SEC-cyBERT/docs/STATUS.md
2026-04-03 14:43:53 -04:00

# Project Status — 2026-04-02 (evening)
## What's Done
### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)
### GenAI Labeling (Stage 1)
- [x] Prompt v2.5 locked after 12+ iterations
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- [x] 150,009 annotations completed ($115.88, 0 failures)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl`
- [x] Codebook v3.0 with 3 major rulings
### DAPT + TAPT Pre-Training
- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/`
- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/`
- [x] Custom `WholeWordMaskCollator` (the upstream `transformers` whole-word-masking collator is broken for BPE tokenizers)
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
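As a sanity check on the loss/perplexity pairs reported above, the sketch below recomputes perplexity from eval loss under two conventions. It makes no assumption about the training script, and the function name is illustrative. The reported values (1.65 and 2.11) match base-2 exponentiation. If the eval loss is a natural-log cross-entropy, the base-e values would apply instead.

```python
import math

def perplexity(loss: float, base: float = math.e) -> float:
    """Perplexity = base ** loss; the base must match the log used for the loss."""
    return base ** loss

# Eval losses as reported in this status file.
for name, loss in [("DAPT", 0.7250), ("TAPT", 1.0754)]:
    print(f"{name}: 2**loss = {perplexity(loss, 2):.2f}, e**loss = {perplexity(loss):.2f}")
```

It is worth confirming which log base the trainer's loss uses before quoting perplexity in the paper.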
### Human Labeling — Complete
- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/`
- [x] Comprehensive IRR analysis → `data/gold/charts/`

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
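The labeling design described above (1,200 paragraphs, each labeled by exactly 3 of 6 annotators, 600 per annotator) can be reproduced with a simple balanced assignment. This is a minimal sketch, not the procedure used in `labelapp`: the annotator IDs are placeholders, and cycling through all 3-subsets is just one way to get a balanced design.

```python
from collections import Counter
from itertools import combinations, cycle

ANNOTATORS = ["A1", "A2", "A3", "A4", "A5", "A6"]  # placeholder IDs (assumption)

def assign(paragraph_ids, annotators=ANNOTATORS, k=3):
    """Cycle through all k-subsets of annotators so every subset is used equally often."""
    triples = cycle(combinations(annotators, k))
    return {pid: next(triples) for pid in paragraph_ids}

assignment = assign(range(1200))

# Per-annotator load: 20 triples x 60 paragraphs each, every annotator sits in 10 triples.
load = Counter(a for triple in assignment.values() for a in triple)
print(load)
```

With 20 possible triples and 1,200 paragraphs, each triple is used 60 times, so every annotator gets 600 paragraphs and every pair of annotators shares the same number of paragraphs.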
### Prompt v3.0
- [x] Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- [x] Prompt version bumped from v2.5 → v3.0
### GenAI Holdout Benchmark — Complete
- [x] 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs
- [x] All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix)
- [x] Total benchmark cost: $45.63

| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus |
|-------|----------|------|---------------|----------------|
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% |
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% |
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% |
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% |
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% |
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% |
| anthropic/claude-opus-4.6 | Anthropic | $0 | — | — |

Together with the 3-model Stage 1 panel already on file, this makes **10 models from 8 suppliers**.
### 13-Signal Cross-Source Analysis — Complete
- [x] 30 diagnostic charts generated → `data/gold/charts/`
- [x] Leave-one-out analysis (no model privileged as reference)
- [x] Adjudication tier breakdown computed

**Adjudication tiers (13 signals per paragraph):**
| Tier | Count | % | Rule |
|------|-------|---|------|
| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold |
| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated |
| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review |
| 4 | 202 | 16.8% | Universal disagreement → expert review |
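The tier rules in the table above can be sketched as a small function. This is a hypothetical reading of the rules, not the project's adjudication script: it assumes each signal is a `(category, specificity)` pair, that "majority" means a strict majority within the human or model group, and that any case not matching Tiers 1-3 falls into Tier 4.

```python
from collections import Counter

def majority(labels):
    """Return the label held by a strict majority of `labels`, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else None

def tier(human, model):
    """Assign a tier from 3 human and 10 model (category, specificity) labels."""
    top_count = Counter(human + model).most_common(1)[0][1]
    if top_count >= 10:                   # 10+/13 agree on both dimensions
        return 1
    h, m = majority(human), majority(model)
    if h is not None and h == m:          # human and GenAI majorities agree
        return 2
    if h is None and m is not None:       # humans split, GenAI converges
        return 3
    return 4                              # everything else (assumption)

humans = [("RMP", 2), ("RMP", 2), ("MR", 2)]
models = [("RMP", 2)] * 6 + [("MR", 2)] * 4
print(tier(humans, models))
```

In this example only 8 of 13 signals agree, but both group majorities land on the same label, so the paragraph is cross-validated rather than auto gold.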

**Leave-one-out ranking (each source vs. the majority of the other 12 signals; selected ranks shown):**
| Rank | Source | Cat % | Spec % | Both % |
|------|--------|-------|--------|--------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 |
| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 |
| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 |

**Key finding:** Opus takes the #1 spot on leave-one-out agreement alone. It is not ranked first because we designated it as gold; it disagrees with the majority of the other sources least often (7.4% odd-one-out rate on category).
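A minimal sketch of the leave-one-out computation behind the ranking above: each source is scored against the plurality of all other sources that labeled the same paragraph. The function name, the `None` convention for unlabeled paragraphs, and the handling of ties (counted as disagreement) are assumptions of this sketch, not details taken from `scripts/analyze-gold.py`.

```python
from collections import Counter

def leave_one_out_agreement(labels_by_source):
    """Share of items where each source matches the plurality of the other sources.

    labels_by_source: {source: [label or None per item]}, lists aligned by item.
    None means the source did not label that item.
    """
    sources = list(labels_by_source)
    n_items = len(next(iter(labels_by_source.values())))
    scores = {}
    for s in sources:
        hits = total = 0
        for i in range(n_items):
            own = labels_by_source[s][i]
            if own is None:
                continue
            others = Counter(
                labels_by_source[o][i]
                for o in sources
                if o != s and labels_by_source[o][i] is not None
            )
            ranked = others.most_common(2)
            tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
            total += 1
            if ranked and not tied and ranked[0][0] == own:
                hits += 1
        scores[s] = hits / total if total else float("nan")
    return scores

toy = {
    "m1": ["MR", "SI", "BG"],
    "m2": ["MR", "SI", "BG"],
    "m3": ["MR", "N/O", "RMP"],
    "h1": ["RMP", None, "BG"],
}
scores = leave_one_out_agreement(toy)
print(scores)
```

Because no source is treated as the reference, a source can only rank highly by agreeing with the rest of the panel.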
### Codebook v3.5 & Prompt Iteration — Complete
- [x] Cross-analysis: GenAI vs human systematic errors identified (SI↔N/O 23:0, MR↔RMP 38:13, BG↔MR 33:6)
- [x] v3.5 rulings: SI materiality assessment test, BG purpose test, MR↔RMP 3-step chain
- [x] v3.5 gold re-run: 7 models × 359 confusion-axis holdout paragraphs ($18)
- [x] 6 rounds of prompt iteration on 26 regression paragraphs ($1.02): v3.0 = 18/26 → v3.5 = 22/26
- [x] SI rule tightened: "could have material adverse effect" = NOT SI (speculation, not assessment)
- [x] Cross-reference exception: materiality language in cross-refs = N/O
- [x] BG threshold: one-sentence committee mention doesn't flip to BG
- [x] Stage 1 corrections flagged: 308 paragraphs (180 materiality + 128 SPACs)
- [x] Prompt locked at v3.5, codebook updated, version history documented
- [x] SI↔N/O paradox investigated and resolved: models correct, humans systematically over-call SI on speculation
- [x] Codebook Case 9 contradiction with Rule 6 fixed ("could" example → N/O)
- [x] Gold adjudication strategy for SI↔N/O defined: trust model consensus, apply SI via regex for assessments

| Data asset | Location |
|-----------|----------|
| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 models × 359) |
| v3.5 Opus annotations | `data/annotations/golden-v35/opus.jsonl` (359) |
| Stage 1 correction flags | `data/annotations/stage1-corrections.jsonl` (308) |
| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) |
### Gold Set Adjudication v1 — Complete
- [x] Aaryan redo integrated: 50.3% of labels changed, α 0.801→0.825 (cat), 0.546→0.661 (spec)
- [x] Old Aaryan labels preserved in `data/gold/human-labels-aaryan-v1.jsonl`
- [x] Cross-axis systematic error analysis: models correct ~85% on MR↔RMP, MR↔BG, RMP↔BG, TP↔RMP, SI↔N/O
- [x] 5-tier adjudication: T1 super-consensus (911), T2 cross-validated (108), T3 rule-based (30), T4 model-unanimous (59), T5 plurality (92)
- [x] 30 rule-based overrides (27 SI↔N/O + 3 T5 codebook resolutions)
### Gold Set Adjudication v2 — Complete (T5 deep analysis)
- [x] Full model disagreement analysis: 6-model vote vectors on all 1,200 paragraphs
- [x] Gemini identified as systematic MR outlier (z≈+2.3, 302 MR vs ~192 avg, drives 45% MR↔RMP confusion)
- [x] Gemini exclusion experiment: NULL RESULT at T5 (human MR bias makes it redundant; tiering already neutralizes at T4)
- [x] v3.5 prompt impact: unanimity 25%→60%, but created new BG↔RMP hotspot (+171%)
- [x] **Text-based BG vote removal**: automated, verifiable — if "board" absent from text, BG model votes removed. 13 labels corrected, source accuracy UP for 10/12 sources
- [x] **10 new codebook tiebreaker overrides**: ID↔SI (negative assertions), SPAC rule, board-removal test, committee-level test
- [x] **Specificity hybrid**: human unanimous → human label, human split → model majority. 195 specificity labels updated
- [x] All changes validated experimentally (one variable at a time, acceptance criteria checked)
- [x] T5: 92 → 85, gold≠human: 151 → 144
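The two automated rules above (text-based BG vote removal and the specificity hybrid) can be sketched as follows. This is an illustrative reading, not the code in `scripts/adjudicate-gold.py`: the function names are hypothetical, the "board" check is assumed to be a case-insensitive substring match, and ties among model votes are broken by first occurrence.

```python
from collections import Counter

def drop_unsupported_bg_votes(text, model_votes):
    """Remove model BG votes when the paragraph text never mentions "board"."""
    if "board" in text.lower():
        return list(model_votes)
    return [v for v in model_votes if v != "BG"]

def hybrid_specificity(human_specs, model_specs):
    """Use the human label if humans are unanimous, else the most common model label."""
    if len(set(human_specs)) == 1:
        return human_specs[0]
    return Counter(model_specs).most_common(1)[0][0]

votes = drop_unsupported_bg_votes(
    "The CISO reports quarterly to senior management.",
    ["BG", "MR", "MR", "BG", "RMP", "MR"],
)
print(votes)
print(hybrid_specificity([1, 2, 2], [3, 3, 2, 3, 3, 1]))
```

Both rules are deterministic functions of the paragraph text and the existing votes, which is what makes them verifiable without further annotation.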

| Source | Accuracy vs Gold (v1) | Accuracy vs Gold (v2) | Δ |
|--------|----------------------|----------------------|---|
| Xander | 91.0% | 91.5% | +0.5% |
| Opus | 88.6% | 89.1% | +0.5% |
| GPT-5.4 | 87.4% | 88.5% | +1.1% |
| GLM-5 | 86.0% | 86.5% | +0.5% |
| Elisabeth | 85.8% | 86.5% | +0.7% |
| MIMO | 85.8% | 86.2% | +0.5% |
| Meghan | 85.3% | 86.0% | +0.7% |
| Kimi | 84.5% | 84.9% | +0.4% |
| Gemini | 84.0% | 84.6% | +0.6% |
| Joey | 80.7% | 80.2% | -0.5% |
| Aaryan | 75.2% | 74.2% | -1.0% |
| Anuj | 69.3% | 69.7% | +0.3% |

| Data asset | Location |
|-----------|----------|
| Adjudicated gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) |
| Old Aaryan labels | `data/gold/human-labels-aaryan-v1.jsonl` (600) |
| Adjudication charts | `data/gold/charts/gold-*.png` (4 charts) |
| Adjudication script | `scripts/adjudicate-gold.py` (v2) |
| Experiment harness | `scripts/adjudicate-gold-experiment.py` |
| T5 analysis docs | `docs/T5-ANALYSIS.md` |
## What's Next (in dependency order)
### 1. (Optional) Manual review of remaining 85 T5-plurality paragraphs
- 85 paragraphs resolved by signal plurality — lowest confidence tier
- 71% on the BG↔MR↔RMP triangle (irreducible ambiguity)
- 62 have weak plurality (4-5 of 9 signals)
- Manual review could improve the gold set by an estimated ~1-3%, with diminishing returns
### 2. Stage 2 re-eval on training data
- Pilot gpt-5.4-mini vs gpt-5.4 on holdout validation sample
- Run on 308 flagged Stage 1 corrections (180 materiality + 128 SPACs)
- Also run standard Stage 2 judge on existing disagreements with v3.5 prompt
### 3. Training data assembly
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
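The weighting scheme above can be expressed as a per-example sample weight. This is a minimal sketch: the dictionary keys and the choice to multiply the label-source weight by the quality-tier weight are assumptions, since the plan only lists the individual weights.

```python
# Weights as listed in the plan; key names are illustrative.
LABEL_SOURCE_WEIGHT = {"unanimous": 1.0, "majority": 1.0, "judge": 1.0}
QUALITY_TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def sample_weight(label_source: str, quality_tier: str) -> float:
    """Combine label-source and quality-tier weights (product is an assumption)."""
    return LABEL_SOURCE_WEIGHT[label_source] * QUALITY_TIER_WEIGHT[quality_tier]

print(sample_weight("unanimous", "clean"))
print(sample_weight("judge", "degraded"))
```

Keeping the two factors in separate tables makes it easy to down-weight a label source later without touching the quality tiers.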
### 4. Fine-tuning + ablations
- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- Focal loss / class-weighted CE for category imbalance
- Ordinal regression (CORAL) for specificity
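Two pieces of the plan above are easy to pin down in code: the size of the ablation grid and the CORAL target encoding for the 4-class specificity head. The sketch below is illustrative only. The config keys are hypothetical, specificity levels are assumed to be 0-based, and the full cross product gives 12 runs, so "8+" presumably means a subset may be run.

```python
from itertools import product

BACKBONES = ["base", "+DAPT", "+DAPT+TAPT"]

def ablation_grid():
    """Full cross product of backbone x SCL x class weighting."""
    return [
        {"backbone": b, "scl": scl, "class_weighting": cw}
        for b, scl, cw in product(BACKBONES, [False, True], [False, True])
    ]

def coral_targets(level: int, num_classes: int = 4):
    """Encode a 0-based ordinal level as num_classes - 1 binary 'label > k' targets."""
    return [1 if level > k else 0 for k in range(num_classes - 1)]

def coral_decode(probs, threshold: float = 0.5):
    """Predicted level = number of 'label > k' probabilities above the threshold."""
    return sum(p > threshold for p in probs)

print(len(ablation_grid()))
print(coral_targets(2))
print(coral_decode([0.9, 0.7, 0.2]))
```

In CORAL the specificity head emits three logits that share one weight vector and differ only in bias, and each is trained with binary cross-entropy against these targets.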
### 5. Evaluation + paper
- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
- Full GenAI benchmark table (10 models × 1,200 holdout)
- Cost/time/reproducibility comparison
- Error analysis on Tier 4 paragraphs (A-grade criterion)
- IGNITE slides (20 slides, 15s each)
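For the category metric above, macro F1 is the unweighted mean of per-class F1, so rare categories count as much as common ones. The following dependency-free sketch shows the computation on a toy example; the labels are made up, and in practice a library implementation would likely be used.

```python
def per_class_f1(y_true, y_pred):
    """F1 per class, computed from true-positive, false-positive, false-negative counts."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Toy labels, not project data.
y_true = ["MR", "MR", "RMP", "BG"]
y_pred = ["MR", "RMP", "RMP", "BG"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}, passes 0.80 gate: {score > 0.80}")
```

One MR paragraph predicted as RMP lowers both the MR and RMP scores, which is why confusion along the MR↔RMP axis is costly under this metric.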
## Parallel Tracks
```
Track A (GPU):   DAPT ✓ → TAPT ✓ ────────────────────────────────────→ Fine-tuning → Eval
Track B (API):   Opus re-run ✓ ─┐                                        │
                                ├→ v3.5 re-run ✓ → SI paradox ✓ ───┐     │
Track C (API):   6-model bench ✓┘                                  │     │
                                                     Gold adjud. ✓ ┤     │
Track E (API):   v3.5 prompt ✓ → S1 flags ✓ → Stage 2 re-eval ─────┘─────┘
Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ → Aaryan redo ✓
```
## Key File Locations
| What | Where |
|------|-------|
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) |
| Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) |
| Human label metrics | `data/gold/metrics.json` |
| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) |
| Diagnostic charts | `data/gold/charts/*.png` (30 charts) |
| Opus golden labels | `data/annotations/golden/opus.jsonl` (1,200) |
| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` (6 × 1,200) |
| Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 × 359) |
| v3.5 Opus golden | `data/annotations/golden-v35/opus.jsonl` (359) |
| Stage 1 correction flags | `data/annotations/stage1-corrections.jsonl` (1,014) |
| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) |
| Analysis script | `scripts/analyze-gold.py` (30-chart, 13-signal analysis) |
| Data dump script | `labelapp/scripts/dump-all.ts` |