Project Status — 2026-04-02 (evening)
What's Done
Data Pipeline
- 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- 14 filing generators identified, quality metrics per generator
- 6 surgical patches applied (orphan words + heading stripping)
- Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- All data integrity rules formalized (frozen originals, UUID-linked patches)
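The tier-based down-weighting reduces to a one-line lookup at training-set assembly time. A minimal sketch (the function name is hypothetical; the tier names and the 0.5 weight for degraded paragraphs are from the plan above):

```python
def sample_weight(quality_tier):
    """Per-paragraph training weight from the quality tier system:
    clean/headed/minor paragraphs carry full weight; paragraphs with
    embedded-bullet damage ('degraded') are down-weighted to 0.5 so
    extraction noise doesn't dominate the loss."""
    return {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}[quality_tier]
```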
GenAI Labeling (Stage 1)
- Prompt v2.5 locked after 12+ iterations
- 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- 150,009 annotations completed ($115.88, 0 failures)
- Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into stage1.patched.jsonl
- Codebook v3.0 with 3 major rulings
DAPT + TAPT Pre-Training
- DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- DAPT checkpoint at checkpoints/dapt/modernbert-large/final/
- TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- TAPT checkpoint at checkpoints/tapt/modernbert-large/final/
- Custom WholeWordMaskCollator (upstream transformers collator broken for BPE tokenizers)
- Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- Procedure documented in docs/DAPT-PROCEDURE.md
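The custom collator's core idea can be sketched independently of transformers: with a BPE tokenizer a single word may span several sub-tokens, so the mask decision has to be made per word (e.g. via a fast tokenizer's `word_ids()`) and then broadcast to every sub-token of each selected word. A minimal sketch of that selection step (function name and interface are illustrative, not the actual WholeWordMaskCollator API):

```python
import random

def whole_word_mask(word_ids, mlm_prob=0.15, rng=None):
    """Choose token positions to mask at whole-word granularity: if any
    sub-token of a word is selected, every sub-token of that word is
    masked together.

    word_ids: per-token word index (None for special tokens), as a fast
    tokenizer's BatchEncoding.word_ids() would return.
    Returns a list of booleans, one per token position.
    """
    rng = rng or random.Random()
    # Group token positions by the word they belong to.
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            words.setdefault(wid, []).append(pos)
    mask = [False] * len(word_ids)
    for positions in words.values():
        if rng.random() < mlm_prob:
            for pos in positions:
                mask[pos] = True
    return mask
```

Special tokens (word id `None`) are never maskable, and the per-word draw guarantees no word is ever half-masked, which is the property the upstream BPE-unaware collator failed to preserve.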
Human Labeling — Complete
- All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- Full data export: raw labels, timing, quiz sessions, metrics → data/gold/
- Comprehensive IRR analysis → data/gold/charts/
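The BIBD schedule can be generated by cycling paragraphs through all C(6,3) = 20 annotator triples: each annotator sits in exactly 10 of the 20 triples, so 1,200 paragraphs yield exactly 600 per annotator, and every annotator pair co-labels the same number of items. An illustrative construction (not necessarily the assignment script actually used):

```python
from itertools import combinations

def bibd_assign(paragraph_ids, annotators):
    """Assign each paragraph a block of 3 annotators by cycling through
    all C(6,3)=20 triples. With 1,200 paragraphs, each of the 20 blocks
    is used 60 times; each annotator appears in 10 blocks, hence labels
    10 * 60 = 600 paragraphs, and pair co-occurrence is balanced."""
    blocks = list(combinations(annotators, 3))  # 20 blocks for 6 annotators
    return {pid: blocks[i % len(blocks)] for i, pid in enumerate(paragraph_ids)}
```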
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
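For reference, the pairwise statistic behind the "Avg Cohen's κ" row: κ = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement from each annotator's label marginals, averaged over annotator pairs that share items. A minimal implementation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    kappa = (p_o - p_e) / (1 - p_e), with chance agreement p_e
    estimated from each annotator's label marginals."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```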
Prompt v3.0
- Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- Prompt version bumped from v2.5 → v3.0
GenAI Holdout Benchmark — Complete
- 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs
- All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix)
- Total benchmark cost: $45.63
| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus |
|---|---|---|---|---|
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% |
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% |
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% |
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% |
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% |
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% |
| anthropic/claude-opus-4.6 | Anthropic | $0 | — | — |
Together with the Stage 1 panel already on file, that makes 10 models across 8 suppliers.
13-Signal Cross-Source Analysis — Complete
- 30 diagnostic charts generated → data/gold/charts/
- Leave-one-out analysis (no model privileged as reference)
- Adjudication tier breakdown computed
Adjudication tiers (13 signals per paragraph):
| Tier | Count | % | Rule |
|---|---|---|---|
| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold |
| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated |
| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review |
| 4 | 202 | 16.8% | Universal disagreement → expert review |
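The tier rules reduce to a vote-counting cascade over the 3 human + 10 model signals. A sketch of that cascade; the Tier 3 "GenAI converges" threshold isn't stated above, so the 70% cutoff is a placeholder, and the sketch applies the rule to one dimension rather than jointly to category and specificity:

```python
from collections import Counter

def tier(human_votes, model_votes, t1_threshold=10, converge_frac=0.7):
    """Adjudication tier from 3 human + 10 model votes (13 signals)."""
    top_n = Counter(human_votes + model_votes).most_common(1)[0][1]
    if top_n >= t1_threshold:
        return 1                                 # super-consensus: auto gold
    h_label, h_n = Counter(human_votes).most_common(1)[0]
    m_label, m_n = Counter(model_votes).most_common(1)[0]
    human_majority = h_n > len(human_votes) / 2  # at least 2 of 3 humans
    model_converges = m_n > len(model_votes) * converge_frac
    if human_majority and h_label == m_label:
        return 2                                 # human + GenAI majorities agree
    if not human_majority and model_converges:
        return 3                                 # humans split, GenAI converges
    return 4                                     # universal disagreement
```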
Leave-one-out ranking (each source vs majority of other 12):
| Rank | Source | Cat % | Spec % | Both % |
|---|---|---|---|---|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 |
| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 |
| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 |
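Leave-one-out scoring is simple to state: each source's vote on each paragraph is compared against the majority of the other 12, so no source is baked in as the reference. A minimal sketch of the odd-one-out rate:

```python
from collections import Counter

def odd_one_out_rate(votes_by_source):
    """Leave-one-out scoring: for each source, compare its vote on each
    item against the majority vote of the remaining sources.

    votes_by_source: {source: [label per item]}, all lists same length.
    Returns {source: fraction of items where it disagrees with the rest}.
    """
    sources = list(votes_by_source)
    n_items = len(next(iter(votes_by_source.values())))
    rates = {}
    for s in sources:
        misses = 0
        for i in range(n_items):
            others = [votes_by_source[o][i] for o in sources if o != s]
            majority = Counter(others).most_common(1)[0][0]
            if votes_by_source[s][i] != majority:
                misses += 1
        rates[s] = misses / n_items
    return rates
```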
Key finding: Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate).
Codebook v3.5 & Prompt Iteration — Complete
- Cross-analysis: GenAI vs human systematic errors identified (SI↔N/O 23:0, MR↔RMP 38:13, BG↔MR 33:6)
- v3.5 rulings: SI materiality assessment test, BG purpose test, MR↔RMP 3-step chain
- v3.5 gold re-run: 7 models × 359 confusion-axis holdout paragraphs ($18)
- 6 rounds prompt iteration on 26 regression paragraphs ($1.02): v3.0=18/26 → v3.5=22/26
- SI rule tightened: "could have material adverse effect" = NOT SI (speculation, not assessment)
- Cross-reference exception: materiality language in cross-refs = N/O
- BG threshold: one-sentence committee mention doesn't flip to BG
- Stage 1 corrections flagged: 308 paragraphs (180 materiality + 128 SPACs)
- Prompt locked at v3.5, codebook updated, version history documented
- SI↔N/O paradox investigated and resolved: models correct, humans systematically over-call SI on speculation
- Codebook Case 9 contradiction with Rule 6 fixed ("could" example → N/O)
- Gold adjudication strategy for SI↔N/O defined: trust model consensus, apply SI via regex for assessments
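The regex-based SI adjudication can be illustrated as follows. The patterns are hypothetical stand-ins, not the production rules; they encode the v3.5 distinction that modal "could/may/might have a material adverse effect" language is speculation (not SI), while a concluded materiality assessment is SI:

```python
import re

# Illustrative patterns only. Modal risk language is speculation;
# a concluded assessment ("determined ... not material") is SI.
SPECULATION = re.compile(
    r"\b(could|may|might)\s+have\s+a?\s*material(ly)?\s+adverse", re.I)
ASSESSMENT = re.compile(
    r"\b(determined|concluded|assessed|believes?)\b"
    r".{0,80}?\b(not\s+)?(im)?material", re.I | re.S)

def si_candidate(text):
    """True if the paragraph contains a concluded materiality
    assessment and is not merely speculative risk language."""
    return bool(ASSESSMENT.search(text)) and not SPECULATION.search(text)
```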
| Data asset | Location |
|---|---|
| v3.5 bench annotations | data/annotations/bench-holdout-v35/*.jsonl (7 models × 359) |
| v3.5 Opus annotations | data/annotations/golden-v35/opus.jsonl (359) |
| Stage 1 correction flags | data/annotations/stage1-corrections.jsonl (308) |
| Holdout re-run IDs | data/gold/holdout-rerun-v35.jsonl (359) |
Gold Set Adjudication v1 — Complete
- Aaryan redo integrated: 50.3% of labels changed, α 0.801→0.825 (cat), 0.546→0.661 (spec)
- Old Aaryan labels preserved in data/gold/human-labels-aaryan-v1.jsonl
- Cross-axis systematic error analysis: models correct ~85% on MR↔RMP, MR↔BG, RMP↔BG, TP↔RMP, SI↔N/O
- 5-tier adjudication: T1 super-consensus (911), T2 cross-validated (108), T3 rule-based (30), T4 model-unanimous (59), T5 plurality (92)
- 30 rule-based overrides (27 SI↔N/O + 3 T5 codebook resolutions)
Gold Set Adjudication v2 — Complete (T5 deep analysis)
- Full model disagreement analysis: 6-model vote vectors on all 1,200 paragraphs
- Gemini identified as systematic MR outlier (z≈+2.3, 302 MR vs ~192 avg, drives 45% MR↔RMP confusion)
- Gemini exclusion experiment: NULL RESULT at T5 (human MR bias makes it redundant; tiering already neutralizes at T4)
- v3.5 prompt impact: unanimity 25%→60%, but created new BG↔RMP hotspot (+171%)
- Text-based BG vote removal: automated, verifiable — if "board" absent from text, BG model votes removed. 13 labels corrected, source accuracy UP for 10/12 sources
- 10 new codebook tiebreaker overrides: ID↔SI (negative assertions), SPAC rule, board-removal test, committee-level test
- Specificity hybrid: human unanimous → human label, human split → model majority. 195 specificity labels updated
- All changes validated experimentally (one variable at a time, acceptance criteria checked)
- T5: 92 → 85, gold≠human: 151 → 144
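Two of the v2 mechanics are small enough to state as code. Below is a sketch of the specificity hybrid and the text-based BG vote removal; the label strings ("S2", "BG") are placeholders for the actual tag vocabulary:

```python
from collections import Counter

def specificity_gold(human_votes, model_votes):
    """Specificity hybrid from adjudication v2: if the humans who
    labeled the paragraph agree unanimously, keep the human label;
    if they split, fall back to the model majority."""
    if len(set(human_votes)) == 1:
        return human_votes[0]
    return Counter(model_votes).most_common(1)[0][0]

def strip_bg_votes(text, model_votes):
    """Text-based BG removal: if 'board' never appears in the
    paragraph, drop BG votes from the model pool before tallying
    (automated and verifiable, per adjudication v2)."""
    if "board" in text.lower():
        return model_votes
    return [v for v in model_votes if v != "BG"]
```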
| Source | Accuracy vs Gold (v1) | Accuracy vs Gold (v2) | Δ |
|---|---|---|---|
| Xander | 91.0% | 91.5% | +0.5% |
| Opus | 88.6% | 89.1% | +0.5% |
| GPT-5.4 | 87.4% | 88.5% | +1.1% |
| GLM-5 | 86.0% | 86.5% | +0.5% |
| Elisabeth | 85.8% | 86.5% | +0.7% |
| MIMO | 85.8% | 86.2% | +0.5% |
| Meghan | 85.3% | 86.0% | +0.7% |
| Kimi | 84.5% | 84.9% | +0.4% |
| Gemini | 84.0% | 84.6% | +0.6% |
| Joey | 80.7% | 80.2% | -0.5% |
| Aaryan | 75.2% | 74.2% | -1.0% |
| Anuj | 69.3% | 69.7% | +0.3% |
| Data asset | Location |
|---|---|
| Adjudicated gold labels | data/gold/gold-adjudicated.jsonl (1,200) |
| Old Aaryan labels | data/gold/human-labels-aaryan-v1.jsonl (600) |
| Adjudication charts | data/gold/charts/gold-*.png (4 charts) |
| Adjudication script | scripts/adjudicate-gold.py (v2) |
| Experiment harness | scripts/adjudicate-gold-experiment.py |
| T5 analysis docs | docs/T5-ANALYSIS.md |
What's Next (in dependency order)
1. (Optional) Manual review of remaining 85 T5-plurality paragraphs
- 85 paragraphs resolved by signal plurality — lowest confidence tier
- 71% on the BG↔MR↔RMP triangle (irreducible ambiguity)
- 62 have weak plurality (4-5 of 9 signals)
- Could improve gold set by ~1-3% if reviewed, but with diminishing returns
2. Stage 2 re-eval on training data
- Pilot gpt-5.4-mini vs gpt-5.4 on holdout validation sample
- Run on 308 flagged Stage 1 corrections (180 materiality + 128 SPACs)
- Also run standard Stage 2 judge on existing disagreements with v3.5 prompt
3. Training data assembly
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
4. Fine-tuning + ablations
- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- Focal loss / class-weighted CE for category imbalance
- Ordinal regression (CORAL) for specificity
5. Evaluation + paper
- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
- Full GenAI benchmark table (10 models × 1,200 holdout)
- Cost/time/reproducibility comparison
- Error analysis on Tier 4 paragraphs (A-grade criterion)
- IGNITE slides (20 slides, 15s each)
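For the planned CORAL specificity head (item 4): CORAL models a K-level ordinal target with K-1 shared-weight binary outputs estimating P(y > k), and decoding counts how many 0.5 thresholds are passed. A minimal decode sketch for the 4-level specificity scale:

```python
def coral_decode(probs):
    """Decode CORAL ordinal-head outputs: probs[k] estimates
    P(y > k) for k = 0..K-2 (K=4 specificity levels, so 3 outputs).
    The predicted level is the number of 0.5 thresholds passed."""
    return sum(p > 0.5 for p in probs)
```

Because the K-1 classifiers share weights and differ only in bias, their probabilities are rank-consistent, so threshold counting yields a valid ordinal label.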
Parallel Tracks
Track A (GPU): DAPT ✓ → TAPT ✓ ─────────────────────────────→ Fine-tuning → Eval
↑
Track B (API): Opus re-run ✓─┐ │
├→ v3.5 re-run ✓ → SI paradox ✓ ───┐ │
Track C (API): 6-model bench ✓┘ │ │
Gold adjud. ✓ ┤ │
Track E (API): v3.5 prompt ✓ → S1 flags ✓ → Stage 2 re-eval ───┘───┘
Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ → Aaryan redo ✓
Key File Locations
| What | Where |
|---|---|
| Patched paragraphs | data/paragraphs/paragraphs-clean.patched.jsonl (49,795) |
| Patched annotations | data/annotations/stage1.patched.jsonl (150,009) |
| Quality scores | data/paragraphs/quality/quality-scores.jsonl (72,045) |
| Human labels (raw) | data/gold/human-labels-raw.jsonl (3,600 labels) |
| Human label metrics | data/gold/metrics.json |
| Holdout paragraphs | data/gold/paragraphs-holdout.jsonl (1,200) |
| Diagnostic charts | data/gold/charts/*.png (30 charts) |
| Opus golden labels | data/annotations/golden/opus.jsonl (1,200) |
| Benchmark annotations | data/annotations/bench-holdout/{model}.jsonl (6 × 1,200) |
| Original sampled IDs | labelapp/.sampled-ids.original.json (1,200 holdout PIDs) |
| DAPT corpus | data/dapt-corpus/shard-*.jsonl (14,756 docs) |
| DAPT checkpoint | checkpoints/dapt/modernbert-large/final/ |
| TAPT checkpoint | checkpoints/tapt/modernbert-large/final/ |
| v3.5 bench annotations | data/annotations/bench-holdout-v35/*.jsonl (7 × 359) |
| v3.5 Opus golden | data/annotations/golden-v35/opus.jsonl (359) |
| Stage 1 correction flags | data/annotations/stage1-corrections.jsonl (1,014) |
| Holdout re-run IDs | data/gold/holdout-rerun-v35.jsonl (359) |
| Analysis script | scripts/analyze-gold.py (30-chart, 13-signal analysis) |
| Data dump script | labelapp/scripts/dump-all.ts |