# Project Status — 2026-04-03 (v2 Reboot)

**Deadline:** 2026-04-24 (21 days)

## What's Done (Carried Forward from v1)

### Data Pipeline

- [x] 72,045 paragraphs extracted from ~9,000 10-K + 207 8-K filings
- [x] 14 filing generators identified, 6 surgical patches applied
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] 72 truncated filings identified and excluded
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)

### Pre-Training

- [x] DAPT: 1 epoch on 500M tokens, eval loss 0.7250, ~14.5h on RTX 3090
- [x] TAPT: 5 epochs on 72K paragraphs, eval loss 1.0754, ~50 min on RTX 3090
- [x] Custom `WholeWordMaskCollator` (upstream implementation broken for BPE)
- [x] Checkpoints: `checkpoints/dapt/` and `checkpoints/tapt/`

### v1 Labeling (preserved, not used for v2 training)

- [x] 150K Stage 1 annotations (v2.5 prompt, $115.88)
- [x] 10-model benchmark (8 suppliers, $45.63)
- [x] Human labeling: 6 annotators × 600 paragraphs, category α=0.801, specificity α=0.546
- [x] Gold adjudication: 13-signal cross-analysis, 5-tier adjudication
- [x] Codebook v1.0→v3.5 iteration (12+ prompt versions, 6 v3.5 rounds)
- [x] All v1 data preserved at original paths + `docs/NARRATIVE-v1.md`

### v2 Codebook (this session)

- [x] LABELING-CODEBOOK.md v2: broadened Level 2, 1+ QV, "what question?" test
- [x] CODEBOOK-ETHOS.md: full reasoning, worked edge cases
- [x] NARRATIVE.md: data/pretraining carried forward, pivot divider, v2 section started
- [x] STATUS.md: this document

---

## What's Next (v2 Pipeline)

### Step 1: Codebook Finalization ← CURRENT

- [x] Draft v2 codebook with systemic changes
- [x] Draft codebook ethos with full reasoning
- [ ] Get group approval on v2 codebook (share both docs)
- [ ] Incorporate any group feedback

### Step 2: Prompt Iteration (dev set)

- [ ] Draw ~200-paragraph dev set from existing Stage 1 labels (stratified, separate from holdout)
- [ ] Update Stage 1 prompt to match v2 codebook
- [ ] Run 2-3 models on dev set, analyze results
- [ ] Iterate prompt against judge panel until reasonable consensus
- [ ] Update codebook with any rulings needed (should be minimal if the rules are clean)
- [ ] Re-approval if codebook changed materially
- **Estimated cost:** ~$5-10
- **Estimated time:** 1-2 sessions

### Step 3: Stage 1 Re-Run

- [ ] Lock v2 prompt
- [ ] Re-run Stage 1 on full corpus (~50K paragraphs × 3 models)
- [ ] Distribution check: verify Level 2 grew to ~20% and the category distribution is healthy
- [ ] If the distribution is off → iterate codebook/prompt before proceeding
- **Estimated cost:** ~$120
- **Estimated time:** ~30 min execution

### Step 4: Holdout Selection

- [ ] Draw stratified holdout from new Stage 1 labels
  - ~170 per category class × 7 ≈ 1,190
  - Random within each stratum (NOT difficulty-weighted)
  - Secondary constraint: minimum ~100 per specificity level
  - Exclude dev set paragraphs
- [ ] Draw separate AI-labeled extension set (up to 20K) if desired
- **Depends on:** Step 3 complete + distribution check passed

### Step 5: Labelapp Update

- [ ] Update quiz questions for v2 codebook (new Level 2 definition, 1+ QV, "what question?" test)
- [ ] Update warmup paragraphs with v2 examples
- [ ] Update codebook sidebar content
- [ ] Load new holdout paragraphs into labelapp
- [ ] Generate new BIBD assignments (3 of 6 annotators per paragraph)
- [ ] Test the full flow (quiz → warmup → labeling)
- **Depends on:** Step 4 complete

### Step 6: Parallel Labeling

- [ ] **Humans:** annotators start labeling the v2 holdout
- [ ] **Models:** run full benchmark panel on holdout (10+ models, 8+ suppliers)
  - Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
  - Benchmark panel (gpt-5.4, gemini-pro, kimi-k2.5, glm-5, mimo-v2-pro, minimax-m2.7)
  - Opus 4.6 via Anthropic SDK (new addition, treated as another benchmark model)
- **Estimated model cost:** ~$45
- **Estimated human time:** 2-3 days (600 paragraphs per annotator)
- **Depends on:** Step 5 complete

### Step 7: Gold Set Assembly

- [ ] Compute human IRR (target: category α > 0.75, specificity α > 0.67)
- [ ] Gold = majority vote (where all 3 disagree, model consensus is the tiebreaker)
- [ ] Validate gold against the model panel — check for systematic human errors (a lesson learned from the v1 SI↔N/O confusion)
- **Depends on:** Step 6 complete (both humans and models)

### Step 8: Stage 2 (if needed)

- [ ] Benchmark Stage 2 adjudication accuracy against gold
- [ ] If Stage 2 adds value → iterate prompt, run on disputed Stage 1 paragraphs
- [ ] If Stage 2 adds minimal value → document the finding, skip the production run
- **Estimated cost:** ~$20-40 if run
- **Depends on:** Step 7 complete

### Step 9: Training Data Assembly

- [ ] Unanimous Stage 1 labels → full weight
- [ ] Calibrated majority labels → full weight
- [ ] Judge high-confidence labels (if Stage 2 is run) → full weight
- [ ] Quality tier weights: clean/headed/minor = 1.0, degraded = 0.5
- [ ] Exclude the 72 truncated filings
- **Depends on:** Step 8 complete

### Step 10: Fine-Tuning

- [ ] Ablation matrix: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
- [ ] Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- [ ] Ordinal regression (CORAL) for specificity
- [ ] SCL for boundary separation (optional, if time permits)
- **Estimated time:** 12-20h GPU
- **Depends on:** Step 9 complete

### Step 11: Evaluation & Paper

- [ ] Macro F1 on holdout (target: > 0.80 for both heads)
- [ ] Per-class F1 breakdown
- [ ] Full GenAI benchmark table (10+ models × holdout)
- [ ] Cost/time/reproducibility comparison
- [ ] Error analysis on the hardest cases
- [ ] IGNITE slides (20 slides, 15s each)
- [ ] Python notebooks for replication (assignment requirement)
- **Depends on:** Step 10 complete

---

## Timeline Estimate

| Step | Days | Cumulative |
|------|------|------------|
| 1. Codebook approval | 1 | 1 |
| 2. Prompt iteration | 2 | 3 |
| 3. Stage 1 re-run | 0.5 | 3.5 |
| 4. Holdout selection | 0.5 | 4 |
| 5. Labelapp update | 1 | 5 |
| 6. Parallel labeling | 3 | 8 |
| 7. Gold assembly | 1 | 9 |
| 8. Stage 2 (if needed) | 1 | 10 |
| 9. Training data assembly | 0.5 | 10.5 |
| 10. Fine-tuning | 3-5 | 13.5-15.5 |
| 11. Evaluation + paper | 3-5 | 16.5-20.5 |

**Buffer:** 0.5-4.5 days. Tight but feasible if Steps 1-5 execute cleanly.
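The CORAL approach planned for the specificity head in Step 10 treats the 4-level ordinal label as K−1 cumulative binary questions ("is the level greater than t?") rather than 4 independent classes. A minimal sketch of the target encoding and decoding, in pure Python — function names are hypothetical, not the project's actual fine-tuning code:

```python
def coral_targets(level: int, num_levels: int = 4) -> list[int]:
    """Encode ordinal level k as K-1 cumulative binary targets.

    Target t answers "is the true level greater than threshold t?",
    so level k becomes [1]*k + [0]*(K-1-k) — always monotone.
    """
    return [1 if level > t else 0 for t in range(num_levels - 1)]


def coral_decode(threshold_probs: list[float], cutoff: float = 0.5) -> int:
    """Predicted level = number of thresholds the model believes are passed."""
    return sum(1 for p in threshold_probs if p > cutoff)
```

The monotone structure is what makes this a better fit for specificity than plain softmax: a prediction can never claim "above level 2" while denying "above level 1".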
---

## Rubric Checklist (Assignment)

### C (F1 > .80): the goal

- [ ] Fine-tuned model with F1 > .80 — category likely; specificity needs the v2 broadening
- [x] Performance comparison, GenAI vs fine-tuned — 10 models benchmarked (will re-run on v2 holdout)
- [x] Labeled datasets — 150K Stage 1 + 1,200 gold (v1; will be redone for v2)
- [x] Documentation — extensive
- [ ] Python notebooks for replication

### B (3+ of 4): already have all 4

- [x] Cost, time, reproducibility — dollar amounts for every API call
- [x] 6+ models, 3+ suppliers — 10 models, 8 suppliers (+ Opus in v2)
- [x] Contemporary self-collected data — 72K paragraphs from SEC EDGAR
- [x] Compelling use case — SEC cyber disclosure quality assessment

### A (3+ of 4): have 3, working on the 4th

- [x] Error analysis — T5 deep-dive, confusion-axis analysis, model reasoning examination
- [x] Mitigation strategy — v1→v2 codebook evolution, experimental validation
- [ ] Additional baselines — dictionary/keyword approach (specificity IS/NOT lists as a baseline)
- [x] Comparison to amateur labels — annotator before/after, human-vs-model agreement analysis

---

## Key File Locations

| What | Where |
|------|-------|
| v2 codebook | `docs/LABELING-CODEBOOK.md` |
| v2 codebook ethos | `docs/CODEBOOK-ETHOS.md` |
| v2 narrative | `docs/NARRATIVE.md` |
| v1 codebook (preserved) | `docs/LABELING-CODEBOOK-v1.md` |
| v1 narrative (preserved) | `docs/NARRATIVE-v1.md` |
| Strategy notes | `docs/STRATEGY-NOTES.md` |
| Paragraphs | `data/paragraphs/paragraphs-clean.jsonl` (72,045) |
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| v1 gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) |
| v1 human labels | `data/gold/human-labels-raw.jsonl` (3,600) |
| v1 benchmark annotations | `data/annotations/bench-holdout/*.jsonl` |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` |
| Stage 1 prompt | `ts/src/label/prompts.ts` |
| Annotation runner | `ts/src/label/annotate.ts` |
| Labelapp | `labelapp/` |
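The Step 7 gold rule (majority vote among the 3 human labels, model consensus as tiebreaker when all three disagree) can be sketched as a small helper. This is an illustration under assumed inputs, not the project's actual assembly code — the function name and label strings are hypothetical:

```python
from collections import Counter


def gold_label(human_votes: list[str], model_votes: list[str]) -> str:
    """Resolve one paragraph's gold label per the Step 7 rule.

    Majority vote over the 3 human labels; if all three disagree,
    fall back to the model panel's most common label.
    """
    label, count = Counter(human_votes).most_common(1)[0]
    if count >= 2:
        return label
    # No human majority: defer to the model panel's consensus.
    return Counter(model_votes).most_common(1)[0][0]
```

Note that `Counter.most_common` breaks ties by insertion order, so a production version would want an explicit rule for the (rare) case where the model panel is also split.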