Project Status — 2026-04-03 (v2 Reboot)
Deadline: 2026-04-24 (21 days)
What's Done (Carried Forward from v1)
Data Pipeline
- 72,045 paragraphs extracted from ~9,000 10-K + 207 8-K filings
- 14 filing generators identified, 6 surgical patches applied
- Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- 72 truncated filings identified and excluded
- All data integrity rules formalized (frozen originals, UUID-linked patches)
Pre-Training
- DAPT: 1 epoch on 500M tokens, eval loss 0.7250, ~14.5h on RTX 3090
- TAPT: 5 epochs on 72K paragraphs, eval loss 1.0754, ~50 min on RTX 3090
- Custom WholeWordMaskCollator (upstream broken for BPE)
- Checkpoints: checkpoints/dapt/ and checkpoints/tapt/
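The collator's core behavior can be sketched independently of any tokenizer API: when any BPE sub-token of a word is selected for masking, every sub-token of that word is masked together. The mask token ID and the `word_ids` mapping below are placeholder assumptions standing in for whatever the real tokenizer provides.

```python
import random

MASK_TOKEN_ID = 50264  # placeholder; the real ID comes from the tokenizer


def whole_word_mask(input_ids, word_ids, mlm_prob=0.15, rng=random):
    """Whole-word masking: when a word is chosen, every sub-token of
    that word is replaced by the mask token together.
    word_ids[i] is the word index of token i, or None for specials."""
    # group token positions by word index
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            words.setdefault(wid, []).append(pos)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = ignored by the MLM loss
    for positions in words.values():
        if rng.random() < mlm_prob:
            for p in positions:
                labels[p] = masked[p]  # predict the original token
                masked[p] = MASK_TOKEN_ID
    return masked, labels
```

The point of the word-level grouping is that BPE pieces of one word are never masked independently, which is exactly what the broken upstream collator failed to guarantee for BPE vocabularies.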
v1 Labeling (preserved, not used for v2 training)
- 150K Stage 1 annotations (v2.5 prompt, $115.88)
- 10-model benchmark (8 suppliers, $45.63)
- Human labeling: 6 annotators × 600 paragraphs, category α=0.801, specificity α=0.546
- Gold adjudication: 13-signal cross-analysis, 5-tier adjudication
- Codebook v1.0→v3.5 iteration (12+ prompt versions, 6 v3.5 rounds)
- All v1 data preserved at original paths + docs/NARRATIVE-v1.md
v2 Codebook (this session)
- LABELING-CODEBOOK.md v2: broadened Level 2, 1+ QV, "what question?" test
- CODEBOOK-ETHOS.md: full reasoning, worked edge cases
- NARRATIVE.md: data/pretraining carried forward, pivot divider, v2 section started
- STATUS.md: this document
What's Next (v2 Pipeline)
Step 1: Codebook Finalization ← CURRENT
- Draft v2 codebook with systemic changes
- Draft codebook ethos with full reasoning
- Get group approval on v2 codebook (share both docs)
- Incorporate any group feedback
Step 2: Prompt Iteration (dev set)
- Draw ~200 paragraph dev set from existing Stage 1 labels (stratified, separate from holdout)
- Update Stage 1 prompt to match v2 codebook
- Run 2-3 models on dev set, analyze results
- Iterate prompt against judge panel until reasonable consensus
- Update codebook with any rulings needed (should be minimal if rules are clean)
- Re-approval if codebook changed materially
- Estimated cost: ~$5-10
- Estimated time: 1-2 sessions
Step 3: Stage 1 Re-Run
- Lock v2 prompt
- Re-run Stage 1 on full corpus (~50K paragraphs × 3 models)
- Distribution check: verify Level 2 grew to ~20%, category distribution healthy
- If distribution is off → iterate codebook/prompt before proceeding
- Estimated cost: ~$120
- Estimated time: ~30 min execution
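A minimal sketch of the Step 3 distribution gate, assuming Stage 1 output reduces to (category, specificity level) pairs; the 20% Level 2 target comes from the plan above, but the ±5-point tolerance is illustrative, not decided.

```python
from collections import Counter


def distribution_check(rows, level2_target=0.20, tol=0.05, n_categories=7):
    """Gate before holdout selection: Level 2 share near target and all
    categories present. rows are (category, level) pairs from Stage 1."""
    n = len(rows)
    level2_share = sum(1 for _, lvl in rows if lvl == 2) / n
    cats = Counter(cat for cat, _ in rows)
    ok = abs(level2_share - level2_target) <= tol and len(cats) == n_categories
    return ok, level2_share, dict(cats)
```

If the gate fails, the plan loops back to codebook/prompt iteration rather than proceeding to holdout selection.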
Step 4: Holdout Selection
- Draw stratified holdout from new Stage 1 labels
- ~170 per category × 7 categories ≈ 1,190
- Random within each stratum (NOT difficulty-weighted)
- Secondary constraint: minimum ~100 per specificity level
- Exclude dev set paragraphs
- Draw separate AI-labeled extension set (up to 20K) if desired
- Depends on: Step 3 complete + distribution check passed
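The stratified draw can be sketched as plain uniform sampling within each category stratum; the field names (`id`, `category`) and the seed are hypothetical. The secondary specificity-minimum constraint would be checked after the draw, re-rolling the seed if violated.

```python
import random


def draw_holdout(rows, per_category=170, exclude_ids=frozenset(), seed=42):
    """Draw per_category paragraphs per category, uniformly at random
    within each stratum (explicitly NOT difficulty-weighted)."""
    rng = random.Random(seed)
    by_cat = {}
    for r in rows:
        if r["id"] not in exclude_ids:  # keep the dev set out of the holdout
            by_cat.setdefault(r["category"], []).append(r)
    holdout = []
    for _, pool in sorted(by_cat.items()):  # sorted for determinism
        rng.shuffle(pool)
        holdout.extend(pool[:per_category])
    return holdout
```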
Step 5: Labelapp Update
- Update quiz questions for v2 codebook (new Level 2 definition, 1+ QV, "what question?" test)
- Update warmup paragraphs with v2 examples
- Update codebook sidebar content
- Load new holdout paragraphs into labelapp
- Generate new BIBD assignments (3 of 6 annotators per paragraph)
- Test the full flow (quiz → warmup → labeling)
- Depends on: Step 4 complete
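One way to generate the balanced assignments, sketched here with placeholder annotator names: cycle through all C(6,3) = 20 annotator triples. Each annotator sits in 10 of the 20 triples and each pair co-occurs in 4, so over any multiple of 20 paragraphs the loads and pairwise overlaps come out equal.

```python
from itertools import combinations


def bibd_assignments(paragraph_ids,
                     annotators=("a1", "a2", "a3", "a4", "a5", "a6")):
    """Map each paragraph to a triple of annotators by cycling through
    all 20 three-person subsets of the six annotators."""
    triples = list(combinations(annotators, 3))  # C(6,3) = 20 blocks
    return {pid: triples[i % len(triples)] for i, pid in enumerate(paragraph_ids)}
```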
Step 6: Parallel Labeling
- Humans: annotators begin labeling the v2 holdout
- Models: Run full benchmark panel on holdout (10+ models, 8+ suppliers)
- Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
- Benchmark panel (gpt-5.4, gemini-pro, kimi-k2.5, glm-5, mimo-v2-pro, minimax-m2.7)
- Opus 4.6 via Anthropic SDK (new addition, treated as another benchmark model)
- Estimated model cost: ~$45
- Estimated human time: 2-3 days (600 paragraphs per annotator)
- Depends on: Step 5 complete
Step 7: Gold Set Assembly
- Compute human IRR (target: category α > 0.75, specificity α > 0.67)
- Gold = majority vote (where all 3 disagree, model consensus tiebreaker)
- Validate gold against model panel — check for systematic human errors (learned from v1 SI↔N/O)
- Depends on: Step 6 complete (both humans and models)
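The majority-vote-with-tiebreaker rule above is simple enough to sketch directly; the category strings are placeholders.

```python
from collections import Counter


def gold_label(human_votes, model_votes):
    """Gold = human majority (2-of-3 or 3-of-3); when all three humans
    disagree, fall back to the plurality vote of the model panel."""
    label, count = Counter(human_votes).most_common(1)[0]
    if count >= 2:
        return label
    return Counter(model_votes).most_common(1)[0][0]
```

Validating the resulting gold against the full model panel (the v1 SI↔N/O lesson) happens after this step, not inside it.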
Step 8: Stage 2 (if needed)
- Bench Stage 2 adjudication accuracy against gold
- If Stage 2 adds value → iterate prompt, run on disputed Stage 1 paragraphs
- If Stage 2 adds minimal value → document finding, skip production run
- Estimated cost: ~$20-40 if run
- Depends on: Step 7 complete
Step 9: Training Data Assembly
- Unanimous Stage 1 labels → full weight
- Calibrated majority labels → full weight
- Judge high-confidence (if Stage 2 run) → full weight
- Quality tier weights: clean/headed/minor = 1.0, degraded = 0.5
- Exclude the 72 truncated filings
- Depends on: Step 8 complete
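The weighting rules above reduce to a small lookup. The row fields (`truncated`, `tier`, `source`) are hypothetical names for however the assembled records are stored, and the sketch assumes labels outside the three full-weight sources are excluded entirely, which the plan leaves open.

```python
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}
FULL_WEIGHT_SOURCES = {"unanimous", "calibrated_majority", "judge_high_conf"}


def example_weight(row):
    """Training weight = label-source gate x quality-tier weight;
    truncated filings are excluded outright."""
    if row["truncated"]:
        return 0.0
    if row["source"] not in FULL_WEIGHT_SOURCES:
        return 0.0  # assumption: disputed / low-confidence labels stay out
    return TIER_WEIGHT[row["tier"]]
```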
Step 10: Fine-Tuning
- Ablation matrix: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- Ordinal regression (CORAL) for specificity
- SCL for boundary separation (optional, if time permits)
- Estimated time: 12-20h GPU
- Depends on: Step 9 complete
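The CORAL idea behind the specificity head, sketched in NumPy rather than the actual training stack: a single shared weight vector plus K-1 biases produce K-1 binary "is the level greater than k?" logits, and the predicted ordinal level is the number of thresholds passed. Shapes and variable names are illustrative.

```python
import numpy as np


def coral_predict(pooled, w, b):
    """CORAL ordinal inference: shared weight vector w and K-1 biases b
    give K-1 threshold logits per example; the predicted level is the
    count of thresholds where sigmoid(logit) > 0.5."""
    logits = (pooled @ w)[:, None] + b       # shape (B, K-1)
    probs = 1.0 / (1.0 + np.exp(-logits))    # sigmoid per threshold
    return (probs > 0.5).sum(axis=1)         # ordinal level in {0..K-1}
```

Sharing one weight vector across thresholds is what gives CORAL its rank-consistency guarantee, in contrast to a plain 4-way softmax that ignores the ordering of specificity levels.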
Step 11: Evaluation & Paper
- Macro F1 on holdout (target: > 0.80 for both heads)
- Per-class F1 breakdown
- Full GenAI benchmark table (10+ models × holdout)
- Cost/time/reproducibility comparison
- Error analysis on hardest cases
- IGNITE slides (20 slides, 15s each)
- Python notebooks for replication (assignment requirement)
- Depends on: Step 10 complete
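Since the headline target is macro F1 on both heads, a from-scratch reference implementation keeps the metric unambiguous: every class contributes equally to the mean, so the rare categories dominate the risk of missing 0.80.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over the given class list."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

This matches sklearn's `f1_score(average="macro")` and would be applied separately to the category head (7 classes) and the specificity head (4 levels).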
Timeline Estimate
| Step | Days | Cumulative |
|---|---|---|
| 1. Codebook approval | 1 | 1 |
| 2. Prompt iteration | 2 | 3 |
| 3. Stage 1 re-run | 0.5 | 3.5 |
| 4. Holdout selection | 0.5 | 4 |
| 5. Labelapp update | 1 | 5 |
| 6. Parallel labeling | 3 | 8 |
| 7. Gold assembly | 1 | 9 |
| 8. Stage 2 (if needed) | 1 | 10 |
| 9. Training data assembly | 0.5 | 10.5 |
| 10. Fine-tuning | 3-5 | 13.5-15.5 |
| 11. Evaluation + paper | 3-5 | 16.5-20.5 |
Buffer: 0.5-4.5 days. Tight but feasible if Steps 1-5 execute cleanly.
Rubric Checklist (Assignment)
C (F1 > .80): the goal
- Fine-tuned model with F1 > .80 — category likely, specificity needs v2 broadening
- Performance comparison GenAI vs fine-tuned — 10 models benchmarked (will re-run on v2 holdout)
- Labeled datasets — 150K Stage 1 + 1,200 gold (v1; will re-do for v2)
- Documentation — extensive
- Python notebooks for replication
B (3+ of 4): already have all 4
- Cost, time, reproducibility — dollar amounts for every API call
- 6+ models, 3+ suppliers — 10 models, 8 suppliers (+ Opus in v2)
- Contemporary self-collected data — 72K paragraphs from SEC EDGAR
- Compelling use case — SEC cyber disclosure quality assessment
A (3+ of 4): have 3, working on 4th
- Error analysis — T5 deep-dive, confusion axis analysis, model reasoning examination
- Mitigation strategy — v1→v2 codebook evolution, experimental validation
- Additional baselines — dictionary/keyword approach (specificity IS/NOT lists as baseline)
- Comparison to amateur labels — annotator before/after, human vs model agreement analysis
Key File Locations
| What | Where |
|---|---|
| v2 codebook | docs/LABELING-CODEBOOK.md |
| v2 codebook ethos | docs/CODEBOOK-ETHOS.md |
| v2 narrative | docs/NARRATIVE.md |
| v1 codebook (preserved) | docs/LABELING-CODEBOOK-v1.md |
| v1 narrative (preserved) | docs/NARRATIVE-v1.md |
| Strategy notes | docs/STRATEGY-NOTES.md |
| Paragraphs | data/paragraphs/paragraphs-clean.jsonl (72,045) |
| Patched paragraphs | data/paragraphs/paragraphs-clean.patched.jsonl (49,795) |
| v1 Stage 1 annotations | data/annotations/stage1.patched.jsonl (150,009) |
| v1 gold labels | data/gold/gold-adjudicated.jsonl (1,200) |
| v1 human labels | data/gold/human-labels-raw.jsonl (3,600) |
| v1 benchmark annotations | data/annotations/bench-holdout/*.jsonl |
| DAPT checkpoint | checkpoints/dapt/modernbert-large/final/ |
| TAPT checkpoint | checkpoints/tapt/modernbert-large/final/ |
| DAPT corpus | data/dapt-corpus/shard-*.jsonl |
| Stage 1 prompt | ts/src/label/prompts.ts |
| Annotation runner | ts/src/label/annotate.ts |
| Labelapp | labelapp/ |