SEC-cyBERT/docs/STATUS.md
2026-04-04 15:01:20 -04:00

# Project Status — 2026-04-03 (v2 Reboot)
**Deadline:** 2026-04-24 (21 days)
## What's Done (Carried Forward from v1)
### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K + 207 8-K filings
- [x] 14 filing generators identified, 6 surgical patches applied
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] 72 truncated filings identified and excluded
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)
### Pre-Training
- [x] DAPT: 1 epoch on 500M tokens, eval loss 0.7250, ~14.5h on RTX 3090
- [x] TAPT: 5 epochs on 72K paragraphs, eval loss 1.0754, ~50 min on RTX 3090
- [x] Custom `WholeWordMaskCollator` (the upstream whole-word masking collator keys on WordPiece `##` markers, so it mis-groups BPE tokens)
- [x] Checkpoints: `checkpoints/dapt/` and `checkpoints/tapt/`
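The core of the custom collator is grouping BPE pieces back into words before masking. A minimal sketch of that idea, assuming a `word_ids` list like the one HF fast tokenizers expose via `encoding.word_ids()`; the real collator also does the 80/10/10 mask/random/keep split, omitted here, and `MASK_ID` is a placeholder for the tokenizer's actual mask-token id:

```python
import random

MASK_ID = 50284  # placeholder; the real value comes from the tokenizer

def whole_word_mask(input_ids, word_ids, mlm_prob=0.15, seed=0):
    """Mask entire words rather than individual BPE pieces.

    word_ids[i] is the word index each token belongs to (None for specials).
    """
    rng = random.Random(seed)
    # Collect token positions per word, skipping special tokens.
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            words.setdefault(wid, []).append(pos)
    labels = [-100] * len(input_ids)  # -100 = ignored by the MLM loss
    masked = list(input_ids)
    for positions in words.values():
        if rng.random() < mlm_prob:
            for pos in positions:  # mask every piece of the chosen word
                labels[pos] = input_ids[pos]
                masked[pos] = MASK_ID
    return masked, labels
```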
### v1 Labeling (preserved, not used for v2 training)
- [x] 150K Stage 1 annotations (v2.5 prompt, $115.88)
- [x] 10-model benchmark (8 suppliers, $45.63)
- [x] Human labeling: 6 annotators × 600 paragraphs, category α=0.801, specificity α=0.546
- [x] Gold adjudication: 13-signal cross-analysis, 5-tier adjudication
- [x] Codebook v1.0→v3.5 iteration (12+ prompt versions, 6 v3.5 rounds)
- [x] All v1 data preserved at original paths + `docs/NARRATIVE-v1.md`
### v2 Codebook (this session)
- [x] LABELING-CODEBOOK.md v2: broadened Level 2, 1+ QV, "what question?" test
- [x] CODEBOOK-ETHOS.md: full reasoning, worked edge cases
- [x] NARRATIVE.md: data/pretraining carried forward, pivot divider, v2 section started
- [x] STATUS.md: this document
---
## What's Next (v2 Pipeline)
### Step 1: Codebook Finalization ← CURRENT
- [x] Draft v2 codebook with systemic changes
- [x] Draft codebook ethos with full reasoning
- [ ] Get group approval on v2 codebook (share both docs)
- [ ] Incorporate any group feedback
### Step 2: Prompt Iteration (dev set)
- [ ] Draw ~200 paragraph dev set from existing Stage 1 labels (stratified, separate from holdout)
- [ ] Update Stage 1 prompt to match v2 codebook
- [ ] Run 2-3 models on dev set, analyze results
- [ ] Iterate prompt against judge panel until reasonable consensus
- [ ] Update codebook with any rulings needed (should be minimal if rules are clean)
- [ ] Re-approval if codebook changed materially
- **Estimated cost:** ~$5-10
- **Estimated time:** 1-2 sessions
### Step 3: Stage 1 Re-Run
- [ ] Lock v2 prompt
- [ ] Re-run Stage 1 on full corpus (~50K paragraphs × 3 models)
- [ ] Distribution check: verify Level 2 grew to ~20%, category distribution healthy
- [ ] If distribution is off → iterate codebook/prompt before proceeding
- **Estimated cost:** ~$120
- **Estimated time:** ~30 min execution
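The distribution check gate can be automated before Step 4 starts; a sketch, where the Level 2 band and the per-category floor are illustrative placeholders, not project-mandated thresholds:

```python
from collections import Counter

def distribution_check(labels, level2_band=(0.15, 0.25), min_cat_share=0.02):
    """Sanity-check Stage 1 output before drawing the holdout.

    labels: list of (category, specificity_level) pairs.
    Returns a list of problem strings; empty means the distribution looks healthy.
    """
    n = len(labels)
    level2_share = sum(1 for _, lvl in labels if lvl == 2) / n
    lo, hi = level2_band
    problems = []
    if not (lo <= level2_share <= hi):
        problems.append(f"Level 2 share {level2_share:.1%} outside [{lo:.0%}, {hi:.0%}]")
    for cat, count in Counter(c for c, _ in labels).items():
        if count / n < min_cat_share:
            problems.append(f"category {cat!r} below {min_cat_share:.0%} of corpus")
    return problems
```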
### Step 4: Holdout Selection
- [ ] Draw stratified holdout from new Stage 1 labels
- ~170 per category class × 7 ≈ 1,190
- Random within each stratum (NOT difficulty-weighted)
- Secondary constraint: minimum ~100 per specificity level
- Exclude dev set paragraphs
- [ ] Draw separate AI-labeled extension set (up to 20K) if desired
- **Depends on:** Step 3 complete + distribution check passed
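The draw above is plain stratified sampling, uniform within each stratum; a sketch (the secondary specificity-floor constraint is omitted, and field names are assumptions):

```python
import random
from collections import defaultdict

def draw_holdout(records, per_category=170, exclude_ids=frozenset(), seed=42):
    """Stratified random draw: per_category paragraphs per category class,
    uniform within each stratum (deliberately NOT difficulty-weighted).

    records: list of dicts with "id" and "category" keys.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        if rec["id"] not in exclude_ids:  # keep the dev set out of the holdout
            strata[rec["category"]].append(rec)
    holdout = []
    for cat, pool in sorted(strata.items()):
        k = min(per_category, len(pool))  # small classes contribute what they have
        holdout.extend(rng.sample(pool, k))
    return holdout
```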
### Step 5: Labelapp Update
- [ ] Update quiz questions for v2 codebook (new Level 2 definition, 1+ QV, "what question?" test)
- [ ] Update warmup paragraphs with v2 examples
- [ ] Update codebook sidebar content
- [ ] Load new holdout paragraphs into labelapp
- [ ] Generate new BIBD assignments (3 of 6 annotators per paragraph)
- [ ] Test the full flow (quiz → warmup → labeling)
- **Depends on:** Step 4 complete
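One simple way to generate the 3-of-6 assignments: cycle through all C(6,3) = 20 annotator triples, which balances per-annotator load exactly whenever the paragraph count is a multiple of 20. A sketch of the idea, not the labelapp's actual scheduler:

```python
from itertools import combinations

def bibd_assignments(paragraph_ids, annotators):
    """Assign each paragraph to 3 annotators by cycling through all
    C(6,3) = 20 triples, keeping per-annotator (and per-pair) loads balanced."""
    triples = list(combinations(annotators, 3))
    return {pid: triples[i % len(triples)] for i, pid in enumerate(paragraph_ids)}
```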
### Step 6: Parallel Labeling
- [ ] **Humans:** Tell annotators to start labeling v2 holdout
- [ ] **Models:** Run full benchmark panel on holdout (10+ models, 8+ suppliers)
- Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
- Benchmark panel (gpt-5.4, gemini-pro, kimi-k2.5, glm-5, mimo-v2-pro, minimax-m2.7)
- Opus 4.6 via Anthropic SDK (new addition, treated as another benchmark model)
- **Estimated model cost:** ~$45
- **Estimated human time:** 2-3 days (600 paragraphs per annotator)
- **Depends on:** Step 5 complete
### Step 7: Gold Set Assembly
- [ ] Compute human IRR (target: category α > 0.75, specificity α > 0.67)
- [ ] Gold = majority vote (where all 3 disagree, model consensus tiebreaker)
- [ ] Validate gold against model panel — check for systematic human errors (learned from v1 SI↔N/O)
- **Depends on:** Step 6 complete (both humans and models)
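The gold rule above can be sketched directly; "model consensus" is simplified here to a plurality over the panel's votes:

```python
from collections import Counter

def gold_label(human_votes, model_votes):
    """Majority vote over 3 human labels; when all three disagree,
    fall back to the model panel's plurality (illustrative tie-break)."""
    label, n = Counter(human_votes).most_common(1)[0]
    if n >= 2:
        return label
    return Counter(model_votes).most_common(1)[0][0]
```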
### Step 8: Stage 2 (if needed)
- [ ] Bench Stage 2 adjudication accuracy against gold
- [ ] If Stage 2 adds value → iterate prompt, run on disputed Stage 1 paragraphs
- [ ] If Stage 2 adds minimal value → document finding, skip production run
- **Estimated cost:** ~$20-40 if run
- **Depends on:** Step 7 complete
### Step 9: Training Data Assembly
- [ ] Unanimous Stage 1 labels → full weight
- [ ] Calibrated majority labels → full weight
- [ ] Judge high-confidence (if Stage 2 run) → full weight
- [ ] Quality tier weights: clean/headed/minor = 1.0, degraded = 0.5
- [ ] Nuke 72 truncated filings
- **Depends on:** Step 8 complete
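A sketch of the weighting rules above; "calibrated majority" is simplified to a plain 2-of-3 majority, and the Stage 2 judge path is omitted:

```python
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def example_weight(stage1_votes, quality_tier, truncated=False):
    """Per-example training weight: unanimous or majority Stage 1 labels
    get full source weight, then the quality-tier multiplier applies.
    Truncated filings and three-way disagreements are dropped (None)."""
    if truncated:
        return None  # excluded from training entirely
    if len(set(stage1_votes)) == len(stage1_votes):
        return None  # all models disagree: no usable label
    return 1.0 * TIER_WEIGHT[quality_tier]
```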
### Step 10: Fine-Tuning
- [ ] Ablation matrix: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
- [ ] Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- [ ] Ordinal regression (CORAL) for specificity
- [ ] SCL for boundary separation (optional, if time permits)
- **Estimated time:** 12-20h GPU
- **Depends on:** Step 9 complete
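CORAL reduces the 4-level ordinal specificity head to K-1 = 3 shared-weight binary classifiers; the target encoding and rank-consistent decoding look like this (a sketch of the encoding only, not the shared-bias head itself):

```python
def coral_targets(level, num_levels=4):
    """CORAL encoding: K-1 binary targets, target[k] = 1 iff level > k.
    The head emits K-1 sigmoids sharing one weight vector, so predicted
    probabilities are rank-monotonic by construction."""
    return [1 if level > k else 0 for k in range(num_levels - 1)]

def coral_decode(probs, threshold=0.5):
    """Predicted level = number of binary probabilities above threshold."""
    return sum(1 for p in probs if p > threshold)
```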
### Step 11: Evaluation & Paper
- [ ] Macro F1 on holdout (target: > 0.80 for both heads)
- [ ] Per-class F1 breakdown
- [ ] Full GenAI benchmark table (10+ models × holdout)
- [ ] Cost/time/reproducibility comparison
- [ ] Error analysis on hardest cases
- [ ] IGNITE slides (20 slides, 15s each)
- [ ] Python notebooks for replication (assignment requirement)
- **Depends on:** Step 10 complete
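The headline metric is computable from counts alone; a dependency-free sketch (macro F1 = unweighted mean of per-class F1 over the gold classes, so rare categories count as much as common ones):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the classes present in the gold labels."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in set(y_true):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```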
---
## Timeline Estimate
| Step | Days | Cumulative |
|------|------|-----------|
| 1. Codebook approval | 1 | 1 |
| 2. Prompt iteration | 2 | 3 |
| 3. Stage 1 re-run | 0.5 | 3.5 |
| 4. Holdout selection | 0.5 | 4 |
| 5. Labelapp update | 1 | 5 |
| 6. Parallel labeling | 3 | 8 |
| 7. Gold assembly | 1 | 9 |
| 8. Stage 2 (if needed) | 1 | 10 |
| 9. Training data assembly | 0.5 | 10.5 |
| 10. Fine-tuning | 3-5 | 13.5-15.5 |
| 11. Evaluation + paper | 3-5 | 16.5-20.5 |
**Buffer:** 0.5-4.5 days. Tight but feasible if Steps 1-5 execute cleanly.
---
## Rubric Checklist (Assignment)
### C (F1 > .80): the goal
- [ ] Fine-tuned model with F1 > .80 — category likely, specificity needs v2 broadening
- [x] Performance comparison GenAI vs fine-tuned — 10 models benchmarked (will re-run on v2 holdout)
- [x] Labeled datasets — 150K Stage 1 + 1,200 gold (v1; will re-do for v2)
- [x] Documentation — extensive
- [ ] Python notebooks for replication
### B (3+ of 4): already have all 4
- [x] Cost, time, reproducibility — dollar amounts for every API call
- [x] 6+ models, 3+ suppliers — 10 models, 8 suppliers (+ Opus in v2)
- [x] Contemporary self-collected data — 72K paragraphs from SEC EDGAR
- [x] Compelling use case — SEC cyber disclosure quality assessment
### A (3+ of 4): have 3, working on 4th
- [x] Error analysis — T5 deep-dive, confusion axis analysis, model reasoning examination
- [x] Mitigation strategy — v1→v2 codebook evolution, experimental validation
- [ ] Additional baselines — dictionary/keyword approach (specificity IS/NOT lists as baseline)
- [x] Comparison to amateur labels — annotator before/after, human vs model agreement analysis
---
## Key File Locations
| What | Where |
|------|-------|
| v2 codebook | `docs/LABELING-CODEBOOK.md` |
| v2 codebook ethos | `docs/CODEBOOK-ETHOS.md` |
| v2 narrative | `docs/NARRATIVE.md` |
| v1 codebook (preserved) | `docs/LABELING-CODEBOOK-v1.md` |
| v1 narrative (preserved) | `docs/NARRATIVE-v1.md` |
| Strategy notes | `docs/STRATEGY-NOTES.md` |
| Paragraphs | `data/paragraphs/paragraphs-clean.jsonl` (72,045) |
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| v1 gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) |
| v1 human labels | `data/gold/human-labels-raw.jsonl` (3,600) |
| v1 benchmark annotations | `data/annotations/bench-holdout/*.jsonl` |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` |
| Stage 1 prompt | `ts/src/label/prompts.ts` |
| Annotation runner | `ts/src/label/annotate.ts` |
| Labelapp | `labelapp/` |