

Project Status — 2026-04-03 (v2 Reboot)

Deadline: 2026-04-24 (21 days)

What's Done (Carried Forward from v1)

Data Pipeline

  • 72,045 paragraphs extracted from ~9,000 10-K + 207 8-K filings
  • 14 filing generators identified, 6 surgical patches applied
  • Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
  • 72 truncated filings identified and excluded
  • All data integrity rules formalized (frozen originals, UUID-linked patches)

Pre-Training

  • DAPT: 1 epoch on 500M tokens, eval loss 0.7250, ~14.5h on RTX 3090
  • TAPT: 5 epochs on 72K paragraphs, eval loss 1.0754, ~50 min on RTX 3090
  • Custom WholeWordMaskCollator (the upstream whole-word collator keys on WordPiece "##" continuation markers, which BPE tokenizers never emit)
  • Checkpoints: checkpoints/dapt/ and checkpoints/tapt/
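
The BPE grouping problem the custom collator works around can be sketched in a few lines: with a GPT-2-style BPE vocabulary, a word start is marked by a "Ġ" prefix rather than a "##" continuation, so a WordPiece-oriented collator ends up masking subtokens independently. A minimal, illustrative sketch (function name and token conventions are hypothetical, not the actual collator):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words in a BPE token sequence. A 'Ġ' prefix marks a
    word start (GPT-2-style BPE); a token without it continues the
    previous word, so all of a word's subtokens are masked together."""
    groups = []  # each entry: indices of one word's subtokens
    for i, tok in enumerate(tokens):
        if tok.startswith("Ġ") or not groups:
            groups.append([i])
        else:
            groups[-1].append(i)
    rng = random.Random(seed)
    n = max(1, round(len(groups) * mask_prob))
    masked = {i for g in rng.sample(groups, n) for i in g}
    out = [mask_token if i in masked else t for i, t in enumerate(tokens)]
    return out, masked
```

The real collator also has to emit MLM labels and attention-safe padding; this only shows the grouping step that breaks upstream.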

v1 Labeling (preserved, not used for v2 training)

  • 150K Stage 1 annotations (v2.5 prompt, $115.88)
  • 10-model benchmark (8 suppliers, $45.63)
  • Human labeling: 6 annotators × 600 paragraphs, category α=0.801, specificity α=0.546
  • Gold adjudication: 13-signal cross-analysis, 5-tier adjudication
  • Codebook v1.0→v3.5 iteration (12+ prompt versions, 6 v3.5 rounds)
  • All v1 data preserved at original paths + docs/NARRATIVE-v1.md

v2 Codebook (this session)

  • LABELING-CODEBOOK.md v2: broadened Level 2, 1+ QV, "what question?" test
  • CODEBOOK-ETHOS.md: full reasoning, worked edge cases
  • NARRATIVE.md: data/pretraining carried forward, pivot divider, v2 section started
  • STATUS.md: this document

What's Next (v2 Pipeline)

Step 1: Codebook Finalization ← CURRENT

  • Draft v2 codebook with systemic changes
  • Draft codebook ethos with full reasoning
  • Get group approval on v2 codebook (share both docs)
  • Incorporate any group feedback

Step 2: Prompt Iteration (dev set)

  • Draw a ~200-paragraph dev set from existing Stage 1 labels (stratified, separate from holdout)
  • Update Stage 1 prompt to match v2 codebook
  • Run 2-3 models on dev set, analyze results
  • Iterate prompt against judge panel until reasonable consensus
  • Update codebook with any rulings needed (should be minimal if rules are clean)
  • Re-approval if codebook changed materially
  • Estimated cost: ~$5-10
  • Estimated time: 1-2 sessions

Step 3: Stage 1 Re-Run

  • Lock v2 prompt
  • Re-run Stage 1 on full corpus (~50K paragraphs × 3 models)
  • Distribution check: verify Level 2 grew to ~20%, category distribution healthy
  • If distribution is off → iterate codebook/prompt before proceeding
  • Estimated cost: ~$120
  • Estimated time: ~30 min execution
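
The Step 3 distribution check can be a single pass over the new Stage 1 output. A sketch, assuming each label row carries `category` and `specificity` fields (field names are illustrative, not the actual schema):

```python
from collections import Counter

def distribution_check(labels, level2_target=0.20, tol=0.05):
    """Sanity-check a Stage 1 re-run: report per-category shares and
    flag whether Level 2 specificity landed near the ~20% target."""
    n = len(labels)
    cats = Counter(l["category"] for l in labels)
    level2_share = sum(1 for l in labels if l["specificity"] == 2) / n
    return {
        "category_shares": {c: round(k / n, 3) for c, k in cats.items()},
        "level2_share": round(level2_share, 3),
        "level2_ok": abs(level2_share - level2_target) <= tol,
    }
```

A failed `level2_ok` is the signal to loop back to codebook/prompt iteration before drawing the holdout.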

Step 4: Holdout Selection

  • Draw stratified holdout from new Stage 1 labels
    • ~170 per category class × 7 ≈ 1,190
    • Random within each stratum (NOT difficulty-weighted)
    • Secondary constraint: minimum ~100 per specificity level
    • Exclude dev set paragraphs
  • Draw separate AI-labeled extension set (up to 20K) if desired
  • Depends on: Step 3 complete + distribution check passed
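
The holdout draw above is a per-stratum uniform sample with a post-hoc check on the secondary constraint. A sketch under assumed row keys (`id`, `category`, `specificity`):

```python
import random
from collections import defaultdict, Counter

def draw_holdout(rows, per_category=170, min_per_spec=100, exclude_ids=(), seed=0):
    """Stratified holdout: up to `per_category` paragraphs drawn
    uniformly at random within each category (NOT difficulty-weighted),
    excluding dev-set ids. Returns the draw plus any specificity levels
    that fell below the secondary minimum and need topping up."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for r in rows:
        if r["id"] not in exclude_ids:
            by_cat[r["category"]].append(r)
    holdout = []
    for _, pool in sorted(by_cat.items()):
        holdout.extend(rng.sample(pool, min(per_category, len(pool))))
    spec_counts = Counter(r["specificity"] for r in holdout)
    short = {s for s, k in spec_counts.items() if k < min_per_spec}
    return holdout, short
```

A non-empty `short` set means drawing extra paragraphs for those specificity levels (still uniformly within strata) before locking the holdout.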

Step 5: Labelapp Update

  • Update quiz questions for v2 codebook (new Level 2 definition, 1+ QV, "what question?" test)
  • Update warmup paragraphs with v2 examples
  • Update codebook sidebar content
  • Load new holdout paragraphs into labelapp
  • Generate new BIBD assignments (3 of 6 annotators per paragraph)
  • Test the full flow (quiz → warmup → labeling)
  • Depends on: Step 4 complete
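
One simple way to generate balanced 3-of-6 assignments is to cycle through all C(6,3) = 20 annotator triples: each annotator sits in exactly 10 of the 20, so loads even out over any multiple of 20 paragraphs. This is a round-robin approximation of a BIBD, not necessarily the exact design the labelapp uses:

```python
from itertools import combinations
from collections import Counter

def assign_annotators(paragraph_ids, annotators, per_paragraph=3):
    """Assign `per_paragraph` annotators to each paragraph by cycling
    through all C(n, k) annotator subsets in a fixed order. Every
    annotator appears in the same fraction of subsets, so workloads
    balance as the paragraph count grows."""
    subsets = list(combinations(annotators, per_paragraph))
    return {pid: subsets[i % len(subsets)]
            for i, pid in enumerate(paragraph_ids)}
```

A strict BIBD additionally balances how often each *pair* of annotators co-labels a paragraph; the cycling scheme gets that for free whenever the paragraph count is a multiple of the subset count.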

Step 6: Parallel Labeling

  • Humans: annotators begin labeling the v2 holdout
  • Models: Run full benchmark panel on holdout (10+ models, 8+ suppliers)
    • Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
    • Benchmark panel (gpt-5.4, gemini-pro, kimi-k2.5, glm-5, mimo-v2-pro, minimax-m2.7)
    • Opus 4.6 via Anthropic SDK (new addition, treated as another benchmark model)
  • Estimated model cost: ~$45
  • Estimated human time: 2-3 days (600 paragraphs per annotator)
  • Depends on: Step 5 complete

Step 7: Gold Set Assembly

  • Compute human IRR (target: category α > 0.75, specificity α > 0.67)
  • Gold = majority vote (where all 3 disagree, model consensus tiebreaker)
  • Validate gold against model panel — check for systematic human errors (learned from v1 SI↔N/O)
  • Depends on: Step 6 complete (both humans and models)
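
The gold rule above (human majority, model-consensus tiebreak when all three humans disagree) is small enough to sketch directly. Label strings and the return shape are illustrative:

```python
from collections import Counter

def gold_label(human_votes, model_votes):
    """Majority vote over the three human labels; if all three
    disagree, fall back to the model panel's plurality label.
    Returns (label, provenance) so tiebroken rows stay auditable."""
    label, top = Counter(human_votes).most_common(1)[0]
    if top >= 2:
        return label, "human_majority"
    model_label, _ = Counter(model_votes).most_common(1)[0]
    return model_label, "model_tiebreak"
```

Keeping the provenance tag makes the Step 7 validation pass easy: systematic human errors (like the v1 SI↔N/O confusion) show up as clusters where `human_majority` gold disagrees with a near-unanimous model panel.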

Step 8: Stage 2 (if needed)

  • Bench Stage 2 adjudication accuracy against gold
  • If Stage 2 adds value → iterate prompt, run on disputed Stage 1 paragraphs
  • If Stage 2 adds minimal value → document finding, skip production run
  • Estimated cost: ~$20-40 if run
  • Depends on: Step 7 complete

Step 9: Training Data Assembly

  • Unanimous Stage 1 labels → full weight
  • Calibrated majority labels → full weight
  • Judge high-confidence (if Stage 2 run) → full weight
  • Quality tier weights: clean/headed/minor = 1.0, degraded = 0.5
  • Drop the 72 truncated filings
  • Depends on: Step 8 complete
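
The assembly rules reduce to a filter plus a per-row weight. A sketch with illustrative source names and row keys (the actual schema may differ):

```python
def assemble_training_rows(rows, truncated_filings):
    """Filter and weight labeled paragraphs per the Step 9 rules:
    unanimous Stage 1, calibrated-majority, and judge high-confidence
    labels all carry full weight; the paragraph's quality tier scales
    it (clean/headed/minor = 1.0, degraded = 0.5); paragraphs from
    truncated filings are dropped entirely."""
    keep_sources = {"stage1_unanimous", "calibrated_majority", "judge_high_conf"}
    tier_w = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}
    out = []
    for r in rows:
        if r["filing_id"] in truncated_filings or r["source"] not in keep_sources:
            continue
        out.append({**r, "weight": tier_w[r["tier"]]})
    return out
```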

Step 10: Fine-Tuning

  • Ablation matrix: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
  • Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
  • Ordinal regression (CORAL) for specificity
  • SCL for boundary separation (optional, if time permits)
  • Estimated time: 12-20h GPU
  • Depends on: Step 9 complete
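
For the ordinal specificity head, CORAL replaces a 4-way softmax with K-1 = 3 binary "is the level greater than k?" classifiers sharing one weight vector. The label encoding and rank-consistent decode are tiny; this is a pure-Python sketch of just that transform, not the training head itself:

```python
def coral_encode(level, num_levels=4):
    """CORAL target for ordinal level in 0..K-1: K-1 binary
    indicators, the k-th being 1 iff level > k."""
    return [1 if level > k else 0 for k in range(num_levels - 1)]

def coral_decode(probs, threshold=0.5):
    """Predicted level = number of thresholds with P(level > k) at or
    above 0.5. Counting passes (rather than scanning for the first
    failure) keeps the decode monotone even when the probabilities
    are not perfectly ordered."""
    return sum(1 for p in probs if p >= threshold)
```

The appeal over plain cross-entropy is that a level-3 paragraph predicted as level 1 incurs loss on two thresholds, not one, so the head learns that specificity errors have magnitude.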

Step 11: Evaluation & Paper

  • Macro F1 on holdout (target: > 0.80 for both heads)
  • Per-class F1 breakdown
  • Full GenAI benchmark table (10+ models × holdout)
  • Cost/time/reproducibility comparison
  • Error analysis on hardest cases
  • IGNITE slides (20 slides, 15s each)
  • Python notebooks for replication (assignment requirement)
  • Depends on: Step 10 complete
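
The headline metric is macro F1, which averages per-class F1 unweighted, so rare categories count as much as common ones. A dependency-free sketch equivalent to the usual library implementation:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the
    unweighted mean over all classes seen in either vector."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

In practice this would come from `sklearn.metrics.f1_score(..., average="macro")`; the sketch just makes the > 0.80 target concrete.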

Timeline Estimate

| Step | Days | Cumulative |
| --- | --- | --- |
| 1. Codebook approval | 1 | 1 |
| 2. Prompt iteration | 2 | 3 |
| 3. Stage 1 re-run | 0.5 | 3.5 |
| 4. Holdout selection | 0.5 | 4 |
| 5. Labelapp update | 1 | 5 |
| 6. Parallel labeling | 3 | 8 |
| 7. Gold assembly | 1 | 9 |
| 8. Stage 2 (if needed) | 1 | 10 |
| 9. Training data assembly | 0.5 | 10.5 |
| 10. Fine-tuning | 3-5 | 13.5-15.5 |
| 11. Evaluation + paper | 3-5 | 16.5-20.5 |

Buffer: 0.5-4.5 days. Tight but feasible if Steps 1-5 execute cleanly.


Rubric Checklist (Assignment)

C (F1 > 0.80): the goal

  • Fine-tuned model with F1 > 0.80 — category likely, specificity needs v2 broadening
  • Performance comparison GenAI vs fine-tuned — 10 models benchmarked (will re-run on v2 holdout)
  • Labeled datasets — 150K Stage 1 + 1,200 gold (v1; will re-do for v2)
  • Documentation — extensive
  • Python notebooks for replication

B (3+ of 4): already have all 4

  • Cost, time, reproducibility — dollar amounts for every API call
  • 6+ models, 3+ suppliers — 10 models, 8 suppliers (+ Opus in v2)
  • Contemporary self-collected data — 72K paragraphs from SEC EDGAR
  • Compelling use case — SEC cyber disclosure quality assessment

A (3+ of 4): have 3, working on 4th

  • Error analysis — T5 deep-dive, confusion axis analysis, model reasoning examination
  • Mitigation strategy — v1→v2 codebook evolution, experimental validation
  • Additional baselines — dictionary/keyword approach (specificity IS/NOT lists as baseline)
  • Comparison to amateur labels — annotator before/after, human vs model agreement analysis

Key File Locations

| What | Where |
| --- | --- |
| v2 codebook | docs/LABELING-CODEBOOK.md |
| v2 codebook ethos | docs/CODEBOOK-ETHOS.md |
| v2 narrative | docs/NARRATIVE.md |
| v1 codebook (preserved) | docs/LABELING-CODEBOOK-v1.md |
| v1 narrative (preserved) | docs/NARRATIVE-v1.md |
| Strategy notes | docs/STRATEGY-NOTES.md |
| Paragraphs | data/paragraphs/paragraphs-clean.jsonl (72,045) |
| Patched paragraphs | data/paragraphs/paragraphs-clean.patched.jsonl (49,795) |
| v1 Stage 1 annotations | data/annotations/stage1.patched.jsonl (150,009) |
| v1 gold labels | data/gold/gold-adjudicated.jsonl (1,200) |
| v1 human labels | data/gold/human-labels-raw.jsonl (3,600) |
| v1 benchmark annotations | data/annotations/bench-holdout/*.jsonl |
| DAPT checkpoint | checkpoints/dapt/modernbert-large/final/ |
| TAPT checkpoint | checkpoints/tapt/modernbert-large/final/ |
| DAPT corpus | data/dapt-corpus/shard-*.jsonl |
| Stage 1 prompt | ts/src/label/prompts.ts |
| Annotation runner | ts/src/label/annotate.ts |
| Labelapp | labelapp/ |