Project Status — v2 Pipeline

Deadline: 2026-04-24 | Started: 2026-04-03 | Updated: 2026-04-05 (Holdout eval done: cat F1=0.934, spec F1=0.895 vs GPT-5.4 proxy gold)


Carried Forward (not re-done)

  • 72,045 paragraphs (all annotated in v2), quality tiers, 6 surgical patches
  • DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
  • v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
  • v2 codebook approved (5/6 group approval 2026-04-04)

Pipeline Steps

1. Codebook Finalization — DONE

  • Draft v2 codebook (LABELING-CODEBOOK.md)
  • Draft codebook ethos (CODEBOOK-ETHOS.md)
  • Group approval (5/6, 2026-04-04)

2. Holdout Selection — DONE

  • Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
  • Stratified holdout: 185 per non-ID category, 90 ID = 1,200 exact
  • Max 2 paragraphs per company per category stratum
  • Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
  • 1,042 companies represented, max 3 from any one company
  • Output: data/gold/v2-holdout-ids.json, data/gold/v2-holdout-manifest.jsonl
  • Script: scripts/sample-v2-holdout.py
  • Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
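
The stratified draw with a per-company cap can be sketched roughly as follows. This is an illustration, not the contents of scripts/sample-v2-holdout.py: the function name, the record keys (`id`, `category`, `company`) and the seed are assumptions, and the specificity floors are not enforced here.

```python
import random
from collections import defaultdict


def sample_holdout(paragraphs, quotas, max_per_company=2, seed=0):
    """Draw quotas[category] paragraphs per category, capping each company.

    `paragraphs` is a list of dicts with hypothetical keys
    'id', 'category' and 'company'.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in paragraphs:
        by_category[p["category"]].append(p)

    selected = []
    for category, quota in quotas.items():
        pool = by_category[category][:]
        rng.shuffle(pool)
        per_company = defaultdict(int)
        picked = []
        for p in pool:
            if len(picked) == quota:
                break
            # Enforce the cap within this category stratum only.
            if per_company[p["company"]] < max_per_company:
                per_company[p["company"]] += 1
                picked.append(p)
        if len(picked) < quota:
            raise ValueError(f"stratum {category!r}: only {len(picked)} eligible")
        selected.extend(picked)
    return selected
```

With a quota of 185 for each of the six non-ID categories and 90 for ID, a draw of this shape returns exactly 1,200 paragraphs or raises if a stratum cannot be filled under the cap.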

3. Prompt Iteration — DONE

  • Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
  • Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
  • Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
  • Category/specificity independence explicitly stated (presence check, not relevance judgment)
  • Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
  • VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
  • Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
  • Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
  • v4.5 iteration: mechanical bridge (specific_facts → specificity level), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
  • v4.4 results (200 paragraphs): L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
  • Cost per 200: ~$1.20 (GPT-5.4)
  • Prompt version: v4.5 (locked)

4. Full Holdout Validation — DONE

  • Run GPT-5.4 on all 1,200 holdout paragraphs with v4.4 prompt ($5.70)
  • Identified 34.5% medium-confidence specificity calls, concentrated at L1/L2 and L2/L3 boundaries
  • Identified SI materiality assertions being false-promoted to L4 (negative assertions not verifiable)
  • Identified specific_facts field not being stored to disk (toLabelOutput stripped it)
  • Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
  • Re-ran full 1,200 with v4.5 ($6.88)
  • Verified bridge consistency: L1=all empty, L2+=all populated (100%)
  • Verified SI L4 false positives eliminated (0 remaining)
  • Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
  • v4.5 results (1,200 paragraphs): L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
  • Confidence: 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
  • Category stability: 96.8% agreement between v4.4 and v4.5
  • L2 at 14%: below the 15% target on holdout, but the holdout oversamples TP (14.4% vs 5% in corpus). On the full corpus (46% RMP, 5% TP), L2 should land around 15-17%, since RMP L2 held up.
  • Dev vs unseen stable: no prompt overfitting
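
The bridge consistency check (L1 has no stored facts, L2+ always has some) can be expressed as a short scan over the label output. A minimal sketch, assuming each record carries a `specific_facts` list plus hypothetical `id` and `specificity` (1 to 4) keys; the real output schema may differ.

```python
def bridge_violations(records):
    """Return ids where the specificity level and stored facts disagree.

    Rule checked: level 1 <=> specific_facts is empty.
    Key names other than specific_facts are assumptions.
    """
    bad = []
    for r in records:
        has_facts = bool(r["specific_facts"])
        is_l1 = r["specificity"] == 1
        # Violation when L1 carries facts, or L2+ carries none.
        if is_l1 == has_facts:
            bad.append(r["id"])
    return bad
```

An empty return value corresponds to the 100% consistency reported above.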

5. Holdout Benchmark — DONE

  • Run 10 models from 8 providers on 1,200 holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
  • Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
  • MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
  • Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
  • Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity)
  • Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
  • Benchmark cost: $45.47
  • Top models: Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
  • Stage 1 panel: Grok 4.1 Fast ×3 ($96 estimated)
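
For reference, a weighted Cohen's kappa of the kind reported as κw can be computed as in the sketch below. Whether the benchmark used linear or quadratic weights is not stated here, so `power` is exposed as a parameter (1 = linear, 2 = quadratic); the function name and signature are illustrative, not taken from scripts/analyze-v2-bench.py.

```python
def weighted_kappa(a, b, levels=(1, 2, 3, 4), power=1):
    """Weighted Cohen's kappa between two label sequences on ordered levels."""
    k = len(levels)
    index = {lvl: i for i, lvl in enumerate(levels)}
    n = len(a)

    # Observed joint distribution of (rater a, rater b) labels.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            num += w * observed[i][j]            # observed disagreement
            den += w * row[i] * col[j]           # chance disagreement
    # den is zero only if both raters use a single identical level.
    return 1.0 - num / den
```

A model that collapses every paragraph to L1 carries no ordinal signal, so its kappa against a varied reference drops toward zero even when raw agreement looks acceptable on an L1-heavy sample.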

6. Stage 1 Re-Run — DONE

  • Lock v2 prompt (v4.5)
  • Model selection: Grok 4.1 Fast ×3 (self-consistency)
  • Re-run Stage 1 on full corpus (72,045 paragraphs × 3 runs, concurrency 200)
  • Cross-run agreement: category 94.9% unanimous, specificity 91.3% unanimous
  • Consensus: 62,510 unanimous (86.8%), 9,323 majority (12.9%), 212 judge tiebreaker (0.3%)
  • GPT-5.4 judge on 212 unresolved paragraphs — 100% agreed with a Grok label
  • Distribution check: L2=22.7% (above 15% target), categories healthy
  • Stage 1 cost: $129.75 (3 runs) + $5.76 (judge) = $135.51
  • Run time: ~33 min per run at concurrency 200
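
The three consensus tiers above can be sketched as a per-dimension vote over the three runs. This is an assumption-laden illustration: the record keys and the rule that a paragraph takes the weaker of its two per-dimension tiers are guesses at how the joint figure is computed, not the actual consensus code.

```python
from collections import Counter

TIER_ORDER = ["unanimous", "majority", "judge"]


def vote(labels):
    """Return (winning label or None, tier) for one dimension."""
    top, count = Counter(labels).most_common(1)[0]
    if count == len(labels):
        return top, "unanimous"
    if count * 2 > len(labels):
        return top, "majority"
    return None, "judge"  # no majority: defer to the judge model


def paragraph_consensus(runs):
    """Combine category and specificity votes across runs (hypothetical keys)."""
    cat, cat_tier = vote([r["category"] for r in runs])
    spec, spec_tier = vote([r["specificity"] for r in runs])
    # Assumption: the paragraph's tier is the weaker of the two dimensions.
    tier = max(cat_tier, spec_tier, key=TIER_ORDER.index)
    return {"category": cat, "specificity": spec, "tier": tier}
```

Paragraphs that come back with tier "judge" correspond to the set sent to the GPT-5.4 tiebreaker.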

7. Labelapp Update ← CURRENT

  • Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
  • Update warmup paragraphs with v2 explanations
  • Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
  • Update codebook reference page for v2
  • DB migration to clear old 72k data (0002_v2-reset.sql)
  • Seed script updated for 1,200 holdout paragraphs only
  • Remove the old admin account; joey is admin
  • Quiz is one-time (at onboarding), warmup resets each login session
  • Run migration + seed (la:db:migrate then la:seed)
  • Generate new BIBD assignments (3 of 5 annotators per paragraph)
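
One simple way to get balanced 3-of-5 assignments is to cycle through all ten annotator triples, which is a (trivial) balanced incomplete block design. The sketch below assumes that approach; the labelapp may generate its blocks differently, and the annotator names are placeholders.

```python
from collections import Counter
from itertools import combinations, cycle


def assign_blocks(paragraph_ids, annotators, k=3):
    """Assign each paragraph to one k-subset of annotators, round-robin.

    Balance is exact when the number of paragraphs is a multiple of the
    number of k-subsets (10 for 3 of 5). Shuffle paragraph_ids beforehand
    if their order is meaningful.
    """
    blocks = cycle(combinations(annotators, k))
    return {pid: next(blocks) for pid in paragraph_ids}


# Toy usage: 20 paragraphs, 5 placeholder annotators.
assignments = assign_blocks(range(20), ["a1", "a2", "a3", "a4", "a5"])
load = Counter(a for block in assignments.values() for a in block)
```

In this design every annotator carries the same load and every pair of annotators shares the same number of paragraphs, which keeps pairwise agreement estimates comparable.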

8. Parallel Labeling

  • Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
  • Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
  • Estimated cost: ~$0 remaining (models done)

9. Gold Set Assembly

  • Compute human IRR (category α > 0.75, specificity α > 0.67)
  • Gold = majority vote; all-disagree → model consensus tiebreaker
  • Cross-validate against model panel
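
The IRR thresholds above refer to Krippendorff's α. A minimal sketch of the nominal version follows, built from the standard coincidence-matrix definition; the function name is illustrative, and an ordinal distance metric (not shown) would likely be the better fit for specificity.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` is a list of per-paragraph rating lists; paragraphs with
    fewer than two ratings are ignored.
    """
    coincidence = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Each ordered pair of ratings within a unit contributes 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidence[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), w in coincidence.items():
        totals[a] += w
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    # alpha = 1 - D_o / D_e, with D_o = observed / n and
    # D_e = expected / (n * (n - 1)). Undefined if only one label is used.
    return 1.0 - observed * (n - 1) / expected
```

The result would be compared against the 0.75 (category) and 0.67 (specificity) targets once human labels are in.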

10. Stage 2

  • GPT-5.4 judge resolved 212 tiebreaker paragraphs during Stage 1 consensus ($5.76)
  • Bench Stage 2 accuracy against gold (if needed for additional disputed paragraphs)
  • Cost so far: $5.76 | Remaining budget: ~$39

11. Training Data Assembly — DONE

  • Merge Stage 1 consensus with paragraph data (python/src/finetune/data.py)
  • Exclude 1,200 holdout paragraphs (reserved for eval)
  • Exclude 614 individually truncated paragraphs (not entire filings — more targeted than original plan)
  • Quality tier weights: clean/headed/minor 1.0, degraded 0.5
  • Stratified train/val split (90/10) from training set
  • Training set size: 70,231 paragraphs (72,045 − 1,200 holdout − 614 truncated)
  • Train/val split: 63,214 / 7,024
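
The assembly steps above (exclusions, tier weights, stratified split) can be sketched as follows. This is not the code in python/src/finetune/data.py: the record keys, the choice of (category, specificity) as the stratification key, and the seed are all assumptions.

```python
import random
from collections import defaultdict

# Quality tier weights as listed above.
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}


def assemble(rows, holdout_ids, truncated_ids, val_fraction=0.1, seed=0):
    """Filter excluded paragraphs, attach sample weights, split train/val."""
    kept = [
        dict(r, weight=TIER_WEIGHT[r["tier"]])
        for r in rows
        if r["id"] not in holdout_ids and r["id"] not in truncated_ids
    ]

    # Assumption: stratify on the joint (category, specificity) label.
    by_stratum = defaultdict(list)
    for r in kept:
        by_stratum[(r["category"], r["specificity"])].append(r)

    rng = random.Random(seed)
    train, val = [], []
    for group in by_stratum.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Because rounding happens per stratum, the resulting split sizes can differ by a few rows from an exact 90/10 cut of the total.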

12. Fine-Tuning — DONE

  • Ablation round 1: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss} = 12 configs × 1 epoch
  • Ablation round 1 winner: base_weighted_ce (CORAL head, [CLS] pooling)
  • CORAL limitation identified: shared weight vector can't capture 3 different transition signals (L1→L2: domain terms, L2→L3: firm facts, L3→L4: quantified claims)
  • Architecture iteration: replaced CORAL with independent threshold heads (3 separate MLP binary classifiers), attention pooling, specificity confidence filtering
  • Final model (iter1-independent, epoch 8): Cat F1=0.943, Spec F1=0.945, QWK=0.952, Combined=0.944
  • Architecture: ModernBERT-large → attention pooling → dropout →
    • Category: Linear(1024, 7) + weighted CE
    • Specificity: 3× IndependentThreshold(Linear(1024→256→1)) + cumulative BCE + ordinal consistency reg.
  • Key findings (ablation round 1):
    • DAPT/TAPT pre-training did not help — base ModernBERT-large outperformed both
    • Class weighting + CE is the best loss combination
    • Focal loss + class weighting = too much correction (always bottom tier)
    • TAPT consistently worst — likely overfitting on task paragraphs during MLM pre-training
  • Key findings (architecture iteration):
    • CORAL's shared weight vector was the primary bottleneck for specificity (0.517 → 0.940)
    • Independent threshold heads let each L1→L2, L2→L3, L3→L4 transition learn different features
    • Attention pooling captures distributed specificity signals (one "CISO" mention anywhere matters)
    • Confidence filtering removes ~8.7% noisy boundary labels from specificity training
  • Training speed: ~2.1 it/s, batch 32, seq 512, bf16, flash attention 2, torch.compile
  • Peak VRAM: ~18-20 GB / 24.6 GB (RTX 3090)
  • Improvement plan: docs/SPECIFICITY-IMPROVEMENT-PLAN.md
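
The independent-threshold idea can be illustrated without any framework: each of the three heads answers "is the level above t?", targets are cumulative, and the loss is a mean binary cross-entropy over the heads. The sketch below is plain Python for clarity; the decoding rule (count of thresholds passed at 0.5) and all names are assumptions, and the ordinal consistency regularizer is omitted. The real model lives in python/src/finetune/model.py.

```python
import math


def cumulative_targets(level, num_levels=4):
    """Level 3 of 4 -> [1.0, 1.0, 0.0]: above L1, above L2, not above L3."""
    return [1.0 if level > t else 0.0 for t in range(1, num_levels)]


def cumulative_bce(probs, level):
    """Mean binary cross-entropy of per-threshold probabilities vs targets."""
    targets = cumulative_targets(level, len(probs) + 1)
    eps = 1e-7
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(probs)


def decode_level(probs, threshold=0.5):
    """Predicted level = 1 + number of thresholds passed (assumed rule)."""
    return 1 + sum(p > threshold for p in probs)
```

Since each head has its own parameters, the L1→L2, L2→L3 and L3→L4 decisions are free to key on different features, which is the property the shared CORAL weight vector lacked.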

13. Evaluation & Paper ← CURRENT

  • Proxy eval: fine-tuned model on 1,200 holdout vs GPT-5.4 and Opus-4.6 proxy gold
  • Full metrics suite: macro/per-class F1, precision, recall, MCC, AUC, QWK, MAE, Krippendorff's α, ECE, confusion matrices
  • CORAL baseline comparison: same eval pipeline on CORAL epoch 5 checkpoint
  • Figures: confusion matrices, calibration diagrams, per-class F1 bars, CORAL vs Independent comparison, speed/cost table
  • Reference ceiling analysis: GPT-5.4 vs Opus-4.6 agreement = 0.885 macro spec F1 (our model exceeds this at 0.895)
  • L2 error analysis: model L2 F1 (0.798) within 0.007 of reference ceiling (0.805)
  • Sequence length analysis: only 139/72K paragraphs (0.19%) truncated at 512 tokens — negligible impact
  • Opus labels completed: 1,200/1,200 (filled 16 missing from initial run)
  • Macro F1 on holdout gold (target > 0.80 both heads) — blocked on human labels
  • Per-threshold sigmoid tuning against human gold (potential +0.01-0.02 on L2 F1)
  • Temperature scaling for improved calibration — T_cat=1.76, T_spec=2.46; ECE reduced 33%/40% (cat/spec); F1 unchanged
  • Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed
  • Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context
  • Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement
  • Error analysis against human gold, IGNITE slides
  • Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
  • Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result
  • Note in paper: CORAL ordinal regression insufficient for multi-signal ordinal classification
  • Note in paper: model exceeds inter-reference agreement — approaches ceiling of construct reliability
  • Proxy gold results (vs GPT-5.4): Cat F1=0.934, Spec F1=0.895, MCC=0.923/0.866, AUC=0.992/0.982, QWK=0.932
  • Proxy gold results (vs Opus-4.6): Cat F1=0.923, Spec F1=0.883, QWK=0.923
  • Speed: 5.6ms/sample (178/sec) — 520× faster than GPT-5.4, 1,070× faster than Opus
  • Next: deploy labelapp for human annotation, then gold evaluation + threshold tuning
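
Temperature scaling and ECE, as used in the calibration items above, can be sketched in a few lines. This is a generic illustration with equal-width confidence bins, not the code in python/src/finetune/eval.py; the bin count and function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece
```

Dividing logits by a positive temperature never changes the argmax, which is why F1 is unchanged while ECE drops; the fitted values reported above are T_cat=1.76 and T_spec=2.46.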

Rubric Checklist

C (F1 > .80): Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks
B (3+ of 4): [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case
A (3+ of 4): [x] Error analysis, [x] Mitigation strategy, [x] Additional baselines (keyword/dictionary: Cat 0.55 / Spec 0.66), [x] Comparison to amateur labels


Key Data

| What | Where |
|---|---|
| v2 codebook | docs/LABELING-CODEBOOK.md |
| v2 ethos | docs/CODEBOOK-ETHOS.md |
| Paragraphs (patched) | data/paragraphs/paragraphs-clean.patched.jsonl (72,045) |
| v1 Stage 1 annotations | data/annotations/stage1.patched.jsonl (150,009) |
| v2 holdout IDs | data/gold/v2-holdout-ids.json (1,200) |
| v2 holdout manifest | data/gold/v2-holdout-manifest.jsonl |
| v1 holdout IDs | labelapp/.sampled-ids.original.json |
| v1 gold labels | data/gold/gold-adjudicated.jsonl |
| v2 holdout benchmark | data/annotations/v2-bench/ (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | data/annotations/v2-bench/gpt-5.4.jsonl (v4.5, 1,200 paragraphs) |
| v2 iteration archive | data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl |
| v4.5 boundary test | data/annotations/v2-bench/v45-test/gpt-5.4.jsonl (50 paragraphs) |
| Opus prompt-only | data/annotations/v2-bench/opus-4.6.jsonl (1,200 paragraphs) |
| Opus +codebook | data/annotations/golden/opus.jsonl (includes v1 + v2 runs) |
| Grok self-consistency test | data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl (47 paragraphs) |
| Benchmark analysis | scripts/analyze-v2-bench.py |
| Stage 1 prompt | ts/src/label/prompts.ts (v4.5) |
| Holdout sampling script | scripts/sample-v2-holdout.py |
| v2 Stage 1 run 1 | data/annotations/v2-stage1/grok-4.1-fast.run1.jsonl (72,045) |
| v2 Stage 1 run 2 | data/annotations/v2-stage1/grok-4.1-fast.run2.jsonl (72,045) |
| v2 Stage 1 run 3 | data/annotations/v2-stage1/grok-4.1-fast.run3.jsonl (72,045) |
| v2 Stage 1 consensus | data/annotations/v2-stage1/consensus.jsonl (72,045) |
| v2 Stage 1 judge | data/annotations/v2-stage1/judge.jsonl (212 tiebreakers) |
| Stage 1 distribution charts | figures/stage1-*.png (7 charts) |
| Stage 1 chart script | scripts/plot-stage1-distributions.py |
| Fine-tuning data loader | python/src/finetune/data.py |
| Dual-head model | python/src/finetune/model.py |
| Fine-tuning trainer | python/src/finetune/train.py |
| Fine-tune config | python/configs/finetune/modernbert.yaml |
| Ablation results | checkpoints/finetune/ablation/ablation_results.json |
| Best model (final) | checkpoints/finetune/iter1-independent/final/ (cat=0.943, spec=0.945) |
| CORAL baseline (ablation winner) | checkpoints/finetune/best-base_weighted_ce-ep5/final/ (cat=0.932, spec=0.517) |
| Spec improvement plan | docs/SPECIFICITY-IMPROVEMENT-PLAN.md |
| Best model iter1 config | python/configs/finetune/iter1-independent.yaml |
| Eval script | python/src/finetune/eval.py |
| Eval results (best model) | results/eval/iter1-independent/metrics.json |
| Eval results (CORAL) | results/eval/coral-baseline/metrics.json |
| Comparison figures | results/eval/comparison/ (5 charts) |
| Per-model eval figures | results/eval/iter1-independent/figures/ + results/eval/coral-baseline/figures/ |
| Comparison figure script | python/scripts/generate-comparison-figures.py |

v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)

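| Category | Count | % |
|---|---|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|---|---|---|
| L1 | 29,593 | 41.1% |
| L2 | 16,344 | 22.7% |
| L3 | 17,911 | 24.9% |
| L4 | 8,197 | 11.4% |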

v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)

| Category | Count | % |
|---|---|---|
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |

GPT-5.4 Prompt Iteration (holdout)

| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1200) | v4.5 (full, 1200) |
|---|---|---|---|---|
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | | | 414 (34.5%) | 211 (17.6%) |

v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.