# Project Status — v2 Pipeline
**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Holdout eval done: cat F1=0.934, spec F1=0.895 vs GPT-5.4 proxy gold)

---
## Carried Forward (not re-done)
- 72,045 paragraphs (all annotated in v2), quality tiers, 6 surgical patches
- DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
- v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
- v2 codebook approved (5/6 group approval 2026-04-04)
---
## Pipeline Steps
### 1. Codebook Finalization — DONE
- [x] Draft v2 codebook (LABELING-CODEBOOK.md)
- [x] Draft codebook ethos (CODEBOOK-ETHOS.md)
- [x] Group approval (5/6, 2026-04-04)
### 2. Holdout Selection — DONE
- [x] Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
- [x] Stratified holdout: 185 per non-ID category, 90 ID = 1,200 exact (sampling sketch at the end of this section)
- [x] Max 2 paragraphs per company per category stratum
- [x] Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
- [x] 1,042 companies represented, max 3 from any one company
- [x] Output: `data/gold/v2-holdout-ids.json`, `data/gold/v2-holdout-manifest.jsonl`
- [x] Script: `scripts/sample-v2-holdout.py`
- Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
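
A minimal sketch of the stratified draw with per-company caps described above. Field names (`category`, `company`) and the greedy fill are illustrative assumptions, not a transcription of `scripts/sample-v2-holdout.py`:

```python
import random
from collections import Counter, defaultdict

QUOTAS = {"RMP": 185, "BG": 185, "MR": 185, "SI": 185, "N/O": 185, "TP": 185, "ID": 90}
MAX_PER_COMPANY_STRATUM = 2   # max 2 paragraphs per company within one category stratum
MAX_PER_COMPANY_TOTAL = 3     # max 3 paragraphs per company across the whole holdout

def sample_holdout(paragraphs, seed=42):
    """Greedy stratified draw: fixed quota per category, capped per company."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for p in paragraphs:
        by_cat[p["category"]].append(p)

    holdout, per_company = [], Counter()
    for cat, quota in QUOTAS.items():
        pool = by_cat[cat][:]
        rng.shuffle(pool)
        per_company_stratum, taken = Counter(), 0
        for p in pool:
            if taken >= quota:
                break
            c = p["company"]
            if per_company_stratum[c] >= MAX_PER_COMPANY_STRATUM:
                continue
            if per_company[c] >= MAX_PER_COMPANY_TOTAL:
                continue
            holdout.append(p)
            per_company_stratum[c] += 1
            per_company[c] += 1
            taken += 1
    return holdout   # 6 × 185 + 90 = 1,200 when every quota can be filled
```

The per-stratum and overall company caps keep any single filer from dominating a category while the fixed quotas still land on exactly 1,200.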
### 3. Prompt Iteration — DONE
- [x] Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
- [x] Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
- [x] Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
- [x] Category/specificity independence explicitly stated (presence check, not relevance judgment)
- [x] Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
- [x] VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
- [x] Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
- [x] Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
- [x] v4.5 iteration: mechanical bridge (specific_facts → specificity level), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
- **v4.4 results (200 paragraphs):** L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
- **Cost per 200:** ~$1.20 (GPT-5.4)
- **Prompt version:** v4.5 (locked)
### 4. Full Holdout Validation — DONE
- [x] Run GPT-5.4 on all 1,200 holdout paragraphs with v4.4 prompt ($5.70)
- [x] Identified 34.5% medium-confidence specificity calls, concentrated at L1/L2 and L2/L3 boundaries
- [x] Identified SI materiality assertions being false-promoted to L4 (negative assertions not verifiable)
- [x] Identified specific_facts field not being stored to disk (toLabelOutput stripped it)
- [x] Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
- [x] Re-ran full 1,200 with v4.5 ($6.88)
- [x] Verified bridge consistency: L1=all empty, L2+=all populated (100%); check sketched at the end of this section
- [x] Verified SI L4 false positives eliminated (0 remaining)
- [x] Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
- **v4.5 results (1,200 paragraphs):** L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
- **Confidence:** 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
- **Category stability:** 96.8% agreement between v4.4 and v4.5
- **L2 at 14%:** below 15% target on holdout, but holdout oversamples TP (14.4% vs 5% in corpus). On full corpus (46% RMP, 5% TP), L2 should be ~15-17% since RMP L2 held up.
- **Dev vs unseen stable:** no prompt overfitting
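
A minimal sketch of the bridge-consistency check noted above: L1 rows must carry an empty `specific_facts` list and L2-L4 rows must carry a populated one. The field and id names are assumptions about the stored JSONL, not the actual verifier:

```python
import json

def check_bridge(path):
    """Flag rows that violate the specific_facts ↔ specificity bridge:
    L1 must have no stored facts, L2-L4 must have at least one."""
    violations = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            facts = row.get("specific_facts") or []
            level = row["specificity"]            # "L1" .. "L4"
            if level == "L1" and facts:
                violations.append((row.get("id"), "L1 with facts"))
            elif level != "L1" and not facts:
                violations.append((row.get("id"), f"{level} with no facts"))
    return violations   # empty list = 100% consistent, as on the v4.5 run

# check_bridge("data/annotations/v2-bench/gpt-5.4.jsonl")
```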
### 5. Holdout Benchmark — DONE
- [x] Run 10 models from 8 providers on 1,200 holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
- [x] Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
- [x] MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
- [x] Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
- [x] Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity)
- [x] Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
- **Benchmark cost:** $45.47
- **Top models:** Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
- **Stage 1 panel:** Grok 4.1 Fast ×3 ($96 estimated)
### 6. Stage 1 Re-Run — DONE
- [x] Lock v2 prompt (v4.5)
- [x] Model selection: Grok 4.1 Fast ×3 (self-consistency)
- [x] Re-run Stage 1 on full corpus (72,045 paragraphs × 3 runs, concurrency 200)
- [x] Cross-run agreement: category 94.9% unanimous, specificity 91.3% unanimous
- [x] Consensus: 62,510 unanimous (86.8%), 9,323 majority (12.9%), 212 judge tiebreaker (0.3%); resolution rule sketched at the end of this section
- [x] GPT-5.4 judge on 212 unresolved paragraphs — 100% agreed with a Grok label
- [x] Distribution check: L2=22.7% (above 15% target), categories healthy
- **Stage 1 cost:** $129.75 (3 runs) + $5.76 (judge) = $135.51
- **Run time:** ~33 min per run at concurrency 200
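
A minimal sketch of the self-consistency resolution above, assuming each of the 3 Grok runs yields one `(category, specificity)` pair per paragraph; the real pipeline may resolve category and specificity separately, so this is illustrative only:

```python
from collections import Counter

def resolve(run_labels):
    """run_labels: the (category, specificity) pair from each of the 3 Grok runs."""
    counts = Counter(run_labels)
    label, n = counts.most_common(1)[0]
    if n == 3:
        return label, "unanimous"
    if n == 2:
        return label, "majority"
    return None, "judge"   # all three disagree: route to the GPT-5.4 judge

# resolve([("RMP", "L2"), ("RMP", "L2"), ("RMP", "L3")]) -> (("RMP", "L2"), "majority")
```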
### 7. Labelapp Update ← CURRENT
- [x] Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
- [x] Update warmup paragraphs with v2 explanations
- [x] Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
- [x] Update codebook reference page for v2
- [x] DB migration to clear old 72k data (0002_v2-reset.sql)
- [x] Seed script updated for 1,200 holdout paragraphs only
- [x] Nuke admin account, joey is admin
- [x] Quiz is one-time (at onboarding), warmup resets each login session
- [ ] Run migration + seed (`la:db:migrate` then `la:seed`)
- [ ] Generate new BIBD assignments (3 of 6 annotators per paragraph)
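
One way to generate the balanced 3-of-6 assignments is to cycle through all C(6,3) = 20 annotator triples, which keeps per-annotator load and pairwise overlap even. A hedged sketch (the actual generator may use a different design):

```python
import random
from itertools import combinations

def bibd_assignments(paragraph_ids, annotators, k=3, seed=42):
    """Cycle through all C(6,3) = 20 annotator triples so per-annotator load
    and pairwise overlap stay balanced across the 1,200 paragraphs."""
    blocks = list(combinations(annotators, k))
    random.Random(seed).shuffle(blocks)
    return {pid: blocks[i % len(blocks)] for i, pid in enumerate(paragraph_ids)}

# Each annotator sits in 10 of the 20 triples, so 1,200 paragraphs yield
# 1,200 × 10/20 = 600 per annotator (the ~600 estimate in the next section).
```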
### 8. Parallel Labeling
- [ ] Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
- [x] Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
- **Estimated cost:** ~$0 remaining (models done)
### 9. Gold Set Assembly
- [ ] Compute human IRR (category α > 0.75, specificity α > 0.67)
- [ ] Gold = majority vote; all-disagree → model consensus tiebreaker
- [ ] Cross-validate against model panel
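
A minimal sketch of the IRR gate and the gold rule above, assuming the `krippendorff` PyPI package and an annotators × paragraphs matrix with NaN where a paragraph was not assigned under the 3-of-6 design; function and variable names are illustrative:

```python
import krippendorff   # pip install krippendorff (assumed dependency)
from collections import Counter

def irr(matrix, level):
    """matrix: annotators × paragraphs, NaN where not assigned (3-of-6 BIBD).
    level: "nominal" for category codes, "ordinal" for specificity 1-4."""
    return krippendorff.alpha(reliability_data=matrix, level_of_measurement=level)

def gold_label(human_votes, model_consensus):
    """Majority of the 3 human votes; full disagreement falls back to model consensus."""
    label, n = Counter(human_votes).most_common(1)[0]
    return label if n >= 2 else model_consensus

# assert irr(cat_matrix, "nominal") > 0.75     # category target
# assert irr(spec_matrix, "ordinal") > 0.67    # specificity target
```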
### 10. Stage 2
- [x] GPT-5.4 judge resolved 212 tiebreaker paragraphs during Stage 1 consensus ($5.76)
- [ ] Bench Stage 2 accuracy against gold (if needed for additional disputed paragraphs)
- **Cost so far:** $5.76 | **Remaining budget:** ~$39
### 11. Training Data Assembly — DONE
- [x] Merge Stage 1 consensus with paragraph data (`python/src/finetune/data.py`)
- [x] Exclude 1,200 holdout paragraphs (reserved for eval)
- [x] Exclude 614 individually truncated paragraphs (not entire filings — more targeted than original plan)
- [x] Quality tier weights: clean/headed/minor 1.0, degraded 0.5
- [x] Stratified train/val split (90/10) from training set
- **Training set size:** 70,231 paragraphs (72,045 - 1,200 holdout - 614 truncated)
- **Train/val split:** 63,214 / 7,024
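
A minimal sketch of the assembly rules above (holdout/truncation exclusions, tier weights, stratified 90/10 split), assuming a pandas DataFrame and illustrative column names; the real logic lives in `python/src/finetune/data.py` and may differ:

```python
from sklearn.model_selection import train_test_split

TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def build_training_set(df, holdout_ids, truncated_ids, seed=42):
    """df: one paragraph per row; holdout_ids / truncated_ids: sets of paragraph ids.
    Returns (train, val) after exclusions, tier weighting, and a stratified 90/10 split."""
    df = df[~df["paragraph_id"].isin(holdout_ids | truncated_ids)].copy()
    df["weight"] = df["quality_tier"].map(TIER_WEIGHTS)
    # assumption: stratify on the joint label so both heads keep their class balance
    # (very rare joint classes may need merging before stratifying)
    strata = df["category"] + "|" + df["specificity"]
    return train_test_split(df, test_size=0.10, stratify=strata, random_state=seed)
```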
### 12. Fine-Tuning — DONE
- [x] Ablation round 1: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss} = 12 configs × 1 epoch
- [x] Ablation round 1 winner: base_weighted_ce (CORAL head, [CLS] pooling)
- [x] CORAL limitation identified: shared weight vector can't capture 3 different transition signals (L1→L2: domain terms, L2→L3: firm facts, L3→L4: quantified claims)
- [x] Architecture iteration: replaced CORAL with independent threshold heads (3 separate MLP binary classifiers), attention pooling, specificity confidence filtering
- [x] **Final model (iter1-independent, epoch 8):** Cat F1=0.943, Spec F1=0.945, QWK=0.952, Combined=0.944
- **Architecture:** ModernBERT-large → attention pooling → dropout → two heads (sketched at the end of this section):
- Category: Linear(1024, 7) + weighted CE
- Specificity: 3× IndependentThreshold(Linear(1024→256→1)) + cumulative BCE + ordinal consistency reg.
- **Key findings (ablation round 1):**
- DAPT/TAPT pre-training did not help — base ModernBERT-large outperformed both
- Class weighting + CE is the best loss combination
- Focal loss + class weighting = too much correction (always bottom tier)
- TAPT consistently worst — likely overfitting on task paragraphs during MLM pre-training
- **Key findings (architecture iteration):**
- CORAL's shared weight vector was the primary bottleneck for specificity (0.517 → 0.940)
- Independent threshold heads let each L1→L2, L2→L3, L3→L4 transition learn different features
- Attention pooling captures distributed specificity signals (one "CISO" mention anywhere matters)
- Confidence filtering removes ~8.7% noisy boundary labels from specificity training
- **Training speed:** ~2.1 it/s, batch 32, seq 512, bf16, flash attention 2, torch.compile
- **Peak VRAM:** ~18-20 GB / 24.6 GB (RTX 3090)
- **Improvement plan:** `docs/SPECIFICITY-IMPROVEMENT-PLAN.md`
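
A minimal PyTorch sketch of the dual-head architecture summarized above: attention pooling over tokens, a 7-way category head, and three independent threshold heads decoded cumulatively into L1-L4. Layer shapes follow the bullets, but the module is illustrative, not the code in `python/src/finetune/model.py` (loss weighting and the ordinal-consistency regularizer are omitted):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learned softmax attention over token embeddings (replaces [CLS] pooling)."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, hidden_states, attention_mask):
        scores = self.score(hidden_states).squeeze(-1)          # (B, T)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = scores.softmax(dim=-1).unsqueeze(-1)          # (B, T, 1)
        return (weights * hidden_states).sum(dim=1)             # (B, H)

class DualHead(nn.Module):
    def __init__(self, hidden=1024, n_categories=7, dropout=0.1):
        super().__init__()
        self.pool = AttentionPool(hidden)
        self.dropout = nn.Dropout(dropout)
        self.category = nn.Linear(hidden, n_categories)
        # three independent binary heads: P(level > L1), P(level > L2), P(level > L3)
        self.thresholds = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 256), nn.GELU(), nn.Linear(256, 1))
            for _ in range(3)
        )

    def forward(self, hidden_states, attention_mask):
        pooled = self.dropout(self.pool(hidden_states, attention_mask))
        cat_logits = self.category(pooled)                       # weighted CE target
        thr_logits = torch.cat([h(pooled) for h in self.thresholds], dim=-1)  # cumulative BCE target
        return cat_logits, thr_logits

def decode_specificity(thr_logits):
    """Cumulative decode: level = 1 + number of thresholds predicted as crossed."""
    return 1 + (thr_logits.sigmoid() > 0.5).sum(dim=-1)
```

Because each threshold head has its own MLP, the L1→L2, L2→L3, and L3→L4 transitions can key on different features (domain terms, firm facts, quantified claims) rather than sharing one CORAL weight vector.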
### 13. Evaluation & Paper ← CURRENT
- [x] Proxy eval: fine-tuned model on 1,200 holdout vs GPT-5.4 and Opus-4.6 proxy gold
- [x] Full metrics suite: macro/per-class F1, precision, recall, MCC, AUC, QWK, MAE, Krippendorff's α, ECE, confusion matrices
- [x] CORAL baseline comparison: same eval pipeline on CORAL epoch 5 checkpoint
- [x] Figures: confusion matrices, calibration diagrams, per-class F1 bars, CORAL vs Independent comparison, speed/cost table
- [x] Reference ceiling analysis: GPT-5.4 vs Opus-4.6 agreement = 0.885 macro spec F1 (our model exceeds this at 0.895)
- [x] L2 error analysis: model L2 F1 (0.798) within 0.007 of reference ceiling (0.805)
- [x] Sequence length analysis: only 139/72K paragraphs (0.19%) truncated at 512 tokens — negligible impact
- [x] Opus labels completed: 1,200/1,200 (filled 16 missing from initial run)
- [ ] Macro F1 on holdout gold (target > 0.80 both heads) — blocked on human labels
- [ ] Per-threshold sigmoid tuning against human gold (potential +0.01-0.02 on L2 F1)
- [x] Temperature scaling for improved calibration — T_cat=1.76, T_spec=2.46; ECE reduced 33%/40% (cat/spec); F1 unchanged (fitting sketched at the end of this section)
- [x] Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed
- [x] Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context
- [x] Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement
- [x] Pooling ablation (attention vs CLS) — attention +0.005 F1 consistent; small but credible effect
- [x] DAPT re-test with new architecture — val +0.007 cat F1, best val NLL 0.333→0.318 (4.5%), generalization gap unchanged; holdout gain ~0.001 (better init, not better generalization)
- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result
- [ ] Note in paper: CORAL ordinal regression insufficient for multi-signal ordinal classification
- [ ] Note in paper: model exceeds inter-reference agreement — approaches ceiling of construct reliability
- **Proxy gold results (vs GPT-5.4):** Cat F1=0.934, Spec F1=0.895, MCC=0.923/0.866, AUC=0.992/0.982, QWK=0.932
- **Proxy gold results (vs Opus-4.6):** Cat F1=0.923, Spec F1=0.883, QWK=0.923
- **Speed:** 5.6ms/sample (178/sec) — 520× faster than GPT-5.4, 1,070× faster than Opus
- **Next:** deploy labelapp for human annotation, then gold evaluation + threshold tuning
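
The temperature-scaling step noted above is post hoc: a single scalar per head divides the logits, fitted to validation NLL, so argmax (and therefore F1) is unchanged while over-confident probabilities soften. A minimal sketch for the category head (the specificity head would use the analogous BCE objective); tensor names are assumptions:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=200):
    """Fit one scalar T on validation NLL; predictions become softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)        # T = exp(log_t), starts at 1.0, stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.05, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# T = fit_temperature(val_logits, val_labels)
# calibrated = (holdout_logits / T).softmax(dim=-1)   # e.g. T ≈ 1.76 (cat), 2.46 (spec) per above
```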
---
## Rubric Checklist
**C (F1 > .80):** Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks
**B (3+ of 4):** [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case
**A (3+ of 4):** [x] Error analysis, [x] Mitigation strategy, [x] Additional baselines (keyword/dictionary — Cat 0.55 / Spec 0.66), [x] Comparison to amateur labels

---
## Key Data
| What | Where |
|------|-------|
| v2 codebook | `docs/LABELING-CODEBOOK.md` |
| v2 ethos | `docs/CODEBOOK-ETHOS.md` |
| Paragraphs (patched) | `data/paragraphs/paragraphs-clean.patched.jsonl` (72,045) |
| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| v2 holdout IDs | `data/gold/v2-holdout-ids.json` (1,200) |
| v2 holdout manifest | `data/gold/v2-holdout-manifest.jsonl` |
| v1 holdout IDs | `labelapp/.sampled-ids.original.json` |
| v1 gold labels | `data/gold/gold-adjudicated.jsonl` |
| v2 holdout benchmark | `data/annotations/v2-bench/` (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | `data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs) |
| v2 iteration archive | `data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl` |
| v4.5 boundary test | `data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs) |
| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,200 paragraphs) |
| Opus +codebook | `data/annotations/golden/opus.jsonl` (includes v1 + v2 runs) |
| Grok self-consistency test | `data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs) |
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
| Stage 1 prompt | `ts/src/label/prompts.ts` (v4.5) |
| Holdout sampling script | `scripts/sample-v2-holdout.py` |
| v2 Stage 1 run 1 | `data/annotations/v2-stage1/grok-4.1-fast.run1.jsonl` (72,045) |
| v2 Stage 1 run 2 | `data/annotations/v2-stage1/grok-4.1-fast.run2.jsonl` (72,045) |
| v2 Stage 1 run 3 | `data/annotations/v2-stage1/grok-4.1-fast.run3.jsonl` (72,045) |
| v2 Stage 1 consensus | `data/annotations/v2-stage1/consensus.jsonl` (72,045) |
| v2 Stage 1 judge | `data/annotations/v2-stage1/judge.jsonl` (212 tiebreakers) |
| Stage 1 distribution charts | `figures/stage1-*.png` (7 charts) |
| Stage 1 chart script | `scripts/plot-stage1-distributions.py` |
| Fine-tuning data loader | `python/src/finetune/data.py` |
| Dual-head model | `python/src/finetune/model.py` |
| Fine-tuning trainer | `python/src/finetune/train.py` |
| Fine-tune config | `python/configs/finetune/modernbert.yaml` |
| Ablation results | `checkpoints/finetune/ablation/ablation_results.json` |
| **Best model (final)** | `checkpoints/finetune/iter1-independent/final/` (cat=0.943, spec=0.945) |
| CORAL baseline (ablation winner) | `checkpoints/finetune/best-base_weighted_ce-ep5/final/` (cat=0.932, spec=0.517) |
| Spec improvement plan | `docs/SPECIFICITY-IMPROVEMENT-PLAN.md` |
| Best model iter1 config | `python/configs/finetune/iter1-independent.yaml` |
| Eval script | `python/src/finetune/eval.py` |
| Eval results (best model) | `results/eval/iter1-independent/metrics.json` |
| Eval results (CORAL) | `results/eval/coral-baseline/metrics.json` |
| Comparison figures | `results/eval/comparison/` (5 charts) |
| Per-model eval figures | `results/eval/iter1-independent/figures/` + `results/eval/coral-baseline/figures/` |
| Comparison figure script | `python/scripts/generate-comparison-figures.py` |
### v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)
| Category | Count | % |
|----------|-------|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|-------------|-------|---|
| L1 | 29,593 | 41.1% |
| L2 | 16,344 | 22.7% |
| L3 | 17,911 | 24.9% |
| L4 | 8,197 | 11.4% |
### v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)
| Category | Count | % |
|----------|-------|---|
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |
### GPT-5.4 Prompt Iteration (holdout)
| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1200) | v4.5 (full, 1200) |
|-------------|-------------------|----------------------|--------------------|--------------------|
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | — | — | 414 (34.5%) | 211 (17.6%) |
v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.