# Project Status — v2 Pipeline

**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Holdout eval done: cat F1=0.934, spec F1=0.895 vs GPT-5.4 proxy gold)

---

## Carried Forward (not re-done)

- 72,045 paragraphs (all annotated in v2), quality tiers, 6 surgical patches
- DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
- v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
- v2 codebook approved (5/6 group approval 2026-04-04)

---

## Pipeline Steps

### 1. Codebook Finalization — DONE
- [x] Draft v2 codebook (LABELING-CODEBOOK.md)
- [x] Draft codebook ethos (CODEBOOK-ETHOS.md)
- [x] Group approval (5/6, 2026-04-04)

### 2. Holdout Selection — DONE
- [x] Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
- [x] Stratified holdout: 185 per non-ID category, 90 ID = 1,200 exact
- [x] Max 2 paragraphs per company per category stratum
- [x] Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
- [x] 1,042 companies represented, max 3 from any one company
- [x] Output: `data/gold/v2-holdout-ids.json`, `data/gold/v2-holdout-manifest.jsonl`
- [x] Script: `scripts/sample-v2-holdout.py`
- Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)

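
The sampling constraints above (fixed quota per category, capped paragraphs per company within each stratum) can be sketched as follows. This is a minimal illustration, not the contents of `scripts/sample-v2-holdout.py`: the function name and the record fields `category` and `company` are assumptions, and the specificity floors are not enforced here.

```python
import random
from collections import defaultdict

def sample_holdout(paragraphs, quotas, max_per_company=2, seed=0):
    """Pick quotas[category] paragraphs per category, capping each company per stratum."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in paragraphs:
        by_category[p["category"]].append(p)

    selected = []
    for category, quota in quotas.items():
        pool = by_category[category][:]
        rng.shuffle(pool)
        per_company = defaultdict(int)
        picked = []
        for p in pool:
            if len(picked) == quota:
                break
            # Skip paragraphs from companies that already hit the stratum cap.
            if per_company[p["company"]] < max_per_company:
                per_company[p["company"]] += 1
                picked.append(p)
        if len(picked) < quota:
            raise ValueError(f"stratum {category!r}: only {len(picked)} of {quota} available")
        selected.extend(picked)
    return selected

# Quotas from the checklist: 185 for each of the six non-ID categories, 90 for ID.
QUOTAS = {"RMP": 185, "BG": 185, "MR": 185, "SI": 185, "N/O": 185, "TP": 185, "ID": 90}
assert sum(QUOTAS.values()) == 1200
```

A fixed seed keeps the draw reproducible, which matters because the first 200 sampled paragraphs double as the dev set.
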
### 3. Prompt Iteration — DONE
- [x] Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
- [x] Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
- [x] Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
- [x] Category/specificity independence explicitly stated (presence check, not relevance judgment)
- [x] Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
- [x] VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
- [x] Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
- [x] Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
- [x] v4.5 iteration: mechanical bridge (specific_facts → specificity level), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
- **v4.4 results (200 paragraphs):** L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
- **Cost per 200:** ~$1.20 (GPT-5.4)
- **Prompt version:** v4.5 (locked)

### 4. Full Holdout Validation — DONE
- [x] Run GPT-5.4 on all 1,200 holdout paragraphs with v4.4 prompt ($5.70)
- [x] Identified 34.5% medium-confidence specificity calls, concentrated at L1/L2 and L2/L3 boundaries
- [x] Identified SI materiality assertions being false-promoted to L4 (negative assertions not verifiable)
- [x] Identified specific_facts field not being stored to disk (toLabelOutput stripped it)
- [x] Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
- [x] Re-ran full 1,200 with v4.5 ($6.88)
- [x] Verified bridge consistency: L1=all empty, L2+=all populated (100%)
- [x] Verified SI L4 false positives eliminated (0 remaining)
- [x] Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
- **v4.5 results (1,200 paragraphs):** L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
- **Confidence:** 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
- **Category stability:** 96.8% agreement between v4.4 and v4.5
- **L2 at 14%:** below 15% target on holdout, but holdout oversamples TP (14.4% vs 5% in corpus). On full corpus (46% RMP, 5% TP), L2 should be ~15-17% since RMP L2 held up.
- **Dev vs unseen stable:** no prompt overfitting

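
The bridge-consistency check above (L1 carries no facts, L2 and higher always carry some) is simple enough to sketch. The function name and the `specificity` key are assumptions for illustration; only the `specific_facts` field name comes from the notes above.

```python
def bridge_violations(labels):
    """Return records whose specific_facts list contradicts their specificity level."""
    violations = []
    for rec in labels:
        has_facts = bool(rec.get("specific_facts"))
        is_l1 = rec["specificity"] == 1
        # L1 must carry no facts; L2 and above must carry at least one.
        if is_l1 == has_facts:
            violations.append(rec)
    return violations

# Hypothetical records, shaped the way the check expects them.
sample = [
    {"id": "a", "specificity": 1, "specific_facts": []},
    {"id": "b", "specificity": 3, "specific_facts": ["named CISO"]},
    {"id": "c", "specificity": 2, "specific_facts": []},  # violates the bridge
]
print([rec["id"] for rec in bridge_violations(sample)])
```

An empty result over all 1,200 labels would correspond to the 100% consistency reported above.
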
### 5. Holdout Benchmark — DONE
- [x] Run 10 models from 8 providers on 1,200 holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
- [x] Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
- [x] MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
- [x] Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
- [x] Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity)
- [x] Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
- **Benchmark cost:** $45.47
- **Top models:** Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
- **Stage 1 panel:** Grok 4.1 Fast ×3 ($96 estimated)

### 6. Stage 1 Re-Run — DONE
- [x] Lock v2 prompt (v4.5)
- [x] Model selection: Grok 4.1 Fast ×3 (self-consistency)
- [x] Re-run Stage 1 on full corpus (72,045 paragraphs × 3 runs, concurrency 200)
- [x] Cross-run agreement: category 94.9% unanimous, specificity 91.3% unanimous
- [x] Consensus: 62,510 unanimous (86.8%), 9,323 majority (12.9%), 212 judge tiebreaker (0.3%)
- [x] GPT-5.4 judge on 212 unresolved paragraphs — 100% agreed with a Grok label
- [x] Distribution check: L2=22.7% (above 15% target), categories healthy
- **Stage 1 cost:** $129.75 (3 runs) + $5.76 (judge) = $135.51
- **Run time:** ~33 min per run at concurrency 200

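
The three consensus buckets above (unanimous, majority, judge tiebreaker) can be sketched as a per-dimension vote over the three runs. This is an illustration under assumptions: the notes do not say exactly how category and specificity votes are combined, so the rule here (a paragraph goes to the judge if either dimension has no majority) is a guess, and the function names are hypothetical.

```python
from collections import Counter

def vote(labels):
    """Classify the run labels for one dimension as unanimous, majority, or unresolved."""
    top, count = Counter(labels).most_common(1)[0]
    if count == len(labels):
        return top, "unanimous"
    if count > len(labels) / 2:
        return top, "majority"
    return None, "judge"  # no majority: defer to the tiebreaker model

def consensus(runs):
    """runs: three (category, specificity) tuples for one paragraph."""
    category, cat_status = vote([r[0] for r in runs])
    specificity, spec_status = vote([r[1] for r in runs])
    statuses = {cat_status, spec_status}
    if "judge" in statuses:
        status = "judge"
    elif statuses == {"unanimous"}:
        status = "unanimous"
    else:
        status = "majority"
    return category, specificity, status

print(consensus([("RMP", 2), ("RMP", 3), ("RMP", 2)]))
```

With three runs, a dimension can only be unresolved when all three labels differ, which is why the judge bucket stays small.
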
### 7. Labelapp Update ← CURRENT
- [x] Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
- [x] Update warmup paragraphs with v2 explanations
- [x] Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
- [x] Update codebook reference page for v2
- [x] DB migration to clear old 72k data (0002_v2-reset.sql)
- [x] Seed script updated for 1,200 holdout paragraphs only
- [x] Delete admin account; joey is now admin
- [x] Quiz is one-time (at onboarding), warmup resets each login session
- [ ] Run migration + seed (`la:db:migrate` then `la:seed`)
- [ ] Generate new BIBD assignments (3 of 6 annotators per paragraph)

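
One way to produce balanced 3-of-6 assignments is to cycle through all 20 possible annotator triples, which is itself a balanced incomplete block design (every annotator appears in 10 triples, every pair in 4). The sketch below assumes this construction; the labelapp's actual generator is not described here, and the annotator names and function name are placeholders.

```python
import random
from collections import Counter
from itertools import combinations

def assign_annotators(paragraph_ids, annotators, k=3, seed=0):
    """Map each paragraph to one k-subset of annotators, cycling through all subsets."""
    blocks = list(combinations(annotators, k))  # 20 triples for 6 annotators
    ids = list(paragraph_ids)
    random.Random(seed).shuffle(ids)  # avoid tying blocks to paragraph order
    return {pid: blocks[i % len(blocks)] for i, pid in enumerate(ids)}

assignments = assign_annotators(range(1200), ["a1", "a2", "a3", "a4", "a5", "a6"])
load = Counter(a for block in assignments.values() for a in block)
print(load)
```

Because 1,200 is divisible by 20, every triple is used 60 times, so each annotator receives exactly 600 paragraphs and every pair of annotators shares 240, which gives pairwise agreement statistics equal support.
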
### 8. Parallel Labeling
- [ ] Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
- [x] Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
- **Estimated cost:** ~$0 remaining (models done)

### 9. Gold Set Assembly
- [ ] Compute human IRR (category α > 0.75, specificity α > 0.67)
- [ ] Gold = majority vote; all-disagree → model consensus tiebreaker
- [ ] Cross-validate against model panel

### 10. Stage 2
- [x] GPT-5.4 judge resolved 212 tiebreaker paragraphs during Stage 1 consensus ($5.76)
- [ ] Bench Stage 2 accuracy against gold (if needed for additional disputed paragraphs)
- **Cost so far:** $5.76 | **Remaining budget:** ~$39

### 11. Training Data Assembly — DONE
- [x] Merge Stage 1 consensus with paragraph data (`python/src/finetune/data.py`)
- [x] Exclude 1,200 holdout paragraphs (reserved for eval)
- [x] Exclude 614 individually truncated paragraphs (not entire filings — more targeted than original plan)
- [x] Quality tier weights: clean/headed/minor 1.0, degraded 0.5
- [x] Stratified train/val split (90/10) from training set
- **Training set size:** 70,231 paragraphs (72,045 − 1,200 holdout − 614 truncated)
- **Train/val split:** 63,214 / 7,024

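
The assembly steps above (drop holdout and truncated paragraphs, attach quality-tier weights, split 90/10 within each stratum) might look roughly like this. It is a sketch under assumptions, not the code in `python/src/finetune/data.py`: the field names `id`, `quality_tier`, and `category`, and both function names, are hypothetical, and stratifying on category alone is a simplification.

```python
import random
from collections import defaultdict

# Tier weights from the checklist above.
TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def assemble(paragraphs, holdout_ids, truncated_ids):
    """Drop excluded paragraphs and attach a per-sample loss weight."""
    excluded = set(holdout_ids) | set(truncated_ids)
    return [
        {**p, "weight": TIER_WEIGHTS[p["quality_tier"]]}
        for p in paragraphs
        if p["id"] not in excluded
    ]

def stratified_split(rows, key, val_fraction=0.1, seed=0):
    """Split each stratum separately so train and val share the label mix."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, val = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Rounding the validation share per stratum means the overall split can land a few rows away from an exact 90/10.
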
### 12. Fine-Tuning — DONE
- [x] Ablation round 1: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss} = 12 configs × 1 epoch
- [x] Ablation round 1 winner: base_weighted_ce (CORAL head, [CLS] pooling)
- [x] CORAL limitation identified: shared weight vector can't capture 3 different transition signals (L1→L2: domain terms, L2→L3: firm facts, L3→L4: quantified claims)
- [x] Architecture iteration: replaced CORAL with independent threshold heads (3 separate MLP binary classifiers), attention pooling, specificity confidence filtering
- [x] **Final model (iter1-independent, epoch 8):** Cat F1=0.943, Spec F1=0.945, QWK=0.952, Combined=0.944
- **Architecture:** ModernBERT-large → attention pooling → dropout →
  - Category: Linear(1024, 7) + weighted CE
  - Specificity: 3× IndependentThreshold(Linear(1024→256→1)) + cumulative BCE + ordinal consistency reg.
- **Key findings (ablation round 1):**
  - DAPT/TAPT pre-training did not help — base ModernBERT-large outperformed both
  - Class weighting + CE is the best loss combination
  - Focal loss + class weighting = too much correction (always bottom tier)
  - TAPT consistently worst — likely overfitting on task paragraphs during MLM pre-training
- **Key findings (architecture iteration):**
  - CORAL's shared weight vector was the primary bottleneck for specificity (0.517 → 0.940)
  - Independent threshold heads let each L1→L2, L2→L3, L3→L4 transition learn different features
  - Attention pooling captures distributed specificity signals (one "CISO" mention anywhere matters)
  - Confidence filtering removes ~8.7% noisy boundary labels from specificity training
- **Training speed:** ~2.1 it/s, batch 32, seq 512, bf16, flash attention 2, torch.compile
- **Peak VRAM:** ~18-20 GB / 24.6 GB (RTX 3090)
- **Improvement plan:** `docs/SPECIFICITY-IMPROVEMENT-PLAN.md`

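
To make the threshold-head formulation concrete, the sketch below shows how a four-level label maps to three cumulative binary targets (the cumulative BCE setup) and how three head logits could be decoded back into a level. It is framework-free on purpose and rests on assumptions: the real decoding rule in `python/src/finetune/model.py` is not documented here, and stopping at the first failed threshold is just one way to keep predictions ordinal.

```python
import math

def cumulative_targets(level, num_levels=4):
    """Encode level k as binary targets [k>1, k>2, k>3], one per threshold head."""
    return [1.0 if level > t else 0.0 for t in range(1, num_levels)]

def decode_level(logits, cutoff=0.5):
    """Level = 1 + number of consecutive thresholds passed, starting from L1→L2."""
    level = 1
    for logit in logits:
        prob = 1.0 / (1.0 + math.exp(-logit))
        if prob <= cutoff:
            break  # stop at the first failed threshold to keep the output ordinal
        level += 1
    return level

print(cumulative_targets(3))           # targets for an L3 paragraph
print(decode_level([2.0, 1.0, -3.0]))  # made-up logits from the three heads
```

Each head sees its own binary problem, so the L1→L2 head is free to weight domain terms while the L3→L4 head weights quantified claims. A single shared `cutoff` is used here for brevity; a per-head cutoff would be the natural place for threshold tuning.
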
### 13. Evaluation & Paper ← CURRENT
- [x] Proxy eval: fine-tuned model on 1,200 holdout vs GPT-5.4 and Opus-4.6 proxy gold
- [x] Full metrics suite: macro/per-class F1, precision, recall, MCC, AUC, QWK, MAE, Krippendorff's α, ECE, confusion matrices
- [x] CORAL baseline comparison: same eval pipeline on CORAL epoch 5 checkpoint
- [x] Figures: confusion matrices, calibration diagrams, per-class F1 bars, CORAL vs Independent comparison, speed/cost table
- [x] Reference ceiling analysis: GPT-5.4 vs Opus-4.6 agreement = 0.885 macro spec F1 (our model exceeds this at 0.895)
- [x] L2 error analysis: model L2 F1 (0.798) within 0.007 of reference ceiling (0.805)
- [x] Sequence length analysis: only 139/72K paragraphs (0.19%) truncated at 512 tokens — negligible impact
- [x] Opus labels completed: 1,200/1,200 (filled 16 missing from initial run)
- [ ] Macro F1 on holdout gold (target > 0.80 both heads) — blocked on human labels
- [ ] Per-threshold sigmoid tuning against human gold (potential +0.01-0.02 on L2 F1)
- [x] Temperature scaling for improved calibration — T_cat=1.76, T_spec=2.46; ECE reduced 33%/40% (cat/spec); F1 unchanged
- [x] Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed
- [x] Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context
- [x] Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement
- [x] Pooling ablation (attention vs CLS) — attention +0.005 F1 consistent; small but credible effect
- [x] DAPT re-test with new architecture — val +0.007 cat F1, best val NLL 0.333→0.318 (−4.5%), generalization gap unchanged; holdout gain ~0.001 (better init, not better generalization)
- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result
- [ ] Note in paper: CORAL ordinal regression insufficient for multi-signal ordinal classification
- [ ] Note in paper: model exceeds inter-reference agreement — approaches ceiling of construct reliability
- **Proxy gold results (vs GPT-5.4):** Cat F1=0.934, Spec F1=0.895, MCC=0.923/0.866, AUC=0.992/0.982, QWK=0.932
- **Proxy gold results (vs Opus-4.6):** Cat F1=0.923, Spec F1=0.883, QWK=0.923
- **Speed:** 5.6ms/sample (178/sec) — 520× faster than GPT-5.4, 1,070× faster than Opus
- **Next:** deploy labelapp for human annotation, then gold evaluation + threshold tuning

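
The temperature-scaling result above (lower ECE, unchanged F1) follows from how the technique works: dividing logits by a constant T > 1 softens the probabilities without changing their order, so the predicted class stays the same. A minimal sketch for the category head, with made-up logits and a hypothetical helper name; only the value T_cat=1.76 comes from the checklist.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5, 0.0, -1.0, -1.5, -2.0]  # made-up 7-way category logits
raw = softmax_with_temperature(logits)
calibrated = softmax_with_temperature(logits, temperature=1.76)  # T_cat from the checklist

# Same argmax (so F1 is unaffected), lower top-class confidence (so ECE can improve).
assert raw.index(max(raw)) == calibrated.index(max(calibrated))
assert max(calibrated) < max(raw)
```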
---

## Rubric Checklist

**C (F1 > .80):** Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks
**B (3+ of 4):** [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case
**A (3+ of 4):** [x] Error analysis, [x] Mitigation strategy, [x] Additional baselines (keyword/dictionary — Cat 0.55 / Spec 0.66), [x] Comparison to amateur labels

---

## Key Data

| What | Where |
|------|-------|
| v2 codebook | `docs/LABELING-CODEBOOK.md` |
| v2 ethos | `docs/CODEBOOK-ETHOS.md` |
| Paragraphs (patched) | `data/paragraphs/paragraphs-clean.patched.jsonl` (72,045) |
| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| v2 holdout IDs | `data/gold/v2-holdout-ids.json` (1,200) |
| v2 holdout manifest | `data/gold/v2-holdout-manifest.jsonl` |
| v1 holdout IDs | `labelapp/.sampled-ids.original.json` |
| v1 gold labels | `data/gold/gold-adjudicated.jsonl` |
| v2 holdout benchmark | `data/annotations/v2-bench/` (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | `data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs) |
| v2 iteration archive | `data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl` |
| v4.5 boundary test | `data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs) |
| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,200 paragraphs) |
| Opus +codebook | `data/annotations/golden/opus.jsonl` (includes v1 + v2 runs) |
| Grok self-consistency test | `data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs) |
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
| Stage 1 prompt | `ts/src/label/prompts.ts` (v4.5) |
| Holdout sampling script | `scripts/sample-v2-holdout.py` |
| v2 Stage 1 run 1 | `data/annotations/v2-stage1/grok-4.1-fast.run1.jsonl` (72,045) |
| v2 Stage 1 run 2 | `data/annotations/v2-stage1/grok-4.1-fast.run2.jsonl` (72,045) |
| v2 Stage 1 run 3 | `data/annotations/v2-stage1/grok-4.1-fast.run3.jsonl` (72,045) |
| v2 Stage 1 consensus | `data/annotations/v2-stage1/consensus.jsonl` (72,045) |
| v2 Stage 1 judge | `data/annotations/v2-stage1/judge.jsonl` (212 tiebreakers) |
| Stage 1 distribution charts | `figures/stage1-*.png` (7 charts) |
| Stage 1 chart script | `scripts/plot-stage1-distributions.py` |
| Fine-tuning data loader | `python/src/finetune/data.py` |
| Dual-head model | `python/src/finetune/model.py` |
| Fine-tuning trainer | `python/src/finetune/train.py` |
| Fine-tune config | `python/configs/finetune/modernbert.yaml` |
| Ablation results | `checkpoints/finetune/ablation/ablation_results.json` |
| **Best model (final)** | `checkpoints/finetune/iter1-independent/final/` (cat=0.943, spec=0.945) |
| CORAL baseline (ablation winner) | `checkpoints/finetune/best-base_weighted_ce-ep5/final/` (cat=0.932, spec=0.517) |
| Spec improvement plan | `docs/SPECIFICITY-IMPROVEMENT-PLAN.md` |
| Best model iter1 config | `python/configs/finetune/iter1-independent.yaml` |
| Eval script | `python/src/finetune/eval.py` |
| Eval results (best model) | `results/eval/iter1-independent/metrics.json` |
| Eval results (CORAL) | `results/eval/coral-baseline/metrics.json` |
| Comparison figures | `results/eval/comparison/` (5 charts) |
| Per-model eval figures | `results/eval/iter1-independent/figures/` + `results/eval/coral-baseline/figures/` |
| Comparison figure script | `python/scripts/generate-comparison-figures.py` |

### v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)

| Category | Count | % |
|----------|-------|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|-------------|-------|---|
| L1 | 29,593 | 41.1% |
| L2 | 16,344 | 22.7% |
| L3 | 17,911 | 24.9% |
| L4 | 8,197 | 11.4% |

### v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)

| Category | Count | % |
|----------|-------|---|
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |

### GPT-5.4 Prompt Iteration (holdout)

| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1200) | v4.5 (full, 1200) |
|-------------|-------------------|----------------------|--------------------|--------------------|
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | — | — | 414 (34.5%) | 211 (17.6%) |

v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.