Project Status — v2 Pipeline
Deadline: 2026-04-24 | Started: 2026-04-03 | Updated: 2026-04-05 (Stage 1 complete, 72K×3 + judge)
Carried Forward (not re-done)
- 72,045 paragraphs (all annotated in v2), quality tiers, 6 surgical patches
- DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
- v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
- v2 codebook approved (5/6 group approval 2026-04-04)
Pipeline Steps
1. Codebook Finalization — DONE
- Draft v2 codebook (LABELING-CODEBOOK.md)
- Draft codebook ethos (CODEBOOK-ETHOS.md)
- Group approval (5/6, 2026-04-04)
2. Holdout Selection — DONE
- Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
- Stratified holdout: 185 per non-ID category, 90 ID = 1,200 exact
- Max 2 paragraphs per company per category stratum
- Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
- 1,042 companies represented, max 3 from any one company
- Output: data/gold/v2-holdout-ids.json, data/gold/v2-holdout-manifest.jsonl
- Script: scripts/sample-v2-holdout.py
- Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
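
The stratified draw with a per-company cap can be sketched roughly as follows. This is an illustration only, not the contents of scripts/sample-v2-holdout.py: the function name and the record keys `id`, `category`, and `company` are assumptions, and the specificity floors and the overall per-company limit are not modeled here.

```python
import random
from collections import defaultdict

# Quotas from the plan above: 185 for each non-ID category, 90 for ID.
QUOTAS = {"RMP": 185, "BG": 185, "MR": 185, "SI": 185, "N/O": 185, "TP": 185, "ID": 90}


def sample_holdout(paragraphs, quotas, max_per_company=2, seed=0):
    """Stratified sample with a per-company cap inside each category stratum.

    paragraphs: iterable of dicts with assumed keys 'id', 'category', 'company'.
    quotas: dict mapping category -> number of paragraphs to draw.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in paragraphs:
        by_category[p["category"]].append(p)

    selected = []
    for category, quota in quotas.items():
        pool = list(by_category[category])
        rng.shuffle(pool)
        per_company = defaultdict(int)
        picked = []
        for p in pool:
            if len(picked) == quota:
                break
            # Skip paragraphs from companies that already hit the cap.
            if per_company[p["company"]] < max_per_company:
                per_company[p["company"]] += 1
                picked.append(p)
        if len(picked) < quota:
            raise ValueError(f"stratum {category!r}: only {len(picked)} of {quota} available")
        selected.extend(picked)
    return selected
```

Fixing the seed keeps the draw reproducible, and raising on an under-filled stratum makes a too-tight company cap visible instead of silently shrinking the holdout.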
3. Prompt Iteration — DONE
- Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
- Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
- Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
- Category/specificity independence explicitly stated (presence check, not relevance judgment)
- Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
- VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
- Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
- Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
- v4.5 iteration: mechanical bridge (specific_facts → specificity level), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
- v4.4 results (200 paragraphs): L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
- Cost per 200: ~$1.20 (GPT-5.4)
- Prompt version: v4.5 (locked)
4. Full Holdout Validation — DONE
- Run GPT-5.4 on all 1,200 holdout paragraphs with v4.4 prompt ($5.70)
- Identified 34.5% medium-confidence specificity calls, concentrated at L1/L2 and L2/L3 boundaries
- Identified SI materiality assertions being false-promoted to L4 (negative assertions not verifiable)
- Identified specific_facts field not being stored to disk (toLabelOutput stripped it)
- Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
- Re-ran full 1,200 with v4.5 ($6.88)
- Verified bridge consistency: L1=all empty, L2+=all populated (100%)
- Verified SI L4 false positives eliminated (0 remaining)
- Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
- v4.5 results (1,200 paragraphs): L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
- Confidence: 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
- Category stability: 96.8% agreement between v4.4 and v4.5
- L2 at 14%: below the 15% target on holdout, but the holdout oversamples TP (14.4% vs 5% in corpus). On the full corpus (46% RMP, 5% TP), L2 should land at ~15-17%, since RMP L2 held up.
- Dev vs unseen stable: no prompt overfitting
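
The bridge-consistency verification above (L1 has no stored facts, L2 and higher always have some) could be checked with something like the sketch below. The keys `id`, `specificity`, and `specific_facts`, and the integer encoding 1 to 4 for levels, are assumptions about the stored records rather than the confirmed output schema.

```python
def bridge_violations(records):
    """Return ids of records where specificity and specific_facts disagree.

    Rule checked: level 1 must have no stored facts, and level 2 or higher
    must have at least one. Field names are assumed, not confirmed.
    """
    bad = []
    for r in records:
        has_facts = bool(r.get("specific_facts"))
        is_l1 = r["specificity"] == 1
        # Violation when L1 carries facts, or when L2+ carries none.
        if is_l1 == has_facts:
            bad.append(r["id"])
    return bad
```

An empty return value corresponds to the "100%" consistency reported above.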
5. Holdout Benchmark — DONE
- Run 10 models from 8 providers on 1,200 holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
- Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
- MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
- Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
- Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity)
- Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
- Benchmark cost: $45.47
- Top models: Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
- Stage 1 panel: Grok 4.1 Fast ×3 ($96 estimated)
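
As an illustration of the "both-match" figures above, the sketch below scores one model against a reference labeling. It assumes labels are held as a dict from paragraph id to a `(category, specificity)` tuple, and it scores only ids present on both sides, since not every run covers all 1,200 paragraphs. Which reference the benchmark actually compares against is not stated here, so `reference_labels` is a placeholder.

```python
def both_match_rate(model_labels, reference_labels):
    """Share of shared paragraph ids where category AND specificity both match.

    Both arguments map paragraph id -> (category, specificity).
    """
    shared = model_labels.keys() & reference_labels.keys()
    if not shared:
        raise ValueError("no overlapping paragraph ids")
    hits = sum(1 for pid in shared if model_labels[pid] == reference_labels[pid])
    return hits / len(shared)
```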
6. Stage 1 Re-Run — DONE
- Lock v2 prompt (v4.5)
- Model selection: Grok 4.1 Fast ×3 (self-consistency)
- Re-run Stage 1 on full corpus (72,045 paragraphs × 3 runs, concurrency 200)
- Cross-run agreement: category 94.9% unanimous, specificity 91.3% unanimous
- Consensus: 62,510 unanimous (86.8%), 9,323 majority (12.9%), 212 judge tiebreaker (0.3%)
- GPT-5.4 judge on 212 unresolved paragraphs — 100% agreed with a Grok label
- Distribution check: L2=22.7% (above 15% target), categories healthy
- Stage 1 cost: $129.75 (3 runs) + $5.76 (judge) = $135.51
- Run time: ~33 min per run at concurrency 200
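
The three-run consensus can be sketched as a per-dimension vote. This is a minimal illustration under the assumption that category and specificity are voted on separately and that a paragraph goes to the judge when either dimension has three different labels; the function names and status strings are hypothetical, not taken from the pipeline code.

```python
from collections import Counter


def vote(labels):
    """Return (winner, status) for the three labels of one dimension."""
    value, count = Counter(labels).most_common(1)[0]
    if count == 3:
        return value, "unanimous"
    if count == 2:
        return value, "majority"
    return None, "tie"  # all three runs disagree


def consensus(runs):
    """runs: three (category, specificity) tuples for one paragraph."""
    category, cat_status = vote([r[0] for r in runs])
    specificity, spec_status = vote([r[1] for r in runs])
    statuses = {cat_status, spec_status}
    if "tie" in statuses:
        status = "needs_judge"
    elif statuses == {"unanimous"}:
        status = "unanimous"
    else:
        status = "majority"
    return {"category": category, "specificity": specificity, "status": status}
```

For example, `consensus([("RMP", 1), ("RMP", 2), ("RMP", 1)])` keeps the unanimous category, takes the 2-of-3 specificity, and reports the paragraph as a majority case.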
7. Labelapp Update ← CURRENT
- Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
- Update warmup paragraphs with v2 explanations
- Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
- Update codebook reference page for v2
- DB migration to clear old 72k data (0002_v2-reset.sql)
- Seed script updated for 1,200 holdout paragraphs only
- Nuke admin account, joey is admin
- Quiz is one-time (at onboarding), warmup resets each login session
- Run migration + seed (la:db:migrate then la:seed)
- Generate new BIBD assignments (3 of 5 annotators per paragraph)
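
One simple way to get balanced 3-of-5 assignments is to cycle through all ten 3-annotator subsets, which form a complete (and therefore balanced) block design. The sketch below assumes that approach. It is not necessarily how the labelapp generates its assignments, and the annotator names are placeholders.

```python
from itertools import combinations, cycle


def assign_annotators(paragraph_ids, annotators, per_paragraph=3):
    """Map each paragraph id to a tuple of annotators.

    Cycling through every size-k subset keeps individual loads and pair
    co-occurrences equal whenever the number of paragraphs is a multiple
    of the number of subsets (10 for 3 of 5).
    """
    blocks = cycle(combinations(annotators, per_paragraph))
    return {pid: next(blocks) for pid in paragraph_ids}


# Placeholder annotator names, for illustration only.
example = assign_annotators(range(20), ["a1", "a2", "a3", "a4", "a5"])
```

Shuffling the paragraph ids before assignment would avoid tying a block to any ordering in the seed data.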
8. Parallel Labeling
- Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
- Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
- Estimated cost: ~$0 remaining (models done)
9. Gold Set Assembly
- Compute human IRR (category α > 0.75, specificity α > 0.67)
- Gold = majority vote; all-disagree → model consensus tiebreaker
- Cross-validate against model panel
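
For the category threshold, Krippendorff's α on nominal labels can be computed from the coincidence matrix, as in the sketch below. It covers only the nominal metric, so the specificity α would need an ordinal distance that is not shown here. The input layout (one list of labels per paragraph) is an assumption.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of lists, each holding the labels given to one paragraph."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        # Every ordered pair of distinct label positions, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one paragraph with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one label value was ever used; treated as full agreement here
    return 1 - observed / expected
```

Paragraphs with fewer than two labels are skipped, so partially labeled data does not break the computation.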
10. Stage 2
- GPT-5.4 judge resolved 212 tiebreaker paragraphs during Stage 1 consensus ($5.76)
- Bench Stage 2 accuracy against gold (if needed for additional disputed paragraphs)
- Cost so far: $5.76 | Remaining budget: ~$39
11. Training Data Assembly
- Unanimous Stage 1 → full weight, calibrated majority → full weight
- Quality tier weights: clean/headed/minor 1.0, degraded 0.5
- Exclude 72 truncated filings
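
The weighting rules above reduce to a small lookup, sketched here. The keys `filing_id` and `quality_tier` are assumed names, and the set of truncated filing ids would come from wherever the 72 truncated filings are recorded. Because unanimous and majority labels both carry full weight, no consensus factor appears.

```python
# Tier weights from the plan above.
TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}


def training_weight(record, truncated_filings):
    """Return the sample weight, or None when the paragraph is excluded.

    record: dict with assumed keys 'filing_id' and 'quality_tier'.
    truncated_filings: set of filing ids to drop entirely.
    """
    if record["filing_id"] in truncated_filings:
        return None
    return TIER_WEIGHTS[record["quality_tier"]]
```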
12. Fine-Tuning
- Ablation: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
- Dual-head: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- CORAL for ordinal specificity
- Estimated time: 12-20h GPU
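
For the specificity head, CORAL turns the 4-level ordinal target into three binary "is the level above this threshold" tasks that share one score and differ only in their bias. The sketch below shows the target encoding, loss, and decoding in plain Python. It illustrates the technique and is not the training code; in practice the shared score would come from the backbone and the biases would be learned.

```python
import math

NUM_LEVELS = 4  # L1..L4


def coral_targets(level):
    """Encode level (1..4) as K-1 binary targets: [level > 1, level > 2, level > 3]."""
    return [1.0 if level > k else 0.0 for k in range(1, NUM_LEVELS)]


def coral_logits(shared_score, biases):
    """One shared score plus a separate bias per threshold."""
    return [shared_score + b for b in biases]


def coral_loss(logits, level):
    """Sum of binary cross-entropies over the K-1 thresholds (stable logit form)."""
    total = 0.0
    for z, t in zip(logits, coral_targets(level)):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total


def coral_predict(logits):
    """Level = 1 + number of thresholds passed; sigmoid(z) > 0.5 is the same as z > 0."""
    return 1 + sum(1 for z in logits if z > 0)
```

For instance, `coral_predict(coral_logits(0.5, [1.0, 0.0, -1.0]))` passes the first two thresholds and returns level 3.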
13. Evaluation & Paper
- Macro F1 on holdout (target > 0.80 both heads)
- Per-class F1 breakdown + GenAI benchmark table
- Error analysis, cost comparison, IGNITE slides
- Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
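
Macro F1, the target metric for both heads, averages per-class F1 without weighting by class size, so rare classes such as ID count as much as RMP. A dependency-free sketch follows; the function name is illustrative, and a library implementation would work equally well.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never appears in truth or prediction scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```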
Rubric Checklist
- C (F1 > .80): Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks
- B (3+ of 4): [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case
- A (3+ of 4): [x] Error analysis, [x] Mitigation strategy, [ ] Additional baselines (keyword/dictionary), [x] Comparison to amateur labels
Key Data
| What | Where |
|---|---|
| v2 codebook | docs/LABELING-CODEBOOK.md |
| v2 ethos | docs/CODEBOOK-ETHOS.md |
| Paragraphs (patched) | data/paragraphs/paragraphs-clean.patched.jsonl (72,045) |
| v1 Stage 1 annotations | data/annotations/stage1.patched.jsonl (150,009) |
| v2 holdout IDs | data/gold/v2-holdout-ids.json (1,200) |
| v2 holdout manifest | data/gold/v2-holdout-manifest.jsonl |
| v1 holdout IDs | labelapp/.sampled-ids.original.json |
| v1 gold labels | data/gold/gold-adjudicated.jsonl |
| v2 holdout benchmark | data/annotations/v2-bench/ (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | data/annotations/v2-bench/gpt-5.4.jsonl (v4.5, 1,200 paragraphs) |
| v2 iteration archive | data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl |
| v4.5 boundary test | data/annotations/v2-bench/v45-test/gpt-5.4.jsonl (50 paragraphs) |
| Opus prompt-only | data/annotations/v2-bench/opus-4.6.jsonl (1,184 paragraphs) |
| Opus +codebook | data/annotations/golden/opus.jsonl (includes v1 + v2 runs) |
| Grok self-consistency test | data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl (47 paragraphs) |
| Benchmark analysis | scripts/analyze-v2-bench.py |
| Stage 1 prompt | ts/src/label/prompts.ts (v4.5) |
| Holdout sampling script | scripts/sample-v2-holdout.py |
| v2 Stage 1 run 1 | data/annotations/v2-stage1/grok-4.1-fast.run1.jsonl (72,045) |
| v2 Stage 1 run 2 | data/annotations/v2-stage1/grok-4.1-fast.run2.jsonl (72,045) |
| v2 Stage 1 run 3 | data/annotations/v2-stage1/grok-4.1-fast.run3.jsonl (72,045) |
| v2 Stage 1 consensus | data/annotations/v2-stage1/consensus.jsonl (72,045) |
| v2 Stage 1 judge | data/annotations/v2-stage1/judge.jsonl (212 tiebreakers) |
| Stage 1 distribution charts | figures/stage1-*.png (7 charts) |
| Stage 1 chart script | scripts/plot-stage1-distributions.py |
v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)
| Category | Count | % |
|---|---|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|---|---|---|
| L1 | 29,593 | 41.1% |
| L2 | 16,344 | 22.7% |
| L3 | 17,911 | 24.9% |
| L4 | 8,197 | 11.4% |
v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)
| Category | Count | % |
|---|---|---|
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |
GPT-5.4 Prompt Iteration (holdout)
| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1200) | v4.5 (full, 1200) |
|---|---|---|---|---|
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | — | — | 414 (34.5%) | 211 (17.6%) |
v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.