Project Status — v2 Pipeline

Deadline: 2026-04-24 | Started: 2026-04-03 | Updated: 2026-04-05 (holdout benchmark done, Grok ×3 selected)


Carried Forward (not re-done)

  • 72,045 paragraphs (49,795 annotated), quality tiers, 6 surgical patches
  • DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
  • v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
  • v2 codebook approved (5/6 group approval 2026-04-04)

Pipeline Steps

1. Codebook Finalization — DONE

  • Draft v2 codebook (LABELING-CODEBOOK.md)
  • Draft codebook ethos (CODEBOOK-ETHOS.md)
  • Group approval (5/6, 2026-04-04)

2. Holdout Selection — DONE

  • Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
  • Stratified holdout: 185 per non-ID category + 90 ID = 1,200 exact (sampling sketch after this list)
  • Max 2 paragraphs per company per category stratum
  • Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
  • 1,042 companies represented, max 3 from any one company
  • Output: data/gold/v2-holdout-ids.json, data/gold/v2-holdout-manifest.jsonl
  • Script: scripts/sample-v2-holdout.py
  • Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
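
A minimal sketch of the stratified draw described above (the real logic lives in scripts/sample-v2-holdout.py): greedy per-category sampling under the per-company caps, with field names ("id", "company", "category") assumed for illustration. The specificity floors were verified after the draw rather than enforced in the sampler.

```python
import random
from collections import Counter

QUOTAS = {"RMP": 185, "MR": 185, "BG": 185, "SI": 185, "N/O": 185, "TP": 185, "ID": 90}
PER_COMPANY_PER_STRATUM = 2   # max 2 paragraphs per company within a category stratum
PER_COMPANY_TOTAL = 3         # max 3 paragraphs from any one company overall

def sample_holdout(paragraphs, seed=0):
    """Greedy stratified draw; `paragraphs` is a list of dicts with assumed
    keys "id", "company", "category"."""
    rng = random.Random(seed)
    per_stratum = Counter()   # (category, company) -> count
    per_company = Counter()   # company -> count
    chosen = []
    for cat, quota in QUOTAS.items():
        pool = [p for p in paragraphs if p["category"] == cat]
        rng.shuffle(pool)
        taken = 0
        for p in pool:
            if taken >= quota:
                break
            comp = p["company"]
            if per_stratum[(cat, comp)] >= PER_COMPANY_PER_STRATUM:
                continue
            if per_company[comp] >= PER_COMPANY_TOTAL:
                continue
            chosen.append(p)
            per_stratum[(cat, comp)] += 1
            per_company[comp] += 1
            taken += 1
    return chosen   # 6 * 185 + 90 = 1,200 when every quota can be filled
```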

3. Prompt Iteration — DONE

  • Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
  • Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
  • Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
  • Category/specificity independence explicitly stated (presence check, not relevance judgment)
  • Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
  • VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
  • Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
  • Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
  • v4.5 iteration: mechanical bridge (specific_facts drives the specificity level; sketch after this list), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
  • v4.4 results (200 paragraphs): L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
  • Cost per 200: ~$1.20 (GPT-5.4)
  • Prompt version: v4.5 (locked)
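
One plausible reading of the v4.5 mechanical bridge, written as post-hoc logic: the specificity level is derived from the extracted specific_facts rather than judged independently. Field names and the exact promotion rules below are illustrative; the authoritative wording is the v4.5 prompt in ts/src/label/prompts.ts.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    text: str
    company_specific: bool        # "unique to THIS company" (L3 test)
    externally_verifiable: bool   # could an outsider check it? (L4 test)

def bridge_specificity(facts: list[Fact]) -> str:
    """Derive the specificity level from extracted facts. This captures the
    L1/L2+ bridge plus the named L3/L4 tests; the prompt governs edge cases."""
    if not facts:
        return "L1"   # no specific facts extracted -> generic boilerplate
    if any(f.company_specific and f.externally_verifiable for f in facts):
        return "L4"   # verifiable, company-specific detail
    if any(f.company_specific for f in facts):
        return "L3"   # unique to THIS company
    return "L2"       # substance beyond boilerplate (ERM test)
```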

4. Full Holdout Validation — DONE

  • Ran GPT-5.4 on all 1,200 holdout paragraphs with the v4.4 prompt ($5.70)
  • Identified that 34.5% of specificity calls were medium-confidence, concentrated at the L1/L2 and L2/L3 boundaries
  • Identified SI materiality assertions being falsely promoted to L4 (negative assertions are not externally verifiable)
  • Identified that the specific_facts field was not being stored to disk (toLabelOutput stripped it)
  • Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
  • Re-ran full 1,200 with v4.5 ($6.88)
  • Verified bridge consistency: L1=all empty, L2+=all populated (100%; check sketched after this list)
  • Verified SI L4 false positives eliminated (0 remaining)
  • Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
  • v4.5 results (1,200 paragraphs): L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
  • Confidence: 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
  • Category stability: 96.8% agreement between v4.4 and v4.5
  • L2 at 14%: below 15% target on holdout, but holdout oversamples TP (14.4% vs 5% in corpus). On full corpus (46% RMP, 5% TP), L2 should be ~15-17% since RMP L2 held up.
  • Dev vs unseen stable: no prompt overfitting
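
A sketch of the consistency checks behind the verification bullets above; the JSONL field names ("id", "specificity", "specific_facts", "confidence") are assumptions about the annotation schema, not confirmed from the files.

```python
import json
from collections import Counter

def check_run(path):
    """Count specificity and confidence levels and flag bridge violations."""
    spec_counts, conf_counts, violations = Counter(), Counter(), []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            spec = rec["specificity"]
            facts = rec.get("specific_facts", [])
            spec_counts[spec] += 1
            conf_counts[rec.get("confidence", "unknown")] += 1
            if spec == "L1" and facts:
                violations.append((rec["id"], "L1 with facts"))
            if spec != "L1" and not facts:
                violations.append((rec["id"], f"{spec} with no facts"))
    return spec_counts, conf_counts, violations   # v4.5 run: zero violations
```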

5. Holdout Benchmark — DONE

  • Ran 10 models from 8 providers on the 1,200-paragraph holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
  • Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
  • MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
  • Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
  • Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity; measurement sketched after this list)
  • Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
  • Benchmark cost: $45.47
  • Top models: Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
  • Stage 1 panel: Grok 4.1 Fast ×3 ($96 estimated)
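
A sketch of the repeat-run comparison behind the 8.5% divergence figure: load two runs of the same model over the same paragraphs and count how often the specificity call changes. Field names are assumed.

```python
import json

def load_run(path):
    with open(path) as f:
        return {r["id"]: r for r in map(json.loads, f)}

def specificity_divergence(path_a, path_b):
    """Fraction of shared paragraphs whose specificity call differs between
    two runs of the same model and prompt."""
    a, b = load_run(path_a), load_run(path_b)
    shared = a.keys() & b.keys()
    changed = sum(1 for pid in shared if a[pid]["specificity"] != b[pid]["specificity"])
    return changed / len(shared) if shared else 0.0
```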

6. Stage 1 Re-Run ← CURRENT

  • Lock v2 prompt (v4.5)
  • Model selection: Grok 4.1 Fast ×3 (self-consistency)
  • Re-run Stage 1 on full corpus (~50K paragraphs × 3 runs)
  • Distribution check: L2 ~15-17%, categories healthy (aggregation and check sketched after this list)
  • Estimated cost: ~$96
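
A sketch of the planned ×3 aggregation and distribution check: per-paragraph majority vote over the three Grok runs, with three-way splits set aside as disputed. File layout and field names are assumptions.

```python
import json
from collections import Counter

def majority(labels):
    (top, n), = Counter(labels).most_common(1)
    return top if n >= 2 else None   # None -> no 2-of-3 agreement

def aggregate(run_paths):
    """run_paths: three JSONL files, one per independent Grok run."""
    runs = []
    for path in run_paths:
        with open(path) as f:
            runs.append({r["id"]: r for r in map(json.loads, f)})
    shared = set(runs[0]) & set(runs[1]) & set(runs[2])
    consensus, disputed = {}, []
    for pid in shared:
        cat = majority([run[pid]["category"] for run in runs])
        spec = majority([run[pid]["specificity"] for run in runs])
        if cat is None or spec is None:
            disputed.append(pid)
        else:
            consensus[pid] = {"category": cat, "specificity": spec}
    return consensus, disputed

def l2_share(consensus):
    """Distribution check: L2 share should land around 15-17% on the corpus."""
    spec = Counter(v["specificity"] for v in consensus.values())
    return spec["L2"] / sum(spec.values())
```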

7. Labelapp Update

  • Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
  • Update warmup paragraphs with v2 explanations
  • Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
  • Update codebook reference page for v2
  • DB migration to clear old 72k data (0002_v2-reset.sql)
  • Seed script updated for 1,200 holdout paragraphs only
  • Remove the old admin account; joey is the admin
  • Quiz is one-time (at onboarding), warmup resets each login session
  • Run migration + seed (la:db:migrate then la:seed)
  • Generate new BIBD assignments (3 of 5 annotators per paragraph)
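
A sketch of one way to generate the BIBD-style assignments: cycling through all C(5,3) = 10 annotator triples keeps per-annotator load and pairwise overlap roughly balanced. Annotator names are placeholders, and the labelapp's actual generator may differ.

```python
from itertools import combinations, cycle

ANNOTATORS = ["a1", "a2", "a3", "a4", "a5"]   # placeholder names

def assign(paragraph_ids, annotators=ANNOTATORS, k=3):
    """Give every paragraph k of the annotators. Cycling through all C(n, k)
    combinations keeps per-annotator load and pairwise overlap roughly even;
    the balance is exact only when the paragraph count divides evenly."""
    blocks = cycle(list(combinations(annotators, k)))
    return {pid: next(blocks) for pid in paragraph_ids}
```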

8. Parallel Labeling

  • Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
  • Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
  • Estimated cost: ~$0 remaining (models done)

9. Gold Set Assembly

  • Compute human IRR (category α > 0.75, specificity α > 0.67)
  • Gold = majority vote; if all annotators disagree → model-consensus tiebreaker (sketch after this list)
  • Cross-validate against model panel
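
A sketch of the IRR computation and the gold-assembly rule, assuming the krippendorff package and integer-coded labels; whether specificity is scored as nominal or ordinal for α is a choice this sketch leaves open.

```python
import krippendorff   # pip install krippendorff
from collections import Counter

def irr_alpha(reliability_matrix, level="nominal"):
    """reliability_matrix: one row per annotator, one column per paragraph,
    labels coded as numbers, float("nan") where an annotator skipped a unit.
    Targets: category alpha > 0.75, specificity alpha > 0.67."""
    return krippendorff.alpha(reliability_data=reliability_matrix,
                              level_of_measurement=level)

def gold_label(human_votes, model_consensus):
    """Majority vote across the human labels for one paragraph; if all of
    them disagree, fall back to the model-panel consensus as tiebreaker."""
    (top, n), = Counter(human_votes).most_common(1)
    return top if n >= 2 else model_consensus
```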

10. Stage 2 (if needed)

  • Bench Stage 2 accuracy against gold
  • If adds value → run on disputed Stage 1 paragraphs
  • Estimated cost: ~$20-40 if run

11. Training Data Assembly

  • Unanimous Stage 1 → full weight, calibrated majority → full weight (weight assembly sketched after this list)
  • Quality tier weights: clean/headed/minor 1.0, degraded 0.5
  • Exclude 72 truncated filings
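
A sketch of the per-example weight assembly implied by the rules above; the record fields ("truncated_filing", "vote", "quality_tier") are assumed names, and the treatment of non-majority votes is left open pending step 10.

```python
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def example_weight(rec):
    """Assumed fields: "truncated_filing", "vote", "quality_tier". Votes other
    than unanimous / calibrated-majority get zero here; their handling depends
    on whether Stage 2 runs (step 10)."""
    if rec.get("truncated_filing"):
        return 0.0   # the 72 truncated filings are excluded entirely
    vote_w = 1.0 if rec.get("vote") in ("unanimous", "calibrated_majority") else 0.0
    return vote_w * TIER_WEIGHT.get(rec.get("quality_tier"), 0.0)
```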

12. Fine-Tuning

  • Ablation: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
  • Dual-head: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal); architecture sketched after this list
  • CORAL for ordinal specificity
  • Estimated time: 12-20h GPU
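
A minimal sketch of the dual-head architecture with a CORAL specificity head, assuming a PyTorch/transformers setup; the backbone id is the public ModernBERT base as a stand-in for the DAPT/TAPT checkpoints, and the class-weighting and focal-loss ablation arms are not shown.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualHeadClassifier(nn.Module):
    """Shared encoder, a 7-way category head, and a CORAL ordinal head for
    the four specificity levels (illustrative sketch, not the final code)."""

    def __init__(self, backbone="answerdotai/ModernBERT-base",
                 n_categories=7, n_levels=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.category_head = nn.Linear(hidden, n_categories)
        # CORAL: one shared weight vector plus n_levels - 1 ordered biases
        self.coral_weight = nn.Linear(hidden, 1, bias=False)
        self.coral_bias = nn.Parameter(torch.zeros(n_levels - 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                         # [CLS]-style pooling
        cat_logits = self.category_head(cls)                      # (batch, 7)
        coral_logits = self.coral_weight(cls) + self.coral_bias   # (batch, 3)
        return cat_logits, coral_logits

def coral_loss(coral_logits, levels):
    """levels in {0,1,2,3} for L1..L4; threshold k's target is 1 if level > k."""
    k = torch.arange(coral_logits.size(1), device=levels.device)
    targets = (levels.unsqueeze(1) > k).float()
    return nn.functional.binary_cross_entropy_with_logits(coral_logits, targets)

# Predicted level index = number of thresholds passed:
# (torch.sigmoid(coral_logits) > 0.5).sum(dim=1) -> 0..3 -> L1..L4
```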

13. Evaluation & Paper

  • Macro F1 on holdout (target > 0.80 for both heads; evaluation sketch after this list)
  • Per-class F1 breakdown + GenAI benchmark table
  • Error analysis, cost comparison, IGNITE slides
  • Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
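
A sketch of the headline evaluation: macro F1 per head over the gold holdout, with scikit-learn's classification_report supplying the per-class breakdown for the paper table.

```python
from sklearn.metrics import classification_report, f1_score

def evaluate(gold_cats, pred_cats, gold_levels, pred_levels):
    """Macro F1 per head (target > 0.80 for each) plus per-class breakdowns."""
    print(classification_report(gold_cats, pred_cats))
    print(classification_report(gold_levels, pred_levels))
    return {
        "category_macro_f1": f1_score(gold_cats, pred_cats, average="macro"),
        "specificity_macro_f1": f1_score(gold_levels, pred_levels, average="macro"),
    }
```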

Rubric Checklist

C (F1 > .80): Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks
B (3+ of 4): [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case
A (3+ of 4): [x] Error analysis, [x] Mitigation strategy, [ ] Additional baselines (keyword/dictionary), [x] Comparison to amateur labels


Key Data

| What | Where |
| --- | --- |
| v2 codebook | docs/LABELING-CODEBOOK.md |
| v2 ethos | docs/CODEBOOK-ETHOS.md |
| Paragraphs (patched) | data/paragraphs/paragraphs-clean.patched.jsonl (72,045) |
| v1 Stage 1 annotations | data/annotations/stage1.patched.jsonl (150,009) |
| v2 holdout IDs | data/gold/v2-holdout-ids.json (1,200) |
| v2 holdout manifest | data/gold/v2-holdout-manifest.jsonl |
| v1 holdout IDs | labelapp/.sampled-ids.original.json |
| v1 gold labels | data/gold/gold-adjudicated.jsonl |
| v2 holdout benchmark | data/annotations/v2-bench/ (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | data/annotations/v2-bench/gpt-5.4.jsonl (v4.5, 1,200 paragraphs) |
| v2 iteration archive | data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl |
| v4.5 boundary test | data/annotations/v2-bench/v45-test/gpt-5.4.jsonl (50 paragraphs) |
| Opus prompt-only | data/annotations/v2-bench/opus-4.6.jsonl (1,184 paragraphs) |
| Opus +codebook | data/annotations/golden/opus.jsonl (includes v1 + v2 runs) |
| Grok self-consistency test | data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl (47 paragraphs) |
| Benchmark analysis | scripts/analyze-v2-bench.py |
| Stage 1 prompt | ts/src/label/prompts.ts (v4.5) |
| Holdout sampling script | scripts/sample-v2-holdout.py |

v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)

| Category | Count | % |
| --- | ---: | ---: |
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |

GPT-5.4 Prompt Iteration (holdout)

| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1,200) | v4.5 (full, 1,200) |
| --- | --- | --- | --- | --- |
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | | | 414 (34.5%) | 211 (17.6%) |

v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.