SEC-cyBERT/docs/STATUS.md
2026-04-05 01:30:39 -04:00

# Project Status — v2 Pipeline
**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Stage 1 complete, 72K×3 + judge)

---
## Carried Forward (not re-done)
- 72,045 paragraphs (all annotated in v2), quality tiers, 6 surgical patches
- DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
- v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
- v2 codebook approved (5/6 group approval 2026-04-04)
---
## Pipeline Steps
### 1. Codebook Finalization — DONE
- [x] Draft v2 codebook (LABELING-CODEBOOK.md)
- [x] Draft codebook ethos (CODEBOOK-ETHOS.md)
- [x] Group approval (5/6, 2026-04-04)
### 2. Holdout Selection — DONE
- [x] Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
- [x] Stratified holdout: 185 per non-ID category, 90 ID = 1,200 exact
- [x] Max 2 paragraphs per company per category stratum
- [x] Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
- [x] 1,042 companies represented, max 3 from any one company
- [x] Output: `data/gold/v2-holdout-ids.json`, `data/gold/v2-holdout-manifest.jsonl`
- [x] Script: `scripts/sample-v2-holdout.py`
- Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
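
The sampling constraints above (fixed per-category quotas plus a per-company cap inside each stratum) can be sketched as follows. This is a minimal illustration, not the contents of `scripts/sample-v2-holdout.py`: the record keys `id`, `category`, and `company` and the function name `sample_holdout` are assumptions, and the specificity floors are not enforced here.

```python
import random
from collections import defaultdict

# Quotas from the checklist: 185 per non-ID category, 90 for ID.
QUOTAS = {c: 185 for c in ("RMP", "BG", "MR", "SI", "N/O", "TP")}
QUOTAS["ID"] = 90


def sample_holdout(paragraphs, quotas, per_company_cap=2, seed=0):
    """Pick paragraph ids per category, capping each company within a stratum.

    `paragraphs` is a list of dicts with hypothetical keys
    "id", "category", and "company".
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in paragraphs:
        by_category[p["category"]].append(p)

    selected = []
    for category, quota in quotas.items():
        pool = list(by_category[category])
        rng.shuffle(pool)
        per_company = defaultdict(int)
        picked = 0
        for p in pool:
            if picked == quota:
                break
            if per_company[p["company"]] >= per_company_cap:
                continue  # company already at its cap in this stratum
            per_company[p["company"]] += 1
            selected.append(p["id"])
            picked += 1
    return selected
```

If a stratum's pool cannot fill its quota under the cap, this sketch simply returns fewer ids for that category.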
### 3. Prompt Iteration — DONE
- [x] Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
- [x] Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
- [x] Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
- [x] Category/specificity independence explicitly stated (presence check, not relevance judgment)
- [x] Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
- [x] VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
- [x] Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
- [x] Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
- [x] v4.5 iteration: mechanical bridge (specific_facts → specificity level), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
- **v4.4 results (200 paragraphs):** L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
- **Cost per 200:** ~$1.20 (GPT-5.4)
- **Prompt version:** v4.5 (locked)
### 4. Full Holdout Validation — DONE
- [x] Run GPT-5.4 on all 1,200 holdout paragraphs with v4.4 prompt ($5.70)
- [x] Identified 34.5% medium-confidence specificity calls, concentrated at L1/L2 and L2/L3 boundaries
- [x] Identified SI materiality assertions being false-promoted to L4 (negative assertions not verifiable)
- [x] Identified specific_facts field not being stored to disk (toLabelOutput stripped it)
- [x] Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
- [x] Re-ran full 1,200 with v4.5 ($6.88)
- [x] Verified bridge consistency: L1=all empty, L2+=all populated (100%)
- [x] Verified SI L4 false positives eliminated (0 remaining)
- [x] Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
- **v4.5 results (1,200 paragraphs):** L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
- **Confidence:** 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
- **Category stability:** 96.8% agreement between v4.4 and v4.5
- **L2 at 14%:** below 15% target on holdout, but holdout oversamples TP (14.4% vs 5% in corpus). On full corpus (46% RMP, 5% TP), L2 should be ~15-17% since RMP L2 held up.
- **Dev vs unseen stable:** no prompt overfitting
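
The bridge-consistency check above (L1 rows carry no `specific_facts`, L2+ rows always carry some) could be verified along these lines. This is a sketch under assumptions: the keys `id` and `specificity` (an integer 1 to 4) and the function name `bridge_violations` are hypothetical; only `specific_facts` is named in this document.

```python
def bridge_violations(labels):
    """Return ids whose specific_facts contradict the specificity level.

    Expected: level 1 has an empty fact list, levels 2-4 have a non-empty one.
    """
    bad = []
    for row in labels:
        has_facts = bool(row.get("specific_facts"))
        is_level_one = row["specificity"] == 1
        # A violation is L1 with facts, or L2+ without facts.
        if is_level_one == has_facts:
            bad.append(row["id"])
    return bad
```

An empty return value on the 1,200 holdout rows would correspond to the 100% consistency reported above.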
### 5. Holdout Benchmark — DONE
- [x] Run 10 models from 8 providers on 1,200 holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
- [x] Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
- [x] MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
- [x] Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
- [x] Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity)
- [x] Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
- **Benchmark cost:** $45.47
- **Top models:** Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
- **Stage 1 panel:** Grok 4.1 Fast ×3 ($96 estimated; the actual Stage 1 cost was $129.75 for 3 runs)
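
For reference, a weighted kappa (the κw figure quoted for MIMO Flash) over the four ordinal specificity levels can be computed as sketched below. The weighting scheme used in `scripts/analyze-v2-bench.py` is not stated here, so the `power` parameter (1 for linear, 2 for quadratic weights) and the function name are assumptions.

```python
def weighted_kappa(a, b, n_levels=4, power=2):
    """Cohen's weighted kappa for two lists of ordinal labels in 1..n_levels."""
    n = len(a)
    # Observed joint distribution of the two raters' labels.
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        obs[x - 1][y - 1] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_levels)]
    col = [sum(obs[i][j] for i in range(n_levels)) for j in range(n_levels)]

    observed = expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # Disagreement weight grows with the distance between levels.
            w = (abs(i - j) / (n_levels - 1)) ** power
            observed += w * obs[i][j]
            expected += w * row[i] * col[j]
    if expected == 0.0:
        return 1.0  # both raters used a single identical level
    return 1.0 - observed / expected
```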
### 6. Stage 1 Re-Run — DONE
- [x] Lock v2 prompt (v4.5)
- [x] Model selection: Grok 4.1 Fast ×3 (self-consistency)
- [x] Re-run Stage 1 on full corpus (72,045 paragraphs × 3 runs, concurrency 200)
- [x] Cross-run agreement: category 94.9% unanimous, specificity 91.3% unanimous
- [x] Consensus: 62,510 unanimous (86.8%), 9,323 majority (12.9%), 212 judge tiebreaker (0.3%)
- [x] GPT-5.4 judge on 212 unresolved paragraphs — 100% agreed with a Grok label
- [x] Distribution check: L2=22.7% (above 15% target), categories healthy
- **Stage 1 cost:** $129.75 (3 runs) + $5.76 (judge) = $135.51
- **Run time:** ~33 min per run at concurrency 200
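
The three-way consensus above (unanimous, majority, judge tiebreaker) can be sketched as a per-dimension vote. Whether the actual pipeline votes on category and specificity separately or on the pair is not stated here, so the per-dimension rule, the record keys, and the function names below are assumptions.

```python
from collections import Counter


def vote(values):
    """Return (winner, count); winner is None when no value has 2+ votes."""
    value, count = Counter(values).most_common(1)[0]
    return (value, count) if count >= 2 else (None, count)


def consensus(runs):
    """Combine three run outputs for one paragraph.

    `runs` is a list of three dicts with hypothetical keys
    "category" and "specificity".
    """
    category, cat_votes = vote([r["category"] for r in runs])
    specificity, spec_votes = vote([r["specificity"] for r in runs])
    if category is None or specificity is None:
        status = "judge"  # a three-way split needs the judge model
    elif cat_votes == 3 and spec_votes == 3:
        status = "unanimous"
    else:
        status = "majority"
    return {"category": category, "specificity": specificity, "status": status}
```

As a sanity check on the reported counts, 62,510 + 9,323 + 212 = 72,045, which matches the corpus size.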
### 7. Labelapp Update ← CURRENT
- [x] Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
- [x] Update warmup paragraphs with v2 explanations
- [x] Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
- [x] Update codebook reference page for v2
- [x] DB migration to clear old 72k data (0002_v2-reset.sql)
- [x] Seed script updated for 1,200 holdout paragraphs only
- [x] Remove the old admin account; joey is now the admin
- [x] Quiz is one-time (at onboarding), warmup resets each login session
- [ ] Run migration + seed (`la:db:migrate` then `la:seed`)
- [ ] Generate new BIBD assignments (3 of 5 annotators per paragraph)
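
One simple balanced design for the pending assignment step is to cycle through every 3-annotator subset, which gives each annotator and each annotator pair the same load. This is a sketch, not the labelapp's generator; the function name and annotator identifiers are hypothetical. The per-annotator load works out to 3 × 1,200 / k, which is 720 with five annotators and 600 with six.

```python
from collections import Counter
from itertools import combinations, cycle


def assign(paragraph_ids, annotators, per_paragraph=3):
    """Map each paragraph id to a tuple of annotators, cycling over all subsets."""
    subsets = cycle(combinations(annotators, per_paragraph))
    return {pid: next(subsets) for pid in paragraph_ids}


def annotator_load(assignment):
    """Count how many paragraphs each annotator receives."""
    return Counter(a for group in assignment.values() for a in group)
```

Exact balance requires the paragraph count to be a multiple of the number of subsets (10 for five annotators, 20 for six); 1,200 satisfies both.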
### 8. Parallel Labeling
- [ ] Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
- [x] Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
- **Estimated cost:** ~$0 remaining (models done)
### 9. Gold Set Assembly
- [ ] Compute human IRR (category α > 0.75, specificity α > 0.67)
- [ ] Gold = majority vote; all-disagree → model consensus tiebreaker
- [ ] Cross-validate against model panel
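
The gold rule above (majority vote, with the model consensus as the tiebreaker when all annotators disagree) is small enough to sketch directly. The function name and the returned source tags are hypothetical.

```python
from collections import Counter


def gold_label(human_labels, model_consensus):
    """Return (label, source) for one paragraph and one dimension.

    `human_labels` holds the annotators' labels for the paragraph;
    `model_consensus` is the label agreed on by the model panel.
    """
    label, count = Counter(human_labels).most_common(1)[0]
    if count >= 2:
        return label, "human_majority"
    # All annotators disagree: fall back to the model panel.
    return model_consensus, "model_tiebreak"
```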
### 10. Stage 2
- [x] GPT-5.4 judge resolved 212 tiebreaker paragraphs during Stage 1 consensus ($5.76)
- [ ] Bench Stage 2 accuracy against gold (if needed for additional disputed paragraphs)
- **Cost so far:** $5.76 | **Remaining budget:** ~$39
### 11. Training Data Assembly
- [ ] Unanimous Stage 1 → full weight, calibrated majority → full weight
- [ ] Quality tier weights: clean/headed/minor 1.0, degraded 0.5
- [ ] Exclude 72 truncated filings
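
The weighting plan above can be expressed as a per-row sample weight. This is a sketch under assumptions: the keys `filing_id`, `status`, and `quality_tier` are hypothetical, and the weight for judge-resolved rows is not specified in this document, so it is exposed as a placeholder parameter.

```python
# Quality-tier weights from the plan: clean/headed/minor 1.0, degraded 0.5.
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}
# Unanimous and calibrated-majority labels both get full weight.
CONSENSUS_WEIGHT = {"unanimous": 1.0, "majority": 1.0}


def sample_weight(row, truncated_filings, judge_weight=1.0):
    """Return the training weight for a row, or None if it is excluded.

    `judge_weight` is a placeholder for judge-resolved rows (assumption).
    """
    if row["filing_id"] in truncated_filings:
        return None  # paragraphs from truncated filings are dropped
    consensus = CONSENSUS_WEIGHT.get(row["status"], judge_weight)
    return consensus * TIER_WEIGHT[row["quality_tier"]]
```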
### 12. Fine-Tuning
- [ ] Ablation: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
- [ ] Dual-head: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- [ ] CORAL for ordinal specificity
- **Estimated time:** 12-20h GPU
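
For the ordinal specificity head, CORAL replaces the 4-way softmax with three binary "is the level greater than k?" tasks that share one score and differ only in their bias. The sketch below shows just the target encoding and decoding in plain Python; the function names are illustrative, and the real head would sit on the ModernBERT backbone.

```python
import math


def coral_targets(level, n_levels=4):
    """Encode level (1..n_levels) as n_levels-1 binary targets: level > k."""
    return [1 if level > k else 0 for k in range(1, n_levels)]


def coral_logits(shared_score, biases):
    """One shared score plus a separate bias per threshold."""
    return [shared_score + b for b in biases]


def coral_predict(logits):
    """Predicted level = 1 + number of thresholds passed with probability > 0.5."""
    return 1 + sum(1 for z in logits if 1.0 / (1.0 + math.exp(-z)) > 0.5)
```

For example, `coral_targets(3)` yields `[1, 1, 0]`, and logits of `[1.5, 0.5, -0.5]` decode back to level 3.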
### 13. Evaluation & Paper
- [ ] Macro F1 on holdout (target > 0.80 both heads)
- [ ] Per-class F1 breakdown + GenAI benchmark table
- [ ] Error analysis, cost comparison, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
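
Macro F1, the headline metric above, is the unweighted mean of per-class F1, so rare classes such as ID count as much as RMP. A dependency-free sketch follows; the function name is illustrative, and a library implementation could be used instead.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over the given classes."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```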
---
## Rubric Checklist
**C (F1 > .80):** Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks

**B (3+ of 4):** [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case

**A (3+ of 4):** [x] Error analysis, [x] Mitigation strategy, [ ] Additional baselines (keyword/dictionary), [x] Comparison to amateur labels

---
## Key Data
| What | Where |
|------|-------|
| v2 codebook | `docs/LABELING-CODEBOOK.md` |
| v2 ethos | `docs/CODEBOOK-ETHOS.md` |
| Paragraphs (patched) | `data/paragraphs/paragraphs-clean.patched.jsonl` (72,045) |
| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| v2 holdout IDs | `data/gold/v2-holdout-ids.json` (1,200) |
| v2 holdout manifest | `data/gold/v2-holdout-manifest.jsonl` |
| v1 holdout IDs | `labelapp/.sampled-ids.original.json` |
| v1 gold labels | `data/gold/gold-adjudicated.jsonl` |
| v2 holdout benchmark | `data/annotations/v2-bench/` (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | `data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs) |
| v2 iteration archive | `data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl` |
| v4.5 boundary test | `data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs) |
| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,184 paragraphs) |
| Opus +codebook | `data/annotations/golden/opus.jsonl` (includes v1 + v2 runs) |
| Grok self-consistency test | `data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs) |
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
| Stage 1 prompt | `ts/src/label/prompts.ts` (v4.5) |
| Holdout sampling script | `scripts/sample-v2-holdout.py` |
| v2 Stage 1 run 1 | `data/annotations/v2-stage1/grok-4.1-fast.run1.jsonl` (72,045) |
| v2 Stage 1 run 2 | `data/annotations/v2-stage1/grok-4.1-fast.run2.jsonl` (72,045) |
| v2 Stage 1 run 3 | `data/annotations/v2-stage1/grok-4.1-fast.run3.jsonl` (72,045) |
| v2 Stage 1 consensus | `data/annotations/v2-stage1/consensus.jsonl` (72,045) |
| v2 Stage 1 judge | `data/annotations/v2-stage1/judge.jsonl` (212 tiebreakers) |
| Stage 1 distribution charts | `figures/stage1-*.png` (7 charts) |
| Stage 1 chart script | `scripts/plot-stage1-distributions.py` |
### v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)
| Category | Count | % |
|----------|-------|---|
| RMP | 31,201 | 43.3% |
| BG | 13,876 | 19.3% |
| MR | 10,591 | 14.7% |
| SI | 7,470 | 10.4% |
| N/O | 4,576 | 6.4% |
| TP | 4,094 | 5.7% |
| ID | 237 | 0.3% |

| Specificity | Count | % |
|-------------|-------|---|
| L1 | 29,593 | 41.1% |
| L2 | 16,344 | 22.7% |
| L3 | 17,911 | 24.9% |
| L4 | 8,197 | 11.4% |
### v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)
| Category | Count | % |
|----------|-------|---|
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |
### GPT-5.4 Prompt Iteration (holdout)
| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1200) | v4.5 (full, 1200) |
|-------------|-------------------|----------------------|--------------------|--------------------|
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | — | — | 414 (34.5%) | 211 (17.6%) |

v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.