Project Status — v2 Pipeline

Deadline: 2026-04-24 | Started: 2026-04-03 | Updated: 2026-04-05 (holdout benchmark done, Grok ×3 selected)


Carried Forward (not re-done)

  • 72,045 paragraphs (49,795 annotated), quality tiers, 6 surgical patches
  • DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
  • v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
  • v2 codebook approved (5/6 group approval 2026-04-04)

Pipeline Steps

1. Codebook Finalization — DONE

  • Draft v2 codebook (LABELING-CODEBOOK.md)
  • Draft codebook ethos (CODEBOOK-ETHOS.md)
  • Group approval (5/6, 2026-04-04)

2. Holdout Selection — DONE

  • Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
  • Stratified holdout: 185 per non-ID category + 90 ID = 1,200 exact (sampling sketch after this list)
  • Max 2 paragraphs per company per category stratum
  • Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
  • 1,042 companies represented, max 3 from any one company
  • Output: data/gold/v2-holdout-ids.json, data/gold/v2-holdout-manifest.jsonl
  • Script: scripts/sample-v2-holdout.py
  • Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
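
A minimal sketch of the stratified draw described above (the real logic lives in scripts/sample-v2-holdout.py): greedy per-category sampling under the per-company caps, with field names ("id", "company", "category") assumed for illustration. The specificity floors were verified after the draw rather than enforced in the sampler.

```python
import random
from collections import Counter

QUOTAS = {"RMP": 185, "MR": 185, "BG": 185, "SI": 185, "N/O": 185, "TP": 185, "ID": 90}
PER_COMPANY_PER_STRATUM = 2   # max 2 paragraphs per company within a category stratum
PER_COMPANY_TOTAL = 3         # max 3 paragraphs from any one company overall

def sample_holdout(paragraphs, seed=0):
    """Greedy stratified draw; `paragraphs` is a list of dicts with assumed
    keys "id", "company", "category"."""
    rng = random.Random(seed)
    per_stratum = Counter()   # (category, company) -> count
    per_company = Counter()   # company -> count
    chosen = []
    for cat, quota in QUOTAS.items():
        pool = [p for p in paragraphs if p["category"] == cat]
        rng.shuffle(pool)
        taken = 0
        for p in pool:
            if taken >= quota:
                break
            comp = p["company"]
            if per_stratum[(cat, comp)] >= PER_COMPANY_PER_STRATUM:
                continue
            if per_company[comp] >= PER_COMPANY_TOTAL:
                continue
            chosen.append(p)
            per_stratum[(cat, comp)] += 1
            per_company[comp] += 1
            taken += 1
    return chosen   # 6 * 185 + 90 = 1,200 when every quota can be filled
```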

3. Prompt Iteration — DONE

  • Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
  • Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
  • Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
  • Category/specificity independence explicitly stated (presence check, not relevance judgment)
  • Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
  • VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
  • Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
  • Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
  • v4.5 iteration: mechanical bridge (specific_facts drives the specificity level; sketch after this list), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
  • v4.4 results (200 paragraphs): L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
  • Cost per 200: ~$1.20 (GPT-5.4)
  • Prompt version: v4.5 (locked)
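
One plausible reading of the v4.5 mechanical bridge, written as post-hoc logic: the specificity level is derived from the extracted specific_facts rather than judged independently. Field names and the exact promotion rules below are illustrative; the authoritative wording is the v4.5 prompt in ts/src/label/prompts.ts.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    text: str
    company_specific: bool        # "unique to THIS company" (L3 test)
    externally_verifiable: bool   # could an outsider check it? (L4 test)

def bridge_specificity(facts: list[Fact]) -> str:
    """Derive the specificity level from extracted facts. This captures the
    L1/L2+ bridge plus the named L3/L4 tests; the prompt governs edge cases."""
    if not facts:
        return "L1"   # no specific facts extracted -> generic boilerplate
    if any(f.company_specific and f.externally_verifiable for f in facts):
        return "L4"   # verifiable, company-specific detail
    if any(f.company_specific for f in facts):
        return "L3"   # unique to THIS company
    return "L2"       # substance beyond boilerplate (ERM test)
```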

4. Full Holdout Validation — DONE

  • Ran GPT-5.4 on all 1,200 holdout paragraphs with the v4.4 prompt ($5.70)
  • Identified that 34.5% of specificity calls were medium-confidence, concentrated at the L1/L2 and L2/L3 boundaries
  • Identified SI materiality assertions being falsely promoted to L4 (negative assertions are not externally verifiable)
  • Identified that the specific_facts field was not being stored to disk (toLabelOutput stripped it)
  • Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
  • Re-ran full 1,200 with v4.5 ($6.88)
  • Verified bridge consistency: L1=all empty, L2+=all populated (100%; check sketched after this list)
  • Verified SI L4 false positives eliminated (0 remaining)
  • Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
  • v4.5 results (1,200 paragraphs): L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
  • Confidence: 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
  • Category stability: 96.8% agreement between v4.4 and v4.5
  • L2 at 14%: below 15% target on holdout, but holdout oversamples TP (14.4% vs 5% in corpus). On full corpus (46% RMP, 5% TP), L2 should be ~15-17% since RMP L2 held up.
  • Dev vs unseen stable: no prompt overfitting
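
A sketch of the consistency checks behind the verification bullets above; the JSONL field names ("id", "specificity", "specific_facts", "confidence") are assumptions about the annotation schema, not confirmed from the files.

```python
import json
from collections import Counter

def check_run(path):
    """Count specificity and confidence levels and flag bridge violations."""
    spec_counts, conf_counts, violations = Counter(), Counter(), []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            spec = rec["specificity"]
            facts = rec.get("specific_facts", [])
            spec_counts[spec] += 1
            conf_counts[rec.get("confidence", "unknown")] += 1
            if spec == "L1" and facts:
                violations.append((rec["id"], "L1 with facts"))
            if spec != "L1" and not facts:
                violations.append((rec["id"], f"{spec} with no facts"))
    return spec_counts, conf_counts, violations   # v4.5 run: zero violations
```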

5. Holdout Benchmark — DONE

  • Ran 10 models from 8 providers on the 1,200-paragraph holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
  • Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
  • MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
  • Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
  • Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity; measurement sketched after this list)
  • Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
  • Benchmark cost: $45.47
  • Top models: Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
  • Stage 1 panel: Grok 4.1 Fast ×3 ($96 estimated)
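
A sketch of the repeat-run comparison behind the 8.5% divergence figure: load two runs of the same model over the same paragraphs and count how often the specificity call changes. Field names are assumed.

```python
import json

def load_run(path):
    with open(path) as f:
        return {r["id"]: r for r in map(json.loads, f)}

def specificity_divergence(path_a, path_b):
    """Fraction of shared paragraphs whose specificity call differs between
    two runs of the same model and prompt."""
    a, b = load_run(path_a), load_run(path_b)
    shared = a.keys() & b.keys()
    changed = sum(1 for pid in shared if a[pid]["specificity"] != b[pid]["specificity"])
    return changed / len(shared) if shared else 0.0
```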

6. Stage 1 Re-Run ← CURRENT

  • Lock v2 prompt (v4.5)
  • Model selection: Grok 4.1 Fast ×3 (self-consistency)
  • Re-run Stage 1 on full corpus (~50K paragraphs × 3 runs)
  • Distribution check: L2 ~15-17%, categories healthy (aggregation and check sketched after this list)
  • Estimated cost: ~$96
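
A sketch of the planned ×3 aggregation and distribution check: per-paragraph majority vote over the three Grok runs, with three-way splits set aside as disputed. File layout and field names are assumptions.

```python
import json
from collections import Counter

def majority(labels):
    (top, n), = Counter(labels).most_common(1)
    return top if n >= 2 else None   # None -> no 2-of-3 agreement

def aggregate(run_paths):
    """run_paths: three JSONL files, one per independent Grok run."""
    runs = []
    for path in run_paths:
        with open(path) as f:
            runs.append({r["id"]: r for r in map(json.loads, f)})
    shared = set(runs[0]) & set(runs[1]) & set(runs[2])
    consensus, disputed = {}, []
    for pid in shared:
        cat = majority([run[pid]["category"] for run in runs])
        spec = majority([run[pid]["specificity"] for run in runs])
        if cat is None or spec is None:
            disputed.append(pid)
        else:
            consensus[pid] = {"category": cat, "specificity": spec}
    return consensus, disputed

def l2_share(consensus):
    """Distribution check: L2 share should land around 15-17% on the corpus."""
    spec = Counter(v["specificity"] for v in consensus.values())
    return spec["L2"] / sum(spec.values())
```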

7. Labelapp Update

  • Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
  • Update warmup paragraphs with v2 explanations
  • Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
  • Update codebook reference page for v2
  • DB migration to clear old 72k data (0002_v2-reset.sql)
  • Seed script updated for 1,200 holdout paragraphs only
  • Remove the old admin account; joey is the admin
  • Quiz is one-time (at onboarding), warmup resets each login session
  • Run migration + seed (la:db:migrate then la:seed)
  • Generate new BIBD assignments (3 of 5 annotators per paragraph)
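
A sketch of one way to generate the BIBD-style assignments: cycling through all C(5,3) = 10 annotator triples keeps per-annotator load and pairwise overlap roughly balanced. Annotator names are placeholders, and the labelapp's actual generator may differ.

```python
from itertools import combinations, cycle

ANNOTATORS = ["a1", "a2", "a3", "a4", "a5"]   # placeholder names

def assign(paragraph_ids, annotators=ANNOTATORS, k=3):
    """Give every paragraph k of the annotators. Cycling through all C(n, k)
    combinations keeps per-annotator load and pairwise overlap roughly even;
    the balance is exact only when the paragraph count divides evenly."""
    blocks = cycle(list(combinations(annotators, k)))
    return {pid: next(blocks) for pid in paragraph_ids}
```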

8. Parallel Labeling

  • Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
  • Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
  • Estimated cost: ~$0 remaining (models done)

9. Gold Set Assembly

  • Compute human IRR (category α > 0.75, specificity α > 0.67)
  • Gold = majority vote; if all annotators disagree → model-consensus tiebreaker (sketch after this list)
  • Cross-validate against model panel
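
A sketch of the IRR computation and the gold-assembly rule, assuming the krippendorff package and integer-coded labels; whether specificity is scored as nominal or ordinal for α is a choice this sketch leaves open.

```python
import krippendorff   # pip install krippendorff
from collections import Counter

def irr_alpha(reliability_matrix, level="nominal"):
    """reliability_matrix: one row per annotator, one column per paragraph,
    labels coded as numbers, float("nan") where an annotator skipped a unit.
    Targets: category alpha > 0.75, specificity alpha > 0.67."""
    return krippendorff.alpha(reliability_data=reliability_matrix,
                              level_of_measurement=level)

def gold_label(human_votes, model_consensus):
    """Majority vote across the human labels for one paragraph; if all of
    them disagree, fall back to the model-panel consensus as tiebreaker."""
    (top, n), = Counter(human_votes).most_common(1)
    return top if n >= 2 else model_consensus
```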

10. Stage 2 (if needed)

  • Bench Stage 2 accuracy against gold
  • If adds value → run on disputed Stage 1 paragraphs
  • Estimated cost: ~$20-40 if run

11. Training Data Assembly

  • Unanimous Stage 1 → full weight, calibrated majority → full weight (weight assembly sketched after this list)
  • Quality tier weights: clean/headed/minor 1.0, degraded 0.5
  • Exclude 72 truncated filings
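
A sketch of the per-example weight assembly implied by the rules above; the record fields ("truncated_filing", "vote", "quality_tier") are assumed names, and the treatment of non-majority votes is left open pending step 10.

```python
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def example_weight(rec):
    """Assumed fields: "truncated_filing", "vote", "quality_tier". Votes other
    than unanimous / calibrated-majority get zero here; their handling depends
    on whether Stage 2 runs (step 10)."""
    if rec.get("truncated_filing"):
        return 0.0   # the 72 truncated filings are excluded entirely
    vote_w = 1.0 if rec.get("vote") in ("unanimous", "calibrated_majority") else 0.0
    return vote_w * TIER_WEIGHT.get(rec.get("quality_tier"), 0.0)
```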

12. Fine-Tuning

  • Ablation: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
  • Dual-head: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal); architecture sketched after this list
  • CORAL for ordinal specificity
  • Estimated time: 12-20h GPU
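
A minimal sketch of the dual-head architecture with a CORAL specificity head, assuming a PyTorch/transformers setup; the backbone id is the public ModernBERT base as a stand-in for the DAPT/TAPT checkpoints, and the class-weighting and focal-loss ablation arms are not shown.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualHeadClassifier(nn.Module):
    """Shared encoder, a 7-way category head, and a CORAL ordinal head for
    the four specificity levels (illustrative sketch, not the final code)."""

    def __init__(self, backbone="answerdotai/ModernBERT-base",
                 n_categories=7, n_levels=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.category_head = nn.Linear(hidden, n_categories)
        # CORAL: one shared weight vector plus n_levels - 1 ordered biases
        self.coral_weight = nn.Linear(hidden, 1, bias=False)
        self.coral_bias = nn.Parameter(torch.zeros(n_levels - 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                         # [CLS]-style pooling
        cat_logits = self.category_head(cls)                      # (batch, 7)
        coral_logits = self.coral_weight(cls) + self.coral_bias   # (batch, 3)
        return cat_logits, coral_logits

def coral_loss(coral_logits, levels):
    """levels in {0,1,2,3} for L1..L4; threshold k's target is 1 if level > k."""
    k = torch.arange(coral_logits.size(1), device=levels.device)
    targets = (levels.unsqueeze(1) > k).float()
    return nn.functional.binary_cross_entropy_with_logits(coral_logits, targets)

# Predicted level index = number of thresholds passed:
# (torch.sigmoid(coral_logits) > 0.5).sum(dim=1) -> 0..3 -> L1..L4
```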

13. Evaluation & Paper

  • Macro F1 on holdout (target > 0.80 for both heads; evaluation sketch after this list)
  • Per-class F1 breakdown + GenAI benchmark table
  • Error analysis, cost comparison, IGNITE slides
  • Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
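
A sketch of the headline evaluation: macro F1 per head over the gold holdout, with scikit-learn's classification_report supplying the per-class breakdown for the paper table.

```python
from sklearn.metrics import classification_report, f1_score

def evaluate(gold_cats, pred_cats, gold_levels, pred_levels):
    """Macro F1 per head (target > 0.80 for each) plus per-class breakdowns."""
    print(classification_report(gold_cats, pred_cats))
    print(classification_report(gold_levels, pred_levels))
    return {
        "category_macro_f1": f1_score(gold_cats, pred_cats, average="macro"),
        "specificity_macro_f1": f1_score(gold_levels, pred_levels, average="macro"),
    }
```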

Rubric Checklist

C (F1 > .80): Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks
B (3+ of 4): [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case
A (3+ of 4): [x] Error analysis, [x] Mitigation strategy, [ ] Additional baselines (keyword/dictionary), [x] Comparison to amateur labels


Key Data

| What | Where |
| --- | --- |
| v2 codebook | docs/LABELING-CODEBOOK.md |
| v2 ethos | docs/CODEBOOK-ETHOS.md |
| Paragraphs (patched) | data/paragraphs/paragraphs-clean.patched.jsonl (72,045) |
| v1 Stage 1 annotations | data/annotations/stage1.patched.jsonl (150,009) |
| v2 holdout IDs | data/gold/v2-holdout-ids.json (1,200) |
| v2 holdout manifest | data/gold/v2-holdout-manifest.jsonl |
| v1 holdout IDs | labelapp/.sampled-ids.original.json |
| v1 gold labels | data/gold/gold-adjudicated.jsonl |
| v2 holdout benchmark | data/annotations/v2-bench/ (10 models + 3 pilots, 1,200 paragraphs) |
| v2 holdout reference | data/annotations/v2-bench/gpt-5.4.jsonl (v4.5, 1,200 paragraphs) |
| v2 iteration archive | data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl |
| v4.5 boundary test | data/annotations/v2-bench/v45-test/gpt-5.4.jsonl (50 paragraphs) |
| Opus prompt-only | data/annotations/v2-bench/opus-4.6.jsonl (1,184 paragraphs) |
| Opus +codebook | data/annotations/golden/opus.jsonl (includes v1 + v2 runs) |
| Grok self-consistency test | data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl (47 paragraphs) |
| Benchmark analysis | scripts/analyze-v2-bench.py |
| Stage 1 prompt | ts/src/label/prompts.ts (v4.5) |
| Holdout sampling script | scripts/sample-v2-holdout.py |

v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)

| Category | Count | % |
| --- | ---: | ---: |
| RMP | 22,898 | 45.8% |
| MR | 8,782 | 17.6% |
| BG | 8,024 | 16.0% |
| SI | 5,014 | 10.0% |
| N/O | 2,503 | 5.0% |
| TP | 2,478 | 5.0% |
| ID | 304 | 0.6% |

GPT-5.4 Prompt Iteration (holdout)

| Specificity | v4.0 (list, 200) | v4.4 (principle, 200) | v4.4 (full, 1,200) | v4.5 (full, 1,200) |
| --- | --- | --- | --- | --- |
| L1 | 81 (40.5%) | 65 (32.5%) | 546 (45.5%) | 618 (51.5%) |
| L2 | 32 (16.0%) | 41 (20.5%) | 229 (19.1%) | 168 (14.0%) |
| L3 | 43 (21.5%) | 51 (25.5%) | 225 (18.8%) | 207 (17.2%) |
| L4 | 44 (22.0%) | 43 (21.5%) | 200 (16.7%) | 207 (17.2%) |
| Med conf | | | 414 (34.5%) | 211 (17.6%) |

v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.