trying ensemble and nofilter versions of the model

2026-04-06 15:50:15 -04:00


Project Status — v2 Pipeline

Deadline: 2026-04-24 | Started: 2026-04-03 | Updated: 2026-04-05 (Holdout eval done: cat F1=0.934, spec F1=0.895 vs GPT-5.4 proxy gold)

Carried Forward (not re-done)

72,045 paragraphs (all annotated in v2), quality tiers, 6 surgical patches
DAPT checkpoint (eval loss 0.7250, ~14.5h) + TAPT checkpoint (eval loss 1.0754, ~50min)
v1 data preserved: 150K Stage 1 annotations, 10-model benchmark, 6-annotator human labels, gold adjudication
v2 codebook approved (5/6 group approval 2026-04-04)

Pipeline Steps

1. Codebook Finalization — DONE

Draft v2 codebook (LABELING-CODEBOOK.md)
Draft codebook ethos (CODEBOOK-ETHOS.md)
Group approval (5/6, 2026-04-04)

2. Holdout Selection — DONE

Heuristic v2 specificity prediction (keyword scan of v1 L1 → predicted L2, v1 L3 → predicted L4)
Stratified holdout: 185 per non-ID category, 90 ID = 1,200 exact
Max 2 paragraphs per company per category stratum
Specificity floors met: L1=621, L2=119, L3=262, L4=198 (all ≥100)
1,042 companies represented, max 3 from any one company
Output: data/gold/v2-holdout-ids.json, data/gold/v2-holdout-manifest.jsonl
Script: scripts/sample-v2-holdout.py
Dev set drawn from holdout (first 200 paragraphs used for prompt iteration)
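
A minimal sketch of the stratified draw with a per-company cap. It assumes each paragraph record carries hypothetical `id`, `category`, and `company` fields, and it leaves out the specificity floors. The real logic lives in scripts/sample-v2-holdout.py and may differ.

```python
import random
from collections import defaultdict

# Quotas from the plan above: 185 per non-ID category, 90 for ID.
QUOTAS = {"RMP": 185, "BG": 185, "MR": 185, "SI": 185, "N/O": 185, "TP": 185, "ID": 90}
assert sum(QUOTAS.values()) == 1200


def sample_holdout(paragraphs, quotas, max_per_company=2, seed=0):
    """Draw quotas[category] paragraphs per category while capping how many
    paragraphs a single company contributes to each category stratum."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in paragraphs:
        by_category[p["category"]].append(p)

    selected = []
    for category, quota in quotas.items():
        pool = by_category[category][:]
        rng.shuffle(pool)
        per_company = defaultdict(int)
        picked = []
        for p in pool:
            if len(picked) == quota:
                break
            if per_company[p["company"]] < max_per_company:
                per_company[p["company"]] += 1
                picked.append(p)
        if len(picked) < quota:
            raise ValueError(f"stratum {category!r}: only {len(picked)} eligible paragraphs")
        selected.extend(picked)
    return selected
```

The cap is applied per stratum, so a company can still appear in several categories, which is consistent with the overall maximum of 3 reported above.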

3. Prompt Iteration — DONE

Full rewrite of SYSTEM_PROMPT for v2 codebook (v4.0 → v4.5, ~8 iterations)
Principle-first restructure: ERM test for L2, "unique to THIS company" for L3, external verifiability for L4
Lists compressed to boundary-case disambiguation only (not exhaustive checklists)
Category/specificity independence explicitly stated (presence check, not relevance judgment)
Hard vs soft number boundary clarified for QV; lower bounds ("more than 20 years") count as hard
VP/SVP title boundary: VP-or-above with IT/Security qualifier → L3; Director of IT without security qualifier → L1
Schema updated: "Sector-Adapted" → "Domain-Adapted", 2+ QV → 1+ QV
Piloted on 200 holdout paragraphs with GPT-5.4 across 5 iterations (~$6 total)
v4.5 iteration: mechanical bridge (specific_facts → specificity level), expertise-vs-topic L1/L2 clarification, SI negative-assertion L4 fix, fact storage in output
v4.4 results (200 paragraphs): L1=65, L2=41, L3=51, L4=43; category 95.5% agreement with v1
Cost per 200: ~$1.20 (GPT-5.4)
Prompt version: v4.5 (locked)

4. Full Holdout Validation — DONE

Run GPT-5.4 on all 1,200 holdout paragraphs with v4.4 prompt ($5.70)
Identified 34.5% medium-confidence specificity calls, concentrated at L1/L2 and L2/L3 boundaries
Identified SI materiality assertions being false-promoted to L4 (negative assertions not verifiable)
Identified specific_facts field not being stored to disk (toLabelOutput stripped it)
Iterated to v4.5: mechanical bridge, expertise-vs-topic, SI L4 fix, fact storage
Re-ran full 1,200 with v4.5 ($6.88)
Verified bridge consistency: L1=all empty, L2+=all populated (100%)
Verified SI L4 false positives eliminated (0 remaining)
Verified TP L2→L1 drops are correct (generic vendor language, not cybersecurity expertise)
v4.5 results (1,200 paragraphs): L1=618 (51.5%), L2=168 (14.0%), L3=207 (17.2%), L4=207 (17.2%)
Confidence: 989 high (82.4%), 211 medium (17.6%) — down from 414 medium in v4.4
Category stability: 96.8% agreement between v4.4 and v4.5
L2 at 14%: below 15% target on holdout, but holdout oversamples TP (14.4% vs 5% in corpus). On full corpus (46% RMP, 5% TP), L2 should be ~15-17% since RMP L2 held up.
Dev vs unseen stable: no prompt overfitting
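
The bridge consistency check (L1 has no stored facts, L2+ always has some) can be sketched as below. The `specific_facts` field name comes from the notes above. The `id` and integer `specificity` fields are assumptions about the label records, not the actual schema.

```python
def bridge_violations(labels):
    """Return ids of labels that break the bridge invariant:
    level 1 must carry no specific facts, levels 2-4 must carry at least one."""
    bad = []
    for row in labels:
        has_facts = len(row["specific_facts"]) > 0
        is_level_one = row["specificity"] == 1
        # A violation is L1 with facts, or L2+ without facts.
        if is_level_one == has_facts:
            bad.append(row["id"])
    return bad
```

An empty result on all 1,200 labels would correspond to the 100% consistency reported above.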

5. Holdout Benchmark — DONE

Run 10 models from 8 providers on 1,200 holdout (GPT-5.4, Grok Fast, Gemini Lite, Gemini Pro, MIMO Flash, Kimi K2.5, GLM-5, MiniMax M2.7, Opus 4.6, + 3 pilots)
Opus prompt-only vs codebook A/B test (prompt-only wins: 85.2% vs 82.4% both-match)
MIMO Flash broken on specificity (91% L1 collapse, κw=0.662) — disqualified
Pilot 3 cheap candidates (Qwen3-235B, Seed 1.6 Flash, Qwen3.5 Flash) — all below Flash Lite quality
Grok self-consistency test: 8.5% specificity divergence on repeated runs at temp=0 (reasoning stochasticity)
Decision: Grok ×3 self-consistency panel (Wang et al. 2022)
Benchmark cost: $45.47
Top models: Grok Fast (86.1% both), Opus prompt-only (85.2%), Gemini Pro (84.2%)
Stage 1 panel: Grok 4.1 Fast ×3 ($96 estimated)

6. Stage 1 Re-Run — DONE

Lock v2 prompt (v4.5)
Model selection: Grok 4.1 Fast ×3 (self-consistency)
Re-run Stage 1 on full corpus (72,045 paragraphs × 3 runs, concurrency 200)
Cross-run agreement: category 94.9% unanimous, specificity 91.3% unanimous
Consensus: 62,510 unanimous (86.8%), 9,323 majority (12.9%), 212 judge tiebreaker (0.3%)
GPT-5.4 judge on 212 unresolved paragraphs — 100% agreed with a Grok label
Distribution check: L2=22.7% (above 15% target), categories healthy
Stage 1 cost: $129.75 (3 runs) + $5.76 (judge) = $135.51
Run time: ~33 min per run at concurrency 200
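
A sketch of how three runs might be reduced to a consensus label. It assumes a paragraph counts as unanimous only when both dimensions agree across all three runs, and that it goes to the judge when either dimension has no two-run majority. The actual rule is not spelled out here and may differ.

```python
from collections import Counter


def resolve(votes):
    """Return (label, vote_count); label is None when no value has 2+ votes."""
    label, n = Counter(votes).most_common(1)[0]
    return (label, n) if n >= 2 else (None, n)


def consensus(runs):
    """runs: three (category, specificity) tuples, one per Grok run."""
    category, cat_votes = resolve([r[0] for r in runs])
    specificity, spec_votes = resolve([r[1] for r in runs])
    # The weaker of the two dimensions decides how the paragraph is resolved.
    method = {3: "unanimous", 2: "majority", 1: "judge"}[min(cat_votes, spec_votes)]
    return category, specificity, method


# The reported consensus buckets add up to the full corpus.
assert 62_510 + 9_323 + 212 == 72_045
```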

7. Labelapp Update ← CURRENT

Update quiz questions for v2 codebook (v2 specificity rules, fixed impossible qv-3, all 4 levels as options)
Update warmup paragraphs with v2 explanations
Update onboarding content for v2 (Domain-Adapted, 1+ QV, domain terminology lists)
Update codebook reference page for v2
DB migration to clear old 72k data (0002_v2-reset.sql)
Seed script updated for 1,200 holdout paragraphs only
Remove the old admin account; joey is the admin
Quiz is one-time (at onboarding), warmup resets each login session
Run migration + seed (la:db:migrate then la:seed)
Generate new BIBD assignments (3 of 5 annotators per paragraph)
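
One simple way to get a balanced 3-of-5 design is to cycle through all ten 3-annotator subsets, which forms a complete block design. The sketch below assumes that approach and uses placeholder annotator names. The labelapp's own generator may work differently.

```python
from collections import Counter
from itertools import combinations


def assign_blocks(paragraph_ids, annotators, k=3):
    """Cycle through every k-subset of annotators so each subset is used
    equally often (exactly balanced when the paragraph count is a multiple
    of the number of subsets)."""
    blocks = list(combinations(annotators, k))
    return {pid: blocks[i % len(blocks)] for i, pid in enumerate(paragraph_ids)}


# Placeholder annotator names; 1,200 holdout paragraphs.
assignments = assign_blocks(range(1200), ["A1", "A2", "A3", "A4", "A5"])

# Paragraphs per annotator, and paragraphs shared by each annotator pair.
load = Counter(a for block in assignments.values() for a in block)
pair_overlap = Counter(
    pair for block in assignments.values() for pair in combinations(block, 2)
)
```

Under these assumptions each annotator receives 1,200 × 3 / 5 = 720 paragraphs and every pair of annotators shares 360. A load near 600 per annotator would instead correspond to six annotators (1,200 × 3 / 6).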

8. Parallel Labeling

Humans: annotators label v2 holdout (~600 per annotator, 2-3 days)
Models: full benchmark panel on holdout (10 models, 8 providers + Opus via Agent SDK) — $45.47
Estimated cost: ~$0 remaining (models done)

9. Gold Set Assembly

Compute human IRR (category α > 0.75, specificity α > 0.67)
Gold = majority vote; all-disagree → model consensus tiebreaker
Cross-validate against model panel
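
A minimal sketch of the gold rule and the IRR computation, assuming three human votes per paragraph. The α shown is the nominal variant, which suits the category labels. Specificity is ordinal and would normally use an ordinal distance, which this sketch does not implement. Function names are illustrative.

```python
from collections import Counter
from itertools import permutations


def gold_label(human_votes, model_consensus):
    """Majority of human votes; fall back to the model consensus when all disagree."""
    label, n = Counter(human_votes).most_common(1)[0]
    if n >= 2:
        return label, "human-majority"
    return model_consensus, "model-tiebreak"


def krippendorff_alpha_nominal(units):
    """units: one list of labels per paragraph (units with <2 labels are skipped)."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of ratings within a unit contributes 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label value was ever used
    return 1 - observed / expected
```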

10. Stage 2

GPT-5.4 judge resolved 212 tiebreaker paragraphs during Stage 1 consensus ($5.76)
Bench Stage 2 accuracy against gold (if needed for additional disputed paragraphs)
Cost so far: $5.76 | Remaining budget: ~$39

11. Training Data Assembly — DONE

Merge Stage 1 consensus with paragraph data (python/src/finetune/data.py)
Exclude 1,200 holdout paragraphs (reserved for eval)
Exclude 614 individually truncated paragraphs (not entire filings — more targeted than original plan)
Quality tier weights: clean/headed/minor 1.0, degraded 0.5
Stratified train/val split (90/10) from training set
Training set size: 70,231 paragraphs (72,045 − 1,200 holdout − 614 truncated)
Train/val split: 63,214 / 7,024
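
The assembly steps above could look roughly like this. The sketch assumes hypothetical `id`, `category`, and `tier` fields and stratifies on category only. The actual loader is python/src/finetune/data.py and may stratify differently.

```python
import random
from collections import defaultdict

# Quality tier weights as listed above.
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}


def build_split(rows, holdout_ids, truncated_ids, val_frac=0.10, seed=42):
    """Drop holdout and truncated paragraphs, attach tier weights, then
    split each category stratum into train and validation parts."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        if row["id"] in holdout_ids or row["id"] in truncated_ids:
            continue
        strata[row["category"]].append({**row, "weight": TIER_WEIGHT[row["tier"]]})

    train, val = [], []
    for group in strata.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```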

12. Fine-Tuning — DONE

Ablation round 1: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss} = 12 configs × 1 epoch
Ablation round 1 winner: base_weighted_ce (CORAL head, [CLS] pooling)
CORAL limitation identified: shared weight vector can't capture 3 different transition signals (L1→L2: domain terms, L2→L3: firm facts, L3→L4: quantified claims)
Architecture iteration: replaced CORAL with independent threshold heads (3 separate MLP binary classifiers), attention pooling, specificity confidence filtering
Final model (iter1-independent, epoch 8): Cat F1=0.943, Spec F1=0.945, QWK=0.952, Combined=0.944
Architecture: ModernBERT-large → attention pooling → dropout →
- Category: Linear(1024, 7) + weighted CE
- Specificity: 3× IndependentThreshold(Linear(1024→256→1)) + cumulative BCE + ordinal consistency reg.
Key findings (ablation round 1):
- DAPT/TAPT pre-training did not help — base ModernBERT-large outperformed both
- Class weighting + CE is the best loss combination
- Focal loss + class weighting = too much correction (always bottom tier)
- TAPT consistently worst — likely overfitting on task paragraphs during MLM pre-training
Key findings (architecture iteration):
- CORAL's shared weight vector was the primary bottleneck for specificity (0.517 → 0.940)
- Independent threshold heads let each L1→L2, L2→L3, L3→L4 transition learn different features
- Attention pooling captures distributed specificity signals (one "CISO" mention anywhere matters)
- Confidence filtering removes ~8.7% noisy boundary labels from specificity training
Training speed: ~2.1 it/s, batch 32, seq 512, bf16, flash attention 2, torch.compile
Peak VRAM: ~18-20 GB / 24.6 GB (RTX 3090)
Improvement plan: docs/SPECIFICITY-IMPROVEMENT-PLAN.md
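
To illustrate the cumulative-threshold idea behind the specificity head, here is a framework-free sketch of target encoding, decoding, and a consistency penalty. The decoding rule (count thresholds above 0.5) and the exact form of the penalty are assumptions. The trained model in python/src/finetune/model.py may differ.

```python
def cumulative_targets(level, num_levels=4):
    """Encode level L as binary targets [L > 1, L > 2, L > 3]."""
    return [1.0 if level > k else 0.0 for k in range(1, num_levels)]


def decode_level(probs, threshold=0.5):
    """Predicted level = 1 + number of thresholds the paragraph clears."""
    return 1 + sum(p > threshold for p in probs)


def ordinal_violation(probs):
    """Penalty that is positive only when a higher threshold is judged
    more likely than the one below it (P(L > k+1) should not exceed P(L > k))."""
    return sum(max(0.0, higher - lower) for lower, higher in zip(probs, probs[1:]))
```

Because each threshold has its own head, the three probabilities are not guaranteed to be monotone, which is why a consistency term of this kind is useful.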

13. Evaluation & Paper ← CURRENT

Proxy eval: fine-tuned model on 1,200 holdout vs GPT-5.4 and Opus-4.6 proxy gold
Full metrics suite: macro/per-class F1, precision, recall, MCC, AUC, QWK, MAE, Krippendorff's α, ECE, confusion matrices
CORAL baseline comparison: same eval pipeline on CORAL epoch 5 checkpoint
Figures: confusion matrices, calibration diagrams, per-class F1 bars, CORAL vs Independent comparison, speed/cost table
Reference ceiling analysis: GPT-5.4 vs Opus-4.6 agreement = 0.885 macro spec F1 (our model exceeds this at 0.895)
L2 error analysis: model L2 F1 (0.798) within 0.007 of reference ceiling (0.805)
Sequence length analysis: only 139/72K paragraphs (0.19%) truncated at 512 tokens — negligible impact
Opus labels completed: 1,200/1,200 (filled 16 missing from initial run)
Macro F1 on holdout gold (target > 0.80 both heads) — blocked on human labels
Per-threshold sigmoid tuning against human gold (potential +0.01-0.02 on L2 F1)
Temperature scaling for improved calibration — T_cat=1.76, T_spec=2.46; ECE reduced 33%/40% (cat/spec); F1 unchanged
Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed
Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context
Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement
Error analysis against human gold, IGNITE slides
Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result
Note in paper: CORAL ordinal regression insufficient for multi-signal ordinal classification
Note in paper: model exceeds inter-reference agreement — approaches ceiling of construct reliability
Proxy gold results (vs GPT-5.4): Cat F1=0.934, Spec F1=0.895, MCC=0.923/0.866, AUC=0.992/0.982, QWK=0.932
Proxy gold results (vs Opus-4.6): Cat F1=0.923, Spec F1=0.883, QWK=0.923
Speed: 5.6ms/sample (178/sec) — 520× faster than GPT-5.4, 1,070× faster than Opus
Next: deploy labelapp for human annotation, then gold evaluation + threshold tuning
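
For reference, temperature scaling and ECE can be sketched in a few lines. This is a generic illustration with equal-width bins, not the code in python/src/finetune/eval.py. The bin count and the example logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens confidence)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Illustrative logits only; 1.76 is the category temperature reported above.
raw = softmax([2.0, 0.0, -1.0])
cooled = softmax([2.0, 0.0, -1.0], temperature=1.76)
```

Dividing by a single temperature never changes the argmax, which is consistent with F1 staying unchanged while ECE drops.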

Rubric Checklist

C (F1 > .80): Fine-tuned model, GenAI comparison, labeled datasets, documentation, Python notebooks
B (3+ of 4): [x] Cost/time/reproducibility, [x] 6+ models / 3+ suppliers, [x] Contemporary self-collected data, [x] Compelling use case
A (3+ of 4): [x] Error analysis, [x] Mitigation strategy, [x] Additional baselines (keyword/dictionary: Cat 0.55 / Spec 0.66), [x] Comparison to amateur labels

Key Data

What	Where
v2 codebook	`docs/LABELING-CODEBOOK.md`
v2 ethos	`docs/CODEBOOK-ETHOS.md`
Paragraphs (patched)	`data/paragraphs/paragraphs-clean.patched.jsonl` (72,045)
v1 Stage 1 annotations	`data/annotations/stage1.patched.jsonl` (150,009)
v2 holdout IDs	`data/gold/v2-holdout-ids.json` (1,200)
v2 holdout manifest	`data/gold/v2-holdout-manifest.jsonl`
v1 holdout IDs	`labelapp/.sampled-ids.original.json`
v1 gold labels	`data/gold/gold-adjudicated.jsonl`
v2 holdout benchmark	`data/annotations/v2-bench/` (10 models + 3 pilots, 1,200 paragraphs)
v2 holdout reference	`data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs)
v2 iteration archive	`data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl`
v4.5 boundary test	`data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs)
Opus prompt-only	`data/annotations/v2-bench/opus-4.6.jsonl` (1,200 paragraphs)
Opus +codebook	`data/annotations/golden/opus.jsonl` (includes v1 + v2 runs)
Grok self-consistency test	`data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs)
Benchmark analysis	`scripts/analyze-v2-bench.py`
Stage 1 prompt	`ts/src/label/prompts.ts` (v4.5)
Holdout sampling script	`scripts/sample-v2-holdout.py`
v2 Stage 1 run 1	`data/annotations/v2-stage1/grok-4.1-fast.run1.jsonl` (72,045)
v2 Stage 1 run 2	`data/annotations/v2-stage1/grok-4.1-fast.run2.jsonl` (72,045)
v2 Stage 1 run 3	`data/annotations/v2-stage1/grok-4.1-fast.run3.jsonl` (72,045)
v2 Stage 1 consensus	`data/annotations/v2-stage1/consensus.jsonl` (72,045)
v2 Stage 1 judge	`data/annotations/v2-stage1/judge.jsonl` (212 tiebreakers)
Stage 1 distribution charts	`figures/stage1-*.png` (7 charts)
Stage 1 chart script	`scripts/plot-stage1-distributions.py`
Fine-tuning data loader	`python/src/finetune/data.py`
Dual-head model	`python/src/finetune/model.py`
Fine-tuning trainer	`python/src/finetune/train.py`
Fine-tune config	`python/configs/finetune/modernbert.yaml`
Ablation results	`checkpoints/finetune/ablation/ablation_results.json`
Best model (final)	`checkpoints/finetune/iter1-independent/final/` (cat=0.943, spec=0.945)
CORAL baseline (ablation winner)	`checkpoints/finetune/best-base_weighted_ce-ep5/final/` (cat=0.932, spec=0.517)
Spec improvement plan	`docs/SPECIFICITY-IMPROVEMENT-PLAN.md`
Best model iter1 config	`python/configs/finetune/iter1-independent.yaml`
Eval script	`python/src/finetune/eval.py`
Eval results (best model)	`results/eval/iter1-independent/metrics.json`
Eval results (CORAL)	`results/eval/coral-baseline/metrics.json`
Comparison figures	`results/eval/comparison/` (5 charts)
Per-model eval figures	`results/eval/iter1-independent/figures/` + `results/eval/coral-baseline/figures/`
Comparison figure script	`python/scripts/generate-comparison-figures.py`

v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)

Category	Count	%
RMP	31,201	43.3%
BG	13,876	19.3%
MR	10,591	14.7%
SI	7,470	10.4%
N/O	4,576	6.4%
TP	4,094	5.7%
ID	237	0.3%

Specificity	Count	%
L1	29,593	41.1%
L2	16,344	22.7%
L3	17,911	24.9%
L4	8,197	11.4%

v1 Stage 1 Distribution (50,003 paragraphs, v2.5 prompt, 3-model consensus)

Category	Count	%
RMP	22,898	45.8%
MR	8,782	17.6%
BG	8,024	16.0%
SI	5,014	10.0%
N/O	2,503	5.0%
TP	2,478	5.0%
ID	304	0.6%

GPT-5.4 Prompt Iteration (holdout)

Specificity	v4.0 (list, 200)	v4.4 (principle, 200)	v4.4 (full, 1200)	v4.5 (full, 1200)
L1	81 (40.5%)	65 (32.5%)	546 (45.5%)	618 (51.5%)
L2	32 (16.0%)	41 (20.5%)	229 (19.1%)	168 (14.0%)
L3	43 (21.5%)	51 (25.5%)	225 (18.8%)	207 (17.2%)
L4	44 (22.0%)	43 (21.5%)	200 (16.7%)	207 (17.2%)
Med conf	—	—	414 (34.5%)	211 (17.6%)

v4.4→v4.5 key changes: mechanical bridge (specific_facts drives specificity level, 100% consistent), expertise-vs-topic L1/L2 clarification (fixes TP false L2s), SI negative-assertion L4 fix, lower-bound numbers as hard QV, fact storage in output.
