Project Status — 2026-04-02 (evening)
What's Done
Data Pipeline
- 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- 14 filing generators identified, quality metrics per generator
- 6 surgical patches applied (orphan words + heading stripping)
- Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- All data integrity rules formalized (frozen originals, UUID-linked patches)
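The tier-based down-weighting reduces to a one-line lookup at training-set assembly time. A minimal sketch (the function name is hypothetical; the tier names and the 0.5 weight for degraded paragraphs are from the plan above):

```python
def sample_weight(quality_tier):
    """Per-paragraph training weight from the quality tier system:
    clean/headed/minor paragraphs carry full weight; paragraphs with
    embedded-bullet damage ('degraded') are down-weighted to 0.5 so
    extraction noise doesn't dominate the loss."""
    return {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}[quality_tier]
```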
GenAI Labeling (Stage 1)
- Prompt v2.5 locked after 12+ iterations
- 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- 150,009 annotations completed ($115.88, 0 failures)
- Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into stage1.patched.jsonl
- Codebook v3.0 with 3 major rulings
DAPT + TAPT Pre-Training
- DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- DAPT checkpoint at checkpoints/dapt/modernbert-large/final/
- TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- TAPT checkpoint at checkpoints/tapt/modernbert-large/final/
- Custom WholeWordMaskCollator (upstream transformers collator broken for BPE tokenizers)
- Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- Procedure documented in docs/DAPT-PROCEDURE.md
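The custom collator's core idea can be sketched independently of transformers: with a BPE tokenizer a single word may span several sub-tokens, so the mask decision has to be made per word (e.g. via a fast tokenizer's `word_ids()`) and then broadcast to every sub-token of each selected word. A minimal sketch of that selection step (function name and interface are illustrative, not the actual WholeWordMaskCollator API):

```python
import random

def whole_word_mask(word_ids, mlm_prob=0.15, rng=None):
    """Choose token positions to mask at whole-word granularity: if any
    sub-token of a word is selected, every sub-token of that word is
    masked together.

    word_ids: per-token word index (None for special tokens), as a fast
    tokenizer's BatchEncoding.word_ids() would return.
    Returns a list of booleans, one per token position.
    """
    rng = rng or random.Random()
    # Group token positions by the word they belong to.
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            words.setdefault(wid, []).append(pos)
    mask = [False] * len(word_ids)
    for positions in words.values():
        if rng.random() < mlm_prob:
            for pos in positions:
                mask[pos] = True
    return mask
```

Special tokens (word id `None`) are never maskable, and the per-word draw guarantees no word is ever half-masked, which is the property the upstream BPE-unaware collator failed to preserve.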
Human Labeling — Complete
- All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- Full data export: raw labels, timing, quiz sessions, metrics → data/gold/
- Comprehensive IRR analysis → data/gold/charts/
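The BIBD schedule can be generated by cycling paragraphs through all C(6,3) = 20 annotator triples: each annotator sits in exactly 10 of the 20 triples, so 1,200 paragraphs yield exactly 600 per annotator, and every annotator pair co-labels the same number of items. An illustrative construction (not necessarily the assignment script actually used):

```python
from itertools import combinations

def bibd_assign(paragraph_ids, annotators):
    """Assign each paragraph a block of 3 annotators by cycling through
    all C(6,3)=20 triples. With 1,200 paragraphs, each of the 20 blocks
    is used 60 times; each annotator appears in 10 blocks, hence labels
    10 * 60 = 600 paragraphs, and pair co-occurrence is balanced."""
    blocks = list(combinations(annotators, 3))  # 20 blocks for 6 annotators
    return {pid: blocks[i % len(blocks)] for i, pid in enumerate(paragraph_ids)}
```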
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
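For reference, the pairwise statistic behind the "Avg Cohen's κ" row: κ = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement from each annotator's label marginals, averaged over annotator pairs that share items. A minimal implementation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    kappa = (p_o - p_e) / (1 - p_e), with chance agreement p_e
    estimated from each annotator's label marginals."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```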
Prompt v3.0
- Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- Prompt version bumped from v2.5 → v3.0
GenAI Holdout Benchmark — Complete
- 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs
- All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix)
- Total benchmark cost: $45.63
| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus |
|---|---|---|---|---|
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% |
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% |
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% |
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% |
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% |
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% |
| anthropic/claude-opus-4.6 | Anthropic | $0 | — | — |
Together with the Stage 1 panel already on file, that makes 10 models across 8 suppliers.
13-Signal Cross-Source Analysis — Complete
- 30 diagnostic charts generated → data/gold/charts/
- Leave-one-out analysis (no model privileged as reference)
- Adjudication tier breakdown computed
Adjudication tiers (13 signals per paragraph):
| Tier | Count | % | Rule |
|---|---|---|---|
| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold |
| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated |
| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review |
| 4 | 202 | 16.8% | Universal disagreement → expert review |
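The tier rules reduce to a vote-counting cascade over the 3 human + 10 model signals. A sketch of that cascade; the Tier 3 "GenAI converges" threshold isn't stated above, so the 70% cutoff is a placeholder, and the sketch applies the rule to one dimension rather than jointly to category and specificity:

```python
from collections import Counter

def tier(human_votes, model_votes, t1_threshold=10, converge_frac=0.7):
    """Adjudication tier from 3 human + 10 model votes (13 signals)."""
    top_n = Counter(human_votes + model_votes).most_common(1)[0][1]
    if top_n >= t1_threshold:
        return 1                                 # super-consensus: auto gold
    h_label, h_n = Counter(human_votes).most_common(1)[0]
    m_label, m_n = Counter(model_votes).most_common(1)[0]
    human_majority = h_n > len(human_votes) / 2  # at least 2 of 3 humans
    model_converges = m_n > len(model_votes) * converge_frac
    if human_majority and h_label == m_label:
        return 2                                 # human + GenAI majorities agree
    if not human_majority and model_converges:
        return 3                                 # humans split, GenAI converges
    return 4                                     # universal disagreement
```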
Leave-one-out ranking (each source vs majority of other 12):
| Rank | Source | Cat % | Spec % | Both % |
|---|---|---|---|---|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 |
| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 |
| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 |
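Leave-one-out scoring is simple to state: each source's vote on each paragraph is compared against the majority of the other 12, so no source is baked in as the reference. A minimal sketch of the odd-one-out rate:

```python
from collections import Counter

def odd_one_out_rate(votes_by_source):
    """Leave-one-out scoring: for each source, compare its vote on each
    item against the majority vote of the remaining sources.

    votes_by_source: {source: [label per item]}, all lists same length.
    Returns {source: fraction of items where it disagrees with the rest}.
    """
    sources = list(votes_by_source)
    n_items = len(next(iter(votes_by_source.values())))
    rates = {}
    for s in sources:
        misses = 0
        for i in range(n_items):
            others = [votes_by_source[o][i] for o in sources if o != s]
            majority = Counter(others).most_common(1)[0][0]
            if votes_by_source[s][i] != majority:
                misses += 1
        rates[s] = misses / n_items
    return rates
```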
Key finding: Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate).
Codebook v3.5 & Prompt Iteration — Complete
- Cross-analysis: GenAI vs human systematic errors identified (SI↔N/O 23:0, MR↔RMP 38:13, BG↔MR 33:6)
- v3.5 rulings: SI materiality assessment test, BG purpose test, MR↔RMP 3-step chain
- v3.5 gold re-run: 7 models × 359 confusion-axis holdout paragraphs ($18)
- 6 rounds prompt iteration on 26 regression paragraphs ($1.02): v3.0=18/26 → v3.5=22/26
- SI rule tightened: "could have material adverse effect" = NOT SI (speculation, not assessment)
- Cross-reference exception: materiality language in cross-refs = N/O
- BG threshold: one-sentence committee mention doesn't flip to BG
- Stage 1 corrections flagged: 308 paragraphs (180 materiality + 128 SPACs)
- Prompt locked at v3.5, codebook updated, version history documented
- SI↔N/O paradox investigated and resolved: models correct, humans systematically over-call SI on speculation
- Codebook Case 9 contradiction with Rule 6 fixed ("could" example → N/O)
- Gold adjudication strategy for SI↔N/O defined: trust model consensus, apply SI via regex for assessments
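The regex-based SI adjudication can be illustrated as follows. The patterns are hypothetical stand-ins, not the production rules; they encode the v3.5 distinction that modal "could/may/might have a material adverse effect" language is speculation (not SI), while a concluded materiality assessment is SI:

```python
import re

# Illustrative patterns only. Modal risk language is speculation;
# a concluded assessment ("determined ... not material") is SI.
SPECULATION = re.compile(
    r"\b(could|may|might)\s+have\s+a?\s*material(ly)?\s+adverse", re.I)
ASSESSMENT = re.compile(
    r"\b(determined|concluded|assessed|believes?)\b"
    r".{0,80}?\b(not\s+)?(im)?material", re.I | re.S)

def si_candidate(text):
    """True if the paragraph contains a concluded materiality
    assessment and is not merely speculative risk language."""
    return bool(ASSESSMENT.search(text)) and not SPECULATION.search(text)
```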
| Data asset | Location |
|---|---|
| v3.5 bench annotations | data/annotations/bench-holdout-v35/*.jsonl (7 models × 359) |
| v3.5 Opus annotations | data/annotations/golden-v35/opus.jsonl (359) |
| Stage 1 correction flags | data/annotations/stage1-corrections.jsonl (308) |
| Holdout re-run IDs | data/gold/holdout-rerun-v35.jsonl (359) |
Gold Set Adjudication v1 — Complete
- Aaryan redo integrated: 50.3% of labels changed, α 0.801→0.825 (cat), 0.546→0.661 (spec)
- Old Aaryan labels preserved in data/gold/human-labels-aaryan-v1.jsonl
- Cross-axis systematic error analysis: models correct ~85% on MR↔RMP, MR↔BG, RMP↔BG, TP↔RMP, SI↔N/O
- 5-tier adjudication: T1 super-consensus (911), T2 cross-validated (108), T3 rule-based (30), T4 model-unanimous (59), T5 plurality (92)
- 30 rule-based overrides (27 SI↔N/O + 3 T5 codebook resolutions)
Gold Set Adjudication v2 — Complete (T5 deep analysis)
- Full model disagreement analysis: 6-model vote vectors on all 1,200 paragraphs
- Gemini identified as systematic MR outlier (z≈+2.3, 302 MR vs ~192 avg, drives 45% MR↔RMP confusion)
- Gemini exclusion experiment: NULL RESULT at T5 (human MR bias makes it redundant; tiering already neutralizes at T4)
- v3.5 prompt impact: unanimity 25%→60%, but created new BG↔RMP hotspot (+171%)
- Text-based BG vote removal: automated, verifiable — if "board" absent from text, BG model votes removed. 13 labels corrected, source accuracy UP for 10/12 sources
- 10 new codebook tiebreaker overrides: ID↔SI (negative assertions), SPAC rule, board-removal test, committee-level test
- Specificity hybrid: human unanimous → human label, human split → model majority. 195 specificity labels updated
- All changes validated experimentally (one variable at a time, acceptance criteria checked)
- T5: 92 → 85, gold≠human: 151 → 144
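Two of the v2 mechanics are small enough to state as code. Below is a sketch of the specificity hybrid and the text-based BG vote removal; the label strings ("S2", "BG") are placeholders for the actual tag vocabulary:

```python
from collections import Counter

def specificity_gold(human_votes, model_votes):
    """Specificity hybrid from adjudication v2: if the humans who
    labeled the paragraph agree unanimously, keep the human label;
    if they split, fall back to the model majority."""
    if len(set(human_votes)) == 1:
        return human_votes[0]
    return Counter(model_votes).most_common(1)[0][0]

def strip_bg_votes(text, model_votes):
    """Text-based BG removal: if 'board' never appears in the
    paragraph, drop BG votes from the model pool before tallying
    (automated and verifiable, per adjudication v2)."""
    if "board" in text.lower():
        return model_votes
    return [v for v in model_votes if v != "BG"]
```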
| Source | Accuracy vs Gold (v1) | Accuracy vs Gold (v2) | Δ |
|---|---|---|---|
| Xander | 91.0% | 91.5% | +0.5% |
| Opus | 88.6% | 89.1% | +0.5% |
| GPT-5.4 | 87.4% | 88.5% | +1.1% |
| GLM-5 | 86.0% | 86.5% | +0.5% |
| Elisabeth | 85.8% | 86.5% | +0.7% |
| MIMO | 85.8% | 86.2% | +0.5% |
| Meghan | 85.3% | 86.0% | +0.7% |
| Kimi | 84.5% | 84.9% | +0.4% |
| Gemini | 84.0% | 84.6% | +0.6% |
| Joey | 80.7% | 80.2% | -0.5% |
| Aaryan | 75.2% | 74.2% | -1.0% |
| Anuj | 69.3% | 69.7% | +0.3% |
| Data asset | Location |
|---|---|
| Adjudicated gold labels | data/gold/gold-adjudicated.jsonl (1,200) |
| Old Aaryan labels | data/gold/human-labels-aaryan-v1.jsonl (600) |
| Adjudication charts | data/gold/charts/gold-*.png (4 charts) |
| Adjudication script | scripts/adjudicate-gold.py (v2) |
| Experiment harness | scripts/adjudicate-gold-experiment.py |
| T5 analysis docs | docs/T5-ANALYSIS.md |
What's Next (in dependency order)
1. (Optional) Manual review of remaining 85 T5-plurality paragraphs
- 85 paragraphs resolved by signal plurality — lowest confidence tier
- 71% on the BG↔MR↔RMP triangle (irreducible ambiguity)
- 62 have weak plurality (4-5 of 9 signals)
- Could improve gold set by ~1-3% if reviewed, but with diminishing returns
2. Stage 2 re-eval on training data
- Pilot gpt-5.4-mini vs gpt-5.4 on holdout validation sample
- Run on 308 flagged Stage 1 corrections (180 materiality + 128 SPACs)
- Also run standard Stage 2 judge on existing disagreements with v3.5 prompt
3. Training data assembly
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
4. Fine-tuning + ablations
- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- Focal loss / class-weighted CE for category imbalance
- Ordinal regression (CORAL) for specificity
5. Evaluation + paper
- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
- Full GenAI benchmark table (10 models × 1,200 holdout)
- Cost/time/reproducibility comparison
- Error analysis on Tier 4 paragraphs (A-grade criterion)
- IGNITE slides (20 slides, 15s each)
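For the planned CORAL specificity head (item 4): CORAL models a K-level ordinal target with K-1 shared-weight binary outputs estimating P(y > k), and decoding counts how many 0.5 thresholds are passed. A minimal decode sketch for the 4-level specificity scale:

```python
def coral_decode(probs):
    """Decode CORAL ordinal-head outputs: probs[k] estimates
    P(y > k) for k = 0..K-2 (K=4 specificity levels, so 3 outputs).
    The predicted level is the number of 0.5 thresholds passed."""
    return sum(p > 0.5 for p in probs)
```

Because the K-1 classifiers share weights and differ only in bias, their probabilities are rank-consistent, so threshold counting yields a valid ordinal label.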
Parallel Tracks
Track A (GPU): DAPT ✓ → TAPT ✓ ─────────────────────────────→ Fine-tuning → Eval
↑
Track B (API): Opus re-run ✓─┐ │
├→ v3.5 re-run ✓ → SI paradox ✓ ───┐ │
Track C (API): 6-model bench ✓┘ │ │
Gold adjud. ✓ ┤ │
Track E (API): v3.5 prompt ✓ → S1 flags ✓ → Stage 2 re-eval ───┘───┘
Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ → Aaryan redo ✓
Key File Locations
| What | Where |
|---|---|
| Patched paragraphs | data/paragraphs/paragraphs-clean.patched.jsonl (49,795) |
| Patched annotations | data/annotations/stage1.patched.jsonl (150,009) |
| Quality scores | data/paragraphs/quality/quality-scores.jsonl (72,045) |
| Human labels (raw) | data/gold/human-labels-raw.jsonl (3,600 labels) |
| Human label metrics | data/gold/metrics.json |
| Holdout paragraphs | data/gold/paragraphs-holdout.jsonl (1,200) |
| Diagnostic charts | data/gold/charts/*.png (30 charts) |
| Opus golden labels | data/annotations/golden/opus.jsonl (1,200) |
| Benchmark annotations | data/annotations/bench-holdout/{model}.jsonl (6 × 1,200) |
| Original sampled IDs | labelapp/.sampled-ids.original.json (1,200 holdout PIDs) |
| DAPT corpus | data/dapt-corpus/shard-*.jsonl (14,756 docs) |
| DAPT checkpoint | checkpoints/dapt/modernbert-large/final/ |
| TAPT checkpoint | checkpoints/tapt/modernbert-large/final/ |
| v3.5 bench annotations | data/annotations/bench-holdout-v35/*.jsonl (7 × 359) |
| v3.5 Opus golden | data/annotations/golden-v35/opus.jsonl (359) |
| Stage 1 correction flags | data/annotations/stage1-corrections.jsonl (1,014) |
| Holdout re-run IDs | data/gold/holdout-rerun-v35.jsonl (359) |
| Analysis script | scripts/analyze-gold.py (30-chart, 13-signal analysis) |
| Data dump script | labelapp/scripts/dump-all.ts |