Project Status — 2026-04-02
What's Done
Data Pipeline
- 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- 14 filing generators identified, quality metrics per generator
- 6 surgical patches applied (orphan words + heading stripping)
- Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- All data integrity rules formalized (frozen originals, UUID-linked patches)
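The frozen-originals rule can be sketched as a pure read-side merge: originals are never edited in place, and patches are looked up by UUID at load time. This is an illustrative sketch, not the project's actual code; the field names (`uuid`, `text`) and file layout are assumptions.

```python
import json

def apply_patches(originals_path, patches_path, out_path):
    """Merge UUID-linked patches into frozen originals without mutating them.

    Assumes each original record carries a unique "uuid" field and each
    patch record carries the target "uuid" plus replacement "text".
    (Field names are hypothetical.)
    """
    patches = {}
    with open(patches_path) as f:
        for line in f:
            p = json.loads(line)
            patches[p["uuid"]] = p["text"]

    with open(originals_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            if rec["uuid"] in patches:
                # Build a patched copy; the original file is never rewritten.
                rec = {**rec, "text": patches[rec["uuid"]]}
            fout.write(json.dumps(rec) + "\n")
```

Because the merge is recomputed from the frozen file each time, a bad patch can always be reverted by deleting its record.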
GenAI Labeling (Stage 1)
- Prompt v2.5 locked after 12+ iterations
- 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- 150,009 annotations completed ($115.88, 0 failures)
- Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into stage1.patched.jsonl
- Codebook v3.0 with 3 major rulings
DAPT + TAPT Pre-Training
- DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- DAPT checkpoint at checkpoints/dapt/modernbert-large/final/
- TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- TAPT checkpoint at checkpoints/tapt/modernbert-large/final/
- Custom WholeWordMaskCollator (upstream transformers collator broken for BPE tokenizers)
- Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- Procedure documented in docs/DAPT-PROCEDURE.md
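The core idea behind a whole-word mask collator for BPE tokenizers can be sketched as follows. This is an illustrative stand-in for the custom WholeWordMaskCollator, not its actual implementation; the function name and interface are hypothetical. It assumes the per-token word indices a fast tokenizer exposes via `word_ids()` (with `None` for special tokens).

```python
import random

def whole_word_mask(word_ids, mask_prob=0.15, seed=0):
    """Select token positions to mask so that every sub-token of a word
    is masked together (whole-word masking over BPE sub-tokens).

    word_ids: per-token word index (None for special tokens), as returned
    by a fast tokenizer's word_ids(). Returns a boolean mask per token.
    """
    rng = random.Random(seed)
    # Group token positions by the word they belong to.
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            words.setdefault(wid, []).append(pos)
    mask = [False] * len(word_ids)
    for positions in words.values():
        if rng.random() < mask_prob:
            for pos in positions:  # mask every sub-token of the chosen word
                mask[pos] = True
    return mask
```

The masking decision is made once per word, not per sub-token, which is what a naive per-token collator gets wrong for BPE vocabularies.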
Documentation
- docs/DATA-QUALITY-AUDIT.md — full audit with all patches and quality tiers
- docs/EDGAR-FILING-GENERATORS.md — 14 generators with signatures and quality profiles
- docs/DAPT-PROCEDURE.md — pre-flight checklist, commands, monitoring guide
- docs/NARRATIVE.md — 11 phases documented through TAPT completion
What's Done (since last update)
Human Labeling — Complete
- All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- Full data export: raw labels, timing, quiz sessions, metrics → data/gold/
- Comprehensive IRR analysis with 16 diagnostic charts → data/gold/charts/
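One simple way to realize the BIBD constraint (each paragraph labeled by exactly 3 of 6 annotators, with balanced load) is to cycle paragraphs through all C(6,3) = 20 annotator triples. This is a hedged sketch of the idea, not necessarily the project's actual assignment procedure.

```python
from itertools import combinations, cycle

def assign_bibd(paragraph_ids, annotators):
    """Assign each paragraph a block of 3 annotators by cycling through
    all C(6,3)=20 triples. With 1,200 paragraphs this gives each annotator
    exactly 600 paragraphs and balances every annotator pairing."""
    blocks = list(combinations(annotators, 3))  # 20 blocks for 6 annotators
    assignment = {}
    block_iter = cycle(blocks)
    for pid in paragraph_ids:
        assignment[pid] = next(block_iter)
    return assignment
```

Each annotator appears in 10 of the 20 triples, so 1,200 paragraphs split as 60 per triple × 10 triples = 600 per annotator, matching the totals above.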
Human Labeling Results
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
Key findings:
- Category is reliable (α=0.801) — above the 0.80 threshold for reliable data
- Specificity is unreliable (α=0.546) — driven primarily by one outlier annotator (Aaryan, +1.28 specificity levels vs Stage 1, κ=0.03-0.25 on specificity) and genuinely hard Spec 3↔4 boundary
- Human majority = Stage 1 majority on 83.3% of categories — strong cross-validation
- Same confusion axes in humans and GenAI: MR↔RMP (#1), BG↔MR (#2), N/O↔SI (#3)
- Excluding outlier annotator: both-unanimous jumps from 5% → 50% on his paragraphs (+45pp)
- Timing: 21.5 active hours total, median 14.9s per paragraph
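For reference, the "Avg Cohen's κ" row is an average of pairwise kappas over annotator pairs on their shared paragraphs. A minimal pure-Python version of the pairwise statistic (not the project's analysis script) looks like this:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels on the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement under independent per-annotator label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Krippendorff's α generalizes this to many annotators with missing data, which is why it is the headline reliability number above.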
Prompt v3.0
- Updated SYSTEM_PROMPT with codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- Prompt version bumped from v2.5 → v3.0
GenAI Holdout Benchmark — In Progress
Running 6 benchmark models + Opus on the 1,200 holdout paragraphs:
| Model | Supplier | Est. Cost/call | Notes |
|---|---|---|---|
| openai/gpt-5.4 | OpenAI | $0.009 | Structured output |
| moonshotai/kimi-k2.5 | Moonshot | $0.006 | Structured output |
| google/gemini-3.1-pro-preview | Google | $0.006 | Structured output |
| z-ai/glm-5 | Zhipu | $0.006 | Structured output, exacto routing |
| minimax/minimax-m2.7 | MiniMax | $0.002 | Raw text + fence stripping |
| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | Structured output, exacto routing |
| anthropic/claude-opus-4.6 | Anthropic | $0 (subscription) | Agent SDK, parallel workers |
Plus Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast) already on file = 10 models, 8 suppliers.
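The per-call estimates above give a quick back-of-the-envelope budget for the 6-model run (Opus is excluded as subscription-covered). A hypothetical calculation:

```python
# Estimated per-call costs from the benchmark table above (USD).
COST_PER_CALL = {
    "openai/gpt-5.4": 0.009,
    "moonshotai/kimi-k2.5": 0.006,
    "google/gemini-3.1-pro-preview": 0.006,
    "z-ai/glm-5": 0.006,
    "minimax/minimax-m2.7": 0.002,
    "xiaomi/mimo-v2-pro": 0.006,
}
N_PARAGRAPHS = 1200  # holdout size

# One call per paragraph per model.
total = sum(cost * N_PARAGRAPHS for cost in COST_PER_CALL.values())
```

At these estimates the full 6-model pass costs on the order of $40, well under the Stage 1 spend.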
What's In Progress
Opus Golden Re-Run
- Opus golden labels being re-run on the correct 1,200 holdout paragraphs (previous run was on a stale sample due to .sampled-ids.json being overwritten)
- Previous Opus labels (different 1,200 paragraphs) preserved at data/annotations/golden/opus.wrong-sample.jsonl
- Using parallelized Agent SDK workers (concurrency=20)
GenAI Benchmark
- 6 models running on holdout with v3.0 prompt, high concurrency (200)
- Output: data/annotations/bench-holdout/{model}.jsonl
What's Next (in dependency order)
1. Gold set adjudication (blocked on benchmark + Opus completion)
Each paragraph will have 13+ independent annotations: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark models. Adjudication tiers:
- Tier 1: 10+/13 agree → gold label, no intervention
- Tier 2: Human majority + GenAI consensus agree → take consensus
- Tier 3: Humans split, GenAI converges → expert adjudication using Opus reasoning traces
- Tier 4: Universal disagreement → expert adjudication with documented reasoning
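The tier routing can be sketched as a small decision function over the 3 human and 10 GenAI labels per paragraph. This is an illustrative sketch of the plan above; thresholds and return conventions are assumptions, not the final adjudication code.

```python
from collections import Counter

def adjudication_tier(human_labels, genai_labels):
    """Route a paragraph to an adjudication tier from its 3 human and
    10 GenAI labels. Returns (tier, label_or_None)."""
    all_labels = human_labels + genai_labels
    top, top_n = Counter(all_labels).most_common(1)[0]
    if top_n >= 10:
        return 1, top  # Tier 1: 10+/13 agree -> gold label, no intervention

    human_top, human_n = Counter(human_labels).most_common(1)[0]
    genai_top, genai_n = Counter(genai_labels).most_common(1)[0]
    human_majority = human_n >= 2                       # 2 of 3 humans
    genai_consensus = genai_n > len(genai_labels) // 2  # strict GenAI majority

    if human_majority and genai_consensus and human_top == genai_top:
        return 2, human_top  # Tier 2: human majority + GenAI consensus agree
    if not human_majority and genai_consensus:
        return 3, None  # Tier 3: humans split -> expert + Opus reasoning traces
    return 4, None      # Tier 4: universal disagreement -> expert adjudication
```

Only Tier 3 and Tier 4 paragraphs reach a human expert, which is what keeps adjudication tractable at 1,200 paragraphs.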
2. Training data assembly (blocked on adjudication)
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
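The assembly rules above reduce to a per-example weight that multiplies a label-source weight by a quality-tier weight. A minimal sketch; the source-name keys are hypothetical labels for the three bullets above:

```python
def sample_weight(label_source, quality_tier):
    """Training-example weight = source weight x quality-tier weight.
    All three accepted label sources get full weight; only the 'degraded'
    quality tier is down-weighted (0.5, per the embedded-bullet rule)."""
    source_w = {
        "unanimous_stage1": 1.0,    # unanimous Stage 1 labels
        "calibrated_majority": 1.0, # calibrated majority labels
        "judge_high_conf": 1.0,     # judge high-confidence labels
    }[label_source]
    tier_w = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}[quality_tier]
    return source_w * tier_w
```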
3. Fine-tuning + ablations (blocked on training data)
7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config. Dual-head classifier: shared ModernBERT backbone + 2 linear classification heads.
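The dual-head forward pass amounts to one shared pooled representation feeding two independent linear heads. A minimal NumPy sketch (standing in for the actual PyTorch heads; head sizes are hypothetical):

```python
import numpy as np

def dual_head_forward(pooled, w_cat, b_cat, w_spec, b_spec):
    """Dual-head classifier sketch: a shared encoder output ("pooled",
    from the ModernBERT backbone) feeds two independent linear heads,
    one for category and one for specificity.

    pooled: (batch, hidden)
    w_cat:  (hidden, n_categories),  b_cat:  (n_categories,)
    w_spec: (hidden, n_spec_levels), b_spec: (n_spec_levels,)
    """
    cat_logits = pooled @ w_cat + b_cat    # category head
    spec_logits = pooled @ w_spec + b_spec  # specificity head
    return cat_logits, spec_logits
```

Both heads backpropagate into the same backbone, so the two tasks regularize each other; only the two small linear layers are task-specific.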
4. Evaluation + paper (blocked on everything above)
Full GenAI benchmark (10 models) on 1,200 holdout. Comparison tables. Write-up. IGNITE slides.
Parallel Tracks
```
Track A (GPU):   DAPT ✓ → TAPT ✓ ────────────────────→ Fine-tuning → Eval
                                                            ↑
Track B (API):   Opus re-run ─┐                             │
                              ├→ Gold adjudication ─────────┤
Track C (API):   6-model bench┘                             │
                                                            │
Track D (Human): Labeling ✓ → IRR analysis ✓ ───────────────┘
```
Key File Locations
| What | Where |
|---|---|
| Patched paragraphs | data/paragraphs/paragraphs-clean.patched.jsonl (49,795) |
| Patched annotations | data/annotations/stage1.patched.jsonl (150,009) |
| Quality scores | data/paragraphs/quality/quality-scores.jsonl (72,045) |
| Human labels (raw) | data/gold/human-labels-raw.jsonl (3,600 labels) |
| Human label metrics | data/gold/metrics.json |
| Holdout paragraphs | data/gold/paragraphs-holdout.jsonl (1,200) |
| Diagnostic charts | data/gold/charts/*.png (16 charts) |
| Opus golden labels | data/annotations/golden/opus.jsonl (re-run on correct holdout) |
| Benchmark annotations | data/annotations/bench-holdout/{model}.jsonl |
| Original sampled IDs | labelapp/.sampled-ids.original.json (1,200 holdout PIDs) |
| DAPT corpus | data/dapt-corpus/shard-*.jsonl (14,756 docs) |
| DAPT config | python/configs/dapt/modernbert.yaml |
| TAPT config | python/configs/tapt/modernbert.yaml |
| DAPT checkpoint | checkpoints/dapt/modernbert-large/final/ |
| TAPT checkpoint | checkpoints/tapt/modernbert-large/final/ |
| Training CLI | python/main.py dapt --config ... |
| Analysis script | scripts/analyze-gold.py |
| Data dump script | labelapp/scripts/dump-all.ts |