SEC-cyBERT/docs/STATUS.md
2026-04-02 09:28:44 -04:00


Project Status — 2026-04-02 (evening)

What's Done

Data Pipeline

  • 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
  • 14 filing generators identified, quality metrics per generator
  • 6 surgical patches applied (orphan words + heading stripping)
  • Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
  • Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
  • All data integrity rules formalized (frozen originals, UUID-linked patches)

GenAI Labeling (Stage 1)

  • Prompt v2.5 locked after 12+ iterations
  • 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
  • 150,009 annotations completed ($115.88, 0 failures)
  • Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into stage1.patched.jsonl
  • Codebook v3.0 with 3 major rulings

DAPT + TAPT Pre-Training

  • DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
  • DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
  • DAPT checkpoint at checkpoints/dapt/modernbert-large/final/
  • TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
  • TAPT checkpoint at checkpoints/tapt/modernbert-large/final/
  • Custom WholeWordMaskCollator (upstream transformers collator broken for BPE tokenizers)
  • Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
  • Procedure documented in docs/DAPT-PROCEDURE.md
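
The perplexities reported above match a base-2 conversion of the eval loss (2 raised to the loss). The short check below is only a sketch and assumes nothing about how the training script computed them. If the eval loss is the usual natural-log cross-entropy, the conventional conversion is `exp(loss)`, which is printed alongside for comparison.

```python
import math

# Eval losses as reported in this status file.
losses = {"DAPT": 0.7250, "TAPT": 1.0754}

for name, loss in losses.items():
    base2 = 2 ** loss          # reproduces the perplexities reported above
    natural = math.exp(loss)   # conventional value if the loss is in nats
    print(f"{name}: 2**loss = {base2:.2f}, exp(loss) = {natural:.2f}")
```

Whichever convention is kept, stating the log base next to the perplexity makes the numbers comparable with other papers.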

Human Labeling — Complete

  • All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
  • BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
  • Full data export: raw labels, timing, quiz sessions, metrics → data/gold/
  • Comprehensive IRR analysis → data/gold/charts/

| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | |
| Avg Cohen's κ | 0.612 | 0.440 | |
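
The balanced assignment described above (each paragraph labeled by exactly 3 of 6 annotators, 600 paragraphs per annotator) can be illustrated with a minimal sketch. It is one possible construction, not necessarily the procedure used in labelapp: it cycles through all 20 annotator triples so each triple covers 60 paragraphs. The annotator names are placeholders.

```python
from collections import Counter
from itertools import combinations

ANNOTATORS = ["A1", "A2", "A3", "A4", "A5", "A6"]  # placeholder names
N_PARAGRAPHS = 1200

# All C(6, 3) = 20 possible 3-annotator blocks.
triples = list(combinations(ANNOTATORS, 3))

# Cycle through the blocks so each one is used 1200 / 20 = 60 times.
assignment = {pid: triples[pid % len(triples)] for pid in range(N_PARAGRAPHS)}

# Paragraphs per annotator, and paragraphs shared by each annotator pair.
load = Counter(a for trio in assignment.values() for a in trio)
pair_overlap = Counter(
    pair for trio in assignment.values() for pair in combinations(trio, 2)
)

print(load)          # every annotator gets the same number of paragraphs
print(pair_overlap)  # every pair shares the same number of paragraphs
```

Equal pair overlap is what makes pairwise Cohen's κ values comparable across annotator pairs.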

Prompt v3.0

  • Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
  • Prompt version bumped from v2.5 → v3.0

GenAI Holdout Benchmark — Complete

  • 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs
  • All 1,200 annotations completed for every model (0 failures after the minimax/kimi fence-stripping fix)
  • Total benchmark cost: $45.63

| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus |
|---|---|---|---|---|
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% |
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% |
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% |
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% |
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% |
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% |
| anthropic/claude-opus-4.6 | Anthropic | $0 | | |

Together with the 3-model Stage 1 panel already on file, this gives 10 models from 8 suppliers.
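
The fence-stripping fix mentioned above addresses models that wrap their JSON answer in a Markdown code fence. A minimal sketch of that kind of cleanup follows. The function names and the `category`/`specificity` field names are hypothetical, and the actual fix in the repo may differ.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example stays fence-safe

# Optional language tag after the opening fence, then the payload, then the closing fence.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)

def strip_fences(text: str) -> str:
    """Return the payload inside a single wrapping code fence, or the trimmed text."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text.strip()

def parse_annotation(raw: str) -> dict:
    """Parse a model response that may or may not be fenced."""
    return json.loads(strip_fences(raw))

fenced = FENCE + "json\n" + '{"category": "RMP", "specificity": 2}' + "\n" + FENCE
print(parse_annotation(fenced))
```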

13-Signal Cross-Source Analysis — Complete

  • 30 diagnostic charts generated → data/gold/charts/
  • Leave-one-out analysis (no model privileged as reference)
  • Adjudication tier breakdown computed

Adjudication tiers (13 signals per paragraph):

| Tier | Count | % | Rule |
|---|---|---|---|
| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold |
| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated |
| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review |
| 4 | 202 | 16.8% | Universal disagreement → expert review |
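
The tier rules above can be sketched as a small function. This is an illustration under stated assumptions, not the logic in scripts/analyze-gold.py: each signal is a `(category, specificity)` tuple, "majority" means a strict majority, and anything not matching tiers 1 to 3 (including human and GenAI majorities that disagree) falls to tier 4.

```python
from collections import Counter

def majority(labels):
    """Return the strict-majority label, or None if no label exceeds half."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else None

def assign_tier(human, genai):
    """human: 3 labels, genai: 10 labels; each label is (category, specificity)."""
    signals = list(human) + list(genai)
    _, top_count = Counter(signals).most_common(1)[0]
    if top_count >= 10:            # 10+/13 agree on both dimensions
        return 1
    h, g = majority(human), majority(genai)
    if h is not None and h == g:   # human and GenAI majorities agree
        return 2
    if h is None and g is not None:  # humans split, GenAI converges
        return 3
    return 4                       # assumption: everything else needs expert review

A, B, C = ("RMP", 2), ("MR", 2), ("SI", 1)  # illustrative labels
print(assign_tier([A, A, B], [A] * 6 + [B] * 4))
```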

Leave-one-out ranking (each source vs majority of other 12):

| Rank | Source | Cat % | Spec % | Both % |
|---|---|---|---|---|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 |
| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 |
| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 |

Key finding: Opus earns the #1 spot in the leave-one-out ranking on its own merits, not because we designated it as gold. It disagrees with the majority of the other sources least often (7.4% odd-one-out rate on category).
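
The leave-one-out ranking can be sketched as follows: for each paragraph, each source's label is compared with the majority label of all other sources on that paragraph. This is a minimal illustration, not the code in scripts/analyze-gold.py. The tie handling (skipping paragraphs where the other sources have no unique top label) and the toy source names are assumptions.

```python
from collections import Counter

def leave_one_out_agreement(labels_by_paragraph):
    """labels_by_paragraph: list of dicts mapping source name -> label."""
    hits, totals = Counter(), Counter()
    for labels in labels_by_paragraph:
        for source, label in labels.items():
            others = [l for s, l in labels.items() if s != source]
            if not others:
                continue
            counts = Counter(others).most_common()
            # Assumption: skip the comparison when the others tie for the top label.
            if len(counts) > 1 and counts[0][1] == counts[1][1]:
                continue
            totals[source] += 1
            hits[source] += int(label == counts[0][0])
    return {s: hits[s] / totals[s] for s in totals}

toy = [
    {"a": "X", "b": "X", "c": "X", "d": "Y"},
    {"a": "X", "b": "X", "c": "Y", "d": "Y"},
]
rates = leave_one_out_agreement(toy)
print(rates)
```

Because no source is ever part of its own reference majority, a model designated as gold gets no built-in advantage in this ranking.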

What's Next (in dependency order)

1. Gold set adjudication

  • Tier 1+2 (972 paragraphs, 81%) → auto-resolved from 13-signal consensus
  • Tier 3+4 (228 paragraphs, 19%) → expert review with Opus reasoning traces
  • For Aaryan's 600 paragraphs: use the label shared by the other two annotators when they agree with each other and he disagrees

2. Training data assembly

  • Unanimous Stage 1 labels (35,204 paragraphs) → full weight
  • Calibrated majority labels (~9-12K) → full weight
  • Judge high-confidence labels (~2-3K) → full weight
  • Quality tier weights: clean/headed/minor=1.0, degraded=0.5
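
The weighting scheme above reduces to a per-paragraph lookup. A minimal sketch, assuming the quality tier is available as a string per paragraph. The function name and the `label_weight` parameter are hypothetical, and every label source listed above currently maps to 1.0.

```python
# Quality tier weights as listed above.
QUALITY_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def sample_weight(quality_tier: str, label_weight: float = 1.0) -> float:
    """Combine the label-source weight with the paragraph's quality tier weight."""
    return label_weight * QUALITY_WEIGHTS[quality_tier]

print(sample_weight("clean"), sample_weight("degraded"))
```

Keeping `label_weight` as a separate factor leaves room to down-weight a label source later without touching the quality tiers.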

3. Fine-tuning + ablations

  • 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
  • Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
  • Focal loss / class-weighted CE for category imbalance
  • Ordinal regression (CORAL) for specificity
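
For the specificity head, CORAL-style ordinal regression replaces one 4-way softmax with 3 binary "is the level greater than j?" targets. The sketch below shows only the target encoding and the decoding step in plain Python. It assumes specificity levels are indexed 0 to 3 and leaves out the shared-weight, per-threshold-bias output layer that CORAL also prescribes.

```python
NUM_LEVELS = 4  # specificity head is 4-class ordinal

def encode_ordinal(level: int) -> list[int]:
    """Level k in 0..3 -> 3 binary targets; target j answers 'is level > j?'."""
    return [1 if level > j else 0 for j in range(NUM_LEVELS - 1)]

def decode_ordinal(probs: list[float], threshold: float = 0.5) -> int:
    """Predicted level = number of thresholds whose probability exceeds the cutoff."""
    return sum(p > threshold for p in probs)

for level in range(NUM_LEVELS):
    print(level, encode_ordinal(level))
```

With this encoding, predicting level 3 for a true level 0 gets three binary targets wrong, while predicting level 1 gets only one wrong, so the loss reflects ordinal distance.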

4. Evaluation + paper

  • Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
  • Full GenAI benchmark table (10 models × 1,200 holdout)
  • Cost/time/reproducibility comparison
  • Error analysis on Tier 4 paragraphs (A-grade criterion)
  • IGNITE slides (20 slides, 15s each)
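
The macro F1 target above averages per-class F1 with equal weight per class, so rare categories count as much as common ones. A dependency-free sketch of the metric follows. The label values are illustrative, and the real evaluation may use a library implementation instead.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: F1} over every class seen in either list."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

y_true = ["RMP", "RMP", "SI", "SI", "MR"]
y_pred = ["RMP", "SI", "SI", "SI", "MR"]
print(per_class_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```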

Parallel Tracks

Track A (GPU):  DAPT ✓ → TAPT ✓ ──────────────→ Fine-tuning → Eval
                                                        ↑
Track B (API):  Opus re-run ✓─┐                         │
                              ├→ Gold adjudication ─────┤
Track C (API):  6-model bench ✓┘                        │
                                                        │
Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ─────┘

Key File Locations

| What | Where |
|---|---|
| Patched paragraphs | data/paragraphs/paragraphs-clean.patched.jsonl (49,795) |
| Patched annotations | data/annotations/stage1.patched.jsonl (150,009) |
| Quality scores | data/paragraphs/quality/quality-scores.jsonl (72,045) |
| Human labels (raw) | data/gold/human-labels-raw.jsonl (3,600 labels) |
| Human label metrics | data/gold/metrics.json |
| Holdout paragraphs | data/gold/paragraphs-holdout.jsonl (1,200) |
| Diagnostic charts | data/gold/charts/*.png (30 charts) |
| Opus golden labels | data/annotations/golden/opus.jsonl (1,200) |
| Benchmark annotations | data/annotations/bench-holdout/{model}.jsonl (6 × 1,200) |
| Original sampled IDs | labelapp/.sampled-ids.original.json (1,200 holdout PIDs) |
| DAPT corpus | data/dapt-corpus/shard-*.jsonl (14,756 docs) |
| DAPT checkpoint | checkpoints/dapt/modernbert-large/final/ |
| TAPT checkpoint | checkpoints/tapt/modernbert-large/final/ |
| Analysis script | scripts/analyze-gold.py (30-chart, 13-signal analysis) |
| Data dump script | labelapp/scripts/dump-all.ts |