SEC-cyBERT/docs/STATUS.md
2026-03-31 16:27:47 -04:00


Project Status — 2026-03-30

What's Done

Data Pipeline

  • 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
  • 14 filing generators identified, quality metrics per generator
  • 6 surgical patches applied (orphan words + heading stripping)
  • Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
  • Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
  • All data integrity rules formalized (frozen originals, UUID-linked patches)
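
The tier-to-weight mapping above can be sketched as a small lookup. This is an illustrative sketch, not the project's actual code: the `quality_tier` field name and `sample_weight` helper are assumptions; only the tier names and the 0.5x degraded weight come from the audit.

```python
# Hypothetical sketch of quality-tier sample weighting.
# Tier names and the 0.5x degraded weight are from the audit above;
# the field and function names are illustrative.
TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def sample_weight(paragraph: dict) -> float:
    """Return the training sample weight for a paragraph record."""
    return TIER_WEIGHTS.get(paragraph.get("quality_tier"), 1.0)
```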

GenAI Labeling (Stage 1)

  • Prompt v2.5 locked after 12+ iterations
  • 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
  • 150,009 annotations completed ($115.88, 0 failures)
  • Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into stage1.patched.jsonl
  • Codebook v3.0 with 3 major rulings

DAPT + TAPT Pre-Training

  • DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
  • DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
  • DAPT checkpoint at checkpoints/dapt/modernbert-large/final/
  • TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
  • TAPT checkpoint at checkpoints/tapt/modernbert-large/final/
  • Custom WholeWordMaskCollator (upstream transformers collator broken for BPE tokenizers)
  • Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
  • Procedure documented in docs/DAPT-PROCEDURE.md
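
The idea behind the custom collator can be sketched in a few lines: group BPE subwords into words (a token starting with the "Ġ" word-boundary marker opens a new word), then mask every subword of a selected word together. This is a simplified illustration of the technique, not the project's `WholeWordMaskCollator` — the function name, masking rate, and mask token are assumptions.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words: every subword of a selected word is replaced together."""
    rng = random.Random(seed)
    # Group token indices into words: a token starts a new word when it
    # carries the BPE word-boundary marker "Ġ".
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("Ġ") and current:
            words.append(current)
            current = []
        current.append(i)
    if current:
        words.append(current)
    # Select words (not tokens) for masking, so subwords stay consistent.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked
```

The key difference from per-token masking is the selection unit: sampling happens over word groups, so a word like "Ġwor"/"ld" is either fully masked or fully visible.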

Documentation

  • docs/DATA-QUALITY-AUDIT.md — full audit with all patches and quality tiers
  • docs/EDGAR-FILING-GENERATORS.md — 14 generators with signatures and quality profiles
  • docs/DAPT-PROCEDURE.md — pre-flight checklist, commands, monitoring guide
  • docs/NARRATIVE.md — 11 phases documented through TAPT completion

What's In Progress

Human Labeling (139/1,200)

  • 3 of 6 annotators started: 68 + 50 + 21 paragraphs completed
  • Deployed via labelapp with quiz gating + warmup
  • Each annotator needs 600 paragraphs (BIBD assignment)
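
The arithmetic behind the assignment: 1,200 paragraphs × 3 annotators each ÷ 6 annotators = 600 per annotator. A minimal sketch of one balanced scheme, cycling through all C(6,3) = 20 annotator triples — the real BIBD assignment may differ; the function and names here are illustrative:

```python
from itertools import combinations

def assign(paragraph_ids, annotators):
    """Assign each paragraph to 3 of 6 annotators, cycling over all triples."""
    triples = list(combinations(annotators, 3))  # 20 triples for 6 annotators
    return {pid: triples[i % len(triples)] for i, pid in enumerate(paragraph_ids)}
```

With 1,200 paragraphs, each triple is used 60 times and each annotator (appearing in 10 triples) receives exactly 600 paragraphs.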

What's Next (in dependency order)

1. Fine-tuning pipeline (no blockers — can build now)

Build the dual-head classifier (7-class category + 4-class specificity) with:

  • Shared ModernBERT backbone + 2 linear classification heads
  • Sample weighting from quality tiers (1.0 clean/headed/minor, 0.5 degraded)
  • Confidence-stratified label assembly (unanimous → majority → judge)
  • Train/val/test split with stratification
  • Ablation configs: base vs +DAPT vs +DAPT+TAPT
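
The head structure and sample-weighted loss above can be sketched as follows. A toy linear encoder stands in for the ModernBERT backbone; only the head sizes (7 categories, 4 specificity levels) and the tier-weighting idea come from the plan — everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    """Shared encoder feeding two independent linear classification heads."""
    def __init__(self, backbone, hidden=768, n_cat=7, n_spec=4):
        super().__init__()
        self.backbone = backbone            # shared encoder (stand-in here)
        self.cat_head = nn.Linear(hidden, n_cat)
        self.spec_head = nn.Linear(hidden, n_spec)

    def forward(self, x):
        h = self.backbone(x)                # (batch, hidden) pooled output
        return self.cat_head(h), self.spec_head(h)

def weighted_loss(cat_logits, spec_logits, cat_y, spec_y, sample_w):
    """Sum per-sample cross-entropy of both heads, scaled by quality-tier weight."""
    ce = nn.CrossEntropyLoss(reduction="none")
    per_sample = ce(cat_logits, cat_y) + ce(spec_logits, spec_y)
    return (per_sample * sample_w).mean()
```

Using `reduction="none"` keeps a loss per paragraph, so the 0.5 weight for degraded-tier samples can be applied before averaging.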

2. Judge prompt v3.0 update (no blockers — can do now)

Update buildJudgePrompt() with codebook v3.0 rulings:

  • Materiality disclaimers → Strategy Integration
  • SPACs → None/Other
  • Person-vs-function test for Management↔RMP

Then re-bench against gold labels.

3. Training data assembly (blocked on judge + human labels)

Combine all annotation sources into final training dataset:

  • Unanimous Stage 1 labels (35,204 paragraphs, ~97% accuracy)
  • Calibrated majority labels (~9-12K, ~85-90%)
  • Judge high-confidence labels (~2-3K, ~84%)
  • Judge low-confidence → downweight or exclude
  • Quality tier sample weights applied
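
A minimal sketch of the confidence-stratified assembly, combining a per-source trust weight with the quality-tier weight. The source names and weight values here are assumptions loosely based on the accuracy estimates above, not the project's actual numbers:

```python
# Hypothetical per-source trust weights (illustrative values).
SOURCE_WEIGHT = {
    "unanimous": 1.0,     # Stage 1 consensus, ~97% accurate
    "majority": 0.9,      # calibrated majority vote, ~85-90%
    "judge_high": 0.85,   # judge high-confidence, ~84%
}

def assemble(records):
    """Keep records from trusted sources; combine source and tier weights."""
    out = []
    for r in records:
        src_w = SOURCE_WEIGHT.get(r["source"])
        if src_w is None:          # e.g. judge low-confidence: exclude
            continue
        tier_w = 0.5 if r.get("quality_tier") == "degraded" else 1.0
        out.append({**r, "weight": src_w * tier_w})
    return out
```

Low-confidence judge labels simply drop out here; downweighting instead would mean adding them to the map with a small weight.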

4. Judge production run (blocked on human gold labels)

Run judge on ~409 unresolved + flagged majority cases. Validate against expanded gold set from human labels.

5. Fine-tuning + ablations (blocked on steps 1-3)

7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config.
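
The grid expands to 3 checkpoints × 2 contrastive settings = 6 runs, plus a final run of the winning configuration. A sketch with illustrative names:

```python
from itertools import product

checkpoints = ["base", "dapt", "dapt_tapt"]
experiments = [
    {"checkpoint": c, "scl": s}            # scl: supervised contrastive loss on/off
    for c, s in product(checkpoints, [False, True])
]
experiments.append({"checkpoint": "best", "scl": None})  # best-config re-run
```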

6. Evaluation + paper (blocked on everything above)

Full GenAI benchmark (9 models) on 1,200 holdout. Comparison tables. Write-up.

Parallel Tracks

Track A (GPU):  DAPT ✓ → TAPT ✓ → Fine-tuning → Eval
                                                ↑
Track B (API):  Judge v3 → Judge run ───────────┤
                                                ↑
Track C (Human): Labeling (139/1200) → Gold set validation
                                                ↑
Track D (Code): Fine-tune pipeline build ───────┘

DAPT + TAPT complete. Track D (fine-tune pipeline) can proceed now. Track B can start (prompt update) but production run waits for Track C. Everything converges at fine-tuning.

Key File Locations

What                  Where
Patched paragraphs    data/paragraphs/training.patched.jsonl (49,795)
Patched annotations   data/annotations/stage1.patched.jsonl (150,009)
Quality scores        data/paragraphs/quality/quality-scores.jsonl (72,045)
DAPT corpus           data/dapt-corpus/shard-*.jsonl (14,756 docs)
DAPT config           python/configs/dapt/modernbert.yaml
TAPT config           python/configs/tapt/modernbert.yaml
DAPT checkpoint       checkpoints/dapt/modernbert-large/final/
Training CLI          python/main.py dapt --config ...