SEC-cyBERT/docs/STATUS.md
# Project Status — 2026-03-30
## What's Done
### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)
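As a sketch, the tier-to-weight rule above (the 1.0/0.5 values come from this doc; the function name and shape are illustrative, not the project's actual API):

```python
# Quality tier -> training sample weight, per the tiers listed above.
# Hypothetical helper; the real pipeline's API may differ.
TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def sample_weight(tier: str) -> float:
    """Return the training sample weight for a paragraph's quality tier."""
    return TIER_WEIGHTS[tier]
```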
### GenAI Labeling (Stage 1)
- [x] Prompt v2.5 locked after 12+ iterations
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- [x] 150,009 annotations completed ($115.88, 0 failures)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl`
- [x] Codebook v3.0 with 3 major rulings
### DAPT + TAPT Pre-Training
- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65 (1 epoch on 500M tokens, ~14.5h on an RTX 3090)
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/`
- [x] TAPT config: 5 epochs, whole-word masking, seq_len=512, batch=32
- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers)
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
### Documentation
- [x] `docs/DATA-QUALITY-AUDIT.md` — full audit with all patches and quality tiers
- [x] `docs/EDGAR-FILING-GENERATORS.md` — 14 generators with signatures and quality profiles
- [x] `docs/DAPT-PROCEDURE.md` — pre-flight checklist, commands, monitoring guide
- [x] `docs/NARRATIVE.md` — 11 phases documented through TAPT launch
## What's In Progress
### TAPT Training — Running
Training on the 72K Item 1C paragraphs, initialized from the DAPT checkpoint. 5 epochs, whole-word masking, seq_len=512, batch=32. Early loss: 1.46 → 1.40 over the first 1% of training. Expected ~1.6h total on an RTX 3090; expecting a final loss of ~1.0-1.2.
```bash
bun run py:train dapt --config configs/tapt/modernbert.yaml
```
### Human Labeling (139/1,200)
- 3 of 6 annotators started: 68 + 50 + 21 paragraphs completed
- Deployed via labelapp with quiz gating + warmup
- Each annotator needs 600 paragraphs (BIBD assignment)
## What's Next (in dependency order)
### 1. Fine-tuning pipeline (no blockers — can build now)
Build the dual-head classifier (7-class category + 4-class specificity) with:
- Shared ModernBERT backbone + 2 linear classification heads
- Sample weighting from quality tiers (1.0 clean/headed/minor, 0.5 degraded)
- Confidence-stratified label assembly (unanimous → majority → judge)
- Train/val/test split with stratification
- Ablation configs: base vs +DAPT vs +DAPT+TAPT
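A minimal PyTorch sketch of the dual-head design, under the assumption that pooled backbone features are handed in directly (the real model would pool ModernBERT outputs; all names here are illustrative):

```python
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    """Two linear heads (7-class category, 4-class specificity) over a
    shared representation. A stand-in for the real backbone + heads."""
    def __init__(self, hidden: int = 768, n_cat: int = 7, n_spec: int = 4):
        super().__init__()
        self.category_head = nn.Linear(hidden, n_cat)
        self.specificity_head = nn.Linear(hidden, n_spec)

    def forward(self, pooled: torch.Tensor):
        # pooled: [batch, hidden] features from the shared backbone
        return self.category_head(pooled), self.specificity_head(pooled)

def weighted_loss(cat_logits, spec_logits, cat_y, spec_y, sample_w):
    """Sum both cross-entropies, scaled per example by its quality-tier weight."""
    ce = nn.functional.cross_entropy
    per_ex = (ce(cat_logits, cat_y, reduction="none")
              + ce(spec_logits, spec_y, reduction="none"))
    return (per_ex * sample_w).mean()
```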
### 2. Judge prompt v3.0 update (no blockers — can do now)
Update `buildJudgePrompt()` with codebook v3.0 rulings:
- Materiality disclaimers → Strategy Integration
- SPACs → None/Other
- Person-vs-function test for Management↔RMP
Then re-bench against gold labels.
### 3. Judge production run (blocked on human gold labels)
Run the judge on ~409 unresolved + flagged majority cases. Validate against the expanded gold set from human labels.
### 4. Training data assembly (blocked on judge run + human labels)
Combine all annotation sources into the final training dataset:
- Unanimous Stage 1 labels (35,204 paragraphs, ~97% accuracy)
- Calibrated majority labels (~9-12K, ~85-90%)
- Judge high-confidence labels (~2-3K, ~84%)
- Judge low-confidence → downweight or exclude
- Quality tier sample weights applied
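The confidence-stratified assembly above could be sketched as follows (field names and the 0.5 downweight for low-confidence judge labels are illustrative, not the project's schema):

```python
def assemble_label(record: dict):
    """Pick the final (label, weight) for one paragraph in priority order:
    unanimous -> calibrated majority -> judge. Low-confidence judge labels
    are downweighted; unlabeled paragraphs are excluded (weight 0)."""
    if record.get("unanimous"):
        return record["label"], 1.0
    if record.get("majority"):
        return record["label"], 1.0
    judge = record.get("judge")
    if judge and judge.get("confidence") == "high":
        return judge["label"], 1.0
    if judge:
        return judge["label"], 0.5  # assumed downweight; could also exclude
    return None, 0.0
```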
### 5. Fine-tuning + ablations (blocked on steps 1-4)
7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config.
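Assuming SCL refers to a supervised contrastive loss term, the 6-run grid (the 7th run re-trains the best config) can be enumerated directly; names are illustrative:

```python
from itertools import product

BACKBONES = ["base", "dapt", "dapt_tapt"]  # pretraining variants from above
SCL = [False, True]                        # supervised contrastive loss on/off

def ablation_grid() -> list[dict]:
    """The 6 grid experiments; the final 'best config' run is chosen afterwards."""
    return [{"backbone": b, "scl": s} for b, s in product(BACKBONES, SCL)]
```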
### 6. Evaluation + paper (blocked on everything above)
Full GenAI benchmark (9 models) on 1,200 holdout. Comparison tables. Write-up.
## Parallel Tracks
```
Track A (GPU): DAPT ✓ → TAPT (running) → Fine-tuning → Eval
Track B (API): Judge v3 → Judge run ───────────┤
Track C (Human): Labeling (139/1200) → Gold set validation
Track D (Code): Fine-tune pipeline build ───────┘
```
TAPT finishes in ~1.5h. Track D (fine-tune pipeline) can proceed now. Track B can start (prompt update) but production run waits for Track C. Everything converges at fine-tuning.
## Key File Locations
| What | Where |
|------|-------|
| Patched paragraphs | `data/paragraphs/training.patched.jsonl` (49,795) |
| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT config | `python/configs/dapt/modernbert.yaml` |
| TAPT config | `python/configs/tapt/modernbert.yaml` |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| Training CLI | `python/main.py dapt --config ...` |