SEC-cyBERT/docs/STATUS.md
2026-03-29 23:55:49 -04:00

# Project Status — 2026-03-29
## What's Done
### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)
### GenAI Labeling (Stage 1)
- [x] Prompt v2.5 locked after 12+ iterations
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- [x] 150,009 annotations completed ($115.88, 0 failures)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl`
- [x] Codebook v3.0 with 3 major rulings
### DAPT Corpus
- [x] 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] Training pipeline verified end-to-end (PyTorch 2.10, CUDA, ModernBERT loads, tokenization works)
- [x] Config: 8192 seq_len, batch=4, grad_accum=8, 1 epoch, bf16, FA2, torch.compile, 500M tokens
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
### Documentation
- [x] `docs/DATA-QUALITY-AUDIT.md` — full audit with all patches and quality tiers
- [x] `docs/EDGAR-FILING-GENERATORS.md` — 14 generators with signatures and quality profiles
- [x] `docs/DAPT-PROCEDURE.md` — pre-flight checklist, commands, monitoring guide
- [x] `docs/NARRATIVE.md` — 11 phases documented through DAPT corpus prep
## What's In Progress
### DAPT Training (~13.5h, running)
```bash
cd python && bun run py:train dapt --config configs/dapt/modernbert.yaml
```
Running on an RTX 3090: 500M tokens (newest filings), batch=4, ~27s/step, ~1,950 steps. Checkpoints are written every 256 steps (~1.9h). The run is resume-safe: press Ctrl+C and rerun the same command to continue from the last checkpoint.
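
The step count and runtime can be sanity-checked from the config values listed under DAPT Corpus. This is a rough sketch, assuming every micro-batch is packed with full 8192-token sequences; the function names are illustrative and not part of the training CLI.

```python
# Back-of-envelope DAPT run estimate from the figures quoted in this file.
# Assumption: every micro-batch is filled with full 8192-token sequences.
SEQ_LEN = 8192
MICRO_BATCH = 4
GRAD_ACCUM = 8
TOKEN_BUDGET = 500_000_000
SECONDS_PER_STEP = 27        # approximate figure quoted above
CHECKPOINT_EVERY = 256       # optimizer steps between checkpoints


def tokens_per_step(seq_len: int, micro_batch: int, grad_accum: int) -> int:
    """Tokens consumed by one optimizer step."""
    return seq_len * micro_batch * grad_accum


def estimate_run(token_budget: int, per_step: int, sec_per_step: float):
    """Return (optimizer steps, wall-clock hours) for the token budget."""
    steps = -(-token_budget // per_step)  # ceiling division
    return steps, steps * sec_per_step / 3600


per_step = tokens_per_step(SEQ_LEN, MICRO_BATCH, GRAD_ACCUM)
steps, hours = estimate_run(TOKEN_BUDGET, per_step, SECONDS_PER_STEP)
checkpoint_hours = CHECKPOINT_EVERY * SECONDS_PER_STEP / 3600

print(f"tokens/step: {per_step:,}")
print(f"steps: {steps:,}, wall clock: {hours:.1f}h")
print(f"checkpoint interval: {checkpoint_hours:.2f}h")
```

Under these assumptions the estimate is 262,144 tokens per step, about 1,908 steps, and roughly 14.3h, which is close to the rounded ~1,950 steps and ~13.5h quoted here. The checkpoint interval works out to about 1.9h.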
### Human Labeling (139/1,200)
- 3 of 6 annotators started: 68 + 50 + 21 paragraphs completed
- Deployed via labelapp with quiz gating + warmup
- Each annotator needs 600 paragraphs (BIBD assignment)
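
The workload implied by these numbers can be checked with a few lines of arithmetic. The sketch below assumes 1,200 distinct paragraphs and a balanced design in which every paragraph receives the same number of labels; the per-paragraph count is derived, not stated in this file.

```python
# Labeling workload implied by the figures above.
# Assumptions: 1,200 distinct paragraphs, 6 annotators, 600 paragraphs each,
# and a balanced design where every paragraph gets the same number of labels.
ANNOTATORS = 6
PER_ANNOTATOR = 600
PARAGRAPHS = 1_200
completed = [68, 50, 21]  # per-annotator counts reported above

total_labels = ANNOTATORS * PER_ANNOTATOR
labels_per_paragraph, remainder = divmod(total_labels, PARAGRAPHS)
done = sum(completed)

print(f"total labels needed: {total_labels}")
print(f"labels per paragraph: {labels_per_paragraph} (remainder {remainder})")
print(f"labels done: {done} ({done / total_labels:.1%} of all labels)")
```

If those assumptions hold, the design needs 3,600 individual labels (3 per paragraph), so the 139 completed labels are a little under 4% of the total annotation effort.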
## What's Next (in dependency order)
### 1. TAPT (~2-3h, blocked on DAPT)
Continue masked language modeling (MLM) on the ~72K Item 1C paragraphs, starting from the final DAPT checkpoint.
```bash
bun run py:train dapt --config configs/dapt/modernbert.yaml \
  --model-path ../checkpoints/dapt/modernbert-large/final \
  --data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \
  --output-dir ../checkpoints/tapt/modernbert-large --stage tapt
```
### 2. Fine-tuning pipeline (no blockers — can build now)
Build the dual-head classifier (7-class category + 4-class specificity) with:
- Shared ModernBERT backbone + 2 linear classification heads
- Sample weighting from quality tiers (1.0 clean/headed/minor, 0.5 degraded)
- Confidence-stratified label assembly (unanimous → majority → judge)
- Train/val/test split with stratification
- Ablation configs: base vs +DAPT vs +DAPT+TAPT
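
Two of the pieces above, tier-based sample weights and a stratified split, are small enough to sketch in plain Python. The tier weights come from this file; the 80/10/10 split fractions, the `category` and `tier` field names, and the function names are assumptions for illustration only.

```python
import random
from collections import defaultdict

# Weights as stated above: degraded paragraphs count half, other tiers fully.
TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}


def sample_weight(tier: str) -> float:
    """Look up the loss weight for a quality tier; unknown tiers raise KeyError."""
    return TIER_WEIGHTS[tier]


def stratified_split(rows, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test so each label keeps roughly the same share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[key]].append(row)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy rows; the "category" and "tier" field names are hypothetical.
rows = [
    {"id": i, "category": f"cat{i % 2}", "tier": "degraded" if i % 5 == 0 else "clean"}
    for i in range(20)
]
train, val, test = stratified_split(rows, key="category")
weights = [sample_weight(r["tier"]) for r in train]
```

The weights list lines up with the training rows and could be passed to a weighted loss. Splitting per label before shuffling is what keeps rare categories represented in all three splits.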
### 3. Judge prompt v3.0 update (no blockers — can do now)
Update `buildJudgePrompt()` with codebook v3.0 rulings:
- Materiality disclaimers → Strategy Integration
- SPACs → None/Other
- Person-vs-function test for Management↔RMP
Then re-benchmark against the gold labels.
### 4. Training data assembly (blocked on judge + human labels)
Combine all annotation sources into final training dataset:
- Unanimous Stage 1 labels (35,204 paragraphs, ~97% accuracy)
- Calibrated majority labels (~9-12K, ~85-90%)
- Judge high-confidence labels (~2-3K, ~84%)
- Judge low-confidence → downweight or exclude
- Quality tier sample weights applied
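
The assembly order above can be sketched as a small resolver. This is a simplified illustration, not the project's implementation: it ignores calibration and flagging of majority labels, and the `SOURCE_WEIGHTS` values (including dropping low-confidence judge labels entirely) are hypothetical, since this file only says those labels are downweighted or excluded.

```python
from collections import Counter

# Hypothetical per-source weights; only the ordering of sources is from this file.
SOURCE_WEIGHTS = {"unanimous": 1.0, "majority": 1.0, "judge_high": 1.0, "judge_low": 0.0}
# Quality tier weights as listed in the fine-tuning section.
TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}


def assemble_label(votes, judge=None):
    """Resolve one paragraph: unanimous, then majority, then judge.

    votes: the Stage 1 labels from the model panel.
    judge: optional (label, is_high_confidence) tuple.
    Returns (label, source), or None when nothing resolves the paragraph.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count == len(votes):
        return label, "unanimous"
    if count > len(votes) / 2:
        return label, "majority"
    if judge is not None:
        judge_label, high_conf = judge
        return judge_label, ("judge_high" if high_conf else "judge_low")
    return None


def training_weight(source: str, tier: str) -> float:
    """Combine the source weight with the quality tier weight."""
    return SOURCE_WEIGHTS[source] * TIER_WEIGHTS[tier]


resolved = assemble_label(["Management", "Management", "RMP"])
if resolved is not None:
    label, source = resolved
    print(label, source, training_weight(source, "degraded"))
```

Keeping the source tag next to each label makes it possible to change the downweighting policy later without re-resolving the votes.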
### 5. Judge production run (blocked on human gold labels)
Run the judge on the ~409 unresolved + flagged majority cases, then validate the results against the expanded gold set built from the human labels.
### 6. Fine-tuning + ablations (blocked on steps 1-4)
7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config.
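
The experiment count follows from a 3 × 2 grid plus one extra run. A minimal sketch of the enumeration, where the dictionary keys and the `"best"` placeholder are illustrative and not taken from any config in the repo:

```python
from itertools import product

BACKBONES = ["base", "+DAPT", "+DAPT+TAPT"]
SCL = [False, True]

# Six grid cells: every backbone with and without SCL.
grid = [{"backbone": b, "scl": s} for b, s in product(BACKBONES, SCL)]
# Seventh run: the best configuration, unknown until the grid has been evaluated.
experiments = grid + [{"backbone": "best", "scl": None}]

for i, exp in enumerate(experiments, start=1):
    print(i, exp)
```

Generating the grid from two lists keeps the ablation set in one place if another backbone or training option is added later.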
### 7. Evaluation + paper (blocked on everything above)
Full GenAI benchmark (9 models) on 1,200 holdout. Comparison tables. Write-up.
## Parallel Tracks
```
Track A (GPU):   DAPT ──→ TAPT ──→ Fine-tuning ──→ Eval
Track B (API):   Judge v3 → Judge run ────┤
Track C (Human): Labeling (139/1200) → Gold set validation
Track D (Code):  Fine-tune pipeline build ┘
```
Tracks A and D can proceed now. Track B can start with the prompt update, but its production run waits on Track C. Everything converges at fine-tuning.
## Key File Locations
| What | Where |
|------|-------|
| Patched paragraphs | `data/paragraphs/training.patched.jsonl` (49,795) |
| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT config | `python/configs/dapt/modernbert.yaml` |
| Training CLI | `python/main.py dapt --config ...` |