SEC-cyBERT/docs/STATUS.md
2026-04-02 09:28:44 -04:00

# Project Status — 2026-04-02 (evening)
## What's Done
### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)
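
The frozen-originals rule above (originals never rewritten; fixes carried as UUID-linked patch records) might be applied roughly as below. This is a minimal sketch: the field names `uuid`, `text`, and `patched` are illustrative assumptions, not the project's actual schema.

```python
import json

def apply_patches(originals_path, patches_path, out_path):
    """Overlay UUID-linked patches onto frozen originals, non-destructively.

    Each JSONL record is assumed to carry a stable `uuid`; a patch record
    overrides the matching original's `text`. The originals file is never
    modified; the merged view is written to a new file.
    """
    patches = {}
    with open(patches_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            patches[rec["uuid"]] = rec

    with open(originals_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            rec = json.loads(line)
            patch = patches.get(rec["uuid"])
            if patch is not None:
                # Shallow-merge: keep original fields, swap in patched text.
                rec = {**rec, "text": patch["text"], "patched": True}
            dst.write(json.dumps(rec) + "\n")
```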
### GenAI Labeling (Stage 1)
- [x] Prompt v2.5 locked after 12+ iterations
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- [x] 150,009 annotations completed ($115.88, 0 failures)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl`
- [x] Codebook v3.0 with 3 major rulings
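
Reducing each paragraph's 3-model panel to a single label (the unanimous/majority split used later in training-data assembly) can be sketched as follows; the status strings are illustrative, and the real pipeline's vote-reduction logic may differ.

```python
from collections import Counter

def reduce_panel(votes):
    """Collapse one paragraph's panel votes into (label, status).

    `votes` is a list of hashable labels, e.g. (category, specificity)
    tuples. Status is 'unanimous', 'majority', or 'split'.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    if n == len(votes):
        return label, "unanimous"
    if n > len(votes) / 2:
        return label, "majority"
    return None, "split"
```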
### DAPT + TAPT Pre-Training
- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/`
- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/`
- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers)
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
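
The custom collator exists because `transformers`' upstream whole-word-mask collator keys on WordPiece `##` continuation markers, which BPE vocabularies lack. A framework-free sketch of the core selection rule (grouping sub-tokens by the fast tokenizer's `word_ids()` output, then masking whole words together) looks like this; the actual `WholeWordMaskCollator` in the repo may differ in detail:

```python
import random

def whole_word_mask(word_ids, mlm_prob=0.15, rng=None):
    """Return a boolean mask that selects whole words for MLM.

    `word_ids` maps each token position to its word index, with None for
    special tokens (the shape returned by fast tokenizers' `word_ids()`).
    All sub-tokens of a chosen word are masked together, so BPE word
    fragments are never masked in isolation.
    """
    rng = rng or random.Random()
    # Group token positions by word; special tokens are never candidates.
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            words.setdefault(wid, []).append(pos)
    mask = [False] * len(word_ids)
    n_target = max(1, round(mlm_prob * sum(len(v) for v in words.values())))
    n_masked = 0
    # Visit words in random order until roughly mlm_prob of tokens is covered.
    for wid in rng.sample(sorted(words), len(words)):
        if n_masked >= n_target:
            break
        for pos in words[wid]:
            mask[pos] = True
        n_masked += len(words[wid])
    return mask
```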
### Human Labeling — Complete
- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/`
- [x] Comprehensive IRR analysis → `data/gold/charts/`

**Inter-rater reliability (1,200 paragraphs × 3 annotators):**

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
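
The average pairwise Cohen's κ reported above can be reproduced per annotator pair with a small helper. A minimal sketch for nominal labels (the degenerate case where expected agreement equals 1 is not handled):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' aligned label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (po - pe) / (1 - pe)
```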
### Prompt v3.0
- [x] Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- [x] Prompt version bumped from v2.5 → v3.0
### GenAI Holdout Benchmark — Complete
- [x] 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs
- [x] All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix)
- [x] Total benchmark cost: $45.63

**Holdout benchmark vs Opus 4.6:**

| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus |
|-------|----------|------|---------------|----------------|
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% |
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% |
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% |
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% |
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% |
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% |
| anthropic/claude-opus-4.6 | Anthropic | $0 | — | — |

Combined with the 3-model Stage 1 panel already on file: **10 models, 8 suppliers**.
### 13-Signal Cross-Source Analysis — Complete
- [x] 30 diagnostic charts generated → `data/gold/charts/`
- [x] Leave-one-out analysis (no model privileged as reference)
- [x] Adjudication tier breakdown computed

**Adjudication tiers (13 signals per paragraph):**

| Tier | Count | % | Rule |
|------|-------|---|------|
| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold |
| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated |
| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review |
| 4 | 202 | 16.8% | Universal disagreement → expert review |
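
The tier rules in the table map onto code fairly directly. A sketch, assuming each paragraph carries 3 human and 10 model labels as (category, specificity) pairs; the real adjudication script may implement the thresholds differently:

```python
from collections import Counter

def assign_tier(human_labels, model_labels, auto_threshold=10):
    """Tier a paragraph from its 13 signals (3 human + 10 model labels).

    Rules mirror the table above: >=10/13 full agreement -> tier 1;
    human and GenAI majorities agree -> tier 2; humans split while the
    models converge on a majority -> tier 3; otherwise -> tier 4.
    """
    signals = human_labels + model_labels
    top, n = Counter(signals).most_common(1)[0]
    if n >= auto_threshold:
        return 1

    def majority(labels):
        lab, c = Counter(labels).most_common(1)[0]
        return lab if c > len(labels) / 2 else None

    hm, mm = majority(human_labels), majority(model_labels)
    if hm is not None and hm == mm:
        return 2
    if hm is None and mm is not None:
        return 3
    return 4
```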

**Leave-one-out ranking (each source vs majority of other 12):**

| Rank | Source | Cat % | Spec % | Both % |
|------|--------|-------|--------|--------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 |
| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 |
| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 |

**Key finding:** Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate).
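
The leave-one-out computation behind this ranking can be sketched as: for each source, compare its label on every paragraph against the plurality vote of the remaining sources. This is an illustrative reimplementation, not the repo's `analyze-gold.py`; ties in the plurality vote are broken arbitrarily here.

```python
from collections import Counter

def leave_one_out_agreement(labels_by_source):
    """Agreement of each source with the majority of the other sources.

    `labels_by_source` maps source name -> list of labels aligned by
    paragraph. Returns source -> fraction of paragraphs where the source
    matches the plurality vote of the remaining sources.
    """
    sources = list(labels_by_source)
    n = len(next(iter(labels_by_source.values())))
    scores = {}
    for s in sources:
        hits = 0
        for i in range(n):
            others = [labels_by_source[o][i] for o in sources if o != s]
            majority = Counter(others).most_common(1)[0][0]
            hits += labels_by_source[s][i] == majority
        scores[s] = hits / n
    return scores
```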
## What's Next (in dependency order)
### 1. Gold set adjudication
- Tier 1+2 (972 paragraphs, 81%) → auto-resolved from 13-signal consensus
- Tier 3+4 (228 paragraphs, 19%) → expert review with Opus reasoning traces
- For Aaryan's 600 paragraphs: use other-2-annotator majority when they agree and he disagrees
### 2. Training data assembly
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
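
Combining the label-provenance rules with the quality-tier weights might look like the sketch below. The source names are illustrative stand-ins for the three full-weight buckets above; labels outside those buckets are excluded (weight 0).

```python
def sample_weight(label_source, quality_tier):
    """Per-example training weight from label provenance and text quality.

    Unanimous Stage 1, calibrated-majority, and judge high-confidence
    labels all get full weight; clean/headed/minor tiers keep 1.0 while
    degraded paragraphs are down-weighted to 0.5.
    """
    source_w = {"unanimous": 1.0, "calibrated_majority": 1.0, "judge": 1.0}
    tier_w = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}
    return source_w.get(label_source, 0.0) * tier_w.get(quality_tier, 0.0)
```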
### 3. Fine-tuning + ablations
- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- Focal loss / class-weighted CE for category imbalance
- Ordinal regression (CORAL) for specificity
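
Under CORAL, the specificity head emits K-1 binary logits with shared weights, where logit k models P(y > k); the predicted level is the number of thresholds passed. A minimal decode sketch for the 4-level scale (so 3 logits, ranks 0-3):

```python
import math

def coral_predict(logits):
    """Decode a CORAL ordinal head's K-1 binary logits into a rank.

    Each logit k models P(y > k); the rank is the count of thresholds
    whose probability exceeds 0.5.
    """
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return sum(p > 0.5 for p in probs)
```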
### 4. Evaluation + paper
- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
- Full GenAI benchmark table (10 models × 1,200 holdout)
- Cost/time/reproducibility comparison
- Error analysis on Tier 4 paragraphs (A-grade criterion)
- IGNITE slides (20 slides, 15s each)
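
The macro-F1 gate above (category macro F1 must exceed 0.80) can be checked with a small helper; this is an unweighted mean of per-class F1 over classes present in the gold labels, equivalent to scikit-learn's `f1_score(..., average="macro")` for that class set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes in y_true."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); 0 when the class is never predicted.
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```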
## Parallel Tracks
```
Track A (GPU):   DAPT ✓ → TAPT ✓ ──────────────────→ Fine-tuning → Eval
Track B (API):   Opus re-run ✓ ───────────────────┐
Track C (API):   6-model bench ✓ ─────────────────┼→ Gold adjudication → Fine-tuning
Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ┘
```
## Key File Locations
| What | Where |
|------|-------|
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) |
| Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) |
| Human label metrics | `data/gold/metrics.json` |
| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) |
| Diagnostic charts | `data/gold/charts/*.png` (30 charts) |
| Opus golden labels | `data/annotations/golden/opus.jsonl` (1,200) |
| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` (6 × 1,200) |
| Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| Analysis script | `scripts/analyze-gold.py` (30-chart, 13-signal analysis) |
| Data dump script | `labelapp/scripts/dump-all.ts` |