SEC-cyBERT/docs/STATUS.md
2026-04-02 02:02:36 -04:00

# Project Status — 2026-04-02
## What's Done
### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)
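The embedded-bullet detection above can be sketched with a simple heuristic. The real pipeline's rules are not documented here; this is an illustrative regex-based detector (the pattern and function name are assumptions), flagging paragraphs where list markers survived extraction into the body text:

```python
import re

# Hypothetical detector: flags lines after the first that begin with a
# bullet glyph or a short enumeration marker like "(a)" or "1.".
BULLET_RE = re.compile(r"^\s*(?:[-*\u2022\u25cf]|\(?[a-z0-9]{1,3}[.)])\s+\S")

def has_embedded_bullets(paragraph: str) -> bool:
    """True if a mid-paragraph line starts with a bullet/enumeration marker."""
    # Skip the first line: a leading marker there is a list item, not embedded.
    lines = paragraph.splitlines()[1:]
    return any(BULLET_RE.match(line) for line in lines)

print(has_embedded_bullets("Revenue grew.\n\u2022 Segment A\n\u2022 Segment B"))  # True
```

Flagged paragraphs stay in the corpus but carry the degraded tier's 0.5x sample weight.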
### GenAI Labeling (Stage 1)
- [x] Prompt v2.5 locked after 12+ iterations
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- [x] 150,009 annotations completed ($115.88, 0 failures)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl`
- [x] Codebook v3.0 with 3 major rulings
### DAPT + TAPT Pre-Training
- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/`
- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/`
- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers)
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
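The whole-word-masking fix comes down to how subwords are grouped. The upstream `transformers` collator keys on WordPiece's `##` continuation prefix, which BPE vocabularies don't use. A minimal sketch of the grouping idea (not the project's actual `WholeWordMaskCollator`), assuming a GPT-style BPE where a leading `Ġ` space marker starts a new word:

```python
import random

def whole_word_mask_indices(tokens, mask_prob=0.15, seed=0):
    """Group BPE subwords into words, then mask whole words together.

    Assumes a leading "\u0120" ('G-with-breve' space marker) starts a new
    word and continuation pieces lack it; WordPiece's "##" convention,
    which the upstream collator expects, is the opposite.
    """
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("\u0120") or not current:
            if current:
                words.append(current)
            current = [i]
        else:
            current.append(i)
    if current:
        words.append(current)

    rng = random.Random(seed)
    masked = set()
    for word in words:
        if rng.random() < mask_prob:
            masked.update(word)   # mask every piece of the word, or none
    return sorted(masked)
```

With WordPiece-style grouping applied to BPE tokens, every token looks like a word start, so "whole-word" masking silently degrades to token-level masking; the custom grouping restores the intended behavior.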
### Documentation
- [x] `docs/DATA-QUALITY-AUDIT.md` — full audit with all patches and quality tiers
- [x] `docs/EDGAR-FILING-GENERATORS.md` — 14 generators with signatures and quality profiles
- [x] `docs/DAPT-PROCEDURE.md` — pre-flight checklist, commands, monitoring guide
- [x] `docs/NARRATIVE.md` — 11 phases documented through TAPT completion
## What's New (since last update)
### Human Labeling — Complete
- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/`
- [x] Comprehensive IRR analysis with 16 diagnostic charts → `data/gold/charts/`
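The BIBD arithmetic works out cleanly: C(6,3) = 20 annotator triples, 1,200 / 20 = 60 paragraphs per triple, and each annotator sits in C(5,2) = 10 triples, hence 600 paragraphs each. A sketch of that assignment (annotator names and paragraph IDs are placeholders, not the project's):

```python
from itertools import combinations, cycle

annotators = list("ABCDEF")
triples = list(combinations(annotators, 3))   # C(6,3) = 20 blocks
paragraphs = [f"p{i:04d}" for i in range(1200)]

# Cycle through the 20 triples: each is used exactly 1200 / 20 = 60 times.
assignment = {p: t for p, t in zip(paragraphs, cycle(triples))}

# Each annotator appears in C(5,2) = 10 triples -> 10 * 60 = 600 paragraphs.
load = {a: sum(a in t for t in assignment.values()) for a in annotators}
print(load)  # {'A': 600, 'B': 600, 'C': 600, 'D': 600, 'E': 600, 'F': 600}
```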
### Human Labeling Results
| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
**Key findings:**
- **Category is reliable (α=0.801)** — above the 0.80 threshold for reliable data
- **Specificity is unreliable (α=0.546)** — driven primarily by one outlier annotator (Aaryan, +1.28 specificity levels vs Stage 1, κ=0.03-0.25 on specificity) and genuinely hard Spec 3↔4 boundary
- **Human majority = Stage 1 majority on 83.3% of categories** — strong cross-validation
- **Same confusion axes** in humans and GenAI: MR↔RMP (#1), BG↔MR (#2), N/O↔SI (#3)
- **Excluding the outlier annotator:** both-unanimous agreement on the paragraphs he labeled jumps from 5% → 50% (+45pp)
- **Timing:** 21.5 active hours total, median 14.9s per paragraph
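For reference, the pairwise Cohen's κ reported above corrects raw agreement for chance, using each annotator's marginal label distribution. A self-contained sketch on toy labels (the category codes are illustrative, not project data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' nominal labels (same items, same order)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)     # chance agreement from marginals
    return (po - pe) / (1 - pe)

r1 = ["MR", "MR", "BG", "SI", "MR", "BG"]
r2 = ["MR", "RMP", "BG", "SI", "MR", "BG"]
print(round(cohens_kappa(r1, r2), 3))  # 0.76
```

Krippendorff's α generalizes the same chance-correction idea to any number of annotators with missing data, which is why it is the headline metric for the BIBD design.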
### Prompt v3.0
- [x] Updated `SYSTEM_PROMPT` with codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- [x] Prompt version bumped from v2.5 → v3.0
### GenAI Holdout Benchmark — In Progress
Running 6 benchmark models + Opus on the 1,200 holdout paragraphs:
| Model | Supplier | Est. Cost/call | Notes |
|-------|----------|---------------|-------|
| openai/gpt-5.4 | OpenAI | $0.009 | Structured output |
| moonshotai/kimi-k2.5 | Moonshot | $0.006 | Structured output |
| google/gemini-3.1-pro-preview | Google | $0.006 | Structured output |
| z-ai/glm-5 | Zhipu | $0.006 | Structured output, exacto routing |
| minimax/minimax-m2.7 | MiniMax | $0.002 | Raw text + fence stripping |
| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | Structured output, exacto routing |
| anthropic/claude-opus-4.6 | Anthropic | $0 (subscription) | Agent SDK, parallel workers |
Plus Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast) already on file = **10 models, 8 suppliers**.
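The "raw text + fence stripping" path for minimax can be sketched as below. The exact parsing logic isn't documented here; this is a minimal version (function name assumed) that unwraps a Markdown code fence before JSON parsing:

```python
import json
import re

def strip_fences(raw: str) -> str:
    """Remove a Markdown code-fence wrapper, if present, before JSON parsing."""
    m = re.search(r"```(?:json)?\s*\n(.*?)\n```", raw, flags=re.DOTALL)
    return m.group(1) if m else raw.strip()

raw = '```json\n{"category": "MR", "specificity": 3}\n```'
print(json.loads(strip_fences(raw)))  # {'category': 'MR', 'specificity': 3}
```

Models with structured output skip this step entirely; the API guarantees parseable JSON.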
## What's In Progress
### Opus Golden Re-Run
- Opus golden labels are being re-run on the correct 1,200 holdout paragraphs (the previous run used a stale sample because `.sampled-ids.json` had been overwritten)
- Previous Opus labels (different 1,200 paragraphs) preserved at `data/annotations/golden/opus.wrong-sample.jsonl`
- Using parallelized Agent SDK workers (concurrency=20)
### GenAI Benchmark
- 6 models running on holdout with v3.0 prompt, high concurrency (200)
- Output: `data/annotations/bench-holdout/{model}.jsonl`
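The high-concurrency pattern amounts to a semaphore-bounded fan-out. A minimal sketch, assuming an asyncio-style client (the `annotate` body is a placeholder, not the real API call):

```python
import asyncio

async def annotate(pid: str, sem: asyncio.Semaphore) -> tuple[str, str]:
    """Stand-in for one model call; the real HTTP client and prompt are not shown."""
    async with sem:                # at most `concurrency` calls in flight
        await asyncio.sleep(0)     # placeholder for the request round-trip
        return pid, "MR"           # placeholder label

async def run_bench(pids, concurrency=200):
    sem = asyncio.Semaphore(concurrency)
    # gather preserves input order, so results line up with paragraph IDs
    return await asyncio.gather(*(annotate(p, sem) for p in pids))

results = asyncio.run(run_bench([f"p{i}" for i in range(5)]))
print(results)
```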
## What's Next (in dependency order)
### 1. Gold set adjudication (blocked on benchmark + Opus completion)
Each paragraph will have **13+ independent annotations**: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark models.
Adjudication tiers:
- **Tier 1:** 10+/13 agree → gold label, no intervention
- **Tier 2:** Human majority + GenAI consensus agree → take consensus
- **Tier 3:** Humans split, GenAI converges → expert adjudication using Opus reasoning traces
- **Tier 4:** Universal disagreement → expert adjudication with documented reasoning
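The tier routing above can be sketched as a single function. The 2/3 human-majority and 6/10 GenAI-consensus thresholds are assumptions for illustration; only the 10+/13 Tier 1 cutoff is stated in the plan:

```python
from collections import Counter

def adjudication_tier(human: list[str], genai: list[str]) -> int:
    """Route a paragraph to an adjudication tier (hypothetical helper).

    `human` is the 3 human labels; `genai` is the 10 model labels
    (3 Stage 1 + 1 Opus + 6 benchmark).
    """
    votes = Counter(human + genai)
    if votes.most_common(1)[0][1] >= 10:          # Tier 1: 10+/13 agree
        return 1
    h_top, h_n = Counter(human).most_common(1)[0]
    g_top, g_n = Counter(genai).most_common(1)[0]
    if h_n >= 2 and g_n >= 6 and h_top == g_top:  # Tier 2: majorities align
        return 2
    if g_n >= 6:                                  # Tier 3: GenAI converges alone
        return 3
    return 4                                      # Tier 4: universal disagreement
```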
### 2. Training data assembly (blocked on adjudication)
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
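The two weighting rules compose multiplicatively. A minimal sketch (helper name and table keys are assumptions; per the plan, all three label sources carry full weight, so the quality tier is the only differentiator for now):

```python
# Hypothetical weighting helper combining the two rules above.
SOURCE_WEIGHT = {"unanimous": 1.0, "calibrated_majority": 1.0, "judge_high_conf": 1.0}
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def sample_weight(source: str, tier: str) -> float:
    return SOURCE_WEIGHT[source] * TIER_WEIGHT[tier]

print(sample_weight("unanimous", "degraded"))  # 0.5
```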
### 3. Fine-tuning + ablations (blocked on training data)
7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config.
Dual-head classifier: shared ModernBERT backbone + 2 linear classification heads.
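The dual-head idea in miniature: two independent linear heads read the same pooled backbone features, one per label axis. A NumPy sketch of the forward pass only (dimensions are toy stand-ins: ModernBERT-large pools to 1024-d, and the 6-category / 4-level head sizes here are illustrative, not the project's actual label counts):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_cat, n_spec = 8, 6, 4   # toy sizes; see lead-in

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Two independent linear heads share the same pooled backbone features.
W_cat, b_cat = rng.normal(size=(hidden, n_cat)), np.zeros(n_cat)
W_spec, b_spec = rng.normal(size=(hidden, n_spec)), np.zeros(n_spec)

feats = rng.normal(size=(2, hidden))        # batch of 2 pooled paragraph embeddings
p_cat = softmax(feats @ W_cat + b_cat)      # (2, n_cat) category distribution
p_spec = softmax(feats @ W_spec + b_spec)   # (2, n_spec) specificity distribution
```

Training sums the two cross-entropy losses, so both heads shape the shared backbone's representations.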
### 4. Evaluation + paper (blocked on everything above)
Full GenAI benchmark (10 models) on 1,200 holdout. Comparison tables. Write-up. IGNITE slides.
## Parallel Tracks
```
Track A (GPU):   DAPT ✓ → TAPT ✓ ─────────────────────────┬→ Fine-tuning → Eval
Track B (API):   Opus re-run ──┐                          │
                               ├→ Gold adjudication ──────┤
Track C (API):   6-model bench ┘                          │
Track D (Human): Labeling ✓ → IRR analysis ✓ ─────────────┘
```
## Key File Locations
| What | Where |
|------|-------|
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) |
| Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) |
| Human label metrics | `data/gold/metrics.json` |
| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) |
| Diagnostic charts | `data/gold/charts/*.png` (16 charts) |
| Opus golden labels | `data/annotations/golden/opus.jsonl` (re-run on correct holdout) |
| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` |
| Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT config | `python/configs/dapt/modernbert.yaml` |
| TAPT config | `python/configs/tapt/modernbert.yaml` |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| Training CLI | `python/main.py dapt --config ...` |
| Analysis script | `scripts/analyze-gold.py` |
| Data dump script | `labelapp/scripts/dump-all.ts` |