# Project Status — 2026-04-02

## What's Done

### Data Pipeline

- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)

### GenAI Labeling (Stage 1)

- [x] Prompt v2.5 locked after 12+ iterations
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- [x] 150,009 annotations completed ($115.88, 0 failures)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl`
- [x] Codebook v3.0 with 3 major rulings

### DAPT + TAPT Pre-Training

- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/`
- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/`
- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers)
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`

### Documentation

- [x] `docs/DATA-QUALITY-AUDIT.md` — full audit with all patches and quality tiers
- [x] `docs/EDGAR-FILING-GENERATORS.md` — 14 generators with signatures and quality profiles
- [x] `docs/DAPT-PROCEDURE.md` — pre-flight checklist, commands, monitoring guide
- [x] `docs/NARRATIVE.md` — 11 phases documented through TAPT completion

## What's Done (since last update)

### Human Labeling — Complete

- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/`
- [x] Comprehensive IRR analysis with 16 diagnostic charts → `data/gold/charts/`

### Human Labeling Results

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |

**Key findings:**

- **Category is reliable (α=0.801)** — above the 0.80 threshold for reliable data
- **Specificity is unreliable (α=0.546)** — driven primarily by one outlier annotator (Aaryan, +1.28 specificity levels vs Stage 1, κ=0.03-0.25 on specificity) and genuinely hard Spec 3↔4 boundary
- **Human majority = Stage 1 majority on 83.3% of categories** — strong cross-validation
- **Same confusion axes** in humans and GenAI: MR↔RMP (#1), BG↔MR (#2), N/O↔SI (#3)
- **Excluding outlier annotator:** both-unanimous jumps from 5% → 50% on his paragraphs (+45pp)
- **Timing:** 21.5 active hours total, median 14.9s per paragraph

### Prompt v3.0

- [x] Updated `SYSTEM_PROMPT` with codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- [x] Prompt version bumped from v2.5 → v3.0

### GenAI Holdout Benchmark — In Progress

Running 6 benchmark models + Opus on the 1,200 holdout paragraphs:

| Model | Supplier | Est. Cost/call | Notes |
|-------|----------|---------------|-------|
| openai/gpt-5.4 | OpenAI | $0.009 | Structured output |
| moonshotai/kimi-k2.5 | Moonshot | $0.006 | Structured output |
| google/gemini-3.1-pro-preview | Google | $0.006 | Structured output |
| z-ai/glm-5 | Zhipu | $0.006 | Structured output, exacto routing |
| minimax/minimax-m2.7 | MiniMax | $0.002 | Raw text + fence stripping |
| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | Structured output, exacto routing |
| anthropic/claude-opus-4.6 | Anthropic | $0 (subscription) | Agent SDK, parallel workers |

Plus Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast) already on file = **10 models, 8 suppliers**.

## What's In Progress

### Opus Golden Re-Run

- Opus golden labels being re-run on the correct 1,200 holdout paragraphs (previous run was on a stale sample due to `.sampled-ids.json` being overwritten)
- Previous Opus labels (different 1,200 paragraphs) preserved at `data/annotations/golden/opus.wrong-sample.jsonl`
- Using parallelized Agent SDK workers (concurrency=20)

### GenAI Benchmark

- 6 models running on holdout with v3.0 prompt, high concurrency (200)
- Output: `data/annotations/bench-holdout/{model}.jsonl`

## What's Next (in dependency order)

### 1. Gold set adjudication (blocked on benchmark + Opus completion)

Each paragraph will have **13+ independent annotations**: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark models.
Adjudication tiers:

- **Tier 1:** 10+/13 agree → gold label, no intervention
- **Tier 2:** Human majority + GenAI consensus agree → take consensus
- **Tier 3:** Humans split, GenAI converges → expert adjudication using Opus reasoning traces
- **Tier 4:** Universal disagreement → expert adjudication with documented reasoning

### 2. Training data assembly (blocked on adjudication)

- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5

### 3. Fine-tuning + ablations (blocked on training data)

7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config.
Dual-head classifier: shared ModernBERT backbone + 2 linear classification heads.

### 4. Evaluation + paper (blocked on everything above)

Full GenAI benchmark (10 models) on 1,200 holdout. Comparison tables. Write-up. IGNITE slides.

## Parallel Tracks

```
Track A (GPU):  DAPT ✓ → TAPT ✓ ──────────────→ Fine-tuning → Eval
                                                        ↑
Track B (API):  Opus re-run ─┐                          │
                             ├→ Gold adjudication ──────┤
Track C (API):  6-model bench┘                          │
                                                        │
Track D (Human): Labeling ✓ → IRR analysis ✓ ───────────┘
```

## Key File Locations

| What | Where |
|------|-------|
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) |
| Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) |
| Human label metrics | `data/gold/metrics.json` |
| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) |
| Diagnostic charts | `data/gold/charts/*.png` (16 charts) |
| Opus golden labels | `data/annotations/golden/opus.jsonl` (re-run on correct holdout) |
| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` |
| Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT config | `python/configs/dapt/modernbert.yaml` |
| TAPT config | `python/configs/tapt/modernbert.yaml` |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| Training CLI | `python/main.py dapt --config ...` |
| Analysis script | `scripts/analyze-gold.py` |
| Data dump script | `labelapp/scripts/dump-all.ts` |
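
## Sketch: Adjudication Tier Assignment

The adjudication tiers listed under step 1 of "What's Next" could be assigned mechanically before any expert review. The sketch below is one possible reading of those rules, not the project's implementation. The function names `top_label` and `adjudication_tier` are hypothetical. It also rests on three assumptions the status notes do not state: "human majority" means at least 2 of 3 annotators, "GenAI consensus" means a strict majority of the GenAI labels, and a human majority that contradicts the GenAI consensus falls through to Tier 4.

```python
from collections import Counter


def top_label(labels, min_votes):
    """Return the most common label if it has at least min_votes, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None


def adjudication_tier(human, genai, tier1_votes=10):
    """Return (tier, label); label is None when expert adjudication is needed.

    human: labels from the 3 human annotators.
    genai: labels from the GenAI models (Stage 1 + Opus + benchmark).
    tier1_votes: the "10+/13" threshold from the tier list.
    """
    # Tier 1: overwhelming agreement across all annotations.
    overall = top_label(list(human) + list(genai), tier1_votes)
    if overall is not None:
        return 1, overall

    # Assumption: human majority = 2 of 3; GenAI consensus = strict majority.
    human_major = top_label(human, 2)
    genai_major = top_label(genai, len(genai) // 2 + 1)

    # Tier 2: human majority and GenAI consensus agree.
    if human_major is not None and human_major == genai_major:
        return 2, human_major

    # Tier 3: humans split, GenAI converges.
    if human_major is None and genai_major is not None:
        return 3, None

    # Tier 4: everything else (assumed to include human/GenAI conflicts).
    return 4, None


if __name__ == "__main__":
    humans = ["MR", "MR", "RMP"]
    models = ["MR"] * 6 + ["RMP"] * 4
    print(adjudication_tier(humans, models))
```

In this example MR has only 8 of 13 votes, so Tier 1 does not apply, but the human majority and the GenAI majority both pick MR, which places the paragraph in Tier 2.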