Project Status — 2026-04-02
What's Done
Data Pipeline
- 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- 14 filing generators identified, quality metrics per generator
- 6 surgical patches applied (orphan words + heading stripping)
- Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- All data integrity rules formalized (frozen originals, UUID-linked patches)
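The frozen-originals rule can be sketched as a pure read-side merge: originals are never edited in place, and patches are looked up by UUID at load time. This is an illustrative sketch, not the project's actual code; the field names (`uuid`, `text`) and file layout are assumptions.

```python
import json

def apply_patches(originals_path, patches_path, out_path):
    """Merge UUID-linked patches into frozen originals without mutating them.

    Assumes each original record carries a unique "uuid" field and each
    patch record carries the target "uuid" plus replacement "text".
    (Field names are hypothetical.)
    """
    patches = {}
    with open(patches_path) as f:
        for line in f:
            p = json.loads(line)
            patches[p["uuid"]] = p["text"]

    with open(originals_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            if rec["uuid"] in patches:
                # Build a patched copy; the original file is never rewritten.
                rec = {**rec, "text": patches[rec["uuid"]]}
            fout.write(json.dumps(rec) + "\n")
```

Because the merge is recomputed from the frozen file each time, a bad patch can always be reverted by deleting its record.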
GenAI Labeling (Stage 1)
- Prompt v2.5 locked after 12+ iterations
- 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- 150,009 annotations completed ($115.88, 0 failures)
- Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into stage1.patched.jsonl
- Codebook v3.0 with 3 major rulings
DAPT + TAPT Pre-Training
- DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- DAPT checkpoint at checkpoints/dapt/modernbert-large/final/
- TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- TAPT checkpoint at checkpoints/tapt/modernbert-large/final/
- Custom WholeWordMaskCollator (upstream transformers collator broken for BPE tokenizers)
- Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- Procedure documented in docs/DAPT-PROCEDURE.md
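The core idea behind a whole-word mask collator for BPE tokenizers can be sketched as follows. This is an illustrative stand-in for the custom WholeWordMaskCollator, not its actual implementation; the function name and interface are hypothetical. It assumes the per-token word indices a fast tokenizer exposes via `word_ids()` (with `None` for special tokens).

```python
import random

def whole_word_mask(word_ids, mask_prob=0.15, seed=0):
    """Select token positions to mask so that every sub-token of a word
    is masked together (whole-word masking over BPE sub-tokens).

    word_ids: per-token word index (None for special tokens), as returned
    by a fast tokenizer's word_ids(). Returns a boolean mask per token.
    """
    rng = random.Random(seed)
    # Group token positions by the word they belong to.
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            words.setdefault(wid, []).append(pos)
    mask = [False] * len(word_ids)
    for positions in words.values():
        if rng.random() < mask_prob:
            for pos in positions:  # mask every sub-token of the chosen word
                mask[pos] = True
    return mask
```

The masking decision is made once per word, not per sub-token, which is what a naive per-token collator gets wrong for BPE vocabularies.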
Documentation
- docs/DATA-QUALITY-AUDIT.md — full audit with all patches and quality tiers
- docs/EDGAR-FILING-GENERATORS.md — 14 generators with signatures and quality profiles
- docs/DAPT-PROCEDURE.md — pre-flight checklist, commands, monitoring guide
- docs/NARRATIVE.md — 11 phases documented through TAPT completion
What's Done (since last update)
Human Labeling — Complete
- All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- Full data export: raw labels, timing, quiz sessions, metrics → data/gold/
- Comprehensive IRR analysis with 16 diagnostic charts → data/gold/charts/
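One simple way to realize the BIBD constraint (each paragraph labeled by exactly 3 of 6 annotators, with balanced load) is to cycle paragraphs through all C(6,3) = 20 annotator triples. This is a hedged sketch of the idea, not necessarily the project's actual assignment procedure.

```python
from itertools import combinations, cycle

def assign_bibd(paragraph_ids, annotators):
    """Assign each paragraph a block of 3 annotators by cycling through
    all C(6,3)=20 triples. With 1,200 paragraphs this gives each annotator
    exactly 600 paragraphs and balances every annotator pairing."""
    blocks = list(combinations(annotators, 3))  # 20 blocks for 6 annotators
    assignment = {}
    block_iter = cycle(blocks)
    for pid in paragraph_ids:
        assignment[pid] = next(block_iter)
    return assignment
```

Each annotator appears in 10 of the 20 triples, so 1,200 paragraphs split as 60 per triple × 10 triples = 600 per annotator, matching the totals above.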
Human Labeling Results
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
Key findings:
- Category is reliable (α=0.801) — above the 0.80 threshold for reliable data
- Specificity is unreliable (α=0.546) — driven primarily by one outlier annotator (Aaryan, +1.28 specificity levels vs Stage 1, κ=0.03-0.25 on specificity) and genuinely hard Spec 3↔4 boundary
- Human majority = Stage 1 majority on 83.3% of categories — strong cross-validation
- Same confusion axes in humans and GenAI: MR↔RMP (#1), BG↔MR (#2), N/O↔SI (#3)
- Excluding outlier annotator: both-unanimous jumps from 5% → 50% on his paragraphs (+45pp)
- Timing: 21.5 active hours total, median 14.9s per paragraph
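For reference, the "Avg Cohen's κ" row is an average of pairwise kappas over annotator pairs on their shared paragraphs. A minimal pure-Python version of the pairwise statistic (not the project's analysis script) looks like this:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels on the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement under independent per-annotator label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Krippendorff's α generalizes this to many annotators with missing data, which is why it is the headline reliability number above.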
Prompt v3.0
- Updated SYSTEM_PROMPT with codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- Prompt version bumped from v2.5 → v3.0
GenAI Holdout Benchmark — In Progress
Running 6 benchmark models + Opus on the 1,200 holdout paragraphs:
| Model | Supplier | Est. Cost/call | Notes |
|---|---|---|---|
| openai/gpt-5.4 | OpenAI | $0.009 | Structured output |
| moonshotai/kimi-k2.5 | Moonshot | $0.006 | Structured output |
| google/gemini-3.1-pro-preview | Google | $0.006 | Structured output |
| z-ai/glm-5 | Zhipu | $0.006 | Structured output, exacto routing |
| minimax/minimax-m2.7 | MiniMax | $0.002 | Raw text + fence stripping |
| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | Structured output, exacto routing |
| anthropic/claude-opus-4.6 | Anthropic | $0 (subscription) | Agent SDK, parallel workers |
Plus Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast) already on file = 10 models, 8 suppliers.
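The per-call estimates above give a quick back-of-the-envelope budget for the 6-model run (Opus is excluded as subscription-covered). A hypothetical calculation:

```python
# Estimated per-call costs from the benchmark table above (USD).
COST_PER_CALL = {
    "openai/gpt-5.4": 0.009,
    "moonshotai/kimi-k2.5": 0.006,
    "google/gemini-3.1-pro-preview": 0.006,
    "z-ai/glm-5": 0.006,
    "minimax/minimax-m2.7": 0.002,
    "xiaomi/mimo-v2-pro": 0.006,
}
N_PARAGRAPHS = 1200  # holdout size

# One call per paragraph per model.
total = sum(cost * N_PARAGRAPHS for cost in COST_PER_CALL.values())
```

At these estimates the full 6-model pass costs on the order of $40, well under the Stage 1 spend.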
What's In Progress
Opus Golden Re-Run
- Opus golden labels being re-run on the correct 1,200 holdout paragraphs (previous run was on a stale sample due to .sampled-ids.json being overwritten)
- Previous Opus labels (different 1,200 paragraphs) preserved at data/annotations/golden/opus.wrong-sample.jsonl
- Using parallelized Agent SDK workers (concurrency=20)
GenAI Benchmark
- 6 models running on holdout with v3.0 prompt, high concurrency (200)
- Output: data/annotations/bench-holdout/{model}.jsonl
What's Next (in dependency order)
1. Gold set adjudication (blocked on benchmark + Opus completion)
Each paragraph will have 13+ independent annotations: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark models. Adjudication tiers:
- Tier 1: 10+/13 agree → gold label, no intervention
- Tier 2: Human majority + GenAI consensus agree → take consensus
- Tier 3: Humans split, GenAI converges → expert adjudication using Opus reasoning traces
- Tier 4: Universal disagreement → expert adjudication with documented reasoning
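The tier routing can be sketched as a small decision function over the 3 human and 10 GenAI labels per paragraph. This is an illustrative sketch of the plan above; thresholds and return conventions are assumptions, not the final adjudication code.

```python
from collections import Counter

def adjudication_tier(human_labels, genai_labels):
    """Route a paragraph to an adjudication tier from its 3 human and
    10 GenAI labels. Returns (tier, label_or_None)."""
    all_labels = human_labels + genai_labels
    top, top_n = Counter(all_labels).most_common(1)[0]
    if top_n >= 10:
        return 1, top  # Tier 1: 10+/13 agree -> gold label, no intervention

    human_top, human_n = Counter(human_labels).most_common(1)[0]
    genai_top, genai_n = Counter(genai_labels).most_common(1)[0]
    human_majority = human_n >= 2                       # 2 of 3 humans
    genai_consensus = genai_n > len(genai_labels) // 2  # strict GenAI majority

    if human_majority and genai_consensus and human_top == genai_top:
        return 2, human_top  # Tier 2: human majority + GenAI consensus agree
    if not human_majority and genai_consensus:
        return 3, None  # Tier 3: humans split -> expert + Opus reasoning traces
    return 4, None      # Tier 4: universal disagreement -> expert adjudication
```

Only Tier 3 and Tier 4 paragraphs reach a human expert, which is what keeps adjudication tractable at 1,200 paragraphs.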
2. Training data assembly (blocked on adjudication)
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
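The assembly rules above reduce to a per-example weight that multiplies a label-source weight by a quality-tier weight. A minimal sketch; the source-name keys are hypothetical labels for the three bullets above:

```python
def sample_weight(label_source, quality_tier):
    """Training-example weight = source weight x quality-tier weight.
    All three accepted label sources get full weight; only the 'degraded'
    quality tier is down-weighted (0.5, per the embedded-bullet rule)."""
    source_w = {
        "unanimous_stage1": 1.0,    # unanimous Stage 1 labels
        "calibrated_majority": 1.0, # calibrated majority labels
        "judge_high_conf": 1.0,     # judge high-confidence labels
    }[label_source]
    tier_w = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}[quality_tier]
    return source_w * tier_w
```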
3. Fine-tuning + ablations (blocked on training data)
7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config. Dual-head classifier: shared ModernBERT backbone + 2 linear classification heads.
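The dual-head forward pass amounts to one shared pooled representation feeding two independent linear heads. A minimal NumPy sketch (standing in for the actual PyTorch heads; head sizes are hypothetical):

```python
import numpy as np

def dual_head_forward(pooled, w_cat, b_cat, w_spec, b_spec):
    """Dual-head classifier sketch: a shared encoder output ("pooled",
    from the ModernBERT backbone) feeds two independent linear heads,
    one for category and one for specificity.

    pooled: (batch, hidden)
    w_cat:  (hidden, n_categories),  b_cat:  (n_categories,)
    w_spec: (hidden, n_spec_levels), b_spec: (n_spec_levels,)
    """
    cat_logits = pooled @ w_cat + b_cat    # category head
    spec_logits = pooled @ w_spec + b_spec  # specificity head
    return cat_logits, spec_logits
```

Both heads backpropagate into the same backbone, so the two tasks regularize each other; only the two small linear layers are task-specific.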
4. Evaluation + paper (blocked on everything above)
Full GenAI benchmark (10 models) on 1,200 holdout. Comparison tables. Write-up. IGNITE slides.
Parallel Tracks
```
Track A (GPU):   DAPT ✓ → TAPT ✓ ────────────────────→ Fine-tuning → Eval
                                                            ↑
Track B (API):   Opus re-run ─┐                             │
                              ├→ Gold adjudication ─────────┤
Track C (API):   6-model bench┘                             │
                                                            │
Track D (Human): Labeling ✓ → IRR analysis ✓ ───────────────┘
```
Key File Locations
| What | Where |
|---|---|
| Patched paragraphs | data/paragraphs/paragraphs-clean.patched.jsonl (49,795) |
| Patched annotations | data/annotations/stage1.patched.jsonl (150,009) |
| Quality scores | data/paragraphs/quality/quality-scores.jsonl (72,045) |
| Human labels (raw) | data/gold/human-labels-raw.jsonl (3,600 labels) |
| Human label metrics | data/gold/metrics.json |
| Holdout paragraphs | data/gold/paragraphs-holdout.jsonl (1,200) |
| Diagnostic charts | data/gold/charts/*.png (16 charts) |
| Opus golden labels | data/annotations/golden/opus.jsonl (re-run on correct holdout) |
| Benchmark annotations | data/annotations/bench-holdout/{model}.jsonl |
| Original sampled IDs | labelapp/.sampled-ids.original.json (1,200 holdout PIDs) |
| DAPT corpus | data/dapt-corpus/shard-*.jsonl (14,756 docs) |
| DAPT config | python/configs/dapt/modernbert.yaml |
| TAPT config | python/configs/tapt/modernbert.yaml |
| DAPT checkpoint | checkpoints/dapt/modernbert-large/final/ |
| TAPT checkpoint | checkpoints/tapt/modernbert-large/final/ |
| Training CLI | python/main.py dapt --config ... |
| Analysis script | scripts/analyze-gold.py |
| Data dump script | labelapp/scripts/dump-all.ts |