# Project Status — 2026-04-02 (evening)

## What's Done

### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight)
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches)

### GenAI Labeling (Stage 1)
- [x] Prompt v2.5 locked after 12+ iterations
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast
- [x] 150,009 annotations completed ($115.88, 0 failures)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl`
- [x] Codebook v3.0 with 3 major rulings

### DAPT + TAPT Pre-Training
- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/`
- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/`
- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers)
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
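As a quick check on the reported numbers, the sketch below recomputes perplexity from the two eval losses. The reported values (1.65 and 2.11) match `2 ** loss`, i.e. a loss read in bits. If the losses are in nats (the usual convention for cross-entropy in PyTorch-based trainers, which is an assumption here because the notes do not state the units), `exp(loss)` would give roughly 2.06 and 2.93 instead.

```python
import math

# Eval losses as reported in the status notes above.
EVAL_LOSS = {"DAPT": 0.7250, "TAPT": 1.0754}


def perplexity(loss: float, base: float = math.e) -> float:
    """Perplexity is base ** loss; the base has to match the units of the loss."""
    return base ** loss


for stage, loss in EVAL_LOSS.items():
    as_bits = perplexity(loss, base=2.0)  # loss interpreted in bits
    as_nats = perplexity(loss)            # loss interpreted in nats
    print(f"{stage}: 2**loss = {as_bits:.2f}, exp(loss) = {as_nats:.2f}")
```

Either reading preserves the ordering of the two stages; only the absolute perplexity values change.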
### Human Labeling — Complete
- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/`
- [x] Comprehensive IRR analysis → `data/gold/charts/`

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | n/a |
| Avg Cohen's κ | 0.612 | 0.440 | n/a |
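The assignment numbers above can be sanity-checked with a short sketch. It assumes the design cycles through all 20 possible 3-annotator subsets equally (the complete design, which is trivially balanced). The notes do not say which block set was actually used, and a 10-block 2-(6,3,2) design would give the same 600-paragraph load per annotator. The annotator IDs are placeholders.

```python
from collections import Counter
from itertools import combinations

ANNOTATORS = ["A1", "A2", "A3", "A4", "A5", "A6"]  # placeholder IDs, not the real names
N_PARAGRAPHS = 1200


def assign(n_paragraphs, annotators, k=3):
    """Cycle through every k-subset of annotators, one subset per paragraph."""
    blocks = list(combinations(annotators, k))  # 6 choose 3 = 20 triples
    return [blocks[i % len(blocks)] for i in range(n_paragraphs)]


assignment = assign(N_PARAGRAPHS, ANNOTATORS)

# Paragraphs per annotator, and paragraphs shared by each pair of annotators.
load = Counter(a for block in assignment for a in block)
pair_overlap = Counter(p for block in assignment for p in combinations(block, 2))

print(dict(load))
print(set(pair_overlap.values()))
```

Under this assumption every annotator labels 600 paragraphs (3,600 labels in total) and every pair of annotators shares 240 paragraphs, which is what makes pairwise Cohen's κ comparable across pairs.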

### Prompt v3.0
- [x] Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- [x] Prompt version bumped from v2.5 → v3.0
### GenAI Holdout Benchmark — Complete
- [x] 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs
- [x] All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix)
- [x] Total benchmark cost: $45.63

| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus |
|-------|----------|------|---------------|----------------|
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% |
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% |
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% |
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% |
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% |
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% |
| anthropic/claude-opus-4.6 | Anthropic | $0 | n/a | n/a |

Together with the 3-model Stage 1 panel already on file, that makes **10 models from 8 suppliers**.
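The fence-stripping fix mentioned above is not shown in this document. A minimal sketch of one way to do it, assuming the affected models wrapped their JSON reply in a Markdown code fence. The function names and the `category` / `specificity` fields are hypothetical.

```python
import json
import re

# A Markdown code fence, built indirectly so this snippet never contains one literally.
TICKS = "`" * 3

# Optional language tag after the opening fence, then the payload, then the closing fence.
_FENCE = re.compile(
    r"^\s*" + TICKS + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + TICKS + r"\s*$",
    re.DOTALL,
)


def strip_fences(reply: str) -> str:
    """Return the text inside a wrapping code fence, or the stripped reply if there is none."""
    match = _FENCE.match(reply)
    return match.group(1) if match else reply.strip()


def parse_annotation(reply: str) -> dict:
    """Parse a model reply as JSON after removing any wrapping fence."""
    return json.loads(strip_fences(reply))


fenced = TICKS + 'json\n{"category": "RMP", "specificity": 2}\n' + TICKS
print(parse_annotation(fenced))
```

Replies that are already bare JSON pass through unchanged, so the same parser can be used for every model in the panel.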
### 13-Signal Cross-Source Analysis — Complete
- [x] 30 diagnostic charts generated → `data/gold/charts/`
- [x] Leave-one-out analysis (no model privileged as reference)
- [x] Adjudication tier breakdown computed

**Adjudication tiers (13 signals per paragraph):**

| Tier | Count | % | Rule |
|------|-------|---|------|
| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold |
| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated |
| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review |
| 4 | 202 | 16.8% | Universal disagreement → expert review |

**Leave-one-out ranking (each source vs majority of other 12):**

| Rank | Source | Cat % | Spec % | Both % |
|------|--------|-------|--------|--------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 |
| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 |
| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 |

**Key finding:** Opus earns the #1 spot through leave-one-out. It is not ranked first because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate on category).
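A minimal sketch of the leave-one-out comparison described above, assuming a simple plurality vote among the other sources and that paragraphs with a tied vote are skipped for that source. The tie rule used in `scripts/analyze-gold.py` is not stated here, and the source names and labels below are toy values.

```python
from collections import Counter


def majority(labels):
    """Most common label, or None when first place is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def leave_one_out(signals):
    """signals: one {source: label} dict per paragraph.

    Returns {source: share of paragraphs where the source matches
    the majority label of all the *other* sources}.
    """
    hits, seen = Counter(), Counter()
    for paragraph in signals:
        for source, label in paragraph.items():
            others = [l for s, l in paragraph.items() if s != source]
            reference = majority(others)
            if reference is None:  # assumption: ties are skipped, not counted as misses
                continue
            seen[source] += 1
            hits[source] += int(label == reference)
    return {source: hits[source] / seen[source] for source in seen}


toy = [
    {"m1": "RMP", "m2": "RMP", "m3": "RMP", "h1": "MR"},
    {"m1": "SI", "m2": "SI", "m3": "SI", "h1": "SI"},
    {"m1": "MR", "m2": "RMP", "m3": "MR", "h1": "MR"},
]
rates = leave_one_out(toy)
print(rates)
```

Because each source is scored against everyone else, no single model acts as the reference, which is the property the ranking table relies on.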

## What's Next (in dependency order)

### 1. Gold set adjudication
- Tier 1+2 (972 paragraphs, 81%) → auto-resolved from 13-signal consensus
- Tier 3+4 (228 paragraphs, 19%) → expert review with Opus reasoning traces
- For Aaryan's 600 paragraphs: use other-2-annotator majority when they agree and he disagrees
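The tier rules that drive this split can be sketched as a small function. The thresholds are assumptions where the tier table leaves them open: a human majority is taken as 2 of 3, a GenAI majority as more than half of the 10 models, and anything that matches no earlier rule (including human and GenAI majorities that disagree) falls through to Tier 4. Labels are `(category, specificity)` pairs so that agreement is on both dimensions.

```python
from collections import Counter


def top(labels):
    """Return (label, votes) for the most common label."""
    return Counter(labels).most_common(1)[0]


def assign_tier(human, genai):
    """human: 3 labels, genai: 10 labels; each label is a (category, specificity) pair."""
    _, votes = top(human + genai)
    if votes >= 10:
        return 1  # 10+/13 agree on both dimensions: auto gold

    human_label, human_votes = top(human)
    genai_label, genai_votes = top(genai)
    human_majority = human_votes >= 2               # assumption: 2 of 3 humans
    genai_majority = genai_votes > len(genai) / 2   # assumption: strict majority

    if human_majority and genai_majority and human_label == genai_label:
        return 2  # both majorities agree: cross-validated
    if not human_majority and genai_majority:
        return 3  # humans split, GenAI converges: expert review
    return 4      # everything else: expert review


A, B, C = ("RMP", 2), ("MR", 1), ("SI", 3)
print(assign_tier([A, A, B], [A] * 6 + [B] * 4))
```

Tiers 1 and 2 can then be written straight to the gold set, while Tiers 3 and 4 are queued for expert review.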

### 2. Training data assembly
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5
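A minimal sketch of how the quality tier weights above could enter training. The helper names are hypothetical, and normalizing by the sum of weights is an assumption rather than something these notes specify.

```python
# Per-sample weights by quality tier, as listed in the plan above.
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}


def sample_weight(quality_tier: str, label_weight: float = 1.0) -> float:
    """Combine the label-source weight (all 1.0 in the plan) with the tier weight."""
    return label_weight * TIER_WEIGHT[quality_tier]


def weighted_mean_loss(losses, weights):
    """Weighted average of per-sample losses (assumption: normalize by total weight)."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


tiers = ["clean", "headed", "degraded"]
losses = [0.2, 0.4, 1.0]
weights = [sample_weight(t) for t in tiers]
print(weights, weighted_mean_loss(losses, weights))
```

With these toy values the degraded paragraph contributes half as much to the batch loss as the other two.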

### 3. Fine-tuning + ablations
- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- Focal loss / class-weighted CE for category imbalance
- Ordinal regression (CORAL) for specificity
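For the specificity head, the CORAL idea can be illustrated without a deep learning framework. The sketch below assumes specificity levels are zero-indexed (0 to 3) and uses made-up bias values; in a real model the shared score would come from the ModernBERT backbone and the biases would be learned.

```python
import math

NUM_LEVELS = 4  # specificity classes, per the plan above


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def coral_targets(level: int, num_levels: int = NUM_LEVELS) -> list[int]:
    """Encode an ordinal level as K-1 binary targets: target[k] = 1 if level > k."""
    return [1 if level > k else 0 for k in range(num_levels - 1)]


def coral_logits(score: float, biases: list[float]) -> list[float]:
    """CORAL shares one scalar score across thresholds and adds a per-threshold bias."""
    return [score + b for b in biases]


def coral_predict(logits: list[float]) -> int:
    """Predicted level = number of thresholds whose probability exceeds 0.5."""
    return sum(1 for z in logits if sigmoid(z) > 0.5)


def coral_loss(logits: list[float], targets: list[int]) -> float:
    """Sum of binary cross-entropies over the K-1 thresholds."""
    return -sum(
        t * math.log(sigmoid(z)) + (1 - t) * math.log(1.0 - sigmoid(z))
        for z, t in zip(logits, targets)
    )


biases = [1.0, 0.0, -1.0]          # illustrative values only
logits = coral_logits(0.5, biases)
print(coral_predict(logits), coral_targets(2), round(coral_loss(logits, coral_targets(2)), 3))
```

Sharing the score across thresholds is what keeps the threshold probabilities rank-consistent, so the predicted level never skips a threshold.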

### 4. Evaluation + paper
- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
- Full GenAI benchmark table (10 models × 1,200 holdout)
- Cost/time/reproducibility comparison
- Error analysis on Tier 4 paragraphs (A-grade criterion)
- IGNITE slides (20 slides, 15s each)
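For reference, a dependency-free sketch of the per-class and macro F1 computation named above. It averages over the classes that appear in either the gold or predicted labels; for the real 7-class evaluation the class list would presumably be fixed up front, which is an assumption. The labels are toy values.

```python
def per_class_f1(y_true, y_pred):
    """F1 per class, computed from true positives, false positives and false negatives."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


gold = ["RMP", "RMP", "MR", "SI", "SI", "SI"]
pred = ["RMP", "MR", "MR", "SI", "SI", "RMP"]
print(per_class_f1(gold, pred), round(macro_f1(gold, pred), 4))
```

Macro averaging weights every class equally, so rare categories count as much as common ones against the 0.80 target.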

## Parallel Tracks

```
Track A (GPU):   DAPT ✓ → TAPT ✓ ──────────────→ Fine-tuning → Eval
                                                        ↑
Track B (API):   Opus re-run ✓──┐                       │
                                ├→ Gold adjudication ───┤
Track C (API):   6-model bench ✓┘                       │
                                                        │
Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ──────┘
```

## Key File Locations

| What | Where |
|------|-------|
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) |
| Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) |
| Human label metrics | `data/gold/metrics.json` |
| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) |
| Diagnostic charts | `data/gold/charts/*.png` (30 charts) |
| Opus golden labels | `data/annotations/golden/opus.jsonl` (1,200) |
| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` (6 × 1,200) |
| Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| Analysis script | `scripts/analyze-gold.py` (30-chart, 13-signal analysis) |
| Data dump script | `labelapp/scripts/dump-all.ts` |
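A small sanity check can confirm that the files still match the counts listed in the table. This is a sketch under two assumptions: the numbers in parentheses are record counts, and each `.jsonl` file holds one JSON object per non-empty line. The helper names are hypothetical.

```python
import json
from pathlib import Path

# Expected record counts, copied from the table above.
EXPECTED = {
    "data/paragraphs/paragraphs-clean.patched.jsonl": 49_795,
    "data/annotations/stage1.patched.jsonl": 150_009,
    "data/gold/human-labels-raw.jsonl": 3_600,
    "data/gold/paragraphs-holdout.jsonl": 1_200,
    "data/annotations/golden/opus.jsonl": 1_200,
}


def count_jsonl(path) -> int:
    """Count non-empty lines, checking that each one parses as JSON."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)
                count += 1
    return count


def check_counts(root=".", expected=EXPECTED):
    """Return {path: (expected, actual)} for files that are missing or off-count."""
    problems = {}
    for relative, want in expected.items():
        path = Path(root) / relative
        got = count_jsonl(path) if path.exists() else None
        if got != want:
            problems[relative] = (want, got)
    return problems


if __name__ == "__main__":
    print(check_counts())  # an empty dict means every listed file matches
```

Running it from the repository root after each patch step gives an early warning if a merge dropped or duplicated records.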