analyze gold
This commit is contained in:
parent c9497f5709
commit 26367a8e86
@@ -1,6 +1,6 @@
 outs:
-- md5: 6147599f136e4781a2de20adcb2aba1f.dir
-  size: 737313104
-  nfiles: 135
+- md5: d64ad0c8040d75230a3013c4751910eb.dir
+  size: 740635168
+  nfiles: 174
   hash: md5
   path: .dvc-store
290 docs/F1-STRATEGY.md Normal file
@@ -0,0 +1,290 @@
# F1 Strategy — Passing the Class

The assignment requires **macro F1 > 0.80** on category, measured against the human-labeled 1,200-paragraph holdout. This document lays out the concrete plan for getting there.

---
## The Situation

### What we have

- **Training data:** 150,009 Stage 1 annotations across 50,003 paragraphs (3 models each). ~35K paragraphs with unanimous category agreement (all 3 models agree).
- **Pre-trained backbone:** ModernBERT-large with DAPT (1B tokens of SEC filings) + TAPT (labeled paragraphs). Domain-adapted and task-adapted.
- **Gold holdout:** 1,200 paragraphs with 13 independent annotations each (3 human + 10 GenAI). Adjudication tiers computed: 81% auto-resolvable.
- **Complete benchmark:** 10 GenAI models from 8 suppliers, all run on the holdout.

### The ceiling

The best individual GenAI models agree with the human majority on ~83-87% of category labels. Our fine-tuned model is trained on GenAI labels, so its accuracy is bounded by how well GenAI labels match human labels. With DAPT+TAPT, the model should approach or slightly exceed this ceiling because:

1. It learns decision boundaries directly from representations, not through generative reasoning
2. It is specialized on the exact domain (SEC filings) and task distribution
3. The training data (35K+ unanimous labels) is cleaner than any single model's output

### The threat

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP splits, N/O<->SI splits, Spec 3/4 splits). This means raw F1 on this holdout is **lower** than on a random sample. Macro F1 also weights all 7 classes equally — rare categories (TPR at ~5%, ID at ~8%) get the same influence as RMP at ~35%.

**Estimated range: category macro F1 of 0.78-0.85.** The plan below is designed to push toward the top of that range.

---
## Action 1: Clean the Gold Labels

**Priority: highest. This directly caps F1 from above.**

If the gold label is wrong, even a perfect model gets penalized. Gold label quality depends on how we adjudicate the 1,200 holdout paragraphs.

### Aaryan correction

Aaryan has a 40.9% odd-one-out rate (vs 8-16% for the other annotators), specificity kappa of 0.03-0.25, and a +1.30 specificity bias vs Opus. On his 600 paragraphs, when the other 2 annotators agree and he disagrees, the other-2 majority should be the human signal. This is not "throwing out" his data — it is using the objective reliability metrics to weight it appropriately.

Excluding his label on his paragraphs pushes both-unanimous agreement from 5% to 50% (+45pp). This single correction likely improves effective gold label quality by 5-10% on the paragraphs he touched.
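In code, the correction is a simple vote rule. A minimal sketch (annotator names and the dict layout are illustrative, not the project's actual schema):

```python
from collections import Counter

def corrected_human_majority(votes: dict, outlier: str = "aaryan"):
    """Category signal for one paragraph, with the outlier's vote down-weighted.

    If the two non-outlier annotators agree, their label wins regardless of the
    outlier's vote; otherwise fall back to a plain majority. None means no
    majority exists and the paragraph routes to expert adjudication.
    """
    others = [label for name, label in votes.items() if name != outlier]
    if len(others) == 2 and others[0] == others[1]:
        return others[0]
    label, n = Counter(votes.values()).most_common(1)[0]
    return label if n >= 2 else None

print(corrected_human_majority({"aaryan": "RMP", "xander": "MR", "meghan": "MR"}))  # MR
```

Note that the outlier's vote still counts when the other two annotators split, so no data is discarded outright.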
### Tiered adjudication

| Tier | Count | % | Gold label source |
|------|-------|---|-------------------|
| 1 | 756 | 63% | 13-signal consensus (10+/13 agree on both dimensions) |
| 2 | 216 | 18% | Human majority + GenAI majority agree — take the consensus |
| 3 | 26 | 2% | Expert review with Opus reasoning traces |
| 4 | 202 | 17% | Expert review, documented reasoning |

For Tiers 1+2 (972 paragraphs, 81%), the gold label is objectively strong — at least 10 of 13 annotators agree, or the human and GenAI majorities independently converge. These labels are essentially guaranteed correct.

For Tiers 3+4 (228 paragraphs), expert adjudication uses:

1. The Opus reasoning trace (why did the best model choose this category?)
2. The GenAI consensus direction (what do 7+/10 models say?)
3. The paragraph text itself
4. Codebook boundary rules (the MR vs RMP person-vs-function test, materiality disclaimers -> SI, etc.)

Document the reasoning for every Tier 4 decision. These 202 paragraphs become the error analysis corpus.

---
## Action 2: Handle Class Imbalance

**Priority: critical. This is the difference between 0.76 and 0.83 macro F1.**

### The problem

The training data class distribution is heavily skewed:

| Category | Est. % of training | Macro F1 weight |
|----------|-------------------|-----------------|
| RMP | ~35% | 14.3% (1/7) |
| BG | ~15% | 14.3% |
| MR | ~14% | 14.3% |
| SI | ~13% | 14.3% |
| N/O | ~10% | 14.3% |
| ID | ~8% | 14.3% |
| TPR | ~5% | 14.3% |

Without correction, the model will over-predict RMP (the majority class) and under-predict TPR/ID. Since macro F1 weights all 7 classes equally, poor performance on rare classes tanks the overall score.
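To make the equal weighting concrete, a toy computation with made-up per-class scores (illustrative numbers only):

```python
# Hypothetical per-class F1 scores: six strong classes, one weak rare class.
per_class_f1 = {
    "RMP": 0.90, "BG": 0.85, "MR": 0.80, "SI": 0.82,
    "N/O": 0.78, "ID": 0.75, "TPR": 0.50,
}
# Macro F1 is the unweighted mean, so TPR's ~5% share of the data still
# contributes a full 1/7 of the score.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.771 -- under the 0.80 bar despite 6 strong classes
```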
### Solutions (use in combination)

**Focal loss (gamma=2).** Down-weights easy/confident examples and up-weights hard/uncertain ones. The model spends more gradient on the examples it is getting wrong — which are disproportionately from rare classes and boundary cases. Better than static class weights because it adapts as training progresses.

**Class-weighted sampling.** Over-sample rare categories during training so the model sees roughly equal numbers of each class per epoch. Alternatively, use class-weighted cross-entropy with weights inversely proportional to class frequency.

**Stratified validation split.** Ensure the validation set used for early stopping and threshold optimization has proportional representation of all classes. Don't let the model optimize for RMP accuracy at the expense of TPR.
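A minimal NumPy sketch of the focal loss (the real training loop would use the PyTorch equivalent on the category head):

```python
import numpy as np

def focal_loss(logits: np.ndarray, targets: np.ndarray, gamma: float = 2.0) -> float:
    """Mean focal loss: cross-entropy scaled by (1 - p_t)**gamma.

    Confident correct predictions (p_t near 1) are damped toward zero, so the
    gradient budget shifts to hard examples; gamma=0 recovers plain CE.
    """
    z = logits - logits.max(axis=1, keepdims=True)          # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_t = p[np.arange(len(targets)), targets]               # prob of true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With gamma=2, an example already classified correctly at p_t = 0.9 contributes only 1% of its plain cross-entropy.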
---

## Action 3: Supervised Contrastive Learning (SCL)

**Priority: high. Directly attacks the #1 confusion axis.**

### The problem

MR<->RMP is the dominant confusion axis for humans, for Stage 1, and for all GenAI models — and it will be the dominant confusion axis for our fine-tuned model. These two categories share vocabulary (both discuss "cybersecurity" in a management/process context) and differ primarily in whether the paragraph describes a **person's role** (MR) or a **process/procedure** (RMP).

BG<->MR is the #2 axis — both involve governance/management but differ in whether it is board-level or management-level.

### How SCL helps

SCL adds a contrastive loss that pulls representations of same-class paragraphs together and pushes different-class paragraphs apart in the embedding space. This is especially valuable when:

- Two classes share surface-level vocabulary (MR/RMP, BG/MR)
- The distinguishing features are subtle (person vs function, board vs management)
- The model needs to learn discriminative features, not just predictive ones

### Implementation

Dual loss: L = L_classification + lambda * L_contrastive

The contrastive loss uses the [CLS] representation from the shared backbone. Lambda should be tuned on the validation set (start with 0.1-0.5).
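The contrastive term can be sketched at batch level in NumPy (the training code would use the PyTorch analogue over L2-normalized [CLS] embeddings; tau is the usual temperature hyperparameter):

```python
import numpy as np

def supcon_loss(embeddings: np.ndarray, labels: np.ndarray, tau: float = 0.1) -> float:
    """Supervised contrastive loss over a batch of embeddings.

    For each anchor, positives are same-class batch members; the loss rewards
    high similarity to positives relative to everything else in the batch.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                  # never contrast with self
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = labels[:, None] == labels[None, :]
    np.fill_diagonal(positives, False)
    n_pos = positives.sum(axis=1)
    keep = n_pos > 0                                # anchors with >= 1 positive
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1)[keep] / n_pos[keep]
    return float(per_anchor.mean())
```

The full objective then combines this with the classification loss as L = L_classification + lambda * supcon_loss(...).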
---

## Action 4: Ordinal Specificity

**Priority: medium. Matters for specificity F1, not directly for category F1 (the pass/fail metric).**

### The problem

Specificity is a 4-point ordinal scale (1 = Generic Boilerplate, 2 = Sector-Standard, 3 = Firm-Specific, 4 = Quantified-Verifiable). Treating it as flat classification ignores the ordering — a Spec 1->4 error is worse than a Spec 2->3 error.

Human alpha on specificity is only 0.546 (unreliable). The Spec 3<->4 boundary is genuinely ambiguous. Even frontier models agree only 75-91% on specificity.

### Solution

Use CORAL (Consistent Rank Logits) ordinal regression for the specificity head. CORAL converts a K-class ordinal problem into K-1 binary problems (is this >= 2? is this >= 3? is this >= 4?) and trains shared representations across all thresholds. This:

- Respects the ordinal structure
- Eliminates impossible predictions (e.g., predicting "yes >= 4" but "no >= 3")
- Handles the noisy Spec 3<->4 boundary gracefully
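Decoding the K-1 cumulative logits is a count of passed thresholds; a sketch of the decoding step (not a specific library's API):

```python
import numpy as np

def coral_decode(cum_logits: np.ndarray) -> np.ndarray:
    """Map K-1 cumulative binary logits (>=2?, >=3?, >=4?) to levels 1..K.

    Counting thresholds whose probability clears 0.5 always yields a valid
    level, so 'yes >= 4' without 'yes >= 3' can never be emitted.
    """
    probs = 1.0 / (1.0 + np.exp(-cum_logits))       # sigmoid per threshold
    return 1 + (probs > 0.5).sum(axis=1)

print(coral_decode(np.array([[4.0, 2.5, -3.0]])))   # [3]: passes >=2 and >=3 only
```

CORAL's shared weight vector with per-threshold biases additionally keeps the cumulative probabilities themselves monotonically decreasing.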
### Managed expectations

Specificity macro F1 will land around 0.65-0.75 regardless of what we do. This is not a model failure — it is a gold label quality issue (alpha = 0.546). Report specificity F1 separately and frame it as a finding about construct reliability.

---
## Action 5: Training Data Curation

**Priority: high. Garbage in, garbage out.**

### Confidence-stratified assembly

| Source | Count | Sample Weight | Rationale |
|--------|-------|--------------|-----------|
| Unanimous Stage 1 (3/3 agree on both) | ~35K | 1.0 | Highest confidence |
| Majority Stage 1 (2/3 agree on cat) | ~9-12K | 0.8 | Good but not certain |
| Judge labels (high confidence) | ~2-3K | 0.7 | Disputed, resolved by a stronger model |
| All-disagree | ~2-3K | 0.0 (exclude) | Too noisy |

### Quality tier weighting

| Paragraph quality | Weight |
|-------------------|--------|
| Clean | 1.0 |
| Headed | 1.0 |
| Minor issues | 1.0 |
| Degraded (embedded bullets, orphan words) | 0.5 |

### What NOT to include

- Paragraphs where all 3 Stage 1 models disagree on category (pure noise)
- Paragraphs from truncated filings (72 identified and removed pre-DAPT)
- Paragraphs shorter than 10 words (these tend to be parsing artifacts)
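The two weighting tables compose multiplicatively; a sketch with illustrative field names (the actual example schema may differ):

```python
CONFIDENCE_WEIGHT = {"unanimous": 1.0, "majority": 0.8, "judge": 0.7}
QUALITY_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def sample_weight(example: dict) -> float:
    """Per-example training weight; 0.0 means exclude entirely."""
    if example["agreement"] == "all-disagree" or example["n_words"] < 10:
        return 0.0
    return CONFIDENCE_WEIGHT[example["agreement"]] * QUALITY_WEIGHT[example["quality"]]

# A majority-label, degraded paragraph trains at 0.8 * 0.5 weight.
print(sample_weight({"agreement": "majority", "quality": "degraded", "n_words": 80}))
```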
---

## Action 6: Ablation Design

**The assignment requires at least 4 configurations. We'll run 6-8 to isolate each contribution.**

| # | Backbone | Focal Loss | SCL | Notes |
|---|----------|-----------|-----|-------|
| 1 | ModernBERT-large (base) | No | No | Baseline — no domain adaptation |
| 2 | +DAPT | No | No | Isolate the domain pre-training effect |
| 3 | +DAPT+TAPT | No | No | Isolate the task-adaptive pre-training effect |
| 4 | +DAPT+TAPT | Yes | No | Isolate class imbalance handling |
| 5 | +DAPT+TAPT | Yes | Yes | Full pipeline |
| 6 | +DAPT+TAPT | Yes | Yes | Ensemble (3 seeds, majority vote) |

**Expected pattern:** 1 < 2 < 3 (pre-training helps), 3 < 4 (focal loss helps rare classes), 4 < 5 (SCL helps confusion boundaries), 5 < 6 (the ensemble smooths variance).

Each experiment trains for ~30-60 min on the RTX 3090. Total ablation time: ~4-8 hours.

### Hyperparameters (starting points, tune on validation)

- Learning rate: 2e-5 (standard for BERT fine-tuning)
- Batch size: 16-32 (depending on VRAM with dual heads)
- Max sequence length: 512 (most paragraphs are <200 tokens; 8192 is unnecessary for classification)
- Epochs: 5-10 with early stopping (patience = 3)
- Warmup: 10% of steps
- Weight decay: 1e-5 (matching the ModernBERT pre-training config)
- Focal loss gamma: 2.0
- SCL lambda: 0.1-0.5 (tune)
- Label smoothing: 0.05

---
## Action 7: Inference-Time Techniques

### Ensemble (3 seeds)

Train 3 instances of the best configuration (experiment 5) with different random seeds. At inference, average the softmax probabilities and take the argmax. This typically adds 1-3pp macro F1 over a single model. The variance across seeds also gives confidence intervals for reported metrics.
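Probability averaging is a few lines (a sketch; assumes each seed model exposes its softmax outputs as an `[n_examples, n_classes]` array):

```python
import numpy as np

def ensemble_predict(seed_probs: list) -> np.ndarray:
    """Average per-seed softmax outputs, then take the argmax per example."""
    return np.mean(seed_probs, axis=0).argmax(axis=1)

# Two seeds lean toward class 1, one toward class 0: the average settles on 1.
probs = [np.array([[0.6, 0.4]]), np.array([[0.3, 0.7]]), np.array([[0.4, 0.6]])]
print(ensemble_predict(probs))  # [1]
```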
### Per-class threshold optimization

After training, don't use a plain argmax. Instead, optimize per-class classification thresholds on the validation set to maximize macro F1 directly. The optimal threshold for RMP (high prevalence, precision matters) differs from TPR (low prevalence, recall matters). Use a grid search or Bayesian optimization over the 7 thresholds.
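A minimal coordinate-wise grid search over the thresholds (a sketch; the prediction rule scales each class probability by 1/threshold before the argmax, so lowering a rare class's threshold raises its recall):

```python
import numpy as np

def macro_f1(pred: np.ndarray, y: np.ndarray, n_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (y == c))
        fp = np.sum((pred == c) & (y != c))
        fn = np.sum((pred != c) & (y == c))
        scores.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))

def tune_thresholds(probs: np.ndarray, y: np.ndarray, rounds: int = 3) -> np.ndarray:
    """Greedy per-class threshold search on validation data, maximizing macro F1."""
    n_classes = probs.shape[1]
    grid = np.linspace(0.05, 0.95, 19)              # includes the neutral 0.5
    th = np.full(n_classes, 0.5)                    # 0.5 everywhere == plain argmax
    for _ in range(rounds):                         # a few coordinate passes
        for c in range(n_classes):
            trials = []
            for g in grid:
                t = th.copy()
                t[c] = g
                trials.append((macro_f1((probs / t).argmax(axis=1), y, n_classes), g))
            th[c] = max(trials)[1]                  # keep the best threshold for c
    return th
```

Because the current threshold is always in the grid, each coordinate update can only keep or improve validation macro F1.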
### Post-hoc calibration

Apply temperature scaling on the validation set. This doesn't change predictions (and therefore doesn't change F1), but it makes the model's confidence scores meaningful for:

- Calibration plots (a recommended evaluation metric)
- AUC computation
- The error analysis narrative
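Temperature fitting reduces to a one-dimensional search on validation NLL; a sketch:

```python
import numpy as np

def fit_temperature(logits: np.ndarray, y: np.ndarray) -> float:
    """Pick T > 0 minimizing validation NLL of softmax(logits / T).

    Dividing logits by a constant never changes the argmax, so F1 is
    untouched; only the confidence scores move.
    """
    def nll(T: float) -> float:
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)                    # stable log-softmax
        log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return float(-log_p[np.arange(len(y)), y].mean())
    grid = np.linspace(0.5, 5.0, 46)
    return float(min(grid, key=nll))
```

An overconfident model (the usual case after fine-tuning) gets T > 1, flattening its probabilities.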
---

## Action 8: Evaluation & Reporting

### Primary metrics (what determines the grade)

- **Category macro F1** on the full 1,200 holdout — must exceed 0.80
- **Per-class F1** — breakdown showing which categories are strong/weak
- **Krippendorff's alpha** — model vs human labels (should approach the GenAI panel's alpha)
- **MCC** — robust to class imbalance
- **AUC** — from calibrated probabilities

### Dual F1 reporting (adverse incentive mitigation)

Report F1 on both:

1. The **full 1,200 holdout** (stratified, over-samples hard cases)
2. A **~720-paragraph proportional subsample** (random draw matching corpus class proportions)

The delta between these two numbers quantifies how much the model degrades at decision boundaries. This directly serves the A-grade "error analysis" criterion and is methodologically honest about the stratified design.
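Drawing the proportional subsample is straightforward; a sketch (assumes corpus-level class proportions are known; rounding can leave the draw slightly off k):

```python
import random

def proportional_subsample(items: list, labels: list, proportions: dict,
                           k: int, seed: int = 0) -> list:
    """Random subsample of ~k items matching corpus class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    sample = []
    for label, frac in proportions.items():
        pool = by_class.get(label, [])
        take = min(round(k * frac), len(pool))      # per-class quota
        sample.extend(rng.sample(pool, take))
    return sample
```

Fixing the seed keeps the subsample reproducible across the dual F1 runs.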
### Error analysis corpus

The 202 Tier 4 paragraphs (universal disagreement) are the natural error analysis set. For each:

- What did the model predict?
- What is the gold label?
- What do the 13 signals show?
- What is the confusion axis?
- Is the gold label itself debatable?

This analysis should show that most "errors" fall on the MR<->RMP, BG<->MR, and N/O<->SI axes — the same axes where humans disagree. The model is not failing randomly; it is failing where the construct itself is ambiguous.

### GenAI vs specialist comparison (assignment Step 10)

| Dimension | GenAI Panel (10 models) | Fine-tuned Specialist |
|-----------|------------------------|----------------------|
| Category macro F1 | ~0.82-0.87 (per model) | Target: 0.80-0.85 |
| Cost per 1M texts | ~$5,000-13,000 | ~$5 (GPU inference) |
| Latency per text | 3-76 seconds | ~5 ms |
| Reproducibility | Varies (temperature, routing) | Deterministic |
| Setup cost | $165 (one-time labeling) | + ~8h GPU training |

The specialist wins on cost (~1000x cheaper), speed (~1000x faster), and reproducibility (deterministic). The GenAI panel wins on raw accuracy by a few points. This is the core Ringel (2023) thesis: the specialist approximates the GenAI labeler at near-zero marginal cost.

---
## Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Macro F1 lands at 0.78-0.80 (just below threshold) | Medium | High | Ensemble + threshold optimization should add 2-3pp |
| TPR per-class F1 tanks the macro average | Medium | Medium | Focal loss + over-sampling TPR in training |
| Gold label noise on Tier 4 paragraphs | Low | Medium | Conservative adjudication + dual F1 reporting |
| MR<->RMP confusion not resolved by SCL | Low | Medium | Person-vs-function test baked into training data via the v3.0 codebook |
| DAPT+TAPT doesn't help (base model is already good enough) | Low | Low | Still meets the 0.80 threshold; the ablation result itself is publishable |

---
## Timeline

| Task | Duration | Target Date |
|------|----------|-------------|
| Gold set adjudication (Tier 3+4 expert review) | 2-3h | Apr 3-4 |
| Training data assembly | 1-2h | Apr 4 |
| Fine-tuning ablations (6 configs) | 4-8h GPU | Apr 5-8 |
| Final evaluation on holdout | 1h | Apr 9 |
| Error analysis writeup | 2h | Apr 10 |
| Executive memo draft | 3h | Apr 11-12 |
| IGNITE slides (20 slides) | 2h | Apr 13-14 |
| Final review + submission | 2h | Apr 22 |
| **Due date** | | **Apr 23 12pm** |
@@ -1056,6 +1056,56 @@ The Opus golden labeling was re-run on the correct 1,200 holdout paragraphs. A p
---

## Phase 14: 13-Signal Analysis & F1 Strategy

### Benchmark Complete

All 6 benchmark models + Opus completed 1,200 annotations each. Total benchmark cost: $45.63. Every paragraph in the holdout now has exactly 13 independent annotations: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark.

Model performance sorted by leave-one-out "both" accuracy (each source vs the majority of the other 12): Opus 4.6 (84.0%), Kimi K2.5 (83.3%), Gemini Pro (82.3%), GPT-5.4 (82.1%), GLM-5 (81.4%), MIMO Pro (81.4%), Grok Fast (80.0%). Best human: Xander at 76.9%. Worst: Aaryan at 15.8%.
### The "Is Opus Special?" Question
|
||||
|
||||
We tested whether Opus's apparent dominance was an artifact of using it as the reference. Answer: no. In leave-one-out analysis, Opus has the lowest "odd one out" rate at 7.4% — it disagrees with the remaining 12 sources less than any other source. But the top 6 GenAI models are within 3pp of each other — any could serve as reference with similar results. The 13-signal majority is 99.5% identical to the 10-GenAI majority; adding 3 human votes barely shifts consensus because 10 outvotes 3.
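The leave-one-out metric is simple to state in code; a sketch with an illustrative data layout (ties in the remaining majority are broken arbitrarily):

```python
from collections import Counter

def odd_one_out_rate(labels_by_source: dict, source: str) -> float:
    """Share of paragraphs where `source` disagrees with the majority of the
    other sources -- the leave-one-out ranking used for the 13 signals."""
    others = [s for s in labels_by_source if s != source]
    own = labels_by_source[source]
    odd = 0
    for i, label in enumerate(own):
        # Majority vote of everyone except the source under evaluation.
        majority = Counter(labels_by_source[o][i] for o in others).most_common(1)[0][0]
        odd += label != majority
    return odd / len(own)
```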
### Adjudication Tiers

The 13-signal consensus enables tiered adjudication:

- **Tier 1 (63.0%):** 756 paragraphs where 10+/13 agree on both dimensions. Auto-gold, zero human work.
- **Tier 2 (18.0%):** 216 paragraphs where the human majority and GenAI majority agree. Cross-validated.
- **Tier 3 (2.2%):** 26 paragraphs where humans split but GenAI converges.
- **Tier 4 (16.8%):** 202 paragraphs with universal disagreement. Expert adjudication needed.

81% of the holdout can be adjudicated automatically. The 202 Tier 4 paragraphs are dominated by MR↔RMP confusion (the #1 axis everywhere) and are the natural error analysis corpus.
### Specificity: GenAI Is More Consistent Than Humans

GenAI spec unanimity is 60.1% vs human spec unanimity of 42.2% (+18pp). Specificity calibration plots show that GPT-5.4, Gemini Pro, and Kimi K2.5 closely track Opus across all 4 specificity levels. MiniMax M2.7 is the only model with a systematic specificity bias (−0.26 vs Opus). Among humans, Aaryan's +1.30 bias dwarfs all other sources.

### F1 Strategy

The assignment requires macro F1 > 0.80 on category. Based on the data:

- The best GenAI models agree with the human majority on ~83-87% of category labels
- Training on 35K+ unanimous Stage 1 labels with DAPT+TAPT should approach this ceiling
- The swing categories for macro F1 are MR (~65-80%), TPR (~70-90%), and N/O (~60-85%)
- Focal loss for class imbalance + SCL for boundary separation + an ensemble for robustness

Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a random sample. Mitigation: report F1 on both the full holdout and a proportional subsample. The delta quantifies model degradation at decision boundaries.
### Cost Ledger Update

| Phase | Cost | Time |
|-------|------|------|
| Stage 1 (150K annotations) | $115.88 | ~30 min |
| Orphan re-annotation | $3.30 | ~9 min |
| Benchmark (6 models × 1,200) | $45.63 | ~1h |
| Opus golden (1,200) | $0 (subscription) | ~30 min |
| Human labeling | $0 (class assignment) | 21.5h active |
| Post-labeling analysis | | ~3h |
| **Total API** | **$164.81** | |

---
## Lessons Learned

### On Prompt Engineering
@@ -1,104 +1,222 @@
# Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy

Updated 2026-04-02 with actual benchmark results and 13-signal analysis.

---

## Human Labeling Results (Complete)

Human labeling completed 2026-04-01: 3,600 labels (1,200 paragraphs x 3 annotators via BIBD), 21.5 active hours total.
| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | **0.801** | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |
**Category is reliable.** Alpha = 0.801 exceeds the conventional 0.80 threshold. The human majority matches the Stage 1 GenAI majority on 83.3% of paragraphs for category.

**Specificity is unreliable.** Alpha = 0.546, well below the 0.667 threshold -- driven by one outlier annotator (Aaryan: 67% of paragraphs labeled Spec 4 vs 8-23% for others and 9% for Stage 1, +1.28 specificity bias, kappa 0.03-0.25) and a genuinely hard Spec 3<->4 boundary. Excluding his label on his 600 paragraphs pushes both-unanimous agreement from 5% to 50% (+45pp).

The confusion axes are the same, in the same order, for humans and the GenAI panel: (1) Management Role <-> Risk Management Process, (2) Board Governance <-> Management Role, (3) None/Other <-> Strategy Integration (materiality disclaimers). The codebook boundaries drive disagreement, not annotator or model limitations.

---
## GenAI Benchmark Results (Complete)

10 models from 8 suppliers on the 1,200 holdout paragraphs. $45.63 total benchmark cost.

### Per-Model Accuracy (Leave-One-Out: each source vs majority of other 12)
| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|------|--------|-------|--------|--------|---------------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6 |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1 |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9 |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1 |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7 |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5 |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0 |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6 |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7 |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1 |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0 |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3 |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9 |

Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.
### Cross-Source Agreement

| Comparison | Category |
|------------|----------|
| Human maj = S1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |

### Confusion Axes (same order for all source types)

1. MR <-> RMP (dominant)
2. BG <-> MR
3. N/O <-> SI

---
## Adjudication Strategy (13 Signals)

### Sources per paragraph

| Source | Count | Prompt |
|--------|-------|--------|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0 + codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| **Total** | **13** | |
### Adjudication tiers
|
||||
### Tier breakdown (actual counts)
|
||||
|
||||
**Tier 1 — High confidence:** 10+/13 agree on both dimensions. Gold label, no intervention.
|
||||
| Tier | Rule | Count | % |
|
||||
|------|------|-------|---|
|
||||
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
|
||||
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
|
||||
| 3 | Humans split, GenAI converges | 26 | 2.2% |
|
||||
| 4 | Universal disagreement | 202 | 16.8% |
|
||||
|
||||
**Tier 2 — Clear majority with cross-validation:** Human majority (2/3) matches GenAI consensus (majority of 10 GenAI labels). Take the consensus.
|
||||
**81% auto-resolvable.** Only 228 paragraphs (19%) need expert review.
### Aaryan correction

On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for the other annotators) and α = 0.03-0.25 on specificity.

### Adjudication process for Tier 3+4

1. Pull the Opus reasoning trace for the paragraph
2. Check the GenAI consensus (which category do 7+/10 models agree on?)
3. The expert reads the paragraph and all signals, then makes the final call
4. Document the reasoning for Tier 4 paragraphs (these are the error analysis corpus)

GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision, which avoids circularity.

---
---
|
||||
|
||||
## Task Sequence
|
||||
## F1 Strategy — How to Pass
|
||||
|
||||
### In progress
|
||||
- [x] Human labeling — complete
|
||||
- [x] Data export and IRR analysis — complete
|
||||
- [x] Prompt v3.0 update with codebook rulings — complete
|
||||
- [x] GenAI benchmark infrastructure — complete
|
||||
- [ ] Opus golden re-run on correct holdout (running, ~1h with 20 workers)
|
||||
- [ ] 6-model benchmark on holdout (running, high concurrency)
|
||||
### The requirement
|
||||
|
||||
### After benchmark completes
|
||||
- [ ] Cross-source analysis with all 13 signals (update `analyze-gold.py`)
|
||||
- [ ] Gold set adjudication using tiered strategy
|
||||
- [ ] Training data assembly (unanimous + calibrated majority + judge)
|
||||
- **C grade minimum:** fine-tuned model with macro F1 > 0.80 on holdout
|
||||
- **Gold standard:** human-labeled holdout (1,200 paragraphs)
|
||||
- **Metrics to report:** macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
|
||||
- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling
|
||||
|
||||
### After gold set is finalized
|
||||
- [ ] Fine-tuning + ablations (7 experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} + best)
|
||||
- [ ] Final evaluation on holdout
|
||||
- [ ] Writeup + IGNITE slides
|
||||
### The challenge
|
||||
|
||||
The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be **lower** than on a random sample. Additionally:
|
||||
|
||||
- The best individual GenAI models only agree with human majority ~83-87% on category
|
||||
- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
|
||||
- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
|
||||
- The MR<->RMP confusion axis is the #1 challenge across all source types
|
||||
|
||||
### Why F1 > 0.80 is achievable
|
||||
|
||||
1. **DAPT + TAPT give domain advantage.** The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.
|
||||
|
||||
2. **35K+ high-confidence training examples.** Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.
|
||||
|
||||
3. **Encoder classification outperforms generative labeling on fine-tuned domains.** The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).
|
||||
|
||||
4. **The hard cases are a small fraction.** 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.
|
||||
|
||||
### Critical actions
|
||||
|
||||
#### 1. Gold label quality (highest priority)
|
||||
|
||||
Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.
|
||||
|
||||
- **Tier 1+2 (972 paragraphs):** Use 13-signal consensus. These are essentially guaranteed correct.
|
||||
- **Tier 3+4 (228 paragraphs):** Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
|
||||
- **Aaryan correction:** On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
|
||||
- **Document the process:** The adjudication methodology itself is a deliverable (IRR report + reliability analysis).
|
||||
|
||||
#### 2. Training data curation

- **Primary corpus:** Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
- **Secondary:** Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
- **Tertiary:** Judge labels with high confidence -- ~2-3K
- **Exclude:** Paragraphs where all 3 models disagree (too noisy for training)
- **Quality weighting:** clean/headed/minor = 1.0, degraded = 0.5

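How these rules could translate into per-example sample weights, as a sketch (the `agreement` and `quality` field names are assumptions, not the project's actual schema):

```python
# Illustrative weighting scheme combining agreement source and quality tier.
SOURCE_WEIGHT = {"unanimous": 1.0, "majority": 0.8, "judge": 1.0}
QUALITY_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def sample_weight(example):
    if example["agreement"] == "none":   # all three Stage 1 models disagree
        return 0.0                       # excluded from training
    return SOURCE_WEIGHT[example["agreement"]] * QUALITY_WEIGHT[example["quality"]]

assert sample_weight({"agreement": "majority", "quality": "degraded"}) == 0.4
```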
#### 3. Architecture and loss

- **Dual-head classifier:** Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
- **Category loss:** Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in the training data.
- **Specificity loss:** Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully.
- **Combined loss:** L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)

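A compact PyTorch sketch of the dual-head setup and the combined loss. A stand-in linear "encoder" replaces the ModernBERT backbone so the sketch runs self-contained; all module and function names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHead(nn.Module):
    """Stand-in for the real model: a linear 'encoder' replaces the
    DAPT+TAPT ModernBERT backbone so the sketch stays self-contained."""
    def __init__(self, in_dim=16, hidden=64, n_cat=7, n_spec=4):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden)    # placeholder backbone
        self.cat_head = nn.Linear(hidden, n_cat)    # 7-class softmax head
        # CORAL: one shared weight vector, K-1 separate threshold biases
        self.spec_head = nn.Linear(hidden, 1, bias=False)
        self.spec_bias = nn.Parameter(torch.zeros(n_spec - 1))

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.cat_head(h), self.spec_head(h) + self.spec_bias

def focal_loss(logits, y, gamma=2.0):
    # -(1 - p_t)^gamma * log(p_t): down-weights easy, well-classified examples
    logp = F.log_softmax(logits, dim=-1).gather(1, y[:, None]).squeeze(1)
    return -(((1 - logp.exp()) ** gamma) * logp).mean()

def coral_loss(logits, y, n_spec=4):
    # binary targets: does the true level exceed each of the K-1 thresholds?
    target = (y[:, None] > torch.arange(n_spec - 1)).float()
    return F.binary_cross_entropy_with_logits(logits, target)

def combined_loss(cat_logits, spec_logits, y_cat, y_spec):
    return focal_loss(cat_logits, y_cat) + 0.5 * coral_loss(spec_logits, y_spec)
```

The CORAL head shares one weight vector across the three cumulative thresholds and learns only separate biases, which encourages rank-consistent ordinal predictions; the 0.5 factor mirrors the combined loss above.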
#### 4. Ablation experiments (assignment requires >=4 configurations; we run 6)

| # | Backbone | Class Weights | SCL | Notes |
|---|----------|--------------|-----|-------|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |

Experiments 1-3 isolate the pre-training contribution. 4-5 isolate the training strategy. 6 is the final system.

#### 5. Evaluation strategy

- **Primary metric:** Category macro F1 on the full 1,200 holdout (must exceed 0.80)
- **Secondary metrics:** Per-class F1, specificity F1 (reported separately), MCC, Krippendorff's alpha vs human labels
- **Dual reporting (adverse incentive mitigation):** Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
- **Error analysis corpus:** The 202 Tier 4 paragraphs are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.

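As a sanity check on what the primary metric rewards, here is macro F1 from scratch (in practice `sklearn.metrics.f1_score` with `average="macro"` would be used):

```python
def macro_f1(y_true, y_pred, classes):
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)  # unweighted mean: every class counts equally
```

Because the mean over classes is unweighted, a collapse on ~5%-frequency TPR costs exactly as much macro F1 as a collapse on ~35%-frequency RMP.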
#### 6. Inference-time techniques

- **Ensemble:** Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
- **Threshold optimization:** After training, optimize per-class classification thresholds on a validation set (not the holdout) to maximize macro F1. Don't use plain argmax -- use thresholds that balance precision and recall per class.
- **Post-hoc calibration:** Temperature scaling on the validation set. Important for AUC and calibration plots.

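One way the per-class thresholds could enter at inference time, as a sketch (the threshold values themselves would be fit by grid or coordinate search on the validation split, never the holdout):

```python
# Instead of plain argmax, pick the class with the largest margin over its
# tuned threshold; lower thresholds make rare classes easier to predict.
def predict_with_thresholds(probs, thresholds):
    margins = {c: p - thresholds[c] for c, p in probs.items()}
    return max(margins, key=margins.get)

probs = {"RMP": 0.48, "TPR": 0.40, "MR": 0.12}   # illustrative probabilities
thr   = {"RMP": 0.40, "TPR": 0.20, "MR": 0.30}   # illustrative tuned thresholds

assert max(probs, key=probs.get) == "RMP"            # plain argmax picks RMP
assert predict_with_thresholds(probs, thr) == "TPR"  # thresholded pick: rare class wins
```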
### Specificity dimension -- managed expectations

Specificity F1 will be lower than category F1. This is not a model failure:

- Human alpha on specificity is only 0.546 (unreliable gold)
- Even frontier models only agree 75-91% on specificity
- The Spec 3<->4 boundary is genuinely ambiguous

Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.

### Concrete F1 estimate

Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:

- **Category macro F1:** 0.78-0.85 (depends on class imbalance handling and gold quality)
- **Specificity macro F1:** 0.65-0.75 (ceiling-limited by human disagreement)
- **Combined (cat x spec) accuracy:** 0.55-0.70

The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.

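A quick sensitivity check on that claim with hypothetical per-class F1s. The five "fixed" values are assumptions (TPR fixed inside its ~70-90% range); only the swing ranges for MR and N/O come from the estimates above:

```python
fixed = {"RMP": 0.88, "BG": 0.84, "SI": 0.82, "ID": 0.80, "TPR": 0.80}  # assumed
swing_low  = {"MR": 0.65, "N/O": 0.60}   # pessimistic end of the ranges
swing_high = {"MR": 0.80, "N/O": 0.85}   # optimistic end

def macro(per_class):
    return sum(per_class.values()) / len(per_class)

low, high = macro({**fixed, **swing_low}), macro({**fixed, **swing_high})
# under these assumptions the 0.80 bar sits between the two scenarios,
# i.e. clearing it hinges on the swing classes
assert low < 0.80 < high
```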
---

## The Meta-Narrative

The finding that trained student annotators achieve α = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% spec unanimity vs 42.2% for humans), validates the synthetic experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it's evidence that the specificity construct requires systematic attention to the IS/NOT lists and counting rules that humans don't consistently invest at a 15s/paragraph pace. The human labels remain essential as a calibration anchor, but GenAI's advantage on multi-step reasoning tasks (like QV fact counting) is itself a key finding.

The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined.

---

## Timeline

| Task | Target | Status |
|------|--------|--------|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03-04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05-08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10-14 | |
| Submission | 2026-04-23 | |

docs/STATUS.md

# Project Status — 2026-04-02 (evening)

## What's Done

- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`

### Documentation
- [x] `docs/DATA-QUALITY-AUDIT.md` — full audit with all patches and quality tiers
- [x] `docs/EDGAR-FILING-GENERATORS.md` — 14 generators with signatures and quality profiles
- [x] `docs/DAPT-PROCEDURE.md` — pre-flight checklist, commands, monitoring guide
- [x] `docs/NARRATIVE.md` — 11 phases documented through TAPT completion

## What's Done (since last update)

### Human Labeling — Complete
- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3)
- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators
- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/`
- [x] Comprehensive IRR analysis with 16 diagnostic charts → `data/gold/charts/`

### Human Labeling Results

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |

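For reference, the reported α values follow Krippendorff's nominal formulation, which can be sketched from scratch (in practice a maintained library such as the `krippendorff` package would be used):

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha from a list of per-paragraph label lists
    (each inner list = labels that unit received; units with <2 labels are skipped)."""
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)
    totals = Counter(label for u in units for label in u)
    # expected disagreement: chance two random pairable values differ
    de = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    # observed disagreement, averaged within units
    do = sum(
        (len(u) * (len(u) - 1) - sum(c * (c - 1) for c in Counter(u).values()))
        / (len(u) - 1)
        for u in units
    ) / n
    return 1 - do / de

# perfect agreement within every unit -> alpha == 1.0
assert krippendorff_alpha_nominal([["MR"] * 3, ["RMP"] * 3]) == 1.0
```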
**Key findings:**
- **Category is reliable (α=0.801)** — above the 0.80 threshold for reliable data
- **Specificity is unreliable (α=0.546)** — driven primarily by one outlier annotator (Aaryan, +1.28 specificity levels vs Stage 1, κ=0.03-0.25 on specificity) and a genuinely hard Spec 3↔4 boundary
- **Human majority = Stage 1 majority on 83.3% of categories** — strong cross-validation
- **Same confusion axes** in humans and GenAI: MR↔RMP (#1), BG↔MR (#2), N/O↔SI (#3)
- **Excluding the outlier annotator:** both-unanimous jumps from 5% → 50% on his paragraphs (+45pp)
- **Timing:** 21.5 active hours total, median 14.9s per paragraph

### Prompt v3.0
- [x] Updated `SYSTEM_PROMPT` with codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- [x] Prompt version bumped from v2.5 → v3.0

### GenAI Holdout Benchmark — Complete
- [x] 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs
- [x] All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix)
- [x] Total benchmark cost: $45.63

| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus |
|-------|----------|------|---------------|----------------|
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% |
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% |
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% |
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% |
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% |
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% |
| anthropic/claude-opus-4.6 | Anthropic | $0 (subscription) | — | — |

Plus the Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast) already on file = **10 models, 8 suppliers**.

### 13-Signal Cross-Source Analysis — Complete
- [x] 30 diagnostic charts generated → `data/gold/charts/`
- [x] Leave-one-out analysis (no model privileged as reference)
- [x] Adjudication tier breakdown computed

**Adjudication tiers (13 signals per paragraph):**

| Tier | Count | % | Rule |
|------|-------|---|------|
| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold |
| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated |
| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review |
| 4 | 202 | 16.8% | Universal disagreement → expert review |

**Leave-one-out ranking (each source vs majority of the other 12):**

| Rank | Source | Cat % | Spec % | Both % |
|------|--------|-------|--------|--------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 |
| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 |
| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 |

**Key finding:** Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate).

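The odd-one-out rate behind this ranking can be sketched as follows (the per-paragraph `{source: label}` layout is an assumption for illustration):

```python
from collections import Counter

def odd_one_out_rate(annotations, source):
    """Fraction of paragraphs where `source` disagrees with the majority
    of the other sources (paragraphs where the rest tie are skipped)."""
    odd = total = 0
    for votes in annotations:  # votes: {source_name: label} for one paragraph
        if source not in votes:
            continue
        rest = Counter(v for k, v in votes.items() if k != source).most_common(2)
        if len(rest) > 1 and rest[0][1] == rest[1][1]:
            continue  # no clear majority among the others
        total += 1
        if votes[source] != rest[0][0]:
            odd += 1
    return odd / total if total else 0.0
```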
## What's Next (in dependency order)

### 1. Gold set adjudication
Each paragraph now has **13 independent annotations**: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark models.
- Tier 1+2 (972 paragraphs, 81%) → auto-resolved from 13-signal consensus
- Tier 3+4 (228 paragraphs, 19%) → expert review with Opus reasoning traces
- For Aaryan's 600 paragraphs: use other-2-annotator majority when they agree and he disagrees

### 2. Training data assembly
- Unanimous Stage 1 labels (35,204 paragraphs) → full weight
- Calibrated majority labels (~9-12K) → full weight
- Judge high-confidence labels (~2-3K) → full weight
- Quality tier weights: clean/headed/minor=1.0, degraded=0.5

### 3. Fine-tuning + ablations
- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting}
- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- Focal loss / class-weighted CE for category imbalance
- Ordinal regression (CORAL) for specificity

### 4. Evaluation + paper
- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category)
- Full GenAI benchmark table (10 models × 1,200 holdout)
- Cost/time/reproducibility comparison
- Error analysis on Tier 4 paragraphs (A-grade criterion)
- IGNITE slides (20 slides, 15s each)

## Parallel Tracks

```
Track A (GPU): DAPT ✓ → TAPT ✓ ──────────────→ Fine-tuning → Eval
                                                      ↑
Track B (API): Opus re-run ✓ ─┐                       │
                              ├→ Gold adjudication ───┤
Track C (API): 6-model bench ✓┘                       │
                                                      │
Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ────┘
```

## Key File Locations

| Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) |
| Human label metrics | `data/gold/metrics.json` |
| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) |
| Diagnostic charts | `data/gold/charts/*.png` (30 charts) |
| Opus golden labels | `data/annotations/golden/opus.jsonl` (1,200) |
| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` (6 × 1,200) |
| Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) |
| DAPT config | `python/configs/dapt/modernbert.yaml` |
| TAPT config | `python/configs/tapt/modernbert.yaml` |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| Training CLI | `python/main.py dapt --config ...` |
| Analysis script | `scripts/analyze-gold.py` (30-chart, 13-signal analysis) |
| Data dump script | `labelapp/scripts/dump-all.ts` |