Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy
Updated 2026-04-02 with actual benchmark results and 13-signal analysis.
Human Labeling Results (Complete)
3,600 labels (1,200 paragraphs x 3 annotators via BIBD), 21.5 active hours total.
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | 0.801 | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |
Category is reliable: alpha = 0.801 clears the conventional 0.80 threshold. Specificity is not: alpha = 0.546, driven by one outlier annotator (a +1.28 mean specificity bias) and a genuinely hard Spec 3<->4 boundary.
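For reference, nominal Krippendorff's alpha can be computed from scratch via the coincidence-matrix formulation; the production numbers presumably come from a standard package, and the function name and input shape here are illustrative. Units with fewer than two ratings (possible under BIBD assignment) drop out of the calculation.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: list of per-paragraph label lists (one entry per annotator who saw it)."""
    coincidence = Counter()              # o[c, k] over ordered value pairs
    for values in units:
        m = len(values)
        if m < 2:
            continue                     # unpairable units drop out of alpha
        for i, j in permutations(range(m), 2):
            coincidence[(values[i], values[j])] += 1.0 / (m - 1)
    n = sum(coincidence.values())        # total number of pairable values
    totals = Counter()                   # marginal frequency n_c per label
    for (c, _), w in coincidence.items():
        totals[c] += w
    d_o = sum(w for (c, k), w in coincidence.items() if c != k) / n
    d_e = sum(totals[c] * totals[k] for c, k in permutations(totals, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

Perfect agreement yields alpha = 1.0; systematic disagreement can push alpha below 0.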
GenAI Benchmark Results (Complete)
10 models from 8 suppliers on 1,200 holdout paragraphs. $45.63 total benchmark cost.
Per-Model Accuracy (Leave-One-Out: each source vs majority of other 12)
| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|---|---|---|---|---|---|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% |
Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.
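The leave-one-out score behind this ranking can be sketched as follows, assuming for simplicity that every source labeled every paragraph (under BIBD, humans only cover a subset, so the real computation restricts to each source's paragraphs). Ties in the rest-of-crowd majority resolve arbitrarily here.

```python
from collections import Counter

def leave_one_out_agreement(labels_by_source):
    """labels_by_source: {source_name: [label per paragraph]} (equal-length lists).
    Returns {source: fraction of paragraphs matching the majority of the rest};
    odd-one-out rate is 1 minus this score."""
    sources = list(labels_by_source)
    n_items = len(next(iter(labels_by_source.values())))
    scores = {}
    for held_out in sources:
        hits = 0
        for i in range(n_items):
            votes = Counter(labels_by_source[s][i] for s in sources if s != held_out)
            majority = votes.most_common(1)[0][0]
            hits += labels_by_source[held_out][i] == majority
        scores[held_out] = hits / n_items
    return scores
```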
Cross-Source Agreement
| Comparison | Category |
|---|---|
| Human maj = S1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |
Confusion Axes (same order for all source types)
- MR <-> RMP (dominant)
- BG <-> MR
- N/O <-> SI
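Confusion axes like these fall out of counting symmetric disagreement pairs between any two label sequences; a minimal sketch (label strings are illustrative):

```python
from collections import Counter

def top_confusion_axes(labels_a, labels_b, k=3):
    """Count unordered disagreement pairs between two aligned label sequences."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs.most_common(k)
```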
Adjudication Strategy (13 Signals)
Sources per paragraph
| Source | Count | Prompt |
|---|---|---|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0+codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| Total | 13 | |
Tier breakdown (actual counts)
| Tier | Rule | Count | % |
|---|---|---|---|
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
| 3 | Humans split, GenAI converges | 26 | 2.2% |
| 4 | Universal disagreement | 202 | 16.8% |
81% auto-resolvable. Only 228 paragraphs (19%) need expert review.
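The tier rules can be expressed directly over the 13 (category, specificity) votes. This sketch reads "human majority" as 2/3 and "GenAI converges" as 6/10; adjust the thresholds if the actual adjudication rules differ.

```python
from collections import Counter

def majority(votes, threshold):
    """Return the modal vote if it reaches threshold, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= threshold else None

def assign_tier(human_votes, genai_votes):
    """human_votes: 3 (category, specificity) tuples; genai_votes: 10 such tuples."""
    all_votes = human_votes + genai_votes
    if majority(all_votes, 10) is not None:      # Tier 1: 10+/13 agree on both dims
        return 1
    h_maj = majority(human_votes, 2)             # 2/3 humans (assumed threshold)
    g_maj = majority(genai_votes, 6)             # 6/10 models (assumed threshold)
    if h_maj is not None and h_maj == g_maj:     # Tier 2: both majorities agree
        return 2
    if h_maj is None and g_maj is not None:      # Tier 3: humans split, GenAI converges
        return 3
    return 4                                     # Tier 4: universal disagreement
```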
Aaryan correction
On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for other annotators) and α=0.03-0.25 on specificity.
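The correction is a small pre-processing step before tier assignment; annotator names and the vote encoding here are illustrative:

```python
def corrected_human_signal(votes_by_annotator, outlier="Aaryan"):
    """If the two non-outlier annotators agree and the outlier dissents, the
    other-2 majority replaces the outlier's vote; otherwise all votes stand.
    Paragraphs the outlier didn't label pass through unchanged."""
    others = [v for a, v in votes_by_annotator.items() if a != outlier]
    outlier_vote = votes_by_annotator.get(outlier)
    if len(others) == 2 and others[0] == others[1] and outlier_vote != others[0]:
        return others + [others[0]]   # other-2 majority becomes the human signal
    return list(votes_by_annotator.values())
```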
Adjudication process for Tier 3+4
- Pull Opus reasoning trace for the paragraph
- Check the GenAI consensus (which category do 7+/10 models agree on?)
- Expert reads the paragraph and all signals, makes final call
- Document reasoning for Tier 4 paragraphs (these are the error analysis corpus)
F1 Strategy — How to Pass
The requirement
- C grade minimum: fine-tuned model with macro F1 > 0.80 on holdout
- Gold standard: human-labeled holdout (1,200 paragraphs)
- Metrics to report: macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling
The challenge
The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be lower than on a random sample. Additionally:
- The best individual GenAI models only agree with human majority ~83-87% on category
- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
- The MR<->RMP confusion axis is the #1 challenge across all source types
Why F1 > 0.80 is achievable
- DAPT + TAPT give domain advantage. The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.
- 35K+ high-confidence training examples. Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.
- Encoder classification outperforms generative labeling on fine-tuned domains. The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).
- The hard cases are a small fraction. 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.
Critical actions
1. Gold label quality (highest priority)
Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.
- Tier 1+2 (972 paragraphs): Use 13-signal consensus. These are essentially guaranteed correct.
- Tier 3+4 (228 paragraphs): Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
- Aaryan correction: On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
- Document the process: The adjudication methodology itself is a deliverable (IRR report + reliability analysis).
2. Training data curation
- Primary corpus: Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
- Secondary: Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
- Tertiary: Judge labels with high confidence -- ~2-3K
- Exclude: Paragraphs where all 3 models disagree (too noisy for training)
- Quality weighting: clean/headed/minor = 1.0, degraded = 0.5
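The curation rules reduce to a per-paragraph mapping from the 3 Stage-1 votes plus a quality flag to a (label, sample_weight) pair; a sketch, with the judge-label tier omitted and the quality flag names taken from the list above:

```python
from collections import Counter

def curated_example(stage1_votes, quality):
    """Return (label, sample_weight) for a paragraph, or None if excluded."""
    label, count = Counter(stage1_votes).most_common(1)[0]
    if count == 3:
        base = 1.0           # unanimous -> primary corpus
    elif count == 2:
        base = 0.8           # 2/3 majority -> secondary, down-weighted
    else:
        return None          # three-way disagreement -> excluded as too noisy
    quality_weight = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}
    return label, base * quality_weight[quality]
```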
3. Architecture and loss
- Dual-head classifier: Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
- Category loss: Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in training data.
- Specificity loss: Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully.
- Combined loss: L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)
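A NumPy sketch of the loss design (the real heads would live in a PyTorch model on the ModernBERT backbone; only the loss math is shown). Focal loss down-weights easy examples via the (1 - p_t)^gamma factor; CORAL encodes specificity s in {1..4} as three cumulative binary targets (s>1, s>2, s>3), so distant ordinal errors accrue more penalty than adjacent ones.

```python
import numpy as np

def focal_loss(logits, y, gamma=2.0, class_weights=None):
    """Focal loss for the 7-class category head; y holds integer class indices."""
    z = logits - logits.max(axis=1, keepdims=True)          # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_t = p[np.arange(len(y)), y]                           # prob of true class
    w = 1.0 if class_weights is None else class_weights[y]
    return np.mean(-w * (1 - p_t) ** gamma * np.log(p_t + 1e-12))

def coral_loss(spec_logits, spec):
    """CORAL ordinal loss: spec_logits has one column per threshold (>1, >2, >3)."""
    levels = np.arange(1, spec_logits.shape[1] + 1)
    targets = (spec[:, None] > levels[None, :]).astype(float)
    log_sig = -np.logaddexp(0.0, -spec_logits)              # log sigmoid(x)
    log_one_minus = -np.logaddexp(0.0, spec_logits)         # log(1 - sigmoid(x))
    return np.mean(-(targets * log_sig + (1 - targets) * log_one_minus))

def combined_loss(cat_logits, y_cat, spec_logits, y_spec):
    """L = L_cat + 0.5 * L_spec, per the weighting above."""
    return focal_loss(cat_logits, y_cat) + 0.5 * coral_loss(spec_logits, y_spec)
```

Note the ordinal behavior: with fixed spec logits predicting level 2, the loss is larger when the true level is 4 than when it is 3.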
4. Ablation experiments (need >=4 configurations)
| # | Backbone | Class Weights | SCL | Notes |
|---|---|---|---|---|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |
Experiments 1-3 isolate the pre-training contribution. 4-5 isolate training strategy. 6 is the final system.
5. Evaluation strategy
- Primary metric: Category macro F1 on full 1,200 holdout (must exceed 0.80)
- Secondary metrics: Per-class F1, specificity F1 (report separately), MCC, Krippendorff's alpha vs human labels
- Dual reporting (adverse incentive mitigation): Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
- Error analysis corpus: Tier 4 paragraphs (202) are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.
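Macro F1 is the unweighted mean of per-class F1 scores, which is exactly why rare classes carry full weight; a dependency-free sketch (in practice `sklearn.metrics.f1_score(average="macro")` does the same):

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)   # absent class scores 0
    return sum(f1s) / len(f1s)
```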
6. Inference-time techniques
- Ensemble: Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
- Threshold optimization: After training, optimize per-class classification thresholds on a validation set (not holdout) to maximize macro F1. Don't use argmax -- use thresholds that balance precision and recall per class.
- Post-hoc calibration: Temperature scaling on validation set. Important for AUC and calibration plots.
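One way to implement the per-class threshold search is coordinate ascent on validation macro F1, with the decision rule "argmax of probability divided by threshold" replacing plain argmax; the grid and sweep count are illustrative hyperparameters.

```python
import numpy as np

def predict_with_thresholds(probs, thresholds):
    """Divide each class probability by its threshold, then take the argmax."""
    return np.argmax(probs / thresholds[None, :], axis=1)

def macro_f1(gold, pred, n_classes):
    f1s = []
    for c in range(n_classes):
        tp = np.sum((gold == c) & (pred == c))
        fp = np.sum((gold != c) & (pred == c))
        fn = np.sum((gold == c) & (pred != c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

def tune_thresholds(probs, gold, grid=np.linspace(1.0, 0.2, 9), sweeps=2):
    """Coordinate ascent over per-class thresholds on the validation set.
    The grid is descending so ties keep the larger (more conservative) threshold."""
    n_classes = probs.shape[1]
    thresholds = np.ones(n_classes)
    for _ in range(sweeps):
        for c in range(n_classes):
            def score(t):
                cand = thresholds.copy()
                cand[c] = t
                return macro_f1(gold, predict_with_thresholds(probs, cand), n_classes)
            thresholds[c] = max(grid, key=score)
    return thresholds
```

Because the current threshold is always among the candidates, each sweep can only hold or improve validation macro F1 relative to plain argmax.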
Specificity dimension -- managed expectations
Specificity F1 will be lower than category F1. This is not a model failure:
- Human alpha on specificity is only 0.546 (unreliable gold)
- Even frontier models only agree 75-91% on specificity
- The Spec 3<->4 boundary is genuinely ambiguous
Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.
Concrete F1 estimate
Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:
- Category macro F1: 0.78-0.85 (depends on class imbalance handling and gold quality)
- Specificity macro F1: 0.65-0.75 (ceiling-limited by human disagreement)
- Combined (cat x spec) accuracy: 0.55-0.70
The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.
The Meta-Narrative
The finding that trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% specificity unanimity vs 42.3% for humans), validates the synthetic-experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it is evidence that the specificity construct demands systematic attention to IS/NOT lists and counting rules that humans don't consistently apply at a 15s/paragraph pace. GenAI's advantage on such multi-step rule application is itself a key finding.
The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined.
Timeline
| Task | Target | Status |
|---|---|---|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03-04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05-08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10-14 | |
| Submission | 2026-04-23 | |