# Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy

Updated 2026-04-02 with actual benchmark results and 13-signal analysis.

---

## Human Labeling Results (Complete)

3,600 labels (1,200 paragraphs x 3 annotators via BIBD), 21.5 active hours total.

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | **0.801** | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |

**Category is reliable.** Alpha = 0.801 exceeds the conventional 0.80 threshold.

**Specificity is unreliable.** Alpha = 0.546, driven by one outlier annotator (+1.28 specificity bias) and a genuinely hard Spec 3-4 boundary.

---

## GenAI Benchmark Results (Complete)

10 models from 8 suppliers on 1,200 holdout paragraphs. $45.63 total benchmark cost.

### Per-Source Accuracy (Leave-One-Out: each source vs majority of the other 12)

The ranking below interleaves the 10 models with the 6 human annotators for comparison.

| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|------|--------|-------|--------|--------|---------------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% |

Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.
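The leave-one-out ranking can be reproduced with a small helper: for each source, compare its label on every paragraph against the majority vote of the remaining signals. A minimal sketch, not the project's actual scoring code (`loo_agreement` and its tie-breaking rule are illustrative assumptions):

```python
from collections import Counter

def loo_agreement(labels_by_source):
    """For each source, the fraction of paragraphs where its label matches
    the majority vote of the remaining sources (leave-one-out)."""
    sources = list(labels_by_source)
    n_paragraphs = len(labels_by_source[sources[0]])
    rates = {}
    for source in sources:
        hits = 0
        for i in range(n_paragraphs):
            # Majority vote of everyone except the source under evaluation.
            others = [labels_by_source[s][i] for s in sources if s != source]
            # Counter.most_common breaks exact ties by insertion order, so
            # ties resolve arbitrarily -- acceptable for an illustrative sketch.
            majority = Counter(others).most_common(1)[0][0]
            hits += labels_by_source[source][i] == majority
        rates[source] = hits / n_paragraphs
    return rates
```

The odd-one-out rate in the table is simply `1 - rates[source]` under this scheme.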
### Cross-Source Agreement

| Comparison | Category |
|------------|----------|
| Human maj = S1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |

### Confusion Axes (same order for all source types)

1. MR <-> RMP (dominant)
2. BG <-> MR
3. N/O <-> SI

---

## Adjudication Strategy (13 Signals)

### Sources per paragraph

| Source | Count | Prompt |
|--------|-------|--------|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0+codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| **Total** | **13** | |

### Tier breakdown (actual counts)

| Tier | Rule | Count | % |
|------|------|-------|---|
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
| 3 | Humans split, GenAI converges | 26 | 2.2% |
| 4 | Universal disagreement | 202 | 16.8% |

**81% auto-resolvable.** Only 228 paragraphs (19%) need expert review.

### Aaryan correction

On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for other annotators) and alpha = 0.03-0.25 on specificity.

### Adjudication process for Tier 3+4

1. Pull the Opus reasoning trace for the paragraph
2. Check the GenAI consensus (which category do 7+/10 models agree on?)
3. Expert reads the paragraph and all signals, makes the final call
4. Document reasoning for Tier 4 paragraphs (these are the error analysis corpus)

---

## F1 Strategy — How to Pass

### The requirement

- **C grade minimum:** fine-tuned model with macro F1 > 0.80 on holdout
- **Gold standard:** human-labeled holdout (1,200 paragraphs)
- **Metrics to report:** macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling

### The challenge

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be **lower** than on a random sample. Additionally:

- The best individual GenAI models only agree with human majority ~83-87% on category
- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
- The MR<->RMP confusion axis is the #1 challenge across all source types

### Why F1 > 0.80 is achievable

1. **DAPT + TAPT give domain advantage.** The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.
2. **35K+ high-confidence training examples.** Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.
3. **Encoder classification outperforms generative labeling on fine-tuned domains.** The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).
4. **The hard cases are a small fraction.** 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.

### Critical actions

#### 1. Gold label quality (highest priority)

Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.
- **Tier 1+2 (972 paragraphs):** Use 13-signal consensus. These are essentially guaranteed correct.
- **Tier 3+4 (228 paragraphs):** Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
- **Aaryan correction:** On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
- **Document the process:** The adjudication methodology itself is a deliverable (IRR report + reliability analysis).

#### 2. Training data curation

- **Primary corpus:** Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
- **Secondary:** Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
- **Tertiary:** Judge labels with high confidence -- ~2-3K
- **Exclude:** Paragraphs where all 3 models disagree (too noisy for training)
- **Quality weighting:** clean/headed/minor = 1.0, degraded = 0.5

#### 3. Architecture and loss

- **Dual-head classifier:** Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
- **Category loss:** Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in the training data.
- **Specificity loss:** Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully.
- **Combined loss:** L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)

#### 4. Ablation experiments (need >=4 configurations)

| # | Backbone | Class Weights | SCL | Notes |
|---|----------|---------------|-----|-------|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |

Experiments 1-3 isolate the pre-training contribution, 4-5 the training strategy. 6 is the final system.

#### 5. Evaluation strategy

- **Primary metric:** Category macro F1 on the full 1,200 holdout (must exceed 0.80)
- **Secondary metrics:** Per-class F1, specificity F1 (report separately), MCC, Krippendorff's alpha vs human labels
- **Dual reporting (adverse incentive mitigation):** Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
- **Error analysis corpus:** Tier 4 paragraphs (202) are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.

#### 6. Inference-time techniques

- **Ensemble:** Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
- **Threshold optimization:** After training, optimize per-class classification thresholds on a validation set (not the holdout) to maximize macro F1. Don't use plain argmax -- use thresholds that balance precision and recall per class.
- **Post-hoc calibration:** Temperature scaling on the validation set. Important for AUC and calibration plots.

### Specificity dimension -- managed expectations

Specificity F1 will be lower than category F1.
This is not a model failure:

- Human alpha on specificity is only 0.546 (unreliable gold)
- Even frontier models only agree 75-91% on specificity
- The Spec 3<->4 boundary is genuinely ambiguous

Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.

### Concrete F1 estimate

Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:

- **Category macro F1:** 0.78-0.85 (depends on class imbalance handling and gold quality)
- **Specificity macro F1:** 0.65-0.75 (ceiling-limited by human disagreement)
- **Combined (cat x spec) accuracy:** 0.55-0.70

The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.

---

## The Meta-Narrative

Trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% spec unanimity vs 42.3% for humans). This validates the synthetic experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it is evidence that the specificity construct demands systematic attention to IS/NOT lists and counting rules, attention humans don't consistently invest at a 15s/paragraph pace. GenAI's advantage on multi-step reasoning tasks is itself a key finding.

The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined.
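The macro F1 estimates above hinge on the fact that every class contributes equally to the average, so per-class F1 on rare categories like TPR moves the metric as much as MR does. A minimal reference computation (hypothetical label strings, not the project's evaluation harness):

```python
def per_class_f1(y_true, y_pred, classes):
    """One-vs-rest F1 for each class."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # absent class scores 0
    return scores

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: a rare class counts as much as MR."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)
```

This is why a single weak rare class can drag the macro average below 0.80 even when overall accuracy looks comfortable.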
---

## Timeline

| Task | Target | Status |
|------|--------|--------|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03/04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05/08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10/14 | |
| Submission | 2026-04-23 | |
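The per-class threshold optimization listed under inference-time techniques (#6) can be approximated with a simple coordinate search over validation predictions. A hypothetical sketch: `tune_thresholds`, its scaling-based prediction rule, and the search grid are illustrative assumptions, not the project's tuning code:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of one-vs-rest F1 over integer class ids."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def tune_thresholds(probs, y_true, grid=None):
    """One pass of coordinate ascent over per-class score scalings.
    Prediction rule: argmax(probs / t), so t = 1 everywhere is plain argmax.
    Run on a validation split, never on the holdout."""
    n_classes = probs.shape[1]
    grid = np.linspace(0.2, 1.0, 9) if grid is None else grid
    t = np.ones(n_classes)
    best = macro_f1(y_true, probs.argmax(axis=1), n_classes)
    for c in range(n_classes):
        for g in grid:
            trial = t.copy()
            trial[c] = g
            score = macro_f1(y_true, (probs / trial).argmax(axis=1), n_classes)
            if score > best:
                best, t = score, trial
    return t, best
```

Lowering a class's scaling `t[c]` inflates its score, trading precision for recall on that class -- which is where macro F1 gains over plain argmax usually come from on rare categories.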