# Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy

Updated 2026-04-02 with actual benchmark results and 13-signal analysis.

---

## Human Labeling Results (Complete)

3,600 labels (1,200 paragraphs x 3 annotators via BIBD), 21.5 active hours total.

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's alpha | **0.801** | 0.546 | -- |
| Avg Cohen's kappa | 0.612 | 0.440 | -- |

**Category is reliable.** Alpha = 0.801 exceeds the conventional 0.80 threshold.

**Specificity is unreliable.** Alpha = 0.546, driven by one outlier annotator (+1.28 specificity bias) and a genuinely hard Spec 3-4 boundary.

---

## GenAI Benchmark Results (Complete)

10 models from 8 suppliers on 1,200 holdout paragraphs. $45.63 total benchmark cost.

### Per-Source Accuracy (Leave-One-Out: each source vs majority of the other 12)

The ranking below interleaves the 10 models with the 6 human annotators for comparison.

| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % |
|------|--------|-------|--------|--------|---------------|
| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% |
| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% |
| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% |
| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% |
| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% |
| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% |
| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% |
| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% |
| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% |
| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% |
| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% |
| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% |
| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% |
| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% |
| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% |
| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% |

Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least.
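The leave-one-out ranking can be reproduced with a small helper: for each source, compare its label on every paragraph against the majority vote of the remaining signals. A minimal sketch, not the project's actual scoring code (`loo_agreement` and its tie-breaking rule are illustrative assumptions):

```python
from collections import Counter

def loo_agreement(labels_by_source):
    """For each source, the fraction of paragraphs where its label matches
    the majority vote of the remaining sources (leave-one-out)."""
    sources = list(labels_by_source)
    n_paragraphs = len(labels_by_source[sources[0]])
    rates = {}
    for source in sources:
        hits = 0
        for i in range(n_paragraphs):
            # Majority vote of everyone except the source under evaluation.
            others = [labels_by_source[s][i] for s in sources if s != source]
            # Counter.most_common breaks exact ties by insertion order, so
            # ties resolve arbitrarily -- acceptable for an illustrative sketch.
            majority = Counter(others).most_common(1)[0][0]
            hits += labels_by_source[source][i] == majority
        rates[source] = hits / n_paragraphs
    return rates
```

The odd-one-out rate in the table is simply `1 - rates[source]` under this scheme.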
### Cross-Source Agreement

| Comparison | Category |
|------------|----------|
| Human maj = S1 maj | 81.7% |
| Human maj = Opus | 83.2% |
| Human maj = GenAI maj (10) | 82.2% |
| GenAI maj = Opus | 86.8% |
| 13-signal maj = 10-GenAI maj | 99.5% |

### Confusion Axes (same order for all source types)

1. MR <-> RMP (dominant)
2. BG <-> MR
3. N/O <-> SI

---

## Adjudication Strategy (13 Signals)

### Sources per paragraph

| Source | Count | Prompt |
|--------|-------|--------|
| Human annotators | 3 | Codebook v3.0 |
| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 |
| Opus 4.6 golden | 1 | v3.0+codebook |
| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 |
| **Total** | **13** | |

### Tier breakdown (actual counts)

| Tier | Rule | Count | % |
|------|------|-------|---|
| 1 | 10+/13 agree on both dimensions | 756 | 63.0% |
| 2 | Human majority + GenAI majority agree | 216 | 18.0% |
| 3 | Humans split, GenAI converges | 26 | 2.2% |
| 4 | Universal disagreement | 202 | 16.8% |

**81% auto-resolvable.** Only 228 paragraphs (19%) need expert review.

### Aaryan correction

On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for other annotators) and alpha = 0.03-0.25 on specificity.

### Adjudication process for Tier 3+4

1. Pull the Opus reasoning trace for the paragraph
2. Check the GenAI consensus (which category do 7+/10 models agree on?)
3. Expert reads the paragraph and all signals, makes the final call
4. Document reasoning for Tier 4 paragraphs (these are the error analysis corpus)

---

## F1 Strategy — How to Pass

### The requirement

- **C grade minimum:** fine-tuned model with macro F1 > 0.80 on holdout
- **Gold standard:** human-labeled holdout (1,200 paragraphs)
- **Metrics to report:** macro F1, per-class F1, Krippendorff's alpha, AUC, MCC
- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling

### The challenge

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). This means raw F1 on this holdout will be **lower** than on a random sample. Additionally:

- The best individual GenAI models only agree with human majority ~83-87% on category
- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement
- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence
- The MR<->RMP confusion axis is the #1 challenge across all source types

### Why F1 > 0.80 is achievable

1. **DAPT + TAPT give domain advantage.** The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't.
2. **35K+ high-confidence training examples.** Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels.
3. **Encoder classification outperforms generative labeling on fine-tuned domains.** The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023).
4. **The hard cases are a small fraction.** 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80.

### Critical actions

#### 1. Gold label quality (highest priority)

Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized.
- **Tier 1+2 (972 paragraphs):** Use 13-signal consensus. These are essentially guaranteed correct.
- **Tier 3+4 (228 paragraphs):** Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence.
- **Aaryan correction:** On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially.
- **Document the process:** The adjudication methodology itself is a deliverable (IRR report + reliability analysis).

#### 2. Training data curation

- **Primary corpus:** Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs
- **Secondary:** Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K
- **Tertiary:** Judge labels with high confidence -- ~2-3K
- **Exclude:** Paragraphs where all 3 models disagree (too noisy for training)
- **Quality weighting:** clean/headed/minor = 1.0, degraded = 0.5

#### 3. Architecture and loss

- **Dual-head classifier:** Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal)
- **Category loss:** Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in the training data.
- **Specificity loss:** Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully.
- **Combined loss:** L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric)

#### 4. Ablation experiments (need >=4 configurations)

| # | Backbone | Class Weights | SCL | Notes |
|---|----------|---------------|-----|-------|
| 1 | Base ModernBERT-large | No | No | Baseline |
| 2 | +DAPT | No | No | Domain adaptation effect |
| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline |
| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling |
| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning |
| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) |

Experiments 1-3 isolate the pre-training contribution, 4-5 the training strategy. 6 is the final system.

#### 5. Evaluation strategy

- **Primary metric:** Category macro F1 on the full 1,200 holdout (must exceed 0.80)
- **Secondary metrics:** Per-class F1, specificity F1 (report separately), MCC, Krippendorff's alpha vs human labels
- **Dual reporting (adverse incentive mitigation):** Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion.
- **Error analysis corpus:** Tier 4 paragraphs (202) are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why.

#### 6. Inference-time techniques

- **Ensemble:** Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1.
- **Threshold optimization:** After training, optimize per-class classification thresholds on a validation set (not the holdout) to maximize macro F1. Don't use plain argmax -- use thresholds that balance precision and recall per class.
- **Post-hoc calibration:** Temperature scaling on the validation set. Important for AUC and calibration plots.

### Specificity dimension -- managed expectations

Specificity F1 will be lower than category F1.
This is not a model failure:

- Human alpha on specificity is only 0.546 (unreliable gold)
- Even frontier models only agree 75-91% on specificity
- The Spec 3<->4 boundary is genuinely ambiguous

Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting.

### Concrete F1 estimate

Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium:

- **Category macro F1:** 0.78-0.85 (depends on class imbalance handling and gold quality)
- **Specificity macro F1:** 0.65-0.75 (ceiling-limited by human disagreement)
- **Combined (cat x spec) accuracy:** 0.55-0.70

The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80.

---

## The Meta-Narrative

Trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% spec unanimity vs 42.3% for humans). This validates the synthetic experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it is evidence that the specificity construct demands systematic attention to IS/NOT lists and counting rules, attention humans don't consistently invest at a 15s/paragraph pace. GenAI's advantage on multi-step reasoning tasks is itself a key finding.

The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined.
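The macro F1 estimates above hinge on the fact that every class contributes equally to the average, so per-class F1 on rare categories like TPR moves the metric as much as MR does. A minimal reference computation (hypothetical label strings, not the project's evaluation harness):

```python
def per_class_f1(y_true, y_pred, classes):
    """One-vs-rest F1 for each class."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # absent class scores 0
    return scores

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: a rare class counts as much as MR."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)
```

This is why a single weak rare class can drag the macro average below 0.80 even when overall accuracy looks comfortable.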
---

## Timeline

| Task | Target | Status |
|------|--------|--------|
| Human labeling | 2026-04-01 | Done |
| GenAI benchmark (10 models) | 2026-04-02 | Done |
| 13-signal analysis | 2026-04-02 | Done |
| Gold set adjudication | 2026-04-03/04 | Next |
| Training data assembly | 2026-04-04 | |
| Fine-tuning ablations (6 configs) | 2026-04-05/08 | |
| Final evaluation on holdout | 2026-04-09 | |
| Executive memo + IGNITE slides | 2026-04-10/14 | |
| Submission | 2026-04-23 | |
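The per-class threshold optimization listed under inference-time techniques (#6) can be approximated with a simple coordinate search over validation predictions. A hypothetical sketch: `tune_thresholds`, its scaling-based prediction rule, and the search grid are illustrative assumptions, not the project's tuning code:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of one-vs-rest F1 over integer class ids."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def tune_thresholds(probs, y_true, grid=None):
    """One pass of coordinate ascent over per-class score scalings.
    Prediction rule: argmax(probs / t), so t = 1 everywhere is plain argmax.
    Run on a validation split, never on the holdout."""
    n_classes = probs.shape[1]
    grid = np.linspace(0.2, 1.0, 9) if grid is None else grid
    t = np.ones(n_classes)
    best = macro_f1(y_true, probs.argmax(axis=1), n_classes)
    for c in range(n_classes):
        for g in grid:
            trial = t.copy()
            trial[c] = g
            score = macro_f1(y_true, (probs / trial).argmax(axis=1), n_classes)
            if score > best:
                best, t = score, trial
    return t, best
```

Lowering a class's scaling `t[c]` inflates its score, trading precision for recall on that class -- which is where macro F1 gains over plain argmax usually come from on rare categories.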