# Post-Labeling Plan — Gold Set Repair & Final Pipeline

Written 2026-04-01 while waiting for the last human annotator to finish.

---

## The Situation

Human labeling is nearly complete (1,200 paragraphs, 6 annotators, 3 per paragraph via BIBD). Current inter-annotator agreement:

- **Cohen's Kappa (avg):** 0.622
- **Krippendorff's alpha:** 0.616

These numbers are at the floor of "substantial agreement" (Landis & Koch) but below the 0.667 threshold Krippendorff recommends for even tentative conclusions. The holdout was deliberately stratified to over-sample hard cases (120 Management↔RMP splits, 80 None/Other↔Strategy splits, 80 Spec [3,4] splits, etc.), so raw consensus reflects sampling difficulty, not pure annotator quality.

The task is genuinely hard: 7 categories, 4 specificity levels, 5 decision rules, 3 codebook rulings, and multi-step reasoning (the person-vs-function test, QV fact counting). The GenAI panel struggled with the same boundaries.

---

## Immediate Analysis (once the last annotator finishes)

1. **Export labels** from labelapp (`bun run la:export`).
2. **Per-dimension alpha:** Compute Krippendorff's alpha for category and specificity separately. Hypothesis: category alpha is significantly higher than specificity alpha (matching the GenAI pattern, where Spec 4 was only 37.6% unanimous).
3. **Pairwise Kappa matrix:** All 15 annotator pairs. Identify whether one annotator is a systematic outlier or disagreement is uniform.
4. **Stratum-level agreement:** Break down consensus rates by sampling stratum (Management↔RMP, None/Other↔Strategy, Spec [3,4], proportional random, etc.). The hard strata should show lower agreement; the proportional random stratum should be higher.

---

## The Adverse Incentive Problem

The assignment requires F1 > 0.80 on the holdout to pass. This creates a perverse incentive: pick easy, unambiguous paragraphs for the holdout → high human agreement, high GenAI scores, high fine-tuned model F1 → passing grade, meaningless evaluation.
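The agreement statistics in the Immediate Analysis steps (per-dimension alpha, the 15-pair kappa matrix) can be sketched in pure Python. This is a minimal sketch assuming labels come out of `la:export` as per-paragraph label lists and as per-annotator dicts; a real run would more likely use the `krippendorff` and `scikit-learn` packages.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two raters over the same nominal items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in ca) / (n * n)    # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha (nominal) over units = list of label lists,
    one list per paragraph (here: the 3 BIBD labels per paragraph)."""
    o = Counter()  # coincidence matrix, keyed by (value, value)
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for i in range(m):
            for j in range(m):
                if i != j:
                    o[(labels[i], labels[j])] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_obs = sum(v for (c, k), v in o.items() if c != k)
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    return 1.0 - (n - 1) * d_obs / d_exp

def pairwise_kappa(by_annotator):
    """All annotator-pair kappas, restricted to jointly labeled paragraphs.
    by_annotator: {annotator_id: {paragraph_id: label}}."""
    out = {}
    for p, q in combinations(sorted(by_annotator), 2):
        shared = sorted(by_annotator[p].keys() & by_annotator[q].keys())
        if shared:
            out[(p, q)] = cohen_kappa(
                [by_annotator[p][u] for u in shared],
                [by_annotator[q][u] for u in shared],
            )
    return out
```

Under BIBD each pair shares only a subset of paragraphs, which is why `pairwise_kappa` intersects the two annotators' paragraph sets before computing kappa.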
We did the opposite: stratified to stress-test decision boundaries. This produces a harder holdout with lower headline numbers but an actually informative evaluation.

**Mitigation:** Report F1 on both the full 1,200-paragraph holdout AND the 720-paragraph "proportional stratified random" subsample separately. The proportional subsample approximates what a random holdout would look like. The delta between the two quantifies exactly how much performance degrades at decision boundaries.

This isn't gaming — it's rigorous reporting. The A-grade criteria ("error analysis," "comparison to amateur labels") are directly served by our approach. The low human agreement rate is a finding, not a failure.

---

## Gold Set Repair Strategy: 13+ Signals Per Paragraph

### Existing signals (7 per paragraph)

- 3 human labels (from labelapp, with notes and timing)
- 3 Stage 1 GenAI labels (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
- 1 Opus golden label (with full reasoning trace)

### New signals from the GenAI benchmark (6+ additional)

The assignment requires benchmarking 6+ models from 3+ suppliers against the holdout. This serves triple duty:

1. Assignment deliverable (GenAI benchmark table)
2. Gold set repair evidence (6+ more annotation signals for adjudication)
3. "GenAI vs amateur" comparison (A-grade criterion)

**Candidate models (6+ from 3+ suppliers):**

- OpenAI: gpt-5.4-mini, gpt-5.4
- Google: gemini-3-flash, gemini-3-pro (or similar)
- Anthropic: claude-sonnet-4.6, claude-haiku-4.5
- xAI: grok-4.20 (or similar)
- Others as needed for count

After the benchmark, each paragraph has **13+ independent annotations**. This is an absurdly rich signal for adjudication.

### Adjudication tiers

**Tier 1 — High confidence:** 10+/13 annotators agree on both dimensions. Gold label, no intervention needed. Expected: ~500-600 paragraphs.

**Tier 2 — Clear majority with cross-validation:** A human majority exists (2/3) and matches the GenAI consensus (majority of the 10 GenAI labels).
Strong signal — take the consensus. Expected: ~300-400 paragraphs.

**Tier 3 — Human split, GenAI consensus:** Humans disagree but GenAI labels converge. Use the Opus reasoning trace + GenAI consensus to inform expert adjudication. A human (Joey) makes the final call. Expected: ~100-200 paragraphs.

**Tier 4 — Universal disagreement:** Humans and GenAI both split. Genuinely ambiguous. Expert adjudication with documented reasoning, or flag as inherently ambiguous and report in the error analysis. Expected: ~50-100 paragraphs.

The GenAI labels are evidence for adjudication, not the gold label itself; the final label is always a human decision. This avoids circularity — we're not evaluating GenAI against GenAI-derived labels. We're using GenAI agreement patterns to identify which human label is most likely correct in cases of human disagreement.

If we can't produce reliable gold labels from 13+ signals per paragraph, the construct itself is ill-defined. That would be an important finding too — but given that the GenAI panel achieved 70.8% both-unanimous on 50K paragraphs (unstratified), and the hardest axes have clear codebook resolutions, the construct should hold.

---

## The Meta-Narrative

The finding that trained student annotators achieve α = 0.616 while calibrated LLM panels achieve 70.8%+ unanimity on the same task supports the synthetic-experts hypothesis: for complex, rule-heavy classification tasks requiring multi-step reasoning, LLMs with reasoning tokens can match or exceed human annotation quality.

This isn't a failure of the humans — it's the whole point of the project. The Ringel pipeline exists because these tasks are too cognitively demanding for consistent human annotation at scale. The human labels are essential as a calibration anchor, but GenAI's advantage on rule-application tasks is a key finding.
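The tier assignment above can be sketched as a single function. A minimal sketch, assuming each paragraph arrives as 3 human labels plus 10 GenAI labels, each a hashable (category, specificity) pair so agreement is checked on both dimensions at once; the case where a human majority exists but the GenAI consensus contradicts it falls through to Tier 4 here, a choice the plan leaves implicit.

```python
from collections import Counter

def assign_tier(human, genai):
    """Tier a paragraph for adjudication from 3 human + 10 GenAI labels.
    Returns (tier, gold_label_or_None); None means human adjudication."""
    top, top_n = Counter(human + genai).most_common(1)[0]
    if top_n >= 10:                         # Tier 1: 10+/13 agree
        return 1, top
    h_top, h_n = Counter(human).most_common(1)[0]
    g_top, g_n = Counter(genai).most_common(1)[0]
    genai_majority = g_n > len(genai) / 2   # strict majority of 10
    if h_n >= 2 and genai_majority and h_top == g_top:
        return 2, h_top                     # Tier 2: cross-validated majority
    if h_n < 2 and genai_majority:
        return 3, None                      # Tier 3: humans split, GenAI converge
    return 4, None                          # Tier 4: adjudicate from scratch
```

Keeping the tier rules in one pure function like this makes the expected tier counts (~500-600 / ~300-400 / ~100-200 / ~50-100) a one-line histogram over the exported data.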
---

## Task Sequence (dependency order)

### Can start now (no blockers)

- [ ] Judge prompt v3.0 update (codebook rulings → `buildJudgePrompt()`)
- [ ] Fine-tuning pipeline code (dual-head classifier, sample weighting, train/val/test split)
- [ ] GenAI benchmark infrastructure (scripts to run 6+ models on the holdout)

### After the last annotator finishes

- [ ] Export + per-dimension alpha + pairwise Kappa matrix + stratum breakdown
- [ ] Run the GenAI benchmark on the 1,200 holdout (6+ models, 3+ suppliers)
- [ ] Gold set adjudication using 13+ signals per paragraph
- [ ] Judge v3.0 validation against the adjudicated gold set

### After the gold set is finalized

- [ ] Training data assembly (unanimous + calibrated majority + judge)
- [ ] Fine-tuning + ablations (7 experiments)
- [ ] Final evaluation on the holdout
- [ ] Writeup + IGNITE slides

---

## Open Questions

1. **F1 threshold per dimension?** Worth asking Ringel whether the 0.80 F1 requirement applies to the joint 28-class label or can be reported per dimension (category and specificity separately).
2. **Soft labels for ambiguous cases?** For Tier 4 paragraphs, we could use label distributions as soft targets during training instead of forcing a hard label. More sophisticated, but harder to evaluate.
3. **One bad annotator vs. uniform disagreement?** The pairwise Kappa matrix will answer this. If one annotator is systematically off, their labels could be downweighted during adjudication.
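Open Question 2 could be prototyped along these lines. A sketch only: the `human_weight` knob, which upweights each of the 3 human labels relative to each of the 10 GenAI labels before normalizing, is illustrative and not part of the plan.

```python
from collections import Counter

def soft_target(human, genai, human_weight=2.0):
    """Turn the 13 annotations on a Tier 4 paragraph into a soft label
    distribution, usable as a training target (e.g. soft cross-entropy)
    instead of a forced hard label. human_weight is a hypothetical knob."""
    w = Counter()
    for lab in human:
        w[lab] += human_weight   # each human label counts human_weight votes
    for lab in genai:
        w[lab] += 1.0            # each GenAI label counts one vote
    total = sum(w.values())
    return {lab: v / total for lab, v in w.items()}
```

The "harder to evaluate" caveat stands: against these targets one would report a distributional metric (e.g. cross-entropy or top-1 against the argmax) rather than plain F1.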