Post-Labeling Plan — Gold Set Repair & Final Pipeline
Written 2026-04-01 while waiting for the last human annotator to finish.
The Situation
Human labeling is nearly complete (1,200 paragraphs, 6 annotators, 3 per paragraph via BIBD). Current inter-annotator agreement:
- Cohen's Kappa (avg): 0.622
- Krippendorff's alpha: 0.616
These numbers are at the floor of "substantial agreement" (Landis & Koch) but below the 0.667 threshold Krippendorff recommends for tentative conclusions. The holdout was deliberately stratified to over-sample hard cases (120 Management↔RMP splits, 80 None/Other↔Strategy splits, 80 Spec [3,4] splits, etc.), so raw consensus reflects sampling difficulty, not pure annotator quality.
The task is genuinely hard: 7 categories, 4 specificity levels, 5 decision rules, 3 codebook rulings, multi-step reasoning required (person-vs-function test, QV fact counting). The GenAI panel struggled with the same boundaries.
Immediate Analysis (once last annotator finishes)
- Export labels from labelapp (bun run la:export)
- Per-dimension alpha: Compute Krippendorff's alpha for category and specificity separately. Hypothesis: category alpha is significantly higher than specificity alpha (matching the GenAI pattern where Spec 4 was only 37.6% unanimous).
- Pairwise Kappa matrix: All 15 annotator pairs. Identify if one annotator is a systematic outlier or if disagreement is uniform.
- Stratum-level agreement: Break down consensus rates by sampling stratum (Management↔RMP, None/Other↔Strategy, Spec [3,4], proportional random, etc.). The hard strata should show lower agreement; the proportional random stratum should be higher.
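The three analyses above can be sketched in pure Python. This is a minimal sketch, not the labelapp export pipeline itself: it assumes labels arrive as per-paragraph lists (only the coders who actually labeled that paragraph under the BIBD, so missing codings are simply absent), and the kappa helper assumes the two lists are already restricted to paragraphs both annotators labeled.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    # units: one list per paragraph containing the labels actually assigned
    # (3 per paragraph under the BIBD); coders who skipped a unit are omitted.
    coincidence = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # units with a single coding carry no agreement signal
        for a, b in permutations(values, 2):
            coincidence[(a, b)] += 1 / (m - 1)
    n_c = Counter()
    for (a, _), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in coincidence.items() if a != b)
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e

def cohen_kappa(y1, y2):
    # y1, y2: aligned labels from two annotators over their shared paragraphs.
    n = len(y1)
    po = sum(a == b for a, b in zip(y1, y2)) / n
    pe = sum((y1.count(l) / n) * (y2.count(l) / n)
             for l in set(y1) | set(y2))
    return (po - pe) / (1 - pe)
```

Run krippendorff_alpha_nominal twice (category labels only, then specificity labels only) for the per-dimension split, and cohen_kappa over all 15 annotator pairs for the outlier check.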
The Adverse Incentive Problem
The assignment requires F1 > 0.80 on the holdout to pass. This creates a perverse incentive: pick easy, unambiguous paragraphs for the holdout → high human agreement, high GenAI scores, high fine-tuned model F1 → passing grade, meaningless evaluation.
We did the opposite: stratified to stress-test decision boundaries. This produces a harder holdout with lower headline numbers but an actually informative evaluation.
Mitigation: Report F1 on both the full 1,200 holdout AND the 720-paragraph "proportional stratified random" subsample separately. The proportional subsample approximates what a random holdout would look like. The delta between the two quantifies exactly how much performance degrades at decision boundaries. This isn't gaming — it's rigorous reporting.
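The dual report is cheap to compute once predictions exist. A sketch, assuming each holdout item carries a "stratum" field from the sampling design (the field name and the "proportional" tag are placeholders for whatever the export actually uses):

```python
def macro_f1(y_true, y_pred):
    # Macro-averaged F1 over whatever label space is passed in
    # (category, specificity, or the joint 28-class label).
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

def dual_report(items, preds):
    # items: [{"id", "stratum", "gold"}]; preds: {id: predicted label}.
    full = macro_f1([it["gold"] for it in items],
                    [preds[it["id"]] for it in items])
    prop = [it for it in items if it["stratum"] == "proportional"]
    sub = macro_f1([it["gold"] for it in prop],
                   [preds[it["id"]] for it in prop])
    # boundary_delta quantifies degradation at decision boundaries
    return {"full_f1": full, "proportional_f1": sub,
            "boundary_delta": sub - full}
```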
The A-grade criteria ("error analysis," "comparison to amateur labels") are directly served by our approach. The low human agreement rate is a finding, not a failure.
Gold Set Repair Strategy: 13+ Signals Per Paragraph
Existing signals (7 per paragraph)
- 3 human labels (from labelapp, with notes and timing)
- 3 Stage 1 GenAI labels (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
- 1 Opus golden label (with full reasoning trace)
New signals from GenAI benchmark (6+ additional)
The assignment requires benchmarking 6+ models from 3+ suppliers against the holdout. This serves triple duty:
- Assignment deliverable (GenAI benchmark table)
- Gold set repair evidence (6+ more annotation signals for adjudication)
- "GenAI vs amateur" comparison (A-grade criterion)
Candidate models (6+ from 3+ suppliers):
- OpenAI: gpt-5.4-mini, gpt-5.4
- Google: gemini-3-flash, gemini-3-pro (or similar)
- Anthropic: claude-sonnet-4.6, claude-haiku-4.5
- xAI: grok-4.20 (or similar)
- Others as needed for count
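The benchmark scaffolding is supplier-agnostic if the API dispatch is isolated behind one function. A sketch (call_model is a placeholder — each supplier needs its own client and response parsing, and the model list is illustrative):

```python
MODELS = ["gpt-5.4-mini", "gemini-3-flash", "claude-haiku-4.5"]  # subset for illustration

def call_model(model, text):
    # Placeholder: dispatch to the supplier's API client and parse the
    # reply into a (category, specificity) tuple. Not implemented here.
    raise NotImplementedError

def run_benchmark(paragraphs, models=MODELS, call=call_model):
    # One record per holdout paragraph, keyed by model, ready to be
    # merged with the human + Stage 1 + Opus signals for adjudication.
    records = []
    for p in paragraphs:
        labels = {m: call(m, p["text"]) for m in models}
        records.append({"id": p["id"], "labels": labels})
    return records
```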
After the benchmark, each paragraph has 13+ independent annotations. This is an absurdly rich signal for adjudication.
Adjudication tiers
Tier 1 — High confidence: 10+ of the 13 signals agree on both dimensions. Gold label, no intervention needed. Expected: ~500-600 paragraphs.
Tier 2 — Clear majority with cross-validation: Human majority exists (2/3) and matches GenAI consensus (majority of 10 GenAI labels). Strong signal — take the consensus. Expected: ~300-400 paragraphs.
Tier 3 — Human split, GenAI consensus: Humans disagree but GenAI labels converge. Use Opus reasoning trace + GenAI consensus to inform expert adjudication. Human (Joey) makes the final call. Expected: ~100-200 paragraphs.
Tier 4 — Universal disagreement: Humans and GenAI both split. Genuinely ambiguous. Expert adjudication with documented reasoning, or flag as inherently ambiguous and report in error analysis. Expected: ~50-100 paragraphs.
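The tier assignment is mechanical given the pooled signals. A sketch under the tier definitions above — edge cases the plan doesn't pin down (e.g. human majority that contradicts GenAI consensus) fall through to Tier 4 here, which is an assumption:

```python
from collections import Counter

def assign_tier(human, genai):
    # human: 3 (category, specificity) tuples; genai: ~10 tuples.
    pooled_top, pooled_n = Counter(human + genai).most_common(1)[0]
    if pooled_n >= 10:
        return 1, pooled_top          # Tier 1: 10+/13 agree on both dimensions
    h_top, h_n = Counter(human).most_common(1)[0]
    g_top, g_n = Counter(genai).most_common(1)[0]
    genai_consensus = g_n > len(genai) / 2
    if h_n >= 2 and genai_consensus and h_top == g_top:
        return 2, h_top               # Tier 2: human majority + GenAI cross-validation
    if h_n < 2 and genai_consensus:
        return 3, g_top               # Tier 3: humans split, GenAI converges (expert confirms)
    return 4, None                    # Tier 4: universal disagreement -> expert adjudication
```

Tier 3's returned label is a candidate for the expert to confirm, not the gold label itself, consistent with the no-circularity rule below.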
The GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision. This avoids circularity — we're not evaluating GenAI against GenAI-derived labels. We're using GenAI agreement patterns to identify which human label is most likely correct in cases of human disagreement.
If we can't produce reliable gold labels from 13+ signals per paragraph, the construct itself is ill-defined. That would be an important finding too — but given that the GenAI panel achieved 70.8% both-unanimous on 50K paragraphs (unstratified), and the hardest axes have clear codebook resolutions, the construct should hold.
The Meta-Narrative
The finding that trained student annotators reach α = 0.616 while calibrated LLM panels reach 70.8%+ both-dimension unanimity on the same task (different metrics, but pointing the same direction) supports the synthetic experts hypothesis. For complex, rule-heavy classification tasks requiring multi-step reasoning, LLMs with reasoning tokens can match or exceed human annotation quality.
This isn't a failure of the humans — it's the whole point of the project. The Ringel pipeline exists because these tasks are too cognitively demanding for consistent human annotation at scale. The human labels are essential as a calibration anchor, but GenAI's advantage on rule-application tasks is a key finding.
Task Sequence (dependency order)
Can start now (no blockers)
- Judge prompt v3.0 update (codebook rulings → buildJudgePrompt())
- Fine-tuning pipeline code (dual-head classifier, sample weighting, train/val/test split)
- GenAI benchmark infrastructure (scripts to run 6+ models on holdout)
After last annotator finishes
- Export + per-dimension alpha + pairwise Kappa matrix + stratum breakdown
- Run GenAI benchmark on 1,200 holdout (6+ models, 3+ suppliers)
- Gold set adjudication using 13+ signals per paragraph
- Judge v3.0 validation against adjudicated gold set
After gold set is finalized
- Training data assembly (unanimous + calibrated majority + judge)
- Fine-tuning + ablations (7 experiments)
- Final evaluation on holdout
- Writeup + IGNITE slides
Open Questions
- F1 threshold per-dimension? Worth asking Ringel if the 0.80 F1 requirement applies to the joint 28-class label or can be reported per-dimension (category + specificity separately).
- Soft labels for ambiguous cases? For Tier 4 paragraphs, could use label distributions as soft targets during training instead of forcing a hard label. More sophisticated but harder to evaluate.
- One bad annotator vs. uniform disagreement? The pairwise Kappa matrix will answer this. If one annotator is systematically off, their labels could be downweighted during adjudication.
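The soft-label option above is straightforward to construct: normalize the 13+ signals into a distribution over the joint 28-class space and use it as the training target instead of a one-hot. A sketch (function name and class ordering are placeholders):

```python
from collections import Counter

def soft_target(labels, classes):
    # labels: all 13+ (category, specificity) annotations for a Tier 4
    # paragraph; classes: a fixed ordering of the 28 joint classes.
    counts = Counter(labels)
    total = sum(counts.values())
    return [counts[c] / total for c in classes]
```

Training against this with cross-entropy preserves the ambiguity signal; the open evaluation question is what the "gold" label is at test time for such paragraphs.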