# Post-Labeling Plan — Gold Set Repair & Final Pipeline

Written 2026-04-01 while waiting for the last human annotator to finish.

---

## The Situation

Human labeling is nearly complete (1,200 paragraphs, 6 annotators, 3 per paragraph via BIBD). Current inter-annotator agreement:

- **Cohen's Kappa (avg):** 0.622
- **Krippendorff's alpha:** 0.616

These numbers are at the floor of "substantial agreement" (Landis & Koch) but below the 0.667 threshold Krippendorff recommends for even tentative conclusions. The holdout was deliberately stratified to over-sample hard cases (120 Management↔RMP splits, 80 None/Other↔Strategy splits, 80 Spec [3,4] splits, etc.), so raw consensus reflects sampling difficulty, not pure annotator quality.

The task is genuinely hard: 7 categories, 4 specificity levels, 5 decision rules, 3 codebook rulings, and multi-step reasoning (the person-vs-function test, QV fact counting). The GenAI panel struggled with the same boundaries.

---

## Immediate Analysis (once the last annotator finishes)

1. **Export labels** from labelapp (`bun run la:export`).
2. **Per-dimension alpha:** Compute Krippendorff's alpha for category and specificity separately. Hypothesis: category alpha is significantly higher than specificity alpha (matching the GenAI pattern, where Spec 4 was only 37.6% unanimous).
3. **Pairwise Kappa matrix:** All 15 annotator pairs. Identify whether one annotator is a systematic outlier or disagreement is uniform.
4. **Stratum-level agreement:** Break down consensus rates by sampling stratum (Management↔RMP, None/Other↔Strategy, Spec [3,4], proportional random, etc.). The hard strata should show lower agreement; the proportional random stratum should be higher.

---

## The Adverse Incentive Problem

The assignment requires F1 > 0.80 on the holdout to pass. This creates a perverse incentive: pick easy, unambiguous paragraphs for the holdout → high human agreement, high GenAI scores, high fine-tuned model F1 → passing grade, meaningless evaluation.
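The agreement statistics in the Immediate Analysis steps (per-dimension alpha, the 15-pair kappa matrix) can be sketched in pure Python. This is a minimal sketch assuming labels come out of `la:export` as per-paragraph label lists and as per-annotator dicts; a real run would more likely use the `krippendorff` and `scikit-learn` packages.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two raters over the same nominal items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in ca) / (n * n)    # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha (nominal) over units = list of label lists,
    one list per paragraph (here: the 3 BIBD labels per paragraph)."""
    o = Counter()  # coincidence matrix, keyed by (value, value)
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for i in range(m):
            for j in range(m):
                if i != j:
                    o[(labels[i], labels[j])] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_obs = sum(v for (c, k), v in o.items() if c != k)
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    return 1.0 - (n - 1) * d_obs / d_exp

def pairwise_kappa(by_annotator):
    """All annotator-pair kappas, restricted to jointly labeled paragraphs.
    by_annotator: {annotator_id: {paragraph_id: label}}."""
    out = {}
    for p, q in combinations(sorted(by_annotator), 2):
        shared = sorted(by_annotator[p].keys() & by_annotator[q].keys())
        if shared:
            out[(p, q)] = cohen_kappa(
                [by_annotator[p][u] for u in shared],
                [by_annotator[q][u] for u in shared],
            )
    return out
```

Under BIBD each pair shares only a subset of paragraphs, which is why `pairwise_kappa` intersects the two annotators' paragraph sets before computing kappa.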
We did the opposite: stratified to stress-test decision boundaries. This produces a harder holdout with lower headline numbers but an actually informative evaluation.

**Mitigation:** Report F1 on both the full 1,200-paragraph holdout AND the 720-paragraph "proportional stratified random" subsample separately. The proportional subsample approximates what a random holdout would look like. The delta between the two quantifies exactly how much performance degrades at decision boundaries.

This isn't gaming — it's rigorous reporting. The A-grade criteria ("error analysis," "comparison to amateur labels") are directly served by our approach. The low human agreement rate is a finding, not a failure.

---

## Gold Set Repair Strategy: 13+ Signals Per Paragraph

### Existing signals (7 per paragraph)

- 3 human labels (from labelapp, with notes and timing)
- 3 Stage 1 GenAI labels (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
- 1 Opus golden label (with full reasoning trace)

### New signals from the GenAI benchmark (6+ additional)

The assignment requires benchmarking 6+ models from 3+ suppliers against the holdout. This serves triple duty:

1. Assignment deliverable (GenAI benchmark table)
2. Gold set repair evidence (6+ more annotation signals for adjudication)
3. "GenAI vs amateur" comparison (A-grade criterion)

**Candidate models (6+ from 3+ suppliers):**

- OpenAI: gpt-5.4-mini, gpt-5.4
- Google: gemini-3-flash, gemini-3-pro (or similar)
- Anthropic: claude-sonnet-4.6, claude-haiku-4.5
- xAI: grok-4.20 (or similar)
- Others as needed for count

After the benchmark, each paragraph has **13+ independent annotations**. This is an absurdly rich signal for adjudication.

### Adjudication tiers

**Tier 1 — High confidence:** 10+/13 annotators agree on both dimensions. Gold label, no intervention needed. Expected: ~500-600 paragraphs.

**Tier 2 — Clear majority with cross-validation:** A human majority exists (2/3) and matches the GenAI consensus (majority of the 10 GenAI labels).
Strong signal — take the consensus. Expected: ~300-400 paragraphs.

**Tier 3 — Human split, GenAI consensus:** Humans disagree but GenAI labels converge. Use the Opus reasoning trace + GenAI consensus to inform expert adjudication. A human (Joey) makes the final call. Expected: ~100-200 paragraphs.

**Tier 4 — Universal disagreement:** Humans and GenAI both split. Genuinely ambiguous. Expert adjudication with documented reasoning, or flag as inherently ambiguous and report in the error analysis. Expected: ~50-100 paragraphs.

The GenAI labels are evidence for adjudication, not the gold label itself; the final label is always a human decision. This avoids circularity — we're not evaluating GenAI against GenAI-derived labels. We're using GenAI agreement patterns to identify which human label is most likely correct in cases of human disagreement.

If we can't produce reliable gold labels from 13+ signals per paragraph, the construct itself is ill-defined. That would be an important finding too — but given that the GenAI panel achieved 70.8% both-unanimous on 50K paragraphs (unstratified), and the hardest axes have clear codebook resolutions, the construct should hold.

---

## The Meta-Narrative

The finding that trained student annotators achieve α = 0.616 while calibrated LLM panels achieve 70.8%+ unanimity on the same task supports the synthetic-experts hypothesis: for complex, rule-heavy classification tasks requiring multi-step reasoning, LLMs with reasoning tokens can match or exceed human annotation quality.

This isn't a failure of the humans — it's the whole point of the project. The Ringel pipeline exists because these tasks are too cognitively demanding for consistent human annotation at scale. The human labels are essential as a calibration anchor, but GenAI's advantage on rule-application tasks is a key finding.
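The tier assignment above can be sketched as a single function. A minimal sketch, assuming each paragraph arrives as 3 human labels plus 10 GenAI labels, each a hashable (category, specificity) pair so agreement is checked on both dimensions at once; the case where a human majority exists but the GenAI consensus contradicts it falls through to Tier 4 here, a choice the plan leaves implicit.

```python
from collections import Counter

def assign_tier(human, genai):
    """Tier a paragraph for adjudication from 3 human + 10 GenAI labels.
    Returns (tier, gold_label_or_None); None means human adjudication."""
    top, top_n = Counter(human + genai).most_common(1)[0]
    if top_n >= 10:                         # Tier 1: 10+/13 agree
        return 1, top
    h_top, h_n = Counter(human).most_common(1)[0]
    g_top, g_n = Counter(genai).most_common(1)[0]
    genai_majority = g_n > len(genai) / 2   # strict majority of 10
    if h_n >= 2 and genai_majority and h_top == g_top:
        return 2, h_top                     # Tier 2: cross-validated majority
    if h_n < 2 and genai_majority:
        return 3, None                      # Tier 3: humans split, GenAI converge
    return 4, None                          # Tier 4: adjudicate from scratch
```

Keeping the tier rules in one pure function like this makes the expected tier counts (~500-600 / ~300-400 / ~100-200 / ~50-100) a one-line histogram over the exported data.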
---

## Task Sequence (dependency order)

### Can start now (no blockers)

- [ ] Judge prompt v3.0 update (codebook rulings → `buildJudgePrompt()`)
- [ ] Fine-tuning pipeline code (dual-head classifier, sample weighting, train/val/test split)
- [ ] GenAI benchmark infrastructure (scripts to run 6+ models on the holdout)

### After the last annotator finishes

- [ ] Export + per-dimension alpha + pairwise Kappa matrix + stratum breakdown
- [ ] Run the GenAI benchmark on the 1,200 holdout (6+ models, 3+ suppliers)
- [ ] Gold set adjudication using 13+ signals per paragraph
- [ ] Judge v3.0 validation against the adjudicated gold set

### After the gold set is finalized

- [ ] Training data assembly (unanimous + calibrated majority + judge)
- [ ] Fine-tuning + ablations (7 experiments)
- [ ] Final evaluation on the holdout
- [ ] Writeup + IGNITE slides

---

## Open Questions

1. **F1 threshold per dimension?** Worth asking Ringel whether the 0.80 F1 requirement applies to the joint 28-class label or can be reported per dimension (category and specificity separately).
2. **Soft labels for ambiguous cases?** For Tier 4 paragraphs, we could use label distributions as soft targets during training instead of forcing a hard label. More sophisticated, but harder to evaluate.
3. **One bad annotator vs. uniform disagreement?** The pairwise Kappa matrix will answer this. If one annotator is systematically off, their labels could be downweighted during adjudication.
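Open Question 2 could be prototyped along these lines. A sketch only: the `human_weight` knob, which upweights each of the 3 human labels relative to each of the 10 GenAI labels before normalizing, is illustrative and not part of the plan.

```python
from collections import Counter

def soft_target(human, genai, human_weight=2.0):
    """Turn the 13 annotations on a Tier 4 paragraph into a soft label
    distribution, usable as a training target (e.g. soft cross-entropy)
    instead of a forced hard label. human_weight is a hypothetical knob."""
    w = Counter()
    for lab in human:
        w[lab] += human_weight   # each human label counts human_weight votes
    for lab in genai:
        w[lab] += 1.0            # each GenAI label counts one vote
    total = sum(w.values())
    return {lab: v / total for lab, v in w.items()}
```

The "harder to evaluate" caveat stands: against these targets one would report a distributional metric (e.g. cross-entropy or top-1 against the argmax) rather than plain F1.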