# Post-Labeling Plan — Gold Set Repair & Final Pipeline
Written 2026-04-01 while waiting for the last human annotator to finish.
---
## The Situation
Human labeling is nearly complete (1,200 paragraphs, 6 annotators, 3 per paragraph via a balanced incomplete block design, BIBD). Current inter-annotator agreement:
- **Cohen's kappa (averaged over annotator pairs):** 0.622
- **Krippendorff's alpha:** 0.616
These numbers are at the floor of "substantial agreement" (Landis & Koch) but below the 0.667 threshold Krippendorff recommends for tentative conclusions. The holdout was deliberately stratified to over-sample hard cases (120 Management↔RMP splits, 80 None/Other↔Strategy splits, 80 Spec [3,4] splits, etc.), so raw consensus reflects sampling difficulty, not pure annotator quality.
The task is genuinely hard: 7 categories, 4 specificity levels, 5 decision rules, 3 codebook rulings, multi-step reasoning required (person-vs-function test, QV fact counting). The GenAI panel struggled with the same boundaries.
---
## Immediate Analysis (once last annotator finishes)
1. **Export labels** from labelapp (`bun run la:export`)
2. **Per-dimension alpha:** Compute Krippendorff's alpha for category and specificity separately. Hypothesis: category alpha is significantly higher than specificity alpha (matching the GenAI pattern where Spec 4 was only 37.6% unanimous).
3. **Pairwise kappa matrix:** Compute Cohen's kappa for all 15 annotator pairs. Determine whether one annotator is a systematic outlier or disagreement is spread uniformly.
4. **Stratum-level agreement:** Break down consensus rates by sampling stratum (Management↔RMP, None/Other↔Strategy, Spec [3,4], proportional random, etc.). The hard strata should show lower agreement; the proportional random stratum should be higher.
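
Items 2 and 3 reduce to a self-contained pass over the exported labels. A minimal sketch — the data layout (one list of labels per paragraph, missing ratings simply omitted; a per-annotator dict for the kappa matrix) is an assumption about what the labelapp export yields, not its actual schema:

```python
from collections import Counter
from itertools import combinations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.
    `units`: one list of labels per paragraph; missing ratings are left out."""
    o = Counter()  # coincidence matrix over ordered label pairs
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # unpaired units contribute no coincidences
        for i, a in enumerate(ratings):
            for j, b in enumerate(ratings):
                if i != j:
                    o[(a, b)] += 1.0 / (m - 1)
    marginals = Counter()
    for (a, _), w in o.items():
        marginals[a] += w
    n = sum(marginals.values())
    if n <= 1:
        return 1.0
    d_obs = sum(w for (a, b), w in o.items() if a != b) / n
    d_exp = sum(marginals[a] * marginals[b]
                for a in marginals for b in marginals if a != b) / (n * (n - 1))
    return 1.0 if d_exp == 0 else 1.0 - d_obs / d_exp

def cohens_kappa(y1, y2):
    """Cohen's kappa for two annotators over the same paragraphs."""
    n = len(y1)
    p_o = sum(a == b for a, b in zip(y1, y2)) / n
    c1, c2 = Counter(y1), Counter(y2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def pairwise_kappa_matrix(by_annotator):
    """by_annotator: {name: {paragraph_id: label}} -> kappa per annotator pair,
    computed over the paragraphs each pair shares."""
    out = {}
    for a, b in combinations(sorted(by_annotator), 2):
        ids = sorted(by_annotator[a].keys() & by_annotator[b].keys())
        out[(a, b)] = cohens_kappa([by_annotator[a][i] for i in ids],
                                   [by_annotator[b][i] for i in ids])
    return out
```

Running `krippendorff_alpha_nominal` twice — once on category labels, once on specificity — gives the per-dimension split from item 2.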
---
## The Adverse Incentive Problem
The assignment requires F1 > 0.80 on the holdout to pass. This creates a perverse incentive: pick easy, unambiguous paragraphs for the holdout → high human agreement, high GenAI scores, high fine-tuned model F1 → passing grade, meaningless evaluation.
We did the opposite: stratified to stress-test decision boundaries. This produces a harder holdout with lower headline numbers but an actually informative evaluation.
**Mitigation:** Report F1 on both the full 1,200 holdout AND the 720-paragraph "proportional stratified random" subsample separately. The proportional subsample approximates what a random holdout would look like. The delta between the two quantifies exactly how much performance degrades at decision boundaries. This isn't gaming — it's rigorous reporting.
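
The dual report is mechanical once predictions exist. A sketch, assuming each holdout row carries its sampling stratum and the proportional subsample is tagged `"proportional"` (the field names and stratum tag are assumptions, not the pipeline's actual schema):

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over whatever labels appear in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        scores.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

def dual_holdout_report(rows):
    """rows: dicts with 'gold', 'pred', 'stratum'. Reports F1 on the full
    stratified holdout, on the proportional subsample, and the delta that
    quantifies degradation at decision boundaries."""
    full = macro_f1([r["gold"] for r in rows], [r["pred"] for r in rows])
    prop = [r for r in rows if r["stratum"] == "proportional"]
    sub = macro_f1([r["gold"] for r in prop], [r["pred"] for r in prop])
    return {"full_f1": full, "proportional_f1": sub, "delta": sub - full}
```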
The A-grade criteria ("error analysis," "comparison to amateur labels") are directly served by our approach. The low human agreement rate is a finding, not a failure.
---
## Gold Set Repair Strategy: 13+ Signals Per Paragraph
### Existing signals (7 per paragraph)
- 3 human labels (from labelapp, with notes and timing)
- 3 Stage 1 GenAI labels (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
- 1 Opus golden label (with full reasoning trace)
### New signals from GenAI benchmark (6+ additional)
The assignment requires benchmarking 6+ models from 3+ suppliers against the holdout. This serves triple duty:
1. Assignment deliverable (GenAI benchmark table)
2. Gold set repair evidence (6+ more annotation signals for adjudication)
3. "GenAI vs amateur" comparison (A-grade criterion)
**Candidate models (6+ from 3+ suppliers):**
- OpenAI: gpt-5.4-mini, gpt-5.4
- Google: gemini-3-flash, gemini-3-pro (or similar)
- Anthropic: claude-sonnet-4.6, claude-haiku-4.5
- xAI: grok-4.20 (or similar)
- Others as needed for count
After the benchmark, each paragraph has **13+ independent annotations**. This is an absurdly rich signal for adjudication.
### Adjudication tiers
**Tier 1 — High confidence:** 10+/13 annotators agree on both dimensions. Gold label, no intervention needed. Expected: ~500-600 paragraphs.
**Tier 2 — Clear majority with cross-validation:** Human majority exists (2/3) and matches GenAI consensus (majority of 10 GenAI labels). Strong signal — take the consensus. Expected: ~300-400 paragraphs.
**Tier 3 — Human split, GenAI consensus:** Humans disagree but GenAI labels converge. Use Opus reasoning trace + GenAI consensus to inform expert adjudication. Human (Joey) makes the final call. Expected: ~100-200 paragraphs.
**Tier 4 — Universal disagreement:** Humans and GenAI both split. Genuinely ambiguous. Expert adjudication with documented reasoning, or flag as inherently ambiguous and report in error analysis. Expected: ~50-100 paragraphs.
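
The four tiers reduce to a small routing function. A sketch under the plan's thresholds (3 human labels, 10 GenAI labels, each label a hashable category/specificity pair); note the catch-all is my assumption — the tiers above don't name the case where a human majority conflicts with GenAI consensus, so the sketch conservatively sends it to Tier 4:

```python
from collections import Counter

def adjudication_tier(human_labels, genai_labels):
    """Route one paragraph to a tier: returns (tier, proposed_label_or_None)."""
    combined = Counter(human_labels) + Counter(genai_labels)
    top, votes = combined.most_common(1)[0]
    if votes >= 10:                    # Tier 1: 10+/13 annotators agree
        return 1, top
    h_top, h_votes = Counter(human_labels).most_common(1)[0]
    g_top, g_votes = Counter(genai_labels).most_common(1)[0]
    human_majority = h_votes >= 2      # 2/3 human agreement
    genai_consensus = g_votes > len(genai_labels) / 2
    if human_majority and genai_consensus and h_top == g_top:
        return 2, h_top                # Tier 2: cross-validated human majority
    if not human_majority and genai_consensus:
        return 3, None                 # Tier 3: humans split, expert decides
    return 4, None                     # Tier 4 catch-all (assumed fallback)
```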
The GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision. This avoids circularity — we're not evaluating GenAI against GenAI-derived labels. We're using GenAI agreement patterns to identify which human label is most likely correct in cases of human disagreement.
If we can't produce reliable gold labels from 13+ signals per paragraph, the construct itself is ill-defined. That would be an important finding too — but given that the GenAI panel achieved 70.8% both-unanimous on 50K paragraphs (unstratified), and the hardest axes have clear codebook resolutions, the construct should hold.
---
## The Meta-Narrative
The finding that trained student annotators achieve α = 0.616 while calibrated LLM panels achieve 70.8%+ unanimity on the same task supports the synthetic-experts hypothesis (with the caveat that raw unanimity and chance-corrected alpha are not directly comparable metrics). For complex, rule-heavy classification tasks requiring multi-step reasoning, LLMs with reasoning tokens can match or exceed human annotation quality.
This isn't a failure of the humans — it's the whole point of the project. The Ringel pipeline exists because these tasks are too cognitively demanding for consistent human annotation at scale. The human labels are essential as a calibration anchor, but GenAI's advantage on rule-application tasks is a key finding.
---
## Task Sequence (dependency order)
### Can start now (no blockers)
- [ ] Judge prompt v3.0 update (codebook rulings → `buildJudgePrompt()`)
- [ ] Fine-tuning pipeline code (dual-head classifier, sample weighting, train/val/test split)
- [ ] GenAI benchmark infrastructure (scripts to run 6+ models on holdout)
### After last annotator finishes
- [ ] Export + per-dimension alpha + pairwise Kappa matrix + stratum breakdown
- [ ] Run GenAI benchmark on 1,200 holdout (6+ models, 3+ suppliers)
- [ ] Gold set adjudication using 13+ signals per paragraph
- [ ] Judge v3.0 validation against adjudicated gold set
### After gold set is finalized
- [ ] Training data assembly (unanimous + calibrated majority + judge)
- [ ] Fine-tuning + ablations (7 experiments)
- [ ] Final evaluation on holdout
- [ ] Writeup + IGNITE slides
---
## Open Questions
1. **F1 threshold per-dimension?** Worth asking Ringel whether the 0.80 F1 requirement applies to the joint 28-class label (7 categories × 4 specificity levels) or can be reported per dimension (category and specificity separately).
2. **Soft labels for ambiguous cases?** For Tier 4 paragraphs, could use label distributions as soft targets during training instead of forcing a hard label. More sophisticated but harder to evaluate.
3. **One bad annotator vs. uniform disagreement?** The pairwise Kappa matrix will answer this. If one annotator is systematically off, their labels could be downweighted during adjudication.
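
Question 2's soft-target idea can be sketched directly: for a Tier 4 paragraph, keep the full label distribution instead of forcing a hard label. The 2:1 human-vs-GenAI weighting below is purely illustrative, not a calibrated choice:

```python
from collections import Counter

def soft_target(labels, classes, human_weight=2.0):
    """Soft label distribution for an ambiguous (Tier 4) paragraph.
    `labels`: (label, source) pairs; `classes`: the joint label classes.
    human_weight=2.0 is an illustrative assumption, not a tuned value."""
    weights = Counter()
    for label, source in labels:
        weights[label] += human_weight if source == "human" else 1.0
    total = sum(weights.values())
    return [weights[c] / total for c in classes]
```

At train time this vector would replace the one-hot target, e.g. as the reference distribution in a cross-entropy loss.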