Post-Labeling Plan — Gold Set Repair & Final Pipeline
Updated 2026-04-02 with actual human labeling results.
Human Labeling Results
Completed 2026-04-01. 3,600 labels (1,200 paragraphs × 3 annotators via BIBD), 21.5 active hours total.
Per-Dimension Agreement
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
Category is reliable. α = 0.801 exceeds the 0.80 conventional threshold. Human majority matches Stage 1 GenAI majority on 83.3% of paragraphs for category.
Specificity is unreliable. α = 0.546 is well below the 0.667 threshold. The gap is driven by two factors: one outlier annotator and a genuinely hard Spec 3↔4 boundary.
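A minimal sketch of how these agreement numbers can be reproduced from the exported labels, assuming a long-format CSV with paragraph_id, annotator, category, and specificity columns (the file and column names are illustrative, not the actual export schema). It uses the krippendorff package for α and scikit-learn for pairwise κ, treating specificity as ordinal.

```python
# Sketch: recompute consensus, Krippendorff's alpha, and mean pairwise Cohen's kappa
# from a long-format label export (schema assumed, see lead-in).
from itertools import combinations

import krippendorff  # pip install krippendorff
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

labels = pd.read_csv("human_labels.csv")

def reliability_matrix(df: pd.DataFrame, col: str, nominal: bool) -> np.ndarray:
    """Annotator x paragraph matrix, NaN where an annotator did not see a unit (BIBD)."""
    values = pd.Categorical(df[col]).codes.astype(float) if nominal else df[col].astype(float)
    wide = df.assign(_v=values).pivot(index="annotator", columns="paragraph_id", values="_v")
    return wide.to_numpy()

cat_alpha = krippendorff.alpha(reliability_data=reliability_matrix(labels, "category", True),
                               level_of_measurement="nominal")
spec_alpha = krippendorff.alpha(reliability_data=reliability_matrix(labels, "specificity", False),
                                level_of_measurement="ordinal")

def mean_pairwise_kappa(df: pd.DataFrame, col: str) -> float:
    """Average Cohen's kappa over annotator pairs, on the paragraphs each pair shares."""
    wide = df.pivot(index="paragraph_id", columns="annotator", values=col)
    scores = []
    for a, b in combinations(wide.columns, 2):
        shared = wide[[a, b]].dropna()  # paragraphs this pair both labeled
        if len(shared):
            scores.append(cohen_kappa_score(shared[a], shared[b]))
    return float(np.mean(scores))

consensus = (labels.groupby("paragraph_id")["category"].nunique() == 1).mean()
print(cat_alpha, spec_alpha, mean_pairwise_kappa(labels, "category"), consensus)
```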
The Aaryan Problem
One annotator (Aaryan) is a systematic outlier:
- Labels 67% of paragraphs as Spec 4 (Quantified-Verifiable) — others: 8-23%, Stage 1: 9%
- Specificity bias: +1.28 levels vs Stage 1 (massive over-rater)
- Specificity κ: 0.03-0.25 (essentially chance)
- Category κ: 0.40-0.50 (below "moderate")
- Only 3 quiz attempts (lowest; others: 6-11)
Excluding his labels on his 600 paragraphs, both-unanimous agreement jumps from 5% to 50% (+45pp).
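A hedged sketch of the two checks behind this diagnosis: per-annotator specificity bias against the Stage 1 majority, and the change in both-dimension unanimity when one annotator's labels are dropped. File names, column names, and the annotator identifier are illustrative.

```python
# Sketch: quantify one annotator's specificity bias vs Stage 1 and the effect of
# dropping their labels on both-dimension unanimity (schema assumed, see lead-in).
import pandas as pd

human = pd.read_csv("human_labels.csv")      # paragraph_id, annotator, category, specificity
stage1 = pd.read_csv("stage1_majority.csv")  # paragraph_id, category, specificity (panel majority)

merged = human.merge(stage1, on="paragraph_id", suffixes=("", "_s1"))
bias = (merged.specificity - merged.specificity_s1).groupby(merged.annotator).mean()
print(bias.sort_values(ascending=False))     # the outlier surfaces at roughly +1.3 levels

def both_unanimous(df: pd.DataFrame) -> float:
    """Share of paragraphs where all remaining annotators agree on both dimensions."""
    agg = df.groupby("paragraph_id").agg(cats=("category", "nunique"),
                                         specs=("specificity", "nunique"))
    return float(((agg.cats == 1) & (agg.specs == 1)).mean())

print(both_unanimous(human), both_unanimous(human[human.annotator != "aaryan"]))
```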
Confusion Axes (Human vs GenAI — Same Order)
- Management Role ↔ Risk Management Process (dominant)
- Board Governance ↔ Management Role
- None/Other ↔ Strategy Integration (materiality disclaimers)
The same axes appear, in the same order, for both humans and the GenAI panel: the codebook boundaries drive the disagreement, not annotator or model limitations.
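A small sketch of how these axes can be surfaced: cross-tabulate human majority categories against the Stage 1 panel majority and rank the off-diagonal cells. File and column names are illustrative.

```python
# Sketch: rank the dominant category confusion pairs between human majority labels
# and the Stage 1 panel majority (schema assumed, see lead-in).
import numpy as np
import pandas as pd

human = pd.read_csv("human_majority.csv")    # paragraph_id, category (human 2/3+ majority)
genai = pd.read_csv("stage1_majority.csv")   # paragraph_id, category (panel majority)

merged = human.merge(genai, on="paragraph_id", suffixes=("_human", "_genai"))
cats = sorted(set(merged.category_human) | set(merged.category_genai))
confusion = (pd.crosstab(merged.category_human, merged.category_genai)
               .reindex(index=cats, columns=cats, fill_value=0))

# Off-diagonal cells, largest first: expect MR<->RMP on top, then BG<->MR, then N/O<->SI.
offdiag = confusion.where(~np.eye(len(cats), dtype=bool)).stack().sort_values(ascending=False)
print(offdiag.head(6))
```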
The Adverse Incentive Problem
The assignment requires F1 > 0.80 on the holdout to pass. The holdout was deliberately stratified to over-sample hard decision boundaries (120 MR↔RMP, 80 N/O↔SI, 80 Spec [3,4] splits, etc.).
Mitigation: report F1 on both the full 1,200-paragraph holdout AND the 720-paragraph proportional subsample. The delta quantifies performance degradation at decision boundaries. The stratified design directly serves the A-grade "error analysis" criterion.
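A minimal sketch of that dual report, assuming a predictions file with paragraph_id, y_true, y_pred, and a flag marking membership in the proportional subsample (the schema is illustrative).

```python
# Sketch: macro-F1 on the full stratified holdout vs the proportional subsample,
# plus the delta that quantifies degradation at decision boundaries.
import pandas as pd
from sklearn.metrics import f1_score

preds = pd.read_csv("holdout_predictions.csv")

full_f1 = f1_score(preds.y_true, preds.y_pred, average="macro")
prop = preds[preds.in_proportional_subsample.astype(bool)]
prop_f1 = f1_score(prop.y_true, prop.y_pred, average="macro")

print(f"full holdout macro-F1:        {full_f1:.3f}")
print(f"proportional subsample F1:    {prop_f1:.3f}")
print(f"delta (proportional - full):  {prop_f1 - full_f1:+.3f}")
```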
Gold Set Repair Strategy: 13 Signals Per Paragraph
Annotation sources per paragraph
| Source | Count | Prompt | Notes |
|---|---|---|---|
| Human annotators | 3 | Codebook v3.0 | With notes, timing data |
| Stage 1 panel (gemini-flash-lite, mimo-flash, grok-fast) | 3 | v2.5 | Already on file |
| Opus 4.6 golden | 1 | v2.5 + full codebook | With reasoning traces |
| Benchmark models (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 | Running now |
| Total | 13 | | |
Adjudication tiers
Tier 1 — High confidence: 10+/13 agree on both dimensions. Gold label, no intervention.
Tier 2 — Clear majority with cross-validation: Human majority (2/3) matches GenAI consensus (majority of 10 GenAI labels). Take the consensus.
Tier 3 — Human split, GenAI consensus: Humans disagree but GenAI labels converge. Expert adjudication informed by Opus reasoning traces. Human makes the final call.
Tier 4 — Universal disagreement: Everyone splits. Expert adjudication with documented reasoning, or flagged as inherently ambiguous for error analysis.
GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision — this avoids circularity.
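A sketch of the tier assignment over the 13 signals per paragraph, using the thresholds stated above (10+/13, human 2/3 majority, GenAI majority of 10). A label is treated as a (category, specificity) pair; how the rare conflicting-majorities case is routed (to Tier 4 here) is an assumption, and Tiers 3 and 4 return no gold label because the final call stays with a human.

```python
# Sketch: assign each paragraph to an adjudication tier from its 13 signals.
# Field names and the conflicting-majorities routing are assumptions, see lead-in.
from collections import Counter

def assign_tier(human_labels: list, genai_labels: list) -> dict:
    """human_labels: 3 (category, specificity) tuples; genai_labels: 10 tuples."""
    top_label, top_count = Counter(human_labels + genai_labels).most_common(1)[0]

    # Tier 1: 10+ of 13 sources agree on both dimensions -> gold label, no intervention.
    if top_count >= 10:
        return {"tier": 1, "gold": top_label}

    h_label, h_count = Counter(human_labels).most_common(1)[0]
    g_label, g_count = Counter(genai_labels).most_common(1)[0]
    human_majority = h_label if h_count >= 2 else None
    genai_majority = g_label if g_count > len(genai_labels) / 2 else None

    # Tier 2: human 2/3 majority cross-validated by GenAI consensus -> take the consensus.
    if human_majority is not None and human_majority == genai_majority:
        return {"tier": 2, "gold": human_majority}

    # Tier 3: humans split but GenAI converges -> expert adjudication with Opus traces.
    if human_majority is None and genai_majority is not None:
        return {"tier": 3, "gold": None}

    # Tier 4: no cross-validated consensus (everyone splits, or majorities conflict)
    # -> expert adjudication with documented reasoning, or flag as inherently ambiguous.
    return {"tier": 4, "gold": None}
```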
Task Sequence
In progress
- Human labeling — complete
- Data export and IRR analysis — complete
- Prompt v3.0 update with codebook rulings — complete
- GenAI benchmark infrastructure — complete
- Opus golden re-run on correct holdout (running, ~1h with 20 workers)
- 6-model benchmark on holdout (running, high concurrency)
After benchmark completes
- Cross-source analysis with all 13 signals (update analyze-gold.py)
- Gold set adjudication using tiered strategy
- Training data assembly (unanimous + calibrated majority + judge)
After gold set is finalized
- Fine-tuning + ablations (7 experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} + best)
- Final evaluation on holdout
- Writeup + IGNITE slides
The Meta-Narrative
The finding that trained student annotators achieve α = 0.801 on category but only α = 0.546 on specificity, while calibrated LLM panels achieve 70.8%+ both-unanimous on an easier sample, validates the synthetic experts hypothesis for rule-heavy classification tasks. The human labels are essential as a calibration anchor, but GenAI's advantage on multi-step reasoning tasks (like QV fact counting) is itself a key finding.
The low specificity agreement is not annotator incompetence — it's evidence that the specificity construct requires cognitive effort that humans don't consistently invest at the 15-second-per-paragraph pace the task demands. The GenAI panel, which processes every paragraph with the same systematic attention to the IS/NOT lists and counting rules, achieves more consistent results on this specific dimension.