# Post-Labeling Plan — Gold Set Repair & Final Pipeline

Updated 2026-04-02 with actual human labeling results.

---

## Human Labeling Results

Completed 2026-04-01. 3,600 labels (1,200 paragraphs × 3 annotators via BIBD), 21.5 active hours total.

### Per-Dimension Agreement

| Metric | Category | Specificity | Both |
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | **0.801** | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |

**Category is reliable.** α = 0.801 exceeds the conventional 0.80 threshold, and the human majority matches the Stage 1 GenAI majority on category for 83.3% of paragraphs.

**Specificity is unreliable.** α = 0.546 is well below the 0.667 threshold. The gap is driven by two factors: one outlier annotator and a genuinely hard Spec 3↔4 boundary.
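
A minimal sketch of how these α values can be recomputed from the labeling export, assuming the open-source `krippendorff` package and a long-format `labels.csv` with `annotator`, `paragraph_id`, `category`, and `specificity` columns (the filename and column names are assumptions):

```python
# Sketch: recompute per-dimension Krippendorff's alpha from the export.
import krippendorff
import pandas as pd

labels = pd.read_csv("labels.csv")  # assumed long-format export

def alpha_for(dim: str, level: str) -> float:
    data = labels.copy()
    # The package expects numeric values, so encode labels as codes.
    data[dim] = data[dim].astype("category").cat.codes
    # Pivot to annotators x paragraphs; the BIBD assignment leaves NaN
    # where an annotator did not see a paragraph, which alpha tolerates.
    wide = data.pivot(index="annotator", columns="paragraph_id", values=dim)
    return krippendorff.alpha(reliability_data=wide.to_numpy(dtype=float),
                              level_of_measurement=level)

print("category   ", round(alpha_for("category", "nominal"), 3))
# Use "ordinal" instead if specificity is scored as an ordered scale.
print("specificity", round(alpha_for("specificity", "nominal"), 3))
```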

### The Aaryan Problem

One annotator (Aaryan) is a systematic outlier (see the sketch after this list):
- Labels 67% of paragraphs as Spec 4 (Quantified-Verifiable) — others: 8-23%, Stage 1: 9%
- Specificity bias: +1.28 levels vs Stage 1 (massive over-rater)
- Specificity κ: 0.03-0.25 (essentially chance)
- Category κ: 0.40-0.50 (below "moderate")
- Only 3 quiz attempts (lowest; others: 6-11)

Excluding his labels on the 600 paragraphs he annotated, both-unanimous agreement jumps from 5% to 50% (+45pp).
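
The outlier screen itself is mechanical. A sketch, reusing the `labels` frame from the α sketch above and assuming an added `stage1_specificity` column that carries the Stage 1 panel's majority specificity per paragraph:

```python
# Sketch: screen for systematic outlier annotators on specificity.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Per-annotator Spec 4 rate and bias vs the Stage 1 majority.
for name, grp in labels.groupby("annotator"):
    spec4 = (grp["specificity"] == 4).mean()
    bias = (grp["specificity"] - grp["stage1_specificity"]).mean()
    print(f"{name}: Spec 4 rate {spec4:.0%}, bias vs Stage 1 {bias:+.2f}")

# Pairwise Cohen's kappa on the paragraphs each pair shares under the BIBD.
for a, b in combinations(sorted(labels["annotator"].unique()), 2):
    shared = labels[labels["annotator"] == a].merge(
        labels[labels["annotator"] == b], on="paragraph_id",
        suffixes=("_a", "_b"))
    if len(shared):
        kappa = cohen_kappa_score(shared["specificity_a"],
                                  shared["specificity_b"])
        print(f"kappa({a}, {b}) = {kappa:.2f}")
```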

### Confusion Axes (Human vs GenAI — Same Order)

1. Management Role ↔ Risk Management Process (dominant)
2. Board Governance ↔ Management Role
3. None/Other ↔ Strategy Integration (materiality disclaimers)

The same axes appear in the same order for both humans and the GenAI panel: the codebook boundaries, not annotator or model limitations, drive the disagreement.
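
Ranking the axes is a simple tally over majority-vote disagreements. A sketch, assuming a per-paragraph frame `para` with `human_category` and `genai_category` columns holding each side's majority label (all names are assumptions):

```python
# Sketch: rank category confusion axes as unordered label pairs.
from collections import Counter

axes = Counter(
    frozenset((h, g))  # unordered, so MR->RMP and RMP->MR pool together
    for h, g in zip(para["human_category"], para["genai_category"])
    if h != g
)
for pair, count in axes.most_common(3):
    print(" ↔ ".join(sorted(pair)), count)
```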

---

## The Adverse Incentive Problem

The assignment requires F1 > 0.80 on the holdout to pass, but the holdout was deliberately stratified to over-sample hard decision boundaries (120 MR↔RMP, 80 N/O↔SI, 80 Spec [3,4] splits, etc.). A holdout that is harder than the corpus at large depresses measured F1, creating pressure to soften the evaluation rather than improve the model.

**Mitigation:** Report F1 on both the full 1,200-paragraph holdout AND the 720-paragraph proportional subsample. The delta quantifies performance degradation at the decision boundaries, and the stratified design directly serves the A-grade "error analysis" criterion.
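
A sketch of the dual report, assuming a holdout frame with `gold` and `pred` label columns plus a precomputed boolean `in_proportional_subsample` flag marking the 720-paragraph subset (all names are assumptions):

```python
# Sketch: macro-F1 on the full stratified holdout vs the proportional
# subsample; the delta quantifies degradation at decision boundaries.
from sklearn.metrics import f1_score

full_f1 = f1_score(holdout["gold"], holdout["pred"], average="macro")
prop = holdout[holdout["in_proportional_subsample"]]  # assumed flag
prop_f1 = f1_score(prop["gold"], prop["pred"], average="macro")
print(f"full 1,200: {full_f1:.3f}  proportional 720: {prop_f1:.3f}  "
      f"boundary delta: {prop_f1 - full_f1:+.3f}")
```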

---

## Gold Set Repair Strategy: 13 Signals Per Paragraph

### Annotation sources per paragraph

| Source | Count | Prompt | Notes |
|--------|-------|--------|-------|
| Human annotators | 3 | Codebook v3.0 | With notes, timing data |
| Stage 1 panel (gemini-flash-lite, mimo-flash, grok-fast) | 3 | v2.5 | Already on file |
| Opus 4.6 golden | 1 | v2.5 + full codebook | With reasoning traces |
| Benchmark models (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 | Running now |
| **Total** | **13** | | |

### Adjudication tiers

**Tier 1 — High confidence:** 10+/13 agree on both dimensions. Gold label, no intervention.

**Tier 2 — Clear majority with cross-validation:** Human majority (2/3) matches GenAI consensus (majority of the 10 GenAI labels). Take the consensus.

**Tier 3 — Human split, GenAI consensus:** Humans disagree but the GenAI labels converge. Expert adjudication informed by Opus reasoning traces; a human makes the final call.

**Tier 4 — Universal disagreement:** Everyone splits. Expert adjudication with documented reasoning, or flagged as inherently ambiguous for error analysis.

GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision — this avoids circularity.
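
A sketch of the tier routing, treating each of the 13 signals as a (category, specificity) tuple so agreement is checked jointly on both dimensions; the function name and signal structure are assumptions:

```python
# Sketch: route one paragraph's 13 signals to an adjudication tier.
# Each label is a (category, specificity) tuple, so vote counts are
# joint over both dimensions.
from collections import Counter

def adjudication_tier(human: list, genai: list) -> int:
    assert len(human) == 3 and len(genai) == 10
    if Counter(human + genai).most_common(1)[0][1] >= 10:
        return 1  # Tier 1: 10+/13 agree on both dimensions
    h_label, h_n = Counter(human).most_common(1)[0]
    g_label, g_n = Counter(genai).most_common(1)[0]
    genai_consensus = g_n > len(genai) // 2  # majority of 10 GenAI labels
    if h_n >= 2 and genai_consensus and h_label == g_label:
        return 2  # Tier 2: human majority cross-validated by GenAI
    if h_n < 2 and genai_consensus:
        return 3  # Tier 3: humans split, GenAI converges
    return 4      # Tier 4: unresolved; documented expert adjudication
```

Tier 4 here acts as a catch-all: any paragraph the first three rules don't resolve, including a human majority that contradicts the GenAI consensus, falls through to expert adjudication.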

---

## Task Sequence

### In progress

- [x] Human labeling — complete
- [x] Data export and IRR analysis — complete
- [x] Prompt v3.0 update with codebook rulings — complete
- [x] GenAI benchmark infrastructure — complete
- [ ] Opus golden re-run on correct holdout (running, ~1h with 20 workers)
- [ ] 6-model benchmark on holdout (running, high concurrency)

### After benchmark completes

- [ ] Cross-source analysis with all 13 signals (update `analyze-gold.py`)
- [ ] Gold set adjudication using tiered strategy
- [ ] Training data assembly (unanimous + calibrated majority + judge)

### After gold set is finalized

- [ ] Fine-tuning + ablations (7 experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} + best; enumerated in the sketch after this list)
- [ ] Final evaluation on holdout
- [ ] Writeup + IGNITE slides
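
The ablation grid in the first item above expands mechanically; a sketch of the enumeration (run names are assumptions):

```python
# Sketch: enumerate the 7 ablation runs, i.e. 3 pretraining regimes
# crossed with +/- supervised contrastive loss (SCL), plus a final
# "best" configuration chosen from the grid.
from itertools import product

regimes = ["base", "+DAPT", "+DAPT+TAPT"]
runs = [{"regime": r, "scl": s} for r, s in product(regimes, (False, True))]
runs.append({"regime": "best-of-grid", "scl": None})  # chosen after the grid
assert len(runs) == 7
for run in runs:
    print(run)
```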

---

## The Meta-Narrative

Trained student annotators reach α = 0.801 on category but only α = 0.546 on specificity, while calibrated LLM panels reach 70.8%+ both-unanimous agreement on an easier sample. This validates the synthetic-experts hypothesis for rule-heavy classification tasks: the human labels remain essential as a calibration anchor, but GenAI's advantage on multi-step reasoning tasks (like QV fact counting) is itself a key finding.

The low specificity agreement is not annotator incompetence — it's evidence that the specificity construct requires cognitive effort that humans don't consistently invest at the 15-second-per-paragraph pace the task demands. The GenAI panel, which processes every paragraph with the same systematic attention to the IS/NOT lists and counting rules, achieves more consistent results on this specific dimension.