SEC-cyBERT/docs/POST-LABELING-PLAN.md

Post-Labeling Plan — Gold Set Repair & Final Pipeline

Written 2026-04-01 while waiting for the last human annotator to finish.


The Situation

Human labeling is nearly complete (1,200 paragraphs, 6 annotators, 3 per paragraph via a BIBD — balanced incomplete block design). Current inter-annotator agreement:

  • Cohen's Kappa (avg): 0.622
  • Krippendorff's alpha: 0.616

These numbers are at the floor of "substantial agreement" (Landis & Koch) but below the 0.667 threshold Krippendorff recommends for tentative conclusions. The holdout was deliberately stratified to over-sample hard cases (120 Management↔RMP splits, 80 None/Other↔Strategy splits, 80 Spec [3,4] splits, etc.), so raw consensus reflects sampling difficulty, not pure annotator quality.

The task is genuinely hard: 7 categories, 4 specificity levels, 5 decision rules, 3 codebook rulings, multi-step reasoning required (person-vs-function test, QV fact counting). The GenAI panel struggled with the same boundaries.
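To make the agreement numbers above concrete: nominal Krippendorff's alpha can be computed in a few lines via the coincidence-matrix form, and the same function can be reused per-dimension in the analysis below. This is a sketch — the unit lists here are toy placeholders, not real export data.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha via the coincidence-matrix form.

    units: one list of labels per paragraph; paragraphs with fewer
    than two ratings contribute nothing.
    """
    o = Counter()  # coincidence matrix o[(c, k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):
            o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_o = sum(v for (c, k), v in o.items() if c != k)       # observed disagreement
    d_e = sum(n_c[c] * n_c[k]                                # expected disagreement
              for c, k in permutations(n_c, 2)) / (n - 1)
    return 1.0 - d_o / d_e

# Toy units (the real ones come from the labelapp export, 3 labels each).
units = [
    ["Management", "Management", "RMP"],
    ["Strategy", "Strategy", "Strategy"],
    ["RMP", "RMP", "RMP"],
]
print(round(krippendorff_alpha_nominal(units), 3))
```

Perfect agreement yields 1.0 and systematic disagreement goes negative, so the 0.616 vs. 0.667 comparison reads directly off this scale.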


Immediate Analysis (once last annotator finishes)

  1. Export labels from labelapp (bun run la:export)
  2. Per-dimension alpha: Compute Krippendorff's alpha for category and specificity separately. Hypothesis: category alpha is significantly higher than specificity alpha (matching the GenAI pattern where Spec 4 was only 37.6% unanimous).
  3. Pairwise Kappa matrix: All 15 annotator pairs. Identify if one annotator is a systematic outlier or if disagreement is uniform.
  4. Stratum-level agreement: Break down consensus rates by sampling stratum (Management↔RMP, None/Other↔Strategy, Spec [3,4], proportional random, etc.). The hard strata should show lower agreement; the proportional random stratum should be higher.
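Steps 2–3 can be sketched as follows. The annotator names and labels below are hypothetical placeholders for the labelapp export; in the real run, kappa is computed only over the paragraphs each pair shares under the BIBD, separately for category and specificity.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length nominal label sequences."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n              # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[c] / n) * (cb[c] / n)                      # chance agreement
              for c in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Toy labels; the real pairs come from the BIBD overlap structure.
labels = {
    "ann1": ["Management", "Strategy", "RMP", "RMP"],
    "ann2": ["Management", "Strategy", "RMP", "Management"],
    "ann3": ["Management", "None/Other", "RMP", "Management"],
}

kappa_matrix = {
    (p, q): cohens_kappa(labels[p], labels[q])
    for p, q in combinations(sorted(labels), 2)
}
for pair, k in sorted(kappa_matrix.items()):
    print(pair, round(k, 3))
```

An outlier annotator shows up as one row of the matrix sitting well below the rest; uniform disagreement shows up as a flat matrix.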

The Adverse Incentive Problem

The assignment requires F1 > 0.80 on the holdout to pass. This creates a perverse incentive: pick easy, unambiguous paragraphs for the holdout → high human agreement, high GenAI scores, high fine-tuned model F1 → passing grade, meaningless evaluation.

We did the opposite: stratified to stress-test decision boundaries. This produces a harder holdout with lower headline numbers but an actually informative evaluation.

Mitigation: Report F1 on both the full 1,200-paragraph holdout AND the 720-paragraph "proportional stratified random" subsample separately. The proportional subsample approximates what a random holdout would look like. The delta between the two quantifies exactly how much performance degrades at decision boundaries. This isn't gaming — it's rigorous reporting.
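The dual report might look like the sketch below. The row schema (stratum/gold/pred fields) is an assumption about the evaluation output, macro F1 stands in for whichever F1 variant the assignment specifies, and the rows themselves are invented.

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1s = []
    for c in set(tp) | set(fp) | set(fn):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical evaluation rows; "stratum" marks the sampling stratum.
rows = [
    {"stratum": "proportional", "gold": "RMP", "pred": "RMP"},
    {"stratum": "proportional", "gold": "Strategy", "pred": "Strategy"},
    {"stratum": "mgmt-rmp", "gold": "Management", "pred": "RMP"},
    {"stratum": "mgmt-rmp", "gold": "RMP", "pred": "RMP"},
]
full = macro_f1([r["gold"] for r in rows], [r["pred"] for r in rows])
prop = macro_f1(
    [r["gold"] for r in rows if r["stratum"] == "proportional"],
    [r["pred"] for r in rows if r["stratum"] == "proportional"],
)
print(f"full F1={full:.3f}  proportional F1={prop:.3f}  delta={prop - full:+.3f}")
```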

The A-grade criteria ("error analysis," "comparison to amateur labels") are directly served by our approach. The low human agreement rate is a finding, not a failure.


Gold Set Repair Strategy: 13+ Signals Per Paragraph

Existing signals (7 per paragraph)

  • 3 human labels (from labelapp, with notes and timing)
  • 3 Stage 1 GenAI labels (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
  • 1 Opus golden label (with full reasoning trace)

New signals from GenAI benchmark (6+ additional)

The assignment requires benchmarking 6+ models from 3+ suppliers against the holdout. This serves triple duty:

  1. Assignment deliverable (GenAI benchmark table)
  2. Gold set repair evidence (6+ more annotation signals for adjudication)
  3. "GenAI vs amateur" comparison (A-grade criterion)

Candidate models (6+ from 3+ suppliers):

  • OpenAI: gpt-5.4-mini, gpt-5.4
  • Google: gemini-3-flash, gemini-3-pro (or similar)
  • Anthropic: claude-sonnet-4.6, claude-haiku-4.5
  • xAI: grok-4.20 (or similar)
  • Others as needed for count

After the benchmark, each paragraph has 13+ independent annotations. This is an absurdly rich signal for adjudication.

Adjudication tiers

Tier 1 — High confidence: 10+/13 annotators agree on both dimensions. Gold label, no intervention needed. Expected: ~500-600 paragraphs.

Tier 2 — Clear majority with cross-validation: Human majority exists (2/3) and matches GenAI consensus (majority of 10 GenAI labels). Strong signal — take the consensus. Expected: ~300-400 paragraphs.

Tier 3 — Human split, GenAI consensus: Humans disagree but GenAI labels converge. Use Opus reasoning trace + GenAI consensus to inform expert adjudication. Human (Joey) makes the final call. Expected: ~100-200 paragraphs.

Tier 4 — Universal disagreement: Humans and GenAI both split. Genuinely ambiguous. Expert adjudication with documented reasoning, or flag as inherently ambiguous and report in error analysis. Expected: ~50-100 paragraphs.
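The four tiers can be sketched as a routing function over one label dimension. The thresholds mirror the tier definitions above; exact ties in `most_common` fall to insertion order, so genuinely split cases should be double-checked by the adjudicator rather than trusted to this function.

```python
from collections import Counter

def adjudication_tier(human, genai):
    """Route one paragraph/dimension to a tier.

    human: 3 human labels; genai: 10 GenAI labels.
    Returns (tier, suggested_label); None means expert adjudication.
    """
    combined = Counter(human) + Counter(genai)
    top, top_n = combined.most_common(1)[0]
    if top_n >= 10:                          # Tier 1: 10+/13 agree
        return 1, top
    h_top, h_n = Counter(human).most_common(1)[0]
    g_top, g_n = Counter(genai).most_common(1)[0]
    human_majority = h_n >= 2                # at least 2/3 humans
    genai_consensus = g_n > len(genai) // 2  # strict GenAI majority
    if human_majority and genai_consensus and h_top == g_top:
        return 2, h_top                      # Tier 2: cross-validated majority
    if not human_majority and genai_consensus:
        return 3, None                       # Tier 3: GenAI-informed expert call
    return 4, None                           # Tier 4: genuinely ambiguous

print(adjudication_tier(["Management", "Management", "RMP"],
                        ["Management"] * 7 + ["RMP"] * 3))
```

Note one case the tiers above leave implicit: a human majority that contradicts a GenAI consensus falls through to Tier 4 here, which seems like the conservative choice.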

The GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision. This avoids circularity — we're not evaluating GenAI against GenAI-derived labels. We're using GenAI agreement patterns to identify which human label is most likely correct in cases of human disagreement.

If we can't produce reliable gold labels from 13+ signals per paragraph, the construct itself is ill-defined. That would be an important finding too — but given that the GenAI panel achieved 70.8% both-unanimous on 50K paragraphs (unstratified), and the hardest axes have clear codebook resolutions, the construct should hold.


The Meta-Narrative

The finding that trained student annotators achieve α = 0.616 while calibrated LLM panels achieve 70.8%+ unanimity on the same task validates the synthetic experts hypothesis. For complex, rule-heavy classification tasks requiring multi-step reasoning, LLMs with reasoning tokens can match or exceed human annotation quality.

This isn't a failure of the humans — it's the whole point of the project. The Ringel pipeline exists because these tasks are too cognitively demanding for consistent human annotation at scale. The human labels are essential as a calibration anchor, but GenAI's advantage on rule-application tasks is a key finding.


Task Sequence (dependency order)

Can start now (no blockers)

  • Judge prompt v3.0 update (codebook rulings → buildJudgePrompt())
  • Fine-tuning pipeline code (dual-head classifier, sample weighting, train/val/test split)
  • GenAI benchmark infrastructure (scripts to run 6+ models on holdout)

After last annotator finishes

  • Export + per-dimension alpha + pairwise Kappa matrix + stratum breakdown
  • Run GenAI benchmark on the 1,200-paragraph holdout (6+ models, 3+ suppliers)
  • Gold set adjudication using 13+ signals per paragraph
  • Judge v3.0 validation against adjudicated gold set

After gold set is finalized

  • Training data assembly (unanimous + calibrated majority + judge)
  • Fine-tuning + ablations (7 experiments)
  • Final evaluation on holdout
  • Writeup + IGNITE slides

Open Questions

  1. F1 threshold per-dimension? Worth asking Ringel if the 0.80 F1 requirement applies to the joint 28-class label or can be reported per-dimension (category + specificity separately).
  2. Soft labels for ambiguous cases? For Tier 4 paragraphs, could use label distributions as soft targets during training instead of forcing a hard label. More sophisticated but harder to evaluate.
  3. One bad annotator vs. uniform disagreement? The pairwise Kappa matrix will answer this. If one annotator is systematically off, their labels could be downweighted during adjudication.
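For open question 2, the soft-label idea might look like this in pure Python: the 13 annotations on a Tier 4 paragraph become an empirical label distribution, and the model's predicted probabilities are scored against it with cross-entropy instead of a hard 0/1 target. In the real pipeline this would be a soft-target loss term in the fine-tuning code; the 7/4/2 split below is invented.

```python
import math
from collections import Counter

def soft_target(annotations):
    """Empirical label distribution over the annotation panel."""
    counts = Counter(annotations)
    n = len(annotations)
    return {label: c / n for label, c in counts.items()}

def cross_entropy(target, predicted, eps=1e-9):
    """H(target, predicted); eps guards against log(0) on missing labels."""
    return -sum(p * math.log(predicted.get(label, eps))
                for label, p in target.items())

# Hypothetical Tier 4 paragraph: 13 signals split 7/4/2.
anns = ["Management"] * 7 + ["RMP"] * 4 + ["Strategy"] * 2
target = soft_target(anns)
pred = {"Management": 0.6, "RMP": 0.3, "Strategy": 0.1}
print(round(cross_entropy(target, pred), 4))
```

Evaluation is the open issue: against soft targets the natural metrics are cross-entropy or KL divergence rather than F1, which complicates the headline number.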