Post-Labeling Plan — Gold Set Repair & Final Pipeline
Updated 2026-04-02 with actual human labeling results.
Human Labeling Results
Completed 2026-04-01. 3,600 labels (1,200 paragraphs × 3 annotators via BIBD), 21.5 active hours total.
Per-Dimension Agreement
| Metric | Category | Specificity | Both |
|---|---|---|---|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
Category is reliable. α = 0.801 exceeds the 0.80 conventional threshold. Human majority matches Stage 1 GenAI majority on 83.3% of paragraphs for category.
Specificity is unreliable. α = 0.546 is well below the 0.667 threshold. The gap is driven by two factors: one outlier annotator and a genuinely hard Spec 3↔4 boundary.
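A minimal sketch of how these agreement numbers can be reproduced from the exported labels, assuming a long-format CSV with paragraph_id, annotator, category, and specificity columns (the file and column names are illustrative, not the actual export schema). It uses the krippendorff package for α and scikit-learn for pairwise κ, treating specificity as ordinal.

```python
# Sketch: recompute consensus, Krippendorff's alpha, and mean pairwise Cohen's kappa
# from a long-format label export (schema assumed, see lead-in).
from itertools import combinations

import krippendorff  # pip install krippendorff
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

labels = pd.read_csv("human_labels.csv")

def reliability_matrix(df: pd.DataFrame, col: str, nominal: bool) -> np.ndarray:
    """Annotator x paragraph matrix, NaN where an annotator did not see a unit (BIBD)."""
    values = pd.Categorical(df[col]).codes.astype(float) if nominal else df[col].astype(float)
    wide = df.assign(_v=values).pivot(index="annotator", columns="paragraph_id", values="_v")
    return wide.to_numpy()

cat_alpha = krippendorff.alpha(reliability_data=reliability_matrix(labels, "category", True),
                               level_of_measurement="nominal")
spec_alpha = krippendorff.alpha(reliability_data=reliability_matrix(labels, "specificity", False),
                                level_of_measurement="ordinal")

def mean_pairwise_kappa(df: pd.DataFrame, col: str) -> float:
    """Average Cohen's kappa over annotator pairs, on the paragraphs each pair shares."""
    wide = df.pivot(index="paragraph_id", columns="annotator", values=col)
    scores = []
    for a, b in combinations(wide.columns, 2):
        shared = wide[[a, b]].dropna()  # paragraphs this pair both labeled
        if len(shared):
            scores.append(cohen_kappa_score(shared[a], shared[b]))
    return float(np.mean(scores))

consensus = (labels.groupby("paragraph_id")["category"].nunique() == 1).mean()
print(cat_alpha, spec_alpha, mean_pairwise_kappa(labels, "category"), consensus)
```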
The Aaryan Problem
One annotator (Aaryan) is a systematic outlier:
- Labels 67% of paragraphs as Spec 4 (Quantified-Verifiable) — others: 8-23%, Stage 1: 9%
- Specificity bias: +1.28 levels vs Stage 1 (massive over-rater)
- Specificity κ: 0.03-0.25 (essentially chance)
- Category κ: 0.40-0.50 (below "moderate")
- Only 3 quiz attempts (lowest; others: 6-11)
Excluding his labels on his 600 paragraphs, both-unanimous agreement jumps from 5% to 50% (+45pp).
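A hedged sketch of the two checks behind this diagnosis: per-annotator specificity bias against the Stage 1 majority, and the change in both-dimension unanimity when one annotator's labels are dropped. File names, column names, and the annotator identifier are illustrative.

```python
# Sketch: quantify one annotator's specificity bias vs Stage 1 and the effect of
# dropping their labels on both-dimension unanimity (schema assumed, see lead-in).
import pandas as pd

human = pd.read_csv("human_labels.csv")      # paragraph_id, annotator, category, specificity
stage1 = pd.read_csv("stage1_majority.csv")  # paragraph_id, category, specificity (panel majority)

merged = human.merge(stage1, on="paragraph_id", suffixes=("", "_s1"))
bias = (merged.specificity - merged.specificity_s1).groupby(merged.annotator).mean()
print(bias.sort_values(ascending=False))     # the outlier surfaces at roughly +1.3 levels

def both_unanimous(df: pd.DataFrame) -> float:
    """Share of paragraphs where all remaining annotators agree on both dimensions."""
    agg = df.groupby("paragraph_id").agg(cats=("category", "nunique"),
                                         specs=("specificity", "nunique"))
    return float(((agg.cats == 1) & (agg.specs == 1)).mean())

print(both_unanimous(human), both_unanimous(human[human.annotator != "aaryan"]))
```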
Confusion Axes (Human vs GenAI — Same Order)
- Management Role ↔ Risk Management Process (dominant)
- Board Governance ↔ Management Role
- None/Other ↔ Strategy Integration (materiality disclaimers)
The same axes appear, in the same order, for both humans and the GenAI panel: the codebook boundaries drive the disagreement, not annotator or model limitations.
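A small sketch of how these axes can be surfaced: cross-tabulate human majority categories against the Stage 1 panel majority and rank the off-diagonal cells. File and column names are illustrative.

```python
# Sketch: rank the dominant category confusion pairs between human majority labels
# and the Stage 1 panel majority (schema assumed, see lead-in).
import numpy as np
import pandas as pd

human = pd.read_csv("human_majority.csv")    # paragraph_id, category (human 2/3+ majority)
genai = pd.read_csv("stage1_majority.csv")   # paragraph_id, category (panel majority)

merged = human.merge(genai, on="paragraph_id", suffixes=("_human", "_genai"))
cats = sorted(set(merged.category_human) | set(merged.category_genai))
confusion = (pd.crosstab(merged.category_human, merged.category_genai)
               .reindex(index=cats, columns=cats, fill_value=0))

# Off-diagonal cells, largest first: expect MR<->RMP on top, then BG<->MR, then N/O<->SI.
offdiag = confusion.where(~np.eye(len(cats), dtype=bool)).stack().sort_values(ascending=False)
print(offdiag.head(6))
```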
The Adverse Incentive Problem
The assignment requires F1 > 0.80 on the holdout to pass. The holdout was deliberately stratified to over-sample hard decision boundaries (120 MR↔RMP, 80 N/O↔SI, 80 Spec [3,4] splits, etc.).
Mitigation: report F1 on both the full 1,200-paragraph holdout AND the 720-paragraph proportional subsample. The delta quantifies performance degradation at decision boundaries. The stratified design directly serves the A-grade "error analysis" criterion.
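A minimal sketch of that dual report, assuming a predictions file with paragraph_id, y_true, y_pred, and a flag marking membership in the proportional subsample (the schema is illustrative).

```python
# Sketch: macro-F1 on the full stratified holdout vs the proportional subsample,
# plus the delta that quantifies degradation at decision boundaries.
import pandas as pd
from sklearn.metrics import f1_score

preds = pd.read_csv("holdout_predictions.csv")

full_f1 = f1_score(preds.y_true, preds.y_pred, average="macro")
prop = preds[preds.in_proportional_subsample.astype(bool)]
prop_f1 = f1_score(prop.y_true, prop.y_pred, average="macro")

print(f"full holdout macro-F1:        {full_f1:.3f}")
print(f"proportional subsample F1:    {prop_f1:.3f}")
print(f"delta (proportional - full):  {prop_f1 - full_f1:+.3f}")
```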
Gold Set Repair Strategy: 13 Signals Per Paragraph
Annotation sources per paragraph
| Source | Count | Prompt | Notes |
|---|---|---|---|
| Human annotators | 3 | Codebook v3.0 | With notes, timing data |
| Stage 1 panel (gemini-flash-lite, mimo-flash, grok-fast) | 3 | v2.5 | Already on file |
| Opus 4.6 golden | 1 | v2.5 + full codebook | With reasoning traces |
| Benchmark models (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 | Running now |
| Total | 13 | | |
Adjudication tiers
Tier 1 — High confidence: 10+/13 agree on both dimensions. Gold label, no intervention.
Tier 2 — Clear majority with cross-validation: Human majority (2/3) matches GenAI consensus (majority of 10 GenAI labels). Take the consensus.
Tier 3 — Human split, GenAI consensus: Humans disagree but GenAI labels converge. Expert adjudication informed by Opus reasoning traces. Human makes the final call.
Tier 4 — Universal disagreement: Everyone splits. Expert adjudication with documented reasoning, or flagged as inherently ambiguous for error analysis.
GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision — this avoids circularity.
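A sketch of the tier assignment over the 13 signals per paragraph, using the thresholds stated above (10+/13, human 2/3 majority, GenAI majority of 10). A label is treated as a (category, specificity) pair; how the rare conflicting-majorities case is routed (to Tier 4 here) is an assumption, and Tiers 3 and 4 return no gold label because the final call stays with a human.

```python
# Sketch: assign each paragraph to an adjudication tier from its 13 signals.
# Field names and the conflicting-majorities routing are assumptions, see lead-in.
from collections import Counter

def assign_tier(human_labels: list, genai_labels: list) -> dict:
    """human_labels: 3 (category, specificity) tuples; genai_labels: 10 tuples."""
    top_label, top_count = Counter(human_labels + genai_labels).most_common(1)[0]

    # Tier 1: 10+ of 13 sources agree on both dimensions -> gold label, no intervention.
    if top_count >= 10:
        return {"tier": 1, "gold": top_label}

    h_label, h_count = Counter(human_labels).most_common(1)[0]
    g_label, g_count = Counter(genai_labels).most_common(1)[0]
    human_majority = h_label if h_count >= 2 else None
    genai_majority = g_label if g_count > len(genai_labels) / 2 else None

    # Tier 2: human 2/3 majority cross-validated by GenAI consensus -> take the consensus.
    if human_majority is not None and human_majority == genai_majority:
        return {"tier": 2, "gold": human_majority}

    # Tier 3: humans split but GenAI converges -> expert adjudication with Opus traces.
    if human_majority is None and genai_majority is not None:
        return {"tier": 3, "gold": None}

    # Tier 4: no cross-validated consensus (everyone splits, or majorities conflict)
    # -> expert adjudication with documented reasoning, or flag as inherently ambiguous.
    return {"tier": 4, "gold": None}
```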
Task Sequence
In progress
- Human labeling — complete
- Data export and IRR analysis — complete
- Prompt v3.0 update with codebook rulings — complete
- GenAI benchmark infrastructure — complete
- Opus golden re-run on correct holdout (running, ~1h with 20 workers)
- 6-model benchmark on holdout (running, high concurrency)
After benchmark completes
- Cross-source analysis with all 13 signals (update analyze-gold.py)
- Gold set adjudication using tiered strategy
- Training data assembly (unanimous + calibrated majority + judge)
After gold set is finalized
- Fine-tuning + ablations (7 experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} + best)
- Final evaluation on holdout
- Writeup + IGNITE slides
The Meta-Narrative
The finding that trained student annotators achieve α = 0.801 on category but only α = 0.546 on specificity, while calibrated LLM panels achieve 70.8%+ both-unanimous on an easier sample, validates the synthetic experts hypothesis for rule-heavy classification tasks. The human labels are essential as a calibration anchor, but GenAI's advantage on multi-step reasoning tasks (like QV fact counting) is itself a key finding.
The low specificity agreement is not annotator incompetence — it's evidence that the specificity construct requires cognitive effort that humans don't consistently invest at the 15-second-per-paragraph pace the task demands. The GenAI panel, which processes every paragraph with the same systematic attention to the IS/NOT lists and counting rules, achieves more consistent results on this specific dimension.