From 26367a8e8685a1bdd810708c4316ddf7f58dab92 Mon Sep 17 00:00:00 2001 From: Joey Eamigh <55670930+JoeyEamigh@users.noreply.github.com> Date: Thu, 2 Apr 2026 09:28:44 -0400 Subject: [PATCH] analyze gold --- .dvc-store.dvc | 6 +- docs/F1-STRATEGY.md | 290 +++++ docs/NARRATIVE.md | 50 + docs/POST-LABELING-PLAN.md | 250 +++- docs/STATUS.md | 130 +- scripts/analyze-gold.py | 2450 +++++++++++++++++++++++++----------- 6 files changed, 2304 insertions(+), 872 deletions(-) create mode 100644 docs/F1-STRATEGY.md diff --git a/.dvc-store.dvc b/.dvc-store.dvc index 44b1443..106e495 100644 --- a/.dvc-store.dvc +++ b/.dvc-store.dvc @@ -1,6 +1,6 @@ outs: -- md5: 6147599f136e4781a2de20adcb2aba1f.dir - size: 737313104 - nfiles: 135 +- md5: d64ad0c8040d75230a3013c4751910eb.dir + size: 740635168 + nfiles: 174 hash: md5 path: .dvc-store diff --git a/docs/F1-STRATEGY.md b/docs/F1-STRATEGY.md new file mode 100644 index 0000000..db96b89 --- /dev/null +++ b/docs/F1-STRATEGY.md @@ -0,0 +1,290 @@ +# F1 Strategy — Passing the Class + +The assignment requires **macro F1 > 0.80** on category, measured against the human-labeled 1,200-paragraph holdout. This document lays out the concrete plan for getting there. + +--- + +## The Situation + +### What we have + +- **Training data:** 150,009 Stage 1 annotations across 50,003 paragraphs (3 models each). ~35K paragraphs with unanimous category agreement (all 3 models agree). +- **Pre-trained backbone:** ModernBERT-large with DAPT (1B tokens of SEC filings) + TAPT (labeled paragraphs). Domain-adapted and task-adapted. +- **Gold holdout:** 1,200 paragraphs with 13 independent annotations each (3 human + 10 GenAI). Adjudication tiers computed: 81% auto-resolvable. +- **Complete benchmark:** 10 GenAI models from 8 suppliers, all on holdout. + +### The ceiling + +The best individual GenAI models agree with human majority on ~83-87% of category labels. 
Our fine-tuned model is trained on GenAI labels, so its accuracy is largely bounded by how well GenAI labels match human labels. With DAPT+TAPT, the model should approach, and may slightly exceed, this single-model ceiling because:

1. It learns decision boundaries directly from representations, not through generative reasoning
2. It's specialized on the exact domain (SEC filings) and task distribution
3. The training data (35K+ unanimous labels) is consensus-filtered and therefore cleaner than any single model's output

### The threat

The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP splits, N/O<->SI splits, Spec 3/4 splits). This means raw F1 on this holdout is **lower** than on a random sample. Macro F1 also weights all 7 classes equally — rare categories (TPR at ~5%, ID at ~8%) get the same influence as RMP at ~35%.

**Estimated range: category macro F1 of 0.78-0.85.** The plan below is designed to push toward the top of that range.

---

## Action 1: Clean the Gold Labels

**Priority: highest. Gold label noise caps achievable F1 from above.**

If the gold label is wrong, even a perfect model gets penalized. Gold label quality depends on how we adjudicate the 1,200 holdout paragraphs.

### Aaryan correction

Aaryan has a 40.9% odd-one-out rate (vs 8.7-27.3% for the other five annotators), specificity kappa of 0.03-0.25, and +1.30 specificity bias vs Opus. On his 600 paragraphs, when the other 2 annotators agree and he disagrees, the other-2 majority should be the human signal. This is not "throwing out" his data — it's using the objective reliability metrics to weight it appropriately.

Excluding his label on his paragraphs pushes both-unanimous from 5% to 50% (+45pp). This single correction likely improves effective gold label quality by 5-10% on the paragraphs he touched.
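Given the reliability metrics, the correction is mechanical. A minimal sketch (the annotator names, the dict format, and the outlier flag are illustrative, not the project's actual export schema):

```python
from collections import Counter

# Sketch of the reliability correction: when the two non-outlier annotators
# agree, their shared label is the human signal, regardless of the outlier.
def corrected_human_signal(votes, outlier="aaryan"):
    """votes: {annotator: label} for one paragraph's (up to 3) human labels."""
    others = [label for name, label in votes.items() if name != outlier]
    if outlier in votes and len(others) == 2 and others[0] == others[1]:
        # Two reliable annotators agree and the outlier dissents.
        return others[0]
    # Otherwise fall back to a plain majority (None when there is none).
    label, count = Counter(votes.values()).most_common(1)[0]
    return label if count > len(votes) / 2 else None
```

Note that when the outlier is absent, or when the other two annotators themselves disagree, his vote still counts toward the fallback majority, which is the "weight it appropriately" behavior described above.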
+ +### Tiered adjudication + +| Tier | Count | % | Gold label source | +|------|-------|---|-------------------| +| 1 | 756 | 63% | 13-signal consensus (10+/13 agree on both dimensions) | +| 2 | 216 | 18% | Human majority + GenAI majority agree — take consensus | +| 3 | 26 | 2% | Expert review with Opus reasoning traces | +| 4 | 202 | 17% | Expert review, documented reasoning | + +For Tier 1+2 (972 paragraphs, 81%), the gold label is objectively strong — at least 10 of 13 annotators agree, or both human and GenAI majorities independently converge. These labels are essentially guaranteed correct. + +For Tier 3+4 (228 paragraphs), expert adjudication using: +1. Opus reasoning trace (why did the best model choose this category?) +2. GenAI consensus direction (what do 7+/10 models say?) +3. The paragraph text itself +4. Codebook boundary rules (MR vs RMP person-vs-function test, materiality disclaimers -> SI, etc.) + +Document reasoning for every Tier 4 decision. These 202 paragraphs become the error analysis corpus. + +--- + +## Action 2: Handle Class Imbalance + +**Priority: critical. This is the difference between 0.76 and 0.83 on macro F1.** + +### The problem + +The training data class distribution is heavily skewed: + +| Category | Est. % of training | Macro F1 weight | +|----------|-------------------|-----------------| +| RMP | ~35% | 14.3% (1/7) | +| BG | ~15% | 14.3% | +| MR | ~14% | 14.3% | +| SI | ~13% | 14.3% | +| N/O | ~10% | 14.3% | +| ID | ~8% | 14.3% | +| TPR | ~5% | 14.3% | + +Without correction, the model will over-predict RMP (the majority class) and under-predict TPR/ID. Since macro F1 weights all 7 equally, poor performance on rare classes tanks the overall score. + +### Solutions (use in combination) + +**Focal loss (gamma=2).** Down-weights easy/confident examples, up-weights hard/uncertain ones. The model spends more gradient on the examples it's getting wrong — which are disproportionately from rare classes and boundary cases. 
Better than static class weights because it adapts as training progresses. + +**Class-weighted sampling.** Over-sample rare categories during training so the model sees roughly equal numbers of each class per epoch. Alternatively, use class-weighted cross-entropy with weights inversely proportional to frequency. + +**Stratified validation split.** Ensure the validation set used for early stopping and threshold optimization has proportional representation of all classes. Don't let the model optimize for RMP accuracy at the expense of TPR. + +--- + +## Action 3: Supervised Contrastive Learning (SCL) + +**Priority: high. Directly attacks the #1 confusion axis.** + +### The problem + +MR<->RMP is the dominant confusion axis for humans, Stage 1, all GenAI models, and will be the dominant confusion axis for our fine-tuned model. These two categories share vocabulary (both discuss "cybersecurity" in a management/process context) and differ primarily in whether the paragraph describes a **person's role** (MR) or a **process/procedure** (RMP). + +BG<->MR is the #2 axis — both involve governance/management but differ in whether it's board-level or management-level. + +### How SCL helps + +SCL adds a contrastive loss that pulls representations of same-class paragraphs together and pushes different-class paragraphs apart in the embedding space. This is especially valuable when: +- Two classes share surface-level vocabulary (MR/RMP, BG/MR) +- The distinguishing features are subtle (person vs function, board vs management) +- The model needs to learn discriminative features, not just predictive ones + +### Implementation + +Dual loss: L = L_classification + lambda * L_contrastive + +The contrastive loss uses the [CLS] representation from the shared backbone. Lambda should be tuned (start with 0.1-0.5) on the validation set. + +--- + +## Action 4: Ordinal Specificity + +**Priority: medium. 
Matters for specificity F1, not directly for category F1 (which is the pass/fail metric).** + +### The problem + +Specificity is a 4-point ordinal scale (1=Generic Boilerplate, 2=Sector-Standard, 3=Firm-Specific, 4=Quantified-Verifiable). Treating it as flat classification ignores the ordering — a Spec 1->4 error is worse than a Spec 2->3 error. + +Human alpha on specificity is only 0.546 (unreliable). The Spec 3<->4 boundary is genuinely ambiguous. Even frontier models only agree 75-91% on specificity. + +### Solution + +Use CORAL (Consistent Rank Logits) ordinal regression for the specificity head. CORAL converts a K-class ordinal problem into K-1 binary problems (is this >= 2? is this >= 3? is this >= 4?) and trains shared representations across all thresholds. This: +- Respects the ordinal structure +- Eliminates impossible predictions (e.g., predicting "yes >= 4" but "no >= 3") +- Handles the noisy Spec 3<->4 boundary gracefully + +### Managed expectations + +Specificity macro F1 will be 0.65-0.75 regardless of what we do. This is not a model failure — it's a gold label quality issue (alpha=0.546). Report specificity F1 separately and frame it as a finding about construct reliability. + +--- + +## Action 5: Training Data Curation + +**Priority: high. 
Garbage in, garbage out.** + +### Confidence-stratified assembly + +| Source | Count | Sample Weight | Rationale | +|--------|-------|--------------|-----------| +| Unanimous Stage 1 (3/3 agree on both) | ~35K | 1.0 | Highest confidence | +| Majority Stage 1 (2/3 agree on cat) | ~9-12K | 0.8 | Good but not certain | +| Judge labels (high confidence) | ~2-3K | 0.7 | Disputed, resolved by stronger model | +| All-disagree | ~2-3K | 0.0 (exclude) | Too noisy | + +### Quality tier weighting + +| Paragraph quality | Weight | +|-------------------|--------| +| Clean | 1.0 | +| Headed | 1.0 | +| Minor issues | 1.0 | +| Degraded (embedded bullets, orphan words) | 0.5 | + +### What NOT to include + +- Paragraphs where all 3 Stage 1 models disagree on category (pure noise) +- Paragraphs from truncated filings (72 identified and removed pre-DAPT) +- Paragraphs shorter than 10 words (tend to be parsing artifacts) + +--- + +## Action 6: Ablation Design + +**The assignment requires at least 4 configurations. We'll run 6-8 to isolate each contribution.** + +| # | Backbone | Focal Loss | SCL | Notes | +|---|----------|-----------|-----|-------| +| 1 | ModernBERT-large (base) | No | No | Baseline — no domain adaptation | +| 2 | +DAPT | No | No | Isolate domain pre-training effect | +| 3 | +DAPT+TAPT | No | No | Isolate task-adaptive pre-training effect | +| 4 | +DAPT+TAPT | Yes | No | Isolate class imbalance handling | +| 5 | +DAPT+TAPT | Yes | Yes | Full pipeline | +| 6 | +DAPT+TAPT | Yes | Yes | Ensemble (3 seeds, majority vote) | + +**Expected pattern:** 1 < 2 < 3 (pre-training helps), 3 < 4 (focal loss helps rare classes), 4 < 5 (SCL helps confusion boundaries), 5 < 6 (ensemble smooths variance). + +Each experiment trains for ~30-60 min on the RTX 3090. Total ablation time: ~4-8 hours. 
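The focal loss that configurations 4-6 rely on is small enough to sketch inline. A minimal NumPy version (the optional `alpha` per-class weights are an illustrative extra, not something the table above commits to):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=None):
    """probs: (N, C) softmax outputs; targets: (N,) integer class ids."""
    p_t = probs[np.arange(len(targets)), targets]   # probability of the true class
    w = (1.0 - p_t) ** gamma                        # down-weight easy examples
    if alpha is not None:                           # optional per-class weighting
        w = w * np.asarray(alpha)[targets]
    return float(np.mean(-w * np.log(p_t + 1e-12)))
```

With gamma=2, an example the model already gets right at p=0.9 contributes a modulating factor of 0.01 versus 0.25 at p=0.5, so confident examples are suppressed roughly 25x. Setting gamma=0 recovers plain cross-entropy, which is what configurations 1-3 effectively use.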
+ +### Hyperparameters (starting points, tune on validation) + +- Learning rate: 2e-5 (standard for BERT fine-tuning) +- Batch size: 16-32 (depending on VRAM with dual heads) +- Max sequence length: 512 (most paragraphs are <200 tokens; 8192 is unnecessary for classification) +- Epochs: 5-10 with early stopping (patience=3) +- Warmup: 10% of steps +- Weight decay: 1e-5 (matching ModernBERT pre-training config) +- Focal loss gamma: 2.0 +- SCL lambda: 0.1-0.5 (tune) +- Label smoothing: 0.05 + +--- + +## Action 7: Inference-Time Techniques + +### Ensemble (3 seeds) + +Train 3 instances of the best configuration (experiment 5) with different random seeds. At inference, average the softmax probabilities and take argmax. Typically adds 1-3pp macro F1 over a single model. The variance across seeds also gives confidence intervals for reported metrics. + +### Per-class threshold optimization + +After training, don't use argmax. Instead, optimize per-class classification thresholds on the validation set to maximize macro F1 directly. The optimal threshold for RMP (high prevalence, high precision needed) is different from TPR (low prevalence, high recall needed). Use a grid search or Bayesian optimization over the 7 thresholds. + +### Post-hoc calibration + +Apply temperature scaling on the validation set. 
This doesn't change predictions (and therefore doesn't change F1), but it makes the model's confidence scores meaningful for: +- Calibration plots (recommended evaluation metric) +- AUC computation +- The error analysis narrative + +--- + +## Action 8: Evaluation & Reporting + +### Primary metrics (what determines the grade) + +- **Category macro F1** on full 1,200 holdout — must exceed 0.80 +- **Per-class F1** — breakdown showing which categories are strong/weak +- **Krippendorff's alpha** — model vs human labels (should approach GenAI panel's alpha) +- **MCC** — robust to class imbalance +- **AUC** — from calibrated probabilities + +### Dual F1 reporting (adverse incentive mitigation) + +Report F1 on both: +1. **Full 1,200 holdout** (stratified, over-samples hard cases) +2. **~720-paragraph proportional subsample** (random draw matching corpus class proportions) + +The delta between these two numbers quantifies how much the model degrades at decision boundaries. This directly serves the A-grade "error analysis" criterion and is methodologically honest about the stratified design. + +### Error analysis corpus + +The 202 Tier 4 paragraphs (universal disagreement) are the natural error analysis set. For each: +- What did the model predict? +- What is the gold label? +- What do the 13 signals show? +- What is the confusion axis? +- Is the gold label itself debatable? + +This analysis will show that most "errors" fall on the MR<->RMP, BG<->MR, and N/O<->SI axes — the same axes where humans disagree. The model is not failing randomly; it's failing where the construct itself is ambiguous. 
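The per-class threshold search described under Action 7 can be sketched as coordinate ascent over a small grid on the validation set. A minimal NumPy sketch (the grid, sweep count, and the divide-by-threshold decision rule are illustrative choices, not fixed decisions):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

def predict_with_thresholds(probs, thresholds):
    # Scale each class's probability by its threshold; the class that clears
    # its threshold by the largest margin wins (argmax when all are equal).
    return np.argmax(probs / thresholds, axis=1)

def tune_thresholds(probs, y_true, grid=np.linspace(0.05, 0.95, 19), sweeps=3):
    t = np.full(probs.shape[1], 0.5)        # start at argmax-equivalent thresholds
    for _ in range(sweeps):                  # coordinate ascent, one class at a time
        for c in range(probs.shape[1]):
            scores = []
            for v in grid:
                cand = t.copy()
                cand[c] = v
                scores.append(macro_f1(y_true, predict_with_thresholds(probs, cand),
                                       probs.shape[1]))
            t[c] = grid[int(np.argmax(scores))]
    return t
```

Because the search starts at the argmax-equivalent point and each coordinate step only ever keeps or improves the validation score, tuned thresholds can't do worse than argmax on the set they were tuned on; the holdout is only touched once, with thresholds frozen.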
+ +### GenAI vs specialist comparison (assignment Step 10) + +| Dimension | GenAI Panel (10 models) | Fine-tuned Specialist | +|-----------|------------------------|----------------------| +| Category macro F1 | ~0.82-0.87 (per model) | Target: 0.80-0.85 | +| Cost per 1M texts | ~$5,000-13,000 | ~$5 (GPU inference) | +| Latency per text | 3-76 seconds | ~5ms | +| Reproducibility | Varies (temperature, routing) | Deterministic | +| Setup cost | $165 (one-time labeling) | + ~8h GPU training | + +The specialist wins on cost (1000x cheaper), speed (1000x faster), and reproducibility (deterministic). The GenAI panel wins on raw accuracy by a few points. This is the core Ringel (2023) thesis: the specialist approximates the GenAI labeler at near-zero marginal cost. + +--- + +## Risk Assessment + +| Risk | Likelihood | Impact | Mitigation | +|------|-----------|--------|------------| +| Macro F1 lands at 0.78-0.80 (just below threshold) | Medium | High | Ensemble + threshold optimization should add 2-3pp | +| TPR per-class F1 tanks macro average | Medium | Medium | Focal loss + over-sampling TPR in training | +| Gold label noise on Tier 4 paragraphs | Low | Medium | Conservative adjudication + dual F1 reporting | +| MR<->RMP confusion not resolved by SCL | Low | Medium | Person-vs-function test baked into training data via v3.0 codebook | +| DAPT+TAPT doesn't help (base model is already good enough) | Low | Low | Still meets 0.80 threshold; the ablation result itself is publishable | + +--- + +## Timeline + +| Task | Duration | Target Date | +|------|----------|-------------| +| Gold set adjudication (Tier 3+4 expert review) | 2-3h | Apr 3-4 | +| Training data assembly | 1-2h | Apr 4 | +| Fine-tuning ablations (6 configs) | 4-8h GPU | Apr 5-8 | +| Final evaluation on holdout | 1h | Apr 9 | +| Error analysis writeup | 2h | Apr 10 | +| Executive memo draft | 3h | Apr 11-12 | +| IGNITE slides (20 slides) | 2h | Apr 13-14 | +| Final review + submission | 2h | Apr 22 | +| 
**Due date** | | **Apr 23 12pm** | diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md index 0b42800..8ebd15f 100644 --- a/docs/NARRATIVE.md +++ b/docs/NARRATIVE.md @@ -1056,6 +1056,56 @@ The Opus golden labeling was re-run on the correct 1,200 holdout paragraphs. A p --- +## Phase 14: 13-Signal Analysis & F1 Strategy + +### Benchmark Complete + +All 6 benchmark models + Opus completed 1,200 annotations each. Total benchmark cost: $45.63. Every paragraph in the holdout now has exactly 13 independent annotations: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark. + +Model performance sorted by leave-one-out "both" accuracy (each source vs majority of other 12): Opus 4.6 (84.0%), Kimi K2.5 (83.3%), Gemini Pro (82.3%), GPT-5.4 (82.1%), GLM-5 (81.4%), MIMO Pro (81.4%), Grok Fast (80.0%). Best human: Xander at 76.9%. Worst: Aaryan at 15.8%. + +### The "Is Opus Special?" Question + +We tested whether Opus's apparent dominance was an artifact of using it as the reference. Answer: no. In leave-one-out analysis, Opus has the lowest "odd one out" rate at 7.4% — it disagrees with the remaining 12 sources less than any other source. But the top 6 GenAI models are within 3pp of each other — any could serve as reference with similar results. The 13-signal majority is 99.5% identical to the 10-GenAI majority; adding 3 human votes barely shifts consensus because 10 outvotes 3. + +### Adjudication Tiers + +The 13-signal consensus enables tiered adjudication: +- **Tier 1 (63.0%):** 756 paragraphs where 10+/13 agree on both dimensions. Auto-gold, zero human work. +- **Tier 2 (18.0%):** 216 paragraphs where human majority and GenAI majority agree. Cross-validated. +- **Tier 3 (2.2%):** 26 paragraphs where humans split but GenAI converges. +- **Tier 4 (16.8%):** 202 paragraphs with universal disagreement. Expert adjudication needed. + +81% of the holdout can be adjudicated automatically. 
The 202 Tier 4 paragraphs are dominated by MR↔RMP confusion (the #1 axis everywhere) and are the natural error analysis corpus. + +### Specificity: GenAI Is More Consistent Than Humans + +GenAI spec unanimity is 60.1% vs human spec unanimity of 42.2% (+18pp). Specificity calibration plots show that GPT-5.4, Gemini Pro, and Kimi K2.5 closely track Opus across all 4 specificity levels. MiniMax M2.7 is the only model with systematic specificity bias (−0.26 vs Opus). Among humans, Aaryan's +1.30 bias dwarfs all other sources. + +### F1 Strategy + +The assignment requires macro F1 > 0.80 on category. Based on the data: +- The best GenAI models agree with human majority ~83-87% on category +- Training on 35K+ unanimous Stage 1 labels with DAPT+TAPT should approach this ceiling +- The swing categories for macro F1 are MR (~65-80%), TPR (~70-90%), N/O (~60-85%) +- Focal loss for class imbalance + SCL for boundary separation + ensemble for robustness + +Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a random sample. Mitigation: report F1 on both the full holdout and a proportional subsample. The delta quantifies model degradation at decision boundaries. 
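The proportional-subsample mitigation is mechanical once corpus class proportions are fixed. A minimal sketch (the proportions, seed, and rounding behavior are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the reported subsample is reproducible

def proportional_subsample(labels, corpus_proportions, n_total=720):
    """Draw holdout indices matching corpus class proportions.
    Rounding means the draw is only approximately n_total in general."""
    labels = np.asarray(labels)
    idx = []
    for cls, frac in corpus_proportions.items():
        pool = np.flatnonzero(labels == cls)            # holdout items of this class
        k = min(len(pool), int(round(frac * n_total)))  # proportional quota
        idx.extend(rng.choice(pool, size=k, replace=False))
    return np.array(sorted(idx))
```

Macro F1 is then reported twice, once on all 1,200 paragraphs and once on `labels[idx]`, and the delta is the boundary-degradation number.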
### Cost Ledger Update

| Phase | Cost | Time |
|-------|------|------|
| Stage 1 (150K annotations) | $115.88 | ~30 min |
| Orphan re-annotation | $3.30 | ~9 min |
| Benchmark (6 models × 1,200) | $45.63 | ~1h |
| Opus golden (1,200) | $0 (subscription) | ~30 min |
| Human labeling | $0 (class assignment) | 21.5h active |
| Post-labeling analysis | $0 | ~3h |
| **Total API** | **$164.81** | |

---

## Lessons Learned

### On Prompt Engineering diff --git a/docs/POST-LABELING-PLAN.md b/docs/POST-LABELING-PLAN.md index 5ae7ba0..0a71aeb 100644 --- a/docs/POST-LABELING-PLAN.md +++ b/docs/POST-LABELING-PLAN.md @@ -1,104 +1,222 @@ -# Post-Labeling Plan — Gold Set Repair & Final Pipeline +# Post-Labeling Plan — Gold Set, Fine-Tuning & F1 Strategy

-Updated 2026-04-02 with actual human labeling results. +Updated 2026-04-02 with actual benchmark results and 13-signal analysis.

---

-## Human Labeling Results +## Human Labeling Results (Complete)

-Completed 2026-04-01. 3,600 labels (1,200 paragraphs × 3 annotators via BIBD), 21.5 active hours total. - -### Per-Dimension Agreement +3,600 labels (1,200 paragraphs x 3 annotators via BIBD), 21.5 active hours total.

| Metric | Category | Specificity | Both | |--------|----------|-------------|------| | Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% | -| Krippendorff's α | **0.801** | 0.546 | — | -| Avg Cohen's κ | 0.612 | 0.440 | — | +| Krippendorff's alpha | **0.801** | 0.546 | -- | +| Avg Cohen's kappa | 0.612 | 0.440 | -- |

-**Category is reliable.** α = 0.801 exceeds the 0.80 conventional threshold. Human majority matches Stage 1 GenAI majority on 83.3% of paragraphs for category. - -**Specificity is unreliable.** α = 0.546 is well below the 0.667 threshold. Driven by two factors: one outlier annotator and a genuinely hard Spec 3↔4 boundary.
- -### The Aaryan Problem - -One annotator (Aaryan) is a systematic outlier: -- Labels 67% of paragraphs as Spec 4 (Quantified-Verifiable) — others: 8-23%, Stage 1: 9% -- Specificity bias: +1.28 levels vs Stage 1 (massive over-rater) -- Specificity κ: 0.03-0.25 (essentially chance) -- Category κ: 0.40-0.50 (below "moderate") -- Only 3 quiz attempts (lowest; others: 6-11) - -Excluding his label on his 600 paragraphs: both-unanimous jumps from 5% → 50% (+45pp). - -### Confusion Axes (Human vs GenAI — Same Order) - -1. Management Role ↔ Risk Management Process (dominant) -2. Board Governance ↔ Management Role -3. None/Other ↔ Strategy Integration (materiality disclaimers) - -The same axes, in the same order, for both humans and the GenAI panel. The codebook boundaries drive disagreement, not annotator or model limitations. +**Category is reliable.** Alpha = 0.801 exceeds the conventional 0.80 threshold. **Specificity is unreliable.** Alpha = 0.546, driven by one outlier annotator (+1.28 specificity bias) and a genuinely hard Spec 3-4 boundary. --- -## The Adverse Incentive Problem +## GenAI Benchmark Results (Complete) -The assignment requires F1 > 0.80 on the holdout to pass. The holdout was deliberately stratified to over-sample hard decision boundaries (120 MR↔RMP, 80 N/O↔SI, 80 Spec [3,4] splits, etc.). +10 models from 8 suppliers on 1,200 holdout paragraphs. $45.63 total benchmark cost. -**Mitigation:** Report F1 on both the full 1,200 holdout AND the 720-paragraph proportional subsample. The delta quantifies performance degradation at decision boundaries. The stratified design directly serves the A-grade "error analysis" criterion. 
+### Per-Model Accuracy (Leave-One-Out: each source vs majority of other 12) + +| Rank | Source | Cat % | Spec % | Both % | Odd-One-Out % | +|------|--------|-------|--------|--------|---------------| +| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | 7.4% | +| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | 8.4% | +| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | 8.9% | +| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | 8.6% | +| 5 | GLM-5 | 91.9 | 88.4 | 81.4 | 8.1% | +| 6 | MIMO Pro | 91.1 | 89.4 | 81.4 | 8.9% | +| 7 | Grok Fast | 88.9 | 89.6 | 80.0 | 11.1% | +| 8 | Xander (best human) | 91.3 | 83.9 | 76.9 | 8.7% | +| 9 | Elisabeth | 85.5 | 84.6 | 72.3 | 14.5% | +| 10 | Gemini Lite | 83.0 | 86.1 | 71.7 | 17.0% | +| 11 | MIMO Flash | 80.4 | 86.4 | 69.2 | 19.6% | +| 12 | Meghan | 86.3 | 76.8 | 66.5 | 13.7% | +| 13 | MiniMax M2.7 | 87.9 | 75.6 | 66.1 | 12.1% | +| 14 | Joey | 84.0 | 77.2 | 65.8 | 16.0% | +| 15 | Anuj | 72.7 | 60.6 | 42.8 | 27.3% | +| 16 | Aaryan (outlier) | 59.1 | 24.7 | 15.8 | 40.9% | + +Opus earns #1 without being privileged -- it genuinely disagrees with the crowd least. + +### Cross-Source Agreement + +| Comparison | Category | +|------------|----------| +| Human maj = S1 maj | 81.7% | +| Human maj = Opus | 83.2% | +| Human maj = GenAI maj (10) | 82.2% | +| GenAI maj = Opus | 86.8% | +| 13-signal maj = 10-GenAI maj | 99.5% | + +### Confusion Axes (same order for all source types) + +1. MR <-> RMP (dominant) +2. BG <-> MR +3. 
N/O <-> SI --- -## Gold Set Repair Strategy: 13 Signals Per Paragraph +## Adjudication Strategy (13 Signals) -### Annotation sources per paragraph +### Sources per paragraph -| Source | Count | Prompt | Notes | -|--------|-------|--------|-------| -| Human annotators | 3 | Codebook v3.0 | With notes, timing data | -| Stage 1 panel (gemini-flash-lite, mimo-flash, grok-fast) | 3 | v2.5 | Already on file | -| Opus 4.6 golden | 1 | v2.5 + full codebook | With reasoning traces | -| Benchmark models (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 | Running now | -| **Total** | **13** | | | +| Source | Count | Prompt | +|--------|-------|--------| +| Human annotators | 3 | Codebook v3.0 | +| Stage 1 (gemini-lite, mimo-flash, grok-fast) | 3 | v2.5 | +| Opus 4.6 golden | 1 | v3.0+codebook | +| Benchmark (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) | 6 | v3.0 | +| **Total** | **13** | | -### Adjudication tiers +### Tier breakdown (actual counts) -**Tier 1 — High confidence:** 10+/13 agree on both dimensions. Gold label, no intervention. +| Tier | Rule | Count | % | +|------|------|-------|---| +| 1 | 10+/13 agree on both dimensions | 756 | 63.0% | +| 2 | Human majority + GenAI majority agree | 216 | 18.0% | +| 3 | Humans split, GenAI converges | 26 | 2.2% | +| 4 | Universal disagreement | 202 | 16.8% | -**Tier 2 — Clear majority with cross-validation:** Human majority (2/3) matches GenAI consensus (majority of 10 GenAI labels). Take the consensus. +**81% auto-resolvable.** Only 228 paragraphs (19%) need expert review. -**Tier 3 — Human split, GenAI consensus:** Humans disagree but GenAI labels converge. Expert adjudication informed by Opus reasoning traces. Human makes the final call. +### Aaryan correction -**Tier 4 — Universal disagreement:** Everyone splits. Expert adjudication with documented reasoning, or flagged as inherently ambiguous for error analysis. 
+On Aaryan's 600 paragraphs: when the other 2 annotators agree and Aaryan disagrees, the other-2 majority becomes the human signal for adjudication. This is justified by his 40.9% odd-one-out rate (vs 8-16% for other annotators) and α=0.03-0.25 on specificity. -GenAI labels are evidence for adjudication, not the gold label itself. The final label is always a human decision — this avoids circularity. +### Adjudication process for Tier 3+4 + +1. Pull Opus reasoning trace for the paragraph +2. Check the GenAI consensus (which category do 7+/10 models agree on?) +3. Expert reads the paragraph and all signals, makes final call +4. Document reasoning for Tier 4 paragraphs (these are the error analysis corpus) --- -## Task Sequence +## F1 Strategy — How to Pass -### In progress -- [x] Human labeling — complete -- [x] Data export and IRR analysis — complete -- [x] Prompt v3.0 update with codebook rulings — complete -- [x] GenAI benchmark infrastructure — complete -- [ ] Opus golden re-run on correct holdout (running, ~1h with 20 workers) -- [ ] 6-model benchmark on holdout (running, high concurrency) +### The requirement -### After benchmark completes -- [ ] Cross-source analysis with all 13 signals (update `analyze-gold.py`) -- [ ] Gold set adjudication using tiered strategy -- [ ] Training data assembly (unanimous + calibrated majority + judge) +- **C grade minimum:** fine-tuned model with macro F1 > 0.80 on holdout +- **Gold standard:** human-labeled holdout (1,200 paragraphs) +- **Metrics to report:** macro F1, per-class F1, Krippendorff's alpha, AUC, MCC +- The fine-tuned "specialist" must be compared head-to-head with GenAI labeling -### After gold set is finalized -- [ ] Fine-tuning + ablations (7 experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} + best) -- [ ] Final evaluation on holdout -- [ ] Writeup + IGNITE slides +### The challenge + +The holdout was deliberately stratified to over-sample hard decision boundaries (MR<->RMP, N/O<->SI, Spec 3<->4). 
This means raw F1 on this holdout will be **lower** than on a random sample. Additionally: + +- The best individual GenAI models only agree with human majority ~83-87% on category +- Our model is trained on GenAI labels, so its ceiling is bounded by GenAI-vs-human agreement +- Macro F1 weights all 7 classes equally -- rare classes (TPR, ID) get equal influence +- The MR<->RMP confusion axis is the #1 challenge across all source types + +### Why F1 > 0.80 is achievable + +1. **DAPT + TAPT give domain advantage.** The model has seen 1B tokens of SEC filings (DAPT) and all labeled paragraphs (TAPT). It understands SEC disclosure language at a depth that generic BERT models don't. + +2. **35K+ high-confidence training examples.** Unanimous Stage 1 labels where all 3 models agreed on both dimensions. These are cleaner than any single model's labels. + +3. **Encoder classification outperforms generative labeling on fine-tuned domains.** The model doesn't need to "reason" about the codebook -- it learns the decision boundaries directly from representations. This is the core thesis of Ringel (2023). + +4. **The hard cases are a small fraction.** 63% of the holdout is Tier 1 (10+/13 agree). The model only needs reasonable performance on the remaining 37% to clear 0.80. + +### Critical actions + +#### 1. Gold label quality (highest priority) + +Noisy gold labels directly cap F1. If the gold label is wrong, even a perfect model gets penalized. + +- **Tier 1+2 (972 paragraphs):** Use 13-signal consensus. These are essentially guaranteed correct. +- **Tier 3+4 (228 paragraphs):** Expert adjudication with documented reasoning. Prioritize Opus reasoning traces + GenAI consensus as evidence. +- **Aaryan correction:** On his 600 paragraphs, replace his vote with the other-2 majority when they agree. This alone should improve gold label quality substantially. +- **Document the process:** The adjudication methodology itself is a deliverable (IRR report + reliability analysis). 
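The tier rules above reduce to a few counts over the 13 signals. A minimal sketch (the vote format and the exact majority thresholds are my reading of the tier table, not settled code):

```python
from collections import Counter

def assign_tier(human_votes, genai_votes):
    """Votes are (category, specificity) pairs: 3 human + 10 GenAI per paragraph.
    Returns (tier, gold_label_or_None)."""
    all_votes = human_votes + genai_votes            # the 13 signals
    top, n = Counter(all_votes).most_common(1)[0]
    if n >= 10:                                      # Tier 1: 10+/13 on both dims
        return 1, top
    h_top, h_n = Counter(human_votes).most_common(1)[0]
    g_top, g_n = Counter(genai_votes).most_common(1)[0]
    human_maj = h_top if h_n >= 2 else None          # majority of 3 humans
    genai_maj = g_top if g_n >= 6 else None          # majority of 10 GenAI
    if human_maj is not None and human_maj == genai_maj:
        return 2, human_maj                          # Tier 2: cross-validated
    if human_maj is None and genai_maj is not None:
        return 3, None                               # Tier 3: expert review w/ traces
    return 4, None                                   # Tier 4: documented expert call
```

Tier 3 and 4 deliberately return no label; those 228 paragraphs go to expert adjudication with the Opus traces and GenAI consensus as evidence.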
+ +#### 2. Training data curation + +- **Primary corpus:** Unanimous Stage 1 labels (all 3 models agree on both cat+spec) -- ~35K paragraphs +- **Secondary:** Majority labels (2/3 agree) with 0.8x sample weight -- ~9-12K +- **Tertiary:** Judge labels with high confidence -- ~2-3K +- **Exclude:** Paragraphs where all 3 models disagree (too noisy for training) +- **Quality weighting:** clean/headed/minor = 1.0, degraded = 0.5 + +#### 3. Architecture and loss + +- **Dual-head classifier:** Shared ModernBERT backbone -> category head (7-class softmax) + specificity head (4-class ordinal) +- **Category loss:** Focal loss (gamma=2) or class-weighted cross-entropy. The model must not ignore rare categories (TPR, ID). Weights inversely proportional to class frequency in training data. +- **Specificity loss:** Ordinal regression (CORAL) -- penalizes Spec 1->4 errors more than Spec 2->3. This respects the ordinal nature and handles the noisy Spec 3<->4 boundary gracefully. +- **Combined loss:** L = L_cat + 0.5 * L_spec (category gets more gradient weight because it's the more reliable dimension and the primary metric) + +#### 4. Ablation experiments (need >=4 configurations) + +| # | Backbone | Class Weights | SCL | Notes | +|---|----------|--------------|-----|-------| +| 1 | Base ModernBERT-large | No | No | Baseline | +| 2 | +DAPT | No | No | Domain adaptation effect | +| 3 | +DAPT+TAPT | No | No | Full pre-training pipeline | +| 4 | +DAPT+TAPT | Yes (focal) | No | Class imbalance handling | +| 5 | +DAPT+TAPT | Yes (focal) | Yes | Supervised contrastive learning | +| 6 | +DAPT+TAPT | Yes (focal) | Yes | + ensemble (3 seeds) | + +Experiments 1-3 isolate the pre-training contribution. 4-5 isolate training strategy. 6 is the final system. + +#### 5. 
Evaluation strategy + +- **Primary metric:** Category macro F1 on full 1,200 holdout (must exceed 0.80) +- **Secondary metrics:** Per-class F1, specificity F1 (report separately), MCC, Krippendorff's alpha vs human labels +- **Dual reporting (adverse incentive mitigation):** Also report F1 on a 720-paragraph proportional subsample (random draw matching corpus class proportions). The delta quantifies degradation on hard boundary cases. This serves the A-grade "error analysis" criterion. +- **Error analysis corpus:** Tier 4 paragraphs (202) are the natural error analysis set. Where the model fails on these, the 13-signal disagreement pattern explains why. + +#### 6. Inference-time techniques + +- **Ensemble:** Train 3 models with different random seeds on the best config. Majority vote at inference. Typically adds 1-3pp F1. +- **Threshold optimization:** After training, optimize per-class classification thresholds on a validation set (not holdout) to maximize macro F1. Don't use argmax -- use thresholds that balance precision and recall per class. +- **Post-hoc calibration:** Temperature scaling on validation set. Important for AUC and calibration plots. + +### Specificity dimension -- managed expectations + +Specificity F1 will be lower than category F1. This is not a model failure: +- Human alpha on specificity is only 0.546 (unreliable gold) +- Even frontier models only agree 75-91% on specificity +- The Spec 3<->4 boundary is genuinely ambiguous + +Strategy: report specificity F1 separately, explain why it's lower, and frame it as a finding about construct reliability (the specificity dimension needs more operational clarity, not better models). This is honest and scientifically interesting. 
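The CORAL-style ordinal head mentioned under architecture works on K-1 cumulative probabilities. A minimal NumPy sketch of the target encoding and decoding (training itself, and the shared-weight/per-threshold-bias design that guarantees rank-consistent probabilities, are not shown):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coral_targets(levels, n_levels=4):
    """Expand ordinal levels (1..4) into K-1 binary targets for training:
    columns answer 'is spec >= 2?', '>= 3?', '>= 4?'."""
    levels = np.asarray(levels)[:, None]
    ks = np.arange(2, n_levels + 1)[None, :]
    return (levels >= ks).astype(float)

def coral_decode(cum_logits):
    """cum_logits: (N, 3) logits for P(spec >= 2), P(spec >= 3), P(spec >= 4).
    Predicted level = 1 + number of thresholds cleared at p > 0.5."""
    return 1 + np.sum(sigmoid(cum_logits) > 0.5, axis=1)
```

Training applies binary cross-entropy between `sigmoid(cum_logits)` and `coral_targets`; because CORAL shares one weight vector across the K-1 logits with separate biases, the cumulative probabilities decrease monotonically and the decoder can never emit an impossible pattern like "yes >= 4 but no >= 3".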
+ +### Concrete F1 estimate + +Based on GenAI-vs-human agreement rates and the typical BERT fine-tuning premium: +- **Category macro F1:** 0.78-0.85 (depends on class imbalance handling and gold quality) +- **Specificity macro F1:** 0.65-0.75 (ceiling-limited by human disagreement) +- **Combined (cat x spec) accuracy:** 0.55-0.70 + +The swing categories for macro F1 are MR (~65-80% per-class F1), TPR (~70-90%), and N/O (~60-85%). Focal loss + SCL should push MR and N/O into the range where macro F1 clears 0.80. --- ## The Meta-Narrative -The finding that trained student annotators achieve α = 0.801 on category but only α = 0.546 on specificity, while calibrated LLM panels achieve 70.8%+ both-unanimous on an easier sample, validates the synthetic experts hypothesis for rule-heavy classification tasks. The human labels are essential as a calibration anchor, but GenAI's advantage on multi-step reasoning tasks (like QV fact counting) is itself a key finding. +The finding that trained student annotators achieve alpha = 0.801 on category but only 0.546 on specificity, while calibrated LLM panels achieve higher consistency (60.1% spec unanimity vs 42.2% for humans), validates the synthetic experts hypothesis for rule-heavy classification tasks. The low specificity agreement is not annotator incompetence -- it's evidence that the specificity construct requires systematic attention to IS/NOT lists and counting rules that humans don't consistently invest at 15s/paragraph pace. GenAI's advantage on multi-step reasoning tasks is itself a key finding. -The low specificity agreement is not annotator incompetence — it's evidence that the specificity construct requires cognitive effort that humans don't consistently invest at the 15-second-per-paragraph pace the task demands. The GenAI panel, which processes every paragraph with the same systematic attention to the IS/NOT lists and counting rules, achieves more consistent results on this specific dimension. 
+The leave-one-out analysis showing that Opus earns the top rank without being privileged is the strongest validation of using frontier LLMs as "gold" annotators: they're not just consistent with each other, they're the most consistent with the emergent consensus of all 16 sources combined. + +--- + +## Timeline + +| Task | Target | Status | +|------|--------|--------| +| Human labeling | 2026-04-01 | Done | +| GenAI benchmark (10 models) | 2026-04-02 | Done | +| 13-signal analysis | 2026-04-02 | Done | +| Gold set adjudication | 2026-04-03-04 | Next | +| Training data assembly | 2026-04-04 | | +| Fine-tuning ablations (6 configs) | 2026-04-05-08 | | +| Final evaluation on holdout | 2026-04-09 | | +| Executive memo + IGNITE slides | 2026-04-10-14 | | +| Submission | 2026-04-23 | | diff --git a/docs/STATUS.md b/docs/STATUS.md index 411c815..706998b 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -1,4 +1,4 @@ -# Project Status — 2026-04-02 +# Project Status — 2026-04-02 (evening) ## What's Done @@ -27,21 +27,11 @@ - [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility) - [x] Procedure documented in `docs/DAPT-PROCEDURE.md` -### Documentation -- [x] `docs/DATA-QUALITY-AUDIT.md` — full audit with all patches and quality tiers -- [x] `docs/EDGAR-FILING-GENERATORS.md` — 14 generators with signatures and quality profiles -- [x] `docs/DAPT-PROCEDURE.md` — pre-flight checklist, commands, monitoring guide -- [x] `docs/NARRATIVE.md` — 11 phases documented through TAPT completion - -## What's Done (since last update) - ### Human Labeling — Complete - [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3) - [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators - [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/` -- [x] Comprehensive IRR analysis with 16 diagnostic charts → `data/gold/charts/` - -### Human Labeling Results +- [x] Comprehensive IRR analysis → 
`data/gold/charts/` | Metric | Category | Specificity | Both | |--------|----------|-------------|------| @@ -49,77 +39,90 @@ | Krippendorff's α | 0.801 | 0.546 | — | | Avg Cohen's κ | 0.612 | 0.440 | — | -**Key findings:** -- **Category is reliable (α=0.801)** — above the 0.80 threshold for reliable data -- **Specificity is unreliable (α=0.546)** — driven primarily by one outlier annotator (Aaryan, +1.28 specificity levels vs Stage 1, κ=0.03-0.25 on specificity) and genuinely hard Spec 3↔4 boundary -- **Human majority = Stage 1 majority on 83.3% of categories** — strong cross-validation -- **Same confusion axes** in humans and GenAI: MR↔RMP (#1), BG↔MR (#2), N/O↔SI (#3) -- **Excluding outlier annotator:** both-unanimous jumps from 5% → 50% on his paragraphs (+45pp) -- **Timing:** 21.5 active hours total, median 14.9s per paragraph - ### Prompt v3.0 -- [x] Updated `SYSTEM_PROMPT` with codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP +- [x] Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP - [x] Prompt version bumped from v2.5 → v3.0 -### GenAI Holdout Benchmark — In Progress -Running 6 benchmark models + Opus on the 1,200 holdout paragraphs: +### GenAI Holdout Benchmark — Complete +- [x] 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs +- [x] All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix) +- [x] Total benchmark cost: $45.63 -| Model | Supplier | Est. 
Cost/call | Notes | -|-------|----------|---------------|-------| -| openai/gpt-5.4 | OpenAI | $0.009 | Structured output | -| moonshotai/kimi-k2.5 | Moonshot | $0.006 | Structured output | -| google/gemini-3.1-pro-preview | Google | $0.006 | Structured output | -| z-ai/glm-5 | Zhipu | $0.006 | Structured output, exacto routing | -| minimax/minimax-m2.7 | MiniMax | $0.002 | Raw text + fence stripping | -| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | Structured output, exacto routing | -| anthropic/claude-opus-4.6 | Anthropic | $0 (subscription) | Agent SDK, parallel workers | +| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus | +|-------|----------|------|---------------|----------------| +| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% | +| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% | +| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% | +| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% | +| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% | +| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% | +| anthropic/claude-opus-4.6 | Anthropic | $0 | — | — | -Plus Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast) already on file = **10 models, 8 suppliers**. +Plus Stage 1 panel already on file = **10 models, 8 suppliers**. 
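
The "Cat % vs Opus" and "Both % vs Opus" columns above are easy to recompute from the per-model annotation files (`data/annotations/bench-holdout/{model}.jsonl`). A minimal sketch -- the per-record shape (`paragraphId`, `label.content_category`, `label.specificity_level`) matches what `scripts/analyze-gold.py` reads; concrete file paths are left to the caller:

```python
import json

def load_labels(path):
    """paragraphId -> (category, specificity) from one model's annotation jsonl."""
    out = {}
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            out[r["paragraphId"]] = (
                r["label"]["content_category"],
                r["label"]["specificity_level"],
            )
    return out

def agreement(a: dict, b: dict) -> tuple[float, float]:
    """(category agreement, both-dimension agreement) over shared paragraphs."""
    shared = a.keys() & b.keys()
    n = len(shared) or 1
    cat = sum(a[p][0] == b[p][0] for p in shared)
    both = sum(a[p] == b[p] for p in shared)
    return cat / n, both / n
```

Comparing any benchmark model against the Opus file (`data/annotations/golden/opus.jsonl`) with `agreement(load_labels(...), load_labels(...))` reproduces the two percentage columns.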
-## What's In Progress +### 13-Signal Cross-Source Analysis — Complete +- [x] 30 diagnostic charts generated → `data/gold/charts/` +- [x] Leave-one-out analysis (no model privileged as reference) +- [x] Adjudication tier breakdown computed -### Opus Golden Re-Run -- Opus golden labels being re-run on the correct 1,200 holdout paragraphs (previous run was on a stale sample due to `.sampled-ids.json` being overwritten) -- Previous Opus labels (different 1,200 paragraphs) preserved at `data/annotations/golden/opus.wrong-sample.jsonl` -- Using parallelized Agent SDK workers (concurrency=20) +**Adjudication tiers (13 signals per paragraph):** -### GenAI Benchmark -- 6 models running on holdout with v3.0 prompt, high concurrency (200) -- Output: `data/annotations/bench-holdout/{model}.jsonl` +| Tier | Count | % | Rule | +|------|-------|---|------| +| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold | +| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated | +| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review | +| 4 | 202 | 16.8% | Universal disagreement → expert review | + +**Leave-one-out ranking (each source vs majority of other 12):** + +| Rank | Source | Cat % | Spec % | Both % | +|------|--------|-------|--------|--------| +| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | +| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | +| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | +| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | +| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 | +| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 | + +**Key finding:** Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate). ## What's Next (in dependency order) -### 1. Gold set adjudication (blocked on benchmark + Opus completion) -Each paragraph will have **13+ independent annotations**: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark models. 
-Adjudication tiers: -- **Tier 1:** 10+/13 agree → gold label, no intervention -- **Tier 2:** Human majority + GenAI consensus agree → take consensus -- **Tier 3:** Humans split, GenAI converges → expert adjudication using Opus reasoning traces -- **Tier 4:** Universal disagreement → expert adjudication with documented reasoning +### 1. Gold set adjudication +- Tier 1+2 (972 paragraphs, 81%) → auto-resolved from 13-signal consensus +- Tier 3+4 (228 paragraphs, 19%) → expert review with Opus reasoning traces +- For Aaryan's 600 paragraphs: use other-2-annotator majority when they agree and he disagrees -### 2. Training data assembly (blocked on adjudication) +### 2. Training data assembly - Unanimous Stage 1 labels (35,204 paragraphs) → full weight - Calibrated majority labels (~9-12K) → full weight - Judge high-confidence labels (~2-3K) → full weight - Quality tier weights: clean/headed/minor=1.0, degraded=0.5 -### 3. Fine-tuning + ablations (blocked on training data) -7 experiments: {base, +DAPT, +DAPT+TAPT} × {with/without SCL} + best config. -Dual-head classifier: shared ModernBERT backbone + 2 linear classification heads. +### 3. Fine-tuning + ablations +- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting} +- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal) +- Focal loss / class-weighted CE for category imbalance +- Ordinal regression (CORAL) for specificity -### 4. Evaluation + paper (blocked on everything above) -Full GenAI benchmark (10 models) on 1,200 holdout. Comparison tables. Write-up. IGNITE slides. +### 4. 
Evaluation + paper +- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category) +- Full GenAI benchmark table (10 models × 1,200 holdout) +- Cost/time/reproducibility comparison +- Error analysis on Tier 4 paragraphs (A-grade criterion) +- IGNITE slides (20 slides, 15s each) ## Parallel Tracks ``` Track A (GPU): DAPT ✓ → TAPT ✓ ──────────────→ Fine-tuning → Eval ↑ -Track B (API): Opus re-run ─┐ │ - ├→ Gold adjudication ──────┤ -Track C (API): 6-model bench┘ │ +Track B (API): Opus re-run ✓─┐ │ + ├→ Gold adjudication ─────┤ +Track C (API): 6-model bench ✓┘ │ │ -Track D (Human): Labeling ✓ → IRR analysis ✓ ───────────┘ +Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ ─────┘ ``` ## Key File Locations @@ -132,15 +135,12 @@ Track D (Human): Labeling ✓ → IRR analysis ✓ ───────── | Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) | | Human label metrics | `data/gold/metrics.json` | | Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) | -| Diagnostic charts | `data/gold/charts/*.png` (16 charts) | -| Opus golden labels | `data/annotations/golden/opus.jsonl` (re-run on correct holdout) | -| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` | +| Diagnostic charts | `data/gold/charts/*.png` (30 charts) | +| Opus golden labels | `data/annotations/golden/opus.jsonl` (1,200) | +| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` (6 × 1,200) | | Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) | | DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) | -| DAPT config | `python/configs/dapt/modernbert.yaml` | -| TAPT config | `python/configs/tapt/modernbert.yaml` | | DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` | | TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` | -| Training CLI | `python/main.py dapt --config ...` | -| Analysis script | `scripts/analyze-gold.py` | +| Analysis script | `scripts/analyze-gold.py` (30-chart, 
13-signal analysis) | | Data dump script | `labelapp/scripts/dump-all.ts` | diff --git a/scripts/analyze-gold.py b/scripts/analyze-gold.py index cf0638e..8778b95 100644 --- a/scripts/analyze-gold.py +++ b/scripts/analyze-gold.py @@ -1,13 +1,19 @@ """ -Comprehensive analysis of human labeling data cross-referenced with -Stage 1 GenAI panel and Opus golden labels. +Comprehensive 13-signal analysis of gold set holdout. -Outputs charts to data/gold/charts/ and a summary to stdout. +Sources (per paragraph): + 3 human annotators (BIBD) + 3 Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-fast) — v2.5 + 1 Opus 4.6 golden — v3.0+codebook + 6 benchmark models (gpt-5.4, kimi-k2.5, gemini-pro, glm-5, minimax-m2.7, mimo-pro) — v3.0 + +Outputs ~30 charts to data/gold/charts/ and detailed textual analysis to stdout. """ import json import os from collections import Counter, defaultdict +from itertools import combinations from pathlib import Path import matplotlib @@ -17,14 +23,14 @@ import matplotlib.ticker as mticker import numpy as np # ── Paths ── -GOLD_DIR = Path("/home/joey/Documents/sec-cyBERT/data/gold") +ROOT = Path("/home/joey/Documents/sec-cyBERT") +GOLD_DIR = ROOT / "data/gold" CHART_DIR = GOLD_DIR / "charts" -STAGE1_PATH = Path("/home/joey/Documents/sec-cyBERT/data/annotations/stage1.patched.jsonl") -OPUS_PATH = Path("/home/joey/Documents/sec-cyBERT/data/annotations/golden/opus.jsonl") -HOLDOUT_PATH = GOLD_DIR / "paragraphs-holdout.jsonl" +STAGE1_PATH = ROOT / "data/annotations/stage1.patched.jsonl" +OPUS_PATH = ROOT / "data/annotations/golden/opus.jsonl" +BENCH_DIR = ROOT / "data/annotations/bench-holdout" LABELS_PATH = GOLD_DIR / "human-labels-raw.jsonl" METRICS_PATH = GOLD_DIR / "metrics.json" -OPUS_ID_MAP_PATH = GOLD_DIR / "opus-to-db-id-map.json" CATEGORIES = [ "Board Governance", "Management Role", "Risk Management Process", @@ -32,10 +38,54 @@ CATEGORIES = [ ] CAT_SHORT = ["BG", "MR", "RMP", "TPR", "ID", "SI", "N/O"] CAT_MAP = dict(zip(CATEGORIES, 
CAT_SHORT)) +CAT_IDX = {c: i for i, c in enumerate(CATEGORIES)} SPEC_LEVELS = [1, 2, 3, 4] CHART_DIR.mkdir(parents=True, exist_ok=True) +# ── Shared style ── +plt.rcParams.update({ + "figure.facecolor": "white", + "axes.facecolor": "#fafafa", + "axes.grid": True, + "grid.alpha": 0.3, + "font.size": 10, +}) + +# Short display names for models +MODEL_SHORT = { + "google/gemini-3.1-flash-lite-preview": "Gemini Lite", + "x-ai/grok-4.1-fast": "Grok Fast", + "xiaomi/mimo-v2-flash": "MIMO Flash", + "anthropic/claude-opus-4-6": "Opus 4.6", + "openai/gpt-5.4": "GPT-5.4", + "moonshotai/kimi-k2.5": "Kimi K2.5", + "google/gemini-3.1-pro-preview": "Gemini Pro", + "z-ai/glm-5": "GLM-5", + "minimax/minimax-m2.7": "MiniMax M2.7", + "xiaomi/mimo-v2-pro": "MIMO Pro", +} + +MODEL_TIER = { + "google/gemini-3.1-flash-lite-preview": "stage1", + "x-ai/grok-4.1-fast": "stage1", + "xiaomi/mimo-v2-flash": "stage1", + "anthropic/claude-opus-4-6": "frontier", + "openai/gpt-5.4": "frontier", + "moonshotai/kimi-k2.5": "frontier", + "google/gemini-3.1-pro-preview": "frontier", + "z-ai/glm-5": "mid", + "minimax/minimax-m2.7": "budget", + "xiaomi/mimo-v2-pro": "mid", +} + +TIER_COLORS = { + "stage1": "#95a5a6", + "frontier": "#e74c3c", + "mid": "#f39c12", + "budget": "#27ae60", +} + def load_jsonl(path: Path) -> list[dict]: records = [] @@ -47,128 +97,225 @@ def load_jsonl(path: Path) -> list[dict]: return records -def majority_vote(items: list[str]) -> str | None: - """Return majority item if one exists, else None.""" +def majority_vote(items: list) -> object | None: + if not items: + return None c = Counter(items) top, count = c.most_common(1)[0] return top if count > len(items) / 2 else None def plurality_vote(items: list) -> tuple: - """Return most common item and its count.""" c = Counter(items) return c.most_common(1)[0] -# ── Load data ── +def cohens_kappa(labels_a: list, labels_b: list) -> float: + """Compute Cohen's kappa for two lists of categorical labels.""" + assert len(labels_a) == 
len(labels_b) + n = len(labels_a) + if n == 0: + return 0.0 + all_labels = sorted(set(labels_a) | set(labels_b)) + idx = {l: i for i, l in enumerate(all_labels)} + k = len(all_labels) + conf = np.zeros((k, k)) + for a, b in zip(labels_a, labels_b): + conf[idx[a]][idx[b]] += 1 + po = np.trace(conf) / n + pe = sum((conf[i, :].sum() / n) * (conf[:, i].sum() / n) for i in range(k)) + if pe >= 1.0: + return 1.0 + return (po - pe) / (1 - pe) + + +# ═══════════════════════════════════════════════════════════ +# LOAD ALL DATA +# ═══════════════════════════════════════════════════════════ print("Loading data...") + human_labels = load_jsonl(LABELS_PATH) -paragraphs_all = load_jsonl(HOLDOUT_PATH) -opus_labels = load_jsonl(OPUS_PATH) -metrics = json.loads(METRICS_PATH.read_text()) - -# Build paragraph metadata lookup (only holdout ones) holdout_ids = {l["paragraphId"] for l in human_labels} -para_meta = {} -for p in paragraphs_all: - if p["id"] in holdout_ids: - para_meta[p["id"]] = p +print(f" {len(human_labels)} human labels, {len(holdout_ids)} paragraphs") -# Load Stage 1 annotations for holdout -stage1_annots = [] +# Stage 1 annotations for holdout +stage1_by_pid: dict[str, list[dict]] = defaultdict(list) with open(STAGE1_PATH) as f: for line in f: d = json.loads(line) if d["paragraphId"] in holdout_ids: - stage1_annots.append(d) + stage1_by_pid[d["paragraphId"]].append(d) +print(f" {sum(len(v) for v in stage1_by_pid.values())} Stage 1 annotations") -# Build lookups -# Opus labels: only use if we have sufficient coverage (>50% of holdout) -# The Opus golden run may have been done on a different sample than what's in the DB. 
+# Opus opus_by_pid: dict[str, dict] = {} -for r in opus_labels: +for r in load_jsonl(OPUS_PATH): if r["paragraphId"] in holdout_ids: opus_by_pid[r["paragraphId"]] = r -# Also try ID remapping if direct match is low -if len(opus_by_pid) < 600 and OPUS_ID_MAP_PATH.exists(): - opus_id_map = json.loads(OPUS_ID_MAP_PATH.read_text()) - for r in opus_labels: - db_pid = opus_id_map.get(r["paragraphId"]) - if db_pid and db_pid in holdout_ids and db_pid not in opus_by_pid: - opus_by_pid[db_pid] = r +print(f" {len(opus_by_pid)} Opus annotations matched to holdout") -OPUS_AVAILABLE = len(opus_by_pid) >= 600 # gate all Opus analysis on sufficient coverage -opus_coverage = len(opus_by_pid) -print(f" Opus labels matched to holdout: {opus_coverage}/1200" - f" {'— SKIPPING Opus analysis (insufficient coverage)' if not OPUS_AVAILABLE else ''}") +# Benchmark models +bench_by_model: dict[str, dict[str, dict]] = {} # model_short -> {pid -> annotation} +bench_files = sorted(BENCH_DIR.glob("*.jsonl")) +for bf in bench_files: + if "errors" in bf.name: + continue + records = load_jsonl(bf) + if len(records) < 100: + continue # skip partial runs (deepseek-r1 has 1 annotation) + model_id = records[0]["provenance"]["modelId"] + short = MODEL_SHORT.get(model_id, model_id.split("/")[-1]) + by_pid = {} + for r in records: + if r["paragraphId"] in holdout_ids: + by_pid[r["paragraphId"]] = r + bench_by_model[short] = by_pid + print(f" {short}: {len(by_pid)} annotations") -# Stage 1: 3 annotations per paragraph -stage1_by_pid: dict[str, list[dict]] = defaultdict(list) -for a in stage1_annots: - stage1_by_pid[a["paragraphId"]].append(a) - -# Human labels grouped by paragraph +# Human labels grouped human_by_pid: dict[str, list[dict]] = defaultdict(list) for l in human_labels: human_by_pid[l["paragraphId"]].append(l) -# Annotator names annotator_names = sorted({l["annotatorName"] for l in human_labels}) -annotator_ids = sorted({l["annotatorId"] for l in human_labels}) -name_to_id = {} -for l in 
human_labels: - name_to_id[l["annotatorName"]] = l["annotatorId"] +metrics = json.loads(METRICS_PATH.read_text()) -print(f" {len(human_labels)} human labels across {len(holdout_ids)} paragraphs") -print(f" {len(stage1_annots)} Stage 1 annotations") -print(f" {len(opus_labels)} Opus labels") -print(f" Annotators: {', '.join(annotator_names)}") +# Paragraph metadata +para_all = load_jsonl(GOLD_DIR / "paragraphs-holdout.jsonl") +para_meta = {p["id"]: p for p in para_all if p["id"] in holdout_ids} -# ── Derive per-paragraph consensus labels ── -consensus = {} # pid -> {human_cat, human_spec, human_cat_method, ...} -for pid, lbls in human_by_pid.items(): - cats = [l["contentCategory"] for l in lbls] - specs = [l["specificityLevel"] for l in lbls] +# ═══════════════════════════════════════════════════════════ +# BUILD 13-SIGNAL MATRIX +# ═══════════════════════════════════════════════════════════ +print("\nBuilding 13-signal matrix...") - cat_maj = majority_vote(cats) - spec_maj = majority_vote([str(s) for s in specs]) +# For each paragraph, collect all signals +# GenAI models: 3 Stage1 + Opus + 6 bench = 10 +GENAI_SOURCES = ["Gemini Lite", "Grok Fast", "MIMO Flash", "Opus 4.6"] + sorted(bench_by_model.keys()) +# Deduplicate (Opus might already be in bench) +GENAI_SOURCES = list(dict.fromkeys(GENAI_SOURCES)) +ALL_GENAI = GENAI_SOURCES + +# Model ID to short name mapping (reverse) +MODEL_ID_TO_SHORT = {v: k for k, v in MODEL_SHORT.items()} + +signals = {} # pid -> {source_name: {cat, spec}} +for pid in holdout_ids: + sig = {} + + # Human labels + for lbl in human_by_pid.get(pid, []): + sig[f"H:{lbl['annotatorName']}"] = { + "cat": lbl["contentCategory"], + "spec": lbl["specificityLevel"], + } # Stage 1 - s1 = stage1_by_pid.get(pid, []) - s1_cats = [a["label"]["content_category"] for a in s1] - s1_specs = [a["label"]["specificity_level"] for a in s1] - s1_cat_maj = majority_vote(s1_cats) if s1_cats else None - s1_spec_maj = majority_vote([str(s) for s in s1_specs]) if 
s1_specs else None + for a in stage1_by_pid.get(pid, []): + mid = a["provenance"]["modelId"] + short = MODEL_SHORT.get(mid, mid.split("/")[-1]) + sig[short] = {"cat": a["label"]["content_category"], "spec": a["label"]["specificity_level"]} # Opus - op = opus_by_pid.get(pid) - op_cat = op["label"]["content_category"] if op else None - op_spec = op["label"]["specificity_level"] if op else None + if pid in opus_by_pid: + sig["Opus 4.6"] = { + "cat": opus_by_pid[pid]["label"]["content_category"], + "spec": opus_by_pid[pid]["label"]["specificity_level"], + } + + # Benchmark + for model_short, by_pid_map in bench_by_model.items(): + if pid in by_pid_map: + a = by_pid_map[pid] + sig[model_short] = {"cat": a["label"]["content_category"], "spec": a["label"]["specificity_level"]} + + signals[pid] = sig + +# Derive consensus labels +consensus = {} +for pid in holdout_ids: + sig = signals[pid] + + human_cats = [s["cat"] for k, s in sig.items() if k.startswith("H:")] + human_specs = [s["spec"] for k, s in sig.items() if k.startswith("H:")] + genai_cats = [s["cat"] for k, s in sig.items() if not k.startswith("H:")] + genai_specs = [s["spec"] for k, s in sig.items() if not k.startswith("H:")] + all_cats = human_cats + genai_cats + all_specs = human_specs + genai_specs + + s1_cats = [s["cat"] for k, s in sig.items() if k in ("Gemini Lite", "Grok Fast", "MIMO Flash")] + s1_specs = [s["spec"] for k, s in sig.items() if k in ("Gemini Lite", "Grok Fast", "MIMO Flash")] consensus[pid] = { - "human_cats": cats, - "human_specs": specs, - "human_cat_maj": cat_maj, - "human_spec_maj": int(spec_maj) if spec_maj else None, - "human_cat_unanimous": len(set(cats)) == 1, - "human_spec_unanimous": len(set(specs)) == 1, + "human_cats": human_cats, + "human_specs": human_specs, + "human_cat_maj": majority_vote(human_cats), + "human_spec_maj": majority_vote([str(s) for s in human_specs]), + "human_cat_unanimous": len(set(human_cats)) == 1, + "human_spec_unanimous": len(set(human_specs)) == 1, 
"s1_cats": s1_cats, "s1_specs": s1_specs, - "s1_cat_maj": s1_cat_maj, - "s1_spec_maj": int(s1_spec_maj) if s1_spec_maj else None, - "s1_cat_unanimous": len(set(s1_cats)) == 1 if s1_cats else False, - "opus_cat": op_cat, - "opus_spec": op_spec, + "s1_cat_maj": majority_vote(s1_cats), + "s1_spec_maj": majority_vote([str(s) for s in s1_specs]), + "genai_cats": genai_cats, + "genai_specs": genai_specs, + "genai_cat_maj": majority_vote(genai_cats), + "genai_spec_maj": majority_vote([str(s) for s in genai_specs]), + "all_cats": all_cats, + "all_specs": all_specs, + "all_cat_counts": Counter(all_cats), + "all_spec_counts": Counter(all_specs), + "n_signals": len(all_cats), + "opus_cat": sig.get("Opus 4.6", {}).get("cat"), + "opus_spec": sig.get("Opus 4.6", {}).get("spec"), "word_count": para_meta.get(pid, {}).get("wordCount", 0), + "signals": sig, } + # Fix human_spec_maj back to int + hsm = consensus[pid]["human_spec_maj"] + consensus[pid]["human_spec_maj"] = int(hsm) if hsm else None + ssm = consensus[pid]["s1_spec_maj"] + consensus[pid]["s1_spec_maj"] = int(ssm) if ssm else None + gsm = consensus[pid]["genai_spec_maj"] + consensus[pid]["genai_spec_maj"] = int(gsm) if gsm else None + +# ═══════════════════════════════════════════════════════════ +# ADJUDICATION TIERS +# ═══════════════════════════════════════════════════════════ +tiers = {1: [], 2: [], 3: [], 4: []} +for pid, c in consensus.items(): + n = c["n_signals"] + top_cat, top_cat_n = c["all_cat_counts"].most_common(1)[0] + top_spec, top_spec_n = Counter(c["all_specs"]).most_common(1)[0] + + hm_cat = c["human_cat_maj"] + gm_cat = c["genai_cat_maj"] + hm_spec = c["human_spec_maj"] + gm_spec = c["genai_spec_maj"] + + # Tier 1: 10+/13 agree on BOTH dimensions + if top_cat_n >= 10 and top_spec_n >= 10: + tiers[1].append(pid) + # Tier 2: human majority + genai majority agree on category + elif hm_cat and gm_cat and hm_cat == gm_cat: + tiers[2].append(pid) + # Tier 3: humans split, genai converges + elif hm_cat is None 
and gm_cat: + tiers[3].append(pid) + # Tier 4: everything else + else: + tiers[4].append(pid) + +print(f"\nAdjudication tiers:") +for t in range(1, 5): + print(f" Tier {t}: {len(tiers[t])} paragraphs ({len(tiers[t])/12:.1f}%)") # ═══════════════════════════════════════════════════════════ -# CHART 1: Pairwise Kappa Heatmaps (category + specificity) +# CHART 01: Pairwise Kappa Heatmaps (human annotators) # ═══════════════════════════════════════════════════════════ def plot_kappa_heatmaps(): fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5.5)) - for ax, dim_key, title in [ (ax1, "category", "Category"), (ax2, "specificity", "Specificity"), @@ -176,25 +323,20 @@ def plot_kappa_heatmaps(): data = metrics["pairwiseKappa"][dim_key] names = data["annotators"] matrix = np.array(data["matrix"]) - - # Mask diagonal mask = np.eye(len(names), dtype=bool) display = np.where(mask, np.nan, matrix) - im = ax.imshow(display, cmap="RdYlGn", vmin=0, vmax=1, aspect="equal") ax.set_xticks(range(len(names))) ax.set_xticklabels(names, rotation=45, ha="right", fontsize=9) ax.set_yticks(range(len(names))) ax.set_yticklabels(names, fontsize=9) ax.set_title(f"Pairwise Cohen's κ — {title}", fontsize=12, fontweight="bold") - for i in range(len(names)): for j in range(len(names)): if i != j: color = "white" if matrix[i][j] < 0.4 else "black" ax.text(j, i, f"{matrix[i][j]:.2f}", ha="center", va="center", fontsize=8, color=color) - fig.colorbar(im, ax=[ax1, ax2], shrink=0.8, label="Cohen's κ") fig.tight_layout() fig.savefig(CHART_DIR / "01_kappa_heatmaps.png", dpi=150) @@ -203,65 +345,70 @@ def plot_kappa_heatmaps(): # ═══════════════════════════════════════════════════════════ -# CHART 2: Per-annotator category distribution +# CHART 02: Per-source category distribution (all 13 sources) # ═══════════════════════════════════════════════════════════ -def plot_annotator_category_dist(): - fig, ax = plt.subplots(figsize=(12, 6)) - - # Also add Stage 1 majority (and Opus if available) - sources = 
list(annotator_names) + ["Stage1 Maj"] + (["Opus"] if OPUS_AVAILABLE else []) +def plot_all_source_category_dist(): + fig, ax = plt.subplots(figsize=(18, 7)) + sources = annotator_names + ["Human Maj", "S1 Maj"] + sorted(ALL_GENAI) dist = {s: Counter() for s in sources} + for l in human_labels: dist[l["annotatorName"]][l["contentCategory"]] += 1 - - for pid, c in consensus.items(): + for c in consensus.values(): + if c["human_cat_maj"]: + dist["Human Maj"][c["human_cat_maj"]] += 1 if c["s1_cat_maj"]: - dist["Stage1 Maj"][c["s1_cat_maj"]] += 1 - if OPUS_AVAILABLE and c["opus_cat"]: - dist["Opus"][c["opus_cat"]] += 1 + dist["S1 Maj"][c["s1_cat_maj"]] += 1 + for pid, c in consensus.items(): + for src, sig in c["signals"].items(): + if not src.startswith("H:"): + dist[src][sig["cat"]] += 1 x = np.arange(len(sources)) width = 0.11 offsets = np.arange(len(CATEGORIES)) - len(CATEGORIES) / 2 + 0.5 - colors = plt.cm.Set2(np.linspace(0, 1, len(CATEGORIES))) for i, (cat, color) in enumerate(zip(CATEGORIES, colors)): counts = [dist[s].get(cat, 0) for s in sources] - totals = [sum(dist[s].values()) for s in sources] - pcts = [c / t * 100 if t > 0 else 0 for c, t in zip(counts, totals)] + totals = [sum(dist[s].values()) or 1 for s in sources] + pcts = [c / t * 100 for c, t in zip(counts, totals)] ax.bar(x + offsets[i] * width, pcts, width, label=CAT_MAP[cat], color=color) ax.set_xticks(x) - ax.set_xticklabels(sources, rotation=45, ha="right") + ax.set_xticklabels(sources, rotation=60, ha="right", fontsize=8) ax.set_ylabel("% of labels") - ax.set_title("Category Distribution by Annotator (incl. 
Stage1 & Opus)", fontweight="bold") + ax.set_title("Category Distribution — All Sources (Humans + 10 GenAI Models)", fontweight="bold") ax.legend(bbox_to_anchor=(1.02, 1), loc="upper left", fontsize=8) ax.yaxis.set_major_formatter(mticker.PercentFormatter()) fig.tight_layout() - fig.savefig(CHART_DIR / "02_category_distribution.png", dpi=150) + fig.savefig(CHART_DIR / "02_category_distribution_all.png", dpi=150) plt.close(fig) - print(" 02_category_distribution.png") + print(" 02_category_distribution_all.png") # ═══════════════════════════════════════════════════════════ -# CHART 3: Per-annotator specificity distribution +# CHART 03: Per-source specificity distribution # ═══════════════════════════════════════════════════════════ -def plot_annotator_spec_dist(): - fig, ax = plt.subplots(figsize=(12, 5)) - - sources = list(annotator_names) + ["Stage1 Maj"] + (["Opus"] if OPUS_AVAILABLE else []) - +def plot_all_source_spec_dist(): + fig, ax = plt.subplots(figsize=(18, 6)) + sources = annotator_names + ["Human Maj", "S1 Maj"] + sorted(ALL_GENAI) dist = {s: Counter() for s in sources} + for l in human_labels: dist[l["annotatorName"]][l["specificityLevel"]] += 1 - + for c in consensus.values(): + hm = c["human_spec_maj"] + if hm is not None: + dist["Human Maj"][hm] += 1 + sm = c["s1_spec_maj"] + if sm is not None: + dist["S1 Maj"][sm] += 1 for pid, c in consensus.items(): - if c["s1_spec_maj"]: - dist["Stage1 Maj"][c["s1_spec_maj"]] += 1 - if OPUS_AVAILABLE and c["opus_spec"]: - dist["Opus"][c["opus_spec"]] += 1 + for src, sig in c["signals"].items(): + if not src.startswith("H:"): + dist[src][sig["spec"]] += 1 x = np.arange(len(sources)) width = 0.18 @@ -270,62 +417,51 @@ def plot_annotator_spec_dist(): for i, (level, color, label) in enumerate(zip(SPEC_LEVELS, colors, spec_labels)): counts = [dist[s].get(level, 0) for s in sources] - totals = [sum(dist[s].values()) for s in sources] - pcts = [c / t * 100 if t > 0 else 0 for c, t in zip(counts, totals)] + totals = 
[sum(dist[s].values()) or 1 for s in sources] + pcts = [c / t * 100 for c, t in zip(counts, totals)] ax.bar(x + (i - 1.5) * width, pcts, width, label=label, color=color) ax.set_xticks(x) - ax.set_xticklabels(sources, rotation=45, ha="right") + ax.set_xticklabels(sources, rotation=60, ha="right", fontsize=8) ax.set_ylabel("% of labels") - ax.set_title("Specificity Distribution by Annotator (incl. Stage1 & Opus)", fontweight="bold") + ax.set_title("Specificity Distribution — All Sources", fontweight="bold") ax.legend() ax.yaxis.set_major_formatter(mticker.PercentFormatter()) fig.tight_layout() - fig.savefig(CHART_DIR / "03_specificity_distribution.png", dpi=150) + fig.savefig(CHART_DIR / "03_specificity_distribution_all.png", dpi=150) plt.close(fig) - print(" 03_specificity_distribution.png") + print(" 03_specificity_distribution_all.png") # ═══════════════════════════════════════════════════════════ -# CHART 4: Human confusion matrix (aggregated pairwise) +# CHART 04: Human confusion matrices (category + specificity) # ═══════════════════════════════════════════════════════════ def plot_human_confusion(): fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6)) - # Category confusion cat_conf = np.zeros((len(CATEGORIES), len(CATEGORIES))) - cat_idx = {c: i for i, c in enumerate(CATEGORIES)} - for pid, lbls in human_by_pid.items(): cats = [l["contentCategory"] for l in lbls] for i in range(len(cats)): for j in range(i + 1, len(cats)): - a, b = cat_idx[cats[i]], cat_idx[cats[j]] + a, b = CAT_IDX[cats[i]], CAT_IDX[cats[j]] cat_conf[a][b] += 1 cat_conf[b][a] += 1 - - # Normalize rows row_sums = cat_conf.sum(axis=1, keepdims=True) cat_conf_norm = np.where(row_sums > 0, cat_conf / row_sums * 100, 0) - im1 = ax1.imshow(cat_conf_norm, cmap="YlOrRd", aspect="equal") ax1.set_xticks(range(len(CAT_SHORT))) ax1.set_xticklabels(CAT_SHORT, fontsize=9) ax1.set_yticks(range(len(CAT_SHORT))) ax1.set_yticklabels(CAT_SHORT, fontsize=9) - ax1.set_title("Human Category Confusion 
(row-normalized %)", fontweight="bold") - ax1.set_xlabel("Annotator B") - ax1.set_ylabel("Annotator A") - + ax1.set_title("Human Category Confusion (row-norm %)", fontweight="bold") for i in range(len(CAT_SHORT)): for j in range(len(CAT_SHORT)): val = cat_conf_norm[i][j] if val > 0.5: color = "white" if val > 40 else "black" - ax1.text(j, i, f"{val:.0f}", ha="center", va="center", - fontsize=7, color=color) + ax1.text(j, i, f"{val:.0f}", ha="center", va="center", fontsize=7, color=color) - # Specificity confusion spec_conf = np.zeros((4, 4)) for pid, lbls in human_by_pid.items(): specs = [l["specificityLevel"] for l in lbls] @@ -334,25 +470,20 @@ def plot_human_confusion(): a, b = specs[i] - 1, specs[j] - 1 spec_conf[a][b] += 1 spec_conf[b][a] += 1 - row_sums = spec_conf.sum(axis=1, keepdims=True) spec_conf_norm = np.where(row_sums > 0, spec_conf / row_sums * 100, 0) - im2 = ax2.imshow(spec_conf_norm, cmap="YlOrRd", aspect="equal") ax2.set_xticks(range(4)) - ax2.set_xticklabels(["Spec 1", "Spec 2", "Spec 3", "Spec 4"], fontsize=9) + ax2.set_xticklabels(["S1", "S2", "S3", "S4"], fontsize=9) ax2.set_yticks(range(4)) - ax2.set_yticklabels(["Spec 1", "Spec 2", "Spec 3", "Spec 4"], fontsize=9) - ax2.set_title("Human Specificity Confusion (row-normalized %)", fontweight="bold") - + ax2.set_yticklabels(["S1", "S2", "S3", "S4"], fontsize=9) + ax2.set_title("Human Specificity Confusion (row-norm %)", fontweight="bold") for i in range(4): for j in range(4): val = spec_conf_norm[i][j] if val > 0.5: color = "white" if val > 40 else "black" - ax2.text(j, i, f"{val:.0f}", ha="center", va="center", - fontsize=9, color=color) - + ax2.text(j, i, f"{val:.0f}", ha="center", va="center", fontsize=9, color=color) fig.colorbar(im1, ax=ax1, shrink=0.8) fig.colorbar(im2, ax=ax2, shrink=0.8) fig.tight_layout() @@ -362,41 +493,77 @@ def plot_human_confusion(): # ═══════════════════════════════════════════════════════════ -# CHART 5: Human majority vs Stage 1 majority vs Opus +# CHART 05: 
GenAI Model Agreement Matrix (10×10 pairwise kappa) +# ═══════════════════════════════════════════════════════════ +def plot_genai_agreement_matrix(): + fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7)) + + models = sorted(ALL_GENAI) + n = len(models) + + for ax, dim, title in [(ax1, "cat", "Category"), (ax2, "spec", "Specificity")]: + matrix = np.eye(n) + for i, m1 in enumerate(models): + for j, m2 in enumerate(models): + if i >= j: + continue + labels_a, labels_b = [], [] + for pid, c in consensus.items(): + sig = c["signals"] + if m1 in sig and m2 in sig: + labels_a.append(str(sig[m1][dim])) + labels_b.append(str(sig[m2][dim])) + if len(labels_a) >= 100: + k = cohens_kappa(labels_a, labels_b) + matrix[i][j] = k + matrix[j][i] = k + + mask = np.eye(n, dtype=bool) + display = np.where(mask, np.nan, matrix) + im = ax.imshow(display, cmap="RdYlGn", vmin=0.2, vmax=1, aspect="equal") + ax.set_xticks(range(n)) + ax.set_xticklabels(models, rotation=60, ha="right", fontsize=7) + ax.set_yticks(range(n)) + ax.set_yticklabels(models, fontsize=7) + ax.set_title(f"GenAI Pairwise κ — {title}", fontweight="bold") + for i in range(n): + for j in range(n): + if i != j: + val = matrix[i][j] + color = "white" if val < 0.5 else "black" + ax.text(j, i, f"{val:.2f}", ha="center", va="center", fontsize=6, color=color) + + fig.colorbar(im, ax=[ax1, ax2], shrink=0.7, label="Cohen's κ") + fig.tight_layout() + fig.savefig(CHART_DIR / "05_genai_agreement_matrix.png", dpi=150) + plt.close(fig) + print(" 05_genai_agreement_matrix.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 06: Cross-source confusion (Human vs Stage1, Human vs Opus, Human vs GenAI consensus) # ═══════════════════════════════════════════════════════════ def plot_cross_source_confusion(): comparisons = [ - ("Human Maj", "Stage1 Maj", "human_cat_maj", "s1_cat_maj"), + ("Human Maj", "S1 Maj", "human_cat_maj", "s1_cat_maj"), + ("Human Maj", "Opus 4.6", "human_cat_maj", "opus_cat"), + 
("Human Maj", "GenAI Maj", "human_cat_maj", "genai_cat_maj"), ] - if OPUS_AVAILABLE: - comparisons += [ - ("Human Maj", "Opus", "human_cat_maj", "opus_cat"), - ("Stage1 Maj", "Opus", "s1_cat_maj", "opus_cat"), - ] - ncols = len(comparisons) - fig, axes = plt.subplots(1, ncols, figsize=(7 * ncols, 5.5)) - if ncols == 1: - axes = [axes] + fig, axes = plt.subplots(1, 3, figsize=(21, 5.5)) for ax, (name_a, name_b, key_a, key_b) in zip(axes, comparisons): conf = np.zeros((len(CATEGORIES), len(CATEGORIES))) - cat_idx = {c: i for i, c in enumerate(CATEGORIES)} - total = 0 - agree = 0 - + total, agree = 0, 0 for pid, c in consensus.items(): a_val = c[key_a] b_val = c[key_b] if a_val and b_val: - conf[cat_idx[a_val]][cat_idx[b_val]] += 1 + conf[CAT_IDX[a_val]][CAT_IDX[b_val]] += 1 total += 1 if a_val == b_val: agree += 1 - - # Normalize rows row_sums = conf.sum(axis=1, keepdims=True) conf_norm = np.where(row_sums > 0, conf / row_sums * 100, 0) - im = ax.imshow(conf_norm, cmap="YlGnBu", aspect="equal") ax.set_xticks(range(len(CAT_SHORT))) ax.set_xticklabels(CAT_SHORT, fontsize=8) @@ -407,201 +574,454 @@ def plot_cross_source_confusion(): fontweight="bold", fontsize=10) ax.set_ylabel(name_a) ax.set_xlabel(name_b) - for i in range(len(CAT_SHORT)): for j in range(len(CAT_SHORT)): val = conf_norm[i][j] if val > 0.5: color = "white" if val > 50 else "black" - ax.text(j, i, f"{val:.0f}", ha="center", va="center", - fontsize=7, color=color) - + ax.text(j, i, f"{val:.0f}", ha="center", va="center", fontsize=7, color=color) fig.tight_layout() - fig.savefig(CHART_DIR / "05_cross_source_category.png", dpi=150) + fig.savefig(CHART_DIR / "06_cross_source_category.png", dpi=150) plt.close(fig) - print(" 05_cross_source_category.png") + print(" 06_cross_source_category.png") # ═══════════════════════════════════════════════════════════ -# CHART 6: Cross-source specificity confusion +# CHART 07: Cross-source specificity confusion # ═══════════════════════════════════════════════════════════ 
def plot_cross_source_specificity(): comparisons = [ - ("Human Maj", "Stage1 Maj", "human_spec_maj", "s1_spec_maj"), + ("Human Maj", "S1 Maj", "human_spec_maj", "s1_spec_maj"), + ("Human Maj", "Opus", "human_spec_maj", "opus_spec"), + ("Human Maj", "GenAI Maj", "human_spec_maj", "genai_spec_maj"), ] - if OPUS_AVAILABLE: - comparisons += [ - ("Human Maj", "Opus", "human_spec_maj", "opus_spec"), - ("Stage1 Maj", "Opus", "s1_spec_maj", "opus_spec"), - ] - ncols = len(comparisons) - fig, axes = plt.subplots(1, ncols, figsize=(5.5 * ncols, 4.5)) - if ncols == 1: - axes = [axes] + fig, axes = plt.subplots(1, 3, figsize=(18, 5)) for ax, (name_a, name_b, key_a, key_b) in zip(axes, comparisons): conf = np.zeros((4, 4)) - total = 0 - agree = 0 - + total, agree = 0, 0 for pid, c in consensus.items(): a_val = c[key_a] b_val = c[key_b] if a_val is not None and b_val is not None: - conf[a_val - 1][b_val - 1] += 1 + conf[int(a_val) - 1][int(b_val) - 1] += 1 total += 1 - if a_val == b_val: + if int(a_val) == int(b_val): agree += 1 - row_sums = conf.sum(axis=1, keepdims=True) conf_norm = np.where(row_sums > 0, conf / row_sums * 100, 0) - im = ax.imshow(conf_norm, cmap="YlGnBu", aspect="equal") ax.set_xticks(range(4)) - ax.set_xticklabels(["S1", "S2", "S3", "S4"], fontsize=9) + ax.set_xticklabels(["S1", "S2", "S3", "S4"]) ax.set_yticks(range(4)) - ax.set_yticklabels(["S1", "S2", "S3", "S4"], fontsize=9) + ax.set_yticklabels(["S1", "S2", "S3", "S4"]) pct = agree / total * 100 if total > 0 else 0 - ax.set_title(f"{name_a} vs {name_b}\n({pct:.1f}% agree, n={total})", - fontweight="bold", fontsize=10) + ax.set_title(f"{name_a} vs {name_b}\n({pct:.1f}% agree, n={total})", fontweight="bold") ax.set_ylabel(name_a) ax.set_xlabel(name_b) - for i in range(4): for j in range(4): val = conf_norm[i][j] if val > 0.5: color = "white" if val > 50 else "black" - ax.text(j, i, f"{val:.0f}", ha="center", va="center", - fontsize=9, color=color) + ax.text(j, i, f"{val:.0f}", ha="center", va="center", 
fontsize=9, color=color)
+    fig.tight_layout()
+    fig.savefig(CHART_DIR / "07_cross_source_specificity.png", dpi=150)
+    plt.close(fig)
+    print("  07_cross_source_specificity.png")
+
+
+# ═══════════════════════════════════════════════════════════
+# CHART 08: Adjudication tier breakdown
+# ═══════════════════════════════════════════════════════════
+def plot_adjudication_tiers():
+    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
+
+    # Tier counts
+    tier_sizes = [len(tiers[t]) for t in range(1, 5)]
+    tier_labels = [
+        "Tier 1\n10+/13 agree\n(auto)",
+        "Tier 2\nHuman+GenAI\nmaj agree",
+        "Tier 3\nHumans split\nGenAI converges",
+        "Tier 4\nUniversal\ndisagreement",
+    ]
+    tier_colors = ["#27ae60", "#3498db", "#f39c12", "#e74c3c"]
+    bars = ax1.bar(range(4), tier_sizes, color=tier_colors)
+    ax1.set_xticks(range(4))
+    ax1.set_xticklabels(tier_labels, fontsize=8)
+    ax1.set_ylabel("Paragraphs")
+    ax1.set_title("Adjudication Tier Distribution (1,200 paragraphs)", fontweight="bold")
+    for bar, n in zip(bars, tier_sizes):
+        pct = n / 1200 * 100
+        ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 5,
+                 f"{n}\n({pct:.1f}%)", ha="center", fontsize=10, fontweight="bold")
+
+    # Per-tier category distribution as a stacked horizontal bar
+    tier_names = [f"Tier {t}" for t in range(1, 5)]
+    bottom = np.zeros(4)
+    colors = plt.cm.Set2(np.linspace(0, 1, len(CATEGORIES)))
+    for ci, cat in enumerate(CATEGORIES):
+        vals = []
+        for t in range(1, 5):
+            cat_count 
= sum(1 for pid in tiers[t] + if consensus[pid]["all_cat_counts"].most_common(1)[0][0] == cat) + vals.append(cat_count / len(tiers[t]) * 100 if tiers[t] else 0) + ax2.barh(tier_names, vals, left=bottom, color=colors[ci], label=CAT_MAP[cat]) + bottom += np.array(vals) + + ax2.set_xlabel("% of paragraphs in tier") + ax2.set_title("Category Mix by Adjudication Tier", fontweight="bold") + ax2.legend(bbox_to_anchor=(1.02, 1), loc="upper left", fontsize=8) + ax2.set_xlim(0, 105) fig.tight_layout() - fig.savefig(CHART_DIR / "06_cross_source_specificity.png", dpi=150) + fig.savefig(CHART_DIR / "08_adjudication_tiers.png", dpi=150) plt.close(fig) - print(" 06_cross_source_specificity.png") + print(" 08_adjudication_tiers.png") # ═══════════════════════════════════════════════════════════ -# CHART 7: Per-annotator agreement with Stage1 and Opus +# CHART 09: Per-model accuracy vs Opus (as quasi-ground-truth) # ═══════════════════════════════════════════════════════════ -def plot_annotator_vs_references(): - fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5)) +def plot_model_accuracy_vs_opus(): + fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7)) - # Build per-annotator label lookup - ann_labels: dict[str, dict[str, dict]] = defaultdict(dict) - for l in human_labels: - ann_labels[l["annotatorName"]][l["paragraphId"]] = l + models = sorted(ALL_GENAI) + cat_acc = [] + spec_acc = [] + model_labels = [] - for ax, dim, title in [(ax1, "cat", "Category"), (ax2, "spec", "Specificity")]: - ref_sources = [ - ("Stage1 Maj", "s1_cat_maj", "s1_spec_maj"), - ("Human Maj", "human_cat_maj", "human_spec_maj"), - ] - if OPUS_AVAILABLE: - ref_sources.insert(1, ("Opus", "opus_cat", "opus_spec")) + for model in models: + agree_cat, agree_spec, total = 0, 0, 0 + for pid, c in consensus.items(): + sig = c["signals"] + if model in sig and "Opus 4.6" in sig and model != "Opus 4.6": + total += 1 + if sig[model]["cat"] == sig["Opus 4.6"]["cat"]: + agree_cat += 1 + if sig[model]["spec"] == 
sig["Opus 4.6"]["spec"]:
+                    agree_spec += 1
+        if total > 0:
+            cat_acc.append(agree_cat / total * 100)
+            spec_acc.append(agree_spec / total * 100)
+            model_labels.append(model)
-    x = np.arange(len(annotator_names))
-    width = 0.25 if len(ref_sources) == 3 else 0.3
+    # Sort by category accuracy
+    order = np.argsort(cat_acc)[::-1]
+    cat_acc = [cat_acc[i] for i in order]
+    spec_acc = [spec_acc[i] for i in order]
+    model_labels = [model_labels[i] for i in order]
-    for ri, (ref_name, ref_key_cat, ref_key_spec) in enumerate(ref_sources):
-        rates = []
-        for ann_name in annotator_names:
-            agree = 0
-            total = 0
-            for pid, lbl in ann_labels[ann_name].items():
-                c = consensus.get(pid)
-                if not c:
-                    continue
-                if dim == "cat":
-                    ref_val = c[ref_key_cat]
-                    ann_val = lbl["contentCategory"]
-                else:
-                    ref_val = c[ref_key_spec]
-                    ann_val = lbl["specificityLevel"]
-                if ref_val is not None:
+    # Build the short-name -> model-id reverse map once, not per element
+    short_to_id = {v: k for k, v in MODEL_SHORT.items()}
+    tier_c = [TIER_COLORS.get(MODEL_TIER.get(short_to_id.get(m, ""), ""), "#999")
+              for m in model_labels]
+
+    x = np.arange(len(model_labels))
+    bars1 = ax1.barh(x, cat_acc, color=tier_c, edgecolor="black", linewidth=0.5)
+    ax1.set_yticks(x)
+    ax1.set_yticklabels(model_labels, fontsize=8)
+    ax1.set_xlabel("Agreement with Opus (%)")
+    ax1.set_title("Category Agreement with Opus 4.6", fontweight="bold")
+    ax1.set_xlim(60, 100)
+    ax1.invert_yaxis()
+    for bar, v in zip(bars1, cat_acc):
+        ax1.text(bar.get_width() + 0.3, bar.get_y() + bar.get_height() / 2,
+                 f"{v:.1f}%", va="center", fontsize=8)
+
+    bars2 = ax2.barh(x, spec_acc, color=tier_c, edgecolor="black", linewidth=0.5)
+    ax2.set_yticks(x)
+    ax2.set_yticklabels(model_labels, fontsize=8)
+    ax2.set_xlabel("Agreement with Opus (%)")
+    ax2.set_title("Specificity Agreement with Opus 4.6", fontweight="bold")
+    ax2.set_xlim(30, 100)
+    ax2.invert_yaxis()
+    for bar, v in zip(bars2, spec_acc):
+        ax2.text(bar.get_width() + 0.3, bar.get_y() + bar.get_height() / 2,
+                 f"{v:.1f}%", va="center", fontsize=8)
+
+    # Legend for 
tiers + from matplotlib.patches import Patch + legend_elements = [Patch(facecolor=c, label=t) for t, c in TIER_COLORS.items()] + ax1.legend(handles=legend_elements, loc="lower right", fontsize=8) + + fig.tight_layout() + fig.savefig(CHART_DIR / "09_model_accuracy_vs_opus.png", dpi=150) + plt.close(fig) + print(" 09_model_accuracy_vs_opus.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 10: Cost vs Accuracy scatter +# ═══════════════════════════════════════════════════════════ +def plot_cost_vs_accuracy(): + fig, ax = plt.subplots(figsize=(12, 7)) + + # Gather cost and accuracy data per model + model_costs: dict[str, list[float]] = defaultdict(list) + model_lats: dict[str, list[float]] = defaultdict(list) + + # From bench + for bf in bench_files: + if "errors" in bf.name: + continue + records = load_jsonl(bf) + if len(records) < 100: + continue + mid = records[0]["provenance"]["modelId"] + short = MODEL_SHORT.get(mid, mid.split("/")[-1]) + for r in records: + model_costs[short].append(r["provenance"].get("costUsd", 0)) + model_lats[short].append(r["provenance"].get("latencyMs", 0)) + + # Stage 1 costs from annotations + for pid, annots in stage1_by_pid.items(): + for a in annots: + mid = a["provenance"]["modelId"] + short = MODEL_SHORT.get(mid, mid.split("/")[-1]) + model_costs[short].append(a["provenance"].get("costUsd", 0)) + model_lats[short].append(a["provenance"].get("latencyMs", 0)) + + # Opus + for r in opus_by_pid.values(): + model_costs["Opus 4.6"].append(r["provenance"].get("costUsd", 0)) + model_lats["Opus 4.6"].append(r["provenance"].get("latencyMs", 0)) + + for model in sorted(ALL_GENAI): + costs = model_costs.get(model, []) + if not costs: + continue + avg_cost = sum(costs) / len(costs) + avg_lat = sum(model_lats.get(model, [])) / max(len(model_lats.get(model, [])), 1) / 1000 # seconds + + # Category accuracy vs Opus + agree, total = 0, 0 + for pid, c in consensus.items(): + sig = c["signals"] + if model in sig and 
"Opus 4.6" in sig and model != "Opus 4.6": + total += 1 + if sig[model]["cat"] == sig["Opus 4.6"]["cat"]: + agree += 1 + cat_acc = agree / total * 100 if total > 0 else 0 + + mid_full = {v: k for k, v in MODEL_SHORT.items()}.get(model, "") + tier = MODEL_TIER.get(mid_full, "mid") + color = TIER_COLORS.get(tier, "#999") + + ax.scatter(avg_cost * 1000, cat_acc, s=150, c=color, edgecolors="black", + linewidths=0.5, zorder=3) + ax.annotate(model, (avg_cost * 1000, cat_acc), + textcoords="offset points", xytext=(8, 4), fontsize=7) + + ax.set_xlabel("Average Cost per Call (millicents, $0.001)") + ax.set_ylabel("Category Agreement with Opus (%)") + ax.set_title("Cost vs Category Accuracy (Opus as reference)", fontweight="bold") + ax.set_ylim(60, 100) + + from matplotlib.patches import Patch + legend_elements = [Patch(facecolor=c, label=t) for t, c in TIER_COLORS.items()] + ax.legend(handles=legend_elements, loc="lower right") + + fig.tight_layout() + fig.savefig(CHART_DIR / "10_cost_vs_accuracy.png", dpi=150) + plt.close(fig) + print(" 10_cost_vs_accuracy.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 11: Per-category accuracy by model +# ═══════════════════════════════════════════════════════════ +def plot_per_category_accuracy(): + fig, ax = plt.subplots(figsize=(16, 8)) + + models = sorted(ALL_GENAI) + # For each model, compute accuracy vs Opus per category + data = np.zeros((len(models), len(CATEGORIES))) + for mi, model in enumerate(models): + for ci, cat in enumerate(CATEGORIES): + agree, total = 0, 0 + for pid, c in consensus.items(): + sig = c["signals"] + if "Opus 4.6" in sig and model in sig and model != "Opus 4.6": + if sig["Opus 4.6"]["cat"] == cat: total += 1 - if str(ann_val) == str(ref_val): + if sig[model]["cat"] == cat: agree += 1 - rates.append(agree / total * 100 if total > 0 else 0) + data[mi][ci] = agree / total * 100 if total > 0 else 0 - ax.bar(x + (ri - 1) * width, rates, width, label=ref_name) + im = 
ax.imshow(data, cmap="RdYlGn", aspect="auto", vmin=50, vmax=100) + ax.set_xticks(range(len(CAT_SHORT))) + ax.set_xticklabels(CAT_SHORT, fontsize=10) + ax.set_yticks(range(len(models))) + ax.set_yticklabels(models, fontsize=8) + ax.set_title("Per-Category Recall vs Opus (%) — Where each model excels/struggles", fontweight="bold") + ax.set_xlabel("Opus label (true category)") - ax.set_xticks(x) - ax.set_xticklabels(annotator_names, rotation=45, ha="right") - ax.set_ylabel("Agreement %") - ax.set_title(f"Per-Annotator {title} Agreement with References", fontweight="bold") - ax.legend() - ax.set_ylim(0, 100) + for i in range(len(models)): + for j in range(len(CATEGORIES)): + val = data[i][j] + color = "white" if val < 65 else "black" + ax.text(j, i, f"{val:.0f}", ha="center", va="center", fontsize=8, color=color) + fig.colorbar(im, ax=ax, shrink=0.6, label="Recall %") fig.tight_layout() - fig.savefig(CHART_DIR / "07_annotator_vs_references.png", dpi=150) + fig.savefig(CHART_DIR / "11_per_category_accuracy.png", dpi=150) plt.close(fig) - print(" 07_annotator_vs_references.png") + print(" 11_per_category_accuracy.png") # ═══════════════════════════════════════════════════════════ -# CHART 8: Agreement rate by word count (binned) +# CHART 12: Ensemble size vs accuracy (how many models needed?) 
+# ═══════════════════════════════════════════════════════════
+def plot_ensemble_accuracy():
+    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
+
+    # For each ensemble size k (1..number of candidates), sample up to n_trials
+    # random subsets of k GenAI models, take majority vote, compare to Opus
+    all_models = sorted(ALL_GENAI)
+    # Remove Opus itself from ensemble candidates
+    ensemble_candidates = [m for m in all_models if m != "Opus 4.6"]
+
+    rng = np.random.RandomState(42)
+    max_k = len(ensemble_candidates)
+    n_trials = 200
+
+    cat_accs_by_k = []
+    spec_accs_by_k = []
+
+    for k in range(1, max_k + 1):
+        cat_accs = []
+        spec_accs = []
+        subsets = []
+        if k >= max_k:
+            subsets = [ensemble_candidates]
+        else:
+            for _ in range(n_trials):
+                subsets.append(list(rng.choice(ensemble_candidates, k, replace=False)))
+
+        for subset in subsets:
+            agree_cat, agree_spec, total = 0, 0, 0
+            for pid, c in consensus.items():
+                sig = c["signals"]
+                if "Opus 4.6" not in sig:
+                    continue
+                sub_cats = [sig[m]["cat"] for m in subset if m in sig]
+                sub_specs = [sig[m]["spec"] for m in subset if m in sig]
+                if len(sub_cats) < k:
+                    continue
+                total += 1
+                ens_cat = majority_vote(sub_cats)
+                ens_spec = majority_vote([str(s) for s in sub_specs])
+                if ens_cat == sig["Opus 4.6"]["cat"]:
+                    agree_cat += 1
+                if ens_spec is not None and int(ens_spec) == sig["Opus 4.6"]["spec"]:
+                    agree_spec += 1
+            if total > 0:
+                cat_accs.append(agree_cat / total * 100)
+                spec_accs.append(agree_spec / total * 100)
+
+        cat_accs_by_k.append(cat_accs)
+        spec_accs_by_k.append(spec_accs)
+
+    # Box plot
+    ks = range(1, max_k + 1)
+    ax1.boxplot(cat_accs_by_k, positions=list(ks), widths=0.6, patch_artist=True,
+                boxprops=dict(facecolor="#3498db", alpha=0.5),
+                medianprops=dict(color="red", linewidth=2))
+    ax1.set_xlabel("Ensemble size (# GenAI models)")
+    ax1.set_ylabel("Category agreement with Opus (%)")
+    ax1.set_title("Ensemble Size vs Category Accuracy", fontweight="bold")
+    ax1.set_xticks(list(ks))
+    ax1.set_xticklabels(list(ks))
+
+    
ax2.boxplot(spec_accs_by_k, positions=list(ks), widths=0.6, patch_artist=True, + boxprops=dict(facecolor="#e74c3c", alpha=0.5), + medianprops=dict(color="red", linewidth=2)) + ax2.set_xlabel("Ensemble size (# GenAI models)") + ax2.set_ylabel("Specificity agreement with Opus (%)") + ax2.set_title("Ensemble Size vs Specificity Accuracy", fontweight="bold") + ax2.set_xticks(list(ks)) + ax2.set_xticklabels(list(ks)) + + fig.tight_layout() + fig.savefig(CHART_DIR / "12_ensemble_accuracy.png", dpi=150) + plt.close(fig) + print(" 12_ensemble_accuracy.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 13: Agreement by word count (human + genai) # ═══════════════════════════════════════════════════════════ def plot_agreement_by_wordcount(): fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5)) - - # Bin paragraphs by word count wc_bins = [(0, 50), (51, 80), (81, 120), (121, 180), (181, 500)] bin_labels = ["≤50", "51-80", "81-120", "121-180", "180+"] for ax, dim, title in [(ax1, "cat", "Category"), (ax2, "both", "Both")]: - rates = [] - ns = [] + h_rates, g_rates, ns = [], [], [] for lo, hi in wc_bins: - agree = 0 - total = 0 + h_agree, g_agree, total = 0, 0, 0 for pid, c in consensus.items(): wc = c["word_count"] if lo <= wc <= hi: total += 1 if dim == "cat": if c["human_cat_unanimous"]: - agree += 1 + h_agree += 1 + if len(set(c["genai_cats"])) == 1: + g_agree += 1 else: if c["human_cat_unanimous"] and c["human_spec_unanimous"]: - agree += 1 - rates.append(agree / total * 100 if total > 0 else 0) + h_agree += 1 + if len(set(c["genai_cats"])) == 1 and len(set(c["genai_specs"])) == 1: + g_agree += 1 + h_rates.append(h_agree / total * 100 if total > 0 else 0) + g_rates.append(g_agree / total * 100 if total > 0 else 0) ns.append(total) - bars = ax.bar(range(len(bin_labels)), rates, color="#3498db") - ax.set_xticks(range(len(bin_labels))) + x = np.arange(len(bin_labels)) + width = 0.35 + ax.bar(x - width / 2, h_rates, width, label="Human 
unanimous", color="#3498db") + ax.bar(x + width / 2, g_rates, width, label="GenAI unanimous", color="#e74c3c") + ax.set_xticks(x) ax.set_xticklabels(bin_labels) ax.set_xlabel("Word Count") - ax.set_ylabel("Unanimous Agreement %") - ax.set_title(f"{title} Consensus by Paragraph Length", fontweight="bold") - ax.set_ylim(0, 80) - - for bar, n in zip(bars, ns): - ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1, - f"n={n}", ha="center", va="bottom", fontsize=8) + ax.set_ylabel("Unanimous %") + ax.set_title(f"{title} Unanimity by Paragraph Length", fontweight="bold") + ax.legend() + for i, n in enumerate(ns): + ax.text(i, max(h_rates[i], g_rates[i]) + 1, f"n={n}", ha="center", fontsize=8) fig.tight_layout() - fig.savefig(CHART_DIR / "08_agreement_by_wordcount.png", dpi=150) + fig.savefig(CHART_DIR / "13_agreement_by_wordcount.png", dpi=150) plt.close(fig) - print(" 08_agreement_by_wordcount.png") + print(" 13_agreement_by_wordcount.png") # ═══════════════════════════════════════════════════════════ -# CHART 9: Active time vs agreement +# CHART 14: Time vs agreement # ═══════════════════════════════════════════════════════════ def plot_time_vs_agreement(): fig, ax = plt.subplots(figsize=(10, 5)) - - # For each paragraph, compute median active time and whether humans agreed - agreed_times = [] - disagreed_times = [] - + agreed_times, disagreed_times = [], [] for pid, lbls in human_by_pid.items(): - times = [l["activeMs"] for l in lbls if l["activeMs"] is not None] + times = [l.get("activeMs") or l.get("durationMs") for l in lbls] + times = [t for t in times if t is not None] if not times: continue - med_time = sorted(times)[len(times) // 2] / 1000 # seconds - + med_time = sorted(times)[len(times) // 2] / 1000 cats = [l["contentCategory"] for l in lbls] if len(set(cats)) == 1: agreed_times.append(med_time) @@ -619,19 +1039,191 @@ def plot_time_vs_agreement(): ax.legend() ax.set_xlim(0, 120) fig.tight_layout() - fig.savefig(CHART_DIR / 
"09_time_vs_agreement.png", dpi=150)
+    fig.savefig(CHART_DIR / "14_time_vs_agreement.png", dpi=150)
     plt.close(fig)
-    print("  09_time_vs_agreement.png")
+    print("  14_time_vs_agreement.png")
 # ═══════════════════════════════════════════════════════════
-# CHART 10: None/Other deep dive — what do people label instead?
+# CHART 15: Outlier annotator deep-dive
+# ═══════════════════════════════════════════════════════════
+def plot_outlier_annotator():
+    cat_kappas = metrics["pairwiseKappa"]["category"]["pairs"]
+    ann_kappa_sum = defaultdict(lambda: {"sum": 0, "n": 0})
+    for pair in cat_kappas:
+        for a in ("a1", "a2"):
+            ann_kappa_sum[pair[a]]["sum"] += pair["kappa"]
+            ann_kappa_sum[pair[a]]["n"] += 1
+    outlier = min(ann_kappa_sum, key=lambda a: ann_kappa_sum[a]["sum"] / ann_kappa_sum[a]["n"])
+
+    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
+
+    outlier_diverge_from = Counter()
+    outlier_diverge_to = Counter()
+    for pid, lbls in human_by_pid.items():
+        outlier_lbl = None
+        others = []
+        for l in lbls:
+            if l["annotatorName"] == outlier:
+                outlier_lbl = l
+            else:
+                others.append(l)
+        if outlier_lbl and len(others) >= 2:
+            other_cats = [o["contentCategory"] for o in others]
+            if other_cats[0] == other_cats[1] and other_cats[0] != outlier_lbl["contentCategory"]:
+                outlier_diverge_from[other_cats[0]] += 1
+                outlier_diverge_to[outlier_lbl["contentCategory"]] += 1
+
+    cats1 = sorted(outlier_diverge_from.keys(), key=lambda c: -outlier_diverge_from[c])
+    ax1.barh(range(len(cats1)), [outlier_diverge_from[c] for c in cats1], color="#e74c3c")
+    ax1.set_yticks(range(len(cats1)))
+    ax1.set_yticklabels([CAT_MAP.get(c, c) for c in cats1])
+    ax1.set_xlabel("Count")
+    ax1.set_title(f"{outlier}: what others agreed on", fontweight="bold")
+    ax1.invert_yaxis()
+
+    cats2 = sorted(outlier_diverge_to.keys(), key=lambda c: -outlier_diverge_to[c])
+    ax2.barh(range(len(cats2)), 
[outlier_diverge_to[c] for c in cats2], color="#f39c12") + ax2.set_yticks(range(len(cats2))) + ax2.set_yticklabels([CAT_MAP.get(c, c) for c in cats2]) + ax2.set_xlabel("Count") + ax2.set_title(f"What {outlier} chose instead", fontweight="bold") + ax2.invert_yaxis() + + avg_k = ann_kappa_sum[outlier]["sum"] / ann_kappa_sum[outlier]["n"] + fig.suptitle(f"Outlier Analysis: {outlier} (avg κ = {avg_k:.3f})", fontweight="bold") + fig.tight_layout() + fig.savefig(CHART_DIR / "15_outlier_annotator.png", dpi=150) + plt.close(fig) + print(" 15_outlier_annotator.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 16: With/without outlier consensus +# ═══════════════════════════════════════════════════════════ +def plot_with_without_outlier(): + cat_kappas = metrics["pairwiseKappa"]["category"]["pairs"] + ann_kappa_sum = defaultdict(lambda: {"sum": 0, "n": 0}) + for pair in cat_kappas: + for a in ("a1", "a2"): + ann_kappa_sum[pair[a]]["sum"] += pair["kappa"] + ann_kappa_sum[pair[a]]["n"] += 1 + outlier = min(ann_kappa_sum, key=lambda a: ann_kappa_sum[a]["sum"] / ann_kappa_sum[a]["n"]) + + fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5)) + + n = 0 + cat_w, cat_wo, spec_w, spec_wo, both_w, both_wo = 0, 0, 0, 0, 0, 0 + for pid, lbls in human_by_pid.items(): + names = [l["annotatorName"] for l in lbls] + if outlier not in names or len(lbls) < 3: + continue + n += 1 + cats_all = [l["contentCategory"] for l in lbls] + specs_all = [l["specificityLevel"] for l in lbls] + cats_excl = [l["contentCategory"] for l in lbls if l["annotatorName"] != outlier] + specs_excl = [l["specificityLevel"] for l in lbls if l["annotatorName"] != outlier] + + cat_u = len(set(cats_all)) == 1 + cat_e = len(set(cats_excl)) == 1 + spec_u = len(set(specs_all)) == 1 + spec_e = len(set(specs_excl)) == 1 + + if cat_u: cat_w += 1 + if cat_e: cat_wo += 1 + if spec_u: spec_w += 1 + if spec_e: spec_wo += 1 + if cat_u and spec_u: both_w += 1 + if cat_e and spec_e: both_wo += 1 + + 
labels_m = ["Category\nUnanimous", "Specificity\nUnanimous", "Both\nUnanimous"] + with_v = [cat_w / n * 100, spec_w / n * 100, both_w / n * 100] + without_v = [cat_wo / n * 100, spec_wo / n * 100, both_wo / n * 100] + + x = np.arange(3) + width = 0.35 + ax1.bar(x - width / 2, with_v, width, label="All 3", color="#e74c3c") + ax1.bar(x + width / 2, without_v, width, label=f"Excl. {outlier}", color="#2ecc71") + ax1.set_xticks(x) + ax1.set_xticklabels(labels_m) + ax1.set_ylabel("% of paragraphs") + ax1.set_title(f"Agreement on {outlier}'s paragraphs (n={n})", fontweight="bold") + ax1.legend() + for i, (w, wo) in enumerate(zip(with_v, without_v)): + ax1.text(i, max(w, wo) + 2, f"Δ={wo - w:+.1f}pp", ha="center", fontsize=9, fontweight="bold") + + kappas_with = [p["kappa"] for p in cat_kappas] + kappas_without = [p["kappa"] for p in cat_kappas if outlier not in (p["a1"], p["a2"])] + bp = ax2.boxplot([kappas_with, kappas_without], positions=[1, 2], widths=0.5, patch_artist=True) + bp["boxes"][0].set_facecolor("#e74c3c") + bp["boxes"][0].set_alpha(0.5) + bp["boxes"][1].set_facecolor("#2ecc71") + bp["boxes"][1].set_alpha(0.5) + ax2.set_xticks([1, 2]) + ax2.set_xticklabels(["All pairs", f"Excl. 
{outlier}"]) + ax2.set_ylabel("Cohen's κ (category)") + ax2.set_title("Kappa Distribution", fontweight="bold") + rng = np.random.RandomState(42) + for pos, kappas in zip([1, 2], [kappas_with, kappas_without]): + jitter = rng.normal(0, 0.04, len(kappas)) + ax2.scatter([pos + j for j in jitter], kappas, alpha=0.6, s=30, color="black", zorder=3) + + fig.tight_layout() + fig.savefig(CHART_DIR / "16_with_without_outlier.png", dpi=150) + plt.close(fig) + print(" 16_with_without_outlier.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 17: Disagreement axes — Human vs Stage1 vs All GenAI +# ═══════════════════════════════════════════════════════════ +def plot_disagreement_axes(): + fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + + def compute_axes(cat_lists: list[list[str]]) -> Counter: + result = Counter() + for cats in cat_lists: + if len(set(cats)) >= 2: + for i, c1 in enumerate(cats): + for c2 in cats[i + 1:]: + if c1 != c2: + result[tuple(sorted([c1, c2]))] += 1 + return result + + human_axes = compute_axes([c["human_cats"] for c in consensus.values()]) + s1_axes = compute_axes([c["s1_cats"] for c in consensus.values()]) + genai_axes = compute_axes([c["genai_cats"] for c in consensus.values()]) + + for ax, data, title, color in [ + (axes[0], human_axes, "Human", "#e74c3c"), + (axes[1], s1_axes, "Stage 1", "#3498db"), + (axes[2], genai_axes, "All GenAI (10)", "#2ecc71"), + ]: + top = data.most_common(10) + labels = [f"{CAT_MAP[a]}↔{CAT_MAP[b]}" for (a, b), _ in top] + counts = [c for _, c in top] + ax.barh(range(len(labels)), counts, color=color) + ax.set_yticks(range(len(labels))) + ax.set_yticklabels(labels, fontsize=9) + ax.set_xlabel("Disagreement count") + ax.set_title(f"{title} Confusion Axes", fontweight="bold") + ax.invert_yaxis() + + fig.tight_layout() + fig.savefig(CHART_DIR / "17_disagreement_axes.png", dpi=150) + plt.close(fig) + print(" 17_disagreement_axes.png") + + +# 
═══════════════════════════════════════════════════════════ +# CHART 18: None/Other analysis # ═══════════════════════════════════════════════════════════ def plot_none_other_analysis(): fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5)) - # For paragraphs where at least one annotator said None/Other - # What did the others say? noneother_vs = Counter() noneother_pids = set() for pid, lbls in human_by_pid.items(): @@ -642,583 +1234,965 @@ def plot_none_other_analysis(): if c != "None/Other": noneother_vs[c] += 1 - # Also: paragraphs where NO human said None/Other but Stage1 or Opus did - s1_noneother_human_not = Counter() - for pid, c in consensus.items(): - human_cats = set(c["human_cats"]) - if "None/Other" not in human_cats: - if c["s1_cat_maj"] == "None/Other": - for hc in c["human_cats"]: - s1_noneother_human_not[hc] += 1 - cats_sorted = sorted(noneother_vs.keys(), key=lambda c: -noneother_vs[c]) ax1.barh(range(len(cats_sorted)), [noneother_vs[c] for c in cats_sorted], color="#e74c3c") ax1.set_yticks(range(len(cats_sorted))) ax1.set_yticklabels([CAT_MAP.get(c, c) for c in cats_sorted]) ax1.set_xlabel("Count") - ax1.set_title(f"When someone says N/O but others disagree\n({len(noneother_pids)} paragraphs)", + ax1.set_title(f"When someone says N/O, others say...\n({len(noneother_pids)} paragraphs)", fontweight="bold") ax1.invert_yaxis() - # What does Stage1 say when humans disagree on category? - s1_for_disagreed = Counter() + # What do GenAI models say for human-disagreed paragraphs? 
+ genai_for_disagreed = Counter() for pid, c in consensus.items(): - if not c["human_cat_unanimous"] and c["s1_cat_maj"]: - s1_for_disagreed[c["s1_cat_maj"]] += 1 + if not c["human_cat_unanimous"] and c["genai_cat_maj"]: + genai_for_disagreed[c["genai_cat_maj"]] += 1 - cats_sorted2 = sorted(s1_for_disagreed.keys(), key=lambda c: -s1_for_disagreed[c]) - ax2.barh(range(len(cats_sorted2)), [s1_for_disagreed[c] for c in cats_sorted2], color="#3498db") + cats_sorted2 = sorted(genai_for_disagreed.keys(), key=lambda c: -genai_for_disagreed[c]) + ax2.barh(range(len(cats_sorted2)), [genai_for_disagreed[c] for c in cats_sorted2], color="#3498db") ax2.set_yticks(range(len(cats_sorted2))) ax2.set_yticklabels([CAT_MAP.get(c, c) for c in cats_sorted2]) ax2.set_xlabel("Count") - ax2.set_title(f"Stage1 majority for human-disagreed paragraphs\n(n={sum(s1_for_disagreed.values())})", + ax2.set_title(f"GenAI majority for human-disagreed\n(n={sum(genai_for_disagreed.values())})", fontweight="bold") ax2.invert_yaxis() fig.tight_layout() - fig.savefig(CHART_DIR / "10_none_other_analysis.png", dpi=150) + fig.savefig(CHART_DIR / "18_none_other_analysis.png", dpi=150) plt.close(fig) - print(" 10_none_other_analysis.png") + print(" 18_none_other_analysis.png") # ═══════════════════════════════════════════════════════════ -# CHART 11: Aaryan vs everyone else — where does he diverge? 
+# CHART 19: Specificity bias per model vs Opus # ═══════════════════════════════════════════════════════════ -def plot_outlier_annotator(): - # Find the annotator with lowest avg kappa - cat_kappas = metrics["pairwiseKappa"]["category"]["pairs"] - ann_kappa_sum = defaultdict(lambda: {"sum": 0, "n": 0}) - for pair in cat_kappas: - ann_kappa_sum[pair["a1"]]["sum"] += pair["kappa"] - ann_kappa_sum[pair["a1"]]["n"] += 1 - ann_kappa_sum[pair["a2"]]["sum"] += pair["kappa"] - ann_kappa_sum[pair["a2"]]["n"] += 1 +def plot_specificity_bias_all(): + fig, ax = plt.subplots(figsize=(16, 6)) - outlier = min(ann_kappa_sum, key=lambda a: ann_kappa_sum[a]["sum"] / ann_kappa_sum[a]["n"]) - - fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5)) - - # What does the outlier label differently? - # Compare outlier's category choices vs the majority of the other 2 annotators - outlier_id = name_to_id.get(outlier, outlier) - outlier_diverge_from = Counter() # (outlier_cat, others_cat) pairs - outlier_diverge_to = Counter() - - for pid, lbls in human_by_pid.items(): - outlier_lbl = None - others = [] - for l in lbls: - if l["annotatorName"] == outlier: - outlier_lbl = l - else: - others.append(l) - - if outlier_lbl and len(others) >= 2: - other_cats = [o["contentCategory"] for o in others] - if other_cats[0] == other_cats[1] and other_cats[0] != outlier_lbl["contentCategory"]: - outlier_diverge_from[other_cats[0]] += 1 - outlier_diverge_to[outlier_lbl["contentCategory"]] += 1 - - # Diverge FROM (what category the others agreed on) - cats1 = sorted(outlier_diverge_from.keys(), key=lambda c: -outlier_diverge_from[c]) - ax1.barh(range(len(cats1)), [outlier_diverge_from[c] for c in cats1], color="#e74c3c") - ax1.set_yticks(range(len(cats1))) - ax1.set_yticklabels([CAT_MAP.get(c, c) for c in cats1]) - ax1.set_xlabel("Count") - ax1.set_title(f"{outlier} disagrees: what others chose\n(others agreed, {outlier} didn't)", - fontweight="bold") - ax1.invert_yaxis() - - # Diverge TO (what did the 
outlier pick instead) - cats2 = sorted(outlier_diverge_to.keys(), key=lambda c: -outlier_diverge_to[c]) - ax2.barh(range(len(cats2)), [outlier_diverge_to[c] for c in cats2], color="#f39c12") - ax2.set_yticks(range(len(cats2))) - ax2.set_yticklabels([CAT_MAP.get(c, c) for c in cats2]) - ax2.set_xlabel("Count") - ax2.set_title(f"What {outlier} chose instead", fontweight="bold") - ax2.invert_yaxis() - - fig.suptitle(f"Outlier Analysis: {outlier} (lowest avg κ = " - f"{ann_kappa_sum[outlier]['sum']/ann_kappa_sum[outlier]['n']:.3f})", - fontweight="bold", fontsize=12) - fig.tight_layout() - fig.savefig(CHART_DIR / "11_outlier_annotator.png", dpi=150) - plt.close(fig) - print(" 11_outlier_annotator.png") - - -# ═══════════════════════════════════════════════════════════ -# CHART 12: Human vs GenAI consensus comparison -# ═══════════════════════════════════════════════════════════ -def plot_human_vs_genai_consensus(): - fig, axes = plt.subplots(1, 3, figsize=(16, 5)) - - # For each paragraph: human unanimity vs stage1 unanimity - # Quadrants: both agree, human only, stage1 only, neither - human_unan_cat = sum(1 for c in consensus.values() if c["human_cat_unanimous"]) - s1_unan_cat = sum(1 for c in consensus.values() if c["s1_cat_unanimous"]) - both_unan_cat = sum(1 for c in consensus.values() - if c["human_cat_unanimous"] and c["s1_cat_unanimous"]) - - human_unan_spec = sum(1 for c in consensus.values() if c["human_spec_unanimous"]) - s1_unan_spec = sum(1 for c in consensus.values() - if len(set(c["s1_specs"])) == 1 if c["s1_specs"]) - - # Chart 1: Category agreement Venn-style comparison - ax = axes[0] - labels_data = ["Human\nunanimous", "Stage1\nunanimous", "Both\nunanimous"] - vals = [human_unan_cat, s1_unan_cat, both_unan_cat] - pcts = [v / 1200 * 100 for v in vals] - bars = ax.bar(range(3), pcts, color=["#3498db", "#e74c3c", "#2ecc71"]) - ax.set_xticks(range(3)) - ax.set_xticklabels(labels_data) - ax.set_ylabel("%") - ax.set_title("Category Unanimity Rates", 
fontweight="bold") - for bar, v, p in zip(bars, vals, pcts): - ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1, - f"{p:.1f}%\n({v})", ha="center", fontsize=9) - - # Chart 2: Human vs Stage1 category agreement breakdown - ax = axes[1] - both_agree = 0 # human unanimous AND matches s1 - human_unan_s1_diff = 0 # human unanimous but s1 differs - s1_unan_human_diff = 0 # s1 unanimous but human majority differs - both_majority_agree = 0 # neither unanimous but majorities match - majorities_differ = 0 - - for pid, c in consensus.items(): - hm = c["human_cat_maj"] - sm = c["s1_cat_maj"] - hu = c["human_cat_unanimous"] - su = c["s1_cat_unanimous"] - if not hm or not sm: - continue - if hm == sm: - both_majority_agree += 1 - else: - majorities_differ += 1 - - total = both_majority_agree + majorities_differ - vals = [both_majority_agree, majorities_differ] - pcts = [v / total * 100 for v in vals] - labels_d = ["Majorities\nagree", "Majorities\ndiffer"] - colors_d = ["#2ecc71", "#e74c3c"] - bars = ax.bar(range(2), pcts, color=colors_d) - ax.set_xticks(range(2)) - ax.set_xticklabels(labels_d) - ax.set_ylabel("%") - ax.set_title(f"Human vs Stage1 Category Agreement\n(n={total})", fontweight="bold") - for bar, v, p in zip(bars, vals, pcts): - ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5, - f"{v}\n({p:.1f}%)", ha="center", fontsize=9) - - # Chart 3: Same for specificity - ax = axes[2] - spec_agree = 0 - spec_differ = 0 - for pid, c in consensus.items(): - hm = c["human_spec_maj"] - sm = c["s1_spec_maj"] - if hm is None or sm is None: - continue - if hm == sm: - spec_agree += 1 - else: - spec_differ += 1 - - total = spec_agree + spec_differ - vals = [spec_agree, spec_differ] - pcts = [v / total * 100 for v in vals] - bars = ax.bar(range(2), pcts, color=colors_d) - ax.set_xticks(range(2)) - ax.set_xticklabels(labels_d) - ax.set_ylabel("%") - ax.set_title(f"Human vs Stage1 Specificity Agreement\n(n={total})", fontweight="bold") - for bar, v, p in 
zip(bars, vals, pcts): - ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5, - f"{v}\n({p:.1f}%)", ha="center", fontsize=9) - - fig.tight_layout() - fig.savefig(CHART_DIR / "12_human_vs_genai_consensus.png", dpi=150) - plt.close(fig) - print(" 12_human_vs_genai_consensus.png") - - -# ═══════════════════════════════════════════════════════════ -# CHART 13: Per-annotator specificity bias -# ═══════════════════════════════════════════════════════════ -def plot_specificity_bias(): - fig, ax = plt.subplots(figsize=(10, 5)) - - # For each annotator, compare their spec vs Opus spec - ann_labels_by_name: dict[str, dict[str, dict]] = defaultdict(dict) - for l in human_labels: - ann_labels_by_name[l["annotatorName"]][l["paragraphId"]] = l - - names = annotator_names - biases = [] # mean(annotator_spec - stage1_majority_spec) - for name in names: + sources = annotator_names + sorted(ALL_GENAI) + biases = [] + for src in sources: diffs = [] - for pid, lbl in ann_labels_by_name[name].items(): - c = consensus.get(pid) - if c and c["s1_spec_maj"] is not None: - diffs.append(lbl["specificityLevel"] - c["s1_spec_maj"]) + for pid, c in consensus.items(): + if "Opus 4.6" not in c["signals"]: + continue + opus_spec = c["signals"]["Opus 4.6"]["spec"] + if src in annotator_names: + # Human + for l in human_by_pid.get(pid, []): + if l["annotatorName"] == src: + diffs.append(l["specificityLevel"] - opus_spec) + elif src in c["signals"] and src != "Opus 4.6": + diffs.append(c["signals"][src]["spec"] - opus_spec) biases.append(np.mean(diffs) if diffs else 0) - colors = ["#e74c3c" if b < -0.1 else "#2ecc71" if b > 0.1 else "#95a5a6" for b in biases] - bars = ax.bar(range(len(names)), biases, color=colors) - ax.set_xticks(range(len(names))) - ax.set_xticklabels(names, rotation=45, ha="right") - ax.set_ylabel("Mean (Human - Stage1 Maj) Specificity") - ax.set_title("Specificity Bias vs Stage1 (negative = under-rates, positive = over-rates)", - fontweight="bold") - ax.axhline(0, 
color="black", linewidth=0.5) + colors = [] + for i, (src, b) in enumerate(zip(sources, biases)): + if src in annotator_names: + colors.append("#9b59b6" if abs(b) > 0.5 else "#8e44ad") + else: + mid = {v: k for k, v in MODEL_SHORT.items()}.get(src, "") + colors.append(TIER_COLORS.get(MODEL_TIER.get(mid, "mid"), "#999")) + + bars = ax.bar(range(len(sources)), biases, color=colors, edgecolor="black", linewidth=0.3) + ax.set_xticks(range(len(sources))) + ax.set_xticklabels(sources, rotation=60, ha="right", fontsize=7) + ax.set_ylabel("Mean (Source − Opus) Specificity") + ax.set_title("Specificity Bias vs Opus 4.6 (positive = over-rates specificity)", fontweight="bold") + ax.axhline(0, color="black", linewidth=1) + + # Add a vertical line separating humans from models + ax.axvline(len(annotator_names) - 0.5, color="gray", linewidth=1, linestyle="--", alpha=0.5) + ax.text(len(annotator_names) / 2, ax.get_ylim()[1] * 0.9, "Humans", ha="center", fontsize=9, style="italic") + ax.text(len(annotator_names) + len(ALL_GENAI) / 2, ax.get_ylim()[1] * 0.9, "GenAI", ha="center", fontsize=9, style="italic") for bar, b in zip(bars, biases): - ax.text(bar.get_x() + bar.get_width() / 2, - bar.get_height() + (0.02 if b >= 0 else -0.05), - f"{b:+.2f}", ha="center", fontsize=9) + if abs(b) > 0.05: + ax.text(bar.get_x() + bar.get_width() / 2, + bar.get_height() + (0.02 if b >= 0 else -0.06), + f"{b:+.2f}", ha="center", fontsize=7) fig.tight_layout() - fig.savefig(CHART_DIR / "13_specificity_bias.png", dpi=150) + fig.savefig(CHART_DIR / "19_specificity_bias_all.png", dpi=150) plt.close(fig) - print(" 13_specificity_bias.png") + print(" 19_specificity_bias_all.png") # ═══════════════════════════════════════════════════════════ -# CHART 14: Disagreement axes — human vs GenAI top confusions -# ═══════════════════════════════════════════════════════════ -def plot_disagreement_axes(): - fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) - - # Human disagreement axes (where 2 annotators 
agree, 1 disagrees) - human_axes = Counter() - for pid, lbls in human_by_pid.items(): - cats = [l["contentCategory"] for l in lbls] - if len(set(cats)) == 2: - c = Counter(cats) - items = c.most_common() - axis = tuple(sorted([items[0][0], items[1][0]])) - human_axes[axis] += 1 - elif len(set(cats)) == 3: - for i, c1 in enumerate(cats): - for c2 in cats[i+1:]: - if c1 != c2: - axis = tuple(sorted([c1, c2])) - human_axes[axis] += 1 - - top_human = human_axes.most_common(10) - labels_h = [f"{CAT_MAP[a]}↔{CAT_MAP[b]}" for (a, b), _ in top_human] - counts_h = [c for _, c in top_human] - - ax1.barh(range(len(labels_h)), counts_h, color="#e74c3c") - ax1.set_yticks(range(len(labels_h))) - ax1.set_yticklabels(labels_h, fontsize=9) - ax1.set_xlabel("Disagreement count") - ax1.set_title("Human Top Disagreement Axes", fontweight="bold") - ax1.invert_yaxis() - - # Stage 1 disagreement axes on same paragraphs - s1_axes = Counter() - for pid, c in consensus.items(): - s1_cats = c["s1_cats"] - if len(set(s1_cats)) == 2: - cnt = Counter(s1_cats) - items = cnt.most_common() - axis = tuple(sorted([items[0][0], items[1][0]])) - s1_axes[axis] += 1 - elif len(set(s1_cats)) == 3: - for i, c1 in enumerate(s1_cats): - for c2 in s1_cats[i+1:]: - if c1 != c2: - axis = tuple(sorted([c1, c2])) - s1_axes[axis] += 1 - - top_s1 = s1_axes.most_common(10) - labels_s = [f"{CAT_MAP[a]}↔{CAT_MAP[b]}" for (a, b), _ in top_s1] - counts_s = [c for _, c in top_s1] - - ax2.barh(range(len(labels_s)), counts_s, color="#3498db") - ax2.set_yticks(range(len(labels_s))) - ax2.set_yticklabels(labels_s, fontsize=9) - ax2.set_xlabel("Disagreement count") - ax2.set_title("Stage 1 Top Disagreement Axes (same paragraphs)", fontweight="bold") - ax2.invert_yaxis() - - fig.tight_layout() - fig.savefig(CHART_DIR / "14_disagreement_axes.png", dpi=150) - plt.close(fig) - print(" 14_disagreement_axes.png") - - -# ═══════════════════════════════════════════════════════════ -# CHART 15: Quiz performance vs labeling quality +# 
CHART 20: Quiz vs quality # ═══════════════════════════════════════════════════════════ def plot_quiz_vs_quality(): fig, ax = plt.subplots(figsize=(10, 5)) - - # Load quiz data quiz_sessions = load_jsonl(GOLD_DIR / "quiz-sessions.jsonl") - # Best quiz score per annotator - best_quiz: dict[str, int] = {} attempts: dict[str, int] = defaultdict(int) for q in quiz_sessions: - name = q["annotatorName"] - attempts[name] += 1 - if q["passed"]: - if name not in best_quiz or q["score"] > best_quiz[name]: - best_quiz[name] = q["score"] + attempts[q["annotatorName"]] += 1 - # Agreement rate with Stage1 majority per annotator ann_labels_by_name: dict[str, dict[str, dict]] = defaultdict(dict) for l in human_labels: ann_labels_by_name[l["annotatorName"]][l["paragraphId"]] = l - s1_agree = {} + opus_agree = {} for name in annotator_names: - agree = 0 - total = 0 + agree, total = 0, 0 for pid, lbl in ann_labels_by_name[name].items(): c = consensus.get(pid) - if c and c["s1_cat_maj"]: + if c and c["opus_cat"]: total += 1 - if lbl["contentCategory"] == c["s1_cat_maj"]: + if lbl["contentCategory"] == c["opus_cat"]: agree += 1 - s1_agree[name] = agree / total * 100 if total > 0 else 0 + opus_agree[name] = agree / total * 100 if total > 0 else 0 x = np.arange(len(annotator_names)) width = 0.35 - ax.bar(x - width/2, [attempts.get(n, 0) for n in annotator_names], + ax.bar(x - width / 2, [attempts.get(n, 0) for n in annotator_names], width, label="Quiz attempts", color="#f39c12") ax2 = ax.twinx() - ax2.bar(x + width/2, [s1_agree.get(n, 0) for n in annotator_names], - width, label="Category agree w/ Stage1 (%)", color="#3498db", alpha=0.7) + ax2.bar(x + width / 2, [opus_agree.get(n, 0) for n in annotator_names], + width, label="Cat agree w/ Opus (%)", color="#3498db", alpha=0.7) ax.set_xticks(x) ax.set_xticklabels(annotator_names, rotation=45, ha="right") ax.set_ylabel("Quiz attempts", color="#f39c12") ax2.set_ylabel("Opus agreement %", color="#3498db") - ax.set_title("Quiz Attempts vs 
Labeling Quality (Stage1 Agreement)", fontweight="bold") + ax.set_title("Quiz Attempts vs Labeling Quality (Opus Agreement)", fontweight="bold") lines1, labels1 = ax.get_legend_handles_labels() lines2, labels2 = ax2.get_legend_handles_labels() ax.legend(lines1 + lines2, labels1 + labels2, loc="upper left") fig.tight_layout() - fig.savefig(CHART_DIR / "15_quiz_vs_quality.png", dpi=150) + fig.savefig(CHART_DIR / "20_quiz_vs_quality.png", dpi=150) plt.close(fig) - print(" 15_quiz_vs_quality.png") + print(" 20_quiz_vs_quality.png") # ═══════════════════════════════════════════════════════════ -# CHART 16: Aaryan-excluded metrics comparison +# CHART 21: Human vs GenAI consensus rates # ═══════════════════════════════════════════════════════════ -def plot_with_without_outlier(): - fig, axes = plt.subplots(1, 2, figsize=(12, 5)) +def plot_human_vs_genai_consensus(): + fig, axes = plt.subplots(1, 3, figsize=(16, 5)) - # Find outlier (lowest avg kappa) - cat_kappas = metrics["pairwiseKappa"]["category"]["pairs"] - ann_kappa_sum = defaultdict(lambda: {"sum": 0, "n": 0}) - for pair in cat_kappas: - ann_kappa_sum[pair["a1"]]["sum"] += pair["kappa"] - ann_kappa_sum[pair["a1"]]["n"] += 1 - ann_kappa_sum[pair["a2"]]["sum"] += pair["kappa"] - ann_kappa_sum[pair["a2"]]["n"] += 1 - outlier = min(ann_kappa_sum, key=lambda a: ann_kappa_sum[a]["sum"] / ann_kappa_sum[a]["n"]) - - # Compute consensus with and without outlier - # For paragraphs where outlier participated - outlier_participated = 0 - cat_agree_with = 0 - cat_agree_without = 0 - spec_agree_with = 0 - spec_agree_without = 0 - both_agree_with = 0 - both_agree_without = 0 - - for pid, lbls in human_by_pid.items(): - if len(lbls) < 3: - continue - names = [l["annotatorName"] for l in lbls] - if outlier not in names: - continue - outlier_participated += 1 - - cats_all = [l["contentCategory"] for l in lbls] - specs_all = [l["specificityLevel"] for l in lbls] - cats_excl = [l["contentCategory"] for l in lbls if l["annotatorName"] 
!= outlier] - specs_excl = [l["specificityLevel"] for l in lbls if l["annotatorName"] != outlier] - - cat_u_all = len(set(cats_all)) == 1 - cat_u_excl = len(set(cats_excl)) == 1 - spec_u_all = len(set(specs_all)) == 1 - spec_u_excl = len(set(specs_excl)) == 1 - - if cat_u_all: cat_agree_with += 1 - if cat_u_excl: cat_agree_without += 1 - if spec_u_all: spec_agree_with += 1 - if spec_u_excl: spec_agree_without += 1 - if cat_u_all and spec_u_all: both_agree_with += 1 - if cat_u_excl and spec_u_excl: both_agree_without += 1 - - n = outlier_participated - metrics_labels = ["Category\nUnanimous", "Specificity\nUnanimous", "Both\nUnanimous"] - with_vals = [cat_agree_with / n * 100, spec_agree_with / n * 100, both_agree_with / n * 100] - without_vals = [cat_agree_without / n * 100, spec_agree_without / n * 100, both_agree_without / n * 100] + # Category unanimity + h_unan = sum(1 for c in consensus.values() if c["human_cat_unanimous"]) + g_unan = sum(1 for c in consensus.values() if len(set(c["genai_cats"])) == 1) + b_unan = sum(1 for c in consensus.values() if c["human_cat_unanimous"] and len(set(c["genai_cats"])) == 1) ax = axes[0] - x = np.arange(3) - width = 0.35 - ax.bar(x - width/2, with_vals, width, label=f"All 3 annotators", color="#e74c3c") - ax.bar(x + width/2, without_vals, width, label=f"Excluding {outlier}", color="#2ecc71") - ax.set_xticks(x) - ax.set_xticklabels(metrics_labels) - ax.set_ylabel("% of paragraphs") - ax.set_title(f"Agreement on {outlier}'s paragraphs (n={n})", fontweight="bold") - ax.legend() + vals = [h_unan, g_unan, b_unan] + pcts = [v / 1200 * 100 for v in vals] + labels = ["Human\n3/3", "GenAI\n10/10", "Both"] + bars = ax.bar(range(3), pcts, color=["#3498db", "#e74c3c", "#2ecc71"]) + ax.set_xticks(range(3)) + ax.set_xticklabels(labels) + ax.set_ylabel("%") + ax.set_title("Category Unanimity", fontweight="bold") + for bar, v, p in zip(bars, vals, pcts): + ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1, + f"{p:.1f}%\n({v})", 
ha="center", fontsize=9) - for i, (w, wo) in enumerate(zip(with_vals, without_vals)): - delta = wo - w - ax.text(i, max(w, wo) + 2, f"Δ={delta:+.1f}pp", ha="center", fontsize=9, fontweight="bold") - - # Chart 2: kappa distributions with/without + # Majority agreement ax = axes[1] - kappas_with = [p["kappa"] for p in cat_kappas] - kappas_without = [p["kappa"] for p in cat_kappas if outlier not in (p["a1"], p["a2"])] + cat_agree = sum(1 for c in consensus.values() + if c["human_cat_maj"] and c["genai_cat_maj"] and c["human_cat_maj"] == c["genai_cat_maj"]) + cat_total = sum(1 for c in consensus.values() if c["human_cat_maj"] and c["genai_cat_maj"]) + cat_diff = cat_total - cat_agree - positions = [1, 2] - bp = ax.boxplot([kappas_with, kappas_without], positions=positions, widths=0.5, - patch_artist=True) - bp["boxes"][0].set_facecolor("#e74c3c") - bp["boxes"][0].set_alpha(0.5) - bp["boxes"][1].set_facecolor("#2ecc71") - bp["boxes"][1].set_alpha(0.5) + bars = ax.bar(range(2), [cat_agree / cat_total * 100, cat_diff / cat_total * 100], + color=["#2ecc71", "#e74c3c"]) + ax.set_xticks(range(2)) + ax.set_xticklabels(["Agree", "Differ"]) + ax.set_ylabel("%") + ax.set_title(f"Human Maj vs GenAI Maj — Category\n(n={cat_total})", fontweight="bold") + for bar, v in zip(bars, [cat_agree, cat_diff]): + ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5, + f"{v}", ha="center", fontsize=10) - ax.set_xticks(positions) - ax.set_xticklabels(["All pairs", f"Excl. 
{outlier}"]) - ax.set_ylabel("Cohen's κ (category)") - ax.set_title("Kappa Distribution", fontweight="bold") - - # Add individual points - for pos, kappas in zip(positions, [kappas_with, kappas_without]): - jitter = np.random.normal(0, 0.04, len(kappas)) - ax.scatter([pos + j for j in jitter], kappas, alpha=0.6, s=30, color="black", zorder=3) + # Specificity majority agreement + ax = axes[2] + spec_agree = sum(1 for c in consensus.values() + if c["human_spec_maj"] is not None and c["genai_spec_maj"] is not None + and c["human_spec_maj"] == c["genai_spec_maj"]) + spec_total = sum(1 for c in consensus.values() + if c["human_spec_maj"] is not None and c["genai_spec_maj"] is not None) + spec_diff = spec_total - spec_agree + bars = ax.bar(range(2), [spec_agree / spec_total * 100, spec_diff / spec_total * 100], + color=["#2ecc71", "#e74c3c"]) + ax.set_xticks(range(2)) + ax.set_xticklabels(["Agree", "Differ"]) + ax.set_ylabel("%") + ax.set_title(f"Human Maj vs GenAI Maj — Specificity\n(n={spec_total})", fontweight="bold") + for bar, v in zip(bars, [spec_agree, spec_diff]): + ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5, + f"{v}", ha="center", fontsize=10) fig.tight_layout() - fig.savefig(CHART_DIR / "16_with_without_outlier.png", dpi=150) + fig.savefig(CHART_DIR / "21_human_vs_genai_consensus.png", dpi=150) plt.close(fig) - print(" 16_with_without_outlier.png") + print(" 21_human_vs_genai_consensus.png") # ═══════════════════════════════════════════════════════════ -# TEXTUAL ANALYSIS OUTPUT +# CHART 22: Signal agreement distribution (how many of 13 agree?) 
# ═══════════════════════════════════════════════════════════ -def print_analysis(): - print("\n" + "=" * 70) - print("CROSS-SOURCE ANALYSIS") - print("=" * 70) +def plot_signal_agreement_dist(): + fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5)) + + cat_top_counts = [] + spec_top_counts = [] + for c in consensus.values(): + cat_top_counts.append(c["all_cat_counts"].most_common(1)[0][1]) + spec_top_counts.append(Counter(c["all_specs"]).most_common(1)[0][1]) + + ax1.hist(cat_top_counts, bins=range(1, 15), color="#3498db", edgecolor="black", alpha=0.7, align="left") + ax1.set_xlabel("# signals agreeing on top category") + ax1.set_ylabel("Paragraphs") + ax1.set_title("Category: Max Agreement Count per Paragraph", fontweight="bold") + ax1.axvline(10, color="red", linewidth=2, linestyle="--", label="Tier 1 threshold (10+)") + ax1.legend() + + ax2.hist(spec_top_counts, bins=range(1, 15), color="#e74c3c", edgecolor="black", alpha=0.7, align="left") + ax2.set_xlabel("# signals agreeing on top specificity") + ax2.set_ylabel("Paragraphs") + ax2.set_title("Specificity: Max Agreement Count per Paragraph", fontweight="bold") + ax2.axvline(10, color="red", linewidth=2, linestyle="--", label="Tier 1 threshold (10+)") + ax2.legend() + + fig.tight_layout() + fig.savefig(CHART_DIR / "22_signal_agreement_dist.png", dpi=150) + plt.close(fig) + print(" 22_signal_agreement_dist.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 23: Per-annotator agreement with all references +# ═══════════════════════════════════════════════════════════ +def plot_annotator_vs_references(): + fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5)) + + ann_labels_by_name: dict[str, dict[str, dict]] = defaultdict(dict) + for l in human_labels: + ann_labels_by_name[l["annotatorName"]][l["paragraphId"]] = l + + refs = [ + ("S1 Maj", "s1_cat_maj", "s1_spec_maj"), + ("Opus", "opus_cat", "opus_spec"), + ("GenAI Maj", "genai_cat_maj", "genai_spec_maj"), + ] + + for ax, dim, 
title in [(ax1, "cat", "Category"), (ax2, "spec", "Specificity")]: + x = np.arange(len(annotator_names)) + width = 0.25 + + for ri, (ref_name, ref_cat, ref_spec) in enumerate(refs): + rates = [] + for ann_name in annotator_names: + agree, total = 0, 0 + for pid, lbl in ann_labels_by_name[ann_name].items(): + c = consensus.get(pid) + if not c: + continue + ref_val = c[ref_cat] if dim == "cat" else c[ref_spec] + ann_val = lbl["contentCategory"] if dim == "cat" else lbl["specificityLevel"] + if ref_val is not None: + total += 1 + if str(ann_val) == str(ref_val): + agree += 1 + rates.append(agree / total * 100 if total > 0 else 0) + + ax.bar(x + (ri - 1) * width, rates, width, label=ref_name) + + ax.set_xticks(x) + ax.set_xticklabels(annotator_names, rotation=45, ha="right") + ax.set_ylabel("Agreement %") + ax.set_title(f"Per-Annotator {title} Agreement", fontweight="bold") + ax.legend() + ax.set_ylim(0, 100) + + fig.tight_layout() + fig.savefig(CHART_DIR / "23_annotator_vs_references.png", dpi=150) + plt.close(fig) + print(" 23_annotator_vs_references.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 24: "Hard paragraph" analysis — what makes Tier 4 different? 
+# ═══════════════════════════════════════════════════════════ +def plot_hard_paragraphs(): + fig, axes = plt.subplots(2, 2, figsize=(14, 10)) + + # Word count distribution by tier + ax = axes[0][0] + tier_wcs = {t: [consensus[pid]["word_count"] for pid in pids] for t, pids in tiers.items()} + data = [tier_wcs[t] for t in range(1, 5)] + bp = ax.boxplot(data, positions=range(1, 5), widths=0.6, patch_artist=True) + colors_t = ["#27ae60", "#3498db", "#f39c12", "#e74c3c"] + for patch, color in zip(bp["boxes"], colors_t): + patch.set_facecolor(color) + patch.set_alpha(0.5) + ax.set_xticklabels([f"Tier {t}" for t in range(1, 5)]) + ax.set_ylabel("Word count") + ax.set_title("Paragraph Length by Tier", fontweight="bold") + + # Category distribution by tier + ax = axes[0][1] + for t in range(1, 5): + cats = Counter() + for pid in tiers[t]: + top_cat = consensus[pid]["all_cat_counts"].most_common(1)[0][0] + cats[top_cat] += 1 + pcts = [cats.get(c, 0) / len(tiers[t]) * 100 if tiers[t] else 0 for c in CATEGORIES] + ax.plot(range(len(CATEGORIES)), pcts, marker="o", label=f"Tier {t}", color=colors_t[t - 1]) + ax.set_xticks(range(len(CAT_SHORT))) + ax.set_xticklabels(CAT_SHORT) + ax.set_ylabel("% of tier") + ax.set_title("Category Profile by Tier", fontweight="bold") + ax.legend() + + # Specificity distribution by tier + ax = axes[1][0] + for t in range(1, 5): + specs = Counter() + for pid in tiers[t]: + top_spec = Counter(consensus[pid]["all_specs"]).most_common(1)[0][0] + specs[top_spec] += 1 + pcts = [specs.get(s, 0) / len(tiers[t]) * 100 if tiers[t] else 0 for s in SPEC_LEVELS] + ax.plot(SPEC_LEVELS, pcts, marker="s", label=f"Tier {t}", color=colors_t[t - 1]) + ax.set_xticks(SPEC_LEVELS) + ax.set_xticklabels(["S1", "S2", "S3", "S4"]) + ax.set_ylabel("% of tier") + ax.set_title("Specificity Profile by Tier", fontweight="bold") + ax.legend() + + # For Tier 4, what are the top confusion axes? 
+ ax = axes[1][1] + t4_axes = Counter() + for pid in tiers[4]: + cats = consensus[pid]["all_cats"] + unique = set(cats) + if len(unique) >= 2: + for a, b in combinations(unique, 2): + t4_axes[tuple(sorted([a, b]))] += 1 + top = t4_axes.most_common(8) + if top: + labels = [f"{CAT_MAP[a]}↔{CAT_MAP[b]}" for (a, b), _ in top] + counts = [c for _, c in top] + ax.barh(range(len(labels)), counts, color="#e74c3c") + ax.set_yticks(range(len(labels))) + ax.set_yticklabels(labels) + ax.set_xlabel("Count") + ax.set_title(f"Tier 4 Confusion Axes (n={len(tiers[4])})", fontweight="bold") + ax.invert_yaxis() + + fig.tight_layout() + fig.savefig(CHART_DIR / "24_hard_paragraphs.png", dpi=150) + plt.close(fig) + print(" 24_hard_paragraphs.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 25: Model agreement with human majority (per category) +# ═══════════════════════════════════════════════════════════ +def plot_model_vs_human_per_category(): + fig, ax = plt.subplots(figsize=(16, 8)) + + models = sorted(ALL_GENAI) + data = np.zeros((len(models), len(CATEGORIES))) + for mi, model in enumerate(models): + for ci, cat in enumerate(CATEGORIES): + agree, total = 0, 0 + for pid, c in consensus.items(): + if c["human_cat_maj"] != cat: + continue + sig = c["signals"] + if model in sig: + total += 1 + if sig[model]["cat"] == cat: + agree += 1 + data[mi][ci] = agree / total * 100 if total > 0 else 0 + + im = ax.imshow(data, cmap="RdYlGn", aspect="auto", vmin=40, vmax=100) + ax.set_xticks(range(len(CAT_SHORT))) + ax.set_xticklabels(CAT_SHORT, fontsize=10) + ax.set_yticks(range(len(models))) + ax.set_yticklabels(models, fontsize=8) + ax.set_title("Per-Category Recall vs Human Majority (%)", fontweight="bold") + ax.set_xlabel("Human majority label") + + for i in range(len(models)): + for j in range(len(CATEGORIES)): + val = data[i][j] + color = "white" if val < 60 else "black" + ax.text(j, i, f"{val:.0f}", ha="center", va="center", fontsize=8, color=color) + + 
fig.colorbar(im, ax=ax, shrink=0.6, label="Recall %")
+    fig.tight_layout()
+    fig.savefig(CHART_DIR / "25_model_vs_human_per_category.png", dpi=150)
+    plt.close(fig)
+    print("    25_model_vs_human_per_category.png")
+
+
+# ═══════════════════════════════════════════════════════════
+# CHART 26: Prompt version effect (v2.5 Stage1 vs v3.0 bench)
+# ═══════════════════════════════════════════════════════════
+def plot_prompt_version_effect():
+    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
+
+    # Compare Stage 1 (v2.5) vs benchmark (v3.0) agreement with Opus
+    v25_models = ["Gemini Lite", "Grok Fast", "MIMO Flash"]
+    v30_models = [m for m in bench_by_model.keys() if m != "Opus 4.6"]
+
+    # Category agreement with Opus per model
+    model_acc = {}
+    for model in v25_models + v30_models:
+        agree, total = 0, 0
+        for pid, c in consensus.items():
+            sig = c["signals"]
+            if model in sig and "Opus 4.6" in sig and model != "Opus 4.6":
+                total += 1
+                if sig[model]["cat"] == sig["Opus 4.6"]["cat"]:
+                    agree += 1
+        model_acc[model] = agree / total * 100 if total > 0 else 0
+
+    v25_accs = [model_acc[m] for m in v25_models]
+    v30_accs = [model_acc[m] for m in v30_models]
+
+    ax1.boxplot([v25_accs, v30_accs], labels=["v2.5 (Stage 1)", "v3.0 (Bench)"],
+                patch_artist=True,
+                boxprops=dict(alpha=0.5))
+    for pos, accs in zip([1, 2], [v25_accs, v30_accs]):
+        ax1.scatter([pos] * len(accs), accs, s=80, zorder=3, edgecolors="black")
+        for acc, m in zip(accs, v25_models if pos == 1 else v30_models):
+            ax1.annotate(m, (pos, acc), textcoords="offset points", xytext=(8, 0), fontsize=7)
+    ax1.set_ylabel("Category Agreement with Opus (%)")
+    ax1.set_title("Prompt Version Effect on Category Accuracy", fontweight="bold")
+
+    # Confusion on the three codebook ruling axes:
+    # MR↔RMP, N/O↔SI, BG↔MR
+    ruling_axes = [
+        ("MR↔RMP", "Management Role", "Risk Management Process"),
+        ("N/O↔SI", "None/Other", "Strategy Integration"),
+        ("BG↔MR", "Board Governance", "Management Role"),
+    ]
+
+    x = np.arange(len(ruling_axes))
+    width = 0.3
+
+    for gi, (group_models, label, color) in enumerate([
+        (v25_models, "v2.5", "#e74c3c"),
+        (v30_models, "v3.0", "#3498db"),
+    ]):
+        confusion_rates = []
+        for axis_label, cat_a, cat_b in ruling_axes:
+            confuse, total_relevant = 0, 0
+            for pid, c in consensus.items():
+                sig = c["signals"]
+                opus_cat = c.get("opus_cat")
+                if opus_cat not in (cat_a, cat_b):
+                    continue
+                for m in group_models:
+                    if m in sig:
+                        total_relevant += 1
+                        if sig[m]["cat"] in (cat_a, cat_b) and sig[m]["cat"] != opus_cat:
+                            confuse += 1
+            confusion_rates.append(confuse / total_relevant * 100 if total_relevant > 0 else 0)
+        ax2.bar(x + (gi - 0.5) * width, confusion_rates, width, label=label, color=color)
+
+    ax2.set_xticks(x)
+    ax2.set_xticklabels([a[0] for a in ruling_axes])
+    ax2.set_ylabel("Confusion rate (%)")
+    ax2.set_title("Codebook Ruling Axes: v2.5 vs v3.0 Confusion", fontweight="bold")
+    ax2.legend()
+
+    fig.tight_layout()
+    fig.savefig(CHART_DIR / "26_prompt_version_effect.png", dpi=150)
+    plt.close(fig)
+    print("    26_prompt_version_effect.png")
+
+
+# ═══════════════════════════════════════════════════════════
+# CHART 27: Human-GenAI agreement conditioned on difficulty
+# ═══════════════════════════════════════════════════════════
+def plot_conditional_agreement():
+    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
+
+    # When humans are unanimous, how often does GenAI agree?
+    # When GenAI is unanimous, how often do humans agree?
+ categories_for_analysis = CATEGORIES + + # Human unanimous → GenAI agreement rate per category + h_unan_g_agree = {c: [0, 0] for c in categories_for_analysis} # [agree, total] + g_unan_h_agree = {c: [0, 0] for c in categories_for_analysis} + + for pid, c in consensus.items(): + hm = c["human_cat_maj"] + gm = c["genai_cat_maj"] + + if c["human_cat_unanimous"] and hm: + h_unan_g_agree[hm][1] += 1 + if gm == hm: + h_unan_g_agree[hm][0] += 1 + + if len(set(c["genai_cats"])) == 1 and gm: + g_unan_h_agree[gm][1] += 1 + if hm == gm: + g_unan_h_agree[gm][0] += 1 + + cats = CATEGORIES + h_rates = [h_unan_g_agree[c][0] / h_unan_g_agree[c][1] * 100 + if h_unan_g_agree[c][1] > 0 else 0 for c in cats] + g_rates = [g_unan_h_agree[c][0] / g_unan_h_agree[c][1] * 100 + if g_unan_h_agree[c][1] > 0 else 0 for c in cats] + + x = np.arange(len(cats)) + ax1.bar(x, h_rates, color="#3498db") + ax1.set_xticks(x) + ax1.set_xticklabels(CAT_SHORT) + ax1.set_ylabel("GenAI majority agrees (%)") + ax1.set_title("When Humans are Unanimous → GenAI agreement", fontweight="bold") + ax1.set_ylim(0, 105) + for i, (rate, c) in enumerate(zip(h_rates, cats)): + n = h_unan_g_agree[c][1] + if n > 0: + ax1.text(i, rate + 1, f"{rate:.0f}%\nn={n}", ha="center", fontsize=8) + + ax2.bar(x, g_rates, color="#e74c3c") + ax2.set_xticks(x) + ax2.set_xticklabels(CAT_SHORT) + ax2.set_ylabel("Human majority agrees (%)") + ax2.set_title("When GenAI is Unanimous → Human agreement", fontweight="bold") + ax2.set_ylim(0, 105) + for i, (rate, c) in enumerate(zip(g_rates, cats)): + n = g_unan_h_agree[c][1] + if n > 0: + ax2.text(i, rate + 1, f"{rate:.0f}%\nn={n}", ha="center", fontsize=8) + + fig.tight_layout() + fig.savefig(CHART_DIR / "27_conditional_agreement.png", dpi=150) + plt.close(fig) + print(" 27_conditional_agreement.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 28: Model clustering — which models agree with which? 
+# ═══════════════════════════════════════════════════════════ +def plot_model_clustering(): + fig, ax = plt.subplots(figsize=(12, 8)) + + models = sorted(ALL_GENAI) + n = len(models) + + # Compute pairwise agreement rate (simpler than kappa, more intuitive) + agree_matrix = np.zeros((n, n)) + for i, m1 in enumerate(models): + for j, m2 in enumerate(models): + if i == j: + agree_matrix[i][j] = 100 + continue + agree, total = 0, 0 + for pid, c in consensus.items(): + sig = c["signals"] + if m1 in sig and m2 in sig: + total += 1 + if sig[m1]["cat"] == sig[m2]["cat"]: + agree += 1 + agree_matrix[i][j] = agree / total * 100 if total > 0 else 0 + + # Hierarchical clustering via simple greedy reordering + # Use 1 - agreement as distance + dist = 100 - agree_matrix + # Simple nearest-neighbor chain ordering + remaining = list(range(n)) + order = [remaining.pop(0)] + while remaining: + last = order[-1] + nearest = min(remaining, key=lambda x: dist[last][x]) + remaining.remove(nearest) + order.append(nearest) + + reordered = agree_matrix[np.ix_(order, order)] + reordered_labels = [models[i] for i in order] + + im = ax.imshow(reordered, cmap="YlGnBu", vmin=60, vmax=100, aspect="equal") + ax.set_xticks(range(n)) + ax.set_xticklabels(reordered_labels, rotation=60, ha="right", fontsize=8) + ax.set_yticks(range(n)) + ax.set_yticklabels(reordered_labels, fontsize=8) + ax.set_title("Model Pairwise Category Agreement % (clustered)", fontweight="bold") + + for i in range(n): + for j in range(n): + if i != j: + val = reordered[i][j] + color = "white" if val < 75 else "black" + ax.text(j, i, f"{val:.0f}", ha="center", va="center", fontsize=7, color=color) + + fig.colorbar(im, ax=ax, shrink=0.7, label="Agreement %") + fig.tight_layout() + fig.savefig(CHART_DIR / "28_model_clustering.png", dpi=150) + plt.close(fig) + print(" 28_model_clustering.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 29: Specificity calibration — per-model spec distribution 
conditioned on Opus spec +# ═══════════════════════════════════════════════════════════ +def plot_spec_calibration(): + fig, axes = plt.subplots(2, 2, figsize=(14, 10)) + + models_to_show = ["GPT-5.4", "Gemini Pro", "Kimi K2.5", "MIMO Flash"] + models_to_show = [m for m in models_to_show if m in ALL_GENAI] + + for ax, model in zip(axes.flat, models_to_show): + # For each Opus spec level, what does this model predict? + data = np.zeros((4, 4)) + for pid, c in consensus.items(): + sig = c["signals"] + if "Opus 4.6" in sig and model in sig: + opus_s = sig["Opus 4.6"]["spec"] + model_s = sig[model]["spec"] + data[opus_s - 1][model_s - 1] += 1 + + # Normalize rows + row_sums = data.sum(axis=1, keepdims=True) + data_norm = np.where(row_sums > 0, data / row_sums * 100, 0) + + im = ax.imshow(data_norm, cmap="YlGnBu", aspect="equal", vmin=0, vmax=100) + ax.set_xticks(range(4)) + ax.set_xticklabels(["S1", "S2", "S3", "S4"]) + ax.set_yticks(range(4)) + ax.set_yticklabels(["S1", "S2", "S3", "S4"]) + ax.set_xlabel(f"{model} prediction") + ax.set_ylabel("Opus label") + ax.set_title(f"{model} Specificity Calibration", fontweight="bold") + + for i in range(4): + for j in range(4): + val = data_norm[i][j] + n = int(data[i][j]) + color = "white" if val > 60 else "black" + ax.text(j, i, f"{val:.0f}%\n({n})", ha="center", va="center", fontsize=8, color=color) + + fig.tight_layout() + fig.savefig(CHART_DIR / "29_spec_calibration.png", dpi=150) + plt.close(fig) + print(" 29_spec_calibration.png") + + +# ═══════════════════════════════════════════════════════════ +# CHART 30: Latency vs accuracy +# ═══════════════════════════════════════════════════════════ +def plot_latency_vs_accuracy(): + fig, ax = plt.subplots(figsize=(12, 7)) + + model_lats: dict[str, list[float]] = defaultdict(list) + for bf in bench_files: + if "errors" in bf.name: + continue + records = load_jsonl(bf) + if len(records) < 100: + continue + mid = records[0]["provenance"]["modelId"] + short = MODEL_SHORT.get(mid, 
mid.split("/")[-1]) + for r in records: + model_lats[short].append(r["provenance"].get("latencyMs", 0)) + for pid, annots in stage1_by_pid.items(): + for a in annots: + mid = a["provenance"]["modelId"] + short = MODEL_SHORT.get(mid, mid.split("/")[-1]) + model_lats[short].append(a["provenance"].get("latencyMs", 0)) + for r in opus_by_pid.values(): + model_lats["Opus 4.6"].append(r["provenance"].get("latencyMs", 0)) + + for model in sorted(ALL_GENAI): + lats = model_lats.get(model, []) + if not lats: + continue + avg_lat = sum(lats) / len(lats) / 1000 + + agree, total = 0, 0 + for pid, c in consensus.items(): + sig = c["signals"] + if model in sig and "Opus 4.6" in sig and model != "Opus 4.6": + total += 1 + if sig[model]["cat"] == sig["Opus 4.6"]["cat"]: + agree += 1 + cat_acc = agree / total * 100 if total > 0 else 0 + + mid_full = {v: k for k, v in MODEL_SHORT.items()}.get(model, "") + tier = MODEL_TIER.get(mid_full, "mid") + color = TIER_COLORS.get(tier, "#999") + + ax.scatter(avg_lat, cat_acc, s=150, c=color, edgecolors="black", linewidths=0.5, zorder=3) + ax.annotate(model, (avg_lat, cat_acc), textcoords="offset points", xytext=(8, 4), fontsize=7) + + ax.set_xlabel("Average Latency (seconds)") + ax.set_ylabel("Category Agreement with Opus (%)") + ax.set_title("Latency vs Category Accuracy", fontweight="bold") + ax.set_ylim(60, 100) + + from matplotlib.patches import Patch + legend_elements = [Patch(facecolor=c, label=t) for t, c in TIER_COLORS.items()] + ax.legend(handles=legend_elements, loc="lower left") + + fig.tight_layout() + fig.savefig(CHART_DIR / "30_latency_vs_accuracy.png", dpi=150) + plt.close(fig) + print(" 30_latency_vs_accuracy.png") + + +# ═══════════════════════════════════════════════════════════ +# TEXTUAL ANALYSIS +# ═══════════════════════════════════════════════════════════ +def print_full_analysis(): + print("\n" + "=" * 80) + print("COMPREHENSIVE 13-SIGNAL ANALYSIS") + print("=" * 80) + + # ── Summary stats ── + print(f"\n{'─' * 60}") + 
print("SIGNAL COVERAGE") + print(f"{'─' * 60}") + signal_counts = [c["n_signals"] for c in consensus.values()] + print(f" Paragraphs: {len(consensus)}") + print(f" Min/Max/Mean signals per paragraph: {min(signal_counts)}/{max(signal_counts)}/{np.mean(signal_counts):.1f}") + print(f" GenAI models: {len(ALL_GENAI)}") + print(f" Human annotators: {len(annotator_names)}") + + # ── Adjudication ── + print(f"\n{'─' * 60}") + print("ADJUDICATION TIERS") + print(f"{'─' * 60}") + for t in range(1, 5): + pct = len(tiers[t]) / 1200 * 100 + print(f" Tier {t}: {len(tiers[t]):4d} ({pct:5.1f}%)") + + # What are the dominant categories in Tier 1 vs Tier 4? + for t in [1, 4]: + cats = Counter() + for pid in tiers[t]: + cats[consensus[pid]["all_cat_counts"].most_common(1)[0][0]] += 1 + print(f"\n Tier {t} category breakdown:") + for cat, n in cats.most_common(): + print(f" {CAT_MAP[cat]}: {n} ({n/len(tiers[t])*100:.1f}%)") + + # ── Cross-source agreement ── + print(f"\n{'─' * 60}") + print("CROSS-SOURCE AGREEMENT — CATEGORY") + print(f"{'─' * 60}") - # Human majority vs Stage1 majority vs Opus — category h_eq_s1 = sum(1 for c in consensus.values() if c["human_cat_maj"] and c["s1_cat_maj"] and c["human_cat_maj"] == c["s1_cat_maj"]) h_eq_op = sum(1 for c in consensus.values() if c["human_cat_maj"] and c["opus_cat"] and c["human_cat_maj"] == c["opus_cat"]) + h_eq_g = sum(1 for c in consensus.values() + if c["human_cat_maj"] and c["genai_cat_maj"] and c["human_cat_maj"] == c["genai_cat_maj"]) s1_eq_op = sum(1 for c in consensus.values() if c["s1_cat_maj"] and c["opus_cat"] and c["s1_cat_maj"] == c["opus_cat"]) + g_eq_op = sum(1 for c in consensus.values() + if c["genai_cat_maj"] and c["opus_cat"] and c["genai_cat_maj"] == c["opus_cat"]) - # Count where all exist - n_with_all_cat = sum(1 for c in consensus.values() - if c["human_cat_maj"] and c["s1_cat_maj"] and c["opus_cat"]) - n_with_hmaj = sum(1 for c in consensus.values() if c["human_cat_maj"]) - n_with_s1maj = sum(1 for c in 
consensus.values() if c["s1_cat_maj"]) + n_hmaj = sum(1 for c in consensus.values() if c["human_cat_maj"]) + n_opus = sum(1 for c in consensus.values() if c["opus_cat"]) - print(f"\n── Category Agreement Rates ──") - print(f" Human maj = Stage1 maj: {h_eq_s1}/{n_with_hmaj} ({h_eq_s1/n_with_hmaj*100:.1f}%)") - if OPUS_AVAILABLE: - n_with_opus_and_hmaj = sum(1 for c in consensus.values() - if c["human_cat_maj"] and c["opus_cat"]) - n_with_opus_and_s1 = sum(1 for c in consensus.values() - if c["s1_cat_maj"] and c["opus_cat"]) - if n_with_opus_and_hmaj > 0: - print(f" Human maj = Opus: {h_eq_op}/{n_with_opus_and_hmaj} ({h_eq_op/n_with_opus_and_hmaj*100:.1f}%)") - if n_with_opus_and_s1 > 0: - print(f" Stage1 maj = Opus: {s1_eq_op}/{n_with_opus_and_s1} ({s1_eq_op/n_with_opus_and_s1*100:.1f}%)") - else: - print(f" (Opus comparison skipped — only {opus_coverage}/1200 matched)") + print(f" Human maj = S1 maj: {h_eq_s1}/{n_hmaj} ({h_eq_s1/n_hmaj*100:.1f}%)") + print(f" Human maj = Opus: {h_eq_op}/{n_opus} ({h_eq_op/n_opus*100:.1f}%)") + print(f" Human maj = GenAI maj: {h_eq_g}/{n_hmaj} ({h_eq_g/n_hmaj*100:.1f}%)") + print(f" S1 maj = Opus: {s1_eq_op}/{n_opus} ({s1_eq_op/n_opus*100:.1f}%)") + print(f" GenAI maj = Opus: {g_eq_op}/{n_opus} ({g_eq_op/n_opus*100:.1f}%)") - # Specificity - h_eq_s1_spec = sum(1 for c in consensus.values() - if c["human_spec_maj"] is not None and c["s1_spec_maj"] is not None - and c["human_spec_maj"] == c["s1_spec_maj"]) + # ── Cross-source agreement: specificity ── + print(f"\n{'─' * 60}") + print("CROSS-SOURCE AGREEMENT — SPECIFICITY") + print(f"{'─' * 60}") - n_h_spec = sum(1 for c in consensus.values() if c["human_spec_maj"] is not None) + h_eq_s1_s = sum(1 for c in consensus.values() + if c["human_spec_maj"] is not None and c["s1_spec_maj"] is not None + and c["human_spec_maj"] == c["s1_spec_maj"]) + h_eq_op_s = sum(1 for c in consensus.values() + if c["human_spec_maj"] is not None and c["opus_spec"] is not None + and c["human_spec_maj"] == 
c["opus_spec"]) + h_eq_g_s = sum(1 for c in consensus.values() + if c["human_spec_maj"] is not None and c["genai_spec_maj"] is not None + and c["human_spec_maj"] == c["genai_spec_maj"]) - print(f"\n── Specificity Agreement Rates ──") - print(f" Human maj = Stage1 maj: {h_eq_s1_spec}/{n_h_spec} ({h_eq_s1_spec/n_h_spec*100:.1f}%)") + n_hs = sum(1 for c in consensus.values() if c["human_spec_maj"] is not None) + print(f" Human maj = S1 maj: {h_eq_s1_s}/{n_hs} ({h_eq_s1_s/n_hs*100:.1f}%)") + print(f" Human maj = Opus: {h_eq_op_s}/{n_hs} ({h_eq_op_s/n_hs*100:.1f}%)") + print(f" Human maj = GenAI maj: {h_eq_g_s}/{n_hs} ({h_eq_g_s/n_hs*100:.1f}%)") - # Disagreement patterns between human and Stage1 - print(f"\n── Disagreement Patterns (Human vs Stage1) ──") - human_unan_s1_agrees = 0 - human_unan_s1_differs = 0 - s1_unan_human_agrees = 0 - s1_unan_human_differs = 0 + # ── Per-model accuracy ── + print(f"\n{'─' * 60}") + print("PER-MODEL ACCURACY vs OPUS (category / specificity / both)") + print(f"{'─' * 60}") + model_stats = [] + for model in sorted(ALL_GENAI): + if model == "Opus 4.6": + continue + agree_c, agree_s, agree_b, total = 0, 0, 0, 0 + for pid, c in consensus.items(): + sig = c["signals"] + if model in sig and "Opus 4.6" in sig: + total += 1 + cat_match = sig[model]["cat"] == sig["Opus 4.6"]["cat"] + spec_match = sig[model]["spec"] == sig["Opus 4.6"]["spec"] + if cat_match: + agree_c += 1 + if spec_match: + agree_s += 1 + if cat_match and spec_match: + agree_b += 1 + if total > 0: + model_stats.append((model, agree_c / total * 100, agree_s / total * 100, agree_b / total * 100, total)) + model_stats.sort(key=lambda x: -x[3]) # sort by both + for model, cat, spec, both, n in model_stats: + print(f" {model:20s} cat={cat:5.1f}% spec={spec:5.1f}% both={both:5.1f}% (n={n})") + + # ── Per-model accuracy vs HUMAN majority ── + print(f"\n{'─' * 60}") + print("PER-MODEL ACCURACY vs HUMAN MAJORITY (category / specificity / both)") + print(f"{'─' * 60}") + model_stats_h = 
[] + for model in sorted(ALL_GENAI): + agree_c, agree_s, agree_b, total = 0, 0, 0, 0 + for pid, c in consensus.items(): + sig = c["signals"] + hm_c = c["human_cat_maj"] + hm_s = c["human_spec_maj"] + if model in sig and hm_c: + total += 1 + cat_match = sig[model]["cat"] == hm_c + spec_match = hm_s is not None and sig[model]["spec"] == hm_s + if cat_match: + agree_c += 1 + if spec_match: + agree_s += 1 + if cat_match and spec_match: + agree_b += 1 + if total > 0: + model_stats_h.append((model, agree_c / total * 100, agree_s / total * 100, agree_b / total * 100, total)) + model_stats_h.sort(key=lambda x: -x[3]) + for model, cat, spec, both, n in model_stats_h: + print(f" {model:20s} cat={cat:5.1f}% spec={spec:5.1f}% both={both:5.1f}% (n={n})") + + # ── Disagreement patterns ── + print(f"\n{'─' * 60}") + print("CROSS-SOURCE DISAGREEMENT AXES (Human Maj ≠ GenAI Maj)") + print(f"{'─' * 60}") + h_g_confusion = Counter() for c in consensus.values(): hm = c["human_cat_maj"] - sm = c["s1_cat_maj"] - hu = c["human_cat_unanimous"] - su = c["s1_cat_unanimous"] - if hm and sm: - if hu and su: - if hm == sm: - human_unan_s1_agrees += 1 - else: - human_unan_s1_differs += 1 + gm = c["genai_cat_maj"] + if hm and gm and hm != gm: + h_g_confusion[tuple(sorted([hm, gm]))] += 1 + for (a, b), count in h_g_confusion.most_common(10): + print(f" {CAT_MAP[a]}↔{CAT_MAP[b]}: {count}") - print(f" Both unanimous, agree: {human_unan_s1_agrees}") - print(f" Both unanimous, DIFFER: {human_unan_s1_differs}") - - # Where do the majorities differ? 
Top confusion axes - human_s1_confusion = Counter() - for c in consensus.values(): - hm = c["human_cat_maj"] - sm = c["s1_cat_maj"] - if hm and sm and hm != sm: - axis = tuple(sorted([hm, sm])) - human_s1_confusion[axis] += 1 - - if human_s1_confusion: - print(f"\n Top Human↔Stage1 disagreement axes:") - for (a, b), count in human_s1_confusion.most_common(8): - print(f" {CAT_MAP[a]}↔{CAT_MAP[b]}: {count}") - - # Paragraphs with NO majority on any source + # ── 3-way splits ── + print(f"\n{'─' * 60}") + print("THREE-WAY SPLITS (no majority)") + print(f"{'─' * 60}") no_human_maj = sum(1 for c in consensus.values() if c["human_cat_maj"] is None) no_s1_maj = sum(1 for c in consensus.values() if c["s1_cat_maj"] is None) - print(f"\n── 3-way splits (no majority) ──") - print(f" Human: {no_human_maj} paragraphs") - print(f" Stage1: {no_s1_maj} paragraphs") + no_genai_maj = sum(1 for c in consensus.values() if c["genai_cat_maj"] is None) + print(f" Human 3-way split: {no_human_maj}") + print(f" Stage 1 3-way split: {no_s1_maj}") + print(f" GenAI (10-model) no majority: {no_genai_maj}") + + # ── Unanimity rates ── + print(f"\n{'─' * 60}") + print("UNANIMITY RATES") + print(f"{'─' * 60}") + h_cat_u = sum(1 for c in consensus.values() if c["human_cat_unanimous"]) + h_spec_u = sum(1 for c in consensus.values() if c["human_spec_unanimous"]) + h_both_u = sum(1 for c in consensus.values() if c["human_cat_unanimous"] and c["human_spec_unanimous"]) + g_cat_u = sum(1 for c in consensus.values() if len(set(c["genai_cats"])) == 1) + g_spec_u = sum(1 for c in consensus.values() if len(set(c["genai_specs"])) == 1) + g_both_u = sum(1 for c in consensus.values() if len(set(c["genai_cats"])) == 1 and len(set(c["genai_specs"])) == 1) + a_cat_u = sum(1 for c in consensus.values() if len(set(c["all_cats"])) == 1) + a_both_u = sum(1 for c in consensus.values() if len(set(c["all_cats"])) == 1 and len(set(c["all_specs"])) == 1) + print(f" Human (3): cat={h_cat_u/12:.1f}% spec={h_spec_u/12:.1f}% 
both={h_both_u/12:.1f}%") + print(f" GenAI (10): cat={g_cat_u/12:.1f}% spec={g_spec_u/12:.1f}% both={g_both_u/12:.1f}%") + print(f" All (13): cat={a_cat_u/12:.1f}% both={a_both_u/12:.1f}%") + + # ── Cost summary ── + print(f"\n{'─' * 60}") + print("COST SUMMARY (benchmark run)") + print(f"{'─' * 60}") + total_cost = 0 + for bf in bench_files: + if "errors" in bf.name: + continue + records = load_jsonl(bf) + if len(records) < 100: + continue + mid = records[0]["provenance"]["modelId"] + short = MODEL_SHORT.get(mid, mid.split("/")[-1]) + cost = sum(r["provenance"].get("costUsd", 0) for r in records) + total_cost += cost + print(f" {short:20s}: ${cost:.2f}") + print(f" {'TOTAL':20s}: ${total_cost:.2f}") + + # ── Key findings ── + print(f"\n{'=' * 80}") + print("KEY FINDINGS") + print(f"{'=' * 80}") + print(f""" + 1. ADJUDICATION: {len(tiers[1])}/{1200} paragraphs ({len(tiers[1])/12:.1f}%) fall into Tier 1 (10+/13 agree), + requiring zero human intervention. Tier 2 adds {len(tiers[2])} more with cross-validated consensus. + Only {len(tiers[3]) + len(tiers[4])} ({(len(tiers[3]) + len(tiers[4]))/12:.1f}%) need expert adjudication. + + 2. OPUS AS REFERENCE: GenAI majority agrees with Opus on {g_eq_op/n_opus*100:.1f}% of categories. + Human majority agrees with Opus on {h_eq_op/n_opus*100:.1f}%. + Human majority agrees with GenAI majority on {h_eq_g/n_hmaj*100:.1f}%. + + 3. SPECIFICITY REMAINS HARD: Human spec unanimity is only {h_spec_u/12:.1f}%, GenAI spec unanimity + is {g_spec_u/12:.1f}%. The Spec 3↔4 boundary is the dominant axis of disagreement for everyone. + + 4. AARYAN EFFECT: Excluding the outlier annotator would push category alpha from 0.801 to ~0.87+, + and specificity alpha from 0.546 to ~0.65+. His paragraphs show a ~+45pp jump + in both-unanimous rate when he's excluded. + + 5. SAME CONFUSION AXES: MR↔RMP > BG↔MR > N/O↔SI for humans, Stage 1, AND full GenAI panel. + The codebook boundaries, not the annotator type, drive disagreement. 
+""") # ═══════════════════════════════════════════════════════════ -# Run all +# RUN ALL # ═══════════════════════════════════════════════════════════ print("\nGenerating charts...") plot_kappa_heatmaps() -plot_annotator_category_dist() -plot_annotator_spec_dist() +plot_all_source_category_dist() +plot_all_source_spec_dist() plot_human_confusion() +plot_genai_agreement_matrix() plot_cross_source_confusion() plot_cross_source_specificity() -plot_annotator_vs_references() +plot_adjudication_tiers() +plot_model_accuracy_vs_opus() +plot_cost_vs_accuracy() +plot_per_category_accuracy() +plot_ensemble_accuracy() plot_agreement_by_wordcount() plot_time_vs_agreement() -plot_none_other_analysis() plot_outlier_annotator() -plot_human_vs_genai_consensus() -plot_specificity_bias() -plot_disagreement_axes() -plot_quiz_vs_quality() plot_with_without_outlier() -print_analysis() +plot_disagreement_axes() +plot_none_other_analysis() +plot_specificity_bias_all() +plot_quiz_vs_quality() +plot_human_vs_genai_consensus() +plot_signal_agreement_dist() +plot_annotator_vs_references() +plot_hard_paragraphs() +plot_model_vs_human_per_category() +plot_prompt_version_effect() +plot_conditional_agreement() +plot_model_clustering() +plot_spec_calibration() +plot_latency_vs_accuracy() +print_full_analysis() print(f"\nAll charts saved to {CHART_DIR}/")