working model!!!!!

This commit is contained in:
Joey Eamigh 2026-04-05 15:37:50 -04:00
parent 41df5923f2
commit 2e932bc327
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
46 changed files with 2025 additions and 34 deletions


@ -3,7 +3,6 @@
# Allow only what the labelapp Dockerfile needs
!package.json
-!bun.lock
!packages/schemas/
!ts/package.json
@ -14,9 +13,10 @@ labelapp/.env*
labelapp/playwright-report/
labelapp/test-results/
-# Seed data (only the two JSONL files we need)
+# Seed data
!data/paragraphs/paragraphs-clean.jsonl
!data/annotations/stage1.jsonl
+!data/gold/v2-holdout-ids.json
# Git/IDE
.git


@ -575,6 +575,137 @@ As a proxy before human labels arrive, evaluation against GPT-5.4 and Opus bench
---
## Phase 9: Holdout Evaluation — Proxy Gold Results
### Evaluation Setup
Built a comprehensive evaluation pipeline (`python/src/finetune/eval.py`) to test the trained model on the 1,200-paragraph holdout set. Since human gold labels were not yet available, we used two frontier API models as proxy references:
- **GPT-5.4** (1,200 labels, ~$3,400/1M texts, ~2,900ms/sample)
- **Opus-4.6** (1,200 labels, ~$5,000/1M texts, ~6,000ms/sample)
Both references used the same v4.5 prompt as the Grok training labels but come from different model families — they provide independent validation that the fine-tuned model learned the construct, not just Grok's idiosyncrasies.
The evaluation computed: macro/weighted F1, per-class F1, precision, recall, MCC, AUC (one-vs-rest), QWK, MAE, Krippendorff's alpha (nominal for category, ordinal for specificity), confusion matrices, and calibration (ECE).
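For concreteness, two of the headline metrics can be sketched in plain Python (a simplified illustration; the actual `eval.py` uses library implementations):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def mae(y_true, y_pred):
    """Mean absolute error, used for the ordinal specificity head."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Macro F1 weights every class equally, which is why the rare-class performance (e.g., L2) shows up directly in the headline number.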
### Results: Independent Thresholds (Epoch 8, Best Model)
| Metric | vs GPT-5.4 | vs Opus-4.6 |
|--------|-----------|-------------|
| **Cat Macro F1** | **0.934** | **0.923** |
| **Spec Macro F1** | **0.895** | **0.883** |
| Cat MCC | 0.923 | 0.909 |
| Cat AUC (OvR) | 0.992 | 0.994 |
| Spec QWK | 0.932 | 0.923 |
| Spec MAE | 0.118 | 0.136 |
| Cat Kripp α | 0.922 | 0.909 |
| Spec Kripp α | 0.918 | 0.907 |
| Cat ECE | 0.054 | 0.066 |
| Throughput | **178 samples/sec** | — |
| Latency | **5.6ms/sample** | — |
Both heads pass the 0.80 macro F1 target by wide margins on held-out data against independent reference models.
Per-class category F1 (vs GPT-5.4): Board Gov. 0.972, Incident Disc. 0.961, Mgmt Role 0.941, None/Other 0.888, Risk Mgmt Proc. 0.856, Strategy Int. 0.958, Third-Party 0.959. RMP is the weakest category (0.856) due to MR↔RMP boundary ambiguity, but still comfortably above target.
Per-class specificity F1 (vs GPT-5.4): L1 0.936, L2 0.798, L3 0.894, L4 0.954. L2 is the weakest level — analyzed in detail below.
### Results: CORAL Baseline (Epoch 5) — For Comparison
| Metric | vs GPT-5.4 | vs Opus-4.6 |
|--------|-----------|-------------|
| Cat Macro F1 | 0.936 | 0.928 |
| **Spec Macro F1** | **0.597** | **0.596** |
| Spec QWK | 0.876 | 0.872 |
The category heads are essentially identical between models — the backbone handles category well regardless of specificity architecture. The +0.298 spec F1 improvement is entirely attributable to the independent threshold heads.
CORAL's confusion matrix reveals the mechanism: it collapses L2 (F1=0.407) and L3 (F1=0.369) into L1 and L4, predicting extreme levels because the shared weight vector can't represent the intermediate transitions. The independent threshold model's confusion matrix shows clean diagonals across all four levels.
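The architectural difference can be sketched in a few lines (a hypothetical simplification; the real heads operate on pooled transformer features): CORAL ties all K−1 cutpoints to one shared weight vector, so its decision boundaries are parallel shifts of the same plane, while independent thresholds give each cutpoint its own weights.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def coral_logits(feats, w, biases):
    """CORAL head: one shared weight vector, per-threshold biases only.
    All K-1 ordinal boundaries are parallel shifts of the same plane."""
    score = sum(f * wi for f, wi in zip(feats, w))
    return [score + b for b in biases]

def independent_logits(feats, weight_rows, biases):
    """Independent-threshold head: each cutpoint has its own weights,
    so the L1>L2, L2>L3, L3>L4 boundaries can differ in shape."""
    return [sum(f * wi for f, wi in zip(feats, row)) + b
            for row, b in zip(weight_rows, biases)]

def decode_level(logits, cutoff=0.5):
    """Ordinal decode: predicted level = 1 + number of thresholds passed."""
    return 1 + sum(sigmoid(z) > cutoff for z in logits)
```

With a shared `w`, intermediate levels can only be reached by bias offsets along one direction, which is consistent with CORAL collapsing L2/L3 into the extremes.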
### Reference Agreement Ceiling
A critical finding: **the model agrees with the references more than the references agree with each other.**
| Comparison | Macro Spec F1 | L2 F1 |
|-----------|---------------|-------|
| GPT-5.4 vs Opus-4.6 | **0.885** | **0.805** |
| Our model vs GPT-5.4 | **0.895** | 0.798 |
| Our model vs Opus-4.6 | 0.883 | 0.776 |
| Stage 1 Consensus vs GPT-5.4 | 0.911 | 0.845 |
Our model's macro spec F1 (0.895) exceeds the inter-reference agreement (0.885). This means the model learned a "consensus position" that is more consistent than either individual reference. Further improvements against these proxy references are not meaningful — they would represent overfitting to one reference's idiosyncrasies rather than genuine improvement.
The L2 F1 of 0.798 is within 0.007 of the reference ceiling (0.805). The L1↔L2 boundary is the hardest in the construct — it hinges on whether language is "domain-specific" enough to qualify (the ERM test). Paragraphs using quasi-domain language (e.g., "risk management program for cybersecurity") sit in a genuine gray zone where even frontier models disagree.
### L2 Error Analysis
The L2 confusion is directional. Against GPT-5.4:
- 29 L2 paragraphs misclassified as L1 (model under-calls domain terminology)
- 23 L1 paragraphs misclassified as L2 (model over-calls domain terminology)
- Only 7 L2→L3 and 2 L2→L4 errors (higher transitions are clean)
This is the L1↔L2 boundary problem in isolation — the model handles L2↔L3 and L3↔L4 transitions with high accuracy. The ERM test ("would an employee relations manager understand this language?") is inherently subjective at the margin.
### Category × Specificity Joint Distribution
The holdout set reveals strong correlation between category and specificity:
| Category | L1 | L2 | L3 | L4 |
|---------|-----|-----|-----|-----|
| None/Other | **100%** | 0% | 0% | 0% |
| Strategy Integration | **85%** | 10% | 2% | 3% |
| Third-Party Risk | 62% | **22%** | 12% | 5% |
| Risk Mgmt Process | 34% | **44%** | 16% | 6% |
| Board Governance | 42% | 4% | **45%** | 9% |
| Management Role | 13% | 3% | 29% | **54%** |
| Incident Disclosure | 0% | 8% | 2% | **90%** |
Despite this correlation, the current architecture treats specificity as category-independent (by design — per the codebook, specificity measures "how specific" regardless of "what about"). Making specificity category-dependent was considered but rejected: the cell sizes for many (category, spec_level) combinations are too small for reliable conditional modeling, and error propagation from category mistakes would corrupt specificity predictions. The strong correlations are already captured implicitly by the shared backbone. This remains a potential direction for future investigation with a larger dataset.
### Sequence Length Analysis
At max_seq_length=512, truncation is negligible:
| Dataset | Mean tokens | P95 | P99 | Max | Truncated (>512) |
|---------|------------|-----|-----|-----|-----------------|
| All paragraphs (72K) | 114.6 | 240 | 350 | 678 | 139 (0.19%) |
| Holdout (1,200) | 117.9 | 236 | 329 | 603 | 1 (0.08%) |
SEC cybersecurity disclosure paragraphs are short by nature (median ~100 tokens). The 512-token limit is more than sufficient — increasing to 1024 would affect only 139 training paragraphs and 1 holdout paragraph.
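A length audit of this kind reduces to a few lines; a minimal sketch (whitespace tokens stand in for the model's subword tokenizer, so the numbers are illustrative only):

```python
def length_stats(texts, max_len=512):
    """Summarize token-length distribution and truncation at max_len.
    Whitespace split is a stand-in for the real subword tokenizer."""
    lengths = sorted(len(t.split()) for t in texts)
    n = len(lengths)
    return {
        "mean": sum(lengths) / n,
        "p95": lengths[min(n - 1, int(0.95 * n))],
        "max": lengths[-1],
        "truncated": sum(1 for length in lengths if length > max_len),
    }
```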
### Speed and Cost Comparison
| System | Latency | Throughput | Cost/1M texts | Reproducible |
|--------|---------|-----------|---------------|-------------|
| **Fine-tuned specialist** | **5.6ms** | **178/sec** | **~$5** | **Yes** |
| GPT-5.4 (API) | ~2,900ms | ~0.3/sec | ~$3,400 | No |
| Opus-4.6 (API) | ~6,000ms | ~0.2/sec | ~$5,000 | No |
The fine-tuned model is **520× faster** than GPT-5.4 and **1,070× faster** than Opus-4.6, at **~680-1,000× lower cost**, with comparable or better accuracy and full determinism.
### Calibration
The model is well-calibrated for category (ECE=0.054 vs GPT-5.4) and reasonably calibrated for specificity (ECE=0.071). The calibration plot shows slight overconfidence in the 0.7-0.9 range — consistent with the "benign overfitting" observed during training where the model became more confident without changing decision boundaries. Temperature scaling could improve calibration without affecting predictions (a single scalar adjustment on validation logits), which would be valuable for deployment confidence thresholds.
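Temperature scaling itself is a one-parameter fit. A minimal sketch (hypothetical: grid search over validation NLL, rather than the gradient-based fit a real pipeline might use):

```python
import math

def nll(logits_batch, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_z - scaled[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels, lo=0.05, hi=10.0, steps=200):
    """Grid-search the single scalar T on validation logits.
    T > 1 softens overconfident probabilities; argmax is unchanged."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))
```

Because dividing logits by a positive scalar preserves their ordering, the fitted T changes confidence scores but never the predicted labels.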
### Remaining Opportunities
**Threshold tuning (free, post-gold):** Once human gold labels arrive, grid-search the per-threshold sigmoid cutoffs. Currently all thresholds use 0.5 — the optimal L1→L2 cutoff may differ. This requires no retraining and could gain +0.01-0.02 on L2 F1.
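The planned search can be sketched as a coordinate-wise grid (a hypothetical simplification; the real tuning would optimize macro F1 on human gold rather than accuracy):

```python
def tune_thresholds(sig_probs, gold_levels, grid=None):
    """Tune each sigmoid cutoff in turn against gold labels.
    sig_probs[i] = [p(level>1), p(level>2), p(level>3)] per sample."""
    if grid is None:
        grid = [round(0.30 + 0.01 * i, 2) for i in range(41)]  # 0.30..0.70
    cuts = [0.5, 0.5, 0.5]

    def decode(probs, cuts):
        # Ordinal decode: level = 1 + number of cutoffs exceeded
        return 1 + sum(p > c for p, c in zip(probs, cuts))

    def acc(cuts):
        hits = sum(decode(p, cuts) == g for p, g in zip(sig_probs, gold_levels))
        return hits / len(gold_levels)

    for k in range(3):  # tune one cutoff at a time, others held fixed
        cuts[k] = max(grid, key=lambda c: acc(cuts[:k] + [c] + cuts[k + 1:]))
    return cuts
```

Since decoding is deterministic given the probabilities, the whole search is a cheap post-processing pass over cached model outputs: no retraining needed.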
**Ensemble (3 seeds, +0.01-0.03 F1):** Train 3 models with seeds 42/43/44, average sigmoid outputs. Reduces variance on boundary cases and provides confidence intervals for reported metrics. Cost: 3× training time (~24h total), 3× inference time (~17ms/sample).
**Temperature scaling (free, improves calibration only):** Fit a single temperature parameter on the validation set. Reduces ECE without changing predictions — relevant for deployment where confidence scores matter.
**Larger specificity MLP (future investigation):** The current 256-dim MLP is efficient but may not capture the full complexity of subtle specificity distinctions. Larger heads (512-dim or 3-layer) could help if the dataset grows, but risk overfitting at current data scale.
### Figures Generated
All evaluation figures saved to `results/eval/`:
- `iter1-independent/figures/` — confusion matrices (cat + spec), calibration reliability diagrams, per-class F1 bar charts (vs GPT-5.4 and Opus-4.6 separately)
- `coral-baseline/figures/` — same set for CORAL baseline comparison
- `comparison/` — side-by-side CORAL vs Independent (per-class F1 bars, all-metrics comparison, improvement delta chart, confusion matrix comparison, summary table)
---
## v1 Reference
The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication (13-signal cross-analysis), codebook iterations v1.0–v3.5 — is preserved at `docs/NARRATIVE-v1.md`.


@ -129,3 +129,24 @@ eval F1 improvement. Best checkpoint: epoch 8 (spec F1=0.945).
Independent thresholds were the key insight — CORAL's shared weight vector was
the primary bottleneck. Attention pooling, MLP heads, and confidence filtering
all contributed. Tier 2 and Tier 3 ideas were not needed.
### Holdout Evaluation (1,200 paragraphs, proxy gold)
Validated on held-out data against two independent frontier model references:
| Model | Ref | Cat F1 | Spec F1 | L2 F1 | Spec QWK |
|-------|-----|--------|---------|-------|----------|
| Independent (ep8) | GPT-5.4 | 0.934 | **0.895** | 0.798 | 0.932 |
| Independent (ep8) | Opus-4.6 | 0.923 | **0.883** | 0.776 | 0.923 |
| CORAL (ep5) | GPT-5.4 | 0.936 | 0.597 | 0.407 | 0.876 |
| CORAL (ep5) | Opus-4.6 | 0.928 | 0.596 | 0.418 | 0.872 |
| GPT-5.4 | Opus-4.6 | — | **0.885** | **0.805** | 0.919 |
**Key finding:** The model's holdout spec F1 (0.895) exceeds the inter-reference
agreement (0.885 between GPT-5.4 and Opus-4.6). The model has reached the
construct reliability ceiling — further improvement requires cleaner reference
labels, not a better model.
**L2 is at ceiling:** Model L2 F1 (0.798) is within 0.007 of reference agreement
(0.805). The L1↔L2 boundary is genuinely ambiguous. Remaining opportunity:
per-threshold sigmoid tuning against human gold labels (potential +0.01-0.02).


@ -1,6 +1,6 @@
# Project Status — v2 Pipeline
-**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Fine-tuning done: cat F1=0.943, spec F1=0.945)
+**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Holdout eval done: cat F1=0.934, spec F1=0.895 vs GPT-5.4 proxy gold)
---
@ -142,14 +142,27 @@
- **Improvement plan:** `docs/SPECIFICITY-IMPROVEMENT-PLAN.md`
### 13. Evaluation & Paper ← CURRENT
-- [ ] Proxy eval: run fine-tuned model on holdout, compare against GPT-5.4 and Opus benchmark labels
+- [x] Proxy eval: fine-tuned model on 1,200 holdout vs GPT-5.4 and Opus-4.6 proxy gold
- [x] Full metrics suite: macro/per-class F1, precision, recall, MCC, AUC, QWK, MAE, Krippendorff's α, ECE, confusion matrices
- [x] CORAL baseline comparison: same eval pipeline on CORAL epoch 5 checkpoint
- [x] Figures: confusion matrices, calibration diagrams, per-class F1 bars, CORAL vs Independent comparison, speed/cost table
- [x] Reference ceiling analysis: GPT-5.4 vs Opus-4.6 agreement = 0.885 macro spec F1 (our model exceeds this at 0.895)
- [x] L2 error analysis: model L2 F1 (0.798) within 0.007 of reference ceiling (0.805)
- [x] Sequence length analysis: only 139/72K paragraphs (0.19%) truncated at 512 tokens — negligible impact
- [x] Opus labels completed: 1,200/1,200 (filled 16 missing from initial run)
- [ ] Macro F1 on holdout gold (target > 0.80 both heads) — blocked on human labels
-- [ ] Per-class F1 breakdown + GenAI benchmark table
+- [ ] Per-threshold sigmoid tuning against human gold (potential +0.01-0.02 on L2 F1)
-- [ ] Error analysis, cost comparison, IGNITE slides
+- [ ] Temperature scaling for improved calibration (ECE reduction without changing predictions)
- [ ] Ensemble of 3 seeds for confidence intervals and potential +0.01-0.03 F1
- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result
- [ ] Note in paper: CORAL ordinal regression insufficient for multi-signal ordinal classification
-- **Next:** evaluate fine-tuned model on holdout using GPT-5.4 + Opus labels as proxy gold
+- [ ] Note in paper: model exceeds inter-reference agreement — approaches ceiling of construct reliability
- **Proxy gold results (vs GPT-5.4):** Cat F1=0.934, Spec F1=0.895, MCC=0.923/0.866, AUC=0.992/0.982, QWK=0.932
- **Proxy gold results (vs Opus-4.6):** Cat F1=0.923, Spec F1=0.883, QWK=0.923
- **Speed:** 5.6ms/sample (178/sec) — 520× faster than GPT-5.4, 1,070× faster than Opus
- **Next:** deploy labelapp for human annotation, then gold evaluation + threshold tuning
---
@ -177,7 +190,7 @@
| v2 holdout reference | `data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs) |
| v2 iteration archive | `data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl` |
| v4.5 boundary test | `data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs) |
-| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,184 paragraphs) |
+| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,200 paragraphs) |
| Opus +codebook | `data/annotations/golden/opus.jsonl` (includes v1 + v2 runs) |
| Grok self-consistency test | `data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs) |
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
@ -199,6 +212,13 @@
| CORAL baseline (ablation winner) | `checkpoints/finetune/best-base_weighted_ce-ep5/final/` (cat=0.932, spec=0.517) |
| Ablation results | `checkpoints/finetune/ablation/ablation_results.json` |
| Spec improvement plan | `docs/SPECIFICITY-IMPROVEMENT-PLAN.md` |
| Best model iter1 config | `python/configs/finetune/iter1-independent.yaml` |
| Eval script | `python/src/finetune/eval.py` |
| Eval results (best model) | `results/eval/iter1-independent/metrics.json` |
| Eval results (CORAL) | `results/eval/coral-baseline/metrics.json` |
| Comparison figures | `results/eval/comparison/` (5 charts) |
| Per-model eval figures | `results/eval/iter1-independent/figures/` + `results/eval/coral-baseline/figures/` |
| Comparison figure script | `python/scripts/generate-comparison-figures.py` |
### v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)


@ -1,14 +1,15 @@
# Build context: monorepo root (run: docker build -f labelapp/Dockerfile .)
-FROM oven/bun:1 AS base
+FROM oven/bun:1.3.8 AS base
# -- Install dependencies --
FROM base AS deps
WORKDIR /app
-COPY package.json bun.lock ./
+COPY package.json ./
COPY packages/schemas/package.json packages/schemas/
COPY ts/package.json ts/
COPY labelapp/package.json labelapp/
-RUN bun install --frozen-lockfile
+# bun.lock intentionally excluded — bun hangs parsing the binary lockfile in Docker (bun bug)
+RUN bun install
# -- Build Next.js --
FROM base AS builder
@ -16,7 +17,7 @@ WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=deps /app/packages/schemas/node_modules ./packages/schemas/node_modules
COPY --from=deps /app/labelapp/node_modules ./labelapp/node_modules
-COPY package.json bun.lock ./
+COPY package.json ./
COPY packages/schemas/ packages/schemas/
COPY labelapp/ labelapp/
ENV NEXT_TELEMETRY_DISABLED=1


@ -100,7 +100,6 @@ export async function GET() {
// Filter to non-admin annotators for per-annotator stats
const perAnnotator = allAnnotators
-.filter((a) => a.id !== "joey")
.map((a) => ({
id: a.id,
displayName: a.displayName,
@ -132,7 +131,7 @@ export async function GET() {
// Collect all annotator IDs that have labels (excluding admin)
const annotatorIds = [
...new Set(allLabels.map((l) => l.annotatorId)),
-].filter((id) => id !== "joey");
+];
annotatorIds.sort();
// For each annotator pair, collect shared paragraph ratings


@ -797,6 +797,13 @@ export default function CodebookPage() {
how company-specific the disclosure is. Apply the decision test in
order; stop at the first “yes.”
</p>
<p className="rounded-md border border-amber-300 bg-amber-50 dark:border-amber-700 dark:bg-amber-950/30 p-3 text-sm">
<strong>Specificity rates the ENTIRE paragraph</strong>, not just
the parts related to the content category you chose. If a Board
Governance paragraph also mentions CrowdStrike Falcon or the
CISO&rsquo;s 20 years of experience, those facts count. Scan
everything; don&rsquo;t filter by category.
</p>
{/* Decision Test */}
<div className="space-y-3">


@ -489,6 +489,9 @@ function CodebookSidebar() {
desc="Contains 1+ QV-eligible facts: specific numbers, dates, named external entities, named tools/products, verifiable certifications."
/>
</div>
<p className="mt-2 text-xs text-amber-700 dark:text-amber-400">
Specificity rates the WHOLE paragraph, not just the category-relevant parts. Scan everything.
</p>
</section>
<Separator />


@ -29,7 +29,7 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
"Management Role is broader: it now covers how management is ORGANIZED to handle cybersecurity — role allocation, committee structure, reporting lines — not just \"who a specific person is.\" Paragraphs about management structure without named individuals can be MR.",
"Specificity Level 2 is broader: renamed from \"Sector-Adapted\" to \"Domain-Adapted.\" Cybersecurity terms like penetration testing, vulnerability scanning, SIEM, and SOC now trigger Level 2. In v1, these were incorrectly classified as Level 1.",
"Level 4 requires just 1 QV fact (was 2+). No more counting. If an external party could verify even one claim in the paragraph — a dollar amount, a named tool, a specific date — it's Level 4.",
-"You'll be labeling 1,200 holdout paragraphs total. There are 5 annotators, with 3 labeling each paragraph. You'll see roughly 720.",
+"You'll be labeling 1,200 holdout paragraphs total. There are 6 annotators, with 3 labeling each paragraph. You'll see roughly 600.",
],
keyPoints: [
"Same 7 categories, same 4 specificity levels — the framework is unchanged.",
@ -49,11 +49,12 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
"Question 1 — Content Category: \"What is this paragraph about?\" Pick the best of 7 options.",
"Question 2 — Specificity Level: \"How company-specific is this paragraph?\" Pick a level from 1 to 4.",
"These are independent dimensions. A materiality disclaimer can be Strategy Integration (category) at Level 1 (generic boilerplate). An incident report can be Incident Disclosure at Level 4 (specific dates and firms).",
"Important: specificity rates THE WHOLE PARAGRAPH, not just the category-relevant parts. If a Board Governance paragraph mentions the CISO by name and describes penetration testing, those facts count for specificity even though they're not \"board\" content. Scan the entire paragraph for the most specific fact present — don't filter by category first.",
],
keyPoints: [
"One content category (of 7) — pick the dominant one.",
-"One specificity level (1–4) — determined by the most specific fact present.",
+"One specificity level (1–4) — determined by the most specific fact in THE WHOLE PARAGRAPH.",
-"Category and specificity are independent — don't let one influence the other.",
+"Specificity rates the paragraph, not the category. A Board Governance paragraph that mentions CrowdStrike Falcon is Level 4.",
],
},
@ -142,6 +143,7 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
subtitle: "How company-specific is this paragraph?",
content: [
"Specificity measures how much this paragraph tells you about THIS specific company versus generic filler any company could use.",
"Critical: specificity rates the ENTIRE paragraph — not just the parts related to the category you chose. If you categorize a paragraph as Board Governance but it also mentions CrowdStrike Falcon or the CISO's 20 years of experience, those facts still count. Scan everything.",
"Think of it as a waterfall — check from the top and stop at the first yes:",
"Level 4 — Quantified-Verifiable: Can an external party verify at least one claim? (a specific number, date, named tool/firm, verifiable certification) → Level 4.",
"Level 3 — Firm-Specific: Does it contain at least one fact unique to THIS company? (CISO title, named non-generic committee, named individual, 24/7 SOC) → Level 3.",
@ -218,6 +220,13 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
explanation:
"BG because the Audit Committee is the subject (oversight). CISO is a firm-specific fact → Level 3. No QV facts (no numbers, dates, named firms).",
},
{
text: "The Board oversees our cybersecurity program, which is led by our CISO and includes penetration testing and vulnerability assessments using CrowdStrike Falcon.",
category: "Board Governance",
specificity: "Level 4 — Quantified-Verifiable",
explanation:
"BG because the Board is the subject. But specificity rates THE WHOLE PARAGRAPH — not just the board content. CrowdStrike Falcon is a named tool (QV-eligible), so Level 4. Don't be tempted to rate only the \"board\" parts as generic — the paragraph as a whole contains a verifiable fact.",
},
{
text: "Under the leadership of our CISO, we have implemented network segmentation, endpoint detection and response, data loss prevention, and SIEM. Our team monitors critical systems continuously and conducts quarterly tabletop exercises.",
category: "Risk Management Process",
@ -248,8 +257,8 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
},
],
keyPoints: [
-"Category and specificity are independent. Don't let one influence the other.",
+"Specificity rates the WHOLE paragraph — not just the parts related to the category. Scan everything.",
-"The person-removal test and specificity waterfall work together — use both.",
+"A Board Governance paragraph that mentions CrowdStrike Falcon → still Level 4. Don't filter facts by category.",
"When in doubt on category: which question does the paragraph answer?",
"When in doubt on specificity: check the waterfall top-down (QV → IS → Domain → Generic).",
], ],


@ -1,7 +1,6 @@
process.env.DATABASE_URL ??=
"postgresql://sec_cybert:sec_cybert@localhost:5432/sec_cybert";
-import { ne } from "drizzle-orm";
import { db } from "../db";
import * as schema from "../db/schema";
import { generateAssignments, printAssignmentStats } from "../lib/assignment";
@ -15,12 +14,11 @@ async function main() {
const paragraphIds = rows.map((r) => r.id);
console.log(`  ${paragraphIds.length} paragraphs`);
-// 2. Read annotator IDs from DB (exclude joey — admin)
+// 2. Read annotator IDs from DB (all annotators, including joey)
console.log("Loading annotators...");
const annotators = await db
.select({ id: schema.annotators.id })
-.from(schema.annotators)
-.where(ne(schema.annotators.id, "joey"));
+.from(schema.annotators);
const annotatorIds = annotators.map((a) => a.id).sort();
console.log(`  ${annotatorIds.length} annotators: ${annotatorIds.join(", ")}`);


@ -53,16 +53,14 @@ async function main() {
db.select().from(schema.adjudications),
]);
-const nonAdminAnnotators = allAnnotators.filter((a) => a.id !== "joey");
-const annotatorIds = nonAdminAnnotators.map((a) => a.id).sort();
+const annotatorIds = allAnnotators.map((a) => a.id).sort();
const annotatorNames = new Map(allAnnotators.map((a) => [a.id, a.displayName]));
-// Filter to non-admin labels only
-const labels = allLabels.filter((l) => l.annotatorId !== "joey");
+const labels = allLabels;
-console.log(`  ${labels.length} human labels (non-admin)`);
+console.log(`  ${labels.length} human labels`);
console.log(`  ${allParagraphs.length} paragraphs`);
-console.log(`  ${nonAdminAnnotators.length} annotators`);
+console.log(`  ${allAnnotators.length} annotators`);
console.log(`  ${allQuizSessions.length} quiz sessions`);
console.log(`  ${allAdjudications.length} adjudications`);
@ -108,7 +106,7 @@ async function main() {
// ── 3. Annotators JSON ──
console.log("\nExporting annotator profiles...");
-const annotatorProfiles = nonAdminAnnotators.map((a) => ({
+const annotatorProfiles = allAnnotators.map((a: { id: string; displayName: string; onboardedAt: Date | null }) => ({
id: a.id,
displayName: a.displayName,
onboardedAt: a.onboardedAt?.toISOString() ?? null,


@@ -44,6 +44,35 @@ def cmd_finetune(args: argparse.Namespace) -> None:
     train(config)

+def cmd_eval(args: argparse.Namespace) -> None:
+    from src.finetune.eval import EvalConfig, evaluate
+
+    benchmark_paths = {}
+    if args.benchmark:
+        for name, path in args.benchmark:
+            benchmark_paths[name] = path
+    else:
+        # Default benchmarks
+        benchmark_paths = {
+            "GPT-5.4": "../data/annotations/v2-bench/gpt-5.4.jsonl",
+            "Opus-4.6": "../data/annotations/v2-bench/opus-4.6.jsonl",
+        }
+
+    config = EvalConfig(
+        checkpoint_path=args.checkpoint,
+        paragraphs_path=args.paragraphs,
+        holdout_path=args.holdout,
+        benchmark_paths=benchmark_paths,
+        output_dir=args.output_dir,
+        max_seq_length=args.max_seq_length,
+        batch_size=args.batch_size,
+        specificity_head=args.spec_head,
+        spec_mlp_dim=args.spec_mlp_dim,
+        pooling=args.pooling,
+    )
+    evaluate(config)
+
 def cmd_ablate(args: argparse.Namespace) -> None:
     from src.common.config import FinetuneConfig
     from src.finetune.train import ablate
@@ -94,10 +123,20 @@
     ab.add_argument("--epochs", type=int, help="Override epochs per ablation run (default: config value)")
     ab.set_defaults(func=cmd_ablate)

-    # ── eval (placeholder) ──
-    ev = sub.add_parser("eval", help="Evaluate a trained model")
-    ev.add_argument("--config", required=True, help="Path to YAML config file")
-    ev.set_defaults(func=lambda args: print("Evaluation not yet implemented."))
+    # ── eval ──
+    ev = sub.add_parser("eval", help="Evaluate a trained model on holdout set")
+    ev.add_argument("--checkpoint", required=True, help="Path to model checkpoint directory")
+    ev.add_argument("--paragraphs", default="../data/paragraphs/paragraphs-clean.patched.jsonl")
+    ev.add_argument("--holdout", default="../data/gold/v2-holdout-ids.json")
+    ev.add_argument("--benchmark", action="append", nargs=2, metavar=("NAME", "PATH"),
+                    help="Benchmark reference: NAME PATH (can repeat)")
+    ev.add_argument("--output-dir", default="../results/eval")
+    ev.add_argument("--max-seq-length", type=int, default=512)
+    ev.add_argument("--batch-size", type=int, default=64)
+    ev.add_argument("--spec-head", default="independent", choices=["coral", "independent", "softmax"])
+    ev.add_argument("--spec-mlp-dim", type=int, default=256)
+    ev.add_argument("--pooling", default="attention", choices=["cls", "attention"])
+    ev.set_defaults(func=cmd_eval)

     args = parser.parse_args()
     args.func(args)
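The repeatable `--benchmark NAME PATH` flag relies on argparse's `action="append"` combined with `nargs=2`: each occurrence is collected as a two-item list, which `cmd_eval` then folds into a name-to-path dict. A standalone sketch of that behavior (paths are made up):

```python
import argparse

# Repeatable two-argument flag: action="append" + nargs=2 yields a
# list of [name, path] pairs, one per occurrence on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--benchmark", action="append", nargs=2, metavar=("NAME", "PATH"))

args = parser.parse_args([
    "--benchmark", "GPT-5.4", "bench/gpt.jsonl",
    "--benchmark", "Opus-4.6", "bench/opus.jsonl",
])
benchmark_paths = {name: path for name, path in args.benchmark or []}
print(benchmark_paths)
# → {'GPT-5.4': 'bench/gpt.jsonl', 'Opus-4.6': 'bench/opus.jsonl'}
```

The `or []` guard matters because argparse leaves the attribute as `None` (not an empty list) when the flag is never passed, which is why `cmd_eval` falls back to the default benchmarks in that case.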


@@ -15,6 +15,9 @@ dependencies = [
     "unsloth==2026.3.11",
     "coral-pytorch>=1.4.0",
     "scikit-learn>=1.8.0",
+    "krippendorff>=0.8.2",
+    "matplotlib>=3.10.8",
+    "seaborn>=0.13.2",
 ]

 [project.scripts]


@@ -0,0 +1,226 @@
"""Generate side-by-side comparison figures: CORAL baseline vs Independent threshold model."""
import json
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
RESULTS_DIR = Path(__file__).resolve().parent.parent.parent / "results" / "eval"
OUTPUT_DIR = RESULTS_DIR / "comparison"
CATEGORIES = [
"Board Gov.",
"Incident Disc.",
"Mgmt Role",
"None/Other",
"Risk Mgmt Proc.",
"Strategy Int.",
"Third-Party",
]
SPEC_LABELS = ["L1: Generic", "L2: Domain", "L3: Firm-Spec.", "L4: Quantified"]
def load_metrics(model_dir: str) -> dict:
with open(RESULTS_DIR / model_dir / "metrics.json") as f:
return json.load(f)
def main():
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
sns.set_theme(style="whitegrid", font_scale=1.1)
coral = load_metrics("coral-baseline")
indep = load_metrics("iter1-independent")
# Use GPT-5.4 as the reference (1200 samples, complete)
coral_gpt = coral["best-base_weighted_ce-ep5_vs_GPT-5.4"]
indep_gpt = indep["iter1-independent_vs_GPT-5.4"]
# ── 1. Side-by-side per-class F1 (Category) ─────────────────────────────
# Keys come from eval.py: name.replace(" ", "").replace("/", "")[:8]
cat_keys = ["BoardGov", "Incident", "Manageme", "NoneOthe", "RiskMana", "Strategy", "Third-Pa"]
coral_cat_f1 = [coral_gpt.get(f"cat_f1_{k}", 0) for k in cat_keys]
indep_cat_f1 = [indep_gpt.get(f"cat_f1_{k}", 0) for k in cat_keys]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
x = np.arange(len(CATEGORIES))
width = 0.35
bars1 = ax1.bar(x - width/2, coral_cat_f1, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
bars2 = ax1.bar(x + width/2, indep_cat_f1, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
ax1.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="Target (0.80)")
ax1.set_ylabel("F1 Score")
ax1.set_title("Category F1 by Class")
ax1.set_xticks(x)
ax1.set_xticklabels(CATEGORIES, rotation=25, ha="right")
ax1.set_ylim(0, 1.05)
ax1.legend(loc="lower right")
for bar, v in zip(bars1, coral_cat_f1):
ax1.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=7)
for bar, v in zip(bars2, indep_cat_f1):
ax1.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=7)
# ── Specificity F1 side-by-side ──────────────────────────────────────────
# Keys come from eval.py: name.replace(" ", "").replace(":", "")[:8]
spec_keys = ["L1Generi", "L2Domain", "L3Firm-S", "L4Quanti"]
coral_spec_f1 = [coral_gpt.get(f"spec_f1_{k}", 0) for k in spec_keys]
indep_spec_f1 = [indep_gpt.get(f"spec_f1_{k}", 0) for k in spec_keys]
x2 = np.arange(len(SPEC_LABELS))
bars3 = ax2.bar(x2 - width/2, coral_spec_f1, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
bars4 = ax2.bar(x2 + width/2, indep_spec_f1, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
ax2.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="Target (0.80)")
ax2.set_ylabel("F1 Score")
ax2.set_title("Specificity F1 by Level")
ax2.set_xticks(x2)
ax2.set_xticklabels(SPEC_LABELS)
ax2.set_ylim(0, 1.05)
ax2.legend(loc="lower right")
for bar, v in zip(bars3, coral_spec_f1):
ax2.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=8)
for bar, v in zip(bars4, indep_spec_f1):
ax2.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=8)
plt.suptitle("CORAL Baseline vs Independent Thresholds — Holdout Set (vs GPT-5.4)", fontsize=14, fontweight="bold")
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "coral_vs_independent_f1.png", dpi=200)
plt.close(fig)
print(f" Saved: coral_vs_independent_f1.png")
# ── 2. Summary metrics comparison ────────────────────────────────────────
metrics_to_compare = {
"Cat Macro F1": ("cat_macro_f1", "cat_macro_f1"),
"Spec Macro F1": ("spec_macro_f1", "spec_macro_f1"),
"Cat MCC": ("cat_mcc", "cat_mcc"),
"Spec MCC": ("spec_mcc", "spec_mcc"),
"Cat AUC": ("cat_auc", "cat_auc"),
"Spec AUC": ("spec_auc", "spec_auc"),
"Spec QWK": ("spec_qwk", "spec_qwk"),
"Cat Kripp α": ("cat_kripp_alpha", "cat_kripp_alpha"),
"Spec Kripp α": ("spec_kripp_alpha", "spec_kripp_alpha"),
}
fig, ax = plt.subplots(figsize=(12, 6))
labels = list(metrics_to_compare.keys())
coral_vals = [coral_gpt.get(v[0], 0) for v in metrics_to_compare.values()]
indep_vals = [indep_gpt.get(v[1], 0) for v in metrics_to_compare.values()]
x = np.arange(len(labels))
width = 0.35
ax.bar(x - width/2, coral_vals, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
ax.bar(x + width/2, indep_vals, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
ax.axhline(0.80, color="red", linestyle="--", alpha=0.5)
ax.set_ylabel("Score")
ax.set_title("CORAL vs Independent — All Metrics (Holdout vs GPT-5.4)")
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=30, ha="right")
ax.set_ylim(0, 1.1)
ax.legend()
for i, (cv, iv) in enumerate(zip(coral_vals, indep_vals)):
ax.text(i - width/2, cv + 0.01, f"{cv:.3f}", ha="center", va="bottom", fontsize=7)
ax.text(i + width/2, iv + 0.01, f"{iv:.3f}", ha="center", va="bottom", fontsize=7)
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "coral_vs_independent_all_metrics.png", dpi=200)
plt.close(fig)
print(f" Saved: coral_vs_independent_all_metrics.png")
# ── 3. Delta chart (improvement from CORAL → Independent) ────────────────
deltas = [iv - cv for cv, iv in zip(coral_vals, indep_vals)]
colors = ["#55a868" if d >= 0 else "#c44e52" for d in deltas]
fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(labels, deltas, color=colors, alpha=0.85)
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Improvement (Independent − CORAL)")
ax.set_title("Metric Improvement: Independent Thresholds over CORAL")
for i, (d, label) in enumerate(zip(deltas, labels)):
ax.text(d + 0.003 if d >= 0 else d - 0.003, i, f"{d:+.3f}",
va="center", ha="left" if d >= 0 else "right", fontsize=9)
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "improvement_delta.png", dpi=200)
plt.close(fig)
print(f" Saved: improvement_delta.png")
# ── 4. Specificity confusion matrix side-by-side ─────────────────────────
spec_labels_short = ["L1", "L2", "L3", "L4"]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
cm_coral = np.array(coral_gpt["spec_confusion_matrix"])
cm_indep = np.array(indep_gpt["spec_confusion_matrix"])
cm_coral_norm = cm_coral.astype(float) / cm_coral.sum(axis=1, keepdims=True).clip(min=1)
cm_indep_norm = cm_indep.astype(float) / cm_indep.sum(axis=1, keepdims=True).clip(min=1)
sns.heatmap(cm_coral_norm, annot=cm_coral, fmt="d", cmap="Oranges",
xticklabels=spec_labels_short, yticklabels=spec_labels_short,
ax=ax1, vmin=0, vmax=1, cbar=False)
ax1.set_title("CORAL (Epoch 5) — Spec F1=0.597")
ax1.set_xlabel("Predicted")
ax1.set_ylabel("GPT-5.4 Reference")
sns.heatmap(cm_indep_norm, annot=cm_indep, fmt="d", cmap="Blues",
xticklabels=spec_labels_short, yticklabels=spec_labels_short,
ax=ax2, vmin=0, vmax=1, cbar=False)
ax2.set_title("Independent (Epoch 8) — Spec F1=0.895")
ax2.set_xlabel("Predicted")
ax2.set_ylabel("GPT-5.4 Reference")
plt.suptitle("Specificity Confusion Matrices — CORAL vs Independent", fontsize=13, fontweight="bold")
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "spec_confusion_comparison.png", dpi=200)
plt.close(fig)
print(f" Saved: spec_confusion_comparison.png")
# ── 5. Cost/speed comparison table figure ────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 4))
ax.axis("off")
table_data = [
["Metric", "CORAL (Ep5)", "Independent (Ep8)", "GPT-5.4 (API)", "Opus-4.6 (API)"],
["Cat Macro F1", f"{coral_gpt['cat_macro_f1']:.4f}", f"{indep_gpt['cat_macro_f1']:.4f}", "—(reference)", "—(reference)"],
["Spec Macro F1", f"{coral_gpt['spec_macro_f1']:.4f}", f"{indep_gpt['spec_macro_f1']:.4f}", "—(reference)", "—(reference)"],
["Spec QWK", f"{coral_gpt['spec_qwk']:.4f}", f"{indep_gpt['spec_qwk']:.4f}", "", ""],
["MCC (Cat)", f"{coral_gpt['cat_mcc']:.4f}", f"{indep_gpt['cat_mcc']:.4f}", "", ""],
["Latency/sample", "5.6ms", "5.6ms", "~2,900ms", "~6,000ms"],
["Cost/1M texts", "~$5", "~$5", "~$3,400", "~$5,000*"],
["Reproducible", "Yes", "Yes", "No", "No"],
]
table = ax.table(cellText=table_data[1:], colLabels=table_data[0],
cellLoc="center", loc="center")
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1, 1.5)
# Style header
for j in range(len(table_data[0])):
table[0, j].set_facecolor("#4C72B0")
table[0, j].set_text_props(color="white", fontweight="bold")
# Highlight best specialist column
for i in range(1, len(table_data)):
table[i, 2].set_facecolor("#d4edda")
ax.set_title("Model Comparison Summary", fontsize=13, fontweight="bold", pad=20)
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "comparison_table.png", dpi=200)
plt.close(fig)
print(f" Saved: comparison_table.png")
print(f"\n All figures saved to {OUTPUT_DIR}")
if __name__ == "__main__":
main()
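Both confusion-matrix panels above row-normalize with `.clip(min=1)` applied to the row sums, so a reference class that never occurs divides by 1 instead of 0. A minimal standalone sketch of the pattern (toy matrix, not project data):

```python
import numpy as np

# Row-normalize a confusion matrix for a heatmap. clip(min=1) on the row
# sums keeps an all-zero row (class absent from the reference) from
# producing NaN via 0/0; that row simply stays all zeros.
cm = np.array([[8, 2, 0],
               [1, 9, 0],
               [0, 0, 0]])  # third class never appears in the reference
cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True).clip(min=1)
print(cm_norm[0])  # → [0.8 0.2 0. ]
print(cm_norm[2])  # → [0. 0. 0.], not NaN
```

Passing the raw counts as `annot=cm` while coloring by `cm_norm` (as the figures do) keeps absolute sample sizes visible even when rows are normalized.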

python/src/finetune/eval.py (new file, 724 lines)

@@ -0,0 +1,724 @@
"""Holdout evaluation for trained DualHeadModernBERT models.
Loads a trained checkpoint, runs inference on the 1,200-paragraph holdout set,
and compares predictions against proxy gold labels (GPT-5.4, Opus-4.6) or
real human gold labels when available.
Reports: macro/per-class F1, precision, recall, MCC, AUC (one-vs-rest),
QWK, MAE, Krippendorff's alpha, confusion matrices, calibration (ECE),
cost/latency comparison, and generates publication-ready figures.
"""
import json
import time
from dataclasses import dataclass
from pathlib import Path
import krippendorff
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torch.nn.functional as F
from safetensors.torch import load_file
from sklearn.metrics import (
f1_score,
precision_score,
recall_score,
matthews_corrcoef,
roc_auc_score,
cohen_kappa_score,
confusion_matrix,
classification_report,
mean_absolute_error,
)
from transformers import AutoModel, AutoTokenizer
from .data import CATEGORIES, CAT2ID, NUM_CATEGORIES, NUM_SPECIFICITY
from .model import DualHeadModernBERT, ordinal_predict, softmax_predict
SPEC_LABELS = ["L1: Generic", "L2: Domain", "L3: Firm-Specific", "L4: Quantified"]
@dataclass
class EvalConfig:
"""Configuration for holdout evaluation."""
checkpoint_path: str
paragraphs_path: str
holdout_path: str
benchmark_paths: dict[str, str] # name → path to benchmark JSONL
output_dir: str
max_seq_length: int = 512
batch_size: int = 64
# Architecture params (must match training)
specificity_head: str = "independent"
spec_mlp_dim: int = 256
pooling: str = "attention"
def _load_jsonl(path: str | Path) -> list[dict]:
records = []
with open(path) as f:
for line in f:
line = line.strip()
if line:
records.append(json.loads(line))
return records
def load_holdout_data(
paragraphs_path: str,
holdout_path: str,
benchmark_paths: dict[str, str],
) -> dict:
"""Load holdout paragraphs and benchmark labels, joined by paragraph ID."""
paragraphs = {p["id"]: p for p in _load_jsonl(paragraphs_path)}
with open(holdout_path) as f:
holdout_ids = json.load(f)
# Load benchmark labels
benchmarks = {}
for name, path in benchmark_paths.items():
labels = {r["paragraphId"]: r for r in _load_jsonl(path)}
benchmarks[name] = labels
# Build joined dataset
records = []
for pid in holdout_ids:
para = paragraphs.get(pid)
if para is None:
continue
bench_labels = {}
for name, labels in benchmarks.items():
if pid in labels:
lab = labels[pid]["label"]
bench_labels[name] = {
"category": lab["content_category"],
"specificity": lab["specificity_level"],
}
records.append({
"id": pid,
"text": para["text"],
"benchmark_labels": bench_labels,
})
return records
def load_model(config: EvalConfig, device: torch.device) -> tuple:
"""Load trained DualHeadModernBERT from checkpoint."""
checkpoint = Path(config.checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(str(checkpoint))
# Load backbone (we need the architecture, then overwrite weights)
try:
import flash_attn # noqa: F401
attn_impl = "flash_attention_2"
except ImportError:
attn_impl = "sdpa"
backbone = AutoModel.from_pretrained(
"answerdotai/ModernBERT-large",
trust_remote_code=True,
attn_implementation=attn_impl,
dtype=torch.bfloat16,
)
hidden_size = backbone.config.hidden_size
model = DualHeadModernBERT(
backbone=backbone,
hidden_size=hidden_size,
num_categories=NUM_CATEGORIES,
num_specificity=NUM_SPECIFICITY,
specificity_head_type=config.specificity_head,
spec_mlp_dim=config.spec_mlp_dim,
pooling=config.pooling,
)
# Load trained weights (filter out loss function buffers not needed for inference)
state_dict = load_file(str(checkpoint / "model.safetensors"))
model.load_state_dict(state_dict, strict=False)
model = model.to(device).to(torch.bfloat16)
model.eval()
return model, tokenizer
@torch.no_grad()
def run_inference(
model: DualHeadModernBERT,
tokenizer,
records: list[dict],
max_seq_length: int,
batch_size: int,
device: torch.device,
spec_head_type: str = "independent",
) -> dict:
"""Run model inference on holdout records. Returns predictions + timing."""
texts = [r["text"] for r in records]
all_cat_logits = []
all_spec_logits = []
total_time = 0.0
for i in range(0, len(texts), batch_size):
batch_texts = texts[i : i + batch_size]
encoded = tokenizer(
batch_texts,
truncation=True,
max_length=max_seq_length,
padding="longest",
return_tensors="pt",
).to(device)
start = time.perf_counter()
outputs = model(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
)
        if device.type == "cuda":
            torch.cuda.synchronize()
total_time += time.perf_counter() - start
all_cat_logits.append(outputs["category_logits"].float().cpu())
all_spec_logits.append(outputs["specificity_logits"].float().cpu())
cat_logits = torch.cat(all_cat_logits, dim=0)
spec_logits = torch.cat(all_spec_logits, dim=0)
cat_probs = F.softmax(cat_logits, dim=1).numpy()
cat_preds = cat_logits.argmax(dim=1).numpy()
if spec_head_type == "softmax":
spec_preds = softmax_predict(spec_logits).numpy()
spec_probs = F.softmax(spec_logits, dim=1).numpy()
else:
spec_preds = ordinal_predict(spec_logits).numpy()
# Convert ordinal logits to class probs for AUC
spec_probs = _ordinal_to_class_probs(spec_logits).numpy()
return {
"cat_preds": cat_preds,
"cat_probs": cat_probs,
"cat_logits": cat_logits.numpy(),
"spec_preds": spec_preds,
"spec_probs": spec_probs,
"spec_logits": spec_logits.numpy(),
"total_time_s": total_time,
"num_samples": len(texts),
"avg_ms_per_sample": (total_time / len(texts)) * 1000,
}
def _ordinal_to_class_probs(logits: torch.Tensor) -> torch.Tensor:
"""Convert ordinal threshold logits to per-class probabilities.
P(class=0) = 1 - P(>=1)
P(class=k) = P(>=k) - P(>=k+1) for 0 < k < K-1
P(class=K-1) = P(>=K-1)
"""
probs = torch.sigmoid(logits) # (N, K-1)
num_classes = probs.shape[1] + 1
class_probs = torch.zeros(probs.shape[0], num_classes)
class_probs[:, 0] = 1.0 - probs[:, 0]
for k in range(1, num_classes - 1):
class_probs[:, k] = probs[:, k - 1] - probs[:, k]
class_probs[:, -1] = probs[:, -1]
# Clamp for numerical stability
class_probs = class_probs.clamp(min=0.0)
class_probs = class_probs / class_probs.sum(dim=1, keepdim=True)
return class_probs
def compute_ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> tuple[float, dict]:
"""Compute Expected Calibration Error and per-bin data for reliability diagram."""
confidences = np.max(probs, axis=1)
predictions = np.argmax(probs, axis=1)
accuracies = (predictions == labels).astype(float)
bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_data = {"bin_centers": [], "bin_accs": [], "bin_confs": [], "bin_counts": []}
ece = 0.0
for i in range(n_bins):
mask = (confidences > bin_edges[i]) & (confidences <= bin_edges[i + 1])
count = mask.sum()
if count > 0:
avg_conf = confidences[mask].mean()
avg_acc = accuracies[mask].mean()
ece += (count / len(labels)) * abs(avg_acc - avg_conf)
bin_data["bin_centers"].append((bin_edges[i] + bin_edges[i + 1]) / 2)
bin_data["bin_accs"].append(avg_acc)
bin_data["bin_confs"].append(avg_conf)
bin_data["bin_counts"].append(int(count))
return ece, bin_data
def compute_all_metrics(
preds: np.ndarray,
labels: np.ndarray,
probs: np.ndarray,
label_names: list[str],
task_name: str,
is_ordinal: bool = False,
) -> dict:
"""Compute comprehensive metrics for a single classification task."""
num_classes = len(label_names)
# Basic classification metrics
macro_f1 = f1_score(labels, preds, average="macro", labels=range(num_classes))
weighted_f1 = f1_score(labels, preds, average="weighted", labels=range(num_classes))
per_class_f1 = f1_score(labels, preds, average=None, labels=range(num_classes))
per_class_precision = precision_score(labels, preds, average=None, labels=range(num_classes), zero_division=0)
per_class_recall = recall_score(labels, preds, average=None, labels=range(num_classes), zero_division=0)
macro_precision = precision_score(labels, preds, average="macro", labels=range(num_classes), zero_division=0)
macro_recall = recall_score(labels, preds, average="macro", labels=range(num_classes), zero_division=0)
# MCC (multiclass)
mcc = matthews_corrcoef(labels, preds)
# AUC (one-vs-rest, macro)
try:
auc = roc_auc_score(labels, probs, multi_class="ovr", average="macro", labels=range(num_classes))
except ValueError:
auc = float("nan")
# Confusion matrix
cm = confusion_matrix(labels, preds, labels=range(num_classes))
# Calibration
ece, bin_data = compute_ece(probs, labels)
metrics = {
f"{task_name}_macro_f1": macro_f1,
f"{task_name}_weighted_f1": weighted_f1,
f"{task_name}_macro_precision": macro_precision,
f"{task_name}_macro_recall": macro_recall,
f"{task_name}_mcc": mcc,
f"{task_name}_auc": auc,
f"{task_name}_ece": ece,
f"{task_name}_confusion_matrix": cm.tolist(),
f"{task_name}_calibration": bin_data,
}
for i, name in enumerate(label_names):
short = name.replace(" ", "").replace("/", "").replace(":", "")[:8]
metrics[f"{task_name}_f1_{short}"] = per_class_f1[i]
metrics[f"{task_name}_prec_{short}"] = per_class_precision[i]
metrics[f"{task_name}_recall_{short}"] = per_class_recall[i]
# Ordinal-specific metrics
if is_ordinal:
metrics[f"{task_name}_qwk"] = cohen_kappa_score(labels, preds, weights="quadratic")
metrics[f"{task_name}_mae"] = mean_absolute_error(labels, preds)
# Krippendorff's alpha (treat model + reference as two coders)
reliability_data = np.array([labels, preds])
if is_ordinal:
metrics[f"{task_name}_kripp_alpha"] = krippendorff.alpha(
reliability_data=reliability_data, level_of_measurement="ordinal"
)
else:
metrics[f"{task_name}_kripp_alpha"] = krippendorff.alpha(
reliability_data=reliability_data, level_of_measurement="nominal"
)
return metrics
def generate_figures(
results: dict,
output_dir: Path,
model_name: str,
ref_name: str,
) -> list[str]:
"""Generate publication-ready figures. Returns list of saved file paths."""
figures_dir = output_dir / "figures"
figures_dir.mkdir(parents=True, exist_ok=True)
saved = []
sns.set_theme(style="whitegrid", font_scale=1.1)
palette = sns.color_palette("viridis", 7)
# 1. Category confusion matrix
cm = np.array(results["cat_confusion_matrix"])
fig, ax = plt.subplots(figsize=(10, 8))
    cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True).clip(min=1)
sns.heatmap(
cm_norm, annot=cm, fmt="d", cmap="Blues",
xticklabels=[c[:12] for c in CATEGORIES],
yticklabels=[c[:12] for c in CATEGORIES],
ax=ax, vmin=0, vmax=1, cbar_kws={"label": "Proportion"},
)
ax.set_xlabel(f"Predicted ({model_name})")
ax.set_ylabel(f"Reference ({ref_name})")
ax.set_title(f"Category Confusion Matrix — {model_name} vs {ref_name}")
plt.tight_layout()
path = figures_dir / f"confusion_cat_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 2. Specificity confusion matrix
cm_spec = np.array(results["spec_confusion_matrix"])
fig, ax = plt.subplots(figsize=(7, 6))
cm_spec_norm = cm_spec.astype(float) / cm_spec.sum(axis=1, keepdims=True).clip(min=1)
sns.heatmap(
cm_spec_norm, annot=cm_spec, fmt="d", cmap="Oranges",
xticklabels=SPEC_LABELS,
yticklabels=SPEC_LABELS,
ax=ax, vmin=0, vmax=1, cbar_kws={"label": "Proportion"},
)
ax.set_xlabel(f"Predicted ({model_name})")
ax.set_ylabel(f"Reference ({ref_name})")
ax.set_title(f"Specificity Confusion Matrix — {model_name} vs {ref_name}")
plt.tight_layout()
path = figures_dir / f"confusion_spec_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 3. Calibration reliability diagram (category)
cal = results["cat_calibration"]
fig, ax = plt.subplots(figsize=(6, 6))
ax.bar(cal["bin_centers"], cal["bin_accs"], width=0.06, alpha=0.7, label="Accuracy", color="#4C72B0")
ax.plot([0, 1], [0, 1], "k--", alpha=0.5, label="Perfect calibration")
ax.set_xlabel("Confidence")
ax.set_ylabel("Accuracy")
ax.set_title(f"Category Calibration — ECE={results['cat_ece']:.4f}")
ax.legend()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
path = figures_dir / f"calibration_cat_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 4. Per-class F1 bar chart
cat_f1s = [results.get(f"cat_f1_{c.replace(' ', '').replace('/', '')[:8]}", 0) for c in CATEGORIES]
spec_f1s = [results.get(f"spec_f1_{s.replace(' ', '').replace(':', '')[:8]}", 0) for s in SPEC_LABELS]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
bars = ax1.barh([c[:15] for c in CATEGORIES], cat_f1s, color=palette)
ax1.axvline(0.80, color="red", linestyle="--", alpha=0.7, label="Target (0.80)")
ax1.set_xlabel("F1 Score")
ax1.set_title(f"Per-Category F1 ({model_name} vs {ref_name})")
ax1.set_xlim(0, 1)
ax1.legend()
for bar, v in zip(bars, cat_f1s):
ax1.text(v + 0.01, bar.get_y() + bar.get_height() / 2, f"{v:.3f}", va="center", fontsize=9)
bars2 = ax2.barh(SPEC_LABELS, spec_f1s, color=sns.color_palette("YlOrRd", 4))
ax2.axvline(0.80, color="red", linestyle="--", alpha=0.7, label="Target (0.80)")
ax2.set_xlabel("F1 Score")
ax2.set_title(f"Per-Level Specificity F1 ({model_name} vs {ref_name})")
ax2.set_xlim(0, 1)
ax2.legend()
for bar, v in zip(bars2, spec_f1s):
ax2.text(v + 0.01, bar.get_y() + bar.get_height() / 2, f"{v:.3f}", va="center", fontsize=9)
plt.tight_layout()
path = figures_dir / f"per_class_f1_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
return saved
def generate_comparison_figures(
all_results: dict[str, dict],
output_dir: Path,
) -> list[str]:
"""Generate figures comparing multiple models/references."""
figures_dir = output_dir / "figures"
figures_dir.mkdir(parents=True, exist_ok=True)
saved = []
sns.set_theme(style="whitegrid", font_scale=1.1)
# 1. Model comparison bar chart (if we have CORAL vs Independent)
model_names = list(all_results.keys())
if len(model_names) >= 2:
metric_keys = ["cat_macro_f1", "spec_macro_f1", "cat_mcc", "spec_qwk", "spec_mae"]
metric_labels = ["Cat Macro F1", "Spec Macro F1", "Cat MCC", "Spec QWK", "Spec MAE"]
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(metric_labels))
width = 0.8 / len(model_names)
for i, name in enumerate(model_names):
vals = []
for k in metric_keys:
v = all_results[name].get(k, 0)
vals.append(v if not np.isnan(v) else 0)
bars = ax.bar(x + i * width - 0.4 + width / 2, vals, width, label=name, alpha=0.85)
for bar, v in zip(bars, vals):
ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
f"{v:.3f}", ha="center", va="bottom", fontsize=8)
ax.set_xticks(x)
ax.set_xticklabels(metric_labels)
ax.set_ylabel("Score")
ax.set_title("Model Comparison — All Metrics")
        ax.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="F1 Target")
        ax.legend()
plt.tight_layout()
path = figures_dir / "model_comparison.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 2. Speed/cost comparison
speed_data = {}
for name, res in all_results.items():
if "avg_ms_per_sample" in res:
speed_data[name] = res["avg_ms_per_sample"]
if speed_data:
# Add GenAI panel reference speeds from docs
genai_speeds = {
"GPT-5.4 (API)": 2900, # from provenance data
"Opus-4.6 (API)": 6000, # from provenance data
}
fig, ax = plt.subplots(figsize=(10, 5))
all_names = list(speed_data.keys()) + list(genai_speeds.keys())
all_speeds = list(speed_data.values()) + list(genai_speeds.values())
colors = ["#4C72B0"] * len(speed_data) + ["#DD8452"] * len(genai_speeds)
bars = ax.barh(all_names, all_speeds, color=colors)
ax.set_xlabel("Latency (ms per paragraph)")
ax.set_title("Inference Speed Comparison")
ax.set_xscale("log")
for bar, v in zip(bars, all_speeds):
ax.text(v * 1.1, bar.get_y() + bar.get_height() / 2,
f"{v:.1f}ms", va="center", fontsize=9)
plt.tight_layout()
path = figures_dir / "speed_comparison.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
return saved
def format_report(
model_name: str,
ref_name: str,
metrics: dict,
inference_stats: dict,
) -> str:
"""Format a human-readable evaluation report."""
lines = []
lines.append(f"\n{'='*70}")
lines.append(f" HOLDOUT EVALUATION: {model_name} vs {ref_name}")
lines.append(f"{'='*70}\n")
lines.append(f" Samples evaluated: {inference_stats['num_samples']}")
lines.append(f" Total inference time: {inference_stats['total_time_s']:.2f}s")
lines.append(f" Avg latency: {inference_stats['avg_ms_per_sample']:.2f}ms/sample")
lines.append(f" Throughput: {1000 / inference_stats['avg_ms_per_sample']:.0f} samples/sec\n")
# Category metrics
    lines.append(f" {'─'*50}")
    lines.append(" CATEGORY CLASSIFICATION")
    lines.append(f" {'─'*50}")
    lines.append(f" Macro F1: {metrics['cat_macro_f1']:.4f} {'✓' if metrics['cat_macro_f1'] >= 0.80 else '✗'} (target: 0.80)")
lines.append(f" Weighted F1: {metrics['cat_weighted_f1']:.4f}")
lines.append(f" Macro Prec: {metrics['cat_macro_precision']:.4f}")
lines.append(f" Macro Recall: {metrics['cat_macro_recall']:.4f}")
lines.append(f" MCC: {metrics['cat_mcc']:.4f}")
lines.append(f" AUC (OvR): {metrics['cat_auc']:.4f}")
lines.append(f" ECE: {metrics['cat_ece']:.4f}")
lines.append(f" Kripp Alpha: {metrics['cat_kripp_alpha']:.4f}")
lines.append("")
lines.append(f" {'Category':<25} {'F1':>8} {'Prec':>8} {'Recall':>8}")
lines.append(f" {'-'*25} {'-'*8} {'-'*8} {'-'*8}")
for c in CATEGORIES:
short = c.replace(" ", "").replace("/", "")[:8]
f1 = metrics.get(f"cat_f1_{short}", 0)
prec = metrics.get(f"cat_prec_{short}", 0)
rec = metrics.get(f"cat_recall_{short}", 0)
lines.append(f" {c:<25} {f1:>8.4f} {prec:>8.4f} {rec:>8.4f}")
# Specificity metrics
    lines.append(f"\n {'─'*50}")
    lines.append(" SPECIFICITY CLASSIFICATION")
    lines.append(f" {'─'*50}")
    lines.append(f" Macro F1: {metrics['spec_macro_f1']:.4f} {'✓' if metrics['spec_macro_f1'] >= 0.80 else '✗'} (target: 0.80)")
lines.append(f" Weighted F1: {metrics['spec_weighted_f1']:.4f}")
lines.append(f" Macro Prec: {metrics['spec_macro_precision']:.4f}")
lines.append(f" Macro Recall: {metrics['spec_macro_recall']:.4f}")
lines.append(f" MCC: {metrics['spec_mcc']:.4f}")
    lines.append(f" AUC (OvR): {metrics['spec_auc']:.4f}")
    lines.append(f" QWK: {metrics['spec_qwk']:.4f}")
    lines.append(f" MAE: {metrics['spec_mae']:.4f}")
    lines.append(f" ECE: {metrics['spec_ece']:.4f}")
    lines.append(f" Kripp Alpha: {metrics['spec_kripp_alpha']:.4f}")
    lines.append("")
    lines.append(f" {'Level':<25} {'F1':>8} {'Prec':>8} {'Recall':>8}")
    lines.append(f" {'-'*25} {'-'*8} {'-'*8} {'-'*8}")
    for s in SPEC_LABELS:
        short = s.replace(" ", "").replace(":", "")[:8]
        f1 = metrics.get(f"spec_f1_{short}", 0)
        prec = metrics.get(f"spec_prec_{short}", 0)
        rec = metrics.get(f"spec_recall_{short}", 0)
        lines.append(f" {s:<25} {f1:>8.4f} {prec:>8.4f} {rec:>8.4f}")
    lines.append(f"\n{'='*70}\n")
    return "\n".join(lines)


def evaluate(config: EvalConfig) -> dict:
    """Run full holdout evaluation. Returns all metrics + generates figures."""
    output_dir = Path(config.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"\n Device: {device}")

    # Load holdout data
    print(" Loading holdout data...")
    records = load_holdout_data(
        config.paragraphs_path,
        config.holdout_path,
        config.benchmark_paths,
    )
    print(f" Holdout paragraphs: {len(records)}")

    # Load model
    print(f" Loading model from {config.checkpoint_path}...")
    model, tokenizer = load_model(config, device)
    param_count = sum(p.numel() for p in model.parameters()) / 1e6
    print(f" Model parameters: {param_count:.0f}M")

    # Run inference
    print(" Running inference...")
    inference = run_inference(
        model, tokenizer, records,
        config.max_seq_length, config.batch_size,
        device, config.specificity_head,
    )
    print(f" Inference complete: {inference['total_time_s']:.2f}s ({inference['avg_ms_per_sample']:.2f}ms/sample)")

    # Evaluate against each benchmark
    all_results = {}
    all_figures = []
    model_name = Path(config.checkpoint_path).parent.name
    if model_name == "final":
        model_name = Path(config.checkpoint_path).parent.parent.name

    for ref_name, ref_path in config.benchmark_paths.items():
        print(f"\n Evaluating against {ref_name}...")
        # Get reference labels for paragraphs that have them
        cat_labels = []
        spec_labels = []
        cat_preds = []
        spec_preds = []
        cat_probs = []
        spec_probs = []
        for i, rec in enumerate(records):
            bench = rec["benchmark_labels"].get(ref_name)
            if bench is None:
                continue
            cat_labels.append(CAT2ID[bench["category"]])
            spec_labels.append(bench["specificity"] - 1)  # 0-indexed
            cat_preds.append(inference["cat_preds"][i])
            spec_preds.append(inference["spec_preds"][i])
            cat_probs.append(inference["cat_probs"][i])
            spec_probs.append(inference["spec_probs"][i])
        cat_labels = np.array(cat_labels)
        spec_labels = np.array(spec_labels)
        cat_preds = np.array(cat_preds)
        spec_preds = np.array(spec_preds)
        cat_probs = np.array(cat_probs)
        spec_probs = np.array(spec_probs)
        print(f" Matched samples: {len(cat_labels)}")

        # Compute metrics
        cat_metrics = compute_all_metrics(
            cat_preds, cat_labels, cat_probs, CATEGORIES, "cat", is_ordinal=False
        )
        spec_metrics = compute_all_metrics(
            spec_preds, spec_labels, spec_probs, SPEC_LABELS, "spec", is_ordinal=True
        )
        combined = {**cat_metrics, **spec_metrics, **inference}
        combined["combined_macro_f1"] = (combined["cat_macro_f1"] + combined["spec_macro_f1"]) / 2

        # Print report
        report = format_report(model_name, ref_name, combined, inference)
        print(report)

        # Save report
        report_path = output_dir / f"report_{ref_name.lower().replace(' ', '_').replace('.', '')}.txt"
        with open(report_path, "w") as f:
            f.write(report)

        # Generate figures
        figs = generate_figures(combined, output_dir, model_name, ref_name)
        all_figures.extend(figs)
        all_results[f"{model_name}_vs_{ref_name}"] = combined

    # Cross-model comparison figures
    comp_figs = generate_comparison_figures(all_results, output_dir)
    all_figures.extend(comp_figs)

    # Save all metrics as JSON
    serializable = {}
    for k, v in all_results.items():
        serializable[k] = {
            mk: mv for mk, mv in v.items()
            if isinstance(mv, (int, float, str, list, bool))
        }
    metrics_path = output_dir / "metrics.json"
    with open(metrics_path, "w") as f:
        json.dump(serializable, f, indent=2, default=str)

    print(f"\n Results saved to {output_dir}")
    print(f" Figures: {len(all_figures)} generated")
    for fig_path in all_figures:
        print(f" {fig_path}")
    return all_results


def evaluate_comparison(
    configs: list[EvalConfig],
    output_dir: str,
) -> dict:
    """Evaluate multiple model checkpoints and generate cross-model comparisons."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    all_results = {}
    for config in configs:
        results = evaluate(config)
        all_results.update(results)
    # Generate comparison figures across all models
    comp_figs = generate_comparison_figures(all_results, output_dir)
    print(f"\n Comparison figures: {len(comp_figs)}")
    return all_results
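The headline numbers in the reports below (macro F1, QWK, MAE) correspond to standard scikit-learn calls. A toy sketch on made-up labels, assuming `compute_all_metrics` (not shown in this diff) wraps scikit-learn:

```python
# Toy illustration of the headline metrics on hypothetical 4-level
# ordinal labels (0-indexed specificity); NOT holdout data.
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

y_true = np.array([0, 1, 2, 3, 3, 2, 1, 0])
y_pred = np.array([0, 1, 2, 3, 2, 2, 0, 0])

macro_f1 = f1_score(y_true, y_pred, average="macro")          # unweighted mean of per-class F1
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")  # quadratic-weighted kappa
mae = np.abs(y_true - y_pred).mean()                          # mean ordinal distance of errors

print(f"macro F1 {macro_f1:.4f}  QWK {qwk:.4f}  MAE {mae:.4f}")
```

Because QWK and MAE weight errors by ordinal distance, they can stay strong (as they do below) even when macro F1 collapses on the middle levels.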

View File

@ -0,0 +1,298 @@
{
"best-base_weighted_ce-ep5_vs_GPT-5.4": {
"cat_macro_f1": 0.9360575579144376,
"cat_weighted_f1": 0.9360564701876355,
"cat_macro_precision": 0.9336791798534633,
"cat_macro_recall": 0.9414080218768329,
"cat_mcc": 0.9248088496355107,
"cat_auc": 0.991343460842945,
"cat_ece": 0.0441274690628052,
"cat_confusion_matrix": [
[
220,
0,
8,
0,
1,
0,
1
],
[
0,
86,
0,
0,
1,
1,
0
],
[
1,
0,
143,
1,
5,
0,
0
],
[
0,
0,
2,
133,
0,
1,
0
],
[
6,
0,
7,
18,
165,
1,
1
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
1,
6,
1,
169
]
],
"cat_f1_BoardGov": 0.962800875273523,
"cat_prec_BoardGov": 0.9691629955947136,
"cat_recall_BoardGov": 0.9565217391304348,
"cat_f1_Incident": 0.9717514124293786,
"cat_prec_Incident": 0.9662921348314607,
"cat_recall_Incident": 0.9772727272727273,
"cat_f1_Manageme": 0.9196141479099679,
"cat_prec_Manageme": 0.8881987577639752,
"cat_recall_Manageme": 0.9533333333333334,
"cat_f1_NoneOthe": 0.8956228956228957,
"cat_prec_NoneOthe": 0.8260869565217391,
"cat_recall_NoneOthe": 0.9779411764705882,
"cat_f1_RiskMana": 0.873015873015873,
"cat_prec_RiskMana": 0.9166666666666666,
"cat_recall_RiskMana": 0.8333333333333334,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9712643678160919,
"cat_prec_Third-Pa": 0.9883040935672515,
"cat_recall_Third-Pa": 0.9548022598870056,
"cat_kripp_alpha": 0.9243601922903683,
"spec_macro_f1": 0.5970357338282843,
"spec_weighted_f1": 0.7040798408451929,
"spec_macro_precision": 0.7225196233593912,
"spec_macro_recall": 0.6139005306639329,
"spec_mcc": 0.6138700055328291,
"spec_auc": 0.9498756282617218,
"spec_ece": 0.1652249880135059,
"spec_confusion_matrix": [
[
596,
6,
11,
5
],
[
105,
46,
9,
8
],
[
14,
6,
52,
135
],
[
4,
0,
3,
200
]
],
"spec_f1_L1Generi": 0.8915482423335827,
"spec_prec_L1Generi": 0.8289290681502086,
"spec_recall_L1Generi": 0.9644012944983819,
"spec_f1_L2Domain": 0.40707964601769914,
"spec_prec_L2Domain": 0.7931034482758621,
"spec_recall_L2Domain": 0.27380952380952384,
"spec_f1_L3Firm-S": 0.36879432624113473,
"spec_prec_L3Firm-S": 0.6933333333333334,
"spec_recall_L3Firm-S": 0.25120772946859904,
"spec_f1_L4Quanti": 0.7207207207207207,
"spec_prec_L4Quanti": 0.5747126436781609,
"spec_recall_L4Quanti": 0.966183574879227,
"spec_qwk": 0.8757404773441285,
"spec_mae": 0.2975,
"spec_kripp_alpha": 0.8479072400833478,
"total_time_s": 6.695346015971154,
"num_samples": 1200,
"avg_ms_per_sample": 5.579455013309295,
"combined_macro_f1": 0.766546645871361
},
"best-base_weighted_ce-ep5_vs_Opus-4.6": {
"cat_macro_f1": 0.9280167387549427,
"cat_weighted_f1": 0.9273898648954128,
"cat_macro_precision": 0.9223465490796974,
"cat_macro_recall": 0.9382296607170699,
"cat_mcc": 0.9162751746063641,
"cat_auc": 0.992382433433919,
"cat_ece": 0.04689237485329312,
"cat_confusion_matrix": [
[
209,
0,
2,
1,
1,
0,
1
],
[
0,
78,
0,
0,
1,
0,
0
],
[
4,
0,
147,
2,
4,
0,
1
],
[
0,
0,
1,
139,
1,
0,
0
],
[
13,
1,
10,
15,
168,
1,
5
],
[
1,
10,
1,
4,
3,
209,
0
],
[
0,
0,
0,
0,
2,
1,
164
]
],
"cat_f1_BoardGov": 0.9478458049886621,
"cat_prec_BoardGov": 0.920704845814978,
"cat_recall_BoardGov": 0.9766355140186916,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9216300940438872,
"cat_prec_Manageme": 0.9130434782608695,
"cat_recall_Manageme": 0.930379746835443,
"cat_f1_NoneOthe": 0.9205298013245033,
"cat_prec_NoneOthe": 0.8633540372670807,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.8549618320610687,
"cat_prec_RiskMana": 0.9333333333333333,
"cat_recall_RiskMana": 0.7887323943661971,
"cat_f1_Strategy": 0.9521640091116174,
"cat_prec_Strategy": 0.990521327014218,
"cat_recall_Strategy": 0.9166666666666666,
"cat_f1_Third-Pa": 0.9704142011834319,
"cat_prec_Third-Pa": 0.9590643274853801,
"cat_recall_Third-Pa": 0.9820359281437125,
"cat_kripp_alpha": 0.9154955768233572,
"spec_macro_f1": 0.5957642708821952,
"spec_weighted_f1": 0.693005282664721,
"spec_macro_precision": 0.731933400476396,
"spec_macro_recall": 0.6249872364065566,
"spec_mcc": 0.6143201053040909,
"spec_auc": 0.9470735892830423,
"spec_ece": 0.18189165468017254,
"spec_confusion_matrix": [
[
592,
4,
4,
5
],
[
92,
42,
6,
5
],
[
35,
12,
63,
150
],
[
0,
0,
2,
188
]
],
"spec_f1_L1Generi": 0.8942598187311178,
"spec_prec_L1Generi": 0.8233657858136301,
"spec_recall_L1Generi": 0.9785123966942149,
"spec_f1_L2Domain": 0.41379310344827586,
"spec_prec_L2Domain": 0.7241379310344828,
"spec_recall_L2Domain": 0.2896551724137931,
"spec_f1_L3Firm-S": 0.3761194029850746,
"spec_prec_L3Firm-S": 0.84,
"spec_recall_L3Firm-S": 0.2423076923076923,
"spec_f1_L4Quanti": 0.6988847583643123,
"spec_prec_L4Quanti": 0.5402298850574713,
"spec_recall_L4Quanti": 0.9894736842105263,
"spec_qwk": 0.872110225054491,
"spec_mae": 0.3075,
"spec_kripp_alpha": 0.8502616991488389,
"total_time_s": 6.695346015971154,
"num_samples": 1200,
"avg_ms_per_sample": 5.579455013309295,
"combined_macro_f1": 0.7618905048185689
}
}

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.70s
Avg latency: 5.58ms/sample
Throughput: 179 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9361 ✓ (target: 0.80)
Weighted F1: 0.9361
Macro Prec: 0.9337
Macro Recall: 0.9414
MCC: 0.9248
AUC (OvR): 0.9913
ECE: 0.0441
Kripp Alpha: 0.9244
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9628 0.9692 0.9565
Incident Disclosure 0.9718 0.9663 0.9773
Management Role 0.9196 0.8882 0.9533
None/Other 0.8956 0.8261 0.9779
Risk Management Process 0.8730 0.9167 0.8333
Strategy Integration 0.9583 0.9810 0.9367
Third-Party Risk 0.9713 0.9883 0.9548
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.5970 ✗ (target: 0.80)
Weighted F1: 0.7041
Macro Prec: 0.7225
Macro Recall: 0.6139
MCC: 0.6139
AUC (OvR): 0.9499
QWK: 0.8757
MAE: 0.2975
ECE: 0.1652
Kripp Alpha: 0.8479
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.8915 0.8289 0.9644
L2: Domain 0.4071 0.7931 0.2738
L3: Firm-Specific 0.3688 0.6933 0.2512
L4: Quantified 0.7207 0.5747 0.9662
======================================================================
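The `Kripp Alpha` rows above are Krippendorff's alpha. For the two-rater, no-missing-data case used here (model vs. one reference), the nominal-scale version reduces to a few lines; a minimal sketch (not the eval.py implementation, which is not shown in this diff):

```python
import numpy as np

def kripp_alpha_nominal(r1, r2):
    """Krippendorff's alpha (nominal) for two raters, no missing data."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    values = np.concatenate([r1, r2])
    n = len(values)  # total number of ratings across both raters
    # Observed disagreement: each unit contributes its (a,b) and (b,a) pairs.
    d_o = 2 * np.sum(r1 != r2) / n
    # Expected disagreement from the pooled value frequencies.
    _, counts = np.unique(values, return_counts=True)
    d_e = (n * n - np.sum(counts ** 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

print(kripp_alpha_nominal([0, 0, 1, 1], [0, 0, 1, 0]))  # ≈ 0.533 (= 8/15)
```

The ordinal variant reported for specificity additionally weights disagreements by distance between levels, which is why it tracks QWK closely.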

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.70s
Avg latency: 5.58ms/sample
Throughput: 179 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9280 ✓ (target: 0.80)
Weighted F1: 0.9274
Macro Prec: 0.9223
Macro Recall: 0.9382
MCC: 0.9163
AUC (OvR): 0.9924
ECE: 0.0469
Kripp Alpha: 0.9155
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9478 0.9207 0.9766
Incident Disclosure 0.9286 0.8764 0.9873
Management Role 0.9216 0.9130 0.9304
None/Other 0.9205 0.8634 0.9858
Risk Management Process 0.8550 0.9333 0.7887
Strategy Integration 0.9522 0.9905 0.9167
Third-Party Risk 0.9704 0.9591 0.9820
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.5958 ✗ (target: 0.80)
Weighted F1: 0.6930
Macro Prec: 0.7319
Macro Recall: 0.6250
MCC: 0.6143
AUC (OvR): 0.9471
QWK: 0.8721
MAE: 0.3075
ECE: 0.1819
Kripp Alpha: 0.8503
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.8943 0.8234 0.9785
L2: Domain 0.4138 0.7241 0.2897
L3: Firm-Specific 0.3761 0.8400 0.2423
L4: Quantified 0.6989 0.5402 0.9895
======================================================================


View File

@ -0,0 +1,298 @@
{
"iter1-independent_vs_GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.991991833154947,
"cat_ece": 0.053848127176364245,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8951731906425856,
"spec_weighted_f1": 0.9121524819510628,
"spec_macro_precision": 0.8980417155129858,
"spec_macro_recall": 0.8930560580782194,
"spec_mcc": 0.866381831963237,
"spec_auc": 0.981666223606385,
"spec_ece": 0.07135417198141418,
"spec_confusion_matrix": [
[
580,
23,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9354838709677419,
"spec_prec_L1Generi": 0.932475884244373,
"spec_recall_L1Generi": 0.9385113268608414,
"spec_f1_L2Domain": 0.7975460122699386,
"spec_prec_L2Domain": 0.8227848101265823,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9324447137231142,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.917725722448833,
"total_time_s": 6.732117835083045,
"num_samples": 1200,
"avg_ms_per_sample": 5.6100981959025376,
"combined_macro_f1": 0.9144236534059689
},
"iter1-independent_vs_Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.9939660707189948,
"cat_ece": 0.06551479384303091,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8833694419146193,
"spec_weighted_f1": 0.9004034318676798,
"spec_macro_precision": 0.8858989636247611,
"spec_macro_recall": 0.8854684685880032,
"spec_mcc": 0.8500778641433316,
"spec_auc": 0.9736633898988131,
"spec_ece": 0.08248284702499709,
"spec_confusion_matrix": [
[
567,
30,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9242053789731052,
"spec_prec_L1Generi": 0.9115755627009646,
"spec_recall_L1Generi": 0.9371900826446281,
"spec_f1_L2Domain": 0.7788778877887789,
"spec_prec_L2Domain": 0.7468354430379747,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9227008860372746,
"spec_mae": 0.13583333333333333,
"spec_kripp_alpha": 0.9065248741550552,
"total_time_s": 6.732117835083045,
"num_samples": 1200,
"avg_ms_per_sample": 5.6100981959025376,
"combined_macro_f1": 0.9030269144688962
}
}

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-independent vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.73s
Avg latency: 5.61ms/sample
Throughput: 178 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9337 ✓ (target: 0.80)
Weighted F1: 0.9343
Macro Prec: 0.9319
Macro Recall: 0.9378
MCC: 0.9227
AUC (OvR): 0.9920
ECE: 0.0538
Kripp Alpha: 0.9224
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9719 0.9657 0.9783
Incident Disclosure 0.9605 0.9551 0.9659
Management Role 0.9412 0.9231 0.9600
None/Other 0.8881 0.8239 0.9632
Risk Management Process 0.8564 0.8865 0.8283
Strategy Integration 0.9583 0.9810 0.9367
Third-Party Risk 0.9593 0.9880 0.9322
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8952 ✓ (target: 0.80)
Weighted F1: 0.9122
Macro Prec: 0.8980
Macro Recall: 0.8931
MCC: 0.8664
AUC (OvR): 0.9817
QWK: 0.9324
MAE: 0.1175
ECE: 0.0714
Kripp Alpha: 0.9177
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9355 0.9325 0.9385
L2: Domain 0.7975 0.8228 0.7738
L3: Firm-Specific 0.8941 0.8716 0.9179
L4: Quantified 0.9535 0.9653 0.9420
======================================================================

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-independent vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.73s
Avg latency: 5.61ms/sample
Throughput: 178 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9227 ✓ (target: 0.80)
Weighted F1: 0.9216
Macro Prec: 0.9178
Macro Recall: 0.9316
MCC: 0.9093
AUC (OvR): 0.9940
ECE: 0.0655
Kripp Alpha: 0.9086
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9441 0.9056 0.9860
Incident Disclosure 0.9286 0.8764 0.9873
Management Role 0.9172 0.9231 0.9114
None/Other 0.9200 0.8679 0.9787
Risk Management Process 0.8492 0.9135 0.7934
Strategy Integration 0.9476 0.9858 0.9123
Third-Party Risk 0.9521 0.9521 0.9521
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8834 ✓ (target: 0.80)
Weighted F1: 0.9004
Macro Prec: 0.8859
Macro Recall: 0.8855
MCC: 0.8501
AUC (OvR): 0.9737
QWK: 0.9227
MAE: 0.1358
ECE: 0.0825
Kripp Alpha: 0.9065
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9242 0.9116 0.9372
L2: Domain 0.7789 0.7468 0.8138
L3: Firm-Specific 0.8661 0.9495 0.7962
L4: Quantified 0.9643 0.9356 0.9947
======================================================================
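The ECE rows in these reports are expected calibration error: the bin-weighted gap between accuracy and confidence. One common sketch with equal-width confidence bins (the binning scheme in eval.py may differ):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over equal-width confidence bins: sum of bin_weight * |acc - conf|."""
    conf = probs.max(axis=1)                                 # predicted-class confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Overconfident toy model: 75% confident but only 50% accurate -> ECE 0.25
probs = np.array([[0.75, 0.25]] * 4)
labels = np.array([0, 0, 1, 1])
print(expected_calibration_error(probs, labels))
```

Under this reading, the specificity ECE of ~0.17-0.18 for the weighted-CE checkpoint says its confidence overstates accuracy by roughly that margin on average, consistent with the collapsed L2/L3 recall.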