diff --git a/.dockerignore b/.dockerignore
index 1fa5fa9..17e8804 100644
--- a/.dockerignore
+++ b/.dockerignore
@@ -3,7 +3,6 @@
# Allow only what the labelapp Dockerfile needs
!package.json
-!bun.lock
!packages/schemas/
!ts/package.json
@@ -14,9 +13,10 @@ labelapp/.env*
labelapp/playwright-report/
labelapp/test-results/
-# Seed data (only the two JSONL files we need)
+# Seed data
!data/paragraphs/paragraphs-clean.jsonl
!data/annotations/stage1.jsonl
+!data/gold/v2-holdout-ids.json
# Git/IDE
.git
diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md
index 7246ef7..7d8b9ac 100644
--- a/docs/NARRATIVE.md
+++ b/docs/NARRATIVE.md
@@ -575,6 +575,137 @@ As a proxy before human labels arrive, evaluation against GPT-5.4 and Opus bench
---
+## Phase 9: Holdout Evaluation — Proxy Gold Results
+
+### Evaluation Setup
+
+Built a comprehensive evaluation pipeline (`python/src/finetune/eval.py`) to test the trained model on the 1,200-paragraph holdout set. Since human gold labels were not yet available, we used two frontier API models as proxy references:
+
+- **GPT-5.4** (1,200 labels, ~$3,400/1M texts, ~2,900ms/sample)
+- **Opus-4.6** (1,200 labels, ~$5,000/1M texts, ~6,000ms/sample)
+
+Both references used the same v4.5 prompt as the Grok training labels but come from different model families — they provide independent validation that the fine-tuned model learned the construct, not just Grok's idiosyncrasies.
+
+The evaluation computed: macro/weighted F1, per-class F1, precision, recall, MCC, AUC (one-vs-rest), QWK, MAE, Krippendorff's alpha (nominal for category, ordinal for specificity), confusion matrices, and calibration (ECE).
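Most of these metrics come straight from scikit-learn, with ECE hand-rolled; a minimal sketch on toy arrays (illustrative only — not the project's `eval.py`):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score, matthews_corrcoef

def expected_calibration_error(conf, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the bin-size-weighted
    mean |accuracy - confidence| over the bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Toy specificity levels (1-4) scored against one reference
y_ref  = np.array([1, 2, 3, 4, 2, 3, 1, 4])
y_pred = np.array([1, 2, 3, 4, 1, 3, 1, 4])   # one L2->L1 miss

macro_f1 = f1_score(y_ref, y_pred, average="macro")
qwk = cohen_kappa_score(y_ref, y_pred, weights="quadratic")  # ordinal-aware
mcc = matthews_corrcoef(y_ref, y_pred)
mae = np.abs(y_ref - y_pred).mean()
ece = expected_calibration_error(np.full(8, 0.9), y_ref == y_pred)
```

Krippendorff's alpha is the one metric not in scikit-learn; it comes from the `krippendorff` package added to `pyproject.toml` in this change.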
+
+### Results: Independent Thresholds (Epoch 8, Best Model)
+
+| Metric | vs GPT-5.4 | vs Opus-4.6 |
+|--------|-----------|-------------|
+| **Cat Macro F1** | **0.934** | **0.923** |
+| **Spec Macro F1** | **0.895** | **0.883** |
+| Cat MCC | 0.923 | 0.909 |
+| Cat AUC (OvR) | 0.992 | 0.994 |
+| Spec QWK | 0.932 | 0.923 |
+| Spec MAE | 0.118 | 0.136 |
+| Cat Kripp α | 0.922 | 0.909 |
+| Spec Kripp α | 0.918 | 0.907 |
+| Cat ECE | 0.054 | 0.066 |
+| Throughput | **178 samples/sec** | — |
+| Latency | **5.6ms/sample** | — |
+
+Both heads pass the 0.80 macro F1 target by wide margins on held-out data against independent reference models.
+
+Per-class category F1 (vs GPT-5.4): Board Gov. 0.972, Incident Disc. 0.961, Mgmt Role 0.941, None/Other 0.888, Risk Mgmt Proc. 0.856, Strategy Int. 0.958, Third-Party 0.959. RMP is the weakest category (0.856) due to MR↔RMP boundary ambiguity, but still comfortably above target.
+
+Per-class specificity F1 (vs GPT-5.4): L1 0.936, L2 0.798, L3 0.894, L4 0.954. L2 is the weakest level — analyzed in detail below.
+
+### Results: CORAL Baseline (Epoch 5) — For Comparison
+
+| Metric | vs GPT-5.4 | vs Opus-4.6 |
+|--------|-----------|-------------|
+| Cat Macro F1 | 0.936 | 0.928 |
+| **Spec Macro F1** | **0.597** | **0.596** |
+| Spec QWK | 0.876 | 0.872 |
+
+Category performance is essentially identical between the two models — the shared backbone handles category well regardless of the specificity-head architecture. The +0.298 spec macro F1 gain is entirely attributable to the independent threshold heads.
+
+CORAL's confusion matrix reveals the mechanism: it collapses L2 (F1=0.407) and L3 (F1=0.369) into L1 and L4, predicting extreme levels because the shared weight vector can't represent the intermediate transitions. The independent threshold model's confusion matrix shows clean diagonals across all four levels.
+
+### Reference Agreement Ceiling
+
+A critical finding: **the model agrees with the references more than the references agree with each other.**
+
+| Comparison | Macro Spec F1 | L2 F1 |
+|-----------|---------------|-------|
+| GPT-5.4 vs Opus-4.6 | **0.885** | **0.805** |
+| Our model vs GPT-5.4 | **0.895** | 0.798 |
+| Our model vs Opus-4.6 | 0.883 | 0.776 |
+| Stage 1 Consensus vs GPT-5.4 | 0.911 | 0.845 |
+
+Our model's macro spec F1 (0.895) exceeds the inter-reference agreement (0.885). This means the model learned a "consensus position" that is more consistent than either individual reference. Further improvements against these proxy references are not meaningful — they would represent overfitting to one reference's idiosyncrasies rather than genuine improvement.
+
+The L2 F1 of 0.798 is within 0.007 of the reference ceiling (0.805). The L1↔L2 boundary is the hardest in the construct — it hinges on whether language is "domain-specific" enough to qualify (the ERM test). Paragraphs using quasi-domain language (e.g., "risk management program for cybersecurity") sit in a genuine gray zone where even frontier models disagree.
+
+### L2 Error Analysis
+
+The L2 confusion is directional. Against GPT-5.4:
+- 29 L2 paragraphs misclassified as L1 (model under-calls domain terminology)
+- 23 L1 paragraphs misclassified as L2 (model over-calls domain terminology)
+- Only 7 L2→L3 and 2 L2→L4 errors (higher transitions are clean)
+
+This is the L1↔L2 boundary problem in isolation — the model handles L2↔L3 and L3↔L4 transitions with high accuracy. The ERM test ("would an employee relations manager understand this language?") is inherently subjective at the margin.
+
+### Category × Specificity Joint Distribution
+
+The holdout set reveals strong correlation between category and specificity:
+
+| Category | L1 | L2 | L3 | L4 |
+|---------|-----|-----|-----|-----|
+| None/Other | **100%** | 0% | 0% | 0% |
+| Strategy Integration | **85%** | 10% | 2% | 3% |
+| Third-Party Risk | 62% | **22%** | 12% | 5% |
+| Risk Mgmt Process | 34% | **44%** | 16% | 6% |
+| Board Governance | 42% | 4% | **45%** | 9% |
+| Management Role | 13% | 3% | 29% | **54%** |
+| Incident Disclosure | 0% | 8% | 2% | **90%** |
+
+Despite this correlation, the current architecture treats specificity as category-independent (by design — per the codebook, specificity measures "how specific" regardless of "what about"). Making specificity category-dependent was considered but rejected: the cell sizes for many (category, spec_level) combinations are too small for reliable conditional modeling, and error propagation from category mistakes would corrupt specificity predictions. The strong correlations are already captured implicitly by the shared backbone. This remains a potential direction for future investigation with a larger dataset.
+
+### Sequence Length Analysis
+
+At max_seq_length=512, truncation is negligible:
+
+| Dataset | Mean tokens | P95 | P99 | Max | Truncated (>512) |
+|---------|------------|-----|-----|-----|-----------------|
+| All paragraphs (72K) | 114.6 | 240 | 350 | 678 | 139 (0.19%) |
+| Holdout (1,200) | 117.9 | 236 | 329 | 603 | 1 (0.08%) |
+
+SEC cybersecurity disclosure paragraphs are short by nature (median ~100 tokens). The 512-token limit is more than sufficient — raising it to 1024 would recover only the 139 truncated training paragraphs and the single truncated holdout paragraph.
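The table above reduces to a one-pass percentile computation. A sketch on synthetic lengths (the lognormal draw is a stand-in; in practice `lengths` would be `len(tokenizer(text)["input_ids"])` per paragraph, which is assumed here, not shown):

```python
import numpy as np

# Synthetic token counts mimicking short disclosure paragraphs (median ~100);
# hypothetical data — replace with real per-paragraph tokenizer lengths.
rng = np.random.default_rng(42)
lengths = rng.lognormal(mean=4.6, sigma=0.55, size=72_000).astype(int)

stats = {
    "mean": float(lengths.mean()),
    "p95": float(np.percentile(lengths, 95)),
    "p99": float(np.percentile(lengths, 99)),
    "max": int(lengths.max()),
    "truncated_frac": float((lengths > 512).mean()),  # share lost at max_seq_length=512
}
```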
+
+### Speed and Cost Comparison
+
+| System | Latency | Throughput | Cost/1M texts | Reproducible |
+|--------|---------|-----------|---------------|-------------|
+| **Fine-tuned specialist** | **5.6ms** | **178/sec** | **~$5** | **Yes** |
+| GPT-5.4 (API) | ~2,900ms | ~0.3/sec | ~$3,400 | No |
+| Opus-4.6 (API) | ~6,000ms | ~0.2/sec | ~$5,000 | No |
+
+The fine-tuned model is **520× faster** than GPT-5.4 and **1,070× faster** than Opus-4.6, at **~680-1,000× lower cost**, with comparable or better accuracy and full determinism.
+
+### Calibration
+
+The model is well-calibrated for category (ECE=0.054 vs GPT-5.4) and reasonably calibrated for specificity (ECE=0.071). The calibration plot shows slight overconfidence in the 0.7-0.9 range — consistent with the "benign overfitting" observed during training where the model became more confident without changing decision boundaries. Temperature scaling could improve calibration without affecting predictions (a single scalar adjustment on validation logits), which would be valuable for deployment confidence thresholds.
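Temperature scaling is a one-parameter fit; a minimal numpy sketch with a grid search over T on synthetic overconfident logits (illustrative — not the project's validation data):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 3.0, 51)):
    """Return the scalar T minimizing validation NLL.
    Dividing logits by T rescales confidence but never changes the argmax,
    so predictions are untouched — only the probabilities move."""
    def nll(T):
        p = softmax(logits / T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return min(grid, key=nll)

# Hypothetical toy model: 80% accurate but ~98% confident -> wants T > 1
labels = np.zeros(10, dtype=int)
logits = np.tile([4.0, 0.0], (10, 1))
logits[8:] = [0.0, 4.0]          # two confident mistakes
T = fit_temperature(logits, labels)
```

With a logit margin of 4 and 80% accuracy, NLL is minimized where the scaled confidence matches accuracy, i.e. T ≈ 4/ln 4 ≈ 2.89 — the fit lands on the nearest grid point.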
+
+### Remaining Opportunities
+
+**Threshold tuning (free, post-gold):** Once human gold labels arrive, grid-search the per-threshold sigmoid cutoffs. Currently all thresholds use 0.5 — the optimal L1→L2 cutoff may differ. This requires no retraining and could gain +0.01-0.02 on L2 F1.
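Because the independent-threshold head emits three sigmoid "level ≥ k" probabilities, decoding and tuning take only a few lines. A sketch of the post-gold grid search (the `(N, 3)` probability array and the cutoff grid are assumptions for illustration; the real head lives in `eval.py`):

```python
from itertools import product

import numpy as np
from sklearn.metrics import f1_score

def decode(probs, cutoffs):
    """probs: (N, 3) sigmoids for P(level>=2), P(level>=3), P(level>=4).
    Predicted level = 1 + number of thresholds exceeded
    (counting exceedances tolerates non-monotone sigmoid triples)."""
    return 1 + (probs > np.asarray(cutoffs)).sum(axis=1)

def tune_cutoffs(probs, gold, grid=np.round(np.arange(0.30, 0.71, 0.05), 2)):
    """Exhaustive search (9^3 = 729 combos) for the cutoff triple
    maximizing macro F1 against gold labels."""
    best, best_f1 = (0.5, 0.5, 0.5), -1.0
    for cut in product(grid, repeat=3):
        f1 = f1_score(gold, decode(probs, cut), average="macro", zero_division=0)
        if f1 > best_f1:
            best, best_f1 = cut, f1
    return best, best_f1

probs = np.array([[0.9, 0.1, 0.1],    # clear L2
                  [0.9, 0.9, 0.1],    # clear L3
                  [0.9, 0.9, 0.9],    # clear L4
                  [0.1, 0.1, 0.1]])   # clear L1
gold = np.array([2, 3, 4, 1])
best_cut, best_f1 = tune_cutoffs(probs, gold)
```

In practice only the first cutoff (the L1→L2 boundary) is expected to move off 0.5, per the error analysis above.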
+
+**Ensemble (3 seeds, +0.01-0.03 F1):** Train 3 models with seeds 42/43/44, average sigmoid outputs. Reduces variance on boundary cases and provides confidence intervals for reported metrics. Cost: 3× training time (~24h total), 3× inference time (~17ms/sample).
+
+**Temperature scaling (free, improves calibration only):** Fit a single temperature parameter on the validation set. Reduces ECE without changing predictions — relevant for deployment where confidence scores matter.
+
+**Larger specificity MLP (future investigation):** The current 256-dim MLP is efficient but may not capture the full complexity of subtle specificity distinctions. Larger heads (512-dim or 3-layer) could help if the dataset grows, but risk overfitting at current data scale.
+
+### Figures Generated
+
+All evaluation figures saved to `results/eval/`:
+- `iter1-independent/figures/` — confusion matrices (cat + spec), calibration reliability diagrams, per-class F1 bar charts (vs GPT-5.4 and Opus-4.6 separately)
+- `coral-baseline/figures/` — same set for CORAL baseline comparison
+- `comparison/` — side-by-side CORAL vs Independent (per-class F1 bars, all-metrics comparison, improvement delta chart, confusion matrix comparison, summary table)
+
+---
+
## v1 Reference
The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication (13-signal cross-analysis), codebook iterations v1.0–v3.5 — is preserved at `docs/NARRATIVE-v1.md`.
diff --git a/docs/SPECIFICITY-IMPROVEMENT-PLAN.md b/docs/SPECIFICITY-IMPROVEMENT-PLAN.md
index f58f05f..a3acbb3 100644
--- a/docs/SPECIFICITY-IMPROVEMENT-PLAN.md
+++ b/docs/SPECIFICITY-IMPROVEMENT-PLAN.md
@@ -129,3 +129,24 @@ eval F1 improvement. Best checkpoint: epoch 8 (spec F1=0.945).
Independent thresholds were the key insight — CORAL's shared weight vector was
the primary bottleneck. Attention pooling, MLP heads, and confidence filtering
all contributed. Tier 2 and Tier 3 ideas were not needed.
+
+### Holdout Evaluation (1,200 paragraphs, proxy gold)
+
+Validated on held-out data against two independent frontier model references:
+
+| Model | Ref | Cat F1 | Spec F1 | L2 F1 | Spec QWK |
+|-------|-----|--------|---------|-------|----------|
+| Independent (ep8) | GPT-5.4 | 0.934 | **0.895** | 0.798 | 0.932 |
+| Independent (ep8) | Opus-4.6 | 0.923 | **0.883** | 0.776 | 0.923 |
+| CORAL (ep5) | GPT-5.4 | 0.936 | 0.597 | 0.407 | 0.876 |
+| CORAL (ep5) | Opus-4.6 | 0.928 | 0.596 | 0.418 | 0.872 |
+| GPT-5.4 | Opus-4.6 | — | **0.885** | **0.805** | 0.919 |
+
+**Key finding:** The model's holdout spec F1 (0.895) exceeds the inter-reference
+agreement (0.885 between GPT-5.4 and Opus-4.6). The model has reached the
+construct reliability ceiling — further improvement requires cleaner reference
+labels, not a better model.
+
+**L2 is at ceiling:** Model L2 F1 (0.798) is within 0.007 of reference agreement
+(0.805). The L1↔L2 boundary is genuinely ambiguous. Remaining opportunity:
+per-threshold sigmoid tuning against human gold labels (potential +0.01-0.02).
diff --git a/docs/STATUS.md b/docs/STATUS.md
index 80bc802..b92a147 100644
--- a/docs/STATUS.md
+++ b/docs/STATUS.md
@@ -1,6 +1,6 @@
# Project Status — v2 Pipeline
-**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Fine-tuning done: cat F1=0.943, spec F1=0.945)
+**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Holdout eval done: cat F1=0.934, spec F1=0.895 vs GPT-5.4 proxy gold)
---
@@ -142,14 +142,27 @@
- **Improvement plan:** `docs/SPECIFICITY-IMPROVEMENT-PLAN.md`
### 13. Evaluation & Paper ← CURRENT
-- [ ] Proxy eval: run fine-tuned model on holdout, compare against GPT-5.4 and Opus benchmark labels
+- [x] Proxy eval: fine-tuned model on 1,200 holdout vs GPT-5.4 and Opus-4.6 proxy gold
+- [x] Full metrics suite: macro/per-class F1, precision, recall, MCC, AUC, QWK, MAE, Krippendorff's α, ECE, confusion matrices
+- [x] CORAL baseline comparison: same eval pipeline on CORAL epoch 5 checkpoint
+- [x] Figures: confusion matrices, calibration diagrams, per-class F1 bars, CORAL vs Independent comparison, speed/cost table
+- [x] Reference ceiling analysis: GPT-5.4 vs Opus-4.6 agreement = 0.885 macro spec F1 (our model exceeds this at 0.895)
+- [x] L2 error analysis: model L2 F1 (0.798) within 0.007 of reference ceiling (0.805)
+- [x] Sequence length analysis: only 139/72K paragraphs (0.19%) truncated at 512 tokens — negligible impact
+- [x] Opus labels completed: 1,200/1,200 (filled 16 missing from initial run)
- [ ] Macro F1 on holdout gold (target > 0.80 both heads) — blocked on human labels
-- [ ] Per-class F1 breakdown + GenAI benchmark table
-- [ ] Error analysis, cost comparison, IGNITE slides
+- [ ] Per-threshold sigmoid tuning against human gold (potential +0.01-0.02 on L2 F1)
+- [ ] Temperature scaling for improved calibration (ECE reduction without changing predictions)
+- [ ] Ensemble of 3 seeds for confidence intervals and potential +0.01-0.03 F1
+- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result
- [ ] Note in paper: CORAL ordinal regression insufficient for multi-signal ordinal classification
-- **Next:** evaluate fine-tuned model on holdout using GPT-5.4 + Opus labels as proxy gold
+- [ ] Note in paper: model exceeds inter-reference agreement — approaches ceiling of construct reliability
+- **Proxy gold results (vs GPT-5.4):** Cat F1=0.934, Spec F1=0.895, MCC=0.923/0.866 (cat/spec), AUC=0.992/0.982 (cat/spec), QWK=0.932
+- **Proxy gold results (vs Opus-4.6):** Cat F1=0.923, Spec F1=0.883, QWK=0.923
+- **Speed:** 5.6ms/sample (178/sec) — 520× faster than GPT-5.4, 1,070× faster than Opus
+- **Next:** deploy labelapp for human annotation, then gold evaluation + threshold tuning
---
@@ -177,7 +190,7 @@
| v2 holdout reference | `data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs) |
| v2 iteration archive | `data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl` |
| v4.5 boundary test | `data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs) |
-| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,184 paragraphs) |
+| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,200 paragraphs) |
| Opus +codebook | `data/annotations/golden/opus.jsonl` (includes v1 + v2 runs) |
| Grok self-consistency test | `data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs) |
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
@@ -199,6 +212,13 @@
| CORAL baseline (ablation winner) | `checkpoints/finetune/best-base_weighted_ce-ep5/final/` (cat=0.932, spec=0.517) |
| Ablation results | `checkpoints/finetune/ablation/ablation_results.json` |
| Spec improvement plan | `docs/SPECIFICITY-IMPROVEMENT-PLAN.md` |
+| Best model iter1 config | `python/configs/finetune/iter1-independent.yaml` |
+| Eval script | `python/src/finetune/eval.py` |
+| Eval results (best model) | `results/eval/iter1-independent/metrics.json` |
+| Eval results (CORAL) | `results/eval/coral-baseline/metrics.json` |
+| Comparison figures | `results/eval/comparison/` (5 charts) |
+| Per-model eval figures | `results/eval/iter1-independent/figures/` + `results/eval/coral-baseline/figures/` |
+| Comparison figure script | `python/scripts/generate-comparison-figures.py` |
### v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)
diff --git a/labelapp/Dockerfile b/labelapp/Dockerfile
index 047e9b4..6439774 100644
--- a/labelapp/Dockerfile
+++ b/labelapp/Dockerfile
@@ -1,14 +1,15 @@
# Build context: monorepo root (run: docker build -f labelapp/Dockerfile .)
-FROM oven/bun:1 AS base
+FROM oven/bun:1.3.8 AS base
# -- Install dependencies --
FROM base AS deps
WORKDIR /app
-COPY package.json bun.lock ./
+COPY package.json ./
COPY packages/schemas/package.json packages/schemas/
COPY ts/package.json ts/
COPY labelapp/package.json labelapp/
-RUN bun install --frozen-lockfile
+# bun.lock intentionally excluded — bun install hangs parsing the lockfile in Docker (upstream bun bug)
+RUN bun install
# -- Build Next.js --
FROM base AS builder
@@ -16,7 +17,7 @@ WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=deps /app/packages/schemas/node_modules ./packages/schemas/node_modules
COPY --from=deps /app/labelapp/node_modules ./labelapp/node_modules
-COPY package.json bun.lock ./
+COPY package.json ./
COPY packages/schemas/ packages/schemas/
COPY labelapp/ labelapp/
ENV NEXT_TELEMETRY_DISABLED=1
diff --git a/labelapp/app/api/metrics/route.ts b/labelapp/app/api/metrics/route.ts
index a486109..12ee914 100644
--- a/labelapp/app/api/metrics/route.ts
+++ b/labelapp/app/api/metrics/route.ts
@@ -100,7 +100,6 @@ export async function GET() {
-  // Filter to non-admin annotators for per-annotator stats
+  // Per-annotator stats (admin is no longer excluded)
const perAnnotator = allAnnotators
- .filter((a) => a.id !== "joey")
.map((a) => ({
id: a.id,
displayName: a.displayName,
@@ -132,7 +131,7 @@ export async function GET() {
-  // Collect all annotator IDs that have labels (excluding admin)
+  // Collect all annotator IDs that have labels (admin included)
const annotatorIds = [
...new Set(allLabels.map((l) => l.annotatorId)),
- ].filter((id) => id !== "joey");
+ ];
annotatorIds.sort();
// For each annotator pair, collect shared paragraph ratings
diff --git a/labelapp/app/codebook/page.tsx b/labelapp/app/codebook/page.tsx
index 0d85ff3..6a1497c 100644
--- a/labelapp/app/codebook/page.tsx
+++ b/labelapp/app/codebook/page.tsx
@@ -797,6 +797,13 @@ export default function CodebookPage() {
how company-specific the disclosure is. Apply the decision test in
order — stop at the first “yes.”
+
+ Specificity rates the ENTIRE paragraph — not just
+ the parts related to the content category you chose. If a Board
+ Governance paragraph also mentions CrowdStrike Falcon or the
+ CISO’s 20 years of experience, those facts count. Scan
+ everything, don’t filter by category.
+
{/* Decision Test */}
diff --git a/labelapp/app/label/page.tsx b/labelapp/app/label/page.tsx
index a0a5aa1..79ed36e 100644
--- a/labelapp/app/label/page.tsx
+++ b/labelapp/app/label/page.tsx
@@ -489,6 +489,9 @@ function CodebookSidebar() {
desc="Contains 1+ QV-eligible facts: specific numbers, dates, named external entities, named tools/products, verifiable certifications."
/>
+
+ Specificity rates the WHOLE paragraph — not just the category-relevant parts. Scan everything.
+
diff --git a/labelapp/lib/onboarding-content.ts b/labelapp/lib/onboarding-content.ts
index 9394e69..42dba5c 100644
--- a/labelapp/lib/onboarding-content.ts
+++ b/labelapp/lib/onboarding-content.ts
@@ -29,7 +29,7 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
"Management Role is broader: it now covers how management is ORGANIZED to handle cybersecurity — role allocation, committee structure, reporting lines — not just \"who a specific person is.\" Paragraphs about management structure without named individuals can be MR.",
"Specificity Level 2 is broader: renamed from \"Sector-Adapted\" to \"Domain-Adapted.\" Cybersecurity terms like penetration testing, vulnerability scanning, SIEM, and SOC now trigger Level 2. In v1, these were incorrectly classified as Level 1.",
"Level 4 requires just 1 QV fact (was 2+). No more counting. If an external party could verify even one claim in the paragraph — a dollar amount, a named tool, a specific date — it's Level 4.",
- "You'll be labeling 1,200 holdout paragraphs total. There are 5 annotators, with 3 labeling each paragraph. You'll see roughly 720.",
+ "You'll be labeling 1,200 holdout paragraphs total. There are 6 annotators, with 3 labeling each paragraph. You'll see roughly 600.",
],
keyPoints: [
"Same 7 categories, same 4 specificity levels — the framework is unchanged.",
@@ -49,11 +49,12 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
"Question 1 — Content Category: \"What is this paragraph about?\" Pick the best of 7 options.",
"Question 2 — Specificity Level: \"How company-specific is this paragraph?\" Pick a level from 1 to 4.",
"These are independent dimensions. A materiality disclaimer can be Strategy Integration (category) at Level 1 (generic boilerplate). An incident report can be Incident Disclosure at Level 4 (specific dates and firms).",
+ "Important: specificity rates THE WHOLE PARAGRAPH, not just the category-relevant parts. If a Board Governance paragraph mentions the CISO by name and describes penetration testing, those facts count for specificity even though they're not \"board\" content. Scan the entire paragraph for the most specific fact present — don't filter by category first.",
],
keyPoints: [
"One content category (of 7) — pick the dominant one.",
- "One specificity level (1–4) — determined by the most specific fact present.",
- "Category and specificity are independent — don't let one influence the other.",
+ "One specificity level (1–4) — determined by the most specific fact in THE WHOLE PARAGRAPH.",
+ "Specificity rates the paragraph, not the category. A Board Governance paragraph that mentions CrowdStrike Falcon is Level 4.",
],
},
@@ -142,6 +143,7 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
subtitle: "How company-specific is this paragraph?",
content: [
"Specificity measures how much this paragraph tells you about THIS specific company versus generic filler any company could use.",
+ "Critical: specificity rates the ENTIRE paragraph — not just the parts related to the category you chose. If you categorize a paragraph as Board Governance but it also mentions CrowdStrike Falcon or the CISO's 20 years of experience, those facts still count. Scan everything.",
"Think of it as a waterfall — check from the top and stop at the first yes:",
"Level 4 — Quantified-Verifiable: Can an external party verify at least one claim? (a specific number, date, named tool/firm, verifiable certification) → Level 4.",
"Level 3 — Firm-Specific: Does it contain at least one fact unique to THIS company? (CISO title, named non-generic committee, named individual, 24/7 SOC) → Level 3.",
@@ -218,6 +220,13 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
explanation:
"BG because the Audit Committee is the subject (oversight). CISO is a firm-specific fact → Level 3. No QV facts (no numbers, dates, named firms).",
},
+ {
+ text: "The Board oversees our cybersecurity program, which is led by our CISO and includes penetration testing and vulnerability assessments using CrowdStrike Falcon.",
+ category: "Board Governance",
+ specificity: "Level 4 — Quantified-Verifiable",
+ explanation:
+ "BG because the Board is the subject. But specificity rates THE WHOLE PARAGRAPH — not just the board content. CrowdStrike Falcon is a named tool (QV-eligible), so Level 4. Don't be tempted to rate only the \"board\" parts as generic — the paragraph as a whole contains a verifiable fact.",
+ },
{
text: "Under the leadership of our CISO, we have implemented network segmentation, endpoint detection and response, data loss prevention, and SIEM. Our team monitors critical systems continuously and conducts quarterly tabletop exercises.",
category: "Risk Management Process",
@@ -248,8 +257,8 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
},
],
keyPoints: [
- "Category and specificity are independent. Don't let one influence the other.",
- "The person-removal test and specificity waterfall work together — use both.",
+ "Specificity rates the WHOLE paragraph — not just the parts related to the category. Scan everything.",
+ "A Board Governance paragraph that mentions CrowdStrike Falcon → still Level 4. Don't filter facts by category.",
"When in doubt on category: which question does the paragraph answer?",
"When in doubt on specificity: check the waterfall top-down (QV → IS → Domain → Generic).",
],
diff --git a/labelapp/scripts/assign.ts b/labelapp/scripts/assign.ts
index 84cad3a..7f7438e 100644
--- a/labelapp/scripts/assign.ts
+++ b/labelapp/scripts/assign.ts
@@ -1,7 +1,6 @@
process.env.DATABASE_URL ??=
"postgresql://sec_cybert:sec_cybert@localhost:5432/sec_cybert";
-import { ne } from "drizzle-orm";
import { db } from "../db";
import * as schema from "../db/schema";
import { generateAssignments, printAssignmentStats } from "../lib/assignment";
@@ -15,12 +14,11 @@ async function main() {
const paragraphIds = rows.map((r) => r.id);
console.log(` ${paragraphIds.length} paragraphs`);
- // 2. Read annotator IDs from DB (exclude joey — admin)
+ // 2. Read annotator IDs from DB (all annotators, including joey)
console.log("Loading annotators...");
const annotators = await db
.select({ id: schema.annotators.id })
- .from(schema.annotators)
- .where(ne(schema.annotators.id, "joey"));
+ .from(schema.annotators);
const annotatorIds = annotators.map((a) => a.id).sort();
console.log(` ${annotatorIds.length} annotators: ${annotatorIds.join(", ")}`);
diff --git a/labelapp/scripts/dump-all.ts b/labelapp/scripts/dump-all.ts
index 5f1bacc..341c193 100644
--- a/labelapp/scripts/dump-all.ts
+++ b/labelapp/scripts/dump-all.ts
@@ -53,16 +53,14 @@ async function main() {
db.select().from(schema.adjudications),
]);
- const nonAdminAnnotators = allAnnotators.filter((a) => a.id !== "joey");
- const annotatorIds = nonAdminAnnotators.map((a) => a.id).sort();
+ const annotatorIds = allAnnotators.map((a) => a.id).sort();
const annotatorNames = new Map(allAnnotators.map((a) => [a.id, a.displayName]));
- // Filter to non-admin labels only
- const labels = allLabels.filter((l) => l.annotatorId !== "joey");
+ const labels = allLabels;
- console.log(` ${labels.length} human labels (non-admin)`);
+ console.log(` ${labels.length} human labels`);
console.log(` ${allParagraphs.length} paragraphs`);
- console.log(` ${nonAdminAnnotators.length} annotators`);
+ console.log(` ${allAnnotators.length} annotators`);
console.log(` ${allQuizSessions.length} quiz sessions`);
console.log(` ${allAdjudications.length} adjudications`);
@@ -108,7 +106,7 @@ async function main() {
// ── 3. Annotators JSON ──
console.log("\nExporting annotator profiles...");
- const annotatorProfiles = nonAdminAnnotators.map((a) => ({
+ const annotatorProfiles = allAnnotators.map((a: { id: string; displayName: string; onboardedAt: Date | null }) => ({
id: a.id,
displayName: a.displayName,
onboardedAt: a.onboardedAt?.toISOString() ?? null,
diff --git a/python/main.py b/python/main.py
index 2e3b83f..48a5f25 100644
--- a/python/main.py
+++ b/python/main.py
@@ -44,6 +44,35 @@ def cmd_finetune(args: argparse.Namespace) -> None:
train(config)
+def cmd_eval(args: argparse.Namespace) -> None:
+ from src.finetune.eval import EvalConfig, evaluate
+
+    if args.benchmark:
+        benchmark_paths = dict(args.benchmark)
+    else:
+        # Default proxy-gold references
+        benchmark_paths = {
+            "GPT-5.4": "../data/annotations/v2-bench/gpt-5.4.jsonl",
+            "Opus-4.6": "../data/annotations/v2-bench/opus-4.6.jsonl",
+        }
+
+ config = EvalConfig(
+ checkpoint_path=args.checkpoint,
+ paragraphs_path=args.paragraphs,
+ holdout_path=args.holdout,
+ benchmark_paths=benchmark_paths,
+ output_dir=args.output_dir,
+ max_seq_length=args.max_seq_length,
+ batch_size=args.batch_size,
+ specificity_head=args.spec_head,
+ spec_mlp_dim=args.spec_mlp_dim,
+ pooling=args.pooling,
+ )
+ evaluate(config)
+
+
def cmd_ablate(args: argparse.Namespace) -> None:
from src.common.config import FinetuneConfig
from src.finetune.train import ablate
@@ -94,10 +123,20 @@ def main() -> None:
ab.add_argument("--epochs", type=int, help="Override epochs per ablation run (default: config value)")
ab.set_defaults(func=cmd_ablate)
- # ── eval (placeholder) ──
- ev = sub.add_parser("eval", help="Evaluate a trained model")
- ev.add_argument("--config", required=True, help="Path to YAML config file")
- ev.set_defaults(func=lambda args: print("Evaluation not yet implemented."))
+ # ── eval ──
+ ev = sub.add_parser("eval", help="Evaluate a trained model on holdout set")
+ ev.add_argument("--checkpoint", required=True, help="Path to model checkpoint directory")
+ ev.add_argument("--paragraphs", default="../data/paragraphs/paragraphs-clean.patched.jsonl")
+ ev.add_argument("--holdout", default="../data/gold/v2-holdout-ids.json")
+ ev.add_argument("--benchmark", action="append", nargs=2, metavar=("NAME", "PATH"),
+ help="Benchmark reference: NAME PATH (can repeat)")
+ ev.add_argument("--output-dir", default="../results/eval")
+ ev.add_argument("--max-seq-length", type=int, default=512)
+ ev.add_argument("--batch-size", type=int, default=64)
+ ev.add_argument("--spec-head", default="independent", choices=["coral", "independent", "softmax"])
+ ev.add_argument("--spec-mlp-dim", type=int, default=256)
+ ev.add_argument("--pooling", default="attention", choices=["cls", "attention"])
+ ev.set_defaults(func=cmd_eval)
args = parser.parse_args()
args.func(args)
diff --git a/python/pyproject.toml b/python/pyproject.toml
index d71ba5d..235c9c3 100644
--- a/python/pyproject.toml
+++ b/python/pyproject.toml
@@ -15,6 +15,9 @@ dependencies = [
"unsloth==2026.3.11",
"coral-pytorch>=1.4.0",
"scikit-learn>=1.8.0",
+ "krippendorff>=0.8.2",
+ "matplotlib>=3.10.8",
+ "seaborn>=0.13.2",
]
[project.scripts]
diff --git a/python/scripts/generate-comparison-figures.py b/python/scripts/generate-comparison-figures.py
new file mode 100644
index 0000000..acdc5dc
--- /dev/null
+++ b/python/scripts/generate-comparison-figures.py
@@ -0,0 +1,226 @@
+"""Generate side-by-side comparison figures: CORAL baseline vs Independent threshold model."""
+
+import json
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+import seaborn as sns
+
+RESULTS_DIR = Path(__file__).resolve().parent.parent.parent / "results" / "eval"
+OUTPUT_DIR = RESULTS_DIR / "comparison"
+
+CATEGORIES = [
+ "Board Gov.",
+ "Incident Disc.",
+ "Mgmt Role",
+ "None/Other",
+ "Risk Mgmt Proc.",
+ "Strategy Int.",
+ "Third-Party",
+]
+SPEC_LABELS = ["L1: Generic", "L2: Domain", "L3: Firm-Spec.", "L4: Quantified"]
+
+
+def load_metrics(model_dir: str) -> dict:
+ with open(RESULTS_DIR / model_dir / "metrics.json") as f:
+ return json.load(f)
+
+
+def main():
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+ sns.set_theme(style="whitegrid", font_scale=1.1)
+
+ coral = load_metrics("coral-baseline")
+ indep = load_metrics("iter1-independent")
+
+ # Use GPT-5.4 as the reference (1200 samples, complete)
+ coral_gpt = coral["best-base_weighted_ce-ep5_vs_GPT-5.4"]
+ indep_gpt = indep["iter1-independent_vs_GPT-5.4"]
+
+ # ── 1. Side-by-side per-class F1 (Category) ─────────────────────────────
+ # Keys come from eval.py: name.replace(" ", "").replace("/", "")[:8]
+ cat_keys = ["BoardGov", "Incident", "Manageme", "NoneOthe", "RiskMana", "Strategy", "Third-Pa"]
+
+ coral_cat_f1 = [coral_gpt.get(f"cat_f1_{k}", 0) for k in cat_keys]
+ indep_cat_f1 = [indep_gpt.get(f"cat_f1_{k}", 0) for k in cat_keys]
+
+ fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
+
+ x = np.arange(len(CATEGORIES))
+ width = 0.35
+ bars1 = ax1.bar(x - width/2, coral_cat_f1, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
+ bars2 = ax1.bar(x + width/2, indep_cat_f1, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
+ ax1.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="Target (0.80)")
+ ax1.set_ylabel("F1 Score")
+ ax1.set_title("Category F1 by Class")
+ ax1.set_xticks(x)
+ ax1.set_xticklabels(CATEGORIES, rotation=25, ha="right")
+ ax1.set_ylim(0, 1.05)
+ ax1.legend(loc="lower right")
+
+ for bar, v in zip(bars1, coral_cat_f1):
+ ax1.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=7)
+ for bar, v in zip(bars2, indep_cat_f1):
+ ax1.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=7)
+
+ # ── Specificity F1 side-by-side ──────────────────────────────────────────
+ # Keys come from eval.py: name.replace(" ", "").replace("/", "").replace(":", "")[:8]
+ spec_keys = ["L1Generi", "L2Domain", "L3Firm-S", "L4Quanti"]
+
+ coral_spec_f1 = [coral_gpt.get(f"spec_f1_{k}", 0) for k in spec_keys]
+ indep_spec_f1 = [indep_gpt.get(f"spec_f1_{k}", 0) for k in spec_keys]
+
+ x2 = np.arange(len(SPEC_LABELS))
+ bars3 = ax2.bar(x2 - width/2, coral_spec_f1, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
+ bars4 = ax2.bar(x2 + width/2, indep_spec_f1, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
+ ax2.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="Target (0.80)")
+ ax2.set_ylabel("F1 Score")
+ ax2.set_title("Specificity F1 by Level")
+ ax2.set_xticks(x2)
+ ax2.set_xticklabels(SPEC_LABELS)
+ ax2.set_ylim(0, 1.05)
+ ax2.legend(loc="lower right")
+
+ for bar, v in zip(bars3, coral_spec_f1):
+ ax2.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=8)
+ for bar, v in zip(bars4, indep_spec_f1):
+ ax2.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=8)
+
+ plt.suptitle("CORAL Baseline vs Independent Thresholds — Holdout Set (vs GPT-5.4)", fontsize=14, fontweight="bold")
+ plt.tight_layout()
+ fig.savefig(OUTPUT_DIR / "coral_vs_independent_f1.png", dpi=200)
+ plt.close(fig)
+ print(" Saved: coral_vs_independent_f1.png")
+
+ # ── 2. Summary metrics comparison ────────────────────────────────────────
+ metrics_to_compare = {
+ "Cat Macro F1": ("cat_macro_f1", "cat_macro_f1"),
+ "Spec Macro F1": ("spec_macro_f1", "spec_macro_f1"),
+ "Cat MCC": ("cat_mcc", "cat_mcc"),
+ "Spec MCC": ("spec_mcc", "spec_mcc"),
+ "Cat AUC": ("cat_auc", "cat_auc"),
+ "Spec AUC": ("spec_auc", "spec_auc"),
+ "Spec QWK": ("spec_qwk", "spec_qwk"),
+ "Cat Kripp α": ("cat_kripp_alpha", "cat_kripp_alpha"),
+ "Spec Kripp α": ("spec_kripp_alpha", "spec_kripp_alpha"),
+ }
+
+ fig, ax = plt.subplots(figsize=(12, 6))
+ labels = list(metrics_to_compare.keys())
+ coral_vals = [coral_gpt.get(v[0], 0) for v in metrics_to_compare.values()]
+ indep_vals = [indep_gpt.get(v[1], 0) for v in metrics_to_compare.values()]
+
+ x = np.arange(len(labels))
+ width = 0.35
+ ax.bar(x - width/2, coral_vals, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
+ ax.bar(x + width/2, indep_vals, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
+ ax.axhline(0.80, color="red", linestyle="--", alpha=0.5)
+ ax.set_ylabel("Score")
+ ax.set_title("CORAL vs Independent — All Metrics (Holdout vs GPT-5.4)")
+ ax.set_xticks(x)
+ ax.set_xticklabels(labels, rotation=30, ha="right")
+ ax.set_ylim(0, 1.1)
+ ax.legend()
+
+ for i, (cv, iv) in enumerate(zip(coral_vals, indep_vals)):
+ ax.text(i - width/2, cv + 0.01, f"{cv:.3f}", ha="center", va="bottom", fontsize=7)
+ ax.text(i + width/2, iv + 0.01, f"{iv:.3f}", ha="center", va="bottom", fontsize=7)
+
+ plt.tight_layout()
+ fig.savefig(OUTPUT_DIR / "coral_vs_independent_all_metrics.png", dpi=200)
+ plt.close(fig)
+ print(" Saved: coral_vs_independent_all_metrics.png")
+
+ # ── 3. Delta chart (improvement from CORAL → Independent) ────────────────
+ deltas = [iv - cv for cv, iv in zip(coral_vals, indep_vals)]
+ colors = ["#55a868" if d >= 0 else "#c44e52" for d in deltas]
+
+ fig, ax = plt.subplots(figsize=(10, 5))
+ ax.barh(labels, deltas, color=colors, alpha=0.85)
+ ax.axvline(0, color="black", linewidth=0.8)
+ ax.set_xlabel("Improvement (Independent − CORAL)")
+ ax.set_title("Metric Improvement: Independent Thresholds over CORAL")
+ for i, (d, label) in enumerate(zip(deltas, labels)):
+ ax.text(d + 0.003 if d >= 0 else d - 0.003, i, f"{d:+.3f}",
+ va="center", ha="left" if d >= 0 else "right", fontsize=9)
+
+ plt.tight_layout()
+ fig.savefig(OUTPUT_DIR / "improvement_delta.png", dpi=200)
+ plt.close(fig)
+ print(" Saved: improvement_delta.png")
+
+ # ── 4. Specificity confusion matrix side-by-side ─────────────────────────
+ spec_labels_short = ["L1", "L2", "L3", "L4"]
+
+ fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
+
+ cm_coral = np.array(coral_gpt["spec_confusion_matrix"])
+ cm_indep = np.array(indep_gpt["spec_confusion_matrix"])
+
+ cm_coral_norm = cm_coral.astype(float) / cm_coral.sum(axis=1, keepdims=True).clip(min=1)
+ cm_indep_norm = cm_indep.astype(float) / cm_indep.sum(axis=1, keepdims=True).clip(min=1)
+
+ sns.heatmap(cm_coral_norm, annot=cm_coral, fmt="d", cmap="Oranges",
+ xticklabels=spec_labels_short, yticklabels=spec_labels_short,
+ ax=ax1, vmin=0, vmax=1, cbar=False)
+ ax1.set_title("CORAL (Epoch 5) — Spec F1=0.597")
+ ax1.set_xlabel("Predicted")
+ ax1.set_ylabel("GPT-5.4 Reference")
+
+ sns.heatmap(cm_indep_norm, annot=cm_indep, fmt="d", cmap="Blues",
+ xticklabels=spec_labels_short, yticklabels=spec_labels_short,
+ ax=ax2, vmin=0, vmax=1, cbar=False)
+ ax2.set_title("Independent (Epoch 8) — Spec F1=0.895")
+ ax2.set_xlabel("Predicted")
+ ax2.set_ylabel("GPT-5.4 Reference")
+
+ plt.suptitle("Specificity Confusion Matrices — CORAL vs Independent", fontsize=13, fontweight="bold")
+ plt.tight_layout()
+ fig.savefig(OUTPUT_DIR / "spec_confusion_comparison.png", dpi=200)
+ plt.close(fig)
+ print(" Saved: spec_confusion_comparison.png")
+
+ # ── 5. Cost/speed comparison table figure ────────────────────────────────
+ fig, ax = plt.subplots(figsize=(10, 4))
+ ax.axis("off")
+
+ table_data = [
+ ["Metric", "CORAL (Ep5)", "Independent (Ep8)", "GPT-5.4 (API)", "Opus-4.6 (API)"],
+ ["Cat Macro F1", f"{coral_gpt['cat_macro_f1']:.4f}", f"{indep_gpt['cat_macro_f1']:.4f}", "—(reference)", "—(reference)"],
+ ["Spec Macro F1", f"{coral_gpt['spec_macro_f1']:.4f}", f"{indep_gpt['spec_macro_f1']:.4f}", "—(reference)", "—(reference)"],
+ ["Spec QWK", f"{coral_gpt['spec_qwk']:.4f}", f"{indep_gpt['spec_qwk']:.4f}", "—", "—"],
+ ["MCC (Cat)", f"{coral_gpt['cat_mcc']:.4f}", f"{indep_gpt['cat_mcc']:.4f}", "—", "—"],
+ ["Latency/sample", "5.6ms", "5.6ms", "~2,900ms", "~6,000ms"],
+ ["Cost/1M texts", "~$5", "~$5", "~$3,400", "~$5,000*"],
+ ["Reproducible", "Yes", "Yes", "No", "No"],
+ ]
+
+ table = ax.table(cellText=table_data[1:], colLabels=table_data[0],
+ cellLoc="center", loc="center")
+ table.auto_set_font_size(False)
+ table.set_fontsize(9)
+ table.scale(1, 1.5)
+
+ # Style header
+ for j in range(len(table_data[0])):
+ table[0, j].set_facecolor("#4C72B0")
+ table[0, j].set_text_props(color="white", fontweight="bold")
+
+ # Highlight best specialist column
+ for i in range(1, len(table_data)):
+ table[i, 2].set_facecolor("#d4edda")
+
+ ax.set_title("Model Comparison Summary", fontsize=13, fontweight="bold", pad=20)
+ plt.tight_layout()
+ fig.savefig(OUTPUT_DIR / "comparison_table.png", dpi=200)
+ plt.close(fig)
+ print(" Saved: comparison_table.png")
+
+ print(f"\n All figures saved to {OUTPUT_DIR}")
+
+
+if __name__ == "__main__":
+ main()
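The short metric keys hard-coded in `cat_keys` and `spec_keys` above come from eval.py's truncation scheme (strip spaces, slashes, and colons, then keep the first eight characters). A standalone check of that derivation, using label names that appear in this diff:

```python
def short_key(name: str) -> str:
    # Mirror eval.py's scheme: strip spaces, slashes, colons; keep 8 chars
    return name.replace(" ", "").replace("/", "").replace(":", "")[:8]

assert short_key("None/Other") == "NoneOthe"
assert short_key("Board Gov.") == "BoardGov"
assert short_key("L1: Generic") == "L1Generi"
assert short_key("L3: Firm-Specific") == "L3Firm-S"
```

Keys like `Manageme` and `RiskMana` are derived from the full category names in `data.py`, which are longer than the abbreviated display labels this script uses for plotting.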
diff --git a/python/src/finetune/eval.py b/python/src/finetune/eval.py
new file mode 100644
index 0000000..4a94a21
--- /dev/null
+++ b/python/src/finetune/eval.py
@@ -0,0 +1,724 @@
+"""Holdout evaluation for trained DualHeadModernBERT models.
+
+Loads a trained checkpoint, runs inference on the 1,200-paragraph holdout set,
+and compares predictions against proxy gold labels (GPT-5.4, Opus-4.6) or
+real human gold labels when available.
+
+Reports: macro/per-class F1, precision, recall, MCC, AUC (one-vs-rest),
+QWK, MAE, Krippendorff's alpha, confusion matrices, calibration (ECE),
+cost/latency comparison, and generates publication-ready figures.
+"""
+
+import json
+import time
+from dataclasses import dataclass
+from pathlib import Path
+
+import krippendorff
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+import seaborn as sns
+import torch
+import torch.nn.functional as F
+from safetensors.torch import load_file
+from sklearn.metrics import (
+ f1_score,
+ precision_score,
+ recall_score,
+ matthews_corrcoef,
+ roc_auc_score,
+ cohen_kappa_score,
+ confusion_matrix,
+ classification_report,
+ mean_absolute_error,
+)
+from transformers import AutoModel, AutoTokenizer
+
+from .data import CATEGORIES, CAT2ID, NUM_CATEGORIES, NUM_SPECIFICITY
+from .model import DualHeadModernBERT, ordinal_predict, softmax_predict
+
+SPEC_LABELS = ["L1: Generic", "L2: Domain", "L3: Firm-Specific", "L4: Quantified"]
+
+
+@dataclass
+class EvalConfig:
+ """Configuration for holdout evaluation."""
+ checkpoint_path: str
+ paragraphs_path: str
+ holdout_path: str
+ benchmark_paths: dict[str, str] # name → path to benchmark JSONL
+ output_dir: str
+ max_seq_length: int = 512
+ batch_size: int = 64
+ # Architecture params (must match training)
+ specificity_head: str = "independent"
+ spec_mlp_dim: int = 256
+ pooling: str = "attention"
+
+
+def _load_jsonl(path: str | Path) -> list[dict]:
+ records = []
+ with open(path) as f:
+ for line in f:
+ line = line.strip()
+ if line:
+ records.append(json.loads(line))
+ return records
+
+
+def load_holdout_data(
+ paragraphs_path: str,
+ holdout_path: str,
+ benchmark_paths: dict[str, str],
+) -> dict:
+ """Load holdout paragraphs and benchmark labels, joined by paragraph ID."""
+ paragraphs = {p["id"]: p for p in _load_jsonl(paragraphs_path)}
+
+ with open(holdout_path) as f:
+ holdout_ids = json.load(f)
+
+ # Load benchmark labels
+ benchmarks = {}
+ for name, path in benchmark_paths.items():
+ labels = {r["paragraphId"]: r for r in _load_jsonl(path)}
+ benchmarks[name] = labels
+
+ # Build joined dataset
+ records = []
+ for pid in holdout_ids:
+ para = paragraphs.get(pid)
+ if para is None:
+ continue
+
+ bench_labels = {}
+ for name, labels in benchmarks.items():
+ if pid in labels:
+ lab = labels[pid]["label"]
+ bench_labels[name] = {
+ "category": lab["content_category"],
+ "specificity": lab["specificity_level"],
+ }
+
+ records.append({
+ "id": pid,
+ "text": para["text"],
+ "benchmark_labels": bench_labels,
+ })
+
+ return records
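`load_holdout_data` is a plain id-keyed join. A minimal in-memory sketch of the same logic (the paragraph and label records below are hypothetical):

```python
paragraphs = {p["id"]: p for p in [
    {"id": "p1", "text": "The board oversees climate risk."},
    {"id": "p2", "text": "Revenue grew last quarter."},
]}
benchmarks = {
    "GPT-5.4": {"p1": {"label": {"content_category": "Board Gov.", "specificity_level": 2}}},
}
holdout_ids = ["p1", "p2", "p3"]  # p3 has no paragraph and is silently skipped

records = []
for pid in holdout_ids:
    para = paragraphs.get(pid)
    if para is None:
        continue
    bench_labels = {}
    for name, labels in benchmarks.items():
        if pid in labels:
            lab = labels[pid]["label"]
            bench_labels[name] = {
                "category": lab["content_category"],
                "specificity": lab["specificity_level"],
            }
    records.append({"id": pid, "text": para["text"], "benchmark_labels": bench_labels})

# p2 survives with an empty benchmark dict; downstream evaluation
# simply skips records a given reference did not label
```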
+
+
+def load_model(config: EvalConfig, device: torch.device) -> tuple:
+ """Load trained DualHeadModernBERT from checkpoint."""
+ checkpoint = Path(config.checkpoint_path)
+
+ tokenizer = AutoTokenizer.from_pretrained(str(checkpoint))
+
+ # Load backbone (we need the architecture, then overwrite weights)
+ try:
+ import flash_attn # noqa: F401
+ attn_impl = "flash_attention_2"
+ except ImportError:
+ attn_impl = "sdpa"
+
+ backbone = AutoModel.from_pretrained(
+ "answerdotai/ModernBERT-large",
+ trust_remote_code=True,
+ attn_implementation=attn_impl,
+ dtype=torch.bfloat16,
+ )
+ hidden_size = backbone.config.hidden_size
+
+ model = DualHeadModernBERT(
+ backbone=backbone,
+ hidden_size=hidden_size,
+ num_categories=NUM_CATEGORIES,
+ num_specificity=NUM_SPECIFICITY,
+ specificity_head_type=config.specificity_head,
+ spec_mlp_dim=config.spec_mlp_dim,
+ pooling=config.pooling,
+ )
+
+ # Load trained weights (filter out loss function buffers not needed for inference)
+ state_dict = load_file(str(checkpoint / "model.safetensors"))
+ model.load_state_dict(state_dict, strict=False)
+ model = model.to(device).to(torch.bfloat16)
+ model.eval()
+
+ return model, tokenizer
+
+
+@torch.no_grad()
+def run_inference(
+ model: DualHeadModernBERT,
+ tokenizer,
+ records: list[dict],
+ max_seq_length: int,
+ batch_size: int,
+ device: torch.device,
+ spec_head_type: str = "independent",
+) -> dict:
+ """Run model inference on holdout records. Returns predictions + timing."""
+ texts = [r["text"] for r in records]
+ all_cat_logits = []
+ all_spec_logits = []
+ total_time = 0.0
+
+ for i in range(0, len(texts), batch_size):
+ batch_texts = texts[i : i + batch_size]
+ encoded = tokenizer(
+ batch_texts,
+ truncation=True,
+ max_length=max_seq_length,
+ padding="longest",
+ return_tensors="pt",
+ ).to(device)
+
+ start = time.perf_counter()
+ outputs = model(
+ input_ids=encoded["input_ids"],
+ attention_mask=encoded["attention_mask"],
+ )
+ if device.type == "cuda":
+ torch.cuda.synchronize() # flush async CUDA work so the timing is accurate
+ total_time += time.perf_counter() - start
+
+ all_cat_logits.append(outputs["category_logits"].float().cpu())
+ all_spec_logits.append(outputs["specificity_logits"].float().cpu())
+
+ cat_logits = torch.cat(all_cat_logits, dim=0)
+ spec_logits = torch.cat(all_spec_logits, dim=0)
+
+ cat_probs = F.softmax(cat_logits, dim=1).numpy()
+ cat_preds = cat_logits.argmax(dim=1).numpy()
+
+ if spec_head_type == "softmax":
+ spec_preds = softmax_predict(spec_logits).numpy()
+ spec_probs = F.softmax(spec_logits, dim=1).numpy()
+ else:
+ spec_preds = ordinal_predict(spec_logits).numpy()
+ # Convert ordinal logits to class probs for AUC
+ spec_probs = _ordinal_to_class_probs(spec_logits).numpy()
+
+ return {
+ "cat_preds": cat_preds,
+ "cat_probs": cat_probs,
+ "cat_logits": cat_logits.numpy(),
+ "spec_preds": spec_preds,
+ "spec_probs": spec_probs,
+ "spec_logits": spec_logits.numpy(),
+ "total_time_s": total_time,
+ "num_samples": len(texts),
+ "avg_ms_per_sample": (total_time / len(texts)) * 1000,
+ }
+
+
+def _ordinal_to_class_probs(logits: torch.Tensor) -> torch.Tensor:
+ """Convert ordinal threshold logits to per-class probabilities.
+
+ P(class=0) = 1 - P(>=1)
+ P(class=k) = P(>=k) - P(>=k+1) for 0 < k < K-1
+ P(class=K-1) = P(>=K-1)
+ """
+ probs = torch.sigmoid(logits) # (N, K-1)
+ num_classes = probs.shape[1] + 1
+ class_probs = torch.zeros(probs.shape[0], num_classes)
+
+ class_probs[:, 0] = 1.0 - probs[:, 0]
+ for k in range(1, num_classes - 1):
+ class_probs[:, k] = probs[:, k - 1] - probs[:, k]
+ class_probs[:, -1] = probs[:, -1]
+
+ # Clamp for numerical stability
+ class_probs = class_probs.clamp(min=0.0)
+ class_probs = class_probs / class_probs.sum(dim=1, keepdim=True)
+
+ return class_probs
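The telescoping in `_ordinal_to_class_probs` can be sanity-checked without torch. With cumulative probabilities P(≥1)=0.9, P(≥2)=0.6, P(≥3)=0.2, the four class probabilities are the successive differences and sum to one:

```python
def ordinal_to_class(cum: list[float]) -> list[float]:
    # cum[k] = P(class >= k + 1); class probs telescope from the cumulative tail
    probs = [1.0 - cum[0]]
    probs += [cum[k - 1] - cum[k] for k in range(1, len(cum))]
    probs.append(cum[-1])
    return probs

p = ordinal_to_class([0.9, 0.6, 0.2])
# ≈ [0.1, 0.3, 0.4, 0.2]; negative entries arise only when the raw thresholds
# are non-monotone, which the clamp-and-renormalize in the torch version handles
```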
+
+
+def compute_ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> tuple[float, dict]:
+ """Compute Expected Calibration Error and per-bin data for reliability diagram."""
+ confidences = np.max(probs, axis=1)
+ predictions = np.argmax(probs, axis=1)
+ accuracies = (predictions == labels).astype(float)
+
+ bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
+ bin_data = {"bin_centers": [], "bin_accs": [], "bin_confs": [], "bin_counts": []}
+ ece = 0.0
+
+ for i in range(n_bins):
+ mask = (confidences > bin_edges[i]) & (confidences <= bin_edges[i + 1])
+ count = mask.sum()
+ if count > 0:
+ avg_conf = confidences[mask].mean()
+ avg_acc = accuracies[mask].mean()
+ ece += (count / len(labels)) * abs(avg_acc - avg_conf)
+ bin_data["bin_centers"].append((bin_edges[i] + bin_edges[i + 1]) / 2)
+ bin_data["bin_accs"].append(avg_acc)
+ bin_data["bin_confs"].append(avg_conf)
+ bin_data["bin_counts"].append(int(count))
+
+ return ece, bin_data
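The binning logic above is easy to hand-check. With two equal-width bins, the lone under-confident sample at 0.4 contributes 0.25 · |1 − 0.4| and the three confident samples contribute 0.75 · |2/3 − 2.3/3|, so ECE = 0.15 + 0.075 = 0.225. A pure-Python sketch of the same computation:

```python
def ece_binned(confs: list[float], correct: list[int], n_bins: int = 2) -> float:
    # Expected Calibration Error over equal-width confidence bins (lo, hi]
    edges = [i / n_bins for i in range(n_bins + 1)]
    n, ece = len(confs), 0.0
    for lo, hi in zip(edges, edges[1:]):
        idx = [i for i, c in enumerate(confs) if lo < c <= hi]
        if idx:
            avg_conf = sum(confs[i] for i in idx) / len(idx)
            avg_acc = sum(correct[i] for i in idx) / len(idx)
            ece += (len(idx) / n) * abs(avg_acc - avg_conf)
    return ece

e = ece_binned([0.9, 0.8, 0.6, 0.4], [1, 1, 0, 1])
# ≈ 0.225: bin (0, 0.5] gives 0.25 * 0.6, bin (0.5, 1] gives 0.75 * 0.1
```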
+
+
+def compute_all_metrics(
+ preds: np.ndarray,
+ labels: np.ndarray,
+ probs: np.ndarray,
+ label_names: list[str],
+ task_name: str,
+ is_ordinal: bool = False,
+) -> dict:
+ """Compute comprehensive metrics for a single classification task."""
+ num_classes = len(label_names)
+
+ # Basic classification metrics
+ macro_f1 = f1_score(labels, preds, average="macro", labels=range(num_classes))
+ weighted_f1 = f1_score(labels, preds, average="weighted", labels=range(num_classes))
+ per_class_f1 = f1_score(labels, preds, average=None, labels=range(num_classes))
+ per_class_precision = precision_score(labels, preds, average=None, labels=range(num_classes), zero_division=0)
+ per_class_recall = recall_score(labels, preds, average=None, labels=range(num_classes), zero_division=0)
+ macro_precision = precision_score(labels, preds, average="macro", labels=range(num_classes), zero_division=0)
+ macro_recall = recall_score(labels, preds, average="macro", labels=range(num_classes), zero_division=0)
+
+ # MCC (multiclass)
+ mcc = matthews_corrcoef(labels, preds)
+
+ # AUC (one-vs-rest, macro)
+ try:
+ auc = roc_auc_score(labels, probs, multi_class="ovr", average="macro", labels=range(num_classes))
+ except ValueError:
+ auc = float("nan")
+
+ # Confusion matrix
+ cm = confusion_matrix(labels, preds, labels=range(num_classes))
+
+ # Calibration
+ ece, bin_data = compute_ece(probs, labels)
+
+ metrics = {
+ f"{task_name}_macro_f1": macro_f1,
+ f"{task_name}_weighted_f1": weighted_f1,
+ f"{task_name}_macro_precision": macro_precision,
+ f"{task_name}_macro_recall": macro_recall,
+ f"{task_name}_mcc": mcc,
+ f"{task_name}_auc": auc,
+ f"{task_name}_ece": ece,
+ f"{task_name}_confusion_matrix": cm.tolist(),
+ f"{task_name}_calibration": bin_data,
+ }
+
+ for i, name in enumerate(label_names):
+ short = name.replace(" ", "").replace("/", "").replace(":", "")[:8]
+ metrics[f"{task_name}_f1_{short}"] = per_class_f1[i]
+ metrics[f"{task_name}_prec_{short}"] = per_class_precision[i]
+ metrics[f"{task_name}_recall_{short}"] = per_class_recall[i]
+
+ # Ordinal-specific metrics
+ if is_ordinal:
+ metrics[f"{task_name}_qwk"] = cohen_kappa_score(labels, preds, weights="quadratic")
+ metrics[f"{task_name}_mae"] = mean_absolute_error(labels, preds)
+
+ # Krippendorff's alpha (treat model + reference as two coders)
+ reliability_data = np.array([labels, preds])
+ if is_ordinal:
+ metrics[f"{task_name}_kripp_alpha"] = krippendorff.alpha(
+ reliability_data=reliability_data, level_of_measurement="ordinal"
+ )
+ else:
+ metrics[f"{task_name}_kripp_alpha"] = krippendorff.alpha(
+ reliability_data=reliability_data, level_of_measurement="nominal"
+ )
+
+ return metrics
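The ordinal branch delegates QWK to sklearn's `cohen_kappa_score(weights="quadratic")`. For intuition, the same statistic can be computed by hand: disagreements are penalized by the squared distance between levels, normalized by the disagreement the marginals alone would produce. A pure-Python sketch (illustrative, not the script's code path):

```python
def qwk(labels: list[int], preds: list[int], k: int) -> float:
    # Quadratically weighted kappa:
    # 1 - (weighted observed disagreement / weighted chance disagreement)
    n = len(labels)
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(labels, preds):
        obs[a][b] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def w(i: int, j: int) -> float:
        return (i - j) ** 2 / (k - 1) ** 2

    num = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * row[i] * col[j] / n for i in range(k) for j in range(k))
    return 1.0 - num / den

qwk([0, 0, 1, 2, 3], [0, 1, 1, 2, 3], k=4)
# ≈ 0.918 (56/61): one adjacent-level disagreement costs little under quadratic weights
```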
+
+
+def generate_figures(
+ results: dict,
+ output_dir: Path,
+ model_name: str,
+ ref_name: str,
+) -> list[str]:
+ """Generate publication-ready figures. Returns list of saved file paths."""
+ figures_dir = output_dir / "figures"
+ figures_dir.mkdir(parents=True, exist_ok=True)
+ saved = []
+
+ sns.set_theme(style="whitegrid", font_scale=1.1)
+ palette = sns.color_palette("viridis", 7)
+
+ # 1. Category confusion matrix
+ cm = np.array(results["cat_confusion_matrix"])
+ fig, ax = plt.subplots(figsize=(10, 8))
+ cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True).clip(min=1)
+ sns.heatmap(
+ cm_norm, annot=cm, fmt="d", cmap="Blues",
+ xticklabels=[c[:12] for c in CATEGORIES],
+ yticklabels=[c[:12] for c in CATEGORIES],
+ ax=ax, vmin=0, vmax=1, cbar_kws={"label": "Proportion"},
+ )
+ ax.set_xlabel(f"Predicted ({model_name})")
+ ax.set_ylabel(f"Reference ({ref_name})")
+ ax.set_title(f"Category Confusion Matrix — {model_name} vs {ref_name}")
+ plt.tight_layout()
+ path = figures_dir / f"confusion_cat_{ref_name.lower().replace(' ', '_')}.png"
+ fig.savefig(path, dpi=150)
+ plt.close(fig)
+ saved.append(str(path))
+
+ # 2. Specificity confusion matrix
+ cm_spec = np.array(results["spec_confusion_matrix"])
+ fig, ax = plt.subplots(figsize=(7, 6))
+ cm_spec_norm = cm_spec.astype(float) / cm_spec.sum(axis=1, keepdims=True).clip(min=1)
+ sns.heatmap(
+ cm_spec_norm, annot=cm_spec, fmt="d", cmap="Oranges",
+ xticklabels=SPEC_LABELS,
+ yticklabels=SPEC_LABELS,
+ ax=ax, vmin=0, vmax=1, cbar_kws={"label": "Proportion"},
+ )
+ ax.set_xlabel(f"Predicted ({model_name})")
+ ax.set_ylabel(f"Reference ({ref_name})")
+ ax.set_title(f"Specificity Confusion Matrix — {model_name} vs {ref_name}")
+ plt.tight_layout()
+ path = figures_dir / f"confusion_spec_{ref_name.lower().replace(' ', '_')}.png"
+ fig.savefig(path, dpi=150)
+ plt.close(fig)
+ saved.append(str(path))
+
+ # 3. Calibration reliability diagram (category)
+ cal = results["cat_calibration"]
+ fig, ax = plt.subplots(figsize=(6, 6))
+ ax.bar(cal["bin_centers"], cal["bin_accs"], width=0.06, alpha=0.7, label="Accuracy", color="#4C72B0")
+ ax.plot([0, 1], [0, 1], "k--", alpha=0.5, label="Perfect calibration")
+ ax.set_xlabel("Confidence")
+ ax.set_ylabel("Accuracy")
+ ax.set_title(f"Category Calibration — ECE={results['cat_ece']:.4f}")
+ ax.legend()
+ ax.set_xlim(0, 1)
+ ax.set_ylim(0, 1)
+ plt.tight_layout()
+ path = figures_dir / f"calibration_cat_{ref_name.lower().replace(' ', '_')}.png"
+ fig.savefig(path, dpi=150)
+ plt.close(fig)
+ saved.append(str(path))
+
+ # 4. Per-class F1 bar chart
+ # Short keys must mirror the scheme used in compute_all_metrics
+ cat_f1s = [results.get(f"cat_f1_{c.replace(' ', '').replace('/', '').replace(':', '')[:8]}", 0) for c in CATEGORIES]
+ spec_f1s = [results.get(f"spec_f1_{s.replace(' ', '').replace('/', '').replace(':', '')[:8]}", 0) for s in SPEC_LABELS]
+
+ fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
+
+ bars = ax1.barh([c[:15] for c in CATEGORIES], cat_f1s, color=palette)
+ ax1.axvline(0.80, color="red", linestyle="--", alpha=0.7, label="Target (0.80)")
+ ax1.set_xlabel("F1 Score")
+ ax1.set_title(f"Per-Category F1 ({model_name} vs {ref_name})")
+ ax1.set_xlim(0, 1)
+ ax1.legend()
+ for bar, v in zip(bars, cat_f1s):
+ ax1.text(v + 0.01, bar.get_y() + bar.get_height() / 2, f"{v:.3f}", va="center", fontsize=9)
+
+ bars2 = ax2.barh(SPEC_LABELS, spec_f1s, color=sns.color_palette("YlOrRd", 4))
+ ax2.axvline(0.80, color="red", linestyle="--", alpha=0.7, label="Target (0.80)")
+ ax2.set_xlabel("F1 Score")
+ ax2.set_title(f"Per-Level Specificity F1 ({model_name} vs {ref_name})")
+ ax2.set_xlim(0, 1)
+ ax2.legend()
+ for bar, v in zip(bars2, spec_f1s):
+ ax2.text(v + 0.01, bar.get_y() + bar.get_height() / 2, f"{v:.3f}", va="center", fontsize=9)
+
+ plt.tight_layout()
+ path = figures_dir / f"per_class_f1_{ref_name.lower().replace(' ', '_')}.png"
+ fig.savefig(path, dpi=150)
+ plt.close(fig)
+ saved.append(str(path))
+
+ return saved
+
+
+def generate_comparison_figures(
+ all_results: dict[str, dict],
+ output_dir: Path,
+) -> list[str]:
+ """Generate figures comparing multiple models/references."""
+ figures_dir = output_dir / "figures"
+ figures_dir.mkdir(parents=True, exist_ok=True)
+ saved = []
+
+ sns.set_theme(style="whitegrid", font_scale=1.1)
+
+ # 1. Model comparison bar chart (if we have CORAL vs Independent)
+ model_names = list(all_results.keys())
+ if len(model_names) >= 2:
+ metric_keys = ["cat_macro_f1", "spec_macro_f1", "cat_mcc", "spec_qwk", "spec_mae"]
+ metric_labels = ["Cat Macro F1", "Spec Macro F1", "Cat MCC", "Spec QWK", "Spec MAE"]
+
+ fig, ax = plt.subplots(figsize=(12, 6))
+ x = np.arange(len(metric_labels))
+ width = 0.8 / len(model_names)
+
+ for i, name in enumerate(model_names):
+ vals = []
+ for k in metric_keys:
+ v = all_results[name].get(k, 0)
+ vals.append(v if not np.isnan(v) else 0)
+ bars = ax.bar(x + i * width - 0.4 + width / 2, vals, width, label=name, alpha=0.85)
+ for bar, v in zip(bars, vals):
+ ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
+ f"{v:.3f}", ha="center", va="bottom", fontsize=8)
+
+ ax.set_xticks(x)
+ ax.set_xticklabels(metric_labels)
+ ax.set_ylabel("Score")
+ ax.set_title("Model Comparison — All Metrics")
+ ax.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="F1 Target")
+ ax.legend()
+ plt.tight_layout()
+ path = figures_dir / "model_comparison.png"
+ fig.savefig(path, dpi=150)
+ plt.close(fig)
+ saved.append(str(path))
+
+ # 2. Speed/cost comparison
+ speed_data = {}
+ for name, res in all_results.items():
+ if "avg_ms_per_sample" in res:
+ speed_data[name] = res["avg_ms_per_sample"]
+
+ if speed_data:
+ # Add GenAI panel reference speeds from docs
+ genai_speeds = {
+ "GPT-5.4 (API)": 2900, # from provenance data
+ "Opus-4.6 (API)": 6000, # from provenance data
+ }
+
+ fig, ax = plt.subplots(figsize=(10, 5))
+ all_names = list(speed_data.keys()) + list(genai_speeds.keys())
+ all_speeds = list(speed_data.values()) + list(genai_speeds.values())
+ colors = ["#4C72B0"] * len(speed_data) + ["#DD8452"] * len(genai_speeds)
+
+ bars = ax.barh(all_names, all_speeds, color=colors)
+ ax.set_xlabel("Latency (ms per paragraph)")
+ ax.set_title("Inference Speed Comparison")
+ ax.set_xscale("log")
+ for bar, v in zip(bars, all_speeds):
+ ax.text(v * 1.1, bar.get_y() + bar.get_height() / 2,
+ f"{v:.1f}ms", va="center", fontsize=9)
+
+ plt.tight_layout()
+ path = figures_dir / "speed_comparison.png"
+ fig.savefig(path, dpi=150)
+ plt.close(fig)
+ saved.append(str(path))
+
+ return saved
+
+
+def format_report(
+ model_name: str,
+ ref_name: str,
+ metrics: dict,
+ inference_stats: dict,
+) -> str:
+ """Format a human-readable evaluation report."""
+ lines = []
+ lines.append(f"\n{'='*70}")
+ lines.append(f" HOLDOUT EVALUATION: {model_name} vs {ref_name}")
+ lines.append(f"{'='*70}\n")
+
+ lines.append(f" Samples evaluated: {inference_stats['num_samples']}")
+ lines.append(f" Total inference time: {inference_stats['total_time_s']:.2f}s")
+ lines.append(f" Avg latency: {inference_stats['avg_ms_per_sample']:.2f}ms/sample")
+ lines.append(f" Throughput: {1000 / inference_stats['avg_ms_per_sample']:.0f} samples/sec\n")
+
+ # Category metrics
+ lines.append(f" {'─'*50}")
+ lines.append(" CATEGORY CLASSIFICATION")
+ lines.append(f" {'─'*50}")
+ lines.append(f" Macro F1: {metrics['cat_macro_f1']:.4f} {'✓' if metrics['cat_macro_f1'] >= 0.80 else '✗'} (target: 0.80)")
+ lines.append(f" Weighted F1: {metrics['cat_weighted_f1']:.4f}")
+ lines.append(f" Macro Prec: {metrics['cat_macro_precision']:.4f}")
+ lines.append(f" Macro Recall: {metrics['cat_macro_recall']:.4f}")
+ lines.append(f" MCC: {metrics['cat_mcc']:.4f}")
+ lines.append(f" AUC (OvR): {metrics['cat_auc']:.4f}")
+ lines.append(f" ECE: {metrics['cat_ece']:.4f}")
+ lines.append(f" Kripp Alpha: {metrics['cat_kripp_alpha']:.4f}")
+ lines.append("")
+
+ lines.append(f" {'Category':<25} {'F1':>8} {'Prec':>8} {'Recall':>8}")
+ lines.append(f" {'-'*25} {'-'*8} {'-'*8} {'-'*8}")
+ for c in CATEGORIES:
+ short = c.replace(" ", "").replace("/", "").replace(":", "")[:8]
+ f1 = metrics.get(f"cat_f1_{short}", 0)
+ prec = metrics.get(f"cat_prec_{short}", 0)
+ rec = metrics.get(f"cat_recall_{short}", 0)
+ lines.append(f" {c:<25} {f1:>8.4f} {prec:>8.4f} {rec:>8.4f}")
+
+ # Specificity metrics
+ lines.append(f"\n {'─'*50}")
+ lines.append(" SPECIFICITY CLASSIFICATION")
+ lines.append(f" {'─'*50}")
+ lines.append(f" Macro F1: {metrics['spec_macro_f1']:.4f} {'✓' if metrics['spec_macro_f1'] >= 0.80 else '✗'} (target: 0.80)")
+ lines.append(f" Weighted F1: {metrics['spec_weighted_f1']:.4f}")
+ lines.append(f" Macro Prec: {metrics['spec_macro_precision']:.4f}")
+ lines.append(f" Macro Recall: {metrics['spec_macro_recall']:.4f}")
+ lines.append(f" MCC: {metrics['spec_mcc']:.4f}")
+ lines.append(f" AUC (OvR): {metrics['spec_auc']:.4f}")
+ lines.append(f" QWK: {metrics['spec_qwk']:.4f}")
+ lines.append(f" MAE: {metrics['spec_mae']:.4f}")
+ lines.append(f" ECE: {metrics['spec_ece']:.4f}")
+ lines.append(f" Kripp Alpha: {metrics['spec_kripp_alpha']:.4f}")
+ lines.append("")
+
+ lines.append(f" {'Level':<25} {'F1':>8} {'Prec':>8} {'Recall':>8}")
+ lines.append(f" {'-'*25} {'-'*8} {'-'*8} {'-'*8}")
+ for s in SPEC_LABELS:
+ short = s.replace(" ", "").replace("/", "").replace(":", "")[:8]
+ f1 = metrics.get(f"spec_f1_{short}", 0)
+ prec = metrics.get(f"spec_prec_{short}", 0)
+ rec = metrics.get(f"spec_recall_{short}", 0)
+ lines.append(f" {s:<25} {f1:>8.4f} {prec:>8.4f} {rec:>8.4f}")
+
+ lines.append(f"\n{'='*70}\n")
+
+ return "\n".join(lines)
+
+
+def evaluate(config: EvalConfig) -> dict:
+ """Run full holdout evaluation. Returns all metrics + generates figures."""
+ output_dir = Path(config.output_dir)
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ print(f"\n Device: {device}")
+
+ # Load holdout data
+ print(" Loading holdout data...")
+ records = load_holdout_data(
+ config.paragraphs_path,
+ config.holdout_path,
+ config.benchmark_paths,
+ )
+ print(f" Holdout paragraphs: {len(records)}")
+
+ # Load model
+ print(f" Loading model from {config.checkpoint_path}...")
+ model, tokenizer = load_model(config, device)
+ param_count = sum(p.numel() for p in model.parameters()) / 1e6
+ print(f" Model parameters: {param_count:.0f}M")
+
+ # Run inference
+ print(" Running inference...")
+ inference = run_inference(
+ model, tokenizer, records,
+ config.max_seq_length, config.batch_size,
+ device, config.specificity_head,
+ )
+ print(f" Inference complete: {inference['total_time_s']:.2f}s ({inference['avg_ms_per_sample']:.2f}ms/sample)")
+
+ # Evaluate against each benchmark
+ all_results = {}
+ all_figures = []
+
+ model_name = Path(config.checkpoint_path).parent.name
+ if model_name == "final":
+ model_name = Path(config.checkpoint_path).parent.parent.name
+
+ for ref_name, ref_path in config.benchmark_paths.items():
+ print(f"\n Evaluating against {ref_name}...")
+
+ # Get reference labels for paragraphs that have them
+ cat_labels = []
+ spec_labels = []
+ cat_preds = []
+ spec_preds = []
+ cat_probs = []
+ spec_probs = []
+
+ for i, rec in enumerate(records):
+ bench = rec["benchmark_labels"].get(ref_name)
+ if bench is None:
+ continue
+ cat_labels.append(CAT2ID[bench["category"]])
+ spec_labels.append(bench["specificity"] - 1) # 0-indexed
+ cat_preds.append(inference["cat_preds"][i])
+ spec_preds.append(inference["spec_preds"][i])
+ cat_probs.append(inference["cat_probs"][i])
+ spec_probs.append(inference["spec_probs"][i])
+
+ cat_labels = np.array(cat_labels)
+ spec_labels = np.array(spec_labels)
+ cat_preds = np.array(cat_preds)
+ spec_preds = np.array(spec_preds)
+ cat_probs = np.array(cat_probs)
+ spec_probs = np.array(spec_probs)
+
+ print(f" Matched samples: {len(cat_labels)}")
+
+ # Compute metrics
+ cat_metrics = compute_all_metrics(
+ cat_preds, cat_labels, cat_probs, CATEGORIES, "cat", is_ordinal=False
+ )
+ spec_metrics = compute_all_metrics(
+ spec_preds, spec_labels, spec_probs, SPEC_LABELS, "spec", is_ordinal=True
+ )
+
+ combined = {**cat_metrics, **spec_metrics, **inference}
+ combined["combined_macro_f1"] = (combined["cat_macro_f1"] + combined["spec_macro_f1"]) / 2
+
+ # Print report
+ report = format_report(model_name, ref_name, combined, inference)
+ print(report)
+
+ # Save report
+ report_path = output_dir / f"report_{ref_name.lower().replace(' ', '_').replace('.', '')}.txt"
+ with open(report_path, "w") as f:
+ f.write(report)
+
+ # Generate figures
+ figs = generate_figures(combined, output_dir, model_name, ref_name)
+ all_figures.extend(figs)
+
+ all_results[f"{model_name}_vs_{ref_name}"] = combined
+
+ # Cross-model comparison figures
+ comp_figs = generate_comparison_figures(all_results, output_dir)
+ all_figures.extend(comp_figs)
+
+ # Save all metrics as JSON
+ serializable = {}
+ for k, v in all_results.items():
+ serializable[k] = {
+ mk: mv for mk, mv in v.items()
+ if isinstance(mv, (int, float, str, list, bool))
+ }
+ metrics_path = output_dir / "metrics.json"
+ with open(metrics_path, "w") as f:
+ json.dump(serializable, f, indent=2, default=str)
+
+ print(f"\n Results saved to {output_dir}")
+ print(f" Figures: {len(all_figures)} generated")
+ for fig_path in all_figures:
+ print(f" {fig_path}")
+
+ return all_results
+
+
+def evaluate_comparison(
+ configs: list[EvalConfig],
+ output_dir: str,
+) -> dict:
+ """Evaluate multiple model checkpoints and generate cross-model comparisons."""
+ output_dir = Path(output_dir)
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ all_results = {}
+ for config in configs:
+ results = evaluate(config)
+ all_results.update(results)
+
+ # Generate comparison figures across all models
+ comp_figs = generate_comparison_figures(all_results, output_dir)
+ print(f"\n Comparison figures: {len(comp_figs)}")
+
+ return all_results
diff --git a/results/eval/comparison/comparison_table.png b/results/eval/comparison/comparison_table.png
new file mode 100644
index 0000000..c848429
Binary files /dev/null and b/results/eval/comparison/comparison_table.png differ
diff --git a/results/eval/comparison/coral_vs_independent_all_metrics.png b/results/eval/comparison/coral_vs_independent_all_metrics.png
new file mode 100644
index 0000000..fc47234
Binary files /dev/null and b/results/eval/comparison/coral_vs_independent_all_metrics.png differ
diff --git a/results/eval/comparison/coral_vs_independent_f1.png b/results/eval/comparison/coral_vs_independent_f1.png
new file mode 100644
index 0000000..b67e2d0
Binary files /dev/null and b/results/eval/comparison/coral_vs_independent_f1.png differ
diff --git a/results/eval/comparison/improvement_delta.png b/results/eval/comparison/improvement_delta.png
new file mode 100644
index 0000000..eaabf68
Binary files /dev/null and b/results/eval/comparison/improvement_delta.png differ
diff --git a/results/eval/comparison/spec_confusion_comparison.png b/results/eval/comparison/spec_confusion_comparison.png
new file mode 100644
index 0000000..b4f9f1f
Binary files /dev/null and b/results/eval/comparison/spec_confusion_comparison.png differ
diff --git a/results/eval/coral-baseline/figures/calibration_cat_gpt-5.4.png b/results/eval/coral-baseline/figures/calibration_cat_gpt-5.4.png
new file mode 100644
index 0000000..5f7db0f
Binary files /dev/null and b/results/eval/coral-baseline/figures/calibration_cat_gpt-5.4.png differ
diff --git a/results/eval/coral-baseline/figures/calibration_cat_opus-4.6.png b/results/eval/coral-baseline/figures/calibration_cat_opus-4.6.png
new file mode 100644
index 0000000..1095aa6
Binary files /dev/null and b/results/eval/coral-baseline/figures/calibration_cat_opus-4.6.png differ
diff --git a/results/eval/coral-baseline/figures/confusion_cat_gpt-5.4.png b/results/eval/coral-baseline/figures/confusion_cat_gpt-5.4.png
new file mode 100644
index 0000000..ad136d6
Binary files /dev/null and b/results/eval/coral-baseline/figures/confusion_cat_gpt-5.4.png differ
diff --git a/results/eval/coral-baseline/figures/confusion_cat_opus-4.6.png b/results/eval/coral-baseline/figures/confusion_cat_opus-4.6.png
new file mode 100644
index 0000000..9ee7476
Binary files /dev/null and b/results/eval/coral-baseline/figures/confusion_cat_opus-4.6.png differ
diff --git a/results/eval/coral-baseline/figures/confusion_spec_gpt-5.4.png b/results/eval/coral-baseline/figures/confusion_spec_gpt-5.4.png
new file mode 100644
index 0000000..cf62aef
Binary files /dev/null and b/results/eval/coral-baseline/figures/confusion_spec_gpt-5.4.png differ
diff --git a/results/eval/coral-baseline/figures/confusion_spec_opus-4.6.png b/results/eval/coral-baseline/figures/confusion_spec_opus-4.6.png
new file mode 100644
index 0000000..0024a79
Binary files /dev/null and b/results/eval/coral-baseline/figures/confusion_spec_opus-4.6.png differ
diff --git a/results/eval/coral-baseline/figures/model_comparison.png b/results/eval/coral-baseline/figures/model_comparison.png
new file mode 100644
index 0000000..335c3b8
Binary files /dev/null and b/results/eval/coral-baseline/figures/model_comparison.png differ
diff --git a/results/eval/coral-baseline/figures/per_class_f1_gpt-5.4.png b/results/eval/coral-baseline/figures/per_class_f1_gpt-5.4.png
new file mode 100644
index 0000000..1495660
Binary files /dev/null and b/results/eval/coral-baseline/figures/per_class_f1_gpt-5.4.png differ
diff --git a/results/eval/coral-baseline/figures/per_class_f1_opus-4.6.png b/results/eval/coral-baseline/figures/per_class_f1_opus-4.6.png
new file mode 100644
index 0000000..c390c0d
Binary files /dev/null and b/results/eval/coral-baseline/figures/per_class_f1_opus-4.6.png differ
diff --git a/results/eval/coral-baseline/figures/speed_comparison.png b/results/eval/coral-baseline/figures/speed_comparison.png
new file mode 100644
index 0000000..b6f0b30
Binary files /dev/null and b/results/eval/coral-baseline/figures/speed_comparison.png differ
diff --git a/results/eval/coral-baseline/metrics.json b/results/eval/coral-baseline/metrics.json
new file mode 100644
index 0000000..210be57
--- /dev/null
+++ b/results/eval/coral-baseline/metrics.json
@@ -0,0 +1,298 @@
+{
+ "best-base_weighted_ce-ep5_vs_GPT-5.4": {
+ "cat_macro_f1": 0.9360575579144376,
+ "cat_weighted_f1": 0.9360564701876355,
+ "cat_macro_precision": 0.9336791798534633,
+ "cat_macro_recall": 0.9414080218768329,
+ "cat_mcc": 0.9248088496355107,
+ "cat_auc": 0.991343460842945,
+ "cat_ece": 0.0441274690628052,
+ "cat_confusion_matrix": [
+ [
+ 220,
+ 0,
+ 8,
+ 0,
+ 1,
+ 0,
+ 1
+ ],
+ [
+ 0,
+ 86,
+ 0,
+ 0,
+ 1,
+ 1,
+ 0
+ ],
+ [
+ 1,
+ 0,
+ 143,
+ 1,
+ 5,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 2,
+ 133,
+ 0,
+ 1,
+ 0
+ ],
+ [
+ 6,
+ 0,
+ 7,
+ 18,
+ 165,
+ 1,
+ 1
+ ],
+ [
+ 0,
+ 3,
+ 1,
+ 8,
+ 2,
+ 207,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 1,
+ 6,
+ 1,
+ 169
+ ]
+ ],
+ "cat_f1_BoardGov": 0.962800875273523,
+ "cat_prec_BoardGov": 0.9691629955947136,
+ "cat_recall_BoardGov": 0.9565217391304348,
+ "cat_f1_Incident": 0.9717514124293786,
+ "cat_prec_Incident": 0.9662921348314607,
+ "cat_recall_Incident": 0.9772727272727273,
+ "cat_f1_Manageme": 0.9196141479099679,
+ "cat_prec_Manageme": 0.8881987577639752,
+ "cat_recall_Manageme": 0.9533333333333334,
+ "cat_f1_NoneOthe": 0.8956228956228957,
+ "cat_prec_NoneOthe": 0.8260869565217391,
+ "cat_recall_NoneOthe": 0.9779411764705882,
+ "cat_f1_RiskMana": 0.873015873015873,
+ "cat_prec_RiskMana": 0.9166666666666666,
+ "cat_recall_RiskMana": 0.8333333333333334,
+ "cat_f1_Strategy": 0.9583333333333334,
+ "cat_prec_Strategy": 0.981042654028436,
+ "cat_recall_Strategy": 0.9366515837104072,
+ "cat_f1_Third-Pa": 0.9712643678160919,
+ "cat_prec_Third-Pa": 0.9883040935672515,
+ "cat_recall_Third-Pa": 0.9548022598870056,
+ "cat_kripp_alpha": 0.9243601922903683,
+ "spec_macro_f1": 0.5970357338282843,
+ "spec_weighted_f1": 0.7040798408451929,
+ "spec_macro_precision": 0.7225196233593912,
+ "spec_macro_recall": 0.6139005306639329,
+ "spec_mcc": 0.6138700055328291,
+ "spec_auc": 0.9498756282617218,
+ "spec_ece": 0.1652249880135059,
+ "spec_confusion_matrix": [
+ [
+ 596,
+ 6,
+ 11,
+ 5
+ ],
+ [
+ 105,
+ 46,
+ 9,
+ 8
+ ],
+ [
+ 14,
+ 6,
+ 52,
+ 135
+ ],
+ [
+ 4,
+ 0,
+ 3,
+ 200
+ ]
+ ],
+ "spec_f1_L1Generi": 0.8915482423335827,
+ "spec_prec_L1Generi": 0.8289290681502086,
+ "spec_recall_L1Generi": 0.9644012944983819,
+ "spec_f1_L2Domain": 0.40707964601769914,
+ "spec_prec_L2Domain": 0.7931034482758621,
+ "spec_recall_L2Domain": 0.27380952380952384,
+ "spec_f1_L3Firm-S": 0.36879432624113473,
+ "spec_prec_L3Firm-S": 0.6933333333333334,
+ "spec_recall_L3Firm-S": 0.25120772946859904,
+ "spec_f1_L4Quanti": 0.7207207207207207,
+ "spec_prec_L4Quanti": 0.5747126436781609,
+ "spec_recall_L4Quanti": 0.966183574879227,
+ "spec_qwk": 0.8757404773441285,
+ "spec_mae": 0.2975,
+ "spec_kripp_alpha": 0.8479072400833478,
+ "total_time_s": 6.695346015971154,
+ "num_samples": 1200,
+ "avg_ms_per_sample": 5.579455013309295,
+ "combined_macro_f1": 0.766546645871361
+ },
+ "best-base_weighted_ce-ep5_vs_Opus-4.6": {
+ "cat_macro_f1": 0.9280167387549427,
+ "cat_weighted_f1": 0.9273898648954128,
+ "cat_macro_precision": 0.9223465490796974,
+ "cat_macro_recall": 0.9382296607170699,
+ "cat_mcc": 0.9162751746063641,
+ "cat_auc": 0.992382433433919,
+ "cat_ece": 0.04689237485329312,
+ "cat_confusion_matrix": [
+ [
+ 209,
+ 0,
+ 2,
+ 1,
+ 1,
+ 0,
+ 1
+ ],
+ [
+ 0,
+ 78,
+ 0,
+ 0,
+ 1,
+ 0,
+ 0
+ ],
+ [
+ 4,
+ 0,
+ 147,
+ 2,
+ 4,
+ 0,
+ 1
+ ],
+ [
+ 0,
+ 0,
+ 1,
+ 139,
+ 1,
+ 0,
+ 0
+ ],
+ [
+ 13,
+ 1,
+ 10,
+ 15,
+ 168,
+ 1,
+ 5
+ ],
+ [
+ 1,
+ 10,
+ 1,
+ 4,
+ 3,
+ 209,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 2,
+ 1,
+ 164
+ ]
+ ],
+ "cat_f1_BoardGov": 0.9478458049886621,
+ "cat_prec_BoardGov": 0.920704845814978,
+ "cat_recall_BoardGov": 0.9766355140186916,
+ "cat_f1_Incident": 0.9285714285714286,
+ "cat_prec_Incident": 0.8764044943820225,
+ "cat_recall_Incident": 0.9873417721518988,
+ "cat_f1_Manageme": 0.9216300940438872,
+ "cat_prec_Manageme": 0.9130434782608695,
+ "cat_recall_Manageme": 0.930379746835443,
+ "cat_f1_NoneOthe": 0.9205298013245033,
+ "cat_prec_NoneOthe": 0.8633540372670807,
+ "cat_recall_NoneOthe": 0.9858156028368794,
+ "cat_f1_RiskMana": 0.8549618320610687,
+ "cat_prec_RiskMana": 0.9333333333333333,
+ "cat_recall_RiskMana": 0.7887323943661971,
+ "cat_f1_Strategy": 0.9521640091116174,
+ "cat_prec_Strategy": 0.990521327014218,
+ "cat_recall_Strategy": 0.9166666666666666,
+ "cat_f1_Third-Pa": 0.9704142011834319,
+ "cat_prec_Third-Pa": 0.9590643274853801,
+ "cat_recall_Third-Pa": 0.9820359281437125,
+ "cat_kripp_alpha": 0.9154955768233572,
+ "spec_macro_f1": 0.5957642708821952,
+ "spec_weighted_f1": 0.693005282664721,
+ "spec_macro_precision": 0.731933400476396,
+ "spec_macro_recall": 0.6249872364065566,
+ "spec_mcc": 0.6143201053040909,
+ "spec_auc": 0.9470735892830423,
+ "spec_ece": 0.18189165468017254,
+ "spec_confusion_matrix": [
+ [
+ 592,
+ 4,
+ 4,
+ 5
+ ],
+ [
+ 92,
+ 42,
+ 6,
+ 5
+ ],
+ [
+ 35,
+ 12,
+ 63,
+ 150
+ ],
+ [
+ 0,
+ 0,
+ 2,
+ 188
+ ]
+ ],
+ "spec_f1_L1Generi": 0.8942598187311178,
+ "spec_prec_L1Generi": 0.8233657858136301,
+ "spec_recall_L1Generi": 0.9785123966942149,
+ "spec_f1_L2Domain": 0.41379310344827586,
+ "spec_prec_L2Domain": 0.7241379310344828,
+ "spec_recall_L2Domain": 0.2896551724137931,
+ "spec_f1_L3Firm-S": 0.3761194029850746,
+ "spec_prec_L3Firm-S": 0.84,
+ "spec_recall_L3Firm-S": 0.2423076923076923,
+ "spec_f1_L4Quanti": 0.6988847583643123,
+ "spec_prec_L4Quanti": 0.5402298850574713,
+ "spec_recall_L4Quanti": 0.9894736842105263,
+ "spec_qwk": 0.872110225054491,
+ "spec_mae": 0.3075,
+ "spec_kripp_alpha": 0.8502616991488389,
+ "total_time_s": 6.695346015971154,
+ "num_samples": 1200,
+ "avg_ms_per_sample": 5.579455013309295,
+ "combined_macro_f1": 0.7618905048185689
+ }
+}
\ No newline at end of file
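The per-class and macro F1 values in `metrics.json` derive directly from the stored confusion matrices (rows = reference labels, columns = model predictions). A minimal sketch of that computation (hypothetical `macro_f1` helper, assuming that row/column convention):

```python
def macro_f1(cm: list[list[int]]) -> float:
    """Macro-averaged F1 from a confusion matrix where cm[i][j] counts
    samples with true class i predicted as class j."""
    n = len(cm)
    f1s = []
    for i in range(n):
        tp = cm[i][i]
        pred = sum(cm[r][i] for r in range(n))  # column sum: predicted as i
        true = sum(cm[i])                       # row sum: truly i
        prec = tp / pred if pred else 0.0
        rec = tp / true if true else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n

# Toy 2-class example: class 0 gets prec 2/3, rec 1.0; class 1 gets prec 1.0, rec 0.5
print(macro_f1([[2, 0], [1, 1]]))
```

Macro averaging weights each class equally, which is why the rare L2/L3 specificity levels drag the baseline's spec macro F1 down so sharply despite a strong weighted F1.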
diff --git a/results/eval/coral-baseline/report_gpt-54.txt b/results/eval/coral-baseline/report_gpt-54.txt
new file mode 100644
index 0000000..1b25429
--- /dev/null
+++ b/results/eval/coral-baseline/report_gpt-54.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+ HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs GPT-5.4
+======================================================================
+
+ Samples evaluated: 1200
+ Total inference time: 6.70s
+ Avg latency: 5.58ms/sample
+ Throughput: 179 samples/sec
+
+ ──────────────────────────────────────────────────
+ CATEGORY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.9361 ✓ (target: 0.80)
+ Weighted F1: 0.9361
+ Macro Prec: 0.9337
+ Macro Recall: 0.9414
+ MCC: 0.9248
+ AUC (OvR): 0.9913
+ ECE: 0.0441
+ Kripp Alpha: 0.9244
+
+ Category F1 Prec Recall
+ ------------------------- -------- -------- --------
+ Board Governance 0.9628 0.9692 0.9565
+ Incident Disclosure 0.9718 0.9663 0.9773
+ Management Role 0.9196 0.8882 0.9533
+ None/Other 0.8956 0.8261 0.9779
+ Risk Management Process 0.8730 0.9167 0.8333
+ Strategy Integration 0.9583 0.9810 0.9367
+ Third-Party Risk 0.9713 0.9883 0.9548
+
+ ──────────────────────────────────────────────────
+ SPECIFICITY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.5970 ✗ (target: 0.80)
+ Weighted F1: 0.7041
+ Macro Prec: 0.7225
+ Macro Recall: 0.6139
+ MCC: 0.6139
+ AUC (OvR): 0.9499
+ QWK: 0.8757
+ MAE: 0.2975
+ ECE: 0.1652
+ Kripp Alpha: 0.8479
+
+ Level F1 Prec Recall
+ ------------------------- -------- -------- --------
+ L1: Generic 0.8915 0.8289 0.9644
+ L2: Domain 0.4071 0.7931 0.2738
+ L3: Firm-Specific 0.3688 0.6933 0.2512
+ L4: Quantified 0.7207 0.5747 0.9662
+
+======================================================================
diff --git a/results/eval/coral-baseline/report_opus-46.txt b/results/eval/coral-baseline/report_opus-46.txt
new file mode 100644
index 0000000..ca3012d
--- /dev/null
+++ b/results/eval/coral-baseline/report_opus-46.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+ HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs Opus-4.6
+======================================================================
+
+ Samples evaluated: 1200
+ Total inference time: 6.70s
+ Avg latency: 5.58ms/sample
+ Throughput: 179 samples/sec
+
+ ──────────────────────────────────────────────────
+ CATEGORY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.9280 ✓ (target: 0.80)
+ Weighted F1: 0.9274
+ Macro Prec: 0.9223
+ Macro Recall: 0.9382
+ MCC: 0.9163
+ AUC (OvR): 0.9924
+ ECE: 0.0469
+ Kripp Alpha: 0.9155
+
+ Category F1 Prec Recall
+ ------------------------- -------- -------- --------
+ Board Governance 0.9478 0.9207 0.9766
+ Incident Disclosure 0.9286 0.8764 0.9873
+ Management Role 0.9216 0.9130 0.9304
+ None/Other 0.9205 0.8634 0.9858
+ Risk Management Process 0.8550 0.9333 0.7887
+ Strategy Integration 0.9522 0.9905 0.9167
+ Third-Party Risk 0.9704 0.9591 0.9820
+
+ ──────────────────────────────────────────────────
+ SPECIFICITY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.5958 ✗ (target: 0.80)
+ Weighted F1: 0.6930
+ Macro Prec: 0.7319
+ Macro Recall: 0.6250
+ MCC: 0.6143
+ AUC (OvR): 0.9471
+ QWK: 0.8721
+ MAE: 0.3075
+ ECE: 0.1819
+ Kripp Alpha: 0.8503
+
+ Level F1 Prec Recall
+ ------------------------- -------- -------- --------
+ L1: Generic 0.8943 0.8234 0.9785
+ L2: Domain 0.4138 0.7241 0.2897
+ L3: Firm-Specific 0.3761 0.8400 0.2423
+ L4: Quantified 0.6989 0.5402 0.9895
+
+======================================================================
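The specificity reports lean on quadratically weighted kappa (QWK), which penalizes distant ordinal errors (L1 vs L4) far more than near misses (L2 vs L3). A minimal sketch of QWK from a confusion matrix (hypothetical helper, standard quadratic-weights definition):

```python
def qwk(cm: list[list[int]]) -> float:
    """Quadratic weighted kappa: 1 - sum(w*O) / sum(w*E), where
    w[i][j] = (i-j)^2 / (K-1)^2 and E is the outer product of marginals."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_marg = [sum(row) for row in cm]
    col_marg = [sum(cm[r][c] for r in range(k)) for c in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * cm[i][j]                        # observed disagreement
            den += w * row_marg[i] * col_marg[j] / n   # chance disagreement
    return 1.0 - num / den

print(qwk([[5, 0], [0, 5]]))  # perfect agreement on a toy 2x2 matrix
```

This explains the baseline's split picture: QWK near 0.88 alongside a macro F1 near 0.60, because its L2/L3 confusions were mostly adjacent-level misses.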
diff --git a/results/eval/iter1-independent/figures/calibration_cat_gpt-5.4.png b/results/eval/iter1-independent/figures/calibration_cat_gpt-5.4.png
new file mode 100644
index 0000000..1ca1e0b
Binary files /dev/null and b/results/eval/iter1-independent/figures/calibration_cat_gpt-5.4.png differ
diff --git a/results/eval/iter1-independent/figures/calibration_cat_opus-4.6.png b/results/eval/iter1-independent/figures/calibration_cat_opus-4.6.png
new file mode 100644
index 0000000..9d7c48b
Binary files /dev/null and b/results/eval/iter1-independent/figures/calibration_cat_opus-4.6.png differ
diff --git a/results/eval/iter1-independent/figures/confusion_cat_gpt-5.4.png b/results/eval/iter1-independent/figures/confusion_cat_gpt-5.4.png
new file mode 100644
index 0000000..540db31
Binary files /dev/null and b/results/eval/iter1-independent/figures/confusion_cat_gpt-5.4.png differ
diff --git a/results/eval/iter1-independent/figures/confusion_cat_opus-4.6.png b/results/eval/iter1-independent/figures/confusion_cat_opus-4.6.png
new file mode 100644
index 0000000..671e45e
Binary files /dev/null and b/results/eval/iter1-independent/figures/confusion_cat_opus-4.6.png differ
diff --git a/results/eval/iter1-independent/figures/confusion_spec_gpt-5.4.png b/results/eval/iter1-independent/figures/confusion_spec_gpt-5.4.png
new file mode 100644
index 0000000..6a34fc9
Binary files /dev/null and b/results/eval/iter1-independent/figures/confusion_spec_gpt-5.4.png differ
diff --git a/results/eval/iter1-independent/figures/confusion_spec_opus-4.6.png b/results/eval/iter1-independent/figures/confusion_spec_opus-4.6.png
new file mode 100644
index 0000000..b7cb10d
Binary files /dev/null and b/results/eval/iter1-independent/figures/confusion_spec_opus-4.6.png differ
diff --git a/results/eval/iter1-independent/figures/model_comparison.png b/results/eval/iter1-independent/figures/model_comparison.png
new file mode 100644
index 0000000..0c35f6a
Binary files /dev/null and b/results/eval/iter1-independent/figures/model_comparison.png differ
diff --git a/results/eval/iter1-independent/figures/per_class_f1_gpt-5.4.png b/results/eval/iter1-independent/figures/per_class_f1_gpt-5.4.png
new file mode 100644
index 0000000..dad5668
Binary files /dev/null and b/results/eval/iter1-independent/figures/per_class_f1_gpt-5.4.png differ
diff --git a/results/eval/iter1-independent/figures/per_class_f1_opus-4.6.png b/results/eval/iter1-independent/figures/per_class_f1_opus-4.6.png
new file mode 100644
index 0000000..46e46f0
Binary files /dev/null and b/results/eval/iter1-independent/figures/per_class_f1_opus-4.6.png differ
diff --git a/results/eval/iter1-independent/figures/speed_comparison.png b/results/eval/iter1-independent/figures/speed_comparison.png
new file mode 100644
index 0000000..665b6e3
Binary files /dev/null and b/results/eval/iter1-independent/figures/speed_comparison.png differ
diff --git a/results/eval/iter1-independent/metrics.json b/results/eval/iter1-independent/metrics.json
new file mode 100644
index 0000000..7e34412
--- /dev/null
+++ b/results/eval/iter1-independent/metrics.json
@@ -0,0 +1,298 @@
+{
+ "iter1-independent_vs_GPT-5.4": {
+ "cat_macro_f1": 0.9336741161693523,
+ "cat_weighted_f1": 0.9343162998643407,
+ "cat_macro_precision": 0.93189297179766,
+ "cat_macro_recall": 0.9377918652022429,
+ "cat_mcc": 0.9226990724708704,
+ "cat_auc": 0.991991833154947,
+ "cat_ece": 0.053848127176364245,
+ "cat_confusion_matrix": [
+ [
+ 225,
+ 0,
+ 3,
+ 0,
+ 2,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 85,
+ 0,
+ 0,
+ 2,
+ 1,
+ 0
+ ],
+ [
+ 2,
+ 0,
+ 144,
+ 1,
+ 3,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 3,
+ 131,
+ 0,
+ 2,
+ 0
+ ],
+ [
+ 6,
+ 1,
+ 5,
+ 19,
+ 164,
+ 1,
+ 2
+ ],
+ [
+ 0,
+ 3,
+ 1,
+ 8,
+ 2,
+ 207,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 12,
+ 0,
+ 165
+ ]
+ ],
+ "cat_f1_BoardGov": 0.9719222462203023,
+ "cat_prec_BoardGov": 0.9656652360515021,
+ "cat_recall_BoardGov": 0.9782608695652174,
+ "cat_f1_Incident": 0.96045197740113,
+ "cat_prec_Incident": 0.9550561797752809,
+ "cat_recall_Incident": 0.9659090909090909,
+ "cat_f1_Manageme": 0.9411764705882353,
+ "cat_prec_Manageme": 0.9230769230769231,
+ "cat_recall_Manageme": 0.96,
+ "cat_f1_NoneOthe": 0.888135593220339,
+ "cat_prec_NoneOthe": 0.8238993710691824,
+ "cat_recall_NoneOthe": 0.9632352941176471,
+ "cat_f1_RiskMana": 0.856396866840731,
+ "cat_prec_RiskMana": 0.8864864864864865,
+ "cat_recall_RiskMana": 0.8282828282828283,
+ "cat_f1_Strategy": 0.9583333333333334,
+ "cat_prec_Strategy": 0.981042654028436,
+ "cat_recall_Strategy": 0.9366515837104072,
+ "cat_f1_Third-Pa": 0.9593023255813954,
+ "cat_prec_Third-Pa": 0.9880239520958084,
+ "cat_recall_Third-Pa": 0.9322033898305084,
+ "cat_kripp_alpha": 0.9223591517560865,
+ "spec_macro_f1": 0.8951731906425856,
+ "spec_weighted_f1": 0.9121524819510628,
+ "spec_macro_precision": 0.8980417155129858,
+ "spec_macro_recall": 0.8930560580782194,
+ "spec_mcc": 0.866381831963237,
+ "spec_auc": 0.981666223606385,
+ "spec_ece": 0.07135417198141418,
+ "spec_confusion_matrix": [
+ [
+ 580,
+ 23,
+ 12,
+ 3
+ ],
+ [
+ 29,
+ 130,
+ 7,
+ 2
+ ],
+ [
+ 11,
+ 4,
+ 190,
+ 2
+ ],
+ [
+ 2,
+ 1,
+ 9,
+ 195
+ ]
+ ],
+ "spec_f1_L1Generi": 0.9354838709677419,
+ "spec_prec_L1Generi": 0.932475884244373,
+ "spec_recall_L1Generi": 0.9385113268608414,
+ "spec_f1_L2Domain": 0.7975460122699386,
+ "spec_prec_L2Domain": 0.8227848101265823,
+ "spec_recall_L2Domain": 0.7738095238095238,
+ "spec_f1_L3Firm-S": 0.8941176470588236,
+ "spec_prec_L3Firm-S": 0.8715596330275229,
+ "spec_recall_L3Firm-S": 0.9178743961352657,
+ "spec_f1_L4Quanti": 0.9535452322738386,
+ "spec_prec_L4Quanti": 0.9653465346534653,
+ "spec_recall_L4Quanti": 0.9420289855072463,
+ "spec_qwk": 0.9324447137231142,
+ "spec_mae": 0.1175,
+ "spec_kripp_alpha": 0.917725722448833,
+ "total_time_s": 6.732117835083045,
+ "num_samples": 1200,
+ "avg_ms_per_sample": 5.6100981959025376,
+ "combined_macro_f1": 0.9144236534059689
+ },
+ "iter1-independent_vs_Opus-4.6": {
+ "cat_macro_f1": 0.922684387023173,
+ "cat_weighted_f1": 0.9216414809666168,
+ "cat_macro_precision": 0.9177680939029339,
+ "cat_macro_recall": 0.9316060900094703,
+ "cat_mcc": 0.909266938399113,
+ "cat_auc": 0.9939660707189948,
+ "cat_ece": 0.06551479384303091,
+ "cat_confusion_matrix": [
+ [
+ 211,
+ 0,
+ 1,
+ 1,
+ 1,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 78,
+ 0,
+ 0,
+ 1,
+ 0,
+ 0
+ ],
+ [
+ 8,
+ 0,
+ 144,
+ 1,
+ 4,
+ 0,
+ 1
+ ],
+ [
+ 0,
+ 0,
+ 1,
+ 138,
+ 1,
+ 1,
+ 0
+ ],
+ [
+ 13,
+ 0,
+ 9,
+ 14,
+ 169,
+ 1,
+ 7
+ ],
+ [
+ 1,
+ 11,
+ 1,
+ 4,
+ 3,
+ 208,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 1,
+ 6,
+ 1,
+ 159
+ ]
+ ],
+ "cat_f1_BoardGov": 0.9440715883668904,
+ "cat_prec_BoardGov": 0.9055793991416309,
+ "cat_recall_BoardGov": 0.985981308411215,
+ "cat_f1_Incident": 0.9285714285714286,
+ "cat_prec_Incident": 0.8764044943820225,
+ "cat_recall_Incident": 0.9873417721518988,
+ "cat_f1_Manageme": 0.9171974522292994,
+ "cat_prec_Manageme": 0.9230769230769231,
+ "cat_recall_Manageme": 0.9113924050632911,
+ "cat_f1_NoneOthe": 0.92,
+ "cat_prec_NoneOthe": 0.8679245283018868,
+ "cat_recall_NoneOthe": 0.9787234042553191,
+ "cat_f1_RiskMana": 0.8492462311557789,
+ "cat_prec_RiskMana": 0.9135135135135135,
+ "cat_recall_RiskMana": 0.7934272300469484,
+ "cat_f1_Strategy": 0.9476082004555809,
+ "cat_prec_Strategy": 0.985781990521327,
+ "cat_recall_Strategy": 0.9122807017543859,
+ "cat_f1_Third-Pa": 0.9520958083832335,
+ "cat_prec_Third-Pa": 0.9520958083832335,
+ "cat_recall_Third-Pa": 0.9520958083832335,
+ "cat_kripp_alpha": 0.908575631724203,
+ "spec_macro_f1": 0.8833694419146193,
+ "spec_weighted_f1": 0.9004034318676798,
+ "spec_macro_precision": 0.8858989636247611,
+ "spec_macro_recall": 0.8854684685880032,
+ "spec_mcc": 0.8500778641433316,
+ "spec_auc": 0.9736633898988131,
+ "spec_ece": 0.08248284702499709,
+ "spec_confusion_matrix": [
+ [
+ 567,
+ 30,
+ 7,
+ 1
+ ],
+ [
+ 22,
+ 118,
+ 3,
+ 2
+ ],
+ [
+ 33,
+ 10,
+ 207,
+ 10
+ ],
+ [
+ 0,
+ 0,
+ 1,
+ 189
+ ]
+ ],
+ "spec_f1_L1Generi": 0.9242053789731052,
+ "spec_prec_L1Generi": 0.9115755627009646,
+ "spec_recall_L1Generi": 0.9371900826446281,
+ "spec_f1_L2Domain": 0.7788778877887789,
+ "spec_prec_L2Domain": 0.7468354430379747,
+ "spec_recall_L2Domain": 0.8137931034482758,
+ "spec_f1_L3Firm-S": 0.8661087866108786,
+ "spec_prec_L3Firm-S": 0.9495412844036697,
+ "spec_recall_L3Firm-S": 0.7961538461538461,
+ "spec_f1_L4Quanti": 0.9642857142857143,
+ "spec_prec_L4Quanti": 0.9356435643564357,
+ "spec_recall_L4Quanti": 0.9947368421052631,
+ "spec_qwk": 0.9227008860372746,
+ "spec_mae": 0.13583333333333333,
+ "spec_kripp_alpha": 0.9065248741550552,
+ "total_time_s": 6.732117835083045,
+ "num_samples": 1200,
+ "avg_ms_per_sample": 5.6100981959025376,
+ "combined_macro_f1": 0.9030269144688962
+ }
+}
\ No newline at end of file
diff --git a/results/eval/iter1-independent/report_gpt-54.txt b/results/eval/iter1-independent/report_gpt-54.txt
new file mode 100644
index 0000000..ccf57e2
--- /dev/null
+++ b/results/eval/iter1-independent/report_gpt-54.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+ HOLDOUT EVALUATION: iter1-independent vs GPT-5.4
+======================================================================
+
+ Samples evaluated: 1200
+ Total inference time: 6.73s
+ Avg latency: 5.61ms/sample
+ Throughput: 178 samples/sec
+
+ ──────────────────────────────────────────────────
+ CATEGORY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.9337 ✓ (target: 0.80)
+ Weighted F1: 0.9343
+ Macro Prec: 0.9319
+ Macro Recall: 0.9378
+ MCC: 0.9227
+ AUC (OvR): 0.9920
+ ECE: 0.0538
+ Kripp Alpha: 0.9224
+
+ Category F1 Prec Recall
+ ------------------------- -------- -------- --------
+ Board Governance 0.9719 0.9657 0.9783
+ Incident Disclosure 0.9605 0.9551 0.9659
+ Management Role 0.9412 0.9231 0.9600
+ None/Other 0.8881 0.8239 0.9632
+ Risk Management Process 0.8564 0.8865 0.8283
+ Strategy Integration 0.9583 0.9810 0.9367
+ Third-Party Risk 0.9593 0.9880 0.9322
+
+ ──────────────────────────────────────────────────
+ SPECIFICITY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.8952 ✓ (target: 0.80)
+ Weighted F1: 0.9122
+ Macro Prec: 0.8980
+ Macro Recall: 0.8931
+ MCC: 0.8664
+ AUC (OvR): 0.9817
+ QWK: 0.9324
+ MAE: 0.1175
+ ECE: 0.0714
+ Kripp Alpha: 0.9177
+
+ Level F1 Prec Recall
+ ------------------------- -------- -------- --------
+ L1: Generic 0.9355 0.9325 0.9385
+ L2: Domain 0.7975 0.8228 0.7738
+ L3: Firm-Specific 0.8941 0.8716 0.9179
+ L4: Quantified 0.9535 0.9653 0.9420
+
+======================================================================
diff --git a/results/eval/iter1-independent/report_opus-46.txt b/results/eval/iter1-independent/report_opus-46.txt
new file mode 100644
index 0000000..78c7c47
--- /dev/null
+++ b/results/eval/iter1-independent/report_opus-46.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+ HOLDOUT EVALUATION: iter1-independent vs Opus-4.6
+======================================================================
+
+ Samples evaluated: 1200
+ Total inference time: 6.73s
+ Avg latency: 5.61ms/sample
+ Throughput: 178 samples/sec
+
+ ──────────────────────────────────────────────────
+ CATEGORY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.9227 ✓ (target: 0.80)
+ Weighted F1: 0.9216
+ Macro Prec: 0.9178
+ Macro Recall: 0.9316
+ MCC: 0.9093
+ AUC (OvR): 0.9940
+ ECE: 0.0655
+ Kripp Alpha: 0.9086
+
+ Category F1 Prec Recall
+ ------------------------- -------- -------- --------
+ Board Governance 0.9441 0.9056 0.9860
+ Incident Disclosure 0.9286 0.8764 0.9873
+ Management Role 0.9172 0.9231 0.9114
+ None/Other 0.9200 0.8679 0.9787
+ Risk Management Process 0.8492 0.9135 0.7934
+ Strategy Integration 0.9476 0.9858 0.9123
+ Third-Party Risk 0.9521 0.9521 0.9521
+
+ ──────────────────────────────────────────────────
+ SPECIFICITY CLASSIFICATION
+ ──────────────────────────────────────────────────
+ Macro F1: 0.8834 ✓ (target: 0.80)
+ Weighted F1: 0.9004
+ Macro Prec: 0.8859
+ Macro Recall: 0.8855
+ MCC: 0.8501
+ AUC (OvR): 0.9737
+ QWK: 0.9227
+ MAE: 0.1358
+ ECE: 0.0825
+ Kripp Alpha: 0.9065
+
+ Level F1 Prec Recall
+ ------------------------- -------- -------- --------
+ L1: Generic 0.9242 0.9116 0.9372
+ L2: Domain 0.7789 0.7468 0.8138
+ L3: Firm-Specific 0.8661 0.9495 0.7962
+ L4: Quantified 0.9643 0.9356 0.9947
+
+======================================================================