working model!!!!!

This commit is contained in:
Joey Eamigh 2026-04-05 15:37:50 -04:00
parent 41df5923f2
commit 2e932bc327
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
46 changed files with 2025 additions and 34 deletions


@ -3,7 +3,6 @@
# Allow only what the labelapp Dockerfile needs
!package.json
-!bun.lock
!packages/schemas/
!ts/package.json
@ -14,9 +13,10 @@ labelapp/.env*
labelapp/playwright-report/
labelapp/test-results/
-# Seed data (only the two JSONL files we need)
+# Seed data
!data/paragraphs/paragraphs-clean.jsonl
!data/annotations/stage1.jsonl
+!data/gold/v2-holdout-ids.json
# Git/IDE
.git


@ -575,6 +575,137 @@ As a proxy before human labels arrive, evaluation against GPT-5.4 and Opus bench
---
## Phase 9: Holdout Evaluation — Proxy Gold Results
### Evaluation Setup
Built a comprehensive evaluation pipeline (`python/src/finetune/eval.py`) to test the trained model on the 1,200-paragraph holdout set. Since human gold labels were not yet available, we used two frontier API models as proxy references:
- **GPT-5.4** (1,200 labels, ~$3,400/1M texts, ~2,900ms/sample)
- **Opus-4.6** (1,200 labels, ~$5,000/1M texts, ~6,000ms/sample)
Both references used the same v4.5 prompt as the Grok training labels but come from different model families — they provide independent validation that the fine-tuned model learned the construct, not just Grok's idiosyncrasies.
The evaluation computed: macro/weighted F1, per-class F1, precision, recall, MCC, AUC (one-vs-rest), QWK, MAE, Krippendorff's alpha (nominal for category, ordinal for specificity), confusion matrices, and calibration (ECE).
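For concreteness, two of the headline metrics can be sketched in plain Python (a simplified illustration; the actual `eval.py` uses library implementations):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def mae(y_true, y_pred):
    """Mean absolute error, used for the ordinal specificity head."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Macro F1 weights every class equally, which is why the rare-class performance (e.g., L2) shows up directly in the headline number.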
### Results: Independent Thresholds (Epoch 8, Best Model)
| Metric | vs GPT-5.4 | vs Opus-4.6 |
|--------|-----------|-------------|
| **Cat Macro F1** | **0.934** | **0.923** |
| **Spec Macro F1** | **0.895** | **0.883** |
| Cat MCC | 0.923 | 0.909 |
| Cat AUC (OvR) | 0.992 | 0.994 |
| Spec QWK | 0.932 | 0.923 |
| Spec MAE | 0.118 | 0.136 |
| Cat Kripp α | 0.922 | 0.909 |
| Spec Kripp α | 0.918 | 0.907 |
| Cat ECE | 0.054 | 0.066 |
| Throughput | **178 samples/sec** | — |
| Latency | **5.6ms/sample** | — |
Both heads pass the 0.80 macro F1 target by wide margins on held-out data against independent reference models.
Per-class category F1 (vs GPT-5.4): Board Gov. 0.972, Incident Disc. 0.961, Mgmt Role 0.941, None/Other 0.888, Risk Mgmt Proc. 0.856, Strategy Int. 0.958, Third-Party 0.959. RMP is the weakest category (0.856) due to MR↔RMP boundary ambiguity, but still comfortably above target.
Per-class specificity F1 (vs GPT-5.4): L1 0.936, L2 0.798, L3 0.894, L4 0.954. L2 is the weakest level — analyzed in detail below.
### Results: CORAL Baseline (Epoch 5) — For Comparison
| Metric | vs GPT-5.4 | vs Opus-4.6 |
|--------|-----------|-------------|
| Cat Macro F1 | 0.936 | 0.928 |
| **Spec Macro F1** | **0.597** | **0.596** |
| Spec QWK | 0.876 | 0.872 |
The category heads are essentially identical between models — the backbone handles category well regardless of specificity architecture. The +0.298 spec F1 improvement is entirely attributable to the independent threshold heads.
CORAL's confusion matrix reveals the mechanism: it collapses L2 (F1=0.407) and L3 (F1=0.369) into L1 and L4, predicting extreme levels because the shared weight vector can't represent the intermediate transitions. The independent threshold model's confusion matrix shows clean diagonals across all four levels.
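The architectural difference can be sketched in a few lines (a hypothetical simplification; the real heads operate on pooled transformer features): CORAL ties all K−1 cutpoints to one shared weight vector, so its decision boundaries are parallel shifts of the same plane, while independent thresholds give each cutpoint its own weights.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def coral_logits(feats, w, biases):
    """CORAL head: one shared weight vector, per-threshold biases only.
    All K-1 ordinal boundaries are parallel shifts of the same plane."""
    score = sum(f * wi for f, wi in zip(feats, w))
    return [score + b for b in biases]

def independent_logits(feats, weight_rows, biases):
    """Independent-threshold head: each cutpoint has its own weights,
    so the L1>L2, L2>L3, L3>L4 boundaries can differ in shape."""
    return [sum(f * wi for f, wi in zip(feats, row)) + b
            for row, b in zip(weight_rows, biases)]

def decode_level(logits, cutoff=0.5):
    """Ordinal decode: predicted level = 1 + number of thresholds passed."""
    return 1 + sum(sigmoid(z) > cutoff for z in logits)
```

With a shared `w`, intermediate levels can only be reached by bias offsets along one direction, which is consistent with CORAL collapsing L2/L3 into the extremes.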
### Reference Agreement Ceiling
A critical finding: **the model agrees with the references more than the references agree with each other.**
| Comparison | Macro Spec F1 | L2 F1 |
|-----------|---------------|-------|
| GPT-5.4 vs Opus-4.6 | **0.885** | **0.805** |
| Our model vs GPT-5.4 | **0.895** | 0.798 |
| Our model vs Opus-4.6 | 0.883 | 0.776 |
| Stage 1 Consensus vs GPT-5.4 | 0.911 | 0.845 |
Our model's macro spec F1 (0.895) exceeds the inter-reference agreement (0.885). This means the model learned a "consensus position" that is more consistent than either individual reference. Further improvements against these proxy references are not meaningful — they would represent overfitting to one reference's idiosyncrasies rather than genuine improvement.
The L2 F1 of 0.798 is within 0.007 of the reference ceiling (0.805). The L1↔L2 boundary is the hardest in the construct — it hinges on whether language is "domain-specific" enough to qualify (the ERM test). Paragraphs using quasi-domain language (e.g., "risk management program for cybersecurity") sit in a genuine gray zone where even frontier models disagree.
### L2 Error Analysis
The L2 confusion is directional. Against GPT-5.4:
- 29 L2 paragraphs misclassified as L1 (model under-calls domain terminology)
- 23 L1 paragraphs misclassified as L2 (model over-calls domain terminology)
- Only 7 L2→L3 and 2 L2→L4 errors (higher transitions are clean)
This is the L1↔L2 boundary problem in isolation — the model handles L2↔L3 and L3↔L4 transitions with high accuracy. The ERM test ("would an employee relations manager understand this language?") is inherently subjective at the margin.
### Category × Specificity Joint Distribution
The holdout set reveals strong correlation between category and specificity:
| Category | L1 | L2 | L3 | L4 |
|---------|-----|-----|-----|-----|
| None/Other | **100%** | 0% | 0% | 0% |
| Strategy Integration | **85%** | 10% | 2% | 3% |
| Third-Party Risk | 62% | **22%** | 12% | 5% |
| Risk Mgmt Process | 34% | **44%** | 16% | 6% |
| Board Governance | 42% | 4% | **45%** | 9% |
| Management Role | 13% | 3% | 29% | **54%** |
| Incident Disclosure | 0% | 8% | 2% | **90%** |
Despite this correlation, the current architecture treats specificity as category-independent (by design — per the codebook, specificity measures "how specific" regardless of "what about"). Making specificity category-dependent was considered but rejected: the cell sizes for many (category, spec_level) combinations are too small for reliable conditional modeling, and error propagation from category mistakes would corrupt specificity predictions. The strong correlations are already captured implicitly by the shared backbone. This remains a potential direction for future investigation with a larger dataset.
### Sequence Length Analysis
At max_seq_length=512, truncation is negligible:
| Dataset | Mean tokens | P95 | P99 | Max | Truncated (>512) |
|---------|------------|-----|-----|-----|-----------------|
| All paragraphs (72K) | 114.6 | 240 | 350 | 678 | 139 (0.19%) |
| Holdout (1,200) | 117.9 | 236 | 329 | 603 | 1 (0.08%) |
SEC cybersecurity disclosure paragraphs are short by nature (median ~100 tokens). The 512-token limit is more than sufficient — increasing to 1024 would affect only 139 training paragraphs and 1 holdout paragraph.
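A length audit of this kind reduces to a few lines; a minimal sketch (whitespace tokens stand in for the model's subword tokenizer, so the numbers are illustrative only):

```python
def length_stats(texts, max_len=512):
    """Summarize token-length distribution and truncation at max_len.
    Whitespace split is a stand-in for the real subword tokenizer."""
    lengths = sorted(len(t.split()) for t in texts)
    n = len(lengths)
    return {
        "mean": sum(lengths) / n,
        "p95": lengths[min(n - 1, int(0.95 * n))],
        "max": lengths[-1],
        "truncated": sum(1 for length in lengths if length > max_len),
    }
```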
### Speed and Cost Comparison
| System | Latency | Throughput | Cost/1M texts | Reproducible |
|--------|---------|-----------|---------------|-------------|
| **Fine-tuned specialist** | **5.6ms** | **178/sec** | **~$5** | **Yes** |
| GPT-5.4 (API) | ~2,900ms | ~0.3/sec | ~$3,400 | No |
| Opus-4.6 (API) | ~6,000ms | ~0.2/sec | ~$5,000 | No |
The fine-tuned model is **520× faster** than GPT-5.4 and **1,070× faster** than Opus-4.6, at **~680-1,000× lower cost**, with comparable or better accuracy and full determinism.
### Calibration
The model is well-calibrated for category (ECE=0.054 vs GPT-5.4) and reasonably calibrated for specificity (ECE=0.071). The calibration plot shows slight overconfidence in the 0.7-0.9 range — consistent with the "benign overfitting" observed during training where the model became more confident without changing decision boundaries. Temperature scaling could improve calibration without affecting predictions (a single scalar adjustment on validation logits), which would be valuable for deployment confidence thresholds.
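Temperature scaling itself is a one-parameter fit. A minimal sketch (hypothetical: grid search over validation NLL, rather than the gradient-based fit a real pipeline might use):

```python
import math

def nll(logits_batch, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_z - scaled[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels, lo=0.05, hi=10.0, steps=200):
    """Grid-search the single scalar T on validation logits.
    T > 1 softens overconfident probabilities; argmax is unchanged."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))
```

Because dividing logits by a positive scalar preserves their ordering, the fitted T changes confidence scores but never the predicted labels.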
### Remaining Opportunities
**Threshold tuning (free, post-gold):** Once human gold labels arrive, grid-search the per-threshold sigmoid cutoffs. Currently all thresholds use 0.5 — the optimal L1→L2 cutoff may differ. This requires no retraining and could gain +0.01-0.02 on L2 F1.
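The planned search can be sketched as a coordinate-wise grid (a hypothetical simplification; the real tuning would optimize macro F1 on human gold rather than accuracy):

```python
def tune_thresholds(sig_probs, gold_levels, grid=None):
    """Tune each sigmoid cutoff in turn against gold labels.
    sig_probs[i] = [p(level>1), p(level>2), p(level>3)] per sample."""
    if grid is None:
        grid = [round(0.30 + 0.01 * i, 2) for i in range(41)]  # 0.30..0.70
    cuts = [0.5, 0.5, 0.5]

    def decode(probs, cuts):
        # Ordinal decode: level = 1 + number of cutoffs exceeded
        return 1 + sum(p > c for p, c in zip(probs, cuts))

    def acc(cuts):
        hits = sum(decode(p, cuts) == g for p, g in zip(sig_probs, gold_levels))
        return hits / len(gold_levels)

    for k in range(3):  # tune one cutoff at a time, others held fixed
        cuts[k] = max(grid, key=lambda c: acc(cuts[:k] + [c] + cuts[k + 1:]))
    return cuts
```

Since decoding is deterministic given the probabilities, the whole search is a cheap post-processing pass over cached model outputs: no retraining needed.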
**Ensemble (3 seeds, +0.01-0.03 F1):** Train 3 models with seeds 42/43/44, average sigmoid outputs. Reduces variance on boundary cases and provides confidence intervals for reported metrics. Cost: 3× training time (~24h total), 3× inference time (~17ms/sample).
**Temperature scaling (free, improves calibration only):** Fit a single temperature parameter on the validation set. Reduces ECE without changing predictions — relevant for deployment where confidence scores matter.
**Larger specificity MLP (future investigation):** The current 256-dim MLP is efficient but may not capture the full complexity of subtle specificity distinctions. Larger heads (512-dim or 3-layer) could help if the dataset grows, but risk overfitting at current data scale.
### Figures Generated
All evaluation figures saved to `results/eval/`:
- `iter1-independent/figures/` — confusion matrices (cat + spec), calibration reliability diagrams, per-class F1 bar charts (vs GPT-5.4 and Opus-4.6 separately)
- `coral-baseline/figures/` — same set for CORAL baseline comparison
- `comparison/` — side-by-side CORAL vs Independent (per-class F1 bars, all-metrics comparison, improvement delta chart, confusion matrix comparison, summary table)
---
## v1 Reference
The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication (13-signal cross-analysis), codebook iterations v1.0–v3.5 — is preserved at `docs/NARRATIVE-v1.md`.


@ -129,3 +129,24 @@ eval F1 improvement. Best checkpoint: epoch 8 (spec F1=0.945).
Independent thresholds were the key insight — CORAL's shared weight vector was
the primary bottleneck. Attention pooling, MLP heads, and confidence filtering
all contributed. Tier 2 and Tier 3 ideas were not needed.
### Holdout Evaluation (1,200 paragraphs, proxy gold)
Validated on held-out data against two independent frontier model references:
| Model | Ref | Cat F1 | Spec F1 | L2 F1 | Spec QWK |
|-------|-----|--------|---------|-------|----------|
| Independent (ep8) | GPT-5.4 | 0.934 | **0.895** | 0.798 | 0.932 |
| Independent (ep8) | Opus-4.6 | 0.923 | **0.883** | 0.776 | 0.923 |
| CORAL (ep5) | GPT-5.4 | 0.936 | 0.597 | 0.407 | 0.876 |
| CORAL (ep5) | Opus-4.6 | 0.928 | 0.596 | 0.418 | 0.872 |
| GPT-5.4 | Opus-4.6 | — | **0.885** | **0.805** | 0.919 |
**Key finding:** The model's holdout spec F1 (0.895) exceeds the inter-reference
agreement (0.885 between GPT-5.4 and Opus-4.6). The model has reached the
construct reliability ceiling — further improvement requires cleaner reference
labels, not a better model.
**L2 is at ceiling:** Model L2 F1 (0.798) is within 0.007 of reference agreement
(0.805). The L1↔L2 boundary is genuinely ambiguous. Remaining opportunity:
per-threshold sigmoid tuning against human gold labels (potential +0.01-0.02).


@ -1,6 +1,6 @@
# Project Status — v2 Pipeline
-**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Fine-tuning done: cat F1=0.943, spec F1=0.945)
+**Deadline:** 2026-04-24 | **Started:** 2026-04-03 | **Updated:** 2026-04-05 (Holdout eval done: cat F1=0.934, spec F1=0.895 vs GPT-5.4 proxy gold)
---
@ -142,14 +142,27 @@
- **Improvement plan:** `docs/SPECIFICITY-IMPROVEMENT-PLAN.md`
### 13. Evaluation & Paper ← CURRENT
-- [ ] Proxy eval: run fine-tuned model on holdout, compare against GPT-5.4 and Opus benchmark labels
+- [x] Proxy eval: fine-tuned model on 1,200 holdout vs GPT-5.4 and Opus-4.6 proxy gold
- [x] Full metrics suite: macro/per-class F1, precision, recall, MCC, AUC, QWK, MAE, Krippendorff's α, ECE, confusion matrices
- [x] CORAL baseline comparison: same eval pipeline on CORAL epoch 5 checkpoint
- [x] Figures: confusion matrices, calibration diagrams, per-class F1 bars, CORAL vs Independent comparison, speed/cost table
- [x] Reference ceiling analysis: GPT-5.4 vs Opus-4.6 agreement = 0.885 macro spec F1 (our model exceeds this at 0.895)
- [x] L2 error analysis: model L2 F1 (0.798) within 0.007 of reference ceiling (0.805)
- [x] Sequence length analysis: only 139/72K paragraphs (0.19%) truncated at 512 tokens — negligible impact
- [x] Opus labels completed: 1,200/1,200 (filled 16 missing from initial run)
- [ ] Macro F1 on holdout gold (target > 0.80 both heads) — blocked on human labels
-- [ ] Per-class F1 breakdown + GenAI benchmark table
+- [ ] Per-threshold sigmoid tuning against human gold (potential +0.01-0.02 on L2 F1)
-- [ ] Error analysis, cost comparison, IGNITE slides
+- [ ] Temperature scaling for improved calibration (ECE reduction without changing predictions)
- [ ] Ensemble of 3 seeds for confidence intervals and potential +0.01-0.03 F1
- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result
- [ ] Note in paper: CORAL ordinal regression insufficient for multi-signal ordinal classification
-- **Next:** evaluate fine-tuned model on holdout using GPT-5.4 + Opus labels as proxy gold
+- [ ] Note in paper: model exceeds inter-reference agreement — approaches ceiling of construct reliability
- **Proxy gold results (vs GPT-5.4):** Cat F1=0.934, Spec F1=0.895, MCC=0.923/0.866, AUC=0.992/0.982, QWK=0.932
- **Proxy gold results (vs Opus-4.6):** Cat F1=0.923, Spec F1=0.883, QWK=0.923
- **Speed:** 5.6ms/sample (178/sec) — 520× faster than GPT-5.4, 1,070× faster than Opus
- **Next:** deploy labelapp for human annotation, then gold evaluation + threshold tuning
---
@ -177,7 +190,7 @@
| v2 holdout reference | `data/annotations/v2-bench/gpt-5.4.jsonl` (v4.5, 1,200 paragraphs) |
| v2 iteration archive | `data/annotations/v2-bench/gpt-5.4.v4.{0,1,2,3,4}.jsonl` |
| v4.5 boundary test | `data/annotations/v2-bench/v45-test/gpt-5.4.jsonl` (50 paragraphs) |
-| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,184 paragraphs) |
+| Opus prompt-only | `data/annotations/v2-bench/opus-4.6.jsonl` (1,200 paragraphs) |
| Opus +codebook | `data/annotations/golden/opus.jsonl` (includes v1 + v2 runs) |
| Grok self-consistency test | `data/annotations/v2-bench/grok-rerun/grok-4.1-fast.jsonl` (47 paragraphs) |
| Benchmark analysis | `scripts/analyze-v2-bench.py` |
@ -199,6 +212,13 @@
| CORAL baseline (ablation winner) | `checkpoints/finetune/best-base_weighted_ce-ep5/final/` (cat=0.932, spec=0.517) |
| Ablation results | `checkpoints/finetune/ablation/ablation_results.json` |
| Spec improvement plan | `docs/SPECIFICITY-IMPROVEMENT-PLAN.md` |
| Best model iter1 config | `python/configs/finetune/iter1-independent.yaml` |
| Eval script | `python/src/finetune/eval.py` |
| Eval results (best model) | `results/eval/iter1-independent/metrics.json` |
| Eval results (CORAL) | `results/eval/coral-baseline/metrics.json` |
| Comparison figures | `results/eval/comparison/` (5 charts) |
| Per-model eval figures | `results/eval/iter1-independent/figures/` + `results/eval/coral-baseline/figures/` |
| Comparison figure script | `python/scripts/generate-comparison-figures.py` |
### v2 Stage 1 Distribution (72,045 paragraphs, v4.5 prompt, Grok ×3 consensus + GPT-5.4 judge)


@ -1,14 +1,15 @@
# Build context: monorepo root (run: docker build -f labelapp/Dockerfile .)
-FROM oven/bun:1 AS base
+FROM oven/bun:1.3.8 AS base
# -- Install dependencies --
FROM base AS deps
WORKDIR /app
-COPY package.json bun.lock ./
+COPY package.json ./
COPY packages/schemas/package.json packages/schemas/
COPY ts/package.json ts/
COPY labelapp/package.json labelapp/
-RUN bun install --frozen-lockfile
+# bun.lock intentionally excluded — bun hangs parsing the binary lockfile in Docker (bun bug)
+RUN bun install
# -- Build Next.js --
FROM base AS builder
@ -16,7 +17,7 @@ WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=deps /app/packages/schemas/node_modules ./packages/schemas/node_modules
COPY --from=deps /app/labelapp/node_modules ./labelapp/node_modules
-COPY package.json bun.lock ./
+COPY package.json ./
COPY packages/schemas/ packages/schemas/
COPY labelapp/ labelapp/
ENV NEXT_TELEMETRY_DISABLED=1


@ -100,7 +100,6 @@ export async function GET() {
// Filter to non-admin annotators for per-annotator stats
const perAnnotator = allAnnotators
-.filter((a) => a.id !== "joey")
.map((a) => ({
id: a.id,
displayName: a.displayName,
@ -132,7 +131,7 @@ export async function GET() {
// Collect all annotator IDs that have labels (excluding admin)
const annotatorIds = [
...new Set(allLabels.map((l) => l.annotatorId)),
-].filter((id) => id !== "joey");
+];
annotatorIds.sort();
// For each annotator pair, collect shared paragraph ratings


@ -797,6 +797,13 @@ export default function CodebookPage() {
how company-specific the disclosure is. Apply the decision test in
order; stop at the first “yes.”
</p>
<p className="rounded-md border border-amber-300 bg-amber-50 dark:border-amber-700 dark:bg-amber-950/30 p-3 text-sm">
<strong>Specificity rates the ENTIRE paragraph</strong>, not just
the parts related to the content category you chose. If a Board
Governance paragraph also mentions CrowdStrike Falcon or the
CISO&rsquo;s 20 years of experience, those facts count. Scan
everything; don&rsquo;t filter by category.
</p>
{/* Decision Test */}
<div className="space-y-3">


@ -489,6 +489,9 @@ function CodebookSidebar() {
desc="Contains 1+ QV-eligible facts: specific numbers, dates, named external entities, named tools/products, verifiable certifications."
/>
</div>
<p className="mt-2 text-xs text-amber-700 dark:text-amber-400">
Specificity rates the WHOLE paragraph, not just the category-relevant parts. Scan everything.
</p>
</section>
<Separator />


@ -29,7 +29,7 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
"Management Role is broader: it now covers how management is ORGANIZED to handle cybersecurity — role allocation, committee structure, reporting lines — not just \"who a specific person is.\" Paragraphs about management structure without named individuals can be MR.",
"Specificity Level 2 is broader: renamed from \"Sector-Adapted\" to \"Domain-Adapted.\" Cybersecurity terms like penetration testing, vulnerability scanning, SIEM, and SOC now trigger Level 2. In v1, these were incorrectly classified as Level 1.",
"Level 4 requires just 1 QV fact (was 2+). No more counting. If an external party could verify even one claim in the paragraph — a dollar amount, a named tool, a specific date — it's Level 4.",
-"You'll be labeling 1,200 holdout paragraphs total. There are 5 annotators, with 3 labeling each paragraph. You'll see roughly 720.",
+"You'll be labeling 1,200 holdout paragraphs total. There are 6 annotators, with 3 labeling each paragraph. You'll see roughly 600.",
],
keyPoints: [
"Same 7 categories, same 4 specificity levels — the framework is unchanged.",
@ -49,11 +49,12 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
"Question 1 — Content Category: \"What is this paragraph about?\" Pick the best of 7 options.",
"Question 2 — Specificity Level: \"How company-specific is this paragraph?\" Pick a level from 1 to 4.",
"These are independent dimensions. A materiality disclaimer can be Strategy Integration (category) at Level 1 (generic boilerplate). An incident report can be Incident Disclosure at Level 4 (specific dates and firms).",
"Important: specificity rates THE WHOLE PARAGRAPH, not just the category-relevant parts. If a Board Governance paragraph mentions the CISO by name and describes penetration testing, those facts count for specificity even though they're not \"board\" content. Scan the entire paragraph for the most specific fact present — don't filter by category first.",
],
keyPoints: [
"One content category (of 7) — pick the dominant one.",
-"One specificity level (1–4) — determined by the most specific fact present.",
+"One specificity level (1–4) — determined by the most specific fact in THE WHOLE PARAGRAPH.",
-"Category and specificity are independent — don't let one influence the other.",
+"Specificity rates the paragraph, not the category. A Board Governance paragraph that mentions CrowdStrike Falcon is Level 4.",
],
},
@ -142,6 +143,7 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
subtitle: "How company-specific is this paragraph?",
content: [
"Specificity measures how much this paragraph tells you about THIS specific company versus generic filler any company could use.",
"Critical: specificity rates the ENTIRE paragraph — not just the parts related to the category you chose. If you categorize a paragraph as Board Governance but it also mentions CrowdStrike Falcon or the CISO's 20 years of experience, those facts still count. Scan everything.",
"Think of it as a waterfall — check from the top and stop at the first yes:",
"Level 4 — Quantified-Verifiable: Can an external party verify at least one claim? (a specific number, date, named tool/firm, verifiable certification) → Level 4.",
"Level 3 — Firm-Specific: Does it contain at least one fact unique to THIS company? (CISO title, named non-generic committee, named individual, 24/7 SOC) → Level 3.",
@ -218,6 +220,13 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
explanation:
"BG because the Audit Committee is the subject (oversight). CISO is a firm-specific fact → Level 3. No QV facts (no numbers, dates, named firms).",
},
{
text: "The Board oversees our cybersecurity program, which is led by our CISO and includes penetration testing and vulnerability assessments using CrowdStrike Falcon.",
category: "Board Governance",
specificity: "Level 4 — Quantified-Verifiable",
explanation:
"BG because the Board is the subject. But specificity rates THE WHOLE PARAGRAPH — not just the board content. CrowdStrike Falcon is a named tool (QV-eligible), so Level 4. Don't be tempted to rate only the \"board\" parts as generic — the paragraph as a whole contains a verifiable fact.",
},
{
text: "Under the leadership of our CISO, we have implemented network segmentation, endpoint detection and response, data loss prevention, and SIEM. Our team monitors critical systems continuously and conducts quarterly tabletop exercises.",
category: "Risk Management Process",
@ -248,8 +257,8 @@ export const ONBOARDING_STEPS: OnboardingStep[] = [
},
],
keyPoints: [
-"Category and specificity are independent. Don't let one influence the other.",
+"Specificity rates the WHOLE paragraph — not just the parts related to the category. Scan everything.",
-"The person-removal test and specificity waterfall work together — use both.",
+"A Board Governance paragraph that mentions CrowdStrike Falcon → still Level 4. Don't filter facts by category.",
"When in doubt on category: which question does the paragraph answer?",
"When in doubt on specificity: check the waterfall top-down (QV → IS → Domain → Generic).",
], ],


@ -1,7 +1,6 @@
process.env.DATABASE_URL ??=
"postgresql://sec_cybert:sec_cybert@localhost:5432/sec_cybert";
-import { ne } from "drizzle-orm";
import { db } from "../db";
import * as schema from "../db/schema";
import { generateAssignments, printAssignmentStats } from "../lib/assignment";
@ -15,12 +14,11 @@ async function main() {
const paragraphIds = rows.map((r) => r.id);
console.log(`  ${paragraphIds.length} paragraphs`);
-// 2. Read annotator IDs from DB (exclude joey — admin)
+// 2. Read annotator IDs from DB (all annotators, including joey)
console.log("Loading annotators...");
const annotators = await db
.select({ id: schema.annotators.id })
-.from(schema.annotators)
-.where(ne(schema.annotators.id, "joey"));
+.from(schema.annotators);
const annotatorIds = annotators.map((a) => a.id).sort();
console.log(`  ${annotatorIds.length} annotators: ${annotatorIds.join(", ")}`);


@ -53,16 +53,14 @@ async function main() {
db.select().from(schema.adjudications),
]);
-const nonAdminAnnotators = allAnnotators.filter((a) => a.id !== "joey");
-const annotatorIds = nonAdminAnnotators.map((a) => a.id).sort();
+const annotatorIds = allAnnotators.map((a) => a.id).sort();
const annotatorNames = new Map(allAnnotators.map((a) => [a.id, a.displayName]));
-// Filter to non-admin labels only
-const labels = allLabels.filter((l) => l.annotatorId !== "joey");
+const labels = allLabels;
-console.log(`  ${labels.length} human labels (non-admin)`);
+console.log(`  ${labels.length} human labels`);
console.log(`  ${allParagraphs.length} paragraphs`);
-console.log(`  ${nonAdminAnnotators.length} annotators`);
+console.log(`  ${allAnnotators.length} annotators`);
console.log(`  ${allQuizSessions.length} quiz sessions`);
console.log(`  ${allAdjudications.length} adjudications`);
@ -108,7 +106,7 @@ async function main() {
// ── 3. Annotators JSON ──
console.log("\nExporting annotator profiles...");
-const annotatorProfiles = nonAdminAnnotators.map((a) => ({
+const annotatorProfiles = allAnnotators.map((a: { id: string; displayName: string; onboardedAt: Date | null }) => ({
id: a.id,
displayName: a.displayName,
onboardedAt: a.onboardedAt?.toISOString() ?? null,


@@ -44,6 +44,35 @@ def cmd_finetune(args: argparse.Namespace) -> None:
     train(config)

+def cmd_eval(args: argparse.Namespace) -> None:
+    from src.finetune.eval import EvalConfig, evaluate
+
+    benchmark_paths = {}
+    if args.benchmark:
+        for name, path in args.benchmark:
+            benchmark_paths[name] = path
+    else:
+        # Default benchmarks
+        benchmark_paths = {
+            "GPT-5.4": "../data/annotations/v2-bench/gpt-5.4.jsonl",
+            "Opus-4.6": "../data/annotations/v2-bench/opus-4.6.jsonl",
+        }
+
+    config = EvalConfig(
+        checkpoint_path=args.checkpoint,
+        paragraphs_path=args.paragraphs,
+        holdout_path=args.holdout,
+        benchmark_paths=benchmark_paths,
+        output_dir=args.output_dir,
+        max_seq_length=args.max_seq_length,
+        batch_size=args.batch_size,
+        specificity_head=args.spec_head,
+        spec_mlp_dim=args.spec_mlp_dim,
+        pooling=args.pooling,
+    )
+    evaluate(config)
+
 def cmd_ablate(args: argparse.Namespace) -> None:
     from src.common.config import FinetuneConfig
     from src.finetune.train import ablate
@@ -94,10 +123,20 @@
     ab.add_argument("--epochs", type=int, help="Override epochs per ablation run (default: config value)")
     ab.set_defaults(func=cmd_ablate)

-    # ── eval (placeholder) ──
-    ev = sub.add_parser("eval", help="Evaluate a trained model")
-    ev.add_argument("--config", required=True, help="Path to YAML config file")
-    ev.set_defaults(func=lambda args: print("Evaluation not yet implemented."))
+    # ── eval ──
+    ev = sub.add_parser("eval", help="Evaluate a trained model on holdout set")
+    ev.add_argument("--checkpoint", required=True, help="Path to model checkpoint directory")
+    ev.add_argument("--paragraphs", default="../data/paragraphs/paragraphs-clean.patched.jsonl")
+    ev.add_argument("--holdout", default="../data/gold/v2-holdout-ids.json")
+    ev.add_argument("--benchmark", action="append", nargs=2, metavar=("NAME", "PATH"),
+                    help="Benchmark reference: NAME PATH (can repeat)")
+    ev.add_argument("--output-dir", default="../results/eval")
+    ev.add_argument("--max-seq-length", type=int, default=512)
+    ev.add_argument("--batch-size", type=int, default=64)
+    ev.add_argument("--spec-head", default="independent", choices=["coral", "independent", "softmax"])
+    ev.add_argument("--spec-mlp-dim", type=int, default=256)
+    ev.add_argument("--pooling", default="attention", choices=["cls", "attention"])
+    ev.set_defaults(func=cmd_eval)

     args = parser.parse_args()
     args.func(args)
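The repeatable `--benchmark NAME PATH` flag relies on argparse's `action="append"` combined with `nargs=2`: each occurrence is collected as a two-item list, which `cmd_eval` then folds into a name-to-path dict. A standalone sketch of that behavior (paths are made up):

```python
import argparse

# Repeatable two-argument flag: action="append" + nargs=2 yields a
# list of [name, path] pairs, one per occurrence on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--benchmark", action="append", nargs=2, metavar=("NAME", "PATH"))

args = parser.parse_args([
    "--benchmark", "GPT-5.4", "bench/gpt.jsonl",
    "--benchmark", "Opus-4.6", "bench/opus.jsonl",
])
benchmark_paths = {name: path for name, path in args.benchmark or []}
print(benchmark_paths)
# → {'GPT-5.4': 'bench/gpt.jsonl', 'Opus-4.6': 'bench/opus.jsonl'}
```

The `or []` guard matters because argparse leaves the attribute as `None` (not an empty list) when the flag is never passed, which is why `cmd_eval` falls back to the default benchmarks in that case.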


@@ -15,6 +15,9 @@ dependencies = [
     "unsloth==2026.3.11",
     "coral-pytorch>=1.4.0",
     "scikit-learn>=1.8.0",
+    "krippendorff>=0.8.2",
+    "matplotlib>=3.10.8",
+    "seaborn>=0.13.2",
 ]

 [project.scripts]


@@ -0,0 +1,226 @@
"""Generate side-by-side comparison figures: CORAL baseline vs Independent threshold model."""
import json
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
RESULTS_DIR = Path(__file__).resolve().parent.parent.parent / "results" / "eval"
OUTPUT_DIR = RESULTS_DIR / "comparison"
CATEGORIES = [
"Board Gov.",
"Incident Disc.",
"Mgmt Role",
"None/Other",
"Risk Mgmt Proc.",
"Strategy Int.",
"Third-Party",
]
SPEC_LABELS = ["L1: Generic", "L2: Domain", "L3: Firm-Spec.", "L4: Quantified"]
def load_metrics(model_dir: str) -> dict:
with open(RESULTS_DIR / model_dir / "metrics.json") as f:
return json.load(f)
def main():
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
sns.set_theme(style="whitegrid", font_scale=1.1)
coral = load_metrics("coral-baseline")
indep = load_metrics("iter1-independent")
# Use GPT-5.4 as the reference (1200 samples, complete)
coral_gpt = coral["best-base_weighted_ce-ep5_vs_GPT-5.4"]
indep_gpt = indep["iter1-independent_vs_GPT-5.4"]
# ── 1. Side-by-side per-class F1 (Category) ─────────────────────────────
# Keys come from eval.py: name.replace(" ", "").replace("/", "")[:8]
cat_keys = ["BoardGov", "Incident", "Manageme", "NoneOthe", "RiskMana", "Strategy", "Third-Pa"]
coral_cat_f1 = [coral_gpt.get(f"cat_f1_{k}", 0) for k in cat_keys]
indep_cat_f1 = [indep_gpt.get(f"cat_f1_{k}", 0) for k in cat_keys]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
x = np.arange(len(CATEGORIES))
width = 0.35
bars1 = ax1.bar(x - width/2, coral_cat_f1, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
bars2 = ax1.bar(x + width/2, indep_cat_f1, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
ax1.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="Target (0.80)")
ax1.set_ylabel("F1 Score")
ax1.set_title("Category F1 by Class")
ax1.set_xticks(x)
ax1.set_xticklabels(CATEGORIES, rotation=25, ha="right")
ax1.set_ylim(0, 1.05)
ax1.legend(loc="lower right")
for bar, v in zip(bars1, coral_cat_f1):
ax1.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=7)
for bar, v in zip(bars2, indep_cat_f1):
ax1.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=7)
# ── Specificity F1 side-by-side ──────────────────────────────────────────
# Keys come from eval.py: name.replace(" ", "").replace(":", "")[:8]
spec_keys = ["L1Generi", "L2Domain", "L3Firm-S", "L4Quanti"]
coral_spec_f1 = [coral_gpt.get(f"spec_f1_{k}", 0) for k in spec_keys]
indep_spec_f1 = [indep_gpt.get(f"spec_f1_{k}", 0) for k in spec_keys]
x2 = np.arange(len(SPEC_LABELS))
bars3 = ax2.bar(x2 - width/2, coral_spec_f1, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
bars4 = ax2.bar(x2 + width/2, indep_spec_f1, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
ax2.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="Target (0.80)")
ax2.set_ylabel("F1 Score")
ax2.set_title("Specificity F1 by Level")
ax2.set_xticks(x2)
ax2.set_xticklabels(SPEC_LABELS)
ax2.set_ylim(0, 1.05)
ax2.legend(loc="lower right")
for bar, v in zip(bars3, coral_spec_f1):
ax2.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=8)
for bar, v in zip(bars4, indep_spec_f1):
ax2.text(bar.get_x() + bar.get_width()/2, v + 0.01, f"{v:.2f}", ha="center", va="bottom", fontsize=8)
plt.suptitle("CORAL Baseline vs Independent Thresholds — Holdout Set (vs GPT-5.4)", fontsize=14, fontweight="bold")
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "coral_vs_independent_f1.png", dpi=200)
plt.close(fig)
print(f" Saved: coral_vs_independent_f1.png")
# ── 2. Summary metrics comparison ────────────────────────────────────────
metrics_to_compare = {
"Cat Macro F1": ("cat_macro_f1", "cat_macro_f1"),
"Spec Macro F1": ("spec_macro_f1", "spec_macro_f1"),
"Cat MCC": ("cat_mcc", "cat_mcc"),
"Spec MCC": ("spec_mcc", "spec_mcc"),
"Cat AUC": ("cat_auc", "cat_auc"),
"Spec AUC": ("spec_auc", "spec_auc"),
"Spec QWK": ("spec_qwk", "spec_qwk"),
"Cat Kripp α": ("cat_kripp_alpha", "cat_kripp_alpha"),
"Spec Kripp α": ("spec_kripp_alpha", "spec_kripp_alpha"),
}
fig, ax = plt.subplots(figsize=(12, 6))
labels = list(metrics_to_compare.keys())
coral_vals = [coral_gpt.get(v[0], 0) for v in metrics_to_compare.values()]
indep_vals = [indep_gpt.get(v[1], 0) for v in metrics_to_compare.values()]
x = np.arange(len(labels))
width = 0.35
ax.bar(x - width/2, coral_vals, width, label="CORAL (Epoch 5)", color="#DD8452", alpha=0.85)
ax.bar(x + width/2, indep_vals, width, label="Independent (Epoch 8)", color="#4C72B0", alpha=0.85)
ax.axhline(0.80, color="red", linestyle="--", alpha=0.5)
ax.set_ylabel("Score")
ax.set_title("CORAL vs Independent — All Metrics (Holdout vs GPT-5.4)")
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=30, ha="right")
ax.set_ylim(0, 1.1)
ax.legend()
for i, (cv, iv) in enumerate(zip(coral_vals, indep_vals)):
ax.text(i - width/2, cv + 0.01, f"{cv:.3f}", ha="center", va="bottom", fontsize=7)
ax.text(i + width/2, iv + 0.01, f"{iv:.3f}", ha="center", va="bottom", fontsize=7)
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "coral_vs_independent_all_metrics.png", dpi=200)
plt.close(fig)
print(f" Saved: coral_vs_independent_all_metrics.png")
# ── 3. Delta chart (improvement from CORAL → Independent) ────────────────
deltas = [iv - cv for cv, iv in zip(coral_vals, indep_vals)]
colors = ["#55a868" if d >= 0 else "#c44e52" for d in deltas]
fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(labels, deltas, color=colors, alpha=0.85)
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Improvement (Independent − CORAL)")
ax.set_title("Metric Improvement: Independent Thresholds over CORAL")
for i, (d, label) in enumerate(zip(deltas, labels)):
ax.text(d + 0.003 if d >= 0 else d - 0.003, i, f"{d:+.3f}",
va="center", ha="left" if d >= 0 else "right", fontsize=9)
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "improvement_delta.png", dpi=200)
plt.close(fig)
print(f" Saved: improvement_delta.png")
# ── 4. Specificity confusion matrix side-by-side ─────────────────────────
spec_labels_short = ["L1", "L2", "L3", "L4"]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
cm_coral = np.array(coral_gpt["spec_confusion_matrix"])
cm_indep = np.array(indep_gpt["spec_confusion_matrix"])
cm_coral_norm = cm_coral.astype(float) / cm_coral.sum(axis=1, keepdims=True).clip(min=1)
cm_indep_norm = cm_indep.astype(float) / cm_indep.sum(axis=1, keepdims=True).clip(min=1)
sns.heatmap(cm_coral_norm, annot=cm_coral, fmt="d", cmap="Oranges",
xticklabels=spec_labels_short, yticklabels=spec_labels_short,
ax=ax1, vmin=0, vmax=1, cbar=False)
ax1.set_title("CORAL (Epoch 5) — Spec F1=0.597")
ax1.set_xlabel("Predicted")
ax1.set_ylabel("GPT-5.4 Reference")
sns.heatmap(cm_indep_norm, annot=cm_indep, fmt="d", cmap="Blues",
xticklabels=spec_labels_short, yticklabels=spec_labels_short,
ax=ax2, vmin=0, vmax=1, cbar=False)
ax2.set_title("Independent (Epoch 8) — Spec F1=0.895")
ax2.set_xlabel("Predicted")
ax2.set_ylabel("GPT-5.4 Reference")
plt.suptitle("Specificity Confusion Matrices — CORAL vs Independent", fontsize=13, fontweight="bold")
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "spec_confusion_comparison.png", dpi=200)
plt.close(fig)
print(f" Saved: spec_confusion_comparison.png")
# ── 5. Cost/speed comparison table figure ────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 4))
ax.axis("off")
table_data = [
["Metric", "CORAL (Ep5)", "Independent (Ep8)", "GPT-5.4 (API)", "Opus-4.6 (API)"],
["Cat Macro F1", f"{coral_gpt['cat_macro_f1']:.4f}", f"{indep_gpt['cat_macro_f1']:.4f}", "—(reference)", "—(reference)"],
["Spec Macro F1", f"{coral_gpt['spec_macro_f1']:.4f}", f"{indep_gpt['spec_macro_f1']:.4f}", "—(reference)", "—(reference)"],
["Spec QWK", f"{coral_gpt['spec_qwk']:.4f}", f"{indep_gpt['spec_qwk']:.4f}", "", ""],
["MCC (Cat)", f"{coral_gpt['cat_mcc']:.4f}", f"{indep_gpt['cat_mcc']:.4f}", "", ""],
["Latency/sample", "5.6ms", "5.6ms", "~2,900ms", "~6,000ms"],
["Cost/1M texts", "~$5", "~$5", "~$3,400", "~$5,000*"],
["Reproducible", "Yes", "Yes", "No", "No"],
]
table = ax.table(cellText=table_data[1:], colLabels=table_data[0],
cellLoc="center", loc="center")
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1, 1.5)
# Style header
for j in range(len(table_data[0])):
table[0, j].set_facecolor("#4C72B0")
table[0, j].set_text_props(color="white", fontweight="bold")
# Highlight best specialist column
for i in range(1, len(table_data)):
table[i, 2].set_facecolor("#d4edda")
ax.set_title("Model Comparison Summary", fontsize=13, fontweight="bold", pad=20)
plt.tight_layout()
fig.savefig(OUTPUT_DIR / "comparison_table.png", dpi=200)
plt.close(fig)
print(f" Saved: comparison_table.png")
print(f"\n All figures saved to {OUTPUT_DIR}")
if __name__ == "__main__":
main()
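Both confusion-matrix panels above row-normalize with `.clip(min=1)` applied to the row sums, so a reference class that never occurs divides by 1 instead of 0. A minimal standalone sketch of the pattern (toy matrix, not project data):

```python
import numpy as np

# Row-normalize a confusion matrix for a heatmap. clip(min=1) on the row
# sums keeps an all-zero row (class absent from the reference) from
# producing NaN via 0/0; that row simply stays all zeros.
cm = np.array([[8, 2, 0],
               [1, 9, 0],
               [0, 0, 0]])  # third class never appears in the reference
cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True).clip(min=1)
print(cm_norm[0])  # → [0.8 0.2 0. ]
print(cm_norm[2])  # → [0. 0. 0.], not NaN
```

Passing the raw counts as `annot=cm` while coloring by `cm_norm` (as the figures do) keeps absolute sample sizes visible even when rows are normalized.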

python/src/finetune/eval.py (new file, 724 lines)

@@ -0,0 +1,724 @@
"""Holdout evaluation for trained DualHeadModernBERT models.
Loads a trained checkpoint, runs inference on the 1,200-paragraph holdout set,
and compares predictions against proxy gold labels (GPT-5.4, Opus-4.6) or
real human gold labels when available.
Reports: macro/per-class F1, precision, recall, MCC, AUC (one-vs-rest),
QWK, MAE, Krippendorff's alpha, confusion matrices, calibration (ECE),
cost/latency comparison, and generates publication-ready figures.
"""
import json
import time
from dataclasses import dataclass
from pathlib import Path
import krippendorff
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torch.nn.functional as F
from safetensors.torch import load_file
from sklearn.metrics import (
f1_score,
precision_score,
recall_score,
matthews_corrcoef,
roc_auc_score,
cohen_kappa_score,
confusion_matrix,
classification_report,
mean_absolute_error,
)
from transformers import AutoModel, AutoTokenizer
from .data import CATEGORIES, CAT2ID, NUM_CATEGORIES, NUM_SPECIFICITY
from .model import DualHeadModernBERT, ordinal_predict, softmax_predict
SPEC_LABELS = ["L1: Generic", "L2: Domain", "L3: Firm-Specific", "L4: Quantified"]
@dataclass
class EvalConfig:
"""Configuration for holdout evaluation."""
checkpoint_path: str
paragraphs_path: str
holdout_path: str
benchmark_paths: dict[str, str] # name → path to benchmark JSONL
output_dir: str
max_seq_length: int = 512
batch_size: int = 64
# Architecture params (must match training)
specificity_head: str = "independent"
spec_mlp_dim: int = 256
pooling: str = "attention"
def _load_jsonl(path: str | Path) -> list[dict]:
records = []
with open(path) as f:
for line in f:
line = line.strip()
if line:
records.append(json.loads(line))
return records
def load_holdout_data(
paragraphs_path: str,
holdout_path: str,
benchmark_paths: dict[str, str],
) -> dict:
"""Load holdout paragraphs and benchmark labels, joined by paragraph ID."""
paragraphs = {p["id"]: p for p in _load_jsonl(paragraphs_path)}
with open(holdout_path) as f:
holdout_ids = json.load(f)
# Load benchmark labels
benchmarks = {}
for name, path in benchmark_paths.items():
labels = {r["paragraphId"]: r for r in _load_jsonl(path)}
benchmarks[name] = labels
# Build joined dataset
records = []
for pid in holdout_ids:
para = paragraphs.get(pid)
if para is None:
continue
bench_labels = {}
for name, labels in benchmarks.items():
if pid in labels:
lab = labels[pid]["label"]
bench_labels[name] = {
"category": lab["content_category"],
"specificity": lab["specificity_level"],
}
records.append({
"id": pid,
"text": para["text"],
"benchmark_labels": bench_labels,
})
return records
def load_model(config: EvalConfig, device: torch.device) -> tuple:
"""Load trained DualHeadModernBERT from checkpoint."""
checkpoint = Path(config.checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(str(checkpoint))
# Load backbone (we need the architecture, then overwrite weights)
try:
import flash_attn # noqa: F401
attn_impl = "flash_attention_2"
except ImportError:
attn_impl = "sdpa"
backbone = AutoModel.from_pretrained(
"answerdotai/ModernBERT-large",
trust_remote_code=True,
attn_implementation=attn_impl,
dtype=torch.bfloat16,
)
hidden_size = backbone.config.hidden_size
model = DualHeadModernBERT(
backbone=backbone,
hidden_size=hidden_size,
num_categories=NUM_CATEGORIES,
num_specificity=NUM_SPECIFICITY,
specificity_head_type=config.specificity_head,
spec_mlp_dim=config.spec_mlp_dim,
pooling=config.pooling,
)
# Load trained weights (filter out loss function buffers not needed for inference)
state_dict = load_file(str(checkpoint / "model.safetensors"))
model.load_state_dict(state_dict, strict=False)
model = model.to(device).to(torch.bfloat16)
model.eval()
return model, tokenizer
@torch.no_grad()
def run_inference(
model: DualHeadModernBERT,
tokenizer,
records: list[dict],
max_seq_length: int,
batch_size: int,
device: torch.device,
spec_head_type: str = "independent",
) -> dict:
"""Run model inference on holdout records. Returns predictions + timing."""
texts = [r["text"] for r in records]
all_cat_logits = []
all_spec_logits = []
total_time = 0.0
for i in range(0, len(texts), batch_size):
batch_texts = texts[i : i + batch_size]
encoded = tokenizer(
batch_texts,
truncation=True,
max_length=max_seq_length,
padding="longest",
return_tensors="pt",
).to(device)
start = time.perf_counter()
outputs = model(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
)
        if device.type == "cuda":
            torch.cuda.synchronize()
total_time += time.perf_counter() - start
all_cat_logits.append(outputs["category_logits"].float().cpu())
all_spec_logits.append(outputs["specificity_logits"].float().cpu())
cat_logits = torch.cat(all_cat_logits, dim=0)
spec_logits = torch.cat(all_spec_logits, dim=0)
cat_probs = F.softmax(cat_logits, dim=1).numpy()
cat_preds = cat_logits.argmax(dim=1).numpy()
if spec_head_type == "softmax":
spec_preds = softmax_predict(spec_logits).numpy()
spec_probs = F.softmax(spec_logits, dim=1).numpy()
else:
spec_preds = ordinal_predict(spec_logits).numpy()
# Convert ordinal logits to class probs for AUC
spec_probs = _ordinal_to_class_probs(spec_logits).numpy()
return {
"cat_preds": cat_preds,
"cat_probs": cat_probs,
"cat_logits": cat_logits.numpy(),
"spec_preds": spec_preds,
"spec_probs": spec_probs,
"spec_logits": spec_logits.numpy(),
"total_time_s": total_time,
"num_samples": len(texts),
"avg_ms_per_sample": (total_time / len(texts)) * 1000,
}
def _ordinal_to_class_probs(logits: torch.Tensor) -> torch.Tensor:
"""Convert ordinal threshold logits to per-class probabilities.
P(class=0) = 1 - P(>=1)
P(class=k) = P(>=k) - P(>=k+1) for 0 < k < K-1
P(class=K-1) = P(>=K-1)
"""
probs = torch.sigmoid(logits) # (N, K-1)
num_classes = probs.shape[1] + 1
class_probs = torch.zeros(probs.shape[0], num_classes)
class_probs[:, 0] = 1.0 - probs[:, 0]
for k in range(1, num_classes - 1):
class_probs[:, k] = probs[:, k - 1] - probs[:, k]
class_probs[:, -1] = probs[:, -1]
# Clamp for numerical stability
class_probs = class_probs.clamp(min=0.0)
class_probs = class_probs / class_probs.sum(dim=1, keepdim=True)
return class_probs
def compute_ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> tuple[float, dict]:
"""Compute Expected Calibration Error and per-bin data for reliability diagram."""
confidences = np.max(probs, axis=1)
predictions = np.argmax(probs, axis=1)
accuracies = (predictions == labels).astype(float)
bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_data = {"bin_centers": [], "bin_accs": [], "bin_confs": [], "bin_counts": []}
ece = 0.0
for i in range(n_bins):
mask = (confidences > bin_edges[i]) & (confidences <= bin_edges[i + 1])
count = mask.sum()
if count > 0:
avg_conf = confidences[mask].mean()
avg_acc = accuracies[mask].mean()
ece += (count / len(labels)) * abs(avg_acc - avg_conf)
bin_data["bin_centers"].append((bin_edges[i] + bin_edges[i + 1]) / 2)
bin_data["bin_accs"].append(avg_acc)
bin_data["bin_confs"].append(avg_conf)
bin_data["bin_counts"].append(int(count))
return ece, bin_data
def compute_all_metrics(
preds: np.ndarray,
labels: np.ndarray,
probs: np.ndarray,
label_names: list[str],
task_name: str,
is_ordinal: bool = False,
) -> dict:
"""Compute comprehensive metrics for a single classification task."""
num_classes = len(label_names)
# Basic classification metrics
macro_f1 = f1_score(labels, preds, average="macro", labels=range(num_classes))
weighted_f1 = f1_score(labels, preds, average="weighted", labels=range(num_classes))
per_class_f1 = f1_score(labels, preds, average=None, labels=range(num_classes))
per_class_precision = precision_score(labels, preds, average=None, labels=range(num_classes), zero_division=0)
per_class_recall = recall_score(labels, preds, average=None, labels=range(num_classes), zero_division=0)
macro_precision = precision_score(labels, preds, average="macro", labels=range(num_classes), zero_division=0)
macro_recall = recall_score(labels, preds, average="macro", labels=range(num_classes), zero_division=0)
# MCC (multiclass)
mcc = matthews_corrcoef(labels, preds)
# AUC (one-vs-rest, macro)
try:
auc = roc_auc_score(labels, probs, multi_class="ovr", average="macro", labels=range(num_classes))
except ValueError:
auc = float("nan")
# Confusion matrix
cm = confusion_matrix(labels, preds, labels=range(num_classes))
# Calibration
ece, bin_data = compute_ece(probs, labels)
metrics = {
f"{task_name}_macro_f1": macro_f1,
f"{task_name}_weighted_f1": weighted_f1,
f"{task_name}_macro_precision": macro_precision,
f"{task_name}_macro_recall": macro_recall,
f"{task_name}_mcc": mcc,
f"{task_name}_auc": auc,
f"{task_name}_ece": ece,
f"{task_name}_confusion_matrix": cm.tolist(),
f"{task_name}_calibration": bin_data,
}
for i, name in enumerate(label_names):
short = name.replace(" ", "").replace("/", "").replace(":", "")[:8]
metrics[f"{task_name}_f1_{short}"] = per_class_f1[i]
metrics[f"{task_name}_prec_{short}"] = per_class_precision[i]
metrics[f"{task_name}_recall_{short}"] = per_class_recall[i]
# Ordinal-specific metrics
if is_ordinal:
metrics[f"{task_name}_qwk"] = cohen_kappa_score(labels, preds, weights="quadratic")
metrics[f"{task_name}_mae"] = mean_absolute_error(labels, preds)
# Krippendorff's alpha (treat model + reference as two coders)
reliability_data = np.array([labels, preds])
if is_ordinal:
metrics[f"{task_name}_kripp_alpha"] = krippendorff.alpha(
reliability_data=reliability_data, level_of_measurement="ordinal"
)
else:
metrics[f"{task_name}_kripp_alpha"] = krippendorff.alpha(
reliability_data=reliability_data, level_of_measurement="nominal"
)
return metrics
def generate_figures(
results: dict,
output_dir: Path,
model_name: str,
ref_name: str,
) -> list[str]:
"""Generate publication-ready figures. Returns list of saved file paths."""
figures_dir = output_dir / "figures"
figures_dir.mkdir(parents=True, exist_ok=True)
saved = []
sns.set_theme(style="whitegrid", font_scale=1.1)
palette = sns.color_palette("viridis", 7)
# 1. Category confusion matrix
cm = np.array(results["cat_confusion_matrix"])
fig, ax = plt.subplots(figsize=(10, 8))
    cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True).clip(min=1)
sns.heatmap(
cm_norm, annot=cm, fmt="d", cmap="Blues",
xticklabels=[c[:12] for c in CATEGORIES],
yticklabels=[c[:12] for c in CATEGORIES],
ax=ax, vmin=0, vmax=1, cbar_kws={"label": "Proportion"},
)
ax.set_xlabel(f"Predicted ({model_name})")
ax.set_ylabel(f"Reference ({ref_name})")
ax.set_title(f"Category Confusion Matrix — {model_name} vs {ref_name}")
plt.tight_layout()
path = figures_dir / f"confusion_cat_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 2. Specificity confusion matrix
cm_spec = np.array(results["spec_confusion_matrix"])
fig, ax = plt.subplots(figsize=(7, 6))
cm_spec_norm = cm_spec.astype(float) / cm_spec.sum(axis=1, keepdims=True).clip(min=1)
sns.heatmap(
cm_spec_norm, annot=cm_spec, fmt="d", cmap="Oranges",
xticklabels=SPEC_LABELS,
yticklabels=SPEC_LABELS,
ax=ax, vmin=0, vmax=1, cbar_kws={"label": "Proportion"},
)
ax.set_xlabel(f"Predicted ({model_name})")
ax.set_ylabel(f"Reference ({ref_name})")
ax.set_title(f"Specificity Confusion Matrix — {model_name} vs {ref_name}")
plt.tight_layout()
path = figures_dir / f"confusion_spec_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 3. Calibration reliability diagram (category)
cal = results["cat_calibration"]
fig, ax = plt.subplots(figsize=(6, 6))
ax.bar(cal["bin_centers"], cal["bin_accs"], width=0.06, alpha=0.7, label="Accuracy", color="#4C72B0")
ax.plot([0, 1], [0, 1], "k--", alpha=0.5, label="Perfect calibration")
ax.set_xlabel("Confidence")
ax.set_ylabel("Accuracy")
ax.set_title(f"Category Calibration — ECE={results['cat_ece']:.4f}")
ax.legend()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
path = figures_dir / f"calibration_cat_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 4. Per-class F1 bar chart
cat_f1s = [results.get(f"cat_f1_{c.replace(' ', '').replace('/', '')[:8]}", 0) for c in CATEGORIES]
spec_f1s = [results.get(f"spec_f1_{s.replace(' ', '').replace(':', '')[:8]}", 0) for s in SPEC_LABELS]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
bars = ax1.barh([c[:15] for c in CATEGORIES], cat_f1s, color=palette)
ax1.axvline(0.80, color="red", linestyle="--", alpha=0.7, label="Target (0.80)")
ax1.set_xlabel("F1 Score")
ax1.set_title(f"Per-Category F1 ({model_name} vs {ref_name})")
ax1.set_xlim(0, 1)
ax1.legend()
for bar, v in zip(bars, cat_f1s):
ax1.text(v + 0.01, bar.get_y() + bar.get_height() / 2, f"{v:.3f}", va="center", fontsize=9)
bars2 = ax2.barh(SPEC_LABELS, spec_f1s, color=sns.color_palette("YlOrRd", 4))
ax2.axvline(0.80, color="red", linestyle="--", alpha=0.7, label="Target (0.80)")
ax2.set_xlabel("F1 Score")
ax2.set_title(f"Per-Level Specificity F1 ({model_name} vs {ref_name})")
ax2.set_xlim(0, 1)
ax2.legend()
for bar, v in zip(bars2, spec_f1s):
ax2.text(v + 0.01, bar.get_y() + bar.get_height() / 2, f"{v:.3f}", va="center", fontsize=9)
plt.tight_layout()
path = figures_dir / f"per_class_f1_{ref_name.lower().replace(' ', '_')}.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
return saved
def generate_comparison_figures(
all_results: dict[str, dict],
output_dir: Path,
) -> list[str]:
"""Generate figures comparing multiple models/references."""
figures_dir = output_dir / "figures"
figures_dir.mkdir(parents=True, exist_ok=True)
saved = []
sns.set_theme(style="whitegrid", font_scale=1.1)
# 1. Model comparison bar chart (if we have CORAL vs Independent)
model_names = list(all_results.keys())
if len(model_names) >= 2:
metric_keys = ["cat_macro_f1", "spec_macro_f1", "cat_mcc", "spec_qwk", "spec_mae"]
metric_labels = ["Cat Macro F1", "Spec Macro F1", "Cat MCC", "Spec QWK", "Spec MAE"]
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(metric_labels))
width = 0.8 / len(model_names)
for i, name in enumerate(model_names):
vals = []
for k in metric_keys:
v = all_results[name].get(k, 0)
vals.append(v if not np.isnan(v) else 0)
bars = ax.bar(x + i * width - 0.4 + width / 2, vals, width, label=name, alpha=0.85)
for bar, v in zip(bars, vals):
ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
f"{v:.3f}", ha="center", va="bottom", fontsize=8)
ax.set_xticks(x)
ax.set_xticklabels(metric_labels)
ax.set_ylabel("Score")
ax.set_title("Model Comparison — All Metrics")
        ax.axhline(0.80, color="red", linestyle="--", alpha=0.5, label="F1 Target")
        ax.legend()
plt.tight_layout()
path = figures_dir / "model_comparison.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
# 2. Speed/cost comparison
speed_data = {}
for name, res in all_results.items():
if "avg_ms_per_sample" in res:
speed_data[name] = res["avg_ms_per_sample"]
if speed_data:
# Add GenAI panel reference speeds from docs
genai_speeds = {
"GPT-5.4 (API)": 2900, # from provenance data
"Opus-4.6 (API)": 6000, # from provenance data
}
fig, ax = plt.subplots(figsize=(10, 5))
all_names = list(speed_data.keys()) + list(genai_speeds.keys())
all_speeds = list(speed_data.values()) + list(genai_speeds.values())
colors = ["#4C72B0"] * len(speed_data) + ["#DD8452"] * len(genai_speeds)
bars = ax.barh(all_names, all_speeds, color=colors)
ax.set_xlabel("Latency (ms per paragraph)")
ax.set_title("Inference Speed Comparison")
ax.set_xscale("log")
for bar, v in zip(bars, all_speeds):
ax.text(v * 1.1, bar.get_y() + bar.get_height() / 2,
f"{v:.1f}ms", va="center", fontsize=9)
plt.tight_layout()
path = figures_dir / "speed_comparison.png"
fig.savefig(path, dpi=150)
plt.close(fig)
saved.append(str(path))
return saved
def format_report(
model_name: str,
ref_name: str,
metrics: dict,
inference_stats: dict,
) -> str:
"""Format a human-readable evaluation report."""
lines = []
lines.append(f"\n{'='*70}")
lines.append(f" HOLDOUT EVALUATION: {model_name} vs {ref_name}")
lines.append(f"{'='*70}\n")
lines.append(f" Samples evaluated: {inference_stats['num_samples']}")
lines.append(f" Total inference time: {inference_stats['total_time_s']:.2f}s")
lines.append(f" Avg latency: {inference_stats['avg_ms_per_sample']:.2f}ms/sample")
lines.append(f" Throughput: {1000 / inference_stats['avg_ms_per_sample']:.0f} samples/sec\n")
# Category metrics
    lines.append(f" {'─'*50}")
    lines.append(" CATEGORY CLASSIFICATION")
    lines.append(f" {'─'*50}")
    lines.append(f" Macro F1: {metrics['cat_macro_f1']:.4f} {'✓' if metrics['cat_macro_f1'] >= 0.80 else '✗'} (target: 0.80)")
lines.append(f" Weighted F1: {metrics['cat_weighted_f1']:.4f}")
lines.append(f" Macro Prec: {metrics['cat_macro_precision']:.4f}")
lines.append(f" Macro Recall: {metrics['cat_macro_recall']:.4f}")
lines.append(f" MCC: {metrics['cat_mcc']:.4f}")
lines.append(f" AUC (OvR): {metrics['cat_auc']:.4f}")
lines.append(f" ECE: {metrics['cat_ece']:.4f}")
lines.append(f" Kripp Alpha: {metrics['cat_kripp_alpha']:.4f}")
lines.append("")
lines.append(f" {'Category':<25} {'F1':>8} {'Prec':>8} {'Recall':>8}")
lines.append(f" {'-'*25} {'-'*8} {'-'*8} {'-'*8}")
for c in CATEGORIES:
short = c.replace(" ", "").replace("/", "")[:8]
f1 = metrics.get(f"cat_f1_{short}", 0)
prec = metrics.get(f"cat_prec_{short}", 0)
rec = metrics.get(f"cat_recall_{short}", 0)
lines.append(f" {c:<25} {f1:>8.4f} {prec:>8.4f} {rec:>8.4f}")
# Specificity metrics
    lines.append(f"\n {'─'*50}")
    lines.append(" SPECIFICITY CLASSIFICATION")
    lines.append(f" {'─'*50}")
    lines.append(f" Macro F1: {metrics['spec_macro_f1']:.4f} {'✓' if metrics['spec_macro_f1'] >= 0.80 else '✗'} (target: 0.80)")
lines.append(f" Weighted F1: {metrics['spec_weighted_f1']:.4f}")
lines.append(f" Macro Prec: {metrics['spec_macro_precision']:.4f}")
lines.append(f" Macro Recall: {metrics['spec_macro_recall']:.4f}")
lines.append(f" MCC: {metrics['spec_mcc']:.4f}")
    lines.append(f" AUC (OvR): {metrics['spec_auc']:.4f}")
    lines.append(f" QWK: {metrics['spec_qwk']:.4f}")
    lines.append(f" MAE: {metrics['spec_mae']:.4f}")
    lines.append(f" ECE: {metrics['spec_ece']:.4f}")
    lines.append(f" Kripp Alpha: {metrics['spec_kripp_alpha']:.4f}")
    lines.append("")
    lines.append(f" {'Level':<25} {'F1':>8} {'Prec':>8} {'Recall':>8}")
    lines.append(f" {'-'*25} {'-'*8} {'-'*8} {'-'*8}")
    for s in SPEC_LABELS:
        short = s.replace(" ", "").replace(":", "")[:8]
        f1 = metrics.get(f"spec_f1_{short}", 0)
        prec = metrics.get(f"spec_prec_{short}", 0)
        rec = metrics.get(f"spec_recall_{short}", 0)
        lines.append(f" {s:<25} {f1:>8.4f} {prec:>8.4f} {rec:>8.4f}")
    lines.append(f"\n{'='*70}\n")
    return "\n".join(lines)


def evaluate(config: EvalConfig) -> dict:
    """Run full holdout evaluation. Returns all metrics + generates figures."""
    output_dir = Path(config.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"\n Device: {device}")

    # Load holdout data
    print(" Loading holdout data...")
    records = load_holdout_data(
        config.paragraphs_path,
        config.holdout_path,
        config.benchmark_paths,
    )
    print(f" Holdout paragraphs: {len(records)}")

    # Load model
    print(f" Loading model from {config.checkpoint_path}...")
    model, tokenizer = load_model(config, device)
    param_count = sum(p.numel() for p in model.parameters()) / 1e6
    print(f" Model parameters: {param_count:.0f}M")

    # Run inference
    print(" Running inference...")
    inference = run_inference(
        model, tokenizer, records,
        config.max_seq_length, config.batch_size,
        device, config.specificity_head,
    )
    print(f" Inference complete: {inference['total_time_s']:.2f}s ({inference['avg_ms_per_sample']:.2f}ms/sample)")

    # Evaluate against each benchmark
    all_results = {}
    all_figures = []
    model_name = Path(config.checkpoint_path).parent.name
    if model_name == "final":
        model_name = Path(config.checkpoint_path).parent.parent.name

    for ref_name, ref_path in config.benchmark_paths.items():
        print(f"\n Evaluating against {ref_name}...")
        # Get reference labels for paragraphs that have them
        cat_labels = []
        spec_labels = []
        cat_preds = []
        spec_preds = []
        cat_probs = []
        spec_probs = []
        for i, rec in enumerate(records):
            bench = rec["benchmark_labels"].get(ref_name)
            if bench is None:
                continue
            cat_labels.append(CAT2ID[bench["category"]])
            spec_labels.append(bench["specificity"] - 1)  # 0-indexed
            cat_preds.append(inference["cat_preds"][i])
            spec_preds.append(inference["spec_preds"][i])
            cat_probs.append(inference["cat_probs"][i])
            spec_probs.append(inference["spec_probs"][i])
        cat_labels = np.array(cat_labels)
        spec_labels = np.array(spec_labels)
        cat_preds = np.array(cat_preds)
        spec_preds = np.array(spec_preds)
        cat_probs = np.array(cat_probs)
        spec_probs = np.array(spec_probs)
        print(f" Matched samples: {len(cat_labels)}")

        # Compute metrics
        cat_metrics = compute_all_metrics(
            cat_preds, cat_labels, cat_probs, CATEGORIES, "cat", is_ordinal=False
        )
        spec_metrics = compute_all_metrics(
            spec_preds, spec_labels, spec_probs, SPEC_LABELS, "spec", is_ordinal=True
        )
        combined = {**cat_metrics, **spec_metrics, **inference}
        combined["combined_macro_f1"] = (combined["cat_macro_f1"] + combined["spec_macro_f1"]) / 2

        # Print report
        report = format_report(model_name, ref_name, combined, inference)
        print(report)

        # Save report
        report_path = output_dir / f"report_{ref_name.lower().replace(' ', '_').replace('.', '')}.txt"
        with open(report_path, "w") as f:
            f.write(report)

        # Generate figures
        figs = generate_figures(combined, output_dir, model_name, ref_name)
        all_figures.extend(figs)
        all_results[f"{model_name}_vs_{ref_name}"] = combined

    # Cross-model comparison figures
    comp_figs = generate_comparison_figures(all_results, output_dir)
    all_figures.extend(comp_figs)

    # Save all metrics as JSON
    serializable = {}
    for k, v in all_results.items():
        serializable[k] = {
            mk: mv for mk, mv in v.items()
            if isinstance(mv, (int, float, str, list, bool))
        }
    metrics_path = output_dir / "metrics.json"
    with open(metrics_path, "w") as f:
        json.dump(serializable, f, indent=2, default=str)

    print(f"\n Results saved to {output_dir}")
    print(f" Figures: {len(all_figures)} generated")
    for fig_path in all_figures:
        print(f" {fig_path}")
    return all_results


def evaluate_comparison(
    configs: list[EvalConfig],
    output_dir: str,
) -> dict:
    """Evaluate multiple model checkpoints and generate cross-model comparisons."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    all_results = {}
    for config in configs:
        results = evaluate(config)
        all_results.update(results)
    # Generate comparison figures across all models
    comp_figs = generate_comparison_figures(all_results, output_dir)
    print(f"\n Comparison figures: {len(comp_figs)}")
    return all_results
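The headline numbers in the reports below (macro F1, QWK, MAE) correspond to standard scikit-learn calls. A toy sketch on made-up labels, assuming `compute_all_metrics` (not shown in this diff) wraps scikit-learn:

```python
# Toy illustration of the headline metrics on hypothetical 4-level
# ordinal labels (0-indexed specificity); NOT holdout data.
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

y_true = np.array([0, 1, 2, 3, 3, 2, 1, 0])
y_pred = np.array([0, 1, 2, 3, 2, 2, 0, 0])

macro_f1 = f1_score(y_true, y_pred, average="macro")          # unweighted mean of per-class F1
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")  # quadratic-weighted kappa
mae = np.abs(y_true - y_pred).mean()                          # mean ordinal distance of errors

print(f"macro F1 {macro_f1:.4f}  QWK {qwk:.4f}  MAE {mae:.4f}")
```

Because QWK and MAE weight errors by ordinal distance, they can stay strong (as they do below) even when macro F1 collapses on the middle levels.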

View File

@ -0,0 +1,298 @@
{
"best-base_weighted_ce-ep5_vs_GPT-5.4": {
"cat_macro_f1": 0.9360575579144376,
"cat_weighted_f1": 0.9360564701876355,
"cat_macro_precision": 0.9336791798534633,
"cat_macro_recall": 0.9414080218768329,
"cat_mcc": 0.9248088496355107,
"cat_auc": 0.991343460842945,
"cat_ece": 0.0441274690628052,
"cat_confusion_matrix": [
[
220,
0,
8,
0,
1,
0,
1
],
[
0,
86,
0,
0,
1,
1,
0
],
[
1,
0,
143,
1,
5,
0,
0
],
[
0,
0,
2,
133,
0,
1,
0
],
[
6,
0,
7,
18,
165,
1,
1
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
1,
6,
1,
169
]
],
"cat_f1_BoardGov": 0.962800875273523,
"cat_prec_BoardGov": 0.9691629955947136,
"cat_recall_BoardGov": 0.9565217391304348,
"cat_f1_Incident": 0.9717514124293786,
"cat_prec_Incident": 0.9662921348314607,
"cat_recall_Incident": 0.9772727272727273,
"cat_f1_Manageme": 0.9196141479099679,
"cat_prec_Manageme": 0.8881987577639752,
"cat_recall_Manageme": 0.9533333333333334,
"cat_f1_NoneOthe": 0.8956228956228957,
"cat_prec_NoneOthe": 0.8260869565217391,
"cat_recall_NoneOthe": 0.9779411764705882,
"cat_f1_RiskMana": 0.873015873015873,
"cat_prec_RiskMana": 0.9166666666666666,
"cat_recall_RiskMana": 0.8333333333333334,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9712643678160919,
"cat_prec_Third-Pa": 0.9883040935672515,
"cat_recall_Third-Pa": 0.9548022598870056,
"cat_kripp_alpha": 0.9243601922903683,
"spec_macro_f1": 0.5970357338282843,
"spec_weighted_f1": 0.7040798408451929,
"spec_macro_precision": 0.7225196233593912,
"spec_macro_recall": 0.6139005306639329,
"spec_mcc": 0.6138700055328291,
"spec_auc": 0.9498756282617218,
"spec_ece": 0.1652249880135059,
"spec_confusion_matrix": [
[
596,
6,
11,
5
],
[
105,
46,
9,
8
],
[
14,
6,
52,
135
],
[
4,
0,
3,
200
]
],
"spec_f1_L1Generi": 0.8915482423335827,
"spec_prec_L1Generi": 0.8289290681502086,
"spec_recall_L1Generi": 0.9644012944983819,
"spec_f1_L2Domain": 0.40707964601769914,
"spec_prec_L2Domain": 0.7931034482758621,
"spec_recall_L2Domain": 0.27380952380952384,
"spec_f1_L3Firm-S": 0.36879432624113473,
"spec_prec_L3Firm-S": 0.6933333333333334,
"spec_recall_L3Firm-S": 0.25120772946859904,
"spec_f1_L4Quanti": 0.7207207207207207,
"spec_prec_L4Quanti": 0.5747126436781609,
"spec_recall_L4Quanti": 0.966183574879227,
"spec_qwk": 0.8757404773441285,
"spec_mae": 0.2975,
"spec_kripp_alpha": 0.8479072400833478,
"total_time_s": 6.695346015971154,
"num_samples": 1200,
"avg_ms_per_sample": 5.579455013309295,
"combined_macro_f1": 0.766546645871361
},
"best-base_weighted_ce-ep5_vs_Opus-4.6": {
"cat_macro_f1": 0.9280167387549427,
"cat_weighted_f1": 0.9273898648954128,
"cat_macro_precision": 0.9223465490796974,
"cat_macro_recall": 0.9382296607170699,
"cat_mcc": 0.9162751746063641,
"cat_auc": 0.992382433433919,
"cat_ece": 0.04689237485329312,
"cat_confusion_matrix": [
[
209,
0,
2,
1,
1,
0,
1
],
[
0,
78,
0,
0,
1,
0,
0
],
[
4,
0,
147,
2,
4,
0,
1
],
[
0,
0,
1,
139,
1,
0,
0
],
[
13,
1,
10,
15,
168,
1,
5
],
[
1,
10,
1,
4,
3,
209,
0
],
[
0,
0,
0,
0,
2,
1,
164
]
],
"cat_f1_BoardGov": 0.9478458049886621,
"cat_prec_BoardGov": 0.920704845814978,
"cat_recall_BoardGov": 0.9766355140186916,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9216300940438872,
"cat_prec_Manageme": 0.9130434782608695,
"cat_recall_Manageme": 0.930379746835443,
"cat_f1_NoneOthe": 0.9205298013245033,
"cat_prec_NoneOthe": 0.8633540372670807,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.8549618320610687,
"cat_prec_RiskMana": 0.9333333333333333,
"cat_recall_RiskMana": 0.7887323943661971,
"cat_f1_Strategy": 0.9521640091116174,
"cat_prec_Strategy": 0.990521327014218,
"cat_recall_Strategy": 0.9166666666666666,
"cat_f1_Third-Pa": 0.9704142011834319,
"cat_prec_Third-Pa": 0.9590643274853801,
"cat_recall_Third-Pa": 0.9820359281437125,
"cat_kripp_alpha": 0.9154955768233572,
"spec_macro_f1": 0.5957642708821952,
"spec_weighted_f1": 0.693005282664721,
"spec_macro_precision": 0.731933400476396,
"spec_macro_recall": 0.6249872364065566,
"spec_mcc": 0.6143201053040909,
"spec_auc": 0.9470735892830423,
"spec_ece": 0.18189165468017254,
"spec_confusion_matrix": [
[
592,
4,
4,
5
],
[
92,
42,
6,
5
],
[
35,
12,
63,
150
],
[
0,
0,
2,
188
]
],
"spec_f1_L1Generi": 0.8942598187311178,
"spec_prec_L1Generi": 0.8233657858136301,
"spec_recall_L1Generi": 0.9785123966942149,
"spec_f1_L2Domain": 0.41379310344827586,
"spec_prec_L2Domain": 0.7241379310344828,
"spec_recall_L2Domain": 0.2896551724137931,
"spec_f1_L3Firm-S": 0.3761194029850746,
"spec_prec_L3Firm-S": 0.84,
"spec_recall_L3Firm-S": 0.2423076923076923,
"spec_f1_L4Quanti": 0.6988847583643123,
"spec_prec_L4Quanti": 0.5402298850574713,
"spec_recall_L4Quanti": 0.9894736842105263,
"spec_qwk": 0.872110225054491,
"spec_mae": 0.3075,
"spec_kripp_alpha": 0.8502616991488389,
"total_time_s": 6.695346015971154,
"num_samples": 1200,
"avg_ms_per_sample": 5.579455013309295,
"combined_macro_f1": 0.7618905048185689
}
}

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.70s
Avg latency: 5.58ms/sample
Throughput: 179 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9361 ✓ (target: 0.80)
Weighted F1: 0.9361
Macro Prec: 0.9337
Macro Recall: 0.9414
MCC: 0.9248
AUC (OvR): 0.9913
ECE: 0.0441
Kripp Alpha: 0.9244
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9628 0.9692 0.9565
Incident Disclosure 0.9718 0.9663 0.9773
Management Role 0.9196 0.8882 0.9533
None/Other 0.8956 0.8261 0.9779
Risk Management Process 0.8730 0.9167 0.8333
Strategy Integration 0.9583 0.9810 0.9367
Third-Party Risk 0.9713 0.9883 0.9548
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.5970 ✗ (target: 0.80)
Weighted F1: 0.7041
Macro Prec: 0.7225
Macro Recall: 0.6139
MCC: 0.6139
AUC (OvR): 0.9499
QWK: 0.8757
MAE: 0.2975
ECE: 0.1652
Kripp Alpha: 0.8479
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.8915 0.8289 0.9644
L2: Domain 0.4071 0.7931 0.2738
L3: Firm-Specific 0.3688 0.6933 0.2512
L4: Quantified 0.7207 0.5747 0.9662
======================================================================
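The `Kripp Alpha` rows above are Krippendorff's alpha. For the two-rater, no-missing-data case used here (model vs. one reference), the nominal-scale version reduces to a few lines; a minimal sketch (not the eval.py implementation, which is not shown in this diff):

```python
import numpy as np

def kripp_alpha_nominal(r1, r2):
    """Krippendorff's alpha (nominal) for two raters, no missing data."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    values = np.concatenate([r1, r2])
    n = len(values)  # total number of ratings across both raters
    # Observed disagreement: each unit contributes its (a,b) and (b,a) pairs.
    d_o = 2 * np.sum(r1 != r2) / n
    # Expected disagreement from the pooled value frequencies.
    _, counts = np.unique(values, return_counts=True)
    d_e = (n * n - np.sum(counts ** 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

print(kripp_alpha_nominal([0, 0, 1, 1], [0, 0, 1, 0]))  # ≈ 0.533 (= 8/15)
```

The ordinal variant reported for specificity additionally weights disagreements by distance between levels, which is why it tracks QWK closely.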

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.70s
Avg latency: 5.58ms/sample
Throughput: 179 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9280 ✓ (target: 0.80)
Weighted F1: 0.9274
Macro Prec: 0.9223
Macro Recall: 0.9382
MCC: 0.9163
AUC (OvR): 0.9924
ECE: 0.0469
Kripp Alpha: 0.9155
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9478 0.9207 0.9766
Incident Disclosure 0.9286 0.8764 0.9873
Management Role 0.9216 0.9130 0.9304
None/Other 0.9205 0.8634 0.9858
Risk Management Process 0.8550 0.9333 0.7887
Strategy Integration 0.9522 0.9905 0.9167
Third-Party Risk 0.9704 0.9591 0.9820
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.5958 ✗ (target: 0.80)
Weighted F1: 0.6930
Macro Prec: 0.7319
Macro Recall: 0.6250
MCC: 0.6143
AUC (OvR): 0.9471
QWK: 0.8721
MAE: 0.3075
ECE: 0.1819
Kripp Alpha: 0.8503
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.8943 0.8234 0.9785
L2: Domain 0.4138 0.7241 0.2897
L3: Firm-Specific 0.3761 0.8400 0.2423
L4: Quantified 0.6989 0.5402 0.9895
======================================================================


View File

@ -0,0 +1,298 @@
{
"iter1-independent_vs_GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.991991833154947,
"cat_ece": 0.053848127176364245,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8951731906425856,
"spec_weighted_f1": 0.9121524819510628,
"spec_macro_precision": 0.8980417155129858,
"spec_macro_recall": 0.8930560580782194,
"spec_mcc": 0.866381831963237,
"spec_auc": 0.981666223606385,
"spec_ece": 0.07135417198141418,
"spec_confusion_matrix": [
[
580,
23,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9354838709677419,
"spec_prec_L1Generi": 0.932475884244373,
"spec_recall_L1Generi": 0.9385113268608414,
"spec_f1_L2Domain": 0.7975460122699386,
"spec_prec_L2Domain": 0.8227848101265823,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9324447137231142,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.917725722448833,
"total_time_s": 6.732117835083045,
"num_samples": 1200,
"avg_ms_per_sample": 5.6100981959025376,
"combined_macro_f1": 0.9144236534059689
},
"iter1-independent_vs_Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.9939660707189948,
"cat_ece": 0.06551479384303091,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8833694419146193,
"spec_weighted_f1": 0.9004034318676798,
"spec_macro_precision": 0.8858989636247611,
"spec_macro_recall": 0.8854684685880032,
"spec_mcc": 0.8500778641433316,
"spec_auc": 0.9736633898988131,
"spec_ece": 0.08248284702499709,
"spec_confusion_matrix": [
[
567,
30,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9242053789731052,
"spec_prec_L1Generi": 0.9115755627009646,
"spec_recall_L1Generi": 0.9371900826446281,
"spec_f1_L2Domain": 0.7788778877887789,
"spec_prec_L2Domain": 0.7468354430379747,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9227008860372746,
"spec_mae": 0.13583333333333333,
"spec_kripp_alpha": 0.9065248741550552,
"total_time_s": 6.732117835083045,
"num_samples": 1200,
"avg_ms_per_sample": 5.6100981959025376,
"combined_macro_f1": 0.9030269144688962
}
}

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-independent vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.73s
Avg latency: 5.61ms/sample
Throughput: 178 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9337 ✓ (target: 0.80)
Weighted F1: 0.9343
Macro Prec: 0.9319
Macro Recall: 0.9378
MCC: 0.9227
AUC (OvR): 0.9920
ECE: 0.0538
Kripp Alpha: 0.9224
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9719 0.9657 0.9783
Incident Disclosure 0.9605 0.9551 0.9659
Management Role 0.9412 0.9231 0.9600
None/Other 0.8881 0.8239 0.9632
Risk Management Process 0.8564 0.8865 0.8283
Strategy Integration 0.9583 0.9810 0.9367
Third-Party Risk 0.9593 0.9880 0.9322
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8952 ✓ (target: 0.80)
Weighted F1: 0.9122
Macro Prec: 0.8980
Macro Recall: 0.8931
MCC: 0.8664
AUC (OvR): 0.9817
QWK: 0.9324
MAE: 0.1175
ECE: 0.0714
Kripp Alpha: 0.9177
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9355 0.9325 0.9385
L2: Domain 0.7975 0.8228 0.7738
L3: Firm-Specific 0.8941 0.8716 0.9179
L4: Quantified 0.9535 0.9653 0.9420
======================================================================

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-independent vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.73s
Avg latency: 5.61ms/sample
Throughput: 178 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9227 ✓ (target: 0.80)
Weighted F1: 0.9216
Macro Prec: 0.9178
Macro Recall: 0.9316
MCC: 0.9093
AUC (OvR): 0.9940
ECE: 0.0655
Kripp Alpha: 0.9086
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9441 0.9056 0.9860
Incident Disclosure 0.9286 0.8764 0.9873
Management Role 0.9172 0.9231 0.9114
None/Other 0.9200 0.8679 0.9787
Risk Management Process 0.8492 0.9135 0.7934
Strategy Integration 0.9476 0.9858 0.9123
Third-Party Risk 0.9521 0.9521 0.9521
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8834 ✓ (target: 0.80)
Weighted F1: 0.9004
Macro Prec: 0.8859
Macro Recall: 0.8855
MCC: 0.8501
AUC (OvR): 0.9737
QWK: 0.9227
MAE: 0.1358
ECE: 0.0825
Kripp Alpha: 0.9065
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9242 0.9116 0.9372
L2: Domain 0.7789 0.7468 0.8138
L3: Firm-Specific 0.8661 0.9495 0.7962
L4: Quantified 0.9643 0.9356 0.9947
======================================================================
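The ECE rows in these reports are expected calibration error: the bin-weighted gap between accuracy and confidence. One common sketch with equal-width confidence bins (the binning scheme in eval.py may differ):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over equal-width confidence bins: sum of bin_weight * |acc - conf|."""
    conf = probs.max(axis=1)                                 # predicted-class confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Overconfident toy model: 75% confident but only 50% accurate -> ECE 0.25
probs = np.array([[0.75, 0.25]] * 4)
labels = np.array([0, 0, 1, 1])
print(expected_calibration_error(probs, labels))
```

Under this reading, the specificity ECE of ~0.17-0.18 for the weighted-CE checkpoint says its confidence overstates accuracy by roughly that margin on average, consistent with the collapsed L2/L3 recall.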