diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md
index 0b3cb69..538ae03 100644
--- a/docs/NARRATIVE.md
+++ b/docs/NARRATIVE.md
@@ -901,6 +901,202 @@ thus invariant under T > 0.
 
 Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
 
+### 10.5 Pooling Ablation (Attention vs [CLS])
+
+**Motivation:** The spec F1 jump from 0.517 → 0.945 was credited to three
+architectural changes — independent threshold heads, attention pooling, and
+confidence filtering. Independent thresholds were ablated against CORAL;
+confidence filtering was ablated in §10.3 (null result). Attention pooling
+had never been isolated. We needed to know whether it actually matters or
+whether independent thresholds carry all the gain.
+
+**Setup:** `iter1-clspool.yaml` — identical iter1 config but with
+`pooling: cls`. Same seed (42), same 11 epochs, confidence filtering on.
+
+**Results:**
+
+| Config | Val Cat F1 | Val Spec F1 | Val Combined | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
+|--------|-----------:|------------:|-------------:|-------------------------:|--------------------------:|
+| iter1 (attention) | 0.9430 | 0.9450 | 0.9440 | 0.9343 | 0.8950 |
+| iter1-clspool ([CLS])| 0.9368 | 0.9414 | 0.9391 | 0.9296 | 0.8920 |
+| **Δ (attention − CLS)** | **+0.006** | **+0.004** | **+0.005** | **+0.005** | **+0.003** |
+
+**Finding:** Attention pooling is consistently better than [CLS] pooling
+across all metrics and both references, but the effect is **small** —
+0.003-0.006 F1. This is only 2-3× the seed-level std (±0.002), so
+the direction is credible but the magnitude is modest. Attention pooling is
+doing real work ("one CISO mention anywhere matters"), but independent
+threshold heads are clearly carrying the majority of the architecture win.
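For concreteness, a minimal sketch of the two pooling modes being compared. This is illustrative, not the project's actual module; it assumes a standard `(batch, tokens, hidden)` encoder output and a single learned scoring vector:

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Sketch of `pooling: attention`: a learned softmax over token positions,
    so one strong token (e.g. a CISO mention anywhere) can dominate the
    pooled vector regardless of where it appears."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H); attention_mask: (B, T), 1 = real token
        scores = self.score(hidden_states).squeeze(-1)           # (B, T)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # never attend to padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (B, T, 1)
        return (weights * hidden_states).sum(dim=1)              # (B, H)


def cls_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """Sketch of `pooling: cls`: just the first ([CLS]) token's embedding."""
    return hidden_states[:, 0]
```

The intuition tested by the ablation: `cls_pool` forces all paragraph evidence through one token's representation, while `AttentionPool` can up-weight a single decisive token directly.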
+ +**Interpretation for the paper:** We can report this cleanly as "attention +pooling contributes a small but consistent improvement over [CLS] pooling +(~+0.005 F1 on both heads); the bulk of the CORAL → independent-threshold +gain (~+0.43 on spec F1) is attributable to the decoupled threshold weights, +not the pooling change." This is honest and gives each design choice its +proper credit. + +Output: `checkpoints/finetune/iter1-clspool/`, `results/eval/iter1-clspool/`. + +### 10.6 DAPT Re-Test with New Architecture + +**Motivation:** During the original 12-config ablation grid (CORAL + +[CLS] pooling), DAPT and TAPT both *hurt* — base ModernBERT-large +outperformed DAPT and TAPT checkpoints on every loss combination. That was +reported as a noteworthy null result. But the architecture has changed +substantially since then (independent thresholds, attention pooling). The +verdict on DAPT could now flip: maybe the DAPT vocabulary signal was +previously wasted on a model that couldn't use it. + +**Setup:** `iter1-dapt.yaml` — identical iter1 config but +`model.name_or_path` points at `checkpoints/dapt/modernbert-large/final` +(eval loss 0.7250 from Phase 7). Same seed, 11 epochs, attention pooling, +independent threshold heads, confidence filtering on. 
+
+**Results (epoch 11 — final checkpoint):**
+
+| Config | Val Cat F1 | Val Spec F1 | Val Combined | Val NLL (ep 11) | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
+|--------|-----------:|------------:|-------------:|----------------:|-------------------------:|--------------------------:|
+| iter1 (base ModernBERT, seed 69) | 0.9384 | 0.9462 | 0.9423 | 0.511 | — | — |
+| iter1 (base ModernBERT, seed 42) | 0.9430 | 0.9450 | 0.9440 | — | 0.9343 | 0.8950 |
+| iter1-dapt (DAPT init) | 0.9500 | 0.9462 | 0.9481 | 0.494 | 0.9350 | 0.8959 |
+| **Δ (dapt − base)** | **+0.007** | **+0.001** | **+0.004** | **−0.017** | +0.001 | +0.001 |
+
+**Per-epoch val NLL trajectory (confirmed not overfitting-driven):**
+
+| Epoch | seed 69 (no DAPT) | DAPT | Δ |
+|-------|------------------:|-----:|----:|
+| 1 | 0.376 | 0.346 | −0.030 |
+| 2 | 0.337 | **0.318** (best) | −0.019 |
+| 3 | **0.333** (best) | 0.331 | −0.002 |
+| 5 | 0.394 | 0.385 | −0.009 |
+| 8 | 0.493 | 0.482 | −0.011 |
+| 11 | 0.511 | 0.494 | −0.017 |
+
+Both runs bottom out at epoch 2-3 and then overfit steadily. The overfit gap
+(val NLL at epoch 11 minus best) is **0.178 for the baseline** and
+**0.176 for DAPT** — essentially identical. DAPT is not overfitting worse;
+it is **starting from a better representation** and maintaining the same
+generalization gap through training.
+
+**Finding — a more nuanced null:** DAPT initialization genuinely improves
+val NLL by ~4.5% at the best checkpoint (0.333 → 0.318), with a matching
++0.007 category F1 improvement on val. The improvement is real and not a
+side-effect of overfitting: the train/val gap is unchanged. But the benefit
+does not transfer to the stratified holdout — the gain there is **+0.001**
+on both heads, within seed-level noise and nowhere near the val improvement. 
Something interesting is happening: + +- DAPT helps the model fit in-distribution data more tightly (val gain + + NLL drop) +- That extra fit does not generalize to the stratified holdout +- The holdout oversamples minority classes (L2, TP, ID) relative to the + training distribution; DAPT's benefit is on the head of the distribution + +**Interpretation for the paper:** This is a more interesting null result +than the original "DAPT/TAPT did not help." The revised claim is: + +> *"Domain-adaptive pretraining improves in-distribution val NLL by ~4.5% +> at the best checkpoint (0.333 → 0.318) and provides a modest val F1 gain +> (+0.007 cat, +0.004 combined) under the independent-threshold + +> attention-pooling architecture. The generalization gap (difference between +> best val NLL and final val NLL) is unchanged by DAPT (0.178 vs 0.176), +> confirming that DAPT is providing a better initialization rather than +> just enabling overfitting. However, this val improvement does not +> transfer to the stratified holdout — DAPT produces a model that is +> better-calibrated on paragraphs similar to the training distribution, +> yet no more generalizable to the rare-class boundary cases (L2, TP, ID) +> that macro F1 weighs heavily. Our original finding (DAPT does not help +> final macro F1) is reaffirmed; the mechanism is now clearer."* + +This is stronger than the original null because we can now point to a +specific, measurable effect of DAPT (val NLL) distinct from overfitting, +and explain why it doesn't show up in the headline macro F1 metric. + +The non-DAPT 3-seed ensemble remains the recommended headline checkpoint. +The DAPT run is reportable as an ablation and a more precise null. + +Output: `checkpoints/finetune/iter1-dapt/`, `results/eval/iter1-dapt/`. 
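The overfit-gap arithmetic above can be reproduced directly from the per-epoch table. A small sketch; the values are copied from the table, with epochs 4, 6, 7, 9, and 10 omitted exactly as in the table:

```python
# Per-epoch val NLL for the baseline (seed 69) and the DAPT-initialized run.
nll_base = {1: 0.376, 2: 0.337, 3: 0.333, 5: 0.394, 8: 0.493, 11: 0.511}
nll_dapt = {1: 0.346, 2: 0.318, 3: 0.331, 5: 0.385, 8: 0.482, 11: 0.494}


def overfit_gap(nll_by_epoch: dict) -> float:
    """Final-epoch val NLL minus best (minimum) val NLL."""
    final = nll_by_epoch[max(nll_by_epoch)]
    best = min(nll_by_epoch.values())
    return round(final - best, 3)


print(overfit_gap(nll_base))  # 0.178
print(overfit_gap(nll_dapt))  # 0.176
```

Both gaps are essentially identical, which is the basis for the claim that DAPT provides a better starting point rather than extra memorization headroom.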
+
+### 10.7 The NLL-vs-F1 Decoupling and the Overfit Story
+
+Investigating the DAPT ablation (§10.6) surfaced a general property of
+every run in Phase 10 worth documenting explicitly, because it affects how
+the paper should report training dynamics.
+
+**Observation:** In all six independent-threshold runs (seeds 42/69/420,
+iter1-nofilter, iter1-clspool, iter1-dapt), **val NLL bottoms at epoch 2-3
+and then climbs monotonically through epoch 11, while val macro F1 peaks
+at epoch 8 and plateaus.** The two metrics disagree about when the model
+is at its best.
+
+**Per-epoch val NLL, representative run (seed 69):**
+
+| Epoch | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
+|-------|---|---|---|---|---|---|---|---|---|----|----|
+| Val NLL | 0.376 | 0.337 | **0.333** | 0.369 | 0.394 | 0.443 | 0.472 | 0.493 | 0.505 | — | 0.511 |
+| Val F1 | ~0.90 | ~0.92 | ~0.925 | ~0.932 | ~0.938 | ~0.941 | ~0.942 | **~0.944** | 0.944 | 0.944 | 0.943 |
+
+**Interpretation:** Past epoch 3, continued training memorizes *confidence*,
+not *decisions*. Two things happen simultaneously:
+
+1. Training-set probabilities are pushed toward 0/1 (training loss → 0)
+2. Very few argmax decision boundaries shift
+
+For val examples the model already gets right, sharpening is neutral-to-bad
+for NLL and neutral-to-good for F1. For val examples the model gets wrong,
+continued training makes the prediction *more confidently wrong* — terrible
+for NLL (log-penalty grows), irrelevant for F1 (still wrong by argmax).
+Net: NLL climbs, F1 inches up as a small number of borderline examples
+flip to the correct side.
+
+This is a well-documented decoupling in deep classifiers, not a pathology
+specific to this model.
+
+**Is it a problem for the F1 claim? No.** Model selection uses val F1, so
+we pick the epoch where F1 peaks (epoch 8). Val F1 at the selected
+checkpoint (0.943/0.945) closely tracks holdout F1 against proxy gold
+(0.934/0.895) — a ~0.01 category gap and ~0.05 specificity gap. 
The +decision boundaries generalized. The model did not overfit the *task*. + +**Is it a problem for the probability claim? Yes, but measurable and +fixable.** Raw logits at epoch 8 are overconfident, which is exactly what +the pre-scaling ECE measured (0.05-0.08). The fitted temperatures +(T_cat = 1.76, T_spec = 2.46) are a direct quantification of how +overconfident the model became between epoch 3 and epoch 8: T > 1 means +"divide logits to cool them off." Temperature scaling (§10.4) recovers +calibration without touching predictions, so the cost of training to +epoch 8 instead of epoch 3 is paid in a scalar that's learned in ~1 second +on val. + +**Is it a problem for the holdout claim? No, by construction.** The +holdout was never touched during training. The train/val loss gap measures +memorization of the training distribution; the holdout measures +generalization to a distributionally distinct sample. These are independent +signals and both tell a consistent story: decision boundaries transfer, +probability calibration does not. + +**Why not just stop at epoch 3?** Because you'd save ~0.18 in val NLL and +lose ~0.02 in val F1. Epochs 3 → 8 buy ~0.015-0.020 F1 at the cost of +calibration that temperature scaling mechanically recovers. For a +task where F1 is the rubric metric, that is a good trade. Were this a +deployment where confidence scores drive downstream decisions (e.g., a +human-in-the-loop review queue prioritizing low-confidence paragraphs), +epoch 3 + no temperature scaling would be a reasonable alternative choice. + +**Paper framing:** + +> *"Val NLL minimizes at epoch 2-3 while val macro F1 peaks at epoch 8 — a +> well-documented decoupling between calibration and decision quality in +> deep classifiers. We select checkpoints by F1, report pre- and +> post-temperature-scaling ECE separately, and verify generalization via +> an untouched stratified holdout. 
The model's val-holdout F1 gap (~0.01 +> category, ~0.05 specificity) is within the inter-reference agreement +> ceiling, confirming decision-boundary generalization despite +> in-distribution confidence memorization. Temperature scaling recovers +> calibration (ECE −33% cat, −40% spec) without altering predictions."* + ### Phase 10 Summary | Experiment | Cost | Outcome | Paper value | @@ -909,6 +1105,8 @@ Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`. | Dictionary baseline | ~1 min CPU | Cat 0.55, Spec 0.66 — clear gap to learned model | A-rubric "additional baselines" item | | Confidence-filter ablation | ~3h GPU | Null result — filtering does not matter | Justifies architecture, not data engineering | | Temperature scaling | ~10 min GPU | ECE −33% cat, −40% spec, F1 unchanged | Calibration story, deployment quality | +| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds | +| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (−4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization | The 3-seed ensemble is now the recommended headline checkpoint. The calibrated ECE numbers should replace the pre-scaling ECE in the paper. 
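A toy illustration of the mechanics §10.7 describes: sharpening a wrong prediction leaves the argmax (and thus F1) unchanged while inflating NLL, and dividing by a temperature T > 1 undoes the inflation without touching any prediction. The logits below are invented; only T = 2.46, the fitted T_spec from §10.4, comes from the text:

```python
import torch
import torch.nn.functional as F


def nll(logits: torch.Tensor, labels: torch.Tensor, T: float = 1.0) -> float:
    """Mean negative log-likelihood of temperature-scaled logits."""
    return F.cross_entropy(logits / T, labels).item()


label = torch.tensor([1])                # true class
mild = torch.tensor([[2.0, 1.0, 0.0]])   # early-epoch-style logits, wrong argmax
sharp = torch.tensor([[8.0, 4.0, 0.0]])  # same decision, sharpened by more training

# Same argmax (same F1 contribution), much worse NLL once sharpened:
assert mild.argmax(-1) == sharp.argmax(-1)
print(nll(mild, label), nll(sharp, label))

# Cooling with T_spec = 2.46 shrinks the log-penalty of the confidently-wrong
# prediction without changing the argmax: calibration improves, F1 untouched.
assert (sharp / 2.46).argmax(-1) == sharp.argmax(-1)
print(nll(sharp, label, T=2.46))
```

This is why temperature scaling can repair the ECE accumulated between epoch 3 and epoch 8 while leaving every F1 number exactly as reported.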
The diff --git a/docs/STATUS.md b/docs/STATUS.md index ba92bd1..5e41eac 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -156,6 +156,8 @@ - [x] Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed - [x] Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context - [x] Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement +- [x] Pooling ablation (attention vs CLS) — attention +0.005 F1 consistent; small but credible effect +- [x] DAPT re-test with new architecture — val +0.007 cat F1, best val NLL 0.333→0.318 (−4.5%), generalization gap unchanged; holdout gain ~0.001 (better init, not better generalization) - [ ] Error analysis against human gold, IGNITE slides - [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work - [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result diff --git a/python/configs/finetune/iter1-clspool.yaml b/python/configs/finetune/iter1-clspool.yaml new file mode 100644 index 0000000..69d0d23 --- /dev/null +++ b/python/configs/finetune/iter1-clspool.yaml @@ -0,0 +1,37 @@ +model: + name_or_path: answerdotai/ModernBERT-large + +data: + paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl + consensus_path: ../data/annotations/v2-stage1/consensus.jsonl + quality_path: ../data/paragraphs/quality/quality-scores.jsonl + holdout_path: ../data/gold/v2-holdout-ids.json + max_seq_length: 512 + validation_split: 0.1 + +training: + output_dir: ../checkpoints/finetune/iter1-clspool + learning_rate: 0.00005 + num_train_epochs: 11 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 64 + gradient_accumulation_steps: 1 + warmup_ratio: 0.1 + weight_decay: 0.01 + dropout: 0.1 + bf16: true + 
gradient_checkpointing: false + logging_steps: 50 + save_total_limit: 3 + dataloader_num_workers: 4 + seed: 42 + loss_type: ce + focal_gamma: 2.0 + class_weighting: true + category_loss_weight: 1.0 + specificity_loss_weight: 1.0 + specificity_head: independent + spec_mlp_dim: 256 + pooling: cls + ordinal_consistency_weight: 0.1 + filter_spec_confidence: true diff --git a/python/configs/finetune/iter1-dapt.yaml b/python/configs/finetune/iter1-dapt.yaml new file mode 100644 index 0000000..2c013f6 --- /dev/null +++ b/python/configs/finetune/iter1-dapt.yaml @@ -0,0 +1,37 @@ +model: + name_or_path: ../checkpoints/dapt/modernbert-large/final + +data: + paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl + consensus_path: ../data/annotations/v2-stage1/consensus.jsonl + quality_path: ../data/paragraphs/quality/quality-scores.jsonl + holdout_path: ../data/gold/v2-holdout-ids.json + max_seq_length: 512 + validation_split: 0.1 + +training: + output_dir: ../checkpoints/finetune/iter1-dapt + learning_rate: 0.00005 + num_train_epochs: 11 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 64 + gradient_accumulation_steps: 1 + warmup_ratio: 0.1 + weight_decay: 0.01 + dropout: 0.1 + bf16: true + gradient_checkpointing: false + logging_steps: 50 + save_total_limit: 3 + dataloader_num_workers: 4 + seed: 42 + loss_type: ce + focal_gamma: 2.0 + class_weighting: true + category_loss_weight: 1.0 + specificity_loss_weight: 1.0 + specificity_head: independent + spec_mlp_dim: 256 + pooling: attention + ordinal_consistency_weight: 0.1 + filter_spec_confidence: true diff --git a/results/eval/iter1-clspool/figures/calibration_cat_gpt-5.4.png b/results/eval/iter1-clspool/figures/calibration_cat_gpt-5.4.png new file mode 100644 index 0000000..3db8c07 Binary files /dev/null and b/results/eval/iter1-clspool/figures/calibration_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/calibration_cat_opus-4.6.png 
b/results/eval/iter1-clspool/figures/calibration_cat_opus-4.6.png new file mode 100644 index 0000000..621a314 Binary files /dev/null and b/results/eval/iter1-clspool/figures/calibration_cat_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_cat_gpt-5.4.png b/results/eval/iter1-clspool/figures/confusion_cat_gpt-5.4.png new file mode 100644 index 0000000..01bc357 Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_cat_opus-4.6.png b/results/eval/iter1-clspool/figures/confusion_cat_opus-4.6.png new file mode 100644 index 0000000..f9a62ad Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_cat_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_spec_gpt-5.4.png b/results/eval/iter1-clspool/figures/confusion_spec_gpt-5.4.png new file mode 100644 index 0000000..ff8c9ff Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_spec_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_spec_opus-4.6.png b/results/eval/iter1-clspool/figures/confusion_spec_opus-4.6.png new file mode 100644 index 0000000..436183d Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_spec_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/model_comparison.png b/results/eval/iter1-clspool/figures/model_comparison.png new file mode 100644 index 0000000..81d7221 Binary files /dev/null and b/results/eval/iter1-clspool/figures/model_comparison.png differ diff --git a/results/eval/iter1-clspool/figures/per_class_f1_gpt-5.4.png b/results/eval/iter1-clspool/figures/per_class_f1_gpt-5.4.png new file mode 100644 index 0000000..2250dc1 Binary files /dev/null and b/results/eval/iter1-clspool/figures/per_class_f1_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/per_class_f1_opus-4.6.png 
b/results/eval/iter1-clspool/figures/per_class_f1_opus-4.6.png new file mode 100644 index 0000000..325d918 Binary files /dev/null and b/results/eval/iter1-clspool/figures/per_class_f1_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/speed_comparison.png b/results/eval/iter1-clspool/figures/speed_comparison.png new file mode 100644 index 0000000..af0c9d5 Binary files /dev/null and b/results/eval/iter1-clspool/figures/speed_comparison.png differ diff --git a/results/eval/iter1-clspool/metrics.json b/results/eval/iter1-clspool/metrics.json new file mode 100644 index 0000000..aa1e00b --- /dev/null +++ b/results/eval/iter1-clspool/metrics.json @@ -0,0 +1,298 @@ +{ + "iter1-clspool_vs_GPT-5.4": { + "cat_macro_f1": 0.9296272782528762, + "cat_weighted_f1": 0.9306824376807155, + "cat_macro_precision": 0.9289887550616817, + "cat_macro_recall": 0.9334375025997984, + "cat_mcc": 0.9179226636085169, + "cat_auc": 0.9911299127522846, + "cat_ece": 0.05557066917419438, + "cat_confusion_matrix": [ + [ + 217, + 0, + 8, + 3, + 2, + 0, + 0 + ], + [ + 0, + 83, + 0, + 2, + 2, + 1, + 0 + ], + [ + 2, + 0, + 144, + 1, + 3, + 0, + 0 + ], + [ + 1, + 0, + 2, + 132, + 1, + 0, + 0 + ], + [ + 6, + 1, + 5, + 17, + 167, + 1, + 1 + ], + [ + 0, + 2, + 1, + 8, + 2, + 208, + 0 + ], + [ + 0, + 0, + 0, + 1, + 11, + 0, + 165 + ] + ], + "cat_f1_BoardGov": 0.9517543859649122, + "cat_prec_BoardGov": 0.9601769911504425, + "cat_recall_BoardGov": 0.9434782608695652, + "cat_f1_Incident": 0.9540229885057471, + "cat_prec_Incident": 0.9651162790697675, + "cat_recall_Incident": 0.9431818181818182, + "cat_f1_Manageme": 0.9290322580645162, + "cat_prec_Manageme": 0.9, + "cat_recall_Manageme": 0.96, + "cat_f1_NoneOthe": 0.88, + "cat_prec_NoneOthe": 0.8048780487804879, + "cat_recall_NoneOthe": 0.9705882352941176, + "cat_f1_RiskMana": 0.8652849740932642, + "cat_prec_RiskMana": 0.8882978723404256, + "cat_recall_RiskMana": 0.8434343434343434, + "cat_f1_Strategy": 0.9651972157772621, + "cat_prec_Strategy": 
0.9904761904761905, + "cat_recall_Strategy": 0.9411764705882353, + "cat_f1_Third-Pa": 0.9620991253644315, + "cat_prec_Third-Pa": 0.9939759036144579, + "cat_recall_Third-Pa": 0.9322033898305084, + "cat_kripp_alpha": 0.9174669822467758, + "spec_macro_f1": 0.892010224838834, + "spec_weighted_f1": 0.9098424770121019, + "spec_macro_precision": 0.9042493173083448, + "spec_macro_recall": 0.8836163792237031, + "spec_mcc": 0.8634241541671751, + "spec_auc": 0.9777836963763646, + "spec_ece": 0.07659540871779125, + "spec_confusion_matrix": [ + [ + 587, + 11, + 17, + 3 + ], + [ + 32, + 125, + 9, + 2 + ], + [ + 14, + 4, + 187, + 2 + ], + [ + 3, + 1, + 9, + 194 + ] + ], + "spec_f1_L1Generi": 0.9362041467304625, + "spec_prec_L1Generi": 0.9229559748427673, + "spec_recall_L1Generi": 0.9498381877022654, + "spec_f1_L2Domain": 0.8090614886731392, + "spec_prec_L2Domain": 0.8865248226950354, + "spec_recall_L2Domain": 0.7440476190476191, + "spec_f1_L3Firm-S": 0.8717948717948718, + "spec_prec_L3Firm-S": 0.8423423423423423, + "spec_recall_L3Firm-S": 0.9033816425120773, + "spec_f1_L4Quanti": 0.9509803921568627, + "spec_prec_L4Quanti": 0.9651741293532339, + "spec_recall_L4Quanti": 0.9371980676328503, + "spec_qwk": 0.9224750079938221, + "spec_mae": 0.1275, + "spec_kripp_alpha": 0.9099809044589873, + "total_time_s": 6.83874113188358, + "num_samples": 1200, + "avg_ms_per_sample": 5.698950943236317, + "combined_macro_f1": 0.910818751545855 + }, + "iter1-clspool_vs_Opus-4.6": { + "cat_macro_f1": 0.9228949790380195, + "cat_weighted_f1": 0.9228190044594041, + "cat_macro_precision": 0.9183239817151002, + "cat_macro_recall": 0.9310538134995027, + "cat_mcc": 0.9101930161599978, + "cat_auc": 0.9924519781241848, + "cat_ece": 0.06223733584086104, + "cat_confusion_matrix": [ + [ + 208, + 0, + 3, + 3, + 0, + 0, + 0 + ], + [ + 0, + 76, + 0, + 1, + 2, + 0, + 0 + ], + [ + 5, + 0, + 147, + 1, + 4, + 0, + 1 + ], + [ + 0, + 0, + 0, + 139, + 2, + 0, + 0 + ], + [ + 12, + 1, + 9, + 14, + 171, + 1, + 5 + ], + [ + 1, 
+ 9, + 1, + 6, + 2, + 208, + 1 + ], + [ + 0, + 0, + 0, + 0, + 7, + 1, + 159 + ] + ], + "cat_f1_BoardGov": 0.9454545454545454, + "cat_prec_BoardGov": 0.9203539823008849, + "cat_recall_BoardGov": 0.9719626168224299, + "cat_f1_Incident": 0.9212121212121213, + "cat_prec_Incident": 0.8837209302325582, + "cat_recall_Incident": 0.9620253164556962, + "cat_f1_Manageme": 0.9245283018867925, + "cat_prec_Manageme": 0.91875, + "cat_recall_Manageme": 0.930379746835443, + "cat_f1_NoneOthe": 0.9114754098360656, + "cat_prec_NoneOthe": 0.8475609756097561, + "cat_recall_NoneOthe": 0.9858156028368794, + "cat_f1_RiskMana": 0.8528678304239401, + "cat_prec_RiskMana": 0.9095744680851063, + "cat_recall_RiskMana": 0.8028169014084507, + "cat_f1_Strategy": 0.9497716894977168, + "cat_prec_Strategy": 0.9904761904761905, + "cat_recall_Strategy": 0.9122807017543859, + "cat_f1_Third-Pa": 0.954954954954955, + "cat_prec_Third-Pa": 0.9578313253012049, + "cat_recall_Third-Pa": 0.9520958083832335, + "cat_kripp_alpha": 0.9095735484151157, + "spec_macro_f1": 0.8804386286358235, + "spec_weighted_f1": 0.8975676999782217, + "spec_macro_precision": 0.8892226854649037, + "spec_macro_recall": 0.8750457181821643, + "spec_mcc": 0.8465565454059848, + "spec_auc": 0.9697722386763277, + "spec_ece": 0.08741456707318629, + "spec_confusion_matrix": [ + [ + 575, + 19, + 10, + 1 + ], + [ + 26, + 114, + 4, + 1 + ], + [ + 35, + 8, + 204, + 13 + ], + [ + 0, + 0, + 4, + 186 + ] + ], + "spec_f1_L1Generi": 0.9266720386784851, + "spec_prec_L1Generi": 0.9040880503144654, + "spec_recall_L1Generi": 0.9504132231404959, + "spec_f1_L2Domain": 0.7972027972027972, + "spec_prec_L2Domain": 0.8085106382978723, + "spec_recall_L2Domain": 0.7862068965517242, + "spec_f1_L3Firm-S": 0.8464730290456431, + "spec_prec_L3Firm-S": 0.918918918918919, + "spec_recall_L3Firm-S": 0.7846153846153846, + "spec_f1_L4Quanti": 0.9514066496163683, + "spec_prec_L4Quanti": 0.9253731343283582, + "spec_recall_L4Quanti": 0.9789473684210527, + "spec_qwk": 
0.9187882106031572, + "spec_mae": 0.14083333333333334, + "spec_kripp_alpha": 0.9041056117796359, + "total_time_s": 6.83874113188358, + "num_samples": 1200, + "avg_ms_per_sample": 5.698950943236317, + "combined_macro_f1": 0.9016668038369215 + } +} \ No newline at end of file diff --git a/results/eval/iter1-clspool/report_gpt-54.txt b/results/eval/iter1-clspool/report_gpt-54.txt new file mode 100644 index 0000000..a9b51c6 --- /dev/null +++ b/results/eval/iter1-clspool/report_gpt-54.txt @@ -0,0 +1,54 @@ + +====================================================================== + HOLDOUT EVALUATION: iter1-clspool vs GPT-5.4 +====================================================================== + + Samples evaluated: 1200 + Total inference time: 6.84s + Avg latency: 5.70ms/sample + Throughput: 175 samples/sec + + ────────────────────────────────────────────────── + CATEGORY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.9296 ✓ (target: 0.80) + Weighted F1: 0.9307 + Macro Prec: 0.9290 + Macro Recall: 0.9334 + MCC: 0.9179 + AUC (OvR): 0.9911 + ECE: 0.0556 + Kripp Alpha: 0.9175 + + Category F1 Prec Recall + ------------------------- -------- -------- -------- + Board Governance 0.9518 0.9602 0.9435 + Incident Disclosure 0.9540 0.9651 0.9432 + Management Role 0.9290 0.9000 0.9600 + None/Other 0.8800 0.8049 0.9706 + Risk Management Process 0.8653 0.8883 0.8434 + Strategy Integration 0.9652 0.9905 0.9412 + Third-Party Risk 0.9621 0.9940 0.9322 + + ────────────────────────────────────────────────── + SPECIFICITY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.8920 ✓ (target: 0.80) + Weighted F1: 0.9098 + Macro Prec: 0.9042 + Macro Recall: 0.8836 + MCC: 0.8634 + AUC (OvR): 0.9778 + QWK: 0.9225 + MAE: 0.1275 + ECE: 0.0766 + Kripp Alpha: 0.9100 + + Level F1 Prec Recall + ------------------------- -------- -------- -------- + L1: Generic 0.9362 0.9230 0.9498 + L2: Domain 0.8091 0.8865 0.7440 + L3: Firm-Specific 
0.8718 0.8423 0.9034 + L4: Quantified 0.9510 0.9652 0.9372 + +====================================================================== diff --git a/results/eval/iter1-clspool/report_opus-46.txt b/results/eval/iter1-clspool/report_opus-46.txt new file mode 100644 index 0000000..1a89adb --- /dev/null +++ b/results/eval/iter1-clspool/report_opus-46.txt @@ -0,0 +1,54 @@ + +====================================================================== + HOLDOUT EVALUATION: iter1-clspool vs Opus-4.6 +====================================================================== + + Samples evaluated: 1200 + Total inference time: 6.84s + Avg latency: 5.70ms/sample + Throughput: 175 samples/sec + + ────────────────────────────────────────────────── + CATEGORY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.9229 ✓ (target: 0.80) + Weighted F1: 0.9228 + Macro Prec: 0.9183 + Macro Recall: 0.9311 + MCC: 0.9102 + AUC (OvR): 0.9925 + ECE: 0.0622 + Kripp Alpha: 0.9096 + + Category F1 Prec Recall + ------------------------- -------- -------- -------- + Board Governance 0.9455 0.9204 0.9720 + Incident Disclosure 0.9212 0.8837 0.9620 + Management Role 0.9245 0.9187 0.9304 + None/Other 0.9115 0.8476 0.9858 + Risk Management Process 0.8529 0.9096 0.8028 + Strategy Integration 0.9498 0.9905 0.9123 + Third-Party Risk 0.9550 0.9578 0.9521 + + ────────────────────────────────────────────────── + SPECIFICITY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.8804 ✓ (target: 0.80) + Weighted F1: 0.8976 + Macro Prec: 0.8892 + Macro Recall: 0.8750 + MCC: 0.8466 + AUC (OvR): 0.9698 + QWK: 0.9188 + MAE: 0.1408 + ECE: 0.0874 + Kripp Alpha: 0.9041 + + Level F1 Prec Recall + ------------------------- -------- -------- -------- + L1: Generic 0.9267 0.9041 0.9504 + L2: Domain 0.7972 0.8085 0.7862 + L3: Firm-Specific 0.8465 0.9189 0.7846 + L4: Quantified 0.9514 0.9254 0.9789 + +====================================================================== diff 
--git a/results/eval/iter1-dapt/figures/calibration_cat_gpt-5.4.png b/results/eval/iter1-dapt/figures/calibration_cat_gpt-5.4.png new file mode 100644 index 0000000..9092ab9 Binary files /dev/null and b/results/eval/iter1-dapt/figures/calibration_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-dapt/figures/calibration_cat_opus-4.6.png b/results/eval/iter1-dapt/figures/calibration_cat_opus-4.6.png new file mode 100644 index 0000000..263723b Binary files /dev/null and b/results/eval/iter1-dapt/figures/calibration_cat_opus-4.6.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_cat_gpt-5.4.png b/results/eval/iter1-dapt/figures/confusion_cat_gpt-5.4.png new file mode 100644 index 0000000..dbdc2bd Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_cat_opus-4.6.png b/results/eval/iter1-dapt/figures/confusion_cat_opus-4.6.png new file mode 100644 index 0000000..92b4600 Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_cat_opus-4.6.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_spec_gpt-5.4.png b/results/eval/iter1-dapt/figures/confusion_spec_gpt-5.4.png new file mode 100644 index 0000000..1ff03a8 Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_spec_gpt-5.4.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_spec_opus-4.6.png b/results/eval/iter1-dapt/figures/confusion_spec_opus-4.6.png new file mode 100644 index 0000000..4299858 Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_spec_opus-4.6.png differ diff --git a/results/eval/iter1-dapt/figures/model_comparison.png b/results/eval/iter1-dapt/figures/model_comparison.png new file mode 100644 index 0000000..7dca2d9 Binary files /dev/null and b/results/eval/iter1-dapt/figures/model_comparison.png differ diff --git a/results/eval/iter1-dapt/figures/per_class_f1_gpt-5.4.png 
b/results/eval/iter1-dapt/figures/per_class_f1_gpt-5.4.png
new file mode 100644
index 0000000..627d867
Binary files /dev/null and b/results/eval/iter1-dapt/figures/per_class_f1_gpt-5.4.png differ
diff --git a/results/eval/iter1-dapt/figures/per_class_f1_opus-4.6.png b/results/eval/iter1-dapt/figures/per_class_f1_opus-4.6.png
new file mode 100644
index 0000000..5f1c8aa
Binary files /dev/null and b/results/eval/iter1-dapt/figures/per_class_f1_opus-4.6.png differ
diff --git a/results/eval/iter1-dapt/figures/speed_comparison.png b/results/eval/iter1-dapt/figures/speed_comparison.png
new file mode 100644
index 0000000..ada2251
Binary files /dev/null and b/results/eval/iter1-dapt/figures/speed_comparison.png differ
diff --git a/results/eval/iter1-dapt/metrics.json b/results/eval/iter1-dapt/metrics.json
new file mode 100644
index 0000000..2390368
--- /dev/null
+++ b/results/eval/iter1-dapt/metrics.json
@@ -0,0 +1,298 @@
+{
+  "iter1-dapt_vs_GPT-5.4": {
+    "cat_macro_f1": 0.9350000205815902,
+    "cat_weighted_f1": 0.936034565494772,
+    "cat_macro_precision": 0.9344660111343602,
+    "cat_macro_recall": 0.9378555188267356,
+    "cat_mcc": 0.9246263785540332,
+    "cat_auc": 0.9915953686916092,
+    "cat_ece": 0.04942640244960788,
+    "cat_confusion_matrix": [
+      [
+        224,
+        0,
+        4,
+        0,
+        2,
+        0,
+        0
+      ],
+      [
+        0,
+        83,
+        0,
+        0,
+        2,
+        2,
+        1
+      ],
+      [
+        2,
+        0,
+        145,
+        1,
+        2,
+        0,
+        0
+      ],
+      [
+        0,
+        0,
+        2,
+        132,
+        1,
+        1,
+        0
+      ],
+      [
+        6,
+        1,
+        5,
+        18,
+        166,
+        1,
+        1
+      ],
+      [
+        0,
+        2,
+        1,
+        8,
+        1,
+        209,
+        0
+      ],
+      [
+        0,
+        0,
+        0,
+        0,
+        13,
+        0,
+        164
+      ]
+    ],
+    "cat_f1_BoardGov": 0.9696969696969697,
+    "cat_prec_BoardGov": 0.9655172413793104,
+    "cat_recall_BoardGov": 0.9739130434782609,
+    "cat_f1_Incident": 0.9540229885057471,
+    "cat_prec_Incident": 0.9651162790697675,
+    "cat_recall_Incident": 0.9431818181818182,
+    "cat_f1_Manageme": 0.9446254071661238,
+    "cat_prec_Manageme": 0.9235668789808917,
+    "cat_recall_Manageme": 0.9666666666666667,
+    "cat_f1_NoneOthe": 0.8949152542372881,
+    "cat_prec_NoneOthe": 0.8301886792452831,
+    "cat_recall_NoneOthe": 0.9705882352941176,
+    "cat_f1_RiskMana": 0.8623376623376623,
+    "cat_prec_RiskMana": 0.8877005347593583,
+    "cat_recall_RiskMana": 0.8383838383838383,
+    "cat_f1_Strategy": 0.9631336405529954,
+    "cat_prec_Strategy": 0.9812206572769953,
+    "cat_recall_Strategy": 0.9457013574660633,
+    "cat_f1_Third-Pa": 0.956268221574344,
+    "cat_prec_Third-Pa": 0.9879518072289156,
+    "cat_recall_Third-Pa": 0.9265536723163842,
+    "cat_kripp_alpha": 0.9243058890635424,
+    "spec_macro_f1": 0.8959443847575952,
+    "spec_weighted_f1": 0.914085249793483,
+    "spec_macro_precision": 0.9055333144570721,
+    "spec_macro_recall": 0.889132193611932,
+    "spec_mcc": 0.8698798188273218,
+    "spec_auc": 0.9806421467148638,
+    "spec_ece": 0.0693218584855397,
+    "spec_confusion_matrix": [
+      [
+        588,
+        14,
+        13,
+        3
+      ],
+      [
+        32,
+        126,
+        8,
+        2
+      ],
+      [
+        11,
+        4,
+        191,
+        1
+      ],
+      [
+        2,
+        2,
+        10,
+        193
+      ]
+    ],
+    "spec_f1_L1Generi": 0.9400479616306955,
+    "spec_prec_L1Generi": 0.9289099526066351,
+    "spec_recall_L1Generi": 0.9514563106796117,
+    "spec_f1_L2Domain": 0.802547770700637,
+    "spec_prec_L2Domain": 0.863013698630137,
+    "spec_recall_L2Domain": 0.75,
+    "spec_f1_L3Firm-S": 0.8904428904428905,
+    "spec_prec_L3Firm-S": 0.8603603603603603,
+    "spec_recall_L3Firm-S": 0.9227053140096618,
+    "spec_f1_L4Quanti": 0.9507389162561576,
+    "spec_prec_L4Quanti": 0.9698492462311558,
+    "spec_recall_L4Quanti": 0.9323671497584541,
+    "spec_qwk": 0.9315994086072762,
+    "spec_mae": 0.11666666666666667,
+    "spec_kripp_alpha": 0.9194074359344485,
+    "total_time_s": 6.855555058107711,
+    "num_samples": 1200,
+    "avg_ms_per_sample": 5.712962548423093,
+    "combined_macro_f1": 0.9154722026695927
+  },
+  "iter1-dapt_vs_Opus-4.6": {
+    "cat_macro_f1": 0.9277442873196512,
+    "cat_weighted_f1": 0.9268438855804646,
+    "cat_macro_precision": 0.9237899595225246,
+    "cat_macro_recall": 0.9349393170438051,
+    "cat_mcc": 0.9150420281652446,
+    "cat_auc": 0.9934333602136249,
+    "cat_ece": 0.057411353190739985,
+    "cat_confusion_matrix": [
+      [
+        210,
+        0,
+        2,
+        1,
+        1,
+        0,
+        0
+      ],
+      [
+        0,
+        77,
+        0,
+        0,
+        1,
+        0,
+        1
+      ],
+      [
+        8,
+        0,
+        145,
+        1,
+        3,
+        0,
+        1
+      ],
+      [
+        0,
+        0,
+        0,
+        139,
+        2,
+        0,
+        0
+      ],
+      [
+        13,
+        0,
+        9,
+        13,
+        172,
+        1,
+        5
+      ],
+      [
+        1,
+        9,
+        1,
+        4,
+        2,
+        211,
+        0
+      ],
+      [
+        0,
+        0,
+        0,
+        1,
+        6,
+        1,
+        159
+      ]
+    ],
+    "cat_f1_BoardGov": 0.9417040358744395,
+    "cat_prec_BoardGov": 0.9051724137931034,
+    "cat_recall_BoardGov": 0.9813084112149533,
+    "cat_f1_Incident": 0.9333333333333333,
+    "cat_prec_Incident": 0.8953488372093024,
+    "cat_recall_Incident": 0.9746835443037974,
+    "cat_f1_Manageme": 0.9206349206349206,
+    "cat_prec_Manageme": 0.9235668789808917,
+    "cat_recall_Manageme": 0.9177215189873418,
+    "cat_f1_NoneOthe": 0.9266666666666666,
+    "cat_prec_NoneOthe": 0.8742138364779874,
+    "cat_recall_NoneOthe": 0.9858156028368794,
+    "cat_f1_RiskMana": 0.86,
+    "cat_prec_RiskMana": 0.9197860962566845,
+    "cat_recall_RiskMana": 0.8075117370892019,
+    "cat_f1_Strategy": 0.9569160997732427,
+    "cat_prec_Strategy": 0.9906103286384976,
+    "cat_recall_Strategy": 0.9254385964912281,
+    "cat_f1_Third-Pa": 0.954954954954955,
+    "cat_prec_Third-Pa": 0.9578313253012049,
+    "cat_recall_Third-Pa": 0.9520958083832335,
+    "cat_kripp_alpha": 0.9144489824694872,
+    "spec_macro_f1": 0.8823881241075249,
+    "spec_weighted_f1": 0.8997013825586678,
+    "spec_macro_precision": 0.8895415282112857,
+    "spec_macro_recall": 0.8784196767594721,
+    "spec_mcc": 0.84923108221758,
+    "spec_auc": 0.9732413764660657,
+    "spec_ece": 0.08008741805950799,
+    "spec_confusion_matrix": [
+      [
+        573,
+        22,
+        9,
+        1
+      ],
+      [
+        26,
+        114,
+        3,
+        2
+      ],
+      [
+        34,
+        10,
+        207,
+        9
+      ],
+      [
+        0,
+        0,
+        3,
+        187
+      ]
+    ],
+    "spec_f1_L1Generi": 0.925686591276252,
+    "spec_prec_L1Generi": 0.9052132701421801,
+    "spec_recall_L1Generi": 0.947107438016529,
+    "spec_f1_L2Domain": 0.7835051546391752,
+    "spec_prec_L2Domain": 0.7808219178082192,
+    "spec_recall_L2Domain": 0.7862068965517242,
+    "spec_f1_L3Firm-S": 0.8589211618257261,
+    "spec_prec_L3Firm-S": 0.9324324324324325,
+    "spec_recall_L3Firm-S": 0.7961538461538461,
+    "spec_f1_L4Quanti": 0.961439588688946,
+    "spec_prec_L4Quanti": 0.9396984924623115,
+    "spec_recall_L4Quanti": 0.9842105263157894,
+    "spec_qwk": 0.9200429286057613,
+    "spec_mae": 0.13833333333333334,
+    "spec_kripp_alpha": 0.9047987190793844,
+    "total_time_s": 6.855555058107711,
+    "num_samples": 1200,
+    "avg_ms_per_sample": 5.712962548423093,
+    "combined_macro_f1": 0.9050662057135881
+  }
+}
\ No newline at end of file
diff --git a/results/eval/iter1-dapt/report_gpt-54.txt b/results/eval/iter1-dapt/report_gpt-54.txt
new file mode 100644
index 0000000..bdafc0b
--- /dev/null
+++ b/results/eval/iter1-dapt/report_gpt-54.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+          HOLDOUT EVALUATION: iter1-dapt vs GPT-5.4
+======================================================================
+
+  Samples evaluated:      1200
+  Total inference time:   6.86s
+  Avg latency:            5.71ms/sample
+  Throughput:             175 samples/sec
+
+  ──────────────────────────────────────────────────
+  CATEGORY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.9350  ✓ (target: 0.80)
+  Weighted F1:    0.9360
+  Macro Prec:     0.9345
+  Macro Recall:   0.9379
+  MCC:            0.9246
+  AUC (OvR):      0.9916
+  ECE:            0.0494
+  Kripp Alpha:    0.9243
+
+  Category                  F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  Board Governance          0.9697   0.9655   0.9739
+  Incident Disclosure       0.9540   0.9651   0.9432
+  Management Role           0.9446   0.9236   0.9667
+  None/Other                0.8949   0.8302   0.9706
+  Risk Management Process   0.8623   0.8877   0.8384
+  Strategy Integration      0.9631   0.9812   0.9457
+  Third-Party Risk          0.9563   0.9880   0.9266
+
+  ──────────────────────────────────────────────────
+  SPECIFICITY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.8959  ✓ (target: 0.80)
+  Weighted F1:    0.9141
+  Macro Prec:     0.9055
+  Macro Recall:   0.8891
+  MCC:            0.8699
+  AUC (OvR):      0.9806
+  QWK:            0.9316
+  MAE:            0.1167
+  ECE:            0.0693
+  Kripp Alpha:    0.9194
+
+  Level                     F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  L1: Generic               0.9400   0.9289   0.9515
+  L2: Domain                0.8025   0.8630   0.7500
+  L3: Firm-Specific         0.8904   0.8604   0.9227
+  L4: Quantified            0.9507   0.9698   0.9324
+
+======================================================================
diff --git a/results/eval/iter1-dapt/report_opus-46.txt b/results/eval/iter1-dapt/report_opus-46.txt
new file mode 100644
index 0000000..83a28cd
--- /dev/null
+++ b/results/eval/iter1-dapt/report_opus-46.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+          HOLDOUT EVALUATION: iter1-dapt vs Opus-4.6
+======================================================================
+
+  Samples evaluated:      1200
+  Total inference time:   6.86s
+  Avg latency:            5.71ms/sample
+  Throughput:             175 samples/sec
+
+  ──────────────────────────────────────────────────
+  CATEGORY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.9277  ✓ (target: 0.80)
+  Weighted F1:    0.9268
+  Macro Prec:     0.9238
+  Macro Recall:   0.9349
+  MCC:            0.9150
+  AUC (OvR):      0.9934
+  ECE:            0.0574
+  Kripp Alpha:    0.9144
+
+  Category                  F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  Board Governance          0.9417   0.9052   0.9813
+  Incident Disclosure       0.9333   0.8953   0.9747
+  Management Role           0.9206   0.9236   0.9177
+  None/Other                0.9267   0.8742   0.9858
+  Risk Management Process   0.8600   0.9198   0.8075
+  Strategy Integration      0.9569   0.9906   0.9254
+  Third-Party Risk          0.9550   0.9578   0.9521
+
+  ──────────────────────────────────────────────────
+  SPECIFICITY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.8824  ✓ (target: 0.80)
+  Weighted F1:    0.8997
+  Macro Prec:     0.8895
+  Macro Recall:   0.8784
+  MCC:            0.8492
+  AUC (OvR):      0.9732
+  QWK:            0.9200
+  MAE:            0.1383
+  ECE:            0.0801
+  Kripp Alpha:    0.9048
+
+  Level                     F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  L1: Generic               0.9257   0.9052   0.9471
+  L2: Domain                0.7835   0.7808   0.7862
+  L3: Firm-Specific         0.8589   0.9324   0.7962
+  L4: Quantified            0.9614   0.9397   0.9842
+
+======================================================================
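
Two aggregate fields in the `metrics.json` above are derived rather than measured, and the numbers confirm the relationships: `combined_macro_f1` is exactly the unweighted mean of `cat_macro_f1` and `spec_macro_f1`, and `avg_ms_per_sample` is `total_time_s` × 1000 / `num_samples`. A minimal sanity-check sketch, with the GPT-5.4 entry's values copied inline so it stands alone:

```python
# Aggregate fields from the "iter1-dapt_vs_GPT-5.4" entry of
# results/eval/iter1-dapt/metrics.json (values copied verbatim).
m = {
    "cat_macro_f1": 0.9350000205815902,
    "spec_macro_f1": 0.8959443847575952,
    "total_time_s": 6.855555058107711,
    "num_samples": 1200,
    "avg_ms_per_sample": 5.712962548423093,
    "combined_macro_f1": 0.9154722026695927,
}

# combined_macro_f1 = unweighted mean of the two head-level macro F1s
combined = (m["cat_macro_f1"] + m["spec_macro_f1"]) / 2
assert abs(combined - m["combined_macro_f1"]) < 1e-12

# avg_ms_per_sample = total wall time spread over the 1200 holdout samples
latency_ms = m["total_time_s"] * 1000 / m["num_samples"]
assert abs(latency_ms - m["avg_ms_per_sample"]) < 1e-12
```

The same identities hold for the Opus-4.6 entry, so the "combined" column used throughout §10 weights the category and specificity heads equally.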