======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs Opus-4.6
======================================================================

Samples evaluated:     1200
Total inference time:  6.70s
Avg latency:           5.58ms/sample
Throughput:            179 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9280 ✓ (target: 0.80)
Weighted F1:   0.9274
Macro Prec:    0.9223
Macro Recall:  0.9382
MCC:           0.9163
AUC (OvR):     0.9924
ECE:           0.0469
Kripp Alpha:   0.9155

Category                  F1       Prec     Recall
------------------------- -------- -------- --------
Board Governance          0.9478   0.9207   0.9766
Incident Disclosure       0.9286   0.8764   0.9873
Management Role           0.9216   0.9130   0.9304
None/Other                0.9205   0.8634   0.9858
Risk Management Process   0.8550   0.9333   0.7887
Strategy Integration      0.9522   0.9905   0.9167
Third-Party Risk          0.9704   0.9591   0.9820

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.5958 ✗ (target: 0.80)
Weighted F1:   0.6930
Macro Prec:    0.7319
Macro Recall:  0.6250
MCC:           0.6143
AUC (OvR):     0.9471
QWK:           0.8721
MAE:           0.3075
ECE:           0.1819
Kripp Alpha:   0.8503

Level                     F1       Prec     Recall
------------------------- -------- -------- --------
L1: Generic               0.8943   0.8234   0.9785
L2: Domain                0.4138   0.7241   0.2897
L3: Firm-Specific         0.3761   0.8400   0.2423
L4: Quantified            0.6989   0.5402   0.9895
======================================================================