======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs GPT-5.4
======================================================================

Samples evaluated:     1200
Total inference time:  6.70s
Avg latency:           5.58ms/sample
Throughput:            179 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9361 ✓ (target: 0.80)
Weighted F1:   0.9361
Macro Prec:    0.9337
Macro Recall:  0.9414
MCC:           0.9248
AUC (OvR):     0.9913
ECE:           0.0441
Kripp Alpha:   0.9244

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9628   0.9692   0.9565
Incident Disclosure         0.9718   0.9663   0.9773
Management Role             0.9196   0.8882   0.9533
None/Other                  0.8956   0.8261   0.9779
Risk Management Process     0.8730   0.9167   0.8333
Strategy Integration        0.9583   0.9810   0.9367
Third-Party Risk            0.9713   0.9883   0.9548

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.5970 ✗ (target: 0.80)
Weighted F1:   0.7041
Macro Prec:    0.7225
Macro Recall:  0.6139
MCC:           0.6139
AUC (OvR):     0.9499
QWK:           0.8757
MAE:           0.2975
ECE:           0.1652
Kripp Alpha:   0.8479

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.8915   0.8289   0.9644
L2: Domain                  0.4071   0.7931   0.2738
L3: Firm-Specific           0.3688   0.6933   0.2512
L4: Quantified              0.7207   0.5747   0.9662
======================================================================