======================================================================
  HOLDOUT EVALUATION: iter1-independent vs GPT-5.4
======================================================================

  Samples evaluated:     1200
  Total inference time:  6.73s
  Avg latency:           5.61ms/sample
  Throughput:            178 samples/sec

  ──────────────────────────────────────────────────
  CATEGORY CLASSIFICATION
  ──────────────────────────────────────────────────
  Macro F1:      0.9337  ✓  (target: 0.80)
  Weighted F1:   0.9343
  Macro Prec:    0.9319
  Macro Recall:  0.9378
  MCC:           0.9227
  AUC (OvR):     0.9920
  ECE:           0.0538
  Kripp Alpha:   0.9224

  Category                        F1     Prec   Recall
  ------------------------- -------- -------- --------
  Board Governance            0.9719   0.9657   0.9783
  Incident Disclosure         0.9605   0.9551   0.9659
  Management Role             0.9412   0.9231   0.9600
  None/Other                  0.8881   0.8239   0.9632
  Risk Management Process     0.8564   0.8865   0.8283
  Strategy Integration        0.9583   0.9810   0.9367
  Third-Party Risk            0.9593   0.9880   0.9322

  ──────────────────────────────────────────────────
  SPECIFICITY CLASSIFICATION
  ──────────────────────────────────────────────────
  Macro F1:      0.8952  ✓  (target: 0.80)
  Weighted F1:   0.9122
  Macro Prec:    0.8980
  Macro Recall:  0.8931
  MCC:           0.8664
  AUC (OvR):     0.9817
  QWK:           0.9324
  MAE:           0.1175
  ECE:           0.0714
  Kripp Alpha:   0.9177

  Level                           F1     Prec   Recall
  ------------------------- -------- -------- --------
  L1: Generic                 0.9355   0.9325   0.9385
  L2: Domain                  0.7975   0.8228   0.7738
  L3: Firm-Specific           0.8941   0.8716   0.9179
  L4: Quantified              0.9535   0.9653   0.9420

======================================================================
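
The summary rows can be cross-checked against the per-class rows. The following is a minimal sketch, not the script that produced this report: it assumes each macro metric is the unweighted mean of the per-class column, that per-class F1 is the harmonic mean of precision and recall, and that throughput is samples divided by total inference time. The names `category`, `specificity`, `macro`, and `f1` are hypothetical, and the numbers are copied from the report at four-decimal precision, so comparisons should allow for rounding.

```python
# Per-class rows copied from the report as (F1, precision, recall).
category = {
    "Board Governance":        (0.9719, 0.9657, 0.9783),
    "Incident Disclosure":     (0.9605, 0.9551, 0.9659),
    "Management Role":         (0.9412, 0.9231, 0.9600),
    "None/Other":              (0.8881, 0.8239, 0.9632),
    "Risk Management Process": (0.8564, 0.8865, 0.8283),
    "Strategy Integration":    (0.9583, 0.9810, 0.9367),
    "Third-Party Risk":        (0.9593, 0.9880, 0.9322),
}
specificity = {
    "L1: Generic":       (0.9355, 0.9325, 0.9385),
    "L2: Domain":        (0.7975, 0.8228, 0.7738),
    "L3: Firm-Specific": (0.8941, 0.8716, 0.9179),
    "L4: Quantified":    (0.9535, 0.9653, 0.9420),
}


def macro(rows, column):
    """Unweighted mean of one column (0 = F1, 1 = precision, 2 = recall)."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for title, rows in (("CATEGORY", category), ("SPECIFICITY", specificity)):
        print(title)
        print(f"  macro F1:     {macro(rows, 0):.4f}")
        print(f"  macro prec:   {macro(rows, 1):.4f}")
        print(f"  macro recall: {macro(rows, 2):.4f}")
        for name, (reported_f1, p, r) in rows.items():
            # Recomputed F1 may differ in the last digit because the
            # inputs are already rounded to four decimals.
            print(f"  {name:<25} reported={reported_f1:.4f} recomputed={f1(p, r):.4f}")

    samples, total_seconds = 1200, 6.73
    print(f"latency:    {1000 * total_seconds / samples:.2f} ms/sample")
    print(f"throughput: {samples / total_seconds:.0f} samples/sec")
```

Under these assumptions the recomputed macro values agree with the reported summary rows to within rounding, which suggests the per-class tables and the summary block come from the same predictions.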