======================================================================
HOLDOUT EVALUATION: iter1-clspool vs GPT-5.4
======================================================================

Samples evaluated:     1200
Total inference time:  6.84s
Avg latency:           5.70ms/sample
Throughput:            175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9296 ✓ (target: 0.80)
Weighted F1:   0.9307
Macro Prec:    0.9290
Macro Recall:  0.9334
MCC:           0.9179
AUC (OvR):     0.9911
ECE:           0.0556
Kripp Alpha:   0.9175

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9518   0.9602   0.9435
Incident Disclosure         0.9540   0.9651   0.9432
Management Role             0.9290   0.9000   0.9600
None/Other                  0.8800   0.8049   0.9706
Risk Management Process     0.8653   0.8883   0.8434
Strategy Integration        0.9652   0.9905   0.9412
Third-Party Risk            0.9621   0.9940   0.9322

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8920 ✓ (target: 0.80)
Weighted F1:   0.9098
Macro Prec:    0.9042
Macro Recall:  0.8836
MCC:           0.8634
AUC (OvR):     0.9778
QWK:           0.9225
MAE:           0.1275
ECE:           0.0766
Kripp Alpha:   0.9100

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9362   0.9230   0.9498
L2: Domain                  0.8091   0.8865   0.7440
L3: Firm-Specific           0.8718   0.8423   0.9034
L4: Quantified              0.9510   0.9652   0.9372

======================================================================