======================================================================
HOLDOUT EVALUATION: ensemble-3seed vs GPT-5.4
======================================================================

Samples evaluated:    1200
Total inference time: 19.85s
Avg latency:          16.54ms/sample
Throughput:           60 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9383 ✓ (target: 0.80)
Weighted F1:  0.9386
Macro Prec:   0.9370
Macro Recall: 0.9418
MCC:          0.9276
AUC (OvR):    0.9931
ECE:          0.0509
Kripp Alpha:  0.9273

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9719   0.9657   0.9783
Incident Disclosure         0.9659   0.9659   0.9659
Management Role             0.9477   0.9295   0.9667
None/Other                  0.8949   0.8302   0.9706
Risk Management Process     0.8653   0.8883   0.8434
Strategy Integration        0.9630   0.9858   0.9412
Third-Party Risk            0.9591   0.9939   0.9266

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9022 ✓ (target: 0.80)
Weighted F1:  0.9178
Macro Prec:   0.9070
Macro Recall: 0.8991
MCC:          0.8754
AUC (OvR):    0.9826
QWK:          0.9339
MAE:          0.1125
ECE:          0.0692
Kripp Alpha:  0.9206

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9396   0.9358   0.9434
L2: Domain                  0.8150   0.8609   0.7738
L3: Firm-Specific           0.8930   0.8610   0.9275
L4: Quantified              0.9610   0.9704   0.9517

======================================================================