======================================================================
HOLDOUT EVALUATION: ensemble-3seed vs Opus-4.6
======================================================================

Samples evaluated:     1200
Total inference time:  19.85s
Avg latency:           16.54ms/sample
Throughput:            60 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9288 ✓ (target: 0.80)
Weighted F1:   0.9277
Macro Prec:    0.9243
Macro Recall:  0.9368
MCC:           0.9161
AUC (OvR):     0.9948
ECE:           0.0629
Kripp Alpha:   0.9154

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9441   0.9056   0.9860
Incident Disclosure         0.9341   0.8864   0.9873
Management Role             0.9236   0.9295   0.9177
None/Other                  0.9267   0.8742   0.9858
Risk Management Process     0.8628   0.9202   0.8122
Strategy Integration        0.9522   0.9905   0.9167
Third-Party Risk            0.9578   0.9636   0.9521

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8853 ✓ (target: 0.80)
Weighted F1:   0.9024
Macro Prec:    0.8881
Macro Recall:  0.8858
MCC:           0.8535
AUC (OvR):     0.9776
QWK:           0.9248
MAE:           0.1325
ECE:           0.0845
Kripp Alpha:   0.9110

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9300   0.9165   0.9438
L2: Domain                  0.7973   0.7815   0.8138
L3: Firm-Specific           0.8571   0.9283   0.7962
L4: Quantified              0.9567   0.9261   0.9895

======================================================================