======================================================================
HOLDOUT EVALUATION: iter1-clspool vs Opus-4.6
======================================================================

Samples evaluated:    1200
Total inference time: 6.84s
Avg latency:          5.70ms/sample
Throughput:           175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9229 ✓ (target: 0.80)
Weighted F1:  0.9228
Macro Prec:   0.9183
Macro Recall: 0.9311
MCC:          0.9102
AUC (OvR):    0.9925
ECE:          0.0622
Kripp Alpha:  0.9096

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9455   0.9204   0.9720
Incident Disclosure         0.9212   0.8837   0.9620
Management Role             0.9245   0.9187   0.9304
None/Other                  0.9115   0.8476   0.9858
Risk Management Process     0.8529   0.9096   0.8028
Strategy Integration        0.9498   0.9905   0.9123
Third-Party Risk            0.9550   0.9578   0.9521

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.8804 ✓ (target: 0.80)
Weighted F1:  0.8976
Macro Prec:   0.8892
Macro Recall: 0.8750
MCC:          0.8466
AUC (OvR):    0.9698
QWK:          0.9188
MAE:          0.1408
ECE:          0.0874
Kripp Alpha:  0.9041

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9267   0.9041   0.9504
L2: Domain                  0.7972   0.8085   0.7862
L3: Firm-Specific           0.8465   0.9189   0.7846
L4: Quantified              0.9514   0.9254   0.9789
======================================================================
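
The macro-averaged and timing figures in this report can be cross-checked from its own per-class rows. The sketch below is not the evaluation script that produced the report. It is a minimal consistency check that assumes "Macro" means the unweighted mean of the per-class scores and that latency and throughput are derived from the sample count and total inference time. All constant and function names are hypothetical, and the values are transcribed from the report, so agreement is only expected up to rounding in the fourth decimal place.

```python
# Per-class F1 values transcribed from the report (already rounded to 4 places).
CATEGORY_F1 = [0.9455, 0.9212, 0.9245, 0.9115, 0.8529, 0.9498, 0.9550]
SPECIFICITY_F1 = [0.9267, 0.7972, 0.8465, 0.9514]

# Timing figures transcribed from the report header.
SAMPLES = 1200
TOTAL_SECONDS = 6.84


def macro_average(per_class_scores):
    """Unweighted mean of per-class scores (assumed definition of 'Macro')."""
    return sum(per_class_scores) / len(per_class_scores)


def agrees(computed, reported, tol=5e-4):
    """True if a recomputed value matches the reported one up to input rounding."""
    return abs(computed - reported) <= tol


def latency_ms(samples, total_seconds):
    """Average per-sample latency in milliseconds."""
    return total_seconds / samples * 1000.0


def throughput(samples, total_seconds):
    """Samples processed per second."""
    return samples / total_seconds


if __name__ == "__main__":
    print("category macro F1 consistent:   ",
          agrees(macro_average(CATEGORY_F1), 0.9229))
    print("specificity macro F1 consistent:",
          agrees(macro_average(SPECIFICITY_F1), 0.8804))
    print("avg latency (ms/sample):", round(latency_ms(SAMPLES, TOTAL_SECONDS), 2))
    print("throughput (samples/sec):", int(throughput(SAMPLES, TOTAL_SECONDS)))
```

The same `macro_average` check applies to the Prec and Recall columns. Weighted F1 cannot be recomputed this way because the report does not list per-class support.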