======================================================================
HOLDOUT EVALUATION: iter1-independent vs Opus-4.6
======================================================================

Samples evaluated:    1200
Total inference time: 6.73s
Avg latency:          5.61ms/sample
Throughput:           178 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9227 ✓ (target: 0.80)
Weighted F1:   0.9216
Macro Prec:    0.9178
Macro Recall:  0.9316
MCC:           0.9093
AUC (OvR):     0.9940
ECE:           0.0655
Kripp Alpha:   0.9086

Category                  F1       Prec     Recall
------------------------- -------- -------- --------
Board Governance          0.9441   0.9056   0.9860
Incident Disclosure       0.9286   0.8764   0.9873
Management Role           0.9172   0.9231   0.9114
None/Other                0.9200   0.8679   0.9787
Risk Management Process   0.8492   0.9135   0.7934
Strategy Integration      0.9476   0.9858   0.9123
Third-Party Risk          0.9521   0.9521   0.9521

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8834 ✓ (target: 0.80)
Weighted F1:   0.9004
Macro Prec:    0.8859
Macro Recall:  0.8855
MCC:           0.8501
AUC (OvR):     0.9737
QWK:           0.9227
MAE:           0.1358
ECE:           0.0825
Kripp Alpha:   0.9065

Level                     F1       Prec     Recall
------------------------- -------- -------- --------
L1: Generic               0.9242   0.9116   0.9372
L2: Domain                0.7789   0.7468   0.8138
L3: Firm-Specific         0.8661   0.9495   0.7962
L4: Quantified            0.9643   0.9356   0.9947

======================================================================