======================================================================
  HOLDOUT EVALUATION: iter1-dapt vs Opus-4.6
======================================================================

  Samples evaluated:     1200
  Total inference time:  6.86s
  Avg latency:           5.71ms/sample
  Throughput:            175 samples/sec

  ──────────────────────────────────────────────────
  CATEGORY CLASSIFICATION
  ──────────────────────────────────────────────────
  Macro F1:      0.9277  ✓  (target: 0.80)
  Weighted F1:   0.9268
  Macro Prec:    0.9238
  Macro Recall:  0.9349
  MCC:           0.9150
  AUC (OvR):     0.9934
  ECE:           0.0574
  Kripp Alpha:   0.9144

  Category                        F1     Prec   Recall
  ------------------------- -------- -------- --------
  Board Governance            0.9417   0.9052   0.9813
  Incident Disclosure         0.9333   0.8953   0.9747
  Management Role             0.9206   0.9236   0.9177
  None/Other                  0.9267   0.8742   0.9858
  Risk Management Process     0.8600   0.9198   0.8075
  Strategy Integration        0.9569   0.9906   0.9254
  Third-Party Risk            0.9550   0.9578   0.9521

  ──────────────────────────────────────────────────
  SPECIFICITY CLASSIFICATION
  ──────────────────────────────────────────────────
  Macro F1:      0.8824  ✓  (target: 0.80)
  Weighted F1:   0.8997
  Macro Prec:    0.8895
  Macro Recall:  0.8784
  MCC:           0.8492
  AUC (OvR):     0.9732
  QWK:           0.9200
  MAE:           0.1383
  ECE:           0.0801
  Kripp Alpha:   0.9048

  Level                           F1     Prec   Recall
  ------------------------- -------- -------- --------
  L1: Generic                 0.9257   0.9052   0.9471
  L2: Domain                  0.7835   0.7808   0.7862
  L3: Firm-Specific           0.8589   0.9324   0.7962
  L4: Quantified              0.9614   0.9397   0.9842

======================================================================
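
The macro scores in each section can be cross-checked against the per-class rows, since a macro average is the unweighted mean over classes. The sketch below is not part of the evaluation script; the dictionary names and the `macro_average` helper are hypothetical, and the inputs are the already-rounded values printed in the report, so agreement is only expected to four decimals.

```python
# Per-class rows copied from the report: (F1, precision, recall).
category = {
    "Board Governance":        (0.9417, 0.9052, 0.9813),
    "Incident Disclosure":     (0.9333, 0.8953, 0.9747),
    "Management Role":         (0.9206, 0.9236, 0.9177),
    "None/Other":              (0.9267, 0.8742, 0.9858),
    "Risk Management Process": (0.8600, 0.9198, 0.8075),
    "Strategy Integration":    (0.9569, 0.9906, 0.9254),
    "Third-Party Risk":        (0.9550, 0.9578, 0.9521),
}
specificity = {
    "L1: Generic":       (0.9257, 0.9052, 0.9471),
    "L2: Domain":        (0.7835, 0.7808, 0.7862),
    "L3: Firm-Specific": (0.8589, 0.9324, 0.7962),
    "L4: Quantified":    (0.9614, 0.9397, 0.9842),
}


def macro_average(rows):
    """Unweighted mean of each column (F1, precision, recall) across classes."""
    n = len(rows)
    return tuple(
        round(sum(values[i] for values in rows.values()) / n, 4)
        for i in range(3)
    )


category_macro = macro_average(category)        # compare with Macro F1 / Prec / Recall
specificity_macro = macro_average(specificity)  # same check for the specificity task

print("category   :", category_macro)
print("specificity:", specificity_macro)
```

The weighted F1 cannot be reproduced the same way, because the per-class support counts are not shown in the report.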