======================================================================
HOLDOUT EVALUATION: dictionary-baseline vs Opus-4.6
======================================================================

Samples evaluated:    1200
Total inference time: 0.00s
Avg latency:          0.00ms/sample
Throughput:           1000000 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.5405 ✗ (target: 0.80)
Weighted F1:  0.5681
Macro Prec:   0.5642
Macro Recall: 0.5503
MCC:          0.4981
AUC (OvR):    0.7392
ECE:          0.4300
Kripp Alpha:  0.4905

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.7642   0.7714   0.7570
Incident Disclosure         0.3957   0.3426   0.4684
Management Role             0.6137   0.7143   0.5380
None/Other                  0.2673   0.3816   0.2057
Risk Management Process     0.4163   0.3834   0.4554
Strategy Integration        0.6909   0.8471   0.5833
Third-Party Risk            0.6351   0.5090   0.8443

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.6345 ✗ (target: 0.80)
Weighted F1:  0.6902
Macro Prec:   0.7051
Macro Recall: 0.6129
MCC:          0.5373
AUC (OvR):    0.7435
QWK:          0.5875
MAE:          0.5258
ECE:          0.2967
Kripp Alpha:  0.5620

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.7918   0.7094   0.8959
L2: Domain                  0.4883   0.4740   0.5034
L3: Firm-Specific           0.5625   0.8710   0.4154
L4: Quantified              0.6954   0.7658   0.6368
======================================================================
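
The macro figures in the report above can be cross-checked against the per-class rows. The sketch below assumes the macro metrics are plain unweighted means over classes (the usual convention, but the evaluation script itself is not shown here). The names `CATEGORY`, `SPECIFICITY`, and `macro_average` are hypothetical and chosen for illustration only.

```python
# Per-class (F1, precision, recall) rows copied from the report.
CATEGORY = {
    "Board Governance":        (0.7642, 0.7714, 0.7570),
    "Incident Disclosure":     (0.3957, 0.3426, 0.4684),
    "Management Role":         (0.6137, 0.7143, 0.5380),
    "None/Other":              (0.2673, 0.3816, 0.2057),
    "Risk Management Process": (0.4163, 0.3834, 0.4554),
    "Strategy Integration":    (0.6909, 0.8471, 0.5833),
    "Third-Party Risk":        (0.6351, 0.5090, 0.8443),
}
SPECIFICITY = {
    "L1: Generic":       (0.7918, 0.7094, 0.8959),
    "L2: Domain":        (0.4883, 0.4740, 0.5034),
    "L3: Firm-Specific": (0.5625, 0.8710, 0.4154),
    "L4: Quantified":    (0.6954, 0.7658, 0.6368),
}


def macro_average(rows):
    """Return the unweighted mean of (F1, precision, recall) across classes.

    Assumption: this mirrors how the report derives its macro metrics.
    """
    n = len(rows)
    columns = zip(*rows.values())  # transpose rows into three columns
    return tuple(sum(col) / n for col in columns)


cat_f1, cat_prec, cat_rec = macro_average(CATEGORY)
spec_f1, spec_prec, spec_rec = macro_average(SPECIFICITY)

print(f"Category    macro F1/Prec/Recall: {cat_f1:.4f} {cat_prec:.4f} {cat_rec:.4f}")
print(f"Specificity macro F1/Prec/Recall: {spec_f1:.4f} {spec_prec:.4f} {spec_rec:.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed means can differ from the reported macro values in the last digit. Agreement within about 1e-4 is the most this check can show.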