======================================================================
HOLDOUT EVALUATION: dictionary-baseline vs GPT-5.4
======================================================================

Samples evaluated:     1200
Total inference time:  0.00s
Avg latency:           0.00ms/sample
Throughput:            1000000 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.5563  ✗ (target: 0.80)
Weighted F1:   0.5867
Macro Prec:    0.5821
Macro Recall:  0.5593
MCC:           0.5160
AUC (OvR):     0.7450
ECE:           0.4142
Kripp Alpha:   0.5092

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.8045   0.8429   0.7696
Incident Disclosure         0.4184   0.3796   0.4659
Management Role             0.6171   0.6975   0.5533
None/Other                  0.3113   0.4342   0.2426
Risk Management Process     0.4169   0.3715   0.4747
Strategy Integration        0.6825   0.8217   0.5837
Third-Party Risk            0.6432   0.5271   0.8249

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.6555  ✗ (target: 0.80)
Weighted F1:   0.7095
Macro Prec:    0.7204
Macro Recall:  0.6226
MCC:           0.5555
AUC (OvR):     0.7507
QWK:           0.5757
MAE:           0.5158
ECE:           0.2800
Kripp Alpha:   0.5594

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.8017   0.7251   0.8964
L2: Domain                  0.5342   0.5584   0.5119
L3: Firm-Specific           0.6284   0.8387   0.5024
L4: Quantified              0.6575   0.7595   0.5797

======================================================================