======================================================================
HOLDOUT EVALUATION: iter1-dapt vs GPT-5.4
======================================================================

Samples evaluated:     1200
Total inference time:  6.86s
Avg latency:           5.71ms/sample
Throughput:            175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:        0.9350 ✓ (target: 0.80)
Weighted F1:     0.9360
Macro Prec:      0.9345
Macro Recall:    0.9379
MCC:             0.9246
AUC (OvR):       0.9916
ECE:             0.0494
Kripp Alpha:     0.9243

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9697   0.9655   0.9739
Incident Disclosure         0.9540   0.9651   0.9432
Management Role             0.9446   0.9236   0.9667
None/Other                  0.8949   0.8302   0.9706
Risk Management Process     0.8623   0.8877   0.8384
Strategy Integration        0.9631   0.9812   0.9457
Third-Party Risk            0.9563   0.9880   0.9266

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:        0.8959 ✓ (target: 0.80)
Weighted F1:     0.9141
Macro Prec:      0.9055
Macro Recall:    0.8891
MCC:             0.8699
AUC (OvR):       0.9806
QWK:             0.9316
MAE:             0.1167
ECE:             0.0693
Kripp Alpha:     0.9194

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9400   0.9289   0.9515
L2: Domain                  0.8025   0.8630   0.7500
L3: Firm-Specific           0.8904   0.8604   0.9227
L4: Quantified              0.9507   0.9698   0.9324
======================================================================
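
The macro scores reported above appear to be unweighted means of the per-class rows, and each per-class F1 is the harmonic mean of its precision and recall. A minimal Python sketch of that cross-check for the category table follows. The names `category_rows`, `f1_from`, and `macro_mean` are illustrative and not part of the evaluation script, and small gaps are expected because the report rounds every value to four decimals.

```python
# Per-class (precision, recall, reported F1) copied from the category table.
category_rows = {
    "Board Governance": (0.9655, 0.9739, 0.9697),
    "Incident Disclosure": (0.9651, 0.9432, 0.9540),
    "Management Role": (0.9236, 0.9667, 0.9446),
    "None/Other": (0.8302, 0.9706, 0.8949),
    "Risk Management Process": (0.8877, 0.8384, 0.8623),
    "Strategy Integration": (0.9812, 0.9457, 0.9631),
    "Third-Party Risk": (0.9880, 0.9266, 0.9563),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_mean(values):
    """Unweighted mean, i.e. every class counts equally."""
    values = list(values)
    return sum(values) / len(values)


# Largest difference between a reported F1 and the one recomputed
# from the (already rounded) precision and recall in the same row.
max_f1_gap = max(
    abs(f1_from(p, r) - f1) for p, r, f1 in category_rows.values()
)

# Macro F1 recomputed as the plain average of the reported per-class F1s.
macro_f1 = macro_mean(f1 for _, _, f1 in category_rows.values())

print(f"max per-class F1 gap: {max_f1_gap:.5f}")
print(f"recomputed macro F1:  {macro_f1:.4f}")
```

The same check can be run on the specificity table by swapping in its four rows. Weighted F1 cannot be reproduced this way because the report does not list per-class support counts.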