2026-04-05 15:37:50 -04:00

======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.70s
Avg latency: 5.58ms/sample
Throughput: 179 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9280 ✓ (target: 0.80)
Weighted F1: 0.9274
Macro Prec: 0.9223
Macro Recall: 0.9382
MCC: 0.9163
AUC (OvR): 0.9924
ECE: 0.0469
Kripp Alpha: 0.9155
Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9478   0.9207   0.9766
Incident Disclosure         0.9286   0.8764   0.9873
Management Role             0.9216   0.9130   0.9304
None/Other                  0.9205   0.8634   0.9858
Risk Management Process     0.8550   0.9333   0.7887
Strategy Integration        0.9522   0.9905   0.9167
Third-Party Risk            0.9704   0.9591   0.9820
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.5958 ✗ (target: 0.80)
Weighted F1: 0.6930
Macro Prec: 0.7319
Macro Recall: 0.6250
MCC: 0.6143
AUC (OvR): 0.9471
QWK: 0.8721
MAE: 0.3075
ECE: 0.1819
Kripp Alpha: 0.8503
Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.8943   0.8234   0.9785
L2: Domain                  0.4138   0.7241   0.2897
L3: Firm-Specific           0.3761   0.8400   0.2423
L4: Quantified              0.6989   0.5402   0.9895
======================================================================
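
The per-class rows above are enough to re-derive the headline Macro F1 figures. The sketch below is not the script that produced this report; it is a minimal check, assuming F1 is the harmonic mean of precision and recall and Macro F1 is the unweighted mean of per-class F1. The names f1, macro_f1, CATEGORY, SPECIFICITY and TARGET are hypothetical, and the inputs are the rounded values printed above, so results agree with the report only to about four decimal places.

```python
# Re-derive Macro F1 from the per-class precision/recall printed in the report.
# Assumption: F1 = 2PR / (P + R), Macro F1 = unweighted mean of per-class F1.

TARGET = 0.80  # target stated in the report

# (precision, recall) per class, copied from the report tables
CATEGORY = {
    "Board Governance": (0.9207, 0.9766),
    "Incident Disclosure": (0.8764, 0.9873),
    "Management Role": (0.9130, 0.9304),
    "None/Other": (0.8634, 0.9858),
    "Risk Management Process": (0.9333, 0.7887),
    "Strategy Integration": (0.9905, 0.9167),
    "Third-Party Risk": (0.9591, 0.9820),
}
SPECIFICITY = {
    "L1: Generic": (0.8234, 0.9785),
    "L2: Domain": (0.7241, 0.2897),
    "L3: Firm-Specific": (0.8400, 0.2423),
    "L4: Quantified": (0.5402, 0.9895),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def macro_f1(rows: dict) -> float:
    """Unweighted mean of per-class F1."""
    return sum(f1(p, r) for p, r in rows.values()) / len(rows)


if __name__ == "__main__":
    for task, rows in (("category", CATEGORY), ("specificity", SPECIFICITY)):
        score = macro_f1(rows)
        status = "meets" if score >= TARGET else "misses"
        print(f"{task}: macro F1 = {score:.4f} ({status} target {TARGET:.2f})")
```

On the specificity task the low F1 for L2 and L3 comes from recall (0.2897 and 0.2423), not precision. Because every class counts equally in the macro average, those two levels pull the score below the target even though L1 scores well.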