======================================================================
HOLDOUT EVALUATION: dictionary-baseline vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 0.00s
Avg latency: 0.00ms/sample
Throughput: 1000000 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.5405 ✗ (target: 0.80)
Weighted F1: 0.5681
Macro Prec: 0.5642
Macro Recall: 0.5503
MCC: 0.4981
AUC (OvR): 0.7392
ECE: 0.4300
Kripp Alpha: 0.4905
Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.7642   0.7714   0.7570
Incident Disclosure         0.3957   0.3426   0.4684
Management Role             0.6137   0.7143   0.5380
None/Other                  0.2673   0.3816   0.2057
Risk Management Process     0.4163   0.3834   0.4554
Strategy Integration        0.6909   0.8471   0.5833
Third-Party Risk            0.6351   0.5090   0.8443
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.6345 ✗ (target: 0.80)
Weighted F1: 0.6902
Macro Prec: 0.7051
Macro Recall: 0.6129
MCC: 0.5373
AUC (OvR): 0.7435
QWK: 0.5875
MAE: 0.5258
ECE: 0.2967
Kripp Alpha: 0.5620
Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.7918   0.7094   0.8959
L2: Domain                  0.4883   0.4740   0.5034
L3: Firm-Specific           0.5625   0.8710   0.4154
L4: Quantified              0.6954   0.7658   0.6368
======================================================================
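
Note: the Macro F1 figures in this report appear to be the unweighted mean of the per-class F1 scores in each table. The sketch below recomputes both values from the printed rows as a sanity check. It assumes the standard macro-average definition; the names macro_average and TARGET are illustrative and are not taken from the evaluation script.

```python
# Per-class F1 scores copied from the report tables.
category_f1 = {
    "Board Governance": 0.7642,
    "Incident Disclosure": 0.3957,
    "Management Role": 0.6137,
    "None/Other": 0.2673,
    "Risk Management Process": 0.4163,
    "Strategy Integration": 0.6909,
    "Third-Party Risk": 0.6351,
}
specificity_f1 = {
    "L1: Generic": 0.7918,
    "L2: Domain": 0.4883,
    "L3: Firm-Specific": 0.5625,
    "L4: Quantified": 0.6954,
}

TARGET = 0.80  # target shown next to each Macro F1 line in the report


def macro_average(per_class):
    """Unweighted mean of per-class scores (assumed macro-average definition)."""
    return sum(per_class.values()) / len(per_class)


category_macro_f1 = macro_average(category_f1)
specificity_macro_f1 = macro_average(specificity_f1)

for name, score in (
    ("Category", category_macro_f1),
    ("Specificity", specificity_macro_f1),
):
    status = "pass" if score >= TARGET else "fail"
    print(f"{name} Macro F1: {score:.4f} ({status}, target {TARGET:.2f})")
```

Both recomputed values agree with the reported Macro F1 lines to four decimal places, which suggests the per-class rows and the summary lines come from the same run.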