======================================================================
HOLDOUT EVALUATION: dictionary-baseline vs Opus-4.6
======================================================================

Samples evaluated:     1200
Total inference time:  0.00s
Avg latency:           0.00ms/sample
Throughput:            1000000 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.5405 ✗ (target: 0.80)
Weighted F1:   0.5681
Macro Prec:    0.5642
Macro Recall:  0.5503
MCC:           0.4981
AUC (OvR):     0.7392
ECE:           0.4300
Kripp Alpha:   0.4905

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.7642   0.7714   0.7570
Incident Disclosure         0.3957   0.3426   0.4684
Management Role             0.6137   0.7143   0.5380
None/Other                  0.2673   0.3816   0.2057
Risk Management Process     0.4163   0.3834   0.4554
Strategy Integration        0.6909   0.8471   0.5833
Third-Party Risk            0.6351   0.5090   0.8443

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.6345 ✗ (target: 0.80)
Weighted F1:   0.6902
Macro Prec:    0.7051
Macro Recall:  0.6129
MCC:           0.5373
AUC (OvR):     0.7435
QWK:           0.5875
MAE:           0.5258
ECE:           0.2967
Kripp Alpha:   0.5620

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.7918   0.7094   0.8959
L2: Domain                  0.4883   0.4740   0.5034
L3: Firm-Specific           0.5625   0.8710   0.4154
L4: Quantified              0.6954   0.7658   0.6368

======================================================================