======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs Opus-4.6
======================================================================

Samples evaluated:    1200
Total inference time: 6.70s
Avg latency:          5.58ms/sample
Throughput:           179 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9280 ✓ (target: 0.80)
Weighted F1:  0.9274
Macro Prec:   0.9223
Macro Recall: 0.9382
MCC:          0.9163
AUC (OvR):    0.9924
ECE:          0.0469
Kripp Alpha:  0.9155

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9478   0.9207   0.9766
Incident Disclosure         0.9286   0.8764   0.9873
Management Role             0.9216   0.9130   0.9304
None/Other                  0.9205   0.8634   0.9858
Risk Management Process     0.8550   0.9333   0.7887
Strategy Integration        0.9522   0.9905   0.9167
Third-Party Risk            0.9704   0.9591   0.9820

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.5958 ✗ (target: 0.80)
Weighted F1:  0.6930
Macro Prec:   0.7319
Macro Recall: 0.6250
MCC:          0.6143
AUC (OvR):    0.9471
QWK:          0.8721
MAE:          0.3075
ECE:          0.1819
Kripp Alpha:  0.8503

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.8943   0.8234   0.9785
L2: Domain                  0.4138   0.7241   0.2897
L3: Firm-Specific           0.3761   0.8400   0.2423
L4: Quantified              0.6989   0.5402   0.9895

======================================================================
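
Note: the macro scores reported above can be cross-checked against the per-class rows. The following is a minimal sketch, not the evaluation script that produced this report. It assumes that "Macro F1" is the unweighted mean of the per-class F1 values (the usual definition) and that each F1 is the harmonic mean of the row's precision and recall. The names macro_f1, f1_from_pr, and TARGET are hypothetical helpers introduced here for illustration.

```python
# Per-class F1 values copied from the two tables in the report.
category_f1 = {
    "Board Governance": 0.9478,
    "Incident Disclosure": 0.9286,
    "Management Role": 0.9216,
    "None/Other": 0.9205,
    "Risk Management Process": 0.8550,
    "Strategy Integration": 0.9522,
    "Third-Party Risk": 0.9704,
}
specificity_f1 = {
    "L1: Generic": 0.8943,
    "L2: Domain": 0.4138,
    "L3: Firm-Specific": 0.3761,
    "L4: Quantified": 0.6989,
}

TARGET = 0.80  # target value printed next to each Macro F1 line in the report


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 (assumed definition of 'Macro F1')."""
    return sum(per_class_f1.values()) / len(per_class_f1)


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for name, scores in [("Category", category_f1), ("Specificity", specificity_f1)]:
        value = macro_f1(scores)
        status = "meets" if value >= TARGET else "misses"
        print(f"{name} macro F1 = {value:.4f} ({status} target {TARGET:.2f})")

    # Example row check: L2: Domain, precision 0.7241, recall 0.2897.
    print(f"L2 F1 from precision/recall = {f1_from_pr(0.7241, 0.2897):.4f}")
```

Under these assumptions the recomputed macro scores round to the reported 0.9280 and 0.5958. The specificity shortfall is concentrated in the L2 and L3 rows, where recall (0.2897 and 0.2423) is far below precision.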