======================================================================
HOLDOUT EVALUATION: best-base_weighted_ce-ep5 vs GPT-5.4
======================================================================

Samples evaluated:    1200
Total inference time: 6.70s
Avg latency:          5.58ms/sample
Throughput:           179 samples/sec
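
The latency and throughput figures above are consistent with simple division of the total inference time by the sample count. A minimal sketch of that arithmetic follows; it assumes the average is wall-clock time over all samples (the report does not state how timing was measured), and the function names are hypothetical.

```python
# Figures copied from the report; the helper names are illustrative only.
samples = 1200
total_seconds = 6.70


def avg_latency_ms(n_samples: int, seconds: float) -> float:
    """Mean time per sample in milliseconds, assuming total time / count."""
    return seconds / n_samples * 1000.0


def throughput(n_samples: int, seconds: float) -> float:
    """Samples processed per second over the whole run."""
    return n_samples / seconds


print(f"Avg latency: {avg_latency_ms(samples, total_seconds):.2f}ms/sample")
print(f"Throughput:  {throughput(samples, total_seconds):.0f} samples/sec")
```

If inference was batched, the per-sample latency here is an amortized figure rather than the time a single request would take.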

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9361 ✓ (target: 0.80)
Weighted F1:  0.9361
Macro Prec:   0.9337
Macro Recall: 0.9414
MCC:          0.9248
AUC (OvR):    0.9913
ECE:          0.0441
Kripp Alpha:  0.9244

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9628   0.9692   0.9565
Incident Disclosure         0.9718   0.9663   0.9773
Management Role             0.9196   0.8882   0.9533
None/Other                  0.8956   0.8261   0.9779
Risk Management Process     0.8730   0.9167   0.8333
Strategy Integration        0.9583   0.9810   0.9367
Third-Party Risk            0.9713   0.9883   0.9548
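
The macro scores in this section can be cross-checked against the per-category rows, since a macro average is the unweighted mean of the per-class values. The sketch below does that check using the numbers from the table; the variable and function names are hypothetical and this is not the script that produced the report.

```python
# Per-category (F1, precision, recall) copied from the table above.
category_scores = {
    "Board Governance": (0.9628, 0.9692, 0.9565),
    "Incident Disclosure": (0.9718, 0.9663, 0.9773),
    "Management Role": (0.9196, 0.8882, 0.9533),
    "None/Other": (0.8956, 0.8261, 0.9779),
    "Risk Management Process": (0.8730, 0.9167, 0.8333),
    "Strategy Integration": (0.9583, 0.9810, 0.9367),
    "Third-Party Risk": (0.9713, 0.9883, 0.9548),
}


def macro_average(rows: dict) -> tuple:
    """Unweighted mean of each metric column, rounded to 4 decimals."""
    n = len(rows)
    column_sums = [sum(column) for column in zip(*rows.values())]
    return tuple(round(total / n, 4) for total in column_sums)


macro_f1, macro_prec, macro_recall = macro_average(category_scores)
print(f"Macro F1:     {macro_f1:.4f}")
print(f"Macro Prec:   {macro_prec:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
```

The recomputed values agree with the reported Macro F1, Macro Prec, and Macro Recall to four decimals.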

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.5970 ✗ (target: 0.80)
Weighted F1:  0.7041
Macro Prec:   0.7225
Macro Recall: 0.6139
MCC:          0.6139
AUC (OvR):    0.9499
QWK:          0.8757
MAE:          0.2975
ECE:          0.1652
Kripp Alpha:  0.8479

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.8915   0.8289   0.9644
L2: Domain                  0.4071   0.7931   0.2738
L3: Firm-Specific           0.3688   0.6933   0.2512
L4: Quantified              0.7207   0.5747   0.9662

======================================================================
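
The specificity Macro F1 misses the 0.80 target mainly because of L2 and L3, where recall is low even though precision is moderate. F1 is the harmonic mean of precision and recall, so the smaller of the two dominates. The sketch below recomputes the per-level F1 from the precision and recall columns of the report; the names are hypothetical and the code is only a consistency check, not the original evaluation script.

```python
# Per-level (precision, recall) copied from the specificity table.
level_scores = {
    "L1: Generic": (0.8289, 0.9644),
    "L2: Domain": (0.7931, 0.2738),
    "L3: Firm-Specific": (0.6933, 0.2512),
    "L4: Quantified": (0.5747, 0.9662),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


level_f1 = {name: f1_score(p, r) for name, (p, r) in level_scores.items()}
macro_f1 = sum(level_f1.values()) / len(level_f1)

for name, value in level_f1.items():
    print(f"{name:<25} {value:>8.4f}")
print(f"{'Macro F1':<25} {macro_f1:>8.4f}")
```

The recomputed values match the F1 column and the reported Macro F1 to four decimals, which suggests that raising recall on L2 and L3 is where most of the remaining headroom lies.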