======================================================================
HOLDOUT EVALUATION: iter1-independent vs GPT-5.4
======================================================================

Samples evaluated:    1200
Total inference time: 6.73s
Avg latency:          5.61ms/sample
Throughput:           178 samples/sec

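The latency and throughput figures follow directly from the sample count and total inference time. A minimal sketch of that arithmetic, assuming the average latency is simply total time divided by sample count (the report does not say whether latency was measured per request or amortized over batches):

```python
# Figures copied from the timing block of the report.
samples = 1200
total_seconds = 6.73

# Assumption: latency is amortized over all samples, not timed per request.
avg_latency_ms = total_seconds / samples * 1000.0
throughput = samples / total_seconds

print(f"Avg latency: {avg_latency_ms:.2f}ms/sample")
print(f"Throughput:  {throughput:.0f} samples/sec")
```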
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9337 ✓ (target: 0.80)
Weighted F1:   0.9343
Macro Prec:    0.9319
Macro Recall:  0.9378
MCC:           0.9227
AUC (OvR):     0.9920
ECE:           0.0538
Kripp Alpha:   0.9224

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9719   0.9657   0.9783
Incident Disclosure         0.9605   0.9551   0.9659
Management Role             0.9412   0.9231   0.9600
None/Other                  0.8881   0.8239   0.9632
Risk Management Process     0.8564   0.8865   0.8283
Strategy Integration        0.9583   0.9810   0.9367
Third-Party Risk            0.9593   0.9880   0.9322

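The per-category rows and the macro figures for category classification can be cross-checked against each other. The sketch below assumes the usual definitions (F1 as the harmonic mean of precision and recall, macro F1 as the unweighted mean of per-class F1); the report does not state which implementation produced its numbers, and the variable names here are illustrative only.

```python
# (precision, recall, reported F1) per category, copied from the report.
rows = {
    "Board Governance":        (0.9657, 0.9783, 0.9719),
    "Incident Disclosure":     (0.9551, 0.9659, 0.9605),
    "Management Role":         (0.9231, 0.9600, 0.9412),
    "None/Other":              (0.8239, 0.9632, 0.8881),
    "Risk Management Process": (0.8865, 0.8283, 0.8564),
    "Strategy Integration":    (0.9810, 0.9367, 0.9583),
    "Third-Party Risk":        (0.9880, 0.9322, 0.9593),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between a recomputed F1 and the value printed in the table.
# Inputs are rounded to four decimals, so a small gap is expected.
max_row_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())

# Macro averages weight every category equally, regardless of support.
macro_f1 = sum(f1 for _, _, f1 in rows.values()) / len(rows)
macro_prec = sum(p for p, _, _ in rows.values()) / len(rows)
macro_recall = sum(r for _, r, _ in rows.values()) / len(rows)

print(f"max per-row F1 gap: {max_row_gap:.5f}")
print(f"Macro F1:     {macro_f1:.4f}")
print(f"Macro Prec:   {macro_prec:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
```

Weighted F1 cannot be recomputed the same way because the report does not list per-category support counts.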
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8952 ✓ (target: 0.80)
Weighted F1:   0.9122
Macro Prec:    0.8980
Macro Recall:  0.8931
MCC:           0.8664
AUC (OvR):     0.9817
QWK:           0.9324
MAE:           0.1175
ECE:           0.0714
Kripp Alpha:   0.9177

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9355   0.9325   0.9385
L2: Domain                  0.7975   0.8228   0.7738
L3: Firm-Specific           0.8941   0.8716   0.9179
L4: Quantified              0.9535   0.9653   0.9420

======================================================================
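The specificity section reports QWK and MAE because its four levels are ordinal: confusing L1 with L4 should cost more than confusing L1 with L2. The sketch below shows the standard definitions of quadratic weighted kappa and mean absolute error on a small hypothetical label set (levels L1 to L4 encoded as 0 to 3). The toy labels are invented for illustration and are not the holdout data, and the function names are not taken from the evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_levels):
    """Cohen's kappa with quadratic disagreement weights (i - j)^2 / (k - 1)^2."""
    n = len(y_true)
    k = num_levels
    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if true and predicted levels were independent.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n
    # Undefined (division by zero) when every label falls in a single level.
    return 1.0 - weighted_observed / weighted_expected


def mean_absolute_error(y_true, y_pred):
    """Average distance, in levels, between predicted and true labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: two of eight predictions are off by one level.
toy_true = [0, 0, 1, 1, 2, 2, 3, 3]
toy_pred = [0, 0, 1, 2, 2, 2, 3, 2]

toy_qwk = quadratic_weighted_kappa(toy_true, toy_pred, num_levels=4)
toy_mae = mean_absolute_error(toy_true, toy_pred)
print(f"QWK: {toy_qwk:.4f}")
print(f"MAE: {toy_mae:.4f}")
```

Read this way, the reported MAE of 0.1175 means predictions are on average about 0.12 levels away from the reference label.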