======================================================================
HOLDOUT EVALUATION: iter1-clspool vs GPT-5.4
======================================================================

Samples evaluated:     1200
Total inference time:  6.84s
Avg latency:           5.70ms/sample
Throughput:            175 samples/sec
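
The two derived figures above follow from the sample count and total time. A minimal sketch, assuming average latency is total inference time divided by sample count and throughput is its inverse; the variable names are illustrative and do not come from the evaluation script:

```python
# Hypothetical variable names; the two input values are copied from the report.
samples = 1200
total_seconds = 6.84

avg_latency_ms = total_seconds / samples * 1000  # mean time per sample, in ms
throughput = samples / total_seconds             # samples processed per second

print(f"Avg latency: {avg_latency_ms:.2f}ms/sample")
print(f"Throughput:  {throughput:.0f} samples/sec")
```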

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9296 ✓ (target: 0.80)
Weighted F1:   0.9307
Macro Prec:    0.9290
Macro Recall:  0.9334
MCC:           0.9179
AUC (OvR):     0.9911
ECE:           0.0556
Kripp Alpha:   0.9175

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9518   0.9602   0.9435
Incident Disclosure         0.9540   0.9651   0.9432
Management Role             0.9290   0.9000   0.9600
None/Other                  0.8800   0.8049   0.9706
Risk Management Process     0.8653   0.8883   0.8434
Strategy Integration        0.9652   0.9905   0.9412
Third-Party Risk            0.9621   0.9940   0.9322
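
The macro figures reported for this section can be cross-checked against the per-category rows. A minimal sketch, assuming "macro" means the unweighted mean of the per-class scores (the usual convention, though the evaluation script itself is not shown here); `rows` and `macro_average` are illustrative names:

```python
# Per-category (F1, precision, recall) values copied from the table above.
rows = {
    "Board Governance":        (0.9518, 0.9602, 0.9435),
    "Incident Disclosure":     (0.9540, 0.9651, 0.9432),
    "Management Role":         (0.9290, 0.9000, 0.9600),
    "None/Other":              (0.8800, 0.8049, 0.9706),
    "Risk Management Process": (0.8653, 0.8883, 0.8434),
    "Strategy Integration":    (0.9652, 0.9905, 0.9412),
    "Third-Party Risk":        (0.9621, 0.9940, 0.9322),
}

def macro_average(values):
    """Unweighted mean over classes: every class counts equally."""
    values = list(values)
    return sum(values) / len(values)

macro_f1 = macro_average(f1 for f1, _, _ in rows.values())
macro_prec = macro_average(p for _, p, _ in rows.values())
macro_recall = macro_average(r for _, _, r in rows.values())

print(f"Macro F1:     {macro_f1:.4f}")
print(f"Macro Prec:   {macro_prec:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
```

The weighted F1 cannot be reproduced the same way, because it needs the per-class sample counts, which this report does not list.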

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8920 ✓ (target: 0.80)
Weighted F1:   0.9098
Macro Prec:    0.9042
Macro Recall:  0.8836
MCC:           0.8634
AUC (OvR):     0.9778
QWK:           0.9225
MAE:           0.1275
ECE:           0.0766
Kripp Alpha:   0.9100

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9362   0.9230   0.9498
L2: Domain                  0.8091   0.8865   0.7440
L3: Firm-Specific           0.8718   0.8423   0.9034
L4: Quantified              0.9510   0.9652   0.9372

======================================================================