======================================================================
HOLDOUT EVALUATION: ensemble-3seed vs GPT-5.4
======================================================================

Samples evaluated:    1200
Total inference time: 19.85s
Avg latency:          16.54ms/sample
Throughput:           60 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9383 ✓ (target: 0.80)
Weighted F1:  0.9386
Macro Prec:   0.9370
Macro Recall: 0.9418
MCC:          0.9276
AUC (OvR):    0.9931
ECE:          0.0509
Kripp Alpha:  0.9273

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9719   0.9657   0.9783
Incident Disclosure         0.9659   0.9659   0.9659
Management Role             0.9477   0.9295   0.9667
None/Other                  0.8949   0.8302   0.9706
Risk Management Process     0.8653   0.8883   0.8434
Strategy Integration        0.9630   0.9858   0.9412
Third-Party Risk            0.9591   0.9939   0.9266

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9022 ✓ (target: 0.80)
Weighted F1:  0.9178
Macro Prec:   0.9070
Macro Recall: 0.8991
MCC:          0.8754
AUC (OvR):    0.9826
QWK:          0.9339
MAE:          0.1125
ECE:          0.0692
Kripp Alpha:  0.9206

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9396   0.9358   0.9434
L2: Domain                  0.8150   0.8609   0.7738
L3: Firm-Specific           0.8930   0.8610   0.9275
L4: Quantified              0.9610   0.9704   0.9517

======================================================================
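The summary figures in this report can be cross-checked against its own rows. The following is a minimal sketch in Python, not the script that produced the report. It assumes that "Macro F1" means the unweighted mean of the per-class F1 values, which the report does not state explicitly. All variable and function names are hypothetical, and the tolerances allow for the per-class values having been rounded to four decimals before printing.

```python
from statistics import mean

# Per-class F1 values copied from the two tables in this report.
category_f1 = {
    "Board Governance": 0.9719,
    "Incident Disclosure": 0.9659,
    "Management Role": 0.9477,
    "None/Other": 0.8949,
    "Risk Management Process": 0.8653,
    "Strategy Integration": 0.9630,
    "Third-Party Risk": 0.9591,
}
specificity_f1 = {
    "L1: Generic": 0.9396,
    "L2: Domain": 0.8150,
    "L3: Firm-Specific": 0.8930,
    "L4: Quantified": 0.9610,
}


def is_consistent(per_class, reported, tol=1e-4):
    """Return True if the unweighted mean of per_class is within tol of reported.

    Assumption: macro averaging is a plain mean over classes.
    """
    return abs(mean(per_class.values()) - reported) <= tol


# Timing figures copied from the report header.
samples = 1200
total_seconds = 19.85
avg_latency_ms = total_seconds / samples * 1000  # milliseconds per sample
throughput = samples / total_seconds             # samples per second

print("category macro F1 consistent:   ", is_consistent(category_f1, 0.9383))
print("specificity macro F1 consistent:", is_consistent(specificity_f1, 0.9022))
print(f"avg latency: {avg_latency_ms:.2f} ms/sample")
print(f"throughput:  {throughput:.0f} samples/sec")
```

The same helper can be pointed at the Prec and Recall columns to check the reported macro precision and macro recall in the same way.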