2026-04-05 15:37:50 -04:00

======================================================================
HOLDOUT EVALUATION: iter1-independent vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.73s
Avg latency: 5.61ms/sample
Throughput: 178 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9337 ✓ (target: 0.80)
Weighted F1: 0.9343
Macro Prec: 0.9319
Macro Recall: 0.9378
MCC: 0.9227
AUC (OvR): 0.9920
ECE: 0.0538
Kripp Alpha: 0.9224
Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9719   0.9657   0.9783
Incident Disclosure         0.9605   0.9551   0.9659
Management Role             0.9412   0.9231   0.9600
None/Other                  0.8881   0.8239   0.9632
Risk Management Process     0.8564   0.8865   0.8283
Strategy Integration        0.9583   0.9810   0.9367
Third-Party Risk            0.9593   0.9880   0.9322
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8952 ✓ (target: 0.80)
Weighted F1: 0.9122
Macro Prec: 0.8980
Macro Recall: 0.8931
MCC: 0.8664
AUC (OvR): 0.9817
QWK: 0.9324
MAE: 0.1175
ECE: 0.0714
Kripp Alpha: 0.9177
Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9355   0.9325   0.9385
L2: Domain                  0.7975   0.8228   0.7738
L3: Firm-Specific           0.8941   0.8716   0.9179
L4: Quantified              0.9535   0.9653   0.9420
======================================================================
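
The summary figures above can be cross-checked against the per-class rows. The sketch below is not the evaluation script that produced this report; it assumes the macro scores are plain unweighted means of the per-class values and that latency and throughput are derived from the sample count and total inference time. All names in it (category_rows, specificity_rows, macro_average) are hypothetical, and the numbers are copied from the report.

```python
# Per-class (F1, precision, recall) values copied from the report above.
category_rows = {
    "Board Governance": (0.9719, 0.9657, 0.9783),
    "Incident Disclosure": (0.9605, 0.9551, 0.9659),
    "Management Role": (0.9412, 0.9231, 0.9600),
    "None/Other": (0.8881, 0.8239, 0.9632),
    "Risk Management Process": (0.8564, 0.8865, 0.8283),
    "Strategy Integration": (0.9583, 0.9810, 0.9367),
    "Third-Party Risk": (0.9593, 0.9880, 0.9322),
}
specificity_rows = {
    "L1: Generic": (0.9355, 0.9325, 0.9385),
    "L2: Domain": (0.7975, 0.8228, 0.7738),
    "L3: Firm-Specific": (0.8941, 0.8716, 0.9179),
    "L4: Quantified": (0.9535, 0.9653, 0.9420),
}


def macro_average(rows):
    """Return the unweighted mean of each metric column across classes.

    Assumption: this is how the report's macro scores were computed.
    """
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows.values()))


cat_f1, cat_prec, cat_rec = macro_average(category_rows)
spec_f1, spec_prec, spec_rec = macro_average(specificity_rows)

# Timing figures from the report header.
samples = 1200
total_seconds = 6.73
latency_ms = total_seconds / samples * 1000  # milliseconds per sample
throughput = samples / total_seconds         # samples per second

print(f"Category macro F1/Prec/Recall:    {cat_f1:.4f} {cat_prec:.4f} {cat_rec:.4f}")
print(f"Specificity macro F1/Prec/Recall: {spec_f1:.4f} {spec_prec:.4f} {spec_rec:.4f}")
print(f"Avg latency: {latency_ms:.2f} ms/sample, throughput: {throughput:.0f} samples/sec")
```

Because the per-class values are already rounded to four decimals, the recomputed macro scores can differ from the reported ones in the last digit.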