======================================================================
HOLDOUT EVALUATION: dictionary-baseline vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 0.00s
Avg latency: 0.00ms/sample
Throughput: 1000000 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.5563 ✗ (target: 0.80)
Weighted F1: 0.5867
Macro Prec: 0.5821
Macro Recall: 0.5593
MCC: 0.5160
AUC (OvR): 0.7450
ECE: 0.4142
Kripp Alpha: 0.5092
Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.8045   0.8429   0.7696
Incident Disclosure         0.4184   0.3796   0.4659
Management Role             0.6171   0.6975   0.5533
None/Other                  0.3113   0.4342   0.2426
Risk Management Process     0.4169   0.3715   0.4747
Strategy Integration        0.6825   0.8217   0.5837
Third-Party Risk            0.6432   0.5271   0.8249
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.6555 ✗ (target: 0.80)
Weighted F1: 0.7095
Macro Prec: 0.7204
Macro Recall: 0.6226
MCC: 0.5555
AUC (OvR): 0.7507
QWK: 0.5757
MAE: 0.5158
ECE: 0.2800
Kripp Alpha: 0.5594
Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.8017   0.7251   0.8964
L2: Domain                  0.5342   0.5584   0.5119
L3: Firm-Specific           0.6284   0.8387   0.5024
L4: Quantified              0.6575   0.7595   0.5797
======================================================================