======================================================================
HOLDOUT EVALUATION: dictionary-baseline vs GPT-5.4
======================================================================

Samples evaluated: 1200
Total inference time: 0.00s
Avg latency: 0.00ms/sample
Throughput: 1000000 samples/sec
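
The zero-valued timing fields look odd next to a throughput of 1000000 samples/sec. The sketch below works through the arithmetic under one assumption the report does not confirm: that throughput is samples divided by elapsed seconds. On that reading the elapsed time is about 1.2 ms, which prints as 0.00 at two decimals. The round figure could equally come from a floor applied to the elapsed time, so this is one consistent reading, not the evaluator's actual logic.

```python
# Values copied from the report above.
samples = 1200
throughput = 1_000_000  # samples/sec, as printed

# Assumption (not stated in the report): throughput = samples / elapsed_seconds.
elapsed_s = samples / throughput          # implied total time in seconds
latency_ms = 1000 * elapsed_s / samples   # implied per-sample latency in ms

# Format with two decimals, as the report does.
total_line = f"Total inference time: {elapsed_s:.2f}s"
latency_line = f"Avg latency: {latency_ms:.2f}ms/sample"

print(total_line)
print(latency_line)
print(f"Implied elapsed time: {elapsed_s * 1000:.1f} ms")
```

Printing these fields with more decimal places, or in microseconds, would make the timing of a fast baseline readable.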

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.5563 ✗ (target: 0.80)
Weighted F1:   0.5867
Macro Prec:    0.5821
Macro Recall:  0.5593
MCC:           0.5160
AUC (OvR):     0.7450
ECE:           0.4142
Kripp Alpha:   0.5092

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.8045   0.8429   0.7696
Incident Disclosure         0.4184   0.3796   0.4659
Management Role             0.6171   0.6975   0.5533
None/Other                  0.3113   0.4342   0.2426
Risk Management Process     0.4169   0.3715   0.4747
Strategy Integration        0.6825   0.8217   0.5837
Third-Party Risk            0.6432   0.5271   0.8249
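
The macro figures in the category summary can be cross-checked against the per-category rows. This is a minimal sketch, assuming "macro" means the unweighted mean over the seven categories (the standard definition, though the report does not say so). The helper name `macro_average` is hypothetical.

```python
# (F1, precision, recall) per category, copied from the table above.
per_category = {
    "Board Governance": (0.8045, 0.8429, 0.7696),
    "Incident Disclosure": (0.4184, 0.3796, 0.4659),
    "Management Role": (0.6171, 0.6975, 0.5533),
    "None/Other": (0.3113, 0.4342, 0.2426),
    "Risk Management Process": (0.4169, 0.3715, 0.4747),
    "Strategy Integration": (0.6825, 0.8217, 0.5837),
    "Third-Party Risk": (0.6432, 0.5271, 0.8249),
}


def macro_average(rows, column):
    """Unweighted mean of one column (0=F1, 1=precision, 2=recall)."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)


macro_f1 = macro_average(per_category, 0)
macro_prec = macro_average(per_category, 1)
macro_recall = macro_average(per_category, 2)

print(f"Macro F1:     {macro_f1:.4f}")
print(f"Macro Prec:   {macro_prec:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
```

The recomputed values agree with the reported 0.5563, 0.5821, and 0.5593 to within rounding of the four-decimal inputs. Because the mean is unweighted, None/Other (F1 0.3113) counts as much toward Macro F1 as Board Governance (F1 0.8045).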

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.6555 ✗ (target: 0.80)
Weighted F1:   0.7095
Macro Prec:    0.7204
Macro Recall:  0.6226
MCC:           0.5555
AUC (OvR):     0.7507
QWK:           0.5757
MAE:           0.5158
ECE:           0.2800
Kripp Alpha:   0.5594

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.8017   0.7251   0.8964
L2: Domain                  0.5342   0.5584   0.5119
L3: Firm-Specific           0.6284   0.8387   0.5024
L4: Quantified              0.6575   0.7595   0.5797

======================================================================
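
As a consistency check on the specificity table, each level's F1 can be recomputed from its precision and recall. The sketch below assumes the usual definition of F1 as the harmonic mean of the two; the helper name `f1_from` is hypothetical and not taken from the evaluator.

```python
# (reported F1, precision, recall) per level, copied from the report.
levels = {
    "L1: Generic": (0.8017, 0.7251, 0.8964),
    "L2: Domain": (0.5342, 0.5584, 0.5119),
    "L3: Firm-Specific": (0.6284, 0.8387, 0.5024),
    "L4: Quantified": (0.6575, 0.7595, 0.5797),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for every level and compare with the reported value.
recomputed = {name: f1_from(p, r) for name, (_, p, r) in levels.items()}
max_gap = max(abs(recomputed[name] - f1) for name, (f1, _, _) in levels.items())

for name, value in recomputed.items():
    print(f"{name:<20} {value:.4f}")
print(f"Largest gap vs. reported F1: {max_gap:.4f}")
```

The gaps stay within rounding of the four-decimal inputs. The harmonic mean also explains why L3: Firm-Specific lands at 0.6284 despite a precision of 0.8387: the low recall of 0.5024 pulls the score toward the smaller of the two values.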