======================================================================
HOLDOUT EVALUATION: ensemble-3seed vs Opus-4.6
======================================================================

Samples evaluated:     1200
Total inference time:  19.85s
Avg latency:           16.54ms/sample
Throughput:            60 samples/sec

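The latency and throughput figures follow from the sample count and the total inference time. A minimal Python sketch of that arithmetic, assuming latency is total time divided by samples and throughput is its reciprocal (the variable names are illustrative and not taken from the evaluation script):

```python
# Assumed relationship: both rates derive from the two totals in the report.
samples = 1200          # "Samples evaluated"
total_seconds = 19.85   # "Total inference time"

avg_latency_ms = total_seconds / samples * 1000   # milliseconds per sample
throughput = samples / total_seconds              # samples per second

print(f"Avg latency: {avg_latency_ms:.2f}ms/sample")
print(f"Throughput:  {throughput:.0f} samples/sec")
```

Under these assumptions the printed values match the reported 16.54ms/sample and 60 samples/sec after rounding.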
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9288 ✓ (target: 0.80)
Weighted F1:   0.9277
Macro Prec:    0.9243
Macro Recall:  0.9368
MCC:           0.9161
AUC (OvR):     0.9948
ECE:           0.0629
Kripp Alpha:   0.9154

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9441   0.9056   0.9860
Incident Disclosure         0.9341   0.8864   0.9873
Management Role             0.9236   0.9295   0.9177
None/Other                  0.9267   0.8742   0.9858
Risk Management Process     0.8628   0.9202   0.8122
Strategy Integration        0.9522   0.9905   0.9167
Third-Party Risk            0.9578   0.9636   0.9521

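The macro figures for category classification are consistent with unweighted means of the per-category rows. A short Python check, assuming macro averaging is the plain mean over the seven categories (the values are transcribed from the table; the name `per_category` is illustrative):

```python
from statistics import mean

# (F1, precision, recall) per category, copied from the report's table.
per_category = {
    "Board Governance":        (0.9441, 0.9056, 0.9860),
    "Incident Disclosure":     (0.9341, 0.8864, 0.9873),
    "Management Role":         (0.9236, 0.9295, 0.9177),
    "None/Other":              (0.9267, 0.8742, 0.9858),
    "Risk Management Process": (0.8628, 0.9202, 0.8122),
    "Strategy Integration":    (0.9522, 0.9905, 0.9167),
    "Third-Party Risk":        (0.9578, 0.9636, 0.9521),
}

# Transpose rows into three columns and average each one without weighting.
macro_f1, macro_prec, macro_rec = (mean(col) for col in zip(*per_category.values()))

print(f"Macro F1:     {macro_f1:.4f}")
print(f"Macro Prec:   {macro_prec:.4f}")
print(f"Macro Recall: {macro_rec:.4f}")
```

To four decimals this reproduces the reported 0.9288, 0.9243, and 0.9368. The weighted F1 cannot be rechecked the same way because the per-category support counts are not part of the report.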
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8853 ✓ (target: 0.80)
Weighted F1:   0.9024
Macro Prec:    0.8881
Macro Recall:  0.8858
MCC:           0.8535
AUC (OvR):     0.9776
QWK:           0.9248
MAE:           0.1325
ECE:           0.0845
Kripp Alpha:   0.9110

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9300   0.9165   0.9438
L2: Domain                  0.7973   0.7815   0.8138
L3: Firm-Specific           0.8571   0.9283   0.7962
L4: Quantified              0.9567   0.9261   0.9895

======================================================================