2026-04-05 15:37:50 -04:00

======================================================================
HOLDOUT EVALUATION: iter1-independent vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.73s
Avg latency: 5.61ms/sample
Throughput: 178 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9227 ✓ (target: 0.80)
Weighted F1: 0.9216
Macro Prec: 0.9178
Macro Recall: 0.9316
MCC: 0.9093
AUC (OvR): 0.9940
ECE: 0.0655
Kripp Alpha: 0.9086
Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9441   0.9056   0.9860
Incident Disclosure         0.9286   0.8764   0.9873
Management Role             0.9172   0.9231   0.9114
None/Other                  0.9200   0.8679   0.9787
Risk Management Process     0.8492   0.9135   0.7934
Strategy Integration        0.9476   0.9858   0.9123
Third-Party Risk            0.9521   0.9521   0.9521
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8834 ✓ (target: 0.80)
Weighted F1: 0.9004
Macro Prec: 0.8859
Macro Recall: 0.8855
MCC: 0.8501
AUC (OvR): 0.9737
QWK: 0.9227
MAE: 0.1358
ECE: 0.0825
Kripp Alpha: 0.9065
Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9242   0.9116   0.9372
L2: Domain                  0.7789   0.7468   0.8138
L3: Firm-Specific           0.8661   0.9495   0.7962
L4: Quantified              0.9643   0.9356   0.9947
======================================================================
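
NOTE: The headline Macro F1 figures are consistent with an unweighted mean of the per-class F1 scores in the tables above. The sketch below is a sanity check only, not the evaluation script that produced this report. It assumes each per-class F1 is the harmonic mean of the printed precision and recall, and the names CATEGORY, SPECIFICITY, f1 and macro_f1 are hypothetical. Because the inputs are already rounded to four decimals, the results should match the report only to within rounding.

```python
# (precision, recall) per class, copied from the report tables above.
CATEGORY = {
    "Board Governance": (0.9056, 0.9860),
    "Incident Disclosure": (0.8764, 0.9873),
    "Management Role": (0.9231, 0.9114),
    "None/Other": (0.8679, 0.9787),
    "Risk Management Process": (0.9135, 0.7934),
    "Strategy Integration": (0.9858, 0.9123),
    "Third-Party Risk": (0.9521, 0.9521),
}
SPECIFICITY = {
    "L1: Generic": (0.9116, 0.9372),
    "L2: Domain": (0.7468, 0.8138),
    "L3: Firm-Specific": (0.9495, 0.7962),
    "L4: Quantified": (0.9356, 0.9947),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_f1(table: dict) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(f1(p, r) for p, r in table.values()) / len(table)


if __name__ == "__main__":
    print(f"Category macro F1:    {macro_f1(CATEGORY):.4f}")
    print(f"Specificity macro F1: {macro_f1(SPECIFICITY):.4f}")
```

The same arithmetic applies to the timing lines: 6.73 s over 1200 samples is about 5.61 ms per sample, or roughly 178 samples per second.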