======================================================================
HOLDOUT EVALUATION: iter1-independent vs Opus-4.6
======================================================================

Samples evaluated:     1200
Total inference time:  6.73s
Avg latency:           5.61ms/sample
Throughput:            178 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9227 ✓ (target: 0.80)
Weighted F1:   0.9216
Macro Prec:    0.9178
Macro Recall:  0.9316
MCC:           0.9093
AUC (OvR):     0.9940
ECE:           0.0655
Kripp Alpha:   0.9086

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9441   0.9056   0.9860
Incident Disclosure         0.9286   0.8764   0.9873
Management Role             0.9172   0.9231   0.9114
None/Other                  0.9200   0.8679   0.9787
Risk Management Process     0.8492   0.9135   0.7934
Strategy Integration        0.9476   0.9858   0.9123
Third-Party Risk            0.9521   0.9521   0.9521
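
Note: the macro scores in the summary are consistent with unweighted means of the per-category rows, and each row's F1 is consistent with the harmonic mean of its precision and recall. The sketch below re-derives them from the values printed in the table. The names per_category, macro_average and f1_from are illustrative only and are not taken from the evaluation script.

```python
# Per-category scores copied from the table above: (F1, precision, recall).
per_category = {
    "Board Governance":        (0.9441, 0.9056, 0.9860),
    "Incident Disclosure":     (0.9286, 0.8764, 0.9873),
    "Management Role":         (0.9172, 0.9231, 0.9114),
    "None/Other":              (0.9200, 0.8679, 0.9787),
    "Risk Management Process": (0.8492, 0.9135, 0.7934),
    "Strategy Integration":    (0.9476, 0.9858, 0.9123),
    "Third-Party Risk":        (0.9521, 0.9521, 0.9521),
}


def macro_average(rows, index):
    """Unweighted mean of one score column across all categories."""
    values = [scores[index] for scores in rows.values()]
    return sum(values) / len(values)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


macro_f1 = macro_average(per_category, 0)
macro_precision = macro_average(per_category, 1)
macro_recall = macro_average(per_category, 2)

print(f"Macro F1:     {macro_f1:.4f}")
print(f"Macro Prec:   {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the report in the last digit.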

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8834 ✓ (target: 0.80)
Weighted F1:   0.9004
Macro Prec:    0.8859
Macro Recall:  0.8855
MCC:           0.8501
AUC (OvR):     0.9737
QWK:           0.9227
MAE:           0.1358
ECE:           0.0825
Kripp Alpha:   0.9065

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9242   0.9116   0.9372
L2: Domain                  0.7789   0.7468   0.8138
L3: Firm-Specific           0.8661   0.9495   0.7962
L4: Quantified              0.9643   0.9356   0.9947
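
Note: QWK and MAE appear only in this section, presumably because the levels L1 to L4 are ordered, so a prediction that is off by one level should cost less than one that is off by three. The sketch below shows the standard quadratic weighted kappa and a mean absolute error over level indices. It assumes levels are encoded as 0 to 3, and the toy labels are hypothetical. The evaluation script's actual implementation is not shown in this report and may differ.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_levels):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels."""
    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    total = len(y_true)
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_levels))
                 for j in range(n_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # The penalty grows with the squared distance between levels.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


def mean_absolute_error(y_true, y_pred):
    """Average distance, in levels, between true and predicted labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels (0 = L1 ... 3 = L4): two adjacent-level mistakes.
toy_true = [0, 0, 1, 1, 2, 2, 3, 3]
toy_pred = [0, 0, 1, 2, 2, 1, 3, 3]

toy_qwk = quadratic_weighted_kappa(toy_true, toy_pred, n_levels=4)
toy_mae = mean_absolute_error(toy_true, toy_pred)
print(f"QWK: {toy_qwk:.4f}  MAE: {toy_mae:.4f}")
```

Read this way, the reported MAE of 0.1358 would mean predictions are on average about 0.14 levels away from the reference label.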

======================================================================