======================================================================
HOLDOUT EVALUATION: iter1-clspool vs Opus-4.6
======================================================================

Samples evaluated:     1200
Total inference time:  6.84s
Avg latency:           5.70ms/sample
Throughput:            175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9229 ✓ (target: 0.80)
Weighted F1:   0.9228
Macro Prec:    0.9183
Macro Recall:  0.9311
MCC:           0.9102
AUC (OvR):     0.9925
ECE:           0.0622
Kripp Alpha:   0.9096

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9455   0.9204   0.9720
Incident Disclosure         0.9212   0.8837   0.9620
Management Role             0.9245   0.9187   0.9304
None/Other                  0.9115   0.8476   0.9858
Risk Management Process     0.8529   0.9096   0.8028
Strategy Integration        0.9498   0.9905   0.9123
Third-Party Risk            0.9550   0.9578   0.9521

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8804 ✓ (target: 0.80)
Weighted F1:   0.8976
Macro Prec:    0.8892
Macro Recall:  0.8750
MCC:           0.8466
AUC (OvR):     0.9698
QWK:           0.9188
MAE:           0.1408
ECE:           0.0874
Kripp Alpha:   0.9041

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9267   0.9041   0.9504
L2: Domain                  0.7972   0.8085   0.7862
L3: Firm-Specific           0.8465   0.9189   0.7846
L4: Quantified              0.9514   0.9254   0.9789

======================================================================
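
The macro figures and the timing summary in this report can be cross-checked from the per-class rows. The sketch below is not taken from the evaluation script; it assumes the usual definitions (macro metric = unweighted mean of the per-class values, throughput = samples / total time), and the names `CATEGORY`, `SPECIFICITY`, `macro_average`, `throughput` and `latency_ms` are hypothetical. Note that macro F1 here is the mean of the per-class F1 scores, not the F1 of macro precision and macro recall.

```python
from statistics import mean

# Per-class scores copied from the report tables: (F1, precision, recall).
CATEGORY = {
    "Board Governance": (0.9455, 0.9204, 0.9720),
    "Incident Disclosure": (0.9212, 0.8837, 0.9620),
    "Management Role": (0.9245, 0.9187, 0.9304),
    "None/Other": (0.9115, 0.8476, 0.9858),
    "Risk Management Process": (0.8529, 0.9096, 0.8028),
    "Strategy Integration": (0.9498, 0.9905, 0.9123),
    "Third-Party Risk": (0.9550, 0.9578, 0.9521),
}
SPECIFICITY = {
    "L1: Generic": (0.9267, 0.9041, 0.9504),
    "L2: Domain": (0.7972, 0.8085, 0.7862),
    "L3: Firm-Specific": (0.8465, 0.9189, 0.7846),
    "L4: Quantified": (0.9514, 0.9254, 0.9789),
}


def macro_average(per_class):
    """Return the unweighted mean (F1, precision, recall) across classes."""
    f1, prec, rec = zip(*per_class.values())
    return mean(f1), mean(prec), mean(rec)


def throughput(samples, seconds):
    """Samples processed per second (assumed definition)."""
    return samples / seconds


def latency_ms(samples, seconds):
    """Average per-sample latency in milliseconds (assumed definition)."""
    return 1000.0 * seconds / samples


if __name__ == "__main__":
    for name, table in (("category", CATEGORY), ("specificity", SPECIFICITY)):
        f1, prec, rec = macro_average(table)
        print(f"{name}: macro F1={f1:.4f} prec={prec:.4f} recall={rec:.4f}")
    print(f"throughput={throughput(1200, 6.84):.1f} samples/sec")
    print(f"latency={latency_ms(1200, 6.84):.2f} ms/sample")
```

Under these assumptions the recomputed values agree with the reported macro metrics, latency and throughput to within the rounding of the four-decimal inputs. The weighted F1 cannot be rechecked this way because the report does not list per-class support.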