======================================================================
HOLDOUT EVALUATION: iter1-dapt vs Opus-4.6
======================================================================

Samples evaluated: 1200
Total inference time: 6.86s
Avg latency: 5.71ms/sample
Throughput: 175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9277 ✓ (target: 0.80)
Weighted F1: 0.9268
Macro Prec: 0.9238
Macro Recall: 0.9349
MCC: 0.9150
AUC (OvR): 0.9934
ECE: 0.0574
Kripp Alpha: 0.9144

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9417   0.9052   0.9813
Incident Disclosure         0.9333   0.8953   0.9747
Management Role             0.9206   0.9236   0.9177
None/Other                  0.9267   0.8742   0.9858
Risk Management Process     0.8600   0.9198   0.8075
Strategy Integration        0.9569   0.9906   0.9254
Third-Party Risk            0.9550   0.9578   0.9521

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8824 ✓ (target: 0.80)
Weighted F1: 0.8997
Macro Prec: 0.8895
Macro Recall: 0.8784
MCC: 0.8492
AUC (OvR): 0.9732
QWK: 0.9200
MAE: 0.1383
ECE: 0.0801
Kripp Alpha: 0.9048

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9257   0.9052   0.9471
L2: Domain                  0.7835   0.7808   0.7862
L3: Firm-Specific           0.8589   0.9324   0.7962
L4: Quantified              0.9614   0.9397   0.9842

======================================================================
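
Consistency check (not part of the original output). The macro figures in both sections can be cross-checked against the per-class rows. The sketch below assumes that "macro" means the unweighted mean over classes and that throughput is samples divided by total inference time. The dictionary and function names are hypothetical and introduced only for illustration. The per-class values are copied from the tables in this report.

```python
# Per-class (F1, precision, recall) copied from the CATEGORY table.
CATEGORY_SCORES = {
    "Board Governance": (0.9417, 0.9052, 0.9813),
    "Incident Disclosure": (0.9333, 0.8953, 0.9747),
    "Management Role": (0.9206, 0.9236, 0.9177),
    "None/Other": (0.9267, 0.8742, 0.9858),
    "Risk Management Process": (0.8600, 0.9198, 0.8075),
    "Strategy Integration": (0.9569, 0.9906, 0.9254),
    "Third-Party Risk": (0.9550, 0.9578, 0.9521),
}

# Per-level (F1, precision, recall) copied from the SPECIFICITY table.
SPECIFICITY_SCORES = {
    "L1: Generic": (0.9257, 0.9052, 0.9471),
    "L2: Domain": (0.7835, 0.7808, 0.7862),
    "L3: Firm-Specific": (0.8589, 0.9324, 0.7962),
    "L4: Quantified": (0.9614, 0.9397, 0.9842),
}


def macro_average(scores):
    """Return the unweighted mean of each column (F1, precision, recall).

    Assumption: macro averaging gives every class equal weight,
    regardless of how many samples it has.
    """
    n = len(scores)
    columns = zip(*scores.values())  # regroup rows into three columns
    return tuple(round(sum(col) / n, 4) for col in columns)


if __name__ == "__main__":
    print("category macro (F1, Prec, Recall):", macro_average(CATEGORY_SCORES))
    print("specificity macro (F1, Prec, Recall):", macro_average(SPECIFICITY_SCORES))
    # Throughput, assuming samples / total inference time.
    print("throughput (samples/sec):", round(1200 / 6.86))
```

Under these assumptions the recomputed values agree with the reported Macro F1, Macro Prec and Macro Recall lines to four decimals, and with the reported throughput. Weighted F1 cannot be recomputed the same way because the per-class sample counts are not listed in this report.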