======================================================================
HOLDOUT EVALUATION: iter1-dapt vs GPT-5.4
======================================================================

Samples evaluated:    1200
Total inference time: 6.86s
Avg latency:          5.71ms/sample
Throughput:           175 samples/sec

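The latency and throughput figures above follow from the sample count and the total inference time. A minimal sketch of that arithmetic, assuming both figures are plain averages over the whole run (the function name `summarize_timing` is hypothetical and not taken from the evaluation script):

```python
def summarize_timing(n_samples: int, total_seconds: float) -> tuple[float, float]:
    """Return (average latency in ms/sample, throughput in samples/sec)."""
    latency_ms = total_seconds / n_samples * 1000.0
    throughput = n_samples / total_seconds
    return latency_ms, throughput


# Inputs copied from the report: 1200 samples in 6.86 s.
latency_ms, throughput = summarize_timing(1200, 6.86)
print(f"{latency_ms:.2f} ms/sample, {throughput:.0f} samples/sec")
```

The recomputed latency may differ from the reported 5.71 ms in the last digit, since the total time shown in the report is itself rounded to two decimals.
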
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.9350 ✓ (target: 0.80)
Weighted F1:  0.9360
Macro Prec:   0.9345
Macro Recall: 0.9379
MCC:          0.9246
AUC (OvR):    0.9916
ECE:          0.0494
Kripp Alpha:  0.9243

Category                        F1     Prec   Recall
------------------------- -------- -------- --------
Board Governance            0.9697   0.9655   0.9739
Incident Disclosure         0.9540   0.9651   0.9432
Management Role             0.9446   0.9236   0.9667
None/Other                  0.8949   0.8302   0.9706
Risk Management Process     0.8623   0.8877   0.8384
Strategy Integration        0.9631   0.9812   0.9457
Third-Party Risk            0.9563   0.9880   0.9266

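The macro figures in the category summary can be cross-checked against the per-category rows. A minimal sketch, assuming the macro metrics are unweighted means over the seven categories (the usual definition; the evaluation script itself is not shown, and `macro_average` is a hypothetical helper):

```python
# (F1, precision, recall) per category, copied from the table above.
per_category = {
    "Board Governance": (0.9697, 0.9655, 0.9739),
    "Incident Disclosure": (0.9540, 0.9651, 0.9432),
    "Management Role": (0.9446, 0.9236, 0.9667),
    "None/Other": (0.8949, 0.8302, 0.9706),
    "Risk Management Process": (0.8623, 0.8877, 0.8384),
    "Strategy Integration": (0.9631, 0.9812, 0.9457),
    "Third-Party Risk": (0.9563, 0.9880, 0.9266),
}


def macro_average(rows: dict[str, tuple[float, ...]]) -> tuple[float, ...]:
    """Unweighted mean of each metric column across all classes."""
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows.values()))


macro_f1, macro_prec, macro_recall = macro_average(per_category)
print(f"Macro F1 {macro_f1:.4f}  Prec {macro_prec:.4f}  Recall {macro_recall:.4f}")
```

Weighted F1 cannot be reproduced the same way, because the report does not list the per-category sample counts.
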
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:     0.8959 ✓ (target: 0.80)
Weighted F1:  0.9141
Macro Prec:   0.9055
Macro Recall: 0.8891
MCC:          0.8699
AUC (OvR):    0.9806
QWK:          0.9316
MAE:          0.1167
ECE:          0.0693
Kripp Alpha:  0.9194

Level                           F1     Prec   Recall
------------------------- -------- -------- --------
L1: Generic                 0.9400   0.9289   0.9515
L2: Domain                  0.8025   0.8630   0.7500
L3: Firm-Specific           0.8904   0.8604   0.9227
L4: Quantified              0.9507   0.9698   0.9324

======================================================================
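
Each per-level F1 in the specificity table is the harmonic mean of that level's precision and recall, which is why the low recall on L2: Domain (0.7500) pulls its F1 down to 0.8025 despite a precision of 0.8630. A minimal sketch of that relationship, using the values from the report (`f1_score` here is a hypothetical local helper, not a library call):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per specificity level, copied from the report.
levels = {
    "L1: Generic": (0.9289, 0.9515),
    "L2: Domain": (0.8630, 0.7500),
    "L3: Firm-Specific": (0.8604, 0.9227),
    "L4: Quantified": (0.9698, 0.9324),
}

for name, (precision, recall) in levels.items():
    print(f"{name:<20} {f1_score(precision, recall):.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values should match the reported F1 column only to within about one unit in the last digit.
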