testing clspool and dapt on new architecture
@ -901,6 +901,202 @@ thus invariant under T > 0.
Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
### 10.5 Pooling Ablation (Attention vs [CLS])
**Motivation:** The spec F1 jump from 0.517 → 0.945 was credited to three
architectural changes — independent threshold heads, attention pooling, and
confidence filtering. Independent thresholds were ablated against CORAL;
confidence filtering was ablated in §10.3 (null result). Attention pooling
had never been isolated. We needed to know whether it actually matters or
whether independent thresholds carry all the gain.

**Setup:** `iter1-clspool.yaml` — identical to the iter1 config but with
`pooling: cls`. Same seed (42), same 11 epochs, confidence filtering on.

**Results:**

| Config | Val Cat F1 | Val Spec F1 | Val Combined | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
|--------|-----------:|------------:|-------------:|-------------------------:|--------------------------:|
| iter1 (attention) | 0.9430 | 0.9450 | 0.9440 | 0.9343 | 0.8950 |
| iter1-clspool ([CLS]) | 0.9368 | 0.9414 | 0.9391 | 0.9296 | 0.8920 |
| **Δ (attention − CLS)** | **+0.006** | **+0.004** | **+0.005** | **+0.005** | **+0.003** |

**Finding:** Attention pooling is consistently better than [CLS] pooling
across all metrics and both references, but the effect is **small** —
3-6 thousandths of F1. That is only 2-3× the seed-level std (±0.002), so
the direction is credible but the magnitude is modest. Attention pooling is
doing real work ("one CISO mention anywhere matters"), but independent
threshold heads are clearly carrying the majority of the architecture win.

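The mechanism being ablated here can be sketched in a few lines. This is a minimal NumPy illustration (not the project's training code; the query vector `w` stands in for the learned pooling parameter, and all shapes are assumptions): a softmax over per-token scores lets one strong token dominate the pooled vector, whereas [CLS] pooling just reads a single position.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def cls_pool(hidden):
    # hidden: (seq_len, d). [CLS] pooling reads only position 0.
    return hidden[0]

def attention_pool(hidden, w):
    # w: (d,) hypothetical learned query vector. One scalar score per
    # token, softmax over the sequence, then a weighted sum — a single
    # salient token ("one CISO mention anywhere") can dominate the pool.
    scores = hidden @ w          # (seq_len,)
    alpha = softmax(scores)      # attention weights, sum to 1
    return alpha @ hidden        # (d,)

rng = np.random.default_rng(0)
h = rng.normal(size=(128, 16))   # toy hidden states
w = rng.normal(size=16)
pooled = attention_pool(h, w)
```

Both poolings output a single `(d,)` vector per sequence; only the weighting over positions differs.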
**Interpretation for the paper:** We can report this cleanly as "attention
pooling contributes a small but consistent improvement over [CLS] pooling
(~+0.005 F1 on both heads); the bulk of the CORAL → independent-threshold
gain (~+0.43 on spec F1) is attributable to the decoupled threshold weights,
not the pooling change." This is honest and gives each design choice its
proper credit.

Output: `checkpoints/finetune/iter1-clspool/`, `results/eval/iter1-clspool/`.

### 10.6 DAPT Re-Test with New Architecture

**Motivation:** During the original 12-config ablation grid (CORAL +
[CLS] pooling), DAPT and TAPT both *hurt* — base ModernBERT-large
outperformed the DAPT and TAPT checkpoints on every loss combination. That
was reported as a noteworthy null result. But the architecture has changed
substantially since then (independent thresholds, attention pooling). The
verdict on DAPT could now flip: maybe the DAPT vocabulary signal was
previously wasted on a model that couldn't use it.

**Setup:** `iter1-dapt.yaml` — identical to the iter1 config but
`model.name_or_path` points at `checkpoints/dapt/modernbert-large/final`
(eval loss 0.7250 from Phase 7). Same seed, 11 epochs, attention pooling,
independent threshold heads, confidence filtering on.

**Results (epoch 11 — final checkpoint):**

| Config | Val Cat F1 | Val Spec F1 | Val Combined | Val NLL (ep 11) | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
|--------|-----------:|------------:|-------------:|----------------:|-------------------------:|--------------------------:|
| iter1 (base ModernBERT, seed 69) | 0.9384 | 0.9462 | 0.9423 | 0.511 | — | — |
| iter1 (base ModernBERT, seed 42) | 0.9430 | 0.9450 | 0.9440 | — | 0.9343 | 0.8950 |
| iter1-dapt (DAPT init) | 0.9500 | 0.9462 | 0.9481 | 0.494 | 0.9350 | 0.8959 |
| **Δ (dapt − base)** | **+0.007** | **+0.001** | **+0.004** | **−0.017** | +0.001 | +0.001 |

**Per-epoch val NLL trajectory (confirms the gain is not overfitting-driven):**

| Epoch | seed 69 (no DAPT) | DAPT | Δ |
|-------|------------------:|-----------------:|-------:|
| 1 | 0.376 | 0.346 | −0.030 |
| 2 | 0.337 | **0.318** (best) | −0.019 |
| 3 | **0.333** (best) | 0.331 | −0.002 |
| 5 | 0.394 | 0.385 | −0.009 |
| 8 | 0.493 | 0.482 | −0.011 |
| 11 | 0.511 | 0.494 | −0.017 |

Both runs peak at epoch 2-3 and then overfit steadily. The overfit gap
(val NLL at epoch 11 minus best val NLL) is **0.178 for the baseline** and
**0.176 for DAPT** — essentially identical. DAPT is not overfitting worse;
it is **starting from a better representation** and maintaining the same
generalization gap through training.

**Finding — a more nuanced null:** DAPT initialization genuinely improves
val NLL by ~4.5% at the best checkpoint (0.333 → 0.318), with a matching
+0.007 category F1 improvement on val. The improvement is real and not a
side-effect of overfitting: the train/val gap is unchanged. But the holdout
gain is **0.001** on both heads — within seed-level noise and nowhere near
the val improvement. Something interesting is happening:

- DAPT helps the model fit in-distribution data more tightly (val gain +
  NLL drop)
- That extra fit does not generalize to the stratified holdout
- The holdout oversamples minority classes (L2, TP, ID) relative to the
  training distribution; DAPT's benefit is on the head of the distribution

**Interpretation for the paper:** This is a more interesting null result
than the original "DAPT/TAPT did not help." The revised claim is:

> *"Domain-adaptive pretraining improves in-distribution val NLL by ~4.5%
> at the best checkpoint (0.333 → 0.318) and provides a modest val F1 gain
> (+0.007 cat, +0.004 combined) under the independent-threshold +
> attention-pooling architecture. The generalization gap (difference between
> best val NLL and final val NLL) is unchanged by DAPT (0.178 vs 0.176),
> confirming that DAPT is providing a better initialization rather than
> just enabling overfitting. However, this val improvement does not
> transfer to the stratified holdout — DAPT produces a model that is
> better-calibrated on paragraphs similar to the training distribution,
> yet no more generalizable to the rare-class boundary cases (L2, TP, ID)
> that macro F1 weighs heavily. Our original finding (DAPT does not help
> final macro F1) is reaffirmed; the mechanism is now clearer."*

This is stronger than the original null because we can now point to a
specific, measurable effect of DAPT (val NLL) distinct from overfitting,
and explain why it does not show up in the headline macro F1 metric.

The non-DAPT 3-seed ensemble remains the recommended headline checkpoint.
The DAPT run is reportable as an ablation and a more precise null.

Output: `checkpoints/finetune/iter1-dapt/`, `results/eval/iter1-dapt/`.
### 10.7 The NLL-vs-F1 Decoupling and the Overfit Story
Investigating the DAPT ablation (§10.6) surfaced a general property of
every run in Phase 10 that is worth documenting explicitly, because it
affects how the paper should report training dynamics.

**Observation:** In every independent-threshold run (iter1 seeds 42/69/420,
iter1-nofilter, iter1-clspool, iter1-dapt), **val NLL bottoms out at epoch
2-3 and then climbs monotonically through epoch 11, while val macro F1
peaks at epoch 8 and plateaus.** The two metrics disagree about when the
model is at its best.

**Per-epoch val NLL and F1, representative run (seed 69):**

| Epoch | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|-------|---|---|---|---|---|---|---|---|---|----|----|
| Val NLL | 0.376 | 0.337 | **0.333** | 0.369 | 0.394 | 0.443 | 0.472 | 0.493 | 0.505 | 0.511 | — |
| Val F1 | ~0.90 | ~0.92 | ~0.925 | ~0.932 | ~0.938 | ~0.941 | ~0.942 | **~0.944** | 0.944 | 0.944 | 0.943 |

**Interpretation:** Past epoch 3, continued training memorizes *confidence*,
not *decisions*. Two things happen simultaneously:

1. Training-set probabilities are pushed toward 0/1 (training loss → 0)
2. Very few argmax decision boundaries shift

For val examples the model already gets right, sharpening is neutral-to-bad
for NLL and neutral-to-good for F1. For val examples the model gets wrong,
continued training makes the prediction *more confidently wrong* — terrible
for NLL (the log-penalty grows), irrelevant for F1 (still wrong by argmax).
Net: NLL climbs while F1 inches up as a small number of borderline examples
flip to the correct side.

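The asymmetry above can be checked with two lines of arithmetic. A toy stdlib-only sketch (the numbers are illustrative, not from any run): scaling the logits of a confidently-wrong example leaves the argmax unchanged but inflates its NLL.

```python
import math

def nll_true(logits, true_idx):
    # Negative log-likelihood of the true class under softmax(logits).
    m = max(logits)
    z = [math.exp(v - m) for v in logits]
    return -math.log(z[true_idx] / sum(z))

wrong = [2.0, 1.0]   # model prefers class 0; true class is 1
sharp = [4.0, 2.0]   # same logits sharpened ×2: same decision...
same_argmax = (max(range(2), key=wrong.__getitem__)
               == max(range(2), key=sharp.__getitem__))
worse_nll = nll_true(sharp, 1) > nll_true(wrong, 1)   # ...worse NLL
```

Sharpening a correct example, by contrast, can only shrink its NLL toward 0 — which is why train loss keeps falling while val NLL climbs.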
This is a well-documented decoupling in deep classifiers, not a pathology
specific to this model.

**Is it a problem for the F1 claim? No.** Model selection uses val F1, so
we pick the epoch where F1 peaks (epoch 8). Val F1 at the selected
checkpoint (0.943/0.945) closely tracks holdout F1 against proxy gold
(0.934/0.895) — a ~0.01 category gap and ~0.05 specificity gap. The
decision boundaries generalized. The model did not overfit the *task*.

**Is it a problem for the probability claim? Yes, but it is measurable and
fixable.** Raw logits at epoch 8 are overconfident, which is exactly what
the pre-scaling ECE measured (0.05-0.08). The fitted temperatures
(T_cat = 1.76, T_spec = 2.46) directly quantify how overconfident the
model became between epoch 3 and epoch 8: T > 1 means "divide the logits
to cool them off." Temperature scaling (§10.4) recovers calibration
without touching predictions, so the cost of training to epoch 8 instead
of epoch 3 is paid in a scalar that is learned in ~1 second on val.

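The fit itself is tiny. A minimal sketch of post-hoc temperature scaling (assumed shapes and a synthetic val set — not the project's §10.4 script): minimize val NLL over a single scalar T, then divide logits by T at inference. Dividing by T > 0 never moves the argmax, so F1 is untouched by construction.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(val_logits, val_labels):
    # One bounded 1-D minimization; T > 1 cools overconfident logits.
    res = minimize_scalar(lambda T: nll(val_logits, val_labels, T),
                          bounds=(0.05, 10.0), method="bounded")
    return res.x

# Synthetic "epoch 8"-style val set: sharp logits, 20% confidently wrong —
# the overconfident regime where the fitted T comes out above 1.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=1000)
pred = labels.copy()
flip = rng.random(1000) < 0.2
pred[flip] = (pred[flip] + 1) % 4          # confidently wrong on 20%
logits = rng.normal(0.0, 0.5, size=(1000, 4))
logits[np.arange(1000), pred] += 6.0       # large margins everywhere
T = fit_temperature(logits, labels)
```

On this toy data T lands above 1 and the scaled logits have strictly lower val NLL, mirroring the T_cat/T_spec values quoted above.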
**Is it a problem for the holdout claim? No, by construction.** The
holdout was never touched during training. The train/val loss gap measures
memorization of the training distribution; the holdout measures
generalization to a distributionally distinct sample. These are independent
signals, and both tell a consistent story: decision boundaries transfer,
probability calibration does not.

**Why not just stop at epoch 3?** Because you would save ~0.18 in val NLL
and lose ~0.02 in val F1. Epochs 3 → 8 buy ~0.015-0.020 F1 at the cost of
calibration that temperature scaling mechanically recovers. For a task
where F1 is the rubric metric, that is a good trade. Were this a
deployment where confidence scores drive downstream decisions (e.g., a
human-in-the-loop review queue prioritizing low-confidence paragraphs),
epoch 3 + no temperature scaling would be a reasonable alternative.

**Paper framing:**

> *"Val NLL minimizes at epoch 2-3 while val macro F1 peaks at epoch 8 — a
> well-documented decoupling between calibration and decision quality in
> deep classifiers. We select checkpoints by F1, report pre- and
> post-temperature-scaling ECE separately, and verify generalization via
> an untouched stratified holdout. The model's val-holdout F1 gap (~0.01
> category, ~0.05 specificity) is within the inter-reference agreement
> ceiling, confirming decision-boundary generalization despite
> in-distribution confidence memorization. Temperature scaling recovers
> calibration (ECE −33% cat, −40% spec) without altering predictions."*

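The ECE figures quoted throughout are the standard binned estimator. A short sketch under assumptions (15 equal-width bins; the project's evaluation code may bin differently):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    # probs: (n, k) softmax outputs; labels: (n,) integer classes.
    conf = probs.max(axis=1)              # predicted confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # |accuracy − mean confidence| in the bin, weighted by bin mass
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Calibrated toy case: 90% accuracy at 0.9 confidence → ECE ≈ 0.
cal = expected_calibration_error(np.array([[0.9, 0.1]] * 10),
                                 np.array([0] * 9 + [1]))
# Overconfident toy case: 50% accuracy at 0.99 confidence → large ECE.
over = expected_calibration_error(np.array([[0.99, 0.01]] * 10),
                                  np.array([0] * 5 + [1] * 5))
```

Temperature scaling lowers `conf` in each bin toward the bin's accuracy, which is how the −33%/−40% reductions arise without any prediction changing.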
### Phase 10 Summary

| Experiment | Cost | Outcome | Paper value |
|------------|------|---------|-------------|
@ -909,6 +1105,8 @@ Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
| Dictionary baseline | ~1 min CPU | Cat 0.55, Spec 0.66 — clear gap to learned model | A-rubric "additional baselines" item |
| Confidence-filter ablation | ~3h GPU | Null result — filtering does not matter | Justifies architecture, not data engineering |
| Temperature scaling | ~10 min GPU | ECE −33% cat, −40% spec, F1 unchanged | Calibration story, deployment quality |
| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds |
| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (−4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization |

The 3-seed ensemble is now the recommended headline checkpoint. The
calibrated ECE numbers should replace the pre-scaling ECE in the paper. The
@ -156,6 +156,8 @@
- [x] Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed
- [x] Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context
- [x] Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement
- [x] Pooling ablation (attention vs CLS) — attention +0.005 F1 consistent; small but credible effect
- [x] DAPT re-test with new architecture — val +0.007 cat F1, best val NLL 0.333→0.318 (−4.5%), generalization gap unchanged; holdout gain ~0.001 (better init, not better generalization)
- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result

python/configs/finetune/iter1-clspool.yaml (new file, 37 lines)
@ -0,0 +1,37 @@
model:
  name_or_path: answerdotai/ModernBERT-large

data:
  paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl
  consensus_path: ../data/annotations/v2-stage1/consensus.jsonl
  quality_path: ../data/paragraphs/quality/quality-scores.jsonl
  holdout_path: ../data/gold/v2-holdout-ids.json
  max_seq_length: 512
  validation_split: 0.1

training:
  output_dir: ../checkpoints/finetune/iter1-clspool
  learning_rate: 0.00005
  num_train_epochs: 11
  per_device_train_batch_size: 32
  per_device_eval_batch_size: 64
  gradient_accumulation_steps: 1
  warmup_ratio: 0.1
  weight_decay: 0.01
  dropout: 0.1
  bf16: true
  gradient_checkpointing: false
  logging_steps: 50
  save_total_limit: 3
  dataloader_num_workers: 4
  seed: 42
  loss_type: ce
  focal_gamma: 2.0
  class_weighting: true
  category_loss_weight: 1.0
  specificity_loss_weight: 1.0
  specificity_head: independent
  spec_mlp_dim: 256
  pooling: cls
  ordinal_consistency_weight: 0.1
  filter_spec_confidence: true
python/configs/finetune/iter1-dapt.yaml (new file, 37 lines)
@ -0,0 +1,37 @@
model:
  name_or_path: ../checkpoints/dapt/modernbert-large/final

data:
  paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl
  consensus_path: ../data/annotations/v2-stage1/consensus.jsonl
  quality_path: ../data/paragraphs/quality/quality-scores.jsonl
  holdout_path: ../data/gold/v2-holdout-ids.json
  max_seq_length: 512
  validation_split: 0.1

training:
  output_dir: ../checkpoints/finetune/iter1-dapt
  learning_rate: 0.00005
  num_train_epochs: 11
  per_device_train_batch_size: 32
  per_device_eval_batch_size: 64
  gradient_accumulation_steps: 1
  warmup_ratio: 0.1
  weight_decay: 0.01
  dropout: 0.1
  bf16: true
  gradient_checkpointing: false
  logging_steps: 50
  save_total_limit: 3
  dataloader_num_workers: 4
  seed: 42
  loss_type: ce
  focal_gamma: 2.0
  class_weighting: true
  category_loss_weight: 1.0
  specificity_loss_weight: 1.0
  specificity_head: independent
  spec_mlp_dim: 256
  pooling: attention
  ordinal_consistency_weight: 0.1
  filter_spec_confidence: true
Binary files added under results/eval/iter1-clspool/figures/:
calibration_cat_gpt-5.4.png, calibration_cat_opus-4.6.png,
confusion_cat_gpt-5.4.png, confusion_cat_opus-4.6.png,
confusion_spec_gpt-5.4.png, confusion_spec_opus-4.6.png,
model_comparison.png, per_class_f1_gpt-5.4.png,
per_class_f1_opus-4.6.png, speed_comparison.png
results/eval/iter1-clspool/metrics.json (new file, 298 lines)
@ -0,0 +1,298 @@
{
  "iter1-clspool_vs_GPT-5.4": {
    "cat_macro_f1": 0.9296272782528762,
    "cat_weighted_f1": 0.9306824376807155,
    "cat_macro_precision": 0.9289887550616817,
    "cat_macro_recall": 0.9334375025997984,
    "cat_mcc": 0.9179226636085169,
    "cat_auc": 0.9911299127522846,
    "cat_ece": 0.05557066917419438,
    "cat_confusion_matrix": [
      [217, 0, 8, 3, 2, 0, 0],
      [0, 83, 0, 2, 2, 1, 0],
      [2, 0, 144, 1, 3, 0, 0],
      [1, 0, 2, 132, 1, 0, 0],
      [6, 1, 5, 17, 167, 1, 1],
      [0, 2, 1, 8, 2, 208, 0],
      [0, 0, 0, 1, 11, 0, 165]
    ],
    "cat_f1_BoardGov": 0.9517543859649122,
    "cat_prec_BoardGov": 0.9601769911504425,
    "cat_recall_BoardGov": 0.9434782608695652,
    "cat_f1_Incident": 0.9540229885057471,
    "cat_prec_Incident": 0.9651162790697675,
    "cat_recall_Incident": 0.9431818181818182,
    "cat_f1_Manageme": 0.9290322580645162,
    "cat_prec_Manageme": 0.9,
    "cat_recall_Manageme": 0.96,
    "cat_f1_NoneOthe": 0.88,
    "cat_prec_NoneOthe": 0.8048780487804879,
    "cat_recall_NoneOthe": 0.9705882352941176,
    "cat_f1_RiskMana": 0.8652849740932642,
    "cat_prec_RiskMana": 0.8882978723404256,
    "cat_recall_RiskMana": 0.8434343434343434,
    "cat_f1_Strategy": 0.9651972157772621,
    "cat_prec_Strategy": 0.9904761904761905,
    "cat_recall_Strategy": 0.9411764705882353,
    "cat_f1_Third-Pa": 0.9620991253644315,
    "cat_prec_Third-Pa": 0.9939759036144579,
    "cat_recall_Third-Pa": 0.9322033898305084,
    "cat_kripp_alpha": 0.9174669822467758,
    "spec_macro_f1": 0.892010224838834,
    "spec_weighted_f1": 0.9098424770121019,
    "spec_macro_precision": 0.9042493173083448,
    "spec_macro_recall": 0.8836163792237031,
    "spec_mcc": 0.8634241541671751,
    "spec_auc": 0.9777836963763646,
    "spec_ece": 0.07659540871779125,
    "spec_confusion_matrix": [
      [587, 11, 17, 3],
      [32, 125, 9, 2],
      [14, 4, 187, 2],
      [3, 1, 9, 194]
    ],
    "spec_f1_L1Generi": 0.9362041467304625,
    "spec_prec_L1Generi": 0.9229559748427673,
    "spec_recall_L1Generi": 0.9498381877022654,
    "spec_f1_L2Domain": 0.8090614886731392,
    "spec_prec_L2Domain": 0.8865248226950354,
    "spec_recall_L2Domain": 0.7440476190476191,
    "spec_f1_L3Firm-S": 0.8717948717948718,
    "spec_prec_L3Firm-S": 0.8423423423423423,
    "spec_recall_L3Firm-S": 0.9033816425120773,
    "spec_f1_L4Quanti": 0.9509803921568627,
    "spec_prec_L4Quanti": 0.9651741293532339,
    "spec_recall_L4Quanti": 0.9371980676328503,
    "spec_qwk": 0.9224750079938221,
    "spec_mae": 0.1275,
    "spec_kripp_alpha": 0.9099809044589873,
    "total_time_s": 6.83874113188358,
    "num_samples": 1200,
    "avg_ms_per_sample": 5.698950943236317,
    "combined_macro_f1": 0.910818751545855
  },
  "iter1-clspool_vs_Opus-4.6": {
    "cat_macro_f1": 0.9228949790380195,
    "cat_weighted_f1": 0.9228190044594041,
    "cat_macro_precision": 0.9183239817151002,
    "cat_macro_recall": 0.9310538134995027,
    "cat_mcc": 0.9101930161599978,
    "cat_auc": 0.9924519781241848,
    "cat_ece": 0.06223733584086104,
    "cat_confusion_matrix": [
      [208, 0, 3, 3, 0, 0, 0],
      [0, 76, 0, 1, 2, 0, 0],
      [5, 0, 147, 1, 4, 0, 1],
      [0, 0, 0, 139, 2, 0, 0],
      [12, 1, 9, 14, 171, 1, 5],
      [1, 9, 1, 6, 2, 208, 1],
      [0, 0, 0, 0, 7, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9454545454545454,
    "cat_prec_BoardGov": 0.9203539823008849,
    "cat_recall_BoardGov": 0.9719626168224299,
    "cat_f1_Incident": 0.9212121212121213,
    "cat_prec_Incident": 0.8837209302325582,
    "cat_recall_Incident": 0.9620253164556962,
    "cat_f1_Manageme": 0.9245283018867925,
    "cat_prec_Manageme": 0.91875,
    "cat_recall_Manageme": 0.930379746835443,
    "cat_f1_NoneOthe": 0.9114754098360656,
    "cat_prec_NoneOthe": 0.8475609756097561,
    "cat_recall_NoneOthe": 0.9858156028368794,
    "cat_f1_RiskMana": 0.8528678304239401,
    "cat_prec_RiskMana": 0.9095744680851063,
    "cat_recall_RiskMana": 0.8028169014084507,
    "cat_f1_Strategy": 0.9497716894977168,
    "cat_prec_Strategy": 0.9904761904761905,
    "cat_recall_Strategy": 0.9122807017543859,
    "cat_f1_Third-Pa": 0.954954954954955,
    "cat_prec_Third-Pa": 0.9578313253012049,
    "cat_recall_Third-Pa": 0.9520958083832335,
    "cat_kripp_alpha": 0.9095735484151157,
    "spec_macro_f1": 0.8804386286358235,
    "spec_weighted_f1": 0.8975676999782217,
    "spec_macro_precision": 0.8892226854649037,
    "spec_macro_recall": 0.8750457181821643,
    "spec_mcc": 0.8465565454059848,
    "spec_auc": 0.9697722386763277,
    "spec_ece": 0.08741456707318629,
    "spec_confusion_matrix": [
      [575, 19, 10, 1],
      [26, 114, 4, 1],
      [35, 8, 204, 13],
      [0, 0, 4, 186]
    ],
    "spec_f1_L1Generi": 0.9266720386784851,
    "spec_prec_L1Generi": 0.9040880503144654,
    "spec_recall_L1Generi": 0.9504132231404959,
    "spec_f1_L2Domain": 0.7972027972027972,
    "spec_prec_L2Domain": 0.8085106382978723,
    "spec_recall_L2Domain": 0.7862068965517242,
    "spec_f1_L3Firm-S": 0.8464730290456431,
    "spec_prec_L3Firm-S": 0.918918918918919,
    "spec_recall_L3Firm-S": 0.7846153846153846,
    "spec_f1_L4Quanti": 0.9514066496163683,
    "spec_prec_L4Quanti": 0.9253731343283582,
    "spec_recall_L4Quanti": 0.9789473684210527,
    "spec_qwk": 0.9187882106031572,
    "spec_mae": 0.14083333333333334,
    "spec_kripp_alpha": 0.9041056117796359,
    "total_time_s": 6.83874113188358,
    "num_samples": 1200,
    "avg_ms_per_sample": 5.698950943236317,
    "combined_macro_f1": 0.9016668038369215
  }
}
results/eval/iter1-clspool/report_gpt-54.txt (new file, 54 lines)
@ -0,0 +1,54 @@

======================================================================
HOLDOUT EVALUATION: iter1-clspool vs GPT-5.4
======================================================================

Samples evaluated:    1200
Total inference time: 6.84s
Avg latency:          5.70ms/sample
Throughput:           175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.9296  ✓ (target: 0.80)
Weighted F1:   0.9307
Macro Prec:    0.9290
Macro Recall:  0.9334
MCC:           0.9179
AUC (OvR):     0.9911
ECE:           0.0556
Kripp Alpha:   0.9175

Category                        F1     Prec   Recall
-------------------------  -------- -------- --------
Board Governance             0.9518   0.9602   0.9435
Incident Disclosure          0.9540   0.9651   0.9432
Management Role              0.9290   0.9000   0.9600
None/Other                   0.8800   0.8049   0.9706
Risk Management Process      0.8653   0.8883   0.8434
Strategy Integration         0.9652   0.9905   0.9412
Third-Party Risk             0.9621   0.9940   0.9322

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1:      0.8920  ✓ (target: 0.80)
Weighted F1:   0.9098
Macro Prec:    0.9042
Macro Recall:  0.8836
MCC:           0.8634
AUC (OvR):     0.9778
QWK:           0.9225
MAE:           0.1275
ECE:           0.0766
Kripp Alpha:   0.9100

Level                           F1     Prec   Recall
-------------------------  -------- -------- --------
L1: Generic                  0.9362   0.9230   0.9498
L2: Domain                   0.8091   0.8865   0.7440
L3: Firm-Specific            0.8718   0.8423   0.9034
L4: Quantified               0.9510   0.9652   0.9372

======================================================================
results/eval/iter1-clspool/report_opus-46.txt (new file, 54 lines)
@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-clspool vs Opus-4.6
======================================================================

Samples evaluated:    1200
Total inference time: 6.84s
Avg latency:          5.70ms/sample
Throughput:           175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
  Macro F1:       0.9229 ✓ (target: 0.80)
  Weighted F1:    0.9228
  Macro Prec:     0.9183
  Macro Recall:   0.9311
  MCC:            0.9102
  AUC (OvR):      0.9925
  ECE:            0.0622
  Kripp Alpha:    0.9096

  Category                    F1       Prec     Recall
  ------------------------- -------- -------- --------
  Board Governance            0.9455   0.9204   0.9720
  Incident Disclosure         0.9212   0.8837   0.9620
  Management Role             0.9245   0.9187   0.9304
  None/Other                  0.9115   0.8476   0.9858
  Risk Management Process     0.8529   0.9096   0.8028
  Strategy Integration        0.9498   0.9905   0.9123
  Third-Party Risk            0.9550   0.9578   0.9521

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
  Macro F1:       0.8804 ✓ (target: 0.80)
  Weighted F1:    0.8976
  Macro Prec:     0.8892
  Macro Recall:   0.8750
  MCC:            0.8466
  AUC (OvR):      0.9698
  QWK:            0.9188
  MAE:            0.1408
  ECE:            0.0874
  Kripp Alpha:    0.9041

  Level                       F1       Prec     Recall
  ------------------------- -------- -------- --------
  L1: Generic                 0.9267   0.9041   0.9504
  L2: Domain                  0.7972   0.8085   0.7862
  L3: Firm-Specific           0.8465   0.9189   0.7846
  L4: Quantified              0.9514   0.9254   0.9789

======================================================================
results/eval/iter1-dapt/figures/calibration_cat_gpt-5.4.png   (new binary file, 52 KiB)
results/eval/iter1-dapt/figures/calibration_cat_opus-4.6.png  (new binary file, 52 KiB)
results/eval/iter1-dapt/figures/confusion_cat_gpt-5.4.png     (new binary file, 115 KiB)
results/eval/iter1-dapt/figures/confusion_cat_opus-4.6.png    (new binary file, 115 KiB)
results/eval/iter1-dapt/figures/confusion_spec_gpt-5.4.png    (new binary file, 79 KiB)
results/eval/iter1-dapt/figures/confusion_spec_opus-4.6.png   (new binary file, 81 KiB)
results/eval/iter1-dapt/figures/model_comparison.png          (new binary file, 60 KiB)
results/eval/iter1-dapt/figures/per_class_f1_gpt-5.4.png      (new binary file, 103 KiB)
results/eval/iter1-dapt/figures/per_class_f1_opus-4.6.png     (new binary file, 104 KiB)
results/eval/iter1-dapt/figures/speed_comparison.png          (new binary file, 51 KiB)

results/eval/iter1-dapt/metrics.json (new file, 298 lines)
@@ -0,0 +1,298 @@
{
  "iter1-dapt_vs_GPT-5.4": {
    "cat_macro_f1": 0.9350000205815902,
    "cat_weighted_f1": 0.936034565494772,
    "cat_macro_precision": 0.9344660111343602,
    "cat_macro_recall": 0.9378555188267356,
    "cat_mcc": 0.9246263785540332,
    "cat_auc": 0.9915953686916092,
    "cat_ece": 0.04942640244960788,
    "cat_confusion_matrix": [
      [224, 0, 4, 0, 2, 0, 0],
      [0, 83, 0, 0, 2, 2, 1],
      [2, 0, 145, 1, 2, 0, 0],
      [0, 0, 2, 132, 1, 1, 0],
      [6, 1, 5, 18, 166, 1, 1],
      [0, 2, 1, 8, 1, 209, 0],
      [0, 0, 0, 0, 13, 0, 164]
    ],
    "cat_f1_BoardGov": 0.9696969696969697,
    "cat_prec_BoardGov": 0.9655172413793104,
    "cat_recall_BoardGov": 0.9739130434782609,
    "cat_f1_Incident": 0.9540229885057471,
    "cat_prec_Incident": 0.9651162790697675,
    "cat_recall_Incident": 0.9431818181818182,
    "cat_f1_Manageme": 0.9446254071661238,
    "cat_prec_Manageme": 0.9235668789808917,
    "cat_recall_Manageme": 0.9666666666666667,
    "cat_f1_NoneOthe": 0.8949152542372881,
    "cat_prec_NoneOthe": 0.8301886792452831,
    "cat_recall_NoneOthe": 0.9705882352941176,
    "cat_f1_RiskMana": 0.8623376623376623,
    "cat_prec_RiskMana": 0.8877005347593583,
    "cat_recall_RiskMana": 0.8383838383838383,
    "cat_f1_Strategy": 0.9631336405529954,
    "cat_prec_Strategy": 0.9812206572769953,
    "cat_recall_Strategy": 0.9457013574660633,
    "cat_f1_Third-Pa": 0.956268221574344,
    "cat_prec_Third-Pa": 0.9879518072289156,
    "cat_recall_Third-Pa": 0.9265536723163842,
    "cat_kripp_alpha": 0.9243058890635424,
    "spec_macro_f1": 0.8959443847575952,
    "spec_weighted_f1": 0.914085249793483,
    "spec_macro_precision": 0.9055333144570721,
    "spec_macro_recall": 0.889132193611932,
    "spec_mcc": 0.8698798188273218,
    "spec_auc": 0.9806421467148638,
    "spec_ece": 0.0693218584855397,
    "spec_confusion_matrix": [
      [588, 14, 13, 3],
      [32, 126, 8, 2],
      [11, 4, 191, 1],
      [2, 2, 10, 193]
    ],
    "spec_f1_L1Generi": 0.9400479616306955,
    "spec_prec_L1Generi": 0.9289099526066351,
    "spec_recall_L1Generi": 0.9514563106796117,
    "spec_f1_L2Domain": 0.802547770700637,
    "spec_prec_L2Domain": 0.863013698630137,
    "spec_recall_L2Domain": 0.75,
    "spec_f1_L3Firm-S": 0.8904428904428905,
    "spec_prec_L3Firm-S": 0.8603603603603603,
    "spec_recall_L3Firm-S": 0.9227053140096618,
    "spec_f1_L4Quanti": 0.9507389162561576,
    "spec_prec_L4Quanti": 0.9698492462311558,
    "spec_recall_L4Quanti": 0.9323671497584541,
    "spec_qwk": 0.9315994086072762,
    "spec_mae": 0.11666666666666667,
    "spec_kripp_alpha": 0.9194074359344485,
    "total_time_s": 6.855555058107711,
    "num_samples": 1200,
    "avg_ms_per_sample": 5.712962548423093,
    "combined_macro_f1": 0.9154722026695927
  },
  "iter1-dapt_vs_Opus-4.6": {
    "cat_macro_f1": 0.9277442873196512,
    "cat_weighted_f1": 0.9268438855804646,
    "cat_macro_precision": 0.9237899595225246,
    "cat_macro_recall": 0.9349393170438051,
    "cat_mcc": 0.9150420281652446,
    "cat_auc": 0.9934333602136249,
    "cat_ece": 0.057411353190739985,
    "cat_confusion_matrix": [
      [210, 0, 2, 1, 1, 0, 0],
      [0, 77, 0, 0, 1, 0, 1],
      [8, 0, 145, 1, 3, 0, 1],
      [0, 0, 0, 139, 2, 0, 0],
      [13, 0, 9, 13, 172, 1, 5],
      [1, 9, 1, 4, 2, 211, 0],
      [0, 0, 0, 1, 6, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9417040358744395,
    "cat_prec_BoardGov": 0.9051724137931034,
    "cat_recall_BoardGov": 0.9813084112149533,
    "cat_f1_Incident": 0.9333333333333333,
    "cat_prec_Incident": 0.8953488372093024,
    "cat_recall_Incident": 0.9746835443037974,
    "cat_f1_Manageme": 0.9206349206349206,
    "cat_prec_Manageme": 0.9235668789808917,
    "cat_recall_Manageme": 0.9177215189873418,
    "cat_f1_NoneOthe": 0.9266666666666666,
    "cat_prec_NoneOthe": 0.8742138364779874,
    "cat_recall_NoneOthe": 0.9858156028368794,
    "cat_f1_RiskMana": 0.86,
    "cat_prec_RiskMana": 0.9197860962566845,
    "cat_recall_RiskMana": 0.8075117370892019,
    "cat_f1_Strategy": 0.9569160997732427,
    "cat_prec_Strategy": 0.9906103286384976,
    "cat_recall_Strategy": 0.9254385964912281,
    "cat_f1_Third-Pa": 0.954954954954955,
    "cat_prec_Third-Pa": 0.9578313253012049,
    "cat_recall_Third-Pa": 0.9520958083832335,
    "cat_kripp_alpha": 0.9144489824694872,
    "spec_macro_f1": 0.8823881241075249,
    "spec_weighted_f1": 0.8997013825586678,
    "spec_macro_precision": 0.8895415282112857,
    "spec_macro_recall": 0.8784196767594721,
    "spec_mcc": 0.84923108221758,
    "spec_auc": 0.9732413764660657,
    "spec_ece": 0.08008741805950799,
    "spec_confusion_matrix": [
      [573, 22, 9, 1],
      [26, 114, 3, 2],
      [34, 10, 207, 9],
      [0, 0, 3, 187]
    ],
    "spec_f1_L1Generi": 0.925686591276252,
    "spec_prec_L1Generi": 0.9052132701421801,
    "spec_recall_L1Generi": 0.947107438016529,
    "spec_f1_L2Domain": 0.7835051546391752,
    "spec_prec_L2Domain": 0.7808219178082192,
    "spec_recall_L2Domain": 0.7862068965517242,
    "spec_f1_L3Firm-S": 0.8589211618257261,
    "spec_prec_L3Firm-S": 0.9324324324324325,
    "spec_recall_L3Firm-S": 0.7961538461538461,
    "spec_f1_L4Quanti": 0.961439588688946,
    "spec_prec_L4Quanti": 0.9396984924623115,
    "spec_recall_L4Quanti": 0.9842105263157894,
    "spec_qwk": 0.9200429286057613,
    "spec_mae": 0.13833333333333334,
    "spec_kripp_alpha": 0.9047987190793844,
    "total_time_s": 6.855555058107711,
    "num_samples": 1200,
    "avg_ms_per_sample": 5.712962548423093,
    "combined_macro_f1": 0.9050662057135881
  }
}
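The headline numbers in `metrics.json` can be re-derived from the stored confusion matrices, which makes a quick consistency check possible. The sketch below (illustrative only, not part of the eval pipeline) recomputes the specificity macro F1, ordinal MAE, and quadratic-weighted kappa for iter1-dapt vs GPT-5.4 from `spec_confusion_matrix`, and checks the observation that `combined_macro_f1` equals the simple mean of the category and specificity macro F1s.

```python
import numpy as np

# Specificity confusion matrix for iter1-dapt vs GPT-5.4, copied from
# metrics.json above (rows = reference level L1..L4, columns = prediction).
cm = np.array([
    [588,  14,  13,   3],
    [ 32, 126,   8,   2],
    [ 11,   4, 191,   1],
    [  2,   2,  10, 193],
], dtype=float)

tp = np.diag(cm)
prec = tp / cm.sum(axis=0)        # column sums = predicted counts per level
rec = tp / cm.sum(axis=1)         # row sums = reference counts per level
f1 = 2 * prec * rec / (prec + rec)
spec_macro_f1 = f1.mean()         # matches reported spec_macro_f1 ≈ 0.8959

# Ordinal MAE: mean absolute level distance, matches spec_mae ≈ 0.1167
idx = np.arange(4)
dist = np.abs(idx[:, None] - idx[None, :])
mae = (dist * cm).sum() / cm.sum()

# Quadratic-weighted kappa, matches spec_qwk ≈ 0.9316
w = dist ** 2
expected = np.outer(cm.sum(axis=1), cm.sum(axis=0)) / cm.sum()
qwk = 1 - (w * cm).sum() / (w * expected).sum()

# combined_macro_f1 appears to be the simple mean of the two macro F1s
cat_macro_f1 = 0.9350000205815902  # from metrics.json above
combined = (cat_macro_f1 + spec_macro_f1) / 2  # ≈ 0.9155
```

The same arithmetic reproduces the Opus-4.6 block (e.g. its `combined_macro_f1` of 0.9051 is the mean of 0.9277 and 0.8824), so the JSON is internally consistent across both references.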
results/eval/iter1-dapt/report_gpt-54.txt (new file, 54 lines)
@@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-dapt vs GPT-5.4
======================================================================

Samples evaluated:    1200
Total inference time: 6.86s
Avg latency:          5.71ms/sample
Throughput:           175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
  Macro F1:       0.9350 ✓ (target: 0.80)
  Weighted F1:    0.9360
  Macro Prec:     0.9345
  Macro Recall:   0.9379
  MCC:            0.9246
  AUC (OvR):      0.9916
  ECE:            0.0494
  Kripp Alpha:    0.9243

  Category                    F1       Prec     Recall
  ------------------------- -------- -------- --------
  Board Governance            0.9697   0.9655   0.9739
  Incident Disclosure         0.9540   0.9651   0.9432
  Management Role             0.9446   0.9236   0.9667
  None/Other                  0.8949   0.8302   0.9706
  Risk Management Process     0.8623   0.8877   0.8384
  Strategy Integration        0.9631   0.9812   0.9457
  Third-Party Risk            0.9563   0.9880   0.9266

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
  Macro F1:       0.8959 ✓ (target: 0.80)
  Weighted F1:    0.9141
  Macro Prec:     0.9055
  Macro Recall:   0.8891
  MCC:            0.8699
  AUC (OvR):      0.9806
  QWK:            0.9316
  MAE:            0.1167
  ECE:            0.0693
  Kripp Alpha:    0.9194

  Level                       F1       Prec     Recall
  ------------------------- -------- -------- --------
  L1: Generic                 0.9400   0.9289   0.9515
  L2: Domain                  0.8025   0.8630   0.7500
  L3: Firm-Specific           0.8904   0.8604   0.9227
  L4: Quantified              0.9507   0.9698   0.9324

======================================================================
results/eval/iter1-dapt/report_opus-46.txt (new file, 54 lines)
@@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-dapt vs Opus-4.6
======================================================================

Samples evaluated:    1200
Total inference time: 6.86s
Avg latency:          5.71ms/sample
Throughput:           175 samples/sec

──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
  Macro F1:       0.9277 ✓ (target: 0.80)
  Weighted F1:    0.9268
  Macro Prec:     0.9238
  Macro Recall:   0.9349
  MCC:            0.9150
  AUC (OvR):      0.9934
  ECE:            0.0574
  Kripp Alpha:    0.9144

  Category                    F1       Prec     Recall
  ------------------------- -------- -------- --------
  Board Governance            0.9417   0.9052   0.9813
  Incident Disclosure         0.9333   0.8953   0.9747
  Management Role             0.9206   0.9236   0.9177
  None/Other                  0.9267   0.8742   0.9858
  Risk Management Process     0.8600   0.9198   0.8075
  Strategy Integration        0.9569   0.9906   0.9254
  Third-Party Risk            0.9550   0.9578   0.9521

──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
  Macro F1:       0.8824 ✓ (target: 0.80)
  Weighted F1:    0.8997
  Macro Prec:     0.8895
  Macro Recall:   0.8784
  MCC:            0.8492
  AUC (OvR):      0.9732
  QWK:            0.9200
  MAE:            0.1383
  ECE:            0.0801
  Kripp Alpha:    0.9048

  Level                       F1       Prec     Recall
  ------------------------- -------- -------- --------
  L1: Generic                 0.9257   0.9052   0.9471
  L2: Domain                  0.7835   0.7808   0.7862
  L3: Firm-Specific           0.8589   0.9324   0.7962
  L4: Quantified              0.9614   0.9397   0.9842

======================================================================