testing clspool and dapt on new architecture

This commit is contained in:
Joey Eamigh 2026-04-07 00:51:48 -04:00
parent edcffbcc78
commit 07dc3d6133
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
30 changed files with 1086 additions and 0 deletions


@ -901,6 +901,202 @@ thus invariant under T > 0.
Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
### 10.5 Pooling Ablation (Attention vs [CLS])
**Motivation:** The spec F1 jump from 0.517 → 0.945 was credited to three
architectural changes — independent threshold heads, attention pooling, and
confidence filtering. Independent thresholds were ablated against CORAL;
confidence filtering was ablated in §10.3 (null result). Attention pooling
had never been isolated. We needed to know whether it actually matters or
whether independent thresholds carry all the gain.
**Setup:** `iter1-clspool.yaml` — identical iter1 config but with
`pooling: cls`. Same seed (42), same 11 epochs, confidence filtering on.
**Results:**
| Config | Val Cat F1 | Val Spec F1 | Val Combined | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
|--------|-----------:|------------:|-------------:|-------------------------:|--------------------------:|
| iter1 (attention) | 0.9430 | 0.9450 | 0.9440 | 0.9343 | 0.8950 |
| iter1-clspool ([CLS])| 0.9368 | 0.9414 | 0.9391 | 0.9296 | 0.8920 |
| **Δ (attention − [CLS])** | **+0.006** | **+0.004** | **+0.005** | **+0.005** | **+0.003** |
**Finding:** Attention pooling is consistently better than [CLS] pooling
across all metrics and both references, but the effect is **small**:
0.003-0.006 F1. That is roughly 2-3× the seed-level std (±0.002), so the
direction is credible but the magnitude is modest. Attention pooling is
doing real work ("one CISO mention anywhere matters"), but the independent
threshold heads clearly carry the majority of the architecture win.
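The two pooling variants differ only in how per-token hidden states collapse to one paragraph vector. A minimal numpy sketch of that difference (the scoring vector `w` is a stand-in assumption for the actual learned attention head):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def cls_pool(hidden):
    # [CLS] pooling: the first token's vector represents the paragraph
    return hidden[0]

def attention_pool(hidden, w):
    # score every token, softmax over the sequence, weighted sum:
    # a single decisive token anywhere can dominate the pooled vector
    alpha = softmax(hidden @ w)   # (seq_len,) attention weights
    return alpha @ hidden         # (dim,)

# One strong token late in the sequence ("one CISO mention anywhere")
H = np.zeros((4, 3))
H[2] = np.array([10.0, 0.0, 0.0])
w = np.array([1.0, 0.0, 0.0])

assert cls_pool(H).sum() == 0.0                        # [CLS] misses it
assert np.allclose(attention_pool(H, w), H[2], atol=1e-2)  # attention finds it
```

Under [CLS] pooling the decisive token only helps insofar as self-attention has already routed it into position 0; attention pooling reads it directly, which is consistent with the small-but-consistent gain above.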
**Interpretation for the paper:** We can report this cleanly as "attention
pooling contributes a small but consistent improvement over [CLS] pooling
(~+0.005 F1 on both heads); the bulk of the CORAL → independent-threshold
gain (~+0.43 on spec F1) is attributable to the decoupled threshold weights,
not the pooling change." This is honest and gives each design choice its
proper credit.
Output: `checkpoints/finetune/iter1-clspool/`, `results/eval/iter1-clspool/`.
### 10.6 DAPT Re-Test with New Architecture
**Motivation:** During the original 12-config ablation grid (CORAL +
[CLS] pooling), DAPT and TAPT both *hurt* — base ModernBERT-large
outperformed DAPT and TAPT checkpoints on every loss combination. That was
reported as a noteworthy null result. But the architecture has changed
substantially since then (independent thresholds, attention pooling). The
verdict on DAPT could now flip: maybe the DAPT vocabulary signal was
previously wasted on a model that couldn't use it.
**Setup:** `iter1-dapt.yaml` — identical iter1 config but
`model.name_or_path` points at `checkpoints/dapt/modernbert-large/final`
(eval loss 0.7250 from Phase 7). Same seed, 11 epochs, attention pooling,
independent threshold heads, confidence filtering on.
**Results (epoch 11 — final checkpoint):**
| Config | Val Cat F1 | Val Spec F1 | Val Combined | Val NLL (ep 11) | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
|--------|-----------:|------------:|-------------:|----------------:|-------------------------:|--------------------------:|
| iter1 (base ModernBERT, seed 69) | 0.9384 | 0.9462 | 0.9423 | 0.511 | — | — |
| iter1 (base ModernBERT, seed 42) | 0.9430 | 0.9450 | 0.9440 | — | 0.9343 | 0.8950 |
| iter1-dapt (DAPT init) | 0.9500 | 0.9462 | 0.9481 | 0.494 | 0.9350 | 0.8959 |
| **Δ (DAPT − base)** | **+0.007** | **+0.001** | **+0.004** | **−0.017** | +0.001 | +0.001 |
**Per-epoch val NLL trajectory (confirmed not overfitting-driven):**
| Epoch | seed 69 (no DAPT) | DAPT | Δ (no DAPT − DAPT) |
|-------|------------------:|-----:|-------------------:|
| 1 | 0.376 | 0.346 | 0.030 |
| 2 | 0.337 | **0.318** (best) | 0.019 |
| 3 | **0.333** (best) | 0.331 | 0.002 |
| 5 | 0.394 | 0.385 | 0.009 |
| 8 | 0.493 | 0.482 | 0.011 |
| 11 | 0.511 | 0.494 | 0.017 |
Both runs peak at epoch 2-3 and then overfit steadily. The overfit gap
(val NLL at epoch 11 minus best) is **0.178 for the baseline** and
**0.176 for DAPT** — essentially identical. DAPT is not overfitting worse;
it is **starting from a better representation** and maintaining the same
generalization gap through training.
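The overfit-gap arithmetic above can be checked in a couple of lines, with values copied from the trajectory table:

```python
# Per-epoch val NLL from the table above (epochs shown in the log)
baseline_nll = {1: 0.376, 2: 0.337, 3: 0.333, 5: 0.394, 8: 0.493, 11: 0.511}
dapt_nll     = {1: 0.346, 2: 0.318, 3: 0.331, 5: 0.385, 8: 0.482, 11: 0.494}

def overfit_gap(nll_by_epoch, final_epoch=11):
    # final-epoch NLL minus the best (minimum) NLL reached during training
    return round(nll_by_epoch[final_epoch] - min(nll_by_epoch.values()), 3)

print(overfit_gap(baseline_nll))  # 0.178
print(overfit_gap(dapt_nll))      # 0.176
```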
**Finding — a more nuanced null:** DAPT initialization genuinely improves
val NLL by ~4.5% at the best checkpoint (0.333 → 0.318), with a matching
+0.007 category F1 improvement on val. The improvement is real and not a
side-effect of overfitting: the train/val gap is unchanged. But the holdout
gain is **+0.001** on both heads — within seed-level noise and nowhere near
the val improvement. Something interesting is happening:
- DAPT helps the model fit in-distribution data more tightly (val F1 gain +
NLL drop)
- That extra fit does not generalize to the stratified holdout
- The holdout oversamples minority classes (L2, TP, ID) relative to the
training distribution; DAPT's benefit is on the head of the distribution
**Interpretation for the paper:** This is a more interesting null result
than the original "DAPT/TAPT did not help." The revised claim is:
> *"Domain-adaptive pretraining improves in-distribution val NLL by ~4.5%
> at the best checkpoint (0.333 → 0.318) and provides a modest val F1 gain
> (+0.007 cat, +0.004 combined) under the independent-threshold +
> attention-pooling architecture. The generalization gap (difference between
> best val NLL and final val NLL) is unchanged by DAPT (0.178 vs 0.176),
> confirming that DAPT is providing a better initialization rather than
> just enabling overfitting. However, this val improvement does not
> transfer to the stratified holdout — DAPT produces a model that is
> better-calibrated on paragraphs similar to the training distribution,
> yet no more generalizable to the rare-class boundary cases (L2, TP, ID)
> that macro F1 weighs heavily. Our original finding (DAPT does not help
> final macro F1) is reaffirmed; the mechanism is now clearer."*
This is stronger than the original null because we can now point to a
specific, measurable effect of DAPT (val NLL) distinct from overfitting,
and explain why it doesn't show up in the headline macro F1 metric.
The non-DAPT 3-seed ensemble remains the recommended headline checkpoint.
The DAPT run is reportable as an ablation and a more precise null.
Output: `checkpoints/finetune/iter1-dapt/`, `results/eval/iter1-dapt/`.
### 10.7 The NLL-vs-F1 Decoupling and the Overfit Story
Investigating the DAPT ablation (§10.6) surfaced a general property of
every run in Phase 10 worth documenting explicitly, because it affects how
the paper should report training dynamics.
**Observation:** In every independent-threshold run (seeds 42/69/420,
iter1-nofilter, iter1-clspool, iter1-dapt), **val NLL bottoms out at
epoch 2-3 and then climbs monotonically through epoch 11, while val macro
F1 peaks at epoch 8 and plateaus.** The two metrics disagree about when
the model is at its best.
**Per-epoch val NLL, representative run (seed 69):**
| Epoch | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|-------|---|---|---|---|---|---|---|---|---|----|----|
| Val NLL | 0.376 | 0.337 | **0.333** | 0.369 | 0.394 | 0.443 | 0.472 | 0.493 | 0.505 | — | 0.511 |
| Val F1 | ~0.90 | ~0.92 | ~0.925 | ~0.932 | ~0.938 | ~0.941 | ~0.942 | **~0.944** | ~0.944 | ~0.944 | ~0.943 |
**Interpretation:** Past epoch 3, continued training memorizes *confidence*,
not *decisions*. Two things happen simultaneously:
1. Training-set probabilities are pushed toward 0/1 (training loss → 0)
2. Very few argmax decision boundaries shift
For val examples the model already gets right, sharpening is neutral-to-bad
for NLL and neutral-to-good for F1. For val examples the model gets wrong,
continued training makes the prediction *more confidently wrong* — terrible
for NLL (log-penalty grows), irrelevant for F1 (still wrong by argmax).
Net: NLL climbs, F1 inches up as a small number of borderline examples
flip to the correct side.
This is a well-documented decoupling in deep classifiers, not a pathology
specific to this model.
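The mechanism can be demonstrated with two toy logit vectors (hypothetical numbers, not taken from the model): sharpening a confidently-wrong prediction leaves the argmax, and hence F1, untouched, while the log-penalty on the true class grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nll(logits, label):
    # negative log-likelihood of the true class
    return -np.log(softmax(logits)[label])

# A val example the model gets *wrong*: true class is 0, argmax is 1.
early = np.array([1.0, 2.0, 0.0])   # epoch-3-like: mildly confident
late  = np.array([2.0, 6.0, 0.0])   # epoch-8-like: same decision, sharper

assert early.argmax() == late.argmax()   # argmax unchanged -> F1 unchanged
assert nll(late, 0) > nll(early, 0)      # more confidently wrong -> NLL grows
```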
**Is it a problem for the F1 claim? No.** Model selection uses val F1, so
we pick the epoch where F1 peaks (epoch 8). Val F1 at the selected
checkpoint (0.943/0.945) closely tracks holdout F1 against proxy gold
(0.934/0.895) — a ~0.01 category gap and ~0.05 specificity gap. The
decision boundaries generalized. The model did not overfit the *task*.
**Is it a problem for the probability claim? Yes, but measurable and
fixable.** Raw logits at epoch 8 are overconfident, which is exactly what
the pre-scaling ECE measured (0.05-0.08). The fitted temperatures
(T_cat = 1.76, T_spec = 2.46) are a direct quantification of how
overconfident the model became between epoch 3 and epoch 8: T > 1 means
"divide logits to cool them off." Temperature scaling (§10.4) recovers
calibration without touching predictions, so the cost of training to
epoch 8 instead of epoch 3 is paid in a scalar that's learned in ~1 second
on val.
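A minimal sketch of that one-scalar fit, on synthetic logits and with a grid search standing in for whatever optimizer the pipeline actually uses: dividing logits by T > 1 cools confidence without moving a single argmax.

```python
import numpy as np

def mean_nll(logits, labels, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    # one scalar, fitted on val; grid search is an assumption here
    grid = np.linspace(0.5, 5.0, 451)
    return grid[np.argmin([mean_nll(logits, labels, T) for T in grid])]

# Synthetic overconfident val set: sharp and right on ~85% of samples,
# sharp and wrong on ~15% (logits flipped)
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=400)
logits = rng.normal(size=(400, 4))
logits[np.arange(400), labels] += 4.0
wrong = rng.random(400) < 0.15
logits[wrong] *= -1.0

T = fit_temperature(logits, labels)
assert T > 1.0                                             # cooling is optimal
assert (logits.argmax(1) == (logits / T).argmax(1)).all()  # decisions untouched
```

Because division by a positive scalar preserves every argmax, F1 is exactly unchanged; only the probabilities move, which is why the §10.4 numbers show ECE improving with F1 flat.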
**Is it a problem for the holdout claim? No, by construction.** The
holdout was never touched during training. The train/val loss gap measures
memorization of the training distribution; the holdout measures
generalization to a distributionally distinct sample. These are independent
signals and both tell a consistent story: decision boundaries transfer,
probability calibration does not.
**Why not just stop at epoch 3?** Because you'd save ~0.18 in val NLL and
lose ~0.02 in val F1. Epochs 3 → 8 buy ~0.015-0.020 F1 at the cost of
calibration that temperature scaling mechanically recovers. For a
task where F1 is the rubric metric, that is a good trade. Were this a
deployment where confidence scores drive downstream decisions (e.g., a
human-in-the-loop review queue prioritizing low-confidence paragraphs),
epoch 3 + no temperature scaling would be a reasonable alternative choice.
**Paper framing:**
> *"Val NLL minimizes at epoch 2-3 while val macro F1 peaks at epoch 8 — a
> well-documented decoupling between calibration and decision quality in
> deep classifiers. We select checkpoints by F1, report pre- and
> post-temperature-scaling ECE separately, and verify generalization via
> an untouched stratified holdout. The model's val-holdout F1 gap (~0.01
> category, ~0.05 specificity) is within the inter-reference agreement
> ceiling, confirming decision-boundary generalization despite
> in-distribution confidence memorization. Temperature scaling recovers
> calibration (reducing ECE by 33% cat, 40% spec) without altering predictions."*
### Phase 10 Summary
| Experiment | Cost | Outcome | Paper value |
|---|---|---|---|
@ -909,6 +1105,8 @@ Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
| Dictionary baseline | ~1 min CPU | Cat 0.55, Spec 0.66 — clear gap to learned model | A-rubric "additional baselines" item |
| Confidence-filter ablation | ~3h GPU | Null result — filtering does not matter | Justifies architecture, not data engineering |
| Temperature scaling | ~10 min GPU | ECE ↓33% cat, ↓40% spec, F1 unchanged | Calibration story, deployment quality |
| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds |
| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization |
The 3-seed ensemble is now the recommended headline checkpoint. The
calibrated ECE numbers should replace the pre-scaling ECE in the paper.


@ -156,6 +156,8 @@
- [x] Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed
- [x] Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context
- [x] Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement
- [x] Pooling ablation (attention vs CLS) — attention +0.005 F1 consistent; small but credible effect
- [x] DAPT re-test with new architecture — val +0.007 cat F1, best val NLL 0.333→0.318 (4.5%), generalization gap unchanged; holdout gain ~0.001 (better init, not better generalization)
- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result


@ -0,0 +1,37 @@
model:
name_or_path: answerdotai/ModernBERT-large
data:
paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl
consensus_path: ../data/annotations/v2-stage1/consensus.jsonl
quality_path: ../data/paragraphs/quality/quality-scores.jsonl
holdout_path: ../data/gold/v2-holdout-ids.json
max_seq_length: 512
validation_split: 0.1
training:
output_dir: ../checkpoints/finetune/iter1-clspool
learning_rate: 0.00005
num_train_epochs: 11
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 1
warmup_ratio: 0.1
weight_decay: 0.01
dropout: 0.1
bf16: true
gradient_checkpointing: false
logging_steps: 50
save_total_limit: 3
dataloader_num_workers: 4
seed: 42
loss_type: ce
focal_gamma: 2.0
class_weighting: true
category_loss_weight: 1.0
specificity_loss_weight: 1.0
specificity_head: independent
spec_mlp_dim: 256
pooling: cls
ordinal_consistency_weight: 0.1
filter_spec_confidence: true


@ -0,0 +1,37 @@
model:
name_or_path: ../checkpoints/dapt/modernbert-large/final
data:
paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl
consensus_path: ../data/annotations/v2-stage1/consensus.jsonl
quality_path: ../data/paragraphs/quality/quality-scores.jsonl
holdout_path: ../data/gold/v2-holdout-ids.json
max_seq_length: 512
validation_split: 0.1
training:
output_dir: ../checkpoints/finetune/iter1-dapt
learning_rate: 0.00005
num_train_epochs: 11
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 1
warmup_ratio: 0.1
weight_decay: 0.01
dropout: 0.1
bf16: true
gradient_checkpointing: false
logging_steps: 50
save_total_limit: 3
dataloader_num_workers: 4
seed: 42
loss_type: ce
focal_gamma: 2.0
class_weighting: true
category_loss_weight: 1.0
specificity_loss_weight: 1.0
specificity_head: independent
spec_mlp_dim: 256
pooling: attention
ordinal_consistency_weight: 0.1
filter_spec_confidence: true

10 binary image files added (not shown).


@ -0,0 +1,298 @@
{
"iter1-clspool_vs_GPT-5.4": {
"cat_macro_f1": 0.9296272782528762,
"cat_weighted_f1": 0.9306824376807155,
"cat_macro_precision": 0.9289887550616817,
"cat_macro_recall": 0.9334375025997984,
"cat_mcc": 0.9179226636085169,
"cat_auc": 0.9911299127522846,
"cat_ece": 0.05557066917419438,
"cat_confusion_matrix": [
[
217,
0,
8,
3,
2,
0,
0
],
[
0,
83,
0,
2,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
1,
0,
2,
132,
1,
0,
0
],
[
6,
1,
5,
17,
167,
1,
1
],
[
0,
2,
1,
8,
2,
208,
0
],
[
0,
0,
0,
1,
11,
0,
165
]
],
"cat_f1_BoardGov": 0.9517543859649122,
"cat_prec_BoardGov": 0.9601769911504425,
"cat_recall_BoardGov": 0.9434782608695652,
"cat_f1_Incident": 0.9540229885057471,
"cat_prec_Incident": 0.9651162790697675,
"cat_recall_Incident": 0.9431818181818182,
"cat_f1_Manageme": 0.9290322580645162,
"cat_prec_Manageme": 0.9,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.88,
"cat_prec_NoneOthe": 0.8048780487804879,
"cat_recall_NoneOthe": 0.9705882352941176,
"cat_f1_RiskMana": 0.8652849740932642,
"cat_prec_RiskMana": 0.8882978723404256,
"cat_recall_RiskMana": 0.8434343434343434,
"cat_f1_Strategy": 0.9651972157772621,
"cat_prec_Strategy": 0.9904761904761905,
"cat_recall_Strategy": 0.9411764705882353,
"cat_f1_Third-Pa": 0.9620991253644315,
"cat_prec_Third-Pa": 0.9939759036144579,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9174669822467758,
"spec_macro_f1": 0.892010224838834,
"spec_weighted_f1": 0.9098424770121019,
"spec_macro_precision": 0.9042493173083448,
"spec_macro_recall": 0.8836163792237031,
"spec_mcc": 0.8634241541671751,
"spec_auc": 0.9777836963763646,
"spec_ece": 0.07659540871779125,
"spec_confusion_matrix": [
[
587,
11,
17,
3
],
[
32,
125,
9,
2
],
[
14,
4,
187,
2
],
[
3,
1,
9,
194
]
],
"spec_f1_L1Generi": 0.9362041467304625,
"spec_prec_L1Generi": 0.9229559748427673,
"spec_recall_L1Generi": 0.9498381877022654,
"spec_f1_L2Domain": 0.8090614886731392,
"spec_prec_L2Domain": 0.8865248226950354,
"spec_recall_L2Domain": 0.7440476190476191,
"spec_f1_L3Firm-S": 0.8717948717948718,
"spec_prec_L3Firm-S": 0.8423423423423423,
"spec_recall_L3Firm-S": 0.9033816425120773,
"spec_f1_L4Quanti": 0.9509803921568627,
"spec_prec_L4Quanti": 0.9651741293532339,
"spec_recall_L4Quanti": 0.9371980676328503,
"spec_qwk": 0.9224750079938221,
"spec_mae": 0.1275,
"spec_kripp_alpha": 0.9099809044589873,
"total_time_s": 6.83874113188358,
"num_samples": 1200,
"avg_ms_per_sample": 5.698950943236317,
"combined_macro_f1": 0.910818751545855
},
"iter1-clspool_vs_Opus-4.6": {
"cat_macro_f1": 0.9228949790380195,
"cat_weighted_f1": 0.9228190044594041,
"cat_macro_precision": 0.9183239817151002,
"cat_macro_recall": 0.9310538134995027,
"cat_mcc": 0.9101930161599978,
"cat_auc": 0.9924519781241848,
"cat_ece": 0.06223733584086104,
"cat_confusion_matrix": [
[
208,
0,
3,
3,
0,
0,
0
],
[
0,
76,
0,
1,
2,
0,
0
],
[
5,
0,
147,
1,
4,
0,
1
],
[
0,
0,
0,
139,
2,
0,
0
],
[
12,
1,
9,
14,
171,
1,
5
],
[
1,
9,
1,
6,
2,
208,
1
],
[
0,
0,
0,
0,
7,
1,
159
]
],
"cat_f1_BoardGov": 0.9454545454545454,
"cat_prec_BoardGov": 0.9203539823008849,
"cat_recall_BoardGov": 0.9719626168224299,
"cat_f1_Incident": 0.9212121212121213,
"cat_prec_Incident": 0.8837209302325582,
"cat_recall_Incident": 0.9620253164556962,
"cat_f1_Manageme": 0.9245283018867925,
"cat_prec_Manageme": 0.91875,
"cat_recall_Manageme": 0.930379746835443,
"cat_f1_NoneOthe": 0.9114754098360656,
"cat_prec_NoneOthe": 0.8475609756097561,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.8528678304239401,
"cat_prec_RiskMana": 0.9095744680851063,
"cat_recall_RiskMana": 0.8028169014084507,
"cat_f1_Strategy": 0.9497716894977168,
"cat_prec_Strategy": 0.9904761904761905,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.954954954954955,
"cat_prec_Third-Pa": 0.9578313253012049,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9095735484151157,
"spec_macro_f1": 0.8804386286358235,
"spec_weighted_f1": 0.8975676999782217,
"spec_macro_precision": 0.8892226854649037,
"spec_macro_recall": 0.8750457181821643,
"spec_mcc": 0.8465565454059848,
"spec_auc": 0.9697722386763277,
"spec_ece": 0.08741456707318629,
"spec_confusion_matrix": [
[
575,
19,
10,
1
],
[
26,
114,
4,
1
],
[
35,
8,
204,
13
],
[
0,
0,
4,
186
]
],
"spec_f1_L1Generi": 0.9266720386784851,
"spec_prec_L1Generi": 0.9040880503144654,
"spec_recall_L1Generi": 0.9504132231404959,
"spec_f1_L2Domain": 0.7972027972027972,
"spec_prec_L2Domain": 0.8085106382978723,
"spec_recall_L2Domain": 0.7862068965517242,
"spec_f1_L3Firm-S": 0.8464730290456431,
"spec_prec_L3Firm-S": 0.918918918918919,
"spec_recall_L3Firm-S": 0.7846153846153846,
"spec_f1_L4Quanti": 0.9514066496163683,
"spec_prec_L4Quanti": 0.9253731343283582,
"spec_recall_L4Quanti": 0.9789473684210527,
"spec_qwk": 0.9187882106031572,
"spec_mae": 0.14083333333333334,
"spec_kripp_alpha": 0.9041056117796359,
"total_time_s": 6.83874113188358,
"num_samples": 1200,
"avg_ms_per_sample": 5.698950943236317,
"combined_macro_f1": 0.9016668038369215
}
}


@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-clspool vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.84s
Avg latency: 5.70ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9296 ✓ (target: 0.80)
Weighted F1: 0.9307
Macro Prec: 0.9290
Macro Recall: 0.9334
MCC: 0.9179
AUC (OvR): 0.9911
ECE: 0.0556
Kripp Alpha: 0.9175
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9518 0.9602 0.9435
Incident Disclosure 0.9540 0.9651 0.9432
Management Role 0.9290 0.9000 0.9600
None/Other 0.8800 0.8049 0.9706
Risk Management Process 0.8653 0.8883 0.8434
Strategy Integration 0.9652 0.9905 0.9412
Third-Party Risk 0.9621 0.9940 0.9322
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8920 ✓ (target: 0.80)
Weighted F1: 0.9098
Macro Prec: 0.9042
Macro Recall: 0.8836
MCC: 0.8634
AUC (OvR): 0.9778
QWK: 0.9225
MAE: 0.1275
ECE: 0.0766
Kripp Alpha: 0.9100
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9362 0.9230 0.9498
L2: Domain 0.8091 0.8865 0.7440
L3: Firm-Specific 0.8718 0.8423 0.9034
L4: Quantified 0.9510 0.9652 0.9372
======================================================================


@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-clspool vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.84s
Avg latency: 5.70ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9229 ✓ (target: 0.80)
Weighted F1: 0.9228
Macro Prec: 0.9183
Macro Recall: 0.9311
MCC: 0.9102
AUC (OvR): 0.9925
ECE: 0.0622
Kripp Alpha: 0.9096
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9455 0.9204 0.9720
Incident Disclosure 0.9212 0.8837 0.9620
Management Role 0.9245 0.9187 0.9304
None/Other 0.9115 0.8476 0.9858
Risk Management Process 0.8529 0.9096 0.8028
Strategy Integration 0.9498 0.9905 0.9123
Third-Party Risk 0.9550 0.9578 0.9521
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8804 ✓ (target: 0.80)
Weighted F1: 0.8976
Macro Prec: 0.8892
Macro Recall: 0.8750
MCC: 0.8466
AUC (OvR): 0.9698
QWK: 0.9188
MAE: 0.1408
ECE: 0.0874
Kripp Alpha: 0.9041
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9267 0.9041 0.9504
L2: Domain 0.7972 0.8085 0.7862
L3: Firm-Specific 0.8465 0.9189 0.7846
L4: Quantified 0.9514 0.9254 0.9789
======================================================================

10 binary image files added (not shown).


@ -0,0 +1,298 @@
{
"iter1-dapt_vs_GPT-5.4": {
"cat_macro_f1": 0.9350000205815902,
"cat_weighted_f1": 0.936034565494772,
"cat_macro_precision": 0.9344660111343602,
"cat_macro_recall": 0.9378555188267356,
"cat_mcc": 0.9246263785540332,
"cat_auc": 0.9915953686916092,
"cat_ece": 0.04942640244960788,
"cat_confusion_matrix": [
[
224,
0,
4,
0,
2,
0,
0
],
[
0,
83,
0,
0,
2,
2,
1
],
[
2,
0,
145,
1,
2,
0,
0
],
[
0,
0,
2,
132,
1,
1,
0
],
[
6,
1,
5,
18,
166,
1,
1
],
[
0,
2,
1,
8,
1,
209,
0
],
[
0,
0,
0,
0,
13,
0,
164
]
],
"cat_f1_BoardGov": 0.9696969696969697,
"cat_prec_BoardGov": 0.9655172413793104,
"cat_recall_BoardGov": 0.9739130434782609,
"cat_f1_Incident": 0.9540229885057471,
"cat_prec_Incident": 0.9651162790697675,
"cat_recall_Incident": 0.9431818181818182,
"cat_f1_Manageme": 0.9446254071661238,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.8949152542372881,
"cat_prec_NoneOthe": 0.8301886792452831,
"cat_recall_NoneOthe": 0.9705882352941176,
"cat_f1_RiskMana": 0.8623376623376623,
"cat_prec_RiskMana": 0.8877005347593583,
"cat_recall_RiskMana": 0.8383838383838383,
"cat_f1_Strategy": 0.9631336405529954,
"cat_prec_Strategy": 0.9812206572769953,
"cat_recall_Strategy": 0.9457013574660633,
"cat_f1_Third-Pa": 0.956268221574344,
"cat_prec_Third-Pa": 0.9879518072289156,
"cat_recall_Third-Pa": 0.9265536723163842,
"cat_kripp_alpha": 0.9243058890635424,
"spec_macro_f1": 0.8959443847575952,
"spec_weighted_f1": 0.914085249793483,
"spec_macro_precision": 0.9055333144570721,
"spec_macro_recall": 0.889132193611932,
"spec_mcc": 0.8698798188273218,
"spec_auc": 0.9806421467148638,
"spec_ece": 0.0693218584855397,
"spec_confusion_matrix": [
[
588,
14,
13,
3
],
[
32,
126,
8,
2
],
[
11,
4,
191,
1
],
[
2,
2,
10,
193
]
],
"spec_f1_L1Generi": 0.9400479616306955,
"spec_prec_L1Generi": 0.9289099526066351,
"spec_recall_L1Generi": 0.9514563106796117,
"spec_f1_L2Domain": 0.802547770700637,
"spec_prec_L2Domain": 0.863013698630137,
"spec_recall_L2Domain": 0.75,
"spec_f1_L3Firm-S": 0.8904428904428905,
"spec_prec_L3Firm-S": 0.8603603603603603,
"spec_recall_L3Firm-S": 0.9227053140096618,
"spec_f1_L4Quanti": 0.9507389162561576,
"spec_prec_L4Quanti": 0.9698492462311558,
"spec_recall_L4Quanti": 0.9323671497584541,
"spec_qwk": 0.9315994086072762,
"spec_mae": 0.11666666666666667,
"spec_kripp_alpha": 0.9194074359344485,
"total_time_s": 6.855555058107711,
"num_samples": 1200,
"avg_ms_per_sample": 5.712962548423093,
"combined_macro_f1": 0.9154722026695927
},
"iter1-dapt_vs_Opus-4.6": {
"cat_macro_f1": 0.9277442873196512,
"cat_weighted_f1": 0.9268438855804646,
"cat_macro_precision": 0.9237899595225246,
"cat_macro_recall": 0.9349393170438051,
"cat_mcc": 0.9150420281652446,
"cat_auc": 0.9934333602136249,
"cat_ece": 0.057411353190739985,
"cat_confusion_matrix": [
[
210,
0,
2,
1,
1,
0,
0
],
[
0,
77,
0,
0,
1,
0,
1
],
[
8,
0,
145,
1,
3,
0,
1
],
[
0,
0,
0,
139,
2,
0,
0
],
[
13,
0,
9,
13,
172,
1,
5
],
[
1,
9,
1,
4,
2,
211,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9417040358744395,
"cat_prec_BoardGov": 0.9051724137931034,
"cat_recall_BoardGov": 0.9813084112149533,
"cat_f1_Incident": 0.9333333333333333,
"cat_prec_Incident": 0.8953488372093024,
"cat_recall_Incident": 0.9746835443037974,
"cat_f1_Manageme": 0.9206349206349206,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9177215189873418,
"cat_f1_NoneOthe": 0.9266666666666666,
"cat_prec_NoneOthe": 0.8742138364779874,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.86,
"cat_prec_RiskMana": 0.9197860962566845,
"cat_recall_RiskMana": 0.8075117370892019,
"cat_f1_Strategy": 0.9569160997732427,
"cat_prec_Strategy": 0.9906103286384976,
"cat_recall_Strategy": 0.9254385964912281,
"cat_f1_Third-Pa": 0.954954954954955,
"cat_prec_Third-Pa": 0.9578313253012049,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9144489824694872,
"spec_macro_f1": 0.8823881241075249,
"spec_weighted_f1": 0.8997013825586678,
"spec_macro_precision": 0.8895415282112857,
"spec_macro_recall": 0.8784196767594721,
"spec_mcc": 0.84923108221758,
"spec_auc": 0.9732413764660657,
"spec_ece": 0.08008741805950799,
"spec_confusion_matrix": [
[
573,
22,
9,
1
],
[
26,
114,
3,
2
],
[
34,
10,
207,
9
],
[
0,
0,
3,
187
]
],
"spec_f1_L1Generi": 0.925686591276252,
"spec_prec_L1Generi": 0.9052132701421801,
"spec_recall_L1Generi": 0.947107438016529,
"spec_f1_L2Domain": 0.7835051546391752,
"spec_prec_L2Domain": 0.7808219178082192,
"spec_recall_L2Domain": 0.7862068965517242,
"spec_f1_L3Firm-S": 0.8589211618257261,
"spec_prec_L3Firm-S": 0.9324324324324325,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.961439588688946,
"spec_prec_L4Quanti": 0.9396984924623115,
"spec_recall_L4Quanti": 0.9842105263157894,
"spec_qwk": 0.9200429286057613,
"spec_mae": 0.13833333333333334,
"spec_kripp_alpha": 0.9047987190793844,
"total_time_s": 6.855555058107711,
"num_samples": 1200,
"avg_ms_per_sample": 5.712962548423093,
"combined_macro_f1": 0.9050662057135881
}
}


@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-dapt vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.86s
Avg latency: 5.71ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9350 ✓ (target: 0.80)
Weighted F1: 0.9360
Macro Prec: 0.9345
Macro Recall: 0.9379
MCC: 0.9246
AUC (OvR): 0.9916
ECE: 0.0494
Kripp Alpha: 0.9243
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9697 0.9655 0.9739
Incident Disclosure 0.9540 0.9651 0.9432
Management Role 0.9446 0.9236 0.9667
None/Other 0.8949 0.8302 0.9706
Risk Management Process 0.8623 0.8877 0.8384
Strategy Integration 0.9631 0.9812 0.9457
Third-Party Risk 0.9563 0.9880 0.9266
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8959 ✓ (target: 0.80)
Weighted F1: 0.9141
Macro Prec: 0.9055
Macro Recall: 0.8891
MCC: 0.8699
AUC (OvR): 0.9806
QWK: 0.9316
MAE: 0.1167
ECE: 0.0693
Kripp Alpha: 0.9194
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9400 0.9289 0.9515
L2: Domain 0.8025 0.8630 0.7500
L3: Firm-Specific 0.8904 0.8604 0.9227
L4: Quantified 0.9507 0.9698 0.9324
======================================================================

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-dapt vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.86s
Avg latency: 5.71ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9277 ✓ (target: 0.80)
Weighted F1: 0.9268
Macro Prec: 0.9238
Macro Recall: 0.9349
MCC: 0.9150
AUC (OvR): 0.9934
ECE: 0.0574
Kripp Alpha: 0.9144
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9417 0.9052 0.9813
Incident Disclosure 0.9333 0.8953 0.9747
Management Role 0.9206 0.9236 0.9177
None/Other 0.9267 0.8742 0.9858
Risk Management Process 0.8600 0.9198 0.8075
Strategy Integration 0.9569 0.9906 0.9254
Third-Party Risk 0.9550 0.9578 0.9521
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8824 ✓ (target: 0.80)
Weighted F1: 0.8997
Macro Prec: 0.8895
Macro Recall: 0.8784
MCC: 0.8492
AUC (OvR): 0.9732
QWK: 0.9200
MAE: 0.1383
ECE: 0.0801
Kripp Alpha: 0.9048
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9257 0.9052 0.9471
L2: Domain 0.7835 0.7808 0.7862
L3: Firm-Specific 0.8589 0.9324 0.7962
L4: Quantified 0.9614 0.9397 0.9842
======================================================================