testing clspool and dapt on new architecture

This commit is contained in:
Joey Eamigh 2026-04-07 00:51:48 -04:00
parent edcffbcc78
commit 07dc3d6133
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
30 changed files with 1086 additions and 0 deletions


@ -901,6 +901,202 @@ thus invariant under T > 0.
Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
### 10.5 Pooling Ablation (Attention vs [CLS])
**Motivation:** The spec F1 jump from 0.517 → 0.945 was credited to three
architectural changes — independent threshold heads, attention pooling, and
confidence filtering. Independent thresholds were ablated against CORAL;
confidence filtering was ablated in §10.3 (null result). Attention pooling
had never been isolated. We needed to know whether it actually matters or
whether independent thresholds carry all the gain.
**Setup:** `iter1-clspool.yaml` — identical iter1 config but with
`pooling: cls`. Same seed (42), same 11 epochs, confidence filtering on.
**Results:**
| Config | Val Cat F1 | Val Spec F1 | Val Combined | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
|--------|-----------:|------------:|-------------:|-------------------------:|--------------------------:|
| iter1 (attention) | 0.9430 | 0.9450 | 0.9440 | 0.9343 | 0.8950 |
| iter1-clspool ([CLS])| 0.9368 | 0.9414 | 0.9391 | 0.9296 | 0.8920 |
| **Δ (attention − [CLS])** | **+0.006** | **+0.004** | **+0.005** | **+0.005** | **+0.003** |
**Finding:** Attention pooling is consistently better than [CLS] pooling
across all metrics and both references, but the effect is **small**:
0.003-0.006 F1. That is roughly 2-3× the seed-level std (±0.002), so the
direction is credible but the magnitude is modest. Attention pooling is
doing real work ("one CISO mention anywhere matters"), but the independent
threshold heads clearly carry the majority of the architecture win.
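The two pooling variants differ only in how per-token hidden states collapse to one paragraph vector. A minimal numpy sketch of that difference (the scoring vector `w` is a stand-in assumption for the actual learned attention head):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def cls_pool(hidden):
    # [CLS] pooling: the first token's vector represents the paragraph
    return hidden[0]

def attention_pool(hidden, w):
    # score every token, softmax over the sequence, weighted sum:
    # a single decisive token anywhere can dominate the pooled vector
    alpha = softmax(hidden @ w)   # (seq_len,) attention weights
    return alpha @ hidden         # (dim,)

# One strong token late in the sequence ("one CISO mention anywhere")
H = np.zeros((4, 3))
H[2] = np.array([10.0, 0.0, 0.0])
w = np.array([1.0, 0.0, 0.0])

assert cls_pool(H).sum() == 0.0                        # [CLS] misses it
assert np.allclose(attention_pool(H, w), H[2], atol=1e-2)  # attention finds it
```

Under [CLS] pooling the decisive token only helps insofar as self-attention has already routed it into position 0; attention pooling reads it directly, which is consistent with the small-but-consistent gain above.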
**Interpretation for the paper:** We can report this cleanly as "attention
pooling contributes a small but consistent improvement over [CLS] pooling
(~+0.005 F1 on both heads); the bulk of the CORAL → independent-threshold
gain (~+0.43 on spec F1) is attributable to the decoupled threshold weights,
not the pooling change." This is honest and gives each design choice its
proper credit.
Output: `checkpoints/finetune/iter1-clspool/`, `results/eval/iter1-clspool/`.
### 10.6 DAPT Re-Test with New Architecture
**Motivation:** During the original 12-config ablation grid (CORAL +
[CLS] pooling), DAPT and TAPT both *hurt* — base ModernBERT-large
outperformed DAPT and TAPT checkpoints on every loss combination. That was
reported as a noteworthy null result. But the architecture has changed
substantially since then (independent thresholds, attention pooling). The
verdict on DAPT could now flip: maybe the DAPT vocabulary signal was
previously wasted on a model that couldn't use it.
**Setup:** `iter1-dapt.yaml` — identical iter1 config but
`model.name_or_path` points at `checkpoints/dapt/modernbert-large/final`
(eval loss 0.7250 from Phase 7). Same seed, 11 epochs, attention pooling,
independent threshold heads, confidence filtering on.
**Results (epoch 11 — final checkpoint):**
| Config | Val Cat F1 | Val Spec F1 | Val Combined | Val NLL (ep 11) | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
|--------|-----------:|------------:|-------------:|----------------:|-------------------------:|--------------------------:|
| iter1 (base ModernBERT, seed 69) | 0.9384 | 0.9462 | 0.9423 | 0.511 | — | — |
| iter1 (base ModernBERT, seed 42) | 0.9430 | 0.9450 | 0.9440 | — | 0.9343 | 0.8950 |
| iter1-dapt (DAPT init) | 0.9500 | 0.9462 | 0.9481 | 0.494 | 0.9350 | 0.8959 |
| **Δ (DAPT − base)** | **+0.007** | **+0.001** | **+0.004** | **−0.017** | +0.001 | +0.001 |
**Per-epoch val NLL trajectory (confirmed not overfitting-driven):**
| Epoch | seed 69 (no DAPT) | DAPT | Δ (no DAPT − DAPT) |
|-------|------------------:|-----:|-------------------:|
| 1 | 0.376 | 0.346 | 0.030 |
| 2 | 0.337 | **0.318** (best) | 0.019 |
| 3 | **0.333** (best) | 0.331 | 0.002 |
| 5 | 0.394 | 0.385 | 0.009 |
| 8 | 0.493 | 0.482 | 0.011 |
| 11 | 0.511 | 0.494 | 0.017 |
Both runs peak at epoch 2-3 and then overfit steadily. The overfit gap
(val NLL at epoch 11 minus best) is **0.178 for the baseline** and
**0.176 for DAPT** — essentially identical. DAPT is not overfitting worse;
it is **starting from a better representation** and maintaining the same
generalization gap through training.
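The overfit-gap arithmetic above can be checked in a couple of lines, with values copied from the trajectory table:

```python
# Per-epoch val NLL from the table above (epochs shown in the log)
baseline_nll = {1: 0.376, 2: 0.337, 3: 0.333, 5: 0.394, 8: 0.493, 11: 0.511}
dapt_nll     = {1: 0.346, 2: 0.318, 3: 0.331, 5: 0.385, 8: 0.482, 11: 0.494}

def overfit_gap(nll_by_epoch, final_epoch=11):
    # final-epoch NLL minus the best (minimum) NLL reached during training
    return round(nll_by_epoch[final_epoch] - min(nll_by_epoch.values()), 3)

print(overfit_gap(baseline_nll))  # 0.178
print(overfit_gap(dapt_nll))      # 0.176
```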
**Finding — a more nuanced null:** DAPT initialization genuinely improves
val NLL by ~4.5% at the best checkpoint (0.333 → 0.318), with a matching
+0.007 category F1 improvement on val. The improvement is real and not a
side-effect of overfitting: the train/val gap is unchanged. But the holdout
gain is **+0.001** on both heads — within seed-level noise and nowhere near
the val improvement. Something interesting is happening:
- DAPT helps the model fit in-distribution data more tightly (val F1 gain +
NLL drop)
- That extra fit does not generalize to the stratified holdout
- The holdout oversamples minority classes (L2, TP, ID) relative to the
training distribution; DAPT's benefit is on the head of the distribution
**Interpretation for the paper:** This is a more interesting null result
than the original "DAPT/TAPT did not help." The revised claim is:
> *"Domain-adaptive pretraining improves in-distribution val NLL by ~4.5%
> at the best checkpoint (0.333 → 0.318) and provides a modest val F1 gain
> (+0.007 cat, +0.004 combined) under the independent-threshold +
> attention-pooling architecture. The generalization gap (difference between
> best val NLL and final val NLL) is unchanged by DAPT (0.178 vs 0.176),
> confirming that DAPT is providing a better initialization rather than
> just enabling overfitting. However, this val improvement does not
> transfer to the stratified holdout — DAPT produces a model that is
> better-calibrated on paragraphs similar to the training distribution,
> yet no more generalizable to the rare-class boundary cases (L2, TP, ID)
> that macro F1 weighs heavily. Our original finding (DAPT does not help
> final macro F1) is reaffirmed; the mechanism is now clearer."*
This is stronger than the original null because we can now point to a
specific, measurable effect of DAPT (val NLL) distinct from overfitting,
and explain why it doesn't show up in the headline macro F1 metric.
The non-DAPT 3-seed ensemble remains the recommended headline checkpoint.
The DAPT run is reportable as an ablation and a more precise null.
Output: `checkpoints/finetune/iter1-dapt/`, `results/eval/iter1-dapt/`.
### 10.7 The NLL-vs-F1 Decoupling and the Overfit Story
Investigating the DAPT ablation (§10.6) surfaced a general property of
every run in Phase 10 worth documenting explicitly, because it affects how
the paper should report training dynamics.
**Observation:** In every independent-threshold run (seeds 42/69/420,
iter1-nofilter, iter1-clspool, iter1-dapt), **val NLL bottoms out at
epoch 2-3 and then climbs monotonically through epoch 11, while val macro
F1 peaks at epoch 8 and plateaus.** The two metrics disagree about when
the model is at its best.
**Per-epoch val NLL, representative run (seed 69):**
| Epoch | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|-------|---|---|---|---|---|---|---|---|---|----|----|
| Val NLL | 0.376 | 0.337 | **0.333** | 0.369 | 0.394 | 0.443 | 0.472 | 0.493 | 0.505 | — | 0.511 |
| Val F1 | ~0.90 | ~0.92 | ~0.925 | ~0.932 | ~0.938 | ~0.941 | ~0.942 | **~0.944** | ~0.944 | ~0.944 | ~0.943 |
**Interpretation:** Past epoch 3, continued training memorizes *confidence*,
not *decisions*. Two things happen simultaneously:
1. Training-set probabilities are pushed toward 0/1 (training loss → 0)
2. Very few argmax decision boundaries shift
For val examples the model already gets right, sharpening is neutral-to-bad
for NLL and neutral-to-good for F1. For val examples the model gets wrong,
continued training makes the prediction *more confidently wrong* — terrible
for NLL (log-penalty grows), irrelevant for F1 (still wrong by argmax).
Net: NLL climbs, F1 inches up as a small number of borderline examples
flip to the correct side.
This is a well-documented decoupling in deep classifiers, not a pathology
specific to this model.
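The mechanism can be demonstrated with two toy logit vectors (hypothetical numbers, not taken from the model): sharpening a confidently-wrong prediction leaves the argmax, and hence F1, untouched, while the log-penalty on the true class grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nll(logits, label):
    # negative log-likelihood of the true class
    return -np.log(softmax(logits)[label])

# A val example the model gets *wrong*: true class is 0, argmax is 1.
early = np.array([1.0, 2.0, 0.0])   # epoch-3-like: mildly confident
late  = np.array([2.0, 6.0, 0.0])   # epoch-8-like: same decision, sharper

assert early.argmax() == late.argmax()   # argmax unchanged -> F1 unchanged
assert nll(late, 0) > nll(early, 0)      # more confidently wrong -> NLL grows
```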
**Is it a problem for the F1 claim? No.** Model selection uses val F1, so
we pick the epoch where F1 peaks (epoch 8). Val F1 at the selected
checkpoint (0.943/0.945) closely tracks holdout F1 against proxy gold
(0.934/0.895) — a ~0.01 category gap and ~0.05 specificity gap. The
decision boundaries generalized. The model did not overfit the *task*.
**Is it a problem for the probability claim? Yes, but measurable and
fixable.** Raw logits at epoch 8 are overconfident, which is exactly what
the pre-scaling ECE measured (0.05-0.08). The fitted temperatures
(T_cat = 1.76, T_spec = 2.46) are a direct quantification of how
overconfident the model became between epoch 3 and epoch 8: T > 1 means
"divide logits to cool them off." Temperature scaling (§10.4) recovers
calibration without touching predictions, so the cost of training to
epoch 8 instead of epoch 3 is paid in a scalar that's learned in ~1 second
on val.
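A minimal sketch of that one-scalar fit, on synthetic logits and with a grid search standing in for whatever optimizer the pipeline actually uses: dividing logits by T > 1 cools confidence without moving a single argmax.

```python
import numpy as np

def mean_nll(logits, labels, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    # one scalar, fitted on val; grid search is an assumption here
    grid = np.linspace(0.5, 5.0, 451)
    return grid[np.argmin([mean_nll(logits, labels, T) for T in grid])]

# Synthetic overconfident val set: sharp and right on ~85% of samples,
# sharp and wrong on ~15% (logits flipped)
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=400)
logits = rng.normal(size=(400, 4))
logits[np.arange(400), labels] += 4.0
wrong = rng.random(400) < 0.15
logits[wrong] *= -1.0

T = fit_temperature(logits, labels)
assert T > 1.0                                             # cooling is optimal
assert (logits.argmax(1) == (logits / T).argmax(1)).all()  # decisions untouched
```

Because division by a positive scalar preserves every argmax, F1 is exactly unchanged; only the probabilities move, which is why the §10.4 numbers show ECE improving with F1 flat.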
**Is it a problem for the holdout claim? No, by construction.** The
holdout was never touched during training. The train/val loss gap measures
memorization of the training distribution; the holdout measures
generalization to a distributionally distinct sample. These are independent
signals and both tell a consistent story: decision boundaries transfer,
probability calibration does not.
**Why not just stop at epoch 3?** Because you'd save ~0.18 in val NLL and
lose ~0.02 in val F1. Epochs 3 → 8 buy ~0.015-0.020 F1 at the cost of
calibration that temperature scaling mechanically recovers. For a
task where F1 is the rubric metric, that is a good trade. Were this a
deployment where confidence scores drive downstream decisions (e.g., a
human-in-the-loop review queue prioritizing low-confidence paragraphs),
epoch 3 + no temperature scaling would be a reasonable alternative choice.
**Paper framing:**
> *"Val NLL minimizes at epoch 2-3 while val macro F1 peaks at epoch 8 — a
> well-documented decoupling between calibration and decision quality in
> deep classifiers. We select checkpoints by F1, report pre- and
> post-temperature-scaling ECE separately, and verify generalization via
> an untouched stratified holdout. The model's val-holdout F1 gap (~0.01
> category, ~0.05 specificity) is within the inter-reference agreement
> ceiling, confirming decision-boundary generalization despite
> in-distribution confidence memorization. Temperature scaling recovers
> calibration (reducing ECE by 33% cat, 40% spec) without altering predictions."*
### Phase 10 Summary
| Experiment | Cost | Outcome | Paper value |
|---|---|---|---|
@ -909,6 +1105,8 @@ Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
| Dictionary baseline | ~1 min CPU | Cat 0.55, Spec 0.66 — clear gap to learned model | A-rubric "additional baselines" item |
| Confidence-filter ablation | ~3h GPU | Null result — filtering does not matter | Justifies architecture, not data engineering |
| Temperature scaling | ~10 min GPU | ECE ↓33% cat, ↓40% spec, F1 unchanged | Calibration story, deployment quality |
| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds |
| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization |
The 3-seed ensemble is now the recommended headline checkpoint. The
calibrated ECE numbers should replace the pre-scaling ECE in the paper.


@ -156,6 +156,8 @@
- [x] Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed
- [x] Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context
- [x] Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement
- [x] Pooling ablation (attention vs CLS) — attention +0.005 F1 consistent; small but credible effect
- [x] DAPT re-test with new architecture — val +0.007 cat F1, best val NLL 0.333→0.318 (4.5%), generalization gap unchanged; holdout gain ~0.001 (better init, not better generalization)
- [ ] Error analysis against human gold, IGNITE slides
- [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work
- [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result


@ -0,0 +1,37 @@
model:
name_or_path: answerdotai/ModernBERT-large
data:
paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl
consensus_path: ../data/annotations/v2-stage1/consensus.jsonl
quality_path: ../data/paragraphs/quality/quality-scores.jsonl
holdout_path: ../data/gold/v2-holdout-ids.json
max_seq_length: 512
validation_split: 0.1
training:
output_dir: ../checkpoints/finetune/iter1-clspool
learning_rate: 0.00005
num_train_epochs: 11
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 1
warmup_ratio: 0.1
weight_decay: 0.01
dropout: 0.1
bf16: true
gradient_checkpointing: false
logging_steps: 50
save_total_limit: 3
dataloader_num_workers: 4
seed: 42
loss_type: ce
focal_gamma: 2.0
class_weighting: true
category_loss_weight: 1.0
specificity_loss_weight: 1.0
specificity_head: independent
spec_mlp_dim: 256
pooling: cls
ordinal_consistency_weight: 0.1
filter_spec_confidence: true


@ -0,0 +1,37 @@
model:
name_or_path: ../checkpoints/dapt/modernbert-large/final
data:
paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl
consensus_path: ../data/annotations/v2-stage1/consensus.jsonl
quality_path: ../data/paragraphs/quality/quality-scores.jsonl
holdout_path: ../data/gold/v2-holdout-ids.json
max_seq_length: 512
validation_split: 0.1
training:
output_dir: ../checkpoints/finetune/iter1-dapt
learning_rate: 0.00005
num_train_epochs: 11
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 1
warmup_ratio: 0.1
weight_decay: 0.01
dropout: 0.1
bf16: true
gradient_checkpointing: false
logging_steps: 50
save_total_limit: 3
dataloader_num_workers: 4
seed: 42
loss_type: ce
focal_gamma: 2.0
class_weighting: true
category_loss_weight: 1.0
specificity_loss_weight: 1.0
specificity_head: independent
spec_mlp_dim: 256
pooling: attention
ordinal_consistency_weight: 0.1
filter_spec_confidence: true

10 binary image files added (not shown).


@ -0,0 +1,298 @@
{
"iter1-clspool_vs_GPT-5.4": {
"cat_macro_f1": 0.9296272782528762,
"cat_weighted_f1": 0.9306824376807155,
"cat_macro_precision": 0.9289887550616817,
"cat_macro_recall": 0.9334375025997984,
"cat_mcc": 0.9179226636085169,
"cat_auc": 0.9911299127522846,
"cat_ece": 0.05557066917419438,
"cat_confusion_matrix": [
[
217,
0,
8,
3,
2,
0,
0
],
[
0,
83,
0,
2,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
1,
0,
2,
132,
1,
0,
0
],
[
6,
1,
5,
17,
167,
1,
1
],
[
0,
2,
1,
8,
2,
208,
0
],
[
0,
0,
0,
1,
11,
0,
165
]
],
"cat_f1_BoardGov": 0.9517543859649122,
"cat_prec_BoardGov": 0.9601769911504425,
"cat_recall_BoardGov": 0.9434782608695652,
"cat_f1_Incident": 0.9540229885057471,
"cat_prec_Incident": 0.9651162790697675,
"cat_recall_Incident": 0.9431818181818182,
"cat_f1_Manageme": 0.9290322580645162,
"cat_prec_Manageme": 0.9,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.88,
"cat_prec_NoneOthe": 0.8048780487804879,
"cat_recall_NoneOthe": 0.9705882352941176,
"cat_f1_RiskMana": 0.8652849740932642,
"cat_prec_RiskMana": 0.8882978723404256,
"cat_recall_RiskMana": 0.8434343434343434,
"cat_f1_Strategy": 0.9651972157772621,
"cat_prec_Strategy": 0.9904761904761905,
"cat_recall_Strategy": 0.9411764705882353,
"cat_f1_Third-Pa": 0.9620991253644315,
"cat_prec_Third-Pa": 0.9939759036144579,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9174669822467758,
"spec_macro_f1": 0.892010224838834,
"spec_weighted_f1": 0.9098424770121019,
"spec_macro_precision": 0.9042493173083448,
"spec_macro_recall": 0.8836163792237031,
"spec_mcc": 0.8634241541671751,
"spec_auc": 0.9777836963763646,
"spec_ece": 0.07659540871779125,
"spec_confusion_matrix": [
[
587,
11,
17,
3
],
[
32,
125,
9,
2
],
[
14,
4,
187,
2
],
[
3,
1,
9,
194
]
],
"spec_f1_L1Generi": 0.9362041467304625,
"spec_prec_L1Generi": 0.9229559748427673,
"spec_recall_L1Generi": 0.9498381877022654,
"spec_f1_L2Domain": 0.8090614886731392,
"spec_prec_L2Domain": 0.8865248226950354,
"spec_recall_L2Domain": 0.7440476190476191,
"spec_f1_L3Firm-S": 0.8717948717948718,
"spec_prec_L3Firm-S": 0.8423423423423423,
"spec_recall_L3Firm-S": 0.9033816425120773,
"spec_f1_L4Quanti": 0.9509803921568627,
"spec_prec_L4Quanti": 0.9651741293532339,
"spec_recall_L4Quanti": 0.9371980676328503,
"spec_qwk": 0.9224750079938221,
"spec_mae": 0.1275,
"spec_kripp_alpha": 0.9099809044589873,
"total_time_s": 6.83874113188358,
"num_samples": 1200,
"avg_ms_per_sample": 5.698950943236317,
"combined_macro_f1": 0.910818751545855
},
"iter1-clspool_vs_Opus-4.6": {
"cat_macro_f1": 0.9228949790380195,
"cat_weighted_f1": 0.9228190044594041,
"cat_macro_precision": 0.9183239817151002,
"cat_macro_recall": 0.9310538134995027,
"cat_mcc": 0.9101930161599978,
"cat_auc": 0.9924519781241848,
"cat_ece": 0.06223733584086104,
"cat_confusion_matrix": [
[
208,
0,
3,
3,
0,
0,
0
],
[
0,
76,
0,
1,
2,
0,
0
],
[
5,
0,
147,
1,
4,
0,
1
],
[
0,
0,
0,
139,
2,
0,
0
],
[
12,
1,
9,
14,
171,
1,
5
],
[
1,
9,
1,
6,
2,
208,
1
],
[
0,
0,
0,
0,
7,
1,
159
]
],
"cat_f1_BoardGov": 0.9454545454545454,
"cat_prec_BoardGov": 0.9203539823008849,
"cat_recall_BoardGov": 0.9719626168224299,
"cat_f1_Incident": 0.9212121212121213,
"cat_prec_Incident": 0.8837209302325582,
"cat_recall_Incident": 0.9620253164556962,
"cat_f1_Manageme": 0.9245283018867925,
"cat_prec_Manageme": 0.91875,
"cat_recall_Manageme": 0.930379746835443,
"cat_f1_NoneOthe": 0.9114754098360656,
"cat_prec_NoneOthe": 0.8475609756097561,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.8528678304239401,
"cat_prec_RiskMana": 0.9095744680851063,
"cat_recall_RiskMana": 0.8028169014084507,
"cat_f1_Strategy": 0.9497716894977168,
"cat_prec_Strategy": 0.9904761904761905,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.954954954954955,
"cat_prec_Third-Pa": 0.9578313253012049,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9095735484151157,
"spec_macro_f1": 0.8804386286358235,
"spec_weighted_f1": 0.8975676999782217,
"spec_macro_precision": 0.8892226854649037,
"spec_macro_recall": 0.8750457181821643,
"spec_mcc": 0.8465565454059848,
"spec_auc": 0.9697722386763277,
"spec_ece": 0.08741456707318629,
"spec_confusion_matrix": [
[
575,
19,
10,
1
],
[
26,
114,
4,
1
],
[
35,
8,
204,
13
],
[
0,
0,
4,
186
]
],
"spec_f1_L1Generi": 0.9266720386784851,
"spec_prec_L1Generi": 0.9040880503144654,
"spec_recall_L1Generi": 0.9504132231404959,
"spec_f1_L2Domain": 0.7972027972027972,
"spec_prec_L2Domain": 0.8085106382978723,
"spec_recall_L2Domain": 0.7862068965517242,
"spec_f1_L3Firm-S": 0.8464730290456431,
"spec_prec_L3Firm-S": 0.918918918918919,
"spec_recall_L3Firm-S": 0.7846153846153846,
"spec_f1_L4Quanti": 0.9514066496163683,
"spec_prec_L4Quanti": 0.9253731343283582,
"spec_recall_L4Quanti": 0.9789473684210527,
"spec_qwk": 0.9187882106031572,
"spec_mae": 0.14083333333333334,
"spec_kripp_alpha": 0.9041056117796359,
"total_time_s": 6.83874113188358,
"num_samples": 1200,
"avg_ms_per_sample": 5.698950943236317,
"combined_macro_f1": 0.9016668038369215
}
}


@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-clspool vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.84s
Avg latency: 5.70ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9296 ✓ (target: 0.80)
Weighted F1: 0.9307
Macro Prec: 0.9290
Macro Recall: 0.9334
MCC: 0.9179
AUC (OvR): 0.9911
ECE: 0.0556
Kripp Alpha: 0.9175
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9518 0.9602 0.9435
Incident Disclosure 0.9540 0.9651 0.9432
Management Role 0.9290 0.9000 0.9600
None/Other 0.8800 0.8049 0.9706
Risk Management Process 0.8653 0.8883 0.8434
Strategy Integration 0.9652 0.9905 0.9412
Third-Party Risk 0.9621 0.9940 0.9322
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8920 ✓ (target: 0.80)
Weighted F1: 0.9098
Macro Prec: 0.9042
Macro Recall: 0.8836
MCC: 0.8634
AUC (OvR): 0.9778
QWK: 0.9225
MAE: 0.1275
ECE: 0.0766
Kripp Alpha: 0.9100
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9362 0.9230 0.9498
L2: Domain 0.8091 0.8865 0.7440
L3: Firm-Specific 0.8718 0.8423 0.9034
L4: Quantified 0.9510 0.9652 0.9372
======================================================================


@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-clspool vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.84s
Avg latency: 5.70ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9229 ✓ (target: 0.80)
Weighted F1: 0.9228
Macro Prec: 0.9183
Macro Recall: 0.9311
MCC: 0.9102
AUC (OvR): 0.9925
ECE: 0.0622
Kripp Alpha: 0.9096
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9455 0.9204 0.9720
Incident Disclosure 0.9212 0.8837 0.9620
Management Role 0.9245 0.9187 0.9304
None/Other 0.9115 0.8476 0.9858
Risk Management Process 0.8529 0.9096 0.8028
Strategy Integration 0.9498 0.9905 0.9123
Third-Party Risk 0.9550 0.9578 0.9521
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8804 ✓ (target: 0.80)
Weighted F1: 0.8976
Macro Prec: 0.8892
Macro Recall: 0.8750
MCC: 0.8466
AUC (OvR): 0.9698
QWK: 0.9188
MAE: 0.1408
ECE: 0.0874
Kripp Alpha: 0.9041
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9267 0.9041 0.9504
L2: Domain 0.7972 0.8085 0.7862
L3: Firm-Specific 0.8465 0.9189 0.7846
L4: Quantified 0.9514 0.9254 0.9789
======================================================================

10 binary image files added (not shown).


@ -0,0 +1,298 @@
{
"iter1-dapt_vs_GPT-5.4": {
"cat_macro_f1": 0.9350000205815902,
"cat_weighted_f1": 0.936034565494772,
"cat_macro_precision": 0.9344660111343602,
"cat_macro_recall": 0.9378555188267356,
"cat_mcc": 0.9246263785540332,
"cat_auc": 0.9915953686916092,
"cat_ece": 0.04942640244960788,
"cat_confusion_matrix": [
[
224,
0,
4,
0,
2,
0,
0
],
[
0,
83,
0,
0,
2,
2,
1
],
[
2,
0,
145,
1,
2,
0,
0
],
[
0,
0,
2,
132,
1,
1,
0
],
[
6,
1,
5,
18,
166,
1,
1
],
[
0,
2,
1,
8,
1,
209,
0
],
[
0,
0,
0,
0,
13,
0,
164
]
],
"cat_f1_BoardGov": 0.9696969696969697,
"cat_prec_BoardGov": 0.9655172413793104,
"cat_recall_BoardGov": 0.9739130434782609,
"cat_f1_Incident": 0.9540229885057471,
"cat_prec_Incident": 0.9651162790697675,
"cat_recall_Incident": 0.9431818181818182,
"cat_f1_Manageme": 0.9446254071661238,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.8949152542372881,
"cat_prec_NoneOthe": 0.8301886792452831,
"cat_recall_NoneOthe": 0.9705882352941176,
"cat_f1_RiskMana": 0.8623376623376623,
"cat_prec_RiskMana": 0.8877005347593583,
"cat_recall_RiskMana": 0.8383838383838383,
"cat_f1_Strategy": 0.9631336405529954,
"cat_prec_Strategy": 0.9812206572769953,
"cat_recall_Strategy": 0.9457013574660633,
"cat_f1_Third-Pa": 0.956268221574344,
"cat_prec_Third-Pa": 0.9879518072289156,
"cat_recall_Third-Pa": 0.9265536723163842,
"cat_kripp_alpha": 0.9243058890635424,
"spec_macro_f1": 0.8959443847575952,
"spec_weighted_f1": 0.914085249793483,
"spec_macro_precision": 0.9055333144570721,
"spec_macro_recall": 0.889132193611932,
"spec_mcc": 0.8698798188273218,
"spec_auc": 0.9806421467148638,
"spec_ece": 0.0693218584855397,
"spec_confusion_matrix": [
[
588,
14,
13,
3
],
[
32,
126,
8,
2
],
[
11,
4,
191,
1
],
[
2,
2,
10,
193
]
],
"spec_f1_L1Generi": 0.9400479616306955,
"spec_prec_L1Generi": 0.9289099526066351,
"spec_recall_L1Generi": 0.9514563106796117,
"spec_f1_L2Domain": 0.802547770700637,
"spec_prec_L2Domain": 0.863013698630137,
"spec_recall_L2Domain": 0.75,
"spec_f1_L3Firm-S": 0.8904428904428905,
"spec_prec_L3Firm-S": 0.8603603603603603,
"spec_recall_L3Firm-S": 0.9227053140096618,
"spec_f1_L4Quanti": 0.9507389162561576,
"spec_prec_L4Quanti": 0.9698492462311558,
"spec_recall_L4Quanti": 0.9323671497584541,
"spec_qwk": 0.9315994086072762,
"spec_mae": 0.11666666666666667,
"spec_kripp_alpha": 0.9194074359344485,
"total_time_s": 6.855555058107711,
"num_samples": 1200,
"avg_ms_per_sample": 5.712962548423093,
"combined_macro_f1": 0.9154722026695927
},
"iter1-dapt_vs_Opus-4.6": {
"cat_macro_f1": 0.9277442873196512,
"cat_weighted_f1": 0.9268438855804646,
"cat_macro_precision": 0.9237899595225246,
"cat_macro_recall": 0.9349393170438051,
"cat_mcc": 0.9150420281652446,
"cat_auc": 0.9934333602136249,
"cat_ece": 0.057411353190739985,
"cat_confusion_matrix": [
[
210,
0,
2,
1,
1,
0,
0
],
[
0,
77,
0,
0,
1,
0,
1
],
[
8,
0,
145,
1,
3,
0,
1
],
[
0,
0,
0,
139,
2,
0,
0
],
[
13,
0,
9,
13,
172,
1,
5
],
[
1,
9,
1,
4,
2,
211,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9417040358744395,
"cat_prec_BoardGov": 0.9051724137931034,
"cat_recall_BoardGov": 0.9813084112149533,
"cat_f1_Incident": 0.9333333333333333,
"cat_prec_Incident": 0.8953488372093024,
"cat_recall_Incident": 0.9746835443037974,
"cat_f1_Manageme": 0.9206349206349206,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9177215189873418,
"cat_f1_NoneOthe": 0.9266666666666666,
"cat_prec_NoneOthe": 0.8742138364779874,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.86,
"cat_prec_RiskMana": 0.9197860962566845,
"cat_recall_RiskMana": 0.8075117370892019,
"cat_f1_Strategy": 0.9569160997732427,
"cat_prec_Strategy": 0.9906103286384976,
"cat_recall_Strategy": 0.9254385964912281,
"cat_f1_Third-Pa": 0.954954954954955,
"cat_prec_Third-Pa": 0.9578313253012049,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9144489824694872,
"spec_macro_f1": 0.8823881241075249,
"spec_weighted_f1": 0.8997013825586678,
"spec_macro_precision": 0.8895415282112857,
"spec_macro_recall": 0.8784196767594721,
"spec_mcc": 0.84923108221758,
"spec_auc": 0.9732413764660657,
"spec_ece": 0.08008741805950799,
"spec_confusion_matrix": [
[
573,
22,
9,
1
],
[
26,
114,
3,
2
],
[
34,
10,
207,
9
],
[
0,
0,
3,
187
]
],
"spec_f1_L1Generi": 0.925686591276252,
"spec_prec_L1Generi": 0.9052132701421801,
"spec_recall_L1Generi": 0.947107438016529,
"spec_f1_L2Domain": 0.7835051546391752,
"spec_prec_L2Domain": 0.7808219178082192,
"spec_recall_L2Domain": 0.7862068965517242,
"spec_f1_L3Firm-S": 0.8589211618257261,
"spec_prec_L3Firm-S": 0.9324324324324325,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.961439588688946,
"spec_prec_L4Quanti": 0.9396984924623115,
"spec_recall_L4Quanti": 0.9842105263157894,
"spec_qwk": 0.9200429286057613,
"spec_mae": 0.13833333333333334,
"spec_kripp_alpha": 0.9047987190793844,
"total_time_s": 6.855555058107711,
"num_samples": 1200,
"avg_ms_per_sample": 5.712962548423093,
"combined_macro_f1": 0.9050662057135881
}
}


@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-dapt vs GPT-5.4
======================================================================
Samples evaluated: 1200
Total inference time: 6.86s
Avg latency: 5.71ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9350 ✓ (target: 0.80)
Weighted F1: 0.9360
Macro Prec: 0.9345
Macro Recall: 0.9379
MCC: 0.9246
AUC (OvR): 0.9916
ECE: 0.0494
Kripp Alpha: 0.9243
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9697 0.9655 0.9739
Incident Disclosure 0.9540 0.9651 0.9432
Management Role 0.9446 0.9236 0.9667
None/Other 0.8949 0.8302 0.9706
Risk Management Process 0.8623 0.8877 0.8384
Strategy Integration 0.9631 0.9812 0.9457
Third-Party Risk 0.9563 0.9880 0.9266
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8959 ✓ (target: 0.80)
Weighted F1: 0.9141
Macro Prec: 0.9055
Macro Recall: 0.8891
MCC: 0.8699
AUC (OvR): 0.9806
QWK: 0.9316
MAE: 0.1167
ECE: 0.0693
Kripp Alpha: 0.9194
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9400 0.9289 0.9515
L2: Domain 0.8025 0.8630 0.7500
L3: Firm-Specific 0.8904 0.8604 0.9227
L4: Quantified 0.9507 0.9698 0.9324
======================================================================

View File

@ -0,0 +1,54 @@
======================================================================
HOLDOUT EVALUATION: iter1-dapt vs Opus-4.6
======================================================================
Samples evaluated: 1200
Total inference time: 6.86s
Avg latency: 5.71ms/sample
Throughput: 175 samples/sec
──────────────────────────────────────────────────
CATEGORY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.9277 ✓ (target: 0.80)
Weighted F1: 0.9268
Macro Prec: 0.9238
Macro Recall: 0.9349
MCC: 0.9150
AUC (OvR): 0.9934
ECE: 0.0574
Kripp Alpha: 0.9144
Category F1 Prec Recall
------------------------- -------- -------- --------
Board Governance 0.9417 0.9052 0.9813
Incident Disclosure 0.9333 0.8953 0.9747
Management Role 0.9206 0.9236 0.9177
None/Other 0.9267 0.8742 0.9858
Risk Management Process 0.8600 0.9198 0.8075
Strategy Integration 0.9569 0.9906 0.9254
Third-Party Risk 0.9550 0.9578 0.9521
──────────────────────────────────────────────────
SPECIFICITY CLASSIFICATION
──────────────────────────────────────────────────
Macro F1: 0.8824 ✓ (target: 0.80)
Weighted F1: 0.8997
Macro Prec: 0.8895
Macro Recall: 0.8784
MCC: 0.8492
AUC (OvR): 0.9732
QWK: 0.9200
MAE: 0.1383
ECE: 0.0801
Kripp Alpha: 0.9048
Level F1 Prec Recall
------------------------- -------- -------- --------
L1: Generic 0.9257 0.9052 0.9471
L2: Domain 0.7835 0.7808 0.7862
L3: Firm-Specific 0.8589 0.9324 0.7962
L4: Quantified 0.9614 0.9397 0.9842
======================================================================