diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md
index 0b3cb69..538ae03 100644
--- a/docs/NARRATIVE.md
+++ b/docs/NARRATIVE.md
@@ -901,6 +901,202 @@ thus invariant under T > 0.
 
 Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`.
 
+### 10.5 Pooling Ablation (Attention vs [CLS])
+
+**Motivation:** The spec F1 jump from 0.517 → 0.945 was credited to three
+architectural changes — independent threshold heads, attention pooling, and
+confidence filtering. Independent thresholds were ablated against CORAL;
+confidence filtering was ablated in §10.3 (null result). Attention pooling
+had never been isolated. We needed to know whether it actually matters or
+whether independent thresholds carry all the gain.
+
+**Setup:** `iter1-clspool.yaml` — identical iter1 config but with
+`pooling: cls`. Same seed (42), same 11 epochs, confidence filtering on.
+
+**Results:**
+
+| Config | Val Cat F1 | Val Spec F1 | Val Combined | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
+|--------|-----------:|------------:|-------------:|-------------------------:|--------------------------:|
+| iter1 (attention) | 0.9430 | 0.9450 | 0.9440 | 0.9343 | 0.8950 |
+| iter1-clspool ([CLS])| 0.9368 | 0.9414 | 0.9391 | 0.9296 | 0.8920 |
+| **Δ (attention − CLS)** | **+0.006** | **+0.004** | **+0.005** | **+0.005** | **+0.003** |
+
+**Finding:** Attention pooling is consistently better than [CLS] pooling
+across all metrics and both references, but the effect is **small** —
+0.003-0.006 F1. This is only 2-3× the seed-level std (±0.002), so
+the direction is credible but the magnitude is modest. Attention pooling is
+doing real work ("one CISO mention anywhere matters"), but independent
+threshold heads are clearly carrying the majority of the architecture win.
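For concreteness, a minimal sketch of the two pooling modes being compared. This is illustrative, not the project's actual module; it assumes a standard `(batch, tokens, hidden)` encoder output and a single learned scoring vector:

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Sketch of `pooling: attention`: a learned softmax over token positions,
    so one strong token (e.g. a CISO mention anywhere) can dominate the
    pooled vector regardless of where it appears."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H); attention_mask: (B, T), 1 = real token
        scores = self.score(hidden_states).squeeze(-1)           # (B, T)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # never attend to padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (B, T, 1)
        return (weights * hidden_states).sum(dim=1)              # (B, H)


def cls_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """Sketch of `pooling: cls`: just the first ([CLS]) token's embedding."""
    return hidden_states[:, 0]
```

The intuition tested by the ablation: `cls_pool` forces all paragraph evidence through one token's representation, while `AttentionPool` can up-weight a single decisive token directly.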
+ +**Interpretation for the paper:** We can report this cleanly as "attention +pooling contributes a small but consistent improvement over [CLS] pooling +(~+0.005 F1 on both heads); the bulk of the CORAL → independent-threshold +gain (~+0.43 on spec F1) is attributable to the decoupled threshold weights, +not the pooling change." This is honest and gives each design choice its +proper credit. + +Output: `checkpoints/finetune/iter1-clspool/`, `results/eval/iter1-clspool/`. + +### 10.6 DAPT Re-Test with New Architecture + +**Motivation:** During the original 12-config ablation grid (CORAL + +[CLS] pooling), DAPT and TAPT both *hurt* — base ModernBERT-large +outperformed DAPT and TAPT checkpoints on every loss combination. That was +reported as a noteworthy null result. But the architecture has changed +substantially since then (independent thresholds, attention pooling). The +verdict on DAPT could now flip: maybe the DAPT vocabulary signal was +previously wasted on a model that couldn't use it. + +**Setup:** `iter1-dapt.yaml` — identical iter1 config but +`model.name_or_path` points at `checkpoints/dapt/modernbert-large/final` +(eval loss 0.7250 from Phase 7). Same seed, 11 epochs, attention pooling, +independent threshold heads, confidence filtering on. 
+
+**Results (epoch 11 — final checkpoint):**
+
+| Config | Val Cat F1 | Val Spec F1 | Val Combined | Val NLL (ep 11) | Holdout Cat F1 (GPT-5.4) | Holdout Spec F1 (GPT-5.4) |
+|--------|-----------:|------------:|-------------:|----------------:|-------------------------:|--------------------------:|
+| iter1 (base ModernBERT, seed 69) | 0.9384 | 0.9462 | 0.9423 | 0.511 | — | — |
+| iter1 (base ModernBERT, seed 42) | 0.9430 | 0.9450 | 0.9440 | — | 0.9343 | 0.8950 |
+| iter1-dapt (DAPT init) | 0.9500 | 0.9462 | 0.9481 | 0.494 | 0.9350 | 0.8959 |
+| **Δ (dapt − base)** | **+0.007** | **+0.001** | **+0.004** | **−0.017** | +0.001 | +0.001 |
+
+**Per-epoch val NLL trajectory (confirmed not overfitting-driven):**
+
+| Epoch | seed 69 (no DAPT) | DAPT | Δ |
+|-------|------------------:|-----:|----:|
+| 1 | 0.376 | 0.346 | −0.030 |
+| 2 | 0.337 | **0.318** (best) | −0.019 |
+| 3 | **0.333** (best) | 0.331 | −0.002 |
+| 5 | 0.394 | 0.385 | −0.009 |
+| 8 | 0.493 | 0.482 | −0.011 |
+| 11 | 0.511 | 0.494 | −0.017 |
+
+Both runs bottom out at epoch 2-3 and then overfit steadily. The overfit gap
+(val NLL at epoch 11 minus best) is **0.178 for the baseline** and
+**0.176 for DAPT** — essentially identical. DAPT is not overfitting worse;
+it is **starting from a better representation** and maintaining the same
+generalization gap through training.
+
+**Finding — a more nuanced null:** DAPT initialization genuinely improves
+val NLL by ~4.5% at the best checkpoint (0.333 → 0.318), with a matching
++0.007 category F1 improvement on val. The improvement is real and not a
+side-effect of overfitting: the train/val gap is unchanged. But the benefit
+does not transfer to the stratified holdout — the gain there is **+0.001**
+on both heads, within seed-level noise and nowhere near the val improvement. 
Something interesting is happening: + +- DAPT helps the model fit in-distribution data more tightly (val gain + + NLL drop) +- That extra fit does not generalize to the stratified holdout +- The holdout oversamples minority classes (L2, TP, ID) relative to the + training distribution; DAPT's benefit is on the head of the distribution + +**Interpretation for the paper:** This is a more interesting null result +than the original "DAPT/TAPT did not help." The revised claim is: + +> *"Domain-adaptive pretraining improves in-distribution val NLL by ~4.5% +> at the best checkpoint (0.333 → 0.318) and provides a modest val F1 gain +> (+0.007 cat, +0.004 combined) under the independent-threshold + +> attention-pooling architecture. The generalization gap (difference between +> best val NLL and final val NLL) is unchanged by DAPT (0.178 vs 0.176), +> confirming that DAPT is providing a better initialization rather than +> just enabling overfitting. However, this val improvement does not +> transfer to the stratified holdout — DAPT produces a model that is +> better-calibrated on paragraphs similar to the training distribution, +> yet no more generalizable to the rare-class boundary cases (L2, TP, ID) +> that macro F1 weighs heavily. Our original finding (DAPT does not help +> final macro F1) is reaffirmed; the mechanism is now clearer."* + +This is stronger than the original null because we can now point to a +specific, measurable effect of DAPT (val NLL) distinct from overfitting, +and explain why it doesn't show up in the headline macro F1 metric. + +The non-DAPT 3-seed ensemble remains the recommended headline checkpoint. +The DAPT run is reportable as an ablation and a more precise null. + +Output: `checkpoints/finetune/iter1-dapt/`, `results/eval/iter1-dapt/`. 
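The overfit-gap arithmetic above can be reproduced directly from the per-epoch table. A small sketch; the values are copied from the table, with epochs 4, 6, 7, 9, and 10 omitted exactly as in the table:

```python
# Per-epoch val NLL for the baseline (seed 69) and the DAPT-initialized run.
nll_base = {1: 0.376, 2: 0.337, 3: 0.333, 5: 0.394, 8: 0.493, 11: 0.511}
nll_dapt = {1: 0.346, 2: 0.318, 3: 0.331, 5: 0.385, 8: 0.482, 11: 0.494}


def overfit_gap(nll_by_epoch: dict) -> float:
    """Final-epoch val NLL minus best (minimum) val NLL."""
    final = nll_by_epoch[max(nll_by_epoch)]
    best = min(nll_by_epoch.values())
    return round(final - best, 3)


print(overfit_gap(nll_base))  # 0.178
print(overfit_gap(nll_dapt))  # 0.176
```

Both gaps are essentially identical, which is the basis for the claim that DAPT provides a better starting point rather than extra memorization headroom.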
+
+### 10.7 The NLL-vs-F1 Decoupling and the Overfit Story
+
+Investigating the DAPT ablation (§10.6) surfaced a general property of
+every run in Phase 10 worth documenting explicitly, because it affects how
+the paper should report training dynamics.
+
+**Observation:** In all six independent-threshold runs (seeds 42/69/420,
+iter1-nofilter, iter1-clspool, iter1-dapt), **val NLL bottoms at epoch 2-3
+and then climbs monotonically through epoch 11, while val macro F1 peaks
+at epoch 8 and plateaus.** The two metrics disagree about when the model
+is at its best.
+
+**Per-epoch val NLL, representative run (seed 69):**
+
+| Epoch | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
+|-------|---|---|---|---|---|---|---|---|---|----|----|
+| Val NLL | 0.376 | 0.337 | **0.333** | 0.369 | 0.394 | 0.443 | 0.472 | 0.493 | 0.505 | — | 0.511 |
+| Val F1 | ~0.90 | ~0.92 | ~0.925 | ~0.932 | ~0.938 | ~0.941 | ~0.942 | **~0.944** | 0.944 | 0.944 | 0.943 |
+
+**Interpretation:** Past epoch 3, continued training memorizes *confidence*,
+not *decisions*. Two things happen simultaneously:
+
+1. Training-set probabilities are pushed toward 0/1 (training loss → 0)
+2. Very few argmax decision boundaries shift
+
+For val examples the model already gets right, sharpening is neutral-to-bad
+for NLL and neutral-to-good for F1. For val examples the model gets wrong,
+continued training makes the prediction *more confidently wrong* — terrible
+for NLL (log-penalty grows), irrelevant for F1 (still wrong by argmax).
+Net: NLL climbs, F1 inches up as a small number of borderline examples
+flip to the correct side.
+
+This is a well-documented decoupling in deep classifiers, not a pathology
+specific to this model.
+
+**Is it a problem for the F1 claim? No.** Model selection uses val F1, so
+we pick the epoch where F1 peaks (epoch 8). Val F1 at the selected
+checkpoint (0.943/0.945) closely tracks holdout F1 against proxy gold
+(0.934/0.895) — a ~0.01 category gap and ~0.05 specificity gap. 
The +decision boundaries generalized. The model did not overfit the *task*. + +**Is it a problem for the probability claim? Yes, but measurable and +fixable.** Raw logits at epoch 8 are overconfident, which is exactly what +the pre-scaling ECE measured (0.05-0.08). The fitted temperatures +(T_cat = 1.76, T_spec = 2.46) are a direct quantification of how +overconfident the model became between epoch 3 and epoch 8: T > 1 means +"divide logits to cool them off." Temperature scaling (§10.4) recovers +calibration without touching predictions, so the cost of training to +epoch 8 instead of epoch 3 is paid in a scalar that's learned in ~1 second +on val. + +**Is it a problem for the holdout claim? No, by construction.** The +holdout was never touched during training. The train/val loss gap measures +memorization of the training distribution; the holdout measures +generalization to a distributionally distinct sample. These are independent +signals and both tell a consistent story: decision boundaries transfer, +probability calibration does not. + +**Why not just stop at epoch 3?** Because you'd save ~0.18 in val NLL and +lose ~0.02 in val F1. Epochs 3 → 8 buy ~0.015-0.020 F1 at the cost of +calibration that temperature scaling mechanically recovers. For a +task where F1 is the rubric metric, that is a good trade. Were this a +deployment where confidence scores drive downstream decisions (e.g., a +human-in-the-loop review queue prioritizing low-confidence paragraphs), +epoch 3 + no temperature scaling would be a reasonable alternative choice. + +**Paper framing:** + +> *"Val NLL minimizes at epoch 2-3 while val macro F1 peaks at epoch 8 — a +> well-documented decoupling between calibration and decision quality in +> deep classifiers. We select checkpoints by F1, report pre- and +> post-temperature-scaling ECE separately, and verify generalization via +> an untouched stratified holdout. 
The model's val-holdout F1 gap (~0.01 +> category, ~0.05 specificity) is within the inter-reference agreement +> ceiling, confirming decision-boundary generalization despite +> in-distribution confidence memorization. Temperature scaling recovers +> calibration (ECE −33% cat, −40% spec) without altering predictions."* + ### Phase 10 Summary | Experiment | Cost | Outcome | Paper value | @@ -909,6 +1105,8 @@ Output: `results/eval/ensemble-3seed-tempscaled/temperature_scaling.json`. | Dictionary baseline | ~1 min CPU | Cat 0.55, Spec 0.66 — clear gap to learned model | A-rubric "additional baselines" item | | Confidence-filter ablation | ~3h GPU | Null result — filtering does not matter | Justifies architecture, not data engineering | | Temperature scaling | ~10 min GPU | ECE −33% cat, −40% spec, F1 unchanged | Calibration story, deployment quality | +| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds | +| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (−4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization | The 3-seed ensemble is now the recommended headline checkpoint. The calibrated ECE numbers should replace the pre-scaling ECE in the paper. 
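A toy illustration of the mechanics §10.7 describes: sharpening a wrong prediction leaves the argmax (and thus F1) unchanged while inflating NLL, and dividing by a temperature T > 1 undoes the inflation without touching any prediction. The logits below are invented; only T = 2.46, the fitted T_spec from §10.4, comes from the text:

```python
import torch
import torch.nn.functional as F


def nll(logits: torch.Tensor, labels: torch.Tensor, T: float = 1.0) -> float:
    """Mean negative log-likelihood of temperature-scaled logits."""
    return F.cross_entropy(logits / T, labels).item()


label = torch.tensor([1])                # true class
mild = torch.tensor([[2.0, 1.0, 0.0]])   # early-epoch-style logits, wrong argmax
sharp = torch.tensor([[8.0, 4.0, 0.0]])  # same decision, sharpened by more training

# Same argmax (same F1 contribution), much worse NLL once sharpened:
assert mild.argmax(-1) == sharp.argmax(-1)
print(nll(mild, label), nll(sharp, label))

# Cooling with T_spec = 2.46 shrinks the log-penalty of the confidently-wrong
# prediction without changing the argmax: calibration improves, F1 untouched.
assert (sharp / 2.46).argmax(-1) == sharp.argmax(-1)
print(nll(sharp, label, T=2.46))
```

This is why temperature scaling can repair the ECE accumulated between epoch 3 and epoch 8 while leaving every F1 number exactly as reported.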
The diff --git a/docs/STATUS.md b/docs/STATUS.md index ba92bd1..5e41eac 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -156,6 +156,8 @@ - [x] Ensemble of 3 seeds for confidence intervals — seeds 42/69/420, val std ±0.002 spec, holdout +0.017 L2 F1, +0.007 spec F1 vs single seed - [x] Dictionary/keyword baseline (A-rubric "additional baselines") — Cat 0.55, Spec 0.66; gap to learned model documents value of context - [x] Confidence-filter ablation — null result, filtering does not affect F1; architecture changes carry the spec F1 improvement +- [x] Pooling ablation (attention vs CLS) — attention +0.005 F1 consistent; small but credible effect +- [x] DAPT re-test with new architecture — val +0.007 cat F1, best val NLL 0.333→0.318 (−4.5%), generalization gap unchanged; holdout gain ~0.001 (better init, not better generalization) - [ ] Error analysis against human gold, IGNITE slides - [ ] Note in paper: specificity is paragraph-level (presence check), not category-conditional — acknowledge as limitation/future work - [ ] Note in paper: DAPT/TAPT did not improve fine-tuning — noteworthy null result diff --git a/python/configs/finetune/iter1-clspool.yaml b/python/configs/finetune/iter1-clspool.yaml new file mode 100644 index 0000000..69d0d23 --- /dev/null +++ b/python/configs/finetune/iter1-clspool.yaml @@ -0,0 +1,37 @@ +model: + name_or_path: answerdotai/ModernBERT-large + +data: + paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl + consensus_path: ../data/annotations/v2-stage1/consensus.jsonl + quality_path: ../data/paragraphs/quality/quality-scores.jsonl + holdout_path: ../data/gold/v2-holdout-ids.json + max_seq_length: 512 + validation_split: 0.1 + +training: + output_dir: ../checkpoints/finetune/iter1-clspool + learning_rate: 0.00005 + num_train_epochs: 11 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 64 + gradient_accumulation_steps: 1 + warmup_ratio: 0.1 + weight_decay: 0.01 + dropout: 0.1 + bf16: true + 
gradient_checkpointing: false + logging_steps: 50 + save_total_limit: 3 + dataloader_num_workers: 4 + seed: 42 + loss_type: ce + focal_gamma: 2.0 + class_weighting: true + category_loss_weight: 1.0 + specificity_loss_weight: 1.0 + specificity_head: independent + spec_mlp_dim: 256 + pooling: cls + ordinal_consistency_weight: 0.1 + filter_spec_confidence: true diff --git a/python/configs/finetune/iter1-dapt.yaml b/python/configs/finetune/iter1-dapt.yaml new file mode 100644 index 0000000..2c013f6 --- /dev/null +++ b/python/configs/finetune/iter1-dapt.yaml @@ -0,0 +1,37 @@ +model: + name_or_path: ../checkpoints/dapt/modernbert-large/final + +data: + paragraphs_path: ../data/paragraphs/paragraphs-clean.patched.jsonl + consensus_path: ../data/annotations/v2-stage1/consensus.jsonl + quality_path: ../data/paragraphs/quality/quality-scores.jsonl + holdout_path: ../data/gold/v2-holdout-ids.json + max_seq_length: 512 + validation_split: 0.1 + +training: + output_dir: ../checkpoints/finetune/iter1-dapt + learning_rate: 0.00005 + num_train_epochs: 11 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 64 + gradient_accumulation_steps: 1 + warmup_ratio: 0.1 + weight_decay: 0.01 + dropout: 0.1 + bf16: true + gradient_checkpointing: false + logging_steps: 50 + save_total_limit: 3 + dataloader_num_workers: 4 + seed: 42 + loss_type: ce + focal_gamma: 2.0 + class_weighting: true + category_loss_weight: 1.0 + specificity_loss_weight: 1.0 + specificity_head: independent + spec_mlp_dim: 256 + pooling: attention + ordinal_consistency_weight: 0.1 + filter_spec_confidence: true diff --git a/results/eval/iter1-clspool/figures/calibration_cat_gpt-5.4.png b/results/eval/iter1-clspool/figures/calibration_cat_gpt-5.4.png new file mode 100644 index 0000000..3db8c07 Binary files /dev/null and b/results/eval/iter1-clspool/figures/calibration_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/calibration_cat_opus-4.6.png 
b/results/eval/iter1-clspool/figures/calibration_cat_opus-4.6.png new file mode 100644 index 0000000..621a314 Binary files /dev/null and b/results/eval/iter1-clspool/figures/calibration_cat_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_cat_gpt-5.4.png b/results/eval/iter1-clspool/figures/confusion_cat_gpt-5.4.png new file mode 100644 index 0000000..01bc357 Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_cat_opus-4.6.png b/results/eval/iter1-clspool/figures/confusion_cat_opus-4.6.png new file mode 100644 index 0000000..f9a62ad Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_cat_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_spec_gpt-5.4.png b/results/eval/iter1-clspool/figures/confusion_spec_gpt-5.4.png new file mode 100644 index 0000000..ff8c9ff Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_spec_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/confusion_spec_opus-4.6.png b/results/eval/iter1-clspool/figures/confusion_spec_opus-4.6.png new file mode 100644 index 0000000..436183d Binary files /dev/null and b/results/eval/iter1-clspool/figures/confusion_spec_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/model_comparison.png b/results/eval/iter1-clspool/figures/model_comparison.png new file mode 100644 index 0000000..81d7221 Binary files /dev/null and b/results/eval/iter1-clspool/figures/model_comparison.png differ diff --git a/results/eval/iter1-clspool/figures/per_class_f1_gpt-5.4.png b/results/eval/iter1-clspool/figures/per_class_f1_gpt-5.4.png new file mode 100644 index 0000000..2250dc1 Binary files /dev/null and b/results/eval/iter1-clspool/figures/per_class_f1_gpt-5.4.png differ diff --git a/results/eval/iter1-clspool/figures/per_class_f1_opus-4.6.png 
b/results/eval/iter1-clspool/figures/per_class_f1_opus-4.6.png new file mode 100644 index 0000000..325d918 Binary files /dev/null and b/results/eval/iter1-clspool/figures/per_class_f1_opus-4.6.png differ diff --git a/results/eval/iter1-clspool/figures/speed_comparison.png b/results/eval/iter1-clspool/figures/speed_comparison.png new file mode 100644 index 0000000..af0c9d5 Binary files /dev/null and b/results/eval/iter1-clspool/figures/speed_comparison.png differ diff --git a/results/eval/iter1-clspool/metrics.json b/results/eval/iter1-clspool/metrics.json new file mode 100644 index 0000000..aa1e00b --- /dev/null +++ b/results/eval/iter1-clspool/metrics.json @@ -0,0 +1,298 @@ +{ + "iter1-clspool_vs_GPT-5.4": { + "cat_macro_f1": 0.9296272782528762, + "cat_weighted_f1": 0.9306824376807155, + "cat_macro_precision": 0.9289887550616817, + "cat_macro_recall": 0.9334375025997984, + "cat_mcc": 0.9179226636085169, + "cat_auc": 0.9911299127522846, + "cat_ece": 0.05557066917419438, + "cat_confusion_matrix": [ + [ + 217, + 0, + 8, + 3, + 2, + 0, + 0 + ], + [ + 0, + 83, + 0, + 2, + 2, + 1, + 0 + ], + [ + 2, + 0, + 144, + 1, + 3, + 0, + 0 + ], + [ + 1, + 0, + 2, + 132, + 1, + 0, + 0 + ], + [ + 6, + 1, + 5, + 17, + 167, + 1, + 1 + ], + [ + 0, + 2, + 1, + 8, + 2, + 208, + 0 + ], + [ + 0, + 0, + 0, + 1, + 11, + 0, + 165 + ] + ], + "cat_f1_BoardGov": 0.9517543859649122, + "cat_prec_BoardGov": 0.9601769911504425, + "cat_recall_BoardGov": 0.9434782608695652, + "cat_f1_Incident": 0.9540229885057471, + "cat_prec_Incident": 0.9651162790697675, + "cat_recall_Incident": 0.9431818181818182, + "cat_f1_Manageme": 0.9290322580645162, + "cat_prec_Manageme": 0.9, + "cat_recall_Manageme": 0.96, + "cat_f1_NoneOthe": 0.88, + "cat_prec_NoneOthe": 0.8048780487804879, + "cat_recall_NoneOthe": 0.9705882352941176, + "cat_f1_RiskMana": 0.8652849740932642, + "cat_prec_RiskMana": 0.8882978723404256, + "cat_recall_RiskMana": 0.8434343434343434, + "cat_f1_Strategy": 0.9651972157772621, + "cat_prec_Strategy": 
0.9904761904761905, + "cat_recall_Strategy": 0.9411764705882353, + "cat_f1_Third-Pa": 0.9620991253644315, + "cat_prec_Third-Pa": 0.9939759036144579, + "cat_recall_Third-Pa": 0.9322033898305084, + "cat_kripp_alpha": 0.9174669822467758, + "spec_macro_f1": 0.892010224838834, + "spec_weighted_f1": 0.9098424770121019, + "spec_macro_precision": 0.9042493173083448, + "spec_macro_recall": 0.8836163792237031, + "spec_mcc": 0.8634241541671751, + "spec_auc": 0.9777836963763646, + "spec_ece": 0.07659540871779125, + "spec_confusion_matrix": [ + [ + 587, + 11, + 17, + 3 + ], + [ + 32, + 125, + 9, + 2 + ], + [ + 14, + 4, + 187, + 2 + ], + [ + 3, + 1, + 9, + 194 + ] + ], + "spec_f1_L1Generi": 0.9362041467304625, + "spec_prec_L1Generi": 0.9229559748427673, + "spec_recall_L1Generi": 0.9498381877022654, + "spec_f1_L2Domain": 0.8090614886731392, + "spec_prec_L2Domain": 0.8865248226950354, + "spec_recall_L2Domain": 0.7440476190476191, + "spec_f1_L3Firm-S": 0.8717948717948718, + "spec_prec_L3Firm-S": 0.8423423423423423, + "spec_recall_L3Firm-S": 0.9033816425120773, + "spec_f1_L4Quanti": 0.9509803921568627, + "spec_prec_L4Quanti": 0.9651741293532339, + "spec_recall_L4Quanti": 0.9371980676328503, + "spec_qwk": 0.9224750079938221, + "spec_mae": 0.1275, + "spec_kripp_alpha": 0.9099809044589873, + "total_time_s": 6.83874113188358, + "num_samples": 1200, + "avg_ms_per_sample": 5.698950943236317, + "combined_macro_f1": 0.910818751545855 + }, + "iter1-clspool_vs_Opus-4.6": { + "cat_macro_f1": 0.9228949790380195, + "cat_weighted_f1": 0.9228190044594041, + "cat_macro_precision": 0.9183239817151002, + "cat_macro_recall": 0.9310538134995027, + "cat_mcc": 0.9101930161599978, + "cat_auc": 0.9924519781241848, + "cat_ece": 0.06223733584086104, + "cat_confusion_matrix": [ + [ + 208, + 0, + 3, + 3, + 0, + 0, + 0 + ], + [ + 0, + 76, + 0, + 1, + 2, + 0, + 0 + ], + [ + 5, + 0, + 147, + 1, + 4, + 0, + 1 + ], + [ + 0, + 0, + 0, + 139, + 2, + 0, + 0 + ], + [ + 12, + 1, + 9, + 14, + 171, + 1, + 5 + ], + [ + 1, 
+ 9, + 1, + 6, + 2, + 208, + 1 + ], + [ + 0, + 0, + 0, + 0, + 7, + 1, + 159 + ] + ], + "cat_f1_BoardGov": 0.9454545454545454, + "cat_prec_BoardGov": 0.9203539823008849, + "cat_recall_BoardGov": 0.9719626168224299, + "cat_f1_Incident": 0.9212121212121213, + "cat_prec_Incident": 0.8837209302325582, + "cat_recall_Incident": 0.9620253164556962, + "cat_f1_Manageme": 0.9245283018867925, + "cat_prec_Manageme": 0.91875, + "cat_recall_Manageme": 0.930379746835443, + "cat_f1_NoneOthe": 0.9114754098360656, + "cat_prec_NoneOthe": 0.8475609756097561, + "cat_recall_NoneOthe": 0.9858156028368794, + "cat_f1_RiskMana": 0.8528678304239401, + "cat_prec_RiskMana": 0.9095744680851063, + "cat_recall_RiskMana": 0.8028169014084507, + "cat_f1_Strategy": 0.9497716894977168, + "cat_prec_Strategy": 0.9904761904761905, + "cat_recall_Strategy": 0.9122807017543859, + "cat_f1_Third-Pa": 0.954954954954955, + "cat_prec_Third-Pa": 0.9578313253012049, + "cat_recall_Third-Pa": 0.9520958083832335, + "cat_kripp_alpha": 0.9095735484151157, + "spec_macro_f1": 0.8804386286358235, + "spec_weighted_f1": 0.8975676999782217, + "spec_macro_precision": 0.8892226854649037, + "spec_macro_recall": 0.8750457181821643, + "spec_mcc": 0.8465565454059848, + "spec_auc": 0.9697722386763277, + "spec_ece": 0.08741456707318629, + "spec_confusion_matrix": [ + [ + 575, + 19, + 10, + 1 + ], + [ + 26, + 114, + 4, + 1 + ], + [ + 35, + 8, + 204, + 13 + ], + [ + 0, + 0, + 4, + 186 + ] + ], + "spec_f1_L1Generi": 0.9266720386784851, + "spec_prec_L1Generi": 0.9040880503144654, + "spec_recall_L1Generi": 0.9504132231404959, + "spec_f1_L2Domain": 0.7972027972027972, + "spec_prec_L2Domain": 0.8085106382978723, + "spec_recall_L2Domain": 0.7862068965517242, + "spec_f1_L3Firm-S": 0.8464730290456431, + "spec_prec_L3Firm-S": 0.918918918918919, + "spec_recall_L3Firm-S": 0.7846153846153846, + "spec_f1_L4Quanti": 0.9514066496163683, + "spec_prec_L4Quanti": 0.9253731343283582, + "spec_recall_L4Quanti": 0.9789473684210527, + "spec_qwk": 
0.9187882106031572, + "spec_mae": 0.14083333333333334, + "spec_kripp_alpha": 0.9041056117796359, + "total_time_s": 6.83874113188358, + "num_samples": 1200, + "avg_ms_per_sample": 5.698950943236317, + "combined_macro_f1": 0.9016668038369215 + } +} \ No newline at end of file diff --git a/results/eval/iter1-clspool/report_gpt-54.txt b/results/eval/iter1-clspool/report_gpt-54.txt new file mode 100644 index 0000000..a9b51c6 --- /dev/null +++ b/results/eval/iter1-clspool/report_gpt-54.txt @@ -0,0 +1,54 @@ + +====================================================================== + HOLDOUT EVALUATION: iter1-clspool vs GPT-5.4 +====================================================================== + + Samples evaluated: 1200 + Total inference time: 6.84s + Avg latency: 5.70ms/sample + Throughput: 175 samples/sec + + ────────────────────────────────────────────────── + CATEGORY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.9296 ✓ (target: 0.80) + Weighted F1: 0.9307 + Macro Prec: 0.9290 + Macro Recall: 0.9334 + MCC: 0.9179 + AUC (OvR): 0.9911 + ECE: 0.0556 + Kripp Alpha: 0.9175 + + Category F1 Prec Recall + ------------------------- -------- -------- -------- + Board Governance 0.9518 0.9602 0.9435 + Incident Disclosure 0.9540 0.9651 0.9432 + Management Role 0.9290 0.9000 0.9600 + None/Other 0.8800 0.8049 0.9706 + Risk Management Process 0.8653 0.8883 0.8434 + Strategy Integration 0.9652 0.9905 0.9412 + Third-Party Risk 0.9621 0.9940 0.9322 + + ────────────────────────────────────────────────── + SPECIFICITY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.8920 ✓ (target: 0.80) + Weighted F1: 0.9098 + Macro Prec: 0.9042 + Macro Recall: 0.8836 + MCC: 0.8634 + AUC (OvR): 0.9778 + QWK: 0.9225 + MAE: 0.1275 + ECE: 0.0766 + Kripp Alpha: 0.9100 + + Level F1 Prec Recall + ------------------------- -------- -------- -------- + L1: Generic 0.9362 0.9230 0.9498 + L2: Domain 0.8091 0.8865 0.7440 + L3: Firm-Specific 
0.8718 0.8423 0.9034 + L4: Quantified 0.9510 0.9652 0.9372 + +====================================================================== diff --git a/results/eval/iter1-clspool/report_opus-46.txt b/results/eval/iter1-clspool/report_opus-46.txt new file mode 100644 index 0000000..1a89adb --- /dev/null +++ b/results/eval/iter1-clspool/report_opus-46.txt @@ -0,0 +1,54 @@ + +====================================================================== + HOLDOUT EVALUATION: iter1-clspool vs Opus-4.6 +====================================================================== + + Samples evaluated: 1200 + Total inference time: 6.84s + Avg latency: 5.70ms/sample + Throughput: 175 samples/sec + + ────────────────────────────────────────────────── + CATEGORY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.9229 ✓ (target: 0.80) + Weighted F1: 0.9228 + Macro Prec: 0.9183 + Macro Recall: 0.9311 + MCC: 0.9102 + AUC (OvR): 0.9925 + ECE: 0.0622 + Kripp Alpha: 0.9096 + + Category F1 Prec Recall + ------------------------- -------- -------- -------- + Board Governance 0.9455 0.9204 0.9720 + Incident Disclosure 0.9212 0.8837 0.9620 + Management Role 0.9245 0.9187 0.9304 + None/Other 0.9115 0.8476 0.9858 + Risk Management Process 0.8529 0.9096 0.8028 + Strategy Integration 0.9498 0.9905 0.9123 + Third-Party Risk 0.9550 0.9578 0.9521 + + ────────────────────────────────────────────────── + SPECIFICITY CLASSIFICATION + ────────────────────────────────────────────────── + Macro F1: 0.8804 ✓ (target: 0.80) + Weighted F1: 0.8976 + Macro Prec: 0.8892 + Macro Recall: 0.8750 + MCC: 0.8466 + AUC (OvR): 0.9698 + QWK: 0.9188 + MAE: 0.1408 + ECE: 0.0874 + Kripp Alpha: 0.9041 + + Level F1 Prec Recall + ------------------------- -------- -------- -------- + L1: Generic 0.9267 0.9041 0.9504 + L2: Domain 0.7972 0.8085 0.7862 + L3: Firm-Specific 0.8465 0.9189 0.7846 + L4: Quantified 0.9514 0.9254 0.9789 + +====================================================================== diff 
--git a/results/eval/iter1-dapt/figures/calibration_cat_gpt-5.4.png b/results/eval/iter1-dapt/figures/calibration_cat_gpt-5.4.png new file mode 100644 index 0000000..9092ab9 Binary files /dev/null and b/results/eval/iter1-dapt/figures/calibration_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-dapt/figures/calibration_cat_opus-4.6.png b/results/eval/iter1-dapt/figures/calibration_cat_opus-4.6.png new file mode 100644 index 0000000..263723b Binary files /dev/null and b/results/eval/iter1-dapt/figures/calibration_cat_opus-4.6.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_cat_gpt-5.4.png b/results/eval/iter1-dapt/figures/confusion_cat_gpt-5.4.png new file mode 100644 index 0000000..dbdc2bd Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_cat_gpt-5.4.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_cat_opus-4.6.png b/results/eval/iter1-dapt/figures/confusion_cat_opus-4.6.png new file mode 100644 index 0000000..92b4600 Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_cat_opus-4.6.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_spec_gpt-5.4.png b/results/eval/iter1-dapt/figures/confusion_spec_gpt-5.4.png new file mode 100644 index 0000000..1ff03a8 Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_spec_gpt-5.4.png differ diff --git a/results/eval/iter1-dapt/figures/confusion_spec_opus-4.6.png b/results/eval/iter1-dapt/figures/confusion_spec_opus-4.6.png new file mode 100644 index 0000000..4299858 Binary files /dev/null and b/results/eval/iter1-dapt/figures/confusion_spec_opus-4.6.png differ diff --git a/results/eval/iter1-dapt/figures/model_comparison.png b/results/eval/iter1-dapt/figures/model_comparison.png new file mode 100644 index 0000000..7dca2d9 Binary files /dev/null and b/results/eval/iter1-dapt/figures/model_comparison.png differ diff --git a/results/eval/iter1-dapt/figures/per_class_f1_gpt-5.4.png 
b/results/eval/iter1-dapt/figures/per_class_f1_gpt-5.4.png
new file mode 100644
index 0000000..627d867
Binary files /dev/null and b/results/eval/iter1-dapt/figures/per_class_f1_gpt-5.4.png differ
diff --git a/results/eval/iter1-dapt/figures/per_class_f1_opus-4.6.png b/results/eval/iter1-dapt/figures/per_class_f1_opus-4.6.png
new file mode 100644
index 0000000..5f1c8aa
Binary files /dev/null and b/results/eval/iter1-dapt/figures/per_class_f1_opus-4.6.png differ
diff --git a/results/eval/iter1-dapt/figures/speed_comparison.png b/results/eval/iter1-dapt/figures/speed_comparison.png
new file mode 100644
index 0000000..ada2251
Binary files /dev/null and b/results/eval/iter1-dapt/figures/speed_comparison.png differ
diff --git a/results/eval/iter1-dapt/metrics.json b/results/eval/iter1-dapt/metrics.json
new file mode 100644
index 0000000..2390368
--- /dev/null
+++ b/results/eval/iter1-dapt/metrics.json
@@ -0,0 +1,298 @@
+{
+  "iter1-dapt_vs_GPT-5.4": {
+    "cat_macro_f1": 0.9350000205815902,
+    "cat_weighted_f1": 0.936034565494772,
+    "cat_macro_precision": 0.9344660111343602,
+    "cat_macro_recall": 0.9378555188267356,
+    "cat_mcc": 0.9246263785540332,
+    "cat_auc": 0.9915953686916092,
+    "cat_ece": 0.04942640244960788,
+    "cat_confusion_matrix": [
+      [
+        224,
+        0,
+        4,
+        0,
+        2,
+        0,
+        0
+      ],
+      [
+        0,
+        83,
+        0,
+        0,
+        2,
+        2,
+        1
+      ],
+      [
+        2,
+        0,
+        145,
+        1,
+        2,
+        0,
+        0
+      ],
+      [
+        0,
+        0,
+        2,
+        132,
+        1,
+        1,
+        0
+      ],
+      [
+        6,
+        1,
+        5,
+        18,
+        166,
+        1,
+        1
+      ],
+      [
+        0,
+        2,
+        1,
+        8,
+        1,
+        209,
+        0
+      ],
+      [
+        0,
+        0,
+        0,
+        0,
+        13,
+        0,
+        164
+      ]
+    ],
+    "cat_f1_BoardGov": 0.9696969696969697,
+    "cat_prec_BoardGov": 0.9655172413793104,
+    "cat_recall_BoardGov": 0.9739130434782609,
+    "cat_f1_Incident": 0.9540229885057471,
+    "cat_prec_Incident": 0.9651162790697675,
+    "cat_recall_Incident": 0.9431818181818182,
+    "cat_f1_Manageme": 0.9446254071661238,
+    "cat_prec_Manageme": 0.9235668789808917,
+    "cat_recall_Manageme": 0.9666666666666667,
+    "cat_f1_NoneOthe": 0.8949152542372881,
+    "cat_prec_NoneOthe": 0.8301886792452831,
+    "cat_recall_NoneOthe": 0.9705882352941176,
+    "cat_f1_RiskMana": 0.8623376623376623,
+    "cat_prec_RiskMana": 0.8877005347593583,
+    "cat_recall_RiskMana": 0.8383838383838383,
+    "cat_f1_Strategy": 0.9631336405529954,
+    "cat_prec_Strategy": 0.9812206572769953,
+    "cat_recall_Strategy": 0.9457013574660633,
+    "cat_f1_Third-Pa": 0.956268221574344,
+    "cat_prec_Third-Pa": 0.9879518072289156,
+    "cat_recall_Third-Pa": 0.9265536723163842,
+    "cat_kripp_alpha": 0.9243058890635424,
+    "spec_macro_f1": 0.8959443847575952,
+    "spec_weighted_f1": 0.914085249793483,
+    "spec_macro_precision": 0.9055333144570721,
+    "spec_macro_recall": 0.889132193611932,
+    "spec_mcc": 0.8698798188273218,
+    "spec_auc": 0.9806421467148638,
+    "spec_ece": 0.0693218584855397,
+    "spec_confusion_matrix": [
+      [
+        588,
+        14,
+        13,
+        3
+      ],
+      [
+        32,
+        126,
+        8,
+        2
+      ],
+      [
+        11,
+        4,
+        191,
+        1
+      ],
+      [
+        2,
+        2,
+        10,
+        193
+      ]
+    ],
+    "spec_f1_L1Generi": 0.9400479616306955,
+    "spec_prec_L1Generi": 0.9289099526066351,
+    "spec_recall_L1Generi": 0.9514563106796117,
+    "spec_f1_L2Domain": 0.802547770700637,
+    "spec_prec_L2Domain": 0.863013698630137,
+    "spec_recall_L2Domain": 0.75,
+    "spec_f1_L3Firm-S": 0.8904428904428905,
+    "spec_prec_L3Firm-S": 0.8603603603603603,
+    "spec_recall_L3Firm-S": 0.9227053140096618,
+    "spec_f1_L4Quanti": 0.9507389162561576,
+    "spec_prec_L4Quanti": 0.9698492462311558,
+    "spec_recall_L4Quanti": 0.9323671497584541,
+    "spec_qwk": 0.9315994086072762,
+    "spec_mae": 0.11666666666666667,
+    "spec_kripp_alpha": 0.9194074359344485,
+    "total_time_s": 6.855555058107711,
+    "num_samples": 1200,
+    "avg_ms_per_sample": 5.712962548423093,
+    "combined_macro_f1": 0.9154722026695927
+  },
+  "iter1-dapt_vs_Opus-4.6": {
+    "cat_macro_f1": 0.9277442873196512,
+    "cat_weighted_f1": 0.9268438855804646,
+    "cat_macro_precision": 0.9237899595225246,
+    "cat_macro_recall": 0.9349393170438051,
+    "cat_mcc": 0.9150420281652446,
+    "cat_auc": 0.9934333602136249,
+    "cat_ece": 0.057411353190739985,
+    "cat_confusion_matrix": [
+      [
+        210,
+        0,
+        2,
+        1,
+        1,
+        0,
+        0
+      ],
+      [
+        0,
+        77,
+        0,
+        0,
+        1,
+        0,
+        1
+      ],
+      [
+        8,
+        0,
+        145,
+        1,
+        3,
+        0,
+        1
+      ],
+      [
+        0,
+        0,
+        0,
+        139,
+        2,
+        0,
+        0
+      ],
+      [
+        13,
+        0,
+        9,
+        13,
+        172,
+        1,
+        5
+      ],
+      [
+        1,
+        9,
+        1,
+        4,
+        2,
+        211,
+        0
+      ],
+      [
+        0,
+        0,
+        0,
+        1,
+        6,
+        1,
+        159
+      ]
+    ],
+    "cat_f1_BoardGov": 0.9417040358744395,
+    "cat_prec_BoardGov": 0.9051724137931034,
+    "cat_recall_BoardGov": 0.9813084112149533,
+    "cat_f1_Incident": 0.9333333333333333,
+    "cat_prec_Incident": 0.8953488372093024,
+    "cat_recall_Incident": 0.9746835443037974,
+    "cat_f1_Manageme": 0.9206349206349206,
+    "cat_prec_Manageme": 0.9235668789808917,
+    "cat_recall_Manageme": 0.9177215189873418,
+    "cat_f1_NoneOthe": 0.9266666666666666,
+    "cat_prec_NoneOthe": 0.8742138364779874,
+    "cat_recall_NoneOthe": 0.9858156028368794,
+    "cat_f1_RiskMana": 0.86,
+    "cat_prec_RiskMana": 0.9197860962566845,
+    "cat_recall_RiskMana": 0.8075117370892019,
+    "cat_f1_Strategy": 0.9569160997732427,
+    "cat_prec_Strategy": 0.9906103286384976,
+    "cat_recall_Strategy": 0.9254385964912281,
+    "cat_f1_Third-Pa": 0.954954954954955,
+    "cat_prec_Third-Pa": 0.9578313253012049,
+    "cat_recall_Third-Pa": 0.9520958083832335,
+    "cat_kripp_alpha": 0.9144489824694872,
+    "spec_macro_f1": 0.8823881241075249,
+    "spec_weighted_f1": 0.8997013825586678,
+    "spec_macro_precision": 0.8895415282112857,
+    "spec_macro_recall": 0.8784196767594721,
+    "spec_mcc": 0.84923108221758,
+    "spec_auc": 0.9732413764660657,
+    "spec_ece": 0.08008741805950799,
+    "spec_confusion_matrix": [
+      [
+        573,
+        22,
+        9,
+        1
+      ],
+      [
+        26,
+        114,
+        3,
+        2
+      ],
+      [
+        34,
+        10,
+        207,
+        9
+      ],
+      [
+        0,
+        0,
+        3,
+        187
+      ]
+    ],
+    "spec_f1_L1Generi": 0.925686591276252,
+    "spec_prec_L1Generi": 0.9052132701421801,
+    "spec_recall_L1Generi": 0.947107438016529,
+    "spec_f1_L2Domain": 0.7835051546391752,
+    "spec_prec_L2Domain": 0.7808219178082192,
+    "spec_recall_L2Domain": 0.7862068965517242,
+    "spec_f1_L3Firm-S": 0.8589211618257261,
+    "spec_prec_L3Firm-S": 0.9324324324324325,
+    "spec_recall_L3Firm-S": 0.7961538461538461,
+    "spec_f1_L4Quanti": 0.961439588688946,
+    "spec_prec_L4Quanti": 0.9396984924623115,
+    "spec_recall_L4Quanti": 0.9842105263157894,
+    "spec_qwk": 0.9200429286057613,
+    "spec_mae": 0.13833333333333334,
+    "spec_kripp_alpha": 0.9047987190793844,
+    "total_time_s": 6.855555058107711,
+    "num_samples": 1200,
+    "avg_ms_per_sample": 5.712962548423093,
+    "combined_macro_f1": 0.9050662057135881
+  }
+}
\ No newline at end of file
diff --git a/results/eval/iter1-dapt/report_gpt-54.txt b/results/eval/iter1-dapt/report_gpt-54.txt
new file mode 100644
index 0000000..bdafc0b
--- /dev/null
+++ b/results/eval/iter1-dapt/report_gpt-54.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+          HOLDOUT EVALUATION: iter1-dapt vs GPT-5.4
+======================================================================
+
+  Samples evaluated:      1200
+  Total inference time:   6.86s
+  Avg latency:            5.71ms/sample
+  Throughput:             175 samples/sec
+
+  ──────────────────────────────────────────────────
+  CATEGORY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.9350  ✓ (target: 0.80)
+  Weighted F1:    0.9360
+  Macro Prec:     0.9345
+  Macro Recall:   0.9379
+  MCC:            0.9246
+  AUC (OvR):      0.9916
+  ECE:            0.0494
+  Kripp Alpha:    0.9243
+
+  Category                  F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  Board Governance          0.9697   0.9655   0.9739
+  Incident Disclosure       0.9540   0.9651   0.9432
+  Management Role           0.9446   0.9236   0.9667
+  None/Other                0.8949   0.8302   0.9706
+  Risk Management Process   0.8623   0.8877   0.8384
+  Strategy Integration      0.9631   0.9812   0.9457
+  Third-Party Risk          0.9563   0.9880   0.9266
+
+  ──────────────────────────────────────────────────
+  SPECIFICITY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.8959  ✓ (target: 0.80)
+  Weighted F1:    0.9141
+  Macro Prec:     0.9055
+  Macro Recall:   0.8891
+  MCC:            0.8699
+  AUC (OvR):      0.9806
+  QWK:            0.9316
+  MAE:            0.1167
+  ECE:            0.0693
+  Kripp Alpha:    0.9194
+
+  Level                     F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  L1: Generic               0.9400   0.9289   0.9515
+  L2: Domain                0.8025   0.8630   0.7500
+  L3: Firm-Specific         0.8904   0.8604   0.9227
+  L4: Quantified            0.9507   0.9698   0.9324
+
+======================================================================
diff --git a/results/eval/iter1-dapt/report_opus-46.txt b/results/eval/iter1-dapt/report_opus-46.txt
new file mode 100644
index 0000000..83a28cd
--- /dev/null
+++ b/results/eval/iter1-dapt/report_opus-46.txt
@@ -0,0 +1,54 @@
+
+======================================================================
+          HOLDOUT EVALUATION: iter1-dapt vs Opus-4.6
+======================================================================
+
+  Samples evaluated:      1200
+  Total inference time:   6.86s
+  Avg latency:            5.71ms/sample
+  Throughput:             175 samples/sec
+
+  ──────────────────────────────────────────────────
+  CATEGORY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.9277  ✓ (target: 0.80)
+  Weighted F1:    0.9268
+  Macro Prec:     0.9238
+  Macro Recall:   0.9349
+  MCC:            0.9150
+  AUC (OvR):      0.9934
+  ECE:            0.0574
+  Kripp Alpha:    0.9144
+
+  Category                  F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  Board Governance          0.9417   0.9052   0.9813
+  Incident Disclosure       0.9333   0.8953   0.9747
+  Management Role           0.9206   0.9236   0.9177
+  None/Other                0.9267   0.8742   0.9858
+  Risk Management Process   0.8600   0.9198   0.8075
+  Strategy Integration      0.9569   0.9906   0.9254
+  Third-Party Risk          0.9550   0.9578   0.9521
+
+  ──────────────────────────────────────────────────
+  SPECIFICITY CLASSIFICATION
+  ──────────────────────────────────────────────────
+  Macro F1:       0.8824  ✓ (target: 0.80)
+  Weighted F1:    0.8997
+  Macro Prec:     0.8895
+  Macro Recall:   0.8784
+  MCC:            0.8492
+  AUC (OvR):      0.9732
+  QWK:            0.9200
+  MAE:            0.1383
+  ECE:            0.0801
+  Kripp Alpha:    0.9048
+
+  Level                     F1       Prec     Recall
+  ------------------------- -------- -------- --------
+  L1: Generic               0.9257   0.9052   0.9471
+  L2: Domain                0.7835   0.7808   0.7862
+  L3: Firm-Specific         0.8589   0.9324   0.7962
+  L4: Quantified            0.9614   0.9397   0.9842
+
+======================================================================
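
Two aggregate fields in the `metrics.json` above are derived rather than measured, and the numbers confirm the relationships: `combined_macro_f1` is exactly the unweighted mean of `cat_macro_f1` and `spec_macro_f1`, and `avg_ms_per_sample` is `total_time_s` × 1000 / `num_samples`. A minimal sanity-check sketch, with the GPT-5.4 entry's values copied inline so it stands alone:

```python
# Aggregate fields from the "iter1-dapt_vs_GPT-5.4" entry of
# results/eval/iter1-dapt/metrics.json (values copied verbatim).
m = {
    "cat_macro_f1": 0.9350000205815902,
    "spec_macro_f1": 0.8959443847575952,
    "total_time_s": 6.855555058107711,
    "num_samples": 1200,
    "avg_ms_per_sample": 5.712962548423093,
    "combined_macro_f1": 0.9154722026695927,
}

# combined_macro_f1 = unweighted mean of the two head-level macro F1s
combined = (m["cat_macro_f1"] + m["spec_macro_f1"]) / 2
assert abs(combined - m["combined_macro_f1"]) < 1e-12

# avg_ms_per_sample = total wall time spread over the 1200 holdout samples
latency_ms = m["total_time_s"] * 1000 / m["num_samples"]
assert abs(latency_ms - m["avg_ms_per_sample"]) < 1e-12
```

The same identities hold for the Opus-4.6 entry, so the "combined" column used throughout §10 weights the category and specificity heads equally.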