quantization + onnx sweeps

Phase 10.8: torchao/bnb quant sweep on iter1-independent. bf16 already
optimal; torchao int8-wo gives -19% VRAM at no F1 cost; all 4-bit
variants collapse (ModernBERT-large too quant-sensitive).

Phase 10.9: ONNX export + ORT eval. Legacy exporter only working path
(dynamo adds 56 Memcpy nodes); ORT fp32 -22% latency vs torch via
kernel fusion but bf16+flash-attn-2 still wins; fp16 broken on rotary;
dynamic int8 silently CPU-fallback + 0.5 F1 collapse.

Driver scripts wired to bun run py:quant / py:onnx; full reports at
results/eval/{quant,onnx}/REPORT.md.
This commit is contained in:
Joey Eamigh 2026-04-07 05:10:14 -04:00
parent ebfc103dac
commit 67beaede45
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
19 changed files with 4367 additions and 0 deletions

.gitignore vendored

@ -2,7 +2,10 @@
/data/
/models/
/checkpoints/
/results/eval/onnx/models/
*.tar.zst
*.onnx
*.onnx.data
# Dependencies
ts/node_modules/


@ -1097,6 +1097,212 @@ epoch 3 + no temperature scaling would be a reasonable alternative choice.
> in-distribution confidence memorization. Temperature scaling recovers
> calibration (ECE 33% cat, 40% spec) without altering predictions."*
### 10.8 Quantization Sweep (2026-04-07)
**Question:** does post-training quantization buy us a smaller / faster
deployable model without giving back accuracy? And — almost more
interesting — *which* quant schemes does ModernBERT-large tolerate?
**Setup:** new sweep driver at `python/scripts/quantize_sweep.py` (wired
to `bun run py:quant`). Loads the iter1-independent checkpoint, applies
each scheme to the encoder backbone only (heads stay bf16), reruns the
full holdout eval against GPT-5.4 and Opus-4.6 proxy gold, and records
latency, peak VRAM, encoder footprint, and the full metrics suite. 5
warmup batches before timing; batch 64; max_seq 512; RTX 3090.
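The timing protocol (5 warmup batches excluded from timing, then timed batches of 64) is simple enough to state as a sketch; `bench` and its arguments are illustrative, not the driver's actual API, and the real driver additionally synchronizes CUDA around each timed forward:

```python
import time

def bench(run_batch, batches, warmup=5):
    """Per-sample latency in ms: warm up first, then time only forwards."""
    for b in batches[:warmup]:
        run_batch(b)                      # warmup pass, excluded from timing
    total, n = 0.0, 0
    for b in batches:
        t0 = time.perf_counter()
        run_batch(b)                      # real driver syncs CUDA here
        total += time.perf_counter() - t0
        n += len(b)
    return (total / n) * 1000
```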
**Variants:** fp32, bf16 (baseline), fp16, torchao int8 weight-only,
torchao int8 dynamic-act + int8 weight, torchao int4 weight-only,
bitsandbytes LLM.int8, bitsandbytes nf4 (with and without
double-quantization), bitsandbytes fp4.
**Results (vs GPT-5.4 proxy gold):**

| variant | enc MB | ms/samp | thru/s | VRAM MB | cat F1 | spec F1 | spec QWK |
|--------------------|-------:|--------:|-------:|--------:|-------:|--------:|---------:|
| fp32 | 1579 | 16.29 | 61 | 3504 | 0.9337 | 0.8943 | 0.9321 |
| **bf16 baseline** | 790 | 5.52 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 |
| fp16 | 790 | 5.54 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 |
| **torchao int8-wo**| ~395 | 6.08 | 165 | 1416 | 0.9345 | 0.8941 | 0.9330 |
| torchao int8-dyn | ~395 | 9.67 | 103 | 1774 | 0.9336 | 0.8918 | 0.9315 |
| torchao int4-wo | — | — | — | — | err | err | err |
| bnb LLM.int8 | ~395 | 7.76 | 129 | 2135 | 0.9361 | 0.8986 | 0.9308 |
| bnb nf4 (DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 |
| bnb nf4 (no DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 |
| bnb fp4 | 275 | 5.87 | 170 | 1287 | 0.1629 | 0.2085 | 0.2326 |
(torchao subclass tensors report the bf16 element_size, so the "~395 MB"
rows are hand-computed storage estimates, not what summing
`param.numel() * param.element_size()` returns.)
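The footprint column is back-of-envelope arithmetic: params × bytes per weight. A sketch (the encoder-param count is backed out of the fp32 row, not measured here):

```python
# Back out the encoder param count from the fp32 row (1579 MB at 4 B/weight);
# ~395M is an inference from the table, not a measured number.
ENC_PARAMS = 1579e6 / 4

def footprint_mb(bits_per_weight: float, meta_bytes_per_weight: float = 0.0) -> float:
    """Encoder storage estimate in MB at a given weight width."""
    return ENC_PARAMS * (bits_per_weight / 8 + meta_bytes_per_weight) / 1e6

fp32_mb = footprint_mb(32)  # ≈ 1579
bf16_mb = footprint_mb(16)  # ≈ 790
int8_mb = footprint_mb(8)   # ≈ 395, the table's "true storage" figure
# nf4: 4-bit weights plus one fp32 absmax per 64-weight block. The table's
# 275 MB also includes non-Linear params left in bf16, so this undershoots.
nf4_mb = footprint_mb(4, meta_bytes_per_weight=4 / 64)
```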
**Six findings:**
1. **bf16 + flash-attn-2 is already the sweet spot.** 3.0× throughput over
fp32 with bit-identical accuracy and half the VRAM. Nothing in the
precision dimension beats it on this hardware.
2. **fp16 ≡ bf16.** RTX 3090 has matched fp16/bf16 tensor-core throughput
and the model has no overflow issues; pick whichever the loader
prefers.
3. **torchao int8 weight-only is the only quantization that's worth
shipping.** -19% VRAM (1741 → 1416 MB), accuracy delta inside ±0.002
per-seed noise, +10% latency because RTX 3090 (sm_86) lacks the int8
tensor-core matmul path that torchao would otherwise route through —
so the int8 weight is dequantized to bf16 on the fly. **This is the
variant we'd ship as the "low-VRAM" deployment option**, and on
Hopper / Ada the latency would invert and be a strict win.
4. **torchao int8 dynamic-activation regresses on Ampere.** -43%
throughput and *more* peak VRAM than bf16 because the per-batch
activation quantization adds work without unlocking the int8
matmul. Skip.
5. **bnb LLM.int8 is the slowest int8 path and uses *more* VRAM than
bf16.** Mixed-precision outlier handling adds 23% peak memory and 41%
latency for an F1 bump that's inside noise. It's tuned for LLM-scale
models where outlier features dominate quant error; for an
encoder this size on a single 3090 it's a regression.
6. **All 4-bit variants collapse to near-random.** Both nf4 (DQ and
no-DQ) and fp4 produce essentially category-prior and L1-collapsed
predictions (cat ECE jumps from 0.054 to 0.10–0.21). We verified per
layer that the dequantized weights of one MLP `Wi` differ from the
original by mean 0.005 / max 0.11 — quantization is *correct* — but
the relative output drift on a single Linear is already ~98% (mean),
and that compounds across 28 transformer blocks + GLU FFN paths until
the [CLS]/pooled representation no longer carries the discriminative
signal. **DQ vs no-DQ produce bit-identical predictions** because the
nf4 weight indices are stable under absmax requantization (only the
metadata block differs). The catastrophe is inherent to 4-bit weight
precision on this architecture, not to a config knob. Recovering 4-bit
would require QAT, GPTQ/AWQ-style per-channel calibration, or keeping
the GLU FFN in 8-bit while only 4-bit'ing attention projections —
none reachable inside the remaining capstone budget.
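The compounding argument can be illustrated with a toy simulation: uniform fake-quantization of random tanh-Linear blocks (a crude stand-in for nf4's per-block non-uniform scheme, and for ModernBERT's actual blocks) already shows 8-bit weights surviving 28 layers while 4-bit decorrelates:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    # Symmetric per-tensor uniform quantization, a crude stand-in for
    # nf4/fp4 (those are non-uniform and per-block); enough to show scaling.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def cos_after_layers(n_layers: int, bits: int, dim: int = 256) -> float:
    """Cosine similarity between full-precision and weight-quantized
    activations after n_layers random tanh-Linear blocks."""
    x = xq = rng.standard_normal(dim)
    for _ in range(n_layers):
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        x = np.tanh(w @ x)
        xq = np.tanh(fake_quant(w, bits) @ xq)
    return float(x @ xq / (np.linalg.norm(x) * np.linalg.norm(xq)))

# 8-bit weights stay aligned over 28 blocks; 4-bit per-layer weight error
# is ~20× larger and compounds until the output decorrelates.
```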
**Paper hooks:**
- Add a "deployment precision" row to the speed/cost table — bf16 vs
torchao int8-wo gives a clean Pareto pair (latency vs VRAM).
- One paragraph in the discussion alongside the DAPT and CORAL nulls:
*naive post-training 4-bit weight quantization is not viable for
ModernBERT-large on this task; the GLU FFN amplifies per-layer weight
error across 28 blocks until signal is destroyed*. This is a useful
counterpoint to the 4-bit-by-default LLM serving narrative and a
legitimate negative result tied to architectural choices.
- Caveat the int8 latency rows with the sm_86 hardware footnote — the
result would invert on H100/A100/Ada.
Full standalone report at `results/eval/quant/REPORT.md`; per-variant
metrics at `results/eval/quant/<variant>/metrics.json`; aggregate row data
at `results/eval/quant/summary.json`.
### 10.9 ONNX Export + Eval (2026-04-07)
**Question:** can we get a portable ONNX artifact with comparable
latency / accuracy? What does the ORT path look like for fp32, fp16,
and int8?
**Setup:** new driver at `python/scripts/onnx_export_eval.py` (`bun run
py:onnx`). Exports the iter1-independent checkpoint, runs ORT inference
on the full holdout via CUDAExecutionProvider, and compares against the
proxy gold.
**Six things broke along the way; documenting because each one is a real
gotcha for the paper's reproducibility section:**
1. **Dynamo exporter optimizer crashes.** `torch.onnx.export(...,
dynamo=True)` translates the graph but its post-translation `InlinePass`
trips on `onnx_ir`. Workaround: `optimize=False`.
2. **Dynamo-exported graph is unusable on CUDA EP.** ORT inserts 56
Memcpy nodes between layers because dynamo emits scalar tensors with
CPU-side placement metadata. Result: 42.9 ms/sample (2.6× torch fp32)
and 15.4 GB peak VRAM (4.4× torch fp32). The legacy TorchScript
exporter (`dynamo=False`) only inserts 1 Memcpy and is the only
working export path.
3. **`op_types_to_quantize=['MatMul']` quantizes nothing on the dynamo
graph.** Dynamo emits encoder linears as `Gemm`, not `MatMul`. Need
`['MatMul', 'Gemm']`.
4. **Both ORT shape-inference paths choke on ModernBERT.** Symbolic
inference asserts in `_infer_Range` (the rotary embedding's `limit`
input is not a scalar); the C++ path raises a (1024)/(7) dimension
mismatch on the category head Gemm. The `skip_*` flags on
`quant_pre_process` are *ignored* — it always runs symbolic shape
inference — and `ONNXQuantizer.__init__` calls
`save_and_reload_model_with_shape_infer` unconditionally. Workaround:
monkey-patch both bindings to no-ops, then pass
`extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}` so the
quantizer can still type the head MatMul output.
5. **fp16 conversion via `onnxconverter_common` breaks on rotary
embeddings.** Two distinct failure modes seen across exports, e.g. `Type
parameter (T) of Optype (Mul) bound to different types
(tensor(float) and tensor(float16)) in node
/model/backbone/rotary_emb_1/Mul_2`. The converter leaves the
`inv_freq` buffer in fp32 and the surrounding `Mul`/`Expand` ops
then can't unify their type parameter. Patchable with an
`op_block_list` for the rotary subgraph, but cost/value isn't there
given the int8 result below.
6. **Dynamic int8 via ORT silently falls back to CPU.** The quantizer
replaces Gemm/MatMul with `MatMulInteger` + `DynamicQuantizeLinear`,
neither of which has CUDA kernels in onnxruntime-gpu 1.24. Session
creation succeeds with `CUDAExecutionProvider` but routes the
quantized ops to the CPU EP — observable from the load-time GPU
memory delta collapsing from 2074 MB (fp32) to 266 MB (int8) and
latency exploding to **95.9 ms/sample**. Accuracy also drops to
cat F1 = 0.397 / spec F1 = 0.336, further confirming the kernel
path is wrong (not just slow).
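For the record, the `op_block_list` patch alluded to in gotcha 5 would amount to keeping the rotary subgraph in fp32. `convert_float_to_float16` does accept `op_block_list`/`node_block_list` arguments; the name filter below is a hypothetical sketch of what would feed it, not a tested fix (node names are from the error in gotcha 5):

```python
def rotary_node_block_list(node_names: list[str]) -> list[str]:
    """Nodes to keep in fp32 when converting the rest of the graph to fp16.

    Intended to feed onnxconverter_common's node_block_list; matching on
    the rotary subgraph's name is a guess at the right granularity.
    """
    return [n for n in node_names if "rotary_emb" in n]

names = [
    "/model/backbone/rotary_emb_1/Mul_2",    # fp32 inv_freq math: block it
    "/model/backbone/layers.0/attn/MatMul",  # ordinary matmul: convert it
]
```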
**Results (legacy exporter, 1,200 holdout, vs GPT-5.4):**

| variant | size MB | ms/samp | VRAM MB | cat F1 | spec F1 | spec QWK |
|--------------------|--------:|--------:|--------:|-------:|--------:|---------:|
| **onnx-fp32** | 1583 | 12.70 | 8228 | 0.9337 | 0.8952 | 0.9324 |
| onnx-fp16 | 754 | err | err | err | err | err |
| onnx-int8 (dynamic)| 527 | 95.91 | ~CPU | 0.3972 | 0.3364 | 0.4413 |
For comparison, the torch baselines from Phase 10.8:
- torch fp32: 16.29 ms / 3504 MB / cat 0.9337 / spec 0.8943
- torch bf16: **5.52 ms / 1741 MB** / cat 0.9337 / spec 0.8952
**Three findings:**
1. **The one clean win — ORT fp32 beats torch fp32 by 22% on latency
(12.70 vs 16.29 ms)** at bit-identical accuracy, thanks to ORT's
LayerNorm + Gelu + MatMul kernel fusion. VRAM is 2.3× torch's
(8228 vs 3504 MB) because the ORT session allocates a separate
~5 GB workspace — fair trade for batched inference. But torch bf16
+ flash-attn-2 still wins outright on every dimension (5.52 ms,
1741 MB), so this is a moral victory at best.
2. **fp16 ONNX is currently unreachable** without writing custom rotary
handling for the float16 converter. Doable but several hours of
plumbing for an artifact that bf16 already dominates.
3. **ORT dynamic int8 is a deployment trap on this hardware.** It looks
like it works (export succeeds, file shrinks 1583 → 527 MB, session
constructs cleanly with CUDAExecutionProvider in the providers list),
but at runtime the integer matmul ops route to the CPU EP and the
model produces ~uniform-prior predictions because the per-channel
weight quantization interacts badly with the activation
quantization path. Both observations would silently bite a
production deployment that didn't run a holdout sanity check.
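The CPU-fallback symptom suggests a cheap deployment smoke test: compare the session's load-time GPU memory delta against a known-good fp32 session. The function and threshold below are illustrative, not part of the driver:

```python
def looks_like_cpu_fallback(load_vram_mb: float, fp32_load_vram_mb: float,
                            ratio: float = 0.5) -> bool:
    """Flag a session whose GPU footprint at load is suspiciously small.

    A quantized model should still stage most of its weights on the GPU;
    if the load-time delta collapses relative to the fp32 session, the
    quantized ops were probably assigned to the CPU EP. The 0.5 ratio is
    a guessed threshold, not a calibrated one.
    """
    return load_vram_mb < ratio * fp32_load_vram_mb

# From the numbers above: fp32 session loads ~2074 MB; int8 loads only 266 MB.
```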
**Net recommendation: don't ship ONNX for this model on this hardware.**
torchao int8-wo from §10.8 still owns the "smaller deployment" Pareto
slot (5.52 → 6.08 ms, 1741 → 1416 MB, F1 within ±0.001) more cleanly
than any ONNX variant we could produce here. ONNX would be worth
revisiting only for CPU-only deployment, cross-runtime portability
(TensorRT/OpenVINO/mobile), or a properly calibrated static int8 path
with a ModernBERT-aware op block list — none reachable inside the
remaining capstone budget.
**Paper hooks:**
- One paragraph in the deployment / reproducibility discussion:
*ONNX export of ModernBERT-large via the dynamo exporter is currently
broken (excessive Memcpy insertion); the legacy TorchScript exporter
produces a clean graph that's 22% faster than torch fp32 via ORT
kernel fusion, but bf16 + flash-attn-2 dominates at half the latency.
fp16 conversion via onnxconverter_common fails on rotary embeddings,
and ORT dynamic int8 silently falls back to CPU on
onnxruntime-gpu 1.24, dropping ~0.5 macro F1.*
- Add a "deployment lessons learned" sub-bullet to the limitations
section so a follow-on engineering team doesn't waste a day chasing
the same dead ends.
Full standalone report at `results/eval/onnx/REPORT.md`; aggregate
results at `results/eval/onnx/summary.json`; exported models at
`results/eval/onnx/models/`.
### Phase 10 Summary
| Experiment | Cost | Outcome | Paper value |
@ -1107,6 +1313,8 @@ epoch 3 + no temperature scaling would be a reasonable alternative choice.
| Temperature scaling | ~10 min GPU | ECE 33% cat, 40% spec, F1 unchanged | Calibration story, deployment quality |
| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds |
| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization |
| Quantization sweep (10 variants) | ~5 min GPU | bf16 already optimal; torchao int8-wo = 19% VRAM no F1 cost; **all 4-bit collapses** (ModernBERT-large too quant-sensitive) | Deployment Pareto + 4-bit null result |
| ONNX export + ORT eval | ~10 min GPU | Legacy exporter only working path; ORT fp32 22% latency vs torch (kernel fusion), but bf16 still wins; fp16 broken on rotary; int8 silently CPU-fallback + 0.5 F1 collapse | Deployment lessons learned, reproducibility caveats |
The 3-seed ensemble is now the recommended headline checkpoint. The
calibrated ECE numbers should replace the pre-scaling ECE in the paper. The


@ -20,6 +20,8 @@
"ts:sec": "bun run --filter sec-cybert sec",
"ts:typecheck": "bun run --filter sec-cybert typecheck",
"py:train": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run main.py",
"py:quant": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run scripts/quantize_sweep.py",
"py:onnx": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run scripts/onnx_export_eval.py",
"typecheck": "bun run --filter '*' typecheck",
"data:push": "./scripts/data-push.sh",
"data:pull": "./scripts/data-pull.sh",


@ -18,6 +18,11 @@ dependencies = [
"krippendorff>=0.8.2",
"matplotlib>=3.10.8",
"seaborn>=0.13.2",
"onnx>=1.21.0",
"onnxruntime-gpu>=1.24.4",
"onnxruntime>=1.24.4",
"onnxscript>=0.6.2",
"onnxconverter-common>=1.16.0",
]
[project.scripts]


@ -0,0 +1,369 @@
"""ONNX export + eval for the iter1-independent ModernBERT-large checkpoint.
Variants:
onnx-fp32 straight torch.onnx.export from the fp32 model
onnx-fp16 fp32 export converted to fp16 via onnxconverter_common
(proxy for bf16; ORT does not support bf16 inference natively)
onnx-int8-dyn dynamic int8 quantization of the fp32 graph via
onnxruntime.quantization.quantize_dynamic (weights in int8,
activations quantized at runtime)
For each variant:
- latency (ms/sample, batch=64, 5 warmup batches)
- peak GPU memory delta around the session (free-mem snapshot)
- on-disk size of model.onnx + model.onnx.data
- cat / spec macro F1, QWK, ECE on the 1,200-paragraph holdout
against GPT-5.4 + Opus-4.6 proxy gold
Usage:
bun run py:onnx
"""
from __future__ import annotations
import gc
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))
from src.finetune.data import CAT2ID, CATEGORIES, NUM_CATEGORIES, NUM_SPECIFICITY # noqa: E402
from src.finetune.eval import SPEC_LABELS, compute_all_metrics, load_holdout_data # noqa: E402
from src.finetune.model import ordinal_predict # noqa: E402
from scripts.quantize_sweep import ( # noqa: E402
BENCHMARKS, BATCH_SIZE, HOLDOUT, MAX_SEQ, PARAGRAPHS, WARMUP_BATCHES,
_build_model, evaluate_predictions,
)
OUTPUT_DIR = ROOT.parent / "results/eval/onnx"
ONNX_DIR = OUTPUT_DIR / "models"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
ONNX_DIR.mkdir(parents=True, exist_ok=True)
# ──────────────────────────────────────────────────────────────────────
# Export
# ──────────────────────────────────────────────────────────────────────
class _Wrap(nn.Module):
def __init__(self, model):
super().__init__()
self.model = model
def forward(self, input_ids, attention_mask):
out = self.model(input_ids=input_ids, attention_mask=attention_mask)
return out["category_logits"], out["specificity_logits"]
def export_fp32(out_path: Path, sample_batch: int = 4, sample_seq: int = 64) -> None:
print(f" building fp32 torch model...")
model, tokenizer = _build_model(torch.float32, attn_impl="sdpa")
model = model.cuda().eval()
wrap = _Wrap(model).cuda().eval()
dummy_text = ["the company maintains a cybersecurity program overseen by the board"] * sample_batch
enc = tokenizer(
dummy_text, padding="max_length", max_length=sample_seq,
truncation=True, return_tensors="pt",
).to("cuda")
print(f" exporting → {out_path}")
# Legacy TorchScript exporter (dynamo=False). The dynamo path produces a
# graph with 56+ Memcpy nodes when run on CUDAExecutionProvider, blowing
# latency 8× and VRAM 4× over native torch — unusable. The legacy
# exporter emits clean Gemm/MatMul/LayerNorm nodes ORT can fuse.
torch.onnx.export(
wrap,
(enc["input_ids"], enc["attention_mask"]),
str(out_path),
input_names=["input_ids", "attention_mask"],
output_names=["cat_logits", "spec_logits"],
dynamic_axes={
"input_ids": {0: "batch", 1: "seq"},
"attention_mask": {0: "batch", 1: "seq"},
"cat_logits": {0: "batch"},
"spec_logits": {0: "batch"},
},
opset_version=17,
dynamo=False,
do_constant_folding=True,
)
del wrap, model
gc.collect()
torch.cuda.empty_cache()
def convert_fp16(fp32_path: Path, fp16_path: Path) -> None:
"""Convert an fp32 ONNX model to fp16 via onnxconverter_common."""
import onnx
from onnxconverter_common import float16
print(f" loading {fp32_path}")
model = onnx.load(str(fp32_path), load_external_data=True)
print(f" converting to fp16...")
model_fp16 = float16.convert_float_to_float16(
model, keep_io_types=False, disable_shape_infer=True,
)
print(f" saving → {fp16_path}")
onnx.save_model(
model_fp16, str(fp16_path),
save_as_external_data=True,
all_tensors_to_one_file=True,
location=fp16_path.name + ".data",
size_threshold=1024,
)
def quantize_int8_dynamic(fp32_path: Path, int8_path: Path) -> None:
"""Dynamic int8 quantization (weights → int8, activations on the fly).
Two shape-inference paths in the ORT quantizer choke on the dynamo
export of ModernBERT-large:
1. `SymbolicShapeInference._infer_Range` asserts on the dynamic limit
input emitted by RoPE (`assert len(x) == 1` in `as_scalar`).
2. `onnx.shape_inference.infer_shapes_path` (C++) raises a (1024)/(7)
dim mismatch on the category head Gemm; the dynamo decomposition
leaves a dimension hint the C++ inferencer disagrees with.
The skip flags on `quant_pre_process` are ignored (it always runs
`SymbolicShapeInference.infer_shapes`), and `ONNXQuantizer.__init__`
calls `save_and_reload_model_with_shape_infer` unconditionally. We
monkey-patch both to no-ops, then run `quantize_dynamic` restricted to
MatMul/Gemm ops (the only nodes we want quantized anyway).
"""
import onnx
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization import quant_utils
from onnxruntime.tools import symbolic_shape_infer as sym
# No-op the broken shape passes.
original_save_reload = quant_utils.save_and_reload_model_with_shape_infer
def _passthrough(model):
return model
quant_utils.save_and_reload_model_with_shape_infer = _passthrough
# Some imports cache the symbol — patch the onnx_quantizer module too.
import onnxruntime.quantization.onnx_quantizer as oq
oq.save_and_reload_model_with_shape_infer = _passthrough
try:
print(f" quantizing {fp32_path} → {int8_path}")
quantize_dynamic(
model_input=str(fp32_path),
model_output=str(int8_path),
weight_type=QuantType.QInt8,
per_channel=True,
reduce_range=False,
op_types_to_quantize=["MatMul", "Gemm"],
use_external_data_format=True,
extra_options={"DefaultTensorType": onnx.TensorProto.FLOAT},
)
finally:
quant_utils.save_and_reload_model_with_shape_infer = original_save_reload
oq.save_and_reload_model_with_shape_infer = original_save_reload
# ──────────────────────────────────────────────────────────────────────
# Inference + metrics
# ──────────────────────────────────────────────────────────────────────
def _files_size(model_path: Path) -> int:
"""Sum of model.onnx + any external .data files in the same dir."""
total = model_path.stat().st_size
for sib in model_path.parent.iterdir():
if sib.name.startswith(model_path.name) and sib != model_path:
total += sib.stat().st_size
return total
def run_onnx(model_path: Path, texts: list[str], use_cuda: bool = True) -> dict:
import onnxruntime as ort
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"../checkpoints/finetune/iter1-independent/final"
)
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = (
["CUDAExecutionProvider", "CPUExecutionProvider"] if use_cuda
else ["CPUExecutionProvider"]
)
free_before, total_vram = torch.cuda.mem_get_info()
sess = ort.InferenceSession(str(model_path), so, providers=providers)
free_after_load, _ = torch.cuda.mem_get_info()
load_vram_mb = (free_before - free_after_load) / (1024 ** 2)
# Warmup
warm_enc = tokenizer(
texts[:BATCH_SIZE], truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="np",
)
warm_inputs = {
"input_ids": warm_enc["input_ids"].astype(np.int64),
"attention_mask": warm_enc["attention_mask"].astype(np.int64),
}
for _ in range(WARMUP_BATCHES):
sess.run(None, warm_inputs)
free_after_warm, _ = torch.cuda.mem_get_info()
peak_vram_mb = (free_before - free_after_warm) / (1024 ** 2)
cat_logits_list = []
spec_logits_list = []
total_time = 0.0
for i in range(0, len(texts), BATCH_SIZE):
batch = texts[i : i + BATCH_SIZE]
enc = tokenizer(
batch, truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="np",
)
inputs = {
"input_ids": enc["input_ids"].astype(np.int64),
"attention_mask": enc["attention_mask"].astype(np.int64),
}
t0 = time.perf_counter()
out = sess.run(None, inputs)
total_time += time.perf_counter() - t0
cat_logits_list.append(torch.from_numpy(out[0].astype(np.float32)))
spec_logits_list.append(torch.from_numpy(out[1].astype(np.float32)))
free_end, _ = torch.cuda.mem_get_info()
peak_vram_mb = max(peak_vram_mb, (free_before - free_end) / (1024 ** 2))
del sess
gc.collect()
torch.cuda.empty_cache()
return {
"cat_logits": torch.cat(cat_logits_list),
"spec_logits": torch.cat(spec_logits_list),
"ms_per_sample": (total_time / len(texts)) * 1000,
"throughput": len(texts) / total_time,
"peak_vram_mb": peak_vram_mb,
"load_vram_mb": load_vram_mb,
"providers": providers,
}
# ──────────────────────────────────────────────────────────────────────
# Driver
# ──────────────────────────────────────────────────────────────────────
def main():
print("loading holdout...")
records = load_holdout_data(
str(PARAGRAPHS), str(HOLDOUT), {k: str(v) for k, v in BENCHMARKS.items()},
)
texts = [r["text"] for r in records]
print(f" {len(records)} paragraphs")
fp32_path = ONNX_DIR / "model_fp32.onnx"
fp16_path = ONNX_DIR / "model_fp16.onnx"
int8_path = ONNX_DIR / "model_int8_dyn.onnx"
# ── Export fp32 (source for both fp16 and int8 quant) ──
if not fp32_path.exists():
print("\n══ exporting fp32 ONNX")
export_fp32(fp32_path)
else:
print(f"\n══ reusing existing {fp32_path}")
# ── fp16 conversion ──
if not fp16_path.exists():
print("\n══ converting → fp16 ONNX")
convert_fp16(fp32_path, fp16_path)
else:
print(f"\n══ reusing existing {fp16_path}")
# ── int8 dynamic quantization ──
if not int8_path.exists():
print("\n══ quantizing → int8 dynamic ONNX")
quantize_int8_dynamic(fp32_path, int8_path)
else:
print(f"\n══ reusing existing {int8_path}")
summary = []
variants = [
("onnx-fp32", fp32_path),
("onnx-fp16", fp16_path),
("onnx-int8-dyn", int8_path),
]
for name, path in variants:
print(f"\n══ {name}: {path.name}")
size_mb = _files_size(path) / 1e6
print(f" on-disk size: {size_mb:.1f} MB")
try:
inf = run_onnx(path, texts, use_cuda=True)
print(
f" latency {inf['ms_per_sample']:.2f} ms/sample, "
f"throughput {inf['throughput']:.0f}/s, "
f"peak VRAM {inf['peak_vram_mb']:.0f} MB "
f"(load {inf['load_vram_mb']:.0f} MB)"
)
row = {
"variant": name,
"model_mb": size_mb,
"ms_per_sample": inf["ms_per_sample"],
"throughput_per_s": inf["throughput"],
"peak_vram_mb": inf["peak_vram_mb"],
"load_vram_mb": inf["load_vram_mb"],
}
for ref in BENCHMARKS:
m = evaluate_predictions(inf["cat_logits"], inf["spec_logits"], records, ref)
print(
f" vs {ref}: cat F1={m['cat_macro_f1']:.4f}, "
f"spec F1={m['spec_macro_f1']:.4f}, QWK={m['spec_qwk']:.4f}, "
f"cat ECE={m['cat_ece']:.4f}, spec ECE={m['spec_ece']:.4f}"
)
row[f"{ref}_cat_f1"] = m["cat_macro_f1"]
row[f"{ref}_spec_f1"] = m["spec_macro_f1"]
row[f"{ref}_cat_mcc"] = m["cat_mcc"]
row[f"{ref}_spec_qwk"] = m["spec_qwk"]
row[f"{ref}_spec_mae"] = m["spec_mae"]
row[f"{ref}_cat_ece"] = m["cat_ece"]
row[f"{ref}_spec_ece"] = m["spec_ece"]
summary.append(row)
except Exception as e:
import traceback
traceback.print_exc()
summary.append({"variant": name, "error": f"{type(e).__name__}: {e}"})
summary_path = OUTPUT_DIR / "summary.json"
with open(summary_path, "w") as f:
json.dump(summary, f, indent=2, default=str)
print(f"\nsummary → {summary_path}")
print("\n" + "=" * 110)
print(f"{'variant':<18} {'MB':>9} {'ms/samp':>9} {'throughput':>11} "
f"{'VRAM MB':>9} {'cat F1':>9} {'spec F1':>9} {'spec QWK':>9}")
print("-" * 110)
for r in summary:
if "error" in r:
print(f"{r['variant']:<18} ERROR: {r['error']}")
continue
print(
f"{r['variant']:<18} {r['model_mb']:>9.1f} {r['ms_per_sample']:>9.2f} "
f"{r['throughput_per_s']:>11.0f} {r['peak_vram_mb']:>9.0f} "
f"{r['GPT-5.4_cat_f1']:>9.4f} {r['GPT-5.4_spec_f1']:>9.4f} "
f"{r['GPT-5.4_spec_qwk']:>9.4f}"
)
print("=" * 110)
if __name__ == "__main__":
main()


@ -0,0 +1,491 @@
"""Quantization sweep for the iter1-independent ModernBERT-large checkpoint.
Loads the trained DualHeadModernBERT, applies a series of quantization
schemes to the *encoder* (heads kept in their native dtype), and re-runs
holdout evaluation against the GPT-5.4 / Opus-4.6 proxy gold.
For each variant we record:
- cat / spec macro F1, per-class F1, QWK, MAE, ECE
- peak VRAM (encoder forward)
- latency (ms/sample, batch=64) and throughput
- encoder parameter footprint in MB
- delta vs bf16 baseline
Variants:
fp32, bf16 (baseline), fp16,
torchao int8 weight-only,
torchao int8 dynamic-act + int8 weight,
torchao int4 weight-only (group=128),
bitsandbytes LLM.int8 (8-bit),
bitsandbytes nf4 (4-bit, with and without double-quant, bf16 compute),
bitsandbytes fp4 (4-bit, bf16 compute).
Heads (category linear, attention pooler, independent threshold MLPs)
stay in bf16: they sit on a 1024-dim representation and account for
< 0.3% of params, so quantizing them buys nothing and risks the threshold
margins which already drive most of the spec error budget.
Usage:
bun run py:quant # via package.json wrapper
# or directly:
cd python && uv run scripts/quantize_sweep.py
"""
from __future__ import annotations
import gc
import json
import sys
import time
import traceback
from dataclasses import dataclass, field
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from safetensors.torch import load_file
from transformers import AutoModel, AutoTokenizer
# Make `src` importable when run as a script
ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))
from src.finetune.data import CAT2ID, CATEGORIES, NUM_CATEGORIES, NUM_SPECIFICITY # noqa: E402
from src.finetune.eval import ( # noqa: E402
SPEC_LABELS,
compute_all_metrics,
load_holdout_data,
)
from src.finetune.model import DualHeadModernBERT, ordinal_predict # noqa: E402
REPO = ROOT.parent
CHECKPOINT = REPO / "checkpoints/finetune/iter1-independent/final"
PARAGRAPHS = REPO / "data/paragraphs/paragraphs-clean.patched.jsonl"
HOLDOUT = REPO / "data/gold/v2-holdout-ids.json"
BENCHMARKS = {
"GPT-5.4": REPO / "data/annotations/v2-bench/gpt-5.4.jsonl",
"Opus-4.6": REPO / "data/annotations/v2-bench/opus-4.6.jsonl",
}
OUTPUT_DIR = REPO / "results/eval/quant"
BATCH_SIZE = 64
MAX_SEQ = 512
WARMUP_BATCHES = 5
# ──────────────────────────────────────────────────────────────────────
# Model loading
# ──────────────────────────────────────────────────────────────────────
def _build_model(dtype: torch.dtype, attn_impl: str = "sdpa") -> tuple[DualHeadModernBERT, AutoTokenizer]:
"""Construct DualHeadModernBERT and load trained weights at the requested dtype."""
tokenizer = AutoTokenizer.from_pretrained(str(CHECKPOINT))
backbone = AutoModel.from_pretrained(
"answerdotai/ModernBERT-large",
trust_remote_code=True,
attn_implementation=attn_impl,
dtype=dtype,
)
model = DualHeadModernBERT(
backbone=backbone,
hidden_size=backbone.config.hidden_size,
num_categories=NUM_CATEGORIES,
num_specificity=NUM_SPECIFICITY,
specificity_head_type="independent",
spec_mlp_dim=256,
pooling="attention",
)
state = load_file(str(CHECKPOINT / "model.safetensors"))
model.load_state_dict(state, strict=False)
model = model.to(dtype)
model.eval()
return model, tokenizer
def _try_flash_attn() -> str:
try:
import flash_attn # noqa: F401
return "flash_attention_2"
except ImportError:
return "sdpa"
# ──────────────────────────────────────────────────────────────────────
# Quantization variants
# ──────────────────────────────────────────────────────────────────────
def variant_native(dtype: torch.dtype, attn: str | None = None):
def _build():
impl = attn or _try_flash_attn()
# bf16/fp16 supported by flash-attn; fp32 must use sdpa
if dtype == torch.float32:
impl = "sdpa"
model, tok = _build_model(dtype, attn_impl=impl)
return model.cuda(), tok
return _build
def variant_torchao(config_factory):
def _build():
from torchao.quantization import quantize_
# torchao expects bf16 master weights
model, tok = _build_model(torch.bfloat16, attn_impl=_try_flash_attn())
model = model.cuda()
# Quantize encoder linears only (skip heads + attention pooler)
quantize_(model.backbone, config_factory())
return model, tok
return _build
def _swap_bnb_linear(
module: nn.Module,
mode: str,
compute_dtype=torch.bfloat16,
compress_statistics: bool = True,
) -> int:
"""Recursively replace nn.Linear with bnb 8-bit / 4-bit equivalents.
Returns number of layers swapped. Copies weights from the original
module so the trained checkpoint is preserved.
"""
import bitsandbytes as bnb
swapped = 0
for name, child in list(module.named_children()):
if isinstance(child, nn.Linear):
in_f, out_f = child.in_features, child.out_features
has_bias = child.bias is not None
if mode == "int8":
new = bnb.nn.Linear8bitLt(
in_f, out_f, bias=has_bias,
has_fp16_weights=False, threshold=6.0,
)
new.weight = bnb.nn.Int8Params(
child.weight.data.clone(),
requires_grad=False,
has_fp16_weights=False,
)
if has_bias:
new.bias = nn.Parameter(child.bias.data.clone())
elif mode in ("nf4", "fp4"):
new = bnb.nn.Linear4bit(
in_f, out_f, bias=has_bias,
compute_dtype=compute_dtype,
quant_type=mode,
compress_statistics=compress_statistics,
quant_storage=torch.uint8,
device="cuda",
)
w = child.weight.data.detach().to(torch.float32).clone()
new.weight = bnb.nn.Params4bit(
w, requires_grad=False, quant_type=mode,
compress_statistics=compress_statistics, module=new,
).cuda()
if has_bias:
new.bias = nn.Parameter(
child.bias.data.detach().to(compute_dtype).clone().cuda()
)
else:
raise ValueError(mode)
new = new.cuda()
setattr(module, name, new)
swapped += 1
else:
swapped += _swap_bnb_linear(child, mode, compute_dtype, compress_statistics)
return swapped
def variant_bnb(mode: str, compress_statistics: bool = True):
def _build():
model, tok = _build_model(torch.bfloat16, attn_impl="sdpa")
model = model.cuda()
n = _swap_bnb_linear(
model.backbone, mode, compress_statistics=compress_statistics,
)
print(f" bnb {mode} (cs={compress_statistics}): swapped {n} linears")
return model, tok
return _build
# ──────────────────────────────────────────────────────────────────────
# Inference + measurement
# ──────────────────────────────────────────────────────────────────────
def _encoder_param_bytes(model: DualHeadModernBERT) -> int:
"""Sum bytes of every parameter / buffer inside the encoder backbone.
Handles bnb Int8Params (int8 storage) and Params4bit (uint8 packed)
correctly because element_size() reflects the storage dtype.
"""
total = 0
seen = set()
for p in list(model.backbone.parameters()) + list(model.backbone.buffers()):
if id(p) in seen:
continue
seen.add(id(p))
total += p.numel() * p.element_size()
return total
@torch.no_grad()
def run_inference(model, tokenizer, texts: list[str]) -> dict:
device = next(model.parameters()).device
cat_logits_list = []
spec_logits_list = []
# Warmup
warm_batch = tokenizer(
texts[: BATCH_SIZE], truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="pt",
).to(device)
for _ in range(WARMUP_BATCHES):
_ = model(input_ids=warm_batch["input_ids"], attention_mask=warm_batch["attention_mask"])
torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()
total_time = 0.0
for i in range(0, len(texts), BATCH_SIZE):
batch = texts[i : i + BATCH_SIZE]
enc = tokenizer(
batch, truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="pt",
).to(device)
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
torch.cuda.synchronize()
total_time += time.perf_counter() - t0
cat_logits_list.append(out["category_logits"].float().cpu())
spec_logits_list.append(out["specificity_logits"].float().cpu())
peak_vram = torch.cuda.max_memory_allocated()
cat_logits = torch.cat(cat_logits_list)
spec_logits = torch.cat(spec_logits_list)
return {
"cat_logits": cat_logits,
"spec_logits": spec_logits,
"total_time_s": total_time,
"ms_per_sample": (total_time / len(texts)) * 1000,
"throughput": len(texts) / total_time,
"peak_vram_mb": peak_vram / (1024 ** 2),
"num_samples": len(texts),
}
def evaluate_predictions(
cat_logits: torch.Tensor,
spec_logits: torch.Tensor,
records: list[dict],
ref_name: str,
) -> dict:
cat_probs_all = F.softmax(cat_logits, dim=1).numpy()
cat_preds_all = cat_logits.argmax(dim=1).numpy()
spec_preds_all = ordinal_predict(spec_logits).numpy()
# ordinal → class probs
sp = torch.sigmoid(spec_logits)
K = sp.shape[1] + 1
spec_probs_all = torch.zeros(sp.shape[0], K)
spec_probs_all[:, 0] = 1 - sp[:, 0]
for k in range(1, K - 1):
spec_probs_all[:, k] = sp[:, k - 1] - sp[:, k]
spec_probs_all[:, -1] = sp[:, -1]
spec_probs_all = spec_probs_all.clamp(min=0)
spec_probs_all = spec_probs_all / spec_probs_all.sum(dim=1, keepdim=True)
spec_probs_all = spec_probs_all.numpy()
cat_labels, spec_labels = [], []
cat_p, spec_p, cat_pr, spec_pr = [], [], [], []
for i, rec in enumerate(records):
b = rec["benchmark_labels"].get(ref_name)
if b is None:
continue
cat_labels.append(CAT2ID[b["category"]])
spec_labels.append(b["specificity"] - 1)
cat_p.append(cat_preds_all[i])
spec_p.append(spec_preds_all[i])
cat_pr.append(cat_probs_all[i])
spec_pr.append(spec_probs_all[i])
cat_m = compute_all_metrics(
np.array(cat_p), np.array(cat_labels), np.array(cat_pr),
CATEGORIES, "cat", is_ordinal=False,
)
spec_m = compute_all_metrics(
np.array(spec_p), np.array(spec_labels), np.array(spec_pr),
SPEC_LABELS, "spec", is_ordinal=True,
)
return {**cat_m, **spec_m}
# ──────────────────────────────────────────────────────────────────────
# Variant registry
# ──────────────────────────────────────────────────────────────────────
@dataclass
class Variant:
name: str
description: str
builder: callable
skip_reason: str | None = None
def build_variants() -> list[Variant]:
from torchao.quantization import (
Int4WeightOnlyConfig,
Int8DynamicActivationInt8WeightConfig,
Int8WeightOnlyConfig,
)
return [
Variant("fp32", "Float32 encoder + heads", variant_native(torch.float32, attn="sdpa")),
Variant("bf16", "BFloat16 baseline (matches eval pipeline)", variant_native(torch.bfloat16)),
Variant("fp16", "Float16 encoder + heads", variant_native(torch.float16)),
Variant(
"torchao-int8-wo",
"torchao Int8 weight-only on encoder linears",
variant_torchao(lambda: Int8WeightOnlyConfig()),
),
Variant(
"torchao-int8-dyn",
"torchao Int8 dynamic activation + Int8 weight on encoder",
variant_torchao(lambda: Int8DynamicActivationInt8WeightConfig()),
),
Variant(
"torchao-int4-wo",
"torchao Int4 weight-only (group=128) on encoder",
variant_torchao(lambda: Int4WeightOnlyConfig(group_size=128)),
),
Variant("bnb-int8", "bitsandbytes LLM.int8 on encoder linears", variant_bnb("int8")),
Variant("bnb-nf4", "bitsandbytes NF4 4-bit (double-quant, bf16 compute)", variant_bnb("nf4", compress_statistics=True)),
Variant("bnb-nf4-nodq", "bitsandbytes NF4 4-bit (no double-quant)", variant_bnb("nf4", compress_statistics=False)),
Variant("bnb-fp4", "bitsandbytes FP4 4-bit (no double-quant)", variant_bnb("fp4", compress_statistics=False)),
]
# ──────────────────────────────────────────────────────────────────────
# Driver
# ──────────────────────────────────────────────────────────────────────
def free():
gc.collect()
torch.cuda.empty_cache()
torch.cuda.synchronize()
def main():
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Loading holdout from {HOLDOUT}")
records = load_holdout_data(
str(PARAGRAPHS), str(HOLDOUT), {k: str(v) for k, v in BENCHMARKS.items()},
)
texts = [r["text"] for r in records]
print(f" {len(records)} holdout paragraphs loaded")
variants = build_variants()
summary = []
for v in variants:
print(f"\n══ {v.name}{v.description}")
free()
try:
t0 = time.perf_counter()
model, tokenizer = v.builder()
build_s = time.perf_counter() - t0
enc_bytes = _encoder_param_bytes(model)
print(f" encoder footprint: {enc_bytes / 1e6:.1f} MB (build {build_s:.1f}s)")
inf = run_inference(model, tokenizer, texts)
print(
f" latency {inf['ms_per_sample']:.2f} ms/sample, "
f"throughput {inf['throughput']:.0f}/s, "
f"peak VRAM {inf['peak_vram_mb']:.0f} MB"
)
metrics_per_ref = {}
for ref in BENCHMARKS:
m = evaluate_predictions(inf["cat_logits"], inf["spec_logits"], records, ref)
metrics_per_ref[ref] = m
print(
f" vs {ref}: cat F1={m['cat_macro_f1']:.4f}, "
f"spec F1={m['spec_macro_f1']:.4f}, QWK={m['spec_qwk']:.4f}, "
f"cat ECE={m['cat_ece']:.4f}, spec ECE={m['spec_ece']:.4f}"
)
row = {
"variant": v.name,
"description": v.description,
"encoder_mb": enc_bytes / 1e6,
"ms_per_sample": inf["ms_per_sample"],
"throughput_per_s": inf["throughput"],
"peak_vram_mb": inf["peak_vram_mb"],
"build_s": build_s,
}
for ref, m in metrics_per_ref.items():
row[f"{ref}_cat_f1"] = m["cat_macro_f1"]
row[f"{ref}_spec_f1"] = m["spec_macro_f1"]
row[f"{ref}_cat_mcc"] = m["cat_mcc"]
row[f"{ref}_spec_qwk"] = m["spec_qwk"]
row[f"{ref}_spec_mae"] = m["spec_mae"]
row[f"{ref}_cat_ece"] = m["cat_ece"]
row[f"{ref}_spec_ece"] = m["spec_ece"]
# per-spec-level F1
for s in SPEC_LABELS:
short = s.replace(" ", "").replace(":", "")[:8]
row[f"{ref}_spec_f1_{short}"] = m.get(f"spec_f1_{short}", 0)
summary.append(row)
# Per-variant detailed metrics dump
vdir = OUTPUT_DIR / v.name
vdir.mkdir(parents=True, exist_ok=True)
with open(vdir / "metrics.json", "w") as f:
ser = {}
for ref, m in metrics_per_ref.items():
ser[ref] = {
k: (v_ if not isinstance(v_, np.ndarray) else v_.tolist())
for k, v_ in m.items()
                    if isinstance(v_, (int, float, str, list, bool, np.ndarray))
}
ser["_runtime"] = {
"encoder_mb": enc_bytes / 1e6,
"ms_per_sample": inf["ms_per_sample"],
"throughput_per_s": inf["throughput"],
"peak_vram_mb": inf["peak_vram_mb"],
"build_s": build_s,
}
json.dump(ser, f, indent=2, default=str)
del model, tokenizer, inf
except Exception as e:
print(f" FAILED: {type(e).__name__}: {e}")
traceback.print_exc()
summary.append({
"variant": v.name,
"description": v.description,
"error": f"{type(e).__name__}: {e}",
})
free()
# Write summary
summary_path = OUTPUT_DIR / "summary.json"
with open(summary_path, "w") as f:
json.dump(summary, f, indent=2, default=str)
print(f"\nSummary written to {summary_path}")
# Print compact table
print("\n" + "=" * 110)
print(f"{'variant':<18} {'enc MB':>9} {'ms/samp':>9} {'throughput':>11} "
f"{'VRAM MB':>9} {'cat F1':>9} {'spec F1':>9} {'spec QWK':>9}")
print("-" * 110)
for r in summary:
if "error" in r:
print(f"{r['variant']:<18} ERROR: {r['error']}")
continue
print(
f"{r['variant']:<18} {r['encoder_mb']:>9.1f} {r['ms_per_sample']:>9.2f} "
f"{r['throughput_per_s']:>11.0f} {r['peak_vram_mb']:>9.0f} "
f"{r['GPT-5.4_cat_f1']:>9.4f} {r['GPT-5.4_spec_f1']:>9.4f} {r['GPT-5.4_spec_qwk']:>9.4f}"
)
print("=" * 110)
if __name__ == "__main__":
main()

results/eval/onnx/REPORT.md Normal file

@ -0,0 +1,117 @@
# ONNX Export + Eval — iter1-independent ModernBERT-large
**Date:** 2026-04-07
**Checkpoint:** `checkpoints/finetune/iter1-independent/final/`
**Hardware:** RTX 3090 (sm_8.6, 24 GB), onnxruntime-gpu 1.24.4, onnx 1.21
**Driver:** `python/scripts/onnx_export_eval.py` (`bun run py:onnx`)
**Eval set:** 1,200-paragraph v2 holdout, proxy gold = GPT-5.4 + Opus-4.6
## TL;DR
ONNX export of this model is *technically* possible, but the path is full of
dead ends. The dynamo exporter produces a graph with 56 Memcpy nodes that
makes ORT 8× slower than native torch and 4× more VRAM-heavy; the legacy
TorchScript exporter produces a clean graph that's actually 22% faster than
torch fp32 (kernel fusion); fp16 conversion breaks on the rotary embedding;
dynamic int8 quantization via ORT silently falls back to CPU and drops
~0.5 macro F1. **Net: torchao int8-wo from the earlier sweep is still the
right int8 deployment path. ONNX is not.**
## What we tried
| variant | exporter | size MB | ms/sample | VRAM MB | cat F1 | spec F1 | result |
|--------------------|----------------------|--------:|----------:|--------:|-------:|--------:|-----------------|
| onnx-fp32 (dynamo) | torch.onnx (dynamo) | 1583 | 42.92 | 15388 | 0.9337 | 0.8943 | works but unusable |
| onnx-int8 (dynamo) | dynamo + ORT int8 | 1580 | 42.82 | 15398 | 0.9337 | 0.8943 | no-op (no quant) |
| **onnx-fp32 (legacy)** | torch.onnx (TorchScript) | 1583 | **12.70** | 8228 | 0.9337 | 0.8952 | **clean graph, faster than torch** |
| onnx-fp16 (legacy) | onnxconverter_common | 754 | err | err | err | err | rotary type unify |
| onnx-int8 (legacy) | ORT quantize_dynamic | 527 | 95.91 | ~CPU | 0.3972 | 0.3364 | CPU fallback + accuracy collapse |
(All entries above were re-run from scratch — fp32 timing improved 3× moving
from dynamo to legacy export.)
## Six things broke along the way (workarounds in the script)
1. **Dynamo exporter optimizer crashes.** `torch.onnx.export(..., dynamo=True)`
succeeds at translation but the post-translation `InlinePass` optimizer
trips on `onnx_ir`. Workaround: `optimize=False`.
2. **Dynamo-exported graph is unusable on CUDA EP.** ORT inserts 56 Memcpy
nodes between layers because dynamo emits scalar tensors with CPU
placement metadata. Result: 42.9 ms/sample (8× torch fp32) and 15.4 GB
VRAM (4.4× torch fp32). The legacy exporter only inserts 1 Memcpy.
3. **`op_types_to_quantize=['MatMul']` quantizes nothing on the dynamo
graph.** Dynamo emits encoder linears as `Gemm` nodes, not `MatMul`.
Fix: `op_types_to_quantize=['MatMul', 'Gemm']`.
4. **Both ORT shape-inference paths choke on ModernBERT.** Symbolic
inference asserts in `_infer_Range` (rotary embedding limit input is
not a scalar); the C++ inference raises a (1024)/(7) dim mismatch on
the category head Gemm. The `skip_*` flags on `quant_pre_process` are
ignored, and `ONNXQuantizer.__init__` calls
`save_and_reload_model_with_shape_infer` unconditionally. Workaround:
monkey-patch `quant_utils.save_and_reload_model_with_shape_infer`
*and* the cached binding in `onnx_quantizer` to a no-op, then pass
`extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}` so the
quantizer can still type the head MatMul.
5. **fp16 conversion via `onnxconverter_common` breaks on rotary
embeddings.** Two distinct failure modes seen across exports:
`Type Error: Type (tensor(float16)) of output arg (val_58) of node
(node_Expand_56) does not match expected type (tensor(float))` (dynamo
graph) and `Type parameter (T) of Optype (Mul) bound to different types
(tensor(float) and tensor(float16) in node
(/model/backbone/rotary_emb_1/Mul_2)` (legacy graph). The converter
leaves the `inv_freq` buffer in fp32 and the surrounding Mul/Expand
ops then can't unify their type parameter. Could be patched with an
`op_block_list` for the rotary subgraph, but the cost/value isn't
there given the dynamic int8 result below.
6. **Dynamic int8 via ORT silently falls back to CPU.** The quantizer
replaces Gemm/MatMul with `MatMulInteger` + `DynamicQuantizeLinear`,
neither of which has CUDA kernels in onnxruntime-gpu 1.24. Session
creation succeeds with CUDAExecutionProvider but routes the
quantized ops to the CPU EP — observable from the `load_vram_mb`
collapsing from 2074 MB (fp32) to 266 MB (int8) and latency exploding
to 95.9 ms/sample. Per-channel int8 weights also drop accuracy from
0.934 → 0.397 on category and 0.895 → 0.336 on spec, further
confirming the kernel path is wrong (not just slow).
## What actually works
**onnx-fp32 via the legacy TorchScript exporter** is the one clean win:
12.70 ms/sample vs 16.29 for torch fp32 — a **22% latency improvement
from ORT's LayerNorm/Gelu/MatMul fusion** at bit-identical accuracy. VRAM
is 8228 MB vs 3504 MB for torch fp32 (the ORT session allocates a separate
~5 GB workspace), so the speedup costs you ~2.3× memory. On a single
3090 batch=64 inference run that's a fair trade.
But this is fp32 — bf16 torch + flash-attn-2 is *still* the strict winner
at 5.52 ms / 1741 MB (Phase 10.8 result). ORT can't run bf16 natively, and
fp16 conversion is broken. So even the working ONNX path is dominated by
what we already ship.
## Recommendation
**Don't use ONNX for this model on this hardware.** The torchao int8-wo
result from the quantization sweep (5.52 → 6.08 ms, 1741 → 1416 MB peak
VRAM, F1 within ±0.001) covers the "smaller deployment" use case more
cleanly than anything ONNX can offer here, and bf16 + flash-attn-2
remains the production default.
ONNX *would* be worth revisiting in any of these scenarios:
- **CPU-only deployment** — fp32 ONNX runs fine on CPUExecutionProvider
and ORT's int8 dynamic path is actually designed for this case. Worth
benchmarking if a CPU serving target ever shows up.
- **Cross-runtime portability** — TensorRT, OpenVINO, mobile runtimes.
These would each need their own export validation pass.
- **Static int8 with calibration**`quantize_static` with a calibration
dataset can avoid the dynamic-quant CPU fallback path. Would need a
ModernBERT-friendly calibration loop and probably an `op_block_list`
to keep the rotary in fp32. Real engineering work, not a one-shot.
## Reproduce
```bash
bun run py:onnx
# writes to:
# results/eval/onnx/models/{model_fp32,model_fp16,model_int8_dyn}.onnx[.data]
# results/eval/onnx/summary.json
# results/eval/onnx/REPORT.md (this file)
```



@ -0,0 +1,50 @@
[
{
"variant": "onnx-fp32",
"model_mb": 1583.256294,
"ms_per_sample": 12.703279327494482,
"throughput_per_s": 78.71983085781946,
"peak_vram_mb": 8228.0,
"load_vram_mb": 2074.0,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8951731906425856,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9324447137231142,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.05386760701735813,
"GPT-5.4_spec_ece": 0.07004604930679002,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8833694419146193,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9227008860372746,
"Opus-4.6_spec_mae": 0.13583333333333333,
"Opus-4.6_cat_ece": 0.06540583113829297,
"Opus-4.6_spec_ece": 0.08156729981303217
},
{
"variant": "onnx-fp16",
"error": "Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /home/joey/Documents/sec-cyBERT/results/eval/onnx/models/model_fp16.onnx failed:Type Error: Type parameter (T) of Optype (Mul) bound to different types (tensor(float) and tensor(float16) in node (/model/backbone/rotary_emb_1/Mul_2)."
},
{
"variant": "onnx-int8-dyn",
"model_mb": 553.381903,
"ms_per_sample": 95.90791940659983,
"throughput_per_s": 10.42666764316426,
"peak_vram_mb": 7188.0,
"load_vram_mb": 266.0,
"GPT-5.4_cat_f1": 0.3971686880679718,
"GPT-5.4_spec_f1": 0.3364003775746365,
"GPT-5.4_cat_mcc": 0.3459776856134484,
"GPT-5.4_spec_qwk": 0.4412945592628398,
"GPT-5.4_spec_mae": 0.7225,
"GPT-5.4_cat_ece": 0.23434762333830195,
"GPT-5.4_spec_ece": 0.35458642202119034,
"Opus-4.6_cat_f1": 0.4150626036637055,
"Opus-4.6_spec_f1": 0.3204015536108683,
"Opus-4.6_cat_mcc": 0.3663561834842673,
"Opus-4.6_spec_qwk": 0.43979676755288855,
"Opus-4.6_spec_mae": 0.7375,
"Opus-4.6_cat_ece": 0.2126809566716353,
"Opus-4.6_spec_ece": 0.37541975535452365
}
]


@ -0,0 +1,163 @@
# Quantization Sweep — iter1-independent ModernBERT-large
**Date:** 2026-04-07
**Checkpoint:** `checkpoints/finetune/iter1-independent/final/`
**Hardware:** RTX 3090 (sm_8.6, 24 GB)
**Eval set:** 1,200-paragraph v2 holdout, proxy gold = GPT-5.4 + Opus-4.6
**Driver:** `python/scripts/quantize_sweep.py` (run via `bun run py:quant`)
## Setup
For each variant the *encoder* (ModernBERT-large backbone, 28 layers, 112
nn.Linear modules) is converted to the target precision/scheme, while the
attention pooler and the dual heads (category linear + 3 independent
threshold MLPs) are kept in bf16. Heads are <0.3% of params and sit on
already-distilled 1024-d representations — quantizing them buys nothing and
risks the threshold margins that drive most of the spec error budget.
For every variant we measure end-to-end inference on the full 1,200-paragraph
holdout at batch=64, max_seq=512, after 5 warmup batches:
- **encoder_mb** — sum of `param.numel() * param.element_size()` over the
encoder. **Caveat:** for torchao tensor subclasses (`AffineQuantizedTensor`)
this reports the *outer* dtype (bf16) rather than the int8 storage, so the
790 MB figure for the torchao rows is an over-estimate; real on-disk
storage is roughly half. The bnb 4-bit row (275 MB) is correct because
`Params4bit` reports `uint8` element_size.
- **ms/sample** — wall-clock per paragraph at batch=64
- **peak VRAM**`torch.cuda.max_memory_allocated()` over the timed run
(encoder fwd + activations)
- **F1 / QWK / ECE** — full eval pipeline reused from `src/finetune/eval.py`
## Results
| variant | enc MB | ms/samp | thru/s | VRAM MB | cat F1 (GPT) | spec F1 (GPT) | spec QWK | cat F1 (Opus) | spec F1 (Opus) | notes |
|--------------------|-------:|--------:|-------:|--------:|-------------:|--------------:|---------:|--------------:|---------------:|--------------------------------|
| fp32 | 1579 | 16.29 | 61 | 3504 | 0.9337 | 0.8943 | 0.9321 | 0.9227 | 0.8825 | sdpa (no flash-attn) |
| **bf16 (baseline)**| 790 | 5.52 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 | 0.9227 | 0.8834 | flash-attn-2 |
| fp16 | 790 | 5.54 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 | 0.9227 | 0.8834 | flash-attn-2 |
| **torchao int8-wo**| ~395* | 6.08 | 165 | 1416 | 0.9345 | 0.8941 | 0.9330 | 0.9235 | 0.8815 | weight-only int8 |
| torchao int8-dyn | ~395* | 9.67 | 103 | 1774 | 0.9336 | 0.8918 | 0.9315 | 0.9243 | 0.8827 | dyn act + int8 weight |
| torchao int4-wo | — | — | — | — | — | — | — | — | — | requires `mslk>=1.0.0` |
| bnb LLM.int8 | ~395* | 7.76 | 129 | 2135 | 0.9361 | 0.8986 | 0.9308 | 0.9235 | 0.8827 | mixed-precision outliers |
| bnb nf4 (DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 | 0.3576 | 0.2075 | **collapsed** |
| bnb nf4 (no DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 | 0.3576 | 0.2075 | **collapsed** |
| bnb fp4 (no DQ) | 275 | 5.87 | 170 | 1287 | 0.1629 | 0.2085 | 0.2326 | 0.1686 | 0.1978 | **collapsed harder** |
\*torchao subclass tensors report bf16 element_size; true storage ~395 MB.
Per-variant detail (per-class F1, MCC, AUC, confusion matrices, calibration
bins) is in `results/eval/quant/{variant}/metrics.json`. Aggregate row-level
data is in `results/eval/quant/summary.json`.
## Findings
### 1. bf16 is already the production sweet spot
Flash-attention-2 + bf16 gives **3.0× the throughput of fp32** (181 vs 61
samples/sec) at **half the VRAM** (1.7 vs 3.5 GB) with bit-identical
accuracy. This is what we already train and serve at; the sweep simply
confirms there's no headroom in fp16/fp32 for this hardware.
### 2. fp16 ≡ bf16 on Ampere
Identical latency, identical VRAM, identical F1. RTX 3090 has matched
bf16/fp16 throughput on tensor cores and the model has no overflow issues
in either format. Pick whichever the loader prefers.
### 3. torchao int8 weight-only is the only quantization variant worth shipping
- **VRAM -19%** (1741 → 1416 MB) — meaningful for batched serving
- **F1 essentially unchanged** (cat +0.0008, spec -0.0011 vs bf16 — both
  inside per-seed noise)
- **Latency +10%** (5.52 → 6.08 ms/sample) — the int8 weight is dequantized
to bf16 on the fly because RTX 3090 (sm_8.6) lacks the int8 tensor-core
matmul kernel paths torchao would otherwise use; on H100/A100/Ada this
same config would also be faster
The accuracy delta is statistically nothing — well within the ±0.002 std we
observed across the 3-seed ensemble. **This is the variant we'd ship as the
"low-VRAM" deployment option.**
### 4. torchao int8 dynamic activation: don't bother on this hardware
-43% throughput (5.52 → 9.67 ms/sample) and *more* peak VRAM than bf16
(1774 vs 1741 MB) because the per-batch activation quantization adds work
without unlocking int8 tensor cores. Pure regression on Ampere.
### 5. bnb LLM.int8: slowest int8, no accuracy upside
- **+41% latency** (5.52 → 7.76 ms/sample) due to mixed-precision outlier
handling
- **+23% VRAM** (1741 → 2135 MB) — outlier columns are kept in fp16 plus
scratch buffers
- **F1 +0.0024 cat, +0.0034 spec** — within noise; not a real win
bnb LLM.int8 was designed for LLM-scale models where outlier features
dominate quant error; for an encoder of this size on a single 3090 it
just trades performance for nothing.
### 6. All 4-bit variants collapse — ModernBERT-large is too quant-sensitive
Both nf4 (with and without double-quantization) and fp4 produce essentially
random predictions:
| variant | cat F1 | spec F1 | spec ECE |
|---------|-------:|--------:|---------:|
| nf4 | 0.354 | 0.221 | 0.434 |
| fp4 | 0.163 | 0.209 | 0.443 |
Per-layer dequantization is faithful — we verified that the dequantized
weight of one MLP Wi layer differs from the original by mean 0.005 / max
0.11 (sub-1% error). But the relative output drift on a single Linear is
already ~98% (mean), and that error compounds across 28 transformer blocks
+ GLU FFN paths until the [CLS]/pooled representation no longer carries
the discriminative signal. The category head essentially collapses to a
near-uniform prior (cat ECE 0.10 vs the 0.054 baseline) and the threshold
heads collapse onto L1 because all three thresholds emit similar logits.
The fact that **DQ vs no-DQ are bit-identical** at this scale tells us the
nf4 weight indices are stable under absmax requantization (only ~5% of the
weight bytes change, all in the metadata block) — the catastrophe is
inherent to 4-bit weight precision on this architecture, not to a
quantization-config knob.
This is a real noteworthy null for the paper: **naive post-training 4-bit
weight quantization is not viable for ModernBERT-large on this task**.
Recovering 4-bit would require either (a) QAT, (b) per-channel calibration
with a held-out activation distribution (GPTQ / AWQ-style), or (c) keeping
the GLU FFN in 8-bit and only 4-bit'ing attention projections. None of
these are reachable inside the remaining capstone time budget.
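The compounding claim can be illustrated without bitsandbytes at all. A toy experiment with a crude uniform absmax 4-bit fake-quant (coarser than nf4, so its per-weight error lands higher than the sub-1% figure above; the 28-layer tanh stack is a hypothetical stand-in, not ModernBERT):

```python
import torch

torch.manual_seed(0)

def fake_quant_4bit(w: torch.Tensor, block: int = 64) -> torch.Tensor:
    """Uniform absmax 4-bit round-trip per block of 64 weights."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 7.0
    return (torch.round(flat / scale).clamp(-8, 7) * scale).reshape(w.shape)

dim, depth = 512, 28
weights = [torch.randn(dim, dim) / dim**0.5 for _ in range(depth)]
x = torch.randn(8, dim)

h_ref, h_q = x, x
for w in weights:
    h_ref = torch.tanh(h_ref @ w.T)
    h_q = torch.tanh(h_q @ fake_quant_4bit(w).T)

# Per-weight error is small, but each layer injects fresh noise and the
# drift of the final representation ends up far larger.
per_weight = ((fake_quant_4bit(weights[0]) - weights[0]).abs().mean()
              / weights[0].abs().mean()).item()
drift = ((h_q - h_ref).norm() / h_ref.norm()).item()
print(f"per-weight rel err {per_weight:.3f}; output drift after {depth} layers {drift:.3f}")
```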
### 7. torchao int4-wo: dependency hole
torchao 0.17 requires `mslk >= 1.0.0` for the new `Int4Tensor.from_hp` path.
Not installed in the lockfile and not worth chasing given the bnb 4-bit
collapse — even if the kernel ran cleanly we'd expect the same compounding
error pattern.
## Recommendations
| Use case | Variant | Why |
|-----------------------------------|--------------------|-------------------------------------------------------------|
| **Production / paper headline** | bf16 | Best of every dimension on this hardware |
| **Low-VRAM batch serving**        | torchao int8-wo    | -19% VRAM, accuracy intact, only +10% latency penalty        |
| **Multi-GPU sharded serving** | bf16 | int8-wo's dequant overhead grows with replica count |
| **Embedded / 4-bit** | not viable | Needs QAT or AWQ-style calibration; future work |
## Paper-worthy notes
1. **Quantization story** — bf16 is already the sweet spot; torchao int8-wo
   cuts VRAM 19% with no accuracy cost; 4-bit fails. This adds another row
   to the speed/cost table.
2. **Architecture-specific quant fragility** — ModernBERT-large's GLU FFN
amplifies per-layer weight error across 28 blocks. This is a noteworthy
counterpoint to the 4-bit-by-default LLM serving narrative and worth
one paragraph in the discussion section alongside the DAPT and
CORAL null results.
3. **Hardware caveat** — int8 latency results would invert on
Hopper/Ada/A100; the 3090 just doesn't have the matmul path. State the
sm_8.6 caveat in the table caption.
## Reproduce
```bash
# from repo root
bun run py:quant
# writes to results/eval/quant/{summary.json, REPORT.md, <variant>/metrics.json}
```
Run time: ~5 minutes total (most spent in fp32 + torchao build steps).


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.991991833154947,
"cat_ece": 0.053848127176364245,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8951731906425856,
"spec_weighted_f1": 0.9121524819510628,
"spec_macro_precision": 0.8980417155129858,
"spec_macro_recall": 0.8930560580782194,
"spec_mcc": 0.866381831963237,
"spec_auc": 0.981666223606385,
"spec_ece": 0.07135417198141418,
"spec_confusion_matrix": [
[
580,
23,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9354838709677419,
"spec_prec_L1Generi": 0.932475884244373,
"spec_recall_L1Generi": 0.9385113268608414,
"spec_f1_L2Domain": 0.7975460122699386,
"spec_prec_L2Domain": 0.8227848101265823,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9324447137231142,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.917725722448833
},
"Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.9939660707189948,
"cat_ece": 0.06551479384303091,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8833694419146193,
"spec_weighted_f1": 0.9004034318676798,
"spec_macro_precision": 0.8858989636247611,
"spec_macro_recall": 0.8854684685880032,
"spec_mcc": 0.8500778641433316,
"spec_auc": 0.9736633898988131,
"spec_ece": 0.08248284702499709,
"spec_confusion_matrix": [
[
567,
30,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9242053789731052,
"spec_prec_L1Generi": 0.9115755627009646,
"spec_recall_L1Generi": 0.9371900826446281,
"spec_f1_L2Domain": 0.7788778877887789,
"spec_prec_L2Domain": 0.7468354430379747,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9227008860372746,
"spec_mae": 0.13583333333333333,
"spec_kripp_alpha": 0.9065248741550552
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 5.516677870764397,
"throughput_per_s": 181.26851402716375,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.48778308398323134
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.16293893512410998,
"cat_weighted_f1": 0.1746727986514593,
"cat_macro_precision": 0.6289222195093943,
"cat_macro_recall": 0.23220413662370398,
"cat_mcc": 0.22345796853389935,
"cat_auc": 0.8960306312891495,
"cat_ece": 0.2080524676044782,
"cat_confusion_matrix": [
[
8,
0,
0,
0,
221,
0,
1
],
[
0,
0,
0,
1,
82,
0,
5
],
[
0,
0,
4,
0,
145,
0,
1
],
[
0,
0,
2,
3,
128,
0,
3
],
[
0,
0,
0,
0,
195,
0,
3
],
[
0,
0,
0,
0,
208,
2,
11
],
[
0,
0,
0,
0,
80,
0,
97
]
],
"cat_f1_BoardGov": 0.06722689075630252,
"cat_prec_BoardGov": 1.0,
"cat_recall_BoardGov": 0.034782608695652174,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.05128205128205128,
"cat_prec_Manageme": 0.6666666666666666,
"cat_recall_Manageme": 0.02666666666666667,
"cat_f1_NoneOthe": 0.04285714285714286,
"cat_prec_NoneOthe": 0.75,
"cat_recall_NoneOthe": 0.022058823529411766,
"cat_f1_RiskMana": 0.31026252983293556,
"cat_prec_RiskMana": 0.18413597733711048,
"cat_recall_RiskMana": 0.9848484848484849,
"cat_f1_Strategy": 0.017937219730941704,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.00904977375565611,
"cat_f1_Third-Pa": 0.6510067114093959,
"cat_prec_Third-Pa": 0.8016528925619835,
"cat_recall_Third-Pa": 0.5480225988700564,
"cat_kripp_alpha": -0.08693512028952255,
"spec_macro_f1": 0.20854117827130608,
"spec_weighted_f1": 0.2571301750438355,
"spec_macro_precision": 0.3741612607031285,
"spec_macro_recall": 0.33018440069147115,
"spec_mcc": 0.1895317453505129,
"spec_auc": 0.8110497500610155,
"spec_ece": 0.44289420386155437,
"spec_confusion_matrix": [
[
136,
473,
9,
0
],
[
4,
163,
1,
0
],
[
1,
179,
27,
0
],
[
2,
171,
34,
0
]
],
"spec_f1_L1Generi": 0.35742444152431013,
"spec_prec_L1Generi": 0.951048951048951,
"spec_recall_L1Generi": 0.22006472491909385,
"spec_f1_L2Domain": 0.2824956672443674,
"spec_prec_L2Domain": 0.16531440162271804,
"spec_recall_L2Domain": 0.9702380952380952,
"spec_f1_L3Firm-S": 0.19424460431654678,
"spec_prec_L3Firm-S": 0.38028169014084506,
"spec_recall_L3Firm-S": 0.13043478260869565,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.2326064604575444,
"spec_mae": 0.8825,
"spec_kripp_alpha": 0.26499611744119067
},
"Opus-4.6": {
"cat_macro_f1": 0.16861118726256397,
"cat_weighted_f1": 0.1792365613004711,
"cat_macro_precision": 0.6306758954840335,
"cat_macro_recall": 0.2357303291121537,
"cat_mcc": 0.2251562222131823,
"cat_auc": 0.8995073249291591,
"cat_ece": 0.19888580093781152,
"cat_confusion_matrix": [
[
8,
0,
0,
0,
205,
0,
1
],
[
0,
0,
0,
1,
73,
0,
5
],
[
0,
0,
4,
0,
154,
0,
0
],
[
0,
0,
2,
3,
133,
0,
3
],
[
0,
0,
0,
0,
208,
0,
5
],
[
0,
0,
0,
0,
216,
2,
10
],
[
0,
0,
0,
0,
70,
0,
97
]
],
"cat_f1_BoardGov": 0.07207207207207207,
"cat_prec_BoardGov": 1.0,
"cat_recall_BoardGov": 0.037383177570093455,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.04878048780487805,
"cat_prec_Manageme": 0.6666666666666666,
"cat_recall_Manageme": 0.02531645569620253,
"cat_f1_NoneOthe": 0.041379310344827586,
"cat_prec_NoneOthe": 0.75,
"cat_recall_NoneOthe": 0.02127659574468085,
"cat_f1_RiskMana": 0.3270440251572327,
"cat_prec_RiskMana": 0.1964117091595845,
"cat_recall_RiskMana": 0.9765258215962441,
"cat_f1_Strategy": 0.017391304347826087,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.008771929824561403,
"cat_f1_Third-Pa": 0.6736111111111112,
"cat_prec_Third-Pa": 0.8016528925619835,
"cat_recall_Third-Pa": 0.5808383233532934,
"cat_kripp_alpha": -0.07941064783948448,
"spec_macro_f1": 0.19783939283519508,
"spec_weighted_f1": 0.24886714543281097,
"spec_macro_precision": 0.37592821714182745,
"spec_macro_recall": 0.3291807330600434,
"spec_mcc": 0.18219176358380398,
"spec_auc": 0.790090253498083,
"spec_ece": 0.45814307530721027,
"spec_confusion_matrix": [
[
132,
466,
7,
0
],
[
1,
142,
2,
0
],
[
8,
221,
31,
0
],
[
2,
157,
31,
0
]
],
"spec_f1_L1Generi": 0.35294117647058826,
"spec_prec_L1Generi": 0.9230769230769231,
"spec_recall_L1Generi": 0.21818181818181817,
"spec_f1_L2Domain": 0.251105216622458,
"spec_prec_L2Domain": 0.1440162271805274,
"spec_recall_L2Domain": 0.9793103448275862,
"spec_f1_L3Firm-S": 0.18731117824773413,
"spec_prec_L3Firm-S": 0.43661971830985913,
"spec_recall_L3Firm-S": 0.11923076923076924,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.22580295138888895,
"spec_mae": 0.8925,
"spec_kripp_alpha": 0.2579634594689497
},
"_runtime": {
"encoder_mb": 274.843904,
"ms_per_sample": 5.865302347471394,
"throughput_per_s": 170.49419463109393,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4887635139748454
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9360988760303737,
"cat_weighted_f1": 0.9367630863906107,
"cat_macro_precision": 0.934342558672944,
"cat_macro_recall": 0.9404157843351134,
"cat_mcc": 0.9256911778959798,
"cat_auc": 0.9918112947607864,
"cat_ece": 0.052939765204985965,
"cat_confusion_matrix": [
[
226,
0,
2,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
145,
1,
2,
0,
0
],
[
0,
0,
3,
132,
0,
1,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9741379310344828,
"cat_prec_BoardGov": 0.9658119658119658,
"cat_recall_BoardGov": 0.9826086956521739,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9477124183006536,
"cat_prec_Manageme": 0.9294871794871795,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.8918918918918919,
"cat_prec_NoneOthe": 0.825,
"cat_recall_NoneOthe": 0.9705882352941176,
"cat_f1_RiskMana": 0.8586387434554974,
"cat_prec_RiskMana": 0.8913043478260869,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9605568445475638,
"cat_prec_Strategy": 0.9857142857142858,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9253092213149172,
"spec_macro_f1": 0.8986323186392307,
"spec_weighted_f1": 0.9144644120807768,
"spec_macro_precision": 0.9034925881673722,
"spec_macro_recall": 0.8950728490354916,
"spec_mcc": 0.870090391628814,
"spec_auc": 0.98134918835569,
"spec_ece": 0.06740866973996164,
"spec_confusion_matrix": [
[
582,
19,
14,
3
],
[
29,
130,
7,
2
],
[
12,
3,
190,
2
],
[
2,
1,
8,
196
]
],
"spec_f1_L1Generi": 0.9364440868865648,
"spec_prec_L1Generi": 0.9312,
"spec_recall_L1Generi": 0.941747572815534,
"spec_f1_L2Domain": 0.8099688473520249,
"spec_prec_L2Domain": 0.8496732026143791,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.892018779342723,
"spec_prec_L3Firm-S": 0.867579908675799,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9560975609756097,
"spec_prec_L4Quanti": 0.9655172413793104,
"spec_recall_L4Quanti": 0.9468599033816425,
"spec_qwk": 0.9307948020550015,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.9166492249745117
},
"Opus-4.6": {
"cat_macro_f1": 0.9235105849558979,
"cat_weighted_f1": 0.9224780370334836,
"cat_macro_precision": 0.9187130112710481,
"cat_macro_recall": 0.9326192612354074,
"cat_mcc": 0.9103198007176273,
"cat_auc": 0.9937246318315877,
"cat_ece": 0.06465620135267579,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
9,
0,
144,
1,
3,
0,
1
],
[
0,
0,
1,
139,
1,
0,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9419642857142857,
"cat_prec_BoardGov": 0.9017094017094017,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.9235880398671097,
"cat_prec_NoneOthe": 0.86875,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.8513853904282116,
"cat_prec_RiskMana": 0.9184782608695652,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9497716894977168,
"cat_prec_Strategy": 0.9904761904761905,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9095619506866199,
"spec_macro_f1": 0.8826923642825633,
"spec_weighted_f1": 0.8991699562480843,
"spec_macro_precision": 0.8862949086294886,
"spec_macro_recall": 0.8831960153359262,
"spec_mcc": 0.8485449936701916,
"spec_auc": 0.9725823165743999,
"spec_ece": 0.083350846717755,
"spec_confusion_matrix": [
[
568,
27,
9,
1
],
[
23,
117,
3,
2
],
[
34,
9,
206,
11
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9235772357723577,
"spec_prec_L1Generi": 0.9088,
"spec_recall_L1Generi": 0.9388429752066115,
"spec_f1_L2Domain": 0.785234899328859,
"spec_prec_L2Domain": 0.7647058823529411,
"spec_recall_L2Domain": 0.8068965517241379,
"spec_f1_L3Firm-S": 0.860125260960334,
"spec_prec_L3Firm-S": 0.9406392694063926,
"spec_recall_L3Firm-S": 0.7923076923076923,
"spec_f1_L4Quanti": 0.9618320610687023,
"spec_prec_L4Quanti": 0.9310344827586207,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9198415117342273,
"spec_mae": 0.13916666666666666,
"spec_kripp_alpha": 0.9038906079654127
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 7.762363941583317,
"throughput_per_s": 128.82673468103667,
"peak_vram_mb": 2135.203125,
"build_s": 1.1878160500200465
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.3536909012886116,
"cat_weighted_f1": 0.4058815979606338,
"cat_macro_precision": 0.6317997184487815,
"cat_macro_recall": 0.38979766446605063,
"cat_mcc": 0.42471542150657926,
"cat_auc": 0.9205800077405307,
"cat_ece": 0.09734637491405013,
"cat_confusion_matrix": [
[
143,
0,
0,
0,
85,
0,
2
],
[
0,
0,
0,
4,
73,
0,
11
],
[
3,
0,
20,
1,
124,
0,
2
],
[
1,
0,
1,
5,
122,
0,
7
],
[
0,
0,
0,
0,
185,
0,
13
],
[
0,
0,
0,
0,
180,
28,
13
],
[
0,
0,
0,
0,
22,
0,
155
]
],
"cat_f1_BoardGov": 0.7586206896551724,
"cat_prec_BoardGov": 0.9727891156462585,
"cat_recall_BoardGov": 0.6217391304347826,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.23391812865497075,
"cat_prec_Manageme": 0.9523809523809523,
"cat_recall_Manageme": 0.13333333333333333,
"cat_f1_NoneOthe": 0.0684931506849315,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03676470588235294,
"cat_f1_RiskMana": 0.3741152679474216,
"cat_prec_RiskMana": 0.23388116308470291,
"cat_recall_RiskMana": 0.9343434343434344,
"cat_f1_Strategy": 0.2248995983935743,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12669683257918551,
"cat_f1_Third-Pa": 0.8157894736842105,
"cat_prec_Third-Pa": 0.7635467980295566,
"cat_recall_Third-Pa": 0.8757062146892656,
"cat_kripp_alpha": 0.27180867501339423,
"spec_macro_f1": 0.22049451330952025,
"spec_weighted_f1": 0.26278390857815354,
"spec_macro_precision": 0.4075440073341987,
"spec_macro_recall": 0.34148466970860386,
"spec_mcc": 0.20939315966102864,
"spec_auc": 0.8490039116946011,
"spec_ece": 0.43363295723994577,
"spec_confusion_matrix": [
[
132,
483,
3,
0
],
[
2,
166,
0,
0
],
[
2,
171,
34,
0
],
[
0,
175,
32,
0
]
],
"spec_f1_L1Generi": 0.35013262599469497,
"spec_prec_L1Generi": 0.9705882352941176,
"spec_recall_L1Generi": 0.21359223300970873,
"spec_f1_L2Domain": 0.28546861564918313,
"spec_prec_L2Domain": 0.16683417085427135,
"spec_recall_L2Domain": 0.9880952380952381,
"spec_f1_L3Firm-S": 0.2463768115942029,
"spec_prec_L3Firm-S": 0.4927536231884058,
"spec_recall_L3Firm-S": 0.1642512077294686,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24233251808742773,
"spec_mae": 0.8733333333333333,
"spec_kripp_alpha": 0.2761091078775676
},
"Opus-4.6": {
"cat_macro_f1": 0.35763512449392704,
"cat_weighted_f1": 0.40173099854659305,
"cat_macro_precision": 0.6354693148020794,
"cat_macro_recall": 0.39500680662311666,
"cat_mcc": 0.42166882753874363,
"cat_auc": 0.9209441610065957,
"cat_ece": 0.09567970824738346,
"cat_confusion_matrix": [
[
141,
0,
0,
0,
71,
0,
2
],
[
0,
0,
0,
4,
65,
0,
10
],
[
5,
0,
21,
1,
131,
0,
0
],
[
1,
0,
0,
5,
128,
0,
7
],
[
0,
0,
0,
0,
194,
0,
19
],
[
0,
0,
0,
0,
186,
28,
14
],
[
0,
0,
0,
0,
16,
0,
151
]
],
"cat_f1_BoardGov": 0.7811634349030471,
"cat_prec_BoardGov": 0.9591836734693877,
"cat_recall_BoardGov": 0.6588785046728972,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.2346368715083799,
"cat_prec_Manageme": 1.0,
"cat_recall_Manageme": 0.13291139240506328,
"cat_f1_NoneOthe": 0.06622516556291391,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03546099290780142,
"cat_f1_RiskMana": 0.38645418326693226,
"cat_prec_RiskMana": 0.24525916561314792,
"cat_recall_RiskMana": 0.9107981220657277,
"cat_f1_Strategy": 0.21875,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12280701754385964,
"cat_f1_Third-Pa": 0.8162162162162162,
"cat_prec_Third-Pa": 0.7438423645320197,
"cat_recall_Third-Pa": 0.9041916167664671,
"cat_kripp_alpha": 0.27338793761748126,
"spec_macro_f1": 0.20754679251319788,
"spec_weighted_f1": 0.25637242485646744,
"spec_macro_precision": 0.40946072005380696,
"spec_macro_recall": 0.33929593134138586,
"spec_mcc": 0.2041103760829744,
"spec_auc": 0.8271022317290393,
"spec_ece": 0.4489923599362374,
"spec_confusion_matrix": [
[
130,
473,
2,
0
],
[
0,
145,
0,
0
],
[
6,
217,
37,
0
],
[
0,
160,
30,
0
]
],
"spec_f1_L1Generi": 0.3508771929824561,
"spec_prec_L1Generi": 0.9558823529411765,
"spec_recall_L1Generi": 0.21487603305785125,
"spec_f1_L2Domain": 0.2543859649122807,
"spec_prec_L2Domain": 0.1457286432160804,
"spec_recall_L2Domain": 1.0,
"spec_f1_L3Firm-S": 0.22492401215805471,
"spec_prec_L3Firm-S": 0.5362318840579711,
"spec_recall_L3Firm-S": 0.1423076923076923,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24096533359991634,
"spec_mae": 0.88,
"spec_kripp_alpha": 0.2758412395136435
},
"_runtime": {
"encoder_mb": 274.843904,
"ms_per_sample": 5.861402786540566,
"throughput_per_s": 170.607623536175,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4908116469741799
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.3536909012886116,
"cat_weighted_f1": 0.4058815979606338,
"cat_macro_precision": 0.6317997184487815,
"cat_macro_recall": 0.38979766446605063,
"cat_mcc": 0.42471542150657926,
"cat_auc": 0.9205800077405307,
"cat_ece": 0.09734637491405013,
"cat_confusion_matrix": [
[
143,
0,
0,
0,
85,
0,
2
],
[
0,
0,
0,
4,
73,
0,
11
],
[
3,
0,
20,
1,
124,
0,
2
],
[
1,
0,
1,
5,
122,
0,
7
],
[
0,
0,
0,
0,
185,
0,
13
],
[
0,
0,
0,
0,
180,
28,
13
],
[
0,
0,
0,
0,
22,
0,
155
]
],
"cat_f1_BoardGov": 0.7586206896551724,
"cat_prec_BoardGov": 0.9727891156462585,
"cat_recall_BoardGov": 0.6217391304347826,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.23391812865497075,
"cat_prec_Manageme": 0.9523809523809523,
"cat_recall_Manageme": 0.13333333333333333,
"cat_f1_NoneOthe": 0.0684931506849315,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03676470588235294,
"cat_f1_RiskMana": 0.3741152679474216,
"cat_prec_RiskMana": 0.23388116308470291,
"cat_recall_RiskMana": 0.9343434343434344,
"cat_f1_Strategy": 0.2248995983935743,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12669683257918551,
"cat_f1_Third-Pa": 0.8157894736842105,
"cat_prec_Third-Pa": 0.7635467980295566,
"cat_recall_Third-Pa": 0.8757062146892656,
"cat_kripp_alpha": 0.27180867501339423,
"spec_macro_f1": 0.22049451330952025,
"spec_weighted_f1": 0.26278390857815354,
"spec_macro_precision": 0.4075440073341987,
"spec_macro_recall": 0.34148466970860386,
"spec_mcc": 0.20939315966102864,
"spec_auc": 0.8490039116946011,
"spec_ece": 0.43363295723994577,
"spec_confusion_matrix": [
[
132,
483,
3,
0
],
[
2,
166,
0,
0
],
[
2,
171,
34,
0
],
[
0,
175,
32,
0
]
],
"spec_f1_L1Generi": 0.35013262599469497,
"spec_prec_L1Generi": 0.9705882352941176,
"spec_recall_L1Generi": 0.21359223300970873,
"spec_f1_L2Domain": 0.28546861564918313,
"spec_prec_L2Domain": 0.16683417085427135,
"spec_recall_L2Domain": 0.9880952380952381,
"spec_f1_L3Firm-S": 0.2463768115942029,
"spec_prec_L3Firm-S": 0.4927536231884058,
"spec_recall_L3Firm-S": 0.1642512077294686,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24233251808742773,
"spec_mae": 0.8733333333333333,
"spec_kripp_alpha": 0.2761091078775676
},
"Opus-4.6": {
"cat_macro_f1": 0.35763512449392704,
"cat_weighted_f1": 0.40173099854659305,
"cat_macro_precision": 0.6354693148020794,
"cat_macro_recall": 0.39500680662311666,
"cat_mcc": 0.42166882753874363,
"cat_auc": 0.9209441610065957,
"cat_ece": 0.09567970824738346,
"cat_confusion_matrix": [
[
141,
0,
0,
0,
71,
0,
2
],
[
0,
0,
0,
4,
65,
0,
10
],
[
5,
0,
21,
1,
131,
0,
0
],
[
1,
0,
0,
5,
128,
0,
7
],
[
0,
0,
0,
0,
194,
0,
19
],
[
0,
0,
0,
0,
186,
28,
14
],
[
0,
0,
0,
0,
16,
0,
151
]
],
"cat_f1_BoardGov": 0.7811634349030471,
"cat_prec_BoardGov": 0.9591836734693877,
"cat_recall_BoardGov": 0.6588785046728972,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.2346368715083799,
"cat_prec_Manageme": 1.0,
"cat_recall_Manageme": 0.13291139240506328,
"cat_f1_NoneOthe": 0.06622516556291391,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03546099290780142,
"cat_f1_RiskMana": 0.38645418326693226,
"cat_prec_RiskMana": 0.24525916561314792,
"cat_recall_RiskMana": 0.9107981220657277,
"cat_f1_Strategy": 0.21875,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12280701754385964,
"cat_f1_Third-Pa": 0.8162162162162162,
"cat_prec_Third-Pa": 0.7438423645320197,
"cat_recall_Third-Pa": 0.9041916167664671,
"cat_kripp_alpha": 0.27338793761748126,
"spec_macro_f1": 0.20754679251319788,
"spec_weighted_f1": 0.25637242485646744,
"spec_macro_precision": 0.40946072005380696,
"spec_macro_recall": 0.33929593134138586,
"spec_mcc": 0.2041103760829744,
"spec_auc": 0.8271022317290393,
"spec_ece": 0.4489923599362374,
"spec_confusion_matrix": [
[
130,
473,
2,
0
],
[
0,
145,
0,
0
],
[
6,
217,
37,
0
],
[
0,
160,
30,
0
]
],
"spec_f1_L1Generi": 0.3508771929824561,
"spec_prec_L1Generi": 0.9558823529411765,
"spec_recall_L1Generi": 0.21487603305785125,
"spec_f1_L2Domain": 0.2543859649122807,
"spec_prec_L2Domain": 0.1457286432160804,
"spec_recall_L2Domain": 1.0,
"spec_f1_L3Firm-S": 0.22492401215805471,
"spec_prec_L3Firm-S": 0.5362318840579711,
"spec_recall_L3Firm-S": 0.1423076923076923,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24096533359991634,
"spec_mae": 0.88,
"spec_kripp_alpha": 0.2758412395136435
},
"_runtime": {
"encoder_mb": 274.843904,
"ms_per_sample": 5.860076693982895,
"throughput_per_s": 170.64623079537446,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4858604749897495
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.9920503365435541,
"cat_ece": 0.053747650533914546,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8951731906425856,
"spec_weighted_f1": 0.9121524819510628,
"spec_macro_precision": 0.8980417155129858,
"spec_macro_recall": 0.8930560580782194,
"spec_mcc": 0.866381831963237,
"spec_auc": 0.981767664615518,
"spec_ece": 0.07004868157207966,
"spec_confusion_matrix": [
[
580,
23,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9354838709677419,
"spec_prec_L1Generi": 0.932475884244373,
"spec_recall_L1Generi": 0.9385113268608414,
"spec_f1_L2Domain": 0.7975460122699386,
"spec_prec_L2Domain": 0.8227848101265823,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9324447137231142,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.917725722448833
},
"Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.993963602835296,
"cat_ece": 0.06541431720058125,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8833694419146193,
"spec_weighted_f1": 0.9004034318676798,
"spec_macro_precision": 0.8858989636247611,
"spec_macro_recall": 0.8854684685880032,
"spec_mcc": 0.8500778641433316,
"spec_auc": 0.9736589116420353,
"spec_ece": 0.0816012116521597,
"spec_confusion_matrix": [
[
567,
30,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9242053789731052,
"spec_prec_L1Generi": 0.9115755627009646,
"spec_recall_L1Generi": 0.9371900826446281,
"spec_f1_L2Domain": 0.7788778877887789,
"spec_prec_L2Domain": 0.7468354430379747,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9227008860372746,
"spec_mae": 0.13583333333333333,
"spec_kripp_alpha": 0.9065248741550552
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 5.539002780715236,
"throughput_per_s": 180.53791261517884,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.46582157304510474
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.9920546854722492,
"cat_ece": 0.05388230005900064,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8943486525770918,
"spec_weighted_f1": 0.9113685505226937,
"spec_macro_precision": 0.896720845083131,
"spec_macro_recall": 0.8926515273338829,
"spec_mcc": 0.8651595302642376,
"spec_auc": 0.9817804600696421,
"spec_ece": 0.07088303024570146,
"spec_confusion_matrix": [
[
579,
24,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9346246973365617,
"spec_prec_L1Generi": 0.9323671497584541,
"spec_recall_L1Generi": 0.9368932038834952,
"spec_f1_L2Domain": 0.7951070336391437,
"spec_prec_L2Domain": 0.8176100628930818,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9321211092744079,
"spec_mae": 0.11833333333333333,
"spec_kripp_alpha": 0.9170875429859872
},
"Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.9939567083286731,
"cat_ece": 0.06541596949100496,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8825095464914274,
"spec_weighted_f1": 0.8996514471183623,
"spec_macro_precision": 0.8846890964606792,
"spec_macro_recall": 0.8850552454475074,
"spec_mcc": 0.8488763096810703,
"spec_auc": 0.9736482774372809,
"spec_ece": 0.08238246644536655,
"spec_confusion_matrix": [
[
566,
31,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9233278955954323,
"spec_prec_L1Generi": 0.9114331723027376,
"spec_recall_L1Generi": 0.9355371900826446,
"spec_f1_L2Domain": 0.7763157894736842,
"spec_prec_L2Domain": 0.7421383647798742,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9223702541559166,
"spec_mae": 0.13666666666666666,
"spec_kripp_alpha": 0.9059072309806726
},
"_runtime": {
"encoder_mb": 1579.127296,
"ms_per_sample": 16.293709366727853,
"throughput_per_s": 61.37337898281309,
"peak_vram_mb": 3503.53369140625,
"build_s": 0.6251941699883901
}
}


@ -0,0 +1,286 @@
[
{
"variant": "fp32",
"description": "Float32 encoder + heads",
"encoder_mb": 1579.127296,
"ms_per_sample": 16.293709366727853,
"throughput_per_s": 61.37337898281309,
"peak_vram_mb": 3503.53369140625,
"build_s": 0.6251941699883901,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8943486525770918,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9321211092744079,
"GPT-5.4_spec_mae": 0.11833333333333333,
"GPT-5.4_cat_ece": 0.05388230005900064,
"GPT-5.4_spec_ece": 0.07088303024570146,
"GPT-5.4_spec_f1_L1Generi": 0.9346246973365617,
"GPT-5.4_spec_f1_L2Domain": 0.7951070336391437,
"GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8825095464914274,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9223702541559166,
"Opus-4.6_spec_mae": 0.13666666666666666,
"Opus-4.6_cat_ece": 0.06541596949100496,
"Opus-4.6_spec_ece": 0.08238246644536655,
"Opus-4.6_spec_f1_L1Generi": 0.9233278955954323,
"Opus-4.6_spec_f1_L2Domain": 0.7763157894736842,
"Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "bf16",
"description": "BFloat16 baseline (matches eval pipeline)",
"encoder_mb": 789.563648,
"ms_per_sample": 5.516677870764397,
"throughput_per_s": 181.26851402716375,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.48778308398323134,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8951731906425856,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9324447137231142,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.053848127176364245,
"GPT-5.4_spec_ece": 0.07135417198141418,
"GPT-5.4_spec_f1_L1Generi": 0.9354838709677419,
"GPT-5.4_spec_f1_L2Domain": 0.7975460122699386,
"GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8833694419146193,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9227008860372746,
"Opus-4.6_spec_mae": 0.13583333333333333,
"Opus-4.6_cat_ece": 0.06551479384303091,
"Opus-4.6_spec_ece": 0.08248284702499709,
"Opus-4.6_spec_f1_L1Generi": 0.9242053789731052,
"Opus-4.6_spec_f1_L2Domain": 0.7788778877887789,
"Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "fp16",
"description": "Float16 encoder + heads",
"encoder_mb": 789.563648,
"ms_per_sample": 5.539002780715236,
"throughput_per_s": 180.53791261517884,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.46582157304510474,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8951731906425856,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9324447137231142,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.053747650533914546,
"GPT-5.4_spec_ece": 0.07004868157207966,
"GPT-5.4_spec_f1_L1Generi": 0.9354838709677419,
"GPT-5.4_spec_f1_L2Domain": 0.7975460122699386,
"GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8833694419146193,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9227008860372746,
"Opus-4.6_spec_mae": 0.13583333333333333,
"Opus-4.6_cat_ece": 0.06541431720058125,
"Opus-4.6_spec_ece": 0.0816012116521597,
"Opus-4.6_spec_f1_L1Generi": 0.9242053789731052,
"Opus-4.6_spec_f1_L2Domain": 0.7788778877887789,
"Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "torchao-int8-wo",
"description": "torchao Int8 weight-only on encoder linears",
"encoder_mb": 789.563648,
"ms_per_sample": 6.078403938445263,
"throughput_per_s": 164.5168715549004,
"peak_vram_mb": 1416.36376953125,
"build_s": 0.5027359619853087,
"GPT-5.4_cat_f1": 0.9344870894825886,
"GPT-5.4_spec_f1": 0.8941203230194683,
"GPT-5.4_cat_mcc": 0.9237006314618685,
"GPT-5.4_spec_qwk": 0.9329693660903852,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.05415941931307314,
"GPT-5.4_spec_ece": 0.06980206420024232,
"GPT-5.4_spec_f1_L1Generi": 0.9353796445880452,
"GPT-5.4_spec_f1_L2Domain": 0.793939393939394,
"GPT-5.4_spec_f1_L3Firm-S": 0.8936170212765957,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.9234810481200378,
"Opus-4.6_spec_f1": 0.8814731397444973,
"Opus-4.6_cat_mcc": 0.9102750101817324,
"Opus-4.6_spec_qwk": 0.9207708779443254,
"Opus-4.6_spec_mae": 0.13916666666666666,
"Opus-4.6_cat_ece": 0.0641141641388337,
"Opus-4.6_spec_ece": 0.08370273689428968,
"Opus-4.6_spec_f1_L1Generi": 0.9208163265306123,
"Opus-4.6_spec_f1_L2Domain": 0.7752442996742671,
"Opus-4.6_spec_f1_L3Firm-S": 0.865546218487395,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "torchao-int8-dyn",
"description": "torchao Int8 dynamic activation + Int8 weight on encoder",
"encoder_mb": 789.563648,
"ms_per_sample": 9.671733896636093,
"throughput_per_s": 103.39407707937539,
"peak_vram_mb": 1774.27392578125,
"build_s": 0.4831273259478621,
"GPT-5.4_cat_f1": 0.9336475878058536,
"GPT-5.4_spec_f1": 0.8918479759675974,
"GPT-5.4_cat_mcc": 0.9226968780743573,
"GPT-5.4_spec_qwk": 0.931514217618119,
"GPT-5.4_spec_mae": 0.12,
"GPT-5.4_cat_ece": 0.05363284418980283,
"GPT-5.4_spec_ece": 0.07049367701013878,
"GPT-5.4_spec_f1_L1Generi": 0.934412955465587,
"GPT-5.4_spec_f1_L2Domain": 0.7889908256880734,
"GPT-5.4_spec_f1_L3Firm-S": 0.8904428904428905,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.9242573204255528,
"Opus-4.6_spec_f1": 0.8827245859621925,
"Opus-4.6_cat_mcc": 0.9112549308356716,
"Opus-4.6_spec_qwk": 0.92235918049198,
"Opus-4.6_spec_mae": 0.13666666666666666,
"Opus-4.6_cat_ece": 0.06330573419729862,
"Opus-4.6_spec_ece": 0.08290670409798626,
"Opus-4.6_spec_f1_L1Generi": 0.9230769230769231,
"Opus-4.6_spec_f1_L2Domain": 0.7763157894736842,
"Opus-4.6_spec_f1_L3Firm-S": 0.8672199170124482,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "torchao-int4-wo",
"description": "torchao Int4 weight-only (group=128) on encoder",
"error": "ImportError: Requires mslk >= 1.0.0"
},
{
"variant": "bnb-int8",
"description": "bitsandbytes LLM.int8 on encoder linears",
"encoder_mb": 789.563648,
"ms_per_sample": 7.762363941583317,
"throughput_per_s": 128.82673468103667,
"peak_vram_mb": 2135.203125,
"build_s": 1.1878160500200465,
"GPT-5.4_cat_f1": 0.9360988760303737,
"GPT-5.4_spec_f1": 0.8986323186392307,
"GPT-5.4_cat_mcc": 0.9256911778959798,
"GPT-5.4_spec_qwk": 0.9307948020550015,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.052939765204985965,
"GPT-5.4_spec_ece": 0.06740866973996164,
"GPT-5.4_spec_f1_L1Generi": 0.9364440868865648,
"GPT-5.4_spec_f1_L2Domain": 0.8099688473520249,
"GPT-5.4_spec_f1_L3Firm-S": 0.892018779342723,
"GPT-5.4_spec_f1_L4Quanti": 0.9560975609756097,
"Opus-4.6_cat_f1": 0.9235105849558979,
"Opus-4.6_spec_f1": 0.8826923642825633,
"Opus-4.6_cat_mcc": 0.9103198007176273,
"Opus-4.6_spec_qwk": 0.9198415117342273,
"Opus-4.6_spec_mae": 0.13916666666666666,
"Opus-4.6_cat_ece": 0.06465620135267579,
"Opus-4.6_spec_ece": 0.083350846717755,
"Opus-4.6_spec_f1_L1Generi": 0.9235772357723577,
"Opus-4.6_spec_f1_L2Domain": 0.785234899328859,
"Opus-4.6_spec_f1_L3Firm-S": 0.860125260960334,
"Opus-4.6_spec_f1_L4Quanti": 0.9618320610687023
},
{
"variant": "bnb-nf4",
"description": "bitsandbytes NF4 4-bit (double-quant, bf16 compute)",
"encoder_mb": 274.843904,
"ms_per_sample": 5.860076693982895,
"throughput_per_s": 170.64623079537446,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4858604749897495,
"GPT-5.4_cat_f1": 0.3536909012886116,
"GPT-5.4_spec_f1": 0.22049451330952025,
"GPT-5.4_cat_mcc": 0.42471542150657926,
"GPT-5.4_spec_qwk": 0.24233251808742773,
"GPT-5.4_spec_mae": 0.8733333333333333,
"GPT-5.4_cat_ece": 0.09734637491405013,
"GPT-5.4_spec_ece": 0.43363295723994577,
"GPT-5.4_spec_f1_L1Generi": 0.35013262599469497,
"GPT-5.4_spec_f1_L2Domain": 0.28546861564918313,
"GPT-5.4_spec_f1_L3Firm-S": 0.2463768115942029,
"GPT-5.4_spec_f1_L4Quanti": 0.0,
"Opus-4.6_cat_f1": 0.35763512449392704,
"Opus-4.6_spec_f1": 0.20754679251319788,
"Opus-4.6_cat_mcc": 0.42166882753874363,
"Opus-4.6_spec_qwk": 0.24096533359991634,
"Opus-4.6_spec_mae": 0.88,
"Opus-4.6_cat_ece": 0.09567970824738346,
"Opus-4.6_spec_ece": 0.4489923599362374,
"Opus-4.6_spec_f1_L1Generi": 0.3508771929824561,
"Opus-4.6_spec_f1_L2Domain": 0.2543859649122807,
"Opus-4.6_spec_f1_L3Firm-S": 0.22492401215805471,
"Opus-4.6_spec_f1_L4Quanti": 0.0
},
{
"variant": "bnb-nf4-nodq",
"description": "bitsandbytes NF4 4-bit (no double-quant)",
"encoder_mb": 274.843904,
"ms_per_sample": 5.861402786540566,
"throughput_per_s": 170.607623536175,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4908116469741799,
"GPT-5.4_cat_f1": 0.3536909012886116,
"GPT-5.4_spec_f1": 0.22049451330952025,
"GPT-5.4_cat_mcc": 0.42471542150657926,
"GPT-5.4_spec_qwk": 0.24233251808742773,
"GPT-5.4_spec_mae": 0.8733333333333333,
"GPT-5.4_cat_ece": 0.09734637491405013,
"GPT-5.4_spec_ece": 0.43363295723994577,
"GPT-5.4_spec_f1_L1Generi": 0.35013262599469497,
"GPT-5.4_spec_f1_L2Domain": 0.28546861564918313,
"GPT-5.4_spec_f1_L3Firm-S": 0.2463768115942029,
"GPT-5.4_spec_f1_L4Quanti": 0.0,
"Opus-4.6_cat_f1": 0.35763512449392704,
"Opus-4.6_spec_f1": 0.20754679251319788,
"Opus-4.6_cat_mcc": 0.42166882753874363,
"Opus-4.6_spec_qwk": 0.24096533359991634,
"Opus-4.6_spec_mae": 0.88,
"Opus-4.6_cat_ece": 0.09567970824738346,
"Opus-4.6_spec_ece": 0.4489923599362374,
"Opus-4.6_spec_f1_L1Generi": 0.3508771929824561,
"Opus-4.6_spec_f1_L2Domain": 0.2543859649122807,
"Opus-4.6_spec_f1_L3Firm-S": 0.22492401215805471,
"Opus-4.6_spec_f1_L4Quanti": 0.0
},
{
"variant": "bnb-fp4",
"description": "bitsandbytes FP4 4-bit (no double-quant)",
"encoder_mb": 274.843904,
"ms_per_sample": 5.865302347471394,
"throughput_per_s": 170.49419463109393,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4887635139748454,
"GPT-5.4_cat_f1": 0.16293893512410998,
"GPT-5.4_spec_f1": 0.20854117827130608,
"GPT-5.4_cat_mcc": 0.22345796853389935,
"GPT-5.4_spec_qwk": 0.2326064604575444,
"GPT-5.4_spec_mae": 0.8825,
"GPT-5.4_cat_ece": 0.2080524676044782,
"GPT-5.4_spec_ece": 0.44289420386155437,
"GPT-5.4_spec_f1_L1Generi": 0.35742444152431013,
"GPT-5.4_spec_f1_L2Domain": 0.2824956672443674,
"GPT-5.4_spec_f1_L3Firm-S": 0.19424460431654678,
"GPT-5.4_spec_f1_L4Quanti": 0.0,
"Opus-4.6_cat_f1": 0.16861118726256397,
"Opus-4.6_spec_f1": 0.19783939283519508,
"Opus-4.6_cat_mcc": 0.2251562222131823,
"Opus-4.6_spec_qwk": 0.22580295138888895,
"Opus-4.6_spec_mae": 0.8925,
"Opus-4.6_cat_ece": 0.19888580093781152,
"Opus-4.6_spec_ece": 0.45814307530721027,
"Opus-4.6_spec_f1_L1Generi": 0.35294117647058826,
"Opus-4.6_spec_f1_L2Domain": 0.251105216622458,
"Opus-4.6_spec_f1_L3Firm-S": 0.18731117824773413,
"Opus-4.6_spec_f1_L4Quanti": 0.0
}
]

View File

@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336475878058536,
"cat_weighted_f1": 0.9342872402134198,
"cat_macro_precision": 0.9319024691959354,
"cat_macro_recall": 0.9376938259865566,
"cat_mcc": 0.9226968780743573,
"cat_auc": 0.9924054453050574,
"cat_ece": 0.05363284418980283,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
145,
1,
2,
0,
0
],
[
1,
0,
3,
130,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9698275862068966,
"cat_prec_BoardGov": 0.9615384615384616,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9446254071661238,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.8843537414965986,
"cat_prec_NoneOthe": 0.8227848101265823,
"cat_recall_NoneOthe": 0.9558823529411765,
"cat_f1_RiskMana": 0.8586387434554974,
"cat_prec_RiskMana": 0.8913043478260869,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223561935890119,
"spec_macro_f1": 0.8918479759675974,
"spec_weighted_f1": 0.9097693388297432,
"spec_macro_precision": 0.8930494570032042,
"spec_macro_recall": 0.8915621000757135,
"spec_mcc": 0.8628946887605918,
"spec_auc": 0.9807842405238503,
"spec_ece": 0.07049367701013878,
"spec_confusion_matrix": [
[
577,
24,
14,
3
],
[
29,
129,
8,
2
],
[
9,
5,
191,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.934412955465587,
"spec_prec_L1Generi": 0.9351701782820098,
"spec_recall_L1Generi": 0.9336569579288025,
"spec_f1_L2Domain": 0.7889908256880734,
"spec_prec_L2Domain": 0.8113207547169812,
"spec_recall_L2Domain": 0.7678571428571429,
"spec_f1_L3Firm-S": 0.8904428904428905,
"spec_prec_L3Firm-S": 0.8603603603603603,
"spec_recall_L3Firm-S": 0.9227053140096618,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.931514217618119,
"spec_mae": 0.12,
"spec_kripp_alpha": 0.9169918680049234
},
"Opus-4.6": {
"cat_macro_f1": 0.9242573204255528,
"cat_weighted_f1": 0.9232556488517519,
"cat_macro_precision": 0.9193897229484191,
"cat_macro_recall": 0.9331778058838005,
"cat_mcc": 0.9112549308356716,
"cat_auc": 0.9941614030336741,
"cat_ece": 0.06330573419729862,
"cat_confusion_matrix": [
[
212,
0,
1,
0,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
145,
1,
3,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9464285714285714,
"cat_prec_BoardGov": 0.905982905982906,
"cat_recall_BoardGov": 0.9906542056074766,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9206349206349206,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9177215189873418,
"cat_f1_NoneOthe": 0.9230769230769231,
"cat_prec_NoneOthe": 0.8734177215189873,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8513853904282116,
"cat_prec_RiskMana": 0.9184782608695652,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9105393643352402,
"spec_macro_f1": 0.8827245859621925,
"spec_weighted_f1": 0.8997656600606208,
"spec_macro_precision": 0.8833309642003535,
"spec_macro_recall": 0.8861518760895928,
"spec_mcc": 0.8488976906438819,
"spec_auc": 0.9740582923879771,
"spec_ece": 0.08290670409798626,
"spec_confusion_matrix": [
[
564,
31,
9,
1
],
[
22,
118,
3,
2
],
[
31,
10,
209,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9230769230769231,
"spec_prec_L1Generi": 0.9141004862236629,
"spec_recall_L1Generi": 0.9322314049586777,
"spec_f1_L2Domain": 0.7763157894736842,
"spec_prec_L2Domain": 0.7421383647798742,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8672199170124482,
"spec_prec_L3Firm-S": 0.9414414414414415,
"spec_recall_L3Firm-S": 0.8038461538461539,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.92235918049198,
"spec_mae": 0.13666666666666666,
"spec_kripp_alpha": 0.9061330450504643
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 9.671733896636093,
"throughput_per_s": 103.39407707937539,
"peak_vram_mb": 1774.27392578125,
"build_s": 0.4831273259478621
}
}

View File

@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9344870894825886,
"cat_weighted_f1": 0.9351173265780133,
"cat_macro_precision": 0.9326512314038842,
"cat_macro_recall": 0.9387442461546238,
"cat_mcc": 0.9237006314618685,
"cat_auc": 0.992309699625497,
"cat_ece": 0.05415941931307314,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
145,
1,
2,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9446254071661238,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.8586387434554974,
"cat_prec_RiskMana": 0.8913043478260869,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9233443339647499,
"spec_macro_f1": 0.8941203230194683,
"spec_weighted_f1": 0.9115075208518084,
"spec_macro_precision": 0.8957148694260108,
"spec_macro_recall": 0.892931893103379,
"spec_mcc": 0.8651929532300995,
"spec_auc": 0.981624069084201,
"spec_ece": 0.06980206420024232,
"spec_confusion_matrix": [
[
579,
24,
12,
3
],
[
29,
131,
6,
2
],
[
10,
6,
189,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9353796445880452,
"spec_prec_L1Generi": 0.9338709677419355,
"spec_recall_L1Generi": 0.9368932038834952,
"spec_f1_L2Domain": 0.793939393939394,
"spec_prec_L2Domain": 0.808641975308642,
"spec_recall_L2Domain": 0.7797619047619048,
"spec_f1_L3Firm-S": 0.8936170212765957,
"spec_prec_L3Firm-S": 0.875,
"spec_recall_L3Firm-S": 0.9130434782608695,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9329693660903852,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.9181842655510584
},
"Opus-4.6": {
"cat_macro_f1": 0.9234810481200378,
"cat_weighted_f1": 0.9224737817442137,
"cat_macro_precision": 0.9185473372257941,
"cat_macro_recall": 0.9325102491414775,
"cat_mcc": 0.9102750101817324,
"cat_auc": 0.9940184741579791,
"cat_ece": 0.0641141641388337,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
145,
1,
3,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9206349206349206,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9177215189873418,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8513853904282116,
"cat_prec_RiskMana": 0.9184782608695652,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9095617653952504,
"spec_macro_f1": 0.8814731397444973,
"spec_weighted_f1": 0.8981338362706646,
"spec_macro_precision": 0.8833981471623865,
"spec_macro_recall": 0.8849913986360116,
"spec_mcc": 0.8465512998506631,
"spec_auc": 0.9729999946345258,
"spec_ece": 0.08370273689428968,
"spec_confusion_matrix": [
[
564,
33,
7,
1
],
[
22,
119,
2,
2
],
[
34,
10,
206,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9208163265306123,
"spec_prec_L1Generi": 0.9096774193548387,
"spec_recall_L1Generi": 0.9322314049586777,
"spec_f1_L2Domain": 0.7752442996742671,
"spec_prec_L2Domain": 0.7345679012345679,
"spec_recall_L2Domain": 0.8206896551724138,
"spec_f1_L3Firm-S": 0.865546218487395,
"spec_prec_L3Firm-S": 0.9537037037037037,
"spec_recall_L3Firm-S": 0.7923076923076923,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9207708779443254,
"spec_mae": 0.13916666666666666,
"spec_kripp_alpha": 0.9033268512180281
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 6.078403938445263,
"throughput_per_s": 164.5168715549004,
"peak_vram_mb": 1416.36376953125,
"build_s": 0.5027359619853087
}
}