quantization + onnx sweeps

Phase 10.8: torchao/bnb quant sweep on iter1-independent. bf16 already
optimal; torchao int8-wo gives -19% VRAM at no F1 cost; all 4-bit
variants collapse (ModernBERT-large too quant-sensitive).

Phase 10.9: ONNX export + ORT eval. Legacy exporter only working path
(dynamo adds 56 Memcpy nodes); ORT fp32 -22% latency vs torch via
kernel fusion but bf16+flash-attn-2 still wins; fp16 broken on rotary;
dynamic int8 silently CPU-fallback + 0.5 F1 collapse.

Driver scripts wired to bun run py:quant / py:onnx; full reports at
results/eval/{quant,onnx}/REPORT.md.
This commit is contained in:
Joey Eamigh 2026-04-07 05:10:14 -04:00
parent ebfc103dac
commit 67beaede45
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
19 changed files with 4367 additions and 0 deletions

.gitignore vendored

@ -2,7 +2,10 @@
/data/
/models/
/checkpoints/
/results/eval/onnx/models/
*.tar.zst
*.onnx
*.onnx.data
# Dependencies
ts/node_modules/


@ -1097,6 +1097,212 @@ epoch 3 + no temperature scaling would be a reasonable alternative choice.
> in-distribution confidence memorization. Temperature scaling recovers
> calibration (ECE 33% cat, 40% spec) without altering predictions."*
### 10.8 Quantization Sweep (2026-04-07)
**Question:** does post-training quantization buy us a smaller / faster
deployable model without giving back accuracy? And — almost more
interesting — *which* quant schemes does ModernBERT-large tolerate?
**Setup:** new sweep driver at `python/scripts/quantize_sweep.py` (wired
to `bun run py:quant`). Loads the iter1-independent checkpoint, applies
each scheme to the encoder backbone only (heads stay bf16), reruns the
full holdout eval against GPT-5.4 and Opus-4.6 proxy gold, and records
latency, peak VRAM, encoder footprint, and the full metrics suite. 5
warmup batches before timing; batch 64; max_seq 512; RTX 3090.
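The timing protocol (5 warmup batches excluded from timing, then timed batches of 64) is simple enough to state as a sketch; `bench` and its arguments are illustrative, not the driver's actual API, and the real driver additionally synchronizes CUDA around each timed forward:

```python
import time

def bench(run_batch, batches, warmup=5):
    """Per-sample latency in ms: warm up first, then time only forwards."""
    for b in batches[:warmup]:
        run_batch(b)                      # warmup pass, excluded from timing
    total, n = 0.0, 0
    for b in batches:
        t0 = time.perf_counter()
        run_batch(b)                      # real driver syncs CUDA here
        total += time.perf_counter() - t0
        n += len(b)
    return (total / n) * 1000
```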
**Variants:** fp32, bf16 (baseline), fp16, torchao int8 weight-only,
torchao int8 dynamic-act + int8 weight, torchao int4 weight-only,
bitsandbytes LLM.int8, bitsandbytes nf4 (with and without
double-quantization), bitsandbytes fp4.
**Results (vs GPT-5.4 proxy gold):**

| variant | enc MB | ms/samp | thru/s | VRAM MB | cat F1 | spec F1 | spec QWK |
|--------------------|-------:|--------:|-------:|--------:|-------:|--------:|---------:|
| fp32 | 1579 | 16.29 | 61 | 3504 | 0.9337 | 0.8943 | 0.9321 |
| **bf16 baseline** | 790 | 5.52 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 |
| fp16 | 790 | 5.54 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 |
| **torchao int8-wo**| ~395 | 6.08 | 165 | 1416 | 0.9345 | 0.8941 | 0.9330 |
| torchao int8-dyn | ~395 | 9.67 | 103 | 1774 | 0.9336 | 0.8918 | 0.9315 |
| torchao int4-wo | — | — | — | — | err | err | err |
| bnb LLM.int8 | ~395 | 7.76 | 129 | 2135 | 0.9361 | 0.8986 | 0.9308 |
| bnb nf4 (DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 |
| bnb nf4 (no DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 |
| bnb fp4 | 275 | 5.87 | 170 | 1287 | 0.1629 | 0.2085 | 0.2326 |
(torchao subclass tensors report the bf16 element_size, so the "~395 MB"
rows are hand-computed storage estimates, not what summing
`param.numel() * param.element_size()` returns.)
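The footprint column is back-of-envelope arithmetic: params × bytes per weight. A sketch (the encoder-param count is backed out of the fp32 row, not measured here):

```python
# Back out the encoder param count from the fp32 row (1579 MB at 4 B/weight);
# ~395M is an inference from the table, not a measured number.
ENC_PARAMS = 1579e6 / 4

def footprint_mb(bits_per_weight: float, meta_bytes_per_weight: float = 0.0) -> float:
    """Encoder storage estimate in MB at a given weight width."""
    return ENC_PARAMS * (bits_per_weight / 8 + meta_bytes_per_weight) / 1e6

fp32_mb = footprint_mb(32)  # ≈ 1579
bf16_mb = footprint_mb(16)  # ≈ 790
int8_mb = footprint_mb(8)   # ≈ 395, the table's "true storage" figure
# nf4: 4-bit weights plus one fp32 absmax per 64-weight block. The table's
# 275 MB also includes non-Linear params left in bf16, so this undershoots.
nf4_mb = footprint_mb(4, meta_bytes_per_weight=4 / 64)
```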
**Six findings:**
1. **bf16 + flash-attn-2 is already the sweet spot.** 3.0× throughput over
fp32 with bit-identical accuracy and half the VRAM. Nothing in the
precision dimension beats it on this hardware.
2. **fp16 ≡ bf16.** RTX 3090 has matched fp16/bf16 tensor-core throughput
and the model has no overflow issues; pick whichever the loader
prefers.
3. **torchao int8 weight-only is the only quantization that's worth
shipping.** -19% VRAM (1741 → 1416 MB), accuracy delta inside ±0.002
per-seed noise, +10% latency because RTX 3090 (sm_86) lacks the int8
tensor-core matmul path that torchao would otherwise route through —
so the int8 weight is dequantized to bf16 on the fly. **This is the
variant we'd ship as the "low-VRAM" deployment option**, and on
Hopper / Ada the latency would invert and be a strict win.
4. **torchao int8 dynamic-activation regresses on Ampere.** -43%
throughput and *more* peak VRAM than bf16 because the per-batch
activation quantization adds work without unlocking the int8
matmul. Skip.
5. **bnb LLM.int8 is the slowest int8 path and uses *more* VRAM than
bf16.** Mixed-precision outlier handling adds 23% peak memory and 41%
latency for an F1 bump that's inside noise. It's tuned for LLM-scale
models where outlier features dominate quant error; for an
encoder this size on a single 3090 it's a regression.
6. **All 4-bit variants collapse to near-random.** Both nf4 (DQ and
no-DQ) and fp4 produce essentially category-prior and L1-collapsed
predictions (cat ECE jumps from 0.054 to 0.10–0.21). We verified per
layer that the dequantized weights of one MLP `Wi` differ from the
original by mean 0.005 / max 0.11 — quantization is *correct* — but
the relative output drift on a single Linear is already ~98% (mean),
and that compounds across 28 transformer blocks + GLU FFN paths until
the [CLS]/pooled representation no longer carries the discriminative
signal. **DQ vs no-DQ produce bit-identical predictions** because the
nf4 weight indices are stable under absmax requantization (only the
metadata block differs). The catastrophe is inherent to 4-bit weight
precision on this architecture, not to a config knob. Recovering 4-bit
would require QAT, GPTQ/AWQ-style per-channel calibration, or keeping
the GLU FFN in 8-bit while only 4-bit'ing attention projections —
none reachable inside the remaining capstone budget.
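The compounding argument can be illustrated with a toy simulation: uniform fake-quantization of random tanh-Linear blocks (a crude stand-in for nf4's per-block non-uniform scheme, and for ModernBERT's actual blocks) already shows 8-bit weights surviving 28 layers while 4-bit decorrelates:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    # Symmetric per-tensor uniform quantization, a crude stand-in for
    # nf4/fp4 (those are non-uniform and per-block); enough to show scaling.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def cos_after_layers(n_layers: int, bits: int, dim: int = 256) -> float:
    """Cosine similarity between full-precision and weight-quantized
    activations after n_layers random tanh-Linear blocks."""
    x = xq = rng.standard_normal(dim)
    for _ in range(n_layers):
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        x = np.tanh(w @ x)
        xq = np.tanh(fake_quant(w, bits) @ xq)
    return float(x @ xq / (np.linalg.norm(x) * np.linalg.norm(xq)))

# 8-bit weights stay aligned over 28 blocks; 4-bit per-layer weight error
# is ~20× larger and compounds until the output decorrelates.
```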
**Paper hooks:**
- Add a "deployment precision" row to the speed/cost table — bf16 vs
torchao int8-wo gives a clean Pareto pair (latency vs VRAM).
- One paragraph in the discussion alongside the DAPT and CORAL nulls:
*naive post-training 4-bit weight quantization is not viable for
ModernBERT-large on this task; the GLU FFN amplifies per-layer weight
error across 28 blocks until signal is destroyed*. This is a useful
counterpoint to the 4-bit-by-default LLM serving narrative and a
legitimate negative result tied to architectural choices.
- Caveat the int8 latency rows with the sm_86 hardware footnote — the
result would invert on H100/A100/Ada.
Full standalone report at `results/eval/quant/REPORT.md`; per-variant
metrics at `results/eval/quant/<variant>/metrics.json`; aggregate row data
at `results/eval/quant/summary.json`.
### 10.9 ONNX Export + Eval (2026-04-07)
**Question:** can we get a portable ONNX artifact with comparable
latency / accuracy? What does the ORT path look like for fp32, fp16,
and int8?
**Setup:** new driver at `python/scripts/onnx_export_eval.py` (`bun run
py:onnx`). Exports the iter1-independent checkpoint, runs ORT inference
on the full holdout via CUDAExecutionProvider, and compares against the
proxy gold.
**Six things broke along the way; documenting because each one is a real
gotcha for the paper's reproducibility section:**
1. **Dynamo exporter optimizer crashes.** `torch.onnx.export(...,
dynamo=True)` translates the graph but its post-translation `InlinePass`
trips on `onnx_ir`. Workaround: `optimize=False`.
2. **Dynamo-exported graph is unusable on CUDA EP.** ORT inserts 56
Memcpy nodes between layers because dynamo emits scalar tensors with
CPU-side placement metadata. Result: 42.9 ms/sample (2.6× torch fp32)
and 15.4 GB peak VRAM (4.4× torch fp32). The legacy TorchScript
exporter (`dynamo=False`) only inserts 1 Memcpy and is the only
working export path.
3. **`op_types_to_quantize=['MatMul']` quantizes nothing on the dynamo
graph.** Dynamo emits encoder linears as `Gemm`, not `MatMul`. Need
`['MatMul', 'Gemm']`.
4. **Both ORT shape-inference paths choke on ModernBERT.** Symbolic
inference asserts in `_infer_Range` (the rotary embedding's `limit`
input is not a scalar); the C++ path raises a (1024)/(7) dimension
mismatch on the category head Gemm. The `skip_*` flags on
`quant_pre_process` are *ignored* — it always runs symbolic shape
inference — and `ONNXQuantizer.__init__` calls
`save_and_reload_model_with_shape_infer` unconditionally. Workaround:
monkey-patch both bindings to no-ops, then pass
`extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}` so the
quantizer can still type the head MatMul output.
5. **fp16 conversion via `onnxconverter_common` breaks on rotary
embeddings.** Two distinct failure modes seen across exports, e.g. `Type
parameter (T) of Optype (Mul) bound to different types
(tensor(float) and tensor(float16)) in node
/model/backbone/rotary_emb_1/Mul_2`. The converter leaves the
`inv_freq` buffer in fp32 and the surrounding `Mul`/`Expand` ops
then can't unify their type parameter. Patchable with an
`op_block_list` for the rotary subgraph, but cost/value isn't there
given the int8 result below.
6. **Dynamic int8 via ORT silently falls back to CPU.** The quantizer
replaces Gemm/MatMul with `MatMulInteger` + `DynamicQuantizeLinear`,
neither of which has CUDA kernels in onnxruntime-gpu 1.24. Session
creation succeeds with `CUDAExecutionProvider` but routes the
quantized ops to the CPU EP — observable from the load-time GPU
memory delta collapsing from 2074 MB (fp32) to 266 MB (int8) and
latency exploding to **95.9 ms/sample**. Accuracy also drops to
cat F1 = 0.397 / spec F1 = 0.336, further confirming the kernel
path is wrong (not just slow).
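For the record, the `op_block_list` patch alluded to in gotcha 5 would amount to keeping the rotary subgraph in fp32. `convert_float_to_float16` does accept `op_block_list`/`node_block_list` arguments; the name filter below is a hypothetical sketch of what would feed it, not a tested fix (node names are from the error in gotcha 5):

```python
def rotary_node_block_list(node_names: list[str]) -> list[str]:
    """Nodes to keep in fp32 when converting the rest of the graph to fp16.

    Intended to feed onnxconverter_common's node_block_list; matching on
    the rotary subgraph's name is a guess at the right granularity.
    """
    return [n for n in node_names if "rotary_emb" in n]

names = [
    "/model/backbone/rotary_emb_1/Mul_2",    # fp32 inv_freq math: block it
    "/model/backbone/layers.0/attn/MatMul",  # ordinary matmul: convert it
]
```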
**Results (legacy exporter, 1,200 holdout, vs GPT-5.4):**

| variant | size MB | ms/samp | VRAM MB | cat F1 | spec F1 | spec QWK |
|--------------------|--------:|--------:|--------:|-------:|--------:|---------:|
| **onnx-fp32** | 1583 | 12.70 | 8228 | 0.9337 | 0.8952 | 0.9324 |
| onnx-fp16 | 754 | err | err | err | err | err |
| onnx-int8 (dynamic)| 527 | 95.91 | ~CPU | 0.3972 | 0.3364 | 0.4413 |
For comparison, the torch baselines from Phase 10.8:
- torch fp32: 16.29 ms / 3504 MB / cat 0.9337 / spec 0.8943
- torch bf16: **5.52 ms / 1741 MB** / cat 0.9337 / spec 0.8952
**Three findings:**
1. **The one clean win — ORT fp32 beats torch fp32 by 22% on latency
(12.70 vs 16.29 ms)** at bit-identical accuracy, thanks to ORT's
LayerNorm + Gelu + MatMul kernel fusion. VRAM is 2.3× torch's
(8228 vs 3504 MB) because the ORT session allocates a separate
~5 GB workspace — fair trade for batched inference. But torch bf16
+ flash-attn-2 still wins outright on every dimension (5.52 ms,
1741 MB), so this is a moral victory at best.
2. **fp16 ONNX is currently unreachable** without writing custom rotary
handling for the float16 converter. Doable but several hours of
plumbing for an artifact that bf16 already dominates.
3. **ORT dynamic int8 is a deployment trap on this hardware.** It looks
like it works (export succeeds, file shrinks 1583 → 527 MB, session
constructs cleanly with CUDAExecutionProvider in the providers list),
but at runtime the integer matmul ops route to the CPU EP and the
model produces ~uniform-prior predictions because the per-channel
weight quantization interacts badly with the activation
quantization path. Both observations would silently bite a
production deployment that didn't run a holdout sanity check.
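The CPU-fallback symptom suggests a cheap deployment smoke test: compare the session's load-time GPU memory delta against a known-good fp32 session. The function and threshold below are illustrative, not part of the driver:

```python
def looks_like_cpu_fallback(load_vram_mb: float, fp32_load_vram_mb: float,
                            ratio: float = 0.5) -> bool:
    """Flag a session whose GPU footprint at load is suspiciously small.

    A quantized model should still stage most of its weights on the GPU;
    if the load-time delta collapses relative to the fp32 session, the
    quantized ops were probably assigned to the CPU EP. The 0.5 ratio is
    a guessed threshold, not a calibrated one.
    """
    return load_vram_mb < ratio * fp32_load_vram_mb

# From the numbers above: fp32 session loads ~2074 MB; int8 loads only 266 MB.
```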
**Net recommendation: don't ship ONNX for this model on this hardware.**
torchao int8-wo from §10.8 still owns the "smaller deployment" Pareto
slot (5.52 → 6.08 ms, 1741 → 1416 MB, F1 within ±0.001) more cleanly
than any ONNX variant we could produce here. ONNX would be worth
revisiting only for CPU-only deployment, cross-runtime portability
(TensorRT/OpenVINO/mobile), or a properly calibrated static int8 path
with a ModernBERT-aware op block list — none reachable inside the
remaining capstone budget.
**Paper hooks:**
- One paragraph in the deployment / reproducibility discussion:
*ONNX export of ModernBERT-large via the dynamo exporter is currently
broken (excessive Memcpy insertion); the legacy TorchScript exporter
produces a clean graph that's 22% faster than torch fp32 via ORT
kernel fusion, but bf16 + flash-attn-2 dominates at half the latency.
fp16 conversion via onnxconverter_common fails on rotary embeddings,
and ORT dynamic int8 silently falls back to CPU on
onnxruntime-gpu 1.24, dropping ~0.5 macro F1.*
- Add a "deployment lessons learned" sub-bullet to the limitations
section so a follow-on engineering team doesn't waste a day chasing
the same dead ends.
Full standalone report at `results/eval/onnx/REPORT.md`; aggregate
results at `results/eval/onnx/summary.json`; exported models at
`results/eval/onnx/models/`.
### Phase 10 Summary
| Experiment | Cost | Outcome | Paper value |
@ -1107,6 +1313,8 @@ epoch 3 + no temperature scaling would be a reasonable alternative choice.
| Temperature scaling | ~10 min GPU | ECE 33% cat, 40% spec, F1 unchanged | Calibration story, deployment quality |
| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds |
| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization |
| Quantization sweep (10 variants) | ~5 min GPU | bf16 already optimal; torchao int8-wo = 19% VRAM no F1 cost; **all 4-bit collapses** (ModernBERT-large too quant-sensitive) | Deployment Pareto + 4-bit null result |
| ONNX export + ORT eval | ~10 min GPU | Legacy exporter only working path; ORT fp32 22% latency vs torch (kernel fusion), but bf16 still wins; fp16 broken on rotary; int8 silently CPU-fallback + 0.5 F1 collapse | Deployment lessons learned, reproducibility caveats |
The 3-seed ensemble is now the recommended headline checkpoint. The
calibrated ECE numbers should replace the pre-scaling ECE in the paper. The


@ -20,6 +20,8 @@
"ts:sec": "bun run --filter sec-cybert sec",
"ts:typecheck": "bun run --filter sec-cybert typecheck",
"py:train": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run main.py",
"py:quant": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run scripts/quantize_sweep.py",
"py:onnx": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run scripts/onnx_export_eval.py",
"typecheck": "bun run --filter '*' typecheck",
"data:push": "./scripts/data-push.sh",
"data:pull": "./scripts/data-pull.sh",


@ -18,6 +18,11 @@ dependencies = [
"krippendorff>=0.8.2",
"matplotlib>=3.10.8",
"seaborn>=0.13.2",
"onnx>=1.21.0",
"onnxruntime-gpu>=1.24.4",
"onnxruntime>=1.24.4",
"onnxscript>=0.6.2",
"onnxconverter-common>=1.16.0",
]
[project.scripts]


@ -0,0 +1,369 @@
"""ONNX export + eval for the iter1-independent ModernBERT-large checkpoint.
Variants:
onnx-fp32 straight torch.onnx.export from the fp32 model
onnx-fp16 fp32 export converted to fp16 via onnxconverter_common
(proxy for bf16; ORT does not support bf16 inference natively)
onnx-int8-dyn dynamic int8 quantization of the fp32 graph via
onnxruntime.quantization.quantize_dynamic (weights in int8,
activations quantized at runtime)
For each variant:
- latency (ms/sample, batch=64, 5 warmup batches)
- peak GPU memory delta around the session (free-mem snapshot)
- on-disk size of model.onnx + model.onnx.data
- cat / spec macro F1, QWK, ECE on the 1,200-paragraph holdout
against GPT-5.4 + Opus-4.6 proxy gold
Usage:
bun run py:onnx
"""
from __future__ import annotations
import gc
import json
import os
import sys
import time
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))
from src.finetune.data import CAT2ID, CATEGORIES, NUM_CATEGORIES, NUM_SPECIFICITY # noqa: E402
from src.finetune.eval import SPEC_LABELS, compute_all_metrics, load_holdout_data # noqa: E402
from src.finetune.model import ordinal_predict # noqa: E402
from scripts.quantize_sweep import ( # noqa: E402
BENCHMARKS, BATCH_SIZE, HOLDOUT, MAX_SEQ, PARAGRAPHS, WARMUP_BATCHES,
_build_model, evaluate_predictions,
)
OUTPUT_DIR = ROOT.parent / "results/eval/onnx"
ONNX_DIR = OUTPUT_DIR / "models"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
ONNX_DIR.mkdir(parents=True, exist_ok=True)
# ──────────────────────────────────────────────────────────────────────
# Export
# ──────────────────────────────────────────────────────────────────────
class _Wrap(nn.Module):
def __init__(self, model):
super().__init__()
self.model = model
def forward(self, input_ids, attention_mask):
out = self.model(input_ids=input_ids, attention_mask=attention_mask)
return out["category_logits"], out["specificity_logits"]
def export_fp32(out_path: Path, sample_batch: int = 4, sample_seq: int = 64) -> None:
print(f" building fp32 torch model...")
model, tokenizer = _build_model(torch.float32, attn_impl="sdpa")
model = model.cuda().eval()
wrap = _Wrap(model).cuda().eval()
dummy_text = ["the company maintains a cybersecurity program overseen by the board"] * sample_batch
enc = tokenizer(
dummy_text, padding="max_length", max_length=sample_seq,
truncation=True, return_tensors="pt",
).to("cuda")
print(f" exporting → {out_path}")
# Legacy TorchScript exporter (dynamo=False). The dynamo path produces a
# graph with 56+ Memcpy nodes when run on CUDAExecutionProvider, blowing
# latency 8× and VRAM 4× over native torch — unusable. The legacy
# exporter emits clean Gemm/MatMul/LayerNorm nodes ORT can fuse.
torch.onnx.export(
wrap,
(enc["input_ids"], enc["attention_mask"]),
str(out_path),
input_names=["input_ids", "attention_mask"],
output_names=["cat_logits", "spec_logits"],
dynamic_axes={
"input_ids": {0: "batch", 1: "seq"},
"attention_mask": {0: "batch", 1: "seq"},
"cat_logits": {0: "batch"},
"spec_logits": {0: "batch"},
},
opset_version=17,
dynamo=False,
do_constant_folding=True,
)
del wrap, model
gc.collect()
torch.cuda.empty_cache()
def convert_fp16(fp32_path: Path, fp16_path: Path) -> None:
"""Convert an fp32 ONNX model to fp16 via onnxconverter_common."""
import onnx
from onnxconverter_common import float16
print(f" loading {fp32_path}")
model = onnx.load(str(fp32_path), load_external_data=True)
print(f" converting to fp16...")
model_fp16 = float16.convert_float_to_float16(
model, keep_io_types=False, disable_shape_infer=True,
)
print(f" saving → {fp16_path}")
onnx.save_model(
model_fp16, str(fp16_path),
save_as_external_data=True,
all_tensors_to_one_file=True,
location=fp16_path.name + ".data",
size_threshold=1024,
)
def quantize_int8_dynamic(fp32_path: Path, int8_path: Path) -> None:
"""Dynamic int8 quantization (weights → int8, activations on the fly).
Two shape-inference paths in the ORT quantizer choke on the dynamo
export of ModernBERT-large:
1. `SymbolicShapeInference._infer_Range` asserts on the dynamic limit
input emitted by RoPE (`assert len(x) == 1` in `as_scalar`).
2. `onnx.shape_inference.infer_shapes_path` (C++) raises a (1024)/(7)
dim mismatch on the category head Gemm; the dynamo decomposition
leaves a dimension hint the C++ inferencer disagrees with.
The skip flags on `quant_pre_process` are ignored (it always runs
`SymbolicShapeInference.infer_shapes`), and `ONNXQuantizer.__init__`
calls `save_and_reload_model_with_shape_infer` unconditionally. We
monkey-patch both to no-ops, then run `quantize_dynamic` restricted to
MatMul/Gemm ops (the only nodes we want quantized anyway).
"""
import onnx
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization import quant_utils
from onnxruntime.tools import symbolic_shape_infer as sym
# No-op the broken shape passes.
original_save_reload = quant_utils.save_and_reload_model_with_shape_infer
def _passthrough(model):
return model
quant_utils.save_and_reload_model_with_shape_infer = _passthrough
# Some imports cache the symbol — patch the onnx_quantizer module too.
import onnxruntime.quantization.onnx_quantizer as oq
oq.save_and_reload_model_with_shape_infer = _passthrough
try:
print(f" quantizing {fp32_path} → {int8_path}")
quantize_dynamic(
model_input=str(fp32_path),
model_output=str(int8_path),
weight_type=QuantType.QInt8,
per_channel=True,
reduce_range=False,
op_types_to_quantize=["MatMul", "Gemm"],
use_external_data_format=True,
extra_options={"DefaultTensorType": onnx.TensorProto.FLOAT},
)
finally:
quant_utils.save_and_reload_model_with_shape_infer = original_save_reload
oq.save_and_reload_model_with_shape_infer = original_save_reload
# ──────────────────────────────────────────────────────────────────────
# Inference + metrics
# ──────────────────────────────────────────────────────────────────────
def _files_size(model_path: Path) -> int:
"""Sum of model.onnx + any external .data files in the same dir."""
total = model_path.stat().st_size
for sib in model_path.parent.iterdir():
if sib.name.startswith(model_path.name) and sib != model_path:
total += sib.stat().st_size
return total
def run_onnx(model_path: Path, texts: list[str], use_cuda: bool = True) -> dict:
import onnxruntime as ort
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"../checkpoints/finetune/iter1-independent/final"
)
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = (
["CUDAExecutionProvider", "CPUExecutionProvider"] if use_cuda
else ["CPUExecutionProvider"]
)
free_before, total_vram = torch.cuda.mem_get_info()
sess = ort.InferenceSession(str(model_path), so, providers=providers)
free_after_load, _ = torch.cuda.mem_get_info()
load_vram_mb = (free_before - free_after_load) / (1024 ** 2)
# Warmup
warm_enc = tokenizer(
texts[:BATCH_SIZE], truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="np",
)
warm_inputs = {
"input_ids": warm_enc["input_ids"].astype(np.int64),
"attention_mask": warm_enc["attention_mask"].astype(np.int64),
}
for _ in range(WARMUP_BATCHES):
sess.run(None, warm_inputs)
free_after_warm, _ = torch.cuda.mem_get_info()
peak_vram_mb = (free_before - free_after_warm) / (1024 ** 2)
cat_logits_list = []
spec_logits_list = []
total_time = 0.0
for i in range(0, len(texts), BATCH_SIZE):
batch = texts[i : i + BATCH_SIZE]
enc = tokenizer(
batch, truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="np",
)
inputs = {
"input_ids": enc["input_ids"].astype(np.int64),
"attention_mask": enc["attention_mask"].astype(np.int64),
}
t0 = time.perf_counter()
out = sess.run(None, inputs)
total_time += time.perf_counter() - t0
cat_logits_list.append(torch.from_numpy(out[0].astype(np.float32)))
spec_logits_list.append(torch.from_numpy(out[1].astype(np.float32)))
free_end, _ = torch.cuda.mem_get_info()
peak_vram_mb = max(peak_vram_mb, (free_before - free_end) / (1024 ** 2))
del sess
gc.collect()
torch.cuda.empty_cache()
return {
"cat_logits": torch.cat(cat_logits_list),
"spec_logits": torch.cat(spec_logits_list),
"ms_per_sample": (total_time / len(texts)) * 1000,
"throughput": len(texts) / total_time,
"peak_vram_mb": peak_vram_mb,
"load_vram_mb": load_vram_mb,
"providers": providers,
}
# ──────────────────────────────────────────────────────────────────────
# Driver
# ──────────────────────────────────────────────────────────────────────
def main():
print("loading holdout...")
records = load_holdout_data(
str(PARAGRAPHS), str(HOLDOUT), {k: str(v) for k, v in BENCHMARKS.items()},
)
texts = [r["text"] for r in records]
print(f" {len(records)} paragraphs")
fp32_path = ONNX_DIR / "model_fp32.onnx"
fp16_path = ONNX_DIR / "model_fp16.onnx"
int8_path = ONNX_DIR / "model_int8_dyn.onnx"
# ── Export fp32 (source for both fp16 and int8 quant) ──
if not fp32_path.exists():
print("\n══ exporting fp32 ONNX")
export_fp32(fp32_path)
else:
print(f"\n══ reusing existing {fp32_path}")
# ── fp16 conversion ──
if not fp16_path.exists():
print("\n══ converting → fp16 ONNX")
convert_fp16(fp32_path, fp16_path)
else:
print(f"\n══ reusing existing {fp16_path}")
# ── int8 dynamic quantization ──
if not int8_path.exists():
print("\n══ quantizing → int8 dynamic ONNX")
quantize_int8_dynamic(fp32_path, int8_path)
else:
print(f"\n══ reusing existing {int8_path}")
summary = []
variants = [
("onnx-fp32", fp32_path),
("onnx-fp16", fp16_path),
("onnx-int8-dyn", int8_path),
]
for name, path in variants:
print(f"\n══ {name}: {path.name}")
size_mb = _files_size(path) / 1e6
print(f" on-disk size: {size_mb:.1f} MB")
try:
inf = run_onnx(path, texts, use_cuda=True)
print(
f" latency {inf['ms_per_sample']:.2f} ms/sample, "
f"throughput {inf['throughput']:.0f}/s, "
f"peak VRAM {inf['peak_vram_mb']:.0f} MB "
f"(load {inf['load_vram_mb']:.0f} MB)"
)
row = {
"variant": name,
"model_mb": size_mb,
"ms_per_sample": inf["ms_per_sample"],
"throughput_per_s": inf["throughput"],
"peak_vram_mb": inf["peak_vram_mb"],
"load_vram_mb": inf["load_vram_mb"],
}
for ref in BENCHMARKS:
m = evaluate_predictions(inf["cat_logits"], inf["spec_logits"], records, ref)
print(
f" vs {ref}: cat F1={m['cat_macro_f1']:.4f}, "
f"spec F1={m['spec_macro_f1']:.4f}, QWK={m['spec_qwk']:.4f}, "
f"cat ECE={m['cat_ece']:.4f}, spec ECE={m['spec_ece']:.4f}"
)
row[f"{ref}_cat_f1"] = m["cat_macro_f1"]
row[f"{ref}_spec_f1"] = m["spec_macro_f1"]
row[f"{ref}_cat_mcc"] = m["cat_mcc"]
row[f"{ref}_spec_qwk"] = m["spec_qwk"]
row[f"{ref}_spec_mae"] = m["spec_mae"]
row[f"{ref}_cat_ece"] = m["cat_ece"]
row[f"{ref}_spec_ece"] = m["spec_ece"]
summary.append(row)
except Exception as e:
import traceback
traceback.print_exc()
summary.append({"variant": name, "error": f"{type(e).__name__}: {e}"})
summary_path = OUTPUT_DIR / "summary.json"
with open(summary_path, "w") as f:
json.dump(summary, f, indent=2, default=str)
print(f"\nsummary → {summary_path}")
print("\n" + "=" * 110)
print(f"{'variant':<18} {'MB':>9} {'ms/samp':>9} {'throughput':>11} "
f"{'VRAM MB':>9} {'cat F1':>9} {'spec F1':>9} {'spec QWK':>9}")
print("-" * 110)
for r in summary:
if "error" in r:
print(f"{r['variant']:<18} ERROR: {r['error']}")
continue
print(
f"{r['variant']:<18} {r['model_mb']:>9.1f} {r['ms_per_sample']:>9.2f} "
f"{r['throughput_per_s']:>11.0f} {r['peak_vram_mb']:>9.0f} "
f"{r['GPT-5.4_cat_f1']:>9.4f} {r['GPT-5.4_spec_f1']:>9.4f} "
f"{r['GPT-5.4_spec_qwk']:>9.4f}"
)
print("=" * 110)
if __name__ == "__main__":
main()


@ -0,0 +1,491 @@
"""Quantization sweep for the iter1-independent ModernBERT-large checkpoint.
Loads the trained DualHeadModernBERT, applies a series of quantization
schemes to the *encoder* (heads kept in their native dtype), and re-runs
holdout evaluation against the GPT-5.4 / Opus-4.6 proxy gold.
For each variant we record:
- cat / spec macro F1, per-class F1, QWK, MAE, ECE
- peak VRAM (encoder forward)
- latency (ms/sample, batch=64) and throughput
- encoder parameter footprint in MB
- delta vs bf16 baseline
Variants:
fp32, bf16 (baseline), fp16,
torchao int8 weight-only,
torchao int8 dynamic-act + int8 weight,
torchao int4 weight-only (group=128),
bitsandbytes LLM.int8 (8-bit),
bitsandbytes nf4 (4-bit, with and without double-quant, bf16 compute),
bitsandbytes fp4 (4-bit, bf16 compute).
Heads (category linear, attention pooler, independent threshold MLPs)
stay in bf16: they sit on a 1024-dim representation and account for
< 0.3% of params, so quantizing them buys nothing and risks the threshold
margins which already drive most of the spec error budget.
Usage:
bun run py:quant # via package.json wrapper
# or directly:
cd python && uv run scripts/quantize_sweep.py
"""
from __future__ import annotations
import gc
import json
import sys
import time
import traceback
from dataclasses import dataclass, field
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from safetensors.torch import load_file
from transformers import AutoModel, AutoTokenizer
# Make `src` importable when run as a script
ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))
from src.finetune.data import CAT2ID, CATEGORIES, NUM_CATEGORIES, NUM_SPECIFICITY # noqa: E402
from src.finetune.eval import ( # noqa: E402
SPEC_LABELS,
compute_all_metrics,
load_holdout_data,
)
from src.finetune.model import DualHeadModernBERT, ordinal_predict # noqa: E402
REPO = ROOT.parent
CHECKPOINT = REPO / "checkpoints/finetune/iter1-independent/final"
PARAGRAPHS = REPO / "data/paragraphs/paragraphs-clean.patched.jsonl"
HOLDOUT = REPO / "data/gold/v2-holdout-ids.json"
BENCHMARKS = {
"GPT-5.4": REPO / "data/annotations/v2-bench/gpt-5.4.jsonl",
"Opus-4.6": REPO / "data/annotations/v2-bench/opus-4.6.jsonl",
}
OUTPUT_DIR = REPO / "results/eval/quant"
BATCH_SIZE = 64
MAX_SEQ = 512
WARMUP_BATCHES = 5
# ──────────────────────────────────────────────────────────────────────
# Model loading
# ──────────────────────────────────────────────────────────────────────
def _build_model(dtype: torch.dtype, attn_impl: str = "sdpa") -> tuple[DualHeadModernBERT, AutoTokenizer]:
"""Construct DualHeadModernBERT and load trained weights at the requested dtype."""
tokenizer = AutoTokenizer.from_pretrained(str(CHECKPOINT))
backbone = AutoModel.from_pretrained(
"answerdotai/ModernBERT-large",
trust_remote_code=True,
attn_implementation=attn_impl,
dtype=dtype,
)
model = DualHeadModernBERT(
backbone=backbone,
hidden_size=backbone.config.hidden_size,
num_categories=NUM_CATEGORIES,
num_specificity=NUM_SPECIFICITY,
specificity_head_type="independent",
spec_mlp_dim=256,
pooling="attention",
)
state = load_file(str(CHECKPOINT / "model.safetensors"))
model.load_state_dict(state, strict=False)
model = model.to(dtype)
model.eval()
return model, tokenizer
def _try_flash_attn() -> str:
try:
import flash_attn # noqa: F401
return "flash_attention_2"
except ImportError:
return "sdpa"
# ──────────────────────────────────────────────────────────────────────
# Quantization variants
# ──────────────────────────────────────────────────────────────────────
def variant_native(dtype: torch.dtype, attn: str | None = None):
def _build():
impl = attn or _try_flash_attn()
# bf16/fp16 supported by flash-attn; fp32 must use sdpa
if dtype == torch.float32:
impl = "sdpa"
model, tok = _build_model(dtype, attn_impl=impl)
return model.cuda(), tok
return _build
def variant_torchao(config_factory):
def _build():
from torchao.quantization import quantize_
# torchao expects bf16 master weights
model, tok = _build_model(torch.bfloat16, attn_impl=_try_flash_attn())
model = model.cuda()
# Quantize encoder linears only (skip heads + attention pooler)
quantize_(model.backbone, config_factory())
return model, tok
return _build
def _swap_bnb_linear(
module: nn.Module,
mode: str,
compute_dtype=torch.bfloat16,
compress_statistics: bool = True,
) -> int:
"""Recursively replace nn.Linear with bnb 8-bit / 4-bit equivalents.
Returns number of layers swapped. Copies weights from the original
module so the trained checkpoint is preserved.
"""
import bitsandbytes as bnb
swapped = 0
for name, child in list(module.named_children()):
if isinstance(child, nn.Linear):
in_f, out_f = child.in_features, child.out_features
has_bias = child.bias is not None
if mode == "int8":
new = bnb.nn.Linear8bitLt(
in_f, out_f, bias=has_bias,
has_fp16_weights=False, threshold=6.0,
)
new.weight = bnb.nn.Int8Params(
child.weight.data.clone(),
requires_grad=False,
has_fp16_weights=False,
)
if has_bias:
new.bias = nn.Parameter(child.bias.data.clone())
elif mode in ("nf4", "fp4"):
new = bnb.nn.Linear4bit(
in_f, out_f, bias=has_bias,
compute_dtype=compute_dtype,
quant_type=mode,
compress_statistics=compress_statistics,
quant_storage=torch.uint8,
device="cuda",
)
w = child.weight.data.detach().to(torch.float32).clone()
new.weight = bnb.nn.Params4bit(
w, requires_grad=False, quant_type=mode,
compress_statistics=compress_statistics, module=new,
).cuda()
if has_bias:
new.bias = nn.Parameter(
child.bias.data.detach().to(compute_dtype).clone().cuda()
)
else:
raise ValueError(mode)
new = new.cuda()
setattr(module, name, new)
swapped += 1
else:
swapped += _swap_bnb_linear(child, mode, compute_dtype, compress_statistics)
return swapped
def variant_bnb(mode: str, compress_statistics: bool = True):
def _build():
model, tok = _build_model(torch.bfloat16, attn_impl="sdpa")
model = model.cuda()
n = _swap_bnb_linear(
model.backbone, mode, compress_statistics=compress_statistics,
)
print(f" bnb {mode} (cs={compress_statistics}): swapped {n} linears")
return model, tok
return _build
# ──────────────────────────────────────────────────────────────────────
# Inference + measurement
# ──────────────────────────────────────────────────────────────────────
def _encoder_param_bytes(model: DualHeadModernBERT) -> int:
"""Sum bytes of every parameter / buffer inside the encoder backbone.
Handles bnb Int8Params (int8 storage) and Params4bit (uint8 packed)
correctly because element_size() reflects the storage dtype.
"""
total = 0
seen = set()
for p in list(model.backbone.parameters()) + list(model.backbone.buffers()):
if id(p) in seen:
continue
seen.add(id(p))
total += p.numel() * p.element_size()
return total
@torch.no_grad()
def run_inference(model, tokenizer, texts: list[str]) -> dict:
device = next(model.parameters()).device
cat_logits_list = []
spec_logits_list = []
# Warmup
warm_batch = tokenizer(
texts[: BATCH_SIZE], truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="pt",
).to(device)
for _ in range(WARMUP_BATCHES):
_ = model(input_ids=warm_batch["input_ids"], attention_mask=warm_batch["attention_mask"])
torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()
total_time = 0.0
for i in range(0, len(texts), BATCH_SIZE):
batch = texts[i : i + BATCH_SIZE]
enc = tokenizer(
batch, truncation=True, max_length=MAX_SEQ,
padding="longest", return_tensors="pt",
).to(device)
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
torch.cuda.synchronize()
total_time += time.perf_counter() - t0
cat_logits_list.append(out["category_logits"].float().cpu())
spec_logits_list.append(out["specificity_logits"].float().cpu())
peak_vram = torch.cuda.max_memory_allocated()
cat_logits = torch.cat(cat_logits_list)
spec_logits = torch.cat(spec_logits_list)
return {
"cat_logits": cat_logits,
"spec_logits": spec_logits,
"total_time_s": total_time,
"ms_per_sample": (total_time / len(texts)) * 1000,
"throughput": len(texts) / total_time,
"peak_vram_mb": peak_vram / (1024 ** 2),
"num_samples": len(texts),
}
def evaluate_predictions(
cat_logits: torch.Tensor,
spec_logits: torch.Tensor,
records: list[dict],
ref_name: str,
) -> dict:
cat_probs_all = F.softmax(cat_logits, dim=1).numpy()
cat_preds_all = cat_logits.argmax(dim=1).numpy()
spec_preds_all = ordinal_predict(spec_logits).numpy()
# ordinal → class probs
sp = torch.sigmoid(spec_logits)
K = sp.shape[1] + 1
spec_probs_all = torch.zeros(sp.shape[0], K)
spec_probs_all[:, 0] = 1 - sp[:, 0]
for k in range(1, K - 1):
spec_probs_all[:, k] = sp[:, k - 1] - sp[:, k]
spec_probs_all[:, -1] = sp[:, -1]
spec_probs_all = spec_probs_all.clamp(min=0)
spec_probs_all = spec_probs_all / spec_probs_all.sum(dim=1, keepdim=True)
spec_probs_all = spec_probs_all.numpy()
cat_labels, spec_labels = [], []
cat_p, spec_p, cat_pr, spec_pr = [], [], [], []
for i, rec in enumerate(records):
b = rec["benchmark_labels"].get(ref_name)
if b is None:
continue
cat_labels.append(CAT2ID[b["category"]])
spec_labels.append(b["specificity"] - 1)
cat_p.append(cat_preds_all[i])
spec_p.append(spec_preds_all[i])
cat_pr.append(cat_probs_all[i])
spec_pr.append(spec_probs_all[i])
cat_m = compute_all_metrics(
np.array(cat_p), np.array(cat_labels), np.array(cat_pr),
CATEGORIES, "cat", is_ordinal=False,
)
spec_m = compute_all_metrics(
np.array(spec_p), np.array(spec_labels), np.array(spec_pr),
SPEC_LABELS, "spec", is_ordinal=True,
)
return {**cat_m, **spec_m}
# ──────────────────────────────────────────────────────────────────────
# Variant registry
# ──────────────────────────────────────────────────────────────────────
@dataclass
class Variant:
name: str
description: str
builder: callable
skip_reason: str | None = None
def build_variants() -> list[Variant]:
from torchao.quantization import (
Int4WeightOnlyConfig,
Int8DynamicActivationInt8WeightConfig,
Int8WeightOnlyConfig,
)
return [
Variant("fp32", "Float32 encoder + heads", variant_native(torch.float32, attn="sdpa")),
Variant("bf16", "BFloat16 baseline (matches eval pipeline)", variant_native(torch.bfloat16)),
Variant("fp16", "Float16 encoder + heads", variant_native(torch.float16)),
Variant(
"torchao-int8-wo",
"torchao Int8 weight-only on encoder linears",
variant_torchao(lambda: Int8WeightOnlyConfig()),
),
Variant(
"torchao-int8-dyn",
"torchao Int8 dynamic activation + Int8 weight on encoder",
variant_torchao(lambda: Int8DynamicActivationInt8WeightConfig()),
),
Variant(
"torchao-int4-wo",
"torchao Int4 weight-only (group=128) on encoder",
variant_torchao(lambda: Int4WeightOnlyConfig(group_size=128)),
),
Variant("bnb-int8", "bitsandbytes LLM.int8 on encoder linears", variant_bnb("int8")),
Variant("bnb-nf4", "bitsandbytes NF4 4-bit (double-quant, bf16 compute)", variant_bnb("nf4", compress_statistics=True)),
Variant("bnb-nf4-nodq", "bitsandbytes NF4 4-bit (no double-quant)", variant_bnb("nf4", compress_statistics=False)),
Variant("bnb-fp4", "bitsandbytes FP4 4-bit (no double-quant)", variant_bnb("fp4", compress_statistics=False)),
]
# ──────────────────────────────────────────────────────────────────────
# Driver
# ──────────────────────────────────────────────────────────────────────
def free():
gc.collect()
torch.cuda.empty_cache()
torch.cuda.synchronize()
def main():
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Loading holdout from {HOLDOUT}")
records = load_holdout_data(
str(PARAGRAPHS), str(HOLDOUT), {k: str(v) for k, v in BENCHMARKS.items()},
)
texts = [r["text"] for r in records]
print(f" {len(records)} holdout paragraphs loaded")
variants = build_variants()
summary = []
for v in variants:
print(f"\n══ {v.name}{v.description}")
free()
try:
t0 = time.perf_counter()
model, tokenizer = v.builder()
build_s = time.perf_counter() - t0
enc_bytes = _encoder_param_bytes(model)
print(f" encoder footprint: {enc_bytes / 1e6:.1f} MB (build {build_s:.1f}s)")
inf = run_inference(model, tokenizer, texts)
print(
f" latency {inf['ms_per_sample']:.2f} ms/sample, "
f"throughput {inf['throughput']:.0f}/s, "
f"peak VRAM {inf['peak_vram_mb']:.0f} MB"
)
metrics_per_ref = {}
for ref in BENCHMARKS:
m = evaluate_predictions(inf["cat_logits"], inf["spec_logits"], records, ref)
metrics_per_ref[ref] = m
print(
f" vs {ref}: cat F1={m['cat_macro_f1']:.4f}, "
f"spec F1={m['spec_macro_f1']:.4f}, QWK={m['spec_qwk']:.4f}, "
f"cat ECE={m['cat_ece']:.4f}, spec ECE={m['spec_ece']:.4f}"
)
row = {
"variant": v.name,
"description": v.description,
"encoder_mb": enc_bytes / 1e6,
"ms_per_sample": inf["ms_per_sample"],
"throughput_per_s": inf["throughput"],
"peak_vram_mb": inf["peak_vram_mb"],
"build_s": build_s,
}
for ref, m in metrics_per_ref.items():
row[f"{ref}_cat_f1"] = m["cat_macro_f1"]
row[f"{ref}_spec_f1"] = m["spec_macro_f1"]
row[f"{ref}_cat_mcc"] = m["cat_mcc"]
row[f"{ref}_spec_qwk"] = m["spec_qwk"]
row[f"{ref}_spec_mae"] = m["spec_mae"]
row[f"{ref}_cat_ece"] = m["cat_ece"]
row[f"{ref}_spec_ece"] = m["spec_ece"]
# per-spec-level F1
for s in SPEC_LABELS:
short = s.replace(" ", "").replace(":", "")[:8]
row[f"{ref}_spec_f1_{short}"] = m.get(f"spec_f1_{short}", 0)
summary.append(row)
# Per-variant detailed metrics dump
vdir = OUTPUT_DIR / v.name
vdir.mkdir(parents=True, exist_ok=True)
with open(vdir / "metrics.json", "w") as f:
ser = {}
for ref, m in metrics_per_ref.items():
ser[ref] = {
k: (v_ if not isinstance(v_, np.ndarray) else v_.tolist())
for k, v_ in m.items()
                    if isinstance(v_, (int, float, str, list, bool, np.ndarray))
}
ser["_runtime"] = {
"encoder_mb": enc_bytes / 1e6,
"ms_per_sample": inf["ms_per_sample"],
"throughput_per_s": inf["throughput"],
"peak_vram_mb": inf["peak_vram_mb"],
"build_s": build_s,
}
json.dump(ser, f, indent=2, default=str)
del model, tokenizer, inf
except Exception as e:
print(f" FAILED: {type(e).__name__}: {e}")
traceback.print_exc()
summary.append({
"variant": v.name,
"description": v.description,
"error": f"{type(e).__name__}: {e}",
})
free()
# Write summary
summary_path = OUTPUT_DIR / "summary.json"
with open(summary_path, "w") as f:
json.dump(summary, f, indent=2, default=str)
print(f"\nSummary written to {summary_path}")
# Print compact table
print("\n" + "=" * 110)
print(f"{'variant':<18} {'enc MB':>9} {'ms/samp':>9} {'throughput':>11} "
f"{'VRAM MB':>9} {'cat F1':>9} {'spec F1':>9} {'spec QWK':>9}")
print("-" * 110)
for r in summary:
if "error" in r:
print(f"{r['variant']:<18} ERROR: {r['error']}")
continue
print(
f"{r['variant']:<18} {r['encoder_mb']:>9.1f} {r['ms_per_sample']:>9.2f} "
f"{r['throughput_per_s']:>11.0f} {r['peak_vram_mb']:>9.0f} "
f"{r['GPT-5.4_cat_f1']:>9.4f} {r['GPT-5.4_spec_f1']:>9.4f} {r['GPT-5.4_spec_qwk']:>9.4f}"
)
print("=" * 110)
if __name__ == "__main__":
main()

results/eval/onnx/REPORT.md Normal file

@ -0,0 +1,117 @@
# ONNX Export + Eval — iter1-independent ModernBERT-large
**Date:** 2026-04-07
**Checkpoint:** `checkpoints/finetune/iter1-independent/final/`
**Hardware:** RTX 3090 (sm_8.6, 24 GB), onnxruntime-gpu 1.24.4, onnx 1.21
**Driver:** `python/scripts/onnx_export_eval.py` (`bun run py:onnx`)
**Eval set:** 1,200-paragraph v2 holdout, proxy gold = GPT-5.4 + Opus-4.6
## TL;DR
ONNX export of this model is *technically* possible, but the path is full of
dead ends. The dynamo exporter produces a graph with 56 Memcpy nodes that
makes ORT 8× slower than native torch and 4× more VRAM-heavy; the legacy
TorchScript exporter produces a clean graph that's actually 22% faster than
torch fp32 (kernel fusion); fp16 conversion breaks on the rotary embedding;
dynamic int8 quantization via ORT silently falls back to CPU and drops
~0.5 macro F1. **Net: torchao int8-wo from the earlier sweep is still the
right int8 deployment path. ONNX is not.**
## What we tried
| variant | exporter | size MB | ms/sample | VRAM MB | cat F1 | spec F1 | result |
|--------------------|----------------------|--------:|----------:|--------:|-------:|--------:|-----------------|
| onnx-fp32 (dynamo) | torch.onnx (dynamo) | 1583 | 42.92 | 15388 | 0.9337 | 0.8943 | works but unusable |
| onnx-int8 (dynamo) | dynamo + ORT int8 | 1580 | 42.82 | 15398 | 0.9337 | 0.8943 | no-op (no quant) |
| **onnx-fp32 (legacy)** | torch.onnx (TorchScript) | 1583 | **12.70** | 8228 | 0.9337 | 0.8952 | **clean graph, faster than torch** |
| onnx-fp16 (legacy) | onnxconverter_common | 754 | err | err | err | err | rotary type unify |
| onnx-int8 (legacy) | ORT quantize_dynamic | 527 | 95.91 | ~CPU | 0.3972 | 0.3364 | CPU fallback + accuracy collapse |
(All entries above were re-run from scratch — fp32 timing improved 3× moving
from dynamo to legacy export.)
## Six things broke along the way (workarounds in the script)
1. **Dynamo exporter optimizer crashes.** `torch.onnx.export(..., dynamo=True)`
succeeds at translation but the post-translation `InlinePass` optimizer
trips on `onnx_ir`. Workaround: `optimize=False`.
2. **Dynamo-exported graph is unusable on CUDA EP.** ORT inserts 56 Memcpy
nodes between layers because dynamo emits scalar tensors with CPU
placement metadata. Result: 42.9 ms/sample (8× torch fp32) and 15.4 GB
VRAM (4.4× torch fp32). The legacy exporter only inserts 1 Memcpy.
3. **`op_types_to_quantize=['MatMul']` quantizes nothing on the dynamo
graph.** Dynamo emits encoder linears as `Gemm` nodes, not `MatMul`.
Fix: `op_types_to_quantize=['MatMul', 'Gemm']`.
4. **Both ORT shape-inference paths choke on ModernBERT.** Symbolic
inference asserts in `_infer_Range` (rotary embedding limit input is
not a scalar); the C++ inference raises a (1024)/(7) dim mismatch on
the category head Gemm. The `skip_*` flags on `quant_pre_process` are
ignored, and `ONNXQuantizer.__init__` calls
`save_and_reload_model_with_shape_infer` unconditionally. Workaround:
monkey-patch `quant_utils.save_and_reload_model_with_shape_infer`
*and* the cached binding in `onnx_quantizer` to a no-op, then pass
`extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}` so the
quantizer can still type the head MatMul.
5. **fp16 conversion via `onnxconverter_common` breaks on rotary
embeddings.** Two distinct failure modes seen across exports:
`Type Error: Type (tensor(float16)) of output arg (val_58) of node
(node_Expand_56) does not match expected type (tensor(float))` (dynamo
graph) and `Type parameter (T) of Optype (Mul) bound to different types
(tensor(float) and tensor(float16) in node
(/model/backbone/rotary_emb_1/Mul_2)` (legacy graph). The converter
leaves the `inv_freq` buffer in fp32 and the surrounding Mul/Expand
ops then can't unify their type parameter. Could be patched with an
`op_block_list` for the rotary subgraph, but the cost/value isn't
there given the dynamic int8 result below.
6. **Dynamic int8 via ORT silently falls back to CPU.** The quantizer
replaces Gemm/MatMul with `MatMulInteger` + `DynamicQuantizeLinear`,
neither of which has CUDA kernels in onnxruntime-gpu 1.24. Session
creation succeeds with CUDAExecutionProvider but routes the
quantized ops to the CPU EP — observable from the `load_vram_mb`
collapsing from 2074 MB (fp32) to 266 MB (int8) and latency exploding
to 95.9 ms/sample. Per-channel int8 weights also drop accuracy from
0.934 → 0.397 on category and 0.895 → 0.336 on spec, further
confirming the kernel path is wrong (not just slow).
## What actually works
**onnx-fp32 via the legacy TorchScript exporter** is the one clean win:
12.70 ms/sample vs 16.29 for torch fp32 — a **22% latency improvement
from ORT's LayerNorm/Gelu/MatMul fusion** at bit-identical accuracy. VRAM
is 8228 MB vs 3504 MB for torch fp32 (the ORT session allocates a separate
~5 GB workspace), so the speedup costs you ~2.3× memory. On a single
3090 batch=64 inference run that's a fair trade.
But this is fp32 — bf16 torch + flash-attn-2 is *still* the strict winner
at 5.52 ms / 1741 MB (Phase 10.8 result). ORT can't run bf16 natively, and
fp16 conversion is broken. So even the working ONNX path is dominated by
what we already ship.
## Recommendation
**Don't use ONNX for this model on this hardware.** The torchao int8-wo
result from the quantization sweep (5.52 → 6.08 ms, 1741 → 1416 MB peak
VRAM, F1 within ±0.001) covers the "smaller deployment" use case more
cleanly than anything ONNX can offer here, and bf16 + flash-attn-2
remains the production default.
ONNX *would* be worth revisiting in any of these scenarios:
- **CPU-only deployment** — fp32 ONNX runs fine on CPUExecutionProvider
and ORT's int8 dynamic path is actually designed for this case. Worth
benchmarking if a CPU serving target ever shows up.
- **Cross-runtime portability** — TensorRT, OpenVINO, mobile runtimes.
These would each need their own export validation pass.
- **Static int8 with calibration**`quantize_static` with a calibration
dataset can avoid the dynamic-quant CPU fallback path. Would need a
ModernBERT-friendly calibration loop and probably an `op_block_list`
to keep the rotary in fp32. Real engineering work, not a one-shot.
## Reproduce
```bash
bun run py:onnx
# writes to:
# results/eval/onnx/models/{model_fp32,model_fp16,model_int8_dyn}.onnx[.data]
# results/eval/onnx/summary.json
# results/eval/onnx/REPORT.md (this file)
```



@ -0,0 +1,50 @@
[
{
"variant": "onnx-fp32",
"model_mb": 1583.256294,
"ms_per_sample": 12.703279327494482,
"throughput_per_s": 78.71983085781946,
"peak_vram_mb": 8228.0,
"load_vram_mb": 2074.0,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8951731906425856,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9324447137231142,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.05386760701735813,
"GPT-5.4_spec_ece": 0.07004604930679002,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8833694419146193,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9227008860372746,
"Opus-4.6_spec_mae": 0.13583333333333333,
"Opus-4.6_cat_ece": 0.06540583113829297,
"Opus-4.6_spec_ece": 0.08156729981303217
},
{
"variant": "onnx-fp16",
"error": "Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /home/joey/Documents/sec-cyBERT/results/eval/onnx/models/model_fp16.onnx failed:Type Error: Type parameter (T) of Optype (Mul) bound to different types (tensor(float) and tensor(float16) in node (/model/backbone/rotary_emb_1/Mul_2)."
},
{
"variant": "onnx-int8-dyn",
"model_mb": 553.381903,
"ms_per_sample": 95.90791940659983,
"throughput_per_s": 10.42666764316426,
"peak_vram_mb": 7188.0,
"load_vram_mb": 266.0,
"GPT-5.4_cat_f1": 0.3971686880679718,
"GPT-5.4_spec_f1": 0.3364003775746365,
"GPT-5.4_cat_mcc": 0.3459776856134484,
"GPT-5.4_spec_qwk": 0.4412945592628398,
"GPT-5.4_spec_mae": 0.7225,
"GPT-5.4_cat_ece": 0.23434762333830195,
"GPT-5.4_spec_ece": 0.35458642202119034,
"Opus-4.6_cat_f1": 0.4150626036637055,
"Opus-4.6_spec_f1": 0.3204015536108683,
"Opus-4.6_cat_mcc": 0.3663561834842673,
"Opus-4.6_spec_qwk": 0.43979676755288855,
"Opus-4.6_spec_mae": 0.7375,
"Opus-4.6_cat_ece": 0.2126809566716353,
"Opus-4.6_spec_ece": 0.37541975535452365
}
]


@ -0,0 +1,163 @@
# Quantization Sweep — iter1-independent ModernBERT-large
**Date:** 2026-04-07
**Checkpoint:** `checkpoints/finetune/iter1-independent/final/`
**Hardware:** RTX 3090 (sm_8.6, 24 GB)
**Eval set:** 1,200-paragraph v2 holdout, proxy gold = GPT-5.4 + Opus-4.6
**Driver:** `python/scripts/quantize_sweep.py` (run via `bun run py:quant`)
## Setup
For each variant the *encoder* (ModernBERT-large backbone, 28 layers, 112
nn.Linear modules) is converted to the target precision/scheme, while the
attention pooler and the dual heads (category linear + 3 independent
threshold MLPs) are kept in bf16. Heads are <0.3% of params and sit on
already-distilled 1024-d representations — quantizing them buys nothing and
risks the threshold margins that drive most of the spec error budget.
For every variant we measure end-to-end inference on the full 1,200-paragraph
holdout at batch=64, max_seq=512, after 5 warmup batches:
- **encoder_mb** — sum of `param.numel() * param.element_size()` over the
encoder. **Caveat:** for torchao tensor subclasses (`AffineQuantizedTensor`)
this reports the *outer* dtype (bf16) rather than the int8 storage, so the
790 MB figure for the torchao rows is an over-estimate; real on-disk
storage is roughly half. The bnb 4-bit row (275 MB) is correct because
`Params4bit` reports `uint8` element_size.
- **ms/sample** — wall-clock per paragraph at batch=64
- **peak VRAM**`torch.cuda.max_memory_allocated()` over the timed run
(encoder fwd + activations)
- **F1 / QWK / ECE** — full eval pipeline reused from `src/finetune/eval.py`
## Results
| variant | enc MB | ms/samp | thru/s | VRAM MB | cat F1 (GPT) | spec F1 (GPT) | spec QWK | cat F1 (Opus) | spec F1 (Opus) | notes |
|--------------------|-------:|--------:|-------:|--------:|-------------:|--------------:|---------:|--------------:|---------------:|--------------------------------|
| fp32 | 1579 | 16.29 | 61 | 3504 | 0.9337 | 0.8943 | 0.9321 | 0.9227 | 0.8825 | sdpa (no flash-attn) |
| **bf16 (baseline)**| 790 | 5.52 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 | 0.9227 | 0.8834 | flash-attn-2 |
| fp16 | 790 | 5.54 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 | 0.9227 | 0.8834 | flash-attn-2 |
| **torchao int8-wo**| ~395* | 6.08 | 165 | 1416 | 0.9345 | 0.8941 | 0.9330 | 0.9235 | 0.8815 | weight-only int8 |
| torchao int8-dyn | ~395* | 9.67 | 103 | 1774 | 0.9336 | 0.8918 | 0.9315 | 0.9243 | 0.8827 | dyn act + int8 weight |
| torchao int4-wo | — | — | — | — | — | — | — | — | — | requires `mslk>=1.0.0` |
| bnb LLM.int8 | ~395* | 7.76 | 129 | 2135 | 0.9361 | 0.8986 | 0.9308 | 0.9235 | 0.8827 | mixed-precision outliers |
| bnb nf4 (DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 | 0.3576 | 0.2075 | **collapsed** |
| bnb nf4 (no DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 | 0.3576 | 0.2075 | **collapsed** |
| bnb fp4 (no DQ) | 275 | 5.87 | 170 | 1287 | 0.1629 | 0.2085 | 0.2326 | 0.1686 | 0.1978 | **collapsed harder** |
\*torchao subclass tensors report bf16 element_size; true storage ~395 MB.
Per-variant detail (per-class F1, MCC, AUC, confusion matrices, calibration
bins) is in `results/eval/quant/{variant}/metrics.json`. Aggregate row-level
data is in `results/eval/quant/summary.json`.
## Findings
### 1. bf16 is already the production sweet spot
Flash-attention-2 + bf16 gives **3.0× the throughput of fp32** (181 vs 61
samples/sec) at **half the VRAM** (1.7 vs 3.5 GB) with bit-identical
accuracy. This is what we already train and serve at; the sweep simply
confirms there's no headroom in fp16/fp32 for this hardware.
### 2. fp16 ≡ bf16 on Ampere
Identical latency, identical VRAM, identical F1. RTX 3090 has matched
bf16/fp16 throughput on tensor cores and the model has no overflow issues
in either format. Pick whichever the loader prefers.
### 3. torchao int8 weight-only is the only quantization variant worth shipping
- **VRAM -19%** (1741 → 1416 MB) — meaningful for batched serving
- **F1 essentially unchanged** (cat +0.0008, spec -0.0011 vs bf16 — both
  inside per-seed noise)
- **Latency +10%** (5.52 → 6.08 ms/sample) — the int8 weight is dequantized
to bf16 on the fly because RTX 3090 (sm_8.6) lacks the int8 tensor-core
matmul kernel paths torchao would otherwise use; on H100/A100/Ada this
same config would also be faster
The accuracy delta is statistically nothing — well within the ±0.002 std we
observed across the 3-seed ensemble. **This is the variant we'd ship as the
"low-VRAM" deployment option.**
### 4. torchao int8 dynamic activation: don't bother on this hardware
-43% throughput (5.52 → 9.67 ms/sample) and *more* peak VRAM than bf16
(1774 vs 1741 MB) because the per-batch activation quantization adds work
without unlocking int8 tensor cores. Pure regression on Ampere.
### 5. bnb LLM.int8: slowest int8, no accuracy upside
- **+41% latency** (5.52 → 7.76 ms/sample) due to mixed-precision outlier
handling
- **+23% VRAM** (1741 → 2135 MB) — outlier columns are kept in fp16 plus
scratch buffers
- **F1 +0.0024 cat, +0.0034 spec** — within noise; not a real win
bnb LLM.int8 was designed for LLM-scale models where outlier features
dominate quant error; for an encoder of this size on a single 3090 it
just trades performance for nothing.
### 6. All 4-bit variants collapse — ModernBERT-large is too quant-sensitive
Both nf4 (with and without double-quantization) and fp4 produce essentially
random predictions:
| variant | cat F1 | spec F1 | spec ECE |
|---------|-------:|--------:|---------:|
| nf4 | 0.354 | 0.221 | 0.434 |
| fp4 | 0.163 | 0.209 | 0.443 |
Per-layer dequantization is faithful — we verified that the dequantized
weight of one MLP Wi layer differs from the original by mean 0.005 / max
0.11 (sub-1% error). But the relative output drift on a single Linear is
already ~98% (mean), and that error compounds across 28 transformer blocks
+ GLU FFN paths until the [CLS]/pooled representation no longer carries
the discriminative signal. The category head essentially collapses to a
near-uniform prior (cat ECE 0.10 vs the 0.054 baseline) and the threshold
heads collapse onto L1 because all three thresholds emit similar logits.
The fact that **DQ vs no-DQ are bit-identical** at this scale tells us the
nf4 weight indices are stable under absmax requantization (only ~5% of the
weight bytes change, all in the metadata block) — the catastrophe is
inherent to 4-bit weight precision on this architecture, not to a
quantization-config knob.
This is a real noteworthy null for the paper: **naive post-training 4-bit
weight quantization is not viable for ModernBERT-large on this task**.
Recovering 4-bit would require either (a) QAT, (b) per-channel calibration
with a held-out activation distribution (GPTQ / AWQ-style), or (c) keeping
the GLU FFN in 8-bit and only 4-bit'ing attention projections. None of
these are reachable inside the remaining capstone time budget.
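The compounding claim can be illustrated without bitsandbytes at all. A toy experiment with a crude uniform absmax 4-bit fake-quant (coarser than nf4, so its per-weight error lands higher than the sub-1% figure above; the 28-layer tanh stack is a hypothetical stand-in, not ModernBERT):

```python
import torch

torch.manual_seed(0)

def fake_quant_4bit(w: torch.Tensor, block: int = 64) -> torch.Tensor:
    """Uniform absmax 4-bit round-trip per block of 64 weights."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 7.0
    return (torch.round(flat / scale).clamp(-8, 7) * scale).reshape(w.shape)

dim, depth = 512, 28
weights = [torch.randn(dim, dim) / dim**0.5 for _ in range(depth)]
x = torch.randn(8, dim)

h_ref, h_q = x, x
for w in weights:
    h_ref = torch.tanh(h_ref @ w.T)
    h_q = torch.tanh(h_q @ fake_quant_4bit(w).T)

# Per-weight error is small, but each layer injects fresh noise and the
# drift of the final representation ends up far larger.
per_weight = ((fake_quant_4bit(weights[0]) - weights[0]).abs().mean()
              / weights[0].abs().mean()).item()
drift = ((h_q - h_ref).norm() / h_ref.norm()).item()
print(f"per-weight rel err {per_weight:.3f}; output drift after {depth} layers {drift:.3f}")
```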
### 7. torchao int4-wo: dependency hole
torchao 0.17 requires `mslk >= 1.0.0` for the new `Int4Tensor.from_hp` path.
Not installed in the lockfile and not worth chasing given the bnb 4-bit
collapse — even if the kernel ran cleanly we'd expect the same compounding
error pattern.
## Recommendations
| Use case | Variant | Why |
|-----------------------------------|--------------------|-------------------------------------------------------------|
| **Production / paper headline** | bf16 | Best of every dimension on this hardware |
| **Low-VRAM batch serving**        | torchao int8-wo    | -19% VRAM, accuracy intact, only +10% latency penalty        |
| **Multi-GPU sharded serving** | bf16 | int8-wo's dequant overhead grows with replica count |
| **Embedded / 4-bit** | not viable | Needs QAT or AWQ-style calibration; future work |
## Paper-worthy notes
1. **Quantization story** — bf16 is already the sweet spot; torchao int8-wo
   cuts VRAM 19% with no accuracy cost; 4-bit fails. This adds another row
   to the speed/cost table.
2. **Architecture-specific quant fragility** — ModernBERT-large's GLU FFN
amplifies per-layer weight error across 28 blocks. This is a noteworthy
counterpoint to the 4-bit-by-default LLM serving narrative and worth
one paragraph in the discussion section alongside the DAPT and
CORAL null results.
3. **Hardware caveat** — int8 latency results would invert on
Hopper/Ada/A100; the 3090 just doesn't have the matmul path. State the
sm_8.6 caveat in the table caption.
## Reproduce
```bash
# from repo root
bun run py:quant
# writes to results/eval/quant/{summary.json, REPORT.md, <variant>/metrics.json}
```
Run time: ~5 minutes total (most spent in fp32 + torchao build steps).


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.991991833154947,
"cat_ece": 0.053848127176364245,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8951731906425856,
"spec_weighted_f1": 0.9121524819510628,
"spec_macro_precision": 0.8980417155129858,
"spec_macro_recall": 0.8930560580782194,
"spec_mcc": 0.866381831963237,
"spec_auc": 0.981666223606385,
"spec_ece": 0.07135417198141418,
"spec_confusion_matrix": [
[
580,
23,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9354838709677419,
"spec_prec_L1Generi": 0.932475884244373,
"spec_recall_L1Generi": 0.9385113268608414,
"spec_f1_L2Domain": 0.7975460122699386,
"spec_prec_L2Domain": 0.8227848101265823,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9324447137231142,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.917725722448833
},
"Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.9939660707189948,
"cat_ece": 0.06551479384303091,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8833694419146193,
"spec_weighted_f1": 0.9004034318676798,
"spec_macro_precision": 0.8858989636247611,
"spec_macro_recall": 0.8854684685880032,
"spec_mcc": 0.8500778641433316,
"spec_auc": 0.9736633898988131,
"spec_ece": 0.08248284702499709,
"spec_confusion_matrix": [
[
567,
30,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9242053789731052,
"spec_prec_L1Generi": 0.9115755627009646,
"spec_recall_L1Generi": 0.9371900826446281,
"spec_f1_L2Domain": 0.7788778877887789,
"spec_prec_L2Domain": 0.7468354430379747,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9227008860372746,
"spec_mae": 0.13583333333333333,
"spec_kripp_alpha": 0.9065248741550552
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 5.516677870764397,
"throughput_per_s": 181.26851402716375,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.48778308398323134
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.16293893512410998,
"cat_weighted_f1": 0.1746727986514593,
"cat_macro_precision": 0.6289222195093943,
"cat_macro_recall": 0.23220413662370398,
"cat_mcc": 0.22345796853389935,
"cat_auc": 0.8960306312891495,
"cat_ece": 0.2080524676044782,
"cat_confusion_matrix": [
[
8,
0,
0,
0,
221,
0,
1
],
[
0,
0,
0,
1,
82,
0,
5
],
[
0,
0,
4,
0,
145,
0,
1
],
[
0,
0,
2,
3,
128,
0,
3
],
[
0,
0,
0,
0,
195,
0,
3
],
[
0,
0,
0,
0,
208,
2,
11
],
[
0,
0,
0,
0,
80,
0,
97
]
],
"cat_f1_BoardGov": 0.06722689075630252,
"cat_prec_BoardGov": 1.0,
"cat_recall_BoardGov": 0.034782608695652174,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.05128205128205128,
"cat_prec_Manageme": 0.6666666666666666,
"cat_recall_Manageme": 0.02666666666666667,
"cat_f1_NoneOthe": 0.04285714285714286,
"cat_prec_NoneOthe": 0.75,
"cat_recall_NoneOthe": 0.022058823529411766,
"cat_f1_RiskMana": 0.31026252983293556,
"cat_prec_RiskMana": 0.18413597733711048,
"cat_recall_RiskMana": 0.9848484848484849,
"cat_f1_Strategy": 0.017937219730941704,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.00904977375565611,
"cat_f1_Third-Pa": 0.6510067114093959,
"cat_prec_Third-Pa": 0.8016528925619835,
"cat_recall_Third-Pa": 0.5480225988700564,
"cat_kripp_alpha": -0.08693512028952255,
"spec_macro_f1": 0.20854117827130608,
"spec_weighted_f1": 0.2571301750438355,
"spec_macro_precision": 0.3741612607031285,
"spec_macro_recall": 0.33018440069147115,
"spec_mcc": 0.1895317453505129,
"spec_auc": 0.8110497500610155,
"spec_ece": 0.44289420386155437,
"spec_confusion_matrix": [
[
136,
473,
9,
0
],
[
4,
163,
1,
0
],
[
1,
179,
27,
0
],
[
2,
171,
34,
0
]
],
"spec_f1_L1Generi": 0.35742444152431013,
"spec_prec_L1Generi": 0.951048951048951,
"spec_recall_L1Generi": 0.22006472491909385,
"spec_f1_L2Domain": 0.2824956672443674,
"spec_prec_L2Domain": 0.16531440162271804,
"spec_recall_L2Domain": 0.9702380952380952,
"spec_f1_L3Firm-S": 0.19424460431654678,
"spec_prec_L3Firm-S": 0.38028169014084506,
"spec_recall_L3Firm-S": 0.13043478260869565,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.2326064604575444,
"spec_mae": 0.8825,
"spec_kripp_alpha": 0.26499611744119067
},
"Opus-4.6": {
"cat_macro_f1": 0.16861118726256397,
"cat_weighted_f1": 0.1792365613004711,
"cat_macro_precision": 0.6306758954840335,
"cat_macro_recall": 0.2357303291121537,
"cat_mcc": 0.2251562222131823,
"cat_auc": 0.8995073249291591,
"cat_ece": 0.19888580093781152,
"cat_confusion_matrix": [
[
8,
0,
0,
0,
205,
0,
1
],
[
0,
0,
0,
1,
73,
0,
5
],
[
0,
0,
4,
0,
154,
0,
0
],
[
0,
0,
2,
3,
133,
0,
3
],
[
0,
0,
0,
0,
208,
0,
5
],
[
0,
0,
0,
0,
216,
2,
10
],
[
0,
0,
0,
0,
70,
0,
97
]
],
"cat_f1_BoardGov": 0.07207207207207207,
"cat_prec_BoardGov": 1.0,
"cat_recall_BoardGov": 0.037383177570093455,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.04878048780487805,
"cat_prec_Manageme": 0.6666666666666666,
"cat_recall_Manageme": 0.02531645569620253,
"cat_f1_NoneOthe": 0.041379310344827586,
"cat_prec_NoneOthe": 0.75,
"cat_recall_NoneOthe": 0.02127659574468085,
"cat_f1_RiskMana": 0.3270440251572327,
"cat_prec_RiskMana": 0.1964117091595845,
"cat_recall_RiskMana": 0.9765258215962441,
"cat_f1_Strategy": 0.017391304347826087,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.008771929824561403,
"cat_f1_Third-Pa": 0.6736111111111112,
"cat_prec_Third-Pa": 0.8016528925619835,
"cat_recall_Third-Pa": 0.5808383233532934,
"cat_kripp_alpha": -0.07941064783948448,
"spec_macro_f1": 0.19783939283519508,
"spec_weighted_f1": 0.24886714543281097,
"spec_macro_precision": 0.37592821714182745,
"spec_macro_recall": 0.3291807330600434,
"spec_mcc": 0.18219176358380398,
"spec_auc": 0.790090253498083,
"spec_ece": 0.45814307530721027,
"spec_confusion_matrix": [
[
132,
466,
7,
0
],
[
1,
142,
2,
0
],
[
8,
221,
31,
0
],
[
2,
157,
31,
0
]
],
"spec_f1_L1Generi": 0.35294117647058826,
"spec_prec_L1Generi": 0.9230769230769231,
"spec_recall_L1Generi": 0.21818181818181817,
"spec_f1_L2Domain": 0.251105216622458,
"spec_prec_L2Domain": 0.1440162271805274,
"spec_recall_L2Domain": 0.9793103448275862,
"spec_f1_L3Firm-S": 0.18731117824773413,
"spec_prec_L3Firm-S": 0.43661971830985913,
"spec_recall_L3Firm-S": 0.11923076923076924,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.22580295138888895,
"spec_mae": 0.8925,
"spec_kripp_alpha": 0.2579634594689497
},
"_runtime": {
"encoder_mb": 274.843904,
"ms_per_sample": 5.865302347471394,
"throughput_per_s": 170.49419463109393,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4887635139748454
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9360988760303737,
"cat_weighted_f1": 0.9367630863906107,
"cat_macro_precision": 0.934342558672944,
"cat_macro_recall": 0.9404157843351134,
"cat_mcc": 0.9256911778959798,
"cat_auc": 0.9918112947607864,
"cat_ece": 0.052939765204985965,
"cat_confusion_matrix": [
[
226,
0,
2,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
145,
1,
2,
0,
0
],
[
0,
0,
3,
132,
0,
1,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9741379310344828,
"cat_prec_BoardGov": 0.9658119658119658,
"cat_recall_BoardGov": 0.9826086956521739,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9477124183006536,
"cat_prec_Manageme": 0.9294871794871795,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.8918918918918919,
"cat_prec_NoneOthe": 0.825,
"cat_recall_NoneOthe": 0.9705882352941176,
"cat_f1_RiskMana": 0.8586387434554974,
"cat_prec_RiskMana": 0.8913043478260869,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9605568445475638,
"cat_prec_Strategy": 0.9857142857142858,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9253092213149172,
"spec_macro_f1": 0.8986323186392307,
"spec_weighted_f1": 0.9144644120807768,
"spec_macro_precision": 0.9034925881673722,
"spec_macro_recall": 0.8950728490354916,
"spec_mcc": 0.870090391628814,
"spec_auc": 0.98134918835569,
"spec_ece": 0.06740866973996164,
"spec_confusion_matrix": [
[
582,
19,
14,
3
],
[
29,
130,
7,
2
],
[
12,
3,
190,
2
],
[
2,
1,
8,
196
]
],
"spec_f1_L1Generi": 0.9364440868865648,
"spec_prec_L1Generi": 0.9312,
"spec_recall_L1Generi": 0.941747572815534,
"spec_f1_L2Domain": 0.8099688473520249,
"spec_prec_L2Domain": 0.8496732026143791,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.892018779342723,
"spec_prec_L3Firm-S": 0.867579908675799,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9560975609756097,
"spec_prec_L4Quanti": 0.9655172413793104,
"spec_recall_L4Quanti": 0.9468599033816425,
"spec_qwk": 0.9307948020550015,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.9166492249745117
},
"Opus-4.6": {
"cat_macro_f1": 0.9235105849558979,
"cat_weighted_f1": 0.9224780370334836,
"cat_macro_precision": 0.9187130112710481,
"cat_macro_recall": 0.9326192612354074,
"cat_mcc": 0.9103198007176273,
"cat_auc": 0.9937246318315877,
"cat_ece": 0.06465620135267579,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
9,
0,
144,
1,
3,
0,
1
],
[
0,
0,
1,
139,
1,
0,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9419642857142857,
"cat_prec_BoardGov": 0.9017094017094017,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.9235880398671097,
"cat_prec_NoneOthe": 0.86875,
"cat_recall_NoneOthe": 0.9858156028368794,
"cat_f1_RiskMana": 0.8513853904282116,
"cat_prec_RiskMana": 0.9184782608695652,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9497716894977168,
"cat_prec_Strategy": 0.9904761904761905,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9095619506866199,
"spec_macro_f1": 0.8826923642825633,
"spec_weighted_f1": 0.8991699562480843,
"spec_macro_precision": 0.8862949086294886,
"spec_macro_recall": 0.8831960153359262,
"spec_mcc": 0.8485449936701916,
"spec_auc": 0.9725823165743999,
"spec_ece": 0.083350846717755,
"spec_confusion_matrix": [
[
568,
27,
9,
1
],
[
23,
117,
3,
2
],
[
34,
9,
206,
11
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9235772357723577,
"spec_prec_L1Generi": 0.9088,
"spec_recall_L1Generi": 0.9388429752066115,
"spec_f1_L2Domain": 0.785234899328859,
"spec_prec_L2Domain": 0.7647058823529411,
"spec_recall_L2Domain": 0.8068965517241379,
"spec_f1_L3Firm-S": 0.860125260960334,
"spec_prec_L3Firm-S": 0.9406392694063926,
"spec_recall_L3Firm-S": 0.7923076923076923,
"spec_f1_L4Quanti": 0.9618320610687023,
"spec_prec_L4Quanti": 0.9310344827586207,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9198415117342273,
"spec_mae": 0.13916666666666666,
"spec_kripp_alpha": 0.9038906079654127
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 7.762363941583317,
"throughput_per_s": 128.82673468103667,
"peak_vram_mb": 2135.203125,
"build_s": 1.1878160500200465
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.3536909012886116,
"cat_weighted_f1": 0.4058815979606338,
"cat_macro_precision": 0.6317997184487815,
"cat_macro_recall": 0.38979766446605063,
"cat_mcc": 0.42471542150657926,
"cat_auc": 0.9205800077405307,
"cat_ece": 0.09734637491405013,
"cat_confusion_matrix": [
[
143,
0,
0,
0,
85,
0,
2
],
[
0,
0,
0,
4,
73,
0,
11
],
[
3,
0,
20,
1,
124,
0,
2
],
[
1,
0,
1,
5,
122,
0,
7
],
[
0,
0,
0,
0,
185,
0,
13
],
[
0,
0,
0,
0,
180,
28,
13
],
[
0,
0,
0,
0,
22,
0,
155
]
],
"cat_f1_BoardGov": 0.7586206896551724,
"cat_prec_BoardGov": 0.9727891156462585,
"cat_recall_BoardGov": 0.6217391304347826,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.23391812865497075,
"cat_prec_Manageme": 0.9523809523809523,
"cat_recall_Manageme": 0.13333333333333333,
"cat_f1_NoneOthe": 0.0684931506849315,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03676470588235294,
"cat_f1_RiskMana": 0.3741152679474216,
"cat_prec_RiskMana": 0.23388116308470291,
"cat_recall_RiskMana": 0.9343434343434344,
"cat_f1_Strategy": 0.2248995983935743,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12669683257918551,
"cat_f1_Third-Pa": 0.8157894736842105,
"cat_prec_Third-Pa": 0.7635467980295566,
"cat_recall_Third-Pa": 0.8757062146892656,
"cat_kripp_alpha": 0.27180867501339423,
"spec_macro_f1": 0.22049451330952025,
"spec_weighted_f1": 0.26278390857815354,
"spec_macro_precision": 0.4075440073341987,
"spec_macro_recall": 0.34148466970860386,
"spec_mcc": 0.20939315966102864,
"spec_auc": 0.8490039116946011,
"spec_ece": 0.43363295723994577,
"spec_confusion_matrix": [
[
132,
483,
3,
0
],
[
2,
166,
0,
0
],
[
2,
171,
34,
0
],
[
0,
175,
32,
0
]
],
"spec_f1_L1Generi": 0.35013262599469497,
"spec_prec_L1Generi": 0.9705882352941176,
"spec_recall_L1Generi": 0.21359223300970873,
"spec_f1_L2Domain": 0.28546861564918313,
"spec_prec_L2Domain": 0.16683417085427135,
"spec_recall_L2Domain": 0.9880952380952381,
"spec_f1_L3Firm-S": 0.2463768115942029,
"spec_prec_L3Firm-S": 0.4927536231884058,
"spec_recall_L3Firm-S": 0.1642512077294686,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24233251808742773,
"spec_mae": 0.8733333333333333,
"spec_kripp_alpha": 0.2761091078775676
},
"Opus-4.6": {
"cat_macro_f1": 0.35763512449392704,
"cat_weighted_f1": 0.40173099854659305,
"cat_macro_precision": 0.6354693148020794,
"cat_macro_recall": 0.39500680662311666,
"cat_mcc": 0.42166882753874363,
"cat_auc": 0.9209441610065957,
"cat_ece": 0.09567970824738346,
"cat_confusion_matrix": [
[
141,
0,
0,
0,
71,
0,
2
],
[
0,
0,
0,
4,
65,
0,
10
],
[
5,
0,
21,
1,
131,
0,
0
],
[
1,
0,
0,
5,
128,
0,
7
],
[
0,
0,
0,
0,
194,
0,
19
],
[
0,
0,
0,
0,
186,
28,
14
],
[
0,
0,
0,
0,
16,
0,
151
]
],
"cat_f1_BoardGov": 0.7811634349030471,
"cat_prec_BoardGov": 0.9591836734693877,
"cat_recall_BoardGov": 0.6588785046728972,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.2346368715083799,
"cat_prec_Manageme": 1.0,
"cat_recall_Manageme": 0.13291139240506328,
"cat_f1_NoneOthe": 0.06622516556291391,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03546099290780142,
"cat_f1_RiskMana": 0.38645418326693226,
"cat_prec_RiskMana": 0.24525916561314792,
"cat_recall_RiskMana": 0.9107981220657277,
"cat_f1_Strategy": 0.21875,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12280701754385964,
"cat_f1_Third-Pa": 0.8162162162162162,
"cat_prec_Third-Pa": 0.7438423645320197,
"cat_recall_Third-Pa": 0.9041916167664671,
"cat_kripp_alpha": 0.27338793761748126,
"spec_macro_f1": 0.20754679251319788,
"spec_weighted_f1": 0.25637242485646744,
"spec_macro_precision": 0.40946072005380696,
"spec_macro_recall": 0.33929593134138586,
"spec_mcc": 0.2041103760829744,
"spec_auc": 0.8271022317290393,
"spec_ece": 0.4489923599362374,
"spec_confusion_matrix": [
[
130,
473,
2,
0
],
[
0,
145,
0,
0
],
[
6,
217,
37,
0
],
[
0,
160,
30,
0
]
],
"spec_f1_L1Generi": 0.3508771929824561,
"spec_prec_L1Generi": 0.9558823529411765,
"spec_recall_L1Generi": 0.21487603305785125,
"spec_f1_L2Domain": 0.2543859649122807,
"spec_prec_L2Domain": 0.1457286432160804,
"spec_recall_L2Domain": 1.0,
"spec_f1_L3Firm-S": 0.22492401215805471,
"spec_prec_L3Firm-S": 0.5362318840579711,
"spec_recall_L3Firm-S": 0.1423076923076923,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24096533359991634,
"spec_mae": 0.88,
"spec_kripp_alpha": 0.2758412395136435
},
"_runtime": {
"encoder_mb": 274.843904,
"ms_per_sample": 5.861402786540566,
"throughput_per_s": 170.607623536175,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4908116469741799
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.3536909012886116,
"cat_weighted_f1": 0.4058815979606338,
"cat_macro_precision": 0.6317997184487815,
"cat_macro_recall": 0.38979766446605063,
"cat_mcc": 0.42471542150657926,
"cat_auc": 0.9205800077405307,
"cat_ece": 0.09734637491405013,
"cat_confusion_matrix": [
[
143,
0,
0,
0,
85,
0,
2
],
[
0,
0,
0,
4,
73,
0,
11
],
[
3,
0,
20,
1,
124,
0,
2
],
[
1,
0,
1,
5,
122,
0,
7
],
[
0,
0,
0,
0,
185,
0,
13
],
[
0,
0,
0,
0,
180,
28,
13
],
[
0,
0,
0,
0,
22,
0,
155
]
],
"cat_f1_BoardGov": 0.7586206896551724,
"cat_prec_BoardGov": 0.9727891156462585,
"cat_recall_BoardGov": 0.6217391304347826,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.23391812865497075,
"cat_prec_Manageme": 0.9523809523809523,
"cat_recall_Manageme": 0.13333333333333333,
"cat_f1_NoneOthe": 0.0684931506849315,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03676470588235294,
"cat_f1_RiskMana": 0.3741152679474216,
"cat_prec_RiskMana": 0.23388116308470291,
"cat_recall_RiskMana": 0.9343434343434344,
"cat_f1_Strategy": 0.2248995983935743,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12669683257918551,
"cat_f1_Third-Pa": 0.8157894736842105,
"cat_prec_Third-Pa": 0.7635467980295566,
"cat_recall_Third-Pa": 0.8757062146892656,
"cat_kripp_alpha": 0.27180867501339423,
"spec_macro_f1": 0.22049451330952025,
"spec_weighted_f1": 0.26278390857815354,
"spec_macro_precision": 0.4075440073341987,
"spec_macro_recall": 0.34148466970860386,
"spec_mcc": 0.20939315966102864,
"spec_auc": 0.8490039116946011,
"spec_ece": 0.43363295723994577,
"spec_confusion_matrix": [
[
132,
483,
3,
0
],
[
2,
166,
0,
0
],
[
2,
171,
34,
0
],
[
0,
175,
32,
0
]
],
"spec_f1_L1Generi": 0.35013262599469497,
"spec_prec_L1Generi": 0.9705882352941176,
"spec_recall_L1Generi": 0.21359223300970873,
"spec_f1_L2Domain": 0.28546861564918313,
"spec_prec_L2Domain": 0.16683417085427135,
"spec_recall_L2Domain": 0.9880952380952381,
"spec_f1_L3Firm-S": 0.2463768115942029,
"spec_prec_L3Firm-S": 0.4927536231884058,
"spec_recall_L3Firm-S": 0.1642512077294686,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24233251808742773,
"spec_mae": 0.8733333333333333,
"spec_kripp_alpha": 0.2761091078775676
},
"Opus-4.6": {
"cat_macro_f1": 0.35763512449392704,
"cat_weighted_f1": 0.40173099854659305,
"cat_macro_precision": 0.6354693148020794,
"cat_macro_recall": 0.39500680662311666,
"cat_mcc": 0.42166882753874363,
"cat_auc": 0.9209441610065957,
"cat_ece": 0.09567970824738346,
"cat_confusion_matrix": [
[
141,
0,
0,
0,
71,
0,
2
],
[
0,
0,
0,
4,
65,
0,
10
],
[
5,
0,
21,
1,
131,
0,
0
],
[
1,
0,
0,
5,
128,
0,
7
],
[
0,
0,
0,
0,
194,
0,
19
],
[
0,
0,
0,
0,
186,
28,
14
],
[
0,
0,
0,
0,
16,
0,
151
]
],
"cat_f1_BoardGov": 0.7811634349030471,
"cat_prec_BoardGov": 0.9591836734693877,
"cat_recall_BoardGov": 0.6588785046728972,
"cat_f1_Incident": 0.0,
"cat_prec_Incident": 0.0,
"cat_recall_Incident": 0.0,
"cat_f1_Manageme": 0.2346368715083799,
"cat_prec_Manageme": 1.0,
"cat_recall_Manageme": 0.13291139240506328,
"cat_f1_NoneOthe": 0.06622516556291391,
"cat_prec_NoneOthe": 0.5,
"cat_recall_NoneOthe": 0.03546099290780142,
"cat_f1_RiskMana": 0.38645418326693226,
"cat_prec_RiskMana": 0.24525916561314792,
"cat_recall_RiskMana": 0.9107981220657277,
"cat_f1_Strategy": 0.21875,
"cat_prec_Strategy": 1.0,
"cat_recall_Strategy": 0.12280701754385964,
"cat_f1_Third-Pa": 0.8162162162162162,
"cat_prec_Third-Pa": 0.7438423645320197,
"cat_recall_Third-Pa": 0.9041916167664671,
"cat_kripp_alpha": 0.27338793761748126,
"spec_macro_f1": 0.20754679251319788,
"spec_weighted_f1": 0.25637242485646744,
"spec_macro_precision": 0.40946072005380696,
"spec_macro_recall": 0.33929593134138586,
"spec_mcc": 0.2041103760829744,
"spec_auc": 0.8271022317290393,
"spec_ece": 0.4489923599362374,
"spec_confusion_matrix": [
[
130,
473,
2,
0
],
[
0,
145,
0,
0
],
[
6,
217,
37,
0
],
[
0,
160,
30,
0
]
],
"spec_f1_L1Generi": 0.3508771929824561,
"spec_prec_L1Generi": 0.9558823529411765,
"spec_recall_L1Generi": 0.21487603305785125,
"spec_f1_L2Domain": 0.2543859649122807,
"spec_prec_L2Domain": 0.1457286432160804,
"spec_recall_L2Domain": 1.0,
"spec_f1_L3Firm-S": 0.22492401215805471,
"spec_prec_L3Firm-S": 0.5362318840579711,
"spec_recall_L3Firm-S": 0.1423076923076923,
"spec_f1_L4Quanti": 0.0,
"spec_prec_L4Quanti": 0.0,
"spec_recall_L4Quanti": 0.0,
"spec_qwk": 0.24096533359991634,
"spec_mae": 0.88,
"spec_kripp_alpha": 0.2758412395136435
},
"_runtime": {
"encoder_mb": 274.843904,
"ms_per_sample": 5.860076693982895,
"throughput_per_s": 170.64623079537446,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4858604749897495
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.9920503365435541,
"cat_ece": 0.053747650533914546,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8951731906425856,
"spec_weighted_f1": 0.9121524819510628,
"spec_macro_precision": 0.8980417155129858,
"spec_macro_recall": 0.8930560580782194,
"spec_mcc": 0.866381831963237,
"spec_auc": 0.981767664615518,
"spec_ece": 0.07004868157207966,
"spec_confusion_matrix": [
[
580,
23,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9354838709677419,
"spec_prec_L1Generi": 0.932475884244373,
"spec_recall_L1Generi": 0.9385113268608414,
"spec_f1_L2Domain": 0.7975460122699386,
"spec_prec_L2Domain": 0.8227848101265823,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9324447137231142,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.917725722448833
},
"Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.993963602835296,
"cat_ece": 0.06541431720058125,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8833694419146193,
"spec_weighted_f1": 0.9004034318676798,
"spec_macro_precision": 0.8858989636247611,
"spec_macro_recall": 0.8854684685880032,
"spec_mcc": 0.8500778641433316,
"spec_auc": 0.9736589116420353,
"spec_ece": 0.0816012116521597,
"spec_confusion_matrix": [
[
567,
30,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9242053789731052,
"spec_prec_L1Generi": 0.9115755627009646,
"spec_recall_L1Generi": 0.9371900826446281,
"spec_f1_L2Domain": 0.7788778877887789,
"spec_prec_L2Domain": 0.7468354430379747,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9227008860372746,
"spec_mae": 0.13583333333333333,
"spec_kripp_alpha": 0.9065248741550552
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 5.539002780715236,
"throughput_per_s": 180.53791261517884,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.46582157304510474
}
}


@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336741161693523,
"cat_weighted_f1": 0.9343162998643407,
"cat_macro_precision": 0.93189297179766,
"cat_macro_recall": 0.9377918652022429,
"cat_mcc": 0.9226990724708704,
"cat_auc": 0.9920546854722492,
"cat_ece": 0.05388230005900064,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
144,
1,
3,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9411764705882353,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.96,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.856396866840731,
"cat_prec_RiskMana": 0.8864864864864865,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223591517560865,
"spec_macro_f1": 0.8943486525770918,
"spec_weighted_f1": 0.9113685505226937,
"spec_macro_precision": 0.896720845083131,
"spec_macro_recall": 0.8926515273338829,
"spec_mcc": 0.8651595302642376,
"spec_auc": 0.9817804600696421,
"spec_ece": 0.07088303024570146,
"spec_confusion_matrix": [
[
579,
24,
12,
3
],
[
29,
130,
7,
2
],
[
11,
4,
190,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9346246973365617,
"spec_prec_L1Generi": 0.9323671497584541,
"spec_recall_L1Generi": 0.9368932038834952,
"spec_f1_L2Domain": 0.7951070336391437,
"spec_prec_L2Domain": 0.8176100628930818,
"spec_recall_L2Domain": 0.7738095238095238,
"spec_f1_L3Firm-S": 0.8941176470588236,
"spec_prec_L3Firm-S": 0.8715596330275229,
"spec_recall_L3Firm-S": 0.9178743961352657,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9321211092744079,
"spec_mae": 0.11833333333333333,
"spec_kripp_alpha": 0.9170875429859872
},
"Opus-4.6": {
"cat_macro_f1": 0.922684387023173,
"cat_weighted_f1": 0.9216414809666168,
"cat_macro_precision": 0.9177680939029339,
"cat_macro_recall": 0.9316060900094703,
"cat_mcc": 0.909266938399113,
"cat_auc": 0.9939567083286731,
"cat_ece": 0.06541596949100496,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
144,
1,
4,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9171974522292994,
"cat_prec_Manageme": 0.9230769230769231,
"cat_recall_Manageme": 0.9113924050632911,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8492462311557789,
"cat_prec_RiskMana": 0.9135135135135135,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.908575631724203,
"spec_macro_f1": 0.8825095464914274,
"spec_weighted_f1": 0.8996514471183623,
"spec_macro_precision": 0.8846890964606792,
"spec_macro_recall": 0.8850552454475074,
"spec_mcc": 0.8488763096810703,
"spec_auc": 0.9736482774372809,
"spec_ece": 0.08238246644536655,
"spec_confusion_matrix": [
[
566,
31,
7,
1
],
[
22,
118,
3,
2
],
[
33,
10,
207,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9233278955954323,
"spec_prec_L1Generi": 0.9114331723027376,
"spec_recall_L1Generi": 0.9355371900826446,
"spec_f1_L2Domain": 0.7763157894736842,
"spec_prec_L2Domain": 0.7421383647798742,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8661087866108786,
"spec_prec_L3Firm-S": 0.9495412844036697,
"spec_recall_L3Firm-S": 0.7961538461538461,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9223702541559166,
"spec_mae": 0.13666666666666666,
"spec_kripp_alpha": 0.9059072309806726
},
"_runtime": {
"encoder_mb": 1579.127296,
"ms_per_sample": 16.293709366727853,
"throughput_per_s": 61.37337898281309,
"peak_vram_mb": 3503.53369140625,
"build_s": 0.6251941699883901
}
}


@ -0,0 +1,286 @@
[
{
"variant": "fp32",
"description": "Float32 encoder + heads",
"encoder_mb": 1579.127296,
"ms_per_sample": 16.293709366727853,
"throughput_per_s": 61.37337898281309,
"peak_vram_mb": 3503.53369140625,
"build_s": 0.6251941699883901,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8943486525770918,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9321211092744079,
"GPT-5.4_spec_mae": 0.11833333333333333,
"GPT-5.4_cat_ece": 0.05388230005900064,
"GPT-5.4_spec_ece": 0.07088303024570146,
"GPT-5.4_spec_f1_L1Generi": 0.9346246973365617,
"GPT-5.4_spec_f1_L2Domain": 0.7951070336391437,
"GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8825095464914274,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9223702541559166,
"Opus-4.6_spec_mae": 0.13666666666666666,
"Opus-4.6_cat_ece": 0.06541596949100496,
"Opus-4.6_spec_ece": 0.08238246644536655,
"Opus-4.6_spec_f1_L1Generi": 0.9233278955954323,
"Opus-4.6_spec_f1_L2Domain": 0.7763157894736842,
"Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "bf16",
"description": "BFloat16 baseline (matches eval pipeline)",
"encoder_mb": 789.563648,
"ms_per_sample": 5.516677870764397,
"throughput_per_s": 181.26851402716375,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.48778308398323134,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8951731906425856,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9324447137231142,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.053848127176364245,
"GPT-5.4_spec_ece": 0.07135417198141418,
"GPT-5.4_spec_f1_L1Generi": 0.9354838709677419,
"GPT-5.4_spec_f1_L2Domain": 0.7975460122699386,
"GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8833694419146193,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9227008860372746,
"Opus-4.6_spec_mae": 0.13583333333333333,
"Opus-4.6_cat_ece": 0.06551479384303091,
"Opus-4.6_spec_ece": 0.08248284702499709,
"Opus-4.6_spec_f1_L1Generi": 0.9242053789731052,
"Opus-4.6_spec_f1_L2Domain": 0.7788778877887789,
"Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "fp16",
"description": "Float16 encoder + heads",
"encoder_mb": 789.563648,
"ms_per_sample": 5.539002780715236,
"throughput_per_s": 180.53791261517884,
"peak_vram_mb": 1740.83837890625,
"build_s": 0.46582157304510474,
"GPT-5.4_cat_f1": 0.9336741161693523,
"GPT-5.4_spec_f1": 0.8951731906425856,
"GPT-5.4_cat_mcc": 0.9226990724708704,
"GPT-5.4_spec_qwk": 0.9324447137231142,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.053747650533914546,
"GPT-5.4_spec_ece": 0.07004868157207966,
"GPT-5.4_spec_f1_L1Generi": 0.9354838709677419,
"GPT-5.4_spec_f1_L2Domain": 0.7975460122699386,
"GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.922684387023173,
"Opus-4.6_spec_f1": 0.8833694419146193,
"Opus-4.6_cat_mcc": 0.909266938399113,
"Opus-4.6_spec_qwk": 0.9227008860372746,
"Opus-4.6_spec_mae": 0.13583333333333333,
"Opus-4.6_cat_ece": 0.06541431720058125,
"Opus-4.6_spec_ece": 0.0816012116521597,
"Opus-4.6_spec_f1_L1Generi": 0.9242053789731052,
"Opus-4.6_spec_f1_L2Domain": 0.7788778877887789,
"Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "torchao-int8-wo",
"description": "torchao Int8 weight-only on encoder linears",
"encoder_mb": 789.563648,
"ms_per_sample": 6.078403938445263,
"throughput_per_s": 164.5168715549004,
"peak_vram_mb": 1416.36376953125,
"build_s": 0.5027359619853087,
"GPT-5.4_cat_f1": 0.9344870894825886,
"GPT-5.4_spec_f1": 0.8941203230194683,
"GPT-5.4_cat_mcc": 0.9237006314618685,
"GPT-5.4_spec_qwk": 0.9329693660903852,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.05415941931307314,
"GPT-5.4_spec_ece": 0.06980206420024232,
"GPT-5.4_spec_f1_L1Generi": 0.9353796445880452,
"GPT-5.4_spec_f1_L2Domain": 0.793939393939394,
"GPT-5.4_spec_f1_L3Firm-S": 0.8936170212765957,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.9234810481200378,
"Opus-4.6_spec_f1": 0.8814731397444973,
"Opus-4.6_cat_mcc": 0.9102750101817324,
"Opus-4.6_spec_qwk": 0.9207708779443254,
"Opus-4.6_spec_mae": 0.13916666666666666,
"Opus-4.6_cat_ece": 0.0641141641388337,
"Opus-4.6_spec_ece": 0.08370273689428968,
"Opus-4.6_spec_f1_L1Generi": 0.9208163265306123,
"Opus-4.6_spec_f1_L2Domain": 0.7752442996742671,
"Opus-4.6_spec_f1_L3Firm-S": 0.865546218487395,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "torchao-int8-dyn",
"description": "torchao Int8 dynamic activation + Int8 weight on encoder",
"encoder_mb": 789.563648,
"ms_per_sample": 9.671733896636093,
"throughput_per_s": 103.39407707937539,
"peak_vram_mb": 1774.27392578125,
"build_s": 0.4831273259478621,
"GPT-5.4_cat_f1": 0.9336475878058536,
"GPT-5.4_spec_f1": 0.8918479759675974,
"GPT-5.4_cat_mcc": 0.9226968780743573,
"GPT-5.4_spec_qwk": 0.931514217618119,
"GPT-5.4_spec_mae": 0.12,
"GPT-5.4_cat_ece": 0.05363284418980283,
"GPT-5.4_spec_ece": 0.07049367701013878,
"GPT-5.4_spec_f1_L1Generi": 0.934412955465587,
"GPT-5.4_spec_f1_L2Domain": 0.7889908256880734,
"GPT-5.4_spec_f1_L3Firm-S": 0.8904428904428905,
"GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
"Opus-4.6_cat_f1": 0.9242573204255528,
"Opus-4.6_spec_f1": 0.8827245859621925,
"Opus-4.6_cat_mcc": 0.9112549308356716,
"Opus-4.6_spec_qwk": 0.92235918049198,
"Opus-4.6_spec_mae": 0.13666666666666666,
"Opus-4.6_cat_ece": 0.06330573419729862,
"Opus-4.6_spec_ece": 0.08290670409798626,
"Opus-4.6_spec_f1_L1Generi": 0.9230769230769231,
"Opus-4.6_spec_f1_L2Domain": 0.7763157894736842,
"Opus-4.6_spec_f1_L3Firm-S": 0.8672199170124482,
"Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
},
{
"variant": "torchao-int4-wo",
"description": "torchao Int4 weight-only (group=128) on encoder",
"error": "ImportError: Requires mslk >= 1.0.0"
},
{
"variant": "bnb-int8",
"description": "bitsandbytes LLM.int8 on encoder linears",
"encoder_mb": 789.563648,
"ms_per_sample": 7.762363941583317,
"throughput_per_s": 128.82673468103667,
"peak_vram_mb": 2135.203125,
"build_s": 1.1878160500200465,
"GPT-5.4_cat_f1": 0.9360988760303737,
"GPT-5.4_spec_f1": 0.8986323186392307,
"GPT-5.4_cat_mcc": 0.9256911778959798,
"GPT-5.4_spec_qwk": 0.9307948020550015,
"GPT-5.4_spec_mae": 0.1175,
"GPT-5.4_cat_ece": 0.052939765204985965,
"GPT-5.4_spec_ece": 0.06740866973996164,
"GPT-5.4_spec_f1_L1Generi": 0.9364440868865648,
"GPT-5.4_spec_f1_L2Domain": 0.8099688473520249,
"GPT-5.4_spec_f1_L3Firm-S": 0.892018779342723,
"GPT-5.4_spec_f1_L4Quanti": 0.9560975609756097,
"Opus-4.6_cat_f1": 0.9235105849558979,
"Opus-4.6_spec_f1": 0.8826923642825633,
"Opus-4.6_cat_mcc": 0.9103198007176273,
"Opus-4.6_spec_qwk": 0.9198415117342273,
"Opus-4.6_spec_mae": 0.13916666666666666,
"Opus-4.6_cat_ece": 0.06465620135267579,
"Opus-4.6_spec_ece": 0.083350846717755,
"Opus-4.6_spec_f1_L1Generi": 0.9235772357723577,
"Opus-4.6_spec_f1_L2Domain": 0.785234899328859,
"Opus-4.6_spec_f1_L3Firm-S": 0.860125260960334,
"Opus-4.6_spec_f1_L4Quanti": 0.9618320610687023
},
{
"variant": "bnb-nf4",
"description": "bitsandbytes NF4 4-bit (double-quant, bf16 compute)",
"encoder_mb": 274.843904,
"ms_per_sample": 5.860076693982895,
"throughput_per_s": 170.64623079537446,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4858604749897495,
"GPT-5.4_cat_f1": 0.3536909012886116,
"GPT-5.4_spec_f1": 0.22049451330952025,
"GPT-5.4_cat_mcc": 0.42471542150657926,
"GPT-5.4_spec_qwk": 0.24233251808742773,
"GPT-5.4_spec_mae": 0.8733333333333333,
"GPT-5.4_cat_ece": 0.09734637491405013,
"GPT-5.4_spec_ece": 0.43363295723994577,
"GPT-5.4_spec_f1_L1Generi": 0.35013262599469497,
"GPT-5.4_spec_f1_L2Domain": 0.28546861564918313,
"GPT-5.4_spec_f1_L3Firm-S": 0.2463768115942029,
"GPT-5.4_spec_f1_L4Quanti": 0.0,
"Opus-4.6_cat_f1": 0.35763512449392704,
"Opus-4.6_spec_f1": 0.20754679251319788,
"Opus-4.6_cat_mcc": 0.42166882753874363,
"Opus-4.6_spec_qwk": 0.24096533359991634,
"Opus-4.6_spec_mae": 0.88,
"Opus-4.6_cat_ece": 0.09567970824738346,
"Opus-4.6_spec_ece": 0.4489923599362374,
"Opus-4.6_spec_f1_L1Generi": 0.3508771929824561,
"Opus-4.6_spec_f1_L2Domain": 0.2543859649122807,
"Opus-4.6_spec_f1_L3Firm-S": 0.22492401215805471,
"Opus-4.6_spec_f1_L4Quanti": 0.0
},
{
"variant": "bnb-nf4-nodq",
"description": "bitsandbytes NF4 4-bit (no double-quant)",
"encoder_mb": 274.843904,
"ms_per_sample": 5.861402786540566,
"throughput_per_s": 170.607623536175,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4908116469741799,
"GPT-5.4_cat_f1": 0.3536909012886116,
"GPT-5.4_spec_f1": 0.22049451330952025,
"GPT-5.4_cat_mcc": 0.42471542150657926,
"GPT-5.4_spec_qwk": 0.24233251808742773,
"GPT-5.4_spec_mae": 0.8733333333333333,
"GPT-5.4_cat_ece": 0.09734637491405013,
"GPT-5.4_spec_ece": 0.43363295723994577,
"GPT-5.4_spec_f1_L1Generi": 0.35013262599469497,
"GPT-5.4_spec_f1_L2Domain": 0.28546861564918313,
"GPT-5.4_spec_f1_L3Firm-S": 0.2463768115942029,
"GPT-5.4_spec_f1_L4Quanti": 0.0,
"Opus-4.6_cat_f1": 0.35763512449392704,
"Opus-4.6_spec_f1": 0.20754679251319788,
"Opus-4.6_cat_mcc": 0.42166882753874363,
"Opus-4.6_spec_qwk": 0.24096533359991634,
"Opus-4.6_spec_mae": 0.88,
"Opus-4.6_cat_ece": 0.09567970824738346,
"Opus-4.6_spec_ece": 0.4489923599362374,
"Opus-4.6_spec_f1_L1Generi": 0.3508771929824561,
"Opus-4.6_spec_f1_L2Domain": 0.2543859649122807,
"Opus-4.6_spec_f1_L3Firm-S": 0.22492401215805471,
"Opus-4.6_spec_f1_L4Quanti": 0.0
},
{
"variant": "bnb-fp4",
"description": "bitsandbytes FP4 4-bit (no double-quant)",
"encoder_mb": 274.843904,
"ms_per_sample": 5.865302347471394,
"throughput_per_s": 170.49419463109393,
"peak_vram_mb": 1287.34326171875,
"build_s": 0.4887635139748454,
"GPT-5.4_cat_f1": 0.16293893512410998,
"GPT-5.4_spec_f1": 0.20854117827130608,
"GPT-5.4_cat_mcc": 0.22345796853389935,
"GPT-5.4_spec_qwk": 0.2326064604575444,
"GPT-5.4_spec_mae": 0.8825,
"GPT-5.4_cat_ece": 0.2080524676044782,
"GPT-5.4_spec_ece": 0.44289420386155437,
"GPT-5.4_spec_f1_L1Generi": 0.35742444152431013,
"GPT-5.4_spec_f1_L2Domain": 0.2824956672443674,
"GPT-5.4_spec_f1_L3Firm-S": 0.19424460431654678,
"GPT-5.4_spec_f1_L4Quanti": 0.0,
"Opus-4.6_cat_f1": 0.16861118726256397,
"Opus-4.6_spec_f1": 0.19783939283519508,
"Opus-4.6_cat_mcc": 0.2251562222131823,
"Opus-4.6_spec_qwk": 0.22580295138888895,
"Opus-4.6_spec_mae": 0.8925,
"Opus-4.6_cat_ece": 0.19888580093781152,
"Opus-4.6_spec_ece": 0.45814307530721027,
"Opus-4.6_spec_f1_L1Generi": 0.35294117647058826,
"Opus-4.6_spec_f1_L2Domain": 0.251105216622458,
"Opus-4.6_spec_f1_L3Firm-S": 0.18731117824773413,
"Opus-4.6_spec_f1_L4Quanti": 0.0
}
]

View File

@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9336475878058536,
"cat_weighted_f1": 0.9342872402134198,
"cat_macro_precision": 0.9319024691959354,
"cat_macro_recall": 0.9376938259865566,
"cat_mcc": 0.9226968780743573,
"cat_auc": 0.9924054453050574,
"cat_ece": 0.05363284418980283,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
145,
1,
2,
0,
0
],
[
1,
0,
3,
130,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9698275862068966,
"cat_prec_BoardGov": 0.9615384615384616,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9446254071661238,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.8843537414965986,
"cat_prec_NoneOthe": 0.8227848101265823,
"cat_recall_NoneOthe": 0.9558823529411765,
"cat_f1_RiskMana": 0.8586387434554974,
"cat_prec_RiskMana": 0.8913043478260869,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9223561935890119,
"spec_macro_f1": 0.8918479759675974,
"spec_weighted_f1": 0.9097693388297432,
"spec_macro_precision": 0.8930494570032042,
"spec_macro_recall": 0.8915621000757135,
"spec_mcc": 0.8628946887605918,
"spec_auc": 0.9807842405238503,
"spec_ece": 0.07049367701013878,
"spec_confusion_matrix": [
[
577,
24,
14,
3
],
[
29,
129,
8,
2
],
[
9,
5,
191,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.934412955465587,
"spec_prec_L1Generi": 0.9351701782820098,
"spec_recall_L1Generi": 0.9336569579288025,
"spec_f1_L2Domain": 0.7889908256880734,
"spec_prec_L2Domain": 0.8113207547169812,
"spec_recall_L2Domain": 0.7678571428571429,
"spec_f1_L3Firm-S": 0.8904428904428905,
"spec_prec_L3Firm-S": 0.8603603603603603,
"spec_recall_L3Firm-S": 0.9227053140096618,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.931514217618119,
"spec_mae": 0.12,
"spec_kripp_alpha": 0.9169918680049234
},
"Opus-4.6": {
"cat_macro_f1": 0.9242573204255528,
"cat_weighted_f1": 0.9232556488517519,
"cat_macro_precision": 0.9193897229484191,
"cat_macro_recall": 0.9331778058838005,
"cat_mcc": 0.9112549308356716,
"cat_auc": 0.9941614030336741,
"cat_ece": 0.06330573419729862,
"cat_confusion_matrix": [
[
212,
0,
1,
0,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
145,
1,
3,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9464285714285714,
"cat_prec_BoardGov": 0.905982905982906,
"cat_recall_BoardGov": 0.9906542056074766,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9206349206349206,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9177215189873418,
"cat_f1_NoneOthe": 0.9230769230769231,
"cat_prec_NoneOthe": 0.8734177215189873,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8513853904282116,
"cat_prec_RiskMana": 0.9184782608695652,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9105393643352402,
"spec_macro_f1": 0.8827245859621925,
"spec_weighted_f1": 0.8997656600606208,
"spec_macro_precision": 0.8833309642003535,
"spec_macro_recall": 0.8861518760895928,
"spec_mcc": 0.8488976906438819,
"spec_auc": 0.9740582923879771,
"spec_ece": 0.08290670409798626,
"spec_confusion_matrix": [
[
564,
31,
9,
1
],
[
22,
118,
3,
2
],
[
31,
10,
209,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9230769230769231,
"spec_prec_L1Generi": 0.9141004862236629,
"spec_recall_L1Generi": 0.9322314049586777,
"spec_f1_L2Domain": 0.7763157894736842,
"spec_prec_L2Domain": 0.7421383647798742,
"spec_recall_L2Domain": 0.8137931034482758,
"spec_f1_L3Firm-S": 0.8672199170124482,
"spec_prec_L3Firm-S": 0.9414414414414415,
"spec_recall_L3Firm-S": 0.8038461538461539,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.92235918049198,
"spec_mae": 0.13666666666666666,
"spec_kripp_alpha": 0.9061330450504643
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 9.671733896636093,
"throughput_per_s": 103.39407707937539,
"peak_vram_mb": 1774.27392578125,
"build_s": 0.4831273259478621
}
}

View File

@ -0,0 +1,297 @@
{
"GPT-5.4": {
"cat_macro_f1": 0.9344870894825886,
"cat_weighted_f1": 0.9351173265780133,
"cat_macro_precision": 0.9326512314038842,
"cat_macro_recall": 0.9387442461546238,
"cat_mcc": 0.9237006314618685,
"cat_auc": 0.992309699625497,
"cat_ece": 0.05415941931307314,
"cat_confusion_matrix": [
[
225,
0,
3,
0,
2,
0,
0
],
[
0,
85,
0,
0,
2,
1,
0
],
[
2,
0,
145,
1,
2,
0,
0
],
[
0,
0,
3,
131,
0,
2,
0
],
[
6,
1,
5,
19,
164,
1,
2
],
[
0,
3,
1,
8,
2,
207,
0
],
[
0,
0,
0,
0,
12,
0,
165
]
],
"cat_f1_BoardGov": 0.9719222462203023,
"cat_prec_BoardGov": 0.9656652360515021,
"cat_recall_BoardGov": 0.9782608695652174,
"cat_f1_Incident": 0.96045197740113,
"cat_prec_Incident": 0.9550561797752809,
"cat_recall_Incident": 0.9659090909090909,
"cat_f1_Manageme": 0.9446254071661238,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9666666666666667,
"cat_f1_NoneOthe": 0.888135593220339,
"cat_prec_NoneOthe": 0.8238993710691824,
"cat_recall_NoneOthe": 0.9632352941176471,
"cat_f1_RiskMana": 0.8586387434554974,
"cat_prec_RiskMana": 0.8913043478260869,
"cat_recall_RiskMana": 0.8282828282828283,
"cat_f1_Strategy": 0.9583333333333334,
"cat_prec_Strategy": 0.981042654028436,
"cat_recall_Strategy": 0.9366515837104072,
"cat_f1_Third-Pa": 0.9593023255813954,
"cat_prec_Third-Pa": 0.9880239520958084,
"cat_recall_Third-Pa": 0.9322033898305084,
"cat_kripp_alpha": 0.9233443339647499,
"spec_macro_f1": 0.8941203230194683,
"spec_weighted_f1": 0.9115075208518084,
"spec_macro_precision": 0.8957148694260108,
"spec_macro_recall": 0.892931893103379,
"spec_mcc": 0.8651929532300995,
"spec_auc": 0.981624069084201,
"spec_ece": 0.06980206420024232,
"spec_confusion_matrix": [
[
579,
24,
12,
3
],
[
29,
131,
6,
2
],
[
10,
6,
189,
2
],
[
2,
1,
9,
195
]
],
"spec_f1_L1Generi": 0.9353796445880452,
"spec_prec_L1Generi": 0.9338709677419355,
"spec_recall_L1Generi": 0.9368932038834952,
"spec_f1_L2Domain": 0.793939393939394,
"spec_prec_L2Domain": 0.808641975308642,
"spec_recall_L2Domain": 0.7797619047619048,
"spec_f1_L3Firm-S": 0.8936170212765957,
"spec_prec_L3Firm-S": 0.875,
"spec_recall_L3Firm-S": 0.9130434782608695,
"spec_f1_L4Quanti": 0.9535452322738386,
"spec_prec_L4Quanti": 0.9653465346534653,
"spec_recall_L4Quanti": 0.9420289855072463,
"spec_qwk": 0.9329693660903852,
"spec_mae": 0.1175,
"spec_kripp_alpha": 0.9181842655510584
},
"Opus-4.6": {
"cat_macro_f1": 0.9234810481200378,
"cat_weighted_f1": 0.9224737817442137,
"cat_macro_precision": 0.9185473372257941,
"cat_macro_recall": 0.9325102491414775,
"cat_mcc": 0.9102750101817324,
"cat_auc": 0.9940184741579791,
"cat_ece": 0.0641141641388337,
"cat_confusion_matrix": [
[
211,
0,
1,
1,
1,
0,
0
],
[
0,
78,
0,
0,
1,
0,
0
],
[
8,
0,
145,
1,
3,
0,
1
],
[
0,
0,
1,
138,
1,
1,
0
],
[
13,
0,
9,
14,
169,
1,
7
],
[
1,
11,
1,
4,
3,
208,
0
],
[
0,
0,
0,
1,
6,
1,
159
]
],
"cat_f1_BoardGov": 0.9440715883668904,
"cat_prec_BoardGov": 0.9055793991416309,
"cat_recall_BoardGov": 0.985981308411215,
"cat_f1_Incident": 0.9285714285714286,
"cat_prec_Incident": 0.8764044943820225,
"cat_recall_Incident": 0.9873417721518988,
"cat_f1_Manageme": 0.9206349206349206,
"cat_prec_Manageme": 0.9235668789808917,
"cat_recall_Manageme": 0.9177215189873418,
"cat_f1_NoneOthe": 0.92,
"cat_prec_NoneOthe": 0.8679245283018868,
"cat_recall_NoneOthe": 0.9787234042553191,
"cat_f1_RiskMana": 0.8513853904282116,
"cat_prec_RiskMana": 0.9184782608695652,
"cat_recall_RiskMana": 0.7934272300469484,
"cat_f1_Strategy": 0.9476082004555809,
"cat_prec_Strategy": 0.985781990521327,
"cat_recall_Strategy": 0.9122807017543859,
"cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
"cat_recall_Third-Pa": 0.9520958083832335,
"cat_kripp_alpha": 0.9095617653952504,
"spec_macro_f1": 0.8814731397444973,
"spec_weighted_f1": 0.8981338362706646,
"spec_macro_precision": 0.8833981471623865,
"spec_macro_recall": 0.8849913986360116,
"spec_mcc": 0.8465512998506631,
"spec_auc": 0.9729999946345258,
"spec_ece": 0.08370273689428968,
"spec_confusion_matrix": [
[
564,
33,
7,
1
],
[
22,
119,
2,
2
],
[
34,
10,
206,
10
],
[
0,
0,
1,
189
]
],
"spec_f1_L1Generi": 0.9208163265306123,
"spec_prec_L1Generi": 0.9096774193548387,
"spec_recall_L1Generi": 0.9322314049586777,
"spec_f1_L2Domain": 0.7752442996742671,
"spec_prec_L2Domain": 0.7345679012345679,
"spec_recall_L2Domain": 0.8206896551724138,
"spec_f1_L3Firm-S": 0.865546218487395,
"spec_prec_L3Firm-S": 0.9537037037037037,
"spec_recall_L3Firm-S": 0.7923076923076923,
"spec_f1_L4Quanti": 0.9642857142857143,
"spec_prec_L4Quanti": 0.9356435643564357,
"spec_recall_L4Quanti": 0.9947368421052631,
"spec_qwk": 0.9207708779443254,
"spec_mae": 0.13916666666666666,
"spec_kripp_alpha": 0.9033268512180281
},
"_runtime": {
"encoder_mb": 789.563648,
"ms_per_sample": 6.078403938445263,
"throughput_per_s": 164.5168715549004,
"peak_vram_mb": 1416.36376953125,
"build_s": 0.5027359619853087
}
}