quantization + onnx sweeps
Phase 10.8: torchao/bnb quant sweep on iter1-independent. bf16 already
optimal; torchao int8-wo gives -19% VRAM at no F1 cost; all 4-bit
variants collapse (ModernBERT-large too quant-sensitive).
Phase 10.9: ONNX export + ORT eval. Legacy exporter only working path
(dynamo adds 56 Memcpy nodes); ORT fp32 -22% latency vs torch via
kernel fusion but bf16+flash-attn-2 still wins; fp16 broken on rotary;
dynamic int8 silently CPU-fallback + 0.5 F1 collapse.
Driver scripts wired to bun run py:quant / py:onnx; full reports at
results/eval/{quant,onnx}/REPORT.md.
parent ebfc103dac → commit 67beaede45

.gitignore (vendored, +3)
@@ -2,7 +2,10 @@
 /data/
 /models/
 /checkpoints/
+/results/eval/onnx/models/
 *.tar.zst
+*.onnx
+*.onnx.data

 # Dependencies
 ts/node_modules/

@@ -1097,6 +1097,212 @@ epoch 3 + no temperature scaling would be a reasonable alternative choice.
> in-distribution confidence memorization. Temperature scaling recovers
> calibration (ECE −33% cat, −40% spec) without altering predictions."*

### 10.8 Quantization Sweep (2026-04-07)

**Question:** does post-training quantization buy us a smaller / faster
deployable model without giving back accuracy? And — almost more
interesting — *which* quant schemes does ModernBERT-large tolerate?

**Setup:** new sweep driver at `python/scripts/quantize_sweep.py` (wired
to `bun run py:quant`). Loads the iter1-independent checkpoint, applies
each scheme to the encoder backbone only (heads stay bf16), reruns the
full holdout eval against GPT-5.4 and Opus-4.6 proxy gold, and records
latency, peak VRAM, encoder footprint, and the full metrics suite. 5
warmup batches before timing; batch 64; max_seq 512; RTX 3090.
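
The timing protocol above (a few untimed warmup batches, then wall-clock averaging) is the usual guard against JIT / autotune / cache-warm startup cost. A minimal sketch of that harness, with `run_batch` as a hypothetical stand-in for the model call:

```python
import time

def time_per_sample(run_batch, batches, warmup=5):
    """Discard `warmup` batches, then average wall-clock over the rest.

    Returns (ms_per_sample, throughput_per_s).
    """
    for b in batches[:warmup]:   # warmup: not timed
        run_batch(b)
    timed = batches[warmup:]
    n_samples = sum(len(b) for b in timed)
    t0 = time.perf_counter()
    for b in timed:
        run_batch(b)
    elapsed = time.perf_counter() - t0
    return (elapsed / n_samples) * 1000.0, n_samples / elapsed

# toy usage: a "model" that just sums each batch of 64 samples
batches = [[1.0] * 64 for _ in range(10)]
ms_per_sample, throughput = time_per_sample(lambda b: sum(b), batches)
```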

**Variants:** fp32, bf16 (baseline), fp16, torchao int8 weight-only,
torchao int8 dynamic-act + int8 weight, torchao int4 weight-only,
bitsandbytes LLM.int8, bitsandbytes nf4 (with and without
double-quantization), bitsandbytes fp4.

**Results (vs GPT-5.4 proxy gold):**

| variant | enc MB | ms/samp | thru/s | VRAM MB | cat F1 | spec F1 | spec QWK |
|--------------------|-------:|--------:|-------:|--------:|-------:|--------:|---------:|
| fp32 | 1579 | 16.29 | 61 | 3504 | 0.9337 | 0.8943 | 0.9321 |
| **bf16 baseline** | 790 | 5.52 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 |
| fp16 | 790 | 5.54 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 |
| **torchao int8-wo**| ~395 | 6.08 | 165 | 1416 | 0.9345 | 0.8941 | 0.9330 |
| torchao int8-dyn | ~395 | 9.67 | 103 | 1774 | 0.9336 | 0.8918 | 0.9315 |
| torchao int4-wo | — | — | — | — | err | err | err |
| bnb LLM.int8 | ~395 | 7.76 | 129 | 2135 | 0.9361 | 0.8986 | 0.9308 |
| bnb nf4 (DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 |
| bnb nf4 (no DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 |
| bnb fp4 | 275 | 5.87 | 170 | 1287 | 0.1629 | 0.2085 | 0.2326 |

(torchao subclass tensors report bf16 element_size, so "~395 MB" is the
true storage estimate, not what `param.element_size()` returns.)
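
The footnote's figure is plain byte arithmetic rather than anything read off the tensors (numbers taken from the table above):

```python
bf16_mb = 790              # encoder footprint at 2 bytes per parameter
params_m = bf16_mb / 2     # → ~395M encoder parameters
int8_wo_mb = params_m * 1  # int8 weight-only: 1 byte per parameter
# per-output-channel scales add only a few bytes per row — negligible
```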

**Six findings:**

1. **bf16 + flash-attn-2 is already the sweet spot.** 3.0× throughput over
   fp32 with bit-identical accuracy and half the VRAM. Nothing in the
   precision dimension beats it on this hardware.
2. **fp16 ≡ bf16.** RTX 3090 has matched fp16/bf16 tensor-core throughput
   and the model has no overflow issues; pick whichever the loader
   prefers.
3. **torchao int8 weight-only is the only quantization that's worth
   shipping.** −19% VRAM (1741 → 1416 MB), accuracy delta inside ±0.002
   per-seed noise, +10% latency because RTX 3090 (sm_8.6) lacks the int8
   tensor-core matmul path that torchao would otherwise route through —
   so the int8 weight is dequantized to bf16 on the fly. **This is the
   variant we'd ship as the "low-VRAM" deployment option**, and on
   Hopper / Ada the latency would invert and be a strict win.
4. **torchao int8 dynamic-activation regresses on Ampere.** −43%
   throughput and *more* peak VRAM than bf16 because the per-batch
   activation quantization adds work without unlocking the int8
   matmul. Skip.
5. **bnb LLM.int8 is the slowest int8 path and uses *more* VRAM than
   bf16.** Mixed-precision outlier handling adds 23% peak memory and 41%
   latency for an F1 bump that's inside noise. It's tuned for LLM-scale
   models where outlier features dominate quant error; for an encoder
   this size on a single 3090 it's a regression.
6. **All 4-bit variants collapse to near-random.** Both nf4 (DQ and
   no-DQ) and fp4 produce essentially category-prior and L1-collapsed
   predictions (cat ECE jumps from 0.054 to 0.10–0.21). We verified per
   layer that the dequantized weights of one MLP `Wi` differ from the
   original by mean 0.005 / max 0.11 — quantization is *correct* — but
   the relative output drift on a single Linear is already ~98% (mean),
   and that compounds across 28 transformer blocks + GLU FFN paths until
   the [CLS]/pooled representation no longer carries the discriminative
   signal. **DQ vs no-DQ produce bit-identical predictions** because the
   nf4 weight indices are stable under absmax requantization (only the
   metadata block differs). The catastrophe is inherent to 4-bit weight
   precision on this architecture, not to a config knob. Recovering 4-bit
   would require QAT, GPTQ/AWQ-style per-channel calibration, or keeping
   the GLU FFN in 8-bit while only 4-bit'ing attention projections —
   none reachable inside the remaining capstone budget.
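
The bit-width error scaling behind findings 3 and 6 can be sketched with a toy per-channel absmax fake-quantizer. The shapes and the uniform quantizer are illustrative stand-ins — real nf4 uses a nonuniform codebook and drifts far worse on the actual weights — but the 8-bit vs 4-bit gap on even one Linear is already visible:

```python
import numpy as np

def fake_quant(W, bits):
    """Symmetric per-output-channel absmax quantize → dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256)).astype(np.float32)   # toy Linear weight
x = rng.normal(size=(8, 256)).astype(np.float32)    # toy activations

y = x @ W.T
rel_err = {bits: np.abs(x @ fake_quant(W, bits).T - y).mean() / np.abs(y).mean()
           for bits in (8, 4)}
# int8 error stays tiny on one layer; int4's is an order of magnitude
# larger before it even starts compounding across blocks
```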

**Paper hooks:**

- Add a "deployment precision" row to the speed/cost table — bf16 vs
  torchao int8-wo gives a clean Pareto pair (latency vs VRAM).
- One paragraph in the discussion alongside the DAPT and CORAL nulls:
  *naive post-training 4-bit weight quantization is not viable for
  ModernBERT-large on this task; the GLU FFN amplifies per-layer weight
  error across 28 blocks until signal is destroyed*. This is a useful
  counterpoint to the 4-bit-by-default LLM serving narrative and a
  legitimate negative result tied to architectural choices.
- Caveat the int8 latency rows with the sm_8.6 hardware footnote — the
  result would invert on H100/A100/Ada.

Full standalone report at `results/eval/quant/REPORT.md`; per-variant
metrics at `results/eval/quant/<variant>/metrics.json`; aggregate row data
at `results/eval/quant/summary.json`.

### 10.9 ONNX Export + Eval (2026-04-07)

**Question:** can we get a portable ONNX artifact with comparable
latency / accuracy? What does the ORT path look like for fp32, fp16,
and int8?

**Setup:** new driver at `python/scripts/onnx_export_eval.py` (`bun run
py:onnx`). Exports the iter1-independent checkpoint, runs ORT inference
on the full holdout via CUDAExecutionProvider, and compares against the
proxy gold.

**Six things broke along the way; documenting each because it's a real
gotcha for the paper's reproducibility section:**

1. **Dynamo exporter optimizer crashes.** `torch.onnx.export(...,
   dynamo=True)` translates the graph, but its post-translation
   `InlinePass` trips on `onnx_ir`. Workaround: `optimize=False`.
2. **The dynamo-exported graph is unusable on the CUDA EP.** ORT inserts
   56 Memcpy nodes between layers because dynamo emits scalar tensors
   with CPU-side placement metadata. Result: 42.9 ms/sample (8× torch
   fp32) and 15.4 GB peak VRAM (4.4× torch fp32). The legacy TorchScript
   exporter (`dynamo=False`) inserts only 1 Memcpy and is the only
   working export path.
3. **`op_types_to_quantize=['MatMul']` quantizes nothing on the dynamo
   graph.** Dynamo emits the encoder linears as `Gemm`, not `MatMul`;
   you need `['MatMul', 'Gemm']`.
4. **Both ORT shape-inference paths choke on ModernBERT.** Symbolic
   inference asserts in `_infer_Range` (the rotary embedding's `limit`
   input is not a scalar); the C++ path raises a (1024)/(7) dimension
   mismatch on the category head Gemm. The `skip_*` flags on
   `quant_pre_process` are *ignored* — it always runs symbolic shape
   inference — and `ONNXQuantizer.__init__` calls
   `save_and_reload_model_with_shape_infer` unconditionally. Workaround:
   monkey-patch both bindings to no-ops, then pass
   `extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}` so the
   quantizer can still type the head MatMul output.
5. **fp16 conversion via `onnxconverter_common` breaks on rotary
   embeddings.** Seen across exports, e.g. `Type parameter (T) of
   Optype (Mul) bound to different types (tensor(float) and
   tensor(float16)) in node /model/backbone/rotary_emb_1/Mul_2`. The
   converter leaves the `inv_freq` buffer in fp32, and the surrounding
   `Mul`/`Expand` ops then can't unify their type parameter. Patchable
   with an `op_block_list` for the rotary subgraph, but the cost/value
   isn't there given the int8 result below.
6. **Dynamic int8 via ORT silently falls back to CPU.** The quantizer
   replaces Gemm/MatMul with `MatMulInteger` + `DynamicQuantizeLinear`,
   neither of which has CUDA kernels in onnxruntime-gpu 1.24. Session
   creation succeeds with `CUDAExecutionProvider` but routes the
   quantized ops to the CPU EP — observable from the load-time GPU
   memory delta collapsing from 2074 MB (fp32) to 266 MB (int8) and
   latency exploding to **95.9 ms/sample**. Accuracy also drops to
   cat F1 = 0.397 / spec F1 = 0.336, further confirming the kernel
   path is wrong (not just slow).

**Results (legacy exporter, 1,200 holdout, vs GPT-5.4):**

| variant | size MB | ms/samp | VRAM MB | cat F1 | spec F1 | spec QWK |
|--------------------|--------:|--------:|--------:|-------:|--------:|---------:|
| **onnx-fp32** | 1583 | 12.70 | 8228 | 0.9337 | 0.8952 | 0.9324 |
| onnx-fp16 | 754 | err | err | err | err | err |
| onnx-int8 (dynamic)| 527 | 95.91 | ~CPU | 0.3972 | 0.3364 | 0.4413 |

For comparison, the torch baselines from Phase 10.8:

- torch fp32: 16.29 ms / 3504 MB / cat 0.9337 / spec 0.8943
- torch bf16: **5.52 ms / 1741 MB** / cat 0.9337 / spec 0.8952

**Three findings:**

1. **The one clean win: ORT fp32 beats torch fp32 by 22% on latency
   (12.70 vs 16.29 ms)** at bit-identical accuracy, thanks to ORT's
   LayerNorm + Gelu + MatMul kernel fusion. VRAM is 2.3× torch's
   (8228 vs 3504 MB) because the ORT session allocates a separate
   ~5 GB workspace — a fair trade for batched inference. But torch bf16
   + flash-attn-2 still wins outright on every dimension (5.52 ms,
   1741 MB), so this is a moral victory at best.
2. **fp16 ONNX is currently unreachable** without writing custom rotary
   handling for the float16 converter. Doable, but several hours of
   plumbing for an artifact that bf16 already dominates.
3. **ORT dynamic int8 is a deployment trap on this hardware.** It looks
   like it works (export succeeds, the file shrinks 1583 → 527 MB, the
   session constructs cleanly with CUDAExecutionProvider in the
   providers list), but at runtime the integer matmul ops route to the
   CPU EP, and the model produces ~uniform-prior predictions because
   the per-channel weight quantization interacts badly with the
   activation quantization path. Both failures would silently bite a
   production deployment that didn't run a holdout sanity check.
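
The holdout gate finding 3 argues for is cheap to wire into a deploy pipeline. A sketch with illustrative thresholds — `max_f1_drop` and `max_latency_x` are made-up knobs, not anything from the sweep scripts:

```python
def deployment_sanity_check(metrics, baseline, max_f1_drop=0.02, max_latency_x=2.0):
    """Flag silent degradation after swapping an inference runtime.

    metrics/baseline: dicts with 'cat_f1', 'spec_f1', 'ms_per_sample'.
    Returns a list of failure strings; empty means the swap looks safe.
    """
    failures = []
    for key in ("cat_f1", "spec_f1"):
        drop = baseline[key] - metrics[key]
        if drop > max_f1_drop:
            failures.append(f"{key} dropped {drop:.3f}")
    if metrics["ms_per_sample"] > max_latency_x * baseline["ms_per_sample"]:
        failures.append("latency regression suggests an EP fallback")
    return failures

# the ORT int8 numbers from the table above would trip all three checks
bf16 = {"cat_f1": 0.9337, "spec_f1": 0.8952, "ms_per_sample": 5.52}
ort_int8 = {"cat_f1": 0.3972, "spec_f1": 0.3364, "ms_per_sample": 95.91}
problems = deployment_sanity_check(ort_int8, bf16)
```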

**Net recommendation: don't ship ONNX for this model on this hardware.**
torchao int8-wo from §10.8 still owns the "smaller deployment" Pareto
slot (5.52 → 6.08 ms, 1741 → 1416 MB, F1 within ±0.001) more cleanly
than any ONNX variant we could produce here. ONNX would be worth
revisiting only for CPU-only deployment, cross-runtime portability
(TensorRT/OpenVINO/mobile), or a properly calibrated static int8 path
with a ModernBERT-aware op block list — none reachable inside the
remaining capstone budget.

**Paper hooks:**

- One paragraph in the deployment / reproducibility discussion:
  *ONNX export of ModernBERT-large via the dynamo exporter is currently
  broken (excessive Memcpy insertion); the legacy TorchScript exporter
  produces a clean graph that's 22% faster than torch fp32 via ORT
  kernel fusion, but bf16 + flash-attn-2 dominates at half the latency.
  fp16 conversion via onnxconverter_common fails on rotary embeddings,
  and ORT dynamic int8 silently falls back to CPU on
  onnxruntime-gpu 1.24, dropping ~0.5 macro F1.*
- Add a "deployment lessons learned" sub-bullet to the limitations
  section so a follow-on engineering team doesn't waste a day chasing
  the same dead ends.

Full standalone report at `results/eval/onnx/REPORT.md`; aggregate
results at `results/eval/onnx/summary.json`; exported models at
`results/eval/onnx/models/`.

### Phase 10 Summary

| Experiment | Cost | Outcome | Paper value |

@@ -1107,6 +1313,8 @@ epoch 3 + no temperature scaling would be a reasonable alternative choice.
| Temperature scaling | ~10 min GPU | ECE −33% cat, −40% spec, F1 unchanged | Calibration story, deployment quality |
| Pooling ablation (attention vs CLS) | ~3h GPU | +0.005 F1 consistent, small effect | Validates design, credits independent thresholds |
| DAPT re-test with new architecture | ~3h GPU | Val best NLL 0.333→0.318 (−4.5%), F1 +0.007 cat; holdout null; gen gap unchanged | More nuanced null — better init, not better generalization |
| Quantization sweep (10 variants) | ~5 min GPU | bf16 already optimal; torchao int8-wo = −19% VRAM at no F1 cost; **all 4-bit collapses** (ModernBERT-large too quant-sensitive) | Deployment Pareto + 4-bit null result |
| ONNX export + ORT eval | ~10 min GPU | Legacy exporter only working path; ORT fp32 −22% latency vs torch (kernel fusion), but bf16 still wins; fp16 broken on rotary; int8 silently CPU-fallback + 0.5 F1 collapse | Deployment lessons learned, reproducibility caveats |

The 3-seed ensemble is now the recommended headline checkpoint. The
calibrated ECE numbers should replace the pre-scaling ECE in the paper. The

@@ -20,6 +20,8 @@
     "ts:sec": "bun run --filter sec-cybert sec",
     "ts:typecheck": "bun run --filter sec-cybert typecheck",
     "py:train": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run main.py",
+    "py:quant": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run scripts/quantize_sweep.py",
+    "py:onnx": "cd python && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run scripts/onnx_export_eval.py",
     "typecheck": "bun run --filter '*' typecheck",
     "data:push": "./scripts/data-push.sh",
     "data:pull": "./scripts/data-pull.sh",

@@ -18,6 +18,11 @@ dependencies = [
     "krippendorff>=0.8.2",
     "matplotlib>=3.10.8",
     "seaborn>=0.13.2",
+    "onnx>=1.21.0",
+    "onnxruntime-gpu>=1.24.4",
+    "onnxruntime>=1.24.4",
+    "onnxscript>=0.6.2",
+    "onnxconverter-common>=1.16.0",
 ]

 [project.scripts]

python/scripts/onnx_export_eval.py (new file, +369)
@@ -0,0 +1,369 @@
"""ONNX export + eval for the iter1-independent ModernBERT-large checkpoint.

Variants:
  onnx-fp32     — straight torch.onnx.export from the fp32 model
  onnx-fp16     — fp32 export converted to fp16 via onnxconverter_common
                  (proxy for bf16; ORT does not support bf16 inference natively)
  onnx-int8-dyn — dynamic int8 quantization of the fp32 graph via
                  onnxruntime.quantization.quantize_dynamic (weights in int8,
                  activations quantized at runtime)

For each variant:
  - latency (ms/sample, batch=64, 5 warmup batches)
  - peak GPU memory delta around the session (free-mem snapshot)
  - on-disk size of model.onnx + model.onnx.data
  - cat / spec macro F1, QWK, ECE on the 1,200-paragraph holdout
    against GPT-5.4 + Opus-4.6 proxy gold

Usage:
    bun run py:onnx
"""

from __future__ import annotations

import gc
import json
import os
import sys
import time
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))

from src.finetune.data import CAT2ID, CATEGORIES, NUM_CATEGORIES, NUM_SPECIFICITY  # noqa: E402
from src.finetune.eval import SPEC_LABELS, compute_all_metrics, load_holdout_data  # noqa: E402
from src.finetune.model import ordinal_predict  # noqa: E402
from scripts.quantize_sweep import (  # noqa: E402
    BENCHMARKS, BATCH_SIZE, HOLDOUT, MAX_SEQ, PARAGRAPHS, WARMUP_BATCHES,
    _build_model, evaluate_predictions,
)

OUTPUT_DIR = ROOT.parent / "results/eval/onnx"
ONNX_DIR = OUTPUT_DIR / "models"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
ONNX_DIR.mkdir(parents=True, exist_ok=True)


# ──────────────────────────────────────────────────────────────────────
# Export
# ──────────────────────────────────────────────────────────────────────

class _Wrap(nn.Module):
    """Expose (cat_logits, spec_logits) as positional ONNX outputs."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        out = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return out["category_logits"], out["specificity_logits"]


def export_fp32(out_path: Path, sample_batch: int = 4, sample_seq: int = 64) -> None:
    print("  building fp32 torch model...")
    model, tokenizer = _build_model(torch.float32, attn_impl="sdpa")
    model = model.cuda().eval()
    wrap = _Wrap(model).cuda().eval()

    dummy_text = ["the company maintains a cybersecurity program overseen by the board"] * sample_batch
    enc = tokenizer(
        dummy_text, padding="max_length", max_length=sample_seq,
        truncation=True, return_tensors="pt",
    ).to("cuda")

    print(f"  exporting → {out_path}")
    # Legacy TorchScript exporter (dynamo=False). The dynamo path produces a
    # graph with 56+ Memcpy nodes when run on CUDAExecutionProvider, blowing
    # latency 8× and VRAM 4× over native torch — unusable. The legacy
    # exporter emits clean Gemm/MatMul/LayerNorm nodes ORT can fuse.
    torch.onnx.export(
        wrap,
        (enc["input_ids"], enc["attention_mask"]),
        str(out_path),
        input_names=["input_ids", "attention_mask"],
        output_names=["cat_logits", "spec_logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "seq"},
            "attention_mask": {0: "batch", 1: "seq"},
            "cat_logits": {0: "batch"},
            "spec_logits": {0: "batch"},
        },
        opset_version=17,
        dynamo=False,
        do_constant_folding=True,
    )

    del wrap, model
    gc.collect()
    torch.cuda.empty_cache()


def convert_fp16(fp32_path: Path, fp16_path: Path) -> None:
    """Convert an fp32 ONNX model to fp16 via onnxconverter_common."""
    import onnx
    from onnxconverter_common import float16

    print(f"  loading {fp32_path}")
    model = onnx.load(str(fp32_path), load_external_data=True)
    print("  converting to fp16...")
    model_fp16 = float16.convert_float_to_float16(
        model, keep_io_types=False, disable_shape_infer=True,
    )
    print(f"  saving → {fp16_path}")
    onnx.save_model(
        model_fp16, str(fp16_path),
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location=fp16_path.name + ".data",
        size_threshold=1024,
    )


def quantize_int8_dynamic(fp32_path: Path, int8_path: Path) -> None:
    """Dynamic int8 quantization (weights → int8, activations on the fly).

    Two shape-inference paths in the ORT quantizer choke on the dynamo
    export of ModernBERT-large:

    1. `SymbolicShapeInference._infer_Range` asserts on the dynamic limit
       input emitted by RoPE (`assert len(x) == 1` in `as_scalar`).
    2. `onnx.shape_inference.infer_shapes_path` (C++) raises a (1024)/(7)
       dim mismatch on the category head Gemm — the dynamo decomposition
       leaves a dimension hint the C++ inferencer disagrees with.

    The skip flags on `quant_pre_process` are ignored (it always runs
    `SymbolicShapeInference.infer_shapes`), and `ONNXQuantizer.__init__`
    calls `save_and_reload_model_with_shape_infer` unconditionally. We
    monkey-patch both to no-ops, then run `quantize_dynamic` restricted
    to MatMul/Gemm ops (the only nodes we want quantized anyway).
    """
    import onnx
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization import quant_utils
    from onnxruntime.tools import symbolic_shape_infer as sym

    # No-op the broken shape passes.
    original_save_reload = quant_utils.save_and_reload_model_with_shape_infer

    def _passthrough(model):
        return model

    quant_utils.save_and_reload_model_with_shape_infer = _passthrough
    # Some imports cache the symbol — patch the onnx_quantizer module too.
    import onnxruntime.quantization.onnx_quantizer as oq
    oq.save_and_reload_model_with_shape_infer = _passthrough

    try:
        print(f"  quantizing {fp32_path} → {int8_path}")
        quantize_dynamic(
            model_input=str(fp32_path),
            model_output=str(int8_path),
            weight_type=QuantType.QInt8,
            per_channel=True,
            reduce_range=False,
            op_types_to_quantize=["MatMul", "Gemm"],
            use_external_data_format=True,
            extra_options={"DefaultTensorType": onnx.TensorProto.FLOAT},
        )
    finally:
        quant_utils.save_and_reload_model_with_shape_infer = original_save_reload
        oq.save_and_reload_model_with_shape_infer = original_save_reload


# ──────────────────────────────────────────────────────────────────────
# Inference + metrics
# ──────────────────────────────────────────────────────────────────────

def _files_size(model_path: Path) -> int:
    """Sum of model.onnx + any external .data files in the same dir."""
    total = model_path.stat().st_size
    for sib in model_path.parent.iterdir():
        if sib.name.startswith(model_path.name) and sib != model_path:
            total += sib.stat().st_size
    return total


def run_onnx(model_path: Path, texts: list[str], use_cuda: bool = True) -> dict:
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "../checkpoints/finetune/iter1-independent/final"
    )

    so = ort.SessionOptions()
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    providers = (
        ["CUDAExecutionProvider", "CPUExecutionProvider"] if use_cuda
        else ["CPUExecutionProvider"]
    )

    free_before, total_vram = torch.cuda.mem_get_info()
    sess = ort.InferenceSession(str(model_path), so, providers=providers)
    free_after_load, _ = torch.cuda.mem_get_info()
    load_vram_mb = (free_before - free_after_load) / (1024 ** 2)

    # Warmup
    warm_enc = tokenizer(
        texts[:BATCH_SIZE], truncation=True, max_length=MAX_SEQ,
        padding="longest", return_tensors="np",
    )
    warm_inputs = {
        "input_ids": warm_enc["input_ids"].astype(np.int64),
        "attention_mask": warm_enc["attention_mask"].astype(np.int64),
    }
    for _ in range(WARMUP_BATCHES):
        sess.run(None, warm_inputs)

    free_after_warm, _ = torch.cuda.mem_get_info()
    peak_vram_mb = (free_before - free_after_warm) / (1024 ** 2)

    cat_logits_list = []
    spec_logits_list = []
    total_time = 0.0
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i : i + BATCH_SIZE]
        enc = tokenizer(
            batch, truncation=True, max_length=MAX_SEQ,
            padding="longest", return_tensors="np",
        )
        inputs = {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        }
        t0 = time.perf_counter()
        out = sess.run(None, inputs)
        total_time += time.perf_counter() - t0
        cat_logits_list.append(torch.from_numpy(out[0].astype(np.float32)))
        spec_logits_list.append(torch.from_numpy(out[1].astype(np.float32)))

    free_end, _ = torch.cuda.mem_get_info()
    peak_vram_mb = max(peak_vram_mb, (free_before - free_end) / (1024 ** 2))

    del sess
    gc.collect()
    torch.cuda.empty_cache()

    return {
        "cat_logits": torch.cat(cat_logits_list),
        "spec_logits": torch.cat(spec_logits_list),
        "ms_per_sample": (total_time / len(texts)) * 1000,
        "throughput": len(texts) / total_time,
        "peak_vram_mb": peak_vram_mb,
        "load_vram_mb": load_vram_mb,
        "providers": providers,
    }


# ──────────────────────────────────────────────────────────────────────
# Driver
# ──────────────────────────────────────────────────────────────────────

def main():
    print("loading holdout...")
    records = load_holdout_data(
        str(PARAGRAPHS), str(HOLDOUT), {k: str(v) for k, v in BENCHMARKS.items()},
    )
    texts = [r["text"] for r in records]
    print(f"  {len(records)} paragraphs")

    fp32_path = ONNX_DIR / "model_fp32.onnx"
    fp16_path = ONNX_DIR / "model_fp16.onnx"
    int8_path = ONNX_DIR / "model_int8_dyn.onnx"

    # ── Export fp32 (source for both fp16 and int8 quant) ──
    if not fp32_path.exists():
        print("\n══ exporting fp32 ONNX")
        export_fp32(fp32_path)
    else:
        print(f"\n══ reusing existing {fp32_path}")

    # ── fp16 conversion ──
    if not fp16_path.exists():
        print("\n══ converting → fp16 ONNX")
        convert_fp16(fp32_path, fp16_path)
    else:
        print(f"\n══ reusing existing {fp16_path}")

    # ── int8 dynamic quantization ──
    if not int8_path.exists():
        print("\n══ quantizing → int8 dynamic ONNX")
        quantize_int8_dynamic(fp32_path, int8_path)
    else:
        print(f"\n══ reusing existing {int8_path}")

    summary = []
    variants = [
        ("onnx-fp32", fp32_path),
        ("onnx-fp16", fp16_path),
        ("onnx-int8-dyn", int8_path),
    ]
    for name, path in variants:
        print(f"\n══ {name} — {path.name}")
        size_mb = _files_size(path) / 1e6
        print(f"  on-disk size: {size_mb:.1f} MB")
        try:
            inf = run_onnx(path, texts, use_cuda=True)
            print(
                f"  latency {inf['ms_per_sample']:.2f} ms/sample, "
                f"throughput {inf['throughput']:.0f}/s, "
                f"peak VRAM {inf['peak_vram_mb']:.0f} MB "
                f"(load {inf['load_vram_mb']:.0f} MB)"
            )
            row = {
                "variant": name,
                "model_mb": size_mb,
                "ms_per_sample": inf["ms_per_sample"],
                "throughput_per_s": inf["throughput"],
                "peak_vram_mb": inf["peak_vram_mb"],
                "load_vram_mb": inf["load_vram_mb"],
            }
            for ref in BENCHMARKS:
                m = evaluate_predictions(inf["cat_logits"], inf["spec_logits"], records, ref)
                print(
                    f"  vs {ref}: cat F1={m['cat_macro_f1']:.4f}, "
                    f"spec F1={m['spec_macro_f1']:.4f}, QWK={m['spec_qwk']:.4f}, "
                    f"cat ECE={m['cat_ece']:.4f}, spec ECE={m['spec_ece']:.4f}"
                )
                row[f"{ref}_cat_f1"] = m["cat_macro_f1"]
                row[f"{ref}_spec_f1"] = m["spec_macro_f1"]
                row[f"{ref}_cat_mcc"] = m["cat_mcc"]
                row[f"{ref}_spec_qwk"] = m["spec_qwk"]
                row[f"{ref}_spec_mae"] = m["spec_mae"]
                row[f"{ref}_cat_ece"] = m["cat_ece"]
                row[f"{ref}_spec_ece"] = m["spec_ece"]
            summary.append(row)
        except Exception as e:
            import traceback
            traceback.print_exc()
            summary.append({"variant": name, "error": f"{type(e).__name__}: {e}"})

    summary_path = OUTPUT_DIR / "summary.json"
    with open(summary_path, "w") as f:
        json.dump(summary, f, indent=2, default=str)
    print(f"\nsummary → {summary_path}")

    print("\n" + "=" * 110)
    print(f"{'variant':<18} {'MB':>9} {'ms/samp':>9} {'throughput':>11} "
          f"{'VRAM MB':>9} {'cat F1':>9} {'spec F1':>9} {'spec QWK':>9}")
    print("-" * 110)
    for r in summary:
        if "error" in r:
            print(f"{r['variant']:<18} ERROR: {r['error']}")
            continue
        print(
            f"{r['variant']:<18} {r['model_mb']:>9.1f} {r['ms_per_sample']:>9.2f} "
            f"{r['throughput_per_s']:>11.0f} {r['peak_vram_mb']:>9.0f} "
            f"{r['GPT-5.4_cat_f1']:>9.4f} {r['GPT-5.4_spec_f1']:>9.4f} "
            f"{r['GPT-5.4_spec_qwk']:>9.4f}"
        )
    print("=" * 110)
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
491 python/scripts/quantize_sweep.py Normal file
@@ -0,0 +1,491 @@
"""Quantization sweep for the iter1-independent ModernBERT-large checkpoint.

Loads the trained DualHeadModernBERT, applies a series of quantization
schemes to the *encoder* (heads kept in their native dtype), and re-runs
holdout evaluation against the GPT-5.4 / Opus-4.6 proxy gold.

For each variant we record:
- cat / spec macro F1, per-class F1, QWK, MAE, ECE
- peak VRAM (encoder forward)
- latency (ms/sample, batch=64) and throughput
- encoder parameter footprint in MB
- delta vs bf16 baseline

Variants:
    fp32, bf16 (baseline), fp16,
    torchao int8 weight-only,
    torchao int8 dynamic-act + int8 weight,
    torchao int4 weight-only (group=128),
    bitsandbytes LLM.int8 (8-bit),
    bitsandbytes nf4 (4-bit, double-quant, bf16 compute).

Heads (category linear, attention pooler, independent threshold MLPs)
stay in bf16 — they sit on a 1024-dim representation and account for
< 0.3% of params, so quantizing them buys nothing and risks the threshold
margins which already drive most of the spec error budget.

Usage:
    bun run py:quant          # via package.json wrapper
    # or directly:
    cd python && uv run scripts/quantize_sweep.py
"""

from __future__ import annotations

import gc
import json
import sys
import time
import traceback
from dataclasses import dataclass, field
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from safetensors.torch import load_file
from transformers import AutoModel, AutoTokenizer

# Make `src` importable when run as a script
ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))

from src.finetune.data import CAT2ID, CATEGORIES, NUM_CATEGORIES, NUM_SPECIFICITY  # noqa: E402
from src.finetune.eval import (  # noqa: E402
    SPEC_LABELS,
    compute_all_metrics,
    load_holdout_data,
)
from src.finetune.model import DualHeadModernBERT, ordinal_predict  # noqa: E402
REPO = ROOT.parent
CHECKPOINT = REPO / "checkpoints/finetune/iter1-independent/final"
PARAGRAPHS = REPO / "data/paragraphs/paragraphs-clean.patched.jsonl"
HOLDOUT = REPO / "data/gold/v2-holdout-ids.json"
BENCHMARKS = {
    "GPT-5.4": REPO / "data/annotations/v2-bench/gpt-5.4.jsonl",
    "Opus-4.6": REPO / "data/annotations/v2-bench/opus-4.6.jsonl",
}
OUTPUT_DIR = REPO / "results/eval/quant"
BATCH_SIZE = 64
MAX_SEQ = 512
WARMUP_BATCHES = 5


# ──────────────────────────────────────────────────────────────────────
# Model loading
# ──────────────────────────────────────────────────────────────────────

def _build_model(dtype: torch.dtype, attn_impl: str = "sdpa") -> tuple[DualHeadModernBERT, AutoTokenizer]:
    """Construct DualHeadModernBERT and load trained weights at the requested dtype."""
    tokenizer = AutoTokenizer.from_pretrained(str(CHECKPOINT))
    backbone = AutoModel.from_pretrained(
        "answerdotai/ModernBERT-large",
        trust_remote_code=True,
        attn_implementation=attn_impl,
        dtype=dtype,
    )
    model = DualHeadModernBERT(
        backbone=backbone,
        hidden_size=backbone.config.hidden_size,
        num_categories=NUM_CATEGORIES,
        num_specificity=NUM_SPECIFICITY,
        specificity_head_type="independent",
        spec_mlp_dim=256,
        pooling="attention",
    )
    state = load_file(str(CHECKPOINT / "model.safetensors"))
    model.load_state_dict(state, strict=False)
    model = model.to(dtype)
    model.eval()
    return model, tokenizer


def _try_flash_attn() -> str:
    try:
        import flash_attn  # noqa: F401
        return "flash_attention_2"
    except ImportError:
        return "sdpa"


# ──────────────────────────────────────────────────────────────────────
# Quantization variants
# ──────────────────────────────────────────────────────────────────────

def variant_native(dtype: torch.dtype, attn: str | None = None):
    def _build():
        impl = attn or _try_flash_attn()
        # bf16/fp16 supported by flash-attn; fp32 must use sdpa
        if dtype == torch.float32:
            impl = "sdpa"
        model, tok = _build_model(dtype, attn_impl=impl)
        return model.cuda(), tok
    return _build


def variant_torchao(config_factory):
    def _build():
        from torchao.quantization import quantize_
        # torchao expects bf16 master weights
        model, tok = _build_model(torch.bfloat16, attn_impl=_try_flash_attn())
        model = model.cuda()
        # Quantize encoder linears only (skip heads + attention pooler)
        quantize_(model.backbone, config_factory())
        return model, tok
    return _build


def _swap_bnb_linear(
    module: nn.Module,
    mode: str,
    compute_dtype=torch.bfloat16,
    compress_statistics: bool = True,
) -> int:
    """Recursively replace nn.Linear with bnb 8-bit / 4-bit equivalents.

    Returns number of layers swapped. Copies weights from the original
    module so the trained checkpoint is preserved.
    """
    import bitsandbytes as bnb

    swapped = 0
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            in_f, out_f = child.in_features, child.out_features
            has_bias = child.bias is not None
            if mode == "int8":
                new = bnb.nn.Linear8bitLt(
                    in_f, out_f, bias=has_bias,
                    has_fp16_weights=False, threshold=6.0,
                )
                new.weight = bnb.nn.Int8Params(
                    child.weight.data.clone(),
                    requires_grad=False,
                    has_fp16_weights=False,
                )
                if has_bias:
                    new.bias = nn.Parameter(child.bias.data.clone())
            elif mode in ("nf4", "fp4"):
                new = bnb.nn.Linear4bit(
                    in_f, out_f, bias=has_bias,
                    compute_dtype=compute_dtype,
                    quant_type=mode,
                    compress_statistics=compress_statistics,
                    quant_storage=torch.uint8,
                    device="cuda",
                )
                w = child.weight.data.detach().to(torch.float32).clone()
                new.weight = bnb.nn.Params4bit(
                    w, requires_grad=False, quant_type=mode,
                    compress_statistics=compress_statistics, module=new,
                ).cuda()
                if has_bias:
                    new.bias = nn.Parameter(
                        child.bias.data.detach().to(compute_dtype).clone().cuda()
                    )
            else:
                raise ValueError(mode)
            new = new.cuda()
            setattr(module, name, new)
            swapped += 1
        else:
            # Propagate compress_statistics into the recursion (the original
            # call dropped it, silently re-enabling double-quant in children).
            swapped += _swap_bnb_linear(child, mode, compute_dtype, compress_statistics)
    return swapped


def variant_bnb(mode: str, compress_statistics: bool = True):
    def _build():
        model, tok = _build_model(torch.bfloat16, attn_impl="sdpa")
        model = model.cuda()
        n = _swap_bnb_linear(
            model.backbone, mode, compress_statistics=compress_statistics,
        )
        print(f"  bnb {mode} (cs={compress_statistics}): swapped {n} linears")
        return model, tok
    return _build
# ──────────────────────────────────────────────────────────────────────
# Inference + measurement
# ──────────────────────────────────────────────────────────────────────

def _encoder_param_bytes(model: DualHeadModernBERT) -> int:
    """Sum bytes of every parameter / buffer inside the encoder backbone.

    Handles bnb Int8Params (int8 storage) and Params4bit (uint8 packed)
    correctly because element_size() reflects the storage dtype.
    """
    total = 0
    seen = set()
    for p in list(model.backbone.parameters()) + list(model.backbone.buffers()):
        if id(p) in seen:
            continue
        seen.add(id(p))
        total += p.numel() * p.element_size()
    return total


@torch.no_grad()
def run_inference(model, tokenizer, texts: list[str]) -> dict:
    device = next(model.parameters()).device
    cat_logits_list = []
    spec_logits_list = []

    # Warmup
    warm_batch = tokenizer(
        texts[:BATCH_SIZE], truncation=True, max_length=MAX_SEQ,
        padding="longest", return_tensors="pt",
    ).to(device)
    for _ in range(WARMUP_BATCHES):
        _ = model(input_ids=warm_batch["input_ids"], attention_mask=warm_batch["attention_mask"])
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    total_time = 0.0
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i : i + BATCH_SIZE]
        enc = tokenizer(
            batch, truncation=True, max_length=MAX_SEQ,
            padding="longest", return_tensors="pt",
        ).to(device)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
        torch.cuda.synchronize()
        total_time += time.perf_counter() - t0
        cat_logits_list.append(out["category_logits"].float().cpu())
        spec_logits_list.append(out["specificity_logits"].float().cpu())

    peak_vram = torch.cuda.max_memory_allocated()
    cat_logits = torch.cat(cat_logits_list)
    spec_logits = torch.cat(spec_logits_list)
    return {
        "cat_logits": cat_logits,
        "spec_logits": spec_logits,
        "total_time_s": total_time,
        "ms_per_sample": (total_time / len(texts)) * 1000,
        "throughput": len(texts) / total_time,
        "peak_vram_mb": peak_vram / (1024 ** 2),
        "num_samples": len(texts),
    }


def evaluate_predictions(
    cat_logits: torch.Tensor,
    spec_logits: torch.Tensor,
    records: list[dict],
    ref_name: str,
) -> dict:
    cat_probs_all = F.softmax(cat_logits, dim=1).numpy()
    cat_preds_all = cat_logits.argmax(dim=1).numpy()
    spec_preds_all = ordinal_predict(spec_logits).numpy()
    # ordinal → class probs
    sp = torch.sigmoid(spec_logits)
    K = sp.shape[1] + 1
    spec_probs_all = torch.zeros(sp.shape[0], K)
    spec_probs_all[:, 0] = 1 - sp[:, 0]
    for k in range(1, K - 1):
        spec_probs_all[:, k] = sp[:, k - 1] - sp[:, k]
    spec_probs_all[:, -1] = sp[:, -1]
    spec_probs_all = spec_probs_all.clamp(min=0)
    spec_probs_all = spec_probs_all / spec_probs_all.sum(dim=1, keepdim=True)
    spec_probs_all = spec_probs_all.numpy()

    cat_labels, spec_labels = [], []
    cat_p, spec_p, cat_pr, spec_pr = [], [], [], []
    for i, rec in enumerate(records):
        b = rec["benchmark_labels"].get(ref_name)
        if b is None:
            continue
        cat_labels.append(CAT2ID[b["category"]])
        spec_labels.append(b["specificity"] - 1)
        cat_p.append(cat_preds_all[i])
        spec_p.append(spec_preds_all[i])
        cat_pr.append(cat_probs_all[i])
        spec_pr.append(spec_probs_all[i])

    cat_m = compute_all_metrics(
        np.array(cat_p), np.array(cat_labels), np.array(cat_pr),
        CATEGORIES, "cat", is_ordinal=False,
    )
    spec_m = compute_all_metrics(
        np.array(spec_p), np.array(spec_labels), np.array(spec_pr),
        SPEC_LABELS, "spec", is_ordinal=True,
    )
    return {**cat_m, **spec_m}
# ──────────────────────────────────────────────────────────────────────
# Variant registry
# ──────────────────────────────────────────────────────────────────────

@dataclass
class Variant:
    name: str
    description: str
    builder: callable
    skip_reason: str | None = None


def build_variants() -> list[Variant]:
    from torchao.quantization import (
        Int4WeightOnlyConfig,
        Int8DynamicActivationInt8WeightConfig,
        Int8WeightOnlyConfig,
    )

    return [
        Variant("fp32", "Float32 encoder + heads", variant_native(torch.float32, attn="sdpa")),
        Variant("bf16", "BFloat16 baseline (matches eval pipeline)", variant_native(torch.bfloat16)),
        Variant("fp16", "Float16 encoder + heads", variant_native(torch.float16)),
        Variant(
            "torchao-int8-wo",
            "torchao Int8 weight-only on encoder linears",
            variant_torchao(lambda: Int8WeightOnlyConfig()),
        ),
        Variant(
            "torchao-int8-dyn",
            "torchao Int8 dynamic activation + Int8 weight on encoder",
            variant_torchao(lambda: Int8DynamicActivationInt8WeightConfig()),
        ),
        Variant(
            "torchao-int4-wo",
            "torchao Int4 weight-only (group=128) on encoder",
            variant_torchao(lambda: Int4WeightOnlyConfig(group_size=128)),
        ),
        Variant("bnb-int8", "bitsandbytes LLM.int8 on encoder linears", variant_bnb("int8")),
        Variant("bnb-nf4", "bitsandbytes NF4 4-bit (double-quant, bf16 compute)", variant_bnb("nf4", compress_statistics=True)),
        Variant("bnb-nf4-nodq", "bitsandbytes NF4 4-bit (no double-quant)", variant_bnb("nf4", compress_statistics=False)),
        Variant("bnb-fp4", "bitsandbytes FP4 4-bit (no double-quant)", variant_bnb("fp4", compress_statistics=False)),
    ]


# ──────────────────────────────────────────────────────────────────────
# Driver
# ──────────────────────────────────────────────────────────────────────

def free():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()


def main():
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Loading holdout from {HOLDOUT}")
    records = load_holdout_data(
        str(PARAGRAPHS), str(HOLDOUT), {k: str(v) for k, v in BENCHMARKS.items()},
    )
    texts = [r["text"] for r in records]
    print(f"  {len(records)} holdout paragraphs loaded")

    variants = build_variants()
    summary = []

    for v in variants:
        print(f"\n══ {v.name} — {v.description}")
        free()
        try:
            t0 = time.perf_counter()
            model, tokenizer = v.builder()
            build_s = time.perf_counter() - t0
            enc_bytes = _encoder_param_bytes(model)
            print(f"  encoder footprint: {enc_bytes / 1e6:.1f} MB (build {build_s:.1f}s)")
            inf = run_inference(model, tokenizer, texts)
            print(
                f"  latency {inf['ms_per_sample']:.2f} ms/sample, "
                f"throughput {inf['throughput']:.0f}/s, "
                f"peak VRAM {inf['peak_vram_mb']:.0f} MB"
            )

            metrics_per_ref = {}
            for ref in BENCHMARKS:
                m = evaluate_predictions(inf["cat_logits"], inf["spec_logits"], records, ref)
                metrics_per_ref[ref] = m
                print(
                    f"  vs {ref}: cat F1={m['cat_macro_f1']:.4f}, "
                    f"spec F1={m['spec_macro_f1']:.4f}, QWK={m['spec_qwk']:.4f}, "
                    f"cat ECE={m['cat_ece']:.4f}, spec ECE={m['spec_ece']:.4f}"
                )

            row = {
                "variant": v.name,
                "description": v.description,
                "encoder_mb": enc_bytes / 1e6,
                "ms_per_sample": inf["ms_per_sample"],
                "throughput_per_s": inf["throughput"],
                "peak_vram_mb": inf["peak_vram_mb"],
                "build_s": build_s,
            }
            for ref, m in metrics_per_ref.items():
                row[f"{ref}_cat_f1"] = m["cat_macro_f1"]
                row[f"{ref}_spec_f1"] = m["spec_macro_f1"]
                row[f"{ref}_cat_mcc"] = m["cat_mcc"]
                row[f"{ref}_spec_qwk"] = m["spec_qwk"]
                row[f"{ref}_spec_mae"] = m["spec_mae"]
                row[f"{ref}_cat_ece"] = m["cat_ece"]
                row[f"{ref}_spec_ece"] = m["spec_ece"]
                # per-spec-level F1
                for s in SPEC_LABELS:
                    short = s.replace(" ", "").replace(":", "")[:8]
                    row[f"{ref}_spec_f1_{short}"] = m.get(f"spec_f1_{short}", 0)
            summary.append(row)

            # Per-variant detailed metrics dump
            vdir = OUTPUT_DIR / v.name
            vdir.mkdir(parents=True, exist_ok=True)
            with open(vdir / "metrics.json", "w") as f:
                ser = {}
                for ref, m in metrics_per_ref.items():
                    ser[ref] = {
                        k: (v_ if not isinstance(v_, np.ndarray) else v_.tolist())
                        for k, v_ in m.items()
                        if isinstance(v_, (int, float, str, list, bool))
                    }
                ser["_runtime"] = {
                    "encoder_mb": enc_bytes / 1e6,
                    "ms_per_sample": inf["ms_per_sample"],
                    "throughput_per_s": inf["throughput"],
                    "peak_vram_mb": inf["peak_vram_mb"],
                    "build_s": build_s,
                }
                json.dump(ser, f, indent=2, default=str)

            del model, tokenizer, inf
        except Exception as e:
            print(f"  FAILED: {type(e).__name__}: {e}")
            traceback.print_exc()
            summary.append({
                "variant": v.name,
                "description": v.description,
                "error": f"{type(e).__name__}: {e}",
            })
        free()

    # Write summary
    summary_path = OUTPUT_DIR / "summary.json"
    with open(summary_path, "w") as f:
        json.dump(summary, f, indent=2, default=str)
    print(f"\nSummary written to {summary_path}")

    # Print compact table
    print("\n" + "=" * 110)
    print(f"{'variant':<18} {'enc MB':>9} {'ms/samp':>9} {'throughput':>11} "
          f"{'VRAM MB':>9} {'cat F1':>9} {'spec F1':>9} {'spec QWK':>9}")
    print("-" * 110)
    for r in summary:
        if "error" in r:
            print(f"{r['variant']:<18} ERROR: {r['error']}")
            continue
        print(
            f"{r['variant']:<18} {r['encoder_mb']:>9.1f} {r['ms_per_sample']:>9.2f} "
            f"{r['throughput_per_s']:>11.0f} {r['peak_vram_mb']:>9.0f} "
            f"{r['GPT-5.4_cat_f1']:>9.4f} {r['GPT-5.4_spec_f1']:>9.4f} {r['GPT-5.4_spec_qwk']:>9.4f}"
        )
    print("=" * 110)


if __name__ == "__main__":
    main()
117 results/eval/onnx/REPORT.md Normal file
@@ -0,0 +1,117 @@
# ONNX Export + Eval — iter1-independent ModernBERT-large

**Date:** 2026-04-07
**Checkpoint:** `checkpoints/finetune/iter1-independent/final/`
**Hardware:** RTX 3090 (sm_8.6, 24 GB), onnxruntime-gpu 1.24.4, onnx 1.21
**Driver:** `python/scripts/onnx_export_eval.py` (`bun run py:onnx`)
**Eval set:** 1,200-paragraph v2 holdout, proxy gold = GPT-5.4 + Opus-4.6

## TL;DR

ONNX export of this model is *technically* possible but the path is full of
dead ends. The dynamo exporter produces a graph with 56 Memcpy nodes that
makes ORT 8× slower than native torch and 4× more VRAM-heavy; the legacy
TorchScript exporter produces a clean graph that's actually 22% faster than
torch fp32 (kernel fusion); fp16 conversion breaks on the rotary embedding;
dynamic int8 quantization via ORT silently falls back to CPU and drops
~0.5 macro F1. **Net: torchao int8-wo from the earlier sweep is still the
right int8 deployment path. ONNX is not.**

## What we tried

| variant | exporter | size MB | ms/sample | VRAM MB | cat F1 | spec F1 | result |
|--------------------|--------------------------|--------:|----------:|--------:|-------:|--------:|-----------------|
| onnx-fp32 (dynamo) | torch.onnx (dynamo) | 1583 | 42.92 | 15388 | 0.9337 | 0.8943 | works but unusable |
| onnx-int8 (dynamo) | dynamo + ORT int8 | 1580 | 42.82 | 15398 | 0.9337 | 0.8943 | no-op (no quant) |
| **onnx-fp32 (legacy)** | torch.onnx (TorchScript) | 1583 | **12.70** | 8228 | 0.9337 | 0.8952 | **clean graph, faster than torch** |
| onnx-fp16 (legacy) | onnxconverter_common | 754 | err | err | err | err | rotary type unify |
| onnx-int8 (legacy) | ORT quantize_dynamic | 527 | 95.91 | ~CPU | 0.3972 | 0.3364 | CPU fallback + accuracy collapse |

(All entries above were re-run from scratch — fp32 timing improved 3× moving
from dynamo to legacy export.)
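The two export paths in the table can be sketched roughly as follows. This is a hedged sketch, not the driver's actual code: the wrapper arguments, output names, opset version, and paths are illustrative assumptions; the `dynamo=True` / `optimize=False` flags are the ones the workarounds below refer to.

```python
def dynamic_axes(names, batch_axis=0, seq_axis=1):
    """Mark batch and sequence dims as dynamic for each listed tensor name."""
    return {n: {batch_axis: "batch", seq_axis: "seq"} for n in names}


def export_both_ways(model, example, out_dir):
    """Export the same wrapper module via the legacy and dynamo exporters."""
    import torch  # deferred so dynamic_axes stays importable without torch

    io_names = ["input_ids", "attention_mask"]
    # Legacy TorchScript exporter — the path that produced the clean graph.
    torch.onnx.export(
        model, example, f"{out_dir}/model_fp32.onnx",
        input_names=io_names,
        output_names=["category_logits", "specificity_logits"],
        dynamic_axes=dynamic_axes(io_names),
        opset_version=17,
    )
    # Dynamo exporter; optimize=False sidesteps the InlinePass crash below.
    torch.onnx.export(
        model, example, f"{out_dir}/model_fp32_dynamo.onnx",
        dynamo=True, optimize=False,
    )
```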
## Six things broke along the way (workarounds in the script)

1. **Dynamo exporter optimizer crashes.** `torch.onnx.export(..., dynamo=True)`
   succeeds at translation but the post-translation `InlinePass` optimizer
   trips on `onnx_ir`. Workaround: `optimize=False`.
2. **Dynamo-exported graph is unusable on CUDA EP.** ORT inserts 56 Memcpy
   nodes between layers because dynamo emits scalar tensors with CPU
   placement metadata. Result: 42.9 ms/sample (8× torch fp32) and 15.4 GB
   VRAM (4.4× torch fp32). The legacy exporter only inserts 1 Memcpy.
3. **`op_types_to_quantize=['MatMul']` quantizes nothing on the dynamo
   graph.** Dynamo emits encoder linears as `Gemm` nodes, not `MatMul`.
   Fix: `op_types_to_quantize=['MatMul', 'Gemm']`.
4. **Both ORT shape-inference paths choke on ModernBERT.** Symbolic
   inference asserts in `_infer_Range` (the rotary embedding limit input is
   not a scalar); the C++ inference raises a (1024)/(7) dim mismatch on
   the category head Gemm. The `skip_*` flags on `quant_pre_process` are
   ignored, and `ONNXQuantizer.__init__` calls
   `save_and_reload_model_with_shape_infer` unconditionally. Workaround:
   monkey-patch `quant_utils.save_and_reload_model_with_shape_infer`
   *and* the cached binding in `onnx_quantizer` to a no-op, then pass
   `extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}` so the
   quantizer can still type the head MatMul.
5. **fp16 conversion via `onnxconverter_common` breaks on rotary
   embeddings.** Two distinct failure modes seen across exports:
   `Type Error: Type (tensor(float16)) of output arg (val_58) of node
   (node_Expand_56) does not match expected type (tensor(float))` (dynamo
   graph) and `Type parameter (T) of Optype (Mul) bound to different types
   (tensor(float) and tensor(float16) in node
   (/model/backbone/rotary_emb_1/Mul_2)` (legacy graph). The converter
   leaves the `inv_freq` buffer in fp32 and the surrounding Mul/Expand
   ops then can't unify their type parameter. Could be patched with an
   `op_block_list` for the rotary subgraph, but the cost/value isn't
   there given the dynamic int8 result below.
6. **Dynamic int8 via ORT silently falls back to CPU.** The quantizer
   replaces Gemm/MatMul with `MatMulInteger` + `DynamicQuantizeLinear`,
   neither of which has CUDA kernels in onnxruntime-gpu 1.24. Session
   creation succeeds with CUDAExecutionProvider but routes the
   quantized ops to the CPU EP — observable from the `load_vram_mb`
   collapsing from 2074 MB (fp32) to 266 MB (int8) and latency exploding
   to 95.9 ms/sample. Per-channel int8 weights also drop accuracy from
   0.934 → 0.397 on category and 0.895 → 0.336 on spec, further
   confirming the kernel path is wrong (not just slow).
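Workaround 4 boils down to swapping one function for a no-op in two places before the quantizer runs. A minimal sketch, assuming the module layout observed in onnxruntime 1.24 (`quant_utils` plus the cached binding in `onnx_quantizer`); `patch_attr` is our own helper, not an ORT API:

```python
def patch_attr(mod, name, replacement):
    """Replace `mod.name`, returning the original so it can be restored."""
    original = getattr(mod, name)
    setattr(mod, name, replacement)
    return original


def disable_ort_shape_infer():
    """No-op ORT's unconditional shape-inference reload (it chokes on ModernBERT)."""
    from onnxruntime.quantization import onnx_quantizer, quant_utils

    def _skip(model, *args, **kwargs):
        return model  # hand the in-memory model back untouched

    patch_attr(quant_utils, "save_and_reload_model_with_shape_infer", _skip)
    patch_attr(onnx_quantizer, "save_and_reload_model_with_shape_infer", _skip)
```

Both patches are needed because `onnx_quantizer` imports the function by name at module load, so patching only `quant_utils` leaves the cached binding live.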
## What actually works

**onnx-fp32 via the legacy TorchScript exporter** is the one clean win:
12.70 ms/sample vs 16.29 for torch fp32 — a **22% latency improvement
from ORT's LayerNorm/Gelu/MatMul fusion** at bit-identical accuracy. VRAM
is 8228 MB vs 3504 MB for torch fp32 (the ORT session allocates a separate
~5 GB workspace), so the speedup costs you ~2.3× memory. On a single
3090 batch=64 inference run that's a fair trade.

But this is fp32 — bf16 torch + flash-attn-2 is *still* the strict winner
at 5.52 ms / 1741 MB (Phase 10.8 result). ORT can't run bf16 natively, and
fp16 conversion is broken. So even the working ONNX path is dominated by
what we already ship.
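Both the Memcpy counts and the fusion claim are observable from the graph itself: ask ORT to dump its optimized model, then tally node op types. A sketch (the file paths are placeholders; `onnx` and `onnxruntime` are assumed installed):

```python
from collections import Counter


def op_histogram(op_types):
    """Tally op types; Memcpy* nodes mark host/device hops, fused ops confirm fusion."""
    hist = Counter(op_types)
    memcpy = sum(n for op, n in hist.items() if op.startswith("Memcpy"))
    return hist, memcpy


def audit_optimized_graph(model_path, dump_path):
    import onnx
    import onnxruntime as ort

    so = ort.SessionOptions()
    so.optimized_model_filepath = dump_path  # ORT writes its post-fusion graph here
    ort.InferenceSession(
        model_path, so,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    graph = onnx.load(dump_path).graph
    return op_histogram([n.op_type for n in graph.node])
```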
## Recommendation

**Don't use ONNX for this model on this hardware.** The torchao int8-wo
result from the quantization sweep (5.52 → 6.08 ms, 1741 → 1416 MB peak
VRAM, F1 within ±0.001) covers the "smaller deployment" use case more
cleanly than anything ONNX can offer here, and bf16 + flash-attn-2
remains the production default.

ONNX *would* be worth revisiting in any of these scenarios:
- **CPU-only deployment** — fp32 ONNX runs fine on CPUExecutionProvider
  and ORT's int8 dynamic path is actually designed for this case. Worth
  benchmarking if a CPU serving target ever shows up.
- **Cross-runtime portability** — TensorRT, OpenVINO, mobile runtimes.
  These would each need their own export validation pass.
- **Static int8 with calibration** — `quantize_static` with a calibration
  dataset can avoid the dynamic-quant CPU fallback path. Would need a
  ModernBERT-friendly calibration loop and probably an `op_block_list`
  to keep the rotary in fp32. Real engineering work, not a one-shot.
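For the record, the static-int8 route would look roughly like this: a `CalibrationDataReader`-shaped iterator over pre-tokenized holdout batches fed to `quantize_static`. Untested sketch; the batch format and the empty `nodes_to_exclude` are assumptions (the real run would need the rotary subgraph node names there):

```python
class ListDataReader:
    """CalibrationDataReader-compatible: yields dicts of input-name → array, then None."""

    def __init__(self, batches):
        self._it = iter(batches)

    def get_next(self):
        return next(self._it, None)


def quantize_static_sketch(fp32_path, int8_path, calib_batches):
    from onnxruntime.quantization import QuantType, quantize_static

    quantize_static(
        fp32_path, int8_path,
        calibration_data_reader=ListDataReader(calib_batches),
        weight_type=QuantType.QInt8,
        nodes_to_exclude=[],  # placeholder: rotary subgraph nodes would go here
    )
```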
## Reproduce

```bash
bun run py:onnx
# writes to:
#   results/eval/onnx/models/{model_fp32,model_fp16,model_int8_dyn}.onnx[.data]
#   results/eval/onnx/summary.json
#   results/eval/onnx/REPORT.md (this file)
```
50 results/eval/onnx/summary.json Normal file
@@ -0,0 +1,50 @@
[
  {
    "variant": "onnx-fp32",
    "model_mb": 1583.256294,
    "ms_per_sample": 12.703279327494482,
    "throughput_per_s": 78.71983085781946,
    "peak_vram_mb": 8228.0,
    "load_vram_mb": 2074.0,
    "GPT-5.4_cat_f1": 0.9336741161693523,
    "GPT-5.4_spec_f1": 0.8951731906425856,
    "GPT-5.4_cat_mcc": 0.9226990724708704,
    "GPT-5.4_spec_qwk": 0.9324447137231142,
    "GPT-5.4_spec_mae": 0.1175,
    "GPT-5.4_cat_ece": 0.05386760701735813,
    "GPT-5.4_spec_ece": 0.07004604930679002,
    "Opus-4.6_cat_f1": 0.922684387023173,
    "Opus-4.6_spec_f1": 0.8833694419146193,
    "Opus-4.6_cat_mcc": 0.909266938399113,
    "Opus-4.6_spec_qwk": 0.9227008860372746,
    "Opus-4.6_spec_mae": 0.13583333333333333,
    "Opus-4.6_cat_ece": 0.06540583113829297,
    "Opus-4.6_spec_ece": 0.08156729981303217
  },
  {
    "variant": "onnx-fp16",
    "error": "Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /home/joey/Documents/sec-cyBERT/results/eval/onnx/models/model_fp16.onnx failed:Type Error: Type parameter (T) of Optype (Mul) bound to different types (tensor(float) and tensor(float16) in node (/model/backbone/rotary_emb_1/Mul_2)."
  },
  {
    "variant": "onnx-int8-dyn",
    "model_mb": 553.381903,
    "ms_per_sample": 95.90791940659983,
    "throughput_per_s": 10.42666764316426,
    "peak_vram_mb": 7188.0,
    "load_vram_mb": 266.0,
    "GPT-5.4_cat_f1": 0.3971686880679718,
    "GPT-5.4_spec_f1": 0.3364003775746365,
    "GPT-5.4_cat_mcc": 0.3459776856134484,
    "GPT-5.4_spec_qwk": 0.4412945592628398,
    "GPT-5.4_spec_mae": 0.7225,
    "GPT-5.4_cat_ece": 0.23434762333830195,
    "GPT-5.4_spec_ece": 0.35458642202119034,
    "Opus-4.6_cat_f1": 0.4150626036637055,
    "Opus-4.6_spec_f1": 0.3204015536108683,
    "Opus-4.6_cat_mcc": 0.3663561834842673,
    "Opus-4.6_spec_qwk": 0.43979676755288855,
    "Opus-4.6_spec_mae": 0.7375,
    "Opus-4.6_cat_ece": 0.2126809566716353,
    "Opus-4.6_spec_ece": 0.37541975535452365
  }
]
163
results/eval/quant/REPORT.md
Normal file
@@ -0,0 +1,163 @@
# Quantization Sweep — iter1-independent ModernBERT-large

**Date:** 2026-04-07
**Checkpoint:** `checkpoints/finetune/iter1-independent/final/`
**Hardware:** RTX 3090 (sm_8.6, 24 GB)
**Eval set:** 1,200-paragraph v2 holdout, proxy gold = GPT-5.4 + Opus-4.6
**Driver:** `python/scripts/quantize_sweep.py` (run via `bun run py:quant`)

## Setup

For each variant the *encoder* (ModernBERT-large backbone, 28 layers, 112
nn.Linear modules) is converted to the target precision/scheme, while the
attention pooler and the dual heads (category linear + 3 independent
threshold MLPs) are kept in bf16. Heads are <0.3% of params and sit on
already-distilled 1024-d representations — quantizing them buys nothing and
risks the threshold margins that drive most of the spec error budget.
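The backbone-only selection can be sketched with a module filter; a minimal sketch in plain PyTorch, where the `backbone`/`head` stand-in model and the `is_backbone_linear` helper are illustrative (not the real checkpoint), and the torchao call is guarded because its availability and kernel support vary by install:

```python
import torch
import torch.nn as nn

# Stand-in for the real checkpoint: an encoder under "backbone" plus a head.
# (Illustrative only — the real model is ModernBERT-large + dual heads.)
model = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16)),
    "head": nn.Linear(16, 4),
})

def is_backbone_linear(module: nn.Module, fqn: str) -> bool:
    """Quantize only nn.Linear modules under the encoder prefix."""
    return isinstance(module, nn.Linear) and fqn.startswith("backbone")

selected = [fqn for fqn, m in model.named_modules() if is_backbone_linear(m, fqn)]
print(selected)  # -> ['backbone.0', 'backbone.1']; the head is left alone

# With torchao installed, the same predicate drives the in-place conversion:
try:
    from torchao.quantization import quantize_, int8_weight_only
    quantize_(model, int8_weight_only(), filter_fn=is_backbone_linear)
except Exception:
    pass  # torchao missing, or no supported kernel on this machine
```

The predicate is the whole trick: the heads never enter the filter's accept set, so they stay in bf16 regardless of which scheme the sweep applies.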

For every variant we measure end-to-end inference on the full 1,200-paragraph
holdout at batch=64, max_seq=512, after 5 warmup batches:

- **encoder_mb** — sum of `param.numel() * param.element_size()` over the
  encoder. **Caveat:** for torchao tensor subclasses (`AffineQuantizedTensor`)
  this reports the *outer* dtype (bf16) rather than the int8 storage, so the
  790 MB figure for the torchao rows is an over-estimate; real on-disk
  storage is roughly half. The bnb 4-bit row (275 MB) is correct because
  `Params4bit` reports `uint8` element_size.
- **ms/sample** — wall-clock per paragraph at batch=64
- **peak VRAM** — `torch.cuda.max_memory_allocated()` over the timed run
  (encoder fwd + activations)
- **F1 / QWK / ECE** — full eval pipeline reused from `src/finetune/eval.py`
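The timing loop behind the ms/sample and peak-VRAM numbers can be sketched as below; the `bench` helper, the tiny stand-in model, and the batch shapes are illustrative, and the VRAM query is skipped when CUDA is absent:

```python
import time
import torch
import torch.nn as nn

def bench(model: nn.Module, batches: list, warmup: int = 5) -> dict:
    """Wall-clock ms/sample over `batches`, after `warmup` untimed batches."""
    model.eval()
    with torch.no_grad():
        for x in batches[:warmup]:        # warmup batches are not timed
            model(x)
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        n = 0
        for x in batches[warmup:]:
            model(x)
            n += x.shape[0]
        if torch.cuda.is_available():
            torch.cuda.synchronize()       # flush async kernels before stopping the clock
        elapsed = time.perf_counter() - t0
    out = {"ms_per_sample": 1000 * elapsed / n, "throughput_per_s": n / elapsed}
    if torch.cuda.is_available():
        out["peak_vram_mb"] = torch.cuda.max_memory_allocated() / 2**20
    return out

model = nn.Linear(64, 4)                   # stand-in for the encoder
batches = [torch.randn(64, 64) for _ in range(10)]
stats = bench(model, batches)
```

Synchronizing around the timed region matters on GPU: without it, `perf_counter` measures kernel *launch* time, not execution time.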

## Results

| variant | enc MB | ms/samp | thru/s | VRAM MB | cat F1 (GPT) | spec F1 (GPT) | spec QWK | cat F1 (Opus) | spec F1 (Opus) | notes |
|--------------------|-------:|--------:|-------:|--------:|-------------:|--------------:|---------:|--------------:|---------------:|--------------------------------|
| fp32 | 1579 | 16.29 | 61 | 3504 | 0.9337 | 0.8943 | 0.9321 | 0.9227 | 0.8825 | sdpa (no flash-attn) |
| **bf16 (baseline)**| 790 | 5.52 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 | 0.9227 | 0.8834 | flash-attn-2 |
| fp16 | 790 | 5.54 | 181 | 1741 | 0.9337 | 0.8952 | 0.9324 | 0.9227 | 0.8834 | flash-attn-2 |
| **torchao int8-wo**| ~395* | 6.08 | 165 | 1416 | 0.9345 | 0.8941 | 0.9330 | 0.9235 | 0.8815 | weight-only int8 |
| torchao int8-dyn | ~395* | 9.67 | 103 | 1774 | 0.9336 | 0.8918 | 0.9315 | 0.9243 | 0.8827 | dyn act + int8 weight |
| torchao int4-wo | — | — | — | — | — | — | — | — | — | requires `mslk>=1.0.0` |
| bnb LLM.int8 | ~395* | 7.76 | 129 | 2135 | 0.9361 | 0.8986 | 0.9308 | 0.9235 | 0.8827 | mixed-precision outliers |
| bnb nf4 (DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 | 0.3576 | 0.2075 | **collapsed** |
| bnb nf4 (no DQ) | 275 | 5.86 | 171 | 1287 | 0.3537 | 0.2205 | 0.2423 | 0.3576 | 0.2075 | **collapsed** |
| bnb fp4 (no DQ) | 275 | 5.87 | 170 | 1287 | 0.1629 | 0.2085 | 0.2326 | 0.1686 | 0.1978 | **collapsed harder** |

\*torchao subclass tensors report bf16 element_size; true storage ~395 MB.
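The caveat follows directly from how the size metric is computed; a small plain-PyTorch sketch of the accounting (the `param_mb` helper is illustrative), showing why a subclass that still reports a 2-byte element size makes int8 storage look twice its real size:

```python
import torch
import torch.nn as nn

def param_mb(module: nn.Module) -> float:
    """The sweep's size metric: numel * element_size summed over parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters()) / 1e6

layer = nn.Linear(1024, 1024, bias=False).to(torch.bfloat16)
bf16_mb = param_mb(layer)  # element_size() == 2 for bf16

# True int8 storage of the same weight is half the bytes; a tensor subclass
# that keeps reporting the bf16 element size would still show bf16_mb.
int8_bytes = layer.weight.numel() * torch.tensor(0, dtype=torch.int8).element_size()
assert int8_bytes * 2 == layer.weight.numel() * layer.weight.element_size()
```

This is exactly the factor-of-two gap between the reported 790 MB and the ~395 MB true storage in the torchao rows.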

Per-variant detail (per-class F1, MCC, AUC, confusion matrices, calibration
bins) is in `results/eval/quant/{variant}/metrics.json`. Aggregate row-level
data is in `results/eval/quant/summary.json`.

## Findings

### 1. bf16 is already the production sweet spot

Flash-attention-2 + bf16 gives **3.0× the throughput of fp32** (181 vs 61
samples/sec) at **half the VRAM** (1.7 vs 3.5 GB) with bit-identical
accuracy. This is the configuration we already train and serve with; the
sweep simply confirms there's no headroom in fp16/fp32 on this hardware.

### 2. fp16 ≡ bf16 on Ampere

Identical latency, identical VRAM, identical F1. The RTX 3090 has matched
bf16/fp16 throughput on its tensor cores, and the model has no overflow
issues in either format. Pick whichever the loader prefers.

### 3. torchao int8 weight-only is the only quantization variant worth shipping

- **VRAM −19%** (1741 → 1416 MB) — meaningful for batched serving
- **F1 essentially unchanged** (cat +0.0008, spec −0.0011 vs bf16 — both
  inside per-seed noise)
- **Latency +10%** (5.52 → 6.08 ms/sample) — the int8 weight is dequantized
  to bf16 on the fly because the RTX 3090 (sm_8.6) lacks the int8
  tensor-core matmul kernel paths torchao would otherwise use; on
  H100/A100/Ada the same config would also be faster

The accuracy delta is statistically nothing — well within the ±0.002 std we
observed across the 3-seed ensemble. **This is the variant we'd ship as the
"low-VRAM" deployment option.**

### 4. torchao int8 dynamic activation: don't bother on this hardware

**−43% throughput** (181 → 103 samples/sec; 5.52 → 9.67 ms/sample) and
*more* peak VRAM than bf16 (1774 vs 1741 MB), because the per-batch
activation quantization adds work without unlocking int8 tensor cores. A
pure regression on Ampere.

### 5. bnb LLM.int8: slowest int8, no accuracy upside

- **+41% latency** (5.52 → 7.76 ms/sample) due to mixed-precision outlier
  handling
- **+23% VRAM** (1741 → 2135 MB) — outlier columns are kept in fp16, plus
  scratch buffers
- **F1 +0.0024 cat, +0.0034 spec** — within noise; not a real win

bnb LLM.int8 was designed for LLM-scale models where outlier features
dominate quant error; for an encoder of this size on a single 3090 it just
trades performance for nothing.

### 6. All 4-bit variants collapse — ModernBERT-large is too quant-sensitive

Both nf4 (with and without double-quantization) and fp4 produce essentially
random predictions:

| variant | cat F1 | spec F1 | spec ECE |
|---------|-------:|--------:|---------:|
| nf4 | 0.354 | 0.221 | 0.434 |
| fp4 | 0.163 | 0.209 | 0.443 |

Per-layer dequantization is faithful — we verified that the dequantized
weight of one MLP Wi layer differs from the original by mean 0.005 / max
0.11 (sub-1% error). But the relative output drift on a single Linear is
already ~98% (mean), and that error compounds across 28 transformer blocks
and their GLU FFN paths until the [CLS]/pooled representation no longer
carries the discriminative signal. The category head essentially collapses
to a near-uniform prior (cat ECE 0.10 vs the 0.054 baseline), and the
threshold heads collapse onto L1 because all three thresholds emit similar
logits.
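The compounding argument is just geometric growth of relative error; as back-of-envelope arithmetic (the 1%-per-layer figure is illustrative, not the measured drift):

```python
# If each of the 28 blocks multiplies representation error by (1 + eps),
# even a small per-layer relative error snowballs before the pooler sees it.
eps = 0.01                     # illustrative 1% relative drift per layer
blocks = 28
compounded = (1 + eps) ** blocks - 1
print(f"{compounded:.2%}")     # roughly a third of the signal drifts
```

With the much larger single-Linear drift we actually measured, the same geometry destroys the representation long before layer 28.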

The fact that **DQ vs no-DQ are bit-identical** at this scale tells us the
nf4 weight indices are stable under absmax requantization (only ~5% of the
weight bytes change, all in the metadata block) — the catastrophe is
inherent to 4-bit weight precision on this architecture, not to any
quantization-config knob.

This is a noteworthy null result for the paper: **naive post-training 4-bit
weight quantization is not viable for ModernBERT-large on this task**.
Recovering 4-bit would require (a) QAT, (b) per-channel calibration against
a held-out activation distribution (GPTQ / AWQ-style), or (c) keeping the
GLU FFN in 8-bit and quantizing only the attention projections to 4-bit.
None of these are reachable inside the remaining capstone time budget.

### 7. torchao int4-wo: dependency hole

torchao 0.17 requires `mslk >= 1.0.0` for the new `Int4Tensor.from_hp`
path. It is not installed in the lockfile and not worth chasing given the
bnb 4-bit collapse — even if the kernel ran cleanly we'd expect the same
compounding error pattern.

## Recommendations

| Use case | Variant | Why |
|-----------------------------------|--------------------|-------------------------------------------------------------|
| **Production / paper headline** | bf16 | Best on every dimension on this hardware |
| **Low-VRAM batch serving** | torchao int8-wo | −19% VRAM, accuracy intact, only a 10% latency penalty |
| **Multi-GPU sharded serving** | bf16 | int8-wo's dequant overhead grows with replica count |
| **Embedded / 4-bit** | not viable | Needs QAT or AWQ-style calibration; future work |

## Paper-worthy notes

1. **Quantization story** — bf16 is already the sweet spot; torchao int8-wo
   buys 19% VRAM with no accuracy cost; 4-bit fails. This adds another row
   to the speed/cost table.
2. **Architecture-specific quant fragility** — ModernBERT-large's GLU FFN
   amplifies per-layer weight error across 28 blocks. This is a noteworthy
   counterpoint to the 4-bit-by-default LLM serving narrative and worth a
   paragraph in the discussion section alongside the DAPT and CORAL null
   results.
3. **Hardware caveat** — the int8 latency results would invert on
   Hopper/Ada/A100; the 3090 simply lacks the int8 matmul path. State the
   sm_8.6 caveat in the table caption.

## Reproduce

```bash
# from repo root
bun run py:quant
# writes to results/eval/quant/{summary.json, REPORT.md, <variant>/metrics.json}
```

Run time: ~5 minutes total (most of it spent in the fp32 and torchao build
steps).
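To pull a quick ranking out of the summary file, something like the following works; the rows are abbreviated from the actual `summary.json` and the `viable` helper is ours, not part of the sweep driver:

```python
import json

# Abbreviated rows from results/eval/quant/summary.json
summary = json.loads("""[
  {"variant": "bf16", "ms_per_sample": 5.52, "GPT-5.4_cat_f1": 0.9337},
  {"variant": "torchao-int8-wo", "ms_per_sample": 6.08, "GPT-5.4_cat_f1": 0.9345},
  {"variant": "bnb-nf4", "ms_per_sample": 5.86, "GPT-5.4_cat_f1": 0.3537}
]""")

def viable(rows, f1_floor=0.9, key="GPT-5.4_cat_f1"):
    """Drop collapsed variants, then rank the survivors by latency."""
    ok = [r for r in rows if r.get(key, 0.0) >= f1_floor]
    return sorted(ok, key=lambda r: r["ms_per_sample"])

ranked = viable(summary)
print([r["variant"] for r in ranked])  # -> ['bf16', 'torchao-int8-wo']
```

The F1 floor makes the collapsed 4-bit rows fall out before any speed comparison, which is the order the findings above argue for.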
297
results/eval/quant/bf16/metrics.json
Normal file
@@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.9336741161693523,
    "cat_weighted_f1": 0.9343162998643407,
    "cat_macro_precision": 0.93189297179766,
    "cat_macro_recall": 0.9377918652022429,
    "cat_mcc": 0.9226990724708704,
    "cat_auc": 0.991991833154947,
    "cat_ece": 0.053848127176364245,
    "cat_confusion_matrix": [
      [225, 0, 3, 0, 2, 0, 0],
      [0, 85, 0, 0, 2, 1, 0],
      [2, 0, 144, 1, 3, 0, 0],
      [0, 0, 3, 131, 0, 2, 0],
      [6, 1, 5, 19, 164, 1, 2],
      [0, 3, 1, 8, 2, 207, 0],
      [0, 0, 0, 0, 12, 0, 165]
    ],
    "cat_f1_BoardGov": 0.9719222462203023,
    "cat_prec_BoardGov": 0.9656652360515021,
    "cat_recall_BoardGov": 0.9782608695652174,
    "cat_f1_Incident": 0.96045197740113,
    "cat_prec_Incident": 0.9550561797752809,
    "cat_recall_Incident": 0.9659090909090909,
    "cat_f1_Manageme": 0.9411764705882353,
    "cat_prec_Manageme": 0.9230769230769231,
    "cat_recall_Manageme": 0.96,
    "cat_f1_NoneOthe": 0.888135593220339,
    "cat_prec_NoneOthe": 0.8238993710691824,
    "cat_recall_NoneOthe": 0.9632352941176471,
    "cat_f1_RiskMana": 0.856396866840731,
    "cat_prec_RiskMana": 0.8864864864864865,
    "cat_recall_RiskMana": 0.8282828282828283,
    "cat_f1_Strategy": 0.9583333333333334,
    "cat_prec_Strategy": 0.981042654028436,
    "cat_recall_Strategy": 0.9366515837104072,
    "cat_f1_Third-Pa": 0.9593023255813954,
    "cat_prec_Third-Pa": 0.9880239520958084,
    "cat_recall_Third-Pa": 0.9322033898305084,
    "cat_kripp_alpha": 0.9223591517560865,
    "spec_macro_f1": 0.8951731906425856,
    "spec_weighted_f1": 0.9121524819510628,
    "spec_macro_precision": 0.8980417155129858,
    "spec_macro_recall": 0.8930560580782194,
    "spec_mcc": 0.866381831963237,
    "spec_auc": 0.981666223606385,
    "spec_ece": 0.07135417198141418,
    "spec_confusion_matrix": [
      [580, 23, 12, 3],
      [29, 130, 7, 2],
      [11, 4, 190, 2],
      [2, 1, 9, 195]
    ],
    "spec_f1_L1Generi": 0.9354838709677419,
    "spec_prec_L1Generi": 0.932475884244373,
    "spec_recall_L1Generi": 0.9385113268608414,
    "spec_f1_L2Domain": 0.7975460122699386,
    "spec_prec_L2Domain": 0.8227848101265823,
    "spec_recall_L2Domain": 0.7738095238095238,
    "spec_f1_L3Firm-S": 0.8941176470588236,
    "spec_prec_L3Firm-S": 0.8715596330275229,
    "spec_recall_L3Firm-S": 0.9178743961352657,
    "spec_f1_L4Quanti": 0.9535452322738386,
    "spec_prec_L4Quanti": 0.9653465346534653,
    "spec_recall_L4Quanti": 0.9420289855072463,
    "spec_qwk": 0.9324447137231142,
    "spec_mae": 0.1175,
    "spec_kripp_alpha": 0.917725722448833
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.922684387023173,
    "cat_weighted_f1": 0.9216414809666168,
    "cat_macro_precision": 0.9177680939029339,
    "cat_macro_recall": 0.9316060900094703,
    "cat_mcc": 0.909266938399113,
    "cat_auc": 0.9939660707189948,
    "cat_ece": 0.06551479384303091,
    "cat_confusion_matrix": [
      [211, 0, 1, 1, 1, 0, 0],
      [0, 78, 0, 0, 1, 0, 0],
      [8, 0, 144, 1, 4, 0, 1],
      [0, 0, 1, 138, 1, 1, 0],
      [13, 0, 9, 14, 169, 1, 7],
      [1, 11, 1, 4, 3, 208, 0],
      [0, 0, 0, 1, 6, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9440715883668904,
    "cat_prec_BoardGov": 0.9055793991416309,
    "cat_recall_BoardGov": 0.985981308411215,
    "cat_f1_Incident": 0.9285714285714286,
    "cat_prec_Incident": 0.8764044943820225,
    "cat_recall_Incident": 0.9873417721518988,
    "cat_f1_Manageme": 0.9171974522292994,
    "cat_prec_Manageme": 0.9230769230769231,
    "cat_recall_Manageme": 0.9113924050632911,
    "cat_f1_NoneOthe": 0.92,
    "cat_prec_NoneOthe": 0.8679245283018868,
    "cat_recall_NoneOthe": 0.9787234042553191,
    "cat_f1_RiskMana": 0.8492462311557789,
    "cat_prec_RiskMana": 0.9135135135135135,
    "cat_recall_RiskMana": 0.7934272300469484,
    "cat_f1_Strategy": 0.9476082004555809,
    "cat_prec_Strategy": 0.985781990521327,
    "cat_recall_Strategy": 0.9122807017543859,
    "cat_f1_Third-Pa": 0.9520958083832335,
    "cat_prec_Third-Pa": 0.9520958083832335,
    "cat_recall_Third-Pa": 0.9520958083832335,
    "cat_kripp_alpha": 0.908575631724203,
    "spec_macro_f1": 0.8833694419146193,
    "spec_weighted_f1": 0.9004034318676798,
    "spec_macro_precision": 0.8858989636247611,
    "spec_macro_recall": 0.8854684685880032,
    "spec_mcc": 0.8500778641433316,
    "spec_auc": 0.9736633898988131,
    "spec_ece": 0.08248284702499709,
    "spec_confusion_matrix": [
      [567, 30, 7, 1],
      [22, 118, 3, 2],
      [33, 10, 207, 10],
      [0, 0, 1, 189]
    ],
    "spec_f1_L1Generi": 0.9242053789731052,
    "spec_prec_L1Generi": 0.9115755627009646,
    "spec_recall_L1Generi": 0.9371900826446281,
    "spec_f1_L2Domain": 0.7788778877887789,
    "spec_prec_L2Domain": 0.7468354430379747,
    "spec_recall_L2Domain": 0.8137931034482758,
    "spec_f1_L3Firm-S": 0.8661087866108786,
    "spec_prec_L3Firm-S": 0.9495412844036697,
    "spec_recall_L3Firm-S": 0.7961538461538461,
    "spec_f1_L4Quanti": 0.9642857142857143,
    "spec_prec_L4Quanti": 0.9356435643564357,
    "spec_recall_L4Quanti": 0.9947368421052631,
    "spec_qwk": 0.9227008860372746,
    "spec_mae": 0.13583333333333333,
    "spec_kripp_alpha": 0.9065248741550552
  },
  "_runtime": {
    "encoder_mb": 789.563648,
    "ms_per_sample": 5.516677870764397,
    "throughput_per_s": 181.26851402716375,
    "peak_vram_mb": 1740.83837890625,
    "build_s": 0.48778308398323134
  }
}
297
results/eval/quant/bnb-fp4/metrics.json
Normal file
@@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.16293893512410998,
    "cat_weighted_f1": 0.1746727986514593,
    "cat_macro_precision": 0.6289222195093943,
    "cat_macro_recall": 0.23220413662370398,
    "cat_mcc": 0.22345796853389935,
    "cat_auc": 0.8960306312891495,
    "cat_ece": 0.2080524676044782,
    "cat_confusion_matrix": [
      [8, 0, 0, 0, 221, 0, 1],
      [0, 0, 0, 1, 82, 0, 5],
      [0, 0, 4, 0, 145, 0, 1],
      [0, 0, 2, 3, 128, 0, 3],
      [0, 0, 0, 0, 195, 0, 3],
      [0, 0, 0, 0, 208, 2, 11],
      [0, 0, 0, 0, 80, 0, 97]
    ],
    "cat_f1_BoardGov": 0.06722689075630252,
    "cat_prec_BoardGov": 1.0,
    "cat_recall_BoardGov": 0.034782608695652174,
    "cat_f1_Incident": 0.0,
    "cat_prec_Incident": 0.0,
    "cat_recall_Incident": 0.0,
    "cat_f1_Manageme": 0.05128205128205128,
    "cat_prec_Manageme": 0.6666666666666666,
    "cat_recall_Manageme": 0.02666666666666667,
    "cat_f1_NoneOthe": 0.04285714285714286,
    "cat_prec_NoneOthe": 0.75,
    "cat_recall_NoneOthe": 0.022058823529411766,
    "cat_f1_RiskMana": 0.31026252983293556,
    "cat_prec_RiskMana": 0.18413597733711048,
    "cat_recall_RiskMana": 0.9848484848484849,
    "cat_f1_Strategy": 0.017937219730941704,
    "cat_prec_Strategy": 1.0,
    "cat_recall_Strategy": 0.00904977375565611,
    "cat_f1_Third-Pa": 0.6510067114093959,
    "cat_prec_Third-Pa": 0.8016528925619835,
    "cat_recall_Third-Pa": 0.5480225988700564,
    "cat_kripp_alpha": -0.08693512028952255,
    "spec_macro_f1": 0.20854117827130608,
    "spec_weighted_f1": 0.2571301750438355,
    "spec_macro_precision": 0.3741612607031285,
    "spec_macro_recall": 0.33018440069147115,
    "spec_mcc": 0.1895317453505129,
    "spec_auc": 0.8110497500610155,
    "spec_ece": 0.44289420386155437,
    "spec_confusion_matrix": [
      [136, 473, 9, 0],
      [4, 163, 1, 0],
      [1, 179, 27, 0],
      [2, 171, 34, 0]
    ],
    "spec_f1_L1Generi": 0.35742444152431013,
    "spec_prec_L1Generi": 0.951048951048951,
    "spec_recall_L1Generi": 0.22006472491909385,
    "spec_f1_L2Domain": 0.2824956672443674,
    "spec_prec_L2Domain": 0.16531440162271804,
    "spec_recall_L2Domain": 0.9702380952380952,
    "spec_f1_L3Firm-S": 0.19424460431654678,
    "spec_prec_L3Firm-S": 0.38028169014084506,
    "spec_recall_L3Firm-S": 0.13043478260869565,
    "spec_f1_L4Quanti": 0.0,
    "spec_prec_L4Quanti": 0.0,
    "spec_recall_L4Quanti": 0.0,
    "spec_qwk": 0.2326064604575444,
    "spec_mae": 0.8825,
    "spec_kripp_alpha": 0.26499611744119067
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.16861118726256397,
    "cat_weighted_f1": 0.1792365613004711,
    "cat_macro_precision": 0.6306758954840335,
    "cat_macro_recall": 0.2357303291121537,
    "cat_mcc": 0.2251562222131823,
    "cat_auc": 0.8995073249291591,
    "cat_ece": 0.19888580093781152,
    "cat_confusion_matrix": [
      [8, 0, 0, 0, 205, 0, 1],
      [0, 0, 0, 1, 73, 0, 5],
      [0, 0, 4, 0, 154, 0, 0],
      [0, 0, 2, 3, 133, 0, 3],
      [0, 0, 0, 0, 208, 0, 5],
      [0, 0, 0, 0, 216, 2, 10],
      [0, 0, 0, 0, 70, 0, 97]
    ],
    "cat_f1_BoardGov": 0.07207207207207207,
    "cat_prec_BoardGov": 1.0,
    "cat_recall_BoardGov": 0.037383177570093455,
    "cat_f1_Incident": 0.0,
    "cat_prec_Incident": 0.0,
    "cat_recall_Incident": 0.0,
    "cat_f1_Manageme": 0.04878048780487805,
    "cat_prec_Manageme": 0.6666666666666666,
    "cat_recall_Manageme": 0.02531645569620253,
    "cat_f1_NoneOthe": 0.041379310344827586,
    "cat_prec_NoneOthe": 0.75,
    "cat_recall_NoneOthe": 0.02127659574468085,
    "cat_f1_RiskMana": 0.3270440251572327,
    "cat_prec_RiskMana": 0.1964117091595845,
    "cat_recall_RiskMana": 0.9765258215962441,
    "cat_f1_Strategy": 0.017391304347826087,
    "cat_prec_Strategy": 1.0,
    "cat_recall_Strategy": 0.008771929824561403,
    "cat_f1_Third-Pa": 0.6736111111111112,
    "cat_prec_Third-Pa": 0.8016528925619835,
    "cat_recall_Third-Pa": 0.5808383233532934,
    "cat_kripp_alpha": -0.07941064783948448,
    "spec_macro_f1": 0.19783939283519508,
    "spec_weighted_f1": 0.24886714543281097,
    "spec_macro_precision": 0.37592821714182745,
    "spec_macro_recall": 0.3291807330600434,
    "spec_mcc": 0.18219176358380398,
    "spec_auc": 0.790090253498083,
    "spec_ece": 0.45814307530721027,
    "spec_confusion_matrix": [
      [132, 466, 7, 0],
      [1, 142, 2, 0],
      [8, 221, 31, 0],
      [2, 157, 31, 0]
    ],
    "spec_f1_L1Generi": 0.35294117647058826,
    "spec_prec_L1Generi": 0.9230769230769231,
    "spec_recall_L1Generi": 0.21818181818181817,
    "spec_f1_L2Domain": 0.251105216622458,
    "spec_prec_L2Domain": 0.1440162271805274,
    "spec_recall_L2Domain": 0.9793103448275862,
    "spec_f1_L3Firm-S": 0.18731117824773413,
    "spec_prec_L3Firm-S": 0.43661971830985913,
    "spec_recall_L3Firm-S": 0.11923076923076924,
    "spec_f1_L4Quanti": 0.0,
    "spec_prec_L4Quanti": 0.0,
    "spec_recall_L4Quanti": 0.0,
    "spec_qwk": 0.22580295138888895,
    "spec_mae": 0.8925,
    "spec_kripp_alpha": 0.2579634594689497
  },
  "_runtime": {
    "encoder_mb": 274.843904,
    "ms_per_sample": 5.865302347471394,
    "throughput_per_s": 170.49419463109393,
    "peak_vram_mb": 1287.34326171875,
    "build_s": 0.4887635139748454
  }
}
297
results/eval/quant/bnb-int8/metrics.json
Normal file
@@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.9360988760303737,
    "cat_weighted_f1": 0.9367630863906107,
    "cat_macro_precision": 0.934342558672944,
    "cat_macro_recall": 0.9404157843351134,
    "cat_mcc": 0.9256911778959798,
    "cat_auc": 0.9918112947607864,
    "cat_ece": 0.052939765204985965,
    "cat_confusion_matrix": [
      [226, 0, 2, 0, 2, 0, 0],
      [0, 85, 0, 0, 2, 1, 0],
      [2, 0, 145, 1, 2, 0, 0],
      [0, 0, 3, 132, 0, 1, 0],
      [6, 1, 5, 19, 164, 1, 2],
      [0, 3, 1, 8, 2, 207, 0],
      [0, 0, 0, 0, 12, 0, 165]
    ],
    "cat_f1_BoardGov": 0.9741379310344828,
    "cat_prec_BoardGov": 0.9658119658119658,
    "cat_recall_BoardGov": 0.9826086956521739,
    "cat_f1_Incident": 0.96045197740113,
    "cat_prec_Incident": 0.9550561797752809,
    "cat_recall_Incident": 0.9659090909090909,
    "cat_f1_Manageme": 0.9477124183006536,
    "cat_prec_Manageme": 0.9294871794871795,
    "cat_recall_Manageme": 0.9666666666666667,
    "cat_f1_NoneOthe": 0.8918918918918919,
    "cat_prec_NoneOthe": 0.825,
    "cat_recall_NoneOthe": 0.9705882352941176,
    "cat_f1_RiskMana": 0.8586387434554974,
    "cat_prec_RiskMana": 0.8913043478260869,
    "cat_recall_RiskMana": 0.8282828282828283,
    "cat_f1_Strategy": 0.9605568445475638,
    "cat_prec_Strategy": 0.9857142857142858,
    "cat_recall_Strategy": 0.9366515837104072,
    "cat_f1_Third-Pa": 0.9593023255813954,
    "cat_prec_Third-Pa": 0.9880239520958084,
    "cat_recall_Third-Pa": 0.9322033898305084,
    "cat_kripp_alpha": 0.9253092213149172,
    "spec_macro_f1": 0.8986323186392307,
    "spec_weighted_f1": 0.9144644120807768,
    "spec_macro_precision": 0.9034925881673722,
    "spec_macro_recall": 0.8950728490354916,
    "spec_mcc": 0.870090391628814,
    "spec_auc": 0.98134918835569,
    "spec_ece": 0.06740866973996164,
    "spec_confusion_matrix": [
      [582, 19, 14, 3],
      [29, 130, 7, 2],
      [12, 3, 190, 2],
      [2, 1, 8, 196]
    ],
    "spec_f1_L1Generi": 0.9364440868865648,
    "spec_prec_L1Generi": 0.9312,
    "spec_recall_L1Generi": 0.941747572815534,
    "spec_f1_L2Domain": 0.8099688473520249,
    "spec_prec_L2Domain": 0.8496732026143791,
    "spec_recall_L2Domain": 0.7738095238095238,
    "spec_f1_L3Firm-S": 0.892018779342723,
    "spec_prec_L3Firm-S": 0.867579908675799,
    "spec_recall_L3Firm-S": 0.9178743961352657,
    "spec_f1_L4Quanti": 0.9560975609756097,
    "spec_prec_L4Quanti": 0.9655172413793104,
    "spec_recall_L4Quanti": 0.9468599033816425,
    "spec_qwk": 0.9307948020550015,
    "spec_mae": 0.1175,
    "spec_kripp_alpha": 0.9166492249745117
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.9235105849558979,
    "cat_weighted_f1": 0.9224780370334836,
    "cat_macro_precision": 0.9187130112710481,
    "cat_macro_recall": 0.9326192612354074,
    "cat_mcc": 0.9103198007176273,
    "cat_auc": 0.9937246318315877,
    "cat_ece": 0.06465620135267579,
    "cat_confusion_matrix": [
      [211, 0, 1, 1, 1, 0, 0],
      [0, 78, 0, 0, 1, 0, 0],
      [9, 0, 144, 1, 3, 0, 1],
      [0, 0, 1, 139, 1, 0, 0],
      [13, 0, 9, 14, 169, 1, 7],
      [1, 11, 1, 4, 3, 208, 0],
      [0, 0, 0, 1, 6, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9419642857142857,
    "cat_prec_BoardGov": 0.9017094017094017,
    "cat_recall_BoardGov": 0.985981308411215,
    "cat_f1_Incident": 0.9285714285714286,
    "cat_prec_Incident": 0.8764044943820225,
    "cat_recall_Incident": 0.9873417721518988,
    "cat_f1_Manageme": 0.9171974522292994,
    "cat_prec_Manageme": 0.9230769230769231,
    "cat_recall_Manageme": 0.9113924050632911,
    "cat_f1_NoneOthe": 0.9235880398671097,
    "cat_prec_NoneOthe": 0.86875,
    "cat_recall_NoneOthe": 0.9858156028368794,
    "cat_f1_RiskMana": 0.8513853904282116,
    "cat_prec_RiskMana": 0.9184782608695652,
    "cat_recall_RiskMana": 0.7934272300469484,
    "cat_f1_Strategy": 0.9497716894977168,
    "cat_prec_Strategy": 0.9904761904761905,
    "cat_recall_Strategy": 0.9122807017543859,
    "cat_f1_Third-Pa": 0.9520958083832335,
    "cat_prec_Third-Pa": 0.9520958083832335,
    "cat_recall_Third-Pa": 0.9520958083832335,
    "cat_kripp_alpha": 0.9095619506866199,
    "spec_macro_f1": 0.8826923642825633,
    "spec_weighted_f1": 0.8991699562480843,
    "spec_macro_precision": 0.8862949086294886,
    "spec_macro_recall": 0.8831960153359262,
    "spec_mcc": 0.8485449936701916,
    "spec_auc": 0.9725823165743999,
    "spec_ece": 0.083350846717755,
    "spec_confusion_matrix": [
      [568, 27, 9, 1],
||||||
|
[
|
||||||
|
23,
|
||||||
|
117,
|
||||||
|
3,
|
||||||
|
2
|
||||||
|
],
|
||||||
|
[
|
||||||
|
34,
|
||||||
|
9,
|
||||||
|
206,
|
||||||
|
11
|
||||||
|
],
|
||||||
|
[
|
||||||
|
0,
|
||||||
|
0,
|
||||||
|
1,
|
||||||
|
189
|
||||||
|
]
|
||||||
|
],
|
||||||
|
"spec_f1_L1Generi": 0.9235772357723577,
|
||||||
|
"spec_prec_L1Generi": 0.9088,
|
||||||
|
"spec_recall_L1Generi": 0.9388429752066115,
|
||||||
|
"spec_f1_L2Domain": 0.785234899328859,
|
||||||
|
"spec_prec_L2Domain": 0.7647058823529411,
|
||||||
|
"spec_recall_L2Domain": 0.8068965517241379,
|
||||||
|
"spec_f1_L3Firm-S": 0.860125260960334,
|
||||||
|
"spec_prec_L3Firm-S": 0.9406392694063926,
|
||||||
|
"spec_recall_L3Firm-S": 0.7923076923076923,
|
||||||
|
"spec_f1_L4Quanti": 0.9618320610687023,
|
||||||
|
"spec_prec_L4Quanti": 0.9310344827586207,
|
||||||
|
"spec_recall_L4Quanti": 0.9947368421052631,
|
||||||
|
"spec_qwk": 0.9198415117342273,
|
||||||
|
"spec_mae": 0.13916666666666666,
|
||||||
|
"spec_kripp_alpha": 0.9038906079654127
|
||||||
|
},
|
||||||
|
"_runtime": {
|
||||||
|
"encoder_mb": 789.563648,
|
||||||
|
"ms_per_sample": 7.762363941583317,
|
||||||
|
"throughput_per_s": 128.82673468103667,
|
||||||
|
"peak_vram_mb": 2135.203125,
|
||||||
|
"build_s": 1.1878160500200465
|
||||||
|
}
|
||||||
|
}
|
||||||
297 results/eval/quant/bnb-nf4-nodq/metrics.json Normal file
@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.3536909012886116,
    "cat_weighted_f1": 0.4058815979606338,
    "cat_macro_precision": 0.6317997184487815,
    "cat_macro_recall": 0.38979766446605063,
    "cat_mcc": 0.42471542150657926,
    "cat_auc": 0.9205800077405307,
    "cat_ece": 0.09734637491405013,
    "cat_confusion_matrix": [
      [143, 0, 0, 0, 85, 0, 2],
      [0, 0, 0, 4, 73, 0, 11],
      [3, 0, 20, 1, 124, 0, 2],
      [1, 0, 1, 5, 122, 0, 7],
      [0, 0, 0, 0, 185, 0, 13],
      [0, 0, 0, 0, 180, 28, 13],
      [0, 0, 0, 0, 22, 0, 155]
    ],
    "cat_f1_BoardGov": 0.7586206896551724,
    "cat_prec_BoardGov": 0.9727891156462585,
    "cat_recall_BoardGov": 0.6217391304347826,
    "cat_f1_Incident": 0.0,
    "cat_prec_Incident": 0.0,
    "cat_recall_Incident": 0.0,
    "cat_f1_Manageme": 0.23391812865497075,
    "cat_prec_Manageme": 0.9523809523809523,
    "cat_recall_Manageme": 0.13333333333333333,
    "cat_f1_NoneOthe": 0.0684931506849315,
    "cat_prec_NoneOthe": 0.5,
    "cat_recall_NoneOthe": 0.03676470588235294,
    "cat_f1_RiskMana": 0.3741152679474216,
    "cat_prec_RiskMana": 0.23388116308470291,
    "cat_recall_RiskMana": 0.9343434343434344,
    "cat_f1_Strategy": 0.2248995983935743,
    "cat_prec_Strategy": 1.0,
    "cat_recall_Strategy": 0.12669683257918551,
    "cat_f1_Third-Pa": 0.8157894736842105,
    "cat_prec_Third-Pa": 0.7635467980295566,
    "cat_recall_Third-Pa": 0.8757062146892656,
    "cat_kripp_alpha": 0.27180867501339423,
    "spec_macro_f1": 0.22049451330952025,
    "spec_weighted_f1": 0.26278390857815354,
    "spec_macro_precision": 0.4075440073341987,
    "spec_macro_recall": 0.34148466970860386,
    "spec_mcc": 0.20939315966102864,
    "spec_auc": 0.8490039116946011,
    "spec_ece": 0.43363295723994577,
    "spec_confusion_matrix": [
      [132, 483, 3, 0],
      [2, 166, 0, 0],
      [2, 171, 34, 0],
      [0, 175, 32, 0]
    ],
    "spec_f1_L1Generi": 0.35013262599469497,
    "spec_prec_L1Generi": 0.9705882352941176,
    "spec_recall_L1Generi": 0.21359223300970873,
    "spec_f1_L2Domain": 0.28546861564918313,
    "spec_prec_L2Domain": 0.16683417085427135,
    "spec_recall_L2Domain": 0.9880952380952381,
    "spec_f1_L3Firm-S": 0.2463768115942029,
    "spec_prec_L3Firm-S": 0.4927536231884058,
    "spec_recall_L3Firm-S": 0.1642512077294686,
    "spec_f1_L4Quanti": 0.0,
    "spec_prec_L4Quanti": 0.0,
    "spec_recall_L4Quanti": 0.0,
    "spec_qwk": 0.24233251808742773,
    "spec_mae": 0.8733333333333333,
    "spec_kripp_alpha": 0.2761091078775676
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.35763512449392704,
    "cat_weighted_f1": 0.40173099854659305,
    "cat_macro_precision": 0.6354693148020794,
    "cat_macro_recall": 0.39500680662311666,
    "cat_mcc": 0.42166882753874363,
    "cat_auc": 0.9209441610065957,
    "cat_ece": 0.09567970824738346,
    "cat_confusion_matrix": [
      [141, 0, 0, 0, 71, 0, 2],
      [0, 0, 0, 4, 65, 0, 10],
      [5, 0, 21, 1, 131, 0, 0],
      [1, 0, 0, 5, 128, 0, 7],
      [0, 0, 0, 0, 194, 0, 19],
      [0, 0, 0, 0, 186, 28, 14],
      [0, 0, 0, 0, 16, 0, 151]
    ],
    "cat_f1_BoardGov": 0.7811634349030471,
    "cat_prec_BoardGov": 0.9591836734693877,
    "cat_recall_BoardGov": 0.6588785046728972,
    "cat_f1_Incident": 0.0,
    "cat_prec_Incident": 0.0,
    "cat_recall_Incident": 0.0,
    "cat_f1_Manageme": 0.2346368715083799,
    "cat_prec_Manageme": 1.0,
    "cat_recall_Manageme": 0.13291139240506328,
    "cat_f1_NoneOthe": 0.06622516556291391,
    "cat_prec_NoneOthe": 0.5,
    "cat_recall_NoneOthe": 0.03546099290780142,
    "cat_f1_RiskMana": 0.38645418326693226,
    "cat_prec_RiskMana": 0.24525916561314792,
    "cat_recall_RiskMana": 0.9107981220657277,
    "cat_f1_Strategy": 0.21875,
    "cat_prec_Strategy": 1.0,
    "cat_recall_Strategy": 0.12280701754385964,
    "cat_f1_Third-Pa": 0.8162162162162162,
    "cat_prec_Third-Pa": 0.7438423645320197,
    "cat_recall_Third-Pa": 0.9041916167664671,
    "cat_kripp_alpha": 0.27338793761748126,
    "spec_macro_f1": 0.20754679251319788,
    "spec_weighted_f1": 0.25637242485646744,
    "spec_macro_precision": 0.40946072005380696,
    "spec_macro_recall": 0.33929593134138586,
    "spec_mcc": 0.2041103760829744,
    "spec_auc": 0.8271022317290393,
    "spec_ece": 0.4489923599362374,
    "spec_confusion_matrix": [
      [130, 473, 2, 0],
      [0, 145, 0, 0],
      [6, 217, 37, 0],
      [0, 160, 30, 0]
    ],
    "spec_f1_L1Generi": 0.3508771929824561,
    "spec_prec_L1Generi": 0.9558823529411765,
    "spec_recall_L1Generi": 0.21487603305785125,
    "spec_f1_L2Domain": 0.2543859649122807,
    "spec_prec_L2Domain": 0.1457286432160804,
    "spec_recall_L2Domain": 1.0,
    "spec_f1_L3Firm-S": 0.22492401215805471,
    "spec_prec_L3Firm-S": 0.5362318840579711,
    "spec_recall_L3Firm-S": 0.1423076923076923,
    "spec_f1_L4Quanti": 0.0,
    "spec_prec_L4Quanti": 0.0,
    "spec_recall_L4Quanti": 0.0,
    "spec_qwk": 0.24096533359991634,
    "spec_mae": 0.88,
    "spec_kripp_alpha": 0.2758412395136435
  },
  "_runtime": {
    "encoder_mb": 274.843904,
    "ms_per_sample": 5.861402786540566,
    "throughput_per_s": 170.607623536175,
    "peak_vram_mb": 1287.34326171875,
    "build_s": 0.4908116469741799
  }
}
297 results/eval/quant/bnb-nf4/metrics.json Normal file
@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.3536909012886116,
    "cat_weighted_f1": 0.4058815979606338,
    "cat_macro_precision": 0.6317997184487815,
    "cat_macro_recall": 0.38979766446605063,
    "cat_mcc": 0.42471542150657926,
    "cat_auc": 0.9205800077405307,
    "cat_ece": 0.09734637491405013,
    "cat_confusion_matrix": [
      [143, 0, 0, 0, 85, 0, 2],
      [0, 0, 0, 4, 73, 0, 11],
      [3, 0, 20, 1, 124, 0, 2],
      [1, 0, 1, 5, 122, 0, 7],
      [0, 0, 0, 0, 185, 0, 13],
      [0, 0, 0, 0, 180, 28, 13],
      [0, 0, 0, 0, 22, 0, 155]
    ],
    "cat_f1_BoardGov": 0.7586206896551724,
    "cat_prec_BoardGov": 0.9727891156462585,
    "cat_recall_BoardGov": 0.6217391304347826,
    "cat_f1_Incident": 0.0,
    "cat_prec_Incident": 0.0,
    "cat_recall_Incident": 0.0,
    "cat_f1_Manageme": 0.23391812865497075,
    "cat_prec_Manageme": 0.9523809523809523,
    "cat_recall_Manageme": 0.13333333333333333,
    "cat_f1_NoneOthe": 0.0684931506849315,
    "cat_prec_NoneOthe": 0.5,
    "cat_recall_NoneOthe": 0.03676470588235294,
    "cat_f1_RiskMana": 0.3741152679474216,
    "cat_prec_RiskMana": 0.23388116308470291,
    "cat_recall_RiskMana": 0.9343434343434344,
    "cat_f1_Strategy": 0.2248995983935743,
    "cat_prec_Strategy": 1.0,
    "cat_recall_Strategy": 0.12669683257918551,
    "cat_f1_Third-Pa": 0.8157894736842105,
    "cat_prec_Third-Pa": 0.7635467980295566,
    "cat_recall_Third-Pa": 0.8757062146892656,
    "cat_kripp_alpha": 0.27180867501339423,
    "spec_macro_f1": 0.22049451330952025,
    "spec_weighted_f1": 0.26278390857815354,
    "spec_macro_precision": 0.4075440073341987,
    "spec_macro_recall": 0.34148466970860386,
    "spec_mcc": 0.20939315966102864,
    "spec_auc": 0.8490039116946011,
    "spec_ece": 0.43363295723994577,
    "spec_confusion_matrix": [
      [132, 483, 3, 0],
      [2, 166, 0, 0],
      [2, 171, 34, 0],
      [0, 175, 32, 0]
    ],
    "spec_f1_L1Generi": 0.35013262599469497,
    "spec_prec_L1Generi": 0.9705882352941176,
    "spec_recall_L1Generi": 0.21359223300970873,
    "spec_f1_L2Domain": 0.28546861564918313,
    "spec_prec_L2Domain": 0.16683417085427135,
    "spec_recall_L2Domain": 0.9880952380952381,
    "spec_f1_L3Firm-S": 0.2463768115942029,
    "spec_prec_L3Firm-S": 0.4927536231884058,
    "spec_recall_L3Firm-S": 0.1642512077294686,
    "spec_f1_L4Quanti": 0.0,
    "spec_prec_L4Quanti": 0.0,
    "spec_recall_L4Quanti": 0.0,
    "spec_qwk": 0.24233251808742773,
    "spec_mae": 0.8733333333333333,
    "spec_kripp_alpha": 0.2761091078775676
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.35763512449392704,
    "cat_weighted_f1": 0.40173099854659305,
    "cat_macro_precision": 0.6354693148020794,
    "cat_macro_recall": 0.39500680662311666,
    "cat_mcc": 0.42166882753874363,
    "cat_auc": 0.9209441610065957,
    "cat_ece": 0.09567970824738346,
    "cat_confusion_matrix": [
      [141, 0, 0, 0, 71, 0, 2],
      [0, 0, 0, 4, 65, 0, 10],
      [5, 0, 21, 1, 131, 0, 0],
      [1, 0, 0, 5, 128, 0, 7],
      [0, 0, 0, 0, 194, 0, 19],
      [0, 0, 0, 0, 186, 28, 14],
      [0, 0, 0, 0, 16, 0, 151]
    ],
    "cat_f1_BoardGov": 0.7811634349030471,
    "cat_prec_BoardGov": 0.9591836734693877,
    "cat_recall_BoardGov": 0.6588785046728972,
    "cat_f1_Incident": 0.0,
    "cat_prec_Incident": 0.0,
    "cat_recall_Incident": 0.0,
    "cat_f1_Manageme": 0.2346368715083799,
    "cat_prec_Manageme": 1.0,
    "cat_recall_Manageme": 0.13291139240506328,
    "cat_f1_NoneOthe": 0.06622516556291391,
    "cat_prec_NoneOthe": 0.5,
    "cat_recall_NoneOthe": 0.03546099290780142,
    "cat_f1_RiskMana": 0.38645418326693226,
    "cat_prec_RiskMana": 0.24525916561314792,
    "cat_recall_RiskMana": 0.9107981220657277,
    "cat_f1_Strategy": 0.21875,
    "cat_prec_Strategy": 1.0,
    "cat_recall_Strategy": 0.12280701754385964,
    "cat_f1_Third-Pa": 0.8162162162162162,
    "cat_prec_Third-Pa": 0.7438423645320197,
    "cat_recall_Third-Pa": 0.9041916167664671,
    "cat_kripp_alpha": 0.27338793761748126,
    "spec_macro_f1": 0.20754679251319788,
    "spec_weighted_f1": 0.25637242485646744,
    "spec_macro_precision": 0.40946072005380696,
    "spec_macro_recall": 0.33929593134138586,
    "spec_mcc": 0.2041103760829744,
    "spec_auc": 0.8271022317290393,
    "spec_ece": 0.4489923599362374,
    "spec_confusion_matrix": [
      [130, 473, 2, 0],
      [0, 145, 0, 0],
      [6, 217, 37, 0],
      [0, 160, 30, 0]
    ],
    "spec_f1_L1Generi": 0.3508771929824561,
    "spec_prec_L1Generi": 0.9558823529411765,
    "spec_recall_L1Generi": 0.21487603305785125,
    "spec_f1_L2Domain": 0.2543859649122807,
    "spec_prec_L2Domain": 0.1457286432160804,
    "spec_recall_L2Domain": 1.0,
    "spec_f1_L3Firm-S": 0.22492401215805471,
    "spec_prec_L3Firm-S": 0.5362318840579711,
    "spec_recall_L3Firm-S": 0.1423076923076923,
    "spec_f1_L4Quanti": 0.0,
    "spec_prec_L4Quanti": 0.0,
    "spec_recall_L4Quanti": 0.0,
    "spec_qwk": 0.24096533359991634,
    "spec_mae": 0.88,
    "spec_kripp_alpha": 0.2758412395136435
  },
  "_runtime": {
    "encoder_mb": 274.843904,
    "ms_per_sample": 5.860076693982895,
    "throughput_per_s": 170.64623079537446,
    "peak_vram_mb": 1287.34326171875,
    "build_s": 0.4858604749897495
  }
}
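The per-scheme `metrics.json` files in this commit share one layout (a block per annotator plus a `_runtime` block), so the sweep can be summarized with a short script. The sketch below is a hypothetical helper, not part of the driver scripts; the `summarize` name is illustrative and the two entries are values copied from the fp16 and bnb-nf4 files above:

```python
# Hypothetical sweep summary: rank quant schemes by macro F1, carrying
# peak VRAM alongside. Values copied from the metrics.json files in this
# commit; in practice each dict would be json.load()-ed from
# results/eval/quant/<scheme>/metrics.json.
def summarize(schemes):
    """Return (scheme, rounded macro F1, peak VRAM MB), best F1 first."""
    rows = sorted(schemes.items(),
                  key=lambda kv: kv[1]["cat_macro_f1"], reverse=True)
    return [(name, round(m["cat_macro_f1"], 3), m["peak_vram_mb"])
            for name, m in rows]

schemes = {
    "fp16":    {"cat_macro_f1": 0.9336741161693523, "peak_vram_mb": 1740.83837890625},
    "bnb-nf4": {"cat_macro_f1": 0.3536909012886116, "peak_vram_mb": 1287.34326171875},
}
print(summarize(schemes))
```

This makes the 4-bit collapse concrete: bnb-nf4 trims peak VRAM from ~1741 MB to ~1287 MB, but macro F1 falls from 0.934 to 0.354.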
297 results/eval/quant/fp16/metrics.json Normal file
@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.9336741161693523,
    "cat_weighted_f1": 0.9343162998643407,
    "cat_macro_precision": 0.93189297179766,
    "cat_macro_recall": 0.9377918652022429,
    "cat_mcc": 0.9226990724708704,
    "cat_auc": 0.9920503365435541,
    "cat_ece": 0.053747650533914546,
    "cat_confusion_matrix": [
      [225, 0, 3, 0, 2, 0, 0],
      [0, 85, 0, 0, 2, 1, 0],
      [2, 0, 144, 1, 3, 0, 0],
      [0, 0, 3, 131, 0, 2, 0],
      [6, 1, 5, 19, 164, 1, 2],
      [0, 3, 1, 8, 2, 207, 0],
      [0, 0, 0, 0, 12, 0, 165]
    ],
    "cat_f1_BoardGov": 0.9719222462203023,
    "cat_prec_BoardGov": 0.9656652360515021,
    "cat_recall_BoardGov": 0.9782608695652174,
    "cat_f1_Incident": 0.96045197740113,
    "cat_prec_Incident": 0.9550561797752809,
    "cat_recall_Incident": 0.9659090909090909,
    "cat_f1_Manageme": 0.9411764705882353,
    "cat_prec_Manageme": 0.9230769230769231,
    "cat_recall_Manageme": 0.96,
    "cat_f1_NoneOthe": 0.888135593220339,
    "cat_prec_NoneOthe": 0.8238993710691824,
    "cat_recall_NoneOthe": 0.9632352941176471,
    "cat_f1_RiskMana": 0.856396866840731,
    "cat_prec_RiskMana": 0.8864864864864865,
    "cat_recall_RiskMana": 0.8282828282828283,
    "cat_f1_Strategy": 0.9583333333333334,
    "cat_prec_Strategy": 0.981042654028436,
    "cat_recall_Strategy": 0.9366515837104072,
    "cat_f1_Third-Pa": 0.9593023255813954,
    "cat_prec_Third-Pa": 0.9880239520958084,
    "cat_recall_Third-Pa": 0.9322033898305084,
    "cat_kripp_alpha": 0.9223591517560865,
    "spec_macro_f1": 0.8951731906425856,
    "spec_weighted_f1": 0.9121524819510628,
    "spec_macro_precision": 0.8980417155129858,
    "spec_macro_recall": 0.8930560580782194,
    "spec_mcc": 0.866381831963237,
    "spec_auc": 0.981767664615518,
    "spec_ece": 0.07004868157207966,
    "spec_confusion_matrix": [
      [580, 23, 12, 3],
      [29, 130, 7, 2],
      [11, 4, 190, 2],
      [2, 1, 9, 195]
    ],
    "spec_f1_L1Generi": 0.9354838709677419,
    "spec_prec_L1Generi": 0.932475884244373,
    "spec_recall_L1Generi": 0.9385113268608414,
    "spec_f1_L2Domain": 0.7975460122699386,
    "spec_prec_L2Domain": 0.8227848101265823,
    "spec_recall_L2Domain": 0.7738095238095238,
    "spec_f1_L3Firm-S": 0.8941176470588236,
    "spec_prec_L3Firm-S": 0.8715596330275229,
    "spec_recall_L3Firm-S": 0.9178743961352657,
    "spec_f1_L4Quanti": 0.9535452322738386,
    "spec_prec_L4Quanti": 0.9653465346534653,
    "spec_recall_L4Quanti": 0.9420289855072463,
    "spec_qwk": 0.9324447137231142,
    "spec_mae": 0.1175,
    "spec_kripp_alpha": 0.917725722448833
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.922684387023173,
    "cat_weighted_f1": 0.9216414809666168,
    "cat_macro_precision": 0.9177680939029339,
    "cat_macro_recall": 0.9316060900094703,
    "cat_mcc": 0.909266938399113,
    "cat_auc": 0.993963602835296,
    "cat_ece": 0.06541431720058125,
    "cat_confusion_matrix": [
      [211, 0, 1, 1, 1, 0, 0],
      [0, 78, 0, 0, 1, 0, 0],
      [8, 0, 144, 1, 4, 0, 1],
      [0, 0, 1, 138, 1, 1, 0],
      [13, 0, 9, 14, 169, 1, 7],
      [1, 11, 1, 4, 3, 208, 0],
      [0, 0, 0, 1, 6, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9440715883668904,
    "cat_prec_BoardGov": 0.9055793991416309,
    "cat_recall_BoardGov": 0.985981308411215,
    "cat_f1_Incident": 0.9285714285714286,
    "cat_prec_Incident": 0.8764044943820225,
    "cat_recall_Incident": 0.9873417721518988,
    "cat_f1_Manageme": 0.9171974522292994,
    "cat_prec_Manageme": 0.9230769230769231,
    "cat_recall_Manageme": 0.9113924050632911,
    "cat_f1_NoneOthe": 0.92,
    "cat_prec_NoneOthe": 0.8679245283018868,
    "cat_recall_NoneOthe": 0.9787234042553191,
    "cat_f1_RiskMana": 0.8492462311557789,
    "cat_prec_RiskMana": 0.9135135135135135,
    "cat_recall_RiskMana": 0.7934272300469484,
    "cat_f1_Strategy": 0.9476082004555809,
    "cat_prec_Strategy": 0.985781990521327,
    "cat_recall_Strategy": 0.9122807017543859,
    "cat_f1_Third-Pa": 0.9520958083832335,
    "cat_prec_Third-Pa": 0.9520958083832335,
    "cat_recall_Third-Pa": 0.9520958083832335,
    "cat_kripp_alpha": 0.908575631724203,
    "spec_macro_f1": 0.8833694419146193,
    "spec_weighted_f1": 0.9004034318676798,
    "spec_macro_precision": 0.8858989636247611,
    "spec_macro_recall": 0.8854684685880032,
    "spec_mcc": 0.8500778641433316,
    "spec_auc": 0.9736589116420353,
    "spec_ece": 0.0816012116521597,
    "spec_confusion_matrix": [
      [567, 30, 7, 1],
      [22, 118, 3, 2],
      [33, 10, 207, 10],
      [0, 0, 1, 189]
    ],
    "spec_f1_L1Generi": 0.9242053789731052,
    "spec_prec_L1Generi": 0.9115755627009646,
    "spec_recall_L1Generi": 0.9371900826446281,
    "spec_f1_L2Domain": 0.7788778877887789,
    "spec_prec_L2Domain": 0.7468354430379747,
    "spec_recall_L2Domain": 0.8137931034482758,
    "spec_f1_L3Firm-S": 0.8661087866108786,
    "spec_prec_L3Firm-S": 0.9495412844036697,
    "spec_recall_L3Firm-S": 0.7961538461538461,
    "spec_f1_L4Quanti": 0.9642857142857143,
    "spec_prec_L4Quanti": 0.9356435643564357,
    "spec_recall_L4Quanti": 0.9947368421052631,
    "spec_qwk": 0.9227008860372746,
    "spec_mae": 0.13583333333333333,
    "spec_kripp_alpha": 0.9065248741550552
  },
  "_runtime": {
    "encoder_mb": 789.563648,
    "ms_per_sample": 5.539002780715236,
    "throughput_per_s": 180.53791261517884,
    "peak_vram_mb": 1740.83837890625,
    "build_s": 0.46582157304510474
  }
}
297 results/eval/quant/fp32/metrics.json Normal file
@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.9336741161693523,
    "cat_weighted_f1": 0.9343162998643407,
    "cat_macro_precision": 0.93189297179766,
    "cat_macro_recall": 0.9377918652022429,
    "cat_mcc": 0.9226990724708704,
    "cat_auc": 0.9920546854722492,
    "cat_ece": 0.05388230005900064,
    "cat_confusion_matrix": [
      [225, 0, 3, 0, 2, 0, 0],
      [0, 85, 0, 0, 2, 1, 0],
      [2, 0, 144, 1, 3, 0, 0],
      [0, 0, 3, 131, 0, 2, 0],
      [6, 1, 5, 19, 164, 1, 2],
      [0, 3, 1, 8, 2, 207, 0],
      [0, 0, 0, 0, 12, 0, 165]
    ],
    "cat_f1_BoardGov": 0.9719222462203023,
    "cat_prec_BoardGov": 0.9656652360515021,
    "cat_recall_BoardGov": 0.9782608695652174,
    "cat_f1_Incident": 0.96045197740113,
    "cat_prec_Incident": 0.9550561797752809,
    "cat_recall_Incident": 0.9659090909090909,
    "cat_f1_Manageme": 0.9411764705882353,
    "cat_prec_Manageme": 0.9230769230769231,
    "cat_recall_Manageme": 0.96,
    "cat_f1_NoneOthe": 0.888135593220339,
    "cat_prec_NoneOthe": 0.8238993710691824,
    "cat_recall_NoneOthe": 0.9632352941176471,
    "cat_f1_RiskMana": 0.856396866840731,
    "cat_prec_RiskMana": 0.8864864864864865,
    "cat_recall_RiskMana": 0.8282828282828283,
    "cat_f1_Strategy": 0.9583333333333334,
    "cat_prec_Strategy": 0.981042654028436,
    "cat_recall_Strategy": 0.9366515837104072,
    "cat_f1_Third-Pa": 0.9593023255813954,
    "cat_prec_Third-Pa": 0.9880239520958084,
    "cat_recall_Third-Pa": 0.9322033898305084,
    "cat_kripp_alpha": 0.9223591517560865,
    "spec_macro_f1": 0.8943486525770918,
    "spec_weighted_f1": 0.9113685505226937,
    "spec_macro_precision": 0.896720845083131,
    "spec_macro_recall": 0.8926515273338829,
    "spec_mcc": 0.8651595302642376,
    "spec_auc": 0.9817804600696421,
    "spec_ece": 0.07088303024570146,
    "spec_confusion_matrix": [
      [579, 24, 12, 3],
      [29, 130, 7, 2],
      [11, 4, 190, 2],
      [2, 1, 9, 195]
    ],
    "spec_f1_L1Generi": 0.9346246973365617,
    "spec_prec_L1Generi": 0.9323671497584541,
    "spec_recall_L1Generi": 0.9368932038834952,
    "spec_f1_L2Domain": 0.7951070336391437,
    "spec_prec_L2Domain": 0.8176100628930818,
    "spec_recall_L2Domain": 0.7738095238095238,
    "spec_f1_L3Firm-S": 0.8941176470588236,
    "spec_prec_L3Firm-S": 0.8715596330275229,
    "spec_recall_L3Firm-S": 0.9178743961352657,
    "spec_f1_L4Quanti": 0.9535452322738386,
    "spec_prec_L4Quanti": 0.9653465346534653,
    "spec_recall_L4Quanti": 0.9420289855072463,
    "spec_qwk": 0.9321211092744079,
    "spec_mae": 0.11833333333333333,
    "spec_kripp_alpha": 0.9170875429859872
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.922684387023173,
    "cat_weighted_f1": 0.9216414809666168,
    "cat_macro_precision": 0.9177680939029339,
    "cat_macro_recall": 0.9316060900094703,
    "cat_mcc": 0.909266938399113,
    "cat_auc": 0.9939567083286731,
    "cat_ece": 0.06541596949100496,
    "cat_confusion_matrix": [
      [211, 0, 1, 1, 1, 0, 0],
      [0, 78, 0, 0, 1, 0, 0],
      [8, 0, 144, 1, 4, 0, 1],
      [0, 0, 1, 138, 1, 1, 0],
      [13, 0, 9, 14, 169, 1, 7],
      [1, 11, 1, 4, 3, 208, 0],
      [0, 0, 0, 1, 6, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9440715883668904,
    "cat_prec_BoardGov": 0.9055793991416309,
    "cat_recall_BoardGov": 0.985981308411215,
    "cat_f1_Incident": 0.9285714285714286,
    "cat_prec_Incident": 0.8764044943820225,
    "cat_recall_Incident": 0.9873417721518988,
    "cat_f1_Manageme": 0.9171974522292994,
    "cat_prec_Manageme": 0.9230769230769231,
    "cat_recall_Manageme": 0.9113924050632911,
    "cat_f1_NoneOthe": 0.92,
    "cat_prec_NoneOthe": 0.8679245283018868,
    "cat_recall_NoneOthe": 0.9787234042553191,
    "cat_f1_RiskMana": 0.8492462311557789,
    "cat_prec_RiskMana": 0.9135135135135135,
    "cat_recall_RiskMana": 0.7934272300469484,
    "cat_f1_Strategy": 0.9476082004555809,
    "cat_prec_Strategy": 0.985781990521327,
    "cat_recall_Strategy": 0.9122807017543859,
    "cat_f1_Third-Pa": 0.9520958083832335,
"cat_prec_Third-Pa": 0.9520958083832335,
|
||||||
|
"cat_recall_Third-Pa": 0.9520958083832335,
|
||||||
|
"cat_kripp_alpha": 0.908575631724203,
|
||||||
|
"spec_macro_f1": 0.8825095464914274,
|
||||||
|
"spec_weighted_f1": 0.8996514471183623,
|
||||||
|
"spec_macro_precision": 0.8846890964606792,
|
||||||
|
"spec_macro_recall": 0.8850552454475074,
|
||||||
|
"spec_mcc": 0.8488763096810703,
|
||||||
|
"spec_auc": 0.9736482774372809,
|
||||||
|
"spec_ece": 0.08238246644536655,
|
||||||
|
"spec_confusion_matrix": [
|
||||||
|
[
|
||||||
|
566,
|
||||||
|
31,
|
||||||
|
7,
|
||||||
|
1
|
||||||
|
],
|
||||||
|
[
|
||||||
|
22,
|
||||||
|
118,
|
||||||
|
3,
|
||||||
|
2
|
||||||
|
],
|
||||||
|
[
|
||||||
|
33,
|
||||||
|
10,
|
||||||
|
207,
|
||||||
|
10
|
||||||
|
],
|
||||||
|
[
|
||||||
|
0,
|
||||||
|
0,
|
||||||
|
1,
|
||||||
|
189
|
||||||
|
]
|
||||||
|
],
|
||||||
|
"spec_f1_L1Generi": 0.9233278955954323,
|
||||||
|
"spec_prec_L1Generi": 0.9114331723027376,
|
||||||
|
"spec_recall_L1Generi": 0.9355371900826446,
|
||||||
|
"spec_f1_L2Domain": 0.7763157894736842,
|
||||||
|
"spec_prec_L2Domain": 0.7421383647798742,
|
||||||
|
"spec_recall_L2Domain": 0.8137931034482758,
|
||||||
|
"spec_f1_L3Firm-S": 0.8661087866108786,
|
||||||
|
"spec_prec_L3Firm-S": 0.9495412844036697,
|
||||||
|
"spec_recall_L3Firm-S": 0.7961538461538461,
|
||||||
|
"spec_f1_L4Quanti": 0.9642857142857143,
|
||||||
|
"spec_prec_L4Quanti": 0.9356435643564357,
|
||||||
|
"spec_recall_L4Quanti": 0.9947368421052631,
|
||||||
|
"spec_qwk": 0.9223702541559166,
|
||||||
|
"spec_mae": 0.13666666666666666,
|
||||||
|
"spec_kripp_alpha": 0.9059072309806726
|
||||||
|
},
|
||||||
|
"_runtime": {
|
||||||
|
"encoder_mb": 1579.127296,
|
||||||
|
"ms_per_sample": 16.293709366727853,
|
||||||
|
"throughput_per_s": 61.37337898281309,
|
||||||
|
"peak_vram_mb": 3503.53369140625,
|
||||||
|
"build_s": 0.6251941699883901
|
||||||
|
}
|
||||||
|
}
|
||||||
286
results/eval/quant/summary.json
Normal file
@@ -0,0 +1,286 @@
[
  {
    "variant": "fp32",
    "description": "Float32 encoder + heads",
    "encoder_mb": 1579.127296,
    "ms_per_sample": 16.293709366727853,
    "throughput_per_s": 61.37337898281309,
    "peak_vram_mb": 3503.53369140625,
    "build_s": 0.6251941699883901,
    "GPT-5.4_cat_f1": 0.9336741161693523,
    "GPT-5.4_spec_f1": 0.8943486525770918,
    "GPT-5.4_cat_mcc": 0.9226990724708704,
    "GPT-5.4_spec_qwk": 0.9321211092744079,
    "GPT-5.4_spec_mae": 0.11833333333333333,
    "GPT-5.4_cat_ece": 0.05388230005900064,
    "GPT-5.4_spec_ece": 0.07088303024570146,
    "GPT-5.4_spec_f1_L1Generi": 0.9346246973365617,
    "GPT-5.4_spec_f1_L2Domain": 0.7951070336391437,
    "GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
    "GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
    "Opus-4.6_cat_f1": 0.922684387023173,
    "Opus-4.6_spec_f1": 0.8825095464914274,
    "Opus-4.6_cat_mcc": 0.909266938399113,
    "Opus-4.6_spec_qwk": 0.9223702541559166,
    "Opus-4.6_spec_mae": 0.13666666666666666,
    "Opus-4.6_cat_ece": 0.06541596949100496,
    "Opus-4.6_spec_ece": 0.08238246644536655,
    "Opus-4.6_spec_f1_L1Generi": 0.9233278955954323,
    "Opus-4.6_spec_f1_L2Domain": 0.7763157894736842,
    "Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
    "Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
  },
  {
    "variant": "bf16",
    "description": "BFloat16 baseline (matches eval pipeline)",
    "encoder_mb": 789.563648,
    "ms_per_sample": 5.516677870764397,
    "throughput_per_s": 181.26851402716375,
    "peak_vram_mb": 1740.83837890625,
    "build_s": 0.48778308398323134,
    "GPT-5.4_cat_f1": 0.9336741161693523,
    "GPT-5.4_spec_f1": 0.8951731906425856,
    "GPT-5.4_cat_mcc": 0.9226990724708704,
    "GPT-5.4_spec_qwk": 0.9324447137231142,
    "GPT-5.4_spec_mae": 0.1175,
    "GPT-5.4_cat_ece": 0.053848127176364245,
    "GPT-5.4_spec_ece": 0.07135417198141418,
    "GPT-5.4_spec_f1_L1Generi": 0.9354838709677419,
    "GPT-5.4_spec_f1_L2Domain": 0.7975460122699386,
    "GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
    "GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
    "Opus-4.6_cat_f1": 0.922684387023173,
    "Opus-4.6_spec_f1": 0.8833694419146193,
    "Opus-4.6_cat_mcc": 0.909266938399113,
    "Opus-4.6_spec_qwk": 0.9227008860372746,
    "Opus-4.6_spec_mae": 0.13583333333333333,
    "Opus-4.6_cat_ece": 0.06551479384303091,
    "Opus-4.6_spec_ece": 0.08248284702499709,
    "Opus-4.6_spec_f1_L1Generi": 0.9242053789731052,
    "Opus-4.6_spec_f1_L2Domain": 0.7788778877887789,
    "Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
    "Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
  },
  {
    "variant": "fp16",
    "description": "Float16 encoder + heads",
    "encoder_mb": 789.563648,
    "ms_per_sample": 5.539002780715236,
    "throughput_per_s": 180.53791261517884,
    "peak_vram_mb": 1740.83837890625,
    "build_s": 0.46582157304510474,
    "GPT-5.4_cat_f1": 0.9336741161693523,
    "GPT-5.4_spec_f1": 0.8951731906425856,
    "GPT-5.4_cat_mcc": 0.9226990724708704,
    "GPT-5.4_spec_qwk": 0.9324447137231142,
    "GPT-5.4_spec_mae": 0.1175,
    "GPT-5.4_cat_ece": 0.053747650533914546,
    "GPT-5.4_spec_ece": 0.07004868157207966,
    "GPT-5.4_spec_f1_L1Generi": 0.9354838709677419,
    "GPT-5.4_spec_f1_L2Domain": 0.7975460122699386,
    "GPT-5.4_spec_f1_L3Firm-S": 0.8941176470588236,
    "GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
    "Opus-4.6_cat_f1": 0.922684387023173,
    "Opus-4.6_spec_f1": 0.8833694419146193,
    "Opus-4.6_cat_mcc": 0.909266938399113,
    "Opus-4.6_spec_qwk": 0.9227008860372746,
    "Opus-4.6_spec_mae": 0.13583333333333333,
    "Opus-4.6_cat_ece": 0.06541431720058125,
    "Opus-4.6_spec_ece": 0.0816012116521597,
    "Opus-4.6_spec_f1_L1Generi": 0.9242053789731052,
    "Opus-4.6_spec_f1_L2Domain": 0.7788778877887789,
    "Opus-4.6_spec_f1_L3Firm-S": 0.8661087866108786,
    "Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
  },
  {
    "variant": "torchao-int8-wo",
    "description": "torchao Int8 weight-only on encoder linears",
    "encoder_mb": 789.563648,
    "ms_per_sample": 6.078403938445263,
    "throughput_per_s": 164.5168715549004,
    "peak_vram_mb": 1416.36376953125,
    "build_s": 0.5027359619853087,
    "GPT-5.4_cat_f1": 0.9344870894825886,
    "GPT-5.4_spec_f1": 0.8941203230194683,
    "GPT-5.4_cat_mcc": 0.9237006314618685,
    "GPT-5.4_spec_qwk": 0.9329693660903852,
    "GPT-5.4_spec_mae": 0.1175,
    "GPT-5.4_cat_ece": 0.05415941931307314,
    "GPT-5.4_spec_ece": 0.06980206420024232,
    "GPT-5.4_spec_f1_L1Generi": 0.9353796445880452,
    "GPT-5.4_spec_f1_L2Domain": 0.793939393939394,
    "GPT-5.4_spec_f1_L3Firm-S": 0.8936170212765957,
    "GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
    "Opus-4.6_cat_f1": 0.9234810481200378,
    "Opus-4.6_spec_f1": 0.8814731397444973,
    "Opus-4.6_cat_mcc": 0.9102750101817324,
    "Opus-4.6_spec_qwk": 0.9207708779443254,
    "Opus-4.6_spec_mae": 0.13916666666666666,
    "Opus-4.6_cat_ece": 0.0641141641388337,
    "Opus-4.6_spec_ece": 0.08370273689428968,
    "Opus-4.6_spec_f1_L1Generi": 0.9208163265306123,
    "Opus-4.6_spec_f1_L2Domain": 0.7752442996742671,
    "Opus-4.6_spec_f1_L3Firm-S": 0.865546218487395,
    "Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
  },
  {
    "variant": "torchao-int8-dyn",
    "description": "torchao Int8 dynamic activation + Int8 weight on encoder",
    "encoder_mb": 789.563648,
    "ms_per_sample": 9.671733896636093,
    "throughput_per_s": 103.39407707937539,
    "peak_vram_mb": 1774.27392578125,
    "build_s": 0.4831273259478621,
    "GPT-5.4_cat_f1": 0.9336475878058536,
    "GPT-5.4_spec_f1": 0.8918479759675974,
    "GPT-5.4_cat_mcc": 0.9226968780743573,
    "GPT-5.4_spec_qwk": 0.931514217618119,
    "GPT-5.4_spec_mae": 0.12,
    "GPT-5.4_cat_ece": 0.05363284418980283,
    "GPT-5.4_spec_ece": 0.07049367701013878,
    "GPT-5.4_spec_f1_L1Generi": 0.934412955465587,
    "GPT-5.4_spec_f1_L2Domain": 0.7889908256880734,
    "GPT-5.4_spec_f1_L3Firm-S": 0.8904428904428905,
    "GPT-5.4_spec_f1_L4Quanti": 0.9535452322738386,
    "Opus-4.6_cat_f1": 0.9242573204255528,
    "Opus-4.6_spec_f1": 0.8827245859621925,
    "Opus-4.6_cat_mcc": 0.9112549308356716,
    "Opus-4.6_spec_qwk": 0.92235918049198,
    "Opus-4.6_spec_mae": 0.13666666666666666,
    "Opus-4.6_cat_ece": 0.06330573419729862,
    "Opus-4.6_spec_ece": 0.08290670409798626,
    "Opus-4.6_spec_f1_L1Generi": 0.9230769230769231,
    "Opus-4.6_spec_f1_L2Domain": 0.7763157894736842,
    "Opus-4.6_spec_f1_L3Firm-S": 0.8672199170124482,
    "Opus-4.6_spec_f1_L4Quanti": 0.9642857142857143
  },
  {
    "variant": "torchao-int4-wo",
    "description": "torchao Int4 weight-only (group=128) on encoder",
    "error": "ImportError: Requires mslk >= 1.0.0"
  },
  {
    "variant": "bnb-int8",
    "description": "bitsandbytes LLM.int8 on encoder linears",
    "encoder_mb": 789.563648,
    "ms_per_sample": 7.762363941583317,
    "throughput_per_s": 128.82673468103667,
    "peak_vram_mb": 2135.203125,
    "build_s": 1.1878160500200465,
    "GPT-5.4_cat_f1": 0.9360988760303737,
    "GPT-5.4_spec_f1": 0.8986323186392307,
    "GPT-5.4_cat_mcc": 0.9256911778959798,
    "GPT-5.4_spec_qwk": 0.9307948020550015,
    "GPT-5.4_spec_mae": 0.1175,
    "GPT-5.4_cat_ece": 0.052939765204985965,
    "GPT-5.4_spec_ece": 0.06740866973996164,
    "GPT-5.4_spec_f1_L1Generi": 0.9364440868865648,
    "GPT-5.4_spec_f1_L2Domain": 0.8099688473520249,
    "GPT-5.4_spec_f1_L3Firm-S": 0.892018779342723,
    "GPT-5.4_spec_f1_L4Quanti": 0.9560975609756097,
    "Opus-4.6_cat_f1": 0.9235105849558979,
    "Opus-4.6_spec_f1": 0.8826923642825633,
    "Opus-4.6_cat_mcc": 0.9103198007176273,
    "Opus-4.6_spec_qwk": 0.9198415117342273,
    "Opus-4.6_spec_mae": 0.13916666666666666,
    "Opus-4.6_cat_ece": 0.06465620135267579,
    "Opus-4.6_spec_ece": 0.083350846717755,
    "Opus-4.6_spec_f1_L1Generi": 0.9235772357723577,
    "Opus-4.6_spec_f1_L2Domain": 0.785234899328859,
    "Opus-4.6_spec_f1_L3Firm-S": 0.860125260960334,
    "Opus-4.6_spec_f1_L4Quanti": 0.9618320610687023
  },
  {
    "variant": "bnb-nf4",
    "description": "bitsandbytes NF4 4-bit (double-quant, bf16 compute)",
    "encoder_mb": 274.843904,
    "ms_per_sample": 5.860076693982895,
    "throughput_per_s": 170.64623079537446,
    "peak_vram_mb": 1287.34326171875,
    "build_s": 0.4858604749897495,
    "GPT-5.4_cat_f1": 0.3536909012886116,
    "GPT-5.4_spec_f1": 0.22049451330952025,
    "GPT-5.4_cat_mcc": 0.42471542150657926,
    "GPT-5.4_spec_qwk": 0.24233251808742773,
    "GPT-5.4_spec_mae": 0.8733333333333333,
    "GPT-5.4_cat_ece": 0.09734637491405013,
    "GPT-5.4_spec_ece": 0.43363295723994577,
    "GPT-5.4_spec_f1_L1Generi": 0.35013262599469497,
    "GPT-5.4_spec_f1_L2Domain": 0.28546861564918313,
    "GPT-5.4_spec_f1_L3Firm-S": 0.2463768115942029,
    "GPT-5.4_spec_f1_L4Quanti": 0.0,
    "Opus-4.6_cat_f1": 0.35763512449392704,
    "Opus-4.6_spec_f1": 0.20754679251319788,
    "Opus-4.6_cat_mcc": 0.42166882753874363,
    "Opus-4.6_spec_qwk": 0.24096533359991634,
    "Opus-4.6_spec_mae": 0.88,
    "Opus-4.6_cat_ece": 0.09567970824738346,
    "Opus-4.6_spec_ece": 0.4489923599362374,
    "Opus-4.6_spec_f1_L1Generi": 0.3508771929824561,
    "Opus-4.6_spec_f1_L2Domain": 0.2543859649122807,
    "Opus-4.6_spec_f1_L3Firm-S": 0.22492401215805471,
    "Opus-4.6_spec_f1_L4Quanti": 0.0
  },
  {
    "variant": "bnb-nf4-nodq",
    "description": "bitsandbytes NF4 4-bit (no double-quant)",
    "encoder_mb": 274.843904,
    "ms_per_sample": 5.861402786540566,
    "throughput_per_s": 170.607623536175,
    "peak_vram_mb": 1287.34326171875,
    "build_s": 0.4908116469741799,
    "GPT-5.4_cat_f1": 0.3536909012886116,
    "GPT-5.4_spec_f1": 0.22049451330952025,
    "GPT-5.4_cat_mcc": 0.42471542150657926,
    "GPT-5.4_spec_qwk": 0.24233251808742773,
    "GPT-5.4_spec_mae": 0.8733333333333333,
    "GPT-5.4_cat_ece": 0.09734637491405013,
    "GPT-5.4_spec_ece": 0.43363295723994577,
    "GPT-5.4_spec_f1_L1Generi": 0.35013262599469497,
    "GPT-5.4_spec_f1_L2Domain": 0.28546861564918313,
    "GPT-5.4_spec_f1_L3Firm-S": 0.2463768115942029,
    "GPT-5.4_spec_f1_L4Quanti": 0.0,
    "Opus-4.6_cat_f1": 0.35763512449392704,
    "Opus-4.6_spec_f1": 0.20754679251319788,
    "Opus-4.6_cat_mcc": 0.42166882753874363,
    "Opus-4.6_spec_qwk": 0.24096533359991634,
    "Opus-4.6_spec_mae": 0.88,
    "Opus-4.6_cat_ece": 0.09567970824738346,
    "Opus-4.6_spec_ece": 0.4489923599362374,
    "Opus-4.6_spec_f1_L1Generi": 0.3508771929824561,
    "Opus-4.6_spec_f1_L2Domain": 0.2543859649122807,
    "Opus-4.6_spec_f1_L3Firm-S": 0.22492401215805471,
    "Opus-4.6_spec_f1_L4Quanti": 0.0
  },
  {
    "variant": "bnb-fp4",
    "description": "bitsandbytes FP4 4-bit (no double-quant)",
    "encoder_mb": 274.843904,
    "ms_per_sample": 5.865302347471394,
    "throughput_per_s": 170.49419463109393,
    "peak_vram_mb": 1287.34326171875,
    "build_s": 0.4887635139748454,
    "GPT-5.4_cat_f1": 0.16293893512410998,
    "GPT-5.4_spec_f1": 0.20854117827130608,
    "GPT-5.4_cat_mcc": 0.22345796853389935,
    "GPT-5.4_spec_qwk": 0.2326064604575444,
    "GPT-5.4_spec_mae": 0.8825,
    "GPT-5.4_cat_ece": 0.2080524676044782,
    "GPT-5.4_spec_ece": 0.44289420386155437,
    "GPT-5.4_spec_f1_L1Generi": 0.35742444152431013,
    "GPT-5.4_spec_f1_L2Domain": 0.2824956672443674,
    "GPT-5.4_spec_f1_L3Firm-S": 0.19424460431654678,
    "GPT-5.4_spec_f1_L4Quanti": 0.0,
    "Opus-4.6_cat_f1": 0.16861118726256397,
    "Opus-4.6_spec_f1": 0.19783939283519508,
    "Opus-4.6_cat_mcc": 0.2251562222131823,
    "Opus-4.6_spec_qwk": 0.22580295138888895,
    "Opus-4.6_spec_mae": 0.8925,
    "Opus-4.6_cat_ece": 0.19888580093781152,
    "Opus-4.6_spec_ece": 0.45814307530721027,
    "Opus-4.6_spec_f1_L1Generi": 0.35294117647058826,
    "Opus-4.6_spec_f1_L2Domain": 0.251105216622458,
    "Opus-4.6_spec_f1_L3Firm-S": 0.18731117824773413,
    "Opus-4.6_spec_f1_L4Quanti": 0.0
  }
]
297
results/eval/quant/torchao-int8-dyn/metrics.json
Normal file
@@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.9336475878058536,
    "cat_weighted_f1": 0.9342872402134198,
    "cat_macro_precision": 0.9319024691959354,
    "cat_macro_recall": 0.9376938259865566,
    "cat_mcc": 0.9226968780743573,
    "cat_auc": 0.9924054453050574,
    "cat_ece": 0.05363284418980283,
    "cat_confusion_matrix": [
      [225, 0, 3, 0, 2, 0, 0],
      [0, 85, 0, 0, 2, 1, 0],
      [2, 0, 145, 1, 2, 0, 0],
      [1, 0, 3, 130, 0, 2, 0],
      [6, 1, 5, 19, 164, 1, 2],
      [0, 3, 1, 8, 2, 207, 0],
      [0, 0, 0, 0, 12, 0, 165]
    ],
    "cat_f1_BoardGov": 0.9698275862068966,
    "cat_prec_BoardGov": 0.9615384615384616,
    "cat_recall_BoardGov": 0.9782608695652174,
    "cat_f1_Incident": 0.96045197740113,
    "cat_prec_Incident": 0.9550561797752809,
    "cat_recall_Incident": 0.9659090909090909,
    "cat_f1_Manageme": 0.9446254071661238,
    "cat_prec_Manageme": 0.9235668789808917,
    "cat_recall_Manageme": 0.9666666666666667,
    "cat_f1_NoneOthe": 0.8843537414965986,
    "cat_prec_NoneOthe": 0.8227848101265823,
    "cat_recall_NoneOthe": 0.9558823529411765,
    "cat_f1_RiskMana": 0.8586387434554974,
    "cat_prec_RiskMana": 0.8913043478260869,
    "cat_recall_RiskMana": 0.8282828282828283,
    "cat_f1_Strategy": 0.9583333333333334,
    "cat_prec_Strategy": 0.981042654028436,
    "cat_recall_Strategy": 0.9366515837104072,
    "cat_f1_Third-Pa": 0.9593023255813954,
    "cat_prec_Third-Pa": 0.9880239520958084,
    "cat_recall_Third-Pa": 0.9322033898305084,
    "cat_kripp_alpha": 0.9223561935890119,
    "spec_macro_f1": 0.8918479759675974,
    "spec_weighted_f1": 0.9097693388297432,
    "spec_macro_precision": 0.8930494570032042,
    "spec_macro_recall": 0.8915621000757135,
    "spec_mcc": 0.8628946887605918,
    "spec_auc": 0.9807842405238503,
    "spec_ece": 0.07049367701013878,
    "spec_confusion_matrix": [
      [577, 24, 14, 3],
      [29, 129, 8, 2],
      [9, 5, 191, 2],
      [2, 1, 9, 195]
    ],
    "spec_f1_L1Generi": 0.934412955465587,
    "spec_prec_L1Generi": 0.9351701782820098,
    "spec_recall_L1Generi": 0.9336569579288025,
    "spec_f1_L2Domain": 0.7889908256880734,
    "spec_prec_L2Domain": 0.8113207547169812,
    "spec_recall_L2Domain": 0.7678571428571429,
    "spec_f1_L3Firm-S": 0.8904428904428905,
    "spec_prec_L3Firm-S": 0.8603603603603603,
    "spec_recall_L3Firm-S": 0.9227053140096618,
    "spec_f1_L4Quanti": 0.9535452322738386,
    "spec_prec_L4Quanti": 0.9653465346534653,
    "spec_recall_L4Quanti": 0.9420289855072463,
    "spec_qwk": 0.931514217618119,
    "spec_mae": 0.12,
    "spec_kripp_alpha": 0.9169918680049234
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.9242573204255528,
    "cat_weighted_f1": 0.9232556488517519,
    "cat_macro_precision": 0.9193897229484191,
    "cat_macro_recall": 0.9331778058838005,
    "cat_mcc": 0.9112549308356716,
    "cat_auc": 0.9941614030336741,
    "cat_ece": 0.06330573419729862,
    "cat_confusion_matrix": [
      [212, 0, 1, 0, 1, 0, 0],
      [0, 78, 0, 0, 1, 0, 0],
      [8, 0, 145, 1, 3, 0, 1],
      [0, 0, 1, 138, 1, 1, 0],
      [13, 0, 9, 14, 169, 1, 7],
      [1, 11, 1, 4, 3, 208, 0],
      [0, 0, 0, 1, 6, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9464285714285714,
    "cat_prec_BoardGov": 0.905982905982906,
    "cat_recall_BoardGov": 0.9906542056074766,
    "cat_f1_Incident": 0.9285714285714286,
    "cat_prec_Incident": 0.8764044943820225,
    "cat_recall_Incident": 0.9873417721518988,
    "cat_f1_Manageme": 0.9206349206349206,
    "cat_prec_Manageme": 0.9235668789808917,
    "cat_recall_Manageme": 0.9177215189873418,
    "cat_f1_NoneOthe": 0.9230769230769231,
    "cat_prec_NoneOthe": 0.8734177215189873,
    "cat_recall_NoneOthe": 0.9787234042553191,
    "cat_f1_RiskMana": 0.8513853904282116,
    "cat_prec_RiskMana": 0.9184782608695652,
    "cat_recall_RiskMana": 0.7934272300469484,
    "cat_f1_Strategy": 0.9476082004555809,
    "cat_prec_Strategy": 0.985781990521327,
    "cat_recall_Strategy": 0.9122807017543859,
    "cat_f1_Third-Pa": 0.9520958083832335,
    "cat_prec_Third-Pa": 0.9520958083832335,
    "cat_recall_Third-Pa": 0.9520958083832335,
    "cat_kripp_alpha": 0.9105393643352402,
    "spec_macro_f1": 0.8827245859621925,
    "spec_weighted_f1": 0.8997656600606208,
    "spec_macro_precision": 0.8833309642003535,
    "spec_macro_recall": 0.8861518760895928,
    "spec_mcc": 0.8488976906438819,
    "spec_auc": 0.9740582923879771,
    "spec_ece": 0.08290670409798626,
    "spec_confusion_matrix": [
      [564, 31, 9, 1],
      [22, 118, 3, 2],
      [31, 10, 209, 10],
      [0, 0, 1, 189]
    ],
    "spec_f1_L1Generi": 0.9230769230769231,
    "spec_prec_L1Generi": 0.9141004862236629,
    "spec_recall_L1Generi": 0.9322314049586777,
    "spec_f1_L2Domain": 0.7763157894736842,
    "spec_prec_L2Domain": 0.7421383647798742,
    "spec_recall_L2Domain": 0.8137931034482758,
    "spec_f1_L3Firm-S": 0.8672199170124482,
    "spec_prec_L3Firm-S": 0.9414414414414415,
    "spec_recall_L3Firm-S": 0.8038461538461539,
    "spec_f1_L4Quanti": 0.9642857142857143,
    "spec_prec_L4Quanti": 0.9356435643564357,
    "spec_recall_L4Quanti": 0.9947368421052631,
    "spec_qwk": 0.92235918049198,
    "spec_mae": 0.13666666666666666,
    "spec_kripp_alpha": 0.9061330450504643
  },
  "_runtime": {
    "encoder_mb": 789.563648,
    "ms_per_sample": 9.671733896636093,
    "throughput_per_s": 103.39407707937539,
    "peak_vram_mb": 1774.27392578125,
    "build_s": 0.4831273259478621
  }
}
297
results/eval/quant/torchao-int8-wo/metrics.json
Normal file
@@ -0,0 +1,297 @@
{
  "GPT-5.4": {
    "cat_macro_f1": 0.9344870894825886,
    "cat_weighted_f1": 0.9351173265780133,
    "cat_macro_precision": 0.9326512314038842,
    "cat_macro_recall": 0.9387442461546238,
    "cat_mcc": 0.9237006314618685,
    "cat_auc": 0.992309699625497,
    "cat_ece": 0.05415941931307314,
    "cat_confusion_matrix": [
      [225, 0, 3, 0, 2, 0, 0],
      [0, 85, 0, 0, 2, 1, 0],
      [2, 0, 145, 1, 2, 0, 0],
      [0, 0, 3, 131, 0, 2, 0],
      [6, 1, 5, 19, 164, 1, 2],
      [0, 3, 1, 8, 2, 207, 0],
      [0, 0, 0, 0, 12, 0, 165]
    ],
    "cat_f1_BoardGov": 0.9719222462203023,
    "cat_prec_BoardGov": 0.9656652360515021,
    "cat_recall_BoardGov": 0.9782608695652174,
    "cat_f1_Incident": 0.96045197740113,
    "cat_prec_Incident": 0.9550561797752809,
    "cat_recall_Incident": 0.9659090909090909,
    "cat_f1_Manageme": 0.9446254071661238,
    "cat_prec_Manageme": 0.9235668789808917,
    "cat_recall_Manageme": 0.9666666666666667,
    "cat_f1_NoneOthe": 0.888135593220339,
    "cat_prec_NoneOthe": 0.8238993710691824,
    "cat_recall_NoneOthe": 0.9632352941176471,
    "cat_f1_RiskMana": 0.8586387434554974,
    "cat_prec_RiskMana": 0.8913043478260869,
    "cat_recall_RiskMana": 0.8282828282828283,
    "cat_f1_Strategy": 0.9583333333333334,
    "cat_prec_Strategy": 0.981042654028436,
    "cat_recall_Strategy": 0.9366515837104072,
    "cat_f1_Third-Pa": 0.9593023255813954,
    "cat_prec_Third-Pa": 0.9880239520958084,
    "cat_recall_Third-Pa": 0.9322033898305084,
    "cat_kripp_alpha": 0.9233443339647499,
    "spec_macro_f1": 0.8941203230194683,
    "spec_weighted_f1": 0.9115075208518084,
    "spec_macro_precision": 0.8957148694260108,
    "spec_macro_recall": 0.892931893103379,
    "spec_mcc": 0.8651929532300995,
    "spec_auc": 0.981624069084201,
    "spec_ece": 0.06980206420024232,
    "spec_confusion_matrix": [
      [579, 24, 12, 3],
      [29, 131, 6, 2],
      [10, 6, 189, 2],
      [2, 1, 9, 195]
    ],
    "spec_f1_L1Generi": 0.9353796445880452,
    "spec_prec_L1Generi": 0.9338709677419355,
    "spec_recall_L1Generi": 0.9368932038834952,
    "spec_f1_L2Domain": 0.793939393939394,
    "spec_prec_L2Domain": 0.808641975308642,
    "spec_recall_L2Domain": 0.7797619047619048,
    "spec_f1_L3Firm-S": 0.8936170212765957,
    "spec_prec_L3Firm-S": 0.875,
    "spec_recall_L3Firm-S": 0.9130434782608695,
    "spec_f1_L4Quanti": 0.9535452322738386,
    "spec_prec_L4Quanti": 0.9653465346534653,
    "spec_recall_L4Quanti": 0.9420289855072463,
    "spec_qwk": 0.9329693660903852,
    "spec_mae": 0.1175,
    "spec_kripp_alpha": 0.9181842655510584
  },
  "Opus-4.6": {
    "cat_macro_f1": 0.9234810481200378,
    "cat_weighted_f1": 0.9224737817442137,
    "cat_macro_precision": 0.9185473372257941,
    "cat_macro_recall": 0.9325102491414775,
    "cat_mcc": 0.9102750101817324,
    "cat_auc": 0.9940184741579791,
    "cat_ece": 0.0641141641388337,
    "cat_confusion_matrix": [
      [211, 0, 1, 1, 1, 0, 0],
      [0, 78, 0, 0, 1, 0, 0],
      [8, 0, 145, 1, 3, 0, 1],
      [0, 0, 1, 138, 1, 1, 0],
      [13, 0, 9, 14, 169, 1, 7],
      [1, 11, 1, 4, 3, 208, 0],
      [0, 0, 0, 1, 6, 1, 159]
    ],
    "cat_f1_BoardGov": 0.9440715883668904,
    "cat_prec_BoardGov": 0.9055793991416309,
    "cat_recall_BoardGov": 0.985981308411215,
    "cat_f1_Incident": 0.9285714285714286,
    "cat_prec_Incident": 0.8764044943820225,
    "cat_recall_Incident": 0.9873417721518988,
    "cat_f1_Manageme": 0.9206349206349206,
    "cat_prec_Manageme": 0.9235668789808917,
    "cat_recall_Manageme": 0.9177215189873418,
    "cat_f1_NoneOthe": 0.92,
    "cat_prec_NoneOthe": 0.8679245283018868,
    "cat_recall_NoneOthe": 0.9787234042553191,
    "cat_f1_RiskMana": 0.8513853904282116,
    "cat_prec_RiskMana": 0.9184782608695652,
    "cat_recall_RiskMana": 0.7934272300469484,
    "cat_f1_Strategy": 0.9476082004555809,
    "cat_prec_Strategy": 0.985781990521327,
    "cat_recall_Strategy": 0.9122807017543859,
    "cat_f1_Third-Pa": 0.9520958083832335,
    "cat_prec_Third-Pa": 0.9520958083832335,
    "cat_recall_Third-Pa": 0.9520958083832335,
    "cat_kripp_alpha": 0.9095617653952504,
    "spec_macro_f1": 0.8814731397444973,
    "spec_weighted_f1": 0.8981338362706646,
    "spec_macro_precision": 0.8833981471623865,
    "spec_macro_recall": 0.8849913986360116,
    "spec_mcc": 0.8465512998506631,
    "spec_auc": 0.9729999946345258,
    "spec_ece": 0.08370273689428968,
    "spec_confusion_matrix": [
      [564, 33, 7, 1],
      [22, 119, 2, 2],
      [34, 10, 206, 10],
      [0, 0, 1, 189]
    ],
    "spec_f1_L1Generi": 0.9208163265306123,
    "spec_prec_L1Generi": 0.9096774193548387,
    "spec_recall_L1Generi": 0.9322314049586777,
    "spec_f1_L2Domain": 0.7752442996742671,
    "spec_prec_L2Domain": 0.7345679012345679,
    "spec_recall_L2Domain": 0.8206896551724138,
    "spec_f1_L3Firm-S": 0.865546218487395,
    "spec_prec_L3Firm-S": 0.9537037037037037,
    "spec_recall_L3Firm-S": 0.7923076923076923,
    "spec_f1_L4Quanti": 0.9642857142857143,
|
||||||
|
"spec_prec_L4Quanti": 0.9356435643564357,
|
||||||
|
"spec_recall_L4Quanti": 0.9947368421052631,
|
||||||
|
"spec_qwk": 0.9207708779443254,
|
||||||
|
"spec_mae": 0.13916666666666666,
|
||||||
|
"spec_kripp_alpha": 0.9033268512180281
|
||||||
|
},
|
||||||
|
"_runtime": {
|
||||||
|
"encoder_mb": 789.563648,
|
||||||
|
"ms_per_sample": 6.078403938445263,
|
||||||
|
"throughput_per_s": 164.5168715549004,
|
||||||
|
"peak_vram_mb": 1416.36376953125,
|
||||||
|
"build_s": 0.5027359619853087
|
||||||
|
}
|
||||||
|
}
|
||||||