Joey Eamigh 67beaede45
quantization + onnx sweeps
Phase 10.8: torchao/bnb quant sweep on iter1-independent. bf16 already
optimal; torchao int8-wo gives -19% VRAM at no F1 cost; all 4-bit
variants collapse (ModernBERT-large too quant-sensitive).

Phase 10.9: ONNX export + ORT eval. Legacy exporter only working path
(dynamo adds 56 Memcpy nodes); ORT fp32 -22% latency vs torch via
kernel fusion but bf16+flash-attn-2 still wins; fp16 broken on rotary;
dynamic int8 silently CPU-fallback + 0.5 F1 collapse.

Driver scripts wired to bun run py:quant / py:onnx; full reports at
results/eval/{quant,onnx}/REPORT.md.
2026-04-07 05:10:38 -04:00

# ONNX Export + Eval — iter1-independent ModernBERT-large
**Date:** 2026-04-07
**Checkpoint:** `checkpoints/finetune/iter1-independent/final/`
**Hardware:** RTX 3090 (sm_86, 24 GB), onnxruntime-gpu 1.24.4, onnx 1.21
**Driver:** `python/scripts/onnx_export_eval.py` (`bun run py:onnx`)
**Eval set:** 1,200-paragraph v2 holdout, proxy gold = GPT-5.4 + Opus-4.6
## TL;DR
ONNX export of this model is *technically* possible but the path is full of
dead ends. The dynamo exporter produces a graph with 56 Memcpy nodes that
makes ORT 8× slower than native torch and 4× more VRAM-heavy; the legacy
TorchScript exporter produces a clean graph that's actually 22% faster than
torch fp32 (kernel fusion); fp16 conversion breaks on the rotary embedding;
dynamic int8 quantization via ORT silently falls back to CPU and drops
~0.5 macro F1. **Net: torchao int8-wo from the earlier sweep is still the
right int8 deployment path. ONNX is not.**
## What we tried
| variant | exporter | size MB | ms/sample | VRAM MB | cat F1 | spec F1 | result |
|--------------------|----------------------|--------:|----------:|--------:|-------:|--------:|-----------------|
| onnx-fp32 (dynamo) | torch.onnx (dynamo) | 1583 | 42.92 | 15388 | 0.9337 | 0.8943 | works but unusable |
| onnx-int8 (dynamo) | dynamo + ORT int8 | 1580 | 42.82 | 15398 | 0.9337 | 0.8943 | no-op (no quant) |
| **onnx-fp32 (legacy)** | torch.onnx (TorchScript) | 1583 | **12.70** | 8228 | 0.9337 | 0.8952 | **clean graph, faster than torch** |
| onnx-fp16 (legacy) | onnxconverter_common | 754 | err | err | err | err | rotary type unify |
| onnx-int8 (legacy) | ORT quantize_dynamic | 527 | 95.91 | ~CPU | 0.3972 | 0.3364 | CPU fallback + accuracy collapse |
(All entries above were re-run from scratch — fp32 timing improved 3× moving
from dynamo to legacy export.)
## Six things broke along the way (workarounds in the script)
1. **Dynamo exporter optimizer crashes.** `torch.onnx.export(..., dynamo=True)`
succeeds at translation but the post-translation `InlinePass` optimizer
trips on `onnx_ir`. Workaround: `optimize=False`.
2. **Dynamo-exported graph is unusable on CUDA EP.** ORT inserts 56 Memcpy
nodes between layers because dynamo emits scalar tensors with CPU
placement metadata. Result: 42.9 ms/sample (8× torch fp32) and 15.4 GB
VRAM (4.4× torch fp32). The legacy exporter only inserts 1 Memcpy.
3. **`op_types_to_quantize=['MatMul']` quantizes nothing on the dynamo
graph.** Dynamo emits encoder linears as `Gemm` nodes, not `MatMul`.
Fix: `op_types_to_quantize=['MatMul', 'Gemm']`.
4. **Both ORT shape-inference paths choke on ModernBERT.** Symbolic
inference asserts in `_infer_Range` (rotary embedding limit input is
not a scalar); the C++ inference raises a (1024)/(7) dim mismatch on
the category head Gemm. The `skip_*` flags on `quant_pre_process` are
ignored, and `ONNXQuantizer.__init__` calls
`save_and_reload_model_with_shape_infer` unconditionally. Workaround:
monkey-patch `quant_utils.save_and_reload_model_with_shape_infer`
*and* the cached binding in `onnx_quantizer` to a no-op, then pass
`extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}` so the
quantizer can still type the head MatMul.
5. **fp16 conversion via `onnxconverter_common` breaks on rotary
embeddings.** Two distinct failure modes seen across exports:
`Type Error: Type (tensor(float16)) of output arg (val_58) of node
(node_Expand_56) does not match expected type (tensor(float))` (dynamo
graph) and `Type parameter (T) of Optype (Mul) bound to different types
(tensor(float) and tensor(float16) in node
(/model/backbone/rotary_emb_1/Mul_2)` (legacy graph). The converter
leaves the `inv_freq` buffer in fp32 and the surrounding Mul/Expand
ops then can't unify their type parameter. Could be patched with an
`op_block_list` for the rotary subgraph, but the cost/value isn't
there given the dynamic int8 result below.
6. **Dynamic int8 via ORT silently falls back to CPU.** The quantizer
replaces Gemm/MatMul with `MatMulInteger` + `DynamicQuantizeLinear`,
neither of which has CUDA kernels in onnxruntime-gpu 1.24. Session
creation succeeds with CUDAExecutionProvider but routes the
quantized ops to the CPU EP — observable from the `load_vram_mb`
collapsing from 2074 MB (fp32) to 266 MB (int8) and latency exploding
to 95.9 ms/sample. Per-channel int8 weights also drop accuracy from
0.934 → 0.397 on category and 0.895 → 0.336 on spec, further
confirming the kernel path is wrong (not just slow).
## What actually works
**onnx-fp32 via the legacy TorchScript exporter** is the one clean win:
12.70 ms/sample vs 16.29 for torch fp32 — a **22% latency improvement
from ORT's LayerNorm/Gelu/MatMul fusion** at bit-identical accuracy. VRAM
is 8228 MB vs 3504 MB for torch fp32 (the ORT session allocates a separate
~5 GB workspace), so the speedup costs you ~2.3× memory. On a single
3090 batch=64 inference run that's a fair trade.
But this is fp32 — bf16 torch + flash-attn-2 is *still* the strict winner
at 5.52 ms / 1741 MB (Phase 10.8 result). ORT can't run bf16 natively, and
fp16 conversion is broken. So even the working ONNX path is dominated by
what we already ship.
## Recommendation
**Don't use ONNX for this model on this hardware.** The torchao int8-wo
result from the quantization sweep (5.52 → 6.08 ms, 1741 → 1416 MB peak
VRAM, F1 within ±0.001) covers the "smaller deployment" use case more
cleanly than anything ONNX can offer here, and bf16 + flash-attn-2
remains the production default.
ONNX *would* be worth revisiting in any of these scenarios:
- **CPU-only deployment** — fp32 ONNX runs fine on CPUExecutionProvider
and ORT's int8 dynamic path is actually designed for this case. Worth
benchmarking if a CPU serving target ever shows up.
- **Cross-runtime portability** — TensorRT, OpenVINO, mobile runtimes.
These would each need their own export validation pass.
- **Static int8 with calibration** — `quantize_static` with a calibration
dataset can avoid the dynamic-quant CPU fallback path. Would need a
ModernBERT-friendly calibration loop and probably an `op_block_list`
to keep the rotary in fp32. Real engineering work, not a one-shot.
## Reproduce
```bash
bun run py:onnx
# writes to:
# results/eval/onnx/models/{model_fp32,model_fp16,model_int8_dyn}.onnx[.data]
# results/eval/onnx/summary.json
# results/eval/onnx/REPORT.md (this file)
```