Using backend tests to evaluate a runtime#

This page explains how to use the backend test suite shipped with onnx-light to validate that a custom ONNX runtime produces correct numerical results.

The backend test infrastructure is located in onnx_light.onnx_lib.backend.test.case and mirrors the structure of the official ONNX backend test suite. The registered node test cases are generated by the C++ lib_onnx_backend_test library and exposed to Python through collect_test_case(). Downstream code can still register additional Python-only test cases by subclassing Base and calling the expect() helper. The make_test_class() function then turns those test cases into a standard unittest.TestCase subclass that calls into a user-supplied runtime function.

Dependencies and layering#

The backend-test stack is split into a small Python front-end and a C++ core:

Python side: onnx_light.onnx_lib.backend.test.case (test-class generation, filtering, and NumPy-based comparisons).
C++ side: lib_onnx_backend_test (test-case registry and model/data generation) and lib_onnx_kernels (reference kernels + runtime onnx::onnx_kernels::Tensor carrier).

In other words, the Python API depends on the C++ implementation for the canonical registry, and downstream runtimes only need to provide a single callable that maps a model and its inputs to outputs.

Defining a runtime function#

The only requirement for plugging in a runtime is to write a callable with the following signature:

def my_runtime(model, *inputs: np.ndarray) -> list[np.ndarray]:
    ...

where

model is an onnx_light.onnx.ModelProto (the ONNX model for the test case),
*inputs are numpy.ndarray objects corresponding to the model’s graph inputs in order, and
the return value is a list of numpy.ndarray objects corresponding to the model’s graph outputs in order.

The runtime may serialize the ModelProto to bytes, pass it to any ONNX-compatible engine, and return the results.

Generating a test class#

Call make_test_class() with the runtime callable to obtain an ExtTestCase subclass with one test method per registered test case:

import unittest
import numpy as np
from onnx_light.onnx_lib.backend.test.case import make_test_class


def my_runtime(model, *inputs: np.ndarray) -> list[np.ndarray]:
    # replace with the actual engine call
    raise NotImplementedError


MyBackendTests = make_test_class(my_runtime)

if __name__ == "__main__":
    unittest.main(verbosity=2)

Running the file with python or through any unittest-compatible runner (pytest, etc.) will execute every registered node test case and report failures when the runtime output differs from the expected output.
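When a pass/fail summary is needed programmatically, for instance in a CI script, the generated class can presumably be driven through the standard unittest loader like any other TestCase subclass. The sketch below uses a stand-in class built by a hypothetical factory (_make_stand_in) in place of the class returned by make_test_class(). The helper name run_test_class is also an assumption.

```python
import io
import unittest


def _make_stand_in():
    """Build a placeholder for the class returned by make_test_class()."""

    class StandInTests(unittest.TestCase):
        def test_passes(self):
            self.assertEqual(1 + 1, 2)

        def test_fails(self):
            # Deliberate failure so the summary below shows a non-zero count.
            self.assertEqual(1 + 1, 3)

    return StandInTests


def run_test_class(test_class) -> unittest.TestResult:
    # Load every test_* method of the class and run it quietly.
    suite = unittest.TestLoader().loadTestsFromTestCase(test_class)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


result = run_test_class(_make_stand_in())
summary = (result.testsRun, len(result.failures), len(result.errors))
```

Here summary holds the number of tests run, failed, and errored, which is enough to gate a build step.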

Filtering tests#

Two optional parameters let you restrict which test cases are executed.

include_regex: A list of regular-expression patterns. Only test cases whose name matches at least one pattern are kept.
exclude_regex: A list of regular-expression patterns. Test cases whose name matches at least one pattern are discarded (evaluated before include_regex).

Example — run only tests related to element-wise arithmetic:

ArithmeticTests = make_test_class(
    my_runtime,
    include_regex=[r"^test_add", r"^test_sub", r"^test_mul", r"^test_div"],
)

Example — run everything except the quantization operators:

NoQuantTests = make_test_class(
    my_runtime,
    exclude_regex=[r"quantize", r"dequantize"],
)
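The interaction between the two parameters can be pictured with the following sketch. It is not the library's implementation: select_names is a hypothetical helper, and matching with re.search (a match anywhere in the name) is an assumption suggested by the unanchored patterns in the example above.

```python
import re
from typing import Iterable, Optional


def select_names(
    names: Iterable[str],
    include_regex: Optional[list] = None,
    exclude_regex: Optional[list] = None,
) -> list:
    """Illustrative filter: exclusions are applied first, then inclusions."""
    selected = []
    for name in names:
        # Assumption: patterns are matched with re.search (anywhere in the name).
        if exclude_regex and any(re.search(p, name) for p in exclude_regex):
            continue
        if include_regex and not any(re.search(p, name) for p in include_regex):
            continue
        selected.append(name)
    return selected


names = ["test_add", "test_add_uint8", "test_quantizelinear", "test_abs"]
kept = select_names(names, include_regex=[r"^test_add"], exclude_regex=[r"uint8"])
no_quant = select_names(names, exclude_regex=[r"quantize"])
```

Because exclusions run first, a name that matches both lists is dropped, as test_add_uint8 is here.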

Adjusting numerical tolerances#

By default each test case uses atol=1e-7 and rtol=1e-3. These values can be overridden per test-case name via the atols and rtols dictionaries:

MyBackendTests = make_test_class(
    my_runtime,
    atols={"test_cast_FLOAT_to_FLOAT16": 1e-3},
    rtols={"test_cast_FLOAT_to_FLOAT16": 1e-2},
)
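To make the effect of these overrides concrete, the sketch below resolves the tolerances for a given test name and applies them to a pair of values. The comparison rule |actual - expected| <= atol + rtol * |expected| is the numpy.allclose convention and is assumed here, not taken from the onnx-light source. The helper names within_tolerance and tolerances_for are hypothetical.

```python
import math

DEFAULT_ATOL = 1e-7  # defaults quoted in the text above
DEFAULT_RTOL = 1e-3


def within_tolerance(actual, expected, atol, rtol) -> bool:
    # Assumed rule (numpy.allclose convention): |a - e| <= atol + rtol * |e|
    return len(actual) == len(expected) and all(
        math.fabs(a - e) <= atol + rtol * math.fabs(e)
        for a, e in zip(actual, expected)
    )


def tolerances_for(name, atols=None, rtols=None):
    # A per-name override wins; any other case keeps the defaults.
    atol = (atols or {}).get(name, DEFAULT_ATOL)
    rtol = (rtols or {}).get(name, DEFAULT_RTOL)
    return atol, rtol


atols = {"test_cast_FLOAT_to_FLOAT16": 1e-3}
rtols = {"test_cast_FLOAT_to_FLOAT16": 1e-2}
loose = tolerances_for("test_cast_FLOAT_to_FLOAT16", atols, rtols)
strict = tolerances_for("test_abs", atols, rtols)

accepted_loose = within_tolerance([1.005], [1.0], *loose)
accepted_strict = within_tolerance([1.005], [1.0], *strict)
```

Under this rule an error of 0.005 on a value of 1.0 passes with the relaxed pair and fails with the defaults.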

Filtering test cases by operator and opset#

The helper get_test_cases_for_op() returns the subset of collected backend test cases whose model contains a node with a given op_type (and optionally a given domain / opset_version). This is convenient when a backend wants to focus on a single operator (and version) at a time:

from onnx_light.onnx_lib.backend.test.case import get_test_cases_for_op

# All cases that exercise Abs in the default ai.onnx domain.
abs_cases = get_test_cases_for_op("Abs")

# Cases that import ai.onnx at exactly version 13 and use Abs.
abs_v13 = get_test_cases_for_op("Abs", opset_version=13)

# Cases that use Abs from a custom domain.
custom = get_test_cases_for_op("Abs", domain="my.custom.domain")

When called without test_cases, the helper calls collect_test_case() internally. A precomputed mapping can be passed via the test_cases argument to avoid recollecting test cases on repeated lookups.

Full example: ONNXRuntime backend#

The file unittests/onnxl_vs_ort/test_backend_with_onnxruntime.py in the repository is a ready-to-run example that exercises every registered backend test case through ONNXRuntime:

import unittest
import numpy as np
from onnx_light.ext_test_case import import_or_skip, InferenceSessionAllTypes

# The backend test registries are only available in the full build; skip this
# module on a reduced build (ONNX_LIGHT_BUILD_KERNELS=OFF).
make_test_class = import_or_skip("onnx_light.onnx.backend", "make_test_class")


def onnxruntime_backend(model, *inputs: np.ndarray) -> list[np.ndarray]:
    """
    Runs an ONNX model using ONNXRuntime with support for all dtypes.

    Args:
        model: The ONNX model (onnx_light.ModelProto) to run
        *inputs: Input arrays for the model

    Returns:
        List of output arrays from the model
    """
    sess = InferenceSessionAllTypes(model)

    # Get input names and create feed dict
    input_names = [inp.name for inp in sess._sess.get_inputs()]
    input_dict = dict(zip(input_names, inputs))

    # Run inference
    outputs = sess.run(None, input_dict)
    return outputs


def ort_max_supported_opset() -> int:
    """
    Returns the highest default-domain opset version ONNX Runtime supports.

    Reads the registered operator schemas from ONNX Runtime and takes the
    maximum ``since_version`` over the default ONNX domain (``""``). This lets
    the exclusion list adapt to the installed ONNX Runtime instead of
    hard-coding an opset ceiling.

    Returns:
        The highest default-domain opset version ONNX Runtime supports.
    """
    from onnxruntime.capi._pybind_state import get_all_operator_schema

    return max(
        schema.since_version for schema in get_all_operator_schema() if schema.domain == ""
    )


# Opset version at which the cases below were introduced. They are only excluded
# when the installed ONNX Runtime does not yet support that opset.
OPSET_27 = 27
OPSET_28 = 28

# Exclusions that only apply when ONNX Runtime does not support the given opset.
ORT_OPSET_GATED_EXCLUDE_REGEX = {
    OPSET_27: [
        # Range opset 27 cases.
        r"^test_range_float16_type_positive_delta$",
        r"^test_range_bfloat16_type_positive_delta$",
        # LinearAttention is opset 27.
        r"^test_cc_linear_attention_.*$",
        # CausalConvWithState is opset 27.
        r"^test_cc_causal_conv_with_state_.*$",
    ],
    OPSET_28: [
        # Celu-28 adds float16/bfloat16 support; these test cases target opset 28.
        r"^test_cc_celu_float16$",
        r"^test_cc_celu_bfloat16$",
    ],
}

ORT_EXCLUDE_REGEX = [
    # ORT/reference parity mismatches in focused C++ cases.
    r"^test_cc_stft_complex_batched$",
    r"^test_cc_image_decoder_",
    # Preview ops/functions are not registered in ORT.
    r"^test_cc_flexattention_",
    # Light-only ai.rt ops are not registered in ORT.
    r"^test_cc_delayedinitializer_",
    # ORT exposes different Attention intermediates than the ONNX reference.
    r"^test_cc_attention_4d_with_past_and_present_qk_matmul_bias_3d_mask_causal$",
    r"^test_cc_attention_4d_with_past_and_present_qk_matmul_bias_4d_mask_causal$",
    r"^test_cc_attention_3d_with_past_and_present_qk_matmul_bias$",
    r"^test_cc_attention_4d_with_qk_matmul_bias$",
    r"^test_cc_attention_4d_with_past_and_present_qk_matmul_bias$",
    r"^test_cc_attention_4d_with_past_and_present_qk_matmul_bias_3d_mask$",
    r"^test_cc_attention_4d_with_past_and_present_qk_matmul_bias_4d_mask$",
    r"^test_cc_attention_4d_softcap_neginf_mask$",
    r"^test_cc_attention_4d_softcap_neginf_mask_poison$",
    r"^test_cc_attention_23_boolmask_fullymasked_row_nan_robustness$",
    r"^test_cc_attention_causal_boolmask_nan_robustness$",
    r"^test_cc_attention_23_fullymasked_qk_matmul_output_mode3_zero$",
    r"^test_cc_attention_24_fullymasked_qk_matmul_output_mode3_zero$",
    r"^test_cc_attention_4d_causal_nonpad_attn_mask_composition$",
    r"^test_cc_attention_4d_causal_nonpad_batch_prefill$",
    r"^test_cc_attention_4d_causal_nonpad_continued_prefill$",
    r"^test_cc_attention_4d_causal_nonpad_negative_offset_structural_empty$",
    r"^test_cc_attention_4d_gqa_causal_nonpad_decode$",
    r"^test_cc_attention_4d_gqa_causal_nonpad_decode_fp16$",
    # ORT does not yet implement the opset-24 offset-aware (bottom-right)
    # causal frontier for an external KV cache (``nonpad_kv_seqlen`` without
    # ``past_key``); see ONNX PR #8068.
    r"^test_cc_attention_4d_causal_nonpad_kv_continued_prefill$",
    # Preview training ops are not registered in ORT.
    r"^test_cc_adam_",
    r"^test_adam$",
    r"^test_adam_multiple$",
    r"^test_adagrad$",
    r"^test_adagrad_multiple$",
    r"^test_momentum$",
    r"^test_momentum_multiple$",
    r"^test_nesterov_momentum$",
    # Random ops are missing or nondeterministic in ORT.
    r"^test_cc_bernoulli$",
    r"^test_cc_bernoulli_double$",
    r"^test_cc_bernoulli_seed$",
    r"^test_cc_multinomial$",
    r"^test_cc_multinomial_seeded$",
    r"^test_cc_multinomial_int64$",
    r"^test_cc_randomnormal$",
    r"^test_cc_randomnormal_double$",
    r"^test_cc_randomnormal_seeded$",
    r"^test_cc_randomnormallike$",
    r"^test_cc_randomnormallike_double$",
    r"^test_cc_randomnormallike_seeded$",
    r"^test_cc_randomuniform$",
    r"^test_cc_randomuniform_double$",
    r"^test_cc_randomuniform_seeded$",
    r"^test_cc_randomuniformlike$",
    r"^test_cc_randomuniformlike_double$",
    r"^test_cc_randomuniformlike_seeded$",
    r"^test_training_dropout$",
    r"^test_training_dropout_mask$",
    r"^test_training_dropout_default$",
    r"^test_training_dropout_default_mask$",
    # ORT only wires float kernels for these ai.onnx.ml cases.
    r"^test_cc_binarizer_int64$",
    r"^test_cc_scaler_int64$",
    # ORT's binary LinearClassifier Z output uses [1-z, z] instead of the spec's [-z, z].
    r"^test_cc_linearclassifier_int64_binary$",
    # ORT returns wrong labels for the binary TreeEnsembleClassifier test case.
    r"^test_cc_treeensembleclassifier_int64_binary$",
    # ORT returns ZipMap outputs in a different carrier format.
    r"^test_cc_zipmap_",
    # ORT only supports scalar/1-element zero points for MatMulInteger.
    r"^test_cc_matmulinteger_per_col_b_zp$",
    r"^test_cc_matmulinteger_per_row_a_zp$",
    # ORT rejects FLOAT16 scales for QLinearMatMul.
    r"^test_cc_qlinearmatmul_2D_uint8_float16$",
    r"^test_cc_qlinearmatmul_2D_int8_float16$",
    r"^test_cc_qlinearmatmul_3D_uint8_float16$",
    r"^test_cc_qlinearmatmul_3D_int8_float16$",
    # ORT is missing kernels for these ops or dtypes.
    r"^test_cc_globallppool_",
    r"^test_cc_maxroipool_",
    # The backend harness cannot feed these map-typed inputs to ORT.
    r"^test_cc_dict_vectorizer_",
    r"^test_cc_cast_map_",
    # ORT rejects these mixed-dtype or batchwise sequence patterns.
    r"^test_cc_feature_vectorizer_mixed_dtypes$",
    # More single-op kernel gaps and focused parity checks.
    r"^test_bitshift_right_uint16$",
    r"^test_bitshift_left_uint16$",
    r"^test_bitcast_",
    r"^test_cc_top_k_uint64$",
    r"^test_pow_types_float32_uint32$",
    r"^test_pow_types_float32_uint64$",
    r"^test_max_int16$",
    r"^test_max_uint16$",
    r"^test_min_int16$",
    r"^test_min_uint16$",
    # dim0_offset < dim0_size was false. Invalid dim0_offset of 0. Dimension 0 is 0
    r"^test_cc_scan_zero_trip_count$",
    # ORT CPU does not register int16/int64 kernels for Relu(14).
    r"^test_cc_relu_int16$",
    r"^test_cc_relu_int64$",
    # ORT CPU does not register these bfloat16 kernels.
    r"^test_cc_(abs|add|ceil|div|elu|equal|erf|exp|floor|gelu_default|greater|greater_or_equal|isnan|less|less_or_equal|log|mul|neg|reciprocal|relu|sigmoid|sign|softplus|softsign|sqrt|sub|tanh)_bfloat16$",
    r"^test_mod_mixed_sign_bfloat16$",
    r"^test_cc_mod_bfloat16_fmod$",
    r"^test_cc_pow_types_bfloat16_float32$",
    # ORT diverges from the reference on MaxUnpool and on align_corners
    # Resize downsample cases where scale * input_width is fractional:
    # ONNX reference / onnx-light use (scale * input_width - 1) in the
    # denominator, while ORT uses (output_width_int - 1).
    r"^test_cc_maxunpool_export_with_output_shape$",
    r"^test_resize_downsample_scales_linear_align_corners$",
    r"^test_resize_downsample_scales_cubic_align_corners$",
    # ORT IRFFT mishandles the ``inverse=1, onesided=1`` combination.
    r"^test_cc_dft_irfft(_opset19|_roundtrip|_roundtrip_opset19)?$",
    # ORT does not support Optional loop-carried state in this graph structure.
    r"^test_cc_loop16_seq_none$",
    # ORT does not support these Sequence/Optional graph patterns.
    r"^test_cc_identity_sequence$",
    r"^test_cc_identity_opt$",
    r"^test_cc_if_seq$",
    r"^test_cc_if_opt$",
    # ORT rejects the empty-name encoding of the optional ``axes`` input.
    r"^test_cc_squeeze_empty_axes_name$",
    # ORT does not support batchwise recurrent operations (layout == 1).
    r"^test_cc_gru_batchwise$",
    r"^test_cc_lstm_batchwise$",
    r"^test_cc_simple_rnn_batchwise$",
    # ...
    r"e2m1.*",
    r"e4m3.*",
    r"e5m2.*",
    r"float8.*",
    r"quantizelinear_u?int2.*",
    r"quantizelinear_u?int4.*",
    # ...
    r"E2M1.*",
    r"E4M3.*",
    r"E5M2.*",
    r"FLOAT8.*",
    r"to_BFLOAT16.*",
    r"to_U?INT[24].*",
    r"castlike_U?INT[24].*",
    r"cast_U?INT[24].*",
    r"to_STRING",
    r"prelu_inf.*",
    r"sequence.*",
    # ONNX Runtime's Where kernel does not implement these dtypes.
    r"^test_cc_where_(bool|int8|int16|uint16|uint32|uint64)$",
]

# Add opset-gated exclusions only for opset versions ONNX Runtime cannot load yet.
_ORT_MAX_OPSET = ort_max_supported_opset()
for _opset, _patterns in ORT_OPSET_GATED_EXCLUDE_REGEX.items():
    if _ORT_MAX_OPSET < _opset:
        ORT_EXCLUDE_REGEX.extend(_patterns)

TestOrtBackend = make_test_class(onnxruntime_backend, exclude_regex=ORT_EXCLUDE_REGEX)


if __name__ == "__main__":
    unittest.main(verbosity=2)

The runtime function hands the ModelProto to InferenceSessionAllTypes, builds the feed dictionary by pairing the session's input names with the positional inputs, and returns the inference outputs.
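The opset gating in the listing above can be isolated into a small pure function, which makes it easier to reuse for another runtime and to test without ONNX Runtime installed. This is only a sketch: gated_exclusions is a hypothetical name and the patterns are a small subset of the real lists.

```python
def gated_exclusions(base, gated, max_supported_opset):
    """Return base patterns plus those gated on opsets the runtime cannot load."""
    patterns = list(base)  # copy so the caller's list is not mutated
    for opset, extra in sorted(gated.items()):
        # Patterns for an opset are added only when the runtime is older.
        if max_supported_opset < opset:
            patterns.extend(extra)
    return patterns


base = [r"^test_cc_zipmap_"]
gated = {
    27: [r"^test_cc_linear_attention_.*$"],
    28: [r"^test_cc_celu_float16$"],
}

old_runtime = gated_exclusions(base, gated, max_supported_opset=26)
mid_runtime = gated_exclusions(base, gated, max_supported_opset=27)
new_runtime = gated_exclusions(base, gated, max_supported_opset=28)
```

Unlike the in-place extend() in the listing, this version returns a new list, so the base exclusions stay untouched.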

Run it with:

python -m pytest unittests/onnxl_vs_ort/test_backend_with_onnxruntime.py -v

or, to run only the test cases whose name contains abs (pytest's -k matching is case-insensitive):

python -m pytest unittests/onnxl_vs_ort/test_backend_with_onnxruntime.py -v -k abs

How test cases are collected#

collect_test_case() first collects every node test case registered by the C++ lib_onnx_backend_test library (exposed through the onnx_light.onnx_py._onnxpybackend.backend_test Python bindings). It then runs every export_* class method declared on any user-defined subclass of Base; each call to expect() adds one TestCase to the global ALL_TESTS dictionary. When a Python-defined case and a C++ case share a name, the Python-defined case takes precedence.

make_test_class() calls collect_test_case() internally, so tests are always re-collected from scratch when the function is called.
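The precedence rule can be pictured as a dictionary merge in which the Python-defined entries are applied last. The sketch below assumes a registry keyed by test name and uses plain strings in place of TestCase objects. merge_registries is a hypothetical helper, not the library's code.

```python
def merge_registries(cpp_cases: dict, python_cases: dict) -> dict:
    """Merge two name-keyed registries; Python-defined cases win on a clash."""
    merged = dict(cpp_cases)
    # update() overwrites any C++ entry that shares a name with a Python entry.
    merged.update(python_cases)
    return merged


cpp_cases = {"test_abs": "cpp:abs", "test_add": "cpp:add"}
python_cases = {"test_abs": "py:abs", "test_custom": "py:custom"}
merged = merge_registries(cpp_cases, python_cases)
```

After the merge, test_abs resolves to the Python-defined case while test_add still comes from the C++ registry.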

TestCase metadata and ONNX parity#

Every registered case is a kind="node" test case carrying a small single-node ONNX model plus expected datasets. Many cases mirror the official ONNX backend node tests directly; others are focused parity or regression cases that keep the same ONNX-style data model.

Random operators keep deterministic expected outputs even though ONNX marks them as non-deterministic: the reference kernels use a fixed SplitMix64 + Irwin-Hall random stream (or an explicit seed when provided). This keeps the registry reproducible while avoiding large literal TensorProto blobs: expected values are stored as runtime Tensor byte buffers in datasets.
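For readers unfamiliar with the two building blocks named above, here is a minimal Python sketch of the textbook versions: a SplitMix64 generator and an Irwin-Hall approximation of a standard normal. The seed, the number of summed terms (12), and the way 64-bit integers are turned into floats are assumptions made for illustration. The reference kernels may differ in all of these details.

```python
MASK64 = (1 << 64) - 1


class SplitMix64:
    """Textbook SplitMix64 generator; the default seed is arbitrary."""

    def __init__(self, seed: int = 0) -> None:
        self.state = seed & MASK64

    def next_u64(self) -> int:
        # Advance the state by the golden-ratio increment, then mix the bits.
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_uniform(self) -> float:
        # Keep the top 53 bits so the value lies in [0, 1).
        return (self.next_u64() >> 11) / float(1 << 53)


def irwin_hall_normal(rng: SplitMix64, terms: int = 12) -> float:
    # A sum of `terms` uniforms has mean terms/2 and variance terms/12;
    # centring and scaling gives an approximately standard normal value.
    total = sum(rng.next_uniform() for _ in range(terms))
    return (total - terms / 2.0) / (terms / 12.0) ** 0.5


# Two generators with the same seed produce the same first sample.
first = [irwin_hall_normal(SplitMix64(seed=42)) for _ in range(2)]

rng = SplitMix64(seed=42)
stream = [irwin_hall_normal(rng) for _ in range(3)]
```

Because the stream depends only on the seed, expected outputs can be regenerated on demand instead of being stored as literal tensors.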

For shape-oriented scenarios, cases can also annotate graph intermediate values with graph.value_info (through AppendValueInfo) so shape inference tests can assert intermediate dimensions, not just final outputs.

Running backend tests in C++#

The exact same node test cases are also available directly from C++ via the lib_onnx_backend_test static library, with no dependency on Python. It publicly links the lib_onnx_kernels static library, which provides the runtime data model and reference kernel implementations. Together they expose:

a runtime onnx::onnx_kernels::Tensor (distinct from onnx::TensorProto) that stores raw element bytes,
an onnx::onnx_kernels::TestCase bundle of onnx::ModelProto and expected input/output data sets,
the onnx::onnx_kernels::Expect() helper used by every RegisterXxxCases function to register a single-node model, and
onnx::onnx_kernels::CollectTestCases(), which returns the full registry of node test cases (the same registry that the Python bindings expose through onnx_light.onnx_py._onnxpybackend.backend_test).

Per-operator cases are organised under onnx_light/onnx_backend_test/cases/<group>/ (math, logical, nn, tensor, …) and the expected outputs are computed with the reference kernels under onnx_light/onnx_kernels/kernels/<group>/ so the registry is fully self-contained and deterministic.

A minimal C++ runtime evaluator therefore looks like:

#include <vector>

#include "onnx_backend_test/test_case.h"

using namespace onnx::onnx_kernels;

int main() {
  std::vector<TestCase> cases = CollectTestCases();
  for (const TestCase &tc : cases) {
    // Serialize tc.model and run it through your engine, then
    // compare against tc.data_sets[*].outputs using tc.atol / tc.rtol.
  }
  return 0;
}

The library ships its own GoogleTest-based unit tests under unittests/cc/onnx_kernels and unittests/cc/onnx_backend_test. To build and run them, configure the project with ONNX_LIGHT_BUILD_TESTS=ON and use ctest:

cmake -S . -B build -DONNX_LIGHT_BUILD_TESTS=ON
cmake --build build -j
ctest --test-dir build -R Backend --output-on-failure

The -R regex can be tightened (for example -R KernelClass) to focus on a single test group.