.. _op_ai_onnx_QuantizeLinear:

QuantizeLinear
==============

- **Domain**: ``ai.onnx``
- **Since version**: 25

The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the low-precision (quantized) tensor. The scale factor and zero point must have the same shape, which determines the quantization granularity. The quantization formula is ``y = saturate((x / y_scale) + y_zero_point)``.

Saturation is done according to:

- uint16: [0, 65535]
- int16: [-32768, 32767]
- uint8: [0, 255]
- int8: [-128, 127]
- uint4: [0, 15]
- int4: [-8, 7]
- uint2: [0, 3]
- int2: [-2, 1]

The quotient ``(x / y_scale)`` is rounded to the nearest integer, with ties rounded to the nearest even integer. Refer to https://en.wikipedia.org/wiki/Rounding for details.

``y_zero_point`` and ``y`` must have the same type. ``y_zero_point`` is usually not used for quantization to float8 and 4-bit types, but the quantization formula remains the same for consistency, and the type of ``y_zero_point`` still determines the quantization type. ``x`` and ``y_scale`` are allowed to have different types. The type of ``y_scale`` determines the precision of the division of ``x`` by ``y_scale``, unless the ``precision`` attribute is specified.

There are three supported quantization granularities, determined by the shape of ``y_scale``. In all cases, ``y_zero_point`` must have the same shape as ``y_scale``.

- Per-tensor (per-layer) quantization: ``y_scale`` is a scalar.
- Per-axis quantization: ``y_scale`` must be a 1-D tensor whose length equals the size of the quantization axis. For an input shape ``(D0, ..., Di, ..., Dn)`` and ``axis=i``, ``y_scale`` is a 1-D tensor of length ``Di``.
- Blocked quantization: The shape of ``y_scale`` is identical to the input's shape, except for one dimension, in which blocking is performed. Given ``x`` shape ``(D0, ..., Di, ..., Dn)``, ``axis=i``, and block size ``B``, the ``y_scale`` shape is ``(D0, ..., ceil(Di/B), ..., Dn)``.

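The formula and the three granularities described above can be illustrated with a small pure-Python sketch. This is not the ONNX reference implementation: the names ``RANGES``, ``quantize_scalar`` and ``quantize_2d`` are hypothetical, the input is restricted to a 2-D nested list, only the integer output types are covered, and the division is carried out in Python double precision rather than in the precision of ``y_scale``.

```python
# Saturation ranges for the integer types listed above.
RANGES = {
    "uint16": (0, 65535), "int16": (-32768, 32767),
    "uint8": (0, 255), "int8": (-128, 127),
    "uint4": (0, 15), "int4": (-8, 7),
    "uint2": (0, 3), "int2": (-2, 1),
}


def quantize_scalar(x, scale, zero_point=0, dtype="uint8"):
    """Quantize one value: round half to even, add zero point, saturate."""
    lo, hi = RANGES[dtype]
    # Python's built-in round() rounds ties to the nearest even integer.
    q = round(x / scale) + zero_point
    return max(lo, min(hi, q))


def quantize_2d(x, scale, zero_point, dtype="uint8", axis=1, block_size=0):
    """Quantize a 2-D nested list (hypothetical helper, for illustration).

    scale and zero_point share one shape, which selects the granularity:
    a scalar (per-tensor), a flat list (per-axis), or a nested list whose
    `axis` dimension is ceil(D / block_size) (blocked).
    """
    def pick(param, i, j):
        if not isinstance(param, list):        # per-tensor
            return param
        if not isinstance(param[0], list):     # per-axis
            return param[(i, j)[axis]]
        # Blocked: the index along `axis` is divided by the block size.
        if axis == 0:
            return param[i // block_size][j]
        return param[i][j // block_size]

    return [
        [quantize_scalar(v, pick(scale, i, j), pick(zero_point, i, j), dtype)
         for j, v in enumerate(row)]
        for i, row in enumerate(x)
    ]


x = [[0.0, 2.5, 3.5, 1000.0],
     [-1.0, 0.5, 1.5, 7.0]]

# Per-tensor: scalar scale and zero point, uint8 output.
per_tensor = quantize_2d(x, 1.0, 0, dtype="uint8")
# [[0, 2, 4, 255], [0, 0, 2, 7]]

# Per-axis along axis 0: one scale and zero point per row, int8 output.
per_axis = quantize_2d(x, [1.0, 0.5], [0, 10], dtype="int8", axis=0)
# [[0, 2, 4, 127], [8, 11, 13, 24]]

# Blocked along axis 1 with B = 2: scale shape is (2, ceil(4 / 2)) = (2, 2).
blocked = quantize_2d(x, [[1.0, 2.0], [0.5, 1.0]], [[0, 0], [0, 0]],
                      dtype="int4", axis=1, block_size=2)
# [[0, 2, 2, 7], [-2, 1, 2, 7]]
```

In the per-tensor result, ``2.5`` becomes ``2`` and ``3.5`` becomes ``4`` because ties go to the even neighbour, while ``1000.0`` and ``-1.0`` are clamped to the uint8 limits.
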
**Inputs**

- **x** (*T1*): N-D full-precision input tensor to be quantized.
- **y_scale** (*T2*): Scale used to quantize ``x`` into ``y``. For per-tensor (per-layer) quantization the scale is a scalar, for per-axis quantization it is a 1-D tensor, and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
- **y_zero_point** (*T3*): Zero point used to quantize ``x`` into ``y``. Its shape must match ``y_scale``. If not specified, it defaults to a uint8 zero point of 0.

**Outputs**

- **y** (*T3*): N-D quantized output tensor. It has the same shape as the input ``x``.

**Type Constraints**

- **T1**: The type of the input ``x``. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).
- **T2**: The type of the input ``y_scale``. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e8m0), tensor(int32).
- **T3**: The type of the input ``y_zero_point`` and the output ``y``. Allowed types: tensor(float4e2m1), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int2), tensor(int4), tensor(int8), tensor(uint16), tensor(uint2), tensor(uint4), tensor(uint8).

Differences with previous version (24)
--------------------------------------

**SchemaDiff**: ``QuantizeLinear`` (domain ``'ai.onnx'``)

* old version: 24
* new version: 25
* breaking: no

**Type constraints:**

* changed 'T3': added types: ['tensor(int2)', 'tensor(uint2)']

**Documentation:**

* line similarity: 0.97 (+2/-0 lines)

.. code-block:: diff

    --- QuantizeLinear v24
    +++ QuantizeLinear v25
    @@ -10,6 +10,8 @@
     - int8: [-128, 127]
     - uint4: [0, 15]
     - int4: [-8, 7]
    +- uint2: [0, 3]
    +- int2: [-2, 1]

     For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.

Version History
---------------

- :doc:`Version 24 `
- :doc:`Version 23 `
- :doc:`Version 21 `
- :doc:`Version 19 `
- :doc:`Version 13 `
- :doc:`Version 10 `