.. _op_ai_onnx_QuantizeLinear-21:

QuantizeLinear - version 21
===========================

This page documents version **21** of operator **QuantizeLinear**.
See :doc:`QuantizeLinear` for the latest version (since version 25).

- **Domain**: ``ai.onnx``
- **Since version**: 21

The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the low-precision/quantized tensor. The scale factor and zero point must have the same shape, which determines the quantization granularity. The quantization formula is ``y = saturate((x / y_scale) + y_zero_point)``.

Saturation is done according to:

- uint16: [0, 65535]
- int16: [-32768, 32767]
- uint8: [0, 255]
- int8: [-128, 127]
- uint4: [0, 15]
- int4: [-8, 7]

The quotient ``(x / y_scale)`` is rounded to the nearest integer, with ties going to the nearest even value. Refer to https://en.wikipedia.org/wiki/Rounding for details.

``y_zero_point`` and ``y`` must have the same type. ``y_zero_point`` is usually not used for quantization to float8 types, but the quantization formula remains the same for consistency, and the type of the input ``y_zero_point`` still determines the quantization type.

There are three supported quantization granularities, determined by the shape of ``y_scale``. In all cases, ``y_zero_point`` must have the same shape as ``y_scale``.

- Per-tensor (per-layer) quantization: ``y_scale`` is a scalar.
- Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape ``(D0, ..., Di, ..., Dn)`` and ``axis=i``, ``y_scale`` is a 1-D tensor of length ``Di``.
- Blocked quantization: The scale's shape is identical to the input's shape, except for one dimension, in which blocking is performed. Given ``x`` shape ``(D0, ..., Di, ..., Dn)``, ``axis=i``, and block size ``B``: ``y_scale`` shape is ``(D0, ..., ceil(Di/B), ..., Dn)``.

**Inputs**

- **x** (*T1*): N-D full-precision input tensor to be quantized.
- **y_scale** (*T1*): Scale for doing quantization to get ``y``. For per-tensor/layer quantization the scale is a scalar, for per-axis quantization it is a 1-D tensor, and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
- **y_zero_point** (*T2*): Zero point for doing quantization to get ``y``. Shape must match ``y_scale``. Default is uint8 with zero point of 0 if it's not specified.

**Outputs**

- **y** (*T2*): N-D quantized output tensor. It has the same shape as input ``x``.

**Type Constraints**

- **T1**: The type of the input ``x``. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).
- **T2**: The type of the input ``y_zero_point`` and the output ``y``. Allowed types: tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8).

Differences with previous version (19)
--------------------------------------

**SchemaDiff**: ``QuantizeLinear`` (domain ``'ai.onnx'``)

* old version: 19
* new version: 21
* breaking: no

**Type constraints:**

* changed 'T2': added types: ['tensor(int16)', 'tensor(int4)', 'tensor(uint16)', 'tensor(uint4)']

**Documentation:**

* line similarity: 0.06 (+21/-9 lines)

.. code-block:: diff

    --- QuantizeLinear v19
    +++ QuantizeLinear v21
    @@ -1,10 +1,22 @@
    -The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor.
    -The scale factor and zero point must have same shape, and can be either a scalar for per-tensor / per layer quantization, or a 1-D tensor for per-axis quantization.
    -The quantization formula is `y = saturate ((x / y_scale) + y_zero_point)`.
    -For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
    -For (x / y_scale), it's rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
    -'y_zero_point' and 'y' must have same type.
    -'y_zero_point' is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz,
    -but the quantization formula remains the same for consistency and
    -the type of the attribute 'y_zero_point' still determines the quantization type.
    +The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
    +low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
    +granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
    +Saturation is done according to:
    +- uint16: [0, 65535]
    +- int16: [-32768, 32767]
    +- uint8: [0, 255]
    +- int8: [-128, 127]
    +- uint4: [0, 15]
    +- int4: [-8, 7]
    +For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
    +`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
    +formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.
    +There are three supported quantization granularities, determined by the shape of `y_scale`.
    +In all cases, `y_zero_point` must have the same shape as `y_scale`.
    +- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
    +- Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape
    +  `(D0, ..., Di, ..., Dn)` and `axis=i`, `y_scale` is a 1-D tensor of length `Di`.
    +- Blocked quantization: The scale's shape is identical to the input's shape, except for one dimension, in which
    +  blocking is performed. Given `x` shape `(D0, ..., Di, ..., Dn)`, `axis=i`, and block size `B`: `y_scale` shape is
    +  `(D0, ..., ceil(Di/B), ..., Dn)`.
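
The version 21 formula and the three granularities it introduces (per-tensor, per-axis, blocked) can be illustrated with a small NumPy sketch. This is not the ONNX reference implementation: the function name ``quantize_linear_sketch``, the ``qtype`` parameter, and the ``SATURATION`` table are hypothetical names chosen for this example, and ``axis`` and ``block_size`` are plain Python arguments standing in for ``axis=i`` and block size ``B``. It covers only the integer output types, and it returns values in an ``int32`` array because NumPy has no 4-bit integer type.

.. code-block:: python

    import numpy as np

    # Integer saturation ranges as listed on this page (float8 types are not covered here).
    SATURATION = {
        "uint16": (0, 65535),
        "int16": (-32768, 32767),
        "uint8": (0, 255),
        "int8": (-128, 127),
        "uint4": (0, 15),
        "int4": (-8, 7),
    }


    def quantize_linear_sketch(x, y_scale, y_zero_point, qtype="uint8", axis=1, block_size=0):
        """Hypothetical helper: y = saturate(round_half_even(x / y_scale) + y_zero_point)."""
        x = np.asarray(x, dtype=np.float32)
        scale = np.asarray(y_scale, dtype=np.float32)
        zp = np.asarray(y_zero_point, dtype=np.int64)
        if scale.shape != zp.shape:
            raise ValueError("y_zero_point must have the same shape as y_scale")

        if scale.ndim == 0:
            pass  # Per-tensor: the scalar broadcasts over all of x.
        elif block_size > 0:
            # Blocked: entry k along `axis` covers elements k*B .. (k+1)*B - 1 of x.
            idx = np.arange(x.shape[axis]) // block_size
            scale = np.take(scale, idx, axis=axis)
            zp = np.take(zp, idx, axis=axis)
        else:
            # Per-axis: reshape the 1-D scale so it broadcasts along `axis` only.
            shape = [1] * x.ndim
            shape[axis] = x.shape[axis]
            scale = scale.reshape(shape)
            zp = zp.reshape(shape)

        qmin, qmax = SATURATION[qtype]
        y = np.rint(x / scale) + zp  # np.rint rounds ties to the nearest even value
        return np.clip(y, qmin, qmax).astype(np.int32)


    # Per-tensor, uint8: 0.5 rounds to 0 and 1.5 rounds to 2; out-of-range values saturate.
    per_tensor = quantize_linear_sketch([0.0, 1.0, 3.0, 1000.0, -1000.0], 2.0, 128)

    # Per-axis, int8: one scale per column (axis=1).
    per_axis = quantize_linear_sketch(
        [[1, 2, 4], [-3, 6, 1000]], [1, 2, 4], [0, 0, 0], qtype="int8", axis=1
    )

    # Blocked, uint4: x has shape (2, 4), block size 2 along axis 1, so y_scale has shape (2, 2).
    blocked = quantize_linear_sketch(
        [[1, 3, 4, 6], [8, 12, 16, 40]],
        [[1, 2], [4, 8]],
        [[0, 0], [0, 0]],
        qtype="uint4",
        axis=1,
        block_size=2,
    )

Under these assumptions, ``per_tensor`` is ``[128, 128, 130, 255, 0]``, which shows both the ties-to-even rounding and the saturation to ``[0, 255]``. The blocked case expands each scale entry across its block before dividing, which is one straightforward way to realize the ``(D0, ..., ceil(Di/B), ..., Dn)`` scale shape.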