QuantizeLinear - version 21

This page documents version 21 of operator QuantizeLinear. See QuantizeLinear for the latest version (since version 25).

  • Domain: ai.onnx

  • Since version: 21

The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the low-precision/quantized tensor. The scale factor and zero point must have the same shape, and that shape determines the quantization granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point). Saturation clamps the result to the range of the output type:

  • uint16: [0, 65535]

  • int16: [-32768, 32767]

  • uint8: [0, 255]

  • int8: [-128, 127]

  • uint4: [0, 15]

  • int4: [-8, 7]
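As a rough illustration of the formula and the saturation ranges above, the following pure-Python sketch quantizes single values. The names `RANGES` and `quantize_value` are hypothetical helpers introduced for this example and are not part of ONNX. The sketch assumes integer output types only and relies on Python's built-in `round()`, which rounds ties to the nearest even integer.

```python
# Saturation ranges copied from the list above.
RANGES = {
    "uint16": (0, 65535),
    "int16": (-32768, 32767),
    "uint8": (0, 255),
    "int8": (-128, 127),
    "uint4": (0, 15),
    "int4": (-8, 7),
}


def quantize_value(x, y_scale, y_zero_point=0, dtype="uint8"):
    """Hypothetical helper: y = saturate(round(x / y_scale) + y_zero_point)."""
    lo, hi = RANGES[dtype]
    # Built-in round() rounds halfway cases to the nearest even integer.
    q = round(x / y_scale) + y_zero_point
    # Saturate: clamp into the representable range of the output type.
    return max(lo, min(hi, q))


inputs = [-300.0, -1.0, 0.0, 1.0, 3.0, 5.0, 1000.0]
values = [quantize_value(v, 2.0, 128, "uint8") for v in inputs]
print(values)  # [0, 128, 128, 128, 130, 130, 255]
```

In this run, 1.0 / 2.0 = 0.5 rounds to 0 and 5.0 / 2.0 = 2.5 rounds to 2 (ties go to even), while -300.0 and 1000.0 saturate at the uint8 bounds.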

x / y_scale is rounded to the nearest integer, with ties going to the nearest even integer. Refer to https://en.wikipedia.org/wiki/Rounding for details.

y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 types, but the quantization formula remains the same for consistency, and the type of the y_zero_point input still determines the quantization type.

There are three supported quantization granularities, determined by the shape of y_scale. In all cases, y_zero_point must have the same shape as y_scale.

  • Per-tensor (per-layer) quantization: y_scale is a scalar.

  • Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.

  • Blocked quantization: The scale’s shape is identical to the input’s shape, except for one dimension, in which blocking is performed. Given x shape (D0, ..., Di, ..., Dn), axis=i, and block size B: y_scale shape is (D0, ..., ceil(Di/B), ..., Dn).
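The shape rules for the three granularities can be sketched in pure Python as follows. `scale_shape` and `quantize_blocked_2d` are hypothetical helpers written for this example only. The blocked example assumes a 2-D input, axis=1, a zero point of 0, uint8 saturation, and that element `c` along the blocked axis uses the scale at block index `c // block_size`. That mapping is consistent with the `ceil(Di/B)` shape above but is an assumption of this sketch.

```python
import math


def scale_shape(x_shape, granularity, axis=1, block_size=None):
    """Hypothetical helper: expected y_scale shape for a given granularity."""
    if granularity == "per_tensor":
        return ()  # a scalar
    if granularity == "per_axis":
        return (x_shape[axis],)  # 1-D tensor of length Di
    if granularity == "blocked":
        shape = list(x_shape)
        shape[axis] = math.ceil(x_shape[axis] / block_size)
        return tuple(shape)
    raise ValueError(f"unknown granularity: {granularity}")


def quantize_blocked_2d(x, y_scale, block_size, lo=0, hi=255):
    """Blocked quantization of a 2-D list along axis=1, zero point 0."""
    return [
        [
            # Column c falls in block c // block_size (assumed mapping).
            max(lo, min(hi, round(v / y_scale[r][c // block_size])))
            for c, v in enumerate(row)
        ]
        for r, row in enumerate(x)
    ]


x = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [10.0, 20.0, 30.0, 40.0, 50.0]]
y_scale = [[1.0, 2.0, 5.0],
           [10.0, 20.0, 50.0]]

shape = scale_shape((2, 5), "blocked", axis=1, block_size=2)
y = quantize_blocked_2d(x, y_scale, block_size=2)
print(shape)  # (2, 3)
print(y)      # [[1, 2, 2, 2, 1], [1, 2, 2, 2, 1]]
```

With block size 2 along an axis of length 5, the last block holds a single element, which is why the scale has ceil(5/2) = 3 entries per row.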

Inputs

  • x (T1): N-D full-precision input tensor to be quantized.

  • y_scale (T1): Scale used to quantize x into y. For per-tensor/layer quantization the scale is a scalar; for per-axis quantization it is a 1-D tensor; for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.

  • y_zero_point (T2): Zero point used to quantize x into y. Its shape must match y_scale. If not specified, it defaults to a uint8 zero point of 0.

Outputs

  • y (T2): N-D quantized output tensor. It has the same shape as the input x.

Type Constraints

  • T1: The type of the input ‘x’. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).

  • T2: The type of the input y_zero_point and the output y. Allowed types: tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8).

Differences with previous version (19)

SchemaDiff: QuantizeLinear (domain 'ai.onnx')

  • old version: 19

  • new version: 21

  • breaking: no

Type constraints:

  • changed ‘T2’: added types: [‘tensor(int16)’, ‘tensor(int4)’, ‘tensor(uint16)’, ‘tensor(uint4)’]

Documentation:

  • line similarity: 0.06 (+21/-9 lines)

--- QuantizeLinear v19
+++ QuantizeLinear v21
@@ -1,10 +1,22 @@

-The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor.
-The scale factor and zero point must have same shape, and can be either a scalar for per-tensor / per layer quantization, or a 1-D tensor for per-axis quantization.
-The quantization formula is `y = saturate ((x / y_scale) + y_zero_point)`.
-For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
-For (x / y_scale), it's rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-'y_zero_point' and 'y' must have same type.
-'y_zero_point' is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz,
-but the quantization formula remains the same for consistency and
-the type of the attribute 'y_zero_point' still determines the quantization type.
+The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
+low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
+granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
+Saturation is done according to:
+- uint16: [0, 65535]
+- int16: [-32768, 32767]
+- uint8: [0, 255]
+- int8: [-128, 127]
+- uint4: [0, 15]
+- int4: [-8, 7]
+For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.
+There are three supported quantization granularities, determined by the shape of `y_scale`.
+In all cases, `y_zero_point` must have the same shape as `y_scale`.
+- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
+- Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape
+  `(D0, ..., Di, ..., Dn)` and `axis=i`, `y_scale` is a 1-D tensor of length `Di`.
+- Blocked quantization: The scale's shape is identical to the input's shape, except for one dimension, in which
+  blocking is performed. Given `x` shape `(D0, ..., Di, ..., Dn)`, `axis=i`, and block size `B`: `y_scale` shape is
+  `(D0, ..., ceil(Di/B), ..., Dn)`.