QuantizeLinear - version 19
This page documents version 19 of operator QuantizeLinear. See QuantizeLinear for the latest version (since version 25).
Domain: ai.onnx
Since version: 19
The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor.
The scale factor and zero point must have the same shape. Each can be either a scalar, for per-tensor / per-layer quantization, or a 1-D tensor, for per-axis quantization.
The quantization formula is y = saturate ((x / y_scale) + y_zero_point).
Saturation clamps the result to [0, 255] when the output type is uint8, or to [-128, 127] when it is int8.
The quotient (x / y_scale) is rounded to the nearest integer, with ties going to the nearest even value. Refer to https://en.wikipedia.org/wiki/Rounding for details.
‘y_zero_point’ and ‘y’ must have the same type.
‘y_zero_point’ is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, or float8e5m2fnuz,
but the quantization formula remains the same for consistency, and
the type of the input ‘y_zero_point’ still determines the quantization type.
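The formula, rounding rule, and saturation ranges above can be sketched for a single value in plain Python. This is only an illustration under stated assumptions, not the reference implementation: `quantize_scalar`, `RANGES`, and the `out_type` parameter are hypothetical names, and only the two integer output types are covered.

```python
# Saturation ranges taken from the text; the keys are illustrative labels,
# not part of any ONNX API.
RANGES = {"uint8": (0, 255), "int8": (-128, 127)}

def quantize_scalar(x, y_scale, y_zero_point, out_type="uint8"):
    """Hypothetical helper: y = saturate(round(x / y_scale) + y_zero_point)."""
    lo, hi = RANGES[out_type]
    # Python 3's built-in round() sends ties to the nearest even integer,
    # which matches the rounding rule described above.
    rounded = round(x / y_scale)
    # Saturate: clamp into the representable range of the output type.
    return max(lo, min(hi, rounded + y_zero_point))

# 0.25/0.5 = 0.5 rounds to 0 and 1.25/0.5 = 2.5 rounds to 2 (ties to even);
# 200.0/0.5 + 128 = 528 saturates to 255.
values = [quantize_scalar(v, 0.5, 128) for v in (-1.0, 0.25, 0.75, 1.25, 200.0)]
print(values)  # [126, 128, 130, 130, 255]
```

Rounding happens before the zero point is added, so the zero point itself is never rounded.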
Inputs
x (T1): N-D full-precision input tensor to be quantized.
y_scale (T1): Scale used to quantize ‘x’ into ‘y’. It can be a scalar, which means per-tensor / per-layer quantization, or a 1-D tensor for per-axis quantization.
y_zero_point (T2): Zero point used to quantize ‘x’ into ‘y’. Its shape must match that of ‘y_scale’. If it is not specified, the default is uint8 with a zero point of 0.
Outputs
y (T2): N-D quantized output tensor. It has the same shape as input ‘x’.
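Per-axis quantization with a 1-D ‘y_scale’ and ‘y_zero_point’ might look like the NumPy sketch below. `quantize_per_axis` is a hypothetical helper, and its `axis` parameter (which dimension of ‘x’ the 1-D tensors run along) is an assumption of this sketch, since the text above does not say how that dimension is chosen. Only integer output types are handled.

```python
import numpy as np

def quantize_per_axis(x, y_scale, y_zero_point, axis=1):
    """Hypothetical helper: one scale / zero point per slice along `axis`."""
    x = np.asarray(x, dtype=np.float32)
    y_scale = np.asarray(y_scale, dtype=np.float32)
    y_zero_point = np.asarray(y_zero_point)
    if y_scale.shape != y_zero_point.shape:
        raise ValueError("y_scale and y_zero_point must have the same shape")
    if y_scale.ndim != 1 or y_scale.shape[0] != x.shape[axis]:
        raise ValueError("expected 1-D tensors matching x.shape[axis]")
    # Reshape the 1-D tensors so they broadcast along the chosen axis.
    shape = [1] * x.ndim
    shape[axis] = -1
    scale = y_scale.reshape(shape)
    zero_point = y_zero_point.astype(np.float32).reshape(shape)
    # The output type follows the zero point's type.
    info = np.iinfo(y_zero_point.dtype)
    # np.rint rounds ties to the nearest even integer.
    y = np.clip(np.rint(x / scale) + zero_point, info.min, info.max)
    return y.astype(y_zero_point.dtype)

x = np.array([[1.0, 2.0, 3.0], [-4.0, -5.0, 600.0]], dtype=np.float32)
y = quantize_per_axis(x, [0.5, 1.0, 2.0],
                      np.array([0, 10, -10], dtype=np.int8), axis=1)
print(y)  # [[2 12 -8] [-8 5 127]], same shape as x, dtype int8
```

Each column uses its own scale and zero point, and the last element (600 / 2 - 10 = 290) saturates to the int8 maximum of 127.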
Type Constraints
T1: Constrain ‘x’ to float, float16, bfloat16 or int32 tensor. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).
T2: Constrain ‘y_zero_point’ and ‘y’ to 8-bit integer/float tensor. Allowed types: tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int8), tensor(uint8).
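The type constraints can be read as a small validation rule: ‘x’ and ‘y_scale’ share one type from T1, and ‘y_zero_point’ (uint8 when omitted) picks the output type from T2. A minimal sketch of that check follows. `check_quantize_linear_types` is a hypothetical helper working on type strings, not an ONNX API.

```python
# Allowed types copied from the constraints above.
T1 = {"tensor(bfloat16)", "tensor(float)", "tensor(float16)", "tensor(int32)"}
T2 = {
    "tensor(float8e4m3fn)", "tensor(float8e4m3fnuz)",
    "tensor(float8e5m2)", "tensor(float8e5m2fnuz)",
    "tensor(int8)", "tensor(uint8)",
}

def check_quantize_linear_types(x_type, y_scale_type, y_zero_point_type=None):
    """Hypothetical helper: return the type of 'y' or raise TypeError."""
    if x_type not in T1:
        raise TypeError(f"x has unsupported type {x_type}")
    # 'x' and 'y_scale' are both bound to T1, so they must agree.
    if y_scale_type != x_type:
        raise TypeError("y_scale must have the same type as x")
    # When the zero point is omitted, the default is uint8.
    if y_zero_point_type is None:
        y_zero_point_type = "tensor(uint8)"
    if y_zero_point_type not in T2:
        raise TypeError(f"y_zero_point has unsupported type {y_zero_point_type}")
    # 'y' has the same type as 'y_zero_point'.
    return y_zero_point_type

default_out = check_quantize_linear_types("tensor(float16)", "tensor(float16)")
fp8_out = check_quantize_linear_types(
    "tensor(float)", "tensor(float)", "tensor(float8e4m3fn)")
print(default_out, fp8_out)  # tensor(uint8) tensor(float8e4m3fn)
```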
Differences with previous version (13)
SchemaDiff: QuantizeLinear (domain 'ai.onnx')
old version: 13
new version: 19
breaking: yes
Breaking reasons:
input ‘y_scale’ (changed): type_str changed ‘tensor(float)’ -> ‘T1’
Inputs:
[BREAKING] changed ‘y_scale’: type_str changed ‘tensor(float)’ -> ‘T1’
Type constraints:
changed ‘T1’: added types: [‘tensor(bfloat16)’, ‘tensor(float16)’]
changed ‘T2’: added types: [‘tensor(float8e4m3fn)’, ‘tensor(float8e4m3fnuz)’, ‘tensor(float8e5m2)’, ‘tensor(float8e5m2fnuz)’]
Documentation:
line similarity: 0.50 (+6/-2 lines)
--- QuantizeLinear v13
+++ QuantizeLinear v19
@@ -1,6 +1,10 @@
The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor.
The scale factor and zero point must have same shape, and can be either a scalar for per-tensor / per layer quantization, or a 1-D tensor for per-axis quantization.
-The quantization formula is y = saturate ((x / y_scale) + y_zero_point).
+The quantization formula is `y = saturate ((x / y_scale) + y_zero_point)`.
For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
-For (x / y_scale), it's rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details. 'y_zero_point' and 'y' must have same type.
+For (x / y_scale), it's rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
+'y_zero_point' and 'y' must have same type.
+'y_zero_point' is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz,
+but the quantization formula remains the same for consistency and
+the type of the attribute 'y_zero_point' still determines the quantization type.