QuantizeLinear - version 23
This page documents version 23 of operator QuantizeLinear. See QuantizeLinear for the latest version (since version 25).
Domain: ai.onnx
Since version: 23
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point).
Saturation is done according to:
uint16: [0, 65535]
int16: [-32768, 32767]
uint8: [0, 255]
int8: [-128, 127]
uint4: [0, 15]
int4: [-8, 7]
The result of (x / y_scale) is rounded to the nearest integer, with ties rounded to the nearest even value. Refer to https://en.wikipedia.org/wiki/Rounding for details.
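The formula, the saturation ranges, and the rounding rule above can be combined into a minimal pure-Python sketch of per-tensor quantization. The helper name `quantize_linear` and the `dtype` string argument are illustrative assumptions, not part of the ONNX API. The sketch divides in Python's double precision, which may differ from the precision an actual runtime uses.

```python
# Saturation ranges copied from the list above.
RANGES = {
    "uint16": (0, 65535),
    "int16": (-32768, 32767),
    "uint8": (0, 255),
    "int8": (-128, 127),
    "uint4": (0, 15),
    "int4": (-8, 7),
}


def quantize_linear(x, y_scale, y_zero_point=0, dtype="uint8"):
    """Hypothetical helper: per-tensor quantization of a flat list of floats."""
    lo, hi = RANGES[dtype]
    out = []
    for value in x:
        # Python's built-in round() rounds ties to the nearest even integer.
        q = round(value / y_scale) + y_zero_point
        # saturate(): clamp into the representable range of the target type.
        out.append(max(lo, min(hi, q)))
    return out


print(quantize_linear([0.5, 1.5, 2.5, -1.0, 1000.0], 1.0))
print(quantize_linear([-300.0, 3.0], 2.0, y_zero_point=1, dtype="int8"))
```

In the first call, 0.5 and 2.5 round down to even values while 1.5 rounds up, and the out-of-range inputs are clamped to the uint8 limits.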
y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 and 4-bit types, but the quantization
formula remains the same for consistency, and the type of the input y_zero_point still determines the quantization type.
x and y_scale are allowed to have different types. The type of y_scale determines the precision of the division operation between x and
y_scale, unless the precision attribute is specified.
There are three supported quantization granularities, determined by the shape of y_scale.
In all cases, y_zero_point must have the same shape as y_scale.
Per-tensor (per-layer) quantization: y_scale is a scalar.
Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.
Blocked quantization: The scale’s shape is identical to the input’s shape, except for one dimension, in which blocking is performed. Given x shape (D0, ..., Di, ..., Dn), axis=i, and block size B: y_scale shape is (D0, ..., ceil(Di/B), ..., Dn).
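The shape rules for the three granularities can be sketched as a small helper. The function names and the `granularity` strings below are hypothetical and only serve the illustration. The mapping from an element index to its scale index in the blocked case assumes that each block covers B consecutive elements along the blocked axis, which the text above implies but does not spell out.

```python
import math


def scale_shape(x_shape, granularity, axis=None, block_size=None):
    """Hypothetical helper: y_scale shape implied by each granularity."""
    if granularity == "per_tensor":
        return ()  # a scalar has an empty shape
    if granularity == "per_axis":
        return (x_shape[axis],)  # 1-D tensor of length Di
    if granularity == "blocked":
        shape = list(x_shape)
        shape[axis] = math.ceil(x_shape[axis] / block_size)  # ceil(Di/B)
        return tuple(shape)
    raise ValueError(f"unknown granularity: {granularity}")


def blocked_scale_index(x_index, axis, block_size):
    """Assumption: element i along the blocked axis uses scale entry i // B."""
    index = list(x_index)
    index[axis] //= block_size
    return tuple(index)


print(scale_shape((2, 3, 8), "per_tensor"))
print(scale_shape((2, 3, 8), "per_axis", axis=1))
print(scale_shape((2, 3, 10), "blocked", axis=2, block_size=4))
print(blocked_scale_index((1, 2, 9), axis=2, block_size=4))
```

With a blocked axis of length 10 and a block size of 4, the last block is partial, which is why the scale needs ceil(10/4) = 3 entries along that axis.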
Inputs
x (T1): N-D full precision input tensor to be quantized.
y_scale (T2): Scale for doing quantization to get y. For per-tensor/layer quantization the scale is a scalar, for per-axis quantization it is a 1-D tensor, and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
y_zero_point (T3): Zero point for doing quantization to get y. Shape must match y_scale. Default is uint8 with zero point of 0 if it’s not specified.
Outputs
y (T3): N-D quantized output tensor. It has the same shape as input x.
Type Constraints
T1: The type of the input ‘x’. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).
T2: The type of the input ‘y_scale’. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).
T3: The type of the input ‘y_zero_point’ and the output ‘y’. Allowed types: tensor(float4e2m1), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8).
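As a rough illustration of how the three type variables interact, the sketch below checks a combination of input types against the allowed sets listed above. The function `check_types` is a hypothetical helper, not an ONNX API. It only encodes the lists from this page, plus the uint8 default for an omitted y_zero_point.

```python
# Allowed types for T1 (x) and T2 (y_scale), copied from the lists above.
HIGH_PRECISION = {"bfloat16", "float", "float16", "int32"}

# Allowed types for T3 (y_zero_point and y), copied from the list above.
QUANTIZED = {
    "float4e2m1", "float8e4m3fn", "float8e4m3fnuz", "float8e5m2",
    "float8e5m2fnuz", "int16", "int4", "int8", "uint16", "uint4", "uint8",
}


def check_types(x_type, y_scale_type, y_zero_point_type="uint8"):
    """Hypothetical helper: return the output type or raise on a bad combination."""
    if x_type not in HIGH_PRECISION:
        raise TypeError(f"x type {x_type!r} is not allowed for T1")
    if y_scale_type not in HIGH_PRECISION:
        raise TypeError(f"y_scale type {y_scale_type!r} is not allowed for T2")
    if y_zero_point_type not in QUANTIZED:
        raise TypeError(f"y_zero_point type {y_zero_point_type!r} is not allowed for T3")
    # y shares T3 with y_zero_point, so the zero point type is the output type.
    return y_zero_point_type


# x and y_scale may differ because T1 and T2 are separate type variables.
print(check_types("float", "float16", "int8"))
print(check_types("bfloat16", "float"))
```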
Differences with previous version (21)
SchemaDiff: QuantizeLinear (domain 'ai.onnx')
old version: 21
new version: 23
breaking: yes
Breaking reasons:
input ‘y_scale’ (changed): type_str changed ‘T1’ -> ‘T2’
input ‘y_zero_point’ (changed): type_str changed ‘T2’ -> ‘T3’
output ‘y’ (changed): type_str changed ‘T2’ -> ‘T3’
type constraint ‘T2’ (changed): added types: [‘tensor(bfloat16)’, ‘tensor(float)’, ‘tensor(float16)’, ‘tensor(int32)’]; removed types: [‘tensor(float8e4m3fn)’, ‘tensor(float8e4m3fnuz)’, ‘tensor(float8e5m2)’, ‘tensor(float8e5m2fnuz)’, ‘tensor(int16)’, ‘tensor(int4)’, ‘tensor(int8)’, ‘tensor(uint16)’, ‘tensor(uint4)’, ‘tensor(uint8)’]
Inputs:
[BREAKING] changed ‘y_scale’: type_str changed ‘T1’ -> ‘T2’
[BREAKING] changed ‘y_zero_point’: type_str changed ‘T2’ -> ‘T3’
Outputs:
[BREAKING] changed ‘y’: type_str changed ‘T2’ -> ‘T3’
Type constraints:
added ‘T3’: added types: [‘tensor(float4e2m1)’, ‘tensor(float8e4m3fn)’, ‘tensor(float8e4m3fnuz)’, ‘tensor(float8e5m2)’, ‘tensor(float8e5m2fnuz)’, ‘tensor(int16)’, ‘tensor(int4)’, ‘tensor(int8)’, ‘tensor(uint16)’, ‘tensor(uint4)’, ‘tensor(uint8)’]
[BREAKING] changed ‘T2’: added types: [‘tensor(bfloat16)’, ‘tensor(float)’, ‘tensor(float16)’, ‘tensor(int32)’]; removed types: [‘tensor(float8e4m3fn)’, ‘tensor(float8e4m3fnuz)’, ‘tensor(float8e5m2)’, ‘tensor(float8e5m2fnuz)’, ‘tensor(int16)’, ‘tensor(int4)’, ‘tensor(int8)’, ‘tensor(uint16)’, ‘tensor(uint4)’, ‘tensor(uint8)’]
Documentation:
line similarity: 0.84 (+7/-1 lines)
--- QuantizeLinear v21
+++ QuantizeLinear v23
@@ -2,6 +2,7 @@
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
+
Saturation is done according to:
- uint16: [0, 65535]
- int16: [-32768, 32767]
@@ -9,9 +10,14 @@
- int8: [-128, 127]
- uint4: [0, 15]
- int4: [-8, 7]
+
For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 and 4bit types, but the quantization
formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.
+`x` and `y_scale` are allowed to have different types. The type of `y_scale` determines the precision of the division operation between `x` and
+`y_scale`, unless the `precision` attribute is specified.
+
There are three supported quantization granularities, determined by the shape of `y_scale`.
In all cases, `y_zero_point` must have the same shape as `y_scale`.
- Per-tensor (per-layer) quantization: `y_scale` is a scalar.