QuantizeLinear

Domain: ai.onnx
Since version: 25

The linear quantization operator consumes a high-precision tensor, a scale, and a zero point, and computes the low-precision (quantized) tensor. The scale and zero point must have the same shape, which determines the quantization granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point).

Saturation clamps the result to the following ranges:

uint16: [0, 65535]
int16: [-32768, 32767]
uint8: [0, 255]
int8: [-128, 127]
uint4: [0, 15]
int4: [-8, 7]
uint2: [0, 3]
int2: [-2, 1]

(x / y_scale) is rounded to the nearest integer, with ties rounded to the nearest even value. Refer to https://en.wikipedia.org/wiki/Rounding for details.
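
The formula, the saturation ranges, and the rounding rule above can be combined in a few lines. The following is a minimal pure-Python sketch for a single element, not the reference implementation: the names `RANGES` and `quantize_scalar` are illustrative, and it relies on Python's built-in `round`, which rounds ties to even.

```python
# Saturation ranges copied from the list above.
RANGES = {
    "uint16": (0, 65535), "int16": (-32768, 32767),
    "uint8": (0, 255), "int8": (-128, 127),
    "uint4": (0, 15), "int4": (-8, 7),
    "uint2": (0, 3), "int2": (-2, 1),
}

def quantize_scalar(x, y_scale, y_zero_point=0, dtype="uint8"):
    """Hypothetical helper: y = saturate(round(x / y_scale) + y_zero_point)."""
    lo, hi = RANGES[dtype]
    rounded = round(x / y_scale)  # Python's round() breaks ties toward even
    return max(lo, min(hi, rounded + y_zero_point))

x = [0.0, 2.0, 3.0, 1000.0, -254.0, -1000.0]
y_uint8 = [quantize_scalar(v, 2.0) for v in x]
y_int8 = [quantize_scalar(v, 2.0, -10, "int8") for v in x]
print(y_uint8)
print(y_int8)
```

With these inputs, 3 / 2 = 1.5 rounds up to 2, while a value such as 5 / 2 = 2.5 would round down to 2, because ties go to the even neighbor.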

y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 and 4-bit types, but the quantization formula remains the same for consistency, and the type of the y_zero_point input still determines the quantization type. x and y_scale are allowed to have different types. The type of y_scale determines the precision of the division between x and y_scale, unless the precision attribute is specified.

There are three supported quantization granularities, determined by the shape of y_scale. In all cases, y_zero_point must have the same shape as y_scale.

Per-tensor (per-layer) quantization: y_scale is a scalar.
Per-axis quantization: y_scale is a 1-D tensor whose length equals the size of the quantization axis. For an input of shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.
Blocked quantization: The shape of y_scale is identical to the shape of the input, except for the one dimension along which blocking is performed. Given x of shape (D0, ..., Di, ..., Dn), axis=i, and block size B, the shape of y_scale is (D0, ..., ceil(Di/B), ..., Dn).
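
The three shape rules above can be written as one small function. This is a sketch only; `expected_scale_shape` and its `granularity` strings are hypothetical names introduced for illustration and are not part of the operator.

```python
import math

def expected_scale_shape(x_shape, granularity, axis=None, block_size=None):
    """Hypothetical helper returning the y_scale shape for each granularity."""
    if granularity == "per_tensor":
        return ()                      # scalar
    if granularity == "per_axis":
        return (x_shape[axis],)        # 1-D tensor of length Di
    if granularity == "blocked":
        shape = list(x_shape)
        shape[axis] = math.ceil(x_shape[axis] / block_size)
        return tuple(shape)
    raise ValueError(f"unknown granularity: {granularity}")

per_tensor = expected_scale_shape((1, 3, 3, 2), "per_tensor")
per_axis = expected_scale_shape((1, 3, 3, 2), "per_axis", axis=1)
blocked = expected_scale_shape((3, 4), "blocked", axis=1, block_size=2)
blocked_ragged = expected_scale_shape((3, 5), "blocked", axis=1, block_size=2)
print(per_tensor, per_axis, blocked, blocked_ragged)
```

The last call shows the effect of the ceiling: a dimension of 5 with block size 2 needs 3 scale entries, the final block covering a single element.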

Inputs

x (T1): N-D full-precision input tensor to be quantized.
y_scale (T2): Scale used to quantize x into y. For per-tensor (per-layer) quantization the scale is a scalar; for per-axis quantization it is a 1-D tensor; for blocked quantization it has the same shape as the input, except for the one dimension along which blocking is performed.
y_zero_point (T3): Zero point used to quantize x into y. Its shape must match y_scale. Optional; if it is not specified, the default is a uint8 zero point of 0.

Outputs

y (T3): N-D quantized output tensor. It has the same shape as input x.

Type Constraints

T1: The type of the input ‘x’. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).
T2: The type of the input ‘y_scale’. Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e8m0), tensor(int32).
T3: The type of the input ‘y_zero_point’ and the output ‘y’. Allowed types: tensor(float4e2m1), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int2), tensor(int4), tensor(int8), tensor(uint16), tensor(uint2), tensor(uint4), tensor(uint8).

Examples

test_cc_quantizelinear

Node:
  QuantizeLinear(x, y_scale) -> (y)

Inputs:
  x: shape=(6,), dtype=float32
    [    0.,     2.,     3.,  1000.,  -254., -1000.]
  y_scale: shape=(), dtype=float32
    2.

Outputs:
  y: shape=(6,), dtype=uint8
    [  0,   1,   2, 255,   0,   0]

test_cc_quantizelinear_int8

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)

Inputs:
  x: shape=(6,), dtype=float32
    [    0.,     2.,     3.,  1000.,  -254., -1000.]
  y_scale: shape=(), dtype=float32
    2.
  y_zero_point: shape=(), dtype=int8
    -10

Outputs:
  y: shape=(6,), dtype=int8
    [ -10,   -9,   -8,  127, -128, -128]

test_quantizelinear

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)

Inputs:
  x: shape=(6,), dtype=float32
    [    0.,     2.,     3.,  1000.,  -254., -1000.]
  y_scale: shape=(), dtype=float32
    2.
  y_zero_point: shape=(), dtype=uint8
    128

Outputs:
  y: shape=(6,), dtype=uint8
    [128, 129, 130, 255,   1,   0]

test_quantizelinear_axis

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)
  Attributes:
    axis = 1

Inputs:
  x: shape=(1, 3, 3, 2), dtype=float32
    [[[[-162.,   10.],
       [-100.,  232.],
       [ -20.,  -50.]],

      [[ -76.,    0.],
       [   0.,  252.],
       [  32.,  -44.]],

      [[ 245., -485.],
       [-960., -270.],
       [-375., -470.]]]]
  y_scale: shape=(3,), dtype=float32
    [2., 4., 5.]
  y_zero_point: shape=(3,), dtype=uint8
    [ 84,  24, 196]

Outputs:
  y: shape=(1, 3, 3, 2), dtype=uint8
    [[[[  3,  89],
       [ 34, 200],
       [ 74,  59]],

      [[  5,  24],
       [ 24,  87],
       [ 32,  13]],

      [[245,  99],
       [  4, 142],
       [121, 102]]]]
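
To see how axis = 1 selects the scale and zero point for each element in the example above, the following pure-Python sketch recomputes the output from a row-major flattened copy of x. The helper name `quantize_per_axis` is illustrative, and the uint8 bounds are passed as defaults rather than derived from a tensor type.

```python
def quantize_per_axis(flat_x, shape, axis, scales, zero_points, lo=0, hi=255):
    """Hypothetical helper: per-axis quantization of a row-major flat list."""
    # Number of consecutive elements that share one index along `axis`.
    inner = 1
    for dim in shape[axis + 1:]:
        inner *= dim
    out = []
    for i, v in enumerate(flat_x):
        c = (i // inner) % shape[axis]      # index along the quantization axis
        q = round(v / scales[c]) + zero_points[c]
        out.append(max(lo, min(hi, q)))
    return out

x = [-162, 10, -100, 232, -20, -50,
     -76, 0, 0, 252, 32, -44,
     245, -485, -960, -270, -375, -470]
y = quantize_per_axis(x, (1, 3, 3, 2), 1, [2.0, 4.0, 5.0], [84, 24, 196])
print(y)
```

Each group of six consecutive values belongs to one channel along axis 1, so the first six use scale 2 and zero point 84, the next six use 4 and 24, and the last six use 5 and 196.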

test_quantizelinear_blocked_asymmetric

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)
  Attributes:
    axis = 1
    block_size = 2

Inputs:
  x: shape=(3, 4), dtype=float32
    [[ 6., 12., 50.,  5.],
     [ 1.,  8.,  4.,  5.],
     [ 0., 20., 10.,  4.]]
  y_scale: shape=(3, 2), dtype=float32
    [[1.5, 2.5],
     [3. , 4.9],
     [5.1, 6.9]]
  y_zero_point: shape=(3, 2), dtype=uint8
    [[0, 1],
     [1, 0],
     [2, 3]]

Outputs:
  y: shape=(3, 4), dtype=uint8
    [[ 4,  8, 21,  3],
     [ 1,  4,  1,  1],
     [ 2,  6,  4,  4]]
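
The blocked example above can be reproduced with a short sketch in which every run of block_size consecutive elements along axis 1 shares one scale and one zero point. The function name `quantize_blocked_2d` is hypothetical, and the code only handles the 2-D, axis = 1 case used here.

```python
def quantize_blocked_2d(x, y_scale, y_zero_point, block_size, lo=0, hi=255):
    """Hypothetical helper: blocked quantization of a 2-D list along axis=1."""
    y = []
    for r, row in enumerate(x):
        out_row = []
        for c, v in enumerate(row):
            b = c // block_size             # block index along axis 1
            q = round(v / y_scale[r][b]) + y_zero_point[r][b]
            out_row.append(max(lo, min(hi, q)))
        y.append(out_row)
    return y

x = [[6.0, 12.0, 50.0, 5.0],
     [1.0, 8.0, 4.0, 5.0],
     [0.0, 20.0, 10.0, 4.0]]
y_scale = [[1.5, 2.5], [3.0, 4.9], [5.1, 6.9]]
y_zero_point = [[0, 1], [1, 0], [2, 3]]
y = quantize_blocked_2d(x, y_scale, y_zero_point, block_size=2)
print(y)
```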

test_quantizelinear_blocked_symmetric

Node:
  QuantizeLinear(x, y_scale) -> (y)
  Attributes:
    axis = 1
    block_size = 2
    output_dtype = 5

Inputs:
  x: shape=(3, 4), dtype=float32
    [[  6.,  -8., -10.,   5.],
     [  1.,   8.,   4.,   5.],
     [  0.,  20.,  10.,   4.]]
  y_scale: shape=(3, 2), dtype=float32
    [[1.5, 2.5],
     [3. , 4.9],
     [5.1, 6.9]]

Outputs:
  y: shape=(3, 4), dtype=int16
    [[ 4, -5, -4,  2],
     [ 0,  3,  1,  1],
     [ 0,  4,  1,  1]]
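
In this example no y_zero_point is passed, so the output type is taken from output_dtype. Assuming the usual ONNX TensorProto.DataType numbering (not stated in this section), the value 5 denotes INT16, which agrees with the int16 output shown. The sketch below hard-codes a small subset of that enum and recomputes the first output row with an implied zero point of 0; `OUTPUT_DTYPES` and `quantize_symmetric` are illustrative names.

```python
# Subset of ONNX TensorProto.DataType values (assumed from the ONNX enum,
# not stated in this section) with the saturation range of each type.
OUTPUT_DTYPES = {
    2: ("uint8", 0, 255),
    3: ("int8", -128, 127),
    4: ("uint16", 0, 65535),
    5: ("int16", -32768, 32767),
}

def quantize_symmetric(row, block_scales, block_size, output_dtype):
    """Hypothetical helper: blocked quantization of one row with zero point 0."""
    name, lo, hi = OUTPUT_DTYPES[output_dtype]
    q = [max(lo, min(hi, round(v / block_scales[i // block_size])))
         for i, v in enumerate(row)]
    return name, q

name, y_row0 = quantize_symmetric([6.0, -8.0, -10.0, 5.0], [1.5, 2.5], 2, 5)
print(name, y_row0)
```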

test_quantizelinear_e4m3fn

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)

Inputs:
  x: shape=(5,), dtype=float32
    [0.e+00, 1.e+00, 2.e+00, 1.e+05, 2.e+02]
  y_scale: shape=(), dtype=float32
    2.
  y_zero_point: shape=(1,), dtype=float8_e4m3fn
    [0]

Outputs:
  y: shape=(5,), dtype=float8_e4m3fn
    [0, 0.5, 1, 448, 96]

test_quantizelinear_e5m2

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)

Inputs:
  x: shape=(5,), dtype=float32
    [0.e+00, 1.e+00, 2.e+00, 1.e+05, 2.e+02]
  y_scale: shape=(), dtype=float32
    2.
  y_zero_point: shape=(1,), dtype=float8_e5m2
    [0]

Outputs:
  y: shape=(5,), dtype=float8_e5m2
    [0, 0.5, 1, 49152, 96]
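
The two float8 examples above differ only in the grid of representable values. The sketch below rounds x / y_scale to the nearest value of such a grid with ties to even, then saturates. It is an approximation for illustration, not the reference conversion: the format parameters (3 mantissa bits, minimum normal exponent -6, maximum 448 for e4m3fn; 2 mantissa bits, -14, 57344 for e5m2) are assumptions taken from the usual definitions of these formats, saturating behavior is assumed, the zero point of 0 is omitted, and NaN and infinity are not handled.

```python
import math

def round_to_float8(v, mantissa_bits, min_normal_exp, max_value):
    """Hypothetical helper: nearest float8 value, ties to even, saturating."""
    if v == 0.0:
        return 0.0
    # frexp gives |v| = m * 2**e with 0.5 <= m < 1, so floor(log2|v|) = e - 1.
    exp = max(math.frexp(abs(v))[1] - 1, min_normal_exp)
    step = 2.0 ** (exp - mantissa_bits)     # spacing of representable values
    q = round(v / step) * step              # round() breaks ties toward even
    return max(-max_value, min(max_value, q))

def quantize_e4m3fn(x, y_scale):
    return [round_to_float8(v / y_scale, 3, -6, 448.0) for v in x]

def quantize_e5m2(x, y_scale):
    return [round_to_float8(v / y_scale, 2, -14, 57344.0) for v in x]

x = [0.0, 1.0, 2.0, 1e5, 200.0]
y_e4m3fn = quantize_e4m3fn(x, 2.0)
y_e5m2 = quantize_e5m2(x, 2.0)
print(y_e4m3fn)
print(y_e5m2)
```

For the last element, 200 / 2 = 100 lies exactly halfway between 96 and 104 on the e4m3fn grid and rounds to 96, whose mantissa is even. On the coarser e5m2 grid the neighbors are 96 and 112, so 96 is simply the nearest value.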

test_quantizelinear_float4e2m1

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)
  Attributes:
    axis = 0

Inputs:
  x: shape=(3, 4), dtype=float32
    [[  0. ,   2.5,   4.8,   8.6],
     [-30. , -20. ,   6. ,   9. ],
     [ -0. ,  -2.5,  -4.8,  -8.6]]
  y_scale: shape=(3,), dtype=float32
    [2., 3., 4.]
  y_zero_point: shape=(3,), dtype=float4_e2m1fn
    [0, 0, 0]

Outputs:
  y: shape=(3, 4), dtype=float4_e2m1fn
    [[0, 1, 2, 4],
     [-6, -6, 2, 3],
     [0, -0.5, -1, -2]]
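
For float4e2m1 the set of representable magnitudes is small enough to enumerate. The sketch below assumes the grid 0, 0.5, 1, 1.5, 2, 3, 4, 6 (from the usual E2M1 definition, not stated in this section), rounds to the nearest entry, breaks ties toward the entry with an even code index, and saturates at 6. The names `E2M1_GRID`, `round_to_e2m1`, and `quantize_rows` are illustrative, and the zero point of 0 is omitted.

```python
import math

# Non-negative float4e2m1 values in code order (assumed from the E2M1 format).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_e2m1(v):
    """Hypothetical helper: nearest grid value, ties to the even code, saturating."""
    mag = min(abs(v), E2M1_GRID[-1])        # saturate at +/-6
    best = min(
        range(len(E2M1_GRID)),
        # Sort key: distance first, then prefer an even code index on ties.
        key=lambda i: (abs(E2M1_GRID[i] - mag), i % 2),
    )
    return math.copysign(E2M1_GRID[best], v)

def quantize_rows(x, y_scale):
    """Per-axis quantization along axis 0: one scale per row."""
    return [[round_to_e2m1(v / s) for v in row] for row, s in zip(x, y_scale)]

x = [[0.0, 2.5, 4.8, 8.6],
     [-30.0, -20.0, 6.0, 9.0],
     [-0.0, -2.5, -4.8, -8.6]]
y = quantize_rows(x, [2.0, 3.0, 4.0])
print(y)
```

In the first row, 2.5 / 2 = 1.25 is a tie between 1 and 1.5 and resolves to 1, while in the second row -30 / 3 = -10 saturates to -6.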

test_quantizelinear_int16

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)

Inputs:
  x: shape=(4,), dtype=float32
    [ 0.e+00,  2.e+00,  3.e+00, -1.e+05]
  y_scale: shape=(), dtype=float32
    2.
  y_zero_point: shape=(), dtype=int16
    -1024

Outputs:
  y: shape=(4,), dtype=int16
    [ -1024,  -1023,  -1022, -32768]

test_quantizelinear_int2

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)
  Attributes:
    axis = 0

Inputs:
  x: shape=(3, 4), dtype=float32
    [[ 0. ,  2.5,  4.8,  8.6],
     [-4. , -3. ,  1. ,  2. ],
     [-0. , -2.5, -4.8, -8.6]]
  y_scale: shape=(3,), dtype=float32
    [2., 3., 4.]
  y_zero_point: shape=(3,), dtype=int2
    [0, 0, 0]

Outputs:
  y: shape=(3, 4), dtype=int2
    [[0, 1, 1, 1],
     [-1, -1, 0, 1],
     [0, -1, -1, -2]]

test_quantizelinear_int4

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)
  Attributes:
    axis = 0

Inputs:
  x: shape=(3, 4), dtype=float32
    [[  0. ,   2.5,   4.8,   8.6],
     [-30. , -20. ,   6. ,   9. ],
     [ 12. ,  15. ,  16. ,  40. ]]
  y_scale: shape=(3,), dtype=float32
    [2., 3., 4.]
  y_zero_point: shape=(3,), dtype=int4
    [1, 1, 1]

Outputs:
  y: shape=(3, 4), dtype=int4
    [[1, 2, 3, 5],
     [-8, -6, 3, 4],
     [4, 5, 5, 7]]

test_quantizelinear_uint16

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)

Inputs:
  x: shape=(4,), dtype=float32
    [0.e+00, 2.e+00, 3.e+00, 2.e+05]
  y_scale: shape=(), dtype=float32
    2.
  y_zero_point: shape=(), dtype=uint16
    32767

Outputs:
  y: shape=(4,), dtype=uint16
    [32767, 32768, 32769, 65535]

test_quantizelinear_uint2

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)
  Attributes:
    axis = 0

Inputs:
  x: shape=(3, 4), dtype=float32
    [[ 0. ,  2.5,  4.8,  8.6],
     [-2. , -1. ,  1. ,  3. ],
     [ 4. ,  5. ,  6. ,  7. ]]
  y_scale: shape=(3,), dtype=float32
    [2., 3., 4.]
  y_zero_point: shape=(3,), dtype=uint2
    [0, 0, 0]

Outputs:
  y: shape=(3, 4), dtype=uint2
    [[0, 1, 2, 3],
     [0, 0, 0, 1],
     [1, 1, 2, 2]]

test_quantizelinear_uint4

Node:
  QuantizeLinear(x, y_scale, y_zero_point) -> (y)
  Attributes:
    axis = 0

Inputs:
  x: shape=(3, 4), dtype=float32
    [[  0. ,   2.5,   4.8,   8.6],
     [-30. , -20. ,   6. ,   9. ],
     [ 12. ,  15. ,  16. ,  40. ]]
  y_scale: shape=(3,), dtype=float32
    [2., 3., 4.]
  y_zero_point: shape=(3,), dtype=uint4
    [1, 1, 1]

Outputs:
  y: shape=(3, 4), dtype=uint4
    [[1, 2, 3, 5],
     [0, 0, 3, 4],
     [4, 5, 5, 11]]

Differences with previous version (24)

SchemaDiff: QuantizeLinear (domain 'ai.onnx')

old version: 24
new version: 25
breaking: no

Type constraints:

changed ‘T3’: added types: [‘tensor(int2)’, ‘tensor(uint2)’]

Documentation:

line similarity: 0.97 (+2/-0 lines)

--- QuantizeLinear v24
+++ QuantizeLinear v25
@@ -10,6 +10,8 @@
 - int8: [-128, 127]
 - uint4: [0, 15]
 - int4: [-8, 7]
+- uint2: [0, 3]
+- int2: [-2, 1]

 For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
