.. _op_ai_onnx_QuantizeLinear-21:

QuantizeLinear - version 21
===========================

This page documents version **21** of operator **QuantizeLinear**. See :doc:`QuantizeLinear` for the latest version (since version 25).

- **Domain**: ``ai.onnx``
- **Since version**: 21

The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is ``y = saturate((x / y_scale) + y_zero_point)``.
Saturation is done according to:

- uint16: [0, 65535]
- int16: [-32768, 32767]
- uint8: [0, 255]
- int8: [-128, 127]
- uint4: [0, 15]
- int4: [-8, 7]
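
As a minimal sketch of the formula and saturation ranges above, the following pure-Python snippet quantizes a single value. The helper ``quantize_scalar`` and the ``RANGES`` table are hypothetical names introduced for illustration and are not part of ONNX. The sketch covers only the integer types, computes in double precision, and relies on Python's built-in ``round``, which rounds ties to the nearest even integer.

.. code-block:: python

    # Saturation ranges for the integer output types listed above.
    RANGES = {
        "uint16": (0, 65535),
        "int16": (-32768, 32767),
        "uint8": (0, 255),
        "int8": (-128, 127),
        "uint4": (0, 15),
        "int4": (-8, 7),
    }


    def quantize_scalar(x, y_scale, y_zero_point=0, dtype="uint8"):
        """Hypothetical helper: y = saturate(round(x / y_scale) + y_zero_point)."""
        lo, hi = RANGES[dtype]
        # Python's built-in round() rounds ties to the nearest even integer.
        rounded = round(x / y_scale)
        # Saturate: clamp the shifted value into the range of the target type.
        return max(lo, min(hi, rounded + y_zero_point))

Under these assumptions, ``quantize_scalar(2.5, 1.0)`` gives ``2`` (the tie is rounded to the even neighbour) and ``quantize_scalar(1000.0, 2.0)`` saturates to ``255``.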

``(x / y_scale)`` is rounded to the nearest integer, with ties rounded to the nearest even integer. Refer to https://en.wikipedia.org/wiki/Rounding for details.
``y_zero_point`` and ``y`` must have the same type. ``y_zero_point`` is usually not used for quantization to float8 types, but the quantization
formula remains the same for consistency, and the type of the input ``y_zero_point`` still determines the quantization type.
There are three supported quantization granularities, determined by the shape of ``y_scale``.
In all cases, ``y_zero_point`` must have the same shape as ``y_scale``.

- Per-tensor (per-layer) quantization: ``y_scale`` is a scalar.
- Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape
  ``(D0, ..., Di, ..., Dn)`` and ``axis=i``, ``y_scale`` is a 1-D tensor of length ``Di``.
- Blocked quantization: The scale's shape is identical to the input's shape, except for one dimension, in which
  blocking is performed. Given ``x`` shape ``(D0, ..., Di, ..., Dn)``, ``axis=i``, and block size ``B``: ``y_scale`` shape is
  ``(D0, ..., ceil(Di/B), ..., Dn)``.
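
The three granularities above can be sketched as a shape computation. The helpers ``scale_shape`` and ``scale_index`` are hypothetical and exist only for illustration. ``scale_index`` additionally assumes that, for blocked quantization, element ``i`` along the blocked axis uses the scale of block ``i // B``, which the shape rule implies but the text does not state explicitly.

.. code-block:: python

    import math


    def scale_shape(x_shape, granularity, axis=0, block_size=None):
        """Hypothetical helper: expected y_scale shape for each granularity."""
        if granularity == "per_tensor":
            return ()  # a scalar has an empty shape
        if granularity == "per_axis":
            return (x_shape[axis],)  # 1-D tensor of length Di
        if granularity == "blocked":
            shape = list(x_shape)
            # Only the blocked dimension shrinks: Di -> ceil(Di / B).
            shape[axis] = math.ceil(x_shape[axis] / block_size)
            return tuple(shape)
        raise ValueError(f"unknown granularity: {granularity}")


    def scale_index(x_index, granularity, axis=0, block_size=None):
        """Hypothetical helper: index into y_scale for the element at x_index."""
        if granularity == "per_tensor":
            return ()
        if granularity == "per_axis":
            return (x_index[axis],)
        if granularity == "blocked":
            idx = list(x_index)
            # Assumption: blocks cover consecutive elements along the axis.
            idx[axis] = x_index[axis] // block_size
            return tuple(idx)
        raise ValueError(f"unknown granularity: {granularity}")

For example, ``scale_shape((4, 10), "blocked", axis=1, block_size=4)`` gives ``(4, 3)``, and under the stated assumption the element at ``(2, 9)`` uses the scale at ``(2, 2)``.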

**Inputs**

- **x** (*T1*): N-D full-precision input tensor to be quantized.
- **y_scale** (*T1*): Scale used to quantize ``x`` into ``y``. For per-tensor (per-layer) quantization the scale is a scalar; for per-axis quantization it is a 1-D tensor; for blocked quantization it has the same shape as the input, except for the one dimension in which blocking is performed.
- **y_zero_point** (*T2*): Zero point used to quantize ``x`` into ``y``. Its shape must match ``y_scale``. If it is not specified, the default is a uint8 zero point of 0.

**Outputs**

- **y** (*T2*): N-D quantized output tensor. It has the same shape as the input ``x``.

**Type Constraints**

- **T1**: The type of the input ``x``.
  Allowed types: tensor(bfloat16), tensor(float), tensor(float16), tensor(int32).
- **T2**: The type of the input ``y_zero_point`` and the output ``y``.
  Allowed types: tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8).

Differences with previous version (19)
--------------------------------------

**SchemaDiff**: ``QuantizeLinear`` (domain ``'ai.onnx'``)

* old version: 19
* new version: 21
* breaking: no

**Type constraints:**

* changed 'T2': added types: ['tensor(int16)', 'tensor(int4)', 'tensor(uint16)', 'tensor(uint4)']

**Documentation:**

* line similarity: 0.06 (+21/-9 lines)

.. code-block:: diff

    --- QuantizeLinear v19
    +++ QuantizeLinear v21
    @@ -1,10 +1,22 @@
     
    -The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor.
    -The scale factor and zero point must have same shape, and can be either a scalar for per-tensor / per layer quantization, or a 1-D tensor for per-axis quantization.
    -The quantization formula is `y = saturate ((x / y_scale) + y_zero_point)`.
    -For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
    -For (x / y_scale), it's rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
    -'y_zero_point' and 'y' must have same type.
    -'y_zero_point' is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz,
    -but the quantization formula remains the same for consistency and
    -the type of the attribute 'y_zero_point' still determines the quantization type.
    +The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
    +low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
    +granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
    +Saturation is done according to:
    +- uint16: [0, 65535]
    +- int16: [-32768, 32767]
    +- uint8: [0, 255]
    +- int8: [-128, 127]
    +- uint4: [0, 15]
    +- int4: [-8, 7]
    +For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
    +`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
    +formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.
    +There are three supported quantization granularities, determined by the shape of `y_scale`.
    +In all cases, `y_zero_point` must have the same shape as `y_scale`.
    +- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
    +- Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape
    +  `(D0, ..., Di, ..., Dn)` and `axis=i`, `y_scale` is a 1-D tensor of length `Di`.
    +- Blocked quantization: The scale's shape is identical to the input's shape, except for one dimension, in which
    +  blocking is performed. Given `x` shape `(D0, ..., Di, ..., Dn)`, `axis=i`, and block size `B`: `y_scale` shape is
    +  `(D0, ..., ceil(Di/B), ..., Dn)`.