com.microsoft - QOrderedAttention#

QOrderedAttention - 1#

Version

  • name: QOrderedAttention (GitHub)

  • domain: com.microsoft

  • since_version: 1

  • function:

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Attributes

  • num_heads - INT (required) : Number of attention heads

  • order_input - INT (required) : cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.

  • order_output - INT (required) : cublasLt order of global bias

  • order_weight - INT (required) : cublasLt order of weight matrix

  • qkv_hidden_sizes - INTS : Hidden layer sizes of Q, K, V paths in Attention

  • unidirectional - INT : Whether every token can only attend to previous tokens. Default value is 0.

Inputs

Between 17 and 20 inputs.

  • input (heterogeneous) - Q:

  • scale_input (heterogeneous) - S:

  • scale_Q_gemm (heterogeneous) - S:

  • scale_K_gemm (heterogeneous) - S:

  • scale_V_gemm (heterogeneous) - S:

  • Q_weight (heterogeneous) - Q:

  • K_weight (heterogeneous) - Q:

  • V_weight (heterogeneous) - Q:

  • scale_Q_weight (heterogeneous) - S:

  • scale_K_weight (heterogeneous) - S:

  • scale_V_weight (heterogeneous) - S:

  • Q_bias (heterogeneous) - S:

  • K_bias (heterogeneous) - S:

  • V_bias (heterogeneous) - S:

  • scale_QKT_gemm (optional, heterogeneous) - S:

  • scale_QKT_softmax (optional, heterogeneous) - S:

  • scale_values_gemm (heterogeneous) - S:

  • mask_index (optional, heterogeneous) - G:

  • past (optional, heterogeneous) - Q:

  • relative_position_bias (optional, heterogeneous) - S:

Outputs

  • output (heterogeneous) - Q:

Type Constraints

  • Q in ( tensor(int8) ): Constrain input and output types to int8 tensors.

  • S in ( tensor(float) ): Constrain scales to float32 tensors.

  • G in ( tensor(int32) ): Constrain to integer types

Examples