com.microsoft - QOrderedLongformerAttention

QOrderedLongformerAttention - 1

Version

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Quantized version of Longformer attention, operating on int8 tensors laid out in specific cuBLASLt matrix orders, with float32 scales carrying the quantization parameters.

Attributes

  • num_heads - INT (required) : Number of attention heads

  • order_global_weight - INT (required) : cublasLt order of the global weight matrix

  • order_input - INT (required) : cublasLt order of the input matrix. See the schema of QuantizeWithOrder for order definitions; the sketch after this list shows the common enum values.

  • order_output - INT (required) : cublasLt order of the output matrix

  • order_weight - INT (required) : cublasLt order of the weight matrix

  • window - INT (required) : One-sided attention window length W, i.e. half of the total window length
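
The order_* attributes are plain integers; per the QuantizeWithOrder schema they follow cuBLASLt's cublasLtOrder_t enum. A minimal sketch of an attribute set, assuming the standard cuBLASLt enum values (the specific choices below are illustrative, not mandated by the schema):

```python
# cublasLtOrder_t values from cuBLASLt (assumed to carry over unchanged)
ORDER_COL = 0            # CUBLASLT_ORDER_COL: column-major
ORDER_ROW = 1            # CUBLASLT_ORDER_ROW: row-major
ORDER_COL32 = 2          # CUBLASLT_ORDER_COL32: 32-column interleaved
ORDER_COL4_4R2_8C = 3    # CUBLASLT_ORDER_COL4_4R2_8C
ORDER_COL32_2R_4R4 = 4   # CUBLASLT_ORDER_COL32_2R_4R4

# Illustrative attribute set for a hypothetical 12-head model
attrs = {
    "num_heads": 12,
    "window": 256,                         # one-sided window length W
    "order_input": ORDER_COL32,
    "order_output": ORDER_COL32,
    "order_weight": ORDER_COL4_4R2_8C,
    "order_global_weight": ORDER_COL4_4R2_8C,
}
```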

Inputs

  • input (heterogeneous) - Q: quantized (int8) input tensor

  • scale_input (heterogeneous) - S: quantization scale of input (the Q/S pairing is sketched after this list)

  • weight (heterogeneous) - Q: quantized weight matrix for the QKV projection

  • scale_weight (heterogeneous) - S: quantization scale of weight

  • bias (heterogeneous) - S: bias for the QKV projection (float32)

  • scale_bias (heterogeneous) - S: scale associated with bias

  • scale_qkv_gemm (heterogeneous) - S: quantization scale of the QKV GEMM output

  • mask (heterogeneous) - F: attention mask (float16)

  • global_weight (heterogeneous) - Q: quantized weight matrix for the global attention projection

  • scale_global_weight (heterogeneous) - S: quantization scale of global_weight

  • global_bias (heterogeneous) - S: bias for the global attention projection (float32)

  • scale_global_gemm (heterogeneous) - S: quantization scale of the global attention GEMM output

  • global (heterogeneous) - G: global attention flags (int32)

  • scale_output (heterogeneous) - S: quantization scale of output
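
Each int8 tensor of type Q travels with a float32 scale of type S; assuming the standard symmetric per-tensor convention, the relationship is real_value ≈ scale * int8_value. A minimal numpy sketch of that pairing (the shapes and the scale choice are illustrative, not part of the schema):

```python
import numpy as np

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    # Symmetric per-tensor quantization: float32 -> int8.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float32 values: scale * int8_value.
    return q.astype(np.float32) * scale

# Hypothetical activation: (batch_size, sequence_length, hidden_size)
x = np.random.randn(1, 1024, 768).astype(np.float32)
scale_input = np.float32(np.abs(x).max() / 127.0)
q_input = quantize(x, scale_input)             # feeds "input"
x_restored = dequantize(q_input, scale_input)  # value the kernel logically sees
```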

Outputs

  • output (heterogeneous) - Q: quantized (int8) output tensor

Type Constraints

  • Q in ( tensor(int8) ): Constrain input and output types to int8 tensors.

  • S in ( tensor(float) ): Constrain scales to float32 tensors.

  • G in ( tensor(int32) ): Constrain global attention flags to int32 tensors.

  • F in ( tensor(float16) ): Constrain the mask to float16 tensors, for compatibility with the float version of the operator.

Examples
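
The schema ships without a worked example, so the following is a hypothetical sketch of building the node with onnx.helper. The input names and their order follow the Inputs list above; num_heads, window, and the order values are assumptions for illustration:

```python
from onnx import helper

node = helper.make_node(
    "QOrderedLongformerAttention",
    inputs=[
        "input", "scale_input", "weight", "scale_weight",
        "bias", "scale_bias", "scale_qkv_gemm", "mask",
        "global_weight", "scale_global_weight", "global_bias",
        "scale_global_gemm", "global", "scale_output",
    ],
    outputs=["output"],
    domain="com.microsoft",
    num_heads=12,
    window=256,
    order_input=2,           # CUBLASLT_ORDER_COL32 (assumed)
    order_output=2,
    order_weight=3,          # CUBLASLT_ORDER_COL4_4R2_8C (assumed)
    order_global_weight=3,
)
```

The node would then be placed in a graph whose initializers supply the quantized weights and the float32 scales named above; as with all com.microsoft contrib ops, the model must import the com.microsoft opset.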