com.microsoft - QAttention#

QAttention - 1#

Version

  • name: QAttention (GitHub)

  • domain: com.microsoft

  • since_version: 1

  • function:

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Attributes

  • do_rotary - INT : Whether to use rotary position embedding. Default value is 0.

  • mask_filter_value - FLOAT : The value to be filled in the attention mask. Default value is -10000.0f

  • num_heads - INT (required) : Number of attention heads

  • past_present_share_buffer - INT : Corresponding past and present are same tensor, its shape is (2, batch_size, num_heads, max_sequence_length, head_size)

  • scale - FLOAT : Custom scale will be used if specified. Default value is 1/sqrt(head_size)

  • unidirectional - INT : Whether every token can only attend to previous tokens. Default value is 0.

Inputs

Between 5 and 9 inputs.

  • input (heterogeneous) - T1:

  • weight (heterogeneous) - T2:

  • bias (heterogeneous) - T3:

  • input_scale (heterogeneous) - T3:

  • weight_scale (heterogeneous) - T3:

  • mask_index (optional, heterogeneous) - T4:

  • input_zero_point (optional, heterogeneous) - T1:

  • weight_zero_point (optional, heterogeneous) - T2:

  • past (optional, heterogeneous) - T3:

Outputs

Between 1 and 2 outputs.

  • output (heterogeneous) - T3:

  • present (optional, heterogeneous) - T3:

Type Constraints

  • T1 in ( tensor(int8), tensor(uint8) ): Constrain input and output types to int8 tensors.

  • T2 in ( tensor(int8), tensor(uint8) ): Constrain input and output types to int8 tensors.

  • T3 in ( tensor(float), tensor(float16) ): Constrain input and output types to float tensors.

  • T4 in ( tensor(int32) ): Constrain mask index to integer types

Examples