FlexAttention

Domain: ai.onnx.preview
Since version: 1

Computes scaled dot-product attention over rank-4 (batched, multi-head) inputs, with optional user-provided customization subgraphs at two stages:

score_mod: Modifies the scaled attention score tensor, (Q·K^T) * scale, before Softmax
prob_mod: Modifies the probability tensor after Softmax, before it is multiplied by V

This operator mirrors the capabilities of PyTorch’s flex_attention: https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html

Input Shapes (MUST be rank-4 tensors):

Q: (batch_size, q_num_heads, q_sequence_length, head_size)
K: (batch_size, kv_num_heads, kv_sequence_length, head_size)
V: (batch_size, kv_num_heads, kv_sequence_length, v_head_size)

Output Shape:

Y: (batch_size, q_num_heads, q_sequence_length, v_head_size)

FlexAttention Computation:

Scores = (Q @ K^T) * scale
Scores = score_mod(Scores)             # if 'score_mod' is provided
Probs = Softmax(Scores, axis=-1)
Probs = prob_mod(Probs)                # if 'prob_mod' is provided
Y = Probs @ V
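
The computation above can be sketched in NumPy as follows. This is an illustrative sketch, not the reference implementation: the helper names (softmax, flex_attention) are hypothetical, the modifiers are modeled as plain Python callables rather than ONNX subgraphs, and the default scale of 1/sqrt(head_size) is an assumption (it reproduces the values listed under test_cc_flexattention_basic).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def flex_attention(Q, K, V, scale=None, score_mod=None, prob_mod=None):
    # Assumption: when 'scale' is not given, use 1/sqrt(head_size).
    if scale is None:
        scale = Q.shape[-1] ** -0.5
    # K^T transposes only the last two axes: (..., kv_len, head) -> (..., head, kv_len).
    scores = (Q @ np.swapaxes(K, -1, -2)) * scale
    if score_mod is not None:
        scores = score_mod(scores)
    probs = softmax(scores, axis=-1)
    if prob_mod is not None:
        probs = prob_mod(probs)
    return probs @ V

# Inputs of test_cc_flexattention_basic.
Q = np.array([[[[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]]]], dtype=np.float32)
K = np.array([[[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]], dtype=np.float32)
V = np.array([[[[1, 2], [3, 4]], [[-1, 0], [0, 1]]]], dtype=np.float32)

Y = flex_attention(Q, K, V)
print(Y.shape)
print(Y)
```

This sketch requires q_num_heads == kv_num_heads; it does not handle the grouped-query case.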

Grouped Query Attention (GQA): q_num_heads must be a multiple of kv_num_heads. When q_num_heads != kv_num_heads, each K/V head is shared by a contiguous group of query heads in head-index order: with group_size = q_num_heads / kv_num_heads, query head h uses K/V head floor(h / group_size).
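
The head mapping can be illustrated with a short NumPy sketch. The helper names (kv_head_for, expand_kv_heads) are hypothetical, and the default scale of 1/sqrt(head_size) is an assumption. Repeating each K/V head group_size times along the head axis is one way to realize the mapping; an actual backend may avoid materializing the copies.

```python
import numpy as np

def kv_head_for(h, q_num_heads, kv_num_heads):
    # Query head h reads K/V head floor(h / group_size).
    if q_num_heads % kv_num_heads != 0:
        raise ValueError("q_num_heads must be a multiple of kv_num_heads")
    return h // (q_num_heads // kv_num_heads)

def expand_kv_heads(x, q_num_heads):
    # x: (batch, kv_num_heads, seq, dim). np.repeat places the copies of each
    # head next to each other, which matches the contiguous grouping.
    kv_num_heads = x.shape[1]
    if q_num_heads % kv_num_heads != 0:
        raise ValueError("q_num_heads must be a multiple of kv_num_heads")
    return np.repeat(x, q_num_heads // kv_num_heads, axis=1)

# Inputs of test_cc_flexattention_gqa: 4 query heads, 2 K/V heads.
Q = np.array([[[[0.1, 0.2], [0.3, 0.4]], [[-0.1, 0.05], [0.2, -0.3]],
               [[0.5, 0.5], [0.0, 1.0]], [[1.0, 0.0], [0.5, -0.5]]]], dtype=np.float32)
K = np.array([[[[1, 0], [0.5, 0.5], [0, 1]],
               [[-1, 1], [1, 1], [0.25, -0.5]]]], dtype=np.float32)
V = np.array([[[[1, 0], [0, 1], [-1, 1]],
               [[2, -2], [0.5, 0.25], [-0.5, 0]]]], dtype=np.float32)

mapping = [kv_head_for(h, Q.shape[1], K.shape[1]) for h in range(Q.shape[1])]

Kx = expand_kv_heads(K, Q.shape[1])
Vx = expand_kv_heads(V, Q.shape[1])
scores = (Q @ np.swapaxes(Kx, -1, -2)) * (Q.shape[-1] ** -0.5)  # assumed default scale
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs = probs / probs.sum(axis=-1, keepdims=True)
Y = probs @ Vx
print(mapping)
print(Y.shape)
```

With these inputs, query heads 0 and 1 read K/V head 0, and query heads 2 and 3 read K/V head 1.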

Modifier Subgraphs (score_mod, prob_mod): Each modifier subgraph takes exactly one rank-4 tensor input and must produce exactly one rank-4 tensor output of the same shape and element type.

score_mod input/output shape: (batch_size, q_num_heads, q_sequence_length, kv_sequence_length)
prob_mod input/output shape: (batch_size, q_num_heads, q_sequence_length, kv_sequence_length)

The element type of each modifier's input and output is determined by the softmax_precision attribute, which defaults to float32 for non-double inputs and to double for double inputs.
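
A sketch of the modifier contract and the default precision rule, in NumPy. Several things here are assumptions made for illustration: the helper names (softmax_dtype, checked, attention_in_softmax_precision), the choice to upcast Q, K and V before the matrix product, the cast of Y back to the input type at the very end, and the default scale of 1/sqrt(head_size). The specification text above only fixes the element type seen by the modifiers. NumPy has no native bfloat16, so that type is not exercised.

```python
import numpy as np

def softmax_dtype(input_dtype):
    # Default rule: double inputs stay double, everything else uses float32.
    return np.float64 if np.dtype(input_dtype) == np.float64 else np.float32

def checked(mod, x):
    # A modifier must return a tensor of the same shape and element type.
    y = mod(x)
    if y.shape != x.shape or y.dtype != x.dtype:
        raise ValueError("modifier must preserve shape and element type")
    return y

def attention_in_softmax_precision(Q, K, V, score_mod=None, prob_mod=None):
    out_dtype = Q.dtype
    dt = softmax_dtype(out_dtype)
    # Assumption: upcast once, compute everything in 'dt', cast Y back at the end.
    q, k, v = (x.astype(dt) for x in (Q, K, V))
    scores = (q @ np.swapaxes(k, -1, -2)) * (q.shape[-1] ** -0.5)
    if score_mod is not None:
        scores = checked(score_mod, scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    if prob_mod is not None:
        probs = checked(prob_mod, probs)
    return (probs @ v).astype(out_dtype)

# Inputs of test_cc_flexattention_fp16.
Q = np.array([[[[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]]]], dtype=np.float16)
K = np.array([[[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]], dtype=np.float16)
V = np.array([[[[1, 2], [3, 4]], [[-1, 0], [0, 1]]]], dtype=np.float16)

seen = []
Y = attention_in_softmax_precision(Q, K, V, score_mod=lambda s: (seen.append(s.dtype), s)[1])
print(seen[0], Y.dtype)
```

Here the modifier observes float32 scores even though the inputs and the output are float16.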

Masking can be expressed in score_mod by writing masked positions as -inf (or a large negative value appropriate for the target precision).
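
As an illustration, a causal mask can be written as a score modifier. The sketch below models the modifier as a NumPy callable (causal_score_mod is a hypothetical name) and assumes the default scale is 1/sqrt(head_size); the subgraph used by test_cc_flexattention_causal_mask is not shown in this document, but this modifier reproduces the output listed for that test.

```python
import numpy as np

def causal_score_mod(scores):
    # Position (q, kv) is kept only when kv <= q; the rest become -inf,
    # so Softmax assigns them zero probability.
    q_len, kv_len = scores.shape[-2:]
    q_idx = np.arange(q_len)[:, None]
    kv_idx = np.arange(kv_len)[None, :]
    masked = np.where(kv_idx <= q_idx, scores, -np.inf)
    return masked.astype(scores.dtype)

def attention(Q, K, V, score_mod):
    scores = (Q @ np.swapaxes(K, -1, -2)) * (Q.shape[-1] ** -0.5)  # assumed default scale
    scores = score_mod(scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ V

Q = np.array([[[[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]]]], dtype=np.float32)
K = np.array([[[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]], dtype=np.float32)
V = np.array([[[[1, 2], [3, 4]], [[-1, 0], [0, 1]]]], dtype=np.float32)

Y = attention(Q, K, V, causal_score_mod)
print(Y)
```

The first query position can attend only to the first key, so its output row equals the first row of V for each head. Every row must keep at least one unmasked position; a row that is entirely -inf yields NaN after Softmax.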

Inputs

Q (T1): Query tensor with shape (batch_size, q_num_heads, q_sequence_length, head_size).
K (T1): Key tensor with shape (batch_size, kv_num_heads, kv_sequence_length, head_size).
V (T1): Value tensor with shape (batch_size, kv_num_heads, kv_sequence_length, v_head_size).

Outputs

Y (T1): Output tensor with shape (batch_size, q_num_heads, q_sequence_length, v_head_size).

Type Constraints

T1: Constrain Q, K, V to float tensors. Allowed types: tensor(bfloat16), tensor(double), tensor(float), tensor(float16).

Examples

test_cc_flexattention_basic

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.6604769 ,  2.660477  ],
       [ 2.339523  ,  3.339523  ]],

      [[-0.66976154,  0.33023846],
       [-0.80442965,  0.19557032]]]]

test_cc_flexattention_causal_mask

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)
  Attributes:
    score_mod = <subgraph>

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.        ,  2.        ],
       [ 2.339523  ,  3.339523  ]],

      [[-1.        ,  0.        ],
       [-0.80442965,  0.19557032]]]]

test_cc_flexattention_diff_head_sizes

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 3), dtype=float32
    [[[[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]],

      [[-1.,  0.,  1.],
       [ 0.,  1., -1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 3), dtype=float32
    [[[[ 1.9907154 ,  2.9907155 ,  3.9907155 ],
       [ 3.0092845 ,  4.0092845 ,  5.0092845 ]],

      [[-0.66976154,  0.33023846,  0.33952308],
       [-0.80442965,  0.19557032,  0.6088593 ]]]]

test_cc_flexattention_double

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float64
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float64
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float64
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float64
    [[[[ 1.66047691,  2.66047691],
       [ 2.33952309,  3.33952309]],

      [[-0.66976155,  0.33023845],
       [-0.80442968,  0.19557032]]]]

test_cc_flexattention_fp16

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float16
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float16
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float16
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float16
    [[[[ 1.66  ,  2.66  ],
       [ 2.34  ,  3.34  ]],

      [[-0.67  ,  0.3303],
       [-0.804 ,  0.1956]]]]

test_cc_flexattention_gqa

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)

Inputs:
  Q: shape=(1, 4, 2, 2), dtype=float32
    [[[[ 0.1 ,  0.2 ],
       [ 0.3 ,  0.4 ]],

      [[-0.1 ,  0.05],
       [ 0.2 , -0.3 ]],

      [[ 0.5 ,  0.5 ],
       [ 0.  ,  1.  ]],

      [[ 1.  ,  0.  ],
       [ 0.5 , -0.5 ]]]]
  K: shape=(1, 2, 3, 2), dtype=float32
    [[[[ 1.  ,  0.  ],
       [ 0.5 ,  0.5 ],
       [ 0.  ,  1.  ]],

      [[-1.  ,  1.  ],
       [ 1.  ,  1.  ],
       [ 0.25, -0.5 ]]]]
  V: shape=(1, 2, 3, 2), dtype=float32
    [[[[ 1.  ,  0.  ],
       [ 0.  ,  1.  ],
       [-1.  ,  1.  ]],

      [[ 2.  , -2.  ],
       [ 0.5 ,  0.25],
       [-0.5 ,  0.  ]]]]

Outputs:
  Y: shape=(1, 4, 2, 2), dtype=float32
    [[[[-0.02356532,  0.6783799 ],
       [-0.02356532,  0.6783799 ]],

      [[-0.03533876,  0.6841799 ],
       [ 0.11724144,  0.6063233 ]],

      [[ 0.6482419 , -0.37858844],
       [ 0.9917567 , -0.74587834]],

      [[ 0.37784207, -0.12898168],
       [ 0.29831943, -0.26321504]]]]

test_cc_flexattention_prob_mod_identity

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)
  Attributes:
    prob_mod = <subgraph>

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.6604769 ,  2.660477  ],
       [ 2.339523  ,  3.339523  ]],

      [[-0.66976154,  0.33023846],
       [-0.80442965,  0.19557032]]]]

test_cc_flexattention_prob_mod_scale_half

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)
  Attributes:
    prob_mod = <subgraph>

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 0.83023846,  1.3302385 ],
       [ 1.1697615 ,  1.6697615 ]],

      [[-0.33488077,  0.16511923],
       [-0.40221483,  0.09778516]]]]
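
The prob_mod subgraph for this test is not shown. The output above is consistent with a modifier that multiplies every probability by 0.5, which the NumPy sketch below checks. The helper name (attention) and the default scale of 1/sqrt(head_size) are assumptions.

```python
import numpy as np

def attention(Q, K, V, prob_mod=None):
    scores = (Q @ np.swapaxes(K, -1, -2)) * (Q.shape[-1] ** -0.5)  # assumed default scale
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    if prob_mod is not None:
        probs = prob_mod(probs)
    return probs @ V

Q = np.array([[[[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]]]], dtype=np.float32)
K = np.array([[[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]], dtype=np.float32)
V = np.array([[[[1, 2], [3, 4]], [[-1, 0], [0, 1]]]], dtype=np.float32)

Y_plain = attention(Q, K, V)
Y_half = attention(Q, K, V, prob_mod=lambda p: 0.5 * p)
print(Y_half)
```

Because prob_mod runs after Softmax, the modified rows are not renormalized: each row of probabilities sums to 0.5, and Y is exactly half of the unmodified result.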

test_cc_flexattention_relative_positional

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)
  Attributes:
    score_mod = <subgraph>

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.3070787 ,  2.3070788 ],
       [ 1.8545915 ,  2.8545914 ]],

      [[-0.84646064,  0.15353936],
       [-0.91790473,  0.08209524]]]]
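
The score_mod subgraph for this test is not shown. As a sketch, adding the bias (q_idx - kv_idx) to each score reproduces the output above. This particular bias is an assumption: any bias that differs from it by a per-row constant gives the same result, because Softmax is invariant to shifting a row. The helper names and the default scale of 1/sqrt(head_size) are also assumptions.

```python
import numpy as np

def relative_bias_score_mod(scores):
    # Hypothetical bias: query index minus key index, broadcast over batch and heads.
    q_len, kv_len = scores.shape[-2:]
    rel = np.arange(q_len)[:, None] - np.arange(kv_len)[None, :]
    return scores + rel.astype(scores.dtype)

def attention(Q, K, V, score_mod):
    scores = (Q @ np.swapaxes(K, -1, -2)) * (Q.shape[-1] ** -0.5)  # assumed default scale
    scores = score_mod(scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ V

Q = np.array([[[[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]]]], dtype=np.float32)
K = np.array([[[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]], dtype=np.float32)
V = np.array([[[[1, 2], [3, 4]], [[-1, 0], [0, 1]]]], dtype=np.float32)

Y = attention(Q, K, V, relative_bias_score_mod)
print(Y)
```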

test_cc_flexattention_scaled

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)
  Attributes:
    scale = 0.10000000149011612

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.9500417 ,  2.9500418 ],
       [ 2.0499582 ,  3.0499582 ]],

      [[-0.5249792 ,  0.47502083],
       [-0.549834  ,  0.45016602]]]]
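
The effect of an explicit scale can be checked with a short NumPy sketch (the helper name attention is hypothetical). With scale = 0.1 the scores are closer together than with the assumed default of 1/sqrt(head_size), so the probabilities are flatter and each output row moves toward the mean of the V rows.

```python
import numpy as np

def attention(Q, K, V, scale):
    scores = (Q @ np.swapaxes(K, -1, -2)) * scale
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ V

Q = np.array([[[[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]]]], dtype=np.float32)
K = np.array([[[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]], dtype=np.float32)
V = np.array([[[[1, 2], [3, 4]], [[-1, 0], [0, 1]]]], dtype=np.float32)

Y = attention(Q, K, V, scale=0.1)
print(Y)
```

The attribute value printed above, 0.10000000149011612, is 0.1 rounded to float32.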

test_cc_flexattention_score_mod

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)
  Attributes:
    score_mod = <subgraph>

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.6604768 ,  2.6604767 ],
       [ 2.339523  ,  3.339523  ]],

      [[-0.66976154,  0.33023843],
       [-0.80442965,  0.19557032]]]]

test_cc_flexattention_soft_cap

Node:
  ai.onnx.preview.FlexAttention(Q, K, V) -> (Y)
  Attributes:
    score_mod = <subgraph>

Inputs:
  Q: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1. ,  0. ],
       [ 0. ,  1. ]],

      [[ 0.5,  0.5],
       [ 1. , -1. ]]]]
  K: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  0.],
       [ 0.,  1.]],

      [[ 1.,  1.],
       [-1.,  1.]]]]
  V: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.,  2.],
       [ 3.,  4.]],

      [[-1.,  0.],
       [ 0.,  1.]]]]

Outputs:
  Y: shape=(1, 2, 2, 2), dtype=float32
    [[[[ 1.6606072 ,  2.6606073 ],
       [ 2.3393927 ,  3.3393927 ]],

      [[-0.6696964 ,  0.3303036 ],
       [-0.8040593 ,  0.19594066]]]]
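
The score_mod subgraph for this test is not shown. Soft capping is commonly written as cap * tanh(score / cap), which keeps every score strictly inside (-cap, cap). The sketch below uses that form with cap = 20.0; both the formula and the cap value are assumptions, chosen because they reproduce the output above. The helper names and the default scale of 1/sqrt(head_size) are assumptions as well.

```python
import numpy as np

CAP = 20.0  # hypothetical cap value, not stated in this document

def soft_cap_score_mod(scores):
    # Smoothly squashes scores into (-CAP, CAP); near zero it is almost the identity.
    return CAP * np.tanh(scores / CAP)

def attention(Q, K, V, score_mod):
    scores = (Q @ np.swapaxes(K, -1, -2)) * (Q.shape[-1] ** -0.5)  # assumed default scale
    scores = score_mod(scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ V

Q = np.array([[[[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]]]], dtype=np.float32)
K = np.array([[[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]], dtype=np.float32)
V = np.array([[[[1, 2], [3, 4]], [[-1, 0], [0, 1]]]], dtype=np.float32)

Y = attention(Q, K, V, soft_cap_score_mod)
print(Y)
```

The scores in this test are small relative to the cap, so the result differs from the unmodified output only in the fourth decimal place.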