RoiAlign#

RoiAlign - 16#

Version

  • name: RoiAlign (GitHub)

  • domain: main

  • since_version: 16

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 16.

Summary

Region of Interest (RoI) align operation described in the [Mask R-CNN paper](https://arxiv.org/abs/1703.06870). RoiAlign consumes an input tensor X and region of interests (rois) to apply pooling across each RoI; it produces a 4-D tensor of shape (num_rois, C, output_height, output_width).

RoiAlign is proposed to avoid the misalignment by removing quantizations while converting from original image into feature map and from feature map into RoI feature; in each ROI bin, the value of the sampled locations are computed directly through bilinear interpolation.

Attributes

  • coordinate_transformation_mode: Allowed values are ‘half_pixel’ and ‘output_half_pixel’. Use the value ‘half_pixel’ to pixel shift the input coordinates by -0.5 (the recommended behavior). Use the value ‘output_half_pixel’ to omit the pixel shift for the input (use this for a backward-compatible behavior). Default value is 'half_pixel'.

  • mode: The pooling method. Two modes are supported: ‘avg’ and ‘max’. Default is ‘avg’. Default value is 'avg'.

  • output_height: default 1; Pooled output Y’s height. Default value is 1.

  • output_width: default 1; Pooled output Y’s width. Default value is 1.

  • sampling_ratio: Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio grid points are used. If == 0, then an adaptive number of grid points are used (computed as ceil(roi_width / output_width), and likewise for height). Default is 0. Default value is 0.

  • spatial_scale: Multiplicative spatial scale factor to translate ROI coordinates from their input spatial scale to the scale used when pooling, i.e., spatial scale of the input feature map X relative to the input image. E.g.; default is 1.0f. Default value is 1.0.

Inputs

  • X (heterogeneous) - T1: Input data tensor from the previous operator; 4-D feature map of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data.

  • rois (heterogeneous) - T1: RoIs (Regions of Interest) to pool over; rois is 2-D input of shape (num_rois, 4) given as [[x1, y1, x2, y2], …]. The RoIs’ coordinates are in the coordinate system of the input image. Each coordinate set has a 1:1 correspondence with the ‘batch_indices’ input.

  • batch_indices (heterogeneous) - T2: 1-D tensor of shape (num_rois,) with each element denoting the index of the corresponding image in the batch.

Outputs

  • Y (heterogeneous) - T1: RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element Y[r-1] is a pooled feature map corresponding to the r-th RoI X[r-1].

Type Constraints

  • T1 in ( tensor(double), tensor(float), tensor(float16) ): Constrain types to float tensors.

  • T2 in ( tensor(int64) ): Constrain types to int tensors.

Examples

roialign_aligned_false

node = onnx.helper.make_node(
    "RoiAlign",
    inputs=["X", "rois", "batch_indices"],
    outputs=["Y"],
    spatial_scale=1.0,
    output_height=5,
    output_width=5,
    sampling_ratio=2,
    coordinate_transformation_mode="output_half_pixel",
)

X, batch_indices, rois = get_roi_align_input_values()
# (num_rois, C, output_height, output_width)
Y = np.array(
    [
        [
            [
                [0.4664, 0.4466, 0.3405, 0.5688, 0.6068],
                [0.3714, 0.4296, 0.3835, 0.5562, 0.3510],
                [0.2768, 0.4883, 0.5222, 0.5528, 0.4171],
                [0.4713, 0.4844, 0.6904, 0.4920, 0.8774],
                [0.6239, 0.7125, 0.6289, 0.3355, 0.3495],
            ]
        ],
        [
            [
                [0.3022, 0.4305, 0.4696, 0.3978, 0.5423],
                [0.3656, 0.7050, 0.5165, 0.3172, 0.7015],
                [0.2912, 0.5059, 0.6476, 0.6235, 0.8299],
                [0.5916, 0.7389, 0.7048, 0.8372, 0.8893],
                [0.6227, 0.6153, 0.7097, 0.6154, 0.4585],
            ]
        ],
        [
            [
                [0.2384, 0.3379, 0.3717, 0.6100, 0.7601],
                [0.3767, 0.3785, 0.7147, 0.9243, 0.9727],
                [0.5749, 0.5826, 0.5709, 0.7619, 0.8770],
                [0.5355, 0.2566, 0.2141, 0.2796, 0.3600],
                [0.4365, 0.3504, 0.2887, 0.3661, 0.2349],
            ]
        ],
    ],
    dtype=np.float32,
)

expect(node, inputs=[X, rois, batch_indices], outputs=[Y], name="test_roialign_aligned_false")

roialign_aligned_true

node = onnx.helper.make_node(
    "RoiAlign",
    inputs=["X", "rois", "batch_indices"],
    outputs=["Y"],
    spatial_scale=1.0,
    output_height=5,
    output_width=5,
    sampling_ratio=2,
    coordinate_transformation_mode="half_pixel",
)

X, batch_indices, rois = get_roi_align_input_values()
# (num_rois, C, output_height, output_width)
Y = np.array(
    [
        [
            [
                [0.5178, 0.3434, 0.3229, 0.4474, 0.6344],
                [0.4031, 0.5366, 0.4428, 0.4861, 0.4023],
                [0.2512, 0.4002, 0.5155, 0.6954, 0.3465],
                [0.3350, 0.4601, 0.5881, 0.3439, 0.6849],
                [0.4932, 0.7141, 0.8217, 0.4719, 0.4039],
            ]
        ],
        [
            [
                [0.3070, 0.2187, 0.3337, 0.4880, 0.4870],
                [0.1871, 0.4914, 0.5561, 0.4192, 0.3686],
                [0.1433, 0.4608, 0.5971, 0.5310, 0.4982],
                [0.2788, 0.4386, 0.6022, 0.7000, 0.7524],
                [0.5774, 0.7024, 0.7251, 0.7338, 0.8163],
            ]
        ],
        [
            [
                [0.2393, 0.4075, 0.3379, 0.2525, 0.4743],
                [0.3671, 0.2702, 0.4105, 0.6419, 0.8308],
                [0.5556, 0.4543, 0.5564, 0.7502, 0.9300],
                [0.6626, 0.5617, 0.4813, 0.4954, 0.6663],
                [0.6636, 0.3721, 0.2056, 0.1928, 0.2478],
            ]
        ],
    ],
    dtype=np.float32,
)

expect(node, inputs=[X, rois, batch_indices], outputs=[Y], name="test_roialign_aligned_true")

Differences

00Region of Interest (RoI) align operation described in theRegion of Interest (RoI) align operation described in the
11[Mask R-CNN paper](https://arxiv.org/abs/1703.06870).[Mask R-CNN paper](https://arxiv.org/abs/1703.06870).
22RoiAlign consumes an input tensor X and region of interests (rois)RoiAlign consumes an input tensor X and region of interests (rois)
33to apply pooling across each RoI; it produces a 4-D tensor of shapeto apply pooling across each RoI; it produces a 4-D tensor of shape
44(num_rois, C, output_height, output_width).(num_rois, C, output_height, output_width).
55
66RoiAlign is proposed to avoid the misalignment by removingRoiAlign is proposed to avoid the misalignment by removing
77quantizations while converting from original image into featurequantizations while converting from original image into feature
88map and from feature map into RoI feature; in each ROI bin,map and from feature map into RoI feature; in each ROI bin,
99the value of the sampled locations are computed directlythe value of the sampled locations are computed directly
1010through bilinear interpolation.through bilinear interpolation.
1111
1212**Attributes****Attributes**
1313
14* **coordinate_transformation_mode**:
15 Allowed values are 'half_pixel' and 'output_half_pixel'. Use the
16 value 'half_pixel' to pixel shift the input coordinates by -0.5 (the
17 recommended behavior). Use the value 'output_half_pixel' to omit the
18 pixel shift for the input (use this for a backward-compatible
19 behavior). Default value is 'half_pixel'.
1420* **mode**:* **mode**:
1521 The pooling method. Two modes are supported: 'avg' and 'max'. The pooling method. Two modes are supported: 'avg' and 'max'.
1622 Default is 'avg'. Default value is 'avg'. Default is 'avg'. Default value is 'avg'.
1723* **output_height**:* **output_height**:
1824 default 1; Pooled output Y's height. Default value is 1. default 1; Pooled output Y's height. Default value is 1.
1925* **output_width**:* **output_width**:
2026 default 1; Pooled output Y's width. Default value is 1. default 1; Pooled output Y's width. Default value is 1.
2127* **sampling_ratio**:* **sampling_ratio**:
2228 Number of sampling points in the interpolation grid used to compute Number of sampling points in the interpolation grid used to compute
2329 the output value of each pooled output bin. If > 0, then exactly the output value of each pooled output bin. If > 0, then exactly
2430 sampling_ratio x sampling_ratio grid points are used. If == 0, then sampling_ratio x sampling_ratio grid points are used. If == 0, then
2531 an adaptive number of grid points are used (computed as an adaptive number of grid points are used (computed as
2632 ceil(roi_width / output_width), and likewise for height). Default is ceil(roi_width / output_width), and likewise for height). Default is
2733 0. Default value is 0. 0. Default value is 0.
2834* **spatial_scale**:* **spatial_scale**:
2935 Multiplicative spatial scale factor to translate ROI coordinates Multiplicative spatial scale factor to translate ROI coordinates
3036 from their input spatial scale to the scale used when pooling, i.e., from their input spatial scale to the scale used when pooling, i.e.,
3137 spatial scale of the input feature map X relative to the input spatial scale of the input feature map X relative to the input
3238 image. E.g.; default is 1.0f. Default value is 1.0. image. E.g.; default is 1.0f. Default value is 1.0.
3339
3440**Inputs****Inputs**
3541
3642* **X** (heterogeneous) - **T1**:* **X** (heterogeneous) - **T1**:
3743 Input data tensor from the previous operator; 4-D feature map of Input data tensor from the previous operator; 4-D feature map of
3844 shape (N, C, H, W), where N is the batch size, C is the number of shape (N, C, H, W), where N is the batch size, C is the number of
3945 channels, and H and W are the height and the width of the data. channels, and H and W are the height and the width of the data.
4046* **rois** (heterogeneous) - **T1**:* **rois** (heterogeneous) - **T1**:
4147 RoIs (Regions of Interest) to pool over; rois is 2-D input of shape RoIs (Regions of Interest) to pool over; rois is 2-D input of shape
4248 (num_rois, 4) given as [[x1, y1, x2, y2], ...]. The RoIs' (num_rois, 4) given as [[x1, y1, x2, y2], ...]. The RoIs'
4349 coordinates are in the coordinate system of the input image. Each coordinates are in the coordinate system of the input image. Each
4450 coordinate set has a 1:1 correspondence with the 'batch_indices' coordinate set has a 1:1 correspondence with the 'batch_indices'
4551 input. input.
4652* **batch_indices** (heterogeneous) - **T2**:* **batch_indices** (heterogeneous) - **T2**:
4753 1-D tensor of shape (num_rois,) with each element denoting the index 1-D tensor of shape (num_rois,) with each element denoting the index
4854 of the corresponding image in the batch. of the corresponding image in the batch.
4955
5056**Outputs****Outputs**
5157
5258* **Y** (heterogeneous) - **T1**:* **Y** (heterogeneous) - **T1**:
5359 RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, RoI pooled output, 4-D tensor of shape (num_rois, C, output_height,
5460 output_width). The r-th batch element Y[r-1] is a pooled feature map output_width). The r-th batch element Y[r-1] is a pooled feature map
5561 corresponding to the r-th RoI X[r-1]. corresponding to the r-th RoI X[r-1].
5662
5763**Type Constraints****Type Constraints**
5864
5965* **T1** in (* **T1** in (
6066 tensor(double), tensor(double),
6167 tensor(float), tensor(float),
6268 tensor(float16) tensor(float16)
6369 ): ):
6470 Constrain types to float tensors. Constrain types to float tensors.
6571* **T2** in (* **T2** in (
6672 tensor(int64) tensor(int64)
6773 ): ):
6874 Constrain types to int tensors. Constrain types to int tensors.

RoiAlign - 10#

Version

  • name: RoiAlign (GitHub)

  • domain: main

  • since_version: 10

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 10.

Summary

Region of Interest (RoI) align operation described in the [Mask R-CNN paper](https://arxiv.org/abs/1703.06870). RoiAlign consumes an input tensor X and region of interests (rois) to apply pooling across each RoI; it produces a 4-D tensor of shape (num_rois, C, output_height, output_width).

RoiAlign is proposed to avoid the misalignment by removing quantizations while converting from original image into feature map and from feature map into RoI feature; in each ROI bin, the value of the sampled locations are computed directly through bilinear interpolation.

Attributes

  • mode: The pooling method. Two modes are supported: ‘avg’ and ‘max’. Default is ‘avg’. Default value is 'avg'.

  • output_height: default 1; Pooled output Y’s height. Default value is 1.

  • output_width: default 1; Pooled output Y’s width. Default value is 1.

  • sampling_ratio: Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio grid points are used. If == 0, then an adaptive number of grid points are used (computed as ceil(roi_width / output_width), and likewise for height). Default is 0. Default value is 0.

  • spatial_scale: Multiplicative spatial scale factor to translate ROI coordinates from their input spatial scale to the scale used when pooling, i.e., spatial scale of the input feature map X relative to the input image. E.g.; default is 1.0f. Default value is 1.0.

Inputs

  • X (heterogeneous) - T1: Input data tensor from the previous operator; 4-D feature map of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data.

  • rois (heterogeneous) - T1: RoIs (Regions of Interest) to pool over; rois is 2-D input of shape (num_rois, 4) given as [[x1, y1, x2, y2], …]. The RoIs’ coordinates are in the coordinate system of the input image. Each coordinate set has a 1:1 correspondence with the ‘batch_indices’ input.

  • batch_indices (heterogeneous) - T2: 1-D tensor of shape (num_rois,) with each element denoting the index of the corresponding image in the batch.

Outputs

  • Y (heterogeneous) - T1: RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element Y[r-1] is a pooled feature map corresponding to the r-th RoI X[r-1].

Type Constraints

  • T1 in ( tensor(double), tensor(float), tensor(float16) ): Constrain types to float tensors.

  • T2 in ( tensor(int64) ): Constrain types to int tensors.