RoiAlign
RoiAlign - 16
Version
name: RoiAlign (GitHub)
domain: main
since_version: 16
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 16.
Summary
Region of Interest (RoI) align operation described in the [Mask R-CNN paper](https://arxiv.org/abs/1703.06870). RoiAlign consumes an input tensor X and regions of interest (rois) and applies pooling across each RoI; it produces a 4-D tensor of shape (num_rois, C, output_height, output_width).
RoiAlign was proposed to avoid misalignment by removing the quantization steps when mapping from the original image to the feature map and from the feature map to the RoI feature; in each RoI bin, the values at the sampled locations are computed directly through bilinear interpolation.
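To make the bilinear interpolation step concrete, here is a minimal pure-Python sketch that samples a single-channel feature map at a fractional location. The function name `bilinear_sample` and the clamping of out-of-range coordinates are assumptions made for illustration; they are not taken from the operator specification.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2-D grid (list of rows) at a fractional (y, x) location.

    Hypothetical helper for illustration only.
    """
    height, width = len(feature), len(feature[0])
    # Assumption of this sketch: clamp the location to the valid index range.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    # Indices of the four neighbouring grid points.
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    # Fractional distances to the top-left neighbour.
    ly, lx = y - y0, x - x0
    hy, hx = 1.0 - ly, 1.0 - lx
    # Weighted sum of the four neighbours; no coordinate is rounded.
    return (
        hy * hx * feature[y0][x0]
        + hy * lx * feature[y0][x1]
        + ly * hx * feature[y1][x0]
        + ly * lx * feature[y1][x1]
    )


grid = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(grid, 0.5, 0.5))
```

Because the sample location stays fractional, the result blends the four surrounding feature values instead of snapping to the nearest cell, which is the quantization the summary says RoiAlign avoids.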
Attributes
coordinate_transformation_mode: Allowed values are 'half_pixel' and 'output_half_pixel'. Use 'half_pixel' to shift the input coordinates by -0.5 pixel (the recommended behavior). Use 'output_half_pixel' to omit the pixel shift for the input (backward-compatible behavior). Default value is 'half_pixel'.
mode: The pooling method. Two modes are supported: 'avg' and 'max'. Default value is 'avg'.
output_height: Height of the pooled output Y. Default value is 1.
output_width: Width of the pooled output Y. Default value is 1.
sampling_ratio: Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, exactly sampling_ratio x sampling_ratio grid points are used. If == 0, an adaptive number of grid points is used (computed as ceil(roi_width / output_width), and likewise for height). Default value is 0.
spatial_scale: Multiplicative spatial scale factor that translates RoI coordinates from their input spatial scale to the scale used when pooling, i.e., the spatial scale of the input feature map X relative to the input image. Default value is 1.0.
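The interaction of spatial_scale, coordinate_transformation_mode, and sampling_ratio can be sketched along a single axis. The helper below is hypothetical: its name, the order of operations (scale first, then shift), and the placement of sample points at evenly spaced centers inside each bin follow common RoIAlign implementations and are assumptions, not statements from this specification.

```python
import math


def sample_points_1d(roi_start, roi_end, output_size, sampling_ratio=0,
                     spatial_scale=1.0,
                     coordinate_transformation_mode="half_pixel"):
    """Return, for each output bin, the sample coordinates along one axis.

    Hypothetical helper; a simplified sketch rather than a reference
    implementation.
    """
    # 'half_pixel' shifts the scaled coordinates by -0.5; the other mode does not.
    offset = 0.5 if coordinate_transformation_mode == "half_pixel" else 0.0
    start = roi_start * spatial_scale - offset
    end = roi_end * spatial_scale - offset
    roi_size = end - start
    bin_size = roi_size / output_size
    if sampling_ratio > 0:
        grid = sampling_ratio  # fixed number of points per bin
    else:
        # Adaptive count, ceil(roi_size / output_size); at least one point
        # (the lower bound of 1 is an assumption of this sketch).
        grid = max(1, math.ceil(roi_size / output_size))
    return [
        [start + b * bin_size + (i + 0.5) * bin_size / grid for i in range(grid)]
        for b in range(output_size)
    ]


print(sample_points_1d(0.0, 4.0, 2, sampling_ratio=2,
                       coordinate_transformation_mode="output_half_pixel"))
print(sample_points_1d(0.0, 4.0, 2, sampling_ratio=2,
                       coordinate_transformation_mode="half_pixel"))
```

Under these assumptions the two modes produce the same grid layout, offset from each other by half a pixel; the full operator applies the same logic to both the height and the width axis.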
Inputs
X (heterogeneous) - T1: Input data tensor from the previous operator; 4-D feature map of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data.
rois (heterogeneous) - T1: RoIs (Regions of Interest) to pool over; rois is a 2-D input of shape (num_rois, 4) given as [[x1, y1, x2, y2], …]. The RoIs' coordinates are in the coordinate system of the input image. Each coordinate set has a 1:1 correspondence with an element of the 'batch_indices' input.
batch_indices (heterogeneous) - T2: 1-D tensor of shape (num_rois,) with each element denoting the index of the corresponding image in the batch.
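The 1:1 pairing between rois and batch_indices can be illustrated with a small sketch that groups boxes by the image they belong to. The function name `group_rois_by_image` and the sample values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_rois_by_image(rois, batch_indices):
    """Map each batch index to the list of (x1, y1, x2, y2) boxes that use it.

    Hypothetical helper for illustration only.
    """
    if len(rois) != len(batch_indices):
        raise ValueError("rois and batch_indices must have the same length")
    grouped = defaultdict(list)
    for roi, image_index in zip(rois, batch_indices):
        if len(roi) != 4:
            raise ValueError("each RoI must be [x1, y1, x2, y2]")
        grouped[image_index].append(tuple(roi))
    return dict(grouped)


# Illustrative values: three RoIs, the first two on image 0, the last on image 1.
rois = [[0.0, 0.0, 9.0, 9.0], [0.0, 5.0, 4.0, 9.0], [5.0, 5.0, 9.0, 9.0]]
batch_indices = [0, 0, 1]
print(group_rois_by_image(rois, batch_indices))
```

Row i of rois is pooled from the feature map X[batch_indices[i]], so the two inputs must have the same leading dimension.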
Outputs
Y (heterogeneous) - T1: RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element Y[r-1] is the pooled feature map corresponding to the r-th RoI, rois[r-1].
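The shape relationship between the inputs and Y can be written down as a short check. This is a sketch under the shapes stated above; the function name `roi_align_output_shape` is hypothetical and the error handling is an assumption.

```python
def roi_align_output_shape(x_shape, rois_shape, batch_indices_shape,
                           output_height=1, output_width=1):
    """Derive the shape of Y from the input shapes described above.

    Hypothetical helper for illustration only.
    """
    if len(x_shape) != 4:
        raise ValueError("X must be 4-D: (N, C, H, W)")
    if len(rois_shape) != 2 or rois_shape[1] != 4:
        raise ValueError("rois must be 2-D: (num_rois, 4)")
    if len(batch_indices_shape) != 1 or batch_indices_shape[0] != rois_shape[0]:
        raise ValueError("batch_indices must be 1-D: (num_rois,)")
    num_rois = rois_shape[0]
    channels = x_shape[1]
    # The batch size N and the spatial size (H, W) of X do not appear in Y.
    return (num_rois, channels, output_height, output_width)


print(roi_align_output_shape((1, 1, 10, 10), (3, 4), (3,), 5, 5))
```

The leading dimension of Y counts RoIs rather than images, which is why N, H, and W of X do not appear in the result.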
Type Constraints
T1 in ( tensor(double), tensor(float), tensor(float16) ): Constrain types to float tensors.
T2 in ( tensor(int64) ): Constrain types to int tensors.
Examples
roialign_aligned_false
import numpy as np
import onnx

# expect and get_roi_align_input_values are helpers defined in the ONNX
# backend test suite; they are not defined in this snippet.
node = onnx.helper.make_node(
    "RoiAlign",
    inputs=["X", "rois", "batch_indices"],
    outputs=["Y"],
    spatial_scale=1.0,
    output_height=5,
    output_width=5,
    sampling_ratio=2,
    coordinate_transformation_mode="output_half_pixel",
)
X, batch_indices, rois = get_roi_align_input_values()
# (num_rois, C, output_height, output_width)
Y = np.array(
    [
        [
            [
                [0.4664, 0.4466, 0.3405, 0.5688, 0.6068],
                [0.3714, 0.4296, 0.3835, 0.5562, 0.3510],
                [0.2768, 0.4883, 0.5222, 0.5528, 0.4171],
                [0.4713, 0.4844, 0.6904, 0.4920, 0.8774],
                [0.6239, 0.7125, 0.6289, 0.3355, 0.3495],
            ]
        ],
        [
            [
                [0.3022, 0.4305, 0.4696, 0.3978, 0.5423],
                [0.3656, 0.7050, 0.5165, 0.3172, 0.7015],
                [0.2912, 0.5059, 0.6476, 0.6235, 0.8299],
                [0.5916, 0.7389, 0.7048, 0.8372, 0.8893],
                [0.6227, 0.6153, 0.7097, 0.6154, 0.4585],
            ]
        ],
        [
            [
                [0.2384, 0.3379, 0.3717, 0.6100, 0.7601],
                [0.3767, 0.3785, 0.7147, 0.9243, 0.9727],
                [0.5749, 0.5826, 0.5709, 0.7619, 0.8770],
                [0.5355, 0.2566, 0.2141, 0.2796, 0.3600],
                [0.4365, 0.3504, 0.2887, 0.3661, 0.2349],
            ]
        ],
    ],
    dtype=np.float32,
)
expect(node, inputs=[X, rois, batch_indices], outputs=[Y], name="test_roialign_aligned_false")
roialign_aligned_true
import numpy as np
import onnx

# expect and get_roi_align_input_values are helpers defined in the ONNX
# backend test suite; they are not defined in this snippet.
node = onnx.helper.make_node(
    "RoiAlign",
    inputs=["X", "rois", "batch_indices"],
    outputs=["Y"],
    spatial_scale=1.0,
    output_height=5,
    output_width=5,
    sampling_ratio=2,
    coordinate_transformation_mode="half_pixel",
)
X, batch_indices, rois = get_roi_align_input_values()
# (num_rois, C, output_height, output_width)
Y = np.array(
    [
        [
            [
                [0.5178, 0.3434, 0.3229, 0.4474, 0.6344],
                [0.4031, 0.5366, 0.4428, 0.4861, 0.4023],
                [0.2512, 0.4002, 0.5155, 0.6954, 0.3465],
                [0.3350, 0.4601, 0.5881, 0.3439, 0.6849],
                [0.4932, 0.7141, 0.8217, 0.4719, 0.4039],
            ]
        ],
        [
            [
                [0.3070, 0.2187, 0.3337, 0.4880, 0.4870],
                [0.1871, 0.4914, 0.5561, 0.4192, 0.3686],
                [0.1433, 0.4608, 0.5971, 0.5310, 0.4982],
                [0.2788, 0.4386, 0.6022, 0.7000, 0.7524],
                [0.5774, 0.7024, 0.7251, 0.7338, 0.8163],
            ]
        ],
        [
            [
                [0.2393, 0.4075, 0.3379, 0.2525, 0.4743],
                [0.3671, 0.2702, 0.4105, 0.6419, 0.8308],
                [0.5556, 0.4543, 0.5564, 0.7502, 0.9300],
                [0.6626, 0.5617, 0.4813, 0.4954, 0.6663],
                [0.6636, 0.3721, 0.2056, 0.1928, 0.2478],
            ]
        ],
    ],
    dtype=np.float32,
)
expect(node, inputs=[X, rois, batch_indices], outputs=[Y], name="test_roialign_aligned_true")
Differences
Compared with version 10, version 16 adds one attribute; the rest of the specification text (summary, remaining attributes, inputs, outputs, and type constraints) is unchanged.
Added in version 16:
coordinate_transformation_mode: Allowed values are 'half_pixel' and 'output_half_pixel'. Use the value 'half_pixel' to pixel shift the input coordinates by -0.5 (the recommended behavior). Use the value 'output_half_pixel' to omit the pixel shift for the input (use this for a backward-compatible behavior). Default value is 'half_pixel'.
RoiAlign - 10
Version
name: RoiAlign (GitHub)
domain: main
since_version: 10
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 10.
Summary
Region of Interest (RoI) align operation described in the [Mask R-CNN paper](https://arxiv.org/abs/1703.06870). RoiAlign consumes an input tensor X and regions of interest (rois) and applies pooling across each RoI; it produces a 4-D tensor of shape (num_rois, C, output_height, output_width).
RoiAlign was proposed to avoid misalignment by removing the quantization steps when mapping from the original image to the feature map and from the feature map to the RoI feature; in each RoI bin, the values at the sampled locations are computed directly through bilinear interpolation.
Attributes
mode: The pooling method. Two modes are supported: 'avg' and 'max'. Default value is 'avg'.
output_height: Height of the pooled output Y. Default value is 1.
output_width: Width of the pooled output Y. Default value is 1.
sampling_ratio: Number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, exactly sampling_ratio x sampling_ratio grid points are used. If == 0, an adaptive number of grid points is used (computed as ceil(roi_width / output_width), and likewise for height). Default value is 0.
spatial_scale: Multiplicative spatial scale factor that translates RoI coordinates from their input spatial scale to the scale used when pooling, i.e., the spatial scale of the input feature map X relative to the input image. Default value is 1.0.
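Version 10 has no coordinate_transformation_mode attribute, while version 16 defaults it to 'half_pixel' and describes 'output_half_pixel' as the backward-compatible setting. A node moved from version 10 to version 16 would therefore presumably need that attribute set explicitly to keep its coordinate handling. The helper below is a hypothetical sketch of that bookkeeping; its name and error messages are assumptions, and it only builds a plain attribute dictionary.

```python
def roi_align_attributes(opset_version, **attrs):
    """Return RoiAlign attributes that keep version-10 coordinate handling.

    Hypothetical helper for illustration only.
    """
    if opset_version < 10:
        raise ValueError("RoiAlign is available since version 10")
    result = dict(attrs)
    if opset_version >= 16:
        # Version 16 defaults to 'half_pixel'; request the backward-compatible
        # mode unless the caller chose a mode explicitly.
        result.setdefault("coordinate_transformation_mode", "output_half_pixel")
    elif "coordinate_transformation_mode" in result:
        raise ValueError(
            "coordinate_transformation_mode is not available before version 16"
        )
    return result


print(roi_align_attributes(10, output_height=5, output_width=5))
print(roi_align_attributes(16, output_height=5, output_width=5))
```

The resulting dictionary could be passed as keyword arguments when constructing the node, for example with onnx.helper.make_node as in the version 16 examples.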
Inputs
X (heterogeneous) - T1: Input data tensor from the previous operator; 4-D feature map of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data.
rois (heterogeneous) - T1: RoIs (Regions of Interest) to pool over; rois is a 2-D input of shape (num_rois, 4) given as [[x1, y1, x2, y2], …]. The RoIs' coordinates are in the coordinate system of the input image. Each coordinate set has a 1:1 correspondence with an element of the 'batch_indices' input.
batch_indices (heterogeneous) - T2: 1-D tensor of shape (num_rois,) with each element denoting the index of the corresponding image in the batch.
Outputs
Y (heterogeneous) - T1: RoI pooled output, 4-D tensor of shape (num_rois, C, output_height, output_width). The r-th batch element Y[r-1] is the pooled feature map corresponding to the r-th RoI, rois[r-1].
Type Constraints
T1 in ( tensor(double), tensor(float), tensor(float16) ): Constrain types to float tensors.
T2 in ( tensor(int64) ): Constrain types to int tensors.