BatchNormalization

BatchNormalization - 15

Version

  • name: BatchNormalization (GitHub)

  • domain: main

  • since_version: 15

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 15.

Summary

Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. There are five required inputs: ‘X’, ‘scale’, ‘B’, ‘input_mean’ and ‘input_var’. Note that ‘input_mean’ and ‘input_var’ are expected to be the estimated statistics in inference mode (training_mode=False, default), and the running statistics in training mode (training_mode=True). The number of outputs depends on the mode in which the operator is run:

Output case #1: Y, running_mean, running_var (training_mode=True)
Output case #2: Y (training_mode=False)

When training_mode=False, extra outputs are invalid. The outputs are updated as follows when training_mode=True:

running_mean = input_mean * momentum + current_mean * (1 - momentum)
running_var = input_var * momentum + current_var * (1 - momentum)

Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B

where:

current_mean = ReduceMean(X, axis=all_except_channel_index)
current_var =  ReduceVar(X, axis=all_except_channel_index)

Note that ReduceVar refers to the population variance, which equals
sum(sqrd(x_i - x_avg)) / N
where N is the population size (the formula does not divide by the sample-size term N - 1).
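The training-mode formulas above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `batchnorm_train` is made up for this example, and the channel axis is assumed to be axis 1.

```python
import numpy as np

def batchnorm_train(x, scale, b, input_mean, input_var, epsilon=1e-5, momentum=0.9):
    """Hypothetical helper: training-mode formulas for an input of shape (N, C, D1, ..., Dn)."""
    # Reduce over every axis except the channel axis (assumed to be axis 1).
    axes = tuple(i for i in range(x.ndim) if i != 1)
    current_mean = x.mean(axis=axes)
    current_var = x.var(axis=axes)  # NumPy's default ddof=0 gives the population variance

    running_mean = input_mean * momentum + current_mean * (1 - momentum)
    running_var = input_var * momentum + current_var * (1 - momentum)

    # Reshape per-channel vectors to (1, C, 1, ..., 1) so they broadcast against x.
    shape = [1, -1] + [1] * (x.ndim - 2)
    y = (x - current_mean.reshape(shape)) / np.sqrt(current_var.reshape(shape) + epsilon)
    y = y * scale.reshape(shape) + b.reshape(shape)
    return y, running_mean, running_var

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 5, 5)).astype(np.float32)
scale = np.ones(3, dtype=np.float32)
b = np.zeros(3, dtype=np.float32)
y, running_mean, running_var = batchnorm_train(
    x, scale, b, np.zeros(3, dtype=np.float32), np.ones(3, dtype=np.float32)
)
```

With unit scale and zero bias, each channel of `y` has a mean close to zero, and `running_mean` moves one tenth of the way from `input_mean` toward the batch mean.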

The computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs.
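To see why a wider float type helps, consider float16 values whose sum exceeds the largest finite float16 value (65504). The following NumPy illustration uses arbitrary array contents chosen for this example; it shows the general numerical issue, not the internals of any particular runtime.

```python
import numpy as np

# 1024 values of 100.0 sum to 102400, which exceeds the float16 maximum of 65504.
x = np.full(1024, 100.0, dtype=np.float16)

with np.errstate(over="ignore"):
    # The sum itself has to be stored as float16, so it overflows to infinity.
    naive_mean = x.sum(dtype=np.float16) / np.float16(x.size)

# Converting to float32 first keeps the sum finite.
safe_mean = x.astype(np.float32).mean()

print(naive_mean)  # inf
print(safe_mean)   # 100.0
```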

When training_mode=False:

Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
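The inference-mode formula can be sketched the same way. The helper name `batchnorm_infer` and the small input values are assumptions made for this example, and the channel axis is again assumed to be axis 1.

```python
import numpy as np

def batchnorm_infer(x, scale, b, input_mean, input_var, epsilon=1e-5):
    """Hypothetical helper: inference-mode formula, channel axis assumed to be 1."""
    shape = [1, -1] + [1] * (x.ndim - 2)  # (1, C, 1, ..., 1) for broadcasting
    x_hat = (x - input_mean.reshape(shape)) / np.sqrt(input_var.reshape(shape) + epsilon)
    return x_hat * scale.reshape(shape) + b.reshape(shape)

x = np.array([[[1.0, 2.0], [10.0, 20.0]]], dtype=np.float32)  # shape (N=1, C=2, D1=2)
scale = np.array([1.0, 2.0], dtype=np.float32)
b = np.array([0.0, 1.0], dtype=np.float32)
mean = np.array([1.0, 10.0], dtype=np.float32)
var = np.array([1.0, 25.0], dtype=np.float32)

# epsilon=0 keeps the arithmetic easy to follow by hand:
# channel 0 -> [(1-1)/1, (2-1)/1] = [0, 1]
# channel 1 -> [(10-10)/5*2+1, (20-10)/5*2+1] = [1, 5]
y = batchnorm_infer(x, scale, b, mean, var, epsilon=0.0)
```

Unlike training mode, no statistics are computed from `X` here; the supplied `input_mean` and `input_var` are used directly.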

For the previous (deprecated) non-spatial cases, implementors are advised to flatten the input shape to (N x C * D1 * D2 * … * Dn) before a BatchNormalization Op.

This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also simply be omitted.
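One way to read the flattening advice: reshaping (N, C, D1, ..., Dn) to (N, C * D1 * ... * Dn) turns every former (channel, position) pair into its own channel, so the statistics are reduced over the batch axis only. A NumPy sketch with arbitrary example shapes:

```python
import numpy as np

x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)  # (N, C, D1, D2)

# Flatten everything after the batch axis: (N, C * D1 * D2).
x_flat = x.reshape(x.shape[0], -1)

# Each of the C * D1 * D2 columns is now treated as a channel,
# so per-channel statistics reduce over the batch axis alone.
per_feature_mean = x_flat.mean(axis=0)

print(x_flat.shape)            # (2, 60)
print(per_feature_mean.shape)  # (60,)
```

Under this reading, scale, B, input_mean and input_var for the flattened input would each have C * D1 * ... * Dn elements.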

Attributes

  • epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.

  • momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.

  • training_mode: If set to true, it indicates that BatchNormalization is being used for training, and the optional outputs running_mean and running_var are populated. Default value is 0.
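The long decimal defaults are 1e-5 and 0.9 after rounding to single precision. This can be checked with Python's standard `struct` module; the helper name `as_float32` is made up for this example.

```python
import struct

def as_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

print(as_float32(1e-5))  # 9.999999747378752e-06
print(as_float32(0.9))   # 0.8999999761581421
```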

Inputs

  • X (heterogeneous) - T: Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over the N and D1 to Dn dimensions. For image data, the input dimensions become (N x C x H x W). The op also accepts a single-dimension input of size N, in which case C is assumed to be 1.

  • scale (heterogeneous) - T1: Scale tensor of shape (C).

  • B (heterogeneous) - T1: Bias tensor of shape (C).

  • input_mean (heterogeneous) - T2: Running mean (training) or estimated mean (testing), a tensor of shape (C).

  • input_var (heterogeneous) - T2: Running variance (training) or estimated variance (testing), a tensor of shape (C).
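The shape rules in this list can be expressed as a small check. The function `check_input_shapes` is a hypothetical validator written for this illustration; it is not part of ONNX or any runtime.

```python
def check_input_shapes(x_shape, scale_shape, b_shape, mean_shape, var_shape):
    """Hypothetical validator for the input shape rules listed above."""
    if len(x_shape) == 0:
        raise ValueError("X must have at least one dimension")
    # A 1-D input of size N is treated as having a single channel.
    channels = 1 if len(x_shape) == 1 else x_shape[1]
    for name, shape in [("scale", scale_shape), ("B", b_shape),
                        ("input_mean", mean_shape), ("input_var", var_shape)]:
        if tuple(shape) != (channels,):
            raise ValueError(f"{name} must have shape ({channels},), got {tuple(shape)}")
    return channels

print(check_input_shapes((8, 3, 32, 32), (3,), (3,), (3,), (3,)))  # 3
print(check_input_shapes((8,), (1,), (1,), (1,), (1,)))            # 1
```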

Outputs

Between 1 and 3 outputs.

  • Y (heterogeneous) - T: The output tensor of the same shape as X

  • running_mean (optional, heterogeneous) - T2: The running mean after the BatchNormalization operator.

  • running_var (optional, heterogeneous) - T2: The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N-1.

Type Constraints

  • T in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.

  • T1 in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain scale and bias types to float tensors.

  • T2 in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain mean and variance types to float tensors.
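Because T, T1 and T2 are constrained separately, the data, the scale and bias, and the statistics may each use a different float type. The sketch below shows one possible way to handle such a mix in NumPy. The policy of computing in float32 and casting Y back to the type of X is an assumption made for this example, not something the text above prescribes.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float16)   # T  = float16, shape (N=2, C=2)
scale = np.array([1.0, 1.0], dtype=np.float32)             # T1 = float32
b = np.array([0.0, 0.0], dtype=np.float32)                 # T1 = float32
input_mean = np.array([2.0, 3.0], dtype=np.float64)        # T2 = float64
input_var = np.array([1.0, 1.0], dtype=np.float64)         # T2 = float64

# Assumption for this sketch: compute in float32, then cast Y back to X's type T.
x32 = x.astype(np.float32)
y32 = (x32 - input_mean.astype(np.float32)) / np.sqrt(input_var.astype(np.float32) + 1e-5)
y = (y32 * scale + b).astype(x.dtype)

print(y.dtype)  # float16
```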

Differences

Changes relative to BatchNormalization - 14:

  • Summary: added the note that the computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs.

  • Inputs: scale and B change type from T to T1; input_mean and input_var change type from U to T2.

  • Outputs: running_mean and running_var change type from U to T2.

  • Type Constraints: T1 is added for the scale and bias types; U is renamed to T2, and the sentence “It allows all float type for U.” is dropped.

BatchNormalization - 14

Version

  • name: BatchNormalization (GitHub)

  • domain: main

  • since_version: 14

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 14.

Summary

Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. There are five required inputs: ‘X’, ‘scale’, ‘B’, ‘input_mean’ and ‘input_var’. Note that ‘input_mean’ and ‘input_var’ are expected to be the estimated statistics in inference mode (training_mode=False, default), and the running statistics in training mode (training_mode=True). The number of outputs depends on the mode in which the operator is run:

Output case #1: Y, running_mean, running_var (training_mode=True)
Output case #2: Y (training_mode=False)

When training_mode=False, extra outputs are invalid. The outputs are updated as follows when training_mode=True:

running_mean = input_mean * momentum + current_mean * (1 - momentum)
running_var = input_var * momentum + current_var * (1 - momentum)

Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B

where:

current_mean = ReduceMean(X, axis=all_except_channel_index)
current_var =  ReduceVar(X, axis=all_except_channel_index)

Note that ReduceVar refers to the population variance, which equals
sum(sqrd(x_i - x_avg)) / N
where N is the population size (the formula does not divide by the sample-size term N - 1).
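To make the population-variance convention and the momentum update concrete, here is a small worked example using Python's standard `statistics` module. The sample values, the starting `input_var` of 1.0 and the momentum of 0.9 are arbitrary choices for this illustration.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0]   # one channel's values across the batch

current_var = statistics.pvariance(values)   # population variance: divides by N
sample_var = statistics.variance(values)     # sample variance: divides by N - 1 (not used by this op)

print(current_var)  # 1.25
print(sample_var)   # 1.6666666666666667

# Running-statistics update from the formulas above.
momentum = 0.9
input_var = 1.0
running_var = input_var * momentum + current_var * (1 - momentum)
print(round(running_var, 6))  # 1.025
```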

When training_mode=False:

Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B

For the previous (deprecated) non-spatial cases, implementors are advised to flatten the input shape to (N x C * D1 * D2 * … * Dn) before a BatchNormalization Op.

This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also simply be omitted.

Attributes

  • epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.

  • momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.

  • training_mode: If set to true, it indicates that BatchNormalization is being used for training, and the optional outputs running_mean and running_var are populated. Default value is 0.

Inputs

  • X (heterogeneous) - T: Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over the N and D1 to Dn dimensions. For image data, the input dimensions become (N x C x H x W). The op also accepts a single-dimension input of size N, in which case C is assumed to be 1.

  • scale (heterogeneous) - T: Scale tensor of shape (C).

  • B (heterogeneous) - T: Bias tensor of shape (C).

  • input_mean (heterogeneous) - U: Running mean (training) or estimated mean (testing), a tensor of shape (C).

  • input_var (heterogeneous) - U: Running variance (training) or estimated variance (testing), a tensor of shape (C).

Outputs

Between 1 and 3 outputs.

  • Y (heterogeneous) - T: The output tensor of the same shape as X

  • running_mean (optional, heterogeneous) - U: The running mean after the BatchNormalization operator.

  • running_var (optional, heterogeneous) - U: The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N-1.

Type Constraints

  • T in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.

  • U in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ): Constrain mean and variance types to float tensors. All float types are allowed for U.

Differences

Changes relative to BatchNormalization - 9:

  • Summary: now lists the five required inputs, states what input_mean and input_var hold in each mode, and gives explicit formulas for training mode (running_mean, running_var and Y, using the population variance) and for inference mode. The output cases change from “Y, mean, var, saved_mean, saved_var (training mode)” and “Y (test mode)” to “Y, running_mean, running_var (training_mode=True)” and “Y (training_mode=False)”.

  • Attributes: training_mode is added (default value 0).

  • Inputs: mean and var are renamed to input_mean and input_var, and their type changes from T to U.

  • Outputs: the maximum number of outputs drops from 5 to 3. mean and var are renamed to running_mean and running_var with type U, and saved_mean and saved_var are removed. The running_var description now states that the population size N is used, not the sample size N-1.

  • Type Constraints: tensor(bfloat16) is added to T, and a new constraint U is introduced for the mean and variance types.

BatchNormalization - 9

Version

  • name: BatchNormalization (GitHub)

  • domain: main

  • since_version: 9

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 9.

Summary

Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. The number of outputs depends on the mode in which the operator is run:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)

For the previous (deprecated) non-spatial cases, implementors are advised to flatten the input shape to (N x C * D1 * D2 * … * Dn) before a BatchNormalization Op.

This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also simply be omitted.
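The naming convention for optional arguments can be sketched with plain Python lists of output names. The helper `trim_trailing_optional` is hypothetical, and the name lists only illustrate the convention; they are not a statement about which output combinations a given runtime accepts.

```python
def trim_trailing_optional(names):
    """Hypothetical helper: drop trailing empty names, keep interior ones as placeholders."""
    names = list(names)
    while names and names[-1] == "":
        names.pop()
    return names

# An interior missing argument keeps its empty-string slot so later positions stay aligned.
print(trim_trailing_optional(["Y", "", "", "saved_mean", ""]))  # ['Y', '', '', 'saved_mean']

# Trailing missing arguments can simply be omitted.
print(trim_trailing_optional(["Y", "", "", "", ""]))            # ['Y']
```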

Attributes

  • epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.

  • momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.

Inputs

  • X (heterogeneous) - T: Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over the N and D1 to Dn dimensions. For image data, the input dimensions become (N x C x H x W). The op also accepts a single-dimension input of size N, in which case C is assumed to be 1.

  • scale (heterogeneous) - T: Scale tensor of shape (C).

  • B (heterogeneous) - T: Bias tensor of shape (C).

  • mean (heterogeneous) - T: Running mean (training) or estimated mean (testing), a tensor of shape (C).

  • var (heterogeneous) - T: Running variance (training) or estimated variance (testing), a tensor of shape (C).

Outputs

Between 1 and 5 outputs.

  • Y (heterogeneous) - T: The output tensor of the same shape as X

  • mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator.

  • var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator.

  • saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation.

  • saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation.

Type Constraints

  • T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.

Differences

Changes relative to BatchNormalization - 7:

  • Summary: added the advice to flatten the input shape for the previous non-spatial cases.

  • Attributes: spatial is removed.

  • Inputs: X is now described in the general form (N x C x D1 x D2 … Dn), with statistics computed for every channel over N and D1 to Dn, and a single-dimension input of size N is accepted with C assumed to be 1. scale, B, mean and var are always tensors of shape (C); the (C x D1 x … x Dn) shapes used when spatial was false are gone.

  • Outputs and Type Constraints: unchanged.

BatchNormalization - 7

Version

  • name: BatchNormalization (GitHub)

  • domain: main

  • since_version: 7

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 7.

Summary

Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. The number of outputs depends on the mode in which the operator is run:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)

This operator has optional inputs/outputs. See ONNX for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument’s name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.

Attributes

  • epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06.

  • momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421.

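  • spatial: If true, compute the mean and variance per channel, across the batch and all spatial positions, so that the statistics have dimension (C). If false, compute the mean and variance per activation over each mini-batch, so that the statistics have dimensions (C x D1 x … x Dn). Default value is 1.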
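The running-statistics update quoted for momentum can be sketched in plain Python. This is a minimal illustration, not the reference implementation; the function name `update_running_stat` and the use of Python lists for the per-channel statistics are assumptions made for the example.

```python
def update_running_stat(running, current, momentum=0.9):
    """Blend the previous running statistic with the current batch statistic.

    Applies running * momentum + current * (1 - momentum) element-wise
    over per-channel lists (hypothetical helper, not part of ONNX).
    """
    if len(running) != len(current):
        raise ValueError("running and current statistics must have the same length")
    return [r * momentum + c * (1.0 - momentum) for r, c in zip(running, current)]


# Two channels: previous running means and the means of the current batch.
new_running_mean = update_running_stat([0.0, 2.0], [1.0, 4.0], momentum=0.9)
```

With a momentum close to 1, the running statistic moves only slightly toward the current batch statistic on each step.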

Inputs

  • X (heterogeneous) - T: Input data tensor from the previous operator; dimensions for the image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For the non-image case, the dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size.

  • scale (heterogeneous) - T: If spatial is true, the dimension of scale is (C). If spatial is false, the dimensions of scale are (C x D1 x … x Dn).

  • B (heterogeneous) - T: If spatial is true, the dimension of bias is (C). If spatial is false, the dimensions of bias are (C x D1 x … x Dn).

  • mean (heterogeneous) - T: If spatial is true, the dimension of the running mean (training) or the estimated mean (testing) is (C). If spatial is false, the dimensions of the running mean (training) or the estimated mean (testing) are (C x D1 x … x Dn).

  • var (heterogeneous) - T: If spatial is true, the dimension of the running variance (training) or the estimated variance (testing) is (C). If spatial is false, the dimensions of the running variance (training) or the estimated variance (testing) are (C x D1 x … x Dn).
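The shape rules for scale, B, mean and var can be summarized in a few lines of Python. This is only a sketch of the rule stated in the input descriptions; `expected_stat_shape` is a hypothetical helper name and shapes are represented as plain tuples.

```python
def expected_stat_shape(x_shape, spatial=1):
    """Return the expected shape of scale, B, mean and var for an input shape.

    x_shape is (N, C, D1, ..., Dn). With spatial set, there is one value per
    channel; otherwise there is one value per activation (all dims except N).
    """
    if len(x_shape) < 2:
        raise ValueError("X must have at least the N and C dimensions")
    if spatial:
        return (x_shape[1],)
    return tuple(x_shape[1:])


image_shape = (8, 3, 32, 32)  # N x C x H x W
per_channel = expected_stat_shape(image_shape, spatial=1)
per_activation = expected_stat_shape(image_shape, spatial=0)
```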

Outputs

Between 1 and 5 outputs.

  • Y (heterogeneous) - T: The output tensor of the same shape as X.

  • mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator.

  • var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator.

  • saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation.

  • saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation.

Type Constraints

  • T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.
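As an illustration of test mode with spatial set to true, the sketch below normalizes an input given as nested lists of shape [N][C][L]. It assumes that the formula given for version 15, Y = (X - mean) / sqrt(var + epsilon) * scale + B, also describes test mode in this version; the function name and the nested-list representation are assumptions made for the example.

```python
import math


def batchnorm_inference(x, scale, bias, mean, var, epsilon=1e-5):
    """Normalize x ([N][C][L] nested lists) with per-channel statistics.

    scale, bias, mean and var are lists of length C (the spatial=1 case).
    """
    output = []
    for sample in x:
        out_sample = []
        for c, channel in enumerate(sample):
            # One inverse standard deviation per channel.
            inv_std = 1.0 / math.sqrt(var[c] + epsilon)
            out_sample.append(
                [(v - mean[c]) * inv_std * scale[c] + bias[c] for v in channel]
            )
        output.append(out_sample)
    return output


# One sample, one channel, two values; epsilon=0 keeps the arithmetic exact.
y = batchnorm_inference([[[1.0, 3.0]]], scale=[2.0], bias=[1.0],
                        mean=[2.0], var=[1.0], epsilon=0.0)
```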

Differences

The following changes were made relative to version 6:

  • Summary: the paragraph on optional inputs/outputs was added (an empty string may stand in for a missing argument, and trailing optional arguments may be omitted).

  • is_test: the attribute ("If set to nonzero, run spatial batch normalization in test mode, default is 0.") was removed.

  • epsilon and momentum: the inline remarks "default is 1e-5f" and "default is 0.9f" were removed; the default values are unchanged.

  • spatial: the description was reworded. Version 6 read "If true, compute the mean and variance across all spatial elements If false, compute the mean and variance across per feature." The default value is unchanged.

  • scale, B, mean and var (inputs): previously 1-dimensional tensors of size C; now of dimension (C) if spatial is true and (C x D1 x … x Dn) if spatial is false.

  • mean and var (outputs): the remarks "Must be in-place with the input mean." / "Must be in-place with the input var." and "Should not be used for testing." were removed.

  • saved_mean and saved_var (outputs): the remark "Should not be used for testing." was removed.

  • The input X, the output Y, and the type constraints are unchanged.

BatchNormalization - 6#

Version

  • name: BatchNormalization (GitHub)

  • domain: main

  • since_version: 6

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 6.

Summary

Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode in which it is run, the operator produces a different number of outputs, as listed below:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)

Output case #2: Y (test mode)

Attributes

  • epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06 (1e-5f).

  • is_test: If set to nonzero, run spatial batch normalization in test mode. Default value is 0.

  • momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421 (0.9f).

  • spatial: If true, compute the mean and variance across all spatial elements. If false, compute the mean and variance per feature. Default value is 1.
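The long decimal defaults listed above are what 1e-5 and 0.9 look like after being stored as 32-bit floats and printed as 64-bit floats. The sketch below reproduces that rounding with the standard `struct` module; `to_float32` is a hypothetical helper name.

```python
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 and widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


epsilon_default = to_float32(1e-5)
momentum_default = to_float32(0.9)
print(epsilon_default, momentum_default)
```

Comparing a stored attribute against the literal 1e-5 or 0.9 with `==` can therefore fail; comparing against the rounded value, or with a tolerance, is safer.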

Inputs

  • X (heterogeneous) - T: Input data tensor from the previous operator; dimensions for the image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For the non-image case, the dimensions are in the form of (N x C x D1 x D2 … Dn), where N is the batch size.

  • scale (heterogeneous) - T: The scale as a 1-dimensional tensor of size C to be applied to the output.

  • B (heterogeneous) - T: The bias as a 1-dimensional tensor of size C to be applied to the output.

  • mean (heterogeneous) - T: The running mean (training) or the estimated mean (testing) as a 1-dimensional tensor of size C.

  • var (heterogeneous) - T: The running variance (training) or the estimated variance (testing) as a 1-dimensional tensor of size C.

Outputs

Between 1 and 5 outputs.

  • Y (heterogeneous) - T: The output tensor of the same shape as X.

  • mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator. Must be in-place with the input mean. Should not be used for testing.

  • var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.

  • saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation. Should not be used for testing.

  • saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation. Should not be used for testing.
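The batch statistics that saved_mean and saved_var refer to can be sketched as follows for the spatial case. The example assumes the population variance (division by the number of elements, as described for version 15) and an input given as [N][C][L] nested lists; what a particular backend actually stores in saved_var may differ, and `batch_statistics` is a hypothetical helper name.

```python
def batch_statistics(x):
    """Per-channel mean and population variance of x ([N][C][L] nested lists)."""
    num_channels = len(x[0])
    means, variances = [], []
    for c in range(num_channels):
        # Gather every value of channel c across the batch and positions.
        values = [v for sample in x for v in sample[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        variances.append(var)
    return means, variances


# Two samples, one channel, two positions each.
saved_mean, saved_var = batch_statistics([[[1.0, 3.0]], [[5.0, 7.0]]])
```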

Type Constraints

  • T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.

Differences

The following changes were made relative to version 1:

  • consumed_inputs: the required attribute ("legacy optimization attribute.") was removed.

  • X (input): previously "The input 4-dimensional tensor of shape NCHW."; now described as (N x C x H x W) for the image case and (N x C x D1 x D2 … Dn) for the non-image case.

  • Y (output): previously "The output 4-dimensional tensor of the same shape as X."; now "The output tensor of the same shape as X."

  • The summary, the remaining attributes, the other inputs and outputs, and the type constraints are unchanged.
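The attribute change between versions 1 and 6 (removal of consumed_inputs) can be illustrated on a plain dictionary of attributes. This is a hypothetical helper operating on a dict, not an ONNX API, and the consumed_inputs value shown is made up for the example.

```python
def upgrade_attributes_v1_to_v6(attributes):
    """Return a copy of a version-1 attribute dict without consumed_inputs."""
    upgraded = dict(attributes)
    upgraded.pop("consumed_inputs", None)
    return upgraded


# Hypothetical version-1 attributes; the consumed_inputs value is illustrative.
v1_attributes = {"consumed_inputs": [0, 0, 0, 1, 1], "epsilon": 1e-5, "is_test": 1}
v6_attributes = upgrade_attributes_v1_to_v6(v1_attributes)
```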

BatchNormalization - 1#

Version

  • name: BatchNormalization (GitHub)

  • domain: main

  • since_version: 1

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: False

This version of the operator has been available since version 1.

Summary

Carries out batch normalization as described in the paper https://arxiv.org/abs/1502.03167. Depending on the mode in which it is run, the operator produces a different number of outputs, as listed below:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)

Output case #2: Y (test mode)

Attributes

  • consumed_inputs (required): legacy optimization attribute.

  • epsilon: The epsilon value to use to avoid division by zero. Default value is 9.999999747378752e-06 (1e-5f).

  • is_test: If set to nonzero, run spatial batch normalization in test mode. Default value is 0.

  • momentum: Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum). Default value is 0.8999999761581421 (0.9f).

  • spatial: If true, compute the mean and variance across all spatial elements. If false, compute the mean and variance per feature. Default value is 1.

Inputs

  • X (heterogeneous) - T: The input 4-dimensional tensor of shape NCHW.

  • scale (heterogeneous) - T: The scale as a 1-dimensional tensor of size C to be applied to the output.

  • B (heterogeneous) - T: The bias as a 1-dimensional tensor of size C to be applied to the output.

  • mean (heterogeneous) - T: The running mean (training) or the estimated mean (testing) as a 1-dimensional tensor of size C.

  • var (heterogeneous) - T: The running variance (training) or the estimated variance (testing) as a 1-dimensional tensor of size C.

Outputs

Between 1 and 5 outputs.

  • Y (heterogeneous) - T: The output 4-dimensional tensor of the same shape as X.

  • mean (optional, heterogeneous) - T: The running mean after the BatchNormalization operator. Must be in-place with the input mean. Should not be used for testing.

  • var (optional, heterogeneous) - T: The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.

  • saved_mean (optional, heterogeneous) - T: Saved mean used during training to speed up gradient computation. Should not be used for testing.

  • saved_var (optional, heterogeneous) - T: Saved variance used during training to speed up gradient computation. Should not be used for testing.

Type Constraints

  • T in ( tensor(double), tensor(float), tensor(float16) ): Constrain input and output types to float tensors.