BatchNormalization - 6 vs 15#

Next section compares an older to a newer version of the same operator after both definition are converted into markdown text. Green means an addition to the newer version, red means a deletion. Anything else is unchanged.

Files changed (1) hide show

BatchNormalization6 → BatchNormalization15 +46 -80

BatchNormalization6 → BatchNormalization15 RENAMED Viewed

@@ -1 +1 @@
  Carries out batch normalization as described in the paper
  https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
- There are five required inputs 'X', 'scale', 'B', 'input_mean' and
- 'input_var'.
- Note that 'input_mean' and 'input_var' are expected to be the estimated
- statistics in inference mode (training_mode=False, default),
- and the running statistics in training mode (training_mode=True).
- There are multiple cases for the number of outputs, which we list below:
+ there are multiple cases for the number of outputs, which we list below:
+ Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
+ Output case #2: Y (test mode)
- Output case #1: Y, running_mean, running_var (training_mode=True)
- Output case #2: Y (training_mode=False)
- When training_mode=False, extra outputs are invalid.
- The outputs are updated as follows when training_mode=True:
- ::
-     running_mean = input_mean * momentum + current_mean * (1 - momentum)
-     running_var = input_var * momentum + current_var * (1 - momentum)
-     Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
-     where:
-     current_mean = ReduceMean(X, axis=all_except_channel_index)
-     current_var =  ReduceVar(X, axis=all_except_channel_index)
-     Notice that ReduceVar refers to the population variance, and it equals to
-     sum(sqrd(x_i - x_avg)) / N
-     where N is the population size (this formula does not use sample size N - 1).
- The computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs.
- When training_mode=False:
- ::
-     Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
- For previous (depreciated) non-spatial cases, implementors are suggested
- to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
- This operator has **optional** inputs/outputs. See ONNX <https://github.com/onnx/onnx/blob/master/docs/IR.md>_ for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
  **Attributes**
  * **epsilon**:
-   The epsilon value to use to avoid division by zero.
+   The epsilon value to use to avoid division by zero, default is
+   1e-5f.
+ * **is_test**:
+   If set to nonzero, run spatial batch normalization in test mode,
+   default is 0.
  * **momentum**:
    Factor used in computing the running mean and variance.e.g.,
-   running_mean = running_mean * momentum + mean * (1 - momentum).
+   running_mean = running_mean * momentum + mean * (1 - momentum),
+   default is 0.9f.
- * **training_mode**:
+ * **spatial**:
-   If set to true, it indicates BatchNormalization is being used for
+   If true, compute the mean and variance across all spatial elements
-   training, and outputs 1, 2, 3, and 4 would be populated.
+   If false, compute the mean and variance across per feature.Default
+   is 1.
  **Inputs**
  * **X** (heterogeneous) - **T**:
-   Input data tensor from the previous operator; dimensions are in the
+   Input data tensor from the previous operator; dimensions for image
+   case are (N x C x H x W), where N is the batch size, C is the number
+   of channels, and H and W are the height and the width of the data.
+   For non image case, the dimensions are in the form of (N x C x D1 x
+   D2 ... Dn), where N is the batch size.
-   form of (N x C x D1 x D2 ... Dn), where N is the batch size, C is
-   the number of channels. Statistics are computed for every channel of
-   C over N and D1 to Dn dimensions. For image data, input dimensions
-   become (N x C x H x W). The op also accepts single dimension input
-   of size N in which case C is assumed to be 1
- * **scale** (heterogeneous) - **T1**:
+ * **scale** (heterogeneous) - **T**:
-   Scale tensor of shape (C).
+   The scale as a 1-dimensional tensor of size C to be applied to the
+   output.
- * **B** (heterogeneous) - **T1**:
+ * **B** (heterogeneous) - **T**:
-   Bias tensor of shape (C).
+   The bias as a 1-dimensional tensor of size C to be applied to the
+   output.
- * **input_mean** (heterogeneous) - **T2**:
+ * **mean** (heterogeneous) - **T**:
-   running (training) or estimated (testing) mean tensor of shape (C).
+   The running mean (training) or the estimated mean (testing) as a
+   1-dimensional tensor of size C.
- * **input_var** (heterogeneous) - **T2**:
+ * **var** (heterogeneous) - **T**:
-   running (training) or estimated (testing) variance tensor of shape
+   The running variance (training) or the estimated variance (testing)
-   (C).
+   as a 1-dimensional tensor of size C.
  **Outputs**
- Between 1 and 3 outputs.
+ Between 1 and 5 outputs.
  * **Y** (heterogeneous) - **T**:
-   The output tensor of the same shape as X
+   The output tensor of the same shape as X.
- * **running_mean** (optional, heterogeneous) - **T2**:
+ * **mean** (optional, heterogeneous) - **T**:
-   The running mean after the BatchNormalization operator.
+   The running mean after the BatchNormalization operator. Must be in-
+   place with the input mean. Should not be used for testing.
- * **running_var** (optional, heterogeneous) - **T2**:
+ * **var** (optional, heterogeneous) - **T**:
-   The running variance after the BatchNormalization operator. This op
+   The running variance after the BatchNormalization operator. Must be
-   uses the population size (N) for calculating variance, and not the
+   in-place with the input var. Should not be used for testing.
+ * **saved_mean** (optional, heterogeneous) - **T**:
+   Saved mean used during training to speed up gradient computation.
-   sample size N-1.
+   Should not be used for testing.
+ * **saved_var** (optional, heterogeneous) - **T**:
+   Saved variance used during training to speed up gradient
+   computation. Should not be used for testing.
  **Type Constraints**
  * **T** in (
-   tensor(bfloat16),
    tensor(double),
    tensor(float),
    tensor(float16)
    ):
-   Constrain input and output types to float tensors.
+   Constrain input and output types to float tensors.- * **T1** in (
-   tensor(bfloat16),
-   tensor(double),
-   tensor(float),
-   tensor(float16)
-   ):
-   Constrain scale and bias types to float tensors.
- * **T2** in (
-   tensor(bfloat16),
-   tensor(double),
-   tensor(float),
-   tensor(float16)
-   ):
-   Constrain mean and variance types to float tensors.