BatchNormalization - 9 vs 15

The next section compares an older version of the operator to a newer one after both definitions are converted into markdown text. Green means an addition in the newer version, red means a deletion. Anything else is unchanged.

BatchNormalization9 → BatchNormalization15 RENAMED
@@ -1 +1 @@
1
1
  Carries out batch normalization as described in the paper
2
2
  https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
3
- There are five required inputs 'X', 'scale', 'B', 'input_mean' and
4
- 'input_var'.
5
- Note that 'input_mean' and 'input_var' are expected to be the estimated
6
- statistics in inference mode (training_mode=False, default),
7
- and the running statistics in training mode (training_mode=True).
8
- There are multiple cases for the number of outputs, which we list below:
3
+ there are multiple cases for the number of outputs, which we list below:
4
+ Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
5
+ Output case #2: Y (test mode)
9
- Output case #1: Y, running_mean, running_var (training_mode=True)
10
- Output case #2: Y (training_mode=False)
11
-
12
- When training_mode=False, extra outputs are invalid.
13
- The outputs are updated as follows when training_mode=True:
14
- ::
15
-
16
- running_mean = input_mean * momentum + current_mean * (1 - momentum)
17
- running_var = input_var * momentum + current_var * (1 - momentum)
18
-
19
- Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
20
-
21
- where:
22
-
23
- current_mean = ReduceMean(X, axis=all_except_channel_index)
24
- current_var = ReduceVar(X, axis=all_except_channel_index)
25
-
26
- Notice that ReduceVar refers to the population variance, and it equals to
27
- sum(sqrd(x_i - x_avg)) / N
28
- where N is the population size (this formula does not use sample size N - 1).
29
-
30
- The computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs.
31
-
32
- When training_mode=False:
33
- ::
34
-
35
- Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
36
6
  For previous (deprecated) non-spatial cases, implementors are suggested
37
- to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
7
+ to flatten the input shape to (N x C*D1*D2 ..*Dn) before a BatchNormalization Op.
38
8
  This operator has **optional** inputs/outputs. See ONNX <https://github.com/onnx/onnx/blob/master/docs/IR.md>_ for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
39
9
  **Attributes**
40
10
  * **epsilon**:
41
11
  The epsilon value to use to avoid division by zero.
42
12
  * **momentum**:
43
13
  Factor used in computing the running mean and variance, e.g.,
44
14
  running_mean = running_mean * momentum + mean * (1 - momentum).
45
- * **training_mode**:
46
- If set to true, it indicates BatchNormalization is being used for
47
- training, and outputs 1, 2, 3, and 4 would be populated.
48
15
  **Inputs**
49
16
  * **X** (heterogeneous) - **T**:
50
17
  Input data tensor from the previous operator; dimensions are in the
51
18
  form of (N x C x D1 x D2 ... Dn), where N is the batch size, C is
52
19
  the number of channels. Statistics are computed for every channel of
53
20
  C over N and D1 to Dn dimensions. For image data, input dimensions
54
21
  become (N x C x H x W). The op also accepts single dimension input
55
22
  of size N in which case C is assumed to be 1
56
- * **scale** (heterogeneous) - **T1**:
23
+ * **scale** (heterogeneous) - **T**:
57
24
  Scale tensor of shape (C).
58
- * **B** (heterogeneous) - **T1**:
25
+ * **B** (heterogeneous) - **T**:
59
26
  Bias tensor of shape (C).
60
- * **input_mean** (heterogeneous) - **T2**:
27
+ * **mean** (heterogeneous) - **T**:
61
28
  running (training) or estimated (testing) mean tensor of shape (C).
62
- * **input_var** (heterogeneous) - **T2**:
29
+ * **var** (heterogeneous) - **T**:
63
30
  running (training) or estimated (testing) variance tensor of shape
64
31
  (C).
65
32
  **Outputs**
66
- Between 1 and 3 outputs.
33
+ Between 1 and 5 outputs.
67
34
  * **Y** (heterogeneous) - **T**:
68
35
  The output tensor of the same shape as X
69
- * **running_mean** (optional, heterogeneous) - **T2**:
36
+ * **mean** (optional, heterogeneous) - **T**:
70
37
  The running mean after the BatchNormalization operator.
71
- * **running_var** (optional, heterogeneous) - **T2**:
38
+ * **var** (optional, heterogeneous) - **T**:
72
- The running variance after the BatchNormalization operator. This op
39
+ The running variance after the BatchNormalization operator.
40
+ * **saved_mean** (optional, heterogeneous) - **T**:
73
- uses the population size (N) for calculating variance, and not the
41
+ Saved mean used during training to speed up gradient computation.
42
+ * **saved_var** (optional, heterogeneous) - **T**:
43
+ Saved variance used during training to speed up gradient
74
- sample size N-1.
44
+ computation.
75
45
  **Type Constraints**
76
46
  * **T** in (
77
- tensor(bfloat16),
78
47
  tensor(double),
79
48
  tensor(float),
80
49
  tensor(float16)
81
50
  ):
82
- Constrain input and output types to float tensors.
51
+ Constrain input and output types to float tensors.
- * **T1** in (
83
- tensor(bfloat16),
84
- tensor(double),
85
- tensor(float),
86
- tensor(float16)
87
- ):
88
- Constrain scale and bias types to float tensors.
89
- * **T2** in (
90
- tensor(bfloat16),
91
- tensor(double),
92
- tensor(float),
93
- tensor(float16)
94
- ):
95
- Constrain mean and variance types to float tensors.
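
The formulas in the `training_mode` variant of the text above can be sketched in NumPy as follows. This is a minimal illustration, not the ONNX reference implementation: the helper name `batch_norm`, its keyword defaults, and the assumption that `X` has at least two dimensions with channels on axis 1 are all choices made for this example.

```python
import numpy as np


def batch_norm(x, scale, bias, input_mean, input_var,
               epsilon=1e-5, momentum=0.9, training_mode=False):
    """Hypothetical helper that follows the formulas quoted in the diff.

    Assumes x has shape (N, C, D1, ..., Dn) with at least two dimensions.
    Returns Y alone when training_mode is False, and
    (Y, running_mean, running_var) when training_mode is True.
    """
    # Shape that broadcasts a per-channel vector of shape (C,) against x.
    bshape = (1, -1) + (1,) * (x.ndim - 2)

    if not training_mode:
        # Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
        y = (x - input_mean.reshape(bshape)) / np.sqrt(
            input_var.reshape(bshape) + epsilon)
        return y * scale.reshape(bshape) + bias.reshape(bshape)

    # Reduce over every axis except the channel axis.
    axes = tuple(i for i in range(x.ndim) if i != 1)
    # Accumulate in float32 for float16 inputs, as the text suggests.
    xf = x.astype(np.float32) if x.dtype == np.float16 else x
    current_mean = xf.mean(axis=axes)
    # np.var defaults to ddof=0, i.e. the population variance (divide by N).
    current_var = xf.var(axis=axes)

    running_mean = input_mean * momentum + current_mean * (1 - momentum)
    running_var = input_var * momentum + current_var * (1 - momentum)

    y = (xf - current_mean.reshape(bshape)) / np.sqrt(
        current_var.reshape(bshape) + epsilon)
    y = y * scale.reshape(bshape) + bias.reshape(bshape)
    return y.astype(x.dtype), running_mean, running_var


# Example inputs: batch of 2, 3 channels, 2x2 spatial dimensions.
x = np.arange(24, dtype=np.float32).reshape(2, 3, 2, 2)
scale = np.ones(3, dtype=np.float32)
bias = np.zeros(3, dtype=np.float32)
mean0 = np.zeros(3, dtype=np.float32)
var0 = np.ones(3, dtype=np.float32)

y_train, running_mean, running_var = batch_norm(
    x, scale, bias, mean0, var0, training_mode=True)
y_infer = batch_norm(x, scale, bias, mean0, var0)
```

In training mode the normalization uses the statistics of the current batch, while `input_mean` and `input_var` only feed the running averages. In inference mode the supplied statistics are used directly and no extra outputs are produced.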