BatchNormalization - 9 vs 14

BatchNormalization9 → BatchNormalization14 RENAMED
@@ -1 +1 @@
  Carries out batch normalization as described in the paper
  https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
+ There are five required inputs 'X', 'scale', 'B', 'input_mean' and
+ 'input_var'.
+ Note that 'input_mean' and 'input_var' are expected to be the estimated
+ statistics in inference mode (training_mode=False, default),
+ and the running statistics in training mode (training_mode=True).
- there are multiple cases for the number of outputs, which we list below:
+ There are multiple cases for the number of outputs, which we list below:
- Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
+ Output case #1: Y, running_mean, running_var (training_mode=True)
- Output case #2: Y (test mode)
+ Output case #2: Y (training_mode=False)
+
+ When training_mode=False, extra outputs are invalid.
+ The outputs are updated as follows when training_mode=True:
+ ::
+
+ running_mean = input_mean * momentum + current_mean * (1 - momentum)
+ running_var = input_var * momentum + current_var * (1 - momentum)
+
+ Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
+
+ where:
+
+ current_mean = ReduceMean(X, axis=all_except_channel_index)
+ current_var = ReduceVar(X, axis=all_except_channel_index)
+
+ Notice that ReduceVar refers to the population variance, and it equals
+ sum(sqrd(x_i - x_avg)) / N
+ where N is the population size (this formula does not use sample size N - 1).
+
+ When training_mode=False:
+ ::
+
+ Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
  For previous (deprecated) non-spatial cases, implementors are suggested
- to flatten the input shape to (N x C*D1*D2 ..*Dn) before a BatchNormalization Op.
+ to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
  This operator has **optional** inputs/outputs. See `ONNX <https://github.com/onnx/onnx/blob/master/docs/IR.md>`_ for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
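The update and normalization rules above can be sketched with NumPy. This is an illustrative sketch of the operator's semantics for NCHW-style input, not a normative reference implementation, and the function name `batch_norm_14` is ours:

```python
import numpy as np

def batch_norm_14(X, scale, B, input_mean, input_var,
                  epsilon=1e-5, momentum=0.9, training_mode=False):
    """Sketch of BatchNormalization-14 semantics for (N, C, D1, ..., Dn) input."""
    # Statistics are per-channel: reduce over every axis except C (axis 1).
    axes = tuple(i for i in range(X.ndim) if i != 1)
    shape = [1, X.shape[1]] + [1] * (X.ndim - 2)  # broadcastable (1, C, 1, ...)

    if training_mode:
        current_mean = X.mean(axis=axes)
        current_var = X.var(axis=axes)  # population variance (divides by N)
        Y = (X - current_mean.reshape(shape)) / np.sqrt(
            current_var.reshape(shape) + epsilon)
        # Running statistics blend the incoming estimates with the batch statistics.
        running_mean = input_mean * momentum + current_mean * (1 - momentum)
        running_var = input_var * momentum + current_var * (1 - momentum)
        return (Y * scale.reshape(shape) + B.reshape(shape),
                running_mean, running_var)

    # Inference: normalize with the estimated statistics; only Y is produced.
    Y = (X - input_mean.reshape(shape)) / np.sqrt(
        input_var.reshape(shape) + epsilon)
    return Y * scale.reshape(shape) + B.reshape(shape)
```

With `scale` all ones and `B` all zeros, the training-mode output has per-channel mean approximately zero, and the returned running statistics follow the momentum formulas above.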
  **Attributes**
  * **epsilon**:
  The epsilon value to use to avoid division by zero.
  * **momentum**:
  Factor used in computing the running mean and variance, e.g.,
  running_mean = running_mean * momentum + mean * (1 - momentum).
+ * **training_mode**:
+ If set to true, it indicates BatchNormalization is being used for
+ training, and outputs 1 and 2 (running_mean and running_var) would
+ be populated.
  **Inputs**
  * **X** (heterogeneous) - **T**:
  Input data tensor from the previous operator; dimensions are in the
  form of (N x C x D1 x D2 ... Dn), where N is the batch size, C is
  the number of channels. Statistics are computed for every channel of
  C over N and D1 to Dn dimensions. For image data, input dimensions
  become (N x C x H x W). The op also accepts single dimension input
  of size N in which case C is assumed to be 1.
  * **scale** (heterogeneous) - **T**:
  Scale tensor of shape (C).
  * **B** (heterogeneous) - **T**:
  Bias tensor of shape (C).
- * **mean** (heterogeneous) - **T**:
+ * **input_mean** (heterogeneous) - **U**:
  running (training) or estimated (testing) mean tensor of shape (C).
- * **var** (heterogeneous) - **T**:
+ * **input_var** (heterogeneous) - **U**:
  running (training) or estimated (testing) variance tensor of shape
  (C).
  **Outputs**
- Between 1 and 5 outputs.
+ Between 1 and 3 outputs.
  * **Y** (heterogeneous) - **T**:
  The output tensor of the same shape as X.
- * **mean** (optional, heterogeneous) - **T**:
+ * **running_mean** (optional, heterogeneous) - **U**:
  The running mean after the BatchNormalization operator.
- * **var** (optional, heterogeneous) - **T**:
+ * **running_var** (optional, heterogeneous) - **U**:
- The running variance after the BatchNormalization operator.
+ The running variance after the BatchNormalization operator. This op
+ uses the population size (N) for calculating variance, and not the
+ sample size N-1.
- * **saved_mean** (optional, heterogeneous) - **T**:
- Saved mean used during training to speed up gradient computation.
- * **saved_var** (optional, heterogeneous) - **T**:
- Saved variance used during training to speed up gradient
- computation.
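The population-variance convention noted for running_var can be checked with NumPy: `np.var` defaults to the population formula (divide by N), while `ddof=1` gives the sample formula (divide by N - 1) that this op does not use.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Population variance: sum((x_i - mean)^2) / N -- the convention this op uses.
pop_var = np.var(x)              # ddof=0, divides by N = 4
# Sample variance divides by N - 1 = 3 -- NOT what this op uses.
sample_var = np.var(x, ddof=1)
```

For this example the squared deviations sum to 5.0, so the population variance is 5.0 / 4 = 1.25 while the sample variance is 5.0 / 3.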
  **Type Constraints**
  * **T** in (
+ tensor(bfloat16),
  tensor(double),
  tensor(float),
  tensor(float16)
  ):
+ * **U** in (
+ tensor(bfloat16),
+ tensor(double),
+ tensor(float),
+ tensor(float16)
+ ):
- Constrain input and output types to float tensors.
+ Constrain input and output types to float tensors.
+ Constrain mean and variance types to float tensors. All float types
+ are allowed for U.