BatchNorm2d: How to use the BatchNorm2d Module in PyTorch

Video Transcript

Batch normalization is a technique that can speed up the training of a neural network.

It does so by reducing internal covariate shift, which is essentially the phenomenon of each layer’s input distribution changing as the parameters of the layers before it change during training.

More concretely, in the displayed network

import torch.nn as nn

class Convolutional(nn.Module):
    def __init__(self, input_channels=3, num_classes=10):
        super().__init__()
        # Block 1: conv -> batch norm -> ReLU; height and width are unchanged.
        self.layer1 = nn.Sequential()
        self.layer1.add_module("Conv1", nn.Conv2d(in_channels=input_channels, out_channels=16, kernel_size=3, padding=1))
        self.layer1.add_module("BN1", nn.BatchNorm2d(num_features=16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        self.layer1.add_module("Relu1", nn.ReLU(inplace=False))
        # Block 2: stride=2 halves the height and width.
        self.layer2 = nn.Sequential()
        self.layer2.add_module("Conv2", nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1, stride=2))
        self.layer2.add_module("BN2", nn.BatchNorm2d(num_features=32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        self.layer2.add_module("Relu2", nn.ReLU(inplace=False))
        # 32 * 16 * 16 implies 32x32 inputs: layer2 then outputs 32 channels of 16x16.
        self.fully_connected = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        # Flatten each sample into a vector for the linear layer.
        x = x.view(-1, 32 * 16 * 16)
        x = self.fully_connected(x)
        return x

The second layer

x = self.layer2(x)

has an expected distribution of inputs coming from the first layer

x = self.layer1(x)

and its parameters are optimized for this expected distribution.

As the parameters in the first layer are updated

x = self.layer1(x)

this expected distribution becomes less like the true distribution of outputs that layer1 actually passes on.

This is a problem because it can force some of layer2’s activations to saturate, which significantly slows down training.

Batch normalization

self.layer1.add_module("BN1", nn.BatchNorm2d(num_features=16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))

grants us the freedom to use larger learning rates while not worrying as much about internal covariate shift.

This, in turn, means that our network can be trained slightly faster.

In the displayed network, batch normalization is applied to both the first

self.layer1.add_module("BN1", nn.BatchNorm2d(num_features=16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))

and the second layer

self.layer2.add_module("BN2", nn.BatchNorm2d(num_features=32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))

with every argument other than num_features left at its default value.
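As a quick check on those arguments, here is a minimal sketch that compares a BatchNorm2d built with the explicit arguments from the displayed network against one built with only num_features. It also pushes a batch through two blocks shaped like layer1 and layer2. The 32x32 image size is an assumption inferred from the 32 * 16 * 16 in the network, and the batch of 4 random images is purely illustrative.

```python
import torch
import torch.nn as nn

# Explicit arguments, copied from the displayed network.
explicit = nn.BatchNorm2d(num_features=16, eps=1e-05, momentum=0.1,
                          affine=True, track_running_stats=True)
# Only the required argument; everything else falls back to the defaults.
default = nn.BatchNorm2d(16)

settings = ("num_features", "eps", "momentum", "affine", "track_running_stats")
same_settings = all(getattr(explicit, s) == getattr(default, s) for s in settings)

# Two blocks shaped like layer1 and layer2 (conv -> batch norm -> ReLU).
block1 = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, padding=1),
                       default, nn.ReLU())
block2 = nn.Sequential(nn.Conv2d(16, 32, kernel_size=3, padding=1, stride=2),
                       nn.BatchNorm2d(32), nn.ReLU())

# Hypothetical input: 4 RGB images, assumed to be 32x32.
images = torch.randn(4, 3, 32, 32)
out1 = block1(images)  # channels go 3 -> 16, spatial size stays 32x32
out2 = block2(out1)    # channels go 16 -> 32, stride 2 halves to 16x16

print(same_settings, tuple(out1.shape), tuple(out2.shape))
```

Batch normalization does not change the shape of its input, so the shapes printed here are decided entirely by the Conv2d layers.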

num_features

num_features=32

is a required argument that tells BatchNorm how many features are in the output of the layer before it.

In the case of layer1 and layer2 in the displayed network

out_channels=32

this would be the output channels of their Conv2d layers: 16 for layer1 and 32 for layer2.

Channels are equivalent to features, but "channels" is more commonly used when referring to image data sets because the original image has a certain number of color channels.
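To make the link between num_features and out_channels concrete, here is a small sketch. The batch of 8 random 32x32 images is a made-up input, and the names wrong_bn and mismatch_error are just illustrative. It shows that BatchNorm2d keeps one statistic per channel, and that a num_features that does not match the incoming channel count is rejected.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# num_features is taken straight from the conv layer's out_channels.
bn = nn.BatchNorm2d(num_features=conv.out_channels)

x = torch.randn(8, 3, 32, 32)  # hypothetical batch of 8 RGB images
y = bn(conv(x))

# One running mean (and one running variance) per channel.
stats_shape = tuple(bn.running_mean.shape)

# A BatchNorm2d sized for 32 channels cannot accept a 16-channel input.
wrong_bn = nn.BatchNorm2d(num_features=32)
try:
    wrong_bn(conv(x))
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = err

print(stats_shape, tuple(y.shape), mismatch_error is not None)
```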

eps or epsilon

eps=1e-05

is a value added to the variance in the denominator of the batch normalization calculation.

This is just to improve numerical stability, and it should only be modified with good reason.
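To see where eps enters the calculation, here is a rough sketch that redoes the normalization by hand and compares it with what nn.BatchNorm2d returns in training mode. The input tensor is random, illustrative data. A freshly constructed layer has a scale of 1 and a shift of 0, so the affine step leaves the normalized values unchanged here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16, 4, 4)  # hypothetical activations: (batch, channels, H, W)

bn = nn.BatchNorm2d(16, eps=1e-05)  # a new module starts in training mode
expected = bn(x)

# Per-channel statistics, taken over the batch, height and width dimensions.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)  # biased variance

# eps is added to the variance before the square root, so the division
# stays well behaved even when a channel's variance is close to zero.
manual = (x - mean) / torch.sqrt(var + bn.eps)

matches = torch.allclose(manual, expected, atol=1e-5)
print(matches)
```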

The BatchNorm function will keep a running estimate of its computed mean and variance during training for use during evaluation of the network.

This can be disabled by setting track_running_stats

track_running_stats=True

to False, in which case the batch statistics are calculated and used during evaluation as well.
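The difference shows up once the layer is put in evaluation mode. In this sketch, the input is made-up data deliberately shifted away from zero mean and unit variance, and the names tracked and untracked are only for illustration. A fresh tracked layer normalizes with its running estimates, which still hold their initial values of 0 and 1, while the untracked layer normalizes with the statistics of the batch it is given.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical data, far from mean 0 and variance 1.
x = torch.randn(8, 3, 4, 4) * 5 + 2

tracked = nn.BatchNorm2d(3)  # track_running_stats=True by default
untracked = nn.BatchNorm2d(3, track_running_stats=False)
tracked.eval()
untracked.eval()

with torch.no_grad():
    out_tracked = tracked(x)      # uses running estimates (initially 0 and 1)
    out_untracked = untracked(x)  # uses this batch's own mean and variance

# With tracking disabled, no running buffers exist at all.
no_buffers = untracked.running_mean is None

tracked_mean = out_tracked.mean(dim=(0, 2, 3))
untracked_mean = out_untracked.mean(dim=(0, 2, 3))
print(no_buffers, tracked_mean, untracked_mean)
```

The tracked output is almost identical to the input because nothing has been learned yet, while the untracked output has a per-channel mean close to zero.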

The momentum argument

momentum=0.1

determines the rate at which the running estimates are updated.

If it is set to None, then the running estimates are computed as a simple (cumulative) average over all the batches seen so far.
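Both update rules can be checked on a couple of batches. This is a small sketch with made-up data: batch1 and batch2 are random tensors shifted to have different means. With momentum=0.1, the new running mean is (1 - 0.1) times the old estimate plus 0.1 times the batch mean. With momentum=None, the running mean after two batches is the plain average of the two batch means.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch1 = torch.randn(8, 3, 4, 4) + 1.0  # hypothetical batch, mean near 1
batch2 = torch.randn(8, 3, 4, 4) + 3.0  # hypothetical batch, mean near 3
mean1 = batch1.mean(dim=(0, 2, 3))
mean2 = batch2.mean(dim=(0, 2, 3))

# Exponential moving average: the running mean starts at 0, so after one
# training batch it equals (1 - 0.1) * 0 + 0.1 * mean1.
ema = nn.BatchNorm2d(3, momentum=0.1)
ema(batch1)
ema_ok = torch.allclose(ema.running_mean, 0.1 * mean1, atol=1e-5)

# momentum=None: cumulative average, every batch seen so far counts equally.
cma = nn.BatchNorm2d(3, momentum=None)
cma(batch1)
cma(batch2)
cma_ok = torch.allclose(cma.running_mean, (mean1 + mean2) / 2, atol=1e-5)

print(ema_ok, cma_ok)
```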

Lastly, the affine argument

affine=True

when set to True, indicates that the BatchNorm layer should have learnable affine parameters: a per-channel scale (weight) and shift (bias).

The default value for affine is True.
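Here is a short sketch of what the affine argument changes in practice. The variable names are only illustrative. With affine=True, the layer holds a scale and a shift for each channel, which start at 1 and 0. With affine=False, it has no learnable parameters at all.

```python
import torch.nn as nn

with_affine = nn.BatchNorm2d(16, affine=True)
without_affine = nn.BatchNorm2d(16, affine=False)

# Count the learnable values in each layer.
n_with = sum(p.numel() for p in with_affine.parameters())
n_without = sum(p.numel() for p in without_affine.parameters())

# weight is the per-channel scale and bias is the per-channel shift.
scale_shape = tuple(with_affine.weight.shape)
shift_shape = tuple(with_affine.bias.shape)
no_scale = without_affine.weight is None

print(n_with, n_without, scale_shape, shift_shape, no_scale)
```

For 16 channels that is 16 scales plus 16 shifts, so 32 learnable values with affine=True and none with affine=False.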