Xavier Initialization

From the Caffe (Berkeley Vision) tutorial linked above, you may have seen that a convolutional layer can be defined as:

layer {
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler { # WEIGHT INITIALIZATION
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
  bottom: "data"
  top: "conv1"
}

where the weight filler uses the "xavier" algorithm, which automatically determines the scale of initialization based on the number of input and output neurons.

What is Xavier initialization and why is it important?

Imagine that your weights are initially very close to 0. The signal then shrinks as it passes through each layer until it becomes too tiny to be useful. Conversely, if your weights are too big, the signal grows at each layer it passes through until it is too massive to be useful.

By using Xavier initialization, we make sure that the weights are neither too small nor too big, so that the signal propagates accurately through the network. From my tests, it turns out that initialization is surprisingly important: a marked difference can appear with only 3-4 layers in the network.
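
To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the original experiments): it pushes the same input through a stack of purely linear layers and prints the standard deviation of the signal for a too-small, a too-large, and a Xavier-like weight scale. The layer width, depth and scales are arbitrary demo choices.

import numpy as np

rng = np.random.default_rng(0)
n = 512        # layer width (arbitrary choice for this demo)
depth = 10     # number of stacked linear layers
x0 = rng.standard_normal(n)

def signal_stds(weight_std):
    # Propagate x0 through `depth` linear layers whose weights have the given std,
    # recording the std of the signal after each layer.
    x = x0.copy()
    stds = []
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std
        x = W @ x
        stds.append(float(x.std()))
    return stds

for name, std in [("too small", 0.01),
                  ("too large", 0.2),
                  ("xavier-like", np.sqrt(1.0 / n))]:  # Var(w) = 1/n preserves the variance
    print(name, ["%.3g" % s for s in signal_stds(std)])

With the small scale the signal vanishes within a few layers, with the large scale it explodes, and with the Xavier-like scale its standard deviation stays close to 1.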

Glorot and Bengio (2010) suggested initializing the weights from a distribution with zero mean and variance

$$\mathrm{Var}(W) = \frac{2}{n_\text{in} + n_\text{out}}$$

where $n_\text{in}$ and $n_\text{out}$ are respectively the number of inputs and outputs of your layer.
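
As an illustration only (not part of the original post), here is one way to draw weights with that variance in NumPy, either from a normal or from a uniform distribution; a uniform bound of $\sqrt{6/(n_\text{in}+n_\text{out})}$ gives exactly the variance above. The dense-layer sizes are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    # Zero-mean normal weights with Var(W) = 2 / (n_in + n_out).
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out):
    # Uniform on [-b, b] with b = sqrt(6 / (n_in + n_out)) has the same variance b^2 / 3.
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_out, n_in))

# Hypothetical dense layer: 784 inputs, 256 outputs.
W = xavier_normal(784, 256)
print(W.var(), 2.0 / (784 + 256))   # empirical variance vs. target

# For the conv1 layer above (20 filters of size 5x5, assuming a single input channel),
# the usual convention is fan_in = channels * k * k and fan_out = num_output * k * k.
print(2.0 / (1 * 5 * 5 + 20 * 5 * 5))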

Proof

Suppose we have an input $x$ with $n$ components and a fully connected layer (also called linear or dense) with random weights $w$ that outputs a number $y$ such that

$$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

To make sure that the signal stays in a reasonable range as it goes through the layer, we want the output to keep the same variance as the inputs:

$$\mathrm{Var}(y) = \mathrm{Var}(x_i)$$

We also know how to compute the variance of the product of two independent random variables. Therefore

$$\mathrm{Var}(w_i x_i) = \mathbb{E}[x_i]^2\,\mathrm{Var}(w_i) + \mathbb{E}[w_i]^2\,\mathrm{Var}(x_i) + \mathrm{Var}(w_i)\,\mathrm{Var}(x_i)$$

Both our inputs and weights have a mean of 0, so this simplifies to

$$\mathrm{Var}(w_i x_i) = \mathrm{Var}(w_i)\,\mathrm{Var}(x_i)$$

Now we make the further assumption that the $x_i$ and the $w_i$ are all independent and identically distributed (iid), so the variance of the sum is the sum of the variances:

$$\mathrm{Var}(y) = \mathrm{Var}(w_1 x_1 + w_2 x_2 + \dots + w_n x_n) = n\,\mathrm{Var}(w_i)\,\mathrm{Var}(x_i)$$

It turns out that, if we want $\mathrm{Var}(y) = \mathrm{Var}(x_i)$, we must enforce the condition $n\,\mathrm{Var}(w_i) = 1$, that is $\mathrm{Var}(w_i) = \frac{1}{n} = \frac{1}{n_\text{in}}$.
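
A quick Monte-Carlo check (my own sketch) of this step: with zero-mean iid inputs and weights, $\mathrm{Var}(y) = n\,\mathrm{Var}(w_i)\,\mathrm{Var}(x_i)$, so choosing $\mathrm{Var}(w_i) = 1/n$ does give $\mathrm{Var}(y) \approx \mathrm{Var}(x_i)$. The sizes below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 200_000
var_x = 1.0
var_w = 1.0 / n                     # the condition n * Var(w) = 1

x = rng.normal(0.0, np.sqrt(var_x), size=(trials, n))
w = rng.normal(0.0, np.sqrt(var_w), size=(trials, n))
y = (w * x).sum(axis=1)             # y = w_1 x_1 + ... + w_n x_n, one sample per row

print(y.var())                      # ~= 1.0, i.e. Var(y) ~= Var(x_i)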

Still following Glorot and Bengio (2010), if we go through the same steps for the backpropagated signal, we find that we need

$$\mathrm{Var}(w_i) = \frac{1}{n_\text{out}}$$

to keep the variance of the gradients constant from layer to layer.

Those two constraints can only be satisfied simultaneously if $n_\text{in} = n_\text{out}$, which is too restrictive in practice. As a compromise, they roughly take the average of the two constraints:

$$\mathrm{Var}(w_i) = \frac{2}{n_\text{in} + n_\text{out}}$$

which is the formula given at the top of this post.
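
For a concrete feeling of the compromise, with hypothetical layer sizes $n_\text{in} = 784$ and $n_\text{out} = 256$:

n_in, n_out = 784, 256
print(1.0 / n_in)               # forward constraint:  Var(w) = 1/n_in  ~= 0.00128
print(1.0 / n_out)              # backward constraint: Var(w) = 1/n_out ~= 0.00391
print(2.0 / (n_in + n_out))     # Glorot compromise:   2/(n_in+n_out)  ~= 0.00192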

Written on March 21, 2016