The basic function underlying a normal distribution, aka a Gaussian, is $e^{-x^2}$. But you might wonder, “Why this function?” Of all the expressions we could dream up that give a smooth, symmetric graph with mass concentrated towards the middle, why does the theory of probability seem to have a special place in its heart for this particular expression? For the last several videos I’ve been hinting at an answer to this question, and here we’ll finally arrive at something like a satisfying answer.

As a quick refresher on where we are: a couple of videos ago, we talked about the Central Limit Theorem, which describes how, when you add many copies of a random variable (for example, rolling a weighted die many different times, or letting a ball bounce off a peg repeatedly), the distribution describing that sum looks approximately like a normal distribution. What the Central Limit Theorem says is that as you make that sum bigger and bigger, then under appropriate conditions, that approximation to a normal becomes better and better. But I never explained why this is actually true; we only talked about what it’s claiming.
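To pin down that claim in symbols (a standard statement of the theorem, using notation that won’t be needed again here): if $X_1, X_2, \ldots$ are independent copies of a random variable with mean $\mu$ and standard deviation $\sigma$, then the rescaled sum

$$\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma \sqrt{n}}$$

tends, as $n \to \infty$, towards a standard normal distribution.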

In the last video, we started talking about the math involved in adding two random variables. If you have two random variables, each following some distribution, then to find the distribution describing the sum of those variables, you compute something known as a convolution of the two original functions. And we spent a lot of time building up two distinct ways to visualize what this convolution operation really is.

Today, our basic job is to work through an example, which is to ask: what happens when you add two normally distributed random variables? As you know by now, this is the same as asking what you get if you compute a convolution between two Gaussian functions. I’d like to share an especially pleasing visual way that you can think about this calculation, which hopefully offers some sense of what makes the $e^{-x^2}$ function special in the first place. After we walk through it, we’ll talk about how this calculation is one of the steps involved in proving the Central Limit Theorem; it’s the step that answers the question of why a Gaussian and not something else is the central limit. But first, let’s dive in.

The full formula for a Gaussian is more complicated than just $e^{-x^2}$. The exponent is typically written as $-\frac{1}{2}\frac{x^2}{\sigma^2}$, where $\sigma$ describes the spread of the distribution, specifically the standard deviation. All of this needs to be multiplied by a fraction out front, which is there to make sure that the area under the curve is one, making it a valid probability distribution. And if you want to consider distributions that aren’t necessarily centred at zero, you would also throw another parameter, $\mu$, into the exponent, writing it as $-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2$. Although for everything we’ll be doing here, we just consider centred distributions.
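Putting those pieces together, the full density is the standard formula

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2},$$

where the fraction out front is exactly what makes the total area equal to one.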

Now, if you look at our central goal for today, which is to compute a convolution between two Gaussian functions, the direct way to do this would be to take the definition of a convolution, the integral expression we built up in the last video, and then to plug in the formula for a Gaussian for each of the functions involved. It’s kind of a lot of symbols when you throw it all together, but more than anything, working it out is an exercise in completing the square, and there’s nothing wrong with that. That will get you the answer that you want.

But of course, you know me: I’m a sucker for visual intuition, and in this case there’s another way to think about it that I haven’t seen written about before, one that offers a very nice connection to other aspects of this distribution, like the presence of $\pi$ and certain ways to derive where it comes from. The way I’d like to do this is by first peeling away all of the constants associated with the actual distribution and just showing the computation for the simplified form, $e^{-x^2}$. The essence of what we want to compute is what the convolution between two copies of this function looks like. If you remember, in the last video we had two different ways to visualize convolutions, and the one we’ll be using here is the second one, involving diagonal slices.
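For the record, here is a sketch of that direct route (not the path the rest of this video takes). Plugging $f(x) = g(x) = e^{-x^2}$ into the convolution integral and completing the square in the exponent gives

$$(f * g)(s) = \int_{-\infty}^{\infty} e^{-x^2}\, e^{-(s - x)^2}\, dx = e^{-s^2/2} \int_{-\infty}^{\infty} e^{-2\left(x - \frac{s}{2}\right)^2}\, dx,$$

since $-x^2 - (s - x)^2 = -2\left(x - \frac{s}{2}\right)^2 - \frac{s^2}{2}$. The remaining integral is just a constant, independent of $s$, which already hints at the shape of the final answer.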

And as a quick reminder of the way that worked: if you have two different distributions that are described by two different functions, $f$ and $g$, then every possible pair of values that you might get when you sample from these two distributions can be thought of as an individual point on the xy-plane. And the probability density of landing on one such point, assuming independence, looks like $f(x) \cdot g(y)$.

So what we do is look at a graph of that expression as a two-variable function of $x$ and $y$, which is a way of showing the distribution of all possible outcomes when we sample from the two different variables. To interpret the convolution of $f$ and $g$ evaluated on some input $s$, which is a way of saying how likely you are to get a pair of samples that adds up to this sum $s$, what you do is look at a slice of this graph over the line $x + y = s$ and consider the area under that slice. This area is almost, but not quite, the value of the convolution at $s$. For a mildly technical reason, you need to divide by the square root of two.
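Where that square root of two comes from (a quick derivation, in case you want the technical reason): parametrize the line $x + y = s$ by $x = t$, $y = s - t$. A step of size $dt$ in the parameter moves you a distance $\sqrt{2}\, dt$ along the line, so the area under the slice is

$$\int_{-\infty}^{\infty} f(t)\, g(s - t)\, \sqrt{2}\, dt = \sqrt{2}\, (f * g)(s),$$

which is the convolution integral scaled up by $\sqrt{2}$. Dividing the area by $\sqrt{2}$ recovers $(f * g)(s)$.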

Still, this area is the key feature to focus on. You can think of it as a way to combine together the probability densities of all the outcomes corresponding to a given sum. In the specific case where these two functions look like $e^{-x^2}$ and $e^{-y^2}$, the resulting 3D graph has a really nice property that you can exploit: it’s rotationally symmetric. You can see this by combining the terms and noticing that it’s entirely a function of $x^2 + y^2$. And this term describes the square of the distance between any point on the xy-plane and the origin.
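In symbols, writing $r$ for the distance from a point $(x, y)$ to the origin:

$$e^{-x^2} \cdot e^{-y^2} = e^{-(x^2 + y^2)} = e^{-r^2},$$

so the height of the graph depends only on how far you are from the origin, not on the direction.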

So in other words, the expression is purely a function of the distance from the origin. And by the way, this would not be true for any other distribution; it’s a property that uniquely characterizes bell curves. For most other pairs of functions, these diagonal slices will be some complicated shape that’s hard to think about, and honestly, calculating the area would just amount to computing the original integral that defines a convolution in the first place.
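To sketch why this property singles out bell curves (a compressed argument, assuming everything is smooth and positive, and not spelled out in the video): if $f(x) f(y)$ depends only on $x^2 + y^2$, write $\ell(x) = \ln f(x)$, so that $\ell(x) + \ell(y) = L(x^2 + y^2)$ for some function $L$. Substituting $u = x^2$ and $v = y^2$ turns this into the additivity condition $L(u) + L(v) = L(u + v) + \text{const}$, whose smooth solutions are linear, forcing

$$\ell(x) = a + b x^2 \quad\Longrightarrow\quad f(x) = C\, e^{b x^2},$$

a bell curve whenever $b < 0$. This is essentially the Herschel-Maxwell characterization mentioned at the end.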

So, in most cases, the visual intuition doesn’t really buy you anything. But in the case of bell curves, you can leverage that rotational symmetry. Here, focus on one of these slices over the line $x + y = s$ for some value of $s$. And remember, the convolution that we’re trying to compute is a function of $s$; the thing that you want is an expression in terms of $s$ that tells you the area under this slice.

Well, if you look at that line, it intersects the x-axis at $(s, 0)$ and the y-axis at $(0, s)$, and a little bit of Pythagoras will show you that the straight-line distance from the origin to this line is $s$ divided by the square root of two. Now, because of the symmetry, this slice is identical to the one you get by rotating 45°, where you’d find a slice parallel to the y-axis, the same distance away from the origin.
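Spelling out that bit of Pythagoras: by symmetry, the point on the line closest to the origin is the midpoint $\left(\frac{s}{2}, \frac{s}{2}\right)$, so the distance is

$$\sqrt{\left(\frac{s}{2}\right)^2 + \left(\frac{s}{2}\right)^2} = \sqrt{\frac{s^2}{2}} = \frac{s}{\sqrt{2}}.$$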

The key is that computing the area of this other slice, the one parallel to the y-axis, is much, much easier than for slices in other directions, because it only involves taking an integral with respect to $y$. The value of $x$ on the slice is a constant; specifically, it would be the constant $s$ divided by the square root of two. So when you’re computing the integral to find this area, the entire term involving $x$ behaves like it was just some number, and you can factor it out of the integral.
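Concretely, on this rotated slice the height of the graph is

$$e^{-\left(\left(\frac{s}{\sqrt{2}}\right)^2 + y^2\right)} = e^{-s^2/2} \cdot e^{-y^2},$$

so the area under the slice is $e^{-s^2/2} \int_{-\infty}^{\infty} e^{-y^2}\, dy$.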

The important point to note is that all of the stuff involving $s$ is now entirely separate from the integrated variable. This remaining integral is a little bit tricky, and it’s actually quite famous, but for our purposes it’s essentially just some number; it happens to be the square root of $\pi$. What really matters is that it has no dependence on $s$. We were looking for an expression for the area of these slices as a function of $s$, and now we have it: it looks like $e^{-s^2/2}$ scaled by some constant. This is a bell curve, another Gaussian, just stretched out a little bit because of the two in the exponent. Technically, the convolution evaluated at $s$ is not quite this area; it’s this area divided by the square root of two. But that factor of $\sqrt{2}$ just gets absorbed into the constant, and the conclusion is that the convolution of two Gaussians is itself another Gaussian.

This result is quite special; almost always, convolving two functions gives you a completely different kind of function. The Central Limit Theorem states that if you repeatedly add copies of a random variable to itself, the tendency is always to approach a normal distribution, and this computation is the reason that the function at the heart of the Central Limit Theorem is a Gaussian in the first place. Going beyond the scope of this video, a full proof involves a tool like Fourier transforms or moment-generating functions: it involves showing that the transform of this universal shape is itself universal, and then showing that the transform of the initial distribution is stable, in the sense that it tends towards the transform of that universal shape. The fact established here, that a Gaussian is a fixed point of this adding process within the family of distributions, is what makes it the natural candidate for the universal shape.

The geometric argument above also connects to the Herschel-Maxwell derivation of a Gaussian, which takes the rotational symmetry of this graph as the defining feature of the distribution, and it explains why $\pi$ appears in the formula. There are other routes to the same conclusion too, for instance one based on entropy.
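As a quick numerical sanity check (my own sketch, not something from the video), you can convolve two sampled copies of $e^{-x^2}$ and compare against the closed form $\sqrt{\pi/2}\, e^{-s^2/2}$ derived above:

```python
import numpy as np

# Sample f(x) = exp(-x^2) on a symmetric grid. An odd point count keeps the
# grid centred at 0, which lines up np.convolve's "same" output with x itself.
n = 2001
x = np.linspace(-10.0, 10.0, n)
dx = x[1] - x[0]
f = np.exp(-x**2)

# Riemann-sum approximation of the convolution integral (f * f)(s).
conv = np.convolve(f, f, mode="same") * dx

# Closed form from the slicing argument: sqrt(pi/2) * exp(-s^2 / 2).
predicted = np.sqrt(np.pi / 2) * np.exp(-x**2 / 2)

print(np.max(np.abs(conv - predicted)))  # prints a tiny number; the curves match
```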