Let’s kick things off with a quiz: Suppose we take a normal distribution with the familiar bell curve shape, and we have a random variable X drawn from that distribution. What the curve actually means is that the probability of your sample falling within a given range of values (e.g. the probability that it ends up between -1 and 2) equals the area under the curve over that range.

I’ll also pull up a second random variable, also following a normal distribution, but this one slightly more spread out, with a larger standard deviation. The quiz is this: if you repeatedly sample both of these variables and in each iteration add up the two results, then that sum behaves like its own random variable. What distribution describes that sum?

Maybe you have a guess, but guessing is not enough. The real quiz is to be able to explain why you get the answer that you do. We’ll be covering a special operation called a convolution, and the primary thing we’ll do today is motivate and build up two distinct ways to visualize what a convolution looks like for continuous functions. We’ll also talk about how these two different visualizations can each be helpful in different ways, with a special focus on the Central Limit Theorem.

To warm up, we’ll cover an example in a setting that’s more discrete and finite: rolling a pair of weighted dice. Here, the animation you’re looking at is simulating two weighted dice, repeatedly sampling from each one and recording the sum of the two values at each iteration. Repeating samples like this many times can give you a heuristic sense of the final distribution, but our real task today is to compute that distribution precisely. If I highlight all the pairs of dice values that add up to six, they all line up along a diagonal.
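If you want to replicate that kind of experiment yourself, here is a minimal sketch in Python. The specific weights below are made up purely for illustration; the animation uses its own.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical weights for the two dice; each array must sum to 1.
blue_weights = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])
red_weights = np.array([0.25, 0.25, 0.20, 0.15, 0.10, 0.05])

# Repeatedly sample both dice, recording the sum at each iteration.
n_samples = 100_000
blue = rng.choice(np.arange(1, 7), size=n_samples, p=blue_weights)
red = rng.choice(np.arange(1, 7), size=n_samples, p=red_weights)
sums = blue + red

# The empirical frequencies give a heuristic sense of the distribution.
for s in range(2, 13):
    print(f"P(sum = {s}) ~ {np.mean(sums == s):.4f}")
```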

For a pair of fair dice, the probability of any particular outcome - say a blue three and a red two - is $\frac{1}{36}$: there are 36 distinct possible outcomes, each equally likely. For weighted dice, we can still organize the outcomes in a six by six grid; assuming the two dice are independent, the probability of each outcome equals the probability of the blue die’s value multiplied by the probability of the red die’s value. We can also plot each probability as the height of a bar above its square in a three dimensional plot, and then the distribution of possible sums comes from combining the heights of this plot along each of the diagonal slices.

Alternatively, we can flip the bottom distribution around horizontally, so that the die values increase from right to left, which lets us see which pairs of dice values line up with each other. As it’s positioned right now, we have one and six, two and five, three and four, and so on - all of the pairs of values that add up to seven. So if you want the probability of rolling a seven, a way to hold that computation in your mind is to take all of the pairs of probabilities that line up with each other, multiply those pairs together, and then add up all of the results. Some of you might like to think of this as a kind of dot product, but the operation as a whole is not just one dot product but many.

If we were to slide that bottom distribution a little more to the left - so that the die values lining up are one and four, two and three, three and two, and four and one, in other words all the pairs that add up to five - then taking the dot product, multiplying the pairs of probabilities that line up and adding them together, gives the total probability of rolling a five. In general, from this point of view, computing the full distribution for the sum looks like sliding that bottom distribution into each possible position and computing this dot product along the way. It is precisely the same operation as the diagonal slices we were looking at earlier - just two different ways to visualize the same underlying computation.
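Here is a sketch of that slide-and-dot-product procedure in code, using the same hypothetical weights as before; the final assert checks that it agrees with numpy’s built-in convolution.

```python
import numpy as np

blue = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])  # P_X for faces 1..6
red = np.array([0.25, 0.25, 0.20, 0.15, 0.10, 0.05])   # P_Y for faces 1..6

# Slide the flipped red distribution across the blue one; at each offset,
# multiply the pairs of probabilities that line up and add up the results.
dist = np.zeros(len(blue) + len(red) - 1)  # possible sums: 2 through 12
for offset in range(len(dist)):
    for i in range(len(blue)):
        j = offset - i  # index of the red face lined up with blue face i
        if 0 <= j < len(red):
            dist[offset] += blue[i] * red[j]

assert np.allclose(dist, np.convolve(blue, red))  # same operation, one call
print({s: round(p, 4) for s, p in zip(range(2, 13), dist)})
```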

And however you choose to visualize it, this operation - taking in two different distributions and spitting out a new one describing the sum of the relevant random variables - is called a convolution, and we often denote it with an asterisk, $*$. Really, the way you want to think about it, especially as we set up for the continuous case, is as something that combines two functions and spits out a new function.

For example, in this case, maybe I give the function for the first distribution the name $P_X$. This is a function that takes in a possible value for the die, like a three, and spits out the corresponding probability. Similarly, let’s let $P_Y$ be the function for our second distribution, and $P_{X+Y}$ the function describing the distribution for the sum. In the lingo, what you would say is that $P_{X+Y}$ is the convolution of $P_X$ with $P_Y$, written $P_{X+Y} = P_X * P_Y$.

And what I want you to think about now is what the formula for this operation should look like. You’ve seen two different ways to visualize it, but how do we actually write it down in symbols? To get your bearings, maybe it’s helpful to write down a specific example, like the case of plugging in a four, where you add up all the pairwise products corresponding to pairs of inputs that add up to four. More generally, the new function takes as an input a possible sum for your random variables, which I’ll call $s$, and what it outputs is a sum over a bunch of pairs of values for $X$ and $Y$. The usual way it’s written, though, is not with both variables: instead we focus on just one of them, in this case $X$, letting it range over all of its possible values, which here means going from one all the way up to six, and in place of the value for $Y$ we write $s - x$, whatever the number has to be to make the total come out to $s$:

$$P_{X+Y}(s) = \sum_{x=1}^{6} P_X(x) \, P_Y(s - x)$$
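As a quick sanity check, here is that formula translated directly into Python, again with the hypothetical weights from before; note how any input outside one through six is simply treated as having probability zero, a convention justified in a moment.

```python
# Hypothetical weights, stored as maps from face value to probability.
P_X = dict(zip(range(1, 7), [0.05, 0.10, 0.15, 0.20, 0.25, 0.25]))
P_Y = dict(zip(range(1, 7), [0.25, 0.25, 0.20, 0.15, 0.10, 0.05]))

def P_sum(s):
    # P_{X+Y}(s) = sum over x of P_X(x) * P_Y(s - x), where any value
    # outside the defined domain contributes probability zero.
    return sum(P_X[x] * P_Y.get(s - x, 0.0) for x in range(1, 7))

print(P_sum(4))   # only x = 1, 2, 3 contribute; the rest hit P_Y(...) = 0
print(sum(P_sum(s) for s in range(2, 13)))  # sanity check: should total 1
```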

Now, the astute among you might notice a slightly weird quirk with the formula as it’s written. If you plug in a given value like $s = 4$ and unpack the sum, letting $x$ range over all the possible values from one up to six, then sometimes the corresponding value $s - x$ drops below the domain we’ve explicitly defined: you get zero, negative one, negative two. It’s not actually that big a deal. Essentially, you just declare all of those values to be zero, so the later terms don’t get counted; the probability that the red die rolls a negative one is zero, because it’s an impossible outcome.

When dealing with continuous distributions, the random variable can take on values anywhere in an infinite continuum. To talk about the probability of a sample falling within a given range, we use a probability density function (PDF), where that probability equals the area under the curve over that range. The formula for the continuous case is similar to the discrete one, but with an integral in place of the sum. If our two random variables have density functions $f(x)$ and $g(y)$, and we want the new distribution describing their sum, the expression is an integral ranging over all pairs of values $x$ and $y$ constrained to a given sum $s$:

$$[f * g](s) = \int_{-\infty}^{\infty} f(x) \, g(s - x) \, dx$$

So, as I slide around this parameter $s$, the area under the graph changes, and so does the corresponding probability.

For the demo, instead of graphing $g$ directly, I want to graph $g(s - x)$. This has the effect of flipping the graph horizontally and then shifting it left or right, depending on whether $s$ is positive or negative; here the parameter $s$ is treated as a constant. The real fun comes from graphing the product of these two graphs and integrating it. This is analogous to the list of pairwise products, but instead of adding them up, we integrate them together, which is interpreted as the area underneath the product graph. As $s$ shifts around, the shape of the product graph changes, and so does the corresponding area.
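Numerically, that area is easy to approximate. The sketch below assumes two standard normal densities purely for concreteness and estimates $\int f(x)\,g(s - x)\,dx$ with a Riemann sum on a fine grid.

```python
import numpy as np

def f(x):  # density of the first variable (standard normal, for illustration)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(x):  # density of the second variable (also standard normal here)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

xs = np.linspace(-10, 10, 20001)
dx = xs[1] - xs[0]

def convolve_at(s):
    # Flip g, shift by s, multiply pointwise with f, and integrate:
    # the area under the product graph f(x) * g(s - x).
    return np.sum(f(xs) * g(s - xs)) * dx

for s in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"[f * g]({s:+.1f}) ~ {convolve_at(s):.4f}")
```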

For a simple example, let’s say each of our two random variables follows a uniform distribution between negative one half and positive one half. The product of the two graphs is one wherever they overlap, and zero everywhere else. If $s$ is shifted far enough to the left, the product graph is zero everywhere - that sum is impossible to achieve. As $s$ shifts to the right and the graphs begin to overlap, the area increases linearly until the graphs line up entirely and the area reaches its maximum; after that point, the area decreases linearly again, so the distribution for the sum takes on a wedge shape.
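For the record, here is that little computation in symbols, assuming both densities equal $1$ on $[-\frac{1}{2}, \frac{1}{2}]$ and $0$ elsewhere: the integrand is $1$ exactly where the two intervals overlap.

```latex
[f * g](s) = \int_{-\infty}^{\infty} f(x)\, g(s - x)\, dx
           = \operatorname{length}\!\Big( \big[-\tfrac{1}{2}, \tfrac{1}{2}\big]
             \cap \big[s - \tfrac{1}{2},\, s + \tfrac{1}{2}\big] \Big)
           = \begin{cases} 1 - |s| & \text{if } |s| \le 1 \\
             0 & \text{otherwise} \end{cases}
```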

This is analogous to the sum of two fair dice, where the probabilities increase until they max out at seven and then decrease back down. If we instead add up three uniformly distributed variables, we can think of the sum of the first two as its own variable, following that wedge shaped distribution, and then take a convolution between it and the top hat function (the flat rectangular graph of the uniform distribution). This has the effect of filtering out values from the top graph, so the product on the bottom looks like a copy of the top graph limited to a certain window. As $s$ shifts around, the area under that graph changes, and so does the corresponding probability.

Again, as I slide this around left and right and the area gets bigger and smaller, the result maxes out in the middle and tapers off to either side, more smoothly than before. And it’s more than just qualitatively smoother: this literally is a moving average of the top graph. One thing you might think to do is take this even further. We started by combining two top hat functions to get this wedge; then we replaced the first function with that wedge, and taking the convolution gave this smoother shape describing a sum of three distinct uniform variables. But we could just repeat: swap that result in as the top function, convolve it with the flat rectangular function again, and whatever comes out describes a sum of four uniformly distributed random variables.

Any of you who watched the video about the central limit theorem should know what to expect: as we repeat this process over and over, the shape looks more and more like a bell curve. Or, to be more precise, at each iteration we should rescale the x axis to keep the standard deviation equal to one, because the dominant effect of this repeated convolution - a kind of repeated moving average - is to flatten the function out over time, so in the limit it just flattens toward zero. Rescaling is a way of saying, “Yeah, yeah, I know it gets flatter, but what is the actual shape underlying it all?”
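Here is a sketch of that repeated experiment. Rather than literally rescaling the axis at each step, it compares the $n$-fold convolution against the normal density with the matching variance ($n/12$ for a sum of $n$ uniforms on $[-\frac{1}{2}, \frac{1}{2}]$), which exhibits the same convergence.

```python
import numpy as np

xs = np.linspace(-4, 4, 8001)
dx = xs[1] - xs[0]
top_hat = np.where(np.abs(xs) <= 0.5, 1.0, 0.0)  # uniform on [-1/2, 1/2]

density = top_hat
for n in range(2, 7):
    # Convolve once more: density now describes a sum of n uniform variables.
    density = np.convolve(density, top_hat, mode="same") * dx

    # Normal density with the same mean (0) and variance (n / 12).
    sigma = np.sqrt(n / 12)
    normal = np.exp(-xs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    print(f"n = {n}: max |density - normal| = "
          f"{np.max(np.abs(density - normal)):.4f}")
```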

The statement of the central limit theorem, one of the coolest facts in probability, is that you could have started with essentially any distribution (so long as it has a finite variance) and this still would have been true: as you take repeated convolutions like this, representing bigger and bigger sums of a given random variable, the distribution describing that sum - which might start off looking nothing like a normal distribution - smooths out more and more until it gets arbitrarily close to a normal distribution. It’s as if the bell curve is, in some loose manner of speaking, the smoothest possible distribution, an attractive fixed point in the space of all functions as we apply this process of repeated smoothing through convolution.
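Stated symbolically: if $X_1, X_2, \dots$ are independent draws from a fixed distribution with mean $\mu$ and finite variance $\sigma^2$, the rescaled sum converges in distribution to a standard normal.

```latex
\frac{X_1 + X_2 + \cdots + X_n \;-\; n\mu}{\sigma \sqrt{n}}
\;\xrightarrow{\,d\,}\; \mathcal{N}(0, 1)
\qquad \text{as } n \to \infty
```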

Naturally, you might wonder: why normal distributions? Why this function and not some other one? There’s a very good answer, and I think the most fun way to show it is in light of the last visualization we’ll give for convolutions. Remember how, in the discrete case, the first of our two visualizations involved forming a kind of multiplication table showing the probabilities for all possible outcomes and adding up along the diagonals? You’ve probably guessed it by now: our last step is to generalize this to the continuous case, and it is beautiful, but you have to be a little careful. Pulling up the same two functions we had before, $f(x)$ and $g(y)$, what would be analogous to the grid of possible pairs we were looking at earlier? Here each of the variables can take on any real number, so we want to think about all possible pairs of real numbers, and the $xy$-plane comes to mind: every point corresponds to a possible outcome when we sample from both distributions.

Now, the probability of any one of these outcomes $(x, y)$ - or rather, the probability density around that point - looks like $f(x)$ times $g(y)$, again assuming the two variables are independent. So a natural thing to do is to graph this two-variable function $f(x) \cdot g(y)$, which gives a surface above the $xy$-plane. Notice in this example how, if we look at the surface from one angle, where we see the $x$ values changing, it has the shape of our first graph, but if we look from another angle, emphasizing change in the $y$ direction, it takes on the shape of our second graph. And this is the key to understanding why normal distributions are so special.
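If you’d like to render that surface yourself, here is a minimal matplotlib sketch; the two normal densities below (one standard, one wider) are stand-ins for whatever $f$ and $g$ you prefer.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):  # density of the first variable (standard normal, for illustration)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(y):  # density of the second variable (wider normal, sigma = 1.5)
    return np.exp(-y**2 / (2 * 1.5**2)) / (1.5 * np.sqrt(2 * np.pi))

# Joint density f(x) * g(y) over the xy-plane, assuming independence.
xs = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(xs, xs)
Z = f(X) * g(Y)

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("f(x) g(y)")
plt.show()
```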

This three-dimensional graph encodes all of the information we need: it shows the probability densities for every possible outcome. If we want to limit our view to just those outcomes where $x + y$ is constrained to be a given sum, that looks like restricting to a diagonal slice - specifically, a slice over the line $x + y = s$ for some constant $s$. All of the probability densities for outcomes subject to this constraint look like a slice under this graph, and as we change which specific sum we’re constraining to, it shifts which diagonal slice we’re looking at.

Now, what you might predict is that the way to combine all of the probability densities along one of these slices - to integrate them together - can be interpreted as the area under this curve, the slice of the surface. And that is almost correct. There’s a subtle detail regarding a factor of the square root of two that we need to talk about, but up to a constant factor, the areas of these slices give us the values of the convolution. In fact, all of these slices are precisely the same as the product graph we were looking at earlier. To emphasize this point, let me pull up both visualizations side by side and slowly decrease the value of $s$ from two down to negative two, which on the left means looking at different slices and on the right means shifting around the modified graph of $g$. Notice how at all points the shape of the graph on the bottom right, the product between the functions, looks exactly the same as the shape of the diagonal slice. And this should make sense: they are two distinct ways to visualize the same thing. It sounds like a lot when we put it into words, but what we’re looking at are all the possible products between outputs of the functions corresponding to pairs of inputs with a given sum. It’s a mouthful, but I think you see what I’m saying, and we now have two different ways to see it.

The nice thing about the diagonal slice visualization is that it makes it much clearer that this is a symmetric operation: it’s much more obvious that $f$ convolved with $g$ is the same thing as $g$ convolved with $f$. Now, technically, the diagonal slices are not exactly the same shape as the product graphs - they’ve been stretched out by a factor of the square root of two. The basic reason is that if you imagine taking some small step along one of these lines where $x + y$ equals a constant, then the change in your $x$ value, the $\Delta x$ here, is not the same as the length of that step; the step is longer by a factor of the square root of two. I’ll leave a note up on the screen for the calculus enthusiasts among you who want to pause and ponder, but the upshot is very simply that the outputs of our convolution are technically not quite the areas of these diagonal slices: you have to divide those areas by the square root of two.
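Here is that note in symbols, for anyone who wants it: parametrize the slice over $x + y = s$ by $x$, so the points are $(x, s - x)$, and a step of $dx$ in the parameter corresponds to a step of length $\sqrt{2}\,dx$ along the line (since $dy = -dx$).

```latex
\text{slice area}
= \int_{-\infty}^{\infty} f(x)\,g(s - x)\,\sqrt{(dx)^2 + (dy)^2}
= \int_{-\infty}^{\infty} f(x)\,g(s - x)\,\sqrt{2}\,dx
= \sqrt{2}\,[f * g](s)
```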

Stepping back from all of this for a moment, I just think this is so beautiful. We started with such a simple question - or at least a seemingly simple question - how do you add up two random variables? And what we ended up with is this intricate operation for combining two different functions, with at least two very pretty ways to understand it. Still, some of you might be raising your hands and saying, “Pretty pictures are all well and good, but do they actually help you calculate something?” For example, I still have not answered the opening quiz question about adding two normally distributed random variables. The ordinary way to approach this kind of question, if it showed up on a homework, is to plug the formula for a normal distribution into the definition of a convolution - the integral we’ve been describing here - and then just do the integral. In this case, the integral is not prohibitively difficult, and there are analytical methods, but for this example I want to show you a more fun method, where the visualizations, specifically the diagonal slices, play a much more prominent role in the proof itself. Many of you may enjoy taking a moment to predict how this will look: think about what this 3D graph would look like in the case of two normal distributions, and what properties it has that you might be able to take advantage of. It’s certainly easiest if you start with the case where both distributions have the same standard deviation. When you want the details, and to see how the answer fits into the central limit theorem, come join me in the next video.