Let’s kick things off with a quiz: Suppose I take a normal distribution, with its familiar bell curve shape, and I have a random variable X drawn from that distribution. What is the probability that X ends up between negative one and two? It’s equal to the area under the curve in that range of values. Now, I’ll pull up a second random variable, also following a normal distribution but with a slightly bigger standard deviation. If you repeatedly sample both of these variables and in each iteration add up the two results, what distribution describes that sum?

To answer this question, we will warm up in a setting that’s more discrete and finite, like rolling a pair of weighted dice. The blue die follows a distribution that seems to be biased towards lower values, and the red die follows its own distinct distribution. Repeatedly sampling like this can give us a heuristic sense of the final distribution, but our real task is to compute that distribution precisely. For instance, if I were to highlight all the pairs of outcomes with a sum of six in a grid of possibilities, they would all sit on the same diagonal.

So our real question is: what is the precise probability of rolling a two, three, four, five, or any other sum? It’s not too hard a question, and you’re encouraged to pause and try working it out for yourself. There are two distinct ways to visualize the underlying computation:

  1. Organize the 36 possible outcomes in a six-by-six grid and multiply the probabilities of the individual die rolls; this assumes that the die rolls are independent of each other. All the pairs with a given sum sit on the same diagonal of this grid, so adding up the products along each diagonal gives the probability of that sum.
  2. Flip the bottom distribution horizontally so that the die values increase as you go from right to left. This allows you to highlight all the pairs with a certain sum, as they will all be sitting on the same vertical column. As it’s positioned right now, we have one and six, two and five, three and four, and so on: all of the pairs of values that add up to seven. If you want to think about the probability of rolling a seven, a way to hold that computation in your mind is to take all of the pairs of probabilities that line up with each other, multiply together those pairs, and then add up all of the results. Some of you might like to think of this as a kind of dot product, but the operation as a whole is not just one dot product but many.

If we were to slide that bottom distribution a little more to the left, the die values that line up become one and four, two and three, three and two, and four and one: in other words, all the pairs that add up to five. Now if we take the dot product, multiplying the pairs of probabilities that line up and adding them together, that gives us the total probability of rolling a five.

In general, from this point of view, computing the full distribution for the sum looks like sliding that bottom distribution into various different positions and computing this dot product along the way. It is precisely the same operation as the diagonal slices we were looking at earlier; they’re just two different ways to visualize the same underlying operation. And however you choose to visualize it, this operation, which takes in two different distributions and spits out a new one describing the sum of the relevant random variables, is called a convolution, and we often denote it with an asterisk.
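Here’s a minimal sketch of that sliding dot product in code; the particular weights for the blue and red dice are made-up placeholders, since only the procedure matters:

```python
import numpy as np

# Made-up weights for the biased blue and red dice (faces 1 through 6);
# any two distributions summing to 1 would do.
blue = np.array([0.30, 0.25, 0.15, 0.12, 0.10, 0.08])
red  = np.array([0.05, 0.10, 0.20, 0.30, 0.20, 0.15])

# Slide one distribution past the other: for each sum s, multiply the
# pairs of probabilities that line up, then add up the results.
p_sum = [sum(blue[x - 1] * red[s - x - 1]
             for x in range(1, 7) if 1 <= s - x <= 6)
         for s in range(2, 13)]

# numpy's convolve performs exactly this slide-multiply-add in one call.
assert np.allclose(p_sum, np.convolve(blue, red))
print(dict(zip(range(2, 13), np.round(p_sum, 4))))
```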

Really, the way you want to think about it, especially as we set up for the continuous case, is as combining two different functions and spitting out a new function. For example, in this case, maybe I give the function for the first distribution the name $P_X$. This is a function that takes in a possible value for the die, like a three, and spits out the corresponding probability. Similarly, let’s let $P_Y$ be the function for our second distribution and $P_{X+Y}$ be the function describing the distribution for the sum. In the lingo, what you would say is that $P_{X+Y}$ is equal to a convolution between $P_X$ and $P_Y$, written $P_{X+Y} = P_X * P_Y$.

And what I want you to think about now is what the formula for this operation should look like. You’ve seen two different ways to visualize it, but how do we actually write it down in symbols? To get your bearings, maybe it’s helpful to write down a specific example, like the case of plugging in a four, where you add up all the different pairwise products corresponding to pairs of inputs that add up to four. More generally, here’s how it might look.
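Using the $P_X$, $P_Y$ names from above, and writing $s$ for the sum in question, the formula reads:

$$P_{X+Y}(s) = \sum_{x=1}^{6} P_X(x) \cdot P_Y(s - x)$$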

This new function takes as an input a possible sum for your random variables, which I’ll call $s$, and what it outputs looks like a sum over a bunch of pairs of values for $x$ and $y$. Except, the usual way it’s written is not with both $x$ and $y$; instead, we focus on just one of those variables, in this case $x$, letting it range over all of its possible values, which here just means going from one all the way up to six. And instead of writing $y$, you write $s - x$: essentially, whatever the number has to be to make sure the sum is $s$.

Now, the astute among you might notice a slightly weird quirk with the formula as it’s written. If you plug in a given value like $s = 4$ and unpack the sum, letting $x$ range over all the possible values from one up to six, then sometimes the corresponding value $s - x$ drops below the domain of what we’ve explicitly defined, hitting zero, negative one, and negative two. It’s not actually that big a deal. Essentially, you would just say all of these values are zero, so the later terms don’t get counted; the probability that the red die rolls a negative one is zero, as it is an impossible outcome.

When dealing with continuous distributions, the random variable can take on values anywhere in an infinite continuum, such as temperatures, financial projections, or wait times. The probability that a sample of the variable falls within a given range looks like the area under a curve in that range, and that curve is known as the probability density function (PDF). The formula for the continuous case swaps the sum for an integral: for two independent random variables following continuous distributions, the distribution for their sum is an integral over products of the two PDFs, where one variable ranges freely and the other is constrained so that the two inputs add up to a given sum.
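In symbols, writing the two density functions as $F$ and $G$ (the names used for the demo below), the formula just described reads:

$$[F * G](s) = \int_{-\infty}^{\infty} F(x)\, G(s - x)\, dx$$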

For the demo, instead of graphing $G$ directly, I want to graph $G(s - x)$, which has the effect of flipping the graph around horizontally and then shifting it either left or right, depending on whether $s$ is positive or negative. The real fun comes from graphing the entire contents of the integral: the product between these two graphs, $F(x) \cdot G(s - x)$. This is analogous to the list of pairwise products that we saw earlier, but in this case, instead of adding up all of those pairwise products, we want to integrate them together, which you would interpret as the area underneath this product graph.

As I shift around this value of $s$, the shape of the product graph changes, and so does the corresponding area. Keep in mind that for all three graphs on the left, the input is $x$, and the number $s$ is just a parameter. But for the final graph on the right, for the resulting convolution itself, this number $s$ is the input to that function, and the corresponding output is whatever the area of the lower-left graph is.
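If you want to recreate this numerically, here’s a bare-bones sketch: sample both densities on a grid, form the product graph for a given $s$, and take its area. The two densities here are placeholder choices, not the ones from the animation.

```python
import numpy as np

xs = np.linspace(-5, 5, 2001)   # grid of x values
dx = xs[1] - xs[0]

# Placeholder densities F and G; any PDFs sampled on this grid work.
f = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)    # standard normal
g = np.exp(-xs**2 / 4) / np.sqrt(4 * np.pi)    # normal with sigma = sqrt(2)

def convolution_at(s):
    """Area under the product graph F(x) * G(s - x)."""
    g_flipped_shifted = np.interp(s - xs, xs, g, left=0.0, right=0.0)
    return np.sum(f * g_flipped_shifted) * dx

# Sampling [F * G](s) at many values of s traces out the full curve.
print([round(convolution_at(s), 4) for s in (-1.0, 0.0, 1.0)])
```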

Let’s do a simple example, say, where each of our two random variables follows a uniform distribution between the values negative one half and positive one half. This means that our density functions each have a top hat shape, where the graph equals one for all inputs between negative one half and positive one half and equals zero everywhere else.

The product between the two graphs has a really easy interpretation: it is one wherever the graphs overlap with each other, and zero everywhere else. If I slide the parameter $s$ far enough to the left that our top graphs don’t overlap at all, then the product graph is zero everywhere, and that’s a way of saying this is an impossible sum to achieve. As I start to slide $s$ to the right and the graphs overlap with each other, the area increases linearly until the graphs overlap entirely and it reaches a maximum. After that point, it starts to decrease linearly again, which means that the distribution for the sum takes on a wedge shape.
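As a quick check on that wedge shape, here’s a sketch using np.convolve on a sampled top hat; the factor dx turns the sliding dot product into a Riemann sum, and the grid resolution is an arbitrary choice:

```python
import numpy as np

xs = np.linspace(-2, 2, 801)
dx = xs[1] - xs[0]

# Uniform density on [-1/2, 1/2]: the top hat function.
top_hat = np.where(np.abs(xs) <= 0.5, 1.0, 0.0)

# Discrete approximation of the continuous convolution of the two top hats.
wedge = np.convolve(top_hat, top_hat, mode="same") * dx

# The result is the wedge: it peaks near 1 at s = 0 and is zero outside [-1, 1].
print(round(wedge.max(), 3), round(wedge[np.abs(xs) > 1.01].max(), 3))
```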

If we add up three different uniformly distributed variables, we can think about the sum of the first two as its own variable, which follows a wedge-shaped distribution, and then take a convolution between that and the top hat function. Pulling up the demo, the product on the bottom looks just like a copy of the top graph, but limited to a certain window. So, as I slide $s$ around, I’m selecting different windows from the top graph. Again, as I slide this left and right, the area gets bigger and smaller, and the result maxes out in the middle but tapers out to either side, except this time it does so more smoothly. It’s kind of like we’re taking a moving average of that top left graph. Actually, it’s more than just kind of: this literally is a moving average of the top left graph.

One thing you might think to do is take this even further. The way we started was combining two top hat functions, and we got this wedge. Then we replaced the first function with that wedge, and when we took the convolution, we got this smoother shape describing a sum of three distinct uniform variables. But we could just repeat: swap that out for the top function, convolve it with the flat rectangular function, and whatever result we see should describe a sum of four uniformly distributed random variables.

Any of you who watched the video about the central limit theorem should know what to expect. As we repeat this process over and over, the shape looks more and more like a bell curve. Or, to be more precise, at each iteration we should rescale the $x$-axis to make sure that the standard deviation is one, because the dominant effect of this repeated convolution, this kind of repeated moving-average process, is to flatten out the function over time; in the limit, it just flattens out towards zero. Rescaling is a way of saying, “Yeah, yeah, yeah, I know that it gets flatter, but what’s the actual shape underlying it all?”

The statement of the central limit theorem, one of the coolest facts from probability, is that you could have started with essentially any distribution and this still would have been true: as you take repeated convolutions like this, representing bigger and bigger sums of a given random variable, the distribution describing that sum, which might start off looking very different from a normal distribution, smooths out more and more until it gets arbitrarily close to a normal distribution. It’s as if a bell curve is, in some loose manner of speaking, the smoothest possible distribution, an attractive fixed point in the space of all possible functions as we apply this process of repeated smoothing through convolution.
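Here’s a minimal numeric sketch of that repeated smoothing, assuming we start from the same top hat density; the grid size and the number of iterations are arbitrary choices:

```python
import numpy as np

xs = np.linspace(-6, 6, 2401)
dx = xs[1] - xs[0]
top_hat = np.where(np.abs(xs) <= 0.5, 1.0, 0.0)   # uniform on [-1/2, 1/2]

density = top_hat
for n in range(2, 7):
    # Convolve with the top hat again: the density for a sum of n
    # uniform variables, before any rescaling.
    density = np.convolve(density, top_hat, mode="same") * dx
    # Rescale the x-axis to unit standard deviation; a sum of n of
    # these uniforms has variance n / 12.
    sigma = np.sqrt(n / 12)
    rescaled = sigma * np.interp(sigma * xs, xs, density, left=0, right=0)
    # The distance from the standard normal bell curve shrinks as n grows.
    bell = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
    print(n, round(np.max(np.abs(rescaled - bell)), 4))
```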
Naturally, you might wonder: why normal distributions? Why this function and not some other one? There’s a very good answer, and I think the most fun way to show it is in the light of the last visualization we’ll show for convolutions. Remember how, in the discrete case, the first of our two visualizations involved forming a kind of multiplication table showing the probabilities for all possible outcomes and adding up along the diagonals? You’ve probably guessed it by now, but our last step is to generalize this to the continuous case, and it is beautiful, but you have to be a little bit careful.

Pulling up the same two functions we had before, $F(x)$ and $G(y)$, what in this case would be analogous to the grid of possible pairs that we were looking at earlier? Well, in this case, each of the variables can take on any real number, so we want to think about all possible pairs of real numbers, and the $xy$-plane comes to mind. Every point corresponds to a possible outcome when we sample from both distributions. Now, the probability of any one of these outcomes $(x, y)$, or rather the probability density around that point, will look like $F(x) \cdot G(y)$, again assuming that the two are independent. So a natural thing to do is to graph this two-variable function $F(x) \cdot G(y)$, which gives something that looks like a surface above the $xy$-plane. Notice in this example how, if we look at it from one angle where we see the $x$ values changing, it has the shape of our first graph; but if we look at it from another angle, emphasizing the change in the $y$ direction, it takes on the shape of our second graph.

This three-dimensional graph encodes all of the information we need: it shows the probability densities for every possible outcome. If we want to limit our view just to those outcomes where $X + Y$ is constrained to be a given sum, what that looks like is limiting our view to a diagonal slice, specifically a slice over the line $x + y = s$ for whatever constant sum $s$ we care about. All of the probability density for outcomes satisfying this constraint sits under this slice of the graph, and as we change which specific sum we’re constraining to, it shifts which specific diagonal slice we’re looking at.

Now, what you might predict is that the way to combine all of the probability densities along one of these slices, the way to integrate them together, can be interpreted as the area under this curve, which is a slice of the surface. And that is almost correct. There’s a subtle detail regarding a factor of the square root of two that we need to talk about, but up to a constant factor, the areas of these slices give us the values of the convolution. In fact, all of these slices are precisely the same as the product graph that we were looking at earlier. To emphasize this point, let me pull up both visualizations side by side, and I’m going to slowly decrease the value of $s$ from two down to negative two, which on the left means we are looking at different slices, and on the right means we’re shifting around the modified graph of $G$. Notice how at all points the shape of the graph on the bottom right, the product between the functions, looks exactly the same as the shape of the diagonal slice. And this should make sense: they are two distinct ways to visualize the same thing.

It sounds like a lot when we put it into words, but what we’re looking at are all the possible products between outputs of the two functions corresponding to pairs of inputs that have a given sum. Again, it’s kind of a mouthful, but I think you see what I’m saying, and we now have two different ways to see it. The nice thing about the diagonal slice visualization is that it makes it much more clear that this is a symmetric operation: it’s much more obvious that $F$ convolved with $G$ is the same thing as $G$ convolved with $F$, that is, $F * G = G * F$.

Now, technically, the diagonal slices are not exactly the same shape: they’ve been stretched out by a factor of the square root of two. The basic reason is that if you imagine taking some small step along one of these lines where $x + y$ is a constant, then the change in your $x$ value, that $\Delta x$, is not the same as the length of the step; the step is actually longer by a factor of the square root of two. I’ll leave a note up on the screen for the calculus enthusiasts among you who want to pause and ponder, but the upshot is simply that the outputs of our convolution are technically not quite the areas of these diagonal slices. You have to divide those areas by the square root of two.
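To make that concrete, here’s a small numeric check with placeholder densities standing in for $F$ and $G$: walking along the slice $x + y = s$ by arc length, the area under the surface there comes out to the square root of two times the convolution.

```python
import numpy as np

xs = np.linspace(-6, 6, 2401)
dx = xs[1] - xs[0]
f = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)    # placeholder PDF for F
g = np.where(np.abs(xs) <= 0.5, 1.0, 0.0)      # placeholder PDF for G

s = 0.75    # which diagonal slice x + y = s to look at

# Parameterize the line x + y = s by arc length u, sample the surface
# F(x) * G(y) along it, and take the area under those heights.
u = np.linspace(-8, 8, 3201)
du = u[1] - u[0]
x_pts = s / 2 + u / np.sqrt(2)
y_pts = s / 2 - u / np.sqrt(2)
heights = (np.interp(x_pts, xs, f, left=0, right=0) *
           np.interp(y_pts, xs, g, left=0, right=0))
slice_area = np.sum(heights) * du

# The convolution at s, computed the ordinary way.
conv = np.sum(f * np.interp(s - xs, xs, g, left=0, right=0)) * dx

print(round(slice_area / np.sqrt(2), 4), round(conv, 4))   # these agree
```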

Stepping back from all of this for a moment, I just think this is so beautiful. We started with such a simple question or at least such a seemingly simple question - how do you add up two random variables? And what we end up with is this very intricate operation for combining two different functions. We have at least two very pretty ways to understand it. But still, some of you might be raising your hands and saying “Pretty pictures are all well and good, but do they actually help you calculate something?” For example, I still have not answered the opening quiz question about adding two normally distributed random variables.

Well, the ordinary way you would approach this kind of question, if it showed up on a homework or something like that, is to plug the formula for a normal distribution into the definition of a convolution, the integral we’ve been describing here, and then do the integral. In this case, the integral is not prohibitively difficult, and there are analytical methods, but for this example I want to show you a more fun method, where the visualizations, specifically the diagonal slices, play a much more front-and-center role in the proof itself. I think many of you may actually enjoy taking a moment to predict how this will look for yourself. Think about what this 3D graph would look like in the case of two normal distributions, and what properties it has that you might be able to take advantage of. It’s certainly easiest if you start with the case where both distributions have the same standard deviation. When you want the details, and to see how the answer fits into the central limit theorem, come join me in the next video.
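For anyone who wants to try that homework-style route in the meantime, here is the integral in question, assuming for simplicity that both distributions have mean zero and writing their standard deviations as $\sigma_1$ and $\sigma_2$:

$$[F * G](s) = \int_{-\infty}^{\infty} \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-x^2/(2\sigma_1^2)} \cdot \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-(s-x)^2/(2\sigma_2^2)}\, dx$$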