Law of Large Numbers and the Central Limit Theorem (With Python)
Law of Large Numbers & CLT: Explanation, mathematical proof & illustration of theorems with Python
Statistics does not usually go with the word "famous," but a few theorems and concepts of statistics find such universal application that they deserve the tag. The Central Limit Theorem and the Law of Large Numbers are two such concepts. Combined with hypothesis testing, they belong in the toolkit of every quantitative researcher. They are among the most discussed theorems in quantitative analysis, and yet scores of people still do not understand them well, or worse, misunderstand them. Moreover, these calculations are rarely done manually, since the datasets are too large and the computations time-consuming, so it is equally important to understand the computational side of these theorems.
A working knowledge of probability, random variables and their distributions is required to understand these concepts.
Sample Mean
A "sample" is a set of outcomes in an experiment or an event. A sample can sometimes also be replaced by "trials" or number of repetitions of the experiment. For example, tossing a coin with probability p of achieving a heads is n Bernoulli (p) trials. If the outcome is a random variable X, then X~Binomial(n, p) distribution (with n number of Bern(p) trials).
The values in a sample, \(X_{1},X_{2},X_{3}, ..., X_{n}\), are all random variables, all drawn from the same probability distribution since they are outcomes of the same experiment. The \(X_{1},X_{2},X_{3}, ..., X_{n}\) here are not actual numbers but names of random variables. A realization of the random variable \(X_{1}\) is \(x_{1}\), an actual number.
A realization, or an actual value of an experiment, is one reading from the distribution. In a way, a probability distribution is not a physical thing, and hence is sometimes hard to relate to everyday actions, even though it is highly applicable to everyday activities. It can be thought of as a tree, say a mango tree. We know the characteristics of the tree: what types of leaves it has, what types of fruits, their shape, size, color and so on. We have a good idea of what a mango tree is, even if we don't physically see it and inspect every leaf. That is similar to the probability distribution of a random variable X. One specific leaf, x, plucked from a mango tree, is like a realization of the random variable X.
The sample mean is a random variable itself, as it is an average of other random variables. Since all the outcomes come from the same experiment, they all belong to the same distribution, and hence are identically distributed. If each trial is independent of the others, the random variables \(X_{1},X_{2},X_{3}, ..., X_{n}\) are also independent. This is then a set of I.I.D. (Independent and Identically Distributed) random variables.
For I.I.D. random variables \(X_{1},X_{2},X_{3}, ..., X_{n}\), the sample mean \({\overline{X_n}}\) is simply the arithmetic average of the set of random variables. (Note: the definition of the sample mean applies to any set of random variables, but the I.I.D. case is the one that arises in common experiments and the one used to derive the important theorems below.)
\[{\overline{X_n}} = \frac{1}{n}\sum_{i=1}^{n}{X_i}\]
In a small lab experiment, like measuring the length of an object with Vernier calipers, we normally observe and record 3-5 readings of a measurement and take their average to report the final value, so that random errors cancel out. This is high-school level mathematics. In research experiments, this level of simplification is not possible, but the idea behind the averaging is the same. In practice, we draw samples and average the realized values \(x_i\). The expectation of the sample mean, by contrast, averages over all possible outcomes of each random variable \(X_i\), not just the realized values, so in a sense it is a theoretical average.
\[E[{\overline{X_n}}] = E\left[\frac{1}{n}\sum_{i=1}^{n}{X_i}\right]\]
When we take the expectation, the measurement errors cancel out, because the expectation of a purely random error is zero: E[e] = 0.
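As a quick illustration of this idea, here is a sketch with assumed numbers (a true length of 10.0 units and small, zero-mean Normal measurement errors): averaging a few noisy readings gives a value close to the truth.
import numpy as np
true_length = 10.0 #assumed true value being measured
errors = np.random.normal(0, 0.05, size=5) #5 random measurement errors with E[e] = 0
readings = true_length + errors #the recorded readings
print("Readings:", readings)
print("Average reading:", readings.mean()) #close to 10.0, since the errors roughly cancel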
Convergence in Probability
Convergence in probability is defined as follows: a sequence of random variables \(X_n\) is said to converge in probability to a number a if, for any given small number ϵ > 0, the probability that \(X_n\) differs from a by at least ϵ tends to zero as n approaches ∞.
For any \(\epsilon\) > 0, \[\lim_{n \rightarrow \infty } P(|X_n - a| \geq \epsilon ) = 0\]
So the distribution of \(X_n\) bunches up around the number a for a large enough n. But it does not always follow that the expectation E[\(X_n\)] converges to a as well: rare but extreme outcomes (outliers) can pull the expectation away from the number a, where the bulk of the outcomes lies.
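The classic textbook example of this caveat (my own illustrative choice, not tied to the experiments above) is a sequence \(X_n\) that equals n with probability 1/n and 0 otherwise: it converges in probability to 0, yet E[\(X_n\)] = 1 for every n. A quick simulation:
import numpy as np
def sample_Xn(n, size=100000):
    #Draws from X_n, which equals n with probability 1/n and 0 otherwise
    return np.where(np.random.random(size) < 1.0 / n, n, 0)
epsilon = 0.5
for n in [10, 100, 1000]:
    draws = sample_Xn(n)
    prob_far = np.mean(np.abs(draws - 0) >= epsilon) #estimate of P(|X_n - 0| >= epsilon)
    print(n, prob_far, draws.mean()) #prob_far shrinks toward 0, but the mean stays near 1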
Law of Large Numbers (LLN)
As per the LLN, for a set of I.I.D. random variables \(X_i\) with mean μ and variance σ², the sample mean converges to the true mean μ of the population as the sample size n tends to ∞ (with probability 1, in the strong form of the law). The expectation of the sample mean equals μ for any n, by linearity of expectation:
\[E[{\overline{X_n}}] = E\left[\frac{1}{n}\sum_{i=1}^{n}{X_i}\right] = \frac{1}{n}\sum_{i=1}^{n}E[{X_i}] = \mu \]
This can be simulated and tested in Python by creating, say, 15 I.I.D. random variables X1 to X15, each Xi ~ Binomial(n, p), using NumPy's random generator. We calculate the value of the sample mean by averaging the variables. The true mean (mu in the code) comes out very close to the calculated value (mean in the code) based on the randomly generated distributions.
Note that np.random.binomial(n, p, size=None) in NumPy uses a slightly different notation for the Binomial distribution than what we have been using so far. Here n refers to the number of trials within one variable, p is the probability of success, and size is the number of samples drawn (e.g., the number of coins tossed). It treats the Binomial distribution as a sum of indicator random variables, so each entry of the output is the number of successes for one sample (one coin). So, if we take size as 5 and n (trials) as 100, the output will be a list of 5 numbers, each giving the number of successes out of 100 for one sample (one coin), as in the example below.
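For example, a small check of the call signature described above (the printed numbers will vary from run to run):
import numpy as np
counts = np.random.binomial(100, 0.5, size=5) #5 coins, each tossed 100 times, p = 0.5
print(counts) #a list of 5 numbers, each the count of successes out of 100 (around 50 on average)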
For the sake of simplicity though, I have created 15 separate random variables, each with size = 1, for illustrative purposes. XN refers to the sample mean in the code.
import numpy as np
import scipy.stats as sp
#Running the Simulation with 15 IID Binomial RV for size=1 each, with n=1000 trials, probability of success is p=0.5
X1 = np.random.binomial(1000, 0.5, 1)
X2 = np.random.binomial(1000, 0.5, 1)
X3 = np.random.binomial(1000, 0.5, 1)
X4 = np.random.binomial(1000, 0.5, 1)
X5 = np.random.binomial(1000, 0.5, 1)
X6 = np.random.binomial(1000, 0.5, 1)
X7 = np.random.binomial(1000, 0.5, 1)
X8 = np.random.binomial(1000, 0.5, 1)
X9 = np.random.binomial(1000, 0.5, 1)
X10 = np.random.binomial(1000, 0.5, 1)
X11 = np.random.binomial(1000, 0.5, 1)
X12 = np.random.binomial(1000, 0.5, 1)
X13 = np.random.binomial(1000, 0.5, 1)
X14 = np.random.binomial(1000, 0.5, 1)
X15 = np.random.binomial(1000, 0.5, 1)
XN = (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8+ X9 + X10 + X11 + X12 + X13 + X14 + X15)/15 #Sample Mean
mean = np.mean(XN) #Calculated mean of the sample
print("Sample Mean: "+ str(mean))
mu = sp.binom.mean(1000, 0.5) #True Mean of the sample
print("True Mean: " + str(mu))
Output:
Sample Mean: 500.8666666666667
True Mean: 500.0
This is the result for just 15 random variables. As the number of variables increases, the sample mean gets closer to the true mean. (Note: every run of the code gives a new value for the sample mean, because a new set of random variables is generated each time.)
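To see this effect without copy-pasting more variables, the same experiment can be sketched with a single call, using the size argument to generate many I.I.D. variables at once (a separate illustrative snippet, not the code above):
import numpy as np
import scipy.stats as sp
mu = sp.binom.mean(1000, 0.5) #true mean = 500
for num_vars in [15, 150, 1500, 15000]:
    X = np.random.binomial(1000, 0.5, size=num_vars) #num_vars IID Bin(1000, 0.5) variables
    print(num_vars, X.mean(), abs(X.mean() - mu)) #the gap to mu tends to shrink as num_vars grows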
The variance of the Sample Mean \(\overline{X_n}\) is calculated as follows: \[Var(\overline{X_n}) =\frac{Var(X_1 + X_2 + ... + X_n)}{n^2} = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n} \]
Since the \(X_i\) are independent, the variance of their sum is the sum of their variances, which gives the formula above; a quick numerical check is sketched below. Using this variance together with Chebyshev's inequality, we can then prove the Weak Law of Large Numbers.
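The sketch uses the same Binomial(1000, 0.5) setup as before, where σ² = 1000 × 0.5 × 0.5 = 250, and compares the simulated variance of the sample mean with σ²/n:
import numpy as np
import scipy.stats as sp
n = 15 #number of IID variables in each sample mean
sigma2 = sp.binom.var(1000, 0.5) #true variance of one X_i (= 250)
#Simulate many sample means, each averaging n IID Bin(1000, 0.5) variables
sample_means = np.random.binomial(1000, 0.5, size=(100000, n)).mean(axis=1)
print("Empirical variance of the sample mean:", sample_means.var())
print("Theoretical sigma^2 / n:", sigma2 / n)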
Weak Law of Large Numbers
As per Chebyshev's inequality,
For any \(\epsilon\) > 0, \[P(|Y_n - a| \geq \epsilon ) \leq \frac{Var(Y_n)}{ \epsilon ^2} \]
Plugging \(\overline{X_n}\) and its variance into this inequality, we get:
\[ P(|\overline{X_n} - \mu| \geq \epsilon ) \leq \frac{\sigma^2}{ n\epsilon ^2} \underset{n \longrightarrow \infty }{\overset{}{\longrightarrow}} 0 \]
As n approaches infinity, the probability that the sample mean differs from the true mean μ by at least ϵ tends to zero, for any fixed small ϵ.
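The bound can be checked numerically. The sketch below uses an assumed ϵ = 5 with the same Binomial(1000, 0.5) variables, and shows the estimated tail probability falling toward zero and staying within the Chebyshev bound as n grows:
import numpy as np
import scipy.stats as sp
mu = sp.binom.mean(1000, 0.5) #true mean = 500
sigma2 = sp.binom.var(1000, 0.5) #true variance of one X_i = 250
epsilon = 5.0 #assumed fixed small number for the tail event
for n in [10, 100, 1000]:
    sample_means = np.random.binomial(1000, 0.5, size=(10000, n)).mean(axis=1)
    empirical = np.mean(np.abs(sample_means - mu) >= epsilon) #estimated P(|XN - mu| >= epsilon)
    bound = sigma2 / (n * epsilon**2) #Chebyshev upper bound sigma^2 / (n * epsilon^2)
    print(n, empirical, bound)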
Central Limit Theorem
So far, we have not said anything about which distribution the \(X_i\) belong to, or about the distribution of the sample mean (which is a random variable too, remember?). Most of the time, knowing the mean is not enough; we would like to know the full distribution of the sample mean so we can understand its properties. The Central Limit Theorem describes exactly this.
The Central Limit Theorem (CLT) says that the standardized sample mean converges in distribution to the Standard Normal:
\[ \frac{\sqrt{n} (\overline{X_n} - \mu )}{ \sigma } \underset{n \longrightarrow \infty }{\longrightarrow}N(0,1) \]
import numpy as np
import scipy.stats as sp
import matplotlib.pyplot as plt
import math
#Running the Simulation with 10 IID Binomial RV for 500 coins, with 1000 trials, probability of success is 0.5
X1 = np.random.binomial(1000, 0.5, 500)
X2 = np.random.binomial(1000, 0.5, 500)
X3 = np.random.binomial(1000, 0.5, 500)
X4 = np.random.binomial(1000, 0.5, 500)
X5 = np.random.binomial(1000, 0.5, 500)
X6 = np.random.binomial(1000, 0.5, 500)
X7 = np.random.binomial(1000, 0.5, 500)
X8 = np.random.binomial(1000, 0.5, 500)
X9 = np.random.binomial(1000, 0.5, 500)
X10 = np.random.binomial(1000, 0.5, 500)
XN = (X1+X2+X3+X4+X5+X6+X7+X8+X9+X10)/10
mean = np.mean(XN) #Calculated mean of the sample
print("Sample Mean: "+ str(mean))
mu = sp.binom.mean(1000, 0.5) #true mean of the sample
print("True Mean: " + str(mu))
sigma = sp.binom.std(1000, 0.5)
print("True standard deviation: " + str(sigma))
#Plotting Sample mean distribution for CLT
gauss = np.random.standard_normal(5000) #Reference draws from N(0,1) for comparison
nsigma = 1 #Standard deviation of the reference Standard Normal
nmu = 0 #Mean of the reference Standard Normal
n = 10 #Number of IID variables averaged in XN
N = math.sqrt(n) #sqrt(n) scaling factor from the CLT
ZN = (N*(XN-mu))/ sigma #Standardized sample means: sqrt(n)(XN - mu)/sigma
plt.figure(figsize=(10,7))
count, bins, ignored = plt.hist(gauss, 100, alpha=0.5, color='orange', density=True, label="Standard Normal Distribution")
plt.plot(bins, 1/(nsigma * np.sqrt(2 * np.pi)) *
np.exp( - (bins - nmu)**2 / (2 * nsigma**2) ),
linewidth=3, color='r', label="Std Normal PDF")
plt.hist(ZN, 100, alpha=0.7, color="Navy", density=True, label="Sample Mean Distribution")
plt.legend(loc=2) #Labels are attached to each plot call so legend entries match the right artists
plt.savefig("CLT_new.jpeg")
plt.show()
Note that n is still not very large here; as n gets bigger, the results get closer and closer to the Standard Normal distribution.
The same exercise can be repeated with random variables drawn from another distribution, for example the Poisson, and the standardized sample mean again approaches the Standard Normal.
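A minimal sketch of that Poisson version (with an assumed rate λ = 4, n = 30 variables averaged, and the same standardization as above):
import numpy as np
import scipy.stats as sp
import matplotlib.pyplot as plt
import math
lam = 4 #assumed Poisson rate
n = 30 #number of IID Poisson variables averaged in each sample mean
mu = sp.poisson.mean(lam) #true mean = lambda
sigma = sp.poisson.std(lam) #true standard deviation = sqrt(lambda)
XN = np.random.poisson(lam, size=(500, n)).mean(axis=1) #500 realizations of the sample mean
ZN = math.sqrt(n)*(XN - mu)/sigma #standardized sample means
plt.hist(ZN, 50, density=True, alpha=0.7, color="Navy", label="Sample Mean Distribution")
zs = np.linspace(-4, 4, 200)
plt.plot(zs, sp.norm.pdf(zs), linewidth=3, color='r', label="Std Normal PDF")
plt.legend(loc=2)
plt.show()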
As per the Central Limit Theorem, the distribution of the sample mean, after being standardized, converges to the Standard Normal distribution as n approaches infinity. This is not a very intuitive result, and yet it turns out to be true. The proof of the CLT works with the moment generating function (or characteristic function) of the standardized sample mean, which converges to that of the Standard Normal (refer to the lectures in the references for a detailed proof). Since the proof involves more advanced statistics and calculus, it is not covered here.
It is important to remember that the CLT is applicable for 1) independent, 2) identically distributed variables with 3) a large sample size. What counts as "large" is an open-ended question, but a sample size of roughly 30 or more is commonly taken as acceptable; the larger the sample size, the better. In some cases, even if the variables are not strictly independent, a very weak dependence can be approximated as independence for the purpose of analysis. In nature it may be difficult to find complete independence, as there may be unaccounted-for externalities, but if there is no clear, strong dependence, independence is usually assumed. That said, the researcher must proceed with caution and weigh each situation on its own merits.
The Standard Normal distribution has many nice properties which simplify calculations and give intuitive results for experimental analysis. The bell curve is symmetric, centered around zero, has standard deviation 1, is unimodal and is overall easy to read. The Z tables make it easy to relate values to the CDF. No wonder the CLT finds wide application in research, exploratory data analysis and machine learning.
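For instance, the usual Z-table lookups can be reproduced directly with scipy.stats (a small standalone example, not tied to the simulations above):
import scipy.stats as sp
print(sp.norm.cdf(1.96)) #about 0.975: P(Z <= 1.96)
print(sp.norm.cdf(1) - sp.norm.cdf(-1)) #about 0.68: probability within one standard deviation of the mean
print(sp.norm.ppf(0.975)) #about 1.96: the inverse lookup (quantile function)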
References:
Law of Large Numbers and Central Limit Theorem | Statistics 110
John Tsitsiklis, and Patrick Jaillet. RES.6-012 Introduction to Probability. Spring 2018. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.