Introduction

In this document, we will explore six important statistical distributions: the Binomial, Normal, Poisson, Chi-Square, F, and t distributions. Each section includes theoretical explanations, visualizations, and examples using R. Additionally, we will discuss the Central Limit Theorem (CLT) and illustrate with figures how each distribution behaves under it. We will also summarize the features of these distributions in a comparison table and visualize their relationships.


1. Binomial Distribution

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n \]

Where:
- \(n\) is the number of trials,
- \(p\) is the probability of success in each trial,
- \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) is the binomial coefficient.

The Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials with the same probability of success.

Parameters:
- n: Number of trials
- p: Probability of success

R Example: The following code simulates 1000 experiments, where each experiment consists of 10 trials with success probability 0.5:

# Parameters
n <- 10
p <- 0.5

# Generate binomial data
binom_data <- rbinom(1000, size = n, prob = p)

# Plot
hist(binom_data, breaks = 10, main = "Binomial Distribution", xlab = "Number of Successes", col = "skyblue")
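
As a quick sanity check, we can compare the simulated frequencies with the theoretical PMF via dbinom (a minimal sketch; the object names are illustrative):

# Compare empirical frequencies with the theoretical PMF
emp <- table(factor(binom_data, levels = 0:n)) / length(binom_data)
theo <- dbinom(0:n, size = n, prob = p)
round(cbind(empirical = as.numeric(emp), theoretical = theo), 3)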

# 1.1 Galton Board
# Set the number of simulations (balls) and trials (layers)
k <- 10000  # number of balls (simulations)
n <- 5      # number of trials (layers)
p <- 0.5    # probability of success
# Generate binomial data and plot histogram for 5 trials with probability 0.5
x <- rbinom(k, n, p)
hist(x, main = "Binomial Distribution: n=5, p=0.5", xlab = "Number of Successes", col = "lightblue", border = "black")

# 1.2 Different Probability of Success
# Change parameters for another binomial distribution (200 trials, probability 0.4)
p <- 0.4  # update probability
n <- 200  # update number of trials
x <- rbinom(k, n, p)
hist(x, main = "Binomial Distribution: n=200, p=0.4", xlab = "Number of Successes", col = "lightgreen", border = "black")

# 1.3 Standardization
# Standardizing the data (apply Z-score transformation)
mean <- n * p               # Expected mean of the binomial distribution
var <- n * p * (1 - p)      # Variance of the binomial distribution
z <- (x - mean) / sqrt(var)  # Standardize the data
# Plot the standardized data
hist(z, main = "Standardized Binomial Distribution", xlab = "Z-Score", col = "lightcoral", border = "black")
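
To see how close the standardized counts are to the standard normal, we can redraw the histogram on the density scale and overlay the dnorm curve (a short sketch using the same z as above):

# Overlay the standard normal density on the standardized histogram
hist(z, freq = FALSE, main = "Standardized Binomial vs. N(0, 1)",
     xlab = "Z-Score", col = "lightcoral", border = "black")
curve(dnorm(x), from = -4, to = 4, add = TRUE, lwd = 2, col = "darkblue")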

# 1.4 Plot on Density
# Generate and plot the density of the standardized data
d <- density(z)
par(mfrow = c(2, 1), mar = c(3, 4, 1, 1))  # Adjust the layout for two plots
# Plot density
plot(d, main = "Density of Standardized Data", xlab = "Z-Score")
# Add a shaded area to the density plot for visual effect
plot(d, main = "Density of Standardized Data (Polygon)", xlab = "Z-Score", col = "darkblue")
polygon(d, col = "red", border = "blue")  # Shade the area under the density curve (the area sums to 1)


2. Normal Distribution

\[ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \]

Where:
- \(\mu\) is the mean,
- \(\sigma^2\) is the variance.

The normal distribution is a continuous distribution characterized by its mean (μ) and standard deviation (σ).

Parameters:
- mean: The mean of the distribution
- sd: The standard deviation

R Example: Let us generate 1000 random numbers from a normal distribution with a mean of 0 and a standard deviation of 1:

# Parameters
mean <- 0
sd <- 1

# Generate normal data
norm_data <- rnorm(1000, mean = mean, sd = sd)

# Plot
hist(norm_data, breaks = 30, main = "Normal Distribution", xlab = "Value", col = "lightgreen")
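
We can also check the empirical 68-95-99.7 rule against pnorm (a quick sketch on the sample just generated):

# Proportion of samples within 1, 2, and 3 standard deviations of the mean
for (s in 1:3) {
  emp <- sum(abs(norm_data - mean) <= s * sd) / length(norm_data)
  theo <- pnorm(s) - pnorm(-s)
  cat("within", s, "sd: empirical =", round(emp, 3),
      ", theoretical =", round(theo, 3), "\n")
}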

# 2.1 Generating a random sample from a normal distribution
# with k data points, mean, and variance
x = rnorm(k, mean = mean, sd = sqrt(var))
# Plotting the histogram of the normal distribution sample
hist(x, main = "Histogram of Normal Distribution", xlab = "Value", col = "lightblue", border = "black")

# 2.2 --- Binomial Distribution vs. Normal Distribution ---
par(mfrow = c(2, 1), mar = c(3, 4, 1, 1))  # Adjust the layout for two plots
# Generating a random sample from a binomial distribution with k trials, probability p of success per trial, and number of trials n
x = rbinom(k, n, p)
# Generating the density estimate of the binomial sample
d = density(x)
# Plotting the density estimate of the binomial distribution
plot(d, main = "Density of Binomial Distribution", xlab = "Value", col = "red")
# Calculating the mean and variance for the normal distribution based on the parameters of the binomial distribution
mean = n * p  # mean of binomial distribution
var = n * p * (1 - p)  # variance of binomial distribution
# Generating a random sample from a normal distribution with the same mean and variance as the binomial distribution
x = rnorm(k, mean = mean, sd = sqrt(var))
# Generating the density estimate of the normal sample
d = density(x)
# Plotting the density estimate of the normal distribution
plot(d, main = "Density of Normal Distribution (Approximated from Binomial)", xlab = "Value", col = "blue")


3. Poisson Distribution

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots \]

Where:
- \(\lambda > 0\) is the rate parameter (both the mean and the variance).

The Poisson distribution models the number of events occurring in a fixed interval of time or space, given an average rate (λ).

Parameter:
- lambda: The average number of events in the interval

R Example: Simulate the number of calls received at a call center in one hour, assuming an average of 5 calls per hour:

# Parameter
lambda <- 5

# Generate Poisson data
pois_data <- rpois(1000, lambda = lambda)

# Plot
hist(pois_data, breaks = 10, main = "Poisson Distribution", xlab = "Number of Calls", col = "coral")
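
Because λ is both the mean and the variance of a Poisson variable, a quick numerical check on the simulated data is instructive (a minimal sketch):

# The sample mean and sample variance should both be close to lambda
cat("mean =", mean(pois_data), ", variance =", var(pois_data), "\n")
# Compare a few empirical probabilities with the theoretical PMF
emp <- table(factor(pois_data, levels = 0:10)) / length(pois_data)
round(cbind(empirical = as.numeric(emp), theoretical = dpois(0:10, lambda)), 3)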

# 3.1 Poisson Distribution
# Set up a 2x2 plotting layout and adjust margins
par(mfrow = c(2, 2), mar = c(3, 4, 1, 1))

# Define a vector of lambda values for the Poisson distribution
lambdas = c(0.5, 1, 5, 10)

# Loop through each lambda value, generate a Poisson distribution sample, and plot its histogram
for (lambda in lambdas) {
  x = rpois(k, lambda)
  hist(x, main = paste("Poisson Distribution λ =", lambda), col = "lightblue", border = "black")
}

# 3.2 The Poisson as a Limit of the Binomial
# Set up a 3x3 plotting layout and adjust margins
par(mfrow = c(3, 3), mar = c(3, 4, 1, 1))

# Parameters for the binomial simulation
k = 10000  # Number of samples (simulations)
p = c(0.5, 0.05, 0.005)  # Probabilities of success
n = c(10, 100, 1000)  # Numbers of trials

# Generate histograms for combinations of p and n
for (pi in p) {
  for (ni in n) {
    x = rbinom(k, ni, pi)
    hist(x, breaks = "Scott", 
         main = paste("n =", ni), 
         ylab = paste("p =", pi), 
         col = "lightblue", border = "black")
  }
}

# Additional histogram for Poisson distribution
lambda = 5
x = rpois(k, lambda)
hist(x, breaks = "Scott", 
     main = "Poisson Distribution (λ = 5)", 
     col = "lightgreen", border = "black")
# The "Scott" method selects the bin width automatically based on the data's spread; see ?hist for details.


4. Chi-Square Distribution

\[ f(x) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}, \quad x > 0 \]

Where:
- \(k\) is the degrees of freedom,
- \(\Gamma\) is the gamma function.

The chi-square distribution is used in hypothesis testing and confidence interval estimation for variance, particularly for categorical data.

Parameter:
- df: Degrees of freedom

R Example: Generate 1000 random values from a chi-square distribution with 5 degrees of freedom:

# Parameter
df <- 5

# Generate chi-square data
chisq_data <- rchisq(1000, df = df)

# Plot
hist(chisq_data, breaks = 30, main = "Chi-Square Distribution", xlab = "Values", col = "plum")
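
To connect the chi-square distribution with categorical data, here is a minimal goodness-of-fit sketch using chisq.test; the observed counts are made up purely for illustration:

# Hypothetical die rolls: test whether the die is fair
observed <- c(18, 22, 16, 14, 19, 11)  # made-up counts for illustration
chisq.test(observed, p = rep(1/6, 6))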

# 4.1 Random Normal Distribution and Transformations 
par(mfrow = c(2, 1), mar = c(3, 4, 1, 1))  
k = 10000
x = rnorm(k, 0, 1)
hist(x, main = "Histogram of Standard Normal Distribution", col = "lightblue", border = "black")

y = x^2
hist(y, main = "Histogram of Squared Normal Distribution", col = "lightgreen", border = "black")

# 4.2 Sum of Squares of Independent Normal Variables 
par(mfrow = c(2, 1), mar = c(3, 4, 1, 1)) 
x1 = rnorm(k, 0, 1)
x2 = rnorm(k, 0, 1)
y = x1^2 + x2^2
cat("Mean:", mean(y), "\n")
## Mean: 2.002868
cat("Variance:", var(y), "\n")
## Variance: 3.991971
hist(y, main = "Histogram of Sum of Squares (n=2)", col = "lightpink", border = "black")
plot(density(y), main = "Density of Sum of Squares (n=2)", col = "red", lwd = 2)

# 4.3 Chi-square Distribution Simulation 
par(mfrow = c(3, 1), mar = c(3, 4, 1, 1))  # Setup for 3 plots
n = c(2, 5, 100)  # Different degrees of freedom
for (ni in n) {
  x = rnorm(k * ni, 0, 1)^2  # Generate squared random variables
  xm = matrix(x, k, ni)  # Reshape into a matrix of k rows and ni columns
  y = rowSums(xm)  # Sum along rows to get chi-square values
  cat("For n =", ni, ": mean =", mean(y), ", variance =", var(y), "\n")
  hist(y, breaks = "Scott", main = paste("Chi-square Distribution, n =", ni), 
       col = "lightblue", border = "black")
}
## For n = 2 : mean = 1.96542 , variance = 4.001104
## For n = 5 : mean = 5.041031 , variance = 10.18571
## For n = 100 : mean = 100.1561 , variance = 206.3854

# 4.4 Chi-square Distribution Using Built-in Function 
par(mfrow = c(2, 2), mar = c(3, 4, 1, 1))  # Setup for 2x2 plots
dfs = c(2, 5, 100, 1000)  # Degrees of freedom
for (df in dfs) {
  x = rchisq(k, df)  # Generate chi-square samples
  d = density(x)     # Calculate density
  plot(d, main = paste("Chi-square Distribution (df =", df, ")"), 
       col = "blue", lwd = 2)
}


5. F Distribution

\[ f(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x \, B\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}, \quad x > 0 \]

Where:
- \(d_1\) and \(d_2\) are the numerator and denominator degrees of freedom,
- \(B\) is the beta function.

The F distribution arises as the ratio of two independent chi-square variables, each divided by its degrees of freedom, and is used when comparing variances. It is most commonly encountered in ANOVA.

Parameters:
- df1: Degrees of freedom for the numerator
- df2: Degrees of freedom for the denominator

R Example: Generate 1000 random values from an F distribution with 5 and 10 degrees of freedom:

# Parameters
df1 <- 5 
df2 <- 10

# Generate F distribution data
f_data <- rf(1000, df1 = df1, df2 = df2)

# Plot
hist(f_data, breaks = 30, main = "F Distribution", xlab = "Value", col = "gold")
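
As a usage sketch, var.test compares the variances of two samples and reports an F statistic; the data below are simulated purely for illustration:

# F test of equal variances on two simulated normal samples
set.seed(1)  # for reproducibility of this sketch
g1 <- rnorm(30, mean = 0, sd = 1)
g2 <- rnorm(30, mean = 0, sd = 1.5)
var.test(g1, g2)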

# 5.1 F Distribution
# Set up a 2x2 plotting layout and adjust margins
par(mfrow = c(2, 2), mar = c(3, 4, 1, 1))

# Parameters for the F-distribution
k = 10000  # Number of samples
df1_values = c(1, 1, 10, 10000)  # Degrees of freedom for the numerator
df2_values = c(100, 10000, 10000, 10000)  # Degrees of freedom for the denominator

# Loop through the parameter pairs and generate histograms
for (i in seq_along(df1_values)) {
  df1 = df1_values[i]
  df2 = df2_values[i]
  x = rf(k, df1, df2)  # Generate F-distribution samples
  hist(x, breaks = "Scott", 
       main = paste("F-distribution (df1 =", df1, ", df2 =", df2, ")"), 
       col = "lightblue", border = "black")
}


6. t Distribution

\[ f(x) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi} \, \Gamma\left(\frac{k}{2}\right)} \left(1 + \frac{x^2}{k}\right)^{-\frac{k+1}{2}}, \quad x \in \mathbb{R} \]

Where:
- \(k\) is the degrees of freedom,
- \(\Gamma\) is the gamma function.

The t distribution is used when estimating the mean of a normally distributed population with a small sample size.

Parameter:
- df: Degrees of freedom

R Example: Generate 1000 random values from a t distribution with 10 degrees of freedom:

# Parameter
df <- 10

# Generate t distribution data
t_data <- rt(1000, df = df)

# Plot
hist(t_data, breaks = 30, main = "t Distribution", xlab = "Value", col = "dodgerblue")
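
A minimal usage sketch: a one-sample t test on a small simulated sample (the data and the hypothesized mean are made up for illustration):

# One-sample t test on a small sample
set.seed(2)
small_sample <- rnorm(12, mean = 5, sd = 2)  # hypothetical measurements
t.test(small_sample, mu = 5)  # test H0: true mean = 5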

# 6.1 t distribution in R
# Set up a 2x2 plotting layout and adjust margins
par(mfrow = c(2, 2), mar = c(3, 4, 1, 1))

# Parameters for the t-distribution
k = 10000  # Number of samples
df_values = c(2, 5, 10, 100)  # Degrees of freedom

# Loop through each degree of freedom and generate histograms
for (df in df_values) {
  x = rt(k, df)  # Generate t-distribution samples
  hist(x,  breaks = "Scott", 
       main = paste("t-Distribution (df =", df, ")"), 
       col = "lightblue", border = "black")
}

# 6.2 Relationship between t and F 
# Set up a 2x1 plotting layout and adjust margins
par(mfrow = c(2, 1), mar = c(3, 4, 1, 1))

# Parameters
k = 10000  # Number of samples

# Plot 1: F-distribution with df1 = 1 and df2 = 100
x = rf(k, df1 = 1, df2 = 100)
hist(x, breaks = "Scott", 
     main = "F-Distribution (df1 = 1, df2 = 100)", 
     col = "lightblue", border = "black")

# Plot 2: Squared t-distribution with df = 100
x = rt(k, df = 100)  # Generate t-distribution samples
z = x^2              # Square the t-distribution samples
hist(z, breaks = "Scott", 
     main = "Squared t-Distribution (df = 100)", 
     col = "lightgreen", border = "black")
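
The two plots above illustrate a standard identity: if \(T \sim t_k\), then

\[ T^2 \sim F(1, k), \]

which is why the squared t sample with \(k = 100\) matches the F(1, 100) histogram.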

Central Limit Theorem (CLT)

The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the original population’s distribution.
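
Formally, if \(X_1, \dots, X_n\) are independent and identically distributed with mean \(\mu\) and finite variance \(\sigma^2\), then

\[ \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty. \]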

CLT Illustration

We will generate random samples from different distributions and demonstrate how their sample means behave under the CLT.

# First: Generate and plot different distributions

# Set up the plotting layout for the first 5 plots
par(mfrow = c(5, 1), mar = c(3, 4, 1, 1))

# Number of simulations (samples)
k = 10000

# Binomial Distribution
p = 0.05  # probability of success
n = 100   # number of trials
x = rbinom(k, n, p)  # Generate binomial random variables
d = density(x)  # Estimate the density of the binomial data
plot(d, main = "Binomial Distribution")

# Poisson Distribution
lambda = 10  # mean of the Poisson distribution
x = rpois(k, lambda)  # Generate Poisson random variables
d = density(x)  # Estimate the density of the Poisson data
plot(d, main = "Poisson Distribution")

# Chi-Square Distribution
x = rchisq(k, 5)  # Generate chi-square random variables with 5 degrees of freedom
d = density(x)  # Estimate the density of the chi-square data
plot(d, main = "Chi-square Distribution")

# F Distribution
x = rf(k, 10, 10000)  # Generate F-distributed random variables
d = density(x)  # Estimate the density of the F-distribution data
plot(d, main = "F Distribution")

# t Distribution
x = rt(k, 5)  # Generate t-distributed random variables with 5 degrees of freedom
d = density(x)  # Estimate the density of the t-distribution data
plot(d, main = "t Distribution")

# Now apply Central Limit Theorem (averaging the random variables) and plot

# Function to apply Central Limit Theorem by averaging values from a distribution
i2mean <- function(x, n = 10) {
  k = length(x)  # total number of samples
  nobs = k / n    # number of groups (observations)
  xm = matrix(x, nobs, n)  # reshape the data into a matrix with 'n' columns
  y = rowMeans(xm)  # compute the group means; by the CLT these are approximately normal
  return(y)  # return the row means (averages)
}

# Apply i2mean to each distribution and plot
par(mfrow = c(5, 1), mar = c(3, 4, 1, 1))  # Reset plotting layout

# Binomial Distribution (CLT applied)
x = i2mean(rbinom(k, n, p))  # Apply Central Limit Theorem to Binomial distribution
d = density(x)  # Estimate the density of the averaged data
plot(d, main = "Binomial Distribution (CLT Applied)")

# Poisson Distribution (CLT applied)
x = i2mean(rpois(k, lambda))  # Apply Central Limit Theorem to Poisson distribution
d = density(x)  # Estimate the density of the averaged data
plot(d, main = "Poisson Distribution (CLT Applied)")

# Chi-Square Distribution (CLT applied)
x = i2mean(rchisq(k, 5))  # Apply Central Limit Theorem to Chi-Square distribution
d = density(x)  # Estimate the density of the averaged data
plot(d, main = "Chi-square Distribution (CLT Applied)")

# F Distribution (CLT applied)
x = i2mean(rf(k, 10, 10000))  # Apply Central Limit Theorem to F distribution
d = density(x)  # Estimate the density of the averaged data
plot(d, main = "F Distribution (CLT Applied)")

# t Distribution (CLT applied)
x = i2mean(rt(k, 5))  # Apply Central Limit Theorem to t distribution
d = density(x)  # Estimate the density of the averaged data
plot(d, main = "t Distribution (CLT Applied)")

Key Insights:

  1. Regardless of the original distribution, the sampling distribution of the sample mean becomes approximately normal as sample size increases.
  2. This property forms the basis for many inferential statistics methods.

An Optimized Version of the CLT Illustration
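
The five generate/plot blocks above repeat the same pattern, so they can be collapsed into a loop over a list of generator functions. This is one possible sketch; the gens list and its entry names are our own, and it reuses k and i2mean from above:

# Loop over a named list of generators, plotting the raw density
# next to the density of the group means (CLT applied)
gens <- list(
  Binomial     = function(k) rbinom(k, 100, 0.05),
  Poisson      = function(k) rpois(k, 10),
  "Chi-square" = function(k) rchisq(k, 5),
  F            = function(k) rf(k, 10, 10000),
  t            = function(k) rt(k, 5)
)
par(mfrow = c(5, 2), mar = c(3, 4, 1, 1))
for (nm in names(gens)) {
  x <- gens[[nm]](k)
  plot(density(x), main = paste(nm, "Distribution"))
  plot(density(i2mean(x)), main = paste(nm, "Distribution (CLT Applied)"))
}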

Distribution Comparison Table

Comparison of Key Statistical Distributions

| Distribution | Symmetrical | Parameters | Applications             |
|--------------|-------------|------------|--------------------------|
| Binomial     | No          | n, p       | Modeling binary outcomes |
| Normal       | Yes         | mean, sd   | Continuous data          |
| Poisson      | No          | lambda     | Event counts             |
| Chi-Square   | No          | df         | Variance testing         |
| F            | No          | df1, df2   | Variance ratio testing   |
| t            | Yes         | df         | Small sample analysis    |

Relationships Between Distributions

Below is a conceptual diagram showing relationships among the distributions.

Binomial and Poisson distributions converge to the Normal distribution under the Central Limit Theorem (CLT). The Chi-Square distribution arises as the sum of squared standard normal variables. The t-distribution and F-distribution are derived from the Normal and Chi-Square distributions. The F-distribution is formed as the ratio of two independent Chi-Square distributions scaled by their degrees of freedom.
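
These relationships can be written compactly. If \(Z, Z_1, \dots, Z_k \sim N(0, 1)\) are independent and \(\chi^2_k\) denotes a chi-square variable with \(k\) degrees of freedom (independent of \(Z\)), then

\[ \chi^2_k = \sum_{i=1}^{k} Z_i^2, \qquad t_k = \frac{Z}{\sqrt{\chi^2_k / k}}, \qquad F(d_1, d_2) = \frac{\chi^2_{d_1} / d_1}{\chi^2_{d_2} / d_2}. \]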

# install.packages("DiagrammeR") # Uncomment this line if installing for the first time

library(DiagrammeR)

grViz("
digraph relationships {
  graph [layout = dot, rankdir = LR]

  # Nodes with different fill and border colors
  binomial [label = 'Binomial', shape = oval, style=filled, fillcolor=lightblue, color=blue]
  poisson [label = 'Poisson', shape = oval, style=filled, fillcolor=lightseagreen, color=green]
  normal [label = 'Normal (N)', shape = oval, style=filled, fillcolor=gold, color=orange]
  chisquare [label = 'Chi-Square (X²)', shape = oval, style=filled, fillcolor=plum, color=purple]
  f [label = 'F', shape = oval, style=filled, fillcolor=coral, color=red]
  t [label = 't', shape = oval, style=filled, fillcolor=dodgerblue, color=darkblue]

  # Connections with customized edge colors
  binomial -> normal [label = 'CLT', color=blue]
  poisson -> normal [label = 'CLT', color=green]
  chisquare -> normal [label = 'CLT', color=purple]
  f -> normal [label = 'Approximates', color=red]
  t -> normal [label = 'Approximates', color=dodgerblue]
  chisquare -> f [label = 'Ratio of X²', color=orange]
  normal -> t [label = 't from Z', color=darkblue]
  normal -> chisquare [label = 'Sum of Z²', color=gold]
}
")

Conclusion

In this document, we explored six important statistical distributions, provided examples with R, and demonstrated the Central Limit Theorem. We also included a comparison table and a conceptual diagram to visualize the relationships among these distributions. Understanding these distributions and their properties is essential for data analysis and statistical modeling.