Key Takeaways
- Probability theory rests on Kolmogorov's three axioms: non-negativity (P(E) ≥ 0), normalization (P(Ω) = 1), and countable additivity for mutually exclusive events.
- The binomial distribution Bin(n,p) gives P(K=k) = C(n,k) p^k (1-p)^{n-k}, with mean np and variance np(1-p).
- The normal distribution N(μ,σ²) underlies the 68-95-99.7 rule; P(Z ≤ 1.96) ≈ 0.975 for the standard normal yields the familiar 95% confidence interval.
- The Central Limit Theorem explains the normal distribution's ubiquity: standardized sums of i.i.d. variables with finite variance converge to N(0,1), with the Berry-Esseen theorem bounding the approximation error.
- Surprising results such as the birthday problem (≈ 50.7% chance of a shared birthday among 23 people) and the Monty Hall problem (switching wins 2/3 of the time) show how probability defies intuition.
This blog post explores probability foundations, key distributions, theorems, and surprising real-world applications.
Advanced Theorems
- Central Limit Theorem (CLT): the standardized sum of i.i.d. random variables with finite variance converges in distribution to N(0,1).
- Lindeberg-Lévy CLT requires i.i.d. mean μ, var σ²>0, S_n* = (S_n - nμ)/(σ√n) → N(0,1).
- Berry-Esseen theorem bounds the CLT approximation error: |F_n(x) - Φ(x)| ≤ C ρ / (σ^3 √n), with best known constant C ≤ 0.4748.
- Weak Law of Large Numbers (LLN): the sample mean converges in probability to μ for i.i.d. variables with finite mean.
- Strong LLN (Kolmogorov): for i.i.d. variables with finite mean, P( lim \bar{X}_n = μ ) = 1, i.e., almost-sure convergence.
- Glivenko-Cantelli theorem: uniform convergence of empirical CDF to true CDF almost surely.
- Donsker's theorem for functional CLT: empirical process → Brownian bridge in Skorokhod space.
- Hoeffding's inequality: for bounded i.i.d., P(|\bar{X}-μ| ≥ t) ≤ 2 exp(-2 n t^2 / (b-a)^2).
- Chernoff bound: for S_n ~ Bin(n,p) and q > p, P(S_n ≥ qn) ≤ exp(-n D(q||p)), where D is the KL divergence between Bernoulli parameters.
- Markov's inequality: P(X ≥ a) ≤ E[X]/a for non-negative X, a>0.
- Chebyshev's inequality: P(|X-μ| ≥ kσ) ≤ 1/k^2, distribution-free bound.
- Cramér's theorem gives a large deviation principle for i.i.d. sums: tail probabilities P(\bar{X}_n ≥ a) decay exponentially, at a rate given by the Legendre transform of the log moment generating function.
- Stein's method bounds distances between distributions, yielding normal-approximation errors of order 1/√n.
- Pólya's urn: under reinforcement sampling, the limiting fraction of each color is Beta-distributed, yielding beta-binomial counts.
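The Lindeberg-Lévy statement above is easy to check numerically. The following is a minimal Python sketch (standard library only, illustrative parameters) that standardizes sums of Uniform(0,1) draws and measures how often S_n* lands inside the central 95% band of N(0,1):

```python
import math
import random

random.seed(42)

# Uniform(0,1) has mu = 0.5 and sigma^2 = 1/12; the CLT says the
# standardized sum S_n* = (S_n - n*mu) / (sigma*sqrt(n)) is near-normal.
mu, sigma = 0.5, math.sqrt(1 / 12)
n, trials = 100, 10_000

hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))  # S_n* from the CLT statement
    if abs(z) <= 1.96:
        hits += 1

coverage = hits / trials
print(f"empirical P(|S_n*| <= 1.96) = {coverage:.3f}")  # close to 0.95
```

With n = 100 the empirical coverage already sits within a fraction of a percent of the normal prediction, consistent with the 1/√n error rate from Berry-Esseen.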
Applications and Examples
- Birthday problem: P(at least one shared birthday in 23 people) ≈ 0.5073 for 365 days.
- Monty Hall problem: switching doors gives 2/3 probability of winning car.
- In 52-card deck, P(royal flush in 5 cards) = 4 / 2,598,960 ≈ 0.000154%.
- Extending the birthday problem: with 20 people the shared-birthday probability is ≈ 41.1%, still short of the 50% threshold first crossed at 23.
- Gambler's ruin: with win probability p ≠ q = 1-p, starting capital i, and target N, P(reach N before 0) = (1-(q/p)^i)/(1-(q/p)^N); for p = q = 1/2 it is simply i/N.
- Buffon's needle: P(needle crosses a line) = 2l/(π d) for needle length l ≤ line spacing d, enabling Monte Carlo estimates of π.
- In craps, the pass-line bet wins with probability 244/495 ≈ 49.29%, a house edge of about 1.41%.
- Boy or Girl paradox: P(both boys | at least one boy) = 1/3, yet P(both boys | at least one boy born on a Monday) = 13/27 ≈ 0.481.
- Sleeping Beauty problem: on awakening, "halfers" assign P(heads) = 1/2 while "thirders" assign 1/3.
- After observing 100 heads in 100 flips, the posterior probability that the coin is fair is tiny under any reasonable Beta prior over the bias; the evidence updates belief strongly.
- In election polling, margin of error for n=1000, p=0.5 is ≈3.1% at 95% confidence via normal approx.
- Netflix Prize: probabilistic models of user ratings drove test RMSE down to roughly 0.857, a 10% improvement over Netflix's own Cinematch system.
- In quality control, an AQL of 1.0% means lots containing 1% defectives are accepted with high probability, typically around 95%.
- DNA match probability: for 13 STR loci, random match 1 in 10^18 for Caucasians.
- In machine learning, VC-dimension generalization bounds show the probability of large generalization error shrinks as sample size grows relative to model capacity.
- P(airplane crash per flight) ≈1 in 11 million for commercial jets 2008-2017.
- In insurance, Poisson claims with λ=2, P(no claims)=e^{-2}≈0.1353.
- Black Monday (1987): the 22.6% one-day drop was a tail event far beyond what a normal model of daily returns allows, often cited as a 20+ sigma move.
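Several of the figures above can be verified by direct computation. The sketch below (plain Python, no external data) computes the exact birthday-collision probability by multiplying the "all birthdays distinct" factors:

```python
# Exact shared-birthday probability: 1 minus the product of
# "next person's birthday is still distinct" factors.
def p_shared_birthday(people: int, days: int = 365) -> float:
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(round(p_shared_birthday(23), 4))  # 0.5073 -- first group size past 50%
print(round(p_shared_birthday(20), 4))  # 0.4114 -- still below the threshold
```

The same loop generalizes to hash-collision estimates by swapping 365 for the size of the hash space.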
Continuous Distributions
- The normal distribution N(μ,σ²) has density φ(x) = (1/(σ√(2π))) exp(-(x-μ)^2/(2σ²)).
- Standard normal Z~N(0,1) has P(Z ≤ 1.96) ≈ 0.975, used for 95% confidence intervals.
- 68-95-99.7 rule: ≈68% within 1σ, 95% within 2σ, 99.7% within 3σ of mean for normal.
- Exponential distribution Exp(λ) has pdf λ e^{-λx}, mean 1/λ, memoryless property P(X>s+t|X>s)=P(X>t).
- Uniform continuous U(a,b) has pdf 1/(b-a), mean (a+b)/2, variance (b-a)^2/12.
- Gamma distribution Γ(α,β) generalizes exponential (α=1), mean α/β, mode (α-1)/β for α>1.
- Chi-squared χ²(k) is Gamma(k/2,1/2), mean k, variance 2k, for sum of k standard normal squares.
- Student's t-distribution t(ν) has heavier tails than normal, converges as ν→∞, used in t-tests.
- F-distribution F(d1,d2): the ratio of two independent chi-squared variables each divided by its degrees of freedom; central in ANOVA; mean d2/(d2-2) for d2 > 2.
- Beta distribution Beta(α,β) on [0,1], mean α/(α+β), conjugate prior for binomial p.
- Lognormal ln(X)~N(μ,σ²), median e^μ, used for skewed positives like stock prices.
- Weibull(λ,k) models lifetimes, shape k=1 exponential, k>1 increasing hazard.
- Cauchy distribution has no mean or variance, heavy tails, pdf 1/[π(1+x²)].
- Logistic distribution symmetric, variance π²/3, cdf 1/(1+e^{-x}), sigmoid shape.
- Pareto distribution Type I: pdf α x_m^α / x^{α+1}, tail index α, for incomes/earthquakes.
- Inverse Gaussian μ,λ has mean μ, used in Brownian motion first passage times.
- Laplace distribution double exponential, median μ, heavier tails than normal.
- Rayleigh distribution for vector magnitude of normals, pdf (x/σ²) exp(-x²/(2σ²)).
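The normal-distribution facts in this list can be reproduced with the standard library's error function, since Φ(x) = (1 + erf(x/√2))/2. A short Python sketch:

```python
import math

# Standard normal CDF via the error function.
def phi(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# 68-95-99.7 rule: probability mass within k standard deviations.
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973

print(round(phi(1.96), 4))  # 0.975, the 95%-confidence-interval quantile
```

Note the rule's "95% within 2σ" is a round number; the exact two-sided mass at z = 2 is 95.45%, and the exact 95% quantile is 1.96.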
Discrete Distributions
- The binomial distribution Bin(n,p) gives the probability of exactly k successes in n independent Bernoulli trials: P(K=k) = C(n,k) p^k (1-p)^{n-k}.
- For Bin(10,0.5), the mode is 5 with P(K=5) ≈ 0.2461, highest probability mass at the mean.
- The expected value of Bin(n,p) is np, linear in trials, e.g., for n=100, p=0.3, E[X]=30.
- Variance of Bin(n,p) is np(1-p), maximized at p=0.5, e.g., Var=2.5 for n=10, p=0.5.
- Poisson approximation to Bin(n,p) is valid when n is large and p small, with λ=np; Le Cam's theorem bounds the total-variation error by np².
- Geometric distribution Geo(p) models trials until first success: P(X=k) = (1-p)^{k-1} p, for k=1,2,...
- Negative binomial NB(r,p) counts trials for r successes: mean r/p, variance r(1-p)/p^2.
- Hypergeometric distribution for sampling without replacement: P(K=k) = [C(K,k) C(N-K,n-k)] / C(N,n).
- For Hypergeometric N=52, K=13 hearts, n=5, P(exactly 2 hearts) ≈ 0.2743.
- Uniform discrete on {1..n} has P(X=k)=1/n, mean (n+1)/2, variance (n^2-1)/12.
- Bernoulli(p) is Bin(1,p), with P(X=1)=p, P(X=0)=1-p, simplest discrete distribution.
- Multinomial distribution generalizes binomial to k categories: P(n1,..nk) = [n! / (n1!..nk!)] p1^{n1}...pk^{nk}.
- Zipf's law follows discrete power-law: P(rank r) ∝ 1/r^s, s≈1 for word frequencies.
- Skellam distribution models difference of two Poissons: P(K=k|μ1,μ2) involves modified Bessel function.
- Binomial cumulative P(K≤k) for n=20,p=0.5,k=10 is ≈0.588, via tables or computation.
- Pascal distribution is the negative binomial with integer r; counting failures before the r-th success gives mean r(1-p)/p (versus r/p when counting trials).
- Delaporte distribution convolves gamma and negative binomial, used in insurance claims.
- Hermite distribution for sum of Poissons with Bernoulli thinning, mean μ, variance μ + θμ(1-θ).
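The binomial figures quoted above, such as the Bin(10, 0.5) mode probability and the Bin(20, 0.5) cumulative value, follow directly from the pmf. A minimal Python sketch using `math.comb`:

```python
from math import comb

# Binomial pmf: P(K = k) = C(n, k) p^k (1-p)^(n-k).
def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Binomial cdf by summing the pmf up to k.
def binom_cdf(k: int, n: int, p: float) -> float:
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

print(round(binom_pmf(5, 10, 0.5), 4))   # 0.2461, the mode of Bin(10, 0.5)
print(round(binom_cdf(10, 20, 0.5), 4))  # 0.5881, P(K <= 10) for Bin(20, 0.5)
print(10 * 0.5 * 0.5)                    # 2.5, the variance np(1-p) at n=10, p=0.5
```

For large n this direct summation becomes slow and numerically delicate; that is exactly where the Poisson and normal approximations listed above earn their keep.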
Foundational Concepts
- Kolmogorov's first axiom states that the probability of any event is a non-negative real number, ensuring P(E) ≥ 0 for all events E in the sample space.
- Kolmogorov's second axiom requires that the probability of the entire sample space is exactly 1, i.e., P(Ω) = 1, normalizing all probabilities.
- Kolmogorov's third axiom specifies that for any countable collection of mutually exclusive events, the probability of their union equals the sum of their individual probabilities.
- The classical probability definition assigns equal probability to each outcome in a finite equally likely sample space, as P(E) = |E| / |Ω|.
- Conditional probability is defined as P(A|B) = P(A ∩ B) / P(B) when P(B) > 0, quantifying updated probabilities given evidence.
- The law of total probability states that for a partition {B_i} of the sample space, P(A) = Σ P(A|B_i) P(B_i), decomposing probabilities over partitions.
- Independence of events A and B means P(A ∩ B) = P(A) P(B), implying that knowledge of one doesn't affect the other.
- The probability of the union of two events is P(A ∪ B) = P(A) + P(B) - P(A ∩ B), accounting for overlap via inclusion-exclusion.
- Bayes' theorem relates prior and posterior probabilities: P(A|B) = [P(B|A) P(A)] / P(B), fundamental for inference.
- The sample space Ω is the set of all possible outcomes of a random experiment, foundational to probability modeling.
- Events are subsets of the sample space, and the power set of Ω contains all possible events, with 2^|Ω| events for finite Ω.
- The addition rule for mutually exclusive events simplifies to P(∪ A_i) = Σ P(A_i), avoiding overlap corrections.
- Probability zero events are not necessarily impossible, as in continuous spaces where single points have P=0 but can occur.
- The frequentist interpretation defines probability as the long-run frequency limit of relative occurrences in repeated trials.
- Subjective probability reflects an individual's degree of belief, calibrated via betting odds or coherence axioms.
- The principle of indifference assigns equal probabilities to indistinguishable outcomes under insufficient information.
- Boole's inequality bounds the probability of union: P(∪ A_i) ≤ Σ P(A_i), useful for upper bounds.
- The probability of an empty event is always P(∅) = 0, a direct consequence of the axioms.
- Continuity of probability measures ensures limits of increasing events have P(lim A_n) = lim P(A_n).
- Sigma-additivity extends finite additivity to countable unions of disjoint events in modern probability theory.
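Bayes' theorem and the law of total probability combine naturally in practice. The sketch below works through a standard diagnostic-test example; the prevalence, sensitivity, and false-positive numbers are illustrative assumptions, not data:

```python
# Illustrative diagnostic-test numbers (assumed for the example):
prior = 0.01  # P(disease): 1% prevalence
sens = 0.99   # P(positive | disease): sensitivity
fpr = 0.05    # P(positive | no disease): false-positive rate

# Law of total probability gives the overall evidence P(positive).
p_positive = sens * prior + fpr * (1 - prior)

# Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive).
posterior = sens * prior / p_positive
print(round(posterior, 4))  # 0.1667: a positive test still leaves P(disease) at 1/6
```

The counterintuitive result, a 99%-sensitive test yielding only a one-in-six posterior, is driven entirely by the low prior: false positives from the healthy 99% of the population swamp the true positives.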