**Why do I need to learn about probability and statistics?**

Probability and statistics are fundamental tools for understanding many modern theories and techniques such as artificial intelligence, machine learning, deep learning, data mining, security, digital image processing and natural language processing.

**What can I do after finishing learning about probability and statistics?**

You will be prepared to learn modern theories and techniques to create modern security, machine learning, data mining, image processing or natural language processing software.

**That sounds useful! What should I do now?**

Please read

– this Dimitri P. Bertsekas and John N. Tsitsiklis (2008). Introduction to Probability. Athena Scientific book, or

– this Hossein Pishro-Nik (2014). Introduction to Probability, Statistics, and Random Processes. Kappa Research, LLC book.

Alternatively, please read these notes, then watch

– this MIT 6.041SC – Probabilistic Systems Analysis and Applied Probability, Fall 2011 course (Lecture Notes), and

– this MIT RES.6-012 – Introduction to Probability, Spring 2018 course (Lecture Notes).

Probability and statistics are quite difficult topics, so you may need to *learn them 2 or 3 times* using different sources to actually master the concepts.

**Terminology Review:**

- Sample Space (Ω): Set of possible outcomes.
- Event: Subset of the sample space.
- Probability Law: Law specified by giving the probabilities of all possible outcomes.
- Probability Model = Sample Space + Probability Law.
- Probability Axioms: Nonnegativity: P(A) ≥ 0; Normalization: P(Ω)=1; Additivity: If A ∩ B = Ø, then P(A ∪ B)= P(A)+ P(B).
- Conditional Probability: P(A|B) = P (A ∩ B) / P(B).
- Multiplication Rule.
- Total Probability Theorem.
- Bayes’ Rule: Given P(Aᵢ) (initial “beliefs”) and P(B|Aᵢ), find P(Aᵢ|B) = P(Aᵢ) · P(B|Aᵢ) / P(B) (revised “beliefs”, given that B occurred).
- Independence of Two Events: P(B|A) = P(B) or P(A ∩ B) = P(A) · P(B).
- Discrete Uniform Law: P(A) = Number of elements of A / Total number of sample points = |A| / |Ω|
- Basic Counting Principle: r stages, nᵢ choices at stage i, number of choices = n₁ n₂ · · · nᵣ
- Permutations: Number of ways of ordering n elements. No repetition for n slots: n · (n − 1) · (n − 2) ⋯ 1 = n!.
- Combinations: Number of k-element subsets of a given n-element set.
- Binomial Probabilities: P(any particular sequence) = p^(number of heads) · (1 − p)^(number of tails).
- Random Variable: A function from the sample space to the real numbers. It is not random. It is not a variable. It is a function: f: Ω ↦ ℝ.
- Discrete Random Variable.
- Bernoulli Random Variable (Indicator Random Variable): f: Ω ↦ {1, 0}.
- Probability Mass Function: P(X = 𝑥) or Pₓ(𝑥): A function from the possible values of X to [0, 1] that gives the probability that X equals 𝑥. A PMF gives probabilities, so 0 ≤ Pₓ(𝑥) ≤ 1, and all the values of a PMF must sum to 1.
- Geometric Random Variable: X = Number of coin tosses until first head.
- Geometric Probability Mass Function: P(X = k) = (1 − p)ᵏ⁻¹ · p, for k = 1, 2, 3, ...
- Binomial Random Variable: X = Number of heads (e.g. 2) in n (e.g. 4) independent coin tosses.
- Binomial Probability Mass Function: P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ, where C(n, k) = n! / (k! (n − k)!) is the number of k-element combinations of n elements.
- Expectation: E[X] = Sum of xpₓ(x).
- Let Y=g(X): E[Y] = E[g(X)] = Sum of g(x)pₓ(x). Caution: E[g(X)] ≠ g(E[X]) in general.
- Variance: var(X) = E[(X−E[X])²].
- var(aX)=a²var(X).
- If X and Y are independent: var(X+Y) = var(X) + var(Y). Caution: var(X+Y) ≠ var(X) + var(Y) in general.
- Standard Deviation: Square root of var(X).
- Conditional Probability Mass Function: P(X=x|A).
- Conditional Expectation: E[X|A].
- Joint Probability Mass Function: Pₓᵧ(x,y) = P(X=x, Y=y) = P((X=x) and (Y=y)).
- Marginal Probability Mass Function: Pₓ(x) = Σᵧ Pₓᵧ(x,y).
- Total Expectation Theorem: E[X] = Σᵧ Pᵧ(y) · E[X | Y = y].
- Independent Random Variables: P(X=x, Y=y)=P(X=x)·P(Y=y).
- Expectation of Multiple Random Variables: E[X + Y + Z] = E[X] + E[Y] + E[Z].
- Binomial Random Variable: X = Sum of Bernoulli Random Variables.
- The Hat Problem.
- Continuous Random Variables.
- Probability Density Function: fₓ(𝑥), defined so that P(a ≤ X ≤ b) = ∫ₐᵇ fₓ(𝑥) d𝑥. *(a ≤ X ≤ b)* means the function X produces a real number value *within the [a, b] range*. In programming terms: X(outcome) = 𝑥, where a ≤ 𝑥 ≤ b. A PDF does NOT give probabilities. A PDF does NOT have to be less than 1. A PDF gives probability per unit length. The total area under a PDF must be 1.
- Continuous Uniform Random Variable.
- Cumulative Distribution Function: Fₓ(b) = P(X ≤ b). *(X ≤ b)* means the function X produces a real number value *within the (−∞, b] range*. In programming terms: X(outcome) = 𝑥, where 𝑥 ≤ b.
- Normal Random Variable, Gaussian Distribution, Normal Distribution.
- Joint Probability Density Function.
- Conditional Probability Density Function.
- Marginal Probability Density Function.
- Derived Distributions.
- Convolution: A mathematical operation on two functions (f and g) that produces a third function.
- Covariance.
- Correlation Coefficient.
- Conditional Expectation: E[X | Y = y] = Sum of xpₓ|ᵧ(x|y). If Y is unknown then E[X | Y] is a random variable, i.e. a function of Y. So E[X | Y] also has its expectation and variance.
- Law of Iterated Expectations: E[E[X | Y]] = E[X].
- Conditional Variance: var(X | Y) is a function of Y.
- Law of Total Variance: var(X) = E[var(X | Y)] + var(E[X | Y]).
- Bernoulli Process: A sequence of independent Bernoulli trials. At each trial, i: P(Xᵢ=1)=p, P(Xᵢ=0)=1−p.
- Poisson Process.
- Markov Chain.
- Markov’s Inequality: P(X ≥ a) ≤ E[X]/a (X ≥ 0, a > 0).
- Chebyshev’s Inequality: P(|X − E[X]| ≥ a) ≤ var(X)/a² (a > 0).
- The Law of Large Numbers.
- Central Limit Theorem.
- Model Building: X = a·S + W, where W: noise, know S, assume W, observe X, find a.
- Inferring: Know a, assume W, observe X, find S.
- Hypothesis Testing: Know a, observe X, find S. S can take *one* of a few possible values.
- Estimation: Know a, observe X, find S. S can take unlimited possible values.
- Bayesian Inference can be used for both Hypothesis Testing and Estimation, and leverages Bayes’ rule. The output is a posterior distribution. A single answer can be the Maximum a Posteriori probability (MAP) estimate or the Conditional Expectation.
- Least Mean Squares Estimation of Θ based on X.
- Classical Inference can be used for both Hypothesis Testing and Estimation, leverages .
- Maximum Likelihood Estimation: Given data, the maximum likelihood estimate (MLE) for the *parameter* p is the *value* of p that maximizes the likelihood P(data | p). P(data | p) is the *likelihood function*. For continuous distributions, we use the probability density function to define the likelihood.
- Log likelihood: the natural log of the likelihood function.
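
To make the Total Probability Theorem and Bayes' Rule from the list above more concrete, here is a minimal Python sketch. The function name `posterior` and the numbers (a 1% prior and two made-up likelihoods) are assumptions chosen purely for illustration; they do not come from the books or courses.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for each i, given P(A_i) and P(B | A_i)."""
    # Total probability theorem: P(B) = sum over i of P(A_i) * P(B | A_i)
    p_b = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes' rule: P(A_i | B) = P(A_i) * P(B | A_i) / P(B)
    return [p * l / p_b for p, l in zip(priors, likelihoods)]


# Hypothetical numbers: A_1 has prior 0.01, A_2 has prior 0.99.
priors = [0.01, 0.99]
# Hypothetical likelihoods of observing B under A_1 and under A_2.
likelihoods = [0.95, 0.10]

post = posterior(priors, likelihoods)
print(post)  # revised "beliefs" after B occurred
```

Even though B is much more likely under A₁ than under A₂, the revised belief in A₁ stays below 10% because its prior is so small.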
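
The Binomial Probability Mass Function, Expectation and Variance entries above can be checked numerically. The sketch below uses the example from the list (n = 4 tosses) and assumes a fair coin (p = 0.5); the helper name `binomial_pmf` is hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X = number of heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 4, 0.5  # assumed: 4 tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Expectation: E[X] = sum of x * p_X(x)
mean = sum(k * pk for k, pk in enumerate(pmf))
# Variance: var(X) = E[(X - E[X])^2]
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(pmf[2], mean, variance)
```

For these values the PMF sums to 1, and the mean and variance agree with the closed forms n·p and n·p·(1 − p).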
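
The Marginal Probability Mass Function, Conditional Expectation and Law of Iterated Expectations entries above can be illustrated with a tiny joint PMF. The table `joint` below is invented for this sketch, and the helper name `cond_exp_x` is hypothetical.

```python
# Hypothetical joint PMF P(X = x, Y = y), keyed by (x, y).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Marginal PMF of Y: P_Y(y) = sum over x of P_XY(x, y)
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p


def cond_exp_x(y):
    """E[X | Y = y] = sum over x of x * P(X = x | Y = y)."""
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]


# Direct expectation: E[X] = sum of x * P_XY(x, y)
e_x = sum(x * p for (x, _), p in joint.items())
# Law of iterated expectations: E[E[X | Y]] = sum over y of P_Y(y) * E[X | Y = y]
e_of_cond = sum(p_y[y] * cond_exp_x(y) for y in p_y)

print(e_x, e_of_cond)
```

Both quantities agree up to floating-point rounding, which is what E[E[X | Y]] = E[X] predicts.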
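
As a rough illustration of Maximum Likelihood Estimation and the log likelihood, the sketch below estimates the head probability p of a coin. The data (7 heads, 3 tails) are made up, and the grid search is just one simple way to maximize the function, not the method used in the books or courses.

```python
from math import log


def log_likelihood(p, heads, tails):
    """Natural log of P(data | p) = p^heads * (1 - p)^tails."""
    return heads * log(p) + tails * log(1 - p)


heads, tails = 7, 3  # hypothetical observed data

# Candidate values of p strictly between 0 and 1, in steps of 0.001.
grid = [i / 1000 for i in range(1, 1000)]
# The MLE is the candidate that maximizes the (log) likelihood.
p_hat = max(grid, key=lambda p: log_likelihood(p, heads, tails))

print(p_hat)
```

Maximizing the log likelihood gives the same p as maximizing the likelihood because the log is an increasing function. For this model the grid result matches the closed-form estimate heads / (heads + tails).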

After finishing learning about probability and statistics please click Topic 20 – Discrete Mathematics to continue.