Topic 21 – Introduction to Computational Thinking

Why do I need to learn about computational thinking?

Computational thinking is a fundamental tool for understanding, implementing, and evaluating modern theories in artificial intelligence, machine learning, deep learning, data mining, security, digital image processing, and natural language processing.

What can I do after finishing learning about computational thinking?

You will be able to:

  • use a programming language to express computations,
  • apply systematic problem-solving strategies such as decomposition, pattern recognition, abstraction, and algorithmic thinking to turn an ambiguous problem statement into a computational solution method,
  • apply algorithmic and problem-reduction techniques,
  • use randomness and simulations to address problems that cannot be solved with closed-form solutions,
  • use computational tools, including basic statistical, visualization, and machine learning tools, to model and understand data.

These skills foster abstract thinking that enables you not only to use technology effectively but also to understand what is possible, recognize inherent trade-offs, and account for computational constraints that shape the software you design.

You will also be prepared to learn how to design and build compilers, operating systems, database management systems, and distributed systems.

That sounds useful! What should I do now?

First, please read this book to learn how to apply computational methods such as simulation, randomized algorithms, and statistical analysis to solve problems such as modeling disease spread, simulating physical systems, analyzing biological data, optimizing transportation, and designing communication networks: John V. Guttag (2021). Introduction to Computation and Programming using Python. 3rd Edition. The MIT Press.

Alternatively, if you want to gain the same concepts through interactive explanations, please audit the following courses:

After that, please read chapters 5 and 6 of the following book to learn about the theory of computing and how a machine performs computations: Robert Sedgewick and Kevin Wayne (2016). Computer Science – An Interdisciplinary Approach. Addison-Wesley Professional.

Alternatively, if you want to gain the same concepts through interactive explanations, please audit the following course: Computer Science: Algorithms, Theory, and Machines.

After that, please read the following book to learn what is going on “under the hood” of a computer system: Randal E. Bryant and David R. O’Hallaron (2015). Computer Systems: A Programmer’s Perspective. Pearson.

After that, please audit this course to learn how to build scalable and high-performance software systems: MIT 6.172 Performance Engineering of Software Systems, Fall 2018 (Lecture Notes).

Terminology Review:

  • Algorithms.
  • Fixed Program Computer, Stored Program Computer.
  • Computer Architecture.
  • Hardware or Computer Architecture Primitives, Programming Language Primitives, Theoretical or Computability Primitives
  • Mathematical Abstraction of a Computing Machine (Turing Machine, Abstract Device), Turing’s Primitives.
  • Programming Languages.
  • Expressions, Syntax, Static Semantics, Semantics, Variables, Bindings.
  • Programming vs. Math.
  • Programs.
  • Big O notation.
  • Optimization Models: Knapsack Problem.
  • Graph-Theoretic Models: Shortest Path Problems.
  • Simulation Models: Monte Carlo Simulation, Random Walk.
  • Statistical Models.
  • K-means Clustering.
  • k-Nearest Neighbors Algorithm.
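
To make the "Optimization Models: Knapsack Problem" entry above more concrete, here is a minimal sketch of the 0/1 knapsack problem solved with dynamic programming. The function name, the item values and weights, and the capacity are invented for this illustration; they are not taken from the books or courses listed above.

```python
def knapsack(items, capacity):
    """Return the best total value for a 0/1 knapsack.

    items is a list of (value, weight) pairs with integer weights;
    each item may be taken at most once.
    """
    # best[c] = best value achievable with total weight at most c
    best = [0] * (capacity + 1)
    for value, weight in items:
        # Go through capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


# Hypothetical example data: (value, weight)
items = [(60, 10), (100, 20), (120, 30)]
print(knapsack(items, 50))  # 220 (take the second and third items)
```

The running time grows with (number of items) × (capacity), which is a useful first exercise in Big O reasoning.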

After finishing computational thinking, please click on Topic 22 – Introduction to Machine Learning to continue.

 

Topic 19 – Probability & Statistics

Why do I need to learn about probability and statistics?

Probability and statistics are fundamental tools for understanding many modern theories and techniques such as artificial intelligence, machine learning, deep learning, data mining, security, digital image processing, and natural language processing.

What can I do after finishing learning about probability and statistics?

You will be prepared to learn modern theories and techniques to create modern security, machine learning, data mining, image processing or natural language processing software.

That sounds useful! What should I do now?

Please read one of the following books to grasp the core concepts of probability and statistics:

Alternatively, please read these notes first, and then audit the courses below if you would like to learn through interactive explanations:

Perhaps probability and statistics are among the most difficult topics in mathematics, so you may need to study them two or three times using different sources to truly master the concepts. For example, you may audit the course and read the books below to gain additional examples and intuition about the concepts:

Learning probability and statistics requires patience. However, the rewards will be worthwhile: you will be able to master AI algorithms more quickly and with greater confidence.

Terminology Review:

  • Sample Space (Ω): Set of possible outcomes.
  • Event: Subset of the sample space.
  • Probability Law: Law specified by giving the probabilities of all possible outcomes.
  • Probability Model = Sample Space + Probability Law.
  • Probability Axioms: Nonnegativity: P(A) ≥ 0; Normalization: P(Ω) = 1; Additivity: If A ∩ B = Ø, then P(A ∪ B) = P(A) + P(B).
  • Conditional Probability: P(A|B) = P(A ∩ B) / P(B), defined when P(B) > 0.
  • Multiplication Rule.
  • Total Probability Theorem.
  • Bayes’ Rule: Given P(Aᵢ) (initial “beliefs”) and P(B|Aᵢ), find P(Aᵢ|B) (revised “beliefs”, given that B occurred): P(Aᵢ|B) = P(Aᵢ) P(B|Aᵢ) / Σⱼ P(Aⱼ) P(B|Aⱼ).
  • The Monty Hall Problem: 3 doors, behind which are two goats and a car.
  • The Spam Detection Problem: “Lottery” word in spam emails.
  • Independence of Two Events: P(B|A) = P(B) or P(A ∩ B) = P(A) · P(B).
  • The Birthday Problem: P(Same Birthday of 23 People) > 50%.
  • The Naive Bayes Model: “Naive” refers to the assumption that the features are independent of each other given the class.
  • Discrete Uniform Law: P(A) = Number of elements of A / Total number of sample points = |A| / |Ω|
  • Basic Counting Principle: r stages, nᵢ choices at stage i, number of choices = n₁ n₂ · · · nᵣ
  • Permutations: Number of ways of ordering n elements. With no repetition, there are n choices for the first slot, n − 1 for the second, and so on down to 1: n · (n − 1) · (n − 2) · · · 1 = n!.
  • Combinations: number of k-element subsets of a given n-element set.
  • Binomial Probabilities: P(any particular sequence) = p^(# heads) · (1 − p)^(# tails).
  • Random Variable: A function from the sample space to the real numbers. It is not random. It is not a variable. It is a function: f: Ω → ℝ. A random variable is used to model the whole experiment at once.
  • Discrete Random Variables.
  • Probability Mass Function: P(X = 𝑥) or pₓ(𝑥): A function from the possible values of X to [0, 1] that gives the probability that X equals 𝑥. PMF gives probabilities. 0 ≤ PMF ≤ 1. All the values of PMF must sum to 1. PMF is used to model a discrete random variable.
  • Bernoulli Random Variable (Indicator Random Variable): f: Ω → {1, 0}. Only 2 outcomes: 1 and 0. p(1) = p and p(0) = 1 – p.
  • Binomial Random Variable: X = Number of successes in n trials. X = Number of heads in n independent coin tosses.
  • Binomial Probability Mass Function: pₓ(k) = C(n, k) pᵏ(1 − p)ⁿ⁻ᵏ, where C(n, k) is the number of k-element subsets of an n-element set.
  • Geometric Random Variable: X = Number of coin tosses until first head.
  • Geometric Probability Mass Function: pₓ(k) = (1 − p)ᵏ⁻¹p, for k = 1, 2, …
  • Expectation: E[X] = Σₓ x pₓ(x).
  • Let Y = g(X): E[Y] = E[g(X)] = Σₓ g(x) pₓ(x). Caution: E[g(X)] ≠ g(E[X]) in general.
  • Variance: var(X) = E[(X−E[X])²].
  • var(aX)=a²var(X).
  • If X and Y are independent: var(X+Y) = var(X) + var(Y). Caution: var(X+Y) ≠ var(X) + var(Y) in general.
  • Standard Deviation: Square root of var(X).
  • Conditional Probability Mass Function: P(X=x|A).
  • Conditional Expectation: E[X|A].
  • Joint Probability Mass Function: Pₓᵧ(x,y) = P(X=x, Y=y) = P((X=x) and (Y=y)).
  • Marginal Distribution: Distribution of one variable while ignoring the other.
  • Marginal Probability Mass Function: pₓ(x) = Σᵧ pₓᵧ(x, y).
  • Total Expectation Theorem: E[X] = Σᵧ pᵧ(y) E[X | Y = y].
  • Independent Random Variables: P(X = x, Y = y) = P(X = x) · P(Y = y) for all x and y.
  • Expectation of Multiple Random Variables: E[X + Y + Z] = E[X] + E[Y] + E[Z].
  • Binomial Random Variable: X = Sum of Bernoulli Random Variables.
  • The Hat Problem.
  • Continuous Random Variables.
  • Probability Density Function: fₓ(𝑥), defined so that P(a ≤ X ≤ b) is the area under fₓ between a and b. (a ≤ X ≤ b) means the function X produces a real value within the [a, b] range. In programming terms: X(outcome) = 𝑥, where a ≤ 𝑥 ≤ b. PDF does NOT give probabilities. PDF does NOT have to be less than 1. PDF gives probabilities per unit length. The total area under PDF must be 1. PDF is used to find the probability that the random variable falls within a given range of values.
  • Cumulative Distribution Function: Fₓ(b) = P(X ≤ b). (X ≤ b) means the function X produces a real value within the (−∞, b] range. In programming terms: X(outcome) = 𝑥, where 𝑥 ≤ b.
  • Continuous Uniform Random Variables: fₓ(x) = 1/(b – a) if a ≤ x ≤ b, otherwise fₓ(x) = 0.
  • Normal Random Variable, Gaussian Distribution, Normal Distribution: Fitting bell shaped data.
  • Chi-Squared Distribution: Modelling communication noise.
  • Sampling from a Distribution: The process of drawing a random value (or set of values) from a probability distribution.
  • Joint Probability Density Function.
  • Marginal Probability Density Function.
  • Conditional Probability Density Function.
  • Derived Distributions.
  • Convolution: A mathematical operation on two functions (f and g) that produces a third function.
  • The Distribution of W = X + Y.
  • The Distribution of X + Y where X, Y: Independent Normal Random Variables.
  • Covariance.
  • Covariance Matrix.
  • Correlation Coefficient.
  • Conditional Expectation: E[X | Y = y] = Σₓ x pₓ|ᵧ(x | y). If the value of Y is not yet known, then E[X | Y] is a random variable, i.e. a function of Y. So E[X | Y] also has its own expectation and variance.
  • Law of Iterated Expectations: E[E[X | Y]] = E[X].
  • Conditional Variance: var(X | Y) is a function of Y.
  • Law of Total Variance: var(X) = E[var(X | Y)] + var(E[X | Y]).
  • Bernoulli Process:  A sequence of independent Bernoulli trials. At each trial, i: P(Xᵢ=1)=p, P(Xᵢ=0)=1−p.
  • Poisson Process.
  • Markov Chain.
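
As a small illustration of Bayes' rule (with the total probability theorem in the denominator) and of the birthday problem from the list above, here is a minimal sketch. The spam probabilities are hypothetical numbers chosen only for the example, and the function names are invented for it; the birthday calculation assumes 365 equally likely birthdays.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: turn P(A_i) and P(B | A_i) into P(A_i | B)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_i) * P(B | A_i)
    total = sum(joint)  # total probability theorem: P(B)
    return [j / total for j in joint]


def birthday_collision_probability(people, days=365):
    """P(at least two of `people` share a birthday), assuming uniform birthdays."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


# Hypothetical spam example: P(spam) = 0.2, P("lottery" | spam) = 0.4,
# P("lottery" | not spam) = 0.01.
p_spam_given_word, p_ham_given_word = posterior([0.2, 0.8], [0.4, 0.01])
print(round(p_spam_given_word, 3))  # 0.909

print(birthday_collision_probability(23) > 0.5)  # True
```

With these made-up numbers, seeing the word "lottery" raises the belief that an email is spam from 0.2 to about 0.91.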

  • Bar Chart, Line Charts, Scatter Plots, Histograms.
  • Mean, Median, Mode.
  • Moments of a Distribution.
  • Skewness: E[((X – μ)/σ)³].
  • Kurtosis: E[((X – μ)/σ)⁴].
  • k% Quantile: Value qₖ/₁₀₀ such that P(X ≤ qₖ/₁₀₀) = k/100.
  • Interquartile Range: IQR = Q₃ − Q₁.
  • Box-Plots: Q₁, Q₂, Q₃, IQR, min, max.
  • Kernel Density Estimation.
  • Violin Plot = Box-Plot + Kernel Density Estimation.
  • Quantile-Quantile Plots (QQ Plots).
  • Population: N.
  • Sample: n.
  • Random Sampling.
  • Population Mean: μ.
  • Sample Mean: x̄.
  • Population Proportion: p.
  • Sample Proportion: p̂.
  • Population Variance: σ².
  • Sample Variance: s².
  • Sampling Distributions.
  • Sampling from a Distribution: Drawing random values directly from a probability distribution. Purpose: Simulating or modeling real-world processes when the underlying distribution is known.
  • Markov’s Inequality: P(X ≥ a) ≤ E(X)/a (X ≥ 0, a > 0).
  • Chebyshev’s Inequality: P(|X – E(X)| ≥ a) ≤ var(X)/a² (a > 0).
  • Weak Law of Large Numbers: The average of the samples will get closer to the population mean as the sample size (not number of items) increases.
  • Central Limit Theorem: The distribution of sample means approximates a normal distribution as the sample size (not number of items) gets larger, regardless of the population’s distribution.
  • Sampling Distributions: Distribution of Sample Mean, Distribution of Sample Proportion, Distribution of Sample Variance.
  • Point Estimate: A single number, calculated from a sample, that estimates a parameter of the population.
  • Maximum Likelihood Estimation: Given data the maximum likelihood estimate (MLE) for the parameter p is the value of p that maximizes the likelihood P (data | p). P (data | p) is the likelihood function. For continuous distributions, we use the probability density function to define the likelihood.
  • Log likelihood: the natural log of the likelihood function.
  • Frequentists: Assume no prior belief, the goal is to find the model that most likely generated observed data.
  • Bayesians: Assume prior belief, the goal is to update prior belief based on observed data.
  • Maximum A Posteriori (MAP): Good for instances when you have limited data or strong prior beliefs. Wrong priors, wrong conclusions. MAP with uninformative priors is just MLE.
  • Margin of Error: A bound that we can confidently place on the difference between an estimate of something and the true value.
  • Significance Level: α, the probability that the event could have occurred by chance.
  • Confidence Level: 1 − α,  a measure of how confident we are in a given margin of error.
  • Confidence Interval: A 95% confidence interval (CI) of the mean is a range with an upper and lower number calculated from a sample. Because the true population mean is unknown, this range describes possible values that the mean could be. If multiple samples were drawn from the same population and a 95% CI calculated for each sample, we would expect the population mean to be found within 95% of these CIs.
  • z-score: the number of standard deviations from the mean value of the reference population.
  • Confidence Interval: Unknown σ.
  • Confidence Interval for Proportions.
  • Hypothesis: A statement about a population developed for the purpose of testing.
  • Hypothesis Testing.
  • Null Hypothesis (H₀): A statement about the value of a population parameter; it always contains an equality sign.
  • Alternative Hypothesis (H₁): A statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false; it never contains an equality sign.
  • Type I Error: Reject the null hypothesis when it is true.
  • Type II Error: Do not reject the null hypothesis when it is false.
  • Significance Level, α: The maximum probability of rejecting the null hypothesis when it is true.
  • Test Statistic:  A number, calculated from samples, used to find if your data could have occurred under the null hypothesis.
  • Right-Tailed Test: The alternative hypothesis states that the true value of the parameter specified in the null hypothesis is greater than the null hypothesis claims.
  • Left-Tailed Test: The alternative hypothesis states that the true value of the parameter specified in the null hypothesis is less than the null hypothesis claims.
  • Two-Tailed Test: The alternative hypothesis does not specify a direction, i.e. it only states that the true value of the parameter differs from what the null hypothesis claims.
  • p-value: The probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. μ₀ is assumed to be known and H₀ is assumed to be true.
  • Decision Rules: If H₀ is true then acceptable x̄ must fall in (1 − α) region.
  • Critical Value or k-value: A value on a test distribution that is used to decide whether the null hypothesis should be rejected or not.
  • Power of a Test: The probability of rejecting the null hypothesis when it is false; in other words, it is the probability of avoiding a type II error.
  • t-Distribution.
  • T-Statistic.
  • t-Tests: Unknown σ, use T-Statistic.
  • Independent Two-Sample t-Tests.
  • Paired t-Tests.
  • A/B testing: A methodology for comparing two variations (A/B) that uses t-Tests for statistical analysis and making a decision.
  • Model Building: X = a·S + W, where X: output, S: “signal”, a: parameters, W: noise. Know S, assume W, observe X, find a.
  • Inferring: X = a·S + W. Know a, assume W, observe X, find S.
  • Hypothesis Testing: X = a·S + W. Know a, observe X, find S. S can take one of few possible values.
  • Estimation: X = a·S + W. Know a, observe X, find S. S can take unlimited possible values.
  • Bayesian Inference can be used for both Hypothesis Testing and Estimation by leveraging Bayes’ rule. The output is a posterior distribution. A single answer can be the Maximum a Posteriori (MAP) estimate or the Conditional Expectation.
  • Least Mean Squares Estimation of Θ based on X.
  • Classical Inference can be used for both Hypothesis Testing and Estimation.
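
To tie together the sample mean, margin of error, confidence level, and confidence interval entries above, here is a minimal sketch that computes an approximate 95% confidence interval for a mean. It uses the normal critical value and treats the sample standard deviation as if it were σ, which is an assumption of this sketch; for a small sample with unknown σ, the t-distribution entry above is the more appropriate tool. The data values and the function name are invented for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (normal approximation)."""
    n = len(sample)
    x_bar = mean(sample)              # point estimate of the population mean
    s = stdev(sample)                 # sample standard deviation (n - 1 denominator)
    alpha = 1 - confidence            # significance level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for 95%
    margin = z * s / sqrt(n)          # margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical measurements
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
low, high = mean_confidence_interval(data)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

A higher confidence level widens the interval, and a larger sample narrows it, because the margin of error shrinks like 1/√n.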

After finishing probability and statistics, please click on Topic 20 – Discrete Mathematics to continue.

 

Topic 18 – Linear Algebra

Why do I need to learn about linear algebra?

Linear algebra is a fundamental tool for understanding many modern theories and techniques such as artificial intelligence, machine learning, deep learning, data mining, security, digital image processing, and natural language processing.

Linear algebra provides a powerful language that unifies algebra, geometry, and computation. It enables compact representation, allowing many equations to be expressed as a single 2D array. It also facilitates convenient manipulation, as algebraic operations on vectors and matrices naturally correspond to geometric transformations. By linking algebra, geometry, and computation within a single framework, linear algebra serves as a foundation for both geometric interpretation and computational implementation.
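
As a rough illustration of this idea, the sketch below stores a small system of equations as a 2D array and uses the same matrix-vector product to apply a geometric transformation. The system, the rotation matrix, and the helper name `matvec` are chosen only for this example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# The two equations
#   2x + y = 5
#    x - y = 1
# written compactly as A x = b.
A = [[2, 1],
     [1, -1]]
b = [5, 1]

x = [2, 1]                 # candidate solution: x = 2, y = 1
print(matvec(A, x) == b)   # True: both equations hold at once

# The same operation read geometrically: rotate a vector by 90 degrees
# counterclockwise.
R = [[0, -1],
     [1, 0]]
print(matvec(R, [1, 0]))   # [0, 1]
```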

What can I do after finishing learning about linear algebra?

You will be prepared to learn modern theories and techniques to create modern security, machine learning, data mining, image processing or natural language processing software.

That sounds useful! What should I do now?

Linear algebra can be difficult if you try to memorize all of its formulas. The best way to study it is to focus on the systems of equations in the problems that interest you, and then look for notations and concepts that make it easier to analyze or solve those systems.

Please read this book to grasp the core concepts of linear algebra: David C. Lay et al. (2022). Linear Algebra and Its Applications. Pearson Education.

Alternatively, please audit this course and read its lecture notes: MIT 18.06 – Linear Algebra, Spring 2005 (Lecture Notes).

While auditing this course, refer to this book for a better understanding of some complex topics: Gilbert Strang (2016). Introduction to Linear Algebra. Wellesley-Cambridge Press.

Terminology Review:

  • Linear Equations.
  • Row Picture.
  • Column Picture.
  • Triangular matrix is a square matrix where all the values above or below the diagonal are zero.
  • Lower Triangular Matrices.
  • Upper Triangular Matrices.
  • Diagonal matrix is a matrix in which the entries outside the main diagonal are all zero.
  • Tridiagonal Matrices.
  • Identity Matrices.
  • Transpose of a Matrix.
  • Symmetric Matrices.
  • Pivot Columns.
  • Pivot Variables.
  • Augmented Matrix.
  • Echelon Form.
  • Reduced Row Echelon Form.
  • Elimination Matrices.
  • Inverse Matrix.
  • Factorization into A = LU.
  • Free Columns.
  • Free Variables.
  • Gauss-Jordan Elimination.
  • Vector Spaces.
  • Rank of a Matrix.
  • Permutation Matrices.
  • Subspaces.
  • Column space, C(A): consists of all linear combinations of the columns of A and is a subspace of ℝᵐ (for an m by n matrix A).
  • Nullspace, N(A): consists of all solutions x of the equation Ax = 0 and is a subspace of ℝⁿ.
  • Row space, C(Aᵀ): consists of all linear combinations of the row vectors of A and forms a subspace of ℝⁿ. It is the same as the column space of the transpose of A, hence the notation C(Aᵀ).
  • Left nullspace, N(Aᵀ): the nullspace of Aᵀ. This is a subspace of ℝᵐ.
  • Linearly Dependent Vectors.
  • Linearly Independent Vectors.
  • Linear Span of Vectors.
  • A basis for a vector space is a sequence of vectors with two properties:
    • They are independent.
    • They span the vector space.
  • Given a space, every basis for that space has the same number of vectors; that number is the dimension of the space.
  • Dimension of a Vector Space.
  • Dot Product.
  • Orthogonal Vectors.
  • Orthogonal Subspaces.
  • The row space of A is orthogonal to the nullspace of A.
  • Matrix Spaces.
  • Rank-One Matrices.
  • Orthogonal Complements.
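  • Projection Matrices: P = A(AᵀA)⁻¹Aᵀ, where A has independent columns. Properties of projection matrix: Pᵀ = P and P² = P. Projection of b onto the column space of A: p = Pb = A(AᵀA)⁻¹Aᵀb = Ax̂, where x̂ = (AᵀA)⁻¹Aᵀb. For a single vector a, this reduces to p = a(aᵀb)/(aᵀa).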
  • Linear regression, least squares, and normal equations: Instead of solving Ax = b we solve Ax̂ = p or AᵀAx̂ = Aᵀb.
  • Linear Regression.
  • Orthogonal Matrices.
  • Orthogonal Basis.
  • Orthonormal Vectors.
  • Orthonormal Basis.
  • Gram–Schmidt process.
  • Determinant: A number associated with any square matrix that tells us whether the matrix is invertible (it is invertible exactly when the determinant is nonzero), appears in the formula for the inverse matrix, and gives, up to sign, the volume of the parallelepiped whose edges are the column vectors of A. The determinant of a triangular matrix is the product of the diagonal entries (pivots).
  • The big formula for computing the determinant.
  • The cofactor formula rewrites the big formula for the determinant of an n by n matrix in terms of the determinants of smaller matrices.
  • Formula for Inverse Matrices.
  • Cramer’s Rule.
  • Eigenvectors are vectors for which Ax is parallel to x: Ax = λx. λ is an eigenvalue of A, det(A − λI)= 0.
  • Diagonalizing a matrix: AS = SΛ ⇒ S⁻¹AS = Λ ⇒ A = SΛS⁻¹. S: matrix of n linearly independent eigenvectors. Λ: matrix of eigenvalues on diagonal.
  • Matrix exponential eᴬᵗ.
  • Markov Matrices: All entries are non-negative and each column adds to 1.
  • Symmetric Matrices: Aᵀ = A.
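  • Positive Definite Matrices: a symmetric matrix is positive definite if all eigenvalues are positive, or equivalently all pivots are positive, or equivalently all upper-left (leading principal) determinants are positive.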
  • Similar Matrices: A and B = M⁻¹AM.
  • Singular Value Decomposition (SVD) of a matrix: A = UΣVᵀ, where U is orthogonal, Σ is diagonal, and V is orthogonal.
  • Linear Transformations: T(v + w) = T(v) + T(w) and T(cv) = cT(v). For any linear transformation T we can find a matrix A so that T(v) = Av.
  • Change-of-basis Matrix.
  • Left Inverse Matrices: LA = I, Right Inverse Matrices: AR = I.
  • Pseudo Inverse Matrices: A⁺=VΣ⁺Uᵀ.
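
To connect the least squares, normal equations, and Cramer's rule entries above, here is a minimal sketch that fits a straight line b ≈ c + d·t by solving the 2 by 2 normal equations AᵀAx̂ = Aᵀb, where each row of A is (1, t). The function name and the three data points are chosen for illustration only, and the sketch assumes the points have at least two distinct t values.

```python
def fit_line(ts, bs):
    """Least-squares line b = c + d*t via the normal equations AᵀA x̂ = Aᵀb."""
    n = len(ts)
    s_t = sum(ts)
    s_tt = sum(t * t for t in ts)
    s_b = sum(bs)
    s_tb = sum(t * b for t, b in zip(ts, bs))

    # AᵀA = [[n, s_t], [s_t, s_tt]] and Aᵀb = [s_b, s_tb]
    det = n * s_tt - s_t * s_t
    if det == 0:
        raise ValueError("AᵀA is singular: need at least two distinct t values")

    # Cramer's rule on the 2 by 2 system
    c = (s_b * s_tt - s_t * s_tb) / det
    d = (n * s_tb - s_t * s_b) / det
    return c, d


# Three points that do not lie on a single line, so Ax = b has no exact solution.
c, d = fit_line([1, 2, 3], [1, 2, 2])
print(round(c, 4), round(d, 4))  # 0.6667 0.5
```

The fitted values c + d·t are the projection p of b onto the column space of A, which is why the error b − p is orthogonal to every column of A.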

After finishing linear algebra, please click on Topic 19 – Probability & Statistics to continue.