Probability

Fundamentals for ML

Modified: March 3, 2025

This note could easily be its own book. For now, we’ll simply use this page to collect relevant definitions and formulas. We may add additional narrative and examples over time.

Terms

  • Experiment: A process that generates an outcome.
  • Outcome: A possible result of an experiment.
  • Sample Space: The set of all possible outcomes of an experiment, \(S\).
  • Event: A subset of the sample space.
  • Probability: A function, \(P()\), that assigns a number between 0 and 1 to each event.
  • Random Variable: A function that assigns a number to each outcome of an experiment.
  • Joint Probability: The probability of two events occurring together.
  • Conditional Probability: The probability of one event occurring given that another event has occurred.
  • Independence: Two events are independent if the occurrence of one does not affect the probability of the other.

Probability Rules

Complement Rule

The probability of the complement of an event \(A\), denoted as \(A^\prime\), is given by

\[ P(A^\prime) = 1 - P(A). \]

Note that \(A\) and \(A^\prime\) are disjoint as they cannot both occur. To denote this, we write

\[ A \cap A^\prime = \emptyset. \]

Also note that one of \(A\) or \(A^\prime\) must occur, so together they are exhaustive. To denote this, we write

\[ A \cup A^\prime = S. \]

A collection of events that are disjoint and exhaustive is called a partition of the sample space.

Addition Rule

The probability of the union of two events \(A\) and \(B\) is given by

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B). \]

  • \(P(A \cup B)\) is the probability that either \(A\) or \(B\) occurs.
    • In set theory, \(\cup\) is the union operator.
  • \(P(A \cap B)\) is the probability that both \(A\) and \(B\) occur.
    • In set theory, \(\cap\) is the intersection operator.

If \(A\) and \(B\) are disjoint, then \(P(A \cap B) = 0\) and the addition rule simplifies to

\[ P(A \cup B) = P(A) + P(B). \]
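As a quick sanity check, the addition rule can be verified by direct enumeration. The six-sided die, and the events \(A\) (even roll) and \(B\) (roll of at least 4), are illustrative choices, not taken from anything above:

```python
# Check the addition rule by enumerating a fair six-sided die.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space
A = {2, 4, 6}            # event: roll is even
B = {4, 5, 6}            # event: roll is at least 4

def prob(event):
    # Equally likely outcomes, so P(E) = |E| / |S|.
    return Fraction(len(event), len(S))

lhs = prob(A | B)                        # P(A ∪ B) = P({2, 4, 5, 6}) = 4/6
rhs = prob(A) + prob(B) - prob(A & B)    # 3/6 + 3/6 - 2/6
print(lhs == rhs, lhs)  # True 2/3
```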

Conditional Probability

The conditional probability of event \(A\) given event \(B\) is given by

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \]

  • \(P(A \mid B)\) is the probability that \(A\) occurs given that \(B\) has already occurred.

Multiplication Rule

The probability of two events \(A\) and \(B\) occurring together is given by

\[ P(A \cap B) = P(A \mid B) \cdot P(B). \]

Notice that this is simply a rearrangement of the definition of conditional probability.

Reversing the order of the factors on the right-hand side helps to better understand the multiplication rule.

\[ P(A \cap B) = P(B) \cdot P(A \mid B). \]

We can read this as: “The probability that \(A\) and \(B\) both occur is first the probability that \(B\) occurs, then the probability that \(A\) occurs given that \(B\) has occurred.”

Because \(A \cap B = B \cap A\), we could equally well condition on \(A\) and write \(P(A \cap B)\) as

\[ P(A \cap B) = P(B \mid A) \cdot P(A). \]

Independence

Two events \(A\) and \(B\) are independent if the occurrence of one does not affect the probability of the other.

In terms of conditional probability, this can be expressed as:

\[ P(A \mid B) = P(A) \]

\[ P(B \mid A) = P(B). \]

That is, \(A\) and \(B\) are independent if both of the above are true.

Applying the conditional probability rule and rearranging, we can see that an alternative check for the independence of \(A\) and \(B\) is

\[ P(A \cap B) = P(A) \cdot P(B). \]

This is the specific case of the multiplication rule when \(A\) and \(B\) are independent.
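The product-rule check is easy to confirm by enumeration. The two-coin-flip experiment below is an illustrative assumption, not from the text above:

```python
# Check independence via P(A ∩ B) = P(A) · P(B) for two fair coin flips.
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))   # sample space of two flips
A = {s for s in S if s[0] == "H"}  # event: first flip is heads
B = {s for s in S if s[1] == "H"}  # event: second flip is heads

def prob(event):
    # Equally likely outcomes, so P(E) = |E| / |S|.
    return Fraction(len(event), len(S))

# P(A ∩ B) = 1/4 and P(A) · P(B) = 1/2 · 1/2, so the flips are independent.
print(prob(A & B) == prob(A) * prob(B))  # True
```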

Law of Total Probability

The law of total probability states that

\[ P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B \mid A_i) \cdot P(A_i) \]

where \(A_1, A_2, \ldots, A_n\) is a partition of the sample space.

A partition of the sample space is a set of \(n\) events \(A_1, A_2, \ldots, A_n\) that are mutually exclusive and exhaustive.

The events are mutually exclusive if they cannot occur together. That is, \(A_i \cap A_j = \emptyset\) for all \(i \neq j\).

The events are exhaustive if their union covers the entire sample space. That is, \(\cup_{i=1}^n A_i = S\).


For any event, say \(A\), the event together with its complement form a partition of the sample space.

\[ S = A \cup A^\prime \]

Because \(A\) and \(A^\prime\) cannot both occur, they are mutually exclusive. Similarly, because either \(A\) or \(A^\prime\) must occur, they are exhaustive.

In the case of a simple partition of the space with \(A\) and \(A^\prime\), the law of total probability simplifies to

\[ P(B) = P(B \cap A) + P(B \cap A^\prime) = P(B \mid A) \cdot P(A) + P(B \mid A^\prime) \cdot P(A^\prime). \]
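The two-event version translates to a one-line computation. All of the probabilities below are hypothetical numbers chosen only to show the mechanics:

```python
# Two-event law of total probability: P(B) = P(B|A)·P(A) + P(B|A')·P(A').
p_A = 0.3             # hypothetical P(A); P(A') = 1 - P(A)
p_B_given_A = 0.6     # hypothetical P(B | A)
p_B_given_Ac = 0.2    # hypothetical P(B | A')

p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)
print(round(p_B, 2))  # 0.32
```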

Bayes’ Theorem

Given events \(A\) and \(B\), with \(P(B) \neq 0\), Bayes’ Theorem states that

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}. \]

In practice, the denominator \(P(B)\) is often rewritten using the law of total probability.

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A^\prime) \cdot P(A^\prime)} \]

If instead of \(A\) and \(A^\prime\) we have a partition of the sample space with an arbitrary number of events \(n\), \(A_1, A_2, \ldots, A_n\), then Bayes’ Theorem generalizes to

\[ P(A_i \mid B) = \frac{P(B \mid A_i) \cdot P(A_i)}{\sum_{j=1}^n P(B \mid A_j) \cdot P(A_j)}. \]

When working with Bayes’ Theorem it is often useful to draw a tree diagram.
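A classic use of Bayes' Theorem is "inverting" a diagnostic test: given a positive result, how likely is the condition? All numbers below (prevalence, sensitivity, false-positive rate) are hypothetical, chosen only to illustrate the calculation:

```python
# Bayes' theorem with hypothetical numbers. Let A = "has the condition"
# and B = "test is positive".
p_A = 0.01            # hypothetical prevalence, P(A)
p_B_given_A = 0.95    # hypothetical sensitivity, P(B | A)
p_B_given_Ac = 0.05   # hypothetical false-positive rate, P(B | A')

# Denominator via the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes' theorem: P(A | B) = P(B | A) · P(A) / P(B).
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # 0.161 -- positives are mostly false here
```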

Random Variables

A random variable is simply a function which maps outcomes in the sample space to real numbers.

Distributions

We often talk about the distribution of a random variable, which can be thought of as:

\[ \text{distribution} = \text{list of possible} \textbf{ values} + \text{associated} \textbf{ probabilities} \]

This is not a strict mathematical definition, but is useful for conveying the idea.

If the possible values of a random variable are discrete, it is called a discrete random variable. If the possible values are continuous, it is called a continuous random variable.

Discrete Random Variables

The distribution of a discrete random variable \(X\) is most often specified by a list of possible values and a probability mass function, \(p(x)\). The mass function directly gives probabilities, that is,

\[ p(x) = p_X(x) = P[X = x]. \]

Note that we almost always drop the subscript from the more correct \(p_X(x)\) and simply write \(p(x)\); the relevant random variable is discerned from context.

The most common example of a discrete random variable is a binomial random variable. The mass function of a binomial random variable \(X\) is given by

\[ p(x | n, p) = {n \choose x} p^x(1 - p)^{n - x}, \ \ \ x = 0, 1, \ldots, n, \ n \in \mathbb{N}, \ 0 < p < 1. \]

This line conveys a large amount of information.

  • The function \(p(x | n, p)\) is the mass function. It is a function of \(x\), the possible values of the random variable \(X\). It depends on the parameters \(n\) and \(p\). Different values of these parameters specify different binomial distributions.
  • \(x = 0, 1, \ldots, n\) indicates the sample space, that is, the possible values of the random variable.
  • \(n \in \mathbb{N}\) and \(0 < p < 1\) specify the parameter spaces. These are the possible values of the parameters that give a valid binomial distribution.

Often all of this information is simply encoded by writing

\[ X \sim \text{bin}(n, p). \]
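The mass function above translates directly into code; `math.comb` in the standard library computes the binomial coefficient \({n \choose x}\). The parameter values \(n = 10\), \(p = 0.5\) are an arbitrary example:

```python
# The binomial mass function, implemented directly from the formula.
from math import comb

def binom_pmf(x, n, p):
    # p(x | n, p) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# X ~ bin(10, 0.5): the probabilities over x = 0, 1, ..., 10 sum to 1.
pmf = [binom_pmf(x, 10, 0.5) for x in range(11)]
print(binom_pmf(5, 10, 0.5))  # 0.24609375
print(sum(pmf))               # 1.0
```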

Continuous Random Variables

The distribution of a continuous random variable \(X\) is most often specified by a set of possible values and a probability density function, \(f(x)\).

The probability of the event \(a < X < b\) is calculated as

\[ P[a < X < b] = \int_{a}^{b} f(x)dx. \]

Note that densities are not probabilities.

The most common example of a continuous random variable is a normal random variable. The density of a normal random variable \(X\) is given by

\[ f(x | \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \cdot \exp\left[\frac{-1}{2} \left(\frac{x - \mu}{\sigma}\right)^2 \right], \ \ \ -\infty < x < \infty, \ -\infty < \mu < \infty, \ \sigma > 0. \]

  • The function \(f(x | \mu, \sigma^2)\) is the density function. It is a function of \(x\), the possible values of the random variable \(X\). It depends on the parameters \(\mu\) and \(\sigma^2\). Different values of these parameters specify different normal distributions.
  • \(-\infty < x < \infty\) indicates the sample space. In this case, the random variable may take any value on the real line.
  • \(-\infty < \mu < \infty\) and \(\sigma > 0\) specify the parameter space. These are the possible values of the parameters that give a valid normal distribution.

Often all of this information is simply encoded by writing

\[ X \sim N(\mu, \sigma^2). \]
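The density formula is equally direct to implement. Because densities are not probabilities, the natural sanity check is that the density integrates to 1; a crude Riemann sum suffices. The standard normal (\(\mu = 0\), \(\sigma = 1\)), the interval \([-8, 8]\), and the step size are all arbitrary choices for illustration:

```python
# The normal density, implemented from the formula, plus a Riemann-sum
# check that it integrates to approximately 1.
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 0.0, 1.0  # the standard normal, as an example
step = 0.001
grid = [-8 + i * step for i in range(16000)]
area = sum(normal_pdf(x, mu, sigma) * step for x in grid)
print(round(area, 4))  # 1.0
```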

Working with Random Variables

Consider random variables \(X\) and \(Y\). We will often encounter probabilities written in terms of random variables. For example, consider

\[ P[Y = 1 \mid X = 2]. \]

While written in terms of random variables, this is simply a conditional probability. We can use the definition of conditional probability to write

\[ P[Y = 1 \mid X = 2] = \frac{P[Y = 1 \cap X = 2]}{P[X = 2]}. \]

Here, \(Y = 1\) and \(X = 2\) are events, and thus this expression is simply an application of the definition of conditional probability.

Conditional Distributions

Given random variables \(X\) and \(Y\), in machine learning we will often discuss the conditional distribution of \(Y\) given \(X\).

Conditional distributions are contrasted with marginal distributions.

The marginal distribution of \(Y\) is the distribution of \(Y\) without regard to \(X\). That is, it is the distribution of \(Y\) as if \(X\) did not exist. Simply, it is the distribution of \(Y\).

The conditional distribution of \(Y\) given \(X\) is the distribution of \(Y\) updated based on a known value of \(X\).

Example: Joint, Marginal, and Conditional Distributions

Let’s consider a simple example where the random variable \(Y\) can take two values, 0 and 1, and the random variable \(X\) can take three values, 0, 1, and 2.

Joint Distribution

The joint distribution of \(X\) and \(Y\) can be written as a table.

\(Y \backslash X\)   \(X = 0\)   \(X = 1\)   \(X = 2\)
\(Y = 0\)                0.1           0.2           0.3
\(Y = 1\)                0.2           0.2           0.0

This table specifies the probability of each pair of possible values for \((X, Y)\).

  • \(P(X = 0 \cap Y = 0) = 0.1\)
  • \(P(X = 0 \cap Y = 1) = 0.2\)
  • \(P(X = 1 \cap Y = 0) = 0.2\)
  • \(P(X = 1 \cap Y = 1) = 0.2\)
  • \(P(X = 2 \cap Y = 0) = 0.3\)
  • \(P(X = 2 \cap Y = 1) = 0.0\)

Marginal Distributions

The marginal distribution of \(X\) is obtained by summing the joint probabilities over all possible values of \(Y\):

\[ P(X = 0) = 0.1 + 0.2 = 0.3 \] \[ P(X = 1) = 0.2 + 0.2 = 0.4 \] \[ P(X = 2) = 0.3 + 0.0 = 0.3 \]

More compactly, we could write:

\[ P(X = x) = \begin{cases} 0.3 & \text{if } x = 0 \\ 0.4 & \text{if } x = 1 \\ 0.3 & \text{if } x = 2 \\ 0 & \text{otherwise} \end{cases} \]

Note that if not written explicitly, we assume that the probability of any other value of \(X\) is 0.

The marginal distribution of \(Y\) is obtained by summing the joint probabilities over all possible values of \(X\):

\[ P(Y = 0) = 0.1 + 0.2 + 0.3 = 0.6 \] \[ P(Y = 1) = 0.2 + 0.2 + 0.0 = 0.4 \]

Again, more compactly, we could write:

\[ P(Y = y) = \begin{cases} 0.6 & \text{if } y = 0 \\ 0.4 & \text{if } y = 1 \\ 0 & \text{otherwise} \end{cases} \]
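This bookkeeping is easy to mirror in code. The sketch below stores the joint table as a dictionary and recovers both marginals by summing over the other variable; exact fractions are used to avoid floating-point noise:

```python
# The joint table above as a dictionary mapping (x, y) to P(X = x ∩ Y = y).
from fractions import Fraction

joint = {
    (0, 0): Fraction(1, 10), (1, 0): Fraction(2, 10), (2, 0): Fraction(3, 10),
    (0, 1): Fraction(2, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(0, 10),
}

# Marginal of X: sum over y. Marginal of Y: sum over x.
p_X = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1, 2)}
p_Y = {y: sum(joint[(x, y)] for x in (0, 1, 2)) for y in (0, 1)}

print(p_X[0], p_X[1], p_X[2])  # 3/10 2/5 3/10
print(p_Y[0], p_Y[1])          # 3/5 2/5
```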

Conditional Distributions

The conditional distribution of \(Y\) given \(X\) is obtained by dividing the joint probabilities by the marginal probability of \(X\).

When \(X = 0\), we have:

\[ \begin{aligned} P(Y = 0 \mid X = 0) &= \frac{P(X = 0 \cap Y = 0)}{P(X = 0)} = \frac{0.1}{0.3} = \frac{1}{3} = 0.33\overline{3} \\ P(Y = 1 \mid X = 0) &= \frac{P(X = 0 \cap Y = 1)}{P(X = 0)} = \frac{0.2}{0.3} = \frac{2}{3} = 0.66\overline{6} \end{aligned} \]

When \(X = 1\), we have:

\[ \begin{aligned} P(Y = 0 \mid X = 1) &= \frac{P(X = 1 \cap Y = 0)}{P(X = 1)} = \frac{0.2}{0.4} = \frac{1}{2} = 0.5 \\ P(Y = 1 \mid X = 1) &= \frac{P(X = 1 \cap Y = 1)}{P(X = 1)} = \frac{0.2}{0.4} = \frac{1}{2} = 0.5 \end{aligned} \]

When \(X = 2\), we have:

\[ \begin{aligned} P(Y = 0 \mid X = 2) &= \frac{P(X = 2 \cap Y = 0)}{P(X = 2)} = \frac{0.3}{0.3} = 1 \\ P(Y = 1 \mid X = 2) &= \frac{P(X = 2 \cap Y = 1)}{P(X = 2)} = \frac{0.0}{0.3} = 0 \end{aligned} \]

The conditional distribution of \(X\) given \(Y\) is obtained by dividing the joint probabilities by the marginal probability of \(Y\).

When \(Y = 0\), we have:

\[ \begin{aligned} P(X = 0 \mid Y = 0) &= \frac{P(X = 0 \cap Y = 0)}{P(Y = 0)} = \frac{0.1}{0.6} = \frac{1}{6} = 0.166\overline{6} \\ P(X = 1 \mid Y = 0) &= \frac{P(X = 1 \cap Y = 0)}{P(Y = 0)} = \frac{0.2}{0.6} = \frac{1}{3} = 0.33\overline{3} \\ P(X = 2 \mid Y = 0) &= \frac{P(X = 2 \cap Y = 0)}{P(Y = 0)} = \frac{0.3}{0.6} = \frac{1}{2} = 0.5 \end{aligned} \]

When \(Y = 1\), we have:

\[ \begin{aligned} P(X = 0 \mid Y = 1) &= \frac{P(X = 0 \cap Y = 1)}{P(Y = 1)} = \frac{0.2}{0.4} = \frac{1}{2} = 0.5 \\ P(X = 1 \mid Y = 1) &= \frac{P(X = 1 \cap Y = 1)}{P(Y = 1)} = \frac{0.2}{0.4} = \frac{1}{2} = 0.5 \\ P(X = 2 \mid Y = 1) &= \frac{P(X = 2 \cap Y = 1)}{P(Y = 1)} = \frac{0.0}{0.4} = 0 \end{aligned} \]
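Finally, the conditional distributions can be computed mechanically from the same joint table by dividing by the appropriate marginal, exactly as in the definition of conditional probability:

```python
# Conditional distributions from the joint table: divide each joint
# probability by the matching marginal.
from fractions import Fraction

joint = {
    (0, 0): Fraction(1, 10), (1, 0): Fraction(2, 10), (2, 0): Fraction(3, 10),
    (0, 1): Fraction(2, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(0, 10),
}
p_X = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1, 2)}       # marginal of X
p_Y = {y: sum(joint[(x, y)] for x in (0, 1, 2)) for y in (0, 1)}  # marginal of Y

# P(Y = y | X = x) = P(X = x ∩ Y = y) / P(X = x)
p_Y_given_X = {(y, x): joint[(x, y)] / p_X[x] for x in (0, 1, 2) for y in (0, 1)}

# P(X = x | Y = y) = P(X = x ∩ Y = y) / P(Y = y)
p_X_given_Y = {(x, y): joint[(x, y)] / p_Y[y] for x in (0, 1, 2) for y in (0, 1)}

print(p_Y_given_X[(0, 0)], p_Y_given_X[(1, 0)])  # 1/3 2/3
print(p_X_given_Y[(2, 0)])                       # 1/2
```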