Notation and Terminology
Maximum Likelihood Estimation
Given “data” $X_1, X_2, X_3,…, X_n$ , a random sample (iid) from a distribution with unknown parameter $\theta$, we want to find the value of $\theta$ in the parameter space that maximizes our “probability” of observing that data.
- If $X_1, X_2, X_3,…, X_n$ are discrete, we can look at \(P(X_1=x_1, X_2=x_2, ..., X_n=x_n)\) as a function of $\theta$, and find $\theta$ that maximizes it.
This is the joint pmf for $X_1, X_2, X_3,…, X_n$.
- The analogue for continuous $X_1, X_2, X_3,…, X_n$ is to maximize the joint pdf with respect to $\theta$.
The pmf/pdf for any one of $X_1, X_2, X_3,…, X_n$ is denoted by $f(x)$.
We will emphasize the dependence of $f$ on a parameter $\theta$ by writing it as $f(x;\theta)$.
The joint pmf/pdf for all of them is:
\[f(x_1,x_2,...,x_n;\theta) = f(\stackrel{\rightharpoonup}{x}; \theta) = \prod_{i=1}^{n}f(x_i;\theta)\]
- The data (the $x$’s) are fixed.
- Think of the $x$’s as fixed and the joint pdf as a function of $\theta$.
Call it a likelihood function and denote it by $L(\theta)$.
The likelihood $L(\theta)$ is defined to be anything proportional to the joint pmf/pdf.
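To make “the data are fixed and $\theta$ varies” concrete, here is a minimal Python sketch (the 0/1 data values are hypothetical and numpy is assumed to be available) that evaluates a Bernoulli log-likelihood on a grid of candidate $p$ values and reports where it peaks; the same model is worked out analytically in the Discrete Example below.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # hypothetical fixed 0/1 data

def log_likelihood(p):
    """Log of the joint Bernoulli pmf, viewed as a function of p with the data held fixed."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)                   # candidate values of p
values = np.array([log_likelihood(p) for p in grid])
print(grid[np.argmax(values)])                       # grid point where the likelihood peaks
```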
Discrete Example
\(X_1, X_2, X_3,..., X_n \stackrel{iid}{\sim} Bernoulli(p)\)
The pmf for one of them is \(f(x;p) = p^x(1-p)^{1-x}I_{\{0,1\}}(x)\)
The joint pmf for all of them is:
\[\begin{align} f(\stackrel{\rightharpoonup}{x};p) &= \prod_{i=1}^{n}f(x_i;p) \\ &= \prod_{i=1}^{n}p^{x_i}(1-p)^{1-x_i}I_{\{0,1\}}(x_i) \\ &= p^{\sum_{i=1}^{n}x_i}(1-p)^{n-\sum_{i=1}^{n}x_i}\prod_{i=1}^{n}I_{\{0,1\}}(x_i) \end{align}\]
A likelihood is
\[L(p) = p^{\sum_{i=1}^{n}x_i}(1-p)^{n-\sum_{i=1}^{n}x_i}\]
It is almost always easier to maximize the log-likelihood
\[l(p) = \ln L(p) = \sum_{i=1}^{n}x_i\ln p + \bigg(n - \sum_{i=1}^{n}x_i\bigg)\ln (1 - p)\]
To maximize $l(p)$ w.r.t. $p$, we take the derivative w.r.t. $p$ and set it to $0$:
\[\begin{align} \frac{\partial}{\partial p}l(p) &= \frac{\sum_{i=1}^{n}x_i}{p} - \frac{n - \sum_{i=1}^{n}x_i}{1-p} = 0 \\ &\Rightarrow (1-p)\sum_{i=1}^{n}x_i - p\bigg(n - \sum_{i=1}^{n}x_i\bigg) = 0 \quad \text{(multiplying through by } p(1-p)\text{)} \\ &\Rightarrow \sum_{i=1}^{n}x_i - pn = 0 \\ &\Rightarrow p = \frac{\sum_{i=1}^{n}x_i}{n} \end{align}\]
Since $\frac{\partial^2}{\partial p^2}l(p) = -\frac{\sum_{i=1}^{n}x_i}{p^2} - \frac{n - \sum_{i=1}^{n}x_i}{(1-p)^2} < 0$, this critical point is a maximum. The maximum likelihood estimator for $p$ is:
\[\hat{p} = \frac{\sum_{i=1}^{n}X_i}{n} = \bar{X}\]
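As a sanity check on the closed form, the sketch below (hypothetical 0/1 data, scipy assumed available) maximizes the Bernoulli log-likelihood numerically by minimizing its negative and compares the answer to the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # hypothetical 0/1 sample

def neg_log_likelihood(p):
    """Negative Bernoulli log-likelihood; minimizing it maximizes l(p)."""
    s, n = x.sum(), x.size
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)   # numerical maximizer of l(p)
print(x.mean())   # closed-form MLE: the sample mean
```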
Continuous Example
\[X_1, X_2, X_3,..., X_n \stackrel{iid}{\sim} exp(rate = \lambda)\]
The pdf for one of them is
\[f(x;\lambda) = \lambda e^{-\lambda x}I_{(0, \infty)}(x)\]
The joint pdf for all of them is
\[\begin{align} f(\stackrel{\rightharpoonup}{x};\lambda) &= \prod_{i=1}^{n}f(x_i;\lambda) \\ &= \prod_{i=1}^{n}\lambda e^{-\lambda x_i}I_{(0, \infty)}(x_i) \\ &= \lambda^n e^{-\lambda \sum_{i=1}^{n}x_i}\prod_{i=1}^{n}I_{(0, \infty)}(x_i) \end{align}\]
A likelihood is
\[L(\lambda) = \lambda^n e^{-\lambda \sum_{i=1}^{n}x_i}\]
The log-likelihood is
\[l(\lambda) = n\ln \lambda - \lambda\sum_{i=1}^{n}x_i\]
To maximize $l(\lambda)$ w.r.t. $\lambda$, we take the derivative w.r.t. $\lambda$ and set it to $0$:
\[\begin{align} \frac{\partial}{\partial \lambda}l(\lambda) &= \frac{n}{\lambda} - \sum_{i=1}^{n}x_i = 0 \\ &\Rightarrow \lambda = \frac{n}{\sum_{i=1}^{n}x_i} \end{align}\]
Since $\frac{\partial^2}{\partial \lambda^2}l(\lambda) = -\frac{n}{\lambda^2} < 0$, this critical point is a maximum. The maximum likelihood estimator for $\lambda$ is
\[\hat{\lambda} = \frac{n}{\sum_{i=1}^{n}X_i} = \frac{1}{\bar{X}}\]
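The same numerical check works for the exponential case. In the sketch below (simulated data with a hypothetical true rate, scipy assumed available), the numerical maximizer of $l(\lambda)$ should agree with $1/\bar{X}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=1000)  # hypothetical sample, true rate lambda = 2.5

def neg_log_likelihood(lam):
    """Negative exponential log-likelihood: -(n*ln(lambda) - lambda*sum(x))."""
    return -(x.size * np.log(lam) - lam * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(result.x)       # numerical maximizer of l(lambda)
print(1 / x.mean())   # closed-form MLE: 1 / sample mean
```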