---
title: "Theory and Intuition Behind Numerical MLE"
author: "Alexander Towell"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Theory and Intuition Behind Numerical MLE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(compositional.mle)
```

## What is Maximum Likelihood Estimation?

Maximum Likelihood Estimation (MLE) is a fundamental method for estimating the parameters of a statistical model. The idea is simple yet powerful: **find the parameter values that make the observed data most probable**.

### The Likelihood Function

Suppose we observe data $x_1, x_2, \ldots, x_n$ and believe it comes from a probability distribution with parameter(s) $\theta$. The **likelihood function** $L(\theta)$ measures how probable the observed data is, given parameter $\theta$:

$$L(\theta) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid \theta)$$

For independent observations:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

where $f(\cdot \mid \theta)$ is the probability density (or mass) function.

### Why Log-Likelihood?

Working with products is numerically unstable and mathematically inconvenient. Taking the logarithm converts products to sums:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Since $\log$ is monotonic, maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$. The log-likelihood has several advantages:

1. **Numerical stability**: Products of small probabilities can underflow to zero
2. **Computational efficiency**: Sums are faster to compute than products
3. **Mathematical convenience**: Derivatives of sums are easier than derivatives of products
4. **Statistical properties**: The curvature of $\ell(\theta)$ relates to estimation uncertainty

### A Concrete Example

Let's see this with normal data:

```{r normal-likelihood}
set.seed(42)
x <- rnorm(50, mean = 3, sd = 1.5)

# Log-likelihood function for a normal distribution
loglike <- function(theta) {
  mu <- theta[1]
  sigma <- theta[2]
  if (sigma <= 0) return(-Inf)
  sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Visualize the log-likelihood surface
mu_grid <- seq(1, 5, length.out = 50)
sigma_grid <- seq(0.5, 3, length.out = 50)
ll_surface <- outer(mu_grid, sigma_grid, function(m, s) {
  mapply(function(mi, si) loglike(c(mi, si)), m, s)
})

# Contour plot
contour(mu_grid, sigma_grid, ll_surface, nlevels = 20,
        xlab = expression(mu), ylab = expression(sigma),
        main = "Log-Likelihood Surface")
points(mean(x), sd(x), pch = 19, col = "red", cex = 1.5)
legend("topright", "MLE", pch = 19, col = "red")
```

The MLE is the point on this surface with the highest value, marked by the red dot. (The dot is drawn at `mean(x)` and `sd(x)`; the exact MLE of $\sigma$ divides by $n$ rather than $n - 1$, a negligible difference at $n = 50$.)

## The Score Function

The **score function** is the gradient (vector of partial derivatives) of the log-likelihood:

$$s(\theta) = \nabla_\theta \ell(\theta) = \left( \frac{\partial \ell}{\partial \theta_1}, \ldots, \frac{\partial \ell}{\partial \theta_p} \right)$$

At the MLE $\hat{\theta}$, the score is zero: $s(\hat{\theta}) = 0$.

### Intuition

The score tells us the **direction of steepest ascent** on the log-likelihood surface. If $s(\theta) \neq 0$, we can increase the likelihood by moving in the direction of $s(\theta)$.
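For the normal model above, the score has a closed form. Writing $\ell(\mu, \sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 + \text{const}$ and differentiating gives

$$s(\mu, \sigma) = \left( \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu), \; -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2 \right)$$

Setting both components to zero recovers the familiar estimates $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$. This is exactly what the `score()` function below implements.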
```{r score-example}
# Score function for the normal distribution
score <- function(theta) {
  mu <- theta[1]
  sigma <- theta[2]
  n <- length(x)
  c(
    sum(x - mu) / sigma^2,                  # d/d_mu
    -n / sigma + sum((x - mu)^2) / sigma^3  # d/d_sigma
  )
}

# At a point away from the MLE, the score points toward the MLE
theta_start <- c(1, 0.8)
s <- score(theta_start)
cat("Score at (1, 0.8):", round(s, 2), "\n")
cat("Direction: move", ifelse(s[1] > 0, "right", "left"), "in mu,",
    ifelse(s[2] > 0, "up", "down"), "in sigma\n")
```

## Gradient Ascent

**Gradient ascent** is the simplest gradient-based optimization algorithm. It iteratively moves in the direction of the gradient:

$$\theta^{(t+1)} = \theta^{(t)} + \eta \cdot s(\theta^{(t)})$$

where $\eta > 0$ is the **learning rate** (step size).

### Why It Works

The score points in the direction of steepest increase, so taking a sufficiently small step in that direction is guaranteed to improve the log-likelihood.

### The Challenge: Choosing the Step Size

- **Too large**: we may overshoot the optimum and oscillate
- **Too small**: convergence is painfully slow

```{r gradient-demo}
# Demonstrate gradient ascent with different step sizes
run_gradient_ascent <- function(eta, max_iter = 50) {
  theta <- c(1, 0.8)
  path <- matrix(NA, max_iter + 1, 2)
  path[1, ] <- theta
  for (i in 1:max_iter) {
    theta <- theta + eta * score(theta)
    if (theta[2] <= 0) theta[2] <- 0.01  # Enforce sigma > 0
    path[i + 1, ] <- theta
  }
  path
}

# Compare step sizes
path_small <- run_gradient_ascent(0.001)
path_good <- run_gradient_ascent(0.01)
path_large <- run_gradient_ascent(0.05)

# Plot paths
contour(mu_grid, sigma_grid, ll_surface, nlevels = 15,
        xlab = expression(mu), ylab = expression(sigma),
        main = "Gradient Ascent: Effect of Step Size")
lines(path_small[, 1], path_small[, 2], col = "blue", lwd = 2)
lines(path_good[, 1], path_good[, 2], col = "green", lwd = 2)
lines(path_large[1:20, 1], path_large[1:20, 2], col = "red", lwd = 2)
points(mean(x), sd(x), pch = 19, cex = 1.5)
legend("topright", c("Small (0.001)", "Good (0.01)", "Large (0.05)"),
       col = c("blue", "green", "red"), lwd = 2)
```

### Line Search: Automatic Step Size Selection

**Backtracking line search** finds a good step size adaptively:

1. Start with a large step size
2. If it doesn't improve the objective enough, shrink it
3. Repeat until an acceptable step is found

```{r line-search-demo}
# Define the problem
problem <- mle_problem(
  loglike = loglike,
  score = score,
  constraint = mle_constraint(
    support = function(theta) theta[2] > 0,
    project = function(theta) c(theta[1], max(theta[2], 1e-6))
  )
)

result <- gradient_ascent(max_iter = 50)(problem, theta0 = c(1, 0.8))
cat("Final estimate:", round(result$theta.hat, 4), "\n")
cat("Iterations:", result$iterations, "\n")
cat("Converged:", result$converged, "\n")
```

## The Fisher Information Matrix

The **Fisher information matrix** $I(\theta)$ measures how much information the data carries about $\theta$. It can be defined equivalently as:

$$I(\theta) = -E\left[ \frac{\partial^2 \ell}{\partial \theta \, \partial \theta^T} \right] = E\left[ s(\theta) \, s(\theta)^T \right]$$

The second form shows that $I(\theta)$ equals the covariance of the score (since $E[s(\theta)] = 0$ at the true parameter).

### Why It Matters

1. **Curvature**: $I(\theta)$ describes the curvature of the log-likelihood surface
2. **Uncertainty**: The asymptotic variance of the MLE is $\text{Var}(\hat{\theta}) \approx I(\theta)^{-1}$; the MLE asymptotically attains the Cramér-Rao lower bound
3. **Natural scaling**: Different parameters may have different scales; $I(\theta)$ accounts for this

### Observed vs Expected Fisher Information

- **Expected information**: $I(\theta) = -E[\nabla^2 \ell(\theta)]$
- **Observed information**: $J(\theta) = -\nabla^2 \ell(\theta)$ (evaluated at the data)

In practice, we often use the observed information, which doesn't require computing expectations.
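As a quick sanity check (a base-R sketch, not part of the package API), we can approximate the observed information by finite-differencing the `score()` function at the MLE, then compare the implied standard error of $\hat{\mu}$ with the textbook value $\sigma / \sqrt{n}$:

```{r observed-info}
# Observed information J = -Hessian of loglike, approximated by
# central finite differences of the analytic score
theta_hat <- c(mean(x), sqrt(mean((x - mean(x))^2)))  # exact normal MLE
h <- 1e-5
J <- -sapply(1:2, function(j) {
  e <- c(0, 0); e[j] <- h
  (score(theta_hat + e) - score(theta_hat - e)) / (2 * h)
})

# Approximate standard errors: sqrt of the diagonal of J^{-1}
sqrt(diag(solve(J)))

# Textbook asymptotic standard error for mu-hat: sigma / sqrt(n)
theta_hat[2] / sqrt(length(x))
```

For this model the two agree: in the $(\mu, \sigma)$ parameterization of the normal, the observed and expected information coincide at the MLE.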
## Newton-Raphson

**Newton-Raphson** uses the Fisher information to take smarter steps:

$$\theta^{(t+1)} = \theta^{(t)} + I(\theta^{(t)})^{-1} s(\theta^{(t)})$$

(Strictly speaking, using the expected information $I(\theta)$ in place of the observed Hessian is known as **Fisher scoring**; the update has the same form.)

### Intuition

Gradient ascent treats all directions equally, but some directions are "easier" to move in than others. Newton-Raphson pre-multiplies by $I^{-1}$, which:

- Takes larger steps in flat directions (low curvature)
- Takes smaller steps in curved directions (high curvature)
- Accounts for correlations between parameters

### Comparison

```{r newton-comparison}
# Fisher information for the normal distribution
fisher <- function(theta) {
  sigma <- theta[2]
  n <- length(x)
  matrix(c(
    n / sigma^2, 0,
    0, 2 * n / sigma^2
  ), nrow = 2)
}

# Define the problem with Fisher information
problem_with_fisher <- mle_problem(
  loglike = loglike,
  score = score,
  fisher = fisher,
  constraint = mle_constraint(
    support = function(theta) theta[2] > 0,
    project = function(theta) c(theta[1], max(theta[2], 1e-6))
  )
)

# Run gradient ascent
result_ga <- gradient_ascent(max_iter = 100)(problem, theta0 = c(1, 0.8))

# Run Newton-Raphson
result_nr <- newton_raphson(max_iter = 100)(problem_with_fisher, theta0 = c(1, 0.8))

cat("Gradient Ascent: ", result_ga$iterations, "iterations\n")
cat("Newton-Raphson: ", result_nr$iterations, "iterations\n")
```

Newton-Raphson typically converges much faster, especially near the optimum, where its quadratic convergence kicks in.

## Composing Solvers

The `compositional.mle` package lets you **compose** optimization strategies:

```{r compose-demo}
# Coarse-to-fine: grid search finds a good region, Newton polishes
strategy <- grid_search(lower = c(0, 0.1), upper = c(6, 4), n = 5) %>>%
  newton_raphson(max_iter = 20)
result <- strategy(problem_with_fisher, theta0 = c(1, 0.8))
cat("Result:", round(result$theta.hat, 4), "\n")

# Race different methods, pick the best
strategy <- gradient_ascent() %|% bfgs() %|% nelder_mead()
result <- strategy(problem, theta0 = c(1, 0.8))
cat("Winner:", result$solver, "\n")
```

## When to Use Which Method

| Method | Information | Best For |
|--------|-------------|----------|
| `gradient_ascent()` | Score only | Large problems, rough estimates |
| `newton_raphson()` | Score + Fisher | High precision, small problems |
| `bfgs()` | Score only | Quasi-Newton, good balance |
| `nelder_mead()` | Likelihood only | Derivative-free, robust |
| `grid_search()` | Likelihood only | Finding starting points |
| `with_restarts()` | Varies | Multi-modal problems |

## Constrained Optimization

Real problems often have **constraints** on parameters:

- Variance must be positive: $\sigma > 0$
- Probabilities must lie in $[0, 1]$
- Correlations must satisfy $|\rho| < 1$

### Projection Method

The `compositional.mle` package uses **projection**: if a step takes us outside the feasible region, we project back to the nearest feasible point, as the sketch below illustrates.
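For the $\sigma > 0$ constraint, the nearest feasible point is obtained by clamping $\sigma$ at a small positive bound. Here is a minimal base-R sketch of that projection (the same helper passed as `project` to `mle_constraint()` above):

```{r projection-sketch}
# Projection: map an infeasible point to the nearest feasible one
project <- function(theta) c(theta[1], max(theta[2], 1e-6))

project(c(2.5, -0.3))  # an illegal point (sigma < 0) is pulled back to the boundary
project(c(2.5, 1.2))   # feasible points are left unchanged

# A projected gradient step then looks like:
#   theta_new <- project(theta + eta * score(theta))
```

Inside the package, this projection is applied automatically after each step via the `constraint` supplied to `mle_problem()`: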
```{r constraint-demo}
# The constraint keeps sigma positive throughout optimization
result <- gradient_ascent(max_iter = 100)(problem, theta0 = c(0, 0.1))
cat("Final sigma:", result$theta.hat[2], "> 0 (constraint satisfied)\n")
```

## Regularization and Penalized Likelihood

Sometimes we want to **penalize** certain parameter values to:

- Prevent overfitting
- Encourage sparsity
- Incorporate prior beliefs

### Penalized Log-Likelihood

$$\ell_\lambda(\theta) = \ell(\theta) - \lambda \cdot P(\theta)$$

where $P(\theta)$ is a penalty function and $\lambda > 0$ controls the regularization strength.

### Common Penalties

**L1 (LASSO)**: $P(\theta) = \sum_j |\theta_j|$

- Encourages sparsity (some $\theta_j = 0$)
- Useful for variable selection

**L2 (Ridge)**: $P(\theta) = \sum_j \theta_j^2$

- Shrinks parameters toward zero
- Prevents extreme values
- Equivalent to a Gaussian prior

**Elastic Net**: $P(\theta) = \alpha \sum_j |\theta_j| + (1 - \alpha) \sum_j \theta_j^2$

- Combines the benefits of L1 and L2
- $\alpha$ controls the mix

```{r penalty-demo}
# Original log-likelihood (maximum at theta = (3, 2))
loglike_simple <- function(theta) -sum((theta - c(3, 2))^2)

# Add an L2 penalty
loglike_l2 <- with_penalty(loglike_simple, penalty_l2(), lambda = 1)

# Compare
theta <- c(3, 2)
cat("At theta = (3, 2):\n")
cat("  Original:", loglike_simple(theta), "\n")
cat("  With L2 penalty:", loglike_l2(theta), "\n")
cat("  The penalty shrinks the solution toward zero\n")
```

## Summary

| Method | Order | Information Needed | Best For |
|--------|-------|--------------------|----------|
| Gradient Ascent | 1st | Score only | Large problems, rough estimates |
| Newton-Raphson | 2nd | Score + Fisher | High precision, small problems |
| BFGS | Quasi-2nd | Score only | Good balance |
| Nelder-Mead | 0th | Likelihood only | Robust, derivative-free |
| Grid Search | 0th | Likelihood only | Finding starting points |

The `compositional.mle` package provides composable solvers that can be chained (`%>>%`), raced (`%|%`), or restarted (`with_restarts()`) to handle real-world complexities like constraints, regularization, and multimodal surfaces.

## Further Reading

1. Casella, G. and Berger, R.L. (2002). *Statistical Inference*. Duxbury.
2. Nocedal, J. and Wright, S.J. (2006). *Numerical Optimization*. Springer.
3. Murphy, K.P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press.