For many models in machine learning and statistics, computing the ML or MAP parameter estimate is easy provided we observe all the values of all the relevant random variables, i.e., if we have complete data. However, if we have missing data and/or latent variables, then computing the ML/MAP estimate becomes hard. One approach is to use a generic gradient-based optimizer to find a local minimum of the negative log likelihood or NLL, given by
However, we often have to enforce constraints, such as the fact that covariance matrices must be positive definite, mixing weights must sum to one, etc., which can be tricky. In such cases, it is often much simpler (but not always faster) to use an algorithm called expectation maximization, or EM for short. This is a simple iterative algorithm, often with closed-form updates at each step. Furthermore, the algorithm automatically enforce the required constraints. EM exploits the fact that if the data were fully observed, then the ML/ MAP estimate would be easy to compute. In particular, EM is an iterative algorithm which alternates between inferring the missing values given the parameters (E step), and then optimizing the parameters given the “filled in” data (M step). We give the details below, followed by several examples.
Let be the visible or observed variables in case , and let be the hidden or missing variables. The goal is to maximize the log likelihood of the observed data:
Unfortunately this is hard to optimize, since the log cannot be pushed inside the sum.
EM gets around this problem as follows. Define the complete data log likelihood to be
Let the marginal likelihood as
The complete data log likelihod cannot be computed. So lets define the expected complete data log likelihood as follows:
is called the auxiliary function. The expectation is taken wrt the old parameters, , and the observed data . The goal of the E step is to compute or rather, the terms inside of it which the MLE depends on; these are known as the expected sufficient statistics or ESS. In the M step, we optimize the Q function wrt :
To perform MAP to avoid overfitting:
We could easily maximize
E-step: . In class label problems, the E-step is to estimate the responsibility of one data belongs to that class label. M-step: Using the new class label, we may choose a new parameter to maximize the likelihood.
Taking maxture of Gausians as an example, the expected complete data log likelihood is given by
where is the responsibility that cluster takes for data point . This is computed at the E step:
M step: in the M step, we optimize Q wrt and the . For , we obviously have
Which is the averaged responsibility for cluster . However, the M step will have no regard to . and
This is just a weighted version of the standard problem of computing the MLEs of an MVN and the new parameter estimate is given by
import numpy as np def gaussian_point_generator(mu:list,sigma:list,n:int): return list(zip(np.random.normal(mu[0], sigma[0], n), np.random.normal(mu[1], sigma[1], n))) class GaussianParameter: def __init__(self, mu:list,sigma:list, cov=0): self.mu = mu self.sigma = sigma self.cov=cov def __repr__(self): return "".format(self.mu, self.sigma) def plot(self, ax): m = np.array([[self.mu[0]],[self.mu[1]]]) cov = np.array(self.sigma) cov_inv = np.linalg.inv(cov) # inverse of covariance matrix cov_det = np.linalg.det(cov) # determinant of covariance matrix x = np.linspace(-20,20) y = np.linspace(-20,20) X,Y = np.meshgrid(x,y) coe = 1.0 / ((2 * np.pi)**2 * cov_det)**0.5 Z = coe * np.e ** (-0.5 * (cov_inv[0,0]*(X-m[0])**2 + (cov_inv[0,1] + cov_inv[1,0])*(X-m[0])*(Y-m[1]) + cov_inv[1,1]*(Y-m[1])**2)) ax.contour(X,Y,Z) def p(self, point): return np.array((1 / 2 * np.pi * np.power(np.linalg.det(np.matrix(self.sigma)), 0.5) ) * np.exp( -1/2 * (np.matrix(np.array(self.mu)) - np.matrix(np.array(point))) * np.matrix(self.sigma) * (np.matrix(np.array(self.mu)) - np.matrix(np.array(point))).T))[0][0] init_gaussian_parameter = [GaussianParameter([0,10], [[1,0],[0,1]]), GaussianParameter([10,0], [[1,0],[0,1]])] points = gaussian_point_generator([1,1], [1,1],50) + gaussian_point_generator([9,9], [2,2],50) def iterate(axis): r = [[], []] for i in points: for j in range(2): r[j].append(init_gaussian_parameter[j].p(i)) r[0], r[1] = np.array(r[0]) / (np.array(r[0]) + np.array(r[1])), np.array(r[1]) / (np.array(r[0]) + np.array(r[1])) r[0] = np.array(list(map(lambda x:0 if np.isnan(x) else x, r[0]))) r[1] = np.array(list(map(lambda x:0 if np.isnan(x) else x, r[1]))) new_mu = [np.array([0,0]), np.array([0,0])] for j in range(2): for ind,i in enumerate(points): new_mu[j] = new_mu[j] + r[j][ind] * np.array(i) new_mu[j] = new_mu[j] / np.sum(r[j]) new_sigma = [np.array([[0,0], [0,0]]), np.array([[0,0], [0,0]])] for j in range(2): for ind,i in enumerate(points): new_sigma[j] = new_sigma[j] + r[j][ind] * np.matrix(i).T * np.matrix(i) new_sigma[j] = new_sigma[j] / np.sum(r[j]) new_sigma[j] = new_sigma[j] - np.matrix(new_mu[j]).T * np.matrix(new_mu[j]) init_gaussian_parameter[j].mu = np.array(new_mu[j]) init_gaussian_parameter[j].sigma = np.array(new_sigma[j]) init_gaussian_parameter[0].plot(axis) init_gaussian_parameter[1].plot(axis)
The result gives