Exploring the EM Algorithm

tl;dr: I explain, visualize, and code the EM Algorithm for Gaussian Mixture Models and show how it is basically just an example of Variational Inference.

In my AMLED class we recently talked about how Maximum Likelihood can help you find the best statistical model:

  1. Write down the formula for the Likelihood of the data $P(\mathbf{X})$. This is how likely the data are with respect to your probabilistic model's parameters.
  2. Take that likelihood, and take the natural logarithm of it. This won't change the optimal parameters that maximize the likelihood, since the logarithm is a monotonically increasing function.
  3. Take that log-likelihood, and optimize it, typically by setting the partial derivatives of the log-likelihood w.r.t. each parameter equal to 0 and solving for the parameters that satisfy those equations.

A typical example of this is finding the $\mu$ and $\sigma$ of a Gaussian Distribution.
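
As a quick illustration of that recipe (a minimal sketch, not part of the original notebook), the MLE for a single Gaussian has a closed form: setting the partial derivatives of the log-likelihood to zero yields the sample mean and the (biased) sample variance, which we can check numerically:

# Sketch: closed-form MLE for a single Gaussian (illustrative values)
import numpy as np
from scipy.stats import norm

np.random.seed(0)
data = norm(loc=2.0, scale=1.5).rvs(size=10000)  # pretend this is our observed data

mu_hat = np.mean(data)                    # d(logL)/d(mu) = 0      =>  sample mean
sigma2_hat = np.mean((data - mu_hat)**2)  # d(logL)/d(sigma^2) = 0 =>  biased sample variance
print(mu_hat, np.sqrt(sigma2_hat))        # should be close to 2.0 and 1.5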

This works fine for certain convex likelihoods (provided you are careful to take care of things like singularities, ill-conditioned matrices, etc.). However, for many types of models this optimization is not straightforward: the optimal parameters depend on one another in complex, non-linear ways, and you can't just "plug in" the right data to get the optimal answer. One case where this occurs is in Latent Variable models, such as Gaussian Mixture Models, that is, where you have data that are generated from multiple Gaussian distributions, but you don't know which distribution generated which data point. While this might seem similar to the MLE of the Gaussian Distribution above, it is actually much more complex.

One way to solve this problem is by first finding a "simpler" distribution, for which analytical, closed-form updates are possible, and then matching that simpler distribution as closely as possible to the "harder" distribution. This is the basic idea behind Variational Inference. Often, one can iteratively update simpler models to get closer and closer to the "harder" one. That iteration between simpler models is the basic idea of the Expectation-Maximization Algorithm. And in some cases, with certain models and distributions, that iterative process of inching closer and closer with simpler distributions can actually get you the exact solution to the harder distribution that you originally wanted. Below, we will see that Gaussian Mixture Models are one case where that happens. Most of the time, however, you will not be so lucky and the optimized simple models will be a little different from what an optimized hard model would be; in these cases you just hope that the difference doesn't affect actual performance much.

In [1]:
# First, let's just setup the libraries we'll need
from scipy.stats import norm
from scipy.stats import entropy
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from ipywidgets import interact
import numpy as np
plot_height = 400
plot_width = 800
# output to the IPython Notebook
output_notebook()
BokehJS successfully loaded.

Problem Setup - Gaussian Mixture Models

As mentioned above, the basic idea is that you have $N$ data points that are actually drawn from $K$ different Gaussian distributions. We don't know which data point was drawn from which distribution, or what these distributions are: where each ($k$) is centered ($\mu_k$), or what its standard deviation ($\sigma_k$) is. To make things even a little trickier, we also don't know how many of the $N$ points are from which distribution! Maybe 90% of the points come from one distribution, and 10% come from the others; we don't know. So we also want to figure out, for each ($k$) distribution, what fraction of the data comes from that distribution ($\pi_k$).

So what does this look like?

Before we get into the details of the model, let's just see what the problem and data look like:

In [2]:
# Draw points from actual Gaussians
N = 10000 # This is the number of data points
K = 2 # We'll only do 2 Gaussians

### First Gaussian ###
mu_1 = 3
s_1 = 0.5
### Second Gaussian ###
mu_2 = 5
s_2 = 1

# Relative frequency of the gaussians
pi1 = 1
pi2 = 2 # The second will produce twice as many data points
pi_total = pi1+pi2
pi1 /= pi_total # Normalize
pi2 /= pi_total # Normalize
# Make an analytical gaussian with scipi.stats
g1 = norm(loc = mu_1, scale = s_1)
g2 = norm(loc = mu_2, scale = s_2)
# Now "mix" the two: this shows us the joint PDF of the mixture
gm = lambda x: pi1*g1.pdf(x)+pi2*g2.pdf(x)

# Plot the actual Probability Density Function (PDF) curve
xp = np.linspace(0, 10, 1000)
p = figure(title="Gaussian Mixtures", x_axis_label='x', y_axis_label='p(x)',
           plot_height = plot_height, plot_width = plot_width)
# Plot the first gaussian - lighter line
p.line(xp,g1.pdf(xp),legend="p1(x)",  line_width=2, alpha =0.3)
# Second gaussian - lighter line
p.line(xp,g2.pdf(xp),legend="p2(x)",  line_width=2, alpha =0.3)
# Mixed gaussian - darker line
p.line(xp,gm(xp),legend="Mix",  line_width=2, alpha = 1)
show(p)
In [3]:
# Now generate actual data points
D1 = g1.rvs(size=int(N*pi1)) # Get points from the first (size must be an integer)
D2 = g2.rvs(size=int(N*pi2)) # Get points from the second
D = np.hstack([D1,D2])
N = len(D) 

# Plot them
hist, edges = np.histogram(D, density=True, bins=50)
p1 = figure(title="Actual Mixture with Data", x_axis_label='x', y_axis_label='p(x)',
           plot_height = plot_height, plot_width = plot_width)
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
     fill_color="#036564", line_color="#033649", alpha=0.4)
p1.line(xp,gm(xp),legend="p(x)",  line_width=2)
show(p1)

The Probabilistic Model

Now that we understand the setup and have seen what the data look like, let's think about how we would model this probabilistically: We know we have data ($\mathbf{X}$), and that each data point ($X_n$) is independently selected from one ($Z_n$) out of a set ($\mathbf{Z}$) of Gaussians. The chance of that data point coming from a specific Gaussian (e.g., the first one, $k=1$) is controlled by the overall percentage of data that the specific distribution contributes ($\pi_k$). We don't know the means ($\mathbf{\mu}$), variances ($\mathbf{\sigma^2}$), or data percentages ($\mathbf{\pi}$) of any of the $K$ Gaussians. We also don't know which Gaussian ($Z_n$) was responsible for which data point ($X_n$). In this example, we will assume $\mu$, $\sigma$, and $\pi$ are just fixed parameters that we will optimize over (a fully Bayesian model would treat these as random variables with a prior, etc., but let's not get into that just yet...).

We'll also assume that $\mathbf{Z}$ is an unknown set of binary random variables that dictate which $k$ out of the $K$ (2, in our case) distributions a specific data point ($X_n$) came from. So, $\mathbf{Z}$ is actually an $N\times K$ matrix where $Z_{n,k}=1$ if $X_n$ came from the $k$th distribution, and $Z_{n,k}=0$ otherwise. For example, for the 7th data point ($X_7$), $Z_7$ might be [ 1 , 0 ], so $Z_{7,1} = 1$, $Z_{7,2} = 0$, meaning that $X_7$ came from the first Gaussian, and not the second. Since a data point can only come from one of those Gaussians, it also makes sense that $\sum_{k=1}^K Z_{n,k} = 1$. Of course, we don't yet know which $Z_{n,k}$ are 1 or 0, but our plan is to use our probabilistic model to figure that out.

So given all of this, how would we express the joint probability over the random variables (i.e., X,Z):

$$ P(\mathbf{X},\mathbf{Z}) = \prod_{n=1}^{N} P(X_n,Z_n) = \prod_{n=1}^{N} P(X_n|Z_n)P(Z_n) $$

since the $N$ data points are independent of one another, and where: $$ P(X_n|Z_n) = \prod_{k=1}^{K} \mathcal{N_n}(X|\mu_k,\sigma_k)^{Z_{nk}} $$ $$ \mathcal{N_n}(X|\mu_k,\sigma_k) = \left(2\pi \sigma_k^2\right)^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\frac{(x_n - \mu_k)^2}{\sigma_k^2}\right] $$ $$ P(Z_n) = \prod_{k=1}^{K} \pi_k^{Z_{nk}}~~ \mathrm{where}~~ \sum_{k=1}^{K} \pi_k = 1 $$

There is one weird math trick going on above: we're using $Z_{n,k}$ as an indicator variable, to select the appropriate Gaussian distribution and $\pi_k$ when a given $Z_{n,k}=1$.

Optimizing the Log-Likelihood

Alright, so now that we have the probabilistic model set up, and have chosen what our distributions will be, we can attack our main goal of finding $\mu$, $\sigma$, and $\pi$, such that the data likelihood is maximized. To do this, we need the likelihood of the data ($P(\mathbf{X})$): $$ \begin{eqnarray} P(\mathbf{X}) &=& \prod_{n=1}^{N} P(X_n)\\ &=& \prod_{n=1}^{N} \sum_Z P(X_n,Z_n)\\ &=& \prod_{n=1}^{N} \sum_Z P(X_n|Z_n)P(Z_n)\\ &=& \prod_{n=1}^{N} \sum_Z \left[ \prod_{k=1}^{K} \mathcal{N_n}(X|\mu_k,\sigma_k)^{Z_{nk}} \right] \left[\prod_{k=1}^{K} \pi_k^{Z_{nk}}\right]\\ &=& \prod_{n=1}^{N} \sum_Z \prod_{k=1}^{K} \pi_k^{Z_{nk}}\mathcal{N_n}(X|\mu_k,\sigma_k)^{Z_{nk}} \\ &=& \prod_{n=1}^{N} \sum_Z \prod_{k=1}^{K} \left[ \pi_k \mathcal{N_n}(X|\mu_k,\sigma_k) \right]^{Z_{nk}} \\ &=& \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N_n}(X|\mu_k,\sigma_k)\\ \end{eqnarray} $$

Let's stop here a second and look back at those last two lines. What happened to $Z_{n,k}$ there? For each data point ($\prod_{n=1}^{N}$), we sum over all possible values of $Z_n$ (to marginalize out that variable). What does this do? Each of the $K$ terms in that summation corresponds to setting one particular $Z_{n,k}=1$ and all the others to 0. So, given that we're looking at a particular data point $n$, $\sum_Z$ really means $\sum_{k=1}^K$. Then, when we see the product ($\prod_{k=1}^{K}$), every term inside that product is raised to the power $Z_{n,k}$, but only one of those exponents is 1 inside each summand (the $k$th one); the rest are 0 ($Z_{n,j}=0~~\forall j\ne k$), so those factors are just 1. Thus, the inner product collapses to the single term $\pi_k \mathcal{N_n}(X|\mu_k,\sigma_k)$.
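
To make that collapse concrete, here is the marginalization written out for $K=2$ (just expanding the definitions above):

$$ \sum_{Z_n} \prod_{k=1}^{2} \left[ \pi_k \mathcal{N_n}(X|\mu_k,\sigma_k) \right]^{Z_{nk}} = \underbrace{\left[\pi_1 \mathcal{N_n}(X|\mu_1,\sigma_1)\right]^{1}\left[\pi_2 \mathcal{N_n}(X|\mu_2,\sigma_2)\right]^{0}}_{Z_n=[1,0]} + \underbrace{\left[\pi_1 \mathcal{N_n}(X|\mu_1,\sigma_1)\right]^{0}\left[\pi_2 \mathcal{N_n}(X|\mu_2,\sigma_2)\right]^{1}}_{Z_n=[0,1]} = \pi_1 \mathcal{N_n}(X|\mu_1,\sigma_1) + \pi_2 \mathcal{N_n}(X|\mu_2,\sigma_2) $$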

Now that we have done some simplifying algebra on the Likelihood, let's get the Log-Likelihood: $$ \begin{eqnarray} \ln P(\mathbf{X}) &=& \ln \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\mathcal{N_n}(X|\mu_k,\sigma_k)\\ &=& \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\mathcal{N_n}(X|\mu_k,\sigma_k)\\ \end{eqnarray} $$

In [4]:
gaussian = lambda x,mu,sigma2: 1/(np.sqrt(2*np.pi*sigma2)) * np.exp(-(x-mu)**2 / (2*sigma2))
def log_likelihood(X,mu,sigma2,pi):
    LL = 0
    for n in range(N):
        L = 0
        for k in range(K):
            L+=pi[k]*gaussian(X[n],mu[k],sigma2[k])
        LL += np.log(L)
    return LL
# Actually calculating the likelihood is not something you would do in practice
# Since the numbers get so small that they become useless and prone to round-off error
# This function is just for illustrative purposes.
def likelihood(X,mu,sigma2,pi):
    L = 1
    for n in range(N):
        sL = 0
        for k in range(K):
            sL+=pi[k]*gaussian(X[n],mu[k],sigma2[k])
        L *= sL
    return L
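
As an aside, if you did want a numerically robust version, a common approach (just a sketch here, not used in the rest of this notebook) is to vectorize the computation and work entirely in log-space with the log-sum-exp trick; it should agree with log_likelihood above up to floating-point error:

# Sketch: a vectorized, log-space version of log_likelihood
# (assumes a scipy version where logsumexp lives in scipy.special; older versions had it in scipy.misc)
from scipy.special import logsumexp
def log_likelihood_vec(X,mu,sigma2,pi):
    X = np.asarray(X)[:, np.newaxis]            # shape (N, 1)
    mu = np.asarray(mu)[np.newaxis, :]          # shape (1, K)
    sigma2 = np.asarray(sigma2)[np.newaxis, :]  # shape (1, K)
    # log[ pi_k * N(x_n | mu_k, sigma_k^2) ] for every (n, k) pair, shape (N, K)
    log_terms = (np.log(pi)[np.newaxis, :]
                 - 0.5*np.log(2*np.pi*sigma2)
                 - (X - mu)**2 / (2*sigma2))
    # logsumexp over k avoids underflow; then sum the per-point log-likelihoods over n
    return np.sum(logsumexp(log_terms, axis=1))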

Now that we have the Log-Likelihood, we want to maximize it. We do this by taking derivatives of the Log-Likelihood and setting them equal to zero to obtain the MLE: $$ \begin{eqnarray} N_k &=& \sum_{n=1}^{N} \gamma(z_{nk})\\ \gamma(z_{nk}) &=& \frac{\pi_k \mathcal{N_n}(X|\mu_k,\sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N_n}(X|\mu_j,\sigma_j)}\\ \mu_k &=& \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\cdot x_n\\ \sigma_k^2 &=& \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n-\mu_k)^2\\ \pi_k &=& \frac{N_k}{N} \end{eqnarray} $$ Note that $\gamma(z_{nk})$ depends on $\mu,\sigma,\pi$, and vice versa, but if we knew one group we could easily figure out the other. In fact, we will alternate between the two by guessing one, and then using that guess to update the other one. We will then switch back and forth, alternating as we go until they don't change anymore. This back and forth is the main idea behind the EM Algorithm. What the EM Algorithm is doing is actually a type of Variational Inference, which we'll come back to below, but first let's just see visually (and in code) how the EM algorithm handles the mixture of two Gaussians:

In [5]:
def update_gamma(X,mu,sigma2,pi):
    gamma = np.zeros((N,K))
    for n in range(N):
        for k in range(K):
            gamma[n][k] = pi[k]*gaussian(X[n],mu[k],sigma2[k])
        # Now normalize
        gamma[n] /= np.sum(gamma[n])
    return gamma
In [6]:
# E-Step
def EStep():
    G=update_gamma(D,mu,s2,pi) # Gamma
    return (G, np.sum(G,axis=0)) # Total Counts
In [7]:
# M-Step:
def MStep():
    # Note: sigma^2 is computed using the current (pre-update) mu, since mu is only
    # reassigned outside this function, after the whole tuple has been returned.
    return (np.sum(np.multiply(gamma,D[:,np.newaxis]),axis=0)/Nk, # Mu
            np.sum(np.multiply(gamma,(np.subtract(mu,D[:,np.newaxis]))**2),axis=0)/Nk, # Sigma^2
            Nk/N) # Pi

Now let's visualize how exactly EM works:

In [8]:
# First, let's randomly initialize everything
np.random.seed(3) # Just so everyone will see something consistent.
gamma = np.zeros((N,K))
mu = np.random.uniform([0,10],size=K)
s2 = np.random.uniform([0.5,3],size=K)
pi = np.random.uniform([0,1],size=K)
pi /= np.sum(pi) # pi must sum to 1
print('   mu:',mu)
print('sigma:',np.sqrt(s2))
print('   pi:',pi)
   mu: [ 0.5507979  3.6266696]
sigma: [ 0.8034005   1.40653645]
   pi: [ 0.47172318  0.52827682]
In [9]:
def plot_dist(title):
    em0 = norm(loc = mu[0], scale = np.sqrt(s2[0]))
    em1 = norm(loc = mu[1], scale = np.sqrt(s2[1]))
    emg = lambda x: pi[0]*em0.pdf(x)+pi[1]*em1.pdf(x)

    hist, edges = np.histogram(D, density=True, bins=50)
    p1 = figure(title=title, x_axis_label='x', y_axis_label='p(x)',
               plot_height = plot_height, plot_width = plot_width)
    p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
         fill_color="#036564", line_color="#033649", alpha=0.4)
    p1.line(xp,emg(xp),legend="p(x)",  line_width=2)
    show(p1)
    print('LogL(X):',log_likelihood(D,mu,s2,pi))
    print('   mu:',mu)
    print('sigma:',np.sqrt(s2))
    print('   pi:',pi)
plot_dist("Initial Mixture Model")
LogL(X): -24165.6102451
   mu: [ 0.5507979  3.6266696]
sigma: [ 0.8034005   1.40653645]
   pi: [ 0.47172318  0.52827682]
In [10]:
plot_height=200
plot_width=400
iters = 60
m1 = np.zeros(iters)
m2 = np.zeros(iters)
pi1 = np.zeros(iters)
pi2 = np.zeros(iters)
s21 = np.zeros(iters)
s22 = np.zeros(iters)
for i in range(iters):
    gamma, Nk = EStep()
    mu, s2, pi = MStep()
    if i%10==0:
        print("After ",i," Iterations...")
        plot_dist("Iteration "+str(i))
    m1[i] = mu[0]
    m2[i] = mu[1]
    pi1[i] = pi[0]
    pi2[i] = pi[1]
    s21[i] = s2[0]
    s22[i] = s2[1]
plot_dist("Iteration "+str(i))
After  0  Iterations...
LogL(X): -16913.7198837
   mu: [ 2.30103714  4.38008569]
sigma: [ 1.79815121  1.46766574]
   pi: [ 0.0209453  0.9790547]
After  10  Iterations...
LogL(X): -16634.9190938
   mu: [ 3.14792909  4.36061059]
sigma: [ 0.61676405  1.28140891]
   pi: [ 0.01984957  0.98015043]
After  20  Iterations...
LogL(X): -16277.4977762
   mu: [ 2.8976809   4.56734176]
sigma: [ 0.33622725  1.22734046]
   pi: [ 0.13823308  0.86176692]
After  30  Iterations...
LogL(X): -16180.6742575
   mu: [ 2.91742178  4.74300876]
sigma: [ 0.42807031  1.14992907]
   pi: [ 0.22265133  0.77734867]
After  40  Iterations...
LogL(X): -16148.0395245
   mu: [ 2.95129087  4.85797164]
sigma: [ 0.4734274   1.08914979]
   pi: [ 0.27347643  0.72652357]
After  50  Iterations...
LogL(X): -16135.3217373
   mu: [ 2.97961941  4.9316184 ]
sigma: [ 0.49767569  1.04727505]
   pi: [ 0.30485621  0.69514379]
LogL(X): -16130.6433223
   mu: [ 2.99861521  4.97377539]
sigma: [ 0.51085983  1.02280191]
   pi: [ 0.32262497  0.67737503]

In addition to visualizing the progress of the mixtures, we can also plot the convergence of the parameters:

In [11]:
plot_height = 500
plot_width = 600
p = figure(title="Convergence", x_axis_label='x', y_axis_label='f(x)',
           plot_height = plot_height, plot_width = plot_width)
p.line(range(iters),m1,legend="m1", line_width=2, alpha=1.0, color='red')
p.line(range(iters),m2,legend="m2", line_width=2, alpha=1.0, color='blue')
p.line(range(iters),pi1,legend="pi1", line_width=3, alpha=0.75, color='red', line_dash='dashed')
p.line(range(iters),pi2,legend="pi2", line_width=3, alpha=0.75, color='blue', line_dash='dashed')
p.line(range(iters),s21,legend="s21", line_width=4, alpha=0.5, color='red', line_dash='dotted')
p.line(range(iters),s22,legend="s22", line_width=4, alpha=0.5, color='blue', line_dash='dotted')
show(p)
In [12]:
plot_dist("Final Mixture Model")
LogL(X): -16130.6433223
   mu: [ 2.99861521  4.97377539]
sigma: [ 0.51085983  1.02280191]
   pi: [ 0.32262497  0.67737503]

Connection to Variational Inference

So we saw how by decomposing the Log Likelihood into two simpler, tractable parts and iterating between the two, the EM algorithm could take what would normally be a difficult estimation problem and make it solvable. This approach (modeling a complex distribution through a simpler one and using that simpler one to "inch" closer to the right answer) is central to one of the main methods of statistical inference: Variational Inference.

Typically, Variational Inference is discussed in the context of fully Bayesian approaches, but I find it easier to introduce it in the concrete context of the EM algorithm above (following the style and progression of Bishop 2006). To see what I mean by this, let's review what we know from above:

  1. We wanted to optimize $P(\mathbf{X})$ with respect to the model's parameters $\mu$, $\sigma$, $\pi$; however, this was difficult because the likelihood was not straightforward: our updates for the parameters were awkwardly tied together because of the latent variables $Z_{n,k}$.
  2. However, if we didn't have to marginalize $Z$ out, then the log-likelihood would become much simpler: $$ \begin{eqnarray} \ln P(\mathbf{X},\mathbf{Z}) &=& \ln \prod_{n=1}^{N} P(X_n,Z_n)\\ &=& \sum_{n=1}^{N} \ln P(X_n|Z_n)P(Z_n)\\ &=& \sum_{n=1}^{N} \ln \prod_{k=1}^{K} \left[ \pi_k \mathcal{N_n}(X|\mu_k,\sigma_k) \right]^{Z_{nk}} \\ &=& \sum_{n=1}^{N} \sum_{k=1}^{K} Z_{nk}\left[\ln(\pi_k) + \ln (\mathcal{N_n}(X|\mu_k,\sigma_k))\right]\\ &=& \sum_{n=1}^{N} \sum_{k=1}^{K} Z_{nk}\left[\ln(\pi_k) -\frac{1}{2} \ln \left(2\pi \sigma_k^2\right) -\frac{1}{2\sigma_k^2}(x_n - \mu_k)^2\right]\\ \end{eqnarray} $$

How can we use this to our advantage? We will take our original formulation for $P(X)$ and introduce an explicit, tractable probability distribution over $Z$, $q(Z)$. We get to choose what $q(Z)$ is, and we will try to pick something that helps us approximate $P(X)$ in a useful way while still being tractable to compute. To see how this works, first let's decompose the joint probability and shift some terms around: $$ \begin{eqnarray} \ln P(X,Z) &=& \ln P(Z|X) + \ln P(X)\\ \ln P(X) &=& \ln P(X,Z) - \ln P(Z|X) \end{eqnarray} $$ Now we'll introduce $q(Z)$, noting that $\sum_Z q(Z) = 1$ since $q$ is a probability distribution: $$ \begin{eqnarray} \sum_Z q(Z) \ln P(X) &=& \sum_Z q(Z)\ln P(X,Z) - q(Z)\ln P(Z|X)\\ \ln P(X) &=& \sum_Z q(Z)\ln P(X,Z) - q(Z)\ln P(Z|X) + \left[q(Z)\ln q(Z) - q(Z)\ln q(Z)\right]\\ &=& \sum_Z \left[q(Z)\ln P(X,Z)-q(Z)\ln q(Z)\right] - \left[q(Z)\ln P(Z|X) - q(Z)\ln q(Z)\right]\\ &=& \sum_Z q(Z)\ln \frac{P(X,Z)}{q(Z)} - q(Z)\ln \frac{P(Z|X)}{q(Z)}\\ &=& \sum_Z q(Z)\ln \frac{P(X,Z)}{q(Z)} + q(Z)\ln \frac{q(Z)}{P(Z|X)}\\ &=& \mathcal{L}(q) + KL(q~||~P(Z|X)) \end{eqnarray} $$

So we see that the complex log-likelihood that we had before ($\ln P(X)$) can actually be broken down into two terms: 1) $\mathcal{L}(q)$, which is a function that depends only on $q$ (a distribution that we get to choose) and the (substantially simpler) complete model likelihood where we don't have to marginalize out $Z$ (i.e., $P(X,Z)$); and 2) the KL-Divergence between our chosen simple distribution $q$ and $P(Z|X)$.

You are probably asking yourself three questions at this point: 1) What is the KL-Divergence? 2) What exactly is this $q$ distribution anyways? and 3) How does this get me any closer to solving my problem?

While I encourage you to read further about it, in broad strokes the KL-Divergence measures the expected logarithmic difference between two probability distributions (where the expectation is taken with respect to the first distribution). As such, it has the following nice properties:

  1. $KL(p~||~p) = 0$
  2. $KL(q~||~p) \geq 0$
  3. $KL(q~||~p) = 0~\mathrm{if~and~only~if}~q=p$
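
As a quick numerical illustration of those properties (a small sketch; note that entropy was imported in the first cell but is not otherwise used in the code shown here), scipy.stats.entropy(p, q) computes exactly this discrete KL-Divergence:

# KL-Divergence between two discrete distributions via scipy.stats.entropy
p = np.array([0.3, 0.7])
q = np.array([0.5, 0.5])
print(entropy(p, p))   # KL(p || p) = 0
print(entropy(q, p))   # KL(q || p) >= 0, and equals 0 only when q = p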

Why does this matter to us? Well, we see that for our new expression of $\ln P(X)$ above: $$ \ln P(X) = \mathcal{L}(q) + KL(q~||~P(Z|X)) $$ so we would like to drive $KL(q~||~P(Z|X))\rightarrow 0$, which means setting $q(Z) = P(Z|X)$. This is exactly what the E-step of the EM Algorithm did above: it optimized a distribution over $Z$, such that it matched $P(Z|X)$ while holding the parameters $\theta$ (i.e., $\mu$, $\sigma$, $\pi$) fixed: $$ \gamma(z_{nk}) = \frac{\pi_k \mathcal{N_n}(X|\mu_k,\sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N_n}(X|\mu_j,\sigma_j)} $$ You can think of it as setting $q(Z)=P(Z|X,\theta^{old})$.
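
To make the decomposition concrete, here is a small sketch (using the gamma, mu, s2, and pi variables already defined above) that computes $\mathcal{L}(q)$ and $KL(q~||~P(Z|X))$ directly and checks that they sum to $\ln P(X)$; right after an E-step with the current parameters, the KL term should be essentially zero:

# Sketch: verify that ln P(X) = L(q) + KL(q || P(Z|X)) for a factorized q over Z
def decompose_log_likelihood(q,X,mu,sigma2,pi):
    post = update_gamma(X,mu,sigma2,pi)   # P(Z|X) under the current parameters
    L_q, KL = 0.0, 0.0
    for n in range(N):
        for k in range(K):
            joint_nk = pi[k]*gaussian(X[n],mu[k],sigma2[k])   # P(X_n, Z_nk = 1)
            L_q += q[n][k]*(np.log(joint_nk) - np.log(q[n][k]))
            KL  += q[n][k]*(np.log(q[n][k]) - np.log(post[n][k]))
    return L_q, KL

L_q, KL = decompose_log_likelihood(gamma,D,mu,s2,pi)
print(L_q + KL, log_likelihood(D,mu,s2,pi))  # these two numbers should match
print(KL)  # ~0 if gamma was just recomputed by an E-step under these same parameters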

Note that minimizing $KL(q~||~P(Z|X))$ doesn't change $\ln P(X)$ itself, since $\ln P(X)$ doesn't depend on $q(Z)$: whatever amount the KL term shrinks by gets transferred over to $\mathcal{L}(q)$, until $KL(q~||~P(Z|X))=0$ and $\mathcal{L}(q) = \ln P(X)$. So minimizing the KL makes $\mathcal{L}(q)$ a tight lower bound on $\ln P(X)$, but on its own it doesn't increase the likelihood.

We've still made progress, however: the E-step minimized $KL(q~||~P(Z|X))$ by holding the parameters fixed ($\theta^{old}$) and finding the "best" simple distribution $q(Z|\theta^{old})$ for our given likelihood. Now we can turn the tables by keeping $q(Z|\theta^{old})$ fixed and trying to find the best parameters $\theta$ that increase $\mathcal{L}(q)$!

$$ \begin{eqnarray} \mathcal{L}(q) &=& \sum_Z q(Z|\theta^{old})\ln \frac{P(X,Z|\theta)}{q(Z|\theta^{old})}\\ &=& \sum_Z q(Z|\theta^{old})\ln P(X,Z|\theta) - q(Z|\theta^{old})\ln q(Z|\theta^{old})\\ &=& \sum_Z q(Z|\theta^{old})\ln P(X,Z|\theta) + \mathrm{constant}\\ &=& \sum_Z P(Z|X,\theta^{old})\ln P(X,Z|\theta) + \mathrm{constant}\\ &=& \mathbb{E}_{P(Z|X,\theta^{old})} \ln P(X,Z|\theta) + \mathrm{constant} \end{eqnarray} $$

So what we are really optimizing here is the complete-data log-likelihood $P(X,Z|\theta)$, but under the expectation of $P(Z|X,\theta^{old})$. In other words, we are doing the much simpler optimization (over $P(X,Z)$), by setting the troublesome values of $Z$ to what they would have been, in expectation, given the old parameters.
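
To see how this recovers the M-step updates we used earlier, here is the $\mu_k$ case sketched out: replace $Z_{nk}$ with its expectation $\gamma(z_{nk})$ in the complete-data log-likelihood from before, differentiate with respect to $\mu_k$, and set the result to zero:

$$ \begin{eqnarray} \mathbb{E}_{P(Z|X,\theta^{old})} \ln P(X,Z|\theta) &=& \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\left[\ln(\pi_k) -\frac{1}{2}\ln\left(2\pi \sigma_k^2\right) -\frac{1}{2\sigma_k^2}(x_n-\mu_k)^2\right]\\ \frac{\partial}{\partial \mu_k} \mathbb{E}_{P(Z|X,\theta^{old})} \ln P(X,Z|\theta) &=& \sum_{n=1}^{N} \gamma(z_{nk})\frac{x_n-\mu_k}{\sigma_k^2} = 0 \quad\Rightarrow\quad \mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,x_n \end{eqnarray} $$

The same recipe gives the $\sigma_k^2$ update, and adding a Lagrange multiplier for the $\sum_{k=1}^{K} \pi_k = 1$ constraint gives the $\pi_k$ update.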

This is identical to the "M-Step" that we saw above and is guaranteed to increase the log-likelihood (unless we're already at the optimal parameters). Unfortunately, in changing the parameters $\theta$, we also increase the KL-divergence $KL(q~||~P(Z|X))$, which means we have to do the E-step again (to bring the $KL$ back to 0).

This back and forth might seem counter-productive, but in each step we're making guaranteed progress: minimizing the KL in the E-Step improves $q(Z)$ without harming the overall log-likelihood of $P(X)$, and then maximizing $\mathcal{L}(q)$ finds better model parameters by bumping up the likelihood. Over time, this one-two punch climbs its way up the likelihood to a (local) maximum, and we have made an intractable problem easily solvable!

Of course, we were pretty lucky in this particular example: 1) we were able to minimize the KL to zero through a fairly straightforward, closed-form update; this may not always be the case and we may need to pick a computable distribution that doesn't accurately capture $P(Z|X)$. Likewise, 2) we could maximize $\mathcal{L}(q)$ in a useful, closed form; again this may not be possible and we may have to make approximations there. Still, this general framework forms the backbone of Variational Inference and you can apply it to many models beyond just Gaussian Mixtures.

Laplacians and Middleschool

tl;dr: I visualize the connection between Graphs and their Laplacian Eigenvalues.

The below example is something I put together to supplement an extra credit assignment as part of my Applied Machine Learning for Engineering and Design course, though I'll post it here in case others find it useful.

In [771]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import numpy as np
import scipy.linalg
# Add a dark background
#sns.set_style('dark')
sns.set_context("poster")

A Relaxed Caveman graph basically just creates K groups of N members each, where every member is connected to every other member within its group, but to no one outside of its group (this should look familiar to anyone who ever went to middle school...). The "relaxed" part controls whether these internal connections are randomly rewired to other nodes, i.e., the nodes from one group will start to randomly connect with other groups as the rewiring probability increases.

In [772]:
number_of_clusters = 4
number_of_nodes_in_each_cluster = 5
probability_of_randomly_rewriting_edges = 0.01
G = nx.relaxed_caveman_graph(number_of_clusters,
                             number_of_nodes_in_each_cluster,
                             probability_of_randomly_rewriting_edges,
                             seed=53)
fig = plt.figure(figsize=(8,16))
ax = plt.subplot(2,1,1)
nx.draw(G,pos=nx.spring_layout(G, iterations=500, weight=500),
        linewidths=1, width=2, with_labels=True, ax=ax,
        node_color = sns.color_palette()[0], node_size=500)
plt.draw()

With this graph, G (which, above, has number_of_clusters components), we can plot its adjacency matrix, showing which nodes are connected to which other nodes:

In [773]:
A = nx.adjacency_matrix(G).todense()
plt.matshow(A,cmap = plt.cm.Blues)
Out[773]:
<matplotlib.image.AxesImage at 0x129b1eb0>

We can now construct something called the Graph Laplacian, which is just: $$L = D - A$$ where A is the adjacency matrix above, and D is the Degree Matrix of the graph. The Laplacian is just a matrix, so we can take that matrix and find its Eigenvalues, and then plot the eigenvalues from smallest to largest:

In [774]:
D = np.diag(list(nx.degree(G).values()))
L = D-A
# For undirected graphs, the Laplacian is real-symmetric
w = scipy.linalg.eigvalsh(L)
plt.plot(w,'o-')
plt.xlabel('Index of Eigenvalue')
plt.ylabel('Value of Eigenvalue')
Out[774]:
<matplotlib.text.Text at 0x14d191b0>

To help us understand the role of the Laplacian here, let's take what we did above and put it in a function, so that we can look at different kinds of graphs and see how things change.

In [775]:
def draw_relaxed_graph_and_spectrum(probability_of_randomly_rewriting_edges):
    number_of_clusters = 4
    number_of_nodes_in_each_cluster = 5
    G = nx.relaxed_caveman_graph(number_of_clusters,
                                 number_of_nodes_in_each_cluster,
                                 probability_of_randomly_rewriting_edges,
                                 seed=53)
    A = nx.adjacency_matrix(G).todense()
    D = np.diag(list(nx.degree(G).values()))
    L = D-A
    # Now Plot things
    fig = plt.figure(figsize=(16,8))
    fig.patch.set_alpha(0)
    ax1 = plt.subplot(1,2,1)
    ax2 = plt.subplot(1,2,2)
    # Plot the graph
    nx.draw(G,pos=nx.spring_layout(G, iterations=500, weight=1, k=.17),
            linewidths=1, width=2, ax=ax1,
            node_color = sns.color_palette()[0], node_size=500)
    # Plot the eigenvalues
    ax2.plot(scipy.linalg.eigvalsh(L),'o-')
    plt.xlabel('Index of Eigenvalue')
    plt.ylabel('Value of Eigenvalue')
    plt.draw()

Specifically, let's look at different Relaxed Caveman graphs as we increase the relaxation:

In [776]:
# Low Probability of Re-writing (aka Middle School)
draw_relaxed_graph_and_spectrum(0.01)
In [777]:
# Get some edges in there...
draw_relaxed_graph_and_spectrum(0.04)
In [778]:
# Some Re-writing (though still pretty clustered)
draw_relaxed_graph_and_spectrum(0.06)
In [779]:
# Now we're mixing!
draw_relaxed_graph_and_spectrum(0.2)
In [780]:
# Edges now equally likely to mix vs stay in the community
draw_relaxed_graph_and_spectrum(0.5)
In [781]:
# Completely random re-writing
draw_relaxed_graph_and_spectrum(1.0)

Questions for students:

  1. What is the connection between the graph and the eigenvalues of the Laplacian (in terms of the smaller eigenvalues)?
  2. What does it mean if an eigenvalue is zero?
  3. Describe one possible way you might use the above as a way to "discover" communities and clusters in data, even when the communities aren't completely separable.

Note, most libraries will have a function for what we just did above: nx.laplacian_spectrum(G)
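
As a quick sanity check (a sketch; exact return ordering can vary slightly across networkx versions, hence the sort), the built-in version should agree with the manual D - A computation above:

# Compare the built-in Laplacian spectrum against our manual computation
w_manual = scipy.linalg.eigvalsh(L)
w_builtin = np.sort(nx.laplacian_spectrum(G))
print(np.allclose(w_manual, w_builtin))   # expected: True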

The Design Data List

tl;dr: I've made a GitHub Repo for linking to known Design Data resources. Check it out and add to it if you or someone you know has design data somewhere.

One theme to come out during this year's IDETC Design Theory and Methodology session on Creativity and Ideation was the need for increased data sharing and reproducibility in our community. We do have a beginning culture of this, particularly in the CAD and Design Computing sub-fields. But cultural change takes time, and we have struggled to find a central place to share and store data like the Machine Learning community has managed to do with the UCI Machine Learning Repository and MLoss. Jami Shah's group has valiantly tried to create a central repository through the ASU Design Protocol Repository, and Rob Stone's group has provided a product repository for several years, but the practice of reproducibility and data sharing is not yet widely adopted.

I think this comes down to the perceived difference between benefits and costs:

  • Benefits: The benefits of sharing code and data aren't as well understood within our community as they are in other communities (Computer Science, Psychology, Economics, etc.). This is something that will take time to educate the community about, and is not the purpose of this post (though this would be a good topic for a future post).
  • Costs: Sharing data or code is seen as arduous (which might or might not be true depending on how it is done). Even if one wanted to follow Uri Simonsohn's advice to "Just Post It", it's often unclear how or where to post your data/code to maximize impact and minimize work. Do you just post up a zip file on your website, transfer it to a central community repository, or go all-in by publishing your entire workflow on something like the Open Science Framework? This is what this post attempts to address.

After chatting with folks at the past two IDETCs about this (notably Alex Burnap and Karthik Ramani), I decided to address the Cost side of the equation by borrowing an idea I saw from trying to keep track of the ever evolving landscape of Machine Learning Libraries: rather than asking folks to pick some standard method of sharing their data, I created a GitHub Repo that just aggregates links to all of the design data sources that I know about. This works for me for several reasons:

  • Contributing is easy: it's just a set of text files. The original authors can add a link, or any third party can. It's completely open.
  • Since it just provides links, researchers don't need to do anything fancy to share their data: they can choose to just post it wherever is convenient, or they can use more formal mechanisms like central repositories. It's all treated equally; I (or anyone else) just needs to add a link to it.
  • It's a low-effort way to highlight some of the great open-source work we do in our community, and to visualize our collective efforts in one place. In NSF parlance: it helps increase our work's broader impact beyond just the papers we write.
  • GitHub's version tracking and commenting system gives us a lightweight but potentially useful means to have a discussion around the purpose and organization of the list. Its pull request system means we can collaborate on it together and change it over time.

Will this be the permanent solution to bring reproducibility to the design community? I hope not. But it's a start; a way for folks to test the waters and establish a culture of reproducibility before trying their hands at something better-suited, like the OSF.

Recap of IDETC 2015

Table of Contents

IDETC 2015 concluded yesterday and there were a bunch of great talks and events I wanted to recap.

Conference Overall

This year, ASME held IDETC in the beautiful Back Bay neighborhood of Boston, with great nearby restaurants and perfect weather. The conference has grown this year with the co-location of the AM3D industry conference. This comes with the additional logistical complexity of having more people, but it was also nice to run into industry friends of mine whom I might not have seen otherwise. I'm looking forward to the more intimate Design Computing and Cognition conference in 2016, which is a nice counter-point to the behemoth IDETC has become. Overall though, the conference was nicely done, and I'm looking forward to IDETC next year in Charlotte.

NSF Broadening Participation Committee Workshop: Managing your Career Like an Entrepreneur

This Pre-conference Sunday workshop was the first NSF BP workshop I've attended, but it certainly won't be the last. Great community and diversity of ideas, with a fun atmosphere. We discussed the connection between Entrepreneurship and Academia, and how to identify strong/weak entrepreneurial skills within ourselves that we could work on. I had some great conversations, particularly with Li Shu of U-Toronto on how to build high functioning research teams.

The best sign of the quality of the workshop: it raised more questions than it answered. Great job to the Broadening Participation committee, particularly co-chairs Susan Finger and Kate Fu.

Design Theory and Methodology (DTM)

I spent most of my in-session time at the Design Theory and Methodology sub-conference this year. The room is always packed, the topics are diverse, and the quality tends to be very high overall. Specifically, here were some talks/papers that I thought raised some great questions:

DTM Creativity and Ideation

The Creativity and Ideation sessions were split into two sections this year (a sign of the topic's growing popularity), and they covered a wide gamut:

  • Bradley Camburn looked at the prototyping behaviors of participants on Instructables.com, and collected some interesting principles on how DIY designers and makers create ideas. Would be interesting to see how members of that community interact with each other to promote ideas, a la Liz Gerber's work on crowd-funding platforms and my work on Open-Source design communities.
  • Mahmoud Dinar looked at how to capture and understand problem formulation in a data-driven fashion. Would be fascinating to combine some of these ideas with unsupervised learning techniques as they collect more information from their web-tool. I'm definitely going to keep my eye on this!
  • Diana Moreno presented some great work on the application of creativity within the context of Service Design. I've been following her work for some time and it is great to see the expansion of DTM's efforts beyond traditional mechanical design. Dave Brown and Christine Toh triangulated a great discussion during the Q/A around the role of Fun and Novelty.
  • Ryan Arlitt gave an informative and entertaining talk that showed some of the power of combining automated text analysis with traditional Design by Analogy. The part I found the most intriguing was the delineation between compound and single analogies, and how to capture those connections in description text.
  • Bryan Levy gave an interesting talk on the open problem of identifying the fundamental characteristics of design problems; the central idea being that we need a minimal set of maximally diverse benchmark problems to consistently test our design methods. This is far from a solved problem, but he brought up some great points and the Q/A discussion shined.
  • Caitlin Dippo presented some follow-up work to her IDETC 2013 paper using the Alternative Uses Test (which I remember enjoying back then, so I'm glad they came back to DTM this year). This time, she focused on the role of concept elaboration, or how well-described a concept is. This reminded me of the concept of minimum information content/minimum description length in Computer Science, and how that might connect to design representations.
  • There were also some great talks by Christine Toh on Ownership Bias, Tahira Reid on reducing sketch inhibition, and by Tyler Johnson on linking ideation/creativity tests across Psychology and Engineering.

Really impressed with the diversity of topics this year that kept the session interesting. I gave the last talk in the session on the role of Statistical Tests in Design Cognition. It was primarily a review paper on the debate about Null Hypothesis Statistical Testing and how DTM has evolved over the past decade. We had an interesting discussion both after the talk and beyond the session, and I'm looking forward to exploring some of my ideas, particularly around review checklists, with the community in future years.

DTM User Preferences

The user preferences session this year was a brilliant display of diverse techniques, and also (personally interesting to me) the role that large-scale data can play in uncovering new questions:

  • Erin McDonald used eye-tracking techniques to debunk the assumption that shared features cancel each other out when users are considering alternatives. I love these kind of "things aren't as simple as you think they are" papers, since they always make me revisit my assumptions; I'm always better off as a result.
  • Alex Burnap presented this year's DTM Best Paper, which brought together a bunch of nice techniques under one umbrella. Specifically, he looked at the trade-offs car designers face between designs that are novel with those that preserve brand; a hard problem! They used a combination of crowd-sourced evaluations with a home-grown distance function to define a set of linear trade-offs for designers to explore. It reminded me of Pareto Frontiers, and could probably be expanded to non-linear convex frontiers in the future. A fascinating challenge they had to face is in defining reasonable distance metrics; this brought to mind techniques from the metric-learning community, where you can use human evaluations to actually learn the distance metric. So many neat directions they could take it! No wonder it was selected for the best paper award.
  • Cory Schaffhausen gave a talk near-and-dear to my own interests in identifying collections of user needs. His approach was interesting in that they studied the convergence and concentration of user needs collected during an Amazon Mechanical Turk project. The fascinating thing about the work was that unique user needs did not seem to saturate as they increased the sample size; this is odd, since you'd imagine that a given design prompt would have a finite number of applicable user needs. The talk highlighted the power of combining user needs with crowd-sourcing and automated text analysis, and I'm looking forward to following where this work goes from here.
  • Jessica Armstrong gave a nice counter-point to the more data-centric talks with her work on designing suits for empathic design for disabilities. This highlights the diversity of how DTM papers approach user preference problems.

Design Computing

I've been fascinated by Design Computing, particularly Computational Design Synthesis, since following the great work of Matt Campbell and Kristi Shea over the years. I missed the beginning and end of this session, but the talks I did catch were great:

  • I only caught the tail-end of Andy Dong's talk on graph transformations of function structures. In his usual clarity, Andy connected the worlds of graph theory to real-world properties of physical and functional relationships. Given the post-talk Q/A that I did see, I anticipate that I'll enjoy reading through the whole paper.
  • Joran Booth talked about the relationship between function trees and the final design. He brought up some neat ideas around the quality of function decompositions which I'm still mulling over.
  • Clemens Münzer presented some new work in a line of functional synthesis that brought together prior work on the connection between configuration design and the Boolean Satisfiability Problem, and how to conduct joint optimization using simulation models. I love this kind of work because it marries the pragmatic choices one needs to make when adapting optimization to real engineering problems, with the elegance of finding unique connections to well-studied problems in computer science (I remember reading their first paper on connecting design synthesis to the SAT problem---brilliant!)
  • Fritz Stöckli presented work on optimizing Brachiating (i.e., swinging) passive robots using graph grammars. As expected, the objective space is highly non-linear, and so they turned to techniques like simulated annealing and showed some well-presented results. This got me thinking about relaxations within complex design configuration problems, and how we might approximate the solution space to simplify search.

This led to a great conversation I had with Warren Seering and Chad Foster about how Design Computing could benefit from bridging to some of the theoretical work in computational complexity and lower-/upper-bounds on performance (e.g., Branch-and-Bound and Upper Confidence Bound/Minimal Regret algorithms) that are common in Computer Science. Some of the work by Kristi's group would be the most state-of-the-art that I can think of; I think the field is ripe for more efforts in this area. I anticipate recent work in Multi-Armed and Infinite-Armed Bandits will be relevant as we broach Design Computing topics that combine discrete and continuous parameter spaces.

DTM Trends and Technologies Impacting the Design Process

This was a neat session that brought together a set of topics that the chairs thought might represent future directions for DTM. Two papers were highly related to the intersection of design and data:

  • Chaoyang Song and Jianxi Luo summarized products that resulted from Crowd-funding, specifically on platforms such as Kickstarter and Indiegogo. Again this reminded me of Liz Gerber's work on crowd-funding and my work on online design communities. There does seem to be a growing interest in how to leverage "the crowd" in various capacities to aid product development, whether through funding or design itself. This was exemplified through the next DTM Trends talk:
  • Dev Ramanujan presented a Crowd Co-Creation study that used professional designers to seed a crowd-interface where users could parametrically alter the 3D models to produce new designs. Again, we see the use of crowd services like MTurk, as well as adoption of ML techniques such as k-means to help interpret the designs that get produced.

DTM Overall

There were several growing themes at DTM this year, some of which included:

  • The use of crowd platforms (MTurk, Instructables, etc.) for the purpose of conducting empirical work.
  • The combination of advanced statistical and computational techniques with human-generated data, whether through databases, crowd-sourcing, or otherwise.
  • The expansion of the scope that DTM typically covers; for example, product-service systems.

I'm excited to see where our community takes these themes at next year's conference.

Design Automation Conference: Data-Driven Design

Two years ago the Design Automation Conference had "Data-Driven Design" as an emerging research topic, and now the Data-Driven Design session at DAC is in full swing. I was really fascinated by this set of talks because of their obvious connection to my area of applying machine learning techniques to design data. Although I usually participate in the DTM sessions, this DAC session stole me away due to the sheer concentration of peer researchers who are doing great work.

  • Yi (Max) Ren kicked things off with his paper, which won this year's DAC Best Paper award. He explored how humans select designs and control policies through an online crowd-sourced game called "Eco-Racer". He then compared human performance to that of a global optimizer (specifically, EGO) and looked at the differences in convergence performance to the optimal policy. He highlighted the challenge of finding the right crowd to provide reasonable initial data for policy exploration. This reminded me a lot of parallels in robotics with apprenticeship learning and off-policy learning, though those have slightly different goals than what Max was trying to accomplish. As we start to explore how humans and computers can design in semi-supervised fashion, these connections with optimal control exploration and trust-regions from the ML community will become increasingly important.
  • Harrison Kim then presented work on predictive modeling of product returns. The key to the paper was in combining two disparate types of prior product return models: one that models the sales data and another that models product return distributions. By combining the two you in essence get to capture the relationship between these two linked distributions and build a joint model that can use both datasets.
  • Hyunmin Cheong gave two back-to-back presentations on some of the work he is spearheading at Autodesk Research. The first detailed a crowd-sourcing game called "find the pretender" based on a popular Chinese gameshow. The brilliance behind this approach is how he takes the fairly simple game mechanic (providing text information for objects), and leverages it to get functional data at the right level of abstraction to be useful in functional design. The solution looks both fun and elegant.
  • His second paper extracted functional relationships in what he called "Artifact-Function-Flow" Triplets. For example "Battery-Stores-Electrical Energy". He does this by selecting a sub-corpus of Wikipedia, and then using a combination of sentence parsers and word-similarity measures to collect these triplets. This was a great example of how domain knowledge regarding functions could combine with modern text-processing and crowd-sourcing techniques. Both this paper and his last paper show great creativity in combining the best parts of DAC and DTM together. I also had no idea Autodesk was doing this kind of work, which was great news!

DFMLC: Panel on Additive Manufacturing's Impact on Design for Manufacturing, Assembly, and Life Cycle

Kazuhiro Saitou organized a great panel of academic and industry experts to discuss what role Additive Manufacturing is playing in DFM, DFA, and DFMLC. After the panelists gave an overview of their opinions of the field, he posed the question: "Is Complexity Free?"

The discussion took a series of interesting turns, with several people starting off agreeing that manufacturing complexity has essentially become free, pointing to examples of the GE LEAP engine fuel injectors as one example. However, Erhan Arishoy of Siemens Corporate Research noted that while manufacturing might be free, there is no such thing as a free lunch: the costs of AM now come in the design and software side. For example, even if one could create a complex, light-weight lattice model, how would one store or transmit that data to machines in today's formats? This then opened up additional concerns about other aspects of complexity, such as surface finish and precision, which are far from "Free" in AM.

The Graduate School Statement of Purpose: A Faculty Perspective

tl;dr: I talk about writing an effective graduate school Statement of Purpose (SoP) from the perspective of how I (a faculty member in Mechanical Engineering) read it, along with common mistakes I see applicants make. I offer a few tips for students looking to make sure their SoP gets into the right hands and communicates the necessary details so that you'll have the best shot of getting accepted.

Applying to graduate school, particularly for Ph.D. positions, can be a nerve-racking experience for many students. Part of the stress comes from the all-important Statement of Purpose, where you have the opportunity to represent yourself and your interests beyond what your purely numerical scores (e.g., GPA, GRE, TOEFL, etc.) or recommendation letters might say about you. There are many guides all over the Internet about how to write your Statement of Purpose (SoP) (See Berkeley, Purdue, UCLA, UNI, etc.). I won't replicate their advice here.

However, to write well you need to know your audience. So rather than talk about how to write a SoP, I want to describe what it is like to read one. By going in the reverse direction and giving you the faculty perspective, I'm hoping you'll better understand how and why faculty members read a SoP so that you'll have an easier time writing something that communicates effectively to them. (With the important caveat that this is all my own opinion, and that other faculty may read a SoP slightly differently.)

Why the type of Graduate School program you're applying to matters

Before we dive into specific details, we need to differentiate between at least three types of "Graduate School" programs (in Science and Engineering):

  1. Ph.D. Program: long-term commitment usually at least four years in length where the primary responsibility of the student is to conduct research with a faculty advisor. This includes M.S./Ph.D. programs where a student receives an M.S. degree during the course of pursuing their Ph.D. degree.
  2. M.S. w/ Thesis: shorter-term commitment, typically two years in length, where the student splits their time between taking graduate courses and conducting a two year research project with a faculty advisor.
  3. M.S. w/ Coursework: like the M.S. w/ Thesis, except without the research aspects; you just take courses of your own choosing and then graduate with the degree. Depending on the student, there might be no primary faculty advisor that you communicate with on a regular basis. This also includes M.Eng. degrees or "Professional Masters" programs.

Why differentiate? Because faculty will expect your SoP to be fundamentally different depending on what your eventual goals are.

Advice for Particular Types of Graduate Programs

For each of the three main programs, I'll mention: 1. what I'm looking for in the SoP, 2. common mistakes I see applicants make, and 3. suggestions for improving your SoP so it has a better chance of success.

Ph.D. Program

From the faculty perspective, Ph.D. students are a big, but important, commitment. You will develop a long-term professional relationship with your faculty advisor and they will act as a mentor (officially or unofficially) to you for the rest of their life, even after you graduate. Beyond mentoring, faculty provide most of the financial support for their Ph.D. students, for things like tuition, a stipend, any experimental resources they need to complete their research, not to mention hours of one-on-one training. In exchange for these years of training, the Ph.D. student and the advisor will eventually carve out new areas of knowledge that will push forward the cutting edge of science and technology. In short: big commitment and big pay-off, for both the student and advisor, over the course of about 4-6 years.

What runs through my head when I open the SoP

This student is looking primarily for a faculty mentor that will guide their research throughout the course of the Ph.D. I should be looking to see if I'm the right person to guide them. This means I'm paying attention to:

  1. Are they interested in research that is relevant to my area?
  2. Who else in the department could act as good additional mentors to them?
  3. Do their interests align with projects I have going on right now (or wish to start)?
  4. What are their career goals once they get their Ph.D.?
  5. Do they appear to have enough preparation and credentials that it is worth my time to contact them and set up a remote interview?
  6. If they are the right fit, can I find the appropriate financial support for them over the duration of the Ph.D.?

Ideally, the SoP would help me answer the above questions as easily as possible.

Common Mistakes and Suggestions for Improvement

In line with my above points, here are common mistakes applicants make:

  1. The applicant doesn't say what their research interests are.

    If a student is fantastic (good grades, research experience, great letters of recommendations, etc.), but doesn't tell me what kind of research they want to do, there is no way for me to determine if I'd be the right advisor for them.

    Suggestion: Be upfront about the kind of research you want to do, preferably in the first paragraph. Say something to the effect of "My research interests include insert broad MechE topic area here, specifically in insert specific sub-fields here." This way, in the first paragraph of your statement I know whether you are appropriate for my lab or possibly another faculty member's lab.

    It's important to strike a balance here. If you say "I'm interested in Mechanical Engineering", I would say "this student doesn't yet know what kind of research they want to do, so how do I know if I'll be a good advisor for them?" On the other hand, if you are super-specific and say something like "I want to work on agent-based architectures for swarm-based, unmanned underwater vehicles" then I might say "hmm, I don't really have any funded projects specifically on that topic right now, so maybe the student wouldn't ultimately be happy with my available projects; maybe another faculty member might have something closer to that." Look over faculty web pages and try to find a happy medium that is specific enough to pique some faculty interests, but broad enough to appeal to the projects they have going on.

  2. The applicant doesn't make it clear which faculty might be appropriate mentors for them.

    If you want to work with particular people but don't mention them, you are missing a golden opportunity.

    Suggestion: name dropping particular faculty in your SoP is one of the best ways to get those particular people to look over your application. Look over faculty webpages and specifically highlight one or more faculty that you might possibly want to work with. For example, if you really like the work of Dr. X, but could also see yourself working with Dr. Y or Z, then say something like "I am particularly interested in Dr. X's work on super cool research topic by Dr. X, but would also be interested in related work by Dr. Y and Dr. Z in the areas of research topics of Drs. Y and Z that you like."

    That strategy is powerful for multiple reasons. First, it shows you did your homework on what people are working on. Second, it demonstrates that you have specific research interests, but also are flexible regarding projects in related areas. Third, it is eye-catching: if I see my name explicitly listed in a SoP, I spend much more time reading it through, since I already know that the student is possibly interested in my specific line of research.

  3. The applicant doesn't mention what they want to do after they complete their Ph.D.

    If you don't mention what you want to do once you have your Ph.D., then I can't determine if I'll be able to provide the appropriate contacts or support when you graduate.

    Suggestion: mention why you want to get a Ph.D. and what your goals are once you graduate. Do you want to do research at a research University? Teach at a teaching university? Work in an Industry lab? Start-up company? Open your own bakery/circus/boutique coffee shop? Let us know.

    This is important since this helps us determine two things: 1) why do you want to go through the long and arduous Ph.D. process, and 2) are we the best people to provide you with that kind of path once you graduate? If you're interested in working as a research scientist for Fancy Company or National Lab, and I have many connections or joint-projects with those or similar labs, then I'll likely be able to give you what you need to succeed.

  4. Not listing skills or experience that match the research field you are trying to go into.

    Your experience and skills should match the job you want. If you've spent years doing experimental work, but list heavy computational or theoretical research interests, we may think "This person is really interested in my area, but do they really know what they are getting themselves into? How much extra training will they need to get up-to-speed on the work in my area?"

    Suggestion: make it crystal clear how your past experience translates directly into applicable skills that will be useful when you start. For example, what if you want to join a lab that does computational work? Did you do a project where you had to learn and master C++ programming? Go ahead and mention it! What about your time doing biological research in a wet-lab? Think about how that experience translates to the new lab you want to join and tailor it to them: maybe your exceptional pipetting ability is not worth mentioning, but your data-analysis abilities would be perfect!

  5. It is unclear what options exist to financially support the student.

    Typically students are funded by the advisor out of an active research grant they have at that time. If you express interest in a project related to that grant, and we have money available, it's your lucky day! However, sometimes things aren't that lucky: maybe we're waiting to hear back about a pending grant, or there is a student graduating in one year who is already on that grant, so money won't be available for a new student on that project until he or she graduates. This could mean that I can't admit a fantastic student that I normally would because the right funding didn't line up.

    Suggestion: if you're open to receiving other forms of funding, say so. For example, Teaching Assistantships might be possible for several semesters while waiting for dedicated research grant funding. Or if your country has some kind of fellowship program (NSF GRFP or NDSEG are examples in the U.S.) that you have already applied for (or anticipate applying for), then you should mention this. If you're open to different funding options, then that increases the possibility that we can provide continuous financial support throughout your entire time as a Ph.D. student.

M.S. w/ Thesis

For a research-focused M.S. degree, where you are expected to work with a faculty advisor, the same advice from Ph.D. applications above applies. In addition to that advice, you should be specific about your goals for the M.S. degree.

Students apply to a research-focused M.S. program for a variety of reasons: 1) they like research, but are unsure about whether they want to go all the way with a Ph.D., so they test the water with the M.S. + research first and then maybe apply for the Ph.D. later; 2) They just want the M.S. degree, and intend to go into industry upon completing it, but like research and are hoping to cover some of the M.S. costs through a research assistantship; or 3) they want to get into a Ph.D. program, but believe that having an M.S. first before applying for Ph.D. programs will benefit them more than the direct Ph.D. program (this is less useful if you intend to stay at one institution for both degrees).

Whatever your goals, be specific about them, since that will help faculty determine the appropriate level of support, expectations for your application, and how you might fit into the research group.

M.S. w/ Coursework

This type of degree program typically doesn't directly involve a faculty advisor, so faculty have less say in these applications, and the advice above is less relevant to you. Since these are often reviewed by a department's graduate office, I don't have much input here other than to be specific in your degree goals and state concrete ways in which the programs at that particular university will benefit you.

General Tips for Improving Readability

Given the above considerations, there are some general ways that you can make your SoP easier to read:

  1. Organization and Formatting are your Friends.

    SoPs that are well organized, either by using topic paragraphs/sentences or section headings, make it really easy to scan through the SoP and make a judgement. For example, bolding the names of research interests or particular professors makes it less likely that a person will miss that detail in a quick read. You can even use special headings to organize the SoP, such as "Faculty who are closely related to my research interest:" or "Prior Research Experience:" or "Degree Goals:"

  2. Quality Over Quantity

    The longer the SoP is, the more likely the reader is to skip around looking for the information they want, rather than reading the whole thing. Just like a resume, assume that a first-pass read of your SoP will only be ~10 seconds, so you want to get your point across quickly. This means 1) highlight important points in the first paragraph, 2) keep it shorter, if possible, and 3) use organization to make things easy to scan. Feel free to use all the space provided to tell your story, but make sure that if they only read the first paragraph you'd be able to pique the interest of the appropriate faculty member. I've seen a SoP with only a couple of to-the-point paragraphs that led me to interview someone, as well as a multi-page, well organized SoP, labeled with clear section headings that allowed me to identify whether the candidate was appropriate within seconds. Length doesn't matter as much as quality and clarity.

  3. Print It Out and Give it to Someone to Quickly Read:

    Get a friend of yours to look at your SoP quickly and give you their gut reaction. You have been working so hard on it that you'll know it inside and out, but a fresh set of eyes can be really useful. Is the page too crammed with text that it looks cluttered, busy, and unapproachable? Is it easy for them to find the above mentioned information? Are there spelling or other errors you didn't catch? Spending one minute with a third party will drastically help you improve your chances on the real deal.

Best of luck!