In natural language we often wish to model the conditional distribution over the next word \(w\) in a sentence given the previous words \(c\), which stands for context. We can model the conditional density using a real-valued function \(S_{\theta} (w, c)\):

\(\displaystyle p_{\theta} (w|c) = \frac{\exp (S_{\theta} (w, c))}{Z_{\theta} (c)}, \qquad Z_{\theta} (c) = \int \exp (S_{\theta} (w', c))\, d w' \qquad (1)\)

Intuitively, a particular value of \(w\) is exponentially more likely (higher density) if \(S_{\theta} (w, c)\) is larger. The denominator \(Z_{\theta} (c)\) is there to ensure that the probability density integrates to \(1\):

\(\displaystyle \int p_{\theta} (w|c)\, d w = \int \frac{\exp (S_{\theta} (w, c))}{Z_{\theta} (c)}\, d w = \frac{1}{Z_{\theta} (c)} \int \exp (S_{\theta} (w, c))\, d w = \frac{Z_{\theta} (c)}{Z_{\theta} (c)} = 1\)

Notice that we were able to pull \(Z_{\theta} (c)\) out of the integral because it is not a function of \(w\).
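To make Eq. 1 concrete, here is a minimal numerical sketch for a small discrete vocabulary (three hypothetical words), where the integral becomes a sum; the scores are made-up values standing in for \(S_{\theta} (w, c)\):

```python
import numpy as np

def log_p(scores):
    """Normalize unnormalized scores into log-probabilities (Eq. 1).

    scores[i] plays the role of S_theta(w_i, c) for a small discrete
    vocabulary, so the log of the sum of exponentials is log Z_theta(c).
    """
    log_Z = np.log(np.sum(np.exp(scores)))  # log of the partition function
    return scores - log_Z

scores = np.array([2.0, 1.0, -0.5])  # hypothetical scores for 3 words
log_probs = log_p(scores)
print(np.exp(log_probs).sum())  # probabilities sum to 1
```

Note that since \(Z_{\theta} (c)\) does not depend on \(w\), it is computed once and subtracted from every score.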

## 1 Maximum Likelihood

Suppose that for each context \(c\) there is a true data distribution over next words \(p_d (w|c)\). For each \(c\) we would like to maximize the log likelihood of the observed \(w\) under our model:

\(\displaystyle \max_{\theta} E_{p_d (\cdot |c)} [\log p_{\theta} (w|c)] \qquad (2)\)

To optimize this objective we would like to differentiate the expected log-likelihood:

\(\displaystyle \frac{\partial}{\partial \theta} E_{p_d (\cdot |c)} [\log p_{\theta} (w|c)] = E_{p_d (\cdot |c)} \left[ \frac{\partial}{\partial \theta} S_{\theta} (w, c) \right] - \frac{\partial}{\partial \theta} \log Z_{\theta} (c) = E_{p_d (\cdot |c)} \left[ \frac{\partial}{\partial \theta} S_{\theta} (w, c) \right] - E_{p_{\theta} (\cdot |c)} \left[ \frac{\partial}{\partial \theta} S_{\theta} (w, c) \right] \qquad (3)\)

For the second equality we are using:

\(\displaystyle \frac{\partial}{\partial \theta} \log Z_{\theta} (c) = \frac{1}{Z_{\theta} (c)} \int \exp (S_{\theta} (w', c)) \frac{\partial}{\partial \theta} S_{\theta} (w', c)\, d w' = E_{p_{\theta} (\cdot |c)} \left[ \frac{\partial}{\partial \theta} S_{\theta} (w', c) \right] \qquad (4)\)

We see from Eq. 4 that the second term of Eq. 3 involves an integral or sum over all possible values of \(w\), which can be difficult to compute.
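As a sanity check on this gradient, here is a small sketch using a toy linear model \(S_{\theta} (w, c) = \theta_w\) over a tiny vocabulary and a made-up data distribution (both are assumptions for illustration); the analytic gradient \(E_{p_d} [\partial S / \partial \theta] - E_{p_{\theta}} [\partial S / \partial \theta]\) is compared against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5                                # toy vocabulary size
theta = rng.normal(size=V)           # S_theta(w, c) = theta[w] for one fixed context
p_d = rng.dirichlet(np.ones(V))      # made-up "true" distribution p_d(.|c)

def p_model(theta):
    """Softmax of the scores, i.e. p_theta(w|c) from Eq. 1."""
    e = np.exp(theta)
    return e / e.sum()

# With this toy model, dS/dtheta for word w is the one-hot vector e_w, so the
# two expectations in Eq. 3 are just the distributions themselves:
# E_{p_d}[dS] - E_{p_theta}[dS] = p_d - p_theta.
grad_analytic = p_d - p_model(theta)

def objective(theta):
    return np.sum(p_d * np.log(p_model(theta)))  # E_{p_d}[log p_theta]

# Central finite differences on the objective should match the analytic gradient.
eps = 1e-5
grad_fd = np.array([
    (objective(theta + eps * np.eye(V)[i]) - objective(theta - eps * np.eye(V)[i])) / (2 * eps)
    for i in range(V)
])
print(np.max(np.abs(grad_analytic - grad_fd)))  # tiny
```

Even in this toy setting the model-expectation term requires a sum over the whole vocabulary; for a realistic vocabulary that sum is the expensive part Eq. 4 warns about.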

## 2 Classifying reals vs fakes

Suppose we have a batch of samples of \(w\) from two different distributions: one of the samples is from the real distribution \(p_d (w|c)\), and \(k\) additional samples are from a “noise” distribution \(p_n (w|c)\). For each sample we attach a label \(D = 1\) if it is from the real distribution, or \(D = 0\) if it is from the noise distribution. We want to use our model \(p_{\theta}\) to predict this label (a binary classification problem). Essentially, the idea behind noise contrastive estimation (NCE) is to use the model to differentiate samples from the true data distribution versus those from the noise distribution.

Since we want our model distribution \(p_{\theta} (w|c)\) to closely approximate \(p_d (w|c)\), we use \(p_{\theta}\) to write down the likelihood of each label given \((w, c)\):

\(\displaystyle P (D = 1|w, c) = \frac{p_{\theta} (w|c)}{p_{\theta} (w|c) + k\, p_n (w|c)}, \qquad P (D = 0|w, c) = \frac{k\, p_n (w|c)}{p_{\theta} (w|c) + k\, p_n (w|c)}\)

The factor of \(k\) appears because each real sample is accompanied by \(k\) noise samples, so the prior odds of the noise label are \(k : 1\).

In practice, NCE implementations often assume the partition function \(Z_{\theta} (c) \equiv 1\), so for the rest of the section we assume

\(\displaystyle p_{\theta} (w|c) = \exp (S_{\theta} (w, c))\)

Then we can rewrite the likelihoods using the sigmoid \(\sigma (x) = \frac{1}{1 + e^{- x}}\):

\(\displaystyle P (D = 1|w, c) = \sigma (\Delta (w, c)), \qquad P (D = 0|w, c) = 1 - \sigma (\Delta (w, c))\)

where we have defined

\(\displaystyle \Delta (w, c) = S_{\theta} (w, c) - \log (k\, p_n (w|c))\)
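The algebra behind this rewrite is easy to check numerically; the sketch below uses hypothetical values for a single \((w, c)\) pair and confirms that the sigmoid form equals the ratio form of \(P (D = 1|w, c)\):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values for a single (w, c) pair, with Z = 1 so p_theta = exp(S).
S = 0.7                          # S_theta(w, c)
p_n = 0.05                       # noise probability p_n(w|c)
k = 10                           # number of noise samples per real sample

p_theta = math.exp(S)
delta = S - math.log(k * p_n)    # Delta(w, c) = S_theta(w, c) - log(k * p_n(w|c))

lhs = sigmoid(delta)                    # sigmoid form of P(D=1|w,c)
rhs = p_theta / (p_theta + k * p_n)     # ratio form of P(D=1|w,c)
print(lhs, rhs)  # the two forms agree
```

The equivalence is just \(\sigma (S - \log (k p_n)) = \frac{1}{1 + k p_n / e^S} = \frac{e^S}{e^S + k p_n}\).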

So we can write our maximum likelihood objective for this binary classification task:

\(\displaystyle \max_{\theta}\; J (\theta) = E_{p_d (\cdot |c)} [\log \sigma (\Delta (w, c))] + k\, E_{p_n (\cdot |c)} [\log (1 - \sigma (\Delta (w, c)))]\)

As before we differentiate the objective, making use of the fact that \(\sigma' (x) = \sigma (x) (1 - \sigma (x))\) and that \(\frac{d \Delta (w, c)}{d \theta} = \frac{d}{d \theta} S_{\theta} (w, c)\):

\(\displaystyle \frac{\partial}{\partial \theta} J (\theta) = E_{p_d (\cdot |c)} \left[ (1 - \sigma (\Delta (w, c))) \frac{\partial}{\partial \theta} S_{\theta} (w, c) \right] - k\, E_{p_n (\cdot |c)} \left[ \sigma (\Delta (w, c)) \frac{\partial}{\partial \theta} S_{\theta} (w, c) \right] = \int (1 - \sigma (\Delta (w, c))) \left( p_d (w|c) - p_{\theta} (w|c) \right) \frac{\partial}{\partial \theta} S_{\theta} (w, c)\, d w\)

At the last equality we are using the fact that:

\(\displaystyle k\, p_n (w|c)\, \sigma (\Delta (w, c)) = \frac{k\, p_n (w|c)\, p_{\theta} (w|c)}{p_{\theta} (w|c) + k\, p_n (w|c)} = p_{\theta} (w|c) \left( 1 - \sigma (\Delta (w, c)) \right)\)

Finally, since \(\Delta (w, c) = S_{\theta} (w, c) - \log (k\, p_n (w|c)) \rightarrow - \infty\) as \(k \rightarrow \infty\), by inspecting the definition we can see \(\sigma (\Delta (w, c)) \xrightarrow{k \rightarrow \infty} 0\), and therefore:

\(\displaystyle \frac{\partial}{\partial \theta} J (\theta) \xrightarrow{k \rightarrow \infty} \int \left( p_d (w|c) - p_{\theta} (w|c) \right) \frac{\partial}{\partial \theta} S_{\theta} (w, c)\, d w = E_{p_d (\cdot |c)} \left[ \frac{\partial}{\partial \theta} S_{\theta} (w, c) \right] - E_{p_{\theta} (\cdot |c)} \left[ \frac{\partial}{\partial \theta} S_{\theta} (w, c) \right]\)

So as we increase the number of noise samples \(k\) in the batch, the NCE gradient asymptotically approaches the original maximum-likelihood gradient of Eq. 3. This is the motivation for why NCE's binary classification task is a good approximation to the original objective in Eq. 2.
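As a numerical illustration of this limit, the sketch below uses a toy model \(S_{\theta} (w, c) = \theta_w\) with \(Z \equiv 1\), a made-up data distribution, and a uniform noise distribution (all assumptions for illustration), and measures how far the expected NCE gradient is from the maximum-likelihood gradient as \(k\) grows:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 5                               # toy vocabulary size
theta = rng.normal(size=V)          # S_theta(w, c) = theta[w]; Z assumed 1
p_theta = np.exp(theta)             # p_theta(w|c) = exp(S_theta) under Z = 1
p_d = rng.dirichlet(np.ones(V))     # made-up true distribution p_d(.|c)
p_n = np.full(V, 1.0 / V)           # uniform noise distribution p_n(.|c)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Maximum-likelihood gradient (per-word components of Eq. 3 with Z = 1).
ml_grad = p_d - p_theta

# Expected NCE gradient per word: (1 - sigma(Delta)) * (p_d - p_theta).
errors = []
for k in [1, 10, 100, 10_000]:
    delta = theta - np.log(k * p_n)          # Delta(w, c) = S - log(k * p_n)
    nce_grad = (1.0 - sigmoid(delta)) * (p_d - p_theta)
    errors.append(np.max(np.abs(nce_grad - ml_grad)))
print(errors)  # the gap shrinks as k grows
```

The gap between the two gradients is exactly \(\sigma (\Delta (w, c)) (p_d - p_{\theta})\) per word, which vanishes as \(k\) grows, matching the asymptotic argument above.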

## 3 Further reading

This blog post and this paper contain good explanations of noise contrastive estimation, and give more detailed derivations of some of these results.