In natural language modeling we often wish to model the conditional distribution
over the next word \(w\) in a sentence given the previous words \(c\)
(the context). We can model this conditional density using a
real-valued score function \(S_{\theta} (w, c)\):
\(\displaystyle p_{\theta} (w|c) = \frac{\exp (S_{\theta} (w, c))}{Z_{\theta} (c)}, \qquad Z_{\theta} (c) = \int \exp (S_{\theta} (w', c)) \, d w' \qquad (1)\)
Intuitively, a particular value of \(w\) is exponentially more likely
(higher density) when \(S_{\theta} (w, c)\) is larger. The denominator
\(Z_{\theta} (c)\), known as the partition function, ensures that the
probability density integrates to \(1\):
\(\displaystyle \int p_{\theta} (w|c) d w = \frac{1}{Z_{\theta} (c)}
\int \exp (S_{\theta} (w,
c)) d w = 1\)
Notice that we were able to pull \(Z_{\theta} (c)\) out of the integral
because it is not a function of \(w\).
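To make Eq. 1 concrete, here is a minimal numerical sketch (my own illustration, not from the original text) assuming a small discrete vocabulary, so the integral defining \(Z_{\theta} (c)\) becomes a sum; the vocabulary and score function are made up for the example.

```python
import numpy as np

# Toy setup: a discrete vocabulary, so Z_theta(c) is a sum rather than an integral.
vocab = ["the", "cat", "sat", "mat"]
rng = np.random.default_rng(0)

# A made-up score function S_theta(w, c): here just a table of scores per word,
# standing in for e.g. a neural network's output logits for a fixed context c.
theta = rng.normal(size=len(vocab))          # one score per word

def score(w_idx, theta):
    return theta[w_idx]                      # S_theta(w, c) for the fixed context

# p_theta(w|c) = exp(S_theta(w, c)) / Z_theta(c),  Z_theta(c) = sum_w' exp(S_theta(w', c))
scores = np.array([score(i, theta) for i in range(len(vocab))])
Z = np.exp(scores).sum()                     # partition function
p = np.exp(scores) / Z

print(dict(zip(vocab, p.round(3))), "sums to", p.sum())   # probabilities sum to 1
```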
1. Maximum Likelihood
Suppose that for each context \(c\) there is a true data distribution
over next words \(p_d (w|c)\). For each \(c\) we would like to maximize
the log likelihood of the observed \(w\) under our model:
\(\displaystyle \max_{\theta} \; \mathbb{E}_{p_d (\cdot |c)} [\log p_{\theta} (w|c)] \qquad (2)\)
To optimize this objective we would like to differentiate the expected
log-likelihood:
\begin{eqnarray*}
\frac{d}{d \theta} \mathbb{E}_{p_d (\cdot |c)} [\log
p_{\theta} (w|c)] & = &
\mathbb{E}_{p_d (\cdot |c)} \left[ \frac{d}{d
\theta} \log (\exp (S_{\theta}
(w, c))) \right] - \frac{d}{d \theta}
\log Z_{\theta} (c)\\
& = & \mathbb{E}_{p_d (\cdot |c)} \left[ \frac{d}{d \theta} S_{\theta} (w, c) \right] - \mathbb{E}_{p_{\theta} (\cdot |c)} \left[ \frac{d}{d \theta} S_{\theta} (w, c) \right] \hspace{3cm} \text{(3)}
\end{eqnarray*}
For the second equality we are using (differentiating under the integral sign):
\begin{eqnarray*}
\frac{d}{d \theta} \log Z_{\theta} (c) & = &
\frac{1}{Z_{\theta} (c)}
\frac{d}{d \theta} \int \exp (S_{\theta} (w,
c)) d w\\
& = & \frac{1}{Z_{\theta} (c)} \int \exp (S_{\theta} (w, c))
\frac{d}{d
\theta} S_{\theta} (w, c) d w\\
& = & \int \frac{\exp
(S_{\theta} (w, c))}{Z_{\theta} (c)} \frac{d}{d
\theta} S_{\theta} (w,
c) d w \hspace{3cm} \text{(4)}\\
& = & \mathbb{E}_{p_{\theta} (\cdot |c)} \left[ \frac{d}{d \theta} S_{\theta} (w, c) \right]
\end{eqnarray*}
We see from Eq. 4 that the second term of Eq. 3
involves an integral or sum over all possible values of \(w\), which can
be difficult to compute.
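As a sanity check on Eq. 3, here is a small sketch (my own illustration, assuming a discrete vocabulary and the toy score \(S_{\theta} (w, c) = \theta_w\)) that verifies numerically that the gradient of the expected log-likelihood equals the difference of the two expectations of \(\frac{d}{d \theta} S_{\theta}\):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 5                                          # toy vocabulary size
theta = rng.normal(size=V)                     # scores: S_theta(w, c) = theta[w]
p_d = rng.dirichlet(np.ones(V))                # a made-up "true" data distribution p_d(w|c)

def model_probs(theta):
    z = np.exp(theta)
    return z / z.sum()                         # p_theta(w|c), Eq. 1 with a sum for Z

# Left-hand side of Eq. 3 by central finite differences of E_{p_d}[log p_theta(w|c)]
def expected_loglik(theta):
    return np.sum(p_d * np.log(model_probs(theta)))

eps = 1e-6
grad_fd = np.array([
    (expected_loglik(theta + eps * np.eye(V)[i]) -
     expected_loglik(theta - eps * np.eye(V)[i])) / (2 * eps)
    for i in range(V)
])

# Right-hand side of Eq. 3: E_{p_d}[dS/dtheta] - E_{p_theta}[dS/dtheta].
# With S_theta(w, c) = theta[w], dS/dtheta_i is the indicator [w == i],
# so each expectation is just the corresponding probability vector.
grad_eq3 = p_d - model_probs(theta)

print(np.allclose(grad_fd, grad_eq3, atol=1e-5))   # True
```

Here the second expectation can be computed exactly because the vocabulary is tiny; in a realistic model it is precisely this term that is expensive, which is what motivates NCE below.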
2. Classifying reals vs. fakes
Suppose we have a batch of samples of \(w\) from two different
distributions: one of the samples is from the real distribution \(p_d
(w|c)\), and \(k\) additional samples are from a “noise”
distribution \(p_n (w|c)\). For each sample we attach a label \(D = 1\)
if it is from the real distribution, or \(D = 0\) if it is from the
noise distribution. We want to use our model \(p_{\theta}\) to predict
this label (a binary classification problem). Essentially, the idea
behind noise contrastive estimation (NCE) is to use the model to
differentiate samples from the true data distribution versus those from
the noise distribution.
Since we want our model distribution \(p_{\theta} (w|c)\) to closely
approximate \(p_d (w|c)\), we substitute \(p_{\theta}\) for \(p_d\) and write down
the posterior probability of each label given \((w, c)\) (this follows from
Bayes' rule, with prior odds \(1 : k\) that a sample is real rather than noise):
\(\displaystyle p (D = 1| w, c) = \frac{p_{\theta} (w|c)}{p_{\theta}
(w|c) + kp_n (w|c)}\)
\(\displaystyle p (D = 0| w, c) = \frac{kp_n (w|c)}{p_{\theta} (w|c) +
kp_n (w|c)}\)
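As a quick numerical check (with toy numbers of my own choosing), the posteriors above are just Bayes' rule applied with prior odds \(1 : k\) that a sample is real rather than noise:

```python
# Toy numbers (mine, purely for illustration) at some particular (w, c):
p_theta_w = 0.30      # stand-in for p_theta(w|c)
p_n_w = 0.10          # stand-in for p_n(w|c)
k = 5                 # number of noise samples per real sample

# Bayes' rule: p(D=1|w,c) = p(w|D=1) p(D=1) / [p(w|D=1) p(D=1) + p(w|D=0) p(D=0)]
p_real, p_noise = 1 / (1 + k), k / (1 + k)            # label prior
posterior_real = (p_theta_w * p_real) / (p_theta_w * p_real + p_n_w * p_noise)

# The same quantity in the simplified form used in the text:
posterior_real_simplified = p_theta_w / (p_theta_w + k * p_n_w)
print(posterior_real, posterior_real_simplified)       # identical values
```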
In practice, NCE implementations often assume the partition function
\(Z_{\theta} (c) \equiv 1\) (i.e. they treat the model as self-normalized), so
for the rest of the section we take
\(\displaystyle p_{\theta} (w|c) = \exp (S_{\theta} (w, c))\)
Then we can rewrite the likelihoods using the sigmoid \(\sigma (x) =
\frac{1}{1 + e^{- x}}\):
\begin{eqnarray*}
p (D = 1| w, c) & = & \frac{1}{1 + \frac{kp_n
(w|c)}{\exp (S_{\theta} (w,
c))}}\\
& = & \frac{1}{1 + \exp (- (S_{\theta} (w, c) - \log (kp_n (w|c))))}\\
& = & \sigma (S_{\theta} (w, c) - \log (kp_n (w|c)))\\
& = & \sigma (\Delta (w, c))\\
p (D =
0| w, c) & = & 1 - \sigma (\Delta (w, c))
\end{eqnarray*}
where we have defined
\(\displaystyle \Delta (w, c) = S_{\theta} (w, c) - \log (kp_n (w|c))\)
So we can write our maximum likelihood objective for this binary
classification task:
\(\displaystyle \max_{\theta} G (\theta) =\mathbb{E}_{p_d (\cdot |c)}
[\log \sigma (\Delta)] +
k\mathbb{E}_{p_n (\cdot |c)} [\log (1 - \sigma
(\Delta))]\)
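Here is a rough sketch of how this objective might be computed in practice (the function and variable names are mine, and the expectations are replaced by sample averages over one batch of real words and \(k\) times as many noise words):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(scores_real, scores_noise, log_p_n_real, log_p_n_noise, k):
    """Negative of the NCE objective G(theta) for one context c (a sketch).

    scores_real:   S_theta(w, c) for words drawn from the data distribution
    scores_noise:  S_theta(w, c) for k * len(scores_real) words drawn from p_n(.|c)
    log_p_n_*:     log p_n(w|c) evaluated at those same words
    """
    # Delta(w, c) = S_theta(w, c) - log(k * p_n(w|c))
    delta_real = scores_real - (np.log(k) + log_p_n_real)
    delta_noise = scores_noise - (np.log(k) + log_p_n_noise)

    # G = E_{p_d}[log sigma(Delta)] + k * E_{p_n}[log(1 - sigma(Delta))],
    # with the expectations replaced by sample averages.
    g = np.mean(np.log(sigmoid(delta_real))) \
        + k * np.mean(np.log(1.0 - sigmoid(delta_noise)))
    return -g   # minimize the negative objective

# Toy usage with made-up scores and noise log-probabilities:
rng = np.random.default_rng(2)
loss = nce_loss(scores_real=rng.normal(size=8),
                scores_noise=rng.normal(size=40),
                log_p_n_real=np.full(8, np.log(1 / 100)),
                log_p_n_noise=np.full(40, np.log(1 / 100)),
                k=5)
print(loss)
```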
As before, we differentiate the objective, making use of the fact that
\(\sigma' (x) = \sigma (x) (1 - \sigma (x))\) and that \(\frac{d \Delta
(w, c)}{d \theta} = \frac{d}{d \theta} S_{\theta} (w, c)\):
\begin{eqnarray*}
\frac{d G}{d \theta} & = & \mathbb{E}_{p_d (\cdot
|c)} \left[ \frac{d}{d
\theta} \log \sigma (\Delta) \right] +
k\mathbb{E}_{p_n (\cdot |c)} \left[
\frac{d}{d \theta} \log (1 -
\sigma (\Delta)) \right]\\
& = & \mathbb{E}_{p_d (\cdot |c)} \left[
\frac{\sigma (\Delta) (1 - \sigma
(\Delta))}{\sigma (\Delta)} \frac{d
S_{\theta}}{d \theta} \right] -
k\mathbb{E}_{p_n (\cdot |c)} \left[
\frac{\sigma (\Delta) (1 - \sigma
(\Delta))}{(1 - \sigma (\Delta))}
\frac{d S_{\theta}}{d \theta} \right]\\
& = & \mathbb{E}_{p_d (\cdot
|c)} \left[ (1 - \sigma (\Delta)) \frac{d
S_{\theta}}{d \theta}
\right] - k\mathbb{E}_{p_n (\cdot |c)} \left[ \sigma
(\Delta) \frac{d
S_{\theta}}{d \theta} \right]\\
& = & \mathbb{E}_{p_d (\cdot |c)}
\left[ (1 - \sigma (\Delta)) \frac{d
S_{\theta}}{d \theta} \right]
-\mathbb{E}_{p_{\theta} (\cdot |c)} \left[ (1
- \sigma (\Delta))
\frac{d S_{\theta}}{d \theta} \right]
\end{eqnarray*}
For the last equality we use the following identity (recall that, under the \(Z \equiv 1\) assumption above, \(p_{\theta} (w|c) = \exp (S_{\theta} (w, c))\) is treated as a normalized density):
\begin{eqnarray*}
k\mathbb{E}_{p_n (\cdot |c)} \left[ \sigma (\Delta) \frac{d S_{\theta}}{d \theta} \right] & = & \int kp_n (w|c) \, \frac{p_{\theta} (w|c)}{p_{\theta} (w|c) + kp_n (w|c)} \frac{d S_{\theta}}{d \theta} d w\\
& = & \int p_{\theta} (w|c) \, \frac{kp_n (w|c)}{p_{\theta} (w|c) + kp_n (w|c)} \frac{d S_{\theta}}{d \theta} d w\\
& = & \mathbb{E}_{p_{\theta} (\cdot
|c)} \left[ (1 - \sigma (\Delta))
\frac{d S_{\theta}}{d \theta}
\right]
\end{eqnarray*}
Finally, since \(\Delta (w, c) = S_{\theta} (w, c) - \log (kp_n (w|c)) \rightarrow - \infty\)
as \(k \rightarrow \infty\) (for any \(w\) with \(p_n (w|c) > 0\)), we have
\(\sigma (\Delta) \xrightarrow{k \rightarrow \infty} 0\), and therefore:
\(\displaystyle \frac{d G}{d \theta} \xrightarrow{k \rightarrow \infty} \mathbb{E}_{p_d (\cdot |c)} \left[ \frac{d}{d \theta} S_{\theta} (w, c) \right] - \mathbb{E}_{p_{\theta} (\cdot |c)} \left[ \frac{d}{d \theta} S_{\theta} (w, c) \right]\)
So as we increase the number of noise samples \(k\) in the batch, the NCE
gradient approaches the original maximum-likelihood gradient of Eq. 3. This is
the motivation for why NCE's binary classification task is a good surrogate for
the original objective in Eq. 2.
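We can also check this limit numerically on a toy discrete vocabulary where both expectations can be computed exactly (again my own sketch; the uniform noise distribution and the scores are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
V = 6                                     # toy vocabulary size
theta = rng.normal(size=V) - 1.0          # scores; p_theta(w|c) = exp(theta[w]) under Z = 1
p_d = rng.dirichlet(np.ones(V))           # made-up data distribution p_d(w|c)
p_n = np.full(V, 1.0 / V)                 # uniform noise distribution p_n(w|c)

# Limit of the gradient under the Z = 1 convention:
# E_{p_d}[dS/dtheta] - E_{p_theta}[dS/dtheta].  With S_theta(w, c) = theta[w],
# dS/dtheta_i = [w == i], so the two expectations are p_d and exp(theta).
grad_ml = p_d - np.exp(theta)

for k in [1, 10, 100, 1000, 10000]:
    delta = theta - np.log(k * p_n)                       # Delta(w, c)
    s = sigmoid(delta)
    # Exact dG/dtheta for the finite vocabulary:
    # E_{p_d}[(1 - sigma(Delta)) dS/dtheta] - k E_{p_n}[sigma(Delta) dS/dtheta]
    grad_nce = p_d * (1.0 - s) - k * p_n * s
    print(k, np.abs(grad_nce - grad_ml).max())            # gap shrinks as k grows
```

The printed gap shrinks roughly like \(1 / k\), reflecting the fact that \(\sigma (\Delta)\) vanishes at that rate for each fixed \(w\).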
3. Further reading
This blog post and this paper contain good explanations of noise contrastive
estimation, and include more detailed derivations of some of these results.