Policy gradient


Note: This largely summarizes introductory material from Berkeley's CS285.

A trajectory \(\tau\) of finite length \(T\) is a sequence of states and actions:

\(\displaystyle \tau = \{ s_1, a_1, \ldots, s_T, a_T, s_{T + 1} \}\)

where the initial state \(s_1\) has density \(p_0 (s)\), actions are sampled from the parameterized policy \(\pi_{\theta} (a|s)\), and the transition dynamics are \(p (s' |s, a)\). The joint distribution over \(\tau\) is:

\(\displaystyle p_{\theta} (\tau) = p_{\theta} (s_1, a_1, \ldots, s_T, a_T, s_{T + 1}) = p_0 (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) p (s_{t + 1} |s_t, a_t)\)
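Sampling from this factored distribution amounts to drawing \(s_1 \sim p_0\), then alternately drawing actions from the policy and next states from the dynamics. A minimal sketch for a tabular MDP (the example numbers are made up for illustration):

```python
import numpy as np

def sample_trajectory(p0, P, pi, T, rng):
    """Sample tau = (s_1, a_1, ..., s_T, a_T, s_{T+1}) from the factored
    distribution p0(s_1) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t).

    p0: (S,)     initial-state probabilities
    P:  (S,A,S)  transition probabilities P[s, a, s']
    pi: (S,A)    policy probabilities pi[s, a]
    """
    states, actions = [], []
    s = rng.choice(len(p0), p=p0)          # s_1 ~ p0
    for _ in range(T):
        a = rng.choice(P.shape[1], p=pi[s])  # a_t ~ pi(. | s_t)
        states.append(s)
        actions.append(a)
        s = rng.choice(P.shape[2], p=P[s, a])  # s_{t+1} ~ p(. | s_t, a_t)
    states.append(s)                       # the final state s_{T+1}
    return states, actions

# Tiny hypothetical 2-state, 2-action example.
rng = np.random.default_rng(0)
p0 = np.array([1.0, 0.0])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
states, actions = sample_trajectory(p0, P, pi, T=5, rng=rng)
```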

We use the environment reward function \(r (s, a)\) to define each timestep's reward \(r_t = r (s_t, a_t)\). Our objective is to find the policy that maximizes the expected return, where here and below the expectation is over \(\tau \sim p_{\theta}\):

\(\displaystyle \max_{\theta} J (\theta) =\mathbb{E} \left[ \sum_{t = 1}^T r_t \right]\) (1)

We now compute the gradient of this objective. Denoting the trajectory return \(r (\tau) = \sum_{t = 1}^T r_t\), and using the identity \(\nabla_{\theta} p_{\theta} = p_{\theta} \nabla_{\theta} \log p_{\theta}\) (the log-derivative trick) in the third line below:

\begin{eqnarray*} \nabla_{\theta} J (\theta) & = & \nabla_{\theta} \int r (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \int r (\tau) \nabla_{\theta} p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \int r (\tau) \nabla_{\theta} \log p_{\theta} (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \mathbb{E} [r (\tau) \nabla_{\theta} \log p_{\theta} (\tau)] \end{eqnarray*}

Expanding the gradient of the log density:

\begin{eqnarray*} \nabla_{\theta} \log p_{\theta} (\tau) & = & \nabla_{\theta} \log \left( p_0 (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) p (s_{t + 1} |s_t, a_t) \right)\\ & = & \nabla_{\theta} \left( \log p_0 (s_1) + \sum_{t = 1}^T [\log \pi_{\theta} (a_t |s_t) + \log p (s_{t + 1} |s_t, a_t)] \right)\\ & = & \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \end{eqnarray*}

Since neither the initial-state density nor the transition dynamics depend on \(\theta\), only the policy terms survive, and we obtain:

\(\displaystyle \nabla_{\theta} J (\theta) =\mathbb{E} \left[ \left( \sum_{t = 1}^T r_t \right) \left( \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \right) \right]\) (2)
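Eq. 2 can be checked numerically. In a hypothetical stateless problem (not from the source) with a softmax policy over two actions and fixed per-action rewards, the exact gradient is available in closed form, and the Monte Carlo average of \(r(\tau) \sum_t \nabla_{\theta} \log \pi_{\theta}(a_t)\) should match it:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3                       # horizon
r = np.array([1.0, 0.0])    # reward per action (no state dependence)
theta = np.zeros(2)         # softmax policy parameters
pi = np.exp(theta) / np.exp(theta).sum()   # pi = [0.5, 0.5]

# Exact gradient: J = T * sum_a pi(a) r(a), and for a softmax policy
# grad_theta log pi(a) = e_a - pi.
grad_log_pi = np.eye(2) - pi               # row a: grad log pi(a)
exact = T * (r * pi) @ grad_log_pi         # closed-form gradient

# Monte Carlo estimate of Eq. 2: E[(sum_t r_t)(sum_t grad log pi(a_t))].
N = 200_000
a = rng.choice(2, size=(N, T), p=pi)       # actions for N trajectories
returns = r[a].sum(axis=1)                 # r(tau) per trajectory
scores = grad_log_pi[a].sum(axis=1)        # sum_t grad log pi(a_t)
estimate = (returns[:, None] * scores).mean(axis=0)
```

With these numbers the exact gradient is \([0.75, -0.75]\), and the sample average converges to it at the usual \(O(1/\sqrt{N})\) rate.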

However, by linearity of expectation we may also rewrite Eq. 1 as:

\(\displaystyle J (\theta) = \sum_{t = 1}^T \mathbb{E} [r_t]\)

Since \(r_t\) is a function of the trajectory up to time \(t\), which we denote \(\tau_{1 : t} = \{ s_1, a_1, \ldots, s_t, a_t \}\), we can do analogous calculations to the above:

\begin{eqnarray*} \nabla_{\theta} \mathbb{E} [r_t] & = & \nabla_{\theta} \int r_t p_{\theta} (\tau_{1 : t}) \mathrm{d} \tau_{1 : t}\\ & = & \int r_t \nabla_{\theta} \log p_{\theta} (\tau_{1 : t}) p_{\theta} (\tau_{1 : t}) \mathrm{d} \tau_{1 : t}\\ & = & \mathbb{E} \left[ r_t \left( \sum_{u = 1}^t \nabla_{\theta} \log \pi_{\theta} (a_u |s_u) \right) \right] \end{eqnarray*}

Then summing over values of \(t\) and exchanging the order of summation, we obtain a different policy gradient formula (compare to Eq. 2): each score term now multiplies only the "reward-to-go" \(\sum_{u = t}^T r_u\), which typically lowers the variance of the Monte Carlo estimator:

\begin{eqnarray*} \nabla_{\theta} J (\theta) & = & \sum_{t = 1}^T \nabla_{\theta} \mathbb{E} [r_t]\\ & = & \sum_{t = 1}^T \mathbb{E} \left[ r_t \left( \sum_{u = 1}^t \nabla_{\theta} \log \pi_{\theta} (a_u |s_u) \right) \right]\\ & = & \mathbb{E} \left[ \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \left( \sum_{u = t}^T r_u \right) \right] \end{eqnarray*}
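Continuing the hypothetical stateless softmax example from above (an illustration, not from the source), we can confirm that the full-return estimator (Eq. 2) and the reward-to-go estimator share the same mean, while the latter has lower variance:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 3
r = np.array([1.0, 0.0])
pi = np.array([0.5, 0.5])                  # softmax with theta = 0
grad_log_pi = np.eye(2) - pi               # row a: grad log pi(a)

N = 200_000
a = rng.choice(2, size=(N, T), p=pi)
rewards = r[a]                              # (N, T)
scores = grad_log_pi[a]                     # (N, T, 2)

# Full-return estimator (Eq. 2): (sum_t r_t) * (sum_t grad log pi(a_t)).
full = rewards.sum(axis=1, keepdims=True) * scores.sum(axis=1)

# Reward-to-go estimator: sum_t grad log pi(a_t) * sum_{u >= t} r_u.
rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
togo = (scores * rtg[:, :, None]).sum(axis=1)

mean_full, mean_togo = full.mean(axis=0), togo.mean(axis=0)
var_full, var_togo = full.var(axis=0), togo.var(axis=0)
```

The means agree (both estimate the exact gradient, \([0.75, -0.75]\) here); the extra terms in the full-return estimator pair past rewards with future score functions, which have zero mean but add variance.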

Finally, define the action-value function:

\(\displaystyle Q_t (s, a) =\mathbb{E} \left[ \sum_{u = t}^T r_u \middle| s_t = s, a_t = a \right]\)

Using the tower property (conditioning on \((s_t, a_t)\) inside each term of the outer sum), we may rewrite the previous formula as:

\(\displaystyle \nabla_{\theta} J (\theta) =\mathbb{E} \left[ \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) Q_t (s_t, a_t) \right]\) (3)
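In the same hypothetical stateless example used above, \(Q_t\) is analytic: the immediate reward plus the expected reward \(\mathbb{E}_{\pi}[r] = 0.5\) for each of the \(T - t\) remaining steps. A quick check that Eq. 3 still recovers the exact gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 3
r = np.array([1.0, 0.0])
pi = np.array([0.5, 0.5])                  # softmax with theta = 0
grad_log_pi = np.eye(2) - pi               # row a: grad log pi(a)

# Analytic Q in the stateless case: Q_t(a) = r(a) + 0.5 * (T - t).
t = np.arange(1, T + 1)
Q = r[None, :] + 0.5 * (T - t)[:, None]     # (T, A): Q[t-1, a]

# Monte Carlo estimate of Eq. 3: E[sum_t grad log pi(a_t) Q_t(a_t)].
N = 200_000
a = rng.choice(2, size=(N, T), p=pi)
scores = grad_log_pi[a]                     # (N, T, 2)
q_vals = Q[np.arange(T)[None, :], a]        # (N, T)
estimate = (scores * q_vals[:, :, None]).sum(axis=1).mean(axis=0)
```

Replacing the sampled reward-to-go by its conditional expectation \(Q_t(s_t, a_t)\) leaves the mean unchanged and typically reduces variance further, which is the starting point for actor-critic methods.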