Policy gradient

2020/07/11

Note: This largely summarizes introductory material from Berkeley's CS285.

A trajectory \(\tau\) of length \(T\) is a sequence of states and actions:

\(\displaystyle \tau = \{ s_1, a_1, \ldots, s_T, a_T, s_{T + 1} \}\)

Suppose the initial state \(s_1\) is sampled from a distribution with density \(p (s)\). For given dynamics \(p (s' |s, a)\) and parameterized policy \(\pi_{\theta} (a|s)\), the joint distribution for some triple \((s_1, a_1, s_2)\) has density:

\(\displaystyle p_{\theta} (s_1, a_1, s_2) = p (s_1) \pi_{\theta} (a_1 |s_1) p (s_2 |s_1, a_1)\)

Extending this logic, the density of the joint distribution over \(\tau\) is:

\(\displaystyle p_{\theta} (\tau) = p_{\theta} (s_1, a_1, \ldots, s_T, a_T, s_{T + 1}) = p (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) p (s_{t + 1} |s_t, a_t)\)
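As a sanity check of this factorization, here is a minimal sketch that samples a trajectory from a small tabular MDP and accumulates \(\log p_{\theta} (\tau)\) term by term; the arrays p0, P, PI and the horizon are made-up placeholders, not anything specific from CS285.

```python
# Minimal sketch: sample tau = (s_1, a_1, ..., s_T, a_T, s_{T+1}) from a
# made-up tabular MDP and accumulate log p_theta(tau) using the factorization
# log p(s_1) + sum_t [log pi(a_t|s_t) + log p(s_{t+1}|s_t, a_t)].
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 5

p0 = np.array([0.6, 0.3, 0.1])                                     # p(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s'|s,a)
PI = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi_theta(a|s)

def sample_trajectory():
    """Return the trajectory as a list and its log-density under p_theta."""
    s = rng.choice(n_states, p=p0)
    log_density = np.log(p0[s])
    traj = [s]
    for _ in range(T):
        a = rng.choice(n_actions, p=PI[s])
        s_next = rng.choice(n_states, p=P[s, a])
        log_density += np.log(PI[s, a]) + np.log(P[s, a, s_next])
        traj += [a, s_next]
        s = s_next
    return traj, log_density

traj, log_density = sample_trajectory()
print(traj, log_density)
```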

Assume we are given some state-action reward function \(r : (s, a) \mapsto r (s, a)\). Our objective is to find the policy that maximizes the expected return:

\(\displaystyle \max_{\theta} J (\theta) =\mathbb{E}_{p_{\theta} (\tau)} \left[ \sum_{t = 1}^T r (s_t, a_t) \right] =\mathbb{E}_{p_{\theta} (\tau)} [r (\tau)]\)

where we have abused notation to define \(r (\tau) = \sum_t r (s_t, a_t)\).
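The objective itself is straightforward to estimate: sample trajectories from \(p_{\theta} (\tau)\) and average their returns. A minimal sketch, again on a made-up tabular MDP (p0, P, PI, R and the horizon are arbitrary placeholders):

```python
# Naive Monte Carlo estimate of J(theta) = E[ sum_t r(s_t, a_t) ] on a
# made-up tabular MDP; all tables below are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, T, N = 3, 2, 5, 10_000

p0 = np.array([0.6, 0.3, 0.1])                                     # p(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s'|s,a)
PI = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi_theta(a|s)
R = rng.normal(size=(n_states, n_actions))                         # r(s, a)

def sample_return():
    """r(tau) = sum_t r(s_t, a_t) for one trajectory sampled from p_theta."""
    s = rng.choice(n_states, p=p0)
    total = 0.0
    for _ in range(T):
        a = rng.choice(n_actions, p=PI[s])
        total += R[s, a]
        s = rng.choice(n_states, p=P[s, a])
    return total

J_hat = np.mean([sample_return() for _ in range(N)])
print("estimated J(theta):", J_hat)
```

To maximize \(J (\theta)\) by gradient ascent, we would like its gradient: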

\begin{eqnarray*} \nabla_{\theta} J (\theta) & = & \nabla_{\theta} \int r (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \int r (\tau) \nabla_{\theta} p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \int r (\tau) \nabla_{\theta} \log p_{\theta} (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \mathbb{E}_{p_{\theta} (\tau)} [r (\tau) \nabla_{\theta} \log p_{\theta} (\tau)] \end{eqnarray*}
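The third equality is the log-derivative (score-function) trick, \(\nabla_{\theta} p_{\theta} (\tau) = p_{\theta} (\tau) \nabla_{\theta} \log p_{\theta} (\tau)\). As a quick check of the trick in isolation, consider a toy problem chosen purely for illustration: estimate \(\nabla_{\mu} \mathbb{E}_{x \sim \mathcal{N} (\mu, 1)} [x^2]\), whose true value is \(2 \mu\) since \(\mathbb{E} [x^2] = \mu^2 + 1\).

```python
# Sanity check of the score-function (log-derivative) trick on a 1-D example:
# estimate d/dmu E_{x~N(mu,1)}[x^2] as E[x^2 * d/dmu log p_mu(x)] = E[x^2 (x - mu)].
# The true gradient is 2*mu. The numbers here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 1_000_000

x = rng.normal(loc=mu, scale=1.0, size=n)
score = x - mu                      # d/dmu log N(x; mu, 1)
grad_estimate = np.mean(x**2 * score)

print(grad_estimate, 2 * mu)        # the two should be close
```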

Expanding the gradient of the log density:

\begin{eqnarray*} \nabla_{\theta} \log p_{\theta} (\tau) & = & \nabla_{\theta} \log \left( p (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) p (s_{t + 1} |s_t, a_t) \right)\\ & = & \nabla_{\theta} \left( \log p (s_1) + \sum_{t = 1}^T \left[ \log \pi_{\theta} (a_t |s_t) + \log p (s_{t + 1} |s_t, a_t) \right] \right)\\ & = & \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \end{eqnarray*}

where the initial-state and dynamics terms vanish because neither depends on \(\theta\).
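For a concrete policy class, take a tabular softmax policy \(\pi_{\theta} (a|s) = \operatorname{softmax} (\theta_s)_a\) (a choice made here for illustration, not something fixed by the derivation). Then \(\nabla_{\theta_s} \log \pi_{\theta} (a|s) = \mathbf{1}_a - \pi_{\theta} (\cdot |s)\), with zero gradient for every other row of \(\theta\); the sketch below checks this against finite differences.

```python
# Closed-form grad of log pi_theta(a|s) for a tabular softmax policy
# pi_theta(a|s) = softmax(theta[s])_a, checked against central differences.
# theta, s, a below are arbitrary illustrative values.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta (zero outside row s)."""
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

rng = np.random.default_rng(2)
theta = rng.normal(size=(3, 2))
s, a, eps = 1, 0, 1e-6

analytic = grad_log_pi(theta, s, a)
numeric = np.zeros_like(theta)
for idx in np.ndindex(theta.shape):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[idx] += eps
    t_minus[idx] -= eps
    numeric[idx] = (np.log(softmax(t_plus[s])[a])
                    - np.log(softmax(t_minus[s])[a])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # expect True
```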

So we have that:

\(\displaystyle \nabla_{\theta} J (\theta) =\mathbb{E}_{p_{\theta} (\tau)} \left[ \left( \sum_{t = 1}^T r (s_t, a_t) \right) \left( \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \right) \right]\)

which we can estimate with simple Monte Carlo sampling: draw \(N\) trajectories \(\tau_i \sim p_{\theta} (\tau)\) and average \(r (\tau_i) \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_{i, t} |s_{i, t})\) over the samples.
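Putting the pieces together, here is a minimal sketch of that estimator (a REINFORCE-style update) on the same kind of made-up tabular MDP, using the tabular softmax policy from above and a few plain gradient-ascent steps; the sizes, step size, and sample counts are arbitrary.

```python
# REINFORCE-style Monte Carlo estimate of nabla_theta J(theta):
#   (1/N) sum_i r(tau_i) * sum_t grad log pi_theta(a_t|s_t),
# on a made-up tabular MDP with a tabular softmax policy. All constants
# (sizes, step size, sample counts) are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, T, N = 3, 2, 5, 1_000

p0 = np.array([0.6, 0.3, 0.1])                                     # p(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                         # r(s, a)

def policy(theta, s):
    """pi_theta(.|s) for a tabular softmax policy."""
    z = theta[s] - theta[s].max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_estimate(theta):
    """Average r(tau) * sum_t grad log pi_theta(a_t|s_t) over N trajectories."""
    grad = np.zeros_like(theta)
    for _ in range(N):
        s = rng.choice(n_states, p=p0)
        ret, score = 0.0, np.zeros_like(theta)
        for _ in range(T):
            probs = policy(theta, s)
            a = rng.choice(n_actions, p=probs)
            score[s] -= probs          # accumulate grad log pi_theta(a_t|s_t)
            score[s, a] += 1.0
            ret += R[s, a]
            s = rng.choice(n_states, p=P[s, a])
        grad += ret * score
    return grad / N

def estimate_J(theta, n=5_000):
    """Monte Carlo estimate of the expected return under theta."""
    total = 0.0
    for _ in range(n):
        s = rng.choice(n_states, p=p0)
        for _ in range(T):
            a = rng.choice(n_actions, p=policy(theta, s))
            total += R[s, a]
            s = rng.choice(n_states, p=P[s, a])
    return total / n

theta = np.zeros((n_states, n_actions))
print("J before:", estimate_J(theta))
for _ in range(20):                    # plain gradient ascent on J(theta)
    theta += 0.1 * policy_gradient_estimate(theta)
print("J after:", estimate_J(theta))   # should tend to be higher
```

This vanilla estimator is unbiased but can have high variance; the sketch makes no attempt at variance reduction.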