Note: This largely summarizes introductory material from Berkeley's
CS285.
A trajectory \(\tau\) of finite length \(T\) is a sequence of states and
actions:
\(\displaystyle \tau = \{ s_1, a_1, \ldots, s_T, a_T, s_{T + 1} \}\)
where the initial state \(s_1\) has density \(p_0 (s)\), actions are
sampled from the parameterized policy \(\pi_{\theta} (a|s)\), and the
transition dynamics are \(p (s' |s, a)\). The joint distribution over
\(\tau\) is:
\(\displaystyle p_{\theta} (\tau) = p_{\theta} (s_1, a_1, \ldots, s_T, a_T, s_{T + 1}) = p_0 (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) \, p (s_{t + 1} |s_t, a_t)\)
We use the environment reward function \(r (s, a)\) to define each
timestep's reward \(r_t = r (s_t, a_t)\). Our objective is to find the
policy that maximizes the expected return:
\(\displaystyle \max_{\theta} J (\theta) =\mathbb{E} \left[ \sum_{t = 1}^T r_t \right] \quad (1)\)
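Everything above has a direct computational counterpart: sampling from \(p_{\theta} (\tau)\) means alternating between the policy and the dynamics, and \(J (\theta)\) can be estimated by averaging the return \(\sum_t r_t\) over sampled trajectories. Here is a minimal sketch on a made-up tabular MDP (the sizes, dynamics, and softmax parameterization are illustrative assumptions, not anything from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small, made-up tabular MDP: 3 states, 2 actions, horizon T = 5.
n_states, n_actions, T = 3, 2, 5
p0 = np.array([1.0, 0.0, 0.0])                        # initial state density p_0(s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # dynamics p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))            # reward function r(s,a)

def policy(theta, s):
    """Softmax policy pi_theta(a|s) with a tabular parameter theta[s, a]."""
    logits = theta[s] - theta[s].max()
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_trajectory(theta):
    """Draw tau = (s_1, a_1, ..., s_T, a_T, s_{T+1}) from p_theta(tau); return sum of rewards."""
    s = rng.choice(n_states, p=p0)                    # s_1 ~ p_0
    total_reward = 0.0
    for t in range(T):
        a = rng.choice(n_actions, p=policy(theta, s)) # a_t ~ pi_theta(.|s_t)
        total_reward += R[s, a]                       # r_t = r(s_t, a_t)
        s = rng.choice(n_states, p=P[s, a])           # s_{t+1} ~ p(.|s_t, a_t)
    return total_reward

theta = np.zeros((n_states, n_actions))
returns = [sample_trajectory(theta) for _ in range(1000)]
print(np.mean(returns))                               # Monte Carlo estimate of J(theta)
```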
We now want the gradient of this objective. Writing \(r (\tau) = \sum_t r_t\) and using the log-derivative trick \(\nabla_{\theta} p_{\theta} (\tau) = p_{\theta} (\tau) \nabla_{\theta} \log p_{\theta} (\tau)\) in the third step below:
\begin{eqnarray*}
  \nabla_{\theta} J (\theta) & = & \nabla_{\theta} \int r (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\
  & = & \int r (\tau) \nabla_{\theta} p_{\theta} (\tau) \mathrm{d} \tau\\
  & = & \int r (\tau) \nabla_{\theta} \log p_{\theta} (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\
  & = & \mathbb{E} [r (\tau) \nabla_{\theta} \log p_{\theta} (\tau)]
\end{eqnarray*}
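The log-derivative trick is easy to sanity-check numerically on a toy distribution. The snippet below compares the exact gradient of an expectation with the score-function estimate \(\mathbb{E} [f (x) \nabla_{\theta} \log p_{\theta} (x)]\) for a Bernoulli with a sigmoid parameter (the distribution and the values of \(f\) are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = 0.3
f = np.array([2.0, -1.0])                 # arbitrary "reward" for outcomes x in {0, 1}

# Exact gradient of E[f(x)] for x ~ Bernoulli(sigmoid(theta)).
p1 = sigmoid(theta)
exact = (f[1] - f[0]) * p1 * (1 - p1)     # d/dtheta [ (1 - p1) f(0) + p1 f(1) ]

# Score-function (log-derivative trick) estimate: mean of f(x) * d/dtheta log p_theta(x).
xs = rng.random(200_000) < p1             # samples x ~ p_theta
score = np.where(xs, 1 - p1, -p1)         # d/dtheta log p_theta(x) for x = 1 and x = 0
estimate = np.mean(f[xs.astype(int)] * score)

print(exact, estimate)                    # the two should agree closely
```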
Expanding the gradient of the log density:
\begin{eqnarray*}
  \nabla_{\theta} \log p_{\theta} (\tau) & = & \nabla_{\theta} \log \left( p_0 (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) p (s_{t + 1} |s_t, a_t) \right)\\
  & = & \nabla_{\theta} \left( \log p_0 (s_1) + \sum_{t = 1}^T \log \pi_{\theta} (a_t |s_t) + \log p (s_{t + 1} |s_t, a_t) \right)\\
  & = & \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t)
\end{eqnarray*}
since neither the initial state density nor the dynamics depends on \(\theta\).
So we have that:
\(\displaystyle \nabla_{\theta} J (\theta) =\mathbb{E} \left[ \left( \sum_{t = 1}^T r_t \right) \left( \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \right) \right] \quad (2)\)
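Eq. 2 is already a usable Monte Carlo estimator: sample a trajectory, then multiply its total reward by the summed score \(\sum_t \nabla_{\theta} \log \pi_{\theta} (a_t |s_t)\). Here is a sketch using PyTorch autograd with a small categorical policy network; the network shape and the fake rollout data standing in for \((s_t, a_t, r_t)\) are assumptions for illustration, not part of the derivation:

```python
import torch
import torch.nn as nn

# Hypothetical categorical policy over n_actions actions from an obs_dim observation.
obs_dim, n_actions, T = 4, 2, 10
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

def reinforce_loss(log_probs, rewards):
    """Surrogate loss whose gradient equals the Eq. 2 estimator for one trajectory."""
    total_return = rewards.sum()                  # r(tau); no gradient flows through rewards
    return -(total_return * log_probs.sum())      # minimizing this ascends J(theta)

# Fake rollout data; a real agent would collect these by stepping an environment
# with the current policy.
states = torch.randn(T, obs_dim)
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()
rewards = torch.randn(T)

loss = reinforce_loss(dist.log_prob(actions), rewards)
loss.backward()    # the policy parameters now hold one sample of the gradient estimate
```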
However, we may also rewrite Eq. 1 by exchanging
expectation and sum:
\(\displaystyle J (\theta) = \sum_{t = 1}^T \mathbb{E} [r_t]\)
Since \(r_t\) is a function of the trajectory up to time \(t\), which we
denote \(\tau_{1 : t} = \{ s_1, a_1, \ldots, s_t, a_t \}\), we can do
analogous calculations to the above:
\begin{eqnarray*}
  \nabla_{\theta} \mathbb{E} [r_t] & = & \nabla_{\theta} \int r_t p_{\theta} (\tau_{1 : t}) \mathrm{d} \tau_{1 : t}\\
  & = & \int r_t \nabla_{\theta} \log p_{\theta} (\tau_{1 : t}) p_{\theta} (\tau_{1 : t}) \mathrm{d} \tau_{1 : t}\\
  & = & \mathbb{E} \left[ r_t \left( \sum_{u = 1}^t \nabla_{\theta} \log \pi_{\theta} (a_u |s_u) \right) \right]
\end{eqnarray*}
Then summing over values of \(t\), we obtain a different policy gradient
formula (compare to Eq. 2):
\begin{eqnarray*}
  \nabla_{\theta} J (\theta) & = & \sum_{t = 1}^T \nabla_{\theta} \mathbb{E} [r_t]\\
  & = & \sum_{t = 1}^T \mathbb{E} \left[ r_t \left( \sum_{u = 1}^t \nabla_{\theta} \log \pi_{\theta} (a_u |s_u) \right) \right]\\
  & = & \mathbb{E} \left[ \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \left( \sum_{u = t}^T r_u \right) \right]
\end{eqnarray*}
where the last equality exchanges the order of the two sums (and relabels the indices), so that each score term \(\nabla_{\theta} \log \pi_{\theta} (a_t |s_t)\) multiplies only the rewards from time \(t\) onward, the "reward-to-go".
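The only change relative to Eq. 2 is that the score at time \(t\) is now weighted by the reward-to-go \(\sum_{u = t}^T r_u\) rather than by the full return, which typically lowers the variance of the estimator. In the same PyTorch style as above, assuming per-step log-probs and rewards have already been collected, a sketch of this estimator looks like:

```python
import torch

def reward_to_go(rewards):
    """Compute sum_{u=t}^T r_u for every t (a reversed cumulative sum)."""
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

def reward_to_go_loss(log_probs, rewards):
    """Surrogate loss whose gradient equals the reward-to-go policy gradient estimator."""
    rtg = reward_to_go(rewards)                   # no gradient flows through the rewards
    return -(log_probs * rtg).sum()

# Example: rewards [1, 2, 3] give reward-to-go [6, 5, 3].
print(reward_to_go(torch.tensor([1.0, 2.0, 3.0])))
```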
Finally, define the action-value function:
\(\displaystyle Q_t (s, a) =\mathbb{E} \left[ \sum_{u = t}^T r_u \,\middle|\, s_t = s, a_t = a \right]\)
Using the tower property (conditioning on \(s_t\) and \(a_t\)), we may rewrite the previous formula as:
\(\displaystyle \nabla_{\theta} J (\theta) =\mathbb{E} \left[ \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) Q_t (s_t, a_t) \right] \quad (3)\)
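In practice \(Q_t (s_t, a_t)\) is not available exactly: the reward-to-go above is its single-sample Monte Carlo estimate, and actor-critic methods instead plug in a learned approximation. A minimal sketch of the corresponding surrogate loss, where `q_values` stands for whatever estimate of \(Q_t (s_t, a_t)\) is available (a hypothetical critic's output here, not something specified by the derivation):

```python
import torch

def policy_gradient_loss(log_probs, q_values):
    """Surrogate loss for Eq. 3: E[ sum_t grad log pi(a_t|s_t) * Q_t(s_t, a_t) ].

    log_probs: log pi_theta(a_t|s_t) along a trajectory (requires grad)
    q_values:  estimates of Q_t(s_t, a_t), e.g. reward-to-go or a critic's output
    """
    return -(log_probs * q_values.detach()).sum()  # detach: no gradient through the Q estimate
```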