## 2020/07/11

Note: This largely summarizes introductory material from Berkeley's CS285.

A trajectory $$\tau$$ of finite length $$T$$ is a sequence of states and actions:

$$\displaystyle \tau = \{ s_1, a_1, \ldots, s_T, a_T, s_{T + 1} \}$$

where the initial state $$s_1$$ has density $$p_0 (s)$$, actions are sampled from the parameterized policy $$\pi_{\theta} (a|s)$$, and the transition dynamics are $$p (s' |s, a)$$. The joint distribution over $$\tau$$ is:

$$\displaystyle p_{\theta} (\tau) = p_{\theta} (s_1, a_1, \ldots, s_T, a_T, s_{T + 1}) = p_0 (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) p (s_{t + 1} |s_t, a_t)$$
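This factorization can be sketched numerically for a tiny tabular MDP. Everything below (state/action counts, probabilities, function names) is illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 3 states, 2 actions, horizon T = 5.
n_states, n_actions, T = 3, 2, 5
p0 = np.array([0.5, 0.3, 0.2])                                    # initial density p_0(s)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # policy pi(a|s), rows sum to 1
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # dynamics p(s'|s,a)

def sample_trajectory():
    """Sample tau = (s_1, a_1, ..., s_T, a_T, s_{T+1}) and accumulate log p_theta(tau)."""
    s = rng.choice(n_states, p=p0)
    logp = np.log(p0[s])
    states, actions = [s], []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        # One factor of the product: pi(a_t|s_t) * p(s_{t+1}|s_t, a_t)
        logp += np.log(pi[s, a]) + np.log(P[s, a, s_next])
        actions.append(a)
        states.append(s_next)
        s = s_next
    return states, actions, logp

states, actions, logp = sample_trajectory()
```

The trajectory contains $$T$$ actions and $$T + 1$$ states, matching the definition above.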

We use the environment reward function $$r (s, a)$$ to define each timestep's reward $$r_t = r (s_t, a_t)$$. Our objective is to find the policy that maximizes the expected return:

 $$\displaystyle \max_{\theta} J (\theta) =\mathbb{E} \left[ \sum_{t = 1}^T r_t \right]$$ (1)

Next we compute the gradient of this objective. Denoting the return of a trajectory by $$r (\tau) = \sum_{t = 1}^T r_t$$, and using the log-derivative trick $$\nabla_{\theta} p_{\theta} = p_{\theta} \nabla_{\theta} \log p_{\theta}$$:

\begin{eqnarray*} \nabla_{\theta} J (\theta) & = & \nabla_{\theta} \int r (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \int r (\tau) \nabla_{\theta} p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \int r (\tau) \nabla_{\theta} \log p_{\theta} (\tau) p_{\theta} (\tau) \mathrm{d} \tau\\ & = & \mathbb{E} [r (\tau) \nabla_{\theta} \log p_{\theta} (\tau)] \end{eqnarray*}
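The score-function identity in the last line can be checked by Monte Carlo on a distribution simple enough to differentiate by hand. For $$X \sim \mathrm{Bernoulli} (\theta)$$, $$\mathbb{E} [r (X)] = \theta r (1) + (1 - \theta) r (0)$$, so the exact gradient is $$r (1) - r (0)$$; the reward values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3
r = np.array([1.0, 5.0])   # r(0), r(1): arbitrary illustrative rewards

# Exact gradient of E[r(X)] for X ~ Bernoulli(theta): r(1) - r(0)
exact = r[1] - r[0]

# Score-function estimate: E[r(X) * d/dtheta log p(X)]
# where d/dtheta log p(x) = 1/theta if x = 1, else -1/(1 - theta).
x = rng.random(200_000) < theta
score = np.where(x, 1.0 / theta, -1.0 / (1.0 - theta))
estimate = np.mean(r[x.astype(int)] * score)
```

With 200k samples the estimate lands close to the exact value, though its variance is already visible at this scale.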

Expanding the gradient of the log density, the initial-state and dynamics terms do not depend on $$\theta$$ and drop out:

\begin{eqnarray*} \nabla_{\theta} \log p_{\theta} (\tau) & = & \nabla_{\theta} \log \left( p_0 (s_1) \prod_{t = 1}^T \pi_{\theta} (a_t |s_t) p (s_{t + 1} |s_t, a_t) \right)\\ & = & \nabla_{\theta} \left( \log p_0 (s_1) + \sum_{t = 1}^T \log \pi_{\theta} (a_t |s_t) + \log p (s_{t + 1} |s_t, a_t) \right)\\ & = & \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \end{eqnarray*}

So we have that:

 $$\displaystyle \nabla_{\theta} J (\theta) =\mathbb{E} \left[ \left( \sum_{t = 1}^T r_t \right) \left( \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \right) \right]$$ (2)
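Eq. 2 can be validated against the analytic gradient in a setting small enough to compute both. The sketch below uses a hypothetical single-state MDP (a repeated bandit) with a softmax policy, where $$\nabla_{\theta} \log \pi_{\theta} (a) = e_a - \pi$$ and $$J (\theta) = T \sum_a \pi (a) r (a)$$; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.2, -0.1])       # softmax logits over 2 actions (single state)
r_a = np.array([1.0, 3.0])          # reward for each action
T, n_traj = 4, 100_000

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Exact gradient of J = T * sum_a pi(a) r(a), using d pi / d theta = diag(pi) - pi pi^T.
exact = T * (np.diag(pi) - np.outer(pi, pi)) @ r_a

# Eq. 2 estimator: average of (sum_t r_t) * (sum_t grad log pi(a_t)) over trajectories.
grad_log = np.eye(2) - pi                      # row a is the softmax score e_a - pi
A = rng.choice(2, size=(n_traj, T), p=pi)      # sampled actions for each trajectory
returns = r_a[A].sum(axis=1)                   # sum_t r_t per trajectory
scores = grad_log[A].sum(axis=1)               # sum_t grad log pi(a_t) per trajectory
estimate = (returns[:, None] * scores).mean(axis=0)
```

Note that the exact softmax gradient components sum to zero, a useful sanity check on the estimator.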

However, we may also rewrite Eq. 1 by exchanging the expectation and the sum (linearity of expectation):

$$\displaystyle J (\theta) = \sum_{t = 1}^T \mathbb{E} [r_t]$$

Since $$r_t$$ is a function only of the trajectory up to time $$t$$, which we denote $$\tau_{1 : t} = \{ s_1, a_1, \ldots, s_t, a_t \}$$, we can repeat the calculation above; expanding $$\nabla_{\theta} \log p_{\theta} (\tau_{1 : t})$$ as before leaves only the policy terms up to time $$t$$:

\begin{eqnarray*} \nabla_{\theta} \mathbb{E} [r_t] & = & \nabla_{\theta} \int r_t p_{\theta} (\tau_{1 : t}) \mathrm{d} \tau_{1 : t}\\ & = & \int r_t \nabla_{\theta} \log p_{\theta} (\tau_{1 : t}) p_{\theta} (\tau_{1 : t}) \mathrm{d} \tau_{1 : t}\\ & = & \mathbb{E} \left[ r_t \left( \sum_{u = 1}^t \nabla_{\theta} \log \pi_{\theta} (a_u |s_u) \right) \right] \end{eqnarray*}

Then summing over values of $$t$$ and exchanging the order of the double sum, we obtain a different policy gradient formula, often called the reward-to-go form (compare to Eq. 2):

\begin{eqnarray*} \nabla_{\theta} J (\theta) & = & \sum_{t = 1}^T \nabla_{\theta} \mathbb{E} [r_t]\\ & = & \sum_{t = 1}^T \mathbb{E} \left[ r_t \left( \sum_{u = 1}^t \nabla_{\theta} \log \pi_{\theta} (a_u |s_u) \right) \right]\\ & = & \mathbb{E} \left[ \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) \left( \sum_{u = t}^T r_u \right) \right] \end{eqnarray*}
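The last equality is purely an exchange of summation order: $$\sum_{t = 1}^T \sum_{u = 1}^t = \sum_{u = 1}^T \sum_{t = u}^T$$ followed by relabeling. A quick numerical check with arbitrary illustrative rewards and scalar stand-ins for the score terms:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 3.0])   # illustrative r_1, ..., r_T
g = np.array([0.5, -1.0, 2.0, 0.25])       # scalar stand-ins for grad log pi terms
T = len(rewards)

# Reward-to-go: for each t, sum_{u=t}^T r_u, i.e. a reversed cumulative sum.
reward_to_go = np.cumsum(rewards[::-1])[::-1]

# Exchange of summation order behind the last equality:
# sum_t r_t (sum_{u<=t} g_u)  ==  sum_t g_t (sum_{u>=t} r_u)
lhs = sum(rewards[t] * g[: t + 1].sum() for t in range(T))
rhs = sum(g[t] * reward_to_go[t] for t in range(T))
```

Here `reward_to_go` is `[6., 5., 5., 3.]` and the two double sums agree exactly.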

Finally, define the action-value function:

$$\displaystyle Q_t (s, a) =\mathbb{E} \left[ \sum_{u = t}^T r_u \middle| s_t = s, a_t = a \right]$$

Using the tower property (conditioning on $$(s_t, a_t)$$, on which $$\nabla_{\theta} \log \pi_{\theta} (a_t |s_t)$$ depends), we may rewrite the previous formula as:

 $$\displaystyle \nabla_{\theta} J (\theta) =\mathbb{E} \left[ \sum_{t = 1}^T \nabla_{\theta} \log \pi_{\theta} (a_t |s_t) Q_t (s_t, a_t) \right]$$ (3)
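The practical payoff of the reward-to-go form is variance reduction: both estimators are unbiased, but weighting each score by the full return adds noise from rewards that precede the action. A sketch in the same hypothetical single-state softmax bandit as before (all numbers illustrative), using the sampled reward-to-go as a single-sample stand-in for $$Q_t$$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([0.0, 0.0])                  # uniform softmax policy over 2 actions
r_a = np.array([1.0, 3.0])
T, n = 4, 50_000

pi = np.exp(theta) / np.exp(theta).sum()
grad_log = np.eye(2) - pi                     # softmax score: e_a - pi

A = rng.choice(2, size=(n, T), p=pi)          # sampled actions, (n, T)
R = r_a[A]                                    # per-step rewards r_t
rtg = np.cumsum(R[:, ::-1], axis=1)[:, ::-1]  # reward-to-go sum_{u>=t} r_u
G = grad_log[A]                               # per-step score vectors, (n, T, 2)

# Per-trajectory gradient samples: Eq. 2 (full return) vs. the reward-to-go form.
full_samples = R.sum(axis=1)[:, None] * G.sum(axis=1)
rtg_samples = (G * rtg[:, :, None]).sum(axis=1)

mean_full = full_samples.mean(axis=0)
mean_rtg = rtg_samples.mean(axis=0)
var_full = full_samples.var(axis=0).sum()
var_rtg = rtg_samples.var(axis=0).sum()
```

Both means agree with the analytic gradient (here $$[-2, 2]$$), while the reward-to-go samples have noticeably lower variance.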