江苏省建设考试培训网网站,昆明开发,做视频特技的网站,网站运营配置文章目录 前言TRPO特点策略梯度的优化目标使用重要性采样忽略状态分布的差异约束策略的变化近似求解线性搜索算法伪代码广义优势估计代码实践离散动作空间连续动作空间 参考 前言
之前介绍的基于策略的方法包括策略梯度算法和 Actor-Critic 算法。这些方法虽然简单、直观… 文章目录 前言TRPO特点策略梯度的优化目标使用重要性采样忽略状态分布的差异约束策略的变化近似求解线性搜索算法伪代码广义优势估计代码实践离散动作空间连续动作空间 参考 前言
之前介绍的基于策略的方法包括策略梯度算法和 Actor-Critic 算法。这些方法虽然简单、直观但在实际应用过程中会遇到训练不稳定的情况。回顾一下基于策略的方法参数化智能体的策略并设计衡量策略好坏的目标函数通过梯度上升的方法来最大化这个目标函数使得策略最优。这种算法有一个明显的缺点当策略网络是深度模型时沿着策略梯度更新参数很有可能由于步长太长策略突然显著变差进而影响训练效果同时由于采集到的数据的分布会随策略的更新而变化步长太长更新的策略可能变差导致采集到的数据变差进而陷入恶性循环之中。本文将要介绍的信任区域策略优化trust region policy optimizationTRPO算法在理论上能够保证策略学习的性能单调性并在实际应用中取得了比策略梯度算法更好的效果。
TRPO特点
是一种on-policy的算法可以用于连续或离散的动作空间Open AI Spinning Up 中的TRPO可以达到并行运行的效果TRPO以一种on-policy的方式训练了一个随机策略。这意味着它通过根据其随机策略的最新版本进行动作采样来进行探索。动作选择中的随机量取决于初始条件和训练过程。在训练过程中策略通常会逐渐变得不那么随机因为更新规则鼓励它利用已经发现的奖励。这可能导致策略陷入局部最优。
策略梯度的优化目标
假设当前策略为 π θ \pi_\theta πθ其参数为 θ \theta θ。对于策略梯度的优化目标我们可以将其写成累计的折扣回报 J ( θ ) E τ ∼ p θ ( τ ) [ ∑ t γ t r ( s t , a t ) ] J(\theta)\mathbb{E}_{\tau\sim p_\theta(\tau)}[\sum_t\gamma^tr(s_t,a_t)] J(θ)Eτ∼pθ(τ)[∑tγtr(st,at)]基于轨迹 τ \tau τ我们可以采样到累计的折扣回报。同时因为 V π θ ( s ) V^{\pi_\theta}(s) Vπθ(s)相应的期望可以写成累计折扣回报的形式, V π θ ( s ) E a ∼ π θ ( s ) [ Q π θ ( s , a ) ] E a ∼ π θ ( s ) [ E τ ∼ p θ ( τ ) [ ∑ s k s , a k a ∑ t k ∞ γ t − k r ( s t , a t ) ] ] V^{\pi_\theta}(s)\mathbb{E}_{a\sim\pi_\theta(s)}[Q^{\pi_\theta}(s,a)]\mathbb{E}_{a\sim\pi_\theta(s)}[\mathbb{E}_{\tau\sim p_\theta(\tau)}[\sum_{s_ks,a_ka}\sum_{tk}^{\infty}\gamma^{t-k}r(s_t,a_t)]] Vπθ(s)Ea∼πθ(s)[Qπθ(s,a)]Ea∼πθ(s)[Eτ∼pθ(τ)[∑sks,aka∑tk∞γt−kr(st,at)]]所以我们可以得到第二种优化目标基于起始点的形式 J ( θ ) E s 0 ∼ p θ ( s 0 ) [ V π θ ( s 0 ) ] J(\theta)\mathbb{E}_{s_0\sim p_{\theta}(s_0)}[V^{\pi_\theta}(s_0)] J(θ)Es0∼pθ(s0)[Vπθ(s0)]。两者是等效的。
在策略优化中我们考虑如何借助当前的 θ \theta θ找到一个更优的参数 θ ′ \theta θ′使得 J ( θ ′ ) ≥ J ( θ ) J(\theta)\ge J(\theta) J(θ′)≥J(θ)。由于初始状态 s 0 s_0 s0的分布和策略无关与起始的环境相关因此上述策略下的优化目标 J ( θ ) J(\theta) J(θ)可以写成在新策略 π θ ′ \pi_{\theta} πθ′的期望形式 J ( θ ) E s 0 [ V π θ ( s 0 ) ] E π θ ′ [ ∑ t 0 ∞ γ t V π θ ( s t ) − ∑ t 1 ∞ γ t V π θ ( s t ) ] − E π θ ′ [ ∑ t 0 ∞ γ t ( γ V π θ ( s t 1 ) − V π θ ( s t ) ) ] \begin{aligned} J(\theta) \mathbb{E}_{s_0}[V^{\pi_\theta}(s_0)] \\ \mathbb{E}_{\pi_{\theta^{\prime}}}\left[\sum_{t0}^\infty\gamma^tV^{\pi_\theta}(s_t)-\sum_{t1}^\infty\gamma^tV^{\pi_\theta}(s_t)\right] \\ -\mathbb{E}_{\pi_{\theta}}\left[\sum_{t0}^{\infty}\gamma^{t}\left(\gamma V^{\pi_{\theta}}(s_{t1})-V^{\pi_{\theta}}(s_{t})\right)\right] \end{aligned} J(θ)Es0[Vπθ(s0)]Eπθ′[t0∑∞γtVπθ(st)−t1∑∞γtVπθ(st)]−Eπθ′[t0∑∞γt(γVπθ(st1)−Vπθ(st))]
基于以上等式我们可以推导新旧策略的目标函数之间的差距 J ( θ ′ ) − J ( θ ) E s 0 [ V π θ ′ ( s 0 ) ] − E s 0 [ V π θ ( s 0 ) ] E π θ ′ [ ∑ t 0 ∞ γ t r ( s t , a t ) ] E π θ ′ [ ∑ t 0 ∞ γ t ( γ V π θ ( s t 1 ) − V π θ ( s t ) ) ] E π θ ′ [ ∑ t 0 ∞ γ t [ r ( s t , a t ) γ V π θ ( s t 1 ) − V π θ ( s t ) ] ] \begin{aligned} J(\theta)-J(\theta) \mathbb{E}_{s_0}\left[V^{\pi_{\theta}}(s_0)\right]-\mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right] \\ \mathbb{E}_{\pi_{\theta^{\prime}}}\left[\sum_{t0}^\infty\gamma^tr(s_t,a_t)\right]\mathbb{E}_{\pi_{\theta^{\prime}}}\left[\sum_{t0}^\infty\gamma^t\left(\gamma V^{\pi_\theta}(s_{t1})-V^{\pi_\theta}(s_t)\right)\right] \\ \mathbb{E}_{\pi_{\theta}}\left[\sum_{t0}^{\infty}\textcolor{red}{\gamma^{t}\left[r(s_{t},a_{t})\gamma V^{\pi_{\theta}}(s_{t1})-V^{\pi_{\theta}}(s_{t})\right]}\right] \end{aligned} J(θ′)−J(θ)Es0[Vπθ′(s0)]−Es0[Vπθ(s0)]Eπθ′[t0∑∞γtr(st,at)]Eπθ′[t0∑∞γt(γVπθ(st1)−Vπθ(st))]Eπθ′[t0∑∞γt[r(st,at)γVπθ(st1)−Vπθ(st)]]
很自然的红色标注的部分和我们之前介绍过的优势函数形式一致因此等式可以变为 E π θ ′ [ ∑ t 0 ∞ γ t A π θ ( s t , a t ) ] \mathbb{E}_{\textcolor{blue}{\pi_{\theta^{\prime}}}}\left[\sum_{t0}^\infty\gamma^tA^{\textcolor{red}{\pi_\theta}}(s_t,a_t)\right] Eπθ′[t0∑∞γtAπθ(st,at)]
使用重要性采样
直接对上式进行采样是困难的因为 π θ ′ \pi_{\theta^{\prime}} πθ′是我们需要求解的策略但我们又要用它来收集样本。把所有可能的新策略都拿来收集数据然后判断哪个策略满足上述条件的做法显然是不现实的。因此我们需要用到重要性采样Importance Sampling。 J ( θ ′ ) − J ( θ ) E τ ∼ p θ ′ ( τ ) [ ∑ t 0 ∞ γ t A π θ ( s t , a t ) ] \begin{aligned}J(\theta^{\prime})-J(\theta)\\\mathbb{E}_{\tau\sim p_{\theta^{\prime}}(\tau)}\left[\sum_{t0}^\infty\gamma^tA^{\pi_\theta}(s_t,a_t)\right]\end{aligned} J(θ′)−J(θ)Eτ∼pθ′(τ)[t0∑∞γtAπθ(st,at)] E τ ∼ p θ ′ ( τ ) \mathbb{E}_{\tau\sim p_{\theta^{\prime}}(\tau)} Eτ∼pθ′(τ)可以拆分成先采样出状态 s t s_t st再基于 π θ ′ \pi_{\theta} πθ′采样出动作 a t a_t at于是有 ∑ t E s t ∼ p θ ′ ( s t ) [ E a t ∼ π θ ′ ( a t ∣ s t ) [ γ t A π θ ( s t , a t ) ] ] \sum_t\mathbb{E}_{s_t\sim p_{\theta^{\prime}}(s_t)}[\mathbb{E}_{a_t\sim\pi_{\theta^{\prime}}(a_t|s_t)}[\gamma^tA^{\pi_\theta}(s_t,a_t)]] t∑Est∼pθ′(st)[Eat∼πθ′(at∣st)[γtAπθ(st,at)]]
再利用重要性采样的方法可得 ∑ t E s t ∼ p θ ′ ( s t ) [ E a t ∼ π θ ( a t ∣ s t ) [ π θ ′ ( a t ∣ s t ) π θ ( a t ∣ s t ) γ t A π θ ( s t , a t ) ] ] \sum_t\mathbb{E}_{s_t\sim p_{\theta^{\prime}}(s_t)}[\mathbb{E}_{a_t\sim\pi_\theta(a_t|s_t)}[\frac{\pi_{\theta^{\prime}}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)]] t∑Est∼pθ′(st)[Eat∼πθ(at∣st)[πθ(at∣st)πθ′(at∣st)γtAπθ(st,at)]]
忽略状态分布的差异
不过此时前面依然是 E s t ∼ p θ ′ ( s t ) \mathbb{E}_{s_t\sim p_{\theta^{\prime}}(s_t)} Est∼pθ′(st) s t s_t st需要基于 π θ ′ \pi_{\theta} πθ′进行采样。对此TRPO忽略状态分布的差异.当策略更新前后的变化较小时可以近似地令 π θ ( s t ) ≈ π θ ′ ( s t ) \pi_{\theta}(s_t)\approx\pi_{\theta}(s_t) πθ(st)≈πθ′(st)。现在假设使用确定性策略且当 π θ ′ ( s t ) ≠ π θ ( s t ) \pi_{\theta^{\prime}}(s_t)\neq\pi_\theta\mathrm{~(s_t)} πθ′(st)πθ (st)的概率小于时或者假设使用随机性策略且当 a ′ ∼ π θ ′ ( ⋅ ∣ s t ) ≠ a ∼ π θ ( ⋅ ∣ s t ) a^{\prime}{\sim}\pi_{\theta^{\prime}}(\cdot|s_t)\neq a{\sim}\pi_\theta(\cdot|s_t) a′∼πθ′(⋅∣st)a∼πθ(⋅∣st)的概率小于时则新策略可以由与旧策略相等的部分和不相等的部分之和组成 p θ ′ ( s t ) ( 1 − ϵ ) t p θ ( s t ) ( 1 − ( 1 − ϵ ) t ) p m i s t a k e ( s t ) p_{\theta}(s_t)(1-\epsilon)^tp_\theta(s_t)(1-(1-\epsilon)^t)p_{mistake}(s_t) pθ′(st)(1−ϵ)tpθ(st)(1−(1−ϵ)t)pmistake(st)
移项可得 ∣ p θ ′ ( s t ) − p θ ( s t ) ∣ ( 1 − ( 1 − ϵ ) t ) ∣ p m i s t a k e ( s t ) − p θ ( s t ) ∣ ≤ 2 ( 1 − ( 1 − ϵ ) t ) ≤ 2 ϵ t \begin{aligned}|p_{\theta^{\prime}}(s_t)-p_\theta(s_t)|(1-(1-\epsilon)^t)|p_{mistake}(s_t)-p_\theta(s_t)|\\ \leq2(1-(1-\epsilon)^t)\\\leq2\epsilon t\end{aligned} ∣pθ′(st)−pθ(st)∣(1−(1−ϵ)t)∣pmistake(st)−pθ(st)∣≤2(1−(1−ϵ)t)≤2ϵt ⚠️You Should Know 1. ∣ p m i s t a k e ( s t ) − p θ ( s t ) ∣ ≤ 2 |p_{mistake}(s_t)-p_\theta(s_t)| \leq2 ∣pmistake(st)−pθ(st)∣≤2概率上下差值最大为2 2. ( 1 − ϵ ) t ≥ 1 − ϵ t f o r ϵ ∈ [ 0 , 1 ] (1-\epsilon)^t\geq1-\epsilon t\mathrm{~for~}\epsilon\in[0,1] (1−ϵ)t≥1−ϵt for ϵ∈[0,1]泰勒展开 最后我们可以让 π θ ( s t ) ≈ π θ ′ ( s t ) \pi_{\theta}(s_t)\approx\pi_{\theta}(s_t) πθ(st)≈πθ′(st) J ( θ ′ ) − J ( θ ) ≈ ∑ t E s t ∼ p θ ( s t ) [ E a t ∼ π θ ( a t ∣ s t ) [ π θ ′ ( a t ∣ s t ) π θ ( a t ∣ s t ) γ t A π θ ( s t , a t ) ] ] J(\theta^{\prime})-J(\theta)\approx\sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}[\mathbb{E}_{a_t\sim\pi_\theta(a_t|s_t)}[\frac{\pi_{\theta^{\prime}}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)]] J(θ′)−J(θ)≈t∑Est∼pθ(st)[Eat∼πθ(at∣st)[πθ(at∣st)πθ′(at∣st)γtAπθ(st,at)]]
约束策略的变化
为了保证新旧策略足够接近TRPO 使用了库尔贝克-莱布勒Kullback-LeiblerKL散度来约束策略更新的幅度。 θ ′ ← arg max θ ′ ∑ t E s t ∼ p θ ( s t ) [ E a t ∼ π θ ( a t ∣ s t ) [ π θ ′ ( a t ∣ s t ) π θ ( a t ∣ s t ) γ t A π θ ( s t , a t ) ] ] s u c h t h a t E s t ∼ p ( s t ) [ D K L ( π θ ′ ( a t ∣ s t ) ∥ π θ ( a t ∣ s t ) ) ] ≤ ϵ \begin{aligned} \theta\leftarrow\arg\max_{\theta}\sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}[\mathbb{E}_{a_t\sim\pi_\theta(a_t|s_t)}[\frac{\pi_{\theta}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)]] \\ \mathrm{such~that~}\mathbb{E}_{s_t\sim p(s_t)}[D_{KL}(\pi_{\theta^{\prime}}(a_t|s_t)\parallel\pi_\theta(a_t|s_t))]\leq\epsilon \end{aligned} θ′←argθ′maxt∑Est∼pθ(st)[Eat∼πθ(at∣st)[πθ(at∣st)πθ′(at∣st)γtAπθ(st,at)]]such that Est∼p(st)[DKL(πθ′(at∣st)∥πθ(at∣st))]≤ϵ ⚠️You Should Know 1.当 θ θ ′ \theta\theta θθ′时上述目标函数以及其约束都为0此外当 θ θ ′ \theta\theta θθ′时约束对 θ \theta θ的梯度也为0. 2.实际多使用constraint violate as penalty进行求解 θ ′ ← arg max θ ′ ∑ t E s t ∼ p θ ( s t ) [ E a t ∼ π θ ( a t ∣ s t ) [ π θ ′ ( a t ∣ s t ) π θ ( a t ∣ s t ) γ t A π θ ( s t , a t ) ] ] − λ ( D K L ( π θ ′ ( a t ∣ s t ) ∥ π θ ( a t ∣ s t ) ) − ϵ ) \begin{aligned}\theta\leftarrow\arg\max_{\theta}\sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}[\mathbb{E}_{a_t\sim\pi_\theta(a_t|s_t)}[\frac{\pi_{\theta}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^tA^{\pi_\theta}(s_t,a_t)]]\\-\lambda(D_{KL}(\pi_{\theta}(a_t|s_t)\parallel\pi_\theta(a_t|s_t))-\epsilon)\end{aligned} θ′←argθ′maxt∑Est∼pθ(st)[Eat∼πθ(at∣st)[πθ(at∣st)πθ′(at∣st)γtAπθ(st,at)]]−λ(DKL(πθ′(at∣st)∥πθ(at∣st))−ϵ) 2.1优化上式更新 θ ′ \theta θ′ 2.2更新惩罚项 λ ← λ α ( D K L ( π θ ′ ( a t ∣ s t ) ∥ π θ ( a t ∣ s t ) ) − ϵ ) \lambda\leftarrow\lambda\alpha(D_{KL}(\pi_{\theta^{\prime}}(a_{t}|s_{t})\parallel\pi_{\theta}(a_{t}|s_{t}))-\epsilon) λ←λα(DKL(πθ′(at∣st)∥πθ(at∣st))−ϵ)若KL散度超出 ϵ \epsilon ϵ则进行惩罚超出越多惩罚越多 近似求解
上述式子的求解还是比较困难的因此TRPO进行了近似求解。为了方便描述我们将目标函数和约束转化为以下形式 θ k 1 arg max θ L ( θ k , θ ) s . t . D ˉ K L ( θ ∣ ∣ θ k ) ≤ δ \begin{aligned}\theta_{k1}\arg\max_{\theta}\mathcal{L}(\theta_k,\theta)\\\mathrm{s.t.~}\bar{D}_{KL}(\theta||\theta_k)\le\delta\end{aligned} θk1argθmaxL(θk,θ)s.t. DˉKL(θ∣∣θk)≤δ
其中 L ( θ k , θ ) E s , a ∼ π θ k [ π θ ( a ∣ s ) π θ k ( a ∣ s ) A π θ k ( s , a ) ] , \mathcal{L}(\theta_k,\theta)\underset{s,a\sim\pi_{\theta_k}}{\operatorname*{E}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)\right], L(θk,θ)s,a∼πθkE[πθk(a∣s)πθ(a∣s)Aπθk(s,a)], D ˉ K L ( θ ∣ ∣ θ k ) E s ∼ π θ k [ D K L ( π θ ( ⋅ ∣ s ) ∣ ∣ π θ k ( ⋅ ∣ s ) ) ] . \bar{D}_{KL}(\theta||\theta_{k})\mathop{\mathrm{E}}_{s\sim\pi_{\theta_{k}}}\left[D_{KL}\left(\pi_{\theta}(\cdot|s)||\pi_{\theta_{k}}(\cdot|s))\right].\right. DˉKL(θ∣∣θk)Es∼πθk[DKL(πθ(⋅∣s)∣∣πθk(⋅∣s))].
对其基于 θ k \theta_k θk进行泰勒展开 L ( θ k , θ ) ≈ g T ( θ − θ k ) D ˉ K L ( θ ∣ ∣ θ k ) ≈ 1 2 ( θ − θ k ) T H ( θ − θ k ) \begin{aligned} \mathcal{L}(\theta_{k},\theta) \approx g^{T}(\theta-\theta_{k}) \\ \bar{D}_{KL}(\theta||\theta_{k}) \approx\frac12(\theta-\theta_k)^TH(\theta-\theta_k) \end{aligned} L(θk,θ)DˉKL(θ∣∣θk)≈gT(θ−θk)≈21(θ−θk)TH(θ−θk)
其中 g g g表示目标函数的梯度 ∇ θ J ( π θ ) \nabla_{\theta}J(\pi_{\theta}) ∇θJ(πθ) H H H表示策略之间平均 KL 距离的黑塞矩阵Hessian matrix。因此我们可以将问题转化为一个近似的优化问题 θ k 1 arg max θ g T ( θ − θ k ) s.t. 1 2 ( θ − θ k ) T H ( θ − θ k ) ≤ δ . \begin{aligned}\theta_{k1}\arg\max_{\theta}g^T(\theta-\theta_k)\\\text{s.t. }\frac{1}{2}(\theta-\theta_k)^TH(\theta-\theta_k)\leq\delta.\end{aligned} θk1argθmaxgT(θ−θk)s.t. 21(θ−θk)TH(θ−θk)≤δ.
此时我们可以用卡罗需-库恩-塔克Karush-Kuhn-TuckerKKT条件直接导出上述问题的解 θ k 1 θ k 2 δ g T H − 1 g H − 1 g \theta_{k1}\theta_k\sqrt{\frac{2\delta}{g^TH^{-1}g}}H^{-1}g θk1θkgTH−1g2δ H−1g ⚠️You Should Know 1.进一步的求解需要用到共轭梯度法conjugate gradient method或者拉格朗日对偶法Lagrangian duality。共轭梯度法TRPO所用的具体求解参考《动手学强化学习》第11.4节拉格朗日对偶法Lagrangian duality则可参考Boyd的凸优化 —— Convex Optimization by Boyd and Vandenberghe, especially chapters 2 through 5. 2.一般来说用神经网络表示的策略函数的参数数量都是成千上万的计算和存储黑塞矩阵 H H H的逆矩阵会耗费大量的内存资源和时间。TRPO 通过共轭梯度法conjugate gradient method回避了这个问题它的核心思想是直接计算 H x Hx Hx向量, ( x H − 1 g ) (xH^{-1}g) (xH−1g) x x x即参数更新方向而非存储黑塞矩阵 H H H。公式描述如下 H x ∇ θ ( ( ∇ θ D ˉ K L ( θ ∣ ∣ θ k ) ) T x ) , Hx\nabla_{\theta}\left(\left(\nabla_{\theta}\bar{D}_{KL}(\theta||\theta_{k})\right)^{T}x\right), Hx∇θ((∇θDˉKL(θ∣∣θk))Tx), 线性搜索
如果到此为止计算出最终的结果那么现在的算法和Natural Policy Gradient差不多。由于 TRPO 算法用到了泰勒展开的 1 阶和 2 阶近似这并非精准求解最终的策略结果并非能得到提升并且未必能满足 KL 散度限制。因此TRPO 在每次迭代的最后进行一次线性搜索Line Search以确保找到满足条件。具体来说就是找到一个最小的非负整数 i i i使得按照 θ k 1 θ k α i 2 δ x T H x x \theta_{k1}\theta_k\alpha^i\sqrt{\frac{2\delta}{x^THx}}x θk1θkαixTHx2δ x
其中 α ∈ ( 0 , 1 ) \alpha \in (0,1) α∈(0,1)是一个决定线性搜索长度的超参数。求出的 θ k 1 \theta_{k1} θk1依然满足最初的 KL 散度限制并且确实能够提升目标函数 L ( θ k , θ ) \mathcal{L}(\theta_k,\theta) L(θk,θ)。
算法伪代码 广义优势估计
现在我们还缺少一种合适的方法用来估计优势函数 A A A。目前比较常用的一种方法为广义优势估计Generalized Advantage EstimationGAE接下来我们简单介绍一下 GAE 的做法。首先用 δ t r t γ V ( s t 1 ) − V ( s t ) \delta_tr_t\gamma V(s_{t1})-V(s_t) δtrtγV(st1)−V(st)表示时序差分误差其中 V V V是一个已经学习的状态价值函数。于是根据多步时序差分的思想有 A t ( 1 ) δ t − V ( s t ) r t γ V ( s t 1 ) A t ( 2 ) δ t γ δ t 1 − V ( s t ) r t γ r t 1 γ 2 V ( s t 2 ) A t ( 3 ) δ t γ δ t 1 γ 2 δ t 2 − V ( s t ) r t γ r t 1 γ 2 r t 2 γ 3 V ( s t 3 ) ⋮ ⋮ A t ( k ) ∑ l 0 k − 1 γ l δ t l − V ( s t ) r t γ r t 1 … γ k − 1 r t k − 1 γ k V ( s t k ) \begin{aligned} A_{t}^{(1)} \delta_{t} -V(s_t)r_t\gamma V(s_{t1}) \\ A_{t}^{(2)} \delta_t\gamma\delta_{t1} -V(s_t)r_t\gamma r_{t1}\gamma^2V(s_{t2}) \\ A_t^{(3)} \delta_t\gamma\delta_{t1}\gamma^2\delta_{t2} -V(s_t)r_t\gamma r_{t1}\gamma^2r_{t2}\gamma^3V(s_{t3}) \\ \vdots\vdots\\ A_t^{(k)} \sum_{l0}^{k-1}\gamma^l\delta_{tl} -V(s_t)r_t\gamma r_{t1}\ldots\gamma^{k-1}r_{tk-1}\gamma^kV(s_{tk}) \end{aligned} At(1)At(2)At(3)At(k)⋮δtδtγδt1δtγδt1γ2δt2l0∑k−1γlδtl⋮−V(st)rtγV(st1)−V(st)rtγrt1γ2V(st2)−V(st)rtγrt1γ2rt2γ3V(st3)−V(st)rtγrt1…γk−1rtk−1γkV(stk) 然后GAE 将这些不同步数的优势估计进行指数加权平均 A t G A E ( 1 − λ ) ( A t ( 1 ) λ A t ( 2 ) λ 2 A t ( 3 ) ⋯ ) ( 1 − λ ) ( δ t λ ( δ t γ δ t 1 ) λ 2 ( δ t γ δ t 1 γ 2 δ t 2 ) ⋯ ) ( 1 − λ ) ( δ ( 1 λ λ 2 ⋯ ) γ δ t 1 ( λ λ 2 λ 3 ⋯ ) γ ( 1 − λ ) ( δ t 1 1 − λ γ δ t 1 λ 1 − λ γ 2 δ t 2 λ 2 1 − λ ⋯ ) ∑ l 0 ∞ ( γ λ ) l δ t l \begin{aligned} A_{t}^{GAE} (1-\lambda)(A_t^{(1)}\lambda A_t^{(2)}\lambda^2A_t^{(3)}\cdots) \\ (1-\lambda)(\delta_t\lambda(\delta_t\gamma\delta_{t1})\lambda^2(\delta_t\gamma\delta_{t1}\gamma^2\delta_{t2})\cdots) \\ (1-\lambda)(\delta(1\lambda\lambda^2\cdots)\gamma\delta_{t1}(\lambda\lambda^2\lambda^3\cdots)\gamma \\ (1-\lambda)\left(\delta_t\frac1{1-\lambda}\gamma\delta_{t1}\frac\lambda{1-\lambda}\gamma^2\delta_{t2}\frac{\lambda^2}{1-\lambda}\cdots\right) \\ \sum_{l0}^\infty(\gamma\lambda)^l\delta_{tl} \end{aligned} AtGAE(1−λ)(At(1)λAt(2)λ2At(3)⋯)(1−λ)(δtλ(δtγδt1)λ2(δtγδt1γ2δt2)⋯)(1−λ)(δ(1λλ2⋯)γδt1(λλ2λ3⋯)γ(1−λ)(δt1−λ1γδt11−λλγ2δt21−λλ2⋯)l0∑∞(γλ)lδtl
其中 λ ∈ [ 0 , 1 ] \lambda\in[0,1] λ∈[0,1]是在 GAE 中额外引入的一个超参数。当时 λ 0 \lambda0 λ0 A t G A E δ t r t γ V ( s t 1 ) − V ( s t ) A_t^{GAE}\delta_tr_t\gamma V(s_{t1})-V(s_t) AtGAEδtrtγV(st1)−V(st)也即是仅仅只看一步差分得到的优势当时 λ 1 \lambda1 λ1 A t G A E ∑ l 0 ∞ γ l δ t l ∑ l 0 ∞ γ l r t l − V ( s t ) . A_t^{GAE}\sum_{l0}^{\infty}\gamma^l\delta_{tl}\sum_{l0}^{\infty}\gamma^lr_{tl}-V(s_t). AtGAE∑l0∞γlδtl∑l0∞γlrtl−V(st).则是看每一步差分得到优势的完全平均值。
由于广义优势估计(GAE)的实现有很多变化下面是一种可能的实现方法
def calculate_gae(rewards, values, gamma0.99, lambda_0.95):计算广义优势估计(GAE)的值:param rewards: 一组连续的状态动作奖励(reward)list:param values: 一组连续的状态的价值(value) list:param gamma: 衰减系数float:param lambda_: GAE超参数float:return: 估计的优势值(advantages)list# 计算 TD 计算误差 deltadeltas [r gamma * v_next - v for r, v, v_next in zip(rewards, values, values[1:] [0])]# 计算 GAEadvantages []advantage 0for delta in reversed(deltas):advantage gamma * lambda_ * advantage deltaadvantages.append(advantage)advantages.reverse()return advantages这个函数的输入是同步的状态动作奖励和状态价值列表以及GAE的超参数。它通过计算TD误差(delta)和广义优势估计(GAE)输出估计的优势值(advantages)。
在训练过程中我们会调用这个函数来计算advantages。例子伪代码如下
for _ in range(num_iterations):#执行策略获取state,action,rewardstates, actions, rewards run_policy(env, model, num_steps)#计算所有状态的值values calculate_values(env, model, states)#计算所有状态的advantagesadvantages calculate_gae(rewards, values)#更新策略update_policy(model, states, actions, advantages)这些伪代码只是一个粗略的概念实际实现中需要很多细节。
代码实践
离散动作空间
class PolicyNet(torch.nn.Module):def __init__(self, state_dim, hidden_dim, action_dim):super(PolicyNet, self).__init__()self.fc1 torch.nn.Linear(state_dim, hidden_dim)self.fc2 torch.nn.Linear(hidden_dim, action_dim)def forward(self, x):x F.relu(self.fc1(x))return F.softmax(self.fc2(x), dim1)# 输入是某个状态输出则是状态的价值。
class ValueNet(torch.nn.Module):def __init__(self, state_dim, hidden_dim):super(ValueNet, self).__init__()self.fc1 torch.nn.Linear(state_dim, hidden_dim)self.fc2 torch.nn.Linear(hidden_dim, 1)def forward(self, x):x F.relu(self.fc1(x))return self.fc2(x)class TRPO:def __init__(self, state_dim, hidden_dim, action_dim, lambda_, gamma, alpha,kl_constraint, critic_lr, device, numOfEpisodes, env):self.actor PolicyNet(state_dim, hidden_dim, action_dim).to(device)self.critic ValueNet(state_dim, hidden_dim).to(device)# 策略网络参数不需要优化器更新self.critic_optimizer torch.optim.Adam(self.critic.parameters(), lrcritic_lr)self.gamma gamma# GAE参数self.lambda_ lambda_# KL距离最大限制self.kl_constraint kl_constraint# 线性搜索参数self.alpha alphaself.device deviceself.env envself.numOfEpisodes numOfEpisodesdef take_action(self, state):state torch.tensor(np.array([state]), dtypetorch.float).to(self.device)probs self.actor(state)action_dist torch.distributions.Categorical(probs)return action_dist.sample().item()def cal_advantage(self, gamma, lambda_, td_delta):td_delta td_delta.detach().numpy()advantages []advantage 0.0for delta in reversed(td_delta):advantage gamma * lambda_ * advantage deltaadvantages.append(advantage)advantages.reverse()return torch.FloatTensor(np.array(advantages))# compute surrogate object functiondef compute_surrogate_obj(self, states, actions, advantage, old_log_probs, actor):log_probs torch.log(actor(states).gather(1, actions))ratio torch.exp(log_probs - old_log_probs)return torch.mean(ratio * advantage)def hessian_matrix_vector_product(self, states, old_action_dists, vector):# 计算黑塞矩阵和一个向量的乘积new_action_dists torch.distributions.Categorical(self.actor(states))kl torch.mean(torch.distributions.kl.kl_divergence(old_action_dists, new_action_dists)) # 计算平均KL距离kl_grad torch.autograd.grad(kl, self.actor.parameters(), create_graphTrue)kl_grad_vector torch.cat([grad.view(-1) for grad in kl_grad])# KL距离的梯度先和向量进行点积运算kl_grad_vector_product torch.dot(kl_grad_vector, vector)grad2 torch.autograd.grad(kl_grad_vector_product, self.actor.parameters())grad2_vector torch.cat([grad.view(-1) for grad in grad2])return grad2_vector# 共轭梯度法求解方程def conjugate_gradient(self, grad, states, old_action_dists):x torch.zeros_like(grad)r grad.clone()p grad.clone()rdotr torch.dot(r, r)for i in range(10): # 共轭梯度主循环Hp self.hessian_matrix_vector_product(states, old_action_dists,p)alpha rdotr / torch.dot(p, Hp)x alpha * pr - alpha * Hpnew_rdotr torch.dot(r, r)if new_rdotr 1e-10:breakbeta new_rdotr / rdotrp r beta * prdotr new_rdotrreturn xdef line_search(self, states, actions, advantage, old_log_probs,old_action_dists, max_vec): # 线性搜索old_para torch.nn.utils.convert_parameters.parameters_to_vector(self.actor.parameters())old_obj self.compute_surrogate_obj(states, actions, advantage,old_log_probs, self.actor)for i in range(15): # 线性搜索主循环coef self.alpha**inew_para old_para coef * max_vecnew_actor copy.deepcopy(self.actor)torch.nn.utils.convert_parameters.vector_to_parameters(new_para, new_actor.parameters())new_action_dists torch.distributions.Categorical(new_actor(states))kl_div torch.mean(torch.distributions.kl.kl_divergence(old_action_dists,new_action_dists))new_obj self.compute_surrogate_obj(states, actions, advantage,old_log_probs, new_actor)if new_obj old_obj and kl_div self.kl_constraint:return new_parareturn old_paradef update(self, transition_dict):states torch.tensor(np.array(transition_dict[states]), dtypetorch.float).to(self.device)actions torch.tensor(transition_dict[actions]).view(-1, 1).to(self.device)rewards torch.tensor(transition_dict[rewards], dtypetorch.float).view(-1, 1).to(self.device)next_states torch.tensor(np.array(transition_dict[next_states]), dtypetorch.float).to(self.device)terminateds torch.tensor(transition_dict[terminateds], dtypetorch.float).view(-1, 1).to(self.device)truncateds torch.tensor(transition_dict[truncateds], dtypetorch.float).view(-1, 1).to(self.device)td_target rewards self.gamma * self.critic(next_states) * (1 - terminateds truncateds)td_delta td_target - self.critic(states)# estimate advantage using GAEadvantage self.cal_advantage(self.gamma, self.lambda_, td_delta.cpu()).to(self.device)old_log_probs torch.log(self.actor(states).gather(1, actions)).detach()old_action_dists torch.distributions.Categorical(self.actor(states).detach())# update Value functioncritic_loss torch.mean(F.mse_loss(self.critic(states), td_target.detach()))self.critic_optimizer.zero_grad()critic_loss.backward()self.critic_optimizer.step()# estimate policy gradient# compute surrogate object functionsurrogate_obj self.compute_surrogate_obj(states, actions, advantage, old_log_probs, self.actor)grads torch.autograd.grad(surrogate_obj, self.actor.parameters())obj_grad torch.cat([grad.view(-1) for grad in grads]).detach()# use the conjugate gradient algorithm to compute x H^(-1)gdescent_direction self.conjugate_gradient(obj_grad, states, old_action_dists)Hd self.hessian_matrix_vector_product(states, old_action_dists,descent_direction)max_coef torch.sqrt(2 * self.kl_constraint / (torch.dot(descent_direction, Hd) 1e-8))# update the policy by backtracking line searchnew_para self.line_search(states, actions, advantage, old_log_probs,old_action_dists,descent_direction * max_coef)# 用线性搜索后的参数更新策略torch.nn.utils.convert_parameters.vector_to_parameters(new_para, self.actor.parameters())def TRPOrun(self):returnList []for i in range(10):with tqdm(totalint(self.numOfEpisodes / 10), descIteration %d % i) as pbar:for episode in range(int(self.numOfEpisodes / 10)):# initialize statestate, info self.env.reset()terminated Falsetruncated FalseepisodeReward 0transition_dict {states: [], actions: [], next_states: [], rewards: [], terminateds: [], truncateds: []}# Loop for each step of episode:while 1:action self.take_action(state)next_state, reward, terminated, truncated, info self.env.step(action)transition_dict[states].append(state)transition_dict[actions].append(action)transition_dict[next_states].append(next_state)transition_dict[rewards].append(reward)transition_dict[terminateds].append(terminated)transition_dict[truncateds].append(truncated)state next_stateepisodeReward rewardif terminated or truncated:breakself.update(transition_dict)returnList.append(episodeReward)if (episode 1) % 10 0: # 每10条序列打印一下这10条序列的平均回报pbar.set_postfix({episode:%d % (self.numOfEpisodes / 10 * i episode 1),return:%.3f % np.mean(returnList[-10:])})pbar.update(1)return returnList从结果中可以看到TRPO收敛速度相对较快,性能优异。
超参数参考 agent TRPO(state_dimenv.observation_space.shape[0],hidden_dim256,action_dim2,lambda_0.95,gamma0.99,alpha0.5,kl_constraint0.0005,critic_lr1e-2,devicedevice,numOfEpisodes1000,envenv)连续动作空间
下面代码仅列出了需要修改部分的代码。
class PolicyNetContinuous(torch.nn.Module):def __init__(self, state_dim, hidden_dim, action_dim):super(PolicyNetContinuous, self).__init__()self.fc1 torch.nn.Linear(state_dim, hidden_dim)self.fc_mu torch.nn.Linear(hidden_dim, action_dim)self.fc_std torch.nn.Linear(hidden_dim, action_dim)# 高斯分布的均值和标准差def forward(self, x):x F.relu(self.fc1(x))mu 2.0 * torch.tanh(self.fc_mu(x))std F.softplus(self.fc_std(x))return mu, stdclass TRPO:def __init__(self, state_dim, hidden_dim, action_dim, lambda_, gamma, alpha,kl_constraint, critic_lr, device, numOfEpisodes, env):self.actor PolicyNetContinuous(state_dim, hidden_dim, action_dim).to(device)self.critic ValueNet(state_dim, hidden_dim).to(device)# 策略网络参数不需要优化器更新self.critic_optimizer torch.optim.Adam(self.critic.parameters(), lrcritic_lr)self.gamma gamma# GAE参数self.lambda_ lambda_# KL距离最大限制self.kl_constraint kl_constraint# 线性搜索参数self.alpha alphaself.device deviceself.env envself.numOfEpisodes numOfEpisodesdef take_action(self, state):state torch.tensor(np.array([state]), dtypetorch.float).to(self.device)mu, std self.actor(state)action_dist torch.distributions.Normal(mu, std)action action_dist.sample()return [action.item()]# compute surrogate object functiondef compute_surrogate_obj(self, states, actions, advantage, old_log_probs, actor):mu, std actor(states)action_dists torch.distributions.Normal(mu, std)log_probs action_dists.log_prob(actions)ratio torch.exp(log_probs - old_log_probs)return torch.mean(ratio * advantage)def hessian_matrix_vector_product(self, states, old_action_dists, vector, damping0.1):# 计算黑塞矩阵和一个向量的乘积mu, std self.actor(states)new_action_dists torch.distributions.Normal(mu, std)kl torch.mean(torch.distributions.kl.kl_divergence(old_action_dists, new_action_dists)) # 计算平均KL距离kl_grad torch.autograd.grad(kl, self.actor.parameters(), create_graphTrue)kl_grad_vector torch.cat([grad.view(-1) for grad in kl_grad])# KL距离的梯度先和向量进行点积运算kl_grad_vector_product torch.dot(kl_grad_vector, vector)grad2 torch.autograd.grad(kl_grad_vector_product, self.actor.parameters())grad2_vector torch.cat([grad.view(-1) for grad in grad2])return grad2_vector damping * vectordef line_search(self, states, actions, advantage, old_log_probs,old_action_dists, max_vec):old_para torch.nn.utils.convert_parameters.parameters_to_vector(self.actor.parameters())old_obj self.compute_surrogate_obj(states, actions, advantage,old_log_probs, self.actor)for i in range(15):coef self.alpha**inew_para old_para coef * max_vecnew_actor copy.deepcopy(self.actor)torch.nn.utils.convert_parameters.vector_to_parameters(new_para, new_actor.parameters())mu, std new_actor(states)new_action_dists torch.distributions.Normal(mu, std)kl_div torch.mean(torch.distributions.kl.kl_divergence(old_action_dists,new_action_dists))new_obj self.compute_surrogate_obj(states, actions, advantage,old_log_probs, new_actor)if new_obj old_obj and kl_div self.kl_constraint:return new_parareturn old_paradef update(self, transition_dict):states torch.tensor(np.array(transition_dict[states]), dtypetorch.float).to(self.device)actions torch.tensor(transition_dict[actions]).view(-1, 1).to(self.device)rewards torch.tensor(transition_dict[rewards], dtypetorch.float).view(-1, 1).to(self.device)next_states torch.tensor(np.array(transition_dict[next_states]), dtypetorch.float).to(self.device)terminateds torch.tensor(transition_dict[terminateds], dtypetorch.float).view(-1, 1).to(self.device)truncateds torch.tensor(transition_dict[truncateds], dtypetorch.float).view(-1, 1).to(self.device)rewards (rewards 8.0) / 8.0 # 对奖励进行修改,方便训练td_target rewards self.gamma * self.critic(next_states) * (1 - terminateds truncateds)td_delta td_target - self.critic(states)# estimate advantage using GAEadvantage self.cal_advantage(self.gamma, self.lambda_, td_delta.cpu()).to(self.device)mu, std self.actor(states)old_action_dists torch.distributions.Normal(mu.detach(), std.detach())old_log_probs old_action_dists.log_prob(actions)# update Value functioncritic_loss torch.mean(F.mse_loss(self.critic(states), td_target.detach()))self.critic_optimizer.zero_grad()critic_loss.backward()self.critic_optimizer.step()# estimate policy gradient# compute surrogate object functionsurrogate_obj self.compute_surrogate_obj(states, actions, advantage, old_log_probs, self.actor)grads torch.autograd.grad(surrogate_obj, self.actor.parameters())obj_grad torch.cat([grad.view(-1) for grad in grads]).detach()# use the conjugate gradient algorithm to compute x H^(-1)gdescent_direction self.conjugate_gradient(obj_grad, states, old_action_dists)Hd self.hessian_matrix_vector_product(states, old_action_dists,descent_direction)max_coef torch.sqrt(2 * self.kl_constraint / (torch.dot(descent_direction, Hd) 1e-8))# update the policy by backtracking line searchnew_para self.line_search(states, actions, advantage, old_log_probs,old_action_dists,descent_direction * max_coef)# 用线性搜索后的参数更新策略torch.nn.utils.convert_parameters.vector_to_parameters(new_para, self.actor.parameters())用 TRPO 在与连续动作交互的倒立摆环境中能够取得非常不错的效果这说明 TRPO 中的信任区域优化方法在离散和连续动作空间都能有效工作。
超参数参考 agent TRPO(state_dimenv.observation_space.shape[0],hidden_dim256,action_dimenv.action_space.shape[0],lambda_0.85,gamma0.90,alpha0.5,kl_constraint0.0005,critic_lr1e-2,devicedevice,numOfEpisodes2000,envenv)参考
[1] 伯禹AI [2] https://www.davidsilver.uk/teaching/ [3] 动手学强化学习 [4] Reinforcement Learning [5] SCHULMAN J, LEVINE S, ABBEEL P, et al. Trust region policy optimization [C]// International conference on machine learning, PMLR, 2015:1889-1897. [6] Schulman J. Optimizing expectations: From deep reinforcement learning to stochastic computation graphs[D]. UC Berkeley, 2016. [7] https://spinningup.openai.com/en/latest/algorithms/trpo.html