Part 1: Basic Concepts
Concepts
Reinforcement Learning
Reinforcement learning (RL) is concerned with how an agent learns, through continual interaction with its environment, to accomplish a specific goal.
Compared with supervised learning, we do not give the agent data together with the corresponding labels and fit a model; instead the agent has to learn from repeated trials in the environment (figuring out, in effect, which situations call for which responses) and from these regularities derive a policy.
In reinforcement learning, the agent observes the current state, takes an action, transitions to the next state, and receives a reward. By repeating this process it learns a policy (a mapping from states to actions): which action to take in the current state so that the return it eventually collects is maximized (not just the immediate reward of the current state; the long-term effect of a policy is what matters).
Interacting Parties
Agent: perceives the state of the external environment and the reward it feeds back, and performs learning and decision making. The agent's decision-making function chooses actions according to the state of the environment, and its learning function adjusts the policy according to the rewards the environment returns.
Environment: everything outside the agent; its state changes under the influence of the agent's actions, and it feeds back the corresponding reward to the agent.
Basic Elements
State: a description of the environment, denoted $s$
Action: a description of the agent's behavior, denoted $a$
Reward: after the agent takes action $a$, the environment moves to a new state $s'$ and emits a reward $r$ that evaluates how good the action was at that moment. The reward lets the agent correct its choice of action in the same state so that it adapts better to the environment; reward design determines whether the game is fair and whether the agent can solve it at all
Policy: a probability distribution giving the probability of each action, denoted $\pi$
Return: the cumulative reward from some state onward, i.e. the return at time $t$ is the reward at the current step plus the sum of subsequent rewards, with later rewards contributing less to the current return; this is modeled by weighting the rewards after time $t$ with a discount factor $\gamma$
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^N \gamma^k R_{t+k}$$
State-value function:
the expected return obtained by starting from state $s$ and following policy $\pi$, i.e.
$$V^\pi(s) = E_\pi[G_t|S_t=s]$$
which satisfies the Bellman equation
$$\begin{aligned}
V^{\pi}(s) &= E_\pi[G_t|S_t=s] \\
&= E_\pi[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots | S_t=s] \\
&= E_\pi[R_t + \gamma (R_{t+1} + \gamma R_{t+2} + \cdots) | S_t=s] \\
&= E_\pi[R_t + \gamma G_{t+1} | S_t=s] \\
&= E_\pi[R_t + \gamma V^{\pi}(S_{t+1}) | S_t=s]
\end{aligned}$$
Action-value function: the expected return obtained by taking action $a$ in the current state $s$ and then following policy $\pi$, i.e.
$$Q^\pi(s, a) = E_\pi[G_t|S_t=s, A_t=a]$$
The Q in the Q-function is usually read as "quality"; it is the state-action value function.
Conditioning on the action, we have
$$V^\pi(s) = E_{a \sim P(A_t=a|S_t=s)} Q^\pi(s, a)$$
The action value consists of the immediate reward $r(s, a)$ plus the expected value of the next state. Writing $P(s'|s, a)$ for the probability of transitioning from state $s$ to state $s'$ under action $a$, we have
$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^\pi(s')$$
The action-value function can be used to select the highest-value action at time $t$, i.e.
$$a^* = \argmax_a Q(s, a)$$
Advantage function: measures how much better or worse action $a$ is than the average at state $s$
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
TD error: within one episode we only observe a partial state sequence. By the Bellman equation $V^{\pi}(s)=E_\pi[R_t + \gamma V^{\pi}(S_{t+1}) | S_t=s]$, the TD target $R_t + \gamma V^{\pi}(S_{t+1})$ can be used in place of $G_t$, and the TD error is defined as
$$\delta(t) = R_t + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_{t})$$
Suppose we observe the following two trajectories:
$S_0^{(1)} \xrightarrow{A_0^{(1)}} S_1^{(1)} \xrightarrow{A_1^{(1)}} S_2^{(1)} \xrightarrow{A_2^{(1)}} S_3^{(1)}$, win
$S_0^{(2)} \xrightarrow{A_0^{(2)}} S_1^{(2)} \xrightarrow{A_1^{(2)}} S_2^{(2)}$, lose
There are 2 trajectories in total, and state $S_1$ transitions to two different next states, so each transition probability is 0.5. Under the Markov assumption, with discount factor $\gamma=0.9$, the state value of $S_1$ is $V^\pi(S_1)=0.5 \times (R_1^{(1)} + 0.9 R_2^{(1)} + 0.9^2 R_3^{(1)}) + 0.5 \times (R_1^{(2)} + 0.9 R_2^{(2)})$. With $R_1^{(1)} = R_2^{(1)} = R_3^{(1)} = 1$ in the winning trajectory and $R_1^{(2)} = R_2^{(2)} = 0$ in the losing one, this gives $V^\pi(S_1)=1.355$.
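The arithmetic can be checked in a couple of lines; this is only a minimal sketch of the example above, using the reward values and $\gamma$ assumed in the text.

```python
# Numeric check of V(S_1) for the two observed trajectories above.
gamma = 0.9
win_rewards = [1, 1, 1]   # R_1^(1), R_2^(1), R_3^(1)
lose_rewards = [0, 0]     # R_1^(2), R_2^(2)

g_win = sum(gamma ** k * r for k, r in enumerate(win_rewards))    # 1 + 0.9 + 0.81 = 2.71
g_lose = sum(gamma ** k * r for k, r in enumerate(lose_rewards))  # 0.0

v_s1 = 0.5 * g_win + 0.5 * g_lose  # each branch observed with probability 0.5
print(v_s1)                        # 1.355
```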
Classification
value-based & policy-based
value-based: learn $Q(s, a)$ during training; at test time, given $s$, pick the action $a$ with the largest Q value, e.g. Q-Learning, SARSA, DQN
policy-based: learn $p(s, a)$ during training; at test time, given $s$, obtain the probability of each action $a$ and pick the most probable one, e.g. policy gradient
The two can also be combined, as in actor-critic
on-policy & off-policy
on-policy: the behavior policy and the evaluation policy are the same, i.e. the agent being learned and the agent that interacts with the environment during training are one and the same, e.g. SARSA
off-policy: the behavior policy and the evaluation policy differ, i.e. the agent being learned is not the agent that actually interacts with the environment during training, e.g. Q-Learning
model-based & model-free
The key difference between model-based and model-free methods is that the former builds a model of the environment. "Model" here means a model trained by supervised learning on the algorithm's actual interaction data $(s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, \cdots)$, which predicts the next state $s_{t+1}$ given $s_t$ and $a_t$.
model-based: solves the RL problem using an environment model (the environment dynamics, i.e. expected rewards and state-transition probabilities) plus planning (considering possible future situations before actually experiencing them, in order to decide on actions in advance).
model-free: solves the RL problem by learning (direct trial and error) from experience (the states, actions and rewards sampled while interacting with the environment).
If the agent can predict the next state and reward before executing its action, the method is model-based (much as a human with a rough estimate of how the environment will change can plan a better action); if it cannot, the method is model-free.
offline reinforcement learning
Offline RL: learning from a large amount of previously collected data, with no interactive environment involved.
Part 2: From Q-Learning to DQN
Q-Learning
Q-Learning builds a Q-table over the states experienced and the actions taken, and updates it with the reward obtained in each round of learning. The Q-table has states as rows and actions as columns, and stores Q values. The question is how to compute the Q values in the table.

| state \ action | $a_0$ | $a_1$ | $a_2$ | $\cdots$ |
| --- | --- | --- | --- | --- |
| $s_0$ | | | | |
| $s_1$ | | | | |
| $s_2$ | | | | |
| $\cdots$ | | | | |

The pseudocode is
```
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
        s ← s'
    until s is terminal
```
Here ε-greedy means that in the initial phase, random exploration of the environment is usually better than a fixed behavior pattern: this is the stage of accumulating experience, so we do not want the explorer to be too greedy. The value ε controls the degree of greediness (choose the best action with probability ε, explore randomly with probability 1 − ε), and ε can be increased as exploration proceeds (becoming greedier over time), i.e.
$$a = \begin{cases}
\argmax_{a' \in A} Q(s, a') & p < \epsilon \\
\text{random } a' \in A & \text{otherwise}
\end{cases}$$
Unrolled over time steps (see the figure), note that at time $t$ the quadruple $(s, a, s', r)$ consists entirely of known quantities.
The parameter update rule is as follows, where $\alpha$ is the learning rate
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[
\underline{r + \gamma \max_{a'} Q(s', a')} - Q(s, a)
\right]$$
Here $r + \gamma \max_{a'} Q(s', a')$ can be regarded as the "true" value of $Q(s, a)$, and the table entry is corrected step by step according to its deviation from the predicted $Q(s, a)$; $\max_{a'} Q(s', a')$ is the largest Q value attainable in the next state $s'$ over all available actions $a' \in A$.
The Q-Learning example below has the agent explore a one-dimensional world of length `N_STATES`; with `N_STATES=6` the world is rendered as `-----T`. The agent starts at the far left, i.e. `o----T`, and explores until it finds a route to the terminal `T`. The Q-table is set up as

| position (s) \ direction (a) | left | right |
| --- | --- | --- |
| 0 | | |
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | | |
| 5 (`T`) | | |

Q-Learning example: the agent explores a one-dimensional world of length `N_STATES`
```python
import numpy as np
import pandas as pd
import time

np.random.seed(42)

N_STATES = 6                  # length of the 1-D world
ACTIONS = ['left', 'right']   # available actions
EPSILON = 0.9                 # greediness of the ε-greedy policy
ALPHA = 0.1                   # learning rate
GAMMA = 0.9                   # discount factor
MAX_EPISODES = 13             # number of episodes
FRESH_TIME = 0.3              # seconds between two moves when rendering


def build_q_table(n_states, actions):
    """Create the Q-table; Q(s, a) is the value of taking action a at position s."""
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),
        columns=actions,
    )
    return table
    # initial table:
    #    left  right
    # 0   0.0    0.0
    # 1   0.0    0.0
    # 2   0.0    0.0
    # 3   0.0    0.0
    # 4   0.0    0.0
    # 5   0.0    0.0


def choose_action(state, q_table):
    """ε-greedy choice of the action a at the current state s:
    greedy with 90% probability, random with 10% probability."""
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or (state_actions.any() == 0):
        action_name = np.random.choice(ACTIONS)
    else:
        action_name = state_actions.idxmax()
    return action_name


def get_env_feedback(S, A):
    """Take action a at position s and return the next state s' and reward r."""
    if A == 'right':
        if S == N_STATES - 2:
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:
        R = 0
        if S == 0:
            S_ = S
        else:
            S_ = S - 1
    return S_, R


def update_env(S, episode, step_counter):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode + 1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(1)
        print('\r' + ' ' * 32, end='')  # clear the rendered line
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r[{} - {}] {}'.format(episode, step_counter, interaction), end='')
        time.sleep(FRESH_TIME)


def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()
            else:
                q_target = R
                is_terminated = True
            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            step_counter += 1
            S = S_
            update_env(S, episode, step_counter)
    return q_table


if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)
```
SARSA
The name stands for State-Action-Reward-State'-Action'.
The pseudocode is
```
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using policy derived from Q (e.g. ε-greedy)
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
        s ← s'; a ← a'
    until s is terminal
```
The difference from Q-Learning is the update rule: in the next state $s'$, the action $a'$ is chosen with the same policy that is being learned
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[
\underline{r + \gamma Q(s', a')} - Q(s, a)
\right]$$
Difference from Q-Learning: Q-learning bootstraps from the action that would bring the highest value in $s'$, but when it actually makes its next decision it may not choose that action (off-policy: the behavior policy and the evaluation policy are not the same policy), whereas SARSA uses the Q value of the action $a'$ it actually takes in $s'$; it then, like Q-learning, computes the gap between the target and the estimate and updates the entry in the Q-table. The contrast is sketched in code below.
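The two update rules differ in a single line. The sketch below is purely illustrative: it assumes a Q-table stored as a dict of dicts, with `alpha` and `gamma` the learning rate and discount factor.

```python
# Illustrative contrast between the Q-Learning and SARSA updates.
def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    # off-policy: bootstrap from the greedy action in s',
    # regardless of which action will actually be taken next
    target = r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    # on-policy: bootstrap from the action a' actually chosen in s'
    # by the same ε-greedy behavior policy
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])
```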
DQN
When the state space $S$ or the action space $A$ is very large, enumerating $(s, a)$ to build a Q-table is impossible, so Q-Learning does not work in complex scenarios. To solve this, DQN fits the function $Q(s, a)$ with a neural network.
The pseudocode is as follows
```
Initialize replay memory D to capacity N                                        # experience replay
Initialize action-value function Q with random weights θ                        # Q-Function
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)                          # ε-greedy
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}  # environment reaction
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D                           # experience replay
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                   if episode terminates at step j+1
                  r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)   otherwise
        Perform a gradient descent step on L_j = (y_j - Q(φ_j, a_j; θ))² with respect to θ
        Every C steps reset Q̂ = Q                                                # fixed-q-target
    End For
End For
```
Here $a_t$ is likewise chosen ε-greedily, i.e.
$$a_t = \begin{cases}
\argmax_{a} Q(\phi(s_t), a; \theta) & p < \epsilon \\
\text{random } a \in A & \text{otherwise}
\end{cases}$$
Note that the loss is defined as
$$L_j = \left( y_j - Q(\phi_j, a_j; \theta) \right)^2$$
where
$$y_j = \begin{cases}
r_j & \text{if episode terminates at step } j + 1 \\
r_j + \gamma \max_{a'} \hat{Q}(\phi_{j + 1}, a'; \theta^-) & \text{otherwise}
\end{cases}$$
As the pseudocode shows, DQN makes three main contributions
Parameterizing the Q-table into a Q-function and fitting it with a neural network;
Experience replay:
collecting data in RL is very slow; if interaction data are cached, every step can update the parameters from a sampled batch;
consecutive RL samples are correlated, while deep-network training assumes i.i.d. data, so training directly on adjacent time steps is unstable; sampling from the replay buffer breaks these correlations;
when the capacity $N$ is exceeded, the oldest experience is dropped in queue (FIFO) order, which dynamically improves the quality of the training data.
Target network (fixed Q-target): training uses two networks, the evaluation network $Q$ and the target network $\hat{Q}$, which is another mechanism for decorrelating updates. Concretely, the two networks start with the same architecture and parameters; during training, the parameters $\theta$ of the evaluation network $Q$ are updated continuously by gradient descent, while the parameters $\theta^-$ of the target network $\hat{Q}$ are synchronized with $Q$ only every $C$ steps.
In fact, the DQN parameter update can be written as
$$\theta \leftarrow \theta + \alpha \left[
r_j + \gamma \max_{a'} \hat{Q}(\phi_{j + 1}, a'; \theta^-) - Q(\phi_j, a_j; \theta)
\right] \nabla Q(\phi_j, a_j; \theta)$$
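A compact training-step sketch tying the three ingredients together is given below. It is only an illustration under assumed names (`QNet`, `DQNAgent`, the layer sizes and hyperparameters are all made up here), not the original DQN implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)


class DQNAgent:
    def __init__(self, state_dim, n_actions, capacity=10000,
                 gamma=0.9, lr=1e-3, sync_every=100):
        self.q = QNet(state_dim, n_actions)          # evaluation network Q(.; θ)
        self.q_target = QNet(state_dim, n_actions)   # target network Q̂(.; θ⁻)
        self.q_target.load_state_dict(self.q.state_dict())
        self.buffer = deque(maxlen=capacity)         # experience replay, FIFO when full
        self.gamma, self.sync_every, self.steps = gamma, sync_every, 0
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)

    def store(self, s, a, r, s_next, done):
        # states are assumed to be flat lists/arrays of floats
        self.buffer.append((s, a, r, s_next, float(done)))

    def update(self, batch_size=32):
        if len(self.buffer) < batch_size:
            return
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = (torch.tensor(x, dtype=torch.float32)
                                 for x in zip(*batch))
        # y_j = r_j                             if terminal
        #       r_j + γ max_a' Q̂(s', a'; θ⁻)    otherwise
        with torch.no_grad():
            y = r + self.gamma * self.q_target(s_next).max(dim=1).values * (1 - done)
        q_sa = self.q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # fixed Q-target: every C steps copy θ into θ⁻
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.q_target.load_state_dict(self.q.state_dict())
```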
The three major DQN variants
Double DQN: an improved target-value estimate that mitigates overestimation
Because DQN is off-policy, each update uses not the action actually taken at the next interaction but the action currently believed to have the highest value, so the Q values tend to be too large (overestimation). An intuitive remedy is to add another model as a cross-check, and DQN already has the two networks $Q$ and $\hat{Q}$, with $\hat{Q}$ lagging behind $Q$, which can alleviate the problem considerably. Concretely, when computing $y_j$, $\max_{a'} \hat{Q}(\phi_{j + 1}, a'; \theta^-)$ is replaced by $\hat{Q}(\phi_{j + 1}, \underline{\argmax_{a'} Q(\phi_{j + 1}, a'; \theta)}; \theta^-)$
$$y_j = \begin{cases}
r_j & \text{if episode terminates at step } j + 1 \\
r_j + \gamma \hat{Q}(\phi_{j + 1}, \underline{\argmax_{a'} Q(\phi_{j + 1}, a'; \theta)}; \theta^-) & \text{otherwise}
\end{cases}$$
where $a_{j + 1} = \argmax_{a'} Q(\phi_{j + 1}, a'; \theta)$ is the action that the evaluation network $Q$ selects for state $\phi_{j + 1}$.
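Assuming the `q` / `q_target` networks and the batch tensors (`r`, `s_next`, `done`, `gamma`) from the DQN sketch above, the change amounts to which network selects and which evaluates the next action; a hedged illustration:

```python
# Illustrative targets only: in Double DQN, q selects a', q_target evaluates it.
with torch.no_grad():
    # vanilla DQN: the target network both selects and evaluates a'
    y_dqn = r + gamma * q_target(s_next).max(dim=1).values * (1 - done)
    # Double DQN: the online network selects a', the target network evaluates it
    a_star = q(s_next).argmax(dim=1, keepdim=True)
    y_double = r + gamma * q_target(s_next).gather(1, a_star).squeeze(1) * (1 - done)
```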
Dueling DQN: an improved network architecture
Dueling DQN changes the network structure, decomposing the action-value function into a state-value function $V$ and an advantage function $A$, i.e.
$$Q(\phi, a; \theta, \alpha, \beta) = V(\phi; \theta, \beta) + A(\phi, a; \theta, \alpha)$$
where $\alpha$ and $\beta$ are the parameters of two fully connected heads; $V$ depends only on the state $\phi$, while $A$ depends on both the state $\phi$ and the action $a$. However, $Q$ does not determine $V$ and $A$ uniquely, so the advantage estimator is forced to have zero advantage at the chosen action $a^*$, i.e.
$$Q(\phi, a; \theta, \alpha, \beta) = V(\phi; \theta, \beta) + \left(
A(\phi, a; \theta, \alpha) - \max_{a'} A(\phi, a'; \theta, \alpha)
\right)$$
In this way, for the maximizing action $a^* \in \mathcal{A}$ we have
$$a^* = \argmax_{a' \in \mathcal{A}} Q(\phi, a'; \theta, \alpha, \beta) = \argmax_{a' \in \mathcal{A}} A(\phi, a'; \theta, \alpha)$$
and therefore
$$Q(\phi, a^*; \theta, \alpha, \beta) = V(\phi; \theta, \beta)$$
Finally, the authors replace the max with a mean, i.e.
$$Q(\phi, a; \theta, \alpha, \beta) = V(\phi; \theta, \beta) + \left(
A(\phi, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(\phi, a'; \theta, \alpha)
\right)$$
Although $V$ and $A$ then no longer represent the value function and the advantage function exactly (in the semantic sense), this change improves stability, and it does not alter what $V$ and $A$ essentially stand for.
The state-value function $V(\phi; \theta, \beta)$ is the sum over all possible actions $a$ of the action-value function weighted by the probability of taking that action, i.e. the expected value of the state. The advantage $Q(\phi, a; \theta, \alpha, \beta) - V(\phi; \theta, \beta)$ evaluates the current action value relative to that average: the "advantage" is how much better the action value $Q$ is than the state value $V$ of the current state; if $Q - V > 0$, action $a$ is better than the average action. A sketch of such a network head follows.
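A minimal sketch of a dueling head, using the mean-subtraction aggregation from the last equation; the class name, hidden width and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(φ, a) = V(φ) + (A(φ, a) - mean_a' A(φ, a'))."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)         # V(φ; θ, β)
        self.adv_head = nn.Linear(hidden, n_actions)   # A(φ, a; θ, α)

    def forward(self, x):
        h = self.feature(x)
        v = self.value_head(h)                         # shape (batch, 1)
        a = self.adv_head(h)                           # shape (batch, |A|)
        return v + a - a.mean(dim=1, keepdim=True)     # shape (batch, |A|)
```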
Prioritized replay buffer: an improved training procedure
In vanilla DQN's replay memory, the batches used for training are drawn at random, with no notion of sample priority. But samples are not equally valuable, so we assign each sample a priority and sample according to it.
How is the priority determined? We can use the TD error, i.e. q_target − q_eval, to decide how urgently a sample needs to be learned: the larger the TD error, the more room there is to improve the prediction on that sample, so the more it needs to be learned and the higher its priority p.
With the TD error giving the priority p, how do we sample efficiently according to p? Sorting all samples by p for every draw would be computationally prohibitive, so the paper proposes a structure called a SumTree.
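Below is a minimal SumTree sketch for priority-proportional sampling (an illustration, not the paper's exact implementation): leaves store priorities, internal nodes store the sum of their children, so both updating a priority and drawing a sample cost O(log n).

```python
import numpy as np

class SumTree:
    """Array-based sum tree: leaves hold sample priorities, internal nodes hold
    the sum of their children, enabling O(log n) proportional sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.data = [None] * capacity           # the stored transitions
        self.write = 0                          # next leaf to overwrite (FIFO)

    def total(self):
        return self.tree[0]                     # root = sum of all priorities

    def add(self, priority, transition):
        leaf = self.write + self.capacity - 1   # index of the leaf in the tree array
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                        # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def sample(self, s):
        """Descend from the root with a value s drawn uniformly from [0, total())."""
        idx = 0
        while True:
            left, right = 2 * idx + 1, 2 * idx + 2
            if left >= len(self.tree):          # reached a leaf
                break
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = right
        return self.tree[idx], self.data[idx - self.capacity + 1]
```

To draw a minibatch, one typically splits $[0, \text{total})$ into equal segments and samples one value uniformly from each segment, which is the stratified scheme used with prioritized replay.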
Part 3: From Policy Gradient to TRPO/PPO/PPO2
What is the difference between policy-based and value-based RL methods?
Author: 郝伟
Link: https://www.zhihu.com/question/542423465/answer/2566685921
Source: 知乎 (Zhihu)
The copyright belongs to the author. For commercial reuse please contact the author for permission; for non-commercial reuse please credit the source.
For a Markov decision process whose state-transition probabilities are known, dynamic programming can be used to solve it. In terms of how decisions are made, RL methods can be divided into policy-based and value-based methods. The decision rule is what the agent uses to choose an action from the action set in a given state; it is static and does not change from state to state. In policy-based methods the agent maintains an explicit action policy (which action to take in a given state) and acts accordingly; the algorithm optimizes the policy directly so that it collects the maximum reward. In value-based methods the agent needs no explicit policy: it maintains a value table or value function and picks the action with the highest value. Value-iteration-style methods apply only to discrete, non-continuous settings (such as Go or certain games); for action sets that are huge or continuous (such as robot control) they struggle to learn good behavior, whereas policy-iteration-style methods can choose continuous actions according to the learned policy. Value-based algorithms include Q-learning and Sarsa; policy-based algorithms include the Policy Gradient (PG) algorithm. In addition, actor-critic algorithms use both a policy and a value estimate to make decisions: the agent acts according to the policy while the value function scores the chosen actions, which speeds up learning on top of the plain policy gradient algorithm and achieves better results.
Policy Gradient
The core idea is to directly optimize a policy network $a = \pi(a | s; \theta)$: given an input state $s$ it outputs a probability for each action, and the action $a$ is sampled from that distribution. How, then, should the network be trained so that it converges? In RL the only way to judge an action is through the reward: if an action yields a larger reward, increase its probability, otherwise decrease it. This is the basic idea of policy gradient.
Given the policy network $\pi(a | s; \theta)$, one episode of interaction with the environment (from the start of the game to its end) produces a trajectory $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T\}$, where each $a_t$ is sampled according to $\pi(a_t | s_t; \theta)$ and is therefore stochastic. The total reward of the episode is $R_{\theta}(\tau) = \sum_t r_t$; let $P_{\theta}(\tau)$ denote the probability of generating this trajectory, and let $\Tau$ be the set of trajectories collected over multiple episodes. The expected total reward $\overline{R}_{\theta}$ is then
$$\overline{R}_{\theta} = \sum_\tau R_{\theta}(\tau) P_{\theta}(\tau)$$
The overall training objective is to maximize the expected total reward, i.e.
$$\theta^* = \argmax_{\theta} \overline{R}_{\theta}$$
which can be optimized by gradient ascent on $\overline{R}_{\theta}$:
$$\begin{aligned}
\nabla \overline{R}_{\theta} &= \sum_\tau R_{\theta}(\tau) \cdot \nabla P_{\theta}(\tau) \\
&= \sum_\tau R_{\theta}(\tau) \cdot P_{\theta}(\tau) \cdot \nabla \log P_{\theta}(\tau) \\
&= E_{\tau \sim P_{\theta}(\tau)} R_{\theta}(\tau) \cdot \nabla \log P_{\theta}(\tau) \\
&\approx \frac{1}{|\Tau|} \sum_{\tau \in \Tau} R_{\theta}(\tau) \cdot \nabla \log P_{\theta}(\tau)
\end{aligned}$$
Note: $\nabla f(x) = f(x) \cdot \frac{\nabla f(x)}{f(x)} = f(x) \cdot \nabla \log f(x)$
Moreover,
$$\begin{aligned}
P_{\theta}(\tau) &= P(s_1) \cdot P(a_1|s_1) P(s_2|s_1, a_1) \cdot P(a_2|s_2) P(s_3|s_2, a_2) \cdots \\
&= P(s_1) \prod_{t} P(a_t|s_t) P(s_{t+1}|s_t, a_t)
\end{aligned}$$
so
$$\log P_{\theta}(\tau) = \underline{\log P(s_1)} + \sum_t \left[ \log P(a_t|s_t) + \underline{\log P(s_{t+1}|s_t, a_t)} \right]$$
The underlined terms do not depend on $\theta$, so
$$\nabla \log P_{\theta}(\tau) = \sum_t \nabla \log P(a_t|s_t)$$
Substituting into $\nabla \overline{R}_{\theta}$ gives
$$\begin{aligned}
\nabla \overline{R}_{\theta}
\approx \frac{1}{|\Tau|} \sum_{\tau \in \Tau} R_{\theta}(\tau) \cdot \underline{\sum_t \nabla \log \pi(a_t|s_t; \theta)}
\approx \frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} r_t \cdot \nabla \log \pi(a_t|s_t; \theta)
\end{aligned}$$
Hence
$$\begin{cases}
\nabla \overline{R}_{\theta} &\approx \frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} r_t \cdot \nabla \log \pi(a_t|s_t; \theta) \\
\theta &\leftarrow \theta + \eta \nabla \overline{R}_{\theta}
\end{cases}$$
Note: is this not similar in form to the cross-entropy loss $L = \frac{1}{|D|} \sum_{(x, y) \in D} \sum_c y_c \log p_c(x)$?
Improvement 1: add a reward baseline $b$, i.e. only treat a step as good if its reward exceeds $b$; this prevents the agent from locking onto a few high-reward actions early in training and neglecting the exploration of low-reward actions
$$\nabla \overline{R}_{\theta} \approx \frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \underline{(r_t - b)} \cdot \nabla \log \pi(a_t|s_t; \theta)$$
Improvement 2: in the expression above, every pair $(s_t, a_t)$ at every time step $t$ is weighted by the same end-of-episode quantity $(r_t - b)$, which is unreasonable. Instead, use the accumulated reward from $t$ until the end of the episode as the weight at time $t$, and add a discount factor $0 < \gamma < 1$ so that, as time goes on and more steps accumulate, earlier steps have less and less influence on steps much later, i.e.
$$r_t \rightarrow \sum_{t' \ge t} r_{t'} \rightarrow \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$$
$$\nabla \overline{R}_{\theta} \approx \frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} (\underline{\sum_{t' \ge t} \gamma^{t'-t} r_{t'} - b}) \cdot \nabla \log \pi(a_t|s_t; \theta)$$
The underlined part is defined as the advantage function, i.e.
$$A(s_t, a_t; \theta) = \sum_{t' \ge t} \gamma^{t'-t} r_{t'} - b$$
The final optimization objective is then
$$\theta^* = \argmax_{\theta} \frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} A(s_t, a_t; \theta) \cdot \log \pi(a_t|s_t; \theta)$$
The advantage function can also be parameterized, e.g. by defining a value function $V(s; \phi)$ to estimate the reward (the Critic in the actor-critic framework), optimized by
$$\phi^* = \argmin_{\phi} \frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} (V(s_t; \phi) - r_t)^2$$
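To make the two objectives concrete, here is a small sketch of the discounted reward-to-go and the resulting losses; the function names and the 0.5 weight on the value loss are assumptions made for illustration, not part of the text above.

```python
import torch

def rewards_to_go(rewards, gamma=0.99):
    """Compute sum_{t' >= t} gamma^(t'-t) * r_{t'} for every t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return torch.tensor(returns)

def pg_loss(log_probs, rewards, values, gamma=0.99):
    """log_probs: log π(a_t|s_t; θ); values: V(s_t; φ) used as the baseline b."""
    returns = rewards_to_go(rewards, gamma)
    advantages = returns - values.detach()           # A(s_t, a_t) = reward-to-go - b
    policy_loss = -(advantages * log_probs).mean()   # maximize A·log π == minimize its negative
    value_loss = ((values - returns) ** 2).mean()    # fit the critic V(s; φ) to the returns
    return policy_loss + 0.5 * value_loss
```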
Comparison of several PG variants:
$$\nabla \overline{R}_{\theta} \approx \begin{cases}
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot r_t & \text{REINFORCE} \\
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot Q(s_t, a_t; \theta) & \text{Q Actor-Critic} \\
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot A(s_t, a_t; \theta) & \text{Advantage Actor-Critic} \\
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot \delta & \text{TD Actor-Critic} \\
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot \delta e & \text{TD(}\lambda\text{) Actor-Critic}
\end{cases}$$
Advantages:
better convergence properties
effective in high-dimensional or continuous action spaces
can learn stochastic policies
does not suffer from policy degradation
Disadvantages:
converges to a fixed point, which is often only a local optimum
policy evaluation tends to be inefficient and high-variance
poor data efficiency and robustness.
A Policy Gradient example, in which the agent keeps a pole upright by moving a cart left and right.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 import osimport gymimport numpy as npfrom copy import deepcopyfrom collections import dequeimport torchimport torch.nn as nnimport torch.nn.functional as Ffrom torch.distributions import Categoricalenv = gym.make('CartPole-v1' ) env = env.unwrapped state_number = env.observation_space.shape[0 ] action_number = env.action_space.n device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu" ) class Net (nn.Module): def __init__ (self ): super ().__init__() self .layers = nn.Sequential( nn.Linear(state_number, 32 ), nn.ReLU(inplace=True ), nn.Linear(32 , 32 ), nn.ReLU(inplace=True ), nn.Linear(32 , action_number), nn.Softmax(dim=-1 ), ) def forward (self, state ): pi = self .layers(state) return pi class PG (): def __init__ ( self, gamma=0.9 , lr=5e-4 , weight_decay=0.0 , ): self .gamma = gamma self .buffer = [] self .model = Net() self .model.to(device) self .optimizer = torch.optim.Adam(self .model.parameters(), lr=lr, weight_decay=weight_decay) @torch.no_grad() def choose_action (self, state ): state = torch.from_numpy(state).float ().unsqueeze(0 ).to(device) pi = self .model(state) dist = torch.distributions.Categorical(pi) action = dist.sample().item() return action def store_experience (self, experience ): self .buffer.append(experience) def update (self ): get_tensor = lambda x: torch.tensor([b[x] for b in self .buffer]).to(device) states = get_tensor(0 ).float () actions = get_tensor(1 ).long() rewards = get_tensor(2 ).float () next_states = get_tensor(3 ).float () done = get_tensor(4 ).long() for t in reversed (range (0 , rewards.size(0 ) - 1 )): rewards[t] = rewards[t] + self .gamma * rewards[t + 1 ] rewards = (rewards - rewards.mean()) / rewards.std() pi = self .model(states) log_prob = torch.sum (pi.log() * F.one_hot(actions), dim=1 ) loss = - (log_prob * rewards).mean() self .optimizer.zero_grad() loss.backward() self .optimizer.step() del self .buffer[:] return loss.item() def train (agent, num_episodes=5000 , render=False ): step = 0 for i in range (num_episodes): total_rewards = 0 done = False state, _ = env.reset() while not done: step += 1 if render: env.render() action = agent.choose_action(state) next_state, reward, done, truncated, info = env.step(action) x, x_dot, theta, theta_dot = next_state r1 = (env.x_threshold - abs (x)) / env.x_threshold - 0.8 r2 = (env.theta_threshold_radians - abs (theta)) / env.theta_threshold_radians - 0.5 r3 = 3 * r1 + r2 agent.store_experience((state, action, r3, next_state, done)) state = next_state total_rewards += reward loss = agent.update() if i % 50 == 0 : print ('episode:{} reward:{}' .format (i, total_rewards)) def test (agent, num_episodes=10 , render=False ): env = gym.make('CartPole-v1' , render_mode="human" if render else None ) step = 0 eval_rewards = [] for i in range (num_episodes): total_rewards = 0 done = False state, _ = env.reset() while not done: step += 1 if render: env.render() action = agent.choose_action(state) next_state, reward, done, truncated, info = env.step(action) state = next_state total_rewards += reward eval_rewards.append(total_rewards) return sum 
(eval_rewards) / len (eval_rewards) if __name__ == "__main__" : agent = PG() train(agent, render=False ) test(agent, render=True )
TRPO
The goal of RL is to maximize the long-term expected discounted reward, i.e.
$$\theta^* = \argmax_\theta \sum_t \gamma^t R^{\theta}_t = \argmax_\theta G^{\theta}(\tau)$$
If the learning rate $\alpha$ is chosen poorly, there is no guarantee that $\theta_{new}$ is better than $\theta_{old}$ after an update; the poorer $\theta_{new}$ then produces poorer samples, which degrades the parameters further. TRPO (Trust Region Policy Optimization) was designed to solve exactly this: how to choose an update, or a step size, such that the updated policy $\pi(a|s; \theta_{new})$ is guaranteed to be at least as good as the policy before the update, $\pi(a|s; \theta_{old})$.
Under the policies $\pi(a_t|s_t;\theta)$ and $\pi(a_t|s_t;\tilde{\theta})$, the long-term discounted rewards are as follows, and the goal is $g(\theta_{new}) \ge g(\theta_{old})$
$$\begin{aligned}
g(\theta) &= E_{\tau \sim P_{\theta}(\tau)} G^{\theta}(\tau) \\
g(\tilde{\theta}) &= E_{\tau \sim P_{\tilde{\theta}}(\tau)} G^{\tilde{\theta}}(\tau)
\end{aligned}$$
One can then show that
$$g(\tilde{\theta}) = g(\theta) + E_{\tau \sim P_{\tilde{\theta}}(\tau)} \sum_t \gamma^t A^{\theta} (s_t, a_t)$$
Where does this come from? Define
$$\rho^{\theta}(s) = \sum_{t=0}^\infty \gamma^t P(s_t = s)$$
Then
$$\begin{aligned}
g(\tilde{\theta})
& = g(\theta) + E_{\tau \sim P_{\tilde{\theta}}(\tau)} \sum_t \gamma^t A^{\theta} (s_t, a_t) \\
& = g(\theta) + \sum_t \underline{\sum_s P(s_t=s) \sum_a \pi(a|s;\tilde{\theta})} \cdot \gamma^t A^{\theta} (s, a) \\
& = g(\theta) + \sum_s \sum_t \gamma^t P(s_t=s) \sum_a \pi(a|s;\tilde{\theta}) A^{\theta} (s, a) \\
& = g(\theta) + \sum_s \rho^{\tilde{\theta}}(s) \sum_a \pi(a|s;\tilde{\theta}) A^{\theta} (s, a)
\end{aligned}$$
In this expression $\rho^{\tilde{\theta}}(s)$ depends strongly on $\tilde{\theta}$, but during training the next parameters $\tilde{\theta}$ are not yet available, so consider the surrogate function $L^{\theta}(\tilde{\theta})$
$$L^{\theta}(\tilde{\theta}) = g(\theta) + \sum_s \underline{\rho^{\theta}(s)} \sum_a \pi(a|s;\tilde{\theta}) A^{\theta} (s, a)$$
This function is a first-order approximation of $g(\tilde{\theta})$ around $\theta = \theta_{old}$, i.e.
$$\begin{cases}
L^{\theta}(\theta_{old}) &= g(\theta_{old}) \\
\nabla L^{\theta}(\theta) |_{\theta=\theta_{old}} &= \nabla g(\theta) |_{\theta=\theta_{old}}
\end{cases}$$
For example, $f(x)=x-1$ and $g(x)=\ln x$ agree to first order at $x=1$, since $f(1)=g(1)=0$ and $f'(1)=g'(1)=1$.
So we can optimize $L^{\theta}(\tilde{\theta})$ in order to optimize $g(\tilde{\theta})$:
$$\tilde{\theta}^* = \argmax_{\tilde{\theta}} L^{\theta}(\tilde{\theta})$$
However, this maximizer cannot be used directly as the updated parameter $\theta_{new}$, because:
$\tilde{\theta}^*$ only gives the direction in which to improve $\theta_{old}$; $\theta_{old}$ still has to be moved towards $\tilde{\theta}^*$ step by step;
$\tilde{\theta}^*$ is not necessarily close to $\theta_{old}$, so $L^{\theta_{old}}(\tilde{\theta}^*) \ge L^{\theta_{old}}(\theta_{old})$ does not imply $g(\tilde{\theta}^*) \ge g(\theta_{old})$.
Therefore $\tilde{\theta}^*$ must be kept near $\theta_{old}$, which can be done by bounding the KL divergence between the two policies (besides the reason above, the accuracy of importance sampling also requires it). This yields the TRPO optimization problem
$$\begin{aligned}
\tilde{\theta}^* &= \argmax_{\tilde{\theta}} L^{\theta}(\tilde{\theta}) \\
\text{s.t.} &\quad \text{KL} \left( \pi(a|s; \theta),\pi(a|s; \tilde{\theta}^*) \right) \leq \delta
\end{aligned}$$
That is, we search for $\tilde{\theta}^*$ within a region centered at $\theta$ with radius $\delta$. One remaining problem is that $L^{\theta}(\tilde{\theta})$ requires sampling actions according to $\pi(a|s; \tilde{\theta})$, which is unknown before the update, so importance sampling is used: sample from $\pi(a|s; \theta)$ first and then correct
$$\begin{aligned}
L^{\theta}(\tilde{\theta})
&= g(\theta) + \sum_s \rho^{\theta}(s) \sum_a \pi(a|s;\tilde{\theta}) A^{\theta} (s, a) \\
&= g(\theta) + \sum_s \rho^{\theta}(s) \sum_a \pi(a|s; \theta) \left(
\frac{\pi(a|s;\tilde{\theta})}{\pi(a|s; \theta)} A^{\theta} (s, a)
\right)
\end{aligned}$$
Each policy-gradient update step then corresponds to
$$\begin{aligned}
\tilde{\theta}^* &= \argmax_{\tilde{\theta}} E_{s \sim \rho^{\theta}(s), a \sim \pi(a|s; \theta)}
\frac{\pi(a|s;\tilde{\theta})}{\pi(a|s; \theta)} A^{\theta} (s, a) \\
\text{s.t.} &\quad \text{KL} \left( \pi(a|s; \theta),\pi(a|s; \tilde{\theta}^*) \right) \leq \delta
\end{aligned}$$
Simplifying with a Taylor expansion,
$$\begin{aligned}
\tilde{\theta}^* &= \argmax_{\tilde{\theta}} g^\top (\tilde{\theta} - \theta) \\
\text{s.t.} &\quad \frac{1}{2} (\tilde{\theta} - \theta)^\top H (\tilde{\theta} - \theta) \leq \delta
\end{aligned}$$
where $g$ is the policy gradient. By Lagrangian duality, the update is
$$\tilde{\theta}^* = \theta + \alpha^j \sqrt{\frac{2 \delta}{g^\top H^{-1} g}} H^{-1} g$$
Here $\alpha^j$ is a backtracking (line-search) coefficient that compensates for the error of the Taylor expansion, preventing the constraint from being violated or the surrogate objective from failing to improve.
Importance sampling: given a probability distribution $p(x)$ and a function $f(x)$, the quantity $E_{x \sim p(x)} f(x)$ can be approximated by Monte Carlo, i.e. by sampling enough times $N$ and averaging
$$E_{x \sim p(x)} f(x) = \int p(x) f(x) dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i)$$
The practical problems are: 1) the distribution $p(x)$ is often hard to determine; 2) even if $p(x)$ is known, sampling $x_i$ from it may be hard; 3) sampling according to $p(x)$ may estimate the quantity poorly, e.g. using uniform samples on $[a, b]$ to estimate the integral $\int_a^b f(x) dx = \frac{b - a}{N} \sum_{i=1}^N f(x_i)$ ignores the curvature of $f(x)$ and other factors, and the result can be inaccurate.
In such cases importance sampling helps: introduce another, easy-to-sample distribution $q(x)$, and then
$$E_{x \sim p(x)} f(x)
= \int p(x) f(x) dx
= \int q(x) \frac{p(x)}{q(x)} f(x) dx
= \underline{
E_{x \sim q(x)} \frac{p(x)}{q(x)} f(x)
\approx \frac{1}{N} \sum_{i=1}^N \frac{p(x_i)}{q(x_i)} f(x_i)
}$$
where $\frac{p(x_i)}{q(x_i)}$ is the importance weight. Note that the further apart $p(x)$ and $q(x)$ are, the more samples are needed to maintain accuracy.
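A tiny numeric sketch of the idea (the distributions and the function are arbitrary choices for illustration): estimate $E_{x \sim p}[x^2]$ with $p = \mathcal{N}(0,1)$ while actually sampling from $q = \mathcal{N}(1,2)$; the true value is 1.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100_000)               # samples drawn from q(x)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 1.0, 2.0)  # importance weights p(x)/q(x)
print(np.mean(w * x ** 2))                           # close to 1, with higher variance than sampling p directly
```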
PPO(DeepMind)
TRPO keeps the two distributions close with a KL constraint, which means solving a constrained optimization problem. PPO (Proximal Policy Optimization) improves on this by moving the constraint into the objective as a penalty:
$$\tilde{\theta}^* = \argmax_{\tilde{\theta}}
E_{s \sim \rho^{\theta}(s), a \sim \pi(a|s; \theta)} \left(
\frac{\pi(a|s;\tilde{\theta})}{\pi(a|s; \theta)} A^{\theta} (s, a)
- \beta \text{KL} \left(
\pi(a|s; \theta),\pi(a|s; \tilde{\theta}^*)
\right)
\right)$$
where $\beta$ is a dynamic penalty coefficient that controls the KL divergence: if $\text{KL} > \text{KL}_{\max}$, increase $\beta$; if $\text{KL} < \text{KL}_{\min}$, decrease $\beta$.
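The adaptive penalty can be written in a few lines; the thresholds and the factor of 2 below are illustrative assumptions (the PPO paper adapts $\beta$ around a KL target in a similar spirit).

```python
def update_beta(beta, kl, kl_min, kl_max, factor=2.0):
    """Adapt the KL penalty coefficient after each policy update."""
    if kl > kl_max:      # the new policy drifted too far: penalize the KL term more
        beta *= factor
    elif kl < kl_min:    # the update was too conservative: penalize less
        beta /= factor
    return beta
```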
PPO2(OpenAI)
The other variant instead clips the probability ratio to the interval $(1 - \epsilon, 1 + \epsilon)$ to keep the two distributions close
$$\tilde{\theta}^* = \argmax_{\tilde{\theta}}
E_{s \sim \rho^{\theta}(s), a \sim \pi(a|s; \theta)} \min \left(
\frac{\pi(a|s;\tilde{\theta})}{\pi(a|s; \theta)} A^{\theta} (s, a),
\text{clip}\left(
\frac{\pi(a|s;\tilde{\theta})}{\pi(a|s; \theta)}, 1 - \epsilon, 1 + \epsilon
\right) A^{\theta} (s, a)
\right)$$
A PPO2 example, in which the agent keeps a pendulum upright by controlling how hard it twists left or right (this uses Actor-Critic, introduced in the next part).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 import osimport randomimport argparsefrom collections import namedtupleimport gymimport torchimport torch.nn as nnimport torch.nn.functional as Fimport torch.optim as optimfrom torch.distributions import Normalfrom torch.utils.data.sampler import BatchSampler, SubsetRandomSamplerparser = argparse.ArgumentParser(description='Solve the Pendulum with PPO' ) parser.add_argument('--gamma' , type =float , default=0.9 , metavar='G' , help ='discount factor (default: 0.9)' ) parser.add_argument('--seed' , type =int , default=0 , metavar='N' , help ='random seed (default: 0)' ) parser.add_argument('--render' , action='store_true' , default=False , help ='render the environment' ) parser.add_argument('--log-interval' , type =int , default=10 , metavar='N' , help ='interval between training status logs (default: 10)' ) args = parser.parse_args() env = gym.make('Pendulum-v1' , render_mode='human' if args.render else None ).unwrapped num_state = env.observation_space.shape[0 ] num_action = env.action_space.shape[0 ] torch.manual_seed(args.seed) random.seed(args.seed) Transition = namedtuple('Transition' , ['state' , 'action' , 'a_log_prob' , 'reward' , 'next_state' ]) TrainRecord = namedtuple('TrainRecord' , ['episode' , 'reward' ]) class Actor (nn.Module): def __init__ (self ): super (Actor, self ).__init__() self .fc = nn.Linear(3 , 100 ) self .mu_head = nn.Linear(100 , 1 ) self .sigma_head = nn.Linear(100 , 1 ) def forward (self, x ): x = F.tanh(self .fc(x)) mu = 2.0 * F.tanh(self .mu_head(x)) sigma = F.softplus(self .sigma_head(x)) return (mu, sigma) class Critic (nn.Module): def __init__ (self ): super (Critic, self ).__init__() self .fc1 = nn.Linear(num_state, 64 ) self .fc2 = nn.Linear(64 , 8 ) self .state_value = nn.Linear(8 , 1 ) def forward (self, x ): x = F.leaky_relu(self .fc1(x)) x = F.relu(self .fc2(x)) value = self .state_value(x) return value class PPO2 (): clip_epsilon = 0.2 max_grad_norm = 0.5 ppo_epoch = 10 buffer_capacity, batch_size = 1000 , 32 def __init__ (self ): super (PPO2, self ).__init__() self .actor_net = Actor().float () self .critic_net = Critic().float () self .buffer = [] self .counter = 0 self .training_step = 0 self .actor_optimizer = optim.Adam(self .actor_net.parameters(), lr=1e-4 ) self .critic_net_optimizer = optim.Adam(self .critic_net.parameters(), lr=3e-4 ) @torch.no_grad() def select_action (self, state ): state = torch.from_numpy(state).float ().unsqueeze(0 ) mu, sigma = self .actor_net(state) dist = Normal(mu, sigma) action = dist.sample() action_log_prob = dist.log_prob(action) action = action.clamp(-2 , 2 ) return action.item(), action_log_prob.item() @torch.no_grad() def get_value (self, state ): state = torch.from_numpy(state) value = self .critic_net(state) return value.item() def save_param (self ): torch.save(self .actor_net.state_dict(), 'ppo2_actor_params.pkl' ) torch.save(self 
.critic_net.state_dict(), 'ppo2_critic_params.pkl' ) def load_param (self ): self .actor_net.load_state_dict(torch.load('ppo2_actor_params.pkl' )) self .critic_net.load_state_dict(torch.load('ppo2_critic_params.pkl' )) def store_transition (self, transition ): self .buffer.append(transition) self .counter += 1 return self .counter % self .buffer_capacity == 0 def update (self ): self .training_step += 1 state = torch.tensor([t.state for t in self .buffer], dtype=torch.float ) action = torch.tensor([t.action for t in self .buffer], dtype=torch.float ).view(-1 , 1 ) action_log_prob_old = torch.tensor([t.a_log_prob for t in self .buffer], dtype=torch.float ).view(-1 , 1 ) reward = torch.tensor([t.reward for t in self .buffer], dtype=torch.float ).view(-1 , 1 ) next_state = torch.tensor([t.next_state for t in self .buffer], dtype=torch.float ) del self .buffer[:] with torch.no_grad(): reward = (reward + 8 ) / 8 reward = (reward - reward.mean()) / (reward.std() + 1e-5 ) target_v = reward + args.gamma * self .critic_net(next_state) advantage = target_v - self .critic_net(state) for _ in range (self .ppo_epoch): for index in BatchSampler( SubsetRandomSampler(range (self .buffer_capacity)), self .batch_size, False ): mu, sigma = self .actor_net(state[index]) dist = Normal(mu, sigma) action_log_prob = dist.log_prob(action[index]) ratio = torch.exp(action_log_prob - action_log_prob_old[index] ) action_loss = - torch.min ( ratio * advantage[index], torch.clamp(ratio, 1 - self .clip_epsilon, 1 + self .clip_epsilon) * advantage[index], ).mean() self .actor_optimizer.zero_grad() action_loss.backward() nn.utils.clip_grad_norm_(self .actor_net.parameters(), self .max_grad_norm) self .actor_optimizer.step() value_loss = F.smooth_l1_loss(self .critic_net(state[index]), target_v[index]) self .critic_net_optimizer.zero_grad() value_loss.backward() nn.utils.clip_grad_norm_(self .critic_net.parameters(), self .max_grad_norm) self .critic_net_optimizer.step() def main (is_training ): agent = PPO2() if not is_training: agent.load_param() args.render = True training_records = [] running_reward = -1000 for i_epoch in range (1000 ): score = 0 state, _ = env.reset() if args.render: env.render() for t in range (200 ): action, action_log_prob = agent.select_action(state) next_state, reward, done, truncated, info = env.step([action]) if args.render: env.render() if is_training: trans = Transition(state, action, action_log_prob, reward, next_state) if agent.store_transition(trans): agent.update() score += reward state = next_state running_reward = running_reward * 0.9 + score * 0.1 training_records.append(TrainRecord(i_epoch, running_reward)) if i_epoch % 10 == 0 : print ("Epoch {}, Moving average score is: {:.2f} " .format (i_epoch, running_reward)) if running_reward > -200 : print ("Solved! Moving average score is now {}!" .format (running_reward)) env.close() agent.save_param() break if __name__ == '__main__' : main(is_training=True ) main(is_training=False )
Part 4: From Actor-Critic to A2C/A3C
AC: Actor-Critic
Policy-based methods can choose suitable actions in a continuous action space, where the search space would be too large for value-based methods; but policy-based methods update once per episode, which makes learning inefficient, whereas a value-based critic allows per-step updates. The Actor-Critic algorithm therefore combines the two families and has two parts:
Actor: policy-based; chooses suitable actions, possibly in a continuous action space, i.e. the policy function $\pi(a|s)$;
Critic: value-based; evaluates the actions produced by the actor, e.g. with the state-value function $V(s)$.
The Actor's parameters are updated so that the Critic's output becomes as large as possible: given a state $s$, choosing the action $a$ that maximizes the Critic's value is exactly the Actor network's optimization target. The Critic's parameters are updated so that its scoring becomes more accurate, with the environment's reward $r$ as the training signal.
In the Monte-Carlo policy gradient algorithm REINFORCE, the parameter update is
$$\theta \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot r_t$$
where $r_t$ is obtained by Monte-Carlo sampling. Introducing a Critic, a neural network that estimates the Q value, gives
$$\theta \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot Q(s_t, a_t; \theta)$$
where the Critic model $Q(s_t, a_t; \theta)$ is updated by gradient descent on the squared TD error
$$\theta \leftarrow \theta - \eta \nabla
||r_t + \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta)||_2^2$$
More generally, Actor-Critic comes in the following flavors
$$\begin{cases}
\theta & \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot V^{\pi}(s_{t})
& \text{state-value based} \\
\theta & \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot Q(s_t, a_t; \theta)
& \text{action-value based} \\
\theta & \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot \delta(t)
& \text{TD-error based} \\
\theta & \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot A(s_t, a_t; \theta)
& \text{advantage based} \\
\theta & \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta) \cdot \delta(t) E(t)
& \text{TD(}\lambda\text{)-error based}
\end{cases}$$
A2C: Advantage Actor-Critic
**A2C was introduced to address the high variance of AC.** A2C differs from AC in that a baseline is subtracted from the Q value; we judge how good the current decision is by Q minus this baseline, and the baseline is usually $V^{\pi}(s_t)$, giving
$$\theta \leftarrow \theta + \eta
\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t}
\nabla \log \pi(a_t|s_t; \theta) \cdot
\left(
Q(s_t, a_t; \theta) - V^{\pi}(s_t)
\right)$$
We would then need an Actor to choose actions and a Critic to estimate both the V value and the Q value, but estimating both at once is complicated. After an action is executed, the next step necessarily arrives at $s_{t+1}$, and adding the reward $r_t$ of this step gives the expectation of Q. Equivalently, from
$$\begin{cases}
Q^\pi(s, a) &= r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^\pi(s') \\
V^{\pi}(s) &= E_\pi[R_t + \gamma V^{\pi}(S_{t+1}) | S_t=s] & \text{(Bellman equation)}
\end{cases}$$
we can replace $Q^\pi(s, a)$ with $r_t + \gamma V^{\pi}(s_{t+1})$, so that only the V value needs to be estimated:
$$\delta(t) = \underbrace{r_t + \gamma V^{\pi}(s_{t+1})}_{\text{target } V} - V^{\pi}(s_{t})$$
so the update term becomes
$$\frac{1}{|\Tau|} \sum_{\tau \in \Tau} \sum_{t} \nabla \log \pi(a_t|s_t; \theta)
\cdot \left(
r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_{t})
\right)$$
where the Critic model $V^{\pi}(s)$ is updated by gradient descent on the squared TD error
$$\theta \leftarrow \theta - \eta \nabla ||\underline{r_t + \gamma V^{\pi}(s_{t+1})} - V^{\pi}(s_{t})||_2^2$$
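A minimal A2C update sketch under assumed names (the helper function and the way the two losses are returned separately are illustrative choices): the TD error serves both as the advantage for the actor and as the regression error for the critic.

```python
import torch

def a2c_losses(log_prob, value, next_value, reward, done, gamma=0.9):
    """log_prob: log π(a_t|s_t; θ); value, next_value: V(s_t), V(s_{t+1}) from the critic."""
    target_v = reward + gamma * next_value.detach() * (1 - done)  # target V
    td_error = target_v - value                                   # δ(t)
    actor_loss = -(td_error.detach() * log_prob).mean()           # policy gradient weighted by δ(t)
    critic_loss = td_error.pow(2).mean()                          # fit V(s_t) to the TD target
    return actor_loss, critic_loss
```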
A3C: Asynchronous Advantage Actor Critic
A3C uses the Actor-Critic framework as is and adds the idea of asynchronous training (asynchronous in the sense that the data are not generated at the same time), which improves performance while also greatly speeding up training.
The experience replay mechanism has two problems:
every real-time interaction between the agent and the environment costs a lot of memory and compute;
experience replay requires the agent to learn off-policy, and off-policy methods can only update from data generated by old policies.
To speed up training, A3C therefore trains asynchronously with multiple threads. Each thread is effectively one agent exploring at random; many agents explore together, compute policy gradients in parallel, and update the shared parameters. Put differently, several training environments are launched at once, sampled simultaneously, and the collected samples are used for training directly. Because the data arrive asynchronously, A3C, unlike DQN, needs no replay memory for storing historical samples and drawing them at random to break data correlations, which saves storage; the asynchronous training also multiplies the sampling speed and therefore the training speed. At the same time, collecting samples from several different environments makes the sample distribution more even, which is better for training the neural network.
Part 5: AlphaZero: Multi-Agent Reinforcement Learning
Overview
Monte Carlo Tree Search
Self-Play
References