MARL

Artificial Intelligence / 2022-10-23

Taxonomy

Coordination schemes

  • centralized: cooperative games; a direct extension of single-agent RL in which all agents share one policy
  • decentralized: each agent independently optimizes its own environment return
    • IPPO

Algorithmic approaches

  • centralized training and decentralized execution (CTDE): uses the Actor-Critic framework, with a centralized Critic that keeps a global view during training
    • MADDPG
    • COMA: a multi-agent policy-gradient (PG) method
    • QMix
  • value decomposition (VD)
    • value-decomposed Q-learning
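As a concrete illustration of value-decomposed Q-learning, here is a minimal PyTorch sketch in the VDN style, where the joint action value is the sum of per-agent utilities; `obs_dim`, `n_actions`, and the toy shapes are illustrative assumptions, not taken from any particular implementation.

```python
import torch
import torch.nn as nn

class PerAgentQ(nn.Module):
    """Shared per-agent utility network Q_i(o_i, a_i)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):               # obs: [batch, n_agents, obs_dim]
        return self.net(obs)              # -> [batch, n_agents, n_actions]

def joint_q_vdn(per_agent_q, obs, actions):
    """VDN-style decomposition: Q_tot(s, A) = sum_i Q_i(o_i, a_i)."""
    q_all = per_agent_q(obs)                                          # [B, n, n_actions]
    q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)     # [B, n]
    return q_taken.sum(dim=-1)                                        # [B]

# toy usage with made-up shapes
B, n, obs_dim, n_actions = 32, 3, 10, 5
per_agent_q = PerAgentQ(obs_dim, n_actions)
obs = torch.randn(B, n, obs_dim)
actions = torch.randint(0, n_actions, (B, n))
q_tot = joint_q_vdn(per_agent_q, obs, actions)   # trained centrally against a TD target on the shared reward
```

Because each $Q_i$ depends only on the local observation, every agent can act greedily on its own $Q_i$ at execution time; QMIX replaces the simple sum with a learned monotonic mixing network.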

Challenges

  1. instability
  2. high variance
    • use a large batch size to reduce the variance of the policy gradient

Environments

MDP

  • Decentralized partially observable Markov decision processes (DEC-POMDP) with shared rewards. A DEC-POMDP is defined by $\langle\mathcal{S}, \mathcal{A}, O, R, P, n, \gamma\rangle$. $\mathcal{S}$ is the state space. $\mathcal{A}$ is the shared action space for each agent. $o_{i}=O(s; i)$ is the local observation for agent $i$ at global state $s$. $P(s' \mid s, A)$ denotes the transition probability from $s$ to $s'$ given the joint action $A=(a_{1}, \ldots, a_{n})$ for all $n$ agents. $R(s, A)$ denotes the shared reward function. $\gamma$ is the discount factor. Since most of the benchmark environments contain homogeneous agents, we utilize parameter sharing: each agent uses a shared policy $\pi_{\theta}(a_{i} \mid o_{i})$ parameterized by $\theta$ to produce its action $a_{i}$ from its local observation $o_{i}$, and optimizes its discounted accumulated reward $J(\theta)=\mathbb{E}_{a^{t}, s^{t}}\left[\sum_{t} \gamma^{t} R(s^{t}, a^{t})\right]$.
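To make the parameter-sharing setup concrete, below is a minimal sketch of a single decision step in which all agents sample actions from the same policy $\pi_{\theta}(a_i \mid o_i)$ applied to their own local observations, plus a Monte-Carlo estimate of the discounted return $J(\theta)$. `SharedPolicy` and all sizes are hypothetical placeholders rather than any benchmark's API.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SharedPolicy(nn.Module):
    """One set of parameters theta shared by all homogeneous agents."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def act(self, local_obs):                      # local_obs: [n_agents, obs_dim]
        dist = Categorical(logits=self.net(local_obs))
        actions = dist.sample()                    # a_i ~ pi_theta(. | o_i) for every agent i
        return actions, dist.log_prob(actions)

def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Monte-Carlo estimate of J(theta) = E[sum_t gamma^t R(s^t, a^t)] for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))  # rewards: shared R(s^t, A^t) per step
```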

GYM

  • multi-agent particle-world environment (MPE)
  • Starcraft multi-agent challenge (SMAC)
  • Hanabi challenge

MAPPO

Description

MAPPO is both a CTDE algorithm with a centralized value function and a decentralized-learning algorithm with distributed value functions: it comes with a CTDE-style set of networks, while also allowing each agent to keep its own independent set of networks.

It effectively addresses the low sample efficiency of on-policy methods such as PPO by using importance sampling to learn from past experience.

Approach

Like PPO, MAPPO trains a policy $\pi_\theta$ and a value function $V_\phi(s)$. The value function $V_\phi(s)$, used to reduce variance during training, has a global view, which is what makes MAPPO a CTDE method. These networks can be handed out to every agent, or each agent can instead keep two independent networks of its own.
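A minimal sketch of that network layout, assuming discrete actions and a global state built by concatenating all local observations; layer sizes are arbitrary illustrative choices.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized execution: pi_theta(a_i | o_i) sees only the local observation."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, local_obs):
        return self.net(local_obs)                  # action logits

class CentralCritic(nn.Module):
    """Centralized training: V_phi(s) sees a global state, e.g. all observations concatenated."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, global_state):
        return self.net(global_state).squeeze(-1)   # value used only to compute advantages during training
```

Only the actor is needed at execution time; the centralized critic is unused once training ends, which is exactly the CTDE split.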

Five tricks that matter for MAPPO are used to tune the networks: value normalization, value function inputs, training data usage, policy and value clipping, and death masking.

Tricks

  1. Utilize value normalization to stabilize value learning (see the sketch after this list).
  2. Include agent-specific features in the global state and check that these features do not make the state dimension substantially higher.
  3. Avoid using too many training epochs and do not split data into mini-batches.
  4. For the best PPO performance, tune the clipping ratio $\epsilon$ as a trade-off between training stability and fast convergence.
  5. Use zero states with agent ID as the value input for dead agents.
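Below is a hedged sketch of how tricks 1 and 5 might look in code: a running-statistics normalizer for value targets, and a zero state concatenated with a one-hot agent ID as the critic input for dead agents. The names `RunningValueNorm` and `death_masked_value_input` are my own placeholders; the reference MAPPO implementation may differ in detail.

```python
import numpy as np

class RunningValueNorm:
    """Trick 1: normalize value-function targets with running mean/std statistics."""
    def __init__(self, eps: float = 1e-8):
        self.mean, self.var, self.count, self.eps = 0.0, 1.0, eps, eps

    def update(self, targets: np.ndarray):
        batch_mean, batch_var, batch_count = targets.mean(), targets.var(), targets.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # merge batch statistics into the running statistics (parallel-variance formula)
        m_a, m_b = self.var * self.count, batch_var * batch_count
        self.mean += delta * batch_count / total
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)

    def denormalize(self, x):
        return x * np.sqrt(self.var + self.eps) + self.mean

def death_masked_value_input(state_dim: int, agent_id: int, n_agents: int) -> np.ndarray:
    """Trick 5: a dead agent's value input is a zero state with only its agent ID set."""
    zero_state = np.zeros(state_dim, dtype=np.float32)
    agent_one_hot = np.eye(n_agents, dtype=np.float32)[agent_id]
    return np.concatenate([zero_state, agent_one_hot])
```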

Optimization objectives

  1. Actor network

$$L(\theta)=\left[\frac{1}{Bn} \sum_{i=1}^{B} \sum_{k=1}^{n} \min \left(r_{\theta, i}^{(k)} A_{i}^{(k)}, \operatorname{clip}\left(r_{\theta, i}^{(k)}, 1-\epsilon, 1+\epsilon\right) A_{i}^{(k)}\right)\right]+\sigma \frac{1}{Bn} \sum_{i=1}^{B} \sum_{k=1}^{n} S\left[\pi_{\theta}\left(o_{i}^{(k)}\right)\right]$$

where $r_{\theta, i}^{(k)}=\frac{\pi_{\theta}\left(a_{i}^{(k)} \mid o_{i}^{(k)}\right)}{\pi_{\theta_{old}}\left(a_{i}^{(k)} \mid o_{i}^{(k)}\right)}$, $A_{i}^{(k)}$ is computed using the GAE method, $S$ is the policy entropy, and $\sigma$ is the entropy coefficient hyperparameter.
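The same objective written as a PyTorch loss to be minimized (so the sign is flipped), as a hedged sketch; flattening the batch and agent dimensions into a single axis of size $Bn$ is an implementation choice, not something the formula requires.

```python
import torch

def mappo_actor_loss(log_probs, old_log_probs, advantages, entropy,
                     clip_eps: float = 0.2, entropy_coef: float = 0.01):
    """Clipped surrogate objective with an entropy bonus, averaged over the B*n samples.

    All inputs are tensors of shape [B * n]; advantages come from GAE.
    """
    ratio = torch.exp(log_probs - old_log_probs)                       # r_{theta,i}^{(k)}
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # negate because optimizers minimize; entropy_coef (sigma) weights the exploration bonus
    return -(torch.min(surr1, surr2).mean() + entropy_coef * entropy.mean())
```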

  2. Critic network

$$L(\phi)=\frac{1}{Bn} \sum_{i=1}^{B} \sum_{k=1}^{n} \max \left[\left(V_{\phi}\left(s_{i}^{(k)}\right)-\hat{R}_{i}\right)^{2},\left(\operatorname{clip}\left(V_{\phi}\left(s_{i}^{(k)}\right), V_{\phi_{old}}\left(s_{i}^{(k)}\right)-\varepsilon, V_{\phi_{old}}\left(s_{i}^{(k)}\right)+\varepsilon\right)-\hat{R}_{i}\right)^{2}\right]$$

where $\hat{R}_{i}$ is the discounted reward-to-go, $B$ refers to the batch size, and $n$ refers to the number of agents.
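A matching sketch of the value-clipped critic loss; `values`, `old_values`, and the return targets would come from the rollout buffer, and applying the value normalizer from trick 1 to the targets is omitted for brevity.

```python
import torch

def mappo_critic_loss(values, old_values, returns, clip_eps: float = 0.2):
    """Max of the unclipped and clipped squared errors, averaged over the B*n samples."""
    # clip(V, V_old - eps, V_old + eps) == V_old + clamp(V - V_old, -eps, eps)
    clipped_values = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (clipped_values - returns) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()
```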
