Classification
Coordination schemes
- centralized: cooperative games; directly extends single-agent RL, with all agents sharing one policy
- decentralized: each agent optimizes its own independent environment return
- IPPO
Algorithm ideas
- centralized training and decentralized execution (CTDE): uses the Actor-Critic framework, with a centralized Critic that oversees the whole picture
- MADDPG
- COMA: a multi-agent PG method
- QMix
- value decomposition (VD)
- value-decomposed Q-learning
Issues
- instability
- high variance
- use a large batch size to reduce the variance of PG
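The batch-size point can be illustrated numerically: the variance of a mean over B i.i.d. samples falls roughly as 1/B, which is why large batches tame the noise of PG gradient estimates. A minimal sketch (toy Gaussian samples standing in for per-sample gradients; not the paper's experiment):

```python
import random
import statistics

def batch_mean_variance(batch_size, trials=2000, seed=0):
    """Empirical variance of a batch-mean estimator over many trials.

    Each trial averages `batch_size` standard-normal samples, mimicking
    a gradient estimate averaged over a batch; the spread of these means
    shrinks roughly as 1/batch_size.
    """
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(0, 1) for _ in range(batch_size))
             for _ in range(trials)]
    return statistics.pvariance(means)
```

For example, `batch_mean_variance(64)` comes out far smaller than `batch_mean_variance(4)`, matching the 1/B scaling.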
Environments
MDP
- Decentralized partially observable Markov decision processes (DEC-POMDP) with shared rewards. A DEC-POMDP is defined by $ \langle S, A, O, R, P, n, \gamma \rangle $: $ S $ is the state space, $ A $ is the shared action space for each agent, $ o_i = O(s; i) $ is the local observation for agent $ i $ at global state $ s $, $ P(s' \mid s, A) $ denotes the transition probability from $ s $ to $ s' $ given the joint action $ A = (a_1, \ldots, a_n) $ for all $ n $ agents, $ R(s, A) $ denotes the shared reward function, and $ \gamma $ is the discount factor. Since most of the benchmark environments contain homogeneous agents, we utilize parameter sharing: each agent uses a shared policy $ \pi_\theta(a_i \mid o_i) $ parameterized by $ \theta $ to produce its action $ a_i $ from its local observation $ o_i $, and optimizes its discounted accumulated reward $ J(\theta) = \mathbb{E}_{a^t, s^t}\left[ \sum_t \gamma^t R(s^t, a^t) \right] $.
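The setup above can be sketched as a toy rollout loop: n agents act from local observations through one shared policy and all receive the same shared reward. Everything here (the environment dynamics, `shared_policy`, `step`) is a hypothetical illustration, not the paper's benchmark:

```python
import random

def shared_policy(obs, theta):
    # One set of parameters `theta` used by every agent (parameter sharing).
    # Toy rule: a threshold policy over a scalar local observation.
    return 1 if obs * theta > 0 else 0

def step(state, joint_action):
    # Toy stand-ins for the transition P(s'|s, A) and shared reward R(s, A).
    s_next = state + sum(joint_action) - len(joint_action) / 2
    reward = float(sum(joint_action))  # identical reward for all agents
    return s_next, reward

def rollout(n_agents=3, horizon=5, theta=1.0, gamma=0.99, seed=0):
    """Collect one episode's discounted shared return J under the shared policy."""
    rng = random.Random(seed)
    state, ret, discount = 0.0, 0.0, 1.0
    for _ in range(horizon):
        # Each agent i only sees a noisy local observation o_i = O(s; i).
        obs = [state + rng.gauss(0, 0.1) for _ in range(n_agents)]
        joint_action = [shared_policy(o, theta) for o in obs]
        state, reward = step(state, joint_action)
        ret += discount * reward
        discount *= gamma
    return ret
```

The key structural points survive even in this toy: agents never see the global state when acting, and the optimized quantity is the shared discounted return.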
GYM
- multi-agent particle-world environment (MPE)
- StarCraft multi-agent challenge (SMAC)
- Hanabi challenge
MAPPO
Description
MAPPO is both a CTDE algorithm with a centralized value function and a decentralized-learning algorithm with distributed value functions: it comes with a CTDE-style set of networks, while also allowing each agent to keep its own independent networks.
It mitigates the low sample efficiency of on-policy methods like PPO by using importance sampling to learn from previously collected experience.
Idea
Train a policy $ \pi_\theta $ and a value function $ V_\phi $ as in PPO. $ V_\phi $, used to reduce variance during training, has a global view, which makes MAPPO a CTDE method. These networks can be distributed to every agent, or each agent can additionally keep two independent networks of its own.
Tune the networks with five tricks that matter for MAPPO: value normalization, value function inputs, training data usage, policy and value clipping, and death masking.
Tricks
- Utilize value normalization to stabilize value learning.
- Include agent-specific features in the global state and check that these features do not make the state dimension substantially higher.
- Avoid using too many training epochs and do not split data into mini-batches.
- For the best PPO performance, tune the clipping ratio as a trade-off between training stability and fast convergence.
- Use zero states with agent ID as the value input for dead agents.
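The first trick above, value normalization, can be sketched as a running normalizer for value-regression targets. This is a simplified stand-in (the paper's implementation uses PopArt-style normalization; the class name here is an assumption):

```python
class RunningValueNorm:
    """Running mean/std normalizer for value targets.

    Train the critic on normalize(target); call denormalize on its
    output wherever an unnormalized value estimate is needed.
    """
    def __init__(self, eps=1e-5):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, xs):
        # Welford's online algorithm over a batch of value targets.
        for x in xs:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        var = self.m2 / max(self.count, 1)
        return (var + self.eps) ** 0.5

    def normalize(self, x):
        return (x - self.mean) / self.std

    def denormalize(self, x):
        return x * self.std + self.mean
```

Normalizing targets this way keeps the critic's regression scale stable even when the magnitude of returns drifts over training.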
Optimization objectives
- Actor network
$ L(\theta)=\frac{1}{Bn} \sum_{i=1}^{B} \sum_{k=1}^{n} \min \left(r_{\theta, i}^{(k)} A_{i}^{(k)},\ \operatorname{clip}\left(r_{\theta, i}^{(k)}, 1-\epsilon, 1+\epsilon\right) A_{i}^{(k)}\right)+\sigma \frac{1}{Bn} \sum_{i=1}^{B} \sum_{k=1}^{n} S\left[\pi_{\theta}\left(o_{i}^{(k)}\right)\right] $, where $ r_{\theta, i}^{(k)}=\frac{\pi_{\theta}\left(a_{i}^{(k)} \mid o_{i}^{(k)}\right)}{\pi_{\theta_{old}}\left(a_{i}^{(k)} \mid o_{i}^{(k)}\right)} $, the advantage $ A_{i}^{(k)} $ is computed using the GAE method, $ S $ is the policy entropy, and $ \sigma $ is the entropy coefficient hyperparameter.
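A minimal sketch of the clipped actor objective in plain Python (the function name and flat-list interface are assumptions; a real implementation would be vectorized over the B×n samples):

```python
import math

def mappo_actor_loss(logp, logp_old, adv, clip_eps=0.2, entropy=None, sigma=0.01):
    """Clipped surrogate loss averaged over the B*n (sample, agent) pairs.

    logp, logp_old: flat lists of per-(sample, agent) log-probs under the
    current and old policies; adv: GAE advantages. Returns the loss to
    minimize (negative of the objective).
    """
    total = 0.0
    for lp, lp_old, a in zip(logp, logp_old, adv):
        r = math.exp(lp - lp_old)                         # importance ratio
        clipped = min(max(r, 1 - clip_eps), 1 + clip_eps)
        total += min(r * a, clipped * a)                  # clipped surrogate
    obj = total / len(adv)
    if entropy is not None:                               # entropy bonus S
        obj += sigma * sum(entropy) / len(entropy)
    return -obj
```

Working in log-probabilities keeps the ratio numerically stable; the `min` with the clipped term removes any incentive to move the ratio outside `[1 - eps, 1 + eps]`.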
- Critic network
$ L(\phi)=\frac{1}{Bn} \sum_{i=1}^{B} \sum_{k=1}^{n} \max \left[\left(V_{\phi}\left(s_{i}^{(k)}\right)-\hat{R}_{i}\right)^{2},\left(\operatorname{clip}\left(V_{\phi}\left(s_{i}^{(k)}\right), V_{\phi_{old}}\left(s_{i}^{(k)}\right)-\varepsilon, V_{\phi_{old}}\left(s_{i}^{(k)}\right)+\varepsilon\right)-\hat{R}_{i}\right)^{2}\right] $, where $ \hat{R}_{i} $ is the discounted reward-to-go, $ B $ refers to the batch size, and $ n $ refers to the number of agents.
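Likewise, the clipped value loss can be sketched (again with an assumed name and flat lists over the B×n samples):

```python
def mappo_critic_loss(v, v_old, returns, clip_eps=0.2):
    """Clipped value loss averaged over the B*n (sample, agent) pairs.

    v, v_old: current and behavior-policy value predictions V_phi(s);
    returns: discounted reward-to-go targets R_hat.
    """
    total = 0.0
    for vi, vo, ret in zip(v, v_old, returns):
        # Keep the new prediction within clip_eps of the old one.
        v_clip = min(max(vi, vo - clip_eps), vo + clip_eps)
        # max of the two squared errors: a pessimistic (larger) loss that
        # discourages value updates far from the old predictions.
        total += max((vi - ret) ** 2, (v_clip - ret) ** 2)
    return total / len(v)
```

Taking the `max` of the clipped and unclipped errors mirrors the policy clipping: it limits how far a single update can move the critic from its previous predictions.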