Survey of Multiagent Reinforcement Learning

22 minute read

The survey paper can be accessed here: Busoniu et al. 2008. In this post I will try to condense the basic taxonomy and highlight the main lines along which work in multiagent reinforcement learning had been conducted until 2008. A later post will talk about the work since then, and particularly in the last 2 years following the seminal Deep Q-Networks paper.

What about predefined behaviors? Why learn? It is certainly possible to program multiple agents to act (and even coordinate) in a shared environment, and has even been deployed to great effect by companies like Disney for attractions, and Amazon in their warehouses, it requires very controlled environments and a great deal of resources (in terms of man hours spent) to get those systems to work, painstakingly. Getting multiple agents to collaborate in any significant manner will require behavior that would be VERY difficult for humans to hand-design apriori. Moreover, in dynamic environments which keep changing, those hand-designed behaviors would soon be redundant in the absence of any learning.

Background: Stochastic Games

  1. State transitions are the result of the joint action of all agents.
  2. The rewards and returns also depend upon the joint action.
  3. The Q-function of each agent depends on the joint action and is conditioned upon the joint policy.
  4. If all agents’ reward functions are identical, they have the same goal and hence the SG is fully cooperative. If there are 2 agents and one’s reward is negative of the other’s reward, the SG is fully competitive. Mixed games are stochastic games which are neither fully cooperative nor fully competitive.

Benefits of MARL

  1. Exploiting the decentralized nature of the problem lends itself to easy parallelization and faster computation.
  2. Experience sharing through communication, teaching and imitation can help learn faster.
  3. Inherent robustness of the system because of multiple agents which can take over from one another

Issues with MARL

  1. What is a good learning goal for the agents, which will elicit the desired behavior? How do we incorporate both stability and adaptability?
  2. Keeping track of the other agents (who are also learning and improving)
  3. Learning agents render the environment for any one agent nonstationary - which invalidates the convergence properties of most single-agent RL algorithms like Generalized Policy Iteration, Q-Learning, SARSA etc.
  4. Scalability issues become even more pronounced with multiple agents. The curse of dimensionality rears its ugly head with exponential growth in the dimensionality of joint action space.
  5. The agents’ choice of actions must be mutually consistent, and if planning is done in the joint action space then coordination boils down to each agent consistently breaking ties between equally good strategies.

Choosing a goal

  1. Stability: is important because stable agents reduce nonstationarity in learning process of other agents, making things easier to solve. The following criteria have been used:
    1. Convergence of agents’ strategies to Nash equilibria. Nash Q-Learning
    2. Convergence of agents’ strategies to an equilibrium strategy regardless of what other agents are doing Littman 2001
    3. Agent’s capability to predict nearly accurate models of the other agents
  2. Adaptation: is needed because environment dynamics are evolving, and agents need to adapt to it. The following criteria have been used:
    1. Rationality: Each agent should converge to a best response if other agents remain stationary (do not change their strategies)
    2. No-Regret: The agent should always achieve a return at least as good as any stationary strategy, for any set of strategies of the other agents. This also prevents agent from being exploited by the other agents.
    3. Average Reward Bounds: Targeted optimality requires an average reward better than that of a best response. Compatibility prescribes an average reward level in self-play, while Safety prescribes a minimum safety level of average reward.
    4. Opponent Awareness: Learn models of other agents and formulate a best response to them

Some taxonomy for ease of reference

  1. Type of task targeted: Static (stateless) or Dynamic
  2. Degree of Awareness exhibited: Strongly linked to the learning goal. Algorithms focused on stability are relatively less aware of other agents. Algorithms focused on adaptation are obviously aware of others, and often use some form of modeling to keep track of the other agents’ policies.
  3. Inspiration: Fusion of concepts from game theory (fictitious play, AWESOME, Nash-Q, minimax-Q), Temporal Difference-RL (Distributed-Q, FMQ), Direct Policy Search (IGA,WoLF-IGA)
  4. Homogeneity of the learning algorithms: Does the algo work only if all the agents use it (team-Q, Nash-Q), or others can use other learning algorithms (AWESOME, WoLF-PHC)
  5. Prior Knowledge: A task model is available to the agent (AWESOME), or not (team-Q, Nash-Q, WoLF-PHC)
  6. Agent Observations: Agent might need to observe actions of other agents (team-Q, AWESOME), their actions and rewards (Nash-Q) or neither (WoLF-PHC)

Some MARL Algorithms

  1. Fully Cooperative Tasks: Decentralized decision making, learning the optimal joint-action values with Q-learning is easy but coordination problems arise when there are multiple joint actions that are unique, and each agent breaks ties randomly, resulting in a suboptimal joint action. These algos assume perfect perception of state and of other agents’ actions. Most also suffer from curse of dimensionality (except Distributed-Q and FMQ).
    1. Coordination-Free:
      • Team-Q: Assume unique optimal joint-action (hardly ever the case). Each agent learns the optimal joint-action value by Q-learning, and uses that to select joint actions.
      • Distributed-Q: It works only with deterministic policies. Each agent’s local Q-function (depending only on its own action) is updated only when it leads to an increase in the Q-value, and the local policy is updated only if it improves the Q-values. Provably converges to joint-optimal policies if reward is positive and Q’s are initialized with zero values.
    2. Coordination-Based: Coordination graphs additively decompose the Q-function into multiple local Q-functions depending on only a subset of the agents, making things easier to solve individually and then aggregating their solutions.
    3. Indirect Coordination:
      • Joint Action Learners: Static Tasks: Each agent learns models for all the other agents based on empirically counting how many times it observed the other taking a particular action.
      • Frequency Max-Q (FMQ): Static Tasks: It works only for deterministic tasks, where variance in the reward resulting from an agent’s actions can only result from other agents’ actions. It empirically tracks the frequency with which any particular action has yielded good rewards in the past, and increases the Q-value for those joint actions, leading to coordination developing.
      • Optimal Adaptive Learning: Dynamic Tasks: Uses virtual games where optimal joint actions are rewarded with 1, and others with 0. This biases the agent toward recently selected optimal actions, and provably converges to optimal joint policies in fully cooperative tasks. However computational complexity is high.
  2. Fully Competitive Tasks: Minimax principle is applicable: maximize the minimum expected reward. Minimax-Q is opponent independent and uses minimax principle to compute strategies and values, and a TD-rule to propagate those values. Opponent Modeling can do better than the minimax return if the opponent is suboptimal, i.e., does not always take the most damaging action). The model is built by empirically observing the frequency of taking each action in a particular state.

  3. Mixed Tasks: Most of the following algorithms are for static games, which have limited applicability in real-world applications WoLF-PHC, NSCP are exceptions. Static game algos also assume availability of exact task model, which is not the case usually. Understanding the conditions in which single-agent RL works in multiagent settings would be an important step since there are theoretical guarantees available for single-agent RL algorithms.
    1. Single agent Q-Learning: Tuyls et al. 2006 applied evolutionary analysis to behavior of Q-learning with Boltzmann policies in repeated games, and found that there was some limited convergence in some games.
    2. Agent-Independent methods: Policies and state values are computed with game-theoretic solvers for each stage game, for example Nash Q-Learning which computes the Nash equilibrium at each stage. However, if this equilibrium is not unique, then the equilibirum selection problem arises where each agent might break ties differently and end up with different updates.
    3. Agent-Tracking methods: Estimate models of other agents’ strategies and formulate a best response to those. Usually not stable. Observability of other agents’ actions is assumed. e.g. Fictitious Play and Nonstationary Converging Policies (NSCP)
    4. Agent-Aware methods: Target stability as well as adaptability.
      • AWESOME: It tries to adapt to the others’ strategies when they appear stationary (using fictitious play), but retreats to a centrally precomputed Nash equilibrium strategy when nonstationary. It is rational, and converges to a Nash eq in self-play, and valid for any number of actions and players, while relying on observing only the other agents’ actions, not their strategies.
      • Infinitesimal Gradient Ascent (IGA), Win-or-Learn-Fast-IGA (WoLF-IGA) and Generalized IGA (GIGA) are direct policy search methods which use gradient update rules that guarantee convergence in specific static games. IGA works for 2-agent, 2-action games, uses constant gradient steps and the average rewards converge to Nash rewards for infinitesimally small step size. WoLF-IGA uses a small learning rate when winning, and a large winning rate when it is losing.
      • WoLF-Policy Hill Climbing (WoLF-PHC) updates policies with a WoLF step size, and Q-functions with the Q-learning rule.
  4. Explicit Coordination Mechanisms: Could work with any of the above 3 settings. The coordination problem usually involves all agents breaking ties in the same way, which can be achieved using social roles, conventions and communication. Social roles restricts the set of actions available to an agent and prevent ties. Social conventions encode preferences towards certain joint actions and can totally eliminate ties if properly designed. Communication can obviously be used to communicate action choices, either alone or along with conventions and roles. If comms are available, only an ordering of agents has to be known to break ties deterministically.


  1. Scalability is the big challenge for MARL. Most of the above presented methods require tabular representation of Q-values, which is no good for any real application with continuous state and action spaces. The work on function approximation in single-agent RL needs to be extended to multiple agents.
  2. Exploiting the decentralized, modular structure of the multiagent task can distinguish MARL methods completely from single-agent RL methods and provide a quantum leap in performance.
  3. Making learning easier by providing some domain knowledge, by choosing a suitable model for the approximator, shaping rewards to reward promising behaviour, decomposing into subtasks and using hierarchical RL.

Leave a Comment