Motivation

Most multi-agent RL assumes agents are trained together, share conventions, or have access to communication channels. Zero-shot coordination (ZSC) removes these assumptions: can an agent trained independently cooperate with a stranger at test time? The problem is harder than it looks — self-play policies develop arbitrary conventions that prevent ad-hoc teamwork.

Approach

I implement Independent PPO (IPPO) across a suite of cooperative grid-world environments derived from Overcooked-AI and Level-Based Foraging (LBF). The core question: what structural properties of the environment or algorithm enable coordination to emerge without explicit agreement?

The key hypothesis I'm testing: symmetry-breaking in the reward landscape forces agents to commit to asymmetric roles, creating stable conventions even without communication. I compare:

  • Standard self-play IPPO
  • Population-based training (PBT) for convention diversity
  • Fictitious co-play as a ZSC-specific baseline
  • Cross-play evaluation between independently trained populations
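The cross-play protocol in the last bullet can be sketched with a toy one-step convention game. Everything here is an illustrative stand-in, not the actual setup: `play_episode` replaces a full Overcooked/LBF rollout, and the hard-coded policies represent independently trained populations that each settled on a convention.

```python
import itertools

# Toy one-step coordination game: two agents each pick a "convention";
# reward is 1.0 when their choices match, 0.0 otherwise. A stand-in for
# a full episode, used only to illustrate the evaluation protocol.
def play_episode(policy_a, policy_b):
    return 1.0 if policy_a() == policy_b() else 0.0

# Three independently "trained" policies; two happen to share a convention.
policies = [lambda: 0, lambda: 0, lambda: 1]
n = len(policies)

# Pairwise reward matrix: entry (i, j) is the reward when policy i is
# paired with policy j. The diagonal is self-play; off-diagonal entries
# are cross-play between strangers.
matrix = [[play_episode(policies[i], policies[j]) for j in range(n)]
          for i in range(n)]

self_play = sum(matrix[i][i] for i in range(n)) / n
cross_play = sum(matrix[i][j]
                 for i, j in itertools.permutations(range(n), 2)) / (n * (n - 1))
print(self_play, cross_play)
```

Self-play is perfect here (every policy coordinates with itself), while cross-play drops whenever two populations committed to different conventions: exactly the gap ZSC evaluation is designed to expose.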

Technical Setup

Environments are implemented in PettingZoo. Policies are MLPs with parameters shared across agents during self-play training and separated at test time. Training runs in JAX (Brax-style parallelism) with PyTorch policy networks. The evaluation metric is cross-play reward normalized by the self-play upper bound:
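The share-then-separate pattern can be illustrated without any ML dependencies; the dict of weights below is a stand-in for the actual MLP parameters.

```python
import copy

# Minimal stand-in for a policy network: a dict of "parameters".
# In the real setup these would be MLP weights.
shared_policy = {"weights": [0.1, -0.3, 0.5]}

# Self-play training: both agent slots reference the SAME parameters,
# so any gradient update is seen by both agents.
train_agents = [shared_policy, shared_policy]
assert train_agents[0] is train_agents[1]

# Test time: deep-copy the policy per agent, so cross-play pairings
# cannot leak state between partners.
test_agents = [copy.deepcopy(shared_policy) for _ in range(2)]
assert test_agents[0] is not test_agents[1]
assert test_agents[0] == test_agents[1]  # identical values, distinct objects
```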

$$\text{ZSC Score} = \frac{\mathbb{E}[\text{cross-play reward}]}{\mathbb{E}[\text{self-play reward}]}$$
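A minimal sketch of computing this score from a pairwise reward matrix; the numbers are made-up placeholders, not experimental results.

```python
# Pairwise reward matrix: entry (i, j) = reward of policy i paired with
# policy j. Illustrative placeholder values only.
rewards = [
    [20.0, 14.0, 11.0],
    [15.0, 22.0, 12.0],
    [10.0, 13.0, 21.0],
]
n = len(rewards)

# Self-play upper bound: mean of the diagonal entries.
self_play = sum(rewards[i][i] for i in range(n)) / n
# Cross-play: mean over all ordered off-diagonal pairings.
cross_play = sum(rewards[i][j]
                 for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

zsc_score = cross_play / self_play
print(round(zsc_score, 3))  # → 0.595
```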

Status

Ongoing. Initial results show that symmetry-breaking reward shaping improves the ZSC Score by ~18% on the Overcooked layout. Adding population diversity via PBT further narrows the cross-play gap. A write-up is in progress.

← Back to Projects