Zero-Shot Coordination in Multi-Agent RL
The Problem
Two robots are trained separately to cook in the same kitchen. When they finally meet, they can't coordinate: one keeps trying to pass ingredients from the left while the other reaches from the right. They never agreed on a protocol, and neither knows what the other expects.
This is the zero-shot coordination problem. Agents that train together develop effective conventions — but those conventions are private. They're habits formed through co-adaptation, not principles that generalize to a new partner. Put two agents from different training runs together, and coordination collapses.
The Question
The standard fix — training against a diverse population of partners — helps but doesn't fully solve it. The deeper question is structural: what properties of the training environment push agents toward conventions that strangers can recognize?
When a task has a natural asymmetry — one side should carry, the other should assemble — agents have a reason to commit to distinct roles. The asymmetry breaks the tie. Instead of both agents independently arriving at an arbitrary private convention, they both arrive at the same one, because the environment made that convention the obvious choice.
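The tie-breaking argument can be made concrete with a toy two-role game. The sketch below is an illustration only, not the Overcooked setup: the role names, the `bonus` parameter, and the seed-parity tie-break (a stand-in for arbitrary co-adaptation during training) are all assumptions introduced here.

```python
from itertools import permutations, product

ROLES = ("carry", "assemble")

def team_reward(role0, role1, bonus=0.0):
    """1.0 if the two roles are complementary, else 0.0.

    `bonus` is a hypothetical asymmetry: a small extra reward when
    player 0 is the one who carries.
    """
    base = 1.0 if role0 != role1 else 0.0
    return base + (bonus if base and role0 == "carry" else 0.0)

def train_pair(seed, bonus=0.0):
    """Stand-in for co-training: return one optimal joint convention.

    Ties between equally good conventions are broken by seed parity,
    mimicking the arbitrary outcome of co-adaptation.
    """
    joint = list(product(ROLES, ROLES))
    best = max(team_reward(a, b, bonus) for a, b in joint)
    optimal = [jc for jc in joint if team_reward(*jc, bonus) == best]
    return optimal[seed % len(optimal)]

def mean_cross_play(conventions):
    """Mean unshaped reward when player 0 of run i meets player 1 of run j."""
    pairs = list(permutations(range(len(conventions)), 2))
    total = sum(team_reward(conventions[i][0], conventions[j][1]) for i, j in pairs)
    return total / len(pairs)

symmetric = [train_pair(seed) for seed in range(4)]
asymmetric = [train_pair(seed, bonus=0.1) for seed in range(4)]

print(mean_cross_play(symmetric))   # runs disagree on who carries
print(mean_cross_play(asymmetric))  # every run picks the same convention
```

In the symmetric game the four runs split between two equally good conventions, so most stranger pairings clash. With the bonus there is a single best convention, every run finds it, and cross-play matches same-team play.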
What Was Built
Agents are trained in independent runs on cooperative environments derived from Overcooked, then paired with a stranger from a different run at test time. Four conditions are compared:
- Standard independent training — the baseline
- Population-based training — train against many partner policies to force robustness
- Fictitious co-play — train against a mixture of past selves to diversify conventions
- Symmetry-breaking reward shaping — the proposed structural mechanism
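One way to read the four conditions is that the first three differ in who the learner trains with, while the fourth changes the reward itself. The sketch below is a rough illustration under that assumption; the function names, the condition strings, and the additive `role_bonus` are hypothetical and are not taken from the actual training code.

```python
import random

CONDITIONS = ("independent", "population", "fictitious_co_play", "symmetry_breaking")

def pick_training_partner(condition, co_trained, population, past_selves, rng):
    """Choose the partner for the next training episode (hypothetical helper)."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if condition == "population":
        # Train against many partner policies to force robustness.
        return rng.choice(population)
    if condition == "fictitious_co_play":
        # Train against a mixture of past selves.
        return rng.choice(past_selves)
    # Baseline and symmetry-breaking both keep the co-trained partner.
    return co_trained

def training_reward(condition, task_reward, role_bonus):
    """Add a role-dependent shaping term only in the symmetry-breaking condition."""
    if condition == "symmetry_breaking":
        return task_reward + role_bonus
    return task_reward

rng = random.Random(0)
partner = pick_training_partner("population", "partner", ["pop_a", "pop_b"], ["ckpt_1"], rng)
print(partner, training_reward("symmetry_breaking", 1.0, 0.25))
```

Because partner selection and reward shaping are separate functions here, a combined condition (population partners plus shaping) would be a small change in this sketch.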
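Training is parallelized in JAX. Coordination is measured as cross-play reward (an agent paired with a stranger from another run) normalized by same-team reward (the same agent paired with its training partner). A score of 1.0 would mean an agent does as well with any stranger as with its own partner.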
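A minimal sketch of the normalized score, assuming returns are collected in a square matrix where entry `[i][j]` is the mean episode return when an agent from run `i` plays with an agent from run `j`. The function name, the ratio-of-means normalization, and the numbers in the example are illustrative assumptions, not measured results.

```python
from statistics import mean

def normalized_cross_play(returns):
    """Cross-play return divided by same-team return.

    returns[i][j] is the mean return for run i's agent paired with
    run j's agent; the diagonal holds same-team (self-play) pairings.
    """
    n = len(returns)
    if n < 2 or any(len(row) != n for row in returns):
        raise ValueError("need a square matrix covering at least two runs")
    same_team = mean(returns[i][i] for i in range(n))
    cross_play = mean(returns[i][j] for i in range(n) for j in range(n) if i != j)
    return cross_play / same_team

# Made-up returns for three runs, for illustration only.
example = [
    [200, 50, 100],
    [60, 180, 90],
    [110, 70, 220],
]
print(normalized_cross_play(example))
```

Here the diagonal averages 200 and the off-diagonal entries average 80, so the score is 0.4: strangers recover 40% of what co-trained partners achieve.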
Early Results
Symmetry-breaking reward shaping improves zero-shot coordination by ~18% in the core Overcooked layout. Population diversity narrows the remaining gap further. The two mechanisms appear complementary: one shapes the conventions themselves, the other diversifies the agent's experience of possible partners.
The structural story is the interesting part: effective zero-shot coordination is not just about training more broadly — it is about training in environments whose structure makes the right convention the obvious one.
Ongoing. Dissipative coupling and multi-layout generalization in progress.