
A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem

Published: 06 May 2024

Abstract

Training multiple agents to coordinate is an essential problem with applications in robotics, game theory, economics, and social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to what we call the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) coordination challenges, two issues at which current offline MARL algorithms fail. Concretely, we reveal that the prevalent model-free methods are severely deficient and cannot handle coordination-intensive offline multi-agent tasks in either toy or MuJoCo domains. To address this setback, we emphasize the importance of inter-agent interactions and propose the very first model-based offline MARL method. Our resulting algorithm, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. This simple model-based solution solves the coordination-intensive offline tasks, significantly outperforming the prevalent model-free methods even under severe partial observability and with learned world models.
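
The abstract only sketches the pipeline at a high level. As a rough illustration of the model-based offline MARL loop it describes, here is a minimal, hypothetical sketch: fit an ensemble world model on the offline dataset, then roll out the current joint policy inside the model to generate synthetic interaction data for a multi-agent PPO learner to fine-tune on. All names (WorldModelEnsemble, generate_rollouts, reward_fn) and the ensemble-disagreement reward penalty are illustrative assumptions, not the paper's published design.

```python
# A minimal sketch (not the authors' code) of a model-based offline MARL loop:
# learn a dynamics-model ensemble from offline data, then branch short synthetic
# rollouts of the current joint policy from dataset states.
import numpy as np

class WorldModelEnsemble:
    """Ensemble of linear-Gaussian dynamics models: s' ~ W @ [s, a] + b."""
    def __init__(self, n_models, state_dim, joint_action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.in_dim = state_dim + joint_action_dim
        # each member gets its own parameters; fit() replaces them by least squares
        self.members = [
            {"W": rng.normal(scale=0.1, size=(state_dim, self.in_dim)),
             "b": np.zeros(state_dim)}
            for _ in range(n_models)
        ]

    def fit(self, states, joint_actions, next_states):
        """Fit each member on a bootstrap resample of the offline transitions."""
        X = np.concatenate([states, joint_actions], axis=1)
        rng = np.random.default_rng(1)
        for m in self.members:
            idx = rng.integers(0, len(X), size=len(X))
            Xb = np.concatenate([X[idx], np.ones((len(idx), 1))], axis=1)
            # least squares: next_state ~ W @ x + b
            sol, *_ = np.linalg.lstsq(Xb, next_states[idx], rcond=None)
            m["W"], m["b"] = sol[:-1].T, sol[-1]

    def step(self, state, joint_action):
        """Predict next state with a random member; also return disagreement."""
        x = np.concatenate([state, joint_action])
        preds = np.stack([m["W"] @ x + m["b"] for m in self.members])
        k = np.random.randint(len(self.members))
        disagreement = preds.std(axis=0).mean()  # crude epistemic-uncertainty proxy
        return preds[k], disagreement

def generate_rollouts(model, policies, dataset_states, reward_fn,
                      horizon=5, penalty=1.0):
    """Branch short model rollouts from offline states; penalize uncertain regions
    (a MOPO-style pessimism heuristic, assumed here for illustration)."""
    synthetic = []
    for s in dataset_states:
        for _ in range(horizon):
            # one action per agent, concatenated into the joint action
            joint_a = np.concatenate([pi(s) for pi in policies])
            s_next, unc = model.step(s, joint_a)
            r = reward_fn(s, joint_a) - penalty * unc
            synthetic.append((s, joint_a, r, s_next))
            s = s_next
    return synthetic
```

In a full implementation, the tuples returned by generate_rollouts would be batched into advantage estimates and fed to per-agent PPO-clip updates; it is these imagined inter-agent interactions, unavailable in the static offline dataset, that let agents agree on a joint strategy (SA) and fine-tune their policies to it (SFT).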



Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024
2898 pages
ISBN:9798400704864


Publisher

International Foundation for Autonomous Agents and Multiagent Systems

Richland, SC


Author Tags

  1. coordination
  2. deep learning
  3. model-based reinforcement learning
  4. multi-agent learning
  5. offline reinforcement learning
  6. world models

Qualifiers

  • Research-article

Conference

AAMAS '23
Sponsor:

Acceptance Rates

Overall Acceptance Rate: 1,155 of 5,036 submissions, 23%
