First Order Constrained Optimization in Policy Space
Article No.: 1286, Pages 15338 - 15349
Abstract
In reinforcement learning, an agent attempts to learn high-performing behaviors by interacting with the environment; such behaviors are often quantified in the form of a reward function. However, some aspects of behavior, such as those deemed unsafe and to be avoided, are best captured through constraints. We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS) which maximizes an agent's overall reward while ensuring that the agent satisfies a set of cost constraints. Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. It then projects this update policy back into the parametric policy space. Our approach provides an approximate upper bound on worst-case constraint violation throughout training and, being first-order in nature, is simple to implement. We provide empirical evidence that this simple approach achieves better performance on a set of constrained robotic locomotion tasks.
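The abstract only sketches the two-step update, so a small illustration may help. The following is a minimal, hypothetical sketch of what one such iteration could look like for a stochastic policy, assuming a KL-regularized formulation in which the nonparametric update policy reweights actions by the reward advantage minus a multiplier times the cost advantage, and the projection step becomes a first-order surrogate loss restricted to states close to the data-collecting policy. All names and hyperparameters (lam, nu, delta, adv, cost_adv) are illustrative assumptions, not details taken from this page, and the code is not the authors' reference implementation.

```python
# Hypothetical sketch of a FOCOPS-style update step; hyperparameters and
# names (lam, nu, delta, adv, cost_adv) are illustrative assumptions.
import torch

def focops_style_loss(logp, logp_old, adv, cost_adv, kl, lam=1.5, nu=0.1, delta=0.02):
    """Surrogate projection loss for one batch of states/actions.

    logp     : log pi_theta(a|s) under the current (trainable) policy
    logp_old : log pi_theta_k(a|s) under the data-collecting policy (detached)
    adv      : reward advantage estimates
    cost_adv : cost advantage estimates
    kl       : per-state KL(pi_theta || pi_theta_k), e.g. closed form for Gaussians
    """
    ratio = torch.exp(logp - logp_old)                  # importance sampling ratio
    in_trust_region = (kl.detach() <= delta).float()    # keep only states near the old policy
    # Pull pi_theta toward the reweighted (nonparametric) update policy:
    # penalize divergence from the old policy, reward high advantage,
    # and penalize high cost advantage scaled by the multiplier nu.
    per_state = kl - (1.0 / lam) * ratio * (adv - nu * cost_adv)
    return (in_trust_region * per_state).mean()

def update_nu(nu, avg_episode_cost, cost_limit, step_size=0.01, nu_max=2.0):
    """Projected gradient step on the cost multiplier: grow nu when the
    measured cost exceeds the limit, shrink it otherwise, clipped to [0, nu_max]."""
    nu = nu + step_size * (avg_episode_cost - cost_limit)
    return float(min(max(nu, 0.0), nu_max))

if __name__ == "__main__":
    # Toy usage on random data, purely to show the shapes and the call pattern.
    n = 64
    logp = torch.randn(n, requires_grad=True)
    logp_old = (logp + 0.01 * torch.randn(n)).detach()
    adv, cost_adv = torch.randn(n), torch.randn(n)
    kl = 0.001 * torch.rand(n)
    loss = focops_style_loss(logp, logp_old, adv, cost_adv, kl)
    loss.backward()
    print(loss.item(), update_nu(0.1, avg_episode_cost=30.0, cost_limit=25.0))
```

In this reading, the cost multiplier acts like a per-iteration Lagrange-style weight: when the measured cost exceeds the limit, nu grows and the loss pushes the policy away from high-cost actions; when the constraint is satisfied, nu shrinks back toward zero so that reward dominates the update.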
Index Terms
- First order constrained optimization in policy space
Published In
December 2020
22651 pages
ISBN:9781713829546
- Editors:
- H. Larochelle,
- M. Ranzato,
- R. Hadsell,
- M.F. Balcan,
- H. Lin
Copyright © 2020 Neural Information Processing Systems Foundation, Inc.
Publisher
Curran Associates Inc.
Red Hook, NY, United States
Publication History
Published: 06 December 2020
Qualifiers
- Research-article
- Research
- Refereed limited