First Order Constrained Optimization in Policy Space
Article No.: 1286, Pages 15338 - 15349
Abstract
In reinforcement learning, an agent attempts to learn high-performing behaviors by interacting with the environment; such behaviors are often quantified in the form of a reward function. However, some aspects of behavior, such as those deemed unsafe and to be avoided, are best captured through constraints. We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS) which maximizes an agent's overall reward while ensuring that the agent satisfies a set of cost constraints. Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. It then projects this update policy back into the parametric policy space. Our approach provides an approximate upper bound on worst-case constraint violation throughout training and, being first-order in nature, is simple to implement. We provide empirical evidence that this simple approach achieves better performance on a set of constrained robotic locomotion tasks.
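The abstract only sketches the two-step update, so a small illustration may help. The following is a minimal, hypothetical sketch of what one such iteration could look like for a stochastic policy, assuming a KL-regularized formulation in which the nonparametric update policy reweights actions by the reward advantage minus a multiplier times the cost advantage, and the projection step becomes a first-order surrogate loss restricted to states close to the data-collecting policy. All names and hyperparameters (lam, nu, delta, adv, cost_adv) are illustrative assumptions, not details taken from this page, and the code is not the authors' reference implementation.

```python
# Hypothetical sketch of a FOCOPS-style update step; hyperparameters and
# names (lam, nu, delta, adv, cost_adv) are illustrative assumptions.
import torch

def focops_style_loss(logp, logp_old, adv, cost_adv, kl, lam=1.5, nu=0.1, delta=0.02):
    """Surrogate projection loss for one batch of states/actions.

    logp     : log pi_theta(a|s) under the current (trainable) policy
    logp_old : log pi_theta_k(a|s) under the data-collecting policy (detached)
    adv      : reward advantage estimates
    cost_adv : cost advantage estimates
    kl       : per-state KL(pi_theta || pi_theta_k), e.g. closed form for Gaussians
    """
    ratio = torch.exp(logp - logp_old)                  # importance sampling ratio
    in_trust_region = (kl.detach() <= delta).float()    # keep only states near the old policy
    # Pull pi_theta toward the reweighted (nonparametric) update policy:
    # penalize divergence from the old policy, reward high advantage,
    # and penalize high cost advantage scaled by the multiplier nu.
    per_state = kl - (1.0 / lam) * ratio * (adv - nu * cost_adv)
    return (in_trust_region * per_state).mean()

def update_nu(nu, avg_episode_cost, cost_limit, step_size=0.01, nu_max=2.0):
    """Projected gradient step on the cost multiplier: grow nu when the
    measured cost exceeds the limit, shrink it otherwise, clipped to [0, nu_max]."""
    nu = nu + step_size * (avg_episode_cost - cost_limit)
    return float(min(max(nu, 0.0), nu_max))

if __name__ == "__main__":
    # Toy usage on random data, purely to show the shapes and the call pattern.
    n = 64
    logp = torch.randn(n, requires_grad=True)
    logp_old = (logp + 0.01 * torch.randn(n)).detach()
    adv, cost_adv = torch.randn(n), torch.randn(n)
    kl = 0.001 * torch.rand(n)
    loss = focops_style_loss(logp, logp_old, adv, cost_adv, kl)
    loss.backward()
    print(loss.item(), update_nu(0.1, avg_episode_cost=30.0, cost_limit=25.0))
```

In this reading, the cost multiplier acts like a per-iteration Lagrange-style weight: when the measured cost exceeds the limit, nu grows and the loss pushes the policy away from high-cost actions; when the constraint is satisfied, nu shrinks back toward zero so that reward dominates the update.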
Index Terms
- First order constrained optimization in policy space
Published In
December 2020
22651 pages
ISBN:9781713829546
- Editors:
- H. Larochelle,
- M. Ranzato,
- R. Hadsell,
- M.F. Balcan,
- H. Lin
Copyright © 2020 Neural Information Processing Systems Foundation, Inc.
Publisher
Curran Associates Inc.
Red Hook, NY, United States
Publication History
Published: 06 December 2020
Qualifiers
- Research-article
- Research
- Refereed limited