DOI: 10.5555/3495724.3497010

First Order Constrained Optimization in Policy Space

Published: 06 December 2020

Abstract

In reinforcement learning, an agent attempts to learn high-performing behaviors by interacting with the environment; such behaviors are typically quantified in the form of a reward function. However, some aspects of behavior, such as those deemed unsafe and to be avoided, are better captured through constraints. We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS), which maximizes an agent's overall reward while ensuring that the agent satisfies a set of cost constraints. Using data generated by the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. It then projects this update policy back into the parametric policy space. Our approach admits an approximate upper bound on worst-case constraint violation throughout training and, being first-order in nature, is simple to implement. We provide empirical evidence that this simple approach achieves better performance on a set of constrained robotic locomotion tasks.
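
To make the two-step structure described in the abstract concrete, below is a minimal sketch for a tabular softmax policy over discrete actions. The exponentiated-advantage reweighting used for the nonparameterized update, the temperature `lam`, the cost multiplier `nu`, the learning rate, and all variable names are illustrative assumptions for this sketch, not the paper's exact formulation; the paper targets parameterized (e.g., neural network) policies trained from sampled data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))  # policy logits, one row per state

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for reward and cost advantage estimates from rollouts of the current policy.
adv_reward = rng.normal(size=(n_states, n_actions))
adv_cost = rng.normal(size=(n_states, n_actions))

lam, nu, lr = 1.0, 0.5, 0.1  # temperature, cost penalty, step size (assumed values)

# Step 1: a nonparameterized "update policy" obtained by reweighting the current
# policy with penalized advantages (one plausible closed form, assumed here).
pi_old = softmax(theta)
pi_star = pi_old * np.exp((adv_reward - nu * adv_cost) / lam)
pi_star /= pi_star.sum(axis=-1, keepdims=True)

# Step 2: project the update policy back into the parametric space using plain
# first-order steps that reduce KL(pi_theta || pi_star) on the sampled states.
for _ in range(200):
    pi = softmax(theta)
    log_ratio = np.log(pi) - np.log(pi_star)
    # gradient of sum_s KL(pi_theta(.|s) || pi_star(.|s)) w.r.t. the logits
    grad = pi * (log_ratio - np.sum(pi * log_ratio, axis=-1, keepdims=True))
    theta -= lr * grad

pi = softmax(theta)
print("max per-state KL to update policy:",
      np.max(np.sum(pi * np.log(pi / pi_star), axis=-1)))
```

In the setting the abstract describes, the projection step would instead take stochastic first-order gradient steps on state-action samples generated by the current policy; the tabular case above is used only to keep the sketch self-contained and runnable.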



Published In

NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems
December 2020
22651 pages
ISBN: 9781713829546

Publisher

Curran Associates Inc., Red Hook, NY, United States

Publication History

Published: 06 December 2020

Qualifiers

• Research-article
• Research
• Refereed limited
