DOI: 10.5555/1995456.1995518
WSC Conference Proceedings · Research article

Reinforcement learning for model building and variance-penalized control

Published: 13 December 2009 Publication History

Abstract

Reinforcement learning (RL) is a simulation-based technique for solving Markov decision problems or processes (MDPs). It is especially useful when the transition probabilities of the MDP are hard to obtain or when the number of states in the problem is too large. In this paper, we present a new model-based RL algorithm that builds the transition-probability model without generating the transition probabilities themselves; the existing literature on model-based RL attempts to compute these probabilities explicitly. We also present a variance-penalized Bellman equation and an RL algorithm that uses it to solve a variance-penalized MDP. We conclude with numerical experiments on these algorithms.
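The variance-penalized objective described in the abstract trades mean reward against reward variance (in the spirit of Filar, Kallenberg, and Lee [6] and Markowitz [12]). As a rough illustration only — not the paper's algorithm — the sketch below runs a Q-learning-style update on a toy two-state MDP in which each immediate reward is penalized by a weight times its squared deviation from a running per-state-action mean. The toy MDP, the penalty form, and all parameter values are assumptions made for this example.

```python
import random

random.seed(0)

N_STATES, N_ACTIONS = 2, 2
LAM = 1.0      # variance-penalty weight (illustrative choice)
ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor
EPS = 0.1      # exploration rate

def step(state, action):
    """Toy 2-state MDP: action 1 has a higher mean reward but much higher variance."""
    reward = 1.0 if action == 0 else random.choice([0.0, 3.0])
    return random.randrange(N_STATES), reward

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
mean_r = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # running reward means

state = 0
for _ in range(20000):
    if random.random() < EPS:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
    nxt, r = step(state, action)
    # Track a running mean of the immediate reward for this state-action pair.
    mean_r[state][action] += ALPHA * (r - mean_r[state][action])
    # Penalize the squared deviation from that mean: one common surrogate
    # for a variance penalty in risk-sensitive RL.
    penalized = r - LAM * (r - mean_r[state][action]) ** 2
    Q[state][action] += ALPHA * (penalized + GAMMA * max(Q[nxt]) - Q[state][action])
    state = nxt

# With this penalty weight the greedy policy prefers the safe action (0),
# even though the risky action (1) has the higher mean reward.
greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(greedy)
```

Without the penalty (LAM = 0), the risky action's higher mean (1.5 versus 1.0) would win; the penalty flips that preference, which is the qualitative behavior a variance-penalized MDP formulation is after.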

References

[1]
Baird, L. C. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning.
[2]
Barto, A., R. Sutton, and C. Anderson. 1983. Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13:835--846.
[3]
Bertsekas, D., and J. Tsitsiklis. 1996. Neuro-dynamic programming. Belmont, MA, USA: Athena Scientific.
[4]
Borkar, V. 2002. Q-learning for risk-sensitive control. Mathematics of Operations Research 27(2):294--311.
[5]
Borkar, V. S. 1997. Stochastic approximation with two-time scales. Systems and Control Letters 29:291--294.
[6]
Filar, J., L. Kallenberg, and H. Lee. 1989. Variance-penalized Markov decision processes. Mathematics of Operations Research 14(1):147--161.
[7]
Geibel, P., and F. Wysotzki. 2005. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24:81--108.
[8]
Gosavi, A. 2003. Simulation-based optimization: Parametric optimization techniques and reinforcement learning. Boston, MA: Kluwer Academic.
[9]
Gosavi, A. 2006. A risk-sensitive approach to total productive maintenance. Automatica 42:1321--1330.
[10]
Gosavi, A. 2007. Adaptive critics for airline revenue management. In Proceedings of 18th Annual Conference of the Production and Operations Management Society, Dallas, TX.
[11]
Gosavi, A., and S. Meyn. 2009. A dynamic programming algorithm for variance-penalized Markov decision process. Working Paper, Missouri University of Science and Technology and University of Illinois.
[12]
Markowitz, H. 1952. Portfolio selection. Journal of Finance 7(1):77--91.
[13]
Rummery, G., and M. Niranjan. 1994. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166. Engineering Department, Cambridge University.
[14]
Sato, M., and S. Kobayashi. 2001. Average-reward reinforcement learning for variance-penalized Markov decision problems. In Proceedings of the 18th International Conference on Machine Learning, 473--480. Morgan Kaufmann.
[15]
Sutton, R. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
[16]
Sutton, R., and A. G. Barto. 1998. Reinforcement learning: An introduction. Cambridge, MA, USA: The MIT Press.
[17]
Tadepalli, P., and D. Ok. 1998. Model-based average reward reinforcement learning algorithms. Artificial Intelligence 100:177--224.
[18]
Tsitsiklis, J. N., and B. Van Roy. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5):674--690.
[19]
Watkins, C. 1989. Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge, England.
[20]
Werbos, P. 1990. A menu of designs for reinforcement learning over time. In Neural Networks for Control, 67--95. MIT Press, MA.
[21]
Werbos, P. J. 1974, May. Beyond regression: New tools for prediction and analysis of behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA, USA.
[22]
Werbos, P. J. 1987. Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics 17:7--20.
[23]
Williams, R. 1988. On the use of backpropagation in associative reinforcement learning. In Proceedings of the International Conference on Neural Networks, San Diego, CA.
[24]
Witten, I. 1977. An adaptive optimal controller for discrete time Markov environments. Information and Control 34:286--295.

Cited By

  • (2018) Solving Markov decision processes with downside risk adjustment. International Journal of Automation and Computing 13(3):235--245. DOI: 10.1007/s11633-016-1005-3. Online publication date: 17-Dec-2018.
  • (2015) A comprehensive survey on safe reinforcement learning. The Journal of Machine Learning Research 16(1):1437--1480. DOI: 10.5555/2789272.2886795. Online publication date: 1-Jan-2015.
  • (2011) Stochastic policy search for variance-penalized semi-Markov control. In Proceedings of the Winter Simulation Conference, 2865--2876. DOI: 10.5555/2431518.2431858. Online publication date: 11-Dec-2011.


Published In

WSC '09: Winter Simulation Conference
December 2009, 3211 pages
ISBN: 9781424457717

Publisher

Winter Simulation Conference



Conference

WSC '09: Winter Simulation Conference
December 13--16, 2009
Austin, Texas

Acceptance Rates

WSC '09 paper acceptance rate: 137 of 256 submissions (54%)
Overall acceptance rate: 3,413 of 5,075 submissions (67%)


