Solving Markov decision processes with downside risk adjustment

Published: 01 June 2016

Abstract

Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. Some literature on MDPs does take risk into account; much of it addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above the mean is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing such risks, however, can lead to poor solutions in which the risk is zero or near zero but the average rewards are also rather low. Hence, in this paper, we study a risk-averse criterion based on the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing this risk outright, we only reduce it at the cost of slightly lowered average rewards. A solution in which the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon cases. The algorithms are tested in numerical studies and show encouraging performance.
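The objective function described above, the expected reward minus a scalar times the downside risk, can be illustrated with a short simulation. The Python sketch below is only an illustration of that criterion under assumed parameters: the toy two-state MDP, the fixed policy, the risk-aversion weight theta, and the reward target tau are all hypothetical, and the sketch evaluates a single policy by Monte Carlo rather than implementing the dynamic programming or reinforcement learning algorithms developed in the paper.

# Minimal Monte Carlo sketch (not the paper's algorithm): estimate the
# downside-risk-adjusted objective  E[R] - theta * P(R < tau)  for a fixed
# policy on a toy finite-horizon MDP. The MDP, policy, theta, and tau
# below are illustrative assumptions.
import random

P = {  # transition probabilities: P[state][action] -> list of (next_state, prob)
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.2), (1, 0.8)]},
}
R = {  # immediate rewards: R[state][action]
    0: {0: 6.0, 1: 10.0},
    1: {0: -5.0, 1: 12.0},
}
policy = {0: 1, 1: 0}            # fixed deterministic policy being evaluated
H, theta, tau = 20, 5.0, 100.0   # horizon, risk-aversion weight, reward target

def simulate_episode(start_state=0):
    """Return the total (undiscounted) reward of one H-step episode."""
    s, total = start_state, 0.0
    for _ in range(H):
        a = policy[s]
        total += R[s][a]
        u, cum = random.random(), 0.0
        for nxt, prob in P[s][a]:   # sample the next state
            cum += prob
            if u <= cum:
                s = nxt
                break
    return total

episodes = [simulate_episode() for _ in range(10000)]
mean_reward = sum(episodes) / len(episodes)
downside_risk = sum(r < tau for r in episodes) / len(episodes)  # P(R < tau)
score = mean_reward - theta * downside_risk
print(f"E[R] = {mean_reward:.2f}, P(R < tau) = {downside_risk:.3f}, "
      f"risk-adjusted score = {score:.2f}")

Comparing such scores across candidate policies shows how the criterion trades a small loss in average reward for a lower probability of the total reward falling below the target.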

Published In

International Journal of Automation and Computing, Volume 13, Issue 3
June 2016
106 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Downside risk
  2. Markov decision processes
  3. dynamic programming
  4. reinforcement learning
  5. targets
  6. thresholds
