Solving Markov decision processes with downside risk adjustment

Published: 01 June 2016

Abstract

Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. Some literature on MDPs does take risk into account; much of it addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above the mean is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing such risks, however, can lead to poor solutions in which the risk is zero or near zero but the average rewards are also rather low. Hence, in this paper, we study a risk-averse criterion based on the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing this risk outright, we only reduce it at the cost of slightly lowered average rewards. A solution in which the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon cases. The algorithms are tested in numerical studies and show encouraging performance.
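The objective function described above, the expected reward minus a scalar times the downside risk, can be illustrated with a short simulation. The Python sketch below is only an illustration of that criterion under assumed parameters: the toy two-state MDP, the fixed policy, the risk-aversion weight theta, and the reward target tau are all hypothetical, and the sketch evaluates a single policy by Monte Carlo rather than implementing the dynamic programming or reinforcement learning algorithms developed in the paper.

# Minimal Monte Carlo sketch (not the paper's algorithm): estimate the
# downside-risk-adjusted objective  E[R] - theta * P(R < tau)  for a fixed
# policy on a toy finite-horizon MDP. The MDP, policy, theta, and tau
# below are illustrative assumptions.
import random

P = {  # transition probabilities: P[state][action] -> list of (next_state, prob)
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.2), (1, 0.8)]},
}
R = {  # immediate rewards: R[state][action]
    0: {0: 6.0, 1: 10.0},
    1: {0: -5.0, 1: 12.0},
}
policy = {0: 1, 1: 0}            # fixed deterministic policy being evaluated
H, theta, tau = 20, 5.0, 100.0   # horizon, risk-aversion weight, reward target

def simulate_episode(start_state=0):
    """Return the total (undiscounted) reward of one H-step episode."""
    s, total = start_state, 0.0
    for _ in range(H):
        a = policy[s]
        total += R[s][a]
        u, cum = random.random(), 0.0
        for nxt, prob in P[s][a]:   # sample the next state
            cum += prob
            if u <= cum:
                s = nxt
                break
    return total

episodes = [simulate_episode() for _ in range(10000)]
mean_reward = sum(episodes) / len(episodes)
downside_risk = sum(r < tau for r in episodes) / len(episodes)  # P(R < tau)
score = mean_reward - theta * downside_risk
print(f"E[R] = {mean_reward:.2f}, P(R < tau) = {downside_risk:.3f}, "
      f"risk-adjusted score = {score:.2f}")

Comparing such scores across candidate policies shows how the criterion trades a small loss in average reward for a lower probability of the total reward falling below the target.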

Published In

International Journal of Automation and Computing, Volume 13, Issue 3
June 2016
106 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Downside risk
  2. Markov decision processes
  3. dynamic programming
  4. reinforcement learning
  5. targets
  6. thresholds
