Learning to Forget: Continual Prediction with LSTM

Published: 01 October 2000

Abstract

Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs). We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. We review illustrative benchmark problems on which standard LSTM outperforms other RNN algorithms. All algorithms (including LSTM) fail to solve continual versions of these problems. LSTM with forget gates, however, easily solves them, and in an elegant way.
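
To make the mechanism concrete, below is a minimal sketch of one step of an LSTM cell with a forget gate, written in the modern notation that descends from this paper. The forget gate f multiplies the previous cell state, so the cell can learn to drive f toward zero and reset itself on a continual input stream rather than letting the state accumulate without bound. This is an illustrative NumPy sketch under assumed names and parameter layout (lstm_step; stacked W, U, b), not the paper's exact formulation or code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # Stacked pre-activations for the forget, input, and output gates and
    # the candidate cell input (illustrative layout: one block per gate).
    z = W @ x + U @ h_prev + b
    n = h_prev.shape[0]
    f = sigmoid(z[0*n:1*n])  # forget gate: near 0 resets the cell, near 1 keeps it
    i = sigmoid(z[1*n:2*n])  # input gate
    o = sigmoid(z[2*n:3*n])  # output gate
    g = np.tanh(z[3*n:4*n])  # candidate cell input
    # Standard LSTM (no forget gate) amounts to c = c_prev + i * g, so the
    # state can only accumulate; f lets learned resets release internal resources.
    c = f * c_prev + i * g
    h = o * np.tanh(c)       # gated cell output
    return h, c

# Toy run on an unsegmented stream with no external resets: because the
# forget gate rescales the old state multiplicatively, the cell state
# stays bounded over arbitrarily long streams.
rng = np.random.default_rng(0)
n_cells, n_in = 8, 3
W = 0.1 * rng.standard_normal((4 * n_cells, n_in))
U = 0.1 * rng.standard_normal((4 * n_cells, n_cells))
b = np.zeros(4 * n_cells)
h, c = np.zeros(n_cells), np.zeros(n_cells)
for _ in range(10000):
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)
print("cell-state norm after 10,000 steps:", float(np.linalg.norm(c)))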

References

[1]
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
[2]
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1, 372-381.
[3]
Cummins, F., Gers, F., & Schmidhuber, J. (1999). Language identification from prosody without explicit features. In Proceedings of EUROSPEECH'99 (Vol. 1, pp. 371-374).
[4]
Darken, C. (1995). Stochastic approximation and neural network learning. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 941-944). Cambridge, MA: MIT Press.
[5]
Doya, K., & Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time backpropagation learning. Neural Networks, 2(5), 375-385.
[6]
Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 190-196). San Mateo, CA: Morgan Kaufmann.
[7]
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). Learning to forget: Continual prediction with LSTM (Tech. Rep. No. IDSIA-01-99). Lugano, Switzerland: IDSIA.
[8]
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München. Available online at www7.informatik.tu-muenchen.de/~hochreit.
[9]
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[10]
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Cognitive Science Society Conference. Hillsdale, NJ: Erlbaum.
[11]
Lin, T., Horne, B. G., Tiňo, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6), 1329-1338.
[12]
Mozer, M. C. (1989). A focused backpropagation algorithm for temporal pattern processing. Complex Systems, 3, 349-381.
[13]
Pearlmutter, B. A. (1995). Gradient calculation for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212-1228.
[14]
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network. (Tech. Rep. No. CUED/F-INFENG/TR.1). Cambridge: Cambridge University Engineering Department.
[15]
Schmidhuber, J. (1989). The neural bucket brigade: A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4), 403-412.
[16]
Schmidhuber, J. (1992). A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243-248.
[17]
Schraudolph, N. (1999). A fast, compact approximation of the exponential function. Neural Computation, 11(4), 853-862.
[18]
Smith, A. W., & Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2), 125-131.
[19]
Waibel, A. (1989). Modular construction of time-delay neural networks for speech recognition. Neural Computation, 1(1), 39-46.
[20]
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339-356.
[21]
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4), 490-501.
[22]
Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, architectures and applications. Hillsdale, NJ: Erlbaum.

Published In

Neural Computation, Volume 12, Issue 10 (October 2000), 242 pages.
MIT Press, Cambridge, MA, United States.
