Learning simpler language models with the differential state framework

Published: 01 December 2017

Abstract

Learning useful information across long time lags is a critical and difficult problem for temporal neural models in tasks such as language modeling. Existing architectures that address the issue are often complex and costly to train. The differential state framework (DSF) is a simple and high-performing design that unifies previously introduced gated neural models. DSF models maintain longer-term memory by learning to interpolate between a fast-changing, data-driven representation and a slowly changing, implicitly stable state. Within the DSF, a new architecture is presented, the delta-RNN. This model requires hardly any more parameters than a classical, simple recurrent network. In language modeling at the word and character levels, the delta-RNN outperforms popular complex architectures, such as the long short-term memory (LSTM) and the gated recurrent unit (GRU), and, when regularized, performs comparably to several state-of-the-art baselines. At the subword level, the delta-RNN's performance is comparable to that of complex gated architectures.
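
To make the interpolation idea concrete, here is a minimal NumPy sketch of a delta-RNN-style cell. The parameter names (W, V, b, b_r), the tanh and sigmoid nonlinearities, and the exact form of the gate are illustrative assumptions rather than the paper's precise parameterization; the sketch only shows the core mechanism, in which the next state is a learned mix of a fast, data-driven candidate and the slowly changing previous state.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class DeltaRNNCell:
    """Illustrative delta-RNN-style cell (a sketch, not the paper's exact equations).

    The next hidden state is a gated interpolation between a fast, data-driven
    candidate z_t and the slowly changing previous state h_{t-1}.
    """

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        self.W = rng.uniform(-s, s, (hidden_size, input_size))   # input-to-hidden weights
        self.V = rng.uniform(-s, s, (hidden_size, hidden_size))  # hidden-to-hidden weights
        self.b = np.zeros(hidden_size)                           # candidate bias
        self.b_r = np.zeros(hidden_size)                         # gate bias (assumed form)

    def step(self, x, h_prev):
        # Fast-changing, data-driven candidate state.
        z = np.tanh(self.W @ x + self.V @ h_prev + self.b)
        # Data-driven interpolation gate: how much of the old state to retain.
        r = sigmoid(self.W @ x + self.b_r)
        # Differential-state update: convex mix of candidate and previous state.
        return (1.0 - r) * z + r * h_prev


# Usage: run the cell over a short random input sequence.
cell = DeltaRNNCell(input_size=8, hidden_size=16)
h = np.zeros(16)
for x in np.random.default_rng(1).normal(size=(5, 8)):
    h = cell.step(x, h)
print(h.shape)  # -> (16,)
```

Note that the only recurrent parameters are W and V, as in a simple recurrent network; the interpolation gate reuses W, which is why the parameter count barely grows relative to an Elman-style network.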

References

[1]
Aslin, R. N., Saffran, J. R., & Newport, E. L. (1998). Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9(4), 321-324.
[2]
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450.
[3]
Baayen, R. H., & Schreuder, R. (2006). Morphological processing. Hoboken, NJ: Wiley.
[4]
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
[5]
Boston, M. F., Hale, J., Kliegl, R., Patil, U., & Vasishth, S. (2008). Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2(1).
[6]
Choudhury, V. (2015). Thought vectors: Bringing common sense to artificial intelligence. www.iamwire.com.
[7]
Chung, J., Ahn, S., & Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks. arXiv:1609.01704.
[8]
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
[9]
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2015). Gated feedback recurrent neural networks. In Proceedings of the International Conference on Machine Learning (pp. 2067-2075).
[10]
Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., & Courville, A. (2016). Recurrent batch normalization. arXiv:1603.09025.
[11]
Das, S., Giles, C. L., & Sun, G.-Z. (1992). Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of the 14th Annual Conference of the Cognitive Science Society (p. 14). San Mateo, CA: Morgan Kaufmann.
[12]
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
[13]
Gal, Y., & Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 1019-1027). Red Hook, NY: Curran.
[14]
Gers, F. A., & Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (vol. 3, pp. 189-194). Piscataway, NJ: IEEE.
[15]
Giles, C. L., Chen, D., Miller, C., Chen, H., Sun, G., & Lee, Y. (1991). Second-order recurrent neural networks for grammatical inference. In Proceedings of the International Joint Conference on Neural Networks (vol. 2, pp. 273-281). Piscataway, NJ: IEEE.
[16]
Giles, C. L., Lawrence, S., & Tsoi, A. C. (2001). Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(1-2), 161-183.
[17]
Giles, C. L., Miller, C. B., Chen, D., Chen, H.-H., Sun, G.-Z., & Lee, Y.-C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393-405.
[18]
Goudreau, M. W., Giles, C. L., Chakradhar, S. T., & Chen, D. (1994). First-order versus second-order single-layer recurrent neural networks. IEEE Transactions on Neural Networks, 5(3), 511-513.
[19]
Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850.
[20]
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.
[21]
Gulcehre, C., Chander, S., & Bengio, Y. (2017). Memory augmented neural networks with wormhole connections. arXiv:1701.08718.
[22]
Gulcehre, C., Moczulski, M., Denil, M., & Bengio, Y. (2016). Noisy activation functions. arXiv:1603.00391.
[23]
Ha, D., Dai, A., & Le, Q. V. (2016). Hypernetworks. arXiv:1609.09106.
[24]
Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (pp. 1-8). Stroudsburg, PA: ACL.
[25]
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (pp. 630-645). New York: Springer.
[26]
Hochreiter, S., & Schmidhuber, J. (1997a). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[27]
Hochreiter, S., & Schmidhuber, J. (1997b). LSTM can solve hard long time lag problems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 473-479). Cambridge, MA: MIT Press.
[28]
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
[29]
Jernite, Y., Grave, E., Joulin, A., & Mikolov, T. (2016). Variable computation in recurrent neural networks. arXiv:1611.06188.
[30]
Jordan, M. I. (1990). Attractor dynamics and parallelism in a connectionist sequential machine. In J. Diederich (Ed.), Artificial neural networks (pp. 112-127). Piscataway, NJ: IEEE Press.
[31]
Joulin, A., & Mikolov, T. (2015). Inferring algorithmic patterns with stack-augmented recurrent nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems (pp. 190-198). Cambridge, MA: MIT Press.
[32]
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (pp. 2342-2350).
[33]
Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
[34]
Koutnik, J., Greff, K., Gomez, F., & Schmidhuber, J. (2014). A clockwork RNN. arXiv:1402.3511.
[35]
Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., Pal, C. (2016). Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv:1606.01305.
[36]
Le, Q. V., Jaitly, N., & Hinton, G. E. (2015). A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941.
[37]
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126-1177.
[38]
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics.
[39]
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.
[40]
Mikolov, T. (2012). Statistical language models based on neural networks. Ph.D. diss., Brno University of Technology, Brno, Czech Republic.
[41]
Mikolov, T., Joulin, A., Chopra, S., Mathieu, M., & Ranzato, M. (2014). Learning longer memory in recurrent neural networks. arXiv:1412.7753.
[42]
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (vol. 2, pp. 1045-1048). International Speech Communication Association.
[43]
Mikolov, T., Kombrink, S., Burget, L., Černocký, J., & Khudanpur, S. (2011). Extensions of recurrent neural network language model. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5528-5531). Piscataway, NJ: IEEE.
[44]
Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., & Černocký, J. (2012). Subword language modeling with neural networks. http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf.
[45]
Mozer, M. C. (1993). Neural net architectures for temporal sequence processing. Reading, MA: Addison-Wesley.
[46]
Neal, R. M. (2012). Bayesian learning for neural networks. New York: Springer Science & Business Media.
[47]
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference of Machine Learning (pp. 1310-1318).
[48]
Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838-855.
[49]
Serban, I. V., Ororbia II, A. G., Pineau, J., & Courville, A. (2016). Piecewise latent variables for neural variational text processing. arXiv:1612.00377.
[50]
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
[51]
Sukhbaatar, S., Szlam, A., Weston, J., & Fergus, R. (2015). End-to-end memory networks. arXiv:1503.08895.
[52]
Sun, G.-Z., Giles, C. L., & Chen, H.-H. (1998). The neural network pushdown automaton: Architecture, dynamics and training. In C. L. Giles & M. Gori (Eds.), Adaptive processing of sequences and data structures (pp. 296-345). New York: Springer.
[53]
Sundermeyer, M. (2016). Improvements in language and translation modeling. Ph.D. diss., RWTH Aachen University.
[54]
Turian, J., Bergstra, J., & Bengio, Y. (2009). Quadratic features and deep architectures for chunking. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers (pp. 245-248). Stroudsburg, PA: Association for Computational Linguistics.
[55]
Wang, T., & Cho, K. (2015). Larger-context language modelling. arXiv:1511.03729.
[56]
Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv:1410.3916.
[57]
Wu, Y., Zhang, S., Zhang, Y., Bengio, Y., & Salakhutdinov, R. R. (2016). On multiplicative integration with recurrent neural networks. arXiv:1606.06630.
[58]
Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv:1409.2329.
[59]
Zhou, G.-B., Wu, J., Zhang, C.-L., & Zhou, Z.-H. (2016). Minimal gated unit for recurrent neural networks. International Journal of Automation and Computing, 13(3), 226-234.

Published In

Neural Computation, Volume 29, Issue 12, December 2017, 278 pages

Publisher

MIT Press, Cambridge, MA, United States

Cited By

  • (2024) Enabling An Informed Contextual Multi-Armed Bandit Framework For Stock Trading With Neuroevolution. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 1924-1933. https://doi.org/10.1145/3638530.3664145
  • (2023) Spiking neural predictive coding for continually learning from data streams. Neurocomputing, 544:C. https://doi.org/10.1016/j.neucom.2023.126292
  • (2023) Online evolutionary neural architecture search for multivariate non-stationary time series forecasting. Applied Soft Computing, 145:C. https://doi.org/10.1016/j.asoc.2023.110522
  • (2021) Neuroevolution of recurrent neural networks for time series forecasting of coal-fired power plant operating parameters. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 1735-1743. https://doi.org/10.1145/3449726.3463196
  • (2020) Improving neuroevolutionary transfer learning of deep recurrent neural networks through network-aware adaptation. Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 315-323. https://doi.org/10.1145/3377930.3390193
  • (2020) Ant-based Neural Topology Search (ANTS) for Optimizing Recurrent Networks. Applications of Evolutionary Computation, 626-641. https://doi.org/10.1007/978-3-030-43722-0_40
  • (2020) Neuro-Evolutionary Transfer Learning Through Structural Adaptation. Applications of Evolutionary Computation, 610-625. https://doi.org/10.1007/978-3-030-43722-0_39
  • (2020) An Empirical Exploration of Deep Recurrent Connections Using Neuro-Evolution. Applications of Evolutionary Computation, 546-561. https://doi.org/10.1007/978-3-030-43722-0_35
  • (2019) Investigating recurrent neural network memory structures using neuro-evolution. Proceedings of the Genetic and Evolutionary Computation Conference, 446-455. https://doi.org/10.1145/3321707.3321795
  • (2019) Sentiment analysis through recurrent variants latterly on convolutional neural network of Twitter. Future Generation Computer Systems, 95:C, 292-308. https://doi.org/10.1016/j.future.2018.12.018
