Were RNNs All We Needed?


Abstract: The introduction of Transformers in 2017 reshaped the landscape of deep learning. Originally proposed for sequence modelling, Transformers have since achieved widespread success across various domains. However, the scalability limitations of Transformers - particularly with respect to sequence length - have sparked renewed interest in novel recurrent models that are parallelizable during training, offer comparable performance, and scale more effectively. In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs), which dominated the field for two decades before the rise of Transformers. Specifically, we examine LSTMs (1997) and GRUs (2014). We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that (1) use fewer parameters than their traditional counterparts, (2) are fully parallelizable during training, and (3) achieve surprisingly competitive performance on a range of tasks, rivalling recent models including Transformers.
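
The abstract does not spell out the simplified recurrences, so the following is only a rough sketch of why a minGRU-style update can be trained in parallel. It assumes the commonly cited form in which the gate and the candidate state depend on the current input alone: z_t = sigmoid(Linear(x_t)), h~_t = Linear(x_t), h_t = (1 - z_t) * h_{t-1} + z_t * h~_t. The scalar weights `w_z`, `b_z`, `w_h`, `b_h` and both function names are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def min_gru_sequential(xs, w_z, b_z, w_h, b_h, h0=0.0):
    """Step-by-step scalar recurrence (assumed minGRU-style update)."""
    h, out = h0, []
    for x in xs:
        z = sigmoid(w_z * x + b_z)       # gate depends only on x_t
        h_tilde = w_h * x + b_h          # candidate depends only on x_t
        h = (1.0 - z) * h + z * h_tilde  # linear in the previous state
        out.append(h)
    return out


def min_gru_affine(xs, w_z, b_z, w_h, b_h, h0=0.0):
    """Same values via composed affine maps h -> A*h + B.

    Because a_t and b_t never depend on h, they can be computed for all
    time steps up front. Composing affine maps is associative, which is
    the property a parallel scan would exploit; a plain loop is used here
    only to keep the sketch short.
    """
    a = [1.0 - sigmoid(w_z * x + b_z) for x in xs]
    b = [sigmoid(w_z * x + b_z) * (w_h * x + b_h) for x in xs]
    out, A, B = [], 1.0, 0.0
    for a_t, b_t in zip(a, b):
        A, B = a_t * A, a_t * B + b_t    # compose step t onto the prefix map
        out.append(A * h0 + B)
    return out


xs = [0.5, -1.0, 2.0, 0.0]
seq = min_gru_sequential(xs, 0.7, -0.1, 1.3, 0.2, h0=0.4)
par = min_gru_affine(xs, 0.7, -0.1, 1.3, 0.2, h0=0.4)
```

Both functions should agree up to floating-point rounding. The contrast with a traditional GRU, whose gates also read h_{t-1}, is that there the per-step coefficients cannot be precomputed, so the affine-map composition is not available.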

Submission history

From: Leo Feng
[v1] Wed, 2 Oct 2024 03:06:49 UTC (292 KB)
[v2] Fri, 4 Oct 2024 05:01:26 UTC (289 KB)
[v3] Thu, 28 Nov 2024 07:10:33 UTC (4,010 KB)