LSTMs are popular because they solved the vanishing gradient problem of vanilla RNNs. But so do Gated Recurrent Units (GRUs). On top of that, GRUs have fewer parameters to learn than LSTMs. And yet, GRUs are not used nearly as much in industry as LSTMs. Almost nobody mentions GRUs when building a course on NLP. Look at the Google Trends of these two terms.
An astonishing difference, right? In this post, let's learn what GRUs are, how they learn, and why they aren't the go-to algorithm for NLP.
What are GRUs and how do they learn?
Gated Recurrent Units (GRUs) were designed with one purpose: to give performance comparable to LSTMs while reducing the number of trainable parameters. And they do it very effectively. Here is how they achieve it.
Contrary to LSTMs (and somewhat counterintuitively), a GRU produces only one output: its hidden state. If you have read my previous post on LSTMs, you'll recall that an LSTM has two outputs, one aligned with short-term memory (the hidden state) and the other with long-term memory (the cell state). So the question remains: how does a GRU maintain context over long sequences?
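To make that single-output point concrete, here is a minimal sketch using PyTorch's nn.LSTM and nn.GRU (the sizes are arbitrary, picked only for illustration): the LSTM hands back two states while the GRU hands back one, and the GRU ends up with roughly three quarters of the LSTM's parameters because it has three gates instead of four.

```python
import torch
import torch.nn as nn

# Arbitrary sizes, just for illustration
input_size, hidden_size = 128, 256

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

x = torch.randn(4, 10, input_size)  # (batch, seq_len, features)

# LSTM returns the output plus TWO states: hidden (short-term) and cell (long-term)
out_lstm, (h_n, c_n) = lstm(x)

# GRU returns the output plus a SINGLE hidden state
out_gru, h_gru = gru(x)

print(h_n.shape, c_n.shape)  # torch.Size([1, 4, 256]) torch.Size([1, 4, 256])
print(h_gru.shape)           # torch.Size([1, 4, 256])

# Fewer gates also means fewer trainable parameters
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm), count(gru))  # the GRU has ~3/4 the parameters of the LSTM
```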
GRUs simplify the LSTM design with two gates: a reset gate and an update gate. These gates decide what information gets passed on to the output. The update gate determines how much past information (from previous time steps) should be carried forward, akin to deciding how much of the long-term memory to use in the current state; it effectively plays the role of the LSTM's forget and input gates combined, which is part of what makes the model leaner. The reset gate, on the other hand, decides how much past information to forget when computing the new candidate state.
Despite having only one output, GRUs are able to maintain the context for long sequences through a clever balancing act performed by these gates. They modulate the flow of information inside the unit without separate memory cells, which are present in LSTMs. This allows GRUs to still capture dependencies from large spans of time, deciding at each step what to keep from the past and what new information to add. By adapting these gates' settings at each step in the sequence, GRUs can keep track of long-term dependencies, thus enabling them to maintain context and perform various sequence modeling tasks effectively.
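If it helps to see the balancing act in code, here is a rough NumPy sketch of a single GRU step. The weight names are illustrative, not tied to any particular library, and papers and libraries differ slightly on which side of the final blend the update gate weights; the structure is what matters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step. Weight names (Wz, Uz, ...) are illustrative only."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params

    # Update gate: how much of the previous hidden state to carry forward
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)

    # Reset gate: how much of the previous hidden state to forget
    # when proposing new content
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)

    # Candidate hidden state, computed from the reset-gated past
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)

    # Blend old state and candidate: one hidden state, no separate cell state
    h_t = z * h_prev + (1 - z) * h_tilde
    return h_t
```

Note how the final line does the context-keeping: when z is close to 1 the unit just copies the old state forward, and when z is close to 0 it overwrites it with the new candidate.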
Why aren't they used as much as LSTMs?
For the most part, GRUs give LSTMs a close fight in terms of performance. The final decision of which one to use depends on the dataset and the problem at hand. It's not set in stone, but from experience, GRUs tend to outperform LSTMs on datasets with shorter sequences, and the reverse happens as sequence length grows. In other words, LSTMs are better at remembering context over longer spans than GRUs.
Secondly, GRUs have been around for far less time than LSTMs; they were introduced only in 2014. Coincidentally, the attention mechanism (we'll discuss this in next week's post) was introduced around the same time and drew most of engineers' and researchers' attention, overshadowing GRUs.
Just to be clear, it's not that GRUs are never used; it's that the use cases the industry is focused on right now fit LSTMs and the attention mechanism better. Take chatbots, for instance: the ability to remember the context of a conversation for a long time is extremely important, and LSTMs do that better than GRUs.