Recurrent Neural Networks have "Attention Deficiency": they struggle to retain information over long sequences. Along came LSTMs, with their ability to store information for longer. But, as they say, all good things have even better alternatives, and so came Attention.

The Attention Mechanism is a way to tell the model which aspects of the input are most useful for making a prediction. Compare it with how your own brain works. When reading something, do you put equal emphasis on every word, or more emphasis on the keywords? Or when you've just watched a movie and retell the story to someone else, do you narrate the whole three-hour script or just summarise the most important bits? It's the same with the Attention Mechanism: it focuses on the most important bits of information.

Where does Attention find its use?

The primary use case for the Attention Mechanism is in Natural Language Processing tasks, which also happens to be why it was invented. Translating sentences from one language to another often poses the challenge of remembering context over long stretches of text, and Attention enables exactly that.

Lately, however, it has found its use in Computer Vision tasks as well, enabling the model to focus on the pixels that add the most value to the final prediction. We'll discuss this more when we cover Computer Vision topics.

How does Attention even work?

The core idea behind an attention mechanism is relatively straightforward: it allows a model to automatically focus on the most relevant parts of the input for making a decision or prediction.

Each part of the input gets its own weight. These weights are called attention weights, and they tell the model how much focus each part of the input deserves. How they are computed depends on the kind of Attention Mechanism used (more on that in just a sec), but the gist is the same.

These weights are then used to build a vector representation of the input called the context vector: a weighted sum of the input representations, where the attention weights decide how much each part contributes. The model uses this vector to focus on the most relevant parts of the input.
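To make this concrete, here's a minimal sketch of the score-then-weight-then-sum idea using plain dot-product scoring in NumPy. The names (`encoder_states`, `query`) and the toy numbers are illustrative assumptions, not from any particular library or paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical encoder outputs: 4 input positions, each a 3-dim vector.
encoder_states = np.array([
    [0.1, 0.9, 0.2],
    [0.8, 0.1, 0.5],
    [0.3, 0.4, 0.9],
    [0.7, 0.2, 0.1],
])

# A query vector, e.g. the decoder's current state.
query = np.array([0.9, 0.1, 0.4])

# Score each input position against the query (dot product),
# then normalise the scores into attention weights that sum to 1.
scores = encoder_states @ query
weights = softmax(scores)

# The context vector is the attention-weighted sum of the inputs.
context = weights @ encoder_states

print(weights.round(3))  # one weight per input position
print(context.round(3))  # a single summary vector
```

Note how the second input position, whose vector points most in the direction of the query, ends up with the largest weight and therefore dominates the context vector.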

There are two primary forms of Attention: Global and Local. The names are largely self-explanatory: Global assigns attention weights across the entirety of the input, while Local assigns attention weights only to a specific part of it. There are many more variants, and each deserves its own article, perhaps once a few more concepts are covered.
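The global/local distinction can be sketched in a few lines. This is one simple way to model it, assuming hypothetical per-position scores and a fixed window for the local variant (real local attention usually predicts the window position; that's skipped here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical relevance scores, one per input position.
scores = np.array([1.2, 0.3, 2.0, 0.5, 0.9, 0.1])

# Global attention: every position receives a weight.
global_weights = softmax(scores)

# Local attention (a simple masked variant): only positions inside a
# window around a chosen centre receive weight; the rest are masked out
# with -inf so softmax assigns them exactly zero.
centre, half_width = 2, 1
mask = np.full_like(scores, -np.inf)
mask[max(0, centre - half_width):centre + half_width + 1] = 0.0
local_weights = softmax(scores + mask)

print(global_weights.round(3))  # all six positions weighted
print(local_weights.round(3))   # only positions 1..3 weighted
```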

(I am skipping over the architecture for now. I'll bring it up once Transformers are covered in this series.)