Transformers: The hottest word of the century!
"Prometheus stole fire from the gods and gifted it to humanity," reads an excerpt from the movie Oppenheimer. Since then, the course of humanity has been forever altered.
Now, let's shift gears to the flashy year of 2023. It's as if the year has burst into a buzzword blossom: ChatGPT, Llama, Dolly, Gorilla LLM, and many more that sound like a zoo party. The big question on everyone's mind: are these innovations a boon or a bane? Only time will tell. These technological marvels have performed the cha-cha with our perceptions, and who knows, we might just be a stone's throw away from conjuring creatures smarter than the average human.
All these technologies are Large Language Models (or LLMs) that stem from the Transformer architecture. Now, one may wonder: what is a Transformer?
Transformers are like superstar brain boosters for computers in the world of language! They're special models designed to understand and generate human-like text. Just imagine having a genius friend who's not only awesome at understanding languages but also brilliant at predicting what comes next in a conversation. Transformers use their smarts to break down sentences into tiny pieces, grasp their meanings, and then piece everything together flawlessly. They've transformed how computers handle language tasks, making them the rockstars of Natural Language Processing (NLP)!
Here, we will focus on the Transformer architecture used commonly for Language Translation.
Architecture Overview:
Data inputs for both the Encoder and Decoder, each of which contains:
Embedding layer
Position Encoding layer
The Encoder stack contains several Encoders (six in the original Transformer paper). Each Encoder contains:
Multi-Head Attention layer
Feed-forward layer
The Decoder stack contains several Decoders. Each Decoder contains:
Two Multi-Head Attention layers (one Masked, the other Cross)
Feed-forward layer
Output, which generates the final output and contains:
Linear layer
Softmax layer
Some points need to be kept in mind before we delve into the intricacies of the Transformer architecture. These are:
Each word of a sentence is tokenized, and a start token <SOS> and an end token <EOS> are added to the beginning and end of the sentence. Say, for example, there is a sentence: "I love to play football." The sentence is broken down into "I", "love", "to", "play", "football", and ".". After appending <SOS> and <EOS>, the whole sentence becomes <SOS>, "I", "love", "to", "play", "football", ".", <EOS>. These tokens are then converted into numbers according to the vocabulary, for example: [0, 2, 3, 4, 5, 6, 7, 1].
A sentence can be of any length. To tackle the problem of variable sentence lengths, padding is performed. Padding ensures every sentence is the same length, making computations smooth and efficient. Say, for example, there is a dataset whose maximum sentence length is 10 (including the <SOS> and <EOS> tokens). For the sentence "I like traveling", the padded sequence becomes <SOS>, "I", "like", "traveling", <EOS>, <PAD>, <PAD>, <PAD>, <PAD>, <PAD>.
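To make this concrete, here is a minimal Python sketch of tokenization and padding. The vocabulary and token IDs below are invented purely for illustration.

```python
# A toy vocabulary; the IDs are made up for illustration only.
vocab = {"<SOS>": 0, "<EOS>": 1, "<PAD>": 2, "I": 3, "like": 4, "traveling": 5}

def encode(sentence, max_len=10):
    tokens = ["<SOS>"] + sentence.split() + ["<EOS>"]   # add start and end tokens
    tokens += ["<PAD>"] * (max_len - len(tokens))       # pad up to the fixed length
    return [vocab[t] for t in tokens]                   # map tokens to vocabulary IDs

print(encode("I like traveling"))   # [0, 3, 4, 5, 1, 2, 2, 2, 2, 2]
```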
Embedding Layer
Remember when we read about position vectors in Physics, where we could imagine the positions or velocities of objects in 3D space? The same kind of transformation is done by the Embedding Layer of a Transformer. Since transformers are algorithms that don't understand raw text, this layer converts the tokens into meaningful vectors that the Transformer can work with. For example, the word 'cat' might be encoded into a vector like [0.3, 0.1, 0.99] if the embedding dimension is 3.
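As a rough sketch, an embedding layer can be thought of as a lookup table from token IDs to vectors. The random values below stand in for the learned weights of a real model, and the sizes are illustrative only.

```python
import numpy as np

vocab_size, d_model = 6, 3
# In a real Transformer these vectors are learned during training;
# here they are random placeholders.
embedding_table = np.random.randn(vocab_size, d_model)

token_ids = np.array([0, 3, 4, 5, 1])      # e.g. <SOS> "I" "like" "traveling" <EOS>
embedded = embedding_table[token_ids]       # shape (5, 3): one vector per token
print(embedded.shape)
```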
Positional Encoding
Unlike Recurrent Neural Networks or Long Short-Term Memory networks (LSTMs) that take words sequentially, transformers take all the words as inputs in parallel, which increases their speed and their ability to look at the whole sentence in one go. But the position of words in a sentence is important to convey its meaning, right? This is made possible by the Positional Encoding Layer, which is computed on each input sequence independently.
The positional encoding is added to the token embeddings before feeding them into the transformer's self-attention layers. This allows the model to consider the order of tokens within the sequence while maintaining the benefits of parallelization and attention-based processing.
The positional encodings are calculated by:
$$PE_{(pos,\ 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos,\ 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Where:
pos is the position of the token in the sequence
i is the dimension index in the positional encoding vector
d_model is the dimensionality of the model's embeddings
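The formulas above translate almost directly into code. Below is a small NumPy sketch (assuming an even d_model) that builds the sinusoidal encodings and hints at how they are added to the embeddings.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings following the formulas above (even d_model assumed)."""
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)     # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=10, d_model=4)
# The encodings are simply added to the token embeddings:
# x = embedded + pe[:embedded.shape[0]]
```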
Squad, Attention!
Imagine you're at a concert, and there's a spotlight that moves around, highlighting different musicians as they play solos. Well, attention in Transformers works kind of like that spotlight, but for words in sentences!
When transformers process language, they don't treat all words equally. Just like that spotlight, they focus more on certain words that are important for understanding the context. These important words get more attention.
For instance, if you're reading "The cat sat on the mat," attention helps the transformer understand that "cat" and "mat" are connected because they're part of the same idea. It's like the spotlight of attention shines on those words, making sure the transformer gets how they fit together.
This way, transformers can catch the relationships between words and understand sentences better. So, attention is like the star feature that lets transformers know where to look and what to focus on when they're figuring out the meaning of words and how they connect.
Self-Attention Mechanism:
At a high level, self-attention allows each position in an input sequence to focus on other positions when computing its output representation. This is done by calculating weighted sums of the values associated with each position, where the weights are determined by similarity scores between positions. These similarity scores are obtained through the dot product of the query and key vectors associated with each position.
Here's a breakdown of the main components of self-attention:
Query, Key, and Value Vectors: For each word in the input sequence, the transformer creates three vectors: the query vector (Q), the key vector (K), and the value vector (V). These vectors are linear transformations of the input features and are used to compute the similarity scores.
Similarity Scores: The similarity between a query vector and a key vector is computed using the dot product, scaled by the square root of d_k to control the magnitude, where d_k is the dimension of the key vectors. This helps stabilize the gradients during backpropagation, leading to efficient optimization of the architecture. The dot product represents the "attention" the query pays to the key. The similarity scores are then passed through a softmax function to obtain attention weights that sum up to 1.
Weighted Sum: The attention weights are used to compute a weighted sum of the value vectors. This step combines the contextual information from other positions in the input sequence, emphasizing more relevant positions.
$$\text{self attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+M\right)$$
$$\text{Contextually aware } V = \text{self attention} \cdot V$$
Where:
Q is the query vector
K is the key vector
V is the value vector
d_k is the dimension of the key vectors
M is the masking matrix, not required in the Encoder but essential in the Decoder
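Putting the formula into code, here is a minimal NumPy sketch of scaled dot-product self-attention. The shapes and random weight matrices are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k) + M) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity scores
    if mask is not None:
        scores = scores + mask                # M: 0 for allowed, -inf for masked positions
    weights = softmax(scores)                 # attention weights; each row sums to 1
    return weights @ V                        # contextually aware values

# Toy example: 4 tokens with embedding dimension 8
x = np.random.randn(4, 8)
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
print(self_attention(x @ Wq, x @ Wk, x @ Wv).shape)   # (4, 8)
```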
Multi-Head Attention Mechanism:
Imagine you're at the center of a buzzing brainstorm with multiple screens showing different things. Each screen focuses on a unique aspect, like colors, shapes, and patterns. That's multi-head attention in a nutshell!
In the world of machine learning, multi-head attention is like having a group of experts in a meeting, each looking at a specific part of the information. Just as your brain can concentrate on various things simultaneously, multi-head attention lets a model pay attention to different parts of input at once.
Instead of relying on a single self-attention mechanism, multiple attention mechanisms (heads) operate in parallel. Each head focuses on different aspects of the data, allowing the model to capture different types of relationships.
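A rough NumPy sketch of this idea is shown below: the model dimension is split across heads, each head attends independently, and the results are concatenated and projected back. Dimensions and weight matrices are made up for illustration.

```python
import numpy as np

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo):
    """Split d_model into heads, attend in each head in parallel, then recombine."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    def split(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # per-head softmax
    heads = weights @ Vh                                     # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                       # final linear projection

d_model, seq_len, num_heads = 8, 4, 2
x = np.random.randn(seq_len, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) for _ in range(4))
print(multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo).shape)   # (4, 8)
```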
Masked Multi-Head Attention Mechanism:
This is a type of Multi-Head Attention mechanism used only in Decoders to ensure words don't get context from words generated after them. In masked multi-head attention, the model wears special masks so it can't look at certain words. It's like a magical veil that hides parts of the input sequence. This is super handy when the model is generating text because it shouldn't peek at future words: no spoilers allowed!
So, just like you'd avoid peeking at the last page of a mystery novel, masked multi-head attention helps the model predict what's next without cheating. It's like solving a puzzle with only a few pieces revealed at a time.
This masked masquerade of attention keeps the model honest and sharp, allowing it to understand the context of each word without any futuristic spoilers. Think of it as a high-tech game where the model's the ultimate mystery solver, one masked glance at a time!
In the masked multi-head attention mechanism of the Decoder, all the self-attention heads apply the mask M given above.
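As a small illustration, the mask M can be built as an upper-triangular matrix of negative infinities: positions a token is not allowed to see receive -inf, which the softmax turns into zero attention weight.

```python
import numpy as np

seq_len = 5
# Each row is a query position; columns to its right (future tokens) are blocked.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(mask)
# Used inside attention as: softmax(Q K^T / sqrt(d_k) + mask)
```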
Cross Multi-Head Attention Mechanism:
Cross Multi-Head Attention Mechanism is the second type of Multi-Head Attention Mechanism used in the Decoder block of the Transformer Architecture where the model understands the relationship between the encoder inputs and the decoder inputs. This enables the decoder to capture relevant information from the input sequence while also maintaining context from its own generated tokens.
Picture this: you're translating English to French. Each attention head in the decoder not only looks at the input English words but also taps into the output French words. It's like having language interpreters who understand both sides of the conversation.
This supercharged teamwork lets the decoder capture richer relationships between the input and output, ensuring accurate translations or any other sequence-to-sequence tasks. It's like having a bilingual genius on your team, making the decoder an impressive language magician!
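The sketch below illustrates this wiring with NumPy: the queries come from the decoder's current tokens, while the keys and values come from the encoder's output. All shapes and weights are illustrative placeholders.

```python
import numpy as np

d_model = 8
encoder_output = np.random.randn(6, d_model)   # 6 source-language positions (from the Encoder)
decoder_state  = np.random.randn(4, d_model)   # 4 target-language positions generated so far

Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
Q = decoder_state @ Wq            # queries come from the decoder
K = encoder_output @ Wk           # keys come from the encoder
V = encoder_output @ Wv           # values come from the encoder

scores = Q @ K.T / np.sqrt(d_model)             # (4, 6): each target position scores every source position
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
context = weights @ V                           # (4, 8): encoder information pulled into the decoder
```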
Residual Connection:
The unidirectional arrows connecting one 'Add and Norm' operation to another, bypassing the attention module, are referred to as skip connections or residual connections. These connections play a crucial role in preventing the degradation of the neural network and ensuring the consistent flow of gradients across the network during training, which contributes to enhanced robustness in the learning process.
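In code, an "Add and Norm" step is simply the sub-layer's input added to its output, followed by layer normalization. Here is a minimal NumPy sketch (without the learnable scale and bias a full implementation would include).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    """Residual (skip) connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

x = np.random.randn(4, 8)                 # input to a sub-layer
sublayer_out = np.random.randn(4, 8)      # e.g. output of the attention sub-layer
print(add_and_norm(x, sublayer_out).shape)   # (4, 8)
```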
Position-wise Feedforward Network:
After the attention mechanism, each layer includes a position-wise feedforward neural network. This network is applied independently to each position in the sequence, adding a non-linear transformation to the representations.
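Concretely, this is just two linear layers with a ReLU in between, applied to every position independently. A small NumPy sketch with illustrative dimensions follows.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

seq_len, d_model, d_ff = 4, 8, 32      # d_ff is usually larger than d_model (2048 vs 512 in the original paper)
x = np.random.randn(seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (4, 8)
```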
Output Generation
The Linear layer of the last Decoder projects the Decoder vector into Word Scores, with a score value for each unique word in the target vocabulary, at each position in the sentence. For instance, if our final output sentence has 7 words and the target English vocabulary has 10000 unique words, we generate 10000 score values for each of those 7 words. The score values indicate the likelihood of occurrence for each word in the vocabulary in that position of the sentence.
The Softmax layer then turns those scores into probabilities (which add up to 1.0). In each position, we find the index for the word with the highest probability and then map that index to the corresponding word in the vocabulary. Those words then form the output sequence of the Transformer.
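A tiny NumPy sketch of this last step, with made-up dimensions and random weights in place of the trained Linear layer, might look as follows.

```python
import numpy as np

seq_len, d_model, vocab_size = 7, 8, 10000
decoder_output = np.random.randn(seq_len, d_model)   # vectors from the last Decoder
W_out = np.random.randn(d_model, vocab_size)         # the Linear layer's weights (learned in practice)

scores = decoder_output @ W_out                      # (7, 10000): word scores per position
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)                # softmax: each row now sums to 1.0
predicted_ids = probs.argmax(-1)                     # index of the most probable word at each position
```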
Train Hard, Train Fast!
Consider the problem of language translation from French to English. During training, a French sentence is given as input to the Encoder, and its corresponding English translation is fed into the Decoder. The softmax layer of the Decoder gives, at each position, the probabilities of the next word following the English words fed to the Decoder. The loss is calculated using a Cross Entropy Loss Function, which is minimized by backpropagation.
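For intuition, here is a toy cross-entropy computation. The predicted probabilities and target token IDs below are placeholders; in practice the probabilities come from the Decoder's softmax and the targets are the true next words of the English translation.

```python
import numpy as np

seq_len, vocab_size = 7, 10000
probs = np.full((seq_len, vocab_size), 1.0 / vocab_size)   # placeholder predictions (uniform)
targets = np.array([3, 17, 256, 42, 9, 881, 1])            # made-up IDs of the true next words

# Cross entropy: negative log-probability assigned to the correct word, averaged over positions.
loss = -np.mean(np.log(probs[np.arange(seq_len), targets] + 1e-9))
print(loss)   # backpropagation then updates the weights to reduce this value
```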
It's Show Time!
During inference, the sentence to be translated (a French sentence) is fed into the Encoder, and the start token <SOS> is fed to the Decoder. The Decoder generates the first word following <SOS>; then <SOS> together with the generated word is fed back to the Decoder to predict the next word. This continues until the Decoder produces the end token <EOS>.
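This greedy decoding loop can be sketched as follows. The encode_fn and decode_step_fn callables are hypothetical wrappers around a trained Encoder and Decoder, passed in only to keep the sketch self-contained.

```python
def greedy_decode(encode_fn, decode_step_fn, source_tokens, sos_id=0, eos_id=1, max_len=50):
    """Feed the growing output back into the Decoder until <EOS> appears.

    encode_fn and decode_step_fn are hypothetical callables wrapping a trained model.
    """
    memory = encode_fn(source_tokens)             # run the Encoder once on the source sentence
    output = [sos_id]                             # the Decoder starts with only <SOS>
    for _ in range(max_len):
        next_id = decode_step_fn(memory, output)  # ID of the most probable next token
        output.append(next_id)
        if next_id == eos_id:                     # stop once <EOS> is generated
            break
    return output
```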
A huge shoutout to the resources and the images provided by Ria Kulshrestha and Ajay Halthor for simplifying the complexities of the Transformer Architecture.
Happy Reading!