Know Your LLM - Part1
- Shashank Shekhar
- Mar 11, 2024
- 6 min read

Language models have undergone a transformative evolution in recent years, leading to the development of Large Language Models (LLMs) that have redefined the capabilities of natural language processing (NLP) systems. The journey to LLMs has been marked by groundbreaking research and technical innovations, with several key papers paving the way for their creation. This article explores the foundational papers that have shaped the landscape of LLMs, from early models like recurrent neural networks (RNNs) to the revolutionary transformer-based architectures that underpin modern LLMs like GPT-3.
Neural Machine Translation by Jointly Learning to Align and Translate (2014) by Bahdanau, Cho, and Bengio, https://arxiv.org/abs/1409.0473
The key architecture change proposed in this paper was the addition of an attention mechanism, which allows the model to focus on different parts of the input sequence when generating the output sequence. In traditional RNN-based machine translation models, the entire input sequence is encoded into a fixed-length vector (often referred to as the "context vector") using the final hidden state of the RNN. This fixed-length representation must then capture all the relevant information from the input sequence, which can be challenging for long input sequences or when there are long-range dependencies between words.
The attention mechanism (soft attention) introduced in the paper addresses this issue by allowing the model to dynamically focus on different parts of the input sequence at each step of the output generation process. At every output step, the model computes one attention weight per input position; these weights sum to 1 and determine how much each input position contributes to the context used to generate the output at that step.
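The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the parameter names `W_s`, `W_h`, and `v`, the toy dimensions, and the random values are all assumptions made for the example. The scoring function is the additive form `v . tanh(W_s s + W_h h_j)`, where `s` is the previous decoder state and `h_j` is the encoder state at input position `j`.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_attention(dec_state, enc_states, W_s, W_h, v):
    """One decoding step of additive (soft) attention.

    dec_state:  previous decoder hidden state, length d_dec
    enc_states: list of T encoder hidden states, each of length d_enc
    W_s, W_h, v: parameters of the small alignment network (illustrative names)
    """
    proj_s = matvec(W_s, dec_state)  # same for every input position
    scores = []
    for h in enc_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # one weight per input position, summing to 1
    d_enc = len(enc_states[0])
    # Context vector: weighted sum of the encoder hidden states.
    context = [sum(w * h[k] for w, h in zip(weights, enc_states)) for k in range(d_enc)]
    return context, weights

def random_matrix(rows, cols, rng):
    """Toy parameter initialisation for the example only."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
enc_states = random_matrix(T, d_enc, rng)
dec_state = [rng.gauss(0, 1) for _ in range(d_dec)]
W_s = random_matrix(d_att, d_dec, rng)
W_h = random_matrix(d_att, d_enc, rng)
v = [rng.gauss(0, 1) for _ in range(d_att)]

context, weights = additive_attention(dec_state, enc_states, W_s, W_h, v)
```

Because the weights are recomputed from the current decoder state at every output step, the context vector changes from step to step instead of being a single fixed summary of the input.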
By allowing the model to focus on different parts of the input sequence as needed, the attention mechanism enables the model to better capture long-range dependencies and improve the quality of the translation. This has led to significant improvements in machine translation performance compared to traditional RNN-based models.
[Figure: the decoder architecture]
Attention Is All You Need (2017) by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, https://arxiv.org/abs/1706.03762
This paper introduced the Transformer, which has since become a foundational architecture in natural language processing. The Transformer is based entirely on attention mechanisms and uses no recurrent or convolutional layers. It consists of an encoder and a decoder, each a stack of identical layers. Each encoder layer has two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization. Each decoder layer has the same two sub-layers, with its self-attention masked so that a position cannot attend to later positions, plus a third sub-layer that performs multi-head attention over the encoder's output. This third sub-layer lets the decoder focus on different parts of the input sequence when generating the output sequence.
The key innovation of the Transformer is its attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when generating each output token. The attention mechanism is based on scaled dot-product attention: the dot product of a query vector with every key vector is divided by the square root of the key dimension, a softmax turns the resulting scores into attention weights, and the output is the weighted sum of the value vectors.
Multi-head attention: To capture different aspects of the input sequence, the Transformer runs multiple attention heads in parallel. Each head applies its own learned projections to the queries, keys, and values and learns a different attention distribution over the input sequence. The outputs of the heads are concatenated and linearly transformed before being passed to the next layer.
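As a rough sketch of scaled dot-product attention, the following plain-Python function computes softmax(QK^T / sqrt(d_k)) V for small lists of vectors. The toy queries, keys, and values are made up for illustration, and the learned projection matrices that a real attention head applies first are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output: attention-weighted sum of the value vectors.
        out = [sum(w * val[j] for w, val in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 2 queries attending over 3 key/value pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = scaled_dot_product_attention(queries, keys, values)
```

In this example the first query is equally similar to the first and third keys, so those two positions receive equal weight and the second receives less. A multi-head layer would run several such functions side by side on differently projected inputs and concatenate the results.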
Since the Transformer has no built-in notion of word order, positional encodings are added to the input embeddings to convey the position of each token in the sequence. Like earlier sequence-to-sequence models with attention, the Transformer is trained to minimize the negative log-likelihood of the target sequence given the source sequence.
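To make the positional-encoding step concrete, here is a small sketch of the fixed sinusoidal encoding used in the Transformer paper, where even dimensions use a sine and odd dimensions a cosine of pos / 10000^(2i / d_model). The function names and toy sizes are illustrative; learned position embeddings are a common alternative that this sketch does not cover.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional encoding to a list of token embeddings, element-wise."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, pe)]

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)

# Identical token embeddings become distinguishable once positions are added.
same_token = [[0.5] * 6 for _ in range(4)]
with_positions = add_positions(same_token)
```

Without this step, the same token at two different positions would present the attention layers with exactly the same input vector.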
The Transformer model has shown state-of-the-art performance on various natural language processing tasks, including machine translation, text summarization, and text generation. It has inspired many subsequent models, including BERT, GPT, and T5.
Note: The attention mechanism introduced in the RNN paper and the one in the Transformer paper are different. The mechanism in the RNN paper is commonly described as "soft" (or "global") attention. It computes a weighted sum of all the encoder hidden states, where the weights are determined by a separate alignment model, a small feed-forward network that learns to align the source and target sequences. This allows the model to focus on different parts of the source sequence when generating each target token.
In contrast, the Transformer uses scaled dot-product attention, applied as self-attention within the encoder and the decoder and as encoder-decoder attention between them. Self-attention computes attention weights directly between all pairs of positions in a sequence, allowing the model to capture dependencies between tokens regardless of how far apart they are. Self-attention can therefore be computed in parallel for all positions, whereas the RNN attention must be computed step by step, because the alignment weights at each output step depend on the previous decoder hidden state. An RNN also carries positional information implicitly through its recurrence, whereas the Transformer uses positional encodings to explicitly encode the position of each token, allowing the model to learn positional relationships between tokens.
The Transformer's attention also has more moving parts than the attention in the RNN model. It involves multiple attention heads, each with its own learned linear projections, whereas the RNN model uses a single alignment network.
On Layer Normalization in the Transformer Architecture (2020) by Xiong, Yang, He, K Zheng, S Zheng, Xing, Zhang, Lan, Wang, and Liu, https://arxiv.org/abs/2002.04745
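This paper analyzes the placement of layer normalization within each Transformer layer and proposes a small change: moving it from after the residual addition (Post-LN, as in the original Transformer) to the input of each sub-layer (Pre-LN). The authors show that Pre-LN has well-behaved gradients at initialization, which allows training without the learning-rate warm-up stage.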
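In code terms, the change is only a reordering of three operations. The original Transformer computes LayerNorm(x + Sublayer(x)) (Post-LN), while the variant studied in this paper computes x + Sublayer(LayerNorm(x)) (Pre-LN). The sketch below is a simplification under stated assumptions: layer normalization is shown without its learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for self-attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def post_ln(x, sublayer):
    """Original Transformer ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    """Pre-LN ordering: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for self-attention or the feed-forward network."""
    return [0.5 * xi for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln(x, toy_sublayer)
pre_out = pre_ln(x, toy_sublayer)
```

In the Pre-LN ordering the input `x` reaches the output through the residual path without being normalized, so the identity connection stays intact across the whole stack.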
Universal Language Model Fine-tuning for Text Classification (2018) by Howard and Ruder, https://arxiv.org/abs/1801.06146
This paper introduces a method for fine-tuning pre-trained language models for text classification. The authors propose a three-stage training process called ULMFiT (Universal Language Model Fine-tuning). First, a language model (LM) is pre-trained on a large general-domain corpus with an unsupervised objective: predicting the next word. Second, the pre-trained LM is fine-tuned on the text of the target task's dataset, still with the language-modelling objective, so that it adapts to the target domain without needing labels. Finally, classifier layers are added on top of the fine-tuned LM, and the model is fine-tuned on the labelled target data with a supervised objective, such as predicting the sentiment of a text or classifying it into categories. In this last stage the LM's layers are not kept frozen; they are unfrozen gradually, as described below.
The authors introduce discriminative fine-tuning, where different layers of the LM are fine-tuned with different learning rates. Because lower layers (closer to the input) capture more general features, they receive lower learning rates, while higher layers receive higher ones; the paper divides the learning rate by 2.6 for each layer below the last. They also use slanted triangular learning rates, a schedule in which the learning rate increases linearly for a short initial fraction of the iterations and then decays linearly over the rest. This helps the model converge quickly to a suitable region of parameter space and then refine its parameters. Instead of fine-tuning all layers at once, the authors gradually unfreeze them, starting from the last layer and moving towards the first. This lets the model retain the knowledge learned during pre-training while adapting to the new task.
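The two learning-rate techniques can be written down compactly. The sketch below follows the formulas in the ULMFiT paper: each layer's learning rate is the rate of the layer above it divided by 2.6, and the slanted triangular schedule rises linearly for a fraction `cut_frac` of the steps and then decays linearly. The defaults (`max_lr=0.01`, `cut_frac=0.1`, `ratio=32`) are the values the paper reports generally using. The function names and the `max(1, ...)` guard are additions made for this example.

```python
import math

def discriminative_learning_rates(top_lr, num_layers, factor=2.6):
    """Per-layer learning rates; index 0 is the layer closest to the input."""
    return [top_lr / factor ** (num_layers - 1 - layer) for layer in range(num_layers)]

def slanted_triangular_lr(t, total_steps, max_lr=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear increase, then long linear decay."""
    # Step at which the rate peaks; the max(1, ...) guard avoids division by
    # zero for very short runs and is not part of the paper's formula.
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    # ratio controls how much smaller the lowest rate is than max_lr.
    return max_lr * (1 + p * (ratio - 1)) / ratio

layer_lrs = discriminative_learning_rates(top_lr=0.01, num_layers=4)
schedule = [slanted_triangular_lr(t, total_steps=100) for t in range(100)]
```

With 100 steps the schedule starts at `max_lr / ratio`, peaks at `max_lr` on step 10, and then declines for the remaining 90 steps, while `layer_lrs` grows from the input layer to the last layer.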
Note: ULMFiT and Transformer-based models such as BART, T5, and GPT share the idea of task-specific fine-tuning, but their approaches differ. Apart from the base model itself (ULMFiT is built on an LSTM, a recurrent network, whereas the others are Transformers), the differences are: 1) Pre-training objective. ULMFiT pre-trains the language model with a standard unsupervised objective, predicting the next word, and GPT uses the same left-to-right objective on a Transformer decoder. Masked language modelling (MLM), where a certain percentage of input tokens is masked and the model is trained to predict them, is the objective used by BERT, not GPT; BART and T5 use related denoising objectives that corrupt the input and train the model to reconstruct it. 2) Fine-tuning stages. ULMFiT employs a two-step fine-tuning process, where the language model is first fine-tuned on the target task's text and then used as the base for a classifier that is further fine-tuned. In contrast, GPT typically fine-tunes the entire model end-to-end on the target task, without a separate language-model fine-tuning step. 3) Fine-tuning techniques. ULMFiT introduces discriminative fine-tuning (different learning rates for different layers) and gradual unfreezing (unfreezing layers gradually during fine-tuning), whereas GPT-style fine-tuning typically updates all layers together.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) by Devlin, Chang, Lee, and Toutanova, https://arxiv.org/abs/1810.04805
Following the original Transformer, large language model research began to split in two directions: encoder-style Transformers for predictive modelling tasks such as text classification, and decoder-style Transformers for generative modelling tasks such as translation, summarization, and other forms of text creation. BERT belongs to the first group: it is a stack of Transformer encoder layers, and the same architecture is used for both pre-training and fine-tuning. The paper introduced two pre-training objectives: masked-language modelling, in which a fraction of the input tokens is hidden and the model predicts them from both left and right context, and next sentence prediction, in which the model predicts whether the second of two sentences follows the first in the original text. BERT uses WordPiece tokenization, which breaks words into subwords and so lets the model handle words outside its vocabulary.
After pre-training, BERT can be fine-tuned on specific downstream tasks such as text classification, question answering, and named entity recognition. Fine-tuning involves adding a task-specific output layer to the pre-trained BERT model and updating the model's weights on a task-specific dataset.
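The masking step of masked-language modelling can be sketched as follows. The rates come from the BERT paper: 15% of tokens are selected for prediction, and of those 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. The rest is an assumption made for illustration: the function name, the whitespace "tokenizer", the tiny vocabulary, and the independent per-token coin flip used to approximate the 15% selection.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (corrupted_tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

rng = random.Random(0)
tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))

corrupted, labels = mask_tokens(tokens, vocab, rng=rng)
```

The training loss is computed only at positions whose label is not `None`. Because the model sees the whole corrupted sequence at once, it can use context on both sides of each hidden token.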
Conclusion: We covered how soft attention removed the fixed-length bottleneck in neural machine translation, how the Transformer builds a highly parallelizable architecture entirely from attention, and how task-oriented fine-tuning (ULMFiT) and masked-language modelling (BERT) produce representations that take a token's context into account. These papers laid the foundation for the generative models that followed, which we will cover in the next article.







