In recent years, large language models (LLMs) like GPT (Generative Pre-trained Transformer) have revolutionized natural language processing by learning complex patterns from massive datasets. Before these models can generate coherent and contextually relevant text, they must first be trained on well-prepared input data. In this post, we’ll take a deep dive into one crucial aspect of training: data preprocessing, batching, and contextual chunking. We’ll explore how to split text data into train and validation sets, prepare input-target sequences for the Transformer, and efficiently batch these sequences for parallel processing—all vital steps in the LLM development process.
Note: This post is inspired by Andrej Karpathy's video "Let's build GPT: from scratch, in code, spelled out.", which provides an accessible yet rigorous walkthrough of building a GPT model. For additional background on Transformers, see Vaswani et al.'s seminal paper Attention Is All You Need (2017).
Introduction
Modern LLMs are built on the Transformer architecture, which processes text data in fixed-size segments or "blocks." Rather than feeding an entire text (e.g., Shakespeare’s complete works) into the model—which would be computationally prohibitive—we work with smaller chunks of data. This not only makes training more manageable but also encourages the model to learn to generate text given varying context lengths. In a typical development process for an LLM, the data preparation pipeline is one of the first stages, ensuring that the model sees a representative sample of language patterns while also guarding against overfitting by setting aside validation data.
The steps we explore here include:
Splitting the dataset into training and validation portions.
Sampling fixed-length sequences from the training data.
Forming input-target pairs where the model learns to predict the next token.
Batching these pairs for efficient parallel processing during training.
Understanding these steps is critical, as they lay the groundwork for the subsequent stages of model architecture design and training.
Splitting Data: Training vs. Validation
Before training, it is standard practice to partition the dataset into two segments:
Training Set (90% of data): Used for learning the patterns in the text.
Validation Set (10% of data): Held out to evaluate the model’s performance on unseen data, thereby helping to monitor and prevent overfitting.
This separation is essential. If a model only memorizes the training data, it won’t generalize well to new inputs. The validation set serves as a check on this behavior.
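The splitting code that follows assumes the raw text has already been encoded into a one-dimensional tensor of token IDs named data. A minimal sketch of that encoding step, assuming a character-level vocabulary built from a local input.txt file (the file name is only a placeholder), might look like this:
import torch

# Read the raw text (e.g., Shakespeare's complete works) from disk
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Build a character-level vocabulary and map each character to an integer ID
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]

# Encode the full text as a 1-D tensor of token IDs
data = torch.tensor(encode(text), dtype=torch.long)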
In our example code:
n = int(0.9 * len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
Here, the first 90% of the data is assigned to train_data, while the remaining 10% becomes val_data.
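As a quick sanity check, you can print the sizes of the two splits; the exact counts depend on the dataset, but the two pieces should add up to the original length:
print(len(data), len(train_data), len(val_data))  # total, ~90%, ~10%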
Chunking the Data: Context and Target Sequences
Training a Transformer involves teaching it to predict the next token in a sequence given a context. However, instead of processing an entire long text, we break it down into manageable chunks defined by a block size. In this context:
Block Size: Maximum context length (e.g., 8 characters).
Input (x): The first block_size tokens.
Target (y): The next sequence of tokens, shifted by one position relative to the input.
The reason for creating chunks with one extra token (i.e., block_size + 1 tokens) is to generate multiple training examples from a single chunk. Each position in the chunk (except the last one) serves as the context for predicting the subsequent token.
For example, given a sequence of 9 characters (with block_size equal to 8):
At time step 0: Input is the first character, and the target is the second.
At time step 1: Input is the first two characters, and the target is the third.
And so on, up to time step 7, where all eight characters form the input and the ninth is the target.
This approach ensures the Transformer learns to predict the next token with varying lengths of context, which is crucial for effective inference—especially when the model is provided with limited context during text generation.
The code snippet below illustrates this idea:
block_size = 8
x = train_data[:block_size]      # first 8 tokens: the input context
y = train_data[1:block_size+1]   # next 8 tokens (offset by 1): the targets
for t in range(block_size):
    context = x[:t+1]            # all tokens up to and including position t
    target = y[t]                # the token the model should predict next
    print(f"when input is {context} the target: {target}")
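Note that a single chunk of block_size + 1 = 9 tokens therefore yields block_size = 8 training examples, one for each time step from 0 to 7.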
Batching: Parallel Processing for Efficiency
Training on modern GPUs requires feeding data in batches to maximize parallel computation. Batching involves stacking several independent chunks together. Each chunk is processed independently by the model, allowing for efficient utilization of GPU resources.
Key parameters in batching:
Batch Size: Number of independent sequences processed in parallel (e.g., 4).
Block Size: The maximum context length for each sequence (e.g., 8).
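The function shown next assumes torch has been imported and both parameters defined; a minimal setup might look like the following (the seed value is arbitrary and only makes the random sampling reproducible):
import torch

torch.manual_seed(1337)  # arbitrary fixed seed for reproducible sampling
batch_size = 4           # independent sequences per batch
block_size = 8           # maximum context length, as defined above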
The batching function randomly samples starting indices from the training (or validation) data and constructs batches of input-target pairs:
def get_batch(split):
    # choose which split to sample from
    data = train_data if split == 'train' else val_data
    # random starting indices; the upper bound leaves room for a full block plus the one-token-shifted target
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])      # inputs: (batch_size, block_size)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets: same shape, shifted by one
    return x, y
This code:
Randomly selects starting positions (ix) for each sequence in the batch.
Extracts a chunk of length block_size for inputs (x) and a shifted chunk for targets (y).
By printing the batch dimensions and individual contexts, one can verify that the batching mechanism creates the correct input-target pairs:
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape) # Expected shape: (batch_size, block_size)
print(xb)
print('targets:')
print(yb.shape) # Expected shape: (batch_size, block_size)
print(yb)
for b in range(batch_size):      # iterate over the batch dimension
    for t in range(block_size):  # iterate over the time dimension
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}")
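With batch_size = 4 and block_size = 8, a single batch therefore contains 4 × 8 = 32 independent next-token prediction examples, all handled in one forward pass.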
Step-by-Step Code Walkthrough
Let’s summarize the code workflow to consolidate understanding:
Dataset Splitting:
Calculate the index n that splits the dataset into 90% training and 10% validation.
Create train_data and val_data accordingly.
Defining Block and Batch Sizes:
Set block_size (e.g., 8) and batch_size (e.g., 4).
These parameters control the sequence length for context and the number of sequences processed in parallel.
Creating Input-Target Pairs:
For a given sequence of tokens, the input x is taken as the first block_size tokens.
The target y is the same sequence shifted by one token, making it a prediction problem for each position in x.
Batching Function (get_batch):
Randomly samples indices from the dataset, ensuring there is enough room for a full block.
Stacks multiple sequences into a batch for simultaneous processing.
Verification:
Print out the shapes and individual sequence contexts to ensure correctness.
Demonstrates that each batch element is processed independently yet follows the same input-target construction pattern.
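Pulling these steps together, one possible end-to-end sketch is shown below; it reuses the character-level encoding assumption from earlier (input.txt and the seed value are placeholders, not requirements):
import torch

# Load the raw text and build a character-level vocabulary
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]

# Encode the full text as a 1-D tensor of token IDs
data = torch.tensor(encode(text), dtype=torch.long)

# 90/10 train/validation split
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# Context length and number of sequences per batch
block_size = 8
batch_size = 4
torch.manual_seed(1337)  # arbitrary fixed seed for reproducible sampling

def get_batch(split):
    # sample random (input, target) chunks from the chosen split
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print(xb.shape, yb.shape)  # both (batch_size, block_size)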
Conclusion
Data preparation is a foundational step in developing LLMs like GPT. The techniques discussed—splitting data into training and validation sets, creating context-target pairs through sequence chunking, and batching these sequences—are critical for efficient model training. Not only do these processes ensure that the model learns from a diverse set of examples, but they also facilitate effective GPU utilization during training.
By carefully constructing the data pipeline, we set the stage for the subsequent steps in model development, such as designing the Transformer architecture, optimizing the training loop, and eventually generating coherent text. For further reading on these techniques and the underlying theory, consult:
Vaswani et al. (2017): Attention Is All You Need
Radford et al. (2018): Improving Language Understanding by Generative Pre-Training (OpenAI)
This deep dive into data batching and sequence chunking is just one piece of the puzzle in building a GPT-like model from scratch. Stay tuned for more posts on model architecture, training strategies, and optimization techniques in the journey toward building state-of-the-art language models.
Happy coding!