Tiny Transformer for Next-Token Prediction

Overview

Implemented a transformer model for next-token prediction on 581K Shakespeare tokens, systematically analyzing 7 variants to understand how each hyperparameter affects performance.

Key Results

  • 3.69 perplexity with vocab size 260 (51% improvement over baseline)
  • Baseline: 7.53 perplexity with vocab size 500
  • Raising the learning rate (0.001 → 0.01) improved perplexity by 27% at no extra cost
  • Batch size trade-off: smaller batches (128) performed well but trained 2.3× longer

Architecture

  • 2 transformer blocks with single-head self-attention
  • RMSNorm, causal masking, residual connections
  • 493,940 total parameters
  • Trained on Tesla T4 GPU (~13 min/model)
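The components above can be sketched in PyTorch. This is a minimal reconstruction, not the project's exact code; the embedding width, FFN width, and sequence length shown are assumptions (the best-variant vocab of 260 and the 2-block / single-head layout come from the write-up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class Block(nn.Module):
    """One transformer block: single-head causal self-attention + FFN."""
    def __init__(self, dim, ffn_dim):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = RMSNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # Causal mask: each position attends only to itself and the past.
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(att)           # residual connection
        x = x + self.ffn(self.norm2(x))  # residual connection
        return x

class TinyTransformer(nn.Module):
    # dim, ffn_dim, and seq here are illustrative defaults.
    def __init__(self, vocab=260, dim=128, ffn_dim=512, n_blocks=2, seq=50):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq, dim)
        self.blocks = nn.Sequential(*[Block(dim, ffn_dim)
                                      for _ in range(n_blocks)])
        self.norm = RMSNorm(dim)
        self.head = nn.Linear(dim, vocab)  # next-token logits

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        return self.head(self.norm(self.blocks(x)))
```

Given a batch of token indices of shape (B, T), the model returns (B, T, vocab) logits, one next-token distribution per position.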

Systematic Hyperparameter Analysis

Variant        Key Change   Perplexity   Insight
v1 (baseline)  -            7.527        Strong baseline, minimal overfitting
v2             LR: 0.01     5.460        Higher LR finds a better solution
v3             Batch: 128   5.366        Better, but 2.3× slower
v5             Vocab: 260   3.690        Best: optimal vocab for dataset
v4             Seq: 25      8.339        Shorter context hurts
v6             FFN: 256     9.278        Reduced capacity degrades quality
v7             Embed: 64    10.851       Insufficient embedding space
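For reference, the perplexity figures in the table are the exponential of the mean per-token cross-entropy loss (in nats), so each value maps directly back to a training loss:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean per-token cross-entropy, in nats)."""
    return math.exp(mean_nll)

# The best variant's 3.69 perplexity corresponds to a mean loss of
# ln(3.69) ≈ 1.306 nats per token.
loss = math.log(3.69)
ppl = perplexity(loss)
```

Intuitively, a perplexity of 3.69 means the model is, on average, as uncertain as if it were choosing uniformly among about 3.7 tokens at each step.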

Technical Stack

Python • PyTorch • Hugging Face Tokenizers • Google Colab

What I Learned

The biggest surprise was vocabulary size: reducing it from 500 to 260 cut perplexity by 51%, because with a smaller vocabulary each token appeared more frequently, letting the model learn robust embeddings rather than spreading capacity across rare tokens. This taught me that matching model capacity to dataset characteristics matters more than simply maximizing parameters. The attention visualization revealed hierarchical patterns in which certain tokens act as information hubs, providing insight into what transformers actually learn.
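The vocab-size experiment comes down to how the tokenizer is trained. A sketch using Hugging Face Tokenizers follows; the BPE model, whitespace pre-tokenizer, and the tiny inline corpus are assumptions for illustration (only the vocab_size=260 target comes from the results above):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Cap the merge vocabulary at 260, the best-performing size above.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=260, special_tokens=["[UNK]"])

# Stand-in corpus; the project trained on the full Shakespeare text.
corpus = ["To be, or not to be", "that is the question"]
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("to be or not").ids  # token ids under the new vocab
```

With a tighter vocab, each surviving token covers more of the corpus, which is exactly why the per-token statistics become easier to learn.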

[Figure: sample transformer attention matrix]

View Code on GitHub
