Tiny Transformer for Next-Token Prediction

Overview

Implemented a transformer model for next-token prediction on 581K Shakespeare tokens, systematically analyzing 7 variants to understand how each hyperparameter affects performance.

Key Results

  • 3.69 perplexity with vocab size 260 (51% improvement over baseline)
  • Baseline: 7.53 perplexity with vocab size 500
  • Raising the learning rate (0.001 → 0.01) improved perplexity by 27% at no extra cost
  • Batch size trade-off: smaller batches (128) performed well but trained 2.3× longer

Architecture

  • 2 transformer blocks with single-head self-attention
  • RMSNorm, causal masking, residual connections
  • 493,940 total parameters
  • Trained on Tesla T4 GPU (~13 min/model)
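The components above can be sketched in PyTorch. This is a minimal reconstruction, not the project's exact code; the embedding width, FFN width, and sequence length shown are assumptions (the best-variant vocab of 260 and the 2-block / single-head layout come from the write-up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class Block(nn.Module):
    """One transformer block: single-head causal self-attention + FFN."""
    def __init__(self, dim, ffn_dim):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = RMSNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # Causal mask: each position attends only to itself and the past.
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(att)           # residual connection
        x = x + self.ffn(self.norm2(x))  # residual connection
        return x

class TinyTransformer(nn.Module):
    # dim, ffn_dim, and seq here are illustrative defaults.
    def __init__(self, vocab=260, dim=128, ffn_dim=512, n_blocks=2, seq=50):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq, dim)
        self.blocks = nn.Sequential(*[Block(dim, ffn_dim)
                                      for _ in range(n_blocks)])
        self.norm = RMSNorm(dim)
        self.head = nn.Linear(dim, vocab)  # next-token logits

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        return self.head(self.norm(self.blocks(x)))
```

Given a batch of token indices of shape (B, T), the model returns (B, T, vocab) logits, one next-token distribution per position.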

Systematic Hyperparameter Analysis

Variant        Key Change   Perplexity   Insight
v1 (baseline)  -            7.527        Strong baseline, minimal overfitting
v2             LR: 0.01     5.460        Higher LR finds a better solution
v3             Batch: 128   5.366        Better, but 2.3× slower
v5             Vocab: 260   3.690        Best: optimal vocab for dataset
v4             Seq: 25      8.339        Shorter context hurts
v6             FFN: 256     9.278        Reduced capacity degrades quality
v7             Embed: 64    10.851       Insufficient embedding space
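For reference, the perplexity figures in the table are the exponential of the mean per-token cross-entropy loss (in nats), so each value maps directly back to a training loss:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean per-token cross-entropy, in nats)."""
    return math.exp(mean_nll)

# The best variant's 3.69 perplexity corresponds to a mean loss of
# ln(3.69) ≈ 1.306 nats per token.
loss = math.log(3.69)
ppl = perplexity(loss)
```

Intuitively, a perplexity of 3.69 means the model is, on average, as uncertain as if it were choosing uniformly among about 3.7 tokens at each step.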

Technical Stack

Python • PyTorch • Hugging Face Tokenizers • Google Colab

What I Learned

The biggest surprise was vocabulary size: reducing it from 500 to 260 cut perplexity by 51%, because with a smaller vocabulary each token appeared more frequently, letting the model learn robust embeddings rather than spreading capacity across rare tokens. This taught me that matching model capacity to dataset characteristics matters more than simply maximizing parameters. The attention visualization revealed hierarchical patterns in which certain tokens act as information hubs, providing insight into what transformers actually learn.
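The vocab-size experiment comes down to how the tokenizer is trained. A sketch using Hugging Face Tokenizers follows; the BPE model, whitespace pre-tokenizer, and the tiny inline corpus are assumptions for illustration (only the vocab_size=260 target comes from the results above):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Cap the merge vocabulary at 260, the best-performing size above.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=260, special_tokens=["[UNK]"])

# Stand-in corpus; the project trained on the full Shakespeare text.
corpus = ["To be, or not to be", "that is the question"]
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("to be or not").ids  # token ids under the new vocab
```

With a tighter vocab, each surviving token covers more of the corpus, which is exactly why the per-token statistics become easier to learn.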

[Figure: sample transformer attention matrix]

View Code on GitHub
