Chapter 13 described fine-tuning as adjusting every weight in a model. For a 7-billion-parameter model, that means storing and updating 7 billion floating-point numbers. In fp16 (16-bit floating point, the standard precision for training), that's 14 GB just for the model weights. But training also requires storing optimizer states -- Adam, the standard optimizer, keeps two additional copies of every parameter (the first and second moment estimates). That triples the memory to around 42 GB. Add gradient storage and activations for backpropagation, and you need 60-80 GB of GPU memory to fine-tune a 7B model.
A 70-billion-parameter model requires roughly 10x that. A consumer GPU has 8-24 GB. Even a high-end professional GPU (NVIDIA A100) has 80 GB. Full fine-tuning of frontier models is possible only with clusters of expensive hardware.
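The back-of-envelope accounting above can be written out directly. This is a sketch of the chapter's simplified model (weights, two Adam moment buffers, and gradients, all counted at the same per-parameter width; activations excluded), not a precise memory profiler:

```python
def training_memory_gb(n_params_billion, bytes_per_param=2):
    """Rough fine-tuning memory following the chapter's accounting:
    weights + 2 Adam moment buffers (m and v) + gradients, all at
    bytes_per_param (2 bytes = fp16). Activations are extra."""
    weights = n_params_billion * bytes_per_param   # model weights
    optimizer = 2 * weights                        # Adam first/second moments
    gradients = weights                            # one gradient per weight
    return weights, weights + optimizer + gradients

w7, total7 = training_memory_gb(7)
w70, total70 = training_memory_gb(70)
print(f"7B:  {w7} GB weights, {total7} GB before activations")
print(f"70B: {w70} GB weights, {total70} GB before activations")
```

The 7B case lands at 56 GB before activations, matching the 60-80 GB range once activations are added, and the 70B case is exactly 10x larger.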
This created a practical problem: most people who want to customize a model can't afford to fine-tune one. The solution came from a simple mathematical observation about what happens to weight matrices during fine-tuning.
Aghajanyan et al. (2020) made a surprising empirical discovery: when you fine-tune a pre-trained language model, the weight updates live in a much lower-dimensional subspace than the full parameter space.1 In linear algebra terms, the matrix of weight changes has low intrinsic rank.
To understand why this matters, recall what rank means. A matrix of size d x d has rank at most d -- that's the maximum number of linearly independent rows (or columns). If the rank is r, where r is much smaller than d, the matrix can be decomposed into a product of two smaller matrices: one of size d x r and one of size r x d. This is just the outer product form of rank factorization you'd see in linear algebra.
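The factorization is easy to verify numerically. The sketch below builds a matrix that is exactly rank r from two small factors, then confirms with an SVD that only r singular values are nonzero (dimensions here are illustrative, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4

# A d x d matrix that is exactly rank r: the product of a
# d x r factor and an r x d factor.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
M = B @ A

print(np.linalg.matrix_rank(M))          # 4

# The SVD recovers a rank-r factorization: keeping only the top r
# singular components reconstructs M exactly (up to float error).
U, s, Vt = np.linalg.svd(M)
M_rebuilt = (U[:, :r] * s[:r]) @ Vt[:r]
print(np.allclose(M, M_rebuilt))         # True
```

Storing M directly costs d * d = 4096 numbers; storing the two factors costs 2 * d * r = 512, an 8x saving even at this toy scale.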
The intuition for why fine-tuning updates are low-rank: the pre-trained model already "knows" almost everything it needs. Fine-tuning is making small, structured adjustments -- not random perturbations across all dimensions of the weight space. Those adjustments tend to be correlated (many weights shift in related ways), which is exactly what low rank means: the change has structure, and that structure can be captured with far fewer numbers than a full matrix.
Hu et al. (2021) turned this observation into a practical method called LoRA (Low-Rank Adaptation of Large Language Models).2 The idea is elegant: freeze the pre-trained weight matrix W and represent the fine-tuning update as a low-rank product, so the effective weight becomes W + BA, where B has size d x r, A has size r x d, and the rank r is small. Only B and A receive gradients; W never changes.
The original weight matrix W might be 4096 x 4096 -- that's about 16.8 million parameters. With LoRA at rank r = 16, the trainable parameters are B (4096 x 16) and A (16 x 4096), totaling about 131,000 parameters. That's less than 1% of the original matrix. Applied across the model, LoRA typically trains 0.1% to 1% of total parameters while achieving performance comparable to full fine-tuning.
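The parameter count in the paragraph above is a two-line calculation:

```python
d, r = 4096, 16
full = d * d            # parameters in the original weight matrix
lora = d * r + r * d    # parameters in B (4096 x 16) and A (16 x 4096)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```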
LoRA initializes A with a random Gaussian and B with zeros. This means the product BA starts as the zero matrix, so the model begins fine-tuning from exactly the pre-trained weights. As training progresses, B and A adjust to capture the needed modification. This is a clean design: at initialization, the LoRA model behaves identically to the pre-trained model, and changes are introduced gradually through gradient updates.
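The zero-start property is easy to check: with B initialized to zeros, the adapted forward pass h = Wx + BAx is identical to the base model's Wx. A minimal numpy sketch (toy dimensions, not a real training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # Gaussian init, per the LoRA paper
B = np.zeros((d, r))                     # zero init, so BA = 0

x = rng.standard_normal(d)
base_out = W @ x
lora_out = W @ x + B @ (A @ x)           # h = Wx + BAx

# At initialization BA is the zero matrix, so the adapted model
# reproduces the pre-trained model exactly.
print(np.allclose(base_out, lora_out))   # True
```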
In a transformer, the main weight matrices are in the attention layers (the query, key, value, and output projection matrices) and the feed-forward layers. The original LoRA paper found that applying low-rank updates to the attention projection matrices (particularly Q and V) was sufficient. Later work showed that applying LoRA to all linear layers gives better results, at modestly higher cost.
After training, you can merge the LoRA weights back into the base model: just compute W' = W + BA and use W' as your new weight matrix. The merged model has the same architecture and size as the original -- there's no inference overhead. Alternatively, you can keep the LoRA weights separate and swap them in and out, which lets you serve one base model with multiple task-specific adaptations.
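Merging can be sketched the same way: adding BA into W once gives a matrix of the original shape that produces identical outputs, with no extra matmuls at inference time. The trained factors below are stand-ins (random values), since only the algebra matters here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 4
W = rng.standard_normal((d, d))
B = rng.standard_normal((d, r)) * 0.1   # stand-in for trained LoRA factors
A = rng.standard_normal((r, d)) * 0.1

W_merged = W + B @ A                    # W' = W + BA, same shape as W

x = rng.standard_normal(d)
# Separate-factor and merged forward passes agree; the merged model
# carries no inference overhead.
print(np.allclose(W @ x + B @ (A @ x), W_merged @ x))  # True
```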
LoRA reduces the number of trainable parameters, but you still need to load the full base model into memory for the forward pass. For a 7B model in fp16, that's 14 GB. Dettmers et al. (2023) introduced QLoRA, which solves this by quantizing the base model to 4-bit precision before applying LoRA.3
Quantization is the process of representing numbers with fewer bits. The standard precision levels:
| Format | Bits per parameter | Memory for 7B model | Relative quality |
|---|---|---|---|
| fp32 | 32 | 28 GB | Full precision (training standard) |
| fp16 / bf16 | 16 | 14 GB | Near-lossless for most models |
| int8 | 8 | 7 GB | Minimal degradation |
| int4 / nf4 | 4 | 3.5 GB | Small degradation, depends on method |
The numbers above are for weights only. A 7B model quantized to 4-bit fits in about 3.5 GB -- well within the 24 GB unified memory of a Mac Mini M4. This is how tools like Ollama run large models locally: the model weights are quantized to 4-bit, and the remaining memory is used for activations and the KV cache during inference (more on this in Chapter 15).
QLoRA introduces a specific quantization format called NF4 (NormalFloat4), which is optimized for the fact that neural network weights tend to be normally distributed. Instead of spacing quantization levels uniformly, NF4 spaces them according to the quantiles of a normal distribution, putting more levels where more values actually are. This gives better quality than naive 4-bit quantization.
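The quantile-spacing idea can be illustrated with the standard library's `statistics.NormalDist`. This is a simplified sketch of the principle, not the actual NF4 codebook, whose construction differs in details such as the exact quantile offsets and the handling of zero:

```python
from statistics import NormalDist

nd = NormalDist()
n_levels = 16  # 4 bits -> 16 representable values

# Uniformly spaced 4-bit grid over [-1, 1] ...
uniform = [-1 + 2 * i / (n_levels - 1) for i in range(n_levels)]

# ... versus levels placed at evenly spaced quantiles of a standard
# normal distribution, the idea behind NF4.
quantile = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]

# Quantile-spaced levels cluster near zero, where normally
# distributed weights are dense, and spread out in the tails.
print([round(v, 2) for v in quantile])
```

Comparing the two lists shows the effect: the gaps between the middle quantile levels are several times smaller than the gaps at the edges, while the uniform grid wastes resolution on the sparse tails.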
QLoRA's key innovation is that the base model is stored in 4-bit, but computations during the LoRA forward and backward passes happen in higher precision (bf16). The frozen base weights are dequantized on the fly -- pulled from 4-bit to 16-bit -- for each computation, then discarded. Only the LoRA parameters (B and A) are stored and updated in full precision.
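The store-low/compute-high flow can be sketched with a naive absmax 4-bit scheme. This is an illustration of the quantize-then-dequantize round trip only; QLoRA's actual format uses the NF4 codebook with per-block quantization constants rather than a single symmetric integer scale:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)  # one frozen weight block

# Quantize once at load time: map to integers in -7..7 (a simplified
# symmetric range; true int4 is -8..7, and NF4 is not integer-based).
scale = np.abs(w).max() / 7
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)

# At compute time, dequantize the block back to higher precision,
# use it in the matmul, then discard the dequantized copy.
w_deq = q.astype(np.float32) * scale
print(np.abs(w - w_deq).max())  # small, bounded quantization error
```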
The result: you can fine-tune a 65B parameter model on a single 48 GB GPU. Or a 7B model on a consumer GPU with 8 GB. The QLoRA paper showed that a 4-bit quantized model fine-tuned with LoRA matches the quality of full 16-bit fine-tuning on standard benchmarks.3
With a Mac Mini M4 and 24 GB of unified memory, QLoRA puts serious fine-tuning within reach. A 7B model in 4-bit takes about 3.5 GB for weights, leaving ample room for the LoRA parameters, optimizer states, and activations. Even a 13B model is feasible. You won't be fine-tuning GPT-4-scale models at home, but the ability to customize a 7B or 13B model for a specific task -- on hardware you already own -- is a qualitative shift from five years ago when this required a data center.
Before LoRA, the parameter-efficient fine-tuning approach was adapters, introduced by Houlsby et al. (2019).4 Adapters insert small trainable modules (typically two linear layers with a nonlinearity) between the existing layers of the transformer. The original layers are frozen; only the adapter modules are trained.
The architecture of an adapter module is simple: a down-projection from dimension d to a bottleneck dimension m (where m is much smaller than d), a nonlinearity (typically ReLU), and an up-projection back to dimension d. A residual connection adds the adapter output to the original layer output.
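The adapter computation fits in a few lines. This sketch uses a zero-initialized up-projection so the module starts as an identity function, one common way to realize the near-identity initialization adapters use; dimensions are illustrative:

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project d -> m, ReLU, up-project
    m -> d, then add the residual connection."""
    z = np.maximum(0.0, h @ W_down)   # nonlinearity in the bottleneck
    return h + z @ W_up               # residual add

rng = np.random.default_rng(0)
d, m = 64, 8                          # m much smaller than d
W_down = rng.standard_normal((d, m)) * 0.01
W_up = np.zeros((m, d))               # zero init: adapter starts as identity

h = rng.standard_normal(d)
print(np.allclose(adapter(h, W_down, W_up), h))  # True at initialization
```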
Adapters work well, but they have a practical disadvantage compared to LoRA: they add new layers to the model, which means they change the model architecture. At inference time, the adapter modules need to be present, adding latency. LoRA's weights can be merged into the base model with no inference overhead, which is why LoRA has largely displaced adapters in practice.
An even lighter approach: instead of modifying any model weights at all, learn a small set of continuous embeddings that are prepended to the input.
Prefix tuning (Li and Liang, 2021) learns a sequence of virtual tokens that are prepended to the key and value matrices in each attention layer.5 The model weights are completely frozen; only these prefix vectors are optimized. The number of trainable parameters is tiny -- a few thousand to a few hundred thousand, depending on the prefix length.
Prompt tuning (Lester et al., 2021) is even simpler: learn a set of soft prompt embeddings that are prepended to the input at the embedding layer only.6 As model size increases, prompt tuning approaches the performance of full fine-tuning, which is a striking result: for sufficiently large models, you can match full fine-tuning performance by learning just a handful of input vectors.
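Mechanically, prompt tuning is just a concatenation at the embedding layer. In this sketch the token embeddings are random stand-ins for a frozen embedding lookup; the soft prompt vectors are the only parameters that would receive gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_virtual, seq_len = 64, 8, 10

# Frozen embeddings for the actual input tokens (stand-ins here).
token_embeds = rng.standard_normal((seq_len, d_model))

# The entire set of trainable parameters: a handful of soft prompt
# vectors prepended ahead of the real tokens.
soft_prompt = rng.standard_normal((n_virtual, d_model)) * 0.01

model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
print(model_input.shape)  # (18, 64)
```

Swapping tasks means swapping `soft_prompt`, an 8 x 64 array in this toy setup, while the model and its embeddings stay untouched.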
These methods are useful when you need to serve many tasks from a single model instance. Each task gets its own set of learned prefix/prompt vectors, and swapping between tasks is just swapping a tiny vector -- no reloading of model weights required.
With all these methods available, how do you choose? The decision depends on your resources, your data, and your target quality.
| Method | Trainable params | Memory needed | When to use |
|---|---|---|---|
| Full fine-tuning | 100% | ~6x model size | Frontier labs, maximum quality, abundant compute |
| LoRA | 0.1-1% | ~1.5x model size | Best quality-to-cost ratio, most common choice |
| QLoRA | 0.1-1% | ~0.5x model size | Limited GPU memory, consumer hardware |
| Adapters | 1-3% | ~1.5x model size | Largely superseded by LoRA |
| Prompt tuning | <0.01% | ~1x model size | Many tasks, large models, minimal compute |
| Just prompting | 0% | Inference only | Sufficient for many tasks with good enough models |
The trend is clear: you should try the lightest method first and move to heavier methods only if quality demands it. For many practical applications, good prompting with a strong base model is enough. When it's not, LoRA or QLoRA is the next step. Full fine-tuning is reserved for cases where you're building a product that needs to be as good as possible and you have the budget for it.
1 Aghajanyan et al. (2020), "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL 2021. Showed that pre-trained models have a low intrinsic dimensionality for fine-tuning -- you can project the optimization into a much smaller subspace and still achieve good performance.
2 Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. The original LoRA paper. Demonstrated that adding trainable low-rank decomposition matrices to frozen pre-trained weights achieves performance comparable to full fine-tuning with a fraction of the trainable parameters.
3 Dettmers et al. (2023), "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. Introduced 4-bit NormalFloat quantization and showed that combining it with LoRA enables fine-tuning of 65B parameter models on a single 48 GB GPU without quality degradation.
4 Houlsby et al. (2019), "Parameter-Efficient Transfer Learning for NLP." ICML 2019. Introduced adapter modules -- small bottleneck layers inserted between transformer layers -- achieving near-full fine-tuning performance while training only 3.6% of parameters.
5 Li and Liang (2021), "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL 2021. Proposed learning task-specific continuous vectors prepended to the key and value matrices at every attention layer, with all model parameters frozen.
6 Lester et al. (2021), "The Power of Scale for Parameter-Efficient Prompt Tuning." EMNLP 2021. Showed that learning soft prompt embeddings at the input layer only becomes competitive with full fine-tuning as model scale increases, closing the gap at around 10 billion parameters.