Part IV — Training and Deployment

LoRA and Efficient Methods

Low-rank adaptation, adapters, quantization. Why you don't need to retrain everything, and what's actually happening mathematically.

The Problem with Full Fine-Tuning

Chapter 13 described fine-tuning as adjusting every weight in a model. For a 7-billion-parameter model, that means storing and updating 7 billion floating-point numbers. In fp16 (16-bit floating point, the standard precision for training), that's 14 GB just for the model weights. But training also requires storing optimizer states -- Adam, the standard optimizer, keeps two additional copies of every parameter (the first and second moment estimates). That triples the memory to around 42 GB. Add gradient storage and activations for backpropagation, and you need 60-80 GB of GPU memory to fine-tune a 7B model.
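The arithmetic above can be sketched in a few lines. This is a back-of-envelope estimate only, assuming (as the text does) that gradients and both Adam moment estimates are kept at the same 2-byte precision as the weights; the function name is ours, and activation memory is excluded because it depends on batch size and sequence length.

```python
def full_finetune_memory_gb(n_params, bytes_per_param=2):
    """Rough GPU memory for full fine-tuning, excluding activations."""
    weights = n_params * bytes_per_param          # fp16 model weights
    grads = n_params * bytes_per_param            # one gradient per weight
    adam_states = 2 * n_params * bytes_per_param  # first + second moments
    return (weights + grads + adam_states) / 1e9

print(full_finetune_memory_gb(7e9))  # 56.0 -- before activations push it to 60-80 GB
```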

A 70-billion-parameter model requires roughly 10x that. A consumer GPU has 8-24 GB. Even a high-end professional GPU (NVIDIA A100) has 80 GB. Full fine-tuning of frontier models is possible only with clusters of expensive hardware.

This created a practical problem: most people who want to customize a model can't afford to fine-tune one. The solution came from a simple mathematical observation about what happens to weight matrices during fine-tuning.

The Key Insight: Low-Rank Updates

Aghajanyan et al. (2020) made a surprising empirical discovery: when you fine-tune a pre-trained language model, the weight updates live in a much lower-dimensional subspace than the full parameter space.1 In linear algebra terms, the matrix of weight changes has low intrinsic rank.

To understand why this matters, recall what rank means. A matrix of size d x d has rank at most d -- that's the maximum number of linearly independent rows (or columns). If the rank is r, where r is much smaller than d, the matrix can be decomposed into a product of two smaller matrices: one of size d x r and one of size r x d. Equivalently, it is a sum of r outer products -- the rank factorization you'd see in linear algebra.
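The decomposition is easy to verify numerically. In this sketch the sizes d = 64 and r = 4 are illustrative, not taken from any real model: a product of a d x r and an r x d matrix always yields a d x d matrix of rank at most r, stored with far fewer numbers.

```python
import numpy as np

d, r = 64, 4
rng = np.random.default_rng(0)
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
M = B @ A                        # d x d, but rank at most r

print(np.linalg.matrix_rank(M))  # 4
print(d * d, "entries in M vs", d * r + r * d, "entries in the factors")
```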

The intuition for why fine-tuning updates are low-rank: the pre-trained model already "knows" almost everything it needs. Fine-tuning is making small, structured adjustments -- not random perturbations across all dimensions of the weight space. Those adjustments tend to be correlated (many weights shift in related ways), which is exactly what low rank means: the change has structure, and that structure can be captured with far fewer numbers than a full matrix.

LoRA: Low-Rank Adaptation

Hu et al. (2021) turned this observation into a practical method called LoRA (Low-Rank Adaptation of Large Language Models).2 The idea is elegant:

  1. Freeze the original pre-trained weight matrix W (no gradient updates).
  2. Add a trainable low-rank decomposition that represents the weight update.
  3. During the forward pass, compute the output as h = (W + BA)x, where B and A are the small trainable matrices.

W' = W + BA
where W is d x d (frozen), B is d x r, A is r x d, and r << d

The original weight matrix W might be 4096 x 4096 -- that's about 16.8 million parameters. With LoRA at rank r = 16, the trainable parameters are B (4096 x 16) and A (16 x 4096), totaling about 131,000 parameters. That's less than 1% of the original matrix. Applied across the model, LoRA typically trains 0.1% to 1% of total parameters while achieving performance comparable to full fine-tuning.
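The parameter counts in the paragraph above can be checked directly, using the text's own numbers (d = 4096, r = 16):

```python
# Trainable-parameter count for LoRA on a single d x d weight matrix.
d, r = 4096, 16
full = d * d           # frozen base matrix
lora = d * r + r * d   # B (d x r) + A (r x d)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 131072 0.78%
```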

[Figure: LoRA architecture. The input x passes through the frozen pre-trained weights W (d x d) and, in parallel, through the trainable low-rank pair A (r x d) and B (d x r); the two paths are summed: h = Wx + BAx. Typical r = 8, 16, or 64 versus full rank d = 4096+; trainable params < 1%.]

Why the initialization matters

LoRA initializes A with a random Gaussian and B with zeros. This means the product BA starts as the zero matrix, so the model begins fine-tuning from exactly the pre-trained weights. As training progresses, B and A adjust to capture the needed modification. This is a clean design: at initialization, the LoRA model behaves identically to the pre-trained model, and changes are introduced gradually through gradient updates.
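A minimal NumPy sketch makes the initialization property concrete (toy sizes, no training framework): because B starts at zero, BA is the zero matrix and the layer's output is exactly that of the frozen weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.standard_normal((d, d))           # frozen pre-trained weights
A = rng.standard_normal((r, d)) * 0.01    # random Gaussian init
B = np.zeros((d, r))                      # zero init -> BA = 0 at start

x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)                   # h = Wx + BAx

print(np.allclose(h, W @ x))              # True: identical to the base model
```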

Where LoRA is applied

In a transformer, the main weight matrices are in the attention layers (the query, key, value, and output projection matrices) and the feed-forward layers. The original LoRA paper found that applying low-rank updates to the attention projection matrices (particularly Q and V) was sufficient. Later work showed that applying LoRA to all linear layers gives better results, at modestly higher cost.

Key idea: LoRA doesn't approximate the pre-trained model. It exactly preserves it and adds a small, trainable correction. The pre-trained knowledge is completely intact. This is why LoRA largely avoids catastrophic forgetting -- the original weights are never modified.

Merging and serving

After training, you can merge the LoRA weights back into the base model: just compute W' = W + BA and use W' as your new weight matrix. The merged model has the same architecture and size as the original -- there's no inference overhead. Alternatively, you can keep the LoRA weights separate and swap them in and out, which lets you serve one base model with multiple task-specific adaptations.
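The merge is a single matrix addition, and its equivalence to the unmerged forward pass is easy to confirm (toy sizes; the trained B and A here are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 4
W = rng.standard_normal((d, d))           # frozen base weights
B = rng.standard_normal((d, r))           # trained LoRA factors
A = rng.standard_normal((r, d))

W_merged = W + B @ A                      # W' = W + BA, same shape as W

x = rng.standard_normal(d)
print(np.allclose(W_merged @ x, W @ x + B @ (A @ x)))  # True
```

Serving W_merged costs exactly one matmul per layer, the same as the original model; keeping B and A separate costs an extra low-rank matmul but lets you hot-swap adaptations.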

QLoRA: Quantization Meets LoRA

LoRA reduces the number of trainable parameters, but you still need to load the full base model into memory for the forward pass. For a 7B model in fp16, that's 14 GB. Dettmers et al. (2023) introduced QLoRA, which solves this by quantizing the base model to 4-bit precision before applying LoRA.3

What quantization means

Quantization is the process of representing numbers with fewer bits. The standard precision levels:

Format        Bits per parameter   Memory for 7B model   Relative quality
fp32          32                   28 GB                 Full precision (training standard)
fp16 / bf16   16                   14 GB                 Near-lossless for most models
int8          8                    7 GB                  Minimal degradation
int4 / nf4    4                    3.5 GB                Small degradation, depends on method

The numbers above are for weights only. A 7B model quantized to 4-bit fits in about 3.5 GB -- well within the 24 GB unified memory of a Mac Mini M4. This is how tools like Ollama run large models locally: the model weights are quantized to 4-bit, and the remaining memory is used for activations and the KV cache during inference (more on this in Chapter 15).

QLoRA introduces a specific quantization format called NF4 (NormalFloat4), which is optimized for the fact that neural network weights tend to be normally distributed. Instead of spacing quantization levels uniformly, NF4 spaces them according to the quantiles of a normal distribution, putting more levels where more values actually are. This gives better quality than naive 4-bit quantization.
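The quantile idea can be demonstrated without the real NF4 codebook. The sketch below, an illustration rather than NF4 itself, draws roughly-normal "weights," builds 16 levels either uniformly or at empirical quantiles, rounds each weight to its nearest level, and compares the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(10_000)          # weights are roughly normal

# 16 uniform levels vs 16 levels at empirical quantiles of the data
uniform_levels = np.linspace(weights.min(), weights.max(), 16)
quantile_levels = np.quantile(weights, np.linspace(0.01, 0.99, 16))

def quantize(w, levels):
    # round each weight to its nearest quantization level
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

err_uniform = np.mean((weights - quantize(weights, uniform_levels)) ** 2)
err_quantile = np.mean((weights - quantize(weights, quantile_levels)) ** 2)
print(err_quantile < err_uniform)              # quantile spacing fits better
```

Uniform levels waste codes on the sparse tails of the distribution; quantile spacing concentrates them where most weights actually sit, which is the intuition behind NF4.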

The QLoRA trick

QLoRA's key innovation is that the base model is stored in 4-bit, but computations during the LoRA forward and backward passes happen in higher precision (bf16). The frozen base weights are dequantized on the fly -- pulled from 4-bit to 16-bit -- for each computation, then discarded. Only the LoRA parameters (B and A) are stored and updated in full precision.
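A toy version of that forward pass, with a stand-in 4-bit codebook rather than real NF4 and NumPy in place of bf16 kernels: the base weights live only as uint8 codes, are dequantized at compute time, and the LoRA factors B and A stay in full precision.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.standard_normal((d, d)) * 0.02

# "Store" W as 4-bit codes against a 16-entry codebook.
codebook = np.linspace(W.min(), W.max(), 16)
codes = np.abs(W[..., None] - codebook).argmin(axis=-1).astype(np.uint8)

B = np.zeros((d, r))                           # trainable, full precision
A = rng.standard_normal((r, d)) * 0.01         # trainable, full precision

def forward(x):
    W_dq = codebook[codes]                     # dequantize on the fly
    return W_dq @ x + B @ (A @ x)              # then discard W_dq

x = rng.standard_normal(d)
print(np.max(np.abs(forward(x) - W @ x)))      # small quantization error
```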

The result: you can fine-tune a 65B parameter model on a single 48 GB GPU. Or a 7B model on a consumer GPU with 8 GB. The QLoRA paper showed that a 4-bit quantized model fine-tuned with LoRA matches the quality of full 16-bit fine-tuning on standard benchmarks.3

What this means for your hardware

With a Mac Mini M4 and 24 GB of unified memory, QLoRA puts serious fine-tuning within reach. A 7B model in 4-bit takes roughly 4 GB for weights (the 3.5 GB of 4-bit values plus quantization constants), leaving ample room for the LoRA parameters, optimizer states, and activations. Even a 13B model is feasible. You won't be fine-tuning GPT-4-scale models at home, but the ability to customize a 7B or 13B model for a specific task -- on hardware you already own -- is a qualitative shift from five years ago when this required a data center.

Adapters

Before LoRA, the parameter-efficient fine-tuning approach was adapters, introduced by Houlsby et al. (2019).4 Adapters insert small trainable modules (typically two linear layers with a nonlinearity) between the existing layers of the transformer. The original layers are frozen; only the adapter modules are trained.

The architecture of an adapter module is simple: a down-projection from dimension d to a bottleneck dimension m (where m is much smaller than d), a nonlinearity (typically ReLU), and an up-projection back to dimension d. A residual connection adds the adapter output to the original layer output.
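A sketch of that bottleneck module (toy sizes; the near-zero initialization, so the adapter starts close to the identity, follows the spirit of the original paper rather than its exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 8                          # hidden dim and bottleneck dim

W_down = rng.standard_normal((m, d)) * 0.01
W_up = np.zeros((d, m))               # zero init: adapter starts as identity

def adapter(h):
    z = np.maximum(0.0, W_down @ h)   # down-projection + ReLU
    return h + W_up @ z               # up-projection + residual connection

h = rng.standard_normal(d)
print(np.allclose(adapter(h), h))     # True with zero-initialized W_up
```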

Adapters work well, but they have a practical disadvantage compared to LoRA: they add new layers to the model, which means they change the model architecture. At inference time, the adapter modules need to be present, adding latency. LoRA's weights can be merged into the base model with no inference overhead, which is why LoRA has largely displaced adapters in practice.

Prompt Tuning and Prefix Tuning

An even lighter approach: instead of modifying any model weights at all, learn a small set of continuous embeddings that are prepended to the input.

Prefix tuning (Li and Liang, 2021) learns a sequence of virtual tokens that are prepended to the key and value matrices in each attention layer.5 The model weights are completely frozen; only these prefix vectors are optimized. The number of trainable parameters is tiny -- a few thousand to a few hundred thousand, depending on the prefix length.

Prompt tuning (Lester et al., 2021) is even simpler: learn a set of soft prompt embeddings that are prepended to the input at the embedding layer only.6 As model size increases, prompt tuning approaches the performance of full fine-tuning, which is a striking result: for sufficiently large models, you can match full fine-tuning performance by learning just a handful of input vectors.
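Mechanically, prompt tuning is just concatenation at the embedding layer. In this sketch (illustrative dimensions, frozen embeddings simulated with random vectors), only the soft-prompt matrix would receive gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_soft = 16, 5, 3

token_embeddings = rng.standard_normal((n_tokens, d))  # frozen embedding output
soft_prompt = rng.standard_normal((n_soft, d)) * 0.01  # the only trainable part

# Prepend the learned vectors; the frozen transformer sees a longer sequence.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(model_input.shape)  # (8, 16)
```

Swapping tasks means swapping soft_prompt -- a few hundred floats -- while the model weights stay resident in memory.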

These methods are useful when you need to serve many tasks from a single model instance. Each task gets its own set of learned prefix/prompt vectors, and swapping between tasks is just swapping a tiny vector -- no reloading of model weights required.

The Practical Decision Tree

With all these methods available, how do you choose? The decision depends on your resources, your data, and your target quality.

Method             Trainable params   Memory needed      When to use
Full fine-tuning   100%               ~6x model size     Frontier labs, maximum quality, abundant compute
LoRA               0.1-1%             ~1.5x model size   Best quality-to-cost ratio, most common choice
QLoRA              0.1-1%             ~0.5x model size   Limited GPU memory, consumer hardware
Adapters           1-3%               ~1.5x model size   Largely superseded by LoRA
Prompt tuning      <0.01%             ~1x model size     Many tasks, large models, minimal compute
Just prompting     0%                 Inference only     Sufficient for many tasks with good enough models

The trend is clear: you should try the lightest method first and move to heavier methods only if quality demands it. For many practical applications, good prompting with a strong base model is enough. When it's not, LoRA or QLoRA is the next step. Full fine-tuning is reserved for cases where you're building a product that needs to be as good as possible and you have the budget for it.

Key idea: The most important shift in the last three years of AI practice isn't a new architecture -- it's the democratization of model customization. LoRA and QLoRA mean that fine-tuning a model for a specific task is no longer a data-center-scale operation. It's something you can do on a laptop with a good GPU, or a Mac Mini with 24 GB of unified memory, in a few hours.
Next: Chapter 15 — Inference. How generation actually works. Autoregressive decoding, temperature, top-k, top-p sampling. Context windows and KV cache.

1 Aghajanyan et al. (2020), "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL 2021. Showed that pre-trained models have a low intrinsic dimensionality for fine-tuning -- you can project the optimization into a much smaller subspace and still achieve good performance.

2 Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. The original LoRA paper. Demonstrated that adding trainable low-rank decomposition matrices to frozen pre-trained weights achieves performance comparable to full fine-tuning with a fraction of the trainable parameters.

3 Dettmers et al. (2023), "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. Introduced 4-bit NormalFloat quantization and showed that combining it with LoRA enables fine-tuning of 65B parameter models on a single 48 GB GPU without quality degradation.

4 Houlsby et al. (2019), "Parameter-Efficient Transfer Learning for NLP." ICML 2019. Introduced adapter modules -- small bottleneck layers inserted between transformer layers -- achieving near-full fine-tuning performance while training only 3.6% of parameters.

5 Li and Liang (2021), "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL 2021. Proposed learning task-specific continuous vectors prepended to the key and value matrices at every attention layer, with all model parameters frozen.

6 Lester et al. (2021), "The Power of Scale for Parameter-Efficient Prompt Tuning." EMNLP 2021. Showed that learning soft prompt embeddings at the input layer only becomes competitive with full fine-tuning as model scale increases, closing the gap at around 10 billion parameters.