AI Concept

What is Fine-Tuning?

Fine-tuning takes a general AI model and trains it further on your specific data — turning a general-purpose AI into one that specialises in your domain, speaks in your brand voice, or handles your exact use case. One of the four core AI concepts every professional needs to understand in 2026.

The one-sentence definition

Fine-tuning is the process of taking a pre-trained AI model and continuing to train it on a smaller, domain-specific dataset — so the model learns the patterns, vocabulary, format, and knowledge specific to your use case, while retaining everything it learned during its original large-scale training.

The analogy that works: A general-practice doctor knows a little about everything in medicine. A cardiac surgeon knows one thing extremely well. Fine-tuning turns a general AI into the cardiac surgeon for your specific problem.

When is fine-tuning the right approach?

Fine-tuning is not always the answer. Before training anything, work down this list; the first two options are faster and cheaper:

  • Prompting — a better system prompt or few-shot examples often solves the problem without any training. Try this first. Most "I need fine-tuning" problems are actually prompting problems.
  • RAG (Retrieval-Augmented Generation) — if the goal is giving the model access to specific knowledge (your documents, your product data, your policies), RAG retrieves relevant content at query time. No training required, and the knowledge stays current automatically.
  • Fine-tuning — the right choice when: you need a specific output format or style consistently, the task requires knowledge that cannot be retrieved (e.g. an implicit writing style), you need faster/cheaper inference and can train a smaller model, or you are running thousands of similar requests where a specialised model is more efficient.

What fine-tuning actually changes

A pre-trained model like GPT-4 has billions of numerical parameters — weights — that encode everything it learned from its training data. Fine-tuning continues training on your examples, updating those weights (or, with adapter methods like LoRA, a small set of additional ones). The model learns which patterns in your data are important and adjusts its behaviour accordingly.
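
To make the mechanics concrete, here is a minimal sketch of what "continuing to train" means in code, using the Hugging Face transformers library. Everything in it is a placeholder: the tiny model (distilgpt2) stands in for a real base model, and a real run would use a proper dataset, batching, and several epochs.

```python
# Minimal sketch: fine-tuning is just more gradient descent on new examples.
# distilgpt2 and the two toy examples are placeholders, not recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

examples = [
    "Q: What is the returns window? A: 30 days from delivery.",
    "Q: Do you ship overseas? A: Yes, to the EU and the UK.",
]

model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    # Labels = input IDs: the standard causal-LM objective. The loss nudges
    # the pre-trained weights toward the patterns in your examples.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```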

What fine-tuning can do:

  • Teach the model a specific writing style (your brand voice, a character's voice, a document format)
  • Teach it domain-specific vocabulary and concepts (medical terminology, legal language, your company's internal jargon)
  • Teach it to follow a specific output format consistently (JSON with particular fields, structured reports, specific templates)
  • Teach it to refuse or redirect specific types of requests
  • Improve accuracy on a narrow task with a smaller, faster model

What fine-tuning cannot do:

  • Reliably add information from after the training cutoff — for current or frequently changing information, use RAG
  • Make a fundamentally incapable model capable — fine-tuning sharpens what is already there; it does not add new reasoning abilities
  • Guarantee factual accuracy — fine-tuned models still hallucinate

How to fine-tune — practical options

OpenAI fine-tuning API

The simplest starting point. Upload a JSONL training file (each line is one complete chat example: a system message, a user message, and the ideal assistant reply), start the training job, and receive a fine-tuned model ID you can use in API calls. Supported models include GPT-4o mini and GPT-3.5-turbo; check the documentation for the current list. Pricing combines a per-token training cost and a per-token inference cost, with inference priced slightly higher than the base model. Minimum recommended training set: 50-100 good examples, though 500-1,000 produce significantly better results. Documentation at platform.openai.com/docs/guides/fine-tuning.
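
A sketch of that flow with the official openai Python client follows. The file name and the model snapshot string are illustrative; verify the current list of fine-tunable models in the documentation before running it.

```python
# Hedged sketch of the OpenAI fine-tuning flow. "train.jsonl" and the model
# snapshot name are placeholders; consult the docs for supported models.
from openai import OpenAI

client = OpenAI()

# Each line of train.jsonl is one complete chat example, e.g.:
# {"messages": [{"role": "system", "content": "You are our support bot."},
#               {"role": "user", "content": "Where is my order?"},
#               {"role": "assistant", "content": "Happy to check. What is your order number?"}]}
uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-4o-mini-2024-07-18",  # example snapshot; verify before use
)
# Poll until the job finishes, then call the returned fine-tuned model ID.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```

When the job succeeds, the job object carries a fine-tuned model ID that you pass as the model in ordinary API calls.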

LoRA / QLoRA on open models

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large models by training only a small set of additional parameters rather than updating all model weights. This reduces the memory and compute required by 10-100x. QLoRA extends LoRA with quantisation for even lower memory requirements. Using LoRA/QLoRA, a consumer GPU (24GB VRAM) can fine-tune models up to 13B parameters. Popular tools: Hugging Face PEFT library, Unsloth (optimised for speed), Axolotl (flexible training framework). Open-source models that can be fine-tuned this way: Llama 3, Mistral, Phi-3, Gemma.
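
As a rough illustration of how little code the adapter setup takes, here is a hedged sketch using the Hugging Face PEFT library, with 4-bit quantisation turning it into QLoRA. The model name, rank, and target modules are illustrative assumptions rather than tuned recommendations; Llama 3 weights are gated, so substitute any causal LM you can download.

```python
# Hedged sketch: attach LoRA adapters to a 4-bit quantised model (QLoRA).
# Model name and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

quant = BitsAndBytesConfig(          # 4-bit loading is what makes this QLoRA
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",    # gated repo; swap in any causal LM
    quantization_config=quant,
    device_map="auto",
)

adapter = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, adapter)
model.print_trainable_parameters()   # typically well under 1% of all weights
```

From here the model trains like any other transformers model; only the adapter weights receive gradients, which is where the memory and compute savings come from.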

Vertex AI / Azure fine-tuning

Google Cloud Vertex AI and Azure OpenAI both offer managed fine-tuning services for their respective models. They cost more than a do-it-yourself setup but handle the infrastructure, monitoring, and model serving for you. Appropriate for enterprise teams that need SLAs and support rather than more infrastructure to manage.

Training data quality — the only thing that matters

Fine-tuning quality is determined almost entirely by training data quality. A small dataset of excellent examples outperforms a large dataset of mediocre examples every time. Principles for training data:

  • Consistency — every example should demonstrate exactly the behaviour you want. Contradictory examples confuse the model.
  • Coverage — examples should cover the range of inputs the model will receive in production, not just the easy cases.
  • Quality over quantity — 100 carefully crafted examples outperform 1,000 automatically generated ones in most cases.
  • Correct format — training examples must use the exact format the model will see in production; a quick validation pass (see the sketch below) catches format problems before you pay for a training run.
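
Format errors are cheap to catch before upload and expensive to discover mid-training, so a small validation pass pays for itself. This sketch assumes the OpenAI chat format; the file name is a placeholder.

```python
# Sanity-check a JSONL training file before uploading it.
# Assumes the OpenAI chat format; "train.jsonl" is a placeholder.
import json

allowed_roles = {"system", "user", "assistant"}

with open("train.jsonl") as f:
    for n, line in enumerate(f, start=1):
        record = json.loads(line)            # raises if the line is not JSON
        roles = [m["role"] for m in record["messages"]]
        assert set(roles) <= allowed_roles, f"line {n}: unexpected role"
        assert roles[-1] == "assistant", f"line {n}: must end with a completion"
```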

Prompts to try

Decide whether to fine-tune or use RAG

My use case is [describe — what the AI should do, what data it needs access to, how often the data changes, the expected volume of requests]. Should I use fine-tuning, RAG, or just better prompting? For each option: explain if it fits my use case, estimate the implementation complexity, and give me the main reason it would or would not work. Make a recommendation.

Create a training dataset

I want to fine-tune [model — e.g. GPT-4o mini / Llama 3 8B] for [task — e.g. classifying customer support emails / writing in our brand voice / extracting data from invoices]. Generate 10 high-quality training examples in the JSONL format required by the OpenAI fine-tuning API. Each example should: [describe the characteristics of a good example for my task]. After the examples, explain what makes each one effective.

Write a system prompt for a fine-tuned model

I am fine-tuning a model for [task]. Write a system prompt that will be included in every training example. The system prompt should: establish the model's role and context, specify the output format precisely, handle edge cases, and be consistent across all training examples. Also write 3 example user messages with ideal responses that I can use as training data.

Evaluate fine-tuning results

I have fine-tuned a model for [task] using [number] training examples. I want to evaluate whether the fine-tuning worked. Design an evaluation framework: (1) the specific metrics I should measure, (2) a test set of [number] examples that would reveal whether the model has learned correctly, (3) how to compare the fine-tuned model against the base model, (4) the threshold that would tell me the fine-tuning was successful.

Prepare data for LoRA fine-tuning

I want to fine-tune [Llama 3 8B / Mistral 7B / another open model] using LoRA on my local machine / a cloud GPU. My use case: [describe]. Walk me through: (1) the data format required, (2) the recommended LoRA hyperparameters for my use case (rank, alpha, target modules), (3) which tool to use (Unsloth / Hugging Face PEFT / Axolotl) and why, (4) how much GPU memory I need and how to reduce it with QLoRA if needed.

Calculate fine-tuning costs

I want to fine-tune [GPT-4o mini / GPT-3.5-turbo] using the OpenAI API. I have approximately [number] training examples, each with an average of [X] tokens in the prompt and [Y] tokens in the completion. Calculate: (1) the cost to train one epoch, (2) the recommended number of epochs, (3) the total training cost, (4) the inference cost if I run [Z] requests per day at an average of [W] tokens each, (5) whether fine-tuning is cost-effective vs using the base model with a longer system prompt.

Compare fine-tuning to alternatives for my case

I have [describe the problem — task, data, constraints]. I am considering three approaches: (1) a detailed system prompt with few-shot examples, (2) RAG with a vector database of my documents, (3) fine-tuning the model. For my specific case, rank these approaches by: likely quality of output, cost to implement, time to implement, ongoing maintenance burden, and ability to update when my data changes.

Fine-tune for brand voice

I want to fine-tune a model to write content in [brand name]'s voice. Here are 5 examples of excellent [brand name] content: [paste examples]. Analyse: (1) the distinctive characteristics of this writing style — vocabulary, sentence structure, tone, what it avoids, (2) how to create training examples that capture these characteristics, (3) how many examples I need, (4) how to test whether the fine-tuned model has learned the voice correctly.