A Practical Guide to Fine-Tuning AI Models

· 14 min read

Fine-tuning has become the answer to everything. Your model does not follow instructions? Fine-tune it. Your prompts are too long? Fine-tune it. Your RAG pipeline returns garbage? Fine-tune it. The problem is that fine-tuning is also the easiest way to burn a weekend and three hundred dollars in compute for a model that performs worse than the base checkpoint you started with.

This guide is the one I wish I had eighteen months ago. No theory for its own sake. Every section ends with something you can run.

When Fine-Tuning Actually Makes Sense

Fine-tuning is not the first tool you reach for. It is the one you reach for when everything else has stopped working. The decision tree, simplified:

| Approach | When to use | Cost | Time to first result |
| --- | --- | --- | --- |
| Prompt engineering | Always first | Free | Minutes |
| Few-shot examples in context | Prompt alone is not enough | Free (but eats context) | Minutes |
| RAG | Need external knowledge, facts, documents | Low to medium | Hours to days |
| Fine-tuning | Need a new behavior, style, format, or domain language | Medium to high | Days to weeks |
| Pre-training from scratch | You are a lab with a nine-figure budget | Astronomical | Months |

The heuristic that has served me well: if you can describe what you want in a paragraph, prompt for it. If you need fifty examples in the context window to get the behavior you want, you have a fine-tuning problem. If the model consistently gets the answer right but in the wrong format, you have a fine-tuning problem. If the model does not know facts it should know, you have a RAG problem, not a fine-tuning problem — fine-tuning is for behavior, not knowledge.

The corollary: the most successful fine-tuning runs I have seen are narrow. One task. One output format. One style. The moment you try to teach a model three unrelated things at once, you are paying for three fine-tuning runs that each produce worse results than one focused run would have.

Full Fine-Tuning vs. LoRA vs. QLoRA

You do not need to understand the linear algebra to pick the right method, but you do need to understand the trade-offs.

| Method | GPU requirement | Training speed | Model size on disk | Quality ceiling |
| --- | --- | --- | --- | --- |
| Full fine-tuning | 8× A100 (80 GB) for a 70B model | Slow | Full model (~140 GB for 70B) | Highest |
| LoRA (Low-Rank Adaptation) | 1–2× A100 or RTX 4090 for 7–13B | Fast | Adapter only (~10–200 MB) | Very close to full |
| QLoRA (Quantized LoRA) | 1× RTX 3090/4090 (24 GB) for 7–13B, or a Mac with 64 GB unified memory | Moderate | Adapter only (~10–200 MB) | Slightly below LoRA, adequate for most tasks |

For ninety-five percent of real-world fine-tuning tasks, QLoRA on a single consumer GPU is the right answer. The quality gap between QLoRA and full fine-tuning on a focused, single-task dataset is smaller than the gap between a well-curated dataset and a sloppy one. Spend your energy on the data, not on chasing the last half-percent of method quality.
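The disk and VRAM figures in the table follow from simple arithmetic on parameter counts. Here is a rough sketch, assuming 2 bytes per weight in bf16 and half a byte per weight in 4-bit, and ignoring activations, optimizer state, and quantization overhead. The function name is mine, not part of any library.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for size in (8, 70):
    bf16 = weight_memory_gb(size, 16)  # full-precision training copy
    nf4 = weight_memory_gb(size, 4)    # 4-bit quantized base, as in QLoRA
    print(f"{size}B model: ~{bf16:.0f} GB in bf16, ~{nf4:.0f} GB in 4-bit")
```

An 8B model quantized to 4-bit needs only a few gigabytes for its weights, which is why a 24 GB card has room left for adapters, gradients, and activations. The 70B bf16 figure matches the ~140 GB in the table before any optimizer state is counted.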

The one exception: if you are fine-tuning a model that will serve millions of requests a day and every token of quality matters, full fine-tuning starts to pay for itself. But if you are in that position, you already have an H100 cluster and a team that is not reading this post.

Preparing Your Dataset: The Part That Actually Matters

The single most common reason fine-tuning fails is not the learning rate or the number of epochs. It is the dataset. And the failure mode is almost always the same: too many examples, too little curation.

Quality Over Quantity, Always

You need far fewer examples than you think. For instruction fine-tuning on a narrow task:

| Task complexity | Minimum examples | Comfortable range |
| --- | --- | --- |
| Output format change (e.g., always respond in YAML) | 50–100 | 200–500 |
| Style or tone adaptation | 100–300 | 500–1,000 |
| Domain-specific reasoning (legal, medical, financial) | 500–1,000 | 2,000–5,000 |
| Complex multi-step behavior | 1,000–2,000 | 5,000–10,000 |

The numbers above are not theoretical. They come from actual runs. I have seen a fifty-example dataset produce a better-structured output formatter than a five-thousand-example dataset where nobody bothered to remove the duplicates and the contradictory examples.

The Format: JSONL, Conversational

Every major fine-tuning API and library expects JSONL — one JSON object per line. The two formats you will actually use:

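Chat messages / conversation format (OpenAI, Together, Fireworks, most open-weight fine-tuning; the older ShareGPT layout is a close cousin that uses "conversations" with "from" and "value" keys):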

{"messages": [{"role": "system", "content": "You are a helpful code reviewer. Respond in the format: {summary, issues: [{file, severity, description, fix}]}."}, {"role": "user", "content": "Review this function: def add(a,b): return a+b"}, {"role": "assistant", "content": "{\"summary\": \"Simple addition function, no issues found.\", \"issues\": []}"}]}

Instruction / completion format (Alpaca-style; older APIs, some open-weight tools):

{"instruction": "Review this code and return a JSON object with keys 'summary' and 'issues'.", "input": "def add(a,b): return a+b", "output": "{\"summary\": \"Simple addition function.\", \"issues\": []}"}

Use the conversation format unless you have a specific reason not to. It is the direction the ecosystem is moving, and it handles multi-turn examples cleanly.
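If you already have data in the instruction format, converting it is mechanical. A minimal sketch, assuming the optional "input" field should be appended to the instruction after a blank line (that joining rule and the function name are my choices, not a standard):

```python
import json
from typing import Optional


def instruction_to_messages(record: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one instruction/input/output record into the messages format."""
    # Assumption: "input" is appended to the instruction after a blank line.
    user_content = record["instruction"]
    if record.get("input"):
        user_content += "\n\n" + record["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}


old_record = {
    "instruction": "Review this code and return a JSON object with keys 'summary' and 'issues'.",
    "input": "def add(a,b): return a+b",
    "output": '{"summary": "Simple addition function.", "issues": []}',
}
print(json.dumps(instruction_to_messages(old_record)))
```

Pass the same system_prompt you plan to use at inference, or leave it out entirely if your pipeline does not use one.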

The Curation Checklist

Before you train, run through this list for every example in your dataset:

  1. Is the output exactly what you want the model to produce? Not approximately. Exactly. If your production system expects valid JSON with snake_case keys, every training example must output valid JSON with snake_case keys. One example with camelCase keys will confuse the model more than zero examples would have.

  2. Are there contradictory examples? Two examples with the same input and different outputs are poison. Find them. Remove one. If you cannot decide which output is correct, your task is underspecified and fine-tuning will not help.

  3. Are the examples diverse enough? Fifty variations of “write a function that adds two numbers” will teach the model to add numbers. It will not teach the model to write functions. Cover the edge cases, the error paths, the weird inputs.

  4. Are system prompts consistent? If you use a system prompt in training, use the same system prompt (or very close variants) in inference. A model fine-tuned with “You are an expert Python reviewer” will behave unpredictably when you prompt it with “You are a helpful assistant” at inference time.

  5. Is there leakage from the base model? Generate candidate outputs with the base model first. If the base model already produces the right answer for twenty percent of your examples, those examples are not teaching the model anything new. Replace them with harder cases.

  6. Have you reserved a test set? Set aside ten to twenty percent of your examples before training. Do not look at them during curation. Use them only for evaluation. This is not optional.
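Several of these checks can be partly automated before you start reading examples by hand. The sketch below covers items 1, 2, and 4 for a messages-format dataset whose outputs are supposed to be JSON. The helper names are hypothetical, and it only catches exact-match contradictions, not near-duplicates.

```python
import json
from collections import defaultdict


def find_contradictions(examples):
    """Group example indices that share a prompt but disagree on the output."""
    outputs_by_prompt = defaultdict(lambda: defaultdict(list))
    for i, ex in enumerate(examples):
        prompt = json.dumps([m for m in ex["messages"] if m["role"] != "assistant"], sort_keys=True)
        answer = json.dumps([m["content"] for m in ex["messages"] if m["role"] == "assistant"])
        outputs_by_prompt[prompt][answer].append(i)
    return [
        sorted(i for indices in answers.values() for i in indices)
        for answers in outputs_by_prompt.values()
        if len(answers) > 1
    ]


def invalid_json_outputs(examples):
    """Indices of examples with an assistant turn that is not valid JSON."""
    bad = []
    for i, ex in enumerate(examples):
        for m in ex["messages"]:
            if m["role"] != "assistant":
                continue
            try:
                json.loads(m["content"])
            except json.JSONDecodeError:
                bad.append(i)
                break
    return bad


def distinct_system_prompts(examples):
    """Every system prompt that appears anywhere in the dataset."""
    return {m["content"] for ex in examples for m in ex["messages"] if m["role"] == "system"}


def chat(system, user, assistant):
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}


examples = [
    chat("You are a precise code reviewer.", "Review: def foo(): pass", '{"summary": "Empty function.", "issues": []}'),
    chat("You are a precise code reviewer.", "Review: def foo(): pass", '{"summary": "Looks fine.", "issues": []}'),
    chat("You are a helpful assistant.", "Review: x = 1", "summary: fine"),
]
print("contradictions:", find_contradictions(examples))
print("invalid JSON outputs:", invalid_json_outputs(examples))
print("system prompts:", distinct_system_prompts(examples))
```

Anything these functions flag still needs a human decision. They narrow down where to look; they do not replace reading the data.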

Data Generation: Use a Stronger Model

The highest-leverage trick in modern fine-tuning is using a stronger model to generate your training data. Have Opus or GPT-5.4 generate five hundred examples in your target format, then manually review the most diverse fifty. Fix the mistakes. Use the cleaned set to fine-tune a smaller, cheaper model.

This pattern — strong model as teacher, small model as student — is called distillation, and it is the reason most successful fine-tuning projects in 2026 are not about teaching a model something new. They are about making something a big model already knows how to do cheap enough to run at scale.
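Picking "the most diverse fifty" by eye is tedious, so a cheap heuristic helps. The sketch below uses greedy farthest-point selection over word-set Jaccard distance. That is one simple stand-in for diversity that I am assuming here; embedding-based distance would likely do better. All names are illustrative.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 minus the overlap ratio of two token sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pick_diverse(texts, k):
    """Greedy farthest-point sampling: indices of up to k mutually dissimilar texts."""
    if not texts or k <= 0:
        return []
    token_sets = [set(t.lower().split()) for t in texts]
    chosen = [0]  # arbitrary starting point
    # nearest[i] = distance from text i to its closest already-chosen text
    nearest = [jaccard_distance(ts, token_sets[0]) for ts in token_sets]
    nearest[0] = -1.0  # mark as taken
    while len(chosen) < min(k, len(texts)):
        candidate = max(range(len(texts)), key=lambda i: nearest[i])
        chosen.append(candidate)
        for i, ts in enumerate(token_sets):
            nearest[i] = min(nearest[i], jaccard_distance(ts, token_sets[candidate]))
        nearest[candidate] = -1.0
    return chosen


generated = [
    "add two numbers",
    "add two integers",
    "parse a yaml file",
    "add two numbers",
]
print(pick_diverse(generated, 2))
```

Run it over the generated prompts, review the selected examples first, then spot-check a random handful of the rest.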

Hands-On: Fine-Tuning with Hugging Face TRL + QLoRA

This section assumes you have a Linux machine with an RTX 3090 or 4090 (24 GB VRAM), or a cloud instance with equivalent specs. If you are on a Mac with Apple Silicon and 64 GB or more of unified memory, replace the bitsandbytes quantization config with MPS-appropriate settings and expect training to take roughly two to three times longer.

Step 1: Install the Stack

pip install transformers datasets accelerate peft trl bitsandbytes wandb
huggingface-cli login
wandb login  # optional but strongly recommended for tracking runs

Step 2: Load and Quantize the Base Model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

# 4-bit quantization config — the sweet spot for 24 GB cards
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = "Qwen/Qwen3-8B-Instruct"  # confirm the exact repo ID on the Hub; Mistral, Llama, Gemma, etc. also work

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = prepare_model_for_kbit_training(model)

Step 3: Configure LoRA

lora_config = LoraConfig(
    r=16,               # rank — 8 to 64; 16 is a safe default
    lora_alpha=32,      # scaling factor — usually 2× rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # light regularization
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Should print something like: trainable params: 42M / 8.2B (0.5%)

A word on r (rank): higher rank means more capacity to learn but also more risk of overfitting and larger adapter files. If you have fewer than two hundred examples, use r=8. For two hundred to a thousand, r=16. Above a thousand, r=32 to 64. Do not just max it out — a higher rank on a small dataset will memorize, not generalize.
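To see where the trainable-parameter count comes from, note that LoRA adds r × (d_in + d_out) weights to each adapted matrix. The sketch below encodes the rank rule of thumb from the paragraph above and does that arithmetic. The layer shapes and layer count are my assumptions for a typical 8B-class model with grouped-query attention, not values read from any specific checkpoint.

```python
def pick_rank(n_examples: int) -> int:
    """Rule of thumb from the text: small datasets get small ranks."""
    if n_examples < 200:
        return 8
    if n_examples <= 1000:
        return 16
    return 32  # lower end of the 32-64 range; raise it only with evidence


def lora_param_count(shapes: dict, r: int, n_layers: int) -> int:
    """Each adapted (d_in, d_out) matrix adds r * (d_in + d_out) trainable weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# ASSUMED shapes (d_in, d_out); check your model's config for real values.
ASSUMED_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 12288),
    "up_proj": (4096, 12288),
    "down_proj": (12288, 4096),
}

rank = pick_rank(500)
total = lora_param_count(ASSUMED_SHAPES, r=rank, n_layers=36)
print(f"r={rank}: about {total / 1e6:.1f}M trainable parameters")
```

Under these assumed shapes the count lands in the same ballpark as the figure print_trainable_parameters() reports, and it scales linearly with r, so doubling the rank doubles the adapter size.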

Step 4: Format and Load Your Dataset

def format_conversation(example):
    """Convert your JSONL lines to the text format TRL expects."""
    messages = example["messages"]
    return {"text": tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )}

dataset = load_dataset("json", data_files="my_dataset.jsonl", split="train")
dataset = dataset.map(format_conversation)
dataset = dataset.train_test_split(test_size=0.15, seed=42)

Step 5: Train

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./qwen3-code-reviewer",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=10,
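    eval_strategy="steps",  # required, or eval_steps is ignored (evaluation_strategy in older transformers)
    eval_steps=50,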
    save_steps=100,
    bf16=True,
    max_seq_length=2048,  # renamed to max_length in newer TRL releases
    packing=False,  # set True for very short examples to speed up
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
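    tokenizer=tokenizer,  # renamed to processing_class in newer TRL releases
    # No peft_config here: the model was already wrapped by get_peft_model in Step 3.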
)

trainer.train()
trainer.save_model("./qwen3-code-reviewer-final")

Total training time on a single RTX 4090: roughly fifteen to forty-five minutes for five hundred examples on an 8B model with the settings above. That is the whole point of QLoRA — you iterate over lunch, not over a weekend.
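It helps to know how many optimizer steps a run actually contains before choosing logging and evaluation intervals. This is a back-of-the-envelope sketch for a single GPU using the settings above. The counts are approximate, and the Trainer's exact numbers can differ by a step depending on version and rounding.

```python
import math


def training_schedule(n_examples, test_size, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Approximate optimizer-step counts for a single-GPU run."""
    n_test = math.ceil(n_examples * test_size)
    n_train = n_examples - n_test
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "train_examples": n_train,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


plan = training_schedule(500, 0.15, per_device_batch=4, grad_accum=4, epochs=3, warmup_ratio=0.1)
print(plan)
```

For five hundred examples the whole run is only about eighty optimizer steps. At that size an eval_steps of 50 gives at most one mid-run evaluation point, and a save_steps of 100 never writes an intermediate checkpoint, so smaller intervals (say 10 to 20) are worth considering for small datasets.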

If you see the eval loss rising while train loss keeps falling after epoch two, you are overfitting. Stop. Reduce epochs or increase your dataset. Pushing through will not fix it.
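The "eval up, train down" pattern is easy to check mechanically from logged losses. A minimal sketch, assuming you have one train-loss and one eval-loss value per evaluation point (the function and its patience parameter are illustrative, not part of TRL):

```python
def overfit_point(train_losses, eval_losses, patience=2):
    """Index of the evaluation where eval loss has risen `patience` times in a row
    while train loss kept falling, or None if that never happens."""
    streak = 0
    for i in range(1, min(len(train_losses), len(eval_losses))):
        eval_up = eval_losses[i] > eval_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        streak = streak + 1 if (eval_up and train_down) else 0
        if streak >= patience:
            return i
    return None


train = [1.20, 0.90, 0.70, 0.50, 0.40]
evals = [1.30, 1.00, 0.95, 1.00, 1.10]
print(overfit_point(train, evals))
```

If you would rather have the trainer stop on its own, transformers ships an EarlyStoppingCallback that does something similar on a chosen metric. It needs periodic evaluation and best-model tracking enabled in the training arguments.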

Hands-On: Fine-Tuning with the OpenAI API

If you do not want to manage GPUs, the hosted option is simpler. It is also more expensive per run but cheaper in engineering time.

Step 1: Prepare and Validate Your JSONL

import json

def validate_and_write(examples, output_path):
    """Write examples to JSONL with validation."""
    with open(output_path, "w") as f:
        for i, ex in enumerate(examples):
            assert "messages" in ex, f"Example {i} missing 'messages'"
            assert len(ex["messages"]) >= 2, f"Example {i} needs at least user + assistant"
            roles = {m["role"] for m in ex["messages"]}
            assert "assistant" in roles, f"Example {i} has no assistant message"
            f.write(json.dumps(ex) + "\n")
    print(f"Wrote {len(examples)} examples to {output_path}")

# Each example is a full conversation
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a precise code reviewer."},
            {"role": "user", "content": "Review: def foo(): pass"},
            {"role": "assistant", "content": '{"summary": "Empty function.", "issues": [{"severity": "low", "description": "Function body is empty."}]}'}
        ]
    },
    # ... more examples
]

validate_and_write(examples, "training_data.jsonl")

Step 2: Upload and Create the Fine-Tuning Job

from openai import OpenAI

client = OpenAI()

# Upload
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)
print(f"File uploaded: {file.id}")

# Create job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",  # cheapest model that supports fine-tuning
    suffix="code-reviewer-v1",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.8,
    },
)

print(f"Job created: {job.id}")
print(f"Status endpoint: client.fine_tuning.jobs.retrieve('{job.id}')")

Step 3: Monitor and Evaluate

import time

while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")

    if job.status == "succeeded":
        print(f"Fine-tuned model: {job.fine_tuned_model}")
        break
    elif job.status in ("failed", "cancelled"):
        print(f"Job failed: {job.error}")
        break

    # Note when a metrics file is available (fetch it with client.files.content)
    if getattr(job, "result_files", None):
        print(f"Metrics file available: {job.result_files[0]}")

    time.sleep(60)

OpenAI fine-tuning pricing as of mid-2026: roughly $8 per million tokens for training on GPT-4o-mini, plus inference on the resulting model at roughly double the base inference price. A five-hundred-example dataset with an average of two thousand tokens per example will cost about eight dollars to train. That is cheap enough that you can afford to be wrong twice before you get it right — and you should budget for being wrong at least once.
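The estimate above is a one-line calculation, and it is worth parameterizing so you can plug in your own dataset. A sketch, using the article's $8 per million tokens and assuming the provider bills per trained token, so the epoch count multiplies the total. Check the current pricing page to confirm both assumptions.

```python
def training_cost_usd(n_examples, avg_tokens_per_example, price_per_million_tokens, epochs=1):
    """Estimated training cost, assuming billing per token processed during training."""
    trained_tokens = n_examples * avg_tokens_per_example * epochs
    return trained_tokens / 1_000_000 * price_per_million_tokens


one_pass = training_cost_usd(500, 2000, 8.0)
three_epochs = training_cost_usd(500, 2000, 8.0, epochs=3)
print(f"one pass over the data: ${one_pass:.2f}")
print(f"three epochs, if epochs are billed: ${three_epochs:.2f}")
```

Either way the total stays in the tens of dollars, which is the point: the expensive part of a hosted run is your time, not the invoice.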

Evaluating Your Fine-Tuned Model

Training loss is not evaluation. It tells you whether the model is learning your dataset. It does not tell you whether the model is learning anything useful. You need a separate evaluation that approximates how the model will be used in production.

Automated Evaluation: LLM-as-Judge

For tasks where the output has a clear structure, use a stronger model as a judge:

def evaluate_with_judge(prompt, reference_output, model_output):
    """Ask a strong model to compare outputs."""
    judge_prompt = f"""You are evaluating a fine-tuned model's output against a reference.

Prompt: {prompt}

Reference (desired output):
{reference_output}

Model output:
{model_output}

Rate the model output on a scale of 1-5 for:
1. Correctness: Does it get the right answer?
2. Format adherence: Does it follow the expected structure?
3. Completeness: Is anything missing?

Return a JSON object: {{"correctness": int, "format": int, "completeness": int, "notes": str}}"""

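    # `client` is the OpenAI client created earlier, so the judge must be a model
    # that endpoint serves. A Claude judge needs Anthropic's SDK or a compatible gateway.
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)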

Run this on your held-out test set and compute the mean scores. If correctness drops below 4 out of 5 for a task you described as narrow and well-defined, your dataset has a problem.
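Aggregating the judge's verdicts is the easy part. A minimal sketch, assuming each verdict is a dict shaped like the JSON object the judge prompt asks for (the summary function and its threshold parameter are mine):

```python
from statistics import mean

DIMENSIONS = ("correctness", "format", "completeness")


def summarize_scores(verdicts, threshold=4.0):
    """Mean score per dimension, plus the examples that drag correctness down."""
    means = {dim: mean(v[dim] for v in verdicts) for dim in DIMENSIONS}
    weak = [i for i, v in enumerate(verdicts) if v["correctness"] < threshold]
    return {
        "means": means,
        "weak_examples": weak,  # read these by hand first
        "passed": means["correctness"] >= threshold,
    }


verdicts = [
    {"correctness": 5, "format": 5, "completeness": 4, "notes": ""},
    {"correctness": 4, "format": 5, "completeness": 5, "notes": ""},
    {"correctness": 2, "format": 5, "completeness": 3, "notes": "wrong severity"},
]
summary = summarize_scores(verdicts)
print(summary)
```

The list of weak examples matters more than the mean. Those are the prompts that tell you what is missing from the dataset.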

The Manual Spot-Check

Automated metrics lie. Always. Sample twenty outputs from your fine-tuned model and twenty from the base model on the same prompts. Read them side by side, blinded if you can. Ask yourself: would I ship this? If the answer is no for more than two or three of the twenty, do not ship it. Go back to the dataset.
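Blinding yourself takes a few lines. The sketch below randomly assigns each pair of outputs to slots "A" and "B" and keeps the answer key separate, so you can judge first and unblind afterwards. The function name and the dict layout are illustrative.

```python
import random


def make_blind_pairs(prompts, base_outputs, tuned_outputs, seed=0):
    """Return (review_sheet, answer_key); answer_key[i] is the slot holding the fine-tuned output."""
    rng = random.Random(seed)
    sheet, key = [], []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        tuned_is_a = rng.random() < 0.5
        a, b = (tuned, base) if tuned_is_a else (base, tuned)
        sheet.append({"prompt": prompt, "A": a, "B": b})
        key.append("A" if tuned_is_a else "B")
    return sheet, key


prompts = [f"prompt {i}" for i in range(20)]
base_outputs = [f"base answer {i}" for i in range(20)]
tuned_outputs = [f"tuned answer {i}" for i in range(20)]

sheet, key = make_blind_pairs(prompts, base_outputs, tuned_outputs)
print(sheet[0])
```

Write your "would I ship this?" verdicts against the sheet only, then use the key to count how many of the rejected outputs came from the fine-tuned model.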

The Regression Test

Keep a fixed set of twenty to fifty prompts that represent your most important use cases. After every fine-tuning run, evaluate against this set. Store the results. A new run that beats the old run on train loss but regresses on three of your regression prompts is a failed run, no matter what the loss curve says.
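The comparison itself can be a small function over stored per-prompt scores. A sketch, assuming each run's results are saved as a mapping from prompt ID to a numeric score where higher is better (the names and the tolerance parameter are hypothetical):

```python
def find_regressions(old_scores: dict, new_scores: dict, tolerance: float = 0.0):
    """Prompt IDs whose score dropped by more than `tolerance`.

    A prompt missing from the new run counts as a regression.
    """
    return sorted(
        prompt_id
        for prompt_id, old in old_scores.items()
        if new_scores.get(prompt_id, float("-inf")) < old - tolerance
    )


run_1 = {"yaml-output": 5, "empty-function": 4, "nested-errors": 4}
run_2 = {"yaml-output": 5, "empty-function": 3, "nested-errors": 5}

regressed = find_regressions(run_1, run_2)
print("regressed prompts:", regressed)
```

Treat a non-empty result as a failed run regardless of the loss curve, and keep the score files under version control next to the dataset that produced them.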

Common Pitfalls and How to Avoid Them

Catastrophic forgetting. The model gets better at your task and worse at everything else. Mitigation: mix in five to ten percent of general-purpose examples (from the base model’s original training distribution) into your dataset. This is called data mixing, and it is the simplest insurance policy you can buy against a model that forgets how to say hello.

Overfitting to the prompt template. If you train with “### Instruction:\n…\n### Response:\n…” and then prompt with “You are a helpful assistant. Please…” the model will produce garbage. The fine-tuned model has learned to associate a specific format with the behavior. Use the same prompt structure at inference that you used at training. Better yet, use the model’s native chat template (via tokenizer.apply_chat_template()) and never think about prompt formatting again.

Training on AI-generated data without review. Using Opus to generate five hundred examples and then fine-tuning on all of them without reading a single one is how you get a model that confidently produces beautiful, well-formatted wrong answers. The teacher model has its own failure modes, and fine-tuning amplifies them. Review at least the most diverse ten percent of your generated data manually.

Using the wrong base model. Fine-tuning a general-purpose chat model (like GPT-4o-mini or Qwen Instruct) usually works better than fine-tuning a base completion model, because the chat model already understands instruction-following. The exception is when you need the model to produce raw completions without the chat scaffolding — but you probably do not.

Ignoring the system prompt during training. If your inference pipeline uses a system prompt, your training data must use the same system prompt. If your inference pipeline does not use a system prompt, do not include one in training. This sounds obvious, and yet it is the single most common bug in production fine-tuning pipelines I have seen.

Fine-tuning when RAG would have worked. I said this at the start, and I am saying it again because it is the most expensive mistake on this list. A fine-tuning run costs money. A RAG pipeline costs time. The wrong choice costs both.
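Of the mitigations above, the data mixing suggested for catastrophic forgetting is the most mechanical. A minimal sketch, assuming you have a pool of general-purpose examples in the same messages format as your task data (where that pool comes from is up to you; the function name is illustrative):

```python
import random


def mix_in_general(task_examples, general_examples, fraction=0.1, seed=42):
    """Return a shuffled dataset in which roughly `fraction` of the examples are general-purpose."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    # Solve n_general / (n_task + n_general) = fraction for n_general.
    n_general = round(len(task_examples) * fraction / (1 - fraction))
    n_general = min(n_general, len(general_examples))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


task = [{"source": "task", "id": i} for i in range(450)]
general = [{"source": "general", "id": i} for i in range(2000)]

mixed = mix_in_general(task, general, fraction=0.1)
share = sum(ex["source"] == "general" for ex in mixed) / len(mixed)
print(f"{len(mixed)} examples, {share:.0%} general-purpose")
```

Mix before you split off the test set only if you also want general-purpose prompts in your evaluation. Otherwise split the task data first and mix into the training portion alone.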

Deployment: Serving Your Fine-Tuned Model

LoRA Adapter Deployment

For Hugging Face PEFT adapters, you have two options. The simple one: merge and serve.

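import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen3-8B-Instruct"  # must match the base model used for training

# Load the base model in bf16 (not 4-bit) so the adapter can be merged cleanly
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "./qwen3-code-reviewer-final")
merged = model.merge_and_unload()  # fold the adapter weights into the base weights
merged.save_pretrained("./qwen3-code-reviewer-merged")

# Save the tokenizer alongside so the merged directory is self-contained
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.save_pretrained("./qwen3-code-reviewer-merged")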

Then serve the merged model with vLLM, TGI, or any standard inference server. The merged model is a full checkpoint — the adapter weights have been folded into the base weights — so it is compatible with any serving infrastructure that supports the base architecture.

The more flexible option: serve the base model and load adapters dynamically. vLLM supports LoRA adapter hot-swapping, which lets you serve dozens of fine-tuned variants from a single base model instance. This is the right architecture if you are fine-tuning per-customer or per-task and do not want to run a separate GPU for each one.

OpenAI Fine-Tuned Model Deployment

Your fine-tuned model gets a model ID like ft:gpt-4o-mini-2024-07-18:personal:code-reviewer-v1:abc123. Use it exactly like any other model:

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:personal:code-reviewer-v1:abc123",
    messages=[
        {"role": "system", "content": "You are a precise code reviewer."},
        {"role": "user", "content": "Review: def foo(): pass"},
    ],
)

No infrastructure to manage. The trade-off is that you cannot inspect the weights, you cannot merge adapters, you cannot serve on your own hardware, and you pay OpenAI’s inference markup. For many teams, this is a completely reasonable trade.

What This Costs, Honestly

Here is a real budget for a typical fine-tuning project in mid-2026, assuming you are an individual developer or a small team:

| Item | QLoRA (self-hosted) | OpenAI hosted |
| --- | --- | --- |
| GPU for training | $1.50–$3/hour (cloud rental, RTX 4090) | $0 (included in training price) |
| Training run (500 examples, 8B model) | ~$1–$5 total (30–90 minutes) | ~$8 total |
| Iteration (3 runs, because the first two will be wrong) | $3–$15 | $24 |
| Inference (1M tokens/month) | $0.30–$0.80 (cloud GPU) or free (local) | ~$0.60 (GPT-4o-mini fine-tune) |
| Engineering time (dataset, eval, integration) | 2–5 days | 1–3 days |

The hosted route wins on engineering time. The self-hosted route wins on marginal cost and gives you a model you can run anywhere, including offline and in air-gapped environments. Pick based on which constraint bites harder: your time or your compute budget.

If you are fine-tuning a 70B model with QLoRA, multiply the training time by roughly four to six, and the cloud GPU cost by the same factor. It is still manageable on a single A100 (80 GB), which rents for about two to four dollars an hour.

When Not to Fine-Tune

You have read two thousand words about how to fine-tune. Here are two hundred about when to stop.

If your base model already gets the answer right more than eighty percent of the time, you are probably fine-tuning the wrong thing. Fix your prompts or your pipeline first. Fine-tuning is for the gap between “mostly works” and “works reliably in production.” It is not for the gap between “does not work at all” and “mostly works.” That gap is almost always a prompt engineering or RAG problem.

If you cannot write down, in one sentence, exactly what behavior you want the fine-tuned model to exhibit that the base model does not, you are not ready to fine-tune. “Better code reviews” is not a sentence. “Always respond with a JSON object containing ‘summary’, ‘issues’, and ‘score’ keys, where ‘issues’ is an array of objects with ‘file’, ‘line’, ‘severity’, and ‘description’” is a sentence. Write the sentence first. Then collect the data. Then train.
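A sentence that precise translates directly into a check you can run on every model output. The sketch below validates the example specification from the paragraph above. It tests only for the presence of keys, not their types or values, and the function name is my own.

```python
import json

REQUIRED_TOP_LEVEL = {"summary", "issues", "score"}
REQUIRED_PER_ISSUE = {"file", "line", "severity", "description"}


def check_output(text: str) -> list:
    """Return a list of problems; an empty list means the output matches the spec."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top level is not a JSON object"]

    problems = [f"missing key '{k}'" for k in sorted(REQUIRED_TOP_LEVEL - obj.keys())]
    issues = obj.get("issues", [])
    if not isinstance(issues, list):
        problems.append("'issues' is not an array")
        return problems
    for i, issue in enumerate(issues):
        if not isinstance(issue, dict):
            problems.append(f"issues[{i}] is not an object")
            continue
        for k in sorted(REQUIRED_PER_ISSUE - issue.keys()):
            problems.append(f"issues[{i}] missing key '{k}'")
    return problems


good = '{"summary": "ok", "score": 9, "issues": [{"file": "a.py", "line": 3, "severity": "low", "description": "unused import"}]}'
bad = '{"summary": "ok", "issues": [{"file": "a.py", "severity": "low"}]}'
print(check_output(good))
print(check_output(bad))
```

If you can write this function, you are ready to collect data, and the same function doubles as a format-adherence metric on your test set.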

If you have fewer than thirty high-quality examples, you do not have a dataset. You have a few examples. Use them as few-shot prompts instead. Fine-tuning with thirty examples will overfit, and your eval will look great right up until the moment a real user sends a prompt that is not in your thirty-example set.
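Turning a handful of examples into a few-shot prompt is a matter of laying them out as prior turns. A minimal sketch, assuming each example is a (user text, assistant text) pair and that you are calling a chat-style API (the function name is illustrative):

```python
def few_shot_messages(system_prompt, examples, new_input):
    """Build a chat message list with worked examples placed before the real request."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": new_input})
    return messages


shots = [
    ("Review: def foo(): pass", '{"summary": "Empty function.", "issues": []}'),
    ("Review: def add(a,b): return a+b", '{"summary": "Simple addition function.", "issues": []}'),
]
messages = few_shot_messages("You are a precise code reviewer.", shots, "Review: x = eval(input())")
print(len(messages), "messages, last role:", messages[-1]["role"])
```

Pass the resulting list as the messages argument of a chat completion call. If this gets you the behavior you want, you have saved yourself a training run.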

The Takeaway

Fine-tuning is not magic. It is a tool that takes a well-defined task, a carefully curated dataset, and a few hours of GPU time, and produces a model that does one thing reliably. The reliability is the point. A fine-tuned 8B model that does exactly what you want, every time, beats a 70B general-purpose model that sometimes gets creative.

Start small. One task. Fifty examples. QLoRA on whatever GPU you have. Evaluate honestly. Iterate on the data, not the hyperparameters. Ship when the regression tests pass.

The models are getting better every quarter, but the gap between a good general-purpose model and a great task-specific model has not closed. Closing that gap is what fine-tuning is for, and it is still the highest-leverage thing you can learn this year.
