Fine-tuning has become the answer to everything. Your model does not follow instructions? Fine-tune it. Your prompts are too long? Fine-tune it. Your RAG pipeline returns garbage? Fine-tune it. The problem is that fine-tuning is also the easiest way to burn a weekend and three hundred dollars in compute for a model that performs worse than the base checkpoint you started with.
This guide is the one I wish I had eighteen months ago. No theory for its own sake. Every section ends with something you can run.
When Fine-Tuning Actually Makes Sense
Fine-tuning is not the first tool you reach for. It is the one you reach for when everything else has stopped working. The decision tree, simplified:
| Approach | When to use | Cost | Time to first result |
|---|---|---|---|
| Prompt engineering | Always first | Free | Minutes |
| Few-shot examples in context | Prompt alone is not enough | Free (but eats context) | Minutes |
| RAG | Need external knowledge, facts, documents | Low to medium | Hours to days |
| Fine-tuning | Need a new behavior, style, format, or domain language | Medium to high | Days to weeks |
| Pre-training from scratch | You are a lab with a nine-figure budget | Astronomical | Months |
The heuristic that has served me well: if you can describe what you want in a paragraph, prompt for it. If you need fifty examples in the context window to get the behavior you want, you have a fine-tuning problem. If the model consistently gets the answer right but in the wrong format, you have a fine-tuning problem. If the model does not know facts it should know, you have a RAG problem, not a fine-tuning problem — fine-tuning is for behavior, not knowledge.
The corollary: the most successful fine-tuning runs I have seen are narrow. One task. One output format. One style. The moment you try to teach a model three unrelated things at once, you are paying for three fine-tuning runs that each produce worse results than one focused run would have.
Full Fine-Tuning vs. LoRA vs. QLoRA
You do not need to understand the linear algebra to pick the right method, but you do need to understand the trade-offs.
| Method | GPU requirement | Training speed | Model size on disk | Quality ceiling |
|---|---|---|---|---|
| Full fine-tuning | 8× A100 (80 GB) for a 70B model | Slow | Full model (~140 GB for 70B) | Highest |
| LoRA (Low-Rank Adaptation) | 1–2× A100 or RTX 4090 for 7–13B | Fast | Adapter only (~10–200 MB) | Very close to full |
| QLoRA (Quantized LoRA) | 1× RTX 3090/4090 (24 GB) for 7–13B, or a Mac with 64 GB unified memory | Moderate | Adapter only (~10–200 MB) | Slightly below LoRA, adequate for most tasks |
For ninety-five percent of real-world fine-tuning tasks, QLoRA on a single consumer GPU is the right answer. The quality gap between QLoRA and full fine-tuning on a focused, single-task dataset is smaller than the gap between a well-curated dataset and a sloppy one. Spend your energy on the data, not on chasing the last half-percent of method quality.
The one exception: if you are fine-tuning a model that will serve millions of requests a day and every token of quality matters, full fine-tuning starts to pay for itself. But if you are in that position, you already have an H100 cluster and a team that is not reading this post.
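The disk and memory figures in the table above follow from bytes-per-parameter arithmetic. Here is a rough sketch, assuming 16 bits per parameter for full-precision weights and 4 bits for the quantized base. It ignores activations, optimizer state, and quantization overhead, so treat the results as lower bounds. The function name weight_memory_gb is made up for this example.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


full_70b = weight_memory_gb(70e9, 16)   # 16-bit weights for a 70B model
qlora_8b = weight_memory_gb(8e9, 4)     # 4-bit base weights for an 8B model
qlora_70b = weight_memory_gb(70e9, 4)   # 4-bit base weights for a 70B model

print(f"70B @ 16-bit: {full_70b:.0f} GB")
print(f"8B  @ 4-bit:  {qlora_8b:.0f} GB")
print(f"70B @ 4-bit:  {qlora_70b:.0f} GB")
```

The 16-bit result lines up with the table's on-disk size for a 70B model. The 4-bit 8B figure is why a 24 GB card still has room for adapters, activations, and optimizer state.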
Preparing Your Dataset: The Part That Actually Matters
The single most common reason fine-tuning fails is not the learning rate or the number of epochs. It is the dataset. And the failure mode is almost always the same: too many examples, too little curation.
Quality Over Quantity, Always
You need far fewer examples than you think. For instruction fine-tuning on a narrow task:
| Task complexity | Minimum examples | Comfortable range |
|---|---|---|
| Output format change (e.g., always respond in YAML) | 50–100 | 200–500 |
| Style or tone adaptation | 100–300 | 500–1,000 |
| Domain-specific reasoning (legal, medical, financial) | 500–1,000 | 2,000–5,000 |
| Complex multi-step behavior | 1,000–2,000 | 5,000–10,000 |
The numbers above are not theoretical. They come from actual runs. I have seen a fifty-example dataset produce a better-structured output formatter than a five-thousand-example dataset where nobody bothered to remove the duplicates and the contradictory examples.
The Format: JSONL, Conversational
Every major fine-tuning API and library expects JSONL — one JSON object per line. The two formats you will actually use:
Conversation format with OpenAI-style "messages" (OpenAI, Together, Fireworks, most open-weight fine-tuning). The original ShareGPT layout is a close cousin that uses "conversations" with "from" and "value" keys:
{"messages": [{"role": "system", "content": "You are a helpful code reviewer. Respond in the format: {summary, issues: [{file, severity, description, fix}]}."}, {"role": "user", "content": "Review this function: def add(a,b): return a+b"}, {"role": "assistant", "content": "{\"summary\": \"Simple addition function, no issues found.\", \"issues\": []}"}]}
Instruction / completion format, Alpaca-style (older APIs, some open-weight tools):
{"instruction": "Review this code and return a JSON object with keys 'summary' and 'issues'.", "input": "def add(a,b): return a+b", "output": "{\"summary\": \"Simple addition function.\", \"issues\": []}"}
Use the conversation format unless you have a specific reason not to. It is the direction the ecosystem is moving, and it handles multi-turn examples cleanly.
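If you already have data in the instruction / completion layout, converting it is mechanical. A minimal sketch, assuming the keys are exactly "instruction", "input", and "output" as in the example above; the function name and the optional system prompt argument are illustrative, not part of any library.

```python
import json
from typing import Optional


def instruction_to_messages(record: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one instruction/input/output record into a conversation record."""
    user_content = record["instruction"]
    if record.get("input"):
        # Append the input below the instruction, separated by a blank line
        user_content += "\n\n" + record["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}


legacy = {
    "instruction": "Review this code and return a JSON object with keys 'summary' and 'issues'.",
    "input": "def add(a,b): return a+b",
    "output": "{\"summary\": \"Simple addition function.\", \"issues\": []}",
}
converted = instruction_to_messages(legacy)
print(json.dumps(converted))  # one JSONL line in conversation format
```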
The Curation Checklist
Before you train, run through this list for every example in your dataset:
Is the output exactly what you want the model to produce? Not approximately. Exactly. If your production system expects valid JSON with snake_case keys, every training example must output valid JSON with snake_case keys. One example with camelCase keys will confuse the model more than zero examples would have.
Are there contradictory examples? Two examples with the same input and different outputs are poison. Find them. Remove one. If you cannot decide which output is correct, your task is underspecified and fine-tuning will not help.
Are the examples diverse enough? Fifty variations of “write a function that adds two numbers” will teach the model to add numbers. It will not teach the model to write functions. Cover the edge cases, the error paths, the weird inputs.
Are system prompts consistent? If you use a system prompt in training, use the same system prompt (or very close variants) in inference. A model fine-tuned with “You are an expert Python reviewer” will behave unpredictably when you prompt it with “You are a helpful assistant” at inference time.
Is there leakage from the base model? Generate candidate outputs with the base model first. If the base model already produces the right answer for twenty percent of your examples, those examples are not teaching the model anything new. Replace them with harder cases.
Have you reserved a test set? Set aside ten to twenty percent of your examples before training. Do not look at them during curation. Use them only for evaluation. This is not optional.
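The duplicate and contradiction checks from the list above are easy to automate. A minimal sketch, assuming conversation-format examples where the final message is the assistant turn; find_conflicts and make are hypothetical helper names, and exact string matching is the simplest possible definition of "same input".

```python
import json
from collections import defaultdict


def find_conflicts(examples):
    """Return (number of exact duplicates, list of prompts with conflicting outputs)."""
    outputs_by_prompt = defaultdict(list)
    for ex in examples:
        msgs = ex["messages"]
        # Key on everything except the final assistant turn
        prompt_key = json.dumps(msgs[:-1], sort_keys=True)
        outputs_by_prompt[prompt_key].append(msgs[-1]["content"])

    duplicates = 0
    contradictions = []
    for prompt_key, outputs in outputs_by_prompt.items():
        distinct = set(outputs)
        duplicates += len(outputs) - len(distinct)  # repeats of an identical pair
        if len(distinct) > 1:                       # same input, different outputs
            contradictions.append(prompt_key)
    return duplicates, contradictions


def make(user, assistant):
    """Build a two-turn example (illustrative helper)."""
    return {"messages": [{"role": "user", "content": user},
                         {"role": "assistant", "content": assistant}]}


examples = [
    make("Review: def foo(): pass", '{"issues": []}'),
    make("Review: def foo(): pass", '{"issues": []}'),            # exact duplicate
    make("Review: def foo(): pass", '{"issues": ["empty body"]}'),  # contradiction
    make("Review: def bar(x): return x", '{"issues": []}'),
]
duplicates, contradictions = find_conflicts(examples)
print(f"{duplicates} duplicate(s), {len(contradictions)} contradictory prompt(s)")
```

Near-duplicates (whitespace changes, reworded prompts) slip past exact matching, so this is a floor on the cleanup work, not a replacement for reading the data.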
Data Generation: Use a Stronger Model
The highest-leverage trick in modern fine-tuning is using a stronger model to generate your training data. Have Opus or GPT-5.4 generate five hundred examples in your target format, then manually review the most diverse fifty. Fix the mistakes. Use the cleaned set to fine-tune a smaller, cheaper model.
This pattern — strong model as teacher, small model as student — is called distillation, and it is the reason most successful fine-tuning projects in 2026 are not about teaching a model something new. They are about making something a big model already knows how to do cheap enough to run at scale.
Hands-On: Fine-Tuning with Hugging Face TRL + QLoRA
This section assumes you have a Linux machine with an RTX 3090 or 4090 (24 GB VRAM), or a cloud instance with equivalent specs. If you are on a Mac with Apple Silicon and 64 GB or more of unified memory, replace the bitsandbytes config with MPS-appropriate settings and expect training to take roughly two to three times longer.
Step 1: Install the Stack
pip install transformers datasets accelerate peft trl bitsandbytes wandb
huggingface-cli login
wandb login  # optional but strongly recommended for tracking runs
Step 2: Load and Quantize the Base Model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

# 4-bit quantization config: the sweet spot for 24 GB cards
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Qwen3's chat-tuned 8B checkpoint (the pretrained-only one is Qwen3-8B-Base);
# Mistral, Llama, Gemma, etc. work the same way
model_id = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Fall back to EOS only when the tokenizer ships without a pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = prepare_model_for_kbit_training(model)
Step 3: Configure LoRA
lora_config = LoraConfig(
    r=16,               # rank: 8 to 64; 16 is a safe default
    lora_alpha=32,      # scaling factor, usually 2× rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # light regularization
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Should print something like: trainable params: 42M / 8.2B (0.5%)
A word on r (rank): higher rank means more capacity to learn but also more risk of overfitting and larger adapter files. If you have fewer than two hundred examples, use r=8. For two hundred to a thousand, r=16. Above a thousand, r=32 to 64. Do not just max it out: a higher rank on a small dataset will memorize, not generalize.
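To make the rank guidance concrete, here is a small sketch that encodes the thresholds above and estimates adapter size. It relies on the fact that a LoRA adapter on a d_in × d_out weight matrix adds r × (d_in + d_out) trainable parameters. The layer dimensions below are assumed, illustrative values for an 8B-class model with grouped-query attention, not numbers read from any specific checkpoint, and both function names are hypothetical.

```python
def pick_rank(n_examples: int) -> int:
    """Rule of thumb from the text: small data gets a small rank."""
    if n_examples < 200:
        return 8
    if n_examples <= 1000:
        return 16
    return 32  # lower end of the 32-64 range; raise only with evidence


def lora_param_count(rank: int, shapes: list, n_layers: int) -> int:
    """Each adapted d_in x d_out matrix adds rank * (d_in + d_out) parameters."""
    return n_layers * sum(rank * (d_in + d_out) for d_in, d_out in shapes)


# Assumed dimensions: hidden size 4096, key/value width 1024, MLP width 12288, 36 layers
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 12288, 36
shapes = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, MLP),     # gate_proj
    (HIDDEN, MLP),     # up_proj
    (MLP, HIDDEN),     # down_proj
]

rank = pick_rank(500)
trainable = lora_param_count(rank, shapes, LAYERS)
print(f"rank={rank}, trainable adapter params = {trainable / 1e6:.1f}M")
```

Under these assumed dimensions the count lands in the low forties of millions at r=16, the same ballpark as the printout in the comment above. Doubling the rank doubles both the parameter count and the adapter file size.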
Step 4: Format and Load Your Dataset
def format_conversation(example):
    """Convert your JSONL lines to the text format TRL expects."""
    messages = example["messages"]
    return {"text": tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )}

dataset = load_dataset("json", data_files="my_dataset.jsonl", split="train")
# Drop the raw "messages" column so the trainer sees only the rendered "text" field
dataset = dataset.map(format_conversation, remove_columns=dataset.column_names)
dataset = dataset.train_test_split(test_size=0.15, seed=42)
Step 5: Train
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./qwen3-code-reviewer",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="steps",  # without this, eval_steps is ignored and no evaluation runs
    eval_steps=50,
    save_steps=100,
    bf16=True,
    max_length=2048,  # called max_seq_length in older TRL releases
    packing=False,  # set True for very short examples to speed up
    report_to="wandb",
)
trainer = SFTTrainer(
    model=model,  # already a PEFT model from Step 3, so no peft_config is passed here
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,  # called tokenizer= in older TRL releases
)
trainer.train()
trainer.save_model("./qwen3-code-reviewer-final")
Total training time on a single RTX 4090: roughly fifteen to forty-five minutes for five hundred examples on an 8B model with the settings above. That is the whole point of QLoRA: you iterate over lunch, not over a weekend.
If you see the eval loss rising while train loss keeps falling after epoch two, you are overfitting. Stop. Reduce epochs or increase your dataset. Pushing through will not fix it.
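The divergence pattern is simple enough to check programmatically from logged losses. A minimal sketch, assuming you have train and eval losses recorded at the same evaluation points; the function name, the patience of two evaluations, and the loss values are all made up for illustration.

```python
def overfitting_step(train_losses, eval_losses, patience=2):
    """Return the index where eval loss has risen `patience` times in a row
    while train loss kept falling, or None if that never happens."""
    streak = 0
    for i in range(1, len(eval_losses)):
        eval_up = eval_losses[i] > eval_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        streak = streak + 1 if (eval_up and train_down) else 0
        if streak >= patience:
            return i
    return None


# Hypothetical loss curves logged at each evaluation point
train = [1.20, 0.90, 0.70, 0.55, 0.42, 0.33]
evals = [1.25, 1.00, 0.88, 0.86, 0.91, 0.97]

stop_at = overfitting_step(train, evals)
print(f"Diverging at evaluation #{stop_at}; best checkpoint is the one before the rise")
```

A patience of two avoids reacting to a single noisy evaluation. With small eval sets the loss jitters, so one uptick alone is weak evidence.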
Hands-On: Fine-Tuning with the OpenAI API
If you do not want to manage GPUs, the hosted option is simpler. It is also more expensive per run but cheaper in engineering time.
Step 1: Prepare and Validate Your JSONL
import json

def validate_and_write(examples, output_path):
    """Write examples to JSONL with validation."""
    with open(output_path, "w") as f:
        for i, ex in enumerate(examples):
            assert "messages" in ex, f"Example {i} missing 'messages'"
            assert len(ex["messages"]) >= 2, f"Example {i} needs at least user + assistant"
            roles = {m["role"] for m in ex["messages"]}
            assert "assistant" in roles, f"Example {i} has no assistant message"
            f.write(json.dumps(ex) + "\n")
    print(f"Wrote {len(examples)} examples to {output_path}")

# Each example is a full conversation
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a precise code reviewer."},
            {"role": "user", "content": "Review: def foo(): pass"},
            {"role": "assistant", "content": '{"summary": "Empty function.", "issues": [{"severity": "low", "description": "Function body is empty."}]}'}
        ]
    },
    # ... more examples
]
validate_and_write(examples, "training_data.jsonl")
Step 2: Upload and Create the Fine-Tuning Job
from openai import OpenAI

client = OpenAI()

# Upload (the context manager closes the file handle afterwards)
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")
print(f"File uploaded: {file.id}")

# Create job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",  # cheapest model that supports fine-tuning
    suffix="code-reviewer-v1",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.8,
    },
)
print(f"Job created: {job.id}")
print(f"Check status with: client.fine_tuning.jobs.retrieve('{job.id}')")
Step 3: Monitor and Evaluate
import time

while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status == "succeeded":
        print(f"Fine-tuned model: {job.fine_tuned_model}")
        # Result files (training metrics) are attached once the job has finished
        if job.result_files:
            print(f"Metrics files: {job.result_files}")
        break
    elif job.status in ("failed", "cancelled"):
        print(f"Job {job.status}: {job.error}")
        break
    time.sleep(60)
OpenAI fine-tuning pricing as of mid-2026: roughly $8 per million tokens for training on GPT-4o-mini, plus inference on the resulting model at roughly double the base inference price. A five-hundred-example dataset with an average of two thousand tokens per example is about one million tokens, so each pass over it costs about eight dollars at that rate. Training is billed on tokens multiplied by epochs, so the three-epoch job above comes to about twenty-four. That is cheap enough that you can afford to be wrong twice before you get it right, and you should budget for being wrong at least once.
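The cost arithmetic is worth running before you submit a job. A minimal sketch, assuming training is billed on every token once per epoch and using the per-million rate quoted above as an input; that rate is not a verified price, so check the provider's current pricing page. The function name is illustrative.

```python
def training_cost_usd(n_examples: int, avg_tokens: int, n_epochs: int,
                      usd_per_million_tokens: float) -> float:
    """Estimate hosted training cost: billed tokens = dataset tokens x epochs."""
    billed_tokens = n_examples * avg_tokens * n_epochs
    return billed_tokens / 1_000_000 * usd_per_million_tokens


one_epoch = training_cost_usd(500, 2000, n_epochs=1, usd_per_million_tokens=8.0)
three_epochs = training_cost_usd(500, 2000, n_epochs=3, usd_per_million_tokens=8.0)
print(f"1 epoch: ${one_epoch:.2f}, 3 epochs: ${three_epochs:.2f}")
```

Average tokens per example is the number people most often underestimate, because system prompts are counted in every example.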
Evaluating Your Fine-Tuned Model
Training loss is not evaluation. It tells you whether the model is learning your dataset. It does not tell you whether the model is learning anything useful. You need a separate evaluation that approximates how the model will be used in production.
Automated Evaluation: LLM-as-Judge
For tasks where the output has a clear structure, use a stronger model as a judge:
def evaluate_with_judge(prompt, reference_output, model_output):
    """Ask a strong model to compare outputs (reuses json and client from the snippets above)."""
    judge_prompt = f"""You are evaluating a fine-tuned model's output against a reference.
Prompt: {prompt}
Reference (desired output):
{reference_output}
Model output:
{model_output}
Rate the model output on a scale of 1-5 for:
1. Correctness: Does it get the right answer?
2. Format adherence: Does it follow the expected structure?
3. Completeness: Is anything missing?
Return a JSON object: {{"correctness": int, "format": int, "completeness": int, "notes": str}}"""
    response = client.chat.completions.create(
        # Must be a model this client can reach; a Claude judge needs Anthropic's SDK
        # or an OpenAI-compatible endpoint instead of the default OpenAI() client
        model="gpt-5.4",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
Run this on your held-out test set and compute the mean scores. If correctness drops below 4 out of 5 for a task you described as narrow and well-defined, your dataset has a problem.
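Aggregating the judge's verdicts takes only a few lines. A minimal sketch, assuming each judgement is a dict with the three integer scores requested in the judge prompt; the sample scores are invented, and applying the same threshold of 4 to every dimension is my simplification of the correctness rule above.

```python
from statistics import mean


def summarize_scores(judgements, threshold=4.0):
    """Return per-dimension means and the dimensions that fall below the threshold."""
    dims = ("correctness", "format", "completeness")
    means = {d: mean(j[d] for j in judgements) for d in dims}
    failing = [d for d in dims if means[d] < threshold]
    return means, failing


# Hypothetical judge outputs for three held-out examples
judgements = [
    {"correctness": 5, "format": 5, "completeness": 4, "notes": ""},
    {"correctness": 3, "format": 5, "completeness": 4, "notes": "missed an issue"},
    {"correctness": 3, "format": 5, "completeness": 5, "notes": "wrong severity"},
]
means, failing = summarize_scores(judgements)
print({d: round(v, 2) for d, v in means.items()}, "below threshold:", failing)
```

Keep the per-example notes alongside the means. The mean tells you something is wrong; the notes tell you which examples to read first.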
The Manual Spot-Check
Automated metrics lie. Always. Sample twenty outputs from your fine-tuned model and twenty from the base model on the same prompts. Read them side by side, blinded if you can. Ask yourself: would I ship this? If the answer is no for more than two or three of the twenty, do not ship it. Go back to the dataset.
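Blinding is easy to get wrong by hand, so it helps to let a script shuffle the pairs. A minimal sketch, assuming you already have parallel lists of prompts, base-model outputs, and fine-tuned outputs; the function name and the placeholder data are hypothetical.

```python
import random


def blind_pairs(prompts, base_outputs, tuned_outputs, n=20, seed=0):
    """Sample n prompts and randomize which model appears as A or B.

    Returns (sheet, key): the sheet is what the reviewer reads,
    the key stays hidden until the review is finished.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(len(prompts)), k=min(n, len(prompts)))
    sheet, key = [], []
    for i in indices:
        pair = [("base", base_outputs[i]), ("tuned", tuned_outputs[i])]
        rng.shuffle(pair)  # coin flip for left/right position
        sheet.append({"prompt": prompts[i], "A": pair[0][1], "B": pair[1][1]})
        key.append({"A": pair[0][0], "B": pair[1][0]})
    return sheet, key


# Placeholder data standing in for real model outputs
prompts = [f"prompt {i}" for i in range(30)]
base_outputs = [f"base answer {i}" for i in range(30)]
tuned_outputs = [f"tuned answer {i}" for i in range(30)]

sheet, key = blind_pairs(prompts, base_outputs, tuned_outputs)
print(f"{len(sheet)} blinded pairs ready for review")
```

Write down your "would I ship this?" verdict for A and B on every row before opening the key.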
The Regression Test
Keep a fixed set of twenty to fifty prompts that represent your most important use cases. After every fine-tuning run, evaluate against this set. Store the results. A new run that beats the old run on train loss but regresses on three of your regression prompts is a failed run, no matter what the loss curve says.
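Comparing runs against the fixed set can be a one-function gate. A minimal sketch, assuming you store one numeric score per regression prompt for each run (for example, the judge's correctness score); the prompt IDs, the scores, and the function name are made up for illustration.

```python
def find_regressions(previous, current, tolerance=0.0):
    """Return the prompt IDs whose score dropped by more than `tolerance`.

    A prompt missing from the current run counts as a regression.
    """
    return sorted(
        pid for pid, old in previous.items()
        if current.get(pid, float("-inf")) < old - tolerance
    )


# Hypothetical stored results from two runs
run_v1 = {"empty-function": 5, "sql-injection": 4, "off-by-one": 5}
run_v2 = {"empty-function": 5, "sql-injection": 3, "off-by-one": 5}

regressions = find_regressions(run_v1, run_v2)
print("FAILED run, regressions:" if regressions else "Run passes", regressions)
```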
Common Pitfalls and How to Avoid Them
Catastrophic forgetting. The model gets better at your task and worse at everything else. Mitigation: mix in five to ten percent of general-purpose examples (from the base model’s original training distribution) into your dataset. This is called data mixing, and it is the simplest insurance policy you can buy against a model that forgets how to say hello.
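Data mixing is a few lines once you have a pool of general-purpose conversations. A minimal sketch, assuming both pools are lists of conversation-format examples; the function name is hypothetical, and the general pool here is placeholder data.

```python
import random


def mix_in_general(task_examples, general_examples, fraction=0.1, seed=0):
    """Return a shuffled dataset in which about `fraction` of examples are general-purpose."""
    # Solve g / (t + g) = fraction for g, the number of general examples to add
    n_general = round(len(task_examples) * fraction / (1 - fraction))
    n_general = min(n_general, len(general_examples))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder pools: 450 task examples and 1,000 general-purpose ones
task = [{"source": "task", "id": i} for i in range(450)]
general = [{"source": "general", "id": i} for i in range(1000)]

mixed = mix_in_general(task, general, fraction=0.1)
share = sum(ex["source"] == "general" for ex in mixed) / len(mixed)
print(f"{len(mixed)} examples, {share:.0%} general-purpose")
```

Do the mixing after you split off the test set, so the held-out examples still measure only the task you care about.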
Overfitting to the prompt template. If you train with “### Instruction:\n…\n### Response:\n…” and then prompt with “You are a helpful assistant. Please…” the model will produce garbage. The fine-tuned model has learned to associate a specific format with the behavior. Use the same prompt structure at inference that you used at training. Better yet, use the model’s native chat template (via tokenizer.apply_chat_template()) and never think about prompt formatting again.
Training on AI-generated data without review. Using Opus to generate five hundred examples and then fine-tuning on all of them without reading a single one is how you get a model that confidently produces beautiful, well-formatted wrong answers. The teacher model has its own failure modes, and fine-tuning amplifies them. Review at least the most diverse ten percent of your generated data manually.
Using the wrong base model. Fine-tuning a general-purpose chat model (like GPT-4o-mini or Qwen Instruct) usually works better than fine-tuning a base completion model, because the chat model already understands instruction-following. The exception is when you need the model to produce raw completions without the chat scaffolding — but you probably do not.
Ignoring the system prompt during training. If your inference pipeline uses a system prompt, your training data must use the same system prompt. If your inference pipeline does not use a system prompt, do not include one in training. This sounds obvious, and yet it is the single most common bug in production fine-tuning pipelines I have seen.
Fine-tuning when RAG would have worked. I said this at the start, and I am saying it again because it is the most expensive mistake on this list. A fine-tuning run costs money. A RAG pipeline costs time. The wrong choice costs both.
Deployment: Serving Your Fine-Tuned Model
LoRA Adapter Deployment
For Hugging Face PEFT adapters, you have two options. The simple one: merge and serve.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-8B"
# Load the base weights unquantized so the adapter can be folded into them
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "./qwen3-code-reviewer-final")
merged = model.merge_and_unload()
merged.save_pretrained("./qwen3-code-reviewer-merged")
# Save the tokenizer next to the weights so the folder is self-contained
AutoTokenizer.from_pretrained(base_id).save_pretrained("./qwen3-code-reviewer-merged")
Then serve the merged model with vLLM, TGI, or any standard inference server. The merged model is a full checkpoint (the adapter weights have been folded into the base weights), so it is compatible with any serving infrastructure that supports the base architecture.
The more flexible option: serve the base model and load adapters dynamically. vLLM supports LoRA adapter hot-swapping, which lets you serve dozens of fine-tuned variants from a single base model instance. This is the right architecture if you are fine-tuning per-customer or per-task and do not want to run a separate GPU for each one.
OpenAI Fine-Tuned Model Deployment
Your fine-tuned model gets a model ID like ft:gpt-4o-mini-2024-07-18:personal:code-reviewer-v1:abc123. Use it exactly like any other model:
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:personal:code-reviewer-v1:abc123",
    messages=[
        {"role": "system", "content": "You are a precise code reviewer."},
        {"role": "user", "content": "Review: def foo(): pass"},
    ],
)
No infrastructure to manage. The trade-off is that you cannot inspect the weights, you cannot merge adapters, you cannot serve on your own hardware, and you pay OpenAI’s inference markup. For many teams, this is a completely reasonable trade.
What This Costs, Honestly
Here is a real budget for a typical fine-tuning project in mid-2026, assuming you are an individual developer or a small team:
| Item | QLoRA (self-hosted) | OpenAI hosted |
|---|---|---|
| GPU for training | $1.50–$3/hour (cloud rental, RTX 4090) | $0 (included in training price) |
| Training run (500 examples, 8B model) | ~$1–$5 total (30–90 minutes) | ~$8 per epoch |
| Iteration (3 runs, because the first two will be wrong) | $3–$15 | ~$24 at one epoch per run |
| Inference (1M tokens/month) | $0.30–$0.80 (cloud GPU) or free (local) | ~$0.60 (GPT-4o-mini fine-tune) |
| Engineering time (dataset, eval, integration) | 2–5 days | 1–3 days |
The hosted route wins on engineering time. The self-hosted route wins on marginal cost and gives you a model you can run anywhere, including offline and in air-gapped environments. Pick based on which constraint bites harder: your time or your compute budget.
If you are fine-tuning a 70B model with QLoRA, multiply the training time by roughly four to six, and the cloud GPU cost by the same factor. It is still manageable on a single A100 (80 GB), which rents for about two to four dollars an hour.
When Not to Fine-Tune
You have read two thousand words about how to fine-tune. Here are two hundred about when to stop.
If your base model gets the answer right less than about eighty percent of the time, you are probably fine-tuning the wrong thing. Fix your prompts or your pipeline first. Fine-tuning is for the gap between “mostly works” and “works reliably in production.” It is not for the gap between “does not work at all” and “mostly works.” That gap is almost always a prompt engineering or RAG problem.
If you cannot write down, in one sentence, exactly what behavior you want the fine-tuned model to exhibit that the base model does not, you are not ready to fine-tune. “Better code reviews” is not a sentence. “Always respond with a JSON object containing ‘summary’, ‘issues’, and ‘score’ keys, where ‘issues’ is an array of objects with ‘file’, ‘line’, ‘severity’, and ‘description’” is a sentence. Write the sentence first. Then collect the data. Then train.
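A sentence that precise can be turned directly into a checker, which is a good test of whether it is precise enough. A minimal sketch that validates an output against the example sentence above; the constant and function names are hypothetical, and it checks only the presence of keys, not value types or ranges.

```python
import json

REQUIRED_TOP = {"summary", "issues", "score"}
REQUIRED_ISSUE = {"file", "line", "severity", "description"}


def violations(raw_output: str) -> list:
    """Return a list of ways the output breaks the one-sentence spec (empty = pass)."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(obj, dict):
        return ["top level is not an object"]

    problems = [f"missing key: {k}" for k in sorted(REQUIRED_TOP - set(obj))]
    issues = obj.get("issues", [])
    if not isinstance(issues, list):
        return problems + ["'issues' is not an array"]
    for i, issue in enumerate(issues):
        if not isinstance(issue, dict):
            problems.append(f"issues[{i}] is not an object")
            continue
        problems += [f"issues[{i}] missing key: {k}"
                     for k in sorted(REQUIRED_ISSUE - set(issue))]
    return problems


good = ('{"summary": "ok", "score": 9, "issues": '
        '[{"file": "a.py", "line": 3, "severity": "low", "description": "unused import"}]}')
bad = '{"summary": "ok", "issues": [{"file": "a.py"}]}'

print(violations(good))
print(violations(bad))
```

Run the same checker over every training example and every eval output. If your own training data fails it, the model never had a chance.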
If you have fewer than thirty high-quality examples, you do not have a dataset. You have a few examples. Use them as few-shot prompts instead. Fine-tuning with thirty examples will overfit, and your eval will look great right up until the moment a real user sends a prompt that is not in your thirty-example set.
The Takeaway
Fine-tuning is not magic. It is a tool that takes a well-defined task, a carefully curated dataset, and a few hours of GPU time, and produces a model that does one thing reliably. The reliability is the point. A fine-tuned 8B model that does exactly what you want, every time, beats a 70B general-purpose model that sometimes gets creative.
Start small. One task. Fifty examples. QLoRA on whatever GPU you have. Evaluate honestly. Iterate on the data, not the hyperparameters. Ship when the regression tests pass.
The models are getting better every quarter, but the gap between a good general-purpose model and a great task-specific model has not closed. That gap is called fine-tuning, and it is still the highest-leverage thing you can learn this year.