Why Fine-Tune? The Power of Customization
Pre-trained LLMs like Llama 3, Mistral, and Qwen are incredibly capable, but they're trained on general data. Fine-tuning lets you specialize a model for your specific use case:
- Domain expertise: Medical, legal, financial, or technical knowledge
- Style adaptation: Match your brand voice or writing style
- Task specialization: Code generation, summarization, classification
- Custom formatting: Structured outputs, JSON responses
- Improved accuracy: Better performance on your specific data
A fine-tuned 7B model can often outperform a general-purpose 70B model on a narrow, well-defined task, while being roughly 10x cheaper to run.
Fine-Tuning Methods Compared
| Method | VRAM Needed | Training Speed | Quality | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | Very High | Slow | Best | Maximum performance when you have enough GPUs |
| LoRA | Medium | Fast | Excellent | Best balance of quality/cost |
| QLoRA | Low | Medium | Very Good | Limited VRAM, larger models |
| Prefix Tuning | Very Low | Very Fast | Good | Simple adaptations |
For most users, LoRA (Low-Rank Adaptation) is the sweet spot. Instead of updating every weight, it learns small low-rank matrices that are added to selected layers, so typically well under 1% of the parameters are trainable, and quality on the target task usually comes close to full fine-tuning.
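To see where that figure comes from, here's a quick back-of-the-envelope calculation, assuming Llama-3-8B-sized dimensions (hidden size 4096, MLP size 14336, grouped-query KV dimension 1024, 32 layers) and the rank-16 adapter used in Step 3 below:
# Back-of-the-envelope LoRA parameter count for a Llama-3-8B-sized model
r = 16       # LoRA rank
layers = 32  # transformer blocks
# (in_features, out_features) of the projections adapted in Step 3
shapes = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]
# Each adapted projection gains an (r x in) and an (out x r) matrix
lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)
print(f"~{lora_params / 1e6:.0f}M trainable parameters, "
      f"about {lora_params / 8e9:.1%} of an 8B model")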
Prerequisites
Before starting, you'll need:
- A GPU instance (we recommend H100 or A100 from GPUBrazil)
- Your training dataset in the right format
- Access to base model weights (Hugging Face account)
GPU Requirements by Model Size
| Model Size | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | 1x A100 80GB | 1x L40S 48GB | 1x RTX 4090 |
| 13B | 2x A100 80GB | 1x A100 80GB | 1x L40S 48GB |
| 70B | 8x H100 80GB | 2x H100 80GB | 1x H100 80GB |
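These figures are driven mostly by how much memory the weights, gradients, and optimizer states need. A rough rule of thumb (ignoring activations, which scale with batch size and sequence length) looks like this, and it also explains why full fine-tuning usually needs gradient checkpointing, sharding, or 8-bit optimizers on top to fit the configurations above:
# Rough VRAM rule of thumb: weights, gradients and optimizer states only;
# activations, KV cache and CUDA overhead come on top of these numbers.
params_b = 8  # model size in billions of parameters
full_ft = params_b * 16   # bf16 weights + grads (~4 B/param) + fp32 master weights and Adam states (~12 B/param)
lora    = params_b * 2    # frozen bf16 weights; the adapter and its optimizer states are comparatively tiny
qlora   = params_b * 0.6  # ~0.5 B/param for 4-bit weights, plus quantization constants and the adapter
print(f"Full FT ~{full_ft} GB | LoRA ~{lora} GB | QLoRA ~{qlora:.0f} GB (before activations)")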
Step 1: Prepare Your Dataset
Your dataset format depends on your task. Here's the standard instruction format:
{
"instruction": "Summarize the following customer review",
"input": "I bought this laptop last week and I'm very impressed...",
"output": "Positive review highlighting good performance and value."
}
For chat models, use the conversation format:
{
"conversations": [
{"role": "system", "content": "You are a helpful customer service agent."},
{"role": "user", "content": "I need help with my order #12345"},
{"role": "assistant", "content": "I'd be happy to help! Let me look up order #12345..."}
]
}
💡 Dataset Quality > Quantity
500 high-quality, diverse examples often beat 50,000 low-quality ones. Focus on examples that represent the exact behavior you want.
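Before spending GPU time, it's worth a quick sanity check of the file itself. A minimal sketch, assuming the instruction format above stored as a single JSON array in your_dataset.json:
import json
# Sanity-check an instruction-format dataset (a list of {"instruction", "input", "output"} records)
with open("your_dataset.json") as f:
    data = json.load(f)
required = {"instruction", "input", "output"}
bad = [i for i, ex in enumerate(data) if not required <= ex.keys()]
empty = [i for i, ex in enumerate(data) if not str(ex.get("output", "")).strip()]
print(f"{len(data)} examples, {len(bad)} with missing fields, {len(empty)} with empty outputs")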
Step 2: Set Up Your Environment
SSH into your GPU instance and install the required packages:
# Create virtual environment
python -m venv finetune-env
source finetune-env/bin/activate
# Install packages
pip install torch transformers datasets accelerate peft bitsandbytes trl wandb
# Verify GPU
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"
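The training script in the next step uses bf16, which needs an Ampere-or-newer GPU (A100, L40S, RTX 4090, H100). A quick way to confirm before you start; if it prints False, switch the script to fp16:
import torch
# bf16 requires Ampere or newer; fall back to fp16 on older GPUs
print(torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())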
Step 3: Fine-Tune with LoRA
Here's a complete script to fine-tune Llama 3 8B with LoRA:
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
# Configuration
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
dataset_path = "your_dataset.json"
output_dir = "./llama3-finetuned"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# LoRA configuration
lora_config = LoraConfig(
r=16, # Rank - higher = more capacity
lora_alpha=32, # Scaling factor
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=[ # Which layers to adapt
"q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"
]
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # Shows ~0.5-1% trainable
# Load dataset
dataset = load_dataset("json", data_files=dataset_path)["train"]
# Format function for your data
def format_prompt(example):
return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
logging_steps=10,
save_strategy="epoch",
bf16=True,
optim="adamw_torch_fused",
report_to="wandb",
)
# Initialize trainer (classic SFTTrainer signature; on recent trl releases,
# max_seq_length and packing are configured via SFTConfig instead)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
formatting_func=format_prompt,
max_seq_length=2048,
packing=True, # Efficient packing of short examples
)
# Train!
trainer.train()
# Save the LoRA weights
model.save_pretrained(f"{output_dir}/lora-weights")
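Before moving on, it's worth smoke-testing the adapter in the same Python session. A minimal sketch with a made-up prompt in the training format:
# Quick smoke test of the freshly trained adapter
model.eval()
prompt = (
    "### Instruction:\nSummarize the following customer review\n"
    "### Input:\nGreat battery life, but the screen is a bit dim.\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))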
Step 4: QLoRA for Larger Models
Need to fine-tune a 70B model on a single H100? Use QLoRA:
from transformers import BitsAndBytesConfig
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-70B-Instruct",
quantization_config=bnb_config,
device_map="auto",
)
# Rest of training is the same as LoRA
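One detail worth calling out: with a 4-bit base model it's standard to run peft's prepare_model_for_kbit_training before attaching the adapters. It freezes the quantized weights and handles details like casting norm layers to fp32 and enabling gradient checkpointing. A minimal sketch, reusing the lora_config from Step 3:
from peft import prepare_model_for_kbit_training, get_peft_model
# Prepare the quantized model for training, then attach the same LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)  # lora_config from Step 3
model.print_trainable_parameters()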
⚠️ QLoRA Tradeoffs
QLoRA uses ~70% less VRAM but trains ~30% slower and may have slightly lower final quality. Use it when you can't fit the model otherwise.
Step 5: Merge and Deploy
After training, merge LoRA weights back into the base model:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model (a fresh session is fine; nothing from training is needed here)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# Load and merge LoRA weights
model = PeftModel.from_pretrained(base_model, "./llama3-finetuned/lora-weights")
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")
Now you can deploy with vLLM or any other inference framework!
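For example, here's a minimal vLLM sketch (assuming vLLM is installed via pip install vllm and using the merged directory from above; the prompt is just an illustration in the training format):
from vllm import LLM, SamplingParams
# Load the merged model with vLLM and run a single prompt
llm = LLM(model="./llama3-merged", dtype="bfloat16")
params = SamplingParams(temperature=0.2, max_tokens=200)
prompt = (
    "### Instruction:\nSummarize the following customer review\n"
    "### Input:\nArrived a day late, but the build quality is excellent.\n"
    "### Response:\n"
)
print(llm.generate([prompt], params)[0].outputs[0].text)
vLLM also ships an OpenAI-compatible HTTP server (python -m vllm.entrypoints.openai.api_server --model ./llama3-merged) if you'd rather expose the model as an API endpoint.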
Cost Breakdown: Real-World Example
Fine-tuning Llama 3 8B on 10,000 examples (3 epochs) on GPUBrazil:
| GPU | Training Time | Cost/Hour | Total Cost |
|---|---|---|---|
| 1x H100 80GB | ~45 minutes | $2.80 | $2.10 |
| 1x A100 80GB | ~1.5 hours | $1.60 | $2.40 |
| 1x L40S 48GB | ~2.5 hours | $0.90 | $2.25 |
You can fine-tune a powerful custom LLM for under $3. Compare that to OpenAI's fine-tuning at $8/1M training tokens!
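To put that comparison in perspective, here's the rough arithmetic. The 500-tokens-per-example average is an assumption, so plug in your own numbers:
# Rough cost comparison (avg_tokens is an assumed average; measure your own data)
examples, epochs, avg_tokens = 10_000, 3, 500
training_tokens = examples * epochs * avg_tokens      # tokens seen during training
hosted_api_cost = training_tokens / 1e6 * 8           # at $8 per 1M training tokens
print(f"{training_tokens / 1e6:.0f}M training tokens: ~${hosted_api_cost:.0f} "
      f"on a hosted fine-tuning API vs ~$2-3 on a rented GPU")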
Best Practices for Fine-Tuning
- Start small: Test with 100 examples before scaling up
- Monitor loss curves: Use Weights & Biases to track training
- Evaluate frequently: Run validation prompts or a held-out split every N steps (see the sketch after this list)
- Avoid overfitting: Stop when validation loss plateaus
- Use diverse examples: Cover edge cases in your training data
- Keep base capability: Include some general examples to prevent forgetting
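To make the evaluation and overfitting points concrete, here's one way to hold out a validation split and track eval loss during training. A minimal sketch that continues the Step 3 script (same model, tokenizer, dataset, and format_prompt):
# Hold out 5% of the data and evaluate on it during training
split = dataset.train_test_split(test_size=0.05, seed=42)
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    evaluation_strategy="steps",       # evaluate on the held-out split
    eval_steps=50,                     # every 50 optimizer steps
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,       # keep the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    report_to="wandb",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    formatting_func=format_prompt,
    max_seq_length=2048,
    packing=True,
)
trainer.train()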
Conclusion
Fine-tuning LLMs on cloud GPUs is fast, affordable, and powerful. With LoRA and modern techniques, you can create specialized AI models that outperform general-purpose alternatives at a fraction of the cost.
Key takeaways:
- LoRA is the best balance of quality, speed, and cost
- Quality data matters more than quantity
- You can fine-tune a custom LLM for under $3 on GPUBrazil
Get started with GPUBrazil and build your first custom LLM today!