Why Fine-Tune? The Power of Customization

Pre-trained LLMs like Llama 3, Mistral, and Qwen are incredibly capable, but they're trained on general data. Fine-tuning lets you specialize a model for your specific use case: your domain's terminology, your preferred output format, and your own tone and policies.

A fine-tuned 7B model often outperforms a general-purpose 70B model on specific tasks, while being 10x cheaper to run.

Fine-Tuning Methods Compared

| Method | VRAM Needed | Training Speed | Quality | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | Very High | Slow | Best | Maximum performance, ample GPU budget |
| LoRA | Medium | Fast | Excellent | Best balance of quality and cost |
| QLoRA | Low | Medium | Very Good | Limited VRAM, larger models |
| Prefix Tuning | Very Low | Very Fast | Good | Simple adaptations |

For most users, LoRA (Low-Rank Adaptation) is the sweet spot. It trains only ~1% of parameters while achieving 95%+ of full fine-tuning quality.
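
To make the "~1% of parameters" figure concrete, here's a rough back-of-the-envelope estimate (a minimal sketch; the dimensions are simplified approximations of a Llama-style 8B model, not exact values):

# Rough LoRA parameter-count estimate, for intuition only.
# Assumes a simplified Llama-style model: hidden size 4096, 32 layers,
# LoRA applied to 7 projection matrices per layer.
hidden_size = 4096
num_layers = 32
rank = 16  # LoRA rank "r"

# Each adapted weight matrix gets two small matrices: (d x r) and (r x d).
params_per_adapter = 2 * hidden_size * rank
lora_params = num_layers * 7 * params_per_adapter

base_params = 8e9  # ~8B-parameter base model
print(f"LoRA trainable params: {lora_params / 1e6:.1f}M "
      f"({lora_params / base_params:.2%} of the base model)")

The exact fraction reported by print_trainable_parameters() will differ slightly because the MLP projections are wider than the attention projections, but the order of magnitude is the point: tens of millions of trainable parameters instead of billions.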

Prerequisites

Before starting, you'll need a GPU instance (see the table below), a training dataset in JSON format, and a Hugging Face account with access to your chosen base model (Llama 3 weights are gated).

GPU Requirements by Model Size

| Model Size | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | 1x A100 80GB | 1x L40S 48GB | 1x RTX 4090 |
| 13B | 2x A100 80GB | 1x A100 80GB | 1x L40S 48GB |
| 70B | 8x H100 80GB | 2x H100 80GB | 1x H100 80GB |

Step 1: Prepare Your Dataset

Your dataset format depends on your task. Here's the standard instruction format:

{
  "instruction": "Summarize the following customer review",
  "input": "I bought this laptop last week and I'm very impressed...",
  "output": "Positive review highlighting good performance and value."
}

For chat models, use the conversation format:

{
  "conversations": [
    {"role": "system", "content": "You are a helpful customer service agent."},
    {"role": "user", "content": "I need help with my order #12345"},
    {"role": "assistant", "content": "I'd be happy to help! Let me look up order #12345..."}
  ]
}
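
If your data is already in this conversation format, you can let the tokenizer render it into training text instead of hand-writing a prompt template. A minimal sketch (assumes the base model's tokenizer ships a chat template, as Llama 3 Instruct does):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

example = {
    "conversations": [
        {"role": "system", "content": "You are a helpful customer service agent."},
        {"role": "user", "content": "I need help with my order #12345"},
        {"role": "assistant", "content": "I'd be happy to help! Let me look up order #12345..."},
    ]
}

# Render the conversation with the model's own chat template so the
# training text matches what the model sees at inference time.
text = tokenizer.apply_chat_template(example["conversations"], tokenize=False)
print(text)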

💡 Dataset Quality > Quantity

500 high-quality, diverse examples often beat 50,000 low-quality ones. Focus on examples that represent the exact behavior you want.
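
A quick sanity pass over your file catches most quality problems before you spend GPU time. Here's a minimal sketch (assumes the instruction format shown above, stored as a JSON list of records in your_dataset.json):

import json

with open("your_dataset.json") as f:
    records = json.load(f)

seen = set()
for i, r in enumerate(records):
    # Every record needs the three expected fields.
    missing = {"instruction", "input", "output"} - r.keys()
    if missing:
        print(f"record {i}: missing fields {missing}")
    # Empty outputs teach the model to answer with nothing.
    elif not r["output"].strip():
        print(f"record {i}: empty output")
    # Exact duplicate prompts add no new information.
    key = (r.get("instruction", ""), r.get("input", ""))
    if key in seen:
        print(f"record {i}: duplicate prompt")
    seen.add(key)

print(f"Checked {len(records)} records")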

Step 2: Set Up Your Environment

SSH into your GPU instance and install the required packages:

# Create virtual environment
python -m venv finetune-env
source finetune-env/bin/activate

# Install packages
pip install torch transformers datasets accelerate peft bitsandbytes trl wandb

# Verify GPU
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"
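
Llama 3 weights are gated on Hugging Face, so authenticate before the training script tries to download the model. A minimal sketch using the huggingface_hub client (installed as a dependency of transformers):

from huggingface_hub import login

# Use a token with read access from https://huggingface.co/settings/tokens
login(token="hf_...")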

Step 3: Fine-Tune with LoRA

Here's a complete script to fine-tune Llama 3 8B with LoRA. The SFTTrainer arguments below follow the older TRL API (roughly the 0.8 series); newer TRL releases move max_seq_length and packing into an SFTConfig object, so check your installed version:

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# Configuration
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
dataset_path = "your_dataset.json"
output_dir = "./llama3-finetuned"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                      # Rank - higher = more capacity
    lora_alpha=32,             # Scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[           # Which layers to adapt
        "q_proj", "k_proj", "v_proj", 
        "o_proj", "gate_proj", "up_proj", "down_proj"
    ]
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Shows ~0.5-1% trainable

# Load dataset
dataset = load_dataset("json", data_files=dataset_path)["train"]

# Format function for your data
def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="adamw_torch_fused",
    report_to="wandb",
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    formatting_func=format_prompt,
    max_seq_length=2048,
    packing=True,  # Efficient packing of short examples
)

# Train!
trainer.train()

# Save the LoRA weights
model.save_pretrained(f"{output_dir}/lora-weights")
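
Before merging anything, it's worth running one of your validation prompts through the freshly trained adapter. A minimal sketch that reuses the model and tokenizer objects from the script above (the prompt text is a made-up example):

# Quick smoke test with the trained adapter still in memory
prompt = (
    "### Instruction:\nSummarize the following customer review\n\n"
    "### Input:\nGreat battery life, but the keyboard feels cheap.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Print only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))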

Step 4: QLoRA for Larger Models

Need to fine-tune a 70B model on a single H100? Use QLoRA:

from transformers import BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Rest of training is the same as LoRA, with one extra preparation step (below)
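
One extra step is recommended with a quantized base model: run PEFT's prepare_model_for_kbit_training before applying the LoRA adapters, so the norm layers stay in higher precision and gradient checkpointing works correctly. A minimal sketch of the changed lines:

from peft import prepare_model_for_kbit_training, get_peft_model

# Prepare the 4-bit model for training, then attach the same LoRA config as before
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)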

⚠️ QLoRA Tradeoffs

QLoRA uses ~70% less VRAM but trains ~30% slower and may have slightly lower final quality. Use it when you can't fit the model otherwise.

Step 5: Merge and Deploy

After training, merge LoRA weights back into the base model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Load and merge LoRA weights
model = PeftModel.from_pretrained(base_model, "./llama3-finetuned/lora-weights")
merged_model = model.merge_and_unload()

# Save merged model and tokenizer
merged_model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")

Now you can deploy with vLLM or any other inference framework!
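
For example, loading the merged checkpoint with vLLM's offline API looks roughly like this (a minimal sketch; vLLM is installed separately with pip install vllm):

from vllm import LLM, SamplingParams

# Point vLLM at the merged checkpoint directory
llm = LLM(model="./llama3-merged")
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["Summarize: Great laptop, the battery lasts all day."], params)
print(outputs[0].outputs[0].text)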

Cost Breakdown: Real-World Example

Fine-tuning Llama 3 8B on 10,000 examples (3 epochs) on GPUBrazil:

| GPU | Training Time | Cost/Hour | Total Cost |
|---|---|---|---|
| 1x H100 80GB | ~45 minutes | $2.80 | $2.10 |
| 1x A100 80GB | ~1.5 hours | $1.60 | $2.40 |
| 1x L40S 48GB | ~2.5 hours | $0.90 | $2.25 |

You can fine-tune a powerful custom LLM for under $3. Compare that to OpenAI's fine-tuning at $8/1M training tokens!

Ready to Fine-Tune Your Own LLM?

Get instant access to H100 and A100 GPUs. Training starts at $0.90/hour.

Start Free with $5 Credit →

Best Practices for Fine-Tuning

  1. Start small: Test with 100 examples before scaling up
  2. Monitor loss curves: Use Weights & Biases to track training
  3. Evaluate frequently: Run validation prompts every N steps
  4. Avoid overfitting: Stop when validation loss plateaus (the sketch after this list automates this with early stopping)
  5. Use diverse examples: Cover edge cases in your training data
  6. Keep base capability: Include some general examples to prevent forgetting
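
Points 3 and 4 can be wired up automatically with a held-out validation split plus the early stopping callback built into transformers. A minimal sketch that reuses the dataset, model, tokenizer, format_prompt, and output_dir objects from the Step 3 script (eval_strategy is spelled evaluation_strategy in older transformers releases):

from transformers import TrainingArguments, EarlyStoppingCallback
from trl import SFTTrainer

# Hold out 5% of the data for validation
split = dataset.train_test_split(test_size=0.05, seed=42)

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,   # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    formatting_func=format_prompt,
    max_seq_length=2048,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()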

Conclusion

Fine-tuning LLMs on cloud GPUs is fast, affordable, and powerful. With LoRA and modern techniques, you can create specialized AI models that outperform general-purpose alternatives at a fraction of the cost.

Key takeaways:

  1. LoRA gives you most of the quality of full fine-tuning while training only about 1% of the parameters.
  2. QLoRA's 4-bit quantization lets a single GPU handle models that would otherwise need a cluster.
  3. Dataset quality matters more than size: a few hundred well-curated examples go a long way.
  4. A LoRA fine-tuning run for an 8B model costs only a few dollars on rented cloud GPUs.

Get started with GPUBrazil and build your first custom LLM today!