Why Fine-Tune? The Power of Customization
Pre-trained LLMs like Llama 3, Mistral, and Qwen are incredibly capable, but they're trained on general data. Fine-tuning lets you specialize a model for your specific use case:
- Domain expertise: Medical, legal, financial, or technical knowledge
- Style adaptation: Match your brand voice or writing style
- Task specialization: Code generation, summarization, classification
- Custom formatting: Structured outputs, JSON responses
- Improved accuracy: Better performance on your specific data
A fine-tuned 7B model can often outperform a general-purpose 70B model on a narrow, well-defined task, while being roughly 10x cheaper to run.
Fine-Tuning Methods Compared
| Method | VRAM Needed | Training Speed | Quality | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | Very High | Slow | Best | Maximum performance when you have enough GPUs |
| LoRA | Medium | Fast | Excellent | Best balance of quality/cost |
| QLoRA | Low | Medium | Very Good | Limited VRAM, larger models |
| Prefix Tuning | Very Low | Very Fast | Good | Simple adaptations |
For most users, LoRA (Low-Rank Adaptation) is the sweet spot. Instead of updating every weight, it learns small low-rank matrices that are added to selected layers, so typically well under 1% of the parameters are trainable, and quality on the target task usually comes close to full fine-tuning.
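To see where that figure comes from, here's a quick back-of-the-envelope calculation, assuming Llama-3-8B-sized dimensions (hidden size 4096, MLP size 14336, grouped-query KV dimension 1024, 32 layers) and the rank-16 adapter used in Step 3 below:
# Back-of-the-envelope LoRA parameter count for a Llama-3-8B-sized model
r = 16       # LoRA rank
layers = 32  # transformer blocks
# (in_features, out_features) of the projections adapted in Step 3
shapes = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]
# Each adapted projection gains an (r x in) and an (out x r) matrix
lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)
print(f"~{lora_params / 1e6:.0f}M trainable parameters, "
      f"about {lora_params / 8e9:.1%} of an 8B model")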
Prerequisites
Before starting, you'll need:
- A GPU instance (we recommend H100 or A100 from GPUBrazil)
- Your training dataset in the right format
- Access to base model weights (Hugging Face account)
GPU Requirements by Model Size
| Model Size | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | 1x A100 80GB | 1x L40S 48GB | 1x RTX 4090 |
| 13B | 2x A100 80GB | 1x A100 80GB | 1x L40S 48GB |
| 70B | 8x H100 80GB | 2x H100 80GB | 1x H100 80GB |
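These figures are driven mostly by how much memory the weights, gradients, and optimizer states need. A rough rule of thumb (ignoring activations, which scale with batch size and sequence length) looks like this, and it also explains why full fine-tuning usually needs gradient checkpointing, sharding, or 8-bit optimizers on top to fit the configurations above:
# Rough VRAM rule of thumb: weights, gradients and optimizer states only;
# activations, KV cache and CUDA overhead come on top of these numbers.
params_b = 8  # model size in billions of parameters
full_ft = params_b * 16   # bf16 weights + grads (~4 B/param) + fp32 master weights and Adam states (~12 B/param)
lora    = params_b * 2    # frozen bf16 weights; the adapter and its optimizer states are comparatively tiny
qlora   = params_b * 0.6  # ~0.5 B/param for 4-bit weights, plus quantization constants and the adapter
print(f"Full FT ~{full_ft} GB | LoRA ~{lora} GB | QLoRA ~{qlora:.0f} GB (before activations)")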
Step 1: Prepare Your Dataset
Your dataset format depends on your task. Here's the standard instruction format:
{
"instruction": "Summarize the following customer review",
"input": "I bought this laptop last week and I'm very impressed...",
"output": "Positive review highlighting good performance and value."
}
For chat models, use the conversation format:
{
"conversations": [
{"role": "system", "content": "You are a helpful customer service agent."},
{"role": "user", "content": "I need help with my order #12345"},
{"role": "assistant", "content": "I'd be happy to help! Let me look up order #12345..."}
]
}
💡 Dataset Quality > Quantity
500 high-quality, diverse examples often beat 50,000 low-quality ones. Focus on examples that represent the exact behavior you want.
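Before spending GPU time, it's worth a quick sanity check of the file itself. A minimal sketch, assuming the instruction format above stored as a single JSON array in your_dataset.json:
import json
# Sanity-check an instruction-format dataset (a list of {"instruction", "input", "output"} records)
with open("your_dataset.json") as f:
    data = json.load(f)
required = {"instruction", "input", "output"}
bad = [i for i, ex in enumerate(data) if not required <= ex.keys()]
empty = [i for i, ex in enumerate(data) if not str(ex.get("output", "")).strip()]
print(f"{len(data)} examples, {len(bad)} with missing fields, {len(empty)} with empty outputs")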
Step 2: Set Up Your Environment
SSH into your GPU instance and install the required packages:
# Create virtual environment
python -m venv finetune-env
source finetune-env/bin/activate
# Install packages
pip install torch transformers datasets accelerate peft bitsandbytes trl wandb
# Verify GPU
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"
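The training script in the next step uses bf16, which needs an Ampere-or-newer GPU (A100, L40S, RTX 4090, H100). A quick way to confirm before you start; if it prints False, switch the script to fp16:
import torch
# bf16 requires Ampere or newer; fall back to fp16 on older GPUs
print(torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())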
Step 3: Fine-Tune with LoRA
Here's a complete script to fine-tune Llama 3 8B with LoRA:
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
# Configuration
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
dataset_path = "your_dataset.json"
output_dir = "./llama3-finetuned"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# LoRA configuration
lora_config = LoraConfig(
r=16, # Rank - higher = more capacity
lora_alpha=32, # Scaling factor
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=[ # Which layers to adapt
"q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"
]
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # Shows ~0.5-1% trainable
# Load dataset
dataset = load_dataset("json", data_files=dataset_path)["train"]
# Format function for your data
def format_prompt(example):
return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
logging_steps=10,
save_strategy="epoch",
bf16=True,
optim="adamw_torch_fused",
report_to="wandb",
)
# Initialize trainer (classic SFTTrainer signature; on recent trl releases,
# max_seq_length and packing are configured via SFTConfig instead)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
formatting_func=format_prompt,
max_seq_length=2048,
packing=True, # Efficient packing of short examples
)
# Train!
trainer.train()
# Save the LoRA weights
model.save_pretrained(f"{output_dir}/lora-weights")
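Before moving on, it's worth smoke-testing the adapter in the same Python session. A minimal sketch with a made-up prompt in the training format:
# Quick smoke test of the freshly trained adapter
model.eval()
prompt = (
    "### Instruction:\nSummarize the following customer review\n"
    "### Input:\nGreat battery life, but the screen is a bit dim.\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))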
Step 4: QLoRA for Larger Models
Need to fine-tune a 70B model on a single H100? Use QLoRA:
from transformers import BitsAndBytesConfig
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-70B-Instruct",
quantization_config=bnb_config,
device_map="auto",
)
# Rest of training is the same as LoRA
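One detail worth calling out: with a 4-bit base model it's standard to run peft's prepare_model_for_kbit_training before attaching the adapters. It freezes the quantized weights and handles details like casting norm layers to fp32 and enabling gradient checkpointing. A minimal sketch, reusing the lora_config from Step 3:
from peft import prepare_model_for_kbit_training, get_peft_model
# Prepare the quantized model for training, then attach the same LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)  # lora_config from Step 3
model.print_trainable_parameters()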
⚠️ QLoRA Tradeoffs
QLoRA uses ~70% less VRAM but trains ~30% slower and may have slightly lower final quality. Use it when you can't fit the model otherwise.
Step 5: Merge and Deploy
After training, merge LoRA weights back into the base model:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model (a fresh session is fine; nothing from training is needed here)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# Load and merge LoRA weights
model = PeftModel.from_pretrained(base_model, "./llama3-finetuned/lora-weights")
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")
Now you can deploy with vLLM or any other inference framework!
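For example, here's a minimal vLLM sketch (assuming vLLM is installed via pip install vllm and using the merged directory from above; the prompt is just an illustration in the training format):
from vllm import LLM, SamplingParams
# Load the merged model with vLLM and run a single prompt
llm = LLM(model="./llama3-merged", dtype="bfloat16")
params = SamplingParams(temperature=0.2, max_tokens=200)
prompt = (
    "### Instruction:\nSummarize the following customer review\n"
    "### Input:\nArrived a day late, but the build quality is excellent.\n"
    "### Response:\n"
)
print(llm.generate([prompt], params)[0].outputs[0].text)
vLLM also ships an OpenAI-compatible HTTP server (python -m vllm.entrypoints.openai.api_server --model ./llama3-merged) if you'd rather expose the model as an API endpoint.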
Cost Breakdown: Real-World Example
Fine-tuning Llama 3 8B on 10,000 examples (3 epochs) on GPUBrazil:
| GPU | Training Time | Cost/Hour | Total Cost |
|---|---|---|---|
| 1x H100 80GB | ~45 minutes | $2.80 | $2.10 |
| 1x A100 80GB | ~1.5 hours | $1.60 | $2.40 |
| 1x L40S 48GB | ~2.5 hours | $0.90 | $2.25 |
You can fine-tune a powerful custom LLM for under $3. Compare that to OpenAI's fine-tuning at $8/1M training tokens!
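To put that comparison in perspective, here's the rough arithmetic. The 500-tokens-per-example average is an assumption, so plug in your own numbers:
# Rough cost comparison (avg_tokens is an assumed average; measure your own data)
examples, epochs, avg_tokens = 10_000, 3, 500
training_tokens = examples * epochs * avg_tokens      # tokens seen during training
hosted_api_cost = training_tokens / 1e6 * 8           # at $8 per 1M training tokens
print(f"{training_tokens / 1e6:.0f}M training tokens: ~${hosted_api_cost:.0f} "
      f"on a hosted fine-tuning API vs ~$2-3 on a rented GPU")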
Best Practices for Fine-Tuning
- Start small: Test with 100 examples before scaling up
- Monitor loss curves: Use Weights & Biases to track training
- Evaluate frequently: Run validation prompts or a held-out split every N steps (see the sketch after this list)
- Avoid overfitting: Stop when validation loss plateaus
- Use diverse examples: Cover edge cases in your training data
- Keep base capability: Include some general examples to prevent forgetting
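To make the evaluation and overfitting points concrete, here's one way to hold out a validation split and track eval loss during training. A minimal sketch that continues the Step 3 script (same model, tokenizer, dataset, and format_prompt):
# Hold out 5% of the data and evaluate on it during training
split = dataset.train_test_split(test_size=0.05, seed=42)
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    evaluation_strategy="steps",       # evaluate on the held-out split
    eval_steps=50,                     # every 50 optimizer steps
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,       # keep the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    report_to="wandb",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    formatting_func=format_prompt,
    max_seq_length=2048,
    packing=True,
)
trainer.train()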
Conclusion
Fine-tuning LLMs on cloud GPUs is fast, affordable, and powerful. With LoRA and modern techniques, you can create specialized AI models that outperform general-purpose alternatives at a fraction of the cost.
Key takeaways:
- LoRA is the best balance of quality, speed, and cost
- Quality data matters more than quantity
- You can fine-tune a custom LLM for under $3 on GPUBrazil
Get started with GPUBrazil and build your first custom LLM today!