Multi-modal AI is revolutionizing how machines understand the world. Vision-language models (VLMs) combine the power of large language models with computer vision, enabling AI to see, understand, and reason about images. This guide covers deploying models like LLaVA and Qwen-VL and building production-ready multi-modal applications.
Deploy LLaVA-1.6 on GPUBrazil with a single A100 GPU in under 5 minutes. Process images with natural language queries instantly.
Understanding Vision-Language Models
Vision-language models bridge the gap between seeing and understanding. Unlike traditional computer vision models that output labels or bounding boxes, VLMs can engage in natural conversations about images.
VLM Architecture Overview
Modern VLMs consist of three main components:
- Vision Encoder: Processes images into feature representations (CLIP, SigLIP, EVA)
- Projection Layer: Aligns visual features with text embedding space
- Language Model: Generates text responses based on visual and text inputs
# VLM Architecture Flow
Image → Vision Encoder → Visual Tokens → Projection
                                             ↓
Text Prompt → Tokenizer → Text Tokens → [Combined] → LLM → Response
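To make the flow concrete, here is a minimal PyTorch-style sketch of the three components wired together. The module names and dimensions are illustrative stand-ins, not any particular model's implementation:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative only: shows the encoder -> projection -> LLM flow."""

    def __init__(self, vision_dim=1024, text_dim=4096, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Identity()                 # stand-in for CLIP/SigLIP
        self.projection = nn.Linear(vision_dim, text_dim)   # aligns visual features with text space
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.llm = nn.Identity()                            # stand-in for the language model

    def forward(self, image_features, text_token_ids):
        visual_tokens = self.projection(self.vision_encoder(image_features))
        text_tokens = self.text_embed(text_token_ids)
        # The LLM attends over visual and text tokens in a single sequence
        combined = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.llm(combined)

# Shapes: (batch, num_patches, vision_dim) and (batch, seq_len)
out = TinyVLM()(torch.randn(1, 576, 1024), torch.randint(0, 32000, (1, 32)))
```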
Top Vision-Language Models in 2025
| Model | Parameters | GPU Memory | Strengths |
|---|---|---|---|
| LLaVA-1.6-34B | 34B | 70GB | Best open-source quality |
| Qwen-VL-Max | 72B | 150GB | Document understanding |
| InternVL2-26B | 26B | 55GB | OCR & charts |
| CogVLM2 | 19B | 40GB | Visual grounding |
| LLaVA-1.6-7B | 7B | 15GB | Fast, efficient |
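The GPU memory column roughly tracks parameter count: 16-bit weights take about 2 bytes per parameter, plus some headroom for the vision tower. A back-of-the-envelope estimator (the 5% overhead factor is an assumption, not a benchmark; KV cache and activations add more at long context lengths):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.05) -> float:
    """Rough fp16/bf16 estimate: weights plus a small margin for the vision tower."""
    return params_billion * bytes_per_param * overhead

print(round(estimate_vram_gb(34)))  # ~71 GB, in line with the table above
print(round(estimate_vram_gb(7)))   # ~15 GB
```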
Deploying LLaVA on GPU Cloud
LLaVA (Large Language and Vision Assistant) is one of the most widely used open-source VLMs. Let's deploy it on GPUBrazil.
Quick Setup with vLLM
# Install vLLM with vision support
pip install "vllm>=0.4.0"
# Start LLaVA server
python -m vllm.entrypoints.openai.api_server \
--model llava-hf/llava-v1.6-mistral-7b-hf \
--trust-remote-code \
--max-model-len 4096 \
--gpu-memory-utilization 0.9
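Before wiring up a client, confirm the server is actually serving the model by querying the OpenAI-compatible /v1/models endpoint (port 8000 is vLLM's default):

```python
import httpx

# Lists the models the vLLM server is currently serving
resp = httpx.get("http://localhost:8000/v1/models", timeout=10.0)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```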
Python Client for Image Analysis
import base64
import httpx
def encode_image(image_path: str) -> str:
"""Encode image to base64."""
with open(image_path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")
async def analyze_image(image_path: str, prompt: str) -> str:
"""Analyze image with LLaVA."""
base64_image = encode_image(image_path)
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:8000/v1/chat/completions",
json={
"model": "llava-hf/llava-v1.6-mistral-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{base64_image}"
}
},
{
"type": "text",
"text": prompt
}
]
}
],
"max_tokens": 1024,
"temperature": 0.2
},
timeout=60.0
)
return response.json()["choices"][0]["message"]["content"]
# Usage (run inside an async function or via asyncio.run)
result = await analyze_image(
"product.jpg",
"Describe this product in detail for an e-commerce listing."
)
print(result)
Deploy Multi-Modal AI Instantly
Get an A100 or H100 GPU and deploy LLaVA in minutes on GPUBrazil.
Start Free Trial →
Building Production VLM Applications
Use Case 1: Automated Image Captioning
from typing import List
import asyncio
class ImageCaptioner:
def __init__(self, model_url: str):
self.model_url = model_url
async def caption_batch(
self,
images: List[str],
style: str = "descriptive"
) -> List[str]:
"""Generate captions for multiple images."""
prompts = {
"descriptive": "Describe this image in detail.",
"alt_text": "Write concise alt text for this image for accessibility.",
"social": "Write an engaging social media caption for this image.",
"seo": "Write an SEO-optimized description for this image."
}
prompt = prompts.get(style, prompts["descriptive"])
tasks = [
self.analyze_single(img, prompt)
for img in images
]
return await asyncio.gather(*tasks)
    async def analyze_single(self, image: str, prompt: str) -> str:
        # Delegates to the analyze_image() helper defined earlier
        return await analyze_image(image, prompt)
# Process 100 product images
captioner = ImageCaptioner("http://localhost:8000")
captions = await captioner.caption_batch(
product_images,
style="seo"
)
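Launching 100 requests at once can overload a single GPU server. A common pattern is to cap in-flight requests with a semaphore; here is a sketch that reuses the analyze_image helper from earlier (the limit of 8 is an arbitrary starting point, tune it to your GPU):

```python
import asyncio
from typing import List

async def caption_batch_bounded(images: List[str], prompt: str,
                                max_concurrency: int = 8) -> List[str]:
    """Run VLM requests in parallel, but never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(img: str) -> str:
        async with semaphore:
            return await analyze_image(img, prompt)

    return await asyncio.gather(*(limited(img) for img in images))
```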
Use Case 2: Document Understanding
class DocumentAnalyzer:
"""Extract structured data from documents."""
EXTRACTION_PROMPT = """Analyze this document image and extract:
1. Document type (invoice, receipt, form, contract, etc.)
2. Key entities (names, dates, amounts, addresses)
3. Main content summary
Return as JSON:
{
"document_type": "...",
"entities": {
"names": [],
"dates": [],
"amounts": [],
"addresses": []
},
"summary": "..."
}"""
async def extract(self, document_image: str) -> dict:
"""Extract structured info from document."""
response = await analyze_image(
document_image,
self.EXTRACTION_PROMPT
)
# Parse JSON from response
import json
try:
# Find JSON in response
start = response.find('{')
end = response.rfind('}') + 1
return json.loads(response[start:end])
        except ValueError:  # json.loads raises JSONDecodeError, a ValueError subclass
return {"raw_response": response}
# Extract invoice data
analyzer = DocumentAnalyzer()
invoice_data = await analyzer.extract("invoice.png")
print(f"Total: {invoice_data['entities']['amounts']}")
Use Case 3: Visual Question Answering API
from typing import List

from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse
import tempfile
import os
app = FastAPI(title="Visual QA API")
@app.post("/v1/visual-qa")
async def visual_qa(
image: UploadFile = File(...),
question: str = Form(...)
):
"""Answer questions about uploaded images."""
# Save uploaded file temporarily
with tempfile.NamedTemporaryFile(
delete=False,
suffix=os.path.splitext(image.filename)[1]
) as tmp:
content = await image.read()
tmp.write(content)
tmp_path = tmp.name
try:
# Analyze with VLM
answer = await analyze_image(tmp_path, question)
return JSONResponse({
"question": question,
"answer": answer,
"model": "llava-1.6-7b",
"confidence": "high"
})
finally:
os.unlink(tmp_path)
@app.post("/v1/batch-analysis")
async def batch_analysis(
images: List[UploadFile] = File(...),
task: str = Form(default="describe")
):
"""Process multiple images in batch."""
tasks_map = {
"describe": "Describe this image in detail.",
"ocr": "Extract all text visible in this image.",
"objects": "List all objects visible in this image.",
"sentiment": "What is the mood/sentiment of this image?"
}
prompt = tasks_map.get(task, tasks_map["describe"])
    results = []
    for image in images:
        # Same pattern as /v1/visual-qa: save to a temp file, analyze, clean up
        with tempfile.NamedTemporaryFile(
            delete=False,
            suffix=os.path.splitext(image.filename)[1]
        ) as tmp:
            tmp.write(await image.read())
            tmp_path = tmp.name
        try:
            answer = await analyze_image(tmp_path, prompt)
            results.append({"filename": image.filename, "result": answer})
        finally:
            os.unlink(tmp_path)
return {"results": results}
Advanced Multi-Modal Techniques
Visual Grounding (Object Detection)
class VisualGrounder:
"""Locate objects in images using natural language."""
GROUNDING_PROMPT = """Find and locate "{object}" in this image.
Return bounding box coordinates as [x1, y1, x2, y2] normalized 0-1.
If multiple instances, return all locations."""
async def ground(
self,
image: str,
target_object: str
) -> List[dict]:
"""Find object locations in image."""
prompt = self.GROUNDING_PROMPT.format(object=target_object)
response = await analyze_image(image, prompt)
# Parse bounding boxes from response
# Models like CogVLM2 return structured coordinates
return self.parse_boxes(response)
def parse_boxes(self, response: str) -> List[dict]:
"""Extract bounding boxes from model response."""
import re
# Find coordinate patterns [x1, y1, x2, y2]
pattern = r'\[(\d+\.?\d*),\s*(\d+\.?\d*),\s*(\d+\.?\d*),\s*(\d+\.?\d*)\]'
matches = re.findall(pattern, response)
boxes = []
for match in matches:
boxes.append({
"x1": float(match[0]),
"y1": float(match[1]),
"x2": float(match[2]),
"y2": float(match[3])
})
return boxes
# Find all people in image
grounder = VisualGrounder()
people_locations = await grounder.ground("crowd.jpg", "person")
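To sanity-check grounding results, convert the normalized coordinates back to pixels and draw them with Pillow. A small helper assuming the box format produced by parse_boxes above:

```python
from typing import List
from PIL import Image, ImageDraw

def draw_boxes(image_path: str, boxes: List[dict], output_path: str) -> None:
    """Overlay normalized [0-1] bounding boxes onto the image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for box in boxes:
        draw.rectangle(
            [box["x1"] * w, box["y1"] * h, box["x2"] * w, box["y2"] * h],
            outline="red",
            width=3,
        )
    img.save(output_path)

draw_boxes("crowd.jpg", people_locations, "crowd_annotated.jpg")
```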
Multi-Image Reasoning
async def compare_images(
image1: str,
image2: str,
comparison_type: str = "differences"
) -> str:
"""Compare two images and describe differences/similarities."""
prompts = {
"differences": "Compare these two images and list all differences you can find.",
"similarities": "What do these two images have in common?",
"changes": "Describe what changed between the first and second image.",
"better": "Which image is better quality and why?"
}
# Combine images (model-specific implementation)
# Some models support multiple images in one request
base64_1 = encode_image(image1)
base64_2 = encode_image(image2)
    client = httpx.AsyncClient(timeout=120.0)
    response = await client.post(
        "http://localhost:8000/v1/chat/completions",
json={
"model": "llava-hf/llava-v1.6-34b-hf",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_1}"}},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_2}"}},
{"type": "text", "text": prompts[comparison_type]}
]
}
]
}
)
    await client.aclose()
    return response.json()["choices"][0]["message"]["content"]
# Compare before/after images
differences = await compare_images(
"before.jpg",
"after.jpg",
"changes"
)
Optimizing VLM Performance
Image Preprocessing
from typing import List

from PIL import Image
import io
def preprocess_image(
image_path: str,
max_size: int = 1024,
quality: int = 85
) -> bytes:
"""Optimize image for VLM inference."""
img = Image.open(image_path)
# Convert to RGB if needed
if img.mode != "RGB":
img = img.convert("RGB")
# Resize if too large (preserves aspect ratio)
if max(img.size) > max_size:
img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)
# Compress
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=quality, optimize=True)
return buffer.getvalue()
def tile_large_image(
image_path: str,
tile_size: int = 512,
overlap: int = 64
) -> List[dict]:
"""Split large image into overlapping tiles for detailed analysis."""
img = Image.open(image_path)
width, height = img.size
tiles = []
for y in range(0, height, tile_size - overlap):
for x in range(0, width, tile_size - overlap):
box = (x, y, min(x + tile_size, width), min(y + tile_size, height))
tile = img.crop(box)
tiles.append({
"image": tile,
"position": (x, y),
"box": box
})
return tiles
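preprocess_image returns raw JPEG bytes, while the earlier client encoded images from a file path. A small bridge base64-encodes the optimized bytes so they slot into the same data-URL payload:

```python
import base64

def encode_preprocessed(image_path: str, max_size: int = 1024) -> str:
    """Optimize the image first, then base64-encode it for the chat request."""
    optimized = preprocess_image(image_path, max_size=max_size)
    return base64.b64encode(optimized).decode("utf-8")

# Drop-in replacement for encode_image() in analyze_image()
base64_image = encode_preprocessed("product.jpg")
```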
Batching and Caching
import hashlib

import redis
class CachedVLM:
"""VLM with result caching for repeated queries."""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.redis = redis.from_url(redis_url)
self.cache_ttl = 3600 * 24 # 24 hours
def _cache_key(self, image_hash: str, prompt: str) -> str:
"""Generate cache key from image and prompt."""
combined = f"{image_hash}:{prompt}"
return f"vlm:{hashlib.md5(combined.encode()).hexdigest()}"
def _hash_image(self, image_bytes: bytes) -> str:
"""Hash image content for cache key."""
return hashlib.sha256(image_bytes).hexdigest()[:16]
async def analyze(
self,
image_path: str,
prompt: str,
use_cache: bool = True
) -> str:
"""Analyze with caching."""
with open(image_path, "rb") as f:
image_bytes = f.read()
image_hash = self._hash_image(image_bytes)
cache_key = self._cache_key(image_hash, prompt)
# Check cache
if use_cache:
cached = self.redis.get(cache_key)
if cached:
return cached.decode()
# Run inference
result = await analyze_image(image_path, prompt)
# Store in cache
self.redis.setex(cache_key, self.cache_ttl, result)
return result
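Usage mirrors the earlier helpers; the Redis URL below assumes a local instance:

```python
vlm = CachedVLM("redis://localhost:6379")

# First call hits the GPU; identical follow-up calls are served from Redis
description = await vlm.analyze("product.jpg", "Describe this product in detail.")
```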
VLMs process images at high resolution. LLaVA-34B needs ~70GB VRAM. Use quantization (AWQ/GPTQ) to run on smaller GPUs, or choose efficient models like LLaVA-7B (15GB) for cost-effective deployments.
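As a sketch of the quantization route, vLLM can load AWQ checkpoints by passing quantization="awq". The model ID below is a placeholder; swap in an actual AWQ build of your chosen VLM, and expect savings to depend on the checkpoint:

```python
from vllm import LLM

# Placeholder model ID: substitute a real AWQ-quantized VLM checkpoint
llm = LLM(
    model="your-org/llava-v1.6-mistral-7b-awq",  # hypothetical checkpoint name
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)
```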
Building a Complete Multi-Modal Pipeline
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional
class TaskType(Enum):
CAPTION = "caption"
OCR = "ocr"
QA = "qa"
ANALYSIS = "analysis"
GROUNDING = "grounding"
@dataclass
class VLMResult:
task: TaskType
input_image: str
output: str
confidence: float
processing_time: float
model: str
class MultiModalPipeline:
"""Complete pipeline for multi-modal AI tasks."""
def __init__(self, config: dict):
self.config = config
self.models = {}
self._load_models()
def _load_models(self):
"""Load models for different tasks."""
# Use specialized models for each task
self.models = {
TaskType.CAPTION: "llava-1.6-7b",
TaskType.OCR: "internvl2-26b",
TaskType.QA: "llava-1.6-34b",
TaskType.ANALYSIS: "qwen-vl-max",
TaskType.GROUNDING: "cogvlm2"
        }

    async def _run_inference(self, image: str, prompt: str, model: str) -> str:
        """Route the request to the selected model.

        Simplified sketch: every task goes through the shared analyze_image()
        helper; in production each model name would map to its own endpoint.
        """
        return await analyze_image(image, prompt)

    async def process(
self,
image: str,
task: TaskType,
prompt: Optional[str] = None,
options: Optional[dict] = None
) -> VLMResult:
"""Process image with appropriate model."""
import time
start = time.time()
# Select model
model = self.models[task]
# Build prompt based on task
if task == TaskType.CAPTION:
prompt = prompt or "Describe this image in detail."
elif task == TaskType.OCR:
prompt = prompt or "Extract all text from this image."
elif task == TaskType.ANALYSIS:
prompt = prompt or "Analyze this image comprehensively."
# Run inference
output = await self._run_inference(image, prompt, model)
return VLMResult(
task=task,
input_image=image,
output=output,
            confidence=0.95,  # placeholder: VLMs do not return calibrated confidence scores
processing_time=time.time() - start,
model=model
)
async def process_batch(
self,
images: List[str],
task: TaskType,
prompt: Optional[str] = None
) -> List[VLMResult]:
"""Process multiple images in parallel."""
tasks = [
self.process(img, task, prompt)
for img in images
]
return await asyncio.gather(*tasks)
# Usage
pipeline = MultiModalPipeline({})
# Single image
result = await pipeline.process(
"product.jpg",
TaskType.CAPTION
)
print(result.output)
# Batch processing
results = await pipeline.process_batch(
["img1.jpg", "img2.jpg", "img3.jpg"],
TaskType.OCR
)
GPU Requirements and Costs
| Model | Min GPU | Recommended GPU | Cost/Hour |
|---|---|---|---|
| LLaVA-7B | RTX 4090 (24GB) | A100 40GB | $0.50 - $1.50 |
| LLaVA-13B | A100 40GB | A100 80GB | $1.50 - $3.00 |
| LLaVA-34B | A100 80GB | 2x A100 80GB | $3.00 - $6.00 |
| Qwen-VL-72B | 2x A100 80GB | 4x A100 80GB | $6.00 - $12.00 |
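To turn hourly rates into a per-image cost, divide the rate by your measured throughput. With illustrative (not benchmarked) numbers of roughly 2 seconds per image for LLaVA-7B on an A100:

```python
def cost_per_1k_images(gpu_cost_per_hour: float, seconds_per_image: float) -> float:
    """Convert an hourly GPU rate into a cost per 1,000 processed images."""
    images_per_hour = 3600 / seconds_per_image
    return gpu_cost_per_hour / images_per_hour * 1000

# Illustrative numbers only: $1.50/hr A100, ~2 s per image
print(f"${cost_per_1k_images(1.50, 2.0):.2f} per 1,000 images")  # ~$0.83
```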
Start with Multi-Modal AI Today
GPUBrazil offers instant access to A100 and H100 GPUs. Deploy vision-language models in minutes with our pre-configured environments.
Get Started Free →
Best Practices Summary
- Model Selection: Use smaller models (7B) for simple tasks, larger (34B+) for complex reasoning
- Image Optimization: Resize to 1024px max, compress to reduce inference time
- Batching: Process multiple images in parallel for throughput
- Caching: Cache results for repeated queries on same images
- Prompt Engineering: Be specific in prompts for better accuracy
- Error Handling: VLMs can hallucinate; validate critical outputs
- Cost Optimization: Use quantized models when quality is acceptable
Conclusion
Multi-modal AI opens up a wide range of possibilities, from automated content moderation to intelligent document processing. With open-source models like LLaVA approaching GPT-4V-level quality on many tasks, you can build powerful visual AI applications without depending on proprietary APIs.
GPUBrazil provides the GPU infrastructure you need to deploy these models cost-effectively. Whether you're processing millions of images or building interactive visual AI, our A100 and H100 GPUs deliver the performance required for production multi-modal workloads.
Ready to build multi-modal AI? Sign up for GPUBrazil, deploy LLaVA, and start processing images with AI in minutes.