Local Vision LLMs Worth Running in 2026

The field exploded. Let’s catch up.

Two years ago I wrote about Pixtral, LLaVA, and Qwen-VL like they were the three models. Then Alibaba shipped Qwen3-VL, Google open-sourced Gemma 4, Meta dropped Llama 4, and Shanghai AI Lab quietly published InternVL3 with an MIT license. Now there’s a whole lineup — and the original three have aged very differently.

I’ve been running most of these on an RTX 4080 (16GB) and I’ll tell you which ones actually matter for a home lab. Some of the new entries are remarkable. Some are nominally “open” but need a small data center to run. I’ll sort out the noise.

Here’s the 2026 picture.

The architecture story (condensed)

Every vision-language model follows the same basic plumbing:

Vision encoder — crunches the image into embeddings (CLIP, DINOv2, or custom ViT)
Projector — maps those embeddings to token space the LLM understands
Language model — the actual reasoning and text generation

The encoder is where all the interesting engineering happens. How it handles resolution, how many vision tokens it generates per image, and how it tiles high-res content — that’s what separates “I can read that invoice” from “I hallucinated three line items.” The base LLM determines reasoning quality and speed.

This split is also why a 7B vision model can match a 13B competitor on certain tasks: a stronger encoder or a smarter tiling scheme in front of a smaller LLM can beat a weaker front end feeding a bigger one on perception-heavy work.

The contenders

LLaVA: still standing, but the old guard now

LLaVA (UW-Madison/Microsoft Research, 2023) is the granddaddy. CLIP ViT-L/14 encoder, 336×336 native, ~576 vision tokens per image. It was the first model that made Ollama’s vision support practical, and honestly it kicked off the whole “local multimodal” era.
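The 576 figure falls straight out of the patch grid. A quick check, assuming the ViT-L/14 patch size of 14 pixels and one token per patch (the function name is mine, for illustration only):

```python
def vision_tokens(image_side: int, patch_side: int) -> int:
    """Tokens produced by a square ViT input with one token per patch."""
    patches_per_side = image_side // patch_side
    return patches_per_side * patches_per_side


# 336 / 14 = 24 patches per side, so 24 x 24 tokens per image
print(vision_tokens(336, 14))
```

Every image gets that same budget, whether it is a cat photo or a page of 8-point text.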

But here’s the thing: LLaVA-OneVision was its last significant push. The ecosystem has largely moved to Qwen and InternVL as base architectures. LLaVA-Next is still in Ollama, still fine for casual use, but nobody’s actively pouring resources into it anymore.

What it’s still good at:

Zero-friction baseline — one ollama pull llava-next and you’re running
Screenshots, diagrams, casual image Q&A
Grafana dashboard reads (I tested this — threw a live dashboard at LLaVA-Next, asked “top 3 metrics by value,” got all three correct in ~3 seconds)
Doesn’t hallucinate on simple stuff

The catch: 336×336 fixed resolution kills you on dense text. My scanned invoice test — LLaVA got the invoice number, hallucinated the total. That’s the ceiling. If your use case lives inside that ceiling, great. If not, keep reading.

ollama pull llava-next
ollama run llava-next

Pixtral: still the invoice whisperer

Pixtral (Mistral) is the model you reach for when accuracy on document OCR is non-negotiable. Two versions: Pixtral-12B (fits on 16GB-ish, tight) and Pixtral Large (~124B parameters, dense rather than MoE, needs a multi-GPU server; not a home-lab card, let’s be honest).

The secret sauce is resolution-aware patch tiling. Instead of cramming everything into 336×336, Pixtral encodes the image at its own resolution and aspect ratio, cut into 16×16 patches, so a dense page gets as many vision tokens as it needs. Dense text stays crisp. On that same scanned invoice test, Pixtral-12B nailed every cell in an 8-column, 25-row table. No hallucinations. LLaVA got most of it.
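To see why resolution-aware tiling matters for dense text, compare token budgets. This is a rough sketch, assuming 16×16-pixel patches (the size given in the Pixtral paper) and one token per patch. It ignores the row-break and end-of-image tokens and any resize limits the real preprocessor applies, and the function name is hypothetical.

```python
import math


def native_res_tokens(width: int, height: int, patch: int = 16) -> int:
    """Approximate vision tokens when the image is tiled at its own resolution."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols * rows


fixed = (336 // 14) ** 2             # fixed 336x336 grid, 14-pixel patches
page = native_res_tokens(1024, 768)  # a scanned page kept at 1024x768

print(f"fixed grid: {fixed} tokens")
print(f"native resolution: {page} tokens")
```

The page gets a bit over five times the tokens of the fixed grid, which is also where the extra seconds per image go.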

What it’s good at:

Dense-text OCR — handwritten forms, scanned PDFs, embedded text in images
Document Q&A (“what’s the customer name on this invoice?”) with complex layouts
Charts with overlapping legend items, similar colors
Spatial reasoning (“top-left corner, what color is…”)

The catch: Still not in Ollama. You’re running vLLM, full stop. It’s slower too — 5–8 seconds per image on a 4080 because it’s doing real work per patch. And 12B on a 16GB card leaves you with almost no headroom for anything else.

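pip install vllm torch

# The Hugging Face repo ID carries the release suffix
huggingface-cli download mistralai/Pixtral-12B-2409

# Pixtral ships with Mistral's own tokenizer format, hence --tokenizer-mode
vllm serve mistralai/Pixtral-12B-2409 \
  --tokenizer-mode mistral \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

import base64
import mimetypes

import requests

MODEL = 'mistralai/Pixtral-12B-2409'
API_URL = 'http://localhost:8000/v1/chat/completions'

def ask_pixtral(image_path, question):
    # Use the file's real MIME type so a PNG is not labelled as JPEG
    mime = mimetypes.guess_type(image_path)[0] or 'image/png'
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    payload = {
        'model': MODEL,
        'messages': [{
            'role': 'user',
            'content': [
                {'type': 'image_url', 'image_url': {'url': f'data:{mime};base64,{image_data}'}},
                {'type': 'text', 'text': question}
            ]
        }],
        'max_tokens': 512,
        'temperature': 0.3
    }

    r = requests.post(API_URL, json=payload, timeout=120)
    r.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return r.json()['choices'][0]['message']['content']

print(ask_pixtral('invoice.png', 'What is the total amount due?'))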

For invoice processing, that accuracy gap pays for itself fast. I manage a small side business — Pixtral saves me an hour a week in manual corrections. LLaVA was wrong on about 10% of invoices. Pixtral: almost none.
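Even with a model this accurate, a cheap arithmetic check catches the rare bad read before it reaches the books. A minimal sketch, assuming you prompt the model to return JSON with `line_items` (each carrying an `amount`) and a `total`. Those field names and the tolerance are my assumptions, not anything Pixtral defines.

```python
import json
from decimal import Decimal


def total_matches(reply_json: str, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when the extracted line items add up to the extracted total."""
    data = json.loads(reply_json)
    # Decimal avoids float rounding noise on currency values
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in data["line_items"]),
        Decimal("0"),
    )
    return abs(items_sum - Decimal(str(data["total"]))) <= tolerance


# Hypothetical model reply for a two-line invoice
reply = '{"line_items": [{"amount": "19.99"}, {"amount": "5.01"}], "total": "25.00"}'
print(total_matches(reply))
```

Real invoices add tax, shipping, and discounts, so the sum needs extending before this is useful on anything but the simplest layout.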

Qwen3-VL: the one that actually grew up

When I wrote the original version of this article, Qwen-VL and Qwen2-VL were the speed demons with solid multilingual chops. Then Alibaba shipped Qwen3-VL (September 2025) and it’s a different beast.

Sizes: dense 2B / 4B / 8B / 32B, plus MoE variants (30B-A3B and a monster 235B-A22B). Apache 2.0 licensed. 256K context window, extendable toward 1M. If you haven’t heard that last part — yes, you can throw long documents at it.

The headline upgrade over Qwen2.5-VL is the Thinking/Instruct split. Instruct is the fast interactive variant you know. Thinking is deep multimodal reasoning — up to ~40K output tokens — for when you need the model to actually work through something complex rather than pattern-match to an answer. Think multi-step document analysis, GUI task planning, or when you need it to explain its visual reasoning rather than just spit out an answer.

Numbers: OCRBench around 896 for the 8B model, climbing past 920 for larger variants. DocVQA in the 96–97% range. GUI grounding (ScreenSpot) around 92–94%. The OCRBench and ScreenSpot figures come from third-party benchmarks; the head-to-head table further down has the confirmed scores. Where Qwen3-VL really shines is raw visual perception: reading what’s actually in the image.

What it’s good at:

OCR across 32 languages (up from 19 in 2.5-VL), including CJK, Arabic, multilingual invoices
Video understanding — can scan long videos natively, not just frame dumps
GUI/agentic tasks — “click this button in this screenshot” type workflows
15–60% faster inference than Qwen2.5-VL at the same model size
The 8B variant is the sweet spot for a 16GB home-lab GPU

The catch: The bigger MoE variants are still not 16GB territory. Stick to 8B for solo-GPU setups. Also — the Thinking mode is powerful but you’ll want to be intentional about when you trigger it; leaving it on for everything is overkill.
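One way to be intentional is to route by task type instead of leaving Thinking on globally. A sketch, assuming both 8B checkpoints are served separately under their Hugging Face names and that the task labels are ones you define yourself (the label set here is hypothetical):

```python
INSTRUCT = "Qwen/Qwen3-VL-8B-Instruct"
THINKING = "Qwen/Qwen3-VL-8B-Thinking"

# Hypothetical task labels: multi-step work goes to the slower variant
DEEP_TASKS = {
    "multi_step_document_analysis",
    "gui_task_planning",
    "explain_reasoning",
}


def pick_variant(task: str) -> str:
    """Send only reasoning-heavy tasks to Thinking; everything else stays fast."""
    return THINKING if task in DEEP_TASKS else INSTRUCT


print(pick_variant("ocr"))
print(pick_variant("gui_task_planning"))
```

Defaulting to Instruct means an unknown task label costs you quality at worst, never a 40K-token wait.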

# Qwen3-VL 8B via vLLM
vllm serve Qwen/Qwen3-VL-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9

For the multilingual invoice test that the old article covered — Qwen3-VL with Chinese headers and English line items — accuracy is effectively perfect now. And it’s faster than Pixtral. If you’re not dealing with degraded scans, Qwen3-VL is becoming the all-rounder.

Gemma 4: the actual home-lab dark horse

This is the one I’m most excited about, and the one most people haven’t tried yet.

Gemma 4 (Google DeepMind, April 2026) is open, Apache 2.0, and it runs in Ollama and llama.cpp. Let that sink in — no vLLM setup, no custom serving stack, just ollama pull. That alone earns it a place in the conversation.

Sizes: E2B, E4B (phone and laptop-class), 12B (this one adds text + image + audio in the “Unified” variant), 26B-A4B (MoE), 31B (dense). 140+ languages. 256K context. Native VLM with configurable visual token budgets: 70 / 140 / 280 / 560 / 1120 — you dial this yourself. 70 tokens = fast captioning or video frames; 560–1120 = OCR, dense docs, UI reasoning. That’s a genuinely useful knob to have.
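A small helper makes that knob easier to use consistently. This is a sketch: the budget values are the ones listed above, but the task labels, the mid-range default, and the split between 560 and 1120 are my assumptions. How the budget gets passed to the model depends on the runtime and is not shown here.

```python
BUDGETS = (70, 140, 280, 560, 1120)

# Hypothetical task labels mapped onto the budgets listed above
TASK_BUDGET = {
    "captioning": 70,
    "video_frame": 70,
    "ui_reasoning": 560,
    "ocr": 1120,
    "dense_document": 1120,
}


def pick_budget(task: str, default: int = 280) -> int:
    """Return a visual token budget for a task, falling back to a mid-range value."""
    budget = TASK_BUDGET.get(task, default)
    if budget not in BUDGETS:
        raise ValueError(f"unsupported budget: {budget}")
    return budget


print(pick_budget("captioning"))
print(pick_budget("ocr"))
```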

The 12B Unified model is the one to highlight: it handles images, text, AND audio (ASR + translation, ~30s clips). If you want a single model that can transcribe a voice memo and then reason about a screenshot in the same workflow — that’s Gemma 4 12B.

The numbers that matter for home lab: QAT (quantization-aware-trained) variants cut memory roughly 3x. At 4-bit QAT:

E2B: ~3GB
E4B: ~5GB
12B: ~7GB
26B-A4B: ~15GB
31B: ~18GB

That E2B running in 3GB of RAM is not a joke model — it’s a genuinely capable vision model designed for constrained hardware. E4B in 5GB. The 12B in 7GB with audio support. If you’re running on anything from a gaming laptop to an actual phone, there’s a Gemma 4 variant for you.
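Those figures turn into a quick "what fits my card" check. A sketch using the 4-bit QAT numbers above; the 2GB of headroom for KV cache and everything else on the GPU is my assumption, so adjust it for your context lengths.

```python
# Approximate 4-bit QAT footprints from the list above, smallest to largest
QAT_4BIT_GB = {"E2B": 3, "E4B": 5, "12B": 7, "26B-A4B": 15, "31B": 18}


def largest_fit(vram_gb: float, headroom_gb: float = 2.0):
    """Return the largest variant whose weights fit, or None if nothing does."""
    usable = vram_gb - headroom_gb
    fitting = [name for name, gb in QAT_4BIT_GB.items() if gb <= usable]
    return fitting[-1] if fitting else None


for card in (4, 8, 16, 24):
    print(f"{card}GB -> {largest_fit(card)}")
```

On a 16GB card the 26B-A4B technically loads, but with almost nothing left over, which is why the helper reports the 12B.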

# Gemma 4 12B via Ollama — yes, this works
ollama pull gemma4:12b

# Or the efficient E4B if you're on a potato
ollama pull gemma4:e4b

# Run it
ollama run gemma4:12b

import ollama

# Vision query — image analysis
response = ollama.generate(
    model='gemma4:12b',
    prompt='What metrics are elevated in this dashboard?',
    images=['grafana_screenshot.png'],
    stream=False
)

print(response['response'])

What it’s good at:

Runs anywhere Ollama runs — the lowest barrier to entry in this roundup
Audio on the 12B+ variant (ASR + translation, ~30s clips)
Video via frame sequences
Visual token budget control — you can tune speed vs accuracy per request
The QAT memory efficiency is real; you can run vision at quality that would’ve needed a 24GB card a year ago

The catch: Gemma 4’s OCR on heavily degraded scanned documents doesn’t match Pixtral’s tiling approach. For invoice-processing accuracy at scale, Pixtral still has an edge. And audio is 12B+ only — E2B/E4B are image-only.

InternVL3: the MIT-licensed all-rounder

InternVL3 (Shanghai AI Lab) deserves a mention because it holds the strongest benchmark numbers of any model here under a truly permissive MIT license. Sizes go up to 78B. The 78B posts MMMU around 72.2 and OCRBench around 906; the 8B variant gets DocVQA ~92.7.

Honest take: if you’re building something you want to ship commercially and need a VLM backend with clean licensing, InternVL3 is your answer. It’s solid across image understanding, document Q&A, and OCR. Not as fast as Qwen3-VL, not as OCR-precise as Pixtral, but permissive MIT and genuinely capable.

Runs via vLLM. Treat it like Pixtral from a setup perspective.

Llama 4: technically multimodal, realistically not home-lab

Meta shipped Llama 4 Scout (~17B active parameters / ~109B total) and Maverick (~17B active / ~400B total with 1M context). Native multimodal, strong numbers — DocVQA around 94.4, ChartQA 88–90.

But let’s be honest about what these models are: MoE giants. Scout needs enough memory to load ~109B total weights. Maverick is 400B. You’re looking at multi-GPU setups with enough VRAM to take a small family on vacation. The Llama 4 Community License is also not true open-source.

Worth knowing they exist. Not worth planning your home-lab workflow around them unless you’re running a proper inference server.
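The arithmetic behind that verdict is short. A back-of-the-envelope sketch counting weights only, using 1GB = 10^9 bytes and the parameter counts above; it ignores KV cache and activations, so real requirements are higher. With MoE, every expert has to be resident even though only ~17B parameters are active per token.

```python
def weight_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return total_params_billion * bits_per_param / 8


for name, params in (("Scout", 109), ("Maverick", 400)):
    print(
        f"{name}: {weight_gb(params, 16):.1f} GB at 16-bit, "
        f"{weight_gb(params, 4):.1f} GB at 4-bit"
    )
```

Even at 4-bit, Scout wants more than three times the VRAM of a 16GB card before a single image token is processed.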

The verified numbers: Qwen3-VL vs Gemma 4, head to head

The impressions above are from my own poking, and so are the subjective ratings in the decision matrix further down. But it’s worth seeing the actual benchmark scores side by side, because they tell a story you might not expect. These are mostly the bigger variants, with Qwen3-VL 4B included for range (the 8B Qwen / 12B Gemma you’d run at home land below the big ones, but the shape holds):

Benchmark	Qwen3-VL 32B (Thinking)	Qwen3-VL 4B (Instruct)	Gemma 4 26B-A4B	Gemma 4 31B
Total params	33B	4B	26B (MoE)	30.7B (dense)
Active params	33B	4B	3.8B	30.7B
Context window	—	262K	256K	256K
MMLU-Pro (knowledge)	82.1%	63.4%	82.6%	85.2%
GPQA Diamond (science)	73.1%	37.1%	82.3%	84.3%
AIME (math)	83.7%	46.6%	88.3%	89.2%
LiveCodeBench v6	65.6%	37.9%	77.1%	80.0%
MMMU-Pro (vision reasoning)	68.1%	—	73.8%	76.9%
MMBench-V1.1 (vision)	90.8%	85.1%	—	—
DocVQA (vision)	96.1%	95.3%	—	—
Arena ELO	—	—	~1441	~1452

Here’s the surprising part. On text reasoning, math, and code — and even MMMU-Pro, which is reasoning-heavy visual problems — Gemma 4 leads, sometimes by a wide margin (GPQA 84.3 vs 73.1 is not close). Google built a genuinely strong reasoner that happens to see.

But flip to raw visual perception — reading documents, answering about what’s literally in an image — and Qwen3-VL pulls ahead (DocVQA 96.1, MMBench 90.8, on benchmarks where Google didn’t even report a number). Qwen3-VL is the sharper eye.

So the practical split: if your “vision” task is really reasoning that involves an image — interpreting a chart, reasoning over a diagram — Gemma 4 is shockingly strong. If it’s perception — OCR, document extraction, “what’s in this screenshot” — Qwen3-VL is still the one to beat. (And note the 4B-Instruct vs 32B-Thinking columns are very different tiers — that’s the range, not a fair fight.)

Decision matrix: the ones you can actually run

	LLaVA-Next	Pixtral-12B	Qwen3-VL 8B	Gemma 4 12B
OCR (clean text)	8/10	9.5/10	9/10	8.5/10
OCR (dense/degraded)	6/10	9.5/10	8.5/10	7/10
Charts & diagrams	8/10	9/10	8.5/10	8/10
Screenshots / UI	8.5/10	8/10	9/10	8.5/10
Speed (4080)	3–4s	5–8s	1.5–2s	2–3s
VRAM	8–12GB	12–16GB	8–12GB	7GB (QAT)
Ollama support	✅	❌	❌	✅
Multilingual	Decent	Good	Excellent (32 langs)	Excellent (140+ langs)
Audio support	❌	❌	❌	✅ (12B+)
Video support	Basic	❌	✅ Native	✅ Frame-based
License	Apache 2.0	Apache 2.0	Apache 2.0	Apache 2.0

Which should you run?

Easiest setup + runs on a potato + want audio too → Gemma 4

ollama pull gemma4:12b and you’re done. The QAT variants mean you can run a genuinely capable vision model in 7GB of RAM. If you also want ASR — transcribing voice memos, meeting recordings, audio clips — Gemma 4 12B is the only model in this lineup that handles all of it. This is the pick for “I want this to just work.”

OCR + multilingual + video + GUI/agentic work → Qwen3-VL

The 8B Instruct variant is the fastest accurate VLM you can run on a 16GB card right now. If your documents mix languages, if you’re building workflows that interact with screenshots or GUI elements, or if you want native video understanding — Qwen3-VL is the answer. It’s grown from “fast multilingual option” to “best general-purpose local VLM.” The Thinking mode is a genuine differentiator when you need deep visual reasoning, not just pattern matching.

Heavy document / invoice OCR accuracy → Pixtral

If you’re processing scanned PDFs, handwritten forms, or invoices where a single wrong number costs you real money — Pixtral’s tiling approach is still the best. It’s not the most convenient, and you’ll need vLLM, but the accuracy gap on degraded documents is real. I’ve been running this for invoice processing and it’s earned its place.

MIT license specifically → InternVL3

Commercial project, clean licensing required? InternVL3 is solid across image understanding and OCR, genuinely permissive MIT, and runs via vLLM on the same hardware as the others.

Already have LLaVA running and don’t need more → keep it

It’s the old guard now. It’s not getting better. But if it’s solving your problem, there’s zero urgency to migrate. Upgrade when your use case hits its ceiling.
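If you want the same decision as something you can drop into a script, here is one way to encode it. A sketch only: the flag names are hypothetical and the order of the checks is my assumption about which requirement wins when several apply.

```python
def recommend(
    need_mit_license: bool = False,
    degraded_scans: bool = False,
    need_audio: bool = False,
    ollama_only: bool = False,
) -> str:
    """Map requirements to a pick from this roundup (check order is an assumption)."""
    if need_mit_license:
        return "InternVL3"
    if degraded_scans:
        return "Pixtral-12B"
    if need_audio or ollama_only:
        return "Gemma 4 12B"
    # Default: OCR, multilingual, video, GUI/agentic work
    return "Qwen3-VL 8B"


print(recommend(degraded_scans=True))
print(recommend(need_audio=True))
print(recommend())
```

Note that conflicting flags, such as degraded scans plus Ollama-only, have no clean answer in this lineup; the function just returns the first match.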

Real-world workflow: what this actually looks like

The Grafana dashboard scraper I mentioned earlier is still running. It takes a screenshot every 5 minutes, feeds it to a vision model, and asks “what’s the CPU utilization right now?” With Qwen3-VL 8B it now comes back under 2 seconds and correctly identifies which panel it’s reading from. With the old LLaVA setup it occasionally picked the wrong panel. That’s a monitoring alerting pipeline, and wrong panel reads sent false alerts at 2 AM. You can guess how much I enjoyed those.
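The model call is the easy part of that pipeline; the part that decides whether you get woken up is turning a free-text reply into a number. A minimal sketch of that step, assuming the model answers in prose containing a percentage. The function names and the 90% threshold are mine, not part of any library.

```python
import re

# First number followed by a percent sign, e.g. "93.5%" or "87 %"
PERCENT = re.compile(r"(\d{1,3}(?:\.\d+)?)\s*%")


def parse_cpu_percent(reply: str):
    """Pull a plausible percentage out of the model's reply, or None."""
    match = PERCENT.search(reply)
    if match is None:
        return None
    value = float(match.group(1))
    # Anything outside 0-100 is treated as a misread, not a reading
    return value if 0.0 <= value <= 100.0 else None


def should_alert(reply: str, threshold: float = 90.0) -> bool:
    """Alert only on a readable value at or above the threshold."""
    value = parse_cpu_percent(reply)
    return value is not None and value >= threshold


print(should_alert("CPU utilization is 93.5% on the top-left panel."))
print(should_alert("I can't make out the CPU panel in this image."))
```

Staying quiet on unreadable replies trades missed alerts for fewer false ones, so it is worth counting those replies separately and alerting if they pile up.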

For invoices, Pixtral is still in the loop. Dense scanned PDFs with varying quality, complex table layouts, occasionally a fax (yes, people still fax things) — Pixtral is the only one in this lineup that doesn’t require manual spot-checking.

And I’ve started experimenting with Gemma 4 12B for a new workflow: voice memos from my phone transcribed locally, then the transcript runs through the same model to extract action items. Single model, no cloud, no API fees. The audio transcription isn’t production-perfect on background noise but for clear recordings it’s surprisingly usable.

The honest state of play

In 2023 this was a research novelty. In 2024 it was “technically works but budget extra RAM.” In 2026 it’s just… infrastructure. You pick the model that fits the task, slap it behind a vLLM server or Ollama endpoint, and build on top of it.

The big shift from the old three-model world: Gemma 4 made the accessibility bar much lower, and Qwen3-VL closed most of the accuracy gap that used to justify running Pixtral for everything. Pixtral is still the specialist for degraded document OCR. But for 80% of vision tasks, you’re now choosing between an Ollama-friendly model and a vLLM model, not between “works” and “doesn’t work.”

Pick one that fits your GPU and your workflow. Spend the first 30 minutes poking it with your actual documents, not generic benchmark images. That’s the only benchmark that matters for your use case.

Your 2 AM self (who will absolutely be debugging an invoice parser at 2 AM) will thank you for making the call now, not then.
