Skip to content
Go back

Local Vision LLMs Worth Running in 2026

By SumGuy 15 min read
Local Vision LLMs Worth Running in 2026

The field exploded. Let’s catch up.

Two years ago I wrote about Pixtral, LLaVA, and Qwen-VL like they were the three models. Then Alibaba shipped Qwen3-VL, Google open-sourced Gemma 4, Meta dropped Llama 4, and Shanghai AI Lab quietly published InternVL3 with an MIT license. Now there’s a whole lineup — and the original three have aged very differently.

I’ve been running most of these on an RTX 4080 (16GB) and I’ll tell you which ones actually matter for a home lab. Some of the new entries are remarkable. Some are nominally “open” but need a small data center to run. I’ll sort out the noise.

Here’s the 2026 picture.

The architecture story (condensed)

Every vision-language model follows the same basic plumbing:

  1. Vision encoder — crunches the image into embeddings (CLIP, DINOv2, or custom ViT)
  2. Projector — maps those embeddings to token space the LLM understands
  3. Language model — the actual reasoning and text generation

The encoder is where all the interesting engineering happens. How it handles resolution, how many vision tokens it generates per image, and how it tiles high-res content — that’s what separates “I can read that invoice” from “I hallucinated three line items.” The base LLM determines reasoning quality and speed.

This split is also why a 7B vision model can match a 13B competitor on certain tasks. Different architecture choices compound differently.


The contenders

LLaVA: still standing, but the old guard now

LLaVA (UW/CMU, 2023) is the granddaddy. CLIP ViT-L/14 encoder, 336×336 native, ~576 vision tokens per image. It was the first model that made Ollama’s vision support possible, and honestly it kicked off the whole “local multimodal” era.

But here’s the thing: LLaVA-OneVision was its last significant push. The ecosystem has largely moved to Qwen and InternVL as base architectures. LLaVA-Next is still in Ollama, still fine for casual use, but nobody’s actively pouring resources into it anymore.

What it’s still good at:

The catch: 336×336 fixed resolution kills you on dense text. My scanned invoice test — LLaVA got the invoice number, hallucinated the total. That’s the ceiling. If your use case lives inside that ceiling, great. If not, keep reading.

Terminal window
ollama pull llava-next
ollama run llava-next

Pixtral: still the invoice whisperer

Pixtral (Mistral) is the model you reach for when accuracy on document OCR is non-negotiable. Two versions: Pixtral-12B (fits on 16GB-ish, tight) and Pixtral Large (~124B parameters, MoE architecture, needs multi-GPU or 24GB+ territory — not a home-lab card, let’s be honest).

The secret sauce is dynamic patch tiling. Instead of cramming everything into 336×336, Pixtral keeps high-res regions high-res and compresses uniform regions. Dense text stays crisp. That same scanned invoice test — Pixtral-12B nailed every cell in an 8-column, 25-row table. No hallucinations. LLaVA got most of it.

What it’s good at:

The catch: Still not in Ollama. You’re running vLLM, full stop. It’s slower too — 5–8 seconds per image on a 4080 because it’s doing real work per patch. And 12B on a 16GB card leaves you with almost no headroom for anything else.

Terminal window
pip install vllm torch
huggingface-cli download mistralai/Pixtral-12B
vllm serve mistralai/Pixtral-12B \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.85
pixtral_client.py
import base64
import requests
def ask_pixtral(image_path, question):
with open(image_path, 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')
payload = {
'model': 'mistralai/Pixtral-12B',
'messages': [{
'role': 'user',
'content': [
{'type': 'image_url', 'image_url': {'url': f'data:image/jpeg;base64,{image_data}'}},
{'type': 'text', 'text': question}
]
}],
'max_tokens': 512,
'temperature': 0.3
}
r = requests.post('http://localhost:8000/v1/chat/completions', json=payload)
return r.json()['choices'][0]['message']['content']
print(ask_pixtral('invoice.png', 'What is the total amount due?'))

For invoice processing, that accuracy gap pays for itself fast. I manage a small side business — Pixtral saves me an hour a week in manual corrections. LLaVA was wrong on about 10% of invoices. Pixtral: almost none.


Qwen3-VL: the one that actually grew up

When I wrote the original version of this article, Qwen-VL and Qwen2-VL were the speed demons with solid multilingual chops. Then Alibaba shipped Qwen3-VL (September 2025) and it’s a different beast.

Sizes: dense 2B / 4B / 8B / 32B, plus MoE variants (30B-A3B and a monster 235B-A22B). Apache 2.0 licensed. 256K context window, extendable toward 1M. If you haven’t heard that last part — yes, you can throw long documents at it.

The headline upgrade over Qwen2.5-VL is the Thinking/Instruct split. Instruct is the fast interactive variant you know. Thinking is deep multimodal reasoning — up to ~40K output tokens — for when you need the model to actually work through something complex rather than pattern-match to an answer. Think multi-step document analysis, GUI task planning, or when you need it to explain its visual reasoning rather than just spit out an answer.

Numbers (OCRBench/ScreenSpot figures from third-party benchmarks; see the verified head-to-head table further down for confirmed scores): OCRBench around 896 for the 8B model, climbing past 920 for larger variants. DocVQA in the 96–97% range. GUI grounding (ScreenSpot) around 92–94%. Where Qwen3-VL really shines is raw visual perception — reading what’s actually in the image.

What it’s good at:

The catch: The bigger MoE variants are still not 16GB territory. Stick to 8B for solo-GPU setups. Also — the Thinking mode is powerful but you’ll want to be intentional about when you trigger it; leaving it on for everything is overkill.

Terminal window
# Qwen3-VL 8B via vLLM
vllm serve Qwen/Qwen3-VL-8B-Instruct \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9

For the multilingual invoice test that the old article covered — Qwen3-VL with Chinese headers and English line items — accuracy is effectively perfect now. And it’s faster than Pixtral. If you’re not dealing with degraded scans, Qwen3-VL is becoming the all-rounder.


Gemma 4: the actual home-lab dark horse

This is the one I’m most excited about, and the one most people haven’t tried yet.

Gemma 4 (Google DeepMind, April 2026) is open, Apache 2.0, and it runs in Ollama and llama.cpp. Let that sink in — no vLLM setup, no custom serving stack, just ollama pull. That alone earns it a place in the conversation.

Sizes: E2B, E4B (phone and laptop-class), 12B (this one adds text + image + audio in the “Unified” variant), 26B-A4B (MoE), 31B (dense). 140+ languages. 256K context. Native VLM with configurable visual token budgets: 70 / 140 / 280 / 560 / 1120 — you dial this yourself. 70 tokens = fast captioning or video frames; 560–1120 = OCR, dense docs, UI reasoning. That’s a genuinely useful knob to have.

The 12B Unified model is the one to highlight: it handles images, text, AND audio (ASR + translation, ~30s clips). If you want a single model that can transcribe a voice memo and then reason about a screenshot in the same workflow — that’s Gemma 4 12B.

The numbers that matter for home lab: QAT (quantization-aware-trained) variants cut memory roughly 3x. At 4-bit QAT:

That E2B running in 3GB of RAM is not a joke model — it’s a genuinely capable vision model designed for constrained hardware. E4B in 5GB. The 12B in 7GB with audio support. If you’re running on anything from a gaming laptop to an actual phone, there’s a Gemma 4 variant for you.

Terminal window
# Gemma 4 12B via Ollama — yes, this works
ollama pull gemma4:12b
# Or the efficient E4B if you're on a potato
ollama pull gemma4:e4b
# Run it
ollama run gemma4:12b
gemma4_vision.py
import ollama
# Vision query — image analysis
response = ollama.generate(
model='gemma4:12b',
prompt='What metrics are elevated in this dashboard?',
images=['grafana_screenshot.png'],
stream=False
)
print(response['response'])

What it’s good at:

The catch: Gemma 4’s OCR on heavily degraded scanned documents doesn’t match Pixtral’s tiling approach. For invoice-processing accuracy at scale, Pixtral still has an edge. And audio is 12B+ only — E2B/E4B are image-only.


InternVL3: the MIT-licensed all-rounder

InternVL3 (Shanghai AI Lab) deserves a mention because it holds the strongest benchmark numbers for a truly permissive MIT license. Up to 78B. MMMU around 72.2, OCRBench around 906, DocVQA ~92.7 on the 8B variant.

Honest take: if you’re building something you want to ship commercially and need a VLM backend with clean licensing, InternVL3 is your answer. It’s solid across image understanding, document Q&A, and OCR. Not as fast as Qwen3-VL, not as OCR-precise as Pixtral, but permissive MIT and genuinely capable.

Runs via vLLM. Treat it like Pixtral from a setup perspective.


Llama 4: technically multimodal, realistically not home-lab

Meta shipped Llama 4 Scout (~17B active parameters / ~109B total) and Maverick (~17B active / ~400B total with 1M context). Native multimodal, strong numbers — DocVQA around 94.4, ChartQA 88–90.

But let’s be honest about what these models are: MoE giants. Scout needs enough memory to load ~109B total weights. Maverick is 400B. You’re looking at multi-GPU setups with enough VRAM to take a small family on vacation. The Llama 4 Community License is also not true open-source.

Worth knowing they exist. Not worth planning your home-lab workflow around them unless you’re running a proper inference server.


The verified numbers: Qwen3-VL vs Gemma 4, head to head

The subjective ratings above are from my own poking. But it’s worth seeing the actual benchmark scores side by side, because they tell a story you might not expect. These are the bigger variants (the 8B Qwen / 12B Gemma you’d run at home land below these, but the shape holds):

BenchmarkQwen3-VL 32B (Thinking)Qwen3-VL 4B (Instruct)Gemma 4 26B-A4BGemma 4 31B
Total params33B4B26B (MoE)30.7B (dense)
Active params33B4B3.8B30.7B
Context window262K256K256K
MMLU-Pro (knowledge)82.1%63.4%82.6%85.2%
GPQA Diamond (science)73.1%37.1%82.3%84.3%
AIME (math)83.7%46.6%88.3%89.2%
LiveCodeBench v665.6%37.9%77.1%80.0%
MMMU-Pro (vision reasoning)68.1%73.8%76.9%
MMBench-V1.1 (vision)90.8%85.1%
DocVQA (vision)96.1%95.3%
Arena ELO~1441~1452

Here’s the surprising part. On text reasoning, math, and code — and even MMMU-Pro, which is reasoning-heavy visual problems — Gemma 4 leads, sometimes by a wide margin (GPQA 84.3 vs 73.1 is not close). Google built a genuinely strong reasoner that happens to see.

But flip to raw visual perception — reading documents, answering about what’s literally in an image — and Qwen3-VL pulls ahead (DocVQA 96.1, MMBench 90.8, on benchmarks where Google didn’t even report a number). Qwen3-VL is the sharper eye.

So the practical split: if your “vision” task is really reasoning that involves an image — interpreting a chart, reasoning over a diagram — Gemma 4 is shockingly strong. If it’s perception — OCR, document extraction, “what’s in this screenshot” — Qwen3-VL is still the one to beat. (And note the 4B-Instruct vs 32B-Thinking columns are very different tiers — that’s the range, not a fair fight.)


Decision matrix: the ones you can actually run

LLaVA-NextPixtral-12BQwen3-VL 8BGemma 4 12B
OCR (clean text)8/109.5/109/108.5/10
OCR (dense/degraded)6/109.5/108.5/107/10
Charts & diagrams8/109/108.5/108/10
Screenshots / UI8.5/108/109/108.5/10
Speed (4080)3–4s5–8s1.5–2s2–3s
VRAM8–12GB12–16GB8–12GB7GB (QAT)
Ollama support
MultilingualDecentGoodExcellent (32 langs)Excellent (140+ langs)
Audio support✅ (12B+)
Video supportBasic✅ Native✅ Frame-based
LicenseApache 2.0Apache 2.0Apache 2.0Apache 2.0

Which should you run?

Easiest setup + runs on a potato + want audio too → Gemma 4

ollama pull gemma4:12b and you’re done. The QAT variants mean you can run a genuinely capable vision model in 7GB of RAM. If you also want ASR — transcribing voice memos, meeting recordings, audio clips — Gemma 4 12B is the only model in this lineup that handles all of it. This is the pick for “I want this to just work.”

OCR + multilingual + video + GUI/agentic work → Qwen3-VL

The 8B Instruct variant is the fastest accurate VLM you can run on a 16GB card right now. If your documents mix languages, if you’re building workflows that interact with screenshots or GUI elements, or if you want native video understanding — Qwen3-VL is the answer. It’s grown from “fast multilingual option” to “best general-purpose local VLM.” The Thinking mode is a genuine differentiator when you need deep visual reasoning, not just pattern matching.

Heavy document / invoice OCR accuracy → Pixtral

If you’re processing scanned PDFs, handwritten forms, or invoices where a single wrong number costs you real money — Pixtral’s tiling approach is still the best. It’s not the most convenient, and you’ll need vLLM, but the accuracy gap on degraded documents is real. I’ve been running this for invoice processing and it’s earned its place.

MIT license specifically → InternVL3

Commercial project, clean licensing required? InternVL3 is solid across image understanding and OCR, genuinely permissive MIT, and runs via vLLM on the same hardware as the others.

Already have LLaVA running and don’t need more → keep it

It’s the old guard now. It’s not getting better. But if it’s solving your problem, there’s zero urgency to migrate. Upgrade when your use case hits its ceiling.


Real-world workflow: what this actually looks like

The Grafana dashboard scraper I mentioned earlier is still running. Screenshot every 5 minutes, fed to a vision model, asks “what’s the CPU utilization right now?” With Qwen3-VL 8B it now comes back under 2 seconds and correctly identifies which panel it’s reading from. With the old LLaVA setup it occasionally picked the wrong panel. That’s a monitoring alerting pipeline, and wrong panel reads sent false alerts at 2 AM. You can guess how much I enjoyed those.

For invoices, Pixtral is still in the loop. Dense scanned PDFs with varying quality, complex table layouts, occasionally a fax (yes, people still fax things) — Pixtral is the only one in this lineup that doesn’t require manual spot-checking.

And I’ve started experimenting with Gemma 4 12B for a new workflow: voice memos from my phone transcribed locally, then the transcript runs through the same model to extract action items. Single model, no cloud, no API fees. The audio transcription isn’t production-perfect on background noise but for clear recordings it’s surprisingly usable.


The honest state of play

In 2023 this was a research novelty. In 2024 it was “technically works but budget extra RAM.” In 2026 it’s just… infrastructure. You pick the model that fits the task, slap it behind a vLLM server or Ollama endpoint, and build on top of it.

The big shift from the old three-model world: Gemma 4 made the accessibility bar much lower, and Qwen3-VL closed most of the accuracy gap that used to justify running Pixtral for everything. Pixtral is still the specialist for degraded document OCR. But for 80% of vision tasks, you’re now choosing between an Ollama-friendly model and a vLLM model, not between “works” and “doesn’t work.”

Pick one that fits your GPU and your workflow. Spend the first 30 minutes poking it with your actual documents, not generic benchmark images. That’s the only benchmark that matters for your use case.

Your 2 AM self (who will absolutely be debugging an invoice parser at 2 AM) will thank you for making the call now, not then.


Share this post on:

Send a Webmention

Written about this post on your own site? Send a webmention and it'll show up above once verified.


Previous Post
Home Assistant Add-Ons vs Docker Containers
Next Post
HAProxy vs Envoy

Discussion

Powered by Garrul . Sign in with GitHub or Google, or post anonymously.

Related Posts