Agentic Browsers: Browser-Use & Skyvern Reviewed

By SumGuy 9 min read
Your Selenium Scripts Are Showing Their Age

Remember writing Playwright tests in 2021? You’d spend 40 minutes hunting the right CSS selector, another 20 minutes making it not break when the marketing team changed a button label, and then it would silently die six months later when the site got redesigned. The automation worked — kind of — but you were basically writing a brittle map to a city that kept renaming its streets.

Agentic browsers flip this on its head. Instead of page.click('#submit-btn-v2-final-FINAL'), you say “fill out the contact form with these details and submit it.” An LLM reads the DOM, figures out what’s what, and drives the browser like a human who actually understands the page.

That’s the pitch, anyway. Let’s see how it holds up.

What We’re Actually Talking About

Two projects are worth your time here: Browser-Use and Skyvern. Both let you describe a goal in plain language and have an LLM execute it against a real browser. They differ in approach, maturity, and who they’re aimed at.

Browser-Use is a Python library backed by Playwright. You give it a task string, hook it up to an LLM (OpenAI, Anthropic, or a local model via Ollama), and it builds an agent that reads the DOM as structured text, takes actions, and loops until the goal is met or it gives up.

Skyvern takes a more vision-forward approach — it combines DOM parsing and screenshots to understand pages, which helps with heavily JavaScript-rendered UIs where the raw HTML is a mess of React garbage. It comes in open-source and cloud-hosted flavors, and the enterprise polish shows: API endpoints, workflow definitions, retry logic, and a UI dashboard.

Both are real tools doing real work in production. Neither is magic.

Browser-Use: Give It a Goal and Step Back

Browser-Use is stupidly easy to get started with. Install it, wire it to an LLM, and write about ten lines of Python.

agent.py
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main():
    # The agent needs two things: a plain-language task and an LLM to drive it.
    agent = Agent(
        task="Go to https://news.ycombinator.com, find the top story, and return its title and URL",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    # run() loops over observe/decide/act steps until the task is done or it gives up.
    result = await agent.run()
    print(result.final_result())


asyncio.run(main())

That’s it. No selectors. No waiting for elements. No waitForSelector('.hn-title-link:nth-child(3)') prayers. The agent spins up a Chromium instance, navigates, reads the DOM as a simplified text tree, decides what action to take next, and loops.
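To make that loop concrete, here is a toy sketch of the observe-decide-act cycle. It is not Browser-Use's actual implementation: `FakePage`, `fake_llm`, and the action dictionaries are hypothetical stand-ins so the example runs without a browser or an API key.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page (not a Browser-Use class)."""
    url: str = "about:blank"
    text: str = ""
    history: list = field(default_factory=list)

    def snapshot(self) -> str:
        # A real agent would serialize the DOM into a simplified text tree here.
        return f"{self.url}\n{self.text}"

    def apply(self, action: dict) -> None:
        # Only navigation is modelled; loading the page fills in some text.
        self.history.append(action)
        if action["type"] == "goto":
            self.url = action["url"]
            self.text = "Top story: Example headline"


def fake_llm(task: str, observation: str) -> dict:
    """Hypothetical decision function; a real agent calls an LLM here."""
    if "Top story" in observation:
        return {"type": "done", "result": observation.splitlines()[-1]}
    return {"type": "goto", "url": "https://news.ycombinator.com"}


def run_agent(task: str, page: FakePage, max_steps: int = 5):
    """Observe, decide, act; stop on 'done' or when the step budget runs out."""
    for _ in range(max_steps):
        action = fake_llm(task, page.snapshot())  # one "LLM call" per step
        if action["type"] == "done":
            return action["result"]
        page.apply(action)
    return None  # step budget exhausted without finishing


result = run_agent("find the top story", FakePage())
print(result)
```

The part worth noticing is the step budget: every pass through the loop is one model call, and a confused model only stops when the budget does.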

Running It With Ollama Instead

Here’s the thing — every agent step is an LLM call. With GPT-4o you’re burning real money per page interaction. If you’re running anything at scale or just want to experiment without a credit card bill, point it at a local model.

agent_local.py
import asyncio

from browser_use import Agent
from langchain_ollama import ChatOllama


async def main():
    agent = Agent(
        task="Search DuckDuckGo for 'self-hosted monitoring tools', click the first result, and summarize what the page is about",
        llm=ChatOllama(model="qwen2.5:32b"),
        # Cap how many browser actions the model may batch into a single step.
        max_actions_per_step=5,
    )
    # Cap the total number of steps so a confused model cannot loop forever.
    result = await agent.run(max_steps=15)
    print(result.final_result())


asyncio.run(main())

Reality check: smaller models struggle with complex multi-step tasks. Qwen2.5 32B handles most form-filling and navigation fine. Anything requiring nuanced reasoning about ambiguous UI — like “find the pricing plan that makes sense for a small team” — gets flaky below the 70B range. You’re not going to run this on a Raspberry Pi.

Self-Hosting Browser-Use in Docker

If you want this running as a service rather than a script you call manually:

docker-compose.yml
services:
  browser-agent:
    image: python:3.12-slim
    working_dir: /app
    volumes:
      - ./app:/app
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - OLLAMA_BASE_URL=http://ollama:11434
    command: >
      sh -c "pip install browser-use langchain-openai langchain-ollama playwright &&
      playwright install chromium &&
      playwright install-deps &&
      python agent.py"
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

One gotcha: playwright install-deps needs root and will pull a bunch of system libraries. Don’t try to slim this image too aggressively — you’ll spend three hours debugging why Chromium won’t launch.
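One more thing to watch: the Compose file sets OLLAMA_BASE_URL, but the agent_local.py shown earlier never reads it, so inside the container the client would presumably fall back to its default localhost endpoint. Here is a minimal sketch of wiring the two together, assuming `ChatOllama` accepts a `base_url` argument. The helper names `resolve_ollama_url` and `build_local_llm` are made up for illustration.

```python
import os


def resolve_ollama_url(default: str = "http://localhost:11434") -> str:
    """Return the Ollama endpoint, preferring the OLLAMA_BASE_URL env var."""
    return os.environ.get("OLLAMA_BASE_URL", default).rstrip("/")


def build_local_llm(model: str = "qwen2.5:32b"):
    """Build a ChatOllama client pointed at the resolved endpoint.

    Assumption: ChatOllama takes a base_url keyword argument.
    """
    # Imported lazily so resolve_ollama_url() works without langchain-ollama installed.
    from langchain_ollama import ChatOllama

    return ChatOllama(model=model, base_url=resolve_ollama_url())
```

In the agent script you would then pass `llm=build_local_llm()` instead of constructing `ChatOllama` inline.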

Skyvern: Enterprise Clothes, Open Source Core

Skyvern is more opinionated. You define tasks via a REST API or their workflow DSL, and it runs them. The open-source version is self-hostable and fully functional; the cloud version adds scale, a nice UI, and managed infrastructure.

The key architectural difference: Skyvern sends both the DOM and a screenshot to the LLM with each step. This costs more tokens but makes it dramatically better at JS-heavy single-page apps where the DOM is a tangle of data-reactid attributes and the actual visible structure only makes sense visually.

docker-compose.yml
services:
  skyvern:
    image: skyvern/skyvern:latest
    ports:
      - "8000:8000"
    environment:
      - DATABASE_STRING=postgresql://skyvern:skyvern@postgres:5432/skyvern
      - BROWSER_TYPE=chromium-headful
      - LLM_KEY=sk-your-openai-key
      - MODEL_NAME=gpt-4o
    depends_on:
      postgres:
        condition: service_healthy

  skyvern-frontend:
    image: skyvern/skyvern-frontend:latest
    ports:
      - "8080:8080"
    environment:
      - VITE_WSS_BASE_URL=ws://localhost:8000

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: skyvern
      POSTGRES_PASSWORD: skyvern
      POSTGRES_DB: skyvern
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U skyvern"]
      interval: 5s
      timeout: 5s
      retries: 5
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

Once it’s up, hit the API directly:

curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/signup",
    "navigation_goal": "Fill out the signup form with email [email protected], username testuser123, and submit it",
    "data_extraction_goal": "Extract the confirmation message after signup"
  }'

Skyvern stores task history in Postgres, so you can query what happened, retry failed steps, and build workflows that chain tasks together. That statefulness is where it pulls ahead for anything beyond one-shot scripts.
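If you would rather call that endpoint from Python than from curl, a rough standard-library equivalent might look like the sketch below. The endpoint path and the three JSON fields come straight from the curl example; the function names `build_task_request` and `submit` are made up here, and your instance may require extra headers (an API key, for example) that are not shown.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port for your deployment.
SKYVERN_TASKS_URL = "http://localhost:8000/api/v1/tasks"


def build_task_request(url: str, navigation_goal: str, data_extraction_goal: str,
                       endpoint: str = SKYVERN_TASKS_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a task."""
    payload = {
        "url": url,
        "navigation_goal": navigation_goal,
        "data_extraction_goal": data_extraction_goal,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_task_request(
    url="https://example.com/signup",
    navigation_goal="Fill out the signup form with username testuser123 and submit it",
    data_extraction_goal="Extract the confirmation message after signup",
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the payload easy to inspect and log before anything touches a real site.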

What These Tools Can Actually Do Today

Both tools handle these cases reasonably well:

Multi-page form workflows. Think: log into a portal, navigate to a specific section, fill a form with dynamic fields, submit, confirm. This is where agentic browsers genuinely shine. The kind of thing that took 200 lines of Playwright and broke every time the portal updated.

Scraping behind logins. You authenticate once and the agent navigates naturally. No cookie juggling, no session token management in your code.

Data extraction from structured pages. “Go to this product page and return the price, availability, and all listed specs” works well when the LLM can read the DOM cleanly.

Exploratory tasks. “Find the cheapest shipping option for this order” — where the path isn’t known in advance — is something traditional scripts simply can’t do. The agent figures it out.

Where They Break

Honest talk, because the demo videos are carefully curated:

CAPTCHAs. Both tools occasionally solve simple image CAPTCHAs and almost always fail on hCaptcha and reCAPTCHA v3. Browser-Use has a solve_captcha hook you can wire to an external service, but you’re paying twice: once for the LLM, once for 2captcha or AntiCaptcha.

Rate limits and bot detection. Agentic browsers move at human-ish speed, but Cloudflare Bot Management and similar tools look at more than timing. Headless Chromium has fingerprinting tells. You can mitigate this with playwright-stealth or undetected-chromium, but it’s an arms race.

Hallucinated selectors. The LLM will sometimes confidently describe clicking a button that doesn’t exist, or extracting data from a field that isn’t there. Browser-Use has retry logic and error recovery, but a sufficiently confused model will loop until max_steps and return garbage. Always validate outputs programmatically if they’re going into anything important.

Cost at scale. A simple three-step task might be 10-15 LLM calls with full DOM context in each. At GPT-4o pricing, running 1,000 tasks a day gets expensive fast. Skyvern’s vision mode is even heavier because it sends screenshots too. Budget accordingly before you architect anything around these tools.

Dynamic JavaScript apps. Browser-Use’s DOM parser struggles with SPAs that render content client-side after the initial load. It’ll read an empty shell and tell you there’s nothing there. Skyvern handles this better with vision, but even that isn’t bulletproof.
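To put rough numbers on the cost-at-scale point above, here is a back-of-the-envelope estimator. Every figure in the example call is an assumption: the token counts per call are guesses, and the prices ($2.50 per million input tokens, $10 per million output tokens) are placeholders rather than quotes from any price list. Plug in your provider's current rates.

```python
def estimate_daily_cost(tasks_per_day: int,
                        calls_per_task: int,
                        input_tokens_per_call: int,
                        output_tokens_per_call: int,
                        usd_per_million_input: float,
                        usd_per_million_output: float) -> float:
    """Estimate daily LLM spend in USD for an agentic browser workload."""
    calls = tasks_per_day * calls_per_task
    input_cost = calls * input_tokens_per_call / 1_000_000 * usd_per_million_input
    output_cost = calls * output_tokens_per_call / 1_000_000 * usd_per_million_output
    return round(input_cost + output_cost, 2)


# Hypothetical workload: 1,000 tasks/day, 12 LLM calls each,
# about 8,000 tokens of DOM context in and 200 tokens of action out per call.
daily = estimate_daily_cost(
    tasks_per_day=1_000,
    calls_per_task=12,
    input_tokens_per_call=8_000,
    output_tokens_per_call=200,
    usd_per_million_input=2.50,    # placeholder price
    usd_per_million_output=10.00,  # placeholder price
)
print(f"Estimated spend: ${daily:.2f} per day")
```

Input tokens dominate because the full page context is resent on every step, which is why screenshot-heavy modes push the bill up further.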

Agentic vs. Traditional: When to Use What

Here’s the honest tradeoff matrix:

| Scenario | Use This |
| --- | --- |
| Stable, well-defined form workflow | Traditional Playwright/Puppeteer |
| Task path unknown in advance | Agentic (Browser-Use/Skyvern) |
| Site changes frequently | Agentic (less brittle to UI changes) |
| Running 10,000 times/day | Traditional (cost and reliability) |
| One-off data extraction | Either; agentic is faster to set up |
| Authentication + navigation | Agentic wins |
| Performance-critical pipeline | Traditional |

Traditional Playwright is still the better tool when you know exactly what you need and the site is stable. It’s faster, cheaper, more debuggable, and completely deterministic. Agentic browsers earn their keep when the task is variable, the site changes often, or you want something a non-engineer can define.

The “brittleness vs flexibility” tradeoff is real. A Playwright script either works exactly right or throws an error you can fix. An agentic browser might kind-of-work and return plausible-sounding nonsense. You need validation layers you wouldn’t bother with in deterministic scripts.
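What a validation layer looks like depends on the task, but as a minimal sketch: suppose you ask the agent to return the Hacker News result as JSON with `title` and `url` keys (that output format is an assumption you would have to request in the task string). The function name `validate_story` is hypothetical.

```python
import json
from urllib.parse import urlparse


def validate_story(raw: str) -> dict:
    """Check agent output shaped like {"title": ..., "url": ...}.

    Raises ValueError on anything that does not look right, so plausible-sounding
    nonsense fails loudly instead of flowing downstream.
    """
    if not raw:
        raise ValueError("agent returned nothing")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    title = data.get("title")
    url = data.get("url")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("missing or empty title")
    if not isinstance(url, str):
        raise ValueError("missing url")

    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return {"title": title.strip(), "url": url}


story = validate_story('{"title": "Show HN: Thing", "url": "https://example.com/a"}')
print(story["title"], story["url"])
```

This only checks shape, not truth. For anything important you would also cross-check the extracted values against the page itself or a second source.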

The Ethics and ToS Conversation

Neither project will stop you from automating against sites that explicitly prohibit it in their ToS. That’s on you.

The practical angle: be a good citizen. Respect robots.txt. Add realistic delays between actions. Don’t hammer APIs. If you’re automating something because a site doesn’t have an API and you have a legitimate use case, fine. For anything murkier, get your legal team to weigh in.
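Checking robots.txt and pacing requests takes very little code. The sketch below uses Python's standard `urllib.robotparser`; the user agent string `my-agent`, the sample rules, and the helper names are made up for illustration. Against a live site you would call `set_url()` and `read()` on the parser instead of feeding it a string.

```python
import random
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(base_seconds: float = 2.0, jitter_seconds: float = 1.0) -> float:
    """Return a randomized pause length; pass the result to time.sleep()."""
    return base_seconds + random.uniform(0.0, jitter_seconds)


# Sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

robots = build_robots(SAMPLE_ROBOTS)
for url in ("https://example.com/public", "https://example.com/private/data"):
    allowed = robots.can_fetch("my-agent", url)
    print(url, "allowed" if allowed else "blocked")
```

Run the check before handing a URL to the agent rather than after, since the agent itself will not consult robots.txt on your behalf.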

Using these tools to scrape competitor pricing or automate social media engagement at scale is a different conversation entirely. The tools are neutral; the use case isn’t.

Picking Your Starting Point

If you’re just experimenting or building something internal: Browser-Use + Ollama is the fast path. Low cost, quick iteration, no external dependencies. Pull down the library, point it at your local model, write 10 lines of Python.

If you need something production-grade with state management, retry logic, and an API surface your team can call: Skyvern self-hosted is the move. The additional setup overhead pays off once you’re running real workflows.

Either way, your 2 AM self will thank you for not spending a weekend writing CSS selectors when you could just tell the computer what you want.

Full example: Clone the working Compose files and agent scripts at github.com/KingPin/sumguy-examples/llm/agentic-browsers-browser-use-skyvern

