Agentic Browsers: Browser-Use & Skyvern Reviewed

Your Selenium Scripts Are Showing Their Age

Remember writing Playwright tests in 2021? You’d spend 40 minutes hunting the right CSS selector, another 20 minutes making it not break when the marketing team changed a button label, and then it would silently die six months later when the site got redesigned. The automation worked, kind of, but you were basically writing a brittle map to a city that kept renaming its streets.

Agentic browsers flip this on its head. Instead of page.click('#submit-btn-v2-final-FINAL'), you say “fill out the contact form with these details and submit it.” An LLM reads the DOM, figures out what’s what, and drives the browser like a human who actually understands the page.

That’s the pitch, anyway. Let’s see how it holds up.

What We’re Actually Talking About

Two projects are worth your time here: Browser-Use and Skyvern. Both let you describe a goal in plain language and have an LLM execute it against a real browser. They differ in approach, maturity, and who they’re aimed at.

Browser-Use is a Python library backed by Playwright. You give it a task string, hook it up to an LLM (OpenAI, Anthropic, or a local model via Ollama), and it builds an agent that reads the DOM as structured text, takes actions, and loops until the goal is met or it gives up.

Skyvern takes a more vision-forward approach: it combines DOM parsing and screenshots to understand pages, which helps with heavily JavaScript-rendered UIs where the raw HTML is a mess of React garbage. It comes in open-source and cloud-hosted flavors, and the enterprise polish shows: API endpoints, workflow definitions, retry logic, and a UI dashboard.

Both are real tools doing real work in production. Neither is magic.

Browser-Use: Give It a Goal and Step Back

Browser-Use is stupidly easy to get started with. Install it, wire it to an LLM, write about ten lines of Python.

import asyncio
from browser_use import Agent, ChatOpenAI

async def main():
    agent = Agent(
        task="Go to https://news.ycombinator.com, find the top story, and return its title and URL",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result.final_result())

asyncio.run(main())

That’s it. No selectors. No waiting for elements. No waitForSelector('.hn-title-link:nth-child(3)') prayers. The agent spins up a Chromium instance, navigates, reads the DOM as a simplified text tree, decides what action to take next, and loops.
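To make that loop concrete, here is a toy sketch of the observe, decide, act cycle. It is not Browser-Use’s implementation: `FakePage`, `decide`, and `run_agent` are hypothetical names, and a hard-coded rule stands in for the LLM call so the example runs without a browser or an API key.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Stand-in for a browser page: a URL plus a simplified text view of the DOM."""
    url: str = "about:blank"
    history: list = field(default_factory=list)

    def observe(self) -> str:
        # A real agent would serialize the page's interactive elements here.
        if self.url == "about:blank":
            return "[empty page]"
        return f"[0] <a>Top story</a> on {self.url}"

    def act(self, action: dict) -> None:
        self.history.append(action["type"])
        if action["type"] == "navigate":
            self.url = action["url"]


def decide(task: str, observation: str) -> dict:
    """Hypothetical policy; in a real agent this is the LLM call."""
    if observation == "[empty page]":
        return {"type": "navigate", "url": "https://news.ycombinator.com"}
    return {"type": "done", "result": observation}


def run_agent(task: str, page: FakePage, max_steps: int = 5):
    """Loop observe -> decide -> act until the policy says done or steps run out."""
    for _ in range(max_steps):
        action = decide(task, page.observe())
        if action["type"] == "done":
            return action["result"]
        page.act(action)
    return None  # gave up: step budget exhausted


page = FakePage()
result = run_agent("find the top story", page)
print(result)
print(page.history)
```

The step budget is the important part of the design: without it, a confused policy never terminates. With it, the worst case is a `None` you can detect and handle.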

Running It With Ollama Instead

Every agent step is an LLM call. With GPT-4o you’re burning real money per page interaction. If you’re running anything at scale or just want to experiment without a credit card bill, point it at a local model.

import asyncio
from browser_use import Agent
from browser_use.llm import ChatOllama

async def main():
    agent = Agent(
        task="Search DuckDuckGo for 'self-hosted monitoring tools', click the first result, and summarize what the page is about",
        llm=ChatOllama(model="qwen3:32b"),
        max_actions_per_step=5,
    )
    result = await agent.run(max_steps=15)
    print(result.final_result())

asyncio.run(main())

Reality check: smaller models struggle with complex multi-step tasks. Qwen3 32B handles most form-filling and navigation fine. Anything requiring nuanced reasoning about ambiguous UI, like “find the pricing plan that makes sense for a small team”, gets flaky below the 70B range. You’re not going to run this on a Raspberry Pi.

Self-Hosting Browser-Use in Docker

If you want this running as a service rather than a script you call manually:

services:
  browser-agent:
    image: python:3.12-slim
    working_dir: /app
    volumes:
      - ./app:/app
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - OLLAMA_BASE_URL=http://ollama:11434
    command: >
      sh -c "pip install browser-use playwright &&
             playwright install chromium &&
             playwright install-deps &&
             python agent.py"
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

One gotcha: playwright install-deps needs root and will pull a bunch of system libraries. Don’t try to slim this image too aggressively: you’ll spend three hours debugging why Chromium won’t launch.

Skyvern: Enterprise Clothes, Open Source Core

Skyvern is more opinionated. You define tasks via a REST API or their workflow DSL, and it runs them. The open-source version is self-hostable and fully functional; the cloud version adds scale, a nice UI, and managed infrastructure.

The key architectural difference: Skyvern sends both the DOM and a screenshot to the LLM with each step. This costs more tokens but makes it dramatically better at JS-heavy single-page apps where the DOM is a tangle of data-reactid attributes and the actual visible structure only makes sense visually.

services:
  skyvern:
    image: skyvern/skyvern:latest
    ports:
      - "8000:8000"
    environment:
      - DATABASE_STRING=postgresql://skyvern:skyvern@postgres:5432/skyvern
      - BROWSER_TYPE=chromium-headful
      - ENABLE_OPENAI=true
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      # LLM_KEY names a model configuration, not the API key itself
      - LLM_KEY=OPENAI_GPT4O
    depends_on:
      postgres:
        condition: service_healthy

  skyvern-frontend:
    image: skyvern/skyvern-frontend:latest
    ports:
      - "8080:8080"
    environment:
      - VITE_WSS_BASE_URL=ws://localhost:8000

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: skyvern
      POSTGRES_PASSWORD: skyvern
      POSTGRES_DB: skyvern
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U skyvern"]
      interval: 5s
      timeout: 5s
      retries: 5
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

Once it’s up, hit the API directly:

curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/signup",
    "navigation_goal": "Fill out the signup form with email [email protected], username testuser123, and submit it",
    "data_extraction_goal": "Extract the confirmation message after signup"
  }'

Skyvern stores task history in Postgres, so you can query what happened, retry failed steps, and build workflows that chain tasks together. That statefulness is where it pulls ahead for anything beyond one-shot scripts.
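For callers that live in Python rather than a shell, the same request can be assembled with the standard library. This is a sketch that mirrors the curl call above: `build_task_request` is a hypothetical helper, and the base URL assumes the Compose setup shown earlier. The request is only built here, not sent.

```python
import json
import urllib.request


def build_task_request(base_url: str, url: str, navigation_goal: str,
                       data_extraction_goal: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the task endpoint used in the curl example."""
    payload = {
        "url": url,
        "navigation_goal": navigation_goal,
        "data_extraction_goal": data_extraction_goal,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/tasks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_task_request(
    "http://localhost:8000",
    "https://example.com/signup",
    "Fill out the signup form with username testuser123 and submit it",
    "Extract the confirmation message after signup",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=30)
```

Keeping construction separate from sending makes the payload easy to unit test before anything touches a live instance.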

What These Tools Can Actually Do Today

Both tools handle these cases reasonably well:

Multi-page form workflows. Think: log into a portal, navigate to a specific section, fill a form with dynamic fields, submit, confirm. This is where agentic browsers genuinely shine. It’s the kind of thing that took 200 lines of Playwright and broke every time the portal updated.

Scraping behind logins. You authenticate once and the agent navigates naturally. No cookie juggling, no session token management in your code.

Data extraction from structured pages. “Go to this product page and return the price, availability, and all listed specs” works well when the LLM can read the DOM cleanly.

Exploratory tasks. “Find the cheapest shipping option for this order”, where the path isn’t known in advance, is something traditional scripts simply can’t do. The agent figures it out.

Where They Break

Honest talk, because the demo videos are carefully curated:

CAPTCHAs. Both tools occasionally solve simple image CAPTCHAs and almost always fail on hCaptcha and reCAPTCHA v3. Browser-Use has a solve_captcha hook you can wire to an external service, but you’re paying twice: once for the LLM, once for 2captcha or AntiCaptcha.

Rate limits and bot detection. Agentic browsers move at human-ish speed, but Cloudflare Bot Management and similar tools look at more than timing. Headless Chromium has fingerprinting tells. You can mitigate this with playwright-stealth or undetected-chromium, but it’s an arms race.

Hallucinated selectors. The LLM will sometimes confidently describe clicking a button that doesn’t exist, or extracting data from a field that isn’t there. Browser-Use has retry logic and error recovery, but a sufficiently confused model will loop until max_steps and return garbage. Always validate outputs programmatically if they’re going into anything important.

Cost at scale. A simple three-step task might be 10-15 LLM calls with full DOM context in each. At GPT-4o pricing, running 1,000 tasks a day gets expensive fast. Skyvern’s vision mode is even heavier because it sends screenshots too. Budget accordingly before you architect anything around these tools.

Dynamic JavaScript apps. Browser-Use’s DOM parser struggles with SPAs that render content client-side after the initial load. It’ll read an empty shell and tell you there’s nothing there. Skyvern handles this better with vision, but even that isn’t bulletproof.
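The cost-at-scale point above is easy to check with back-of-envelope arithmetic. The sketch below multiplies tasks, calls per task, and tokens per call. Every number in it, including the per-token prices, is an illustrative assumption rather than a quoted rate, so plug in your provider’s current pricing.

```python
def daily_cost_usd(tasks_per_day: int, calls_per_task: int,
                   input_tokens_per_call: int, output_tokens_per_call: int,
                   usd_per_million_input: float, usd_per_million_output: float) -> float:
    """Estimated daily LLM spend for an agent workload."""
    calls = tasks_per_day * calls_per_task
    input_cost = calls * input_tokens_per_call * usd_per_million_input / 1_000_000
    output_cost = calls * output_tokens_per_call * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Assumed workload: 1,000 tasks/day, 12 calls each, ~8k tokens of DOM context per call.
# Assumed prices: $2.50 per million input tokens, $10 per million output tokens.
estimate = daily_cost_usd(1_000, 12, 8_000, 200, 2.50, 10.00)
print(f"${estimate:,.2f} per day")
```

Under these assumptions the estimate comes to $264 per day, and $240 of that is input tokens. The DOM context sent with every call dominates the bill, which is why trimming what the model sees per step matters more than shortening its answers.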

Agentic vs. Traditional: When to Use What

Here’s the honest tradeoff matrix:

Scenario	Use This
Stable, well-defined form workflow	Traditional Playwright/Puppeteer
Task path unknown in advance	Agentic (Browser-Use/Skyvern)
Site changes frequently	Agentic (less brittle to UI changes)
Running 10,000 times/day	Traditional (cost and reliability)
One-off data extraction	Either; agentic is faster to set up
Authentication + navigation	Agentic wins
Performance-critical pipeline	Traditional

Traditional Playwright is still the better tool when you know exactly what you need and the site is stable. It’s faster, cheaper, more debuggable, and completely deterministic. Agentic browsers earn their keep when the task is variable, the site changes often, or you want something a non-engineer can define.

The “brittleness vs flexibility” tradeoff is real. A Playwright script either works exactly right or throws an error you can fix. An agentic browser might kind-of-work and return plausible-sounding nonsense. You need validation layers you wouldn’t bother with in deterministic scripts.
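A validation layer does not have to be elaborate. As a minimal sketch, assume the agent was asked to return a title and URL as JSON. The `validate_story` helper and the expected keys are hypothetical, not part of either tool. The idea is to reject anything that fails basic structural checks before it reaches downstream code.

```python
import json
from urllib.parse import urlparse


def validate_story(raw: str) -> dict:
    """Parse agent output; raise ValueError unless it has a usable title and http(s) URL."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title = data.get("title")
    url = data.get("url")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("missing or empty title")
    if not isinstance(url, str):
        raise ValueError("missing url")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return {"title": title.strip(), "url": url}


good = validate_story('{"title": "Show HN: Example", "url": "https://example.com/post"}')
print(good)

try:
    validate_story('{"title": "Looks plausible", "url": "the first link"}')
except ValueError as err:
    print("rejected:", err)
```

Structural checks like these catch malformed output, not output that is well-formed but wrong. For anything important, compare against a second source or spot-check a sample by hand.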

The Ethics and ToS Conversation

Neither project will stop you from automating against sites that explicitly prohibit it in their ToS. That’s on you.

The practical angle: be a good citizen. Respect robots.txt. Add realistic delays between actions. Don’t hammer APIs. If you’re automating something because a site doesn’t have an API and you have a legitimate use case, fine. For anything grayer than that, let your legal team weigh in.
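Checking robots.txt before handing a URL to an agent takes a few lines with Python’s standard `urllib.robotparser`. The sketch below parses an inline robots.txt so it runs offline. The rules, the `my-agent` user agent string, and the `allowed` helper are made up for illustration. Against a real site you would call `set_url()` and `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; a real run would fetch the site's own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "my-agent") -> bool:
    """True if the parsed robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/pricing"))
print(allowed("https://example.com/admin/users"))
print(parser.crawl_delay("my-agent"))
```

A gate like this only covers the starting URL. An agent that navigates on its own can wander into disallowed paths, so the check is worth repeating wherever your code can observe the URLs it visits.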

Using these tools to scrape competitor pricing or automate social media engagement at scale is a different conversation entirely. The tools are neutral; the use case isn’t.

Picking Your Starting Point

If you’re just experimenting or building something internal: Browser-Use + Ollama is the fast path. Low cost, quick iteration, no external dependencies. Pull down the library, point it at your local model, write 10 lines of Python.

If you need something production-grade with state management, retry logic, and an API surface your team can call: Skyvern self-hosted is the move. The additional setup overhead pays off once you’re running real workflows.

Either way, your 2 AM self will thank you for not spending a weekend writing CSS selectors when you could just tell the computer what you want.

Full example: Clone the working Compose files and agent scripts at github.com/KingPin/sumguy-examples/llm/agentic-browsers-browser-use-skyvern
