
Function Calling in Local LLMs

By SumGuy 12 min read

The Leap from Chatbot to Agent is Just Structured Tool Use

Here’s the thing: the jump from “GPT that answers questions” to “AI agent that actually gets stuff done” isn’t some magical leap in model intelligence. It’s a much simpler party trick — structured tool calling. The model stops spitting out prose and starts emitting JSON that says “hey, I need to call the weather API with these parameters” or “run this bash command” or “look this up in the database.”

For a long time, this was closed off to local models. You had to ship your request to OpenAI, Anthropic, or whichever hosted provider was in fashion, wait for them to handle your tools, and hope they didn’t hallucinate the function names. But in 2026, running a capable function-calling model locally is absolutely doable, and honestly more reliable than you’d expect. Models like Llama 3.3, Qwen 2.5, and Hermes 3 can do this. Ollama makes it smooth. llama.cpp gives you the fine-grained control. And the patterns? They’re not even that weird once you understand what’s actually happening.

This article walks through what function calling actually is, which models do it well, the tooling (Ollama, llama.cpp, grammar-constrained generation), and a real working example: a local agent that queries a weather API and calls a calculator. We’ll also dig into what breaks, why, and how to fix it.

What Actually Happens When a Model “Calls a Function”

Function calling isn’t the model reaching into your filesystem and executing code. What’s happening is this:

  1. You give the model a list of available tools in a structured format (JSON schema, usually).
  2. You ask the model a question that requires using one or more of those tools.
  3. The model, trained to recognize the pattern, emits a structured response instead of freeform text. That response says “use tool X with arguments Y.”
  4. Your code parses that structured response, calls the actual tool, and feeds the result back to the model.
  5. The model, now armed with the tool output, answers the original question.

The model isn’t “calling” anything — your orchestration layer is. The model is just predicting what should be called next, and it’s doing so in a format you can parse reliably.

This is why the OpenAI tools format became so influential: it standardizes how you describe tools and how the model responds. But local models have options now, and some are arguably better for constrained generation.
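The five steps above can be sketched without any real model at all. The snippet below is a minimal illustration, not a real inference setup: `fake_model`, `TOOLS`, and `run_loop` are hypothetical names, and the "model" is a scripted stand-in that asks for one tool call and then answers.

```python
import json

# Hypothetical tool registry; the name and function are illustrative only.
TOOLS = {"add": lambda a, b: a + b}

def fake_model(messages):
    """Scripted stand-in for an LLM: request one tool call, then answer."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"The result is {last['content']}."}
    return {"tool_call": {"name": "add", "arguments": {"a": 2, "b": 3}}}

def run_loop(question, model=fake_model, max_steps=3):
    messages = [{"role": "user", "content": question}]   # steps 1-2: tools + question
    for _ in range(max_steps):
        reply = model(messages)                          # step 3: model predicts a call
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]                      # step 5: final answer
        # Step 4: the orchestration layer, not the model, executes the tool.
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None

answer = run_loop("What is 2 + 3?")
print(answer)
```

Swap `fake_model` for a real chat call and the control flow stays the same, which is the whole point: the model only ever produces a description of the call.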

Function Calling Formats: OpenAI, Ollama, and Native Llama 3.1+

OpenAI Tools Format

The OpenAI format is the lingua franca. You define tools like this:

{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
}

The model sees this, understands the schema, and responds with something like:

{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Seattle, WA\", \"unit\": \"fahrenheit\"}"
}
}

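Clean. Parseable. Standard across most LLM APIs. Ollama’s API mirrors this almost exactly when you set up tools, with one difference worth knowing: its native API returns arguments as an already-parsed object rather than a JSON-encoded string.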
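One detail in that response is easy to miss: in the OpenAI format, `arguments` is a JSON document encoded as a string, so it needs a second decode. A small sketch of normalizing a call into a name and a dict follows; `parse_call` is a hypothetical helper, and it also accepts arguments that arrive as a dict already.

```python
import json

raw_call = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": "{\"location\": \"Seattle, WA\", \"unit\": \"fahrenheit\"}",
    },
}

def parse_call(call):
    """Return (tool_name, args_dict) from an OpenAI-style tool call."""
    fn = call["function"]
    args = fn["arguments"]
    if isinstance(args, str):
        # OpenAI-style: arguments are a JSON string, so decode them.
        args = json.loads(args)
    return fn["name"], args

name, args = parse_call(raw_call)
print(name, args)
```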

Ollama’s Tool Support

Ollama supports the OpenAI tools format natively (added in mid-2024). You pass tools as part of the request, and Ollama handles routing the model’s output through the tool-calling path. The upside: it’s familiar. The downside: Ollama’s implementation is thin. It relies entirely on the model’s training to follow the schema, with no constraint enforcement on the output.

Llama 3.1+ Native Format

Llama 3.1 introduced an alternative: a built-in tool-use format baked into the model’s chat template and training. Tool calls are delimited with dedicated special tokens (such as <|python_tag|>) instead of being left as free-form text. This is theoretically more robust because the model was trained on that exact layout, though nothing enforces it at sampling time. In practice, most local inference engines (Ollama, llama.cpp) still convert this back to JSON for easy consumption. You usually don’t notice the difference.

llama.cpp’s Grammar-Constrained Generation (GBNF)

Here’s where it gets interesting: llama.cpp supports GBNF (EBNF-style grammars) to force the model to emit JSON that matches your schema. No hallucinated argument names. No malformed JSON. The model’s sampling is constrained to only tokens that are valid according to your grammar.

Terminal window
llama-cli -m model.gguf -p "What's the weather?" \
  --grammar-file schema.gbnf

This is powerful for reliability, especially when you’re running a smaller or less-trained model locally.
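To make "constrained sampling" concrete, here is a toy sketch of the idea: before picking the next token, every token the grammar does not allow is masked out. This is an illustration only, not llama.cpp’s implementation; the token scores and the `constrained_pick` helper are invented for the example, and it uses greedy selection for simplicity.

```python
import math

def constrained_pick(logits, allowed):
    """Greedy pick among tokens the grammar currently allows."""
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    return max(masked, key=masked.get)

# Hypothetical next-token scores: the model "wants" to start with prose.
logits = {"Sure": 3.2, "The": 2.7, "{": 1.1}

unconstrained = max(logits, key=logits.get)
constrained = constrained_pick(logits, allowed={"{"})
print(unconstrained, constrained)
```

Even though `{` has the lowest score here, it is the only token a JSON-object grammar permits at the start, so it wins.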

Models That Actually Do This Well in 2026

Not every model is trained for function calling. You need one where the developer intentionally included tool-use examples in the training data.

Llama 3.3 (70B): The gold standard. Excellent function calling, native tool-use tokens, solid reasoning. If you’re running local models for serious work, start here. Llama 3.3 ships only at 70B; for an 8B option in the same family, use Llama 3.1 8B.

Qwen 2.5 (72B, 7B): Strong tool use, competitive with Llama 3.3, slightly more forgiving with schema variations.

Hermes 3 (8B, 70B, 405B): The 405B flagship is overkill for most local setups, but if you have the VRAM, it’s phenomenally reliable at tool calling. The 8B and 70B variants are far more realistic to run locally.

Mistral Nemo (12B): Surprisingly capable for its size. Not as robust as Llama 3.3, but gets the job done in tight memory budgets.

Functionary (7B, based on Mistral): Purpose-built for tool calling. If you’re serious about function-calling agents, this is worth trying.

Avoid: Older models (Llama 2, Mistral 7B base, anything before 2024). They weren’t trained on tool examples and will hallucinate.

The Practical Setup: Ollama + Python

Let’s build something real. A local agent that:

  1. Takes a question like “What’s the weather in Portland and add 5 to 32?”
  2. Calls a mock weather API.
  3. Calls a calculator.
  4. Synthesizes the answer.

First, start Ollama with a capable model:

Terminal window
ollama pull llama3.3:70b
ollama serve

Then, Python code to orchestrate the function calling:

agent.py
import json

from ollama import Client

# Initialize Ollama client
client = Client(host="http://localhost:11434")

# Model tag; it must be a model that supports tool calling
MODEL = "llama3.3:70b"

# Define tools schema (OpenAI format)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a specified location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name or city, state"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform basic arithmetic operations",
            "parameters": {
                "type": "object",
                "properties": {
                    "operation": {
                        "type": "string",
                        "enum": ["add", "subtract", "multiply", "divide"],
                        "description": "The arithmetic operation"
                    },
                    "a": {
                        "type": "number",
                        "description": "First number"
                    },
                    "b": {
                        "type": "number",
                        "description": "Second number"
                    }
                },
                "required": ["operation", "a", "b"]
            }
        }
    }
]


# Fake tool implementations
def get_weather(location, unit="fahrenheit"):
    """Mock weather API (temperatures are canned and not converted by unit)"""
    weather_db = {
        "seattle, wa": {"temp": 58, "condition": "rainy"},
        "portland, or": {"temp": 62, "condition": "cloudy"},
        "san francisco, ca": {"temp": 72, "condition": "sunny"}
    }
    data = weather_db.get(location.lower(), {"temp": 70, "condition": "unknown"})
    return f"The weather in {location} is {data['condition']}, {data['temp']}°{unit[0].upper()}."


def calculator(operation, a, b):
    """Simple calculator"""
    if operation == "divide" and b == 0:
        return "Error: division by zero"
    # Lambdas so only the requested operation is evaluated
    ops = {
        "add": lambda: a + b,
        "subtract": lambda: a - b,
        "multiply": lambda: a * b,
        "divide": lambda: a / b
    }
    if operation not in ops:
        return f"Error: unknown operation '{operation}'"
    return f"{a} {operation} {b} = {ops[operation]()}"


# Tool execution dispatcher
def execute_tool(tool_name, args):
    """Call the actual tool based on name and args"""
    handlers = {"get_weather": get_weather, "calculator": calculator}
    if tool_name not in handlers:
        return f"Unknown tool: {tool_name}"
    try:
        return handlers[tool_name](**args)
    except TypeError as exc:
        # Hallucinated or missing argument names end up here;
        # report the problem back to the model instead of crashing
        return f"Error: bad arguments for {tool_name}: {exc}"


# Main agent loop
def run_agent(user_query, max_iterations=5):
    """Run the agent with function calling"""
    messages = [{"role": "user", "content": user_query}]
    print(f"\n[User] {user_query}")

    for _ in range(max_iterations):
        # Call the model with tools
        response = client.chat(
            model=MODEL,
            messages=messages,
            tools=tools,
            stream=False
        )
        assistant_message = response["message"]

        # Check if model wants to call a tool
        if not assistant_message.get("tool_calls"):
            # No tool calls, model gave a direct answer
            print(f"[Agent] {assistant_message['content']}")
            return assistant_message["content"]

        # Keep the assistant turn (with its tool calls) in the history
        messages.append(assistant_message)

        # Process every tool call in this turn, including parallel ones
        for tool_call in assistant_message["tool_calls"]:
            tool_name = tool_call["function"]["name"]
            tool_args = tool_call["function"]["arguments"]
            # Parse arguments (handle both string and dict formats)
            if isinstance(tool_args, str):
                tool_args = json.loads(tool_args)
            print(f"[Tool Call] {tool_name}({tool_args})")

            # Execute the tool
            tool_result = execute_tool(tool_name, tool_args)
            print(f"[Tool Result] {tool_result}")

            # Feed the result back to the model as a tool message
            messages.append({
                "role": "tool",
                "tool_name": tool_name,
                "content": tool_result
            })

    return "Max iterations reached without answer"


# Test it
if __name__ == "__main__":
    result = run_agent("What's the weather in Portland, OR? Then add 5 to that temperature.")
    print(f"\n[Final Answer] {result}")

Run it:

Terminal window
$ python agent.py
[User] What's the weather in Portland, OR? Then add 5 to that temperature.
[Tool Call] get_weather({'location': 'Portland, OR', 'unit': 'fahrenheit'})
[Tool Result] The weather in Portland, OR is cloudy, 62°F.
[Tool Call] calculator({'operation': 'add', 'a': 62, 'b': 5})
[Tool Result] 62 add 5 = 67
[Agent] The weather in Portland, OR is cloudy with a temperature of 62°F. Adding 5 to that would be 67°F.

Works. Actually works.

Why Models Mess Up: Hallucinated Args, Infinite Loops, Parallel Calls

Function calling isn’t magic. Models still hallucinate. Here’s what breaks:

Hallucinated argument names. The model sees your schema and invents a parameter that doesn’t exist. You ask for weather with location but the model emits {"city": "Portland"} instead. Worse at the 8B scale, better with Llama 3.3 70B.

Malformed JSON. The model emits close-enough JSON that regular parsers choke on. Missing quotes, stray commas, incomplete structures. Grammar-constrained generation in llama.cpp eliminates this entirely.

Infinite tool loops. The model calls a tool, gets the result, and decides to call the same tool again in a loop, never reaching a conclusion. Usually happens when the system prompt is unclear about when to stop using tools.

Parallel tool calls. Some models (especially those trained on concurrent tool execution, as the GPT-4 Turbo generation was) emit multiple tool calls in one response. Many local models don’t, but if yours does and your orchestration only handles one call per turn, the remaining calls never get a result and the conversation stalls.

Ignoring tool results. The model calls a tool, you feed it the result, and the model acts like the result doesn’t exist. Often a sign of a weak model or a system prompt that didn’t clearly explain the loop.
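The infinite-loop failure in particular can be caught mechanically: if the model asks for the exact same tool with the exact same arguments again, that is almost never progress. Here is a minimal sketch of such a guard. `LoopGuard` is a hypothetical helper, and treating any repeat as a loop is an assumption that won’t suit tools whose output changes over time.

```python
import json

class LoopGuard:
    """Track (tool, arguments) pairs and flag repeats beyond a limit."""

    def __init__(self, max_repeats=1):
        self.max_repeats = max_repeats
        self.seen = {}

    def allow(self, name, args):
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
        key = (name, json.dumps(args, sort_keys=True))
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.max_repeats

guard = LoopGuard()
first = guard.allow("get_weather", {"location": "Portland, OR"})
second = guard.allow("get_weather", {"location": "Portland, OR"})
other = guard.allow("get_weather", {"location": "Seattle, WA"})
print(first, second, other)
```

When `allow` returns False, one option is to skip the call and send the model a tool message saying the result was already provided.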

Mitigations:

  1. Use a model explicitly trained for tool use (Llama 3.3, Qwen 2.5, Hermes 3).
  2. Use grammar constraints (llama.cpp + GBNF) to force valid JSON.
  3. Write clear system prompts that specify exactly when to stop using tools:
You are a helpful assistant. You have access to the following tools.
Call a tool only when necessary. Once you have all the information needed
to answer the user's question, provide the final answer directly. Do not
call a tool more than once for the same purpose. Do not call a tool and
then immediately call it again.
  4. Validate arguments before calling tools. If the model emits a parameter that doesn’t exist, reject it politely and ask the model to try again.
  5. Set iteration limits to prevent infinite loops (like the max_iterations=5 in the code above).
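Argument validation doesn’t need a full JSON Schema library to be useful. The sketch below checks only three things (unknown keys, missing required keys, and enum membership) against the `parameters` block of a tool definition. `validate_args` is a hypothetical helper, and it deliberately ignores types and nested objects.

```python
def validate_args(schema, args):
    """Return a list of problems; an empty list means the call looks valid."""
    props = schema.get("properties", {})
    problems = []
    for key in args:
        if key not in props:
            problems.append(f"unknown argument '{key}'; expected one of {sorted(props)}")
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument '{key}'")
    for key, value in args.items():
        allowed = props.get(key, {}).get("enum")
        if allowed and value not in allowed:
            problems.append(f"'{key}' must be one of {allowed}, got {value!r}")
    return problems

weather_params = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

bad = validate_args(weather_params, {"city": "Portland"})
good = validate_args(weather_params, {"location": "Portland, OR", "unit": "celsius"})
print(bad)
print(good)
```

Joining the problems into a single string and returning it as the tool result gives the model a concrete error to correct on its next turn.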

Grammar-Constrained Generation: llama.cpp’s GBNF Superpower

If you’re serious about reliability, use llama.cpp directly with GBNF. You write a grammar that describes the exact JSON structure your tools accept, and the sampler refuses to emit anything that violates it.

Example grammar for a simple tool call:

root ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"args\"" ws ":" ws object ws "}"
object ::= "{" (pair ("," pair)*)? "}"
pair ::= string ws ":" ws value
value ::= string | number | boolean
string ::= "\"" ([^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]))* "\""
number ::= "-"? ("0" | [1-9] [0-9]*) ("." [0-9]+)? ([eE] [-+]? [0-9]+)?
boolean ::= "true" | "false"
ws ::= ([ \t\n] ws)?

Then:

Terminal window
llama-cli -m model.gguf \
  -p "What's the weather in Portland?" \
  --grammar-file grammar.gbnf

The model cannot emit invalid JSON. It’s constrained at the token level. This shifts reliability dramatically, especially for smaller models.
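A grammar guarantees shape, not meaning: with a generic string rule for the name field, the model can still produce a well-formed call to a tool that doesn’t exist. One way to close that gap is to generate a rule that only admits your real tool names. The sketch below builds such a rule as text; `gbnf_literal` and `tool_name_rule` are hypothetical helpers, and the rule name `toolname` is an arbitrary choice you would reference from your root rule in place of the generic string.

```python
def gbnf_literal(text):
    """Quote text as a GBNF string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def tool_name_rule(names, rule="toolname"):
    # Each alternative matches the JSON string including its surrounding quotes.
    alternatives = [gbnf_literal('"' + name + '"') for name in names]
    return f"{rule} ::= " + " | ".join(alternatives)

rule = tool_name_rule(["get_weather", "calculator"])
print(rule)
# toolname ::= "\"get_weather\"" | "\"calculator\""
```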

Function Calling vs. MCP: Different Layers

There’s confusion here, so let’s clear it up.

Function calling is a protocol: your orchestration layer describes tools as JSON schemas, the model emits structured tool calls, and you execute them. It’s what we just built above.

MCP (Model Context Protocol) is a standardized tool registry and communication spec. Instead of hardcoding tool definitions in your prompt, MCP lets you connect to a standard tool server that advertises what it can do. Your client application, not the model itself, talks to the MCP server, which handles tool management and execution.

Function calling is the low-level protocol. MCP standardizes the tool layer on top. You can use function calling without MCP (like we did) and you can use MCP with function calling (MCP servers expose their tools as function schemas that feed into your calling loop).

For local models, you usually don’t need MCP unless you’re building something complex. Function calling directly is simpler and faster.
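If you do end up bridging the two, the glue is small. An MCP server describes each tool with a name, a description, and an `inputSchema` (itself a JSON Schema), which maps almost one-to-one onto the function schema used earlier. A minimal sketch of that mapping follows, with a hand-written tool description standing in for what a server would return; no MCP client library is involved, and `mcp_tool_to_function` is a hypothetical helper.

```python
def mcp_tool_to_function(tool):
    """Convert one MCP-style tool description into an OpenAI-style tool schema."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # MCP calls the JSON Schema "inputSchema"; function calling calls it "parameters".
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }

# Hand-written example of a tool description, for illustration only.
mcp_tool = {
    "name": "get_weather",
    "description": "Get the current weather in a location",
    "inputSchema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

converted = mcp_tool_to_function(mcp_tool)
print(converted["function"]["name"], converted["function"]["parameters"]["required"])
```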

Putting It Together: A Real Local Agent

Here’s the full flow for something serious:

  1. Pick a model: Llama 3.3 70B if you have the VRAM, Qwen 2.5 otherwise.
  2. Define your tools: JSON schemas that are clear and specific.
  3. Run inference: Ollama for simplicity, llama.cpp for fine-grained control + GBNF.
  4. Orchestrate: Python with a loop that handles tool calls and feeds results back.
  5. Add error handling: Validate arguments, set iteration limits, catch hallucinations.
  6. Test edge cases: Ask the model to do weird stuff and watch where it breaks.

The tooling is there. The models are capable. The patterns are solid. In 2026, running a function-calling agent locally isn’t aspirational — it’s standard practice for anyone serious about AI at the edge. Your laptop can be the orchestrator. Your network can stay closed. And your models can actually do things instead of just chatting about them.

That’s the shift. Build like it.

