A clean local agent SDK for Ollama, vLLM, and OpenAI-compatible servers.
v0.3.1 · streaming · auto-tune · trace · cli · mit

FreeAgent SDK

A small, honest Python SDK for AI agents on local models. Streaming. Multi-turn by default. Markdown skills and memory. Auto-tuned per model. Single dependency. Actually faster than writing your own loop (HTTP connection reuse).

Not magic. The "framework catches what models can't" thesis is disproven for modern models — 0/40 adversarial cases triggered a real guardrail rescue. What it IS: the plumbing you'd write yourself, but done once, tested, and measured.

281 unit tests · 1 dependency (httpx) · 3 providers · 4 models benchmarked · ~300 token overhead

What's In The Box

Streaming

Token-by-token streaming with semantic events: TokenEvent, ToolCallEvent, ToolResultEvent, RunCompleteEvent. Works with tool-using agents, not just chat.

Auto-Tune

Queries Ollama /api/show on init. Small models get stripped defaults. Context window from the model, not a guess. Opt-out with auto_tune=False.

Conversation

Multi-turn just works. Pluggable strategies: SlidingWindow, TokenWindow, UnlimitedHistory. Optional session persistence across process restarts.

Memory

Markdown files in .freeagent/memory/. Single memory tool (read/write/append/search/list). Auto-load files in system prompt. Daily logs.

Skills

Markdown SKILL.md with frontmatter. Bundled defaults. User skills extend or override. +25% accuracy on qwen3:4b (measured).

Trace API

agent.trace() shows every model call, tool call, retry, and validation event with relative timestamps. The debugger you always wanted.

CLI

freeagent ask qwen3:8b "hello" — one-shot with live streaming. freeagent chat — REPL. freeagent models — list. No extra deps (stdlib argparse).

Multi-Provider

Ollama, vLLM, OpenAI-compat (LM Studio, LocalAI, TGI). Provider protocol means any new backend is one file.

MCP

Connect to MCP servers via stdio or HTTP. Tool schemas auto-converted. Description truncation for small models. pip install freeagent-sdk[mcp]

Hello World

01_hello.py
from freeagent import Agent

agent = Agent(model="llama3.1:8b")
print(agent.run("What is Python?"))

With Tools

02_tools.py
from freeagent import Agent, tool
from freeagent.tools import system_info, calculator

@tool
def weather(city: str) -> dict:
    """Get weather for a city.
    city: The city name
    """
    return {"temp": 72, "condition": "sunny"}

agent = Agent(
    model="qwen3:8b",
    tools=[weather, system_info, calculator],
)

print(agent.run("Weather in Portland and disk space?"))

Memory is on by default. The agent gets a memory tool. Conversation is multi-turn by default. Telemetry is captured automatically.

Command Line

pip install freeagent-sdk gives you a freeagent command.

One-shot query

shell
$ freeagent ask qwen3:8b "What's the capital of France?"
Paris is the capital of France.

Streams tokens as they arrive. Exits when done.

Interactive REPL

shell
$ freeagent chat qwen3:8b
freeagent > What is Python?
Python is a high-level programming language...
freeagent > When was it released?
Python was first released in 1991...
freeagent > /trace
Trace for run 2 (qwen3:8b, native):
  +     0ms  run_start            "When was it released?"
  ...
freeagent > /exit

Slash commands: /clear resets the conversation, /trace shows the last run, /exit quits.

List models

shell
$ freeagent models
Name                           Size       Modified
gemma4:e2b                     7.2GB      2026-04-06
qwen3:8b                       5.2GB      2025-12-09
llama3.1:latest                4.9GB      2025-12-12

Other commands

freeagent version

Print the installed FreeAgent version.

freeagent trace

Show the trace of the last recorded run.

Streaming

Token-level streaming with semantic events. Works for both chat and tool-using agents.

Sync iterator

stream.py
from freeagent import Agent, TokenEvent, ToolCallEvent, ToolResultEvent

agent = Agent(model="qwen3:8b", tools=[weather])

for event in agent.run_stream("Weather in Tokyo?"):
    if isinstance(event, TokenEvent):
        print(event.text, end="", flush=True)
    elif isinstance(event, ToolCallEvent):
        print(f"\n[Calling {event.name}...]")
    elif isinstance(event, ToolResultEvent):
        print(f"[{event.name} -> {'ok' if event.success else 'fail'}]")

Async generator

stream_async.py
async for event in agent.arun_stream("Weather in Tokyo?"):
    if isinstance(event, TokenEvent):
        print(event.text, end="", flush=True)

Event types

RunStartEvent

model, mode. Fired once at the start of each run.

IterationEvent

iteration. Fired at the start of each agent loop cycle.

TokenEvent

text, iteration. Each token as it arrives from the model.

ToolCallEvent

name, args. Fired when the model requests a tool call.

ToolResultEvent

name, result, success, duration_ms. After tool execution.

ValidationErrorEvent

tool_name, errors. Fired when a tool call fails validation.

RetryEvent

tool_name, retry_count. Fired on tool call retry.

RunCompleteEvent

response, elapsed_ms, metrics. Fired once when the run finishes.

Conversation Manager

Multi-turn conversations work out of the box. The agent remembers prior turns automatically using pluggable strategies.

Default: SlidingWindow

multi_turn.py
from freeagent import Agent

agent = Agent(model="qwen3:8b", tools=[weather])
agent.run("What's the weather in Tokyo?")
agent.run("Convert that to Celsius")  # remembers Tokyo was 85°F

Strategies

strategies.py
from freeagent import Agent, SlidingWindow, TokenWindow

# Default: SlidingWindow(max_turns=20)
agent = Agent(model="qwen3:8b")

# Token-based budget (small context models)
agent = Agent(model="qwen3:4b", conversation=TokenWindow(max_tokens=3000))

# Stateless mode (each run() independent)
agent = Agent(model="qwen3:8b", conversation=None)

Session Persistence

session.py
# Saves to .freeagent/sessions/my-chat.json
agent = Agent(model="qwen3:8b", session="my-chat")
agent.run("Hello!")

# Later, in a new process — restores conversation
agent = Agent(model="qwen3:8b", session="my-chat")

Available Strategies

SlidingWindow

Default. Keep the last N turns. Predictable token usage. SlidingWindow(max_turns=20)

TokenWindow

Keep history that fits a token budget. Fills from newest to oldest. TokenWindow(max_tokens=3000)

UnlimitedHistory

Keep everything. Use with caution on small models — will overflow context window.

Custom

Subclass ConversationManager. Implement prepare(), commit(), clear().
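As a sketch of what a custom strategy involves (the prepare()/commit()/clear() names come from this page; the turn format and base-class details are assumptions for illustration):

```python
class KeepFirstAndRecent:
    """Illustrative strategy: keep the first turn (which often carries
    the task setup) plus the most recent N turns."""

    def __init__(self, recent: int = 4):
        self.recent = recent
        self.turns: list[dict] = []

    def prepare(self) -> list[dict]:
        # History handed to the model before each run.
        if len(self.turns) <= self.recent + 1:
            return list(self.turns)
        return [self.turns[0]] + self.turns[-self.recent:]

    def commit(self, turn: dict) -> None:
        # Called after a successful run to persist the turn.
        self.turns.append(turn)

    def clear(self) -> None:
        self.turns = []

conv = KeepFirstAndRecent(recent=2)
for i in range(5):
    conv.commit({"user": f"msg {i}"})
print([t["user"] for t in conv.prepare()])  # → ['msg 0', 'msg 3', 'msg 4']
```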

Model-Aware Defaults

FreeAgent queries Ollama's /api/show on init to detect model capabilities and auto-tune the framework. Small models get stripped defaults. Context window comes from the model, not a guess. Engine selection uses real capabilities, not a hardcoded list.

How it works

autotune.py
from freeagent import Agent

# Auto-tuned: 2B model → strips skills + memory tool
agent = Agent(model="gemma4:e2b")
print(agent.model_info.parameter_size)   # 5.1B
print(agent.model_info.is_small)         # True (MoE pattern)
print(len(agent.skills))                 # 0
print(agent.config.context_window)       # 131072 (real)

# Auto-tuned: 8B model → keeps full defaults
agent = Agent(model="qwen3:8b")
print(agent.model_info.parameter_size)   # 8.2B
print(len(agent.skills))                 # 2
print(agent.config.context_window)       # 40960 (real)

Overriding auto-tune

override.py
# Force bundled skills on a small model
agent = Agent(model="gemma4:e2b", bundled_skills=True, memory_tool=True)

# Disable auto-tune entirely
agent = Agent(model="qwen3:8b", auto_tune=False)

Detection rules

Small (<4B)

Strip bundled skills and memory tool. Tiny models get overwhelmed by the extra context.

MoE pattern

gemma3n:eXb, gemma4:eXb — treated as small regardless of actual param count (effective size matters).

Medium (4-14B)

Keep full defaults. This is the sweet spot where skills help and the memory tool doesn't overwhelm.

Context window

Set from model_info.context_length. qwen3:4b has 262k, llama3.1:latest has 131k — no more guessing.

Engine selection

Uses capabilities.includes("tools"), not a hardcoded model name list.

No Ollama?

Auto-tune silently no-ops for vLLM/OpenAI-compat providers. Defaults are used as specified.
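The detection rules above reduce to a small classifier. A sketch (the regex and threshold follow the rules listed here; the function names are illustrative, not the SDK's internals):

```python
import re

# MoE-style tags like gemma3n:e2b advertise an *effective* size.
MOE_PATTERN = re.compile(r":e\d+b$")

def parse_size_billions(size: str) -> float:
    """'8.2B' → 8.2"""
    return float(size.rstrip("Bb"))

def is_small(model: str, parameter_size: str) -> bool:
    # MoE pattern wins regardless of raw parameter count.
    if MOE_PATTERN.search(model):
        return True
    return parse_size_billions(parameter_size) < 4.0

print(is_small("gemma4:e2b", "5.1B"))   # → True (MoE pattern)
print(is_small("qwen3:8b", "8.2B"))     # → False (medium, full defaults)
print(is_small("qwen3:1.7b", "1.7B"))   # → True (under 4B)
```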

Trace Inspection

Every run is automatically traced. agent.trace() shows a complete timeline of what happened — model calls, tool calls, retries, validation errors — with relative timestamps.

Example

trace.py
from freeagent import Agent, tool

@tool
def adder(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

agent = Agent(model="qwen3:4b", tools=[adder])
agent.run("What is 47 + 23?")

print(agent.trace())
Trace for run 1 (qwen3:4b, native):
  +     0ms  run_start          "What is 47 + 23?"
  +     0ms  model_call_start   iter=0
  +  5800ms  model_call_end     tool_calls=1
  +  5801ms  tool_call          adder(a=47, b=23)
  +  5801ms  tool_result        adder -> ok (0ms)
  +  5802ms  model_call_start   iter=1
  + 12191ms  model_call_end     "The result of 47 + 23 is **70**."
  + 12191ms  run_end            12191ms "The result of 47 + 23 is **70**."

What's recorded

run_start / run_end

User input, final response, elapsed ms, iteration count.

model_call_start / end

Iteration number, content preview, tool call count. Timings show the model's generation latency.

tool_call / tool_result

Tool name, arguments, success, duration_ms, result preview.

validation_error

When a tool call fails validation — tool name and specific error messages.

retry

When the agent retries after a validation error, with the retry count.

loop_detected / timeout

When the circuit breaker fires or the run hits its timeout.

Other formats

formats.py
# One-line summary
print(agent.last_run.summary())
# Run 1: qwen3:4b (native) 12191ms, 2 iters, 1 tool calls

# Markdown report
print(agent.last_run.to_markdown())

# Raw trace events
for te in agent.last_run.trace_events:
    print(te.timestamp, te.event_type, te.data)

Hooks — Lifecycle Events

Hooks let you observe and modify agent behavior at every stage. Register via decorator or direct call. Hooks never crash the agent — exceptions are silently caught.

All 13 Events

Agent Lifecycle

before_run · after_run

Model

before_model · after_model

Tools

before_tool · after_tool

Errors

on_validation_error · on_retry · on_error

Circuit Breaker

on_loop · on_max_iter · on_timeout

Memory

memory_load · memory_save · memory_update
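The "hooks never crash the agent" guarantee comes down to a try/except around each callback. A minimal registry sketch (class and method names are illustrative, not the SDK's internals):

```python
from collections import defaultdict

class HookRegistry:
    """Toy version: callbacks per event name, errors swallowed."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def on(self, event: str, fn=None):
        # Supports both on("x", fn) and the @on("x") decorator form.
        if fn is None:
            return lambda f: self.on(event, f)
        self._hooks[event].append(fn)
        return fn

    def fire(self, event: str, ctx: dict) -> None:
        for fn in self._hooks[event]:
            try:
                fn(ctx)
            except Exception:
                pass  # hooks must never crash the agent

hooks = HookRegistry()

@hooks.on("before_tool")
def noisy(ctx):
    raise RuntimeError("buggy hook")

@hooks.on("before_tool")
def record(ctx):
    ctx["seen"] = True

ctx = {}
hooks.fire("before_tool", ctx)   # the buggy hook is swallowed
print(ctx)  # → {'seen': True}
```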

Decorator Style

hooks example
@agent.on("before_tool")
def log_tool(ctx):
    print(f"→ Calling {ctx.tool_name}({ctx.args})")

@agent.on("after_tool")
def cache_result(ctx):
    if ctx.result and ctx.result.success:
        agent.memory.set(
            f"cache.{ctx.tool_name}",
            ctx.result.data
        )

@agent.on("on_error")
def alert(ctx):
    send_to_slack(f"Agent error: {ctx.error}")

Mutable Context — Skip & Override

@agent.on("before_tool")
def use_cache(ctx):
    # Skip the actual tool call if we have cached data
    cached = agent.memory.get(f"cache.{ctx.tool_name}")
    if cached:
        ctx.skip = True  # tool won't execute

@agent.on("after_run")
def sanitize(ctx):
    # Override the final response
    ctx.override_response = ctx.response.replace("password", "***")

Pre-Built Hooks

from freeagent import log_hook, cost_hook

# Logging — prints every lifecycle event
logger = log_hook(verbose=True)
agent.on("before_run", logger)
agent.on("before_tool", logger)
agent.on("after_tool", logger)
agent.on("after_run", logger)

# Cost tracking — counts tool calls per tool
track, stats = cost_hook()
agent.on("before_tool", track)
agent.run("check disk space")
print(stats())
# → {"calls": 1, "tools": {"system_info": 1}, "errors": 0}

Memory — Persistent Key-Value Store

File-backed JSON store that persists between runs. Auto-loads on agent start, auto-saves after each run. Memory context is injected into the system prompt so the model knows what it remembers.

Agent Integration

memory with agent
agent = Agent(
    model="llama3.1:8b",
    tools=[weather],
    memory_path="~/.freeagent/memory.json",  # persists to disk
)

# Pre-load preferences
agent.memory.set("user.name", "Alice", source="user")
agent.memory.set("user.units", "metric", source="user")

# The model sees this in its system prompt:
# ## Your Memory (facts you remember):
# - user.name: Alice
# - user.units: metric

agent.run("What's the weather?")
# Model knows to use metric because it's in memory

Standalone API

memory API
from freeagent import Memory

mem = Memory(path="~/.freeagent/memory.json")

mem.set("user.name", "Alice")          # create/update
mem.get("user.name")                     # → "Alice"
mem.get("missing", "default")            # → "default"
mem.has("user.name")                     # → True
mem.delete("user.name")                  # → True

mem.search("user.")                      # → all keys starting with "user."
mem.all()                                # → full dict
mem.keys()                               # → list of keys
len(mem)                                # → entry count
"user.name" in mem                      # → True

# Each entry tracks metadata:
# created_at, updated_at, access_count, source

Memory + Hooks Together

# Auto-cache tool results in memory
@agent.on("after_tool")
def auto_cache(ctx):
    if ctx.result and ctx.result.success:
        agent.memory.set(
            f"cache.{ctx.tool_name}.{hash(str(ctx.args))}",
            ctx.result.data,
            source="tool"
        )

# Use cache to skip redundant calls
@agent.on("before_tool")
def check_cache(ctx):
    key = f"cache.{ctx.tool_name}.{hash(str(ctx.args))}"
    if agent.memory.has(key):
        ctx.skip = True  # don't re-run

@tool Decorator

Write a function with type hints and a docstring. FreeAgent builds the JSON schema, Ollama spec, and ReAct description automatically.

tool decorator
@tool
def lookup_user(username: str) -> dict:
    """Look up a user by username.
    username: The username to look up
    """
    return {"name": "Alice", "role": "engineer"}

# Auto-generated:
lookup_user.name         # → "lookup_user"
lookup_user.schema()     # → JSON schema from type hints
lookup_user.to_ollama_spec()    # → Ollama tool format
lookup_user.to_react_description()  # → human-readable for ReAct

Design Rule

Keep schemas flat. One required field is ideal. Use strings over enums. Provide defaults. Every field you add is a chance for a small model to fail.
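The schema generation the decorator performs can be approximated with inspect and typing.get_type_hints. A sketch of the idea (not the SDK's actual implementation):

```python
import inspect
from typing import get_type_hints

PY_TO_JSON = {str: "string", int: "integer", float: "number",
              bool: "boolean", dict: "object"}

def schema_from_signature(fn) -> dict:
    """Build a JSON-schema-style dict from hints and defaults."""
    sig = inspect.signature(fn)
    hints = get_type_hints(fn)
    props, required = {}, []
    for name, param in sig.parameters.items():
        props[name] = {"type": PY_TO_JSON.get(hints.get(name, str), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default → required field
    return {"type": "object", "properties": props, "required": required}

def weather(city: str, units: str = "f") -> dict:
    """Get weather for a city."""
    return {}

print(schema_from_signature(weather))
# → {'type': 'object',
#    'properties': {'city': {'type': 'string'}, 'units': {'type': 'string'}},
#    'required': ['city']}
```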

Skills System

Skills are markdown directories with SKILL.md files containing YAML frontmatter. They get injected into the system prompt automatically.

SKILL.md
---
name: nba-analyst
description: Basketball statistics expert
version: 1.0
tools: [search, calculator]
---

You are an NBA analyst. Always cite your sources.
When comparing players, use per-game averages.

Loading Skills

python
agent = Agent(
    model="qwen3:8b",
    tools=[search, calculator],
    skills=["./my-skills"],   # directory of skill folders
)

How It Works

Bundled

general-assistant and tool-user load automatically. ~157 tokens total.

User Skills

Extend bundled skills. Duplicate names override (last wins).

Budget

build_skill_context(skills, max_chars=N) — truncates when over budget.

No PyYAML

Built-in frontmatter parser handles the subset we need. Zero extra deps.
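A parser for that subset is only a few lines. A sketch (exactly which YAML subset the SDK supports is an assumption; this handles `key: value` pairs and `[a, b]` lists):

```python
def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split '---' fenced frontmatter from a markdown body."""
    if not text.startswith("---"):
        return {}, text
    header, _, body = text[3:].partition("\n---")
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list: [search, calculator] → ["search", "calculator"]
            value = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        meta[key.strip()] = value
    return meta, body.lstrip("\n")

skill = """---
name: nba-analyst
description: Basketball statistics expert
tools: [search, calculator]
---

You are an NBA analyst."""
meta, body = parse_frontmatter(skill)
print(meta["name"], meta["tools"])  # → nba-analyst ['search', 'calculator']
```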

Multi-Provider

All providers implement the same 3-method interface: chat(), chat_with_tools(), chat_with_format().

OllamaProvider

Default. Connects to localhost:11434. Native tool calling + constrained JSON via GBNF.

VLLMProvider

OpenAI-compatible with vLLM defaults. VLLMProvider(model="qwen3-8b")

OpenAICompatProvider

Any OpenAI-compatible server: LM Studio, LocalAI, TGI. Custom API keys and headers.

Usage

python
from freeagent import Agent, VLLMProvider

provider = VLLMProvider(model="qwen3-8b")
agent = Agent(model="qwen3-8b", provider=provider, tools=[my_tool])
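For illustration, the 3-method contract can be expressed as a typing.Protocol. This is a sketch, not the SDK's actual class: the method names come from this page, the signatures are assumptions.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ProviderLike(Protocol):
    """Illustrative provider contract; real signatures may differ."""

    def chat(self, messages: list[dict]) -> str: ...
    def chat_with_tools(self, messages: list[dict], tools: list[dict]) -> dict: ...
    def chat_with_format(self, messages: list[dict], schema: dict) -> dict: ...

class EchoProvider:
    """Toy backend: satisfies the protocol without any HTTP."""

    def chat(self, messages):
        return messages[-1]["content"]

    def chat_with_tools(self, messages, tools):
        return {"content": self.chat(messages), "tool_calls": []}

    def chat_with_format(self, messages, schema):
        return {"answer": self.chat(messages)}

print(isinstance(EchoProvider(), ProviderLike))  # → True
```

This is why "any new backend is one file": implement three methods and you are done.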

Small-Model Guardrails

The OpenAI-compat provider includes automatic recovery for common small-model issues:

Thinking Tags

Strips <think>...</think> from qwen3/deepseek responses.

JSON Recovery

Recovers arguments from code fences, embedded JSON in text, malformed strings.
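Both recoveries boil down to a couple of regexes. A sketch of the idea (the patterns are illustrative, not the provider's actual code):

```python
import json
import re

THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
FENCE = re.compile(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks (qwen3 / deepseek style)."""
    return THINK.sub("", text).strip()

def recover_json(text: str):
    """Try a fenced JSON block first, then the first {...} in prose."""
    m = FENCE.search(text)
    if m:
        return json.loads(m.group(1))
    m = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(m.group(0)) if m else None

raw = '<think>plan the call</think>The arguments are {"city": "Tokyo"} as requested.'
print(strip_thinking(raw))  # → The arguments are {"city": "Tokyo"} as requested.
print(recover_json(raw))    # → {'city': 'Tokyo'}
```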

Telemetry

Built into every agent — no setup: agent.metrics

python
agent.run("What's the weather?")
print(agent.metrics)               # quick summary
print(agent.metrics.tool_stats())  # per-tool breakdown
agent.metrics.to_json("m.json")   # export

Optional OpenTelemetry: pip install freeagent-sdk[otel] — traces and metrics flow automatically.

Dual-Mode Engine

┌──────────────────────────────┐
│          Agent.run()         │
└──────────────┬───────────────┘
               │
     Model Profile Detection
     "llama3.1" → native
     "phi3"     → react
        ┌──────┴───────┐
┌───────▼───────┐  ┌───────▼────────┐
│ NativeEngine  │  │  ReactEngine   │
│ Ollama tool   │  │ 1. Free-text   │
│ API → struct  │  │ 2. GBNF JSON   │
│ tool_calls    │  │    (two-step)  │
└───────┬───────┘  └───────┬────────┘
        └────────┬─────────┘
                 ▼
  Validator → Retry → Dispatch → Circuit Breaker
          Hooks fire at each step

Two-Step Generation

The key insight: asking a small model to think AND produce JSON in one shot fails. Split reasoning (free text) from structured output (constrained JSON). The model thinks naturally, then gives just the arguments with grammar constraints.
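With a stub standing in for the model, the two-step flow looks like this. The function name and prompts are illustrative; in the SDK the second call is constrained by a GBNF grammar rather than a prompt:

```python
import json

def two_step_tool_call(model, question: str, tool_schema: dict) -> dict:
    """Step 1: free-text reasoning. Step 2: arguments only, as JSON."""
    reasoning = model(f"Think step by step about: {question}")
    args_json = model(
        f"Given your reasoning:\n{reasoning}\n"
        f"Reply with ONLY JSON matching {json.dumps(tool_schema)}"
    )
    return json.loads(args_json)  # the real SDK constrains this output

# Stub model: first a reasoning reply, then well-formed arguments.
replies = iter([
    "The user wants Tokyo's weather, so city should be 'Tokyo'.",
    '{"city": "Tokyo"}',
])
args = two_step_tool_call(lambda prompt: next(replies),
                          "Weather in Tokyo?", {"city": "string"})
print(args)  # → {'city': 'Tokyo'}
```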

Guardrails

Fuzzy Matching

Model says "gret"? → "Did you mean 'greet'?"

Type Coercion

"42" → 42, "true" → True. Auto-fixed.

Retry + Feedback

"Missing field 'city'. Schema: {city: string}." Concrete errors, not generic retries.

Circuit Breaker

Same tool+args 3x = stuck. Max iterations = stop. Timeout = partial result.

FreeAgent vs Raw Ollama

Tested with the same eval suite across 4 models and 100+ runs. Full results in evaluation/.

Tool Calling Accuracy (8 cases)

Model          Raw Ollama   FreeAgent
qwen3:8b       75%          75%
qwen3:4b       100%         88%
llama3.1:8b    62%          75% (+13%)

MCP Tool Calling (21 NBA tools, 8 cases)

Model          Raw Ollama   FreeAgent
qwen3:8b       100%         88%
qwen3:4b       88%          88%
llama3.1:8b    100%         88%

Skills Impact (A/B test, 5 cases)

Model          With Skills   Without Skills   Delta
qwen3:4b       100%          80%              +20%
qwen3:8b       80%           80%              0%
llama3.1:8b    80%           80%              0%

Multi-Turn Conversations (6 conversations, 15 turns)

Model          Raw Ollama   FreeAgent (conv)
qwen3:8b       93%          87%
qwen3:4b       93%          87%
llama3.1:8b    87%          80%
gemma4:e2b     N/A          80% (ReactEngine, 2B)

Memory Tool Usability (5 operations)

Model          Accuracy   Used Memory Tool
qwen3:8b       60%        5/5
qwen3:4b       60%        4/5
llama3.1:8b    60%        4/5

Key Findings

Multi-turn just works

Conversation manager delivers 87% on multi-turn conversations out of the box. SlidingWindow default needs no configuration.

Improves llama3.1

FreeAgent boosts llama3.1 tool calling accuracy by +13% through fuzzy name matching and type coercion.

ReactEngine works

gemma4:e2b (2B) achieves 80% on multi-turn via text-based ReAct. Matches llama3.1 (8B) at 1/4 the size. No parse errors.

Skills help small models

Bundled tool-user skill improves qwen3:4b by +20%. Skills are neutral for larger models — they don't need the guidance.

Memory works

All models understand the single-tool action pattern (4-5/5 usage rate). Write operations had a .md extension bug (now fixed).

Zero crashes

All failures are accuracy issues (wrong answer, wrong tool), never framework errors. The guardrails work. 100+ eval runs, zero crashes.

Test Results — All Systems

=== Memory ===
get: Alice
search: {'user.name': 'Alice', 'user.units': 'metric'}
has: True
persist: test_value (survived write → read cycle)
prompt: 76 chars, contains memory context

=== Hooks ===
events: ['before_tool', 'after_tool', 'on_error']  ← all fired
skip: True  ← hook can skip tool execution
override: 'intercepted!'  ← hook can override response
pre-built: log_hook, cost_hook OK

=== Agent Integration ===
Agent(model='llama3.1:8b', mode='native', tools=['greet', 'system_info'], memory=1)
hooks registered: before_tool, after_run
memory in prompt: True  ← auto-injected

=== Validator ===
coerce: {'name': 'Test', 'loud': True}  ← "true"→True
fuzzy: "Unknown tool 'gret'. Did you mean 'greet'?"

=== Circuit Breaker ===
sequence: [CONTINUE, CONTINUE, CONTINUE, LOOP_DETECTED, LOOP_DETECTED]

=== Built-in Tools ===
cpu: {'cpu_cores': 2}
calc: {'expression': '99 * 3', 'result': 297}

=== Stats ===
16 Python files · 1,716 lines · 0 dependencies
ALL TESTS PASSED

Project Structure

freeagent/
├── pyproject.toml
├── README.md
├── freeagent/
│   ├── __init__.py           ← Agent, tool, Memory, hooks exports
│   ├── agent.py              ← Agent class w/ hooks + memory integration
│   ├── hooks.py              ← 13 events, HookRegistry, log_hook, cost_hook
│   ├── memory.py             ← Memory class, MemoryEntry, persistence
│   ├── tool.py               ← @tool decorator, schema gen
│   ├── config.py             ← AgentConfig, model profiles
│   ├── messages.py           ← Message types, error feedback
│   ├── validator.py          ← fuzzy match, coercion, field checks
│   ├── circuit_breaker.py    ← loop detect, iteration limits
│   ├── engines/
│   │   └── __init__.py       ← NativeEngine + ReactEngine
│   ├── providers/
│   │   └── ollama.py         ← stdlib-only Ollama client
│   └── tools/
│       ├── system_info.py    ← disk, cpu, os
│       ├── calculator.py     ← safe math
│       └── shell.py          ← sandboxed commands
└── examples/
    ├── 01_hello.py
    ├── 02_builtin_tools.py
    ├── 03_custom_tool.py
    ├── 04_hooks.py           ← NEW: hooks demo
    └── 05_memory.py          ← NEW: memory demo

What's Done (v0.3.1)

✓ Streaming

Token-by-token streaming via agent.run_stream() / arun_stream(). Works for tool-using agents, not just chat. Semantic events: TokenEvent, ToolCallEvent, etc.

✓ Auto-Tune

Queries Ollama /api/show. Small models get stripped defaults. Context window from the model. Engine selection from real capabilities.

✓ Trace API

agent.trace() full timeline with relative timestamps. run_start, model_call_*, tool_*, validation_*, run_end.

✓ CLI

freeagent ask, freeagent chat, freeagent models, freeagent trace, freeagent version. Stdlib argparse, no extra deps.

✓ Conversation

Multi-turn by default. SlidingWindow, TokenWindow, UnlimitedHistory, session persistence.

✓ Memory

Markdown-backed files, single memory tool, auto_load, daily logs, caching.

✓ Skills

Markdown SKILL.md with frontmatter, bundled defaults, user extensions.

✓ Multi-Provider

Ollama (streaming), vLLM (streaming), OpenAI-compat (streaming).

✓ MCP Support

Stdio + streamable HTTP transports, schema conversion.

✓ Telemetry

Built-in metrics, optional OTEL export, per-tool stats, full trace events.

What's Next

Structured Output

First-class Pydantic schema support via agent.run_structured(). Already works under the hood via the constrained JSON path.

Subagents

Agent-as-tool composition. Spawn a specialist from inside a tool. Shared conversation context or isolated.

Tool Policies

allow / deny / ask per tool. Safety rails for production deployments.