What this does

Unit tests for agents mock the LLM and tool responses to verify decision logic, tool selection, error handling, and termination conditions without making real API calls.

Steps

Mock the LLM client. Replace real calls with controlled responses.

from unittest.mock import Mock, patch
import pytest

class MockLLM:
    def __init__(self, responses: list[dict]):
        self.responses = responses
        self.call_count = 0

    def invoke(self, messages):
        response = self.responses[self.call_count]
        self.call_count += 1
        return response

@pytest.fixture
def mock_llm():
    responses = [
        Mock(tool_calls=[Mock(function=Mock(name="search_web", arguments='{"query":"test"}'), id="call_1")]),
        Mock(content="Final answer.", tool_calls=None, finish_reason="stop")
    ]
    return MockLLM(responses)

Test tool selection logic. Verify the agent chooses the correct tool for a given input.

def test_agent_chooses_search_tool(mock_llm):
    tools = [{"type": "function", "function": {"name": "search_web", "parameters": {...}}}]
    agent = Agent(llm=mock_llm, tools=tools, tool_map={"search_web": lambda q: "result"})

    result = agent.run("Search for Python tutorials")

    assert "Python tutorials" in str(result)
    assert mock_llm.call_count == 2

Test error handling. Verify the agent recovers from tool errors.

def test_agent_handles_tool_error():
    def failing_tool(**kwargs):
        raise ValueError("API unavailable")

    agent = Agent(llm=mock_llm, tools=tools, tool_map={"failing_tool": failing_tool})
    agent.max_retries = 1

    result = agent.run("Use failing tool")

    assert "error" in result.lower() or "unavailable" in result.lower()

Test termination conditions. Ensure the agent stops when expected.

def test_agent_stops_at_max_turns():
    # LLM always returns tool_calls (never says stop)
    always_tool = Mock(tool_calls=[Mock(function=Mock(name="noop", arguments="{}"), id="c1")])
    mock = MockLLM([always_tool] * 10)

    agent = Agent(llm=mock, tools=tools, tool_map={"noop": lambda: None}, max_turns=3)
    result = agent.run("Loop test")

    assert "max turns" in result.lower()
    assert mock.call_count == 3

Use pytest fixtures for test isolation.

@pytest.fixture
def mock_tool_registry():
    registry = ToolRegistry()
    registry.register("search", lambda q: ["result"], {})
    registry.register("calculate", lambda e: "42", {})
    return registry

@pytest.fixture
def agent(mock_llm, mock_tool_registry):
    return Agent(llm=mock_llm, registry=mock_tool_registry)

Test decision logic in isolation. Test the routing function independently.

def test_route_to_correct_tool():
    router = IntentRouter()
    assert router.route("search for data") == "search_web"
    assert router.route("calculate 2+2") == "calculate"
    assert router.route("unknown request") == "ask_clarification"

Verification

pytest test_agent.py -v 2>&1 | Select-String -Pattern "PASSED|FAILED"
# Expected: Several PASSED lines

Common failures

Mocks diverge from real LLM responses. Mock responses may not match the actual message format. Serialize and save real responses as test fixtures.
Test flakiness from state leakage. Tests that modify global state (e.g., set_debug) affect subsequent tests. Use @pytest.fixture(autouse=True) to reset state.
Tool side effects in tests. Tests that call real tools write to databases or send emails. Always mock external dependencies.
Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.

Related guides

How to Debug Agent Reasoning and Tool Selection
How to Build Custom Tools for Agents