06. Function Calling in vLLM

Chapter 6 of 16 · 20 min

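vLLM supports tool calling and can pair it with guided decoding, for example through the Outlines backend. Guided decoding constrains generation so that tool arguments conform to the tool's JSON schema. This prevents malformed JSON in the arguments, although it cannot guarantee that the argument values themselves are sensible.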

Setting up vLLM with tool calling

# Install first (shell): pip install vllm requests
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    guided_decoding_backend="outlines",  # backend for schema-constrained decoding; the option name may differ across vLLM versions
)

Defining tools for vLLM

tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for information.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query"},
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum results to return",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "Who won the Nobel Prize in Physics in 2024?"}
]

# Chat with tools: LLM.chat() passes the tool definitions to the model's chat template

response = llm.chat(
    messages=messages,
    tools=tools,
    sampling_params=SamplingParams(temperature=0.0, max_tokens=256)
)
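
The call above returns generated text, so the caller still has to extract the tool call from it. The following is a minimal sketch, assuming the model emits a single JSON object with a "name" key and its arguments under "parameters" or "arguments"; the exact output format depends on the model and its chat template. parse_tool_call is a hypothetical helper, not part of vLLM.

```python
import json
from typing import Any, Optional


def parse_tool_call(text: str) -> Optional[dict[str, Any]]:
    """Return {"name": ..., "arguments": {...}}, or None if text is not a tool call."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # plain-text answer or malformed JSON
    if not isinstance(payload, dict) or not isinstance(payload.get("name"), str):
        return None
    # Assumption: arguments live under "parameters" or "arguments".
    arguments = payload.get("parameters", payload.get("arguments", {}))
    if not isinstance(arguments, dict):
        return None
    return {"name": payload["name"], "arguments": arguments}


# In a real run the text would come from response[0].outputs[0].text.
sample = '{"name": "web_search", "parameters": {"query": "Nobel Prize in Physics 2024"}}'
call = parse_tool_call(sample)
print(call)
```

Returning None for anything that does not parse lets the caller treat the output as an ordinary text answer instead of raising.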

Understanding guided decoding

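Guided decoding constrains the model's output to match the JSON schema. Without it, the model may emit argument strings that look plausible but are not valid JSON or do not match the schema, for example a missing closing brace or a quoted number where an integer is expected. With guided decoding, vLLM still generates one token at a time, but at every step it masks out any token that would make the output impossible to complete into a schema-valid document.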
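
The masking idea can be illustrated with a toy decoder. This is only a sketch of the principle, not how vLLM or Outlines implement it: the "schema" is reduced to a hypothetical list of two valid outputs, and the model is simulated by a fixed ranking of candidate tokens per step.

```python
def allowed(prefix: str, token: str, valid_outputs: list[str]) -> bool:
    """A token is allowed if prefix + token can still grow into a valid output."""
    candidate = prefix + token
    return any(v.startswith(candidate) for v in valid_outputs)


def guided_decode(ranked_tokens_per_step: list[list[str]], valid_outputs: list[str]) -> str:
    """Greedy decoding with a validity mask applied at every step."""
    out = ""
    for ranked in ranked_tokens_per_step:
        # Mask: drop tokens that would leave the set of valid outputs.
        permitted = [t for t in ranked if allowed(out, t, valid_outputs)]
        if not permitted:
            break
        out += permitted[0]  # highest-ranked token that survives the mask
        if out in valid_outputs:
            break
    return out


# Hypothetical stand-in for a schema: the only two documents considered valid.
VALID = ['{"max_results": 5}', '{"max_results": 10}']

# Simulated model preferences, most preferred token first at each step.
STEPS = [
    ["Sure", '{"'],
    ["results", "max_results"],
    ["':", '": '],
    ['"5"', "5"],
    ["}"],
]

unconstrained = "".join(step[0] for step in STEPS)  # always take the top token
constrained = guided_decode(STEPS, VALID)
print(unconstrained)
print(constrained)  # → {"max_results": 5}
```

Real backends compile the schema into a grammar or state machine instead of enumerating outputs, but the per-step filter plays the same role.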

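This has a trade-off: guided decoding adds overhead compared with unconstrained generation, because the schema has to be compiled into a grammar or state machine and a token mask has to be computed at every decoding step. How much it costs depends on the backend and on the complexity of the schema. Enable it when well-formed tool arguments matter more than raw throughput.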

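Auto, required, and named tool choice

vLLM's OpenAI-compatible server accepts a tool_choice field on each request. In auto mode, the model decides whether to call a tool and which one. In required mode, the model must call a tool and cannot respond with plain text. A named choice goes one step further and forces one specific tool. Configure this in the request parameters:

# tool_choice is a field of the OpenAI-compatible server API, not an LLM.chat() argument.
# Assumes: vllm serve <model> --enable-auto-tool-choice --tool-call-parser llama3_json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    tools=tools,
    tool_choice="required",  # Must call a tool
    temperature=0.0,
    max_tokens=256,
)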
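
The tool choice modes map onto the tool_choice field of an OpenAI-style chat completion request. The sketch below builds the request body for each mode, including the named form, which follows the OpenAI function schema format. build_request is a hypothetical helper introduced for illustration, and the model name is simply the one used earlier in this chapter.

```python
from typing import Any, Optional

tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]
messages = [{"role": "user", "content": "Who won the Nobel Prize in Physics in 2024?"}]


def build_request(
    model: str,
    messages: list[dict[str, Any]],
    tools: list[dict[str, Any]],
    mode: str = "auto",
    tool_name: Optional[str] = None,
) -> dict[str, Any]:
    """Build a chat completion request body for the given tool choice mode."""
    if mode in ("auto", "required", "none"):
        tool_choice: Any = mode
    elif mode == "named":
        known = {t["function"]["name"] for t in tools}
        if tool_name not in known:
            raise ValueError(f"unknown tool: {tool_name!r}")
        # Named form: force one specific function.
        tool_choice = {"type": "function", "function": {"name": tool_name}}
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    return {"model": model, "messages": messages, "tools": tools, "tool_choice": tool_choice}


MODEL = "meta-llama/Llama-3.1-8B-Instruct"
required_req = build_request(MODEL, messages, tools, mode="required")
named_req = build_request(MODEL, messages, tools, mode="named", tool_name="web_search")
print(required_req["tool_choice"])
print(named_req["tool_choice"])
```

Checking the tool name before sending the request surfaces a typo locally instead of as a server-side error.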

Failure mode: schema mismatch

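If a tool schema is malformed, expect the failure when the request is processed rather than when the model loads. Depending on the vLLM version and backend, the request may be rejected with a validation error, or guided decoding may fail while compiling the schema. Validate your schemas against the OpenAI function schema format before passing them to vLLM.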
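
A few structural checks catch the most common mistakes before a request is sent. The sketch below is a hypothetical helper, not a vLLM API, and it covers only the outer shape of the OpenAI function schema format; a full JSON Schema validator would be needed to check the parameter definitions themselves.

```python
from typing import Any


def validate_tool_schema(tool: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one tool definition (empty if none)."""
    errors: list[str] = []
    if tool.get("type") != "function":
        errors.append('top-level "type" must be "function"')
    fn = tool.get("function")
    if not isinstance(fn, dict):
        errors.append('"function" must be an object')
        return errors
    name = fn.get("name")
    if not isinstance(name, str) or not name:
        errors.append('"function.name" must be a non-empty string')
    # "parameters" may be omitted for a function that takes no arguments.
    params = fn.get("parameters", {"type": "object", "properties": {}})
    if not isinstance(params, dict) or params.get("type") != "object":
        errors.append('"function.parameters.type" must be "object"')
        return errors
    props = params.get("properties", {})
    if not isinstance(props, dict):
        errors.append('"function.parameters.properties" must be an object')
        return errors
    for key in params.get("required", []):
        if key not in props:
            errors.append(f'required parameter "{key}" is not defined in "properties"')
    return errors


good = {
    "type": "function",
    "function": {
        "name": "web_search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
# Same tool, but "required" names a parameter that "properties" does not define.
bad = {
    "type": "function",
    "function": {
        "name": "web_search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query", "limit"],
        },
    },
}
print(validate_tool_schema(good))
print(validate_tool_schema(bad))
```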

EXERCISE

Set up a vLLM instance with a tool-enabled model. Define two tools and verify that the model calls the correct one for each query. Then test with tool_choice="required" and confirm that the response always contains a tool call, even for a query that needs no tool.