06. Function Calling in vLLM
vLLM supports tool calling and can combine it with guided decoding, where a backend such as the Outlines library constrains generation so that the output conforms to a JSON schema. This reduces the rate of malformed tool arguments.
Setting up vLLM with tool calling
pip install vllm requests
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    guided_decoding_backend="outlines",  # backend used for guided decoding
)
Defining tools for vLLM
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for information.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query"},
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum results to return",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        }
    }
]
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "Who won the Nobel Prize in Physics in 2024?"}
]
# Pass the tools to llm.chat; the model's chat template renders them into the prompt
response = llm.chat(
    messages=messages,
    tools=tools,
    sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
)
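`llm.chat` returns a list of request outputs, and the generated text is available as `response[0].outputs[0].text`. What that text looks like depends on the model and its chat template. The sketch below assumes the model emits a single JSON object with a `name` key and a `parameters` (or `arguments`) key; that format is an assumption, not something vLLM guarantees, and `parse_tool_call` is a hypothetical helper.

```python
import json
from typing import Optional


def parse_tool_call(text: str) -> Optional[dict]:
    """Return {"name": ..., "arguments": ...} if text is a JSON tool call, else None."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # plain-text answer or malformed JSON
    if not isinstance(payload, dict) or not isinstance(payload.get("name"), str):
        return None
    # Accept either key; which one appears is model-dependent (assumption).
    arguments = payload.get("parameters", payload.get("arguments", {}))
    if not isinstance(arguments, dict):
        return None
    return {"name": payload["name"], "arguments": arguments}


# Hand-written string standing in for response[0].outputs[0].text
sample = '{"name": "web_search", "parameters": {"query": "Nobel Prize in Physics 2024"}}'
call = parse_tool_call(sample)
```

Returning `None` for anything that is not a well-formed call lets the caller treat the output as an ordinary text answer instead of failing.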
Understanding guided decoding
Guided decoding constrains the model's output to match the JSON schema. Without it, the model may produce argument strings that look correct but are not valid JSON. With guided decoding, vLLM still generates one token at a time, but at each step it masks out tokens that would make the output invalid under the schema, so the partial output always remains a valid prefix.
This has a trade-off: guided decoding is typically slower than unconstrained generation, because the set of allowed tokens has to be computed at every step and some optimizations may not apply. Enable it only when tool-calling accuracy matters more than raw throughput.
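The idea can be illustrated with a toy decoder. This is a minimal sketch, not how vLLM or Outlines implement it: the vocabulary, the list of valid outputs (standing in for a compiled schema), and the scoring function are all made up for illustration.

```python
from typing import Callable

# Toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', '"query"', ':', '"cats"', 'hello', ',']

# Toy "schema": the only complete outputs considered valid.
VALID_OUTPUTS = ['{"query":"cats"}']


def is_valid_prefix(text: str) -> bool:
    """True if text can still be extended into a valid output."""
    return any(full.startswith(text) for full in VALID_OUTPUTS)


def constrained_greedy(score: Callable[[str, str], float], max_steps: int = 10) -> str:
    """Greedy decoding that only considers tokens keeping the output valid."""
    out = ""
    for _ in range(max_steps):
        if out in VALID_OUTPUTS:
            break  # complete, schema-valid output
        allowed = [tok for tok in VOCAB if is_valid_prefix(out + tok)]
        if not allowed:
            break  # dead end; cannot happen with this toy grammar
        out += max(allowed, key=lambda tok: score(out, tok))
    return out


def toy_score(prefix: str, token: str) -> float:
    """Stand-in model that prefers the invalid token 'hello'."""
    return 1.0 if token == "hello" else 0.5


unconstrained_first = max(VOCAB, key=lambda tok: toy_score("", tok))
result = constrained_greedy(toy_score)
```

Even though the toy model always prefers `hello`, the mask removes it at every step, so the decoder can only produce the valid JSON object. The extra work of building `allowed` at each step is the source of the slowdown described above.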
Auto vs. required tool choice
This section covers two tool-choice modes. In auto mode, the model decides whether to call a tool and which one to pick from the available tools. In required mode, the model must call a tool and cannot respond with plain text. Configure this in the request parameters:
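import requests
# tool_choice belongs to vLLM's OpenAI-compatible server API, not to LLM.chat.
# Assumes a server with tool calling enabled is listening on localhost:8000.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": messages, "tools": tools,
        "tool_choice": "required",  # Must call a tool
        "temperature": 0.0, "max_tokens": 256,
    },
)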
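When requests go to an OpenAI-compatible chat endpoint, the tool-choice mode is just a field in the JSON body. The sketch below builds such a body and checks the `tool_choice` value before sending anything. `build_chat_payload` is a hypothetical helper, and the dict form for naming a specific function follows the OpenAI chat completions format; whether a given vLLM version accepts each mode is an assumption to verify against its documentation.

```python
from typing import Union


def build_chat_payload(model: str, messages: list, tools: list,
                       tool_choice: Union[str, dict] = "auto") -> dict:
    """Build a chat-completions request body with a checked tool_choice."""
    tool_names = {tool["function"]["name"] for tool in tools}
    if isinstance(tool_choice, str):
        if tool_choice not in ("auto", "required", "none"):
            raise ValueError(f"unknown tool_choice: {tool_choice!r}")
    else:
        # Named form: {"type": "function", "function": {"name": "..."}}
        wanted = tool_choice.get("function", {}).get("name")
        if wanted not in tool_names:
            raise ValueError(f"tool_choice names an undefined tool: {wanted!r}")
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "tool_choice": tool_choice,
    }


# Minimal stand-ins so the example runs on its own
example_tools = [{"type": "function", "function": {"name": "web_search"}}]
example_messages = [{"role": "user", "content": "Who won the Nobel Prize in Physics in 2024?"}]

payload = build_chat_payload(
    "meta-llama/Llama-3.1-8B-Instruct", example_messages, example_tools,
    tool_choice="required",
)
```

Catching an undefined tool name on the client side gives a clearer error than waiting for the server to reject the request.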
Failure mode: schema mismatch
If a tool schema has errors, vLLM can raise an exception at inference time, when the tools are rendered into the prompt or the schema is compiled for guided decoding. Validate your schemas against the OpenAI function schema format before passing them to vLLM.
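A lightweight structural check can catch the most common mistakes before a request is made. The sketch below is a hand-rolled check of a few rules from the OpenAI function schema format, not a full JSON Schema validator; `validate_tool` is a hypothetical helper.

```python
def validate_tool(tool: dict) -> list:
    """Return a list of problems found in one tool definition (empty if none)."""
    errors = []
    if tool.get("type") != "function":
        errors.append('top-level "type" must be "function"')
    fn = tool.get("function")
    if not isinstance(fn, dict):
        errors.append('missing "function" object')
        return errors
    name = fn.get("name")
    if not isinstance(name, str) or not name:
        errors.append('"function.name" must be a non-empty string')
    params = fn.get("parameters")
    if params is None:
        return errors  # a tool without parameters is allowed
    if not isinstance(params, dict) or params.get("type") != "object":
        errors.append('"parameters.type" must be "object"')
        return errors
    properties = params.get("properties", {})
    for key in params.get("required", []):
        if key not in properties:
            errors.append(f'required key "{key}" is not listed in properties')
    return errors


good_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
# Same tool, but "required" names a key that does not exist in "properties"
bad_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["q"],
        },
    },
}

good_errors = validate_tool(good_tool)
bad_errors = validate_tool(bad_tool)
```

Running every tool through a check like this at startup turns a schema mismatch into an immediate, readable error rather than a failure in the middle of a request.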
Set up a vLLM instance with a tool-enabled model. Define two tools and verify the model calls the correct one based on the query. Then test with tool_choice="required" and confirm the model cannot refuse.