05. vLLM Function Calling

Chapter 5 of 18 · 20 min

vLLM supports function calling through its chat template system and guided decoding. Serving models with tool support requires specific configuration during startup.

Start vLLM with tool use enabled:

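vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --dtype half \
    --gpu-memory-utilization 0.85

The --enable-auto-tool-choice flag lets the model decide when to emit a tool call, and --tool-call-parser selects the parser that converts the model's raw output into structured tool calls. Parser names include hermes (Hermes-style models), mistral (Mistral models), and llama3_json (Llama 3.1 and later); the available set depends on your vLLM version, so check vllm serve --help. Some models also need a tool-aware chat template, passed with --chat-template. Choose the parser that matches your model's fine-tuning. If unsure, test the candidates and observe which one yields a populated tool_calls array with valid JSON arguments.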
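
Before sending tool requests, it can help to confirm that the server is up and which model name it serves. The following is a minimal sketch, assuming the server listens on localhost:8000 (the address used in this chapter's request example) and exposes the OpenAI-style GET /v1/models endpoint. The helper names parse_model_ids and list_served_models are illustrative, not part of vLLM.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Pull model ids out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_served_models(base_url: str = "http://localhost:8000") -> list[str]:
    """Query the running server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))


# Offline check of the parsing step with a hand-written sample body
sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.1-8B-Instruct", "object": "model"}],
}
print(parse_model_ids(sample))
```

Call list_served_models() once the server has finished loading; the id it returns is the value to pass as model in later requests.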

Send requests with tool definitions using the OpenAI-compatible API:

import requests
import json

def call_vllm_with_tools(model: str, messages: list, tools: list) -> dict:
    """POST a chat completion request with tool definitions to a local vLLM server."""
    url = "http://localhost:8000/v1/chat/completions"

    payload = {
        "model": model,
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",  # Let the model decide whether to call a tool, and which one
    }

    # requests sets the Content-Type: application/json header when json= is used
    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()  # Surface HTTP errors instead of parsing an error body
    return response.json()

The tool_choice parameter accepts "auto" to let the model decide whether to call a tool, "none" to disable tool calls, or an object of the form {"type": "function", "function": {"name": "get_weather"}} to force a call to one named tool.
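
For concreteness, a tool definition and a forced tool_choice might look like the sketch below. It assumes the OpenAI function-calling schema, in which forcing a tool is expressed as an object naming the function. The add tool, its parameter names, and the helpers force_tool and build_payload are made up for illustration.

```python
# Hypothetical calculator tool in the OpenAI "tools" format (illustrative names)
calculator_tool_schema = {
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two numbers and return the sum.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "First addend"},
                "b": {"type": "number", "description": "Second addend"},
            },
            "required": ["a", "b"],
        },
    },
}


def force_tool(name: str) -> dict:
    """Build the tool_choice object that forces a call to one named tool."""
    return {"type": "function", "function": {"name": name}}


def build_payload(model: str, messages: list, tools: list, tool_choice="auto") -> dict:
    """Assemble a chat completion request body; tool_choice is a string or a dict."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "tool_choice": tool_choice,
    }


payload = build_payload(
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Calculate 15 + 27"}],
    [calculator_tool_schema],
    tool_choice=force_tool("add"),
)
```

Forcing a tool is mostly useful in tests, where you want a deterministic call regardless of how the prompt is phrased; in normal use "auto" is the usual choice.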

Extract tool calls from the vLLM response:

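def extract_vllm_tool_calls(response: dict) -> list[dict]:
    """Return the tool calls in the first choice as {"id", "name", "arguments"} dicts."""
    tool_calls = []

    # Guard against a missing or empty choices list
    choices = response.get("choices") or [{}]
    message = choices[0].get("message") or {}

    # tool_calls may be absent, null, or an empty list when no tool was called
    for call in message.get("tool_calls") or []:
        function = call.get("function") or {}
        tool_calls.append({
            "id": call.get("id"),  # Needed later to pair a tool result with this call
            "name": function.get("name"),
            # arguments arrives as a JSON string, so decode it into a dict
            "arguments": json.loads(function.get("arguments") or "{}"),
        })

    return tool_calls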

# calculator_tool_schema must be a tool definition dict in the OpenAI "tools" format
response = call_vllm_with_tools(
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Calculate 15 + 27"}],
    [calculator_tool_schema]
)

tool_calls = extract_vllm_tool_calls(response)

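vLLM returns tool calls in the tool_calls array within the message object. The function.arguments field is a JSON-encoded string, so parse it (for example with json.loads) before use. When the model answers in plain text instead, tool_calls is empty or absent.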
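
Once the arguments are parsed, the usual next step is to run the matching local function and hand the result back to the model as a tool message. The following is a sketch under assumptions: the add function, TOOL_REGISTRY, and run_tool_call are hypothetical names, and the message shapes follow the OpenAI chat format.

```python
import json


def add(a: float, b: float) -> float:
    """Hypothetical local implementation backing an 'add' tool."""
    return a + b


# Maps tool names the model may request to local callables (illustrative)
TOOL_REGISTRY = {"add": add}


def run_tool_call(call: dict) -> dict:
    """Execute one raw tool_calls entry and build the follow-up 'tool' message."""
    function = call.get("function") or {}
    name = function.get("name")
    try:
        arguments = json.loads(function.get("arguments") or "{}")
        result = TOOL_REGISTRY[name](**arguments)
        content = json.dumps({"result": result})
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Report malformed JSON, unknown tools, or bad arguments back to the model
        content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return {"role": "tool", "tool_call_id": call.get("id"), "content": content}


# Offline example using a hand-written tool call entry
sample_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "add", "arguments": "{\"a\": 15, \"b\": 27}"},
}
tool_message = run_tool_call(sample_call)
```

To continue the conversation, append the assistant message that contained the tool call and then this tool message to messages, and send the request again so the model can produce its final answer.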

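In the Hermes format, the model itself emits each call as a JSON object like the following (wrapped in <tool_call> tags in the raw text), which the server-side parser converts into a tool_calls entry: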

{
  "name": "get_weather",
  "arguments": {
    "city": "Boston",
    "units": "celsius"
  }
}

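The Mistral format wraps the call differently in the raw model output, but when the matching server-side parser is active the client receives the same tool_calls structure, so the client parsing stays the same. Test your parsing logic against both formats during development.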
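
One way to exercise parsing logic against more than one shape is a small normalizer with offline fixtures. The sketch below assumes two input shapes: the OpenAI-style entry with arguments as a JSON string, and a flat object with arguments already decoded, as in the Hermes example above. The function name normalize_tool_call and both fixtures are illustrative.

```python
import json


def normalize_tool_call(call: dict) -> dict:
    """Return {"name", "arguments"} from either a nested or a flat tool call."""
    # OpenAI-style entries nest the details under "function"; flat ones do not
    function = call.get("function") or call
    arguments = function.get("arguments") or {}
    if isinstance(arguments, str):
        # JSON-encoded string: decode it into a dict
        arguments = json.loads(arguments)
    return {"name": function.get("name"), "arguments": arguments}


# Fixture 1: the shape returned in message.tool_calls (arguments is a string)
openai_style = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": '{"city": "Boston", "units": "celsius"}',
    },
}

# Fixture 2: a flat object with arguments already decoded
flat_style = {
    "name": "get_weather",
    "arguments": {"city": "Boston", "units": "celsius"},
}

# Both shapes should normalize to the same result
assert normalize_tool_call(openai_style) == normalize_tool_call(flat_style)
```

Keeping fixtures like these in a test file means a parser or model swap can be checked without a running server.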

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
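
A short script can capture most of that record automatically. This is a sketch only: the function names are illustrative, the package list is an assumption about what this chapter uses, and details such as the GPU model and driver version are not captured and still need to be noted by hand.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_record(packages=("vllm", "requests")) -> dict:
    """Collect the runtime details worth noting next to an exercise result."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
    }


print(json.dumps(environment_record(), indent=2))
```

Save the printed JSON alongside the observed output of the exercise so the two can be compared after an upgrade.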

EXERCISE

Start a vLLM server with tool support, define two tools, send a multi-step query requiring both tools, and verify the model calls them in sequence.