Autonomous Agent — AI glossary

An autonomous agent is a system that uses a language model to decide and execute multi-step tasks without human intervention at each step. It typically combines an LLM with tools (e.g., web search, code execution, file I/O) and a loop: the model observes results, plans the next action, and calls the appropriate tool. The agent runs until a stopping condition (e.g., task completion, max steps). For local operators, this means running a model like Llama 3.1 8B with a framework (e.g., LangChain, CrewAI) that orchestrates tool calls and manages context. VRAM matters because the agent must keep the conversation history and tool outputs in context, which can exceed the model's context window if not managed carefully.

Deeper dive

Autonomous agents extend a language model beyond single-turn Q&A by giving it agency: the model can call functions, interpret results, and decide subsequent actions. The core loop is: (1) receive a task, (2) generate a thought/plan, (3) execute a tool call (e.g., run Python code, fetch a URL), (4) observe the output, (5) repeat until done. Frameworks like LangChain, AutoGPT, and CrewAI implement this pattern. For local operators, the main constraints are context window size and inference speed. Each tool call and its result adds tokens to the context; a 4K context may fill quickly. Many agents use a 'max iterations' limit to avoid runaway loops. Quantization helps fit larger models into VRAM, but the agent's memory management (e.g., summarization, sliding window) is critical to avoid truncating important state. Some agents also use a separate 'planner' model (e.g., a larger model for reasoning) and a 'worker' model (e.g., a smaller model for tool calls) to balance quality and speed.

Practical example

An operator runs a local agent using Ollama and LangChain. They set up a tool that lets the agent execute Python code. The agent is asked to 'analyze a CSV file and plot the sales trend.' The agent first calls a Python tool to read the CSV, then generates a matplotlib script, executes it, and returns the plot. Each step consumes tokens: the CSV content (500 tokens), the generated code (200 tokens), and the plot description (~100 tokens). With a 4K context window, the agent can handle about 5-6 such steps before needing to summarize or stop. Running Llama 3.1 8B Q4 on an RTX 3090 yields ~30 tok/s, so each step takes ~10-20 seconds.

Workflow example

In LM Studio, an operator loads a model (e.g., Mistral 7B Q4) and enables the 'Agent Mode' (if available) or uses the OpenAI-compatible API with a framework like LangChain. They define tools in a JSON schema (e.g., { "name": "web_search", "parameters": {...} }). The agent loop runs: the model receives the system prompt with tool definitions, generates a function call, the runtime executes it, and the result is appended to the conversation. The operator monitors token usage via the LM Studio logs. If the context fills, they may see the agent repeating itself or dropping earlier steps — a sign to reduce max iterations or increase context length.