Build a memory-enabled local agent stack (May 2026)
An agent that takes a task, remembers what happened in prior sessions, retrieves relevant context from prior decisions, and avoids re-discovering the same dead ends — all on local hardware with no data leaving the machine.
- 01 (Tool) Coding agent + planning loop: openhands
OpenHands over Goose / Aider for memory-enabled workflows: Planning Mode pairs naturally with persistent memory (the plan from session N becomes context for session N+1), and the MCP integration is the strongest in the open-source category. Goose is competitive but Mem0 integration is cleaner on OpenHands.
- 02 (Tool) Episodic + semantic memory layer: mem0
Mem0 over Letta for the default memory pick: drop-in API, less ceremony, faster to wire. Letta wins when you need OS-style explicit memory management (paging in/out memory blocks for long-horizon tasks) — promote to Letta only when you've outgrown Mem0's memory model.
- 03 (Tool) MCP filesystem (file access with allowlisting): mcp-server-filesystem
Strict directory allowlist limits the agent's blast radius. Required for any agent that edits files; non-optional for a memory-enabled agent that may take destructive actions based on remembered context.
- 04 (Tool) MCP git (repo metadata): mcp-server-git
Read-side git operations give the agent commit history awareness — crucial when memory says 'we tried X last session' and git can confirm whether X was actually committed or rolled back.
- 05 (Tool) MCP postgres (structured-knowledge memory): mcp-server-postgres
Postgres MCP exposes a structured-knowledge database to the agent — complements vector-based memory (Mem0) by holding facts that need exact lookup. Pin to the current version and run with a least-privilege role; older versions had a SQL-injection escape on the read-only wrapper.
- 06 (Tool) Inference engine: vllm
vLLM's continuous batching matters here: a memory-enabled agent makes 10-30 retrieval-then-generate calls per task. Prefix caching keeps the memory-injection prompt resident across iterations. Use Ollama only if the agent runs at single-user pace.
- 07 (Model) Coding model with strong reasoning: deepseek-coder-v2-lite
DeepSeek Coder V2 Lite over Qwen 2.5 Coder for memory-heavy workflows: stronger at synthesizing across retrieved memory chunks (real test: better at 'reconcile session 3's plan with session 5's findings'). Qwen 2.5 Coder wins on raw HumanEval; DeepSeek V2 Lite wins on multi-turn synthesis.
- 08 (Hardware) GPU: rtx-4090
RTX 4090 24GB is the workstation default. The added memory-retrieval workload doesn't need more VRAM; what changes is system RAM (Mem0 + Postgres + agent buffer = bump to 64GB).
What “memory-enabled” actually means
Most “agent memory” tutorials end at “append embeddings to a vector store.” That covers maybe 30% of the use case. A genuinely memory-enabled agent needs three distinct memory shapes working together:
- Episodic memory — what happened in past sessions. “Last Tuesday we tried approach X and it failed because Y.” Vector-based retrieval works here; this is what Mem0 / Zep / Graphiti excel at.
- Semantic memory — generalized knowledge extracted from episodes. “The auth_token validation module always handles expired tokens via the refresh path, never by raising 401.” Semantic memory is harder: it requires consolidation across episodes, which mature memory systems (Mem0g, Zep's temporal graphs) do at varying quality.
- Structured memory — facts that need exact lookup, not similarity search. “Tasks completed this week, ranked by status.” A Postgres table beats a vector store every time for this; the agent uses MCP-postgres to query directly.
The headline architectural choice this stack makes: three memory layers, three different access patterns. Mem0 handles episodic + semantic via vector retrieval; MCP-postgres handles structured exact-lookup; MCP-git handles repo-state grounding. Each layer is queried separately by the agent at different points in its planning loop.
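Why keep a relational layer next to the vector layer? A toy sketch makes the point (hypothetical data, SQLite standing in for the real Postgres-over-MCP layer): an exact-lookup question gets a deterministic, complete answer from SQL, which similarity search cannot guarantee.

```python
import sqlite3

# Toy stand-in for the structured layer: exact lookup over task rows.
# (Illustrative only -- the real stack uses Postgres via MCP-postgres.)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (session_id TEXT, task TEXT, status TEXT)")
db.executemany("INSERT INTO tasks VALUES (?, ?, ?)", [
    ("s1", "fix auth tests", "done"),
    ("s2", "profile slow tests", "in_progress"),
    ("s2", "apply auth fix", "blocked"),
])

# Exact question: "which session-2 tasks are not done?" A SQL query
# answers deterministically; a vector store can only return rows that
# *look* similar, with no completeness guarantee.
rows = db.execute(
    "SELECT task FROM tasks WHERE session_id = ? AND status != ?",
    ("s2", "done"),
).fetchall()
print([t for (t,) in rows])
```

This is the whole argument for the split: questions with a closed-form answer go to SQL, questions about "what happened that resembles this" go to the vector layer.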
Step-by-step setup
1. Bring up vLLM with the coding model
# DeepSeek Coder V2 Lite Instruct in AWQ — fits 24GB with 32K context
docker run --gpus all -d --name vllm \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
vllm/vllm-openai:v0.17.1 \
--model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill

--gpu-memory-utilization is set to 0.85 rather than 0.9 — memory-heavy workloads inject longer system prompts (memory chunks + tool schemas), which raise prefill peaks. The 5% headroom prevents OOM on long retrievals.
2. Set up Postgres + MCP-postgres for structured memory
# Postgres on the host — the agent's structured-knowledge DB
sudo -u postgres createdb agent_memory
sudo -u postgres psql -d agent_memory -c "
CREATE TABLE tasks (
id SERIAL PRIMARY KEY,
session_id TEXT,
task TEXT,
status TEXT,
outcome TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE decisions (
id SERIAL PRIMARY KEY,
session_id TEXT,
decision TEXT,
rationale TEXT,
outcome TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
"
# Create a least-privilege role for MCP — MCP-postgres should NEVER run
# as superuser. Pin a current, patched MCP version and keep the role read-only.
sudo -u postgres psql -d agent_memory -c "
CREATE ROLE mcp_reader LOGIN PASSWORD 'pin-strong-password';
GRANT CONNECT ON DATABASE agent_memory TO mcp_reader;
GRANT USAGE ON SCHEMA public TO mcp_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO mcp_reader;
"
# MCP-postgres server (stdio transport, allowlisted DB)
npx -y @modelcontextprotocol/server-postgres \
  postgresql://mcp_reader:pin-strong-password@localhost/agent_memory

Critical: the least-privilege role is what stops the older MCP-postgres SQL-injection escape from doing damage. Even if the wrapper's read-only mode is bypassed, the role can't modify anything and can't reach other databases. Belt-and-suspenders is mandatory here, not optional.
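To make the statement-stacking risk concrete, here is a minimal wrapper-level guard of the kind a read-only filter applies — a sketch, not the actual MCP-postgres code, and exactly the layer that the CVE showed can fail, which is why the GRANT-level role is the real backstop.

```python
def is_single_readonly_select(sql: str) -> bool:
    """Reject stacked statements and anything that isn't a SELECT.

    Sketch of wrapper-level filtering -- NOT the real MCP-postgres
    implementation, and never a substitute for the least-privilege
    role, which still holds if a filter like this is bypassed.
    (Naive: a semicolon inside a string literal is also rejected,
    which errs on the safe side.)
    """
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # a second statement is stacked on
        return False
    return stripped.upper().startswith("SELECT")

print(is_single_readonly_select("SELECT * FROM tasks;"))
print(is_single_readonly_select("SELECT 1; DROP TABLE tasks"))
```

The defense-in-depth reading: the filter blocks the easy attacks; the `mcp_reader` role blocks the clever ones.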
3. Wire OpenHands with Mem0 + the three MCP servers
# config.toml
[llm]
model = "openai/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct-AWQ"
api_base = "http://localhost:8000/v1"
api_key = "anything"
[mcp]
servers = [
{ command = "npx", args = ["-y", "@modelcontextprotocol/server-filesystem", "/home/you/projects/active"] },
{ command = "npx", args = ["-y", "@modelcontextprotocol/server-git", "--repository", "/home/you/projects/active"] },
{ command = "npx", args = ["-y", "@modelcontextprotocol/server-postgres", "postgresql://mcp_reader:pin-strong-password@localhost/agent_memory"] }
]
[memory]
provider = "mem0"

[memory.config.vector_store]
provider = "lancedb"
path = "/home/you/.mem0/lancedb"

[memory.config.llm]
provider = "openai"
api_base = "http://localhost:8000/v1"
model = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct-AWQ"

[memory.config.embedder]
provider = "ollama"
model = "mxbai-embed-large"
host = "http://localhost:11434"

Mem0 is configured to use LanceDB for vector storage, the local DeepSeek model for memory consolidation/extraction, and Ollama for embeddings — all local. (The Mem0 config is expressed as nested tables because TOML 1.0 doesn't allow inline tables to span multiple lines.) Mem0 stores its consolidated memory state in LanceDB files, separate from the structured-knowledge Postgres DB.
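What the embedder + vector store do at retrieval time can be approximated with a toy bag-of-words similarity search — purely illustrative (the real stack embeds with mxbai-embed-large and stores vectors in LanceDB), but it shows why "most relevant past episode" is a ranking, not a lookup.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity over sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy episodic store: session summaries stand in for embedded transcripts.
episodes = {
    "s1": "investigated slow auth integration tests, found fixture reuse bug",
    "s2": "refactored payment webhook retries",
    "s3": "proposed fix for auth test fixtures, not yet applied",
}
vecs = {k: Counter(v.split()) for k, v in episodes.items()}

query = Counter("apply the auth test fix".split())
ranked = sorted(vecs, key=lambda k: cosine(query, vecs[k]), reverse=True)
print(ranked[0])  # most relevant past episode by similarity
```

Note the corollary from the failure-modes section below: change the embedding model and every stored vector is in a different space, so the ranking becomes garbage — pin the embedder.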
4. Run a multi-session task
# Session 1: drop the agent into a real task
openhands run --plan-first \
--task "Investigate slow tests in tests/integration/auth/. \
Identify root cause; propose fix; run tests to verify."
# After session 1 closes, Mem0 consolidates episodic memory.
# Verify in the LanceDB store:
ls -la ~/.mem0/lancedb/
# Session 2 (next day): a follow-up task
openhands run --plan-first \
--task "Apply the auth-test fix you proposed yesterday. \
Run the full test suite. Document the change."
# Expected behavior: agent retrieves session 1 memory, knows the
# proposed fix, applies it without re-investigating from scratch.

Memory hierarchy: episodic / semantic / structured
The agent's planning loop queries memory layers in a specific order; getting this wrong is the most common source of confusing agent behavior:
- Episodic first (Mem0). “What happened in past sessions related to this task?” Vector similarity over session transcripts. Returns the 3-5 most relevant past episodes with summaries.
- Semantic next (Mem0g if enabled, or Mem0 consolidation). “What general patterns have we observed?” Returns consolidated insights across episodes. This is where Mem0g's graph variant outperforms — multi-hop reasoning over consolidated facts.
- Structured last (MCP-postgres). “What are the exact facts about prior tasks/decisions?” SQL queries via the MCP layer. Returns rows; agent reasons over them.
- Repo state on demand (MCP-git). “Has the change in question already been committed?” Crucial for catching memory drift — episodic memory says “we made change X” but git log says it never landed; the agent should trust git.
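The query order above can be sketched as a loop with stub layer functions (hypothetical names and data — the real agent calls Mem0 and the MCP servers). The key detail is the last step: the repo check is authoritative and overrides what episodic memory claims.

```python
# Stub memory layers -- in the real stack these are Mem0 and MCP calls.
def episodic(task):   return ["session 3: proposed fix X for auth tests"]
def semantic(task):   return ["auth module refreshes expired tokens, never 401s"]
def structured(task): return [("s3", "fix X", "proposed")]

def git_has_commit(change):
    # Ground truth from MCP-git: fix X was proposed but never committed.
    return False

def build_context(task):
    """Assemble planning context in the documented order."""
    ctx = {
        "episodic": episodic(task),      # 1. what happened
        "semantic": semantic(task),      # 2. what patterns hold
        "structured": structured(task),  # 3. exact facts
    }
    # 4. Memory-drift check: episodic memory may claim work that never
    # landed; git decides, not the transcript.
    ctx["fix_already_landed"] = git_has_commit("fix X")
    return ctx

ctx = build_context("apply the auth-test fix")
print(ctx["fix_already_landed"])  # False -> the agent must still apply it
```

Skipping step 4 is how agents end up "re-applying" changes that only ever existed in their own summaries.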
Failure modes you'll hit
- Memory drift between sessions. Episodic memory says one thing; the actual repo / database state says another. The agent confidently reasons against stale knowledge. Mitigation: always query MCP-git or MCP-postgres for ground truth before acting on episodic memory.
- Mem0 retrieval returns junk. Embedding model mismatch (changed model after ingestion) corrupts the store. Pin the embedding model in Mem0 config; re-build memory if you change it.
- Postgres MCP exceeds the read-only role. Older MCP versions had a statement-stacking escape. The mitigation is BOTH pinning current versions AND running with a least-privilege role. Don't skip the role even if the wrapper claims to enforce read-only.
- Memory consolidation goes off the rails. Mem0 consolidates episodic memory into semantic memory at session boundaries. If the consolidation prompt is wrong or the LLM hallucinates, semantic memory carries fabricated facts forward. Audit consolidated memory periodically; disable consolidation if it's adding more noise than signal.
- Tool-call timeout on long memory queries. Default MCP tool-call timeout is 30 seconds; complex postgres queries can exceed this. Configure per-tool timeouts in OpenHands; don't leave the default.
- Context window exhaustion. 3 memory queries × 500 tokens each + tool schemas + system prompt = often 3-5K tokens before the actual task. With a 32K window, that's fine; with 8K, you've burned half the window before reasoning. Use a 32K+ context model.
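The context-window arithmetic is worth running once. A quick sanity check, using the text's 3 × 500-token memory estimate plus assumed figures for tool schemas and the system prompt (the 1500/800 numbers are illustrative, not measured):

```python
# Rough prompt budget before any task reasoning happens.
memory_queries = 3
tokens_per_memory_chunk = 500
tool_schemas = 1500   # three MCP servers' tool schemas (assumed estimate)
system_prompt = 800   # agent instructions (assumed estimate)

overhead = memory_queries * tokens_per_memory_chunk + tool_schemas + system_prompt
print(overhead)  # tokens consumed before the task itself

for window in (8192, 32768):
    print(window, f"{overhead / window:.0%} of window consumed")
```

Under these assumptions an 8K window loses nearly half its budget to overhead while 32K loses around a tenth — which is the whole case for the 32K+ requirement.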
Variations and alternatives
Letta variation. Replace Mem0 with Letta when you need OS-style explicit memory management — paging memory blocks in and out, controlled archival, deliberate working-memory vs archival-memory split. The cost is more configuration; the benefit is more deterministic memory behavior.
Zep / Graphiti variation. For multi-hop reasoning over knowledge graphs (memory-as-graph rather than memory-as-vector), swap Mem0 for Zep or Graphiti. Slower lookup, better at “what did Bob decide about authentication three sessions ago and why?”
SGLang variation. Replace vLLM with SGLang if your agent loop has very stable system prompts (memory-injection patterns rarely change). RadixAttention's prefix-tree wins compound dramatically when the same memory-injection prefix is reused across many tool calls per session.
OpenClaw variation. Replace OpenHands with OpenClaw if you want the newer agent runtime — it's faster-moving but less battle-tested on memory integration. Wait for ecosystem maturity before promoting it to the default unless you have appetite for early-adopter pain.
Who should avoid this stack
- Single-session agent users. If your agent loops are within one conversation and you don't come back to them, the memory layer is overhead. Use the simpler local coding-agent stack instead.
- Anyone unwilling to monitor memory consolidation. Memory systems hallucinate. They consolidate episodes into semantic facts that look right but aren't. Without periodic auditing, the agent's memory becomes a confident fiction. If you can't commit to monthly memory review, skip the memory layer.
- Anyone whose threat model includes data exfiltration via the agent. A memory-enabled agent that has ever processed sensitive data carries that data forward indefinitely. If your sensitive-data exposure model is “the agent should never carry context across sessions,” the memory layer breaks that guarantee by design.
- Beginners learning agent infrastructure. Memory + 3 MCP servers + Mem0 consolidation + LanceDB + Postgres is a lot of moving parts. Master the simpler stacks first; promote to memory-enabled once the simpler patterns are second nature.
Going deeper
- Mem0 catalog entry — integration patterns, the consolidation cycle, the LanceDB backend.
- Letta catalog entry — the OS-style explicit memory alternative.
- /systems/mcp — the protocol layer the three MCP servers use, including the postgres CVE caveat.
- Local coding-agent stack — the simpler memory-less precursor, with the same OpenHands + vLLM + MCP filesystem/git foundation.