RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /How-to
  5. /How to Implement Agent Memory (Short and Long Term)
HOW-TO · RAG

How to Implement Agent Memory (Short and Long Term)

advanced·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Agent framework, vector store for long-term memory, Python 3.10+

What this does

Short-term memory holds the current conversation within the context window. Long-term memory persists important facts across sessions using a vector store, enabling the agent to recall user preferences and past interactions.

Steps

  • Implement short-term memory as a sliding window. Keep the most recent N messages.
from collections import deque

class ShortTermMemory:
    def __init__(self, max_messages: int = 10):
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list[dict]:
        return list(self.messages)

    def summarize(self, llm) -> str:
        text = "\n".join(m["content"] for m in self.messages)
        summary = llm.invoke(f"Summarize this conversation:\n{text}")
        return summary.content
  • Build long-term memory with a vector store. Store facts as embeddings.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain.schema import Document

class LongTermMemory:
    def __init__(self, collection_name: str = "agent_memory"):
        self.embeddings = OllamaEmbeddings(model="nomic-embed-text")
        self.store = Chroma(
            embedding_function=self.embeddings,
            collection_name=collection_name
        )

    def remember(self, fact: str, metadata: dict = None):
        doc = Document(page_content=fact, metadata=metadata or {})
        self.store.add_documents([doc])

    def recall(self, query: str, k: int = 5) -> list[str]:
        docs = self.store.similarity_search(query, k=k)
        return [d.page_content for d in docs]
  • Decide what to store long-term. Use the LLM to extract salient facts.
def extract_facts(conversation: str, llm) -> list[str]:
    prompt = f"""Extract important facts from this conversation that should be remembered long-term.
Return each fact on a new line.

Conversation:
{conversation}"""
    response = llm.invoke(prompt)
    return [f.strip() for f in response.content.split("\n") if f.strip()]
  • Integrate both memory types into the agent loop.
class MemoryAugmentedAgent:
    def __init__(self, llm, tools, stm: ShortTermMemory, ltm: LongTermMemory):
        self.llm = llm
        self.tools = tools
        self.stm = stm
        self.ltm = ltm

    def run(self, user_input: str) -> str:
        # Retrieve relevant long-term memories
        memories = self.ltm.recall(user_input)
        memory_context = "\n".join(f"Past fact: {m}" for m in memories)

        # Build prompt with both memory types
        prompt = f"{memory_context}\n\nConversation:\n"
        for msg in self.stm.get_context():
            prompt += f"{msg['role']}: {msg['content']}\n"
        prompt += f"user: {user_input}"

        response = self.llm.invoke(prompt)

        # Update short-term memory
        self.stm.add("user", user_input)
        self.stm.add("assistant", response.content)

        # Periodically extract long-term facts
        if len(self.stm.messages) >= 5:
            facts = extract_facts(prompt, self.llm)
            for fact in facts:
                self.ltm.remember(fact)

        return response.content

Verification

python -c "
from collections import deque
stm = deque(maxlen=3)
stm.append('msg1')
stm.append('msg2')
stm.append('msg3')
stm.append('msg4')
print(len(stm))
# Expected: 3 (sliding window)
"

Common failures

  • Memory staleness. Long-term memory stores facts that become outdated. Add an expiration timestamp or allow the user to delete memories.
  • Context window overflow from short-term memory. Even a sliding window of 20 messages can overflow a small context. Store summaries instead of raw messages.
  • Fact extraction captures trivial details. The LLM stores "User said 'OK'" as a fact. Use a stricter extraction prompt that requires informational value.
  • Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.

Related guides

  • How to Use Vector Store as Agent Memory
  • How to Manage Agent Context Window Limits
← All how-to guidesCourses →