TL;DR: A scalable memory framework for LLMs that combines short-term context, adaptive summaries, and cloud-based recall to achieve human-like memory without exceeding context limits.
As LLMs move toward persistent agents and multi-session applications, efficient memory handling becomes critical. This repository outlines a novel architectural approach to solve the context window bottleneck and nuance degradation in long-running conversations with LLMs.
LLMs simulate memory by repeatedly including prior messages within the prompt, an approach that quickly hits context limits, inflates latency, and increases inference cost.
Current solutions are insufficient:
- Full Context Inclusion: The entire conversation is re-sent with every prompt until context is truncated. This is computationally expensive, increases latency, and significantly raises the per-token cost of inference.
- Over-Summarization: Aggressive summarization reduces token usage but often erases critical nuance and continuity. It also fails to dynamically adapt to what the user or model actually needs.
Both paradigms result in artificial "scroll-based" memory, which degrades conversational quality in extended user sessions and is fundamentally unscalable.
To overcome these limitations, this repository proposes a three-tiered hybrid memory architecture that mimics human cognition, balancing fast, limited working memory with scalable, persistent recall.
Memory Tier | Content / Purpose | Location | Retrieval Mechanism |
---|---|---|---|
1. Short-Term (Working Memory) | The last 2-5 exchanges. Maintains immediate flow. | Standard LLM Context Window | Implicit (always included) |
2. Summarized (Directional Memory) | A compact, generalized summary of older exchanges. Preserves topic and direction without verbatim detail. | Standard LLM Context Window | Dynamic (updated/included each turn) |
3. Cloud-Persistent (Source-of-Truth Memory) | Full, verbatim conversation history. Indexed for quick search and precision recall. Preserves all nuance. | Persistent Cloud Storage (Vector Database) | On-Demand Precision Fetch (by the model or user) |
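As a rough illustration, the three tiers could be modeled as follows. This is a minimal Python sketch; the class names, field names, and the 5-exchange window are illustrative assumptions, not part of the specification.

```python
from dataclasses import dataclass, field

@dataclass
class ShortTermMemory:
    """Tier 1: the last few verbatim exchanges, always included in the prompt."""
    max_exchanges: int = 5
    exchanges: list[tuple[str, str]] = field(default_factory=list)  # (user, assistant) pairs

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.exchanges.append((user_msg, assistant_msg))
        self.exchanges = self.exchanges[-self.max_exchanges:]  # keep only the working window

@dataclass
class SummarizedMemory:
    """Tier 2: a compact, rolling summary kept inside the context window."""
    summary: str = ""

@dataclass
class CloudPersistentMemory:
    """Tier 3: full verbatim history, indexed in a vector store for precision recall."""
    store_uri: str = "vector-db://conversations"  # placeholder location (assumption)
```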
The critical innovation is the On-Demand Precision Fetch mechanism, which allows the model to dynamically query and retrieve specific, relevant, verbatim chunks from Cloud-Persistent memory only when it needs them.
- User Query Arrives: The new user message is received.
- Working Prompt Construction: The Short-Term messages and Summarized directional memory are added to the prompt.
- Model Decision: If the model or memory manager determines that key information is missing for a high-quality answer, or if the embedding search yields a high relevance score:
  - The relevant verbatim text chunk is retrieved.
  - It is injected into the current prompt as additional context.
- Inference: The LLM processes the compact, high-relevance prompt to generate the response.
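A minimal sketch of this flow is shown below. The names `embed`, `vector_store.search`, and `RELEVANCE_THRESHOLD` are illustrative placeholders for whatever embedding model and vector database are actually used; the threshold value is an assumption to be tuned.

```python
RELEVANCE_THRESHOLD = 0.75  # assumed cutoff; choosing it well is an open question

def build_prompt(user_query, short_term, summary, embed, vector_store, top_k=3):
    """Construct the compact working prompt, fetching verbatim chunks only on demand."""
    # 1. Always include the directional summary and the short-term exchanges.
    parts = [f"Summary of earlier conversation:\n{summary}"]
    for user_msg, assistant_msg in short_term:
        parts.append(f"User: {user_msg}\nAssistant: {assistant_msg}")

    # 2. Embed the new query and search the cloud-persistent index.
    query_vec = embed(user_query)
    hits = vector_store.search(query_vec, top_k=top_k)  # -> [(score, verbatim_chunk), ...]

    # 3. Inject verbatim chunks only when relevance is high enough.
    recalled = [chunk for score, chunk in hits if score >= RELEVANCE_THRESHOLD]
    if recalled:
        parts.append("Recalled from earlier sessions (verbatim):\n" + "\n---\n".join(recalled))

    # 4. Append the new user message; the result is the compact, high-relevance prompt.
    parts.append(f"User: {user_query}")
    return "\n\n".join(parts)
```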
Users can also explicitly request recall:
“Hey, bring back what we discussed about project Phoenix last week.”
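Explicit recall can reuse the same retrieval path: the user's request text itself becomes the search query. Continuing the illustrative sketch above (same hypothetical `embed` and `vector_store`):

```python
# The user's phrasing is embedded directly and matched against stored turns;
# results are surfaced to the user rather than silently injected into the prompt.
request = "Bring back what we discussed about project Phoenix last week."
for score, chunk in vector_store.search(embed(request), top_k=5):
    print(f"[relevance {score:.2f}] {chunk}")
```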
Feature | Hybrid Memory Architecture | Legacy Method |
---|---|---|
Nuance Preservation | High. Full original text is always available for recall. | Low. Risk of information loss through summarization/pruning. |
Efficiency & Cost | High. Only relevant info is injected; lower token count per request. | Low. Wasted tokens from resending large, entire conversations. |
Conversational Depth | Effectively unlimited. Memory scales independently of the fixed context window. | Limited. Hard-capped by the LLM's fixed context window size. |
Architecture | Mimics human memory (working + long-term recall). | Fakes memory with a linear text buffer. |
Adaptivity | Adaptive, topic-based recall. | Static, fixed context injection. |
User Control | Explicit recall + memory visibility. | Implicit and uncontrollable memory. |
- Persistent Log Storage: A secure, scalable database to house the full, raw conversation history.
- Embedding Indexer: A pipeline that converts every conversation turn into a searchable vector embedding and stores it in a Vector Database (sketched after this list).
- Memory Manager Agent: A service layer that sits between the user and the LLM, responsible for:
  - Handling the embedding search and relevance scoring.
  - Constructing the compact working prompt (Short-Term + Summary).
  - Injecting retrieved verbatim context on-demand.
- LLM Integration: Prompting strategies that instruct the LLM on how to utilize the injected Summarized and On-Demand context efficiently.
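One way the Embedding Indexer could be wired up is sketched below, assuming a generic embedding function and a vector database client that exposes an `upsert` method; none of these names are prescribed by the design.

```python
import uuid

class EmbeddingIndexer:
    """Converts every conversation turn into a vector and stores it with its verbatim text."""

    def __init__(self, embed_fn, vector_db):
        self.embed_fn = embed_fn    # e.g. a sentence-embedding model
        self.vector_db = vector_db  # hypothetical client with upsert(id, vector, payload)

    def index_turn(self, session_id: str, role: str, text: str) -> str:
        turn_id = str(uuid.uuid4())
        self.vector_db.upsert(
            id=turn_id,
            vector=self.embed_fn(text),
            payload={"session": session_id, "role": role, "text": text},  # verbatim source of truth
        )
        return turn_id
```

The Memory Manager Agent would call `index_turn` after every exchange and run the search side of the same store when constructing prompts.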
- Massive cost reduction: By minimizing prompt size and avoiding redundant context, this system can significantly cut per-message inference costs.
- Improved long-term coherence: Retains factual and stylistic continuity across weeks or months of interaction.
- Human-like reasoning continuity: Mimics short-term focus with long-term associative recall, improving “self-consistency” across topics.
- Developer control: Exposes APIs for selective recall, summarization, and user-controlled memory visibility.
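A developer-facing surface for these controls might look like the following; the method names (`recall`, `summarize_now`, `export_memory`) are assumptions for illustration, not a published API.

```python
from typing import Protocol

class MemoryAPI(Protocol):
    """Illustrative developer-facing interface for selective recall and memory visibility."""

    def recall(self, query: str, scope: str = "current_session") -> list[str]:
        """Return verbatim excerpts relevant to the query."""

    def summarize_now(self) -> str:
        """Force a refresh of the directional summary and return it."""

    def export_memory(self, session_id: str) -> list[dict]:
        """Expose stored turns so users can inspect (or delete) their memory."""
```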
- Global On-Demand Recall: Allow the system to retrieve information from any past conversation with the user, not just the current session.
- Context-Aware Prioritization: Prioritize recall from the current conversation or active topic before searching across older sessions, reducing latency and improving relevance.
- Model-Initiated Precision Requests: Enable the model itself to issue targeted recall queries for specific topics, entities, or events when it detects missing context or low confidence (see the sketch after this list).
- Extensible Memory Hooks: Design the memory manager with open interfaces so future modules (e.g., emotional state tracking, personalization layers, or domain-specific memory) can be easily integrated.
- Open Research Space: Encourage experimentation with adaptive summarization, cross-session linking, and relevance scoring methods that balance cost, precision, and recall fidelity.
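For the model-initiated recall direction above, one plausible realization is to expose the precision fetch as a tool the model can call when it detects missing context. The schema and handler below are illustrative only, not a committed interface.

```python
# Hypothetical tool definition the Memory Manager could register with a
# function-calling-capable LLM, letting the model itself request precision recall.
RECALL_TOOL = {
    "name": "recall_memory",
    "description": "Fetch verbatim excerpts from past conversations about a topic, entity, or event.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "What to search for in past conversations."},
            "scope": {
                "type": "string",
                "enum": ["current_session", "all_sessions"],
                "description": "Search the active conversation first, or all past sessions.",
            },
        },
        "required": ["query"],
    },
}

def handle_recall_call(args, embed, vector_store):
    """Run by the Memory Manager when the model invokes recall_memory."""
    hits = vector_store.search(embed(args["query"]), top_k=3)
    return [chunk for _, chunk in hits]
```

The `scope` parameter also gives a natural hook for context-aware prioritization: search the active session first, then widen to older sessions only if nothing relevant is found.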
- Latency from memory lookups and vector searches.
- Cost trade-offs between embedding generation and retrieval.
- Determining when and how to summarize safely.
- Creating intuitive prompts for user-initiated recall.
- Implement adaptive summarization with entropy-based compression.
- Explore latency-optimized vector DBs (FAISS, Milvus) for sub-100 ms retrieval (see the sketch after this list).
- Add memory confidence scoring and visual heatmaps for explainability.
- Test across domains (coding, storytelling, tutoring) to measure recall precision vs. cost.
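As a starting point for the retrieval-latency experiments above, a flat FAISS index over turn embeddings is easy to stand up. The 384-dimension size assumes a small sentence-embedding model; swap in whatever encoder is actually used.

```python
import numpy as np
import faiss

DIM = 384  # assumed embedding dimension (e.g. a small sentence-transformer model)

index = faiss.IndexFlatIP(DIM)  # exact inner-product search; fine for modest corpora
turn_texts: list[str] = []      # verbatim turns, aligned with index positions

def add_turn(text: str, vector: np.ndarray) -> None:
    v = np.ascontiguousarray(vector.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(v)       # normalized inner product == cosine similarity
    index.add(v)
    turn_texts.append(text)

def recall(query_vec: np.ndarray, k: int = 3) -> list[tuple[float, str]]:
    q = np.ascontiguousarray(query_vec.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(float(s), turn_texts[i]) for s, i in zip(scores[0], ids[0]) if i != -1]
```

Swapping `IndexFlatIP` for an IVF or HNSW index is the usual next step once the corpus grows large enough that exact search breaks the latency budget.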
This work extends prior architectures such as Retrieval-Augmented Generation (RAG) and MemGPT, but focuses on verbatim recall fidelity and adaptive, context-aware summarization for multi-session persistence.
This project is currently in the conceptual and architectural design phase. It aims to serve as a research foundation and reference framework for building highly scalable, cost-efficient, and nuanced conversational AI.
Open to feedback, simulation, or collaboration.
- Email: forwork2k24@gmail.com
This design was inspired by an in-depth exploration of current LLM memory limitations and the desire for more human-like recall mechanisms.
This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to use, modify, and share this work for any purpose, with attribution.