AURA: A Decoupled, State-Externalized Architecture for Long-Term Context Management in LLM-Based Agent Systems
Note: This paper is a system description and position paper. It does not constitute a formal proof of convergence or a controlled empirical evaluation. The authors welcome reproduction and critique.
Abstract
Current approaches to equipping Large Language Models (LLMs) with persistent memory through Retrieval-Augmented Generation (RAG) share a common weakness: the retrieved context degrades over time due to embedding staleness, context saturation, and lack of autonomous curation. This paper describes AURA (Advanced Unified Retrieval Architecture), an agentic middleware layer that externalizes episodic and long-term memory into a separate runtime with explicit lifecycle management.
The architecture organises six specialised agent roles into two orthogonal circuits: a data-curation loop (roles 1–3) that maintains memory quality, and an algorithmic control loop (roles 4–6) that adjusts retrieval parameters when the system detects degradation. The key contribution is a scope-targeted half-life decay mechanism that assigns different expiration rates to different knowledge types, a simple but underexploited pattern in production RAG deployments. We provide the architectural specification and a reference implementation in Python, and we discuss the convergence behaviour of the self-correcting loop using a simplified error-dynamics model.
1. Introduction
Large Language Models, by construction, are stateless functions: given an input sequence, they produce a token-level conditional distribution over the vocabulary. Any persistent behaviour—memory of a user, recall of past decisions, accumulation of domain knowledge—must be supplied externally through the context window. This is conventionally done via Retrieval-Augmented Generation (RAG), where relevant documents are fetched from a vector store and appended to the prompt [1].
In production LLM systems, RAG faces a well-documented set of challenges:
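- Context saturation. The context window has a fixed size (typically 4k–128k tokens). Once it is full, older entries are evicted regardless of importance, and models use information placed in the middle of a long context less reliably than information near its start or end [2].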
- Embedding staleness. A document indexed today may become irrelevant tomorrow. Without re-embedding or deprecation, the retriever continues to surface it.
- No autonomous curation. Vector databases store what they are given. No mechanism exists to flag, demote, or remove outdated or contradictory entries without manual intervention.
Several agentic frameworks have attempted to address these limitations. MemGPT (Letta) [3] introduces a virtual memory manager that pages context between a fast "working memory" and a slow "archival storage." AutoGen [4] and CrewAI [5] coordinate multiple LLM agents in conversation graphs, but do not enforce explicit memory lifecycle policies. Reflexion [6] adds a critic-evaluation step that reflects on agent output and feeds back into the next iteration, but operates at the level of a single episode, not over continuous long-term operation.
This paper describes AURA, a middleware architecture that treats memory not as a passive store but as an actively curated state layer with two properties: (i) each memory entry has a scope-dependent half-life after which its influence decays, and (ii) a background evaluation circuit monitors retrieval quality and adjusts parameters when degradation is detected.
The paper is organised as follows. Section 2 formalises the composite scoring function and scope-tuned decay. Section 3 describes the agent role topology. Section 4 presents the self-correcting circuit. Section 5 discusses a simplified error-dynamics model. Section 6 provides a reference implementation sketch. Section 7 surveys related work. Section 8 discusses limitations and open questions. Section 9 concludes.
2. Memory Model: Composite Scoring with Scope-Decay
Let a memory entry be represented as a tuple M_i = (e_i, t_i, s_i, imp_i), where e_i ∈ ℝ^d is a dense embedding, t_i is the wall-clock time of insertion or last verification, s_i is a scope label (e.g., "session_task", "domain_knowledge", "user_profile"), and imp_i ∈ [0,1] is a baseline importance score assigned at write time.
When a query Q arrives, the system retrieves k entries with the highest combined score:
where:
- sim(·,·) is cosine similarity between the query embedding and the entry embedding.
- Δt = t_now − t_i is the time elapsed since insertion or last verification.
- α, β, γ are intent-dependent coefficients summing to 1.0, currently set by a lightweight classifier.
- τ(s_i) is the scope-specific half-life, a configurable constant per scope.
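One form consistent with the definitions above is a convex combination of three terms. It is offered as a sketch and not necessarily as the exact expression used in AURA: the pairing of α, β, γ with similarity, importance, and freshness, the symbol e_Q for the query embedding, and the base-2 decay (which makes τ a half-life in the literal sense) are all assumptions.

```latex
\[
\mathrm{score}(Q, M_i)
  \;=\; \alpha\,\mathrm{sim}(e_Q, e_i)
  \;+\; \beta\,\mathrm{imp}_i
  \;+\; \gamma\,2^{-\Delta t/\tau(s_i)},
\qquad \alpha + \beta + \gamma = 1 .
\]
```

Under this form the freshness term equals 1 at Δt = 0, falls to 1/2 at Δt = τ(s_i), and stays at 1 when τ(s_i) = ∞.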
The half-life τ(s) determines how quickly an entry fades from retrieval unless re-verified. Key values in the current deployment:
| Scope | Half-life τ | Rationale |
|---|---|---|
| user_profile | ∞ (no decay) | User attributes rarely change |
| domain_knowledge | 90 days | Market conditions evolve seasonally |
| session_task | 3600 s | Transient task context |
| regulatory | ≈10 years | Legal references must persist |
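The table can be expressed as a small configuration plus a decay function. The sketch below is illustrative only: the names `HALF_LIFE_S` and `decay_factor` are hypothetical, the base-2 exponential is an assumed reading of "half-life", and a year is taken as 365 days.

```python
import math

DAY = 86_400  # seconds

# Half-lives from the table, converted to seconds.
# Assumption: one year is counted as 365 days.
HALF_LIFE_S = {
    "user_profile": math.inf,         # no decay
    "domain_knowledge": 90 * DAY,
    "session_task": 3_600,
    "regulatory": 10 * 365 * DAY,     # roughly 10 years
}


def decay_factor(scope: str, elapsed_s: float) -> float:
    """Return 2 ** (-elapsed / half_life).

    The factor is 1.0 for a freshly verified entry and 0.5 after exactly
    one half-life. Scopes with an infinite half-life never decay.
    """
    tau = HALF_LIFE_S[scope]
    if math.isinf(tau):
        return 1.0
    return 2.0 ** (-max(elapsed_s, 0.0) / tau)
```

With these values a `session_task` entry keeps half of its freshness after one hour and a quarter after two, while a `user_profile` entry keeps a factor of 1.0 indefinitely.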
Contribution. Standard RAG systems retrieve top-k by cosine similarity alone. Adding scope-dependent temporal decay is an architectural pattern that is simple, implementable on any vector store, and—to our knowledge—not formalised in the current literature on LLM memory management. Its effect is to monotonically demote entries beyond their useful horizon without requiring explicit deletion.
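To show how the pattern could sit on top of an existing vector store, the following sketch reranks a list of candidate entries with a composite score. It is not the AURA implementation. The `MemoryEntry` class, the function names, the assignment of α to similarity, β to importance, and γ to freshness, and the base-2 decay are all assumptions; the default weights simply reuse the initial values from the reference implementation in Section 6.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MemoryEntry:
    embedding: Sequence[float]  # e_i
    verified_at: float          # t_i, seconds since the epoch
    scope: str                  # s_i
    importance: float           # imp_i in [0, 1]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def composite_score(query, entry, now, half_life_s, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of similarity, importance, and freshness (assumed pairing)."""
    alpha, beta, gamma = weights
    tau = half_life_s[entry.scope]
    elapsed = max(now - entry.verified_at, 0.0)
    freshness = 1.0 if math.isinf(tau) else 2.0 ** (-elapsed / tau)
    return (alpha * cosine(query, entry.embedding)
            + beta * entry.importance
            + gamma * freshness)


def top_k(query, entries, k, now, half_life_s, weights=(0.5, 0.3, 0.2)):
    """Rerank candidate entries by composite score and keep the best k."""
    ranked = sorted(
        entries,
        key=lambda e: composite_score(query, e, now, half_life_s, weights),
        reverse=True,
    )
    return ranked[:k]
```

One plausible deployment is to over-fetch candidates from the vector store by cosine similarity alone and then pass them to `top_k`, so that the store itself needs no modification. Two entries that are identical except for age then rank in order of freshness.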
3. Agent Role Topology
Rather than proposing a universal optimal architecture, we describe the topology deployed in AURA and the rationale for each role. The system separates roles into two circuits.
3.1 Circuit A: Data Curation
| # | Role | Operation | Timing |
|---|---|---|---|
| 1 | Executor | Accepts query Q, computes composite score, assembles context, passes to LLM, returns response | Online (every request) |
| 2 | Evolution Generator | Generates synthetic edge-case queries from existing entries to expose gaps or contradictions | Background (low priority) |
| 3 | Gatekeeper | Evaluates proposed new entries (from user input or Role 2) for redundancy, contradiction, and factual grounding before committing to vector store | Near-line (async) |
Circuit A manages memory content only: what enters, when it leaves, and how it ranks. It does not modify retrieval parameters.
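As an illustration of the Gatekeeper's redundancy check only, the sketch below rejects a candidate whose embedding is a near-duplicate of one already stored. The function name `gatekeeper_admit` and the 0.95 threshold are hypothetical, and the contradiction and factual-grounding checks, which would likely require an LLM call, are deliberately left out.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gatekeeper_admit(
    candidate: Sequence[float],
    stored: Sequence[Sequence[float]],
    redundancy_threshold: float = 0.95,  # assumed value, not from the paper
) -> bool:
    """Return True if no stored embedding is a near-duplicate of the candidate."""
    return all(cosine(candidate, e) < redundancy_threshold for e in stored)
```

A linear scan is adequate for a sketch; against a real vector store the same test could be a single nearest-neighbour query followed by a threshold comparison.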
3.2 Circuit B: Parameter Control
| # | Role | Operation | Timing |
|---|---|---|---|
| 4 | Autonomous Coder | Adjusts retrieval parameters (α, β, γ, chunk size) when anomalous drift is detected | Offline (triggered) |
| 5 | Syndicator | Monitors rolling retrieval accuracy; triggers Circuit B if mean confidence drops below θ = 0.80 over N = 50 queries | Background (periodic) |
| 6 | Mentor | Verifies that parameter changes respect invariant constraints (e.g., Σw = 1.0, c_chunk ≥ 128) before deployment | Offline (guard) |
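The Mentor's guard can be sketched as a pure function over the current parameters and a proposed patch. This is a minimal illustration under stated assumptions: the function name `mentor_verify`, the key name `chunk_size`, and the requirement that each weight lie in [0, 1] are not taken from the paper, while the two invariants (weights summing to 1.0 and a chunk size of at least 128) come from the table.

```python
import math

MIN_CHUNK_SIZE = 128  # lower bound on c_chunk from the Mentor row


def mentor_verify(current: dict, patch: dict) -> bool:
    """Check invariants on the parameters that would result from the patch."""
    # Validate the merged result so that partial patches are handled correctly.
    merged = {**current, **patch}
    weights = [merged["alpha"], merged["beta"], merged["gamma"]]
    # Assumed constraint: each weight is a valid proportion.
    if any(w < 0.0 or w > 1.0 for w in weights):
        return False
    # Invariant from the table: the weights sum to 1.0.
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-5):
        return False
    # Invariant from the table: chunk size stays at or above the minimum.
    return merged["chunk_size"] >= MIN_CHUNK_SIZE
```

Checking the merged dictionary instead of the patch alone means a patch that changes only α is judged against the β and γ already in force.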
Why 6? This is not a theoretically optimal number. The current topology emerged from pragmatics: separating (a) data operations, (b) parameter adjustments, and (c) safety constraints into three layers per circuit. Fewer roles collapse these functions into single agents. More roles (9+) introduced coordination overhead that outweighed marginal gains in our deployment.
4. Self-Correcting Feedback Loop
Circuit B implements a closed-loop controller. The Syndicator (Role 5) periodically computes a rolling mean of validation scores. If the mean drops below threshold θ = 0.80, the system enters a lock state and invokes the Autonomous Coder (Role 4), which proposes adjusted hyperparameters. The Mentor (Role 6) applies invariant checks and, if satisfied, commits the patch.
Fig 1. AURA role-flow diagram.
5. A Simplified Model of Error Dynamics
We present a deliberately simplified model of how error propagates in a recursive single-circuit system and how a second control loop can bound it. This is not a proof of convergence—it is a speculative toy model intended to illustrate the intuition behind adding Circuit B.
Let E_t denote the number of degraded entries at epoch t. With only Circuit A:
If Δ_add > Δ_decay, E_t grows without bound. Circuit B introduces a correction term:
Limitation: This model treats "degraded entries" as a uniform quantity and ignores interaction effects. We include it as a motivating framework, not as a formal guarantee.
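To make the intuition concrete, the sketch below simulates one possible instance of the model. It assumes constant per-epoch rates, so that Circuit A alone gives E_{t+1} = E_t + Δ_add − Δ_decay, and it assumes that Circuit B removes a fixed fraction κ of the degraded entries each epoch, giving E_{t+1} = E_t + Δ_add − Δ_decay − κ·E_t. The proportional form of the correction, the symbol κ, and all numeric values are illustrative assumptions and are not taken from the companion notebook.

```python
def simulate(e0, delta_add, delta_decay, kappa=0.0, epochs=50):
    """Iterate the assumed recurrence and return the trajectory of E_t.

    kappa = 0.0 models Circuit A alone; kappa > 0 adds the assumed
    proportional correction from Circuit B. E_t is clamped at zero
    because a count of degraded entries cannot be negative.
    """
    e = float(e0)
    trajectory = [e]
    for _ in range(epochs):
        e = max(0.0, e + delta_add - delta_decay - kappa * e)
        trajectory.append(e)
    return trajectory


if __name__ == "__main__":
    open_loop = simulate(0, delta_add=5, delta_decay=3, epochs=50)
    closed_loop = simulate(0, delta_add=5, delta_decay=3, kappa=0.1, epochs=200)
    print(f"Circuit A only, after 50 epochs: {open_loop[-1]:.1f}")
    print(f"With Circuit B, after 200 epochs: {closed_loop[-1]:.3f}")
```

Under these assumptions the open-loop count grows linearly by Δ_add − Δ_decay per epoch, whereas the closed-loop count approaches the fixed point (Δ_add − Δ_decay)/κ whenever 0 < κ < 2. With the values above that fixed point is 20.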
6. Reference Implementation
A minimal Python implementation is provided in the companion repository. This is a reference sketch, not a production system.
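
    from collections import deque


    class AURAOrchestrator:
        """Demonstrates Circuit B activation on accuracy degradation."""

        WINDOW = 50        # N: size of the rolling validation window
        THRESHOLD = 0.80   # θ: minimum acceptable rolling mean
        MIN_SAMPLES = 3    # do not evaluate drift on fewer scores than this

        def __init__(self):
            self.hyperparameters = {
                "alpha": 0.5, "beta": 0.3, "gamma": 0.2,
            }
            # deque(maxlen=...) discards the oldest score automatically.
            self.recent_validations = deque(maxlen=self.WINDOW)
            self.system_lock = False

        def on_validation(self, score: float):
            self.recent_validations.append(score)

        def check_drift(self):
            if len(self.recent_validations) < self.MIN_SAMPLES:
                return
            mean = sum(self.recent_validations) / len(self.recent_validations)
            if mean < self.THRESHOLD:
                self._run_circuit_b()

        def _run_circuit_b(self):
            # Hold the lock state (Section 4) while a patch is evaluated.
            self.system_lock = True
            try:
                proposed = self._role4_propose_patch()
                if self._role6_verify(proposed):
                    self.hyperparameters.update(proposed)
                    self.recent_validations.clear()
            finally:
                self.system_lock = False

        def _role4_propose_patch(self):
            # Placeholder: a fixed patch stands in for the Autonomous Coder.
            return {"alpha": 0.3, "beta": 0.5, "gamma": 0.2}

        def _role6_verify(self, patch: dict) -> bool:
            # Invariant check: the weights must sum to 1.0.
            w = sum(patch.values())
            return abs(w - 1.0) <= 1e-5
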
7. Related Work
| Category | Works | Relationship |
|---|---|---|
| RAG | Lewis et al. (2020) [1] | Foundational retrieval framework |
| Context saturation | Liu et al. (2024) [2] | Lost-in-the-middle problem |
| Virtual memory for LLMs | MemGPT / Letta (2023) [3] | Shared goal; different approach (paging vs. decaying scores) |
| Multi-agent orchestration | AutoGen [4], CrewAI [5] | Similar role-decomposition; no built-in memory expiry |
| Agent reflection | Reflexion (2023) [6] | Single-episode critique vs. continuous monitoring |
| Model collapse | Shumailov et al. (2023) [7] | Recursive training degradation |
8. Limitations and Future Work
- No controlled evaluation. Claims are architectural and observational.
- Small deployment scale. Three concurrent users, two months, single real-estate agency.
- Synthetic generation quality. Role 2 uses the same LLM that produces responses, which creates a circular dependency.
- Cost. Six agent roles increase token consumption vs. single-shot RAG.
- Tuning sensitivity. θ = 0.80, N = 50 are empirically chosen.
9. Conclusion
We have presented AURA, a middleware layer that externalises memory into a curated state layer with scope-dependent decay. Our primary contribution—scope-dependent half-life scoring—is implementable on existing vector stores and addresses a genuine gap in production RAG deployments. The full codebase is available under Apache 2.0.
References
[1] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Proc. NeurIPS, 2020.
[2] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.
[3] C. Packer et al., "MemGPT: Towards LLMs as Operating Systems," arXiv:2310.08560, 2023.
[4] Q. Wu et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation," arXiv:2308.08155, 2023.
[5] CrewAI, "CrewAI: Framework for Orchestrating Autonomous AI Agents," 2024.
[6] N. Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," in Proc. NeurIPS, 2023.
[7] I. Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget," arXiv:2305.17493, 2023.
Simulation notebook for Section 5: simulations/error_dynamics.ipynb in the companion repository.