AURA: A Decoupled, State-Externalized Architecture for Long-Term Context Management in LLM-Based Agent Systems

Authors: Alexey Voronin, Aurum Estate LLC
Category: Computer Science — Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
License: Apache License 2.0 · Repository: github.com/alexenti-code/AURA
Version: 2.0 (revised after peer review) · Date: June 2026

Note: This paper is a system description and position paper. It does not constitute a formal proof of convergence or a controlled empirical evaluation. The authors welcome reproduction and critique.

Abstract

Current approaches to equipping Large Language Models (LLMs) with persistent memory through Retrieval-Augmented Generation (RAG) share a common weakness: the retrieved context degrades over time due to embedding staleness, context saturation, and lack of autonomous curation. This paper describes AURA (Advanced Unified Retrieval Architecture), an agentic middleware layer that externalizes episodic and long-term memory into a separate runtime with explicit lifecycle management.

The architecture organises six specialised agent roles into two orthogonal circuits: a data-curation loop (roles 1–3) that maintains memory quality, and an algorithmic control loop (roles 4–6) that adjusts retrieval parameters when the system detects degradation. The key contribution is a scope-targeted half-life decay mechanism that assigns different expiration rates to different knowledge types—a simple but underexploited pattern in production RAG deployments. We provide the architectural specification, reference implementation in Python, and discuss the convergence properties of the self-correcting loop using a simplified error-dynamics model.

1. Introduction

Large Language Models, by construction, are stateless functions: given an input sequence, they produce a token-level conditional distribution over the vocabulary. Any persistent behaviour—memory of a user, recall of past decisions, accumulation of domain knowledge—must be supplied externally through the context window. This is conventionally done via Retrieval-Augmented Generation (RAG), where relevant documents are fetched from a vector store and appended to the prompt [1].

In production LLM systems, RAG faces a well-documented set of challenges:

Context saturation. The context window has a fixed size (typically 4k–128k tokens). Once it is full, older entries are evicted regardless of importance [2].
Embedding staleness. A document indexed today may become irrelevant tomorrow. Without re-embedding or deprecation, the retriever continues to surface it.
No autonomous curation. Vector databases store what they are given. No mechanism exists to flag, demote, or remove outdated or contradictory entries without manual intervention.

Several agentic frameworks have attempted to address these limitations. MemGPT (Letta) [3] introduces a virtual memory manager that pages context between a fast "working memory" and a slow "archival storage." AutoGen [4] and CrewAI [5] coordinate multiple LLM agents in conversation graphs, but do not enforce explicit memory lifecycle policies. Reflexion [6] adds a critic-evaluation step that reflects on agent output and feeds back into the next iteration, but operates at the level of a single episode, not over continuous long-term operation.

This paper describes AURA, a middleware architecture that treats memory not as a passive store but as an actively curated state layer with two properties: (i) each memory entry has a scope-dependent half-life after which its influence decays, and (ii) a background evaluation circuit monitors retrieval quality and adjusts parameters when degradation is detected.

The paper is organised as follows. Section 2 formalises the composite scoring function and scope-tuned decay. Section 3 describes the agent role topology. Section 4 presents the self-correcting circuit. Section 5 discusses a simplified error-dynamics model. Section 6 provides a reference implementation sketch. Section 7 surveys related work. Section 8 discusses limitations and open questions.

2. Memory Model: Composite Scoring with Scope-Decay

Let a memory entry be represented as a tuple M_i = (e_i, t_i, s_i, imp_i), where e_i ∈ ℝ^d is a dense embedding, t_i is the wall-clock time of insertion or last verification, s_i is a scope label (e.g., "session_task", "domain_knowledge", "user_profile"), and imp_i ∈ [0,1] is a baseline importance score assigned at write time.

When a query Q arrives, the system retrieves k entries with the highest combined score:

Score(M_i, Q) = α · sim(e_Q, e_i) + β · exp(−Δt / τ(s_i)) + γ · imp_i

where:

sim(·,·) is cosine similarity between the query embedding and the entry embedding.
Δt = t_now − t_i is elapsed time since verification.
α, β, γ are intent-dependent coefficients summing to 1.0, currently set by a lightweight classifier.
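τ(s_i) is the scope-specific decay constant, configurable per scope and referred to in this paper as the half-life. Strictly, with the exponential form above the recency term falls to 1/e at Δt = τ and to one half at Δt = τ · ln 2.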

The half-life τ(s) determines how quickly an entry fades from retrieval unless re-verified. Key values in the current deployment:

Scope	τ	Rationale
`user_profile`	∞ (no decay)	User attributes rarely change
`domain_knowledge`	90 days	Market conditions evolve seasonally
`session_task`	3600 s	Transient task context
`regulatory`	≈10 years	Legal references must persist

Contribution. Standard RAG systems retrieve top-k by cosine similarity alone. Adding scope-dependent temporal decay is an architectural pattern that is simple, implementable on any vector store, and—to our knowledge—not formalised in the current literature on LLM memory management. Its effect is to monotonically demote entries beyond their useful horizon without requiring explicit deletion.
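The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the repository code: the names MemoryEntry, TAU, recency, score and top_k are hypothetical, the default weights are the ones used in the reference sketch of Section 6, and "≈10 years" is approximated as 3650 days.

```python
import math
from dataclasses import dataclass

DAY = 86_400.0  # seconds

# Decay constants per scope, in seconds, taken from the table above.
# math.inf encodes "no decay"; 3650 days is an assumed stand-in for ~10 years.
TAU = {
    "user_profile": math.inf,
    "domain_knowledge": 90 * DAY,
    "session_task": 3600.0,
    "regulatory": 3650 * DAY,
}


@dataclass
class MemoryEntry:
    embedding: list[float]  # e_i
    verified_at: float      # t_i, seconds since epoch
    scope: str              # s_i
    importance: float       # imp_i in [0, 1]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def recency(dt, tau):
    """exp(-dt / tau); an infinite tau yields 1.0, i.e. no decay."""
    return math.exp(-max(dt, 0.0) / tau)


def score(entry, query_emb, now, alpha=0.5, beta=0.3, gamma=0.2):
    """Composite score: similarity + scope-dependent recency + importance."""
    dt = now - entry.verified_at
    return (alpha * cosine(query_emb, entry.embedding)
            + beta * recency(dt, TAU[entry.scope])
            + gamma * entry.importance)


def top_k(entries, query_emb, now, k=3, **weights):
    """Return the k entries with the highest composite score."""
    ranked = sorted(entries, key=lambda m: score(m, query_emb, now, **weights),
                    reverse=True)
    return ranked[:k]
```

Under these assumptions, a `session_task` entry verified one hour ago has a recency term of e^(−1) ≈ 0.368, so its recency contribution drops from 0.30 to about 0.11, while a `user_profile` entry of any age keeps the full 0.30.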

3. Agent Role Topology

Rather than proposing a universal optimal architecture, we describe the topology deployed in AURA and the rationale for each role. The system separates roles into two circuits.

3.1 Circuit A: Data Curation

#	Role	Operation	Timing
1	Executor	Accepts query Q, computes composite score, assembles context, passes to LLM, returns response	Online (every request)
2	Evolution Generator	Generates synthetic edge-case queries from existing entries to expose gaps or contradictions	Background (low priority)
3	Gatekeeper	Evaluates proposed new entries (from user input or Role 2) for redundancy, contradiction, and factual grounding before committing to vector store	Near-line (async)

Circuit A manages memory content only: what enters, when it leaves, and how it ranks. It does not modify retrieval parameters.
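As an illustration of the Gatekeeper's redundancy check only, one possible sketch rejects a candidate whose embedding is nearly identical to a stored one. The paper does not specify how redundancy is measured, so the cosine test, the 0.95 threshold, and the names is_redundant and gatekeep are all assumptions; contradiction and factual-grounding checks are not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def is_redundant(candidate, stored, threshold=0.95):
    """True if the candidate is a near-duplicate of any stored embedding.

    The threshold is a hypothetical value, not taken from the paper.
    """
    return any(cosine(candidate, e) >= threshold for e in stored)


def gatekeep(candidate, stored, threshold=0.95):
    """Commit the candidate to `stored` unless it is redundant.

    Returns True if the candidate was committed, False if it was rejected.
    """
    if is_redundant(candidate, stored, threshold):
        return False
    stored.append(candidate)
    return True
```

Here `stored` is a plain list standing in for the vector store; a real deployment would query the store's nearest-neighbour index instead of scanning every entry.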

3.2 Circuit B: Parameter Control

#	Role	Operation	Timing
4	Autonomous Coder	Adjusts retrieval parameters (α, β, γ, chunk size) when anomalous drift is detected	Offline (triggered)
5	Syndicator	Monitors rolling retrieval accuracy; triggers Circuit B if mean confidence drops below θ = 0.80 over N = 50 queries	Background (periodic)
6	Mentor	Verifies that parameter changes respect invariant constraints (e.g., Σw = 1.0, c_chunk ≥ 128) before deployment	Offline (guard)
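The Mentor's guard can be sketched as a pure function that lists the violated invariants. This is a hedged example: the function name verify_patch and the key chunk_size are hypothetical, the tolerance 1e-5 mirrors the reference sketch in Section 6, and only the two invariants named in the table are checked.

```python
def verify_patch(current: dict, patch: dict,
                 tol: float = 1e-5, min_chunk: int = 128) -> list[str]:
    """Return the invariants a patch would violate (empty list means safe).

    The patch is merged over the current parameters first, so a partial
    patch is judged by the configuration it would actually produce.
    """
    merged = {**current, **patch}
    violations = []

    # Invariant 1: the three score weights sum to 1.0.
    weight_sum = merged["alpha"] + merged["beta"] + merged["gamma"]
    if abs(weight_sum - 1.0) > tol:
        violations.append("weights must sum to 1.0")

    # Invariant 2: chunk size stays at or above the minimum.
    if merged["chunk_size"] < min_chunk:
        violations.append("chunk_size must be >= 128")

    return violations
```

Returning the list of violations rather than a bare boolean is a design choice made for this sketch: it lets the Autonomous Coder see why a proposal was rejected.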

Why 6? This is not a theoretically optimal number. The current topology emerged from pragmatics: separating (a) data operations, (b) parameter adjustments, and (c) safety constraints into three layers per circuit. Fewer roles collapse these functions into single agents. More roles (9+) introduced coordination overhead that outweighed marginal gains in our deployment.

4. Self-Correcting Feedback Loop

Circuit B implements a closed-loop controller. The Syndicator (Role 5) periodically computes a rolling mean of validation scores. If the mean drops below threshold θ = 0.80, the system enters a lock state and invokes the Autonomous Coder (Role 4), which proposes adjusted hyperparameters. The Mentor (Role 6) applies invariant checks and, if satisfied, commits the patch.

            ┌──────────────────────────────────┐
            │  External State (Vector Store)   │
            └──────────────────────────────────┘
                ▲             ▲        ▲
[R4] writes ────┤ Read ───────┤   ┌────┤
                │             │   │    │
          ┌─────┴─────┐   ┌───┴───┴──┐ │
          │    R3     │   │    R1    │ │
          │ Gatekeeper│   │ Executor │ │
          └─────▲─────┘   └─────┬────┘ │
                │               │      │
           ┌────┴────┐  metrics │      │
           │   R2    │          │      │
           │   Gen   │  ┌───────┴──────┴────┐
           └─────────┘  │   R5 Syndicator   │
                        └─────────┬─────────┘
                                  │ trigger
                            ┌─────┴─────┐
                            │ R4 Coder  │
                            └─────▲─────┘
                                  │ verify
                            ┌─────┴─────┐
                            │ R6 Mentor │
                            └───────────┘

Fig 1. AURA role-flow diagram.

5. A Simplified Model of Error Dynamics

We present a deliberately simplified model of how error propagates in a recursive single-circuit system and how a second control loop can bound it. This is not a proof of convergence—it is a speculative toy model intended to illustrate the intuition behind adding Circuit B.

Let E_t denote the number of degraded entries at epoch t. With only Circuit A:

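E_{t+1} = E_t + Δ_add − Δ_decay

where Δ_add and Δ_decay are the per-epoch counts of newly degraded entries and of entries retired by decay. If Δ_add > Δ_decay, E_t grows linearly without bound. Circuit B introduces a correction term proportional to the current error:

E_{t+1} = E_t + (Δ_add − γ_t · E_t) − Δ_decay

Here γ_t is the correction gain of Circuit B at epoch t; it is unrelated to the importance weight γ of Section 2.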

Limitation: This model treats "degraded entries" as a uniform quantity and ignores interaction effects. We include it as a motivating framework, not as a formal guarantee.
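The two recurrences can be iterated directly to see the difference in behaviour. The sketch below is not the companion notebook; the function name simulate and all numeric values are illustrative, and the gain is held constant (the model allows it to vary per epoch). For a constant gain g with 0 < g < 2, the closed-loop recurrence reads E_{t+1} = (1 − g) · E_t + (Δ_add − Δ_decay) and converges to the fixed point E* = (Δ_add − Δ_decay) / g.

```python
def simulate(e0, d_add, d_decay, gain=0.0, steps=200):
    """Iterate E_{t+1} = E_t + (d_add - gain * E_t) - d_decay.

    gain = 0.0 reproduces the Circuit A only recurrence; a constant
    gain > 0 stands in for the Circuit B correction term.
    Returns the full trajectory, including the initial value.
    """
    e = float(e0)
    history = [e]
    for _ in range(steps):
        e = e + (d_add - gain * e) - d_decay
        history.append(e)
    return history


# Illustrative values only: 5 entries degrade and 2 decay per epoch.
open_loop = simulate(0, d_add=5, d_decay=2, gain=0.0, steps=100)
closed_loop = simulate(0, d_add=5, d_decay=2, gain=0.1, steps=500)
```

With these values the open-loop series reaches 300 after 100 epochs and keeps growing, while the closed-loop series settles near (5 − 2) / 0.1 = 30.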

6. Reference Implementation

A minimal Python implementation is provided in the companion repository. This is a reference sketch, not a production system.

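from collections import deque


class AURAOrchestrator:
    """Demonstrates Circuit B activation on accuracy degradation."""

    THETA = 0.80     # drift threshold on the rolling mean (Section 4)
    WINDOW = 50      # N: number of recent validation scores kept
    MIN_SAMPLES = 3  # do not judge drift on fewer scores than this

    def __init__(self):
        # Composite-score weights from Section 2; they must sum to 1.0.
        self.hyperparameters = {
            "alpha": 0.5, "beta": 0.3, "gamma": 0.2,
        }
        # deque(maxlen=...) drops the oldest score automatically.
        self.recent_validations = deque(maxlen=self.WINDOW)
        self.system_lock = False

    def on_validation(self, score: float) -> None:
        """Role 5 input: record one validation score."""
        self.recent_validations.append(score)

    def check_drift(self) -> None:
        """Role 5: trigger Circuit B if the rolling mean falls below THETA."""
        if self.system_lock or len(self.recent_validations) < self.MIN_SAMPLES:
            return
        mean = sum(self.recent_validations) / len(self.recent_validations)
        if mean < self.THETA:
            self._run_circuit_b()

    def _run_circuit_b(self) -> None:
        """Lock, let Role 4 propose a patch, commit it if Role 6 approves."""
        self.system_lock = True
        try:
            proposed = self._role4_propose_patch()
            if self._role6_verify(proposed):
                self.hyperparameters.update(proposed)
                # Start a fresh window so old scores do not re-trigger.
                self.recent_validations.clear()
        finally:
            self.system_lock = False

    def _role4_propose_patch(self) -> dict:
        # Placeholder: a fixed patch that shifts weight from similarity to recency.
        return {"alpha": 0.3, "beta": 0.5, "gamma": 0.2}

    def _role6_verify(self, patch: dict) -> bool:
        # Invariant: the weights in the patch must sum to 1.0.
        w = sum(patch.values())
        return abs(w - 1.0) <= 1e-5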

7. Related Work

Category	Works	Relationship
RAG	Lewis et al. (2020) [1]	Foundational retrieval framework
Context saturation	Liu et al. (2024) [2]	Lost-in-the-middle problem
Virtual memory for LLMs	MemGPT / Letta (2023) [3]	Shared goal; different approach (paging vs. decaying scores)
Multi-agent orchestration	AutoGen [4], CrewAI [5]	Similar role-decomposition; no built-in memory expiry
Agent reflection	Reflexion (2023) [6]	Single-episode critique vs. continuous monitoring
Model collapse	Shumailov et al. (2023) [7]	Recursive training degradation

8. Limitations and Future Work

No controlled evaluation. Claims are architectural and observational.
Small deployment scale. Three concurrent users, two months, single real-estate agency.
Synthetic generation quality. Role 2 uses the same LLM that produces the responses, which creates a circular dependency.
Cost. Six agent roles increase token consumption vs. single-shot RAG.
Tuning sensitivity. θ = 0.80, N = 50 are empirically chosen.

9. Conclusion

We have presented AURA, a middleware layer that externalises memory into a curated state layer with scope-dependent decay. Our primary contribution—scope-dependent half-life scoring—is implementable on existing vector stores and addresses a genuine gap in production RAG deployments. The full codebase is available under Apache 2.0.

References

[1] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Proc. NeurIPS, 2020.

[2] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.

[3] C. Packer et al., "MemGPT: Towards LLMs as Operating Systems," arXiv:2310.08560, 2023.

[4] Q. Wu et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation," arXiv:2308.08155, 2023.

[5] CrewAI, "CrewAI: Framework for Orchestrating Autonomous AI Agents," 2024.

[6] N. Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," in Proc. NeurIPS, 2023.

[7] I. Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget," arXiv:2305.17493, 2023.

Simulation notebook for Section 5: simulations/error_dynamics.ipynb in the companion repository.
