
AURA: A Decoupled, State-Externalized Architecture for Long-Term Context Management in LLM-Based Agent Systems

Authors: Alexey Voronin, Aurum Estate LLC
Category: Computer Science — Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
License: Apache License 2.0 · Repository: github.com/alexenti-code/AURA
Version: 2.0 (revised after peer review) · Date: June 2026

Note: This paper is a system description and position paper. It does not constitute a formal proof of convergence or a controlled empirical evaluation. The authors welcome reproduction and critique.

Abstract

Current approaches to equipping Large Language Models (LLMs) with persistent memory through Retrieval-Augmented Generation (RAG) share a common weakness: the retrieved context degrades over time due to embedding staleness, context saturation, and lack of autonomous curation. This paper describes AURA (Advanced Unified Retrieval Architecture), an agentic middleware layer that externalizes episodic and long-term memory into a separate runtime with explicit lifecycle management.

The architecture organises six specialised agent roles into two orthogonal circuits: a data-curation loop (Roles 1–3) that maintains memory quality, and an algorithmic control loop (Roles 4–6) that adjusts retrieval parameters when the system detects degradation. The key contribution is a scope-targeted half-life decay mechanism that assigns different expiration rates to different knowledge types, a simple but underexploited pattern in production RAG deployments. We provide the architectural specification and a reference implementation in Python, and we discuss the behaviour of the self-correcting loop using a simplified error-dynamics model.

1. Introduction

Large Language Models are, by construction, stateless functions: given an input sequence, they produce a conditional probability distribution over the next token. Any persistent behaviour (memory of a user, recall of past decisions, accumulation of domain knowledge) must be supplied externally through the context window. This is conventionally done via Retrieval-Augmented Generation (RAG), where relevant documents are fetched from a vector store and appended to the prompt [1].

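In production LLM systems, RAG faces a well-documented set of challenges. These include embedding staleness, context saturation [2], and the lack of autonomous curation of stored knowledge.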

Several agentic frameworks have attempted to address these limitations. MemGPT (Letta) [3] introduces a virtual memory manager that pages context between a fast "working memory" and a slow "archival storage." AutoGen [4] and CrewAI [5] coordinate multiple LLM agents in conversation graphs, but do not enforce explicit memory lifecycle policies. Reflexion [6] adds a critic-evaluation step that reflects on agent output and feeds back into the next iteration, but operates at the level of a single episode, not over continuous long-term operation.

This paper describes AURA, a middleware architecture that treats memory not as a passive store but as an actively curated state layer with two properties: (i) each memory entry has a scope-dependent half-life after which its influence decays, and (ii) a background evaluation circuit monitors retrieval quality and adjusts parameters when degradation is detected.

The paper is organised as follows. Section 2 formalises the composite scoring function and scope-tuned decay. Section 3 describes the agent role topology. Section 4 presents the self-correcting circuit. Section 5 discusses a simplified error-dynamics model. Section 6 provides a reference implementation sketch. Section 7 surveys related work. Section 8 discusses limitations and open questions.

2. Memory Model: Composite Scoring with Scope-Decay

Let a memory entry be represented as a tuple $M_i = (e_i, t_i, s_i, \mathrm{imp}_i)$, where:
- $e_i \in \mathbb{R}^d$ is a dense embedding;
- $t_i$ is the wall-clock time of insertion or last verification;
- $s_i$ is a scope label (e.g., "session_task", "domain_knowledge", "user_profile");
- $\mathrm{imp}_i \in [0,1]$ is a baseline importance score assigned at write time.
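For concreteness, the tuple can be mirrored by a small record type. This is a sketch, not the repository's schema. The field names and the use of Unix seconds for $t_i$ are assumptions. The `reverify` helper is hypothetical: it resets $t_i$ to match the "last verification" semantics.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryEntry:
    """Mirror of M_i = (e_i, t_i, s_i, imp_i); field names are illustrative."""
    embedding: List[float]  # e_i: dense embedding in R^d
    t: float                # t_i: insertion or last-verification time, Unix seconds (assumed)
    scope: str              # s_i: e.g. "session_task", "domain_knowledge", "user_profile"
    imp: float              # imp_i: baseline importance assigned at write time

    def __post_init__(self) -> None:
        # Enforce the stated domain imp_i in [0, 1] and a non-empty embedding.
        if not 0.0 <= self.imp <= 1.0:
            raise ValueError(f"imp must lie in [0, 1], got {self.imp}")
        if not self.embedding:
            raise ValueError("embedding must be non-empty")

    def reverify(self, now: Optional[float] = None) -> None:
        """Hypothetical helper: mark the entry as re-verified by resetting t_i."""
        self.t = time.time() if now is None else now
```

Resetting `t` restarts the recency term, so a re-verified entry regains full recency weight without being rewritten.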

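When a query $Q$ arrives, the system retrieves the $k$ entries with the highest composite score:

$$\mathrm{Score}(M_i, Q) = \alpha \cdot \mathrm{sim}(e_Q, e_i) + \beta \cdot \exp\!\left(-\frac{\Delta t_i}{\tau(s_i)}\right) + \gamma \cdot \mathrm{imp}_i$$

where:
- $e_Q$ is the query embedding;
- $\mathrm{sim}(\cdot,\cdot)$ is an embedding similarity (e.g., cosine similarity);
- $\Delta t_i = t_{\text{now}} - t_i$ is the time elapsed since insertion or last verification;
- $\tau(s_i)$ is the decay constant of scope $s_i$;
- $\alpha$, $\beta$, $\gamma$ are weights constrained to sum to 1, the invariant enforced by Role 6 (Section 3.2).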

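The decay constant $\tau(s)$ determines how quickly an entry fades from retrieval unless it is re-verified. Because the recency term is $\exp(-\Delta t/\tau)$, $\tau(s)$ is strictly an e-folding time. The corresponding half-life is $\tau(s) \ln 2 \approx 0.69\,\tau(s)$. Key values in the current deployment: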

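Scope            | τ              | Rationale
-----------------|----------------|------------------------------------
user_profile     | ∞ (no decay)   | User attributes rarely change
domain_knowledge | 90 days        | Market conditions evolve seasonally
session_task     | 3600 s (1 h)   | Transient task context
regulatory       | ≈ 10 years     | Legal references must persist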
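The recency term uses $\exp(-\Delta t/\tau)$, so the time for it to halve is $\tau \ln 2$. The worked values below apply this to the table's scopes. The ages (30 days, 1 year) and the weight $\beta = 0.3$, taken from the Section 6 defaults, are chosen only for illustration.

```latex
\begin{aligned}
\tau_{1/2}(s) &= \tau(s)\,\ln 2 \approx 0.693\,\tau(s)\\
\tau_{1/2}(\texttt{session\_task}) &= 3600\,\mathrm{s} \times 0.693 \approx 2495\,\mathrm{s} \approx 41.6\,\mathrm{min}\\
\tau_{1/2}(\texttt{domain\_knowledge}) &= 90\,\mathrm{d} \times 0.693 \approx 62.4\,\mathrm{d}\\
\exp(-30/90) &\approx 0.717 \;\Rightarrow\; \beta \cdot 0.717 \approx 0.215 \quad (\text{vs. } \beta = 0.3 \text{ when fresh})\\
\exp(-1/10) &\approx 0.905 \quad (\text{a 1-year-old \texttt{regulatory} entry})\\
\exp(-\Delta t/\infty) &= 1 \quad (\texttt{user\_profile} \text{ never decays})
\end{aligned}
```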

Contribution. Standard RAG systems retrieve top-k by cosine similarity alone. Adding scope-dependent temporal decay is an architectural pattern that is simple, implementable on any vector store, and—to our knowledge—not formalised in the current literature on LLM memory management. Its effect is to monotonically demote entries beyond their useful horizon without requiring explicit deletion.
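The decay term only re-weights candidates. One way to layer it onto a vector store that ranks by similarity alone is therefore to over-fetch candidates and re-rank them. The sketch below makes several assumptions:
- similarity is cosine;
- the weights are the Section 6 defaults (α = 0.5, β = 0.3, γ = 0.2);
- τ values come from the table, with "≈ 10 years" approximated as 3650 days;
- the over-fetch factor is a hypothetical 4.

`Entry`, `composite_score` and `retrieve` are illustrative names, not part of the AURA repository's API.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

DAY = 86_400.0  # seconds per day

# tau(s) in seconds, from the deployment table ("~10 years" approximated as 3650 days).
TAU = {
    "user_profile": math.inf,  # no decay: exp(-dt / inf) == 1.0
    "domain_knowledge": 90 * DAY,
    "session_task": 3600.0,
    "regulatory": 3650 * DAY,
}


@dataclass
class Entry:
    text: str
    embedding: List[float]
    t: float     # insertion / last-verification time (seconds)
    scope: str   # key into TAU
    imp: float   # baseline importance in [0, 1]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def composite_score(e: Entry, q: Sequence[float], now: float,
                    alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2) -> float:
    """alpha*sim + beta*exp(-dt/tau(s)) + gamma*imp, as in Section 2."""
    dt = max(0.0, now - e.t)
    recency = math.exp(-dt / TAU[e.scope])
    return alpha * cosine(q, e.embedding) + beta * recency + gamma * e.imp


def retrieve(entries: Sequence[Entry], q: Sequence[float], now: float,
             k: int = 3, overfetch: int = 4) -> List[Entry]:
    """Two-stage top-k: similarity-only candidates, then composite re-ranking."""
    # Stage 1: what a similarity-only vector store would return (k * overfetch hits).
    candidates = sorted(entries, key=lambda e: cosine(q, e.embedding),
                        reverse=True)[: k * overfetch]
    # Stage 2: re-rank by the composite score; stale entries sink without deletion.
    return sorted(candidates, key=lambda e: composite_score(e, q, now),
                  reverse=True)[:k]
```

Consider k = 1 and two entries:
- a six-hour-old session_task entry with perfect similarity scores about 0.60;
- a one-day-old domain_knowledge entry with similarity 0.8 scores about 0.80.

The second entry is returned first. One caveat of this two-stage design: an entry that misses the similarity-only candidate set can never be promoted by recency or importance. The over-fetch factor therefore bounds how far the composite score can reorder results.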

3. Agent Role Topology

Rather than proposing a universal optimal architecture, we describe the topology deployed in AURA and the rationale for each role. The system separates roles into two circuits.

3.1 Circuit A: Data Curation

# | Role                | Operation | Timing
--|---------------------|-----------|-------
1 | Executor            | Accepts query Q, computes composite score, assembles context, passes to LLM, returns response | Online (every request)
2 | Evolution Generator | Generates synthetic edge-case queries from existing entries to expose gaps or contradictions | Background (low priority)
3 | Gatekeeper          | Evaluates proposed new entries (from user input or Role 2) for redundancy, contradiction, and factual grounding before committing to vector store | Near-line (async)

Circuit A manages memory content only: what enters, when it leaves, and how it ranks. It does not modify retrieval parameters.
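The Gatekeeper's checks can be pictured as a short-circuiting filter on the write path. This sketch rests on several assumptions:
- redundancy is approximated as cosine similarity above a hypothetical threshold of 0.95;
- the contradiction and factual-grounding checks are passed in as hypothetical callables, which in practice might plausibly be LLM- or retrieval-based judges;
- `gatekeeper` and `Decision` are illustrative names.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Decision:
    accepted: bool
    reason: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def gatekeeper(
    text: str,
    emb: Sequence[float],
    store: Sequence[Tuple[str, Sequence[float]]],
    contradicts: Callable[[str, str], bool],  # hypothetical judge, e.g. an LLM call
    is_grounded: Callable[[str], bool],       # hypothetical grounding check
    dup_threshold: float = 0.95,              # hypothetical redundancy cut-off
) -> Decision:
    """Role 3 sketch: reject redundant, contradictory, or ungrounded entries."""
    # For simplicity every stored entry is checked; a deployment would more
    # plausibly restrict this to the nearest neighbours of the candidate.
    for existing_text, existing_emb in store:
        if cosine(emb, existing_emb) >= dup_threshold:
            return Decision(False, f"redundant with {existing_text!r}")
        if contradicts(text, existing_text):
            return Decision(False, f"contradicts {existing_text!r}")
    if not is_grounded(text):
        return Decision(False, "not grounded")
    return Decision(True, "accepted for commit")
```

Returning a reason alongside the verdict lets rejected proposals from Role 2 be logged, which fits Circuit A's remit of deciding what enters memory.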

3.2 Circuit B: Parameter Control

# | Role             | Operation | Timing
--|------------------|-----------|-------
4 | Autonomous Coder | Adjusts retrieval parameters (α, β, γ, chunk size) when anomalous drift is detected | Offline (triggered)
5 | Syndicator       | Monitors a rolling mean of validation scores; triggers Circuit B if the mean over the last N = 50 queries drops below θ = 0.80 | Background (periodic)
6 | Mentor           | Verifies that parameter changes respect invariant constraints (e.g., α + β + γ = 1.0, c_chunk ≥ 128) before deployment | Offline (guard)

Why six? This is not a theoretically optimal number. The topology emerged pragmatically from separating (a) data operations, (b) parameter adjustments, and (c) safety constraints, with three roles per circuit. Using fewer roles would collapse these functions into single agents. Using more (nine or more) introduced coordination overhead that outweighed the marginal gains in our deployment.

4. Self-Correcting Feedback Loop

Circuit B implements a closed-loop controller. The Syndicator (Role 5) periodically computes a rolling mean of validation scores. If the mean drops below threshold θ = 0.80, the system enters a lock state and invokes the Autonomous Coder (Role 4), which proposes adjusted hyperparameters. The Mentor (Role 6) applies invariant checks and, if satisfied, commits the patch.

[Diagram: External State (Vector Store) at the top. Arrows:
- R2 Evolution Generator → R3 Gatekeeper → store (write)
- store → R1 Executor (read)
- R1 → metrics → R5 Syndicator → trigger → R4 Autonomous Coder → store (write)
- R6 Mentor → verify → R4]

Fig 1. AURA role-flow diagram.

5. A Simplified Model of Error Dynamics

We present a deliberately simplified model of how error propagates in a recursive single-circuit system and how a second control loop can bound it. This is not a proof of convergence—it is a speculative toy model intended to illustrate the intuition behind adding Circuit B.

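Let $E_t$ denote the number of degraded entries at epoch $t$. With only Circuit A:

$$E_{t+1} = E_t + \Delta_{\text{add}} - \Delta_{\text{decay}}$$

If $\Delta_{\text{add}} > \Delta_{\text{decay}}$, $E_t$ grows linearly without bound. Circuit B introduces a correction term $\gamma_t E_t$. Here $\gamma_t$ is the fraction of degraded entries corrected at epoch $t$, distinct from the importance weight $\gamma$ of Section 2:

$$E_{t+1} = E_t + (\Delta_{\text{add}} - \gamma_t \cdot E_t) - \Delta_{\text{decay}}$$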
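To make the bounding claim explicit, assume a constant rate $\gamma_t = \gamma \in (0, 1]$ and write $D = \Delta_{\text{add}} - \Delta_{\text{decay}} > 0$. This is a simplification beyond the model above. The recursion is then affine, and its fixed point and convergence follow in two steps:

```latex
\begin{aligned}
E_{t+1} &= (1-\gamma)\,E_t + D, \qquad D = \Delta_{\text{add}} - \Delta_{\text{decay}},\\
E^{*} &= (1-\gamma)\,E^{*} + D \;\Longrightarrow\; E^{*} = \frac{D}{\gamma},\\
E_{t+1} - E^{*} &= (1-\gamma)\,(E_t - E^{*}) \;\Longrightarrow\; E_t = E^{*} + (1-\gamma)^{t}\,(E_0 - E^{*}).
\end{aligned}
```

With only Circuit A, $E_t = E_0 + tD$ grows linearly. With the correction term, the gap to $E^{*} = D/\gamma$ shrinks by a factor $(1-\gamma)$ per epoch, so the steady-state error is inversely proportional to the correction rate.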

Limitation: This model treats "degraded entries" as a uniform quantity and ignores interaction effects. We include it as a motivating framework, not as a formal guarantee.
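As a standalone illustration, separate from the companion notebook, both recurrences can be iterated directly. The sketch makes three assumptions:
- Δ_add, Δ_decay and the correction rate are constant, which the model itself does not require;
- E_t is clamped at zero, because it counts entries;
- `simulate` is a hypothetical helper.

```python
from typing import List, Optional


def simulate(e0: float, d_add: float, d_decay: float, epochs: int,
             gamma: Optional[float] = None) -> List[float]:
    """Iterate E_{t+1} = E_t + (d_add - gamma*E_t) - d_decay.

    gamma=None drops the correction term (Circuit A only). Constant
    d_add, d_decay and gamma are simplifying assumptions.
    """
    e = float(e0)
    history = [e]
    for _ in range(epochs):
        correction = 0.0 if gamma is None else gamma * e
        e = max(0.0, e + (d_add - correction) - d_decay)  # counts cannot go negative
        history.append(e)
    return history
```

Take Δ_add = 10 and Δ_decay = 4 over 100 epochs. With no correction, E_t grows by 6 per epoch and reaches 600. With a correction rate of 0.2, it settles just below 6 / 0.2 = 30.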

6. Reference Implementation

A minimal Python implementation is provided in the companion repository. This is a reference sketch, not a production system.

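```python
from collections import deque

WINDOW_N = 50     # Syndicator window size N (Section 3.2)
THETA = 0.80      # drift threshold θ
MIN_SAMPLES = 3   # sketch choice: start checking before the window is full
MIN_CHUNK = 128   # Mentor invariant c_chunk >= 128


class AURAOrchestrator:
    """Demonstrates Circuit B activation on accuracy degradation."""

    def __init__(self):
        # Composite-score weights; Role 6 requires alpha + beta + gamma == 1.0.
        self.hyperparameters = {"alpha": 0.5, "beta": 0.3, "gamma": 0.2}
        # Rolling window of validation scores; the deque drops the oldest beyond N.
        self.recent_validations = deque(maxlen=WINDOW_N)
        self.system_lock = False

    def on_validation(self, score: float) -> None:
        self.recent_validations.append(score)

    def check_drift(self) -> None:
        """Role 5 (Syndicator): run Circuit B if the rolling mean falls below θ."""
        if self.system_lock or len(self.recent_validations) < MIN_SAMPLES:
            return
        mean = sum(self.recent_validations) / len(self.recent_validations)
        if mean < THETA:
            self._run_circuit_b()

    def _run_circuit_b(self) -> None:
        self.system_lock = True  # lock state while a patch is proposed and verified
        try:
            proposed = self._role4_propose_patch()
            if self._role6_verify(proposed):
                self.hyperparameters.update(proposed)
                self.recent_validations.clear()  # restart monitoring under new parameters
        finally:
            self.system_lock = False

    def _role4_propose_patch(self) -> dict:
        # Role 4 (Autonomous Coder): fixed demonstration patch shifting weight to recency.
        return {"alpha": 0.3, "beta": 0.5, "gamma": 0.2}

    def _role6_verify(self, patch: dict) -> bool:
        # Role 6 (Mentor): check invariants on the parameters as they would be after patching.
        merged = {**self.hyperparameters, **patch}
        weights = merged["alpha"] + merged["beta"] + merged["gamma"]
        if abs(weights - 1.0) > 1e-5:
            return False
        # The chunk-size invariant applies only when a patch sets a chunk size.
        return merged.get("chunk_size", MIN_CHUNK) >= MIN_CHUNK
```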

7. Related Work

Category                  | Works                     | Relationship
--------------------------|---------------------------|-------------
RAG                       | Lewis et al. (2020) [1]   | Foundational retrieval framework
Context saturation        | Liu et al. (2024) [2]     | Lost-in-the-middle problem
Virtual memory for LLMs   | MemGPT / Letta (2023) [3] | Shared goal; different approach (paging vs. decaying scores)
Multi-agent orchestration | AutoGen [4], CrewAI [5]   | Similar role decomposition; no built-in memory expiry
Agent reflection          | Reflexion (2023) [6]      | Single-episode critique vs. continuous monitoring
Model collapse            | Shumailov et al. (2023) [7] | Recursive training degradation

8. Limitations and Future Work

  1. No controlled evaluation. Claims are architectural and observational.
  2. Small deployment scale. Three concurrent users, two months, single real-estate agency.
  3. Synthetic generation quality. Role 2 uses the same LLM that generates responses, creating a circular dependency.
  4. Cost. Six agent roles increase token consumption relative to single-shot RAG.
  5. Tuning sensitivity. The values θ = 0.80 and N = 50 were chosen empirically.

9. Conclusion

We have presented AURA, a middleware layer that externalises memory into a curated state layer with scope-dependent decay. Our primary contribution—scope-dependent half-life scoring—is implementable on existing vector stores and addresses a genuine gap in production RAG deployments. The full codebase is available under Apache 2.0.


References

[1] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Proc. NeurIPS, 2020.

[2] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.

[3] C. Packer et al., "MemGPT: Towards LLMs as Operating Systems," arXiv:2310.08560, 2023.

[4] Q. Wu et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation," arXiv:2308.08155, 2023.

[5] CrewAI, "CrewAI: Framework for Orchestrating Autonomous AI Agents," 2024.

[6] N. Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," in Proc. NeurIPS, 2023.

[7] I. Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget," arXiv:2305.17493, 2023.


Simulation notebook for Section 5: simulations/error_dynamics.ipynb in the companion repository.