Perpetual Inference: On the Absence of Idle Cognition in Large Language Models
Abstract. Large language models compute only when prompted. Between inference calls, nothing happens. No memory consolidation, no schema extraction, no reorganization of learned representations. Biological cognition does not work this way. The default mode network, hippocampal replay during sleep, and spontaneous memory reactivation all perform critical computational work without external input. We call this missing capacity perpetual inference: background computation that runs without a task, operating on accumulated experience to improve future performance. The central question is what this computation should optimize. We formalize three candidate objectives drawn from theoretical neuroscience and information theory: free energy minimization, complementary learning systems replay, and rate-distortion compression. These are not equivalent. They imply different architectures, different resource allocations, and different experimental predictions. We propose the Consolidation Loop, a concrete architecture implementing CLS-inspired offline replay over an external memory store, and describe experiments to determine whether idle processing produces improvements that retrieval alone cannot.
1. The Absence
A human given no input continues to think. Default mode network activity during rest involves spontaneous memory replay, self-referential processing, and future simulation. During sleep, hippocampal sharp-wave ripples replay compressed experiences through neocortical circuits, consolidating episodic memories into semantic knowledge. Subjects who sleep between learning sessions outperform those who don't, and the benefit is specific to reorganization of memories, not passage of time.
A transformer has no analog to any of this. The computation is reactive. A forward pass requires an input. Between API calls, parameters are frozen, activations do not exist, nothing is processed. The system at time t+1 is identical to the system at time t unless an external prompt intervenes.
Agent architectures that approximate continuity do not change this picture. The Ralph Wiggum technique intercepts a model's exit signal, re-feeds the original prompt, and lets the model observe modified files from previous iterations. Anthropic's long-running agent harness (2025) uses an initializer agent and a progress file to bridge context windows across sessions. Both are task-driven loops. The model responds to an explicit instruction at every step. If you remove the instruction, the model stops. There is no unprompted computation.
Three consequences follow from this absence.
First, memory quality degrades without consolidation. Systems that accumulate interaction history face interference and noise accumulation. Without a process that detects patterns, resolves contradictions, and compresses redundancies, the memory store gets worse over time, not better.
Second, transfer requires abstraction. Applying knowledge from one context to another requires extracting the structural invariant from specific episodes. A system that stores every interaction verbatim but never abstracts across them cannot transfer. Biological consolidation is precisely the mechanism that converts episode-specific memory into transferable schematic knowledge.
Third, self-correction requires reflection. Humans notice errors in their own reasoning after the fact, sometimes hours later. This requires a process that revisits past computations without being asked. A system that only reviews its outputs on demand cannot spontaneously self-correct.
2. What Should Idle Processing Optimize?
The interesting question is not whether background processing would help. It would. The question is what it should compute. Task-driven loops have an explicit objective: pass the tests, finish the feature. Idle processing has no external loss signal. The system must generate its own optimization target.
Three theoretical frameworks give three different answers.
2.1 Free Energy Minimization
Friston's free energy principle says that biological agents maintain a generative model of their environment and minimize the divergence between predictions and actual input. Idle processing, in this framework, should reduce expected future prediction error across the distribution of inputs the agent anticipates.
Let q(θ) be the agent's model and p(x) the anticipated input distribution. With a prior p0(θ) over parameters, the idle process minimizes the free energy

F(q) = E_{p(x)} E_{q(θ)}[-log p(x | θ)] + KL(q(θ) || p0(θ))

The first term is expected surprise over anticipated inputs. The second is a complexity penalty preventing overfitting to recent experience.
Computationally, this means: run the generative model forward, simulate plausible future inputs, evaluate predictions against simulations, update representations to reduce the gap. This is dreaming. The model hallucinates scenarios and learns from them.
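To make this concrete, here is a minimal numerical sketch under toy assumptions: inputs are discrete symbols, the generative model is a categorical distribution over them, the anticipated distribution p(x) is a smoothed histogram of logged inputs, and the complexity penalty is a KL term against a uniform prior. Every name here is illustrative, not part of any proposed system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: inputs are one of K discrete symbols; the generative model is a
# categorical distribution q(x | theta) parameterized by logits theta.
K = 8

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Anticipated input distribution p(x): estimated from logged past inputs
# (a skewed histogram with add-one smoothing).
logged = rng.choice(K, size=200, p=softmax(np.linspace(2.0, 0.0, K)))
p = (np.bincount(logged, minlength=K) + 1) / (len(logged) + K)

prior = np.full(K, 1.0 / K)   # complexity penalty anchors to a uniform prior
lam = 0.1                     # weight on the complexity term

def free_energy(theta):
    q = softmax(theta)
    surprise = -(p * np.log(q)).sum()            # expected surprise under p(x)
    complexity = (q * np.log(q / prior)).sum()   # KL(q || prior)
    return surprise + lam * complexity

# Idle process ("dreaming"): descend the free-energy objective using the
# analytic gradient of each term with respect to the logits.
theta = np.zeros(K)
for _ in range(500):
    q = softmax(theta)
    s = np.log(q / prior)
    grad = (q - p) + lam * q * (s - (q * s).sum())
    theta -= 0.5 * grad

f_before = free_energy(np.zeros(K))
f_after = free_energy(theta)
```

After the idle updates, the fitted model assigns more probability to the symbols the anticipated distribution favors, and the free energy is strictly lower than at initialization.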
Prediction: a system doing this improves on inputs distributionally similar to past experience. It does not improve on distributionally novel inputs.
2.2 Complementary Learning Systems Replay
CLS theory says consolidation requires replaying episodic memories through a slow-learning system that extracts statistical regularities. The fast learner stores episodes. The slow learner builds schemas by training on interleaved replays of old and new episodes, avoiding catastrophic interference through temporal mixing.
The idle process samples from the episodic store M = {m1, ..., mN} with a curriculum that interleaves recent and distant episodes, and trains a schema extractor S by minimizing a contrastive objective:

L(S) = -E_{(mi, mj) shared} [ log ( exp(sim(S(mi), S(mj)) / τ) / Σ_{k ≠ i} exp(sim(S(mi), S(mk)) / τ) ) ]

This is contrastive: the numerator pulls together episodes with shared structure, the denominator pushes apart episodes without it.
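A minimal sketch of the contrastive replay objective, assuming episodes are already embedded as vectors and shared-structure labels are known in advance (in a real system the labels would come from the schema extractor itself; here they are synthetic, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy episodes: vectors in R^d. The first five share one latent structure,
# the last five another (structure vector plus episode-specific noise).
d = 16
struct_a, struct_b = rng.normal(size=d), rng.normal(size=d)
episodes = np.stack(
    [struct_a + 0.3 * rng.normal(size=d) for _ in range(5)]
    + [struct_b + 0.3 * rng.normal(size=d) for _ in range(5)]
)
labels = np.array([0] * 5 + [1] * 5)
N = len(episodes)

def loss(W, tau=0.5):
    """Supervised contrastive loss: episodes with shared structure are the
    positives; every other episode serves as a negative."""
    z = episodes @ W
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    total, terms = 0.0, 0
    for i in range(N):
        others = [j for j in range(N) if j != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        for j in others:
            if labels[j] == labels[i]:
                total += log_denom - sim[i, j]
                terms += 1
    return total / terms

def num_grad(f, W, eps=1e-4):
    """Finite-difference gradient; fine for a toy this small."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

# Train the linear schema extractor S(m) = W^T m by gradient descent.
W = rng.normal(scale=0.1, size=(d, 4))
loss_before = loss(W)
for _ in range(60):
    W -= 0.1 * num_grad(loss, W)
loss_after = loss(W)
```

After training, same-structure episodes land near each other in schema space, which is exactly the property transfer requires: a new instance of a known structure retrieves its schema rather than a verbatim episode.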
Prediction: a system doing this improves on tasks requiring transfer across contexts, where the same abstract pattern appears in a new setting. It does not improve on tasks solvable by retrieving a specific past episode verbatim.
2.3 Rate-Distortion Compression
Rate-distortion theory defines optimal lossy compression. For a distortion measure d and storage budget R, there exists an encoding Q(x̂ | x) that minimizes distortion subject to the rate constraint:

D(R) = min_{Q(x̂ | x) : I(X; X̂) ≤ R} E[d(X, X̂)]

The idle process compresses interaction history to maximize useful information per bit of storage, where "useful" is defined by task-relevant distortion. The problem: the future task distribution is unknown, so the distortion measure must be estimated.
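Points on the rate-distortion curve can be computed with the standard Blahut-Arimoto algorithm. A toy sketch, using symbol frequencies as a stand-in for frequent versus rare interaction patterns and Hamming distortion as a stand-in for the unknown task-relevant distortion measure:

```python
import numpy as np

# Toy source: skewed symbol frequencies (frequent vs rare patterns).
p = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
n = len(p)
dist = 1.0 - np.eye(n)   # Hamming distortion: 0 iff reproduced exactly

def blahut_arimoto(beta, iters=300):
    """One point on the rate-distortion curve at tradeoff parameter beta."""
    q = np.full(n, 1.0 / n)                      # marginal over reproductions
    for _ in range(iters):
        Q = q[None, :] * np.exp(-beta * dist)    # optimal channel given q
        Q /= Q.sum(axis=1, keepdims=True)
        q = p @ Q                                # re-estimate the marginal
    D = (p[:, None] * Q * dist).sum()            # expected distortion
    R = (p[:, None] * Q * np.log2(Q / q[None, :])).sum()  # I(X; X̂), bits
    return R, D

# Sweeping beta traces the tradeoff: larger beta buys lower distortion
# at the cost of a higher rate.
curve = [blahut_arimoto(b) for b in (0.5, 2.0, 8.0)]
```

The skew in p is what produces the graceful-degradation prediction: as the rate budget tightens, the optimal channel sacrifices fidelity on the rare symbols first.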
Prediction: a system doing this degrades gracefully under memory pressure. Performance on frequent patterns is preserved while rare patterns degrade. Without compression, degradation is uniform or chaotic.
2.4 These Are Not the Same Thing
| Framework | Optimizes for | Mechanism | Distinctive prediction |
|---|---|---|---|
| Free energy | Reduced future surprise | Forward simulation | Improves on expected inputs |
| CLS replay | Schema extraction | Contrastive episode replay | Improves on cross-context transfer |
| Rate-distortion | Storage efficiency | Lossy compression | Graceful degradation under pressure |
Biology likely combines all three. A practical system must choose. We start with CLS replay because it has the strongest empirical support from neuroscience, the clearest computational implementation, and the most directly measurable experimental predictions.
3. The Consolidation Loop
3.1 Components
Episodic store. Every interaction is logged: prompt, response, tool calls, outcomes, user feedback. Timestamped.
Consolidation process. Runs asynchronously between inference sessions. Does not modify base model weights. Operates on an external memory store that the model reads at inference time.
Schema extractor. Takes batches of episodes and identifies structural patterns: recurring intent patterns, common failure modes, entities appearing across conversations, reasoning strategies that succeeded or failed. Can be the same LLM or a smaller specialized model.
Memory rewriter. Updates the memory store based on extracted schemas: merges redundant entries, elevates patterns to explicit schemas, annotates episodes with abstractions, prunes episodes subsumed by schemas.
3.2 The Loop
Between inference sessions, the system repeats:
1. Sample a batch of episodes from the store, interleaving recent and distant experience.
2. Run the schema extractor on the batch to identify structural patterns.
3. Apply the memory rewriter: merge redundant entries, elevate recurring patterns to explicit schemas, annotate episodes, prune episodes the new schemas subsume.
4. Continue until the idle compute budget is exhausted or a new inference session begins.
The sampling strategy matters. Uniform sampling over episodes overrepresents recent interactions because there are more of them. CLS theory requires interleaving recent and distant episodes. We sample uniformly over time periods, not episodes, ensuring old and new experience receive equal representation.
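The loop and its period-uniform sampler can be sketched as follows. The schema extractor and memory rewriter are stubbed hooks (in practice, LLM calls), and every name here is hypothetical:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Episode:
    timestamp: float                 # seconds since epoch
    prompt: str
    response: str
    outcome: str                     # e.g. "success" / "failure"
    schema_ids: list = field(default_factory=list)  # filled by consolidation

def sample_interleaved(episodes, n_periods=4, per_period=2, rng=random):
    """Sample uniformly over time periods, not episodes, so old and new
    experience receive equal representation (the CLS interleaving curriculum).
    Uniform sampling over episodes would overrepresent recent interactions."""
    lo = min(e.timestamp for e in episodes)
    hi = max(e.timestamp for e in episodes)
    width = (hi - lo) / n_periods or 1.0
    batch = []
    for k in range(n_periods):
        upper = lo + (k + 1) * width
        bucket = [e for e in episodes
                  if lo + k * width <= e.timestamp
                  and (e.timestamp < upper or k == n_periods - 1)]
        if bucket:
            batch.extend(rng.sample(bucket, min(per_period, len(bucket))))
    return batch

def consolidation_step(store, extract_schemas, rewrite_memory):
    """One iteration of the loop: sample, extract, rewrite. The two hooks
    are where a real system would call an LLM."""
    batch = sample_interleaved(store)
    schemas = extract_schemas(batch)
    return rewrite_memory(store, schemas)
```

Even when recent episodes outnumber old ones four to one, the sampler still draws from the earliest time period, which is the property the interleaving curriculum depends on.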
3.3 Integration With Inference
At inference time, the model receives relevant schemas and specific episodes alongside its standard context. Consolidation improves inference in two ways. Schemas are more compact and more general than raw episodes: a schema like "this user asks follow-up questions about error handling after receiving code" is more useful than the five specific episodes that demonstrate it. And by pruning subsumed episodes, the store shrinks, so retrieval precision improves: there are fewer redundant entries to confuse the retrieval system.
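One way to realize this integration is to admit schemas to the context budget before raw episodes. A minimal sketch, with an illustrative word-overlap scorer and word-count budget standing in for a real retriever and tokenizer:

```python
def assemble_context(query, schemas, episodes, token_budget=200):
    """Schema-first context packing (hypothetical sketch): schemas are
    compact and general, so they are ranked ahead of raw episodes."""
    query_terms = set(query.lower().split())

    def score(text):
        return len(set(text.lower().split()) & query_terms)

    ranked = (sorted(schemas, key=score, reverse=True)
              + sorted(episodes, key=score, reverse=True))
    picked, used = [], 0
    for text in ranked:
        cost = len(text.split())             # crude token count
        if score(text) == 0 or used + cost > token_budget:
            continue                         # skip irrelevant or over-budget
        picked.append(text)
        used += cost
    return picked
```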
3.4 Scope
The Consolidation Loop leaves model weights untouched. It operates entirely on an external memory store, making it adjacent to RAG, except that RAG retrieves from a static store; the Consolidation Loop makes the store dynamic and self-organizing. And unlike Ralph Wiggum or other agent loops, there is no task: the optimization target is supplied internally by the CLS objective, not by an external prompt.
None of the individual components are novel. Episode logging, schema extraction, memory graphs, and interleaved replay are all established. The argument is that these should be combined into a background process that runs without prompting, that the choice of objective function matters and produces testable differences, and that the absence of this process in current systems is a real limitation, not an engineering convenience.
4. Experiments
4.1 Core Hypothesis
A system with idle consolidation outperforms an equivalent system without it, given identical interaction history and retrieval infrastructure.
The control matters. Give both systems 100 past interactions and a vector database. Both can retrieve relevant episodes at inference time. The question: does reorganization produce improvements beyond what retrieval alone achieves?
4.2 Transfer Across Contexts
A simulated user interacts with the system across 50 sessions on diverse topics. Some topics share structural patterns (e.g., the same debugging strategy applies in Python and Rust). After the interaction phase, test on new instances of shared patterns in unseen contexts.
Conditions: (A) raw episode retrieval, (B) CLS replay consolidation, (C) rate-distortion compression only.
Prediction: B outperforms A and C on transfer tasks. C outperforms A on retrieval speed but not transfer.
4.3 Degradation Under Memory Pressure
The system accumulates 500+ sessions. Memory store has a size constraint.
Conditions: (A) FIFO eviction, (B) random eviction, (C) consolidation-driven eviction (subsumed episodes removed first).
Prediction: C preserves performance on frequent patterns and degrades on rare patterns. A and B degrade more uniformly and more severely.
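The three eviction conditions can be pinned down as simple policies. A sketch, where the `subsumed` index set is assumed to come from the consolidation process:

```python
import random

def evict_fifo(store, n):
    """Condition A: drop the n oldest episodes."""
    return store[n:]

def evict_random(store, n, rng=random):
    """Condition B: drop n episodes uniformly at random."""
    keep = rng.sample(range(len(store)), len(store) - n)
    return [store[i] for i in sorted(keep)]

def evict_consolidated(store, n, subsumed):
    """Condition C: episodes already subsumed by a schema are evicted first;
    only then fall back to oldest-first. `subsumed` is a set of indices
    assumed to be produced by consolidation."""
    order = sorted(range(len(store)), key=lambda i: (i not in subsumed, i))
    victims = set(order[:n])
    return [e for i, e in enumerate(store) if i not in victims]
```

Condition C is the only policy whose victims depend on memory content rather than position or chance, which is what the prediction about frequent versus rare patterns turns on.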
4.4 Spontaneous Error Detection
The system receives interactions containing factual errors that the user did not correct. During idle time, the consolidation process reviews past interactions.
Conditions: (A) no idle processing, (B) idle processing with self-review subobjective.
Prediction: B flags some fraction of past errors without being prompted. A never detects them.
4.5 Comparison With Task-Driven Loops
Same interaction history, same idle compute budget. One condition runs the Consolidation Loop. The other runs a prompted loop: "Review your past interactions and improve your memory store."
Prediction: the prompted loop produces more surface-level improvements (deduplication, obvious cleanups). The Consolidation Loop produces deeper structural improvements (transferable schemas). The prompted loop is biased by what the model expects "reviewing memory" to mean. The Consolidation Loop follows the CLS objective, which may discover structure the model would not think to look for if asked.
This last experiment is the most important. If task-driven loops produce equivalent results, the entire argument for unprompted idle processing collapses into "just prompt the model to review itself periodically," which is a cron job, not a research contribution.
5. Objections
"This is just RAG with a maintenance script." RAG retrieves from a static store. The Consolidation Loop makes the store self-organizing. The distinction is between a library and a librarian who reorganizes the shelves at night. Whether the librarian adds enough value to justify the salary is an empirical question. Experiment 4.2 is designed to answer it.
"The compute cost is not justified." Possibly. If retrieval over raw episodes is good enough, consolidation wastes resources. Experiment 4.2 directly tests this.
"Without weight modification, the benefits are limited." Correct. A more ambitious version would update a LoRA adapter or soft prompts based on consolidated schemas. We scope to external memory because weight modification introduces catastrophic forgetting and makes evaluation harder. The external memory version is the minimum viable test of the hypothesis.
"The biological analogy is misleading." Sleep consolidation involves synaptic homeostasis, neurochemical state changes, and hippocampal-cortical dialogue that have no transformer analog. The analogy is functional: the computational work of replay, abstraction, and compression is valuable regardless of substrate. If we are wrong about this, Experiment 4.2 will show it.
"Context windows will keep growing and this will not matter." This is the strongest objection. If models eventually process their entire interaction history in a single window with perfect attention, consolidation is unnecessary. Two responses. First, attention scales quadratically (or n log n with sparse variants) and interaction histories grow without bound. There will always be a crossover point. Second, 10,000 raw episodes is a worse retrieval target than 200 schemas plus 500 episodes, even with perfect attention, because schemas represent compressed structural knowledge the model would otherwise re-derive on every forward pass. A hundred-page disorganized notebook and a ten-page organized summary are not equivalent even if you can read both.
6. Related Work
Continual learning addresses forgetting during training. The Consolidation Loop addresses memory degradation during deployment, without modifying weights.
Memory-augmented architectures (NTM, DNC, RETRO, Memorizing Transformers) provide read/write memory banks. Memories are written during forward passes and read at inference. They are not reorganized between calls.
Cognitive architectures (SOAR, ACT-R, LIDA) include long-term memory consolidation as an explicit component. This idea predates LLMs by decades. The contribution here is the formalization through competing theoretical frameworks with distinguishable experimental predictions, applied to the specific architectural gap in modern agent systems.
Self-play and synthetic data generation (AlphaGo, self-instruct) let systems generate their own training signal. Self-play is task-driven: generate data to improve at Go, or at instruction following. The Consolidation Loop is untargeted: reorganize memory to improve generally.
Anthropic's long-running agent harness (2025) solves a related problem with engineering. The progress file is a human-designed schema. The Consolidation Loop proposes to generate schemas automatically, without a human specifying the format or content of what should be preserved between sessions.
7. Conclusion
Every deployed LLM agent is purely reactive. Idle time between sessions is wasted. We have argued that this wastes more than compute. It wastes the opportunity to consolidate experience into reusable knowledge.
The core contribution is not the architecture (which is straightforward) but the formalization. Three theoretical frameworks specify three different objectives for idle processing, with three different experimental signatures. Distinguishing them requires the experiments in Section 4. If CLS replay produces transfer benefits that raw retrieval and prompted self-review cannot match, the implication is that agent systems should ship with idle compute budgets: resources allocated to background processing with no immediate task.
The brain does not wait to be prompted. We should find out whether that matters.
10/1/2025