
Investors who put money into context window startups should go to a casino and put it all on red. At least the casino has free drinks.

The pitch is always some version of: models are stateless, naive context stuffing is O(n²) in attention complexity, and retrieval quality degrades for information buried in the middle of long prompts. Liu et al. documented this in 2023. The pitch is technically correct. It doesn't matter, because the category sits in a gap that closes from both ends simultaneously. The hard problems are too hard for startups. The easy problems are too easy to be defensible. There is no middle.

Take Supermemory. Instead of raw context stuffing, they process documents into atomic memory units with resolved coreferences, store them as nodes in a knowledge graph with three relationship types (updates, extends, derives), and do hybrid retrieval against atomic memories with source chunks injected alongside. They publish LongMemEval_s numbers: 81.6% overall versus 60.2% for full-context GPT-4o. The temporal grounding with dual timestamps is clever engineering. But trace any of their design decisions to its theoretical endpoint and you arrive at a research problem that either Google will solve with brute compute or nobody will solve at all.

Context windows went from 4k to 128k to 1M tokens in roughly three years. RoPE, ALiBi, YaRN, sparse attention, KV cache optimization. The "lost in the middle" degradation that motivates half of Supermemory's architecture has been directly addressed by instruction tuning on long-context data. Anthropic, Google, and OpenAI have teams working on this with compute budgets that make the problem tractable in ways it isn't for a startup. Whatever scaling buys, labs will buy first. Every incremental improvement in native context handling shrinks the addressable market for external memory systems. Supermemory's 81.6% versus GPT-4o's 60.2% is a real gap today. It will not be a real gap in eighteen months if context scaling continues on trend. The startup is running to stay ahead of a curve it cannot influence.

A knowledge graph with pgvector embeddings and temporal metadata is not a novel research contribution. The contextual retrieval technique underlying Supermemory's ingestion pipeline is Anthropic's own published method. The graph relationship types are standard knowledge graph patterns. The decay and recency weighting is textbook memory management. A competent engineer with access to Neo4j or a Postgres graph extension and a chunking library gets to 80% of this in a week. The remaining 20% is polish, not defensibility. The build-versus-buy argument holds when the thing you're buying has genuine complexity underneath: fraud detection, distributed consensus, compiler optimization. It collapses when the implementation is well-documented, the components are open source, and the primary value is that someone assembled them for you. Assembly is not a moat. There are no network effects to compound on. Every customer is an API call. Nothing about having more customers makes the product better for existing ones. Churn is cheap because switching costs are low and the alternative is always one sprint away.

So if the easy version is commoditized and the scaling version gets absorbed by labs, is there a hard version worth building? Yes. But the hard version is a research program, not a product.

The foundational problem is that memory is reconstruction, not retrieval. When you remember something, you don't fetch a stored record. You reconstruct an experience from fragments, filling gaps with inference and schema. McClelland et al.'s Complementary Learning Systems theory gives the formal framing: biological memory requires two systems with fundamentally different learning rates. A fast learner (hippocampus) that encodes episodes with minimal interference in high-dimensional sparse representations. A slow learner (neocortex) that extracts statistical structure through interleaved replay, performing something like gradient descent with a very small learning rate over replayed experiences to avoid catastrophic interference.

The computational consequence is the stability-plasticity dilemma. New learning degrades old knowledge. Kirkpatrick et al. formalized one approach with Elastic Weight Consolidation, penalizing changes to parameters that were important for previous tasks, weighted by the diagonal of the Fisher information matrix:

L(θ) = L_task(θ) + (λ/2) Σᵢ Fᵢ(θᵢ - θ*ᵢ)²

The Fisher diagonal is a crude approximation. The full Fisher is intractable, the diagonal assumes parameter independence, and it captures nothing about relational structure between memories. Every continual learning method that followed, from Progress-and-Compress to PackNet, hits the same wall. No principled architectural solution exists. Supermemory doesn't engage with this. Its memory is append-only with heuristic decay. It doesn't consolidate, doesn't compress episodes into schemas, doesn't have a mechanism for deciding what to forget based on structural importance. That's fine for a product. It's a concession that the hard problem is untouched.
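The EWC loss term itself is mechanically trivial, which sharpens the point: the hard part is estimating the Fisher information and everything it fails to capture, not the penalty. A plain-Python sketch of the formula above:

```python
def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    Parameters that mattered for previous tasks (large F_i) become
    expensive to move; unimportant ones stay plastic.
    """
    return 0.5 * lam * sum(
        f * (t - ts) ** 2
        for f, t, ts in zip(fisher_diag, theta, theta_star)
    )

def total_loss(task_loss, theta, theta_star, fisher_diag, lam):
    """L(theta) = L_task(theta) + EWC penalty."""
    return task_loss + ewc_penalty(theta, theta_star, fisher_diag, lam)

# Moving an important parameter (F=10) by one unit costs 100x moving
# an unimportant one (F=0.1) by the same amount.
important = ewc_penalty([1.0], [0.0], [10.0], lam=1.0)   # 5.0
unimportant = ewc_penalty([1.0], [0.0], [0.1], lam=1.0)  # 0.05
```

Note what the diagonal assumption throws away: the penalty is a sum over independent parameters, so any interaction between memories encoded across parameters is invisible to it.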

Supermemory's updates/extends/derives taxonomy is a static classification for something dynamic and contextual. Whether a new fact updates or extends an old one depends on the agent's goals, the task, and sometimes information neither the agent nor the system has yet. One misclassification propagates. And the entity resolution feeding the graph has F1 in the 85-95% range. Each misresolved entity is a wrong edge, and wrong edges in multi-hop queries don't just miss information. They return wrong information with high confidence. If single-hop retrieval accuracy is p, naive multi-hop accuracy for k hops degrades as p^k. At p = 0.85, three hops gives you 0.61. Five hops gives you 0.44. You lose majority accuracy before you get to any query that requires real reasoning across a knowledge base. Knowledge graph query answering over incomplete graphs connects to open-world query answering, which is #P-hard in the general case. The embedding-based approximations everyone uses (TransE, RotatE, CompGCN) trade formal guarantees for scalability, and their failure modes on complex relational queries are poorly understood.
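The compounding arithmetic is worth checking directly, because it holds even at the optimistic end of the entity-resolution range:

```python
def multihop_accuracy(p: float, k: int) -> float:
    """Naive compounding: each of k hops must independently resolve
    correctly, so accuracy degrades as p^k."""
    return p ** k

# At the low end of the 85-95% F1 range, taken as per-hop accuracy:
assert round(multihop_accuracy(0.85, 3), 2) == 0.61
assert round(multihop_accuracy(0.85, 5), 2) == 0.44
# Even at the top of the range, five hops silently loses
# nearly a quarter of queries:
assert round(multihop_accuracy(0.95, 5), 2) == 0.77
```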

Supermemory's dual timestamps are a point-based temporal representation, the simplest formalism in the space. Allen's interval algebra defines 13 possible relationships between time intervals: before, meets, overlaps, during, starts, and finishes, their six inverses, and equals. Real temporal reasoning over documents requires inferring these from natural language cues that are massively ambiguous. "After the merger" could mean after the announcement, after closing, or after integration. Each interpretation changes the temporal graph, and state-of-the-art temporal relation classification on TimeBank plateaus around 65-70%. That's single-document extraction. Cross-document temporal alignment, reconciling timelines from sources with different granularities, reference frames, and implicit assumptions, is substantially harder and has no mature benchmark at all.
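Classifying two known intervals into Allen's relations is simple endpoint comparison; the research problem is extracting the intervals from ambiguous text in the first place. A sketch of the easy half, assuming clean (start, end) intervals:

```python
def allen_relation(a: tuple[float, float], b: tuple[float, float]) -> str:
    """Classify interval a against b into one of Allen's 13 relations.

    Intervals are (start, end) with start < end. The hard part the text
    describes -- getting these endpoints out of "after the merger" -- is
    assumed away here.
    """
    (as_, ae), (bs, be) = a, b
    if ae < bs:                 return "before"
    if ae == bs:                return "meets"
    if as_ == bs and ae == be:  return "equals"
    if as_ == bs:               return "starts" if ae < be else "started-by"
    if ae == be:                return "finishes" if as_ > bs else "finished-by"
    if bs < as_ and ae < be:    return "during"
    if as_ < bs and be < ae:    return "contains"
    if as_ < bs < ae < be:      return "overlaps"
    if bs < as_ < be < ae:      return "overlapped-by"
    # all that remains: a starts at or after b's end
    return "after" if as_ > be else "met-by"
```

Twenty lines for the formal classification; the 65-70% plateau lives entirely upstream of it.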

The standard retrieval model treats relevance as R(q, d), a function of query and document. What memory actually requires is R(q, d, S, T, H): relevance conditioned on agent state, task, and history. The dimensionality explodes and training signal becomes sparse. Frame it as a contextual bandit: the action is which memories to surface, the context is (q, S, T, H), the reward is downstream task performance. The action space is combinatorial, the reward is delayed, and the context representation is itself an open problem. Memory-augmented architectures like NTM, DNC, and Memorizing Transformers try to let the model attend over a memory bank directly, but the read/write heads learn relevance patterns that are fixed post-training and don't adapt to new tasks without fine-tuning. That brings you back to the stability-plasticity dilemma. Every road leads to the same unsolved problem.
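The gap between the two relevance models is visible in the type signatures alone. A hypothetical sketch (all names are mine, for illustration):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentContext:
    state: dict         # S: the agent's current beliefs / scratchpad
    task: str           # T: what the agent is trying to accomplish
    history: list[str]  # H: prior interactions

# Standard retrieval: relevance is a function of query and document only.
StaticRelevance = Callable[[str, str], float]                    # R(q, d)

# What memory needs: relevance conditioned on the agent's situation.
ContextualRelevance = Callable[[str, str, AgentContext], float]  # R(q, d, S, T, H)

def reward_signal(memories_surfaced: list[str], task_succeeded: bool) -> float:
    """Contextual-bandit framing: the only training signal is the delayed
    downstream outcome, shared across every memory surfaced -- a sparse,
    combinatorial credit-assignment problem."""
    return 1.0 if task_succeeded else 0.0
```

Nothing in the reward tells you which surfaced memory helped, which is where the sparse-signal problem the paragraph describes comes from.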

The evaluation problem might be the most damaging. LongMemEval and every similar benchmark test needle-in-a-haystack retrieval, which is the easiest subproblem. The question that matters is whether having a memory system leads to better task outcomes than not having it, across the distribution of tasks the agent will face. That's a causal inference problem. You'd need the Rubin potential outcomes framework to even define the estimand, and running the counterfactual requires evaluating full task completion with and without the memory intervention across a representative task distribution. Nobody does this. The field cannot distinguish genuine progress from benchmark gaming, which is part of why investment in the category is so noisy and will remain so.
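What the missing evaluation would look like, in its simplest paired form. This is illustrative and assumes each task in the distribution can actually be run under both conditions, which is exactly the expensive part:

```python
def average_treatment_effect(with_memory: list[bool],
                             without_memory: list[bool]) -> float:
    """Paired counterfactual evaluation: run every task twice, with and
    without the memory system, and estimate the average treatment effect
    on task success. This is the estimand the potential-outcomes framing
    defines and that retrieval benchmarks never measure."""
    assert len(with_memory) == len(without_memory)
    return sum(w - wo for w, wo in zip(with_memory, without_memory)) / len(with_memory)

# Toy numbers: memory flips two of four tasks from failure to success.
ate = average_treatment_effect(
    with_memory=[True, True, False, True],
    without_memory=[True, False, False, False],
)  # 0.5
```

A needle-in-a-haystack score tells you nothing about this number, which is why benchmark gaming and genuine progress are indistinguishable in the category today.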

The formal frameworks that would actually solve memory (CLS, continual learning theory, Allen's interval algebra, event calculus, rate-distortion theory, contextual bandits, causal inference) each capture a piece of the problem. Nobody has unified them. The systems getting funded operate at the bag-of-tricks level: pgvector plus chunking plus graph plus reranking. That works for demos. It doesn't engage with the hard problems. And the hard problems require sustained, expensive research that happens in labs with large compute budgets or in academic settings with long time horizons. Neither maps to the venture model.

The labs will absorb the problems that are tractable with scale. The startups are left with the problems that aren't tractable at all. Investors will call it a timing issue. It isn't. The timing was always going to be wrong because the category occupies a gap that closes from both ends simultaneously. There is nothing to build in the middle.

4/2/2025