Perfect Retrieval Recall on the Hardest AI Memory Benchmark — Running Fully Local
Source: DEV Community
We've been benchmarking Aingram's hybrid retrieval pipeline against LongMemEval, the most rigorous public benchmark for long-term memory in AI chat assistants. This post covers the retrieval-only results, measured before any LLM generation step, because we think they tell an important story about where memory system failures actually come from.

Background: What LongMemEval Tests

LongMemEval (Wu et al., ICLR 2025) is a benchmark of 500 hand-curated questions embedded across scalable user-assistant chat histories. The LongMemEval-S split gives each question a history of approximately 115,000 tokens (~40 sessions). Questions span five memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

The standard evaluation is end-to-end: ingest the conversation history, retrieve the relevant sessions, pass them to an LLM, generate an answer, and score it with an LLM judge. Most published numbers (Zep: 71.2%; Emergence AI: 86%) are end-to-end accuracy scores.
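To make the retrieval-only framing concrete, here is a minimal sketch of how session-level recall@k could be computed for a LongMemEval-style setup. The field names (`retrieved`, `gold_sessions`) and the scoring convention are our own illustrative assumptions, not the benchmark's actual schema or official scorer:

```python
# Hypothetical retrieval-only evaluation sketch for LongMemEval-style data.
# Field names and scoring convention are illustrative assumptions, not the
# benchmark's official schema or scorer.

def recall_at_k(retrieved_session_ids, gold_session_ids, k):
    """Fraction of gold (evidence) sessions present in the top-k retrieved."""
    top_k = set(retrieved_session_ids[:k])
    gold = set(gold_session_ids)
    return len(top_k & gold) / len(gold) if gold else 1.0

def evaluate(questions, k=5):
    """Average session-level recall@k across all questions."""
    scores = [
        recall_at_k(q["retrieved"], q["gold_sessions"], k)
        for q in questions
    ]
    return sum(scores) / len(scores)

# Toy example: two questions with hand-made retrieval results.
questions = [
    {"retrieved": ["s3", "s7", "s1"], "gold_sessions": ["s3", "s9"]},
    {"retrieved": ["s2", "s5"], "gold_sessions": ["s5"]},
]
print(evaluate(questions, k=2))  # 0.75
```

"Perfect retrieval recall" in the title corresponds to this metric averaging 1.0 over all 500 questions, i.e. every gold evidence session appears in the retrieved set, independent of whatever the downstream LLM does with it.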