Agent Memory System

A production-ready, three-tier memory architecture designed for long-running AI agents.

Architecture

Three tiers of memory optimized for latency and capacity

Query Input → Memory Router (intelligent routing, cost-benefit analysis)

L1 HOT

Latency: ≤1ms

Storage: Ring Buffer

Capacity: ~10-50 KB

L2 WARM

Latency: ≤100ms

Storage: Vectors + FTS + SQLite

Capacity: ~GB scale

L3 COLD

Latency: ≤1s

Storage: RocksDB Archive

Capacity: ~TB scale
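
To make the L1 HOT tier more concrete, here is a minimal sketch of a byte-bounded ring buffer. The type name `HotBuffer`, its methods, and the eviction policy (drop the oldest entries until the new one fits) are assumptions for illustration, not the project's actual implementation.

```rust
use std::collections::VecDeque;

/// Hypothetical L1 hot tier: recent entries kept under a fixed byte budget.
pub struct HotBuffer {
    entries: VecDeque<String>,
    used_bytes: usize,
    max_bytes: usize,
}

impl HotBuffer {
    /// `max_bytes` is the capacity budget, e.g. 50 * 1024 for a ~50 KB tier.
    pub fn new(max_bytes: usize) -> Self {
        Self { entries: VecDeque::new(), used_bytes: 0, max_bytes }
    }

    /// Append an entry, evicting the oldest entries until it fits.
    /// Returns false if the entry alone exceeds the whole budget.
    pub fn push(&mut self, entry: String) -> bool {
        if entry.len() > self.max_bytes {
            return false;
        }
        while self.used_bytes + entry.len() > self.max_bytes {
            match self.entries.pop_front() {
                Some(old) => self.used_bytes -= old.len(),
                None => break,
            }
        }
        self.used_bytes += entry.len();
        self.entries.push_back(entry);
        true
    }

    /// Up to `n` entries, most recent first.
    pub fn recent(&self, n: usize) -> Vec<&str> {
        self.entries.iter().rev().take(n).map(String::as_str).collect()
    }

    /// Bytes currently held by stored entries.
    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```

Bounding by bytes rather than by entry count keeps the tier inside its stated capacity regardless of how long individual entries are.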

Use Cases

Real-world scenarios for consulting and research

Example Query:

What did we learn about healthcare M&A trends in Q3 2024?

Results:

  • Past deal analysis: Healthcare M&A increased 23% in Q3
  • Client note: Client X interested in healthcare vertical
  • Industry report summary: Regulatory changes driving consolidation

Implementation Details

Core logic in Rust for maximum performance

Router Function

Intelligent routing with heuristic cascade

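fn route_query(q: &Query) -> RouteResult {
    let started = std::time::Instant::now();

    let result = if is_smalltalk(q) {
        // Small talk stays in working memory (L1)
        query_l1(q)
    } else if needs_exact_match(q) {
        // Exact matches: sparse + KV with dense sanity check.
        // Assumes RouteResult implements Add to merge the two result sets.
        let merged = query_l2_sparse(q) + query_l2_kv(q);
        sanity_check_dense(merged)
    } else if needs_semantic_search(q) {
        // Semantic search: dense ANN with late-interaction reranking
        let candidates = query_l2_dense_ann(q, TOP_K);
        let reranked = late_interaction_rerank(candidates);
        enrich_from_l1(reranked)
    } else if has_low_confidence() || is_long_horizon(q) {
        // Fall back to archive for low confidence or long-horizon queries
        let archive_result = tap_l3_archive(q);
        // Borrow so that archive_result can still be returned afterwards
        let summary = summarize(&archive_result);
        promote_to_l2_kv(summary);
        archive_result
    } else {
        // Assumption: queries that match no heuristic are served from L1
        query_l1(q)
    };

    // Log telemetry for every routed query.
    // The accessor names on `result` are assumed for illustration.
    log_query_metrics(QueryMetrics {
        tiers_accessed: result.tiers_accessed(),
        latency_ms: started.elapsed().as_secs_f64() * 1000.0,
        recall: result.recall(),
    });

    result
}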
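
The predicates used by the router (`is_smalltalk`, `needs_exact_match`, and so on) are not shown on this page. As a rough sketch of what a heuristic cascade could look like, the following classifier runs cheap string checks in order and returns the first matching route. The `Route` enum, the keyword lists, and every threshold are hypothetical and chosen only for illustration.

```rust
/// Hypothetical route labels, one per tier-level branch of the router.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    HotL1,
    ExactL2,
    SemanticL2,
    ArchiveL3,
}

// Illustrative keyword lists; a real router would tune or learn these.
const SMALLTALK: [&str; 4] = ["hi", "hello", "thanks", "ok"];
const LONG_HORIZON: [&str; 3] = ["last year", "history", "all time"];

/// Strip leading and trailing punctuation from a token.
fn bare(token: &str) -> &str {
    token.trim_matches(|c: char| !c.is_alphanumeric())
}

/// Cheap checks run first, and the first match wins.
pub fn classify(query: &str) -> Route {
    let text = query.trim().to_lowercase();
    let words: Vec<&str> = text.split_whitespace().collect();

    // Very short greetings or acknowledgements stay in L1.
    let smalltalk = !words.is_empty()
        && words.len() <= 2
        && words.iter().all(|w| SMALLTALK.contains(&bare(w)));
    if smalltalk {
        return Route::HotL1;
    }
    // Quoted phrases or identifier-like tokens suggest an exact lookup.
    if text.contains('"') || words.iter().any(|w| w.contains('_') || w.contains("::")) {
        return Route::ExactL2;
    }
    // Long-horizon cues send the query to the archive.
    if LONG_HORIZON.iter().any(|cue| text.contains(*cue)) {
        return Route::ArchiveL3;
    }
    // Everything else is treated as a semantic query.
    Route::SemanticL2
}
```

Ordering the checks from cheapest to most expensive keeps the routing decision itself well below the latency of any tier it selects.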

Performance Metrics

Benchmarked on consumer hardware (Apple Silicon)

Embedding Latency (MLX)

0.06ms

130x faster than CoreML

Cache Hit Latency

~0.05ms

10x faster than MLX

Warm Retrieval (p95)

≤100ms

Target achieved

Cold Retrieval (p95)

≤1s

With summarization

LLM Inference

3-10x faster

With LMCache KV reuse

Indexing Throughput

20+ files/sec

Sustained on consumer hardware
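
The p95 targets above can be checked against recorded latency samples. The sketch below uses the nearest-rank definition of a percentile; the function names are hypothetical, and the project's benchmark harness may compute percentiles differently.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns None for an empty slice or a percentile outside (0, 100].
pub fn percentile(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(pct > 0.0 && pct <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest sample with at least pct% of samples at or below it.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank - 1])
}

/// True when the given percentile of `samples` is within `budget_ms`.
pub fn meets_budget(samples: &[f64], pct: f64, budget_ms: f64) -> bool {
    percentile(samples, pct).map_or(false, |p| p <= budget_ms)
}
```

For example, `meets_budget(&warm_samples, 95.0, 100.0)` would express the warm retrieval target, where `warm_samples` is an assumed vector of measured L2 latencies.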

[Figure: Embedding Performance Comparison]

Query Latency Breakdown (L2 WARM)

Router decision: ~0.1ms
Dense search (HNSW): ~5ms
Sparse search (FTS5): ~3ms
KV lookup (SQLite): ~2ms
Reranking: ~10ms
Total (p50): ~20ms
Total (p95): ≤100ms
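
As a quick arithmetic check, the listed stages sum to 0.1 + 5 + 3 + 2 + 10 = 20.1 ms, which matches the ~20ms p50 total if the stages run one after another. The sketch below encodes that sum and the remaining headroom under the p95 budget; sequential execution is an assumption, and the constant and function names are illustrative.

```rust
/// Approximate per-stage latencies for an L2 warm query, in milliseconds.
pub const STAGES_MS: [(&str, f64); 5] = [
    ("router decision", 0.1),
    ("dense search (HNSW)", 5.0),
    ("sparse search (FTS5)", 3.0),
    ("KV lookup (SQLite)", 2.0),
    ("reranking", 10.0),
];

/// Total latency assuming the stages run sequentially.
pub fn sequential_total_ms() -> f64 {
    STAGES_MS.iter().map(|(_, ms)| ms).sum()
}

/// Headroom left under a latency budget, e.g. the 100 ms p95 target.
pub fn headroom_ms(budget_ms: f64) -> f64 {
    budget_ms - sequential_total_ms()
}
```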

Production Readiness

Test Coverage

216+ tests

Cache Hit Rate

≥70%
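
A cache hit rate target such as ≥70% can be tracked with a pair of counters. This is a minimal sketch with hypothetical names; it does not reflect how the project actually records cache statistics.

```rust
/// Hypothetical hit/miss counters for a cache tier.
#[derive(Debug, Default)]
pub struct CacheStats {
    hits: u64,
    misses: u64,
}

impl CacheStats {
    /// Record the outcome of one lookup.
    pub fn record(&mut self, hit: bool) {
        if hit {
            self.hits += 1;
        } else {
            self.misses += 1;
        }
    }

    /// Fraction of lookups served from cache, or None before any lookup.
    pub fn hit_rate(&self) -> Option<f64> {
        let total = self.hits + self.misses;
        if total == 0 {
            None
        } else {
            Some(self.hits as f64 / total as f64)
        }
    }

    /// True when the observed hit rate is at least `target` (0.70 for 70%).
    pub fn meets_target(&self, target: f64) -> bool {
        self.hit_rate().map_or(false, |rate| rate >= target)
    }
}
```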