
Engrams by DeepSeek: Conditional Memory via Scalable Lookup

An in-depth exploration of DeepSeek's Engram architecture, a novel primitive that introduces conditional memory through O(1) N-gram lookups, enabling massive parameter scaling via CPU-offloaded embedding tables.

TLDR

Engrams, introduced by DeepSeek AI in January 2026, represent a paradigm shift in Large Language Model (LLM) architecture by decoupling Conditional Memory from Conditional Computation. While Mixture-of-Experts (MoE) provides sparsity in processing, Engrams provide sparsity in storage through O(1) N-gram lookups. By offloading static pattern recognition to massive embedding tables—often stored in CPU RAM—Engrams allow models to scale their knowledge base to billions of parameters with less than 3% latency overhead. The core discovery is the Sparsity Allocation Law, which suggests that optimal performance is achieved when approximately 20–25% of a model's parameter budget is dedicated to this static, conditional memory.

Conceptual Overview

The evolution of transformer architectures has largely focused on increasing the depth and width of the hidden layers. However, DeepSeek's research into Engrams suggests that much of the computational effort in early transformer layers is wasted on "memorization" rather than "reasoning." Engrams address this by introducing a dedicated memory primitive.

Conditional Memory vs. Conditional Computation

To understand Engrams, one must distinguish between two types of sparsity:

  1. Conditional Computation (MoE): Dynamically selecting a subset of weights (experts) to process a specific token. This reduces the FLOPs required for inference but keeps the memory requirements high for GPU VRAM.
  2. Conditional Memory (Engram): Dynamically retrieving specific "knowledge vectors" based on the local context (N-grams) without executing dense matrix multiplications.

Engrams function as a massive, sparse lookup table. When the model encounters a specific sequence of tokens (e.g., a technical term or a common phrase), it performs an Exact Match (EM) lookup in an Engram table. This retrieves a pre-computed embedding that summarizes the "static" knowledge associated with that sequence, which is then fused into the transformer's hidden state.

The N-gram Lookup Mechanism

The Engram module uses N-grams (where N typically ranges from 2 to 8) as keys. Unlike traditional RAG (Retrieval-Augmented Generation), which uses dense vector similarity search, Engrams use a hash-based O(1) lookup. This makes retrieval cost independent of table size, allowing memory banks that exceed GPU VRAM capacity.
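The lookup itself can be sketched in a few lines. This is a minimal illustration, not DeepSeek's published implementation: the hashing scheme, table size, and embedding dimension below are all assumptions.

```python
import hashlib
import numpy as np

TABLE_SIZE = 2**20   # number of slots in the Engram table (assumption)
EMBED_DIM = 64       # Engram embedding width (assumption)

rng = np.random.default_rng(0)
engram_table = rng.standard_normal((TABLE_SIZE, EMBED_DIM)).astype(np.float32)

def ngram_index(tokens: tuple) -> int:
    """Map a token N-gram to a table slot with a stable hash (O(1))."""
    key = ",".join(map(str, tokens)).encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % TABLE_SIZE

def lookup(tokens: tuple) -> np.ndarray:
    """Retrieve the Engram vector for an N-gram; cost does not grow with the table."""
    return engram_table[ngram_index(tokens)]

vec = lookup((101, 2045, 17))
print(vec.shape)  # (64,)
```

Because the key is an exact hash of the token IDs rather than a learned query, the same N-gram always hits the same slot, which is what makes the retrieval deterministic and constant-time.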

The U-shaped Scaling Law (Sparsity Allocation Law)

DeepSeek identifies a U-shaped scaling law regarding the allocation of parameters between dense weights and sparse Engram memory. If a model has too little Engram memory, it wastes compute on memorization. If it has too much, the "reasoning" capacity (dense layers) is spread too thin. The "sweet spot" identified by DeepSeek is a 20–25% allocation to Engram parameters.

Engram Architecture: Transformer Block with CPU RAM Embedding Integration. Infographic Description: A diagram showing a Transformer block. Parallel to the Attention and MLP layers is the Engram Module. The input tokens are hashed into N-grams, which point to a massive Embedding Table in CPU RAM. The retrieved vector is passed through a Fusion Gate and added back to the residual stream.

Practical Implementations

Implementing Engrams requires a departure from standard GPU-only training and inference pipelines.

The Two-Phase Operation

  1. Retrieval Phase: The input sequence is decomposed into overlapping N-grams. For each N-gram, a hash function maps the sequence to an index in the Engram table. Because this is an O(1) operation, it can be performed in parallel with the initial embedding layer of the transformer.

  2. Fusion Phase: The retrieved Engram vector e_i is combined with the hidden state h_i of the transformer. DeepSeek utilizes a gated linear unit (GLU) for fusion: h'_i = h_i + \sigma(W_g [h_i; e_i]) \cdot (W_f e_i), where W_g and W_f are small learnable matrices that decide how much of the Engram memory should influence the current token's representation.
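The fusion step above can be written out directly in NumPy. This is a minimal sketch of the gated update h' = h + sigma(W_g [h; e]) * (W_f e); the dimensions and initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_H, D_E = 128, 64                                   # hidden and Engram dims (assumptions)
W_g = rng.standard_normal((D_H, D_H + D_E)) * 0.02   # gate projection
W_f = rng.standard_normal((D_H, D_E)) * 0.02         # Engram projection

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(h: np.ndarray, e: np.ndarray) -> np.ndarray:
    """Add gated Engram memory into the residual stream."""
    gate = sigmoid(W_g @ np.concatenate([h, e]))  # how much memory to admit
    return h + gate * (W_f @ e)

h = rng.standard_normal(D_H)
e = np.zeros(D_E)                  # a hash miss returns a zero vector
print(np.allclose(fuse(h, e), h))  # True: a zero Engram leaves h unchanged
```

Note the convenient property this buys: on a table miss, e is the zero vector, so the residual update vanishes and the block degrades gracefully to a plain transformer layer.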

CPU-Offloading and Latency

One of the most significant advantages of Engrams is their compatibility with CPU RAM offloading. Since the lookup is O(1) and the fusion happens only at specific layers (usually the early-to-mid layers), the Engram table can reside in system memory.

  • Latency: DeepSeek reports that with high-speed PCIe Gen5 or CXL links, the latency overhead of fetching an Engram from CPU RAM is <3%.
  • Throughput: By offloading the "knowledge" parameters to CPU, the GPU VRAM is freed up for larger KV caches, significantly increasing the maximum context window and batch size.
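The latency-hiding idea behind these numbers can be illustrated with a toy producer/consumer loop: while "GPU" compute runs on the current layer's Engrams, a background thread prefetches the next layer's vectors from a host-memory table. The thread pool, table size, and stand-in compute below are assumptions for illustration, not DeepSeek's actual pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
table = rng.standard_normal((1 << 16, 64)).astype(np.float32)  # "CPU RAM" table

def fetch(indices):
    return table[indices]        # stand-in for a PCIe/CXL transfer

def compute(engrams):
    return engrams @ engrams.T   # stand-in for dense transformer work

outputs = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch, np.arange(8))               # prefetch layer 0
    for layer in range(4):
        engrams = pending.result()                           # usually already done
        pending = pool.submit(fetch, np.arange(8) + layer)   # prefetch next layer
        outputs.append(compute(engrams))                     # overlaps with fetch
print(len(outputs))  # 4
```

Because the N-gram keys are known as soon as the tokens arrive, the fetch can be issued well before the layer that consumes it, which is why the reported overhead stays small.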

Integration with NER and EM

Engrams are particularly effective for tasks involving Named Entity Recognition (NER) and Exact Match (EM) requirements.

  • In NER, Engrams can store specific embeddings for millions of unique entities (e.g., rare chemical compounds or obscure historical figures) that would otherwise be "blurred" in a standard dense model.
  • In EM tasks, such as code generation or legal document analysis, the N-gram lookup ensures that specific syntax patterns or boilerplate clauses are retrieved with perfect fidelity.

Advanced Techniques

Multi-Head Engram Lookups

Similar to Multi-Head Attention, Engrams can be implemented with multiple "heads." Each head might look at a different N-gram length (e.g., Head 1 looks at 2-grams, Head 2 looks at 5-grams). This allows the model to capture both local syntactic patterns and longer semantic phrases simultaneously.
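A minimal sketch of the multi-head idea, with two heads keyed on 2-grams and 5-grams over the same context window. Head count, N values, table size, and dimensions are all illustrative assumptions.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)
HEADS = {1: 2, 2: 5}                       # head id -> N-gram length (assumption)
TABLE_SIZE, DIM = 1 << 18, 32
tables = {h: rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32)
          for h in HEADS}

def slot(tokens):
    """Stable O(1) hash of a token N-gram into a table index."""
    key = ",".join(map(str, tokens)).encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(),
                          "little") % TABLE_SIZE

def multi_head_lookup(context: list) -> np.ndarray:
    """Concatenate one Engram per head, each keyed on a different N."""
    parts = []
    for head, n in HEADS.items():
        ngram = tuple(context[-n:])        # suffix N-gram ending at current token
        parts.append(tables[head][slot(ngram)])
    return np.concatenate(parts)

vec = multi_head_lookup([7, 8, 9, 10, 11])
print(vec.shape)  # (64,)
```

The short-N head fires on almost every position (syntax), while the long-N head fires only on recurring phrases (semantics), mirroring the division of labor described above.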

Robustness to Prompt Variants

When comparing prompt variants, Engram-equipped models show significantly higher stability. Traditional models often fluctuate in their internal representations based on minor phrasing changes. Engrams, by anchoring the representation to specific N-gram lookups, provide a "grounding" effect that makes the model more robust to prompt perturbations.

Dynamic Engram Updating

While the initial Engram tables are populated during pre-training, DeepSeek has experimented with "Online Engram Accumulation." In this setup, the model can update its Engram table during a long-running conversation or across a large document, effectively acting as a persistent, high-speed cache of the current context.
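One way to picture such online accumulation is a writable overlay table that averages the hidden states observed for each N-gram during a session. The running-mean update rule here is an assumption for illustration; the source does not specify the update mechanics.

```python
import numpy as np

class OnlineEngramCache:
    """Toy session-level Engram overlay: per-N-gram running mean of hidden states."""

    def __init__(self, dim: int):
        self.dim = dim
        self.store = {}    # ngram -> mean hidden state
        self.counts = {}   # ngram -> number of observations

    def update(self, ngram: tuple, hidden: np.ndarray) -> None:
        """Fold a new observation into the running mean for this N-gram."""
        n = self.counts.get(ngram, 0)
        prev = self.store.get(ngram, np.zeros(self.dim))
        self.store[ngram] = (prev * n + hidden) / (n + 1)
        self.counts[ngram] = n + 1

    def lookup(self, ngram: tuple) -> np.ndarray:
        """Miss returns a zero vector so the fusion gate contributes nothing."""
        return self.store.get(ngram, np.zeros(self.dim))

cache = OnlineEngramCache(dim=4)
cache.update((1, 2), np.ones(4))
cache.update((1, 2), 3 * np.ones(4))
print(cache.lookup((1, 2)))  # [2. 2. 2. 2.]
```

Because misses fall back to zero, the overlay composes cleanly with the frozen pre-trained table: it only ever adds signal for N-grams the session has actually seen.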

Research and Future Directions

The introduction of Engrams marks a shift toward "Hardware-Aware Architecture." As we hit the limits of GPU VRAM, the ability to utilize the terabytes of RAM available on modern AI servers becomes critical.

The Sparsity Allocation Law in AGI

DeepSeek's research suggests that as we move toward AGI, the ratio of "Memory" to "Compute" may need to shift even further. If a model can "look up" the laws of physics or the syntax of a programming language via an Engram, it can dedicate 100% of its active FLOPs to solving the specific problem at hand rather than recalling the rules.

Future Hardware: CXL and Beyond

The future of Engrams is tied to interconnect technology. Technologies like Compute Express Link (CXL) will allow GPUs to access CPU-attached memory with even lower latency, potentially making Engram lookups as fast as local VRAM access. This would allow for "Exascale Memory" models where the Engram table contains trillions of entries, covering almost every known N-gram in human language.

Frequently Asked Questions

Q: How do Engrams differ from standard RAG?

A: RAG (Retrieval-Augmented Generation) typically retrieves whole documents or chunks based on semantic similarity (vector search), which is computationally expensive and happens outside the model's forward pass. Engrams are an architectural primitive inside the transformer that performs O(1) exact-match lookups of N-gram embeddings at the token level.

Q: Does the Engram table make the model harder to train?

A: Actually, it can simplify training. By offloading static memorization to the Engram table, the dense layers (the "brain") converge faster on reasoning tasks. DeepSeek uses a "Warm-up and Freeze" strategy where the Engram table is populated and then the dense layers are fine-tuned to utilize it.

Q: What happens if an N-gram is not in the table?

A: The Engram module includes a "null" or "default" return. If a hash miss occurs or the N-gram is not present, the fusion gate simply outputs a zero-vector, and the transformer relies entirely on its standard dense layers.

Q: Can Engrams be used for multi-modal data?

A: Yes. DeepSeek has proposed "Visual Engrams" where patches of images are quantized into "visual words" and used as keys in a lookup table, similar to how text N-grams are used.

Q: How does the 20-25% parameter budget work?

A: According to the Sparsity Allocation Law, if you have a 100B parameter budget, you should allocate ~75B to the transformer's dense weights and ~25B to the Engram embedding table. This ratio provides the optimal balance between the model's ability to reason and its ability to recall specific facts.
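The arithmetic behind this split is straightforward to make concrete. The 4096-wide Engram embedding used to size the table below is an illustrative assumption, not a figure from the source.

```python
# Back-of-the-envelope parameter split under the Sparsity Allocation Law.
total_params = 100e9
engram_frac = 0.25                      # upper end of the 20-25% sweet spot

engram_params = total_params * engram_frac
dense_params = total_params - engram_params
embed_dim = 4096                        # assumed Engram vector width
n_entries = engram_params / embed_dim   # rows the table can hold

print(f"dense: {dense_params/1e9:.0f}B, engram: {engram_params/1e9:.0f}B")
print(f"table entries at dim {embed_dim}: {n_entries/1e6:.1f}M")
```

At these assumed sizes, a 25B-parameter table holds roughly 6 million distinct N-gram entries; narrower Engram vectors would trade per-entry capacity for coverage.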

References

  1. DeepSeek AI (2026). Engrams: Scaling Conditional Memory in Transformers.
  2. Shazeer et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
  3. DeepSeek-V3 Technical Report (2025).
  4. Vaswani et al. (2017). Attention is All You Need.

Related Articles


Hybrid Search

A deep technical exploration of Hybrid Search, detailing the integration of sparse lexical retrieval and dense semantic vectors to optimize RAG pipelines and enterprise discovery systems.

Keyword Search

A deep technical exploration of Keyword Search (lexical retrieval), covering the mechanics of inverted indexes, the mathematical foundations of BM25, Learned Sparse Retrieval (LSR), and its integration into hybrid RAG architectures.

Semantic Search Ranking

A comprehensive technical guide to modern semantic search ranking, exploring the transition from lexical BM25 to multi-stage neural pipelines involving Bi-Encoders, Cross-Encoders, and Late Interaction models.

Vector Search

An exhaustive technical guide to vector search, exploring high-dimensional embeddings, Approximate Nearest Neighbor (ANN) algorithms, and the architectural integration of vector databases in modern AI retrieval systems.

Cross-Lingual and Multilingual Embeddings

A comprehensive technical exploration of cross-lingual and multilingual embeddings, covering the evolution from static Procrustes alignment to modern multi-functional transformer encoders like M3-Embedding and XLM-R.

Dimensionality and Optimization

An exploration of the transition from the Curse of Dimensionality to the Blessing of Dimensionality, detailing how high-dimensional landscapes facilitate global convergence through saddle point dominance and manifold-aware optimization.

Embedding Model Categories

A comprehensive technical taxonomy of embedding architectures, exploring the trade-offs between dense, sparse, late interaction, and Matryoshka models in modern retrieval systems.

Embedding Techniques

A comprehensive technical exploration of embedding techniques, covering the transition from sparse to dense representations, the mathematics of latent spaces, and production-grade optimizations like Matryoshka Representation Learning and Late Interaction.