TL;DR
Engrams, introduced by DeepSeek AI in January 2026, represent a paradigm shift in Large Language Model (LLM) architecture by decoupling Conditional Memory from Conditional Computation. While Mixture-of-Experts (MoE) provides sparsity in processing, Engrams provide sparsity in storage through O(1) N-gram lookups. By offloading static pattern recognition to massive embedding tables—often stored in CPU RAM—Engrams allow models to scale their knowledge base to billions of parameters with less than 3% latency overhead. The core discovery is the Sparsity Allocation Law, which suggests that optimal performance is achieved when approximately 20–25% of a model's parameter budget is dedicated to this static, conditional memory.
Conceptual Overview
The evolution of transformer architectures has largely focused on increasing the depth and width of the hidden layers. However, DeepSeek's research into Engrams suggests that much of the computational effort in early transformer layers is wasted on "memorization" rather than "reasoning." Engrams address this by introducing a dedicated memory primitive.
Conditional Memory vs. Conditional Computation
To understand Engrams, one must distinguish between two types of sparsity:
- Conditional Computation (MoE): Dynamically selecting a subset of weights (experts) to process a specific token. This reduces the FLOPs required for inference but keeps the memory requirements high for GPU VRAM.
- Conditional Memory (Engram): Dynamically retrieving specific "knowledge vectors" based on the local context (N-grams) without executing dense matrix multiplications.
Engrams function as a massive, sparse lookup table. When the model encounters a specific sequence of tokens (e.g., a technical term or a common phrase), it performs an Exact Match (EM) lookup in an Engram table. This retrieves a pre-computed embedding that summarizes the "static" knowledge associated with that sequence, which is then fused into the transformer's hidden state.
The N-gram Lookup Mechanism
The Engram module utilizes N-grams (where N typically ranges from 2 to 8) as keys. Unlike traditional RAG (Retrieval-Augmented Generation) which uses dense vector similarity search, Engrams use a hash-based O(1) lookup. This ensures that the retrieval process is independent of the table size, allowing for memory banks that exceed the capacity of GPU VRAM.
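The hash-based lookup described above can be sketched in a few lines of Python. This is a toy illustration, not DeepSeek's implementation: the table size, hash function (BLAKE2b here), and dict-backed storage are all illustrative choices, and the sparse dict stands in for a dense embedding table that would really hold billions of rows.

```python
import hashlib

TABLE_SIZE = 1 << 30   # slot count; real Engram tables hold billions of entries
EMBED_DIM = 4          # tiny dimension for illustration

# Sparse dict stands in for the dense embedding table; a missing slot
# behaves like a hash miss and falls back to a zero vector.
table = {}

def ngram_slot(tokens):
    """Hash an N-gram (tuple of tokens) to a fixed slot.

    The cost is independent of TABLE_SIZE, which is what makes the
    retrieval O(1) regardless of how large the memory bank grows.
    """
    key = "\x1f".join(tokens).encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % TABLE_SIZE

def lookup(tokens):
    return table.get(ngram_slot(tokens), [0.0] * EMBED_DIM)

# Write one entry during "training", then retrieve it at "inference".
table[ngram_slot(("mixture", "of", "experts"))] = [0.1, 0.2, 0.3, 0.4]
print(lookup(("mixture", "of", "experts")))  # -> [0.1, 0.2, 0.3, 0.4]
print(lookup(("unseen", "phrase")))          # zero vector on a miss
```

Because the key is an exact token sequence rather than a dense query vector, there is no similarity search step at all, which is the core contrast with RAG-style retrieval.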
The U-shaped Scaling Law (Sparsity Allocation Law)
DeepSeek identifies a U-shaped scaling law regarding the allocation of parameters between dense weights and sparse Engram memory. If a model has too little Engram memory, it wastes compute on memorization. If it has too much, the "reasoning" capacity (dense layers) is spread too thin. The "sweet spot" identified by DeepSeek is a 20–25% allocation to Engram parameters.
Engram Architecture: Transformer Block with CPU RAM Embedding Integration
Infographic Description: A diagram showing a Transformer block. Parallel to the Attention and MLP layers is the Engram Module. The input tokens are hashed into N-grams, which point to a massive Embedding Table in CPU RAM. The retrieved vector is passed through a Fusion Gate and added back to the residual stream.
Practical Implementations
Implementing Engrams requires a departure from standard GPU-only training and inference pipelines.
The Two-Phase Operation
- Retrieval Phase: The input sequence is decomposed into overlapping N-grams. For each N-gram, a hash function maps the sequence to an index in the Engram table. Because this is an O(1) operation, it can be performed in parallel with the initial embedding layer of the transformer.
- Fusion Phase: The retrieved Engram vector e_i is combined with the hidden state h_i of the transformer. DeepSeek utilizes a gated linear unit (GLU) for fusion: h'_i = h_i + \sigma(W_g [h_i; e_i]) \cdot (W_f e_i), where W_g and W_f are small learnable matrices that decide how much of the Engram memory should influence the current token's representation.
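The gated fusion formula above can be worked through numerically. The sketch below is a minimal pure-Python rendering of h'_i = h_i + sigma(W_g [h_i; e_i]) * (W_f e_i); the toy dimensions and weight values are chosen only to make the arithmetic easy to follow, and a real implementation would use batched tensor ops.

```python
import math

def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse(h, e, W_g, W_f):
    """h'_i = h_i + sigmoid(W_g [h_i; e_i]) * (W_f e_i), elementwise."""
    gate = [sigmoid(z) for z in matvec(W_g, h + e)]  # `h + e` is list concat, i.e. [h; e]
    proj = matvec(W_f, e)
    return [hi + g * p for hi, g, p in zip(h, gate, proj)]

# Toy sizes: hidden dim 2, so W_g is 2x4 (acts on [h; e]) and W_f is 2x2.
h = [1.0, -1.0]
e = [0.5, 0.5]
W_g = [[0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0]]   # zero gate weights -> sigmoid(0) = 0.5 everywhere
W_f = [[1.0, 0.0],
       [0.0, 1.0]]             # identity projection, so W_f e = e
print(fuse(h, e, W_g, W_f))    # -> [1.25, -0.75]
```

With the gate saturated at 0.5, exactly half of each Engram component is added to the residual stream; in training, W_g learns to open the gate only when the retrieved memory is relevant.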
CPU-Offloading and Latency
One of the most significant advantages of Engrams is their compatibility with CPU RAM offloading. Since the lookup is O(1) and the fusion happens only at specific layers (usually the early-to-mid layers), the Engram table can reside in system memory.
- Latency: DeepSeek reports that with high-speed PCIe Gen5 or CXL links, the latency overhead of fetching an Engram from CPU RAM is <3%.
- Throughput: By offloading the "knowledge" parameters to CPU, the GPU VRAM is freed up for larger KV caches, significantly increasing the maximum context window and batch size.
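The reason the host-side fetch costs so little is that it can be overlapped with dense compute that is happening anyway. The sketch below illustrates that scheduling idea with a worker thread; the two functions are stand-ins (the sleeps model transfer and compute time), not real PCIe or GPU calls.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def host_engram_lookup(ngrams):
    """Stand-in for the O(1) fetches from the CPU-resident Engram table."""
    time.sleep(0.005)                       # models PCIe/CXL transfer latency
    return {ng: [0.0, 0.0] for ng in ngrams}

def device_embed(tokens):
    """Stand-in for the GPU's dense embedding work for the same step."""
    time.sleep(0.005)
    return [[1.0, 0.0] for _ in tokens]

tokens = ["def", "main", "("]
ngrams = [("def", "main"), ("main", "(")]

with ThreadPoolExecutor(max_workers=1) as pool:
    fetch = pool.submit(host_engram_lookup, ngrams)  # kick off host lookups early
    hidden = device_embed(tokens)                    # dense compute runs meanwhile
    engrams = fetch.result()                         # vectors ready by fusion time
```

Because both stand-ins take the same amount of time, the lookup is effectively free here; the overhead reported by DeepSeek appears only when the transfer cannot be fully hidden behind compute.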
Integration with NER and EM
Engrams are particularly effective for tasks involving Named Entity Recognition (NER) and Exact Match (EM) requirements.
- In NER, Engrams can store specific embeddings for millions of unique entities (e.g., rare chemical compounds or obscure historical figures) that would otherwise be "blurred" in a standard dense model.
- In EM tasks, such as code generation or legal document analysis, the N-gram lookup ensures that specific syntax patterns or boilerplate clauses are retrieved with perfect fidelity.
Advanced Techniques
Multi-Head Engram Lookups
Similar to Multi-Head Attention, Engrams can be implemented with multiple "heads." Each head might look at a different N-gram length (e.g., Head 1 looks at 2-grams, Head 2 looks at 5-grams). This allows the model to capture both local syntactic patterns and longer semantic phrases simultaneously.
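A minimal sketch of this multi-head scheme follows, assuming (hypothetically) one table per head and a summed combination of the retrieved vectors; DeepSeek's actual head count and combination rule are not specified here.

```python
import hashlib

TABLE_SIZE = 1 << 24
DIM = 4
HEAD_NGRAM_LENGTHS = (2, 5)                   # one head per N-gram length
tables = {n: {} for n in HEAD_NGRAM_LENGTHS}  # sparse dicts stand in for full tables

def slot(tokens):
    digest = hashlib.blake2b(" ".join(tokens).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE

def multi_head_lookup(tokens, pos):
    """Sum one retrieved vector per head for the token at `pos`."""
    out = [0.0] * DIM
    for n in HEAD_NGRAM_LENGTHS:
        if pos + 1 < n:
            continue                                   # not enough left-context
        ngram = tuple(tokens[pos + 1 - n : pos + 1])   # the trailing n-gram
        vec = tables[n].get(slot(ngram), [0.0] * DIM)  # hash miss -> zero vector
        out = [o + v for o, v in zip(out, vec)]
    return out

# Seed the 2-gram head with one entry and query it.
tables[2][slot(("neural", "network"))] = [1.0, 0.0, 0.0, 0.0]
print(multi_head_lookup(["a", "neural", "network"], 2))  # -> [1.0, 0.0, 0.0, 0.0]
```

Note that the 5-gram head silently contributes nothing here (too little context), which matches the "null return" behavior described in the FAQ below.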
Comparing Prompt Variants
When comparing prompt variants, Engram-equipped models show significantly higher stability. Traditional models often fluctuate in their internal representations based on minor phrasing changes. Engrams, by anchoring the representation to specific N-gram lookups, provide a "grounding" effect that makes the model more robust to prompt perturbations.
Dynamic Engram Updating
While the initial Engram tables are populated during pre-training, DeepSeek has experimented with "Online Engram Accumulation." In this setup, the model can update its Engram table during a long-running conversation or across a large document, effectively acting as a persistent, high-speed cache of the current context.
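One simple way such online accumulation could work is a per-key running mean, so repeated occurrences of an N-gram within a session refine its cached vector. The class below is a hypothetical sketch of that idea, not DeepSeek's published mechanism.

```python
class OnlineEngramCache:
    """Toy session-level cache: keeps a running mean vector per N-gram key."""

    def __init__(self, dim):
        self.dim = dim
        self.store = {}                 # key -> (count, running mean vector)

    def update(self, key, vec):
        count, mean = self.store.get(key, (0, [0.0] * self.dim))
        count += 1
        # Incremental mean: mean += (x - mean) / n, applied per component.
        mean = [m + (v - m) / count for m, v in zip(mean, vec)]
        self.store[key] = (count, mean)

    def get(self, key):
        return self.store.get(key, (0, [0.0] * self.dim))[1]

cache = OnlineEngramCache(dim=2)
cache.update(("foo", "bar"), [0.0, 2.0])
cache.update(("foo", "bar"), [2.0, 0.0])
print(cache.get(("foo", "bar")))        # -> [1.0, 1.0]
```

Unlike the pre-trained tables, this cache is mutable at inference time, which is what lets it act as a persistent high-speed record of the current context.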
Research and Future Directions
The introduction of Engrams marks a shift toward "Hardware-Aware Architecture." As we hit the limits of GPU VRAM, the ability to utilize the terabytes of RAM available on modern AI servers becomes critical.
The Sparsity Allocation Law in AGI
DeepSeek's research suggests that as we move toward AGI, the ratio of "Memory" to "Compute" may need to shift even further. If a model can "look up" the laws of physics or the syntax of a programming language via an Engram, it can dedicate 100% of its active FLOPs to solving the specific problem at hand rather than recalling the rules.
Future Hardware: CXL and Beyond
The future of Engrams is tied to interconnect technology. Technologies like Compute Express Link (CXL) will allow GPUs to access CPU-attached memory with even lower latency, potentially making Engram lookups as fast as local VRAM access. This would allow for "Exascale Memory" models where the Engram table contains trillions of entries, covering almost every known N-gram in human language.
Frequently Asked Questions
Q: How do Engrams differ from standard RAG?
A: RAG (Retrieval-Augmented Generation) typically retrieves whole documents or chunks based on semantic similarity (vector search), which is computationally expensive and happens outside the model's forward pass. Engrams are an architectural primitive inside the transformer that performs O(1) exact-match lookups of N-gram embeddings at the token level.
Q: Does the Engram table make the model harder to train?
A: Actually, it can simplify training. By offloading static memorization to the Engram table, the dense layers (the "brain") converge faster on reasoning tasks. DeepSeek uses a "Warm-up and Freeze" strategy where the Engram table is populated and then the dense layers are fine-tuned to utilize it.
Q: What happens if an N-gram is not in the table?
A: The Engram module includes a "null" or "default" return. If a hash miss occurs or the N-gram is not present, the fusion gate simply outputs a zero-vector, and the transformer relies entirely on its standard dense layers.
Q: Can Engrams be used for multi-modal data?
A: Yes. DeepSeek has proposed "Visual Engrams" where patches of images are quantized into "visual words" and used as keys in a lookup table, similar to how text N-grams are used.
Q: How does the 20–25% parameter budget work?
A: According to the Sparsity Allocation Law, if you have a 100B parameter budget, you should allocate ~75B to the transformer's dense weights and ~25B to the Engram embedding table. This ratio provides the optimal balance between the model's ability to reason and its ability to recall specific facts.
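The arithmetic of this split is trivial but worth making explicit. The helper below is a toy illustration (the function name and default fraction are ours, with 25% taken from the upper end of the range cited above).

```python
def split_budget(total_params, engram_fraction=0.25):
    """Split a total parameter budget between dense weights and the Engram table."""
    engram = int(total_params * engram_fraction)
    dense = total_params - engram
    return dense, engram

dense, engram = split_budget(100_000_000_000)   # 100B-parameter budget
print(dense, engram)                            # -> 75000000000 25000000000
```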
References
- DeepSeek AI (2026). Engrams: Scaling Conditional Memory in Transformers.
- Shazeer et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
- DeepSeek-V3 Technical Report (2025).
- Vaswani et al. (2017). Attention is All You Need.