TLDR
Compute requirements in 2025 have evolved from a simple pursuit of raw FLOPS (Floating Point Operations Per Second) into a multi-dimensional optimization problem involving High-Bandwidth Memory (HBM), interconnect throughput, and algorithmic efficiency. As Large Language Models (LLMs) continue to scale, the industry is witnessing a $3.9\times$ annual increase in training compute demand. However, the bottleneck has shifted: modern workloads are increasingly memory-bound or interconnect-bound rather than compute-bound. Engineering teams must now balance Model FLOPs Utilization (MFU) against the "Memory Wall" while navigating the trade-offs between training-time scaling and the emerging paradigm of test-time compute. Key strategies for mitigation include precision-optimized architectures (e.g., 1.58-bit LLMs) and IO-aware algorithms like FlashAttention.
Conceptual Overview
Compute requirements represent the quantified hardware and operational resources—specifically TFLOPS, VRAM capacity, and network bandwidth—necessary to execute workloads within defined latency and throughput constraints. In the context of modern AI, these requirements are governed by the relationship between algorithm complexity, data volume, and hardware throughput.
The Roofline Model and Arithmetic Intensity
The fundamental framework for understanding compute requirements is the Roofline Model. This model relates processor performance to memory traffic by plotting "Attainable Performance" (GFLOP/s) against "Arithmetic Intensity" (FLOPs/Byte).
- Arithmetic Intensity: This is the ratio of total floating-point operations performed to the total bytes of data moved from main memory (HBM).
- If a kernel has low arithmetic intensity, it is Memory-Bound. The processor spends most of its time waiting for data to arrive from VRAM.
- If a kernel has high arithmetic intensity, it is Compute-Bound. The processor is fully utilized, and performance is limited by the peak TFLOPS of the hardware.
- The Ridge Point: This is the specific arithmetic intensity at which a system transitions from being memory-bound to compute-bound. For an NVIDIA H100, the ridge point is significantly higher than in previous generations, meaning kernels must perform more operations per byte of data moved to reach peak performance.
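To make the ridge-point arithmetic concrete, here is a minimal Python sketch of the roofline calculation; the peak-throughput and bandwidth constants are illustrative assumptions (they vary by SKU and precision), not vendor specifications.

```python
def attainable_tflops(intensity_flops_per_byte: float,
                      peak_tflops: float,
                      mem_bw_tb_per_s: float) -> float:
    """Roofline model: performance is capped by the lower of the compute roof
    and the memory-bandwidth slope (bandwidth * arithmetic intensity)."""
    return min(peak_tflops, mem_bw_tb_per_s * intensity_flops_per_byte)

# Assumed H100-class numbers for illustration only (dense BF16 peak, HBM3 bandwidth).
PEAK_TFLOPS = 989.0
MEM_BW_TB_PER_S = 3.35

ridge_point = PEAK_TFLOPS / MEM_BW_TB_PER_S  # FLOPs per byte needed to saturate compute
print(f"ridge point ~ {ridge_point:.0f} FLOPs/byte")

for intensity in (10, 100, 300, 1000):       # FLOPs performed per byte moved from HBM
    perf = attainable_tflops(intensity, PEAK_TFLOPS, MEM_BW_TB_PER_S)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"intensity {intensity:>4} FLOPs/B -> {perf:6.1f} TFLOP/s ({regime})")
```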
Scaling Laws: Kaplan vs. Chinchilla
The estimation of compute requirements is heavily influenced by neural scaling laws.
- Kaplan Scaling Laws (2020): Suggested that model performance scales primarily with the number of parameters ($N$), the size of the dataset ($D$), and the amount of compute ($C$). It favored scaling model size more aggressively than data size.
- Chinchilla Scaling Laws (Hoffmann et al., 2022): Revised this by demonstrating that most models were "under-trained." Hoffmann et al. found that for a compute-optimal model, $N$ and $D$ should be scaled in equal proportion. Specifically, a model requires approximately 20 tokens per parameter for compute-optimal training.
In 2025, we are seeing a shift toward "inference-optimal" scaling, where models are trained on far more data than the Chinchilla-optimal amount (e.g., Llama 3) so that a smaller, cheaper-to-serve model reaches the same quality, reducing compute requirements during deployment.
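Both heuristics reduce to simple arithmetic. The sketch below applies the Chinchilla rule of thumb ($D \approx 20N$) and the standard $C \approx 6ND$ training-cost approximation (also used in the FAQ); the over-trained token count is an illustrative placeholder, not a specific model's recipe.

```python
def chinchilla_tokens(params: float) -> float:
    """Chinchilla heuristic: roughly 20 training tokens per parameter."""
    return 20 * params

def training_flops(params: float, tokens: float) -> float:
    """Standard approximation: C ~ 6 * N * D FLOPs for a full training run."""
    return 6 * params * tokens

N = 70e9                                  # 70B-parameter model
D_optimal = chinchilla_tokens(N)          # ~1.4T tokens for compute-optimal training
D_overtrained = 15e12                     # illustrative "inference-optimal" over-training

print(f"compute-optimal tokens:        {D_optimal:.2e}")
print(f"FLOPs at Chinchilla-optimal D: {training_flops(N, D_optimal):.2e}")
print(f"FLOPs when over-training:      {training_flops(N, D_overtrained):.2e}")
```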
(Figure: Roofline plot with Arithmetic Intensity (FLOPs/Byte) on the x-axis and Attainable Performance on the y-axis. A diagonal line represents the memory-bandwidth limit (HBM throughput) and a horizontal plateau represents peak compute (TFLOPS); the Ridge Point, where the two meet, marks the transition between regimes. Annotations show how quantization (e.g., moving from FP16 to FP8) shifts a workload from the memory-bound slope toward the compute-bound plateau by reducing data-movement overhead.)
Practical Implementations
1. VRAM Footprint Estimation
Accurately calculating VRAM requirements is the primary defense against "Out-of-Memory" (OOM) failures. For a model with $P$ parameters, the memory requirements are partitioned as follows:
- Model Weights:
- FP16/BF16: $2 \times P$ bytes.
- INT8/FP8: $1 \times P$ bytes.
- 4-bit Quantization: $0.5 \times P$ bytes.
- Optimizer States (Training): Using AdamW in FP32 requires $12 \times P$ bytes (4 bytes for the master weight, 4 for the momentum, and 4 for the variance).
- Gradients: Typically $2 \times P$ bytes.
- KV Cache (Inference): The Key-Value cache stores the attention context for autoregressive generation. Its size (the leading factor of 2 accounts for storing both keys and values) is: $$\text{Memory}_{KV} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{batch\_size} \times \text{bytes per element}$$
For a 70B parameter model at FP16, the weights alone require 140GB. To run this on 80GB A100s, one must use either quantization or model parallelism.
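The bookkeeping above folds into a small estimator. A minimal sketch, assuming FP16/BF16 weights, AdamW with FP32 states, and the KV-cache formula given earlier; the layer/head/sequence values in the example are illustrative placeholders, and activations plus framework overhead are ignored.

```python
def vram_estimate_gb(params: float,
                     bytes_per_weight: float = 2.0,    # FP16/BF16 weights
                     training: bool = False,
                     n_layers: int = 0, n_kv_heads: int = 0, head_dim: int = 0,
                     seq_len: int = 0, batch_size: int = 0,
                     kv_bytes_per_element: float = 2.0) -> float:
    """Rough VRAM estimate: weights, plus optimizer states and gradients when
    training, plus the KV cache when the inference shape arguments are set."""
    weights = params * bytes_per_weight
    optimizer = 12 * params if training else 0          # FP32 master weights + momentum + variance
    gradients = 2 * params if training else 0           # BF16 gradients
    kv_cache = (2 * n_layers * n_kv_heads * head_dim    # factor 2: one K and one V tensor
                * seq_len * batch_size * kv_bytes_per_element)
    return (weights + optimizer + gradients + kv_cache) / 1e9

# 70B model: weights only, FP16 inference with an illustrative decoder shape, and full training state.
# n_kv_heads equals the attention head count unless GQA/MQA is used.
kv_example = vram_estimate_gb(70e9, n_layers=80, n_kv_heads=8,
                              head_dim=128, seq_len=8192, batch_size=8)
print(f"weights only:         {vram_estimate_gb(70e9):.0f} GB")
print(f"inference + KV cache: {kv_example:.0f} GB")
print(f"training state:       {vram_estimate_gb(70e9, training=True):.0f} GB")
```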
2. Model FLOPs Utilization (MFU)
MFU is the metric used to evaluate how efficiently a workload uses the available hardware. It is the ratio of the actual throughput achieved to the theoretical peak of the GPU. $$MFU = \frac{\text{Actual Throughput (tokens/sec)} \times \text{FLOPs per token}}{\text{Peak Hardware FLOPS}}$$ A low MFU (e.g., < 30%) indicates that the system is bottlenecked by something other than compute—usually the interconnect (latency between GPUs) or data loading (CPU-to-GPU bottleneck).
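A minimal sketch of the same ratio, using the common approximation of roughly $6P$ FLOPs per token for training (about $2P$ for a forward-only inference pass); the throughput and peak numbers are placeholders, not benchmark results.

```python
def mfu(tokens_per_sec: float, params: float, peak_flops_per_sec: float,
        flops_per_token_factor: float = 6.0) -> float:
    """Model FLOPs Utilization: achieved FLOP/s divided by peak FLOP/s.
    flops_per_token_factor ~ 6 for training (forward + backward), ~ 2 for inference."""
    achieved = tokens_per_sec * flops_per_token_factor * params
    return achieved / peak_flops_per_sec

# Illustrative: a 70B model training at 400 tokens/s per GPU on a ~989 TFLOP/s accelerator.
print(f"MFU = {mfu(tokens_per_sec=400, params=70e9, peak_flops_per_sec=989e12):.1%}")
# ~17% -- low enough to suggest an interconnect or data-loading bottleneck.
```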
3. Compute for Information Retrieval (IR)
In production environments, compute requirements are not limited to the LLM. Information Retrieval (IR) systems, such as those used in RAG (Retrieval-Augmented Generation), introduce their own hardware demands:
- Embedding Generation: High-throughput inference is required to convert incoming queries into vectors.
- Vector Search: While often memory-latency bound, large-scale IR requires significant CPU or GPU compute for similarity calculations (e.g., Cosine Similarity or Inner Product) across millions of vectors.
- Re-ranking: To achieve a high Exact Match (EM) rate, systems often employ a "Cross-Encoder" re-ranker. This is a compute-intensive step where a model evaluates the top-$k$ retrieved documents against the query. The compute cost here scales linearly with the number of documents being re-ranked.
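The linear scaling of the re-ranking step can be sketched with the same ~$2P$ FLOPs-per-token forward-pass approximation; the 300M-parameter cross-encoder and the token lengths below are illustrative assumptions.

```python
def rerank_flops(n_docs: int, reranker_params: float,
                 query_tokens: int, doc_tokens: int) -> float:
    """Cross-encoder re-ranking: one forward pass per (query, document) pair at
    roughly 2 * params FLOPs per input token; cost grows linearly with n_docs."""
    return n_docs * 2 * reranker_params * (query_tokens + doc_tokens)

# Illustrative 300M-parameter cross-encoder scoring the top-k retrieved documents.
for k in (10, 50, 200):
    flops = rerank_flops(k, reranker_params=300e6, query_tokens=32, doc_tokens=256)
    print(f"top-{k:<3} candidates -> {flops:.2e} FLOPs per query")
```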
Advanced Techniques
Precision Optimization: The 1.58-Bit Revolution
The most aggressive approach to reducing compute requirements is the move toward ternary weights $\{-1, 0, 1\}$. Research into BitNet b1.58 has shown that LLMs can maintain high performance while replacing floating-point multiplications with simple additions.
- Compute Impact: Matrix multiplication (MatMul) is the most expensive operation in AI. By using 1.58-bit weights, MatMul is replaced by integer addition, which is significantly more energy-efficient and requires less silicon area.
- Memory Impact: This allows for a massive reduction in VRAM usage, potentially allowing 100B+ parameter models to run on hardware with limited HBM.
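A minimal NumPy sketch of absmean-style ternary quantization in the spirit of BitNet b1.58 (the published recipe applies this inside quantization-aware training rather than post-hoc, so treat this purely as an illustration of why the matmul degenerates to additions):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Scale by the mean absolute value, round, and clip to {-1, 0, +1}."""
    gamma = np.abs(w).mean() + eps                    # per-tensor scale
    w_ternary = np.clip(np.round(w / gamma), -1, 1).astype(np.int8)
    return w_ternary, gamma

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)) * 0.02                # toy weight matrix
x = rng.standard_normal(8)                            # toy activation vector

w_t, gamma = ternary_quantize(w)
# With ternary weights the matmul reduces to signed additions of x's entries;
# gamma rescales the result back to the original magnitude.
y_approx = gamma * (w_t @ x)
y_exact = w @ x
print("ternary weights:\n", w_t)
print("max abs error:", np.abs(y_approx - y_exact).max())
```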
Distributed Parallelism Strategies
When a single GPU's compute or memory is insufficient, workloads must be distributed:
- Tensor Parallelism (TP): Splits individual layers across multiple GPUs. This is highly interconnect-bound, requiring the very high bandwidth and sub-microsecond latency provided by NVLink.
- Pipeline Parallelism (PP): Splits the model's layers sequentially across GPUs. While it reduces the interconnect pressure compared to TP, it introduces "pipeline bubbles" (idle time) that lower the overall MFU.
- Expert Parallelism (EP): Used in Mixture-of-Experts (MoE) models. Only a fraction of the model's parameters (the "experts") are activated for any given token, so a model with trillions of total parameters has the per-token compute requirements of a much smaller dense model during the forward pass.
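A back-of-the-envelope sketch of how these degrees affect per-GPU weight memory and MoE active compute; the parallelism layouts, expert counts, and parameter splits below are illustrative placeholders.

```python
def weights_per_gpu_gb(params: float, bytes_per_weight: float, tp: int, pp: int) -> float:
    """Tensor parallelism (tp) shards each layer; pipeline parallelism (pp) shards
    the layer stack, so weight memory per GPU shrinks by roughly tp * pp."""
    return params * bytes_per_weight / (tp * pp) / 1e9

# 70B dense model in BF16 under different layouts.
for tp, pp in [(1, 1), (8, 1), (8, 2), (8, 4)]:
    print(f"TP={tp} PP={pp}: ~{weights_per_gpu_gb(70e9, 2, tp, pp):.0f} GB of weights per GPU")

# Expert parallelism: with top-k routing, only the shared parameters plus k/E of the
# expert parameters are touched per token, so per-token compute tracks the active count.
expert_params, shared_params, n_experts, top_k = 600e9, 20e9, 64, 4
active = shared_params + expert_params * top_k / n_experts
total = shared_params + expert_params
print(f"MoE active params per token: ~{active / 1e9:.0f}B of {total / 1e9:.0f}B total")
```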
FlashAttention and IO-Awareness
Standard attention has $O(N^2)$ time and memory cost in the sequence length $N$. FlashAttention does not reduce the FLOP count; instead, it is "IO-aware": it avoids writing the large $N \times N$ attention matrix to the slow HBM by tiling the computation and keeping intermediate results in the fast on-chip SRAM. Because attention is otherwise memory-bound, cutting HBM traffic in this way in practice roughly doubles training and inference speed for long context windows.
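The tiling idea can be illustrated in plain NumPy. The sketch below is not FlashAttention's CUDA kernel; it only reproduces the block-wise online-softmax algebra that lets the full $N \times N$ score matrix stay unmaterialized (NumPy has no SRAM/HBM distinction, so the memory benefit is conceptual here).

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference: materializes the full N x N score matrix (what FlashAttention avoids)."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

def tiled_attention(Q, K, V, block=64):
    """Block-wise attention with the online-softmax rescaling trick: each K/V tile is
    processed once, and running max/denominator/output accumulators are corrected."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)          # running max logit per query row
    row_sum = np.zeros(n)                  # running softmax denominator per query row

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale                        # only an (n, block) tile of logits
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)             # rescale previous accumulators
        probs = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + probs.sum(axis=1)
        out = out * correction[:, None] + probs @ Vb
        row_max = new_max

    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V), atol=1e-8)
```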
Research and Future Directions
Test-Time Compute Scaling
A major shift in 2025 is the concept of test-time compute scaling (as seen in models like OpenAI's o1). Instead of relying solely on a larger model (scaling $N$), we allow the model to "think" longer during inference.
- Search and Reasoning: By using techniques like Monte Carlo Tree Search (MCTS) or Chain-of-Thought (CoT) at inference time, a smaller model can outperform a much larger model on complex reasoning tasks.
- Infrastructure Shift: This shifts the compute requirement from VRAM capacity (holding a larger model) toward sustained TFLOPS (longer inference per query). It allows a more flexible allocation of resources in which "easy" queries use minimal compute and "hard" queries consume significant test-time compute, as sketched below.
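As a toy illustration of spending variable test-time compute, here is a best-of-N sketch; `generate` and `score` are hypothetical stand-ins for an LLM sampler and a verifier or reward model.

```python
import random

def best_of_n(prompt: str, generate, score, n: int) -> str:
    """Spend more test-time compute by sampling n candidate answers and keeping
    the one the verifier scores highest; cost scales linearly with n."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: a real system would call an LLM and a reward model.
def generate(prompt: str) -> str:
    return f"{prompt} -> draft #{random.randint(0, 9999)}"

def score(answer: str) -> float:
    return random.random()

random.seed(0)
print(best_of_n("Prove the claim", generate, score, n=8))   # "hard" query: large n
print(best_of_n("What is 2 + 2?", generate, score, n=1))    # "easy" query: minimal compute
```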
Silicon Photonics and Optical Interconnects
As clusters grow to tens of thousands of GPUs, the energy lost in copper wiring becomes a primary constraint. Research into Silicon Photonics aims to use light for data transfer between chips. This would provide:
- Terabit-per-second bandwidth with far lower latency and energy per bit than electrical signaling over copper.
- Disaggregated Compute: The ability to pool memory and compute across a data center as if they were on the same board, effectively breaking the physical limits of the server chassis.
Sustainable IR and Green Compute
The environmental cost of meeting these compute requirements is driving research into "Green AI." This involves optimizing for Exact Match (EM) accuracy per Watt. Techniques include:
- Dynamic Computation: Skipping layers for simple tokens.
- Approximate IR: Using lossy compression on vector databases to reduce the memory and compute overhead of Information Retrieval without significantly degrading the EM score of the final output.
Frequently Asked Questions
Q: Why does my model crash with an OOM error even though the weights fit in VRAM?
An OOM (Out-of-Memory) error occurs because VRAM must hold more than just the weights. During inference, the KV Cache grows with the sequence length and batch size. During training, you must also account for optimizer states (which can be $6\times$ the size of the weights) and activations stored for the backward pass.
Q: How does quantization affect the Exact Match (EM) score of a model?
Quantization (e.g., moving from FP16 to INT4) reduces the precision of the model's weights. While this significantly lowers compute and memory requirements, it can introduce "quantization noise." For complex Information Retrieval (IR) tasks, this might lead to a slight drop in EM scores, though modern techniques such as GPTQ and QLoRA typically keep the degradation small.
Q: What is the difference between NVLink and InfiniBand in terms of compute clusters?
NVLink is a specialized, high-speed interconnect (up to 900GB/s on H100) designed for communication within a node or a small group of nodes. It is required for Tensor Parallelism. InfiniBand is a high-speed networking protocol (up to 400Gbps) used for communication between nodes in a large cluster. It is typically used for Data Parallelism and Pipeline Parallelism.
Q: Is it better to scale the model size or the training data?
According to the Chinchilla Scaling Laws, you should scale both proportionally. However, if your goal is to minimize inference-time compute requirements, it is often better to "over-train" a smaller model on more data. This results in a more capable model that fits into smaller VRAM footprints.
Q: How do I calculate the total TFLOPS required to train a model?
A common heuristic is $C \approx 6PD$, where $C$ is the total compute in FLOPs, $P$ is the number of parameters, and $D$ is the number of training tokens. For example, training a 70B model on 2 Trillion tokens requires approximately $6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}$ FLOPs.
References
- Ma, S., et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv:2402.17764.
- Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
- Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135.
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
- NVIDIA (2024). NVIDIA Blackwell Architecture Technical Brief.
- OpenAI (2024). Learning to Reason with LLMs (o1 Research).