TLDR
Vector database formats have transitioned from experimental in-memory structures to sophisticated, multi-tiered storage engines designed to handle the unique demands of high-dimensional AI data. The primary challenge is the "curse of dimensionality," which renders traditional B-Tree indexing obsolete. Modern solutions balance three competing metrics: storage efficiency (via Product Quantization), search performance (via HNSW or Vamana graphs), and random access (via columnar disk formats like Lance). While HNSW remains the standard for low-latency, RAM-resident applications, the industry is rapidly shifting toward DiskANN and Lance to manage billion-scale datasets on NVMe storage. Success in this domain requires rigorous evaluation through A/B testing (comparing prompt variants) to ensure that the chosen format and quantization level do not degrade the semantic integrity of the retrieval process.
Conceptual Overview
A Vector Database is a specialized database optimized for storing and querying embeddings—mathematical representations of data (text, images, audio) in high-dimensional space. Unlike traditional relational databases that rely on exact matches or range scans, vector formats are built for Approximate Nearest Neighbor (ANN) search.
The Curse of Dimensionality
In standard data engineering, we deal with low-dimensional data. In these environments, spatial partitioning (like K-D trees) works efficiently because the volume of the space is manageable. However, as we move to embeddings with 768, 1536, or even 3072 dimensions, the volume of the space grows exponentially. In these high-dimensional regimes, the distances between random points concentrate around a common value, and the "corners" of the hypercube contain most of the volume. Partition-based pruning therefore stops helping, leaving exhaustive scans as the only exact option, which is computationally infeasible at scale and necessitates formats that can "shortcut" the search space.
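This concentration effect is easy to reproduce. The toy numpy sketch below (dimensions, point counts, and the uniform distribution are arbitrary illustrative choices) measures the relative gap between the nearest and farthest random neighbor as dimensionality grows:

```python
import numpy as np

# Toy experiment: the contrast between the nearest and farthest neighbor
# shrinks as dimensionality grows ("distance concentration").
rng = np.random.default_rng(42)

for dim in (2, 32, 768, 3072):
    points = rng.random((10_000, dim))   # uniform random "dataset"
    query = rng.random(dim)              # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:>4}  nearest={dists.min():.3f}  "
          f"farthest={dists.max():.3f}  relative contrast={contrast:.2f}")
```

At low dimensionality the farthest point is many times farther than the nearest; at 3072 dimensions the two are nearly indistinguishable, which is exactly why spatial pruning breaks down.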
The Trade-off Triangle
Every vector format occupies a point inside a trade-off triangle whose vertices are:
- Recall Accuracy: How close the ANN result is to the "true" nearest neighbor.
- Latency/Throughput: How many queries per second (QPS) the system can sustain while meeting a latency target (e.g., sub-10 ms).
- Infrastructure Cost: The physical cost of the storage medium (RAM vs. SSD vs. Object Storage).

Distance Metrics and Format Compatibility
The underlying format must support specific mathematical operations to calculate similarity. The most common are:
- Cosine Similarity: Measures the angle between vectors, ignoring magnitude.
- Euclidean Distance (L2): Measures the straight-line distance between points.
- Inner Product (IP): Measures the projection of one vector onto another.
The choice of format often dictates which metric is most efficient. For example, Binary Quantization is exceptionally fast for Hamming distance but loses significant information if the model was trained for L2 distance.
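For reference, all three metrics reduce to a few lines of numpy; the helper functions below are purely illustrative. Note that on unit-normalized vectors, inner product and cosine similarity produce identical rankings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle between the vectors; magnitude is divided out.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line (L2) distance between the two points.
    return float(np.linalg.norm(a - b))

def inner_product(a: np.ndarray, b: np.ndarray) -> float:
    # Projection of one vector onto the other; equals cosine for unit-norm vectors.
    return float(a @ b)
```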
Practical Implementations
The industry has converged on several architectural archetypes, each solving a specific part of the scale-performance equation.
1. Hierarchical Navigable Small World (HNSW)
HNSW is currently the most widely used format for high-performance vector search. It is a graph-based index that builds a hierarchy of layers.
- The Structure: The bottom layer (Layer 0) contains all data points and a dense graph of their relationships. Each subsequent layer above it contains a subset of the points from the layer below, creating a "skip-list" for graphs.
- Search Mechanism: The search begins at the top layer, where the graph is sparse. It quickly "zooms in" on the relevant neighborhood and then drops down to the next, denser layer. This process repeats until the search reaches Layer 0 and finds the local nearest neighbors.
- Pros: Extremely fast retrieval and high recall. It handles incremental updates better than most cluster-based formats.
- Cons: High memory overhead. HNSW requires both the graph and the vectors to be resident in RAM, and the graph itself adds pointer overhead that grows with the connectivity parameter M (typically tens to a few hundred bytes per vector). A minimal build/query sketch follows this list.
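The sketch below uses the hnswlib library to show the basic build/query cycle; the parameter values (M, ef_construction, ef) and the random data are illustrative, not recommendations.

```python
import hnswlib
import numpy as np

dim, num_elements = 768, 100_000
vectors = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M controls graph connectivity (and memory overhead);
# ef_construction controls build-time search effort (and recall).
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_elements))

index.set_ef(64)  # query-time beam width: must be >= k; higher = better recall, more latency
labels, distances = index.knn_query(vectors[:1], k=10)
```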
2. Inverted File Index with Product Quantization (IVF-PQ)
IVF-PQ is the standard for massive-scale datasets that cannot fit in RAM.
- IVF (Inverted File): The vector space is partitioned into $k$ clusters (Voronoi cells) using k-means. Each vector is assigned to its nearest cluster center (centroid). At query time, the system only searches the clusters closest to the query vector.
- PQ (Product Quantization): This is a lossy compression technique. A high-dimensional vector is split into $m$ sub-vectors. For each sub-vector, a codebook is learned. The sub-vector is then replaced by the index of the nearest entry in the codebook.
- Example: A 128-dimensional FP32 vector (512 bytes) can be compressed to 8 bytes using PQ with 8 sub-vectors and 1-byte codes, a 64x reduction in size.
- Pros: Incredible compression. Allows billion-scale indices to be stored in a fraction of the original space.
- Cons: Lower recall due to quantization error and the risk that the coarse quantizer (IVF) routes the query away from the cluster containing the true nearest neighbor (see the faiss sketch below).
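A hedged sketch of IVF-PQ using faiss, its reference implementation; the cell count, sub-vector count, and nprobe values are illustrative and would be tuned for a real dataset.

```python
import faiss
import numpy as np

d, n = 128, 500_000
xb = np.random.rand(n, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer for the IVF cells
index = faiss.IndexIVFPQ(quantizer, d, 1024, 8, 8)   # 1024 cells, 8 sub-vectors, 8 bits each
index.train(xb)                                      # learns centroids and PQ codebooks
index.add(xb)                                        # stores 8-byte codes instead of 512-byte vectors

index.nprobe = 16                                    # how many IVF cells to visit per query
distances, ids = index.search(xb[:5], k=10)
```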
3. Disk-Native Formats: Lance and DiskANN
As datasets grow to the petabyte scale, the industry is moving away from RAM-only formats toward formats optimized for NVMe SSDs.
- DiskANN (Vamana): Unlike HNSW, whose searches can require many hops (each of which becomes a random read when the graph lives on disk), DiskANN uses the Vamana graph algorithm. Vamana graphs are designed to have a small diameter and high connectivity, which minimizes the number of disk reads required to traverse the graph. DiskANN typically keeps a compressed (PQ) copy of the vectors in RAM and stores the graph plus the full-precision vectors on disk.
- Lance: Lance is a modern columnar data format designed specifically for machine learning. Unlike Parquet, which is optimized for analytical scans, Lance is optimized for random access. It uses a "fast scan" layout that allows it to retrieve specific vectors and their associated metadata in a single I/O operation.
- Pros: NVMe/SSD capacity is typically one to two orders of magnitude (10x-100x) cheaper per gigabyte than RAM. Lance also allows seamless integration of vector search with SQL-like metadata filtering (a LanceDB sketch follows this list).
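As a sketch of the disk-native workflow, the snippet below uses the LanceDB Python client (`pip install lancedb`); the table name, schema, and parameter values are illustrative, and the exact API surface may differ across versions.

```python
import lancedb

# Connect to (or create) a local Lance dataset directory.
db = lancedb.connect("./lance_demo")

data = [
    {"vector": [0.1] * 768, "text": "trail boots review", "category": "footwear"},
    {"vector": [0.2] * 768, "text": "4k monitor review", "category": "electronics"},
]
table = db.create_table("docs", data=data)

# Optional ANN index (IVF-PQ under the hood); brute force is used if omitted.
# table.create_index(metric="cosine", num_partitions=256, num_sub_vectors=16)

# Vector search combined with metadata filtering in a single call.
hits = table.search([0.15] * 768).where("category = 'electronics'").limit(5).to_list()
```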
Advanced Techniques
To optimize these formats further, several advanced quantization and filtering techniques are employed.
Scalar Quantization (SQ)
Scalar Quantization is a simpler alternative to PQ. It maps the range of floating-point values in each dimension to a smaller set of integers (e.g., INT8 or INT4), as sketched after the list below.
- SQ8: Maps 32-bit floats to 8-bit integers. This provides a 4x compression with minimal recall loss (often <1%).
- SQ4: Maps to 4-bit integers. This provides 8x compression but requires more careful calibration to avoid significant precision loss.
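A minimal numpy sketch of SQ8. Per-dimension min/max calibration is one common choice and is assumed here; production implementations also handle outliers and may calibrate per segment.

```python
import numpy as np

def sq8_train(vectors: np.ndarray):
    # Calibrate one (min, range) pair per dimension from a training sample.
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    return lo, np.maximum(hi - lo, 1e-9)  # avoid divide-by-zero for constant dims

def sq8_encode(vectors: np.ndarray, lo, span) -> np.ndarray:
    # Map each float32 value to an unsigned 8-bit code (4x smaller than FP32).
    return np.clip(np.round((vectors - lo) / span * 255), 0, 255).astype(np.uint8)

def sq8_decode(codes: np.ndarray, lo, span) -> np.ndarray:
    # Approximate reconstruction used at distance-computation time.
    return codes.astype(np.float32) / 255 * span + lo

train = np.random.rand(10_000, 768).astype(np.float32)
lo, span = sq8_train(train)
codes = sq8_encode(train, lo, span)
recon = sq8_decode(codes, lo, span)
print("max abs reconstruction error:", float(np.abs(train - recon).max()))
```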
Binary Quantization (BQ)
For specific models (like those from Cohere or OpenAI), vectors can be compressed into a single bit per dimension, as sketched after the list below.
- Mechanism: If a value is positive, it becomes 1; if negative, 0.
- Performance: Distance is calculated using the Hamming Distance (XOR followed by a population count). Modern CPUs have specialized instructions (POPCNT) that make this calculation nearly instantaneous. BQ can provide a 32x compression ratio.
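A small numpy sketch of sign-based binarization and Hamming distance; numpy's packbits and bit counting stand in for the POPCNT instruction a production engine would use, and the random data is illustrative.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    # One bit per dimension: positive -> 1, non-positive -> 0, packed 8 dims per byte.
    return np.packbits(vectors > 0, axis=-1)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    # XOR the packed codes and count the differing bits.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

docs = np.random.randn(1_000, 768).astype(np.float32)
codes = binarize(docs)                      # 768 bits -> 96 bytes per vector (32x vs FP32)
query = binarize(np.random.randn(768).astype(np.float32))
dists = [hamming_distance(query, c) for c in codes]
print("closest doc:", int(np.argmin(dists)))
```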
Two-Stage Retrieval (Re-ranking)
Because quantization is lossy, many formats use a two-stage process (a sketch follows the list below):
- Coarse Search: Use a highly compressed index (like PQ or BQ) to find the top $K$ candidates (e.g., $K=1000$).
- Refinement: Retrieve the full-precision (FP32 or FP16) vectors for those 1000 candidates from disk and perform an exact distance calculation to find the final top 10. This restores the recall lost during the quantization phase.
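A self-contained numpy sketch of the pattern. Here the "compressed" representation is simply an FP16 copy standing in for PQ or BQ codes, and the over-fetch factor of 100 (top-1000 for k=10) mirrors the example above.

```python
import numpy as np

def two_stage_search(full: np.ndarray, compressed: np.ndarray,
                     q_full: np.ndarray, q_compressed: np.ndarray,
                     k: int = 10, overfetch: int = 100):
    # Stage 1 (coarse): rank everything with the cheap, lossy representation.
    coarse = np.linalg.norm(compressed - q_compressed, axis=1)
    candidates = np.argsort(coarse)[: k * overfetch]          # e.g. top-1000 for k=10
    # Stage 2 (refine): exact distances on full-precision vectors, candidates only.
    exact = np.linalg.norm(full[candidates] - q_full, axis=1)
    return candidates[np.argsort(exact)[:k]]

vectors = np.random.rand(50_000, 768).astype(np.float32)
query = np.random.rand(768).astype(np.float32)
top10 = two_stage_search(vectors, vectors.astype(np.float16),
                         query, query.astype(np.float16))
```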
Hybrid Search and Metadata Filtering
Real-world queries are rarely just "find similar vectors." They are usually "find similar vectors where category == 'electronics'." Three common strategies are listed below; a sketch after the list contrasts the first two.
- Pre-filtering: The database applies the metadata filter first, then runs the vector search over only the matching rows. Because a prebuilt ANN index cannot easily be restricted to an arbitrary subset, this often falls back to an exact scan of the matches, and graph-based indices can lose connectivity when the allowed set is tiny (e.g., only 10 items match).
- Post-filtering: The database searches the vector index first, then applies the filter. This is slow or lossy when most of the top vector results fail the filter, forcing the system to over-fetch.
- Native Filtering: Formats like Lance and Weaviate's HNSW implementation allow the search algorithm to "check" the metadata while traversing the graph, effectively combining both steps.
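The brute-force sketch below contrasts the first two strategies; the predicate callable and the over-fetch factor are illustrative choices, and real engines push the predicate into the index traversal rather than scanning in Python.

```python
import numpy as np

def pre_filter_search(vectors, metadata, query, predicate, k=10):
    # Pre-filtering: evaluate the metadata predicate first, rank only the matches.
    allowed = np.array([i for i, m in enumerate(metadata) if predicate(m)])
    dists = np.linalg.norm(vectors[allowed] - query, axis=1)
    return allowed[np.argsort(dists)[:k]]

def post_filter_search(vectors, metadata, query, predicate, k=10, overfetch=4):
    # Post-filtering: rank everything, then drop non-matching rows (may return < k).
    ranked = np.argsort(np.linalg.norm(vectors - query, axis=1))[: k * overfetch]
    return np.array([i for i in ranked if predicate(metadata[i])][:k])

# Usage (hypothetical data):
#   hits = pre_filter_search(vecs, items, q, lambda m: m["category"] == "electronics")
```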
Research and Future Directions
The field of vector storage is evolving toward hardware-native and "learned" architectures.
GPU-Accelerated Indexing (NVIDIA RAFT)
Building an HNSW graph for 100 million vectors can take hours on a CPU. NVIDIA's RAFT library provides GPU-accelerated implementations of IVF-PQ and graph-based indices. By leveraging the massive parallelism of GPUs, indexing time can be reduced from hours to minutes, and query throughput can reach hundreds of thousands of QPS.
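RAFT itself is a C++/CUDA library that is usually reached through higher-level wrappers. As a hedged illustration of the same idea, the sketch below clones a faiss IVF index onto a GPU; it assumes a CUDA-enabled faiss build, and the index type and sizes are illustrative.

```python
import faiss
import numpy as np

d, n = 768, 100_000
xb = np.random.rand(n, d).astype(np.float32)

cpu_index = faiss.index_factory(d, "IVF1024,Flat")      # IVF index, exact storage per cell
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)   # clone onto GPU 0

gpu_index.train(xb)    # k-means for the 1024 cells runs on the GPU
gpu_index.add(xb)
distances, ids = gpu_index.search(xb[:100], k=10)
```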
Decoupled Storage and Compute
The "Serverless" trend is hitting vector databases. In this model, the vector index is stored as a set of immutable files in object storage (like AWS S3). Compute nodes are spun up on demand to query these files. This requires "cloud-native" formats that are highly seek-friendly, allowing the compute node to fetch only the specific bytes of the index it needs without downloading the entire file.
Evaluation via A/B Testing
As the complexity of these formats increases, the methodology for evaluating them has shifted. Engineers now use A/B testing (comparing prompt variants) to determine the robustness of a storage format. For example, if a user asks "What are the best hiking boots?" versus "Recommend some footwear for mountain trails," the vector database should ideally return the same neighborhood of results.
By running A/B tests across different quantization levels (e.g., comparing SQ8 vs. PQ), developers can identify the "semantic breaking point" where compression begins to alter the meaning of the retrieved data. This is critical for RAG (Retrieval-Augmented Generation) systems, where the quality of the prompt variant can significantly impact the LLM's final output.
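One lightweight way to run such a test is to fire paraphrased query pairs at two index configurations and measure how much their top-K result sets overlap. The sketch below assumes generic `search(query_text, k)` callables for each configuration and uses Jaccard overlap as the stability metric; both the callables and the metric are illustrative choices, not a standard API.

```python
def jaccard(a, b):
    # Overlap between two top-K result sets (by document id).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def ab_retrieval_test(search_a, search_b, prompt_pairs, k=10):
    """Compare two index configurations (e.g. SQ8 vs. PQ) on paraphrased prompts.

    search_a / search_b: callables mapping (query_text, k) -> list of doc ids.
    prompt_pairs: list of (variant_1, variant_2) phrasings of the same intent.
    """
    scores = []
    for q1, q2 in prompt_pairs:
        stability_a = jaccard(search_a(q1, k), search_a(q2, k))
        stability_b = jaccard(search_b(q1, k), search_b(q2, k))
        scores.append((stability_a, stability_b))
    return scores

pairs = [("What are the best hiking boots?",
          "Recommend some footwear for mountain trails")]
# results = ab_retrieval_test(search_sq8, search_pq, pairs)   # hypothetical callables
```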
Learned Indices
Recent research suggests that neural networks can replace traditional indexing structures. Instead of a graph, a small "index model" learns the distribution of the vectors and predicts the memory location of the nearest neighbors. While still experimental, learned indices promise even higher compression ratios and faster lookup times by adapting to the specific data distribution of the embeddings.
Frequently Asked Questions
Q: What is a Vector Database?
A: A Vector Database is a specialized database optimized for storing and querying embeddings. It uses Approximate Nearest Neighbor (ANN) algorithms to navigate high-dimensional spaces, enabling semantic search, recommendation engines, and long-term memory for AI agents.
Q: Why is HNSW so popular despite its high memory usage?
A: HNSW provides the best balance of speed and recall for most production use cases. Its ability to handle incremental updates (adding data without a full re-index) and its sub-millisecond query times make it the gold standard for applications where performance is more critical than hardware cost.
Q: When should I choose Lance over Parquet?
A: You should choose Lance when your workload requires random access to individual rows or dense vector search. Parquet is designed for analytical "scans" of entire columns, which makes it extremely slow for the point-lookups required in vector retrieval and re-ranking.
Q: How does Product Quantization (PQ) actually save space?
A: PQ saves space by breaking a large vector into smaller chunks and replacing those chunks with a "code" from a pre-learned dictionary. Instead of storing 1536 floating-point numbers, you might only store 96 one-byte integers, each pointing to a common pattern in the data.
Q: What is A/B testing in the context of vector search evaluation?
A: In this context, A/B testing refers to the practice of comparing prompt variants to test the stability of the retrieval system. It helps developers understand how different storage formats and quantization levels respond to variations in natural language, ensuring that the system remains semantically consistent regardless of how a query is phrased.
(Diagram: two-stage retrieval pipeline. The later stages are: Top K Candidates (Compressed) -> Disk Fetch (Full Precision) -> Re-ranking -> Final Results; a side branch shows Metadata Filtering occurring at step 4.)
References
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
- Jegou, H., et al. (2011). Product Quantization for Nearest Neighbor Search.
- Subramanya, S. J., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node.
- Lance: Modern Columnar Data Format for ML (Official Documentation).
- NVIDIA RAFT: Graph-based and IVF-based vector search on GPUs.
- Douze, M., et al. (2024). The Faiss Library.