TLDR
Milvus is an enterprise-grade, open-source vector database specifically engineered to manage, store, and search billions of high-dimensional vectors. As a cornerstone of the modern AI stack, it facilitates Retrieval-Augmented Generation (RAG), recommendation engines, and multimodal search by providing millisecond-level latency on massive datasets.
Unlike traditional databases, Milvus utilizes a cloud-native, disaggregated architecture that separates storage from compute. This allows for independent horizontal scaling of query nodes (for search) and data nodes (for ingestion). It supports a diverse array of indexing algorithms—including HNSW, IVF, and DiskANN—and offers advanced hybrid search capabilities that combine dense semantic retrieval with sparse keyword matching. As a graduated project of the LF AI & Data Foundation, Milvus is one of the most widely adopted options for self-hosted, large-scale vector data management.
Conceptual Overview
At its core, Milvus is designed to solve the "Similarity Search" problem at a scale where traditional relational or document databases fail. In the AI era, unstructured data (text, images, video) is converted into numerical vectors (embeddings). Milvus provides the infrastructure to perform Approximate Nearest Neighbor (ANN) searches across these embeddings with high precision and low latency.
The Disaggregated Architecture: "Log as the Backbone"
The defining characteristic of Milvus (v2.0+) is its total decoupling of components. This architecture follows the principle that all data mutations are treated as a stream of logs, ensuring high availability and elastic scalability.
- Access Layer (Stateless Proxies): The entry point for client requests via gRPC, REST, or SDKs. Proxies handle static validation, request routing, and result merging. Because they are stateless, they can be scaled behind a load balancer to handle massive concurrent connections.
- Coordinator Service (The Brain): The management plane that maintains cluster topology and assigns tasks.
- Root Coord: Manages metadata (collections, partitions) and handles Time Tick generation for global consistency.
- Data Coord: Manages data placement and background compaction.
- Query Coord: Manages search load balancing and segment handoffs between nodes.
- Index Coord: Manages the lifecycle of index building tasks.
- Worker Nodes (The Muscles):
- Query Nodes: Responsible for executing search requests. They pull data segments from object storage into local memory or cache to perform ANN computations.
- Data Nodes: Subscribe to the log broker, consume incoming data, and persist it into "segments" in object storage.
- Index Nodes: Specialized nodes that build accelerated search structures (like HNSW graphs) from raw vector data to speed up future queries.
- Storage Layer (The Foundation):
- Meta Store: Typically etcd, storing the cluster's structural metadata and health status.
- Log Broker: Typically Apache Pulsar or Kafka. It acts as the system's write-ahead log (WAL), ensuring data persistence and atomicity.
- Object Storage: Typically S3, MinIO, or Azure Blob Storage. This is where the actual bulk data (vectors and scalar fields) and index files reside.
Data Model: Collections, Segments, and Shards
Milvus organizes data hierarchically to optimize for distributed processing:
- Collections: Equivalent to tables in SQL. A collection has a fixed schema defining vector dimensions and scalar fields.
- Partitions: Logical divisions within a collection. Searching a specific partition reduces the search space, significantly improving performance for multi-tenant or time-series data.
- Segments: The physical unit of data. Milvus automatically groups data into segments. A "growing" segment resides in memory/log broker; once it reaches a threshold (e.g., 512MB), it is "sealed," persisted to object storage, and indexed.
- Shards: Data is horizontally partitioned into shards to distribute the ingestion load across multiple Data Nodes.
(Architecture diagram: the Access Layer connects to the Coordinator Layer (Root, Data, Query, and Index Coords). Below that, the Worker Layer shows Query Nodes, Data Nodes, and Index Nodes interacting with the Log Broker (Pulsar/Kafka). At the bottom, the Storage Layer shows Object Storage (S3/MinIO) and Meta Storage (etcd). Arrows indicate the flow of data from ingestion through the Log Broker to Data Nodes and finally to Object Storage, while Query Nodes pull data from Object Storage for search.)
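A minimal PyMilvus sketch of how these data-model concepts map to API calls (the collection and partition names are illustrative, and the num_shards keyword assumes a recent PyMilvus release):
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="Data model demo")

# Shards are fixed at creation time and spread ingestion across Data Nodes.
collection = Collection("events", schema, num_shards=2)

# Partitions are logical subdivisions that can be created (and targeted) later.
collection.create_partition("2024_q1")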
Practical Implementations
Implementing Milvus effectively requires choosing the right indexing strategy and deployment mode based on your dataset size and latency requirements.
1. Indexing Strategies: Speed vs. Accuracy
Milvus supports a wide range of ANN algorithms, each with specific trade-offs:
- HNSW (Hierarchical Navigable Small World):
- Mechanism: Builds a multi-layered graph where the top layers are sparse for fast navigation and the bottom layers are dense for accuracy.
- Best for: In-memory search where low latency is the priority. It is currently the most popular index for RAG.
- IVF (Inverted File):
- Mechanism: Partitions the vector space into Voronoi cells (clusters). Search is restricted to the nprobe closest clusters, where nprobe is a tunable search-time parameter.
- IVF_FLAT: High accuracy but high memory usage.
- IVF_PQ (Product Quantization): Compresses vectors into short codes, allowing for massive datasets to fit in limited RAM at the cost of some precision.
- DiskANN:
- Mechanism: An SSD-optimized index that keeps a small compressed graph in RAM and the full vectors on disk.
- Best for: Billion-scale datasets where fitting everything in RAM is cost-prohibitive.
- GPU_CAGRA:
- Mechanism: A graph-based index optimized for NVIDIA GPUs using the RAFT library.
- Best for: High-throughput batch processing and extremely low-latency requirements.
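These trade-offs surface as build-parameter dictionaries passed to create_index. The values below are illustrative starting points rather than tuned recommendations, and exact parameter names should be verified against the docs for your Milvus version:
# HNSW: graph fan-out (M) and build-time candidate list (efConstruction)
hnsw = {"index_type": "HNSW", "metric_type": "COSINE", "params": {"M": 16, "efConstruction": 200}}

# IVF_FLAT: number of Voronoi cells; the number of searched cells is set at query time via nprobe
ivf_flat = {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}

# IVF_PQ: adds product quantization; m (sub-vector count) must divide the vector dimension
ivf_pq = {"index_type": "IVF_PQ", "metric_type": "L2", "params": {"nlist": 1024, "m": 8, "nbits": 8}}

# DiskANN: build parameters are largely automatic; recall is tuned at query time via search_list
diskann = {"index_type": "DISKANN", "metric_type": "IP", "params": {}}

collection.create_index(field_name="vector", index_params=hnsw)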
2. Metric Types
The choice of metric depends on the embedding model used:
- L2 (Euclidean Distance): Measures the straight-line distance. Common in image processing.
- IP (Inner Product): Often used in recommendation systems where vector magnitude matters.
- Cosine Similarity: The standard for NLP and RAG, measuring the angle between vectors (normalized IP).
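A toy NumPy illustration of how the three metrics score the same pair of vectors (the values are arbitrary):
import numpy as np

a = np.array([0.3, 0.4, 0.5])
b = np.array([0.1, 0.9, 0.2])

l2 = np.linalg.norm(a - b)                              # Euclidean distance: smaller = more similar
ip = float(np.dot(a, b))                                # inner product: larger = more similar
cosine = ip / (np.linalg.norm(a) * np.linalg.norm(b))   # angle only; ignores magnitude
print(l2, ip, cosine)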
3. Deployment Workflow (PyMilvus)
A typical implementation involves defining a schema, creating an index, and performing a search.
from pymilvus import connections, Collection, FieldSchema, DataType, CollectionSchema
# 1. Establish Connection
connections.connect("default", host="localhost", port="19530")
# 2. Define Schema
# Milvus supports dynamic schemas, but explicit schemas are recommended for production.
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1536), # OpenAI dimension
FieldSchema(name="metadata", dtype=DataType.JSON) # Store flexible metadata
]
schema = CollectionSchema(fields, description="Knowledge Base Search")
# 3. Create Collection and Index
collection = Collection("kb_collection", schema)
index_params = {
"metric_type": "COSINE",
"index_type": "HNSW",
"params": {"M": 16, "efConstruction": 200}
}
collection.create_index(field_name="vector", index_params=index_params)
# 4. Data Ingestion
data = [
[[0.1, 0.2, ...]], # List of vectors
[{"source": "doc_1", "page": 10}] # List of JSON metadata
]
collection.insert(data)
# 5. Search with Scalar Filtering
collection.load() # Load collection into Query Node memory
search_params = {"metric_type": "COSINE", "params": {"ef": 64}}
results = collection.search(
data=[[0.1, 0.2, ...]],
anns_field="vector",
param=search_params,
limit=3,
expr="metadata['source'] == 'doc_1'" # Boolean filtering
)
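Reading the results: the search returns one list of hits per query vector. The id and distance are always available; other fields appear on hit.entity only if requested via output_fields in the search call above.
for hits in results:                     # one Hits object per query vector
    for hit in hits:
        # hit.entity.get("metadata") would also work if output_fields=["metadata"] was passed
        print(hit.id, hit.distance)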
Advanced Techniques
Hybrid Search: Dense + Sparse Retrieval
Modern search systems often require both semantic understanding (Dense Vectors) and exact keyword matching (Sparse Vectors). Milvus supports Hybrid Search, allowing users to store both types in a single collection.
- Dense Vectors: Captured by models like BERT or Ada-002 for "meaning."
- Sparse Vectors: Captured by algorithms like BM25 or SPLADE for "keywords."
- Fusion: Milvus uses Reciprocal Rank Fusion (RRF) or weighted scoring to combine the results from both search types into a single ranked list.
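A sketch of the hybrid search API available in recent PyMilvus releases (2.4+). It assumes a collection with both a dense FLOAT_VECTOR field named dense_vector and a SPARSE_FLOAT_VECTOR field named sparse_vector, which differs from the single-vector schema shown earlier:
from pymilvus import AnnSearchRequest, RRFRanker

dense_query = [0.1] * 1536                                # e.g. an embedding from Ada-002
sparse_query = {42: 0.8, 1337: 0.3}                       # e.g. {token_id: weight} from SPLADE/BM25

dense_req = AnnSearchRequest(
    data=[dense_query],
    anns_field="dense_vector",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=20,
)
sparse_req = AnnSearchRequest(
    data=[sparse_query],
    anns_field="sparse_vector",
    param={"metric_type": "IP", "params": {}},
    limit=20,
)
results = collection.hybrid_search(
    reqs=[dense_req, sparse_req],
    rerank=RRFRanker(k=60),                               # or WeightedRanker(0.7, 0.3)
    limit=5,
)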
Scalar Filtering and Bitsets
Milvus performs pre-filtering using bitsets. When a query includes a scalar filter (e.g., status == 'active'), Milvus generates a bitset where each bit represents an entity's eligibility. During the ANN graph traversal, the engine checks the bitset; if a vector's bit is not set, it is ignored. This ensures that the results are 100% accurate regarding the metadata constraints without sacrificing the speed of the vector search.
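A conceptual toy model of bitset pre-filtering (plain NumPy, not Milvus internals): the scalar predicate yields one bit per entity, and entities whose bit is unset are excluded before ranking, so the top-k always satisfies the filter.
import numpy as np

vectors = np.random.rand(1000, 8).astype(np.float32)
status = np.random.choice(["active", "archived"], size=1000)
query = np.random.rand(8).astype(np.float32)

bitset = status == "active"                  # one bit per entity
distances = np.linalg.norm(vectors - query, axis=1)
distances[~bitset] = np.inf                  # filtered-out entities can never be returned
top3 = np.argsort(distances)[:3]
print(top3, status[top3])                    # always "active"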
Multi-Tenancy Strategies
For SaaS applications, managing data for thousands of users is critical. Milvus offers three patterns:
- Collection-per-tenant: Strongest isolation, but limited by the system's collection limit (approx. 65k).
- Partition-per-tenant: Good balance of isolation and performance.
- Field-level filtering: All tenants share a collection, and a tenant_id field is used in every query. This is the most scalable approach, supporting millions of tenants (see the sketch after this list).
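A brief PyMilvus sketch of the second and third patterns. The tenant name, the tenant_id scalar field, and the query_vector variable are assumptions for illustration; data and search_params are as in the earlier example:
# Partition-per-tenant: isolate each tenant in its own partition.
collection.create_partition("tenant_acme")
collection.insert(data, partition_name="tenant_acme")
results = collection.search(
    data=[query_vector], anns_field="vector", param=search_params,
    limit=3, partition_names=["tenant_acme"],
)

# Field-level filtering: shared collection, scoped by a tenant_id scalar field.
results = collection.search(
    data=[query_vector], anns_field="vector", param=search_params,
    limit=3, expr='tenant_id == "acme"',
)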
Research and Future Directions
Milvus 3.0: The Vector Data Lake
The project is evolving from a specialized vector store into a Vector Data Lake. This transition focuses on:
- Unified Storage: Natively handling structured, semi-structured (JSON), and unstructured data in a single, high-performance format.
- Serverless Native: Optimizing for cloud environments where compute can scale to zero, and storage is decoupled via tiered S3/local-disk caching.
- Real-time Updates: Improving the LSM-tree (Log-Structured Merge-tree) approach for vectors to allow for high-frequency deletions and updates without triggering expensive re-indexing.
Hardware Acceleration and AI Agents
Research is heavily focused on Hardware-Software Co-design. By offloading distance computations to FPGAs or utilizing NVIDIA's GPU-accelerated libraries, Milvus aims to reduce the TCO (Total Cost of Ownership) for billion-scale deployments. Furthermore, Milvus is becoming the "Long-term Memory" for AI Agents, where the database stores not just documents, but the agent's past actions, tool-use history, and reasoning traces to provide persistent context across sessions.
Frequently Asked Questions
Q: How does Milvus compare to Pinecone or Weaviate?
A: Milvus is a self-hosted, open-source enterprise solution (with a managed version via Zilliz). Compared to Pinecone (SaaS-only), Milvus offers total control over data residency and infrastructure. Compared to Weaviate, Milvus is generally more performant at massive scales (billions of vectors) due to its disaggregated architecture, while Weaviate is often preferred for smaller-scale ease of use and its GraphQL-first approach.
Q: Can I run Milvus on a single machine?
A: Yes. Milvus offers a Standalone mode via Docker Compose, which packages all components into a few containers. This is ideal for development. For production, Cluster mode on Kubernetes is the recommended path for high availability.
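For quick local experiments, recent PyMilvus releases (2.4+) also bundle Milvus Lite, an embedded, file-backed mode that requires no Docker at all; a minimal sketch, assuming pip install -U pymilvus:
from pymilvus import MilvusClient

client = MilvusClient("./milvus_demo.db")    # embedded Milvus Lite instance stored in a local file
client.create_collection(collection_name="dev_test", dimension=1536)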
Q: Does Milvus support ACID transactions?
A: Milvus supports entity-level atomicity and follows an eventual consistency model by default. However, it allows users to tune the consistency level (Strong, Bounded Staleness, Session, Eventually) per query, balancing the trade-off between data visibility and search performance.
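The consistency level can be set when creating a collection or overridden per query; a sketch using the collection from the earlier example:
results = collection.search(
    data=[[0.1, 0.2, ...]],
    anns_field="vector",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=3,
    consistency_level="Strong",   # alternatives: "Bounded", "Session", "Eventually"
)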
Q: What are the hardware requirements for 1 million vectors?
A: For 1 million 768-dimensional vectors using HNSW, you would typically need ~4GB of RAM for the index and additional overhead for the OS and metadata. Milvus is highly efficient, but memory scales linearly with vector dimensions and the number of entities.
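A back-of-the-envelope calculation behind that estimate (the HNSW graph overhead is a rough assumption of M=16 bidirectional links with 4-byte neighbor ids):
num_vectors = 1_000_000
dim = 768
raw_bytes = num_vectors * dim * 4            # float32 vectors: ~3.07 GB
graph_bytes = num_vectors * 2 * 16 * 4       # rough HNSW link lists: ~0.13 GB
print(f"raw: {raw_bytes / 1e9:.2f} GB, graph: ~{graph_bytes / 1e9:.2f} GB")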
Q: How do I handle data deletion in Milvus?
A: Milvus supports deletion by primary key or via filter expressions. Deletions are initially "soft" (marked in a delete buffer) and are permanently purged during the compaction process, which merges small segments into larger ones to maintain search efficiency.
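A short sketch of both deletion paths using the collection from the earlier example (deleting by arbitrary filter expressions requires Milvus 2.3 or later):
collection.delete(expr="id in [100, 101, 102]")           # delete by primary key
collection.delete(expr='metadata["source"] == "doc_1"')   # delete by filter expression (2.3+)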
References
- https://milvus.io/docs
- https://arxiv.org/abs/2103.01530
- https://zilliz.com/blog
- https://lfaidata.foundation/projects/milvus/
- https://github.com/milvus-io/milvus