Milvus

Milvus is an enterprise-grade, open-source vector database designed for massive-scale similarity search. It features a cloud-native, disaggregated architecture that separates storage and compute, enabling horizontal scaling for billions of high-dimensional embeddings.

TLDR

Milvus is an enterprise-grade, open-source vector database specifically engineered to manage, store, and search billions of high-dimensional vectors. As a cornerstone of the modern AI stack, it facilitates Retrieval-Augmented Generation (RAG), recommendation engines, and multimodal search by providing millisecond-level latency on massive datasets.

Unlike traditional databases, Milvus utilizes a cloud-native, disaggregated architecture that separates storage from compute. This allows for independent horizontal scaling of query nodes (for search) and data nodes (for ingestion). It supports a diverse array of indexing algorithms—including HNSW, IVF, and DiskANN—and offers advanced hybrid search capabilities that combine dense semantic retrieval with sparse keyword matching. As a graduated project of the LF AI & Data Foundation, Milvus is one of the most widely adopted solutions for self-hosted, large-scale vector data management.


Conceptual Overview

At its core, Milvus is designed to solve the "Similarity Search" problem at a scale where traditional relational or document databases fail. In the AI era, unstructured data (text, images, video) is converted into numerical vectors (embeddings). Milvus provides the infrastructure to perform Approximate Nearest Neighbor (ANN) searches across these embeddings with high precision and low latency.

The Disaggregated Architecture: "Log as the Backbone"

The defining characteristic of Milvus (v2.0+) is its total decoupling of components. This architecture follows the principle that all data mutations are treated as a stream of logs, ensuring high availability and elastic scalability.

  1. Access Layer (Stateless Proxies): The entry point for client requests via gRPC, REST, or SDKs. Proxies handle static validation, request routing, and result merging. Because they are stateless, they can be scaled behind a load balancer to handle massive concurrent connections.

  2. Coordinator Service (The Brain): The management plane that maintains cluster topology and assigns tasks.

    • Root Coord: Manages metadata (collections, partitions) and handles Time Tick generation for global consistency.
    • Data Coord: Manages data placement and background compaction.
    • Query Coord: Manages search load balancing and segment handoffs between nodes.
    • Index Coord: Manages the lifecycle of index building tasks.
  3. Worker Nodes (The Muscles):

    • Query Nodes: Responsible for executing search requests. They pull data segments from object storage into local memory or cache to perform ANN computations.
    • Data Nodes: Subscribe to the log broker, consume incoming data, and persist it into "segments" in object storage.
    • Index Nodes: Specialized nodes that build accelerated search structures (like HNSW graphs) from raw vector data to speed up future queries.
  4. Storage Layer (The Foundation):

    • Meta Store: Typically etcd, storing the cluster's structural metadata and health status.
    • Log Broker: Typically Apache Pulsar or Kafka. It acts as the system's write-ahead log (WAL), ensuring data persistence and atomicity.
    • Object Storage: Typically S3, MinIO, or Azure Blob Storage. This is where the actual bulk data (vectors and scalar fields) and index files reside.

Data Model: Collections, Segments, and Shards

Milvus organizes data hierarchically to optimize for distributed processing (a short PyMilvus sketch follows the list):

  • Collections: Equivalent to tables in SQL. A collection has a fixed schema defining vector dimensions and scalar fields.
  • Partitions: Logical divisions within a collection. Searching a specific partition reduces the search space, significantly improving performance for multi-tenant or time-series data.
  • Segments: The physical unit of data. Milvus automatically groups data into segments. A "growing" segment resides in memory/log broker; once it reaches a threshold (e.g., 512MB), it is "sealed," persisted to object storage, and indexed.
  • Shards: Data is horizontally partitioned into shards to distribute the ingestion load across multiple Data Nodes.
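
A minimal PyMilvus sketch of this hierarchy, assuming a local Milvus instance; the collection, partition, and field names are illustrative:

from pymilvus import connections, Collection, CollectionSchema, DataType, FieldSchema

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields)

# Two shards spread the ingestion load across Data Nodes.
collection = Collection("events", schema, shards_num=2)

# Partitions carve the collection into logical slices, e.g., by month.
collection.create_partition("p_2024_06")

# A search scoped to one partition skips every other partition's segments:
# collection.search(..., partition_names=["p_2024_06"])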

[Diagram: The four-layer Milvus architecture. The Access Layer (stateless proxies) sits above the Coordinator Layer (Root, Data, Query, and Index Coords). The Worker Layer's Query Nodes, Data Nodes, and Index Nodes interact with the Log Broker (Pulsar/Kafka), and the Storage Layer comprises Object Storage (S3/MinIO) and Meta Storage (etcd). Data flows from ingestion through the Log Broker to Data Nodes and into Object Storage, while Query Nodes pull data from Object Storage for search.]


Practical Implementations

Implementing Milvus effectively requires choosing the right indexing strategy and deployment mode based on your dataset size and latency requirements.

1. Indexing Strategies: Speed vs. Accuracy

Milvus supports a wide range of ANN algorithms, each with specific trade-offs (example configurations follow the list):

  • HNSW (Hierarchical Navigable Small World):
    • Mechanism: Builds a multi-layered graph where the top layers are sparse for fast navigation and the bottom layers are dense for accuracy.
    • Best for: In-memory search where low latency is the priority. It is currently the most popular index for RAG.
  • IVF (Inverted File):
    • Mechanism: Partitions the vector space into Voronoi cells (clusters). At query time, search is restricted to the nprobe nearest clusters, trading a small amount of recall for speed.
    • IVF_FLAT: High accuracy but high memory usage.
    • IVF_PQ (Product Quantization): Compresses vectors into short codes, allowing for massive datasets to fit in limited RAM at the cost of some precision.
  • DiskANN:
    • Mechanism: An SSD-optimized index that keeps a small compressed graph in RAM and the full vectors on disk.
    • Best for: Billion-scale datasets where fitting everything in RAM is cost-prohibitive.
  • GPU_CAGRA:
    • Mechanism: A graph-based index optimized for NVIDIA GPUs using the RAFT library.
    • Best for: High-throughput batch processing and extremely low-latency requirements.
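
In PyMilvus, the choice is expressed through the index_params dictionary passed to create_index. The values below are illustrative starting points, not tuned recommendations:

hnsw = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 16, "efConstruction": 200}, # graph degree / build-time beam width
}

ivf_pq = {
    "index_type": "IVF_PQ",
    "metric_type": "L2",
    "params": {"nlist": 1024, "m": 16, "nbits": 8}, # clusters / sub-quantizers / bits per code
}

diskann = {
    "index_type": "DISKANN",
    "metric_type": "IP",
    "params": {}, # DiskANN is largely self-tuning in Milvus
}

# collection.create_index(field_name="vector", index_params=hnsw)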

2. Metric Types

The choice of metric depends on the embedding model used; the sketch after this list shows how the three behave on the same pair of vectors:

  • L2 (Euclidean Distance): Measures the straight-line distance. Common in image processing.
  • IP (Inner Product): Often used in recommendation systems where vector magnitude matters.
  • Cosine Similarity: The standard for NLP and RAG, measuring the angle between vectors (normalized IP).
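
A toy comparison in plain Python (no Milvus required) makes the difference concrete:

import math

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0] # same direction as a, twice the magnitude

l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) # ~3.74: "far apart"
ip = sum(x * y for x, y in zip(a, b))                   # 28.0: rewards magnitude
cos = ip / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))) # 1.0: identical direction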

3. Deployment Workflow (PyMilvus)

A typical implementation involves defining a schema, creating an index, and performing a search.

import random

from pymilvus import connections, Collection, CollectionSchema, DataType, FieldSchema

# 1. Establish Connection
connections.connect("default", host="localhost", port="19530")

# 2. Define Schema
# Milvus supports dynamic schemas, but explicit schemas are recommended for production.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1536), # OpenAI embedding size
    FieldSchema(name="metadata", dtype=DataType.JSON) # Store flexible metadata
]
schema = CollectionSchema(fields, description="Knowledge Base Search")

# 3. Create Collection and Index
collection = Collection("kb_collection", schema)
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 200}
}
collection.create_index(field_name="vector", index_params=index_params)

# 4. Data Ingestion
# A random vector stands in for a real embedding here.
embedding = [random.random() for _ in range(1536)]
data = [
    [embedding],                      # List of vectors
    [{"source": "doc_1", "page": 10}] # List of JSON metadata
]
collection.insert(data)
collection.flush() # Persist the buffered data so it can be sealed and indexed

# 5. Search with Scalar Filtering
collection.load() # Load collection into Query Node memory
search_params = {"metric_type": "COSINE", "params": {"ef": 64}}
results = collection.search(
    data=[embedding],                 # Query vector(s)
    anns_field="vector",
    param=search_params,
    limit=3,
    expr='metadata["source"] == "doc_1"' # Boolean filtering on the JSON field
)

Advanced Techniques

Hybrid Search: Dense + Sparse Retrieval

Modern search systems often require both semantic understanding (Dense Vectors) and exact keyword matching (Sparse Vectors). Milvus supports Hybrid Search, allowing users to store both types in a single collection and query them in a single request (see the sketch after the list below).

  • Dense Vectors: Captured by models like BERT or Ada-002 for "meaning."
  • Sparse Vectors: Captured by algorithms like BM25 or SPLADE for "keywords."
  • Fusion: Milvus uses Reciprocal Rank Fusion (RRF) or weighted scoring to combine the results from both search types into a single ranked list.
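
A sketch of PyMilvus hybrid search (available in Milvus 2.4+), assuming a collection that defines a dense vector field named "dense" and a sparse vector field named "sparse"; the query vectors are placeholders:

from pymilvus import AnnSearchRequest, RRFRanker

query_embedding = [0.1] * 1536      # placeholder dense vector
query_sparse = {12: 0.9, 4096: 0.4} # placeholder {dim_index: weight} sparse vector

dense_req = AnnSearchRequest(
    data=[query_embedding],
    anns_field="dense",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=20,
)
sparse_req = AnnSearchRequest(
    data=[query_sparse],
    anns_field="sparse",
    param={"metric_type": "IP", "params": {}},
    limit=20,
)

# Reciprocal Rank Fusion merges the two ranked lists into one.
results = collection.hybrid_search(
    reqs=[dense_req, sparse_req],
    rerank=RRFRanker(k=60),
    limit=5,
)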

Scalar Filtering and Bitsets

Milvus performs pre-filtering using bitsets. When a query includes a scalar filter (e.g., status == 'active'), Milvus generates a bitset where each bit represents an entity's eligibility. During the ANN graph traversal, the engine checks the bitset; if a vector's bit is not set, it is ignored. This ensures that the results are 100% accurate regarding the metadata constraints without sacrificing the speed of the vector search.
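
A deliberately simplified Python illustration of the idea (not Milvus internals):

# One bit per entity: 1 = passes the scalar filter, 0 = filtered out.
entities = [
    {"id": 0, "status": "active"},
    {"id": 1, "status": "deleted"},
    {"id": 2, "status": "active"},
]
bitset = [1 if e["status"] == "active" else 0 for e in entities]

def eligible(candidate_id):
    # Checked during graph traversal, before any distance computation
    # is spent on a vector that the filter would reject anyway.
    return bool(bitset[candidate_id])

assert eligible(0) and not eligible(1) and eligible(2)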

Multi-Tenancy Strategies

For SaaS applications, managing data for thousands of users is critical. Milvus offers three patterns (the second and third are sketched after this list):

  1. Collection-per-tenant: Strongest isolation, but limited by the system's collection limit (approx. 65k).
  2. Partition-per-tenant: Good balance of isolation and performance.
  3. Field-level filtering: All tenants share a collection, and a tenant_id field is used in every query. This is the most scalable approach, supporting millions of tenants.
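
Sketches of patterns 2 and 3, reusing the collection and search_params from the workflow above; the tenant names, query_vector, and the tenant_id field are placeholders:

query_vector = [0.1] * 1536 # placeholder embedding

# Pattern 2: partition-per-tenant
collection.create_partition("tenant_acme")
collection.search(
    data=[query_vector], anns_field="vector", param=search_params,
    limit=5, partition_names=["tenant_acme"],
)

# Pattern 3: field-level filtering (assumes a tenant_id scalar field in the schema)
collection.search(
    data=[query_vector], anns_field="vector", param=search_params,
    limit=5, expr='tenant_id == "acme"',
)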

Research and Future Directions

Milvus 3.0: The Vector Data Lake

The project is evolving from a specialized vector store into a Vector Data Lake. This transition focuses on:

  • Unified Storage: Natively handling structured, semi-structured (JSON), and unstructured data in a single, high-performance format.
  • Serverless Native: Optimizing for cloud environments where compute can scale to zero, and storage is decoupled via tiered S3/local-disk caching.
  • Real-time Updates: Improving the LSM-tree (Log-Structured Merge-tree) approach for vectors to allow for high-frequency deletions and updates without triggering expensive re-indexing.

Hardware Acceleration and AI Agents

Research is heavily focused on Hardware-Software Co-design. By offloading distance computations to FPGAs or utilizing NVIDIA's GPU-accelerated libraries, Milvus aims to reduce the TCO (Total Cost of Ownership) for billion-scale deployments. Furthermore, Milvus is becoming the "Long-term Memory" for AI Agents, where the database stores not just documents, but the agent's past actions, tool-use history, and reasoning traces to provide persistent context across sessions.


Frequently Asked Questions

Q: How does Milvus compare to Pinecone or Weaviate?

A: Milvus is a self-hosted, open-source enterprise solution (with a managed version via Zilliz). Compared to Pinecone (SaaS-only), Milvus offers total control over data residency and infrastructure. Compared to Weaviate, Milvus is generally more performant at massive scales (billions of vectors) due to its disaggregated architecture, while Weaviate is often preferred for smaller-scale ease of use and its GraphQL-first approach.

Q: Can I run Milvus on a single machine?

A: Yes. Milvus offers a Standalone mode via Docker Compose, which packages all components into a few containers. This is ideal for development. For production, Cluster mode on Kubernetes is the recommended path for high availability.

Q: Does Milvus support ACID transactions?

A: Milvus supports entity-level atomicity and follows an eventual consistency model by default. However, it allows users to tune the consistency level (Strong, Bounded Staleness, Session, Eventually) per query, balancing the trade-off between data visibility and search performance.
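
For example, PyMilvus lets you set the consistency level per search call (query_vector and search_params as in the workflow above):

results = collection.search(
    data=[query_vector], anns_field="vector", param=search_params,
    limit=3,
    consistency_level="Strong", # or "Bounded", "Session", "Eventually"
)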

Q: What are the hardware requirements for 1 million vectors?

A: For 1 million 768-dimensional vectors using HNSW, you would typically need ~4GB of RAM for the index and additional overhead for the OS and metadata. Milvus is highly efficient, but memory scales linearly with vector dimensions and the number of entities.
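
A back-of-the-envelope estimate (the HNSW link term is a rough approximation at M=16):

dim, n = 768, 1_000_000
raw_vectors = n * dim * 4   # float32 = 4 bytes -> ~3.07 GB
hnsw_links = n * 2 * 16 * 4 # roughly 2*M int32 neighbor ids per node -> ~0.13 GB
print(f"{(raw_vectors + hnsw_links) / 1e9:.2f} GB") # ~3.2 GB before OS/metadata overhead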

Q: How do I handle data deletion in Milvus?

A: Milvus supports deletion by primary key or via filter expressions. Deletions are initially "soft" (marked in a delete buffer) and are permanently purged during the compaction process, which merges small segments into larger ones to maintain search efficiency.
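
In PyMilvus, both styles go through Collection.delete; the filter-expression form assumes Milvus 2.3+:

# Delete by primary key
collection.delete(expr="id in [10, 11, 12]")

# Delete by filter expression on a JSON field
collection.delete(expr='metadata["source"] == "doc_1"')

# Deleted entities are purged during compaction, which can also be triggered manually.
collection.compact()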


Related Articles

Chroma

Chroma is an AI-native, open-source vector database designed to provide long-term memory for LLMs through high-performance embedding storage, semantic search, and hybrid retrieval.

Elasticsearch

A deep technical exploration of Elasticsearch's architecture, from its Lucene-based inverted indices to its modern role as a high-performance vector database for RAG and Agentic AI.

FAISS (Facebook AI Similarity Search)

A comprehensive technical deep-dive into FAISS, the industry-standard library for billion-scale similarity search, covering its indexing architectures, quantization techniques, and GPU acceleration.

Qdrant: Engineering High-Performance Vector Infrastructure for Agentic AI

A technical deep-dive into the Rust-based vector database architecture, focusing on Filterable HNSW, quantization strategies, and the roadmap toward Agent-Native Retrieval.

Advanced Query Capabilities

An exhaustive technical exploration of modern retrieval architectures, spanning relational window functions, recursive graph traversals, and the convergence of lexical and semantic hybrid search.

Attribute-Based Filtering

A technical deep-dive into Attribute-Based Filtering (ABF), exploring its role in bridging structured business logic with unstructured vector data, hardware-level SIMD optimizations, and the emerging paradigm of Declarative Recall.

Hybrid Query Execution

An exhaustive technical exploration of Hybrid Query Execution, covering the fusion of sparse and dense retrieval, HTAP storage architectures, hardware-aware scheduling, and the future of learned index structures.

Multi-Tenancy Features

An exhaustive technical exploration of multi-tenancy architectures, focusing on isolation strategies, metadata-driven filtering, and resource optimization in modern SaaS and AI platforms.