Definition
A high-dimensional vector space where disparate data modalities—such as text, images, and audio—or different languages are mapped into a single, shared coordinate system. This allows for cross-modal retrieval in RAG pipelines, enabling a text-based query to retrieve semantically relevant non-text assets by calculating their proximity within the same mathematical manifold.
Not just a large vector database, but a shared mathematical representation where different encoders are aligned to the same semantic coordinates.
"A Universal Translator’s Map: A single globe where a mountain is marked at the same GPS coordinate whether the label is a photo, a written word, or a spoken name."
- Cross-modal Retrieval(Application)
- Contrastive Learning(Prerequisite)
- Multimodal LLM(Component)
- Projection Layer(Component)
Conceptual Overview
A high-dimensional vector space where disparate data modalities—such as text, images, and audio—or different languages are mapped into a single, shared coordinate system. This allows for cross-modal retrieval in RAG pipelines, enabling a text-based query to retrieve semantically relevant non-text assets by calculating their proximity within the same mathematical manifold.
Disambiguation
Not just a large vector database, but a shared mathematical representation where different encoders are aligned to the same semantic coordinates.
Visual Analog
A Universal Translator’s Map: A single globe where a mountain is marked at the same GPS coordinate whether the label is a photo, a written word, or a spoken name.