Definition
High-dimensional numerical vectors generated by vision models that map visual features into a latent space, enabling cross-modal retrieval where images and text share the same coordinate system. In RAG pipelines, these allow agents to query visual databases using natural language by calculating the distance between the query vector and the image embeddings.
Unlike image compression or raw pixels, embeddings represent semantic meaning, not visual reconstruction data.
"A GPS coordinate on a conceptual map where 'Golden Retriever' and 'Yellow Lab' are neighboring houses, regardless of the file format or resolution."
- Multimodal RAG(Application Framework)
- Vector Database(Storage Infrastructure)
- Cosine Similarity(Retrieval Logic)
- Contrastive Learning(Foundational Training Method)
Conceptual Overview
High-dimensional numerical vectors generated by vision models that map visual features into a latent space, enabling cross-modal retrieval where images and text share the same coordinate system. In RAG pipelines, these allow agents to query visual databases using natural language by calculating the distance between the query vector and the image embeddings.
Disambiguation
Unlike image compression or raw pixels, embeddings represent semantic meaning, not visual reconstruction data.
Visual Analog
A GPS coordinate on a conceptual map where 'Golden Retriever' and 'Yellow Lab' are neighboring houses, regardless of the file format or resolution.