Yesterday, amid a flurry of enterprise AI product updates, Google announced arguably the most significant of them for enterprise customers: the public preview availability of Gemini Embedding 2, its new embeddings model — a significant evolution in how machines represent and retrieve information across different media types.
While previous embedding models were largely restricted to text, the new model natively integrates text, images, video, audio, and documents into a single numerical space — reducing latency by as much as 70% for some early customers and lowering total costs for enterprises that use AI models powered by their own data to complete business tasks.
VentureBeat collaborator Sam Witteveen, co-founder of AI and ML training company Red Dragon AI, received early access to Gemini Embedding 2 and published a video of his impressions on YouTube. Watch it below:
Who needs and uses an embedding model?

For those who have encountered the term "embeddings" in AI discussions but find it abstract, a useful analogy is that of a universal library.
In a traditional library, books are organized by metadata: author, title, or genre. In the "embedding space" of an AI, information is organized by ideas.
Imagine a library where books aren't organized by the Dewey Decimal System, but by their "vibe" or "essence". In this library, a biography of Steve Jobs would physically fly across the room to sit next to a technical manual for a Macintosh. A poem about a sunset would drift toward a photography book of the Pacific Coast, with all thematically similar content organized in beautiful hovering "clouds" of books. This is basically what an embedding model does.
An embedding model takes complex data—like a sentence, a photo of a sunset, or a snippet of a podcast—and converts it into a long list of numbers called a vector.
These numbers represent coordinates in a high-dimensional map. If two items are "semantically" similar (e.g., a photo of a golden retriever and the text "man's best friend"), the model places their coordinates very close to each other in this map. Today, these models are the invisible engine behind:
Search Engines: Finding results based on what you mean, not just the specific words you typed.
Recommendation Systems: Netflix or Spotify suggesting content because its "coordinates" are near things you already like.
Enterprise AI: Large companies use them for Retrieval-Augmented Generation (RAG), where an AI assistant "looks up" a company's internal PDFs to answer an employee's question accurately.
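The library analogy can be made concrete in a few lines of code. The toy sketch below (the vectors are invented for illustration, not real model output) scores how close two embeddings sit on the map using cosine similarity, the standard measure in vector search:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means closer in meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models use hundreds or thousands of dimensions).
golden_retriever_photo = [0.9, 0.1, 0.3, 0.0]
mans_best_friend_text  = [0.8, 0.2, 0.4, 0.1]
tax_form_pdf           = [0.0, 0.9, 0.0, 0.8]

print(cosine_similarity(golden_retriever_photo, mans_best_friend_text))  # high: close on the map
print(cosine_similarity(golden_retriever_photo, tax_form_pdf))           # low: far apart
```

In production the vectors would come from the embedding model itself, but the distance math a search engine or RAG system runs over them is exactly this.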
The concept of mapping words to vectors dates back to the 1950s with linguists like John Rupert Firth, but the modern "vector revolution" began in the early 2000s when Yoshua Bengio’s team first used the term "word embeddings". The real breakthrough for the industry was Word2Vec, released by a team at Google led by Tomas Mikolov in 2013. Today, the market is led by a handful of major players:
OpenAI: Known for its widely-used text-embedding-3 series.
Google: With the new Gemini and previous Gecko models.
Anthropic and Cohere: Providing specialized models for enterprise search and developer workflows.
By moving beyond text to a natively multimodal architecture, Google is attempting to create a singular, unified map for the sum of human digital expression—text, images, video, audio, and documents—all residing in the same mathematical neighborhood.
Why Gemini Embedding 2 is such a big deal

Most leading models are still "text-first." If you want to search a video library, the AI usually has to transcribe the video into text first, then embed that text.
Google’s Gemini Embedding 2 is natively multimodal.
As Logan Kilpatrick of Google DeepMind posted on X, the model allows developers to "bring text, images, video, audio, and docs into the same embedding space".
It understands audio as sound waves and video as motion directly, without needing to turn them into text first. This reduces "translation" errors and captures nuances that text alone might miss.
For developers and enterprises, the "natively multimodal" nature of Gemini Embedding 2 represents a shift toward more efficient AI pipelines.
By mapping all media into a single 3,072-dimensional space, developers no longer need separate systems for image search and text search; they can perform "cross-modal" retrieval—using a text query to find a specific moment in a video or an image that matches a specific sound.
And unlike its predecessors, Gemini Embedding 2 can process requests that mix modalities. A developer can send a request containing both an image of a vintage car and the text "What is the engine type?". The model doesn't process them separately; it treats them as a single, nuanced concept. This allows for a much deeper understanding of real-world data where the "meaning" is often found in the intersection of what we see and what we say.
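The article doesn't show the exact request schema, so the sketch below only assembles a mixed-modality payload as a plain Python dict. The field names (`model`, `parts`, `inline_data`) and the model identifier are assumptions modeled on Google's generative API conventions, not the confirmed Gemini Embedding 2 API; consult the official docs for the real schema:

```python
import base64

def build_mixed_request(image_bytes: bytes, question: str) -> dict:
    """Assemble one request pairing an image with a text query.

    Hypothetical payload shape: the keys and model name below are a sketch,
    not the documented Gemini Embedding 2 request format.
    """
    return {
        "model": "gemini-embedding-2",  # assumed identifier, not confirmed
        "content": {
            "parts": [
                {"inline_data": {"mime_type": "image/jpeg",
                                 "data": base64.b64encode(image_bytes).decode("ascii")}},
                {"text": question},
            ]
        },
    }

request = build_mixed_request(b"\xff\xd8\xff...", "What is the engine type?")
```

The point is the single `parts` list: both modalities travel in one request and are embedded as one concept, rather than being sent through separate image and text endpoints.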
One of the model's more technical features is Matryoshka Representation Learning. Named after Russian nesting dolls, this technique allows the model to "nest" the most important information in the first few numbers of the vector.
An enterprise can choose to use the full 3,072 dimensions for maximum precision, or "truncate" them down to 1,536 or 768 dimensions to save on database storage costs with minimal loss in accuracy.
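In practice, Matryoshka truncation is as simple as keeping the first N dimensions and re-normalizing. A minimal sketch with a toy stand-in vector (a real vector would come from the model):

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` Matryoshka dimensions and re-normalize to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# A toy stand-in for a 3,072-dimensional embedding.
full = [math.sin(i) for i in range(3072)]
short = truncate_embedding(full, 768)

print(len(short))  # 768 — a quarter of the storage per vector
```

Re-normalizing matters because vector databases typically compare embeddings by cosine similarity, which assumes consistent vector lengths; the truncated vector then drops straight into a 768-dimension index.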
Benchmarking the performance gains of moving to multimodal

Gemini Embedding 2 establishes a new performance ceiling for multimodal depth, specifically outperforming previous industry leaders across text, image, and video evaluation tasks.
The model’s most significant lead is found in video and audio retrieval, where its native architecture allows it to bypass the performance degradation typically associated with text-based transcription pipelines.
Specifically, in video-to-text and text-to-video retrieval tasks, the model demonstrates a measurable performance gap over existing industry leaders, accurately mapping motion and temporal data into a unified semantic space.
The technical results show a distinct advantage in the following standardized categories:
Multimodal Retrieval: Gemini Embedding 2 consistently outperforms leading text and vision models in complex retrieval tasks that require understanding the relationship between visual elements and textual queries.
Speech and Audio Depth: The model introduces a new standard for native audio embeddings, achieving higher accuracy in capturing phonetic and tonal intent compared to models that rely on intermediate text-transcription.
Contextual Scaling: In text-based benchmarks, the model maintains high precision while utilizing its expansive 8,192 token context window, ensuring that long-form documents are embedded with the same semantic density as shorter snippets.
Dimension Flexibility: Testing across the Matryoshka Representation Learning (MRL) layers reveals that even when truncated to 768 dimensions, the model retains a significant majority of its 3,072-dimension performance, outperforming fixed-dimension models of similar size.
For the modern enterprise, information is often a fragmented mess. A single customer issue might involve a recorded support call (audio), a screenshot of an error (image), a PDF of a contract (document), and a series of emails (text).
In previous years, searching across these formats required four different pipelines. With Gemini Embedding 2, an enterprise can create a Unified Knowledge Base. This enables a more advanced form of RAG, wherein a company’s internal AI doesn't just look up facts, but understands the relationship between them regardless of format.
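A "Unified Knowledge Base" of this kind can be sketched as a single vector index whose records carry their format only as metadata. In the toy example below, the 3-D vectors are hand-picked stand-ins for real embeddings; one query ranks an audio file, a screenshot, a PDF, and an email together:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# One index for everything: modality is just metadata, because every
# vector lives in the same embedding space (toy 3-D vectors for illustration).
knowledge_base = [
    {"modality": "audio", "source": "support_call_0412.mp3", "vector": [0.9, 0.2, 0.1]},
    {"modality": "image", "source": "error_screenshot.png",  "vector": [0.8, 0.3, 0.2]},
    {"modality": "pdf",   "source": "contract_acme.pdf",     "vector": [0.1, 0.9, 0.4]},
    {"modality": "text",  "source": "email_thread_17.txt",   "vector": [0.2, 0.1, 0.9]},
]

def search(query_vector, k=2):
    """Rank every record, regardless of format, against one query vector."""
    ranked = sorted(knowledge_base, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return [r["source"] for r in ranked[:k]]

# A made-up query vector standing in for the embedded text
# "customer reported a crash on login": it retrieves the call recording
# and the screenshot ahead of the unrelated contract.
print(search([0.85, 0.25, 0.15]))
```

With a text-first model, each of the four formats would need its own pipeline and its own index; here, one retrieval call spans all of them.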
Early partners are already reporting drastic efficiency gains:
Sparkonomy, a creator economy platform, reported that the model’s native multimodality slashed their latency by up to 70%. By removing the need for intermediate LLM "inference" (the step where one model explains a video to another), they nearly doubled their semantic similarity scores for matching creators with brands.
Everlaw, a legal tech firm, is using the model to navigate the "high-stakes setting" of litigation discovery. In legal cases where millions of records must be parsed, Gemini’s ability to index images and videos alongside text allows legal professionals to find "smoking gun" evidence that traditional text-search would miss.
In its announcement, Google was upfront about some of the current limitations of Gemini Embedding 2. The new model can vectorize individual inputs of up to 8,192 text tokens, 6 images (in a single batch), 128 seconds of video (2 minutes, 8 seconds), 80 seconds of native audio (1 minute, 20 seconds), and a 6-page PDF.
It is vital to clarify that these are input limits per request, not a cap on what the system can remember or store.
Think of it like a scanner. If a scanner has a limit of "one page at a time," it doesn't mean you can only ever scan one page; it means you have to feed the pages in one by one.
Individual File Size: You cannot "embed" a 100-page PDF in a single call. You must "chunk" the document—splitting it into segments of 6 pages or fewer—and send each segment to the model individually.
Cumulative Knowledge: Once those chunks are converted into vectors, they can all live together in your database. You can have a database containing ten million 6-page PDFs, and the model will be able to search across all of them simultaneously.
Video and Audio: Similarly, if you have a 10-minute video, you would break it into 128-second segments to create a searchable "timeline" of embeddings.
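Those per-request limits reduce to a simple chunking helper. The sketch below uses the page and duration caps quoted in the announcement; how you actually extract pages or cut video segments is up to your pipeline:

```python
def chunk_ranges(total, limit):
    """Split `total` units into consecutive (start, end) spans of at most `limit` units."""
    return [(start, min(start + limit, total)) for start in range(0, total, limit)]

# Per-request input limits from the preview announcement.
PDF_PAGE_LIMIT = 6
VIDEO_SECOND_LIMIT = 128
AUDIO_SECOND_LIMIT = 80

# A 100-page PDF becomes 17 requests of at most 6 pages each...
pdf_chunks = chunk_ranges(100, PDF_PAGE_LIMIT)
# ...and a 10-minute (600-second) video becomes 5 segments of at most 128 seconds.
video_chunks = chunk_ranges(600, VIDEO_SECOND_LIMIT)

print(len(pdf_chunks), pdf_chunks[-1])      # 17 (96, 100)
print(len(video_chunks), video_chunks[-1])  # 5 (512, 600)
```

Each `(start, end)` span becomes one embedding request, and the resulting vectors all land in the same database, which is where the cumulative-knowledge point above comes in.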
As of March 10, 2026, Gemini Embedding 2 is officially in Public Preview.
For developers and enterprise leaders, this means the model is accessible for immediate testing and production integration, though it is still subject to the iterative refinements typical of "preview" software before it reaches General Availability (GA).
The model is deployed across Google’s two primary AI gateways, each catering to a different scale of operation:
Gemini API: Targeted at rapid prototyping and individual developers, this path offers a simplified pricing structure.
Vertex AI (Google Cloud): The enterprise-grade environment designed for massive scale, offering advanced security controls and integration with the broader Google Cloud ecosystem.
It's also already integrated with the heavy hitters of AI infrastructure: LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB.
In the Gemini API, Google has introduced a tiered pricing model that distinguishes between "standard" data (text, images, and video) and "native" audio.
The Free Tier: Developers can experiment with the model at no cost, though this tier comes with rate limits (typically 60 requests per minute) and uses data to improve Google’s products.
The Paid Tier: For production-level volume, the cost is calculated per million tokens. For text, image, and video inputs, the rate is $0.25 per 1 million tokens.
The "Audio Premium": Because the model natively ingests audio data without intermediate transcription—a more computationally intensive task—the rate for audio inputs is doubled to $0.50 per 1 million tokens.
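Those tiers translate into a straightforward cost estimate. The per-million-token rates below come from the article; the monthly token counts are invented for the example:

```python
# Paid-tier rates from the announcement, in dollars per 1 million tokens.
RATE_PER_MILLION = {
    "text": 0.25,
    "image": 0.25,
    "video": 0.25,
    "audio": 0.50,  # the "audio premium": double the standard rate
}

def embedding_cost(token_counts: dict) -> float:
    """Estimate the paid-tier bill for a mix of modalities (tokens per modality)."""
    return sum(RATE_PER_MILLION[m] * tokens / 1_000_000 for m, tokens in token_counts.items())

# A hypothetical monthly workload: 40M text, 5M image, 10M video, 8M audio tokens.
monthly = embedding_cost({"text": 40_000_000, "image": 5_000_000,
                          "video": 10_000_000, "audio": 8_000_000})
print(f"${monthly:.2f}")  # $17.75
```

Note that audio contributes $4.00 of that total despite being the smallest share of tokens after images, which is the audio premium at work.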
For large-scale deployments on Vertex AI, the pricing follows an enterprise-centric "Pay-as-you-go" (PayGo) model. This allows organizations to pay for exactly what they use across different processing modes:
Flex PayGo: Best for unpredictable, bursty workloads.
Provisioned Throughput: Designed for enterprises that require guaranteed capacity and consistent latency for high-traffic applications.
Batch Prediction: Ideal for re-indexing massive historical archives, where time-sensitivity is lower but volume is extremely high.
By making the model available through these diverse channels and integrating it natively with libraries like LangChain, LlamaIndex, and Weaviate, Google has ensured that the "switching cost" for businesses isn't just a matter of price, but of operational ease. Whether a startup is building its first RAG-based assistant or a multinational is unifying decades of disparate media archives, the infrastructure is now live and globally accessible.
In addition, the official Gemini API and Vertex AI Colab notebooks, which contain the Python code necessary to implement these features, are licensed under the Apache License, Version 2.0.
The Apache 2.0 license is highly regarded in the tech community because it is "permissive." It allows developers to take Google’s implementation code, modify it, and use it in their own commercial products without having to pay royalties or "open source" their own proprietary code in return.
How enterprises should respond: migrate to Gemini Embedding 2 or not?

For Chief Data Officers and technical leads, the decision to migrate to Gemini Embedding 2 hinges on the transition from a "text-plus" strategy to a "natively multimodal" one.
If your organization currently relies on fragmented pipelines — where images and videos are first transcribed or tagged by separate models before being indexed — the upgrade is likely a strategic necessity.
This model eliminates the "translation tax" of using intermediate LLMs to describe visual or auditory data, a move that partners like Sparkonomy found reduced latency by up to 70% while doubling semantic similarity scores. For businesses managing massive, diverse datasets, this isn't just a performance boost; it is a structural simplification that reduces the number of points where "meaning" can be lost or distorted.
The effort to switch from a text-only foundation is lower than one might expect due to what early users describe as excellent "API continuity".
Because the model integrates with industry-standard frameworks like LangChain, LlamaIndex, and Vector Search, it can often be "dropped into" existing workflows with minimal code changes. However, the real cost and energy investment lies in re-indexing. Moving to this model requires re-embedding your existing corpus to ensure all data points exist in the same 3,072-dimensional space.
While this is a one-time computational hurdle, it is the prerequisite for unlocking cross-modal search—where a simple text query can suddenly "see" into your video archives or "hear" specific customer sentiment in call recordings.
The primary trade-off for data leaders to weigh is the balance between high-fidelity retrieval and long-term storage economics. Gemini Embedding 2 addresses this directly through Matryoshka Representation Learning (MRL), which allows you to truncate vectors from 3,072 dimensions down to 768 without a linear drop in quality.
This gives CDOs a tactical lever: you can choose maximum precision for high-stakes legal or medical discovery—as seen in Everlaw’s 20% lift in recall—while utilizing smaller, more efficient vectors for lower-priority recommendation engines to keep cloud storage costs in check.
Ultimately, the ROI is found in the "lift" of accuracy; in a landscape where an AI's value is defined by its context, the ability to natively index a 6-page PDF or 128 seconds of video directly into a knowledge base provides a depth of insight that text-only models simply cannot replicate.