Architecture Deep-Dive · 12 min read · Mar 2026

Multimodal RAG: How to Build AI Search Across Text, Images, Audio, and Video

Text-only RAG is table stakes. Here's how I built a system that searches across PDFs, images, audio recordings, and video — all in one unified vector space.

Multimodal RAG · Vector Search · Gemini · Whisper · Computer Vision

Dhruv Tomar

AI Solutions Architect

Tech Stack

Next.js · FastAPI · Gemini · Pinecone · Whisper · FFmpeg

Architecture

Upload (any modality) → Modality Router → Text: chunk + embed | Image: Gemini Vision description + embed | Audio: Whisper transcription + embed | Video: FFmpeg frame extraction + Whisper audio + embed → Unified Pinecone index → Hybrid retrieval → SSE streaming response with source citations.
5 modalities supported
100K+ documents indexed
Sub-500ms retrieval
Source citations on every answer

Standard RAG handles text. But real knowledge lives in PDFs with diagrams, recorded meetings, training videos, and product images. My Multimodal RAG system handles all of them.

The Core Insight: Every modality can be converted to text, then embedded into the same vector space. Images become descriptions. Audio becomes transcriptions. Video becomes frame descriptions + audio transcription. Once everything is text, standard RAG applies.
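Here's a minimal sketch of that normalization step. The `Chunk` shape and the extension-to-modality map are my own illustration, not the repo's actual types, but every pipeline below ultimately produces exactly this: plain text plus metadata.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Chunk:
    """One searchable unit. By this point, every modality is just text."""
    text: str                   # raw text, image description, or transcript
    modality: str               # "text" | "pdf" | "image" | "audio" | "video"
    source: str                 # original file path or URL
    metadata: dict = field(default_factory=dict)  # page, timestamp, chunk index...

MODALITY_BY_EXT = {
    ".txt": "text", ".md": "text", ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".mp4": "video", ".mov": "video",
}

def detect_modality(path: str) -> str:
    """The router's first decision: which pipeline does this upload go to?"""
    ext = Path(path).suffix.lower()
    if ext not in MODALITY_BY_EXT:
        raise ValueError(f"Unsupported file type: {ext}")
    return MODALITY_BY_EXT[ext]
```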

Text Pipeline: Straightforward — chunk by semantic boundaries (headers, paragraphs, topic shifts). Embed with text-embedding-3-large. Store in Pinecone with metadata (source file, page number, chunk index).
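A hedged sketch of the text path, assuming the OpenAI and Pinecone Python clients and an index named `multimodal-rag` (the real chunker also splits on headers and topic shifts, not just paragraph length):

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                      # reads OPENAI_API_KEY from the env
index = Pinecone().Index("multimodal-rag")    # reads PINECONE_API_KEY; index name assumed

def chunk_by_paragraphs(text: str, max_chars: int = 1200) -> list[str]:
    """Greedy paragraph packing into ~1200-character chunks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def ingest_text(text: str, source_file: str, page: int = 0) -> None:
    chunks = chunk_by_paragraphs(text)
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-large", input=chunks
    ).data
    index.upsert(vectors=[
        {
            "id": f"{source_file}:{page}:{i}",
            "values": embeddings[i].embedding,
            "metadata": {"modality": "text", "source": source_file,
                         "page": page, "chunk_index": i, "text": chunk},
        }
        for i, chunk in enumerate(chunks)
    ])
```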

PDF Pipeline: PDFs are special because they contain both text and visual elements (charts, diagrams, tables). Extract text with standard parsing. For pages with images/charts, send to Gemini Vision for description. Embed both the extracted text and the visual descriptions.
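Roughly like this, assuming pypdf for text extraction and PyMuPDF for rendering pages that contain figures (the post only says "standard parsing", so treat the library choices, model name, and DPI as illustrative):

```python
import io

import fitz                           # PyMuPDF, used to render pages as images
import google.generativeai as genai
from PIL import Image
from pypdf import PdfReader

genai.configure(api_key="...")        # or set GOOGLE_API_KEY in the env
vision = genai.GenerativeModel("gemini-1.5-flash")

def extract_pdf(path: str) -> list[dict]:
    reader = PdfReader(path)
    doc = fitz.open(path)
    out = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if text.strip():
            out.append({"text": text, "page": page_num, "kind": "text"})
        if page.images:               # page has embedded charts/diagrams
            pix = doc[page_num - 1].get_pixmap(dpi=150)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            desc = vision.generate_content(
                ["Describe every chart, diagram, and table on this page.", img]
            ).text
            out.append({"text": desc, "page": page_num, "kind": "visual"})
    return out
```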

Image Pipeline: Send each image to Gemini Vision with the prompt: "Describe this image in detail, including any text, diagrams, charts, or visual information." The description becomes the searchable text. Store the original image URL in metadata for display.
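The image path in miniature: the description is what gets embedded, while the URL only rides along in metadata so the UI can show the original. The model name is again illustrative.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="...")
vision = genai.GenerativeModel("gemini-1.5-flash")

PROMPT = ("Describe this image in detail, including any text, diagrams, "
          "charts, or visual information.")

def describe_image(path: str, public_url: str) -> dict:
    description = vision.generate_content([PROMPT, Image.open(path)]).text
    return {
        "text": description,                     # this is what gets embedded
        "metadata": {"modality": "image", "source": path,
                     "image_url": public_url},   # shown in the UI, never searched
    }
```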

Audio Pipeline: Transcribe with OpenAI Whisper (or local WhisperKit for speed). Chunk the transcript by speaker turns or time segments. Each chunk includes a timestamp — so when a user searches "what did we discuss about pricing," the result links to the exact moment in the recording.
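A sketch of that using the local openai-whisper package. The 60-second grouping window is my choice for illustration; a diarization step for speaker-turn chunking would slot into the same loop.

```python
import whisper

model = whisper.load_model("base")

def transcribe_audio(path: str, window_s: float = 60.0) -> list[dict]:
    """Group Whisper segments into ~1-minute chunks, each keeping its start timestamp."""
    segments = model.transcribe(path)["segments"]
    chunks, buf, chunk_start = [], [], 0.0
    for seg in segments:
        if buf and seg["end"] - chunk_start > window_s:
            chunks.append({"text": " ".join(buf),
                           "metadata": {"modality": "audio", "source": path,
                                        "start_s": chunk_start}})
            buf, chunk_start = [], seg["start"]
        buf.append(seg["text"].strip())
    if buf:
        chunks.append({"text": " ".join(buf),
                       "metadata": {"modality": "audio", "source": path,
                                    "start_s": chunk_start}})
    return chunks
```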

Video Pipeline: This is the most complex. Extract audio track with FFmpeg, transcribe with Whisper. Extract key frames (1 per 5 seconds for long videos, 1 per second for short). Send frames to Gemini Vision for scene descriptions. Combine audio transcript + frame descriptions + timestamp metadata. A user can search "the slide about Q3 revenue" and get the exact video timestamp.
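Condensed into one function, assuming FFmpeg is on the PATH; the frame rate, temp paths, and model names are illustrative rather than the repo's exact values.

```python
import subprocess
from pathlib import Path

import google.generativeai as genai
import whisper
from PIL import Image

genai.configure(api_key="...")
vision = genai.GenerativeModel("gemini-1.5-flash")
asr = whisper.load_model("base")

def process_video(path: str, workdir: str = "/tmp/video") -> list[dict]:
    frames_dir = Path(workdir) / "frames"
    frames_dir.mkdir(parents=True, exist_ok=True)
    audio_path = str(Path(workdir) / "audio.wav")

    # 1. Strip the audio track and sample one frame every 5 seconds.
    subprocess.run(["ffmpeg", "-y", "-i", path, "-vn", "-ar", "16000", audio_path],
                   check=True)
    subprocess.run(["ffmpeg", "-y", "-i", path, "-vf", "fps=1/5",
                    str(frames_dir / "frame_%05d.png")], check=True)

    chunks = []
    # 2. Timestamped transcript chunks from Whisper.
    for seg in asr.transcribe(audio_path)["segments"]:
        chunks.append({"text": seg["text"].strip(),
                       "metadata": {"modality": "video", "source": path,
                                    "kind": "speech", "start_s": seg["start"]}})
    # 3. Scene descriptions for each sampled frame (frame i lands at roughly i*5 s).
    for i, frame in enumerate(sorted(frames_dir.glob("*.png"))):
        desc = vision.generate_content(
            ["Describe this video frame, including any slide or on-screen text.",
             Image.open(frame)]
        ).text
        chunks.append({"text": desc,
                       "metadata": {"modality": "video", "source": path,
                                    "kind": "frame", "start_s": i * 5.0}})
    return chunks
```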

The Unified Index: All modalities live in one Pinecone index. Each vector has metadata: modality type, source file, timestamp/page number, original content URL. The retrieval pipeline doesn't care about modality — it searches the unified space and returns the most relevant chunks regardless of source type.
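Retrieval then collapses to a single query against that index, with an optional metadata filter if you ever do want to scope by modality. The reranking and sparse (keyword) half of the hybrid retrieval are omitted here for brevity.

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone().Index("multimodal-rag")    # same assumed index as above

def retrieve(query: str, top_k: int = 8, modality: str | None = None) -> list[dict]:
    """Embed the query once, search the whole index; results can be any modality."""
    vector = openai_client.embeddings.create(
        model="text-embedding-3-large", input=[query]
    ).data[0].embedding
    results = index.query(
        vector=vector,
        top_k=top_k,
        include_metadata=True,
        filter={"modality": {"$eq": modality}} if modality else None,
    )
    return [{"score": match.score, "metadata": match.metadata}
            for match in results.matches]
```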

Handling Scale: At 100K+ documents, indexing is a batch job. I use FastAPI background tasks for ingestion — upload triggers async processing. Progress tracking via WebSocket. Pinecone handles the vector scale; Supabase stores metadata and file references.
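The upload endpoint in skeleton form. The WebSocket progress channel and Supabase writes are left out, and `ingest_file` is a stand-in for the per-modality pipelines sketched above.

```python
import shutil
import uuid

from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

def ingest_file(path: str, job_id: str) -> None:
    """Detect modality, convert to text, embed, upsert (see the sketches above)."""
    ...

@app.post("/upload")
async def upload(file: UploadFile, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    dest = f"/tmp/{job_id}_{file.filename}"
    with open(dest, "wb") as out:
        shutil.copyfileobj(file.file, out)
    background_tasks.add_task(ingest_file, dest, job_id)   # response returns immediately
    return {"job_id": job_id, "status": "processing"}
```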

What Surprised Me: Gemini Vision descriptions of technical diagrams are shockingly good. A complex architecture diagram gets described accurately enough that semantic search finds it when someone asks "how does the auth flow work." This was the moment I realized multimodal RAG isn't a gimmick — it genuinely unlocks knowledge that text-only systems miss.

Open-sourced at github.com/aiagentwithdhruv/multimodal-rag.

Want to build something like this?

I architect and deploy end-to-end AI systems — from MVP to revenue.

Let's Talk