January 06, 2025
Disclaimer: This series explores concepts and creative solutions for building Advanced Retrieval Augmented Generation (RAG) Systems. If you’re new to RAG systems, I recommend exploring some introductory materials first. Microsoft’s Introduction to RAG in AI development and Google Cloud’s free course on Retrieval Augmented Generation are good places to start.
Welcome back to my series on building advanced RAG systems! In the previous post, I explored how to move beyond basic document loading and text splitting. If you haven't read part 1, I recommend starting there, as I will be building on those concepts in this and future posts. For part 2 of this series, I will be tackling another important and sometimes overlooked aspect of document ingestion: handling visual elements like images, tables, flow charts, diagrams, and graphs.
An important detail often overlooked in RAG implementation tutorials is the handling of images and other visual (non-text) elements. When using popular frameworks like Langchain, document loaders and text splitters typically discard images, diagrams, and other visual content by default, because these tools have no native way to process visual elements. An unfortunate consequence is that some developers don't realize they need to process visual data separately, and end up with RAG implementations that essentially throw the visual content in the garbage each time they load a new document. By omitting visual data, we lose valuable information.
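To make this concrete, here is a minimal sketch of pulling the embedded images out of a PDF yourself so they can be processed in a later step, since the text loaders won't do it for you. It assumes PyMuPDF (the `fitz` package), which is my choice for illustration rather than anything prescribed by the frameworks:

```python
# Minimal sketch: extract embedded images from a PDF so they aren't silently dropped.
# Assumes PyMuPDF (pip install pymupdf); the output file naming is illustrative.
import os
import fitz  # PyMuPDF

def extract_images(pdf_path: str, out_dir: str = "extracted_images") -> list[str]:
    """Save every embedded image in the PDF and return the saved file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for img_index, img in enumerate(page.get_images(full=True), start=1):
            xref = img[0]                      # cross-reference id of the image
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:         # CMYK and similar -> convert to RGB
                pix = fitz.Pixmap(fitz.csRGB, pix)
            path = os.path.join(out_dir, f"page{page_number}_img{img_index}.png")
            pix.save(path)
            paths.append(path)
    doc.close()
    return paths
```

The extracted files can then be handed to a vision model later on, while the usual loader/splitter pipeline handles the text.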
The development of multimodal “Vision” models has removed the text-only constraint. While true multimodal embedding systems are still in the experimental phase, we have access to powerful vision models that can bridge this gap.
Two notable options are:
These models can accept an image as input and generate text about it. When provided with well-crafted instruction prompts, vision models excel at generating detailed textual descriptions of visual content, including any text in the image itself and where it appears. This capability opens up a lot of possibilities for preserving visual information in our RAG systems.
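As an illustration, here is a hedged sketch of such a call using the OpenAI Python SDK; the model name, prompt wording, and base64 handling are my own assumptions rather than details from this post:

```python
# Sketch: ask a vision-capable chat model to describe an image.
# Assumes the OpenAI Python SDK v1.x and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model could be substituted here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, including any text it "
                         "contains and where that text appears."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The returned description can then be chunked and embedded just like any other text in the document.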
I find that different types of visuals benefit from different ingestion approaches:
Images
For standard images, the straightforward approach works well:
The key here is crafting good prompts for your vision model. You want descriptions that capture not just what's in the image, but also its context and purpose within the document. Providing the vision model with some surrounding text or the image's caption can also improve the results.
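One simple way to do that is to fold the caption and nearby text into the instruction prompt itself; the function and wording below are a sketch of my own, not a template from this post:

```python
# Sketch: build a description prompt that carries document context with it.
def build_image_prompt(caption: str | None = None,
                       surrounding_text: str | None = None) -> str:
    prompt = (
        "Describe this image in detail for a document search index. "
        "Explain what it shows, any text it contains, and its purpose in the document."
    )
    if caption:
        prompt += f"\n\nImage caption: {caption}"
    if surrounding_text:
        prompt += f"\n\nText surrounding the image: {surrounding_text}"
    return prompt
```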
Here's where things get interesting. While you could use the basic image-to-text description approach, I've found a more powerful alternative: converting these elements into Mermaid format, a text-based diagram syntax that can live inside Markdown. This approach has several advantages:
Here’s a simple example of a basic flow chart represented in Mermaid format:
graph TD
A[Document Received] --> B[Extract Visual Elements]
B --> C[Process with Vision Model]
C --> D[Convert to Mermaid/Text]
D --> E[Integrate into RAG System]
When rendered, this produces a simple top-down flow chart of the ingestion pipeline described above.
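In practice, you can ask the vision model directly for Mermaid output and then tidy up its reply; the prompt wording and fence-stripping below are assumptions of mine, layered on top of the same kind of vision call shown earlier:

```python
# Sketch: prompt for Mermaid output and strip any Markdown code fences from the reply.
import re

DIAGRAM_PROMPT = (
    "This image is a flow chart or diagram. Reproduce it as Mermaid code, "
    "preserving node labels and arrows. Return only the Mermaid code. "
    "If it cannot be expressed in Mermaid, give a detailed text description instead."
)

def clean_mermaid(model_reply: str) -> str:
    """Remove surrounding code fences, if the model wrapped its answer in them."""
    return re.sub(r"^```(?:mermaid)?\s*|```\s*$", "", model_reply.strip())
```

If the model falls back to a plain description, that text is still perfectly usable as a retrieval chunk.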
For data visualizations, I recommend a hybrid approach: generate a textual description of what the chart shows and also extract the underlying data values it encodes. This ensures you're preserving both the visual appearance and the underlying data. Pie charts, histograms, and several other graph types can also be converted to Mermaid or other code-based formats, much like the approach used for flow charts and diagrams.
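As a sketch of what that hybrid record might look like (the field names, helper function, and sample values below are illustrative assumptions, not part of the original post):

```python
# Sketch: store a chart as description + recovered data + a Mermaid rendering.
from dataclasses import dataclass

@dataclass
class ChartChunk:
    description: str        # what the chart shows, from the vision model
    data: dict[str, float]  # underlying values recovered from the chart
    mermaid: str = ""       # code-based representation that can be re-rendered

def to_mermaid_pie(title: str, data: dict[str, float]) -> str:
    """Render label/value pairs as a Mermaid pie chart definition."""
    lines = [f"pie title {title}"]
    lines += [f'    "{label}" : {value}' for label, value in data.items()]
    return "\n".join(lines)

# Illustrative values only.
chunk = ChartChunk(
    description="Pie chart of cloud spend split across three providers.",
    data={"Provider A": 45.0, "Provider B": 35.0, "Provider C": 20.0},
)
chunk.mermaid = to_mermaid_pie("Cloud spend by provider", chunk.data)
```

The description and data can be embedded as text for retrieval, while the Mermaid string makes it possible to show the chart again in a generated answer.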
When implementing these approaches, keep in mind:
While adding vision model processing to your RAG pipeline does increase computational overhead and processing time, the benefits often outweigh the costs. In my experience, the improved comprehension and response quality that come from including visual information make the investment worthwhile. This is especially true in technical or educational contexts, where visuals are often used to convey complex information.
In the next post, I'll explore another powerful technique for enhancing RAG systems: document summarization. We'll look at how summaries can provide context that might otherwise be lost in the document retrieval process.
Technical Lead – Robotics & AI | France