
How to Build a State-of-the-Art (SOTA) RAG Engine?


Retrieval Augmented Generation (RAG) plays a pivotal role in enhancing the capabilities of Large Language Models (LLMs) by providing the necessary context during interactions. LLMs are inherently stateless, lacking built-in memory in their APIs: ChatGPT, as a commercial front-end application, incorporates memory, but the back-end APIs that power our application integrations do not. This is where RAG becomes crucial. It enables an LLM to search over the data you provide, effectively sourcing answers from the supplied content. For businesses, developing a state-of-the-art RAG engine is essential to ensure consistent, predictable outcomes each time they use an LLM to query their data.

To construct a leading-edge Retrieval Augmented Generation (RAG) engine, you first need to understand the controllables: the parameters you manipulate to fine-tune the engine for optimal output.


So, what exactly can you fine-tune or optimize?

| Parameter | Description | Our Recommendation |
|---|---|---|
| Data Sources | Identify and streamline the data sources. Choosing the right data source will save you a lot of time integrating the RAG engine. | PDF |
| Data Normalization | Normalize the data for vectorization. This helps ensure consistency across embedding, indexing, and vector stores. | Text |
| Chunk Sizes | Experiment with various chunk sizes and overlap parameters based on content type to improve retrieval performance. | 750 or 1,000 words |
| Embedding Models | Choose the embedding model that suits your data. | OpenAI, BGE |
| Vector DB | Choose the vector database that suits your needs. | Weaviate |
| RAG Techniques | Try various RAG techniques based on the use case. | Hybrid Fusion |
| Query Transformation | Implement query transformation. | HyDE |
| Reranking | Try various re-ranking algorithms. | Lyzr or Cohere |
| Large Language Model | Choose the LLM that suits your application. | GPT-4 Turbo |
| Prompting | Try various prompting techniques. | Chain of Thought + Few Shot |

The process begins with identifying the data sources, which vary widely in type. They can range from PDF documents, text, and videos, to audio files, images, databases, and even JSON formats, among others. Choosing the right data type to build your RAG engine is crucial, as the same data might exist in multiple formats. For instance, in the case of one of Lyzr’s customers, we observed that data stored as PDF documents, identical to that in a NoSQL database, was more amenable to processing by an LLM. The data from PDFs was easier to vectorize, store, and process, leading to more accurate results.
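
For illustration, here is a minimal sketch of the extraction step for PDFs using the open-source pypdf library; the file name is a placeholder, and Lyzr's SDKs handle this parsing internally:

```python
from pypdf import PdfReader  # pip install pypdf

# Placeholder path; any text-based PDF works here.
reader = PdfReader("annual_report.pdf")

# Concatenate the extracted text of every page into one document string.
document_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted {len(document_text)} characters from {len(reader.pages)} pages")
```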

After selecting the appropriate data source, the next step is data normalization. This process standardizes the various data types, making it easier for the embedding models to capture semantic relationships between vector embeddings, a key factor in the RAG engine's performance. In the Lyzr RAG SDK, we've streamlined data normalization and parsing, easing the workload on the embedding models.
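
As a rough illustration of what normalization can involve, here is a hypothetical helper that folds Unicode variants, strips control characters, and collapses whitespace so every source yields comparable plain text (the rules in Lyzr's SDK are more elaborate):

```python
import re
import unicodedata

def normalize_text(raw: str) -> str:
    """Normalize extracted text into a consistent plain-text form."""
    # Fold visually identical Unicode variants (e.g., ligatures) into one form.
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printable control characters left over from extraction.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Collapse runs of spaces/tabs and excessive blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```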

While these initial steps may not involve direct tuning, they are crucial foundations. Next, we will explore the actual parameters that you can adjust to enhance your RAG engine’s performance.

In refining a RAG engine, several tunable parameters are pivotal. First, you can adjust the chunk size and overlap used to segment input data before it is processed by the embedding model. The choice of embedding model is another critical decision, given the variety of models available, each suited to different needs. The choice of vector database is equally important, with options like Weaviate, PGVector, and Pinecone offering unique strengths. Weaviate, for instance, handles indexing algorithms within the database itself, a task that requires more effort with lower-level libraries like FAISS.

Query rephrasing is another adjustable parameter; techniques such as HyDE can refine queries for better results. The RAG technique itself offers flexibility, with various retrieval strategies available, hybrid fusion being a popular choice. Additionally, the re-ranker can be tuned to further improve output quality. The choice of Large Language Model (LLM) also plays a significant role, as models like OpenAI's GPT-4, Google's Gemini Ultra, or open-source models like Mixtral have varied behaviors and context-handling capabilities.

Lastly, another crucial aspect is the prompting technique. The way prompts are structured significantly influences the output quality. These various parameters provide ample scope for customization and optimization in a RAG bot.

Now let us look at each of these tunable parameters in a little more detail.


Chunking and Overlap in Chatbot Development

After data normalization, a critical step in chatbot development is chunking and establishing overlaps. This involves dividing large text documents into smaller segments, or chunks, which are then processed by an embedding model for vectorization and storage. The primary challenge is to ensure continuity between these chunks. A prevalent approach is to create chunks of approximately 1,000 words with a 20-word overlap; the overlap helps maintain contextual flow from one chunk to the next. However, it is worth experimenting with varying chunk sizes and overlaps to determine the most effective structure for a specific application.
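
A minimal word-based chunker with the sizes discussed above might look like this; counting words is a simplification, and production systems often chunk by tokens or sentence boundaries instead:

```python
def chunk_words(text: str, chunk_size: int = 1000, overlap: int = 20) -> list[str]:
    """Split text into ~chunk_size-word chunks, each sharing `overlap` words with its predecessor."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk to create the overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```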

Selection of Embedding Models

Selecting the right embedding model is crucial, as there is no universal solution that fits all needs; different models offer distinct advantages. A valuable resource for comparing them is Hugging Face's MTEB English leaderboard. From our experience, models like Voyage, BGE, and OpenAI's text-ada have proven effective in chatbot applications. It's essential to experiment with various models to identify the one that best matches your data type and sources.


Lyzr’s Enterprise SDKs support this experimentation by allowing users to select from over 20 embedding models, including Voyage, OpenAI Text-ada, Jina, BGE, and others.
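
As a sketch, here is how a batch of chunks might be embedded with the OpenAI Python SDK (v1.x, assuming an OPENAI_API_KEY in the environment); swapping in Voyage, Jina, or BGE changes mainly the client and model name:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunks = ["First chunk of normalized text...", "Second chunk..."]

# One API call can embed a whole batch of chunks.
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=chunks,
)
vectors = [item.embedding for item in response.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```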

Choosing the Right Vector Database

The choice of an appropriate vector database is pivotal in developing a top-tier chatbot. The market offers a variety of options, each with unique strengths. pgvector, for example, adds vector search capabilities to a PostgreSQL framework (Supabase offers it as a managed option), while MongoDB Atlas incorporates these capabilities into a NoSQL framework.

Specialized vector databases like Weaviate are tailored for such applications. Weaviate efficiently handles indexing algorithms, metadata management, and multi-indexing, streamlining the process. Our project, www.theYCbot.com, operates on Weaviate and showcases the benefits of an advanced database system without requiring the parameter fine-tuning discussed in this blog.

All of Lyzr’s Open-Source SDKs are compatible with Weaviate. Lyzr’s Enterprise SDKs also default to the Weaviate Vector Database, but customers have the flexibility to choose from a wide range of vector databases offered by Lyzr’s SDKs, including Pinecone, PGvector, Qdrant, Chroma, Atlas, Lance, and more.
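
To make the contrast concrete, here is what a bare vector index looks like with FAISS: fast similarity search, but index selection, metadata, and persistence are entirely your responsibility, which is exactly what Weaviate manages for you (the data here is random, as a stand-in for real embeddings):

```python
import faiss  # pip install faiss-cpu
import numpy as np

dim = 1536  # e.g., the dimension of OpenAI text-ada embeddings
index = faiss.IndexFlatIP(dim)  # exact inner-product search; no metadata, no persistence

vectors = np.random.rand(100, dim).astype("float32")  # stand-in for real chunk embeddings
faiss.normalize_L2(vectors)  # normalize so inner product equals cosine similarity
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest chunks by cosine similarity
print(ids[0], scores[0])
```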

Hypothetical Document Embeddings (HyDE)

Hypothetical Document Embeddings (HyDE) represent an innovative approach to query transformation. In this method, a Large Language Model (LLM) like GPT generates a hypothetical response to a query. The chatbot then searches for vector embeddings that closely align with this hypothetical response, harnessing the power of semantic search. This technique uses both the query and the generated response to locate the most relevant and accurate answers.
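
A minimal HyDE sketch, assuming the OpenAI Python SDK; the model names are illustrative, and the returned vector would feed the nearest-neighbor search shown earlier:

```python
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question: str) -> list[float]:
    """Embed a hypothetical answer to the question instead of the raw question."""
    # Step 1: ask the LLM to draft a plausible (possibly imperfect) answer.
    draft = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {question}",
        }],
    ).choices[0].message.content

    # Step 2: embed the draft; its vector tends to sit closer to real answers
    # in embedding space than the bare question does.
    return client.embeddings.create(
        model="text-embedding-ada-002",
        input=draft,
    ).data[0].embedding
```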

Lyzr Enterprise SDKs offer the flexibility to select your preferred query transformation method. The most effective chatbot architecture often results from experimenting with various combinations and configurations to achieve the desired performance.

Advanced Retrieval Strategies

Exploring further into retrieval strategies, there are several methods to enhance chatbot performance. LlamaIndex provides insightful analysis of different RAG techniques. For example, auto-merging retrieval combines multiple search results in a layered manner, whereas hybrid fusion search integrates keyword and vector search results, re-ranking them for greater relevance. Keep in mind that these advanced strategies may entail additional costs due to extra LLM calls.
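
Hybrid fusion is commonly implemented with reciprocal rank fusion (RRF); here is a minimal sketch, assuming you already have ranked document-ID lists from a keyword search and a vector search:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; k=60 is the conventional RRF constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Documents ranked highly by either retriever accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # e.g., from BM25
vector_hits = ["doc1", "doc5", "doc3"]   # e.g., from the vector search above
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # doc1 and doc3 rise to the top
```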


The Lyzr RAG SDK offers the ability to select from a variety of RAG techniques simply by specifying the desired method as a parameter. It can also automatically select the most suitable RAG technique for a specific chatbot, while allowing a manual override so users can specify a preferred technique for optimization tailored to the chatbot's unique requirements.

Re-rankers for Enhanced Coherence

Re-rankers are instrumental in refining search results for improved coherence. The Cohere Re-ranker, for instance, promotes search results that align closely with the user's query. Another approach is to use an LLM itself to re-rank the results, further enhancing the relevance of the final output. Lyzr SDKs let you manually select a re-ranker or rely on their built-in algorithm to choose the most appropriate re-ranking technique.
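
For illustration, here is a sketch of re-ranking fused candidates with Cohere's Python SDK; the model name and response shape may vary slightly across SDK versions:

```python
import cohere  # pip install cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

docs = ["chunk about pricing", "chunk about onboarding", "chunk about refunds"]

# Reorder the candidate chunks by relevance to the query.
response = co.rerank(
    model="rerank-english-v3.0",
    query="How do refunds work?",
    documents=docs,
    top_n=2,
)
for hit in response.results:
    print(hit.index, hit.relevance_score, docs[hit.index])
```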

Leveraging LLMs for Interpretation and Output Generation

LLMs, such as GPT-4, play a crucial role in interpreting search results and generating human-readable outputs. These models are adept at summarizing and structuring information retrieved from vector databases, converting it into coherent and pertinent responses. Fine-tuning LLMs can further customize these responses to better reflect a specific brand or industry identity.
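
The generation step can be as simple as placing the top-ranked chunks into the prompt; here is a sketch with the OpenAI SDK, where the model name and prompt wording are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def answer(question: str, top_chunks: list[str]) -> str:
    """Generate a grounded answer from the retrieved, re-ranked chunks."""
    context = "\n\n".join(top_chunks)
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```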

Lyzr Enterprise SDKs facilitate seamless integration and switching between various LLMs, including GPT-3.5, GPT-4, GPT-4 Turbo, Claude, Llama 2, Mistral 7B, Mixtral 8x7B, and over 100 open-source models accessible through Hugging Face APIs.

A special acknowledgment to the LiteLLM open-source project, which has significantly aided Lyzr SDKs in connecting seamlessly with multiple LLMs.

The Significance of Prompting in LLMs

The effectiveness of Large Language Models (LLMs) is heavily influenced by the quality of the prompts they receive. Basic one-shot prompts, similar to those used in theycbot.com, can be effective. However, experimenting with diverse prompting techniques can substantially elevate a chatbot's performance. Greg Brockman, co-founder of OpenAI, has highlighted the critical role of prompt engineering in maximizing the capabilities of GPT-4. Techniques such as Few-Shot, Chain-of-Thought, ReAct, and combinations like Chain-of-Thought (CoT) plus Few-Shot (FS) or ReAct plus Few-Shot are popular. Refer to our blog post on Prompt Engineering 101 for more.
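
As a sketch of the Chain-of-Thought plus Few-Shot combination recommended in the table above, here is a hypothetical prompt template; the examples are placeholders you would replace with domain-specific ones:

```python
# Hypothetical few-shot examples; in practice these come from your own domain.
FEW_SHOT = """\
Q: Which plan includes priority support?
Reasoning: The context lists priority support under the Enterprise plan only.
A: The Enterprise plan.

Q: Can invoices be exported as CSV?
Reasoning: The billing section mentions PDF export but not CSV.
A: No, only PDF export is mentioned.
"""

COT_FEW_SHOT_PROMPT = (
    "Answer questions about the provided context. "
    "First write out your reasoning step by step, then give the final answer.\n\n"
    + FEW_SHOT + "\n"
    "Q: {question}\n"
    "Reasoning:"
)

print(COT_FEW_SHOT_PROMPT.format(question="Does the Pro plan include SSO?"))
```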


A state-of-the-art RAG architecture need not employ all of the techniques above, but you won't know which combination works best until you have tested them yourself. The Lyzr RAG SDK lets you build the best RAG engine for your application without the complexity of integrating these tunable features or manually testing the effect of each parameter. Speak to our team today for a detailed demo.
