RAG Techniques - OpenAI API + Qdrant
Overview: You can enhance your minimal Retrieval-Augmented Generation pipeline (OpenAI API + Qdrant vector DB) by incrementally adding various RAG techniques. Below is a structured guide covering each technique, with explanations tailored to your setup, simple implementation ideas, and notes on trade-offs. The focus is on lightweight, practical approaches (no heavy orchestration frameworks or complex agents unless absolutely required).
1. Simple RAG (Basic Retrieval-Augmented Generation)
What it is: The baseline RAG architecture: embed documents into Qdrant, retrieve relevant chunks for a user query, and feed those chunks to an OpenAI completion to generate an answer.
How to apply: In your minimal setup, implement Simple RAG as follows:
- Indexing: Split your text data into chunks (e.g. by paragraphs or fixed tokens). Compute embeddings (e.g. using OpenAI’s embedding API) and store them in Qdrant with the chunk text as payload.
- Querying: For a new question, compute the query’s embedding and search Qdrant for nearest chunks (top k).
- Answer Generation: Concatenate the retrieved chunks into a prompt (with perhaps a prefix like "Context:" and then "Question:"). Call the OpenAI API (completion) to generate an answer using this context.
Example (pseudo-code):
# Indexing (run once):
for doc in documents:
    chunks = split_document(doc, chunk_size=200)
    for chunk in chunks:
        vector = openai_embed(chunk.text)
        qdrant.upsert(vector=vector, payload={"text": chunk.text, "source": doc.id})

# Query-time:
query_vec = openai_embed(user_query)
results = qdrant.search(vector=query_vec, top=5)  # retrieve top 5 similar chunks
context = " ".join([res.payload["text"] for res in results])
prompt = f"Answer the question using given context.\nContext: {context}\nQuestion: {user_query}\nAnswer:"
answer = openai_complete(prompt)
Trade-offs: This simple approach is easy to implement but the answers are only as good as the retrieved text. It may sometimes return irrelevant chunks if the query or chunks aren’t a close match. Also, the OpenAI model may include unsupported information if retrieval fails. Nevertheless, this is the foundation for all other enhancements.
2. Simple RAG Using a CSV File (Structured Data)
What it is: A variation of basic RAG where source data is tabular (CSV). Instead of unstructured documents, your knowledge comes from a table or CSV file.
How to apply: Even with CSV data, the process is similar: you’ll treat each row or cell description as a chunk of text to embed and store in Qdrant. For example, if the CSV contains facts (rows with columns), you might convert each row into a descriptive sentence before embedding. At query time, search those embeddings for relevant rows and include them in the context.
- Indexing CSV rows: Read the CSV, and for each row, create a textual representation (e.g., "Country: France, Capital: Paris, Population: 67 million."). Embed that text and store in Qdrant (payload can include the original row or an ID reference).
- Retrieval and Answering: Query as usual by embedding the question and searching. The retrieved payload gives you relevant rows which you then present to the OpenAI model in the prompt. The model can formulate an answer by reading the structured info.
Example: If the question is “What is the capital of France?”, the system will find the row about France and supply “Country: France, Capital: Paris, Population: 67 million.” as context, enabling the model to answer “Paris.”
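A minimal sketch of the indexing step, reusing the hypothetical openai_embed and qdrant helpers from the pseudo-code above (the countries.csv filename is just an illustration):
import csv

# Turn each CSV row into a descriptive sentence, then embed and store it like any other chunk
with open("countries.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        text = ", ".join(f"{col}: {val}" for col, val in row.items())  # e.g. "Country: France, Capital: Paris, ..."
        qdrant.upsert(vector=openai_embed(text), payload={"text": text, "row": i})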
Trade-offs: Using structured data is straightforward, but make sure the text representation is clear. The OpenAI model might not reason about tables unless the prompt format is well-crafted. In a minimal setup, avoid complex SQL or table parsers – simply embedding text works, though it may lose some structure-specific nuance.
3. Reliable RAG (Validated Retrieval and Answering)
What it is: An improvement over Simple RAG that adds validation and refinement steps to ensure the retrieved info is actually relevant and used correctly. The goal is to increase trustworthiness and accuracy of the answers.
How to apply: In practice, you can implement reliability checks in two places:
- Post-retrieval validation: After getting top chunks from Qdrant, verify their relevancy before using them. For example, you can use a simple heuristic (does the chunk contain some query keywords?) or an LLM-based check. A lightweight approach: ask the OpenAI model (in a separate prompt) whether a chunk contains information that answers the question. Only keep chunks where the model responds positively. This ensures unrelated text doesn’t confuse the final answer.
- Answer validation: After generating an answer, you can have the model double-check if the answer is fully supported by the retrieved context. For instance, prompt the model: “Given the context, is the answer fully supported? If not, revise it.” This uses the model to self-validate and correct any unsupported assertions.
Additionally, the original technique suggests highlighting the segment of the documents used for answering. In a minimal setup, this could mean instructing the model to quote or reference the exact source text when answering. For example, the prompt can encourage answers like: “Paris is the capital of France (as stated in the context).” This makes it clear which part of the context justified the answer.
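As a minimal sketch of the post-retrieval check (the prompt wording is illustrative, and openai_complete/results follow the pseudo-code style from Simple RAG above):
def is_relevant(chunk_text, question):
    # Ask the model for a strict yes/no relevance judgment on a single chunk
    prompt = (f"Question: {question}\nChunk: {chunk_text}\n"
              "Does this chunk contain information that helps answer the question? Answer yes or no.")
    return openai_complete(prompt).strip().lower().startswith("yes")

validated = [r for r in results if is_relevant(r.payload["text"], user_query)]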
Trade-offs: Validation steps require extra OpenAI calls or logic, which can increase latency and cost. Simple heuristics might reject useful information if they’re too strict, or let junk through if too lax. LLM-based checks improve accuracy but double the API usage. However, these steps greatly improve answer correctness and user trust by reducing hallucinations.
4. Choosing the Right Chunk Size
What it is: Tuning how you split documents into chunks. Chunks that are too large may dilute focus or exceed token limits; too small and they might lose context. This technique is about finding a balanced fixed chunk size.
How to apply: In your Qdrant indexing, experiment with different chunk sizes:
- Try splitting source text into smaller vs. larger pieces (e.g. 100 tokens vs 500 tokens) and see which yields better retrieval results.
- If your documents have natural sections (paragraphs, bullet points), you might use those as chunks. Otherwise, decide on a token or character length that captures a complete thought.
A pragmatic approach:
def split_document(text, size=200):
    # Naive splitter: group words into chunks of roughly `size` tokens (using words as a rough token proxy)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
Test your pipeline’s accuracy with a few queries under various chunk sizes. If the answer quality improves when using, say, 300-token chunks instead of 100-token chunks, stick with that for your data.
Trade-offs: Larger chunks carry more context (good for comprehensive answers) but each chunk embedding might match a query on irrelevant parts, leading to lower precision. Smaller chunks improve retrieval precision (each chunk is more specific) but you might need to retrieve more of them to cover the answer’s context. Also, very small chunks (like single sentences) may cause the model to lose broader context unless you retrieve many at once. Finding the “sweet spot” is usually an empirical process.
5. Proposition Chunking (Finer-Grained Chunks via LLM)
What it is: An advanced chunking method where each document is broken into factual propositions – concise, standalone statements – using an LLM. This yields a highly granular knowledge base.
How to apply: In a minimal pipeline, you can implement proposition chunking in two phases:
- Generation of propositions (index-time): Use the OpenAI API to rephrase or summarize each chunk into a set of factual statements (see the sketch after this list). For example, if a chunk says “France’s capital is Paris and it has 67 million people,” the LLM could produce propositions like “France’s capital is Paris.” and “France has a population of 67 million.” Each proposition is a self-contained fact. Store these generated propositions in Qdrant as separate entries (with metadata linking back to the source chunk/document).
- Quality check (optional): If you want to ensure accuracy, you might have the LLM or some logic verify each proposition for correctness. In a lightweight approach, you could skip this or just manually spot-check, but an automated check could be done by asking the model to judge if the proposition is fully supported by the original text (though that becomes heavy).
- Query-time retrieval: When a query comes in, embed it and search the proposition vectors. Because propositions are very targeted, the matches are likely to be directly relevant facts. You’d retrieve those facts and present them to the OpenAI model to answer the question.
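A minimal sketch of the index-time generation step (the prompt wording and the chunk fields are illustrative, and the helpers follow the earlier pseudo-code):
def generate_propositions(chunk_text):
    # Ask the model for standalone factual statements, one per line
    prompt = ("Rewrite the following text as a list of short, self-contained factual statements, "
              f"one per line:\n{chunk_text}")
    return [line.strip("- ").strip() for line in openai_complete(prompt).splitlines() if line.strip()]

for prop in generate_propositions(chunk.text):
    qdrant.upsert(vector=openai_embed(prop), payload={"text": prop, "source_chunk": chunk.id})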
Example: If the user asks “What is the capital of France and its population?”, the retrieval might return the two propositions above (“France’s capital is Paris.” and “France has 67 million people.”). The final prompt to the model would include these, making it straightforward for GPT to answer with both pieces of information.
Trade-offs: This method can dramatically improve precision – the retrieved texts are pinpoint facts, reducing noise. However, it requires significant upfront work and OpenAI calls to generate the propositions. It also expands the index size (many more, smaller entries in Qdrant). There’s a risk of the LLM introducing errors during proposition generation, so a quality check step is important but adds complexity. In a prototyping context, you might generate propositions for a small subset to see if it boosts performance before scaling up.
6. Query Transformations (Improving User Queries)
What it is: Enhancing or reformulating the user’s question to improve retrieval results. Often user queries are short or ambiguous – transforming them can yield better matches in Qdrant.
How to apply: There are a few lightweight strategies you can implement with OpenAI’s help:
- Query rewriting: Use the OpenAI model to rephrase the query in a more explicit or detailed way. For example, prompt: “Rewrite the query to be more specific for searching the knowledge base: {user_query}.” Then embed the rewritten query for the vector search. This can fix poorly worded questions or add context (e.g., turning “capital France?” into “What is the capital city of France?”).
- Step-back prompting (broadening the query): If a query is very narrow, you can ask the model to generate a slightly broader question that encompasses the original. For instance, from “Python list comprehension speed”, broaden to “How does the speed of list comprehensions in Python compare to other methods?”. This might retrieve more context, especially if the exact keywords aren’t present in documents.
- Sub-query decomposition: For complex questions containing multiple parts, break the query into simpler sub-questions. You can do this by prompting the model, e.g.: “Break this question into simpler sub-questions: {user_query}.” Then retrieve answers for each sub-part and combine them. In a minimal setup, you could just manually identify conjunctions or multiple clauses in the question and split them.
After transforming the query using one or more of these methods, perform the usual retrieval with Qdrant. You can even retrieve with both the original and transformed query and merge the results.
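A minimal sketch of the query-rewriting variant, searching with both the original and rewritten query and merging by result id (assuming results expose an id, as Qdrant’s client does; prompt wording is illustrative):
rewrite_prompt = ("Rewrite this query to be more specific for searching a knowledge base. "
                  f"Return only the rewritten query.\nQuery: {user_query}")
rewritten_query = openai_complete(rewrite_prompt).strip()

# Search with both the original and the rewritten query, then de-duplicate by result id
hits = qdrant.search(vector=openai_embed(user_query), top=5)
hits += qdrant.search(vector=openai_embed(rewritten_query), top=5)
merged = list({r.id: r for r in hits}.values())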
Trade-offs: Query transformations add extra OpenAI calls per user question, which increases latency and cost. There’s also a risk that the transformed query drifts from the user’s intent if the model goes astray. Keep the transformations moderate (you can review the rewritten query before using it). Despite the overhead, this technique can significantly improve retrieval recall, ensuring that relevant info isn’t missed due to wording. It’s quite practical to prototype – you can log original vs. transformed queries and see which yields better answers.
7. Hypothetical Questions (HyDE Approach)
What it is: The HyDE technique (Hypothetical Document Embeddings) involves generating a hypothetical question or answer that could be found in your data, and using its embedding for retrieval. In essence, you use the LLM to create a bridge between the query and the vector store.
How to apply: For each user query, do the following:
- Generate a hypothetical answer or related question: Prompt the OpenAI model with the user’s query to produce a hypothetical document or Q&A pair. For example, “Imagine the answer to this question. Write a brief answer:”. If the question is “What are the health benefits of green tea?”, the model might generate a paragraph about green tea’s benefits (even if it’s hallucinated from general knowledge).
- Embed the hypothetical text: Treat this generated text as a surrogate for the query. Compute its embedding with OpenAI and query Qdrant. The idea is that this embedding may capture the topic in a way closer to how your documents are written, improving retrieval alignment.
- Retrieve and answer: Get relevant chunks from Qdrant using that embedding. Then proceed to answer generation as usual (you can still provide the original query and the retrieved chunks to the final prompt).
Example (pseudo-code):
# HyDE retrieval
prompt = f"Provide a short hypothetical answer to: '{user_query}'"
hypo_answer = openai_complete(prompt)
embedding = openai_embed(hypo_answer)
results = qdrant.search(vector=embedding, top=5)
...
# then use results in final answer prompt
Trade-offs: HyDE can greatly improve retrieval if the original query is sparse – the LLM’s guess often contains synonyms or related context that yield better vector matches. However, it doubles the number of OpenAI calls per query (one to generate the hypothetical answer, one for the final answer), increasing cost and response time. There’s also a chance the LLM’s hypothetical answer focuses on aspects your data doesn’t cover, potentially skewing retrieval. To mitigate this, you might also include the original query embedding in the search or do two searches (one with query, one with hypothetical) and merge results for safety. In a minimal prototype, HyDE is relatively easy to try and often provides a quick boost to retrieval quality for difficult queries.
8. Hypothetical Prompt Embeddings (HyPE)
What it is: HyPE is like a “pre-computed HyDE.” It transforms retrieval into a question-question matching task by generating hypothetical queries at indexing time. Rather than embedding document text directly, you embed questions that the document could answer.
How to apply: This requires an offline preprocessing step:
- Index-time (off-line) augmentation: For each chunk of your documents, use the OpenAI model to generate several possible questions that chunk can answer. For example, if a chunk is about green tea benefits, generate questions like “What are the health benefits of green tea?”, “Does green tea improve metabolism?”, etc. Each question is embedded and stored in Qdrant instead of (or in addition to) the original chunk text. You would store the vector for the question, and in the payload keep a reference to the chunk of text that prompted it.
- Query-time retrieval: Now, when a user query comes in, simply embed the query and search Qdrant. Since Qdrant is now populated with “hypothetical query” vectors, it will return the stored questions most similar to the user’s query. You then fetch the associated text chunks (from metadata) and provide them to the LLM for answering. Essentially, you’re matching the user’s question to previously generated question embeddings, which often aligns better than direct text matching.
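A minimal sketch of both phases, following the same pseudo-helpers as before (prompt wording and payload fields are illustrative):
# Index-time: embed generated questions, keeping the source chunk text in the payload
def generate_questions(chunk_text, n=3):
    prompt = f"Write {n} different questions that the following text answers, one per line:\n{chunk_text}"
    return [q.strip() for q in openai_complete(prompt).splitlines() if q.strip()]

for chunk in chunks:
    for q in generate_questions(chunk.text):
        qdrant.upsert(vector=openai_embed(q), payload={"question": q, "text": chunk.text})

# Query-time: a plain vector search now matches the user's question against stored questions
results = qdrant.search(vector=openai_embed(user_query), top=5)
context = " ".join(r.payload["text"] for r in results)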
Why it helps: This avoids generating a hypothetical answer at runtime (like HyDE does). Retrieval becomes faster and cheaper because it’s just one vector search – you already did the heavy LLM work during indexing. According to the repository, HyPE can significantly boost precision and recall (improving context precision by up to 42 points, and recall by up to 45 points) without runtime overhead.
Trade-offs: The trade-off is the upfront cost and complexity:
- Indexing cost: You must call the OpenAI API many times (for each chunk, multiple questions). This could be expensive and time-consuming for large corpora.
- Storage: You’ll store multiple vectors per original chunk, which increases Qdrant index size.
- Maintenance: If your data updates, you need to regenerate questions for new chunks.
However, once set up, HyPE makes queries super-efficient (no need for on-the-fly query expansion). For a prototype, you might try HyPE on a small subset to gauge improvement. It’s more complex to implement than HyDE, but it pays off in query performance if your use case sees repeated queries on a static dataset.
9. Contextual Chunk Headers
What it is: A technique to enrich each chunk with additional context by prepending a brief header before embedding. The header provides document-level or section-level context that might otherwise be lost when chunking.
How to apply: When preparing chunks for Qdrant:
- Determine some high-level context for the chunk. This could be the document title, section heading, or a summary of the preceding content.
- Prepend this header text to the chunk itself, separated by a delimiter. For example: “Document: Health Benefits of Tea — Chunk: Green tea contains antioxidants...”
- Compute the embedding on this combined text (header + chunk) and store as usual. The header ensures the chunk’s vector carries broader context, which can improve retrieval accuracy if the query matches something more general.
In practice, if your documents have structure (like headings or filenames), include those. If not, you might generate a one-sentence summary for each chunk using the LLM (briefly, offline) and use that as a header.
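A minimal sketch, assuming you have some metadata like a document title and section heading available (both names here are placeholders):
# Prepend a short header so the chunk's embedding carries document-level context
header = f"Document: {doc.title} - Section: {section_heading}"  # placeholder metadata fields
combined = f"{header}\n{chunk.text}"
qdrant.upsert(vector=openai_embed(combined), payload={"text": chunk.text, "header": header})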
Why it helps: Imagine a query asks, “In which article is metabolism discussed?” Without headers, a chunk about “metabolism boost” might not match because it lacks the article title. With a header that includes the article name, the embedding might capture that context and retrieve correctly. This method is used in some RAG systems to boost accuracy by giving embeddings more semantic signal.
Trade-offs: The implementation is straightforward and low-cost (especially if using existing titles or headings). The main caution is to keep headers concise – if the header is too large relative to the chunk, it could dominate the embedding and make unrelated chunks seem similar (e.g., if too many chunks share the same header, their vectors might cluster). In a minimal pipeline, this is an easy win if you have meaningful metadata for chunks (like source titles or section names).
10. Relevant Segment Extraction (RSE)
What it is: Instead of treating each chunk independently, RSE looks at the bigger picture: it tries to stitch together adjacent chunks from the original document that are all relevant to the query. This yields a longer, coherent text segment to provide to the LLM.
How to apply: You can implement RSE as a post-processing step after your initial retrieval:
- Perform the usual vector search in Qdrant with the query embedding to get top-N chunks.
- Analyze those top results for continuity. If multiple top chunks come from the same document and are near each other (e.g., chunk 5 and chunk 6 of the same doc), consider merging them into one larger segment. You might even check chunk 4 or 7 (neighbors) if they weren’t in top N, to see if adding a neighbor chunk yields a more complete passage.
- The goal is to produce one or a few consolidated text segments that cover the answer more completely than disjoint chunks. You then feed these larger segments to the OpenAI model.
Implementation approach: Maintain metadata in Qdrant for chunk indices and document IDs. For example, if a retrieved chunk has doc_id = D1 and chunk_index = 5, you can look up in your original data whether chunk_index 6 (or 4) of D1 is also relevant. A simple heuristic: if chunk 6 was also retrieved in the top N, definitely merge it with 5. If not, but you suspect the answer might span a boundary, you could still fetch it. After merging, you might end up with, say, a full paragraph or section from the doc, which provides better context to answer the query.
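A minimal sketch of the grouping-and-merging step (it assumes each payload carries doc_id and chunk_index, and a hypothetical chunk_text(doc_id, i) helper that fetches the original chunk from your store):
from collections import defaultdict

# Group retrieved hits by document, then merge runs of consecutive chunk indices
by_doc = defaultdict(list)
for r in results:
    by_doc[r.payload["doc_id"]].append(r.payload["chunk_index"])

segments = []
for doc_id, indices in by_doc.items():
    indices.sort()
    start = prev = indices[0]
    for idx in indices[1:] + [None]:          # None acts as a sentinel to flush the last run
        if idx is not None and idx == prev + 1:
            prev = idx
            continue
        segments.append(" ".join(chunk_text(doc_id, i) for i in range(start, prev + 1)))
        if idx is not None:
            start = prev = idx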
Trade-offs: By merging chunks, you reduce fragmentation of context (the LLM sees a fuller picture). This is especially useful if the answer is split across two chunks. However, be cautious:
- If you merge chunks that are not all relevant, you risk adding fluff that could distract the model.
- You must ensure the merged segment doesn’t exceed token limits for the OpenAI model.
- Implementation is a bit more involved (you need to track and fetch neighboring chunks from storage).
In a lightweight prototype, you can approximate RSE by always retrieving slightly more chunks from Qdrant (say top 10) and then grouping those by document before prompting. This way, if a document has many relevant pieces, you capture them. RSE adds complexity, but it can boost the completeness of answers by providing more contiguous context where needed.
11. Context Enrichment (Neighboring Sentences)
What it is: This technique enriches a retrieved snippet by including its immediate context – typically the sentences before and after – to give the LLM more understanding. Essentially, even if you retrieve one highly relevant sentence, you also supply its neighbors.
How to apply: If your chunks or indexing is at the sentence level (or very small chunks):
- When you get a top result (say a single sentence or a very short chunk), go back to the original document/source and pull in a bit of text immediately around that chunk. For example, include the preceding sentence and the following sentence.
- Concatenate these (original sentence + neighbors) as one context passage for answering.
If your chunks are larger (paragraph-sized), you might already have sufficient context and might not need this. But if you experimented with very fine chunking (like proposition or sentence-level), this ensures the model isn’t confused by lack of context.
Implementation: In your metadata, keep the original text or an index that allows you to retrieve neighbors. For instance, store each chunk with a sentence_index and have access to the list of sentences. Then:
best = top_result.payload["text"]   # e.g., the central sentence
idx = top_result.payload["sentence_index"]
doc = documents[top_result.payload["doc_id"]]
before = doc.sentences[idx - 1] if idx > 0 else ""
after = doc.sentences[idx + 1] if idx + 1 < len(doc.sentences) else ""
context_snippet = " ".join(s for s in (before, best, after) if s)
You would do this for each top result or at least the top 1-2 results to give more coherent context.
Trade-offs: This is relatively simple and costs nothing extra from OpenAI (you’re just fetching a bit more text from your store). It can prevent the model from misinterpreting a snippet. On the downside, if the neighboring text is lengthy or irrelevant, it could introduce noise. The key is to balance context – one or two extra sentences are usually enough. In summary, context enrichment is a low-effort tweak to improve answer quality when using very small retrieval units.
12. Semantic Chunking (Chunk by Meaningful Sections)
What it is: Rather than splitting text by arbitrary length, semantic chunking means dividing documents into semantically coherent units. For example, splitting at topic or subtopic boundaries, so each chunk is about one concept.
How to apply: In practice:
- Use document structure: If your documents have headings, chapters, or paragraph breaks that correspond to topic shifts, use those as chunk boundaries. For instance, each FAQ entry could be a chunk, or each section under a heading in a Markdown file.
- NLP-based splitting: If structure isn’t obvious, you can use algorithms to find topic shifts. Simple approach: take a sliding window over the text and measure similarity between adjacent sentences; split when similarity drops (indicating a new topic). More advanced: use an open-source model or library to cluster sentences by topic.
- LLM-assisted: In a minimal way, you could even prompt the LLM to outline the document into sections. For example, “Divide this text into sections by topic:” and use the suggested sections as chunks.
Once determined, embed these semantically meaningful chunks into Qdrant instead of uniform blocks. The retrieval then works on concept-level chunks.
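As a minimal sketch of the similarity-drop idea (the 0.75 threshold is arbitrary, and embedding every sentence costs one API call each, so this is only practical for modest amounts of text):
import numpy as np

def semantic_split(sentences, threshold=0.75):
    # Start a new chunk whenever two adjacent sentences fall below a similarity threshold
    vecs = [np.array(openai_embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = vecs[i] @ vecs[i - 1] / (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[i - 1]))
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks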
Advantages: Because each chunk is internally coherent, any retrieval hit is likely to be a good match to the query’s intent (less chance of a chunk containing mixed topics where only half is relevant). It can improve both precision and recall, as the query will match whole relevant sections.
Trade-offs: Implementing semantic chunking might require additional steps or tools. If your corpus is small or well-structured, the benefit might be marginal. Also, chunks can become quite large if a section is long, which might stress token limits. In those cases, you might need to further break down very large sections (perhaps recursively). For prototyping, use simple heuristics (like splitting by existing headings or double newlines) – this often yields better segments than a naive fixed-size split with very little extra effort.
13. Contextual Compression (Summarize Retrieved Text)
What it is: Summarizing or compressing the retrieved context before feeding it to the generative model. The idea is to preserve key information while staying within token limits and removing fluff.
How to apply: After you retrieve top chunks from Qdrant:
- Use the OpenAI API to compress them. For example, you can prompt: “Summarize the following text, focusing only on information relevant to answering the question: {question}. Text: {chunk}”. This will produce a shorter version of each chunk that ideally retains the important details related to the query.
- You can do this for each retrieved chunk individually, or combine all retrieved text and summarize it in one go (if the combined length isn’t too large).
- Then take these summaries as the context you feed into the final answer prompt. The final prompt might be like: “Using the following summarized context, answer the question... [summaries] ... Question: X”.
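A minimal sketch, compressing each retrieved chunk separately (prompt wording is illustrative; helpers as in the earlier pseudo-code):
def compress(chunk_text, question):
    # Keep only the facts that matter for this question
    prompt = ("Summarize the following text, keeping only information relevant to the question. "
              f"Be factual and concise.\nQuestion: {question}\nText: {chunk_text}")
    return openai_complete(prompt).strip()

summaries = [compress(r.payload["text"], user_query) for r in results]
final_prompt = ("Using the following summarized context, answer the question.\n"
                f"Context: {' '.join(summaries)}\nQuestion: {user_query}\nAnswer:")
answer = openai_complete(final_prompt)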
Benefits: This approach can significantly reduce token usage. If you have, say, 3000 tokens of retrieved text but can compress it to 1000 tokens of relevant summary, you save cost and make it easier for the model to focus on pertinent info. It’s especially useful if your retrieval grabs some extraneous sentences around the answer – the summarization can strip those out.
Trade-offs: Compressing context means you’re relying on the LLM to not drop any critical information. There is a risk that a summary might omit a subtle fact needed for the answer or even misrepresent something (though if you instruct it to be factual and focused, it usually does well). It also adds an extra OpenAI call (or a few, if summarizing chunks separately) per query, which affects latency and cost. For a prototype, it might be worth implementing this if you frequently hit context length limits or notice the model gets distracted by irrelevant details. Otherwise, you might skip it initially and only introduce if needed for scale.
14. Document Augmentation via Question Generation
What it is: Augmenting your document set by generating questions (and potentially answers) for each document or chunk, then indexing those questions similar to HyPE. This increases the ways relevant info can be retrieved.
How to apply: This technique is closely related to HyPE (#8) but can be done in a simpler way:
- For each chunk of text in your documents, ask the OpenAI model to generate a list of questions that the chunk can answer. Prompt example: “Generate 3 different questions that could be answered by the following text: {chunk}”. Collect these questions.
- Insert each generated question into Qdrant as a new entry. The embedding of the question becomes a key to retrieve the original chunk. You’ll want to store metadata so you know which chunk (or document) the question came from (similar to HyPE’s approach).
- Optionally, you could also generate the answers to those questions (since the answer is basically the chunk text or a summary of it). Having a Q&A pair could help if you want to do some custom ranking, but for retrieval, the question alone is often enough.
At query time, you embed the user’s question and search against this augmented index. Because you’ve vastly increased the “query space” your index covers, you have a better chance of hitting relevant content even if there’s not a direct textual match in the original docs.
Trade-offs: This augmentation can improve recall significantly – obscure pieces of knowledge in the documents become reachable via the questions you generated. However:
- It shares similar costs with HyPE: many OpenAI calls up front to generate questions, and a larger index.
- You might generate questions that are too generic or not useful. To mitigate that, guide the prompt to produce specific, content-linked questions.
- There’s redundancy: multiple chunks might yield similar questions, leading to duplicate entries in the vector DB (consider deduplicating identical questions to save space).
For a minimal pipeline, you could try this on a small scale (e.g., one question per chunk or only on key parts of your data) and see if question-based retrieval helps. It’s a bit heavy, but it’s conceptually straightforward and doesn’t require changes to your query workflow (just changes what you index).
15. Fusion Retrieval (Combining Multiple Search Methods)
What it is: Instead of relying solely on vector similarity, fusion retrieval mixes different retrieval strategies (like keyword search + vector search) to get more robust results.
How to apply: In your context, you have Qdrant for vector search. You can introduce a basic keyword or traditional search and then fuse the results:
- Keyword search: For a minimal approach, you might not have a full-text index like ElasticSearch, but you can do a simple keyword match in your text corpus. For example, filter or score chunks by how many query terms they contain (you can maintain a separate inverted index or just do a brute-force scan of top vector hits for keywords).
- Run both the vector search and the keyword-based search. You’ll get two sets of candidate chunks.
- Fusion: Merge the candidates. You can take the union of both sets and then rank them. Ranking could be as simple as: if a chunk appears in both sets, boost it higher. Or use a weighted sum of normalized vector similarity and keyword match score.
- Another approach is to do a hybrid query inside Qdrant itself: Qdrant supports full-text match conditions on payload fields, and newer versions add sparse-vector hybrid search. If you’d rather not depend on those features, the manual fusion described above works fine.
Example: If the query is “apple health benefits”, a keyword search might find chunks containing “apple” and “health” explicitly. Vector search might retrieve a chunk about fruit nutrition that doesn’t explicitly mention “health benefits”. By fusing, you ensure that if either method finds something relevant, it’s not missed.
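A minimal sketch that scans the top vector hits for query terms and blends the two scores (the 0.7/0.3 weights are arbitrary, and hits are assumed to expose a .score like Qdrant results do):
def keyword_score(text, query):
    # Fraction of query terms that literally appear in the chunk
    terms = query.lower().split()
    return sum(term in text.lower() for term in terms) / len(terms)

vector_hits = qdrant.search(vector=openai_embed(user_query), top=20)
fused = sorted(
    vector_hits,
    key=lambda r: 0.7 * r.score + 0.3 * keyword_score(r.payload["text"], user_query),
    reverse=True,
)[:5]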
Trade-offs: Fusion can improve recall (catching results pure vector search might miss due to phrasing) and precision (keyword ensures exact terms appear, reducing weird off-target embeddings). The complexity is slightly higher: you need to implement a basic search or use an existing one. If the corpus is small, a simple Python search over text could suffice. If it’s large, you may need a lightweight search library. Also consider that keyword search won’t work well if queries use synonyms not present in text – that’s where vectors shine. So fusion aims to cover each other’s blind spots. It’s quite feasible to implement in a prototype without heavy frameworks by just writing a few scoring functions.
16. Intelligent Reranking
What it is: Once you have a list of retrieved documents (from any method), reranking means re-ordering them by more sophisticated relevance criteria than the raw similarity score. “Intelligent” reranking uses advanced models (like LLMs or cross-encoders) to score relevance.
How to apply: After you get, say, the top 10 results from Qdrant (which are ranked by vector similarity), you can apply one or more reranking strategies:
- LLM-based scoring: Use the OpenAI model to directly judge relevance. For each retrieved chunk, you could pose a prompt: “Question: {query}\nDocument: {chunk}\nIs this document helpful for answering the question? Respond with a score 1-5.” Parse the model’s output as a relevance score and then sort chunks by this score. This effectively uses GPT-4/3.5 as a custom relevance evaluator, potentially capturing nuances that vector similarity missed.
- Cross-encoder model: If you want something faster/cheaper and are open to using a smaller model, you could use a cross-encoder (like a mini BERT-based model) that takes a (query, chunk) pair and outputs a relevance score. This would require adding an extra model to your pipeline (not using the OpenAI API), which might conflict with your “no heavy frameworks” constraint. But if it’s a small self-contained model, it could be acceptable.
- Metadata or heuristic reranking: You can also incorporate domain knowledge. For example, if your chunks have a freshness timestamp and the query asks for current info, you might boost more recent chunks. Or if one source is generally higher quality, boost those. This is more manual but can layer on top of similarity score.
After computing new scores, sort the results accordingly and then use that sorted list for answering. Reranking helps ensure the best answer sources are considered first by the LLM.
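A minimal sketch of the LLM-based variant (prompt wording and the 1-5 scale are illustrative; expect one extra API call per scored chunk):
def llm_relevance(chunk_text, question):
    prompt = (f"Question: {question}\nDocument: {chunk_text}\n"
              "Rate how helpful this document is for answering the question, from 1 (useless) to 5 (essential). "
              "Reply with only the number.")
    reply = openai_complete(prompt).strip()
    return int(reply[0]) if reply[:1].isdigit() else 1  # treat unparseable replies as low relevance

reranked = sorted(results, key=lambda r: llm_relevance(r.payload["text"], user_query), reverse=True)[:5]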
Trade-offs: LLM-based reranking (using OpenAI) will significantly increase cost – essentially you’re making an extra API call for each retrieved chunk you want to score. For instance, scoring 10 chunks with GPT-4 could be expensive and slow. You might mitigate by only reranking the top 5 instead of top 10, or using a cheaper model (GPT-3.5) for scoring. Also, model scoring can introduce its own errors (the model might erroneously dismiss a relevant chunk if it doesn’t see obvious connections). Despite the cost, reranking can boost answer quality by feeding the model the most relevant info first. If budget is a concern, consider simpler rerankers (like a small BERT or even a TF-IDF overlap measure) as a proxy. This technique is more on the advanced side for a prototype, but you can experiment with a small subset to see if it improves results (e.g., compare answers with and without reranking to decide if it’s worth it).
17. Multi-faceted Filtering
What it is: Applying additional filters to retrieval results beyond just similarity, to ensure they meet certain criteria. “Facets” could be metadata (like date, author), content properties, diversity, etc.
How to apply: Within Qdrant or in post-processing, you can enforce various filters:
- Metadata filtering: If your data has attributes, you can use Qdrant’s filtering in the search query. For example, only search within documents of a certain type or date range if the query implies that (e.g., filter out older entries if the question asks for the “latest information”). Qdrant supports filter conditions on payload fields, such as a range condition that only matches entries with year greater than 2020 (see the sketch after this list).
- Score threshold: Only accept results above a certain similarity score. If the top results have low scores (meaning the query is not similar to anything in the DB above a threshold), you might decide none are reliable and handle that (maybe fall back to a different method or say you don’t know). This avoids presenting barely relevant info to the LLM.
- Content filtering: If you want to ensure certain content standards, e.g., remove any retrieved text that doesn’t contain a needed keyword or that contains disallowed content. For instance, if the query is about a specific person, you might filter out chunks that don’t mention that person’s name at all, even if they came up via embedding.
- Diversity filtering: If your top results are almost duplicates (common when the same fact is present in multiple docs), you can choose to only take one of them to avoid redundant context. You might detect duplicates by comparing text or by seeing if multiple results come from the same document – then keep one or two and drop the rest.
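A minimal sketch using the qdrant-client Python package directly (the URL, collection name, and the 0.3 score threshold are placeholders to adapt to your setup):
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")   # placeholder: your Qdrant instance

year_filter = models.Filter(
    must=[models.FieldCondition(key="year", range=models.Range(gt=2020))]
)
hits = client.search(
    collection_name="docs",                          # placeholder collection name
    query_vector=openai_embed(user_query),
    query_filter=year_filter,
    limit=10,
    score_threshold=0.3,                             # drop barely similar results; tune for your embeddings
)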
Trade-offs: Filters can greatly refine result quality, but they require some domain knowledge or assumptions. If filters are too strict, you might remove genuinely useful info. For example, setting a high similarity threshold might cause the system to return “no answer” for a valid query that just didn’t embed well. Metadata filters depend on having that metadata; they add complexity in managing data labeling. In a minimal setup, you can implement the simplest ones easily (score threshold and basic diversity check). Qdrant’s built-in filtering by payload is also easy to use and doesn’t add overhead on your side. This technique is more about precision control – ensuring the LLM sees only pertinent, high-quality context. Use it judiciously so as not to accidentally filter out the answer itself.
18. Hierarchical Indices (Two-Tier Retrieval)
What it is: Using a multi-level index structure, often two-tiered: a high-level index to choose relevant sections or documents, and a lower-level index to get specific chunks. It’s like first finding the right book, then the right page.
How to apply: Set up two Qdrant (or vector) collections or use two different embedding strategies:
- Tier 1 – Document-level: Create an embedding for each whole document or a summary of each document. You could, for example, take each document’s title + intro paragraph, embed that, and store with doc ID. This index helps you retrieve which documents are likely relevant.
- Tier 2 – Chunk-level: This is your existing chunk index (the one we’ve been discussing so far with all chunks).
The retrieval process becomes:
- Use the query embedding to search the document-level index first. Get the top M documents that are likely relevant.
- For those top docs, fetch their chunk IDs and search within those only (you can either search the chunk index with a filter by doc_id, or simply retrieve all chunks of those docs and rank by similarity to the query).
- Now you have chunks that are not only similar in content but also come from the most relevant documents.
Example: If you have 1000 documents with 100k total chunks, searching all 100k chunks might return 3 chunks from document #20, 1 chunk from document #305, etc. With hierarchical approach, you first find, say, top 3 documents (doc #20, #21, #305), then only search within those documents’ chunks. This can surface slightly lower-similarity chunks from the right documents that a global search might have missed because they were outranked by chunks from less relevant docs.
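A minimal sketch of the two-tier flow, reusing the qdrant-client imports from the filtering sketch above (the doc_summaries and chunks collection names are placeholders):
# Tier 1: find the most relevant documents via their summary embeddings
doc_hits = client.search(collection_name="doc_summaries", query_vector=query_vec, limit=3)
doc_ids = [hit.payload["doc_id"] for hit in doc_hits]

# Tier 2: search chunks, restricted to those documents
chunk_hits = client.search(
    collection_name="chunks",
    query_vector=query_vec,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="doc_id", match=models.MatchAny(any=doc_ids))]
    ),
    limit=5,
)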
Trade-offs: This approach improves efficiency (you narrow search space quickly) and sometimes quality, because it emphasizes document relevance. It’s particularly useful if you have lots of documents where queries usually pertain to a few of them. In Qdrant, implementing this might mean maintaining two collections and doing two queries, which is more code and complexity. You have to also generate good document-level embeddings (a summary or average of chunk embeddings could work too). For a minimal prototype, consider this if you observe that many top chunk results come from the same few documents – an indicator that you could pre-filter by doc. If your dataset isn’t huge, a single-tier might be fine; hierarchical shines with scale. The main cost is storing an additional index and the extra query step, but both are quite manageable.
19. Ensemble Retrieval (Multiple Models in Parallel)
What it is: Using more than one retrieval model or embedding approach and combining their outputs. The idea is that different models may capture different nuances, and an ensemble is more robust.
How to apply: There are a few ways to ensemble in a lightweight manner:
- Multiple embedding models: For example, use OpenAI’s embeddings and another embedding model (perhaps a domain-specific one or an open-source model). Search the query in both vector spaces. Then merge results (similar to fusion retrieval but here both are semantic). You might take the union of the top results from each model, or intersect if you want high precision.
- Different query variants: Even using the same model (OpenAI embeddings), you could ensemble by using both the original query and a transformed query (from Technique #6) and union their retrieval results.
- Voting or weighting: If using two embedding models, you could rank results by a weighted combination of their similarity scores from each model. Or simpler, if a document is retrieved by both models independently, trust it more.
Example: Suppose you have a general embedding model and a code-specialized model. A query about code could benefit from the latter, whereas a plain model might miss some context. By combining, you ensure code-related queries still find matches. In code:
results1 = qdrant_general.search(vector=query_vec_general, top=5)  # general-purpose embedding space
results2 = qdrant_code.search(vector=query_vec_code, top=5)        # code-specialized embedding space
combined = merge(results1, results2)  # e.g., union by chunk id, boosting chunks found by both
If maintaining two separate Qdrant collections, you’d do parallel searches. Alternatively, you might store multiple embedding vectors in the same Qdrant entry (if supported) and craft a custom search – but that’s more complex.
Trade-offs: Running multiple models obviously has overhead: you may need additional infrastructure (for open-source models) or additional OpenAI calls (if you use two different OpenAI embedding endpoints). It doubles indexing effort and storage (for each data point you store multiple vectors). Ensembling is powerful if your data or queries are diverse (one model might not catch everything). If your domain is fairly homogeneous, a single good model should suffice, and ensemble might yield diminishing returns. In a prototype phase, consider ensembling if you notice clear gaps – e.g., certain queries where another approach consistently finds answers that the primary approach misses. Otherwise, it might be overkill initially. Keep it in the toolbox as your project grows.
20. Dartboard Retrieval (Relevance + Diversity Optimization)
What it is: A retrieval strategy that explicitly optimizes for both relevance and diversity in the returned set. It’s sometimes called Maximal Marginal Relevance or similar – ensuring results aren’t all redundant.
How to apply: In a basic way, you can implement a greedy selection algorithm for your final retrieved context:
- First, retrieve a larger-than-needed set of candidates from Qdrant (say the top 20 by similarity).
- From these, build the final set (e.g., 5 to pass to the LLM) iteratively:
  - Pick the highest-score chunk as the first result.
  - For each subsequent pick, don’t just take the next-highest score. Instead, pick the chunk that has a good balance of high similarity to the query and low similarity to the already picked chunks (to introduce diversity).
- To do this, you need a way to compute similarity between chunks. Since you have their embeddings, you can compute cosine similarity between a candidate chunk’s vector and each selected chunk’s vector. You might then define a combined score like score_final = query_similarity - λ * max(similarity_to_any_selected), where λ is a tunable parameter controlling diversity vs. relevance.
- Select the chunk with the highest score_final as the next result. Repeat until you have the desired number of chunks (see the sketch below).
In essence, this prevents you from, for example, selecting 5 chunks that all say the same thing or all come from the same document section.
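A minimal greedy-selection sketch (it assumes each candidate carries its embedding as .vector and its query similarity as .score, which you get from Qdrant if you request vectors with the results; λ is written as lam):
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def diverse_select(candidates, k=5, lam=0.5):
    # Greedily pick chunks that are similar to the query but dissimilar to what's already picked
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def final_score(c):
            redundancy = max((cosine(c.vector, s.vector) for s in selected), default=0.0)
            return c.score - lam * redundancy
        best = max(remaining, key=final_score)
        selected.append(best)
        remaining.remove(best)
    return selected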
Trade-offs: This method can yield a richer set of information for the LLM, potentially improving the answer by covering different aspects or angles. However, it might also drop the 2nd or 3rd most similar chunk in favor of a more diverse one that’s slightly less similar. If those top-similarity chunks actually all contain critical info, you risk leaving something out. The key is tuning how much diversity to enforce. In practice, a small λ (just avoiding exact duplicates) is low-risk and beneficial. A high λ (favoring very diverse results) might hurt if the query is very specific. Implementing this is somewhat more complex than taking top K, but it’s doable with a few lines to calculate similarities between vectors (you can use dot products from the embeddings). For a prototype, if you notice a lot of near-duplicate retrievals, this is a neat trick to spread out the context info. If not, plain top-K may suffice.
21. Multi-modal Retrieval
What it is: Extending RAG to handle not just text, but other data types such as images, audio, and PDFs. In a minimal pipeline, this means incorporating non-text information by converting it into text or using specialized models.
How to apply: There are two general approaches:
- Convert other modalities to text (Captioning/OCR): For images, you can use an image captioning model (like OpenAI’s VL models or open-source ones) to produce a textual description. For PDFs or PPTs, you’d use OCR or PDF text extractors to get the raw text content. Once converted to text, treat it like any other document: chunk, embed, and store in Qdrant. Now your retrieval can surface information from images or PDFs because their textual equivalents are in the index.
- Embed other modalities directly and use multi-modal models: This is more complex and likely outside a minimal setup. It would involve using an embedding model that can handle images (like CLIP) to embed images and store those vectors. Then, depending on whether the query is an image or text, you would use the corresponding encoder and possibly retrieve across modalities. For example, the user might provide an image and you find similar images via vector search. The ColBERT-style approach the source mentions (likely ColPali) goes further: document pages are rendered as images and embedded with a vision-language model for retrieval – which is quite advanced and not typical for a minimal setup.
In your case, you likely want the first approach:
- If you have relevant non-text data (like diagrams, screenshots), run them through an OCR or caption generator offline.
- Store the resulting descriptive text in your Qdrant index (perhaps with a metadata tag that it came from an image or PDF).
- At query time, those descriptions could be retrieved alongside normal text chunks. You would then include them in the context, possibly phrasing them as “Image description: [content]” so the LLM knows it came from an image.
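A minimal sketch of the offline conversion step, using pytesseract/Pillow as one possible OCR option (the diagram.png filename is just an illustration; any OCR or captioning tool works):
import pytesseract
from PIL import Image

# OCR the image offline, then index the extracted text like any other chunk
text = pytesseract.image_to_string(Image.open("diagram.png"))
qdrant.upsert(vector=openai_embed(text),
              payload={"text": text, "modality": "image", "file": "diagram.png"})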
Trade-offs: The biggest effort here is building the pipeline to convert non-text to text. There are open-source tools for OCR (e.g., Tesseract for images/PDF) and captioning models for images. This does add external dependencies and complexity to your system. If your data is already mostly text, you may not need multi-modal. But if, say, you have important info locked in scanned documents or diagrams, this can unlock that for your RAG system. One must also consider accuracy – OCR errors or bad captions could mislead the model. So, ensure the conversion quality is high or double-check critical pieces. In summary, multi-modal retrieval in a lightweight way means “make everything text one way or another.” It’s powerful but only pursue it if your use case truly demands multi-modal knowledge.
22. Retrieval with Feedback Loops
What it is: Incorporating user feedback to iteratively improve the system’s retrieval performance over time. The idea is the system learns from its mistakes or successes.
How to apply: Even without building a full learning pipeline, you can add simple feedback mechanisms:
- Explicit feedback: If your application UI allows, you can have users upvote/downvote answers or mark whether the answer was helpful. You can then trace back which documents were used and increase or decrease their “reputation” score in the system. For instance, maintain a metadata field in Qdrant for each chunk’s usefulness. A very basic implementation: if an answer was good, increment a counter for those chunks; if bad, decrement. Then at retrieval time, adjust similarity scores slightly (e.g., multiply by a factor like 1 + reputation, or filter out chunks with very negative feedback).
- Implicit feedback: Even if users don’t explicitly rate, you can infer from usage. If you provide sources for answers and see which source links users click, that’s a sign those chunks were considered helpful. You could then boost those chunks for similar queries in the future (this requires logging query–result interactions).
- Active learning loop: Periodically, take a set of queries that had poor results (maybe ones where the user asked again or rephrased, indicating first try failed) and analyze what went wrong. You might find documents that should have been retrieved but weren’t (maybe because of missing synonyms). You can then augment those documents (e.g., add more synonyms to their text or add more embeddings for them) or tune your embedding/query strategy accordingly.
In a minimal code-based setup, you might store a simple JSON file or database of feedback, e.g., {chunk_id: score}, and then use that to post-filter or adjust ranks on subsequent searches.
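A minimal sketch of applying such stored feedback at query time (the feedback.json file, its contents, and the 0.05 weight are placeholders; results are assumed to expose .id and .score like Qdrant hits):
import json

with open("feedback.json") as f:
    feedback = json.load(f)   # e.g. {"chunk-42": 3, "chunk-7": -2}

def adjusted(r):
    reputation = feedback.get(str(r.id), 0)
    return r.score * (1 + 0.05 * reputation)   # small nudge per accumulated vote

reranked = sorted(results, key=adjusted, reverse=True)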
Trade-offs: The main challenge is collecting enough feedback and then deciding how to use it. Overfitting to feedback from a few queries could hurt general performance (so don’t over-adjust scores based on one user’s opinion). Also, incorporating feedback might make the system less predictable (because now retrieval isn’t just pure similarity, it’s influenced by past interactions). Start subtle – e.g., only if a chunk gets repeatedly flagged as irrelevant, consider excluding it. This technique shines in long-running applications with many users/questions on the same data, where you can meaningfully learn. In a short-term prototype with few queries, you might not gather much feedback, but it’s good to design your system with a place to plug this in later.
23. Adaptive Retrieval (Dynamic Strategies per Query)
What it is: Adjusting retrieval approach based on the nature of the query. Not all questions are equal – some might be factual, some open-ended, some very specific. Adaptive retrieval means classifying the query and then using a strategy optimized for that class.
How to apply: Introduce a lightweight query classifier (this could be a simple rule-based check or an LLM prompt). Examples of query types and adaptations:
- Factual, specific questions: e.g. “What is the capital of X?” – Here a straightforward vector search might be fine. You might restrict to highly relevant small chunks (because the answer is likely a name or date).
- Broad, exploratory questions: e.g. “Explain how metabolism is affected by tea.” – Maybe the answer isn’t in one chunk. For these, you might retrieve more chunks to cover different points, or use a summarization of an entire section. You could decide to use context compression (#13) here to handle the broader context.
- Very recent or time-sensitive queries: If you have timestamps, an adaptive strategy could prioritize the latest documents.
- Multi-part queries: If the query has an “and” or multiple sub-questions, you might apply the sub-query decomposition from #6 as an adaptive step.
To implement, you could do something simple like:
if "why" in query or "explain" in query:
strategy = "broad"
elif len(query.split()) < 5:
strategy = "specific"
else:
strategy = "default"
Or use OpenAI: “Classify this question as ‘specific’, ‘broad’, or ‘multifaceted’.” Then:
- For ‘specific’: do normal retrieval, maybe k=3 chunks.
- For ‘broad’: increase k (retrieve more chunks), maybe use summarization.
- For ‘multifaceted’: break it and retrieve for each part separately.
Trade-offs: The risk is misclassification – if your logic guesses wrong, it might use a less optimal strategy (e.g., treat a specific question as broad and include too much context). However, even a rough heuristic can improve performance by not using a one-size-fits-all approach. It adds branching in your code but not heavy complexity. Start with obvious distinctions (like known keyword patterns or query length). Over time, you can refine the categories. Adaptive retrieval aims to make your pipeline more flexible and efficient per query. Just be sure to test each path to ensure it indeed helps for that query type.
24. Iterative Retrieval (Multiple Rounds)
What it is: Instead of a single retrieve-then-answer step, iterative retrieval involves looping: use initial results or partial answers to reformulate a new query, retrieve again, and so on. It’s akin to the system asking follow-up questions to itself.
How to apply: A simple way to try this:
- Perform the normal retrieval for the query and get some chunks.
- Ask the LLM (or use logic) if additional information is needed. For instance, after generating an initial answer, you could inspect if the answer contains phrases like “not sure” or “needs more data” or if the answer is incomplete. Alternatively, you can prompt the model: “Do you have enough info to answer thoroughly? If not, what would you ask next?”. If it suggests a follow-up query, take that and treat it as a new query to retrieve more info.
- Another method: use the initial retrieved context and question to have the LLM generate a refined query on its own. For example: “Given the context and question, generate a follow-up query to find missing details.” Then use that query for a second round of Qdrant search. Merge those new results with the original ones and ask the final question again with the expanded context.
Example: User asks: “How did the 2021 supply chain crisis affect small businesses?” The first retrieval might bring general info on the crisis. The LLM’s initial answer might say, “It affected many small businesses with delays and shortages, but specifics are not provided.” The system could then form a follow-up query like “impact of 2021 supply chain crisis on small businesses specific examples” and retrieve again. The second retrieval might yield a case study or statistics that the first missed. Those are then added to produce a more complete answer.
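A minimal single-extra-round sketch (prompts are illustrative; the 'NONE' convention is just a simple stopping signal):
results = qdrant.search(vector=openai_embed(user_query), top=5)
context = " ".join(r.payload["text"] for r in results)

followup = openai_complete(
    "Given the context and the question, suggest ONE short search query that would find missing details, "
    f"or reply 'NONE' if the context is sufficient.\nContext: {context}\nQuestion: {user_query}"
).strip()

if followup.upper() != "NONE":
    extra = qdrant.search(vector=openai_embed(followup), top=3)
    context += " " + " ".join(r.payload["text"] for r in extra)

answer = openai_complete(f"Context: {context}\nQuestion: {user_query}\nAnswer:")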
Trade-offs: Iterative retrieval can find information that a single shot misses, improving answer quality for complex queries. However, it introduces more OpenAI calls and more complexity in orchestration:
- You need a stopping condition (maybe limit to 1 or 2 iterations to avoid a loop).
- There’s added latency from performing multiple searches and LLM calls.
- The approach relies on the LLM to guide the process, which might sometimes ask irrelevant follow-ups or not know when to stop.
For a lightweight implementation, keep it to just one extra round max. You can also decide to trigger it only for questions above a certain complexity (determined perhaps by query length or if the initial answer confidence is low). It’s a step toward an “agent-like” behavior but can be done in a deterministic, simple way without external frameworks – just some conditional logic and re-prompting in your code.
25. DeepEval Evaluation (Assessing RAG performance)
What it is: DeepEval is a framework for evaluating RAG systems on multiple metrics like correctness, faithfulness, etc., using test cases. While not a retrieval technique, it’s a way to systematically measure how well your pipeline is doing.
How to apply: In a prototyping context, you might not integrate a full evaluation suite into your app, but you can adopt some practices:
- Create a small set of QA pairs as a validation set. These are questions with known answers (and perhaps known relevant sources in your data). After you implement various techniques, run these questions through your pipeline and see if the answers are correct and which documents were used.
- If using a library like deepeval (as the repository suggests), you could set up tests for certain criteria. For example, test that for factual questions the answer text is contained in the retrieved context (a faithfulness test), or that numeric answers match expected values (an accuracy test).
- Another manual metric: count how often the model says “I don’t know” or hallucinates an answer not supported by the context. Adjust your pipeline (with the techniques above) to reduce these.
In code, a simplistic evaluation could be:
test_questions = [
    {"q": "What is the capital of France?", "expected_answer": "Paris"},
    ...
]
for t in test_questions:
    ans, used_context = rag_pipeline_answer(t["q"])
    evaluate_correctness(ans, t["expected_answer"])
    evaluate_support(ans, used_context)
Where evaluate_support might check whether the expected answer or key facts appear in the context (a rough measure of grounding).
Trade-offs: Rigorous evaluation is often overlooked in prototyping, but incorporating it early can guide you on which techniques actually help. The downside is the time to create a good test set and possibly the effort to integrate an evaluation library. If using GPT-based evaluation (like having GPT-4 judge outputs), that again costs tokens. But for research purposes, even a small set of tests run after each major change can be invaluable. DeepEval or similar frameworks can automate this, but you can also do it manually or with simple scripts. The benefit is you avoid relying on subjective feel and have data-driven insight into improvements.
26. GroUSE Evaluation (Grounded Output Evaluation)
What it is: GroUSE is another evaluation framework, focused on whether the LLM’s answers are grounded in the provided context, along with other quality metrics. It often uses GPT-4 as a judge to score answers on various dimensions.
How to apply: Similar to DeepEval, this is about evaluation rather than a live technique. In practice, if you want to use something like GroUSE:
- You would take your pipeline’s outputs (answers along with the context that was fed to the LLM) and feed them to an evaluation harness. The harness (possibly using GPT-4) would score things like: correctness, completeness, relevance of context, etc.
- For example, GroUSE defines metrics like consistency, relevance, and factuality. You might prompt GPT-4 with a template: “Given the question, the provided context, and the answer, rate the answer on a scale for factual correctness, relevance, etc.” This requires some coding and using the grouse Python package if available.
For a lightweight approach, you could manually do a GroUSE-like check on a few outputs. Essentially, you want to ensure that:
- The answer only contains info present in the context (no new facts – measures grounding).
- Important relevant context isn’t omitted in the answer (measure of completeness).
- The answer actually addresses the question (relevance/precision).
Even without full automation, just reviewing some outputs with these criteria in mind is useful.
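If you do want a small amount of automation, a minimal GPT-as-judge check for the first criterion could look like the sketch below; the prompt wording and the 1-5 scale are assumptions, not GroUSE’s actual rubric.
def judge_groundedness(question, context, answer):
    # Ask a strong model to rate how well the answer is supported by the context alone.
    prompt = (
        "You are evaluating a RAG answer.\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n"
        "On a scale of 1-5, how well is every claim in the answer supported by the "
        "context alone (5 = fully supported, 1 = mostly unsupported)? "
        "Reply with only the number.")
    score_text = openai_complete(prompt)  # reuse the completion pseudo-helper
    try:
        return int(score_text.strip()[0])
    except (ValueError, IndexError):
        return None  # the judge did not return a parsable score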
Trade-offs: Doing thorough evaluations with GPT-4 judges means additional API usage. For a prototype, you might not run this continuously – maybe just as a one-time or periodic analysis. The benefit of frameworks like GroUSE is that they provide a structured way to catch failure modes (like answers that look good but aren’t grounded). Since your question is about implementing techniques, you may not need to implement GroUSE per se, but being aware of these evaluation dimensions can guide your design. For instance, if evaluation shows low grounding scores, you know to focus on techniques that improve faithfulness (like #3 Reliable RAG or #27 Explainable Retrieval). In summary, use evaluation frameworks as a guide for improvement, but they are outside the main retrieval pipeline.
27. Explainable Retrieval
What it is: Making the retrieval step transparent – explaining why certain documents or passages were retrieved. This can help both developers (to debug relevance) and end-users (to build trust).
How to apply: On a basic level, you can generate explanations for the top retrieved chunks before passing them to the LLM:
- A simple built-in way: if using keyword search, you could highlight which keywords matched. For vector search, it’s trickier since it’s semantic. But you can still attempt to summarize why a chunk was picked. One approach: compute the overlap between query terms and chunk text (just as a rough explanation: “This chunk shares the term ‘metabolism’ with your query”).
- Another approach: use the LLM to explain retrieval. For each chunk, prompt something like: “Question: {Q}\nSnippet: {chunk}\nExplain briefly how this snippet is relevant to the question.” The model might respond, “It mentions the benefits of green tea, which relates to the question about health benefits.” You could then present that explanation alongside the snippet or use it internally to verify relevancy (a minimal sketch of both approaches follows this list).
- If exposing the retrieval step to the user, you can show snippets with a title or source and maybe a line like “(from [Document XYZ], contains info about ABC which matches your question)”. This can be done manually if you have metadata (like document titles or section headings that hint at content).
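Example (pseudo-code): minimal sketches of both kinds of explanation. The term-overlap version is free; the LLM version reuses the openai_complete pseudo-helper and costs extra tokens.
def term_overlap_explanation(query, chunk_text):
    # Cheap, non-LLM rationale: which query terms literally appear in the chunk?
    shared = set(query.lower().split()) & set(chunk_text.lower().split())
    if not shared:
        return ""
    return f"This chunk shares the terms {sorted(shared)} with your query."

def llm_explanation(question, chunk_text):
    # Optional LLM-generated rationale for the retrieved snippet.
    prompt = (f"Question: {question}\nSnippet: {chunk_text}\n"
              "Explain briefly how this snippet is relevant to the question.")
    return openai_complete(prompt)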
For development debugging: Logging explanations for retrieved results can help you see when irrelevant info sneaks in. For example, an explanation might reveal the chunk was retrieved because it loosely matched a concept – if that’s a false positive, you can adjust your pipeline (maybe via filtering or adjusting embedding).
Trade-offs: Generating explanations can use extra tokens if done via LLM, and it slows things down a bit. If you keep explanations simple (like term overlaps or metadata display), the cost is minimal. Users often appreciate knowing why the system gave an answer (“we found this info in a 2020 report about X”). Just ensure the explanation itself is correct – don’t claim a chunk is relevant if it’s not. In a minimal prototype, you could start by simply showing the source text passages to the user (that alone is a form of explainability). Then, if needed, add one-sentence rationales. This technique doesn’t improve answer quality directly, but it improves transparency, which is valuable in its own right.
28. Knowledge Graph Integration (Graph RAG)
What it is: Combining knowledge graph data (structured triples of facts) with your unstructured text RAG system. A knowledge graph can provide structured context like relationships between entities, which complements the text retrieval.
How to apply: In a minimal way, you can leverage any structured data you have:
- If you have a database or knowledge base of facts, you can query it alongside your vector DB. For example, if the question is “Who is the CEO of Company X?”, your text RAG might find a news article, but if you have a knowledge graph, you could directly query it for the CEO relationship of Company X and get that the CEO is Person Y.
- Implementation: identify entities in the query (like person names, company names, etc.) using a simple NER (named entity recognition) tool or even regex. Then see if those entities exist in your structured data. If yes, fetch related info (like all properties of that entity, or specifically the property asked about if the query indicates one).
- Merge the structured info with unstructured context for the LLM. For instance, provide a section in the prompt: “Knowledge Graph Info: Company X – CEO: Person Y; Founded: 2010; Headquarters: London.” followed by “Unstructured Context:” with the text snippets.
This way, the model gets both sources. The structured info can guide the model to the answer more directly if it’s a straightforward fact.
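Example (pseudo-code): a minimal sketch assuming your structured facts live in a plain dict (kg here) and that exact entity-name matching is good enough; a real setup would use NER and a proper graph store.
# Toy structured store: entity -> {relation: value}
kg = {
    "Company X": {"CEO": "Person Y", "Founded": "2010", "Headquarters": "London"},
}

def kg_facts_for_query(user_query):
    # Crude entity matching: check whether any known entity name appears in the query.
    facts = []
    for entity, props in kg.items():
        if entity.lower() in user_query.lower():
            props_text = "; ".join(f"{k}: {v}" for k, v in props.items())
            facts.append(f"{entity} - {props_text}")
    return facts

def build_prompt(user_query, text_chunks):
    kg_section = "\n".join(kg_facts_for_query(user_query)) or "None"
    return (f"Knowledge Graph Info: {kg_section}\n"
            f"Unstructured Context: {' '.join(text_chunks)}\n"
            f"Question: {user_query}\nAnswer:")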
Trade-offs: You need to have a knowledge graph or structured data in the first place, which not every project has. Building one can be non-trivial (extracting entities and relations from text automatically is an entire project on its own). If you do have such data (even if just a simple database), integration is very beneficial for factual queries. The model can cross-verify the text against the structured data, reducing errors. The complexity lies in entity recognition and deciding what part of the KG to retrieve. For a minimal approach, you might hardcode certain relations of interest. For example, if many of your questions involve people or places, you could integrate a public dataset (like WikiData queries via an API) to fetch facts as needed. Just be mindful that merging two contexts (KG and text) means the model has more to chew on. Ensure the prompt clearly delineates them, so it uses the KG data as factual ground truth and uses text for additional detail.
29. GraphRAG (Microsoft’s Implementation)
What it is: This refers to a specific open-source approach by Microsoft that deeply integrates a knowledge graph with RAG. It likely involves building a graph from your text data and using it in the retrieval/generation loop.
How to apply: In principle, GraphRAG will:
- Extract entities and their relationships from your corpus to construct a graph (nodes and edges representing concepts and their connections).
- Use that graph to inform retrieval. For example, if a query mentions two entities, the system could traverse the graph to find how they’re connected and pull documents related to that connection.
- Possibly generate answers by walking the graph (like following a chain of reasoning through the KG) and using the LLM to fill gaps.
For your minimal setup, implementing a full GraphRAG is likely too heavyweight. However, you can borrow ideas:
- Do a one-time entity extraction on your documents using an NER model (or even GPT in a data preprocessing mode: “extract all entities and relations from this text”).
- Build a simple mapping, e.g., a dictionary mapping each entity to the list of chunks/docs that mention it. This is like an inverted index by entity.
- Then, for a query that has identifiable entities, you can directly retrieve chunks by that mapping (in addition to vector search). E.g., question about “Paris” – fetch all chunks linked to entity “Paris” from this map, maybe rank them by some importance.
This is not exactly a graph, but it’s structured retrieval by entity, which covers a lot of what a knowledge graph integration would give you in simpler form.
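Example (pseudo-code): a minimal sketch of that entity index. Here extract_entities is a hypothetical helper (an off-the-shelf NER model or a GPT preprocessing call), and all_chunks are the chunks produced by your existing splitter.
from collections import defaultdict

# One-time preprocessing: map each entity to the chunks that mention it.
entity_index = defaultdict(list)
for chunk in all_chunks:
    for entity in extract_entities(chunk.text):  # hypothetical NER helper
        entity_index[entity].append(chunk)

def retrieve_by_entity(user_query, top_k=5):
    # Combine entity-based lookup with the usual vector search.
    entity_hits = []
    for entity in extract_entities(user_query):
        entity_hits += entity_index.get(entity, [])
    vector_hits = qdrant.search(vector=openai_embed(user_query), top=top_k)
    return entity_hits, vector_hits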
Trade-offs: The true GraphRAG is advanced; it may improve accuracy and allow multi-hop reasoning (like answering questions that require combining info from different parts of the graph). But constructing the graph requires an NLP pipeline (entity/relation extraction), which can be noisy or labor-intensive to do properly. In a prototype, a full graph might be overkill unless your data naturally has a graph structure (e.g., legal documents with references, or an encyclopedic corpus). If you do try a bit of it, keep it simple as described (entity index). The Microsoft GraphRAG might have specific code or tools – exploring those could be informative, but integrating that into your minimal setup might conflict with the “no heavy frameworks” rule. Consider it an inspiration for how to incorporate structured connections in your data, rather than something to directly implement from scratch in early prototyping.
30. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)
What it is: A method that uses recursive summarization to organize information in a tree-like hierarchy. Essentially, it breaks down a large knowledge source into a tree of summaries, enabling a hierarchical drill-down when answering.
How to apply: The full RAPTOR approach might involve:
- Summarizing documents into sections, summarizing those into higher-level summaries, and so on, forming a tree (like chapter -> section -> paragraph hierarchy, each with an LLM-generated summary).
- At query time, using the query to traverse the tree: find relevant high-level summaries, then go down to relevant sections, and finally retrieve specific content.
In a minimal sense, you could do a simpler version:
- Create summaries of each document (maybe a few sentences each) and store those separately (similar to hierarchical indices in #18, but with LLM-generated summaries which might capture the essence better than raw text).
- If a document is very large, you could even summarize each section within it.
- Then, for a query, first search the summaries to find relevant docs or sections, then retrieve actual chunks from those areas as needed.
The “recursive” aspect implies you might not stop at one level: e.g., if the summary of a book is relevant, you then search within chapter summaries, then within paragraphs. This is like doing a deep dive guided by summaries at each step.
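Example (pseudo-code): a minimal one-level version, assuming LLM-generated document summaries are stored in a separate collection ("doc_summaries") with a doc_id payload; the collection argument is an addition on top of the earlier pseudo-helper.
# Stage 1: search the document-level summaries to pick promising documents.
summary_hits = qdrant.search(collection="doc_summaries",
                             vector=openai_embed(user_query), top=3)
relevant_doc_ids = {hit.payload["doc_id"] for hit in summary_hits}

# Stage 2: search chunks as usual, then keep only chunks from those documents.
chunk_hits = qdrant.search(collection="chunks",
                           vector=openai_embed(user_query), top=20)
context_chunks = [h.payload["text"] for h in chunk_hits
                  if h.payload["source"] in relevant_doc_ids][:5]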
Trade-offs: Implementing recursion adds complexity and multiple LLM calls (to generate all these summaries and possibly to navigate them). It’s beneficial when dealing with very large texts where a flat vector search might miss context. If your data is moderately sized, a simpler two-tier (doc and chunk) approach might suffice (#18). However, one advantage of a summary tree is that the model can handle reasoning at different granularities – broad info vs. detailed info. For a prototype, you might try just one level of summarization (like document summaries for first-stage retrieval). Building a full tree (document -> section -> sub-section summaries) is a lot of upfront work and storage, and you’d need logic to query it properly. Unless you have a clear need (e.g., book-length documents where questions might be about various levels of detail), you can consider RAPTOR more of a conceptual ideal than a necessity early on.
31. Self RAG
What it is: A dynamic approach where the system itself decides how to combine retrieval and generation, potentially deciding to skip retrieval or do extra retrieval based on the query. It’s like an autonomous RAG agent that figures out what it needs.
How to apply: In a lightweight interpretation:
- Before retrieving, you might have the LLM analyze the query and decide: “Do I need to look up information for this, or can I answer from general knowledge?”. If it’s something the model likely knows (common knowledge), it might choose to answer directly. If it’s something obscure or data-specific, it should retrieve. This could be done by prompting the model: “Classify this question as answerable with given data or without data.” If it says “without data” and you trust it, you could just let the model answer directly. (Though in many RAG contexts, you always want grounding, so this is optional.)
- Another aspect: dynamically decide how many documents to retrieve. A self-RAG approach might say: “I found some info, but I’m not fully confident, let me get more.” You can emulate this by checking the answer’s confidence or completeness as in #24 Iterative Retrieval, and looping if needed.
- Self-RAG as described sounds like a multi-step agent: retrieval decision -> retrieve -> possibly retrieve more -> generate -> verify -> and so on. Fully implementing that involves writing a chain of prompts where the model can explicitly output actions (like “search for X”) and then you execute them. This basically enters the territory of building an agent (using an LLM to drive its own retrieval via a protocol).
For a minimal pipeline, a cautious approach is to use a simple heuristic or a single-model prompt to decide whether retrieval is needed, as in the sketch below. Example: if the query explicitly mentions your data’s domain (like names or terms known to be in your docs), do retrieval. If it’s a generic question that your docs may not cover, you might still do retrieval but be aware it could come up empty (in which case, perhaps just let the model answer from its training data or respond with uncertainty).
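Example (pseudo-code): a minimal sketch of that retrieval decision as a single classification prompt; the RETRIEVE/DIRECT labels are an arbitrary choice.
def needs_retrieval(user_query):
    # Ask the model whether the question requires the private document collection.
    prompt = (
        "Decide whether the following question must be answered from a private "
        "document collection or can be answered from general knowledge.\n"
        f"Question: {user_query}\n"
        "Reply with exactly one word: RETRIEVE or DIRECT.")
    decision = openai_complete(prompt).strip().upper()
    return decision.startswith("RETRIEVE")

if needs_retrieval(user_query):
    results = qdrant.search(vector=openai_embed(user_query), top=5)
    context = " ".join(r.payload["text"] for r in results)
    prompt = f"Context: {context}\nQuestion: {user_query}\nAnswer:"
else:
    prompt = f"Question: {user_query}\nAnswer:"  # fall back to the model's own knowledge
answer = openai_complete(prompt)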
Trade-offs: The risk with skipping retrieval is the model might hallucinate an answer or use outdated training data. Many RAG setups prefer to always retrieve to keep the model grounded. However, there are scenarios (like very simple commonsense questions) where retrieval is unnecessary and only adds noise. Self-RAG is advanced and requires careful design to avoid the model making poor decisions. Implementing a full self-directed chain would conflict with “no heavy frameworks”, because it’s essentially what agent frameworks (LangChain agents, etc.) do. So, at most, you can incorporate a bit of self-analysis as guidance. This technique is cutting-edge and perhaps overkill for early prototyping, but being aware of it can inspire features like query classification (as above) and iterative loops.
32. Corrective RAG (CRAG)
What it is: A sophisticated pipeline that not only retrieves from the vector DB but can also do web searches and rewrite queries based on what was found or not found. It “corrects” the course of retrieval dynamically to get the best answer.
How to apply: A full CRAG system might:
- Use an initial retrieval from Qdrant. If the retrieved info has gaps or low relevance, invoke a web search (or some secondary source) to get additional information.
- Use an LLM to evaluate the retrieval: e.g., “Did the retrieved docs likely answer the question? If not, what else should be done?”. It might answer: “The info is incomplete, let's search the web for XYZ.”
- Then incorporate that external info and even refine the query for a second pass in Qdrant (maybe the web info provided new keywords or clarified the query).
- Finally, generate the answer from a combination of internal and external sources.
In a minimal environment, you might not want to integrate live web search, but you could simulate a scaled-down corrective strategy:
- If the vector search returns nothing useful (perhaps all scores below a threshold or the LLM says “I don’t have enough info”), you could then query an external API or database if available. For instance, maybe check a Wikipedia API for the answer as a fallback.
- Or, automatically reformulate the query (similar to #6 transformations) and try the vector search again. This is a “corrective” step if the first try failed.
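Example (pseudo-code): a minimal sketch combining both fallbacks, assuming Qdrant results expose a similarity score and that wikipedia_lookup is a hypothetical external-backstop helper.
SCORE_THRESHOLD = 0.75  # arbitrary; calibrate for your embedding model

results = qdrant.search(vector=openai_embed(user_query), top=5)
if not results or max(r.score for r in results) < SCORE_THRESHOLD:
    # Corrective step 1: rewrite the query and retry the vector search.
    rewritten = openai_complete(
        f"Rewrite this question as a concise search query: {user_query}")
    results = qdrant.search(vector=openai_embed(rewritten), top=5)

if not results or max(r.score for r in results) < SCORE_THRESHOLD:
    # Corrective step 2: fall back to an external source (hypothetical helper).
    context = wikipedia_lookup(user_query)
else:
    context = " ".join(r.payload["text"] for r in results)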
Trade-offs: This approach aims for maximum answer accuracy by any means necessary – using multiple tools and steps. It’s powerful but definitely complex. Introducing web search means you need an API (like Bing) and have to parse those results, which is a whole subsystem. That might be outside your current scope. Query rewriting we already discussed; doing it iteratively and conditionally (only when needed) is an added layer of logic. If you foresee your internal data often not having the answer and you want to backstop with external info, a corrective approach is useful. Just beware of blending external info – your answer might end up citing things not in your internal data, which could be fine but might violate the idea of focusing on your data. For prototyping, you might implement a very basic check: if confidence low, call a quick web search (maybe using an unofficial API or a pre-downloaded knowledge base) and see if that helps. This is certainly moving beyond a self-contained system, so consider it if you are exploring how to handle unanswered queries.
33. Sophisticated Controllable Agent (for Complex RAG Tasks)
What it is: This is essentially describing an agent-based system where the LLM uses a deterministic “brain” or plan to break down very complex queries and perform multiple actions (search, retrieve, analyze, plan, answer). It’s the opposite of a minimal pipeline – it’s a maximal approach for tough cases.
How to apply: Given the constraint to avoid heavy frameworks, you likely would not implement this fully. But to outline what it entails:
- The agent would anonymize or simplify the question, then create a high-level plan (like steps required to answer).
- It would then execute a series of sub-queries or retrievals (perhaps using many of the techniques above in combination) to gather pieces of information.
- It might adjust its plan as it finds new info (“continuous re-planning”), ensuring all sub-questions are answered.
- Finally, it verifies the final answer thoroughly against sources.
This resembles a complex decision tree or workflow that the LLM navigates. LangChain or similar frameworks provide tools for this (tools usage, planning, etc.), but doing it from scratch is a big project.
In a minimal setting, you can borrow some spirit:
- For particularly complex multi-hop questions, you could manually script a multi-step solution. For example, if the question is “Compare the economic policies of country X in 1990 and 2020 and their outcomes,” you know this involves multiple parts: find info about the 1990 policies, find info about the 2020 policies, then compare. You could do two separate retrievals for 1990 and 2020, then ask the LLM to compare using those two sets of context (see the sketch after this list).
- That is essentially an agent behavior hardcoded for that pattern of query. You might not cover all possible complex queries, but you can handle some patterns.
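Example (pseudo-code): a minimal hardcoded version of that two-retrieval comparison pattern, reusing the earlier pseudo-helpers.
def compare_two_periods(user_query, topic, year_a, year_b):
    # Step 1: retrieve context for each period separately.
    ctx_a = " ".join(r.payload["text"] for r in
                     qdrant.search(vector=openai_embed(f"{topic} {year_a}"), top=5))
    ctx_b = " ".join(r.payload["text"] for r in
                     qdrant.search(vector=openai_embed(f"{topic} {year_b}"), top=5))
    # Step 2: ask the model to compare using both context sets.
    prompt = (f"Context for {year_a}: {ctx_a}\n"
              f"Context for {year_b}: {ctx_b}\n"
              f"Question: {user_query}\n"
              "Compare the two periods and their outcomes using only the context above.")
    return openai_complete(prompt)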
Trade-offs: A fully autonomous, controllable agent is very powerful but extremely complex to build and debug. It’s likely overkill for most use cases unless you specifically need that level of compositional reasoning and have very challenging queries. It would involve many moving parts (the “deterministic graph brain” as mentioned). For now, it’s enough to recognize that such solutions exist for specialized needs. If your prototype eventually faces questions that require reasoning through many documents or steps, you can consider introducing some agent-like orchestration at that point (potentially using existing libraries in a controlled way). Until then, focusing on the simpler techniques above will yield plenty of improvement without diving into full agent territory.
Conclusion: Each of these techniques can be mixed and matched in your RAG pipeline as needed. Start with the foundational ones (chunk sizing, simple RAG flow), then gradually incorporate query enhancements and context enrichment. Use evaluation (#25, #26) to measure gains. Advanced methods like knowledge graphs or multi-step agents are optional depending on how complex your needs become. By keeping each enhancement modular and lightweight, you can prototype rapidly and evolve your system without the weight of a large framework, staying in control of how each piece works. Good luck with your RAG implementation!