What is Retrieval Augmented Generation?
The simplest and most effective way to make a large language model more accurate, knowledgeable, and robust in practical, domain-specific applications.
In the last article in this series we began exploring how LLMs work and how we can, despite their many limitations, use them to build robust and useful applications in practical scenarios. In this article, we continue with this exploration of the good side of LLMs. Today I want to focus on augmentation techniques: ways to enhance vanilla LLMs and mitigate some of their most obvious limitations.
Augmentation techniques let us extend the capabilities of LLMs in several ways without retraining the models. We can broaden their knowledge by providing relevant context taken from external knowledge sources, or by teaching them to query specialized search engines or APIs directly. We can extend their reasoning and problem-solving capabilities by integrating them with specialized tools that perform specific tasks, or by enabling them to generate and run code on the fly. Finally, we can turn LLMs into fully fledged agents able to decide how and when to interact with the environment, gather information, and take actions that further long-term goals.
In this and follow-up posts, we will review the most common augmentation strategies that can be used with current state-of-the-art LLMs. These techniques are within everyone's reach and require little effort compared to the value they provide. In this article, we'll focus on the simplest and yet most effective augmentation technique: retrieval augmented generation.
This article is part of my upcoming book How to Train your Chatbot, a practical handbook on using LLMs to build all sorts of cool stuff.
You can get early access to the current draft and all future updates at the following link:
Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is the most common augmentation technique used today in practical applications that leverage large language models. It is easy to implement and provides very good value for money, especially in restricted domains where vanilla LLMs have little or no knowledge.
A typical example is building a Q&A bot that can answer queries related to institutional or private knowledge. For example, a bot that knows about your company’s policies and internal rules. It is unlikely that any vanilla LLM, no matter how powerful, can answer a precise query about your organization’s inner workings.
To bridge this knowledge gap, we introduce a retrieval component that can access a knowledge base (database, folder of documents, etc.) and retrieve a small chunk of relevant context for a particular user query. This context is then inserted into a prompt template, along with the user query, instructing the LLM to answer the query based on the given context. If the context is indeed relevant for the query, this often works wonders.
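To make this concrete, here is a minimal sketch of that loop in Python. Both retrieve and call_llm are hypothetical stand-ins for your retrieval component and your LLM client of choice, and the template wording is just one reasonable way to phrase the instruction.

```python
PROMPT_TEMPLATE = """Answer the user's question using only the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical retrieval component: returns the top_k most relevant chunks."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client: sends a prompt and returns the completion."""
    raise NotImplementedError

def answer(question: str) -> str:
    chunks = retrieve(question)
    # Insert the retrieved chunks and the user query into the prompt template.
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    return call_llm(prompt)
```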
RAG approaches differ in at least two dimensions. First, you must decide how to index the background knowledge, which in turn determines how the retrieval works. For example, if the knowledge consists of a set of text documents (the most common scenario), you will almost certainly use a vector database for similarity queries. But you can also use a more sophisticated search strategy, such as a full-text search engine like ElasticSearch, or even a search service like Google or a commercial search API.
Second, you must decide how exactly the user query is used to locate the most relevant context. The simplest option is to directly provide the user query to the retrieval engine—e.g., embedding the user query if you’re using a vector store, or directly submitting the user query to the search engine. However, the user query is often not the most informative way to query the search engine, and you can resort to the LLM itself to modify, extend, or even completely change the user query.
Let’s dive in.
Retrieval strategies
The following retrieval strategies allow us to index, store, and compute the relevant context for a domain-specific user query.
Vector databases
In this approach, each document is split into meaningful chunks of text (e.g., paragraphs or sections) and each chunk is transformed into an embedding. At inference time, the user query is used to obtain a “key” embedding. This key is compared to each stored chunk, and the most similar chunks are inserted into the context.
For large-scale scenarios, you will use an efficient vector store that can quickly locate the most similar vectors. Production-level implementations can locate one chunk out of millions in sub-millisecond time. Some traditional database systems like PostgreSQL offer vector search extensions (e.g., pgvector), and there are also open-source libraries like Faiss that specialize in super-fast vectorized retrieval.
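As an illustration, here is a bare-bones version of this pipeline using a plain NumPy matrix as the “vector store”; in production you would swap this for Faiss, pgvector, or a hosted vector database. The embed function is a hypothetical placeholder for whatever embedding model you use.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder for an embedding model."""
    raise NotImplementedError

def build_index(chunks: list[str]) -> np.ndarray:
    # One L2-normalized embedding per chunk, stacked into a matrix.
    vectors = np.stack([embed(chunk) for chunk in chunks])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def top_k_chunks(query: str, chunks: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    # On normalized vectors, cosine similarity is just a dot product.
    key = embed(query)
    key /= np.linalg.norm(key)
    scores = index @ key
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]
```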
Structured databases
If the knowledge is stored in a traditional SQL database, then you must resort to SQL code generation for retrieval. The simplest solution is to have an LLM generate the appropriate SQL statement for a given user query in a single shot, but this process can be improved with multiple passes, as we’ll see in future articles when we tackle code generation.
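A single-shot version of this idea could look like the sketch below, using SQLite for illustration. Here call_llm is the same hypothetical LLM client as before, and in a real system you would validate or sandbox the generated SQL before executing it.

```python
import sqlite3

SQL_PROMPT = """Here is the database schema:
{schema}

Write a single SQL query that answers this question:
{question}

Return only the SQL, with no explanation."""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def answer_from_database(question: str, schema: str, db_path: str) -> list[tuple]:
    sql = call_llm(SQL_PROMPT.format(schema=schema, question=question))
    # The rows returned by the generated query become the retrieved context.
    with sqlite3.connect(db_path) as connection:
        return connection.execute(sql).fetchall()
```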
Knowledge graphs
A compelling alternative for storing well-structured facts about a specific business domain is a knowledge graph. Explaining what a knowledge graph is in detail goes beyond the scope of this article, but in a nutshell, it is a network of the relevant entities in a domain and their interrelationships. For example, in the clinical domain, a medical knowledge graph could contain nodes for diseases, symptoms, and drugs, and edges indicating which symptoms are associated with each disease and which drugs can be prescribed for it.
Querying a knowledge graph depends on the underlying implementation. If you are using a graph database such as Neo4j, this isn't much different from querying a traditional database: you will probably use an LLM to generate query statements in an appropriate query language (e.g., Cypher in the case of Neo4j).
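For instance, with the official Neo4j Python driver, that could look roughly like the sketch below; the schema description mirrors the medical example above, and call_llm remains a hypothetical LLM client.

```python
from neo4j import GraphDatabase

CYPHER_PROMPT = """The graph has (:Disease), (:Symptom), and (:Drug) nodes, connected by
[:HAS_SYMPTOM] and [:TREATED_BY] relationships.

Write a Cypher query that answers this question: {question}
Return only the Cypher."""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def query_graph(question: str, uri: str, user: str, password: str) -> list[dict]:
    cypher = call_llm(CYPHER_PROMPT.format(question=question))
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            # Each record becomes a plain dict we can later format into the prompt.
            return [record.data() for record in session.run(cypher)]
```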
However, you can also expose the graph structure to the LLM and use it as a controller to navigate the graph. The simplest approaches involve asking the LLM for the relevant entities and relations to focus on, and extracting the induced subgraph. More advanced approaches construct a relevant subgraph step by step, querying the LLM at each iteration about which relations (edges) are worth exploring next.
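As a rough illustration of the iterative flavor, the sketch below assumes a hypothetical graph object exposing relations_from and expand methods, plus the same hypothetical call_llm client; a real implementation would build these on top of whatever graph store you use.

```python
def explore_graph(question: str, seed_entities: list[str], graph, max_hops: int = 2) -> list[str]:
    facts: list[str] = []
    frontier = seed_entities
    for _ in range(max_hops):
        # Ask the LLM which of the available relations are worth following.
        available = graph.relations_from(frontier)  # hypothetical method
        chosen = call_llm(
            f"Question: {question}\nRelations available from {frontier}: {available}\n"
            "List the relations worth exploring, comma-separated."
        )
        relations = [r.strip() for r in chosen.split(",") if r.strip()]
        # Expand the chosen edges: collect new facts and the next frontier of entities.
        new_facts, frontier = graph.expand(frontier, relations)  # hypothetical method
        facts.extend(new_facts)
    return facts
```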
Search APIs
Finally, the relevant domain knowledge may be accessible through a storage service that provides a search API. This can range from locally deployed document databases such as ElasticSearch or MongoDB, to their cloud-hosted counterparts, to third-party search APIs like Google, Reddit, Wikipedia, and a myriad of other domain-specific services.
In these cases, your retrieval strategy will depend on the idiosyncrasies of the search service you use. For simple textual search, like ElasticSearch or the Google API, you may simply submit the user query directly. However, if the search API has relevant parameters, this becomes an instance of function calling, which we’ll see in future articles.
Query strategies
Whatever the retrieval strategy and storage solution you use, you'll need a way to convert the user query into a proper key to retrieve the right context. These are the most common strategies.
Search by query
The most direct strategy is to simply submit the user query to your retrieval engine. In the case of vector databases, this implies embedding the user query directly, while in the case of search APIs this involves sending the user query as the main argument in the API search method.
The obvious upside of this strategy is its simplicity, and the fact that it works surprisingly well in many cases. This of course depends on how robust your search engine is, and more specifically, how closely the user query matches the query language expected by your search engine.
Search by answer
Specifically in vector databases and embedding-based retrieval, researchers have observed that the user query is often not informative enough to pinpoint the most relevant document chunk. For example, if your knowledge base is composed of research papers or technical documents in general, it is very unlikely that a user query formulated in imprecise, informal language will have an embedding that is most similar to the exact paragraph that holds the answer, especially if that answer is non-trivial and thus syntactically very different from the query.
In these cases, a neat trick is to first use the LLM to generate an answer on the fly, and then embed that answer and use it as the query key. The reason this works is that, even if the vanilla LLM doesn't have the precise knowledge to answer the user query in detail, it is often capable of producing at least a plausible-sounding response that mimics the right language.
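A sketch of this trick, reusing the hypothetical call_llm client and the top_k_chunks helper from the earlier sketches:

```python
def retrieve_by_answer(question: str, chunks: list[str], index, k: int = 3) -> list[str]:
    # First ask the model for a plausible draft answer, even if imprecise...
    draft = call_llm(f"Write a short, plausible answer to this question: {question}")
    # ...then embed the draft instead of the raw question. The draft usually shares
    # more vocabulary with the relevant passages than the original question does.
    return top_k_chunks(draft, chunks, index, k=k)
```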
LLM-guided search
A more elaborate search strategy involves prompting the LLM to generate a suitable search query. This can work better than both previous strategies if the prompt is carefully constructed. By providing the LLM with both the context the user is coming from and the characteristics of the knowledge base, we can leverage its powerful natural language understanding capabilities to bridge the gap between what the user says, what the user actually wants, and what the search engine needs.
The simplest example of this approach is prompting the LLM to, given a user query, produce a set of relevant search queries for a special-purpose search engine. For example, if you’re building a medical Q&A bot backed by a custom knowledge base, and the user query is something like “What are the effects of looking straight into the Sun during an eclipse?”, it is unlikely this query by itself will surface the right article. However, an LLM can easily determine that an appropriate query would be “Solar eclipses: medical recommendations”.
If you enhance this approach with a small set of examples, the LLM can quickly learn to map fuzzy user queries to much more precise and domain-specific queries. Thus, this approach works best when you’re dealing with custom search engines or knowledge bases that are not as capable as, say, Google, at providing a relevant context for an arbitrary user query.
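A minimal sketch of this kind of few-shot query rewriting, again with call_llm as a hypothetical LLM client and the medical examples purely illustrative:

```python
REWRITE_PROMPT = """Rewrite the user's question as a concise search query for a
medical knowledge base.

User: My eyes hurt after I watched the eclipse yesterday, is that bad?
Query: solar eclipse eye damage symptoms

User: Can I take ibuprofen and paracetamol together?
Query: ibuprofen paracetamol combined use safety

User: {question}
Query:"""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def rewrite_query(question: str) -> str:
    # The few-shot examples teach the model the expected query style.
    return call_llm(REWRITE_PROMPT.format(question=question)).strip()
```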
Iterated retrieval & refinement
This is an extension of the previous approach in which, instead of a single shot, we let the LLM provide increasingly relevant queries over several iterations. The objective is to construct a relevant context one step at a time: obtain a query from the LLM, extract the relevant chunk, and use self-reflection to let the LLM decide if additional information is required.
This approach has the advantage that if the first query is not that informative, we still get a few shots to pinpoint the exact context we need. However, this can quickly get out of hand and produce a huge, semi-relevant or mostly irrelevant context that will confuse the LLM more than it helps.
To counter this effect, we can add a refinement step after each retrieval, as follows. We let the LLM produce a query, find the most relevant chunk, and then ask the LLM to, given the query and the current context, extract a summary of the relevant points mentioned in that context. This way, even if we end up extracting dozens of chunks, the final context could be a very concise and relevant summary of the necessary background knowledge.
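Putting both ideas together, the retrieve-and-refine loop might look like the sketch below; call_llm and retrieve are the hypothetical helpers from the earlier sketches, and the three-round cap and yes/no stopping check are illustrative choices.

```python
def iterative_retrieve(question: str, max_rounds: int = 3) -> str:
    notes = ""  # running summary of the relevant facts found so far
    for _ in range(max_rounds):
        # 1. Ask the LLM for the next search query, given what we already know.
        query = call_llm(
            f"Question: {question}\nKnown so far: {notes}\n"
            "Write one search query for the missing information."
        )
        chunk = retrieve(query, top_k=1)[0]
        # 2. Refine: keep only the points in the new chunk relevant to the question.
        notes = call_llm(
            f"Question: {question}\nNotes so far: {notes}\nNew passage: {chunk}\n"
            "Update the notes with any relevant new points, staying concise."
        )
        # 3. Self-reflection: stop once the LLM judges the notes sufficient.
        done = call_llm(
            f"Question: {question}\nNotes: {notes}\n"
            "Is this enough to answer the question? Reply yes or no."
        )
        if done.strip().lower().startswith("yes"):
            break
    return notes
```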
Conclusions
Retrieval augmented generation is a growing and very promising field that will definitely play a major role in the near-term future of LLM-powered applications. But it is no silver bullet. Hallucinations and biases can still creep in. In general, there is no guarantee that, even with the perfect context, an LLM will produce a correct answer. However, in practice, RAG tends to outperform vanilla LLMs in many low-resource domains. It is also much more cost-effective than fine-tuning and enables much faster iteration.
However, RAG is far from the only way to extend and enhance LLMs with access to external tools and data. In future articles we will explore two other promising techniques: function calling and code generation. Once we’ve covered all that ground, we will delve into the exciting world of linguistic agents, one of the most powerful paradigms for bringing LLMs into practice.
As usual, if you think anyone would benefit from this knowledge, please share this post with them. And let me know in the comments what you think of this series and what else you’d like to see covered.