Building a Perplexity AI clone
Combining RAG with Google Search to build a bot that knows about everything.
Note: This article is part of my in-progress ebook, How to Train your Chatbot, a jam-packed handbook on how to use LLMs to build all sorts of cool stuff. You can get early access to the ebook (PDF and EPUB) as well as the full source code of all demos at the following link:
In this article, we will continue our exploration of retrieval augmented generation, but zooming out to the widest possible scale. Instead of answering questions from a restricted domain, like a PDF, let’s tackle the opposite problem: a bot that can answer anything at all!
The problem is that our bot doesn’t know everything that has happened in the world, and, in particular, it knows nothing about what has happened since its base pretraining.
But you know who does? Google! Well, not exactly everything, but a giant search engine like Google is the closest we have to a repository of the world’s knowledge. However, as everyone who’s ever used a traditional search engine knows, it is often not trivial to find the answer to your question. You have to sift through several of the search results and skim through their content.
On the other hand, LLMs, even the most powerful and recent ones, are by definition static, in the sense that you cannot update their knowledge without some costly retraining. And even if you could, there is still a limit to how many raw facts they could learn and then retrieve flawlessly. As we’ve seen, LLMs are lousy at remembering minute details like dates and are prone to subtle hallucinations.
So, how can we combine the language understanding and synthesis power of LLMs with the recency and breadth of knowledge of search engines?
With RAG, of course!
Note: If this is the first article of this series for you, make sure to check the previous articles here and here, and my guide to retrieval augmented generation. This article builds heavily on previously discussed concepts.
The plan
Keeping in line with the previous article, I will first explain, at a broad level, the workflow and the necessary functionality for this app, and then we’ll go into implementation details. Since much of this demo builds on previous functionality, I won’t show you the whole code, only the novel parts.
You can check a working version of this demo online at https://llm-book-search.streamlit.app. This demo is provided on a best-effort basis and might be taken offline at any moment.
At a high level, this demo looks like a poor man’s search engine. We’ll have a simple search bar where the user can type a query. But actually, this isn’t just a search engine, it’s an answer engine! Instead of listing the matching websites, we’ll take it a step further and crawl those websites, read them, find a relevant snippet that answers the user query, and compute a natural language response with citations and all. Furthermore, we won’t even pass the exact user query to Google, but instead ask the LLM to suggest a small set of relevant queries that are more suitable for search.
Here is a screenshot of the final result, showing our answering engine talking about recent news (the launch of GPT-4o) which happened way after its training:
Here is the general workflow:
The user types a question.
The LLM provides a set of relevant related queries.
For each query, we obtain a set of search results from Google.
For each search result, we crawl the webpage (if possible) and store the raw text.
Given the original user question and the crawled text, we build a suitable context.
The LLM answers the question with the provided context.
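Before diving in, here is a rough sketch of how these steps fit together in code. The helper function names are placeholders for the pieces we implement throughout the rest of the article, not the exact names used in the demo:

```python
# A rough sketch of the whole pipeline. The helpers are placeholders
# for the pieces implemented in the rest of the article.

def answer(question: str) -> str:
    # Steps 1-2: ask the LLM for a few search-friendly queries
    queries = suggest_queries(question)

    # Steps 3-4: search Google and crawl the resulting pages into text chunks
    chunks = search_and_crawl(queries)

    # Step 5: keep only the chunks most relevant to the original question
    context = select_relevant_chunks(question, chunks)

    # Step 6: let the LLM answer, grounded in the retrieved context
    return answer_with_context(question, context)
```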
There are several problems we must solve to implement this workflow, and we will go over them one at a time.
Getting the relevant queries
The first step is to compute a set of relevant queries related to the user question. Why? Well, first, because the user question might not be the type of query that search engines are good at solving. If you’re using Google, this might be less of an issue, but if you apply this strategy to a self-hosted traditional search engine like ElasticSearch, which relies on keyword-based retrieval with no semantic analysis, it matters a lot more.
But more importantly, there simply might not be a single best document (or webpage) with an answer to the user question. For example, the user might ask for a comparison between two products when there is no actual comparison anywhere in the data; all you have is separate documents describing each product. In this case, the best strategy is to perform separate queries for each product and let the LLM do the comparison, effectively leveraging the best of both paradigms.
Here is a prompt to achieve this. It’s a simple but effective combination of prompting strategies. At its core, this is a zero-shot instruction, but we also include a brief chain of thought to nudge the model into a more coherent answer structure. Finally, we instruct the model to produce a structured response so we can parse it more easily.
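Something along these lines (paraphrased; the exact wording in the demo may differ slightly):

```
You are an expert search assistant. The user will ask you a question.

First, briefly explain your interpretation of what the user wants to know.
Then, suggest up to three short search engine queries that, taken together,
should surface the information needed to answer the question.

Reply with a single JSON object, and nothing else, in the following format:

{
    "interpretation": "<a one-sentence interpretation of the question>",
    "queries": ["<query 1>", "<query 2>", "<query 3>"]
}
```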
From this prompt, any reasonably tuned model will produce something like the following, for the query “what is gpt4-o and how does it compare to gpt4”:
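Paraphrasing a typical response (this is illustrative, not a verbatim model output):

```json
{
    "interpretation": "The user wants to know what GPT-4o is and how it compares to GPT-4, OpenAI's large language model. I am not familiar with GPT-4o, so it is likely a newer model.",
    "queries": [
        "what is GPT-4o",
        "GPT-4o vs GPT-4 differences",
        "GPT-4o announcement"
    ]
}
```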
Notice that, from the interpretation, we can see the model “knows” what GPT-4 is but not GPT-4o. Yet, it produced pretty sensible search queries, including an individual query for GPT-4o, which is very close to what we would do if we were researching this subject on our own.
Implementing this structured response prompt in our system takes a bit more effort than just passing the prompt. The reason is that we want to force the model as much as possible to produce a well-formed JSON object and, while careful prompting can get us pretty far, it is still possible for the model to deviate from producing just a JSON object, adding chatty messages like “Of course, here is your JSON object:”, which would make the response harder to parse.
For this purpose, most OpenAI-compatible APIs implement something called JSON mode which forces the response to be a parseable JSON object. It won’t guarantee that you get the JSON structure you asked for (this is, in general, not solvable without modifying the sampling method) but it will guarantee that, if the model responds at all, it will be a well-formed JSON.
To take advantage of this API feature in our implementation, we will add a JSON mode method to our Chatbot class, which also skips the conversation history workflow, because we usually don’t want these messages to be part of a traditional conversation, but rather use them for one-off instructions.
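A minimal sketch of such a method, assuming the class wraps an OpenAI-compatible client in self.client and stores the model name in self.model (the method name ask_json and these attribute names are illustrative, not necessarily what the actual Chatbot class uses):

```python
import json


class Chatbot:
    # ... the rest of the class, as defined in the previous articles ...

    def ask_json(self, system: str, prompt: str) -> dict:
        """One-off instruction in JSON mode, bypassing the conversation history."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            # Ask the API to guarantee a well-formed JSON object
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)
```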
In the main application loop, we simply call this method and access the queries key to get the list of suggested queries.
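In Streamlit terms, the call boils down to something like this (chatbot and the query-suggestion prompt QUERY_SYSTEM_PROMPT are assumed to be defined elsewhere in the app; the names are illustrative):

```python
import streamlit as st

question = st.text_input("Ask me anything")

if question:
    # QUERY_SYSTEM_PROMPT is the query-suggestion prompt from the previous section
    response = chatbot.ask_json(QUERY_SYSTEM_PROMPT, question)
    queries = response["queries"]
```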
Searching online
This is probably the easiest part of the demo. We can use any number of Google Search wrappers to obtain a list of web pages given a set of queries. In this case, I’m using googlesearch-python, which is one of the libraries with the most GitHub stars, but feel free to experiment.
The first step is to combine all the results from the different queries into a single set so that we don’t crawl a web page twice.
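Here is a sketch of this step, assuming the queries list from the previous section, a results_per_query parameter (one of the numeric inputs mentioned later), and a crawl helper that we define next:

```python
from googlesearch import search

# Combine the results of all queries into a single set of URLs,
# so we never crawl the same page twice.
urls = set()

for query in queries:
    for url in search(query, num_results=results_per_query):
        urls.add(url)

# Crawl every page (the `crawl` helper is sketched in the next step)
# and collect the text chunks from all of them.
all_chunks = []

for url in urls:
    all_chunks.extend(crawl(url))
```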
This isn’t the exact code in our application, because we’d have some streamlit-specific instructions in there to print some status messages while crawling.
The next step is to crawl each of these results, skipping the ones that fail (because they are not HTML content, take too long to load, etc.). We use BeautifulSoup to parse the HTML and obtain a blob of continuous text extracted from every HTML node that has any text at all. This isn’t the prettiest or most robust way to parse an HTML file, especially if you want to show it to a human, but for our purposes, it works pretty well because the LLM will be able to sift through it and extract the relevant parts.
Notice that, at the same time we’re parsing the HTML, we split the resulting text into chunks of, say, 256 words, and store each chunk separately. You can probably already guess why, right?
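A rough sketch of such a crawl helper, using requests and BeautifulSoup (the exact code in the demo handles errors and timeouts in more detail):

```python
import requests
from bs4 import BeautifulSoup


def crawl(url: str, chunk_size: int = 256, timeout: int = 5) -> list[str]:
    """Download a page and return its text split into chunks of ~chunk_size words."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
    except Exception:
        # Not reachable, not HTML, too slow to load, etc. -- just skip it
        return []

    # One blob of continuous text from every node that has any text at all
    words = soup.get_text(separator=" ", strip=True).split()

    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```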
Finding the right context
The next problem we have to solve is giving the LLM a reasonably short fragment of the relevant web pages where the answer to the user query might be. Depending on how you configure this demo, the search step might have extracted hundreds of chunks with thousands of words in total, many of which might be irrelevant. For example, you may ask for a specific event date and get a whole Wikipedia page where that event is mentioned only in passing in one of the paragraphs.
To solve this problem we will resort again to the most effective augmentation strategy, retrieval augmented generation, and our old friend the VectorStore class. We will index the extracted chunks on the fly and immediately extract the most relevant ones.
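A sketch of this step, assuming the all_chunks list from the crawling step and a VectorStore with add and search methods (the method names and parameters in your implementation may differ):

```python
# Index every extracted chunk on the fly...
store = VectorStore()

for text in all_chunks:
    store.add(text)

# ...and immediately retrieve the ones most relevant to the user question,
# attaching an incremental id to each for later citation.
chunks = [
    dict(id=i, text=text)
    for i, text in enumerate(store.search(question, top_k=10))
]
```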
The result of this snippet is a chunks list containing a small number of relevant chunks, formatted as Python dictionaries with the following structure:
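(Fields other than id are illustrative and depend on what you store during crawling; here we keep just the raw text.)

```python
{
    "id": 3,       # incremental index, used later for explicit references
    "text": "..."  # a ~256-word chunk of text extracted from one of the crawled pages
}
```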
The id field, computed as an incremental index, will be useful later on for printing explicit references.
Building the final response
All that’s left is formatting a proper RAG-enabled prompt and injecting the relevant content extracted from the previous step. Here is the prompt we’re using:
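(Paraphrased; the exact wording may differ slightly, and the {chunks} and {question} placeholders are filled in at runtime.)

```
You are a helpful assistant that answers questions using only the
context provided below. Each context fragment has an id field.
Whenever you use information from a fragment, cite it with its id
in square brackets, like [3]. If the context does not contain the
answer, say so instead of guessing.

Context:
{chunks}

Question:
{question}
```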
Notice how we explicitly instruct the model to produce square-bracketed references whenever possible. The chunks we inject in the context are pretty-printed JSON objects from the previous section that contain a convenient id field.
And that’s it: all that remains is a few streamlit-specific bells and whistles here and there to get this demo up and running. For example, we add a few numeric inputs to let the user play around with the parameters of the search (how many queries to perform, how many chunks to extract, etc.).
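For instance, something along these lines (the labels and default values are illustrative):

```python
import streamlit as st

num_queries = st.number_input("Search queries to generate", min_value=1, max_value=5, value=3)
results_per_query = st.number_input("Results per query", min_value=1, max_value=10, value=5)
top_chunks = st.number_input("Chunks to include in the context", min_value=1, max_value=20, value=10)
```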
And, if you check the full source code, we also have a few instructions here and there to make this semi-fancy layout in two columns with a search bar at the top.
Conclusions
Retrieval augmented generation is an extremely versatile paradigm. In the previous article in this series, we saw the standard approach, using just vector-based retrieval from a static text source. In this article, we use a generic search engine as the data source, taking advantage of the massive complexities hidden behind what looks like a simple Google search. We’re leveraging years and years of innovation in collection, storage, and lightning-fast retrieval from billions of web pages. We’re standing on the shoulders of giants, quite literally.
You can adapt this demo to any search engine that provides a text search interface. A common use case is getting all your institutional knowledge in a self-hosted search-enabled database like ElasticSearch, and using it to build an in-house Q&A bot.
By now, you should be starting to see a pattern in this integration of large language models with other traditional technologies. We will use the model to transform the user input into something structured to feed our underlying tool, compute some results, and then use the model again to produce a natural language response. This is why we talk about LLMs as a natural language user interface. It’s basically a wrapper for any traditional computational tool that adds a very convenient conversational input and output, but the tool still performs the basic computation.
This combination of LLMs with other tools helps bridge the gap between the incredible versatility of language models and the efficiency and robustness of more traditional tools, while minimizing (although not entirely eliminating) many of the inherent limitations of LLMs like hallucinations and biases.
In future articles, we will stretch this paradigm to its limits, making our LLM interact with all sorts of APIs and even produce its own executable code.
In this article, I didn't go too deep into ensuring the JSON response is properly formatted, but in the book we tackle this problem more seriously and build a safe JSON parser that, most of the time, ensures the response is not only valid JSON but actually conforms to a user-defined class. Feel free to ask if you want more details.