An introduction to the most important problem in Computer Science and one of the most clever algorithms ever devised.
Searching is at the core of the vast majority of software ever written. From scientific computation to everyday apps, there’s no shortage of examples where searching for something, whether a piece of information, a concrete datum, or a solution to an equation, is a fundamental operation. Databases, ubiquitous in modern applications, are the most visible and common implementation of search.
But searching in Computer Science goes far beyond scanning a list of elements to find the one that matches a query. Searching is one of the fundamental strategies for problem-solving. The hardest problems in Computer Science, the so-called NP-Hard problems, are all about search. So it’s no wonder search algorithms abound, from basic linear search to highly elaborate data structures, heuristics, and approximations.
In this and one or more follow-up posts, I want to tell you about the basic ideas behind searching and sorting and how to make both processes efficient. And finally, I want to show you a beautiful and surprising relationship between both problems, which has interesting philosophical implications for Computer Science at large.
Suppose you have to search for a given book in a huge library, and all you know is the title. The simplest (and dumbest) strategy is to scan the bookshelf from top to bottom, left to right, looking at every single book until you find the one you’re looking for, or you’re convinced it is not in there. If you have, say, 1 million books1, and the time it takes you to grab a book, read its title and put it back is on average 30 seconds, it could take you almost a year (~347 days) to scan the whole library. Assuming the book you’re looking for is equally likely to be anywhere in the library, every query will take on average close to six months!2
Can we do any better? Well, of course! If we only had some sort of organization, for example, putting all books that start with the same letter on contiguous shelves, we could speed up the search immensely. We can complicate this system as much as we want, with rows tagged themselves also by the second letter of the title, and so on.
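As a minimal sketch of this bucketing idea (the titles and helper names here are illustrative, not from the original), here is what a first-letter index might look like in Python. Grouping titles by their initial letter means a lookup only has to scan one bucket, roughly 1/26th of the collection:

```python
from collections import defaultdict

def build_index(titles):
    """Group titles by their first letter, mimicking shelves tagged A-Z."""
    index = defaultdict(list)
    for title in titles:
        index[title[0].upper()].append(title)
    return index

def find(index, title):
    """Scan only the single bucket the title could possibly be in."""
    return title in index[title[0].upper()]

# A hypothetical mini-library for illustration.
library = ["Dune", "Emma", "Dracula", "Beloved", "Blindness"]
idx = build_index(library)
print(find(idx, "Dracula"))  # True
print(find(idx, "Hamlet"))   # False
```

Tagging shelves by the second letter too would just mean nesting this same trick one level deeper.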
However, there is one simple strategy we can use that is almost too good to be true. What if we just sort all books by title, starting from the top left shelf to the bottom right? How fast can we search now? The answer, it turns out, is that we can search as fast as possible.
Searching as fast as possible
The basic idea we will follow is quickly discarding the biggest possible number of books with the minimum amount of effort. If the books are sorted, and we pick a random book, we can immediately discard either all the books before or all the books after it, depending on whether the title we’re looking for comes before or after the title we randomly picked. We repeat this procedure among the remaining set of books until we either find the one we’re looking for or discard all the books.
How much faster is this approach? Well, it depends on how many books we can discard in each step. If we pick one of the books close to the beginning of the list, e.g., the 10,000th book, then two things can happen. If we’re lucky and the title we’re looking for comes before, we’ve just discarded 99% of the library by looking at a single book! But if we’re unlucky, we only discarded the first 1%. Intuitively, we should try to discard the same amount of books regardless of whether the one we’re looking for comes before or after, so we should always split the list in half.
This algorithm is called Binary Search, and it is one of the most beautiful algorithms you encounter when first learning to program.
If we do this, looking at the 500,000th book first, then either the 250,000th or the 750,000th, and so on, we can narrow down the list of possible books pretty fast. A simple information-theoretic argument shows that this is the best possible strategy if the only operation we can do is compare titles.3 In the worst case, we will have to do about log₂(N) comparisons, where N is the initial number of books, which for 1 million gives us a whopping total of 20 comparisons! Even if each comparison takes us 10 minutes instead of 30 seconds (presumably because we have to walk across the library to go from the 500,000th to the 750,000th book) we still find the book we’re looking for, among 1 million books, in a mere 3 hours. Compare that with the six months our naive strategy required!
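The halving procedure above can be sketched in a few lines of Python. The million synthetic titles are just a stand-in for the library; the comparison counter makes the log₂(N) bound visible:

```python
def binary_search(items, target):
    """Return (index, comparisons); index is None if target is absent."""
    lo, hi = 0, len(items) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if items[mid] == target:
            return mid, comparisons
        elif items[mid] < target:
            lo = mid + 1   # discard mid and everything before it
        else:
            hi = mid - 1   # discard mid and everything after it
    return None, comparisons

# A sorted "library" of one million synthetic titles.
library = [f"Book {i:07d}" for i in range(1_000_000)]
_, steps = binary_search(library, "Book 0765432")
print(steps)  # at most 20 comparisons for a million books
```

In practice, Python’s standard `bisect` module implements this same halving strategy; the explicit version here only exists to show how few steps it takes.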
The lesson from Binary Search
Binary Search is the first truly clever algorithm you learn about, one that exploits an innocent characteristic of the problem to obtain a maximally efficient strategy. The main takeaway from Binary Search is a lesson that most grandmas are keen on repeating over and over to their grandchildren: if you keep things in order, you will have a much easier time finding them. Granny just didn’t know how true this can be. By exploiting the order of elements and carefully choosing where we look, we can speed up the search exponentially.4
In brief, it pays to keep things sorted.
Hence, next time we’ll take a look at sorting, the complementary operation of searching, how to make it as fast as possible, and a surprising connection that will close the story beautifully.
And this is all for today. I want to thank @Hrokrinn for their proofreading and suggestions, and all of my subscribers and Twitter followers for giving me the inspiration to keep writing.
This is not an exaggeration by any means. For example, the Library of Congress has over 25 million books and close to 170 million items in total. There are hundreds of libraries worldwide with more than 1 million physical books inside.
This is why libraries have catalogs, indexes, and all sorts of shortcuts for quickly finding what you’re looking for.
An intuitive explanation goes like this. We have to find a specific element X among N, and it could be potentially anywhere, so the probability that any given element is X is 1/N. How can we ask a single question that gives us the maximum amount of information? One way to look at this problem is through the relation between information and entropy: the more entropy a distribution has, the more uncertain we are, and the more information we still need to pin down the answer. Our current best guess is that X can be anywhere, uniformly distributed, and the uniform distribution has the maximum entropy among all discrete distributions over N elements, so we start out with the least amount of information.
If we split the N elements at a fraction k, into two subsets of sizes k·N and (1-k)·N respectively, then the expected amount of information we gain with that single question is the binary entropy H(k) = -k·log(k) - (1-k)·log(1-k): with probability k the answer lands in the first subset, and with probability 1-k in the second. A simple calculus argument shows that this quantity is maximized when k = 0.5, which is when we split the collection in half.
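A quick numeric check of this claim (a sketch; using base-2 logarithms so the gain is measured in bits):

```python
import math

def info_gain(k):
    """Binary entropy H(k): expected information from a split at fraction k."""
    return -k * math.log2(k) - (1 - k) * math.log2(1 - k)

# Sample a few split points; the even split wins.
for k in (0.01, 0.1, 0.25, 0.5, 0.75, 0.99):
    print(f"k={k:.2f}  gain={info_gain(k):.3f} bits")
# H(0.5) = 1 bit, the maximum: each balanced question halves the candidates.
```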
To see why this is the case, consider looking at Binary Search from the inverse function viewpoint. Let’s say searching with Binary Search takes K operations (K comparisons, that is). In K comparisons, you can find any given element among N = 2^K (that’s 2 to the power of K), which is the same as saying K = log₂(N). Thus, a naive linear search will require exponentially more effort than a Binary Search in a collection of the same size.