The Computist Journal: 🧠 The Science of Computation

The fastest algorithm ever devised

Alejandro Piad Morffis — Mon, 01 Jun 2026 11:30:44 GMT

Every post in June is related to algorithms, anchored on The Algorithm Codex. I’m also running an experiment this month: stretching the pace to one post a day, with a per-day theme. Monday is for Mind-blowing. Let’s see how that goes.

Photo by Arnaud Mesureur on Unsplash. I promise, it will make sense in the end. It’s a bunch of trees…

When someone tells you they have the “fastest algorithm ever devised”, you probably think they bumped into a trick to make clever use of the cache, or something equally tasteless, perhaps YouTube-worthy for 2 hours. I know I would if you came to me with this title.

But this is not that post. This is about the actual fastest algorithm we have ever devised, and probably the fastest anyone ever will. And not in a trivial sense. There is one problem in computer science (practical, appearing everywhere you look, explainable in two sentences) where we have an algorithm, a correctness proof, a complexity bound, and a matching lower bound that closes the problem.

Done. Problem solved. No more papers.

That kind of closure is the rarest thing in any field of human knowledge.


Four conditions, one winner

But what does “fastest algorithm” mean? There are many ways you can interpret this phrase, and some of them are trivially true.

Reading an entry from an array takes constant time. Nothing beats that. But it’s one operation on a flat structure, no progression, nowhere lower to go. Getting the title should require something harder.

The interesting frontier is a non-trivial problem with a naive solution everyone reaches for, and another, mind-blowingly faster solution that is provably optimal. The problem we analyze today has the four traits we want:

First, general: appears in real systems across many domains, not a toy. Second, explainable in two sentences: if you need a paragraph to state it, the optimality result arrives too late to matter. Third, a real progression from naive to optimal: intermediate ideas that each earned their place, not a single flash of inspiration. Fourth, provably closed: algorithm and lower bound meet exactly. Not “hard to improve.” Closed.

Our problem is Union-Find, which clears all four. Nothing else with a non-trivial structure clears them as cleanly. Let me explain it to you in two sentences.

You’ve used this all day

Union-Find goes like this. You have n things, organized into groups. You need to merge two groups and ask whether two things are in the same group.

That’s it. No deletion. No ordering. No traversal. Two operations on a shifting partition: union and find.

You’ve used it today, probably without noticing. Your phone’s photo app groups every selfie of the same person into a face album by computing connected components, and connected components is what Union-Find does for a living. The bucket tool in any image editor (Paint, Photoshop, Procreate, whatever) floods a region with color by labeling connected pixels? Yes, Union-Find.

Every time your Rust or Haskell compiler decides two type variables must share a type, Hindley-Milner-style type inference unifies them, and unification is a textbook job for Union-Find. Network design leans on it too: Kruskal’s algorithm, the classic way to find a minimum spanning tree (the cheapest set of links that connects everything without loops), needs Union-Find to detect cycles.

The same idea shows up across classical computer science under different names. The canonical image segmentation algorithm (Felzenszwalb and Huttenlocher, 2004) is the bucket tool’s research-grade cousin. Procedurally generated mazes in games are built by running Kruskal on a grid. Percolation theory in physics, the study of how fluids flow through porous materials, is the same partition-growing-by-merging problem in continuous form. Whenever a system has equivalence classes that grow online, Union-Find is almost always the engine underneath.

Union-Find is actually pretty simple

The naive solution is straightforward: give every element a label. To merge two groups, sweep through the array and relabel everything in one group with the other’s identifier. Works perfectly. Linear work per merge. Nothing cheaper to write, and nothing more expensive to run at scale.
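The naive labeling scheme can be sketched in a few lines. This is a minimal illustration, assuming elements are the integers 0 to n-1; the class name `QuickFind` and its attribute names are made up for this example.

```python
class QuickFind:
    """Naive Union-Find: every element stores the label of its group."""

    def __init__(self, n: int) -> None:
        # Initially every element is alone in its own group.
        self.label = list(range(n))

    def find(self, x: int) -> int:
        # Constant time: the stored label is the group identifier.
        return self.label[x]

    def union(self, x: int, y: int) -> None:
        old, new = self.label[x], self.label[y]
        if old == new:
            return
        # Linear time: sweep the whole array and relabel one group.
        for i, current in enumerate(self.label):
            if current == old:
                self.label[i] = new


groups = QuickFind(5)
groups.union(0, 1)
groups.union(3, 4)
same = groups.find(0) == groups.find(1)       # True
different = groups.find(1) == groups.find(3)  # False
```

The sweep inside `union` is where the linear cost per merge comes from: it touches every element, no matter how small the two groups are.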

Union-Find: n elements, shifting partitions, two operations. Before: three groups. After union(x, y): two. Made with Tesserax.

The first idea that earns its keep is a structural change. Stop labeling every element. Give each element a parent pointer: a single reference to another element. The root of each tree is the element that points to itself, and the root names the group.

To check if two elements share a group: walk each one up its parent pointers to a root. Same root, same group.

To merge two groups: find both roots and point one at the other. One pointer write, regardless of how many elements are involved.

The trade has moved. Merge is cheap. But your find now costs as much as the tree is tall.

Each group is a tree. Arrows point from children up to parents; the root points to itself.

Here is the failure mode. An adversarial union sequence can build a long thin chain: element one points to element two, which points to element three, all the way down. Every find walks the whole chain. Linear again.
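Here is a rough sketch of the parent-pointer forest and of the chain that breaks it, assuming integer elements 0 to n-1. The class `Forest` and the helper `depth` are hypothetical names introduced only for illustration.

```python
class Forest:
    """Parent-pointer Union-Find with no balancing at all."""

    def __init__(self, n: int) -> None:
        # Every element starts as its own root (it points to itself).
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # Walk up the parent pointers until reaching a root.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        # One pointer write: hang x's root under y's root.
        root_x, root_y = self.find(x), self.find(y)
        if root_x != root_y:
            self.parent[root_x] = root_y

    def depth(self, x: int) -> int:
        # Number of hops from x up to its root (illustrative helper).
        hops = 0
        while self.parent[x] != x:
            x = self.parent[x]
            hops += 1
        return hops


# Adversarial sequence: always hang the growing chain under a fresh element.
n = 1000
chain = Forest(n)
for i in range(n - 1):
    chain.union(i, i + 1)

worst = chain.depth(0)  # element 0 ends up n - 1 hops away from the root
```

Each merge really is a single pointer write, but a find on the bottom element of this chain has to climb through every other element.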

The rest of the story is about keeping the trees short, like, really, really short.

Making Union-Find fast, really fast

The first fix is the obvious one once you’ve seen the failure. Track each tree’s height: its rank. When merging, always hang the shorter tree under the root of the taller. The merged tree is no taller than the taller of its inputs.

Union by rank: the shorter tree hangs under the taller root. The merged tree’s height is unchanged.

The counting argument is clean. A tree of rank r must contain at least 2^r elements. You can prove it by induction on the merge rule. So the rank, which bounds the height, can never exceed log₂(n). Both operations now run in O(log n). A million elements: naive worst case is a million steps; rank-bounded is twenty.
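A minimal sketch of the rank rule follows, again assuming integer elements; `RankedForest` is a hypothetical name. The demo merges trees tournament-style (equal heights every time), which is the pattern that makes ranks grow as fast as they possibly can.

```python
import math


class RankedForest:
    """Parent-pointer Union-Find with union by rank (no path compression)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.rank = [0] * n  # upper bound on the height of each root's tree

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return
        # Make root_x the root of the taller (or equally tall) tree.
        if self.rank[root_x] < self.rank[root_y]:
            root_x, root_y = root_y, root_x
        # The height grows only when two equally tall trees merge.
        if self.rank[root_x] == self.rank[root_y]:
            self.rank[root_x] += 1
        self.parent[root_y] = root_x


n = 1024
forest = RankedForest(n)
step = 1
while step < n:
    for i in range(0, n, 2 * step):
        forest.union(i, i + step)  # always merges two trees of equal rank
    step *= 2

tallest = max(forest.rank)                          # 10, exactly log2(1024)
steps_for_a_million = math.ceil(math.log2(10**6))   # 20
```

Even under the worst merge order, 1024 elements never produce a rank above 10, which matches the 2^r counting argument.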

One integer per node buys us exponential improvement in speed. Read that again.

Most algorithms textbooks would stop here. The second fix is what changes this from a really fast algorithm to the fastest algorithm ever, period.

Notice something about the walk-to-root operation. When it finishes, it has discovered the root. Every node it visited could be rewired, right now, on the way back, to point straight at the root.

The structure flattens itself, in place, every time you ask it a question. The next walk on that path takes one step. Work you already did is work you never have to do again.

Path compression: find(x) collapses the chain into a star. Every walked node now points straight at the root.

On its own, this trick is good. Combined with the rank rule, the cost drops below logarithmic, into a regime that barely has a name.
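Putting both ideas together gives something like the sketch below. It is one common way to write the structure, not necessarily the version in the book: this `find` uses two passes (discover the root, then rewire), and the names are assumptions for illustration.

```python
class UnionFind:
    """Union by rank plus path compression."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x: int) -> int:
        # First pass: walk up to discover the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: rewire every visited node straight to the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, x: int, y: int) -> None:
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return
        if self.rank[root_x] < self.rank[root_y]:
            root_x, root_y = root_y, root_x
        if self.rank[root_x] == self.rank[root_y]:
            self.rank[root_x] += 1
        self.parent[root_y] = root_x

    def connected(self, x: int, y: int) -> bool:
        return self.find(x) == self.find(y)


# Build the tallest tree union by rank allows for 16 elements.
uf = UnionFind(16)
step = 1
while step < 16:
    for i in range(0, 16, 2 * step):
        uf.union(i, i + step)
    step *= 2

before = uf.parent[15]   # 14: element 15 sits at the bottom of a 4-hop path
root = uf.find(15)       # walking the path also flattens it
after = uf.parent[15]    # 0: element 15 now points straight at the root
```

After that single `find(15)`, every node on the walked path (15, 14, 12) points directly at the root, so the next query on any of them takes one hop.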

How fast is this? Strap in, baby.

Union-Find is the fastest algorithm ever devised

Robert Tarjan, in 1975, proved the two ideas together give an amortized cost per operation of the inverse Ackermann function, written α(n).

To understand why that means fast, you need a picture of what you’re inverting. The Ackermann function grows like hell. Faster than exponentials, faster than towers of exponentials, faster than any function built by ordinary recursion. A(4, 4), only four steps in, already exceeds the number of atoms in the observable universe. It’s a freaking monster, one of the fastest growing functions ever invented.

The inverse grows correspondingly sloooowww. For any n less than 2 to the power of 65536 (a number whose decimal representation runs to nearly twenty thousand digits), α(n) ≤ 4. Read that again. For any input you could construct on any computer that will ever be built, the average cost per operation is at most four pointer hops.
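To get a feel for the growth, here is a small sketch using the textbook two-argument Ackermann-Péter definition. That choice is an assumption for illustration: the variant used in the actual analysis differs in its details, and the exact definition of α varies between presentations. Only tiny inputs are feasible, which is rather the point.

```python
import math


def ackermann(m: int, n: int) -> int:
    """Two-argument Ackermann-Péter function; only tiny inputs are feasible."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))


# Rows m = 0..3 evaluated at n = 2: successor, addition, doubling, exponentiation.
small = [ackermann(m, 2) for m in range(4)]  # [3, 4, 7, 29]

# Decimal digits of 2**65536, computed from the logarithm instead of
# printing the number itself.
digits = math.floor(65536 * math.log10(2)) + 1  # 19729
```

Row 4 is already out of reach for this naive recursion beyond the very first values, and the inverse of a function like that is what bounds the cost of each Union-Find operation.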

Effectively constant. Call it constant, for practical purposes. But it is not constant. There is a real function in there. That fact is not a technicality. This is not a gap waiting for a smarter trick. It is a permanent lower bound, proved by Fredman and Saks in 1989.

Michael Fredman and Michael Saks worked in the cell-probe model, which counts how many stored bits any data structure must read to answer a query. They proved that Ω(α(n)) is the true lower bound for the union-find problem. No structure of any design can answer union and find queries doing less amortized work than α(n).

Tarjan’s upper bound says path-compression-plus-union-by-rank does at most α(n) amortized. Fredman and Saks’s lower bound says no structure can ever do less. The two bounds meet exactly.

Done. Problem solved. No more papers.

Problem solved is a rare luxury

Newton’s laws work until you approach the speed of light. Quantum mechanics works until you zoom out to cosmological scales. Turbulence in fluids is still open. It has been for a century. The three-body problem is still open. Physics doesn’t finish; it hands off. Every theory we have is a model that fits data within its domain of validity, and I think we forget how strange that is.

The Schrödinger equation describes how electrons move in any atom. For helium (two electrons) it has no closed-form solution. The electron interactions couple (each one affecting all the others) in ways that resist exact treatment. Chemistry approximates, and the gap is permanent, not a waiting-for-faster-computers problem.

Mathematics does close problems, and famously. Wiles closed Fermat’s Last Theorem in 1995, after the conjecture had sat open for 358 years. The four-color theorem closed in 1976 (with a computer’s help, which still rankles some). Poincaré closed in 2003. But the closures cluster on problems mathematicians chose: beautiful, deep, and often distant from anything that runs in your hand.

Computer science almost never gets all three at once. A problem that is practical, used in real systems every day; crisply specified, fitting in two sentences; and provably closed, algorithm and lower bound matched exactly. Union-Find is in this corner. Comparison-based sorting being Θ(n log n) is in it. Linear search on unsorted data being Θ(n) is in it. The list is short.

Algorithm and lower bound, meeting exactly.

Done. Problem solved. No more papers.

The fastest non-trivial algorithm ever devised takes twenty lines to write, runs in essentially constant time per operation, and is one of the very few things we have ever — in any field — entirely finished.

If that isn’t freaking mind-blowing, I don’t know what is. Until next time, stay curious.

Union-Find is chapter 14 of The Algorithm Codex, a book that spells out the algorithms every programmer eventually meets in real, functioning Python — no pseudocode bullshit, no complicated math.

Every chapter ends with the same three questions: is it correct? how efficient? is it optimal? Most chapters land two out of three. A small few land all three. Union-Find is the cleanest example in the book of all three boxes ticked.


The Basics of Search

Alejandro Piad Morffis — Tue, 27 Jan 2026 15:32:20 GMT

This article is a draft of Chapter 1 of The Algorithm Codex. You can read the entire Part I (chapters 1 through 6) online for free (forever). If you want to support this initiative, feel free to grab the official PDF.

Photo by Katerina Kerdi on Unsplash

Searching is arguably the most important problem in Computer Science. In a very simplistic way, searching is at the core of critical applications like databases, and is the cornerstone of how the internet works.

However, beyond this simple, superficial view of searching as an end in itself, you can also view search as a means for general-purpose problem solving. When you are, for example, playing chess, what your brain is doing is, in a very fundamental way, searching for the optimal move: the one that most likely leads to winning.

In this sense, you can view almost all Computer Science problems as search problems. In fact, a large part of this book will be devoted to search, in one way or another.

In this first chapter, we will look at the most explicit form of search: where we are explicitly given a set or collection of items, and asked to find one specific item.

We will start with the simplest, and most expensive kind of search, and progress towards increasingly more refined algorithms that exploit characteristics of the input items to minimize the time required to find the desired item, or determine if it’s not there at all.

Linear Search

Let’s start by analyzing the simplest algorithm that does something non-trivial: linear search. Most of the algorithms in this chapter work on the simplest data structure that we will see, the sequence.

A sequence (Sequence class) is an abstract data type that represents a collection of items with no inherent structure, other than that each element has an index.

from typing import Sequence

Linear search is the most basic form of search. We have a sequence of elements, and we must determine whether one specific element is among them. Since we cannot assume anything at all from the sequence, our only option is to check them all.

def find[T](x: T, items: Sequence[T]) -> bool:
    for y in items:
        if x == y:
            return True

    return False

Our first test will be a sanity check for simple cases:

from codex.search.linear import find

def test_simple_list():
    assert find(1, [1,2,3]) is True
    assert find(2, [1,2,3]) is True
    assert find(3, [1,2,3]) is True
    assert find(4, [1,2,3]) is False

Analyzing Linear Search

Once we have an implementation, we must subject it to the three-step analysis established in our foundations.

Is it correct?

The property of correctness ensures that for any valid input, the algorithm produces the expected output. For linear search, we can verify this through three increasingly formal lenses:

The Exhaustive Argument: Suppose an element x exists in the sequence. By definition, there is some index i such that items[i] == x. Since the algorithm performs an equality test over every single index in the sequence without exception, it is logically impossible to miss the item if it is there.
The Inductive Argument: We can reason about the algorithm’s correctness across different input sizes. For a sequence of length 0, the loop never executes, and the algorithm correctly returns False. Assume the algorithm works for a sequence of length n. For a sequence of length n+1, the target x is either in the first n elements—where the inductive hypothesis ensures we find it—or it is the n+1-th element, which we check in the final iteration. If it is in neither, the algorithm correctly concludes it is not present.
The Loop Invariant: We can define a formal invariant for the for loop: At the start of iteration i, the element x has not been found in the first i−1 elements of the sequence. By the time the loop completes at iteration n, if the function hasn’t returned True, we know with certainty that x is not in the first n elements, which constitutes the entire sequence.

How efficient is it?

We analyze linear search using the RAM model, assuming each comparison and iteration step has unit cost.

Time Complexity: In the worst-case scenario (the item is at the very end or not present at all), we must perform n comparisons for a sequence of size n. This gives us a growth rate of O(n), or linear time.
Space Complexity: The algorithm only requires a constant amount of extra memory to store the loop variable and the target, regardless of the input size, resulting in O(1) space complexity.

Is it optimal?

Intuitively, linear search must be optimal for unstructured data. If we know nothing about the order or distribution of the elements, we are mathematically forced to look at every single item at least once to be certain x is not there. Any algorithm that skipped an element could be “fooled” if that specific element happened to be the one we were looking for. Thus, for a generic sequence, linear time is the best we can possibly do.

To prove this more formally, we employ an adversarial argument, a powerful technique in complexity theory where we imagine a game between our algorithm and a malicious adversary.

The Adversary’s Strategy: Suppose an algorithm claims to find an element x (or prove its absence) by examining fewer than n elements—say, n−1 elements. The adversary waits for the algorithm to finish its n−1 checks.
The “Trap”: Because there is one element the algorithm did not inspect, the adversary is free to define that specific element as x if the algorithm concludes “False,” or as something other than x if the algorithm concludes “True” without having seen it.
The Conclusion: Since the adversary can always change the unexamined element to make the algorithm’s answer wrong, any correct algorithm must inspect every element in the worst case.

This proves that the lower bound for searching an unstructured sequence is Ω(n). Linear search, which operates in O(n), meets this lower bound exactly, making it a tightly optimal solution for the problem as defined. Unless we possess more information about the data’s structure—the central theme of the next chapter—we simply cannot do better.
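The adversary game is easy to play in code. The sketch below assumes a deliberately flawed algorithm, `lazy_find`, that skips the last position, and a hypothetical `adversary` function that builds the input on which it fails; both names are made up for this example.

```python
def lazy_find(x, items):
    """A deliberately flawed search that never inspects the last element."""
    for y in items[:-1]:
        if x == y:
            return True
    return False


def adversary(x, n):
    """Build an input of size n on which lazy_find answers incorrectly."""
    # Fill every inspected position with something other than x,
    # then hide x in the single position the algorithm skips.
    return [x + 1] * (n - 1) + [x]


trap = adversary(7, 10)
answer = lazy_find(7, trap)  # False: the algorithm claims 7 is absent
truth = 7 in trap            # True: it was hiding in the unexamined slot
```

The same trick works against any fixed position an algorithm decides to skip, which is all the lower-bound argument needs.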

Indexing and Counting

The find method tells us whether an element exists in a sequence, but it doesn’t tell us where. We can easily extend it to return an index. We thus define the index method, with the following condition: if index(x,l) == i then l[i] == x, and no earlier position holds x. That is, index returns the first index where we can find a given element x.

def index[T](x: T, items: Sequence[T]) -> int | None:
    for i, y in enumerate(items):
        if x == y:
            return i

    return None

When the item is not present in the sequence, we return None. We could raise an exception instead, but that would force a lot of defensive programming.

Let’s write some tests!

from codex.search.linear import index

def test_index():
    assert index(1, [1,2,3]) == 0
    assert index(2, [1,2,3]) == 1
    assert index(3, [1,2,3]) == 2
    assert index(4, [1,2,3]) is None

As a final step in the linear search paradigm, let’s consider the problem of finding not the first, but all occurrences of a given item. We’ll call this function count. It will return the number of occurrences of some item x in a sequence.

def count[T](x: T, items: Sequence[T]) -> int:
    c = 0

    for y in items:
        if x == y:
            c += 1

    return c

Let’s write some simple tests for this method.

from codex.search.linear import count

def test_count():
    assert count(1, [1,2,3]) == 1
    assert count(2, [1,2,2]) == 2
    assert count(4, [1,2,3]) == 0

Analysis

We won’t dwell too much on this section since the analysis is very similar to linear search: these are just specialized versions of it. Once more, we have O(n) algorithms (with O(1) memory cost) for a problem that is provably Ω(n). Thus, given our assumptions (that there is no intrinsic structure to the elements’ order), we have optimal algorithms.

Min and Max

Let’s now move to a slightly different problem. Instead of finding one specific element, we want to find the element that ranks minimum or maximum. Consider a sequence of numbers in an arbitrary order. We define the minimum (maximum) element as the element x such that x <= y (x >= y) for all y in the sequence.

Now, instead of numbers, consider some arbitrary total ordering function f, such that f(x,y) <= 0 if and only if x <= y. This allows us to extend the notion of minimum and maximum to arbitrary data types.

Let’s formalize this notion as a Python type alias. We will define an Ordering as a function that has this signature:

from typing import Callable

type Ordering[T] = Callable[[T,T], int]

Now, to make things simple for the simplest cases, let’s define a default ordering function that just delegates to the items’ own < and == implementations. This way we don’t have to reinvent the wheel with numbers, strings, and all other natively comparable items.

def default_order(x, y):
    if x < y:
        return -1
    elif x == y:
        return 0
    else:
        return 1

Let’s write the minimum method using this convention. Since we have no knowledge of the structure of the sequence other than that it supports a total ordering, we have to test all possible items, like before. But now, instead of returning as soon as we find the correct item, we simply store the minimum item we’ve seen so far, and return at the end of the for loop. This guarantees we have seen all the items, and thus the minimum among them must be the one we have marked.

from codex.types import Ordering, default_order

def minimum[T](items: Sequence[T], f: Ordering[T] = default_order) -> T:
    m = items[0]

    for x in items:
        if f(x,m) <= 0:
            m = x

    return m

The minimum method can fail only if the items sequence is empty. In the same manner, we can implement maximum. But instead of coding another method with the same functionality, which is not very DRY, we can leverage the fact that we are passing an ordering function that we can manipulate.

Consider an arbitrary ordering function f and two items such that f(x,y) <= 0, which by definition means that x <= y. Now we want to define another function g that inverts the result of f, so that g(y,x) <= 0 exactly when f(x,y) <= 0. We can do this very simply by swapping the inputs in f, that is, g(x,y) = f(y,x).

def maximum[T](items: Sequence[T], f: Ordering[T] = default_order) -> T:
    return minimum(items, lambda x,y: f(y,x))

We can easily code a couple of test methods for this new functionality.

from codex.search.linear import minimum, maximum

def test_minmax():
    items = [4,2,6,5,7,1,0]

    assert minimum(items) == 0
    assert maximum(items) == 7

The correctness, cost, and optimality analysis is very similar in these cases as well.
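To see what the ordering function buys us, here is a self-contained sketch that restates minimum and maximum without the generic type annotations and adds a made-up ordering, `by_length`, that compares strings by their length. That ordering is an assumption for illustration and is not part of the codex package.

```python
def default_order(x, y):
    # Delegate to the items' own comparison operators.
    if x < y:
        return -1
    elif x == y:
        return 0
    else:
        return 1


def minimum(items, f=default_order):
    m = items[0]
    for x in items:
        if f(x, m) <= 0:
            m = x
    return m


def maximum(items, f=default_order):
    # Swapping the arguments inverts the ordering.
    return minimum(items, lambda x, y: f(y, x))


def by_length(x, y):
    # Negative when x is shorter, zero when equal, positive when longer.
    return len(x) - len(y)


words = ["pear", "fig", "banana", "kiwi"]
shortest = minimum(words, by_length)   # "fig"
longest = maximum(words, by_length)    # "banana"
alphabetical_first = minimum(words)    # "banana"
```

The same two functions answer three different questions, and the only thing that changed between calls is the ordering we passed in.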

Conclusion

Linear search is a powerful paradigm precisely because it is universal. Whether we are checking for the existence of an item, finding its index, or identifying the minimum or maximum element in a collection, the exhaustive approach provides absolute certainty. No matter the nature of the data, if we test every single element and skim through every possibility, the problem will be solved.

The primary drawback of this certainty is the cost: some search spaces are simply too vast to be traversed one item at a time. To achieve better performance, we must move beyond the assumption of an unstructured sequence. We need to know more about the search space and impose some level of structure.

In our next chapter, we will explore the most straightforward structure we can impose: order. We will see how knowing the relative position of items allows us to implement what is arguably the most efficient and beautiful algorithm ever designed.

But that’s a story for another Tuesday.

Remember you can read the full draft online (currently Part 1) or get a beautifully typed PDF if you want to support the project.

The Most Beautiful Algorithms Ever Designed

Alejandro Piad Morffis — Mon, 10 Feb 2025 12:03:23 GMT

Photo by Ken Suarez on Unsplash

Donald Knuth, the father of algorithm design and analysis and one of the most impactful, insightful, and influential people in Computer Science, said science is what we understand well enough to explain to a computer, and art is everything else we do. In his view, computer programming is a form of art, and he is undoubtedly a masterful artist.

Whether you subscribe to this broad notion of "art" or not, you won't deny there are a handful of ideas in Computer Science that are so elegant, creative, innovative, powerful, or just downright beautiful that they at least blur the boundary between science and art.

In this post, I want to kick off a new series exploring the idea that some algorithms are inherently beautiful, regardless of their technical details. I will suggest what I consider the most beautiful algorithms ever designed and give you a short rationale for their selection. Then, in future posts, I will discuss each of these in detail and, hopefully, convince you that they are, indeed, beautiful.


What makes an algorithm beautiful

Many believe, myself included, that beauty is in the eye of the beholder. Thus, deciding what the most beautiful algorithms ever are is a highly personal and subjective endeavour. Still, I want to justify my personal criteria for selecting some algorithms among others. You may agree to some extent with these points or disagree entirely, but I hope you'll at least understand where I'm coming from when I say some algorithms are beautiful.

First and foremost, I value elegance. Anyone can solve a complicated problem with a similarly complicated solution, full of special cases and tricks. But there is beauty in finding a simple, clear, elegant solution that happens to be as effective as, or more effective than, anything else you can think of.

Second, I value cleverness. Simple ideas, yes, but also powerful ideas that exploit some hidden insight or reveal something unique about the problem they're solving. Bonus points if they involve clever mathematics.

Finally, I also consider the impact of these algorithms. Many of the most beautiful algorithms are broadly applicable and unlock entire branches of knowledge that were previously hidden.

However, there are also algorithms so unique and weird that they deserve special mention, if only because they show how far the human intellect can go to find solutions to interesting problems. I have a few of these.

The List of Beautiful Algorithms

Based on the above criteria, I created a long list of candidate algorithms. Then, I narrowed it down to 16 because it is a beautiful but still manageable number. Of course, this limitation is totally arbitrary, but so is everything else in this article.

When deciding what to include, I also prioritized diversity. For example, I left out many beautiful graph algorithms, not necessarily because the other algorithms in this list are more beautiful. We could fill the whole list with clever ideas from graph theory, and I preferred to go wide rather than deep here.

All of these algorithms have some unique, clever insight that makes them work. All of them are paradigmatic of a particular technique for algorithm design. In general, I think they cover a broad spectrum of the type of reasoning involved in crafting algorithms, and I hope they'll give you a good taste of what it means to be creative in computer science.

So, without further ado, here is my list of the most beautiful algorithms ever designed. Except for the first one, this list is in no particular order.

Binary Search

If there is an algorithm that should be number one in every list of beautiful algorithms, that is Binary Search. It is the simplest, most clever algorithm ever designed. It deals with the most straightforward version of the most important problem in computer science: search. And it does so in the most marvellous and efficient possible way.

Binary Search is the first serious algorithm you will probably ever learn in a CS major, and the first one that does something truly clever: by optimally exploiting the structure of the data, Binary Search manages to find any element in a very, very large collection in the blink of an eye. It is so fast that if you had to search the whole universe (all 10^80 or so particles in it), you would need no more than 300 operations—probably much less than a millisecond in any modern computer.

To achieve this, Binary Search requires only one key assumption: the data must be sorted. If that's the case, Binary Search is perfect. It prunes the search space in a way that can be mathematically proven optimally efficient—in the sense that it makes the most efficient use of the information available in the ordering of the search space. It is impossible to be any faster when all you can rely on is the order of elements.
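A minimal sketch of the idea, assuming the items are already sorted in ascending order and support the usual comparison operators. The last line checks the universe-sized claim by counting how many halvings it takes to go from 10^80 candidates down to one.

```python
import math


def binary_search(x, items):
    """Return True if x occurs in the sorted sequence items."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == x:
            return True
        if items[mid] < x:
            low = mid + 1    # x can only be in the right half
        else:
            high = mid - 1   # x can only be in the left half
    return False


found = binary_search(7, [1, 3, 5, 7, 9])    # True
missing = binary_search(4, [1, 3, 5, 7, 9])  # False

# Halvings needed to narrow 10**80 candidates down to a single one.
steps_for_universe = math.ceil(math.log2(10**80))  # 266
```

Each comparison throws away half of what is left, which is why 266 halvings are enough and 300 is a comfortable upper bound.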

Binary Search teaches us something profound about the nature of search and problem-solving in general: it teaches us that structure matters a lot. If you can organize the search space in a meaningful way, you can search through it in a breeze. In a sense, this is the most crucial insight in computer science: all computer science problems are ultimately search problems, and all clever solutions to search problems are, in a sense, finding ways to organize the search space for effective pruning.

QuickSort

If search is the most important problem in CS, then sorting is probably the most well-studied one. Among the many flavours of sorting, the simplest incarnation is sorting a finite list of numbers (or comparable items in general). And here, QuickSort is undoubtedly the king of the pack.

QuickSort is not only among the fastest general-purpose sorting algorithms—its name alone says so—but it is also probably the most elegant. It cleverly uses randomization to choose its pivot, so that on average the split is balanced enough to sort in O(n log n) time, whatever the order of the input.

QuickSort is the simplest example of a clever use of randomization aimed not at producing an approximate result but rather at making an algorithm robust to all possible inputs. The algorithm's balance of speed and simplicity epitomizes beauty in design and demonstrates the harmony between theoretical elegance and practical application.
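Here is a deliberately simplified sketch of the randomized version. It builds new lists instead of partitioning in place, which is an assumption made for readability and costs extra memory compared to the classic formulation.

```python
import random


def quicksort(items):
    """Return a sorted copy of items using a randomly chosen pivot."""
    if len(items) <= 1:
        return list(items)
    # A random pivot means no fixed input order can force bad splits every time.
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)


result = quicksort([4, 2, 6, 5, 7, 1, 0])  # [0, 1, 2, 4, 5, 6, 7]
```

The output is always exactly sorted; randomness only affects how long the sort takes, never what it returns.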

A* Search

A* is an extension to the most beautiful graph algorithm—Dijkstra’s shortest path—to incorporate domain knowledge as a heuristic function. A* achieves extraordinary efficiency by balancing the cost to reach a node (what we know for sure) with a heuristic estimate of the cost to reach the goal (what we can guess). This synergy allows it to explore pathways intelligently, often finding optimal solutions with significantly fewer computations than uninformed methods.

Its adaptability makes A* valuable in theoretical contexts and real-world applications, such as route planning in navigation systems and AI for games, where understanding and optimizing paths is critical.

This algorithm beautifully balances exploration and exploitation, finding the shortest path (as long as the heuristic never overestimates the remaining cost) while keeping the search effort low. It is a prime example of the combination of formal and heuristic thinking.
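As a rough sketch, here is A* on a small grid, assuming 4-connected moves of cost 1, cells marked 0 (free) or 1 (wall), and the Manhattan distance as the heuristic, which never overestimates on such a grid. The function name and grid encoding are assumptions for illustration.

```python
import heapq


def a_star(grid, start, goal):
    """Length of the shortest 4-connected path on the grid, or None if blocked."""

    def heuristic(cell):
        # Manhattan distance: what we can guess about the remaining cost.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Each entry is (known cost + guess, known cost, cell).
    frontier = [(heuristic(start), 0, start)]
    best = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best[cell]:
            continue  # stale entry: a cheaper route to this cell was found later
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    priority = new_cost + heuristic((nr, nc))
                    heapq.heappush(frontier, (priority, new_cost, (nr, nc)))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
shortest = a_star(grid, (0, 0), (2, 0))  # 6: the path must go around the wall
```

Setting the heuristic to zero everywhere turns this into Dijkstra's algorithm, which makes the relationship between the two easy to see.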

RSA

RSA is a paradigmatic algorithm in public-key cryptography. It is the first algorithm to achieve something that intuitively seems paradoxical: sending a message through a public channel to another party that is impossible for anyone except them to decipher, even though everyone else knows all the details of the communication protocol and can see all messages.

RSA relies on the difficulty of factoring the product of two large prime numbers, a problem which for a very long time was thought to be a pure mathematical pursuit devoid of any practical applications. But then cryptography came with a simple but very powerful insight: that some functions are easy to compute but almost impossible to invert—aptly called one-way functions.

RSA was the first algorithm to show how one could build truly unhackable systems using clever math (at least before quantum computing was a thing). Nowadays, stronger cryptographic frameworks work even under the threat of ubiquitous and cheap quantum computing. But all of them owe their birth to the unique insights unlocked by RSA.
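The arithmetic behind RSA fits in a handful of lines. This is a toy sketch with tiny primes chosen for illustration; real keys use primes hundreds of digits long plus padding schemes that are left out here. It relies on the three-argument `pow` accepting a negative exponent, which requires Python 3.8 or later.

```python
# Toy parameters: far too small to be secure, chosen only to keep numbers readable.
p, q = 61, 53
n = p * q                  # 3233, the public modulus
phi = (p - 1) * (q - 1)    # 3120, easy to compute only if you know p and q
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # 2753, the private exponent (modular inverse of e)

message = 65
ciphertext = pow(message, e, n)    # anyone can do this with the public key (n, e)
recovered = pow(ciphertext, d, n)  # only the holder of d can undo it
```

Everything an eavesdropper sees is n, e, and the ciphertext; recovering d from those requires phi, and computing phi requires factoring n.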

Fast Fourier Transform

The Fourier transform is undoubtedly the most important operation in signal processing. It allows the decomposition of any signal into its individual frequency components, and it has countless applications, from compression to pattern recognition to synthesis.

However, for a very long time, it was considered an intrinsically expensive operation, applicable only to shorter sequences. The FFT is a revolutionary algorithm that computes the discrete Fourier transform in O(n log n) operations instead of the O(n²) of the direct approach, making it practical to move signals from their original domain to the frequency domain at scale. It's hard to oversell its importance; for many, it's the single most impactful algorithm ever designed.

FFT is instrumental in numerous applications, including digital signal processing, image analysis, and solving partial differential equations. Its profound impact on both mathematics and engineering underscores its elegance, showcasing how computational efficiency can radically enhance data interpretation and manipulation.
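A minimal sketch of the recursive radix-2 idea, assuming the input length is a power of two. The direct `dft` is included only as a slow reference to check the fast version against.

```python
import cmath


def dft(signal):
    """Direct discrete Fourier transform: quadratic time, used as a reference."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(signal):
    """Recursive radix-2 FFT; the length of signal must be a power of two."""
    n = len(signal)
    if n == 1:
        return list(signal)
    # Transform the even- and odd-indexed samples separately...
    even = fft(signal[0::2])
    odd = fft(signal[1::2])
    # ...then combine the two half-size results in linear time.
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


signal = [1, 2, 3, 4, 0, 0, 0, 0]
max_error = max(abs(a - b) for a, b in zip(fft(signal), dft(signal)))
```

Two half-size problems plus a linear combine step is the whole trick; the difference between the two results is nothing but floating-point noise.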

K-means

K-means is a straightforward but powerful clustering algorithm that partitions data into distinct groups. It answers how to classify different objects when you know almost nothing about them. The key insight is that similar objects should be bundled together and separated from other bundles of similar objects.

Despite its simplicity, K-means can provide insights into the underlying structure of complex information. When K-means is not enough (because the data is too complex), we often resort to methods that, deep down, are basically a transformation of the data followed by a K-means-like approach.
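Here is a minimal sketch of the usual iterative procedure (often called Lloyd's algorithm) on 2D points. Seeding the centers with the first k points is a simplifying assumption of mine; practical implementations pick the initial centers more carefully.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2D points; the first k points seed the centers."""
    centers = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: bundle every point with its nearest center.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each center to the mean of its bundle.
        centers = [
            (
                sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster),
            )
            if cluster
            else centers[i]  # keep the old center if a bundle ends up empty
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


data = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
```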

KMP

KMP, or the Knuth-Morris-Pratt algorithm, is a highly efficient string-searching method that finds the occurrences of a pattern within a text in linear time. The clever idea in KMP is that past searches, even failed ones, provide information about future searches. By preprocessing the pattern to create a partial match table, KMP optimally skips unnecessary comparisons, enhancing the searching process.

This clever design mitigates the performance issues seen in naive approaches, making it particularly suitable for applications involving large datasets such as text editing, data parsing, and bioinformatics.
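A compact sketch of the idea, with function names of my own choosing: the table records, for each prefix of the pattern, how much of an already matched prefix can be reused after a mismatch, so the search never moves backwards in the text.

```python
def build_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i + 1] that is also its suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]          # fall back to a shorter matching prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    table = build_table(pattern)
    matches, k = [], 0                # k = number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]          # reuse what the failed attempt already told us
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]          # keep going to find overlapping occurrences
    return matches
```

For example, `kmp_search("abababa", "aba")` finds the three overlapping occurrences at positions 0, 2, and 4.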

Minimax

Minimax is a fundamental algorithm used in decision-making and game theory, particularly in zero-sum games. It systematically evaluates possible moves by minimizing the possible loss in a worst-case scenario. Minimax answers the question of what one should do to behave optimally in a sequence of decisions, assuming everyone else is also behaving optimally. This strategic approach not only aids in deriving the optimal move for a player but also prevents adverse outcomes against an opponent’s best strategy.

Despite its power, Minimax stands out for its simplicity. It is the backbone of many AI implementations in competitive gaming contexts. Although it is no longer the algorithm of choice for practical implementations of optimal AI players, its theoretical properties still inform the design of every decision-making algorithm in adversarial environments.
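The whole algorithm fits in a few lines. In this sketch, a game is a hypothetical nested list where numbers are final scores for the first player and lists are positions with several available moves; that encoding is my assumption, not a standard one.

```python
def minimax(node, maximizing=True):
    """Value of a position when both players play optimally."""
    if isinstance(node, (int, float)):
        return node                      # leaf: the game is over, return the score
    values = [minimax(child, not maximizing) for child in node]
    # The maximizing player picks the best outcome, the opponent picks the worst for us.
    return max(values) if maximizing else min(values)


# Three possible moves; after each one the opponent chooses between two replies.
tree = [[3, 5], [2, 9], [0, 7]]
best_value = minimax(tree)  # 3: the first move guarantees at least 3
```

Note how the tempting 9 in the second branch is ignored: an optimal opponent would never let us reach it.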

Monte Carlo methods

Monte Carlo methods are a family of extremely powerful statistical techniques used to understand and quantify uncertainty in prediction and forecasting models. The purpose is to find some optimal solution, e.g., an optimal strategy in a sequence of decisions that involve uncertainty or inherent randomness.

Monte Carlo exploits the intuitive idea of random sampling but with a clever twist: as you obtain more information about some specific outcome, the uncertainty about it decreases. By balancing the expected potential of each outcome with its uncertainty, Monte Carlo efficiently finds the best solutions while minimizing the effort wasted on exploring unpromising ones.

It is hard to overstate the importance of Monte Carlo. Its many variants are widely applied anywhere from finance and engineering to computer science, AI, and physical sciences, allowing analysts to model scenarios that would be infeasible to compute deterministically. Its versatility and capacity for simulating the unpredictable make it an invaluable tool for decision-making.
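The simplest member of the family is plain random sampling, and it already shows the core effect: more samples, less uncertainty. The sketch below estimates pi by throwing random points at a square; the function name and the fixed seed are my own choices to keep the run reproducible.

```python
import random


def estimate_pi(samples, seed=0):
    """Fraction of random points in the unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples) if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples


# The error typically shrinks roughly like 1 / sqrt(samples).
for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n))
```

The more sophisticated variants, such as the tree search methods used in game playing, add a rule for deciding where to sample next on top of this basic mechanism.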

Genetic Algorithms

Genetic Algorithms (GA) are a family of general-purpose optimization strategies inspired by the principles of natural selection and genetics. They employ mechanisms such as selection, crossover, and mutation to evolve solutions over time. This adaptive search technique is incredibly effective for optimization problems, exploring a vast solution space through iterative improvements.

GAs are used everywhere we find complex optimization problems: in engineering, economics, or artificial intelligence, just to mention a few cases. Their ability to emulate evolutionary processes reveals a profound connection between biological principles and computational strategies, underscoring their elegance and utility.
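Here is a minimal sketch of those three mechanisms on a deliberately silly problem: evolving a list of bits toward all ones (a classic toy benchmark usually called OneMax). The parameter values, the tournament size, and keeping the best individual unchanged (elitism) are assumptions of mine, not the only way to do it.

```python
import random


def evolve(genome_length=20, population_size=30, generations=60, seed=0):
    """Evolve bit lists toward all ones; returns the best genome and the best fitness per generation."""
    rng = random.Random(seed)
    fitness = sum  # OneMax: the fitness of a genome is its number of ones
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    history = []
    for _ in range(generations):
        elite = max(population, key=fitness)
        history.append(fitness(elite))
        next_population = [elite[:]]  # elitism: the best individual always survives
        while len(next_population) < population_size:
            # Selection: the fittest of three random individuals becomes a parent.
            mother = max(rng.sample(population, 3), key=fitness)
            father = max(rng.sample(population, 3), key=fitness)
            # Crossover: the child takes a prefix from one parent and a suffix from the other.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < 1 / genome_length else g for g in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
```

Because of elitism, the best fitness in `history` can never go down from one generation to the next.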

Raytracing

Raytracing is the quintessential computer graphics algorithm. It's a straightforward approximation of the rendering equation that can simulate how light interacts with different objects to create photorealistic images. Raytracing is a recursive algorithm that follows light rays as they bounce through a virtual scene and are reflected, refracted, or absorbed.

Every time you see a CGI effect in a Hollywood movie, whether live-action or animated, Raytracing is working behind the scenes. Modern rendering algorithms are significantly more complex but ultimately based on the same principles.

SVD

SVD, or Singular Value Decomposition, is easily the most important linear algebra algorithm for data representation and analysis. At its core, SVD is about finding an optimal representation of the values in a matrix that preserves as much information as possible. When used on real data, SVD reveals intrinsic structures that enable processes like noise reduction and feature extraction.

This method is critical in machine learning applications like image classification and natural language processing. Its ability to simplify complex datasets highlights its significance, demonstrating how advanced mathematical techniques can lead to practical solutions in various applications.

Deflate

Deflate is a widely used compression algorithm that reduces data size to optimize storage and transmission. It cleverly combines the two most paradigmatic approaches to data compression: LZ77 (pattern-based) and Huffman (information-theoretic) coding.

Deflate is the underlying algorithm for many widely used file formats, including ZIP files and PNG images, where maintaining data integrity during compression is vital. The algorithm's ability to balance efficiency and compression ratio makes it a valuable tool in applications ranging from software distribution to data archiving.
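You can try Deflate without implementing anything: Python's standard `zlib` module wraps a Deflate stream in a small header and checksum. The sample data below is made up, and deliberately repetitive so that the pattern-based stage has something to exploit.

```python
import zlib

# Highly repetitive input: ideal for the LZ77 stage, which replaces repeats with back-references.
text = b"abracadabra " * 200

compressed = zlib.compress(text, 9)     # 9 = slowest, strongest compression level
restored = zlib.decompress(compressed)

print(len(text), "bytes ->", len(compressed), "bytes")
assert restored == text                 # lossless: every byte comes back
```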

HyperLogLog

HyperLogLog is the kind of algorithm that looks like magic. It solves what seems like an insurmountable problem with a genuinely clever idea. The purpose of HyperLogLog is to allow very accurate estimates of the number of distinct items in a very, very large dataset when items are added asynchronously in real time. Think about counting the unique viewers of a hugely viral YouTube video.

Using hash functions and a clever data summarising technique, HyperLogLog maintains a minimal memory footprint while providing accurate estimates of unique elements. Anywhere you see massively distributed databases, you can probably find HyperLogLog or a variant of it underneath the most basic counting operations.
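The following is a simplified sketch of the idea, not a production implementation. Each item is hashed; the first few bits pick a register, and the register remembers the longest run of leading zeros seen in the rest of the hash. Long runs are rare, so seeing one suggests many distinct items went by. The class name, the use of SHA-1, and the default of 4096 registers are my assumptions.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, bits=12):
        self.bits = bits                     # number of hash bits used to pick a register
        self.m = 1 << bits                   # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")            # use 64 bits of the hash
        width = 64 - self.bits
        index = h >> width                               # first bits: which register
        rest = h & ((1 << width) - 1)                    # remaining bits
        rank = width - rest.bit_length() + 1             # position of the leftmost 1 bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)     # small-range correction
        return raw


hll = HyperLogLog()
for i in range(20_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the registers
print(round(hll.count()))  # an estimate close to 20000, using only 4096 small counters
```

With 4096 registers, the typical relative error is around 1.6%, no matter how many items are added.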

PageRank

PageRank is the algorithm that propelled Google to dominate the online search market. It assigns a numerical score to each page, reflecting its importance based on the number and quality of links pointing to it. The quirk is that the Internet is huge, so effectively computing this value for all web pages is unfeasible.

PageRank builds a probabilistic model that estimates the likelihood of a user landing on a particular page. It uses clever numerical approximation techniques to avoid processing the entire Internet graph every time a new web page is added or removed. Beyond search, PageRank is useful for estimating the importance of individual nodes in any type of network, such as social media or knowledge graphs, so it is also widely used in research.
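On a tiny graph, the probabilistic model can be sketched with plain repeated averaging (power iteration). The damping value of 0.85 is the commonly cited default; the graph, the function name, and the requirement that every linked page appears as a key are my assumptions, and a real system would need far more machinery than this.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to; every target must also be a key."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share       # follow one of the links
            else:
                for target in pages:                # dead end: jump anywhere
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Here page `c` comes out on top because three pages link to it, and `a` follows closely because the important page `c` links to it.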

Backpropagation

Backpropagation is probably the most important algorithm of the AI revolution. It is the algorithm that allows the training of artificial neural networks efficiently by computing gradients of the loss function with respect to the model's weights and propagating errors back through the network layers.

Its discovery in the 80s made it possible, for the first time, to train arbitrarily deep neural networks, which are behind most, if not all, modern AI applications, including image recognition, natural language understanding and generation, and even reinforcement learning.
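To see the error propagating backwards, here is a sketch on the smallest network I can think of: one input, one hidden tanh unit, one output, and a squared-error loss. The architecture and names are illustrative assumptions; real networks do the same thing with matrices.

```python
import math


def forward(x, w1, w2):
    """One hidden tanh unit followed by a linear output."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y


def backward(x, target, w1, w2):
    """Gradients of loss = (y - target)^2 with respect to w1 and w2, via the chain rule."""
    h, y = forward(x, w1, w2)
    dloss_dy = 2 * (y - target)               # error at the output
    dloss_dw2 = dloss_dy * h                  # the output weight sees the hidden activation
    dloss_dh = dloss_dy * w2                  # error propagated back to the hidden unit
    dloss_dw1 = dloss_dh * (1 - h * h) * x    # tanh'(z) = 1 - tanh(z)^2
    return dloss_dw1, dloss_dw2


# One step of gradient descent with a hypothetical learning rate of 0.1.
x, target, w1, w2 = 0.5, 1.0, 0.8, -1.2
g1, g2 = backward(x, target, w1, w2)
w1, w2 = w1 - 0.1 * g1, w2 - 0.1 * g2
```

Each gradient reuses the one computed for the layer after it, which is what keeps the whole computation cheap.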

Closing remarks

And there you go. Again, keep in mind this is just my personal selection, and many of you can and most likely will disagree with me on some or maybe all of them--well, except Binary Search, that's a hill I'm willing to die on!

In future posts, I will explore each of these algorithms, explaining not only how they work in the most intuitive terms possible but also highlighting what makes them stand out as some of the most beautiful ideas ever conceived in computer science.

Until then, I would love to hear your thoughts on this topic. Do you think algorithms, and abstract ideas in general, can be beautiful? Why, or why not? And, if you answer yes, then what do you consider the most beautiful algorithms ever designed?

The Power of Abstraction

Alejandro Piad Morffis — Mon, 18 Nov 2024 12:02:17 GMT

Photo by Adi Suryanata on Unsplash

Abstraction is the fundamental ability that powers the way computer scientists think through and solve problems. It is the most important tool in the computer scientist's arsenal and often the hardest skill to master.

It is also a very powerful tool for general-purpose problem-solving. The right abstraction can often unlock unforeseen solutions to otherwise intractable problems. In some rare cases, it even allows for fundamentally novel insights only possible by thinking at the right level of abstraction.

In this article, I will explain abstraction in the context of solving problems, why it is such a powerful tool, and finally, give a concrete tip on how to improve at it.

This article is part of a series on computational thinking called How to Think like a Computer Scientist. In this series, I explore techniques from the computer scientist’s toolset and how they can be applied by anyone to solve problems in any domain.

If you want to learn more, subscribe for free to get future articles directly in your inbox.


Abstraction in action

Abstraction is the process of taking a problem, situation, or concept in general, removing all irrelevant details, and leaving a simplified, more general version that captures what is truly essential to the thing being abstracted.

Abstraction is a form of purposeful simplification—but not one that stems from laziness or a lack of capacity. It's a meaningful simplification that makes a problem easier to solve but also more general and, thus, more relevant.

What is relevant or not about any given concept or situation is very contextual. Depending on the type of analysis you want, the same details may or may not be important.

That is, the why determines the what.

To see this in action, let's begin with a typical example from a typical computer science class.

An example problem

Suppose you want to solve the problem of getting as fast as possible from home to work. How could we solve this?

We first need to understand two things: what we are exactly asking for and what our decision space is, that is, what we can do.

For example, this problem is very different if you use public transportation or a personal vehicle. That is, the means by which you can move is an important detail, as it will determine your decision space. Can you travel through any street, and stop anywhere? Or do you need to pick from a predesigned set of routes and stop at one of the predetermined bus stations?

Likewise, the problem changes if we want to minimize total travelling time, fuel cost, time spent waiting at red lights, or the total number of swears thrown at cyclists.

Suppose you pick a goal: minimize total time spent on the road. Furthermore, assume you have a personal vehicle that can, in principle, travel anywhere in the city.

This is already an exercise in abstraction because we are deciding, for example, that the level of risk of a given road is irrelevant. We want the fastest route, not the safest. We could have done the opposite or somehow tried to find something in between.

The point is that deciding what matters immediately makes some otherwise important details irrelevant to this particular instance of the problem.

Making the problem tractable

Are we ready to solve the problem? Well, not yet. There are still a thousand factors that can determine how fast you can move around the city: traffic, traffic lights, pedestrians, quality of the road... And worse, all of this changes daily, perhaps even every second! If we want to have a chance to tackle this problem at all, we'll have to make some further simplifications.

So, let's assume we can model each road segment as a straight line between intersections, with an average time cost associated. That is, we throw away all the details of a given road except for the average time it takes to traverse it. It doesn't matter how many lanes the road has or how bumpy it is. Also, it doesn't matter which time of day or year we happen to be.

This is a huge simplification, but one that makes the problem tractable with a very efficient computational algorithm—Dijkstra's algorithm, to be precise.

The details of the solution don't matter for the purpose of this article. What matters is that by simplifying this initially seemingly unsurmountable problem in a meaningful way, we obtain a tractable problem that we can solve with clever math.
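For the curious, here is a minimal sketch of what that solution can look like. The city is a hypothetical dictionary mapping each intersection to the roads leaving it, each with its average time in minutes; the names and numbers are made up, and the approach assumes no road has a negative cost.

```python
import heapq


def fastest_route(roads, start, goal):
    """Dijkstra's algorithm; roads maps an intersection to a list of (neighbor, minutes)."""
    queue = [(0, start, [start])]   # (time so far, current intersection, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)   # always expand the cheapest option first
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, minutes in roads.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + minutes, neighbor, path + [neighbor]))
    return None                     # the goal cannot be reached


city = {
    "home": [("a", 4), ("b", 2)],
    "b": [("a", 1), ("work", 7)],
    "a": [("work", 3)],
}
print(fastest_route(city, "home", "work"))  # (6, ['home', 'b', 'a', 'work'])
```

Notice that nothing in the code knows about lanes, bumps, or time of day: the abstraction is baked into the data structure.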

Back to the real world

Now comes the real question: How well does the answer to the artificial, simplified, abstract problem we ended up solving apply to the real-world problem we started with?

The answer, of course, is that it depends. It depends mostly on how important the details we left out were. Every abstraction is a simplification, and thus, any solution to an abstract problem is, at best, an approximate solution to a real problem.

But approximations are all we can have in practice. All real-life problems are infinitely detailed if you take everything into consideration. If you want an exact answer to a real-life problem, you may need to consider the physics of the entire universe!

So, the answer to how good the approximation is, is this: as good as your abstraction is at capturing what is truly essential to the problem and throwing away what doesn't matter.

In this particular case, we'll have a pretty good solution, assuming the average times we assign to each road are really characteristic of said road and not just random values. But there is something even more valuable in what we've done, which is why abstraction is such a powerful tool.

Let's take a look at the bigger picture.

Why abstraction is so powerful

The power of abstraction can be seen in at least three different levels.

Abstraction as simplification

We've already seen the first level: abstraction makes problems easier to solve. Since abstracting is simplifying—as we are, by definition, removing details—we end up solving a problem that is, in some sense, smaller.

In our path-finding example, we simplified things a lot by deciding we didn't care for the width of the roads, the elevation of the terrain, the pavement quality, the weather, and many other details that would make the original problem impossible to solve.

In this simplified version of the problem, we can actually do some math and come up with a sensible solution. Perhaps, as in this case, we can even prove we have an optimal solution (to the simplified problem, that is). Thus, abstraction can make hard problems tractable or at least a bit easier.

Abstraction as generalization

But the second level is even more profound. By abstracting away the things that made our example problem related to cars, roads, and pedestrians—since, ultimately, those things weren't essential—we actually ended up solving a more general problem: finding a minimum-cost route in a graph.

In doing so, we not only solved our mundane go-to-work problem but also solved all the problems of going from anywhere to anywhere else with minimum cost, as long as we only cared about the cost of each connection. And since we didn't define cost in any particular way—just some number that must be minimized—we can apply this solution to time, money, fuel, and any other magnitude we care about!

Now, instead of a car in a city, think of a network of airports worldwide where you want to reach a destination by paying the minimum price. The abstract version of that problem is the same as this one if the cost of each route is the ticket price.

Likewise, we can think of minimizing transmission time in a computer network or even maximizing the probability of a message reaching a destination. All of these are instances of the minimum-cost path-finding problem we just solved!
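The probability case deserves one extra line of math, since probabilities multiply along a path instead of adding. Writing P for a path and p_e for the probability that link e delivers the message (my notation, and assuming links fail independently), taking logarithms turns the product into a sum:

```latex
\[
\arg\max_{P} \prod_{e \in P} p_e
\;=\; \arg\max_{P} \sum_{e \in P} \log p_e
\;=\; \arg\min_{P} \sum_{e \in P} \bigl(-\log p_e\bigr),
\qquad 0 < p_e \le 1 .
\]
```

Since each cost \(-\log p_e\) is a non-negative number, the most reliable path is again a minimum-cost path, just with a different definition of cost.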

Thus, since an abstraction is also a generalization, in another sense, you end up solving a bigger problem than you originally had and get almost for free the solution to a potentially infinite number of similar problems.

Abstraction as conceptualization

Finally, there is a third level, but you must trust me here because it's hard to give you a simple example. The thing is, by abstracting problems, we can sometimes find answers to questions that weren't even conceivable in the concrete problems.

This works because abstraction is also a form of concept creation: you create new concepts and relations that unlock new ways of thinking about things. These new concepts, in turn, let you discover new truths that were invisible before.

Some of the most important ideas in computability theory, such as the existence of specific, non-trivial, and relevant problems that cannot be solved at all, are only possible because we have very abstract notions for the concepts of algorithm, computer, and problem.

So, if abstraction is such a powerful tool, how can one get better at it?

Mastering abstraction

The short answer is that there is no shortcut, no easy formula. You get better at problem-solving in any domain by solving a lot of problems in that domain.

However, I can give you a concrete tip here. It's going to sound like a plug, but hear me out.

Abstraction is a very powerful tool for solving real-life problems in any domain. But there is one particular domain where it shines: computer programming.

The reason is that computer programming is, by definition, an exercise in abstraction. Digital computers are almost incapable of dealing with the fine-grained nature of real-life problems. Programming languages require very precise, clear instructions and definitions. So, any solution to any problem using a computer is a solution to an abstract problem.

If you want to improve at coming up with good abstractions, whatever you work on, you can start by learning to code. Getting any level of expertise in coding will force you to learn to abstract problems as a side effect.

In a sense, learning to code is like going to the gym, but for the mind.

Some people are professional athletes. Some even compete at the Olympic level in weightlifting. Some go to the gym for fun or just to show off. But almost anyone, regardless of their daily job, could benefit from a healthier body, stronger muscles, and more agility.

Likewise, some people code for work or fun. But almost anyone would benefit from thinking better, and being better at abstract thinking is one of the best ways to improve your general problem-solving skills.

This would be just about the right time to promote my programming book or course, but I won't. Truth is, you don't need me for that. Search online, pick the top 3 or 4 results, try them out, and choose the one that best fits your learning style and pace.

Now, don't get me wrong. I've said before most learning resources out there suck, but that is about learning to code at a professional level. Professional athletes need state-of-the-art equipment and world-class trainers, but if all you want is to get a bit healthier, buying a couple of dumbbells or even just doing half a dozen push-ups every day gets you most of the way there.

So there you have it. Learn to code. Doesn't matter which language or development stack. Just grab a coding tutorial and get used to doing a couple of programming exercises every week. That alone will make you a much better thinker in time.

Closing remarks

In a previous article, I described the main principles that guide my educational ethos. One of these principles is show, then tell. The basic premise is that explaining a complex idea is best achieved by first showing things and then explaining how they work.

Some of you asked me for examples of these principles. This article is one example of show, then tell. First, I showed you how abstraction works in a concrete example and then explained the general concept.

Actually, this article is a meta-example of show, then tell. First, I showed you how show-then-tell works, and now I'm telling you about it. Hope you enjoyed it ;)

Going back to the main topic, this article is part of a series I'm writing on how to bring the tools of computational thinking into your daily life. If you want to learn to think like a computer scientist, stick around. I'll be sharing a lot more in future articles.

Why everyone should learn to code

Alejandro Piad Morffis — Fri, 25 Oct 2024 10:31:48 GMT

Photo by Dan Cristian Pădureț on Unsplash

You should learn to code.

Yes, you! No, I don’t know who you are or what interests you, but I know you can and should learn to code. There are many obvious and not-so-obvious reasons why learning to code is one of the best intellectual investments you can make in yourself.

The most obvious reason is that coding is a high-paying skill, and being good at it pretty much guarantees you can find a good job at any of the million software companies hiring like crazy out there.

Yes, I’m sure you’ve heard many stories of developers struggling with coding interviews and getting rejected dozens of times. And sure, even some of the best coders out there can fail at a given time for a myriad of reasons, including bad luck.

But the thing no one tells you is how few of the people applying for a coding interview today can actually code. And it’s not their fault. Many have learned superficial coding practices from dubious YouTube instructors or cheap online blogs. But learning to code, in the sense I’m using the term, implies an entirely different level of understanding.

To be fair, there are also plenty of cases where the interviewer doesn’t understand what a good programmer looks like. Some coding interviews focus way too much on programming language trivia or algorithm design challenges, which are, at best, imperfect proxies for measuring good coding skills.

In any case, learning to code well is a very marketable skill. If you want a job in the tech industry, being a reasonably good programmer will get you halfway there.

But even if you don’t want a job in the software industry, there is a very good reason why learning to code is one of the best intellectual investments anyone can make today. The reason is that coding teaches you to think in a completely new way, making you extremely good at solving problems in all domains, even without a computer.

But to explain why, let me first tell you what coding is really about.

What is coding about

Coding is part science, part engineering, and part art. It is a science because it requires understanding and applying well-established, formally defined principles and techniques to solve complex problems. It is engineering because it requires the ability to design and build systems with complicated trade-offs. It is art because it requires—and boosts—creativity and imagination to create beautiful solutions to challenging problems.

Beautiful, right? But, in practice, what the hell does this mean? At a surface level, coding seems similar to any other technical skill, like, I don’t know, using advanced industrial machinery and tools. In a sense, coding is about using computers to do stuff, right? Why isn’t this just the same as using, say, a hammer and some nails to build a house?

For starters, yes, the computer is a kind of tool, but it is a completely different kind of tool. Most tools extend our physical capabilities or help us overcome our physical limitations. They make us stronger, faster, more precise.

The computer, on the other hand, is a tool to extend our minds. Computers help our brains do stuff they cannot do easily by themselves. They help us overcome our cognitive limitations.

But programming is far less about computers than you think. The great Edsger Dijkstra is often quoted claiming that computer science is as much about computers as astronomy is about telescopes. Computers are just a tool—an incredibly powerful tool—but the important part is not the tool you use, but what you do with it.

Programmers use computers to solve problems.

This is the deeper reason why learning to code is a different type of skill—a more general skill than it initially seems. Coding is general-purpose problem-solving. In fact, there is no problem that can be solved at all that cannot be solved with a computer—this is the Church-Turing thesis. Learning to code is, ultimately, learning to solve problems.

Why? Coding relies on one fundamental ability called abstraction, which is the process of taking a problem, situation, or concept in general and stripping it down to its core defining features, forgetting everything that is irrelevant or useless for the concrete purpose you have at hand.

And the thing is, abstraction is the most useful cognitive ability in the modern world. It is at the core of problem-solving. Any sufficiently complex problem requires you to think at several levels of abstraction and switch between them constantly. In fact, the ability to think at the right level of abstraction at any given point in the process of solving a concrete problem is probably the most critical ability to master to become a really good programmer.

Let me draw a cheap analogy here. Consider driving a car from home to work. Many things happen inside your car, including a bunch of physical and chemical processes, that are fundamental for the car to work correctly. But thinking about the combustion process going inside your engine is useless for reaching your destination—if your car is working correctly, at least. Actually, worrying about your engine's chemical reactions while driving arguably makes you a worse driver, not a better one.

Instead, when driving in the city, you often need to deal with three different levels of abstraction. The lowest level is just to make the car move: shifting gears in due time, pressing the pedals, etc. The middle level is about not crashing: avoiding incoming traffic, staying in your lane, etc. The higher level is about navigation: picking the right path to reach your destination most efficiently.

When people learn to drive, they often learn to think at the lowest level of abstraction first. They struggle to make the car move. But once they master this level, it becomes almost automatic, and they can start to think in terms of what other cars are doing and how to negotiate the road. And finally, they are able to navigate around town seemingly without effort, while casually talking about, I don’t know, programming maybe.

Similarly, in computer programming, you will often deal with three simultaneous levels of abstraction.

First, we have the code level. At this level, all you care about is expressing the solution to a problem in a language that a computer can understand. Computers are very dumb machines, you see. For all their power, you need to talk to them in very, very precise terms using a programming language. This may seem daunting at first, because it requires a very strict use of syntactic rules, but almost everyone can learn to write and read computer code, just like almost everyone can learn to accelerate, brake, and turn. This is the easier task.

The second level is the algorithm level. At this level, you care about actually solving a problem by designing what is called an algorithm. In short, an algorithm is a very precise set of instructions that is guaranteed to provide a solution for a given problem. Coming up with the best algorithm for a problem is a pretty advanced skill, one that is almost always left for college-level advanced programming courses. But fear not, it is a skill that can be mastered with enough practice.

The third level is the system level. At this level, you think about the design of a complex system made out of multiple interacting components, often hundreds of tiny algorithms working together to solve a big problem. Also, at this level, you have to think about the end user of your code, either a human on the other side of the display or a computer on the other side of the planet. Systems thinking is probably the hardest skill in this combo, but it can also be mastered in due time, and it is tremendously useful even outside computer programming.

In larger organizations, people tend to specialize in roles that focus most on one level of abstraction. For example, software architects might spend most of their time thinking about the system as a whole and far less about low level details. Still, the ability to move through abstraction layers is there.

So, learning to code will teach you to think at several levels of abstraction simultaneously: to decompose problems into parts, solve each in the best possible way, and then make those parts work together to become a solution to the original problem. It will teach you to think rigorously about why some potential solution doesn’t actually work and to detect subtle reasoning flaws in code—and, by extension, in other people’s behaviours as well as your own.

In short, coding makes you a better thinker. And thinking better is arguably the best recipe for success in general life, all other things being equal.

Why learning to code (well) is hard

But if learning to code is such a great cognitive investment and automatically makes everyone better at solving real-life problems, why are so many computer programmers out there not… being great at life?

Just like learning to drive, most people will master the lowest level first. They will learn how to express known solutions to known problems in computer lingua. They will learn to write working code.

But then, most people stay here. They can make the car move, but they can’t navigate to their destination, nor can they plan the best route and avoid the morning traffic. And the reason they don’t advance more is, often, because they don’t even know there is something else to be learned.

You see, most resources to learn how to program out there will actually just teach you how to write code. They will not teach you how to think about problems in the way good programmers do, so you can come up with clever solutions to these problems. There are books on this, too, of course. But most are college textbooks or advanced technical books that aren’t marketed at or even written for the first-time learner.

On the other hand, you will find a lot of books on how to do software engineering, and how to design systems that are reliable, maintainable, and a bunch of other fancy adjectives. But these are most often, I kid you not, written as if for people who already know this stuff. Much like college-level math books—that you often need to already know the math to even be able to read the book—the majority of books on software engineering are written by software engineers for software engineers, and they are talking among themselves, repeating the same mantras they already agree on.

Don’t get me wrong, there are good resources out there. But there are vastly, vastly more bad resources that won’t get you very far down the road that really matters: learning to think like a computer programmer. We already said this requires thinking at different levels of abstraction, and most learning resources never touch more than one of these levels.

But beyond the lack of good holistic learning resources, the deeper reason why learning to code is hard is precisely because it requires that you rewire your brain to think in a different, novel, more abstract and rigorous way. And your brain doesn’t want to do that.

Your brain evolved to keep you alive—actually, to transmit your genes—and for that, it needs to focus on mastering only two skills: getting food and getting laid. All else is secondary.

So, you have to trick your brain into believing that this is a critical skill to acquire. And that takes a lot of time and a lot of patience. No one learns to code well in 24 days or two months or even a full college semester. And, more importantly, no one learns to code by watching others code. That is a small part of learning, but you have to do the chores.

Coding is much more an acquired skill than learned knowledge. There is some knowledge involved, for sure. There are algorithms you can memorize. There is syntax and semantics and rules. There are design patterns and reusable ideas. But more than all of that, coding is the process of taking a complex problem, breaking it down into manageable parts, and then explaining in the most precise possible language how to solve each of those parts, and how to put them together again.

That is, coding is about doing stuff much more than it is about knowing stuff. And there is only one way to learn how to do something: practice, lots of practice. And most people who think they can code out there simply haven’t put enough time into it.

But what about AI?

Yes, the elephant in the room! Will AI make coding skills irrelevant? Aren’t programmers coding themselves out of a job?

Good that you ask. I’ve written many times before about why the current language modelling paradigm is, on the one hand, incapable of true reasoning and, on the other hand, unlikely to replace programmers anytime soon. So let me just repeat a few arguments here for the sake of completeness.

First, code generators are indeed powerful tools that any professional programmer would do well to learn how to integrate effectively into their workflow. They can automate some of the most boring tasks and provide some much-needed unblockers from time to time.

However, code generators, at least those based on large language models, are inherently unreliable. So anyone using them to solve a given coding problem without a deep understanding of the problem and the proposed solution is shooting themself in the foot. AI-generated code can and will have errors, often subtle ones, that you need to be sufficiently experienced to catch.

On the other hand, language models struggle to deal with context. They often have a very limited context size compared to a reasonably large software project. But even if context size wasn’t a limitation, LLMs are known to arbitrarily ignore relevant parts of context and overfocus on irrelevant parts. They fail to grasp the big picture when things become complex enough.

In the end, this means that you can trust LLMs to generate small chunks of code that you can easily verify and understand, and that are relatively atomic and independent of the rest of the system.

They can definitely help at the code level (e.g., so you don’t need to remember or even search how that specific method or endpoint is invoked). They can help a little bit at the algorithm level, especially for applying relatively well-known strategies that you might have forgotten or not heard about. But they are almost useless at the system level.

The reason is that, the higher the level of abstraction, the more context matters. When you’re thinking about the whole architecture of your application, you need to understand how it all fits together, and even tiny details can be decisive for system-wide design choices.

So far, LLMs are incapable of such high-level reasoning. They can provide abstract, standard advice on software architecture, but they cannot reason that deeply about your specific application context and propose truly useful, novel, and accurate solutions for system-level design. And it seems we need a paradigm shift to overcome this limitation.

Thus, the better you are at coding—the better you are at thinking at the highest levels of abstraction—the more you can get out of an AI code generator. LLMs don’t make bad programmers better. In fact, they make them worse.

How should you learn to code

Finally, let me address the question of how. What follows is a very personal opinion—based on years of experience, but an opinion nonetheless.

There are two schools of thought about how to learn to code. The first one is the foundations-first school. In this school, you start by learning the syntax of a programming language and then learn how to write successively more complex programs in that language. This is the more traditional bottom-up approach, and it is often the design in college-level programming courses in engineering or computer science majors.

The upside with this approach is you get very strong foundations. After a full college year learning this way, you’re ready to tackle some reasonably complex coding problems. The downside is it takes a long time before you can do anything exciting that you can showcase. You spend most of your time solving abstract problems rather than building useful apps.

The second approach is championed by the applications-first school of thought. It is the complete opposite approach. You are thrown into a fully working application (often a rather simple one, but still far more complex than a single algorithm) and you get a high-level explanation of how the whole thing works, as well as some deep dives into the important parts. This top-down approach is most common in online tutorials, bootcamps, and videos.

The upside is, of course, the pragmatism. You learn something immediately useful that you can replicate and maybe tweak for your own use case. And you get to see a lot of (hopefully) well-written code. However, if you only learn like this, you can end up with lots of disconnected ideas and no sense of the big picture. And more importantly, you don’t get a chance to develop the skill to think on your own and come up with your own ideas.

It probably won’t sound too bold or innovative to claim that I think the right way is a combination of these two. Of course it is! But the devil is in the details.

Good programming pedagogy is more than just doing algorithms one day and applications another. It’s not just a heterogeneous mix. It requires a thoughtful merge of these two approaches in a way that you are doing both things in unison: you’re thinking at the high level and coding at the low level simultaneously.

In my programming classes, I try to do precisely this: I create a fully working application—often a small demo, but something that has a concrete purpose and is designed for a real, human user.

We first spend some time thinking at a very high level about how such a system could be designed and discussing possible architectures. Then I crack open the code and show the parts that involve new content: maybe it’s a new instruction, a design pattern, or an algorithm.

At this point, I go back and forth between showing actual code and working on abstract ideas on the blackboard. This is time to think about why the code works, and to generalize these concrete ideas to more abstract patterns that can be learned.

Finally, I will challenge the students to make changes to the code. Some changes are superficial and only there for the coding level, so they get to practice with the new syntax. Other exercises tackle adding functionality or modifying methods to do extra stuff. These require the ability to think at a higher abstraction and understand what’s going on under the hood.

This is the best way I’ve found to get students to think about their code at several levels of abstraction simultaneously. If you’re interested in seeing this process in action, let me know in the comments. If there is enough interest, I may come back with some beginner-level tutorials using this approach.

The Missing Introduction to Formal Language Theory

Alejandro Piad Morffis — Tue, 16 Jul 2024 11:02:38 GMT

Photo by Ryan Wallace on Unsplash

Next semester, I’ll teach the introductory course in Compilers again, after a couple of years, in the CS major at the University of Havana. It was an excellent opportunity to dust off my notes on formal languages, parsers, analyzers, code generators, etc. Going over the course material, I decided I needed to rewrite most lecture notes because my understanding of the fundamental issues in formal languages has dramatically changed since I last taught this course.

So, as I prepare for the upcoming course, I’ll be sharing with you some of the most intriguing and mindblowing ideas from one of the areas of Computer Science with some of the most profound results. We will begin with a very intuitive introduction to formal language theory and build our way up to understand how compilers, text editors, virtual machines, and all the associated components work.

Buckle up!

What is a (formal) language?

Intuitively, a language is just a collection of correct sentences. In natural languages (Spanish, English, etc.), each sentence is made up of words, which have some intrinsic meaning, and there are rules that describe which sequences of words are valid.

Some of these rules, which we often call “syntactic” are just about the structure of words and sentences, and not their meaning–like how nouns and adjectives must match in gender and number or how verbs connect to adverbs and other modifiers. Other rules, which we call “semantic”, deal with the valid meanings of collections of words–the reason why the sentence “the salad was happy” is perfectly valid syntactically but makes no sense. In linguistics, the set of rules that determine which sentences are valid is called a “grammar”.

In formal language theory, we want to make all these notions as precise as possible in mathematical terms. To achieve this, we will have to make some simplifications which will ultimately imply that natural languages fall outside the scope of what formal language theory can fully study. But these simplifications will enable us to define a very robust notion of language for which we can make pretty strong theoretical claims.

So let’s build this definition from the ground up, starting with our notion of words, or, formally, symbols:

Definition 1.1 (Symbol) A symbol is an atomic element with an intrinsic meaning.

Examples of symbols in abstract languages might be single letters like a, b or c. In programming languages, a symbol might be a variable name, a number, or a keyword like for or class. The next step is to define sentences:

Definition 1.2 (Sentence) A sentence (alternatively called a string) is a finite sequence of symbols.

An example of a sentence formed with the symbols a and b is abba. In a programming language like C# or Python, a sentence can be anything from a single expression to a full program.

One special string is the empty string, which has zero symbols and will often bite us in proofs. It is often denoted as 𝜖.

We are almost ready to define a language. But first, we need to define a “vocabulary”, which is just a collection of valid symbols.

Definition 1.3 (Vocabulary) A vocabulary 𝑉 is a finite set of symbols.

An example of a vocabulary is { 𝑎,𝑏,𝑐 }, which contains three symbols. In a programming language like Python, a sensible vocabulary would be something like {for,while,def,class,…} containing all keywords, but also symbols like +, ., etc.

What about identifiers?
If you think about our definition of vocabulary for a little bit, you’ll notice we defined it as a finite set of symbols. At the same time, I’m claiming that things like variable and function names, and all identifiers in general, will end up being part of the vocabulary in programming languages. However, there are infinitely many valid identifiers, so… how does that work?
The solution to this problem is that we will actually deal with two different languages, on two different levels. We will define a first language for the tokens, which just determines what types of identifiers, numbers, etc., are valid. Then the actual programming language will be defined based on the types of tokens available. So, all numbers are the same token, all identifiers are another token, and so on.
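Here is a minimal sketch of that two-level idea, under some loud assumptions: the token names ID and NUM, the tiny keyword list, the operator set, and the tokenize helper are all made up for illustration, and no real compiler is this simple. What it shows is how infinitely many identifiers and numbers collapse into a finite vocabulary.

```python
import re

# Hypothetical token language: a few keywords and operators, plus two token
# types (ID and NUM) that stand in for infinitely many concrete strings.
KEYWORDS = {"for", "while", "def", "class"}
TOKEN_PATTERN = re.compile(
    r"\s*(?:(?P<NUM>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<OP>[+\-*/=().:]))"
)


def tokenize(text):
    """Map raw text to a list of symbols taken from a finite vocabulary."""
    tokens = []
    position = 0
    text = text.rstrip()
    while position < len(text):
        match = TOKEN_PATTERN.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character at position {position}")
        position = match.end()
        if match.lastgroup == "NUM":
            # Every number becomes the same symbol.
            tokens.append("NUM")
        elif match.lastgroup == "ID":
            # Keywords are symbols of their own; every other name is just ID.
            word = match.group("ID")
            tokens.append(word if word in KEYWORDS else "ID")
        else:
            tokens.append(match.group("OP"))
    return tokens
```

With this sketch, tokenize("x = y + 42") and tokenize("total = count + 7") produce exactly the same sentence, so the second-level language never has to know which identifier or number was written.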

Given a concrete vocabulary, we can then define a language as a (possibly infinite) subset of all the sentences that can be formed with the symbols from that vocabulary.

Definition 1.4 (Language) Given a vocabulary 𝑉, a language 𝐿 is a set of sentences with symbols taken from 𝑉.

Let’s see some examples.

Examples of languages

To illustrate how rich languages can be, let’s define a simple vocabulary with just two symbols, 𝑉={ 𝑎,𝑏 }, and see how many interesting languages we can come up with.

The simplest possible language in any vocabulary is the singleton language whose only sentence is formed by a single symbol from the vocabulary. For example, 𝐿𝑎 = { 𝑎 } or 𝐿𝑏 = { 𝑏 }. This is, of course, rather useless, so let’s keep up.

We can also define what’s called a finite language, which is just a collection of a few (or perhaps many) specific strings. For example,

𝐿1 = { 𝑏𝑎𝑏, 𝑎𝑏𝑏𝑎, 𝑎𝑏𝑎𝑏𝑎, 𝑏𝑎𝑏𝑏𝑎 }

Since languages are sets, there is no intrinsic order to the sentences in a language. For visualization purposes, we will often sort sentences in a language in shortest-to-longest and then lexicographic order, assuming there is a natural order for the symbols. But this is just one arbitrary way of doing it.

Now, we can enter the realm of infinite languages. Even when the vocabulary is finite, and each sentence is also a finite sequence of symbols, we can have infinitely many different sentences in a language. If you need to convince yourself of this claim, think about the language of natural numbers: every natural number is a finite sequence of, at most, 10 different digits, and yet, we have infinitely many natural numbers because we can always take a number and add a digit at the end to make a new one.

Similarly, we can have infinite languages simply by concatenating symbols from the vocabulary ad infinitum. The most straightforward infinite language we can make from an arbitrary vocabulary 𝑉 is called the universe language, and it’s just the collection of all possible strings one can form with symbols from 𝑉.

Definition 1.5 (Universe language) Given a vocabulary 𝑉, the universe language, denoted 𝑉∗, is the set of all possible strings that can be formed with symbols from 𝑉.

An extensional representation of a finite portion of 𝑉∗ would be:

𝑉∗ = { 𝜖, 𝑎, 𝑏, 𝑎𝑎, 𝑎𝑏, 𝑏𝑎, 𝑏𝑏, 𝑎𝑎𝑎, 𝑎𝑎𝑏, 𝑎𝑏𝑎, 𝑎𝑏𝑏, 𝑏𝑎𝑎, 𝑏𝑎𝑏, 𝑏𝑏𝑎, 𝑏𝑏𝑏, ... }

We can now easily see that an alternative definition of language could be any subset of the universe language of a given vocabulary 𝑉.
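To make the universe language concrete, here is a small sketch of a Python generator that lists 𝑉∗ in shortest-to-longest and then lexicographic order. The function name universe is my own choice for illustration, and I'm assuming symbols are single characters that Python can sort.

```python
from itertools import count, islice, product


def universe(vocabulary):
    """Yield every string over `vocabulary`, shortest first, then in symbol order."""
    symbols = sorted(vocabulary)
    for length in count(0):
        # All sequences of exactly `length` symbols; length 0 gives the empty string.
        for combo in product(symbols, repeat=length):
            yield "".join(combo)


# The generator never stops, so we only take a finite prefix of it.
first_strings = list(islice(universe({"a", "b"}), 15))
```

Taking the first 15 strings reproduces the finite portion listed above, with the empty Python string playing the role of 𝜖.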

Now, let’s take it up a notch. We can come up with a gazillion languages just involving 𝑎 and 𝑏, by concocting different relationships between the symbols. For this, we will need some way to describe the languages that don’t require listing all the elements–as they are infinitely many. We can do it with natural language, of course, but in the long run, it will pay to be slightly more formal when describing infinite languages.

For example, let 𝐿2 be the language of strings over the alphabet 𝑉 = { 𝑎, 𝑏 } with the exact same number of 𝑎 and 𝑏.

𝐿2 = { 𝜖, 𝑎𝑏, 𝑎𝑎𝑏𝑏, 𝑎𝑏𝑎𝑏, 𝑏𝑎𝑏𝑎, 𝑏𝑎𝑎𝑏, 𝑎𝑏𝑏𝑎, ... }

We can define it with a bit of math syntax sugar as follows:

𝐿2 = { 𝜔 ∈ { 𝑎,𝑏 }∗ | #(𝑎,𝜔) = #(𝑏,𝜔) }

Let’s unpack this definition. We start by saying, 𝜔 ∈ { 𝑎,𝑏 }∗, which literally parses as “strings 𝜔 in the universe language of the vocabulary { 𝑎,𝑏 },” but is just standard jargon to say “strings made out of 𝑎 and 𝑏.” Then we add the conditional part #(𝑎,𝜔) = #(𝑏,𝜔), which should be pretty straightforward: we are using the #(,) notation to denote the function that counts a given symbol in a string.

𝐿2 is slightly more interesting than 𝑉∗ because it introduces the notion that a formal language is equivalent to some computation. This insight is the fundamental idea that links formal languages and computability theory, and we will formalize this idea in the next section. But first, let’s see other, even more interesting languages, to solidify this intuition that languages equal computation.

Let’s define 𝐿3 as the language of all strings in 𝑉∗ where the number of 𝑎 is a prime factor of the number of 𝑏. Intuitively, working with this language (e.g., finding valid strings) will require us to solve prime factoring, as any question about 𝐿3 that has different answers for strings in 𝐿3 than for strings not in 𝐿3 will necessarily go through what it means for a number to be a prime factor of another.
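A quick sketch of a recognizer for 𝐿3 makes the point tangible: you cannot write it without a primality test and a divisibility check. The names is_prime and l3 are mine, trial division is just the simplest test I could pick, and I'm assuming that a string with no 𝑏 at all is not in the language, since the definition leaves that case open.

```python
def is_prime(n):
    """Trial division: good enough for a sketch, not for large numbers."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def l3(s):
    """Check whether the number of a's is a prime factor of the number of b's."""
    count_a = s.count("a")
    count_b = s.count("b")
    # Assumption: zero b's means there is nothing to factor, so we reject.
    return count_b > 0 and is_prime(count_a) and count_b % count_a == 0
```

For example, l3 accepts "aabbbb" (2 is a prime factor of 4) and rejects "aabbb" (2 does not divide 3).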

But it gets better. We can define the language of all strings made out of 𝑎 and 𝑏 such that, when interpreting 𝑎 as 0 and 𝑏 as 1, the resulting binary number has any property we want. We can thus codify all problems in number theory as problems in formal language theory.
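As a rough illustration of this encoding, the sketch below turns any property of natural numbers into a recognizer for the corresponding language over 𝑎 and 𝑏. The helper names and the example property (being a perfect square) are hypothetical choices of mine, and I'm assuming the empty string reads as zero.

```python
import math


def as_number(s):
    """Read a string over {a, b} as a binary number, with a = 0 and b = 1."""
    value = 0
    for symbol in s:
        value = 2 * value + (1 if symbol == "b" else 0)
    return value


def language_of(property_holds):
    """Build a recognizer for the strings whose number satisfies the property."""
    def recognizer(s):
        return property_holds(as_number(s))
    return recognizer


# Hypothetical example property: the encoded number is a perfect square.
is_square_string = language_of(lambda n: math.isqrt(n) ** 2 == n)
```

Here "baab" encodes 9, so it belongs to the language of perfect squares, while "bab" encodes 5 and does not. Swapping in a different property gives a different language with no other change.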

And, as you can probably understand already, we can easily codify any mathematical problem, not just number theory. Ultimately, we can define a language as the set of strings that are valid input/output pairs for any specific problem we can come up with. Let’s make this intuition formal.

Recognizing a language

The central problem in formal language theory is called the word problem. Intuitively, it is about determining whether a given string is part of a language. Formally:

Definition 1.6 (The Word Problem) Given a language 𝐿 on some vocabulary 𝑉, the word problem is defined as devising a procedure that, for any string 𝜔 ∈ 𝑉∗, determines whether 𝜔 ∈ 𝐿.

Notice that we didn’t define the word problem simply as “given a language 𝐿 and a string 𝜔, is 𝜔 ∈ 𝐿”. Why? Because we might be able to answer that question correctly only for some 𝜔, but not all. Instead, the word problem is coming up with an algorithm that answers for all possible strings 𝜔—technically, a procedure, which is not exactly the same.

The word problem is the most important question in formal language theory, and one of the central problems in computer science in general. So much so, that we actually classify languages (and by extension, all computer science problems) according to how easy or hard it is to solve their related word problem.

In the next few chapters, we will review different classes of languages that have certain common characteristics which make them, in a sense, equally complex. But first, let’s see what it would take to solve the word problem in our example languages.

Solving the word problem in any finite language is trivial. You only need to iterate through all of the strings in the language. The word problem becomes way more interesting when we have infinite languages. In these cases, we need to define a recognizer mechanism, that is, some sort of computational algorithm or procedure, to determine whether any particular string is part of the language.
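For the finite case, a sketch is almost embarrassingly short. Here I reuse the example language 𝐿1 from before. The name in_l1 is just illustrative.

```python
# The finite language L1 from the examples, written out in full.
L1 = {"bab", "abba", "ababa", "babba"}


def in_l1(s):
    """Solve the word problem for L1 by comparing against every sentence."""
    return any(s == sentence for sentence in L1)
```

In everyday Python you would simply write s in L1, but the explicit loop mirrors the idea of iterating through all the strings in the language.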

For example, language 𝐿2 has a very simple solution to the word problem. The following Python program gets the job done:

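def l2(s):
    # Count how many a's and b's we have seen so far.
    a, b = 0, 0

    for c in s:
        if c == "a":
            a += 1
        else:
            # Strings come from {a, b}*, so anything that is not an a is a b.
            b += 1

    # The string is in L2 exactly when both counts match.
    return a == b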

A fundamental question in formal language theory is not only coming up with a solution to the word problem for a given language but, actually, coming up with the simplest solution–for a very specific definition of simple: how much do you need to remember. In other words: what kind of algorithms can solve the word problem for what kind of languages?

For example, we can solve 𝐿2 with 𝑂(𝑛) memory. That is, we need to remember something proportional to how many 𝑎’s and 𝑏’s are in the string. And we cannot solve it with anything less than that, as we will prove a couple chapters down the road.

Now, let’s turn to the opposite problem of generating strings from a given language and wonder what, if any, is the connection between these two.

Generating a language

Suppose you want to generate all strings from a language like 𝐿2. To make things simpler, let’s redefine it as 𝐿2′, the language of strings over {𝑎,𝑏} with the same number of 𝑎’s and 𝑏’s, but where all 𝑎’s come before all 𝑏’s. This means 𝑎𝑎𝑏𝑏 is a valid string in 𝐿2′, but 𝑎𝑏𝑏𝑎 is not. This language is also called 𝑎^𝑛 𝑏^𝑛, that is, 𝑛 symbols 𝑎 followed by 𝑛 symbols 𝑏.

Here is a simple Python method that generates infinitely many strings from 𝐿2′:

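def generate_l2():
    # Start from the empty string, the shortest sentence in the language.
    s = ""

    while True:
        yield s
        # Wrap the current string to get the next one, two symbols longer.
        s = "a" + s + "b"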

Let’s unpack this. We start with the empty string 𝜖, defined in code as s = "". Then, we enter an infinite cycle where we yield the current string, and then attach an 𝑎 to the front and a 𝑏 to the back. Take a moment to convince yourself that any string in the form 𝑎^𝑛 𝑏^𝑛 is eventually generated by this method and, furthermore, only those strings are generated by the method.

This method is actually pretty neat because it not only generates (eventually) all of 𝑎^𝑛 𝑏^𝑛; it does so in increasing length order. It isn’t immediately obvious why this is such a good thing but here’s a bold claim: if you have a generating method for any language 𝐿, then you have a recognizing method too.

Wait, what!? Yep, you heard it right. And actually, it goes both ways. If you have a recognizing algorithm, you also have a generating one. Let’s make this our first theorem in formal language theory.

Theorem 1.1 Let 𝐿 be a formal language. There exists an algorithm 𝐴 for generating all strings in 𝐿 (in increasing length order) if and only if there also exists another algorithm 𝐴′ for solving its word problem.

Proof. To prove this, let’s first understand what the theorem is saying. If we have an algorithm 𝐴 that generates all strings in a language, we can also come up with another algorithm 𝐴′ (presumably using 𝐴) that solves the word problem, and vice-versa.

To prove this type of theorem, the most usual approach is to assume you have 𝐴 (or 𝐴′) as some kind of abstract, black-box algorithm, and try to construct the other. Let’s do it from generation to recognition first, as the other way around will be fairly easy once this is done.

⇒ Suppose we have an algorithm 𝐴 that generates all strings in 𝐿, and we are given an arbitrary string 𝜔. Let 𝑛=|𝜔| be the length of 𝜔. We just need to run 𝐴 until we either see 𝜔, in which case the answer is true (𝜔 ∈ 𝐿) or until we see one string with length greater than 𝑛, in which case the answer is false (𝜔 ∉ 𝐿). Since 𝐴 generates strings in increasing length order, one of these must happen in a finite time for any 𝜔.
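This direction of the proof translates almost word for word into code. The sketch below treats the generator as a black box and only assumes it yields strings in increasing length order. The names recognize and generate_anbn are mine, and the extra return at the end covers a case the argument doesn't spell out: a generator for a finite language that simply runs out of strings.

```python
def generate_anbn():
    """Generate a^n b^n in increasing length order, as in the method shown earlier."""
    s = ""
    while True:
        yield s
        s = "a" + s + "b"


def recognize(generate, w):
    """Solve the word problem for `w` using only a length-ordered generator."""
    for s in generate():
        if s == w:
            return True
        if len(s) > len(w):
            # Every later string is at least this long, so w will never show up.
            return False
    # Assumption: a generator that stops has listed the whole (finite) language.
    return False
```

For instance, recognize(generate_anbn, "aabb") stops as soon as the generator produces that string, while recognize(generate_anbn, "abba") stops when it sees a string longer than four symbols.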

Now, let’s do it the other way around.

⇐ Suppose we have an algorithm 𝐴′ that solves the word problem for 𝐿. Then we do the following. Let 𝑉∗ be the universe language of the vocabulary 𝑉 over which 𝐿 is defined. We can very easily code a generating algorithm 𝐴∗ for 𝑉∗ in increasing length order, simply by enumerating all sequences of symbols of length 0, then length 1, then length 2, and so on. Now, run 𝐴∗ and, for each string 𝜔 generated, run 𝐴′(𝜔). If the output is true, then yield 𝜔. Otherwise, skip it.
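This direction can be sketched in the same spirit: enumerate every string over the vocabulary in increasing length order and keep only those the recognizer accepts. The function name generate is my own, and the l2 used here is a compact stand-in for the counting recognizer shown earlier.

```python
from itertools import count, islice, product


def l2(s):
    """Recognizer for L2: same number of a's and b's."""
    return s.count("a") == s.count("b")


def generate(recognize, vocabulary):
    """Yield the strings accepted by `recognize`, in increasing length order."""
    symbols = sorted(vocabulary)
    for length in count(0):
        for combo in product(symbols, repeat=length):
            w = "".join(combo)
            if recognize(w):
                yield w


# The generator is infinite, so we only take the first few strings.
first_l2 = list(islice(generate(l2, {"a", "b"}), 7))
```

One caveat of this sketch: for a finite language the loop never ends. It just stops producing strings once it has gone past the longest one.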

So there you have it. Generating (in increasing order) and recognizing are two faces of the same problem. Cool, right? But why does this matter? For starters, it gives us a tremendously powerful connection between two sub-branches of formal language theory that we will explore in the following chapters.

Moving on

We are just scratching the surface of what formal language theory can do, and we have already touched upon several areas of computer science.

We have defined a super general notion (language) that is ultimately as profound and powerful as the very notion of algorithm. We have identified a central problem in formal language theory (the word problem) that is as deep as the very question of what problems can be solved, at all, with a computer. We connected two fundamental problems in languages (recognizing and generating) and discovered they are but two sides of the same coin. And we left hanging the question of which languages can be solved with which types of algorithms, which is ultimately a question about complexity theory. Phew!

In the following few articles, we will continue exploring the world of formal languages. We will dive into the different classes of languages according to the complexity of their generating and recognizing algorithms. We will find many intriguing unsolvable problems that have deep connections with other areas in computer science, from the most practical to the most esoteric. When we finish this dive, we will have a much more solid understanding of what computers can ultimately do. And then, we will turn to programming languages and apply all these ideas to solving the more practical problem of building a compiler.

The Hardest Problem

Alejandro Piad Morffis — Thu, 25 Jan 2024 11:00:59 GMT

This article is part of a chapter on Computational Complexity in my upcoming book The Science of Computation. You can read all the details here:
And you can support the book by preordering your digital, DRM-free copy, with early and frequent updates, including all future editions.

What is a Hard (Computational) Problem?

There are many problems in computer science for which, even though they seem complicated at first, if we think hard enough, we can come up with clever algorithms that solve them pretty fast.

A very good example is the shortest path problem. This is basically the problem of finding the shortest path between two given points in any type of network. For example, you have a map with millions of places and intersections in your city, and you have to go between one point at one extreme of the map and another at the other end of the map.

Our cleverest algorithms, on average, have to analyze very few options to get you the shortest path. They are very fast. This is what powers your GPS route planner, for example.

However, other problems can sometimes be very similar to the first type, but no matter how hard we think, we cannot find any clever algorithm that always works.

The quintessential example of this is the traveling salesman problem. Consider a similar setting: you have a city and a bunch of places you need to visit in one tour. The question is, what is the best possible tour —the one that takes the least time or travels the least distance?

This problem seems very similar to the shortest path problem, but if you solve it by starting at one point and traveling to the closest next location iteratively, you can end up with a cycle that is several times worse than the best possible cycle.
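To see the greedy strategy fail on something you can check by hand, here is a small sketch with a made-up instance: four places along a straight road, with coordinates I picked only so the arithmetic is easy. It compares the nearest-neighbor tour against the best tour found by brute force. All names here are illustrative.

```python
from itertools import permutations
from math import dist

# Hypothetical instance: four places along a straight road.
CITIES = [(0, 0), (2, 0), (-3, 0), (8, 0)]


def tour_length(order):
    """Length of the closed tour that visits `order` and returns to the start."""
    return sum(
        dist(CITIES[order[i]], CITIES[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def nearest_neighbor_tour(start=0):
    """Greedy heuristic: always travel to the closest place not yet visited."""
    order = [start]
    remaining = set(range(len(CITIES))) - {start}
    while remaining:
        here = CITIES[order[-1]]
        closest = min(remaining, key=lambda c: dist(here, CITIES[c]))
        order.append(closest)
        remaining.remove(closest)
    return order


def best_tour(start=0):
    """Brute force: try every ordering of the other places."""
    others = [c for c in range(len(CITIES)) if c != start]
    return min(([start, *rest] for rest in permutations(others)), key=tour_length)
```

On this instance the greedy tour has length 26 and the best tour has length 22. The gap is modest with only four places. The point is simply that always going to the closest next location does not give the best tour.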

There are approximate solutions that you can try in very constrained settings. Still, in the general case, no one has ever developed a clever algorithm that solves this problem any faster than checking all possible cycles.

And the thing is, all possible cycles are a huge number.

If you have 20 cities, all possible tours amount to something close to 20 factorial, which is … (checks numbers) … well, huge. Not only that, if you have a computer today that can solve the problem for, let's say, 20 cities in one second, then 22 cities will take nearly 8 minutes, 25 cities will take 73 days, and 30 cities will take you… 3 and a half million years!
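If you want to check these numbers yourself, the arithmetic fits in a few lines. This sketch assumes the running time scales exactly with the factorial of the number of cities, and that 20 cities take one second. The function name and the unit constants are mine.

```python
from math import factorial

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_needed(cities, baseline_cities=20, baseline_seconds=1):
    """Scale the baseline time by how many more tours there are to check."""
    # Exact integer ratio, valid when cities >= baseline_cities.
    ratio = factorial(cities) // factorial(baseline_cities)
    return baseline_seconds * ratio


minutes_for_22 = seconds_needed(22) / 60
days_for_25 = seconds_needed(25) / SECONDS_PER_DAY
years_for_30 = seconds_needed(30) / SECONDS_PER_YEAR
```

Going from 20 to 22 cities multiplies the work by 21 × 22 = 462, which is where the nearly 8 minutes come from. The other figures follow the same way.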

This is an exponential solution, and those become bad really fast. But for many problems, like the traveling salesman, we have nothing better that always works.

Easy versus hard problems

Problems of the first type, like the shortest path, are called P-problems (technically, polynomial time complexity, but don’t worry about that). P basically means problems that are easy to solve.

Now, for the second type of problem, we don't really know whether they are P-problems, but there is a subset of these that has an intriguing characteristic.

If I ask you to find me a cycle with, say, less than 100 km in total, that will be extremely hard. However, if I tell you, “here is a cycle that takes less than 100 km” you can easily verify it —just add up the length of each road.
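Verification really is that simple. Here is a sketch with a made-up road map of four places. The place names, the distances, and the function name are all hypothetical, and I'm assuming a valid cycle must visit every place exactly once before returning to the start.

```python
# Hypothetical road lengths in km between four places.
ROADS = {
    ("A", "B"): 20, ("B", "C"): 30, ("C", "D"): 25,
    ("D", "A"): 15, ("A", "C"): 40, ("B", "D"): 45,
}

# Build a symmetric lookup table: DISTANCES[x][y] == DISTANCES[y][x].
DISTANCES = {}
for (x, y), km in ROADS.items():
    DISTANCES.setdefault(x, {})[y] = km
    DISTANCES.setdefault(y, {})[x] = km


def is_valid_short_cycle(cycle, limit_km):
    """Check that `cycle` visits every place once and is shorter than the limit."""
    if sorted(cycle) != sorted(DISTANCES):
        return False
    total = 0
    for i in range(len(cycle)):
        # Add the road from each place to the next, wrapping back to the start.
        total += DISTANCES[cycle[i]][cycle[(i + 1) % len(cycle)]]
    return total < limit_km
```

The check is one pass over the cycle, no matter how hard it was to find the cycle in the first place.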

These types of problems are called NP problems —technically, non-deterministic polynomial time complexity, but again, forget about that. NP basically means problems that are easy to verify.

So, the most fundamental question in computer science, P versus NP, ultimately is the following:

P vs NP: Are all problems that are easy to verify also easy to solve? Or are there problems that are intrinsically harder to solve than to verify?

Intuitively, for many, it seems it must be the latter. Think about what it would mean to easily solve any problem that is also easy to verify. In a sense, it would be as if recognizing a masterpiece would be the same difficulty as creating the masterpiece in the first place. Being a great critic would be the same as being a great artist.

This seems false, but intuitions in math and computer science are often wrong. We cannot use this kind of analogy to reach any sort of informed conclusion.

Are there really hard problems?

However, theoretical reasons hint at the possibility that P is indeed distinct from NP —that there are some truly, fundamentally difficult problems. The strongest one is the existence of so-called NP-complete problems. Problems, like the traveling salesman, so difficult to solve efficiently that if you could solve one of them, you would solve all the other NP problems at the same time.

When we say a problem is difficult, we mean that it is exponentially easier to verify a solution's correctness than it is to find the solution in the first place. This concept is crucial because, for most problems in computer science, such as sorting, searching for elements in an array, or solving equations, the effort required to solve the problem is roughly the same as the effort needed to verify the solution.

In technical terms, for those problems verifying is only polynomially easier than solving, not exponentially. However, there are certain problems, like the traveling salesman problem, for which we have been unable to find a fast and efficient algorithm, only exponential ones. Nevertheless, their solutions are very easy to verify.

The heart of the P versus NP debate lies in whether these inherently difficult problems truly exist. Do problems exist that are far more challenging to solve than to verify? To answer this question fully, we would need to find a problem for which a polynomial time algorithm cannot exist.

P vs NP remains an unsolved question, and although it has seen a lot of progress, it appears to hint at the need for a new kind of mathematics. Our current mathematical methods lack the power to tackle these challenging meta-questions. However, we do have the next best thing —a wealth of empirical and theoretical evidence suggesting that, indeed, many of these problems may be unsolvable efficiently.

One of the simplest forms of empirical evidence is the vast number of problems for which we lack a polynomial time algorithm to solve them. However, this evidence is not very strong, as it could simply mean we aren’t smart enough to find those algorithms.

What we need is a more principled way of answering this question. When mathematicians are faced with an existential problem —in the sense of, does there exist an object with these properties, not in the sense of, you know, God is dead and all—what they do is try and find extreme cases that can represent the whole spectrum of objects to analyze.

In this case, we are dealing with finding problems that are difficult to solve. So it makes sense to ask, “What is the most difficult problem possible?” A problem so hard that if we crack it, we crack them all.

That’s the idea behind NP-completeness. Let’s focus on these super tricky problems – the toughest ones in this field. If there are problems so tough that solving just one of them would also solve all the others, that would seriously challenge the idea of P equals NP.

These really tough problems are called NP-complete problems, a concept defined by Stephen Cook in the 1970s. An NP-complete problem is basically a problem that is in NP (that is, easy to verify) and for which a polynomial-time solution would also give us a solution to any other problem in NP. So, in a way, these are the only problems we need to look at. If we solve one of these in polynomial time, we've proven P equals NP. And if we prove that just one of them cannot be solved in polynomial time, we've proven P is not equal to NP.

In short, we just need to focus on NP-complete problems. So, the main question then becomes: Are there problems so complex that they are essentially equivalent to all other problems?

Obviously, the hardest problems wouldn't revolve around everyday topics like finding paths in maps, sorting numbers, or solving equations. No, these problems should be much more abstract, so that they could encompass numerous other problems in their definition.

One problem to rule them all

Cook's idea was to create a problem centered around problem-solving itself. For instance, how about simulating a computer? That would be an incredibly abstract problem. To make it even more challenging, he could turn this computer simulation into a decision problem (keep in mind that NP problems are decision problems).

One way to define this problem is to imagine having an electronic circuit that computes some logical formula —this is all computers do at their core, as all computable arithmetic can be reduced to logic. Let's take the simplest logical circuit, a completely stateless one, and ask: Is there any input that will make this circuit output True? If so, we call the circuit satisfiable. Otherwise, it is deemed unsatisfiable.

Hence, given an arbitrary logical circuit, the task of determining if it's satisfiable or not is called circuit satisfiability, or Circuit-SAT for short.
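As a rough illustration, a stateless circuit can be modeled as a plain Python function from booleans to a boolean. The sketch below is the naive approach of trying every input; `is_satisfiable`, `example_circuit`, and `contradiction` are hypothetical names introduced here.

```python
from itertools import product


def is_satisfiable(circuit, num_inputs):
    """Try all 2**num_inputs inputs; return a satisfying one, or None."""
    for bits in product([False, True], repeat=num_inputs):
        if circuit(*bits):
            return bits
    return None


def example_circuit(a, b, c):
    # Three small gates combined with AND.
    return (a or b) and (not a or c) and (not b or not c)


def contradiction(a):
    # No input can make this output True.
    return a and not a
```

Evaluating the circuit on one given input is fast, which is what puts the problem in NP. The loop over all inputs doubles in length with every extra wire, and nobody knows a way around that in general.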

In 1971, Stephen Cook proved that Circuit-SAT is at least as hard as every other problem in NP. If you can solve circuit satisfiability efficiently, you can efficiently solve any other problem in NP. This means you can solve the traveling salesman problem and the knapsack problem, but also sorting, searching, and basically any easily verifiable problem.

The proof of this idea is quite involved and complex, and not something I can fully explain in this post, so I'll save that for a more detailed discussion later. But basically, since logical circuits are extremely expressive and powerful, and can compute almost everything, any decision problem in computer science can be transformed into a logical circuit that simulates its solution. Cook’s proof actually involves constructing a circuit for an abstract NP problem, so it’s pretty technical. But the intuition is that these things are basically computers, so you’re just simulating any possible (decision) algorithm.

And voilà! This proves the existence of at least one NP-complete problem, meaning there is one problem in NP that is as difficult as all problems in NP. Stephen Cook's work in 1971 essentially kick-started the entire field of computational complexity and defined the most important question in computer science: P versus NP.

The story, however, doesn't just end there. Circuit-SAT is great, but it is too much of an abstract problem. Just one year later, Richard Karp came along and demonstrated that 21 well-known and very concrete computer science problems were also NP-complete problems. These problems included the traveling salesman problem, the knapsack problem, various scheduling problems, and many graph coloring problems. In short, there turned out to be a whole bunch of problems that fell under this category, not just some abstract circuit simulation task.

These NP-complete problems aren't just theoretical issues, either. They are practical problems that we encounter regularly in logistics, optimization, and scheduling. After Karp proved his 21 original NP-complete problems, a wave of people started proving that nearly any problem in computer science involving combinatorics could be classified as NP-complete. As a result, there are now over 4,000 papers proving different NP-complete problems, ranging from determining the best move in Tetris to folding proteins.

This compelling evidence has led many computer scientists to believe that P != NP and that there are, indeed, problems that are fundamentally harder to solve, not just because we lack some understanding about them but because, by their very nature, they just cannot be solved efficiently.

What does this mean?

If it turns out that P != NP, then there are some fundamental limits to how fast a computer can solve the majority of the most important problems in logistics, planning, simulation, etc. While we have many heuristics to solve either some cases perfectly or most cases approximately, all these problems may ultimately be unsolvable quickly.

Not even a superadvanced alien —or computer— could be exponentially faster than us. Thus, P vs NP might be our best weapon to stop a self-improving AGI from reaching superhuman capacity. Even gods can’t escape complexity theory.

The Answer to All Questions

Alejandro Piad Morffis — Tue, 23 Jan 2024 17:02:31 GMT

This article is part of my upcoming book The Science of Computation. You can check all the details about it in the following post.
You can also support the book by preordering your digital copy, getting early and frequent updates forever.

What is an algorithm?

In 1936, Alan Turing aimed to answer this question once and for all. At that time, computers as we know them today did not exist. However, the term "computer" was already used, referring to individuals, primarily women, who would perform computations.

These human computers, often instructed by physicists or mathematicians, were responsible for solving calculation problems. Once the problem was devised and all the reasoning and creativity were complete, these women would receive a piece of paper containing precise instructions for the computation.

Their sole duty was to follow these instructions meticulously. They would manipulate the input numbers, essentially performing the step-by-step process that modern computers carry out. By adhering to the instructions, they would ultimately arrive at the desired answer, even if they didn’t understand the underlying math or physics involved.

Now, Turing was a logician working in the early 20th century, and logicians back then sought to address a crucial question that dominated mathematical research at the time. This question pondered the existence of limitations in mathematics.

Can purely computational, logical reasoning solve every problem? Are there certain questions that lie beyond the realm of what pure logic can answer? Such limitations were hinted at by logicians like Gödel, Russell, and Hilbert. However, Turing approached this problem from a different perspective: computational processing instead of pure logical demonstration.

The Turing machine

Turing's initial idea was to formalize and abstract the concept of a computer. He began by contemplating the nature of human computers. These human computers used a sheet of paper, often divided into cells, on which they wrote various symbols, including numbers. Turing realized the two-dimensional aspect of the paper is not essential. It could be simplified into a one-dimensional tape of cells on which symbols could be placed, deleted, or added at any given moment.

He then observed that these individuals followed instructions, focusing only on one specific instruction at any given time. There was no need to consider multiple instructions concurrently, as they were performed sequentially. Additionally, these instructions were finite and entirely predetermined for a given problem. Regardless of the length of the numbers being multiplied, for example, the instructions for multiplication are always the same.

From these observations, Turing conceived the idea of a computation machine, which he referred to as an "a-machine" (short for automatic machine). He imagined this contraption as a state machine with fixed states connected by transitions. These transitions dictated actions based on the observed symbol and current state. For example, if the observed symbol was a 3 and the machine was in state 7, the instruction would specify changing the number to a 4 and transitioning to state 11.

Despite its simplicity, this concept encapsulates the fundamental principle of following instructions and utilizing an external memory, such as a tape, to store intermediate computations. By uniting these concepts, Turing laid the foundation for the conception of a computation machine: an abstract device capable of performing any type of computation.

The a-machine, now known as a Turing machine, is an abstract representation of a concrete algorithm. For instance, there can be a Turing machine designed explicitly for multiplying two numbers, a Turing machine for computing the derivative of a function, or even a Turing machine for translating from Spanish to English, as hard as that may seem. Each machine is a distinct algorithm akin to a specific program in Python or C++.

The universal computer

But Turing took a further step, introducing the concept of a universal machine. This is where the brilliance of his idea shines. Turing deduced that a Turing machine's description can be considered a form of data as well. We can take the states and transitions of a Turing machine and convert them into a long sequence of numbers, which can then be inputted into the tape of another Turing machine.

A universal Turing machine is designed to receive the description of any concrete Turing machine, along with its respective inputs, on its input tape. The universal Turing machine then simulates the precise actions of the specific Turing machine, replicating its behavior with the given input. Turing demonstrated this could be done, thus inventing the modern concept of a general-purpose computer.
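Here is a small sketch of that idea in Python, under some simplifying assumptions of mine: the machine's description is a dictionary of transitions, `"_"` stands for a blank cell, and a step budget stops runaway machines. The names `run` and `INCREMENT` are illustrative, not Turing's.

```python
from collections import defaultdict


def run(transitions, tape, state="start", max_steps=10_000):
    """Simulate any one-tape machine given its transition table as data.

    transitions maps (state, symbol) -> (symbol_to_write, move, next_state),
    with move equal to -1 (left) or +1 (right). The machine halts as soon
    as no transition applies.
    """
    cells = defaultdict(lambda: "_", enumerate(tape))  # "_" is a blank cell
    head = 0
    for _ in range(max_steps):
        key = (state, cells[head])
        if key not in transitions:
            break  # no applicable instruction: the machine halts
        write, move, state = transitions[key]
        cells[head] = write
        head += move
    else:
        raise RuntimeError("step budget exhausted")
    used = [i for i, symbol in cells.items() if symbol != "_"]
    if not used:
        return ""
    return "".join(cells[i] for i in range(min(used), max(used) + 1))


# One concrete machine: add 1 to a binary number written on the tape.
INCREMENT = {
    ("start", "0"): ("0", +1, "start"),  # walk right to the end of the number
    ("start", "1"): ("1", +1, "start"),
    ("start", "_"): ("_", -1, "carry"),  # step back onto the last digit
    ("carry", "1"): ("0", -1, "carry"),  # 1 + 1 = 0, carry on
    ("carry", "0"): ("1", -1, "done"),   # absorb the carry and stop
    ("carry", "_"): ("1", -1, "done"),   # the number grew by one digit
}
```

Calling `run(INCREMENT, "1011")` returns `"1100"`. The point is that `run` knows nothing about incrementing: hand it a different table and it becomes a different machine, which is the universal machine idea in miniature.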

And that’s how Alan Turing almost single-handedly invented the science of Computation —the theoretical field that explores the limits of computation and its practical implementation, now somewhat confusingly called “Computer Science.”

Remarkably, Turing machines, despite their abstract nature, serve as the practical blueprints for modern computers. In modern computers, a microprocessor contains a fixed set of instructions, while random access memory provides us with a virtually unbounded tape —though not technically infinite. Hence, the concept of a universal Turing machine aligns closely with a modern computer.

However, the story does not conclude here. You see, Turing had two fundamental questions in mind. First, he sought to determine the essence of an algorithm and how we can formalize the concept of computation. That’s the Turing Machine we just talked about. But, most importantly, he targeted the most crucial question regarding the limitations of mathematics and computation at the beginning of the 20th century.

The quest to answer all questions

During this time, mathematicians were engaged in a significant undertaking to resolve mathematics altogether. They aimed to address all the remaining critical open questions in the field. A prominent mathematician named David Hilbert compiled a list of 23 questions that he regarded as the most pivotal in mathematics. While many questions on the list pertained to specific areas of mathematics, at least two questions dealt with the fundamental limits of the discipline itself.

The first question asked to prove the completeness and consistency of mathematics. This meant establishing that all truths can be proven and that mathematics has no internal contradictions.

Initially, most mathematicians believed this to be true. However, Kurt Gödel's incompleteness theorem dealt a major blow to this belief. Gödel demonstrated that in any sufficiently strong mathematical system, there are always truths that cannot be proven within that system. To prove them, looking outside the system and introducing new axioms was necessary. This revelation undermined the mathematicians' quest to solve mathematics definitively.

Yet, one fundamental question remained, and it was also widely expected to have a positive answer. This question examined whether, for all provable truths, there existed a completely mechanized procedure (an algorithm) that could find and deliver a proof within a finite amount of time. In simpler terms, it asked if computers could prove all provable theorems.

Surprisingly, the answer also turned out negative. There are problems in math and computer science that have clear answers—either true or false—yet cannot be solved algorithmically. These are known as undecidable problems: no algorithm can determine the truth or falsehood of these problems for all instances. And that is not a limitation of current technology. It’s a fundamental limitation of the very notion of algorithm.

Undecidable problems

There are two arguments to understand why computer science must have undecidable problems, using two of the most powerful tools in logic. The first is a diagonalization argument, very similar to Cantor’s proof that there are fewer natural numbers than real numbers. If you haven’t seen it, check out José J. Rodríguez’s article on Cantor’s diagonalization.

Essentially, any problem in computer science can be seen as a function that takes some input and produces some output. For example, the problem of adding two numbers can be represented as a function that takes two numbers and produces their sum. The question then becomes whether there are functions that cannot be solved by any theoretical machine or algorithm.

Now, here’s the kicker. The number of possible problems, or functions, is uncountable, while the number of Turing machines is countable. We can list them all by enumerating all binary strings and determining which ones correspond to well-formed Turing machines, in the same way in which we can enumerate all Python programs simply by enumerating all possible strings and asking the Python parser which ones are valid programs.
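The enumeration can be sketched in a few lines. I'm assuming a tiny alphabet to keep things readable, and using `ast.parse` as the validity check, which only tests syntax and never runs the code; `all_strings`, `is_program`, and `all_programs` are names made up for this example.

```python
import ast
from itertools import count, islice, product


def all_strings(alphabet):
    """Yield every finite string over the alphabet, shortest first."""
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def is_program(text):
    """True if the text parses as Python source."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False


def all_programs(alphabet):
    """Lazily enumerate the strings that are syntactically valid programs."""
    return (s for s in all_strings(alphabet) if is_program(s))
```

For example, `list(islice(all_programs("x1+="), 2))` gives `['x', '1']`. Every valid program over the alphabet shows up at some finite position in this list, and that is all "countable" means.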

So, the number of problems corresponds to the cardinality of real numbers, while the number of programs corresponds to natural numbers. Consequently, there must be infinitely many mathematical problems that cannot be solved by an algorithm.
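The diagonal trick itself fits in a few lines. This sketch works on a finite list of yes/no functions (an assumption for the sake of a runnable example; the real argument runs over the infinite list of all programs), and `diagonal` and `listed` are hypothetical names.

```python
def diagonal(functions):
    """Return g such that g(n) != functions[n](n) for every listed index n."""
    def g(n):
        return 1 - functions[n](n)  # flip the n-th function's answer on n
    return g


listed = [
    lambda n: 0,           # constant zero
    lambda n: 1,           # constant one
    lambda n: n % 2,       # parity
    lambda n: int(n > 1),  # step function
]
g = diagonal(listed)
```

By construction, `g` disagrees with the function at position `n` on input `n`, so it cannot be anywhere in the list. Apply the same move to a complete list of programs and you get a function that no program computes.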

However, even if this is true, it is conceivable that we may not care about most of these unsolvable problems. They might be unusual or random functions that we do not find significant because for almost every problem we can think of, we can devise an algorithm to solve it. So, while there may be many unsolvable problems, their importance is debatable. Turing thus aimed to find one specific problem that was both important and unsolvable, and for that, he used the second most powerful tool in the logician’s arsenal.

Consider a Turing machine and a specific input. The Turing machine can either run indefinitely in a loop or eventually halt. If the machine halts after a certain number of steps when given that input, we can determine this after a finite amount of time. However, if the machine never halts, we might never be able to tell whether it will halt. It may always be that it has not yet stopped but will do so at some point in the future. Therefore, running a Turing machine that never halts on a given input cannot tell us that it will not halt.
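This asymmetry is easy to see in code. In the sketch below, a "program" is modeled as a Python generator where each `yield` counts as one step. That modeling choice, and the names `halts_within`, `countdown`, and `forever`, are my own assumptions.

```python
def halts_within(program, budget):
    """Run a generator-based program for at most `budget` steps.

    Returns True if it finished, or None if the budget ran out, in which
    case nothing can be concluded either way.
    """
    steps = program()
    for _ in range(budget):
        try:
            next(steps)
        except StopIteration:
            return True
    return None


def countdown():
    n = 5
    while n > 0:
        n -= 1
        yield  # one step of work


def forever():
    while True:
        yield
```

Note that the function never returns False. With a budget of 3, `countdown` and `forever` look exactly the same from the outside, and no finite budget is large enough to rule out halting.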

Turing's question, then, is whether it is possible to determine, by examining only the code and input of a Turing machine, whether it will halt or not, without actually running the machine. This is known as the halting problem.

To establish the undecidability of the halting problem, Turing employs a negated self-reference, a powerful method pioneered by Bertrand Russell with his barber’s paradox. The formal proof is somewhat involved, but the basic idea is pretty straightforward.

Following Russell's template, Turing presents a thought experiment assuming the existence of a “magical” Turing machine that can determine if any other Turing machine will halt without executing it. To show this is impossible, Turing pulls a Russell and creates a new Turing machine, using that magical halting machine, that halts when another arbitrary Turing machine doesn’t. By running the new machine on itself, Turing builds an explicit contradiction: the machine must halt if and only if it doesn't halt. This contradiction proves that the existence of the magical Turing machine for the halting problem is inherently self-contradictory and, therefore, impossible.
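The construction can be sketched in Python, with one simplification: the machines here take no input, so "running the machine on itself" becomes a function that asks the oracle about itself. `make_paradox` and `says_no` are illustrative names, and `halts` stands for any candidate halting checker you care to propose.

```python
def make_paradox(halts):
    """Build a program that does the opposite of what `halts` predicts."""
    def paradox():
        if halts(paradox):
            while True:  # the oracle said "halts", so loop forever
                pass
        return "halted"  # the oracle said "loops", so halt right away
    return paradox


def says_no(program):
    """A candidate oracle that claims nothing ever halts."""
    return False
```

With `says_no` as the oracle, `make_paradox(says_no)()` returns `"halted"`, so the oracle was wrong about that program. An oracle that answers True fares no better, since the program then loops forever (which is why the example avoids running that case). Whatever the candidate answers, the program built from it proves it wrong.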

What does this mean?

What is the significance of the halting problem? For starters, with this result, Turing not only kickstarted computer science but also established its core limitations on day one: the existence of well-defined problems that cannot be solved by algorithms.

However, although the halting problem may seem abstract, it can be interpreted practically as a fundamental limitation of the kind of software we can build. When writing complex code, we often want a compiler, or linter, to determine if our code is error-free before executing it. For instance, a modern compiler for a statically typed programming language can detect and notify developers of potential type errors before runtime, like using undefined variables or calling nonexistent methods.

Ideally, we would want a super smart compiler that could answer questions like: Will this program ever result in a null reference error? Is there a possibility of infinite recursion? Can this program cause an index-out-of-range exception? Unfortunately, these questions all trace back to the fundamental nature of the halting problem; they are, in general, undecidable.

Therefore, the undecidability of the halting problem implies that most of the problems we could expect compilers to solve are fundamentally unsolvable. Consequently, automated program verification (the process wherein programs are checked for intended functionality by other programs) is impossible in general. We must find heuristics and approximate solutions to particular cases because no single algorithm will work on all possible such problems.

This limitation extends to the realm of artificial intelligence as well. These findings tell us that there are fundamental tasks computers will never be able to accomplish. Maybe this means AGI is impossible, but maybe it doesn’t. Perhaps humans are also fundamentally limited in this same way. Many believe human brains are just very powerful computers, and if we are ultimately something beyond fancy Turing machines, we might never be able to know.

But that’s a story for another Tuesday.

Supervised vs Reinforcement Learning

Alejandro Piad Morffis — Tue, 14 Nov 2023 11:01:30 GMT

Supervised Learning and Reinforcement Learning are the two main paradigms in machine learning, both involving learning from experience but in fundamentally different ways.

Supervised learning is comparable to imitation, where learning takes place through demonstrations of solving a given task. This type of learning is commonly used for tasks such as text classification, image classification, and audio transcription, as well as in training large language and diffusion models.

The basic concept of supervised learning is to utilize expert demonstrations of a task to train a machine learning system to map the input space, such as pixels in an image, to the desired output space, like a label indicating the presence or absence of lung cancer in annotated X-ray images.

In contrast, reinforcement learning involves learning through trial and error. It is often applied in robotics, self-driving cars, and any scenario requiring a sequence of actions where a plan is executed before assessing the outcome. This approach is utilized when direct access to expert demonstrations is unavailable, such as in tasks like walking or driving a car, where the method used by humans is not precisely understood or easily communicated to an AI.

Moreover, reinforcement learning is preferred when evaluating the results of a task is much more straightforward than actually solving it; for instance, it is easier to determine whether a car crashed than to drive the car, or to ascertain whether a robot moved forward or fell over than to instruct the robot on how to move.

When deciding between supervised learning and reinforcement learning, the feasibility of direct demonstration by experts becomes a crucial factor. Supervised learning is more suitable when experts can directly demonstrate the task, providing a clear framework for learning. In contrast, reinforcement learning is better suited for tasks where the direct demonstration is difficult, but the evaluation is feasible.
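A toy contrast may help here. The sketch below shrinks "driving" to a single decision with three possible actions, which is a drastic simplification, and every name in it (`ACTIONS`, `learn_by_imitation`, `learn_by_trial_and_error`, `crash_test`) is made up for illustration.

```python
import random
from collections import Counter

ACTIONS = ["accelerate", "coast", "brake"]


def learn_by_imitation(demonstrations):
    """Supervised flavor: copy what the expert did most often."""
    return Counter(demonstrations).most_common(1)[0][0]


def learn_by_trial_and_error(reward, episodes=200, epsilon=0.1, seed=0):
    """Reinforcement flavor: try actions and keep the best-scoring one."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}

    def average(action):
        if counts[action] == 0:
            return float("-inf")  # never tried, so never preferred
        return totals[action] / counts[action]

    for episode in range(episodes):
        if episode < len(ACTIONS):
            action = ACTIONS[episode]           # try everything once
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)        # explore
        else:
            action = max(ACTIONS, key=average)  # exploit what works so far
        totals[action] += reward(action)
        counts[action] += 1
    return max(ACTIONS, key=average)


def crash_test(action):
    """Hypothetical simulator: there is a wall ahead, did we avoid it?"""
    return {"accelerate": -1.0, "coast": -0.5, "brake": 1.0}[action]
```

Both `learn_by_imitation(["brake", "brake", "coast"])` and `learn_by_trial_and_error(crash_test)` end up choosing `"brake"`. The first needed an expert to show the answer, while the second only needed a way to score outcomes.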

This distinction can be best understood in terms of direct and inverse problems, common in computer science. The direct problem is the task you want to solve, from input to output, while the inverse problem involves going from the output to the input.

For example, in image classification, the direct problem is categorizing images: that is, given an input image, provide an output text that describes it –either a simple label or a full natural language description. The inverse problem is thus image generation based on a given label or description.

Since the direct problem is easier —or at least not harder— to solve than the inverse, in this case, it is more feasible to find experts to annotate images and provide accurate labels, making supervised learning the preferable approach.

Consider the example of self-driving cars, where the direct problem involves navigating complex scenarios, while the inverse problem is simulating those scenarios to determine if the car reaches its destination. While both problems are complex, the inverse problem is relatively easier: we just have to build a realistic video game. In this case, even though making an accurate car simulator is no small feat, it is still way more manageable than collecting hundreds of thousands of hours of experts driving cars in many different locations.

Thus, when the direct and inverse problems exhibit similar complexity, supervised learning is typically preferred. However, if the inverse problem is significantly easier than the direct problem, reinforcement learning becomes a more viable option.

If it’s unclear beforehand which problem is easier, you can consider two heuristics. The first heuristic involves examining the availability of data on expert performance that you can access: books, papers, tutorials, video demonstrations, etc. If such data is readily available, supervised learning may be a viable option. Otherwise, you may need to hire experts for annotation or build a simulation.

The second heuristic involves observing how humans learn to perform similar tasks. Is it through formal education or intensive practice? The former —e.g., medical diagnosis—indicates that humans learn via imitation, suggesting that supervised learning is feasible. In the other case, humans learn via trial and error, like in sports, thus suggesting that reinforcement learning is more suitable.

Now, while this heuristic provides valuable insight into the problem, it may not be universally applicable. For instance, traditional board games like chess have historically been tackled using supervised learning, but the most recent models for games, such as AlphaGo and AlphaStar, have shown that reinforcement learning is even more effective.

If you can do either, then consider the advantages of each method. With sufficient data, supervised learning can yield faster initial results, as learning from clear examples of good behavior is more efficient. However, this approach often reaches a limit at the human expert level. On the other hand, reinforcement learning has no upper limit and can achieve superhuman performance if the simulation is sufficiently accurate. Therefore, a hybrid solution that starts with supervised learning and transitions to reinforcement learning may effectively push performance to the highest level possible.

Reinforcement Learning with Human Feedback (RLHF) is an interesting middle ground between Supervised Learning and pure Reinforcement Learning. In RLHF, like in pure RL, an agent interacts with an environment and learns from trial and error. But unlike pure reinforcement, where the simulation determines the feedback for each action, here we let human experts evaluate the agent's performance and learn from their feedback. Thus, we learn the best feedback function in supervised mode and then learn the optimal behavior in reinforcement mode.

In any case, there are no hard rules when deciding whether to use supervision or reinforcement, only some heuristics. As in all disciplines, when faced with a novel problem, scientists and engineers must either draw from experience in similar problems or try different things and see what works. And this is just supervised and reinforcement learning on display. It’s pretty meta, isn’t it?

A Universal Language for Reasoning

Alejandro Piad Morffis — Thu, 09 Nov 2023 11:56:05 GMT

Probably the greatest idea in all of Computer Science is the definition of “computation”. This foundational step enabled everything else in our field to fall into place. It started with a dream of one of the greatest mathematicians, philosophers, and scientists of all time and very likely the first computer scientist, Gottfried Leibniz, and concluded with the seminal work of Alan Turing, the most important mathematician of the 20th century.

This is Origins of CS, a series exploring the foundational ideas in Computer Science, the long story of humanity’s search for the ultimate machine, a machine to solve all problems; the kind of machine we today call a “computer”.

Humanity’s ingenuity extends far back into the foggy ages of the agricultural revolution and probably even farther back. Since the dawn of homo sapiens, our species has always been interested in predicting the future. Tracking the seasons, planning crops, estimating an enemy’s forces, and building cities, all of these tasks require significantly complex computations that often need to be reasonably accurate to be helpful.

For this reason, ancient civilizations developed all sorts of computational devices to keep track of complex phenomena such as planetary motion. The most famous case is the Antikythera mechanism, a two-thousand-and-some-year-old device used to predict eclipses, regarded as the first known analog computer. The zairja is another example from around 1000 CE, a device used to make astrological predictions, much like a simplified language model (only half-jokingly here).

The Antikythera mechanism. Taken from Wikipedia, CC BY 2.5

This is a trend we see developing well into the Middle Ages and the Renaissance: constructing machines that perform some sort of calculation, whose final realization —before the arrival of electrical computers— is Charles Babbage’s differential and analytical engines. But this is a story for another day.

In this post, I want to begin exploring the ideas that led to the notion of “algorithm” or computational process, regardless of their physical realization. This is the first entry of Origins of CS, the story of the origins of Computer Science.

Note: A previous version of this post was read by my very first few subscribers. I’m updating and resharing this post today with you, as part of a new series I’m starting.

The modern history of Computer Science can be said to begin with Gottfried Leibniz, one of the most famous mathematicians of all time. Leibniz worked on so many fronts that it is hard to overestimate his impact. He made major contributions to math, philosophy, laws, history, ethics, and politics. He is probably best known for co-inventing, or co-discovering, calculus along with Isaac Newton. Most of our modern calculus notation, including the integral and differential symbols, is heavily inspired by Leibniz’s original notation. Leibniz is widely regarded as the last universal genius, but in this article, I want to focus on his larger dream about what mathematics could be.

Leibniz’s portrait, by Christoph Bernhard Francke - Herzog Anton Ulrich-Museum, online, Public Domain.

Leibniz was amazed by how much notation could simplify a complex mathematical problem. Take a look at the following example:

“Bob and Alice are siblings. Bob is 5 years older than Alice. Two years ago, his age was twice hers. How old are they?”

Before Algebra was invented, the only way to work through this problem was to think hard, or maybe try a few lucky guesses. But with algebraic notation, you just write some equations like the following and then apply some straightforward rules to obtain an answer.
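Writing B for Bob's current age and A for Alice's (the letters are my choice), the two sentences of the puzzle could be written as:

```latex
\begin{aligned}
B &= A + 5 \\
B - 2 &= 2\,(A - 2)
\end{aligned}
```

Substituting the first equation into the second gives A + 3 = 2A - 4, so A = 7 and B = 12. Two years ago they were 5 and 10, which checks out.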

Any high-school math student can solve this kind of problem today, and most don’t really have the slightest idea of what they’re doing. They just apply the rules, and voilà, the answer comes out. The point is not about this problem in particular but about the fact that using the right notation (algebraic notation, in this case) and applying a set of prescribed rules (an algorithm), you can just plug in some data, and the math seems to work by itself, pretty much guaranteeing a correct answer.

In cases like this, you are the computer, following a set of instructions devised by some smart “programmers”. Following them requires no creativity or intelligence, just the patience to apply every step correctly.

Leibniz saw this and thought: “What if we can devise a language, like algebra or calculus, but instead of manipulating known and unknown magnitudes, it takes known and unknown truth statements in general?”

In his dream, you would write equations relating known truths with some statements you don’t know, and by the pure syntactic manipulation of symbols, you could arrive at the truth of those statements.

Leibniz called it Characteristica Universalis. He imagined it to be a universal language for expressing human knowledge, independently of any particular language or cultural constraint, and applicable to all areas of human thought. If this language existed, it would be just a matter of building a large physical device —like a giant windmill, he might have imagined— to be able to cram into it all of the current human knowledge, let it run, and it would output new theorems about math, philosophy, laws, ethics, etc. In short, Leibniz was asking for an algebra of thoughts, which we today call logic.

He wasn’t the first to consider the possibility of automating reasoning, though. Having a language that can produce true statements reliably is a major trend in Western philosophy, starting with Aristotle's syllogisms and continuing through the works of Descartes, Newton, Kant, and basically every single rationalist that has ever lived. Around four centuries earlier, philosopher Ramon Llull had a similar but narrower idea which he dubbed the Ars Magna, a logical system devised to prove statements about, among other things, God and the Creation.

However, Leibniz is the first to go as far as to consider that all human thought could be systematized in a mathematical language and, even further, to dare dream about building a machine that could apply a set of rules and derive new knowledge automatically. For this reason, he is widely regarded as the first computer scientist in an age where Computer Science wasn’t even an idea.

Leibniz's dream echoes some of the most fundamental ideas in modern Computer Science. He imagined, for example, that prime numbers could play a major part in formalizing thought, an idea that is eerily prescient of Gödel numbering. But what I find most resonant with modern Computer Science is the equivalence between a formal language and a computational device, which ultimately becomes the central notion of computability theory, a topic we will explore in future issues.

Unfortunately, most of Leibniz's work in this regard remained unpublished until the beginning of the 20th century, when most of the developments in logic needed to fulfill his dream were well on their way, pioneered by logicians like Gottlob Frege, Georg Cantor, Bertrand Russell, and ending in the final realization of a computational model by Alan Turing.

But that’s a story for another day :)

Thinking like a Computer Scientist - Rigor and Formality

Alejandro Piad Morffis — Mon, 16 Oct 2023 11:52:59 GMT

A rabbit coming out of a hat. — generated with SDXL via MonsterAPI.

As a teacher, I always need to emphasize the importance of core foundational computer science courses such as logic and discrete math to my students because sometimes they can lose track of the forest by focusing too much on the trees. These courses do not necessarily provide concrete skills or knowledge that can be applied directly. Instead, they help develop fundamental reasoning skills that are applicable everywhere.

For example, in discrete math, students will make proofs about theorems in number theory, combinatorics, computability, and graph theory. While some of these concepts are useful for computer scientists, the most important takeaway from these courses is the ability to think rigorously and formally. The theorems can be found in books, but the critical thinking skills developed in these courses are invaluable.

In my discrete math course, I tell my students there are two critical components to producing a solid mathematical demonstration: rigor and formality. These two are orthogonal. You can achieve rigor without being overly formal or formality without being rigorous. When it comes to mathematical thinking, these two elements interplay to make your reasoning and argumentation clear and sound.

In this article, we will break down what it means to be rigorous and what it means to be formal. Then, we will examine the relationship between rigor and formality and argue why rigor is typically more crucial than formality.

This is the first of a series of articles I want to write on critical reasoning skills for undergrad students. I will mostly write from the context of math and computer science, but I think these ideas are valuable everywhere a sound and logical argumentation is desirable.

Rigor and formality

Rigor and formality are two orthogonal qualities of clear argumentation. In short, formality refers to form, while rigor refers to content —alternatively, we can say formality refers to syntax while rigor refers to semantics. Good arguments, like mathematical proofs, are both rigorous and formal. That is, they are solid in both content and form.

First, let’s examine these two qualities, and then we’ll analyze how they interplay in critical reasoning.

What is formality

In general, formality involves strictly adhering to specific rules of etiquette. In mathematics and computer science, this etiquette is a formal language. This means that what you say is highly restricted, as you must use specific symbols and follow specific syntactic rules with precise meaning.

In mathematical terms, formality means sticking to mathematical notation. Mathematical notation involves using symbols to refer to different objects. For instance, if you want to talk about a number with a particular property, you give it a name like “x” or “y”. You might also use subscripts to refer to a set of numbers with the same property or use more complicated symbols with specific quirks in their allowed syntax.

By naming things consistently, you ensure that anyone following your proof understands precisely what you're discussing. This is better than saying, "Remember I told you there were three numbers; just take the second one and multiply it by the third" because it removes ambiguity.

The benefit of formality is that anyone will understand precisely what you are saying based on the syntax. The advantage of clear syntax is that it removes all friction from decoding what you're talking about. Anyone reading your proof will know exactly which objects you're referring to and what you're trying to say about them. This means that understanding the proof becomes a matter of understanding the semantics and logical validity of the proof rather than figuring out what you're talking about.

What is rigor

In general, rigor refers to the thoroughness and coherence of an argumentation. When reasoning is rigorous, each step follows logically from the previous step without any hidden assumptions or presuppositions. There are no non-trivial premises that one must agree on before following the reasoning.

In mathematical terms, rigor comes from strictly following well-known rules of reasoning without taking unjustified leaps. This involves clearly stating your premises and backing up every claim with the necessary support. All claims must be either a logical consequence of the premises or previously stated claims. This builds a chain of reasoning such that any conclusion can be traced back to the premises via logical reasoning.

Thus, if the reasoning is rigorous, it should be self-evident to anyone following it that if you accept the premises, you must also accept the conclusion. However, this does not necessarily mean you will agree with the premises. Rigor guarantees the validity of the path from premises to conclusion: it ensures that each inference is correct and that what you have uncovered is true provided your premises are.

The advantage of being rigorous is that it ensures your conclusions are necessarily true, given the premises. Rigor also helps focus the discussion on the core of the argument rather than on irrelevant elements like who is making the argument. If someone argues against a rigorous argumentation, they must either reject the premises or show a flawed logical step somewhere. This helps to uncover red herrings and other fallacies people use to divert a discussion in their favor without providing solid arguments.

The interplay between rigor and formality

It is possible to have rigorous argumentation without formality.

Academic papers often present rigorous argumentation that is not overly formal. Philosophical essays, for example, demonstrate this when the writer is clear, logical, and precise. They define their topic and make careful, logical connections between premises. They justify each statement they make, even if it initially seems trivial. This is most evident in modern analytic philosophy, which explicitly strives to be logical and rational, but it is present in all good philosophical arguments.

Similarly, natural science research can also demonstrate rigorous yet not overly formal argumentation. In experimental sciences such as physics, chemistry, and biology, well-thought-out experiments are conducted with detailed setups. The hypotheses and goals of the experiment are clearly stated, and data is analyzed to draw conclusions. Scientists rigorously distill their conclusions from the data to ensure the experiments support them. While scientific papers may use semi-formal notation, such as algebra or statistics, and graphs and tables, they are not as formal as a fully-fledged mathematical proof.

The opposite is also true; you can be exceedingly formal but lack all rigor.

Being overly formal in your writing or argumentation does not guarantee accuracy. For instance, you will reach incorrect conclusions if you work on a math or geometry proof and rely on a nonexistent theorem or skip proper logical steps. Even if you use precise terminology and notation, your argumentation may still be flawed and fail to follow logically from previous steps.

Formal but not rigorous argumentations can appear by accident or by malicious manipulation. A typical example is college students, who learn to grasp the syntax of proofs but struggle with their semantics, thus producing invalid proofs that still appear mathematically sound because all the notation is flawless.

On the other hand, sometimes, people trying to convince you of something untrue will use various tactics, including an appeal to formality, to make their argument seem more credible. By using many definitions, symbols, and formulas, they may try to trick you into believing their reasoning is correct.

This may be accidental, but it can also happen intentionally when researchers try to make their assumptions and conclusions seem more reliable by embedding them in a formal framework.

In political discussions, clever people often use rhetorical language to try and persuade you of their “truth”. They might use formal and complex language, mention scientific or technical concepts, and quote well-known personalities to lend themselves credibility. However, much of what they say is meaningless or plain wrong. Quoting some dead philosopher doesn’t make your argument necessarily any more accurate, but many fall prey to this illusion.

Finally, this practice is also pervasive in pseudosciences, which try to appear scientific but do not follow a genuinely scientific approach. They use formal language, define terms, and create complex taxonomies, ontologies, and theories but lack rigor.

By now, it should be clear that while having rigor without formality can sometimes make understanding difficult, having formality without rigor is a complete disaster. Therefore, it is crucial to prioritize rigor over formality, not only in mathematical thinking but also in reasoning in general.

First, make it rigorous, and then make it formal.

How to think rigorously and formally

The final question I want to address is how to approach writing, thinking, or communicating rigorously and formally. When teaching my students, I advise them to approach this kind of writing —e.g., for mathematical proofs— in three layers.

The first layer is intuition: understanding why something is true. By building intuition and seeing connections between different parts of the issue, you can become convinced of the truth of your argument. For example, a mathematical proof uncovers hidden relationships that may initially seem complicated. However, with a solid intuition, these relations become self-evident. The intuition layer is where you reach a deep understanding of why something is true.

Next, you add rigor to your intuition. This involves making your ideas as detailed and coherent as possible, checking every step, and making explicit all intermediate steps that logically follow from previous claims. You should avoid relying on non-trivial assumptions that aren't explicitly stated.

By adding rigor, you may realize that your initial intuition was incorrect. If you're wrong, you should be able to identify where the issue lies. For instance, you may discover that you cannot justify a particular jump between claims. In such a case, you should look for a counterexample, or a proof of the opposite claim, to determine whether your initial idea was mistaken.

However, if you succeed in making your proof as rigorous as possible, it should be convincing even though it is written in natural language without any formality. Nonetheless, the argument may still be confusing, as it's written in your language and not in a universal language understood by all professionals in your discipline. There may also be some ambiguity in how something is interpreted. For example, if you say “the last number”, it may be unclear which number precisely that phrase refers to.

Finally, you can rewrite your argument using a stricter, more formal notation. In math, this involves defining every concept you mention, giving them a name, and using variables for referring to objects in your proof. When referring to something previously mentioned, use the name given to it instead of repeating its definition. For example, instead of saying, “That number we just saw”, say, “Let n be the first number that has property p,” and use n in subsequent references.

Formality will also help you identify flaws in your argumentation. If you find it difficult to formalize, your proof sketch is poorly laid out. The formality layer is like a programming language that ensures your proof is sound. If it parses and compiles correctly, it is likely to be true —provided you did the rigorous work first.

An illustrative example

Let me finish with a simple example to illustrate the process of going from intuitive to rigorous to formal argumentation. I will examine the classical proof for the infinite nature of prime numbers. This is one of the fundamental theorems in arithmetic and has a beautiful proof often used to introduce students to the art of mathematical proof-making.

Here’s an intuitive idea of why this must be true.

All numbers greater than one are either prime or divisible by some smaller prime(s). If there were finitely many primes, we could build a large number that is not divisible by any of them by multiplying them all together and adding one. But this larger number has to either be prime or have prime factors we didn’t consider.

Now, let’s make that rigorous.

Let’s start by assuming the opposite of what we want to prove. Let’s say there are only finitely many prime numbers. We could list them all out, like this: the first prime number, the second prime number, the third prime number, and so on, until we reach the last prime number.
Now, let’s create a new number. To do this, we multiply all the prime numbers in our list together and then add one to the result. This gives us a big number that is not on our list of primes.
This new number is not divisible by any of the primes in our list. When we divide it by any of our primes, there is always a remainder of one. This is because, by definition, this big number is equal to something multiplied by that prime number plus one.
But every number greater than one is either prime or has some prime divisor. So, either this new number is a prime not in our list, or it has prime factors that were not in our original list.
In either case, we have a contradiction. We assumed that we had a complete list of all the primes, but now we’ve found a prime that wasn’t on our list. This contradiction means that our original assumption (that there are only finitely many primes) must be false. Therefore, there must be infinitely many prime numbers.

The previous proof is rigorous because each claim follows from the previous claims and premises, and there are no hidden assumptions. There is no flawed reasoning, i.e., the logic of the argumentation is sound. However, it is still hard to follow because the lack of formal notation forces us to use phrases like “that number” which may be ambiguous, especially when the argumentation is sufficiently long.
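The construction in the proof is easy to try out on a small case. The sketch below is only an illustration, not part of the proof: it pretends that a short, hypothetical list of primes is complete, builds the "product plus one" number, and checks the two claims made above (remainder one for every listed prime, and a prime factor outside the list). The function names are made up for this example.

```python
from math import prod


def euclid_number(primes):
    """Multiply every prime in the list together and add one."""
    return prod(primes) + 1


def smallest_prime_factor(n):
    """Return the smallest prime factor of n (n itself when n is prime)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


# Hypothetical "complete" list of primes, chosen only for illustration.
assumed_all_primes = [2, 3, 5, 7, 11, 13]
m = euclid_number(assumed_all_primes)

# Dividing m by any listed prime always leaves a remainder of one.
remainders = [m % p for p in assumed_all_primes]

# So the smallest prime factor of m cannot be on the list.
new_prime = smallest_prime_factor(m)
print(m, remainders, new_prime)
```

Note that the new number itself does not have to be prime: here it is composite, and what the argument needs is only that its prime factors are missing from the list.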

Finally, let’s make it formal:

Due to Substack’s lack of inline math support, we have to put an image here.

It doesn’t matter if you precisely understand every step in this formal demonstration. We can turn a rigorous proof into a formal one simply by carefully naming things and using the proper syntax and notation. However, if the proof is not rigorous, to begin with, no amount of formalization will make it better.

Incidentally, note how, in this formal demonstration, we didn’t need to explicitly differentiate whether m is prime itself or a composite number with a prime divisor not in P. This is just a tiny detail to highlight that, upon formalization, a proof may become more straightforward —if you understand the notation, of course.
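For readers without the image at hand, one way such a formal version might be written is sketched below. It reuses the names m and P from the remark above; the index k, the primes p_i, and the divisor q are notation assumed for this sketch and may differ from the original image.

```latex
\textbf{Theorem.} There are infinitely many prime numbers.

\textbf{Proof.} Suppose, for contradiction, that the set of all primes is finite,
say $P = \{p_1, p_2, \dots, p_k\}$. Define
\[
  m = 1 + \prod_{i=1}^{k} p_i .
\]
Since $m > 1$, there exists a prime $q$ such that $q \mid m$.
For every $p_i \in P$ we have $m \equiv 1 \pmod{p_i}$, hence $p_i \nmid m$.
Therefore $q \notin P$, which contradicts the assumption that $P$ contains
every prime. $\blacksquare$
```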

Conclusions

Writing is one of the best ways to think clearly. Instead of relying on your faulty memory, if you explicitly lay out your ideas on paper —or electrons, for that matter— you can more easily see if your ideas make sense and discover connections you previously didn’t think of.

However, not all writing is the same. Depending on your purpose, your writing can be more or less detailed and more or less structured. Sometimes, completely unstructured, free-flow writing is precisely what you need to get something out of your mind. However, if your purpose is to be correct and convincing, you’re better off striving to be rigorous and formal.

Formality and rigor are two cornerstones of sound, logically-backed reasoning. By learning to structure and organize your arguments such that they are both rigorous and as formal as necessary —but not overly formal as to make them difficult to understand— you can become a better writer and, ultimately, a better thinker.

Foundations of ML #2 - Learning Paradigms

Alejandro Piad Morffis — Wed, 09 Aug 2023 11:00:54 GMT

This is the second entry in a series on the Foundations of Machine Learning. In the previous post, we talked about how Machine Learning works. It’s about having some source of experience E for solving a given task T, which allows us to find a program P that is (hopefully) optimal for some metric M. We argued that this paradigm represents a fundamental shift in how we build software because we turn the problem from coming up with a solution ourselves to searching for a suitable solution among sensible strategies —from open-ended to closed-ended.

Photo by Vlad Tchompalov on Unsplash

If you haven’t read the previous post or need a recap, check it out before moving on.

In this post, we will start discussing the different subtypes of machine learning paradigms we can find. If you are familiar with ML, you’ve probably heard terms like supervised, unsupervised, or reinforcement learning. This post will explain exactly what these terms mean. But first, we will develop a unified framework to think about machine learning paradigms encompassing these three and many other ML flavors.

All machine learning is learning from experience, as we’ve seen. To learn, you need to be able to measure how well you’re doing; that is, you need feedback. Thus, to understand where learning paradigms differ, we will look at the different types of experiences that we can access and the types of feedback we can rely on. Once those intuitions are in place, we will meet the three fundamental learning strategies: learning by imitation, pattern recognition, and trial and error.

Combining different flavors of experiences, feedback, and learning strategies, we can define all machine learning paradigms, from supervised, unsupervised, and reinforcement learning, to self-supervised learning, active learning, and other hybrid modes. We will finish this post by describing the most popular paradigms.

Buckle up!

Deconstructing learning paradigms

Machine learning paradigms vary in at least three essential factors: the nature of the experience available, the quality of the feedback, and the strategy used to learn. The fundamental dimension to characterize a source of experience is whether it is static or dynamic. Regarding feedback, the fundamental dimension to characterize it is implicit vs. explicit.

Nature of experience

Static experience is, for example, a collection of books. You can learn from books by reading other people’s experiences. But you are limited beforehand to the amount and quality of experience that was put in those books. If you read from a very good author (or teacher), they will have chosen the precise experience you need to learn effectively. But you cannot control what they teach you: you cannot ask a book questions, you can only choose beforehand what source to consult.

In contrast, dynamic experience is what you get by practicing, e.g., playing a sport or a musical instrument. Dynamic here means that the amount and quality of the experience you get depends at least partially on how you act. You try to kick the ball a certain way or put your fingers on a particular key on the piano, and you’ll get different feedback. Thus you can, to some extent, control the experience you get by asking different questions.

In short, static experience is predefined before we access it. We can perhaps decide in which order to consume it, but we cannot fundamentally change the experience we get. Dynamic experience depends upon our behavior —by acting in specific ways, we can shape the experience we get. We must understand, though, that this isn’t a binary distinction but a spectrum going from pure static to more dynamic.

In computational terms, the most common type of static experience is a dataset, i.e., a collection of data —e.g., images, text, audio, video, etc.— that someone put together. The most common type of dynamic experience is a simulation, i.e., a computer program with specific rules where one or more virtual agents (also computer programs) can interact.
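To make the contrast concrete, here is a minimal sketch. Everything in it is hypothetical and chosen for illustration: a three-item dataset stands in for static experience, and a toy simulation (an agent walking along positions 0 to 4) stands in for dynamic experience.

```python
# Static experience: a fixed dataset someone else assembled.
# The learner may reorder it, but cannot change what it contains.
dataset = [(1, 1), (2, 4), (3, 9)]  # hypothetical (input, output) pairs


class NumberLineSimulation:
    """Hypothetical toy simulation: an agent walks on positions 0..4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        """Apply an action (-1 or +1); return the new position and a reward."""
        self.position = max(0, min(4, self.position + action))
        reward = 1 if self.position == 4 else 0
        return self.position, reward


def collect(actions):
    """Dynamic experience: what the agent observes depends on how it acts."""
    sim = NumberLineSimulation()
    return [sim.step(a) for a in actions]


always_right = collect([+1, +1, +1, +1])
back_and_forth = collect([+1, -1, +1, -1])
print(always_right)
print(back_and_forth)
```

The two runs use the same simulation but produce different experience: only the agent that keeps moving right ever observes the reward.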

Quality of feedback

Explicit feedback is directly related to your learning target. For example, if you’re learning to play an instrument, the most explicit feedback would be, “This is how you must move your fingers.” If you’re learning chess, very explicit feedback would be knowing the optimal move on a given board.

However, you can often only get implicit feedback, in various degrees. For example, take learning to play soccer. You kick the ball and observe its trajectory. If it doesn’t land where it should, you know you did something wrong but not exactly what. This feedback is only indirectly related to your learning target, e.g., the exact force and torque you need to hit the ball. This is also a spectrum rather than a binary distinction.

The more explicit the feedback, the easier it is to learn. However, providing explicit feedback often entails knowing how to solve the problem and effectively communicating that knowledge. Your music teacher not only knows how to play the guitar; they can also tell you precisely what you’re doing right or wrong. But if you’re learning to ride a bicycle, no amount of explanation will do. We don’t even know exactly how we do it, let alone how to explain it to someone else.

Feedback quality can also degrade in at least two critical ways: it can be delayed or sparse. Delayed feedback is, for example, when you lose a game of chess. You only know that your strategy was flawed at the end, but you can’t tell exactly which moves were good or bad. Sparse feedback is the more general issue of getting a feedback signal only for a subset of your decisions. Both phenomena make learning harder.

Learning strategies

Three main learning strategies underpin all ML algorithms so far in use. These are the actual mechanisms by which a model is trained based on experience. While the lines between ML paradigms are somewhat blurry, they all use one or a combination of learning by imitation, trial-and-error, and pattern recognition.

Learning by imitation is the most common and intuitive technique. It’s one of the main ways the young learn, in humans and in most mammal species. You observe some behavior and try to reproduce it. Learning by imitation is the most efficient strategy, but it requires clear and direct feedback.

Learning by trial and error is what you do when feedback is indirect and sparse. If you cannot know precisely what you should do, but at least you can know if your actions are better or worse, you can try different things and see which is better. Learning by trial and error can take much longer than imitation, especially if the space of possible decisions is huge. It requires dynamic experience since you need to be able to decide which actions to try out.

Learning by pattern recognition is the most indirect type of learning, useful in the least favorable setting: implicit feedback and static experience. It’s the learning that happens when, for example, you read a non-didactic book or watch a movie. There is no explicit learning objective, but you can still find some interesting lessons to keep. It involves finding patterns in the available experience that explain or summarize at least part of it.

These three learning strategies are not disjoint. They can be used in different combinations depending on the available experience.

Machine learning paradigms

Now that all intuitions are in place, we can briefly enumerate the most important machine learning paradigms. We will start with the three basics: supervised, unsupervised, and reinforcement learning. Each corresponds to one specific combination of experience, feedback, and learning strategy. Then we will see three prominent hybrid paradigms that blur the frontiers between these three.

Supervised Learning

The most common, simple, and helpful learning paradigm is Supervised Learning. It involves static experience with the most explicit kind of feedback, and it’s based on imitation learning.

In this paradigm, the experience is a static collection of input/output pairs, and the task is defined as finding a function that produces the correct output for any given input. Supervised learning is thus a problem of prediction: given an input, predict the corresponding output. The underlying assumption is that there is some correlation (or, in general, a computable relation) between the structure of an input and its corresponding output and that it is possible to infer that function or mapping from a sufficiently large number of examples.

The output can have any structure, from highly complex to simple atomic values. When the output has a complex structure (e.g., a sequence or an arbitrary object with properties), we call it a structured prediction problem. When the output is a simple atomic value, we have two special sub-problems: classification and regression.

Classification is when the output is a category out of a finite set. Examples are detecting the object(s) that appear in an image, assigning a sentiment to a text, or predicting if a person has a disease —like cancer— given the results of a set of medical exams.

Classification problems can be binary —e.g., predict cancer or no cancer— or multi-class —e.g., predict the sentiment of a text. Multi-class problems can also be multi-label when the output can be zero, one, or more than one category for a single input—for example, predicting the topics of a piece of text.

Regression is when the output is a continuous value, bounded or not. A typical example is property valuation, e.g., predicting the price of a house, a car, or any other object given its characteristics.
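As a rough illustration of both sub-problems, the sketch below learns from a handful of input/output pairs. The data, the function names, and the two methods (a nearest-neighbor rule for classification, a least-squares line for regression) are assumptions picked for brevity; they are just two of many possible supervised learners.

```python
def nearest_neighbor_classify(examples, x):
    """Classification by imitation: copy the label of the closest known input."""
    _, label = min(examples, key=lambda pair: abs(pair[0] - x))
    return label


def fit_line(examples):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    var = sum((x - mean_x) ** 2 for x, _ in examples)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: exam value -> test result, floor area (m^2) -> price.
results = [(0.5, "negative"), (1.0, "negative"), (3.0, "positive"), (4.0, "positive")]
prices = [(50, 100_000), (100, 200_000), (150, 300_000)]

label = nearest_neighbor_classify(results, 3.4)
slope, intercept = fit_line(prices)
predicted_price = slope * 120 + intercept
print(label, slope, intercept, predicted_price)
```

In both cases the learner only sees examples of the right answer and has to generalize to an input it has never seen (3.4 and 120 here).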

Unsupervised Learning

The main alternative to supervised learning is, you guessed it, Unsupervised Learning. The main difference is that we don’t have access to the “right outputs.” Instead, we have a large static dataset and want to find hidden patterns. Thus, feedback is highly indirect and often implicitly defined, and we resort to pattern recognition as the primary learning strategy.

The underlying assumption is that there is some regularity in the structure of those elements that helps explain their characteristics with a restricted amount of information, hopefully significantly less than just enumerating all elements. In this approach, the programmer leaves a deeper footprint, since the kind of regularities that can be found is a bias that needs to be provided.

Two common sub-problems in unsupervised learning are associated with where we want to find that structure: clustering and dimensionality reduction.

Clustering is when we care about the relationship between different elements. This task is often framed as finding the (generally predefined number of) groups to which each element belongs. Examples include grouping users given their behavior on social media or grouping products given their characteristics in an online store.

Clustering requires defining an implicit similarity relation between elements, such that elements in the same group are, on average, more similar to each other than to the elements in other groups. Thus, the performance of clustering is often highly subjective, depending on how we define that similarity. Also, other assumptions and biases determine the “geometry” of the clusters. Long story short, every clustering algorithm is the best one under its own assumptions.
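A minimal sketch of one such algorithm, k-means, shows where those assumptions enter. The choices below are assumptions of this example, not the only options: similarity is squared Euclidean distance, the number of groups is fixed in advance by the initial centers, and the points are hypothetical.

```python
def k_means(points, centers, iterations=10):
    """Plain k-means on 2D points with squared Euclidean distance."""

    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    groups = []
    for _ in range(iterations):
        # Assignment step: each point joins the group of its closest center.
        groups = [[] for _ in centers]
        for p in points:
            closest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            groups[closest].append(p)
        # Update step: each center moves to the mean of its group.
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups


# Hypothetical data: two visually obvious blobs.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, groups = k_means(points, centers=[points[0], points[3]])
print(centers)
print(groups)
```

Swapping the distance function or the number of initial centers changes what counts as a "good" grouping, which is the subjectivity described above.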

Dimensionality Reduction is when we care about the internal structure of each element. The task is often framed as finding a compact representation of the elements that captures their most salient characteristics. In other words, the representation keeps as much information as possible, where each model defines what information means in its context.

This task is often used as a preprocessing step before supervised learning because it helps reduce each element’s detail to its most essential traits. For this reason, there aren’t easily recognizable examples of dimensionality reduction in user-facing applications.
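A tiny sketch can show what "compact representation" means in practice. It assumes one particular notion of information (variance along a direction, as in principal component analysis) and uses hypothetical 2D points that all lie on one line, so a single number per point is enough to rebuild them. The function names are made up for this example.

```python
def first_principal_direction(points, iterations=100):
    """Power iteration on the 2x2 covariance matrix of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = (vx * vx + vy * vy) ** 0.5
        vx, vy = vx / norm, vy / norm
    return (mx, my), (vx, vy)


def compress(point, mean, direction):
    """Represent a 2D point with a single number along the main direction."""
    return (point[0] - mean[0]) * direction[0] + (point[1] - mean[1]) * direction[1]


def reconstruct(code, mean, direction):
    """Map the single number back to 2D."""
    return (mean[0] + code * direction[0], mean[1] + code * direction[1])


# Hypothetical data lying exactly on the line y = 2x.
points = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
mean, direction = first_principal_direction(points)
code = compress((3, 6), mean, direction)
rebuilt = reconstruct(code, mean, direction)
print(code, rebuilt)
```

With real data the points never sit exactly on a line, so the reconstruction is only approximate; how much is lost is what each model's notion of "information" measures.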

Reinforcement Learning

The third fundamental ML paradigm is Reinforcement Learning. The main difference with the previous two is that instead of a static collection of data, here we have a dynamic simulation where an agent can act, and the task is formulated as learning to make decisions that lead to desirable outcomes. It is based on learning by trial and error.

This paradigm is useful when we have to learn to perform a sequence of actions, and there is no obvious way to define the “correct” sequence beforehand other than trial and error, such as training artificial players for video games, robots in real life, or self-driving cars. The feedback is formulated as a payoff obtained after the agent performs one or a sequence of actions, and the objective is often to maximize the expected payoff accumulated over the agent's lifetime.

Feedback in reinforcement learning can be delayed and sparse, meaning our agents must learn to perform long sequences of actions that lead to a single outcome. Thus, a major problem is credit assignment, i.e., deciding which actions in that sequence should account for what part of the ultimate payoff. Also, the same action can lead to different payoffs at different moments.

For example, in self-driving cars, you get an immediate positive payoff every second the car hasn’t crashed, a mid-term negative payoff for every driving law broken, a longer-term payoff if you reach the desired location, etc. The same action (e.g., turning the steering wheel a degree to the left) can be involved in all those payoffs. This level of indirection between action and feedback makes RL far more complex than supervised and unsupervised learning in general.
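One common way to spread a delayed payoff over the actions that led to it is discounting: each step is credited with its own reward plus a shrinking share of everything that came afterwards. The text above does not commit to any particular scheme, so take the sketch below as one assumed option, with a hypothetical four-step episode where only the last step is rewarded.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    # Walk backwards so each step can reuse the return of the next one.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Hypothetical episode: three unrewarded actions, then a payoff of 1.
rewards = [0, 0, 0, 1]
credit = discounted_returns(rewards, gamma=0.5)
print(credit)
```

Earlier actions receive less credit than later ones, which encodes the (debatable) assumption that actions closer to the payoff are more responsible for it.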

Reinforcement learning approaches can be divided into model-based and model-free. In the former, the agent learns a model of the environment. Thus, it can predict the payoff that specific sequences of actions will provide and optimize accordingly. In the latter, the agent learns which action to take, or how valuable each action is, directly from experience, without explicitly modeling the environment.
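As an example of the model-free flavor, here is a minimal sketch of tabular Q-learning, a classic algorithm of this family. The environment is a hypothetical five-position corridor with a payoff only at the far end, and the hyperparameters (learning rate, discount, number of episodes, purely random exploration) are arbitrary choices for illustration.

```python
import random


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical 5-state corridor (goal at state 4)."""
    rng = random.Random(seed)
    actions = (-1, +1)
    q = {(s, a): 0.0 for s in range(5) for a in actions}
    for _ in range(episodes):
        state = 0
        for _ in range(50):
            action = rng.choice(actions)  # explore uniformly at random
            next_state = max(0, min(4, state + action))
            reward = 1.0 if next_state == 4 else 0.0
            done = next_state == 4
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            # Move the estimate toward reward + discounted best future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


q = q_learning()
greedy_policy = {s: max((-1, +1), key=lambda a: q[(s, a)]) for s in range(4)}
print(greedy_policy)
```

The agent never builds a description of the corridor. It only keeps a table of how good each action looked in each state, and the greedy policy read from that table ends up walking toward the goal.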

Semi-supervised Learning

This is a straightforward mixture of supervised and unsupervised learning, in which we have explicit output samples for just a few of the inputs plus many additional inputs where we can try, at least, to learn some structure.

Semi-supervised Learning is useful in almost any supervised learning problem when we hit the point where getting additional labeled data —with both inputs and outputs— is too expensive, but it is easy to get lots of unlabeled data —just with inputs.

Its advantage is that if the input data is relatively uniform, the underlying structure useful for predicting outputs —the one you would need to learn in supervised mode— will closely match the structure you can discover unsupervised. Thus, you can learn a lot from the data alone, in an unsupervised way, and then add a small supervised tweak to the mix.
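A small sketch of that idea, under strong simplifying assumptions: one-dimensional inputs, exactly two classes, and made-up data and labels. The structure is found from all inputs without looking at labels, and the two labeled examples are used only at the end to name the groups.

```python
def two_means_1d(values, iterations=10):
    """Unsupervised step: split 1D values into two groups around two centers."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        left = [v for v in values if abs(v - low) <= abs(v - high)]
        right = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = sum(left) / len(left), sum(right) / len(right)
    return low, high


def label_clusters(centers, labeled):
    """Supervised tweak: name each cluster after its nearest labeled example."""
    return {c: min(labeled, key=lambda pair: abs(pair[0] - c))[1] for c in centers}


def predict(x, cluster_labels):
    """Assign x the label of the closest cluster center."""
    nearest_center = min(cluster_labels, key=lambda c: abs(c - x))
    return cluster_labels[nearest_center]


# Hypothetical data: many unlabeled inputs, only two labeled ones.
unlabeled = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
labeled = [(1.0, "cat"), (5.0, "dog")]

centers = two_means_1d(unlabeled + [x for x, _ in labeled])
cluster_labels = label_clusters(centers, labeled)
print(predict(1.5, cluster_labels), predict(4.0, cluster_labels))
```

This only works because the structure found without labels happens to line up with the classes, which is exactly the uniformity assumption mentioned above.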

One form of this paradigm, unsupervised pretraining, was very popular in the early days of deep learning. Training an entire neural network end-to-end on lots of data was computationally infeasible back then. The alternative was to pre-train intermediate layers of the network in unsupervised mode before training the final layer(s) in supervised mode.1 As newer, simpler architectures have been invented and computational power has increased, this approach has fallen out of favor, but it is still helpful in low-resource scenarios.

Self-supervised Learning

This paradigm is also in-between supervised and unsupervised learning but has a different setup. In Self-supervised Learning, we want to predict an explicit output, but that output is simultaneously part of the input. So in a sense, the output is also defined implicitly. Thus, this paradigm is simultaneously supervised and unsupervised instead of simply concatenating supervised and unsupervised problems.

A straightforward example is language models, like BERT and GPT, where the objective is (hugely oversimplifying) to predict the n-th word in a sentence from the surrounding words, a problem for which we have lots of data —e.g., all the text on the Internet.
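The trick is that the labels come for free from the data itself. The sketch below is a deliberate oversimplification with hypothetical names: it hides one whole word at a time, whereas real models work on subword tokens and use more elaborate schemes (masking a fraction of tokens, or predicting the next token).

```python
def masked_word_examples(sentence, mask="[MASK]"):
    """Turn one unlabeled sentence into (input, target) pairs, one per word."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        # Hide the i-th word; the hidden word becomes the expected output.
        context = words[:i] + [mask] + words[i + 1:]
        examples.append((" ".join(context), target))
    return examples


examples = masked_word_examples("the cat sat")
for model_input, target in examples:
    print(model_input, "->", target)
```

A single three-word sentence already yields three supervised examples without anyone labeling anything, which is why this setup scales to huge text collections.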

Self-supervised learning is a highly effective approach in machine learning as it blurs the boundaries between supervised and unsupervised learning. Its reliance on explicit feedback makes it as efficient in learning as the supervised paradigm. However, what sets it apart is that it doesn’t require human labeling, making it a scalable and much more efficient option, just like in unsupervised learning.

Active Learning

The final paradigm we’ll analyze in this post is Active Learning, which is also a hybrid of sorts, but this time mixing traits of supervised and reinforcement learning. In this setup, we have a vast, potentially infinite amount of unlabelled data as experience and access to an “oracle” (e.g., a human expert) whom we can ask for the correct output for any specific input.

The key to active learning lies in pinpointing the most effective examples to request from humans. This approach can optimize our knowledge acquisition while minimizing human effort.

Active learning can reduce the cost of data collection enormously compared to traditional supervised learning while keeping the same model performance. As in supervised learning, we have explicit, high-quality feedback. But unlike supervised learning, we only pay the cost of getting that feedback (which often entails having a human annotator) for the set of most informative learning examples.

In active learning, we train the model and annotate the data simultaneously in a tight loop. As the model improves, it will tell us exactly which data point is most cost-effective to annotate next. Once finished, we obtain the trained model and the final, optimally annotated dataset.
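The loop described above can be sketched on a toy problem. Assume, purely for illustration, that the task is to learn a single numeric threshold, that `oracle` stands in for the human annotator, and that "most informative" means "closest to the current decision boundary" (a strategy usually called uncertainty sampling). Real active learning systems use richer models and query strategies.

```python
def active_learning(pool, oracle, budget):
    """Learn a 1D threshold by querying the oracle on the most uncertain point."""
    labeled = {}                      # point -> label supplied by the oracle
    lo, hi = min(pool), max(pool)     # current bounds on the true threshold
    for _ in range(budget):
        threshold = (lo + hi) / 2
        candidates = [x for x in pool if x not in labeled]
        if not candidates:
            break
        # The unlabeled point closest to the boundary is the least certain one.
        query = min(candidates, key=lambda x: abs(x - threshold))
        labeled[query] = oracle(query)
        if labeled[query]:
            hi = min(hi, query)       # positives sit at or above the threshold
        else:
            lo = max(lo, query)       # negatives sit below it
    return (lo + hi) / 2, labeled


pool = list(range(101))               # 101 unlabeled points
oracle = lambda x: x >= 37            # hypothetical expert with a hidden rule
threshold, labeled = active_learning(pool, oracle, budget=7)
print(threshold, len(labeled))
```

In this toy run, seven questions to the oracle are enough to pin the boundary between 36 and 37, instead of labeling all 101 points.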

Final remarks

Not every task comes with a paradigm label, though. In the realm of machine learning, these paradigms do not only operate independently. Often, the more powerful methods live on the frontiers between them.

In this final section, we aim to give you a flavor of how these interactions look. We will present the ideas behind some illustrative examples without technical details.

Autoencoders are an excellent example of how supervised and unsupervised learning can be combined in a single model. At their core, they are unsupervised models designed to uncover hidden patterns in unlabeled data. Yet, this process can be enhanced using supervised learning techniques. We feed an encoder with the input data, and it returns a more compact representation as output. Then a decoder receives this output as an input, producing something in the same format as the original input. The composition of both the encoder and the decoder is what we call the autoencoder. Given a similarity metric, the task is to learn to reconstruct the input.[2]

This architecture resembles self-supervised learning —in the sense that the output is part of the input, i.e., it’s the input itself. We are not quite interested in the output itself, though. We are instead interested in the intermediate, more compact representation.

Embeddings are another interesting example, a general method for representing discrete variables, like words, items, or users, by transforming them into continuous vectors within a lower-dimensional space. These embeddings encapsulate some semantic or contextual attributes of the variables. Their application extends to natural language processing, computer vision, recommender systems, and beyond.

In the unsupervised scenario, they can detect similarity, association, or hierarchy. Facing supervised problems, they offer substantial utility due to their capacity to handle nominal data that do not have explicit characteristics. For instance, when dealing with word embeddings, words that share analogous meanings or fall within the same category can be positioned closer together within the vector space.
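As a rough illustration of "closer together within the vector space", here is a small sketch using cosine similarity. The three-dimensional vectors below are made up for the example; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat~dog: {cat_dog:.2f}, cat~car: {cat_car:.2f}")
```

With these toy vectors, "cat" ends up much closer to "dog" than to "car", which is the kind of structure a downstream model can exploit.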

Conclusions

In conclusion, the various subtypes of Machine Learning paradigms are shaped by the interplay between different types of experiences and feedback and the learning strategies employed.

Static and dynamic experiences refer to the nature of the data used for training, with static experiences involving fixed datasets and dynamic experiences involving continuous data streams or simulations. Explicit and implicit feedback refers to the type of guidance provided to the machine learning algorithm, with explicit feedback being more direct while implicit feedback is more subtle and indirect.

Considering these three key concepts, we can better understand the differences and similarities between machine learning paradigms. For example, supervised learning involves explicit feedback and static experiences, while reinforcement learning involves implicit feedback and dynamic experiences. On the other hand, unsupervised learning relies on implicit feedback and can be either static or more dynamic, depending on the nature of the data.

Ultimately, by understanding these key concepts, we can gain a deeper understanding of the underlying principles that govern Machine Learning and make informed decisions about which approach to use for a given problem. In the next entry in this series, we will discuss another fundamental distinction between Machine Learning approaches: generative vs discriminative modeling.

[1] Don’t worry about the details of this process. We will talk about neural networks and their training paradigms in future posts.

[2] This concept is analogous to a musician trying to replicate several pieces by ear. The musician’s brain builds a representation of the music that’s simpler than the music itself. Eventually, most musicians agreed on a unified language for this representation, but that’s totally out of scope for this article ;)

Foundations of ML #1 - What is Machine Learning?

Alejandro Piad Morffis — Wed, 26 Jul 2023 16:14:45 GMT

Machine Learning is a fundamental paradigm shift in how we build computer programs. It opens the door to solving problems that are unsolvable by traditional means. In short, Machine Learning answers the question, “What if we don't know how to write a program to solve this problem?” The solution is simple and elegant: we write another program to find the best program to solve this problem.

Photo by Hassan Pasha on Unsplash

This is the first article in a series about the Foundations of Machine Learning. This series will cover basic concepts, the main learning and modeling paradigms, and evaluating and comparing models. From there, future series will explore specific models and algorithms in greater detail.

This series is written in collaboration with a mathematician turned PhD student of mine in Machine Learning, which is like, the best combo I could get.

This first series is aimed at general readers interested in knowing more about this field so they can understand the conversation around Machine Learning going on in the media. For this reason, this series has no code or math whatsoever, just intuitions —ok, maybe I’ll throw in a few equations for performative reasons, but nothing strictly necessary.

However, if you are interested in Machine Learning from a software development or a scientific perspective, this series is still for you. In that case, I believe this series will be helpful as an intuitive foundation from which you can move on to more technical courses. I’ll give you some pointers if you want to dig deeper.

In this first article, I want to answer the broadest questions: What is Machine Learning, and when is it useful? To get there, first, we’ll look at why we want Machine Learning in the first place by comparing it with the traditional approach in computer science on a motivating example: making a world-class chess player. Then we’ll define exactly what “the Machine Learning approach” means. And finally, we’ll briefly discuss how we can actually do Machine Learning.

Buckle up!

Building a chess player the traditional way

To understand why machine learning is important, we must first discuss when the traditional approach to solving computer problems fails miserably. For that, let’s briefly talk about what it would take to solve what was for a long time considered the quintessential problem in artificial intelligence: making a chess player that can beat the world champion.

If you know how to code, you can already begin to imagine how hard it must be to write a program that plays chess at the highest level. Let me illustrate the type of approach that a typical software developer would follow.

First, you must understand the rules of chess: how pieces move and what it means to win and lose a game. This only guarantees that you can write a program that never makes invalid moves. But that’s far from good enough. To beat even the weakest opponents, you’ll need to code some strategies, i.e., rules for the program to not simply pick one random valid move but a move with a reasonable chance of leading to victory.

To find strategies, you can either read strategy books, consult an expert, or maybe think really hard and develop good ideas yourself. Either way, you will end up coding a program that does a combination of the following:

a bunch of hard-coded rules for well-known openings and endings; and
some heuristic to play the mid-game, most likely based on weighting different pieces according to their position on the board.

And this approach works, but it has many disadvantages, some obvious and some less so. The first obvious limitation is that building and maintaining such a complex program is increasingly difficult, especially as new strategies are discovered. That kind of long collection of rules is a headache for most software engineers. Second, slightly less obvious, we never know if we're playing the optimal strategy. Chess is an unsolved game, meaning no definitive strategy is known to work against all opponents.

But the most important limitation of the traditional programming approach is that the program can only be as good as the programmer. There is nuance here, of course. Maybe the knowledge comes from domain experts and not precisely the programmers. But in any case, one or more humans’ knowledge about the problem must be explicitly coded in software.

This means that while the computer is definitely an immense boost in scale and speed —we can solve bigger problems much faster—, fundamentally, traditional software cannot take us beyond human capabilities. The best chess-playing program using this approach might exceed the best human by sheer brute force, but it cannot fundamentally play better chess.

To see why this is a huge issue, consider what happens when we face a problem no one knows how to solve explicitly. At least two kinds of problems in this category are amenable to machine learning.

First, we have unstructured problems that we can solve, but we can’t explain how. These are problems where we suspect there must be a set of rules that work (because we humans solve them intuitively), but we cannot explicitly list those rules in a way that a computer can understand, i.e., in a computer program. Examples include perception (e.g., recognizing objects in images) and language.

And second, we have inverse problems that we can solve easily in one direction but not in reverse. These are problems for which we know how to go from cause to effect, but it is not easy to determine, given an effect, what was the cause. Most physics falls in this category, e.g., given a specific design for an engine, we can simulate in a computer—and thus calculate— its performance under certain conditions. The inverse problem would be, given some constraints of design and a desired performance, to find the optimal engine design.

Our motivating example, playing chess, falls somewhere in between. On the one hand, some humans are extremely good at it. Still, while there are books and books of strategies, no one can explain exactly what the world champions do, not even themselves —otherwise, everyone could be a world champion. On the other hand, given a set of moves in chess, we can simulate it and compute who is the winner, but it is not easy to determine the right set of moves that, from a given board, leads to someone winning.

The bottom line is: we know there must be an optimal strategy (or set of strategies) for playing chess at the world-champion level —because there are people who do it—, but neither we nor anyone else can think through and come up with a computationally convenient description of that strategy.

The way of Machine Learning

Now here is the Machine Learning approach to this problem.

Instead of attempting to code the optimal strategy directly, let’s turn the problem on its side and ask, “If I had a huge number of possible strategies, could I identify the best one?”

This change in framing is a big leap because it makes the problem at least tractable. We are no longer trying to invent or discover something; we are “merely” searching for something inside a well-defined set of things. The problem shifted from being open-ended (come up with a winning strategy) to closed-ended (select the best strategy among all of these)!

However, this leap hinges on two crucial assumptions.

First, we must be able to enumerate all possible chess-playing strategies or, at least, to define a sufficiently large set of strategies such that the best one among them has a fair chance of being the one we’re looking for. We call this set the hypothesis space, and defining it will prove non-trivial. In fact, having a good hypothesis space will be the fundamental obstacle in getting machine learning to work.

For this motivating example, we can easily define a sensible hypothesis space that will illustrate the thinking one needs to develop in machine learning. Let’s say we assign a score to each piece-position pair, for example, white-pawn-at-B7, black-rook-at-D3, etc. We compute the “value” of any concrete board configuration by adding all scores associated with the pieces in their corresponding locations.

Assuming we have the right scores, our chess-playing program is pretty simple:

Generate every possible board from the current one, applying all valid moves.
For each board, compute its value using those —still unknown— scores.
Return the move that leads to the highest valued board.
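The three steps above can be sketched in a few lines. This is a toy under stated assumptions: a board is represented as a set of piece-position names, `scores` is a plain dictionary, and `successors` is a hypothetical helper that would apply every valid move and return the resulting boards (writing a real move generator is a separate job).

```python
def board_value(board, scores):
    """Add up the scores of every piece-position pair present on the board."""
    return sum(scores.get(piece_at, 0.0) for piece_at in board)


def choose_move(board, successors, scores):
    """Return the move that leads to the highest valued board.

    `successors(board)` is assumed to return a list of (move, next_board) pairs.
    """
    best_move, _ = max(
        successors(board),
        key=lambda pair: board_value(pair[1], scores),
    )
    return best_move


# Made-up scores for a single pawn, just to exercise the function.
scores = {"white-pawn-at-E2": 0.0, "white-pawn-at-E3": 0.1, "white-pawn-at-E4": 0.3}


def toy_successors(board):
    # Stand-in for a real move generator: the pawn advances one or two squares.
    return [("e3", {"white-pawn-at-E3"}), ("e4", {"white-pawn-at-E4"})]


best = choose_move({"white-pawn-at-E2"}, toy_successors, scores)
print(best)
```

All the chess knowledge lives in `scores`; the program itself never changes, which is exactly what makes the scores something we can search for.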

We assume that those scores somehow capture the notion of “this is a good board to be in.” A board with a higher value is better because I’m more likely to win by playing a move that takes me to that configuration. This is, of course, a huge oversimplification because we are ignoring all interactions between pieces. A rook may be more or less valuable in a given position because it challenges an enemy piece or protects one of your own.[1]

However, even with this simplification, we already have six different types of pieces (rooks, knights, bishops, pawns, queens, and kings) times 64 different positions times 2 colors, equal to 768 different scores. Every possible assignment to those 768 scores represents a different strategy, a hypothesis, also called a model. Note that even if we restrict the scores to whole numbers between 1 and 100, we are still talking about 100 to the power of 768 different strategies: a 1 followed by 1536 zeros! Trying out all those hypotheses before the universe reaches heat death is impossible.[2]
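If you want to check that arithmetic yourself, a quick sketch in Python does the job (the variable names are mine, chosen for readability):

```python
piece_types, squares, colors = 6, 64, 2

# One score per (piece type, square, color) combination.
num_scores = piece_types * squares * colors

# Each score independently takes one of 100 values.
num_strategies = 100 ** num_scores

# Number of decimal digits in that count.
digits = len(str(num_strategies))
print(num_scores, digits)
```

Since 100 is 10 squared, 100 to the power of 768 equals 10 to the power of 1536, which is a 1 followed by 1536 zeros (1537 digits in total).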

So far, we have a simplistic but still huge hypothesis space. Now we need a way to search for an optimal —or as good as possible— strategy in it, which in practice means finding the right way to tune those scores so that moving to boards with higher scores actually tends to increase your chance of winning. But how do we know which boards actually lead to winning?

Here comes another key piece in machine learning solutions: leveraging experience. Obviously, we don’t know an a priori way to determine if a board is better than another. If we had, that would be our strategy! But we can access millions, potentially even an infinite number of chess matches. How can we learn the right scores from that source of experience?

There are two fundamentally different paradigms in machine learning which fit chess well. We will dive deeper into them next time, but for now, let’s illustrate how they work in the case of chess.

The first paradigm is learning by imitation, also called supervised learning. In our example, it would work as follows. We first collect all the chess matches ever recorded that we can find. For each match, we note whether White wins or loses and tag each board in that match accordingly.[3]

The second paradigm is learning by trial and error —also called reinforcement learning. Instead of using pre-recorded matches, we let two computer programs play at will —e.g., choosing each move randomly. And just like before, we tag each board as a winning or losing board conditioned on the results of each simulated match.

We will have plenty to say about the advantages and limitations of each paradigm in future entries in this series. For now, all that matters is that we can access a huge amount of experience that tells us how good different boards are in practice.

With all that experience, how do we actually write a Machine Learning program? Or, specifically in this case, how do we adjust the scores so that the resulting strategy is the best possible? Details vary, but the bottom line is always the same.

Instead of directly coding the program that plays chess, we write a meta-program —also called a training algorithm— that will search for the best chess program, which means finding the best set of scores in our example. And here comes the final ingredient: best with respect to what?

In any machine learning setting, we will need a way to compare different hypotheses —models, or strategies in the case of chess. Thus, finding the best model in our hypothesis space is equivalent to solving an optimization problem in which we seek the hypothesis or model that minimizes some error metric. This metric tells us how well any given hypothesis solves the task we wanted to learn, e.g., playing chess in this article.

We will discuss evaluation metrics in detail in a future article, but for now, let’s consider the most straightforward metric in our example. The best strategy in chess is the one that wins the most games. Thus, the direct solution would be to maximize the probability of winning. However, for reasons that will become evident in future articles, we can rarely optimize for a direct success metric. That is usually mathematically intractable or at least quite involved.

Instead, we often find a proxy error metric that is easier to formulate as an optimization problem and close enough to our objective. In this example, a sensible approximation would be attempting to guess whether the current board is winning. Thus, we will compare two strategies (hypotheses or models) not precisely by which is more likely to win but by which is more accurate at guessing whether a given board appears in a winning game.[4]

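Now comes the technical part. Instead of simply enumerating all hypotheses (which we know is impossible), we will pull a rabbit out of the hat and turn the problem of finding the best strategy into solving a system of mathematical equations! Taking the scores as variables, each winning board yields an equation like the following, where p_1, ..., p_k are the piece-position pairs present on that board:

$$w_{p_1} + w_{p_2} + \dots + w_{p_k} = 1$$

While each losing board, with piece-position pairs q_1, ..., q_m, yields a similar equation:

$$w_{q_1} + w_{q_2} + \dots + w_{q_m} = 0$$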

Then, we will use a bit of mathematical magic[5] to determine how to assign the scores w such that most winning boards sum close to 1 and losing boards sum close to 0.[6]
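For the curious, here is one very simple way that tuning could go. This is a sketch under loud assumptions: boards are sets of piece-position names, labels are 1 for winning and 0 for losing, and the scores are adjusted by plain gradient descent on the squared error. It is a stand-in I chose for brevity, not the method a real system would use, and the two-board dataset is invented.

```python
def train_scores(examples, epochs=200, learning_rate=0.1):
    """Fit one score per piece-position pair so board sums approach their labels.

    `examples` is a list of (board, label) pairs: a board is a set of
    piece-position names; label is 1 for a winning board, 0 for a losing one.
    """
    scores = {}
    for _ in range(epochs):
        for board, label in examples:
            prediction = sum(scores.get(name, 0.0) for name in board)
            error = prediction - label
            for name in board:
                # Nudge every score that took part in the sum to shrink the error.
                scores[name] = scores.get(name, 0.0) - learning_rate * error
    return scores


def value(board, scores):
    return sum(scores.get(name, 0.0) for name in board)


# Invented toy experience: two boards that share one piece-position pair.
winning = {"white-queen-at-D4", "white-pawn-at-E4"}
losing = {"white-pawn-at-E4", "black-queen-at-D5"}
scores = train_scores([(winning, 1), (losing, 0)])
print(round(value(winning, scores), 3), round(value(losing, scores), 3))
```

Nothing in the loop enumerates strategies. Each pass only looks at how far a board's sum is from its target and moves the scores in the direction that reduces that gap.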

This magic trick of turning an impossibly large search problem into a clever mathematical problem is pivotal to the success of machine learning. It is what allows our trainer algorithm to actually work at all. Otherwise, we would be stuck randomly trying strategy after strategy to see which is best, something that we will actually do at some point, mind you.

And that’s it! We just tricked a computer into learning to play chess!

Machine Learning formalized

So far, we’ve built an intuitive formulation for a Machine Learning approach to solving chess. To summarize, we first simplified playing good chess to merely deciding which is a good next board by assigning some scores to all possible pieces and positions. Then we used loads of previous experience (recorded or simulated) to gather evidence for good and bad boards. And then, we turned the problem of finding the best scores into solving a mathematical problem.

Now let’s zoom out and finally define machine learning. This one is my favorite definition, due to Carnegie Mellon professor and author of one of the most influential books on ML, Tom Mitchell.

Machine Learning is a computational approach to problem-solving with four key ingredients:
A task to solve (T)
A performance metric (M)
A computer program (P)
A source of experience (E)
You have a Machine Learning solution when:
The performance of program P at task T, as measured by M, improves with access to the experience E.

That's it. Notice that the gist of the definition is that experience improves performance. This contrasts starkly with traditional software, as we’ve seen in this post.

In a typical software application, how well our program solves a problem depends only on its initial programming. The application doesn’t improve the more you use it. Summarizing, in the traditional approach we:

Define a desired output, e.g., the best move in a given board.
Think very hard about the process to compute that output based on a deep understanding of the problem.
Write the program P that produces the output.

Machine learning aims to build a program that improves automatically with experience, even if we don’t fully understand how to solve the problem. To achieve that, we:

Assume there is a template that any sensible program P follows, e.g., a formula parameterized with some unknown values —a hypothesis space.
Write a program Q (a training algorithm) that finds the best values based on some experience E.
Run Q on E to find the best program P —the model or hypothesis.
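The recipe above fits in a few lines for a deliberately tiny task. Everything here is an illustrative assumption: the template for P is "answer yes if the number reaches a threshold", the metric M is accuracy, the experience E is six labeled numbers, and Q simply tries every candidate threshold.

```python
def make_program(threshold):
    """Template for P: a program fully determined by one unknown value."""
    return lambda x: x >= threshold


def train(experience, candidates):
    """Q: search the hypothesis space for the P that scores best on E."""
    def accuracy(threshold):          # the metric M
        program = make_program(threshold)
        hits = sum(program(x) == label for x, label in experience)
        return hits / len(experience)

    best = max(candidates, key=accuracy)
    return make_program(best), best


# E: a handful of labeled examples (made up for the sketch).
experience = [(1, False), (2, False), (3, False), (6, True), (7, True), (9, True)]
program, threshold = train(experience, candidates=range(11))
print(threshold, program(8), program(2))
```

Notice that nobody wrote the final program P by hand. We wrote the template and the trainer, and the data picked the threshold.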

The training algorithm Q is any of a vast array of machine learning techniques: decision trees, neural networks, naive Bayes, logistic regression, etc. Don’t worry if these names sound alien; we’ll meet them all in due time.

Conclusion

In conclusion, the Machine Learning approach represents a major paradigm shift to computational problem-solving. Instead of directly writing a program P to solve task T, you actually code a trainer program Q that, when run on suitable experience E, finds the best program P, according to some metric M.

This paradigm is so useful because there is an incredible number of tasks for which we don't know how to write a solution directly. However, in many cases, writing a trainer algorithm is fairly straightforward, provided we have enough experience (e.g., data or simulations) to train on. Machine Learning allows us to tackle problems we can’t solve directly with a performance that often surpasses the best humans in that task.

And that's it for today. In the next entry in the series, we'll talk about the different flavors of experience we can have.

[1] This simplification of ignoring interactions between elements is extremely common in machine learning. Viewing this from a probabilistic perspective, we’re assuming independence among the influence of each variable (each piece-position pair) concerning the probability of having a winning board. Assuming independence is often dubbed the naive approach, but it is actually a pretty smart thing to do in many cases.

[2] A careful examination will show you that there are symmetrical positions for which having different scores makes no sense. However, for simplicity, let’s happily ignore that fact. Even accounting for all symmetries, the number of strategies will be so large it is unfathomable. And yes, that’s a word I just learned.

[3] This means we’re computing scores for the white player. Scores for the black player will be symmetrical but with opposite signs.

[4] In practice, you won’t directly optimize this probability, but rather a continuous approximation called the logistic function.

[5] Mathematical magic is also known as algebra, optimization, probabilities, etc.

[6] I told you we would throw a few math formulas for performative reasons, but don’t worry; the exact equations don’t matter. All that matters, for now, is that there is a way to convert a search problem into a mathematical problem. These equations aren’t even the ones we would actually use (see footnote 4).

What is Computer Science

Alejandro Piad Morffis — Mon, 05 Jun 2023 12:01:46 GMT

Computer Science is a relatively new field that emerged in the early 20th century from a massive effort to formalize Mathematics. However, it wasn't until the mid-60s and early 70s that it gained recognition as a distinct and unique scientific field. This was achieved by establishing the theoretical grounds of computability and complexity theory, which proved there were many interesting, novel, and challenging scientific questions involving computation.

Computer Science is an extensive field that encompasses a wide range of knowledge and skills. It goes from the most abstract concepts related to the nature of computable problems to the most practical considerations for developing useful software. To solve problems in this field, one must navigate through various levels of abstraction, making it challenging to provide a concise summary that the average reader can easily understand.

In this series, I will introduce you to the vast and exciting world of Computer Science. We will begin by examining the numerous interconnected subfields of CS at a macro level. We will delve into specific fields, concepts, and even practical techniques and algorithms in future posts. I hope you enjoy the ride, and maybe, if I’m persuasive enough, you will decide that you want to master Computer Science.

If you enjoy this post, you will probably want to check my upcoming book The Science of Computation, which will expand on all these subjects.

A Map of Computer Science

Computer Science encompasses a vast range of subjects, incorporating multiple areas of math and engineering. It also introduces many original concepts and ideas. Thus it is impossible to divide this field objectively into subfields or separate areas with precise borders. I will attempt to categorize it into somewhat cohesive disciplines, but keep in mind there is considerable interconnectedness between them.

In the following sections, I will lay out what I believe are the most fundamental concepts and problems in Computer Science. Whenever possible, I will provide a link to further reading, most commonly to a relevant Wikipedia article, which in these topics is a highly accurate and accessible resource. I will also provide links to the other posts in this series in bold format.

Theoretical Foundations

Let's start by looking at the foundations. The fundamental question of Computer Science is something akin to "What kinds of problems can be solved with algorithms?" It was first asked formally by David Hilbert, in the form of the Entscheidungsproblem he posed in 1928, and first answered by Alan Turing in 1936. This central question of computability theory establishes the foundational theory for what can be done with any computer.

The core concept in computability is the Turing machine, an abstract computer model that we define to study the properties of computation and computable problems irrespective of any concrete machine our current technology supports. Turing machines let us answer which kinds of problems are, in principle, undecidable —which means that no algorithm can ever be devised to solve them completely. Surprisingly, there are many of those, not all esoteric; there are very practical problems in Computer Science that we know are mathematically impossible to solve. Turing machines also give a precise definition of algorithm.

Once we lay out the limits of computation, we can ask which computable problems are easier or harder. Complexity theory deals with the complexity of different problems. It asks how efficiently, most commonly in terms of computing time and memory, we can solve any problem. For some problems, we can even prove that we have found the most efficient algorithm anyone could ever develop. The most important problem in complexity theory, and probably in the whole of Computer Science, is the famous question of whether P=NP, which is ultimately a question about the nature of really hard problems.

The final ingredient in the foundational theory of Computer Science is the relationship between formal languages and automata. Formal language theory allows us to formalize the notion of a language, whether it is a human language like English or Spanish, a programming language like Python or C#, or any of the technical, abstract, or mathematical notations used in many fields.

The basic concept in formal language theory is formal grammar, a formal definition of what can be said with a specific language. In contrast, automata theory deals with abstract machines —like Turing machines— of different levels of complexity that can recognize specific families of formal languages. The study of the relationship between language generation and recognition led to the invention of compilers, which are computer programs that understand specific programming languages and translate them to machine code that can be run on concrete hardware.

Math Foundations

Computer Science emerged from Mathematics, and therefore, it adopted the mathematical practice of establishing axioms and demonstrating theorems. Besides, when tackling real-world computational challenges, we frequently apply mathematical techniques used in engineering and other fields of science. Consequently, a substantial amount of mathematical knowledge underpins Computer Science, albeit with a computational perspective.

Logic forms the bedrock of Computer Science as it gave birth to its fundamental principles. By enabling us to formalize reasoning algorithmically, Logic allows us to create programs that can manipulate logical formulas and prove theorems and also lets us formally prove that an algorithm works as intended. Discrete Math is a natural extension of Logic, encompassing a vast mathematical field that studies discrete objects such as collections of natural numbers. It formalizes numerous basic computational concepts like graphs and trees. Additionally, Discrete Math lies at the core of modern cryptography, which relies heavily on number theory.

Algebra and calculus are fundamental math subfields used in most engineering and scientific disciplines, and even more so in Computer Science. Algebra deals with mathematical structures such as vectors, matrices, vector spaces, and operations between them. Many abstract algebraic operations can be interpreted as some form of data manipulation. Thus they appear at the core of many CS applications, from computer graphics to compression to search engines to machine learning. Algebra also strongly connects with computational geometry, the branch of geometry that studies computational approaches to 2D and 3D modeling, computer-aided design (CAD), and many related applications.

Calculus deals with the relationship between changing quantities. It enables us to determine how to adjust the input to achieve a desired output, even when their relationship is fairly complex. If you go deeper into the formal properties of these relations, you will find the field of mathematical analysis. Two key concepts in calculus and analysis are differentiation and integration, and they have numerous applications in Computer Science, particularly in continuous optimization, which also underpins many modern machine learning methods. An interesting middle ground between algebra and calculus is the field of differential equations, which helps to model complex phenomena, from weather to engines to epidemics, where the rates of change among different quantities depend on each other. Mathematical analysis also provides the foundations for the fields of probability and statistics.

Probability is the branch of mathematics that studies reasoning under uncertainty. In a sense, it’s an extension of Logic when we consider that statements can not only be true or false, but they have some degree of likelihood. A related field is Statistics, which involves understanding and extracting insights from —often uncertain— data. Both fields are extremely relevant to data science and intersect with Computer Science in many tasks, from analyzing real-world data to designing programs that can reason and make decisions under uncertainty, such as autonomous vehicles or virtual agents.

However, continuous math (including algebra, calculus, probability, and statistics) assumes real numbers with infinite precision, but in physical computers, all we can hope for is finite approximations. Thus, every mathematical operation performed in a computer can introduce some approximation error. Dealing with this problem robustly and efficiently is surprisingly hard. It is the purpose of numerical analysis, an essential tool to avoid catastrophic errors from misunderstood approximations. It also provides efficient algorithms for many abstract operations from algebra and calculus.
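To make the approximation problem concrete, here is a minimal sketch in Python, assuming standard IEEE 754 double-precision floats (which is what CPython uses on common platforms). The variable names are only illustrative.

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation,
# so the computed sum differs slightly from the literal 0.3.
naive_equal = (0.1 + 0.2 == 0.3)               # exact comparison fails
close_enough = math.isclose(0.1 + 0.2, 0.3)    # comparing with a tolerance works

# Rounding errors accumulate when we add one term at a time.
total = 0.0
for _ in range(10):
    total += 0.1                               # ends slightly below 1.0

# math.fsum keeps track of the low-order bits lost along the way.
careful_total = math.fsum([0.1] * 10)          # exactly 1.0

print(naive_equal, close_enough, total, careful_total)
```

The point is not that floats are broken, but that equality tests and long sums need the kind of care numerical analysis teaches.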

Algorithms

Beyond the theoretical foundations, solving a concrete Computer Science problem will often require using one or more algorithms. Hence, studying which specific algorithm can solve which concrete problem is a big part of Computer Science. Some of the most basic and commonly used algorithms you’ll encounter are devoted to solving problems on collections of items, such as lists of numbers or sequences of letters. Two basic data structures used throughout Computer Science are arrays and strings.

Strings are sequences of symbols —called characters— most commonly used to represent text. String matching, i.e., the problem of finding if a given pattern appears in a text, is one of the most studied problems in CS with applications everywhere from compiler design to DNA analysis to chatbots to search engines.

An array is a collection of items of an arbitrary type —e.g., numbers, words, images, or custom user data records. The most basic operations in arrays are searching and sorting, which give rise to some of the most clever algorithms ever designed. On top of arrays, we find many data structures that focus on efficiently performing some particular operations on specific types of objects. These include lists, queues, stacks, heaps, and the two fundamental ways to represent data in Computer Science: trees and graphs.

A tree is a hierarchical collection of items in which every item has potentially many children and, except for the root, exactly one parent. It is used anywhere we need to represent intrinsically hierarchical data, such as organizations of people. However, trees also appear in many problems where the hierarchy is not obvious, such as evaluating mathematical expressions or searching large mutable collections of objects.

Graphs are the most general data structure to represent abstract information computationally. A graph is a collection of nodes connected by edges. It can represent anything network-like, from geospatial data to social networks to abstract things like the dependencies between subtasks in a given problem. Graph algorithms are so vast that you can find whole subfields dedicated to just a subset of them. Still, the most basic one is path-finding, i.e., finding a path —often optimal in some sense— between one specific source node and one or more destinations.
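As a rough illustration of path-finding, here is a sketch of breadth-first search on a small unweighted graph, where "optimal" means "fewest edges". The adjacency-list dictionary, the node names, and the function name `shortest_path` are all made up for this example.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search: returns a path with the fewest edges, or None."""
    parents = {source: None}          # also serves as the set of visited nodes
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical directed graph as an adjacency list.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(graph, "A", "E"))   # ['A', 'B', 'D', 'E']
```

For weighted graphs you would reach for something like Dijkstra's algorithm instead; this sketch only covers the unweighted case.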

However, just memorizing a huge toolbox of algorithms is not enough. Sometimes you’ll find problems for which no one has already designed an algorithm. In these cases, you’ll need algorithm design strategies, which are higher-level ideas and patterns useful for creating new algorithms.

The most powerful such strategy is recursion, so much so that “you can do anything with recursion” is a popular meme in Computer Science. Recursion is all about solving a problem by decomposing it into subproblems that are easier to solve. When taken to its extreme, this idea allows us to formalize the whole theory of computability, which means that any algorithm in Computer Science is, at its core, recursive. There are many recursive patterns; some of the most commonly used are tail recursion, divide and conquer, and backtracking.

Other powerful design strategies underpin many of the most successful algorithms in Computer Science. Greedy algorithms, for example, are the ones that make the best short-term choice. While this is often suboptimal, there are surprisingly many cases where a carefully chosen local optimum leads to a global optimum. Dynamic programming is another such strategy. Under the right circumstances, it allows transforming a slow recursive algorithm into a very fast non-recursive version while retaining all the correctness guarantees.
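A classic toy example of that transformation, sketched here with Fibonacci numbers (my choice of example, not something specific to any real system): the recursive version recomputes the same subproblems over and over, while the dynamic programming version builds the answer bottom-up.

```python
def fib_recursive(n):
    """Direct translation of the recurrence; takes exponential time."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_dp(n):
    """Bottom-up dynamic programming; linear time, constant memory."""
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev

# Both versions agree, but only the second one scales to large n.
print(fib_recursive(10), fib_dp(10))   # 55 55
```

Both functions compute the same values, so the correctness argument for the recurrence carries over unchanged to the fast version.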

Computational Systems

Mastering algorithms is not enough, though. Ultimately, we need a physical device that can run those algorithms and give us a result at a reasonable cost. Real computers are the physical embodiment of Turing machines, but the devil is in the details. Building a real computer is far more involved than just wiring some cables. Even assuming, as we do in CS, that electronic components always work as expected, the design of resilient computational systems is a whole discipline.

At the lowest level, we have transistors, microscopic components that perform a very simple operation: they act as electronically controlled switches. With just a few transistors, we can design logic circuits —combinations of electronic components that perform whatever logical operation we desire. We can design circuits to add, multiply, or store values to build a Central Processing Unit (CPU) connected to a Random Access Memory (RAM) bank. This is the basic computer, which can be programmed with very simple languages called Assembly Languages. It is at this level that the boundary between hardware and software blurs.

Modern computers have increasingly complex components, including external memory drives and peripherals like displays and keyboards. Keeping all of those components in harmony is the job of the Operating System. OSes like Windows and Linux do much more, though. They provide an abstraction layer over common operations that many programs need, including interfacing with hardware peripherals —e.g., playing sounds or capturing video— or writing and reading from files. Two key abstractions are processes, which allow each program to behave as if it were the only one running, and threads, which enable concurrency even if you have a single physical CPU.

But if one computer is incredible, the real magic begins once you have two or more working together. Enter the world of computer networking. The first problem we must solve is simply getting two computers on opposite sides of the planet to talk reliably to each other over a noisy, error-prone communication channel. The TCP/IP protocol suite is a collection of protocols and algorithms that keep communication reliable even when intermediate links fail constantly. It's the software backbone of the Internet, possibly the single most important technological advance after the first industrial revolution.

Distributed systems allow multiple computers to communicate and collaborate, creating larger organizational levels of computational systems. The simplest setup is the client-server architecture, where an application is split between the user and the service provider. As these systems scale up, they require coordination among several computers to maintain reliability even when some individual components fail. To achieve this, consensus algorithms are used for data distribution, monitoring system health, and error correction tasks.

Some of today's largest software platforms, like the Google Search Engine and Amazon Web Services, rely on distributed systems as their foundation. These infrastructures are often called cloud computing —vast networks of computers working together as one unified computing platform.

A crucial component in computational systems is security, especially when valuable information is stored on computers connected to the internet. Ensuring reliable and secure communication between computers is a major challenge in computer science, addressed with the help of cryptography.

Cryptography as we know it has roots in the codebreaking efforts of World War II, such as the breaking of the Enigma cipher, to which Alan Turing was a key contributor. Modern cryptography relies on symmetric and asymmetric (public-key) methods; the latter enable two actors to establish a secure channel over an insecure communication medium without ever meeting each other. This principle forms the backbone of most online interactions, from reading your email to using your bank account to chatting with another person, ensuring that no one else can access your data, often not even the service provider.

Software Engineering

Building software that works is extremely hard. It's not enough to master all the algorithms, data structures, theorems, and protocols. Software products have unique qualities among all other engineering products, especially regarding flexibility and adaptability requirements. Any software that lives long enough will have to be changed in ways that weren't predictable when the software was first conceived. This poses unique challenges to the engineering process of making software, the domain of software engineering.

Software construction starts with a programming language, of which there are plenty of variants. Some programming languages are designed for specific domains, while others are called general purpose and designed for any programming task. A fundamental distinction is between imperative and declarative languages. The most influential paradigms are functional (often declarative) and object-oriented (most commonly imperative) programming models.

Programming languages are tools, so their effectiveness in solving a concrete problem relies heavily on applying best practices. Software engineering also deals with finding principles and practices to make better use of the available tools, including technical tools and human resources. The most important set of software engineering principles is the SOLID principles. Other principles, such as DRY and YAGNI, emphasize specific, pragmatic mindsets about software development. Additionally, we have dozens of design patterns: concrete reusable solutions to common software development problems.

Beyond the actual code, the most crucial resource most applications must manage is data. As the necessity to organize, query, and process larger and larger amounts of data increases, we turn from classic in-memory data structures to more advanced long-term storage solutions: databases. The relational database is the most pervasive model for representing and storing application data, especially structured data. However, several alternative non-relational paradigms are gaining relevance as less structured domains (such as human language and images) become increasingly important.

In the data-centric software world, information systems are some of the most complex applications we can find. They provide access to vast amounts of information, which can be from a concrete domain —such as medical or commercial information— or general purpose like search engines. These systems combine complex algorithms and data structures for efficient search with massively distributed hardware architectures to serve millions of users in real-time.

As we've seen, software ultimately runs on someone's computer, and different computational systems pose different challenges. From a software engineering perspective, we can understand those challenges better if we think in terms of platforms. The three major software platforms are the desktop (apps installed on your computer), mobile (apps installed on your smartphone), and the web (apps running in your browser).

Developing desktop software is hard because different users will have different hardware, operating systems, and other considerations, making it hard to build robust and portable software. Mobile development brings additional challenges because mobile devices vary widely and are often more limited than desktop hardware, so performance is even more crucial. The web is the largest platform, and we've already discussed distributed systems. The web is an interesting platform from the software engineering perspective because it is pervasive and lets us abstract from the user's hardware. Perhaps the most famous software development technologies are those associated with web development: HTML, CSS, and JavaScript.

The final component in software engineering concerns the interaction between the user and the application. Human-computer interaction (HCI) studies the design of effective user interfaces, both physical and digital. Two major concerns are designing an appropriate user experience and ensuring accessibility for all users, especially those with special needs such as limited eyesight, hearing, or movement.

Artificial Intelligence

Artificial Intelligence (AI) is a broad and loosely defined umbrella term encompassing different disciplines and approaches to designing and building computational systems that exhibit some form of intelligence. Examples of machines exhibiting intelligent behavior range from automatically proving new theorems to playing expert-level chess to self-driving cars to modern chatbots. Accurately defining what we mean by machine intelligence is extremely difficult and out of the scope of this post, but ultimately it revolves around the capacity to solve hard problems with as little explicit guidance as possible. We can cluster the different approaches in AI into three broad areas: knowledge, search, and learning.

Knowledge representation and reasoning studies how to store and manipulate domain knowledge to solve reasoning and inference tasks, such as medical diagnosis. We can draw from Logic and formal languages to represent knowledge in a computationally convenient form. Ontologies are computational representations of the concepts, relations, and inference rules in a concrete domain that can be used to infer new facts or discover inconsistencies automatically via logical reasoning. Knowledge discovery encompasses the tasks for automatically creating these representations, for example, by analyzing large amounts of text and extracting the main entities and relations mentioned. Ontologies are a special case of semantic networks, graph-like representations of knowledge, often with informal or semi-formal semantics.

Search deals with finding solutions to complex problems, such as the best move in a chess game or the optimal distribution of delivery routes. The hardest search problems often appear in combinatorial optimization, where the space of possible solutions is exponential and thus unfeasible to explore completely. The most basic general search procedures —Depth-First Search (DFS) and Breadth-First Search (BFS)— are exact, exhaustive, and thus often impractical. Once you introduce some domain knowledge, you can apply heuristic search methods, such as A* and Monte Carlo Tree Search, which avoid searching the entire space of solutions by cleverly choosing which solutions to look at. The ultimate expression of heuristic search is metaheuristics —general-purpose search and optimization algorithms that can be applied nearly universally without requiring too much domain knowledge.

Machine learning enables the design of computer programs that improve automatically with experience and is behind some of the most impressive AI applications, such as self-driving cars and generative art. In ML, we often say a program is “trained” instead of explicitly coded to refer to this notion of learning on its own. Ultimately, this involves finding hypotheses that can be efficiently updated with new evidence.

The three major paradigms in this field are supervised, unsupervised, and reinforcement learning. Each case differs in the type of experience and/or feedback the learning algorithm receives. In supervised learning, we use annotated data where the correct output for each input is known. Unsupervised learning, in contrast, doesn’t require a known output; it attempts to extract patterns from the input data solely by looking at its internal structure.

Reinforcement learning involves designing systems that can learn by interaction via trial and error. Instead of a fixed chunk of data, reinforcement learning places a learning system —also called an agent in this paradigm— in an environment, simulated or real. The agent perceives the environment, acts upon it, and receives feedback about its progress in whatever task it is being trained on. When the environment is simulated, we can rely on different paradigms, such as discrete events or Monte Carlo simulations, to build a relatively realistic simulation. Reinforcement learning is crucial in robotics and recent advances in large language models.

All of the above are general approaches that can be applied to various domains, from medical diagnosis to self-driving cars to robots for space exploration. However, two domains of special importance that have seen massive improvements recently are vision and language. The most successful approaches in both fields involve using artificial neural networks, a computational model loosely inspired by the brain that can be trained to perform many perceptual and generative tasks. ANNs draw heavily from algebra, calculus, probability, and statistics, representing some of the most complex computer programs ever created. The field of neural networks is alternatively called deep learning, mostly for branding purposes.

Computer vision deals with endowing computational systems with the ability to process and understand images and videos for tasks like automatic object segmentation and classification. Classic approaches to computer vision rely heavily on signal processing algorithms stemming from algebra and numerical analysis. Modern approaches often leverage neural networks, most commonly convolutional networks, loosely inspired by the visual parts of animal brains.

Natural language processing (NLP) enables computational systems to process, understand, and generate human-like language. It encompasses many problems, from low-level linguistic tasks, such as detecting the part-of-speech of words in a sentence or extracting named entities, to higher-level tasks, such as translation, summarization, or maintaining conversations with a human interlocutor. The most successful approaches in modern NLP leverage transformer networks, a special type of neural network architecture that can represent some forms of contextual information more easily than other architectures. These are the mathematical underpinnings behind technologies like large language models and chatbots.

Conclusions

Computer science is a vast field encompassing both theoretical and practical aspects. It goes from the most abstract computational concepts grounded in logic and mathematics to the most pragmatic applications based on solid engineering principles to frontier research like building intelligent systems. Working in CS also involves navigating complex social interactions between developers, users, and society and requires awareness of the many ethical issues that automation and computation pose.

Mastering computer science is a significant challenge, making it one of the most demanding academic pursuits. However, this discipline offers immense fulfillment since computation lies at the heart of our modern world. Software pervades all sectors and industries. Those proficient in computer science can find work across various disciplines and not only by creating commercial software. Across all scientific disciplines, from understanding the frontiers of space to solving climate change, designing new materials, and extending human life, every challenging problem in every scientific field has computational aspects that require collaboration between computer scientists and experts from that domain.

With this post, I hope to have ignited your curiosity in exploring the beautiful ideas at the core of computer science and its intersection with other fields. If that's the case, then I invite you to stay in touch. I'll be back with more Computer Science topics very soon.

Binary Search

Alejandro Piad Morffis — Mon, 27 Mar 2023 02:07:04 GMT

Photo by Clay Banks on Unsplash

Searching is at the core of the vast majority of software ever written. From scientific computation to everyday apps, there’s no shortage of examples where searching for something, a piece of information, a concrete datum, or a solution to an equation, is a fundamental operation. Databases are the most visible and common implementation of search, and they are ubiquitous in modern applications.

But searching in Computer Science goes far beyond scanning a list of elements to find the one that matches a query. Searching is one of the fundamental strategies for problem-solving. Many of the hardest problems in Computer Science, the so-called NP-Hard problems, are all about search. So it’s no wonder search algorithms abound, from the basic linear search to highly elaborate data structures, heuristics, and approximations.

In this and one or more follow-up posts, I want to tell you about the basic ideas behind searching and sorting and how to make both processes efficient. And finally, I want to show you a beautiful and surprising relationship between both problems, which has interesting philosophical implications for Computer Science at large.


Suppose you have to search for a given book in a huge library, and all you know is the title. The simplest (and dumbest) strategy is to scan the bookshelf from top to bottom, left to right, looking at every single book until you find the one you’re looking for, or you’re convinced it is not in there. If you have, say, 1 million books [1], and the time it takes you to grab a book, read its title and put it back is on average 30 seconds, it could take you almost a year (~347 days) to scan the whole library. Assuming the book you’re looking for is equally likely to be anywhere in the library, every query will take on average close to six months! [2]
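Here is a small sketch of that naive strategy together with the back-of-the-envelope arithmetic. The function name `linear_search` and the constants are just for illustration; the 30 seconds per book and the 1 million books come from the scenario above.

```python
def linear_search(titles, wanted):
    """Scan every title in order; returns (index or None, books examined)."""
    for examined, title in enumerate(titles, start=1):
        if title == wanted:
            return examined - 1, examined
    return None, len(titles)

SECONDS_PER_BOOK = 30
N_BOOKS = 1_000_000
SECONDS_PER_DAY = 86_400

# Worst case: the book is last, or not there at all.
worst_case_days = N_BOOKS * SECONDS_PER_BOOK / SECONDS_PER_DAY

# Average case: the book is equally likely to be at any position.
average_days = (N_BOOKS + 1) / 2 * SECONDS_PER_BOOK / SECONDS_PER_DAY

print(round(worst_case_days), round(average_days))   # 347 174
```

The cost grows in direct proportion to the size of the library, which is exactly what makes this strategy hopeless at scale.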

Can we do any better? Well, of course! If we only had some sort of organization, for example, putting all books that start with the same letter on contiguous shelves, we could speed up the search immensely. We can complicate this system as much as we want, with rows tagged themselves also by the second letter of the title, and so on.

However, there is one simple strategy we can use that is almost too good to be true. What if we just sort all books by title, starting from the top left shelf to the bottom right? How fast can we search now? The answer, it turns out, is that we can search as fast as possible.

Searching as fast as possible

The basic idea we will follow is quickly discarding the biggest possible number of books with the minimum amount of effort. If the books are sorted, and we pick a random book, we can immediately discard either all the books before or all the books after it, depending on whether the title we’re looking for comes before or after the title we randomly picked. We repeat this procedure among the remaining set of books until we either find the one we’re looking for or discard all the books.

How much faster is this approach? Well, it depends on how many books we can discard in each step. If we pick one of the books close to the beginning of the list, e.g., the 10,000th book, then two things can happen. If we’re lucky and the title we’re looking for comes before, we’ve just discarded 99% of the library by looking at a single book! But if we’re unlucky, we have only discarded the first 1%. Intuitively, we should try to discard the same number of books regardless of whether the one we’re looking for comes before or after, so we should always split the list in half.

This algorithm is called Binary Search, and it is one of the most beautiful algorithms you encounter when first learning to program.

If we do this, looking at the 500,000th book first, then either the 250,000th or the 750,000th, and so on, we can narrow down the list of possible books pretty fast. A simple information-theoretic argument shows that this is the best possible strategy if the only operation we can do is compare titles. [3] In the worst case, we will have to do about log₂(N) comparisons, where N is the initial number of books, which for 1 million gives us a whopping 20 comparisons! Even if each comparison takes us 10 minutes instead of 30 seconds (presumably because we have to walk across the library to go from the 500,000th to the 750,000th book), we still find the book we’re looking for, among 1 million books, in a little over 3 hours. Compare that with the six months our naive strategy required!
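For the curious, this is roughly what the halving strategy looks like in code. It is a minimal sketch, not a tuned implementation: sorted integers stand in for sorted titles, the name `binary_search` is mine, and each "probe" counts as one look at a book (one three-way comparison of titles).

```python
def binary_search(items, wanted):
    """Search a sorted list; returns (index or None, number of probes)."""
    low, high = 0, len(items) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2       # always look at the middle book
        probes += 1
        if items[mid] == wanted:
            return mid, probes
        if items[mid] < wanted:
            low = mid + 1             # discard the left half
        else:
            high = mid - 1            # discard the right half
    return None, probes

# One million "books", already sorted.
books = list(range(1_000_000))

# A title that would sort after every book forces the longest search.
index, worst_probes = binary_search(books, 1_000_000)
print(index, worst_probes)            # None 20
print(worst_probes * 10, "minutes")   # 200 minutes at 10 minutes per probe
```

In practice, Python's standard library already offers this through the `bisect` module; writing it by hand is mostly useful for seeing where the 20 comes from.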

An illustrative example of Binary Search in a small collection of sorted numbers.

The lesson from Binary Search

Binary Search is the first truly clever algorithm you learn about, one that exploits an innocent characteristic of the problem to obtain a maximally efficient strategy. The main takeaway from Binary Search is a lesson that most grandmas are keen on repeating over and over to their grandchildren: if you keep things in order, you will have a much easier time finding them. Granny just didn’t know how true this can be. By exploiting the order of elements and carefully choosing where we look we can speed up the search exponentially. [4]

In brief, it pays to keep things sorted.

Hence, next time we’ll take a look at sorting, the complementary operation of searching, how to make it as fast as possible, and a surprising connection that will close the story beautifully.

And this is all for today. I want to thank @Hrokrinn for their proofreading and suggestions, and all of my subscribers and Twitter followers for giving me the inspiration to keep writing.

[1] This is not an exaggeration by any means. For example, the Library of Congress has over 25 million books and close to 170 million items in total. There are hundreds of libraries worldwide with more than 1 million physical books inside.

[2] This is why libraries have catalogs, indexes, and all sorts of shortcuts for quickly finding what you’re looking for.

[3] An intuitive explanation goes like this. We have to find a specific element X among N, and it could be anywhere, so the probability that any given element is X is 1/N. How can we ask a single question that gives us the maximum amount of information? One way to look at this problem is to consider the relation between information and entropy. The more entropy a distribution has, the less information it contains. Our current best guess is that X is uniformly distributed over all positions, which has the maximum entropy among all discrete distributions over N elements, so it carries the least information.

If we split the N elements at a fraction k, into two subsets of sizes k*N and (1-k)*N respectively, then the information we gain from that single question is the entropy we had before minus the entropy we expect to have left (again, taking X as uniformly distributed within whichever subset it falls in). This works out to -k*log2(k) - (1-k)*log2(1-k) bits, and a simple mathematical analysis shows that it is maximum when k = 0.5, which is when we split the collection in half.
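If you prefer to check this numerically rather than analytically, here is a small sketch. It assumes X is uniform over N positions and measures the information gained as the entropy before the question minus the expected entropy after it; the function name `information_gain` and the sample values of k are only illustrative.

```python
import math

def information_gain(k, n=1_000_000):
    """Bits learned by asking whether X is among the first k*n elements."""
    before = math.log2(n)             # entropy of a uniform guess over n items
    # With probability k we are left with k*n candidates,
    # and with probability 1-k we are left with (1-k)*n candidates.
    after = k * math.log2(k * n) + (1 - k) * math.log2((1 - k) * n)
    return before - after

gains = {k: information_gain(k) for k in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)}
best_k = max(gains, key=gains.get)

for k, bits in gains.items():
    print(f"k = {k:4}: {bits:.3f} bits")
print("best split:", best_k)          # 0.5, worth one full bit per question
```

Splitting at the 1% mark, like picking the 10,000th book out of a million, yields less than a tenth of a bit on average, which is why lopsided splits are such a bad deal.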

[4] To see why this is the case, consider looking at Binary Search from the inverse function viewpoint. Let’s say searching with Binary Search takes K operations (K comparisons, that is). In K comparisons, you can find any given element among N, where K = log2(N). This means that N = 2^K (that’s 2 to the power of K). Thus, a naive linear search will require exponentially more effort than a Binary Search in a collection of the same size.