The Missing Introduction to Formal Language Theory
A new series on the foundations of Computer Science and the science behind Compilers
Next semester, after a couple of years, I’ll again teach the introductory Compilers course in the CS major at the University of Havana. This has been an excellent opportunity to dust off my notes on formal languages, parsers, analyzers, code generators, etc. Going over the course material, I decided I needed to rewrite most of the lecture notes, because my understanding of the fundamental issues in formal languages has changed dramatically since I last taught the course.
So, as I prepare for the upcoming course, I’ll be sharing with you some of the most intriguing and mind-blowing ideas from one of the areas of Computer Science with some of the most profound results. We will begin with a very intuitive introduction to formal language theory and work our way up to understanding how compilers, text editors, virtual machines, and all their associated components work.
Buckle up!
What is a (formal) language?
Intuitively, a language is just a collection of correct sentences. In natural languages (Spanish, English, etc.), each sentence is made up of words, which have some intrinsic meaning, and there are rules that describe which sequences of words are valid.
Some of these rules, which we often call “syntactic”, are just about the structure of words and sentences, not their meaning: how nouns and adjectives must match in gender and number, or how verbs connect to adverbs and other modifiers. Other rules, which we call “semantic”, deal with the valid meanings of collections of words; they are the reason why the sentence “the salad was happy” is perfectly valid syntactically but makes no sense. In linguistics, the set of rules that determines which sentences are valid is called a “grammar”.
In formal language theory, we want to make all these notions as precise as possible in mathematical terms. To achieve this, we will have to make some simplifications which will ultimately imply that natural languages fall outside the scope of what formal language theory can fully study. But these simplifications will enable us to define a very robust notion of language for which we can make pretty strong theoretical claims.
So let’s build this definition from the ground up, starting with our notion of words, or, formally, symbols:
Definition 1.1 (Symbol) A symbol is an atomic element with an intrinsic meaning.
Examples of symbols in abstract languages might be single letters like 𝑎, 𝑏, or 𝑐. In programming languages, a symbol might be a variable name, a number, or a keyword like for or class. The next step is to define sentences:
Definition 1.2 (Sentence) A sentence (alternatively called a string) is a finite sequence of symbols.
An example of a sentence formed with the symbols 𝑎 and 𝑏 is 𝑎𝑏𝑏𝑎. In a programming language like C# or Python, a sentence can be anything from a single expression to a full program.
One special string is the empty string, which has zero symbols and will often bite us in proofs. It is often denoted as 𝜖.
We are almost ready to define a language. But before that, we need to define a “vocabulary”, which is just a collection of valid symbols.
Definition 1.3 (Vocabulary) A vocabulary 𝑉 is a finite set of symbols.
An example of a vocabulary is { 𝑎, 𝑏, 𝑐 }, which contains three symbols. In a programming language like Python, a sensible vocabulary would be something like { for, while, def, class, … }, containing all keywords, but also symbols like +, ., and so on.
What about identifiers?
If you think about our definition of vocabulary for a little bit, you’ll notice we defined it as a finite set of symbols. At the same time, I’m claiming that things like variable and function names, and identifiers in general, will end up being part of the vocabulary of a programming language. However, there are infinitely many valid identifiers, so… how does that work?
The solution to this problem is that we will actually deal with two different languages, on two different levels. We will define a first language for the tokens, which just determines what kinds of identifiers, numbers, etc., are valid. Then the actual programming language will be defined on top of the types of tokens available. So, all numbers are the same token, all identifiers are another token, and so on, as the sketch below illustrates.
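To make this concrete, here is a minimal, hypothetical tokenizer sketch; the names (TOKEN_PATTERNS, token_type) and patterns are my own inventions for illustration, not part of any real compiler:

import re

# A hypothetical token-level language: each category acts as a single
# "symbol" of the programming language's vocabulary, no matter how many
# concrete spellings it has.
TOKEN_PATTERNS = [
    ("KEYWORD", r"for|while|def|class"),
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
]

def token_type(word):
    # classify one word into its token category (first match wins)
    for name, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, word):
            return name
    return None

print(token_type("for"), token_type("123"), token_type("foo42"))
# KEYWORD NUMBER IDENTIFIER -- "foo42", "x", and "total" are all the same token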
Given a concrete vocabulary, we can then define a language as a (possibly infinite) subset of all the sentences that can be formed with the symbols from that vocabulary.
Definition 1.4 (Language) Given a vocabulary 𝑉, a language 𝐿 is a set of sentences with symbols taken from 𝑉.
Let’s see some examples.
Examples of languages
To illustrate how rich languages can be, let’s define a simple vocabulary with just two symbols, 𝑉={ 𝑎,𝑏 }, and see how many interesting languages we can come up with.
The simplest possible language in any vocabulary is a singleton language whose only sentence is a single symbol from the vocabulary. For example, 𝐿𝑎 = { 𝑎 } or 𝐿𝑏 = { 𝑏 }. This is, of course, rather useless, so let’s move on.
We can also define what’s called a finite language, which is just a collection of a few (or perhaps many) specific strings. For example,
𝐿1 = { 𝑏𝑎𝑏, 𝑎𝑏𝑏𝑎, 𝑎𝑏𝑎𝑏𝑎, 𝑏𝑎𝑏𝑏𝑎 }
Since languages are sets, there is no intrinsic order to the sentences in a language. For visualization purposes, we will often sort the sentences in a language from shortest to longest, and lexicographically within each length, assuming there is a natural order for the symbols. But this is just one arbitrary way of doing it.
Now, we can enter the realm of infinite languages. Even when the vocabulary is finite, and each sentence is a finite sequence of symbols, we can have infinitely many different sentences in a language. If you need to convince yourself of this claim, think about the language of natural numbers: every natural number is a finite sequence of, at most, 10 different digits, and yet there are infinitely many natural numbers, because we can always take a number and add a digit at the end to make a new one.
Similarly, we can have infinite languages simply by concatenating symbols from the vocabulary ad infinitum. The most straightforward infinite language we can make from an arbitrary vocabulary 𝑉 is called the universe language, and it’s just the collection of all possible strings one can form with symbols from 𝑉.
Definition 1.5 (Universe language) Given a vocabulary 𝑉, the universe language, denoted 𝑉∗, is the set of all possible strings that can be formed with symbols from 𝑉.
An extensional representation of a finite portion of 𝑉∗ would be:
𝑉∗ = { 𝜖, 𝑎, 𝑏, 𝑎𝑎, 𝑎𝑏, 𝑏𝑎, 𝑏𝑏, 𝑎𝑎𝑎, 𝑎𝑎𝑏, 𝑎𝑏𝑎, 𝑎𝑏𝑏, 𝑏𝑎𝑎, 𝑏𝑎𝑏, 𝑏𝑏𝑎, 𝑏𝑏𝑏, ... }
We can now easily see that an alternative definition of language could be any subset of the universe language of a given vocabulary 𝑉.
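As an aside, enumerating 𝑉∗ in this shortest-first order is easy to do in code. Here is a minimal Python sketch (the function name universe is my own choice); we will reuse it at the end of this chapter:

from itertools import count, product

def universe(vocabulary):
    # yield every string over the vocabulary in increasing length order,
    # lexicographically within each length
    for n in count(0):
        for symbols in product(sorted(vocabulary), repeat=n):
            yield "".join(symbols)

For the vocabulary { 𝑎, 𝑏 }, the first strings produced are exactly the ones listed above: 𝜖, 𝑎, 𝑏, 𝑎𝑎, 𝑎𝑏, and so on.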
Now, let’s take it up a notch. We can come up with a gazillion languages just involving 𝑎 and 𝑏 by concocting different relationships between the symbols. For this, we will need some way of describing these languages that doesn’t require listing all their elements, as there are infinitely many. We can do it with natural language, of course, but in the long run, it will pay to be slightly more formal when describing infinite languages.
For example, let 𝐿2 be the language of strings over the vocabulary 𝑉 = { 𝑎, 𝑏 } with the exact same number of 𝑎’s and 𝑏’s.
𝐿2 = { 𝜖, 𝑎𝑏, 𝑎𝑎𝑏𝑏, 𝑎𝑏𝑎𝑏, 𝑏𝑎𝑏𝑎, 𝑏𝑎𝑎𝑏, 𝑎𝑏𝑏𝑎, ... }
We can define it with a bit of math syntax sugar as follows:
𝐿2 = { 𝜔 ∈ { 𝑎,𝑏 }∗ | #(𝑎,𝜔) = #(𝑏,𝜔) }
Let’s unpack this definition. We start by saying 𝜔 ∈ { 𝑎,𝑏 }∗, which literally parses as “strings 𝜔 in the universe language of the vocabulary { 𝑎,𝑏 }”, but is just standard jargon for “strings made out of 𝑎 and 𝑏”. Then we add the conditional part #(𝑎,𝜔) = #(𝑏,𝜔), which should be pretty straightforward: we are using the #(<symbol>,<string>) notation to denote the function that counts the occurrences of a given symbol in a string.
𝐿2 is slightly more interesting than 𝑉∗ because it introduces the notion that a formal language is equivalent to some computation. This insight is the fundamental idea that links formal languages and computability theory, and we will formalize this idea in the next section. But first, let’s see other, even more interesting languages, to solidify this intuition that languages equal computation.
Let’s define 𝐿3 as the language of all strings in 𝑉∗ where the number of 𝑎’s is a prime factor of the number of 𝑏’s. Intuitively, working with this language, e.g., finding valid strings, will require us to reason about prime factors: any question about 𝐿3 that has different answers for strings in 𝐿3 than for strings not in 𝐿3 will necessarily go through what it means for a number to be a prime factor of another.
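For concreteness, here is a minimal recognizer sketch for 𝐿3 (the helper is_prime is my own); it reads “the number of 𝑎’s is a prime factor of the number of 𝑏’s” as: the count of 𝑎’s is prime and divides the count of 𝑏’s:

def is_prime(n):
    # naive primality check, good enough for a sketch
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def l3(s):
    na, nb = s.count("a"), s.count("b")
    # the count of a's must be a prime divisor of the count of b's
    return is_prime(na) and nb % na == 0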
But it gets better. We can define the language of all strings made out of 𝑎 and 𝑏 such that, when interpreting 𝑎 as 0 and 𝑏 as 1, the resulting binary number has any property we want. We can thus codify all problems in number theory as problems in formal language theory.
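A quick sketch of that encoding, using divisibility by three as a stand-in for “any property we want” (the property is an arbitrary choice of mine):

def as_number(s):
    # read a as 0 and b as 1, then interpret the string as a binary numeral
    # (assumes a non-empty string over {a, b})
    return int(s.replace("a", "0").replace("b", "1"), 2)

def divisible_by_three(s):
    return as_number(s) % 3 == 0

print(as_number("bab"))          # "101" in binary, i.e., 5
print(divisible_by_three("bb"))  # "11" in binary is 3, so True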
And, as you can probably see already, we can easily codify any mathematical problem this way, not just number theory. Ultimately, we can define a language as the set of strings that are valid input/output pairs for any specific problem we can come up with. Let’s make this intuition formal.
Recognizing a language
The central problem in formal language theory is called the word problem. Intuitively, it is about determining whether a given string is part of a language. Formally:
Definition 1.6 (The Word Problem) Given a language 𝐿 on some vocabulary 𝑉, the word problem is defined as devising a procedure that, for any string 𝜔 ∈ 𝑉∗, determines whether 𝜔 ∈ 𝐿.
Notice that we didn’t define the word problem simply as “given a language 𝐿 and a string 𝜔, is 𝜔 ∈ 𝐿?”. Why? Because we might be able to answer that question correctly for some 𝜔, but not for all. Instead, the word problem is about coming up with an algorithm that answers the question for all possible strings 𝜔 (technically, a procedure, which is not exactly the same thing).
The word problem is the most important question in formal language theory, and one of the central problems in computer science in general. So much so that we actually classify languages (and, by extension, all computer science problems) according to how easy or hard it is to solve their associated word problem.
In the next few chapters, we will review different classes of languages that have certain common characteristics which make them, in a sense, equally complex. But first, let’s see what it would take to solve the word problem in our example languages.
Solving the word problem for any finite language is trivial: you only need to compare the given string against each of the strings in the language, as sketched below. The word problem becomes far more interesting when we have infinite languages. In these cases, we need to define a recognizer mechanism, that is, some sort of computational algorithm or procedure, to determine whether any particular string is part of the language.
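In code, a recognizer for a finite language is just a lookup; a trivial sketch:

def recognize_finite(language, w):
    # a finite language is a set of strings: compare against each one
    return any(w == s for s in language)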
For example, language 𝐿2 has a very simple solution to the word problem. The following Python program gets the job done:
def l2(s):
    # count the number of a's and b's in the string
    a, b = 0, 0
    for c in s:
        if c == "a":
            a += 1
        else:  # any other symbol in our vocabulary must be "b"
            b += 1
    return a == b
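For example, l2("abba") returns True (two 𝑎’s and two 𝑏’s), while l2("aab") returns False.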
A fundamental question in formal language theory is not only coming up with a solution to the word problem for a given language but, actually, coming up with the simplest solution, for a very specific definition of simple: how much you need to remember. In other words: what kinds of algorithms can solve the word problem for what kinds of languages?
For example, we can solve 𝐿2 with 𝑂(𝑛) memory. That is, we need to remember something proportional to the number of 𝑎’s and 𝑏’s in the string. And we cannot solve it with anything less than that, as we will prove a couple of chapters down the road.
Now, let’s turn to the opposite problem, that of generating strings from a given language, and ask what connection, if any, exists between the two.
Generating a language
Suppose you want to generate all strings from a language like 𝐿2. To make things simpler, let’s redefine it as 𝐿2′, the language of strings over { 𝑎, 𝑏 } with the same number of 𝑎’s and 𝑏’s but where all 𝑎’s come before all 𝑏’s. This means 𝑎𝑎𝑏𝑏 is a valid string in 𝐿2′, but 𝑎𝑏𝑏𝑎 is not. This language is also called 𝑎ⁿ𝑏ⁿ, that is, 𝑛 symbols 𝑎 followed by 𝑛 symbols 𝑏.
Here is a simple Python method that generates infinitely many strings from 𝐿2′:
def generate_l2():
    s = ""  # start with the empty string
    while True:
        yield s
        s = "a" + s + "b"  # wrap the current string in one more a...b pair
Let’s unpack this. We start with the empty string 𝜖, defined in code as s = "". Then, we enter an infinite loop where we yield the current string and then attach an 𝑎 to the front and a 𝑏 to the back. Take a moment to convince yourself that every string of the form 𝑎ⁿ𝑏ⁿ is eventually generated by this method and, furthermore, that only those strings are generated.
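If you want to see it in action, here is one way to peek at the first few strings of the infinite generator:

from itertools import islice

# the first four strings of a^n b^n, in increasing length order
print(list(islice(generate_l2(), 4)))
# ['', 'ab', 'aabb', 'aaabbb']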
This method is actually pretty neat because it not only generates (eventually) all of 𝑎ⁿ𝑏ⁿ; it does so in increasing length order. It isn’t immediately obvious why this is such a good thing, but here’s a bold claim: if you have a generating method for any language 𝐿, then you have a recognizing method too.
Wait, what!? Yep, you heard it right. And actually, it goes both ways. If you have a recognizing algorithm, you also have a generating one. Let’s make this our first theorem in formal language theory.
Theorem 1.1 Let 𝐿 be a formal language. There exists an algorithm 𝐴 for generating all strings in 𝐿 (in increasing length order) if and only if there also exists another algorithm 𝐴′ for solving its word problem.
Proof. To prove this, let’s first understand what the theorem is saying. If we have an algorithm 𝐴 that generates all strings in a language, we can also come up with another algorithm 𝐴′ (presumably using 𝐴) that solves the word problem, and vice-versa.
To prove this type of theorem, the most usual approach is to assume you have 𝐴 (or 𝐴′) as some kind of abstract, black-box algorithm, and to construct the other one from it. Let’s go from generation to recognition first, as the other direction will be fairly easy once this is done.
⇒ Suppose we have an algorithm 𝐴 that generates all strings in 𝐿, and we are given an arbitrary string 𝜔. Let 𝑛 = |𝜔| be the length of 𝜔. We just need to run 𝐴 until we either see 𝜔, in which case the answer is true (𝜔 ∈ 𝐿), or we see a string longer than 𝑛, in which case the answer is false (𝜔 ∉ 𝐿). Since 𝐴 generates strings in increasing length order, one of these must happen in finite time for any 𝜔 (and if 𝐿 is finite and 𝐴 simply runs out of strings, the answer is also false).
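Here is that construction as a Python sketch, where generate is assumed to yield all strings of 𝐿 in increasing length order:

def recognize(generate, w):
    # scan the enumeration until we find w, or until every remaining
    # string is guaranteed to be longer than w
    for s in generate():
        if s == w:
            return True
        if len(s) > len(w):
            return False
    return False  # the generator was finite and never produced w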
Now, let’s do it the other way around.
⇐ Suppose we have an algorithm 𝐴′ that solves the word problem for 𝐿. Then we do the following. Let 𝑉∗ be the universe language of 𝐿’s vocabulary 𝑉. We can very easily code a generating algorithm 𝐴∗ for 𝑉∗ in increasing length order, simply by enumerating all combinations of symbols of each length. Now, run 𝐴∗ and, for each string 𝜔 generated, run 𝐴′(𝜔). If the output is true, then yield 𝜔. Otherwise, skip it.
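And the other direction as a sketch, reusing the universe enumerator from earlier in the chapter:

def generate_from_recognizer(recognize, vocabulary):
    # filter the universe language through the word-problem solver
    for w in universe(vocabulary):  # universe() was sketched earlier
        if recognize(w):
            yield w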
So there you have it. Generating (in increasing order) and recognizing are two faces of the same problem. Cool, right? But why does this matter? For starters, it gives us a tremendously powerful connection between two sub-branches of formal language theory that we will explore in the following chapters.
Moving on
We are just scratching the surface of what formal language theory can do, and we have already touched upon several areas of computer science.
We have defined a super general notion (language) that is ultimately as profound and powerful as the very notion of algorithm. We have identified a central problem in formal language theory (the word problem) that is as deep as the very question of what problems can be solved, at all, with a computer. We connected two fundamental problems in languages (recognizing and generating) and discovered they are but two sides of the same coin. And we left hanging the question of which languages can be solved with which types of algorithms, which is ultimately a question about complexity theory. Phew!
In the following few articles, we will continue exploring the world of formal languages. We will dive into the different classes of languages according to the complexity of their generating and recognizing algorithms. We will find many intriguing unsolvable problems that have deep connections with other areas in computer science, from the most practical to the most esoteric. When we finish this dive, we will have a much more solid understanding of what computers can ultimately do. And then we will turn to programming languages and apply all these ideas to the more practical problem of building a compiler.