Why Artificial Neural Networks Are So Damn Powerful - Part I
The most intuitive explanation of the mathematical prowess of neural networks
You already know that neural networks are everywhere, and there’s a reason for that beyond mere hype. And no, it’s not simply because they are “inspired by the human brain”—we already debunked this partial myth in our previous article.
The true strength of neural networks lies primarily in their nature as mathematical constructs that are extremely flexible and powerful. This makes it relatively easy for them to adapt to nearly any domain. Additionally, they excel at leveraging vast amounts of data and computational power.
The ability of neural networks to model complex relationships and learn from vast amounts of data stems from several key mathematical properties. This article will explore these strengths, including the universal approximation theorem, the role of inductive biases in different architectures, and how layers within a network perform representation and manifold learning. We’ll also discuss the versatility of neural networks in turning diverse learning objectives into differentiable loss functions that can be optimized.
Then, in Part II, we’ll see how these networks scale impressively with increasing data and compute resources, further solidifying their position as a cornerstone of modern AI.
Universal Approximators
The Universal Approximation Theorem is a cornerstone of neural networks and machine learning. It asserts that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of inputs to any desired degree of accuracy, provided that the activation function is non-constant, bounded, and continuous.
Whoa, that was a mouthful. In short, a big enough but still very simple neural network can, in theory, learn any pattern you want, to an arbitrary degree of precision.
The theorem's history dates back to the late 1980s, when researchers began formalizing the mathematical foundations of neural networks. Early work by George Cybenko in 1989 demonstrated that networks with a single hidden layer could achieve this approximation capability. Since then, various versions and extensions of the theorem have been developed, solidifying its importance in understanding the theoretical underpinnings of neural networks.
What does this mean in practical terms? The Universal Approximation Theorem implies that neural networks are incredibly versatile tools capable of modelling a wide array of functions, from simple linear relationships to intricate non-linear mappings. This flexibility allows them to be effectively applied across numerous domains, such as image recognition, natural language processing, and more.
However, while the theorem guarantees the existence of such approximations, it does not provide a method for finding them efficiently in practice. Thus, while neural networks can theoretically learn any function, achieving that in real-world scenarios often requires careful design and training strategies.
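To make this concrete, here is a minimal sketch of the idea (assuming PyTorch is available; the target function, hidden-layer width, and training settings are arbitrary choices for illustration): a single hidden layer with a bounded, non-constant activation is fit to a simple continuous function on a compact interval.

```python
import math
import torch
import torch.nn as nn

# One hidden layer with a bounded, non-constant activation (tanh),
# as in the classic statement of the theorem.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

# Target: a simple continuous function on a compact interval, sin(x) on [-pi, pi].
x = torch.linspace(-math.pi, math.pi, 256).unsqueeze(1)
y = torch.sin(x)

optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    optimizer.step()

print(f"final MSE: {loss.item():.6f}")  # small value: the network approximates sin(x) closely
```

Nothing about this network is specific to sin(x); with enough hidden units, the same architecture could be fit to any other continuous target on that interval, though, as noted above, the theorem says nothing about how easy that fit will be to find.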
Architectural Flexibility
While the Universal Approximation Theorem assures us that a sufficiently large, fully connected feedforward neural network with a single hidden layer can approximate any continuous function, this approach is often impractical for real-world applications. Instead, we can leverage specialized structures and different types of layers to create neural networks tailored for specific tasks. This architectural flexibility allows us to exploit the unique characteristics of the data and the problem domain, enhancing performance and efficiency.
One prominent example is convolutional layers, commonly used in Convolutional Neural Networks (CNNs). These layers are designed to process grid-like data, such as images. By applying convolutional filters, they can detect local patterns and features, such as edges or textures, while maintaining spatial hierarchies. This structure is particularly effective for image recognition tasks, where understanding spatial relationships is crucial.
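As a small illustration of what a convolutional filter does, here is a NumPy sketch with a hand-crafted kernel (a real CNN learns its kernels from data): the filter responds strongly wherever a vertical dark-to-light edge falls inside its window and stays silent in flat regions.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter over the image and record its response at each position (no padding)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-crafted vertical-edge detector; a trained CNN discovers filters like this on its own.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])

# Toy "image": dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

print(convolve2d(image, vertical_edge))  # peaks where the dark-to-light boundary falls; zero in flat regions
```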
Another example is recurrent layers, found in Recurrent Neural Networks (RNNs). These layers are specifically designed to handle sequential data, such as time series or natural language. By maintaining a hidden state that captures information from previous inputs, RNNs can effectively model temporal dependencies and context. This makes them well-suited for tasks like language modeling and speech recognition.
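The core of a recurrent layer is a single recurrence that fits in a few lines. Here is a NumPy sketch with made-up dimensions and randomly initialized weights standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5

# Random weights stand in for parameters a trained RNN would have learned.
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(seq_len, input_dim))  # e.g., word embeddings or time-series readings
h = np.zeros(hidden_dim)                          # hidden state starts empty

for x_t in sequence:
    # Each step mixes the new input with a summary of everything seen so far.
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h)  # final hidden state: a fixed-size summary of the whole sequence
```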
As a final example, remember transformers, which have revolutionized natural language processing. Unlike traditional RNNs, transformers rely on self-attention mechanisms to weigh the importance of different input elements relative to one another. This allows them to capture long-range dependencies and contextual relationships more effectively than previous architectures. Transformers have become the backbone of many state-of-the-art models in NLP, enabling tasks such as translation and text generation.
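At the heart of a transformer is scaled dot-product self-attention: every position produces a query, a key, and a value, and each output is a weighted mixture of all values, with the weights given by query-key similarity. A NumPy sketch, with random projection matrices standing in for learned parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16

X = rng.normal(size=(seq_len, d_model))  # one token embedding per row
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)   # similarity of every position to every other position
weights = softmax(scores, axis=-1)    # each row sums to 1: how much one position attends to the rest
output = weights @ V                  # each output is a weighted mix of all positions' values

print(weights.shape, output.shape)    # (6, 6) attention map, (6, 16) contextualized representations
```

Because every position attends to every other position in a single step, distant tokens can influence each other directly, which is exactly the long-range dependency modelling described above.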
By employing these specialized layers, we can create neural networks that don't attempt to approximate arbitrary functions but rather exploit the inherent structure of the problems we are trying to solve.
Representation Learning
One of the most powerful aspects of neural networks is their ability to perform representation learning, which can be understood as a sequence of increasingly abstract feature extraction mechanisms. This process allows neural networks to automatically discover and learn relevant features from raw data without requiring manual feature engineering. Essentially, each layer in a neural network transforms the input data into higher-level representations, capturing more complex patterns as the information flows through the network.
Consider an image classification task. When we analyze what each layer of a convolutional neural network (CNN) is learning, we can observe a fascinating progression. The initial layers typically act as simple feature detectors, identifying basic elements such as edges and textures in various orientations. These early detectors are crucial for understanding the fundamental building blocks of an image.
As we move deeper into the network, these simple features begin to combine into more complex shapes and patterns. For instance, the next layers might learn to recognise geometric shapes like circles and squares by aggregating the edge information detected in the earlier layers. Further down the line, these shape detectors merge into even more sophisticated representations, such as figure-like detectors that can identify parts of objects or specific patterns.
By the time we reach layer 20 or beyond in a deep CNN, the network has developed a highly abstract understanding of the input data. At this stage, it can accurately detect complex objects like dogs, cars, or houses based on the intricate features it has learned to recognise throughout its architecture.
This hierarchical approach to feature extraction means that almost any neural network designed for classification tasks can be viewed as a sequence of increasingly abstract and complex feature extractors.
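One practical way to see this progression is to register forward hooks on intermediate layers of a CNN and inspect their activations. The sketch below assumes PyTorch and torchvision are available and uses an untrained ResNet-18 purely to show the mechanics; with pretrained weights, the tapped features become the edge, shape, and object detectors described above.

```python
import torch
from torchvision.models import resnet18

model = resnet18()  # randomly initialized here; load pretrained weights for meaningful features
model.eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Tap progressively deeper stages of the network.
model.layer1.register_forward_hook(save_activation("early"))
model.layer3.register_forward_hook(save_activation("middle"))
model.layer4.register_forward_hook(save_activation("late"))

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed input image
with torch.no_grad():
    model(image)

for name, feat in activations.items():
    print(name, tuple(feat.shape))
# Spatial resolution shrinks and channel count grows with depth, e.g.:
# early (1, 64, 56, 56), middle (1, 256, 14, 14), late (1, 512, 7, 7)
```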
Manifold Learning
Manifold learning is another insightful way to interpret what neural networks are doing during the learning process. When tackling problems like image classification, we can think of it as a complex instance of a nearest neighbour problem. For example, all images of cats share certain similarities, just as images of dogs do. However, this similarity is not immediately apparent in the input domain—the pixel values—because images that represent similar concepts (like two different cats) can be quite distant from each other in terms of pixel-by-pixel distance.
To understand this better, we can posit that a high-dimensional space exists where these images are represented more meaningfully. In this space, points corresponding to similar images are close together, while those representing fundamentally different objects—like dogs, ships, or houses—are far apart. The challenge is that this "true" image space is tangled and twisted, making it difficult to identify these relationships directly.
Manifold learning refers to the ability of neural networks to find a set of transformations that project the original data from the input space (e.g., pixels in image classification) into this complex high-dimensional space where similar objects (e.g., images of cats) cluster together. If we could untangle this manifold, we could perform a simple nearest-neighbour comparison in a more meaningful context. Neural networks do that implicitly.
We can thus view deep neural networks as a series of projections onto increasingly complex manifolds. Each layer in the network transforms the input data, gradually mapping it closer to this ideal space where similar objects are grouped together. The final layer of the network, right before the softmax classification, thus contains a very twisted and tangled projection of the original input, to the point that it would be unrecognizable to humans. Still, it happens to be the projection that best clusters similar objects together.
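A sketch of what this untangling buys us (NumPy; the embeddings here are random stand-ins for vectors taken from the layer just before the classifier of a trained network): in that learned space, a plain nearest-neighbour lookup becomes a reasonable classifier, something it never is in raw pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from the penultimate layer of a trained network:
# each row is an embedding with a known label.
train_embeddings = rng.normal(size=(100, 512))
train_labels = rng.integers(0, 10, size=100)

def nearest_neighbour_label(query):
    # In pixel space this comparison is nearly meaningless; in the learned
    # embedding space, nearby points tend to share a label.
    distances = np.linalg.norm(train_embeddings - query, axis=1)
    return train_labels[np.argmin(distances)]

query_embedding = rng.normal(size=512)  # embedding of a new, unlabelled example
print(nearest_neighbour_label(query_embedding))
```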
Backpropagation
Training neural networks effectively hinges on the backpropagation algorithm, a pivotal method since its introduction in the 1970s. Backpropagation allows for the fine-tuning of weights within a neural network by computing how every parameter should be adjusted based on the error measured at each training iteration. This feedback mechanism is essential for optimizing the network's performance, as it systematically reduces the error rate by adjusting weights to improve predictions.
The power of backpropagation lies in its ability to compute gradients efficiently, regardless of the network's size or complexity. By applying the chain rule of calculus, backpropagation calculates how weight changes affect the overall error function. This means that even in deep networks with many layers, backpropagation can determine the necessary adjustments for each weight, enabling training to an arbitrary degree of precision (provided the network has enough capacity, i.e., is big enough).
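As a toy illustration of the chain rule at work, here is one forward and backward pass through a one-hidden-layer network in NumPy (dimensions and data are arbitrary; these hand-derived gradients are exactly what a framework's automatic differentiation would produce):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))          # one input example
y = np.array([[1.0]])                # its target value

W1 = rng.normal(scale=0.1, size=(3, 4))
W2 = rng.normal(scale=0.1, size=(4, 1))

# Forward pass: input -> hidden (tanh) -> output, then squared error.
h = np.tanh(x @ W1)
y_hat = h @ W2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: the chain rule walks the same graph in reverse.
d_y_hat = y_hat - y                  # dL/dy_hat
d_W2 = h.T @ d_y_hat                 # gradient for the output weights
d_h = d_y_hat @ W2.T                 # push the error back through the output layer
d_W1 = x.T @ (d_h * (1 - h ** 2))    # tanh'(z) = 1 - tanh(z)^2

# One gradient-descent step on every parameter.
lr = 0.1
W1 -= lr * d_W1
W2 -= lr * d_W2
print(f"loss before the step: {loss:.4f}")
```

The same recipe scales to networks with millions of parameters: each layer only needs to know how to pass gradients backward to the layer before it.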
In theory, this makes all neural networks trainable, but in practice, achieving effective training often requires careful management of various factors. For instance, practitioners must navigate challenges such as vanishing and exploding gradients, which can impede learning in deep networks. Additionally, hyperparameter tuning and regularization techniques are often necessary to ensure convergence and prevent overfitting. We’ll tackle these problems in Part II.
Flexible Learning Objectives
Neural networks are trained using backpropagation, which requires a well-defined loss function to measure the learning error. This loss function quantifies how far off the network's predictions are from the target values, and it must be differentiable for gradient descent to work effectively, since differentiability is what allows us to compute gradients and optimize the network's weights.
However, many learning objectives are not inherently differentiable. A prime example is classification error, often referred to as 0/1 loss. This type of loss is binary: you either classify an instance correctly or incorrectly, providing no gradient information for optimization. Fortunately, we can create differentiable approximations of such non-differentiable loss functions.
For instance, the binary cross-entropy loss is a commonly used differentiable approximation for 0/1 loss in binary classification tasks. It captures the essence of correct and incorrect classifications while allowing for a continuous range of error values. This enables the model to learn more effectively by providing meaningful gradient information even when predictions are not perfect.
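To make the contrast concrete, here is a small NumPy sketch: the 0/1 loss is flat almost everywhere, so it offers no gradient to follow, whereas binary cross-entropy changes smoothly as the predicted probability moves toward or away from the true label.

```python
import numpy as np

def zero_one_loss(p, label, threshold=0.5):
    # Step function: either right or wrong, with no useful gradient information.
    return float((p >= threshold) != label)

def binary_cross_entropy(p, label, eps=1e-12):
    # Smooth and differentiable: the loss shrinks continuously as p approaches the label.
    p = np.clip(p, eps, 1 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

label = 1
for p in [0.1, 0.4, 0.6, 0.9]:
    print(f"p={p:.1f}  0/1 loss={zero_one_loss(p, label):.0f}  BCE={binary_cross_entropy(p, label):.3f}")
# The 0/1 loss jumps from 1 to 0 at the threshold; BCE decreases smoothly: 2.303, 0.916, 0.511, 0.105
```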
Similarly, other tasks have their own tailored loss functions that facilitate learning. For example, a common loss function in regression tasks is Mean Squared Error (MSE), which measures the average squared difference between the predicted and actual values.
For multi-class classification problems, neural networks often use Categorical Cross-Entropy Loss, an extension of binary cross-entropy. This loss measures the dissimilarity between the predicted probability distribution and the true distribution over multiple classes, making it particularly effective for problems with many output categories.
Hinge loss, which focuses on maximizing the margin between classes, is frequently employed in binary classification tasks. It is also commonly used in support vector machines and some neural networks to ensure better separation between categories.
Contrastive loss is often used in more specialized applications, such as face recognition or metric learning. This loss function helps models learn embeddings by minimizing the distances between similar pairs of data points while maximizing the distances between dissimilar ones.
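The losses mentioned above all follow the same recipe. Here are minimal NumPy sketches of each, batch-averaged and written with the usual conventions assumed (labels in {-1, +1} for hinge loss, and a margin hyperparameter for the contrastive loss):

```python
import numpy as np

def mse(y_true, y_pred):
    # Regression: average squared difference between predictions and targets.
    return np.mean((y_true - y_pred) ** 2)

def categorical_cross_entropy(true_probs, pred_probs, eps=1e-12):
    # Multi-class: true_probs is usually one-hot; both arrays are (batch, num_classes).
    return -np.mean(np.sum(true_probs * np.log(np.clip(pred_probs, eps, 1.0)), axis=1))

def hinge(y_true, scores):
    # Margin-based: y_true in {-1, +1}; penalizes predictions that fall inside the margin.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def contrastive(dist, is_similar, margin=1.0):
    # Metric learning: pull similar pairs together, push dissimilar pairs
    # apart until their embedding distance clears the margin.
    return np.mean(is_similar * dist ** 2 + (1 - is_similar) * np.maximum(0.0, margin - dist) ** 2)
```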
Each of these loss functions is designed to suit specific learning objectives while maintaining differentiability, ensuring that gradient descent can be applied effectively to a wide range of dissimilar tasks.
Conclusions
Neural networks are incredibly powerful and flexible mathematical constructs. They can approximate any continuous function, adapt to specific tasks through specialized architectures like convolutional, recurrent, and transformer layers, and automatically extract increasingly abstract features from raw data.
Additionally, they untangle complex data relationships by projecting inputs into high-dimensional manifolds where similar items cluster together. Finally, their ability to transform diverse learning objectives into differentiable loss functions enables effective optimization via gradient descent.
And we have a very powerful and general training algorithm—backpropagation—to ensure we can effectively make use of all these mathematical properties.
But while these strengths explain their theoretical power, they don’t fully account for their practical success. In our follow-up article, we’ll explore how neural networks are perfectly suited to scale with the increasing availability of data and compute power, which is key to their dominance in modern AI.