Supervised vs Reinforcement Learning
Some quick notes on when and how to use each of the two main Machine Learning paradigms.
Supervised Learning and Reinforcement Learning are the two main paradigms in machine learning, both involving learning from experience but in fundamentally different ways.
Supervised learning is comparable to imitation, where learning takes place through demonstrations of solving a given task. This type of learning is commonly used for tasks such as text classification, image classification, and audio transcription, as well as in training large language and diffusion models.
The basic concept of supervised learning is to utilize expert demonstrations of a task to train a machine learning system to map the input space, such as pixels in an image, to the desired output space, like a label indicating the presence or absence of lung cancer in annotated X-ray images.
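To make this concrete, here is a minimal sketch of the supervised recipe, assuming scikit-learn is available. The synthetic features and labels below are stand-ins for real annotated data such as labeled X-ray images:

```python
# Minimal supervised learning sketch: fit a classifier that maps inputs
# (feature vectors standing in for pixels) to expert-provided labels.
# The data here is synthetic; in practice it would be annotated X-rays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # input space (e.g., image features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # expert annotations (e.g., 1 = cancer)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)  # learn the input-to-output mapping
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```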
In contrast, reinforcement learning involves learning through trial and error. It is often applied in robotics, self-driving cars, and any scenario requiring a sequence of actions where a plan is executed before assessing the outcome. This approach is utilized when direct access to expert demonstrations is unavailable, such as in tasks like walking or driving a car, where the method used by humans is not precisely understood or easily communicated to an AI.
Moreover, reinforcement learning is preferred when evaluating the result of a task is much more straightforward than actually solving it; for instance, it is easier to determine whether a car crashed than to drive the car, or to check whether a robot moved forward or fell over than to instruct the robot on how to move.
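For contrast, here is an equally minimal sketch of the reinforcement recipe: tabular Q-learning on a toy corridor, in plain Python. The environment, reward, and hyperparameters are all illustrative; the point is that the agent only ever sees an easy-to-compute outcome signal (did it reach the goal?), never a demonstration:

```python
# Minimal reinforcement learning sketch: tabular Q-learning on a corridor.
import random

N_STATES, GOAL = 10, 9   # positions 0..9, goal at the right end
ACTIONS = (-1, +1)       # step left or step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != GOAL:
        # trial: sometimes explore at random, otherwise act greedily
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0  # evaluating is easy: did we arrive?
        # error: update the value estimate toward the observed outcome
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

# after training, the greedy policy walks straight to the goal
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```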
When deciding between supervised learning and reinforcement learning, the feasibility of direct demonstration by experts becomes a crucial factor. Supervised learning is more suitable when experts can directly demonstrate the task, providing a clear framework for learning. In contrast, reinforcement learning is better suited for tasks where direct demonstration is difficult but evaluation is feasible.
This distinction can be best understood in terms of direct and inverse problems, common in computer science. The direct problem is the task you want to solve, from input to output, while the inverse problem involves going from the output to the input.
For example, in image classification, the direct problem is categorizing images: given an input image, provide an output text that describes it, either a simple label or a full natural language description. The inverse problem is thus image generation based on a given label or description.
Since the direct problem is easier to solve than the inverse (or at least not harder), in this case it is more feasible to find experts to annotate images and provide accurate labels, making supervised learning the preferable approach.
Consider the example of self-driving cars, where the direct problem involves navigating complex scenarios, while the inverse problem is simulating those scenarios to determine if the car reaches its destination. While both problems are complex, the inverse problem is relatively easier: we just have to build a realistic video game. In this case, even though making an accurate car simulator is no small feat, it is still far more manageable than collecting hundreds of thousands of hours of experts driving in many different locations.
Thus, when the direct problem is easier than, or comparable in complexity to, the inverse problem, supervised learning is typically preferred. However, if the inverse problem is significantly easier than the direct one, reinforcement learning becomes the more viable option.
If it’s unclear beforehand which problem is easier, you can consider two heuristics. The first is to examine the availability of data on expert performance: books, papers, tutorials, video demonstrations, etc. If such data is readily available, supervised learning may be a viable option. Otherwise, you may need to hire experts for annotation or build a simulation.
The second heuristic involves observing how humans learn to perform similar tasks. Is it through formal education or intensive practice? The former, as in medical diagnosis, indicates that humans learn via imitation, suggesting that supervised learning is feasible. In the latter case, humans learn via trial and error, as in sports, suggesting that reinforcement learning is more suitable.
Now, while this heuristic provides valuable insight into the problem, it may not be universally applicable. For instance, traditional board games like chess have historically been tackled using supervised learning, but the most successful recent game-playing systems, such as AlphaGo and AlphaStar, have shown that reinforcement learning can be even more effective.
If you can do either, then consider the advantages of each method. With sufficient data, supervised learning can yield faster initial results, as learning from clear examples of good behavior is more efficient. However, this approach tends to plateau at the level of the human experts who produced the data. Reinforcement learning, on the other hand, is not capped by the demonstrators' skill and can achieve superhuman performance if the simulation is sufficiently accurate. Therefore, a hybrid solution that starts with supervised learning and transitions to reinforcement learning, sketched below, may effectively push performance to the highest level possible.
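On the toy corridor from the earlier sketch, that hybrid might look as follows: warm-start the Q-table from (hypothetical) expert demonstrations, then refine it by trial and error:

```python
# Hybrid sketch: imitation warm start, then reinforcement refinement.
import random

N_STATES, GOAL = 10, 9
ACTIONS = (-1, +1)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

# Phase 1, supervised-style: bias Q toward the demonstrated actions.
for s in range(N_STATES):
    Q[(s, +1)] += 1.0  # our stand-in expert always steps right, toward the goal

# Phase 2, reinforcement: the same Q-learning loop as before, now starting
# from a sensible policy instead of a blank slate.
for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next
```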
Reinforcement Learning from Human Feedback (RLHF) is an interesting middle ground between Supervised Learning and pure Reinforcement Learning. In RLHF, as in pure RL, an agent interacts with an environment and learns from trial and error. But unlike pure reinforcement learning, where a simulation determines the feedback for each action, here human experts evaluate the agent's performance, and a reward model is trained to predict their feedback. Thus, we first learn the feedback function in supervised mode and then learn the optimal behavior against it in reinforcement mode.
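Here is a crude sketch of that two-stage recipe, with a scikit-learn classifier standing in for the reward model and random search standing in for the policy optimizer; every feature, label, and candidate below is illustrative:

```python
# RLHF sketch: (1) fit a reward model on human preference labels in
# supervised mode, (2) optimize behavior against it in reinforcement mode.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Step 1: humans compare pairs of outcomes; we learn which features they prefer.
outcomes_a = rng.normal(size=(500, 8))
outcomes_b = rng.normal(size=(500, 8))
prefers_a = (outcomes_a[:, 0] > outcomes_b[:, 0]).astype(int)  # hidden human taste
reward_model = LogisticRegression().fit(outcomes_a - outcomes_b, prefers_a)

def learned_reward(x):
    """Score a single outcome with the learned reward model."""
    return float(reward_model.decision_function(x.reshape(1, -1))[0])

# Step 2: trial and error against the learned reward instead of a human;
# a crude random search stands in for a real policy-optimization algorithm.
candidates = rng.normal(size=(100, 8))
best = max(candidates, key=learned_reward)
print("best candidate reward:", learned_reward(best))
```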
In any case, there are no hard rules when deciding whether to use supervision or reinforcement, only some heuristics. As in all disciplines, when faced with a novel problem, scientists and engineers must either draw from experience with similar problems or try different things and see what works. And this is just supervised and reinforcement learning on display. It’s pretty meta, isn’t it?