The Future of AI is Open Source
Why the future of artificial intelligence, and large language models in particular, will be built on open-source foundation models.
I strongly believe in the potential of open source. Open source software offers numerous benefits over closed source. Over the past 30 years, we have witnessed the growth of this movement from being a fringe ideology to becoming a widely embraced mindset, even by the world's leading companies.
Nevertheless, not everything can be open source. Building a business around software requires having some proprietary elements that can be monetized for profit. However, many successful business models involve a hybrid approach, where certain portions of your codebase are open-sourced while others remain closed-source.
One effective approach to capitalizing on open source is releasing your codebase as an open-source project, leveraging the community effect, while offering premium services such as cloud hosting and enterprise features, including single sign-on and dedicated customer service. This is particularly attractive to users who prefer not to self-host.
This model is widely employed in backend-, platform-, and infrastructure-as-a-service offerings. It is prevalent in various domains, from database systems to developer platforms like GitLab. While you can self-host these services, selling the cloud version is often the main business model.
The hybrid model combines the advantages of the open-source and closed-source models. With open source, you benefit from the community effect, as well as a large number of beta testers and early adopters, which improves the reliability of your product. Additionally, public development allows you to receive feedback and reports on platforms like GitHub. Even small contributions, such as documentation or user examples, greatly enhance the open-source model.
On the other hand, maintaining a closed part of your application has its benefits. For instance, you can close your user interface while releasing the backend and core functionality. By offering a cloud-hosted user interface, along with advanced features like drag-and-drop interfaces and logging administration, you can cater to enterprise users willing to pay for these services. Furthermore, you can also provide customer service, on-premise deployment, and develop client-specific plugins or components.
As AI becomes the foundation of the new Software 2.0 paradigm, the debate between open- and closed-source software becomes a central issue. What does open-source AI look like? Are there clear benefits to open-sourcing at least some part of your AI stack? Can you gain more by giving away more?
In this issue, I want to explore these questions, focusing on the rise of large language models as the backbone infrastructure for a significant portion of the near-future AI applications.
What does open-source AI look like?
Open source is progressively dominating the realm of software, from the bottom up: the closer software is to the infrastructure layers of computing, the more likely it is to be open source. For instance, operating systems, virtualization software, and many drivers are open source. Moving up the hierarchy, development tools, which form a slightly higher infrastructure level, also fall within the realm of open source.
However, open source does not excel similarly when it comes to consumer applications. Open-source compilers tend to be superior, but open-source image, video, and audio editors cannot compete with closed-source alternatives such as Photoshop and other Adobe products. Similarly, open-source video games are far behind their closed-source counterparts in entertainment value. The closer the software sits to low-level implementation details, the more beneficial an open-source approach becomes.
Now, let's examine how this applies in the context of Artificial Intelligence. First, it's important to note that AI development tools and frameworks, such as TensorFlow and PyTorch, while open source, are not the core of my argument: their open-source nature stems primarily from their classification as development tools, even though they are part of the AI stack.
Looking closely at language models within the AI stack, we can draw a parallel to basic infrastructure. For instance, foundation models —models trained on extensive text corpora, like the Llama family models and their various iterations— serve as fundamental building blocks. These models can be likened to the operating system level, forming the core infrastructure of the AI ecosystem.
This opens the option to commercialize fine-tuned adapters for specific domains. Say you have Llama as an open-source model. You can either fine-tune it yourself using your own infrastructure, or provide me with your data and I will fine-tune it for you on mine.
Moreover, there is an opportunity to sell fine-tuned versions of models for commercially valuable domains. For instance, I take an open-source model trained on a large corpus and fine-tune it on a small amount of data specific to, say, the market valuation domain. This model can then be used for generating business models, business plans, enterprise emails, quarterly reports, etc.
Essentially, you can take one of these infrastructure-like, foundational models, fine-tune it for a specific domain using your collected data and computational and human resources, and it would make sense for you to sell it. Thus, as we get closer to the application layer, it becomes more sensible to keep things closed.
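The adapter idea above can be made concrete with a toy NumPy sketch in the spirit of low-rank adaptation (LoRA). The dimensions, names, and numbers here are illustrative assumptions, not Llama's actual architecture: the point is only that the frozen foundation weights stay shared, while the small trainable adapter is the domain-specific, sellable artifact.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "foundation" weight matrix (stands in for one layer of a large model).
d_out, d_in, rank = 64, 64, 4
W = rng.normal(size=(d_out, d_in))

# Low-rank adapter: the only part trained (and sold) per domain.
A = rng.normal(size=(rank, d_in)) * 0.01
B = np.zeros((d_out, rank))  # zero-initialized, so the adapter starts as a no-op

def forward(x, scale=1.0):
    """Base layer plus the domain adapter: W @ x + scale * B @ (A @ x)."""
    return W @ x + scale * (B @ (A @ x))

# The adapter stores far fewer parameters than the full weight matrix.
full_params = W.size               # 64 * 64 = 4096
adapter_params = A.size + B.size   # 4*64 + 64*4 = 512
print(full_params, adapter_params)  # prints: 4096 512
```

A fine-tuning vendor would train only `A` and `B` on the client's data and ship just those matrices; the client combines them with the publicly available foundation weights.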
To open-source or not to open-source
Then why don't companies like Google or OpenAI open source their foundation models? This question is worth pondering because not all companies are keen on open-sourcing AI. Currently, only Meta is trying to open-source very large models. The remaining open-source models are typically smaller and built upon existing open-source foundation models. The large-scale models are generally kept closed source.
So, what is preventing companies like Google and OpenAI from joining the open-sourcing trend? Barring some extreme views[1], the fundamental argument behind these companies' choice between open and closed source is, of course, an economic one: which strategy will provide the most profit? Call me a cynic, but I don’t get fooled by their appeals to safety or privacy principles. These are for-profit companies, so their incentives are aligned with their investors’, and there’s nothing inherently wrong with that, if you ask me.
So let’s examine, from their point of view, the financial advantages of open-sourcing or not their foundation models. Collecting large datasets and investing significant computing power in training a huge model puts your competitors in a difficult position. They need to invest similar resources to compete with you. Consequently, only major players like Google, Meta, and Microsoft could achieve this independently.
In contrast, if, say, OpenAI chose to open source GPT-4, they would give away whatever advantage they had initially. Doing so would save competitors from replicating the extensive computing power needed to train such a model. OpenAI could then only differentiate itself through the fine-tuned and application layers: their advantage would be not the GPT-4 model itself but just the ChatGPT application.
It is understandable why these companies are unwilling to open source their foundation models as they invest significant resources in building them. Unlike projects such as the Linux kernel, which can thrive on contributions from hobbyists worldwide, developing a foundation model appears to require the backing of a large corporation with substantial financial and infrastructural capabilities.
Training GPT-4 is significantly more expensive than creating the Linux kernel, at least in the short term. One may argue that the Linux kernel has absorbed an enormous monetary investment in volunteer hours over many years, and those numbers may sum to a substantial figure. However, no open-source project, regardless of its nature, can allocate millions of dollars within the roughly six months it takes to train a massive model from inception to completion. This distinction between traditional open source and AI open source is crucial.
However, Meta did open-source their model, although they were initially reluctant to go fully open: they released Llama under a non-commercial research license until it was “accidentally” leaked. Llama 2, the newest model, is commercially licensed. There is an unusual restriction, though: the largest companies, those with more than 700 million monthly active users, must request a special license from Meta and are, in principle, prohibited from using it commercially.
This sounds like Meta wants to have their cake and eat it, too. They aim to gain recognition for being a cutting-edge company that shares open-source projects —crucial for their credibility with developers since they are not the “cool” company; Google holds that reputation. Thus, Meta seeks the social credit associated with being the company that shares significant projects like Llama while simultaneously striving to limit competition from major players.
Meta sees Google and the other big players as the only competition because, unlike other AI-first startups like OpenAI, Hugging Face, Anthropic, etc., Meta’s core product is not language models or anything AI-related. Like Google, Meta’s core product is much bigger and darker: advertisement.
AI and language models serve only as valuable tools to enhance their core business model's profitability. In this regard, Meta does not directly compete with the many small startups attempting to build ChatGPT-like products for various domains. Meta is indifferent to such competition.
Whether you use Llama to create a chatbot for business purposes, generate fiction, or automate emails, it bears no significance to them. They are distinct from Grammarly, Hugging Face, and countless other startups focused on constructing AI-powered tools. This is not their area of expertise or profit.
Consequently, they are not bothered by making life easier for these smaller companies by releasing their models. Their primary concern is making it harder for Google, Amazon, Microsoft, and Apple to outpace them. Hence, by outmaneuvering these four entities, they can benefit fully from the clout of being open-source friendly without any of the business drawbacks.
The reasoning behind Meta's decision to open source Llama, I think, is to strengthen their status as a cool company while safeguarding their tech from its real competitors. Despite not sharing Meta's ethos, I agree with their choice to open-source AI. This move benefits the community, advances scientific research, and enhances business prospects when embraced by all.
The advantage of open-sourcing
However, doesn't open-sourcing your model put you at a disadvantage in a world where no one else is doing it? Aren't companies like Google, Microsoft, Amazon, and Apple, by building foundation models but not open-sourcing them, still beating you to the punch?
I believe not. Open sourcing a foundation model and allowing a vast community of researchers, hobbyists, entrepreneurs, start-ups, and students worldwide to contribute offers numerous advantages. The extensive collaboration from these groups enables the development of innovative ideas and products on top of the foundation model. I argue this far outweighs any potential drawbacks of keeping the model closed and restricting product development solely to the company.
In the future, all our creations will rely either on closed services like ChatGPT or open-source models like Llama. The considerable brand recognition gained from having a community of users of your foundation model is an invaluable asset. However, it is not the sole or most significant benefit of open-sourcing your foundation models.
That benefit is collaboration. You see, the most significant issue language models must address if they are to have lasting value, and not be just a passing trend, is reliability. Hallucinations, prompt reproducibility, and biases can all hinder the progress of language model technology and even bring about a new AI winter. To become more than mere entertainment tools and instead serve as foundational infrastructure that applications can rely on, language models must tackle these fundamental problems.
However, solving these problems requires going beyond the API layer. Any solution to hallucinations requires modifying the model —adjusting the weights, altering the architecture, or refining the sampling process. These changes must be implemented at the model level rather than relying solely on API access.
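As a minimal, self-contained illustration of one such model-level intervention, consider refining the sampling process itself. The three-token vocabulary and logit values below are hypothetical; this is not any real model's decoder, just a sketch of what raw access to logits makes possible that a text-in/text-out API hides.

```python
import math
import random

def sample_token(logits, allowed=None, temperature=1.0, rng=random):
    """Softmax sampling with an optional hard constraint on the vocabulary.

    With weight-level access you see the raw next-token logits, so you can
    mask out tokens the application must never emit before sampling.
    """
    items = [(tok, logit) for tok, logit in logits.items()
             if allowed is None or tok in allowed]
    m = max(l for _, l in items)  # subtract max for numerical stability
    weights = [math.exp((l - m) / temperature) for _, l in items]
    r = rng.random() * sum(weights)
    for (tok, _), w in zip(items, weights):
        r -= w
        if r <= 0:
            return tok
    return items[-1][0]

# Hypothetical logits for the next token; "unicorn" plays the hallucination.
logits = {"Paris": 2.0, "London": 1.0, "unicorn": 3.5}
print(sample_token(logits, allowed={"Paris", "London"}))
```

Even though "unicorn" has the highest logit, constraining the sampler guarantees it is never produced, a form of grounding that cannot be enforced from outside a closed API.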
With open-source models, anyone can contribute to addressing hallucinations, grounding the model, mitigating biased language, improving adherence to prompts, and increasing reliability. The availability of model weights and access to the source allows for a seamless combination of efforts. This approach reaps the benefits of a distributed community of researchers and developers tackling the core issues, in the same sense that reliability issues in traditional software are often solved by open collaboration.
In contrast, if you operate on a closed-source basis and keep the model internally within your company, even if you have the power to invest as much money as you desire, the resource that remains most restricted is the availability of sufficiently skilled individuals who can tackle these challenges effectively.
Throwing money at the problem allows for extensive data collection and scaling of computational power. However, scaling the number of qualified personnel is not as easy. Even with substantial resources, OpenAI, Microsoft, or Google cannot recruit every artificial intelligence engineer, data scientist, or machine learning theorist worldwide.
The key limitation to making AI truly useful lies in the scarcity of human resources.
Open source is the key to scaling in human resources. This model promises unlimited and mostly free access to a vast pool of talented individuals eager to work with you. In contrast, closed source restricts your access to only those individuals you can persuade to work for you, often depending on the salary you offer. No matter the amount you pay, there will always be exceptionally bright individuals who choose to work for your competitors instead of you.
Open-source language models, such as Llama, have the edge here. Unlike GPT, Llama and its successors will be developed and improved by the community, which means they can become more reliable, customizable, and practical than closed-source models. As open-source operating systems, compilers, databases, and development tools have demonstrated before, I think open-source language models will eventually become more robust and generally superior to anything that can be developed behind closed doors.
Conclusions
The current state of open-source language models versus closed-source models is far from the ideal I just painted. Today, closed-source models far exceed open-source models regarding reliability and robustness. Anyone who has attempted to build applications using GPT through OpenAI's API and then transitioned to Llama or other open-source models has witnessed their unreliability and general lack of robustness.
However, this situation is not unique to language models. This unreliability is expected in initial open-source projects, as the first generation of any open-source software is typically subpar.
Linux, for instance, took a considerable amount of time before it became a robust operating system, surpassing all closed-source alternatives. Similar patterns are seen with compilers, database engines, IDEs, and frameworks. Initially, closed-source solutions have advantages in terms of development speed, benefiting from massive upfront investment in computing and data, which open-source initiatives cannot match.
Nevertheless, we witness open-source scalability in infrastructure-oriented software. This is because the potential of open source is nearly limitless, tapping into a vast pool of human contributors. In contrast, closed-source tools remain constrained by the number of individuals they can hire. As a result, open source holds the advantage in the long run, at least at the foundational level.
Currently, open-source machine learning and language models are in their early stages. Therefore, you should expect closed-source models to outperform open-source models for the next few iterations, perhaps even a few years. However, as history has demonstrated, open source will eventually catch up and become the optimal solution. Closed-source AI companies will gradually transition to open-source and contribute to the community. Eventually, the infrastructure layer of AI will become open-source, like most software infrastructure has.
For developers today, my advice is to continue using closed-source models in production but keep tabs on the open-source landscape. Become an early adopter and contributor, and help enhance the robustness and accessibility of AI for future generations. Computer Science is undergoing a massive revolution akin to that of the 60s and 70s. This time, you can be one of the pioneers.
[1] A small but vocal group believes open-sourcing AI is a catastrophic mistake as it can precipitate an AGI extinction event. I wholeheartedly disagree with this view for many reasons, but this is not the article to rant against AI doomers. There’s another article in the queue for that.