OSI readies controversial Open AI definition

The Open Source Initiative (OSI) has been working on defining Open Source AI—that is, what constitutes an AI system that can be used, studied, modified, and shared for any purpose—for almost two years. Its board will vote on the Open Source AI Definition (OSAID) on Sunday, October 27, with the 1.0 version slated to be published on October 28. It is never possible to please everyone in such an endeavor, and it would be folly to make that a goal. However, a number of prominent figures in the open-source community have voiced concerns that OSI is setting the bar too low with the OSAID—low enough, they argue, to undo decades of community work to cajole vendors into adhering to, or at least respecting, the original Open Source Definition (OSD).

Defining Open Source AI

OSI executive director Stefano Maffulli announced the organization's intent to provide a definition for open-source AI in June 2023. He took exception to announcements of "large language models, foundational models, tooling, services all claiming to be 'open' or 'Open Source'" while imposing restrictions that run afoul of the OSD. A survey of large language model (LLM) systems in 2023 found that ostensibly open-source LLMs did not live up to the name.

The problem is not quite as simple as saying "use an OSD-compliant license" for LLMs, because there are many more components to consider. The original OSD is understood to apply to the source code of a program in "the preferred form in which a programmer would modify the program". A program is not considered open source if a developer cannot study, use, modify, and share it, and a license is not OSD-compliant if it does not preserve those freedoms. A program can depend on non-free data, however, and still be open source. For example, the game Quake III Arena (Q3A) is available under the GPLv2. That distribution does not include the pak files that contain the maps, textures, and other content required to actually play the commercial game. Despite that, others can still use the Q3A code to create their own games, such as Tremulous.

When discussing an "AI system", however, things are much more complicated. There is more to such a system than the code used to run the models, and the data cannot be separated from the system as cleanly as it can be with a game. When looking at, say, LLMs, there is the model architecture, the code used to train models, the model parameters, the techniques and methodologies used for training, the procedures for labeling training data, the supporting libraries, and (of course) the data used to train the models.

OSI has been working on its definition since last year. It held a kickoff meeting on June 21, 2023, at the Mozilla headquarters in San Francisco, and invited participation afterward via a regular series of in-person and online sessions, as well as a forum for online discussions. LWN covered one of the sessions, held at FOSDEM 2024, in February.

The current draft of the OSAID takes its definition of an AI system from the Organisation for Economic Co-operation and Development (OECD) Recommendation of the Council on Artificial Intelligence:

A machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.

For the purposes of the OSAID, an AI system includes the source code for training and running the system, the model parameters "such as weights or other configuration settings", as well as "sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system".

Preferred form to make modifications

Those elements must all be available under OSI-approved licenses, according to the proposed definition, which seems perfectly in line with what we have come to expect when something is called "open source". There is an exception, though: the data information and model parameters need only be available under "OSI-approved terms", and what qualifies as OSI-approved terms has not yet been defined.

There is no requirement to make the training data available; to be compliant with the current draft of the OSAID, an AI system need only provide "detailed information" about the data, not the data itself.

The OSI published version 0.0.9 on August 22. It acknowledged then that "training data is one of the most hotly debated parts of the definition". However, the OSI was choosing not to require training data:

After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.

Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

As it stands, some feel that the OSAID falls short of ensuring the four freedoms it is supposed to guarantee. For example, julia ferraioli wrote that, without including data, the only things the OSAID guarantees are the ability to use and distribute an AI system: "They would be able to build on top of it, through methods such as transfer learning and fine-tuning, but that's it."
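For readers unfamiliar with what that kind of "building on top of" looks like in practice, here is a minimal sketch (not from ferraioli's post) of fine-tuning a hypothetical open-weights model on one's own data using the Hugging Face transformers and datasets libraries; the model identifier and data file below are placeholders. The point is that none of this requires, or reveals anything about, the original training data.

```python
# Sketch only: fine-tuning an "open-weights" causal language model without
# access to its original training data. The model name and corpus file are
# hypothetical placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "some-org/open-weights-model"   # hypothetical published weights
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:              # many causal-LM tokenizers lack one
    tokenizer.pad_token = tokenizer.eos_token

# The user supplies their *own* data; the original training data never
# enters the picture.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm=False),
)
trainer.train()        # adjusts the existing weights
trainer.save_model()   # the result can be used and shared, but whatever
                       # biases the base model learned remain unauditable
```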

Tom Callaway has written at length on LinkedIn about why open data should be a requirement. He acknowledges that there are good reasons a distributor of an AI system may not want, or be able, to distribute training data. The data may have a high monetary value, and a vendor may be unwilling or unable to share it; Acme Corp might license a data set with permission to create an AI system from it, but not to distribute the data itself. The data might also carry legal risk, ranging from confidentiality concerns (medical data sets, for example) to a desire to avoid lawsuits over the use of copyrighted data.

All of those are understandable reasons for not distributing data with an AI system, he said, but they don't argue for crafting a definition that allows companies to call their system open:

If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts. While there are lots of companies who would like to see open source mean less, I think it's critical that we not compromise here, even if it means there are less Open Source AI systems at first.

Objections to the lack of training data amount to more than an attachment to the original meaning of open source. Giacomo Tesio posted a list of issues that he considered unaddressed in the RC2 version of the OSAID, including a claim that such systems are inherently insecure because undetectable backdoors can be planted in machine-learning models.

Others weigh in

The Free Software Foundation (FSF) announced that it is working on "a statement of criteria for free machine learning applications": the conditions under which it would call a machine-learning application free (or libre). The FSF says that it is close to a definition and is working on the exact text. However, it adds that "we believe that we cannot say a ML application 'is free' unless all its training data and the related scripts for processing it respect all users, following the four freedoms".

However, the FSF makes a distinction between non-free and unethical in this case:

It may be that some nonfree ML have valid moral reasons for not releasing training data, such as personal medical data. In that case, we would describe the application as a whole as nonfree. But using it could be ethically excusable if it helps you do a specialized job that is vital for society, such as diagnosing disease or injury.

The Software Freedom Conservancy has announced an "aspirational statement" about LLM-backed generative AI for programming, called "Machine-Learning-Assisted Programming that Respects User Freedom". Unlike the OSAID, the statement focuses solely on computer-assisted programming and was developed in response to GitHub Copilot. The announcement did not directly name the OSI or the OSAID effort, but said "we have avoided any process that effectively auto-endorses the problematic practices of companies whose proprietary products are already widely deployed". It describes an ideal LLM system built only with FOSS, with all components available, and used only for the creation of FOSS.

Response to criticisms

I emailed Maffulli about some of the criticisms of the current OSAID draft and asked why OSI appears to be "lowering the bar" for AI when it has never budged on requirements like source availability or on prohibiting use restrictions. He replied:

I'll be blunt: you mention "source redistribution" in your question and that's what leads people like [Callaway] into a mental trap [...]

There are some groups believing that more components are required to guarantee more transparency. Other groups instead believe that model parameters and architecture are enough to modify AI. The Open Source AI Definition, developed publicly with a wide variety of stakeholders worldwide, with deep expertise on building AI (see the list of endorsers), found that while those approaches are legitimate, neither is optimal. The OSAID grants users the rights (with licenses) and the tools (with the list of required components) to meaningfully collaborate and innovate on (and fork, if required) AI systems. We have not compromised on our principles: we learned many new things from actual AI experts along the way.

Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems was what is in the OSAID: "it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment". He added that OSI had synthesized input from "AI builders, users, and deployers, content creators, unions, ethicists, lawyers, software developers from all over the world" to arrive at the definition. A "simple translation" of the OSD, he said, would not work.

Stephen O'Grady, co-founder of the RedMonk analyst firm, also makes the case that the OSD does not translate easily to AI projects. But he does not believe that the term open source "can or should be extended into the AI world", as he wrote in a blog post on October 22:

At its heart, the current deliberation around an open source definition for AI is an attempt to drag a term defined over two decades ago to describe a narrowly defined asset into the present to instead cover a brand new, far more complicated future set of artifacts.

O'Grady makes the case that the OSI has set out on a pragmatic path to define open-source AI, which requires nuance. Open source has succeeded, in part, because the OSD removes nuance. Does a license comply with the OSD or doesn't it? It's pretty easy to determine. Less so with the OSAID. The pragmatic path, he said:

Involves substantial compromise and, more problematically, requires explanation to be understood. And as the old political adage advises: "If you're explaining, you're losing."

It would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate. That seems unlikely now, he said, after two years of trying to "thread the needle between idealism and capitalism to arrive at an ideologically sound and yet commercially acceptable" definition.

Indeed, it seems likely that the OSI board will move forward with the current draft of the OSAID or something close to it. The impact that will have is much less certain.