 
Finally, there's an "official" definition of open source AI.
The Open Source Initiative (OSI), the long-running institution trying to define and "steward" all things open source, today released version 1.0 of its Open Source AI Definition (OSAID). The product of several years of collaboration with academia and industry, the OSAID is intended to offer a standard by which anyone can determine whether AI is open source—or not.
You may be wondering, as this reporter did, why it matters to have a consensus definition of open source AI. There's a big motivation for it, said OSI EVP Stefano Maffulli.
"Regulators are already watching the space," Maffulli told TechCrunch, noting that bodies like the European Commission have sought to give special recognition to open source. "We did explicit outreach to a diverse set of stakeholders and communities, not only the usual suspects in tech. We even tried to reach out to the organizations that most often talk to regulators in order to get their early feedback."
Open AI
According to the OSAID, an AI model has to allow a person to "substantially" reproduce its design for it to be open source. The model must make public all relevant information regarding the origin, processing, and ways of obtaining or licensing its training data.
"An open source AI is an AI model that lets you see how it's been built," Maffulli said. "That means you have access to all the components, like the complete code used for training and data filtering."
The OSAID also lays out the usage rights developers should expect with open source AI, such as the freedom to use a model for any purpose and to modify it without asking anyone's permission. "Most importantly, you should be able to build on top," Maffulli added.
The OSI has no mechanisms for enforcing compliance, and it can't compel developers to abide by the OSAID. But it does intend to flag models that are described as "open source" but fall short of the definition.
"Our hope is that if someone tries to abuse the term, the AI community will say, 'We don't recognize this as open source,' and it gets corrected," Maffulli said. Historically, this kind of community pressure has had mixed results, but it isn't entirely without effect.
Many startups and major tech companies, most visibly Meta, have described their release strategies for AI models using the term "open source." However, few of these meet the OSAID's criteria. For instance, Meta requires platforms with over 700 million monthly active users to request a special license to use its Llama models.
Maffulli has been openly critical of Meta's decision to call its models "open source." After discussions with the OSI, Google and Microsoft agreed to stop using the term for models that aren't fully open, but Meta hasn't, he said.
Stability AI, which has long touted its models as "open," requires businesses bringing in more than $1 million in revenue to obtain an enterprise license. And French AI upstart Mistral's license prohibits the use of certain models and outputs for commercial ventures.
A study last August by researchers at the Signal Foundation, the nonprofit AI Now Institute, and Carnegie Mellon found that many "open source" models are basically open source in name only. The data required to train the models is kept secret, the compute power needed to run them is beyond the reach of many developers, and the techniques to fine-tune them are intimidatingly complex.
Rather than democratizing AI, these "open source" projects tended to entrench and expand centralized power, the authors of the study concluded. Indeed, Meta's Llama models have racked up hundreds of millions of downloads, and Stability claims that its models power up to 80% of all AI-generated imagery.
Counterpoints
Meta disagrees with that assessment, naturally, and takes issue with the OSAID as written, despite having had a hand in drafting it. A spokesperson defended the company's license for Llama, arguing that the terms, along with the accompanying acceptable use policy, act as guardrails against harmful deployments.
Meta also said it is taking a "cautious" approach to sharing model details, including details about training data, as laws such as California's new training transparency law develop.
"We agree with our partner the OSI on much, but we, like others across the industry, disagree with their new definition," the spokesperson said. "There isn't a single open source definition for AI, and settling on one is difficult because past open source definitions don't fit the rapidly evolving complexities of today's AI models. We're making Llama free and openly available, and our license and acceptable use policy help keep people safe by placing some restrictions on its use. We'll continue working with the OSI and other industry groups to make AI free responsibly, whether or not that fits technical definitions of 'open source.'"
Other efforts to codify "open source" AI include proposals from the Linux Foundation, criteria for "free machine learning applications" proposed by the Free Software Foundation, and proposals from other AI researchers.
Meta, ironically, is one of the organizations funding the OSI's efforts — along with tech leaders like Amazon, Google, Microsoft, Cisco, Intel, and Salesforce. (The OSI recently received a grant from the nonprofit Sloan Foundation to reduce its dependence on tech-industry donors.)
Meta's reluctance to publish training data probably has to do with how its AI models, and most AI models, are created.
AI companies scrape vast amounts of images, audio, video, and more from social media and websites, and train their models on this "available data," as it's usually called. In today's cutthroat market, the way a company assembles and refines its datasets is considered a competitive advantage, and companies cite this as one of the main reasons for not disclosing it.
But training data details can also paint a legal target on developers' backs. Authors and publishers claim Meta used copyrighted books for training. Artists have filed suits against Stability for scraping their work and reproducing it without credit, an act they compare to theft.
It's not hard to see how the OSAID could be problematic for companies trying to resolve lawsuits favorably, especially if plaintiffs and courts find the definition persuasive.
Questions remaining open
Some argue that the definition doesn't go far enough, for instance in how it handles the licensing of proprietary training data. Luca Antiga, the CTO of Lightning AI, points out that a model can meet all of the OSAID's requirements even though the data it was trained on remains inaccessible. Is it "open" if you have to pay thousands to inspect the private stores of images that a model's creators paid to license?
"To be of practical value, especially for businesses, any definition of open source AI needs to give reasonable confidence that what is being licensed can be licensed for the way that an organization is using it," Antiga told TechCrunch. "By neglecting to deal with licensing of training data, the OSI is leaving a gaping hole that will make terms less effective in determining whether OSI-licensed AI models can be adopted in real-world situations."
OSAID version 1.0 also doesn't address copyright as it applies to AI models, or whether granting a copyright license would be enough to ensure that a model satisfies the open source definition. It remains unclear whether models can be copyrighted under current IP law. But if the courts decide that they can, the OSI would need to devise new "legal instruments" to properly open-source IP-protected models.
Maffulli agreed that the definition will have to evolve, perhaps sooner rather than later. To that end, the OSI has established a committee that will monitor how the OSAID is used and applied, and propose changes for future versions.
"This is not the work of lone geniuses in a basement," he said. "It's work that is being done in the open with wide stakeholders and various interest groups."