If you can't reproduce the model then it's not open-source
(twitter.com)
> Imagine if Linux published only a binary without the codebase. Or published the codebase without the compiler used to make the binary. This is where we are today.
This was such a helpful way to frame the problem! Something felt off about the "open source models" out there; this highlights the problem incredibly well.
In my mind, what's more crucial here is the code for downloading/scraping and labeling the data, not the model architecture or the training script.
As much as I appreciate Mis(x)tral, I would've loved it even more if they released code for gathering data.
I'm speculating they are attempting to avoid controversy about their data sources. That, and a possible competitive edge depending on what specific sets/filtering they're using.
To avoid controversy AND potential lawsuits.
Yup.
I think many countries (Japan already has) will allow copyrighted (IP) material to be used as training data.
They just need to buy time until then.
It's common for third-party model testers not to disclose what they mean by the “Refusal” parameter as well, for obvious reasons. The world is full of witch-hunting maniacs now and will stay that way for an indefinite amount of time. Just wait until the whole thing becomes more widely known and they realize it. All AI companies have to hurry up before the doors shut.
IMHO much of the key training data can't simply be downloaded/scraped/labeled, no matter what code you had - it's not like it's freely accessible to everyone and just needs some code to get it and process it. You can't scrape all of Google Books archive or all of Twitter, and quite a few things that could be scraped at one point may actively prevent you from scraping them now.
I wouldn't mind having ready-to-use datasets instead of the code for downloading/scraping and labeling. It would save a lot of time. Writing code to gather the data isn't complicated, and it may sometimes be impossible to replicate the datasets anyway if parts of the data you'd have to scrape are already gone (removed for various reasons).
I think a better analogy is firmware binary blobs in the Linux kernel, or VM bytecodes.
The LLM inference engine (architecture implementation) is like a kernel driver that loads a firmware binary blob, or a virtual machine that loads bytecode. The inference engine is open source. The problem is that the weights (firmware blobs, VM bytecodes) are opaque: you don't have the means to reproduce them.
The Linux community has long argued that drivers that load firmware blobs are cheating: they don't count as open source.
Still, the "open source" LLMs are more open than "API-gated" LLMs. It's a step in the right direction, but I hope we don't stop there.
If we're continuing the analogy, the compute required to turn the source into binaries costs millions of dollars. Not a license fee for the compiler, but the actual time on a computer.
The GPL describes the source as the "preferred form for modification".
And, that's obviously fun, because with LLMs, you have the LLM itself which cost hundreds of thousands in compute to train, but given you have the weights it's eminently fine-tunable. So it's actually not really like Linux - rather it's closer to something like a car, where you had no hope of making it in the first place but now you have it, maybe you can modify it.
So in this case, the weights are the source code and the training material + compute time is like the software development process that went into creating the source code.
It would probably take well over a million dollars in engineering hours to recreate the postgres source code from scratch, just as it would take millions in compute to rebuild the weights.
The model weights ARE the preferred form for modification
As a long-time 'practitioner' of machine learning models I strongly disagree: the preferred form for model modification is retraining the model with a tweak to the parameters, the training algorithm, the model structure, the data selection, or the length of training.
You can get some effects by fine tuning, and in that case it may be preferable as it's cheaper, but in general if I want to have a different or better model, that involves retraining.
I don't really believe your long-time practice is aligned with the kind of models being discussed
Yeah, that's why data scientists are out there editing the weights rather than cleaning up datasets and rerunning training with different settings.
If that was supposed to be clever it just sounds naive. There’s a ton of work going on fine tuning open source models
> There’s a ton of work going on fine tuning
... models provided in weights only form. (mostly!)
I believe the preferred form would be the whole kit and caboodle: the collection and filtering scripts, the data to the extent that it's non-public, the training routine, and the model weights... because sometimes you'll perform changes at any of those stages.
Do you actually do this for a living? Do you have experience doing this and have credibility talking about what’s preferred? I do.
OK. Where is your reproduction of Pythia trained from scratch? Or MPT? Or Amber? Shall we play a game where you give a paper about pretraining (and we are not talking about puny models trained on wikitext2), I give you a paper about finetuning, and we see who runs out of papers first?
Reproduction is not the goal! Making papers is not the goal! Making useful models is the goal. And having open source models is an enormously more useful thing.
I see you’re someone else, so I’ll ask you too. Do you actually have any experience doing this? Have you ever fine tuned models or tried to change architecture or put a piece of one model into another?
>Making useful models is the goal.
Sure, the training dataset for Pythia is useful. The Pile was used in lots of models. However, it's hardly relevant that Pythia itself was trained on the Pile; they live separate lives.
Having just the weights already allows for incredibly useful results (you don't need the original dataset for flash attention, or for tuning a foundation model into a chat model).
Point is: Having both doesn't make released model more useful.
>Do you actually have any experience doing this? Have you ever fine tuned models or tried to change architecture or put a piece of one model into another?
Yes on both finetuning and "changing" the architecture: with adapters and similar approaches you don't need to retrain everything from scratch after modifying the guts of the original architecture to your liking; you just need to not stir it up too much. Training on the task at hand is sufficient.
No, I haven't glued parts of existing models together (ensembles don't count).
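For what it's worth, a minimal sketch of the adapter approach mentioned above (LoRA via the Hugging Face peft library; the base model and target module names here are just placeholders, not what any commenter actually used):

    # pip install transformers peft
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Stand-in for any weights-only release you can load locally.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    # Inject small trainable adapter matrices; the released weights stay frozen.
    config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["c_attn"],   # GPT-2's fused attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()   # typically well under 1% of all parameters
    # `model` can now be trained on your own task data with any standard loop.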
> Reproduction is not the goal! Making papers is not the goal! Making useful models is the goal.
1. The thread is about the requirements of calling a model open source. The goal of making the models is separate from the requirements of the open source definition.
2. Suppose that author A of a model prefers working exclusively with the model weights to modify the model. Author A's preferred form of modification includes the model weights - and whatever scripts are needed to generate a running model from the weights - but does not include the training set and initial training scripts. Suppose that author B of an unrelated model prefers to retrain the model as part of the process of modifying the model. If author B changes the training set and/or changes the training scripts, then the training set is part of the preferred form of modifying the model. The training set and the training scripts are both necessary for turning the training set into a running model, so I think author B would have to include the training set even if author B changes only the training scripts. (Correct me if I'm wrong.) jncfhnb, you're like author A, so if you were to release an open source model then you would need to include the weights but not the training data. Trapais and nullc, don't assume that every model author is author B.
For personal reference, here is the relevant excerpt from the open source definition from the Open Source Initiative [A1]:
> The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
Open source software is not the same as free software, but here is the relevant excerpt from the free software definition explainer from the Free Software Foundation [A2]:
> Source code is defined as the preferred form of the program for making changes in. Thus, whatever form a developer changes to develop the program is the source code of that developer's version.
> Reproduction is not the goal!
It is for Open Source. Hence why it's silly to call these models open source.
Unless you want to try modifying the model structure, in which case the weights aren’t necessarily valid anymore and will need to be retrained.
The GNU GPLv3 requires "Corresponding Source", not only the files that contain lines such as "def foo(bar):" or "foo(bar)". The Corresponding Source includes all of the files needed to turn your unmodified/modified copy of the source code into something the user can run, with exceptions to some of the tools that the author of the GPLed program has no authorship in.
> The “Corresponding Source” for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.
...
> You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License
Model weights alone are not Corresponding Source. In order to distribute a model you made under the GPLv3, you would have to give users the model weights and the scripts needed to turn the model weights into a runnable model. That's assuming that you only work with the model weights when modifying the model. If you in particular retrain the model as part of modifying the model, then you would have to provide the training data and initial training scripts as well.
Even though I wrote about a particular free software license which happens to be an open source license, the open source definition from the Open Source Initiative also refers to the preferred form of changing the work [2]:
> The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
For good measure, here is the relevant excerpt from the free software definition from the Free Software Foundation [3]:
> Obfuscated “source code” is not real source code and does not count as source code.
> Source code is defined as the preferred form of the program for making changes in. Thus, whatever form a developer changes to develop the program is the source code of that developer's version.
> Freedom 1 includes the freedom to use your changed version in place of the original. If the program is delivered in a product designed to run someone else's modified versions but refuse to run yours—a practice known as “tivoization” or “lockdown,” or (in its practitioners' perverse terminology) as “secure boot”—freedom 1 becomes an empty pretense rather than a practical reality. These binaries are not free software even if the source code they are compiled from is free.
The FSF's free software definition requires that the user be practically - not merely theoretically - allowed to modify the source code and turn the source code into a running program. Because of that, the free software definition considers build scripts to be part of the source code. I can't find an explicit analogue of the practically-modifiable requirement in the open source definition, but I think providing the model weights without providing the scripts needed to turn the weights into a functioning copy of the existing model would be obfuscation i.e. a violation of the open source definition.
[1] https://www.gnu.org/licenses/gpl-3.0.en.html
Off-topic but that's why I always fail to pick up android dev after so many false starts. It just never felt right.
Android is not open source.
No it’s not. You have everything you need to modify the models to your own liking. You can explore how it works.
This analogy is bad. Models are unlike code bases in this way.
> You have everything you need to modify the models to your own liking.
What if I wanted to train it using only half of its training set? If the inputs that were used to generate the set of released weights are not available I can’t do that. I have a set of weights and the model structure but without the training dataset I have no way of doing that.
To riff on the parent post, I have:
    Source + Compiler => Binaries
For the vast majority of open source models I have:
    [unavailable inputs] + Model Structure => Weights
They're not exactly the same as the source code/binary scenario because I can still do this (which isn't generally possible with binaries):
    Model Structure + Weights + [my own training data] => New Weights
Another way to look at it is that with source code I can modify the code and recompile it from scratch. Maybe I think the model author should have used a deeper CNN layer in the middle of the model. Without the inputs I can't do a comparison.
> Maybe I think the model author should have used a deeper CNN layer in the middle of the model. Without the inputs I can't do a comparison.
You can fine tune into a different model architecture.
You’re right on not being able to retrain the model from scratch on half its data without that data but that’s likely pointless.
I’d be happy to be wrong about this but my understanding is that changing the architecture of the last few layers is feasible with fine-tuning but changing middle layers isn’t likely going to work very well without having the full original input set.
> likely pointless
It doesn’t take too much creativity to come up with ideas about why someone might want to do that:
- researchers who want to investigate how much the dataset can be reduced (and thus training cost) and what the accuracy penalty is
- someone who, for either religious or ethical reasons, wants to minimize the probability that the model was trained on pornography
- someone who’s curious about whether there’s significant redundancy in the existing input datasets
- someone who’s curious about whether there is a much smaller subset of images in the input dataset that can quickly help the first few CNN input layers converge before training the middle and output layers on the larger dataset.
Edit: I suspect the real reason they don’t want to share the input dataset is purely because a high-quality annotated dataset is a valuable commodity. While I don’t do ML work myself day-to-day, I do work with a team that does in a very niche field and I can only imagine how much effort they had to go through to get the annotated dataset that they’ve put together. Even just collecting the images for it involved many hours of drone flights in different locales around North America in varying weather and lighting.
Original input set is irrelevant.
You will need some data of your own, of course, to fill in the blanks.
Edit: conversely, however, you can also splice layers out of one model into another model. It'll take some retraining, but this works!
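For what it's worth, a rough sketch of what that kind of splicing can look like with plain PyTorch state dicts (file names and layer-key prefixes are hypothetical, and it assumes the copied layers have matching shapes in both models):

    import torch

    # Hypothetical checkpoints saved as state_dicts.
    donor = torch.load("donor_model.pt", map_location="cpu")
    target = torch.load("target_model.pt", map_location="cpu")

    # Graft layers 10-15 of the donor into the target.
    for name, tensor in donor.items():
        if any(name.startswith(f"layers.{i}.") for i in range(10, 16)):
            assert target[name].shape == tensor.shape
            target[name] = tensor.clone()

    torch.save(target, "spliced_model.pt")
    # As noted above, the spliced model usually needs some additional
    # training so the surrounding layers adapt to the transplanted ones.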
You can do the same with binaries. Can modify those all you want.
Models are the compiler + makefiles. Dataset is the code.
I don't know about the OSI's open source definition [1] in general, but specific licenses might consider makefiles and build scripts to be part of the source code. (For what it's worth, the free software definition from the FSF does consider makefiles and build scripts to be part of the source code [2].)
No, it’s not the same. Yes, you can technically modify binaries, but it’s not at all the preferred way to modify the program.
Congratulations. You've almost finished understanding my comment.
Well you’ve failed and managed to be a dick
>Or published the codebase without the compiler used to make the binary.
A slightly offtopic complaint, but too often I have seen tutorials for open source stuff (coughopenglcough) where they don't provide the proper commands to compile and link everything required to build it. Figuring it out makes the "getting started" portion even more tedious.
Open Source and Free Software weren't formulated to deal with the need for such gargantuan amounts of data and compute.
Can the public compete? What percentage of the technical public could we expect to participate, and how much data, compute, and data quality improvement could they bring to the table? I suspect that large corporations are at least an order of magnitude advantaged economically.
There is a big effort underway in China; Yuanqing Lin gave an interview (in the deep learning course) about work at this scale [1]. They suggest that they will host the resources to store the data, train on it, and make all those algorithms available in China.
The public doesn't have the resources to train the largest state-of-the-art LLMs, but training useful LLMs seems doable. Maybe not for most individuals but certainly for a range of nonprofits, research teams and companies.
Isn't it relatively easy for a smaller model to poke holes in the output of a larger model?
But not nearly as within reach as modifying open source models.
Open Source and Free Software are not about the amount of data.
I think the process of data acquisition isn't so clear-cut. Take CERN as an example: they release loads of data from various experiments under the CC0 license [1]. This isn't just a few small datasets for classroom use; we're talking big-league data, like the entire first run data from LHCb [2].
On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?
Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.
This contrasts with open source models, where you can retrain models using data to get the weights. But getting hold of the data, and the cost of reproducing the weights, is usually prohibitive. I get that CERN's approach might seem to counter this, but remember, they're not releasing the raw data (which is mostly noise) but a more refined version. Otherwise, try downloading several petabytes of raw data; good luck with that. But for training something like an LLM, you might need the whole dataset, which in many cases has its own problems with copyright, etc.
[1] https://opendata.cern.ch/docs/terms-of-use
[2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-data...
You're right that most people have neither the need nor the ability to recreate the data themselves. But the same applies to using open-source software in the first place: most people who use OSS have neither the need nor the ability to compile the software from source themselves. But the whole point of OSS is that that source is available for those who want to use it, whether to study it, to diagnose a bug, or something else. I think the same is true for the LHC's technical details or a model's training data: most people won't recreate it at home, but it's important to make it available, and even someone who can't rebuild the whole thing themselves might spot an important bug or omission by going through the data collection details.
I think the biggest issue is with publishing the datasets. Then people and companies would discover that they're full of their copyrighted content and sue. I wouldn't be surprised if they slurped the whole of Z-Library et al. into their models, or, in Google's case, their entire Google Books dataset.
Somewhat unrelated, but here is a thought experiment...
If a human knows a song "by heart" (imperfectly), it is not considered copyright infringement.
If an LLM knows a song as part of its training data, then it is copyright infringement.
But what if you developed a model with no prepared training data and forced it to learn from its own sensory inputs? Instead of shoveling it bits, you played it this particular song and it (imperfectly) recorded the song with its sensory input device. The same way humans listen to and experience music.
Is the latter learning model infringing on the copyright of the song?
If a person plays a song similarly enough, then it is copyright infringement! Mere knowledge is irrelevant; it is the producing of copies (and also a few related actions) which is prohibited by copyright.
No language model plays a song either, in the narrow sense; they just send a representation of the song to some other program (or human) that might play it.
Mere knowledge is irrelevant only because we don't (yet) have a mechanism to pry open one's brain and inspect the copying of songs within its different parts. Otherwise, mechanistically, besides one using silicon and the other using wetware, they're pretty much doing the same thing.
> send a representation of the song
That is copying. If not the song itself, at the least a close derivative work.
That's my point. If you could pry open a human brain and decipher how it works, you'll see some representation of the song being sent around to various parts of the brain.
This depends: how many times does it need to hear the song to build up a reasonably consistent internal reproduction, and are you paying per stream, buying the input data as CD singles, or just putting the AI in a room with the radio on and waiting for it to take in the playlist a few times?
Let's assume it is in a room with a radio listening to music, and that the AI is "general purpose" meaning that it can also perform other functions. It is not the sole purpose of the AI to do this all day.
I see where you are coming from in trying to identify the source of the copyright. This would be important information if a human wanted to sue another human for re-producing copyright material.
However, does that apply here? Nobody hears a human humming a song and asks if they obtained that music legally. Should it be important to ask an AI that same question if the purpose of listening to the song is not to steal it?
The standards applied are exactly the same regardless of what tools are used. It doesn't matter if you're talking about a dumb AI, a general purpose AI, or a Xerox machine.
If you want an exception to copyright, you're going to want to start looking at a section 107 (of the copyright act) exception: https://www.copyright.gov/title17/92chap1.html#107
The reason someone walking down the street and humming a song is not a violation is because it very clearly meets all of the tests in section 107.
The biggest problem with feeding stuff through a black box like an LLM is it isn't easy for a human to determine how close the result is to the original. An LLM could act like a Xerox machine, and it won't tell you.
I think this conversation has corrected some misgivings I had about the AI copyright argument. My takeaway is:
Possessing copyrighted material is not inherently infringing on a copyright. Disseminating copyrighted material is, unless you meet section 107. AI runs afoul of section 107 when it verbatim shares copyright material from its dataset without attribution.
> AI runs afoul of section 107 when it verbatim shares copyright material from its dataset without attribution.
Technically, the AI doesn't run afoul. The person disseminating the copyrighted material does.
Not humming, but don't we prevent singing songs sometimes? The birthday song was famously held up by IP law for some years, right?
> If a LLM knows a song as part of its training data, then it is copyright infringement.
No it isn't. You can feed whatever you want into your LLM, including copyrighted data. The issues arise when you start reproducing or distributing copyrighted content.
>You can feed whatever you want into your LLM, including copyrighted data.
That's currently the subject of considerable legal debate.
https://edition.cnn.com/2023/07/10/tech/sarah-silverman-open...
That is mostly an issue of the latter, whether the service that Meta/OpenAI offers outputs content that is a violation of copyright. Technically, derivative works are a copyright violation, but if you're not distributing them, you normally have a good fair use argument, and/or nobody knows.
The Open Source Initiative, who maintain the Open Source Definition, have been running a whole series over the past year to collect input from all sorts of stakeholders about what it means for an AI to be open source. I was lucky enough to participate in an afternoon long session with about a hundred other people last year at All Things Open.
https://deepdive.opensource.org/
I encourage you to go check out what's already being done here. I promise it's way more nuanced than anything that is going to fit in a tweet.
Can you summarize? I'm reading https://deepdive.opensource.org/wp-content/uploads/2023/02/D... but it seems to tackle too many questions when I'm really only interested in what criteria to use when deciding whether (for example) Stable Diffusion is open source or not.
Anyway, to go on a tangent: some day, maybe with zero-knowledge proofs, we will be able to prove that a given pretrained model was indeed the result of training on a given dataset, in a way that can be verified vastly more cheaply than training the model itself from scratch. (The same technique could also be applied to other things, like verifying that a binary was compiled from a given source with a given compiler, hopefully verifiable more cheaply than compiling and applying all optimizations from scratch.)
If this ever materializes, then we can just demand proofs.
Here's a study on that
https://montrealethics.ai/experimenting-with-zero-knowledge-...
https://dl.acm.org/doi/10.1145/3576915.3623202
And here is another
Applying the term "open source" to AI models is a bit more nuanced than to software. Many consider reproducibility the bar to get over to earn the label "open source."
For an AI model, that means the model itself, the dataset, and the training recipe (e.g., process and hyperparameters), often also released as source code. With that (and a lot of compute) you can train the model to get the weights.
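As a rough illustration only (every name and number below is made up, not taken from any actual release), a published training recipe can be as small as a config like this, shipped alongside the training code and dataset:

    # Hypothetical training recipe: what you'd need, besides compute,
    # to retrain the model and regenerate (approximately) the weights.
    RECIPE = {
        "model": {"architecture": "decoder-only transformer",
                  "n_layers": 32, "d_model": 4096, "n_heads": 32},
        "data": {"dataset": "example-org/open-corpus-v1",     # published dataset
                 "tokenizer": "example-org/tokenizer-v1",
                 "filtering_scripts": "scripts/filter.py"},
        "training": {"tokens": 1_000_000_000_000,
                     "batch_size_tokens": 4_000_000,
                     "lr": 3e-4, "lr_schedule": "cosine",
                     "optimizer": "AdamW", "seed": 42},
    }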
Same with open-core - if you can't self-host the thing on your own infra then it's not REALLY OSS.
Many companies are using "open source" as marketing rather than actually releasing open source software and models. No data? Not open source. Special license cutting out self-hosting or competitive use? Not open source.
^^^^ Well said
"the project does not benefit from the OSS feedback loop" It's not like you can submit PRs to training data that fixes specific issues the way you can submit bug fixes, so I'm skeptical you would see much of a feedback loop.
"it’s hard to verify that the model has no backdoors (eg sleeper agents)" Again given the size of the datasets and the opaque way training works, I am skeptical that anyone would be able tell if there is a backdoor in the training data.
"impossible to verify the data and content filter and whether they match your company policy" I don't totally know what this means. For one, you can/probably should apply company policies to the model outputs, which you can do without access to training data. Is the idea that every company could/should filter input data and train their own models?
"you are dependent on the company to refresh the model" At the current cost, this is probably already true for most people.
"A true open-source LLM project — where everything is open from the codebase to the data pipeline — could unlock a lot of value, creativity, and improve security." I am overall skeptical that this is true in the case of LLMs. If anything, I think this creates a larger surface for bad actors to attack.
You can grep for bad words. What you can't do (unless hoops are jumped through) is verify that the weights came from the same dataset. You can set the same random seed and still get different results; calculations are not that deterministic. (https://pytorch.org/docs/stable/notes/randomness.html#reprod...).
>I am overall skeptical that this is true in the case of LLMs
This skepticism seems reasonable. EleutherAI have documentation to reproduce training (https://github.com/EleutherAI/pythia#reproducing-training). So far I haven't seen it leading to anything. Lots of arxiv papers I've seen complain about time and budget constraints even for finetunes, let alone pretraining.
The company policy/backdoors issues are possibly like the whole Getty Images debacle. If a company contracts with a provider or just uses a given model themselves, they may have no idea that it's drawing from a ton of copyrighted work, AND with enough of a trail that the infringed party could probably win a suit.
The backdoors I'd think of are sneaky trigger words (maybe not even English) that all of a sudden cause it to emit NSFW outputs. Microsoft's short-lived @TayandYou comes to mind (but I don't think anyone's making that mistake again, where multiple users' sessions are pooled).
I don't agree, and the analogy is poor. One can do the things he lists with a trained model. Having the data is basically a red herring. I wish this got more attention. Open/free software is about exercising freedoms, and they all can be exercised if you've got the model weights and code.
https://www.marble.onl/posts/considerations_for_copyrighting...
But one of the four freedoms is being able to modify/tweak things, including the model. If all you have is the model weights, then you can't easily tweak the model. The model weights are hardly the preferred form for making changes to update the model.
The equivalent would be someone who gives you only the binary of LibreOffice. That's perfectly fine for editing documents and spreadsheets, but suppose you want to fix a bug in LibreOffice? Just having the binary is going to make it quite difficult to fix things.
Similarly, suppose you find that the model has a bias in terms of labeling African Americans as criminals, or women as lousy computer programmers. If all you have is the model weights of the trained model, how easily can you fix the model? And how does that compare with running emacs on the LibreOffice binary?
If all you have are the model weights, you can very easily tweak the model. How else are all these "decensored" Llama 2 models showing up on Hugging Face? There's a lot of value in a trained LLM model itself, and it's 100% a type of openness to release these trained models.
What you can't easily do is retrain from scratch using a heavily modified architecture or different training data preconditioning. So yes, it is valuable to have dataset access and compute to do this and this is the primary type of value for LLM providers. It would be great if this were more open — it would also be great if everybody had a million dollars.
I think it's pretty misguided to put down the first type of value and openness when honestly they're pretty independent, and the second type of value and openness is hard for anybody without millions of dollars to access.
Well, by that argument it's trivially easy to run emacs on a binary and change a pathname --- or wrap a program with another program to "fix a bug". Easy, no?
And yet, the people who insist on having source code so they can edit the program and recompile it have said that for programs, having just the binary isn't good enough.
>suppose you find that the model has a bias in terms of labeling African Americans as criminals; or women as lousy computer programmers. If all you have is the model weights of the trained model, how easily can you fix the model?
That's textbook fine-tuning and is basically trivial. Adding another layer and training that is many orders of magnitude more efficient than retraining the whole model and works ~exactly as well.
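For illustration, a minimal sketch of that pattern (using bert-base-uncased purely as a stand-in for whatever weights-only release you have; the corrective examples and the 2-class head are made up for the example):

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    base = AutoModel.from_pretrained("bert-base-uncased")
    for p in base.parameters():
        p.requires_grad = False              # freeze the released weights

    head = nn.Linear(base.config.hidden_size, 2)   # the new, trainable layer
    opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    # A couple of stand-in corrective examples; in practice, a curated dataset.
    texts = ["example text one", "example text two"]
    labels = torch.tensor([0, 1])

    batch = tok(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = base(**batch).last_hidden_state[:, 0]   # frozen [CLS] features
    loss = loss_fn(head(feats), labels)
    loss.backward()
    opt.step()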
Models are data, not instructions. Analogies to software are actively harmful. We do not fix bugs in models any more than we fix bugs in a JPEG.
Instructions is exactly what weights are. We just have no idea what those instructions are.
You can fine-tune a model; you've got way more power to do so given the trained model than starting from scratch with the raw data.
The next step will be to ask for GPU time, because even with the data, model code, and training framework you may have no resources to train. "The equivalent would be" someone giving you the code, but no access to the mainframe required to compile it. Would that make it not open source? There are other variations, like the original compiler being lost, or current compilers not being backward compatible. Does that make old open source code closed now?
In other words, there should be a reasonable line for when a model is called open source. In the extreme view, it's when the model, the training framework, and the data are all available for free. This would mean an open source model can be trained only on public domain data, which makes the class of open source models very, very limited.
More realistic is to make the code and the weights available, so that with some common knowledge a new model can be trained, or the old one fine-tuned, on available data. Important note: the weights cannot be reproduced exactly even if the original training data is available; it will always be a new model with (slightly) different responses.
Downvoted, hmm... I'll add a bit more then. Sometimes it's even good that a model cannot be easily reproduced. The original developers usually have some skill and responsibility, while 'hackers' don't. It's easy to introduce bias into the data, like removing selected criminal records, and then publish a model with a similar name. That would be confusing; some may mistake the fake one for the real one.
PS: If I ever make my models open, I can't open the data anyway. The license on the images directly prohibits publishing them.
My main concern is that if all you have are weights you're stuck hoping for the benevolence of whatever organization is actually able to train the model with their secret dataset.
When they get bought by Oracle and progress slows to a crawl because it's not profitable enough to interest them, you can't exactly do a LibreOffice. Or they can turn around and say "license change, future versions may not be used for <market that controlling company would like to dominate>" and now you're stuck with whatever old version of the model while they steamroll your project with newer updates.
Open weights are worth nothing in terms of long term security of development, they're a toy that you can play with but you have no assurances of anything for the future.
Everything you just said applies to normal software. Oh no! Big Corp just started a closed fork of their open source codebase! Well, the open source version is still there. The open source community can build off of it.
You may complain that subsequent models are not iterative on the past and so having that old version doesn’t help; but then the data probably changes too so having the old data would largely leave you with the same old model.
When you train an updated model on a new dataset do you really start by deleting all of the data that you collected for training the previous version?
Probably not. But if it’s the new data providing the advantage then you’re not exactly better off having the old data and the model vs. just having the model.
The idea would be that another group could fork it and continue adding to the dataset on their own.
As opposed to not being able to fork it at all because an "open source" model actually just means "you are allowed to use this particular release of our mystery box."
You do not need the original dataset to train the model on an additional dataset
Maybe I misunderstood your original question. To be clear, the process of modifying a trained model does not require the presence of the original data. You said “deleted” which perhaps I misinterpreted. You’re not “instantiating a new model from scratch” when you modify it. You’re continuing to train it where it left off.
What if you want to start with a subset of the original data? Like you've trained a model, and then later said "You know, this new data we're adding is great, but maybe pulling all those comments from 4chan earlier was a mistake," wouldn't that require starting fresh with access to the actual data?
Technically correct but not a very realistic request / approach.
The general idea is to get as good a mastery of language as possible, and then fine-tune to specialize on tasks.
> The “source code” for a work means the preferred form of the work for making modifications to it.
-- gplv3
These AI/ML models are interesting in that the weights are derived from something else (training set), but if you're modifying them you don't need that. Lots of "how to do fine-tuning" tutorials floating around, and they don't need access to the original training set.
Are there any true open-source LLM models, where all the training data is publicly-available (with a compatible license) and the training software can reproduce bit-identical models?
Is training nondeterministic? I know LLM outputs are purposely nondeterministic.
>Are there any true open-source LLM models, where all the training data is publicly-available (with a compatible license)
Mamba has a version trained on the publicly available SlimPajama. RedPajama-INCITE was trained on the non-slimmed version of the dataset (it's only one dataset).
I'm not sure if training scripts are available.
Pythia definitely has scripts. However it was trained on the pile, so you have to find books3 on your own.
Also I believe LLM360 is an explicit attempt to do it with llama.
>Is training nondeterministic?
Correct. The Torch documentation has a section on the reproducibility of training.
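The gist of that section is roughly the following (a best-effort sketch; even with all of these settings, bit-identical results across different GPUs, drivers, or library versions aren't guaranteed):

    import os
    import random
    import numpy as np
    import torch

    # Best-effort determinism, per the PyTorch reproducibility notes.
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)

    # Needed for deterministic cuBLAS matmuls on CUDA >= 10.2; must be set
    # before the CUDA context is created.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    torch.use_deterministic_algorithms(True)   # error out on nondeterministic ops
    torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning
    # DataLoader workers also need seeding (worker_init_fn / generator).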
I think the answer is in the name. The "source" has always been what you need to build the thing. In this context I think we can agree that the thing is the model. Based on that the model is no more open source than a binary program.
I'll venture to say the majority of these "open access models" are meant to serve as advertisements of capabilities (either of hardware, research, or techniques) and nothing more, MPT being one of the most obvious examples.
Many don't offer any information, some do offer information but provide no new techniques and just threw a bunch of compute and some data to make a sub-par model that shows up on a specific leaderboard.
Everyone is trying to save a card up their sleeve so they can sell it. And showing up on scoreboards is a great advertisement.
I like how Debian's machine learning policy says this:
Publish your data and prepare to get vilified by professional complainers because the data doesn't conform to their sensibilities. Lots of downside with very little of the opposite.
No, but it's still insanely useful and free as in beer.
> if you can’t reproduce the model then it’s not truly open-source.
Open-source means open source; it does not make reproducibility guarantees. You get the code and you can use the code. Pushed to the extreme, this is like saying Chromium is not open-source because my 4GB laptop can't compile it.
Getting training code for GPT-4 under MIT would be mostly useless, but it would still be open source.
> Pushed to the extreme this is like saying Chromium is not open-source because my 4GB laptop can't compile it.
Not really, an analog would be if Chromium shipped LLVM IR as its source but no one could get any version of LLVM to output the exact same IR no matter what configurations they tried, and thus any "home grown" Chromium was a little off.
Then what we need isn't open source. It's something else. Maybe called "Open Model?"
Yes that would make sense. I'm in no way arguing that models can't be more open, just that overloading a commonly used expression such as "open-source" and then complaining that projects are not complying with your new definition of open-source just does not make sense to me.
We made our last language model fully reproducible, including all datasets, training details, hyperparameters, etc.: https://stability.wandb.io/stability-llm/stable-lm/reports/S...
95% of the value comes from the model being freely downloadable and analyzable (i.e. not obfuscated/crippled post-hoc). Sure there is some difference, but as a researcher I care far more about open access than about making every "gnuight" on the internet happy that we used the right terminology.
So, we need something like dockerfiles for models?
it's model available, not open source!
Agreed.
I would argue that while technically correct, it is not what most people really care about. What they care about is the following:
1. Can I download it?
2. Can I run it on my hardware?
3. Can I modify it?
4. Can I share my modifications with others?
If the answers to those questions are in the affirmative, then I think most people consider it open enough, and it is a huge step for freedom compared to models such as OpenAI's.
It's a great observation. People simply want their free stuff.
The potential challenge arises in the future. Today's models will probably look weak compared to the models we'll have in 1, 3, or 10 years, which means that today's models will likely be irrelevant a few years hence. Every competitive "open" model today is tied closely to a controlling organization, whether it's Meta, Mistral.AI, TII, 01.AI, etc.
If they simply choose not to publish the next iteration of their model and follow OpenAI's path that's the end of the line.
A truly open model could have some life beyond that of its original developer/organization. Of course it would still take great talent, updated datasets, and serious access to compute to keep a model moving forward and developing but if this is done in the "open" community then we'd have some guarantee for the future.
Imagine if Linux was actually owned by a for-profit corporation and they could simply choose not to release a future version AND it was not possible for another organization to fork and carry on "open" Linux?
Some people want more than that, e.g. they want to fix their printer but the driver is closed source, so they start the GNU project and the broader free software movement, responsible for almost all software innovation for decades.
Number 3 is an issue. If I get a binary of some software with a permissive license, I technically could patch that binary to modify some functionality, but I'd really rather have the source code instead.
Similarly, if I have a LLM model with a permissive license, I technically could fine-tune it to modify its behavior, but for some kinds of modifications I'd really rather re-run (parts of) the training differently.
"Can it be trusted?" is the question many people will care about once awareness of the risks grows. If this question can be answered without publishing the source, fine, but that would probably mean the publisher must be liable for damages from model output.