There are VERY FEW fully open LLMs. Most are the equivalent of source-available in licensing and at best, they're only partially open source because they provide you with the pretrained model.
To be fully open source they need to publish both the model and the training data. What matters is being "fully reproducible", since that is what makes the model trustworthy.
In that vein there's at least one project that's turning out great so far:
Not just LLMs but all kinds of models are equivalent to freeware, i.e. you get the model itself and the other essential bits for it to work. I won't even call it source-available, as there is no source.
Take Redis as an example. I can still go grab the source and compile a binary that works. This doesn't apply to ML models.
Of course one can argue that the training process isn't deterministic, so even with the exact training corpus it can't produce a bit-identical model across multiple runs. However, I would argue that the same corpus provides the chance to train a model of similar or equivalent performance. Hence the openness of the training corpus is an absolute requirement for a model to qualify as FOSS.
I've seen this said multiple times, but I'm not sure where the idea that model training is inherently non-deterministic is coming from. I've trained a few very tiny models deterministically before...
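To make the "tiny models can be trained deterministically" point concrete, here is a minimal sketch in plain Python. It is a toy (a one-variable linear fit with SGD, nothing like an LLM), and the function name, data, and hyperparameters are made up for illustration. The only idea it shows is that when every random choice comes from one seeded generator and the arithmetic runs in a fixed order, two runs give identical weights.

```python
import random


def train(seed, steps=200, lr=0.05):
    """Fit y = w*x + b with plain SGD; all randomness comes from `seed`."""
    rng = random.Random(seed)  # single seeded source of randomness
    # Toy dataset generated from y = 3x + 1 (illustrative only).
    data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
    # Random initialisation, drawn from the seeded generator.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        x, y = rng.choice(data)  # sample order is fixed by the seed
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


run_a = train(seed=42)
run_b = train(seed=42)
print(run_a == run_b)  # same seed, same order of operations
```

In bigger setups, the run-to-run differences people notice usually trace back to things like unseeded RNGs or parallel floating-point reductions whose order varies, rather than to anything inherent in training itself.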
Fortunately, LLMs don't really need to be fully open source to get almost all of the benefits of open source. From a safety and security perspective it's fine because the model weights don't really do anything; all of the actual work is done by the framework code that's running them, and if you can trust that due to it being open source you're 99% of the way there. The LLM model just sits there transforming the input text into the output text.
From a customization standpoint it's a little worse, but we're coming up with a lot of neat tricks for retraining and fine-tuning model weights in powerful ways. The most recent big development I've heard of is abliteration, a technique that lets you isolate a particular "feature" of an LLM and either enhance it or remove it. The first big use of it is to modify various "censored" LLMs to remove their ability to refuse to comply with instructions, so that all those "safe" and "responsible" AIs like Goody-2 can be turned into something that's actually useful. A more fun example is MopeyMule, a LLaMA3 model that has had all of his hope and joy abliterated.
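As I understand it, the core of abliteration is a projection: estimate a direction in activation space that tracks the feature, then subtract each activation's component along that direction. The sketch below shows only that arithmetic on toy vectors. The function names and the difference-of-means way of estimating the direction are my assumptions for illustration, and nothing here touches a real model or library.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def feature_direction(with_feature, without_feature):
    """Unit vector from the mean 'without' activation to the mean 'with' one."""
    diff = [a - b for a, b in zip(mean(with_feature), mean(without_feature))]
    return normalize(diff)


def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    scale = dot(activation, direction)
    return [x - scale * d for x, d in zip(activation, direction)]


# Toy activations (hypothetical): the feature mostly lives on the first axis.
refusing = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
complying = [[0.1, 0.0, 0.1], [-0.1, 0.2, 0.1]]

direction = feature_direction(refusing, complying)
cleaned = ablate([1.5, 0.4, -0.3], direction)
print(abs(dot(cleaned, direction)) < 1e-9)  # nothing left along the direction
```

Enhancing a feature instead of removing it would be the same idea with the component added rather than subtracted.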
So I'm willing to accept open-weight models as being "nearly as good" as a full-blown open source model. I'd like to see full-blown open source models develop more, sure, but I'm not terribly concerned about having to rely on an open-weight model to make an AI system work for the immediate term.
I suppose the importance of the openness of the training data depends on your view of what a model is doing.
If you feel like a model is more like a media file that the model loaders are playing back, where the prompt is more of a type of control over how you access this model, then yes, I suppose from a trustworthiness aspect it doesn't matter much whether the model's training corpus is open.
I see models more in terms of how any other text encoder or serializer would work, if you were, say, manually encoding text. While there is a very low chance of any "malicious code" being executed, the importance is in the fact that you can check the expectations about how your inputs are being encoded against what the provider is telling you.
As an example attack vector, much like any malicious replacement attack: if I were to download a pre-trained model from what I thought was a reputable source, but was man-in-the-middled and given a maliciously trained model, suddenly the system I was relying on that uses that model is compromised in terms of the expected text output. Obviously that exact problem could be fixed with some hash checking, but I hope you see that in some cases even that wouldn't be enough (such as malicious "official" provenance).
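For the man-in-the-middle case specifically, the hash check could look something like the sketch below. It uses only Python's standard `hashlib` and `hmac` modules. The function names and the file name in the usage comment are hypothetical, and it assumes the expected digest reached you over a channel you trust more than the download itself.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so multi-gigabyte weights don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_hex):
    """True only if the file's SHA-256 matches the published digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Usage (hypothetical file name and digest):
# if not verify_model("model.safetensors", published_digest):
#     raise SystemExit("checksum mismatch, refusing to load the model")
```

This only proves you received the file the publisher meant to ship. It says nothing about how that file was trained, which is the case a hash can't cover.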
As these models become more prevalent, being able to guarantee integrity will become more and more of an issue.
I'm not sure where you get that idea. Model training isn't inherently non-deterministic. Making fully reproducible models is 360ai's apparent entire modus operandi.
If a layman may ask, what are folks even using AI/LLMs for mostly? Aside from playing around with some for 10-15 mins out of simple curiosity, I don't have a practical use for platforms like ChatGPT. I'm just wondering what the average tech enthusiast uses these for, outside of academia.
I teach language. I get paid for my time in front of students, not the time it takes to prepare their lessons and the materials. I use AI to quickly reference grammar rules, to fabricate example dialogs in specific scenarios to practice, and to suggest activities to do in class to practice the target grammar. I never do exactly as it says, just take it as kind of a source of suggestions for me to build from.
That sounds like a time saver for sure. I imagine that some of those elements (grammar rules) are widely available everywhere, while others (practice dialogues, activity suggestions focused on the use of language) would require a fairly specific training model.
A friend of mine and I have gotten used to using it during our conversations. We do fast fact-checking or find a good first opinion regarding silly topics. We often find it faster than digging through search-engine results and interpreting scattered information. We have used it for thought experiments, intuitive or ELI5 explanations of topics that we don’t really know about, finding peer-reviewed sources for whatever it is that we’re interested in, or asking questions that would be harder to operationalize into effective search-engine queries than to ask in natural language. We always always ask for citations and links, so that we can discard hallucinations.
Thanks for sharing! I'm probably too set in my ways to ever utilize AI for things like this. I never use virtual assistants like Alexa or Google either, as I like to vet and interpret the source of information myself. Having the citations would be handy, but ultimately I'd want to read them myself so the AI/VA just becomes an added step.
we use it to classify data that is needed to be sent to one of three endpoints. chatgpt tells our tool where it belongs. there are probably more practical ways to do this, but the customer wanted AI in his product so here we are 🤷
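A rough sketch of what that kind of routing might look like, with the model call left as a pluggable function so no particular API is assumed. The endpoint names, URLs, and `route` helper are all invented for illustration. The one design point is that the model's answer gets checked against the known labels before anything is sent.

```python
# Hypothetical endpoints; the real three are not named in the thread.
ENDPOINTS = {
    "billing": "https://example.invalid/billing",
    "support": "https://example.invalid/support",
    "sales": "https://example.invalid/sales",
}


def route(record, classify, default="support"):
    """Pick an endpoint for `record` using `classify`, a text-to-label callable.

    `classify` stands in for whatever asks the LLM; its output is treated as
    untrusted text and falls back to `default` if it isn't a known label.
    """
    label = classify(record).strip().lower()
    if label not in ENDPOINTS:
        label = default
    return ENDPOINTS[label]


# Stub classifiers standing in for the model call.
print(route("Invoice #12 is wrong", lambda text: " Billing\n"))
print(route("hello?", lambda text: "I think this is probably sales-ish"))
```

The first call resolves to the billing URL, and the second falls back to the default because the reply isn't one of the three labels.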
What's FOSS-AI? A model everyone can download and use for free? Or, in the OSS spirit, that everything needs to be open and without discrimination of use, aka an OSS training data corpus and no AUP attached?
Or you mean the inference engine running those models?
I'm just convinced all of y'all asking about this are in a huge circle jerk that never ends, but refuses to understand how it all works.
A model is a model. It's a simplified way of narrowing down thresholds of confidences. It's a pretty basic sorting algorithm that runs super fast on accelerated hardware.
You people seem to think it's like fucking magic that steals your soul.
Don't send information over the wire, and you're golden. Learn how it works, and stop asking dumb questions like this is all brand new, PLEASE.
There is a difference between a general scare about the AI buzzword and legitimate distrust in online services which are closely connected to American spying institutions (regardless of whether they are AI or not).
If my calorie tracker app appointed a (former) NSA official to their board, I would be looking for alternatives too. This is not about AI, this is about a company with huge sets of private data being closely interconnected with American spy institutions.
Sad that you don't seem to be able to distinguish between legitimate security questions and badly informed hypes/scares as soon as a buzzword like AI occurs.
My documented process https://fabien.benetou.fr/Content/SelfHostingArtificialIntelligence but honestly I just tinker with this. Most of that isn't useful IMHO except some pieces, e.g STT/TTS, from time to time. The LLM aspect itself is too unreliable, and I do like 2 relatively recent papers on the topic, namely :
which are respectively saying that the long tail makes it practically impossible to train AI to be correct in rare cases, and that "hallucinations" is a misnomer for marketing purposes that should be replaced by "bullshit", i.e. text used to convince people without caring for veracity.
Still, despite all this criticism it is a very popular topic, hyped up to be the "future" of computing. Consequently I did want to both try and help others to do so rather than imagine that it was restricted to a kind of "elite". I try to keep the page up to date but so far, to be honest, I do it mostly defensively, to be able to genuinely criticize because I did take the time to try, not reject it wholesale.
PS: I do also try the state of the art, both closed and open-source, via APIs, e.g. OpenAI or Mistral, but only for evaluation purposes, not as tools that are part of my daily usage.