What irks me most about this claim from OpenAI and others in the AI industry is that it isn't based on any real evidence. Nobody has tested the counterfactual they claim wouldn't work, yet the experiments that came closest (the first StarCoder LLM and the CommonCanvas text-to-image model) suggest that it would in fact have been possible to produce something very nearly as useful, and in some ways better, with a far more restrained approach to training data curation than scraping outbound Reddit links.
All that aside, copyright clearly isn't the right framework for understanding why what OpenAI does bothers people so much. It's really about "data dignity", a relatively new moral principle not yet protected by any single law. Most people feel that they should have control over what data is gathered about their activities online, and over what is done with that data after it's been collected. Even if they publish or post something under a Creative Commons license that permits derivative uses of their work, they'll still get upset if it's used as an input to machine learning. That's true even when the resulting generative models aren't built for commercial reasons, but only for personal or educational purposes that clearly constitute fair use. I'm not saying that OpenAI's use of copyrighted work is fair; I'm saying that even in cases where the use is clearly fair, there's still a perceived moral injury. So I don't think it's wise to lean too heavily on copyright law if we want to find a path forward that feels just.
I know that on Lemmy you'll get the impression that engineers and scientists are all just bumbling fools who are intellectually outclassed by any high schooler with internet access. But how likely is that, really?
Scaling laws are disputed, but if an effort has in fact already been undertaken to train a general-purpose LLM using only permissively licensed data, great! Can you send me the checkpoint on Hugging Face, a GitHub page hosting the relevant code, or even a paper or blog post about it? I've been looking and haven't found anything like that yet.
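(For anyone else looking, here's a rough sketch of how I've been searching, using the `huggingface_hub` client. One caveat: the Hub's license tag describes the license on the model weights, not the provenance of the training data, so at best this narrows the search.)

```python
# Rough sketch: searching the Hugging Face Hub for permissively licensed
# text-generation models. License is exposed as a tag like "license:apache-2.0".
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    filter=["text-generation", "license:apache-2.0"],
    sort="downloads",
    direction=-1,  # most-downloaded first
    limit=10,
)
for m in models:
    print(m.id)
```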
There is not enough permissively licensed text to train a large model, and what does exist lacks diversity: Wikipedia, government documents, Stack Overflow, century-old public-domain works, and so on. An LLM trained only on that is not likely to be called "general purpose", because of scaling laws. Small models like that are sometimes trained for research purposes, but I don't have a link ready; they are not something you'd actually use. Perhaps you could look at Microsoft's Phi series of models. Those are trained on synthetic data, though that's probably not what you are looking for.
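To make the scaling-laws point concrete, the Chinchilla paper (Hoffmann et al., 2022) fits training loss as a function of parameter count N and training tokens D. Roughly (constants quoted approximately from the paper's fit):

```latex
% Chinchilla parametric loss fit (Hoffmann et al., 2022), approximately:
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \quad \beta \approx 0.28
```

However large you make N, the D term puts a floor under the loss, so a corpus limited to a handful of permissively licensed sources caps how capable the model can get, no matter the parameter budget.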
Apparently, this is about creating a new kind of intellectual property: a generalized and hypercharged version of copyright that applies to all sorts of data.
Maybe this is a touchy subject, but to me this seems like an extremely right-wing approach: turn anything into property, and the magic market will turn everything into rainbows and unicorns. Maybe you feel differently about this?
Regardless of classification, such a policy is obviously devastating to society. Of course, your argument does not consider society, only the feelings of some individuals. Feelings are valid, but one has to consider the effect of such a policy, too. Not every impulse should be given power, especially where feelings are strongly influenced by culture and circumstance. For example, people in the US and the UK have, on the whole, rather different feelings about being ruled by a king. I don't feel that I should be able to control what other people do with data, maybe because I'm a bit older and was socialized into that whole information-wants-to-be-free culture. I don't even remember having a libertarian phase.
I'm not proposing anything new, and I'm not here to "pitch" anything to you. Read Jaron Lanier's writings, e.g. "Who Owns the Future", or watch a talk or interview he's given, if you're interested in a sales pitch for why data dignity is a problem worth addressing. I admire him greatly and agree with many of his observations, but I'm not sure about his proposed solution (mainly a system of micro-payments to the creators of the data used by tech companies). I'm just here to point out that copyright infringement isn't, in fact, the main or the only thing bothering so many people about generative AI, so settling copyright disputes isn't going to stop all those people from being upset about it.
As to your comments about "feelings", I would turn it around and ask: why is it important to society that we prioritize the feelings (mainly greed) of the few tech executives and engineers who think they will profit from such practices over those of the many, many people who object to them?
@General_Effort @mm_maybe
Maybe this will finally be the push we, as a society, need to realize that "intellectual property" is a legal fiction that we are all better off without?
Yeah, I would agree that there's something really off about a framework that just doesn't fit most people's sense of justice. A synth YouTuber, of all people, made a video about this that I liked, though his proposed solution is about as workable as Jaron Lanier's: https://youtu.be/PJSTFzhs1O4?si=ZvY9yfOuIJI7CVUk
Again, I don't have a proposal of my own. I've just decided for myself that if I'm going to do anything money-making with LLMs in my practice as a professional data scientist, I'll rely on StarCoder as my base model instead of the others, particularly because a lot of my clients are in the public sector and face public scrutiny.
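In case it's useful to anyone, here's a minimal sketch of what that looks like with the Hugging Face transformers library. It assumes you've accepted the BigCode OpenRAIL-M license for `bigcode/starcoder` on the Hub and are logged in (e.g. via `huggingface-cli login`); the prompt is just an illustrative example:

```python
# Minimal sketch: loading StarCoder as a base model for code generation.
# Assumes the BigCode OpenRAIL-M license has been accepted on the Hub
# and that you're authenticated (e.g. `huggingface-cli login`).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # halves memory relative to fp32
    device_map="auto",          # needs the `accelerate` package
)

# Illustrative completion; StarCoder is a code model, so prompt it with code.
prompt = "def fibonacci(n: int) -> int:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```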