As it turns out, it’s impossible to remove a user’s data from a trained A.I. model. Deleting the model entirely is also difficult—and there’s little regulation to enforce either option.
I'm rather curious to see how the EU's privacy laws are going to handle this.
(Original article is from Fortune, but Yahoo Finance doesn't have a paywall)
"AI model unlearning" is the equivalent of saying "removing a specific feature from a compiled binary executable". So, yeah, basically not feasible.
But the solution is painfully easy: you remove the data from your training set (i.e., the source code), and retrain your model (recompile the executable).
Yes, it may cost you a lot of time and money to accomplish this, but such are the consequences of breaking the law. Maybe be extra careful about obeying laws going forward, eh?
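A minimal sketch of the "remove it from the training set and retrain" idea, under loud assumptions: the records, the user names, and the `train` function are all hypothetical, and `train` is a trivial one-weight fit standing in for a real (and far more expensive) training run.

```python
# Toy illustration: "unlearning" by filtering the training set and retraining.
# The records, user ids and train() are hypothetical stand-ins, not a real pipeline.

def train(records):
    """'Train' a trivial one-weight model: the mean of the y/x ratios."""
    ratios = [y / x for _, x, y in records]
    return sum(ratios) / len(ratios)

# (user_id, x, y) training records
dataset = [
    ("alice", 1.0, 2.0),
    ("bob", 2.0, 4.0),
    ("carol", 1.0, 5.0),  # carol later asks to be forgotten
    ("dave", 4.0, 8.0),
]

model_v1 = train(dataset)

# Erasure request: drop carol's rows from the source data, then retrain from scratch.
filtered = [rec for rec in dataset if rec[0] != "carol"]
model_v2 = train(filtered)

print(model_v1, model_v2)
```

The filtering step is a one-liner; the whole cost sits in calling `train` again, which is the expense the rest of the thread argues about.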
removing a specific feature from a compiled binary executable
That's actually very feasible. Compiled binaries translate directly to assembly, which is taught to most (all?) comp sci undergrads. When the binary is compiled by a standard compiler the translated assembly is very easy to understand, and for software that has protections/obfuscations like DRM and viruses there are reverse engineering tools like IDA Pro.
Retraining the model is incredibly expensive. That basically means not training the model with any user data, even if it slips in accidentally, through someone sabotaging the training data, or even with consent (since consent can be revoked).
Anything else is going to bite US in the ass. Asking for consent kills any kind of open source development. It puts AI solely in the hands of like three companies. Our economy is going to be very AI focused in the future, they would literally own all of us.
You aren't getting paid either way, so we might as well all enjoy the fruits of humanity's labor freely instead of being forced into a subscription model of it.
"Most of the data used by large companies isn’t available to the majority of people. We think that stifles innovation."
Yes, crowd sourcing is a solution, but it is only really possible if you are able to reach many people like Mozilla can. They only have 20k hours to date. Tortoise needed 50k hours and was made by one guy who open sourced it. He would not have been able to build it without scraping YouTube.
Crowd sourcing also becomes much more complicated for LLMs or if you are making models in other languages.
Asking for consent doesn't kill open source development. Consent is the very reason we have licensed code. MIT, Apache, GPL3... And development is done and code is reused in accordance with those licenses.
Making LLMs requires a stupid amount of data, much more than what is found in the creative commons. Same goes for image gen. Unless you have been accumulating data since forever through tricking people when they sign up to your website or app, you can't train anything without scraping most of the data.
It has nothing to do with licensing but the fact that there just isn't enough "free-use" data.
Yeah, there's no point in the model where you can pinpoint that data. It's like asking a brain surgeon to slice your brain to make you forget something. Sure, he could do it, but don't be surprised if you can't speak or remember your wife when you wake up...
The only option is to relearn from the new filtered training data, or filter it on the way out, which is likely easier said than done because it has no real context of what it's doing.
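A rough sketch of what "filter it on the way out" could look like, assuming a plain blocklist of exact strings. The names in `BLOCKED_TERMS` and the `redact` helper are made up for illustration; this is not how any particular product does it.

```python
import re

# Hypothetical blocklist of strings that must never appear in generated text.
BLOCKED_TERMS = ["Jane Q. Example", "jane@example.com"]

def redact(model_output, blocked=BLOCKED_TERMS):
    """Replace every blocked term in the generated text with a placeholder."""
    for term in blocked:
        # re.escape makes the term match literally; matching is case-insensitive.
        model_output = re.sub(re.escape(term), "[REDACTED]", model_output, flags=re.IGNORECASE)
    return model_output

caught = redact("Contact Jane Q. Example at JANE@example.com.")
missed = redact("Contact J. Q. Example from Springfield.")
print(caught)
print(missed)
```

The second call comes back unchanged: a string filter only catches the spellings it was given, which is the "no real context" problem in miniature. The model still contains whatever it learned; the filter only hides some of its outputs.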
Patches today patch source code. The kind of binary patching you talk about only works with deterministic builds, which sadly there's not enough of out there.
I don't see how that's related at all. Having deterministic builds only matters if you're building a binary from source, if you're working with some distributed binary you'll be applying the patch to identical binaries anyway. And if a new binary is distributed, that's going to be because something in the source was changed; deterministic builds will still give you a different binary if the source changes.
Binary patching is still common, both for getting around DRM and for software updates.
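For anyone who hasn't seen one, a binary patch can be as small as this. A minimal sketch, assuming a made-up byte string in place of a real executable; `apply_patch` is a hypothetical helper, and the bytes are only meant to look like x86 (0x74 is a short conditional jump, 0x90 is a NOP).

```python
def apply_patch(binary, offset, expected, replacement):
    """Overwrite bytes at `offset`, but only if the original bytes match `expected`."""
    if len(expected) != len(replacement):
        raise ValueError("patch must not change the file size")
    if binary[offset:offset + len(expected)] != expected:
        raise ValueError("binary does not match the version this patch targets")
    patched = bytearray(binary)
    patched[offset:offset + len(replacement)] = replacement
    return bytes(patched)

# Made-up bytes standing in for a distributed executable.
original = bytes.fromhex("554889e5 7405 b801000000 c3")

# Replace the two-byte conditional jump at offset 4 with two NOPs.
patched = apply_patch(original, 4, bytes.fromhex("7405"), bytes.fromhex("9090"))
print(patched.hex())
```

The `expected` check is the reason the patch only needs identical distributed binaries, not deterministic builds: it refuses to touch a file that isn't the exact version it was written against.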
Much like DLLs exist for compiled binary executables, could we not have modular AI training data? Then only a small chunk would need to be relearned at a time.
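One way to picture the modular idea is to split the data into shards, train a separate sub-model per shard, and combine their outputs. A toy sketch under heavy assumptions: `train` is a hypothetical stand-in that just averages labels, and the shards and user names are invented. Whether this works well for large networks is a separate question that the sketch does not answer.

```python
def train(records):
    """Hypothetical stand-in for an expensive training run: the mean label."""
    return sum(label for _, label in records) / len(records)

def forget(shards, sub_models, user_id):
    """Drop one user's records and retrain only the shards that held them.

    Returns how many shards had to be retrained. Assumes no shard ends up empty.
    """
    retrained = 0
    for i, shard in enumerate(shards):
        if any(uid == user_id for uid, _ in shard):
            shards[i] = [rec for rec in shard if rec[0] != user_id]
            sub_models[i] = train(shards[i])
            retrained += 1
    return retrained

# (user_id, label) records, split into three independent shards.
shards = [
    [("alice", 1.0), ("bob", 3.0)],
    [("carol", 10.0), ("dave", 2.0)],
    [("erin", 4.0), ("frank", 6.0)],
]
sub_models = [train(shard) for shard in shards]

retrained = forget(shards, sub_models, "carol")
ensemble_output = sum(sub_models) / len(sub_models)  # combine by averaging
print(retrained, sub_models, ensemble_output)
```

Only one of the three sub-models is retrained here. The trade-off is that each sub-model sees less data, so the combined model may be weaker than one trained on everything at once.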
Whether or not something is in the training set of a neural network shows up as different values for non-integer factors all over the network and, worse, those differences are just as likely to be tiny as they are to be massive.
Or to give you a decent metaphor for it, "it would be like trying to remove a specific egg from a bowl of scrambled eggs".
A trained AI model is a set of weights applied to a given neural network. The difference between two models, one trained without the key data and one trained with it, can be computed, and a tool can be created to generate a transformation from model A to model B, or even a good approximation of model B trained with another AI.
I don't doubt that mathematically, but practically that sounds like it would be functionally equivalent to just retraining the model. Like if it were more efficient to just calculate the model weights based on input data, that's what we would do, there would be no need to go through the training process. We could just start with a completely untrained model and calculate the difference between that model and one that was trained with all the data. The more I think about it the more I doubt that mathematically. The feasibility of this would depend heavily on the details of the model and how it was trained. Lots of times the order in which the data was presented during training has an impact on the final result, so there's no guarantee your subtraction would achieve the same or even similar result as retraining without the specified data. Maybe you can reference some papers on the topic.
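Both points (the weight difference and the order dependence) can be seen in a toy setting. A small sketch, assuming a two-weight linear model trained with plain per-example SGD; the data and the `sgd` helper are made up, and nothing here says how large these effects are in a real network.

```python
def sgd(records, lr=0.1):
    """One pass of per-example SGD on squared error for y = w[0]*x0 + w[1]*x1."""
    w = [0.0, 0.0]
    for (x0, x1), y in records:
        err = w[0] * x0 + w[1] * x1 - y
        w[0] -= lr * err * x0
        w[1] -= lr * err * x1
    return w

# Three made-up training records; the middle one is the data to be removed.
data = [((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0), ((0.0, 1.0), 2.0)]

with_all = sgd(data)
without_middle = sgd([data[0], data[2]])
reordered = sgd(list(reversed(data)))

# The "model A to model B" difference: it can only be computed after training both.
delta = [a - b for a, b in zip(with_all, without_middle)]

print(with_all, without_middle, delta)
print(reordered)
```

Computing `delta` required the retrained model in the first place, and the removed record moved both weights, not one. Feeding the same three records in reverse order also ends at different weights, so "the model trained on this data" is not a single well-defined target.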
You are correct. It would be heinously expensive to "remove" training data. Even training a very rudimentary model can take hours on a high-end tensor processor.
I have a bachelor's in computer science specialised in data engineering and data science, with a master's in data science, and I have worked for some years in computer vision, training and tweaking models.
Currently specialised in data engineering, but I'd wager I do know what I'm talking about.
People who "work with AI" most of the time don't know shit about how it internally works, so I don't know if that's a label I'd even use to give an informed opinion about the matter.
It takes so much money to retrain models, though... like the entire cost all over again... and what if they find something else?
Crazy how murky the legalities are here... just no case law to base anything on, really.
For people who don't know how machine learning works, at a very high level:
Basically, every input the AI is trained on or "sees" changes a set of weights (floating-point numbers). Once the weights are changed, you can't remove that input and change the weights back to what they were; you can only keep changing them on new input.
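A tiny worked example of why you can't just subtract an input back out, assuming a one-weight model and plain SGD. The numbers and the `step` helper are invented for illustration.

```python
LR = 0.1

def step(w, x, y):
    """One SGD step on squared error for the one-weight model y = w * x."""
    grad = (w * x - y) * x
    return w - LR * grad

a = (1.0, 1.0)  # the input we later want to "remove"
b = (2.0, 1.0)  # an input seen after it

w_after_a = step(0.0, *a)
w_after_ab = step(w_after_a, *b)

# Naive "undo": subtract the change that input a caused when it was seen.
update_from_a = w_after_a - 0.0
naive_undo = w_after_ab - update_from_a

# What we actually wanted: a model that never saw a at all.
w_only_b = step(0.0, *b)

print(w_after_ab, naive_undo, w_only_b)
```

The naive undo and the never-saw-it model disagree because the update from `b` was computed from weights that `a` had already shifted. Every later update inherits a trace of every earlier one.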
So we just let them break the law without penalty because it's hard and costly to redo the work that already broke the law? Nah, they can put time and money towards safeguards to prevent themselves from breaking the law if they want to try to make money off of this stuff.
No one has established that they've broken the law in any way, though.
Authors are upset but it's unclear if they can prove they were damaged in some way or that the companies in question are even liable for anything.
Remember, the burden of proof is on the plaintiff, not these companies, if a suit is brought.
I just skimmed through the "right to be forgotten" site from the EU and there is nothing specifically mentioned about "search engines" or at least not from what I can find.
Basically, ANY website that has users from the EU needs to comply with the GDPR, which means that you have the "right to be forgotten" when:
The personal data is no longer necessary for the purpose an organization originally collected or processed it.
An organization is relying on an individual’s consent as the lawful basis for processing the data and that individual withdraws their consent.
An organization is relying on legitimate interests as its justification for processing an individual’s data, the individual objects to this processing, and there is no overriding legitimate interest for the organization to continue with the processing.
An organization is processing personal data for direct marketing purposes and the individual objects to this processing.
An organization processed an individual’s personal data unlawfully.
An organization must erase personal data in order to comply with a legal ruling or obligation.
An organization has processed a child’s personal data to offer their information society services.
However, you cannot ask for deletion if the following reasons apply:
The data is being used to exercise the right of freedom of expression and information.
The data is being used to comply with a legal ruling or obligation.
The data is being used to perform a task that is being carried out in the public interest or when exercising an organization’s official authority.
The data being processed is necessary for public health purposes and serves in the public interest.
The data being processed is necessary to perform preventative or occupational medicine. This only applies when the data is being processed by a health professional who is subject to a legal obligation of professional secrecy.
The data represents important information that serves the public interest, scientific research, historical research, or statistical purposes and where erasure of the data would be likely to impair or halt progress towards the achievement that was the goal of the processing.
The data is being used for the establishment of a legal defense or in the exercise of other legal claims.
The GDPR is also not particularly specific, and pretty vague from what I have read, so it will also apply to AI and not just "google searches".
That means that anyone who gathered the data, with or without the consent of the user, will have to comply with it if they are serving the application to EU users. This also includes the right to be forgotten, so every company has to have the necessary features to delete the data.
And the Regulation (it is NOT a law) is already a few years old now, and it requires the company to delete your data "without undue delay". So the arguments "but we can't" or "it takes too much time" aren't really valid here; this should have been considered when the application was written/designed.
However, as stated in the contra points above, someone might argue that AI like ChatGPT could operate in the interest of research or the public interest and that a deletion of that data or data set could "impair or halt progress to that achievement that was the goal".
That means that, from my knowledge right now, it is pretty clear. If someone has private data about you, you can request that it be deleted, and that should be done without delay, which seems to mean that the company has one month to comply with that request.
But, these are just the things I could gather from the official websites.
The "safeguard" would be "no PII in training data, ever". Which is fine by me, but that's what it really means. Retraining on a large dataset every time a GDPR request comes in is completely infeasible.