A social network founded by a former OpenAI employee was caught importing public posts from Mastodon...and running AI analysis to add tags to them.
Maven, a new social network backed by OpenAI's Sam Altman, found itself in a controversy today when it imported a huge number of posts and profiles from the Fediverse, and then ran AI analysis to alter the content.
The wildest part is that he's surprised that Mastodon peeps would react negatively to their posts being scraped without consent or even notification and fed into an AI model. Like, are you for real, dude? Have you spent more than 4 seconds on Mastodon and noticed their (our?) general attitude towards AI? Come the hell on...
People can complain, but the Fediverse is built to make consuming users’ data easy. If you don’t want AI using your data, don’t put it on such an easily “scrapable” network.
It sounds like they weren't "being fed into an AI model" as in being used as training material, they were just being evaluated by an AI model. However...
Have you spent more than 4 seconds on Mastodon and noticed their (our?) general attitude towards AI?
Yeah, the general attitude of wild witch-hunts and instant zero-to-11 rage at the slightest mention of it. Doesn't matter what you're actually doing with AI, the moment the mob thinks they scent blood the avalanche is rolling.
It sounds like Maven wants to play nice, but if the "general attitude" means that playing nice is impossible why should they even bother to try?
He's not surprised. He's acting surprised because he got caught. It's pretty standard for these jerkass tech bros. "Move fast, break things" is code for "break laws, be unethical" - and as I think we've all seen, if you do it often and fast enough you can keep way ahead of any kind of accountability, because everybody else is trying to play catch-up while the last thing has already filtered out of the news cycle.
I'm surprised as well. We put our posts up for anyone to replicate and republish, yet we still get mad when somebody replicates and republishes them. It does not make sense. ActivityPub is an open network with zero privacy expectations.
I was confused about what they were trying to accomplish, and even after reading the article I am still somewhat confused.
Instead, when a user posts something, the algorithm automatically reads the content and tags it with relevant interests so it shows up on those pages. Users can turn up the serendipity slider to branch out beyond their stated interests, and the algorithm running the platform connects users with related interests.
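As a rough illustration of what that description implies (this is a hypothetical sketch — the function names and matching logic are my assumptions, not Maven's actual code), a tag-based feed with a "serendipity slider" could look like:

```python
import random

def rank_posts(posts, user_interests, serendipity=0.2):
    """Build a feed: mostly posts whose tags overlap the user's
    interests, plus a `serendipity` fraction of posts from
    unrelated tags. Purely illustrative, not Maven's algorithm."""
    matching = [p for p in posts if p["tags"] & user_interests]
    other = [p for p in posts if not (p["tags"] & user_interests)]
    n_other = int(len(matching) * serendipity)
    feed = matching + random.sample(other, min(n_other, len(other)))
    random.shuffle(feed)
    return feed

posts = [
    {"id": 1, "tags": {"ai", "ethics"}},
    {"id": 2, "tags": {"gardening"}},
    {"id": 3, "tags": {"fediverse"}},
]
# Slider all the way down: only posts matching stated interests.
feed = rank_posts(posts, user_interests={"ai", "fediverse"}, serendipity=0.0)
```

Turning the slider up just raises the share of off-interest posts mixed in, which matches the "branch out beyond their stated interests" behavior described above.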
Perhaps I'm a minority, but I don't see myself getting much utility out of this. I already know what my interests are, and don't have much interest in growing them algorithmically. If a topic is really interesting, I'll eventually find out about it via an actual human.
You're on slrpnk.net, I assume it's not implementing any of this stuff. As long as you don't sign up for Maven I don't see how this is going to affect you.
So you don't ever want to learn about new things? And even if you did, you wouldn't want those new things to be efficiently suggested to you, instead of bundled with a bunch of other boring crap?
Also, what you're asking for is what the tool seems to do. You would put the slider all the way to one side to avoid having new stuff suggested. Existing social media platforms often just shove stuff at you endlessly.
And it's also damning for private messaging on Mastodon.
I once read vague complaints about it being a rushed implementation. While I won't trust those without evidence, I for sure wouldn't trust Mastodon with my PMs. At least, not until how this was allowed to happen is figured out and fixed if necessary.
P.S. I'm still not sure I believe in PMs in the fediverse. If I need to share something and care about keeping it private, I'd rather move the conversation elsewhere.
So far we only have a corpo fedi-twitter in the form of Threads. In that case, a user on a non-corpo instance has to specifically follow someone before their content is federated, so this sounds like a bit of an overblown issue.
Unfortunately a lot of people think it's to do with scraping as well. The amount of "defederate Threads so that they can't scrape my data" posts I saw was about 50-50 with the sensible takes.
Oh shit, the persona guy was right! We should all be adding licenses to our comments, so they couldn't legally be used to train models that are then used for commercial purposes.
The easiest way is a sitewide NoAI meta tag, since it’s the current standard. Researchers are much more likely to respect a common standard and extremely unlikely to respect a single user’s personal solution adding a link to their comments.
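For reference, the NoAI convention mentioned above is usually expressed as a robots meta tag (the `noai`/`noimageai` values were popularized by DeviantArt). Note it's purely advisory — crawlers comply voluntarily:

```html
<!-- Sitewide opt-out hint for AI crawlers; advisory only -->
<meta name="robots" content="noai, noimageai">
```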
I feel like the bad thing about this is that, while researchers will mostly respect this, companies who want to make money out of the data will still secretly keep using it anyway. I am more OK with the data being used for non-profit research than for making money, but this would likely have the opposite effect.
Yeah, they were. I hope more people start doing it; even if it doesn't legally hold water, it's still a good way to show that Fediverse users won't stand for that.
Why do you think it won't hold water legally?
There's a case going on right now against GitHub Copilot for scraping GPL-licensed code, even spitting it back out verbatim, and not making "open" AI actually open.
Creative Commons is not a joke licence. It actually is used by artists, authors, and other creative types.
Imagine Maven or another company doing the same shit they just did, and it coming to light that there was a bunch of noncommercially licensed content in there. The authors could band together for a class action lawsuit and sue their asses. Given the reaction of users here and on Mastodon, I wouldn't even be surprised if it did happen.
It's especially relevant for these kinds of dumb cases where they simply copy content wholesale and boast about it. With more people licensing their content as non-commercial, the "hot water" these companies get into could be not just trivial but actually legal.
Would be great if web and mobile clients supported signatures or a "licence" field from which signatures were generated. Even better would be if people smarter than me added a feature to poison AI training data. This could also be done by a signature or some other method.
I don't know; AFAIK, Reddit successfully argued that they own Wallstreetbets' trademarks in court. That might void all of these licenses depending on the ToS of the instance being used.
Am I misunderstanding this, or did they just fuck up the integration so it's one way with a plan to make it two ways after, and the AI alteration is just sentiment analysis on whatever they took?
In addition to pulling in posts, the import process seems to be running AI sentiment analysis to add tags and relational data after content reaches Maven’s servers. This is a core part of Maven’s product: instead of follows or likes, a model trains itself on its own data in an attempt to surface unique content algorithmically.
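Reduced to a hypothetical sketch, the import pipeline described above would look something like this (the function names and the toy keyword-matching stand-in for the AI step are my assumptions; the article doesn't document Maven's actual implementation):

```python
def analyze_post(text, known_topics):
    """Toy stand-in for the AI analysis step: derive topic tags
    from post text. A real system would use an ML model, not
    keyword matching."""
    words = set(text.lower().split())
    return {t for t in known_topics if t in words}

def import_post(post, known_topics):
    """Mirror a fetched ActivityPub post and attach derived tags,
    as the article describes happening after content reaches
    Maven's servers."""
    return {
        "author": post["author"],
        "content": post["content"],
        "tags": analyze_post(post["content"], known_topics),
    }

mirrored = import_post(
    {"author": "@user@mastodon.example", "content": "I love gardening and solarpunk"},
    known_topics={"gardening", "solarpunk", "ai"},
)
# mirrored["tags"] == {"gardening", "solarpunk"}
```

The controversial part isn't the mirroring itself — that's how federation works — but that the tagging and relational analysis happen after the fact, on content the original posters never sent to Maven.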
But of course, that news doesn't give the reader those lovely rage endorphins or draw clicks.
This is the Fediverse, having the content we post get spread around to other servers is the whole point of all this. Is this a face-eating leopard situation? People are genuinely surprised and upset that the stuff we post here is ending up being shown in other places?
There is one thing I see here that raises my eyebrows:
Even more shocking is the revelation that somehow, even private DMs from Mastodon were mirrored on their public site and searchable. How this is even possible is beyond me, as DMs are ostensibly only between two parties, and the message itself was sent between two hackers.town users.
But that sounds to me like a hackers.town problem, it shouldn't be sending out private DMs to begin with.
They kind of fucked up everything in approaching this by not talking to the community and collecting feedback, making dumb assumptions about how the integration was supposed to work, leaking private posts, running everything through their AI system, and neglecting to represent the remote content as having come from anywhere else.
The other thing is that Maven's whole concept is training an AI over and over again on the platform's posts. Ostensibly, this could mean that a lot of Fediverse content ended up in the training data.
Genuine question, do instances not have a GPL license on their content? With that license, anyone can use all the data but only for open source software.
Instances don't actually own the copyright to comments. The poster owns the copyright and licenses it to the instance. Which lets the instance use it, but not sublicense to others.
The current assumption made by these companies is that AI training is fair use, and is therefore legal regardless of license. There are still many ongoing court cases over this, but one case was already resolved in favor of the fair use position.
Yes, the entire platform trains itself on posts within its platform to make algorithmic decisions and present it to users. Instead of likes or follows, you just have that.