More troubleshooting was done today. Here's what we did:
Yesterday evening @phiresky@[email protected] did some SQL troubleshooting with some of the lemmy.world admins. After that, phiresky submitted some PRs to GitHub.
We started using this image, and saw a big drop in CPU usage and disk load.
We saw thousands of errors per minute in the nginx log for old clients trying to access the websockets (which were removed in 0.18), so we added a return 404 in nginx conf for /api/v3/ws.
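For reference, the rule is roughly the sketch below (the surrounding config is simplified, not our full nginx conf):

```nginx
# Sketch: reject requests from old clients to the removed websocket API
location /api/v3/ws {
    return 404;
}
```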
We updated lemmy-ui from RC7 to RC10, which fixed a lot, including the issue with replying to DMs.
We found that the many 502 errors were caused by an issue in Lemmy/markdown-it.actix, causing nginx to temporarily mark an upstream as dead. As a workaround we can either 1) only use 1 container, or 2) set proxy_next_upstream timeout and max_fails=5 in nginx.
Currently we're running with 1 Lemmy container, so the 502 errors are completely gone so far, and because of the fixes in the Lemmy code everything seems to be running smoothly. If needed we could spin up a second Lemmy container using the proxy_next_upstream timeout / max_fails=5 workaround, but for now it seems to hold with 1.
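For those curious, option 2 boils down to something like this sketch in the proxied location (the upstream name and surrounding config are placeholders, not our exact setup):

```nginx
# Sketch: only fail over to the next backend on timeouts, so a single
# application error doesn't get an upstream skipped
location / {
    proxy_pass http://lemmy;        # placeholder upstream name
    proxy_next_upstream timeout;
}
```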
And thank you all for your patience, we'll keep working on it!
Oh, and as a bonus, an image (thanks Phiresky!) of the change in bandwidth after implementing the new Lemmy docker image with the PRs.
Edit: So as soon as the US folks wake up (hi!) we seem to need the second Lemmy container for performance. So that's now started, and I noticed the proxy_next_upstream timeout setting didn't work (or I didn't set it properly), so I used max_fails=5 for each upstream, which does actually work.
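The upstream block now looks roughly like this sketch (container names and port are placeholders):

```nginx
upstream lemmy {
    # allow up to 5 failed attempts before nginx marks a container as unavailable
    server lemmy-1:8536 max_fails=5;
    server lemmy-2:8536 max_fails=5;
}
```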
This is why having a big popular instance isn't all bad. It helps detect and fix the scaling problems and inefficiencies for all the other 1000s of instances out there!
This. If everyone had just kept spreading out to smaller instances as suggested in the beginning, which is still a sensible thing to do, no one would have noticed these performance issues. We need to think a few years out, assuming Lemmy succeeds and Reddit dies, and expect that "small instance" will mean 50k users.
I sincerely doubt Reddit will die anytime soon, it'll just exist as its own thing that its new target audience gets bored with and moves on from in a few years when something new and flashy catches their eye in the app store. Just like they do all the other apps designed in exactly the same fashion that Reddit is currently morphing into.
Meanwhile Lemmy will be slowly building its communities up to be what Reddit used to be.
I'm actually kinda waiting a few releases to start promoting my instance anywhere, letting some other brave instance admins work the kinks out a bit first.
If this project is to stay for the long haul, we gotta load test it and stabilize it. These folks are doing the important work here. Large instances are more or less inevitable if Lemmy sticks.
You guys had better quit it with all this amazing transparency or it's going to completely ruin every other service for me. Seriously though amazing work and amazing communication.
Also, to other people: DONATE TO FOSS PROJECTS. If 50.000 people donate only 0.5€, we have 25.000€ for funding the servers, coding, motivating people, etc.
Just skip a cup of coffee for 1 day. There are already 2 million of us across Lemmy instances. We can build a decentralized world together!!
If there's something you really need to pay in a foreign currency, look into Revolut or Wise. Since I occasionally have to pay for stuff in Turkish lira or GBP, and have donated to USD-only receivers, I like to keep Revolut as my secondary bank account, since exchanging one currency for another is completely free!
For example, if you speak a second language, you can even help with translating projects. It's very easy. E.g. I translated Jerboa (the Lemmy client for Android) into Greek 2-3 days ago. I needed only 1 hour to finish, plus an extra 15-20 minutes yesterday for fixes that I'd missed.
Boy does it feel good to have those reports and understand the work you guys do. It's really inspiring! Thanks for your hard work, everything has been silky smooth! This instance is really great, Lemmy and its devs are really amazing, and I feel at home in a nice, cozy community.
You'd be surprised at how much performance this kind of setup can squeeze out. Often the limitation is more on the DB/storage side than on network handling and processing power.
I’ve since successfully upvoted some comments and made replies without it hanging with the spinning circle. Not sure what the issue was but it all seems to be running smoothly now. Thanks.
This is why I love open source. The fact that a community can directly debug the code it's being hosted on and directly contribute the improvements back is just wild. Thanks for all the hard work @[email protected] and the rest of the lemmy.world team! The site already feels much more responsive.
Good to see a heavy production server taking on the scaling issues. Thank you! To discuss Lemmy performance issues, there is a dedicated community: [email protected]
It now feels pretty good to browse, which makes the experience of using Lemmy much more enjoyable. Having to spam the vote buttons was really annoying.
Even though I'm not from this instance, this is such a nice way of keeping users posted about changes.
I wish more companies (I know this is not a company) went straight to the point, instead of using vague terms like "improved stability, fixed a few issues with an update" when things are changed. I hope all instance owners follow this trend.
Can we have an update on the status of Lemmy.world and how close our ties with Meta's Threads are going to be? Threads is going to support ActivityPub, but history has shown that this kind of move is an attempt to kill the open platform and eventually replace it with theirs once they get everyone in their ecosystem (Embrace, Extend, Extinguish). Mastodon said today that they don't mind sleeping with vipers, even though their demise/dissolution is in Meta's best interest.
Please tell me we are defederating from Meta....or let us know what to expect
EDIT: I originally stated that Mastodon told them to fuck off, but I got confused with Fosstodon (who did that). Mastodon doesn't mind being in bed with Meta
Where have you seen Mastodon formally state they have no interest in working with them?
I'm genuinely asking because I'm relatively new to Mastodon and Lemmy and want to be as informed as possible with this whole Meta situation. And just to be transparent, I'm not a Meta fan at all, to the extent I've never had an account with any of their products
I did read this official Mastodon blog post today...
Lemmy's devs and the .world admins have done in a month what Reddit hasn't done in its whole existence: deliver a smooth and almost bug-free experience.
Not to undervalue the efforts going into this, because I appreciate the new community and the transparency, but I believe we have wildly different definitions of 'almost bug free'
Which is also something to consider about user experience consistency. It will be a challenge with growth. Fortunately, plugged-in admins and devs will help.
By almost bug free I was speaking only about my own experience. There are probably loads of things under the hood I'm not noticing, but it's been hours since I last noticed any issue.
I agree that we've still got a long way to go though: both Lemmy and Jerboa are far from their 1.0 release yet.
Same! My first thought was “that’s an impressive-looking graph. I have absolutely no idea what it means.” The proof is in the pudding though - lemmy.world is much improved!
I'm very curious: does a single Lemmy instance have the ability to horizontally scale to multiple machines? You can only get so big of a machine. You did mention a second container, so that would suggest that the Lemmy software is able to do so, but I'm curious if I'm reading that right.
Shouldn't the correct HTTP status code for a removed API be 410? 404 indicates the resource wasn't found or doesn't exist, while 410 indicates a resource that has been intentionally removed.
Awesome work - things seem to be running much more smoothly today.
Do you have anything behind a CDN by chance? Looking at the lemmy.world IPs, the server appears to be hosted in Europe and web traffic goes directly there. The IPv4 address seems to resolve to a Finland-based address, and the IPv6 address to a Germany-based one.
If you put the site behind a CDN, it should significantly reduce your bandwidth requirements and greatly drop the number of requests that need to hit the origin server. CDNs would also make content load faster for people in other parts of the world. I'm in New Zealand, for example, and I'm seeing 300-350 ms latency to lemmy.world currently. If static content such as images could be served via CDN, that would make for a much snappier browsing experience.
How great is it to be a part of history in the making -
This is Web 3 in its fomenting -
Headlines ~5yrs:
The ending of Web 2 was unceremonious and just ugly. u/spez and moron@musk watched as their social media networks signaled the end of Web 2 and slowly dissolved. Blu bird’s value disintegrated and Reddit’s hopes for IPO did likewise. Twitter and Reddit dissolved into odorous flatulence as centralization fell apart to the world’s benefit. Decentralized/federated social media such as Mastodon and Lemmy made their convoluted progress and led Web 3’s development and growth…
This is how history is made, it’s ugly and convoluted but comes out sweeet…
Whilst I'm aware that too many users on one instance can be a bad thing for the wider Fediverse, I think it is a great thing at the moment in terms of how well people are banding together to fix the issues being encountered from such a surge in users.
The issues being found on lemmy.world result in better Lemmy instances for everyone and improve the whole Fediverse of Lemmy instances.
I'm very impressed with how well things are being debugged under pressure, well done to all those involved 👏
I'd volunteer to be a technical troubleshooter - very familiar with docker/javascript/SQL, not super familiar with rust - but I'm sure yall also have an abundance of nerds to lend a hand.
You should try to contact one of the admins of this server (Ruud is very busy tho, lots of mentions) and see if you could be of any help. I am sure they would appreciate even just the offer 😄
It blows my mind that with the amount of traffic you guys must be getting, you are only running one container and not a k8s cluster with multiple pods (or a similar container orchestration system).
Edit: I misread, a second one is coming up, but it's still crazy that this doesn't take some multi-node cluster with multiple pods. Fucking awesome
Yeah this morning everything has loaded so much quicker, I’ve been able to post and vote on comments no problem! Lemmy is really starting to take form. I fucking love this whole thing.
Really great job, guys! I know from my experience in SRE that this kind of debugging, monitoring and fixing can be a real pain, so you have all my appreciation. I'm even determined to donate on Patreon if it's available.
It felt like I’d jinx us all if I commented but THANK YOU! This has been a wonderful experience today. Absolutely loving it and knew you just needed some time to work out the kinks that happen with fast growth.
You know, there's something about dealing with the lagginess of the past few days that makes me appreciate how fast and responsive it is after the update. It's nice to see the community grow; it makes the experience on Lemmy feel authentic.
I hope to start on some small contributions sometime next week. Stability has been noticeably better the last few days and I imagine it’s only going to get better.
A lot of this stuff is pretty opaque to me, but I still read through it because I love how detailed they are in sharing what's going on under the hood and how it relates to problems users experience. Kudos to you guys!
I like that the post goes into detail and allows us tech nerds to get hard watching this stuff, instead of the regular corpo mumbo-jumbo changelog.
Compared to days prior, things are running much better today. Page load speed, reliable post/reply/upvoting. I dunno if it's just happenstance, but whatever knobs, levers and keystrokes you're manipulating, keep doing the things! Thanks so much for a home absent of corporate BS.
Man I thought I noticed something different. For the past week or so I've gotten nothing but network error and Java errors in Jerboa which are completely gone now. Posts load almost instantly too. Appreciate the effort guys, was going insane.
Would HAProxy work better as a load balancer? For work we switched due to some issues with NGINX; so far, the service has been much more consistent with pretty much no downtime, even when restarting server hosts.
@[email protected] is this docker container y'all are using available on a registry? We'd like to use it. And do you have a load balancer in front of your lemmy-ui image to allow two containers to run? or is that built in and I just never noticed it?
Well, we use the cetra3/lemmy:the-phiresky-cut image, but you can also wait for the next RC or release, which will include the PRs.
We load balance with nginx, which works if you use max_fails=5 for each upstream.
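Roughly, the setup amounts to something like this compose sketch (service names, ports and file paths are assumptions, not our actual files):

```yaml
# Sketch only: two Lemmy backend containers behind an nginx load balancer
services:
  lemmy-1:
    image: cetra3/lemmy:the-phiresky-cut
  lemmy-2:
    image: cetra3/lemmy:the-phiresky-cut
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # upstream with max_fails=5 lives here
```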
I'll be honest I don't know what any of this means but what I can say is I absolutely love the transparency of all of this. It's so refreshing and maybe I'll start learning more about what I'm looking at because I'll keep seeing it. Great work!
The instance seems to be much better. Posting and commenting is not taking as long and loading times are way better. I hope things can stay this good or even get better.
Have you looked into possibly migrating to kubernetes or some other form of docker container management/orchestration system to help with automatic scaling and load balancing?
Thank you so much for the hard work, time and money you spend into making lemmy.world run very smoothly. This much transparency is awesome for something that's being used so massively.
Thank you so much! I will be donating a few cappuccinos your way when my next check arrives. I really appreciate how awesome of a community you’ve brought together & all of the transparency with the updates (and the frequency) is astounding! Keep up the great work but don’t forget to take breaks :)
Please recommend people to update their app in a topic title. Connect couldn't even load a topic without failing out today. An update fixed it, but I had to manually force it because it didn't apply automatically.
This will drive people away. Literally none of the communities I subscribe to on world even seemed to have a new comment.
damn bro, y’all coming in clutch to improve stability of this lemmy instance.
Good shit bros. Hope to contribute upstream and find more performance related bugs. I browsed the code for lemmy, and could not find any performance tests.
Minor thing, but overnight both the wefwef and Memmy clients are showing the wrong comment score (karma) against my profile, and given they are showing the same amount I assume it's related to API-fed data. The value was correct yesterday. It's easy for me to confirm given I have only two dozen posts and the value has dropped to single digits.
Not a biggie, but figured I'd report it in case there was some issue causing that. Might be that some optimisation around indexing, or something, has intentionally or unintentionally impacted it.
Otherwise the service feels much more stable currently. No timeouts today where it’s been very frequent the past few days. Nice job. 👍
Cheers. Seems to be incrementing OK with my posts today, but the total seems like it was reset or is only counting the past 24-48 hours of comments, so it's short by 50 or so.
Servers are still performing well since my last post just 5 hours ago.
Really appreciate all the hard work going on behind the scenes! Feels night and day different after the changes. Also appreciate the transparency. Nice to see in this day and age.
I can't imagine the amount of work Lemmy's devs and IT folks have been under these past few days, but it's important for the future of Lemmy. Keep up the good work! You're awesome!
It's not on the default lemmy-ui on the web, but many third-party apps for Lemmy are showing total comment and post “scores” on users' profile pages. Not exactly sure how they are calculating that, but I've noticed that those scores were wiped for at least my account today. Again, not a big deal, just thought it was strange how that would happen. It is showing like this in Memmy as well as wefwef, so it's definitely some data out in the Lemmy-verse that went bad. It is starting to “count” stuff from comments I've made today though, so who knows.
Everything is feeling great so far. The only bug I'm encountering is that when opening a thread (in Firefox on desktop) it auto-scrolls down past the content to the replies.
The 502 errors still seem to be common for me, but they are less persistent. Before, they lasted through multiple refreshes; now it's safe to say that after 1 reload they're gone most of the time.