There are reports of IT outages affecting major institutions in Australia and internationally.
All our servers and company laptops went down at pretty much the same time. Laptops have been bootlooping to blue screen of death. It's all very exciting, personally, as someone not responsible for fixing it.
Apparently caused by a bad CrowdStrike update.
Edit: now being told we (who almost all generally work from home) need to come into the office Monday as they can only apply the fix in-person. We'll see if that changes over the weekend...
Reading into the updates some more... I'm starting to think this might just destroy CrowdStrike as a company altogether. Between the mountain of lawsuits almost certainly incoming and the total destruction of any public trust in the company, I don't see how they survive this. Just absolutely catastrophic on all fronts.
If all the computers stuck in boot loop can't be recovered... yeah, that's a lot of cost for a lot of businesses. Add to that all the immediate impact of missed flights and who knows what happening at the hospitals. Nightmare scenario if you're responsible for it.
This sort of thing is exactly why you push updates to groups in stages, not to everything all at once.
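In practice that looks something like the sketch below - just the ring-rollout logic, where `deploy` and `healthy` are hypothetical stand-ins for whatever your actual update and health-check tooling is:

```python
import random

# Minimal sketch of a staged ("ring") rollout. deploy() and healthy() are
# hypothetical stand-ins for your real update and health-check tooling.
RINGS = [0.01, 0.10, 0.50, 1.00]  # cumulative fraction of the fleet per stage

def staged_rollout(hosts, deploy, healthy):
    random.shuffle(hosts)  # don't let one site or role absorb the whole first ring
    done = 0
    for fraction in RINGS:
        target = int(len(hosts) * fraction)
        for host in hosts[done:target]:
            deploy(host)
        done = target
        # Health gate: stop before the next ring if anything already broke.
        if not all(healthy(host) for host in hosts[:done]):
            return False  # halted early; most of the fleet never saw the update
    return True
```

With gates like that, a bad update bricks 1% of your fleet instead of 100% of it.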
Agreed, this will probably kill them over the next few years unless they can really magic up something.
They probably don't get sued - their contracts will have indemnity clauses against exactly this kind of thing, so unless they seriously misrepresented what their product does, this probably isn't a contract breach.
If you are running CrowdStrike, it's probably because you have some regulatory obligations and an auditor to appease. You aren't going to be able to just turn it off overnight, but I'm sure there are going to be some pretty awkward meetings when it comes to contract renewals in the next year, and I can't imagine them seeing much growth.
Don't most indemnity clauses have exceptions for gross negligence? Pushing out an update this destructive without it getting caught by any quality control checks sure seems grossly negligent.
I think you're on the nose, here. I laughed at the headline, but the more I read the more I see how fucked they are. Airlines. Industrial plants. Fucking governments. This one is big in a way that will likely get used as a case study.
They can have all the clauses they like but pulling something like this off requires a certain amount of gross negligence that they can almost certainly be held liable for.
Don't we blame MS at least as much? How does MS let an update like this push through their Windows Update system? How does an application update make the whole OS unable to boot? Blue screens on Windows have been around for decades, why don't we have a better recovery system?
Crowdstrike runs at ring 0, effectively as part of the kernel. Like a device driver. There are no safeguards at that level. Extreme testing and diligence is required, because these are the consequences for getting it wrong. This is entirely on crowdstrike.
The four multinational corporations I worked at were almost entirely Windows servers with the exception of vendor specific stuff running Linux. Companies REALLY want that support clause in their infrastructure agreement.
I've worked as an IT architect at various companies in my career and you can definitely get support contracts for engineering support of RHEL, Ubuntu, SUSE, etc. That isn't the issue. The issue is that there are a lot of system administrators with "15 years experience in Linux" that have no real experience in Linux. They have experience googling for guides and tutorials while having cobbled together documents of doing various things without understanding what they are really doing.
I can't tell you how many times I've seen an enterprise patch their Linux solutions (if they patched them at all, with some ridiculous rubber-stamped POA&M) manually, without deploying a repo and updating from it the way you would with a WSUS. Hell, I'm pleasantly surprised if I see them joined to a Windows domain (a few times) or an LDAP (once, but they didn't have a trust with the domain forest or use sudoer rules... sigh).
Doesn't like a quarter of the internet kinda run on Azure?
Said another way, 3/4 of the internet isn't on Unsure cloud blah-blah.
And Azure is - shhh - at least partially backed by Linux hosts. Didn't they buy an AWS clone and forcibly inject it with money, like Bobby Brown on a date, in the hopes of building AWS better than AWS, like they did with Nokia? MS could be more protectively diverse than many of its best customers.
I've had my PC shut down for updates three times now, while using it as a Jellyfin server from another room. And I've only been using it for this purpose for six months or so.
Windows server, the OS, runs differently from desktop windows. So if you're using desktop windows and expecting it to run like a server, well, that's on you. However, I ran windows server 2016 and then 2019 for quite a few years just doing general homelab stuff and it is really a pain compared to Linux which I switched to on my server about a year ago. Server stuff is just way easier on Linux in my experience.
Completely justified reaction. A lot of the time tech companies and IT staff get shit for stuff that, in practice, can be really hard to detect before it happens. There are all kinds of issues that can arise in production that you just can't test for.
But this... This has no justification. An issue this immediate, this widespread, would have instantly been caught with even the most basic of testing. The fact that it wasn't raises massive questions about the safety and security of CrowdStrike's internal processes.
From what I've heard, and to play devil's advocate, it coincided with Microsoft pushing out a security update at basically the same time, which contributed to the issue. So it's possible that they didn't have a way to test it properly, because they didn't have the update at hand before it rolled out. So the fault wasn't only in a bug in the CS driver, but in the driver's interaction with the new Windows update - which they didn't have.
Lots of security systems are kernel-level (at least partially); this includes SELinux and AppArmor, by the way. It's a necessity for these things to actually be effective.
You posted this 14 hours ago, which would have made it 4:30 am in Austin, Texas, where CrowdStrike is based. You may have felt the effect on Friday, but it's extremely likely that the person who made the change did it late on a Thursday.
Yeah my plans of going to sleep last night were thoroughly dashed as every single windows server across every datacenter I manage between two countries all cried out at the same time lmao
Marginal? You must be joking. A vast number of servers run on Windows Server. Where I work alone we have several hundred, and many companies have a similar setup. Statista put the Windows Server OS market share at over 70% in 2019. While I find it hard to believe it would be that high, it does clearly indicate it's most certainly not a marginal percentage.
Not too long ago, a lot of Customer Relationship Management (CRM) software ran on MS SQL Server. Businesses made significant investments in software and training, and some of them don't have the technical, financial, or logistical resources to adapt - momentum keeps them using Windows Server.
For example, small businesses that are physically located in rural areas can't use cloud-based services because rural internet is too slow and unreliable. It's not quite the case that there's no amount of money you can pay for a good internet connection in rural America, but last time I looked into it, Verizon wanted to charge me $20,000 per mile to run a fiber optic cable from the nearest town to my client's farm.
Here's the fix: (or rather workaround, released by CrowdStrike)
1) Boot to safe mode/recovery
2) Go to C:\Windows\System32\drivers\CrowdStrike
3) Delete the file matching "C-00000291*.sys"
4) Boot the system normally
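And if you can get to a working shell, the deletion itself is trivially scriptable. A minimal sketch in Python (assuming the affected volume is mounted at C: and you have admin rights - from the recovery command prompt itself you'd just use `del`):

```python
import glob
import os

# Remove the bad Channel File 291 per CrowdStrike's workaround.
# Path assumes the affected Windows volume is mounted as C:.
pattern = r"C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"

for path in glob.glob(pattern):
    os.remove(path)
    print(f"deleted {path}")
```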
It's disappointing that the fix is so easy to perform, and yet it'll almost certainly keep a lot of infrastructure down for hours, because a majority of people seem too scared to try to fix anything on their own machine (or aren't trusted to, so they can't even if they know how).
They also gotta get the fix through a trusted channel and not randomly on the internet. (No offense to the person that gave the info - it may well be correct, but you never know.)
This sort of fix might not be accessible to a lot of employees who don't have admin access on their company laptops, and if the laptop can't be accessed remotely by IT then the options are very limited. Trying to walk a lot of nontechnical users through this over the phone won't go very well.
Might seem easy to someone with a technical background. But the last thing businesses want to be doing is telling average end users to boot into safe mode and start deleting system files.
If that started happening en masse we would quickly end up with far more problems than we started with. Plenty of users would end up deleting system32 entirely or something else equally damaging.
It might not even be that. A lot of places have many servers (and even more virtual servers) running crowdstrike. Some places also seem to have it on endpoints too.
I'm still on a bridge while we wait for BitLocker recovery keys so we can actually boot into safe mode, but the BitLocker key server is down as well...
Not that easy when it's a fleet of servers in multiple remote data centers. Lots of IT folks will be spending their weekend sitting in data center cages.
This is going to be a Big Deal for a whole lot of people. I don't know all the companies and industries that use Crowdstrike but I might guess it will result in airline delays, banking outages, and hospital computer systems failing. Hopefully nobody gets hurt because of it.
Definitely not small - our website is down so we can't do any business, and we're a huge company. Multiply that by all the companies that are down, plus lost time on projects and time to get caught up once it's fixed, and it'll be a huge number in the end.
Was it actually pushed on Friday, or was it a Thursday night (US Central/Pacific time) push? The fact that this comment is from 9 hours ago suggests that the problem existed by the time work started on Friday, so I wouldn't count it as a Friday push. (Still, too many pushes happen at a time that's technically Thursday on the US west coast but is already mid-day Friday in Asia.)
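The timezone math is easy to sanity-check. A quick sketch with a made-up push time, just to illustrate the Thursday-here, Friday-there problem:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumed push time purely for illustration: Thursday 22:00 US Pacific.
push = datetime(2024, 7, 18, 22, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

print(push.astimezone(ZoneInfo("UTC")))         # 2024-07-19 05:00 - already Friday
print(push.astimezone(ZoneInfo("Asia/Tokyo")))  # 2024-07-19 14:00 - Friday afternoon
```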
Wow, I didn't realize CrowdStrike was widespread enough to be a single point of failure for so much infrastructure. Lot of airports and hospitals offline.
The Federal Aviation Administration (FAA) issued ground stops for airlines including United, Delta, American, and Frontier.
Wait, monopolies are bad? This is the first I've ever heard of this concept. So much so that I actually coined the term "monopoly" just now to describe it.
I don't think that's what's happening here. As far as I know it's an issue with a driver installed on the computers, not with anything trying to reach out to an external server. If that were the case you'd expect it to fail to boot any time you don't have an Internet connection.
Yep, stuck at the airport currently. All flights grounded. All major grocery store chains and banks also impacted. Bad day to be a crowdstrike employee!
My flight was canceled. Luckily that was a partner airline. My actual airline rebooked me on a direct flight. Leaves 3 hours later and arrives earlier. Lower carbon footprint. So, except that I'm standing in queue so someone can inspect my documents it's basically a win for me. 😆
Working on our units, but it only works if we are able to launch a command prompt from the recovery menu. Otherwise we get an F8 prompt and cannot start.
Yep, this is the stupid timeline. Y2K happening due to the nuances of calendar systems might have sounded dumb at the time, but it doesn't now. Y2K happening because of some unknown contractor's YOLO Friday update definitely is.
A few years ago, when my org got the ask to deploy the CS agent on Linux production servers, and I also saw it getting deployed on thousands of Windows and Mac desktops all across, the first thought that came to mind was "massive single point of failure and security threat", as we were putting all our trust in a single, relatively small company that will (has?) become the favorite target of all the bad actors across the planet. How long before it gets into trouble, either through its own doing or because of others?
No bad actors did this, and security goes in fads. Crowdstrike is king right now, just as McAfee/Trellix was in the past. If you want to run around without edr/xdr software be my guest.
If you want to run around without edr/xdr software be my guest.
I don't think anyone is saying that... But picking programs that your company has visibility into is a good idea. We use Wazuh. I get to control when updates are rolled out. It's not a massive shit show when the vendor rolls out the update globally without sufficient internal testing. I can stagger the rollout as I see fit.
Hmm. Is it safer to have a potentially exploitable agent running as root and listening on a port than to not have EDR running on a well-secured, low-churn enterprise OS - sit down, Ubuntu - adhering to best practice for least access, least services, and good role separation?
It's a pickle. I'm gonna go with "maybe don't lock down your enterprise Linux hard and then open a yawning garage door of a hole right into it" but YMMV.
I'm so exhausted... This is madness. As a Linux user I've been busy all day telling people with bricked PCs that Linux is better, but there are just so many. It never ends. I think this outage is going to keep me busy all weekend.
My dad needed a CT scan this evening and the local ER's system for reading the images was down. So they sent him via ambulance to a different hospital 40 miles away. Now I'm reading tonight that CrowdStrike may be to blame.
Honestly kind of excited for the company blogs to start spitting out their disaster recovery crisis management stories.
I mean - this is just a giant test of disaster recovery crisis management plans. And while there are absolutely real-world consequences to this, the fix almost seems scriptable.
If a company uses IPMI (which Intel brands as AMT and sometimes vPro), and their network is intact/the devices are on their network, they ought to be able to remotely address this.
But that’s obviously predicated on them having already deployed/configured the tools.
I mean - this is just a giant test of disaster recovery plans.
Anyone who starts DR operations due to this did 0 research into the issue. For those running into the news here...
CrowdStrike Blue Screen solution
The CrowdStrike blue screen of death error occurred after an update. The CrowdStrike team recommends that you follow these methods to fix the error and restore your Windows computer to normal usage.
1) Rename the CrowdStrike folder
2) Delete the "C-00000291*.sys" file in the CrowdStrike directory
3) Disable the CSAgent service using the Registry Editor
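For the Registry Editor method, this is roughly what the edit amounts to - a sketch assuming the service key is `CSAgent` and that Start = 4 means SERVICE_DISABLED. Run it from safe mode as admin, and remember the machine is unprotected until you re-enable the service:

```python
import winreg

# Disable the CSAgent service by setting Start = 4 (SERVICE_DISABLED).
# Assumes the standard service key path; run from safe mode with admin rights.
key_path = r"SYSTEM\CurrentControlSet\Services\CSAgent"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "Start", 0, winreg.REG_DWORD, 4)
```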
No need to roll full backups... as they'd likely just pull the update again anyway and BSOD again. Caching servers are a bitch...
I think we’re defining disaster differently. This is a disaster. It’s just not one that necessitates restoring from backup.
Disaster recovery is about the plan(s), not necessarily specific actions. I would hope that companies recognize rerolling the server from backup isn’t the only option for every possible problem.
I imagine CrowdStrike pulled the update, but that would be a nightmare of epic dumbness if organizations got trapped in a loop.
Note this is easy enough to do if systems are booting or you're dealing with a handful, but if you have hundreds of poorly managed systems, discard and do again.
IPMI is not AMT. AMT/vPro is a closed protocol, right? Also, people are disabling AMT because of the listed risks, which is too bad; but it's easier than properly firewalling it.
Better to just say "it lets you bring up the console remotely without windows running, so machines can be fixed by people who don't have to come into the office".
I meant to say that intel brands their IPMI tools as AMT or vPro. (And completely sidestepped mentioning the numerous issues with AMT, because, well, that’s probably a novel at this point.)
Depends on your management solutions. Intel vPro can allow remote access like that on desktops & laptops even if they’re on WiFi and in some cases cellular. It’s gotta be provisioned first though.
vPro is mostly a desktop feature. Servers have proper IPMI/iDRAC with more features, and vPro fills (part of) that management gap for desktops. It's pretty cool when it's not spoiled or disabled. Time to reburn all the desktops in east-07? vPro will be the way.
Been at work since 5AM... finally finished deleting the C-00000291*.sys file in CrowdStrike directory.
182 machines total. Thankfully the process in and of itself takes about 2-3 minutes. For virtual machines, it's a bit of a pain, at least in this org.
lmao I feel kinda bad for those companies that have 10k+ endpoints to do this to. Eff... that. Lots of immediate short-term contract hires for that, I imagine.
That's one of those situations where they need to immediately hire local contractors to those remote sites. This outage literally requires touching the equipment. lol
I'd even say, fly out each individual team member to those sites.. but even the airports are down.
Yeah, there are USB sticks that identify as keyboards and replay, in sequence, every keystroke saved in a text file in their memory. Neat stuff. The primary use case is of course corrupting systems or brute-forcing passwords without touching anything... but they work really well for executing scripts semi-automated.
We had a bad CrowdStrike update years ago where their network scanning portion couldn't handle a load of DNS queries on startup. When we asked how we could switch to manual updates, we were told that wasn't possible. So we had to black-hole the update endpoint via our firewall, which luckily was separate from their telemetry endpoint. When we were ready to update, we'd add FW rules allowing groups to update in batches. They've since changed that, but a lot of companies just hand control over to them. They have both a file system and a network shim, so they can basically intercept **everything**.
CrowdStrike sent a corrupt file with a software update for Windows servers. This caused a blue screen of death on Windows servers globally for CrowdStrike clients. Even people in my company were hit; luckily I shut off my computer at the end of the day and missed the update. It's not an OTA fix. They have to go into every data center and manually fix all the computer servers. Some of these servers have encryption. I see a very big lawsuit coming...
they have to go into every data center and manually fix all the computer servers.
Jesus christ, you would think that (a) the company would have safeguards in place and (b) businesses using the product would do better due diligence. Goes to show there are no grown-ups in the room inside these massive corporations that rule every aspect of our lives.
I'm calling it now. In the future there will be some software update for your electric car, and due to some jackass, millions of cars will end up getting bricked in the middle of the road where they have to manually be rebooted.
they have to go into every data center and manually fix all the computer servers
Do they not have IPMI/BMC for the servers? Usually you can access KVM over IP and remotely power-off/power-on/reboot servers without having to physically be there. KVM over IP shows the video output of the system so you can use it to enter the UEFI, boot in safe/recovery mode, etc.
I've got IPMI on my home server and I'm just some random guy on the internet, so I'd be surprised if a data center didn't.
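For anyone who hasn't used it, here's a minimal sketch of poking a BMC with ipmitool from Python. The host and credentials are made up, ipmitool has to be installed locally, and the BMC obviously has to be reachable from wherever you run this:

```python
import subprocess

# Hypothetical BMC address and credentials; requires ipmitool installed locally.
BMC_HOST = "10.0.0.50"
BMC_USER = "admin"
BMC_PASS = "changeme"

def ipmi(*args):
    """Run an ipmitool command against the server's BMC over the LAN."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
         "-U", BMC_USER, "-P", BMC_PASS, *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

print(ipmi("chassis", "power", "status"))  # e.g. "Chassis Power is on"
ipmi("chassis", "power", "cycle")          # hard reboot without touching the box
```

For the actual file deletion you'd still need the KVM-over-IP console to drive safe mode, but at least nobody has to drive to the cage.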
One possible fix is to delete a particular file while booted in safe mode. But then they'll need to fix each system manually. My company encrypts the disks as well, so it's going to be an even bigger pain (for them). I'm just happy my weekend started early.
Huh. I guess this explains why the monitor outside of my flight gate tonight started BSoD looping. And may also explain why my flight was delayed by an additional hour and a half...
My company used to use something else but after getting hacked switched to crowdstrike and now this. Hilarious clownery going on. Fingers crossed I'll be working from home for a few days before anything is fixed.
I see a lot of hate ITT on kernel-level EDRs, which I wouldn't say they deserve. Sure, for your own use an AV is sufficient and you don't need an EDR, but they make a world of difference. I work in cybersecurity doing red team engagements, so my job is mostly about bypassing such solutions - writing malware and taking actions within the network that avoid detection as much as possible - and ever since EDRs started getting popular, my job has gotten several leagues harder.
The advantage of EDRs over AVs is that they can catch 0-days. An AV will just look for signatures: known pieces or snippets of malware code. An EDR, on the other hand, looks for sequences of actions a process takes, by scanning memory, watching logs, and hooking syscalls. So if, for example, you made an entirely custom program that allocates memory as read-write-execute, then loads a crypto DLL, decrypts something into that memory, and then calls a thread-spawn syscall to run it in another process, an EDR would correlate those actions and get suspicious, while to a regular AV the code would probably look fine. Some EDRs even watch network packets and can catch suspicious communication, such as port scanning, large data exfiltration, or C2 traffic.
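To make that concrete, here's a toy sketch - nothing like a real EDR engine, just the correlation idea: flag a process whose event stream performs a suspicious sequence of actions in order, regardless of what the bytes look like:

```python
# Toy illustration of behavioral correlation, not a real EDR engine.
# An AV asks "do these bytes match a signature?"; an EDR asks "did this
# process perform a suspicious sequence of actions?"
SUSPICIOUS_SEQUENCE = ["alloc_rwx_memory", "load_crypto_dll", "spawn_remote_thread"]

def is_suspicious(events):
    it = iter(events)
    # True if every step occurs, in order, with any amount of noise in between.
    return all(step in it for step in SUSPICIOUS_SEQUENCE)

benign = ["open_file", "read_config", "write_log"]
shady = ["open_file", "alloc_rwx_memory", "read_config",
         "load_crypto_dll", "spawn_remote_thread"]

print(is_suspicious(benign))  # False
print(is_suspicious(shady))   # True
```

A signature scanner never fires on fully custom code; the sequence check still does.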
Sure, in an ideal world you would have users that never run malware and a network that is impenetrable. But you still get, on average, a few percent of people running random binaries that came from phishing attempts, or around 50% of people at your company falling for vishing attacks. Having an EDR increases your chances of surviving such an attack almost exponentially, and I would say the advantage EDRs get from being kernel-level is well worth it.
I'm not defending CrowdStrike - they did mess up - though I'd bet the damage they caused worldwide is still nowhere near the total damage all the cyberattacks they've prevented would have done. But hating on kernel-level EDRs in general isn't warranted here.
Kernel-level anti-cheat, on the other hand, can go burn in hell, and I hope that something similar will eventually happen with one of them. Fuck kernel level anti-cheats.
The issue is with a widely used third party security software that installs as a kernel level driver. It had an auto update that causes bluescreening moments after booting into the OS.
This same software is available for Linux and Mac, and had similar issues with specific Linux distros a month ago. It just didn't get reported on because it didn't have as wide of an impact.
had similar issues with specific Linux distros a month ago. It just didn’t get reported on because it didn’t have as wide of an impact.
Because most data center admins using linux are not so stupid to subscribe to remote updates from a third party. Linux issues happen when critical package vulnerabilities make it into the repo.
This outage is probably costing a significant portion of CrowdStrike's market cap. They're an 80-billion-dollar company, but this is a multibillion-dollar outage.
Someone's getting fired for this. Massive process failures like this means that it should be some high level managers or the CTO going out.
This is proof you shouldn't invest everything in one technology. I won't say everyone should change to Linux, because it isn't immune to this, but we need to push companies to support several OSes.
There is a fix people have found which requires manual booting into safe mode and removal of a file causing the BSODs. No clue if/how they are going to implement a fix remotely when the affected machines can't even boot.
Having had to fix >100 machines today, I'm not sure how a reimage would be less work. Restoring from backups maybe, but reimage and reconfig is so painful
Everyone is assuming it's some intern pushing a release out accidentally or a lack of QA, but Microsoft also pushed out July security updates that have been causing BSODs since the 9th(?). These aren't optional either.
What's the likelihood that the CS file was tested on devices that hadn't gotten the latest Windows security update, and it was an unholy union of both those things that caused this meltdown? The timelines do potentially line up when you consider your average agile delivery cadence.
A lot of people I work with were affected; I wasn't one of them. I had assumed it was because I put my machine to sleep yesterday (and every other day this week) and just woke it up rather than booting it. I assumed it was an on-startup thing and that's why I didn't get it.
Our IT provider already broke EVERYTHING earlier this month when they remote-installed "Nexthink Collector", which forced a 30+ minute CHKDSK on every boot for EVERYONE until they rolled out a fix (which they were at least able to do remotely), and I didn't want to have to deal with that the week before I go on leave.
But it sounds like it even happened to running systems, so now I don't know why I wasn't affected - unless it's a Windows 10-only thing?
Our IT have had some grief lately, but at least they specified Intel 12th gen on our latest CAD machines, rather than 13th or 14th, so they've got at least one win.
OK, but people aren't running Crowdstrike OS. They're running Microsoft Windows.
I think that some responsibility should lie with Microsoft - to create an OS that:
1) Recovers gracefully from third party code that bugs out
2) Doesn't allow third party software updates to break boot
I get that there can be unforeseeable bugs, I'm a programmer of over two decades myself. But there are also steps you can take to strengthen your code, and as a Windows user it feels more like their resources are focused on random new shit no one wants instead of on the core stability and reliability of the system.
It seems like third party updates have a lot of control/influence over the OS, and that's all well and good, but the equivalent of a "try and catch" is what they needed here, and yet nothing seems to be in place. The OS just boot loops.
Those things never worked for me... Problems always persisted, or it failed to apply the restore point. This is from the XP and Windows 7 days; I never bothered with them again. To Microsoft's credit, both W7 and W10 were a lot more stable, negating the need for it.
I can't say about XP or 7 but they've definitely saved my bacon on Win10 before on my home system. And the company I work for has them automatically created and it made dealing with the problem much easier as there was a restore point right before the crowdstrike update. No messing around with the file system drivers needed.
I'd really recommend at least creating one at a state when your computer is working ok, it doesn't hurt anything even if it doesn't work for you for whatever reason. It's just important to understand that it's not a cure all, it's only designed to help with certain issues (primarily botched updates and file system trouble).
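Creating one is scriptable too, if you want it in a scheduled task. A sketch shelling out to PowerShell's Checkpoint-Computer cmdlet (needs admin rights, and System Protection must be enabled on the drive; the description text is just a made-up label):

```python
import subprocess

# Create a restore point via PowerShell's Checkpoint-Computer cmdlet.
# Requires admin rights and System Protection enabled on the system drive.
subprocess.run(
    ["powershell", "-NoProfile", "-Command",
     'Checkpoint-Computer -Description "known-good baseline" '
     '-RestorePointType MODIFY_SETTINGS'],
    check=True,
)
```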
Not easy to switch a secured 4,000+ workstation business. Plus, a lot of companies get their support, licenses, and managed email from one vendor. It's bundled in such a way that it would cost MORE to deploy Linux. (And that's very much on purpose.)
It's entertaining to me that our brand of monopolistic/oligarchic capitalism disincentivizes paying one-time costs even when they're greatly outweighed by the risk of future incidents - even when those one-time costs would result in greater stability and lower prices, and not even on that big of a time horizon. There is an army of developers that would be so motivated to work on a migration project like this. But then I guess execs couldn't jet-set around the world to hang out at the CrowdStrike F1 hospitality tent every weekend.
Not at my company. We're all stuck in BSOD boot loops thanks to BitLocker, and our BIOS is password protected by IT. This is going to take weeks for them to manually update, on site, all the computers one by one.
Eh. This particular issue is making machines bluescreen.
Virtualized assets? If there's a will, there's a way.
Physical assets with REALLY nice KVMs... you can probably mount up an ISO to boot into to remove the stupid definitions causing this shit.
Everything else? Yeah... you probably need to be there physically to fix it.
But I will note that many companies by policy don't allow USB insertion, virtual or not, which will make this considerably harder across the board. I agree that the majority could probably be fixed remotely, but I don't think the "other" categories are only 1%... I think there are many more systems that required physical intervention. And more importantly, it doesn't matter if it's 100% or 0.0001%: if that one system is the one that makes the company money, the percentage of the population doesn't matter.
It's Russia, or Iran, or China, or even our "ally" Saudi Arabia. So really, it's time to reset the clock to pre-1989. Cut Russia and China off completely: no investment, no internet, no students, no tourists, nothing. These people mean us harm and are continually doing us harm, and we still plod along while some unscrupulous types become agents for personal profit. Enough.
best day ever. the digitards get a wakeup call. how often have i been lectured by imbeciles about how great whatever dumbo closed source is.
"i need photoshop", "windows powershell and i get work done", "azure and onedrive and teams...best shit ever", " go use NT, nobody will use a GNU".
yeah well, i hope every windows user would be kept off the interwebs for a year and mac users just burn in hell right away. lazy scum that justify shitting on society for their own comfort. while everyone needs a driver's license, dumb fucking parents give tiktok to their kids... idiocracy will have a great election this winter.
So when’s the last time you touched some grass? It’s a lovely day outside. Maybe go to a pet shelter and see some puppies? Are you getting enough fiber? Drinking enough water? Why not call a friend and hang out?
While I get your point on the over reliance on Microsoft, some of us are going to be stuck spending the whole day trying to fix this shit. You could show some compassion.
no. after decades...not anymore. again and again. there is no good excel, apple users are not "educated". just assholes caring about themselves and avoiding the need to learn.
i did not yet hear any valid argument.
why not show compassion for someone driving a bmw 5 series around a kindergarten without a driver's license?
i'm seriously just done with every windows, android, osx user forever. digital trumps. that's all they are. me. me. me.
It's fun to dunk on the closed source stuff, but this is a release engineering thing that isn't the fault of a software license.
Render unto Caesar, dude.
It's a nice rant, but it also combines government regulation and monocultures and indolence. You should cut it up into several different rants, bringing out the proper one in a more focused spiel for a timely and dramatic win. The lack of cohesion also means you're taking longer to get the rant out, and if it's too long to hold focus you'll reduce your points awarded for the dunk.
Otherwise, lots of street-preacher potential. Solid effort.