I have two, one is actually complicated and one was so obtuse that I never would have figured it out in a million years:
Actually complicated: I still don't know how it happened, but somehow an update on Arch filled the boot partition with junk files, which then caused the kernel update to fail because of no disk space, which then kind of tanked the whole system. It took ages, but with a boot disk and chroot-ing back into the boot partition I eventually managed to untangle it all. I was determined to see it through and not reinstall.
Ridiculous: One day when using Ubuntu, the entire system went upside-down. As in, everything was working perfectly fine, but literally the screen was upside-down. After much Googling I had no luck figuring it out, then I accidentally found the solution - I'd plugged a PS4 controller into the USB on the laptop to charge it, and for some reason Ubuntu interpreted the gyroscope on the controller as "rotate the screen display" so when I moved it, the screen spun round. I only figured it out by accident when I plugged it back in and it spun back to normal lol.
I had a similar one. I had a USB-powered fan cooling pad that my laptop was sitting on. My laptop would randomly go into boot loops when I turned it on. I thought it was a grub issue so I always had my USB stick ready to re-install grub. Did some dusting one day and forgot to plug in the cooling fan, then the boot loop never happened again. Turns out it was the fan plugged into the USB that was causing it.
A couple years ago on Reddit I saw a story where a dude working IT support had to drive to a remote office to replace a workstation that wouldn't boot. When he got there the lady whose desk it was had some shitty USB fan or maybe an LED Christmas tree plugged into one of the USB ports. He unplugged that and the PC booted fine.
Ah I remember that one! Classic. I also remember a story about someone who lost an entire PC in their apartment. It was running and connected to the network, they could ping it, but couldn't physically find it lol.
I manage a machine that runs both media transcodes and some video game servers.
The video game servers have to run in real-time, or very close to it. Otherwise players using them suffer noticeable lag.
Achieving this while an ffmpeg process was running was completely impossible, no matter what I did to limit ffmpeg's use of CPU time. Even when running it at lowest priority it impacted the game server processes running at top priority. Even if I limited it to one thread, it was affecting things.
I couldn't understand the problem. There was enough CPU time to go around to do both things, and the transcode wasn't even time sensitive, while the game server was, so why couldn't the Linux kernel just figure it out and schedule things in a way that made sense?
So, for the first time I read up on how computers actually handle processes, multi-tasking and CPU scheduling.
As FFMPEG is an application that uses ALL available CPU time until a task is done, I came to the conclusion that due to how context switching works (CPU cores can only do one thing, they just switch out what they do really fast, but this too takes time) it was causing the system to fall behind on the video game processes when the system was operating with zero processing headroom. The scheduler wasn't smart enough to maintain a real-time process in the face of FFMPEG, which would occupy ALL available cycles.
I learned the solution was core pinning. Manually setting processes to run on certain cores of the CPU. I set FFMPEG to use only one core, since it doesn't matter how fast it completes. And I set the game processes to use all but that one core, so they don't accidentally end up queueing for CPU time on a core that doesn't have the headroom to allow the task to run within a reasonable time range.
This has completely solved the problem, as the game processes and FFMPEG no longer wait for CPU cycles in the same queue.
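For anyone who wants to try the same thing, here is a minimal sketch of the core split in Python. It assumes Linux (os.sched_setaffinity is not available elsewhere), and the helper names split_cores and pin are made up for illustration; the comment above doesn't say which tool was actually used, and something like taskset would do the same job from a shell.

```python
import os


def split_cores(cores, reserved=1):
    """Split a set of CPU ids into (batch_cores, realtime_cores).

    The lowest `reserved` ids go to the batch job (the transcode);
    everything else is left for the latency-sensitive processes.
    """
    ordered = sorted(cores)
    if not 0 < reserved < len(ordered):
        raise ValueError("need at least one core on each side")
    return set(ordered[:reserved]), set(ordered[reserved:])


def pin(pid, cores):
    """Restrict an already running process to the given cores (Linux only)."""
    os.sched_setaffinity(pid, cores)
```

On a four-core box, split_cores({0, 1, 2, 3}) gives ({0}, {1, 2, 3}); you would then call pin() once with the ffmpeg PID and the first set, and once per game server PID with the second set.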
I think the difference is simply that most processes only have a certain amount that needs accomplishing in a given unit of time. As long as they can get enough CPU time, and do so soon enough after getting in line for it, they can maintain real-time execution.
Very few workloads have that much to do for that long. But I would expect other similar workloads to present the same problem.
There is a useful stat which Linux tracks in addition to a simple CPU usage percentage. The "load average" represents the average number of processes that are either running on a core or queued waiting for one (on Linux it also counts processes stuck in uninterruptible I/O wait).
As long as the number is lower than the available number of cores, this essentially means that whenever one process is done running a task, the next in line can get right on with theirs.
If the load average is less than the number of cores available, that means the cores have idle time where they are essentially just waiting for a process to need them for something. Good for time-sensitive processes.
If the load average is above the number of cores, that means some processes are having to wait for several cycles of other processes having their turn, before they can execute their tasks. Interestingly, the load average can go beyond this threshold way before the CPU hits 100% usage.
I found that I can allow my system to get up to a load average of about 1.5 times the number of cores available, before you start noticing it when playing on one of the servers I run.
And whenever ffmpeg was running, the load average would spike to 10-20 times the number of cores. Not good.
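A rough sketch of that check in Python, in case it helps. The 1.5 threshold is just the rule of thumb from the comment above, not a general constant, and the function names are invented for this example. os.getloadavg() returns the 1, 5 and 15 minute averages on Unix-like systems.

```python
import os


def load_per_core(load, cores):
    """Return the load average divided by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def is_overloaded(load, cores, threshold=1.5):
    """True once the load per core exceeds the threshold."""
    return load_per_core(load, cores) > threshold


def current_status(threshold=1.5):
    """Check the live 1-minute load average against the threshold."""
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return one_minute, cores, is_overloaded(one_minute, cores, threshold)
```

With 8 cores, a load of 6.0 is comfortably under the line, while the 10-20x spikes described above would be flagged immediately.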
I've found that the silliest desktop problems are usually the hardest to solve, and the "serious" linux system errors are the easiest.
System doesn't boot? Look at error message, boot from a rescue disk, mount root filesystem and fix what you did wrong.
Wrong mouse cursor theme in some Plasma applications, ignoring your settings? Some weird font rendering issue? Bang your head against a wall exploring various dotfiles and rc files in your home directory for two weeks, and eventually give up and nuke your profile and reconfigure your whole desktop from scratch.
A couple of weeks ago I moved Firefox to one side. The window disappeared; Firefox was still running "somewhere" on my desktop, but was not actually being rendered to the screen. Killing the process and relaunching just resulted in it being rendered to this weird black hole. Log out of gnome and log back in? Same! Reboot? Same!
Ended up deleting its config folder and re-attaching to Firefox sync in order to have it working again. No idea what went wrong, nor will I ever most likely.
There really should be a hotkey for "move window to primary display" or somesuch. The worst is when just the top "cleat" of the window is inaccessible, making it impossible to simply move the window yourself.
Alternately, a CLI tool to just trash a specific app's window settings, or a system control panel that lets you browse these settings, would be incredible.
I feel like i had a disappearing window like that a lifetime ago and the fix was to change the resolution.
I don't know if that uncovered the void to the right or forced the window to reassign itself to usable space. But it worked then. Hell, it could have been windows for all I recall.
Yeah for some reason a single game ignores the system sound settings and goes straight to a line out. My system doesn't see that the game is outputting sound and I can't change it. (Arch with KDE)
Somewhat related on windows 11, for some reason teams volume will desync from system volume. I'll put system volume to 0 and still be hearing teams. It's the same audio device being selected. I don't understand why it would ever work that way but here we are
Oh my god, you've put it into (really nice) words something I've felt since quite some time now. I've no trouble (in fact even joy) when something major is fucked up. But all this GUI shenanigans, I've usually no idea where to even begin. The lack of structure and hierarchy completely flummoxes me. Or maybe I just don't have enough experience debugging userland stuff
Around 2017 I spent three days on and off trying to diagnose why my laptop running elementary OS had no wifi support. I reinstalled the wifi drivers and everything countless times. It worked for many days initially then just didn't one day when I got on the laptop. Turns out I had accidentally flipped the wifi toggle switch while it was in my bag. I forgot the laptop had one. Womp womp.
I had a friend come over to my place to fix her laptop's wifi. After about an hour searching for any setting in Windows that I could have missed, I coincidentally found a forum where someone pointed out this could be due to a hardware wifi switch...
Bricked my pc twice because of the bootloader and couldn't repair it. From now on i just nuke my system if something is fucky and have a shell script do the installing of packages etc.
My first Linux machine crashing. This was way before Redhat, Ubuntu, Arch, or OpenSUSE. This was installed from 60+ floppy disks on a 386-40 with 8MB of RAM.
This machine ran happily, but it crashed under heavy load. I tried generating the load with different applications, but could not pin it on any particular software. So the next thing I checked was the RAM. Memtest86 ran for a day without any problems. But the crashes still came. So I got the infrared camera from the lab to see if some hardware was overheating. Nope, this went nowhere, either.
Then I tested the harddisk. Read test of the whole HD went without problems. I copied the data on a backup medium and did a write and read test by dd'ing /dev/zero over the whole disk, and then dd'ing the disk to /dev/null. Nothing did show up.
I reinstalled the Linux, and it crashed again. But this time, I noticed that something was odd with the harddisk. I added a second swap partition, disabled the first, and the machine ran without problems. Strange...
So I wrote a small program that tested the part of the disk occupied by the old swap space: Write data, read data, and log everything with timestamps. And there was the culprit: There was an area on the HD where I could write any data, but when I read blocks from that area, a) It took a very long time for the read, b) the blocks I read were containing all zero, regardless of what I had written, and worst of all c) there was no error indication whatsoever from the controller or drive. Down at the kernel level, the zeroed blocks were happily served by the HD with an "OK". And the faulty area was right in the middle of the original swap partition.
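A test program along those lines might look roughly like the sketch below. This is not the original program (which isn't shown), just an illustration of the write, read back, and time-it idea; check_region and its parameters are made-up names. Two caveats: it overwrites the range it tests, and on a real disk the reads can be served from the page cache, so you would need to bypass or drop the cache (for example with O_DIRECT) to actually hit the platters.

```python
import os
import time


def check_region(path, start, length, block_size=512):
    """Write a known pattern over a byte range, read it back, and
    return a list of (offset, read_seconds, ok) tuples, one per block.

    WARNING: this overwrites whatever is stored in that range.
    """
    results = []
    fd = os.open(path, os.O_RDWR)
    try:
        # Write pass: fill each block with a non-zero byte derived from its
        # index, so an all-zero block read back later stands out as wrong.
        for i, offset in enumerate(range(start, start + length, block_size)):
            pattern = bytes([(i % 255) + 1]) * block_size
            os.pwrite(fd, pattern, offset)
        os.fsync(fd)

        # Read pass: time each read and compare against what was written.
        for i, offset in enumerate(range(start, start + length, block_size)):
            expected = bytes([(i % 255) + 1]) * block_size
            began = time.monotonic()
            data = os.pread(fd, block_size, offset)
            elapsed = time.monotonic() - began
            results.append((offset, elapsed, data == expected))
    finally:
        os.close(fd)
    return results
```

Logging every tuple with a timestamp would reproduce the symptoms described above: blocks where ok is False and the read time is far longer than for their neighbours.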
Working for a VoIP company in the early 2010s I rm -rf'd the /bin/ directory. As root. On a production server. On site.
I ended up booting from my phone (android app for iso booting) then manually copied over the files from another machine. Chrooted and some stuff was broken but rebuilding from the package manager reinstalled everything that was missing. Got the system back up in around 40 mins after that colossal screw up. Good fun and a great learning experience. Honestly, my manager should not have had me doing anything on a root shell with no training.
Around 2003-2004. I was still a bit of a Linux noob, just getting to grips with Gentoo.
Had two no-name WiFi adapters that weren't directly supported under Linux. Found some obscure forum thread that mentioned them, along with which lines in which source code driver to change to make these adapters work.
Maybe this goes a bit deeper than the question intended, but I’ve made and shared two patches that I had to apply locally for years before they were merged into the base packages.
Oh god I remember that. Luckily in 2003 my main computer was scraped together from discarded parts at my father's day job, so it was ethernet only
In 2024 on a laptop I still have wifi problems though. Most recently, if I closed and opened the laptop lid (suspend + resume), the wifi hardware just disappeared off the face of the kernel.
Turns out that the iwlwifi kernel module just irreversibly crashes when the laptop suspends and can only be fixed with a reboot.
So I had the fun task learning about systemd pre-suspend hooks to unload the driver before suspend and load it again on resume.
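For reference, one common way to do this is a hook script dropped into systemd's system-sleep directory (typically /usr/lib/systemd/system-sleep/), which systemd calls with "pre" or "post" as the first argument. The sketch below is only a guess at what such a hook could look like; the comment above doesn't show the actual setup, and commands_for is a made-up helper name. On some machines a dependent module such as iwlmvm has to be unloaded before iwlwifi will go, so treat the module list as an assumption too.

```python
#!/usr/bin/env python3
import subprocess
import sys

MODULE = "iwlwifi"  # driver named in the comment above


def commands_for(phase, module=MODULE):
    """Map the hook's first argument to the modprobe call to run."""
    if phase == "pre":
        return [["modprobe", "-r", module]]  # unload before sleeping
    if phase == "post":
        return [["modprobe", module]]  # reload after waking
    return []


def main(argv):
    # systemd passes "pre" or "post" as argv[1] and the sleep type as argv[2].
    phase = argv[1] if len(argv) > 1 else ""
    for command in commands_for(phase):
        subprocess.run(command, check=False)


if __name__ == "__main__":
    main(sys.argv)
```

The script has to be executable and runs as root, so there is no sudo in the commands.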
Fixed a typo in my /etc/fstab that prevented the NAS from mounting. I am a bear of little brain. But I'm also proof that you don't have to be some master hacker to successfully run Linux.
Saved me from reinstalling. Made me realise that there really should be an alternative to typing into fstab by hand, since us humans will make mistakes. Either that or make fstab not crash completely on an error but just skip it.
I have no idea how widespread it is among other distros, but ArchLinux's bootable install disk/iso comes with a genfstab command that snapshots your current mount points and outputs them as an fstab.
You still need to figure out where and how to mount everything yourself, but at least it saves you from most typos that could otherwise end up in the fstab file.
Not a Linux problem per se, but I had a 128GB disk image in an unknown .bin format which belonged to a proprietary application. The application only ran on Windows.
I tried a few things but nothing except Windows-based programs seemed able to identify the partitions, and while I could run the application in Wine, it ran into unimplemented functions. So after a bit of googling and probing the file, it turned out the format just had a 512-byte header which some Windows-based software ignored. After including the single-block offset, all the tools used in Linux started working flawlessly.
This is so arcane to me. Like, I more or less understand your high-level explanation, but then you gloss over "including the block offset" but how would one do that ??
Inspecting the file with a hex editor would give you lots of useful info in this case. If you know approximately what the data should look like, you can just see where the garbage (header) ends and the data starts. I've reverse engineered data files from an oscilloscope like this.
Well, in this scenario the image file had 512-byte sections, each of which is called a block. If you have a KiB (a kibibyte = 1024 bytes) it will occupy 2 blocks and so on...
Since this image file had a header with 512 bytes (i.e. a block) I could, in any of the relevant Linux mounting software (e.g. mount, losetup), choose an offset adding to the starting block of a partition. The command would look like this:
sudo mount -o loop,offset=$((header+partition)) img_file /mnt
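To make the arithmetic concrete, here is a small sketch of how that offset could be computed. It assumes the partition start is known in 512-byte sectors (the way fdisk usually reports it) and that the extra header is exactly one block, as described above; mount_offset and mount_command are hypothetical helper names, and the command is only built, not run.

```python
SECTOR_SIZE = 512  # block size described in the comment above


def mount_offset(partition_start_sector, header_bytes=512, sector_size=SECTOR_SIZE):
    """Byte offset of a partition inside an image that has an extra header."""
    return header_bytes + partition_start_sector * sector_size


def mount_command(image, mountpoint, partition_start_sector, header_bytes=512):
    """Build the mount invocation as an argument list (not executed here)."""
    offset = mount_offset(partition_start_sector, header_bytes)
    return ["mount", "-o", f"loop,offset={offset}", image, mountpoint]
```

For a partition starting at sector 2048, the offset is 512 + 2048 * 512 = 1049088 bytes, so the option becomes loop,offset=1049088.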
We had a system with a mirrored boot disk. One of the disks failed. And we were unable to boot from the other, because the boot device in OBP (~BIOS) pointed to a device-specific partition. When we manually booted from the live device, it was lacking the boot sector code, and wouldn't boot. When we booted from CDROM, the partitions wouldn't mount because the virtual device mapping pointed to the dead drive.
This was a gas futures trading system, and rebuild wasn't an option. Restoring from backup would have lost four hours of trades, which would be an extreme last resort.
A coworker and I spent all night on the box. We had a whiteboard covered with every stage of the boot sequence broken down, and every redirection we needed to (a) boot and (b) repair the system. The issue started mid-afternoon, and we finally got it back up by around 6:30 am.
Back in the day, I upgraded a Slackware install from kernel 1.3 to 2.0. That was a fucking adventure.
The fun part about back then was that if your machine wouldn't boot or if you couldn't get your modem or pppd working, you probably didn't have another internet connected device so you might have to drive somewhere with a computer...or try to figure it out through books.
Yep. I remember at the time I saw a lot of advice saying "you know you might want to seriously consider just installing your distro from scratch with a newer version." Tracking down all of the dependencies (some of which had to be installed as binaries) was a very manual process.
Edit: Oh and another fun aspect of that time period was that since downloads were so slow on a modem, if you wanted a newer version or to try out another distro, you would go and order a cdrom from a place like Walnut Creek.
Sometimes .... usually I just hit a wall because I don't know enough but I know enough to get myself in trouble .... so I just stop, reformat, reinstall and start all over.
About the biggest lesson I've learned from Linux is not to mess with too many things unless you want to learn about it and have lots of time in your hands.
Otherwise if you find a good distro for your needs, stick with it, don't change it, update and backup regularly.
I once broke my Ubuntu install by trying to convert it to KDE Neon, which reinstalled half my packages and left it in a basically unusable state. I then un-broke the install while upgrading multiple Ubuntu releases, which reinstalled the other half as well. It actually worked, and I'm still using that install.
It was some combination of both: the system would POST, get past the bootloader, attempt to initialize drivers and other standard starting packages and then immediately panic and drop into an emergency terminal (/TTS), with a failure to mount the root partition, from what I can recall. It tried a couple of times and then there was an error message that said: "Bailing out, you are on your own, good luck"
Mine is close to that. I still had a working libc, but the dynamic library for C++ programs wouldn't load, so most of the Gentoo tools and several other things I expected simply crashed on startup.
Found enough working programs to get the library restored and remove the bad arch flags from my configuration to start another emerge world.
After that, I was pretty confident that I could run Linux at least as confidently as I had previously run WinNT 4.
Generally if you remove a file, it won't affect programs that already have it open. So if you delete libc, hope that you don't lose power. If worse comes to worst, you'll need to pull the drive and mount it on another machine.
Getting a Palm Pilot to make a live connection to the internet over infrared (Red Hat Linux). That was circa 2004, and I spent 10 hours, all night, on it.
A couple months ago, I made a Palworld server box out of a spare motherboard assembly (mobo, processor, ram) from a computer I had recently upgraded.
I didn't have any spare drives lying around, so I plugged in 7 USB flash drives and made them into a RAID array. Not a true RAID array, but a BTRFS filesystem with volumes spread onto each flash drive, with the data redundancy set to raid1, and the metadata redundancy set to raid1c3.
It worked... in the sense that I never lost any data. It certainly didn't work in the sense of having good uptime.
The first problem was getting it to boot right. The boot line in GRUB had "root=UUID=..." instead of a specific drive named. That is normal. However, in BTRFS multi-volume filesystems, all the volumes have the same UUID. So the initrd was only waiting for a single drive matching that UUID, then trying to mount it as the root filesystem. This failed, because the kernel had not yet set up the other 6 USB drives, and this BTRFS filesystem needs all 7 volumes present. Maybe 6, if you used the "degraded" mount option.
The workaround was to wait for this boot process to fail, at which point you get dropped into an initrd shell. Then, you look at all the drives and make sure they're all there. And then... I don't exactly remember what happened next. I think it was some black magic that erases your mind in the process. I somehow got it booted from the initrd shell.
Installing Steam and the Palworld server worked ok, and it even ran for a few hours before crashing overnight.
The next morning, I tried rebooting it. Unfortunately, the USB drives weren't all appearing. Turns out the motherboard had some bad USB ports, some sometimes-bad USB ports, and a maybe-bad PCIe bus, because the PCIe USB expansion card I plugged in had a weird problem that it had never had before.
I found the most reliable ports and plugged the drives in there. But you can't just replug them in the initrd. It doesn't have USB hotplug support. So each time it tried to boot with not all the drives there, I restarted it again until one time I finally had all the drives.
I changed the GRUB boot line to "root=/dev/sdg1" . This made it wait for all the drives to load, in any order, and whichever one was last would be mounted as the root filesystem (but the kernel would automatically include all the others too, since they were successfully initialized).
The bad USB ports kept bringing down the server every day or two. I bought a cheap NVMe drive and added it to the BTRFS filesystem, and then removed all the USB drives except the largest. That fixed the reliability. It's been like that since.
Now, to boot the server, all I have to do is change the GRUB boot line to "root=/dev/sdb1" . Since the NVMe drive is much faster than the USB drive, it always initializes first. If the initrd waits for sdb2, then it will always have both drives initialized when it tries to mount the root filesystem.
I could add that to the grub.cfg, or come up with some other more permanent solution, but I'm not planning on rebooting this server ever again. My friends fell off Palworld, and I gave a shutdown date that's about a week away. And the electricity is pretty reliable here.
Still trying to use Linux Mint on my 2013ish MacBook Pro as a daily driver. Got the MacBook for free and it wouldn't update anymore, so installed Linux Mint and it's been great for the most part. Still trying to access my NAS on it though. Having to manually mount drives is a new experience for me, and it's not coming to me intuitively. Reached out via IRC, got some help but still working on it.
More than a decade ago a user came into #ubuntu-server on Freenode (now libera.chat) and said that they had accidentally run "rm -rf /* something*" in a root shell.
Note the errant space that made that a fatal mistake. I don't remember how far it actually got in deleting files, but all of /bin/ /sbin/ and /usr/ were gone.
He had 1 active ssh connection, and couldn't start another one.
It was a server that was "in production", was thousands of miles away from him, and which had no possibility for IPMI / remote hands.
Everyone (but me) in the channel said that he was just SoL and should just give up.
I stayed up most of the night helping him. I like challenges and I like helping people.
This was in the sysv-init (maybe upstart) days, and so a decent number of shell scripts were running, and using basic *nix commands.
We recovered the bash binary by running something along the lines of
(If you can access "lsof" then "sudo lsof | grep deleted" will show you any files that are open, but also "deleted". You may be surprised at how many there are!)
But bash needed too many shared libraries to make that practical.
Somehow we were able to recover curl and chmod, after which I had him download busybox-static. From there we downloaded an Ubuntu LiveCD iso, loop mounted it, loop mounted the squashfs image inside the iso, and copied all of /bin/ , /sbin/ , /etc , and so on from there onto his root FS.
Then we re-installed missing packages and fixed up /etc/ (a lot of important daemons, including the one that was production critical, kept their configuration files open, and so we were able to use lsof to find the magic symlinks to them in /proc/$pid/fd/ and just cp them back into /etc/).
We were able to restart openssh-server, log in again, and I don't remember if we were brave enough to test rebooting.
But we fucking did it!
I am certainly getting a lot of details wrong from memory. It's all somewhere at irclogs.ubuntu.com though. My nick was / is Jordan_U.
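The /proc/$pid/fd trick from this story is easy to poke at yourself. Below is a minimal Python sketch of roughly what "lsof | grep deleted" finds, assuming a Linux-style /proc where the link target of a deleted file ends in " (deleted)". It is an illustration only, not what was run that night, and find_deleted_open_files is a made-up name.

```python
import os

DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(proc_root="/proc"):
    """Yield (pid, fd_path, original_path) for open-but-deleted files."""
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or we lack permission
        for fd in sorted(fds):
            fd_path = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(fd_path)
            except OSError:
                continue
            if target.endswith(DELETED_SUFFIX):
                yield int(entry), fd_path, target[: -len(DELETED_SUFFIX)]
```

Each fd_path can be opened and read like a normal file for as long as the owning process keeps it open, which is why copying it back to its original location restores the content.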
I just told this story to a friend but I did the standard rm -rf * as root while in the / directory. And this was back in the day where we nfs mounted every other machine and root privileges propagated through NFS. I think it was on the 2nd or 3rd machine when I thought -- "this seems to be taking longer than I thought".
So I mostly fried the SSD by using it to write and rewrite ML checkpoints and logs, which in turn made the device read-only. I somehow managed to migrate to a different SSD, probably using Clonezilla or something, but that messed up the bootloader, so I installed rEFInd in a new partition, configured it, and voila, it works. It's scary because you need to do everything without seeing your system even half alive anywhere along the process, but it's not actually hard, just copying data and installing/configuring a bootloader. But for a then 20-year-old at his more or less first job, my head was on fire for the 1.5 days this took.
By far the most difficult single thing that I've ever had to fix that actually had to do with the system.
I now don't flood my SSDs with data that is constantly rewritten.
Learned how drivers worked and fixed a driver for a USB to I2C chip. It's still buggy but at least it sorta works now.
Some more details: I was using a CH347 (USB to UART/SPI/I2C) and there was an open source driver written for a previous chip version. The original dev had hardcoded the bulk I/O endpoint indices. The only change I had to make was to iterate over the endpoints and search for the correct ones. But at first, I didn't understand anything about how the USB subsystem worked and how drivers were loaded. All I could tell was the USB device was correctly detected but the I2C driver wasn't being loaded, despite proper udev rules, correct vendor/product IDs, etc.
Upgrading the system I removed glibc from the system (Debian). apt wasn't working, etc. Had to manually fix dependencies and everything. Currently my working OS so all fixed.
This doesn't fit the question exactly but I feel it's in the same spirit, and a kind of interesting solution, I think.
Back in the early days of scryptcoin mining, I had a few gpu mining rigs running Linux. Occasionally they would hard lock and I'd have to power cycle them.
What I ended up doing is getting some USB to serial adapters and writing a python script that ran on startup and would send a character over serial at a set interval in a loop. That was hooked up, if I recall correctly, to an attiny85 using softwareserial and some ttl to rs232 conversion. It would listen over serial and if it didn't receive anything within a reasonable time frame it'd flip a relay that cut mains power to the pc, then flipped it back. A deadman's switch, of a sort. It worked great!
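The PC side of a setup like that can be tiny. Here is a hedged sketch of the heartbeat loop, written against any writable binary file object so it doesn't depend on a particular serial library; the original script isn't shown, so send_heartbeats, the "." byte and the intervals are all assumptions.

```python
import time


def send_heartbeats(port, interval=1.0, count=None, beat=b"."):
    """Write one byte to `port` every `interval` seconds.

    `port` is any writable binary file object. `count=None` loops forever,
    which is what a watchdog feeder would normally do.
    """
    sent = 0
    while count is None or sent < count:
        port.write(beat)
        port.flush()  # push the byte out now instead of buffering it
        sent += 1
        if count is None or sent < count:
            time.sleep(interval)
    return sent
```

In practice you would open the adapter's device node (something like /dev/ttyUSB0, which is a guess) in binary mode and pass it in, after setting the baud rate with a tool or library of your choice. If the machine hard locks, the loop stops and the microcontroller on the other end notices the silence.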
I remember a story about someone who did something similar with a server that kept hanging. They rigged up a second computer to ping it over the local network and if there was no response for a certain amount of time, the computer would eject its CD-ROM tray which had been lined up neatly with the reset button on the server.
Since it couldn't eject fully, it then retracted, having rebooted the server.
I assume that was a temporary fix... and it was probably a Windows server tbh.
The closest I've done is having a job run every 12 hours checking if a process was over a certain memory usage (memory leak) and restarting it if it was. That was also Windows, but the same thing on Linux wouldn't have been difficult... not that the Linux servers ever had that problem.
This will feel extremely simple for some folks, but I was having a hell of a time getting Steam games that had previously worked through Proton running. I scoured the internet for solutions after trying to install proton-ge and testing multiple versions. Eventually someone had the galaxy brain idea to suggest installing WINE. For some reason, that fixed the problem real good.
Are you including back in the day when we had to use windows device drivers via ndiswrappers?
I've managed to remove a critical library once but did manage to extract it from an RPM on another machine and manually install it. That was good enough to get me to the point where I could yum reinstall.
Pre-linux we had an HP workstation where the disc drive died and of course we had no backups. I managed to frankenstein the disc by connecting the platters on the broken disc to the circuit board of a working disc. This worked and I was able to back up the disk and reload on to a new drive.
And then we bought an 8mm tape drive for backups and I had to port some drivers to HP-UX to get it to work. But we had awesome backups after that!
It's not the biggest issue I managed to fix, but it was definitely the hardest to figure out a fix for:
Whenever I would boot up any game on my Linux machine I would have microstutters every so often, and they were frequent and lengthy enough to be very annoying, and thus started my 2 month long quest to figure out what was going wrong.
To cut a long story short, the compositor I was using had suddenly decided to do a breaking update and change the names of the backends they were using.
single gpu vm passthrough. took a few days for troubleshooting, and i didnt even want to get it to be undetectable by game anticheat, i hear that needs building your own kernel for some advanced detection methods.
I've generally had good luck with hardware and things just worked under linux. But one day I upgraded a few machines on my network to 2.5G ethernet. Several already had the ports, but my little NUC NAS box didn't, so I installed a 2.5G usb ethernet dongle. No matter what I did, I couldn't get it to work. It would show up and NM would act like it was up and there were no errors or anything, but it just wouldn't actually function.
Eventually, I found out that it has a built in USB data partition that contains the drivers for windows. The card was coming up as a usb disk first when the hardware was assigned and not a network card which it should have been.
I had to write a blacklist for the usb modules first, which I had done before, but I also had to write a udev rule to automatically add the network card and driver on boot. It wasn't that difficult to actually do, but I had just never had to do anything with udev rules before. Took me a good three days of troubleshooting to finally get everything to work correctly on boot.
I run a distro with OpenRC instead of systemd, so I had to gain some understanding of udev permissions for USB devices and come up with my own udev rules for Steam because I couldn't follow Valve's setup guide.
VR pretty much just worked for me with my vive. Had some issues with weird stuttering and tearing but I managed to find a solution in some config file.
I have had an issue for years that I couldn't pinpoint to a root cause (I'm strongly inclined to think it's a kernel issue). I bought a CM Storm Quickfire TK keyboard with ABNT2 layout.
The issue is: every time I type any key that is not a letter or number, the computer freezes for a full ten seconds before acknowledging the press and showing the character. Tried a bunch of Linux distros through the years and the issue persists. On Windows it works flawlessly.
I just gave up trying to debug the problem, but I still have this hole in my heart where the cause of this issue lives.
while playing around with face/fingerprint unlock for my laptop, I messed up pam (Linux Pluggable Authentication Modules) and no passwords were working anymore except for the root account. At first I was still on my account, but then I stupidly rebooted and could only log in as root. After so many config edits, I gave up and instead booted up windows (my laptop's dual booted), setting up a new linux install in VirtualBox, and then copying over the PAM config files from the vm to the actual Linux install.
and it all somehow worked!
I am now facing another issue which I'm gonna mention here in the hope somebody has already run into it: after updating to KDE Plasma 6, tap to click works on my touchpad, but actually, physically, pressing on the trackpad doesn't work. I can hear the pad's physical clicking noise, but nothing happens OS-wise.
I can't remember the details anymore, but for a year or two I had a bad run of absolutely hosing my boot config and leaving myself in a state where the system either couldn't find its kernel or couldn't find the root partition and would drop me into an initramfs emergency shell. I got pretty good at booting into a live environment, getting all my dm-raid and lvm disks discovered, mounting all the relevant file systems in the right place, chrooting in and rebuilding the pieces that were broken.
At one point, my laptop's Nvidia drivers were all tangled up. The package dependency graph had portions of the screwy, we-don't-need-your-stinking-standard-version-scheme, binary blob drivers both in front of and behind the currently installed version. I had to basically gut everything Nvidia related, by performing surgery on the filesystem and Apt database, and then build it back. At one point, I was flying in text mode only; not hard, but worth mentioning since it shows how deep a cut this was.
Related: getting the above nonsense to cooperate with containers that also want to do GPU things. As much as I wanted this to work with a one-and-done solution (e.g. docker-compose or a Bash script), you can't get away with mounting the host Nvidia driver and tools via volumes. The software on the container image itself must be built against the specific version you're running - no exceptions. So, I now rebuild these containers (from the author's git repo) after every Nvidia package upgrade, which is a stupid way to achieve containerization. If Nvidia had a stable API/ABI across releases, this would just work. /rant
windows update kept downloading these bloated "updates" that included brand new software that I didn't want or use, broke my settings, added a bunch of spyware, adware and other shit and slowed down my system
installing linux fixed that instantly and permanently
Some programs still relying on python2 when the operating system has long since upgraded to python3.
Not really an issue per se, I just had to switch those apps over to using the flatpak version which would have it installed as needed. (I'm looking at you GIMP)
A Gentoo upgrade package list with over 100 packages and conflicts all over the place. Then do it again when the list grows to the same size in a few months.
Installing a hadoop cluster across 5 machines. I wouldn't say I fixed it, but I made it so it wouldn't collapse until long after I'd left that company.
XFree86 was sometimes a mess. And I did not have a browser anymore when it refused to start. So man pages only.
I once rm -rf'd all the DB files of a running database. I recovered the files via their inodes, since they were all still held open by the running database. That was a mess.
I screwed up permissions on an LXC container in Proxmox by converting it from unprivileged to privileged (against recommendations) and had to mount it offline and write a script that used find with -exec to run chown, changing all the UIDs and GIDs from the shifted unprivileged ones to the standard host-level ones.
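For anyone who ends up in the same spot: on Linux, a deleted file that a process still holds open stays reachable through /proc/<pid>/fd, where its symlink target gets a " (deleted)" suffix. Below is a minimal Python sketch of that recovery, not what the commenter actually ran. The function names and the destination directory are hypothetical, and copying from a database that is still writing can give inconsistent files, so stop or quiesce writes first if you can.

```python
import os
import shutil

DELETED_SUFFIX = " (deleted)"  # suffix the kernel appends to unlinked open files


def original_path(link_target):
    """Return the pre-deletion path for a /proc fd link target, else None."""
    if link_target.endswith(DELETED_SUFFIX):
        return link_target[: -len(DELETED_SUFFIX)]
    return None


def deleted_open_files(pid, proc_root="/proc"):
    """List (fd_link, original_path) pairs for deleted files held open by pid."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    for fd in os.listdir(fd_dir):
        link = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(link)
        except OSError:
            continue  # the descriptor was closed while we were looking
        path = original_path(target)
        if path is not None:
            found.append((link, path))
    return found


def recover(pid, dest_dir):
    """Copy every deleted-but-open file of pid into dest_dir (hypothetical layout)."""
    os.makedirs(dest_dir, exist_ok=True)
    for link, path in deleted_open_files(pid):
        # Prefix with the fd number so two files with the same basename don't collide.
        name = f"{os.path.basename(link)}_{os.path.basename(path)}"
        with open(link, "rb") as src, open(os.path.join(dest_dir, name), "wb") as dst:
            shutil.copyfileobj(src, dst)
```

Run it as root (or as the database's user) with the database PID, e.g. `recover(1234, "/root/rescued")`, and keep the database process alive until the copy is done: once it exits, the inodes are gone for good.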
Luckily this was in my own lab so it was a (mostly) harmless learning experience.
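A rough Python equivalent of that find/chown pass, as a sketch only. It assumes the usual Proxmox mapping where container IDs 0-65535 are shifted up by 100000 on the host; check your own idmap before trusting those numbers. The function names are made up for illustration.

```python
import os
import stat

SHIFT = 100000     # assumed idmap offset (the usual Proxmox default)
ID_RANGE = 65536   # assumed size of the mapped ID range


def unshift(owner_id, shift=SHIFT, id_range=ID_RANGE):
    """Map a shifted UID/GID back to its host-level value; leave others alone."""
    if shift <= owner_id < shift + id_range:
        return owner_id - shift
    return owner_id


def unshift_tree(root, dry_run=True):
    """Fix ownership under root; return the (path, uid, gid) changes found."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)

    changed = []
    for path in paths:
        st = os.lstat(path)  # lstat/lchown so symlinks themselves are fixed
        uid, gid = unshift(st.st_uid), unshift(st.st_gid)
        if (uid, gid) == (st.st_uid, st.st_gid):
            continue
        changed.append((path, uid, gid))
        if not dry_run:
            os.lchown(path, uid, gid)
            if not stat.S_ISLNK(st.st_mode):
                # chown can clear setuid/setgid bits, so put the mode back.
                os.chmod(path, stat.S_IMODE(st.st_mode))
    return changed
```

Call `unshift_tree("/mnt/ct-rootfs")` (hypothetical mount point) first to see what would change, then again with `dry_run=False` as root once the list looks sane.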
At some point I installed the Rust implementation of coreutils from the AUR. It worked for a long while, until some SSL vulnerability was discovered and everyone had to update the library. As you can imagine, without working coreutils the system was hard to use. Troubleshooting was also a pain in the ass because who would blame coreutils of all things? :P
Well, the command was designed to fix the most common Windows problems like corrupted files and weird settings. So of course help lines are going to ask to run it. It was made to automatically fix problems.
/var was almost full and I ran pacman -Syu and left the comp to go and make dinner.
This was also at the time Plasma 6 was rolling out.
It was a big upgrade along with a new kernel.
Download seemed to go smoothly, but during installation, it didn't have enough space to unpack stuff and there was no kernel available to boot.
Even the "previous kernel" options didn't work.
It wasn't too hard to fix because I had learnt how to use pacman in a chroot env, but my dinner got cold by the time I was ready to eat.
I still haven't learnt the lesson though. This is the third time I've had a problem with paccache and I still haven't set up a removal daemon/cron job.
For me it was migrating my Arch install from EXT4 to ZFS. GRUB had to be configured in particular ways to get it to work with ZFS and I didn't do it properly so it wouldn't/couldn't boot.
Then I updated ZFS to a version that wasn't supported by GRUB yet so I chrooted into my installation to switch to Systemd-boot with Unified Kernel Images. Now I still can't figure out how to add a boot entry for Windows. I followed the proper steps I think but selecting the Windows entry just reloads Systemd-boot.
Not fixed, but there is an Arch problem that is and will always be the bane of my existence.
For some reason when I click with the trackpad buttons the touchpad gets frozen for like a second (it's like they are recognised by the system as keyboard buttons, I have enabled that option to temporarily disable it when using keyboard).
I've spent hours and days going through the libinput documentation and some synaptics libraries, even legacy ones. It is to this day the only problem that has led me to reinstall my system, but the problem remains.
It's not even like I have some niche setup, I mean, surely there must be thousands of Arch users running with a ThinkPad X1 Carbon Gen 7, and surely not every single one of them must be running it like this, right?
It has come to a point where I just gave up and got used to my system as is, but I'm sure I would be running fanfare if some day I am able to fix it.
My Mint install won't let sound through my sound card. Drivers are there, it knows exactly the brand and model of card and shows it, it even knows when I plug/unplug stuff from it, but 0 sound, ever.
The solution?
Just plug my headphones into my new speakers that have their own DAC, anyway.
I had problems with the session manager
My lightdm was broken and I tried to fix it.
Disable, enable, start, stop the service in systemctl
I have changed the configuration of lightdm
I've tried different lightdm greeters
But the problem wasn't with lightdm, it was xorg.
I don't use xorg, and now I use terminal session manager "ly"
It will work even without xorg
I was trying to set up Timeshift for system snapshots on a work computer with Ubuntu. It didn’t work for some reason so I tried to first get rid of it. After uninstalling it, I wanted to remove what I thought were remains of TS files in /run/timeshift, but the root partition was still mounted there, so I rm -rf'd the whole root, luckily except for home. And the computer has a BIOS password with secure boot, so it's off to the IT department to explain what I’ve done, that is... Or is it?
The /boot and the initramfs were still in place, so it was dropping me to the emergency shell when trying to boot. Connecting an external USB drive to see if I can mount it, hmm, doesn’t show up. Quick search on my private computer for what kernel modules are required for USB storage, modprobed a couple of xhci_* and bang, was able to mount it. I saved the result of ls -l /dev/disk/by-uuid on the drive and moved to my private PC, where I created a VM and installed the exact same Ubuntu with the exact same config (LVM+LUKS). After it was done I copied all of the / content to the (now formatted as ext4) external drive using cp -a, then edited fstab and crypttab to put the proper UUIDs there, and set up the hostname and user account accordingly. Then I moved back to the borked laptop, copied the newly installed Ubuntu back to the root partition, rebooted and it worked perfectly on the first try and continues to work. All of that roller coaster in just a single hour.
I once removed most of X by trying to remove Gnome dependencies and it led to an interesting couple of hours but I did have a working system when I was done.
There were countless dependency bugs and broken systems but at least I learned how to use the Gentoo Forum and also a lot of how Linux works.
Accidentally put grub on the wrong partition on the device, which it was not happy with. Was able to copy some files over, manually boot the OS, and reconfigure grub to be in the right partition, took me about 2 hours? Then I did it again on a different machine, and speedran it lol
Some of the crap I had to do back in the late 00s to get wifi, sleep and power management even barely working on some machines felt like the hardest thing at the time. I wonder how I’d fare with those issues today, 17 years later, knowing quite a bit more about the underlying OS and working with the OS daily… I don’t know that I’d qualify that as difficult more than it was extremely tedious and a bunch of trial and error of configuration options I didn’t know anything about.
If we’re talking about modern day… not so much honestly. btrfs snapshots saved my ass a couple of times, the rare issue I encounter I just rollback and wait for an upstream fix, and the rest I typically ignore or use something else. Everything tends to run quite smooth for me as a general rule, though.
I feel seen here. I was building an Ubuntu server and messed up the firewall settings so badly that I couldn't get an internet connection. After hours of trying to get back to where I was, I gave up and plan to just start from scratch next time.
Is there a way of taking system snapshots with Linux?
I recently managed to recover from a corrupted libstdc .so. Turns out I shouldn't have bothered because it was a Pi and, of course, the SD card had shit the bed, but I was pretty happy with myself for like 30 minutes.
Jumping from the default kernel with zfs to the xanmod kernel using a manually compiled version of zfs. I don't remember a whole lot but it was quite... interesting. Next would be a suddenly vanished efi partition and my f* mainboard refusing to boot ZBM.
Bonus: my currently still unfixed problem is a very weird freezing/stuttering of the whole OS and the only (useless) "lead" I have is workqueue: fill_page_cache_func hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Used to be messing with kernel arguments and installing/tweaking boot parameters. That was until Grub broke, I learned systemd-boot and chrooting into the system via live USB
Now if I break anything it's just a matter of "sigh, let me get the USB and type a few commands"
I managed a CentOS system where someone accidentally deleted everything from /usr, so no lib64, and no bin. I didn't have a way to get proper files at the time, so I hooked the drive up to my Arch system, made sure glibc matched, and copied yum and other tools from Arch.
Booted the system, reinstalled a whole lot of yum packages, and... the thing still worked.
That's almost equivalent to a reinstall, though. As a broke college student, I had a laptop with a loose drive, that would fall out very easily. I set it up to load a few crucial things into a ramdisk at boot, so that I could browse the web and take notes even if the drive was disconnected, and it would still load images and things. I could pull the cover off and push the drive back in place to save files, but doing that every time I had class got really tiring, so I wanted it to run a little like a live system.
I have taken a drive with filesystem issues, mounted it on a different machine, and either backed up the data I wanted to keep or copied files over to make the original machine runnable.
Rescuing home partition from ZFS, actually that doesn't really count since I did have to reinstall (was no longer booting), but recovering the Home partition from ZFS and to the other ext4 drive was much harder than it should've been and that's why I would never recommend people use ZFS.
tell me, please, who thought it was a good idea for a filesystem to remember the last machine it was mounted from and refuse to let itself be mounted by a different operating system instance even if all the hardware is present?
My first home server would get lost on the network every week, at different times and without any apparent reason. I performed hard resets by unplugging and plugging it back in.
After several months, I decided to connect a screen to it, and I initially thought it had hung up, but it hadn't. After some investigation, I discovered that every time my router obtained a new dynamic IP address, the server lost its network connection, requiring a reset. I wrote a script to check the network connection every minute, and if it's lost again, it will be reset.
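A watchdog along those lines could look like the Python sketch below. This is not the commenter's script: the gateway address, interface name and failure threshold are assumptions, and bouncing the interface with `ip link` is just one possible "reset" (swap in a DHCP renew or a reboot if that is what your box needs).

```python
import subprocess
import time

GATEWAY = "192.168.1.1"  # hypothetical router address
IFACE = "eth0"           # hypothetical interface name


def is_reachable(host):
    """Send one ping with a 2 second timeout; True if the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_interface(iface):
    """Bring the interface down and up again (needs root)."""
    subprocess.run(["ip", "link", "set", iface, "down"], check=False)
    subprocess.run(["ip", "link", "set", iface, "up"], check=False)


def step(failures, reachable, threshold=3):
    """Return (new_failure_count, should_restart) after one check."""
    if reachable:
        return 0, False
    failures += 1
    if failures >= threshold:
        return 0, True  # reset the counter after triggering a restart
    return failures, False


def main():
    """Check once a minute, forever; restart after repeated failures."""
    failures = 0
    while True:
        failures, restart = step(failures, is_reachable(GATEWAY))
        if restart:
            restart_interface(IFACE)
        time.sleep(60)
```

Calling `main()` from a service or an @reboot cron entry keeps it running. Requiring a few consecutive failures before acting avoids resetting the network over a single dropped ping.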
Installed Fedora on btrfs and upgraded from 38 to 39 a week after installation. Everything broke so badly that even the SSD it was on locked up, not just the filesystem. The SSD was new, btw.
Hmm I have come up with a bunch of neat solutions over the years. Where to start?
One time I broke the sudoers file on a distro without a root account, thoroughly locking myself out. I used docker -v /:/chroot to get myself root access to my root filesystem where I fixed the sudoers file. Protip: always use visudo
I don't know how I fixed it, but KDE Plasma 5.whatever on MX was acting up. It would let me log in, but I couldn't do much else. It wouldn't respond to my clicks or anything. Thankfully I could open Yakuake and install a different desktop environment. Then, one day while I was backing up files to do a reinstall, it started working again. I could use Plasma without issues. I have no clue what fixed it, though.
It also came with a non-issue of now my laptop won't auto turn on every time I open it up, but I'll take that over having to reinstall and set things back up.