My first Linux machine crashing. This was way before Redhat, Ubuntu, Arch, or OpenSUSE. This was installed from 60+ floppy disks on a 386-40 with 8MB of RAM.
This machine ran happily, but it crashed under heavy load. I checked out causing the load by using different applications, but could not nail it to a certain software. So the next thing I checked was the RAM. Memtest86 ran for a day without any problems. But the crashes still came. So I got the infrared camera from the lab to see if some hardware overheats. Nope, this went nowhere, either.
Then I tested the harddisk. Read test of the whole HD went without problems. I copied the data on a backup medium and did a write and read test by dd'ing /dev/zero over the whole disk, and then dd'ing the disk to /dev/null. Nothing did show up.
I reinstalled the Linux, and it crashed again. But this time, I noticed that something was odd with the harddisk. I added a second swap partition, disabled the first, and the machine ran without problems. Strange...
So I wrote a small program that tested the part of the disk occupied by the old swap space: Write data, read data, and log everything with timestamps. And there was the culprit: There was an area on the HD where I could write any data, but when I read blocks from that area, a) It took a very long time for the read, b) the blocks I read were containing all zero, regardless of what I had written, and worst of all c) there was no error indication whatsoever from the controller or drive. Down at the kernel level, the zeroed blocks were happily served by the HD with an "OK". And the faulty area was right in the middle of the original swap partition.