Hello fellow selfhoster,
I was wondering how important it is to have ECC memory. I want a server that is really reliable, and ECC memory pops up as one of the must-haves for reliability. But from my research it seems quite expensive to get a setup with ECC memory. How important is ECC memory for a server I rely on?
So far I have been rocking a Raspberry Pi 4, which has ECC memory.
Anecdotal evidence, but I have been self hosting for 3 years and never had a single problem without ECC memory.
I think the thing is if you are serving thousands of requests constantly like a "real website" and/or have financial reliance on it then ECC memory becomes a huge ROI draw. If you are running a media server + a few services for one household, ECC memory is very overkill in the vast majority of cases and you won't see a difference.
If you're using memory for storage operations, especially for something like ZFS cache, then you ideally want ECC so errors are caught and corrected before they corrupt your data, as a best practice.
In the real world unless you're buying old servers off ebay that already have it installed the economics don't make sense for self hosted. The issues are so rare and you should have good backups anyways. I've never run into a problem for not using ECC, been self hosting since 2010 and have some ZFS pools nearly that old. I exclusively run on consumer stuff with the exception of HBAs and networking, never had ECC.
For large storage, ECC helps a lot for avoiding storage corruption. In combination with a redundant architecture in zfs it is almost bullet-proof. (Make no mistake, redundant storage is no substitute for backups! You still need those.)
One option is to use comparatively old server hardware. I have some pretty old stuff (around 10 years) that uses DDR3 RAM, which is dirt cheap, even with ECC (somewhere around 1 €/GB). And it will be fast enough by far for most applications. The downside is higher power consumption for the same performance. The Dell T320 I have with eight 3.5" SAS disks and 32 GB RAM uses some 140 W of power, to give you a ballpark figure.
Yea I have been trying to avoid high power consumption as power is quite expensive here. I think for my case non ECC + ZFS + backup will suffice. Thanks!
Best is to use a file system with checksum-based error detection and correction to mitigate the rare non-ECC memory issue. I use btrfs, which does a good job of that.
Think of it this way: if a cosmic ray happens to land on a silicon cell in your RAM and flips a random bit from 1 to 0, how screwed will you be? If the answer is "meh, I'll just restart the computer/restore corrupted data from backup" then you probably don't need it.
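To make the checksum idea concrete, here's a minimal sketch of how a stored checksum exposes a single flipped bit. SHA-256 is only a stand-in here, not necessarily what btrfs or ZFS use, and `flip_bit` and `checksum` are made-up helper names. Note that this only shows detection; actually repairing the block needs a second good copy somewhere.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulates a single bit flip)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def checksum(data: bytes) -> str:
    """Stand-in for a per-block filesystem checksum."""
    return hashlib.sha256(data).hexdigest()


block = b"family photos, tax records, and other things worth keeping"
stored_checksum = checksum(block)  # recorded when the block was written

damaged = flip_bit(block, 42)  # one bit changes after the checksum was recorded

block_ok = checksum(block) == stored_checksum      # intact block verifies
damaged_ok = checksum(damaged) == stored_checksum  # damaged block does not
print(block_ok, damaged_ok)
```

The catch is that the checksum only protects data from the moment it is computed, so a flip that happens in RAM before that point goes unnoticed.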
My understanding is that as the amount and speed of memory increases, the usefulness of ECC in detecting and preventing the types of errors that can cause a crash or corrupt a file goes up.
But for home use it's probably more useful to focus on storage redundancy and backups, or a UPS to keep things running during power blips/outages.
According to source, ECC has to 'kick in' about 3700 times per year per DIMM module. That's about 10 times per day per DIMM.
Depending on how important your server is to you, you'll either need it (in case of important data you absolutely don't want to lose) or forget about it (just a hobby project, nothing serious).
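For anyone who wants to check the per-day number, the arithmetic is just the quoted yearly figure divided by 365. The variable names are mine, and 3700 is simply the figure quoted above, not something I measured.

```python
corrections_per_year = 3700  # figure quoted above, per DIMM
days_per_year = 365

corrections_per_day = corrections_per_year / days_per_year
print(f"{corrections_per_day:.1f} corrections per day per DIMM")  # prints 10.1
```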
The answers in this thread are surprisingly complex, and though they contain true technical facts, their conclusions are generally wrong in terms of what it takes to maintain file integrity. The simple answer is that ECC ram in a networked file server can only protect against memory corruption in the filesystem, but memory corruption can also occur in application code and that’s enough to corrupt a file even if the file server faithfully records the broken bytestream produced by the app.
If you run a Postgres container, and the non-ecc DB process bitflips a key or value, the ECC networked filesystem will faithfully record that corrupted key or value. If the DB bitflips a critical metadata structure in the db file-format, the db file will get corrupted even though the ECC networked filesystem recorded those corrupt bits faithfully and even though the filesystem metadata is intact.
If you run a video transcoding container and it experiences bitflips, that can result in visual glitches or in the video metadata being invalid… again even if the networked filesystem records those corrupt bits faithfully and the filesystem metadata is fully intact.
ECC in the file server prevents complete filesystem loss due to memory bit-flips corrupting key FS metadata structures (bit-flips in the storage itself are already handled pretty well by modern checksumming fs’s like ZFS). And it protects from individual file loss due to bitflips in the file server. It does NOT protect from the app container corrupting the stream of bytes written to an individual file, which is opaque to the filesystem but which is nonetheless structured data that can be corrupted by the app. If you want ECC-levels of integrity you need to run ECC at all points in the pipeline that are writing data.
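A toy sketch of that failure mode, in case it helps: the bit flips in the app's memory before the write, so the store checksums the already-corrupted bytes and happily verifies them on read. `ChecksummingStore` is a made-up stand-in for the file server, not a real filesystem API, and the JSON record is just an illustrative payload.

```python
import hashlib
import json


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulates a single bit flip)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


class ChecksummingStore:
    """Hypothetical stand-in for a checksumming file server."""

    def __init__(self):
        self._blocks = {}

    def write(self, name, data):
        # The checksum is computed over whatever bytes the app hands over.
        self._blocks[name] = (data, hashlib.sha256(data).digest())

    def read(self, name):
        data, digest = self._blocks[name]
        if hashlib.sha256(data).digest() != digest:
            raise OSError(f"checksum mismatch in {name}")
        return data


record = json.dumps({"balance": 1000}).encode()

# Bit flip in the app's (non-ECC) memory before the write:
# flipping bit 3 of the ASCII digit "1" (0x31) turns it into "9" (0x39).
bit_index = record.index(b"1") * 8 + 3
corrupted_record = flip_bit(record, bit_index)

store = ChecksummingStore()
store.write("account.json", corrupted_record)

read_back = json.loads(store.read("account.json"))  # no checksum error raised
print(read_back)
```

The read succeeds and the balance comes back as 9000: from the store's point of view nothing is wrong, because the bytes it received are exactly the bytes it returns.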
That said, I’ve never run an ECC box in my homelab, have never knowingly experienced corruption due to bit flips, and have never knowingly had a file corruption that mattered despite storing and using many terabytes of data. If I care enough about integrity to care about ECC, I probably also care enough to run multiple pipelines on independent hardware and cross-check their results. It’s not something I would lose sleep over.
DDR5 has built-in on-die ECC, which corrects errors inside the memory chip itself but doesn't cover the path between the RAM and the CPU the way full ECC does. It might still be worthwhile depending on your setup.
I believe the ECC on your Pi isn't for the memory chip but for the ARM's on-die cache.
For me personally, if my racked server supports it, I get ECC. If it doesn't, I don't sweat it. Redundancy in drives, power, and networking is much more important to me, and in my anecdotal experience those are orders of magnitude more likely to fail. If I can save those dollars for another, higher-probability failure, I do that.
DNS is a lynchpin of my network (and wife approval factor), so I splurge a bit on physical redundancy: an identical mini computer also runs it and takes over the same IP if the first box fails. Those considerations come way before whether the server has ECC. Just my $0.02.
Thanks for the feedback!
Yea, I think ZFS redundancy + backup will do for my application then. From what I am reading here, it is less common than I imagined.
It's extremely common in enterprise, where the cost of a 100k+ server isn't the most expensive part of running, maintaining, and servicing said server. If your home lab isn't practicing 3-2-1 backups (at least three copies of your data, two local (on-site) but on different media/devices, and at least one copy off-site) yet, I'd spend money on that before ECC.
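If you want to sanity-check your own setup against that rule, here's a rough sketch. `BackupCopy` and `satisfies_3_2_1` are names I made up, the device labels are just examples, and the check only encodes the rule as worded above, not any official tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    device: str    # hypothetical label for the media/device holding the copy
    offsite: bool  # True if the copy lives somewhere other than your home


def satisfies_3_2_1(copies):
    """At least 3 copies, 2 on-site on different devices, 1 off-site."""
    onsite_devices = {c.device for c in copies if not c.offsite}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(onsite_devices) >= 2 and has_offsite


good_plan = [
    BackupCopy("server-zfs-pool", offsite=False),
    BackupCopy("usb-disk", offsite=False),
    BackupCopy("cloud-bucket", offsite=True),
]

# Two mirrored disks in the same pool count as one device and nothing is off-site.
mirror_only = [
    BackupCopy("server-zfs-pool", offsite=False),
    BackupCopy("server-zfs-pool", offsite=False),
]

print(satisfies_3_2_1(good_plan), satisfies_3_2_1(mirror_only))
```

The second plan fails on purpose: redundancy inside one box is not a backup, which is the same point made earlier in the thread.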