Oct 6th

One of our data storage systems (Ceph) encountered a critical failure when Proxmox lost its connection to several of the cluster’s nodes, ultimately resulting in the Ceph configuration being wiped by the Proxmox cluster consensus mechanism. No data other than the Elasticsearch indexes was stored on Ceph.

When the connection to the other nodes was lost, a split-brain occurred (a state in which nodes disagree about which changes are authoritative and which should be dropped). As we tried to recluster all of the nodes, the resolution that occurred wiped the ceph.conf file, leaving the data on Ceph unrecoverable.
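
Because Proxmox keeps ceph.conf inside its replicated cluster filesystem (pmxcfs), a bad quorum resolution can propagate an empty or missing file to every node at once. Below is a minimal sketch of the kind of out-of-band backup that would have helped here; the standard Proxmox path /etc/pve/ceph.conf is real, but the backup directory and script are hypothetical additions of ours, not an existing tool.

    #!/usr/bin/env python3
    """Periodically snapshot ceph.conf outside of pmxcfs.

    A minimal sketch: /etc/pve/ceph.conf is the standard Proxmox
    location, but the backup directory and naming scheme here are
    assumptions, not an existing tool.
    """
    import shutil
    import sys
    from datetime import datetime, timezone
    from pathlib import Path

    SOURCE = Path("/etc/pve/ceph.conf")           # replicated via pmxcfs/corosync
    BACKUP_DIR = Path("/root/ceph-conf-backups")  # local disk, outside the cluster FS

    def snapshot() -> Path:
        # Copy the current config to a timestamped file on local disk,
        # so a cluster-wide wipe can't take the backups with it.
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        dest = BACKUP_DIR / f"ceph.conf.{stamp}"
        shutil.copy2(SOURCE, dest)  # preserves mtime and permissions
        return dest

    if __name__ == "__main__":
        if not SOURCE.exists():
            sys.exit(f"{SOURCE} missing -- cluster filesystem may be unhealthy")
        print(f"backed up to {snapshot()}")

Run from cron on each node, a script like this keeps a local history of the file that the consensus mechanism can otherwise silently overwrite.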

Thankfully, we’ve suffered no significant data loss, aside from having to rebuild the Mastodon Elasticsearch indexes from 6 AM this morning to the present.

I’d like to apologize profusely for the inconvenience, but we felt it necessary at the time to take all services offline as part of our disaster recovery plan, to ensure no damage occurred to the containers while we investigated.

July 7, 11:30 PM MT

Around 11:30 PM, one of the core GFCI breakers tripped, leaving the UPS array running on battery until just after midnight, when it dropped entirely.

We had not set up monitoring on the APC UPS, as we hadn’t anticipated it being a failure point given the reliability of the Colorado electrical grid, and we didn’t expect the breaker itself to trip after months of continuous use.

All services should be back online and catching up on any content dropped overnight, and we apologize again for the inconvenience.

We’ll be working throughout the day to identify the root cause, repair the electrical connections, and set up alerting through the APC UPS to detect any persistent A/C loss.
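
As a rough sketch of what that alerting could look like, assuming apcupsd is installed and its apcaccess tool is available: the polling interval and the alert() hook below are placeholders of ours, not a finished integration.

    #!/usr/bin/env python3
    """Poll apcupsd for on-battery status and fire an alert.

    A minimal sketch, assuming apcupsd is running locally; the
    alert() hook is a placeholder for whatever pager or webhook
    we end up wiring in.
    """
    import subprocess
    import time

    POLL_SECONDS = 30  # assumed polling interval

    def ups_status() -> str:
        # `apcaccess status` prints KEY : VALUE lines;
        # STATUS is typically ONLINE or ONBATT.
        out = subprocess.run(["apcaccess", "status"],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            key, _, value = line.partition(":")
            if key.strip() == "STATUS":
                return value.strip()
        return "UNKNOWN"

    def alert(msg: str) -> None:
        print(msg)  # placeholder: swap in email/webhook/pager of choice

    if __name__ == "__main__":
        was_on_battery = False
        while True:
            on_battery = "ONBATT" in ups_status()
            if on_battery and not was_on_battery:
                # Only alert on the transition, not every poll.
                alert("UPS on battery -- possible A/C loss")
            was_on_battery = on_battery
            time.sleep(POLL_SECONDS)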
