"There was an emergency and power was temporarily cut to the card reader terminals, and now none of them work. We need to get to the server room to reboot the access control server, but it's behind a card-only door."
"How much do we want to get in?"
"Very much."
"Stand back."
And that's how I ended up busting down the server room's door.
I'm currently working with a client that doesn't have a health endpoint or any kind of monitoring on their new API. They say monitoring isn't needed because it will never go down.
Naturally it went down on day two. They still haven't added any "unnecessary" monitoring, insisting that it will never go down.
Just have an endpoint in your API (like /health) that doesn’t do anything but return “ok”.
So even if your database goes down, your filesystem is full, etc., that endpoint will still return “ok” with HTTP 200. That way you can set up a ping monitoring service that will trigger an alarm if the process itself is down.
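For what it's worth, here is roughly what an endpoint like that could look like, as a minimal sketch using only Python's standard library. The `HealthHandler` and `make_server` names, the address and the port are made up for illustration; in a real API you'd just add one more route to whatever framework you already use.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /health does no work and touches no dependencies."""

    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Stay quiet so frequent monitoring pings don't flood the log.
        pass


def make_server(host="127.0.0.1", port=8080):
    # Host and port are assumptions for the example; port 0 picks a free one.
    return HTTPServer((host, port), HealthHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

The point is that the handler never talks to the database or the disk, so a 200 here only tells you the process is up and answering requests.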
You of course need more pinging for the database server etc. But at least you know which service is down instead of “the whole website is down and we don’t know which parts”.
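A rough sketch of the "which part is down" idea, again standard library only. The helper names (`http_ok`, `tcp_ok`, `failing`) and the hostnames are hypothetical, and a plain TCP connect is only a stand-in for a real database check.

```python
import http.client
import socket
import urllib.request


def http_ok(url, timeout=3.0):
    """True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses, so this covers
        # refused connections, timeouts and non-2xx responses.
        return False


def tcp_ok(host, port, timeout=3.0):
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failing(checks):
    """Run each named check and return the names of the ones that failed."""
    return [name for name, check in checks.items() if not check()]


if __name__ == "__main__":
    # Hypothetical targets; replace with your own services.
    checks = {
        "api": lambda: http_ok("http://api.example.internal/health"),
        "database": lambda: tcp_ok("db.example.internal", 5432),
    }
    down = failing(checks)
    print("down: " + ", ".join(down) if down else "all up")
```

Run something like this from cron or a hosted ping service, and the alert names the part that failed instead of just saying the site is broken.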
That's because you don't write bugs. Health checks are only needed when you're planning on having bugs in your system. Instead of doing monitoring, I prefer to spend a bit more time fixing all the bugs, and then my systems never break, so no monitoring is needed. Of course, the downside is lack of job security. They can fire me and the system will just continue running forever, no support needed. If you add some bugs, they cannot fire you, because someone needs to keep fixing the broken system.