Tales from Tech Support @lemmy.world slazer2au @lemmy.world 3mo ago

Yes boss, the failover works

This is a repost of a story I posted on Reddit a few years ago.

Story participants
Me: Slazer
Boss: the boss
T1: Tech 1
T2: Tech 2

Backstory

The boss is all about redundancy and backup. If he finds a single point of failure that I have missed, he lets us know and sets a time frame for when he wants it resolved, along with when the failover testing should be done. Because an untested backup is worse than no backup.

To spare you the boring BGP details:
We have 2 data centres in our closest state capital, with transit multihomed through a single level 2 carrier (while not truly multihomed, we have transit of last resort through one of our layer 2 customers).

One day the boss arrives in the office around 10:30 AM in a huff after hearing of a major outage in a competitor's network.

Boss: Slazer, did you get our traffic balanced over our 2 transit paths like we discussed a while ago?
Me: Yes, DC1 advertises prefixes 1, 3, 5 and the aggregate. DC2 advertises prefixes 2, 4, 6 and the aggregate.
Boss: What happens when one of the transits fails?
Me: I am advertising the DC2 prefixes out of DC1 with the backup BGP community, then doing the same thing for the DC1 prefixes over DC2. In the event of a transit failure the upstream has a backup path ready to go.
Boss: And it works?
Me: Yes, last time I tested it was about 2 or 3 months ago and it failed over correctly.
Boss: Why haven't you tested it sooner?
Me: RANCID hasn't reported a configuration change since the last test. I only test it if there has been a config change on any of those routers.
Boss: But how can you be sure it still works?
Me: Shall I force a failover now to show it works?
Boss: Sure. (which I assume he said with sarcasm)
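
As a rough illustration of the scheme described in the conversation above, here is a minimal Python sketch of how an upstream might choose between the primary and backup advertisements. It is a toy model, not the real router configuration: the prefix names, the function names, and the assumption that the backup community simply makes the upstream prefer that path less are all hypothetical, and the aggregate is left out for brevity.

```python
# Toy model of the advertisement scheme. Prefix names are placeholders,
# not the real prefixes, and the aggregate is omitted.
PRIMARY = {
    "DC1": ["prefix1", "prefix3", "prefix5"],
    "DC2": ["prefix2", "prefix4", "prefix6"],
}


def build_advertisements(primary):
    """Return {dc: {prefix: role}} where role is 'primary' or 'backup'.

    Each DC advertises its own prefixes as primary and the other DC's
    prefixes tagged as backup (assumed to mean "less preferred").
    """
    adverts = {}
    for dc in primary:
        adverts[dc] = {}
        for owner, prefixes in primary.items():
            role = "primary" if owner == dc else "backup"
            for prefix in prefixes:
                adverts[dc][prefix] = role
    return adverts


def best_paths(adverts, transit_up):
    """Pick the transit link the upstream would use for each prefix.

    Primary advertisements win over backup ones, and a link that is
    down contributes nothing. Prefixes with no live path map to None.
    """
    all_prefixes = sorted({p for routes in adverts.values() for p in routes})
    chosen = {}
    for prefix in all_prefixes:
        candidates = [
            (0 if routes[prefix] == "primary" else 1, dc)
            for dc, routes in adverts.items()
            if transit_up.get(dc, False) and prefix in routes
        ]
        chosen[prefix] = min(candidates)[1] if candidates else None
    return chosen


adverts = build_advertisements(PRIMARY)
normal = best_paths(adverts, {"DC1": True, "DC2": True})
dc1_down = best_paths(adverts, {"DC1": False, "DC2": True})
```

With both links up, each prefix follows its own data centre's transit. With the DC1 link marked down, every prefix falls back to DC2, which matches what the graphs show later in the story.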

Me: Starts logging in to the DC1 core router

T1 sees me doing my configuration change face.

T1: If you are doing that I am going for a break.

I shut down our transit interface for DC1 and wait for BGP to time out.
After about 10 min with no calls, the boss turns around and continues the conversation.

Boss: So when will you be testing the failover?
Me: We are, right now.
Boss: What??!! (as his face drops)
Me: You agreed. Plus, this way you know for sure it works because the phones haven't started ringing.
T2: Slazer is right. The graphs show an increase in traffic on DC2 transit.

Boss slides over to T2's desk. Sure enough, the graph for DC1 transit is reading zero traffic and the graph for DC2 is showing all the transit traffic for the state.

Boss: That doesn't look like much traffic.
Me: Only about 20-30% of our traffic goes via Transit, the rest goes via the various IXs we are on.
Boss: Who don't we get via the IX?
Me: Customers of our transit provider who aren't on any IX, Telstra and Optus as they aren't on any IX, and any international site that doesn't use a CDN.

We continue discussing for a good 20-30 min where we get various traffic from and further redundancy in the core networks, during which time T1 returns from his break.

T1: Phones are quiet?
Me: Yes.
Boss: Can you turn the DC1 transit back on?

I walk back to my desk, turn the transit interface back on and see the BGP peer come back up. While T2 and the boss are watching, the graph for DC2 transit drops about 2/3 of its traffic, which reappears on DC1 transit.

And from that day the Boss hasn't asked about the transit failover because now he knows it works.

10 comments

Halfway through reading this, I was concerned there was going to be something like "the phones aren't ringing because they go through the same DC that went down" haha

It's nice when things work exactly as intended
- We had soft phones on our mobiles so if the desk phones broke the mobiles would still work.
That’s a fun story, thanks for sharing. There’s always a nice feeling showing the doubtful boss that you are in control of your realm. Wish bosses would learn to trust the experts they hire
sigh

I am jealous. I can't count how many times I've explained how various things work to my "superiors" and they've either:

Forgotten

Tried to get a different answer they like better from someone else

Made me contact a vendor and have them explain the same thing

Simply not trusted my answer

It's possible I'm just bad at explaining things. But it's a certainty that the people above me were reactionaries motivated by fear of the reactionaries above them. The toxic culture ran deep and started at the top.

I stayed there way too long.
- It is hard to explain technical things if you have a non-technical boss.
  
  Even though that boss was technical he was RF technical, not network engineering technical. So I had many diagrams created in LibreNMS and weathermaps to show things.
This goes to show the power of demos after work is completed. I am in a support role but will be requesting more demos after my teams’ sprints… at least on major stuff if nothing else.
- The trick is to have a test environment to make sure things work before deploying to prod. But not every org realises the value in having a separate production environment.
  
  I did spend weeks in eve-ng testing out my bgp setup before deploying it to live routers.
"Because an untested backup is worse than no backup."

I've worked at the same site for almost 20 years. We've never actually cut over to our COOP (Continuity Of Operations Plan) site in all that time. Not once - not even partially.

I've recommended top management go to the data center, yank out the plugs and say "Let's see how it goes!" (after verifying plans are up to date and ready to go). No one is the least bit interested in doing anything like that.
Love it - very satisfying stuff.

Were you very concerned about calling the boss's bluff?
- In this instance no. I knew it worked because I had tested it before and had labbed it out in GNS3 beforehand.
