How not to have a 45-hour outage like GigabitNow

That look from my family when the internet is down.

My home internet was recently restored after a 45-hour outage of my GigabitNow service in Bloomington, Indiana. This is the worst internet service provider outage I've experienced in 30 years. The outage apparently affected only my account.

Part of my job is maximizing uptime for internet services, so here's my take on the process failures as I saw them from the outside, and how to avoid them.

Monitor more

While the support staff were all pleasant and trying to be helpful, they were not equipped with access to the monitoring tools they needed to see that there was a problem on their end with my account. Instead, I was told the problem appeared to be on my end, because all the right lights on my ONT device (modem) were blinking green.

This is Credo #1 from my On-Call Credos: Find out before the customer does.

With proper monitoring, they would have had a virtual blinking red light in their dashboard for my account.
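As a sketch of what that could look like: a small script that probes each static-IP customer from inside the provider's network and raises an alert before the customer calls in. The account IDs, addresses, and the alert() hook below are hypothetical illustrations, not GigabitNow's actual tooling.

```python
#!/usr/bin/env python3
"""Per-account reachability probe (illustrative sketch only)."""
import subprocess

# Hypothetical inventory: account ID -> static IP assigned to that customer.
CUSTOMERS = {
    "acct-1001": "203.0.113.10",
    "acct-1002": "203.0.113.11",
}

def is_reachable(ip: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Return True if the address answers at least one ICMP echo (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), ip],
        capture_output=True,
    )
    return result.returncode == 0

def alert(account: str, ip: str) -> None:
    # Placeholder: in a real system this would page on-call or flip the
    # account's indicator to red in the support dashboard.
    print(f"ALERT: {account} ({ip}) is unreachable")

if __name__ == "__main__":
    for account, ip in CUSTOMERS.items():
        if not is_reachable(ip):
            alert(account, ip)
```

Run on a schedule, a check like this gives support staff the red light they were missing.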

Test changes before they go into production

GigabitNow relayed that they outsourced part of their business to a "network builder" and that's where the problem was. Whether you outsource or not, you can have a strong Change Management Policy that employees and subcontractors are required to follow, ensuring quality steps are taken before changes go into production. A Change Management Policy may require:

  • Peer reviews
  • Testing changes in a test environment
  • Automated testing

... or all of the above.
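To make the automated-testing item concrete, here is a minimal pytest-style sketch of a pre-deployment check, assuming changes are staged as a plain-text candidate router config (candidate.conf and the prefix list are hypothetical). The idea is that a change cannot ship if it drops a route for a statically assigned customer.

```python
"""Pre-deployment check (illustrative sketch, pytest style)."""
from pathlib import Path

# Hypothetical list of customer static-IP prefixes that must survive any change.
STATIC_PREFIXES = {"203.0.113.10/32"}

def routed_prefixes(config_text: str) -> set:
    """Collect prefixes from lines that look like 'ip route <prefix> ...'."""
    prefixes = set()
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[:2] == ["ip", "route"]:
            prefixes.add(parts[2])
    return prefixes

def test_static_customers_still_routed():
    config = Path("candidate.conf").read_text()
    missing = STATIC_PREFIXES - routed_prefixes(config)
    assert not missing, f"Candidate config drops static routes: {missing}"
```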

Besides monitoring, this was a second process opportunity that failed to catch the issue.

Test changes after they are in production

Another part of a robust change management program is testing that your changes actually did what they were intended to do once they are in production.

For example, when updating a network router, a snapshot of the routing table taken after the upgrade could be compared with the state before the upgrade to confirm the change actually worked. GigabitNow's status page had advertised that they had just updated a core router in my area the day before the outage started.
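A minimal sketch of that comparison, assuming the routing table is dumped to plain-text snapshots before and after the change (before.txt and after.txt are hypothetical file names):

```python
"""Post-change verification: diff routing-table snapshots (illustrative sketch)."""
from pathlib import Path

def load_routes(path: str) -> set:
    """One route entry per line; ignore blanks and comments."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip() and not line.startswith("#")}

before = load_routes("before.txt")
after = load_routes("after.txt")

missing = before - after   # routes that disappeared with the change
added = after - before     # routes the change introduced

if missing:
    print("Routes missing after the upgrade (investigate before closing the change):")
    for route in sorted(missing):
        print(f"  {route}")
else:
    print("All pre-upgrade routes are still present.")
```

Had a check like this run after the core router update, a dropped route could have surfaced right away instead of after a customer call.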

This third process opportunity to identify the issue before it caused an outage also failed.

Conduct a Production Incident Debrief

Despite our best efforts, my teams also experience outages for production services that we manage. When that happens, a "production incident debrief" is automatically triggered. The focus is forward-looking to improve systems; it is not a blame session. Even when you fail at the first three process points like GigabitNow did, an incident debrief takes the "win or learn" approach. Even if this weekend was ruined for your on-call team as well as for some customers, the event can become a lesson that saves a future weekend.

I use a standard template and time-box the meetings to keep things moving; thirty minutes is often enough. Here are prompts from my production incident debrief template that have stood the test of time:

  • Incident Summary and Timeline
  • Actions Already Taken
  • What went well?
  • How can we prevent similar incidents in the future?
  • How can we respond more efficiently and effectively?
  • What follow-up actions should be taken?

It works best to invite all relevant stakeholders and to have a person close to the incident write up the timeline and run the meeting.

For now, I'm giving GigabitNow a second chance as an internet provider. Hopefully this experience will be a catalyst for process improvements to prevent issues like this in the future.

Postmortem: what actually went wrong?

I've been told they'll pass along more details of their investigation once it concludes. The facts I know so far: the problem was related to me being one of the few customers with a static IP address, core routers in the area were updated a day earlier, and the issue seems to have been fixed in software, not hardware.