If you work in the small world of Internet providers, or do any other Internet-related business, you are surely aware of the major outage that has been affecting Colt in Europe since yesterday (or you’re living on the moon!). Check out the article on theregister.co.uk.
I received feedback from several customers whose business relies on the Colt Internet backbone (such as portals or e-business sites), and they are just waiting for the problem to be solved. All of them contacted Colt and are being kept in the dark by the ISP: “We have big issues” or “We are trying to restore our services as soon as possible”. From a technical point of view, the outage must be something really low-level, because some customers also detected problems with their LAN links between different sites (nothing related to the Internet itself).
I’m sure they are fully busy restoring their services, but in parallel they should have set up some channels to communicate at the same time! Leaving customers without information is really bad press for the company. Once all the services are restored, they will have to put in much more effort to win back their customers’ confidence… Funnily enough, a Twitter account, @COLToutagenews, was created today to spread news over the micro-blogging platform. I suspect this account belongs to a Colt employee who is trying to keep the Internet updated about the issue.
More than 24 hours after the problems started, Colt Belgium is still disconnected from the Internet. You can reach the border routers, then… black hole.
I would like to be clear: this post has nothing against Colt or the services they provide. They have very competent engineers who must be going through a period of intense stress. The story would be the same if you replaced “Colt” with your favorite ISP. It’s just a good opportunity to learn from this bad experience:
1. Colt is a well-known company with a reputation for being “reliable”. This story shows that even the biggest ones can suffer a major outage! And disasters are often caused by a series of small incidents. In short: “Shit happens!”
2. Don’t put all your eggs in one basket! If Colt is your only connection to the Internet, you are probably in a major crisis too! Bandwidth prices are very low today and there are lots of solutions for building a redundant Internet infrastructure. You no longer need to be a BGP guru to become multi-homed.
3. Be prepared to face the same story. Have a good communication plan ready! I wrote about incident management a few weeks ago. When you start your BCP, DRP or whatever you call your plan, don’t forget the communication. I have already read some articles in the specialized press about the current outage.
4. Once the problem is under control, don’t try to hide the facts. Were you in deep shit? Say so! And explain how you successfully resolved the issues! This can prove to customers, partners or the press that you were able to take the right actions.
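To make point 2 a bit more concrete, here is a minimal Python sketch of the kind of check a multi-homed site could run to see which uplink still works. Everything in it is an assumption for illustration: the names `tcp_probe` and `pick_uplink`, the uplink labels, and the addresses (taken from the reserved documentation ranges). It also assumes that the host already has policy routing in place so that traffic sourced from each address leaves through the matching provider; the script only tests reachability, it does not switch any routes.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple


def tcp_probe(source_ip: str,
              target: Tuple[str, int] = ("192.0.2.1", 80),
              timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to `target` succeeds when the
    socket is bound to `source_ip` (hypothetical per-ISP local address)."""
    try:
        with socket.create_connection(target, timeout=timeout,
                                      source_address=(source_ip, 0)):
            return True
    except OSError:
        # Covers timeouts, refused connections and unroutable sources.
        return False


def pick_uplink(uplinks: Iterable[Tuple[str, str]],
                probe: Callable[[str], bool] = tcp_probe) -> Optional[str]:
    """Return the name of the first uplink whose probe succeeds,
    or None if every uplink looks dead."""
    for name, source_ip in uplinks:
        if probe(source_ip):
            return name
    return None
```

Called as `pick_uplink([("isp_a", "203.0.113.10"), ("isp_b", "198.51.100.10")])`, it returns the first provider that still answers, in the order you listed them. The probe is passed in as a parameter so that the selection logic can be tested without touching the network.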
Never forget: “Humans learn by making mistakes”. I hope that Colt will analyze the incident and communicate about it. This could help them improve their service level and would certainly be useful to other companies. Should I remind you how Apache handled their last incident?
Latest news communicated via @COLToutagenews: further updates can be found at http://www.colt.net/UK-en/CaseStudy/COLT_036279.