The most important job for the technology department is also the least glamorous: keeping the lights on and the servers humming. “The Business” often expects the infrastructure to be always available, so much so that when things go well “The Techies” are taken for granted, and when it all goes wrong they receive no end of grief.
The techies have to take some of the blame for not setting appropriate expectations. The reality is that your managers and customers are reasonable people. They will only expect 100% uptime at $0 cost if you let them think that this is reasonable.
So next time the proverbial hits the fan (when, not if), my advice is:
1. Don’t get into the blame game
2. Fix the problem, with a smile on your face
3. When the problem is fixed, get out your risk register (with costed mitigation plans) and show your manager why this issue happened, how likely it is to happen again, what its likely impact is, and what it would cost to avoid or fix it in future
4. Also, while you’re on the subject, explain the other risks that the business is open to right now.
5. Keep in mind that this stuff happens to other companies – and sometimes far worse than what you’ve just experienced. Not that I’d be advocating schadenfreude … but a quick trawl through the news and gossip of the last few weeks turns up:
Kaiser Permanente losing their CIO last week – apparently linked to a disastrous software implementation.
Register.com keeling over a couple of weeks ago after what looked like an epic DDoS attack and taking with it thousands of DNS records.
Morgan Stanley sending out credit card statements last month with all values multiplied by 100.
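To make step 3 a bit more concrete, here is a minimal sketch of what a costed risk register might look like in code. Everything in it is an assumption for illustration: the field names, the two example risks and all of the figures are made up, and "likelihood times impact" is just one common way of putting a number on exposure, not the only one.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One row of a hypothetical risk register (all fields are illustrative)."""
    name: str
    annual_probability: float    # chance of it happening in a given year, 0..1
    impact_cost: float           # estimated cost to the business if it happens
    mitigation_cost: float       # yearly cost of the mitigation plan
    residual_probability: float  # chance per year once the mitigation is in place

    def expected_annual_loss(self) -> float:
        # Exposure with no mitigation: likelihood x impact.
        return self.annual_probability * self.impact_cost

    def mitigated_annual_cost(self) -> float:
        # What you pay for the mitigation plus the exposure that remains.
        return self.mitigation_cost + self.residual_probability * self.impact_cost

    def mitigation_saving(self) -> float:
        # Positive means the mitigation is expected to pay for itself.
        return self.expected_annual_loss() - self.mitigated_annual_cost()


def rank_by_exposure(risks):
    """Biggest unmitigated exposure first, so the worst risks lead the conversation."""
    return sorted(risks, key=lambda r: r.expected_annual_loss(), reverse=True)


# Made-up example entries; replace with your own estimates.
register = [
    Risk("Office internet link goes down", 0.50, 5_000, 1_200, 0.05),
    Risk("Primary database server fails", 0.20, 50_000, 4_000, 0.02),
]

for risk in rank_by_exposure(register):
    print(
        f"{risk.name}: exposure {risk.expected_annual_loss():,.0f}/yr, "
        f"expected saving from mitigation {risk.mitigation_saving():,.0f}/yr"
    )
```

The point of the ranking is the conversation it enables: instead of arguing about blame, you can put likelihood, impact and cost side by side and let your manager choose which risks are worth paying to reduce.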
Then there’s my all-time favourite example of tech muppetry:
NASA trashing the Mars Climate Orbiter (cost: $125m) because one part of the software had been written on the assumption that everyone was operating in metric units while another part had been written on the assumption that imperial units were the order of the day.
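For the curious, one common defence against this kind of mix-up is to stop passing bare numbers around and make every value say which unit it came in. The sketch below is only an illustration of that idea and has nothing to do with the actual flight software: the `Impulse` class and its method names are hypothetical, and impulse is used as the example quantity with the standard conversion of 1 lbf = 4.4482216152605 N.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical unit-aware value: always stored internally in newton seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert at the boundary, so everything downstream sees one unit only.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(readings) -> Impulse:
    """Add up readings; they are safe to combine because all are in newton seconds."""
    return Impulse(sum(r.newton_seconds for r in readings))


# Two teams reporting "10" in different units can no longer be mixed up silently:
metric_reading = Impulse.from_newton_seconds(10.0)
imperial_reading = Impulse.from_pound_force_seconds(10.0)
combined = total_impulse([metric_reading, imperial_reading])
print(f"{combined.newton_seconds:.2f} N·s")
```

The caller has to name the unit at the moment the number enters the system, which is exactly the moment when the assumption would otherwise go unspoken.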
Till next time