How do you manage risk in your data center or IT organization? If you’re in California, you’ve probably given serious consideration to flooding, fires or earthquakes. If you’re on the East Coast, it’s little things like Nor’easters and the occasional hurricane. In Texas or other parts of the central and southern states, you might also worry about tornadoes. Oddly enough, though, the most common cause of an interruption in service is good old humans. Yep, you, me and the other guy are to blame for the vast majority of outages our services suffer.
If we’re to blame, what can we do?
It’s ironic that the easiest and most cost-effective things we can do to improve system availability are generally the things we ignore or get lazy about. That’s right, the best return on investment in any high-availability strategy comes from the investment made in the human portion of the equation.
In researching risk vectors for IT and facilities systems, I found that the percentage of risk attributed to each vector varied depending on who created the pie chart. What I did find, however, was general agreement that no matter how big a risk flooding or hardware failure was, the piece of the pie attributed to humans was always the largest, in many cases around 40%.
The good news is that developing your staff will only make them better and more loyal to the company, all while increasing the availability of your systems. A key component of successful leadership is investing in your team. What better reasons do you need than employee retention and a reduction in mistakes and downtime? Let’s put that another way: improved quality at a lower cost of operations.
It’s simple, really! (OK, not really, but worth it all the same)
The great part about working on team training and performance is that it’s comparatively easy. It’s much harder to get funding for, and put in place, complex failover and recovery solutions for your ERP or power systems.
4 Simple Keys to Improved Operational Resilience
Key 1: Define the “Why”
Why are you doing what you do? How does it align with what the company is doing? Then work outward through the how and what.
Watch this TED Talk by Simon Sinek for more inspiration (thanks to John Willis, @botchagalupe, for the find).
Key 2: Develop a change management plan that you can actually use
Change management implementations are always painful, and there are bound to be a few folks who put up walls and scream loudly about how it will slow or stop productivity. Be strong; you can get through it. The best message here is that their firefighting hours will be reduced and customer satisfaction will go way up.
Sub Key 2.1: The change working group needs to be cross-functional. I recommend incorporating the business (or a business-unit-embedded IT staffer) for larger changes. There must also be an expectation of real review and real responsibility for sign-off on a change.
Sub Key 2.2: You have to be very careful to include the entire system in your change management planning and risk assessment efforts (see the Data Center Pulse “Data Center Stack” for more, and think beyond just the DC).
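For readers who like to see a process written down, here is a minimal sketch of the sign-off rule described in Sub Keys 2.1 and 2.2. Everything in it is an assumption made for illustration: the reviewer group names, the `ChangeRequest` class and the `large` flag are hypothetical and do not come from any particular change management tool. It only shows one way to make "real responsibility for sign-off" something you can check.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical reviewer groups; substitute the functions in your own organization.
BASE_REVIEWERS = {"operations", "facilities", "application"}
BUSINESS_REVIEWER = "business"


@dataclass
class ChangeRequest:
    summary: str
    affected_layers: set[str]  # parts of the whole system touched, e.g. {"power"}
    large: bool = False        # larger changes also need business sign-off
    signoffs: dict[str, str] = field(default_factory=dict)  # group -> reviewer

    def required_groups(self) -> set[str]:
        """Groups that must sign off before the change can proceed."""
        required = set(BASE_REVIEWERS)
        if self.large:
            required.add(BUSINESS_REVIEWER)
        return required

    def sign_off(self, group: str, reviewer: str) -> None:
        """Record a named reviewer as responsible for one group's approval."""
        if group not in self.required_groups():
            raise ValueError(f"{group!r} is not a reviewer group for this change")
        self.signoffs[group] = reviewer

    def missing_signoffs(self) -> set[str]:
        """Required groups that have not yet signed off."""
        return self.required_groups() - set(self.signoffs)

    def approved(self) -> bool:
        # A change with no risk assessment (no affected layers listed)
        # is never approved, however many signatures it has.
        return bool(self.affected_layers) and not self.missing_signoffs()
```

In this sketch a large change stays unapproved until someone from the business has put their name on it, and a change that lists no affected layers is rejected outright, which nudges the requester to think about the whole stack and not just the DC.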
Key 3: Training
Training plans must be drawn up and followed. The training should cover everything from using the Change Management process to “how to work in the data center for employees and contractors”. I’ve gone so far as to have movers go through a training program prior to DC equipment moves. So far the winning team is 5 and the opposition is 0: 5 major physical moves, each involving hundreds to thousands of pieces, all on schedule and within the time target.
Practice what you preach and ensure the training program is treated like a legitimate educational activity. Once the program is in place, you’ll continuously find new things to add that bring even greater value to your organization.
Key 4: Appropriate Rewards
Adjust your verbal and non-verbal reward processes. Many employees have their firefighting habits reinforced by management rewards (“Great job, Mark! We really appreciate you coming in on your vacation to fix the flux capacitor again!”). The appropriate reward would be to give folks targets for system improvements combined with overhead reductions (fewer hours wasted on fixing stuff).
Once your training is baked and the positive results have been demonstrated, you can begin to build these manual “Operations” activities into more of a “DevOps” or automated approach. Remember, if you take a bad process and automate it, you’ll get the same problems, only more quickly. In a nutshell, DevOps is a fluid and highly automated approach to incorporating all the right policy, scheduling, governance and change management practices into a successful and repeatable result.
It May Not Seem Glamorous
While working towards operational excellence may not have the same cachet as building cloud infrastructure or implementing a new WAN (Wide Area Network), it is crucial to protecting your business partners and, in turn, to your success. When I think about operations, I put it in the perspective of a nuclear submarine: you’re underwater, and you’re running a small nuclear plant that you sleep right next to. What could go wrong?