The Fable
The two great heads of IT sat and stared at each other across a meeting room table. It was late in the day, and thankfully their services had all been restored. Now was the time for recriminations. The CIO had been called into firefighting meetings with the board all day. They knew he was going to be royally pissed off, but who was going to get the blame?
The Beginning
The story began when service performance nose-dived. The lead-up to Christmas was always a busy period, but this season had been marked by some hugely successful promotional campaigns, and their online services had been humming with traffic. Nobody quite knew what caused it, but suddenly alarms started sounding. Throughput slowed to a trickle - and response times rocketed through the roof. Downtime. The infrastructure team, plunged into a triage and diagnostics scenario, did what they always did. Whilst others were busy pointing fingers, they formed a firefighting team and quickly diagnosed the issue - they'd hit a capacity limit at a critical tier. As quickly as they could, they provisioned some additional infrastructure and slowly brought the systems back online.
The Background
But why had all this happened? Some months earlier, on the advice of some highly-paid consultants, the CIO had restructured the business into a private cloud model. The infrastructure team provided a service to the applications team, using state-of-the-art automation systems. Each and every employee was soon enamoured with this new way of working, using ultra-modern interfaces to request and provision new capacity whenever they needed it. Crucially, the capacity management function was disbanded - it just seemed irrelevant when you could provision capacity in just a few moments.
The Inquisition
But as the heads of IT discussed the situation, it became clear there were some crucial gaps they had overlooked. The VP of Applications confessed that there was very little work being done in profiling service demand, or in collaborating with the application owners to forecast future demand. He lacked the basic information needed to determine service headroom - and crucially was unable to act proactively to provision the right amount of capacity. In an honest exchange, the VP of Infrastructure also admitted to failings in managing commitment levels of the virtual estate, and in sizing the physical infrastructure needed to keep on top of demand. They realised that in disbanding the Capacity Management function they had fumbled - and in fact needed those skills in both of their teams.
The Conclusion
The ability to act proactively on infrastructure requirements distinguishes successful IT organisations from the crowd. What these heads of IT had realised is that the private cloud model increases the need for Capacity Management, instead of diminishing it. The duality of Capacity Management in the private cloud model is that it belongs on both sides of the table - with the provider, and with the consumer. Working independently, each side can improve its demand forecasts and diminish the risk of performance issues. Working collaboratively, the two sides form a partnership that sizes capacity requirements far more effectively, balancing cost against service headroom.
- As a consumer, make sure you stay well informed about service demand and capacity profiles. Use these profiles to work with your application owners in forecasting different 'what if' scenarios. Use the results to identify which metrics matter most, and prepare a plan of action for when certain thresholds are reached.
- As a provider, make sure you continually track your infrastructure commitment levels and capacity levels. Use the best sizing tools you can find to right-size the infrastructure you provision for future growth.
- Have your capacity management teams work collaboratively to form an effective partnership that ensures cost-efficient infrastructure delivery and effective headroom management.
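As a rough illustration of the consumer-side advice above, a 'what if' headroom check might look something like the sketch below. Everything in it is hypothetical: the `Tier` class, the tier names, the capacity and demand figures, and the 20% minimum-headroom threshold are assumptions made up for the example, not figures from the story or from any real sizing tool.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One service tier (hypothetical model for illustration)."""
    name: str
    capacity: float      # maximum sustainable load, e.g. requests per second
    peak_demand: float   # observed peak load, in the same unit


def headroom(tier: Tier, growth: float = 0.0) -> float:
    """Fraction of capacity left if peak demand grows by `growth` (0.5 = +50%)."""
    projected = tier.peak_demand * (1.0 + growth)
    return 1.0 - projected / tier.capacity


def tiers_at_risk(tiers, growth, min_headroom=0.2):
    """Names of tiers whose projected headroom falls below the threshold."""
    return [t.name for t in tiers if headroom(t, growth) < min_headroom]


# Made-up figures: a web tier with plenty of room, a database tier with little.
tiers = [
    Tier("web", capacity=1000.0, peak_demand=400.0),
    Tier("database", capacity=500.0, peak_demand=350.0),
]

# 'What if' scenarios: no growth, +25% and +50% on top of the observed peak.
for growth in (0.0, 0.25, 0.5):
    print(f"growth {growth:.0%}: at risk = {tiers_at_risk(tiers, growth)}")
```

With these invented numbers, the database tier drops below the threshold in the +25% scenario and runs out of capacity entirely at +50%, which is the kind of early warning that gives both teams time to provision before a promotional campaign rather than during the outage.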
Will you wait for your own downtime before acting?