
Thursday, 27 September 2012

Today, a MHz is just not a MHz any more...

A common practice in virtualization environments is to use the easily accessible MHz rating of a server as a normalization parameter - so that when you run an optimization routine, you can express available capacity in MHz, compare it to the capacity being used, and determine whether there's a fit - or not.  While this method of normalization makes complete sense given the data available, I'm here with some bad news.  They just don't make MHz like they used to.  Actually, in many cases, they make them better!  The SPECint2006_rate benchmark measures CPU throughput, and it gives us a way to test whether a MHz really does translate directly into throughput.
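To make that concrete, here is a minimal sketch of the MHz-based fit check described above.  The host and workload figures are purely hypothetical, and the rest of this post explains why this arithmetic is flawed:

```python
# Minimal sketch of the common MHz-based fit check.  All figures are
# hypothetical; a real tool would pull them from the hypervisor.

def headroom_mhz(cores: int, clock_mhz: float, used_mhz: float) -> float:
    """Naive normalization: capacity = cores x clock speed, minus what's in use."""
    return cores * clock_mhz - used_mhz

# A 6-core, 2.6 GHz host with 9,000 MHz already consumed.
headroom = headroom_mhz(cores=6, clock_mhz=2600, used_mhz=9000)

# A candidate VM consuming 4,500 MHz on its source host.
demand = 4500

# The optimization routine simply asks: does it fit?
print(f"Headroom: {headroom} MHz, demand: {demand} MHz, fits: {demand <= headroom}")
```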



Confused about capacity?  Take this example...  How much oil you can get through a pipeline is proportional to its cross-section - how fat it is - and the speed at which the oil moves through it.  Translate that into a digital context: the cross-section of a CPU corresponds to the number of cores, and the speed is its clock rate, measured in MHz.  The clock speed is the frequency of the chip, and it determines how quickly the CPU can process a task.




They don't make 'em like they used to...

The problem, though, is that the clever guys at Intel and the other processor manufacturers don't want to play this game.  They're always thinking of new ways to boost performance that don't rely on a MHz improvement alone.  Take a look at the data.  The chart below shows the ratio of SPECint2006_rate to GHz over the last 6 years, controlling for the number of cores in the benchmark measurement.  It shows that a GHz in 2011 is equivalent to roughly 1.15 GHz just 12 months earlier.  Another interesting point is that the AMD data doesn't show the same rate of change - its trend line has a much lower gradient.  This highlights that the chipset is a hugely important factor when using MHz as a normalization rating.  A MHz just isn't a transferable unit.


Data for HP ProLiant (Intel only), taken directly from SPEC.ORG
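As an illustration of the ratio behind that chart, here is a small sketch of the calculation - the sample figures are invented purely to show the arithmetic, and are not the actual SPEC.ORG results:

```python
# Illustrative calculation of the normalization ratio behind the chart:
# SPECint2006_rate divided by (GHz x cores).  The sample values are made up;
# the real data comes from SPEC.ORG submissions for HP ProLiant servers.

samples = [
    # (year, specint2006_rate, ghz, cores)
    (2008, 120.0, 2.8, 4),
    (2010, 250.0, 2.9, 8),
    (2011, 330.0, 3.1, 8),
]

for year, rate, ghz, cores in samples:
    per_ghz_per_core = rate / (ghz * cores)
    print(f"{year}: {per_ghz_per_core:.1f} SPECint_rate per GHz per core")

# If this ratio climbs by ~15% a year, a GHz of 2011 silicon delivers roughly
# 1.15x the throughput of a GHz from 12 months earlier.
```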


Conclusion

Normalization is a key part of good capacity management practice.  Using a percentage is simply a recipe for disaster when trying to apply intelligence to configuration optimization.  Using MHz is an easy option, but it is, alas, fool's gold.  The data for Intel chips shows that per-MHz processing rates change dramatically over time, which could introduce optimization errors of over 15% per 12-month period.  If you were moving from legacy kit three years old, the error margin may be over 75%.  This will always lead to over-specified machines - and a higher spend than necessary to meet business requirements.  Whilst that amount of headroom may have been justified for directly allocated capacity, in the cloud that overspend represents a high cost of ownership and immediate optimization challenges on deployment.

Tuesday, 17 July 2012

Planning for better IT operating margin

You'll hear the phrase "predictive analytics" coming from most of the major players in capacity management these days.  Looking into the future to plan major infrastructure or software initiatives, or even to account for variations in workload growth, requires predictive analytics to some extent.  Whether you're an infrastructure provider or consumer, better planning drives more efficient operations and hence improved margins.  Let's explore more:




In its most basic form, predictive analytics is about extrapolation.  By gathering a set of historical data, we can begin to spot patterns and make some assessment of the future trajectory.  The type of extrapolation that can be made depends on the power of the analytics - at its most basic, linear regression looks at long-term trends and plots a single straight-line trend out into the future.  This works fine for persistent metrics like disk space.  In fact, it works reasonably well for less persistent metrics too, provided you bolster the analysis with some variability assessment.  However, better curve-fitting algorithms (lognormal, exponential, binomial etc.) can provide more accurate predictions if the data is well behaved.  Take a look at the graph above.  The binomial fit is closer to the capacity-used metric, which combines steady organic growth with a seasonal variation.  In this case, a linear trend on the peaks (or 98th percentiles) can give the same net result, but it is a little more cumbersome.
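For anyone who wants to experiment, here is a minimal sketch of the straight-line approach using numpy on synthetic data - organic growth plus a seasonal swing, like the graph above - with a simple peak allowance standing in for the 98th-percentile idea:

```python
import numpy as np

# Synthetic monthly 'capacity used' series: steady organic growth plus a
# seasonal swing (illustrative data only).
months = np.arange(36)
capacity_used = 100 + 2.5 * months + 15 * np.sin(2 * np.pi * months / 12)

# Basic predictive analytics: a straight-line fit, extrapolated a year ahead.
slope, intercept = np.polyfit(months, capacity_used, 1)
forecast_month = 48
linear_forecast = slope * forecast_month + intercept

# A crude variability assessment: trend the peaks rather than the mean by
# adding the largest residual seen in history (a stand-in for trending the
# 98th percentiles).
peak_allowance = np.max(capacity_used - (slope * months + intercept))
peak_forecast = linear_forecast + peak_allowance

print(f"Linear forecast at month {forecast_month}: {linear_forecast:.0f}")
print(f"Peak-adjusted forecast: {peak_forecast:.0f}")
```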


There are, however, two problems with extrapolation.  The first is one of scale: with no roll-up mechanism, you quickly get drowned in data, so a simplification process is needed to support the trends.  The second, and more fundamental, is that it assumes all other variables remain constant - meaning only the workload changes, while the environment itself is static.


Is this a good assumption?  Well, for some platforms it is.  For disk capacity, it is a pretty good rule: only when disks are running out of space will some change be made, and such changes can easily be reflected in the extrapolation.  For physical infrastructure, or statically allocated partitions, this can be a decent assumption too - provided the software itself isn't changing.


But where extrapolation and curve-fitting algorithms really fail is where either the software or the operating environment is changing.  Assessing the impact of these step-changes in capacity is too complex a task for curve-fitting alone - some configuration information must be reflected in the predictions.  At this stage, a modelling approach must be used.  There are in fact many different modelling algorithms and approaches, but the most popular provide both an infrastructure and a service perspective on capacity.  The service-centric capacity plan takes a cross-section of data centre capacity allocated or used by a hierarchy of services, which can be taken from a service definition or CMDB.  The benefit of this view is that it enables dialogue with business owners about plans for their relevant domain.  If you're capacity planning in the cloud, the relevant conversation should cover budgeting, quality and optimization opportunity.  If you have a model, then the relevant KPI for trending and extrapolation becomes workload volumetrics - and that means you can manipulate forecast data to reflect changing business requirements in the future.
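As a rough illustration of what a service-centric rollup looks like, here is a small sketch with a hypothetical hierarchy and figures - a real plan would draw both from the CMDB or service catalogue:

```python
from dataclasses import dataclass, field

# Sketch of a service-centric capacity rollup.  The hierarchy and figures are
# hypothetical; in practice they would come from a CMDB or service definition.

@dataclass
class ServiceNode:
    name: str
    allocated_ghz: float = 0.0           # capacity allocated directly to this node
    children: list = field(default_factory=list)

    def total_allocated(self) -> float:
        """Roll capacity up the service hierarchy."""
        return self.allocated_ghz + sum(c.total_allocated() for c in self.children)

payments = ServiceNode("Payments", children=[
    ServiceNode("Web tier", allocated_ghz=24.0),
    ServiceNode("App tier", allocated_ghz=48.0),
    ServiceNode("Database", allocated_ghz=32.0),
])

# A single roll-up figure gives business owners something to budget against.
print(f"{payments.name}: {payments.total_allocated():.0f} GHz allocated")
```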


The modelling approach is really beneficial in managing shared virtual infrastructures like the cloud, where the bottleneck may appear at the physical or virtual layer, where the virtual configuration may be changing rapidly, and where DRS may be shifting workloads around within a cluster.  It is also beneficial in planning for new software releases, upgrades or (major) reconfigurations - thereby incorporating a life-cycle approach to capacity management.  Surely this is where predictive analytics is at its most powerful: helping architects size new cloud environments, testers validate the scalability of their new release, and capacity managers measure the impact of that release on a congested production environment.


In Summary
Across the technology life-cycle, predictive analytics in capacity management should support sizing, provisioning, managing and decommissioning.  Whether you choose to use a tool for that or operate a consultative approach, leaving holes in your planning process has been shown to add risk and cost to your IT operations.