Thursday 27 September 2012

Today, a MHz is just not a MHz any more...

A common trend in virtualization environments is to use the easily-accessible MHz rating of the server as a normalization parameter - so that when you're considering an optimization routine, you can identify available capacity in terms of MHz, compare it to some other capacity being used, and determine whether there's a fit - or not.  While this method of normalization makes complete sense in terms of the data available, I'm here with some bad news.  They just don't make MHz like they used to.  Actually, in many cases, they make them better!  The SPECint2006_rate benchmark is a measure of CPU throughput, so comparing it against MHz shows directly how well clock speed correlates with the throughput you actually get.



Confused about capacity?  Take this example...  How much oil you can get through a pipeline is proportional to its cross-section - how fat it is - and the speed at which the oil moves through it.  Translate that into a digital context: the cross-section of a CPU corresponds to the number of cores, and the speed of the CPU is measured in MHz.  The clock speed is the frequency of the chip, and defines how quickly a task can be processed by the CPU.
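To make the analogy concrete, here's a minimal sketch (in Python, with made-up figures) of the naive model the pipeline analogy suggests - capacity as cores multiplied by clock speed.  The rest of this post is about why that model breaks down.

# Naive "pipeline" capacity model: throughput ~ cores x clock speed.
# The figures below are illustrative only, not benchmark data.

def naive_capacity_mhz(cores, clock_mhz):
    """Capacity as raw core-MHz - the normalization this post questions."""
    return cores * clock_mhz

server_2008 = naive_capacity_mhz(cores=8, clock_mhz=2000)   # 16,000 core-MHz
server_2012 = naive_capacity_mhz(cores=8, clock_mhz=2000)   # 16,000 core-MHz

# By this model the two servers look identical - even though one chip is
# several generations newer and delivers far more throughput per MHz.
print(server_2008, server_2012)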




They don't make 'em like they used to...

The problem, though, is that the clever guys at Intel and the other processor manufacturers don't want to play this game.  They're always thinking of new ways of boosting performance that don't rely on just a MHz improvement.  Take a look at the data.  The chart below shows the ratio of SPECint2006_rate to GHz over the last 6 years, controlling for the number of cores in the benchmark measurement.  It shows that a GHz in 2011 is equivalent to 1.15GHz just 12 months earlier.  Another interesting point is that the AMD data doesn't show the same rate of change - its trend line has a much lower gradient.  This highlights that the chipset is a hugely important factor when using MHz as a normalization rating.  A MHz just isn't a transferable unit.


Chart data: HP ProLiant servers (Intel only), taken directly from SPEC.ORG
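As a rough sketch of what this drift means in practice, the snippet below normalizes MHz between chip generations using an assumed 15% year-on-year improvement in throughput per GHz - a figure taken from the Intel trend above; your chipset will vary.

# Normalizing MHz across chip generations.
# ANNUAL_GAIN is an assumption based on the ~15%/year trend described
# above for Intel; the real figure depends on the chipset.
ANNUAL_GAIN = 1.15

def equivalent_mhz(mhz, years_newer, annual_gain=ANNUAL_GAIN):
    """Express a newer chip's MHz in 'older-generation MHz'."""
    return mhz * (annual_gain ** years_newer)

# Under this assumption, a 2,000 MHz chip from 2011 delivers roughly the
# throughput of a 2,300 MHz chip from 2010.
print(round(equivalent_mhz(2000, years_newer=1)))   # ~2300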


Conclusion

Normalization is a key part of good capacity management practice.  Using a percentage is simply a recipe for disaster when trying to apply intelligence to configuration optimization.  Using MHz is an easy option, but alas it is just fool's gold.  The data for Intel chips shows that the processing rate of chips changes dramatically over time, and that could introduce errors of over 15% per 12-month period into optimization decisions.  If you were moving from legacy kit that is 3 years old, the error margin may be over 75%.  This will always lead to over-specified machines - and a higher spend than necessary to meet business requirements.  Whilst that amount of headroom may have been justified in directly allocated capacity, in the cloud that overspend represents a high cost of ownership and immediate optimization challenges on deployment.
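If you want to check the compounding behind those error margins, here's a small sketch assuming per-year improvements in throughput per GHz of between 15% and 25% - the range quoted in these posts, treated here as rough assumptions rather than measurements.

# How a per-year drift in throughput-per-GHz compounds against older kit.
for annual_rate in (0.15, 0.20, 0.25):
    for age_years in (1, 2, 3):
        error = (1 + annual_rate) ** age_years - 1
        print(f"{annual_rate:.0%}/year, {age_years}-year-old kit: "
              f"~{error:.0%} error from normalizing on raw MHz")

# At 20-25% per year, three-year-old kit is out by roughly 73-95% -
# which is why the MHz-only approach systematically over-specifies machines.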

Tuesday 11 September 2012

Agile practice for the cost-effective cloud


Can you tolerate failure?

Some situations cannot tolerate failure: the highly political government project, the business-enabling SAP deployment and the market-leading e-business application all share a common characteristic - getting it wrong hurts.  And the most common symptom of getting it wrong?  Poor performance, poor availability, poor service - and disgruntled customers.

What happens then is where heroes make their reputation.  Firefighting, troubleshooting, late-night candle-burning, bonus-generating heroics.  Don't get me wrong, it's great to be a hero - saving the day from evil, just before the clock ticks to zero.  But surely it's better not to get into that position in the first place?

Lessons from other industries

The use of simulation is common in industries where getting performance design right first time is critical to the bottom line.  Imagine building an aeroplane with control systems that didn't respond in time, semiconductors which performed worse than their predecessors, or civil engineering projects which couldn't handle the projected loads.  Scenario planning is also a favourite of agile business.  Just as a chess grandmaster thinks several moves ahead, the business planner looks beyond immediate trends and plans for the impact of their next business strategy.

But, IT is sooo complex

For IT professionals, getting it wrong hurts just as much; but here we are at a significant disadvantage.  The complexity of enterprise-class systems, the interconnected and often opaque nature of cloud services, and the lack of insight into business initiatives leave us in the dark.  Whilst we're under constant pressure to reduce cost and risk, the ever-increasing complexity forces us to over-provision, to reduce the risk of getting it wrong.
But is that right?  Is it true to say that IT is more complex than embedded avionic systems?  Probably not, although the rate of change is certainly higher.  Can we still justify time spent planning, even though the sands are constantly shifting under our feet?  In the past, we've focused on a limited number of critical systems, or performed some pretty rudimentary analysis that makes us comfortable - and trusted in the heroes.  But there must be a better way - a way that enables IT to be a part of the planning process, not just driven before the wind like a rudderless ship.  There must, mustn't there?

Converged scenario planning 

Times are changing.  Scenario planning tools have long been available in the capacity planning sector of the marketplace, and now they are catching up with the needs of the cloud management business.  Remember that the agility of cloud computing demands better management of headroom - maintaining the capacity to support elastic demand - and of cost - doing so at a profit.  Cloud Capacity Management solutions must translate the needs of the business into capacity requirements, and also translate capacity requirements back into business constraints.  And the universal language of business?  Currency.
The new world of the cloud demands alignment between the needs of the business and the capacity provided to it.  In the cloud-enabled enterprise, IT spend is proportionate to the needs of the business.  The cloud provider has the challenge of ensuring that its revenue covers its costs and provides a profit margin to its stakeholders.  Efficient decision makers are seizing control of their supply chains, and ensuring that risks and costs are managed effectively throughout.  The convergence of cloud and consumer planning ensures transparency in that decision-making process.
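As a minimal sketch of what translating business demand into capacity, and capacity into currency, might look like - all unit sizes, headroom and prices below are invented for illustration, not drawn from any real service:

# Translating a business demand forecast into capacity, then into cost.
# All figures are hypothetical and will differ for every provider.
import math

TRANSACTIONS_PER_HOST = 5_000      # assumed peak throughput per host
COST_PER_HOST_PER_MONTH = 400.0    # assumed fully-loaded monthly cost
HEADROOM = 0.30                    # capacity reserved for elastic demand

def monthly_cost(peak_transactions_per_hour):
    """Capacity requirement, then cost, for a forecast peak demand."""
    required = peak_transactions_per_hour * (1 + HEADROOM)
    hosts = math.ceil(required / TRANSACTIONS_PER_HOST)
    return hosts, hosts * COST_PER_HOST_PER_MONTH

# A business initiative forecast to add 12,000 transactions/hour at peak:
print(monthly_cost(12_000))   # (4, 1600.0) - four hosts, 1,600 a month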

The new synergy

Successful business and IT leaders are getting their act together, and learning to co-operate in new fruitful ways.  Planning for IT and business initiatives through capacity management, with focus on both customer experience and the bottom line, is enabling enlightened decision makers to understand the ramifications of their alternative strategies, and understand the budget and risk parameters for their chosen plans.  

It turns out the old adage "a stitch in time saves nine" still has relevance today.

Monday 3 September 2012

Starting Capacity Management - Five Guiding Principles for Success

A few folk have been posting asking for advice on choosing a Capacity Management tool.  The normal result of that is a flurry of responses from vendors trying to position their tool as the best.  I prefer a different approach - some tools are better for some situations than others, and all of them will have some limitations - Capacity Management has to cover such a diverse range of platforms and use-cases that it's inevitable.

The fundamental principle of Capacity Management
As I tweeted recently, the fundamental principle of Capacity Management is that it exists to reduce cost and risk to business services.  When we're designing that process, selecting a tool or looking for skills, it is imperative that we keep hold of that guiding principle whenever making a decision.  This means that taking a silo infrastructure view doesn't really help us understand or quantify the risk to business services (unless the business service happens to map 100% onto the infrastructure, which was the old way of running things but is pretty much incompatible with the cloud).

The second principle of Capacity Management
Not all capacity is created equal.  So it is critical that we abandon percentages as a way of comparing different assets.  Percentages are only useful for comparing a current value against another, such as the maximum.  You can't take one server running at 20% and another running at 20% and assume that consolidating them will result in a server running at 40% (unless the configurations are identical).  So we need a way of comparing capacities.  One of the popular ways of doing that is to use MHz.  But this approach is plagued with inaccuracy.  The main issue is that the processing power of a CPU is not directly correlated with MHz.  In fact (as I intend to show in a later post) the difference can be roughly stated as 25% per year for some chipsets.  This means that a 2GHz chip from 2010 is roughly equivalent to a 1.5GHz chip from 2011.  And don't forget to control for the effect of hyperthreading - your monitoring tools will report on logical CPUs, not physical ones.
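To illustrate the consolidation point, here's a quick sketch - the host capacities are in arbitrary normalized units (think benchmark-derived ratings rather than raw MHz) and the figures are invented:

# Why 20% + 20% is not 40% when the hosts differ.
# Capacities are in arbitrary normalized units; figures are illustrative.

def consolidated_utilisation(workloads):
    """workloads: list of (host_capacity_units, utilisation_fraction)."""
    total_demand = sum(cap * util for cap, util in workloads)
    target_capacity = workloads[0][0]        # assume we keep the first host
    return total_demand / target_capacity

identical = [(100, 0.20), (100, 0.20)]
mixed     = [(100, 0.20), (250, 0.20)]       # second host is 2.5x bigger

print(consolidated_utilisation(identical))   # 0.40 - the naive answer holds
print(consolidated_utilisation(mixed))       # 0.70 - percentages mislead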

The third principle of Capacity Management
Cross-platform visibility is essential.  There is no purpose in having silo'd capacity management.  As indicated in the fundamental principle, it is imperative to identify risk to business services.  This means that we need to get end-to-end visibility on all the components that could introduce risk to service quality; meaning storage, network, virtual, physical - and more.  If you are operating within a silo right now, you need to explore ways to increase your scope and introduce a single, standard process - eliminating the variance in approach and in accuracy that is inevitable if you take a silo'd approach.

The fourth principle of Capacity Management
Operate within a limited set of use-cases.  The natural implication of the third principle is that you could extend your approach to include every asset in a business service - even down to the capacity of things you can't measure.  The fourth principle says that you should constrain the scope of your Capacity Management activities to align with your IT management objectives.  You might be solely interested in optimizing your virtual estates, or you might be focused on a DevOps capacity management lifecycle.  Your choice of tool and process should remain consistently aligned with your objectives.

The fifth principle of Capacity Management
Don't Capacity Manage in isolation!  Remember that one of the benefits of Capacity Management is in managing risk, which implies a connection with change.  Change can come from a business perspective (new channels, better performance, new markets etc.) or an IT perspective (virtualizing, consolidating, upgrading, software releases etc.).  Your Capacity Management process must integrate with and add insight into these other critical management processes.  Correlating with IT Financial Management will offer cost/efficiency benchmarking.  Connecting with Release Management will provide scalability and impact assessments.  Working with Business Continuity Management will give scenario planning and quantified risk mitigation.  Capacity Management operating on its own adds no value to the enterprise and should be terminated.

Summary
Capacity Management is a fundamental of business.  Whether the assets in question are office cubicles, trucks in a fleet, or virtual servers - they all fulfil a business need and represent an investment on behalf of the business to meet that need.  Enterprises operating at lower risk and cost will enjoy a competitive advantage in the market - and Capacity Management is the vehicle to that advantage.  Use the five guiding principles outlined in this post to give yourself the very best chance of success.