Monday 19 November 2012

Leadership... and the rush to the cloud

It might seem strange to write a post about leadership on a cloud computing blog, but think about it this way -- the 'Cloud' is all about leadership.  It's the bandwagon that you've got to get on, irrespective of where it's going.  It's the "common wisdom" where 'thought leaders' become too conscious of the opinions of others and begin to emulate each other and conform, rather than think differently.

The reason I think this is happening is that the fundamentals of IT service delivery haven't gone through a revolution in the last 5 years.  Sure, there have been leaps forward - notably in the smartphone and tablet markets, which have radically influenced accessibility and demand for services.  Over the last 5 years, we have also seen incremental advances in IT capacity: in networks (notably end-user bandwidth, driven by increasing demand), and in compute and storage terms.  But the fundamentals of IT service delivery haven't changed.  If you had implemented ITIL 5 years ago, you would still have the same frame of reference today and it would serve you well.

The difference is perception, and the advance of running IT like a business.  Yet this was one of the main strategies before the cloud came along.  The reality is that the business caught up with IT, and debunked the myths of the risk-averse culture that had become prevalent in many large enterprises.  The business started to demand quality of service, and began to put a focus on costs.  Just as an enterprise would manage costs in any other part of its business, IT soon found that it was under similar cost pressures - and these accelerated as the global downturn impacted profit margins.

What's really interesting, though, is the way that certain business models have begun to prosper in this new dynamic.  Those are models that allow businesses to move away from large sunk capital investments, and towards a flexible model that allows them to account for their costs as a percentage of their revenue stream.  There is clearly a great deal to be gained in accounting transparency here, but there's more - these flexible arrangements allow businesses to scale their cost base according to their overriding business dynamic.  On the face of it, it's a low-risk engagement for the customer.

But here comes the rub.  As any risk analyst will tell you, it's the weakest link in the chain that tells you where your true risk lies; and of course there are a number of risks associated with moving to this flexible arrangement that could scupper the whole deal.  For several years, the security risks of losing personal data were often quoted as a show-stopper.  The more reputable companies offering these services have mitigated those risks -- for now at least.  There are also a number of regulatory factors to take into account, not least the actual jurisdiction of the data stores.  Balancing these competing risk factors is the business of IT leaders.

Wednesday 17 October 2012

Strip Down Cloud: the basics of cloud provision


Let me set my stall out: I really think that "in the cloud" is not a term created by IT professionals, or even marketing teams.  I think it is a term created by the archetypal technophobe businessman who doesn't want to be bothered by the details of IT, and just wants a service delivered - and doesn't care how.  He* wants it like a phone contract - something that can be tailored to the number of minutes, the number of texts - and can be flexible to move with his business.

All this stuff about what the cloud really is - is really just guff.  Take for example self-service.  This business guy doesn't care about self-service.  In fact, it would be perfect if somebody else could do it for him.  He wants a way of managing the contract himself - but he doesn't want to administer the service himself.  If his business volumes go up - he wants to adjust the contract, so he has the capacity to support his business.  If the volumes go down, likewise - the alignment with his business needs is what's important to him.

Take virtualization technology.  This is a means to an end: it provides the rapid provisioning that this business guy really wants.  But he doesn't care about virtualization.  He cares about rapid response.  If he orders more minutes on his phone contract, he wants the minutes instantly (although he might be satisfied to wait until the end of the monthly billing period).  The same thing is true with his IT cloud.  He wants to adjust his contract - and then wants to see rapid implementation of the changes.  But it could be a horde of magic goblins doing the work, for all he cares.

The only things this guy cares about are the quality of service he is receiving, the cost of the service, and the ability to flexibly manage that service.  Just like his phone contract -- if the quality is no good, he will cancel it and move to a provider with better performance.  If the cost is too high, he will move to a more competitive provider.  And the flexibility in the contract will allow him to do that (although phone contracts often have a lock-in term; of course, if the businessman were designing the terms, there wouldn't be one).

So what are the essentials of the cloud, from a technologist's point of view?
  1. the ability to measure and manage quality of service.  A provider who values customer service (and many would argue that customer service is the cornerstone of a successful selling organisation) will proactively manage service levels and ensure that the cloud customer is getting the service levels that they need - and that they are contracted for.  For this, we would recommend not only some level of service assurance monitoring, but also some risk avoidance through predictive analytics, typically found in a capacity planning process.  In addition, where service levels are set either contractually or through expectation, some form of management of performance against those service levels - business service insight - will be imperative.
  2. the ability to manage cost and capacity effectively.  Given that an unsatisfied customer may change contracts at will - and that the cloud marketplace is a competitive one - cost is the second important factor in a customer's investment decision.  Cost in a cloud environment is borne mainly through infrastructure operations, and ties together elements like facilities, management, power/cooling, and capital costs.  The Uptime Institute published a very good paper on this recently.  Equally, though, the price charged to the customer must either equate to the cost (in an internal private cloud run as a cost center) or exceed it (as a profit center in a business), and should be derived from the cost of the allocated capacity - a rough sketch of that derivation follows after this list.
  3. the ability for a customer to flexibly manage their contract.  There must be an easy way for the customer to change their service.  Increasingly tech-savvy customers demand portals through which they can manage their own service levels.  A self-service portal is often the lowest-cost way of providing this capability.  However, the cloud does not mandate a portal; in fact, a call center can provide the same facility.  Most of the time, I manage my phone contract through a call center - and the benefit is that my provider gets the chance to sell me something new every time!
  4. the ability to rapidly deploy any changes to a customer's service.  The cheapest and quickest way of doing this is likely through the use of virtualization technology, where existing, unused capacity can be allocated to a customer.  New technologies are emerging here all the time, around storage and network capacity as well as compute capacity.  Hybrid cloud providers are using third-party capacity to extend their capability quickly and leverage existing data center space.
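
To make point 2 concrete, here is a minimal sketch of how a provider might derive a unit price from infrastructure cost and allocated capacity.  Everything in it - the cost figures, the capacity units and the derive_unit_price helper - is hypothetical, purely for illustration.

    # Hypothetical sketch: derive a unit price from infrastructure cost and
    # allocated capacity (point 2 above).  All figures are invented.

    def derive_unit_price(monthly_costs, allocated_units, margin=0.0):
        """Price per allocated unit needed to recover cost plus margin.

        monthly_costs   -- dict of cost elements (facilities, power/cooling, ...)
        allocated_units -- capacity units actually allocated to customers
        margin          -- 0.0 for a cost center, >0 for a profit center
        """
        total_cost = sum(monthly_costs.values())
        return (total_cost / allocated_units) * (1.0 + margin)

    costs = {
        "facilities": 20_000,      # data center space
        "power_cooling": 12_000,   # energy and cooling
        "management": 18_000,      # people and tooling
        "capital": 30_000,         # amortized hardware spend
    }

    # Internal private cloud, charged back as a cost center
    print(derive_unit_price(costs, allocated_units=4_000))               # -> 20.0 per unit

    # Commercial provider running as a profit center with a 25% margin
    print(derive_unit_price(costs, allocated_units=4_000, margin=0.25))  # -> 25.0 per unit

The point of the sketch is only the shape of the calculation: price must be anchored to the cost of the capacity actually allocated, not plucked from thin air.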

*He could be a She too; and probably is

Thursday 27 September 2012

Today, a MHz is just not a MHz any more...

A common trend in virtualization environments is to use the easily-accessible MHz rating of the server as a normalization parameter - so that when you're considering an optimization routine, you can identify available capacity in terms of MHz, compare it to some other capacity being used, and determine whether there's a fit - or not.  While this method of normalization makes complete sense in terms of the data available, I'm here with some bad news.  They just don't make MHz like they used to.  Actually, in many cases, they make them better!  The SPECint2006_rate benchmark is a measure of CPU throughput, and it lets us test the assumption that throughput correlates directly with MHz.



Confused about capacity?  Take this example...  How much oil you can get through an oil pipeline is proportional to the cross-section of the pipeline - how fat it is - and the speed at which the oil moves through it.  Put that into a digital context: the cross-section of a CPU is related to the number of cores, and the speed of the CPU is measured in MHz.  The clock speed is the frequency of the chip - and defines how quickly a task can be processed by the CPU.
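
As a rough sketch of that analogy, the naive normalization looks like the snippet below.  The server specs are invented purely for illustration; they are not real benchmark figures.

    # A naive 'cores x clock' capacity rating, following the pipeline analogy.
    # The server specs below are invented purely for illustration.

    def naive_capacity(cores, clock_ghz):
        """Capacity as pipeline cross-section (cores) times flow speed (clock)."""
        return cores * clock_ghz * 1000  # rating expressed in 'MHz' of capacity

    old_server = {"cores": 8, "clock_ghz": 3.0}   # older chipset, higher clock
    new_server = {"cores": 8, "clock_ghz": 2.6}   # newer chipset, lower clock

    print(naive_capacity(**old_server))  # -> 24000.0 'MHz'
    print(naive_capacity(**new_server))  # -> 20800.0 'MHz' -- smaller on paper, yet
                                         # the newer chip may deliver more real
                                         # throughput per SPECint2006_rate

The trouble, as the rest of this post shows, is that the 'flow speed' part of the analogy no longer behaves itself.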




They don't make 'em like they used to...

The problem, though, is that the clever guys at Intel and the other processor manufacturers don't want to play this game.  They're always thinking of new ways of boosting performance that don't rely on just a MHz improvement.  Take a look at the data.  The chart below shows the ratio of SPECint2006_rate to GHz over the last 6 years, controlling for the number of cores in the benchmark measurement.  It shows that a GHz in 2011 is equivalent to 1.15GHz just 12 months earlier.  Another interesting point is that the AMD data doesn't show this same rate of change - the trend line has a much lower gradient.  This highlights that the chipset is a hugely important factor when using MHz as a normalization rating.  A MHz just isn't a transferable unit.
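
If you want to reproduce this kind of analysis yourself from published SPEC results, a minimal sketch looks something like the following.  The sample rows are invented placeholders - the real figures should come from published submissions at spec.org.

    # Sketch of the ratio analysis described above: SPECint2006_rate per GHz,
    # normalized per core and trended over time.  Sample rows are placeholders;
    # use published results from spec.org for real analysis.
    import numpy as np

    # (year, SPECint2006_rate result, cores, clock in GHz)
    samples = [
        (2006, 110, 4, 3.0),
        (2008, 190, 4, 3.0),
        (2010, 310, 4, 3.3),
        (2011, 360, 4, 3.3),
    ]

    years = np.array([s[0] for s in samples], dtype=float)
    # throughput per core per GHz, so core count and clock are controlled for
    ratio = np.array([rate / (cores * ghz) for _, rate, cores, ghz in samples])

    slope, intercept = np.polyfit(years, ratio, 1)   # simple linear trend
    print(f"approx. change in throughput-per-GHz: {slope / ratio.mean():.1%} per year")

With real SPEC.org data for a single vendor's chips, the gradient of that trend is exactly what makes a 'MHz' from one year incompatible with a 'MHz' from another.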


Data for HP ProLiant servers (Intel only), taken directly from SPEC.org


Conclusion

Normalization is a key part of good capacity management practice.  Using a percentage is simply a recipe for disaster when trying to apply intelligence to configuration optimization.  Using MHz is an easy option, but alas it is just fool's gold.  The data for Intel chips shows that the processing rate for chips changes dramatically over time, and that could introduce errors of over 15% per 12-month period for optimization.  If you were moving from legacy kit, 3 years old, the error margin may be over 75%.  This will always lead to over-specified machines - and a higher spend than necessary to meet business requirements.  Whilst that amount of headroom may have been justified in directly allocated capacity, in the cloud that overspend represents a high cost of ownership and immediate optimization challenges on deployment.

Tuesday 11 September 2012

Agile practice for the cost-effective cloud


Can you tolerate failure?

Some situations cannot tolerate failure: the highly political government project, the business-enabling SAP deployment, or the market-leading e-business application all share a common characteristic - getting it wrong hurts.  And the most common symptom of getting it wrong?  Poor performance, poor availability, poor service - and disgruntled customers.

What happens then is where heroes make their reputation.  Firefighting, troubleshooting, late-night candle-burning, bonus-generating heroics.  Don't get me wrong, it's great to be a hero - saving the day from evil, just before the clock ticks to zero.  But surely it's better not to get into that position in the first place?

Lessons from other industries

The use of simulation is common in industries where getting performance design right first time is critical to the bottom line.  Imagine building an aeroplane with control systems that didn't respond in time, semiconductors that performed worse than their predecessors, or civil engineering projects that couldn't handle the projected loads.  Scenario planning is also a favourite of agile business.  Just as a chess grandmaster is thinking several moves ahead, the business planner looks beyond immediate trends and plans the impact of their next business strategy.

But, IT is sooo complex

For IT professionals, getting it wrong hurts just as much; but here we are at a significant disadvantage.  The complexity of enterprise-class systems, the interconnected and often opaque nature of cloud services, and the lack of insight into business initiatives leave us in the dark.  Whilst we're under constant pressure to reduce cost and risk, the ever-increasing complexity forces us to over-provision, to reduce the risk of getting it wrong.
But is that right?  Is it true to say that IT is more complex than embedded avionic systems?  Probably not, although the rate of change is certainly higher.  Can we still justify time spent planning, even though the sands are constantly shifting under our feet?  In the past, we've focused on just a limited number of critical systems, or performed some pretty rudimentary analysis that makes us comfortable - and trusted in the heroes.  But there must be a better way - a way that enables IT to be a part of the planning process, not just driven before the wind like a rudderless ship.  There must, mustn't there?

Converged scenario planning 

Times are changing.  Scenario planning tools have long been available in the marketplace, in the capacity planning sector, and now they are catching up with the needs of the business of cloud management.  Remember that the agility of cloud computing provokes the need for better management of headroom - to maintain the capacity to support elastic demand - and cost - to do so at a profit.  Cloud Capacity Management solutions must incorporate and translate the needs of the business into capacity requirements, and also translate capacity requirements into business constraints.  And the universal language of business?  Currency.
The new world of the cloud demands alignment between the needs of the business and the capacity provided to it.  The cloud-enabled enterprise's IT spend is proportionate to the needs of the business.  The cloud provider has the challenge of ensuring that its revenue covers its costs and provides a profit margin to its stakeholders.  Efficient decision makers are seizing control of their supply chains, and ensuring that risks and costs are managed effectively throughout.  The convergence of cloud and consumer planning ensures transparency in that decision-making process.

The new synergy

Successful business and IT leaders are getting their act together, and learning to co-operate in new fruitful ways.  Planning for IT and business initiatives through capacity management, with focus on both customer experience and the bottom line, is enabling enlightened decision makers to understand the ramifications of their alternative strategies, and understand the budget and risk parameters for their chosen plans.  

It turns out the old adage "a stitch in time saves nine" still has relevance today...

Monday 3 September 2012

Starting Capacity Management - Five Guiding Principles for Success

A few folk have been posting asking for advice on choosing a Capacity Management tool.  The normal result of that is a flurry of responses from vendors trying to position their tool as the best.  I prefer a different approach - some tools are better for some situations than others, and all of them will have some limitations - Capacity Management has to cover such a diverse range of platforms and use-cases that it's inevitable.

The fundamental principle of Capacity Management
As I tweeted recently, the fundamental principle of Capacity Management is that it exists to reduce cost and risk to business services.  When we're designing that process, selecting a tool, or looking for skills - it is imperative that we keep hold of that guiding principle whenever making a decision.  This means that taking a silo infrastructure view doesn't really help us understand or quantify the risk to business services (unless the business service just happens to map 100% onto the infrastructure, which was the old way of running things but pretty much incompatible with the cloud).

The second principle of Capacity Management
Not all capacity is created equal.  So it is critical that we abandon percentages as a way of comparing different assets.  Percentages are only useful for comparing a current value against another, such as the maximum.  But you can't take one server at 20% and another running at 20% and assume that consolidating them will result in a server running at 40% (unless the configuration is identical) - the sketch below shows why.  We now need a way of comparing capacities.  One of the popular ways of doing that is through using MHz.  But this approach is plagued with inaccuracy.  The main issue is that the processing power of a CPU is not directly correlated with MHz.  In fact (as I intend to show in a later post) the difference can be roughly stated as 25% per year for some chipsets.  This means that a 2GHz chip from 2010 is roughly equivalent to a 1.5GHz chip from 2011.  And don't forget to control for the effect of hyperthreading - your monitoring tools will report on logical CPUs, not physical ones.
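
Here is a minimal sketch of why those percentages don't add up across different hosts.  The per-server capacity ratings are invented for illustration; in practice you would use a published benchmark figure such as SPECint2006_rate, or a vendor-supplied rating.

    # Why 20% + 20% != 40%: utilization only means something relative to each
    # server's own capacity.  The ratings below are invented; in practice use a
    # published benchmark such as SPECint2006_rate.

    def consolidated_utilization(workloads, target_rating):
        """Re-express each workload in absolute capacity units, then sum."""
        used = sum(util * rating for util, rating in workloads)
        return used / target_rating

    # (current utilization, capacity rating of the host it currently runs on)
    workloads = [
        (0.20, 120.0),   # 20% of an older, smaller host
        (0.20, 480.0),   # 20% of a newer, much larger host
    ]

    target_rating = 480.0   # consolidate both onto the newer host
    print(f"{consolidated_utilization(workloads, target_rating):.0%}")   # -> 25%, not 40%

Only once both workloads are expressed in a common capacity unit does the arithmetic become meaningful.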

The third principle of Capacity Management
Cross-platform visibility is essential.  There is no purpose in having silo'd capacity management.  As indicated in the fundamental principle, it is imperative to identify risk to business services.  This means that we need to get end-to-end visibility on all the components that could introduce risk to service quality; meaning storage, network, virtual, physical - and more.  If you are operating within a silo right now, you need to explore ways to increase your scope and introduce a single, standard process - eliminating the variance in approach and in accuracy that is inevitable if you take a silo'd approach.

The fourth principle of Capacity Management
Operate within a limited set of use-cases.  The natural implication of the third principle is that you can extend your approach to include every asset in a business service - even down to the capacity of things you can't measure.  The fourth principle says that you should constrain the scope of your Capacity Management activities to align with your IT management objectives.  You might be solely interested in optimizing your virtual estates, or you might be focused on a DevOps capacity management lifecycle.  Your choice of tool and process should remain consistently aligned with your objectives.

The fifth principle of Capacity Management
Don't Capacity Manage in isolation!  Remember that one of the benefits of Capacity Management is managing risk, which implies a connection with change.  Change can be from a business perspective (new channels, better performance, new markets etc.) or an IT perspective (virtualizing, consolidating, upgrading, software release etc.).  Your Capacity Management process must integrate with and add insight into these other critical management processes.  Correlating with IT Financial Management will offer cost/efficiency benchmarking.  Connecting with Release Management will provide scalability and impact assessments.  Working with Business Continuity Management will give scenario planning and quantified risk mitigation.  Capacity Management operating on its own adds no value to the enterprise and should be terminated.

Summary
Capacity Management is a fundamental of business.  Whether the assets in question are office cubicles, trucks in a fleet, or virtual servers - they all fulfil a business need and represent an investment on behalf of the business to meet that need.  Enterprises operating at a lower risk and cost will enjoy a competitive advantage in the market - and Capacity Management is the vehicle to that advantage.  Use the five guiding principles outlined in this post to give yourself the very best chance of success.

Thursday 9 August 2012

Navigating to the cloud

In my recent post Planning for Better Operating Margin, I explored alternative methods of predictive analytics.  In this post, we'll explore the economics of choice - and how predictive analytics can be used to navigate a route to the cloud.
The fundamentals of the cloud model result in a competitive marketplace - and choice for the consumer.  There are many different routes to cloud computing, representing alternative platforms, service levels, benefits and costs.  Choosing from such a wide range of offerings can be confusing, and scary!  The law of diffusion of innovation says that 50% of the community are either late adopters or laggards - so it follows that popular choices for cloud technology become more and more popular as the word spreads.  However, we're living in a fast-paced world.  New technology is springing up all the time, and the marketplace is getting larger and more diverse.  Take, for example, one cloud niche - virtual storage.  Thus far, this market has been dominated by major players such as VMware, EMC, Dell, IBM and others - but in light of the growing demand for virtual storage, an array of niche storage players such as Tintri, DataCore, Nimble and others are emerging as strong alternative contenders.

In his recent blog post, Larry Walsh explored the idea that service providers will become the arbitrators of cloud services.  This idea rests on the fact that service agreements with cloud providers are relatively fixed and inflexible.  Empirical evidence would support this - the quality of service you experience from a public cloud provider is mostly non-negotiable.  This is simply another reason why enterprise organisations turn to private clouds for their mission-critical systems.  However, the world is a little more complex than that.  Service Providers often fulfil a key role in integrating cloud offerings to deliver a quality of service that can be negotiated and agreed.  I would assert that this act of choosing is key to a well-functioning market economy.  We don't all expect the same level of performance from our cars - some of us would happily pay extra for more horsepower, whilst others are looking for the lowest cost of ownership.  Choosing from the array of cloud providers and offerings gives us the ability to trade off cost and quality of service.  This is where Larry Walsh asserts that the Service Provider fills the gap.

However, as Levitt and Dubner assert in Freakonomics, beware the motivations of the expert.  They observed that real-estate agents selling their own homes left them on the market for 10 days longer than average, and achieved a 2% higher price.  What incentive does the Solution Provider have?  When you move away from the cloud model, you lose the agility, flexibility and elasticity that goes with it.  So address the key topics: How are you engaging with cloud providers?  What are your contract terms with your Solution Provider?  And crucially, how do you benefit from the elasticity of the cloud?

In order to guide you through your decision making, you can either rely on a Solution Provider to do all the heavy lifting (although, as noted above, you cannot completely absolve yourself of choice), or apply your own governance.  This is where predictive analytics comes in.  With the right tools, you can evaluate alternative choices before making a decision.  This 'what if' scenario analysis is key to ensuring that the risk of your choice is mitigated, and that you have an informed idea of the alternatives in case things do change.  After all, change is a world we all live in...
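
As a toy illustration of that kind of 'what if' comparison, the sketch below weighs two hypothetical providers on cost against a required service level.  The provider names, prices and availability figures are all invented.

    # Toy 'what if' comparison of two hypothetical cloud options.  All figures
    # and names are invented; the point is the shape of the trade-off.

    options = [
        {"name": "PublicCloudA",  "monthly_cost": 8_000,  "availability": 0.995},
        {"name": "PrivateCloudB", "monthly_cost": 14_000, "availability": 0.9995},
    ]

    required_availability = 0.999   # the service level the business actually needs

    viable = [o for o in options if o["availability"] >= required_availability]
    best = min(viable, key=lambda o: o["monthly_cost"])
    print(f"Cheapest option meeting the SLA: {best['name']} at {best['monthly_cost']}/month")

A real evaluation would fold in many more dimensions (jurisdiction, lock-in, elasticity), but the principle is the same: make the trade-off explicit before you sign.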

Tuesday 17 July 2012

Planning for better IT operating margin

You'll hear the phrase "predictive analytics" coming from most of the major players in capacity management these days.  Looking into the future to support the planning of any major infrastructure or software initiative, or even to account for variations in workload growth, requires predictive analytics to some extent.  Whether you're an infrastructure provider or consumer, better planning drives more efficient operations and hence improved margins.  Let's explore more:




In its most basic form, predictive analytics is about extrapolation.  By gathering a set of historical data, we can begin to spot patterns and make some assessment of the future trajectory.  The type of extrapolation that can be made depends on the power of the analytics - at its most basic, linear regression analysis looks at long-term trends and plots a single straight-line trend out into the future.  This works fine for persistent metrics like disk space.  In fact, it works reasonably well for less persistent metrics, provided you bolster the analysis with some variability assessment.  However, better curve-fitting algorithms (lognormal, exponential, binomial etc.) can provide more accurate predictions if the data is well behaved.  Take a look at the graph above.  The binomial fit is closer to the capacity-used metric, which is a combination of steady organic growth and a seasonal variation trend.  In this case, a linear trend on the peaks (or 98th percentiles) can give the same net result, but it's a little more cumbersome.
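
As a minimal sketch of the 'linear trend on the peaks' approach mentioned above, something like the following works.  The utilization samples are synthetic, generated purely for illustration, and the 85% planning threshold is an assumption.

    # Sketch: trend the weekly 98th percentile of a utilization metric and
    # extrapolate linearly to a planning threshold.  Samples are synthetic.
    import numpy as np

    rng = np.random.default_rng(42)
    weeks = 26
    weekly_p98 = []
    for w in range(weeks):
        base = 40 + 0.8 * w                        # steady organic growth (%)
        samples = base + rng.normal(0, 5, 24 * 7)  # hourly utilization samples
        weekly_p98.append(np.percentile(samples, 98))

    x = np.arange(weeks)
    slope, intercept = np.polyfit(x, weekly_p98, 1)   # linear trend on the peaks

    threshold = 85.0   # assumed planning threshold (%)
    current = slope * (weeks - 1) + intercept
    print(f"p98 grows ~{slope:.2f} points/week; ~{(threshold - current) / slope:.0f} weeks of headroom left")

The same pattern extends to other percentiles or to a fitted curve; the essential step is trending a peak-behaviour statistic rather than the raw average.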


There are, however, two problems with extrapolation.  The first is one of scale.  With no roll-up mechanism, you quickly get drowned in data.  There's a simplification process that needs to support the trends.  The second, and more fundamental, is that it assumes all other variables remain constant - meaning it's only the workload that changes; the environment itself is static.


Is this a good assumption?  Well, for some platforms it is.  For disk capacity, it is a pretty good rule.  Only when disks are running out of space will some change be made - and these changes can be easily reflected in the extrapolation.  For physical infrastructure, or statically allocated partitions, this can be a decent assumption too - provided the software itself isn't changing.


But where the extrapolation and curve-fitting algorithms really fail is where either the software or the operating environment is changing.  Determining the impact of these step-changes in capacity is a task too complex for curve-fitting alone - some configuration information must be reflected in the predictions.  At this stage, a modelling approach must be used.  There are in fact many different modelling algorithms and approaches, but the most popular provide both an infrastructure and a service perspective on capacity.  The service-centric capacity plan takes a cross-section of Data Centre capacity allocated or used by a hierarchy of service definitions, which can be taken from a service catalogue or CMDB.  The benefit of this view is to enable dialogue with business owners about plans for their relevant domain.  If you're capacity planning in the cloud, the relevant conversation should involve budgeting, quality and optimization opportunity.  If you have a model, then the relevant KPI for trending and extrapolation becomes workload volumetrics - and this means you can manipulate forecast data based on changing business requirements in the future.
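
To make the volumetrics idea concrete, here is a minimal sketch of a model that turns forecast business volumes into utilization, under the simplifying assumption that resource demand scales linearly with transaction volume.  The service demand, host capacity and volumes are invented examples, not measurements.

    # Minimal model sketch: forecast utilization from business volumetrics,
    # assuming resource demand scales linearly with transaction volume.
    # Service demand, capacity and volumes are invented examples.

    service_demand_cpu_sec = 0.030     # CPU-seconds consumed per transaction
    host_capacity_cpu_sec = 16 * 3600  # 16 cores available for one hour

    def utilization(transactions_per_hour):
        return (transactions_per_hour * service_demand_cpu_sec) / host_capacity_cpu_sec

    current_volume = 900_000                  # transactions per hour today
    forecast_volume = current_volume * 1.40   # business forecasts 40% growth

    print(f"today:    {utilization(current_volume):.0%}")    # -> 47%
    print(f"forecast: {utilization(forecast_volume):.0%}")   # -> 66%

Because the KPI being trended is a business volume rather than a raw utilization number, the forecast can be adjusted directly when the business changes its plans.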


The modelling approach really is beneficial in managing shared virtual infrastructures like the cloud, where the bottleneck may appear at the physical or virtual layer, where the virtual configuration may be changing rapidly, and where DRS workloads may be shifting around within a cluster.  It is also beneficial in planning for new software releases, upgrades or (major) reconfigurations - thereby incorporating a life-cycle approach to capacity management.  Surely this is where predictive analytics is at its most powerful?  In helping architects to size new cloud environments, testers to validate the scalability of their new release, and capacity managers to measure the impact of their release into a congested production environment.


In Summary
In the technology life-cycle, capacity management's predictive analytics should support sizing, provisioning, managing and decommissioning.  Whether you choose to use a tool for that, or operate a consultative approach - leaving holes in your planning process has been shown to add risk and cost to your IT operations.