Tuesday, 14 October 2014

Managing User-Experience in the App Economy



In today's competitive app economy, user experience is one of the crucial differentiators between leaders and laggards in the market. With disruptive forces constantly at work, corporations recognize that they have to work hard to gain new customers - and to maintain the loyalty of their existing ones. Consequently, these corporations seek to form 1:1 relationships with their stakeholders by means of an intimate customer experience.

During the last decade, another fundamental force has been shaping how consumers interact socially. User experiences are now widely shared, and reviews are typically present at the point of purchase or install. However those reviews are consulted, it is certain that they can either help or hinder market penetration and, ultimately, the brand.

In days gone by, the quality of an IT service was often measured by percentile. A typical doctrine would define a service level where 90% of responses completed within 3 seconds. This type of logic, however, doesn't consider the social impact on the user experience of the remaining, vociferous 10%. Analytics are needed to identify that segment and then address it with remedial customer service.
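To make the percentile logic concrete, here is a minimal sketch of the 90th-percentile view and the slow tail it hides. The sample data and the 3-second target are invented for illustration, not drawn from any particular monitoring tool:

```python
# Illustrative sketch: a percentile-based service level check, plus the "tail"
# of users who fall outside it. Sample response times are invented.
import random

random.seed(42)
response_times = [random.lognormvariate(0.4, 0.6) for _ in range(1000)]  # seconds

def percentile(data, pct):
    """Return the pct-th percentile of data (simple nearest-rank method)."""
    ordered = sorted(data)
    index = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[index]

p90 = percentile(response_times, 90)
tail = [t for t in response_times if t > 3.0]   # users outside the 3-second target

print(f"90th percentile response time: {p90:.2f}s")
print(f"Users outside the 3s target: {len(tail)} of {len(response_times)}")
```

The service level may look healthy at the 90th percentile while the tail still represents hundreds of unhappy, and vocal, users.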

The question that remains is: "How much investment do you put into managing user experience, and what are the risks if you fail?"

Credit to simonmiller.co.uk for the photo

Re-Branding the Blog

I've decided to rebrand this blog, which used to be titled "Cloud Capacity" - a name that, in all honesty, was probably a bit vague for its purpose. Considering the content, I've decided to focus more on 'rebranding' capacity management and its role in the IT supply chain.

The simple reality is that IT trends have been moving towards a tiered model of operation for some time now, and capacity management has to find its role, as "end to end" management of apps, infrastructure and datacenters has dwindled while a variety of IT sourcing alternatives gain market share.

We must also recognise that many of these changes have been borne out of business drivers to create an agile, cost-effective delivery mechanism. What I'm becoming more interested in researching is this: how do we harness those drivers and add value to them with data and analytics?

Over the next 12 months I shall be investigating several topics around workload management, financial management and the balance of quality against agility. Enjoy, dear reader, enjoy!

Tuesday, 9 September 2014

Connecting 3 tiers of capacity management

It always amazes me how confusing enterprise IT can get without collaboration. I'm not just talking about connecting people in social interaction, but about a proper workflow that co-ordinates the activities of different teams and tools. Without co-ordination, nobody is sure of roles and responsibilities or who is doing what - and once processes are started, it's impossible to track whether they've been completed.


I reflect on these challenges when considering the multiple teams involved in capacity management. There are actually a good number of teams who will carry out some form of capacity management - even if they don't know they're doing it. However, the co-ordination of activity between these IT functional teams is so often neglected. I was sitting through a Trevor Bunker (@tbunker01) keynote at the CA event in Brussels this week, considering how the DevOps trend is impacting enterprises. Interestingly, the #1 recognised benefit of DevOps is collaboration.

So, is DevOps successful because it focuses on collaboration, or is collaboration a side-effect of focusing on a workflow between two distinct functional units in IT?  And if the latter, what greater benefits could we expect when applying workflow between other distinct teams?

In Capacity Management, there are often 3 operational teams who have responsibility for right-sizing analysis. The first sits at the application layer, and is responsible for requesting the right amount of IT capacity according to current and projected workloads; this team is well skilled in the arts of demand management and forecasting. The second tier of capacity management happens at the shared-infrastructure level, where the team anticipates demand from a wide number of applications with varying workloads and ensures that the infrastructure is right-sized to provide a resilient and cost-effective service. The final tier is at the data-center level, where the management of physical, electrical and environmental factors depends on the amount of capacity specified or requested by the IT teams above it.

These 3 teams always carry out a level of capacity management. Every time a change request is submitted for more capacity, an element of sizing is done. This inter-dependent structure ultimately provides an indirect connection between the needs of the business and the data-center capacity provided. Ironically, the tiered nature of the delivery mechanism is what allows for the most cost-efficient and resilient operations, whilst also being the area of greatest inefficiency.

Are we not missing a trick, by declining to add a workflow to connect these tiers?  And what role does service management have to play in facilitating this workflow, beyond managing tickets? What level of automation is desired or accepted, particularly in the realms of elastic compute - can and should capacity be provisioned automatically when certain triggers fire?
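As a thought experiment only, here is a minimal sketch of what a workflow connecting the three tiers might look like. The team names, the vCPU threshold and the auto-provisioning rule are all invented for illustration - this is not a description of any existing tooling:

```python
# Purely illustrative: a capacity request routed through the three tiers.
from dataclasses import dataclass

@dataclass
class CapacityRequest:
    service: str
    extra_vcpus: int
    approved_by: list

def application_tier(req: CapacityRequest) -> CapacityRequest:
    # Application team validates the demand forecast behind the request
    req.approved_by.append("application-capacity")
    return req

def datacentre_tier(req: CapacityRequest) -> CapacityRequest:
    # Data-centre team assesses physical, electrical and environmental impact
    req.approved_by.append("datacentre-capacity")
    return req

def infrastructure_tier(req: CapacityRequest, free_vcpus: int) -> CapacityRequest:
    # Shared-infrastructure team checks headroom; small requests could be
    # auto-provisioned, larger ones escalate to the data-centre tier.
    if req.extra_vcpus <= free_vcpus:
        req.approved_by.append("infrastructure-capacity (auto)")
    else:
        req.approved_by.append("infrastructure-capacity (escalated)")
        req = datacentre_tier(req)
    return req

request = application_tier(CapacityRequest("web-store", extra_vcpus=64, approved_by=[]))
request = infrastructure_tier(request, free_vcpus=32)
print(request.approved_by)
```

Even a toy workflow like this makes roles and hand-offs explicit, which is precisely what is missing today.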

In the wider adoption of capacity management in a tiered delivery model, the answers to these questions can make the difference between a successful implementation and an unsuccessful one.





 

Friday, 14 February 2014

Capacity Management - 5 top tips for #DevOps success

An esteemed consultant friend of mine once commented: "in capacity management, it is the step changes in capacity that are the most difficult to plan for". In agile release practice, such step changes are increasing in frequency. As each new release hits, the historical metrics describing quality of service lose relevance, making capacity planning harder.

To respond to this change, an agile capacity management practice is called for, which must be lightweight, largely automated, and relevant to both deployed software and software not yet released. Indeed, the process must be able to support all aspects of the DevOps performance cycle - from infrastructure sizing, through unit and load testing, to operational capacity management. In shared environments, such as cloud infrastructures, it is easy to become lost in the "big data" of application or infrastructure performance.

When executing a DevOps strategy however, it is critical to embed performance and capacity management as a core principle - structuring the big data to become relevant and actionable.  Here are 5 top tips for success:

1. A well-defined capacity management information system (CMIS) is fundamental

The foundation of your capacity management capability is data - building a strong foundation with a capacity management information system is crucial. The purpose of this foundation is to capture all relevant metrics that assist a predictive process, a process that provides insight about the current environment to help drive future decision-making. Context is crucial, and configuration information must be captured - virtual and physical machine specifications along with service configuration data. It is advisable also to design this system to accommodate business contextual data, such as costs, workloads or revenues. Automation of the data collection is critical when designing an agile process, and the system should be scalable so as to deliver quick wins, yet grow to cover all the platforms in your application infrastructures. This system should not replace or duplicate any existing monitoring, since it will not be used for real-time purposes. Also note: it is easy to over-engineer this system for its purpose - another reason to adopt a scalable system that can grow to accommodate carefully selected metrics.

[Image: the CMIS takes data from real-time monitors]
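As an illustration of the kind of record such a system might hold, here is a minimal sketch. The field names and values are invented; a real CMIS would map them onto whatever monitors and configuration sources you already run:

```python
# A sketch of a CMIS record: performance metrics kept alongside configuration
# and business context. Field names and values are invented examples.
cmis_record = {
    "timestamp": "2014-10-14T09:00:00Z",
    "host": "esx-prod-07",
    "metrics": {"cpu_util_pct": 62.5, "mem_consumed_gb": 188.0, "io_per_sec": 4100},
    "configuration": {"sockets": 2, "cores_per_socket": 10, "ram_gb": 256,
                      "hypervisor": "ESXi 5.5"},
    "business_context": {"service": "payments", "monthly_cost_gbp": 1450,
                         "workload_driver": "transactions_per_hour"},
}
```

The point is less the schema than the discipline: metrics, configuration and business context live together, so that a prediction can always be put into context.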

2. Acquire a knowledge base around platform capacity

A knowledge base is crucial when comparing platform capabilities. Whether you are looking at a legacy AIX server or a modern HP blade, you must know how those platforms compare in both performance and capacity. The knowledge base must be well maintained and reliable, so that you have accurate insight into the latest models on the market as well as the older models that may be deployed in your data centres. For smaller organisations, building your own knowledge base may be a viable option; however, beware of architectural nuances which affect platform scalability (such as logical threading, or hypervisor overheads). For this reason, it is practical to acquire a commercially maintained knowledge base - and avoid benchmarks provided by the platform vendors. Avoid the use of MHz as a benchmark; it is highly inaccurate. Early in the design stage for new applications, this knowledge base will become a powerful ally - especially when correlated against current environmental usage patterns.

[Image: quantifying the capacity of different platforms]
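For illustration only, a hand-rolled knowledge base might look something like the sketch below. The model names and relative capacity ratings are invented placeholders rather than real benchmark figures - which is exactly why a commercially maintained knowledge base is preferable:

```python
# Illustrative only: a tiny knowledge base of relative platform capacity.
# The ratings are invented placeholders, not real benchmark figures.
platform_capacity = {
    "Legacy-AIX-p570": {"relative_capacity": 1.0},
    "HP-BL460c-Gen8":  {"relative_capacity": 2.4},
    "Dell-R720":       {"relative_capacity": 2.2},
}

def normalised_headroom(platform: str, cpu_util_pct: float) -> float:
    """Express spare capacity in units of the baseline platform."""
    rating = platform_capacity[platform]["relative_capacity"]
    return rating * (100.0 - cpu_util_pct) / 100.0

print(normalised_headroom("HP-BL460c-Gen8", 55.0))
```

Normalising headroom to a common unit is what lets you compare an old AIX box and a new blade on the same chart.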

3.  Load Testing is for validation only

For agile releases, incremental change makes it expensive to provision and assemble end-to-end test environments, and time-consuming to execute the tests. However, load testing still remains a critical part of the performance/capacity DevOps cycle. Modern testing practice has "shifted left" the testing phase, using service virtualization and release automation, resulting in component-level performance profiling activity that provides us with a powerful datapoint in our DevOps process. By assimilating these early-stage performance-tested datapoints into our DevOps thinking, we can provide early insight into the effect of change. For this to be effective, a predictive modelling function of some sort is required, where the performance profile can be scaled to production volumes and "swapped in" to the production model. Such a capability has been described in the past as a "virtual test lab". For smaller organisations, this could be possible with an Excel spreadsheet, although factoring in the scalability and infrastructure knowledge base will be a challenge.

[Image: DevOps and performance testing]
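As a rough sketch of the "scaled to production volumes" step, the arithmetic below projects the utilisation implied by a component-level test result. The service demand, forecast volume and host capacity figures are all invented:

```python
# Back-of-the-envelope "swap in" of a shift-left test result: scale the
# measured service demand to the expected production arrival rate.
measured_cpu_seconds_per_txn = 0.045   # from a component load test (invented)
production_txn_per_sec = 120           # forecast production volume (invented)
host_cpu_capacity_seconds = 16         # e.g. 16 cores' worth of CPU-seconds/sec

projected_utilisation = (measured_cpu_seconds_per_txn * production_txn_per_sec
                         / host_cpu_capacity_seconds)
print(f"Projected CPU utilisation: {projected_utilisation:.0%}")
```

A spreadsheet can do this much; where it struggles is in layering on platform scalability and the knowledge base from tip 2.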

4. Prudently apply predictive analytics

To be relevant, predictive analytics need to account for change in your environment - predictive analytics applied only to operational environments are no longer enough. In a DevOps process, change is determined by release, so investing in a modelling capability that allows you to simulate application scalability and the impact of the new release is crucial. Ask yourself the question "how detailed do you need to be?" to help drive a top-down, incremental path to delivering the results you need. Although it is easy and tempting to profile performance in detail, it can be very time-consuming to do. Predictive analytics are fundamentally there to support decision-making on provisioning the right amount of capacity to meet demand - it can be time-consuming and problematic to use them to predict code- or application-level bottlenecks. Investment in a well-rounded application and infrastructure monitoring capability for alerting and diagnostics remains as important as it ever did.

[Image: predictive analytics at work]
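In the top-down, "how detailed do you need to be?" spirit, a first-pass what-if can be as simple as the sketch below. The utilisation, service-demand change and growth figures are invented, and a proper simulation would refine them:

```python
# A deliberately simple what-if: project the utilisation impact of a new
# release from a change in measured service demand. All figures are invented.
current_util_pct = 45.0
service_demand_change = 1.15     # new release needs ~15% more CPU per transaction
workload_growth = 1.20           # business forecasts 20% more transactions

projected_util = current_util_pct * service_demand_change * workload_growth
print(f"Projected utilisation after release: {projected_util:.0f}%")
```

If the first pass shows comfortable headroom, stop there; only dig into detailed profiling when the coarse answer is uncomfortable.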

5. Pause, and measure the value

As a supporting DevOps process, it can be easy to overlook the importance of planning ahead for performance and capacity. Combining the outputs with business context, such as costs, throughputs or revenues, will highlight the value of what you are doing. One example is to add your infrastructure cost model to your capacity analytics - and add transparency into the cost of capacity. By combining these costs with utilization patterns, you can easily show a cost-efficiency metric which can drive further optimization. The capacity management DevOps process is there to increase your agility by reducing the time spent in redundant testing, provide greater predictability into the outcomes of new releases, improve cost-efficiency in expensive production environments, and provide executives with the planning support they need in aligning with other IT or business change projects.

[Image: showing the cost-efficiency of infrastructure used]
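A minimal sketch of the cost-efficiency idea follows. The cluster names, costs and utilisation figures are invented, and a real cost model would obviously be richer:

```python
# Combine an infrastructure cost model with observed utilisation to express
# cost per *used* unit of capacity. All figures are invented examples.
clusters = {
    "prod-cluster-a": {"monthly_cost": 42000, "avg_cpu_util_pct": 22},
    "prod-cluster-b": {"monthly_cost": 36000, "avg_cpu_util_pct": 61},
}

for name, c in clusters.items():
    cost_per_used_pct = c["monthly_cost"] / c["avg_cpu_util_pct"]
    print(f"{name}: {cost_per_used_pct:,.0f} per utilised percentage point")
```

Ranking clusters by cost per utilised unit quickly shows where the optimization effort will pay back fastest.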


Thursday, 6 February 2014

Is performance important?

Over the last decade, seismic progress has been made in the realms of application performance management - developments in diagnostics, predictive analytics and DevOps enable application performance to be driven harder and measured in more ways than ever before.

But is application performance important? At face value it seems like a rhetorical question: performance relating to the user experience is paramount, driving customer satisfaction, repeat business, competitive selection and brand reputation - yes, performance is important. However, it is more often the change in performance that directly influences these behaviours. A response time of 2 seconds may be acceptable if it meets the user's expectation - but could be awful if users were expecting half-second latency. User experience is also more than just performance: its quality is related to performance, availability, design, navigability, ease-of-use, accessibility and more. Performance is important, yes - to a point.

The flip-side of performance is throughput, the rate at which business is processed. Without contention, throughput rises in direct proportion to workload volume, without compromising performance. However, when contention starts, performance suffers and, crucially, throughput stops keeping pace with the arrival rate. In other words, in a contention state, the rate at which business is transacted becomes impacted.
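To illustrate the shape of that relationship, consider the toy single-resource model below. The service time is invented and the formula is a textbook queueing approximation; this is not the simulation referenced later in the post:

```python
# Toy model: a single resource with a fixed service time, showing response
# time degrading as the arrival rate approaches capacity, and throughput
# then failing to keep pace. Figures are invented for illustration.
service_time = 0.2                  # seconds of work per transaction
capacity = 1.0 / service_time       # max sustainable throughput: 5 txn/s

for arrival_rate in [1, 2, 3, 4, 4.5, 4.9, 5.5]:
    throughput = min(arrival_rate, capacity)
    utilisation = throughput / capacity
    if utilisation < 1.0:
        response = service_time / (1.0 - utilisation)   # simple M/M/1 estimate
    else:
        response = float("inf")                         # queue grows without bound
    print(f"arrival {arrival_rate:>4} /s  throughput {throughput:>4} /s  "
          f"response {response:.2f}s")
```

Note how response time has already deteriorated badly by the time throughput hits its ceiling - the business impact arrives before the saturation point is obvious.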

So - is performance important?  Yes, clearly it is important, but only in the context of user-experience. However, a far more important measure of business success is throughput, as it is directly related to business velocity - how fast can a business generate revenue?

Consider the graph below, showing the relationship between performance and throughput for a business service. The point at which throughput is compromised corresponds to a 20% degradation in response time. Yet user experience is largely maintained at this level of performance; customers do not complain en masse until performance has degraded by double that amount. At this point, the damage is already done.


SUMMARY
When seeking to understand the risk margin in service delivery, the more pertinent metric for business performance is throughput. By building out a scalability assessment of your business services, the relationship between performance and throughput can be derived - and the right amount of capacity allocated in order to avoid a potential throughput issue. Such an assessment can be empirical, but for the highest fidelity a simulation approach should be adopted.

The chart above was created using CA Performance Optimizer - a simulation technology that predicts application scalability under a range of different scenarios.

Monday, 27 January 2014

Finding the spare capacity in your VMware clusters

I recently oversaw a project for a large petrochemicals company, where we identified that a heavily used VMware cluster was over-allocated by roughly 500%. I was gobsmacked at the over-allocation of capacity for this production environment, and decided to share some pieces of advice when it comes to capacity analysis in VMware.

How to find the savings in your VMware environment

Hook into your VMware environment and extract CPU utilization numbers for host, guest and cluster. Use the logical definitions of VMware folders, or the cluster/host/guest relationships, to carve out meaningful groupings. Be careful with heterogeneous environments: not all hosts or clusters will have the same configuration - and configuration is important. Use a tool like CA Capacity Management to provide a knowledge base of the different configurations so you can compare apples and oranges. Overlay all the utilization numbers and carry out a percentile analysis on the results - the results here represent a 90th percentile analysis; arguably a higher percentile should be used for critical systems. Use the percentile as a "high watermark" figure and compare it against average utilization to show the "tidal range" of utilization.
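A minimal sketch of the high-watermark/tidal-range calculation is shown below. The guest names and utilisation samples are invented, standing in for the data you would extract from VMware:

```python
# Overlay utilisation samples per guest and compare the 90th percentile
# ("high watermark") against the average ("tidal range"). Sample data invented.
samples = {
    "guest-app01": [12, 15, 14, 80, 18, 16, 13, 11, 75, 17],
    "guest-db01":  [55, 60, 58, 62, 59, 61, 57, 63, 60, 58],
}

def nearest_rank_percentile(values, pct):
    ordered = sorted(values)
    return ordered[max(0, int(round(pct / 100.0 * len(ordered))) - 1)]

for guest, cpu in samples.items():
    high_watermark = nearest_rank_percentile(cpu, 90)
    average = sum(cpu) / len(cpu)
    print(f"{guest}: p90={high_watermark}%, avg={average:.0f}%, "
          f"tidal range={high_watermark - average:.0f}%")
```

A wide tidal range (spiky workload) and a narrow one (steady workload) call for very different right-sizing decisions, even at the same average.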


Memory utilization is a little more challenging, given the diversity of the VMware metrics. Memory consumed is the host physical memory granted to the guest, but if the data collection period includes an OS boot it will be distorted by the memory allocation routines. Memory active is based on "recently touched" pages, and so, depending on the load type, may not capture 100% of the actual requirement. Additionally, there is a host overhead which becomes significant when the number of virtual machines reaches a critical level. Memory figures are further distorted by platforms like Java, SQL Server or Oracle, which hoard memory, hamster-fashion, for when it may be useful. For these purposes, it may also be relevant to consider OS-specific metrics (such as those from Performance Monitor). The capacity manager should therefore be using a combination of these metrics for different purposes, and should refine their practice to avoid paging (the symptom of insufficient memory).
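As an illustration of using these metrics in combination, the sketch below flags guests where consumed memory diverges sharply from active memory. The guest data and the 3x threshold are invented for the example:

```python
# Flag guests where "consumed" and "active" memory diverge sharply, which
# usually warrants a closer look (JVM/database hoarding, a recent boot, etc.).
# Figures are invented sample data, not real VMware output.
guests = [
    {"name": "guest-jvm01", "mem_consumed_gb": 32.0, "mem_active_gb": 6.0},
    {"name": "guest-web01", "mem_consumed_gb": 8.0,  "mem_active_gb": 6.5},
]

for g in guests:
    ratio = g["mem_consumed_gb"] / g["mem_active_gb"]
    if ratio > 3.0:   # arbitrary threshold for the example
        print(f"{g['name']}: consumed is {ratio:.1f}x active - review with "
              f"OS-level metrics before right-sizing")
```

The point is triage, not an automatic verdict: a large gap tells you which guests need the OS-level view before any memory is reclaimed.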


It is also worth reviewing the IO figures from a capacity point of view, although there is a little more work required in determining the capacity of the cluster, due to protocol overheads and behaviours. The response time metrics are a consequence of capacity issues, not a cause; although important, they are a red herring when it comes to capacity profiling and right-sizing (you can't right-size based on a response time, but you can right-size based on a throughput). I've disregarded disk-free stats in this analysis - they would form part of a storage plan - but check the configuration of your SAN or DAS to determine which IO loads represent a risk of capacity bottlenecks.


The Actionable Plan

Any analysis is worthless without an actionable plan, and this is where some analytics are useful in right-sizing every element within the VMware estate. CA Virtual Placement Manager gives this ability, correlating the observed usage against the [changing] configuration of each asset to determine the right size. This seems to work effectively across cluster, host and guest level - and also incorporates several 'what if' parameters such as hypervisor version, hardware platform (from its impressive model library) and reserve settings. It's pretty quick at determining the right size of capacity to allocate to each VM - and how many hosts should fit in a cluster, even factoring in forecast data. Using this approach, a whole series of actionable plans were generated very quickly for a number of clusters - showing over-allocations of 500% and more.
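For readers without such tooling, a much-simplified right-sizing rule of thumb is sketched below. This is emphatically not the algorithm CA Virtual Placement Manager uses, and the 70% target utilisation is an arbitrary assumption:

```python
# Simplified rule of thumb: size each VM so that its observed peak lands at a
# target utilisation, rounded up to whole vCPUs. Inputs are invented.
import math

def right_size_vcpus(allocated_vcpus, peak_util_pct, target_util_pct=70):
    needed = allocated_vcpus * peak_util_pct / target_util_pct
    return max(1, math.ceil(needed))

print(right_size_vcpus(allocated_vcpus=16, peak_util_pct=20))   # -> 5 vCPUs
```

Even this crude calculation makes the scale of over-allocation visible; the proper tooling adds the placement, forecast and configuration intelligence on top.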

Thursday, 12 December 2013

IT's Day of Reckoning Draws Near

Bob Colwell, Intel's former chief architect, recently delivered a keynote speech proclaiming that Moore's law will be dead within a decade.  Of course, there has to come an end to every technological revolution - and we've certainly noted the stabilization of processor clock speeds over recent years, in conjunction with an increasing density of cores per chip.

Moore's Law has been so dominant over the years, it has influenced every major hardware investment and every strategic data center decision.  Over the last 40 years, we have seen a consistent increase in processing capacity - reflected in both the increase in processor speeds and the increased density of transistors per chip.  In recent years, whilst processor clock speed has reached a plateau - the density of cores per chip has increased capacity (though not performance) markedly.

The ramifications of Moore's Law were felt acutely by IT operations, in two ways.

  1. It was often better for CIOs to defer a sizable procurement by six or twelve months, to get more processing power for their money.
  2. Conversely, the argument had a second edge: it was not worthwhile carrying out any Capacity Management, because the price of hardware was cheap - and getting cheaper all the time.

So, let us speculate what happens to IT operations when Moore's Law no longer holds:

  1. IT hardware does not get cheaper over time.  Indeed, we can speculate that costs may increase due to the costs of energy, logistics and so on.  Advancements will continue to be made in capability and performance, though not at the same marked rate charted above.
  2. The rate of hardware refresh slows, as the energy and space savings available from next-generation kit diminish.  Hardware will stay in support longer, and the costs of support will increase.
  3. Converged architectures will gain more traction as the flexibility and increased intra-unit communication rates drive performance and efficiency.
  4. You can't buy your way out of poor Capacity Management in the future.  Therefore the function of sizing, managing and forecasting capacity becomes more strategic.


Since capacity management equates very closely to cost management, we can also speculate that these two functions will continue to evolve closely.  This ties in neatly, though perhaps coincidentally, with the maturing of the cloud model into a truly dichotomous entity - one in which the provider and the consumer have two differing views of the same infrastructure.  As cloud models mature in this way, it becomes easier to compare the market for alternative providers on the basis of cost and quality.

Those organisations with a well-established Capacity Management function are well placed to navigate effectively as these twin forces play out over the next few years, provided they:

  1. Understand that their primary function is to manage the risk margin in business services, ensuring sufficient headroom is aligned to current and future demands
  2. Provide true insight into the marketplace in terms of the alternative cost / quality options (whether hardware or cloudsourced)
  3. Develop effective interfaces within the enterprise to allow them to proactively address the impacts of forthcoming IT projects and business initiatives.

So - the day of reckoning draws near - and IT operations will adapt, as it always does.  Life will go on - but perhaps with a little bit more careful capacity planning....