Friday 14 February 2014

Capacity Management - 5 top tips for #DevOps success

An esteemed consultant friend of mine once commented: "in capacity management, it is the step changes in capacity that are the most difficult to plan for". In agile release practice, such step changes are increasing in frequency. As each new release hits, the historical quality-of-service metrics lose relevance, making capacity planning harder.

To respond to this change, an agile capacity management practice is called for, which must be lightweight, largely automated, and relevant to both deployed software and software not yet released. Indeed, the process must be able to support all aspects of the DevOps performance cycle - from infrastructure sizing, through unit and load testing, to operational capacity management. In shared environments, such as cloud infrastructures, it is easy to become lost in the "big data" of application or infrastructure performance.

When executing a DevOps strategy, however, it is critical to embed performance and capacity management as a core principle - structuring the big data so that it becomes relevant and actionable. Here are 5 top tips for success:

1. A well-defined capacity management information system (CMIS) is fundamental

[Image: CMIS takes data from real-time monitors]

The foundation of your capacity management capability is data, so building a strong foundation with a capacity management information system is crucial. The purpose of this foundation is to capture all relevant metrics that assist a predictive process: one that provides insight into the current environment to help drive future decision-making. Context is crucial, so configuration information must be captured too, covering virtual and physical machine specifications along with service configuration data. It is also advisable to design this system to accommodate business contextual data, such as costs, workloads or revenues. Automation of the data collection is critical when designing an agile process, and the system should be scalable, so that it delivers quick wins but can grow to cover all the platforms in your application infrastructures. This system should not replace or duplicate any existing monitoring, since it will not be used for real-time purposes. Also note that it is easy to over-engineer this system for its purpose - another reason to adopt a scalable system that can grow to accommodate carefully selected metrics.

2. Acquire a knowledge base around platform capacity

[Image: Quantify capacity of different platforms]

A knowledge base is crucial when comparing platform capabilities. Whether you are looking at a legacy AIX server or a modern HP blade, you must know how those platforms compare in both performance and capacity. The knowledge base must be well maintained and reliable, so that you have accurate insight into the latest models on the market as well as the older models that may be deployed in your data centres. For smaller organisations, building your own knowledge base may be a viable option, but beware of architectural nuances that affect platform scalability (such as logical threading or hypervisor overheads). For this reason, it is practical to acquire a commercially maintained knowledge base - and to avoid benchmarks provided by the platform vendors. Avoid using MHz as a benchmark, as it is highly inaccurate. Early in the design stage for new applications, this knowledge base will become a powerful ally - especially when correlated against current environmental usage patterns.
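As a rough illustration of how such a knowledge base gets used, the sketch below translates measured CPU utilisation on one platform into the equivalent load on another by way of a relative capacity rating. The platform names and ratings are made-up placeholders, not real benchmark results, and the linear scaling deliberately ignores the threading and hypervisor effects mentioned above.

```python
# Hypothetical capacity ratings per platform (arbitrary units, NOT real benchmarks).
CAPACITY_RATING = {
    "legacy-aix-server": 40.0,
    "modern-blade": 160.0,
}


def equivalent_utilization(util_pct, source, target, ratings=CAPACITY_RATING):
    """Translate CPU utilisation (%) on `source` into the equivalent % on `target`.

    Assumes the work done scales linearly with the rating, which is a
    simplification of how real platforms behave.
    """
    work_units = util_pct / 100.0 * ratings[source]
    return work_units / ratings[target] * 100.0


if __name__ == "__main__":
    # 60% busy on the older platform, expressed as load on the newer one.
    pct = equivalent_utilization(60.0, "legacy-aix-server", "modern-blade")
    print(f"equivalent utilisation on target: {pct:.1f}%")
```

A maintained knowledge base would supply the ratings; the arithmetic itself stays this simple only while the linear assumption holds.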

3. Load testing is for validation only

[Image: DevOps and performance testing]

For agile releases, incremental change makes end-to-end test environments expensive to provision and assemble, and tests time-consuming to execute. However, load testing remains a critical part of the performance/capacity DevOps cycle. Modern testing practice has "shifted left" the testing phase, using service virtualization and release automation, and the resulting component-level performance profiling provides a powerful datapoint in our DevOps process. By assimilating these early-stage performance-tested datapoints into our DevOps thinking, we can provide early insight into the effect of change. For this to be effective, a predictive modelling function of some sort is required, where the performance profile can be scaled to production volumes and "swapped in" to the production model. Such a capability has been described in the past as a "virtual test lab". For smaller organisations, this could be possible with an Excel spreadsheet, although factoring in the scalability and infrastructure knowledge base will be a challenge.
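A spreadsheet-level version of this scaling step can be sketched with the utilization law (utilization = throughput x service demand). The function names and figures below are illustrative assumptions rather than anything from a real test, and the linear extrapolation leaves out the scalability effects that a proper model would include.

```python
def service_demand(cpu_busy_fraction, throughput_tps, cpu_count):
    """CPU-seconds consumed per transaction, derived from a component-level test."""
    return cpu_busy_fraction * cpu_count / throughput_tps


def projected_utilization(demand_cpu_sec, production_tps, production_cpu_count):
    """Projected CPU busy fraction at production volume (linear scaling assumed)."""
    return demand_cpu_sec * production_tps / production_cpu_count


if __name__ == "__main__":
    # Hypothetical test result: 2 CPUs at 30% busy while handling 50 tx/s.
    demand = service_demand(0.30, 50.0, 2)
    # "Swap" the profile into a hypothetical 8-CPU production host at 400 tx/s.
    util = projected_utilization(demand, 400.0, 8)
    print(f"service demand: {demand:.4f} CPU-s per transaction")
    print(f"projected production utilisation: {util:.0%}")
```

If the test and production platforms differ, the service demand would first need adjusting by the relative platform capacity from the knowledge base.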

4. Prudently apply predictive analytics

[Image: Predictive Analytics at work]

To be relevant, predictive analytics need to account for change in your environment - predictive analytics applied only to operational environments are no longer enough. In a DevOps process, change is determined by release, so investing in a modelling capability that allows you to simulate application scalability and the impact of the new release is crucial. Asking yourself "how detailed do I need to be?" helps drive a top-down, incremental path to the results you need. Although it is easy and tempting to profile performance in detail, it can be very time-consuming to do. Predictive analytics are fundamentally there to support decision-making on provisioning the right amount of capacity to meet demand - using them to predict code or application bottlenecks can be time-consuming and problematic. Investment in a well-rounded application and infrastructure monitoring capability for alerting and diagnostics remains as important as it ever was.

5. Pause, and be sure to measure the value

[Image: Showing cost-efficiency of infrastructure used]

Because it is a supporting DevOps process, planning ahead for performance and capacity is easily overlooked. Combining the outputs with business context, such as costs, throughputs or revenues, will highlight the value of what you are doing. One example is to add your infrastructure cost model to your capacity analytics, which adds transparency into the cost of capacity. By combining these costs with utilization patterns, you can easily show a cost-efficiency metric which can drive further optimization. The capacity management DevOps process is there to increase your agility by reducing the time spent in redundant testing, provide greater predictability into the outcomes of new releases, improve cost-efficiency in expensive production environments, and provide executives with the planning support they need in aligning with other IT or business change projects.
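One possible form of such a cost-efficiency metric is the share of infrastructure spend that is backed by capacity actually in use. The sketch below assumes a flat monthly cost per server and an average utilisation figure; the server names, costs and utilisations are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    monthly_cost: float     # taken from the infrastructure cost model
    avg_utilization: float  # 0.0-1.0, taken from the capacity data


def cost_efficiency(servers):
    """Fraction of total spend that is backed by used capacity."""
    total = sum(s.monthly_cost for s in servers)
    used = sum(s.monthly_cost * s.avg_utilization for s in servers)
    return used / total if total else 0.0


def idle_cost(servers):
    """Monthly spend attributable to unused capacity."""
    return sum(s.monthly_cost * (1.0 - s.avg_utilization) for s in servers)


if __name__ == "__main__":
    estate = [
        Server("web-01", 1000.0, 0.25),  # hypothetical figures
        Server("db-01", 3000.0, 0.75),
    ]
    print(f"cost efficiency: {cost_efficiency(estate):.1%}")
    print(f"idle cost per month: {idle_cost(estate):.2f}")
```

Some headroom is always intentional, so the idle figure is a starting point for discussion rather than a direct savings target.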


Thursday 6 February 2014

Is performance important?

Over the last decade, seismic progress has been made in the realm of application performance management - developments in diagnostics, predictive analytics and DevOps enable application performance to be driven harder and measured in more ways than ever before.

But is application performance important? At face value it seems like a rhetorical question: performance relating to the user experience is paramount, driving customer satisfaction, repeat business, competitive selection and brand reputation - yes, performance is important. However, it is more often the change in performance that directly influences these behaviours. A response time of 2 seconds may be acceptable if it meets the user expectation - but could be awful if users were expecting a half-second latency. User experience is more than just performance: its quality depends on performance, availability, design, navigability, ease of use, accessibility and more. Performance is important, yes - to a point.

The flip-side of performance is throughput, the rate at which business is processed. Without contention, throughput rises in direct proportion to workload volume, without compromising performance. However, when contention starts, performance suffers and, crucially, throughput starts to drop relative to the arrival rate. In other words, in a contention state the rate at which business is transacted is impacted.
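The relationship can be illustrated with a deliberately simple model. The sketch below treats the service as a single M/M/1 queue with a fixed service time. That is an assumption made for illustration, not the simulation approach referred to later in this post, and the numbers are hypothetical.

```python
def response_time(arrival_rate, service_time):
    """Mean response time of an M/M/1 queue; infinite once the server saturates."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")
    return service_time / (1.0 - utilization)


def throughput(arrival_rate, service_time):
    """Completed work per second: it cannot exceed the service capacity."""
    return min(arrival_rate, 1.0 / service_time)


if __name__ == "__main__":
    service_time = 0.1  # hypothetical: 100 ms per transaction, capacity 10 tx/s
    for arrival in (2.0, 5.0, 8.0, 9.5, 12.0):
        x = throughput(arrival, service_time)
        r = response_time(arrival, service_time)
        print(f"arrival={arrival:5.1f}/s  throughput={x:5.1f}/s  "
              f"completed={x / arrival:4.0%}  response={r:.2f}s")
```

In this model throughput tracks the arrival rate until the service saturates, after which the completed share of arriving work falls; real services typically show more gradual behaviour than this idealised curve.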

So - is performance important?  Yes, clearly it is important, but only in the context of user-experience. However, a far more important measure of business success is throughput, as it is directly related to business velocity - how fast can a business generate revenue?

Consider the graph below, showing the relationship between performance and throughput for a business service. The point at which throughput is compromised corresponds to a 20% degradation in response time. Yet user experience is largely maintained at this level of performance: customers are not complaining en masse until performance has degraded by double that amount. At this point, the damage is already done.


SUMMARY
When seeking to understand the risk margin in service delivery, throughput is the more pertinent metric for business performance. By building out a scalability assessment of your business services, the relationship between performance and throughput can be derived - and the right amount of capacity allocated in order to avoid the potential throughput issue. Such an assessment can be empirical, but for the highest fidelity a simulation approach should be adopted.

The chart above was created using CA Performance Optimizer - a simulation technology that predicts application scalability under a range of different scenarios.