Improving Client Stability with Proactive Problem Management

In late 2008, we were experiencing an average of about 5,500 “blue screen” system crashes in our client environment per week. As a result, users were not satisfied with the stability of their PCs.

To improve stability, we implemented a proactive problem management process based on analysis of objective, largely system-generated data from client PCs across our worldwide environment. Using this approach, we have increased client stability by reducing the number of blue screen system crashes by more than 50 percent, and we are beginning to realize benefits in other areas, including unexpected shutdowns and boot time.

The solution has two aspects. The first is a business process based on the Information Technology Infrastructure Library (ITIL). The second is collecting exceptions from the client environment.

Because the client environment is very rich and changes constantly, it is very important to identify in advance which exceptions impact the customer experience. Blue screens, for example, were identified as something that customers don’t like (no big surprise there, I guess).

Once we started collecting and analyzing the data, we could identify trends and the main root causes, which we could then fix across the entire client environment (see the banner on page 4, on the left).
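As a rough illustration of this kind of analysis, the sketch below counts crashes per faulting component and ranks the most frequent ones. It is not the tooling we used; the record layout, machine names, and driver names are hypothetical and exist only to show the idea.

```python
from collections import Counter

# Hypothetical crash records as (machine_id, faulting_driver) pairs.
# These names and values are illustrative, not real collected data.
crash_records = [
    ("pc-001", "video.sys"),
    ("pc-002", "wifi.sys"),
    ("pc-001", "video.sys"),
    ("pc-003", "video.sys"),
    ("pc-004", "storage.sys"),
    ("pc-002", "wifi.sys"),
]


def rank_root_causes(records, top=3):
    """Return (driver, crash_count, share_of_all_crashes) for the most frequent drivers."""
    counts = Counter(driver for _, driver in records)
    total = sum(counts.values())
    return [(driver, n, n / total) for driver, n in counts.most_common(top)]


for driver, n, share in rank_root_causes(crash_records):
    print(f"{driver}: {n} crashes ({share:.0%})")
```

Ranking by share of all crashes makes it easier to decide which fix to release to the whole environment first.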

Another approach was to call the Top N customers with faulty systems in to the support center. This was a nice surprise for those customers, who realized that IT knew about their stability issues and fixed them without their needing to raise an incident ticket.
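A Top N list like this can be derived from the same crash data. The following is a minimal sketch, assuming each crash report carries a machine identifier; the identifiers and the choice of N are hypothetical.

```python
from collections import Counter


def top_n_machines(machine_ids, n=2):
    """Return the n machines with the most crash reports as (machine_id, count) pairs."""
    return Counter(machine_ids).most_common(n)


# Hypothetical input: one machine identifier per reported crash.
reported_crashes = ["pc-001", "pc-002", "pc-001", "pc-003", "pc-001", "pc-002"]

for machine_id, count in top_n_machines(reported_crashes):
    print(f"{machine_id}: {count} crashes")
```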

In my experience, the main contribution to the trend change, reducing blue screens from 5,500 a week to 2,000 a week, was doing proper problem management and releasing solutions to the whole client environment. Fixing specific systems one by one is not enough when managing so many systems.

Updating drivers across the enterprise usually did the job, and we focused on the ones we found to cause most of the blue screens.

The success indicator is the downward trend graph showing the decrease in the weekly blue screen count. It enables us to review progress, set goals, and know whether we’re on the right track (see the screenshot of the trend below).
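To make the indicator concrete, here is a small sketch of how the weekly count can be compared against a baseline and a goal. Only the 5,500 and 2,000 figures come from this post; the intermediate weekly values and the 50 percent goal check are hypothetical and included purely for illustration.

```python
def percent_reduction(baseline, current):
    """Percentage drop from the baseline weekly count to the current weekly count."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline * 100


def meets_goal(baseline, current, goal_percent):
    """True when the reduction from baseline reaches the goal percentage."""
    return percent_reduction(baseline, current) >= goal_percent


# First and last values are from the post; the middle values are hypothetical.
weekly_counts = [5500, 4800, 3900, 3100, 2000]

baseline = weekly_counts[0]
for week, count in enumerate(weekly_counts, start=1):
    print(f"week {week}: {count} crashes, {percent_reduction(baseline, count):.1f}% below baseline")

print("50% goal met:", meets_goal(baseline, weekly_counts[-1], 50))
```

With the figures from this post, going from 5,500 to 2,000 blue screens a week is a reduction of roughly 64 percent.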

Key to these efforts are partnership across IT teams, indicators in place to measure and show progress, and an execution plan that eventually improved customers’ experience with their client systems.

What is your experience in this field? How do you improve your clients’ stability?

You can download the whitepaper from here.

[Image: Capture.PNG, screenshot of the weekly blue screen count trend]

Shachaf Levi

About Shachaf Levi

Shachaf Levi is a Cloud Security Architect at Intel. He has been working in cloud security for the last six years, and has been with Intel since 2004. He is currently building a combined strategy and architecture for cloud security, covering public and private cloud, SaaS (software as a service), PaaS (platform as a service), and IaaS (infrastructure as a service). This includes creating a reference architecture, roadmap, and capability building blocks, then selecting solutions to secure cloud usages across the various threats and compliance requirements. He enjoys new technology, innovative ideas, and working with and empowering engineering and operational teams toward successful solution adoptions while encompassing stakeholders’ needs. In particular, he is interested in automated solutions. Shachaf has published several white papers and a YouTube video documenting his security and automation work.