One solution to improving Quality & Discipline in IT Standard Operating Procedures

By Joe Sartini

As both an automation engineer and an IT automation manager for many years, I’ve both contributed to and monitored how IT standard operating procedures (SOPs) can introduce errors into a system. The challenge for many IT operations teams is how to eliminate human-induced errors and provide closed-loop feedback to process developers on how to create and maintain more robust project insertions. Every IT engineer I’ve ever known has every intention of making SOP changes flawlessly; however, we tend to find that a fair proportion of our operational incidents result from human error during the change process. Intel Factory Automation has strict change control procedures to help engineers through the change process and protect them from human error. Beyond the change control processes that exist in many organizations, I believe a key to success in this area is to automate as much as possible and, where automation is not feasible, to use an automated checklist.

Let me give you an example to illustrate the issues many IT orgs can experience, and a way to avoid or mitigate them by building more IT solutions into the manual processes that will always exist.

The Problem

Suppose you have an engineer performing a standard server build or decommission. In either case your engineer would consider this a fairly straightforward task, and I’m sure you as an IT org have a documented standard operating procedure for each hardware model and O/S revision, right? The problem can arise when our engineers are multitasking across many projects at once, under time constraints. In their mind, the trivial server build/decom SOP needs to be completed before they rush to their next important meeting. So they’re in the data centre (DC) with no access to the SOP instructions unless they print them out or log in to a PC in the DC to view them. Needless to say, an engineer who is in a hurry and has performed this task many times before will proceed from memory. However, suppose something has changed in the process since they last performed the build/decom, or nothing has changed but they simply forget a step, such as disabling the SAN switch port for the decommissioned server.

Down the road, we run into SAN switch port capacity problems which shouldn’t exist. The IT org may needlessly purchase more switches to handle the perceived capacity problem, or have another engineer compare server assets against active port usage, only to find that something doesn’t add up. More time is then spent finding the unused ports and disabling them, all because engineers forgot to disable the ports when following the server decommission SOP.
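To make the capacity analysis concrete, here is a minimal sketch of comparing server assets against active port usage. The function name, the port labels and the server names are all hypothetical, and it assumes you can export a mapping of enabled switch ports to the server each one was assigned to.

```python
def orphaned_ports(enabled_ports: dict[str, str], active_servers: set[str]) -> list[str]:
    """Return enabled ports whose assigned server is no longer an active asset.

    enabled_ports maps a port label to the server it was assigned to
    (hypothetical export format); active_servers is the current asset list.
    """
    return sorted(
        port for port, server in enabled_ports.items()
        if server not in active_servers
    )


# Illustrative data only: web07 was decommissioned but its port was never disabled.
ports = {"sw1/0": "db01", "sw1/1": "web07", "sw1/2": "app03"}
assets = {"db01", "app03"}
print(orphaned_ports(ports, assets))
```

Every port this reports is one that a decommission SOP should have released.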

One Solution

From my experience as an IT engineer and manager, I focus on automated IT checklists for SOPs. Using simple, easily configurable web-based solutions, the IT manager or engineer can develop checklists for all of their SOPs. Each checklist requires the engineer to check boxes on an online form, which can be centrally tracked via simple, standard IT reports. The IT manager can then monitor the completion/success of their SOPs via %PAS reports. Furthermore, the engineer knows that their name is tracked against the tasks with timestamps, so they are more inclined to follow the checklist and complete all tasks. The beauty of an online checklist is that the engineer can access it wherever they have an internet connection (e.g. LAN, Wi-Fi) and from any form factor of device (e.g. PC, laptop, MID, iPhone). The IT manager can also easily run reports on the average time it takes to perform each SOP, which helps with resource allocation per task and provides feedback to development teams on TTM for new project insertions.

In the example above, the engineer who was in the data centre and rushing to a meeting would have opened the checklist on a laptop or phone and ticked each box as they completed the task. Say they still forgot to de-assign the switch port or, more typically, didn’t have time to complete all tasks in one visit to the DC. In this case the checklist would not be 100% complete, and the daily/weekly operational review would show that this SOP is still in flight. The reviewers would then follow up with the engineer to complete the checklist, since everything is centrally tracked and the loop is only closed on actual completion of all tasks.
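As a rough illustration of the data such a checklist needs to capture, here is a minimal Python sketch. It is not any particular product: the class names, fields and task descriptions are assumptions made for the example, and a real solution would sit behind a web form and a database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    description: str
    completed_by: Optional[str] = None       # engineer who ticked the box
    completed_at: Optional[datetime] = None  # when the box was ticked


@dataclass
class ChecklistRun:
    sop_name: str
    engineer: str
    started_at: datetime
    tasks: list[Task] = field(default_factory=list)

    def tick(self, index: int, when: datetime) -> None:
        """Record the engineer's name and a timestamp against one task."""
        task = self.tasks[index]
        task.completed_by = self.engineer
        task.completed_at = when

    def percent_complete(self) -> float:
        done = sum(1 for t in self.tasks if t.completed_at is not None)
        return 100.0 * done / len(self.tasks) if self.tasks else 0.0

    def open_tasks(self) -> list[str]:
        return [t.description for t in self.tasks if t.completed_at is None]

    def duration(self) -> Optional[timedelta]:
        """Time from start to the last tick, or None while still in flight."""
        if not self.tasks or self.open_tasks():
            return None
        return max(t.completed_at for t in self.tasks) - self.started_at


def in_flight(runs: list[ChecklistRun]) -> list[ChecklistRun]:
    """Runs an operational review would follow up on."""
    return [r for r in runs if r.percent_complete() < 100.0]


# Hypothetical decommission where the switch port step was missed.
start = datetime(2024, 1, 8, 9, 0)
run = ChecklistRun("Server decommission", "jdoe", start, [
    Task("Power down server"),
    Task("Remove server from rack"),
    Task("Disable SAN switch port"),
])
run.tick(0, start + timedelta(minutes=5))
run.tick(1, start + timedelta(minutes=20))
print(run.open_tasks())  # the step still waiting to be done
```

Averaging `duration()` over completed runs of the same SOP would give the time-per-SOP figure used for resource allocation.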

Let’s take the pen and paper and the human guesswork out of our IT operations and use our IT skills to develop foolproof solutions to our daily routines. In this way, we’ll have a better chance of removing human errors. I’m sure you’d agree we need as much time as possible to handle the h/w and s/w errors that affect our operational availability and reliability.