Platforms Infrastructure Systems Management – Be Proactive: Would You Rather Be Smart or a Genius?

What is the difference between a smart guy and a genius? Take a few seconds to think about it...

Answer: The smart guy avoids the trouble that only a genius can get you out of. This blog is about being proactive, and that is the connection: being proactive is the smart guy's way of staying out of trouble. I’ll focus on being proactive in the infrastructure systems domain, but the concept can be used everywhere.

In the daily life of a platforms engineer, you have two options:

  1. You can spend your time managing system incidents. This requires no upfront planning time but lots of time troubleshooting and recovering.
  2. You can avoid those system incidents by being proactive. This requires more upfront planning but drastically reduces the time spent recovering, not to mention the impact on the customer!

Some think they are maintaining their systems perfectly, and in one sense it may be true: they are very professional, their system uptime is very high and they solve issues very fast. But by not being proactive they are at higher risk (and they miss the other advantages of being proactive), and it is only a matter of time until their systems are impacted and potentially cause customer issues.

So what does it mean to be proactive? Being proactive means minimizing system issues before the systems break. Use the analogy of maintaining your car: the more you service it proactively (change the oil and filters, check the tyre pressures, respond to warning lights on the dashboard, etc.), the less likely it is to break down on you when you’re on a long journey. Ask yourself: what if you managed your platforms in a similar way?

Why should you be proactive?

Being proactive increases the influence you have over an incident. As the following schema shows, more actions mean more influence.


  • Recovery time – your systems are healthier and the probability of system failure is much lower.
  • Proactive time – the time spent on being proactive is much lower than the time spent fixing issues.
  • Customer impact time – you save the money that production downtime costs.

Here are a couple of domains where you can be proactive:

Capacity management – when managing an environment with hundreds of servers and many terabytes of disk space, you might run out of something without noticing, and the impact might even be environment downtime. This is especially true when you build a new environment and start production: the production capacity grows and consumes more and more resources. Sometimes you find one of the components in your environment completely full, and extending it might take a while, so your systems will be down for that time. To avoid this, you need to proactively monitor system capacity, analyze trends correlated to the production capacity, and act accordingly. The key items that need to be monitored are:

  • Disk space – any disk space assigned to a system has a limit. Disk space usage is often highly correlated with the production capacity (for example, databases, log files, etc.). Generate a report every x days/weeks/months (depending on how fast your environment changes) and, using the change in production capacity, create the disk space trend for the different systems. With those trends, try to forecast the disk space needed at the maximum production capacity, and extend the disks if needed.
  • Resources (CPU, memory, disk performance, network) – when the production capacity grows, the load on the systems might grow as well, for many reasons (more users, more jobs, etc.). As with disk space, you need to monitor system performance and create trends correlated to the production capacity. If needed, extend memory, replace servers with stronger ones, extend network bandwidth, etc.
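As an illustration of the trend analysis described above, here is a minimal sketch that fits a straight line to disk usage against production capacity and forecasts the disk space needed at the maximum production capacity. It assumes a simple linear relationship, which may not hold in your environment. The function names, the production capacity unit and the sample numbers are all hypothetical and exist only for this example.

```python
def fit_linear_trend(samples):
    """Least-squares line through (production_capacity, used_gb) points.

    Returns (slope, intercept) so that used_gb ~= intercept + slope * capacity.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one production capacity value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_disk_gb(samples, max_production_capacity):
    """Disk space the linear trend predicts at the maximum production capacity."""
    slope, intercept = fit_linear_trend(samples)
    return intercept + slope * max_production_capacity


def extension_needed_gb(samples, max_production_capacity, allocated_gb):
    """How much disk to add so the forecast fits; 0.0 if the allocation is enough."""
    forecast = forecast_disk_gb(samples, max_production_capacity)
    return max(0.0, forecast - allocated_gb)


# Hypothetical periodic reports: (production capacity, disk space used in GB).
reports = [(100, 220.0), (200, 420.0), (300, 620.0)]

forecast = forecast_disk_gb(reports, max_production_capacity=1000)
extra = extension_needed_gb(reports, max_production_capacity=1000, allocated_gb=1500.0)
print(f"forecast at max capacity: {forecast:.0f} GB, extend by: {extra:.0f} GB")
```

The same fit can be reused for CPU, memory or network figures by swapping the second value in each sample. A real trend may be better served by a different model, so treat the straight line as a starting point only.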

Alerting cleanup – For capacity management we talked about long-term proactive maintenance. This one is more about day-to-day operations. The task is very simple, but it requires a one-time effort followed by ongoing maintenance. You just need to make your systems clean of any alerts, where clean means taking care of and fixing whatever needs attention. You must do this regardless of the severity or the number of alerts. If you get alerts that are not important and you do nothing with them, remove them. If you get so many alerts that you have no time to deal with them, which means you do nothing with them, remove them too. Most important of all is not to ignore alerts because of time restrictions or low severity. To be proactive here, do the following:

  • Go over ALL alerts and remove the unnecessary ones. Yes, I mean all of them. I know it is a massive task, but it is worth it.
  • Once you have only the needed alerts, go over all active alerts and fix them. Clearing alerts without resolving the underlying problem will not help: the issue will repeat, and you might end up with broken systems.
  • Maintain this status. Monitor your alerts on a daily basis and take care of each alert while it is still at low severity.
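The first two steps above can be sketched as a small triage helper. This is only an illustration under stated assumptions, not the tooling behind this post: the `Alert` record and its fields (`rule`, `active`, `acted_on`) are hypothetical, and a real monitoring system will expose its alert history differently. The output is a list of candidates to review, not an automatic decision.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    rule: str       # hypothetical: name of the alert rule that fired
    active: bool    # is this alert still open?
    acted_on: bool  # did anyone do something in response to it?


def triage(alerts):
    """Split alert rules into removal candidates and rules that need fixing.

    A rule whose alerts were never acted on is a removal candidate.
    A rule that has been acted on and still has active alerts needs a fix.
    """
    by_rule = defaultdict(list)
    for alert in alerts:
        by_rule[alert.rule].append(alert)

    remove = sorted(
        rule for rule, items in by_rule.items()
        if not any(a.acted_on for a in items)
    )
    fix = sorted(
        rule for rule, items in by_rule.items()
        if rule not in remove and any(a.active for a in items)
    )
    return remove, fix


# Hypothetical alert history.
history = [
    Alert("disk_full", active=True, acted_on=True),
    Alert("disk_full", active=False, acted_on=True),
    Alert("debug_noise", active=True, acted_on=False),
    Alert("debug_noise", active=True, acted_on=False),
    Alert("cert_expiry", active=False, acted_on=True),
]

remove, fix = triage(history)
print("review for removal:", remove)
print("fix the root cause:", fix)
```

Running something like this every day covers the third step: the removal list should stay empty and the fix list short.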

We did the above in Intel IT and, as I mentioned, it didn’t change our uptime, but it dramatically reduced our break-fix work, as we could validate from our incident management metrics. As a result, we have some spare time to spend in a proactive management mode, and our systems are much more reliable.

In my next blog I will discuss some of the methods used to be proactive, for example, proactively upgrading hardware to newer models to reduce failure rates and provide more processing power, while also reducing power load in our data centres.