Refreshing your server fleet in a Manufacturing environment

Does the risk of downtime to your production lines outweigh the rewards of server refresh?

As IT managers, one of our roles is to influence our senior management to plan and approve capital funds for IT server upgrades where it makes sense. Needless to say, we always have to demonstrate, as best we can, the benefits of such a capital investment, ideally in Return-On-Investment (ROI) terms. Many of us use varying forms of cost-benefit analysis algorithms to come up with what we believe to be logical, data-based recommendations. Server technology, as with most other technologies, is advancing year on year with improvements in many areas, most notably power efficiency, performance and form factor.
From conversations with peers in several manufacturing industries, I have observed a reluctance to upgrade the manufacturing environment's server fleet at the same cadence as the broader enterprise fleet. I decided to write this blog to understand why. Well, one argument may be that when the underlying manufacturing central systems applications are working fine on the existing h/w, the attitude is 'if it's not broke, don't fix it'. We don't want IT getting in the spotlight for unscheduled downtime affecting company product commitments. Furthermore, there is the impact or loss of production when the refresh cycle occurs, i.e. in non-clustered environments you have to bring your automation systems down in a scheduled manner to replace the servers. In a clustered environment, at least, you can stage the upgrades with less impact, as one set of nodes in the cluster manages the workload while the other half is being replaced, and vice versa. However, sometimes applications have not been developed for clustering/redundancy and one doesn't have that option. Something for all of you to consider as you harden existing environments over time, when finance permits, is to influence your app development teams to ensure strong redundancy / fault tolerance is built in at the design phase.
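To make the staged approach in a clustered environment concrete, here is a minimal sketch of how the two-step plan could be laid out. The function name, the node names and the simple half-and-half split are my own illustrative assumptions; a real cluster would also need to confirm that each half can carry the full workload on its own.

```python
def staged_refresh_plan(nodes):
    """Split the cluster into two halves and return the ordered refresh steps.

    Hypothetical helper: it only plans the order, it does not touch any servers.
    """
    if len(nodes) < 2:
        raise ValueError("a staged refresh needs at least two nodes")
    mid = len(nodes) // 2
    first_half, second_half = nodes[:mid], nodes[mid:]
    # Step 1: replace the first half while the second half serves.
    # Step 2: swap roles, so the refreshed half serves while the rest is replaced.
    return [
        {"replace": first_half, "serving": second_half},
        {"replace": second_half, "serving": first_half},
    ]


# Illustrative node names only.
plan = staged_refresh_plan(["node1", "node2", "node3", "node4"])
for step in plan:
    print(f"replace {step['replace']} while {step['serving']} carry the workload")
```

A non-clustered application has no such plan available, which is why the only option there is a scheduled outage.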
Another regular problem exists when your production systems are not continually re-qualified on the latest server hardware at the same rate as the h/w being offered by the OEMs. This places a larger burden on the software development teams and others to perform lengthy validations which can take them away from their core job of working on the latest software enhancements for the production systems.
Server refresh introduces unknown risks, as all the pre-production software testing and validation on the new platform doesn't always find all the bugs. It goes without saying that the risk of introducing new issues into the production environment arises from any change made to any part of the production system, whether software or hardware, and from human error during the upgrade process, all of which can cause unscheduled outages to factory production.
So with all the potential risks/downsides, how do we weigh those against the longer term benefits of an upgraded IT/Automation server fleet? The answer isn't always clear cut in my experience, as you're trying to compare factual data against many unknowns, e.g. how accurately can you put a cost on the risk of impact and the potential for production downtime? The information gathered here at Intel has consistently shown us that our systems' reliability, performance and total cost of ownership [TCO] have improved with every refresh cycle. As a simple and very recent example (I could give more): for one of our production applications in a non-virtualized environment, my team recently replaced approx. 90 physical servers with 19 of the latest OEM models, based on IAx86 of course. The ROI payback was less than 2 years, as we were able to reduce our TCO by removing the maintenance contract costs, reducing our power costs, and improving server reliability, thus reducing human intervention and enabling our engineers to work on other value-add projects, and that's the big intangible. An added bonus was that we freed up data center physical capacity, helping to avoid future DC expansions, which are a costly endeavor. We could have chosen to do nothing, as the customer wasn't complaining of application performance degradation. However, we as IT people knew the benefit in terms of cost savings to our business in many different ways, some more easily quantifiable than others.
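For anyone who wants to run the same kind of numbers, here is a rough sketch of a simple payback calculation that also nets off an estimated cost of refresh-related downtime. Every figure below is a made-up placeholder and not data from the refresh described above, and the function names and cost categories are my own assumptions. The downtime estimate in particular is the hard-to-quantify part.

```python
def annual_tco(maintenance, power, support_labor):
    """Yearly running cost of a fleet (hypothetical three-item breakdown)."""
    return maintenance + power + support_labor


def payback_years(capital_cost, old_annual_tco, new_annual_tco,
                  expected_annual_downtime_cost=0.0):
    """Simple (undiscounted) payback period in years.

    Returns None when the refresh never pays back, i.e. when the yearly
    savings do not exceed the expected yearly cost of added downtime risk.
    """
    net_annual_saving = (old_annual_tco - new_annual_tco
                         - expected_annual_downtime_cost)
    if net_annual_saving <= 0:
        return None
    return capital_cost / net_annual_saving


# Placeholder inputs, for illustration only.
old_fleet = annual_tco(maintenance=90_000, power=60_000, support_labor=50_000)
new_fleet = annual_tco(maintenance=0, power=15_000, support_labor=20_000)

# Assumed risk estimate: outage probability per year x hours lost x cost per hour.
downtime_risk = 0.10 * 4 * 25_000

years = payback_years(capital_cost=285_000,
                      old_annual_tco=old_fleet,
                      new_annual_tco=new_fleet,
                      expected_annual_downtime_cost=downtime_risk)
print(f"estimated payback: {years:.2f} years")
```

The sketch deliberately leaves out the intangibles, such as engineers freed up for other projects and deferred data center expansion, so a result from it would tend to understate the benefit.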
So my recommendation, from many years managing IT in an Automated Manufacturing / Industrial environment, is that it does indeed make a lot of sense to upgrade the server fleet. However, my experience tells me the key drivers continue to be application performance improvements and End-of-Life / End-of-Support hardware triggers. You should be influencing your peer orgs to ensure solid requalification and test processes exist to manage the change, as this is an ongoing part of our jobs. After that, you can't deny that you're offering the business the very latest and greatest IT infrastructure, which should always offer more than its technology predecessors at a fraction of the ongoing maintenance costs. No pain, no gain!
I’d be interested to hear from those who are keeping their IT manufacturing systems up to the latest spec of h/w on an ongoing basis and if you’ve anything to add in terms of the benefits and/or pitfalls encountered.