Intel IT operates a very large grid infrastructure for internal R&D groups, with over 3 million jobs running per day.
This major shared infrastructure is used by all design projects at Intel for validation and many other activities.
In general, design engineers are mostly interested in a good turn-around time for their jobs.
On the other hand, IT is traditionally interested in high and efficient usage of the provided resources.
Efficient usage can't be measured by high CPU utilization alone.
Sometimes, jobs submitted to the grid fail for various reasons, resulting in wasted runtime hours.
Sometimes, jobs are submitted but have little to no value to the submitter. Running a very large number of validation jobs may or may not bring added value. Designers may have no time to triage and address all the bugs reported by such validation jobs before the next validation cycle begins.
Customers should be able to terminate running jobs as soon as they realize their results are not needed anymore.
To ensure higher efficiency, we've started a joint effort with the design teams.
This effort includes extensive analysis of job waste patterns, including automatic association of finished jobs with predefined "exit buckets".
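To make the "exit bucket" idea more concrete, here is a minimal sketch in Python. It is not our production logic: the record fields (exit_code, killed_by_user, runtime_hours), the bucket names, and the rule that every non-successful job counts as waste are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FinishedJob:
    # Hypothetical record layout; a real scheduler log will look different.
    job_id: str
    exit_code: int
    killed_by_user: bool
    runtime_hours: float


def exit_bucket(job):
    """Map a finished job to one of a few illustrative exit buckets."""
    if job.killed_by_user:
        return "terminated_by_user"
    if job.exit_code == 0:
        return "success"
    return "failed"


def hours_by_bucket(jobs):
    """Sum runtime hours per exit bucket."""
    totals = defaultdict(float)
    for job in jobs:
        totals[exit_bucket(job)] += job.runtime_hours
    return dict(totals)


def wasted_hours(jobs):
    """Assumption: every hour outside the 'success' bucket is waste."""
    return sum(j.runtime_hours for j in jobs if exit_bucket(j) != "success")


jobs = [
    FinishedJob("a", 0, False, 2.0),
    FinishedJob("b", 1, False, 3.5),
    FinishedJob("c", 0, True, 1.0),
    FinishedJob("d", 2, False, 0.5),
]
totals = hours_by_bucket(jobs)
waste = wasted_hours(jobs)
```

In practice the buckets would be finer grained (for example, separating infrastructure failures from tool failures), since each bucket points at a different owner and a different fix.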
There is also an attempt to build several prediction models using data mining techniques on top of the vast data warehouse of information regarding previously completed jobs. Predicting memory consumption or job runtime may allow us to improve scheduling decisions.
Predicting the overall execution time or the chance of job failure from the parameters specified at submission may reduce wasted resources.
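As a rough illustration of predicting from historical data, the sketch below uses the simplest baseline we can think of: the median runtime of previously completed jobs that share the same key. This is only a stand-in for real data mining models, and the key (a job type string) and the sample history are hypothetical.

```python
from collections import defaultdict
from statistics import median


def build_runtime_model(history):
    """Build a lookup of median runtime (hours) per job key.

    history: iterable of (key, runtime_hours) pairs from completed jobs.
    """
    runtimes = defaultdict(list)
    for key, runtime_hours in history:
        runtimes[key].append(runtime_hours)
    return {key: median(values) for key, values in runtimes.items()}


def predict_runtime(model, key, default=None):
    """Return the predicted runtime for a key, or default if never seen."""
    return model.get(key, default)


# Made-up history, for illustration only.
history = [
    ("regression", 1.0),
    ("regression", 3.0),
    ("regression", 2.0),
    ("synthesis", 10.0),
]
model = build_runtime_model(history)
regression_estimate = predict_runtime(model, "regression")
unknown_estimate = predict_runtime(model, "never_seen")
```

A baseline like this is mainly useful as a yardstick: a more elaborate model is worth its complexity only if it beats the per-key median on held-out jobs.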
To achieve good results, extensive joint work between IT and customer groups is necessary.
Are you facing similar challenges in your environment?
Would you be interested in learning more about our experience in this area?
Till the next post,