Up, up and away with Grid efficiency

Intel IT operates a very large grid infrastructure for internal R&D groups, with over 3 million jobs running per day.

This major shared infrastructure is used by all design projects at Intel for validation and many other activities.

In general, design engineers are mostly interested in a good turnaround time for their jobs.

On the other hand, IT is traditionally interested in high and efficient usage of the resources it provides.

Such usage can't be measured by high CPU utilization alone.

Sometimes, jobs submitted to the grid fail for various reasons, resulting in wasted runtime hours.

Sometimes, jobs are submitted but have little to no value to the submitter. Running a very large number of validation jobs may or may not bring added value. Designers may have no time to triage and address all bugs reported by such validation jobs before the next validation cycle begins.

Customers should be able to terminate running jobs as soon as they realize the results are no longer needed.

To ensure higher efficiency, we've started a joint effort with the design teams.

This effort involves extensive analysis of job waste patterns, including automatic association of finished jobs with predefined "exit buckets".
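To make the "exit buckets" idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the bucket names, the rules that assign a job to a bucket, and the job record fields (`exit_code`, `runtime_hours`, `killed_by_user`) are all assumptions made for this example, not a description of our production tooling.

```python
from collections import Counter


def exit_bucket(exit_code, killed_by_user=False):
    """Map a finished job to a bucket. Names and rules are hypothetical."""
    if killed_by_user:
        return "terminated_by_user"
    if exit_code == 0:
        return "success"
    if exit_code == 137:
        # 128 + SIGKILL (9): on Linux this often means the job was killed,
        # for example by the out-of-memory killer.
        return "killed"
    return "failed"


def wasted_hours_by_bucket(jobs):
    """Sum runtime hours per bucket for every job that did not succeed."""
    wasted = Counter()
    for job in jobs:
        bucket = exit_bucket(job["exit_code"], job.get("killed_by_user", False))
        if bucket != "success":
            wasted[bucket] += job["runtime_hours"]
    return dict(wasted)


# Made-up job records, for illustration only.
jobs = [
    {"exit_code": 0, "runtime_hours": 2.0},
    {"exit_code": 1, "runtime_hours": 0.5},
    {"exit_code": 137, "runtime_hours": 4.0},
    {"exit_code": 1, "runtime_hours": 1.5},
    {"exit_code": 0, "runtime_hours": 3.0, "killed_by_user": True},
]
print(wasted_hours_by_bucket(jobs))
```

Summing runtime hours per bucket, rather than just counting jobs, keeps the focus on where the grid capacity actually went.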

We are also trying to build several prediction models, using data mining techniques on top of our vast data warehouse of information about previously completed jobs. Predicting memory consumption or job runtime may allow us to improve scheduling decisions.

Predicting the overall execution time or chances of job failure based on the specified parameters may reduce waste of resources.
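As a rough illustration of what predicting from history can look like, the sketch below uses a deliberately simple baseline: the median runtime of previously completed jobs that share the same key. This is not one of the data mining models mentioned above; the class name `RuntimeBaseline`, the choice of key, and the sample numbers are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


class RuntimeBaseline:
    """Predict a job's runtime as the median of past jobs with the same key.

    The key is any hashable description of the job, for example a
    (flow, block) tuple. Both the key and the median rule are illustrative.
    """

    def __init__(self):
        self._history = defaultdict(list)
        self._all = []

    def observe(self, key, runtime_hours):
        """Record the runtime of a completed job."""
        self._history[key].append(runtime_hours)
        self._all.append(runtime_hours)

    def predict(self, key):
        """Return the predicted runtime in hours, or None with no history."""
        # Fall back to all observed jobs when this key has never been seen.
        samples = self._history.get(key) or self._all
        return median(samples) if samples else None


model = RuntimeBaseline()
model.observe(("regression", "block_a"), 1.0)
model.observe(("regression", "block_a"), 2.0)
model.observe(("regression", "block_a"), 6.0)
model.observe(("lint", "block_b"), 0.5)

print(model.predict(("regression", "block_a")))
print(model.predict(("synthesis", "block_c")))
```

A baseline like this is mainly useful as a yardstick: a more elaborate model is only worth deploying if it beats the per-key median on held-out jobs.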

To achieve good results, extensive joint work between IT and customer groups is necessary.

Are you facing similar challenges in your environment?

Would you be interested in learning more about our experience in this area?

Till the next post,

      Gregory Touretsky