Black box monitoring for IT infrastructure

   Our IT infrastructure is complicated: it includes thousands of compute and file servers, a distributed batch environment, network components, etc. A large number of projects use this infrastructure in parallel.

   We obviously have monitoring systems in place that track the behavior of individual components, such as critical servers, the network, etc. However, these monitoring solutions can't catch every potential service degradation we may run into.
   To intercept such unexpected issues before our internal customers begin to suffer, we are introducing a kind of user experience monitoring, or black box monitoring.


  Some solutions in this area exist on the market for ERP or DB systems. However, I'm talking here about the open systems we use in our R&D environment.

  For example, we monitor the responsiveness of our NFS environment. Rather than looking only at the specific metrics of the file servers (network I/O, CPU utilization, etc., which we still collect for future analysis), we also monitor the entire stack.
  We copy the same file from every file server every X minutes using two clients residing in different subnets. We measure the time it takes to copy this file and compare it with the baseline. Every time the copy time exceeds the predefined threshold, we launch automatic data gathering to see what has happened with the affected file server, network, batch infrastructure, etc. This data is analyzed immediately or at a later time to make the appropriate operational decisions.
  Naturally, the amount of data we are collecting is exploding, so data mining may provide some interesting insights.
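
  To make the copy-and-compare idea concrete, here is a minimal Python sketch of one probe run. It is an illustration under assumptions, not a description of the actual tooling: the function names, the threshold expressed as a multiple of the baseline (factor), and the on_slow callback standing in for the automatic data gathering are all hypothetical.

```python
import shutil
import tempfile
import time
from pathlib import Path


def timed_copy(source: Path, scratch_dir: Path) -> float:
    """Copy source into scratch_dir and return the elapsed wall-clock seconds."""
    target = scratch_dir / source.name
    start = time.perf_counter()
    shutil.copyfile(source, target)
    elapsed = time.perf_counter() - start
    target.unlink()  # remove the copy so the next probe starts clean
    return elapsed


def exceeds_threshold(elapsed: float, baseline: float, factor: float = 2.0) -> bool:
    """Return True when the copy took more than factor times the baseline.

    Expressing the threshold as a multiple of the baseline is an assumption;
    the post only says that a predefined threshold exists.
    """
    return elapsed > baseline * factor


def probe(source: Path, baseline: float, on_slow, factor: float = 2.0) -> float:
    """Run one probe and call on_slow(source, elapsed) if the copy was too slow.

    on_slow is a hypothetical hook for the automatic data gathering step.
    """
    with tempfile.TemporaryDirectory() as scratch:
        elapsed = timed_copy(source, Path(scratch))
    if exceeds_threshold(elapsed, baseline, factor):
        on_slow(source, elapsed)
    return elapsed
```

  A scheduler (cron, for instance) would call probe() every X minutes on each client, once per file server, passing the path of the test file on that server's NFS mount. One caveat for a sketch like this: client-side caching can hide server latency when the same file is read repeatedly, so a real probe would probably need to account for that.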

What data mining techniques/solutions do you use, if at all, for your IT-related data analysis?

Till the next post,

   Gregory Touretsky