Serving a large fleet of powerful compute servers, many of our file servers struggle under the high load and can occasionally become a bottleneck.
One of the most common access patterns in our batch environment is validation, where some dataset under test is accessed again and again by thousands of batch jobs. Such a dataset never changes after it is created, and after some period of time it becomes irrelevant: a new version of the same dataset is released and all tests are pointed to that new version.
To accommodate such workloads, we've developed a caching mechanism that is tightly integrated with the actual testing environment. Every time a test lands on a compute server, it checks whether the relevant dataset is already cached. If it is, the test runs against the cached copy on the local disk. If it is not, the test copies the dataset to the local disk, either from a file server or from one of the peer compute servers, and registers the new location in the central directory service. This solution significantly reduces the load on the central file servers. The cache manager also takes care of cleaning up old data.
Do you have issues with file server performance? How do you solve them?
Till the next post,