Our R&D infrastructure depends heavily on storage, which today amounts to over 10 PB.
The vast majority of this data is kept on centralized file servers accessible via the Network File System (NFS).
We implement a global namespace to provide a unified view of all this storage from thousands of compute servers, so the same NFS filesystem is mounted under the same path on every compute server.
Overall, we are dealing with tens of thousands of such filesystems across the company. This approach gives us better control over capacity allocation, eases load balancing, and improves allocation scalability across multiple file servers. The namespace decouples the physical location of a filesystem from the logical path used to access the data, so data can be migrated between file servers transparently.
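The decoupling described above can be pictured as a lookup table from logical paths to physical NFS exports. The following is a minimal sketch with invented server, volume, and path names (the real system's map format and scale are not described in the post); migration is just a repointing of the map entry, invisible to clients that only ever mount the logical path.

```python
# Hypothetical global-namespace map: the logical path users see on every
# compute server, mapped to the physical NFS export currently serving it.
# All server/volume names below are invented for illustration.
NAMESPACE_MAP = {
    "/proj/cpu_a/work": "filer01:/vol/vol17/cpu_a_work",
    "/proj/cpu_b/work": "filer02:/vol/vol03/cpu_b_work",
}

def resolve(logical_path: str) -> str:
    """Return the physical NFS export behind a logical path."""
    return NAMESPACE_MAP[logical_path]

def migrate(logical_path: str, new_export: str) -> None:
    """Repoint a logical path after its data has been copied to a new
    file server; clients keep using the same logical path."""
    NAMESPACE_MAP[logical_path] = new_export

# Moving a filesystem to another filer changes only the map entry.
migrate("/proj/cpu_a/work", "filer03:/vol/vol09/cpu_a_work")
```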
Over the last few years, we've developed an application that provides self-service capabilities for our design community to manage this large storage capacity. Instead of filing requests with the HelpDesk, customers can instantly provision additional storage capacity, reclaim unused space, get precise usage reports, and more.
Total allocation limits can be defined for various organizational units, so a single project can't consume all of the configured storage capacity.
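A per-organization limit check can be sketched as follows; the org names and terabyte figures are invented, and the real application's accounting is certainly richer than this.

```python
# Hypothetical per-organization caps and current allocations, in TB.
limits_tb = {"cpu_design": 500, "gpu_design": 300}
allocated_tb = {"cpu_design": 480, "gpu_design": 120}

def can_allocate(org: str, request_tb: int) -> bool:
    """Grant a request only if the org stays within its total limit."""
    return allocated_tb[org] + request_tb <= limits_tb[org]
```

With the numbers above, `can_allocate("cpu_design", 20)` succeeds while `can_allocate("cpu_design", 30)` is refused, even if raw capacity is still available elsewhere.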
Policies allow us to define self-healing scenarios. For example, the application can automatically reclaim an entire filesystem, or part of it, according to the configured policy.
The system is designed so that we can use storage solutions from different vendors while keeping the same interface for storage administration.
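A common way to structure such vendor neutrality is a single abstract interface with one implementation per backend; administration code then never touches vendor specifics. This is only a guess at the shape of the design, with invented class and method names:

```python
from abc import ABC, abstractmethod

# Hypothetical vendor-neutral layer: admin code talks to StorageBackend,
# never to a concrete vendor API. Names are invented for illustration.
class StorageBackend(ABC):
    @abstractmethod
    def create_filesystem(self, name: str, size_gb: int) -> str:
        """Create a filesystem and return its identifier."""

    @abstractmethod
    def resize(self, name: str, new_size_gb: int) -> None:
        """Grow or shrink an existing filesystem."""

class VendorABackend(StorageBackend):
    """Stand-in for one vendor's implementation of the interface."""
    def create_filesystem(self, name: str, size_gb: int) -> str:
        return f"vendor_a:{name}"

    def resize(self, name: str, new_size_gb: int) -> None:
        pass  # would call the vendor's management API here
```

Swapping vendors then means adding another `StorageBackend` subclass, with no change to the administration workflows built on top.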
This application is integrated with our internal batch scheduling system, as well as with various design flows. For example, a running batch job that detects it is running out of disk space can allocate additional capacity on demand through this application.
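The on-demand flow from inside a job might look like the sketch below: check free space, and if it drops under a threshold, ask the storage application to grow the filesystem. The `request_grow` callback stands in for the real (undocumented here) API of the self-service application.

```python
import shutil

def ensure_capacity(path: str, min_free_gb: int, grow_gb: int,
                    request_grow) -> bool:
    """If free space at `path` is below `min_free_gb`, request `grow_gb`
    more via `request_grow` (a hypothetical hook into the storage app).
    Returns True if a grow request was issued."""
    free_gb = shutil.disk_usage(path).free // 2**30
    if free_gb < min_free_gb:
        request_grow(path, grow_gb)
        return True
    return False
```

A batch job would call `ensure_capacity` before a space-hungry step, passing a client for the storage application's allocation endpoint as `request_grow`.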
Are you dealing with similar challenges? How do you manage your storage?
Till the next post,