There are really two problems here: 1. Contemporary mainstream OSes have not risen ...

btown · on Feb 21, 2022

Are there solutions to this in the high-performance computing space, where random access to massive datasets is frequent enough that the “sandboxing” overhead adds up?

bayindirh · on Feb 21, 2022

HPC systems generally use LustreFS where you have multiple servers handling metadata and objects (files) separately. These servers use multiple tiers of drives: metadata servers are SSD-backed, and file servers run on SSD-accelerated spinning-disk boxes with a mountain of 10TB+ drives.

When this structure is fed to an EDR/HDR/FDR InfiniBand network, the result is a blazing fast storage system where a very large number of servers can make a massive number of random accesses simultaneously. The whole structure won't even shiver.

Lustre can also pull other tricks for smaller files to accelerate access and reduce the overhead even further.

In this model, the storage boxes are somewhat sandboxed, but the filesystem as a whole is mounted via its own client, so what the OS sees is very close to the model Lustre provides.

On GPU servers, if you're going to provide big NVMe scratch spaces (à la NVIDIA DGX systems), you soft-RAID the internal NVMe disks with mdadm.

In both models, saturation happens at the hardware level (disks, network, etc.); processors and other software components don't impose a meaningful bottleneck, even under high load.

GauntletWizard · on Feb 22, 2022

Additionally, in the HPC space, power loss is not a major factor; backup power systems exist, and rerunning the last few minutes of a half-completed job is common, so on either side you are unlikely to encounter the fallout of "I clicked save, why didn't it save?"

NGRhodes · on Feb 22, 2022

In my experience, hours and days of jobs need to be rerun; our researchers do a poor job of checkpointing. Of all the issues we have with Lustre, data loss has never been one whilst I have been in the team.

throw0101a · on Feb 22, 2022

> In my experience, hours and days of jobs need to be rerun; our researchers do a poor job of checkpointing.

Enable pre-emption in your queues by default and that'll change: after a job has been scheduled and has run for 1-2 hours, it can be kicked out and a new one run instead once the first's priority decays a bit.

* https://slurm.schedmd.com/preempt.html

You can add incentives:

> When would I want to use preemption? When would I not want to use it?

> When a job is designated as a preemptee, we increase the job's priority, and increase several limits, including the maximum number of running processors or jobs per user, and the maximum running time per job. Note that these increased limits only apply to the preemptable job. This allows preemptable jobs to potentially run on more resources, and for longer times, than normal jobs.

* https://rc.byu.edu/documentation/pbs/preemption
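
For preemption to be survivable, the job itself has to notice the scheduler's warning and stop at a safe point. A minimal sketch of that, assuming the scheduler sends SIGTERM some grace period before killing the job (check how your site's preemption is configured). `StopFlag` and `run_steps` are hypothetical names, and the loop body is a stand-in for real work:

```python
import signal


class StopFlag:
    """Callable signal handler that only records that a stop was requested."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def run_steps(state, total_steps, flag):
    """Advance `state` one step at a time; return True if the job finished."""
    while state["step"] < total_steps:
        if flag.requested:
            return False  # stop at a step boundary so `state` stays consistent
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
    return True


if __name__ == "__main__":
    flag = StopFlag()
    # Assumption: SIGTERM is the warning signal delivered before the job is killed.
    signal.signal(signal.SIGTERM, flag)
    state = {"step": 0, "total": 0}
    finished = run_steps(state, 1000, flag)
    print("finished" if finished else "stopped early", state)
```

The handler does nothing but set a flag, so the state is only ever inspected or saved between steps. Code that cannot stop at a clean boundary like this is the kind that tends not to be compatible with preemptive queues.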

bayindirh · on Feb 22, 2022

> Enable pre-emption in your queues by default and that'll change.

We run preemptive queues, and no: not all jobs are compatible with that, especially code the researchers developed themselves.

My own code also doesn't have support for checkpointing. Currently it's blazing fast, but for bigger jobs it might need the support, and it needs way more cogs inside the pipeline to make it possible.

bayindirh · on Feb 22, 2022

This is absolutely correct. The cattle vs. pets analogy [0] applies perfectly here. On the other hand, HPC systems are far from being unprotected. Storage systems generally disable write caches on spinning drives automatically and keep all in-flight data in either battery-backed or flash-based caches, so FS-level corruption is kept to a minimum.

Also, yes, many longer jobs are checkpointed and restart where they left off, but it's not always possible, unfortunately.

[0]: https://blog.engineyard.com/pets-vs-cattle
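
The checkpoint-and-restart pattern mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular HPC code does it: the state is a small JSON-serializable dict, the function names and the `crash_after` parameter (used only to simulate an interruption) are hypothetical, and real jobs usually have far more state to capture.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then atomically replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to storage before the rename
    os.replace(tmp_path, path)  # readers see the old or the new file, never half


def run(path, total_steps, checkpoint_every, crash_after=None):
    """Resume from `path` if present; return the final state, or None on a simulated crash."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < total_steps:
        if crash_after is not None and done_this_run >= crash_after:
            return None  # simulated interruption: work since the last checkpoint is lost
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Running `run("job.ckpt", 10, 3, crash_after=5)` and then `run("job.ckpt", 10, 3)` redoes only the steps after the last checkpoint. The hard part in practice is the one this sketch skips: deciding what counts as the complete state of a real pipeline.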

shellac · on Feb 22, 2022

> HPC systems generally use LustreFS

Or IBM's GPFS / Spectrum Scale. Same deal, really, although GPFS is a more complete package.

saul_goodman · on Feb 22, 2022

Is it though? IBM hates GPFS and has been trying to kill it off since its initial release, but every time it tries the government (by NSF/tertiary proxy) stuffs more money into its mouth. It lives despite being hated by the parent. Both GPFS and Lustre have their warts.