Ask HN: Nix(OS) for HPC?
Recently I helped a friend get some scientific software running on an HPC system a little on the smaller side. The software is written in C++ and uses CMake for building. And to be honest, the experience was rather subpar.
All the HPC systems I have worked on use Lmod [1] to manage the environment and enable building with, say, Intel's compiler or a specific MPI version.
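(For context, a typical Lmod session looks roughly like this; the module names and versions below are placeholders, not from a real site:)

    module avail                 # list the modules the site provides
    module load intel/2024.0     # put Intel's compilers on PATH (hypothetical version)
    module load openmpi/4.1      # and a matching MPI (hypothetical version)
    module list                  # show what is currently loaded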
Now, one of the concrete problems I hit was the following: loading the latest version of cmake via Lmod also pulls in the latest version of gcc's libstdc++, since cmake is dynamically linked against it. But if you then try to build said software with the Intel toolchain, which pulls an older libstdc++ into the environment, cmake suddenly breaks with a rather cryptic symbol-not-found error.
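In shell terms, the failure looked roughly like this; the module names, paths, and the exact GLIBCXX version are reconstructed from memory, so take them as illustrative only:

    module load cmake/3.27      # cmake built against a recent gcc's libstdc++
    module load intel/2021.4    # prepends an older gcc's lib dir to LD_LIBRARY_PATH
    cmake --version
    # cmake: /opt/gcc-8/lib64/libstdc++.so.6: version `GLIBCXX_3.4.29'
    #        not found (required by cmake)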
This is what got me thinking: on HPC systems you typically need lots of libraries and software installed, often in many (conflicting) versions, so your users can use whatever they need. I have not yet tried Nix(OS) myself, but what I described sounds very much like the problem it is intended to solve.
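From the documentation I have skimmed, Nix's answer to this is a self-contained environment per invocation, so two toolchains never fight over LD_LIBRARY_PATH. A minimal sketch (the package attribute names depend on the nixpkgs revision you use):

    # Each command runs in its own environment; cmake and the compiler
    # come from the same snapshot, so their libstdc++ versions agree.
    nix-shell -p cmake gcc12 --run 'cmake --version'
    nix-shell -p cmake gcc13 --run 'cmake --version'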
Thus my question: has anyone tried Nix(OS) on an HPC system, and how did it go? Otherwise, are there (better) alternatives to Lmod?
[1]: https://lmod.readthedocs.io

I can't answer your question, as I am not informed about the subject, but I can add that I am aware Guix/GuixSD is used for some HPC (see https://hpc.guix.info/about/). Perhaps Nix/NixOS is as well. If you don't get an answer here, try https://nixos.org/community/index.html.

Thanks for the hints! I didn't consider Guix as I thought it was FOSS-only, which doesn't quite work for HPC; I will check it out.

Why wouldn't any container system work? Many such systems these days support unprivileged containers and near-zero overhead, which are basically the main concerns in HPC.

Containers would probably work fine for building, but I am pretty sure running them is going to be a massive pain. Typically you need GPU pass-through, and you will want to run your software on multiple nodes, for example using MPI, which I don't think will be straightforward, especially if you have to explain all this to a scientist who just wants to run their software. But perhaps someone can refute this?
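For what it's worth, the pattern I have seen described for sites that do run containers is Apptainer (formerly Singularity), with the host's MPI launching one container per rank. A sketch, assuming an image.sif built elsewhere whose MPI matches the host's (the binary and input names are hypothetical):

    # --nv passes the host's NVIDIA driver and GPUs into the container;
    # mpirun stays on the host, and each rank execs into the image.
    mpirun -n 4 apptainer exec --nv image.sif /opt/app/bin/solver input.dat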
I think there are two parts in play here: the build process and deployment. From what I've been reading at OpenHPC about Slurm and friends, you wouldn't normally waste deployment time doing the compile on each node; you could, but it'd be inefficient to have 1000 nodes all compile at the same time instead of mounting your share and pulling down the executable, which can be built in any of your favorite build environments. I've also seen some stuff recently about writing in Python and using DSL compile toolchains to prep an executable that's much faster than dropping Python on the cluster. That's something you set up in a Docker environment or a GitHub Actions mess that I don't understand.
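As a sketch of that split (all paths, names, and node counts hypothetical): build once on a login node or in CI, drop the binary on the shared filesystem, and have the job script on every node just run it:

    #!/bin/bash
    #SBATCH --nodes=1000
    #SBATCH --ntasks-per-node=32
    # The executable was built once elsewhere and lives on a shared
    # mount (hypothetical path); no node compiles anything at job time.
    srun /shared/apps/mysolver/bin/mysolver input.dat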