HDF5

What is it?

HDF5 (Hierarchical Data Format Version 5) is a general-purpose library and file format for storing scientific data. HDF5 stores two primary kinds of objects: datasets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects within an HDF5 file. Using these two basic objects, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids, and mix and match them in HDF5 files according to one's needs.
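
For example, with the Pythonic interface h5py (covered further below), a group holding a small dataset can be created like this (file and object names are arbitrary illustrations):

import h5py
import numpy as np

# a group acts like a directory inside the file, a dataset like an n-dimensional array
with h5py.File('example.h5', 'w') as f:
    grp = f.create_group('simulation')
    grp.create_dataset('temperature', data=np.random.rand(4, 4))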

Installation and Use of HDF5 on LRZ platforms

Linux based HPC Systems

As of April 2022, a new software stack 22.2.1 is available on CoolMUC-2 and SuperMUC-NG. We provide at least one minor release of both HDF5 1.8 and HDF5 1.10; be careful, as these versions differ in file format and API.

You can check the available hdf5 modules yourself via

module avail hdf5


On spack stack 22.2.1 we provide the following modules:

Serial HDF5                  HDF5 MPI parallel (with Intel-MPI)

hdf5/1.8.22-gcc11            hdf5/1.8.22-gcc11-impi
hdf5/1.8.22-intel21          hdf5/1.8.22-intel21-impi
hdf5/1.10.7-gcc11            hdf5/1.10.7-gcc11-impi
hdf5/1.10.7-intel19          hdf5/1.10.7-intel21-impi

The suffixes "-gcc11" and "-intel21" indicate the compiler used for the build; the corresponding compiler module should be loaded when using these HDF5 modules. The suffix "-impi" stands for the MPI-parallel version built against the standard Intel MPI module.

All packages are built with C, C++ and Fortran support. To make use of HDF5, please load the appropriate environment module.

For the parallel version built with the Intel compiler, use e.g.

module load hdf5/1.10.7-intel21-impi

Then, compile your code with

[mpicc|mpicxx|mpif90] -c $HDF5_INC foo.[c|cc|f90]

and link it with

[mpicc|mpicxx|mpif90] -o myprog foo.o <further objects> [$HDF5_F90_SHLIB|$HDF5_CPP_SHLIB] $HDF5_SHLIB
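
For example, for a Fortran source file foo.f90 (a placeholder name), this boils down to:

mpif90 -c $HDF5_INC foo.f90
mpif90 -o myprog foo.o $HDF5_F90_SHLIB $HDF5_SHLIB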


For a serial version (with the Intel compiler), use e.g.

module load hdf5/1.10.7-intel21

Then, compile your code with

[icc|icpc|ifort] -c $HDF5_INC foo.[c|cc|f90]

and link it with

[icc|icpc|ifort] -o myprog.exe foo.o <further objects> [$HDF5_F90_SHLIB|$HDF5_CPP_SHLIB] $HDF5_SHLIB

The language support libraries $HDF5_F90_SHLIB and $HDF5_CPP_SHLIB are only required if Fortran or C++, respectively, is used for compiling and linking your application.
For static linking, use the $HDF5_..._LIB variables instead of $HDF5_..._SHLIB, but this is not recommended.
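
As a concrete serial instance, a plain C source file foo.c (again a placeholder) needs neither of the language support libraries:

icc -c $HDF5_INC foo.c
icc -o myprog.exe foo.o $HDF5_SHLIB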

Utilities

Loading an HDF5 module typically also makes command-line utilities available, e.g. h5copy, h5debug, h5dump. It may be advisable to run these utilities with a serial (as opposed to MPI-parallel) HDF5 version, since a linked-in MPI library may not work properly in purely interactive usage.
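
For example, to get a quick look at a file's structure or contents (the file name is a placeholder):

h5dump -H myfile.h5      # print only the header information (object structure, no data)
h5dump myfile.h5         # print structure and data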

h5utils

h5utils (Github) is a set of utilities for the visualisation and conversion of scientific data in HDF5 format. Besides providing a simple tool for batch visualisation as PNG images, h5utils also includes programs to convert HDF5 datasets into the formats required by other free visualization software (e.g. plain text, Vis5d, and VTK).

h5utils is not part of the HDF5 module, nor is it available directly in the LRZ-provided software stack. The recommended way to install it on SuperMUC-NG, CoolMUC-2 and other LRZ-managed clusters is via user_spack:

module load user_spack

# Install
spack info h5utils
spack install h5utils

# Load to search path
spack load h5utils
# Unload
spack unload h5utils
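
Once installed and loaded, the tools can be used, for instance, like this (file and dataset names are placeholders for illustration):

h5topng mydata.h5:temperature      # render a 2D dataset as a PNG image
h5totxt mydata.h5:temperature      # dump the same dataset as plain text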

Documentation

Please refer to the HDF5 Web Site for documentation of the interface.


H5py (Pythonic Interface to HDF5)

There are several options to install h5py on LRZ systems. One option is using "pip" or "Conda" (see here or here for details, or further below). The other option (and probably the preferable one) is installation via "user_spack" (see also the Spack package management tool).
The installation procedure is similar on all systems.

Remark

In order to use h5py MPI-parallel, one needs to build it against an hdf5 that was built with MPI support, and against mpi4py! The compiler and MPI installation must be consistent!
This is what we focus on in this documentation. Without the MPI requirement, installing h5py is usually less complex and does not require building from source!
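
Once h5py is installed, a quick way to check whether it was built with MPI support is:

python -c "import h5py; print(h5py.get_config().mpi)"    # prints True for an MPI-enabled build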

Spack/User Spack

To build h5py, first select the hdf5 module you want to work with. Let us assume you want to use the module hdf5/1.10.11-gcc12-impi on CoolMUC-4, which is built with the GCC compiler and Intel MPI.
One needs the hash of the hdf5 Spack installation, which can be obtained using "module show":

cm4login1:~> module show hdf5/1.10.11-gcc12-impi | grep BASE
setenv        HDF5_BASE /dss/lrzsys/sys/spack/release/23.1.0/opt/icelake/hdf5/1.10.11-gcc-mlcdtiq

The hash consists of the last seven characters: mlcdtiq

Please note: the installation hashes differ between systems. Using the hash from above for an installation on e.g. SuperMUC-NG will fail!
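
If preferred, the hash can also be extracted directly on the command line (a convenience sketch; depending on the module system, the output may go to stderr, hence the redirection):

module show hdf5/1.10.11-gcc12-impi 2>&1 | grep BASE | awk -F- '{print $NF}'    # prints: mlcdtiq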

Next, load the user_spack module to make the spack command-line tool available.

module load user_spack

Installation

The installation (which you only need to do once, provided it works without problems) is then done as follows. The general build instruction looks like this:

spack install py-h5py%COMPILER ^hdf5/HASH_OF_INSTALLATION

where COMPILER stands for the compiler of the hdf5 module. It can be gcc, intel or oneapi (please check with spack compilers!), and HASH_OF_INSTALLATION is the hdf5 installation hash from above.
For the example above, this would be:

spack install py-h5py%gcc ^hdf5/mlcdtiq

Usually, it is also sufficient to simply run

spack install py-h5py ^hdf5/mlcdtiq

Spack can resolve the compiler from the hdf5 dependency (you can check this using spack spec -lINt py-h5py ^hdf5/mlcdtiq).

Testing and Using

The easiest way to use the package is now to simply load it:

:~> spack load py-h5py
:~> cat > h5py_test.py << EOT
from mpi4py import MPI
import h5py
rank = MPI.COMM_WORLD.rank                               # The process ID (integer 0-3 for 4-process run)
print("rank:",rank)
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
dset = f.create_dataset('test', (4,), dtype='i')
print("created ds")
dset[rank] = rank
f.close()
EOT

> mpiexec -n 4 python h5py_test.py                         # check h5py working in parallel
 rank: 0
 created ds
 rank: 1
 created ds
 rank: 2
 created ds
 rank: 3
 created ds

On login nodes, it might be required to further tune the MPI environment (if an MPI module is loaded) according to

export I_MPI_HYDRA_BOOTSTRAP=fork I_MPI_FABRICS=shm      # on login nodes; skip if you go to compute nodes
unset I_MPI_HYDRA_IFACE I_MPI_PMI_LIBRARY                # ditto (if set, as on CM4)

before calling mpiexec. Alternatively, with intel-mpi loaded, go to a compute node (via salloc or sbatch). On compute nodes, you SHOULD load the default intel-mpi module, as we usually set further environment variables there for optimization.

Essentially, usage is as simple as that: load user_spack (and intel-mpi), then spack load py-h5py.
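
On a compute node, a minimal interactive session could look like this (the Slurm partition and time limit are placeholders and depend on the system):

salloc -N 1 -p <partition> -t 00:15:00
module load user_spack intel-mpi
spack load py-h5py
mpiexec -n 4 python h5py_test.py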


Module Creation and Usage

One can also create an environment module if desired. If the steps above were successful, one can go on with

spack module tcl refresh -y

The module is then generated below $HOME/{spack,user_spack}/<spack version>/<architecture>/ (this scheme has not fully settled yet).
Note: the architecture-dependent part of the module path differs between systems. On CoolMUC-4 the modules end up under $HOME/user_spack/23.1.0/modules/icelake/, whereas on e.g. SuperMUC-NG the path would be $HOME/spack/modules/x86_avx512/linux-sles15-skylake_avx512/.

To use the h5py module, one needs to make it available to the module system; the corresponding hdf5 and MPI modules are also required.

module use -p ~/user_spack/23.1.0/modules/icelake/
module load python/3.10.10-extended              # some extended python is necessary for the mpi4py
module load gcc                                  # unless done automatically with hdf5
module load intel-mpi                            # unless done automatically with hdf5
module load hdf5/1.10.11-gcc12-impi
module load py-h5py

It is the user's responsibility here to load the modules consistently! We recommend using module collections (check module help!), e.g. as sketched below.
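
For instance, with a module system that supports collections, the currently loaded set of modules can be saved and restored under an arbitrary name:

module save my-h5py-env       # save the currently loaded modules as a collection
module restore my-h5py-env    # restore that collection in a later session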

Using pip 

SuperMUC-NG has no internet access. Please have a look here for options.

Installation and test description:
module sw spack/23.1.0                                   # take a more up-to-date stack (on SuperMUC-NG, for instance)

module av hdf5                                           # to check available versions
module load hdf5/<desired version>-gcc*-impi             # HDF5 library backend; PLEASE TAKE ONE WITH -gcc*-impi!  (gcc and intel-mpi should be automatically loaded)
module load python                                       # python module (read and heed the warnings!)
python -m venv venv_h5py                                 # virtual environments are cleaner
module rm python                                         # should not be needed anymore
source venv_h5py/bin/activate                            # activate environment
pip install --upgrade pip                                # usually a good idea
pip install --no-cache-dir --no-binary=mpi4py mpi4py     # install mpi4py; especially NO CACHE!!
export I_MPI_HYDRA_BOOTSTRAP=fork I_MPI_FABRICS=shm      # on login nodes; skip if you go to compute nodes
unset I_MPI_HYDRA_IFACE I_MPI_PMI_LIBRARY                # ditto (if set)

cat > mpi4py_test.py << EOT
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print("I'm rank: ",rank)
EOT

mpiexec -n 4 python ./mpi4py_test.py                     # check mpi4py working
 I'm rank:  2
 I'm rank:  3
 I'm rank:  0
 I'm rank:  1

export CC=h5pcc
export HDF5_MPI="ON"
pip install --no-cache-dir --no-binary=h5py h5py         # install h5py

cat > h5py_test.py << EOT
from mpi4py import MPI
import h5py
rank = MPI.COMM_WORLD.rank                               # The process ID (integer 0-3 for 4-process run)
print("rank:",rank)
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
dset = f.create_dataset('test', (4,), dtype='i')
print("created ds")
dset[rank] = rank
f.close()
EOT

mpiexec -n 4 python h5py_test.py                         # check h5py working in parallel
 rank: 0
 created ds
 rank: 1
 created ds
 rank: 2
 created ds
 rank: 3
 created ds
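
To verify the result, you can inspect the file with h5dump (available via the loaded hdf5 module); the dataset 'test' should contain the values 0 1 2 3, one entry written by each rank:

h5dump parallel_test.hdf5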

Using Conda (self-contained)

With conda/mamba (please note the current regulations here!), installing HDF5 and MPI can be as simple as

micromamba create -n my_h5py h5py h5py=*=*mpich*          # or "conda"
micromamba activate my_h5py

MPICH is not officially supported at LRZ. It is similar to Intel MPI, but does not react to I_MPI_* environment variables! It can still be used together with our systems. On a single login or compute node, run

mpiexec -n 4 python h5py_test.py

For more than one node, within a Slurm allocation, please use

mpiexec -launcher slurm -n 4 python h5py_test.py