Moose Framework on HPC Systems

What is Moose Framework?

Moose Framework is an open-source, parallel finite element framework built on Libmesh, which in turn builds on PETSc, and is used to numerically solve multi-physics systems of partial differential equations.

Moose already ships with a number of physics modules, but it can also be extended with custom applications, as is the case for Golem.

Getting Started

Installation

The basic installation steps are as follows (please also consult the Moose Framework documentation page).

> git clone https://github.com/idaholab/moose.git
> cd moose
> git checkout master
> git submodule update --init
> git submodule foreach --recursive git submodule update --init
> ./scripts/update_and_rebuild_petsc.sh                              # building PETSc
> ./scripts/update_and_rebuild_libmesh.sh                            # building Libmesh
> ./scripts/update_and_rebuild_wasp.sh                               # building WASP
> cd test                                                            # building and performing tests
> make -j 6
> ./run_tests
> cd ../modules                                                      # building the Moose modules (apps); alternatively, own apps (e.g. Golem) can be built
> make -j 6

The above procedure assumes a correct build environment. On the LRZ HPC clusters, such an environment must be arranged manually by the user. Some adaptations are necessary, e.g. in order to use our tuned performance libraries (HDF5, Intel MKL and MPI, etc.). As is usual for PETSc applications, the whole pipeline down to Moose must be built with the same tool chain (same compiler, compiler settings, MPI, ...).

On CoolMUC-4, for instance, the following procedure using GCC and Intel MPI works reasonably well. Intel compilers should work in principle, too.
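
A quick way to check that the MPI compiler wrappers will invoke the intended compiler is Intel MPI's -show option:

> module load gcc intel-mpi
> mpicc -show                                 # prints the underlying compiler invocation; should report gcc here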

CoolMUC-4 Installation Procedure Example using the GCC compiler and Intel MPI
> module load cmake gcc intel-mkl intel-mpi hdf5/1.10.11-gcc12-impi boost/1.83.0-gcc12-impi libtirpc

> git clone https://github.com/idaholab/moose.git
> cd moose
> git checkout master
> git submodule update --init
> git submodule foreach --recursive git submodule update --init

> export HDF5_DIR=$HDF5_BASE 
> export I_MPI_HYDRA_BOOTSTRAP=fork I_MPI_FABRICS=shm                         # needed because PETSc's configure tests MPI functionality
> export MOOSE_JOBS=20                                                        # builds faster  

# building PETSc
> ./scripts/update_and_rebuild_petsc.sh --help                                # this script passes all command-line parameters on to the PETSc configure script
> ./scripts/update_and_rebuild_petsc.sh --with-blaslapack-dir=$MKL_BASE \
           --with-cc=$(which mpicc) --with-cxx=$(which mpicxx) --with-fc=$(which mpif90) --with-mpi-f90=$(which mpif90) --with-mpiexec=$(which mpiexec) \
           COPTFLAGS='-g -O3 -march=native' CXXOPTFLAGS='-g -O3 -march=native' FOPTFLAGS='-g -O3 -march=native' \
           --with-mpi-include=$MPI_BASE/include --with-mpi-lib=$MPI_BASE/lib/release/libmpi.a --with-64-bit-indices=true
# instead of -march=native, -march=x86-64-v4 could be used
# Since Nov'25, this natively fails with errors when trying to build libceed.
# We couldn't find any other way to cope with that than editing scripts/configure_petsc.sh and setting
# "--download-kokkos=0 --download-kokkos-kernels=0 --download-libceed=0".
# These packages are anyway probably only interesting for GPU applications, which we currently don't support.
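
A sketch of that edit (hedged: where exactly configure_petsc.sh assembles the configure options may change between Moose revisions, so inspect the script first):

# edit the script and append the disabling options to the PETSc configure call, e.g.
> vi scripts/configure_petsc.sh
#   ... --download-kokkos=0 --download-kokkos-kernels=0 --download-libceed=0 ...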


# building Libmesh
> export CC=mpicc CXX=mpicxx FC=mpif90 F90=mpif90 F77=mpif77
> export CFLAGS="-O3 -march=native" CXXFLAGS="-O3 -march=native" FCFLAGS="-O3 -march=native" FFLAGS="-O3 -march=native"

> ./scripts/update_and_rebuild_libmesh.sh

# This may fail with "configure: error: *** XDR was not found, but --enable-xdr-required was specified."
# In that case, load the libtirpc module as above and retry the step above with
> ./scripts/update_and_rebuild_libmesh.sh --with-xdr-include=$LIBTIRPC_BASE/include/tirpc --with-xdr-libdir=$LIBTIRPC_BASE/lib --with-xdr-libname=tirpc
# If this doesn't pay off either, try
> export TIRPC_DIR=$LIBTIRPC_BASE
> export CPATH=$CPATH:$LIBTIRPC_BASE/include/tirpc
> export LIBRARY_PATH=$LIBRARY_PATH:$LIBTIRPC_BASE/lib
> ./scripts/update_and_rebuild_libmesh.sh
# Alternatively, you can also remove the --enable-xdr-required flag in the file scripts/configure_libmesh.sh,
# or set it to --disable-xdr, if you are sure that you won't need XDR support (the build might then still fail).
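
Before retrying, a quick sanity check that the tirpc headers and library are actually visible (paths as used by the libtirpc module above):

> ls $LIBTIRPC_BASE/include/tirpc/rpc/xdr.h $LIBTIRPC_BASE/lib/libtirpc.so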

# building WASP
> ./scripts/update_and_rebuild_wasp.sh

# building and running tests
> cd test
> module load python/3.10.10-extended                                          # the system-provided Python might not suffice
> python -m venv venv_moose
> source venv_moose/bin/activate
> pip install --upgrade pip pyaml jinja2 pandas numpy matplotlib               # required Python modules; see the Moose documentation and any error messages about missing modules
> make -j 6
> unset I_MPI_PMI_LIBRARY I_MPI_HYDRA_IFACE                                    # these variables interfere on the login nodes
> ./run_tests -j 6

A few (around 10) tests may fail, and some 100 may be skipped. As long as the affected functionality is not essential for your workflow, you can live with that.
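
To investigate, failing tests can be re-run selectively; the test harness offers options for that (check ./run_tests --help, as the options may differ between Moose revisions):

> ./run_tests --failed-tests                   # re-run only the previously failed tests
> ./run_tests --re <pattern> -j 1              # re-run tests matching a regular expression, serially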

The naming of the modules might change over time! Please check via module avail how they are currently named!

The hardware-specific GCC optimization flags (-march) must be changed when using a different architecture. Please consult the GCC documentation on that!
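
To see which concrete architecture -march=native resolves to on the node at hand, GCC can be queried directly:

> gcc -march=native -Q --help=target | grep -- '-march='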

For building Moose apps, the environment must be restored (compilers, tool chain, libraries; consider module collections (see module help), or capture the settings in your own user-defined module file), and MOOSE_DIR must be set to the Moose top-level directory. For instance, building the Moose modules separately (although MOOSE_DIR is not strictly necessary in this case) might work as follows.

> module load cmake gcc intel-mkl intel-mpi hdf5/1.10.11-gcc12-impi boost/1.83.0-gcc12-impi libtirpc
> export HDF5_DIR=$HDF5_BASE                                         # probably not necessary anymore; HDF5 is linked in PETSc/Libmesh
> export CC=mpicc CXX=mpicxx FC=mpif90 F90=mpif90 F77=mpif77
> export CFLAGS="-O3 -march=native" CXXFLAGS="-O3 -march=native" FCFLAGS="-O3 -march=native" FFLAGS="-O3 -march=native"
> cd moose
> export MOOSE_DIR=$PWD
> cd modules
> module load python/3.10.10-extended
> make -j 10                                  # takes a while
> unset I_MPI_PMI_LIBRARY I_MPI_HYDRA_IFACE   # these variables interfere on the login nodes
> ./run_tests -j 4                            # takes even longer

A few tests may fail again, and some are skipped. Please check whether that is critical for your workflows.
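
As recommended under the advice below, the build environment can be saved as a module collection right after loading the modules, so that it can be restored in a single step later (the collection name here is just an example):

> module save moose_build                      # saves the currently loaded modules
# later, in a fresh shell:
> module restore moose_build
> export MOOSE_DIR=<path-to-your-moose-clone>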

Finally, there are also examples under moose/examples. They are a good starting point to learn the Moose workflows, and serve as a reference on how to set up solvers and cases.

Usage

For running the application, only the run-time libraries are necessary (Boost may be used only as a compile-time library, but loading it does no harm; the gcc module is possibly also not relevant).

> module load gcc intel-mkl intel-mpi hdf5/1.10.11-gcc12-impi boost/1.83.0-gcc12-impi libtirpc
> module save moose_runtime                      # create a module collection; for later use: module restore moose_runtime
> mpiexec <mpi-options> ./my-moose-app-opt <options>
# for instance, in moose/examples/ex01_inputfile
> make
> mpiexec -n 2 ./ex01-opt --n-threads=4 -i diffusion_pathological.i

Framework Information:
MOOSE Version:           git commit e2ec6f19cf on 2025-01-30
LibMesh Version:         6ef7d4395794104f48dae1fd48e64077207188e8
PETSc Version:           3.22.1
SLEPc Version:           3.22.1
Current Time:            Fri Jan 31 22:02:55 2025
Executable Timestamp:    Fri Jan 31 21:27:42 2025
...
Parallelism:
  Num Processors:          2
  Num Threads:             4
...
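
As PETSc applications, Moose apps also accept the PETSc run-time command-line parameters. For example, -log_view prints a PETSc performance summary at the end of the run (a sketch reusing the example above):

> mpiexec -n 2 ./ex01-opt --n-threads=4 -i diffusion_pathological.i -log_view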

The framework includes most libraries in the executable app's RPATH. However, the Intel MKL, Intel MPI and HDF5 modules also provide run-time optimization settings via environment variables, so loading these modules is recommended.
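
Which libraries an app actually resolves at run time can be verified with standard tools:

> ldd ./my-moose-app-opt | grep -Ei "mkl|hdf5|mpi"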

Moose applications support the option --help; use it to learn about run-time adaptations and monitoring capabilities. A Slurm job script can be kept rather short, e.g.:

moose.slurm
#!/bin/bash
#SBATCH -o myjob.%j.%N.out
#SBATCH -D .
#SBATCH -J Test
#SBATCH --clusters=cm4                      # Sapphire Rapids nodes, 112 CPU cores: 2 sockets, 56 CPUs per socket
#SBATCH --partition=cm4_tiny
#SBATCH --get-user-env
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2                 # 2 ranks (1 rank per socket), and 
#SBATCH --cpus-per-task=56                  # 56 threads per rank
#SBATCH --hint=nomultithread                # don't use hardware threading
#SBATCH --mail-type=none                    # if set differently, provide a valid email address
#SBATCH --export=NONE                       # mandatory!
#SBATCH --time=2:00:00

# module restore moose_runtime              # if you created a module collection; (must be done before slurm_setup)

module load slurm_setup

# or, if no module collection is used ...
module load gcc intel-mkl intel-mpi hdf5/1.10.11-gcc12-impi boost/1.83.0-gcc12-impi libtirpc

export OMP_PLACES=cores OMP_PROC_BIND=close                                 # GOMP thread placement *)
mpiexec ./my-moose-app-opt --n-threads=$SLURM_CPUS_PER_TASK -i Test_Case.i

*) In this MPI+OpenMP hybrid mode, communication between threads within an MPI rank goes via shared memory, which is usually faster than MPI within a NUMA domain.
    Care must be taken that the threads run on different CPU cores. The user is responsible for the correct settings.
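
Thread placement can be verified via the standard OpenMP affinity report (supported by GOMP in recent GCC versions); each thread then prints its CPU binding to the job output:

> export OMP_DISPLAY_AFFINITY=true
> mpiexec ./my-moose-app-opt --n-threads=$SLURM_CPUS_PER_TASK -i Test_Case.i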

Concession

Sort-of System Installation - including Golem

Though LRZ cannot provide much support for Moose, and even less so for apps based on Moose, from time to time we may install something centrally that might work (without any guarantee!!). You can then find some of the modules in our extfiles (experimental module) section.

> module use /lrz/sys/share/modules/extfiles
> module av moose
---------------- /dss/dsshome1/lrz/sys/share/modules/extfiles ----------------------------
moose/2026-03-18-master
> module av golem
---------------- /dss/dsshome1/lrz/sys/share/modules/extfiles ----------------------------
golem/2026-03-18-master

> module sw stack/24.5.0                     # !!!! Please check the dependencies on this stack via "module show moose"!
> module load golem

Moose and its apps like Golem do not appear to have any stable release tags, because they are under heavy, ongoing development. The documentation recommends using the current master branch. Unfortunately, the branches are often not consistent with each other, and Golem then causes build and runtime errors. Some tests even failed because of a wrong parameter count.

Please contact the developers and ask for improvements of this situation if you need to use this software regularly!

The build recipe of the latest such build can be found under $MOOSE_DIR/../../install_moose_golem.sh, and the build and test results under $MOOSE_DIR/../../log.moose_golem.install.

> sed 's/\x1b\[[0-9;]*m//g' $MOOSE_DIR/../../log.moose_golem.install | grep -E "[0-9]+ passed, [0-9]+ skipped, [0-9]+ FAILED"
4922 passed, 472 skipped, 8 FAILED              # moose/test runtest
6749 passed, 820 skipped, 31 FAILED             # moose/modules runtest
31 passed, 0 skipped, 13 FAILED                 # golem runtest

Usage with Slurm

The simplest way to get started is to copy a test case to your HOME or SCRATCH directory.

> cp -r $GOLEM_BASE/test/tests/THM .
> cd THM

Create a Slurm job file there. For instance,

job.slurm
#!/bin/bash
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D . 
#SBATCH -J golem_test
#SBATCH --get-user-env 
#SBATCH --clusters=inter
#SBATCH --partition=cm4_inter
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --hint=nomultithread
#SBATCH --mail-type=none 
#SBATCH --export=NONE 
#SBATCH --time=00:10:00 

module load slurm_setup

module sw stack/24.5.0
module use /lrz/sys/share/modules/extfiles
module load golem

mpiexec golem-opt --n-threads=$SLURM_CPUS_PER_TASK -i THM_3D_grav.i Outputs/file_base=THM_3D_grav_out_test

This example also illustrates the hybrid parallelism using MPI and threading – here with 2 MPI ranks and 4 threads per rank. Either extreme (MPI-only or thread-only) should be possible as well; see the sketch below.
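
For instance, the same case could be run MPI-only or thread-only (a sketch; adjust the Slurm resources --ntasks and --cpus-per-task accordingly):

# MPI-only: e.g. 8 ranks with 1 thread each (--ntasks=8, --cpus-per-task=1)
> mpiexec golem-opt -i THM_3D_grav.i
# thread-only: 1 rank with 8 threads (--ntasks=1, --cpus-per-task=8)
> mpiexec golem-opt --n-threads=8 -i THM_3D_grav.i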

This test case is really short and takes only seconds.

log.golem_test.*****.out example
spack/release/sles15.6/24.5.0 sapphirerapids
Notice: No version specified. Loading version: 'intel-mpi/2021.15.0'.
Notice: No version specified. Loading version: 'intel-mkl/2025.1.0'.

Loading golem/2026-03-18-master
  Loading requirement: gcc/14.2.0 intel-mpi/2021.15.0 intel-mkl/2025.1.0
    python/3.14.0 moose/2026-03-18-master


*** Deprecation Warning ***
Please update your main.C to adapt new main function in MOOSE framework, see'test/src/main.C in MOOSE as an example of moose::main()'. 



*** Info ***
'execute_on' parameter specified in [Outputs] block is ignored for object 'checkpoint'.
Define this object in its own sub-block of [Outputs] to modify its execution schedule.
Framework Information:
MOOSE Version:           git commit b3ab245a82 on 2026-03-17
LibMesh Version:         f8a17588c82a850969afc7d740f3da286da01463
PETSc Version:           3.24.4
SLEPc Version:           3.24.0
Current Time:            Thu Mar 19 07:35:29 2026
Executable Timestamp:    Wed Mar 18 23:51:35 2026

Input File(s):
  /dss/dsshome1/00/di49zop/test_golem/THM/THM_3D_grav.i

Command Line Argument(s):
  --n-threads=4

Command Line Input Argument(s):
  Outputs/file_base=THM_3D_grav_out_test.e

Checkpoint:
  Wall Time Interval:      Every 3600 s
  User Checkpoint:         Disabled
  # Checkpoints Kept:      2
  Execute On:              TIMESTEP_END 

Parallelism:
  Num Processors:          2
  Num Threads:             4

Mesh: 
  Parallel Type:           replicated
  Mesh Dimension:          3
  Spatial Dimension:       3
  Nodes:                   
    Total:                 99
    Local:                 54
    Min/Max/Avg:           45/54/49
  Elems:                   
    Total:                 40
    Local:                 20
    Min/Max/Avg:           20/20/20
  Num Subdomains:          1
  Num Partitions:          2
  Partitioner:             metis

Nonlinear System:
  Num DOFs:                495
  Num Local DOFs:          270
  Variables:               { "pore_pressure" "temperature" "disp_x" "disp_y" "disp_z" } 
  Finite Element Types:    "LAGRANGE" 
  Approximation Orders:    "FIRST" 

Auxiliary System:
  Num DOFs:                80
  Num Local DOFs:          40
  Variables:               { "strain_zz" "stress_zz" } 
  Finite Element Types:    "MONOMIAL" 
  Approximation Orders:    "CONSTANT" 

Execution Information:
  Executioner:             Transient
  TimeStepper:             ConstantDT
  TimeIntegrator(s):       ImplicitEuler
  Solver Mode:             NEWTON 
  MOOSE Preconditioner:    SMP 

LEGACY MODES ENABLED:
 This application uses the legacy initial residual evaluation behavior. The legacy behavior performs an often times redundant residual evaluation before the solution modifying objects are executed prior to the initial (0th nonlinear iteration) residual evaluation. The new behavior skips that redundant residual evaluation unless the parameter Executioner/use_pre_smo_residual is set to true. To remove this message and enable the new behavior, set the parameter 'use_legacy_initial_residual_evaluation_behavior' to false in *App.C. Some tests that rely on the side effects of the legacy behavior may fail/diff and should be re-golded.


Time Step 0, time = 0

Time Step 1, time = 1, dt = 1
 * Nonlinear |R| = 2.430000e+06 (Before preset BCs, predictors, correctors, and constraints)

Performing automatic scaling calculation

 0 Nonlinear |R| = 1.687502e+04
      0 Linear |R| = 1.687502e+04
      1 Linear |R| = 1.957683e+03
      2 Linear |R| = 1.314122e+02
      3 Linear |R| = 8.473260e+00
      4 Linear |R| = 7.308389e-01
      5 Linear |R| = 3.392529e-02
      6 Linear |R| = 1.239313e-03
      7 Linear |R| = 1.080488e-04
      8 Linear |R| = 3.901108e-06
      9 Linear |R| = 5.663600e-08
 1 Nonlinear |R| = 5.662338e-08
 Solve Converged!

Outlier Variable Residual Norms:
  pore_pressure: 5.662338e-08


Performance Graph:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                                Section                               | Calls |   Self(s)  |   Avg(s)   |    %   | Mem(MB) |  Total(s)  |   Avg(s)   |    %   | Mem(MB) |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| GolemApp (main)                                                      |     1 |      0.012 |      0.012 |   4.88 |       2 |      0.251 |      0.251 | 100.00 |      49 |
|   Action::SetupMeshAction::Mesh::SetupMeshAction::act::setup_mesh    |     1 |      0.001 |      0.001 |   0.34 |       2 |      0.001 |      0.001 |   0.34 |       2 |
|   Action::SetupMeshAction::Mesh::SetupMeshAction::act::set_mesh_base |     2 |      0.001 |      0.001 |   0.57 |       0 |      0.009 |      0.005 |   3.77 |       0 |
|   FEProblem::outputStep                                              |     2 |      0.000 |      0.000 |   0.07 |       0 |      0.042 |      0.021 |  16.75 |      13 |
|   Transient::PicardSolve                                             |     1 |      0.001 |      0.001 |   0.21 |       0 |      0.052 |      0.052 |  20.63 |       4 |
|     FEProblem::outputStep                                            |     3 |      0.000 |      0.000 |   0.04 |       0 |      0.000 |      0.000 |   0.04 |       0 |
|     FEProblem::solve                                                 |     1 |      0.030 |      0.030 |  11.95 |       4 |      0.050 |      0.050 |  20.09 |       4 |
|       FEProblem::computeResidualInternal                             |     1 |      0.000 |      0.000 |   0.01 |       0 |      0.004 |      0.004 |   1.52 |       0 |
|       FEProblem::computeResidualInternal                             |     2 |      0.000 |      0.000 |   0.01 |       0 |      0.005 |      0.002 |   1.81 |       0 |
|       FEProblem::computeJacobianInternal                             |     1 |      0.000 |      0.000 |   0.01 |       0 |      0.003 |      0.003 |   1.37 |       0 |
|       FEProblem::computeJacobianInternal                             |     1 |      0.000 |      0.000 |   0.01 |       0 |      0.008 |      0.008 |   3.28 |       0 |
|     FEProblem::computeUserObjects                                    |     1 |      0.000 |      0.000 |   0.03 |       0 |      0.000 |      0.000 |   0.03 |       0 |
|   Transient::final                                                   |     1 |      0.000 |      0.000 |   0.04 |       0 |      0.000 |      0.000 |   0.06 |       0 |
|     FEProblem::outputStep                                            |     1 |      0.000 |      0.000 |   0.01 |       0 |      0.000 |      0.000 |   0.02 |       0 |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

HPC-relevant Topics

Advice

  • Create two module collections - one for build-time, and one for run-time.
  • Compile the whole chain (PETSc-Libmesh-Moose app) with AVX support for the respective hardware. Computational frameworks like PETSc usually benefit from this.
  • Moose apps are MPI programs and are usually started via mpiexec or srun --mpi=pmi2, or the like. However, some applications may also have the option --n-threads=<# threads per rank>. Hybrid MPI/Thread execution of applications on the LRZ clusters is recommended for efficient use on NUMA nodes (see the Usage section above).
  • Measure Performance: This is important, specifically when you start with a new case! Start with a few time steps. Check the correct placement of MPI ranks/threads on the CPU cores. Try to assess the run-time and memory requirements (if memory becomes a bottleneck on the nodes, consider using distributed meshes). Perform some scaling tests with the test case at hand, in order to assess the potential for accelerating your computations. (A parallel efficiency of 70% or more is ok. Please also mind the Slurm queue limits!)
    Assessing the total runtime of a simulation case might prove difficult because of the adaptive time-step integration. But starting from a certain time step is possible, so stopping and restarting is a way to iteratively extend the simulation's total integration time (see the sketch after this list).
  • Pre/Post Processing: Most file formats used in Libmesh/Moose can be analysed with ParaView.
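
A stop-and-restart sketch following the advice above (Moose writes checkpoints as configured in the input file's [Outputs] block, cf. the Checkpoint section in the Golem log; --recover is part of Moose's standard command line, but check --help for your app):

# first leg: run until the Slurm time limit; checkpoints are written periodically
> mpiexec ./my-moose-app-opt -i Test_Case.i
# subsequent legs: continue from the latest checkpoint
> mpiexec ./my-moose-app-opt -i Test_Case.i --recover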

Why don't you provide Moose as a centrally installed Module at LRZ?

  1. There are still only a few users, with diverging requirements.
  2. Moose and its apps are under ongoing, rapid development. Once users settle on some fixed version, or a bigger community starts using Moose-based apps, we can revise our decision.
  3. Education: Moose is a framework meant to support the development of apps. On the one hand, such a framework is often not easy to provide as a central module. On the other hand, users should get to know the complete setup of their tools (including the builds of PETSc and Libmesh; the Moose developers have already simplified this considerably). This is best done when users practice the installation themselves. For support requests, please contact our Service Desk.
  4. But for OpenFOAM, there are central modules, and OpenFOAM is also a framework. That is true. But more users (a much larger community) use that software as-is (no development), and it has well-settled environment management for build and run time, a well-settled release cycle and versioning strategy, and industrial support. We are not going to discriminate against Moose here. We just have to make reasonable decisions, accounting for the limited manpower available for software support.