Intel VTune Profiler and Application Performance Snapshot (APS)

Purpose

Intel VTune supports profiling and performance analysis of single- and multi-threaded programs on Intel-based hardware platforms. It is available free of charge as part of Intel oneAPI.

Availability on LRZ's HPC platforms

VTune is provided on HPC systems that are based on Intel processors. On non-Intel processors, or on systems where no kernel driver is installed, only part of the functionality may work. If you encounter any difficulties with the LRZ-specific installations, please contact the LRZ Service Desk for help.

How to use 

First load the relevant modules:
module load oneapi 
module load intel-oneapi-vtune

You can then invoke the tool either via the command line interface (command vtune) or the GUI (command vtune-gui).
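
For the command line interface, a typical session consists of a collection step followed by a report step. The following is a minimal sketch rather than an LRZ-specific recipe: the application name ./myapp.exe, the result directory vtune_hotspots and the helper function run_or_show are placeholders introduced here for illustration. The helper only prints the command when the tool is not on the PATH, so the snippet can be tried safely before the modules are loaded.

```shell
# Placeholder names (assumptions, adapt to your own application):
APP=./myapp.exe
RESULT_DIR=vtune_hotspots

# Run a command if its executable is on the PATH; otherwise only print it.
run_or_show() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "Would run: $*"
    fi
}

# Step 1: collect a hotspots profile of the application.
run_or_show vtune -collect hotspots -result-dir "$RESULT_DIR" -- "$APP"

# Step 2: print a text summary of the collected result.
run_or_show vtune -report summary -result-dir "$RESULT_DIR"
```

The result directory can also be opened afterwards in the GUI (vtune-gui) for interactive inspection.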

The GUI allows you to build analysis projects, specify an executable as well as various parameters for execution and analysis modes. In particular, profiling of threaded programs (including scalability analysis and identification of parallelization-induced performance problems) is supported. Please consult the documentation referenced below for a description of the many options this tool offers.

Because the kernel modules for performance-counter-based runs cannot be provided, VTune relies on the Linux perf infrastructure instead, and only a subset of the functionality may be available.
However, since 2019 this functionality is much more accurate and comprehensive than in the past. See this (off-site) article for further details.
For data protection reasons, access to profiling counters on the login nodes is very restricted. Full profiling is allowed on compute nodes (accessible via interactive or batch jobs, see Job Processing with SLURM on SuperMUC-NG and Job Processing on the Linux-Cluster).
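
To see how restrictive a given node is, one can inspect the kernel's perf_event_paranoid setting, which controls what unprivileged users may do with the Linux perf infrastructure (lower values are more permissive). This is only a sketch: it assumes that the access restriction mentioned above is implemented through this standard kernel setting, which the LRZ documentation does not state explicitly.

```shell
# Standard Linux location of the perf restriction level (assumed to be
# the mechanism behind the limited counter access on login nodes).
PARANOID_FILE=/proc/sys/kernel/perf_event_paranoid

if [ -r "$PARANOID_FILE" ]; then
    level=$(cat "$PARANOID_FILE")
else
    # File missing or unreadable, e.g. on a non-Linux system.
    level=unknown
fi

# Print the node name together with the value for easy comparison.
echo "$(uname -n): perf_event_paranoid=$level"
```

Running the snippet once on a login node and once inside a job on a compute node shows whether the two differ.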

APS

Recent releases of Intel VTune (formerly Intel VTune Amplifier XE) include Application Performance Snapshot (APS), which provides a quick overview of:

  • MPI parallelism (Linux only)
  • OpenMP parallelism
  • Memory access
  • FPU Utilization
  • I/O efficiency
  • ...

APS is included in the vtune module and can be used wherever VTune can. It is a much more lightweight tool and is often used as a first profiling step or for large-scale runs. LRZ users are encouraged to use APS for their own profiling, especially at the beginning of a new project or after porting to a new machine. Occasionally, LRZ may ask users to provide APS reports of their production runs.

Running APS

Initialize APS on LRZ machines by loading the oneapi and vtune modules as above and setting the statistics level via MPS_STAT_LEVEL:

module load oneapi 
module load intel-oneapi-vtune
export MPS_STAT_LEVEL=4


More information on controlling the amount of collected data via MPS_STAT_LEVEL:

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide-application-snapshot-linux/2023-1/controlling-amount-of-collected-data.html


To analyze an application and store the results in <dir>, run the following inside a job on a compute node (interactive or batch, e.g. within a SLURM job file):

Non-MPI APS profiling
# Collection
aps [--result-dir=<dir>] ./myserial.exe
# Report
aps-report <dir>  # Creates a useful .html report that can be viewed with any browser
aps-report -a <dir>  # Prints all available stats to stdout  

The syntax differs slightly for MPI-parallel profiling:

MPI APS Profiling
# Collection
mpiexec <mpiexec_options> aps --result-dir=<dir> ./myparallel.exe # Output dir is mandatory here
# Report
# The same options as for non-MPI code are available. In addition:
aps-report -x --format=html <dir> # Creates a communication matrix for all MPI tasks
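
The MPI commands above can be combined into a batch job. The following sketch writes a hypothetical job file aps_job.sh. The job name, task count, time limit and the result directory name are illustrative assumptions, and cluster-specific options such as partition or account are deliberately left out and must be added according to the LRZ job processing documentation.

```shell
# Write a minimal SLURM job file for an MPI APS run (all values are examples).
cat > aps_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=aps_profile
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
# Cluster-specific options (partition, account, ...) are intentionally omitted.

module load oneapi
module load intel-oneapi-vtune
export MPS_STAT_LEVEL=4

# One result directory per job; "manual" is used outside of SLURM.
RESULT_DIR=aps_result_${SLURM_JOB_ID:-manual}

# Collection: the output directory is mandatory for MPI runs.
mpiexec -n "$SLURM_NTASKS" aps --result-dir="$RESULT_DIR" ./myparallel.exe

# Report: summary plus the communication matrix for all MPI tasks.
aps-report "$RESULT_DIR"
aps-report -x --format=html "$RESULT_DIR"
EOF

echo "Wrote aps_job.sh"
```

The generated file would then be submitted with sbatch aps_job.sh. Keeping the job ID in the result directory name avoids overwriting earlier profiles when the job is resubmitted.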

For more advanced APS capabilities, please refer to the aps and aps-report manual entries, or consult this (off-site) article.

Documentation