Tuning and Optimization Manuals, Courses and Lectures

Advanced techniques of optimization

    • Optimization, (Pseudo-)Vectorization, and Parallelization on the Hitachi SR8000-F1


    written by the High Performance Computing Group of Regionales Rechenzentrum Erlangen, PDF.


High Precision Clocks and Cycle Counters

Cycle Counters: $SCETCYC

This routine returns the number of machine cycles as an INTEGER (KIND=8) value.
... code to be measured
write(6,*) 'Cycles used ', CYCLES2-CYCLES1
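A cycle difference is usually easier to interpret once it is converted to seconds. The following is a minimal sketch in C, not part of any library described here: the function name cycles_to_seconds is hypothetical, and the clock frequency must be supplied by the caller.

```c
#include <stdint.h>

/* Convert the difference of two cycle-counter readings into seconds.
 * clock_hz is the processor clock frequency in Hz. It is an input of
 * this sketch (an assumption), not something the counter reports. */
double cycles_to_seconds(int64_t cycles_start, int64_t cycles_end,
                         double clock_hz)
{
    return (double)(cycles_end - cycles_start) / clock_hz;
}
```

For example, cycles_to_seconds(CYCLES1, CYCLES2, 375.0e6) would apply if the processor runs at 375 MHz; that figure is an assumption and should be checked for the actual system.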

Timing Routines

Service routines available in Fortran:
XCLOCK
    returns elapsed or CPU times

PCLOCK
    measures the CPU time required for parallel processing. The routine calculates the maximum, minimum and average values for all the threads.

        real*8 p(4)
        call pclock(p,3)
        ... code to be measured
        call pclock(p,8)

    p(1): The maximum CPU time of all of the threads from the PCLOCK(p,3)
    p(2): The minimum CPU time of all of the threads from the PCLOCK(p,3)
    p(3): The average CPU time of all of the threads from the PCLOCK(p,3)
    p(4): The same value as p(3)

The following routines are contained in liblrz
DWALLTIME
    returns the elapsed wallclock time; uses mk_gettimeofday.
        double dwalltime()
        DOUBLE PRECISION FUNCTION DWALLTIME()

DCPUTIME
    returns the used CPU time; uses getrusage.
        double dcputime()
        DOUBLE PRECISION FUNCTION DCPUTIME()

SECOND/DSECOND
    returns the used CPU time; uses XCLOCK.
        REAL*4 FUNCTION SECOND()
        REAL*8 FUNCTION DSECOND()

SECONDR/DSECONDR
    returns the elapsed wallclock time; uses XCLOCK.
        REAL*4 FUNCTION SECONDR()
        REAL*4 FUNCTION DSECONDR()

TREMAIN
    returns the remaining CPU time; uses XCLOCK.
        DOUBLE PRECISION FUNCTION TREMAIN()


StopWatch is a Fortran 90 module for portable, easy-to-use measurement of execution time. It supports four clocks -- wall clock, CPU clock, user CPU clock and system CPU clock -- and returns all times in seconds. It provides a simple means of determining which clocks are available, and the precision of those clocks. StopWatch is used by instrumenting your code with subroutine calls that mimic the operation of a stop watch. StopWatch supports multiple watches, and provides the concept of watch groups to allow functions to operate on multiple watches simultaneously.

Location of libraries and modules: /usr/local/lib/stopwatch

Documentation: user's guide (html), user's guide (postscript), man pages


Hardware Performance Counters

NQS Output

      A very easy way to get information about the performance is to look into the NQS output file. At the end of a job the following information is output:


Figure 1:  shows the operation of the parallel element program.


The following explains in detail the environment variables which are used in the output


        (in the case the server has been already started by another user, use the PMHISTORY or PMTOPOLOGY command to start these displays)
      • Select the appropriate data to be displayed e.g. MFLOPS.




Usage:  pmexec [-g <process number> ] <command> [arg...]
        pmexec -p [-i <interval>] [-g <process number>] [<command> [arg...]]
        pmexec -a [-i <interval>]
        pmexec -G [-i <interval>]


    usr(s)      User cpu time(seconds).
       (us)     User cpu time(micro seconds).
    sys(s)      System cpu time(seconds).
       (us)     System cpu time(micro seconds).
    usage       CPU usage(%) [(usr+sys)/etime].
    inst        Number of instructions.
    CPI         Clocks Per Instructions.
    LD/ST       Number of Load and Store instructions.
    ITLB        Number of Instruction-TLB miss.
    DTLB        Number of Data-TLB miss.
    Icache      Number of Instruction-Cache miss.
    Dcache      Number of Data-Cache miss.
    FU          Number of Floating instructions.
    fault       Number of page faults.
    zero        Number of zero pages(page allocations).
    react       Number of reactivations(pageout cancels).
    pagein      Number of pageins.
    COW         Number of Copy-On-Write.
    nswap       Number of swapouts.
    syscall     Number of system-calls.
    align       Number of page alignment fault
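
Some of the quantities in this legend are derived from others. The following minimal C sketch shows the arithmetic: CPU usage follows the formula (usr+sys)/etime given in the legend, and CPI is clocks divided by instructions. The function names are hypothetical, and the MFLOPS helper assumes that each counted floating instruction corresponds to exactly one floating-point operation, which may not hold on the real hardware.

```c
/* CPU usage in percent, as defined in the legend: (usr + sys) / etime. */
double cpu_usage_percent(double usr_s, double sys_s, double etime_s)
{
    return 100.0 * (usr_s + sys_s) / etime_s;
}

/* Clocks per instruction from a cycle count and an instruction count. */
double clocks_per_instruction(double cycles, double inst)
{
    return cycles / inst;
}

/* MFLOPS estimate from the FU count and a time in seconds.
 * Assumption: one counted floating instruction = one operation. */
double mflops_estimate(double fu, double seconds)
{
    return fu / seconds / 1.0e6;
}
```

With usr = 3 s, sys = 1 s and etime = 8 s, cpu_usage_percent gives 50 percent; 300 cycles for 200 instructions give a CPI of 1.5.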


Automatic Instrumentation with -Xparmonitor or -Xfuncmonitor


To instrument the code, compile with

  • f90 -model=F1 -opt=ss -Xparmonitor ...
if you are interested in parallel performance, or with
  • f90 -model=F1 -opt=ss -Xfuncmonitor ...

f90 -model=F1 -opt=ss -noparallel -Xfuncmonitor ...
if you are interested just in the performance of routines. This adds instrumentation around each parallel section of the subroutine. Several sets of data will be produced for each routine: one set for each parallel region of the code and one set for all the serial sections combined. (Remark) Even if the code is not being compiled with COMPAS enabled, -Xparmonitor is suitable.


  • Link with f90 [-32|-64] -parallel ... -lpl for COMPAS
  • Link with f90 [-32|-64] -noparallel ... -lpl -lcompas -lpthreads -lc_r for non-COMPAS programs
This links to the performance monitor library. The monitor library makes some calls to other libraries, and the easiest way to ensure that all libraries are present is to use the -parallel option.

Set an environment variable to select the type of output file that is produced

export APDEV_OUTPUT_TYPE=TEXT (in sh/ksh)
export APDEV_OUTPUT_TYPE=CSV (in sh/ksh)
export APDEV_OUTPUT_TYPE=BOTH (in sh/ksh)
The value TEXT produces a human-readable output file, CSV produces a comma-separated values file suitable for processing by tools such as spreadsheets, and BOTH produces both files. If the variable is not set, the default value is TEXT. Use CSV to be able to process data by my tool mentioned in

Run the job

After completion, one file per MPI process will be written. The file name will be of the form: executable_name_process_id_node_number.[csv|txt]

Evaluate the data

A Perl script (installed in /usr/local/bin) extracts some useful data from the CSV files. Run it with -f file.csv to get Mflops-related data and with -t file.csv to get time-related data. This script was written for personal use and as an example of how to extract data from the hardware monitor files. It is not a supported Hitachi product. Also be aware that the parallel parts of routines which are not instrumented (e.g., the BLAS library) may not be counted correctly.

Automatic instrumentation with -pmfunc, -pmpar and pmpr

The same measurements as for -Xfuncmonitor and -Xparmonitor are performed if the code is compiled with -pmfunc and/or -pmpar, but the method of output is a bit different.


Compile and link with:

  • f90 -model=F1 -opt=ss -pmfunc -pmpar ...
The FORTRAN compiler expects at least optimization level 4 (-O4) when the -pmfunc or -pmpar option is specified, and the C compiler expects at least optimization level 3 (-O3).

Performance monitoring information file:

A performance monitoring information file is created by the performance monitoring library when an application program runs. For example, if the program was executed with the following conditions, the name of the performance monitoring information file is pm_PROGRAM_Jan02_0304_node005_6
     1. Load module name    : PROGRAM
     2. Execution start time: January 2nd, at 4 minutes past 3.
     3. Node no             : 5
     4. Process no          : 6
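
The naming scheme can be sketched in C as follows. This is only an illustration: the function name pm_file_name is hypothetical, and the field widths (two-digit day, four-digit time, three-digit node number, unpadded process number) are inferred from the single example above rather than taken from the library's documentation.

```c
#include <stdio.h>

/* Build "pm_<module>_<Mon><dd>_<hhmm>_node<nnn>_<process>" into buf.
 * Returns the snprintf result, or -1 if month is not in 1..12.
 * Padding widths are an assumption based on one documented example. */
int pm_file_name(char *buf, size_t len, const char *module,
                 int month, int day, int hour, int minute,
                 int node, int process)
{
    static const char *const months[12] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    if (month < 1 || month > 12)
        return -1;
    return snprintf(buf, len, "pm_%s_%s%02d_%02d%02d_node%03d_%d",
                    module, months[month - 1], day, hour, minute,
                    node, process);
}
```

Calling pm_file_name(buf, sizeof buf, "PROGRAM", 1, 2, 3, 4, 5, 6) fills buf with pm_PROGRAM_Jan02_0304_node005_6, the name from the example.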

Output the information

The pmpr command inputs this performance monitoring information file and displays various types of performance monitoring information (see pmpr(1)), e.g.:

    • pmpr -ex -full pm_PROGRAM_Jan02_0304_node005_6

      this will give a full output with explanations. See output file for details.
    • pmpr -c -full pm_PROGRAM_Jan02_0304_node005_6

      this will output a comma-separated list for use with spreadsheets.

Details of the performance monitoring information

Beware: If a routine calls non-instrumented subroutines (e.g. libraries), the MFlops/timings of the latter are folded into the calling routine's measurement!

Checking the contents of a performance monitoring information file enables you to obtain various types of performance monitoring information, such as the CPU time (the period required by the CPU to execute a program), the number of executed instructions, and the number of floating-point operations. Details of the performance monitoring information are as follows:

Performance monitoring information for a process:

  • Program execution starting date and time
  • Node no
  • Process no
  • Load module name
  • Input/Output count
  • Input/Output quantity
  • CPU time
  • LoaD/STore instructions count
  • execution instructions count
  • Number of floating-point operations
  • MIPS
  • Number of data cache misses
Performance monitoring information in units of functions or procedures:
  • Function or procedure name
  • Source file name
  • Starting-line number
  • Number of executions
  • CPU time
  • LoaD/STore instructions count
  • execution instructions count
  • Number of floating-point operations
  • MIPS
  • Number of data cache misses
  • Execution rate (time basis)
  • Element parallelizing rate (CPU time and floating-point operations)
Performance monitoring information in units of element:
  • Types of information are the same as the performance monitoring information when the units are functions or procedures.


PCL: Performance Counter Library

(currently only available for non-COMPAS programs)


To optimize program performance, it is important to know where the bottlenecks are located. One means to identify bottlenecks in the program code is through the use of hardware counters. For example, such hardware counters can count floating point instructions, cache misses, TLB misses, etc. It is important to use hardware counters and not software counters to keep the overhead to a minimum and thus reduce the disturbing impact on the user program. This is especially important for parallel programs.

Currently, the low level interface to the SR8000 hardware counters is not yet published. However, there is a platform-independent high level interface called "Performance Counter Library", or PCL for short, from Forschungszentrum Juelich GmbH, which has been ported to the Hitachi SR8000 by LRZ staff and which hides all the gory details of the low level interface. These routines can and should be used by SR8000 users to instrument their programs.


PCL was developed by Rudolf Berrendorf and Heinz Ziegler at the Central Institute for Applied Mathematics (ZAM) at the Research Centre Juelich, Germany. It is a library which can be linked with your code and which provides a high level, platform-independent interface to hardware performance counters. These high level library calls can be used to instrument your code and yet keep it portable. PCL also allows nested calls - up to a certain, pre-compiled limit, which is currently set to 16 nesting levels. For more information, please have a look at the PostScript documentation.

On the SR8000, LRZ installed the latest pre-release of PCL, version 2.0. The library resides in /usr/local/lib/libpcl32s.a and can be linked with -lpcl32s. The include files pcl.h for C code and pclh.f for FORTRAN code can be found in /usr/local/include.

Only the following hardware counters (PCL_EVENT) are currently supported on the SR8000:

PCL_L1DCACHE_MISS (Data Cache misses)
PCL_L1ICACHE_MISS (Instruction Cache misses)
PCL_DTLB_MISS (Data Translation Lookaside Buffer misses)
PCL_ITLB_MISS (Instruction Translation Lookaside Buffer misses)
PCL_FP_INSTR (Floating Point Instructions)
PCL_LOADSTORE_INSTR (Number of Load-Store Instructions)
PCL_INSTR (All Instructions)
PCL_IPC (Instructions per Cycle)
PCL_L1DCACHE_MISSRATE (Data Cache Miss Rate)
PCL_MEM_FP_RATIO (Memory Instructions to Floating Point Instructions Ratio)

The small example program ptest.c illustrates how a program can be instrumented. Please compile and link it like this:
cc -o pcltest -I/usr/local/include -L/usr/local/lib ptest.c -lpcl32s

Sample output for this program is provided here:
FLOPs in iteration 0: 1
FLOPs in iteration 1: 101
FLOPs in iteration 2: 201
FLOPs in iteration 3: 301
Total FLOP count: 604

We would like to point out that the total FLOP count in this example was not computed by adding the individual loop contributions, but through the use of nested counters!
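
The bookkeeping behind nested counting can be illustrated with a small software model in C. This is not PCL's API and does not read any hardware counter: the type counter_model and the cm_* functions are hypothetical, and the per-iteration event counts are simply chosen to match the sample output above. Only the nesting limit of 16 is taken from the text.

```c
#include <stddef.h>

#define MAX_NESTING 16  /* nesting limit quoted in the text */

typedef struct {
    long long events;              /* running count, stands in for a hardware counter */
    long long start[MAX_NESTING];  /* counter value saved at each open start */
    size_t depth;                  /* number of currently open measurements */
} counter_model;

void cm_init(counter_model *c) { c->events = 0; c->depth = 0; }

/* Open a (possibly nested) measurement; fails beyond MAX_NESTING levels. */
int cm_start(counter_model *c)
{
    if (c->depth == MAX_NESTING)
        return -1;
    c->start[c->depth++] = c->events;
    return 0;
}

/* Close the innermost measurement and report the events seen since its start. */
int cm_stop(counter_model *c, long long *delta)
{
    if (c->depth == 0)
        return -1;
    *delta = c->events - c->start[--c->depth];
    return 0;
}

/* Simulate n counted events (e.g. floating point instructions). */
void cm_count(counter_model *c, long long n) { c->events += n; }

/* An outer measurement around four inner ones; returns the outer total. */
long long demo_total(long long per_iter[4])
{
    counter_model c;
    long long total = 0;
    cm_init(&c);
    cm_start(&c);                      /* outer measurement */
    for (int i = 0; i < 4; i++) {
        cm_start(&c);                  /* inner, nested measurement */
        cm_count(&c, 1 + 100LL * i);   /* 1, 101, 201, 301 events */
        cm_stop(&c, &per_iter[i]);
    }
    cm_stop(&c, &total);               /* total comes from the outer pair */
    return total;
}
```

In this model the total of 604 is obtained from the outer start/stop pair alone, without summing the four inner results.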


To make life a bit easier for our users, LRZ also installed AutoPCL, which was programmed by Touati Sid Ahmed Ali of INRIA, France. AutoPCL is a tool that automatically inserts calls to PCL in your FORTRAN source. Unfortunately, AutoPCL permits the counting of only one event at a time in a FORTRAN code section and is therefore of only limited practical use. However, it can be used as a first approach to PCL, and the resulting instrumented code can serve as a template for your own, manual instrumentation.

AutoPCL can be called as follows:
autopcl -i fortest -p PCL_FP_INSTR -m PCL_MODE_USER -b7 -e9
The options have the following meaning:
-i <input fortran file name without .f extension>
-p <type of counter, here floating point instructions, cf. table of supported counters>
-m <mode. Must be PCL_MODE_USER for now>
-b<line number in original fortran file where to start monitoring>
-e<line number in original fortran file where to stop monitoring>

The example program fortest.f is transformed into instrumented code, which can be found in ipcl.f.


The LRZ is currently working on a few extensions for automatic instrumentation of user programs. Stay tuned for more information.


PMCINFO: Low Level Hardware Performance Counters

(These counters are available for COMPAS and non-COMPAS programs)

The routines are contained in liblrz


returns the Hardware counters for a serial program