Decommissioned Compiler for Hitachi SR8000
Usage of the Fortran90 Compiler for the SR8000-F1
Calling Conventions
The Hitachi SR8000 Fortran90 compiler has over 100 options. Refer to the man page or the Optimising FORTRAN90 User Guide for full details.
All compiler options can be specified on the command line using the -W0 option, and some have short forms. The -W0 syntax contains "(" and ")", which must be protected from interpretation by the shell by enclosing the option string in single quotes (').
Examples
f90 -c -W0,'opt(o(4)),pvec,mp(procnum(8))' file.f
or
f90 -c -opt=4 -pvec -procnum=8 file.f
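The correspondence between the two forms can be sketched in a few lines of shell. This is only an illustration: the helper name to_w0 is hypothetical, it covers just a handful of the mappings listed in the option tables of this document, and the script prints the resulting command line instead of running the compiler.

```shell
#!/bin/sh
# to_w0: hypothetical helper translating a few short options into their
# -W0 sub-option form. Only mappings listed in this document are covered.
to_w0() {
    case $1 in
        -opt=*)     echo "opt(o(${1#-opt=}))" ;;
        -procnum=*) echo "mp(procnum(${1#-procnum=}))" ;;
        -model=*)   echo "model(${1#-model=})" ;;
        -precexp=*) echo "langlvl(precexp(${1#-precexp=}))" ;;
        -pvec)      echo "pvec" ;;
        -nopvec)    echo "nopvec" ;;
        *)          echo "to_w0: no mapping known for $1" >&2; return 1 ;;
    esac
}

# Join the translated options with commas, keeping their order.
w0=""
for short in -opt=4 -pvec -procnum=8; do
    w0="${w0:+$w0,}$(to_w0 "$short")"
done

# Dry run: print the command line instead of executing it.
echo "f90 -c -W0,'$w0' file.f"   # prints: f90 -c -W0,'opt(o(4)),pvec,mp(procnum(8))' file.f
```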
The order in which the options are specified on the command line may be important; subsequent options take precedence over earlier options. Certain options have hidden side-effects. For example, opt(o(s)) enables pseudo-vectorisation (PVP) so in the option string
-W0,'opt(o(s)),nopvec'
opt(o(s)) enables PVP and nopvec disables PVP, so the overall result is that PVP is disabled. If the options were specified as
-W0,'nopvec,opt(o(s))'
then PVP would be re-enabled as part of the opt(o(s)) option.
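The left-to-right rule can be modelled in a few lines of shell. The sketch below illustrates the rule only and is not a reimplementation of the compiler's option parser: the function name pvp_state is hypothetical, it assumes that only pvec, nopvec, opt(o(s)) and opt(o(ss)) affect PVP, and it reports "unspecified" when none of them appears.

```shell
#!/bin/sh
# pvp_state: hypothetical helper that walks a -W0 option string from left to
# right and reports whether PVP ends up enabled or disabled.
# Assumption: only pvec, nopvec, opt(o(s)) and opt(o(ss)) affect PVP here.
# The body runs in a subshell so the IFS change stays local.
pvp_state() (
    IFS=,                      # split the option string at commas
    state=unspecified
    for opt in $1; do
        case $opt in
            pvec|'opt(o(s))'|'opt(o(ss))') state=enabled ;;
            nopvec)                        state=disabled ;;
        esac
    done
    printf '%s\n' "$state"
)

pvp_state 'opt(o(s)),nopvec'   # prints: disabled
pvp_state 'nopvec,opt(o(s))'   # prints: enabled
```

Because each option simply overwrites the state set by earlier ones, the last PVP-related option in the string decides the outcome.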
The C compiler flags are mostly the same as the short versions of the Fortran compiler flags.
Some of the more common and useful options are listed in the following sections. For details see the Fortran90 Reference Manual or the man page.
A good way to start
The following settings may be used as the starting point for optimized programs:
    f90 -c -model=F1 -opt=s  [-noparallel] ....
    f90 -c -model=F1 -opt=ss [-noparallel] ...
    f90 -c -model=F1 -opt=ss [-noparallel] -noscope ...
 
Compiler Options
Hardware Specification
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'model(F1)' | -model=F1 | name of the machine model | The code is optimized specifically for the F1 model and cannot run on other models. Always specify this flag! | 
Fortran Language Specifications
These options provide some variations on the way in which the Fortran source code is interpreted.
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -halt={w|e|s} | | Controls whether compilation is aborted, depending on the highest error severity (error level) that occurs:  w : Abort if error level at least 4 | LRZ recommends using -halt=e. The default value -halt=s may fail to catch certain typing errors. | 
| -W0,'langlvl(precexp(4))' | -precexp=4 | Expand 4 byte REAL and COMPLEX variables to 8 bytes. Many hardware and compiler optimisations are only applied to 8 byte REAL variables. | Similar options expand 8 byte variables to 16 bytes, expand INTEGER variables, expand data during unformatted I/O operations, and modify the interpretation of certain intrinsic functions. | 
| -W0,'langlvl(save(0|1))' | -[no]save | Allocate subroutine local variables on the stack (-nosave) or statically (-save). The -nosave option is required when a subroutine is called within a parallelised loop, to prevent multiple instances of the subroutine from writing to the same memory location. | |
| -W0,'langlvl(intptr(1))' | -intptr | Enable Cray pointer syntax for dynamically allocated arrays. NB. The C library functions malloc and free are called directly and so their arguments must be passed by value e.g. p = malloc(%val(n)). | |
| -e | | Perform error check, ease restrictions and expand language specification | Implies the following compile-time switches: -i,P,PL -W0,'langlvl(CONTI199,H8000),FORM(FIXED119)' as well as the following run-time settings: -F'PORT(ECONV,EOFRD,EOFRDT,GETARG,GETENV,IARGC,NMLIST,NSCRACH,PRCNTL,REALEDT,REWNOCL,TABSP),RUNST(DAMNONL,UMASK)' | 
| -i[,LS,P,PL] | | Extend language specifications | See appendix F.1 of the Optimizing Fortran 90 Reference for details | 
| -W0,'langlvl(CONTI199(0|1))' | | Specifies whether up to 39 (value 0) or up to 199 (value 1) continuation lines can be written | | 
| -W0,'langlvl(H8000(0|1))' | | Specifies whether (value 1) or not (value 0) the source file is processed according to the OS 7 Fortran specification. | See appendix F.2 of the Optimizing Fortran 90 Reference for details | 
| -W0,'form(FIXED119)' | | Enable using up to 119 columns per line in source | | 
Optimisation Level
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'opt(o(0))' | -opt=0 or -O0 | Statements are compiled and optimized individually. | Default optimisation level when debugging with -g | 
| -W0,'opt(o(3))' | -opt=3 or -O3 | Inter-statement optimizations are applied without changing the sequence of operations. | Level 3 is the default level of optimization. | 
| -W0,'opt(o(4))' | -opt=4 or -O4 | Optimization may transform control structure and operation sequence. | |
| -W0,'opt(o(s))' | -opt=s or -Os | Pseudo-vectorisation and most forms of COMPAS parallelisation are automatically enabled. Some optimizations that exchange accuracy for speed are enabled. | |
| -W0,'opt(o(ss))' | -opt=ss or -Oss | Pseudo-vectorisation and all forms of COMPAS parallelisation are automatically enabled. Some further optimisations that exchange accuracy for speed are enabled. | 
Additional useful optimization flags (after already having specified -opt=s or -opt=ss)
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'opt(scope(0|1))' | -noscope -scope | -scope separates long code blocks into scope regions and optimizes each region independently. If you want to fully optimize long code blocks specify -noscope. | -noscope may lead to better performance but may increase compile times significantly. Try this option when compiling the final production code. | 
| -W0,'opt(listvec(0|1))' | -listvec -nolistvec | Perform list-vectorization for the IF statements in the loop. | Try this for loops with much work in the IF-body and/or low true ratio. | 
| -W0,'opt(predicate(0|1))' | -predicate -nopredicate | Perform optimization without generating a branch instruction for the IF statement | Try this for loops with little work in the IF-body and/or high true ratio. | 
| -W0,'opt(rsqrtlib(1))' | -rsqrtlib | Performs optimization of reciprocal square roots by using the library codes | |
| -W0,'opt(divopt(1))' | -divopt | Performs optimization by reducing division. | |
| -W0,'opt(divopt(0|1|2|3|4))' | -divoptlib=[0|1|2|3|4] | Specifies whether or not to optimize division by using the library codes. | | 
Pseudo-vectorization
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'pvec' | -pvec | Enable pseudo-vector processing (PVP). | Included with -opt=s and -opt=ss | 
| -W0,'nopvec' | -nopvec | Disable pseudo-vector processing (PVP). | Be sure that PVP is not re-enabled by a subsequent option such as opt(o(s)). | 
| -W0,'pvec(pvfunc(0))' | -nopvfunc | Do not use pseudo-vectorized intrinsic functions. | The pseudo-vectorized functions are: acos, asin, atan, atan2, cos, cosh, exp, dim, int, log, log10, anint, sin, sinh, tan, max, min, sign, sqrt. | 
| -W0,'pvec(pvfunc(1))' | -pvfunc=1 | Use pseudo-vectorized intrinsic functions. | This is the default option when PVP is enabled. | 
| -W0,'pvec(pvfunc(2))' | -pvfunc=2 | A work array may be used to split off the intrinsic functions into a separate loop. | |
| -W0,'pvec(pvfunc(3))' | -pvfunc=3 | Allow use of pseudo-vectorized library when the intrinsic function is inside an IF block. | 
Parallelization (COMPAS)
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'mp(p(0|1|2|3|4))' | -noparallel -parallel=3 | Parallelization Level. -parallel=0 or -noparallel disables COMPAS. | Be sure that COMPAS is not re-enabled by a subsequent option such as opt(o(s)). | 
| -W0,'mp(procnum(8))' | -procnum=8 | Enable COMPAS and optimize code for 8 IPs per node. | The executable can only be run on a node for which all the CPUs are available for parallel execution. | 
| -W0,'mp(procnum(n))' | -procnum=n | The number of IPs is determined at run-time. | |
| -W0,'mp(multiversion(1))' | -multiversion | Generate two versions of each loop that execute on 1 IP and 8 IPs. The loop that is used is chosen at run-time according to the number of iterations in the loop. | 
Parallelization (OpenMP)
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'mp(omp(0|1))' | -noomp, -omp | Disable/enable OpenMP | Please note that -omp implies -nosave in order to ensure that subroutine locals are private. Hence you might have to specify -save explicitly if large local arrays then overflow your stack. If this cannot be done safely, you will have to specify a SAVE attribute explicitly for large static arrays in your source. | 
| -W0,'mp(orphaned(0|1|2))' | -orphaned[=0|=1|=2] | Specifies how to compile orphaned directives when procnum(8) option is specified. | 
Diagnostic Messages
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'opt(loopdiag(1))' | -loopdiag | Output diagnostic messages concerning loop and scalar optimizations. | |
| -W0,'pvec(diag(1))' | -pvdiag | Output diagnostic messages concerning PVP. | |
| -W0,'mp(diag(1))' | -pardiag=1 | Output diagnostic messages concerning COMPAS. | |
| -W0,'mp(diag(2|3))' | -pardiag=2|3 | Output (more) detailed diagnostic messages concerning COMPAS. | |
| -loglist | | Output a detailed report on the compiler code analysis and optimisation | | 
Debugging
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -W0,'TESTMODE(DEBUG)' | -debug | Automatically sets the compiler options for acquiring debug information | | 
| -g | | Inserts the information required by the symbolic debugger | This switch implies -O0 optimization. | 
Miscellaneous Options
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -32 | | Compile with 32-bit addressing. Memory is limited to 2 Gbytes. | This is the default option. | 
| -64 | | Compile with 64-bit addressing. | The -64 option must also be passed to the linker. | 
Linker Options
| Option | Short Form | Meaning | Comments | 
|---|---|---|---|
| -32 | | Link 32-bit object files. Memory is limited to 2 Gbytes. | This is the default option. | 
| -64 | | Link 64-bit object files. | The -64 option must also be passed to the compiler. | 
| -rdma | | Place statically allocated data in the remote DMA region of memory. | This option can increase the speed of MPI data transfer. It may not be combined with intra-node (within a node) MPI. | 
| +BTLB | | Use the block translation lookaside buffer (block TLB) for virtual to physical memory address translation for statically allocated data. | This option can give a performance increase when addressing large arrays. | 
| +SBTLB | | Use the block translation lookaside buffer (block TLB) for virtual to physical memory address translation for stack data. | This option can give a performance increase when addressing large automatic or dynamic arrays. See the SBTLB documentation for details. This option is needed when linking with the serial BLAS library in order to get good performance. | 
| -parallel | | Link COMPAS object files. | This option is required when linking parallelised object files and is not the default. | 
Some linkage options can only be transferred to ld via the -Wl switch:
f90 -Oss -Wl,'-v' myprog.f90
for example toggles the verbose mode of the linker, which gives a complete list of objects and libraries used in the linking process. See the man page for ld for further linkage options.
A typical set of options for compilation would be:
-64                        \ # 64-bit addressing
-W0,'langlvl(precexp(4))'  \ # Expand 4-byte REALs to 8-bytes
-W0,'opt(o(ss))'           \ # Maximum optimisation
-W0,'pvec(pvfunc(2))'      \ # Pseudo-vector functions with work array
-W0,'mp(procnum(8))'       \ # Use COMPAS and optimise for 8 IPs
-W0,'mp(nestcheck(1))'     \ # Perform nest checking in COMPAS at runtime
-halt=e                      # Abort compilation on error level 8 or higher
and a typical set of options for linking would be:
-64 -parallel # 64-bit addressing, COMPAS
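The annotated listing above is meant to be read rather than pasted: in a shell, text following a line-continuation backslash is not treated as a comment. A minimal sketch of a pasteable variant builds the same option list step by step, so that each option can carry a real comment. The variable names CFLAGS and LDFLAGS, the file name myprog.f and the use of f90 with -o for the link step are assumptions made for this illustration, and the script only prints the commands.

```shell
#!/bin/sh
# Dry run: the commands are printed, not executed.
# CFLAGS, LDFLAGS and myprog.f are placeholder names chosen for this sketch.
CFLAGS="-64"                                  # 64-bit addressing
CFLAGS="$CFLAGS -W0,'langlvl(precexp(4))'"    # expand 4-byte REALs to 8 bytes
CFLAGS="$CFLAGS -W0,'opt(o(ss))'"             # maximum optimisation
CFLAGS="$CFLAGS -W0,'pvec(pvfunc(2))'"        # pseudo-vector functions with work array
CFLAGS="$CFLAGS -W0,'mp(procnum(8))'"         # COMPAS, optimised for 8 IPs
CFLAGS="$CFLAGS -W0,'mp(nestcheck(1))'"       # COMPAS nest checking at run time
CFLAGS="$CFLAGS -halt=e"                      # abort on error level 8 or higher
LDFLAGS="-64 -parallel"                       # 64-bit addressing, COMPAS

echo "f90 -c $CFLAGS myprog.f"
echo "f90 $LDFLAGS -o myprog myprog.o"
```

The single quotes are stored inside the variable so that the printed line can be pasted as is; executing the command directly from the script would need eval.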
Compiler Directives
The SR8000 Fortran compiler directives are almost as numerous as the compiler options.
The option directive can be used to set compiler options for the file that contains the directive; the soption directive specifies loop and block optimisations; the voption directive relates to PVP and the poption directive controls the operation of COMPAS.
The syntax of the option directive is:
*option compiler-option[,compiler-option...]
The option directive must appear on the first line of the file and start in column 1.
For example, the following use of the option directive adds the langlvl(save(0)) compiler option to the command line options for just this file. If there is a conflict with the command line options then the option directive takes precedence.
    *option langlvl(save(0))
          subroutine foo( a, b )
          real a(*), b(*)
          ...
          return
          end
The other directives have two forms depending on whether the source code is in fixed format or free format. In both cases the directive must start in column 1.
Fixed format
*soption option[,option...]
*voption option[,option...]
*poption option[,option...]
Free format
!soption option[,option...]
!voption option[,option...]
!poption option[,option...]
In C, the directives are hidden inside a comment. Far fewer directives have been implemented in the C compiler than the Fortran compiler. There are no soption directives in C.
/* voption option[,option...] */
/* poption option[,option...] */
Some of the more useful directives are described in the following sections. Refer to the User Guides for the complete list.
SOPTION Directives
| Directive | Meaning | Comments | 
|---|---|---|
| soption loopinterchange(n1,n2...) | Control the nesting of loops. | |
| soption noloopinterchange | Do not allow the compiler to exchange loop nesting. | Useful if the compiler makes a bad choice. | 
| soption unroll(n) | Unroll a loop n times. | |
| soption nounroll | Do not unroll a loop. | |
| soption dcbt(address) | Touches the cache line associated with address. | Useful for programming one's own prefetch algorithms. | 
Example for Use of SOPTION
For some matrix sizes at least, these unrolling numbers give better performance than the default compiler unrolling algorithm which necessarily must be conservative and general.
    *soption unroll(2)
          do i=1,n
    *soption unroll(2)
             do j=1,n
                c(i,j) = 0.0
             enddo
    *soption unroll(2)
             do k=1,n
    *soption unroll(4)
                do j=1,n
                   c(i,j) = c(i,j) + a(i,k)*b(k,j)
                enddo
             enddo
          enddo
VOPTION Directives
| Directive | Meaning | Comments | 
|---|---|---|
| voption vec | Unconditionally pseudo-vectorise the loop. | These directives assert that an array entry that is written to is not read from in the same iteration or in any other iterations. The voption indep directive should be compared with the poption indep directive. | 
| voption indep[(array-name...)] | Assert that there are no dependency relationships in the loop for the specified arrays. If no arrays are listed, then assert that all arrays are independent. | |
| voption prefetch[(array-name...)] | Pre-fetch the specified arrays. If no arrays are listed, then pre-fetch all arrays. | |
| voption preload[(array-name...)] | Pre-load the specified arrays. If no arrays are listed, then pre-load all arrays. | |
| voption pvfunc(0|1|2|3) | Control the way in which the pseudo-vectorised function library is used in the following loop. | Often, the best way to use the pseudo-vectorised function library must be selected manually. | 
Examples for the Use of VOPTION
The indep option asserts that there are no repeated values in array index.
    *voption indep(a)
          do i = 1,n
             a(index(i)) = a(index(i)) + b(i)
          end do
It would be incorrect for the compiler to apply PVP to the following code because the pre-fetch instruction for a(i) at the end of the loop would be issued before the value of a(i) at the beginning of the loop had been written. (Actually the compiler would detect this mistake, ignore the vec directive and instead keep the value of a(i) in a register or in cache, but it may not be possible to analyse more complex subscripts for the array a whose values can only be evaluated at run-time.)
    c PVP MAY NOT BE USED HERE
    *voption vec
          do i = 1,n
             a(i) = ...
             ... = a(i)
          end do
POPTION Directives
The following directives are used for controlling the distribution of iterations of a loop across the IPs of a node.
 
| Directive | Meaning | Comments | 
|---|---|---|
| poption parallel | Unconditionally parallelise the loop. | Block or cyclic distribution of iterations can be specified. | 
| poption noparallel | Do not parallelise the loop. | The compiler parallelises the outermost loop that it can. It may be necessary to control this default behaviour. | 
| poption indep[(array-name...)] | Assert that there are no dependency relationships between iterations of the loop for the specified arrays. If no arrays are listed, then assert that all arrays are independent. | This is a weaker constraint than voption indep. It asserts only that there are no dependencies between iterations of a loop but says nothing about the interactions between uses of a variable in different statements of a given iteration of a loop. | 
| poption tlocal[(variable-name...)] | Create thread-local copies of variables. | The compiler can often make the correct decision about generating thread-local copies of variables. | 
| poption notlocal[(variable-name...)] | Do not create thread-local copies of variables. | |
| poption barrier [ENTRY|EXIT] | Control barrier synchronisation before and after a parallel loop. | Normally, barriers are applied before and after parallel loops. | 
| poption nobarrier [ENTRY|EXIT] | 
The poption section directives provide another form of parallel processing of independent blocks of code.
Example for the Use of POPTION
The outer loop is parallelised. The array zfd is initialised, accumulated and used so each thread needs its own local copy. Thread-local copies of scalars can usually be generated automatically by the compiler but a directive is needed for arrays.
    *poption parallel
    *poption tlocal(zfd)
          do jk1 = klev, 1, -1
             do jl = 1, klon
                zfd(jl) = (1.0 - zclm(jl,jk1,klev)) ...
             enddo
             do jk = klev-1, jk1, -1
                do jl = 1, klon
                   zfd(jl) = zfd(jl) + zcfrac ...
                enddo
             enddo
             do jl = 1, klon
                pflux(jl,2,jk1) = zfd(jl)
             enddo
          enddo
Using OpenMP on the SR8000
Usage of OpenMP with Fortran
The OpenMP Specification 1.0 for Fortran is supported, with the following exceptions:
- nested parallel sections are always executed sequentially
- no dynamic creation or destruction of threads
An OpenMP include file is also available under /usr/include/omp_lib.h and can be used in a Fortran program by inserting the statement
!$ include 'omp_lib.h'
See Chapter 8 of the Optimizing FORTRAN90 User's Guide for details about Hitachi's OpenMP implementation. Fast thread execution is obtained using COMPAS (COoperative Microprocessors in single Address Space). OpenMP may serve as a replacement for the *poption directives and automatic COMPAS parallelization (OpenMP statements are uniquely related to corresponding *poption directives). If you wish the compiler to be aware of OpenMP directives in your code, please specify the option
-W0,'mp(omp(1))'.
In this case, any *poption directives in your code are disabled; instead of automatic parallelisation, the compiler performs a COMPAS code transformation of your OpenMP structures. The following compiler options are relevant for use of OpenMP:
- -W0,'mp(diag(3))' # output OpenMP diagnostics
- -W0,'mp(nestcheck(0|1|2))' # what to do with nested parallelism
- -W0,'mp(orphaned(0|1|2))' # how to treat orphaned directives
- -W0,'mp(procnum(8))' # OpenMP always runs on 8 threads
LRZ offers a (German) introduction to programming with OpenMP on its web site. References (Specifications and OpenMP web site) are also accessible from there.
Usage of OpenMP with C
As of the newest C Compiler release (01-02), OpenMP 1.0 is also supported in C programs. Since the web documentation presently published does not say anything about this feature, LRZ, courtesy of Hitachi, provides a prerelease of Chapter 9 of the new C User's Guide (pdf or compressed postscript).
Furthermore, the C manual page (man cc) provides information on which compiler switches are available for usage of OpenMP.
Run Time Options
The SR8000 offers a number of run-time options for execution of binaries, some of which are described in the following. The run-time options are specified as follows:
[mpiexec ...] executable_name -F'run-time-option, run-time-option'
The following list comprises only those options whose default values on the SR8000 can lead to difficulties.
| Option | Explanation | 
|---|---|
| PORT(PRCNTL(0|1)) | Specifies whether or not to treat the first character in a standard output record as print control character. Since the Fortran 01-02 release, the default is 1 (do not treat first character as print control character); in earlier releases the default was 0. | 
| PORT(MSGPUT(STDERR)) | Specifies that run-time-messages (especially traceback information in case of an abort of the Fortran run time system) go to STDERR. The default value is STDOUT, which is not useful in many cases. | 
| PORT(REWNOCL(0|1)) | Specifies whether (value 0) or not (value 1) to close the referenced file after executing the REWIND statement. If this option is not specified at all, the file is closed after REWIND. | 
| PORT(IARGC(0|1),GETARG(0|1)) | Specifies whether (value 1) or not (value 0) to change how the IARGC intrinsic function counts the command arguments and how the GETARG service subroutine handles them. By default, the command name itself is not counted as first command argument, and GETARG sets the command name position to 0. | 
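As a sketch of how such options are put together, the following shell fragment joins several PORT sub-options into one -F argument. The helper name rt_option_arg and the program name ./myprog are hypothetical, and grouping several sub-options inside one PORT(...) follows the form shown for the -e compiler option earlier in this document. The script prints the command line rather than running anything.

```shell
#!/bin/sh
# rt_option_arg: hypothetical helper that joins PORT sub-options into a single
# -F'PORT(...)' argument, written as it would be typed on the command line.
rt_option_arg() {
    list=""
    for sub in "$@"; do
        list="${list:+$list,}$sub"    # comma-separate the sub-options
    done
    printf '%s\n' "-F'PORT($list)'"
}

# Dry run: print the command line instead of executing it.
# ./myprog is a placeholder executable name.
echo "./myprog $(rt_option_arg 'MSGPUT(STDERR)' 'PRCNTL(1)')"
```

With these two sub-options, run-time messages would go to STDERR and the first output character would not be treated as a print control character, as described in the table.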
Diagnostics and Tuning
Compiler Log File
A detailed report on the compiler code analysis and optimisation can be obtained.
Compile with the following additional compiler flag. The compiler attempts to translate its Japanese messages into English, but some Japanese messages may still appear if the translator is out of step with the compiler release:
-loglist
Example:
The command
f90 -loglist -W0,'opt(o(4)),pvec,mp(procnum(8))' ddot.f
produces, among other files, an output file ddot.log containing the following output:
          subroutine ddot(a,b,n,s)
          real*8 a(n), b(n), s
          s= 0.0
    *poption parallel
    ** Parallel processing starting at loop entry
    ** Parallel function: _parallel_func_1_DDOT
    ** Parallel loop
    ** S: reduction variable (SUM)
    ** --- Add barrier at loop exit ---
    ** Parallel processing finishing at loop exit
    **
    ** Innermost loop accumulator variables expanded (8 times).
    ** PVP applied
    **
          do i = 1,n
             s = s + a(i)*b(i)
          enddo
          end
Performance Monitoring Tools
Automatic instrumentation with -Xfuncmonitor and -Xparmonitor
The hardware monitor gathers performance statistics during the execution of the program. Calls to the hardware monitor system are inserted by setting compiler flags. The hardware monitor calls can be set either at the beginning or end of each function or also around each parallel region. Since this may generate noticeable overhead, please remove instrumentation prior to production runs!
To monitor functions only, use the compiler flag
-Xfuncmonitor
and to monitor functions and parallel regions use the compiler flag
-Xparmonitor
In both cases, also add the linker flag -lpl if using COMPAS, or add -lpl -lcompas -lpthreads -lc_r if you have also specified -noparallel. The compiler needs optimization level -O4 (or higher).
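The choice of linker flags can be written down as a small shell function. This is a sketch only: the function name monitor_link_flags and its two keywords are made up for this illustration, while the flags themselves are the ones named above.

```shell
#!/bin/sh
# monitor_link_flags: hypothetical helper returning the extra linker flags for
# an instrumented build. Argument: "compas" for a COMPAS build, "noparallel"
# for a build compiled with -noparallel.
monitor_link_flags() {
    case $1 in
        compas)     echo "-lpl" ;;
        noparallel) echo "-lpl -lcompas -lpthreads -lc_r" ;;
        *)          echo "usage: monitor_link_flags compas|noparallel" >&2
                    return 1 ;;
    esac
}

monitor_link_flags compas       # prints: -lpl
monitor_link_flags noparallel   # prints: -lpl -lcompas -lpthreads -lc_r
```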
The hardware monitor report is written to files with names of the form pl_wwwww_xxxx_yy.txt where wwwww is the executable name, xxxx is the process number and yy is the node number. This output file gives information such as time, flop counts, MFlop rates, load balance between IPs, etc.
How to obtain Nice Output for Performance/Timing
The following procedure produces nicely formatted tables of the performance and timing of all instrumented routines in a program:
- compile with -Xparmonitor (or -Xfuncmonitor if not element parallel)
- set the environment variable APDEV_OUTPUT_TYPE to CSV (and export it).
- run your code, which generates an output file pl_wwwww_xxxx_yy.csv (see above) for each MPI process started.
- then execute mon.pl -f pl_wwwww_xxxx_yy.csv or mon.pl -t pl_wwwww_xxxx_yy.csv to obtain a per-routine summary of MFlops and times, respectively.
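The environment setting and the final step of this procedure can be sketched as a short shell script. The helper name list_summaries is hypothetical, and the script is a dry run: it prints one mon.pl command per CSV file it finds instead of calling mon.pl. Compiling and running the instrumented program are left out.

```shell
#!/bin/sh
# Request CSV output from the hardware monitor (set before running the program).
APDEV_OUTPUT_TYPE=CSV
export APDEV_OUTPUT_TYPE

# list_summaries: hypothetical helper that prints one mon.pl command per
# hardware monitor CSV file found in the given directory.
# Usage: list_summaries DIR [-f|-t]   (-f: MFlops summary, -t: timing summary)
list_summaries() {
    dir=$1
    mode=${2:--f}                        # default to the MFlops summary
    for csv in "$dir"/pl_*.csv; do
        [ -e "$csv" ] || continue        # no CSV files: print nothing
        echo "mon.pl $mode $csv"
    done
}

# Dry run for the current directory.
list_summaries . -f
```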
pmfunc/pmpar and pmpr
The same measurements as for -Xfuncmonitor and -Xparmonitor are performed if the code is compiled with -pmfunc and/or -pmpar, but the method of output is a bit different.
The Fortran compiler assumes that the -O4 optimization option (level 4) is specified when -pmfunc or -pmpar is used; the C compiler assumes the -O3 optimization option (level 3).
Performance monitoring information file:
A performance monitoring information file is created by the performance monitoring library when an application program runs. For example, if the program was executed with the following conditions, the name of the performance monitoring information file is pm_PROGRAM_Jan02_0304_node005_6:
1. Load module name : PROGRAM
2. Execution start time: January 2nd, at 4 minutes past 3.
3. Node no : 5
4. Process no : 6
Output the information
The pmpr command inputs this performance monitoring information file and displays various types of performance monitoring information (see pmpr(1)), e.g.:
pmpr -ex -full pm_PROGRAM_Jan02_0304_node005_6
This will give a full output with explanations. See the output file for details.
pmpr -c -full pm_PROGRAM_Jan02_0304_node005_6
This will output a comma-separated list for use with spreadsheets.
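The naming scheme of the monitoring file can be sketched as a shell function that assembles the name from its parts and then prints a matching pmpr call. The helper name pm_file_name is hypothetical, and the field widths are inferred from the single example name given above, so treat them as an assumption.

```shell
#!/bin/sh
# pm_file_name: hypothetical helper that assembles a monitoring file name from
# its parts. The field widths are inferred from the one example in the text.
# Usage: pm_file_name MODULE MONTH DAY HOUR MINUTE NODE PROCESS
# Pass the numbers without leading zeros (printf would read 08 as octal).
pm_file_name() {
    printf 'pm_%s_%s%02d_%02d%02d_node%03d_%d\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Load module PROGRAM, started on January 2nd at 03:04, node 5, process 6.
pm=$(pm_file_name PROGRAM Jan 2 3 4 5 6)

# Dry run: print the pmpr call instead of executing it.
echo "pmpr -ex -full $pm"   # prints: pmpr -ex -full pm_PROGRAM_Jan02_0304_node005_6
```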
Details of the performance monitoring information
Beware: If a routine calls non-instrumented subroutines (e.g. libraries), the MFlops/timings of the latter are folded into the calling routine's measurement!
Checking the contents of a performance monitoring information file enables you to obtain various types of performance monitoring information, such as the CPU time (the period required by the CPU to execute a program), the number of executed instructions, and the number of floating-point operations. Details of the performance monitoring information are as follows:
Performance monitoring information for a process:
- Program execution starting date and time
- Node no
- Process no
- Load module name
- Input/Output count
- Input/Output quantity
- CPU time
- LoaD/STore instructions count
- execution instructions count
- Number of floating-point operations
- MIPS
- MFLOPS
- Number of data cache misses
Performance monitoring information in units of functions or procedures:
- Function or procedure name
- Source file name
- Starting-line number
- Number of executions
- CPU time
- LoaD/STore instructions count
- execution instructions count
- Number of floating-point operations
- MIPS
- MFLOPS
- Number of data cache misses
- Execution rate (time basis)
- Element parallelizing rate (CPU time and floating-point operations)
Performance monitoring information in units of elements:
- Types of information are the same as the performance monitoring information when the units are functions or procedures.
Tuning Checklist
Things to check for good code performance (refer to tuning manual for details and remedies).
- parallelised loops
- outermost loops parallelised
- good load balance between IPs
- pre-fetch applied
- pre-load applied if pre-fetch is not possible
- best option for pvfunc selected
- software pipelining applied
- IF predicate transformation used
- no register spill
- no memory bank conflict
- no cache thrashing
- minimal number of divisions
- minimal real-integer conversion
- minimal memory traffic
Calling Service Routines from Fortran
You can access certain operating system services by using functions from the lists in Appendix D of the Fortran 90 Reference manual. Please note that for the functions in D.3 (especially the "SYSTEM" call) the following caveats apply:
- -lf90c must be specified for linkage
- If the -i,L option is used for Fortran and C language mixing, -i,EU must also be specified. Otherwise the libc versions of the service routines will be linked, which gives unreliable results. In this case, all C subroutine names need an additional underscore.
C Compiler
Standard (C89)
Hitachi's Optimizing C compiler is compliant with the ISO/IEC9899:1990 standard (C89). The undocumented and misleading option -c99 has no effect; the compiler cannot compile C99 programs.
C and Fortran Interface
For the interlanguage usage of Fortran and C, see Chapter 9 of the Optimizing FORTRAN90 User's Guide.
To call a FORTRAN subroutine from a C main program, use the -s,INIT option to the F90 compiler to compile your source file containing the FORTRAN subroutines you want to call from your C program. Then use the cc C-compiler to link all objects.
Linking
To link the element parallelizing object modules using C, you must specify the options
cc ... -parallel -lf90s -lf90_r -lhxb
ANSI/ISO98 C++ Compiler
Hitachi's ANSI/ISO98 conforming C++ compiler sCC version 01-00-A is installed together with an implementation of the Standard Template Library (STLport 4.5).
Please note that even though the compiler has passed its beta phase, there may still be problems. The sCC compiler does not conform 100% to the ANSI/ISO98 standard. If you think your program requires some part of the C++ standard that is not implemented in sCC, please contact us so that we can communicate this to Hitachi and, hopefully, have it implemented in the next compiler release.
First steps
| Compiler call (without STL) | /usr/bin/sCC | 
| Compiler call using STL | /usr/local/bin/sCC-stl Please see STL section below | 
| Compiler options for MPI | -I/usr/mpi/include (C bindings) /usr/local/sys/CC/mpi2c++/include/mpi++.h (C++ bindings) (there is no mpiCC command!) | 
| Linker options for MPI | -L/usr/mpi/lib -lmpi (for 32 bit codes) -L/usr/mpi/lib64 -lmpi (for 64 bit codes) /usr/local/sys/CC/mpi2c++/lib/libmpi++.a (C++ bindings, 32 bits only) | 
| Manual page | man sCC | 
| Cross compiler | none available | 
Pseudo-Vectorization and Parallelization with sCC
The big advantage of sCC over KCC and gcc is that it can optimize code for PVP and COMPAS (automatically or guided by directives), and may therefore produce code that performs much better on the SR8000.
In addition to the -parallel and -pvec compiler options, the sCC compiler also recognizes the PVP directives /*voption ... */ and the COMPAS directives /*poption ... */ described in Hitachi's CC manual (old ARM-C++ compiler).
Unsupported or untested features of sCC
The following list may change rapidly, please keep an eye on it as the compiler evolves! sCC currently does not support
- generation of debug symbols (option: -g)
- exception handling of code parts parallelized with COMPAS
- a few 64-bit libraries (e.g. libmpi++, libstlport)
- the Technical Corrigendum 1 (TC1) enhancing the ISO98-Standard (please refer to the official ISO WG21 website)
Usage of the STLport library
Compile with
sCC-stl -c myCCfile.C
If you encounter problems with this script, please try to compile with
sCC -ll64 -IotherIncludePaths -I/usr/local/sys/CC/stl/stlport -c myCCfile.C
Link either with
sCC-stl -o myprog objectfiles libraries
or
sCC -o myprog objectfiles libraries -L/usr/local/sys/CC/stl/lib -lstlport_sCC -lm
Trouble Shooting
- "ip_opts" is multiply defined: This is an incompatibility with /usr/include/netinet/in.h. Please use sCC-stl -localfixes or explicitly give -I/usr/local/sys/CC/sCC_IR/fix/include as the first option to the compiler.
- Stack overflow: sCC uses more stack than previous C++ compiler releases. Please inform us if this happens; we have increased the default stack size to 2 MB for sCC usage. You can also increase the stack limit yourself as described in the introductory LRZ article, but this should only be done until we have updated the default limit.
- Excessive compile time, or compile abort: Compile times using sCC are fairly long because of the difficult optimization tasks involved for templates, PVP, and COMPAS; this is, unfortunately, normal. As a workaround, you might try the additional compiler option -noautoinline. (The option "-Wc,-ZTOb8" that was used up to Beta Release 3 no longer works and may crash the compiler.)
- Unknown poption parallel: With the plain call to sCC, the optimization directives /*poption parallel*/, /*poption parallel_sections*/, /*poption section*/ and /*poption end parallel_sections*/ are rejected as incorrect. If your source file contains at least one of these directives, compile with the additional compiler option -Wc,-ZTOsi. Note that the resulting program is not exception safe, i.e. exceptions must not be thrown from within the parallel code sections.
- Useless line numbers in compiler diagnostic messages: When using the -pardiag, -pvdiag and/or -loglist options, the source code of included header files is not written out together with the diagnostic messages. If the header files contain executable code, the source lines to which the messages apply cannot be determined. In this case, first preprocess the source code with sCC -E and then compile the resulting code (which no longer contains any #include directives).
- Multiply defined types (at compile time), or duplicate symbols (at link time): The STLport library /usr/local/sys/CC/stl implements its own iostream and a few other standard classes. Those implementations are not fully compatible with sCC's own implementations. So if you get errors about multiply defined types, make sure the include options (-I...) are the same for all source files. You may not be able to use sCC-stl, but have to specify the STL includes and libraries explicitly. Or you may try giving -I/usr/include/sCC as the first include option. Warnings due to duplicate symbols at link time are not known to cause problems.
- Class string (or other ANSI C++ class) not found: For a few ANSI classes that are not implemented within sCC (e.g. with #include <string>, or with the ifstream operator >>), please use the implementations in the STLport library. This is done using the wrapper script /usr/local/bin/sCC-stl (see above).
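The restriction that exceptions must not be thrown from within parallel code sections (when compiling with -Wc,-ZTOsi) suggests keeping such regions free of throwing code and reporting errors through status values instead. The sketch below illustrates that pattern in plain ISO98 C++. The function names are hypothetical, and the parallel directive itself is deliberately not shown, since its exact placement rules are not covered here; the comment only marks the loop that would be a candidate for it.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical worker: reports failure via its return value, never throws.
bool safe_reciprocal(double x, double& out)
{
    if (x == 0.0) {
        return false;
    }
    out = 1.0 / x;
    return true;
}

// Computes 1/x for every element and returns the number of failures.
std::size_t reciprocal_all(const std::vector<double>& in,
                           std::vector<double>& out)
{
    // All allocation (which could throw) happens before the loop.
    out.assign(in.size(), 0.0);
    std::vector<char> ok(in.size(), 0);  // one status slot per iteration
    assert(out.size() == in.size());

    // Candidate region for a parallel directive (assumption): the body
    // throws nothing and each iteration writes only to its own slots.
    for (std::size_t i = 0; i < in.size(); ++i) {
        ok[i] = safe_reciprocal(in[i], out[i]) ? 1 : 0;
    }

    // Serial evaluation of the recorded status values.
    std::size_t failures = 0;
    for (std::size_t i = 0; i < ok.size(); ++i) {
        if (!ok[i]) {
            ++failures;
        }
    }
    return failures;
}

// Throws only after the loop has finished, i.e. in serial code.
void reciprocal_all_or_throw(const std::vector<double>& in,
                             std::vector<double>& out)
{
    if (reciprocal_all(in, out) != 0) {
        throw std::invalid_argument("input contains a zero element");
    }
}
```

The point of the design is that the exception, if any, is raised only by `reciprocal_all_or_throw` after all loop iterations have completed.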
Old C++ Compiler (ARM)
Hitachi's old C++ compiler CC is not ISO98 conforming, but only implements the ARM pseudo-standard ("Annotated Reference Manual" by Stroustrup). The LRZ does not recommend using this compiler. together with an implementation of the Standard Template Library (STLport 4.5).
C++ and Fortran Language Mixing
This is unfortunately not entirely trivial, and the fact that there are three C++ compilers installed on the system further confuses the issue.
- Compilation and linkage need to be done separately.
- For the linkage step, the C++ compiler used to generate the C++ objects should be used. The Fortran 90 runtime libraries must be specified explicitly:
    KCC -o ./myprog.exe myprog.o mysub1.o ... -lm -lf90s -lf90 -lhf90math
- KCC Compiler: Calling C++ routines from a Fortran main program is only possible if the C++ runtime environment is set up via a _main() call. So you need a small C++ wrapper
    /* additional part start */
    extern "C" _main();
    #define kai_startup_main KAI_STARTUP_MAIN
    extern "C" void kai_startup_main() { _main(); }
  which is called from the Fortran program before the first C++ subroutine call:
    PROGRAM MYPROG
    ...
    call KAI_STARTUP_MAIN
    call c++subroutine(...)
- CC Compiler: Calling C++ routines from a Fortran main program is performed in the same way as described above for the KAI C++ Compiler.
- sCC Compiler: Linkage against Fortran has not yet been tested. Feel free to try; please report your results to our support group!
- If the above prescription is disregarded, KCC throws a run-time error, while the Hitachi C++ Compiler segfaults at link time due to a bug.
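A C++ routine intended to be called from Fortran might look like the following sketch. It rests on two assumptions that should be checked against the compiler manuals: that Fortran passes all arguments by reference, and that the Fortran compiler refers to external routines by their upper-case names (the `#define` in the wrapper above points in that direction). The routine name `VECSUM` is hypothetical.

```cpp
#include <cassert>

// Hypothetical routine; the intended Fortran call would be
//     call VECSUM(X, N, RESULT)
// extern "C" suppresses C++ name mangling so that the linker can match
// the symbol. Upper-case naming and by-reference arguments are assumptions.
extern "C" void VECSUM(const double* x, const int* n, double* result)
{
    assert(x != 0 && n != 0 && result != 0);

    double sum = 0.0;
    for (int i = 0; i < *n; ++i) {
        sum += x[i];
    }
    *result = sum;
}
```

Passing scalars such as `n` through pointers mirrors the by-reference convention; nothing is returned by value, so the routine corresponds to a Fortran SUBROUTINE rather than a FUNCTION.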
The LRZ thanks Dr. Ullrich Becker-Lemgau from Pallas and Takeshi Murakami from Hitachi for providing information on this issue.
Versions of installed Compiler Releases
The presently (January 20, 2003) installed compiler versions are
- Fortran: 01-06
- C: 01-04/B
- C++: 01-03
- ANSI C++: 01-00
Older compiler versions are installed under
- /usr/bin/cc.0104 (C Version 01-04)
- /usr/bin/cc.0104A (C Version 01-04/A incl. SUT)
- /usr/ccs/f90/01-05/f90 (Fortran Version 01-05)
- /usr/ccs/f90/01-05-A/f90 (Fortran Version 01-05-A)
- /usr/local/bin/sCCb5 (ANSI C++ Beta 5 Release)
and can be used by people whose work would be disrupted by bugs in the new compiler release. If you rely on one of them, please contact the LRZ nevertheless, because the old versions will eventually be deleted.
 
 
        