Most of the hints presented here apply not only to Spectrum Scale (GPFS) but also to other (parallel) file systems.
General Hints for IO
- Open files in the correct mode. If a file is only intended to be read, it must be opened in read-only mode because choosing the right mode allows the system to apply optimisations and to allocate only the necessary resources.
- Write/read arrays/data structures in one call rather than element by element. Not complying with this rule will have a significant negative impact on the I/O performance.
- Do not open and close files too frequently because it involves many system operations. The best way is to open the file the first time it is needed and to close it only if its use is not necessary for a long enough period of time.
- Limit the number of simultaneous open files because for each open file, the system must assign and manage some resources.
- Separate procedures involving I/O from the rest of the source code for better readability and maintainability.
- Separate metadata from data. Metadata is anything that describes the data, usually the parameters of calculations, the sizes of arrays... It is often easier to separate files into a first part (header) containing the metadata followed by the data.
- Create files independent of the number of processes. This will make life much easier for post-processing and also for restarts with a different number of processes.
- Align accesses to the boundaries of the file system blocks and have only one process per data server (not easy).
- Use non-blocking MPI-I/O calls (not implemented/available on all systems).
- Use higher level libraries based on MPI-I/O (HDF5, ADIOS, SIONlib...).
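As a minimal sketch of the "one call" and "header followed by data" hints above, the following Python snippet writes an array of doubles with a single write call after a small header, and reads it back from a file opened read-only. The file layout (one 64-bit element count followed by raw doubles in native byte order) and the function names are assumptions chosen for illustration, not a format prescribed by this page.

```python
import struct
from array import array

# Hypothetical header layout: a single 64-bit element count, native byte order.
HEADER = struct.Struct("=q")


def write_block(path, values):
    """Write the header and then the whole array, one write call each."""
    data = array("d", values)
    with open(path, "wb") as f:          # opened for writing only
        f.write(HEADER.pack(len(data)))  # metadata (header) first
        f.write(data.tobytes())          # entire array in a single call


def read_block(path):
    """Read the array back from a file opened in read-only mode."""
    with open(path, "rb") as f:          # opened for reading only
        (count,) = HEADER.unpack(f.read(HEADER.size))
        data = array("d")
        data.frombytes(f.read(count * data.itemsize))  # one read for all data
    return list(data)
```

The same pattern applies in Fortran or C: pass the whole array to one write statement or one `fwrite` call instead of looping over its elements.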
Avoid repetitive and excessive "open/close" or "stat" operations
- Metadata operations may need serialized locking mechanisms.
- Some users use "stat" or related functions and/or commands to test the size or existence of files. When such testing becomes excessive, it causes a heavy load on the metadata servers.
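One possible way to reduce such metadata traffic is sketched below in Python, assuming the file is going to be read anyway: instead of testing existence and size by path before opening, open the file directly, handle the failure, and query the size from the already open handle. The helper name `read_if_present` is hypothetical.

```python
import os


def read_if_present(path):
    """Return the file contents as bytes, or None if the file does not exist.

    No separate existence test or size query by path is issued before the
    open: the open itself reports a missing file, and the size is taken
    from the already open file descriptor.
    """
    try:
        with open(path, "rb") as f:              # read-only open
            size = os.fstat(f.fileno()).st_size  # size from the open handle
            return f.read(size)                  # one read for the whole file
    except FileNotFoundError:
        return None
```

Whether this measurably lowers the load depends on the file system and the access pattern; the point is only to avoid repeated path-based tests in tight loops.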
Avoid having multiple processes open the same file(s) (for writing)
- Metadata operations may need serialized locking mechanisms.
- When just reading, make this explicit in the open calls (Fortran: ACTION='READ', C: O_RDONLY). This will reduce contention.
Be careful with links
- Having links which point from $HOME to WORK or SCRATCH (or vice versa) may cause problems if they are heavily used from many nodes.
- Use the file systems in the intended way, without indirections via symbolic links.
Do not have all (or too many) files in the same directory
(this also applies to having 1000s of directories within the same directory)
The GPFS architecture is generally good at processing parallel I/O from many nodes. However, it is very slow when different nodes try to write to exactly the same area of the same file. The general rule is to avoid having hundreds or thousands of tasks trying to modify the same file or directory at the same time with certain operations. This happens, for instance, when at job start all participating nodes each try to create a file in one and the same directory; a directory is nothing but a file as well. The rate at which files are created that way was observed to be about 1 per second! It is strongly recommended not to do this for any larger job.
As a better alternative, the files for the individual tasks can all be created by a single task. This is faster by several orders of magnitude at job start (see below). If the nodes do need to create their files themselves, create subdirectories first, either one for each task or one for a (small) subset of tasks, and then let the tasks create their files within these subdirectories. The subdirectory creation should again be done by just one task. A code using MPI should do something like this pseudo-code:
!# serial creation
if (task==0) then
  do i = 0, ntasks-1
    create file(i) // with optional truncate option
  end do
end if
barrier
!# all files created now
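The subdirectory variant described above can be sketched in Python as follows. This is only an illustration without MPI calls: the function names, the `group_` directory naming and the `tasks_per_dir` parameter are assumptions, and in a real program only one task would call `create_task_dirs`, followed by a barrier before the other tasks create their files.

```python
import os


def subdir_for(base, rank, tasks_per_dir):
    """Directory in which task `rank` should create its files."""
    return os.path.join(base, f"group_{rank // tasks_per_dir:05d}")


def create_task_dirs(base, ntasks, tasks_per_dir):
    """Create all subdirectories serially; call this from one task only."""
    ngroups = (ntasks + tasks_per_dir - 1) // tasks_per_dir  # ceiling division
    for g in range(ngroups):
        os.makedirs(os.path.join(base, f"group_{g:05d}"), exist_ok=True)


# Intended use inside an MPI program (not executed here):
#   if rank == 0: create_task_dirs(base, ntasks, tasks_per_dir)
#   <barrier>
#   path = os.path.join(subdir_for(base, rank, tasks_per_dir), f"task_{rank}.dat")
```

Grouping a small number of tasks per subdirectory keeps the number of directory entries, and therefore the contention on any single directory, low.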
The tasks can then proceed to modify their own portions of a common file, with best results if their regions do not overlap at a granularity smaller than the GPFS block size (8 MB). For fine-grained updates that are smaller than the block size, MPI-IO is advised, since it uses MPI to ship the small updates to the nodes that manage the different regions of the file.
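To illustrate the non-overlap rule, here is a minimal Python sketch that assigns each task a region of a shared file padded to whole blocks, so that no two tasks touch the same file system block. The 8 MB value is the block size quoted above; the function name and the simple "equal regions in rank order" layout are assumptions for illustration.

```python
BLOCKSIZE = 8 * 1024 * 1024  # 8 MB, the GPFS block size quoted in the text


def aligned_region(rank, bytes_per_task, blocksize=BLOCKSIZE):
    """Return (offset, length) of the file region reserved for `rank`.

    Each region is padded up to a whole number of blocks, so every offset
    is a multiple of the block size and regions never share a block.
    """
    blocks = -(-bytes_per_task // blocksize)  # ceiling division
    length = blocks * blocksize
    return rank * length, length
```

The padding wastes some space when the per-task payload is much smaller than a block; in that case the collective MPI-IO route mentioned above is usually the better fit.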
Avoid using "ls -l"
- Use "ls" if you just want to list files. If you use "-l" all the metadata have to be read.
Use "vi -n" or "view"
- This avoids the creation of a swap file when opening the file (which is a metadata operation) and speeds up the inspection of files.
Command line options and environment settings for Intel Fortran Compiler
For most applications built with the Intel Fortran Compiler "ifort", make sure that you use buffered I/O. There are at least three methods to do this:
- Set the environment variable "FORT_BUFFERED" to true: "FORT_BUFFERED=true"
- Use "-assume buffered_io" as a command line option for ifort. You can do this, for example, via the variable "FFLAGS": " FFLAGS='-assume buffered_io' ".
- You can use a custom configuration file ifort.cfg which is called at each compilation, here named myifort.cfg: "IFORTCFG=/PATH/TO/myifort.cfg" .
Then just add "-assume buffered_io" as contents to the file. Of course, you can add any further command line options of ifort to that file for convenience (e.g. "-extend-source" for fixed form Fortran).
You might also want to adjust the environment variable "FORT_BLOCKSIZE" for I/O to the block size of the file system used. For the different systems at LRZ, the corresponding values and commands are listed in the following table:
|SuperMUC-NG||$SCRATCH & $WORK||FORT_BLOCKSIZE=16777216|
There is also a third environment variable "FORT_BUFFERCOUNT" which controls the number of buffers used for multibuffered I/O, but in general the default "1" is appropriate. For further information please consult the Intel documentation Supported Environment Variables.
- The environment variables FORT_BUFFERED and FORT_BLOCKSIZE are now set to the values given above by default when a login shell is created
- There exists at least one scenario where the default buffer size is counterproductive: If you use the -no-wrap-margin compilation option, performance for list-directed I/O collapses to a small fraction of its potential. There are two possible ways of dealing with this:
- unset the FORT_BLOCKSIZE variable and hope that other I/O performance losses are negligible, or
- in your Fortran source, open the unit that will do list-directed I/O with the additional (ifort-specific) specifier "BLOCKSIZE=8192"
Using MPIIO Hints
Existing hints and their usefulness for an application developer/user:
|Hint||Usefulness||Description|
|romio_cb_read||High||Enable or disable collective buffering.|
Defines whether or not to utilize collective IO for reading. If romio_cb_read is disabled, all tasks perform their own independent POSIX IO. By default, romio_cb_read is enabled.
|romio_cb_write||High||Enable or disable collective buffering.|
Defines whether or not to utilize collective IO for writing. If romio_cb_write is disabled, all tasks perform their own independent POSIX IO. By default, romio_cb_write is enabled.
|romio_cb_fr_types||Low||Tuning of collective buffering|
|romio_cb_fr_alignment||Low||Tuning of collective buffering|
|romio_cb_alltoall||Low||Tuning of collective buffering|
|romio_cb_pfr||Low||Tuning of collective buffering|
|romio_cb_ds_threshold||Low||Tuning of collective buffering|
Tuning of collective buffering.
The cb_buffer_size hint controls the size (in bytes) of the intermediate buffer used in two-phase collective IO. If the amount of data that an aggregator transfers is larger than this value, multiple operations are used. The default value is 16 MB.
|cb_nodes||Medium||Tuning of collective buffering|
|cb_config_list||Medium||Tuning of collective buffering. Provides explicit control over aggregators.|
|romio_no_indep_rw||Low||Deferred open + only collective I/O|
|ind_rd_buffer_size||Low||Buffer size for data sieving|
|ind_wr_buffer_size||Low||Buffer size for data sieving|
|romio_ds_read||High||Enable or disable data sieving|
|romio_ds_write||High||Enable or disable data sieving|
Most of the time, it is better to disable the data sieving optimisation because a similar one is already performed by the filesystem.
Example settings for which users reported good results on SuperMUC:
call MPI_Info_set(info,"romio_cb_write","enable", error)
call MPI_Info_set(info,"cb_buffer_size","4194304", error)
call MPI_Info_set(info,"striping_unit","4194304", error)