MPI-IO

MPI-IO provides an interface for parallel I/O. It supports the partitioning of file data among processes and offers a collective interface for complete transfers of global data structures between process memories and files.

In addition, MPI-IO provides further efficiency gains through support for asynchronous I/O, strided accesses, and control over the physical file layout on storage devices (disks).

Instead of defining I/O access modes to express the common patterns for accessing a shared file, the approach taken in the MPI-IO standard is to express data partitioning using derived datatypes.

Table 2 shows the main concepts defined for MPI-IO in the MPI-3 standard.

Table 2: Main MPI-IO concepts

  • file: an ordered collection of typed data items.
  • etype: the unit of data access and positioning; it can be any MPI predefined or derived datatype.
  • filetype: the basis for partitioning a file among processes; it defines a template for accessing the file. A filetype is either a single etype or a derived MPI datatype constructed from multiple instances of the same etype.
  • view: the current set of data visible and accessible from an open file, as an ordered set of etypes. Each process has its own view of the file, defined by three quantities: a displacement, an etype, and a filetype. The pattern described by the filetype is repeated, beginning at the displacement, to define the view.
  • offset: a position in the file relative to the current view, expressed as a count of etypes. Holes in the view's filetype are skipped when calculating this position.
  • displacement: an absolute byte position relative to the beginning of the file. The displacement defines the location where a view begins.
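To make these concepts concrete, here is a minimal sketch in which each process reads its own round-robin blocks of a shared file through a view. The file name data.bin, the block size of 4 integers, and the assumption of 4-byte integers are all illustrative choices, not prescribed by the standard:

program view_example
  use mpi
  implicit none
  integer :: ierr, rank, nproc, fh, contig, filetype
  integer :: buf(4), status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: disp
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  ! etype = MPI_INTEGER; the filetype makes 4 integers visible and then
  ! skips the other processes' blocks (extent = 4*nproc integers)
  call MPI_Type_contiguous(4, MPI_INTEGER, contig, ierr)
  lb = 0
  extent = 4 * nproc * 4            ! in bytes; assumes 4-byte integers
  call MPI_Type_create_resized(contig, lb, extent, filetype, ierr)
  call MPI_Type_commit(filetype, ierr)

  ! the displacement shifts each process to its own first block
  disp = rank * 4 * 4               ! in bytes

  call MPI_File_open(MPI_COMM_WORLD, 'data.bin', MPI_MODE_RDONLY, &
                     MPI_INFO_NULL, fh, ierr)
  call MPI_File_set_view(fh, disp, MPI_INTEGER, filetype, 'native', &
                         MPI_INFO_NULL, ierr)
  ! collective read: each process receives the 4 integers of its first block
  call MPI_File_read_all(fh, buf, 4, MPI_INTEGER, status, ierr)
  call MPI_File_close(fh, ierr)
  call MPI_Type_free(contig, ierr)
  call MPI_Type_free(filetype, ierr)
  call MPI_Finalize(ierr)
end program view_example

With this view, process 0 sees integers 0-3, process 1 sees integers 4-7, and so on, with the pattern repeating every 4*nproc integers; no explicit offset arithmetic is needed in the read call itself.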

One of the most common implementations of MPI-IO is ROMIO, which is used in major MPI distributions such as MPICH, MVAPICH, IBM MPI, and Intel MPI.

ROMIO provides two optimization techniques: data sieving for noncontiguous requests from one process and collective I/O (two-phase I/O) for noncontiguous requests from multiple processes.

Collective Buffering Hints:


  • romio_cb_read (usefulness: High): controls when collective buffering is applied to collective read operations. Valid values are enable, disable, and automatic. If romio_cb_read is disabled, all tasks perform their own independent I/O. The default is automatic.
  • romio_cb_write (usefulness: High): controls when collective buffering is applied to collective write operations. Valid values are enable, disable, and automatic. If romio_cb_write is disabled, all tasks perform their own independent I/O. The default is automatic.
  • romio_cb_fr_types (usefulness: Low): tuning of collective buffering.
  • romio_cb_fr_alignment (usefulness: Low): tuning of collective buffering.
  • romio_cb_alltoall (usefulness: Low): tuning of collective buffering.
  • romio_cb_pfr (usefulness: Low): tuning of collective buffering.
  • romio_cb_ds_threshold (usefulness: Low): tuning of collective buffering.
  • cb_buffer_size (usefulness: Medium): controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O. If the amount of data that an aggregator transfers is larger than this value, multiple operations are used. The default value is 16 MB.
  • cb_nodes (usefulness: Medium): controls the maximum number of aggregators to be used.
  • cb_config_list (usefulness: Medium): provides explicit control over aggregators; for example, the value *:1 selects one aggregator process per hostname (i.e., one process per node).
  • romio_no_indep_rw (usefulness: Low): controls when "deferred open" is used.
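The aggregation hints can be set through an info object before opening the file. A minimal sketch follows, in which the values (a 32 MB buffer and at most 8 aggregators) are purely illustrative and should be chosen to match the file system:

integer info, ierr
call MPI_Info_create(info, ierr)
! hint values are always passed as strings
call MPI_Info_set(info, 'cb_buffer_size', '33554432', ierr)   ! 32 MB
call MPI_Info_set(info, 'cb_nodes', '8', ierr)                ! at most 8 aggregators
! pass info to MPI_File_open, as in the example in the next section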

Data Sieving Hints:


  • ind_rd_buffer_size (usefulness: Low): controls the size (in bytes) of the intermediate buffer used when performing data sieving during read operations.
  • ind_wr_buffer_size (usefulness: Low): controls the size (in bytes) of the intermediate buffer used when performing data sieving during write operations.
  • romio_ds_read (usefulness: High): determines when ROMIO performs data sieving for reads. Valid values are enable, disable, and automatic. The default is automatic.
  • romio_ds_write (usefulness: High): determines when ROMIO performs data sieving for writes. Valid values are enable, disable, and automatic. The default is automatic.

Setting hints at the MPI-IO level

Using an info object in the program

integer info, ierr
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'romio_cb_read', 'disable', ierr)
call MPI_Info_set(info, 'romio_cb_write', 'disable', ierr)
...
call MPI_File_open(comm, filename, amode, info, fh, ierr)

Alternatively, the user can define a list of hints in a single file, which will be applied at execution time to the parallel application.

> cat $HOME/romio-hints
romio_cb_read disable
romio_cb_write disable

Then point ROMIO to the hints file via the ROMIO_HINTS environment variable:

export ROMIO_HINTS=$HOME/romio-hints
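Hint keys and values are plain strings, and an implementation silently ignores hints it does not understand, so it is worth checking which hints actually took effect. A minimal sketch, assuming an already opened file handle fh:

integer :: info_used, nkeys, i, ierr
logical :: flag
character(len=MPI_MAX_INFO_KEY) :: key
character(len=MPI_MAX_INFO_VAL) :: value

! retrieve the hints the implementation is actually using for fh
call MPI_File_get_info(fh, info_used, ierr)
call MPI_Info_get_nkeys(info_used, nkeys, ierr)
do i = 0, nkeys - 1
  call MPI_Info_get_nthkey(info_used, i, key, ierr)
  call MPI_Info_get(info_used, key, MPI_MAX_INFO_VAL, value, flag, ierr)
  if (flag) print *, trim(key), ' = ', trim(value)
end do
call MPI_Info_free(info_used, ierr)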

General Hints for I/O

  • Open files in the correct mode. If a file is only intended to be read, open it in read-only mode: choosing the right mode allows the system to apply optimizations and to allocate only the necessary resources.
  • Write/read arrays/data structures in one call rather than element by element. Not complying with this rule will have a significant negative impact on I/O performance.
  • Do not open and close files too frequently, because this involves many system operations. The best approach is to open the file the first time it is needed and to close it only when it will not be needed for a sufficiently long period.
  • Limit the number of simultaneously open files, because the system must assign and manage resources for each of them.
  • Separate procedures involving I/O from the rest of the source code for better readability and maintainability.
  • Separate metadata from data. Metadata is anything that describes the data, usually the parameters of the calculation, the sizes of arrays, and so on. It is often easier to structure a file as a first part (header) containing the metadata, followed by the data.
  • Create files independent of the number of processes. This will make life much easier for post-processing and also for restarts with a different number of processes.
  • Align accesses to file system block boundaries and have only one process per data server (not easy).
  • Use non-blocking MPI-I/O calls where available (not implemented on all systems); see the sketch after this list.
  • Use higher level libraries based on MPI-I/O (HDF5, ADIOS, SIONlib...).
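As an illustration of the non-blocking calls mentioned in the list, the following sketch starts a write and overlaps it with computation. The file handle fh is assumed to be open for writing, and the buffer size and offset are illustrative:

integer :: req, ierr
integer :: status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) :: offset
double precision :: buf(1000)

offset = 0
! start the write at the given offset and return immediately
call MPI_File_iwrite_at(fh, offset, buf, 1000, MPI_DOUBLE_PRECISION, req, ierr)
! ... computation that does not touch buf ...
! wait for completion before reusing buf
call MPI_Wait(req, status, ierr)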

For details see: PRACE Advanced Training - Best practices for parallel IO and MPI-IO hints