MPI-IO

MPI-IO provides an interface for parallel I/O. It supports the partitioning of file data among processes and offers a collective interface for complete transfers of global data structures between process memories and files.

In addition, MPI-IO provides further efficiency gains through support for asynchronous I/O, strided accesses, and control over the physical file layout on storage devices (disks).

Instead of defining I/O access modes to express the common patterns for accessing a shared file, the approach taken in the MPI-IO standard is to express data partitioning using derived datatypes.

Table 2 shows the main concepts defined for MPI-IO in the MPI-3 standard.

Table 2: Main MPI-IO concepts

  • file: an ordered collection of typed data items.
  • etype: the unit of data access and positioning; it can be any MPI predefined or derived datatype.
  • filetype: the basis for partitioning a file among processes; it defines a template for accessing the file. A filetype is either a single etype or a derived MPI datatype constructed from multiple instances of the same etype.
  • view: the current set of data visible and accessible from an open file, as an ordered set of etypes. Each process has its own view of the file, defined by three quantities: a displacement, an etype, and a filetype. The pattern described by the filetype is repeated, beginning at the displacement, to define the view.
  • offset: a position in the file relative to the current view, expressed as a count of etypes. Holes in the view's filetype are skipped when calculating this position.
  • displacement: an absolute byte position relative to the beginning of the file. The displacement defines the location where a view begins.
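To make these concepts concrete, here is a minimal sketch in which each process reads its own round-robin blocks of a shared file through a view. The file name data.bin, the block size of 4 integers, and the assumption of 4-byte integers are all illustrative choices, not prescribed by the standard:

program view_example
  use mpi
  implicit none
  integer :: ierr, rank, nproc, fh, contig, filetype
  integer :: buf(4), status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: disp
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  ! etype = MPI_INTEGER; the filetype makes 4 integers visible and then
  ! skips the other processes' blocks (extent = 4*nproc integers)
  call MPI_Type_contiguous(4, MPI_INTEGER, contig, ierr)
  lb = 0
  extent = 4 * nproc * 4            ! in bytes; assumes 4-byte integers
  call MPI_Type_create_resized(contig, lb, extent, filetype, ierr)
  call MPI_Type_commit(filetype, ierr)

  ! the displacement shifts each process to its own first block
  disp = rank * 4 * 4               ! in bytes

  call MPI_File_open(MPI_COMM_WORLD, 'data.bin', MPI_MODE_RDONLY, &
                     MPI_INFO_NULL, fh, ierr)
  call MPI_File_set_view(fh, disp, MPI_INTEGER, filetype, 'native', &
                         MPI_INFO_NULL, ierr)
  ! collective read: each process receives the 4 integers of its first block
  call MPI_File_read_all(fh, buf, 4, MPI_INTEGER, status, ierr)
  call MPI_File_close(fh, ierr)
  call MPI_Type_free(contig, ierr)
  call MPI_Type_free(filetype, ierr)
  call MPI_Finalize(ierr)
end program view_example

With this view, process 0 sees integers 0-3, process 1 sees integers 4-7, and so on, with the pattern repeating every 4*nproc integers; no explicit offset arithmetic is needed in the read call itself.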

One of the most common implementations of MPI-IO is ROMIO, which is used in major MPI distributions such as MPICH, MVAPICH, IBM MPI, and Intel MPI.

ROMIO provides two optimization techniques: data sieving for noncontiguous requests from one process and collective I/O (two-phase I/O) for noncontiguous requests from multiple processes.

Collective Buffering Hints:


  • romio_cb_read (usefulness: High): controls when collective buffering is applied to collective read operations. Valid values are enable, disable, and automatic. If romio_cb_read is disabled, all tasks perform their own independent I/O. The default is automatic.
  • romio_cb_write (usefulness: High): controls when collective buffering is applied to collective write operations. Valid values are enable, disable, and automatic. If romio_cb_write is disabled, all tasks perform their own independent I/O. The default is automatic.
  • romio_cb_fr_types (usefulness: Low): tuning of collective buffering.
  • romio_cb_fr_alignment (usefulness: Low): tuning of collective buffering.
  • romio_cb_alltoall (usefulness: Low): tuning of collective buffering.
  • romio_cb_pfr (usefulness: Low): tuning of collective buffering.
  • romio_cb_ds_threshold (usefulness: Low): tuning of collective buffering.
  • cb_buffer_size (usefulness: Medium): controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O. If the amount of data that an aggregator transfers is larger than this value, multiple operations are used. The default value is 16 MB.
  • cb_nodes (usefulness: Medium): controls the maximum number of aggregators to be used.
  • cb_config_list (usefulness: Medium): provides explicit control over aggregators; for example, the value *:1 selects one aggregator process per hostname (i.e., one process per node).
  • romio_no_indep_rw (usefulness: Low): controls when "deferred open" is used.
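The aggregation hints can be set through an info object before opening the file. A minimal sketch follows, in which the values (a 32 MB buffer and at most 8 aggregators) are purely illustrative and should be chosen to match the file system:

integer info, ierr
call MPI_Info_create(info, ierr)
! hint values are always passed as strings
call MPI_Info_set(info, 'cb_buffer_size', '33554432', ierr)   ! 32 MB
call MPI_Info_set(info, 'cb_nodes', '8', ierr)                ! at most 8 aggregators
! pass info to MPI_File_open, as in the example in the next section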

Data Sieving Hints:


  • ind_rd_buffer_size (usefulness: Low): controls the size (in bytes) of the intermediate buffer used when performing data sieving during read operations.
  • ind_wr_buffer_size (usefulness: Low): controls the size (in bytes) of the intermediate buffer used when performing data sieving during write operations.
  • romio_ds_read (usefulness: High): determines when ROMIO performs data sieving for reads. Valid values are enable, disable, and automatic. The default is automatic.
  • romio_ds_write (usefulness: High): determines when ROMIO performs data sieving for writes. Valid values are enable, disable, and automatic. The default is automatic.

Setting hints at the MPI-IO level

Using an info object in the program

integer info, ierr
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'romio_cb_read', 'disable', ierr)
call MPI_Info_set(info, 'romio_cb_write', 'disable', ierr)
...
call MPI_File_open(comm, filename, amode, info, fh, ierr)

Alternatively, the user can define a list of hints in a single file, which will be applied at execution time to the parallel application.

> cat $HOME/romio-hints
romio_cb_read disable
romio_cb_write disable

Then point ROMIO to the hints file via the ROMIO_HINTS environment variable:

export ROMIO_HINTS=$HOME/romio-hints
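Hint keys and values are plain strings, and an implementation silently ignores hints it does not understand, so it is worth checking which hints actually took effect. A minimal sketch, assuming an already opened file handle fh:

integer :: info_used, nkeys, i, ierr
logical :: flag
character(len=MPI_MAX_INFO_KEY) :: key
character(len=MPI_MAX_INFO_VAL) :: value

! retrieve the hints the implementation is actually using for fh
call MPI_File_get_info(fh, info_used, ierr)
call MPI_Info_get_nkeys(info_used, nkeys, ierr)
do i = 0, nkeys - 1
  call MPI_Info_get_nthkey(info_used, i, key, ierr)
  call MPI_Info_get(info_used, key, MPI_MAX_INFO_VAL, value, flag, ierr)
  if (flag) print *, trim(key), ' = ', trim(value)
end do
call MPI_Info_free(info_used, ierr)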

General Hints for I/O

  • Open files in the correct mode. If a file is only intended to be read, open it in read-only mode: choosing the right mode allows the system to apply optimizations and to allocate only the necessary resources.
  • Write/read arrays/data structures in one call rather than element by element. Not complying with this rule will have a significant negative impact on I/O performance.
  • Do not open and close files too frequently, because this involves many system operations. The best approach is to open the file the first time it is needed and to close it only when it will not be needed for a sufficiently long period.
  • Limit the number of simultaneously open files, because the system must assign and manage resources for each of them.
  • Separate procedures involving I/O from the rest of the source code for better readability and maintainability.
  • Separate metadata from data. Metadata is anything that describes the data, usually the parameters of the calculation, the sizes of arrays, and so on. It is often easier to structure a file as a first part (header) containing the metadata, followed by the data.
  • Create files independent of the number of processes. This will make life much easier for post-processing and also for restarts with a different number of processes.
  • Align accesses to file system block boundaries and have only one process per data server (not easy).
  • Use non-blocking MPI-I/O calls where available (not implemented on all systems); see the sketch after this list.
  • Use higher level libraries based on MPI-I/O (HDF5, ADIOS, SIONlib...).
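As an illustration of the non-blocking calls mentioned in the list, the following sketch starts a write and overlaps it with computation. The file handle fh is assumed to be open for writing, and the buffer size and offset are illustrative:

integer :: req, ierr
integer :: status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) :: offset
double precision :: buf(1000)

offset = 0
! start the write at the given offset and return immediately
call MPI_File_iwrite_at(fh, offset, buf, 1000, MPI_DOUBLE_PRECISION, req, ierr)
! ... computation that does not touch buf ...
! wait for completion before reusing buf
call MPI_Wait(req, status, ierr)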

For details see: PRACE Advanced Training - Best practices for parallel IO and MPI-IO hints