MPI-IO
MPI-IO provides an interface for parallel I/O. It supports the partitioning of file data among processes and offers a collective interface for complete transfers of global data structures between process memories and files.
In addition, MPI-IO provides facilities for further efficiency gains through support for asynchronous I/O, strided accesses, and control over the physical file layout on storage devices (disks).
Instead of defining I/O access modes to express the common patterns for accessing a shared file, the MPI-IO standard expresses data partitioning using derived datatypes.
Table 2 shows the main concepts defined for MPI-IO in the MPI-3 standard.
Concept | Definition |
---|---|
file | It is an ordered collection of typed data items |
etype | It is the unit of data access and positioning. It can be any MPI predefined or derived datatype. |
filetype | It is the basis for partitioning a file among processes and defines a template for accessing the file. A filetype is either a single etype or a derived MPI datatype constructed from multiple instances of the same etype |
view | It defines the current set of data visible and accessible from an open file as an ordered set of etypes. Each process has its own view of the file, defined by three quantities: a displacement, an etype, and a filetype. The pattern described by a filetype is repeated, beginning at the displacement, to define the view. |
offset | It is a position in the file relative to the current view, expressed as a count of etypes. Holes in the view’s filetype are skipped when calculating this position. |
displacement | It is an absolute byte position relative to the beginning of a file. The displacement defines the location where a view begins. |
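As a hedged sketch of how these concepts fit together (the file name, block size, and all variable names are illustrative assumptions, not prescribed by MPI-IO), the following Fortran fragment uses MPI_REAL as the etype, a subarray filetype that gives each process one fixed-size block per frame of nprocs blocks, and a zero displacement to define each process's view of a shared file:

! Sketch only: each process sees every nprocs-th block of "blocklen" reals
! in the shared file 'data.out'; all names and sizes are illustrative.
program view_example
    implicit none
    include 'mpif.h'
    integer, parameter :: blocklen = 100
    integer :: rank, nprocs, fh, filetype, ierr
    integer :: sizes(1), subsizes(1), starts(1)
    integer(kind=MPI_OFFSET_KIND) :: disp
    real :: buf(blocklen)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    buf = real(rank)                       ! dummy data

    ! filetype: within a frame of nprocs*blocklen etypes, this process
    ! owns the block starting at rank*blocklen (the etype is MPI_REAL)
    sizes(1)    = nprocs * blocklen
    subsizes(1) = blocklen
    starts(1)   = rank * blocklen
    call MPI_Type_create_subarray(1, sizes, subsizes, starts, &
                                  MPI_ORDER_FORTRAN, MPI_REAL, filetype, ierr)
    call MPI_Type_commit(filetype, ierr)

    call MPI_File_open(MPI_COMM_WORLD, 'data.out', &
                       MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
    ! view: displacement 0, etype MPI_REAL, the filetype defined above
    disp = 0
    call MPI_File_set_view(fh, disp, MPI_REAL, filetype, 'native', &
                           MPI_INFO_NULL, ierr)
    ! offsets are now counted in etypes relative to this view; holes that
    ! belong to other processes are skipped automatically
    call MPI_File_write(fh, buf, blocklen, MPI_REAL, MPI_STATUS_IGNORE, ierr)

    call MPI_File_close(fh, ierr)
    call MPI_Type_free(filetype, ierr)
    call MPI_Finalize(ierr)
end program view_example

With this view in place, each process addresses only its own portion of the shared file, while the file itself remains a single ordered collection of etypes.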
One of the most common implementations of MPI-IO is ROMIO, which is used in major MPI distributions such as MPICH, MVAPICH, IBM MPI, and Intel MPI.
ROMIO provides two optimization techniques: data sieving for noncontiguous requests from one process and collective I/O (two-phase I/O) for noncontiguous requests from multiple processes.
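As a rough sketch of the difference (reusing the fh, filetype, buf, and blocklen names from the view example above, which are themselves assumptions), the application only chooses between the independent and the collective write routine; ROMIO then decides whether data sieving or two-phase I/O is applied:

! Alternative 1: independent write of a noncontiguous view; ROMIO may apply
! data sieving (read-modify-write of larger contiguous chunks) within each process.
call MPI_File_write(fh, buf, blocklen, MPI_REAL, MPI_STATUS_IGNORE, ierr)

! Alternative 2: collective write of the same view; ROMIO may apply two-phase
! I/O, shipping the data to a few aggregator processes that then write large
! contiguous chunks.
call MPI_File_write_all(fh, buf, blocklen, MPI_REAL, MPI_STATUS_IGNORE, ierr)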
Collective Buffering Hints:
Hint | Usefulness | Explanation |
---|---|---|
romio_cb_read | High | Controls when collective buffering is applied to collective read operations. Valid values are enable, disable, and automatic. If romio_cb_read is disabled, all tasks perform their own independent I/O. By default, romio_cb_read is automatic. |
romio_cb_write | High | Controls when collective buffering is applied to collective write operations. Valid values are enable, disable, and automatic. If romio_cb_write is disabled, all tasks perform their own independent I/O. By default, romio_cb_write is automatic. |
romio_cb_fr_types | Low | Tuning of collective buffering |
romio_cb_fr_alignment | Low | Tuning of collective buffering |
romio_cb_alltoall | Low | Tuning of collective buffering |
romio_cb_pfr | Low | Tuning of collective buffering |
romio_cb_ds_threshold | Low | Tuning of collective buffering |
cb_buffer_size | Medium | Controls the size (in bytes) of the intermediate buffer used in two-phase collective IO. If the amount of data that an aggregator transfers is larger than this value, multiple operations are used. The default value is 16 MB. |
cb_nodes | Medium | Controls the maximum number of aggregators to be used. |
cb_config_list | Medium | Provides explicit control over aggregators; for example, *:1 selects one aggregator process per hostname (i.e., one process per node). |
romio_no_indep_rw | Low | Controls when "deferred open" is used. |
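To verify which collective-buffering (and other) hint values ROMIO has actually applied, the info object attached to an open file can be queried. The following hedged Fortran fragment (fh and rank are assumed to be an already opened file handle and the process rank) prints all effective key/value pairs on rank 0:

integer :: info_used, nkeys, k, ierr
logical :: flag
character(len=MPI_MAX_INFO_KEY) :: key
character(len=MPI_MAX_INFO_VAL) :: value

! retrieve the hints actually in effect for the open file handle fh
call MPI_File_get_info(fh, info_used, ierr)
call MPI_Info_get_nkeys(info_used, nkeys, ierr)
do k = 0, nkeys - 1
    call MPI_Info_get_nthkey(info_used, k, key, ierr)
    call MPI_Info_get(info_used, key, MPI_MAX_INFO_VAL, value, flag, ierr)
    if (rank == 0) print *, trim(key), ' = ', trim(value)
end do
call MPI_Info_free(info_used, ierr)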
Data Sieving Hints:
Hint | Usefulness | Explanation |
---|---|---|
ind_rd_buffer_size | Low | Controls the size (in bytes) of the intermediate buffer used when performing data sieving during read operations. |
ind_wr_buffer_size | Low | Controls the size (in bytes) of the intermediate buffer when performing data sieving during write operations. |
romio_ds_read | High | Determines when ROMIO will choose to perform data sieving for read. Valid values are enable, disable, or automatic. By default, romio_ds_read is automatic. |
romio_ds_write | High | Determines when ROMIO will choose to perform data sieving for write. Valid values are enable, disable, or automatic. By default, romio_ds_write is automatic. |
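As a small, hedged example (the hint values are purely illustrative, not recommendations), data-sieving behaviour can be tuned through the same info-object mechanism described in the next section:

call MPI_Info_create(info, ierr)
! force data sieving for writes and use a 4 MiB intermediate buffer
call MPI_Info_set(info, 'romio_ds_write', 'enable', ierr)
call MPI_Info_set(info, 'ind_wr_buffer_size', '4194304', ierr)
call MPI_File_open(comm, filename, amode, info, fh, ierr)
call MPI_Info_free(info, ierr)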
Setting hints at MPI-IO level
Using an info object in the program
integer info, ierr
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'romio_cb_read', 'disable', ierr)
call MPI_Info_set(info, 'romio_cb_write', 'disable', ierr)
...
call MPI_File_open(comm, filename, amode, info, fh, ierr)
! the info object may be freed once the file has been opened
call MPI_Info_free(info, ierr)
Alternatively, users can define a list of hints in a single file, which will be applied at execution time to their parallel application.
> cat $HOME/romio-hints
romio_cb_read disable
romio_cb_write disable
Setting the ROMIO_HINTS environment variable:
export ROMIO_HINTS=$HOME/romio-hints
General Hints for I/O
- Open files in the correct mode. If a file is only intended to be read, open it in read-only mode: choosing the right mode allows the system to apply optimisations and to allocate only the necessary resources.
- Write/read arrays/data structures in one call rather than element by element. Not complying with this rule will have a significant negative impact on I/O performance.
- Do not open and close files too frequently, because this involves many system operations. The best approach is to open the file the first time it is needed and to close it only if it will not be used for a long enough period of time.
- Limit the number of simultaneous open files because for each open file, the system must assign and manage some resources.
- Separate procedures involving I/O from the rest of the source code for better readability and maintainability.
- Separate metadata from data. Metadata is anything that describes the data, usually the parameters of the calculation, the sizes of arrays... It is often easier to structure a file as a first part (header) containing the metadata, followed by the data.
- Create files independent of the number of processes. This will make life much easier for post-processing and also for restarts with a different number of processes (see the sketch after this list).
- Align accesses with file system block boundaries and have only one process per data server (not easy).
- Use non-blocking MPI-IO calls (not implemented/available on all systems).
- Use higher-level libraries built on top of MPI-IO (HDF5, ADIOS, SIONlib...).
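As a hedged illustration of several of these recommendations (one collective call per array, a single shared file whose layout does not depend on the number of processes), the following Fortran sketch writes a block-distributed array at explicit byte offsets; the file name, array size, and variable names are assumptions:

! Sketch: nprocs processes write one global array of n_global reals to a
! single shared file; the file layout is independent of nprocs.
program shared_file_write
    implicit none
    include 'mpif.h'
    integer, parameter :: n_global = 1000000
    integer :: rank, nprocs, fh, n_local, realsize, ierr
    integer(kind=MPI_OFFSET_KIND) :: offset
    real, allocatable :: buf(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    n_local = n_global / nprocs           ! assume nprocs divides n_global
    allocate(buf(n_local))
    buf = real(rank)                      ! dummy data

    call MPI_Type_size(MPI_REAL, realsize, ierr)
    ! byte offset of this process's block within the global array
    offset = int(rank, MPI_OFFSET_KIND) * n_local * realsize

    call MPI_File_open(MPI_COMM_WORLD, 'array.dat', &
                       MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
    ! one collective call writes the whole local block at its global position
    call MPI_File_write_at_all(fh, offset, buf, n_local, MPI_REAL, &
                               MPI_STATUS_IGNORE, ierr)
    call MPI_File_close(fh, ierr)

    deallocate(buf)
    call MPI_Finalize(ierr)
end program shared_file_write

Because the byte offsets are computed from global indices, the resulting file is identical regardless of how many processes wrote it, and the single collective call lets the MPI-IO layer aggregate the requests.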
For details see: PRACE Advanced Training - Best practices for parallel IO and MPI-IO hints