IPM : Integrated Performance Monitoring

Overview

IPM is a portable profiling infrastructure for parallel codes. It provides a low-overhead performance summary of the computation and communication in a parallel program. The amount of detail reported is selectable at runtime via environment variables or through an MPI_Pcontrol interface. IPM has extremely low overhead, is scalable, and is easy to use, requiring no source code modification.

IPM runs on IBM SPs, Linux clusters using MPICH, Altix, SX6, and the Earth Simulator. IPM is available under an Open Source software license (LGPL).

IPM brings together several types of information important to developers and users of parallel HPC codes. The information is gathered in a way that tries to minimize the impact on the running code, maintaining a small fixed memory footprint and using minimal amounts of CPU. When the profile is generated, the data from individual tasks is aggregated in a scalable way.

For downloads, news, and other information, visit our Project Page

The monitors that IPM currently integrates are:

Examples are sometimes better than explanation

The 'integrated' in IPM is multi-faceted. It refers to binding the above information together through a common interface and also the integration of the records from all the parallel tasks into a single report. On some platforms IPM can be integrated into the execution environment of a parallel computer. In this way IPM profiling is available either automatically or with very little effort. The final level of integration is the collection of individual performance profiles into a database which synthesizes the performance reports via a web interface. This web interface can be used by all those concerned with parallel code performance, namely users, HPC consultants, and HPC center managers.

Background and Using IPM:

More detailed information about how IPM works:

Profiling vs. Tracing

Performance events include HPM counts from on-chip counters, memory usage events, and timings of routines such as message passing.

The space in which performance events occur is roughly two dimensional. Events occur at some "place" and at some time. The place might be a nodename or cpu number. It could be a context, e.g., a callsite or program counter value. The event also happens at or over some time.

A profile provides an inventory of performance events and timings for the execution as a whole. This ignores the chronology of the events in an absolute sense. Nothing is timestamped and the resulting report does not say which events happened before other events in an absolute sense. Relative ordering of events may be recorded in a profile.

A trace records the chronology, often with timestamps, and is extensive in time. The amount of data in the trace increases with the runtime. To bound the memory used by tracing, one must therefore either periodically write the data out to disk or network, or run for only a very short time.

The distinction between profiling and tracing is sometimes overlooked when choosing a tool to extract performance information from a parallel code. Typically a trace is useful for detailed examination of timing issues occurring within a code. A profile is often sufficient to pinpoint load imbalance due to problem decomposition and/or identify the origin of excessive communication time.

IPM aims toward detailed profiling rather than tracing. It records basic statistics on the performance events as they occur. In most cases these statistics are min, max and total timings of the event. The timings are stored in a fixed size hash table which is keyed off of a description of the event. The description is based on a small number of parameters. For MPI calls these parameters are things like the name of the MPI call, the buffer size, the source/destination rank, etc.

Here are two quick examples showing the MPI profile data collected by IPM on a single task (rank 0) of two parallel codes:

Blocked dense ScaLAPACK code run on 16 tasks:

call        orank      ncalls  buf_size   t_tot    t_min   t_max   %comm
MPI_Recv        2         17     131072 5.96e+00 6.43e-02 5.92e-01  75.5
MPI_Recv        7         18          4 1.82e+00 8.45e-06 4.17e-01  23.0
MPI_Barrier     *          2          * 1.04e-01 7.14e-05 1.03e-01   1.3
MPI_Sendrecv    8         18        504 4.56e-03 6.31e-05 3.19e-04   0.1
MPI_Send        1         36          4 1.84e-03 1.75e-05 1.62e-04   0.0
MPI_Sendrecv   16         18        504 1.55e-03 3.18e-05 2.80e-04   0.0

Here orank means "the other rank" which may be a source or destination for data in the message.

Conjugate gradient on 32 tasks:

call        orank      ncalls  buf_size   t_tot    t_min   t_max   %comm
MPI_Wait        4       3952          8 3.93e+00 3.93e-06 8.33e-02  39.4
MPI_Send        1       1976      75000 1.54e+00 5.76e-04 1.23e-02  15.4
MPI_Send        4       1976      75000 1.33e+00 4.68e-04 1.18e-02  13.4
MPI_Send        2       1976      75000 1.27e+00 4.53e-04 5.45e-03  12.7
MPI_Wait        2       3952          8 1.00e+00 3.93e-06 7.92e-02  10.1
MPI_Wait        1       3952          8 3.04e-01 3.70e-06 2.51e-02   3.0
MPI_Send        0       1976      75000 1.32e-01 6.39e-05 1.39e-04   1.3
MPI_Wait        4       1976      75000 9.17e-02 4.41e-06 5.11e-04   0.9
MPI_Send        4       3952          8 5.35e-02 9.43e-06 1.19e-04   0.5
MPI_Irecv       1       1976      75000 4.71e-02 1.75e-05 1.01e-04   0.5
MPI_Send        2       3952          8 4.18e-02 7.06e-06 4.69e-05   0.4
MPI_Send        1       3952          8 4.10e-02 7.75e-06 8.72e-05   0.4
MPI_Wait        2       1976      75000 3.36e-02 4.41e-06 4.24e-04   0.3
MPI_Wait        1       1976      75000 2.64e-02 1.02e-05 4.14e-04   0.3
MPI_Irecv       2       1976      75000 2.59e-02 6.38e-06 8.01e-05   0.3
MPI_Irecv       4       3952          8 2.57e-02 5.64e-06 2.01e-04   0.3

From such a profile, one does not know the order in which the above events happened. In many cases knowing that, as in the ScaLAPACK example above, 75.5% of the communication time was spent in a 131 KB MPI_Recv is sufficient information to make the next step of code examination and improvement. For scaling studies the above information can be quite useful.

In some cases more programmatic or chronological context of the performance events is needed. IPM includes two interfaces through which detailed information may be recorded.


Using IPM

Interfaces

IPM is controlled via environment variables and through MPI_Pcontrol.

Environment Variables

Variable           Values           Description
IPM_REPORT         terse (default)  Aggregate wallclock time, memory usage and flops are reported along with the percentage of wallclock time spent in MPI calls.
                   full             Each HPM counter is reported, as are all of wallclock, user, system, and MPI time. The contribution of each MPI call to the communication time is given.
                   none             No report.
IPM_MPI_THRESHOLD  0.0 < x < 1.0    Only report MPI routines using more than x% of the total MPI time.
IPM_HPM            1,2,3,4,scan     POWER3 allows four different event sets. Use this environment variable to pick the event set or select scan to use different event sets on different tasks. Using the scan option allows greater coverage of the HPM counters, but for codes with load imbalance or MPMD models uniform sampling may be more accurate. The scan option extrapolates to full totals based on the sampled event sets.

MPI_Pcontrol

The first argument to MPI_Pcontrol determines what action will be taken by IPM.

Arguments Description
1,"label" start code region "label"
-1,"label" exit code region "label"
0,"label" invoke custom event "label"

Defining code regions and events:

       C                                     FORTRAN
MPI_Pcontrol( 1,"proc_a");           call mpi_pcontrol( 1,"proc_a"//char(0))
MPI_Pcontrol(-1,"proc_a");           call mpi_pcontrol(-1,"proc_a"//char(0))
MPI_Pcontrol( 0,"tag_a");            call mpi_pcontrol(0,"tag_a"//char(0))
MPI_Pcontrol( 0,"tag_a");            call mpi_pcontrol(0,"tag_a"//char(0))

                                     ( fortran label strings must be null terminated )

Snapshot Tracing

Real documentation coming soon...

Basic idea:


 IPM_MPI_TRACE=call:MPI_Reduce, region:label[it_i-it_j], time:[t1-t2]
      call:MPI_function     -> trace all calls to MPI_function
      region:label          -> trace all calls in region "label"
      region:label[it1-it2] -> trace it1 to it2 passes through region "label"
      time:[t1-t2]          -> trace all calls from MPI_Init+t1 until MPI_Init+t2

Download

Access is provided through our SourceForge site, both as released versions and via CVS.

 cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/ipm-hpc login
 cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/ipm-hpc co -P ipm
 

Although IPM is easy to use, installation is tricky since it fits in with system libraries and commands, which tend to be site specific. No autoconf approach is used; so far the makefile is set up for a list of specific machines. If you're trying to install IPM, it is best to work with your local system guru.

If you're using IPM some place new or having trouble installing it, please let me know.


Hashing Strategy

IPM uses a fixed size hash to store event information. The hashing strategy aims at making inserts with very low overhead (in terms of CPU time). Double open-address hashing is currently used. For each event the region, call, rank, and buffer size are stored in a 64-bit integer key. The hash maps this key in a roughly deterministic and roughly uniform way to a number between 0 and MAXSIZE_HASH-1, where MAXSIZE_HASH is a prime number. The default size is 32573. In practice fewer than the maximum number of entries should be stored in the hash, since the number of collisions increases as the hash becomes full.

For example the event -> key -> hkey mapping for MPI events is as follows:

A key (64 bit int) is generated from the description of the event:

#define IPM_MPI_HASH_KEY(key,region,call,rank,size) {  \
 key = region; key = key << 8; \
 key |= call;  key = key << 16; \
 key |= rank;  key = key << 32; \
 key |= size;  \
}

A hash table index, hkey, is computed from the key

 hkey = (key%MAXSIZE_HASH+collisions*(1+key%(MAXSIZE_HASH-2)))%MAXSIZE_HASH

This mapping has been benchmarked as performing well (low overhead).

typedef struct ipm_hash_ent {
 IPM_KEY_TYPE key;
 IPM_COUNT_TYPE count;
 double t_tot, t_min, t_max;
} ipm_hash_ent;

Log File Format

The IPM log files are written in an ASCII format that roughly follows XML. The goal is to make a concise format that is easily parsed.

Log file examples:

Parsers:

Other examples:

Links, References, Misc.