IPM Homepage
 


Profiling vs. Tracing

Performance events include HPM counts from on-chip counters, memory usage events, and timings of routines such as message-passing calls.

The space in which performance events occur is roughly two-dimensional: events occur at some "place" and at some time. The place might be a node name or CPU number, or it could be a context, e.g., a callsite or program counter value. The event also happens at an instant or over some interval in time.

A profile provides an inventory of performance events and timings for the execution as a whole. This ignores the chronology of the events in an absolute sense: nothing is timestamped, and the resulting report does not say which events happened before which others. The relative ordering of events may, however, be recorded in a profile.

A trace records the chronology, often with timestamps, and is extensive in time: the amount of data in the trace grows with the runtime. To bound the memory used by tracing, one must therefore periodically write the data out to disk or over the network.
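For illustration only (a generic sketch, not IPM's code; the record layout, TRACE_CAPACITY, and function names here are made up), a tracer can bound its memory use by filling a fixed-size buffer of timestamped records and writing the buffer out whenever it fills:

/* Generic sketch of bounding trace memory: a fixed-size buffer of
 * timestamped records is flushed to disk whenever it fills.
 * This is an illustration, not IPM's implementation. */
#include <stdio.h>

#define TRACE_CAPACITY 4096        /* arbitrary fixed bound on in-memory records */

typedef struct {
    double t_start, t_end;         /* timestamps of the event */
    int    event_id;               /* which event occurred */
} trace_record;

static trace_record buf[TRACE_CAPACITY];
static int          nrec = 0;

static void flush_trace(FILE *out)
{
    fwrite(buf, sizeof buf[0], (size_t)nrec, out);
    nrec = 0;                      /* memory use stays bounded */
}

static void trace_event(FILE *out, int event_id, double t_start, double t_end)
{
    if (nrec == TRACE_CAPACITY)
        flush_trace(out);          /* periodic write keeps the buffer a fixed size */
    buf[nrec].t_start  = t_start;
    buf[nrec].t_end    = t_end;
    buf[nrec].event_id = event_id;
    nrec++;
}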

The distinction between profiling and tracing is sometimes overlooked when choosing a tool to extract performance information from a parallel code. Typically a trace is useful for detailed examination of timing issues occurring within a code, while a profile is often sufficient to pinpoint load imbalance due to problem decomposition and/or to identify the origin of excessive communication time.

IPM aims at detailed profiling rather than tracing. It records basic statistics on performance events as they occur; in most cases these statistics are the minimum, maximum, and total timings of the event. The timings are stored in a fixed-size hash table keyed on a description of the event, which is built from a small number of parameters. For MPI calls these parameters are things like the name of the MPI call, the buffer size, and the source/destination rank.
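The sketch below illustrates the idea in C. It is not IPM's actual implementation: the field names, the table size, the hash function, and the omission of collision handling are all simplifications made here for clarity.

/* Illustrative sketch only -- not IPM's implementation.  Each distinct
 * event signature (call name, buffer size, partner rank) maps to one
 * fixed-size slot holding running statistics.  Hash collisions are
 * ignored here for brevity. */
#include <string.h>

#define HASH_SLOTS 32768           /* fixed table size, chosen arbitrarily */

typedef struct {
    char   call[32];               /* e.g. "MPI_Send" */
    int    buf_size;               /* message size in bytes */
    int    orank;                  /* source/destination rank */
    long   ncalls;                 /* number of occurrences */
    double t_tot;                  /* accumulated time */
    double t_min;                  /* fastest occurrence */
    double t_max;                  /* slowest occurrence */
} event_stats;

static event_stats table[HASH_SLOTS];

/* Fold the event description into a table index. */
static unsigned hash_event(const char *call, int buf_size, int orank)
{
    unsigned h = 5381u;
    for (const char *p = call; *p; p++)
        h = h * 33u + (unsigned char)*p;
    h = h * 33u + (unsigned)buf_size;
    h = h * 33u + (unsigned)orank;
    return h % HASH_SLOTS;
}

/* Update the running min/max/total timings for one timed event. */
static void record_event(const char *call, int buf_size, int orank, double t)
{
    event_stats *e = &table[hash_event(call, buf_size, orank)];
    if (e->ncalls == 0) {          /* first occurrence of this signature */
        strncpy(e->call, call, sizeof e->call - 1);
        e->buf_size = buf_size;
        e->orank    = orank;
        e->t_min    = t;
        e->t_max    = t;
    }
    e->ncalls += 1;
    e->t_tot  += t;
    if (t < e->t_min) e->t_min = t;
    if (t > e->t_max) e->t_max = t;
}

From such a table it is straightforward to emit the per-signature rows (ncalls, buf_size, t_tot, t_min, t_max) shown in the examples below.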

Here are two quick examples showing the MPI profile data collected by IPM on a single task (rank 0) of two parallel codes:

Blocked dense ScaLAPACK code run on 16 tasks:

call        orank      ncalls  buf_size   t_tot    t_min   t_max   %comm
MPI_Recv        2         17     131072 5.96e+00 6.43e-02 5.92e-01  75.5
MPI_Recv        7         18          4 1.82e+00 8.45e-06 4.17e-01  23.0
MPI_Barrier     *          2          * 1.04e-01 7.14e-05 1.03e-01   1.3
MPI_Sendrecv    8         18        504 4.56e-03 6.31e-05 3.19e-04   0.1
MPI_Send        1         36          4 1.84e-03 1.75e-05 1.62e-04   0.0
MPI_Sendrecv   16         18        504 1.55e-03 3.18e-05 2.80e-04   0.0

Here orank means "the other rank", i.e., the source or destination rank for the data in the message; buf_size is the message size in bytes and the timings are in seconds.

A conjugate gradient code run on 32 tasks:

call        orank      ncalls  buf_size   t_tot    t_min   t_max   %comm
MPI_Wait        4       3952          8 3.93e+00 3.93e-06 8.33e-02  39.4
MPI_Send        1       1976      75000 1.54e+00 5.76e-04 1.23e-02  15.4
MPI_Send        4       1976      75000 1.33e+00 4.68e-04 1.18e-02  13.4
MPI_Send        2       1976      75000 1.27e+00 4.53e-04 5.45e-03  12.7
MPI_Wait        2       3952          8 1.00e+00 3.93e-06 7.92e-02  10.1
MPI_Wait        1       3952          8 3.04e-01 3.70e-06 2.51e-02   3.0
MPI_Send        0       1976      75000 1.32e-01 6.39e-05 1.39e-04   1.3
MPI_Wait        4       1976      75000 9.17e-02 4.41e-06 5.11e-04   0.9
MPI_Send        4       3952          8 5.35e-02 9.43e-06 1.19e-04   0.5
MPI_Irecv       1       1976      75000 4.71e-02 1.75e-05 1.01e-04   0.5
MPI_Send        2       3952          8 4.18e-02 7.06e-06 4.69e-05   0.4
MPI_Send        1       3952          8 4.10e-02 7.75e-06 8.72e-05   0.4
MPI_Wait        2       1976      75000 3.36e-02 4.41e-06 4.24e-04   0.3
MPI_Wait        1       1976      75000 2.64e-02 1.02e-05 4.14e-04   0.3
MPI_Irecv       2       1976      75000 2.59e-02 6.38e-06 8.01e-05   0.3
MPI_Irecv       4       3952          8 2.57e-02 5.64e-06 2.01e-04   0.3

From such a profile one does not know the order in which the above events happened. In many cases, knowing that roughly 75% of the communication time was spent in 131 KB MPI_Recv calls (as in the first example above) is sufficient information to take the next step of code examination and improvement. For scaling studies the above information can be quite useful.

In some cases more programmatic or chronological context for the performance events is needed. IPM includes two interfaces through which such detailed information may be recorded:
  • Code regions may be defined. This allows separate profiles of individual code sections and provides the possibility of greater resolution. Regions are implemented through MPI_Pcontrol, so while code changes are required, these changes present no build or execution conflicts if IPM is not used (see the sketch after this list).
  • Snapshot tracing may be done independently of the profiling. Either at execution time or through MPI_Pcontrol, the user may arrange for a fixed-size buffer to be filled with a timestamped trace of all or selected events. When the buffer fills or tracing is turned off, the trace buffer is written out. The trace data can then be visualized.
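As a minimal sketch of the region interface, assuming the usual IPM convention of MPI_Pcontrol(1, name) to enter a region and MPI_Pcontrol(-1, name) to leave it (the region label "solver" and the solve() routine below are placeholders):

/* Minimal sketch of delimiting an IPM code region with MPI_Pcontrol.
 * The label "solver" and the solve() routine are placeholders.  With IPM
 * linked in, the positive/negative first arguments mark region entry and
 * exit; without IPM, MPI_Pcontrol is the standard MPI no-op, so the code
 * builds and runs unchanged. */
#include <mpi.h>

static void solve(void)
{
    /* ... application work to be profiled as its own region ... */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Pcontrol( 1, "solver");    /* enter the "solver" region */
    solve();
    MPI_Pcontrol(-1, "solver");    /* leave the "solver" region */

    MPI_Finalize();
    return 0;
}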

