MPICH Applications

To debug Message Passing Interface/Chameleon Standard (MPICH) applications, you must use MPICH version 1.2.3 or later on a homogeneous collection of computers. If you need a copy of MPICH, you can obtain it at no cost from Argonne National Laboratory at http://www.mpich.org/downloads. (We strongly urge that you use a later version of MPICH. For versions that work with TotalView, see Supported Platforms.)

The MPICH library should use the ch_p4, ch_p4mpd, ch_shmem, ch_lfshmem, or ch_mpl devices.

  • For networks of workstations, the default MPICH device is ch_p4.

  • For shared-memory SMP computers, use ch_shmem.

  • On an IBM SP computer, use the ch_mpl device.

The MPICH source distribution includes all these devices. Choose the one that best fits your environment when you configure and build MPICH.

When configuring MPICH, you must ensure that the MPICH library maintains all of the information that TotalView requires. This means that you must use the -enable-debug option with the MPICH configure command. (Versions earlier than 1.2 used the --debug option.)
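The device choice and the debug option can be combined into a single configure invocation. The following is a minimal sketch that only prints the command for review instead of running it. The helper name mpich_configure_cmd is hypothetical, and the --with-device spelling is an assumption about the MPICH 1.2.x configure script; check ./configure --help in your source tree before relying on it.

```shell
# Hypothetical helper: print a configure command for one of the devices
# listed above, with the debug option that TotalView requires.
# Assumption: this MPICH version's configure script accepts --with-device=NAME.
mpich_configure_cmd() {
    case "$1" in
        ch_p4|ch_p4mpd|ch_shmem|ch_lfshmem|ch_mpl)
            printf './configure --with-device=%s -enable-debug\n' "$1"
            ;;
        *)
            # Reject any device that is not in the supported list.
            printf 'unsupported device: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

# Print the command for the default workstation-network device.
mpich_configure_cmd ch_p4
```

Run the printed command from the top of the MPICH source tree once you have confirmed the option names for your version.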

Starting TotalView on an MPICH Job

Before you can bring an MPICH job under TotalView’s control, both TotalView and tvdsvr must be in your path. This is most easily set in a login or shell startup script.
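As an illustration, a Bourne-style startup script could add the TotalView directory to the path and then confirm that both programs can be found. This is a sketch only: the directory /opt/totalview/bin and the helper name missing_tools are hypothetical, so substitute the location of your own installation.

```shell
# Hypothetical install location; replace with the directory that holds
# the totalview and tvdsvr executables on your system.
TV_BIN="${TV_BIN:-/opt/totalview/bin}"
PATH="$TV_BIN:$PATH"
export PATH

# Print the name of each listed tool that the shell cannot find on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing when both programs are reachable.
missing_tools totalview tvdsvr
```

If the last command prints a name, that program is not yet on your path.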

For version 1.1.2, the following command-line syntax starts a job under TotalView control:

mpirun [ MPICH-arguments ] -tv program [ program-arguments ]

For example:

mpirun -np 4 -tv sendrecv

The MPICH mpirun command obtains information from the TOTALVIEW environment variable and then uses this information when it starts the first process in the parallel job.

For version 1.2.4, the syntax changes to the following:

mpirun -dbg=totalview [ other_mpich-args ] program [ program-args ]

For example:

mpirun -dbg=totalview -np 4 sendrecv

In this case, mpirun obtains the information it needs from the -dbg command-line option.

In other contexts, setting the TOTALVIEW environment variable lets you use a different version of TotalView or pass command-line options to TotalView.

For example, the following is the C shell command that sets the TOTALVIEW environment variable so that mpirun passes the -no_stop_all option to TotalView:

setenv TOTALVIEW "totalview -no_stop_all"
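If you use a Bourne-style shell (sh, bash, ksh) rather than the C shell, setenv is not available. A minimal equivalent, using the same variable name and value as the C shell command above:

```shell
# Bourne-style equivalent of: setenv TOTALVIEW "totalview -no_stop_all"
TOTALVIEW="totalview -no_stop_all"
# Export the variable so that mpirun, started from this shell, inherits it.
export TOTALVIEW
```

An mpirun command that uses the -tv option and is started from the same shell then picks up this setting.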

TotalView begins by starting the first process of your job, the master process, under its control. You can then set breakpoints and begin debugging your code.

On the IBM SP computer with the ch_mpl device, the mpirun command uses the poe command to start an MPI job. You must still use the MPICH mpirun command (and its -tv option) to start an MPICH job, although the way the job starts differs. For details on using TotalView with poe, see Starting TotalView on a PE Program.

Starting TotalView using the ch_p4mpd device is similar to starting TotalView using poe on an IBM computer or other methods you might use on Sun and HP platforms. In general, you start TotalView using the totalview command, with the following syntax:

totalview mpirun [ totalview_args ] -a [ mpich-args ] program [ program-args ]

totalviewcli mpirun [ totalview_args ] -a [ mpich-args ] program [ program-args ]

As your program executes, TotalView automatically acquires the processes that are part of your parallel job as your program creates them. Before TotalView begins to acquire them, it asks if you want to stop the spawned processes. If you click Yes, you can stop processes as they are initialized. This lets you check their states or set breakpoints that are unique to the process. TotalView automatically copies breakpoints from the master process to the slave processes as it acquires them. Consequently, you don’t have to stop them just to set these breakpoints.

If you’re using the UI, TotalView updates the view to show these newly acquired processes.

Attaching to an MPICH Job

You can attach to an MPICH application even if it was not started under TotalView control. To attach to an MPICH application:

  1. Start TotalView.

  2. Select Attach to process on the Start a Debugging Session dialog. A list of processes running on the selected host displays in the Attach to Running Program(s) dialog.

  3. Attach to the first MPICH process in your workstation cluster by diving into it. In the CLI, use the dattach command:

    dattach executable pid

  4. On an IBM SP with the ch_mpl device, attach to the poe process that started your job. For details, see Starting TotalView on a PE Program.

    Normally, the first MPICH process is the highest process with the correct program name in the process list. Other instances of the same executable can be:

    • The p4 listener processes if MPICH was configured with ch_p4.

    • Additional slave processes if MPICH was configured with ch_shmem or ch_lfshmem.

    • Additional slave processes if MPICH was configured with ch_p4 and has a file that places multiple processes on the same computer.

  5. After you attach to your program’s processes, a dialog asks whether you also want to attach to the slave MPICH processes. If you do, press Return or choose Yes. If you do not, choose No.

    If you choose Yes, TotalView starts the server processes and acquires all MPICH processes.

Classic UI Only: As an alternative, use the Group > Attach Subset command to predefine what TotalView should do.

In some situations, the processes you expect to see might not exist (for example, they may crash or exit). TotalView acquires all the processes it can and then warns you if it cannot attach to some of them. If you attempt to dive into a process that no longer exists (for example, using a message queue display), you are alerted that the process no longer exists.

Using MPICH P4 procgroup Files

If you’re using MPICH with a P4 procgroup file (by using the -p4pg option), you must use the same absolute path name in your procgroup file and on the mpirun command line. For example, if your procgroup file contains a different path name than that used in the mpirun command, even though this name resolves to the same executable, TotalView assumes that it is a different executable, which causes debugging problems.

The following example uses the same absolute path name on the mpirun command line and in the procgroup file:

% cat p4group

local 1 /users/smith/mympichexe

bigiron 2 /users/smith/mympichexe

% mpirun -p4pg p4group -tv /users/smith/mympichexe

In this example, TotalView does the following:

  1. Reads the symbols from mympichexe only once.

  2. Places MPICH processes in the same TotalView share group.

  3. Names the processes mympichexe.0, mympichexe.1, mympichexe.2, and mympichexe.3.

If TotalView assigns names such as mympichexe<mympichexe>.0, a problem occurred and you need to compare the contents of your procgroup file and mpirun command line.
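The comparison between the procgroup file and the mpirun command line can be scripted. The following is a minimal sketch, assuming that the executable path is the third whitespace-separated field on each procgroup line, as in the example above. The helper name check_procgroup is hypothetical.

```shell
# check_procgroup FILE EXE
# Print each procgroup line whose third field (the executable path) is not
# textually identical to EXE. Returns nonzero if any mismatch is found.
# Lines with fewer than three fields are ignored.
check_procgroup() {
    awk -v exe="$2" '
        NF >= 3 && $3 != exe { print "line " NR ": " $3; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

For the files shown above, check_procgroup p4group /users/smith/mympichexe prints nothing and returns zero. The comparison is deliberately textual: a different path name that resolves to the same executable is reported as a mismatch, because that is the case that causes debugging problems.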

MPICH Debugging Tips

The following debugging tips apply only to MPICH:

  • Passing options to mpirun

    You can pass options to TotalView through the MPICH mpirun command by setting the TOTALVIEW environment variable. For example, the following C shell command causes mpirun to invoke TotalView with the -no_stop_all option:

    setenv TOTALVIEW "totalview -no_stop_all"

  • Using ch_p4

    If you start remote processes with MPICH/ch_p4, you may need to change the way TotalView starts its servers.

    By default, TotalView uses ssh to start its remote server processes, which is the same mechanism that ch_p4 uses. If you configure ch_p4 to use a different startup mechanism, you probably also need to change the way that TotalView starts its servers.