Parallel Debugging Tips
The following tips are useful for debugging most parallel programs:
*
When you’re debugging message-passing and other multi-process programs, it is usually easier to understand the program’s behavior if you change the default stopping action of breakpoints and barrier breakpoints. By default, when one process in a multi-process program hits a breakpoint, TotalView stops all other processes.
To change the default stopping action of breakpoints and barrier breakpoints, set the corresponding debugger preferences, which are described in the online Help. These preferences tell TotalView whether the other processes continue to run when a process or thread hits the breakpoint.
These options affect only the default behavior. You can choose a behavior for a breakpoint by setting the breakpoint properties in the File > Preferences Action Points Page. See “Setting Breakpoints for Multiple Processes”.
*
TotalView has two features that make it easier to get all of the processes in a multi-process program synchronized and executing the same line of code. Process barrier breakpoints and the process hold/release features work together to help you control the execution of your processes. See “Setting Barrier Points”.
The Process Window Group > Run To command is a special stepping command. It lets you run a group of processes to a selected source line or instruction. See “Stepping (Part I)”.
*
Group commands are often more useful than process commands.
For example, to restart the whole application, use the Group > Go command instead of the Process > Go command.
To stop execution, use the Group > Halt command instead of Process > Halt.
The group-level single-stepping commands such as Group > Step and Group > Next let you single-step a group of processes in parallel. See “Stepping (Part I)”.
*
If you use a process-level single-stepping command in a multi-process program, TotalView may appear to hang (it continuously displays the watch cursor). If you single-step a process over a statement that can’t complete without allowing another process to run, and that process is stopped, the stepping process appears to hang. This can occur, for example, when you try to single-step a process over a communication operation that cannot complete without the participation of another process. When this happens, you can abort the single-step operation by selecting Cancel in the Waiting for Command to Complete Window that TotalView displays. As an alternative, consider using a group-level single-step command.
*
Rogue Wave Software receives many bug reports about processes being hung. In almost all cases, the reason is that one process is waiting for another. Using the Group debugging commands almost always solves this problem.
*
The Root Window helps you determine where various processes and threads are executing. When you select a line of code in the Process Window, the Root Window updates to show which processes and threads are executing that line.
*
You can view the value of a variable that is replicated across multiple processes or multiple threads in a single Variable Window. See “Displaying a Variable in all Processes or Threads”.
*
You can restart a parallel program at any time. If your program runs past the point you want to examine, you can kill the program by selecting the Group > Kill command. This command kills the master process and all the slave processes. Restarting the master process (for example, mpirun or poe) recreates all of the slave processes. Startup is faster on a restart because TotalView doesn’t need to reread the symbol tables or restart its tvdsvr processes, since they are already running.
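
The apparent hang described in the single-stepping tip above can be illustrated outside the debugger. The following is a minimal Python sketch, not TotalView code: two threads stand in for two processes, a queue stands in for the communication channel, and a timeout stands in for selecting Cancel. The names `blocking_receive` and `step_over_receive` are hypothetical and chosen only for this illustration.

```python
import queue
import threading


def blocking_receive(channel, timeout):
    """Wait for a message from the peer; return None if none arrives in time."""
    try:
        return channel.get(timeout=timeout)
    except queue.Empty:
        # The peer never sent anything, so the "step" could not complete.
        return None


def step_over_receive(peer_running, timeout=0.2):
    """Simulate stepping one process over a receive.

    peer_running=False models a process-level step (the peer stays stopped).
    peer_running=True models a group-level step (the peer runs as well).
    """
    channel = queue.Queue()
    peer = None
    if peer_running:
        # The peer is allowed to run, so it performs the matching send.
        peer = threading.Thread(target=channel.put, args=("payload",))
        peer.start()
    result = blocking_receive(channel, timeout)
    if peer is not None:
        peer.join()
    return result


stalled = step_over_receive(peer_running=False)
completed = step_over_receive(peer_running=True)
print("peer stopped:", stalled)
print("peer running:", completed)
```

With the peer stopped, the receive waits until the timeout expires, which corresponds to the stepping process appearing to hang. With the peer running, the receive completes at once, which is why a group-level step usually avoids the problem.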
