Simplifying What You’re Debugging
You use a debugger because your program isn’t operating correctly, and you expect to find the problem by stopping your program’s threads, examining the values assigned to variables, and stepping your program so you can observe its execution.
Unfortunately, your multi-process, multi-threaded program and the computers upon which it executes are running several threads or processes that you want TotalView to ignore. For example, you don’t want to examine manager and service threads that the operating system, your programming environment, and your program create.
Also, most of us are incapable of understanding exactly how a program is acting when perhaps thousands of processes are executing asynchronously. Fortunately, only a few problems require full asynchronous behavior at all times.
One of the first simplifications you can make is to change the number of processes. For example, suppose you have a buggy MPI program running on 128 processors. Your first step might be to have it execute in an 8-processor environment.
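For example, if the job is started through an MPI launcher, reducing the process count can be as simple as changing one argument. The line below is only a sketch: it assumes an mpirun-style launcher and a hypothetical executable named ./myprog. The launcher name and its options vary by MPI implementation and site, so check your own setup.

```
totalview -args mpirun -np 8 ./myprog
```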
After the program is running under TotalView control, run the process being debugged to an action point so that you can inspect the program’s state at that point. In many cases, because your program has places where processes are forced to wait for an interaction with other processes, you can ignore what they are doing.
NOTE: TotalView lets you control as many groups, processes, and threads as you need to control. Although you can control each one individually, it would be very complicated to try to control large numbers of these independently. TotalView creates and manages groups so that you can focus on portions of your program.
In most cases, you don’t need to interact with everything that is executing. Instead, you want to focus on one process and the data that this process manipulates. Things get complicated when the process being investigated is using data created by other processes, and these processes might be dependent on other processes.
The following is a typical way to use TotalView to locate problems:
1. At some point, make sure that the groups you are manipulating do not contain service or manager threads. (You can remove processes and threads from a group by using the Group > Custom Group command.)
2. Place a breakpoint in a process or thread and begin investigating the problem. In many cases, you are setting a breakpoint at a place where you hope the program is still executing correctly. Because you are debugging a multi-process, multi-threaded program, set a barrier point so that all threads and processes stop at the same place.
NOTE: Don’t step your program unless you need to individually look at a thread. Using barrier points is much more efficient. Barrier points are discussed in Setting Barrier Points.
3. After execution stops at a barrier point, look at the contents of your variables. Verify that your program state is actually correct.
4. Begin stepping your program through its code. In most cases, step your program synchronously or set barriers so that everything isn’t running freely.
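In the CLI, steps 2 through 4 might look something like the following. This is only a sketch: exchange_data and local_sum are hypothetical names standing in for a function and a variable in your own program, and the focus you need depends on how your groups are set up.

```
# Hypothetical names: exchange_data (function), local_sum (variable)
dbarrier exchange_data   ;# barrier point: hold each process here until all arrive
dfocus g dgo             ;# resume the group; execution stops at the barrier
dprint local_sum         ;# check program state in the current focus
dfocus g dstep           ;# step the group together instead of letting it run freely
```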
Things begin to get complicated at this point. You’ve been focusing on one process or thread. If another process or thread modifies the data and you become convinced that this is the problem, you need to switch to that process or thread and see what’s going on.
Keep your focus narrow so that you’re investigating only a limited number of behaviors. This is where debugging becomes an art. A multi-process, multi-threaded program can be doing a great number of things. Understanding where to look when problems occur is the art.
For example, you most often execute commands at the default focus. Only when you think that the problem is occurring in another process do you change to that process. You still execute in the default focus, but this time the default focus changes to another process.
Although it seems like you’re often shifting from one focus to another, you probably will do the following:
Modify the focus so that it affects just the next command. If you are using the GUI, you might select this process and thread from the list displayed in the Root Window. If you are using the CLI, you use the dfocus command to limit the scope of a future command. For example, the following is the CLI command that steps thread 7 in process 3:
dfocus t3.7 dstep
Use the dfocus command to change focus temporarily, execute a few commands, and then return to the original focus.
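For instance, a temporary change of focus in the CLI might run along these lines. The process numbers and the variable name counter are made up for illustration, and the sketch assumes your original focus was process 1.

```
dfocus p3        ;# make process 3 the default focus
dstatus          ;# see where process 3 is stopped
dprint counter   ;# hypothetical variable; inspect its value in process 3
dfocus p1        ;# return to the original focus (assumed here to be process 1)
```

Unlike the one-line form shown earlier, dfocus with no trailing command changes the default focus for every command that follows, so remember to switch back when you are done.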