Organizing Chaos
You could debug a program that runs thousands of processes and threads across hundreds of computers by examining each process and thread individually, but this is almost always impractical. The only workable approach is to organize your processes and threads into groups and then debug your program through these groups. This matches how such programs are written: in a multi-process, multi-threaded program, you rarely program each process or thread individually. Instead, most high-performance computing programs perform the same or similar activities on different sets of data.
TotalView cannot know your program’s architecture; however, it can make some intelligent guesses based on what your program is executing and where the program counter is. Using this information, TotalView automatically organizes your processes and threads into the following predefined groups:
• Control Group: All the processes that a program creates. These processes can be local or remote. If your program uses processes that it did not create, TotalView places them in separate control groups. For example, a client/server program that has two distinct executables that run independently of one another has each executable in a separate control group. In contrast, processes created by fork()/exec() are in the same control group.
• Share Group: All the processes within a control group that share the same code. Same code means that the processes have the same executable file name and path. In most cases, your program has more than one share group. Share groups, like control groups, can be local or remote.
• Workers Group: All the worker threads within a control group. These threads can reside in more than one share group.
• Lockstep Group: All threads that are at the same PC (program counter). A lockstep group is a subset of a workers group, so it cannot have members in more than one workers group or more than one control group. A lockstep group exists only for stopped threads; it has no meaning while threads are running.
The control and share groups contain only processes; the workers and lockstep groups contain only threads.
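The grouping rules above can be modeled in a few lines of Python. This is a minimal sketch for illustration only, not TotalView's implementation or API: the `ThreadInfo` record, its field names, and the `build_groups` function are all hypothetical. It also assumes that manager threads are the threads excluded from the workers group, and that a PC value is comparable only within one executable, so lockstep groups are keyed per executable.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThreadInfo:
    """Hypothetical record for one thread; not a TotalView data structure."""
    control_id: int      # control group of the owning process
    pid: int             # owning process
    executable: str      # full path of the process's executable
    tid: int             # thread id within the process
    is_manager: bool     # assumed: manager threads are not worker threads
    pc: Optional[int]    # program counter if stopped, None if running


def build_groups(threads):
    """Sort threads into control, share, workers, and lockstep groups."""
    control = defaultdict(set)   # control_id -> {pid}
    share = defaultdict(set)     # (control_id, executable) -> {pid}
    workers = defaultdict(set)   # control_id -> {(pid, tid)}
    lockstep = defaultdict(set)  # (control_id, executable, pc) -> {(pid, tid)}

    for t in threads:
        # Control and share groups contain processes only.
        control[t.control_id].add(t.pid)
        share[(t.control_id, t.executable)].add(t.pid)

        # Workers and lockstep groups contain threads only.
        if t.is_manager:
            continue
        workers[t.control_id].add((t.pid, t.tid))
        if t.pc is not None:  # lockstep groups exist only for stopped threads
            lockstep[(t.control_id, t.executable, t.pc)].add((t.pid, t.tid))

    return control, share, workers, lockstep


# Hypothetical program: three processes, two executables, one control group.
threads = [
    ThreadInfo(1, 100, "/bin/a", 1, False, 0x400),
    ThreadInfo(1, 100, "/bin/a", 2, True, None),
    ThreadInfo(1, 101, "/bin/a", 1, False, 0x400),
    ThreadInfo(1, 102, "/bin/b", 1, False, 0x400),
    ThreadInfo(1, 102, "/bin/b", 2, False, None),
]
control, share, workers, lockstep = build_groups(threads)
```

In this sketch every lockstep group is a subset of its control group's workers group by construction, because a thread is added to a lockstep group only after it has been added to the workers group.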
TotalView lets you manipulate processes and threads individually and by groups. In addition, you can create your own groups and manipulate a group’s contents (to some extent). For more information, see Chapter 21, "Group, Process, and Thread Control".
Figure 196 shows a processor running five processes (ignoring daemons and other programs not related to your program) and the threads within the processes, along with a control group and two share groups within the control group.
Many of the elements in this figure are used in other figures in this book. These elements are as follows:
CPU
The one outer square represents the CPU. All elements in the drawing operate within one CPU.
Processes
The five white inner squares represent processes being executed by the CPU.
Control Group
The large rounded rectangle that surrounds the five processes shows one control group. This diagram doesn’t indicate which process is the main procedure.
Share Groups
Each of the two smaller rounded rectangles, drawn with yellow dashed lines, surrounds the processes in one share group. This drawing shows two share groups within one control group. The three processes in the first share group have the same executable. The two processes in the second share group share a second executable.
The control group and the share group contain only processes.
Figure 197 shows how TotalView organizes the threads in the previous figure, adding a workers group and two lockstep groups.
NOTE >> This figure doesn’t show the control group because it encompasses everything in the figure. That is, this example’s control group contains all of the processes and threads in the program’s lockstep, share, and workers groups.
The additional elements in this figure are as follows:
Workers Group
All nonmanager threads within the control group make up the workers group. This group includes service threads.
Lockstep Groups
Each share group has its own lockstep group. The previous figure shows two lockstep groups, one in each share group.
Service Threads
Each process in this figure has one service thread. A process can have any number of service threads, but the figure shows only one per process.
Manager Threads
The ten manager threads are the only threads that do not participate in the workers group.
Figure 198 extends Figure 197 to show the same kinds of information executing on two processors.
This figure differs from others in this section because it shows ten processes executing within two processors rather than five processes within one processor. Although the number of processors has changed, the number of control and share groups is unchanged. Note that, while this makes a nice example, most programs are not this regular.
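One way to see why the group counts do not change is to note that a share group is defined by the executable, not by where the process runs. The sketch below illustrates that under stated assumptions: the host names, executable paths, and the `share_groups` function are hypothetical, and all processes are assumed to belong to one control group. It is not TotalView code.

```python
from collections import defaultdict


def share_groups(processes):
    """Map executable path -> set of (host, pid); the host is not part of the key."""
    groups = defaultdict(set)
    for host, pid, executable in processes:
        groups[executable].add((host, pid))
    return dict(groups)


# Five processes on one processor: three run one executable, two run another.
one_host = [
    ("node0", pid, "/opt/app/solver" if pid < 3 else "/opt/app/io")
    for pid in range(5)
]

# The same layout repeated on a second processor gives ten processes.
two_hosts = one_host + [("node1", pid, exe) for _, pid, exe in one_host]

groups_one = share_groups(one_host)
groups_two = share_groups(two_hosts)
```

Both `groups_one` and `groups_two` contain two share groups; only the number of member processes in each group grows when the second processor is added.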