IBM Blue Gene Applications
Debugging IBM Blue Gene MPI programs is very similar to debugging MPI programs on other platforms, but starting TotalView on your program differs slightly. Because each machine is configured differently, you will need to consult IBM’s documentation or documentation created at your site.
Nevertheless, the remainder of this section presents some hints based on information we have gathered at various sites.
TotalView supports debugging applications on three generations of Blue Gene systems: Blue Gene/L, Blue Gene/P, and Blue Gene/Q. While the different Blue Gene generations are similar, there are differences that affect how you start the debugger.
In general, either launch the MPI starter program under the control of the debugger, or start TotalView and attach to an already running MPI starter program. On Blue Gene/L and Blue Gene/P, the starter program is named mpirun. On Blue Gene/Q, the starter program is named runjob in most cases, or srun when the system is configured to use SLURM.
For example, on Blue Gene/L or Blue Gene/P:
{ totalview | totalviewcli } mpirun -a mpirun-command-line
On most Blue Gene/Q systems:
{ totalview | totalviewcli } runjob -a runjob-command-line
On Blue Gene/Q systems configured to use SLURM:
{ totalview | totalviewcli } srun -a srun-command-line
All Blue Gene systems support a scalable tool daemon launching mechanism called “co-spawning”, in which tool daemons, such as TotalView’s tvdsvr, are launched along with the parallel job. As part of the startup or attach sequence, TotalView tells the MPI starter process to launch (or co-spawn) the TotalView Debug Servers on each Blue Gene I/O node.
To support co-spawning, TotalView must pass the address of the network interface connected to the I/O node network on the front-end node to the servers on the I/O nodes. This is usually not the same network interface that is used to connect to the front-end node from the outside world. TotalView assumes that the address can be resolved by using a name that is:
front-end-hostname-io.
For example, if the hostname of the front-end is bgqfen1, TotalView will attempt to resolve the name bgqfen1-io to an IP address that the server is able to connect to.
NOTE >> Some systems follow this convention and some do not. If you are executing programs on a system that follows this convention, you will not need to set the TotalView variables described in the rest of this section. You can use the command ping -c 1 `hostname -s`-io on the front-end node to check whether the system is using this convention.
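As a rough sketch of the check described in the note above, the following shell functions derive the conventional name and test whether it resolves. The function names (io_name_for, resolves, check_io_convention) are illustrative and not part of TotalView, and the use of getent assumes a glibc-based Linux front-end node.

```shell
# Derive the conventional I/O-network name for a front-end host:
# strip any domain part, then append "-io".
io_name_for() {
    printf '%s-io\n' "${1%%.*}"
}

# Succeed if the given name resolves through the system resolver.
# Assumption: getent is available (glibc-based Linux).
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Report whether the convention holds for the given host
# (defaults to the short name of the local host).
check_io_convention() {
    name=$(io_name_for "${1:-$(hostname -s)}")
    if resolves "$name"; then
        echo "$name resolves; no extra TotalView setting should be needed"
    else
        echo "$name does not resolve; supply the interface name explicitly"
    fi
}
```

Running check_io_convention with no argument on the front-end node checks the local host; passing a hostname such as bgqfen1 checks that name instead.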
If the front-end cannot resolve this name, you must supply the name of the interface using the -bluegene_io_interface command-line option, or by setting the bluegene_io_interface TotalView variable. (This variable is described in the TotalView Variables section of the TotalView for HPC Reference Guide.)
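For illustration, a command line that names the interface explicitly might be assembled as in the sketch below. The helper tv_bluegene_cmd and the interface name bgqfen1-io0 are hypothetical; substitute the name or address that your front-end node actually uses on the I/O node network. The helper only prints the command so that you can inspect it before running it.

```shell
# Print (rather than run) a TotalView command line that passes the
# I/O network interface name with -bluegene_io_interface.
#   $1  interface name or address on the I/O node network
#   $2  MPI starter program: mpirun, runjob, or srun
#   $3+ the starter program's own command line
tv_bluegene_cmd() {
    io_if=$1
    starter=$2
    shift 2
    echo "totalview -bluegene_io_interface $io_if $starter -a $*"
}

# bgqfen1-io0 is a made-up interface name used only for this example.
tv_bluegene_cmd bgqfen1-io0 runjob runjob-command-line
```

The TotalView option is placed before the starter program, in the same position as any other TotalView option in the command forms shown earlier in this section.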
Because the same version of TotalView must be able to debug both Power-Linux programs (for example, mpirun) and Blue Gene programs, TotalView uses a Blue Gene-specific server launch string. You can define this launch string by setting the bluegene_server_launch_string TotalView variable or command-line option.
NOTE >> You must set this variable in a tvdrc file. This differs from other TotalView launch strings, which you can set using the File > Preferences Dialog Box.
The default value for the bluegene_server_launch_string variable is:
-callback %L -set_pw %P -verbosity %V %F
In this string, %L is the address of the front-end node interface used by the servers. The other substitution arguments have the same meaning as in a normal server launch string. These substitution arguments are discussed in Chapter 7 of the TotalView for HPC Reference Guide.
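Since the variable has to be set in a tvdrc file, one way to do so is sketched below. This assumes the usual TotalView conventions that variables live in the TV:: namespace and are set with the CLI dset command, and that your personal startup file is $HOME/.tvdrc; confirm both against the TotalView for HPC Reference Guide for your release. The helper name append_bluegene_launch_string is hypothetical, and the value written is simply the default shown above, which you would edit to suit your site.

```shell
# Append a dset line for bluegene_server_launch_string to a tvdrc file.
#   $1  path of the tvdrc file to append to
# Assumption: TotalView reads "dset TV::<variable> <value>" lines at startup.
append_bluegene_launch_string() {
    # The value is passed as a %s argument, so printf does not try to
    # interpret the %L, %P, %V, and %F inside it.
    printf '%s\n' \
        'dset TV::bluegene_server_launch_string {-callback %L -set_pw %P -verbosity %V %F}' \
        >> "$1"
}
```

For example, append_bluegene_launch_string "$HOME/.tvdrc" adds the line to your personal startup file; edit the text between the braces to change the server arguments.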