Starting TotalView on Cray
The configuration of Cray systems varies from site to site, so the following provides only general guidelines for starting TotalView on your application. Consult your site's documentation for the specific steps needed to debug a program using TotalView on your Cray system.
File System Considerations
Place the application you want to debug on a file system that is shared across all Cray node types, such as the service, elogin, login/MOM, and compute nodes, so that you can compile and debug your application from any node type.
Further, make sure that your $HOME/.totalview directory is on a file system shared across all Cray node types so that you can use the tvconnect feature in batch scripts. For more information, see Chapter 22, Reverse Connections.
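One quick sanity check is to confirm that two paths reside on the same file system by comparing the device numbers reported by stat. This is a sketch, not part of TotalView; the paths are illustrative, and `stat -c %d` assumes GNU coreutils (standard on Cray Linux environments):

```shell
# Sketch: compare device numbers to see whether two paths share a
# file system. Illustrative only; substitute your own paths.
same_fs() {
  [ "$(stat -c %d "$1" 2>/dev/null)" = "$(stat -c %d "$2" 2>/dev/null)" ]
}

# Example check: is $HOME/.totalview on the same file system as $HOME?
if same_fs "$HOME" "$HOME/.totalview"; then
  echo "OK: \$HOME/.totalview shares a file system with \$HOME"
else
  echo "check mounts: \$HOME/.totalview may not be on a shared file system" >&2
fi
```

Note that this only verifies that two paths are on one file system; whether that file system is actually mounted on the compute nodes is a site-specific question.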
Starting TotalView
TotalView typically runs interactively. If your site has not designated any compute nodes for interactive processing, you can allocate compute nodes for interactive use. Use PBS Pro's qsub -I or SLURM's salloc command to allocate an interactive job. Be sure that your X11 DISPLAY environment variable is propagated or set properly. See "man qsub" or "man salloc" for more information on interactive jobs.
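The allocation commands and a DISPLAY check might look like the following sketch. The node counts and X11 options are assumptions to adapt to your site (PBS Pro's -X and SLURM's --x11 both depend on site configuration):

```shell
# Illustrative interactive allocations; adjust options for your site.
#
#   qsub -I -X -l nodes=4     # PBS Pro: -X requests X11 forwarding
#   salloc --x11 -N 4         # SLURM: --x11 requires site support
#
# Before launching the TotalView GUI, confirm DISPLAY is set:
check_display() {
  if [ -n "${DISPLAY:-}" ]; then
    echo "DISPLAY=$DISPLAY"
  else
    echo "DISPLAY is not set; the TotalView GUI cannot open" >&2
    return 1
  fi
}
# Call check_display inside the allocation, before running totalview.
```

If check_display fails, fix your X11 forwarding (for example, ssh -X to the login node) before starting the GUI, or use the CLI instead.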
If TotalView is installed on your system, load it into your user environment:
module load totalview
Use one of the following commands to start TotalView, where mpi_starter is the MPI starter program for your system, such as aprun or srun.
The CLI:
totalviewcli -tv_options -args mpi_starter [mpi_options] application_name [application_arguments]
The GUI:
totalview -tv_options -args mpi_starter [mpi_options] application_name [application_arguments]
When using ALPS, TotalView is not able to stop your program before it calls MPI_Init(). While this call is typically near the beginning of main(), its actual location depends on how the program is written. As a result, a breakpoint set before the MPI_Init() call will never be hit, because the statement it is set on will have already executed. SLURM, by contrast, stops your program before it enters main(), which allows you to debug statements that run before MPI_Init() is called.
Example 1: Interactive Jobs Using qsub and aprun
This example shows how you can start TotalView on a program named a.out running in an interactive job using qsub and aprun.
n9610@crystal:~> qsub -I -X -l nodes=4
qsub: waiting for job 599856.sdb to start
qsub: job 599856.sdb ready
Directory: /home/users/n9610
Mon Nov 4 17:23:28 CST 2019
n9610@crystal2:~> cd shared/rctest
n9610@crystal2:~/shared/rctest> module load totalview
n9610@crystal2:~/shared/rctest> totalview -verbosity errors -args aprun -n 4 ./a.out
Example 2: Interactive Jobs Using salloc and srun
If your Cray system uses SLURM, you can similarly debug an interactive job using salloc and srun.
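An interactive SLURM session might look like the following sketch. The node count, task count, and program name (./a.out) are assumptions, not site-specific values:

```shell
# Illustrative interactive SLURM session:
#
#   salloc -N 4                        # allocate 4 nodes interactively
#   module load totalview
#   totalview -args srun -n 4 ./a.out  # debug under TotalView
#
# If you script the session, building the launch line in a variable
# keeps the starter options in one place:
NTASKS=4
APP=./a.out
LAUNCH="totalview -args srun -n ${NTASKS} ${APP}"
echo "${LAUNCH}"
```

As with aprun, everything after -args is passed through: TotalView debugs the srun starter, which in turn launches the MPI ranks.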
Example 3: Batch Jobs Using sbatch and tvconnect
This example shows how to submit a SLURM batch job that uses tvconnect in the batch script. After the batch job starts running, start TotalView to accept the reverse-connect request.
n9610@jupiter-elogin:~/shared/rctest> cat slurm-script.bash
#!/bin/bash -x
#SBATCH --qos=debug
#SBATCH --time=00:30:00
#SBATCH --nodes=4
#SBATCH --tasks-per-node=1
#SBATCH --constraint=haswell
module load totalview
tvconnect srun -n 4 tx_basic_mpi
n9610@jupiter-elogin:~/shared/rctest> sbatch -C BW28 slurm-script.bash
Submitted batch job 1374150
n9610@jupiter-elogin:~/shared/rctest> squeue -u $USER
JOBID USER ACCOUNT NAME ST REASON START_TIME TIME TIME_LEFT NODES CPUS
1374150 n9610 (null) slurm-script.b R None 2019-11-05T09:31:53 0:16 29:44 4 8
n9610@jupiter-elogin:~/shared/rctest> module load totalview
n9610@jupiter-elogin:~/shared/rctest> totalview