Cray XT/XE/XK/XC Applications

The Cray XT/XE/XK/XC series of supercomputers is supported by the TotalView Linux x86_64 and Linux ARM64 (aarch64) distributions. The discussion here is based on running applications using the Cray Linux Environment (CLE). TotalView supports launching application programs using either PBS Pro with ALPS aprun or SLURM srun.

Related topics

Using MRNet on Cray Computers

Setting up a parallel debugging session using the Session Editor

Debug a Parallel Program

Starting TotalView on Cray

Because Cray system configurations vary from site to site, this section provides only general guidelines for starting TotalView on your application. Consult your site's documentation for the specific steps needed to debug a program with TotalView on your Cray system.

File System Considerations

Place the application you want to debug on a file system that is shared across all Cray node types, such as the service, elogin, login/MOM, and compute nodes. This allows you to compile and debug your application from any node type.

Further, make sure that your $HOME/.totalview directory is on a file system shared across all Cray node types, so that you can use the tvconnect feature in batch scripts. For more information, see Reverse Connections.

Starting TotalView

TotalView typically runs interactively. If your site has not designated any compute nodes for interactive processing, you can allocate compute nodes for interactive use. Use PBS Pro's qsub -I or SLURM's salloc command to allocate an interactive job. Be sure that your X11 DISPLAY environment variable is propagated or set properly. See "man qsub" or "man salloc" for more information on interactive jobs.
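As a sketch, an interactive SLURM allocation might look like the following; the node count and time limit are illustrative only, and X11 forwarding setup varies by site.

```shell
# Allocate 4 nodes interactively for 30 minutes (values are illustrative).
salloc --nodes=4 --time=00:30:00
# Inside the allocation, confirm X11 forwarding is available before
# launching the TotalView GUI:
echo $DISPLAY
```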

If TotalView is installed on your system, load it into your user environment:

module load totalview

Use one of the following commands to start TotalView, where mpi_starter is the MPI starter program for your system, such as aprun or srun.

The CLI:

totalviewcli -tv_options -args mpi_starter [mpi_options] application_name [application_arguments]

The GUI:

totalview -tv_options -args mpi_starter [mpi_options] application_name [application_arguments]

TotalView is not able to stop your program before it calls MPI_Init() when using ALPS. While this is typically at the beginning of main(), the actual location depends on how you’ve written the program. This means that if you set a breakpoint before the MPI_Init() call, your program will not hit it because the statement upon which you set the breakpoint will have already executed. On the other hand, SLURM will stop your program before it enters main(), which allows you to debug the statements before MPI_Init() is called.

Example 1: Interactive Jobs Using qsub and aprun

This example shows how you can start TotalView on a program named a.out running in an interactive job using qsub and aprun.

n9610@crystal:~> qsub -I -X -l nodes=4

qsub: waiting for job 599856.sdb to start

qsub: job 599856.sdb ready

Directory: /home/users/n9610

Mon Nov 4 17:23:28 CST 2019

n9610@crystal2:~> cd shared/rctest

n9610@crystal2:~/shared/rctest> module load totalview

n9610@crystal2:~/shared/rctest> totalview -verbosity errors -args aprun -n 4 ./a.out

Example 2: Interactive Jobs Using salloc and srun

Similarly, you can debug an interactive job using salloc and srun if your Cray system uses SLURM.
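The interactive salloc workflow can be sketched as follows; the node count and program name are illustrative, and X11 forwarding options vary by site and SLURM version.

```shell
# Allocate nodes interactively, then run TotalView on srun (illustrative values).
salloc --nodes=4
module load totalview
totalview -args srun -n 4 ./a.out
```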

The following example shows how to submit a SLURM batch job that uses tvconnect in the batch script. After the batch job starts running, start TotalView to accept the reverse-connect request.

n9610@jupiter-elogin:~/shared/rctest> cat slurm-script.bash

#!/bin/bash -x

#SBATCH --qos=debug

#SBATCH --time=00:30:00

#SBATCH --nodes=4

#SBATCH --tasks-per-node=1

#SBATCH --constraint=haswell

module load totalview

tvconnect srun -n 4 tx_basic_mpi

n9610@jupiter-elogin:~/shared/rctest> sbatch -C BW28 slurm-script.bash

Submitted batch job 1374150

n9610@jupiter-elogin:~/shared/rctest> squeue -u $USER

  JOBID  USER ACCOUNT           NAME ST REASON          START_TIME TIME TIME_LEFT NODES CPUS
1374150 n9610  (null) slurm-script.b  R   None 2019-11-05T09:31:53 0:16     29:44     4    8

n9610@jupiter-elogin:~/shared/rctest> module load totalview

n9610@jupiter-elogin:~/shared/rctest> totalview

Support for Cray Abnormal Termination Processing (ATP)

Cray's ATP module stops a running job at the moment it crashes, which allows you to attach TotalView to the held job and begin debugging it. To hold a job as it crashes, set the ATP_HOLD_TIME environment variable before launching your job with aprun or srun. See Parallel Attach Behaviors for more information on attaching to processes.
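For example, the variable can be exported before the launch; the 300-second hold time below is illustrative only, so consult your site's ATP documentation for an appropriate value.

```shell
# Hold a crashing job for 300 seconds so TotalView can attach
# (the value is illustrative).
export ATP_HOLD_TIME=300
# Then launch the job as usual, e.g.:
#   srun -n 4 ./a.out      (or: aprun -n 4 ./a.out)
echo "ATP_HOLD_TIME=$ATP_HOLD_TIME"
```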

When your job crashes, the MPI starter process outputs a message stating that your job has crashed and that ATP is holding it. You can now attach TotalView to the aprun or srun process using the normal attach procedure.

For more information on ATP, see the Cray intro_atp man page.

Special Requirements for Using ReplayEngine

On Cray x86_64 systems, the MPI implementations use RDMA techniques similar to those used by InfiniBand MPIs. When using ReplayEngine on MPI programs, certain environment variable settings must be in effect for the MPI rank processes. These settings ensure that memory-mapping operations are visible to ReplayEngine. The required environment variable settings are:

MPICH_SMP_SINGLE_COPY_OFF=1

LD_PRELOAD: Set to include a preload library, which can be found under the TotalView installation directory at toolworks/totalview.<version>/linux-x86-64/lib/libundodb_infiniband_preload_x64.so.

When using ALPS, these settings may be applied with the aprun -e option. For example, to have TotalView launch an MPI program with ReplayEngine enabled, use a command similar to this:

totalview -replay -args aprun -n 8 \
  -e MPICH_SMP_SINGLE_COPY_OFF=1 \
  -e LD_PRELOAD=/<path>/libundodb_infiniband_preload_x64.so \
  myprogram

When using SLURM, these settings may be applied with the srun --export option. For example:

totalview -replay -args srun -n 8 \
  --export=ALL,MPICH_SMP_SINGLE_COPY_OFF=1,LD_PRELOAD=/<path>/libundodb_infiniband_preload_x64.so \
  myprogram