The organization of modern HPC systems often makes it difficult to deploy tools such as TotalView. For example, the compute nodes in a cluster may not have access to any X libraries or X forwarding, so launching a GUI on a compute node is not possible.
Using the Reverse Connect feature, you can run the TotalView UI on a front-end node to debug a job executing on compute nodes.
The basic process is to embed the tvconnect command in a batch script. When the batch job runs, the tvconnect process connects with the TotalView client, which then starts the debugger server process on the batch node. The TotalView client typically runs on a front-end node, where the application is built and batch jobs are submitted.
About Reverse Connections
When using reverse connect, TotalView is started in two stages:
1. Run the tvconnect command to create a debugging request, typically from a batch job on a batch node or compute node in a cluster. The tvconnect command accepts the name of the program to debug, along with any arguments to pass to the program.
The tvconnect process blocks until a TotalView session accepts or rejects the request.
2. Start TotalView on another node, which is typically a front-end node. When the UI opens, TotalView looks for a request, and if it finds one, confirms via a pop-up that the user wants to accept it. If the request is accepted, TotalView starts a debugger server on the node where the tvconnect process is running, and loads the program that was passed to the tvconnect command. If the request is rejected, the tvconnect process exits with an error.
At that point, debug the program in the normal way within the TotalView UI.
The process works as follows:
Figure 242 – Reverse connect flow
Typically, a tvconnect command is added to a batch script, placed in front of the command to debug. For example:
tvconnect srun -n4 myMPIprogram
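As a rough sketch of how this line might sit inside a complete batch script, the following assumes a Slurm scheduler (suggested by the srun command above). The launch helper and the DEBUG_WITH_TOTALVIEW variable are hypothetical names introduced here for illustration and are not part of TotalView. The resource values in the #SBATCH lines are placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=reverse-connect-demo   # placeholder job name
#SBATCH --ntasks=4                        # matches "srun -n4" below
#SBATCH --time=00:30:00                   # placeholder time limit

# Hypothetical helper: prefix the command with tvconnect only when
# DEBUG_WITH_TOTALVIEW=1 (an illustrative variable, not a TotalView setting).
launch() {
    if [ "${DEBUG_WITH_TOTALVIEW:-0}" = "1" ]; then
        if command -v tvconnect >/dev/null 2>&1; then
            # tvconnect blocks here until a TotalView session responds.
            tvconnect "$@"
            return
        fi
        echo "tvconnect not found on PATH; running without the debugger" >&2
    fi
    "$@"
}

# Launch the job only when running under Slurm, so the script is
# harmless if sourced or executed elsewhere.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    launch srun -n4 myMPIprogram
fi
```

With this arrangement, the same script could serve both normal and debug runs, for example by submitting it with sbatch --export=ALL,DEBUG_WITH_TOTALVIEW=1 when a debugging session is wanted.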
Once the batch script has started the tvconnect command and a TotalView front-end UI is running, the following sequence takes place:
1. The tvconnect command creates a request in the $HOME/.totalview/connect directory and blocks indefinitely until the request is either accepted or rejected. If the tvconnect process is killed with a SIGINT or SIGTERM, the tvconnect process deletes the request it created.
2. TotalView reads the request file written by the tvconnect process.
3. TotalView accepts or rejects the request, sending back a response.
4. tvconnect reads the response. If the request was rejected, tvconnect exits with an error. If it was accepted, steps 5 and 6 follow.
5. tvconnect execs tvdsvr, the command that allows TotalView to control and debug a program on a remote machine.
6. tvdsvr opens a connection to the TotalView UI. TotalView then loads the program and any program arguments, using the parameters provided to tvconnect. In this example, TotalView loads srun in order to debug the MPI job.
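Because requests are created under $HOME/.totalview/connect, it can be handy to check from the front-end node whether any tvconnect process is currently waiting. The sketch below is only an illustration: count_requests is a hypothetical helper, and it assumes that each pending request appears as one regular file in that directory. The naming and contents of request files are not described in this section, so the function does not inspect them.

```shell
#!/bin/bash
# Hypothetical helper: count regular files in the reverse connect
# request directory. Assumes one file per pending request.
count_requests() {
    local dir="${1:-$HOME/.totalview/connect}"
    local count=0
    local f

    # A missing directory simply means no requests have been created.
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi

    for f in "$dir"/*; do
        # The glob stays literal when the directory is empty, so test each entry.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Calling count_requests with no argument checks the default directory; passing a path checks that directory instead.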
NOTE >> TotalView does not look for reverse connect requests once it starts to debug a program, i.e., it automatically listens only if no other debug session is active. You can choose, however, to continue to listen for connection requests while debugging. See Listening for Reverse Connections.
Reverse connections are also supported by the CLI dload command, which has options to either accept or reject reverse connections. In addition, several command-line arguments and environment variables are available to adjust reverse connect behavior.