About the TotalView CUDA Debugger
The TotalView CUDA debugger is an integrated debugging tool that can simultaneously debug CUDA code running on the host system and on the NVIDIA® GPU. CUDA support is an extension to the standard version of TotalView and supports debugging 64-bit CUDA programs. Debugging 32-bit CUDA programs is currently not supported.
Major supported features:

Debug CUDA applications running directly on GPU hardware

Set breakpoints, pause execution, and single step in GPU code

View GPU variables in PTX registers and in local, parameter, global, or shared memory

Access runtime variables such as threadIdx, blockIdx, and blockDim

Debug multiple GPU devices per process

Support for the CUDA MemoryChecker

Debug remote, distributed and clustered systems

Support for directive-based programming languages

Support for host debugging features
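To make the memory spaces and runtime variables listed above concrete, the following is a minimal sketch of a CUDA program that touches each of them. It is not taken from the TotalView documentation: the kernel name scale, the array tile, the file name scale.cu, and the sizes are hypothetical choices made only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration only: multiplies each element by
// `factor`, staging the value through shared memory so that several of the
// memory spaces named in the feature list are in use at once.
__global__ void scale(const float *in, float *out, float factor, int n)
{
    __shared__ float tile[128];                     // shared memory, one copy per block

    // Built-in runtime variables identify this thread within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];                  // global memory -> shared memory
        float v = tile[threadIdx.x] * factor;       // local variable; `factor` is a kernel parameter
        out[i] = v;                                 // local variable -> global memory
    }
}

int main()
{
    const int n = 256;
    const int threads = 128;                        // must match the size of `tile`
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<n / threads, threads>>>(d_in, d_out, 2.0f, n);

    // Report launch or execution failures instead of silently continuing.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(host, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("host[10] = %f\n", host[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

With NVIDIA's nvcc compiler, a build such as nvcc -g -G -o scale scale.cu generates debug information for both host code (-g) and device code (-G). With a breakpoint inside scale, the variables i, v, tile, in, out, and factor are examples of the kind of state that the features above refer to. Any build requirements specific to TotalView should be confirmed in the TotalView documentation.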
Requirements:
The CUDA SDK and a host distribution supported by NVIDIA are required. For supported SDK versions and NVIDIA GPUs, see the TotalView Supported Platforms guide.