Performance Issues
High TLB miss rates with certain multi-threaded target programs
When reverse debugging an application in which many threads make frequent system calls on a multi-processor platform, binding the application process to a single processor can improve performance. This is because such applications put stress on ReplayEngine's heap management, which in turn stresses the processor's TLB (translation lookaside buffer). If the application is bound to a single processor, it is less likely to suffer TLB misses caused by process migration. Since user threads are automatically serialized during reverse debugging, there is no loss of concurrency due to binding.
If the application is to be launched under TotalView, one way to accomplish binding is to preface the TotalView command with a taskset(1) command specifying a single processor. For example:
taskset --cpu-list 3 totalview -replay myapp
To accomplish binding when TotalView is to be attached to a running application, find the PID (process identifier) of the application process, and use taskset to bind that process to a single processor before attaching to it with TotalView. For example:
taskset --pid --cpu-list 3 <PID of myapp>
We have noticed the need for such binding when debugging MySQL applications with ReplayEngine.
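The attach case above can be sketched as a small shell helper that checks its arguments before calling taskset. The function name bind_pid_to_cpu is hypothetical and is not part of TotalView or util-linux; the taskset invocation itself is the same one shown earlier.

```shell
#!/bin/sh
# Sketch: bind an already-running process to one CPU before attaching TotalView.
# bind_pid_to_cpu is a hypothetical helper name used only for illustration.
bind_pid_to_cpu() {
    pid=$1
    cpu=$2
    # Reject anything that is not a plain non-negative integer.
    case "$pid" in
        ''|*[!0-9]*) echo "bad PID: $pid" >&2; return 2 ;;
    esac
    case "$cpu" in
        ''|*[!0-9]*) echo "bad CPU: $cpu" >&2; return 2 ;;
    esac
    # Same command as in the text, with the PID and CPU filled in.
    taskset --pid --cpu-list "$cpu" "$pid"
}

# Example use, assuming exactly one process named "myapp" is running:
#   bind_pid_to_cpu "$(pgrep -x myapp)" 3
# To confirm the result, print the current affinity of the process:
#   taskset --pid <PID of myapp>
```

If more than one process matches the name, pgrep prints several PIDs and the helper rejects the argument, so in that case pick the PID by hand.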
Avoiding self-contention in OpenMP target programs
Because threads are serialized during reverse debugging, OpenMP implementations that use non-yielding spins for synchronization can experience self-contention, resulting in poor performance. ReplayEngine has internal knowledge of several OpenMP implementations and tries to avoid this situation. Since this aspect of OpenMP is somewhat loosely standardized, however, ReplayEngine may not always be able to avoid self-contention.
For the Portland Group compilers in particular, ReplayEngine uses environment variables to avoid self-contention. It inserts the settings OMP_WAIT_POLICY=ACTIVE and MP_SPIN=0 into the environment. The effects are, respectively, to cause idle threads to wait using a semaphore check loop, and to cause the semaphore check loop to call sched_yield in every iteration. If the user has pre-set either of these environment variables, ReplayEngine will not alter the settings.
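Because pre-set values take precedence, it can help to check what the environment already contains before launching the debugger. The following is a minimal sketch; show_openmp_wait_env is a hypothetical helper that only inspects the two variables named above and does not reproduce ReplayEngine's own logic.

```shell
#!/bin/sh
# Sketch: report whether OMP_WAIT_POLICY and MP_SPIN are already exported.
# show_openmp_wait_env is a hypothetical helper name used only for illustration.
show_openmp_wait_env() {
    for name in OMP_WAIT_POLICY MP_SPIN; do
        # printenv exits non-zero when the variable is not in the environment,
        # so shell variables that were never exported count as unset here.
        if value=$(printenv "$name"); then
            echo "$name=$value (pre-set)"
        else
            echo "$name (unset)"
        fi
    done
}

show_openmp_wait_env
```

To keep a setting of your own, export it in the shell that starts TotalView, for example `export MP_SPIN=0` followed by the usual `totalview -replay myapp` command.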