Disabling GPU ECC Memory and Enabling Persistence Mode

This post describes two settings that can improve utilization and performance on NVIDIA GPUs: disabling ECC memory and enabling Persistence Mode.

ECC Memory

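ECC introduces memory scrubbing, error detection, and correction cycles, which add overhead to memory operations. Disabling ECC can yield performance benefits, especially when the highest level of data integrity isn’t required. The gain depends on the GPU: it is typically most noticeable on GDDR-based boards, where ECC also reserves part of the memory capacity, and small on HBM-based ones.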

To disable ECC on an NVIDIA GPU, you can use the following command:

nvidia-smi -e 0

Notes:

  1. Not all NVIDIA GPUs support ECC memory. GPUs from the NVIDIA Tesla and Quadro families are typically equipped with ECC functionality.
  2. Changing the ECC mode requires root access. The new setting stays pending until the next reboot (or a GPU reset with nvidia-smi -r, where supported) and then persists across reboots. A reboot or reset terminates any running processes using the GPU, so plan accordingly when modifying this setting in a production environment.
  3. By default the command affects all GPUs in the system. Use -i <index> to target a single GPU.

To re-enable ECC, run nvidia-smi -e 1 instead.
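
The current and pending ECC modes can differ until a new setting has been applied, so it can help to compare them before and after the change. The following is a minimal sketch, assuming the installed nvidia-smi supports the ecc.mode.current and ecc.mode.pending query fields. The helper name ecc_state is hypothetical and not part of nvidia-smi.

```shell
#!/bin/sh
# ecc_state CURRENT PENDING  (hypothetical helper, not an nvidia-smi feature)
# Classify one GPU's ECC state from the values reported by the
# ecc.mode.current and ecc.mode.pending query fields.
ecc_state() {
  current=$1
  pending=$2
  case "$current" in
    Enabled|Disabled) ;;
    # GPUs without ECC support report something else, such as "[N/A]".
    *) echo "unsupported"; return 0 ;;
  esac
  if [ "$current" = "$pending" ]; then
    echo "applied ($current)"
  else
    echo "pending ($current -> $pending)"
  fi
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
             --format=csv,noheader |
  while IFS=, read -r idx current pending; do
    # CSV fields are separated by ", ", so strip the leading space.
    echo "GPU $idx: $(ecc_state "${current# }" "${pending# }")"
  done
fi
```

A GPU reported as pending still runs with its old ECC mode; the new mode is only in effect once the state reads as applied.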

Persistence Mode

Persistence Mode is disabled by default on many GPUs. While it is disabled, the driver tears down the GPU state whenever no process is using the GPU, so the next process to start must wait for the driver to reinitialize the device. Enabling Persistence Mode keeps the driver loaded and the GPU initialized even when no processes are active. This ensures quicker task launches, especially on HPC or multi-user systems where different users or processes frequently use the GPU. Persistence Mode is available on Linux only.

To enable Persistence Mode, use the following command:

nvidia-smi -pm 1

Notes:

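  1. Enabling Persistence Mode can result in slightly higher idle power consumption. However, this trade-off is generally minimal and worthwhile.
  2. The setting requires root access and does not survive a reboot, so it must be re-enabled after each boot (NVIDIA also provides the nvidia-persistenced daemon for this purpose).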

To disable Persistence Mode again, run nvidia-smi -pm 0 instead.

You can verify that ECC is disabled and Persistence Mode is enabled by running nvidia-smi and checking the Persistence-M and ECC columns:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-----------------------------------------------------------------------------|
| GPU  Name        Persistence-M  ECC     Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage |
|=============================================================================|
|   0  Tesla V100       On           Off    33C    P8    24W / 300W |  0MiB /  16384MiB |
+-----------------------------------------------------------------------------+
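
For scripted checks, the same information can be read in CSV form instead of parsing the table. This is a rough sketch, assuming the persistence_mode and ecc.mode.current query fields are available in the installed driver. The function name check_modes is hypothetical.

```shell
#!/bin/sh
# check_modes (hypothetical helper) reads lines of the form
#   index, persistence_mode, ecc.mode.current
# on stdin and reports whether each GPU matches the target setup
# (Persistence Mode enabled, ECC disabled).
# It returns 1 if any GPU does not match.
check_modes() {
  rc=0
  while IFS=, read -r idx pm ecc; do
    # CSV fields are separated by ", ", so strip the leading space.
    pm=${pm# }
    ecc=${ecc# }
    if [ "$pm" = "Enabled" ] && [ "$ecc" = "Disabled" ]; then
      echo "GPU $idx: OK"
    else
      echo "GPU $idx: persistence=$pm ecc=$ecc"
      rc=1
    fi
  done
  return $rc
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,persistence_mode,ecc.mode.current \
             --format=csv,noheader | check_modes
fi
```

Note that a GPU without ECC support reports a value other than Disabled in the ECC field, so this sketch flags it as a mismatch; adjust the comparison if that case should count as acceptable.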

Author: Serdar Acir