GPU monitoring

Monitoring GPU performance is critical for making full use of the hardware and for identifying bottlenecks and issues.

JupyterLab NVDashboard

NVDashboard is a JupyterLab extension for displaying NVIDIA GPU usage dashboards. It enables users to visualize system hardware metrics within the same interactive environment they use for development and data analysis.

Supported metrics include:

  • GPU compute utilization

  • GPU memory consumption

  • PCIe throughput

  • NVLink throughput

Installation

Select Terminal from JupyterLab

Install the package from PyPI:

pip install jupyterlab_nvdashboard

Restart the JupyterLab application from the Denvr Cloud Dashboard to enable the extension.
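
You can also confirm that the extension is registered by listing the installed JupyterLab extensions from a Terminal (the exact name shown may differ slightly from the PyPI package name):

jupyter labextension list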

Linux command-line

NVIDIA drivers come preinstalled with the nvidia-smi command, which reports performance metrics for the GPUs. The examples in this section were run on a single A100 MIG 10 GB instance.
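
Running nvidia-smi with no arguments prints a one-time snapshot showing the driver version and, for each GPU, its utilization, memory usage, temperature, power draw, and running processes:

nvidia-smi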

You can refresh this view continuously by running:

nvidia-smi -l

Another technique is to use the watch command to run nvidia-smi every X seconds, clearing the screen each time before displaying new output:

watch -n 5 nvidia-smi

The nvidia-smi dmon command can also stream output in a tabular format, which is better suited for logging and automated monitoring.

The Linux man page has additional information on how to configure nvidia-smi dmon to select specific output columns and to produce CSV-style formatting.
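
For example, the following invocation (a sketch; adjust the flags to the metrics you care about) samples utilization and memory statistics every 5 seconds for 12 samples:

nvidia-smi dmon -s um -d 5 -c 12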

Prometheus exporter

NVIDIA provides a tool called DCGM-Exporter, which streams GPU metrics into Prometheus. The exporter should be installed as a service by system administrators according to your deployment preferences (Kubernetes, Docker, or a local install).

$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04

You can verify metrics as follows:

$ curl localhost:9400/metrics
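
To spot-check an individual counter, you can filter the output; for example, DCGM_FI_DEV_GPU_UTIL is one of the GPU utilization fields exported by the default counter configuration (the exact field set depends on how the exporter is configured):

$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

Once the endpoint responds, point your Prometheus server's scrape configuration at port 9400 to start collecting these metrics.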
