Nvidia GPU monitoring with Netdata

Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the nvidia-smi CLI tool.

Requirements and Notes

  • You must have the nvidia-smi tool installed, and your NVIDIA GPU(s) must support it. Support is mostly limited to newer high-end models used for AI/ML and crypto, or the Pro range. Read more about nvidia_smi.

  • You must enable this plugin, as it's disabled by default due to minor performance issues:

    cd /etc/netdata # Replace this path with your Netdata config directory, if different
    sudo ./edit-config python.d.conf

    Remove the '#' before nvidia_smi so it reads: nvidia_smi: yes.

  • On some systems, the nvidia-smi tool unloads when the GPU is idle, which adds latency the next time it is queried. If your GPUs run under constant workload, this is unlikely to be an issue.

  • Currently the nvidia-smi tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this latency issue. See the discussion here: https://github.com/netdata/netdata/pull/4357

  • Contributions are welcome.

  • Make sure the netdata user can execute /usr/bin/nvidia-smi, or wherever your binary is located.

  • If the nvidia-smi process is not killed after a Netdata restart, turn off loop_mode.

  • poll_seconds is an integer that sets how often, in seconds, the tool is polled.
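
The permission check and the loop_mode / poll_seconds options above can be illustrated with a small sketch. This is not the plugin's code: find_nvidia_smi and build_command are hypothetical helpers, and the way the plugin combines nvidia-smi's -q (query), -x (XML output) and -l (loop interval) flags is an assumption made for illustration.

```python
import os
import shutil


def find_nvidia_smi(candidates=("/usr/bin/nvidia-smi",)):
    """Return a path to nvidia-smi that the current user can execute, or None.

    Hypothetical helper: checks the given candidate paths first, then PATH.
    """
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return shutil.which("nvidia-smi")


def build_command(binary, loop_mode=True, poll_seconds=1):
    """Build an nvidia-smi command line (assumed shape, for illustration)."""
    # poll_seconds must be a positive integer number of seconds.
    if isinstance(poll_seconds, bool) or not isinstance(poll_seconds, int):
        raise ValueError("poll_seconds must be an integer")
    if poll_seconds < 1:
        raise ValueError("poll_seconds must be at least 1")
    cmd = [binary, "-q", "-x"]
    if loop_mode:
        # In loop mode a single long-lived process reports every N seconds.
        cmd += ["-l", str(poll_seconds)]
    return cmd


if __name__ == "__main__":
    binary = find_nvidia_smi()
    if binary is None:
        print("nvidia-smi not found or not executable by this user")
    else:
        print(" ".join(build_command(binary, loop_mode=True, poll_seconds=1)))
```

Running the script as the netdata user (for example with sudo -u netdata) shows whether that user can reach the binary. With loop_mode disabled, the sketch builds a one-shot command instead, so no long-lived nvidia-smi process is left behind.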

Charts

It produces the following charts:

  • PCI Express Bandwidth Utilization in KiB/s
  • Fan Speed in percentage
  • GPU Utilization in percentage
  • Memory Bandwidth Utilization in percentage
  • Encoder/Decoder Utilization in percentage
  • Memory Usage in MiB
  • Temperature in Celsius
  • Clock Frequencies in MHz
  • Power Utilization in Watts
  • Memory Used by Each Process in MiB
  • Memory Used by Each User in MiB
  • Number of Users on GPU
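
To give a rough idea of where such chart values can come from, the sketch below parses a hand-written, heavily trimmed sample in the style of nvidia-smi's XML output. The sample values, the FIELDS mapping and the parse_gpus helper are illustrative assumptions; element names can differ between driver versions, and this is not the plugin's parser.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; real output is much larger.
SAMPLE_XML = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <fan_speed>35 %</fan_speed>
    <fb_memory_usage>
      <total>8192 MiB</total>
      <used>1024 MiB</used>
    </fb_memory_usage>
    <utilization>
      <gpu_util>12 %</gpu_util>
      <memory_util>5 %</memory_util>
    </utilization>
    <temperature>
      <gpu_temp>48 C</gpu_temp>
    </temperature>
    <power_readings>
      <power_draw>N/A</power_draw>
    </power_readings>
  </gpu>
</nvidia_smi_log>"""

# Hypothetical metric names mapped to element paths inside each <gpu>.
FIELDS = {
    "fan_speed_percent": "fan_speed",
    "gpu_utilization_percent": "utilization/gpu_util",
    "memory_bandwidth_percent": "utilization/memory_util",
    "memory_used_mib": "fb_memory_usage/used",
    "temperature_celsius": "temperature/gpu_temp",
    "power_watts": "power_readings/power_draw",
}


def number(text):
    """Drop the unit suffix ('48 C' -> 48.0); return None for missing or N/A."""
    if not text or not text.split():
        return None
    try:
        return float(text.split()[0])
    except ValueError:
        return None


def parse_gpus(xml_text):
    """Return one dict of numeric metrics per <gpu> element."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.findall("gpu"):
        metrics = {name: number(gpu.findtext(path)) for name, path in FIELDS.items()}
        metrics["id"] = gpu.get("id")
        gpus.append(metrics)
    return gpus


if __name__ == "__main__":
    for gpu in parse_gpus(SAMPLE_XML):
        print(gpu)
```

Values a GPU does not report (shown as N/A in the sample) come back as None, so a chart for that metric can simply be skipped.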

Configuration

Edit the python.d/nvidia_smi.conf configuration file using edit-config from the Netdata config directory, which is typically at /etc/netdata.

cd /etc/netdata # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf

Sample:

loop_mode : yes
poll_seconds : 1
exclude_zero_memory_users : yes
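
This page does not describe exclude_zero_memory_users further. Going by its name, it plausibly drops users whose processes hold no GPU memory from the per-user charts. The sketch below illustrates that reading only; memory_by_user and the process records are hypothetical and not taken from the plugin.

```python
from collections import defaultdict


def memory_by_user(processes, exclude_zero_memory_users=True):
    """Sum GPU memory (MiB) per user; optionally drop users with a zero total.

    `processes` is a list of dicts with hypothetical keys 'user' and 'used_mib'.
    """
    totals = defaultdict(int)
    for proc in processes:
        totals[proc["user"]] += proc["used_mib"]
    if exclude_zero_memory_users:
        return {user: mib for user, mib in totals.items() if mib > 0}
    return dict(totals)


if __name__ == "__main__":
    # Made-up process list for illustration.
    processes = [
        {"user": "alice", "used_mib": 512},
        {"user": "alice", "used_mib": 256},
        {"user": "bob", "used_mib": 0},
    ]
    print(memory_by_user(processes, exclude_zero_memory_users=True))
    print(memory_by_user(processes, exclude_zero_memory_users=False))
```

Under this interpretation, setting the option to yes keeps the per-user memory chart and the user count free of entries that would only ever show zero.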

