Nvidia GPU monitoring with Netdata
Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.
#
Requirements and Notes

- You must have the `nvidia-smi` tool installed, and your NVIDIA GPU(s) must support it. Support is mostly limited to the newer high-end models used for AI/ML and crypto, and to the Pro range; read more about nvidia_smi.
- You must enable this plugin, as it is disabled by default due to minor performance issues. Remove the '#' before `nvidia_smi` in `python.d.conf` so it reads: `nvidia_smi: yes`.
- On some systems, when the GPU is idle the `nvidia-smi` tool unloads, which adds latency the next time it is queried. If you are running GPUs under constant workload, this is unlikely to be an issue.
- Currently the `nvidia-smi` tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See the discussion here: https://github.com/netdata/netdata/pull/4357
- Contributions are welcome.
- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi`, or wherever your binary is located.
- If the `nvidia-smi` process is not killed after a Netdata restart, you need to turn off `loop_mode`.
- `poll_seconds` is how often, in seconds, the tool is polled, given as an integer.
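
The requirement that the `netdata` user can execute the binary can be checked with a short script. The following is a minimal sketch, not part of the plugin: the helper name `resolve_nvidia_smi` is hypothetical, and the default path is simply the one mentioned above. It reports whether the user running the script can execute `nvidia-smi`, so it only answers the question if it is run as the `netdata` user.

```python
import os
import shutil


def resolve_nvidia_smi(path="/usr/bin/nvidia-smi"):
    """Return a path to nvidia-smi that the current user can execute, or None.

    `path` is an assumed default location; pass your own if the binary
    lives elsewhere.
    """
    # Prefer the explicit path if it is a file the current user may execute.
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    # Otherwise fall back to a PATH lookup; shutil.which returns None
    # when no executable with that name is found.
    return shutil.which("nvidia-smi")


found = resolve_nvidia_smi()
print(found or "nvidia-smi is not executable by this user")
```

Saved under a name of your choice (for example a hypothetical `check_nvidia_smi.py`), it could be run as `sudo -u netdata python3 check_nvidia_smi.py`.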
#
Charts

It produces the following charts:
- PCI Express Bandwidth Utilization in `KiB/s`
- Fan Speed in `percentage`
- GPU Utilization in `percentage`
- Memory Bandwidth Utilization in `percentage`
- Encoder/Decoder Utilization in `percentage`
- Memory Usage in `MiB`
- Temperature in `celsius`
- Clock Frequencies in `MHz`
- Power Utilization in `Watts`
- Memory Used by Each Process in `MiB`
- Memory Used by Each User in `MiB`
- Number of Users on GPU in `num`
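
To illustrate where values like these can come from, here is a rough sketch of parsing XML output of the kind `nvidia-smi -q -x` prints. It is not the plugin's actual code. The sample document is hand-written and heavily abbreviated, the element names are assumptions that may differ between driver versions, and the helpers `leading_number` and `parse_gpus` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample (an assumption for illustration);
# real output contains many more fields.
SAMPLE_XML = """
<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <fan_speed>35 %</fan_speed>
    <fb_memory_usage>
      <total>8192 MiB</total>
      <used>1024 MiB</used>
    </fb_memory_usage>
    <utilization>
      <gpu_util>12 %</gpu_util>
    </utilization>
    <temperature>
      <gpu_temp>47 C</gpu_temp>
    </temperature>
  </gpu>
</nvidia_smi_log>
"""


def leading_number(text):
    """Return the numeric part of a value such as '35 %', or None (e.g. for 'N/A')."""
    if not text:
        return None
    parts = text.split()
    if not parts:
        return None
    try:
        return float(parts[0])
    except ValueError:
        return None


def parse_gpus(xml_text):
    """Extract a few per-GPU values from an nvidia-smi style XML document."""
    root = ET.fromstring(xml_text.strip())
    gpus = []
    for gpu in root.findall("gpu"):
        gpus.append({
            "id": gpu.get("id"),
            "fan_speed_percent": leading_number(gpu.findtext("fan_speed")),
            "gpu_util_percent": leading_number(gpu.findtext("utilization/gpu_util")),
            "memory_used_mib": leading_number(gpu.findtext("fb_memory_usage/used")),
            "temperature_celsius": leading_number(gpu.findtext("temperature/gpu_temp")),
        })
    return gpus


for gpu in parse_gpus(SAMPLE_XML):
    print(gpu)
```

Fields that a GPU does not report typically appear as text such as `N/A`; the sketch maps those to `None` rather than guessing a value.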
#
Configuration

Edit the `python.d/nvidia_smi.conf` configuration file using `edit-config` from the Netdata config directory, which is typically at `/etc/netdata`.
Sample: