Anomaly detection with Netdata
This collector uses the Python PyOD library to perform unsupervised anomaly detection on your Netdata charts and/or dimensions.
Rather than only collecting data, this collector also does some computation on the data it collects to return an anomaly probability and anomaly flag for each chart or custom model you define. This computation consists of a train function that runs periodically (controlled by train_every_n) on the most recent train_n_secs seconds of data to train the ML models to learn what 'normal' typically looks like on your node. At each iteration there is also a predict function that uses the latest trained models and most recent metrics to produce an anomaly probability and anomaly flag for each chart or custom model you define.
As this is a somewhat unique collector and involves often subjective concepts like anomalies and anomaly probabilities, we would love to hear any feedback on it from the community. Please let us know on the community forum or drop us a note at analytics-ml-team@netdata.cloud for any and all feedback, both positive and negative. This sort of feedback is priceless to help us make complex features more useful.
#
Charts
Two charts are produced:
- Anomaly Probability (anomalies.probability): This chart shows the probability that the latest observed data is anomalous based on the trained model for that chart (using the predict_proba() method of the trained PyOD model).
- Anomaly (anomalies.anomaly): This chart shows 1 or 0 predictions of whether the latest observed data is considered anomalous or not based on the trained model (using the predict() method of the trained PyOD model).
Below is an example of the charts produced by this collector and how they might look when things are 'normal' on the node. The anomaly probabilities tend to bounce randomly around a typically low probability range. One or two might randomly jump or drift outside of this range every now and then and show up as anomalies on the anomaly chart.
If we then go onto the system and run a command like stress-ng --all 2 to create some stress, we see some charts begin to have anomaly probabilities that jump outside the typical range. When the anomaly probabilities change enough, we will start seeing anomalies being flagged on the anomalies.anomaly chart. The idea is that these charts are the most anomalous right now, so they could be a good place to start your troubleshooting.
Then, as the issue passes, the anomaly probabilities should settle back down into their 'normal' range again.
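To make the relationship between the two charts more concrete, here is a toy sketch in plain Python. It is not the collector's code and it is not PyOD. It assumes a hypothetical raw anomaly score per model, rescales that score against the scores seen during training to get a value in [0, 1], and raises a flag when the value crosses a threshold. The helper names to_probability and to_flag and the 0.9 threshold are illustrative assumptions; PyOD's own predict_proba() and predict() may compute their outputs differently.

```python
def to_probability(score, train_scores):
    """Min-max rescale a raw score against training scores, clipped to [0, 1]."""
    lo, hi = min(train_scores), max(train_scores)
    if hi == lo:
        # All training scores identical: nothing to scale against.
        return 0.0
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))


def to_flag(probability, threshold=0.9):
    """Return 1 when the probability reaches the (assumed) threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical raw scores observed while the node was behaving normally.
train_scores = [0.8, 1.0, 1.2, 1.1, 0.9, 2.0]

for latest in (1.0, 1.95, 3.5):
    p = to_probability(latest, train_scores)
    print(f"score={latest} probability={p:.3f} flag={to_flag(p)}")
```

A score well inside the training range yields a low probability and a flag of 0, while a score far beyond anything seen in training is clipped to a probability of 1 and flagged.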
#
Requirements
- This collector will only work with Python 3 and requires the packages below to be installed.
- Typically you will not need to do this, but, if needed, to ensure Python 3 is used you can add the below line to the [plugin:python.d] section of netdata.conf.
Install the required Python libraries.
#
Configuration
Install the Python requirements above, enable the collector and restart Netdata.
The configuration for the anomalies collector defines how it will behave on your system and might take some experimentation over time to set optimally for your node. Out of the box, the config comes with some sane defaults to get you started. These try to balance the flexibility and power of the ML models with the goal of being as cheap as possible in terms of node resources.
Note: If you are unsure about any of the below configuration options then it's best to just ignore all this and leave the anomalies.conf file alone to begin with. Then you can return to it later if you would like to tune things a bit more once the collector is running for a while and you have a feeling for its performance on your node.
Edit the python.d/anomalies.conf configuration file using edit-config from your agent's config directory, which is usually at /etc/netdata.
The default configuration should look something like this. Here you can see each parameter (with sane defaults) and some information about each one and what it does.
#
Custom models
In the anomalies.conf file you can also define some "custom models" which you can use to group one or more metrics into a single model, much as is done by default for the charts you specify. This is useful if you have a handful of metrics that exist in different charts but are related to the same underlying thing you would like to perform anomaly detection on, for example a specific app or user.
To define a custom model you would include configuration like below in anomalies.conf. By default there should already be some commented out examples in there.
name is a name you give your custom model. This is what will appear alongside any other specified charts in the anomalies.probability and anomalies.anomaly charts.
dimensions is a string of metrics you want to include in your custom model. By default the netdata-pandas library used to pull the data from Netdata uses a "chart.a|dim.1" type of naming convention in the pandas columns it returns, hence the dimensions string should look like "chart.name|dimension.name,chart.name|dimension.name". The examples below hopefully make this clear.
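As a quick way to sanity-check a dimensions string before putting it in anomalies.conf, the naming convention described above can be sketched in a few lines of Python. This is only an illustration of the "chart.name|dimension.name,..." format: parse_dimensions is a hypothetical helper, not part of Netdata or netdata-pandas, and the chart and dimension names in the example are placeholders you would replace with ones from your own node.

```python
def parse_dimensions(dimensions):
    """Split a 'chart|dim,chart|dim' string into (chart, dimension) pairs."""
    pairs = []
    for item in dimensions.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        chart, sep, dim = item.partition("|")
        if not sep or not chart or not dim:
            raise ValueError(f"expected 'chart.name|dimension.name', got {item!r}")
        pairs.append((chart, dim))
    return pairs


# Placeholder metric names, used here only to show the shape of the string.
example = "system.cpu|user,system.cpu|system,system.ram|used"
for chart, dim in parse_dimensions(example):
    print(chart, dim)
```

An entry without the "|" separator raises a ValueError, which is the kind of typo that is otherwise easy to miss in a long comma-separated string.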
#
Troubleshooting
To see any relevant log messages you can use a command like below.
If you would like more detail, you can log in as the netdata user and run the collector in debug mode.
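Once you have the log output in hand, training lines of the form quoted in the Notes section can be picked apart with a short script. The following is a minimal sketch that assumes the line format stays exactly as in that sample; parse_training_line is a hypothetical helper and not something Netdata ships.

```python
import re

# Sample line copied from the Notes section of this document.
LINE = (
    "2020-12-01 22:02:14: python.d INFO: anomalies[local] : training complete "
    "in 2.81 seconds (runs_counter=2700, model=pca, train_n_secs=14400, "
    "models=26, n_fit_success=26, n_fit_fails=0, after=1606845731, "
    "before=1606860131)."
)


def parse_training_line(line):
    """Return the training duration and key=value fields, or None if no match."""
    match = re.search(r"training complete in ([\d.]+) seconds \((.*)\)", line)
    if match is None:
        return None
    info = {"seconds": float(match.group(1))}
    for pair in match.group(2).split(", "):
        key, _, value = pair.partition("=")
        # Keep numeric fields as integers, everything else as text.
        info[key] = int(value) if value.isdigit() else value
    return info


info = parse_training_line(LINE)
print(info["seconds"], info["model"], info["n_fit_fails"])
# The training window in seconds, from the after/before timestamps.
print(info["before"] - info["after"])
```

For the sample line, the difference between before and after matches the train_n_secs value reported in the same line, which is a handy consistency check.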
#
Deepdive tutorial
If you would like to go deeper on what exactly the anomalies collector is doing under the hood then check out this deepdive tutorial in our community repo, where you can play around with some data from our demo servers (or your own if it's accessible to you) and work through the calculations step by step.
(Note: as it's a Jupyter Notebook it might render a little prettier on nbviewer)
#
Notes
- Python 3 is required as the netdata-pandas package uses Python async libraries (asks and trio) to make asynchronous calls to the Netdata REST API to get the required data for each chart.
- Python 3 is also required for the underlying ML libraries of numba, scikit-learn, and PyOD.
- It may take a few hours or so (depending on your choice of train_n_secs) for the collector to 'settle' into its typical behaviour in terms of the trained models and probabilities you will see in the normal running of your node.
- As this collector does most of the work in Python itself, with PyOD leveraging numba under the hood, you may want to try it out first on a test or development system to get a sense of its performance characteristics on a node similar to where you would like to use it.
- lags_n, smooth_n, and diffs_n together define the preprocessing done to the raw data before models are trained and before each prediction. This essentially creates a feature vector for each chart model (or each custom model). The default settings for these parameters aim to create a rolling matrix of recent smoothed differenced values for each chart. The aim of the model then is to score how unusual this 'matrix' of features is for each chart based on what it has learned as 'normal' from the training data. So, as opposed to just looking at the single most recent value of a dimension and considering how strange it is, this approach looks at a recent smoothed window of all dimensions for a chart (or dimensions in a custom model) and asks how unusual the data as a whole looks. This should be more flexible in capturing a wider range of anomaly types and be somewhat more robust to temporary 'spikes' in the data, which tend to always be happening somewhere in your metrics but often are not the most important type of anomaly (this is all covered in a lot more detail in the deepdive tutorial).
- You can see how long model training is taking by looking in the logs for the collector with grep 'anomalies' /var/log/netdata/error.log | grep 'training'. You should see lines like 2020-12-01 22:02:14: python.d INFO: anomalies[local] : training complete in 2.81 seconds (runs_counter=2700, model=pca, train_n_secs=14400, models=26, n_fit_success=26, n_fit_fails=0, after=1606845731, before=1606860131).
  - This also gives counts of the number of models, if any, that failed to fit and so had to default back to the DefaultModel (which is currently HBOS).
  - after and before here refer to the start and end of the training data used to train the models.
- On a development n1-standard-2 (2 vCPUs, 7.5 GB memory) VM running Ubuntu 18.04 LTS and not doing any work, some of the typical performance characteristics we saw from running this collector (with defaults) were:
  - A runtime (netdata.runtime_anomalies) of ~80ms when doing scoring and ~3 seconds when training or retraining the models.
  - Typically ~3%-3.5% additional CPU usage from scoring, jumping to ~60% for a couple of seconds during model training.
  - About 150MB of RAM (apps.mem) being continually used by the python.d.plugin.
- If you activate this collector on a fresh node, it might take a little while to build up enough data to calculate a realistic and useful model.
- Some models like iforest can be comparatively expensive (on the same n1-standard-2 system as above: ~2s runtime during predict, ~40s training time, ~50% CPU on both train and predict), so if you would like to use it you might be advised to set a relatively high update_every, maybe 10, 15 or 30, in anomalies.conf.
- Setting a higher train_every_n and update_every is an easy way to devote fewer resources on the node to anomaly detection. Specifying fewer charts and a lower train_n_secs will also help reduce resources, at the expense of covering fewer charts and maybe getting a noisier model if you set train_n_secs to be too small for how your node tends to behave.
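The preprocessing note above on lags_n, smooth_n, and diffs_n can be illustrated with a small sketch in plain Python. It rests on several assumptions: it handles a single series rather than all dimensions of a chart, it applies differencing, then smoothing, then lagging in that order, and the exact semantics of each parameter are a guess for illustration rather than the collector's actual implementation. make_features is a hypothetical helper name.

```python
def make_features(values, lags_n, smooth_n, diffs_n):
    """Build one feature vector per time step from a single series.

    Assumed semantics, for illustration only:
    diffs_n  - subtract the value seen diffs_n steps earlier
    smooth_n - rolling mean over the last smooth_n differenced values
    lags_n   - append the previous lags_n smoothed values to each row
    """
    # 1. Difference: value now minus value diffs_n steps ago.
    if diffs_n > 0:
        values = [values[i] - values[i - diffs_n] for i in range(diffs_n, len(values))]
    # 2. Smooth: rolling mean over the last smooth_n values.
    if smooth_n > 1:
        values = [
            sum(values[i - smooth_n + 1:i + 1]) / smooth_n
            for i in range(smooth_n - 1, len(values))
        ]
    # 3. Lag: each row is [current, previous, ..., lags_n steps back].
    return [values[i - lags_n:i + 1][::-1] for i in range(lags_n, len(values))]


# A made-up series with one short spike (30).
raw = [10, 12, 11, 13, 30, 12, 11, 12]
rows = make_features(raw, lags_n=2, smooth_n=2, diffs_n=1)
for row in rows:
    print(row)
```

Each row is what a model would score: a short window of smoothed changes rather than a single raw value, so the spike shows up as an unusual pattern across several consecutive rows instead of one isolated point.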
#
Useful links and further reading
- PyOD documentation, PyOD GitHub.
- Anomaly Detection Wikipedia page.
- Anomaly Detection YouTube playlist maintained by andrewm4894 from Netdata.
- awesome-TS-anomaly-detection Github list of useful tools, libraries and resources.
- Mendeley public group with some interesting anomaly detection papers we have been reading.
- Good blog post from Anodot on time series anomaly detection. Anodot also have some great whitepapers in this space too that some may find useful.
- Novelty and outlier detection in the scikit-learn documentation.