Web server log (Apache, NGINX, Squid) monitoring with Netdata

Tails the access log file and collects web server/caching proxy metrics.

Motivation#

Web server log files have existed for more than 20 years. All web servers of all kinds, from all vendors, since the time NCSA httpd was powering the web, produce log files, saving in real time all accesses to web sites and APIs.

Yet, after the appearance of Google Analytics and similar services, and the recent rise of APM (Application Performance Monitoring) with sophisticated time-series databases that collect and analyze metrics at the application level, all these web server log files are mostly just filling our disks, rotated every night without any use whatsoever.

Netdata turns this "useless" log file into a powerful performance and health monitoring tool, capable of detecting, in real time, the most common web server problems, such as:

  • too many redirects (i.e. oops! this should not redirect clients to itself)
  • too many bad requests (i.e. oops! a few files were not uploaded)
  • too many internal server errors (i.e. oops! this release crashes too much)
  • unreasonably too many requests (i.e. oops! we are under attack)
  • unreasonably few requests (i.e. oops! call the network guys)
  • unreasonably slow responses (i.e. oops! the database is slow again)
  • too few successful responses (i.e. oops! help us God!)

Usage#

If Netdata is installed on a system running a web server, it will detect it and it will automatically present a series of charts, with information obtained from the web server API, like these (these do not come from the web server log file):

Image: Netdata charts based on metrics collected by querying the nginx API (i.e. /stub_status).

Netdata supports Apache, nginx, lighttpd and Tomcat. To obtain real-time information from a web server API, the web server needs to expose it. For directions on configuring your web server, check the config files for each web server. There is a directory with a config file for each web server under /etc/netdata/python.d/.

Configuration#

Edit the python.d/web_log.conf configuration file using edit-config from the Netdata config directory, which is typically at /etc/netdata.

cd /etc/netdata # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/web_log.conf

Netdata has a powerful web_log plugin, capable of incrementally parsing any number of web server log files. This plugin is started automatically with Netdata and comes preconfigured to find web server log files on popular distributions. Its configuration is at /etc/netdata/python.d/web_log.conf, like this:

nginx_log:
  name : 'nginx_log'
  path : '/var/log/nginx/access.log'

apache_log:
  name : 'apache_log'
  path : '/var/log/apache/other_vhosts_access.log'
  categories:
    cacti     : 'cacti.*'
    observium : 'observium'

The module has preconfigured jobs for nginx, apache and gunicorn on various distros. You can add one such section for each of your web server log files.
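To illustrate what "incrementally parsing" a log file can mean, here is a minimal sketch in Python. It is not the plugin's actual code: the function name `read_new_lines` and the rotation check (treating a file that shrank as rotated) are assumptions made for this example.

```python
import os

def read_new_lines(path, offset):
    """Return (lines, new_offset) for data appended to `path` since `offset`.

    Hypothetical helper: only the bytes after `offset` are read, so each
    poll costs time proportional to the new data, not to the whole file.
    """
    size = os.path.getsize(path)
    if size < offset:
        # The file is smaller than last time: assume it was rotated or
        # truncated and start again from the beginning.
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only complete lines; a partial last line is left for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A caller would keep the returned offset between data collection iterations and pass it back on the next call.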

Important
Keep in mind netdata runs as user netdata. So, make sure user netdata has access to the logs directory and can read the log file.

Charts#

Once you have all log files configured and Netdata restarted, you will get a section on the Netdata dashboard for each log file, with the following charts.

Responses by status#

In this chart we tried to provide a meaningful status for all responses. So:

  • success counts all the valid responses (i.e. 1xx informational, 2xx successful and 304 not modified).
  • error are 5xx internal server errors. These are very bad, they mean your web site or API is facing difficulties.
  • redirect are 3xx responses, except 304. All 3xx are redirects, but 304 means "not modified" - it tells the browsers the content they already have is still valid and can be used as-is. So, we decided to account it as a successful response.
  • bad are bad requests that cannot be served.
  • other are all the other, non-standard, types of responses.

image
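The grouping described above can be sketched as a small Python function. This is only an illustration of the rules in the list, not the plugin's implementation; the function name `response_status` is made up for this example, and "bad" is assumed to mean 4xx.

```python
def response_status(code):
    """Map an HTTP status code to the dimension names used in this chart."""
    # 304 is checked first so it is counted as success, not as a redirect.
    if 100 <= code < 300 or code == 304:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "bad"
    if 500 <= code < 600:
        return "error"
    # Anything outside 100-599 is a non-standard response.
    return "other"
```

For example, `response_status(304)` gives `"success"` while `response_status(301)` gives `"redirect"`.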

Responses by type#

Then, we group all responses by code family, without interpreting their meaning.

Response by type requests/s

  • success (1xx, 2xx, 304)
  • error (5xx)
  • redirect (3xx except 304)
  • bad (4xx)
  • other (all other responses)

image

Responses by code family#

Here we show all the response codes in detail.

Response by code family requests/s

  • 1xx (informational)
  • 2xx (successful)
  • 3xx (redirect)
  • 4xx (bad)
  • 5xx (internal server errors)
  • other (non-standard responses)
  • unmatched (the lines in the log file that are not matched)

image
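As a rough sketch of how a log line could end up in one of these dimensions, including unmatched, consider the Python below. The regular expression `STATUS_RE` is a simplifying assumption for this example (it looks for a three-digit status right after the quoted request, as in the common log format); the plugin's own patterns are not shown in this document.

```python
import re

# Assumed pattern: closing quote of the request, a space, a 3-digit status, a space.
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def code_family(line):
    """Return '1xx'..'5xx', 'other', or 'unmatched' for one log line."""
    match = STATUS_RE.search(line)
    if match is None:
        # The line does not look like an access log entry at all.
        return "unmatched"
    code = int(match.group(1))
    if 100 <= code < 600:
        return f"{code // 100}xx"
    return "other"
```

A line such as `127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326` would be counted under 2xx, while a line the pattern cannot parse is counted as unmatched.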

Important
If your application is using hundreds of non-standard response codes, your browser may become slow while viewing this chart, so we have added a configuration option to disable this chart.

Detailed Response Codes#

Number of responses for each response code family individually (requests/s)

Bandwidth#

This is a nice view of the traffic the web server is receiving and is sending.

What is important to know for this chart is that the bandwidth used for each request and response is accounted at the time the log is written. Since Netdata refreshes this chart every single second, you may see unrealistic spikes if the size of the requests or responses is too big. The reason is simple: a response may have needed 1 minute to complete, but all the bandwidth used during that minute for the specific response will be accounted at the second the log line is written.
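The effect can be shown with a few lines of Python. This is a sketch with made-up numbers: `bytes_per_second` is a hypothetical helper that sums response sizes by the second in which each log line was written, which is all the information a log line carries.

```python
from collections import defaultdict

def bytes_per_second(entries):
    """Sum bytes by the second the log line was written.

    `entries` is an iterable of (written_at_second, size_in_bytes) pairs.
    """
    totals = defaultdict(int)
    for written_at, size in entries:
        totals[written_at] += size
    return dict(totals)

# A 60 MB download that took 60 seconds and finished at second 60,
# plus two small responses.
entries = [(60, 60_000_000), (60, 500), (61, 500)]
per_second = bytes_per_second(entries)
```

Here the whole 60 MB lands on second 60, even though the real transfer rate was about 1 MB/s spread over the previous minute.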

As the legend on the chart suggests, you can use FireQoS to set up QoS on the web server ports and IPs to accurately measure the bandwidth the web server is using. Actually, there may be a few more reasons to install QoS on your servers...

Bandwidth KB/s

  • received (bandwidth of requests)
  • sent (bandwidth of responses)

image

Important
Most web servers do not log the request size by default.
So, unless you have configured your web server to log the size of requests, the received dimension will be always zero.

Timings#

netdata will also render the minimum, average and maximum time the web server needed to respond to requests.

Keep in mind that most web servers measure timings from the reception of the full request until the dispatch of the last byte of the response. So, they include network latencies of responses, but they do not include network latencies of requests.

Timings ms (request processing time)

  • min (minimum response time)
  • max (maximum response time)
  • average (average response time)

image
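A minimal sketch of how the three dimensions could be derived for one data collection iteration is shown below. The function name `timing_summary` is hypothetical, and the input is assumed to be already converted to milliseconds.

```python
def timing_summary(times_ms):
    """Return min, max and average of the response times seen in one iteration."""
    if not times_ms:
        # No requests with timing information in this iteration.
        return None
    return {
        "min": min(times_ms),
        "max": max(times_ms),
        "avg": sum(times_ms) / len(times_ms),
    }
```

For instance, three requests that took 10, 20 and 60 ms give a minimum of 10, a maximum of 60 and an average of 30.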

Important
Most web servers do not log timing information by default.
So, unless you have configured your web server to also log timings, this chart will not exist.

URL patterns#

This is a very interesting chart. It is configured entirely by you.

netdata can map the URLs found in the log file into categories. You can define these categories, by providing names and regular expressions in web_log.conf.

So, this configuration:

nginx_netdata:                        # name the charts
  path: '/var/log/nginx/access.log'   # web server log file
  categories:
    badges      : '^/api/v1/badge\.svg'
    charts      : '^/api/v1/(data|chart|charts)'
    registry    : '^/api/v1/registry'
    alarms      : '^/api/v1/alarm'
    allmetrics  : '^/api/v1/allmetrics'
    api_other   : '^/api/'
    netdata_conf: '^/netdata.conf'
    api_old     : '^/(data|datasource|graph|list|all\.json)'

Produces the following chart. The categories section is matched in the order given. So, pay attention to the order you give your patterns.

image
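Why the order matters can be seen in a short Python sketch. It assumes that the first matching pattern wins, which is what "matched in the order given" suggests; `categorize` is a hypothetical helper, and only three of the patterns from the configuration above are used.

```python
import re

def categorize(url, categories):
    """Return the name of the first category whose pattern matches `url`."""
    for name, pattern in categories:
        if re.search(pattern, url):
            return name
    return None  # the URL falls into no category

CATEGORIES = [
    ("badges", r"^/api/v1/badge\.svg"),
    ("charts", r"^/api/v1/(data|chart|charts)"),
    ("api_other", r"^/api/"),
]

# With the generic pattern first, it shadows the more specific ones.
BAD_ORDER = [CATEGORIES[2], CATEGORIES[0], CATEGORIES[1]]
```

With `CATEGORIES`, `/api/v1/badge.svg` is counted under badges. With `BAD_ORDER`, the same URL is counted under api_other, because `^/api/` matches first.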

HTTP methods#

This chart breaks down requests by HTTP method used.

image

IP versions#

This one provides requests per IP version used by the clients (IPv4, IPv6).

image
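Telling IPv4 from IPv6 clients is straightforward with Python's standard `ipaddress` module. The sketch below is one way to do it, not necessarily how the plugin does; the function name `ip_version` and the "invalid" label are assumptions for this example.

```python
import ipaddress

def ip_version(client):
    """Return 'IPv4' or 'IPv6' for a client address, or 'invalid'."""
    try:
        return f"IPv{ipaddress.ip_address(client).version}"
    except ValueError:
        # Not a parsable IP address (e.g. a hostname or a garbled field).
        return "invalid"
```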

Unique clients#

The last charts are about the unique IPs accessing your web server.

Current Poll Unique Client IPs (unique ips/s): counts the unique IPs for each data collection iteration (i.e. unique clients per second).

image

All Time Unique Client IPs (unique ips/s): counts the unique IPs since the last Netdata restart.

image
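The two counters can be sketched with Python sets. This is an illustration only: the class name `UniqueClients` is hypothetical, and it keeps every IP in memory, which is the cost discussed in the note that follows.

```python
class UniqueClients:
    """Track unique client IPs per poll and since start."""

    def __init__(self):
        self.all_time = set()  # grows with every new IP seen

    def poll(self, ips):
        """Return (unique IPs in this poll, unique IPs since start)."""
        current = set(ips)
        self.all_time |= current
        return len(current), len(self.all_time)
```

Calling `poll` once per data collection iteration with the IPs parsed from the new log lines yields both chart values.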

Important
To provide this information, the web_log plugin keeps in memory all the IPs seen by the web server. Although this does not require much memory, if you have a web server with several million unique client IPs, we suggest disabling this chart.

Alarms#

The magic of Netdata is that all metrics are collected per second, and all metrics can be used or correlated to provide real-time alarms. Out of the box, Netdata automatically attaches the following alarms to all web_log charts (i.e. to all log files configured, individually):

| alarm | description | minimum requests | warning | critical |
|---|---|---|---|---|
| 1m_redirects | The ratio of HTTP redirects (3xx except 304) over all the requests, during the last minute. Detects if the site or the web API is suffering from too many or circular redirects. (i.e. oops! this should not redirect clients to itself) | 120/min | > 20% | > 30% |
| 1m_bad_requests | The ratio of HTTP bad requests (4xx) over all the requests, during the last minute. Detects if the site or the web API is receiving too many bad requests, including 404, not found. (i.e. oops! a few files were not uploaded) | 120/min | > 30% | > 50% |
| 1m_internal_errors | The ratio of HTTP internal server errors (5xx), over all the requests, during the last minute. Detects if the site is facing difficulties to serve requests. (i.e. oops! this release crashes too much) | 120/min | > 2% | > 5% |
| 5m_requests_ratio | The percentage of successful web requests of the last 5 minutes, compared with the previous 5 minutes. Detects if the site or the web API is suddenly getting too many or too few requests. (i.e. too many = oops! we are under attack) (i.e. too few = oops! call the network guys) | 120/5min | > double or < half | > 4x or < 1/4x |
| web_slow | The average time to respond to requests, over the last 1 minute, compared to the average of last 10 minutes. Detects if the site or the web API is suddenly a lot slower. (i.e. oops! the database is slow again) | 120/min | > 2x | > 4x |
| 1m_successful | The ratio of successful HTTP responses (1xx, 2xx, 304) over all the requests, during the last minute. Detects if the site or the web API is performing within limits. (i.e. oops! help us God!) | 120/min | < 85% | < 75% |

The minimum requests column states the minimum number of requests required for the alarm to be evaluated. We found that when the site is receiving requests above this rate, these alarms are pretty accurate (i.e. no false positives).
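The way the minimum requests and the two thresholds interact can be sketched in Python, using the 1m_redirects numbers (120 requests per minute, warning above 20%, critical above 30%). This shows only the arithmetic: Netdata alarms are defined in its own health configuration, and the function name and status strings here are hypothetical.

```python
def redirects_alarm(redirects, total, min_requests=120, warn=20.0, crit=30.0):
    """Evaluate a 1m_redirects-style alarm for the last minute."""
    if total < min_requests:
        # Too few requests for the ratio to be meaningful: do not evaluate.
        return "undefined"
    ratio = 100.0 * redirects / total
    if ratio > crit:
        return "critical"
    if ratio > warn:
        return "warning"
    return "clear"
```

With 200 requests in the last minute, 50 redirects (25%) would raise a warning and 80 redirects (40%) would be critical. With only 100 requests, the alarm is not evaluated at all.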

Netdata alarms are user-configurable. Sample config files can be found under directory health/health.d of the Netdata GitHub repository.
