Failed Prometheus-Server Container

Scenario

The prometheus-server container within the Prometheus server deployment hits a failure, but the pod itself does not shut down; it simply stops reporting metrics.

Triage

If no metrics are available, you can identify this failure by searching the prometheus-server container logs for panic or no space left on device messages.

Obtaining the Panic Log

The Prometheus server pod has two containers, prometheus-server-configmap-reload and prometheus-server, so you must explicitly select the prometheus-server container with the -c option to find the panic log below.
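If you need to confirm the container names first, you can list each pod along with its containers and filter for the Prometheus server pod. This is a generic sketch and does not rely on any chart-specific labels:

kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' | grep prometheus-server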

kubectl logs svc/aioli-prometheus-server -c prometheus-server | grep -e panic -e 'no space left on device'
ts=2024-06-13T21:02:52.754Z caller=scrape.go:1351 level=error component="scrape manager" scrape_pool=kubernetes-services target="http://blackbox:80/probe?module=http_2xx&target=fb125m-1-predictor-00001-private.default.svc%3A9091" msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00000324: no space left on device"
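
Because the error reports no space left on device, you can confirm that the Prometheus data volume is actually full before changing any settings. The following command is a sketch that reuses the service name above and assumes the TSDB is mounted at /data, as indicated by the log message:

kubectl exec svc/aioli-prometheus-server -c prometheus-server -- df -h /data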

Resolution

The Prometheus configuration was updated in MLIS version 1.1.0 to help prevent this failure by automatically enabling wal-compression and setting the retention size to 7GB.
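
For reference, these defaults correspond to the following Prometheus server flags. This is an illustrative mapping of the flags themselves, not the exact values the chart renders:

--storage.tsdb.wal-compression      # compress the write-ahead log
--storage.tsdb.retention.size=7GB   # cap the on-disk TSDB size at 7GB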

If you still encounter disk usage issues, you can lower the retention time, for example to 5 days (5d), to further reduce disk usage.

See the official Prometheus documentation on Storage Operational Aspects for more information on configuration options.

Example

The following example manually lowers retention to 5 days (5d):

helm upgrade NAME CHART \
--set prometheus.server.retention=5d

See the Helm chart for more information on the Prometheus configuration.

Tip
Once you get Prometheus working again, consider increasing the scrape_interval to reduce the data collection rate. This lets you keep more days of retention while staying within the available disk space.
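
For example, assuming the chart forwards values to the upstream Prometheus chart and exposes the global scrape interval under prometheus.server.global.scrape_interval (verify the exact key in the chart's values file before using it), a combined adjustment might look like this:

# prometheus.server.global.scrape_interval is an assumed key; confirm it in the chart's values file
helm upgrade NAME CHART \
--set prometheus.server.global.scrape_interval=1m \
--set prometheus.server.retention=7d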