Failed Prometheus-Server Container

Scenario

The prometheus-server container within the Prometheus server deployment hits a failure, but the pod itself does not shut down; it simply stops reporting metrics.

Triage

If no metrics are available, you can identify this failure by searching the prometheus-server container logs for panic or no space left on device messages.

Obtaining the Panic Log

The Prometheus server pod has two containers, prometheus-server-configmap-reload and prometheus-server, so you must explicitly select the prometheus-server container with the -c option to find the panic log below.
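If you need to confirm the container names first, you can list each pod along with its containers and filter for the Prometheus server pod. This is a generic sketch and does not rely on any chart-specific labels:

kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' | grep prometheus-server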

kubectl logs svc/aioli-prometheus-server -c prometheus-server | grep -e panic -e 'no space left on device'
ts=2024-06-13T21:02:52.754Z caller=scrape.go:1351 level=error component="scrape manager" scrape_pool=kubernetes-services target="http://blackbox:80/probe?module=http_2xx&target=fb125m-1-predictor-00001-private.default.svc%3A9091" msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00000324: no space left on device"
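
Because the error reports no space left on device, you can confirm that the Prometheus data volume is actually full before changing any settings. The following command is a sketch that reuses the service name above and assumes the TSDB is mounted at /data, as indicated by the log message:

kubectl exec svc/aioli-prometheus-server -c prometheus-server -- df -h /data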

Resolution

The Prometheus configuration was updated in MLIS version 1.1.0 to help prevent this failure by automatically enabling wal-compression and setting the retention size to 7GB.
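
For reference, these defaults correspond to the following Prometheus server flags. This is an illustrative mapping of the flags themselves, not the exact values the chart renders:

--storage.tsdb.wal-compression      # compress the write-ahead log
--storage.tsdb.retention.size=7GB   # cap the on-disk TSDB size at 7GB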

If you still encounter disk usage issues, you can lower the retention time, for example to 5 days (5d), to further reduce disk usage.

See the official Prometheus documentation on Storage Operational Aspects for more information on configuration options.

Example

The following example manually lowers retention to 5 days (5d):

helm upgrade NAME CHART \
--set prometheus.server.retention=5d

See the Helm chart for more information on the Prometheus configuration.

Tip
Once you get Prometheus working again, consider increasing the scrape_interval to reduce the data collection rate. This lets you keep more days of retention while staying within the available disk space.
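
For example, assuming the chart forwards values to the upstream Prometheus chart and exposes the global scrape interval under prometheus.server.global.scrape_interval (verify the exact key in the chart's values file before using it), a combined adjustment might look like this:

# prometheus.server.global.scrape_interval is an assumed key; confirm it in the chart's values file
helm upgrade NAME CHART \
--set prometheus.server.global.scrape_interval=1m \
--set prometheus.server.retention=7d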