Failed Prometheus-Server Container
Scenario
The prometheus-server container within the Prometheus server deployment hits a failure, but the pod itself does not shut down; it simply stops reporting metrics.
Triage
If no metrics are available, you can identify this failure by searching the logs for panic or "no space left on device" messages. The Prometheus server pod has two containers, prometheus-server-configmap-reload and prometheus-server, so you must explicitly select the prometheus-server container with the -c option to find the panic log below.
kubectl logs svc/aioli-prometheus-server -c prometheus-server | grep -e panic -e 'no space left on device'
ts=2024-06-13T21:02:52.754Z caller=scrape.go:1351 level=error component="scrape manager" scrape_pool=kubernetes-services target="http://blackbox:80/probe?module=http_2xx&target=fb125m-1-predictor-00001-private.default.svc%3A9091" msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00000324: no space left on device"
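To confirm that the data volume is actually full before changing the configuration, you can check free space inside the container. A minimal sketch, assuming the deployment is named aioli-prometheus-server and the Prometheus data volume is mounted at /data (adjust both names for your installation):
# Check free space on the Prometheus data volume (mount path /data is an assumption)
kubectl exec deploy/aioli-prometheus-server -c prometheus-server -- df -h /data
If the volume reports 100% usage, the retention adjustments described in the Resolution below should help.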
Resolution
The Prometheus configuration was updated in MLIS version 1.1.0 to help prevent this failure by automatically enabling wal-compression and setting the retention size to 7GB.
If disk usage issues persist, you can further lower the retention time to 5 days (5d) to reduce disk usage.
See the official Prometheus documentation on Storage Operational Aspects for more information on configuration options.
Example
The following example manually lowers retention to 5 days (5d):
helm upgrade NAME CHART \
  --set prometheus.server.retention=5d
See the Helm chart for more information on the Prometheus configuration.
You can also increase the scrape_interval to reduce the data collection rate. This adjustment allows you to increase the retention days while staying within the disk space constraint.
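For example, assuming the chart passes standard Prometheus server values through the prometheus.server key (as in the retention example above) and that the default scrape interval is 1m, the interval could be doubled; the exact value path may differ in your chart version:
# Scrape every 2 minutes instead of every 1 minute (value path is an assumption)
helm upgrade NAME CHART \
  --set prometheus.server.global.scrape_interval=2m
Collecting samples half as often roughly halves the rate at which the WAL and TSDB grow, which leaves room for a longer retention window on the same volume.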