Observability for Kubernetes: Building a Monitoring Stack That Actually Works

The Default Dashboard Problem

You deployed kube-prometheus-stack. You have Grafana running with a dozen pre-built dashboards. You can see CPU and memory graphs for every pod. And yet, when something goes wrong at 2 AM, those dashboards tell you almost nothing useful.

This is the most common observability failure we encounter. Teams confuse monitoring (is this thing up?) with observability (why is this thing broken?). The default dashboards answer neither question well because they were designed to be generic. Your applications, your SLAs, and your failure modes are not generic.

In this post, we walk through building an observability stack for Kubernetes that moves beyond pretty graphs toward actionable insight.

graph TD
    subgraph Applications
        APP[App Pods]
    end
    subgraph Collection
        PROM[Prometheus]
        LOKI[Loki]
        TEMPO[Tempo]
    end
    subgraph Visualization
        GRAF[Grafana]
        ALERT[Alertmanager]
    end
    APP -->|Metrics| PROM
    APP -->|Logs| LOKI
    APP -->|Traces| TEMPO
    PROM --> GRAF
    LOKI --> GRAF
    TEMPO --> GRAF
    PROM --> ALERT
    ALERT -->|Slack/PagerDuty| OPS[On-Call]

The Three Pillars

Before diving into tooling, let us align on the three pillars of observability:

Metrics are numeric measurements over time. CPU utilization, request latency percentiles, error rates, queue depths. They are cheap to store, fast to query, and excellent for alerting and trending.

Logs are discrete events with context. Application errors, access logs, audit trails. They answer “what happened” in human-readable form but are expensive to store and slow to search at scale.

Traces are the path of a single request through your system. They show you which service called which, how long each step took, and where the bottleneck lives. We cover tracing in depth in our OpenTelemetry post.

A production-grade stack needs all three, correlated. A metric alert should link to relevant logs, and a log entry should link to the trace that produced it.

Deploying kube-prometheus-stack

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics. It is the de facto starting point. Here is a production-oriented values file:

# values-prometheus-stack.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "80GB"
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        memory: 12Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    # Select all ServiceMonitors, PodMonitors, and rules, not only those labeled by this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 10Gi
  config:
    receivers:
      - name: "slack-critical"
        slack_configs:
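          # The file must be mounted via alertmanagerSpec.secrets; the operator
          # mounts each listed Secret under /etc/alertmanager/secrets/<secret-name>/
          - api_url_file: /etc/alertmanager/secrets/slack-webhook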
            channel: "#alerts-critical"
            title: '{{ .GroupLabels.alertname }}'
            text: >-
              {{ range .Alerts }}
              *{{ .Labels.severity }}* - {{ .Annotations.summary }}
              {{ .Annotations.description }}
              {{ end }}
      - name: "null"
    route:
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: "null"
      routes:
        - match:
            severity: critical
          receiver: "slack-critical"
          repeat_interval: 1h

grafana:
  adminPassword: "changeme"  # Placeholder only; in practice reference a Secret via admin.existingSecret
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: custom
          orgId: 1
          folder: "Custom"
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/custom
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
    datasources:
      enabled: true

# Disable default rules we'll replace with better ones
defaultRules:
  create: true
  rules:
    general: true
    kubernetesApps: false  # We'll write our own

Deploy with:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values values-prometheus-stack.yaml \
  --version 65.1.0

A few important settings deserve explanation. Setting serviceMonitorSelectorNilUsesHelmValues: false leaves the ServiceMonitor selector empty, so Prometheus picks up every ServiceMonitor in the cluster rather than only those carrying the Helm release label. Without this, application teams would have to add that label to every ServiceMonitor they create. The podMonitor and rule equivalents do the same for PodMonitors and PrometheusRules. We also disable the default kubernetesApps alerting rules because we will replace them with SLO-based rules that are far more useful.
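To make the selector behaviour concrete, here is a small Python sketch of Kubernetes-style matchLabels logic. It is a simplified model rather than the operator's actual code. The monitor names and the team label are hypothetical, and the release label value assumes the Helm release is named kube-prometheus-stack, as in the install command above.

```python
# Simplified model of matchLabels selection (not the Prometheus Operator's code).
# Monitor names and the "team" label are hypothetical.


def matches(selector: dict, labels: dict) -> bool:
    """A selector matches when every key/value pair is present in labels.

    An empty selector therefore matches everything.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def selected(selector: dict, monitors: dict) -> list:
    """Return the sorted names of monitors whose labels satisfy the selector."""
    return sorted(name for name, labels in monitors.items() if matches(selector, labels))


monitors = {
    "chart-kubelet": {"release": "kube-prometheus-stack"},
    "payments-api": {"team": "payments"},
}

# Chart default: only monitors labeled with the Helm release are selected.
helm_default = selected({"release": "kube-prometheus-stack"}, monitors)

# With serviceMonitorSelectorNilUsesHelmValues: false the selector is empty.
empty_selector = selected({}, monitors)

print(helm_default)
print(empty_selector)
```

With the chart default, the application team's monitor is silently ignored. With the empty selector, both are picked up.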

Adding Logs with Loki

For log aggregation, Grafana Loki is the natural companion. It uses the same label-based approach as Prometheus, making correlation straightforward:

# values-loki.yaml
loki:
  auth_enabled: false
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
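  storage:
    type: s3
    # Key names below follow the grafana/loki Helm chart (camelCase), not Loki's native config
    bucketNames:
      chunks: loki-chunks
    s3:
      endpoint: minio.storage.svc:9000
      # ${...} is only expanded when Loki runs with -config.expand-env=true
      accessKeyId: ${MINIO_ACCESS_KEY}
      secretAccessKey: ${MINIO_SECRET_KEY}
      insecure: true
      s3ForcePathStyle: true  # MinIO is normally addressed path-style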
  limits_config:
    retention_period: 30d
    max_query_length: 720h

Pair it with Promtail or the Grafana Alloy collector to ship logs from every pod. The key is ensuring that log labels match your Prometheus labels so you can jump from a metric alert to the relevant logs with a single click in Grafana.
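As a rough illustration of why matching labels matter, the sketch below turns the labels on a Prometheus alert into a LogQL stream selector. The list of shared labels (namespace, pod, container) is an assumption; substitute whichever labels your collector actually attaches. The alert contents are made up for the example.

```python
# Sketch: derive a LogQL stream selector from a Prometheus alert's labels.
# SHARED_LABELS is an assumption; use whichever labels your collector attaches.

SHARED_LABELS = ("namespace", "pod", "container")


def escape(value: str) -> str:
    """Escape backslashes and double quotes for use inside a LogQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def logql_selector(alert_labels: dict, shared=SHARED_LABELS) -> str:
    """Build a selector such as {namespace="prod", pod="my-api-1"}."""
    pairs = [
        f'{name}="{escape(alert_labels[name])}"'
        for name in shared
        if name in alert_labels
    ]
    if not pairs:
        raise ValueError("alert carries none of the shared labels")
    return "{" + ", ".join(pairs) + "}"


# Hypothetical alert payload for illustration.
alert = {
    "alertname": "MyAPIHighErrorBurnRate",
    "namespace": "prod",
    "pod": "my-api-1",
    "severity": "critical",
}
print(logql_selector(alert))
```

If the log side used a different label name for the same thing (say, k8s_namespace), this mapping would need a translation table, which is exactly the friction that consistent labels avoid.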

Why Default Dashboards Fail

The pre-built dashboards in kube-prometheus-stack show infrastructure metrics: node CPU, pod memory, network I/O. These are useful for capacity planning but terrible for incident response. Here is why:

They do not reflect your SLAs. Your customers do not care about pod CPU. They care about whether the API responds in under 200ms.
They lack context. A spike in memory usage is meaningless without knowing whether it caused errors or degraded latency.
They generate noise. Alerting on CPU > 80% will page you constantly without correlating to user impact.

The fix is to build dashboards around Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Building SLI/SLO Dashboards

An SLI is a quantitative measure of a service’s behavior. An SLO is the target you set for that SLI. For example:

SLI: The proportion of HTTP requests that return in under 500ms with a non-5xx status.
SLO: 99.9% of requests meet the SLI over a 30-day rolling window.
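It helps to translate that target into concrete numbers. The sketch below does the arithmetic for the 99.9% / 30-day example. The monthly request volume is a made-up figure for illustration only.

```python
# Translate an SLO target into a concrete error budget.
# The 99.9% target and 30-day window come from the example above;
# the request volume used at the bottom is hypothetical.

SLO_TARGET = 0.999
WINDOW_DAYS = 30


def allowed_failure_fraction(slo_target: float) -> float:
    """Fraction of requests that may fail without breaching the SLO."""
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int) -> float:
    """Budget expressed as minutes of total failure within the window."""
    return allowed_failure_fraction(slo_target) * window_days * 24 * 60


def allowed_bad_requests(slo_target: float, total_requests: int) -> int:
    """Budget expressed as a request count for a given traffic volume."""
    return round(allowed_failure_fraction(slo_target) * total_requests)


print(f"{allowed_bad_minutes(SLO_TARGET, WINDOW_DAYS):.1f} minutes of full outage")
print(allowed_bad_requests(SLO_TARGET, 10_000_000), "failed requests out of 10M")
```

In other words, 99.9% over 30 days leaves about 43 minutes of complete failure, or one bad request in every thousand.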

Here are the PromQL queries that power an SLO dashboard for an HTTP service. First, the availability SLI, expressed as the request success rate:

# Request success rate (non-5xx) over the last 30 days
sum(rate(http_requests_total{job="my-api", status!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="my-api"}[30d]))

The latency SLI using histogram buckets:

# Proportion of requests under 500ms over the last 30 days
sum(rate(http_request_duration_seconds_bucket{job="my-api", le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count{job="my-api"}[30d]))
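The same ratio can be worked through by hand. Prometheus histogram buckets are cumulative, so the le="0.5" bucket already counts every request that finished in 0.5 seconds or less. The bucket counts below are invented for illustration, and treating a no-traffic window as compliant is our assumption, not a rule.

```python
# Sketch: compute a latency SLI from cumulative histogram bucket counts.
# Bucket values are illustrative; real ones come from *_bucket series.


def latency_sli(buckets: dict, threshold: str) -> float:
    """Share of requests at or under the threshold bucket boundary.

    `buckets` maps each `le` boundary to its cumulative count, with "+Inf"
    holding the total. The threshold must be an existing bucket boundary.
    """
    if threshold not in buckets:
        raise KeyError(f"no bucket with le={threshold!r}; pick a configured boundary")
    total = buckets["+Inf"]
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    return buckets[threshold] / total


sample = {"0.1": 700, "0.25": 900, "0.5": 985, "1": 998, "+Inf": 1000}
print(latency_sli(sample, "0.5"))
```

The KeyError branch reflects a practical constraint: the query only works if your histogram actually has a bucket at the SLO threshold, so choose bucket boundaries with your SLOs in mind.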

The remaining error budget:

# Error budget remaining (1 = full budget, 0 = exhausted)
1 - (
  (1 - (
    sum(rate(http_requests_total{job="my-api", status!~"5.."}[30d]))
    /
    sum(rate(http_requests_total{job="my-api"}[30d]))
  ))
  /
  (1 - 0.999)
)

That last query is the most powerful. When the error budget hits zero, you know your SLO is breached. When it is burning faster than expected, you know you need to act before a breach. Alert on error budget burn rate, not on raw error counts.
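For readers who prefer to see the budget formula outside PromQL, here is the same calculation as a small Python function. The input ratios are illustrative values, not measurements.

```python
# Sketch: the error-budget-remaining formula from the query above.


def error_budget_remaining(success_ratio: float, slo_target: float = 0.999) -> float:
    """Return the unspent share of the error budget.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    budget = 1.0 - slo_target           # allowed failure fraction
    consumed = (1.0 - success_ratio) / budget
    return 1.0 - consumed


for ratio in (1.0, 0.9995, 0.999, 0.998):
    print(f"success={ratio:.4f} remaining={error_budget_remaining(ratio):+.2f}")
```

Note that the value goes negative once the SLO is breached, so a dashboard panel should not clamp it at zero if you want to see how far over budget you are.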

The Prometheus rule for a burn-rate alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-api-slo
  namespace: monitoring
spec:
  groups:
    - name: my-api.slo
      rules:
        # Multi-window burn rate alert
        - alert: MyAPIHighErrorBurnRate
          expr: |
            (
              sum(rate(http_requests_total{job="my-api", status=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="my-api"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{job="my-api", status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-api"}[5m]))
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "my-api is burning error budget 14.4x faster than allowed"
            description: "The 1h error ratio is {{ $value | humanizePercentage }}; at a 14.4x burn rate the 30-day error budget lasts roughly 2 days."

This multi-window approach checks both a 1-hour and a 5-minute window. The long window keeps brief spikes from paging anyone, and the short window confirms the problem is still happening, so the alert clears soon after recovery. A burn rate of 14.4 consumes 2% of the 30-day budget every hour and would exhaust it in about 50 hours, roughly 2 days, which warrants immediate attention.
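The numbers behind the 14.4 factor are easy to check. This sketch assumes the 30-day window and 99.9% target used throughout this post.

```python
# Sketch: the arithmetic behind a burn-rate threshold.
# Assumes the 30-day window and 99.9% SLO used in this post.

WINDOW_HOURS = 30 * 24  # 720


def hours_to_exhaustion(burn_rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_hours / burn_rate


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    window_hours: float = WINDOW_HOURS) -> float:
    """Share of the total budget spent during the alert window."""
    return burn_rate * alert_window_hours / window_hours


def error_ratio_threshold(burn_rate: float, slo_target: float = 0.999) -> float:
    """Error ratio that corresponds to the given burn rate."""
    return burn_rate * (1.0 - slo_target)


print(f"{hours_to_exhaustion(14.4):.0f} hours to exhaustion")
print(f"{budget_consumed(14.4, 1):.0%} of budget spent in 1 hour")
print(f"error ratio threshold: {error_ratio_threshold(14.4):.4f}")
```

The last figure, 0.0144, is the same value as the 14.4 * 0.001 comparison in the rule expression.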

Practical Dashboard Layout

A well-structured Grafana dashboard for a service should have these rows:

SLO Summary — Current SLI values, error budget remaining, budget burn rate. This is what you look at first.
Request Rate and Errors — Requests per second broken down by status code. Error rate over time.
Latency — p50, p90, p99 latency over time. Latency by endpoint.
Saturation — CPU, memory, and connection pool utilization for the service pods.
Dependencies — Latency and error rate for downstream calls (databases, external APIs).

Store these dashboards as JSON in your GitOps config repo and deploy them via ConfigMaps that Grafana’s sidecar picks up. This way, dashboards are version-controlled and reproducible:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-api-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  my-api.json: |
    { ... exported Grafana dashboard JSON ... }
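If you keep raw dashboard JSON in the repo, wrapping it into a ConfigMap can be scripted. The following is a minimal sketch using only the standard library. It emits the manifest as JSON, which kubectl accepts alongside YAML. The function name and the tiny stand-in dashboard are hypothetical; the grafana_dashboard label matches the ConfigMap shown above.

```python
# Sketch: wrap an exported Grafana dashboard in a sidecar-discoverable ConfigMap.
# The helper name and sample dashboard are hypothetical.
import json


def dashboard_configmap(name: str, namespace: str, dashboard: dict) -> dict:
    """Build a ConfigMap manifest carrying one dashboard JSON document."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{name}-dashboard",
            "namespace": namespace,
            # Label the Grafana sidecar watches for, as in the manifest above.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {
            # sort_keys keeps Git diffs stable between exports.
            f"{name}.json": json.dumps(dashboard, indent=2, sort_keys=True),
        },
    }


# Stand-in for a real export; a real dashboard has many more fields.
exported = {"title": "my-api SLO", "panels": []}
manifest = dashboard_configmap("my-api", "monitoring", exported)
print(json.dumps(manifest, indent=2))
```

Running this in CI and committing the output (or piping it to kubectl apply -f -) keeps the ConfigMap in step with the exported JSON.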

Connecting the Dots

The real power of observability comes from correlation. When an SLO alert fires:

The alert links to the SLO dashboard (Grafana alert annotations).
The dashboard shows which SLI is degraded (latency? errors?).
You click through to the dependency row and see a database latency spike.
You switch to the Loki datasource and filter logs for that database’s namespace.
You find a slow query log entry with a trace ID.
You open that trace in your tracing UI and see the exact query and its execution plan.

This is what “observability” actually means — not just having data, but being able to follow the thread from symptom to root cause. Our post on OpenTelemetry covers setting up the tracing layer that completes this picture.

Common Mistakes

Over-alerting. If your team gets more than five alerts per on-call shift, your alerting is broken. Reduce noise by alerting on SLO burn, which reflects user impact, rather than on resource-level signals such as CPU.

Under-retaining. Thirty days of metrics seems like a lot until you need to compare this month’s performance to last quarter. Treat the 30d in the values file above as a starting point: keep at least 90 days if disk allows, or use Thanos/Cortex for long-term storage.
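Longer retention has a disk cost that is worth estimating up front. The Prometheus documentation gives a rough sizing rule: retention time in seconds, times ingested samples per second, times bytes per sample, with typical samples compressing to 1 to 2 bytes. The sketch below applies it; the ingestion rate of 15,000 samples per second is a hypothetical figure, and 2 bytes per sample is the pessimistic end of that range.

```python
# Sketch: rough Prometheus disk sizing.
# needed_bytes = retention_seconds * samples_per_second * bytes_per_sample
# The ingestion rate used below is hypothetical.


def needed_disk_bytes(retention_days: int, samples_per_second: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Estimate on-disk TSDB size for a retention period."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


for days in (30, 90):
    gigabytes = needed_disk_bytes(days, 15_000) / 1e9
    print(f"{days} days: about {gigabytes:.2f} GB")
```

At that ingestion rate, 30 days sits just under an 80GB cap, while 90 days needs roughly three times the volume, so retention and volume size have to be raised together.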

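Ignoring cardinality. Every unique label combination creates a new time series. A label with user IDs will explode your Prometheus memory. Track the total with prometheus_tsdb_head_series, and use the TSDB status page in the Prometheus UI (or the /api/v1/status/tsdb endpoint) to see which metrics and labels contribute the most series.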
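A quick back-of-the-envelope calculation shows how fast this goes wrong. The worst-case series count for one metric is the product of the distinct values of each label. The label names and counts here are invented for illustration.

```python
# Sketch: worst-case series count for one metric.
# Label names and cardinalities are illustrative.
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


base = {"method": 5, "status": 8, "endpoint": 20}
with_user_id = {**base, "user_id": 100_000}

print(max_series(base))
print(max_series(with_user_id))
```

Three modest labels give at most 800 series. Adding a user ID label with 100,000 values pushes the bound to 80 million, and a histogram multiplies that again by its number of buckets.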

Dashboard sprawl. Fifty dashboards that nobody looks at are worse than five that everyone uses. Start with one dashboard per service, focused on SLOs.

Wrapping Up

Building an observability stack that actually works requires moving beyond default configurations. Deploy kube-prometheus-stack as your foundation, add Loki for logs, then invest your energy in SLI/SLO dashboards that reflect what your users experience. Alert on error budget burn rates, not raw metrics. Store everything in Git. And most importantly, practice using your observability tools before the next incident — because 2 AM is not the time to learn PromQL.