Introduction
Observability in Kubernetes requires coordinated implementation of metrics, logs, and traces. This guide walks through building a production-ready observability stack using proven open-source tools.
The Three Pillars
Metrics
Numerical measurements over time:
- CPU, memory, network utilization
- Request rates and latencies
- Business KPIs
Logs
Discrete events with context:
- Application logs
- System logs
- Audit logs
Traces
Request flow across services:
- Distributed transaction tracking
- Latency breakdown
- Dependency mapping
Stack Components
Metrics: Prometheus + Grafana
Prometheus for collection and storage:
- Pull-based metrics collection
- PromQL query language
- Built-in alerting
Grafana for visualization:
- Powerful dashboards
- Multi-data-source support
- Alerting integration
Logs: Loki + Promtail
Loki for log aggregation:
- Prometheus-inspired design
- Label-based indexing
- Cost-effective storage
Promtail for collection:
- Automatic Kubernetes discovery
- Pipeline processing
- Label enrichment
Traces: Tempo + OpenTelemetry
Tempo for trace storage:
- Scales to millions of traces
- Object storage backend
- TraceQL query language
OpenTelemetry for instrumentation:
- Vendor-neutral
- Auto-instrumentation
- Unified collection
Implementation
Step 1: Deploy Prometheus Stack
Using Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-secure-password
Key configurations:
# values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
    serviceMonitorSelector: {}
    podMonitorSelector: {}
grafana:
  persistence:
    enabled: true
    size: 10Gi
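Prometheus discovers application scrape targets through ServiceMonitor objects. Below is a minimal sketch for a hypothetical my-app Service; the name, namespace, and port are placeholders, and the release label is included because by default kube-prometheus-stack only selects monitors carrying the Helm release label unless that behaviour is overridden.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: prometheus        # default selector of the kube-prometheus-stack release
spec:
  selector:
    matchLabels:
      app: my-app              # labels on the application's Service
  namespaceSelector:
    matchNames:
      - default                # namespace where the Service lives
  endpoints:
    - port: http-metrics       # named Service port exposing /metrics
      path: /metrics
      interval: 30s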
Step 2: Deploy Loki Stack
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi
Configure Promtail for Kubernetes:
# promtail-config.yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: msg
      - labels:
          level:
Step 3: Deploy Tempo
helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=s3 \
  --set tempo.storage.trace.s3.bucket=traces-bucket
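When the backend is S3 (or another S3-compatible object store), the bucket usually needs endpoint, region, and credential settings as well. A minimal values sketch, with every value a placeholder:
# tempo-values.yaml (sketch; bucket, endpoint, region, and credentials are placeholders)
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: traces-bucket
        endpoint: s3.us-east-1.amazonaws.com
        region: us-east-1
        # Prefer IAM roles for service accounts over static keys where possible.
        access_key: REPLACE_WITH_ACCESS_KEY
        secret_key: REPLACE_WITH_SECRET_KEY
Pass the file with -f tempo-values.yaml instead of the individual --set flags.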
Step 4: Configure OpenTelemetry
First, define the collector configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
        timeout: 5s
    exporters:
      otlp:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
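The ConfigMap alone does not run anything: the collector itself still has to be deployed (the OpenTelemetry Operator can also manage this for you). A minimal Deployment and Service sketch that mounts the config above; the image tag and replica count are illustrative, so pin a tested collector version in practice.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest  # pin a specific, tested version
          args: ["--config=/etc/otelcol/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8889   # Prometheus exporter
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
    - name: prom-exporter
      port: 8889
The Prometheus exporter on port 8889 still has to be scraped, for example with a ServiceMonitor like the one sketched in Step 1.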
Step 5: Connect Data Sources in Grafana
Add data sources:
- Prometheus: http://prometheus-kube-prometheus-prometheus:9090 (the Prometheus service created by the kube-prometheus-stack release above)
- Loki: http://loki:3100
- Tempo: http://tempo:3100
Enable correlation:
# Grafana data source config (provisioning format)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-kube-prometheus-prometheus:9090
    jsonData:
      httpMethod: POST
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: 'traceID=(\w+)'
          name: TraceID
          url: '${__value.raw}'
  - name: Tempo
    type: tempo
    uid: tempo                 # referenced by the derived field above
    url: http://tempo:3100
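Links can also run the other way, from a span in Tempo to the logs of the pod that produced it. A sketch of the Tempo data source's trace-to-logs settings follows; the field names use the long-standing tracesToLogs shape, and newer Grafana releases also offer a tracesToLogsV2 variant, so check the data source docs for your version.
- name: Tempo
  type: tempo
  uid: tempo
  url: http://tempo:3100
  jsonData:
    tracesToLogs:
      datasourceUid: loki          # assumes the Loki data source is provisioned with this uid
      tags: ['namespace', 'pod']   # span attributes mapped to Loki label matchers
      spanStartTimeShift: '-5m'
      spanEndTimeShift: '5m'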
Application Instrumentation
Automatic Instrumentation
Pod-level auto-instrumentation is handled by the OpenTelemetry Operator. For Java applications, annotate the pod template:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
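The annotation only takes effect if the operator also has an Instrumentation resource telling the injected agents where to send data. A minimal sketch, assuming the otel-collector Service from Step 4; adjust the endpoint and port to the protocol your agent exports (gRPC 4317 vs HTTP 4318), and treat the sampling ratio as a placeholder.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://otel-collector.monitoring:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"   # sample 25% of traces; tune for your traffic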
Manual Instrumentation
For custom metrics, using the Prometheus Go client library:
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "log"
    "net/http"
)

var requestsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "path", "status"},
)

func init() {
    prometheus.MustRegister(requestsTotal)
}

func main() {
    // Expose the registered metrics on /metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Alerting Configuration
Prometheus Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
  labels:
    release: prometheus   # matched by the chart's default rule selector
spec:
  groups:
    - name: application
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: High error rate detected
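Firing alerts are routed by Alertmanager, which kube-prometheus-stack also deploys. A minimal routing sketch via the chart's values; the Slack webhook URL and channel are placeholders, and overriding the config wholesale replaces the chart's default routes.
# values.yaml (kube-prometheus-stack)
alertmanager:
  config:
    route:
      receiver: slack-critical
      group_by: ['alertname', 'namespace']
      routes:
        - receiver: slack-critical
          matchers:
            - severity = "critical"
    receivers:
      - name: slack-critical
        slack_configs:
          - api_url: https://hooks.slack.com/services/REPLACE_ME   # placeholder webhook
            channel: '#alerts'
            send_resolved: true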
Dashboards
Essential dashboards:
- Cluster Overview: Node health, resource utilization
- Namespace View: Per-namespace metrics
- Service Dashboards: RED metrics per service
- Infrastructure: Storage and networking
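These dashboards can live in version control: the Grafana sidecar that kube-prometheus-stack enables by default watches for ConfigMaps carrying a dashboard label and loads their JSON automatically. A sketch with a placeholder dashboard body:
# ConfigMap picked up by the Grafana dashboard sidecar; the JSON is a placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-red-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  service-red.json: |
    { "title": "Service RED Metrics", "panels": [] }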
Conclusion
A comprehensive observability stack enables faster troubleshooting and better understanding of system behavior. Start with metrics, add logs for context, and implement tracing for distributed systems visibility.
