OpenTelemetry in Practice: Instrumenting Your Applications Without Vendor Lock-in
Why OpenTelemetry Matters
For years, observability meant choosing a vendor and adopting their proprietary agent, SDK, and data format. Switch vendors and you re-instrument everything. OpenTelemetry (OTel) changes this equation fundamentally.
OTel is a CNCF project that provides a single, vendor-neutral standard for generating, collecting, and exporting telemetry data: metrics, logs, and traces. The specification is stable for all three signal types (the maturity of individual language SDKs still varies), and adoption has been remarkable. Most major observability vendors now accept OTLP data directly.
If you read our post on building a Kubernetes observability stack, you saw how Prometheus and Loki handle metrics and logs. OpenTelemetry completes the picture by adding distributed tracing and — critically — correlating all three signals together.
graph LR
A1[App 1] -->|OTLP| C[OTel Collector]
A2[App 2] -->|OTLP| C
A3[App 3] -->|OTLP| C
C -->|metrics| P[Prometheus]
C -->|logs| L[Loki]
C -->|traces| T[Tempo / Jaeger]
P --> G[Grafana]
L --> G
T --> G
The Architecture
OpenTelemetry has three main components:
- SDKs and auto-instrumentation — Libraries that generate telemetry data from your application code.
- The Collector — A standalone service that receives, processes, and exports telemetry data.
- The protocol (OTLP) — A standard wire format that all components speak.
The typical deployment looks like this:
Application (with OTel SDK)
  → OTLP → OTel Collector (DaemonSet or Sidecar)
      → Prometheus (metrics)
      → Loki (logs)
      → Jaeger or Tempo (traces)
The Collector in the middle is the key architectural decision. It decouples your applications from your backends. If you switch from Jaeger to Grafana Tempo, you change a Collector config — not application code.
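To make the OTLP leg of that diagram concrete, here is a minimal sketch of the JSON body an application could send to a Collector's OTLP/HTTP receiver. It only builds the payload and does not send it. The function name and the scope name are made up for illustration, and a real SDK would normally use the binary protobuf encoding and fill in many more fields.

```python
import json
import secrets
import time


def build_otlp_trace_payload(service_name: str, span_name: str, duration_ms: int) -> dict:
    """Build a minimal OTLP/JSON trace export body containing one span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                ],
            },
            "scopeSpans": [{
                "scope": {"name": "manual-sketch"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER
                    # 64-bit integers are carried as strings in OTLP/JSON
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }


payload = build_otlp_trace_payload("my-api", "GET /orders", duration_ms=42)
body = json.dumps(payload)
# An OTLP/HTTP receiver accepts this as a POST to <collector>:4318/v1/traces
# with the header Content-Type: application/json.
print(body[:80])
```

Because every backend-specific detail lives behind the Collector, nothing in this payload names Prometheus, Loki, Tempo, or Jaeger.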
Auto-Instrumentation: The Quick Win
For many languages, OTel provides auto-instrumentation that requires zero code changes. In Kubernetes, the OpenTelemetry Operator makes this even simpler by injecting instrumentation via annotations.
First, install the operator:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
# admissionWebhooks.certManager.enabled=true expects cert-manager to be installed already
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry \
  --create-namespace \
  --set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib" \
  --set admissionWebhooks.certManager.enabled=true
Then create an Instrumentation resource:
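apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: opentelemetry
spec:
  exporter:
    # The operator exposes a Collector resource named "otel-collector"
    # through a Service named "otel-collector-collector".
    endpoint: http://otel-collector-collector.opentelemetry.svc:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1" # Sample 10% of traces
  # Pin each image to a tested version in production instead of :latest.
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    env:
      # Python auto-instrumentation exports OTLP over HTTP by default,
      # so point it at the HTTP port (4318) instead of the gRPC port (4317).
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector-collector.opentelemetry.svc:4318
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
  dotnet:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:latest
    env:
      # .NET auto-instrumentation also defaults to OTLP over HTTP.
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector-collector.opentelemetry.svc:4318
  go:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-go:latest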
Now, annotate any pod to get automatic instrumentation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  # selector, replicas, and pod labels omitted for brevity
  template:
    metadata:
      annotations:
        # Format: "<namespace>/<Instrumentation name>"
        instrumentation.opentelemetry.io/inject-python: "opentelemetry/default-instrumentation"
    spec:
      containers:
        - name: my-api
          image: my-registry/my-api:v1.2.3
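The operator injects an init container that adds the OTel SDK to the application’s runtime. The annotation belongs on the pod template, and the operator’s admission webhook acts when a pod is created, so existing pods pick it up only after a restart. For Python, Java, Node.js, and .NET, this works without rebuilding the container image. You get traces for HTTP requests, database calls, and gRPC, all without touching application code.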
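The tracecontext propagator configured in the Instrumentation resource is what stitches those spans into one trace across services: it reads and writes the W3C traceparent HTTP header. The sketch below shows the shape of that header under the version-00 format. The function and class names are illustrative, not part of any OTel API, and the real propagator also handles tracestate and future header versions.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a W3C traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Header an instrumented client would send downstream: same trace, new span ID."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
print(outgoing)
```

The trace ID stays constant across every hop while the span ID changes, which is why the backend can reassemble one request from spans emitted by many services.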
Manual Instrumentation: When You Need More
Auto-instrumentation captures framework-level spans automatically, but it cannot instrument your business logic. For custom spans, you use the OTel SDK directly.
Here is a Python example that adds a custom span for a critical business operation:
from opentelemetry import trace
from opentelemetry.trace import StatusCode

# inventory_service, payment_service, and InsufficientInventoryError are
# application-defined and not shown here.
tracer = trace.get_tracer("my-api.order-service")


def process_order(order_id: str, items: list) -> dict:
    # The except block below records failures explicitly, so the context
    # manager's automatic recording is turned off to avoid duplicate
    # exception events on the parent span.
    with tracer.start_as_current_span(
        "process_order",
        attributes={
            "order.id": order_id,
            "order.item_count": len(items),
        },
        record_exception=False,
        set_status_on_exception=False,
    ) as span:
        try:
            # Validate inventory
            with tracer.start_as_current_span("validate_inventory"):
                available = inventory_service.check(items)
            if not available:
                span.set_status(StatusCode.ERROR, "Inventory unavailable")
                raise InsufficientInventoryError(order_id)

            # Process payment
            with tracer.start_as_current_span("process_payment") as payment_span:
                result = payment_service.charge(order_id, items)
                payment_span.set_attribute("payment.method", result.method)
                payment_span.set_attribute("payment.amount", result.amount)

            span.set_attribute("order.status", "completed")
            return {"status": "completed", "order_id": order_id}
        except Exception as e:
            span.set_status(StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
This creates a trace with three spans: the parent process_order and two children validate_inventory and process_payment. Each span carries attributes that become searchable in your tracing backend. When the payment service is slow, you will see exactly where the time is spent.
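If the parent/child relationship feels like magic, a toy model helps. The sketch below is not the OTel SDK; it is a standard-library imitation of the one idea behind start_as_current_span, namely that a context variable tracks the current span and each new span adopts it as its parent. All names here (ToySpan, start_span, finished_spans) are invented for illustration.

```python
import contextvars
import secrets
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToySpan:
    name: str
    span_id: str
    parent_id: Optional[str]


# Holds whichever span is "current" in this execution context.
_current_span: contextvars.ContextVar = contextvars.ContextVar("current_span", default=None)
finished_spans: List[ToySpan] = []


@contextmanager
def start_span(name: str):
    """Open a span whose parent is the current span, then restore the parent on exit."""
    parent = _current_span.get()
    span = ToySpan(name, secrets.token_hex(8), parent.span_id if parent else None)
    token = _current_span.set(span)
    try:
        yield span
    finally:
        _current_span.reset(token)
        finished_spans.append(span)


with start_span("process_order"):
    with start_span("validate_inventory"):
        pass
    with start_span("process_payment"):
        pass

by_name = {s.name: s for s in finished_spans}
for s in finished_spans:
    print(s.name, "parent:", s.parent_id)
```

Children finish before their parent, so the root span is the last one recorded, which matches what you see when spans arrive at a tracing backend.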
The Collector: Your Telemetry Pipeline
The OTel Collector is where you define how telemetry flows from your applications to your backends. Deploy it as a DaemonSet for efficient node-level collection:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: opentelemetry
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      resource:
        attributes:
          - key: k8s.cluster.name
            value: production
            action: upsert
    connectors:
      # Generate metrics from traces (RED metrics). Current Collector
      # releases ship spanmetrics as a connector rather than a processor.
      spanmetrics:
        dimensions:
          - name: http.method
          - name: http.status_code
          - name: http.route
    exporters:
      # Traces to Tempo
      otlp/tempo:
        endpoint: tempo.monitoring.svc:4317
        tls:
          insecure: true
      # Metrics to Prometheus via remote write
      # (Prometheus must run with --web.enable-remote-write-receiver)
      prometheusremotewrite:
        endpoint: http://prometheus.monitoring.svc:9090/api/v1/write
      # Logs to Loki. Loki 3.x also accepts OTLP natively, so check that
      # your Collector version still ships this exporter.
      loki:
        endpoint: http://loki.monitoring.svc:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          # The connector acts as an exporter here...
          exporters: [otlp/tempo, spanmetrics]
        metrics:
          # ...and as a receiver here.
          receivers: [otlp, spanmetrics]
          processors: [memory_limiter, resource, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [loki]
The spanmetrics connector is particularly powerful. It sits at the end of the traces pipeline and feeds the metrics pipeline, automatically generating RED (Rate, Errors, Duration) metrics from your trace data. This means you get Prometheus metrics for every instrumented endpoint without any additional application-side work. These are the same metrics you would use for the SLI/SLO dashboards we discussed previously.
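What the spanmetrics component computes is simple to picture. The following sketch aggregates a handful of span records into call counts, error counts, and a duration histogram per dimension set. The SpanRecord type, the bucket bounds, and the function name are assumptions made for this example; the real component has its own defaults and emits cumulative Prometheus-style buckets, whereas this sketch keeps per-bucket counts for readability.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

BUCKETS_MS = (50, 100, 250, 500, 1000, 2000)  # hypothetical histogram bounds


@dataclass
class SpanRecord:
    route: str
    status_code: int
    duration_ms: float
    is_error: bool


def red_metrics(spans: List[SpanRecord]) -> Dict[Tuple[str, int], dict]:
    """Aggregate spans into call counts, error counts, and duration buckets."""
    out: Dict[Tuple[str, int], dict] = defaultdict(
        lambda: {"calls": 0, "errors": 0, "buckets": [0] * (len(BUCKETS_MS) + 1)}
    )
    for span in spans:
        # One time series per unique combination of dimensions.
        series = out[(span.route, span.status_code)]
        series["calls"] += 1
        series["errors"] += int(span.is_error)
        # First bucket whose upper bound holds the duration; the last slot is +Inf.
        idx = next(
            (i for i, bound in enumerate(BUCKETS_MS) if span.duration_ms <= bound),
            len(BUCKETS_MS),
        )
        series["buckets"][idx] += 1
    return dict(out)


spans = [
    SpanRecord("/orders", 200, 42.0, False),
    SpanRecord("/orders", 200, 310.0, False),
    SpanRecord("/orders", 500, 2500.0, True),
]
metrics = red_metrics(spans)
for labels, series in metrics.items():
    print(labels, series)
```

Each extra dimension multiplies the number of series, so keep the dimension list short and avoid high-cardinality attributes such as user IDs.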
Trace-to-Metric-to-Log Correlation
Correlation is what transforms three separate data streams into a unified observability experience. The key is ensuring all three signals share common identifiers.
Trace ID in logs. Configure your application’s logging framework to include the active trace ID:
import logging

from opentelemetry import trace


class TraceIdFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""

    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # Outside a span the context is invalid; emit empty fields, not zeros.
        record.trace_id = format(ctx.trace_id, '032x') if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, '016x') if ctx.is_valid else ""
        return True


handler = logging.StreamHandler()
# Filter on the handler, not the logger, so records propagated from child
# loggers also get the fields the formatter expects.
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(levelname)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s'
))

logger = logging.getLogger(__name__)
logger.addHandler(handler)
Now every log line carries a trace ID. In Grafana, you can configure a derived field on your Loki datasource that turns trace IDs into clickable links to Tempo. One click takes you from a log line to the full distributed trace.
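A derived field is essentially a regular expression with one capture group applied to each log line. Before pasting a pattern into the datasource settings, it is worth checking it against a real line. The sketch below does that with Python's re module; the pattern and the sample lines are assumptions based on the log format shown above, and Grafana's regex dialect may differ in edge cases.

```python
import re
from typing import Optional

# Matches the trace_id=<32 hex chars> fragment produced by the formatter above.
TRACE_ID_RE = re.compile(r"trace_id=([0-9a-f]{32})")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID embedded in a log line, or None if there is none."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


in_span = (
    "2024-05-01 12:00:00,123 INFO "
    "[trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7] order accepted"
)
outside_span = "2024-05-01 12:00:01,456 INFO [trace_id= span_id=] cache warmed"

print(extract_trace_id(in_span))
print(extract_trace_id(outside_span))
```

Requiring exactly 32 hex characters means lines logged outside any span, where the field is empty, produce no link instead of a broken one.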
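Exemplars for metrics. Prometheus supports exemplars: trace IDs attached to individual metric observations. Exemplar storage is opt-in (start Prometheus with --enable-feature=exemplar-storage), and the metrics must be exposed or written in a format that carries exemplars. When you see a latency spike on a graph, you can click an exemplar point and jump directly to a trace recorded during that spike.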
Sampling Strategies
In production, tracing 100% of requests generates enormous data volumes. Sampling is essential, but naive random sampling loses important data. We recommend a tiered approach:
- Head-based sampling at 10% for normal traffic (configured in the Instrumentation resource above).
- Tail-based sampling in the Collector to always keep error traces and slow traces:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
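This keeps traces for errors and slow requests, the ones you actually need during incident response, while keeping volume manageable. Two caveats apply. First, the tail sampler only sees traces that survived head-based sampling, so if you need every error trace, head-sample those services at 100% and let the Collector do the thinning. Second, tail sampling needs all spans of a trace to arrive at the same Collector instance, which a per-node DaemonSet does not guarantee. Route traces through a gateway tier (for example, using the load-balancing exporter keyed on trace ID) before the tail_sampling processor.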
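The interaction between the two tiers is easy to see in a toy model. The sketch below is not the SDK sampler or the Collector processor: head_keep imitates a trace-ID-ratio decision on the low 64 bits of the trace ID, and tail_keep imitates the three policies above, using the longest span as a stand-in for trace latency and reusing the ratio check as a stand-in for the probabilistic policy. All names and the sample data are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: int
    duration_ms: float
    is_error: bool


def head_keep(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision on the trace ID, so every service agrees."""
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < int(ratio * 2**64)


def tail_keep(trace: List[Span], threshold_ms: float = 2000, baseline_ratio: float = 0.10) -> bool:
    """Keep the whole trace if any one policy matches."""
    if any(span.is_error for span in trace):                     # errors policy
        return True
    if max(span.duration_ms for span in trace) > threshold_ms:   # slow-requests policy
        return True
    return head_keep(trace[0].trace_id, baseline_ratio)          # baseline policy


low_id = 1            # low 64 bits near zero: inside any positive ratio
high_id = 2**64 - 1   # low 64 bits at maximum: outside a 10% ratio

fast_ok = [Span(high_id, 120.0, False)]
failed = [Span(high_id, 80.0, True)]
slow = [Span(high_id, 3500.0, False)]

print("tail keeps failed trace:", tail_keep(failed))
print("tail keeps slow trace:", tail_keep(slow))
print("tail keeps fast, healthy trace:", tail_keep(fast_ok))
# The same failed trace never reaches the tail sampler if the SDK dropped it first.
print("head (10%) keeps failed trace:", head_keep(failed[0].trace_id, 0.10))
print("head (100%) keeps failed trace:", head_keep(failed[0].trace_id, 1.0))
```

The last two lines are the point: the tail policies are only as complete as the data the head sampler lets through.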
Deployment Considerations
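Collector sizing. Start with a DaemonSet and give each Collector pod a container memory limit somewhat above the memory_limiter’s 512 MiB setting, so the limiter can shed load before the kernel kills the pod. Monitor the Collector’s own metrics (by default it serves them on port 8888 at /metrics) and scale accordingly. For high-throughput clusters, add a Collector gateway deployment between the DaemonSet and backends to handle aggregation.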
Version pinning. The OTel ecosystem moves fast. Pin your Collector image version and SDK versions in your dependency files. Test upgrades in staging before rolling to production, just as you would with any infrastructure component in your GitOps pipeline.
Security. Traces can contain sensitive data in HTTP headers, database queries, and span attributes. Use the attributes processor in the Collector to redact sensitive fields before they reach your backend. Manage your Collector credentials through proper secrets management rather than hard-coding them in the Collector config or in plain-text environment variables in your manifests.
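To decide what to redact, it helps to prototype the rules against real attribute sets first. The sketch below is a plain-Python stand-in, not Collector code: dropping a key corresponds roughly to a delete action in the attributes processor, while masking a pattern inside a value would likely need a different processor. The key list and the card-number pattern are assumptions for illustration only.

```python
import re
from typing import Any, Dict

# Hypothetical deny-list; build yours from what your services actually emit.
SENSITIVE_KEYS = {"http.request.header.authorization", "user.email"}
# Naive pattern for 13 to 16 digit runs that may be card numbers.
CARD_RE = re.compile(r"\b\d{13,16}\b")


def redact_attributes(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of span attributes with sensitive keys dropped and card-like digits masked."""
    cleaned: Dict[str, Any] = {}
    for key, value in attrs.items():
        if key in SENSITIVE_KEYS:
            continue  # drop the attribute entirely
        if isinstance(value, str):
            value = CARD_RE.sub("[REDACTED]", value)
        cleaned[key] = value
    return cleaned


raw = {
    "http.request.header.authorization": "Bearer abc123",
    "db.statement": "SELECT * FROM cards WHERE pan = 4111111111111111",
    "order.id": "A-17",
    "order.item_count": 3,
}
safe = redact_attributes(raw)
print(safe)
```

Redacting in the Collector rather than in each service gives you one place to audit and update the rules.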
The Vendor Lock-in Escape Hatch
This is the strategic argument for OTel that resonates with CTOs. Once your applications emit OTLP, switching backends is a configuration change. Moving from self-hosted Jaeger to Grafana Tempo? Change the Collector exporter. Want to evaluate Datadog? Add a second exporter and run both in parallel. Deciding to bring observability in-house after scaling past a SaaS vendor’s price point? Your application code does not change.
The investment in OTel instrumentation is permanent. The choice of backend is always reversible.
Getting Started
If this seems like a lot, here is the minimal path to value:
- Install the OTel Operator and create an Instrumentation resource.
- Deploy a single Collector with an OTLP receiver and an OTLP exporter pointed at Tempo.
- Annotate one application deployment for auto-instrumentation.
- Open Grafana, add Tempo as a datasource, and explore your first traces.
You can do this in an afternoon. Once you see the first distributed trace spanning multiple services, the value becomes immediately obvious — and you will want to instrument everything.