OpenTelemetry in Practice: Instrumenting Your Applications Without Vendor Lock-in

Why OpenTelemetry Matters

For years, observability meant choosing a vendor and adopting their proprietary agent, SDK, and data format. Switch vendors and you re-instrument everything. OpenTelemetry (OTel) changes this equation fundamentally.

OTel is a CNCF project that provides a single, vendor-neutral standard for generating, collecting, and exporting telemetry data: metrics, logs, and traces. The specifications for all three signals are now stable, although SDK maturity still varies by language, so check the status of the SDK you plan to use. Adoption is broad: every major observability vendor now accepts OTel data natively.

If you read our post on building a Kubernetes observability stack, you saw how Prometheus and Loki handle metrics and logs. OpenTelemetry completes the picture by adding distributed tracing and — critically — correlating all three signals together.

graph LR
    A1[App 1] -->|OTLP| C[OTel Collector]
    A2[App 2] -->|OTLP| C
    A3[App 3] -->|OTLP| C
    C -->|metrics| P[Prometheus]
    C -->|logs| L[Loki]
    C -->|traces| T[Tempo / Jaeger]
    P --> G[Grafana]
    L --> G
    T --> G

The Architecture

OpenTelemetry has three main components:

SDKs and auto-instrumentation — Libraries that generate telemetry data from your application code.
The Collector — A standalone service that receives, processes, and exports telemetry data.
The protocol (OTLP) — A standard wire format that all components speak.

The typical deployment looks like this:

Application (with OTel SDK)
  → OTLP → OTel Collector (DaemonSet or Sidecar)
    → Prometheus (metrics)
    → Loki (logs)
    → Jaeger or Tempo (traces)

The Collector in the middle is the key architectural decision. It decouples your applications from your backends. If you switch from Jaeger to Grafana Tempo, you change a Collector config — not application code.
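
To make the decoupling concrete, here is a toy model in plain Python. It is a sketch, not the Collector's real API: MiniCollector, handle_request, and the in-memory jaeger and tempo lists are all hypothetical names invented for illustration. The point it shows is that the application function never changes when the backend does.

```python
from typing import Callable, Dict, List

Exporter = Callable[[dict], None]


class MiniCollector:
    """Toy stand-in for the Collector: routes each signal type to its exporters."""

    def __init__(self, pipelines: Dict[str, List[Exporter]]) -> None:
        self.pipelines = pipelines

    def receive(self, signal: str, record: dict) -> int:
        """Forward one record and return how many exporters received it."""
        exporters = self.pipelines.get(signal, [])
        for export in exporters:
            export(record)
        return len(exporters)


# Hypothetical backends, modelled as in-memory lists.
jaeger: List[dict] = []
tempo: List[dict] = []


def handle_request(collector: MiniCollector) -> None:
    """Application code: it only knows about the collector, never the backend."""
    collector.receive("traces", {"name": "GET /orders", "duration_ms": 42})


# Before the migration, traces go to Jaeger.
handle_request(MiniCollector({"traces": [jaeger.append]}))
# After the migration, only the pipeline definition changes.
handle_request(MiniCollector({"traces": [tempo.append]}))
```

In the real system the pipeline definition is the Collector's YAML config, and the application talks OTLP rather than calling a Python object.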

Auto-Instrumentation: The Quick Win

For many languages, OTel provides auto-instrumentation that requires zero code changes. In Kubernetes, the OpenTelemetry Operator makes this even simpler by injecting instrumentation via annotations.

First, install the operator. The certManager setting below assumes cert-manager is already running in the cluster:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry \
  --create-namespace \
  --set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib" \
  --set admissionWebhooks.certManager.enabled=true

Then create an Instrumentation resource:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: opentelemetry
spec:
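  exporter:
    # The operator exposes a Collector named otel-collector through a
    # Service called otel-collector-collector.
    endpoint: http://otel-collector-collector.opentelemetry.svc:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"  # Sample 10% of traces
  # Pin the image tags below to specific versions in production.
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    env:
      # The Python agent exports OTLP over HTTP by default, so it needs
      # port 4318 rather than the gRPC port 4317. Check the default
      # protocol of the other language agents you use as well.
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector-collector.opentelemetry.svc:4318
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
  dotnet:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:latest
  go:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-go:latest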
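
The tracecontext propagator in the resource above refers to the W3C Trace Context traceparent header, which is how a trace ID and the sampling decision travel between services. The sketch below parses and re-emits that header using only the standard library. The function and class names are illustrative assumptions, not the SDK's API, and it only handles the version 00 layout.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class ParentContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParentContext]:
    """Return the caller's context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid.
    if version == "ff" or int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None
    return ParentContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def child_traceparent(parent: ParentContext) -> str:
    """Header for an outgoing call: same trace, new span ID, inherited flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

The inherited sampled flag is what the parentbased part of parentbased_traceidratio relies on: a downstream service follows the caller's decision, so a trace is kept or dropped as a whole rather than span by span.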

Now, annotate any pod to get automatic instrumentation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "opentelemetry/default-instrumentation"
    spec:
      containers:
        - name: my-api
          image: my-registry/my-api:v1.2.3

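The operator injects an init container that adds the OTel SDK to the application’s runtime. The annotation belongs on the pod template and takes effect when pods are created, so existing pods need a restart. For Python, Java, Node.js, and .NET, this works without rebuilding the container image. Go is the exception: its auto-instrumentation is eBPF-based, needs elevated privileges, and has to be enabled explicitly in the operator. For the supported runtimes you get traces for HTTP requests, database calls, and gRPC, all without touching application code.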
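
"Zero code changes" works because the SDK reads its configuration from standard OTEL_* environment variables, which the operator sets on the injected container. The sketch below maps the fields of the Instrumentation resource to those variables. The variable names are the standard SDK ones; the function name, the service name, and the placeholder endpoint are assumptions, and the exact set the operator injects may differ by version.

```python
from typing import Dict, List


def sdk_environment(
    service_name: str,
    endpoint: str,
    propagators: List[str],
    sampler: str,
    sampler_arg: str,
) -> Dict[str, str]:
    """Translate Instrumentation-style settings into standard OTel SDK env vars."""
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_PROPAGATORS": ",".join(propagators),
        "OTEL_TRACES_SAMPLER": sampler,
        "OTEL_TRACES_SAMPLER_ARG": sampler_arg,
    }


env = sdk_environment(
    service_name="my-api",
    endpoint="http://collector.example.internal:4317",  # placeholder endpoint
    propagators=["tracecontext", "baggage"],
    sampler="parentbased_traceidratio",
    sampler_arg="0.1",
)

for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

Running kubectl describe on an instrumented pod and looking at the container's environment is a quick way to see what was actually injected.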

Manual Instrumentation: When You Need More

Auto-instrumentation captures framework-level spans automatically, but it cannot instrument your business logic. For custom spans, you use the OTel SDK directly.

Here is a Python example that adds a custom span for a critical business operation:

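from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("my-api.order-service")

# inventory_service, payment_service, and InsufficientInventoryError are
# application-specific and assumed to be defined elsewhere.

def process_order(order_id: str, items: list) -> dict:
    # By default start_as_current_span records escaping exceptions itself.
    # That is switched off on the parent span because the except block
    # below does it explicitly; otherwise the exception is recorded twice.
    with tracer.start_as_current_span(
        "process_order",
        attributes={
            "order.id": order_id,
            "order.item_count": len(items),
        },
        record_exception=False,
        set_status_on_exception=False,
    ) as span:
        try:
            # Validate inventory. The child span keeps the default behavior,
            # so it is marked as failed when the exception passes through it.
            with tracer.start_as_current_span("validate_inventory"):
                available = inventory_service.check(items)
                if not available:
                    raise InsufficientInventoryError(order_id)

            # Process payment
            with tracer.start_as_current_span("process_payment") as payment_span:
                result = payment_service.charge(order_id, items)
                payment_span.set_attribute("payment.method", result.method)
                payment_span.set_attribute("payment.amount", result.amount)

            span.set_attribute("order.status", "completed")
            return {"status": "completed", "order_id": order_id}

        except Exception as e:
            # Mark the parent span as failed and attach the exception details.
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise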

This creates a trace with three spans: the parent process_order and two children validate_inventory and process_payment. Each span carries attributes that become searchable in your tracing backend. When the payment service is slow, you will see exactly where the time is spent.

The Collector: Your Telemetry Pipeline

The OTel Collector is where you define how telemetry flows from your applications to your backends. Deploy it as a DaemonSet for efficient node-level collection:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: opentelemetry
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      resource:
        attributes:
          - key: k8s.cluster.name
            value: production
            action: upsert

    connectors:
      # Generate metrics from traces (RED metrics)
      spanmetrics:
        dimensions:
          - name: http.method
          - name: http.status_code
          - name: http.route

    exporters:
      # Traces to Tempo
      otlp/tempo:
        endpoint: tempo.monitoring.svc:4317
        tls:
          insecure: true

      # Metrics to Prometheus via remote write
      # (Prometheus must run with --web.enable-remote-write-receiver)
      prometheusremotewrite:
        endpoint: http://prometheus.monitoring.svc:9090/api/v1/write

      # Logs to Loki. Recent Collector releases deprecate this exporter in
      # favor of sending OTLP to Loki directly, so check the version you pin.
      loki:
        endpoint: http://loki.monitoring.svc:3100/loki/api/v1/push

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [otlp/tempo, spanmetrics]
        metrics:
          receivers: [otlp, spanmetrics]
          processors: [memory_limiter, resource, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [loki]

The spanmetrics connector is particularly powerful. It is listed as an exporter of the traces pipeline and as a receiver of the metrics pipeline, and it generates RED (Rate, Errors, Duration) metrics from your trace data. Older Collector releases shipped this as a processor with a metrics_exporter setting; that form has been replaced by the connector shown here. You get Prometheus metrics for every instrumented endpoint without any additional application-side work. Keep in mind that the connector only sees spans that reach the Collector, so with 10% head sampling the generated request rates are roughly a tenth of real traffic. These are the same metrics you would use for the SLI/SLO dashboards we discussed previously.
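
To show what "RED metrics from spans" means mechanically, here is a simplified aggregation in plain Python. It is a sketch of the idea, not the spanmetrics implementation: the FinishedSpan type, the bucket bounds, and the output field names are all assumptions chosen for the example.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative histogram upper bounds in milliseconds; the last bucket is overflow.
BUCKETS_MS = (100.0, 500.0, 2000.0)


@dataclass
class FinishedSpan:
    method: str
    route: str
    duration_ms: float
    is_error: bool = False


def red_from_spans(
    spans: List[FinishedSpan], window_seconds: float
) -> Dict[Tuple[str, str], dict]:
    """Group spans by (method, route) and derive rate, errors, and duration."""
    series: Dict[Tuple[str, str], dict] = {}
    for span in spans:
        entry = series.setdefault(
            (span.method, span.route),
            {
                "calls": 0,
                "errors": 0,
                "duration_sum_ms": 0.0,
                "bucket_counts": [0] * (len(BUCKETS_MS) + 1),
            },
        )
        entry["calls"] += 1
        entry["errors"] += int(span.is_error)
        entry["duration_sum_ms"] += span.duration_ms
        # Index of the first bucket whose upper bound is >= the duration.
        entry["bucket_counts"][bisect_left(BUCKETS_MS, span.duration_ms)] += 1
    for entry in series.values():
        entry["rate_per_s"] = entry["calls"] / window_seconds
    return series


spans = [
    FinishedSpan("GET", "/orders", 80.0),
    FinishedSpan("GET", "/orders", 150.0),
    FinishedSpan("GET", "/orders", 2500.0, is_error=True),
    FinishedSpan("POST", "/orders", 300.0),
]
metrics = red_from_spans(spans, window_seconds=2.0)
```

The grouping key plays the role of the dimensions list in the Collector config: every extra dimension multiplies the number of time series, so keep it to low-cardinality attributes.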

Trace-to-Metric-to-Log Correlation

Correlation is what transforms three separate data streams into a unified observability experience. The key is ensuring all three signals share common identifiers.

Trace ID in logs. Configure your application’s logging framework to include the active trace ID:

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""

    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # Outside of a span the context is invalid, so leave the fields empty.
        record.trace_id = format(ctx.trace_id, '032x') if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, '016x') if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
# Attach the filter to the handler rather than the logger, so records that
# propagate from child loggers also get the fields the format string expects.
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(levelname)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s'
))

logger = logging.getLogger(__name__)
logger.addHandler(handler)

Now every log line written inside an active span carries a trace ID. In Grafana, you can configure a derived field on your Loki datasource that turns trace IDs into clickable links to Tempo. One click takes you from a log line to the full distributed trace.
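
A derived field is driven by a regular expression that pulls the trace ID out of each log line. The sketch below tests such a pattern in Python against the log format used above. The pattern itself is an assumption that fits that format, and the sample log lines are made up; adjust both to whatever your formatter actually emits.

```python
import re
from typing import Optional

# Matches the 32-character hex trace ID written by the formatter above.
TRACE_ID_PATTERN = re.compile(r"trace_id=([0-9a-f]{32})")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID embedded in a log line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Made-up sample lines in the format produced by the logging setup above.
inside_span = (
    "2024-05-01 12:00:00,123 ERROR "
    "[trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7] "
    "payment declined"
)
outside_span = "2024-05-01 12:00:01,456 INFO [trace_id= span_id=] worker started"

found = extract_trace_id(inside_span)
missing = extract_trace_id(outside_span)
```

Checking the pattern against a line logged outside any span is worth doing, since an over-permissive regex would turn empty IDs into broken links.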

Exemplars for metrics. Prometheus supports exemplars: trace IDs attached to individual metric observations. Exemplar storage is off by default, so start Prometheus with --enable-feature=exemplar-storage. When you see a latency spike on a graph, you can click an exemplar point and jump directly to a trace that was recorded during the spike.
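
On the wire, an exemplar is extra data appended to a sample in the OpenMetrics text format: a comment marker, a label set holding the trace ID, and the observed value. The helper below builds one such line as an illustration. The function name and the metric name are hypothetical, and real client libraries generate this for you.

```python
def bucket_line_with_exemplar(
    metric: str, le: str, count: int, trace_id: str, observed: float
) -> str:
    """Format one histogram bucket sample with an OpenMetrics-style exemplar."""
    return (
        f'{metric}_bucket{{le="{le}"}} {count} '
        f'# {{trace_id="{trace_id}"}} {observed}'
    )


line = bucket_line_with_exemplar(
    metric="http_request_duration_seconds",  # hypothetical metric name
    le="0.5",
    count=42,
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    observed=0.31,
)
print(line)
```

The exemplar does not change the bucket's count; it only records one concrete request that fell into that bucket, which is what Grafana links to.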

Sampling Strategies

In production, tracing 100% of requests generates enormous data volumes. Sampling is essential, but naive random sampling discards many of the error and slow traces you most need. We recommend a tiered approach:

Head-based sampling at 10% for normal traffic (configured in the Instrumentation resource above). The decision is made when the request starts, before anyone knows whether it will fail.
Tail-based sampling in the Collector, which waits for the whole trace and can therefore always keep error traces and slow traces:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

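Among the traces that reach the Collector, this keeps every error and every slow request (the ones you actually need during incident response) while keeping volume manageable. Two caveats apply. First, tail sampling cannot recover what head sampling already dropped: stacked on a 10% head sampler, it sees only that 10%, so raise the head sampling rate for critical services if you need every error. Second, the processor must see all spans of a trace in one place, which a per-node DaemonSet does not guarantee, so run tail_sampling in a gateway Collector and add it to the traces pipeline there.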
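
The three policies combine as a logical OR: a trace is kept as soon as any one of them says so. The sketch below expresses that decision in plain Python under simplifying assumptions. The Span type and keep_trace function are invented for the example, the latency check uses the time from the earliest span start to the latest span end, and the real processor has more policy types and options than shown.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(
    spans: List[Span],
    latency_threshold_ms: float = 2000.0,
    baseline_percentage: float = 10.0,
    rand: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a complete, non-empty trace."""
    # Policy 1: keep any trace that contains an error span.
    if any(span.is_error for span in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > latency_threshold_ms:
        return True
    # Policy 3: keep a random baseline share of everything else.
    return rand() * 100.0 < baseline_percentage


error_trace = [Span(0, 40), Span(5, 30, is_error=True)]
slow_trace = [Span(0, 100), Span(50, 2500)]
normal_trace = [Span(0, 120), Span(10, 90)]

# A fixed "random" value of 0.99 makes the baseline policy reject.
kept_error = keep_trace(error_trace, rand=lambda: 0.99)
kept_slow = keep_trace(slow_trace, rand=lambda: 0.99)
kept_normal = keep_trace(normal_trace, rand=lambda: 0.99)
```

The decision needs the finished trace, which is why the processor buffers spans for decision_wait before evaluating the policies.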

Deployment Considerations

Collector sizing. Start with a DaemonSet and a 512 MiB memory_limiter limit per node, and set the container’s memory limit somewhat above that so the limiter acts before the kernel does. Monitor the Collector’s own metrics (served at /metrics on port 8888 by default) and scale accordingly. For high-throughput clusters, add a Collector gateway deployment between the DaemonSet and backends to handle aggregation.

Version pinning. The OTel ecosystem moves fast. Pin your Collector image version and SDK versions in your dependency files. Test upgrades in staging before rolling to production, just as you would with any infrastructure component in your GitOps pipeline.

Security. Traces can contain sensitive data in HTTP headers, database queries, and span attributes. Use the attributes processor in the Collector to delete or hash sensitive fields before they reach your backend. Keep Collector credentials in a proper secrets manager rather than in plain environment variables or in the Collector config itself.
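
The redaction step amounts to dropping some attribute keys and replacing others with a hash, so values stay correlatable without being readable. Here is a minimal Python sketch of that transformation. It illustrates the idea rather than the processor's implementation; the function name and the attribute keys in the example are assumptions.

```python
import hashlib
from typing import Any, Dict, Iterable


def redact_attributes(
    attributes: Dict[str, Any],
    delete_keys: Iterable[str],
    hash_keys: Iterable[str],
) -> Dict[str, Any]:
    """Return a copy of the span attributes with sensitive keys removed or hashed."""
    to_delete = set(delete_keys)
    to_hash = set(hash_keys)
    cleaned: Dict[str, Any] = {}
    for key, value in attributes.items():
        if key in to_delete:
            continue  # drop the attribute entirely
        if key in to_hash:
            # Hashing keeps equal values equal, so they can still be grouped.
            cleaned[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            cleaned[key] = value
    return cleaned


# Example attribute keys are illustrative, not a complete list of risky fields.
raw = {
    "http.route": "/orders",
    "http.request.header.authorization": "Bearer secret-token",
    "user.email": "alice@example.com",
}
safe = redact_attributes(
    raw,
    delete_keys=["http.request.header.authorization"],
    hash_keys=["user.email"],
)
```

Doing this in the Collector rather than in each application means one policy covers every service, including those using auto-instrumentation.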

The Vendor Lock-in Escape Hatch

This is the strategic argument for OTel that resonates with CTOs. Once your applications emit OTLP, switching backends is a configuration change. Moving from self-hosted Jaeger to Grafana Tempo? Change the Collector exporter. Want to evaluate Datadog? Add a second exporter and run both in parallel. Deciding to bring observability in-house after scaling past a SaaS vendor’s price point? Your application code does not change.

The investment in OTel instrumentation is permanent. The choice of backend is always reversible.

Getting Started

If this seems like a lot, here is the minimal path to value:

Install the OTel Operator and create an Instrumentation resource.
Deploy a single Collector with an OTLP receiver and Tempo exporter.
Annotate one application deployment for auto-instrumentation.
Open Grafana, add Tempo as a datasource, and explore your first traces.

You can do this in an afternoon. Once you see the first distributed trace spanning multiple services, the value becomes immediately obvious — and you will want to instrument everything.