Observability and Tracing in FastAPI
Observability is the ability to answer questions about a running system from the outside, using three correlated signals — structured logs, metrics, and distributed traces — all tied together by a shared request identifier.
This topic is part of Async, Background Tasks and Observability. It builds directly on the correlation ID assigned in middleware and the consistent error envelope, turning per-request context into a queryable production picture.
Core Mechanics: Structured Logs with Context
The base layer is structured logging keyed by the correlation ID. JSON logs that always carry the request ID let you reconstruct a single request's story with one filter, and they are the substrate the other two signals link back to.
import logging

from app.tracing import request_id_ctx  # contextvar set by tracing middleware


class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Fall back to "-" outside a request (startup, worker threads), where the
        # contextvar may be unset and a bare get() would raise LookupError.
        record.request_id = request_id_ctx.get("-")
        return True
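To show where the filter sits in a logging pipeline, here is a minimal sketch that pairs it with a JSON formatter using only the standard library. The local `request_id_ctx` is a stand-in for the one imported from `app.tracing`, and `JsonFormatter` and `build_logger` are hypothetical names chosen for illustration, not part of any library.

```python
import contextvars
import json
import logging

# Stand-in for app.tracing.request_id_ctx (assumption: a ContextVar holding a str).
request_id_ctx = contextvars.ContextVar("request_id", default="-")


class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_ctx.get()
        return True


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", "-"),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    # Attaching the filter to the handler stamps every record the handler emits.
    handler.addFilter(ContextFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```

Calling `build_logger("orders", sys.stdout)` once at startup and logging inside a request would then emit lines that can be filtered on the `request_id` field.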
Production Implementation: OpenTelemetry Tracing
OpenTelemetry ships instrumentation packages for FastAPI, database drivers, and HTTP clients. Once each one is installed and enabled, it produces spans that form a trace of each request across services. You add manual spans only around the business operations you want timed.
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

tracer = trace.get_tracer("orders")


def instrument(app) -> None:
    FastAPIInstrumentor.instrument_app(app)  # Auto spans for every request.


# Order and persist are the application's own model and persistence coroutine.
async def place_order(order: Order) -> None:
    # A manual span times a business operation inside the auto-generated request span.
    with tracer.start_as_current_span("place_order", attributes={"order.id": order.id}):
        await persist(order)
The full instrumentation walk-through, including exporting to a collector, is in Instrumenting FastAPI with OpenTelemetry.
Metrics
Expose request rate, error rate, and latency per route, plus saturation signals such as pool usage, so you can both alert on user-facing symptoms and see resource limits approaching.
from prometheus_client import Histogram
# Latency distribution per route — the basis for percentile SLOs.
REQUEST_LATENCY = Histogram("http_request_seconds", "Request latency", ["route"])
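As a rough illustration of how such a histogram might be fed, the sketch below times a block of work and records a saturation gauge alongside it. The `observe_latency` helper, the `db_pool_connections_in_use` metric name, and the dedicated `CollectorRegistry` are assumptions made for this example; a real application would more likely record latency from middleware and use the default registry.

```python
import time
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Gauge, Histogram

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

REQUEST_LATENCY = Histogram(
    "http_request_seconds", "Request latency", ["route"], registry=registry
)
# Hypothetical saturation signal: connections currently checked out of the pool.
POOL_IN_USE = Gauge(
    "db_pool_connections_in_use", "Database connections in use", registry=registry
)


@contextmanager
def observe_latency(route: str):
    """Record the wall-clock duration of the wrapped block under the route label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Observe in finally so failed requests are measured too.
        REQUEST_LATENCY.labels(route=route).observe(time.perf_counter() - start)


def handle_orders_request() -> None:
    with observe_latency("/orders"):
        time.sleep(0.01)  # placeholder for real request handling
    POOL_IN_USE.set(7)  # in practice, read this from the pool's own statistics
```

Label by route template rather than raw path, otherwise every distinct ID in a URL creates its own time series.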
Async and Performance Notes
Telemetry must be cheap and non-blocking, because it runs on every request. Export spans and metrics through a batching exporter that buffers and flushes in the background rather than writing synchronously per request, and sample traces under heavy load so instrumentation never becomes the bottleneck it is meant to observe.
Testing Strategy
Assert that the correlation ID propagates into logs and that key spans are emitted:
def test_request_id_in_logs(client, caplog):
    client.get("/health", headers={"x-request-id": "trace-9"})
    assert any(getattr(r, "request_id", "") == "trace-9" for r in caplog.records)
Failure Modes and Debugging
- Unstructured logs. Free-text logs cannot be filtered by request; emit JSON with the ID.
- Synchronous exporters. Exporting per request on the hot path adds latency; batch and sample.
- Context loss in tasks. Background jobs lose the trace unless you propagate it, per Background Task Processing.
- Missing saturation metrics. Latency alone can hide a connection pool that is close to exhaustion; measure database pool usage too.
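The context-loss failure is easy to reproduce in-process. The sketch below uses only the standard library and a stand-in `request_id_ctx`; `submit_with_context` is a hypothetical helper. It covers work handed to a thread pool, not jobs sent to a separate worker process, which need the identifier passed explicitly in the job payload.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for the contextvar set by the tracing middleware.
request_id_ctx = contextvars.ContextVar("request_id", default="-")


def job() -> str:
    # Whatever the job logs would carry this value as its request ID.
    return request_id_ctx.get()


def submit_with_context(pool: ThreadPoolExecutor, fn, *args) -> Future:
    """Run fn in the pool inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


def demo() -> tuple:
    request_id_ctx.set("trace-9")
    with ThreadPoolExecutor(max_workers=1) as pool:
        lost = pool.submit(job).result()              # worker thread has no request context
        kept = submit_with_context(pool, job).result()  # context copied from the caller
    return lost, kept
```

Calling `demo()` yields the default value for the plain submission and the caller's ID for the context-aware one. Because OpenTelemetry's Python API also keeps the current span in context variables, the same copy carries the active trace into the worker thread.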
Related Reading
- Up to the section: Async, Background Tasks and Observability.
- Hands-on guide: Instrumenting FastAPI with OpenTelemetry.
- Composes with: Implementing Custom Middleware for Request Tracing and Error Handling and Global Exceptions.