Observability
What Your Agent Inherits
Every HTTP request your agent handles is automatically traced with OpenTelemetry spans, counted by Prometheus metrics, and logged in structured JSON with request ID correlation. Your agent never has to configure an exporter, register a metric, or wire up a log filter. It focuses on business logic while the chassis takes care of observability.
Three systems work in concert:
- Distributed tracing propagates context across service boundaries, letting you follow a single request through the FastAPI app, its database queries, and any outbound HTTP calls.
- Prometheus metrics count requests, measure latencies, and track body sizes. Everything is exposed on `/metrics` for any scraper to collect.
- Structured logging emits every log record as JSON (or human-readable text in development) with `request_id` and `correlation_id` fields injected automatically through Python context variables.
The payoff is that when something goes wrong in production, the operator can correlate a trace, a metric spike, and a log line back to the same request without touching a single line of application code.
Distributed Tracing
The tracing subsystem sets up a global TracerProvider with the application’s service metadata and ships spans over OTLP to any compatible collector, whether that is Jaeger, Tempo, Honeycomb, or Datadog.
Provider Configuration
```python
def configure_tracing(settings: Settings) -> None:
    """Configure the global OpenTelemetry tracer provider once."""
    global _provider_configured, _httpx_instrumented

    if not settings.otel_enabled or _provider_configured:
        return

    provider = TracerProvider(
        resource=Resource.create(
            {
                "service.name": settings.otel_service_name,
                "service.version": settings.otel_service_version,
                "deployment.environment": settings.otel_environment,
            }
        )
    )
    exporter = OTLPSpanExporter(
        endpoint=settings.otel_exporter_otlp_endpoint,
        headers=_parse_headers(settings.otel_exporter_otlp_headers),
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    _provider_configured = True

    if not _httpx_instrumented:
        HTTPXClientInstrumentor().instrument()
        _httpx_instrumented = True
```

Key design decisions:
- Resource attributes like `service.name`, `service.version`, and `deployment.environment` tag every span, so you can filter traces by service and environment in your collector UI.
- BatchSpanProcessor buffers spans and ships them in batches. This keeps per-request overhead negligible.
- HTTPX instrumentation is global and one-shot: all outbound HTTP calls made through `httpx` automatically create child spans.
- Guard flag (`_provider_configured`) prevents double-initialization if `create_app()` runs multiple times in tests.
FastAPI Auto-Instrumentation
```python
def instrument_fastapi_app(app: Any, settings: Settings) -> None:
    """Attach FastAPI instrumentation to an application instance."""
    if not settings.otel_enabled:
        return

    FastAPIInstrumentor.instrument_app(
        app,
        excluded_urls=",".join(
            [settings.health_check_path, settings.readiness_check_path, "/metrics", "/favicon.ico"]
        ),
    )
```

The instrumentor automatically wraps every route handler with span creation. Health checks, readiness probes, the metrics endpoint, and favicon requests are excluded so they don't pollute your traces with infrastructure noise.
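Conceptually, `excluded_urls` is a comma-separated list of patterns that incoming URLs are checked against before a span is created. The sketch below is a simplified stand-in for that matching (the real logic lives inside the OpenTelemetry HTTP utilities and compiles each entry as a regular expression):

```python
import re

def build_exclude_matcher(excluded_urls: str):
    """Return a predicate that reports whether a URL should skip tracing.

    Simplified approximation of OpenTelemetry's excluded-URL handling:
    split the comma-separated string into patterns and match any of them.
    """
    patterns = [re.compile(p.strip()) for p in excluded_urls.split(",") if p.strip()]

    def is_excluded(url: str) -> bool:
        return any(p.search(url) for p in patterns)

    return is_excluded


# Hypothetical paths standing in for the settings values:
is_excluded = build_exclude_matcher("/healthz,/readyz,/metrics,/favicon.ico")
```

With this matcher, `/metrics` is skipped while a business route like `/items/42` is traced normally.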
Database Tracing
```python
def instrument_database_engine(engine: AsyncEngine, settings: Settings) -> None:
    """Attach SQLAlchemy tracing to the engine when enabled."""
    if not settings.otel_enabled:
        return

    SQLAlchemyInstrumentor().instrument(engine=engine.sync_engine)
```

Database queries show up as child spans beneath the HTTP request span. This gives you timing and statement-level visibility for every query the agent executes.
Builder Integration
```python
def setup_tracing(self) -> Self:
    """Configure OpenTelemetry tracing for the application."""
    configure_tracing(self.settings)
    instrument_fastapi_app(self.app, self.settings)
    self.logger.info(
        "Tracing %s",
        (
            "configured successfully"
            if self.settings.otel_enabled
            else "disabled by configuration"
        ),
    )
    return self
```

The builder's `setup_tracing()` method calls both `configure_tracing()` and `instrument_fastapi_app()` as one step in the build chain. Tracing is off by default (`APP_OTEL_ENABLED=false`), and you can activate it with a single environment variable.
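The `return self` convention is what makes the build chain read fluently. A minimal stdlib sketch of the pattern, with illustrative names rather than the actual chassis API:

```python
class MiniBuilder:
    """Toy builder demonstrating the fluent 'return self' chaining style."""

    def __init__(self) -> None:
        self.steps: list[str] = []

    def setup_tracing(self) -> "MiniBuilder":
        self.steps.append("tracing")  # stands in for configure_tracing(...)
        return self

    def setup_metrics(self) -> "MiniBuilder":
        self.steps.append("metrics")  # stands in for metrics middleware setup
        return self

    def build(self) -> list[str]:
        return self.steps


# Each setup_* call returns the builder, so steps compose in one expression:
app_steps = MiniBuilder().setup_tracing().setup_metrics().build()
```

Because every step returns the builder, steps can be reordered, added, or skipped without restructuring the call site.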
Prometheus Metrics
The chassis exposes request-level metrics at `/metrics` through `starlette-exporter`, a lightweight Prometheus middleware built for ASGI applications.
```python
def setup_metrics(self) -> Self:
    """Configure Prometheus metrics collection."""
    if not self.settings.metrics_enabled:
        self.logger.info("Metrics collection disabled by configuration")
        return self

    try:
        from prometheus_client import REGISTRY, Info
        from starlette_exporter import PrometheusMiddleware, handle_metrics
        from starlette_exporter.optional_metrics import request_body_size, response_body_size

        with contextlib.suppress(KeyError):
            REGISTRY.unregister(REGISTRY._names_to_collectors["fastapi_app_info_info"])

        app_info = Info("fastapi_app_info", "FastAPI application information")
        app_info.info(
            {
                "app_name": self.settings.app_name,
                "app_version": self.settings.app_version,
                "python_version": platform.python_version(),
                "fastapi_version": fastapi.__version__,
            }
        )

        self.app.add_middleware(
            PrometheusMiddleware,
            app_name=self.settings.app_name,
            prefix=self.settings.metrics_prefix,
            group_paths=False,
            optional_metrics=[response_body_size, request_body_size],
            skip_paths=[
                self.settings.health_check_path,
                self.settings.readiness_check_path,
                METRICS_PATH,
            ],
            skip_methods=["OPTIONS"],
        )
        self.app.add_route(METRICS_PATH, handle_metrics)
        self.logger.info("Prometheus metrics configured successfully")
    except ImportError:
        self.logger.warning(
            "Prometheus dependencies not installed. "
            "Install with: pip install prometheus-client starlette-exporter"
        )
    except Exception as exc:
        self.logger.exception("Failed to configure metrics: %s", exc)
        raise

    return self
```

Here is what this provides automatically:
- Request count and latency histograms broken down by method, path, and status code.
- Request and response body size tracking through optional metrics.
- Application info gauge that exports the app name, version, Python version, and FastAPI version as labels.
- Noise filtering. Health checks, readiness probes, the metrics endpoint itself, and `OPTIONS` preflight requests are all excluded from collection.
Just like tracing, metrics are off by default (`APP_METRICS_ENABLED=false`) and require no code changes to activate.
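What a scraper actually receives from `/metrics` is the Prometheus text exposition format. The sample payload below is illustrative (real metric names depend on your `metrics_prefix` setting), and the helper is a minimal stdlib parser for that shape:

```python
# Illustrative /metrics payload in Prometheus text exposition format.
SAMPLE = """\
# HELP app_requests_total Total HTTP requests
# TYPE app_requests_total counter
app_requests_total{method="GET",path="/items",status_code="200"} 42
app_requests_total{method="POST",path="/items",status_code="201"} 7
"""

def parse_samples(text: str) -> dict[str, float]:
    """Map each non-comment sample line to its numeric value."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name_labels, value = line.rsplit(" ", 1)
        out[name_labels] = float(value)
    return out


samples = parse_samples(SAMPLE)
```

Each sample line carries the metric name, its label set, and the current value, which is all a Prometheus server needs to build rate and latency dashboards.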
Structured Logging
The chassis supports two log formats: JSON for production, which is machine-parseable and compatible with any log aggregator, and text for local development, which is human-readable with color support. A single environment variable controls the format: `APP_LOG_FORMAT=json|text`.
Request Context Injection
Every log record emitted during a request includes `request_id` and `correlation_id` fields, even though application code never passes them explicitly. The mechanism behind this is Python's `contextvars` module:
```python
from contextvars import ContextVar, Token

_request_id: ContextVar[str] = ContextVar("request_id", default="-")
_correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

type RequestContextTokens = tuple[Token[str], Token[str]]


def get_request_id() -> str:
    """Return the request ID for the current context, or '-' when absent."""
    return _request_id.get()


def get_correlation_id() -> str:
    """Return the correlation ID for the current context, or '-' when absent."""
    return _correlation_id.get()


def set_request_context(request_id: str, correlation_id: str) -> RequestContextTokens:
    """Set request-scoped tracing IDs and return reset tokens."""
    return _request_id.set(request_id), _correlation_id.set(correlation_id)


def reset_request_context(tokens: RequestContextTokens) -> None:
    """Reset tracing context to the previous values."""
    request_id_token, correlation_id_token = tokens
    _request_id.reset(request_id_token)
    _correlation_id.reset(correlation_id_token)
```

The `RequestIDMiddleware` (covered in the Middleware chapter) calls `set_request_context()` at the start of every request and `reset_request_context()` once the response completes. Since `ContextVar` is async-safe, the correct IDs propagate naturally through `await` chains, background tasks, and database callbacks, with no thread-local hacks required.
JSON Log Configuration
```python
def configure_root_logging(settings: Settings) -> None:
    """Bootstrap the root logger with the format specified in settings."""
    level = getattr(logging, settings.log_level.upper(), logging.INFO)
    root = logging.getLogger()
    root.handlers.clear()
    root.setLevel(level)

    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(level)
    handler.addFilter(RequestContextFilter())

    if settings.log_format == "json":
        from pythonjsonlogger.json import JsonFormatter

        handler.setFormatter(
            JsonFormatter(
                fmt=(
                    "%(asctime)s %(levelname)s %(name)s"
                    " %(request_id)s %(correlation_id)s %(message)s"
                ),
                datefmt="%Y-%m-%dT%H:%M:%S",
                rename_fields={
                    "asctime": "timestamp",
                    "levelname": "level",
                    "name": "logger",
                },
            )
        )
    else:
        handler.setFormatter(
            logging.Formatter(
                fmt=settings.log_text_template,
                datefmt=settings.log_date_format,
            )
        )

    root.addHandler(handler)
```

The `RequestContextFilter` pulls `get_request_id()` and `get_correlation_id()` from the context variables and injects them into every log record. In JSON mode, the output looks like this:
```json
{
  "timestamp": "2026-03-08T14:22:10",
  "level": "INFO",
  "logger": "app.routes.items",
  "request_id": "a1b2c3d4",
  "correlation_id": "x9y8z7w6",
  "message": "Created item id=42"
}
```

In text mode, the same request produces:
```
2026-03-08 14:22:10 | INFO | app.routes.items | request_id=a1b2c3d4 | correlation_id=x9y8z7w6 | items:create:55 | Created item id=42
```

The Three Pillars Together
Traces, metrics, and logs are independent subsystems, but the `request_id` ties them all together:
- A trace captures the full request lifecycle, including the HTTP span, database query spans, and any outbound HTTP call spans, all sharing a single trace ID.
- A metric increments the request counter and records the latency in a histogram, tagged by method, path, and status code.
- A log line records application-level events with the `request_id` and `correlation_id` fields.
When a latency spike shows up in your Prometheus dashboard, you search for the corresponding `request_id` in your log aggregator. That same ID links to the trace in your tracing backend, where you can see exactly which database query or external call caused the delay. There is no manual instrumentation and no boilerplate, just the correlation the chassis provides out of the box.
Best Practices
- Always use the three-pillar approach: traces, metrics, and logs together. Each pillar answers different questions — traces show request flow, metrics show aggregate trends, logs show application-level events. The `request_id` ties them all together.
- Never include health checks, readiness probes, or metrics endpoints in telemetry collection. Infrastructure noise drowns out real application signals and inflates storage costs.
- Always use `BatchSpanProcessor` instead of `SimpleSpanProcessor` in production. Batch processing keeps per-request overhead negligible while simple processing blocks on every span export.
- Prefer structured JSON logging over unstructured text in production. JSON logs are machine-parseable, compatible with every log aggregator, and queryable by field name.
- Always propagate `request_id` and `correlation_id` through Python's `ContextVar`. Context variables are async-safe and propagate naturally through `await` chains without thread-local hacks.
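The batching recommendation is easy to see with a toy exporter: a simple processor pays one export call (in reality, a network round trip) per span, while a batch processor amortizes that cost across hundreds of spans. This is a stdlib illustration of the trade-off, not the real OpenTelemetry processors:

```python
class ToyExporter:
    """Counts export calls, each standing in for a network round trip."""

    def __init__(self) -> None:
        self.calls = 0

    def export(self, spans: list[str]) -> None:
        self.calls += 1


def simple_process(spans: list[str], exporter: ToyExporter) -> None:
    for span in spans:  # one (blocking) export per span
        exporter.export([span])


def batch_process(spans: list[str], exporter: ToyExporter, batch_size: int = 512) -> None:
    for i in range(0, len(spans), batch_size):  # one export per batch
        exporter.export(spans[i:i + batch_size])


spans = [f"span-{i}" for i in range(1000)]
simple, batch = ToyExporter(), ToyExporter()
simple_process(spans, simple)  # 1000 export calls
batch_process(spans, batch)    # 2 export calls
```

For 1000 spans, the simple strategy performs 1000 exports while the batch strategy performs 2, which is why `BatchSpanProcessor` keeps per-request overhead negligible in production.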
Further Reading
- OpenTelemetry Python Documentation
- Prometheus Best Practices — Instrumentation
- Google SRE Book — Monitoring Distributed Systems
- Python `contextvars` Documentation
What the Agent Never Implements
The chassis handles all observability plumbing, so the agent never needs to:
- Create or manage spans. The FastAPI and SQLAlchemy instrumentors handle the span lifecycle automatically.
- Register Prometheus metrics or expose a `/metrics` endpoint. The middleware creates histograms, counters, and gauges from request traffic.
- Configure log formatters or filters. JSON versus text output toggles with a single environment variable.
- Propagate request IDs through async code. `ContextVar` handles propagation transparently.
- Exclude infrastructure paths from telemetry. Health checks, readiness probes, and metrics endpoints are filtered by default.
- Bootstrap the tracing provider or exporter. Setting `APP_OTEL_ENABLED=true` activates the full OTLP pipeline.