# Rate Limiting

## What Your Agent Inherits
Every request to the agent’s endpoints passes through rate limiting before it ever reaches the route handler.
Limits are enforced per client IP (or per authorization token) using a fixed-window algorithm with configurable windows and thresholds.
If a client exceeds the limit, the middleware returns a `429 Too Many Requests` response with a `Retry-After` header. The agent's handler is never invoked.
Successful responses also include `X-RateLimit-*` headers, giving clients the information they need to self-throttle before they hit the wall.
## Fixed-Window Algorithm
The rate limiter divides time into fixed-length windows (60 seconds by default) and counts requests per client within each window. Once the count exceeds the configured limit, the request is rejected.
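To make the bucket arithmetic concrete, here is a small standalone sketch of the window calculation used throughout the stores (the `bucket_for` helper is illustrative, not part of the chassis):

```python
WINDOW_SECONDS = 60  # default window length


def bucket_for(now: int, window_seconds: int = WINDOW_SECONDS) -> tuple[int, int]:
    """Return (bucket index, epoch second when that bucket's window resets)."""
    bucket = now // window_seconds
    reset_at = (bucket + 1) * window_seconds
    return bucket, reset_at


# Two timestamps 59 seconds apart can share a window...
assert bucket_for(120)[0] == bucket_for(179)[0]
# ...while one more second rolls over into the next window.
assert bucket_for(180)[0] == bucket_for(120)[0] + 1
```

All requests in the same bucket share one counter, which is why a burst at a window boundary can briefly see up to twice the configured rate across two adjacent windows.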
The core enforcement logic lives in the middleware's `__call__` method:
```python
async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
    if scope["type"] != "http":
        await self.app(scope, receive, send)
        return

    path = scope.get("path", "")
    if path in self.exempt_paths:
        await self.app(scope, receive, send)
        return

    key = _build_rate_limit_key(
        scope,
        self.key_strategy,
        trust_proxy_headers=self.trust_proxy_headers,
        proxy_headers=self.proxy_headers,
        trusted_proxies=self.trusted_proxies,
    )
    decision = await self.store.hit(key, self.limit, self.window_seconds)
    if not decision.allowed:
        response = JSONResponse(
            status_code=429,
            content={
                "error": "rate_limited",
                "detail": "Request rate limit exceeded",
                "retry_after_seconds": max(decision.reset_at_epoch - int(time.time()), 0),
            },
            headers=_decision_headers(decision),
        )
        await response(scope, receive, send)
        return

    async def send_wrapper(message: Message) -> None:
        if message["type"] == "http.response.start":
            headers = MutableHeaders(raw=message.setdefault("headers", []))
            for key_name, value in _decision_headers(decision).items():
                headers[key_name] = value
        await send(message)

    await self.app(scope, receive, send_wrapper)
```

Notice the exempt-paths check: health and readiness endpoints are excluded from rate limiting so that load balancer probes are never throttled.
## Memory and Redis Backends
The middleware uses a store abstraction that separates counting logic from the storage mechanism. Two implementations ship with the chassis.
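The exact definitions of the store interface and its decision object are not shown here, but the abstraction can be sketched roughly as follows; field and method names are inferred from how the two stores below use them:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RateLimitDecision:
    """Outcome of one counted request (fields inferred from usage in the stores)."""

    allowed: bool         # False once the window's count exceeds the limit
    limit: int            # configured maximum requests per window
    remaining: int        # requests left in the current window
    reset_at_epoch: int   # Unix epoch second when the window resets


class RateLimitStore(Protocol):
    """Anything that can count a hit against a key and decide whether to allow it."""

    async def hit(self, key: str, limit: int, window_seconds: int) -> RateLimitDecision: ...
```

Because the middleware only calls `store.hit(...)`, swapping backends never touches enforcement logic.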
`MemoryRateLimitStore` works well for development and single-instance deployments.
Counts live in a Python dictionary keyed by `{client}:{bucket}`, and stale buckets are pruned on each request:
```python
class MemoryRateLimitStore(RateLimitStore):
    """In-memory fixed-window rate limiting store."""

    def __init__(self) -> None:
        self._buckets: dict[str, tuple[int, int]] = {}

    async def hit(self, key: str, limit: int, window_seconds: int) -> RateLimitDecision:
        now = int(time.time())
        bucket = now // window_seconds
        self._prune_expired_buckets(bucket)
        bucket_key = f"{key}:{bucket}"
        current_bucket, current_count = self._buckets.get(bucket_key, (bucket, 0))
        if current_bucket != bucket:
            current_count = 0

        current_count += 1
        self._buckets[bucket_key] = (bucket, current_count)
        allowed = current_count <= limit
        remaining = max(limit - current_count, 0)
        reset_at = (bucket + 1) * window_seconds
        return RateLimitDecision(allowed, limit, remaining, reset_at)
```

`RedisRateLimitStore` is the right choice for multi-instance deployments where rate limits need to be shared across processes.
It relies on Redis INCR with automatic key expiry, making the store entirely self-cleaning:
```python
class RedisRateLimitStore(RateLimitStore):
    """Redis-backed fixed-window rate limiting store."""

    def __init__(self, storage_url: str) -> None:
        try:
            from redis import asyncio as redis_asyncio
        except ImportError:
            raise ImportError(
                "The 'redis' package is required for Redis-backed rate limiting. "
                "Install it with: uv sync --extra redis"
            ) from None
        self._redis_asyncio = redis_asyncio
        self._client = redis_asyncio.from_url(storage_url, encoding="utf-8", decode_responses=True)

    async def hit(self, key: str, limit: int, window_seconds: int) -> RateLimitDecision:
        now = int(time.time())
        bucket = now // window_seconds
        reset_at = (bucket + 1) * window_seconds
        redis_key = f"rate_limit:{bucket}:{key}"

        count = await self._client.incr(redis_key)
        if count == 1:
            await self._client.expire(redis_key, window_seconds)

        remaining = max(limit - count, 0)
        return RateLimitDecision(count <= limit, limit, remaining, reset_at)
```

You select the backend at startup through `APP_RATE_LIMIT_STORAGE_BACKEND` (`memory` or `redis`).
The middleware constructor wires up the appropriate store automatically:
```python
if storage_url:
    self.store = RedisRateLimitStore(storage_url)
else:
    self.store = MemoryRateLimitStore()
```

## Client Identification
The `_build_rate_limit_key()` function figures out which client a request belongs to.
Two strategies are available:
```python
def _build_rate_limit_key(
    scope: Scope,
    key_strategy: str,
    *,
    trust_proxy_headers: bool,
    proxy_headers: list[str],
    trusted_proxies: tuple[TrustedProxyNetwork, ...],
) -> str:
    headers = Headers(raw=scope.get("headers", []))
    if key_strategy == "authorization":
        authorization = headers.get("authorization")
        if authorization:
            # Normalize: strip the Bearer scheme prefix so casing/whitespace
            # differences in the scheme do not produce distinct rate-limit keys.
            token = authorization
            if authorization.lower().startswith("bearer "):
                token = authorization[7:].strip()
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            return f"authorization:{digest}"

    client = scope.get("client")
    client_host = client[0] if client else "unknown"

    if trust_proxy_headers and _is_trusted_proxy(client_host, trusted_proxies):
        forwarded_ip = get_forwarded_client_ip(headers, proxy_headers, trusted_proxies)
        if forwarded_ip:
            return f"ip:{forwarded_ip}"

    return f"ip:{client_host}"
```

| Strategy | Key source | Use case |
|---|---|---|
| `ip` (default) | Client IP from ASGI scope or `X-Forwarded-For` | Standard per-IP throttling |
| `authorization` | SHA-256 hash of the bearer token (with `Bearer` prefix stripped) | Per-token throttling for API consumers |
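The authorization strategy reduces any spelling of the same bearer token to one stable, non-reversible key. A minimal standalone reproduction of that normalization and hashing (the `authorization_key` helper is illustrative):

```python
import hashlib


def authorization_key(authorization: str) -> str:
    """Derive a rate-limit key from an Authorization header value,
    mirroring the normalization in _build_rate_limit_key."""
    token = authorization
    if authorization.lower().startswith("bearer "):
        token = authorization[7:].strip()
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return f"authorization:{digest}"


# "Bearer", "bearer", and extra whitespace all map to the same key,
# so one consumer cannot dodge its budget by re-casing the scheme.
assert authorization_key("Bearer abc123") == authorization_key("bearer  abc123")
```

Hashing also keeps raw credentials out of the store: only the digest is ever used as a key.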
**Proxy awareness.** When `APP_RATE_LIMIT_TRUST_PROXY_HEADERS` is enabled, the middleware extracts the real client IP from `X-Forwarded-For` or `X-Real-IP`. It does so only when the direct connection comes from an IP in the `APP_RATE_LIMIT_TRUSTED_PROXIES` allowlist, which prevents IP spoofing by untrusted clients.
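The trust check itself can be sketched with the standard library's `ipaddress` module; this is an illustrative stand-in for `_is_trusted_proxy`, not the chassis implementation:

```python
import ipaddress


def is_trusted_proxy(client_host: str, trusted_cidrs: list[str]) -> bool:
    """Return True if the direct peer falls inside one of the allowlisted networks."""
    try:
        client_ip = ipaddress.ip_address(client_host)
    except ValueError:
        return False  # e.g. "unknown" when the ASGI scope has no client
    return any(client_ip in ipaddress.ip_network(cidr) for cidr in trusted_cidrs)


# Only peers inside the allowlist may supply X-Forwarded-For.
assert is_trusted_proxy("10.0.0.7", ["10.0.0.0/8"])
assert not is_trusted_proxy("203.0.113.9", ["10.0.0.0/8"])
```

Without this gate, any direct client could set `X-Forwarded-For` itself and rotate through fake IPs to reset its budget.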
## Response Headers
Every response includes standard rate-limit headers regardless of whether it was allowed or rejected, so clients can self-throttle:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Maximum requests allowed per window |
| `X-RateLimit-Remaining` | Requests remaining in the current window |
| `X-RateLimit-Reset` | Unix epoch when the current window resets |
| `Retry-After` | Seconds until the client should retry (only on 429 responses) |
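A well-behaved client can use these headers to pause before the next 429 ever happens. A minimal sketch of that client-side logic (the `seconds_until_reset` helper is hypothetical, not part of the chassis):

```python
def seconds_until_reset(headers: dict[str, str], now: int) -> int:
    """How long a client should wait before its next request, given the
    X-RateLimit-* headers from the previous response."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0  # budget left in this window; no need to wait
    reset_at = int(headers["X-RateLimit-Reset"])
    return max(reset_at - now, 0)


# Budget exhausted; window resets 12 seconds from "now", so wait 12 seconds.
assert seconds_until_reset(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}, now=988
) == 12
```

Sleeping for the returned duration keeps the client from burning requests that the server will only reject.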
A single helper function generates these headers and applies them to both successful and rate-limited responses:
```python
def _decision_headers(decision: RateLimitDecision) -> dict[str, str]:
    return {
        "X-RateLimit-Limit": str(decision.limit),
        "X-RateLimit-Remaining": str(decision.remaining),
        "X-RateLimit-Reset": str(decision.reset_at_epoch),
        "Retry-After": str(max(decision.reset_at_epoch - int(time.time()), 0)),
    }
```

## Configuration
All rate limiting settings are controlled through environment variables:
| Setting | Default | Description |
|---|---|---|
| `APP_RATE_LIMIT_ENABLED` | `false` | Master toggle for the middleware |
| `APP_RATE_LIMIT_REQUESTS` | `100` | Maximum requests per window |
| `APP_RATE_LIMIT_WINDOW_SECONDS` | `60` | Window length in seconds |
| `APP_RATE_LIMIT_KEY_STRATEGY` | `ip` | Key strategy: `ip` or `authorization` |
| `APP_RATE_LIMIT_STORAGE_BACKEND` | `memory` | Storage backend: `memory` or `redis` |
| `APP_RATE_LIMIT_STORAGE_URL` | (empty) | Explicit Redis URL (overrides host/port/db) |
| `APP_RATE_LIMIT_TRUST_PROXY_HEADERS` | `false` | Honor `X-Forwarded-For` from trusted proxies |
| `APP_RATE_LIMIT_PROXY_HEADERS` | `["X-Forwarded-For", "X-Real-IP"]` | Headers to inspect for client IP |
| `APP_RATE_LIMIT_TRUSTED_PROXIES` | `[]` | CIDR allowlist for proxy trust |
If `APP_RATE_LIMIT_STORAGE_BACKEND=redis` and no explicit `APP_RATE_LIMIT_STORAGE_URL` is provided, the URL is built automatically from `APP_REDIS_HOST`, `APP_REDIS_PORT`, `APP_REDIS_DB`, and `APP_REDIS_PASSWORD`.
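The chassis's exact URL-construction code is not shown here, but the assembly from those settings could plausibly look like this sketch (the `build_redis_url` helper is hypothetical):

```python
def build_redis_url(host: str, port: int, db: int, password: str = "") -> str:
    """Assemble a redis:// URL from APP_REDIS_* style settings.
    Illustrative only -- not the chassis's actual construction logic."""
    auth = f":{password}@" if password else ""
    return f"redis://{auth}{host}:{port}/{db}"


assert build_redis_url("localhost", 6379, 0) == "redis://localhost:6379/0"
assert build_redis_url("cache", 6380, 2, "s3cret") == "redis://:s3cret@cache:6380/2"
```

An explicit `APP_RATE_LIMIT_STORAGE_URL` bypasses this assembly entirely, which is useful for TLS (`rediss://`) or Sentinel-style URLs.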
## Best Practices
- Always use Redis-backed rate limiting for multi-instance deployments. In-memory rate limiting splits counters across processes, effectively multiplying the allowed rate by the number of workers or pods.
- Never apply rate limiting to health check and readiness endpoints. Throttling probe traffic causes false-positive failures in load balancers and orchestrators.
- Prefer IP-based rate limiting as the default strategy and switch to authorization-based limiting only for authenticated API consumers who need per-token budgets.
- Always include `X-RateLimit-*` and `Retry-After` headers in both successful and rate-limited responses. These headers enable well-behaved clients to self-throttle before hitting the limit.
- Always validate proxy trust before extracting client IPs from `X-Forwarded-For`. Without trust validation, clients behind untrusted proxies can spoof their IP to bypass rate limits.
## Further Reading
- IETF Draft — RateLimit Header Fields for HTTP
- OWASP — Denial of Service Cheat Sheet (Rate Limiting)
- Redis INCR Command Documentation
- Cloudflare — Rate Limiting Best Practices
## What the Agent Never Implements
The middleware layer owns all of the following concerns. Your agent’s generated code should never contain logic for:
- Request counting. The fixed-window algorithm runs before the route handler.
- Backend selection. Memory vs. Redis is a deployment decision, not application code.
- Client IP extraction. Proxy-aware IP resolution is handled by the middleware.
- 429 responses. Rate-limited requests are rejected before they reach the agent.
- Rate-limit headers. `X-RateLimit-*` and `Retry-After` are injected automatically.
- Exempt-path management. Health and readiness endpoints are excluded by the builder.