
Rate Limiting

What Your Agent Inherits

Every request to the agent’s endpoints passes through rate limiting before it ever reaches the route handler. Limits are enforced per client IP (or per authorization token) using a fixed-window algorithm with configurable windows and thresholds. If a client exceeds the limit, the middleware returns a 429 Too Many Requests response with a Retry-After header. The agent’s handler is never invoked.

Successful responses also include X-RateLimit-* headers, giving clients the information they need to self-throttle before they hit the wall.
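For instance, a well-behaved client can read those headers and back off before ever receiving a 429. The sketch below is illustrative only; `should_pause` is a hypothetical helper, and just the header names come from this document:

```python
import time


def should_pause(headers: dict[str, str]) -> float:
    """Seconds to wait before the next request, based on rate-limit headers."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # budget left in the current window
    # Window exhausted: wait until the reset epoch advertised by the server.
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    return max(reset_at - time.time(), 0.0)
```

A client would call `should_pause(response.headers)` after each request and sleep for the returned duration.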


Fixed-Window Algorithm

The rate limiter divides time into fixed-length windows (60 seconds by default) and counts requests per client within each window. Once the count exceeds the configured limit, the request is rejected.

The core enforcement logic lives in the middleware’s __call__ method:

RateLimitMiddleware.__call__
async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
    if scope["type"] != "http":
        await self.app(scope, receive, send)
        return

    path = scope.get("path", "")
    if path in self.exempt_paths:
        await self.app(scope, receive, send)
        return

    key = _build_rate_limit_key(
        scope,
        self.key_strategy,
        trust_proxy_headers=self.trust_proxy_headers,
        proxy_headers=self.proxy_headers,
        trusted_proxies=self.trusted_proxies,
    )
    decision = await self.store.hit(key, self.limit, self.window_seconds)
    if not decision.allowed:
        response = JSONResponse(
            status_code=429,
            content={
                "error": "rate_limited",
                "detail": "Request rate limit exceeded",
                "retry_after_seconds": max(decision.reset_at_epoch - int(time.time()), 0),
            },
            headers=_decision_headers(decision),
        )
        await response(scope, receive, send)
        return

    async def send_wrapper(message: Message) -> None:
        if message["type"] == "http.response.start":
            headers = MutableHeaders(raw=message.setdefault("headers", []))
            for key_name, value in _decision_headers(decision).items():
                headers[key_name] = value
        await send(message)

    await self.app(scope, receive, send_wrapper)

Notice the exempt paths check. Health and readiness endpoints are excluded from rate limiting so that load balancer probes are never throttled.


Memory and Redis Backends

The middleware uses a store abstraction that separates counting logic from the storage mechanism. Two implementations ship with the chassis.

MemoryRateLimitStore works well for development and single-instance deployments. Counts live in a Python dictionary keyed by {client}:{bucket}, and stale buckets are pruned on each request:

MemoryRateLimitStore
class MemoryRateLimitStore(RateLimitStore):
    """In-memory fixed-window rate limiting store."""

    def __init__(self) -> None:
        self._buckets: dict[str, tuple[int, int]] = {}

    async def hit(self, key: str, limit: int, window_seconds: int) -> RateLimitDecision:
        now = int(time.time())
        bucket = now // window_seconds
        self._prune_expired_buckets(bucket)
        bucket_key = f"{key}:{bucket}"
        current_bucket, current_count = self._buckets.get(bucket_key, (bucket, 0))
        if current_bucket != bucket:
            current_count = 0
        current_count += 1
        self._buckets[bucket_key] = (bucket, current_count)
        allowed = current_count <= limit
        remaining = max(limit - current_count, 0)
        reset_at = (bucket + 1) * window_seconds
        return RateLimitDecision(allowed, limit, remaining, reset_at)

RedisRateLimitStore is the right choice for multi-instance deployments where rate limits need to be shared across processes. It relies on Redis INCR with automatic key expiry, making the store entirely self-cleaning:

RedisRateLimitStore
class RedisRateLimitStore(RateLimitStore):
    """Redis-backed fixed-window rate limiting store."""

    def __init__(self, storage_url: str) -> None:
        try:
            from redis import asyncio as redis_asyncio
        except ImportError:
            raise ImportError(
                "The 'redis' package is required for Redis-backed rate limiting. "
                "Install it with: uv sync --extra redis"
            ) from None
        self._redis_asyncio = redis_asyncio
        self._client = redis_asyncio.from_url(storage_url, encoding="utf-8", decode_responses=True)

    async def hit(self, key: str, limit: int, window_seconds: int) -> RateLimitDecision:
        now = int(time.time())
        bucket = now // window_seconds
        reset_at = (bucket + 1) * window_seconds
        redis_key = f"rate_limit:{bucket}:{key}"
        count = await self._client.incr(redis_key)
        if count == 1:
            await self._client.expire(redis_key, window_seconds)
        remaining = max(limit - count, 0)
        return RateLimitDecision(count <= limit, limit, remaining, reset_at)

You select the backend at startup through APP_RATE_LIMIT_STORAGE_BACKEND (memory or redis). The middleware constructor wires up the appropriate store automatically:

if storage_url:
    self.store = RedisRateLimitStore(storage_url)
else:
    self.store = MemoryRateLimitStore()

Client Identification

The _build_rate_limit_key() function figures out which client a request belongs to. Two strategies are available:

_build_rate_limit_key
def _build_rate_limit_key(
    scope: Scope,
    key_strategy: str,
    *,
    trust_proxy_headers: bool,
    proxy_headers: list[str],
    trusted_proxies: tuple[TrustedProxyNetwork, ...],
) -> str:
    headers = Headers(raw=scope.get("headers", []))
    if key_strategy == "authorization":
        authorization = headers.get("authorization")
        if authorization:
            # Normalize: strip the Bearer scheme prefix so casing/whitespace
            # differences in the scheme do not produce distinct rate-limit keys.
            token = authorization
            if authorization.lower().startswith("bearer "):
                token = authorization[7:].strip()
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            return f"authorization:{digest}"
    client = scope.get("client")
    client_host = client[0] if client else "unknown"
    if trust_proxy_headers and _is_trusted_proxy(client_host, trusted_proxies):
        forwarded_ip = get_forwarded_client_ip(headers, proxy_headers, trusted_proxies)
        if forwarded_ip:
            return f"ip:{forwarded_ip}"
    return f"ip:{client_host}"
Strategy | Key source | Use case
ip (default) | Client IP from ASGI scope or X-Forwarded-For | Standard per-IP throttling
authorization | SHA-256 hash of the bearer token (with Bearer prefix stripped) | Per-token throttling for API consumers
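The normalization step matters: without it, `Bearer <token>` and `bearer <token>` would count against separate budgets. A standalone sketch of the same derivation (mirroring the logic shown above, outside the middleware):

```python
import hashlib


def rate_limit_key(authorization: str) -> str:
    token = authorization
    if authorization.lower().startswith("bearer "):
        token = authorization[7:].strip()  # drop the scheme, keep the credential
    # Hash the token so raw credentials never land in logs or storage keys.
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return f"authorization:{digest}"


# Scheme casing and extra whitespace do not change the key:
assert rate_limit_key("Bearer abc123") == rate_limit_key("bearer   abc123")
```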

Proxy awareness. When APP_RATE_LIMIT_TRUST_PROXY_HEADERS is enabled, the middleware extracts the real client IP from X-Forwarded-For or X-Real-IP. It only does this when the direct connection comes from an IP in the APP_RATE_LIMIT_TRUSTED_PROXIES allowlist, which prevents IP spoofing from untrusted clients.


Response Headers

Every response, whether the request was allowed or rejected, includes the standard rate-limit headers so clients can self-throttle:

Header | Description
X-RateLimit-Limit | Maximum requests allowed per window
X-RateLimit-Remaining | Requests remaining in the current window
X-RateLimit-Reset | Unix epoch when the current window resets
Retry-After | Seconds until the current window resets; clients should wait this long after a 429

A single helper function generates these headers and applies them to both successful and rate-limited responses:

_decision_headers
def _decision_headers(decision: RateLimitDecision) -> dict[str, str]:
    return {
        "X-RateLimit-Limit": str(decision.limit),
        "X-RateLimit-Remaining": str(decision.remaining),
        "X-RateLimit-Reset": str(decision.reset_at_epoch),
        "Retry-After": str(max(decision.reset_at_epoch - int(time.time()), 0)),
    }

Configuration

All rate limiting settings are controlled through environment variables:

Setting | Default | Description
APP_RATE_LIMIT_ENABLED | false | Master toggle for the middleware
APP_RATE_LIMIT_REQUESTS | 100 | Maximum requests per window
APP_RATE_LIMIT_WINDOW_SECONDS | 60 | Window length in seconds
APP_RATE_LIMIT_KEY_STRATEGY | ip | Key strategy: ip or authorization
APP_RATE_LIMIT_STORAGE_BACKEND | memory | Storage backend: memory or redis
APP_RATE_LIMIT_STORAGE_URL | (empty) | Explicit Redis URL (overrides host/port/db)
APP_RATE_LIMIT_TRUST_PROXY_HEADERS | false | Honor X-Forwarded-For from trusted proxies
APP_RATE_LIMIT_PROXY_HEADERS | ["X-Forwarded-For", "X-Real-IP"] | Headers to inspect for client IP
APP_RATE_LIMIT_TRUSTED_PROXIES | [] | CIDR allowlist for proxy trust

If APP_RATE_LIMIT_STORAGE_BACKEND=redis and no explicit APP_RATE_LIMIT_STORAGE_URL is provided, the URL is built automatically from APP_REDIS_HOST, APP_REDIS_PORT, APP_REDIS_DB, and APP_REDIS_PASSWORD.


Best Practices

  • Always use Redis-backed rate limiting for multi-instance deployments. In-memory rate limiting splits counters across processes, effectively multiplying the allowed rate by the number of workers or pods.
  • Never apply rate limiting to health check and readiness endpoints. Throttling probe traffic causes false-positive failures in load balancers and orchestrators.
  • Prefer IP-based rate limiting as the default strategy and switch to authorization-based limiting only for authenticated API consumers who need per-token budgets.
  • Always include X-RateLimit-* and Retry-After headers in both successful and rate-limited responses. These headers enable well-behaved clients to self-throttle before hitting the limit.
  • Always validate proxy trust before extracting client IPs from X-Forwarded-For. Without trust validation, clients behind untrusted proxies can spoof their IP to bypass rate limits.


What the Agent Never Implements

The middleware layer owns all of the following concerns. Your agent’s generated code should never contain logic for:

  • Request counting. The fixed-window algorithm runs before the route handler.
  • Backend selection. Memory vs. Redis is a deployment decision, not application code.
  • Client IP extraction. Proxy-aware IP resolution is handled by the middleware.
  • 429 responses. Rate-limited requests are rejected before they reach the agent.
  • Rate-limit headers. X-RateLimit-* and Retry-After are injected automatically.
  • Exempt-path management. Health and readiness endpoints are excluded by the builder.