Reverse Proxy

Load Balancer

Distributes traffic across backend pools with health checks, circuit breakers, and outlier detection

Overview

Distributes requests across backend servers with health-aware routing, circuit breakers, and outlier detection. Replaces separate load balancer, health checker, and circuit breaker tools with one integrated module. Applies to every proxy route — each mapping can have its own backend pool, algorithm, and health check configuration.

Key differentiators:

Cluster-native state: all nodes share circuit breaker, health, and connection state via the memory module. One node’s health check benefits all nodes, and circuit trips are replicated to every node in the cluster.
HTTP/3 backend support: full QUIC support for backend connections with HTTP/3 health checks. Per-protocol circuit breaking allows HTTP/3 to fail independently while HTTP/2 continues working (e.g., QUIC blocked by firewall).
Expression-based circuit breaker trip conditions: custom logic like “error_ratio > 0.1 && p99_latency > 2000” (latency variables are in milliseconds) instead of rigid thresholds.
JA4/JA4H fingerprint routing: session affinity based on TLS fingerprint, no cookies required. Works for any protocol, including non-HTTP.
7 health check types: TCP, HTTP, HTTP/3, gRPC, MySQL, PostgreSQL, Redis with native protocol checks (not just TCP port probes).
7 load balancing algorithms: adaptive (default), round_robin, weighted (EDF scheduling), least_conn (Power of Two Choices), hash (xxhash), maglev (Google consistent hashing), random. The adaptive algorithm uses epsilon-greedy reinforcement learning — it starts exploring all backends equally, measures real latency and error rates, then gradually shifts traffic toward the best performers while still probing others to detect recovery. Zero configuration.
Outlier detection with three independent detection mechanisms: consecutive failures, success rate analysis (statistical), and failure percentage (threshold-based).
DNS-based service discovery with automatic backend add/remove on DNS changes.
Token bucket rate limiting with per-pool or per-user limits, distributed across cluster. Per-user mode enabled via rate_limit_per_user = true on proxy mappings.

All state is shared cluster-wide. Reads are served from local in-memory cache for near-instant response. Writes are replicated to all nodes with eventual consistency.

Config

Configuration is primarily done through [[proxy.mapping]] sub-tables. The load balancer does not have its own top-level TOML section; pools are created and managed automatically by the proxy service when mappings have multiple backends.

[[proxy.mapping]]
  service = ["http://be1:8080", "http://be2:8080"]  # Array triggers LB pool creation
  lb_strategy = "adaptive"         # Algorithm: adaptive, round_robin, weighted, least_connections,
                                   #   hash, maglev, random (default: adaptive)
  lb_weights = [5, 3]              # Backend weights for "weighted" strategy (one per backend)
  lb_hash_key = "cookie:session_id"  # Hash key for "hash"/"maglev" strategies
                                   # Options: cookie:<name>, ja4, ja4h, ip, header:<name>
  enable_http3 = false             # Enable HTTP/3 (QUIC) backend connections
  protocol_preference = "prefer_http3"  # Protocol preference: prefer_http3, prefer_http2
  dns_discovery = false            # Enable DNS-based service discovery
  dns_refresh = "30s"              # DNS refresh interval

# Default health check (active by default even without this section).
# Applies to ALL mappings that don't have an explicit [proxy.mapping.health_check].
# Any non-5xx response = healthy. Only connection errors, timeouts, and 5xx = unhealthy.
# Health check path is derived from each mapping's route path.
[proxy.default_health_check]
  enabled = true                   # Set false to disable default health checks
  type = "http"                    # Check type: tcp, http, http3, grpc
  method = "GET"                   # HTTP method (default: GET)
  interval = "15s"                 # Check interval (default: 15s)
  timeout = "5s"                   # Check timeout (default: 5s)
  unhealthy_threshold = 3          # Consecutive failures to mark unhealthy (default: 3)
  healthy_threshold = 2            # Consecutive successes to mark healthy (default: 2)

# Per-mapping health check (overrides default_health_check for this mapping)
[proxy.mapping.health_check]
  enabled = true                   # Enable health checking (default: true)
  type = "http"                    # Check type: tcp, http, http3, grpc
  path = "/health"                 # HTTP/HTTP3 health check path
  method = "GET"                   # HTTP method (default: GET)
  expected_status = [200]          # Expected HTTP status codes (empty = any non-5xx is healthy)
  interval = "10s"                 # Check interval (default: 10s)
  timeout = "5s"                   # Check timeout (default: 5s)
  unhealthy_threshold = 3          # Consecutive failures to mark unhealthy (default: 3)
  healthy_threshold = 2            # Consecutive successes to mark healthy (default: 2)
  tls_skip_verify = false          # Skip TLS certificate verification
  grpc_service = ""                # gRPC service name for grpc.health.v1.Health/Check

[proxy.mapping.circuit_breaker]
  enabled = true                   # Enable circuit breaker (default: false)
  error_ratio_threshold = 0.5      # Max error ratio 0.0-1.0 (threshold mode)
  latency_p95_threshold = "1s"     # Max P95 latency (threshold mode)
  network_error_threshold = 0.3    # Max network error ratio (threshold mode)
  combine_mode = "or"              # Threshold combine: "or" (any) or "and" (all)
  trip_expression = ""             # Expression mode (overrides thresholds when set)
                                   # Variables: error_ratio, success_ratio, p50_latency,
                                   #   p95_latency, p99_latency, avg_latency (all ms),
                                   #   network_error_ratio, timeout_ratio, request_count
  min_samples = 10                 # Minimum requests before evaluation (default: 10)
  error_window = "60s"             # Sliding window for error tracking (default: 60s)
  fallback_duration = "30s"        # Duration circuit stays open (default: 30s)
  success_threshold = 3            # Successes in half-open to close (default: 3)
  fallback_mode = "error"          # Open behavior: "error" (503) or "pool" (fallback pool)
  fallback_pool_id = ""            # Pool ID for fallback_mode="pool"
  response_code = 503              # HTTP status when fallback_mode="error" (default: 503)
  per_protocol = false             # Track circuits per protocol (http3/http2/http1)
  fallback_protocol = "http2"      # Protocol when per-protocol circuit opens

[proxy.mapping.outlier_detection]
  enabled = true                   # Enable outlier detection (default: false)
  consecutive_5xx = 5              # Eject after N consecutive 5xx (default: 5)
  consecutive_gateway_failure = 5  # Eject after N consecutive 502/503/504 (default: 5)
  consecutive_local_failure = 5    # Eject after N consecutive connection errors (default: 5)
  success_rate_enabled = true      # Enable statistical success rate analysis
  success_rate_min_hosts = 5       # Min hosts for success rate calculation
  success_rate_min_requests = 100  # Min requests per host for success rate
  success_rate_stdev_factor = 1.9  # Standard deviation factor for outlier threshold
  success_rate_enforcing_pct = 100 # Percentage of detected outliers actually ejected
  failure_percentage_enabled = true  # Enable failure percentage threshold
  failure_percentage_threshold = 85  # Failure % above which backend is ejected
  failure_percentage_min_hosts = 5   # Min hosts for failure percentage
  failure_percentage_min_reqs = 50   # Min requests for failure percentage
  interval = "10s"                 # Detection sweep interval (default: 10s)
  base_ejection_time = "30s"       # Initial ejection duration (default: 30s)
  max_ejection_time = "300s"       # Maximum ejection duration (default: 300s)
  max_ejection_percent = 10        # Max % of backends ejected at once (default: 10)
  ejection_jitter_pct = 10         # Random jitter on re-admission time (default: 10)

Rate limiting (configured via proxy mapping fields, not a sub-table):

  rate_limit = "200/1m"            # Per-mapping rate limit (count/duration)
  rate_limit_per_user = false      # Per-user limits (vs shared pool limit, cluster-wide)
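
A rough sketch of how a limit such as "200/1m" could map onto a token bucket. The source only states that limits use a count/duration form and that the limiter is a token bucket; the parsing rules, the supported units, the class name, and the choice of bucket capacity equal to the count are all assumptions made for illustration.

```python
import re

# Assumed duration units; the source only shows "1m".
_UNITS = {"s": 1.0, "m": 60.0, "h": 3600.0}


def parse_rate_limit(spec):
    """Parse 'count/duration' (e.g. '200/1m') into (count, window_seconds)."""
    match = re.fullmatch(r"(\d+)/(\d+)([smh])", spec.strip())
    if match is None:
        raise ValueError(f"unsupported rate limit: {spec!r}")
    count, amount, unit = match.groups()
    return int(count), int(amount) * _UNITS[unit]


class TokenBucket:
    """Hypothetical bucket: capacity == count, refilled at count/window per second."""

    def __init__(self, count, window_seconds, now=0.0):
        self.capacity = float(count)
        self.refill_per_second = count / window_seconds
        self.tokens = float(count)
        self.updated_at = now

    def allow(self, now, cost=1.0):
        # Add tokens for the elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With rate_limit_per_user = true, one such bucket would be kept per user rather than one per pool; in the real module the counters are shared cluster-wide, which this single-process sketch does not model.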

DNS discovery modes (dns_mode field in programmatic API):

  "internal" (default): Use Hexon DNS module with DNSSEC, caching, cluster config
  "system": Use system resolver directly (no DNSSEC)
  "custom": Use custom resolvers directly (requires dns_resolvers list)

Maglev table size default: 65537 (prime number). Configurable via programmatic API.

Troubleshooting

Common symptoms and diagnostic steps:

Backend never selected (no healthy backends):

  - Check health status: 'proxy health' or 'proxy health <pool-id>'
  - Verify backend is reachable: 'dns test <backend-hostname>'
  - Wrong health check type: TCP check passes but HTTP check fails (wrong path/status)
  - gRPC health check requires grpc.health.v1.Health service on backend
  - HTTP/3 health check failing: QUIC/UDP may be blocked by firewall
  - Database checks (mysql/postgresql/redis): verify protocol handshake, not auth
  - Threshold too strict: raise unhealthy_threshold (more consecutive failures
    required before marking down) or increase interval

Circuit breaker tripped unexpectedly:

  - Check circuit state: 'proxy circuits'
  - Review trip expression: expression variables are in milliseconds for latency
  - min_samples too low: brief error bursts trip the circuit prematurely
  - error_window too short: a brief burst of errors dominates the window and trips the circuit
  - gRPC pitfall: HTTP status is always 200; circuit uses gRPC status codes
    (4 DEADLINE_EXCEEDED, 8 RESOURCE_EXHAUSTED, 13 INTERNAL, 14 UNAVAILABLE,
    15 DATA_LOSS count as server errors). Check grpc-status trailers.
  - Per-protocol circuit: HTTP/3 circuit may be open while HTTP/2 works fine;
    check per-protocol states via 'proxy circuits'
  - Manual reset: 'proxy reset <pool-id> <backend-id>'

Uneven traffic distribution:

  - Weighted strategy: verify lb_weights array matches service array length
  - Least-connections: requires ConnectionOpened/ConnectionClosed tracking;
    check 'proxy backends <pool-id>' for connection counts
  - Hash/Maglev: same hash key always routes to same backend (by design);
    verify lb_hash_key is sufficiently varied across requests
  - Maglev table imbalance: small backend count can cause uneven distribution;
    increase table size or add more backends
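
When checking whether a weighted pool is behaving as configured, it can help to compute the split the weights should produce. The following is a minimal sketch of EDF-style scheduling, assuming each backend's deadline advances by 1/weight every time it is selected. The function names are hypothetical and this is not the module's actual code.

```python
import heapq
from collections import Counter
from fractions import Fraction


def edf_order(backends, weights, picks):
    """Return the backends chosen over `picks` selections, earliest deadline first."""
    if len(backends) != len(weights):
        raise ValueError("lb_weights must have one entry per backend")
    # Heap entries are (deadline, position); exact fractions avoid float drift.
    heap = [(Fraction(1, w), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    order = []
    for _ in range(picks):
        deadline, i = heapq.heappop(heap)
        order.append(backends[i])
        # Higher weight means a smaller increment, so the backend comes up sooner.
        heapq.heappush(heap, (deadline + Fraction(1, weights[i]), i))
    return order


def expected_split(backends, weights, picks):
    """Count how often each backend is chosen over `picks` selections."""
    return Counter(edf_order(backends, weights, picks))
```

Comparing expected_split(...) against the per-backend request counts over the same number of requests gives a quick sanity check; large deviations point at health, circuit, or outlier exclusions rather than at the weights themselves.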

All backends ejected (outlier detection too aggressive):

  - Check outlier state: 'proxy outliers' or 'proxy outliers <pool-id>'
  - max_ejection_percent too high: set to 10-33% to always keep backends active
  - consecutive_5xx threshold too low: increase from 5 to 10+ for noisy backends
  - success_rate_stdev_factor too low: increase from 1.9 to 2.5+ for high variance
  - Manual re-admit: 'proxy uneject <pool-id> <backend-id>'
  - Ejection backoff: duration doubles with each re-ejection
    (base_ejection_time * 2^(ejection_count - 1)), capped at max_ejection_time

DNS discovery not updating backends:

  - Check discovery state: 'proxy pools <pool-id>' shows discovery config
  - DNS resolution failing: 'dns test <service-hostname>'
  - DNSSEC validation failing on unsigned zone: use dns_mode="system"
  - Custom resolvers unreachable: check dns_resolvers list
  - Exponential backoff active: after DNS failures, refresh backs off up to 5 minutes
  - Force immediate refresh via programmatic API (RefreshDiscovery operation)

Connection pool exhaustion (high latency, timeouts):

  - Check pool stats: 'connpool stats' and 'connpool pools'
  - Backend slow to respond: connections pile up, least-conn skews
  - Circuit breaker not tripping: error_window may be too long to catch slow failures

Rate limit hitting unexpectedly:

  - Per-user vs per-pool: per-pool limit shared across all users
  - Cluster-wide counting: nodes may have slight count drift due to replication lag
  - Burst exhausted: burst tokens consumed, waiting for window refill
  - Check: 'proxy traffic <app-name>' for request rate metrics

Cross-subsystem diagnostics:

  - Full domain diagnostic: 'diagnose domain <hostname>'
  - Full user access diagnostic: 'diagnose user <username>'

Architecture

Cluster state management and design:

State categories and retention:

  Pool configurations:        retained for 24h, refreshed on update
  Backend health states:      retained for 30s, updated by health checker
  Circuit breaker states:     retained for 24h
  Outlier detection states:   retained for 24h per backend
  Connection counts:          retained for 60s per backend
  Rate limit counters:        retained for window duration + 1 minute
  Backend metrics:            retained for circuit breaker decision window

Read/write model:

  Reads (e.g., backend selection, health status, circuit state) are served
  from local in-memory cache for near-instant response.
  Writes (e.g., pool creation, health updates, circuit trips) are replicated
  to all nodes with eventual consistency.

Load balancing algorithm details:

  adaptive: Epsilon-greedy selection that learns from request outcomes. Starts with
    round-robin (learning phase, first 50 requests), then exploits the best-scoring
    backend 95% of the time and explores randomly 5%. Scores combine success rate,
    latency (EMA), timeout rate, and consecutive failures with decay. Default strategy.
  round_robin: Sequential rotation across backends. O(1) per selection.
  weighted: Earliest Deadline First (EDF) scheduling. Higher weight = smaller deadline
    increment = selected more often. Smoother than GCD-based approaches (interleaved,
    not batched). O(log n) per selection.
  least_conn: Power of Two Choices (P2C) — picks 2 random backends, selects the one
    with fewer connections. O(1) complexity with near-optimal distribution. Weighted:
    effective_connections = actual_connections * 1000 / weight.
  hash: Consistent hashing with xxhash. Same key always routes to same backend unless
    pool membership changes. O(1) per selection.
  maglev: Google Maglev consistent hashing. Table size default 65537 (prime). Better
    distribution than ring hash with minimal disruption on backend changes. O(1) lookup
    after O(n * table_size) table build.
  random: Uniform random selection. O(1) per selection.
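
The adaptive strategy described above can be sketched roughly as follows. The 50-request learning phase and the 95%/5% exploit/explore split come from the description; the scoring formula here (an exponential moving average of a reward that favours successes and low latency) is a simplified stand-in, since the real score also accounts for timeout rate and consecutive failures. Class and method names are hypothetical.

```python
import random


class AdaptivePicker:
    LEARNING_REQUESTS = 50  # round-robin learning phase
    EPSILON = 0.05          # fraction of requests used for exploration

    def __init__(self, backends, rng=None):
        self.backends = list(backends)
        self.rng = rng or random.Random()
        self.requests = 0
        # Assumed optimistic starting score for every backend.
        self.scores = {b: 1.0 for b in self.backends}

    def pick(self):
        self.requests += 1
        if self.requests <= self.LEARNING_REQUESTS:
            # Learning phase: plain rotation so every backend gets measured.
            return self.backends[(self.requests - 1) % len(self.backends)]
        if self.rng.random() < self.EPSILON:
            # Explore: probe a random backend to notice recovery.
            return self.rng.choice(self.backends)
        # Exploit: route to the best-scoring backend.
        return max(self.backends, key=lambda b: self.scores[b])

    def record(self, backend, success, latency_ms, alpha=0.2):
        # Simplified reward: 0 for a failure, otherwise higher for lower latency.
        reward = (1.0 if success else 0.0) / (1.0 + latency_ms / 100.0)
        self.scores[backend] = (1 - alpha) * self.scores[backend] + alpha * reward
```

Keeping a small exploration share is what lets a previously slow backend win traffic back once it recovers; a pure greedy picker would never measure it again.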

Circuit breaker state machine:

  closed --> open:      Trip conditions met (after min_samples within error_window)
  open --> half_open:   fallback_duration elapsed
  half_open --> closed: success_threshold consecutive successes
  half_open --> open:   Any failure

  Per-protocol mode: independent state machines for http3, http2, http1, tcp.
  Automatic fallback chain: http3 -> http2 -> http1 (configurable via fallback_protocol).
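
The transitions above, together with threshold-mode evaluation, can be sketched as a small state machine. This is an illustrative model only: the class and function names are hypothetical, only two of the documented thresholds are shown, and the use of strict ">" comparisons is an assumption.

```python
def should_trip(error_ratio, p95_latency_ms, request_count, *,
                error_ratio_threshold=0.5, latency_p95_threshold_ms=1000.0,
                min_samples=10, combine_mode="or"):
    """Threshold mode: evaluate only after min_samples, then combine conditions."""
    if request_count < min_samples:
        return False
    conditions = [
        error_ratio > error_ratio_threshold,
        p95_latency_ms > latency_p95_threshold_ms,
    ]
    return any(conditions) if combine_mode == "or" else all(conditions)


class CircuitBreaker:
    def __init__(self, fallback_duration=30.0, success_threshold=3):
        self.fallback_duration = fallback_duration
        self.success_threshold = success_threshold
        self.state = "closed"
        self.opened_at = None
        self.half_open_successes = 0

    def trip(self, now):
        """closed -> open (or half_open -> open after a failed probe)."""
        self.state = "open"
        self.opened_at = now
        self.half_open_successes = 0

    def allow_request(self, now):
        # open -> half_open once fallback_duration has elapsed.
        if self.state == "open" and now - self.opened_at >= self.fallback_duration:
            self.state = "half_open"
            self.half_open_successes = 0
        return self.state != "open"

    def record(self, success, now):
        if self.state != "half_open":
            return
        if not success:
            self.trip(now)  # any failure while half-open reopens the circuit
            return
        self.half_open_successes += 1
        if self.half_open_successes >= self.success_threshold:
            self.state = "closed"
```

In per-protocol mode one such state machine would exist per protocol for each backend, so an open http3 circuit leaves the http2 one untouched.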

Outlier detection mechanisms:

  1. Consecutive failures (immediate): eject after N consecutive 5xx / gateway / local errors
  2. Success rate (statistical): eject if success_rate < (avg_success_rate - stdev_factor * stdev),
     where the average and stdev are computed across the pool's eligible backends
  3. Failure percentage (threshold): eject if failure% > threshold
  Ejection backoff: base_ejection_time * 2^(ejection_count - 1), capped at max_ejection_time.
  Re-admission: automatic after ejection duration expires; counters reset.
  Jitter: ejection_jitter_pct random jitter on re-admission to prevent thundering herd.
  Safety: max_ejection_percent prevents ejecting all backends simultaneously.
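
As a worked illustration of the backoff formula and the statistical check above, the sketch below computes ejection durations and success-rate outliers. Function names are hypothetical, and the use of the population standard deviation is an assumption; the source does not say which estimator is used.

```python
import statistics


def ejection_duration(ejection_count, base=30.0, maximum=300.0):
    """base * 2^(ejection_count - 1) in seconds, capped at maximum."""
    return min(maximum, base * 2 ** (ejection_count - 1))


def success_rate_outliers(success_rates, stdev_factor=1.9):
    """Return hosts whose success rate falls below mean - stdev_factor * stdev."""
    rates = list(success_rates.values())
    mean = statistics.mean(rates)
    stdev = statistics.pstdev(rates)  # population stdev (assumed)
    threshold = mean - stdev_factor * stdev
    return sorted(host for host, rate in success_rates.items() if rate < threshold)
```

With the defaults, successive ejections of the same backend last 30s, 60s, 120s, 240s, and then 300s once the cap applies. In a five-host pool where four hosts sit at 0.99 and one at 0.50, only the 0.50 host falls below the threshold.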

Health check architecture:

  Each node independently runs health checks for its configured pools.
  Results are replicated to all nodes -- the most recent check from any node is used.
  Threshold logic: unhealthy_threshold consecutive failures to mark down;
  healthy_threshold consecutive successes to mark up.
  HTTP/3 checks: quic-go with connection pooling and idle connection cleanup.
  gRPC checks: native grpc.health.v1.Health/Check RPC with connection pooling.
  Database checks: protocol handshake only (MySQL initial packet, PostgreSQL SSLRequest,
  Redis PING/PONG). No authentication or query execution.
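
The threshold logic above amounts to counting consecutive results. A minimal sketch, assuming a backend starts out healthy and that a connection error or timeout is represented as a status of None; the names are hypothetical.

```python
def http_check_passed(status, expected_status=()):
    """Empty expected_status means any non-5xx response counts as healthy."""
    if status is None:  # connection error or timeout
        return False
    if expected_status:
        return status in expected_status
    return status < 500


class HealthTracker:
    def __init__(self, unhealthy_threshold=3, healthy_threshold=2, healthy=True):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = healthy
        self.consecutive_ok = 0
        self.consecutive_fails = 0

    def observe(self, check_passed):
        """Record one check result and return the resulting health state."""
        if check_passed:
            self.consecutive_ok += 1
            self.consecutive_fails = 0
            if not self.healthy and self.consecutive_ok >= self.healthy_threshold:
                self.healthy = True
        else:
            self.consecutive_fails += 1
            self.consecutive_ok = 0
            if self.healthy and self.consecutive_fails >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

Because a single success resets the failure counter, an intermittently failing backend stays in rotation under these thresholds; that class of problem is what outlier detection is meant to catch.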

DNS discovery lifecycle:

  1. Periodically resolves hostname to IP addresses (refresh interval)
  2. New IPs: automatically added as backends to pool
  3. Removed IPs: automatically removed from pool
  4. DNS failure: exponential backoff up to 5 minutes
  5. Modes: internal (Hexon DNS with DNSSEC), system (OS resolver), custom (direct resolvers)
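
Steps 2 to 4 of the lifecycle above reduce to a set difference plus a capped backoff. A minimal sketch with hypothetical function names; the source only says the backoff is exponential with a 5 minute ceiling, so doubling per consecutive failure is an assumption.

```python
def reconcile(current_ips, resolved_ips):
    """Return (to_add, to_remove) given the pool's IPs and a fresh DNS answer."""
    current, resolved = set(current_ips), set(resolved_ips)
    return sorted(resolved - current), sorted(current - resolved)


def next_refresh_delay(refresh_seconds, consecutive_failures, cap_seconds=300.0):
    """Normal interval when healthy; doubled per failure (assumed), capped at 5 minutes."""
    if consecutive_failures == 0:
        return float(refresh_seconds)
    return min(cap_seconds, refresh_seconds * 2.0 ** consecutive_failures)
```

For example, with dns_refresh = "30s" this model would retry after 60s, then 120s, then 240s, and settle at 300s until resolution succeeds again.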

Interpreting tool output:

  'proxy pools':
    Healthy: All pools show ActiveBackends = TotalBackends, Strategy listed
    Degraded: ActiveBackends < TotalBackends — some backends ejected or unhealthy
    Action: Degraded → 'proxy health' for per-backend status, 'proxy outliers' for ejections

  'proxy health' (per-pool):
    Healthy: All backends Status=healthy, consecutive failures=0
    Unhealthy: Status=unhealthy with failure reason (connection_refused, timeout, http_error)
    Action: Unhealthy → check backend directly with 'net tcp <host:port>' or 'net http <url>'

  'proxy circuits':
    Closed: Normal operation — requests flowing to backend
    Half-open: Testing recovery — limited requests allowed through, do NOT reset manually
    Open: Tripped — backend failing, shows TripCondition and ErrorRate
    Action: Open → fix backend root cause first, then 'proxy reset <pool> <backend>' to clear

  'proxy outliers':
    Normal: No ejected backends
    Ejected: Backend removed from rotation — shows EjectionTime and FailureRate
    Action: Fix backend, then 'proxy uneject <pool> <backend>' to re-admit

Logs

Log entries by component. Search with: logs search "loadbalancer". All entries use log name "loadbalancer". Levels: ERROR > WARN > INFO > DEBUG. No AUDIT entries in this module.

Pool Management:

  loadbalancer  INFO   pool created (strategy, backends count, health_check/circuit_breaker/outlier_detection enabled)
  loadbalancer  INFO   pool deleted
  loadbalancer  INFO   pool updated
  loadbalancer  INFO   backend added (pool_id, backend_id, address)
  loadbalancer  INFO   backend removed (pool_id, backend_id)
  loadbalancer  INFO   backend draining (pool_id, backend_id)
  loadbalancer  WARN   failed to initialize circuit state (pool_id, backend_id, error)
  loadbalancer  WARN   failed to initialize outlier state (pool_id, backend_id, error)
  loadbalancer  ERROR  failed to check pool existence (pool_id, error)
  loadbalancer  ERROR  failed to store pool config (pool_id, error)
  loadbalancer  ERROR  failed to get pool for deletion (pool_id, error)
  loadbalancer  ERROR  failed to delete pool (pool_id, error)
  loadbalancer  ERROR  failed to get pool (pool_id, error)
  loadbalancer  ERROR  failed to list pools (error)
  loadbalancer  ERROR  failed to get pool for update (pool_id, error)
  loadbalancer  ERROR  failed to update pool (pool_id, error)
  loadbalancer  ERROR  failed to get pool for add backend (pool_id, error)
  loadbalancer  ERROR  failed to add backend (pool_id, backend_id, error)
  loadbalancer  ERROR  failed to get pool for remove backend (pool_id, error)
  loadbalancer  ERROR  failed to remove backend (pool_id, backend_id, error)
  loadbalancer  ERROR  failed to get health state for drain (pool_id, backend_id, error)
  loadbalancer  ERROR  failed to update health state for drain (pool_id, backend_id, error)

Backend Selection:

  loadbalancer  DEBUG  backends excluded from selection (pool_id, total_backends, healthy_backends, excluded)
  loadbalancer  DEBUG  backend selected (pool_id, backend_id, strategy, healthy_backends, latency)

Health Checks:

  loadbalancer  INFO   backend health state changed — unhealthy to healthy (pool_id, backend_id, consecutive_ok, latency)
  loadbalancer  WARN   backend health state changed — healthy to unhealthy (pool_id, backend_id, consecutive_fails, error)
  loadbalancer  DEBUG  health check passed (pool_id, backend_id, consecutive_ok, latency)
  loadbalancer  DEBUG  health check failed (pool_id, backend_id, consecutive_fails, error)
  loadbalancer  ERROR  failed to store health state (pool_id, backend_id, error)

Circuit Breaker:

  loadbalancer  INFO   circuit breaker state changed (pool_id, backend_id, from_state, to_state, error_ratio)
  loadbalancer  INFO   per-protocol circuit breaker state changed (pool_id, backend_id, protocol, from_state, to_state, error_ratio)
  loadbalancer  INFO   circuit breaker reset (pool_id, backend_id, reset_by)
  loadbalancer  WARN   circuit breaker expression compilation failed (expression, error)
  loadbalancer  WARN   circuit breaker expression evaluation failed (expression, error)
  loadbalancer  DEBUG  circuit breaker threshold evaluation (combine_mode, conditions_met, error_ratio, error_threshold, p95_latency_ms, latency_threshold_ms, network_error_ratio, network_threshold)
  loadbalancer  ERROR  failed to store circuit state (pool_id, backend_id, error)
  loadbalancer  ERROR  failed to reset circuit (pool_id, backend_id, error)

Connection Tracking:

  loadbalancer  ERROR  failed to update connection count (pool_id, backend_id, error)

Rate Limiting:

  loadbalancer  DEBUG  rate limit exceeded (pool_id, user_id, limit, current_count, cost, retry_after)
  loadbalancer  ERROR  failed to update rate limit state (pool_id, key, error)

Outlier Detection:

  loadbalancer  INFO   backend ejected due to outlier detection (pool_id, backend_id, reason, ejection_count, duration, re_admit_at)
  loadbalancer  INFO   backend re-admitted after ejection period (pool_id, backend_id, total_ejections)
  loadbalancer  INFO   backend manually un-ejected (pool_id, backend_id)
  loadbalancer  DEBUG  outlier success rate analysis (pool_id, eligible_backends, avg_success_rate, stdev, threshold, stdev_factor)
  loadbalancer  DEBUG  outlier failure percentage analysis (pool_id, eligible_backends, threshold, ejected_count, max_ejectable)
  loadbalancer  ERROR  failed to save outlier state (pool_id, backend_id, error)
  loadbalancer  ERROR  failed to save outlier state on re-admission (pool_id, backend_id, error)
  loadbalancer  ERROR  failed to reset outlier interval stats (pool_id, backend_id, error)

DNS Discovery:

  loadbalancer  INFO   DNS discovery enabled (pool_id, hostname, refresh)
  loadbalancer  INFO   DNS discovery disabled (pool_id)
  loadbalancer  INFO   DNS discovery updated backends (pool_id, hostname, total_ips, added, removed)
  loadbalancer  WARN   DNS discovery resolution failed (pool_id, hostname, error)
  loadbalancer  WARN   DNS discovery returned no IPs (pool_id, hostname)
  loadbalancer  WARN   failed to add discovered backend (pool_id, ip, error)
  loadbalancer  WARN   failed to remove discovered backend (pool_id, ip, error)

Metrics

Prometheus metrics. Query with: metrics prometheus loadbalancer_<name>

Pool Lifecycle:

  loadbalancer_pools_created          counter    {strategy}                         Pools created
  loadbalancer_pools_deleted          counter    {}                                 Pools deleted

Backend Selection:

  loadbalancer_selects                counter    {pool_id, strategy}                Successful backend selections
  loadbalancer_select_failures        counter    {pool_id, reason}                  Failed selections (reason: pool_not_found|no_healthy_backends|algorithm_returned_nil)
  loadbalancer_select_latency         latency    {pool_id}                          Backend selection duration

Health Checks:

  loadbalancer_health_checks          counter    {pool_id, healthy}                 Health check executions (healthy: true|false)

Circuit Breaker:

  loadbalancer_circuit_state_changes  counter    {pool_id, backend_id, from_state, to_state}           Circuit state transitions (optionally includes protocol label in per-protocol mode)
  loadbalancer_circuit_resets         counter    {pool_id, backend_id}              Manual circuit resets

Connections:

  loadbalancer_connections_opened     counter    {pool_id, backend_id}              Connections opened
  loadbalancer_connections_closed     counter    {pool_id, backend_id}              Connections closed
  loadbalancer_active_connections     gauge      {pool_id, backend_id}              Current active connections
  loadbalancer_connection_duration    latency    {pool_id, backend_id}              Connection duration
  loadbalancer_bytes_sent             counter    {pool_id, backend_id}              Bytes sent to backends
  loadbalancer_bytes_recv             counter    {pool_id, backend_id}              Bytes received from backends

Rate Limiting:

  loadbalancer_rate_limit_allowed     counter    {pool_id}                          Requests allowed by rate limiter
  loadbalancer_rate_limit_denied      counter    {pool_id}                          Requests denied by rate limiter
  Note: metrics aggregate at pool level. For per-user denial details when
  rate_limit_per_user = true, use: logs search "rate limit exceeded" (includes user_id)

Outlier Detection:

  loadbalancer_outlier_ejections      counter    {pool_id, backend_id, reason}      Backends ejected (reason: consecutive_5xx|consecutive_gateway|consecutive_local|success_rate|failure_percentage)
  loadbalancer_outlier_readmissions   counter    {pool_id, backend_id}              Backends auto-readmitted after ejection period
  loadbalancer_outlier_manual_uneject counter    {pool_id, backend_id}              Backends manually un-ejected

DNS Discovery:

  loadbalancer_dns_discovery_failures counter    {pool_id, hostname}                DNS resolution failures
  loadbalancer_dns_discovery_updates  counter    {pool_id, hostname}                Backend set updates from DNS

Alerts:

  rate(loadbalancer_select_failures{reason="no_healthy_backends"}[5m]) > 0    All backends down — check health and outlier state
  rate(loadbalancer_circuit_state_changes{to_state="open"}[5m]) > 0           Circuit opened — backend degradation
  rate(loadbalancer_outlier_ejections[5m]) > 5                                 High ejection rate — systemic backend issues
  rate(loadbalancer_rate_limit_denied[5m]) > 50                                High rate limit denial — check capacity or limits
  rate(loadbalancer_dns_discovery_failures[5m]) > 0                            DNS discovery failing — check DNS config

Relationships

Module dependencies and interactions:

proxy: Primary consumer. Creates and manages LB pools automatically when [[proxy.mapping]] has multiple backends in the service array. Selects backends on every request, tracks connections, records results for circuit breaker and outlier detection. All LB configuration flows through proxy mapping sub-tables.

distributed cache: State storage backend. All pool configs, health states, circuit breaker states, outlier states, connection counts, rate limit counters, and backend stats are stored in the cluster-wide cache with appropriate TTLs.
dns: Backend hostname resolution for DNS-based service discovery. Supports three modes: internal (Hexon DNS module with DNSSEC), system (OS resolver), custom (direct resolvers). Per-pool DNS configuration.
sessions: Session affinity for hash-based algorithms. Cookie-based hash keys read session cookies. JA4/JA4H fingerprint routing uses TLS fingerprint from session.
connection_pool: Backend HTTP connection management. Circuit breaker integration prevents new connections to tripped backends. Connection counts feed least_conn algorithm decisions.
certificates: TLS for backend connections. Health checks honor TLS configuration (tls_skip_verify). HTTP/3 health checks require valid QUIC/TLS setup.

Reverse Proxy

Routes HTTP traffic to backends with authentication, load balancing, identity headers, and circuit breaking

Overview

Routes HTTP requests to backend applications with authentication, group-based authorization, and signed identity headers. Replaces separate reverse proxy, SSO, and load balancer products with one integrated service. Every proxied request carries the user’s identity — backends verify it without implementing auth.

Capabilities:

Host-based and path-based routing with 3-tier hybrid matcher (exact → prefix → regex)
Per-route OIDC SSO authentication with cross-domain cookie support
Group-based authorization (OR semantics — user needs any one listed group)
Identity header injection (X-Hexon-User, X-Hexon-Mail, X-Hexon-Name, X-Hexon-Groups)
Ed25519 header signing and optional full request signing for backend verification
Response header URL rewriting (Link, Content-Location, Refresh) and HTML body rewriting
JavaScript interceptor injection for dynamic URL rewrites (fetch, XHR, window.open)
Logout toolbar injection for authenticated routes (draggable, shows user + app name)
WebSocket and gRPC support, HTTP/3 (QUIC) backend connections
Zero-copy streaming mode for API routes (rewrite_host=false, saves 8-15ms, 4x throughput)
Zstandard, Brotli, gzip response compression (negotiated via Accept-Encoding)
Speculation Rules API injection for prefetch/prerender (Chromium 121+, per-mapping eagerness)
Per-mapping mTLS, CIDR subnet restriction, HTTP method filtering
Multi-backend load balancing (adaptive, round-robin, weighted, least-conn, consistent hash, Maglev, random)
Circuit breakers with expression-based trip conditions and outlier detection
Native health checks (TCP, HTTP, HTTP/3, gRPC)
Hot-reload of routes, backends, auth rules without restart (atomic, +2ns overhead)
Landing page listing all accessible apps filtered by user groups (folder/tag grouping)
PROXY protocol v1/v2 support for preserving client IP through L4 load balancers
Per-mapping protection overrides (rate limit, size limit bypass, per-user rate limits)
Per-user rate limit response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset (on every response), Retry-After (on 429 only)
Weighted canary traffic splitting with sticky routing and bypass groups
Automatic retry with budget (prevents retry storms) and backend exclusion
Hedged requests for tail latency reduction (parallel speculative requests)
Custom response headers with three-state logic (set/strip/inherit)
Cookie domain rewriting for SSO across subdomains (RFC 6265 compliant)
Request shadowing/mirroring for testing (async fire-and-forget, per-route sampling)
JWT Bearer token verification cache (sessions module, SHA256 of token as session ID, configurable TTL)
Personal Access Token (PAT) authentication via Bearer header with session validation and IP enforcement

Config

Core configuration under [proxy] and [[proxy.mapping]]:

[proxy]
  enabled = true                   # Enable reverse proxy service
  hostname = "apps.hexon.es"       # Landing page hostname (optional; if unset, serves on service.hostname)
  signing_enabled = true           # Ed25519 header signing (default: true if cluster_key set)
  signing_rotation = "15m"         # Key rotation interval (HKDF-SHA256 from cluster_key)
  brotli_support = true            # Decompress/reencode Brotli for URL rewriting
  group_refresh_interval = "15m"   # Background session group membership refresh (0 to disable)
  bearer_cache_ttl = "5m"          # JWT Bearer token verification cache TTL (0 to disable)
  gzip = true                      # Enable gzip compression
  headers = {}                     # Global response header overrides (three-state: value/"-"/empty)
  header_user = "X-Hexon-User"    # Identity header name overrides (rarely changed)
  header_mail = "X-Hexon-Mail"
  header_name = "X-Hexon-Name"
  header_groups = "X-Hexon-Groups"

[[proxy.mapping]]
  app = "Name"                     # Display name (shown in landing page and toolbar)
  host = "app.hexon.es"            # Hostname for routing (SNI matching)
  path = "^/.*"                    # Path regex (auto-classified into matching tiers)
  service = "https://backend:8080" # Backend URL(s) — string or array for load balancing
  auth = true                      # Require authentication
  groups = ["users"]               # Authorized groups (OR logic, empty = any authenticated user)
  bypass_auth_cidrs = ["10.0.0.0/8"] # CIDR ranges that skip auth + JIT-2FA (straight to backend)
  add_auth_headers = true          # Inject X-Hexon-* identity headers
  add_bearer = true                # Inject signed JWT Bearer token to backend (SSO via OIDC)
  allow_upgrade = true             # WebSocket upgrade support
  rewrite_host = true              # HTML URL rewriting (default: true; false = zero-copy mode)
  inject_toolbar = true            # Logout toolbar (default: true when auth=true)
  rewrite_hosts = [["backend.com","app.hexon.es"]]  # Multi-domain URL mapping pairs
  priority = 500                   # Route priority (auto 0-1000, manual >1000 overrides)
  cert = "/path/to/cert.pem"      # Per-mapping TLS certificate
  key = "/path/to/key.pem"        # Per-mapping TLS private key
  tls_check = true                 # Verify backend TLS certificate
  mtls = false                     # Require client certificate (default: false)
  allowed_subnets = ["10.0.0.0/8"] # CIDR subnet restriction (OR logic, 403 if no match)
  allowed_methods = ["GET","POST"] # HTTP method filter (empty = all allowed, 405 if no match)
  audience = "custom"              # Custom audience for header signing (default: mapping app name)
  sign_request = false             # Full request signing (method, path, query, body)
  sign_request_max_body = "10MB"   # Max body size to hash (default 10MB; "0" = skip body hash)
  bearer_cache_ttl = "5m"          # Per-mapping JWT cache TTL override (inherits from global)
  oidc_providers = ["internal"]    # OIDC provider(s) for authentication
  dnssec = true                    # Per-route DNSSEC override
  dns_resolvers = ["10.0.0.1:53"] # Per-route DNS resolver override
  brotli_support = true            # Per-route Brotli override (falls back to global)
  permissions_policy = "..."       # Browser Permissions-Policy header
  referrer_policy = "..."          # Browser Referrer-Policy header
  csp_header = "..."               # Content-Security-Policy header
  headers = {}                     # Per-route response headers (completely replaces global)
  disable_rate_limit = false       # Bypass rate limiting for this route
  rate_limit = "200/1m"            # Custom rate limit
  rate_limit_per_user = false      # Per-user rate limits (cluster-wide, requires auth=true)
  disable_size_limit = false       # Bypass size limiting
  max_bytes = "100MB"              # Custom max body size

  [proxy.mapping.canary]           # Weighted traffic splitting
    enabled = false
    sticky = true                  # Session-pinned: version stored in session/cookie
    sticky_key = "user"            # Initial selection hash: "user" | "fingerprint" | "ip"
    header = "X-Hexon-Version"    # Inject version label header (optional)
    bypass_groups = ["qa"]         # Groups always routed to canary (simple) or first version (multi)
    site = "prod-eu-b2c4d1"       # Connector site for canary (overrides mapping-level site, optional)

    # Simple mode (two versions: stable + canary):
    service = ["https://canary:8080"]
    weight = 10                    # % to canary (0-100), rest goes to stable
    label = "v2.1.0"              # Prometheus {version} label for canary

    # Multi-version mode (A/B/C routing — replaces service/weight/label above):
    [[proxy.mapping.canary.versions]]
      service = ["https://v1:8080"]
      weight = 70                  # Weights must sum to 100
      label = "v1-stable"
      site = "prod-us-a1b2c3"     # Per-version connector site (overrides canary-level site, optional)
    [[proxy.mapping.canary.versions]]
      service = ["https://v2:8080"]
      weight = 20
      label = "v2-beta"
    [[proxy.mapping.canary.versions]]
      service = ["https://v3:8080"]
      weight = 10
      label = "v3-alpha"

  [proxy.mapping.retry]            # Automatic retry with budget
    enabled = false
    max_attempts = 3               # Total attempts (1-10)
    retry_on = ["5xx", "connect-failure", "reset"]  # Also: "retriable-4xx" (429 only)
    retriable_methods = ["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]
    backoff_base = "50ms"
    backoff_max = "1s"
    backoff_jitter = true          # Full jitter: random in [0, computed backoff]
    budget_ratio = 0.10            # Max retries = 10% of requests in window
    budget_window = "10s"
    budget_min_retries = 3         # Always allow N retries regardless of ratio
    max_body_size = "1MB"          # Bodies larger than this skip retry

  [proxy.mapping.hedge]            # Hedged requests for tail latency
    enabled = false
    delay = "100ms"                # Calibrate to route's p99 — too low doubles load, too high adds no value
    max_hedges = 1                 # 1-3: each fires to a different backend after delay

IMPORTANT: Retry and hedge amplify backend load. rate_limit is enforced once per client request (before retry/hedge); retries and hedges do NOT consume additional rate-limit tokens. Example: retry.max_attempts=3 with hedge.max_hedges=2 allows up to 3 × (1 + 2) = 9 backend requests for a single client request. Plan backend capacity accordingly.

  forward_request_headers = false  # Forward Authorization header to backend
  forward_response_headers = false # Forward WWW-Authenticate header from backend
  folder = "Category"              # Landing page folder grouping
  tags = ["tag1"]                  # Landing page tags for filtering
  display = true                   # Show in portal/access list (default: true, set false for API-only)
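A rough sketch of how the [proxy.mapping.retry] and [proxy.mapping.hedge] settings above combine. The exponential growth of the backoff and the exact budget formula are assumptions made for illustration; only the base, cap, full jitter, ratio, and minimum come from the config comments. All function names are hypothetical.

```python
import random


def backoff_delay(attempt, base=0.050, cap=1.0, jitter=True, rng=random):
    """Delay in seconds before retry number `attempt` (1 = first retry).

    Assumption: the computed backoff doubles per retry and is capped at
    backoff_max. Full jitter then picks a random value in [0, computed].
    """
    computed = min(cap, base * (2 ** (attempt - 1)))
    return rng.uniform(0.0, computed) if jitter else computed


def retry_budget(requests_in_window, ratio=0.10, min_retries=3):
    """Retries allowed in one budget_window (assumed: ratio with a floor)."""
    return max(min_retries, int(requests_in_window * ratio))


def max_backend_requests(max_attempts, max_hedges):
    """Worst case for one client request: every attempt also fires all hedges."""
    return max_attempts * (1 + max_hedges)
```

With the values from the IMPORTANT note, max_backend_requests(3, 2) gives 9, which is the figure to size backend capacity against.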

DNS: Centralized in [dns] section. Proxy uses DNS module by default ([proxy.dns] use_cluster=true). Per-route overrides via dnssec and dns_resolvers fields for backends with special DNS needs (e.g., internal backends with internal DNS, or backends in unsigned DNS zones).

Load balancing: service can be an array of URLs. Configure algorithm and weights via lb_strategy and lb_weights fields. Default strategy is adaptive (epsilon-greedy).

Circuit breaker: [proxy.mapping.circuit_breaker] with trip expression and recovery settings.

Health checks: all mappings get HTTP health checks by default (any non-5xx = healthy, 15s interval). Override globally via [proxy.default_health_check] or per-mapping via [proxy.mapping.health_check]. 4 active check types: tcp, http, http3, grpc. expected_status is an array (e.g. [200, 302]); empty means any non-5xx response is healthy. Health check path is derived from the mapping’s route path when using defaults. Health state is shared cluster-wide; unhealthy backends are removed from rotation until they recover.

Connection pool: [connection_pool.http] for global pool settings (max_connections, adaptive_scaling).

Hot-reloadable: routes, backends, auth rules, paths, rewrite rules, protection overrides, identity header names, per-route DNS, certificates. Cold (restart required): proxy.enabled, global connection pool settings, DNS module config, cache.

Security

Identity Headers (when add_auth_headers=true):

  X-Hexon-User:     Username
  X-Hexon-Mail:     Email address
  X-Hexon-Name:     Full name (display name)
  X-Hexon-Groups:   Comma-separated group list (e.g. "users,admins,developers")

Groups are fetched fresh from directory on every request (not cached), ensuring immediate enforcement of group changes without re-authentication. Backends can trust these headers for SSO without implementing their own authentication.

Header Signing — Ed25519 (enabled by default when cluster_key is set):

  Additional headers injected when signing is enabled:
    X-Hexon-Audience:    Route audience string. Precedence: explicit audience field on the
                         mapping, then mapping app name, then service URL fallback. Default
                         is the app name — stable across deployments and unambiguous when the
                         service array contains multiple backends.
    X-Hexon-Timestamp:   Unix epoch seconds when signature was created
    X-Hexon-Request-Id:  Unique request correlation ID for tracing
    X-Hexon-Signature:   Signature in format: v2.{timestamp}.{base64_ed25519}

  Signed payload (pipe-delimited, 7 fields):
    {timestamp}|{request_id}|{audience}|{user}|{email}|{name}|{groups}

  IMPORTANT: The groups field (last) may itself contain pipe characters.
  Backends MUST parse with SplitN(payload, "|", 7) — NOT Split(payload, "|").
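
The split rule above, illustrated in Python, where the equivalent of SplitN(payload, "|", 7) is str.split with maxsplit=6. The field names and the sample payload are made up for the example.

```python
FIELDS = ("timestamp", "request_id", "audience", "user", "email", "name", "groups")


def parse_signed_payload(payload):
    """Split the signed payload into its 7 fields.

    maxsplit=6 means at most 6 cuts, so everything after the sixth pipe
    (including any further pipes) stays inside the groups field.
    """
    parts = payload.split("|", 6)
    if len(parts) != 7:
        raise ValueError("expected 7 pipe-delimited fields, got %d" % len(parts))
    return dict(zip(FIELDS, parts))
```

A plain payload.split("|") on a payload whose groups value contains a pipe yields 8 or more parts and shifts or truncates the groups field.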

  Why Ed25519 instead of HMAC:
    - HMAC requires sharing the secret key with verifiers, enabling forgery
    - Ed25519 distributes only the PUBLIC key — backends can verify but NOT forge
    - The private key never leaves the Hexon cluster

  Key derivation: HKDF-SHA256 (RFC 5869) from cluster_key with a versioned
  domain-specific salt. Rotates every signing_rotation (default 15m).
  Current + previous keypair kept in memory for rotation boundary handling.

  Backend Verification — Option 1: Delegated (simple, recommended for most backends):

    POST /.well-known/header-signing.verify
    Content-Type: application/json

    Request body fields: signature, timestamp, request_id, audience, user, email, name, groups
    Example:
      {"signature":"v2.1732800000.base64ed25519==","timestamp":1732800000,
       "request_id":"abc123","audience":"https://backend:8080",
       "user":"jdoe","email":"jdoe@example.com","name":"John Doe","groups":"admin,users"}

    Responses:
      200 OK:                  {"valid": true}
      401 Unauthorized:        {"valid": false, "error": "signature mismatch"}
      400 Bad Request:         {"valid": false, "error": "missing required field: signature"}
      503 Service Unavailable: {"valid": false, "error": "signing not enabled"}

    Benefits: zero crypto code in backend, automatic key rotation handling,
    works with nginx auth_request directive.

  Backend Verification — Option 2: Direct (fast, recommended for high-throughput):

    GET /.well-known/header-signing.key?t={timestamp}

    Response (200 OK):
      {"public_key":"base64_32_byte_key","valid_from":1732800000,"valid_until":1732800900}

    Verify Ed25519 signature locally:
      1. Parse X-Hexon-Signature: split by "." → [version, timestamp, signature_base64]
      2. Verify version is "v2", decode signature (64 bytes)
      3. Check X-Hexon-Audience matches expected audience
      4. Check timestamp is within 30 seconds of current time
      5. Fetch public key from /.well-known/header-signing.key?t={timestamp}
      6. Reconstruct payload: timestamp|request_id|audience|user|email|name|groups
      7. ed25519.Verify(publicKey, payload, signature) → true/false

    Public key is safe to cache (32 bytes, cannot create signatures — only verify).
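
A minimal sketch of the local verification steps above, not a reference implementation. Python's standard library has no Ed25519, so steps 5 and 7 (key fetch and signature check) are delegated to a caller-supplied verify callable, which is hypothetical. Standard base64 for the signature and building the payload from the X-Hexon-* header values are assumptions.

```python
import base64
import time

MAX_SKEW_SECONDS = 30  # tolerance stated in step 4

# Header order mirrors the 7-field signed payload.
PAYLOAD_HEADERS = (
    "X-Hexon-Timestamp", "X-Hexon-Request-Id", "X-Hexon-Audience",
    "X-Hexon-User", "X-Hexon-Mail", "X-Hexon-Name", "X-Hexon-Groups",
)


def verify_identity_headers(headers, expected_audience, verify, now=None):
    """Return True only if every check passes.

    verify(timestamp, payload_bytes, signature_bytes) -> bool is expected to
    fetch the public key for that timestamp and run the Ed25519 check.
    """
    now = time.time() if now is None else now

    # Steps 1-2: split the signature header, check version and length.
    parts = headers.get("X-Hexon-Signature", "").split(".", 2)
    if len(parts) != 3 or parts[0] != "v2":
        return False
    try:
        timestamp = int(parts[1])
        signature = base64.b64decode(parts[2], validate=True)
    except ValueError:  # also covers binascii.Error
        return False
    if len(signature) != 64:
        return False

    # Step 3: audience must match what this backend expects.
    if headers.get("X-Hexon-Audience") != expected_audience:
        return False

    # Step 4: reject stale or future-dated signatures.
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False

    # Step 6: rebuild the pipe-delimited payload.
    payload = "|".join(headers.get(h, "") for h in PAYLOAD_HEADERS).encode("utf-8")

    # Steps 5 and 7: delegated to the caller's Ed25519 implementation.
    return bool(verify(timestamp, payload, signature))
```

Running the cheap checks (version, audience, freshness) before the key fetch keeps malformed or replayed requests from triggering network calls.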

  Clock synchronization requirements:
    - NTP required on all nodes (chrony or systemd-timesyncd)
    - Clock drift should be <1 second for reliable operation
    - Verification allows 30-second tolerance for network delays
    - Key rotation windows calculated from Unix epoch

Request Signing — Ed25519 (optional, per-route sign_request=true):

  Signs the entire HTTP request for end-to-end integrity verification.
  Protects against: method tampering, host header attacks, path manipulation,
  query injection, body tampering.

  Header: X-Hexon-Request-Signature: v1|{timestamp}|{base64_ed25519}

  Signed payload:
    REQ|{timestamp}|{method}|{host}|{path}|{query_hash}|{body_hash}

  Canonicalization rules:
    Path: URL-decoded → dot-segments resolved → slashes collapsed → leading slash ensured
      /api/../admin → /admin,  /api//users → /api/users,  /api/foo%2Fbar → /api/foo/bar
    Query: parsed → sorted alphabetically by key → re-encoded with URL escaping
      b=2&a=1 → a=1&b=2
    Body: SHA256 hash (base64). Bodies over sign_request_max_body → "SKIPPED".
      Empty body → hash of empty string (47DEQpj8HBSa...).
      Set sign_request_max_body = "0" to always skip body hashing.
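
The canonicalization rules above, sketched for a backend that recomputes the signed values. Function names are hypothetical. Assumptions: trailing slashes are dropped, the query is re-encoded with form-style escaping (space as "+"), and the body hash uses standard base64; confirm these against real signed requests before relying on them.

```python
import base64
import hashlib
from urllib.parse import parse_qsl, unquote, urlencode


def canonical_path(raw_path):
    """URL-decode, resolve dot-segments, collapse slashes, ensure leading slash."""
    segments = []
    for seg in unquote(raw_path).split("/"):
        if seg in ("", "."):
            continue  # empty segments come from repeated slashes
        if seg == "..":
            if segments:
                segments.pop()
            continue
        segments.append(seg)
    return "/" + "/".join(segments)


def canonical_query(raw_query):
    """Parse, sort by key (stable for repeated keys), re-encode."""
    pairs = parse_qsl(raw_query, keep_blank_values=True)
    return urlencode(sorted(pairs, key=lambda kv: kv[0]))


def body_hash(body, max_body=10 * 1024 * 1024):
    """Base64 SHA-256 of the body, or "SKIPPED" past the size limit.

    max_body=0 mirrors sign_request_max_body = "0" (always skip).
    """
    if max_body == 0 or len(body) > max_body:
        return "SKIPPED"
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
```

The three path examples and the query example listed above can be used directly as test vectors for any backend implementation.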

  Verify via: POST /.well-known/request-signing.verify (same JSON format)
  or GET /.well-known/request-signing.key?t={timestamp} (same keypair as header signing).

  Header signing vs request signing:
    Header signing: covers auth headers only, enabled by default, X-Hexon-Signature
    Request signing: covers entire request, opt-in per route, X-Hexon-Request-Signature
    Use both on sensitive routes (e.g., payment gateways) for maximum security.

Bearer Token Injection (when add_bearer=true):

  Injects a signed JWT ID token as Authorization: Bearer <token> on proxied requests.
  Backend verifies the token via the /oidc/cert JWKS endpoint (standard OIDC discovery).
  Token signed with the OIDC provider's signing key — supports threshold ECDSA (TSS/DKG)
  or deterministic HKDF-derived keys, auto-swaps transparently.

  JWT claims: iss (gateway issuer), sub (username), aud (app name or custom audience),
  email, preferred_username, groups, exp, iat. Same structure as regular OIDC ID tokens.

  Audience defaults to the mapping's app name. Override via audience = "custom" field.
  Tokens are cached in the session module (type "proxy_bearer") with deterministic IDs
  derived from the user session and audience, distributed across all cluster nodes. With threshold
  signing, only one node performs the signing ceremony per user:audience pair. Cached JWTs
  are AES-256-GCM encrypted at rest using a cluster key derivative. Refreshed at 80% of TTL.

  Pre-minting: avoids ECDSA signing latency on the first request by minting the token during the OIDC callback.

  Existing Bearer tokens are NOT overwritten — if the request already carries a Bearer
  (e.g., kubelogin passthrough), the injection is skipped. This allows mixed usage:
  M2M clients with their own tokens and browser users with injected tokens on the same route.

  Use case: backends like ArgoCD, Rancher, Grafana, Kubernetes API trust the gateway's
  OIDC issuer and get SSO for free — no redirect flow, no separate auth integration.

PAT Bearer Authentication:

  Personal Access Tokens work as standard Bearer tokens for proxy access.
  Flow: Authorization: Bearer <PAT-JWT> → middleware detects opaque miss →
  JWT verify → PAT detection (jti non-empty) → PAT session validation → bearer auth.
  Session check on every request ensures instant revocation (~5-50µs local KV).
  IP restriction (allowed_ips) enforced from session metadata (exact IP + CIDR).
  Last-used tracking: fire-and-forget metadata update preserves fixed PAT expiry.
  Cache (bearer_cache session, SHA256 key): stores is_pat, jti, allowed_ips metadata.
  Cache hits skip JWT verify but always re-validate session (revocation gate).
  Stale cache entries auto-deleted when revocation detected.
  Groups from JWT claims used for per-route group authorization (OR logic).
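
A simplified model of the PAT flow above, using plain dicts in place of the bearer cache and session store. It is an illustration only: verify_jwt is a hypothetical callable, and keying sessions by jti and hex-encoding the SHA256 cache key are assumptions. What it is meant to show is that a cache hit skips JWT verification but never skips the session check.

```python
import hashlib
import ipaddress


def bearer_cache_key(token):
    """SHA256 of the token, so the raw token is never used as a key."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def ip_allowed(client_ip, allowed_ips):
    """Exact IPs and CIDR ranges; an empty list means no restriction."""
    if not allowed_ips:
        return True
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(entry, strict=False)
               for entry in allowed_ips)


def authenticate_pat(token, client_ip, cache, sessions, verify_jwt):
    """Return the cached PAT entry on success, None on rejection."""
    key = bearer_cache_key(token)
    entry = cache.get(key)
    if entry is None:
        claims = verify_jwt(token)  # expensive path, only on cache miss
        if not claims or not claims.get("jti"):
            return None  # not a PAT
        entry = {"is_pat": True, "jti": claims["jti"],
                 "groups": claims.get("groups", [])}
        cache[key] = entry

    # Revocation gate: runs on every request, cached or not.
    session = sessions.get(entry["jti"])
    if session is None:
        cache.pop(key, None)  # drop the stale cache entry
        return None
    if not ip_allowed(client_ip, session.get("allowed_ips", [])):
        return None
    return entry
```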

  Two access paths to the same proxy mapping:
    Browser: PoW challenge → OIDC SSO → session cookie → proxy (human-optimized)
    Machine: Authorization: Bearer <token> → proxy (machine-optimized)
  Bearer resolves at step 1 of the middleware chain — before PoW, before OIDC redirect.
  All three token types (opaque access tokens, JWT ID tokens, PATs) bypass PoW and
  browser redirects. Provides direct, redirect-free, cookieless proxy access for CI/CD,
  monitoring, CLI tools (kubelogin), and service-to-service calls. Same group authorization,
  identity headers, and Ed25519 signing apply — only the authentication on-ramp differs.

Mutual TLS (per-mapping):

  mtls=false (default): no client certificate requested (no browser popup)
  mtls=true: TLS handshake requires valid client certificate (RequireAndVerifyClientCert)
  Certificate validated against ACME CA bundle or configured external PKI.
  Applied at TLS layer via GetConfigForClient callback during handshake.

Subnet Restriction:

  allowed_subnets uses CIDR notation, OR logic (client IP must match at least one).
  Uses X-Forwarded-For if present (for CDN/LB scenarios), falls back to direct IP.
  Enforced AFTER authentication but BEFORE proxy forwarding (defense-in-depth).
  CIDR validated at config load time — startup fails on invalid format.
  All violations logged at LevelWarn with app and host labels.

Cookie Handling:

  Set-Cookie domains rewritten from backend domain to proxy domain for SSO.
  Cookies intentionally shared across all subdomains (*.hexon.es) for single sign-on.
  RFC 6265 compliant: case-insensitive attribute parsing (Domain=/domain=/DOMAIN=).
  HttpOnly and Secure flags preserved during rewriting.

Response Header Overrides (three-state logic):

  "" (empty/omit): Inherit from backend (pass through unchanged)
  "-" (dash):      Strip header from response
  "value":         Override with the specified value
  Per-route headers completely replace global [proxy].headers (no merging).
  Empty map (headers = {}) disables all header processing for that route.
  Forbidden headers (blocked at config validation):
    Transfer-Encoding, Content-Length, Connection, Keep-Alive, Upgrade, Proxy-Connection, TE
  When both legacy fields (permissions_policy, referrer_policy, csp_header) and headers
  map target the same header, the headers map takes precedence.
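
The three-state logic above can be modeled in a few lines. This sketch assumes header names are compared exactly as written in the response (a real implementation would normalize case) and uses None to mean "no per-route headers map configured"; all function names are hypothetical.

```python
FORBIDDEN = {"transfer-encoding", "content-length", "connection",
             "keep-alive", "upgrade", "proxy-connection", "te"}


def validate_overrides(overrides):
    """Reject hop-by-hop and framing headers, as config validation does."""
    for name in overrides:
        if name.lower() in FORBIDDEN:
            raise ValueError("header override not allowed: %s" % name)


def effective_overrides(global_headers, route_headers):
    """Per-route map replaces the global map entirely; no merging."""
    return global_headers if route_headers is None else route_headers


def apply_overrides(response_headers, overrides):
    """Apply inherit / strip / override to a copy of the backend headers."""
    result = dict(response_headers)
    for name, value in overrides.items():
        if value == "":
            continue               # inherit: leave the backend value alone
        if value == "-":
            result.pop(name, None)  # strip
        else:
            result[name] = value    # override
    return result
```

Note that an empty per-route map ({}) is different from an absent one: it replaces the global map with nothing, so no overrides are applied on that route.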

Troubleshooting

Common symptoms and diagnostic steps:

502 Bad Gateway / 503 Service Unavailable:

  - Backend unreachable: 'proxy health' shows backend down
  - Circuit breaker open: 'proxy circuits' shows tripped breakers
  - DNS resolution failure: 'dns test <backend-hostname>' to verify
  - DNSSEC failure on unsigned zone: set dnssec=false on that route
  - All custom resolvers failing: falls back to system DNS, check 'dns resolvers'
  - Start with: 'diagnose domain <hostname>' for cross-subsystem check

Auth redirect loops:

  - OIDC callback failing: check oidc_providers configuration
  - Cross-domain cookie issue: verify proxy hostname matches cookie domain
  - Session group mismatch: background refresh (group_refresh_interval) changed the session's groups
  - Multiple providers: ensure provider selection page renders correctly
  - Check: 'sessions list --user=X' and 'auth status'

WebSocket upgrade failures:

  - Missing allow_upgrade=true on the mapping
  - Backend not responding to Upgrade handshake
  - Rate limiting blocking upgrade requests: check disable_rate_limit
  - TLS verification failing: check tls_check setting

gRPC errors (circuit breaker tripping unexpectedly):

  - gRPC always returns HTTP 200; actual status is in grpc-status trailer
  - Circuit breaker uses gRPC-aware status extraction (codes 4,8,13,14,15 = server error)
  - Expression variables: grpc_error_rate, grpc_unavailable_rate, grpc_timeout_rate
  - Backend must implement grpc.health.v1.Health for native gRPC health checks
  - Enable: grpc=true on the mapping, grpc_health_check=true on circuit_breaker config
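
To illustrate why HTTP status alone is not enough for gRPC routes, here is a small classifier using the code list above. The function name and the fallback to the HTTP status when the grpc-status trailer is missing or unparseable are assumptions of this sketch.

```python
# gRPC status codes treated as server errors, per the list above.
GRPC_SERVER_ERROR_CODES = {
    4: "DEADLINE_EXCEEDED",
    8: "RESOURCE_EXHAUSTED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
    15: "DATA_LOSS",
}


def is_grpc_server_error(http_status, trailers):
    """Classify a gRPC response for circuit breaker accounting."""
    raw = trailers.get("grpc-status")
    if raw is None:
        return http_status >= 500  # no trailer: fall back to HTTP status
    try:
        code = int(raw)
    except ValueError:
        return http_status >= 500
    return code in GRPC_SERVER_ERROR_CODES
```

A response with HTTP 200 and grpc-status 14 counts as a failure here, while grpc-status 5 (NOT_FOUND) does not, since it reflects a client-side condition.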

Slow responses:

  - HTML buffering: set rewrite_host=false for API routes (saves 8-15ms, 4x throughput)
  - Brotli decompression cost: set brotli_support=false on specific routes
  - Backend health degrading: 'proxy backends' for connection stats
  - Circuit breaker half-open: 'proxy circuits' for breaker states
  - Connection pool exhaustion: 'connpool stats' for pool metrics
  - Route matching slow: too many regex routes in Tier 3, convert to prefix patterns
  - Enable debug mode for Server-Timing header: shows route/auth/backend_ttfb/tls timing
    breakdown in browser DevTools Network tab (do NOT enable in production)

Header signing verification failures:

  - Clock skew >1s between proxy and backend: ensure NTP is running on all nodes
  - Key rotation window boundary: verification allows 30s tolerance
  - Wrong audience: check mapping's audience field matches backend expectation
  - Signature format: expect v2.{timestamp}.{base64}, parse with SplitN(".", 3)
  - Groups with pipes: backend must use SplitN("|", 7), not Split("|")

Request signing verification failures:

  - Path canonicalization mismatch: backend must URL-decode, resolve dots, collapse slashes
  - Query parameter order: backend must sort alphabetically before hashing
  - Body hash "SKIPPED": body exceeded sign_request_max_body, backend must handle this case
  - Large file uploads: set sign_request_max_body = "0" to skip body hashing

Landing page not showing apps:

  - User not in required groups: 'directory user <username>'
  - proxy.hostname not configured: landing page serves on service.hostname at /
  - Route auth=false: public apps show with PUBLIC badge
  - App not visible: check display=true (default) and folder/tags grouping in mapping config

PROXY protocol issues:

  - Backend not expecting PROXY protocol: set proxy_protocol=false
  - Wrong protocol version: check proxy_protocol_version (v1 text vs v2 binary)

PAT Bearer token not authenticating:

  - PAT falls through to session/OIDC: check 'logs search "handlers.bearer"' for PAT validation logs
  - "PAT rejected" in logs: session revoked or expired — check 'sessions list --type=pat --user=X'
  - "Cached PAT rejected" in logs: stale bearer_cache entry — auto-invalidated, retry should work
  - "source IP not allowed" in logs: PAT has allowed_ips restriction — check 'pats show <session_id>'
  - PAT works for QUIC but not proxy: ensure Authorization: Bearer header is sent correctly
  - Groups not matching route: PAT carries groups from creation time — if user groups changed,
    create new PAT with current groups or use route with groups the PAT carries

403 Forbidden / 405 Method Not Allowed on specific routes:

  - Subnet restriction: client IP not in allowed_subnets (check 'proxy traffic' metrics)
  - HTTP method not allowed (405): check allowed_methods list
  - Group authorization failed: user missing required group membership
  - mTLS required but no client certificate: check mtls setting

Canary routing not splitting traffic:

  - Verify canary.enabled = true and weight > 0
  - Sticky routing: same user always hits same version (deterministic hash)
  - bypass_groups: users in these groups always hit canary (simple) or first version (multi)
  - Metric: proxy_canary_requests{version="canary"} should show traffic
  - Use 'proxy canary <app>' for detailed canary status, pool health, and per-version metrics
  - Log: logs search "proxy.canary" shows per-request routing decisions

Canary backend isolation:

  - Canary backends have their own LB pool (separate circuit breaker, health checks, outlier detection)
  - Health checks inherited from parent mapping config
  - Canary errors do NOT contaminate primary circuit breaker
  - Response cache keys include version label (no cross-contamination)
  - E2OE WebSocket connections route to canary when canary selected
  - URL/cookie rewriting: canary must use same hostname as primary (limitation)
  - Retry/hedge after canary failure: selects from canary pool (same pool isolation)
  - Canary site: canary.site overrides mapping-level site for connector routing
  - Per-version site: versions[].site overrides canary.site (multi-version mode)
  - Site resolution: version.site > canary.site > mapping.site > direct connection
  - CLI probe: 'proxy mappings' probes both primary and canary backends

Session-based canary pinning:

  - First request: version computed via weight/sticky, stored in session metadata (auth) or cookie (no-auth)
  - Subsequent requests: pinned version read from session/cookie — same user always gets same version
  - Session priority: session metadata > cookie > fresh computation
  - Logout/login: session destroyed → fresh selection on new session
  - Cookie: hexon_cv_{app} session cookie (no Max-Age = cleared on browser close)
  - Works for auth=false routes via cookie fallback
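
One way to picture the pinning behavior above. The hash function (SHA-256 here), the 0-99 bucket scheme, and the canary_version session key are all assumptions for illustration; only the lookup priority and the hexon_cv_{app} cookie name come from the list above.

```python
import hashlib


def pick_version(sticky_value, versions):
    """Deterministic weighted pick.

    versions: list of (label, weight) pairs whose weights sum to 100.
    The same sticky_value always lands in the same bucket.
    """
    digest = hashlib.sha256(sticky_value.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    upper = 0
    for label, weight in versions:
        upper += weight
        if bucket < upper:
            return label
    raise ValueError("weights must sum to 100")


def resolve_version(session_meta, cookies, app, sticky_value, versions):
    """Session metadata > cookie > fresh computation."""
    labels = {label for label, _ in versions}
    for pinned in (session_meta.get("canary_version"),
                   cookies.get("hexon_cv_%s" % app)):
        if pinned in labels:
            return pinned
    return pick_version(sticky_value, versions)
```

Ignoring a pinned label that no longer exists in the version list is a design choice of this sketch, so that a removed version falls back to a fresh selection.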

Retries not firing / firing too much:

  - Method not in retriable_methods: POST/PATCH never retried by default
  - Budget exhausted: rate(proxy_retry_budget_exceeded[5m]) > 0
  - Body too large: bodies > max_body_size skip retry silently
  - Check: X-Hexon-Attempts response header shows actual attempt count
  - Log: logs search "proxy.retry" for attempt details and budget blocks

Hedging not reducing tail latency:

  - delay too high: hedge fires too late to help — calibrate to route p99
  - Insufficient backends: hedge requires max_hedges+1 backends (validated at config load)
  - max_hedges > 1: fires N hedges simultaneously after delay, each to a different backend
  - Metric ratio: proxy_hedge_fired_total vs proxy_hedge_won_total
    Low won/fired = delay well-calibrated; high = persistent tail latency problem
  - Hedge skipped: proxy_hedge_skipped_total — no different backend available or body replay failed

Interpreting tool output:

  'proxy health':
    Healthy: All backends Status=healthy, Latency < 100ms
    Warning: Status=healthy but Latency > 500ms — backend is slow, not down
    Degraded: Status=unhealthy with Reason: connection_refused | tls_error | timeout | dns_failed
    No pools: "No load balancer pools configured" means all mappings are single-backend (normal)
    Action: All backends unhealthy → 'proxy circuits' for open breakers

  'proxy circuits':
    Healthy: all breakers State=closed — normal operation
    Half-open: breaker is testing recovery — allow a few requests through, do NOT reset
    Open: breaker tripped — backend is failing, check Reason and TripCondition
    No pools: same as above — circuit breakers only exist for multi-backend pools
    Action: Open breaker → check 'proxy backends' for error counts, fix backend, then 'proxy reset'

  'proxy backends':
    Healthy: ActiveConns reasonable, ErrorRate < 1%, Latency stable
    Degraded: ErrorRate > 5% or Latency spiking — backend may be overloaded
    Ejected: Outlier detection removed backend — 'proxy uneject' to re-admit after fixing

  'proxy traffic':
    Normal: RequestRate steady, ErrorRate < 1%, Latency p99 < 500ms
    Abnormal: Sudden RequestRate spike (possible attack), ErrorRate > 5% (backend issue)
    Zero traffic: Route exists but no requests — check DNS/certificate for that hostname

  'proxy canary':
    List view: shows all canary-enabled routes with mode, weight, sticky, pool health
    Detail view (proxy canary <app>): full config, canary pool health, per-version request counts
    No canary routes: "No routes with canary enabled" — canary not configured
    Pool health N/M: N healthy backends out of M total in canary pool
    Version metrics: proxy_canary_requests broken down by version label

Relationships

Module dependencies and interactions:

loadbalancer: Pool management, backend selection, health checks, circuit breakers, outlier detection. Multi-algorithm support (adaptive, round-robin, weighted, least-conn, consistent hash, Maglev, random).
sessions: Authentication enforcement via session cookies. Session creation during OIDC callback. Session group monitor updates groups in place on changes (no re-login).
certificates: TLS termination, SNI-based certificate selection for per-mapping certs. Falls back to service.tls_cert/tls_key if no mapping-specific cert. Invalid or missing certificates prevent route from mounting.
waf: Request filtering applied before proxy forwarding (WAF rules checked first).
authentication.oidc: SSO via internal OIDC provider. Uses a dedicated internal OIDC client with PKCE S256. Back-channel token exchange is in-process (no network hairpin in K8s).
directory: Group membership lookup on every request (fresh, not cached). Powers both per-request authorization and X-Hexon-Groups header, plus landing page app filtering.
dns: Backend hostname resolution with DNSSEC validation. Centralized in [dns] section. Per-route overrides for internal backends or unsigned DNS zones.
firewall: Network-level access rules applied before proxy routing.
protection: Rate limiting (JA4 fingerprint-based) and size limiting at router level. Per-mapping bypass via disable_rate_limit/disable_size_limit context keys.
connection_pool: Backend HTTP connection management with adaptive pool sizing, circuit breaker integration, and performance metrics. Pool consolidation for routes with identical transport configuration.
render: Landing page and toolbar asset serving (CSS, JS, images).

Architecture

Request flow:

  1. Client request arrives → TLS termination with SNI certificate selection
  2. Hostname match: O(1) hash map lookup → PathMatcher for that hostname
  3. Path match: 3-tier hybrid matcher (exact → prefix tree → regex scan)
  4. Middleware chain: rate limit → size limit → subnet check → method filter
  5. Authentication: Bearer token check (opaque cache → JWT session cache → Ed25519 verify).
     PAT detection: if JWT has jti → PAT session validation (revocation + IP restriction) → last_used update.
     Then OIDC SSO check → redirect to /oidc/auth if no session
  6. Authorization: group membership verified (fresh directory lookup, OR logic)
  7. Director: canary selection (if enabled) → inject identity headers, sign headers (Ed25519)
  8. Transport chain: RetryTransport → HedgeTransport → ProtocolFallbackTransport → backend
  9. Response processing: URL rewriting (HTML + response headers), cookie domain rewriting, toolbar + JS interceptor injection, custom header overrides
  10. Response sent to client (zero-copy mode skips step 9 for API routes)

Route matching — 3-tier hybrid matcher:

  Tier 1 (exact): O(1) hash map — static paths like ^/api/v1/users$
  Tier 2 (prefix): O(log n) radix tree — wildcard paths like ^/api/.*
  Tier 3 (regex): O(n) sequential — complex patterns with alternation
  Auto-priority: (prefix_length × 100) + end_anchor_bonus(100) - alternation_penalty(50)
  Manual priority >1000 overrides auto-calculation. Catch-all always priority 0.
  Performance: exact ~50ns, prefix ~100-200ns, regex ~500ns-50μs (scales with route count).
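
A toy model of the tiering idea above, to show why exact and prefix patterns are cheaper than regex routes. The classification heuristic is an assumption (the real auto-classifier is not documented here), a sorted list stands in for the radix tree, and route priorities are ignored entirely.

```python
import re

_META = set(".^$*+?()[]{}|\\")


def classify(pattern):
    """Guess the tier for a path regex: "exact", "prefix" or "regex"."""
    if not pattern.startswith("^"):
        return "regex"
    body = pattern[1:]
    if body.endswith("$") and not (_META & set(body[:-1])):
        return "exact"    # ^/literal/path$
    if body.endswith(".*") and not (_META & set(body[:-2])):
        return "prefix"   # ^/literal/prefix/.*
    return "regex"


class RouteTable:
    def __init__(self):
        self.exact = {}      # Tier 1: dict lookup
        self.prefixes = []   # Tier 2: longest prefix first (radix tree stand-in)
        self.regexes = []    # Tier 3: sequential scan

    def add(self, pattern, target):
        tier = classify(pattern)
        if tier == "exact":
            self.exact[pattern[1:-1]] = target
        elif tier == "prefix":
            self.prefixes.append((pattern[1:-2], target))
            self.prefixes.sort(key=lambda item: len(item[0]), reverse=True)
        else:
            self.regexes.append((re.compile(pattern), target))

    def match(self, path):
        if path in self.exact:
            return self.exact[path]
        for prefix, target in self.prefixes:
            if path.startswith(prefix):
                return target
        for regex, target in self.regexes:
            if regex.search(path):
                return target
        return None
```

Under this heuristic, a pattern such as ^/(admin|ops)/.* falls into the regex tier because of the alternation; splitting it into two prefix routes would keep both out of the sequential scan.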

OIDC SSO flow (solves cross-domain cookie problem):

  1. User hits proxy host with no session cookie
  2. Redirect to OIDC provider with PKCE challenge
  3. If user has existing session on main domain → auto-approved (no login prompt)
  4. Redirect back: /_hexon/oidc/callback?code=...
  5. Token exchange in-process (no network hairpin)
  6. Session cookie set ON the proxy host domain → future requests have session
  Security: PKCE S256, AES-GCM encrypted state with cluster key derivative,
  CSRF double-submit cookie, 10-minute state expiry, host binding in state,
  open redirect prevention (return URL validated against proxy hostname).
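
For reference, the PKCE S256 transform used in step 2 is defined by RFC 7636: the challenge is the unpadded base64url encoding of the SHA-256 of the verifier. The function names below are illustrative and not the gateway's internal API.

```python
import base64
import hashlib
import secrets


def new_code_verifier():
    """Random verifier; 32 random bytes give 43 URL-safe characters."""
    return secrets.token_urlsafe(32)


def s256_challenge(verifier):
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), no padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The verifier stays with the party that started the flow and is only revealed at token exchange, so an intercepted authorization code cannot be redeemed on its own.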

Hot-reload mechanism:

  1. Config change detected (file watcher or SIGHUP)
  2. Config hash compared to detect actual changes (skip if identical)
  3. Routes rebuilt from new configuration
  4. HTTP transport cache checked for connection pool reuse
  5. Circuit breaker state preserved for unchanged routes
  6. Response cache selectively invalidated (only changed routes)
  7. Server routes re-registered (atomic hostname updates)
  8. Proxy state swapped atomically (lock-free reads, +2ns overhead)
  Reload is all-or-nothing: failure keeps old routes active, error logged.
  Duration: 50-200ms for typical configs (10-50 routes).

HTML processing pipeline (when rewrite_host=true):

  1. Backend response received → check Content-Type (only process text/html)
  2. Decompress if needed (gzip always; Brotli if brotli_support=true)
  3. Replace backend URLs with proxy URLs in HTML body
  4. Rewrite URL-containing response headers (Link, Content-Location, Refresh)
  5. Rewrite Set-Cookie domains (case-insensitive attribute matching)
  6. Inject JavaScript interceptor (rewrites fetch, XHR, dynamic elements, window.open)
  7. Inject logout toolbar before </body> (if auth=true and inject_toolbar=true)
  8. Re-compress response (gzip or Brotli based on client Accept-Encoding)
  Zero-copy mode (rewrite_host=false, inject_toolbar=false): skip all HTML processing,
  eliminating 10MB allocation per request. Ideal for APIs, WebSocket, streaming.

Logs

Log entries by component. Search with: logs search "proxy". Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Routing & Dispatch:

  proxy.dispatcher        DEBUG         Matched proxy route / no path match / no routes for hostname
  proxy.request           DEBUG         Proxying request to backend
  proxy.error             INFO          Request canceled (client disconnected — expected)
  proxy.error             ERROR         Proxy request failed (timeout, connection refused, etc.)
  proxy.redirect          DEBUG         Rewriting redirect Location header
  proxy.assets            WARN          Path traversal attempt / invalid path detected
  proxy.debug_timing      DEBUG         Full proxy roundtrip timing summary

Authentication & Authorization:

  proxy.reauth            INFO   AUDIT  Re-authentication required (reauth rule matched)
  proxy.oidc              DEBUG         Redirecting to internal/external OIDC provider
  proxy.oidc              ERROR         Failed to generate PKCE verifier, CSRF token, or state encryption
  proxy.oidc.callback     INFO   AUDIT  OIDC proxy authentication completed
  proxy.oidc.callback     WARN   AUDIT  State expired, CSRF validation failed
  proxy.oidc.callback     WARN          OAuth error from IdP, state decryption failed, host mismatch
  proxy.oidc.callback     ERROR  AUDIT  Token exchange failed
  proxy.oidc.callback     ERROR         Session creation failed

Header Signing:

  proxy.signing           WARN          Rotation interval too short / cluster key too short
  proxy.signing           ERROR         Key derivation failed (initial or rotation)

Bearer Token Injection:

  proxy.bearer_inject     WARN          No username/session, decryption failed (will re-mint)
  proxy.bearer_inject     ERROR         MintBearerToken failed, wrong response type, encryption error
  proxy.bearer_inject     DEBUG         Bearer token minted for backend
  proxy.bearer_refresh    WARN          Background refresh failed
  proxy.bearer_refresh    DEBUG         Background refresh completed

HTML Rewriting:

  proxy.rewrite           WARN          Response too large to buffer, streaming without rewrite
  proxy.rewrite           DEBUG         Chunked/binary response streaming, Brotli/Zstd disabled

WebSocket E2OE:

  proxy.ws_e2oe.*         INFO          Relay started / ended
  proxy.ws_e2oe.*         ERROR         Accept failed / backend dial failed

Session Monitoring:

  proxy.group_monitor     INFO          Monitor started, check completed
  proxy.group_monitor     INFO   AUDIT  User groups changed, updating session
  proxy.group_monitor     WARN          Session update wait failed
  proxy.group_monitor     ERROR         Group fetch failed, session update failed

Lifecycle:

  proxy.init              INFO          Proxy service initialized
  proxy.init              ERROR         Initialization failed
  proxy.reload            ERROR         Reload failed
  proxy.ca_rotation       INFO          Transport caches invalidated (CA rotation)
  proxy.director          WARN          PROXY protocol: invalid source IP

Transport & DNS:

  proxy.transport         DEBUG         Transport configured for route
  proxy.dns               DEBUG/INFO    Backend DNS resolution, DNSSEC validation
  proxy.dns               WARN/ERROR    DNS resolution failed, fallback to system DNS
  proxy.dns.quic          DEBUG/WARN    QUIC-specific DNS resolution and connection

Circuit Breaker & Load Balancing:

  proxy.circuit_breaker   WARN          Circuit breaker open / fallback activated
  proxy.outlier_detection WARN          Outlier detection config warnings
  proxy.dns_discovery     WARN          DNS discovery config warnings
  proxy.health_check      WARN          Health check config warnings
  proxy.fallback          ERROR         Invalid fallback URL / fallback service error

JIT-2FA:

  proxy.jit2fa            ERROR         JIT-2FA middleware creation failure

Request Signing:

  proxy.request_signing   WARN          Body hash failure
  proxy.request_signing   DEBUG         Request signed successfully
  proxy.signing_key       DEBUG/WARN    Public key endpoint access and validation
  proxy.signature_verify  DEBUG/WARN    Signature verification endpoint handler
  proxy.request_signature_verify  DEBUG/WARN  Request signature verification handler

Shadow/Mirror:

  proxy.shadow            DEBUG         Shadow request dispatched

Co-browsing:

  proxy.cobrowse.started                INFO          Co-browse session started
  proxy.cobrowse.stopped                INFO          Session stopped
  proxy.cobrowse.recorder_connected     INFO          Recorder WebSocket connected
  proxy.cobrowse.recorder_disconnected  INFO          Recorder disconnected
  proxy.cobrowse.reconnected            INFO          Recorder reconnected
  proxy.cobrowse.grace_expired          INFO          Cleanup grace period expired
  proxy.cobrowse.ws_upgrade_failed      WARN          WebSocket upgrade failed
  proxy.cobrowse.publish_failed         WARN          Event publish failed
  proxy.cobrowse.input_write_failed     WARN          Input forwarding failed
  proxy.cobrowse.recorder_ws_not_found  WARN          Recorder WS not found in cluster store
  proxy.cobrowse.input_received         DEBUG         Forwarding interaction event to recorder
  proxy.cobrowse.input_subscribe_failed WARN          Input channel subscription failed
  proxy.cobrowse.recorder_stats         INFO          Recorder WebSocket session ended

Configuration:

  proxy.route             INFO          Route configured (full route details)
  proxy.config            INFO          Global config summary, cert loading
  proxy.config            WARN          Duplicate route detection, config validation

Access Control:

  proxy.access            DEBUG         Route access check (app, host, groups, reason)

Canary:

  proxy.canary            DEBUG         Routing to stable/canary backend (app, version, backend)

Retry:

  proxy.retry             INFO          Retrying request (app, attempt, backend)
  proxy.retry             DEBUG         Retry succeeded (app, attempt)
  proxy.retry             WARN          Retry budget exceeded (app, pool_id)
  proxy.retry             WARN          All retry attempts exhausted (app, max_attempts)

Hedge:

  proxy.hedge             DEBUG         Hedge fired (app, hedges, delay, primary_backend)
  proxy.hedge             DEBUG         Hedge skipped (app, reason)
  proxy.hedge             WARN          All hedge attempts failed (app, total_attempts)

Metrics

Prometheus metrics. Query with: metrics prometheus proxy_<name>

Request Flow:

  proxy_requests                          counter    {app, host, auth}         Successful proxied requests
  proxy_errors                            counter    {app, host}               Proxy request errors
  proxy_backend_duration                  latency    {app}                     Backend response time
  proxy_auth_failures                     counter    {app, host}               Authentication failures
  proxy_authz_failures                    counter    {app, host}               Group authorization failures
  proxy_auth_bypass_total                 counter    {app, host}               Auth bypassed (bypass_auth_cidrs)
  proxy_subnet_failures                   counter    {app, host}               Subnet restriction failures
  proxy_reauth_required                   counter    {app, host}               Re-authentication triggered

Caching:

  proxy_cache_hits                        counter    {app, type}               Response cache hits (304/full)
  proxy_cache_misses                      counter    {app}                     Response cache misses
  proxy_cache_size                        gauge      {}                        Response cache entries
  proxy_cache_invalidated                 counter    {}                        Cache entries invalidated on reload

Header Signing:

  proxy_signing_total                     counter    {status, app}             Header signing operations
  proxy_signing_duration                  latency    {app}                     Header signing time
  proxy_request_signing_total             counter    {status, app}             Request signing operations
  proxy_request_signing_duration          latency    {app}                     Request signing time
  proxy_sign_payload_total                counter    {status}                  Payload signing outcomes
  proxy_key_derivation_total              counter    {status}                  Key derivation attempts
  proxy_key_rotation_total                counter    {status}                  Key rotations
  proxy_key_operation_total               counter    {status, operation}       Key operation failures
  proxy_key_request_total                 counter    {status}                  Public key endpoint requests
  proxy_key_request_duration              latency    {}                        Public key endpoint latency
  proxy_signature_verify_total            counter    {status}                  Signature verification requests
  proxy_signature_verify_duration         latency    {}                        Signature verification latency
  proxy_request_signature_verify_total    counter    {status}                  Request signature verification
  proxy_request_signature_verify_duration latency    {}                        Request signature verify latency

OIDC SSO:

  proxy_oidc_flow_initiated               counter    {host, provider}          OIDC auth flows started
  proxy_oidc_flow_completed               counter    {host, provider}          OIDC auth flows completed
  proxy_oidc_flow_failed                  counter    {host, reason, provider?} OIDC auth flow failures (provider absent on pre-state errors)
  proxy_oidc_state_validation_failed      counter    {reason}                  State validation failures

Per-User Rate Limiting:

  proxy_rate_limit_per_user_denied        counter    {app}                     Requests denied by per-user rate limit

Canary:

  proxy_canary_requests                   counter    {app, version}            Requests routed per version (stable/canary label)

Retry:

  proxy_retry_attempts_total              counter    {app, attempt}            Retry attempts by attempt number
  proxy_retry_success_total               counter    {app}                     Retries that succeeded
  proxy_retry_budget_exceeded             counter    {app}                     Retries blocked by budget
  proxy_retry_exhausted                   counter    {app}                     All retry attempts failed

Hedge:

  proxy_hedge_fired_total                 counter    {app}                     Hedge requests sent (primary too slow, value = number of hedges)
  proxy_hedge_won_total                   counter    {app}                     Hedge response used (primary was slower or failed)
  proxy_hedge_lost_total                  counter    {app}                     All attempts failed (primary + all hedges)
  proxy_hedge_skipped_total               counter    {app}                     Hedge skipped (no different backend, body replay failure)

Circuit Breaker:

  proxy_circuit_breaker_rejections        counter    {app}                     Requests rejected (breaker open)
  proxy_circuit_breaker_fallbacks         counter    {app}                     Fallback service activated

Transport:

  proxy_transport_cache_hits              counter    {}                        HTTP transport reused
  proxy_transport_cache_misses            counter    {}                        New HTTP transport created
  proxy_transport_cache_size              gauge      {}                        Cached transports
  proxy_transport_cache_invalidated       counter    {reason}                  Cache invalidated (CA rotation)
  proxy_http3_transport_cache_hits        counter    {}                        HTTP/3 transport reused
  proxy_http3_transport_cache_misses      counter    {}                        HTTP/3 transport created
  proxy_proxyprotocol_sent                counter    {version}                 PROXY protocol headers sent
  proxy_proxyprotocol_skipped             counter    {reason}                  PROXY protocol skipped
  proxy_ca_pool_version                   gauge      {}                        CA pool version
  proxy_optimized_transport_hits          counter    {route, pool}             Optimized transport cache hits
  proxy_optimized_transport_fallbacks     counter    {route, reason}           Optimized transport fallbacks
  proxy_transport_pools_created           counter    {pool, route}             Transport pools created
  proxy_transport_pools_cleaned           counter    {pool}                    Transport pools cleaned
  proxy_pool_registration_failures        counter    {pool, error}             Pool registration failures
  proxy_pool_cleanup_errors               counter    {pool, error}             Pool cleanup errors

Rewriting:

  proxy_rewrite_duration                  latency    {app}                     HTML rewriting time
  proxy_buffer_pool_gets                  counter    {}                        Buffer pool acquisitions

Reload:

  proxy_reload_attempts                   counter    {trigger}                 Reload attempts
  proxy_reload_total                      counter    {success, reason?}        Reload results (reason on failure only)
  proxy_reload_skipped                    counter    {reason}                  Reload skipped
  proxy_reload_duration                   latency    {success}                 Reload time (success path only)
  proxy_routes_configured                 gauge      {}                        Active routes
  proxy_routes_changed                    gauge      {}                        Routes changed on reload
  proxy_routes_unchanged                  gauge      {}                        Routes unchanged on reload
  proxy_routes_added                      counter    {}                        Routes added
  proxy_routes_removed                    counter    {}                        Routes removed
  proxy_config_hash_changed               counter    {}                        Config hash changes
  proxy_lb_pools_preserved                counter    {app}                     LB pools preserved
  proxy_lb_pools_created                  counter    {app, reason}             LB pools created

Session Monitoring:

  proxy_group_monitor_changes_total       counter    {username}                Group membership changes
  proxy_group_monitor_updates_total       counter    {username}                Session metadata updates
  proxy_group_monitor_errors_total        counter    {error_type}              Monitor errors
  proxy_group_monitor_check_duration      latency    {}                        Check cycle time

FastCGI:

  proxy_fastcgi_requests_total            counter    {mapping, status_class}   FastCGI RoundTrip exits (status_class: 2xx/3xx/4xx/5xx/error)
  proxy_fastcgi_request_duration          latency    {mapping}                 RoundTrip wall-clock time (auto-bucketed)
  proxy_fastcgi_pool_total                counter    {mapping, result}         Conn pool acquire (hit/fresh/retry)
  proxy_fastcgi_stderr_bytes_total        counter    {mapping, severity}       PHP-FPM STDERR bytes routed to audit log
  proxy_fastcgi_proto_status_total        counter    {mapping, status}         Non-success FCGI_END_REQUEST signals (cant_mpx/overloaded/unknown_role)

Alerts:

  rate(proxy_errors[5m]) / rate(proxy_requests[5m]) > 0.05     Error rate > 5%
  rate(proxy_circuit_breaker_rejections[5m]) > 0               Backend unhealthy
  rate(proxy_auth_failures[5m]) > 10                            Brute-force attempt
  rate(proxy_oidc_state_validation_failed[5m]) > 5              CSRF/state attack
  rate(proxy_transport_cache_invalidated[5m]) > 0              CA rotation event
  rate(proxy_reload_total{success="false"}[5m]) > 0             Config reload failing
  rate(proxy_retry_budget_exceeded[5m]) > 10                    Retry storm — budget protecting cluster
  rate(proxy_retry_exhausted[5m]) > 5                           Backend failures exhausting retries
  rate(proxy_hedge_fired_total[5m]) / rate(proxy_requests[5m]) > 0.1   >10% requests hedging — check tail latency
  rate(proxy_fastcgi_requests_total{status_class="error"}[5m]) > 5   FastCGI transport-level failures (backend unreachable)
  rate(proxy_fastcgi_stderr_bytes_total{severity="error"}[5m]) > 1000 Sustained PHP-FPM error output (investigate php-fpm.log)
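
To make the ratio alerts above concrete, the sketch below evaluates the first rule from two samples of each counter. It is a simplification: a real Prometheus rate() also extrapolates to the window edges, and the function names are illustrative only.

```python
def rate(first, last, window_seconds):
    """Per-second increase of a monotonic counter between two samples."""
    if last < first:  # counter reset (process restart): count from zero
        return last / window_seconds
    return (last - first) / window_seconds


def error_ratio_alert(errors, requests, window_seconds=300, threshold=0.05):
    """errors and requests are (first_sample, last_sample) tuples."""
    req_rate = rate(*requests, window_seconds)
    if req_rate == 0:
        return False  # no traffic in the window: nothing to alert on
    return rate(*errors, window_seconds) / req_rate > threshold
```

With 60 new errors against 1000 new requests in the window the ratio is 0.06 and the rule fires; with 40 new errors it is 0.04 and the rule stays quiet.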

FastCGI

FastCGI 1.0 RESPONDER backends — PHP-FPM, Python WSGI (flup, gunicorn FCGI mode), Perl FCGI::ProcManager, Ruby rack-fastcgi, Lua, and any other language with a FastCGI 1.0 RESPONDER implementation. Selected per-mapping via the service URL scheme.

For deeper detail see: man fastcgi

URL schemes:

  fastcgi://host:port           Plain TCP. The default for most FastCGI
                                deployments (PHP-FPM in containers,
                                gunicorn FCGI listener, etc.).
  fastcgi+tls://host:port       TCP + TLS handshake before FCGI records.
                                Reuses tls_check from the mapping for
                                cert verify policy.
  fastcgi+unix://...            REJECTED at config load. Configure the
                                FastCGI backend to listen on a TCP port
                                instead.

Required mapping fields:

  service          = ["fastcgi://php-fpm.internal:9000"]
  fcgi_script_root = "/var/www/html"   (absolute filesystem path on backend)

Common optional fields:

  fcgi_index                  "index.php"   front-controller for paths without a SplitPath suffix
  fcgi_split_path             [".php"]      suffixes splitting SCRIPT_NAME from PATH_INFO
  fcgi_env                    {}            extra CGI envs (passthrough)
  fcgi_inject_identity        true          inject HEXON_USER/MAIL/etc. envs
  fcgi_dial_timeout           "5s"          backend connect timeout
  fcgi_read_timeout           "60s"         full response read timeout — bump for heavy/long-running backends
  fcgi_write_timeout          "10s"         request body write timeout
  fcgi_idle_timeout           "90s"         idle TTL for pooled conns (set BELOW your FastCGI worker idle timeout, e.g. PHP-FPM pm.process_idle_timeout)
  fcgi_max_idle_conns         32            pool size per backend (set 0 to disable pooling)
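
How fcgi_split_path and fcgi_index might interact can be sketched as follows. The matching rules here (case-sensitive, first suffix that ends a path segment, whole path passed as PATH_INFO to the index script when nothing matches) are assumptions for illustration, not confirmed behavior.

```python
def split_script_path(url_path, split_suffixes=(".php",), index="index.php"):
    """Return (SCRIPT_NAME, PATH_INFO) for a request path."""
    for suffix in split_suffixes:
        pos = url_path.find(suffix)
        while pos != -1:
            end = pos + len(suffix)
            # Only split where the suffix ends a path segment.
            if end == len(url_path) or url_path[end] == "/":
                return url_path[:end], url_path[end:]
            pos = url_path.find(suffix, pos + 1)
    # No suffix matched: hand the whole path to the front controller.
    return "/" + index, url_path
```

For example, /blog/index.php/2024/post splits into SCRIPT_NAME /blog/index.php and PATH_INFO /2024/post, while /about/team falls through to the front controller.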

Composes with existing mapping fields:

  auth = true + add_auth_headers = true
                              Drives identity envs (HEXON_USER/MAIL/NAME/GROUPS).
  add_extended_auth_headers   When true, also injects X-Hexon-Auth-Method
                              (passkey/oidc/x509/...) and X-Hexon-JA4
                              headers, which surface as HEXON_AUTH_METHOD
                              and HEXON_JA4 envs.
  mtls = true                 Auto-injects SSL_CLIENT_* envs from the peer
                              cert (also fires for X.509 SSO and voluntary
                              client auth — anytime a peer cert is presented).
  tls_check                   For fastcgi+tls:// — verify backend cert chain.
  site = "..."                Routes FCGI traffic through the connector
                              tunnel. The FastCGI backend host doesn't
                              need a public IP.
  host_header                 Overrides the Host the backend sees;
                              propagates to HTTP_HOST and SERVER_NAME envs.
  lb_strategy                 Load balancing across multiple FCGI backends.
                              Mixed schemes (one http://, one fastcgi://)
                              rejected at config load.

Identity envs (when fcgi_inject_identity = true):

  HEXON_USER, HEXON_MAIL, HEXON_NAME, HEXON_GROUPS  — from the standard
                                                       add_auth_headers block.
                                                       Read from the configured
                                                       header_user/mail/name/
                                                       groups names (defaults
                                                       X-Hexon-User/...).
  HEXON_AUTH_METHOD, HEXON_JA4                       — require
                                                       add_extended_auth_headers = true
                                                       (header names are fixed)
  HTTP_X_HEXON_*                                     — same values, also
                                                       exposed via the
                                                       standard HTTP_* loop

Computed CGI envs:

  GATEWAY_INTERFACE, REQUEST_METHOD, REQUEST_URI, QUERY_STRING,
  SCRIPT_NAME, SCRIPT_FILENAME, PATH_INFO, PATH_TRANSLATED,
  DOCUMENT_ROOT, DOCUMENT_URI, REQUEST_SCHEME,
  SERVER_NAME, SERVER_PORT, SERVER_PROTOCOL, SERVER_SOFTWARE,
  REMOTE_ADDR, REMOTE_HOST, REMOTE_PORT, REMOTE_USER,
  CONTENT_LENGTH, CONTENT_TYPE, HTTP_HOST,
  HTTPS, SSL_PROTOCOL, SSL_CIPHER (when TLS),
  SSL_CLIENT_VERIFY, SSL_CLIENT_S_DN, SSL_CLIENT_S_DN_CN,
  SSL_CLIENT_I_DN, SSL_CLIENT_M_SERIAL, SSL_CLIENT_V_START,
  SSL_CLIENT_V_END  (when client cert presented),
  HTTP_*  (all request headers, dashes→underscores, uppercased).
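
The HTTP_* mapping in the last line of the list follows the CGI convention from RFC 3875. A minimal sketch of that conversion is shown here; joining repeated headers with ", " is an assumption of this example.

```python
def header_to_env(headers):
    """Map (name, value) request header pairs to CGI HTTP_* variables."""
    envs = {}
    for name, value in headers:
        # Uppercase the name and turn dashes into underscores.
        key = "HTTP_" + name.upper().replace("-", "_")
        # Repeated headers are joined with ", " (assumption for this sketch).
        envs[key] = f"{envs[key]}, {value}" if key in envs else value
    return envs
```

This is also why X-Hexon-User shows up on the backend as HTTP_X_HEXON_USER.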

AUTH_TYPE (derived from request properties):

  Bearer       Authorization: Bearer ... (also auto when add_bearer=true)
  Basic        Authorization: Basic ...
  Negotiate    Authorization: Negotiate ... (Kerberos/SPNEGO)
  Digest       Authorization: Digest ...
  Certificate  client peer cert presented (mTLS, X.509 SSO, voluntary)
  ""           cookie-based session (passkey, OIDC, magic link, JIT-2FA)

Backends that need the Hexon-canonical auth method (passkey vs oidc vs x509 vs …) should read HEXON_AUTH_METHOD instead. It is populated when add_extended_auth_headers = true on the mapping.
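
The AUTH_TYPE table can be read as a small decision function. The sketch below is one possible reading; the precedence (Authorization scheme first, then add_bearer, then peer certificate) is an assumption, since the table does not state an order.

```python
KNOWN_SCHEMES = {
    "bearer": "Bearer",
    "basic": "Basic",
    "negotiate": "Negotiate",
    "digest": "Digest",
}


def derive_auth_type(authorization=None, has_peer_cert=False, add_bearer=False):
    """Return the AUTH_TYPE value for a request ("" for cookie sessions)."""
    if authorization:
        scheme = authorization.split(" ", 1)[0].lower()
        if scheme in KNOWN_SCHEMES:
            return KNOWN_SCHEMES[scheme]
    if add_bearer:
        return "Bearer"  # proxy injects a bearer token for the backend
    if has_peer_cert:
        return "Certificate"
    return ""
```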

Health checks:

  type = "tcp" probes the FastCGI listener directly (recommended).
  type = "http" requires the backend to expose an HTTP health endpoint
  on a separate listener (PHP-FPM's pm.status_path, gunicorn /healthz,
  etc.).

Security:

  - Null-byte URL paths rejected with 400.
  - Missing/invalid Content-Length on POST/PUT/PATCH rejected with 411
    (PHP-FPM hangs without it).
  - SCRIPT_FILENAME path-traversal rejected (cleaned path must stay
    under fcgi_script_root).
  - Backend error output (FCGI_STDERR) routed to the audit log with the
    request's correlation_id; never appears in the response body.
  - Only Responder role; Authorizer/Filter roles silently ignored.
  - SSRF blocklist (cloud metadata endpoints, link-local) applies when
    routing through the connector.
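
The null-byte and SCRIPT_FILENAME containment checks listed above can be sketched lexically, without touching the filesystem (the script root lives on the backend, so the proxy cannot resolve symlinks there). The function name and the use of ValueError are illustrative assumptions.

```python
import posixpath


def resolve_script_filename(script_root, script_name):
    """Return the absolute SCRIPT_FILENAME or raise ValueError."""
    if "\x00" in script_name:
        raise ValueError("null byte in path")
    root = posixpath.normpath(script_root)
    # Join relative to the root, then collapse "." and ".." segments.
    candidate = posixpath.normpath(posixpath.join(root, script_name.lstrip("/")))
    # The cleaned path must be the root itself or live below "root/".
    if candidate != root and not candidate.startswith(root.rstrip("/") + "/"):
        raise ValueError("path escapes fcgi_script_root")
    return candidate
```

Comparing against the root plus a trailing slash matters: a plain prefix check would accept /var/www/html-evil for a root of /var/www/html.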

Operational notes:

  - Connection pool reuses conns via FCGI_KEEP_CONN. When the backend
    cycles a worker (e.g., PHP-FPM pm.max_requests), the next pooled-
    conn use detects the close and dials fresh on retry.
  - Retries: first-byte failures on pooled conns retry once on a fresh
    dial. Failures after request body bytes have been sent do NOT retry
    (avoids double-execution on POST/PUT/PATCH).
  - STDERR severity: WARN by default, ERROR when status >= 400.
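
The retry and severity rules above reduce to two small predicates. This sketch uses hypothetical parameter names; it only restates the documented conditions.

```python
def should_retry(pooled_conn, attempt, body_bytes_sent, response_bytes_read):
    """Retry once on a fresh dial, only for first-byte failures on pooled conns."""
    if not pooled_conn or attempt > 0:
        return False  # fresh dials are not retried; pooled conns retry once
    if body_bytes_sent > 0:
        return False  # backend may have started executing: avoid double-execution
    return response_bytes_read == 0


def stderr_severity(status):
    """Log level for FCGI_STDERR output, based on the response status."""
    return "ERROR" if status >= 400 else "WARN"
```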

Common deployment pattern:

  [[proxy.mapping]]
  app  = "wordpress"
  host = "blog.example.com"
  path = "/.*"
  service = ["fastcgi://php-fpm.internal:9000"]
  fcgi_script_root = "/var/www/html"
  auth = true
  add_auth_headers = true
  add_extended_auth_headers = true
  groups = ["staff"]

  # Heavy admin operations (Composer, large reports)
  fcgi_read_timeout = "300s"

References:

  RFC 3875                CGI/1.1 environment variables
  FastCGI 1.0 spec        https://fastcgi-archives.github.io/FastCGI_Specification.html

Request Shadow/Mirror

Mirrors live traffic to secondary backends for testing or migration — fully async, never affects the primary path

Overview

Duplicates live HTTP requests to a secondary backend for testing, canary validation, or migration. Shadow traffic is fully asynchronous — it never affects the primary request/response path or adds latency. Configurable per proxy mapping with sampling control.

Core capabilities:

Asynchronous dispatch (no blocking the main request path)
Configurable sampling via runtime_fraction (percentage or fractional modes)
Dedicated HTTP transport pool separate from main proxy traffic
Shadow identification headers for backend awareness (X-Hexon-Shadow-*)
Per-shadow timeout and body size limits to prevent resource exhaustion
Distributed trace ID propagation for end-to-end observability
Per-mapping shadow configuration with global defaults
Multiple shadow targets per proxy mapping (e.g., canary + analytics)
Connector site routing: shadow targets at remote sites via QUIC tunnels

Shadow dispatch flow:

  1. Client request arrives at proxy handler
  2. Proxy forwards to primary backend (normal flow)
  3. For each configured shadow target, sampling decision is evaluated
  4. If sampled, request is dispatched asynchronously to the shadow module
  5. Shadow module replays the request to the shadow backend with its own transport
  6. Shadow response is discarded (metrics only, no client impact)
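
The dispatch flow above can be sketched with a background worker pool. This is an illustration only: the names, the dict-based request, and the thread pool are assumptions, and the shadow copies are submitted before the primary call so that both run concurrently.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool so shadow work never competes with the primary path.
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def _send_shadow(target, request):
    try:
        target["send"](request)  # shadow response is discarded
        return "success"
    except Exception:
        return "error"  # recorded as a metric, never shown to the client


def handle_request(request, primary, shadows, rng=random.random):
    """Serve from primary; mirror to sampled shadow targets in the background."""
    futures = []
    for target in shadows:
        if rng() < target["fraction"]:
            # Copy so the shadow cannot mutate the primary request.
            futures.append(_shadow_pool.submit(_send_shadow, target, dict(request)))
    response = primary(request)  # never waits on the shadow futures
    # Futures are returned only so callers can observe outcomes in tests.
    return response, futures
```

A failing or slow shadow target changes only the outcome stored in its future; the primary response is produced independently.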

Shadow identification headers (when add_headers = true):

  X-Hexon-Shadow: "true"          - Identifies this as a shadow request
  X-Hexon-Shadow-Name: "<name>"   - Shadow target name for routing/filtering
  X-Hexon-Shadow-Source: "<host>" - Original request host
  X-Hexon-Shadow-Time: "<unix>"   - Unix timestamp of original request
  X-Hexon-Trace-ID: "<uuid>"      - Distributed trace ID
  X-Forwarded-Host: "<host>"      - Standard forwarded host header
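
On the receiving side, a shadow backend can use these headers to skip side effects such as sending email or charging cards. A minimal sketch of such a check, using only the header names listed above (the function name and return shape are hypothetical):

```python
def parse_shadow_headers(headers):
    """Return shadow metadata as a dict, or None for a normal request."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    if h.get("x-hexon-shadow") != "true":
        return None
    ts = h.get("x-hexon-shadow-time")
    return {
        "name": h.get("x-hexon-shadow-name"),
        "source": h.get("x-hexon-shadow-source"),
        "time": int(ts) if ts is not None else None,
        "trace_id": h.get("x-hexon-trace-id"),
    }
```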

Sampling modes:

  Percentage: runtime_fraction = { percent = 10 } for 10% of requests
  Fractional: runtime_fraction = { numerator = 1, denominator = 1000 } for 0.1%

Use cases:

  - Canary deployments: shadow 10% of traffic to new version before cutover
  - Analytics pipelines: mirror requests to analytics backend for processing
  - Migration validation: compare primary and shadow responses offline
  - Load testing: replay production traffic to staging environments
  - A/B backend testing: shadow to alternative implementation

Config

Global shadow defaults under [proxy.shadow]:

[proxy.shadow]
  enabled = true                    # Enable shadow dispatch globally
  timeout = "5s"                    # Default timeout for shadow requests
  max_body_size = "10MB"            # Maximum request body size to shadow
  add_headers = true                # Add X-Hexon-Shadow-* identification headers
  max_idle_conns = 50               # Transport pool: max idle connections
  max_idle_conns_per_host = 10      # Transport pool: max idle per host
  max_conns_per_host = 100          # Transport pool: max total per host
  idle_conn_timeout = 90            # Idle connection timeout (seconds)
  tls_handshake_timeout = 10        # TLS handshake timeout (seconds)
  tls_verify = true                 # Verify shadow backend TLS certificates

Per-mapping shadow configuration (overrides global defaults):

[[proxy.mapping]]
  host = "api.example.com"
  path = ".*"
  service = ["https://primary.internal"]

  [[proxy.mapping.shadow]]
  name = "canary"                   # Shadow target name (used in metrics and headers)
  service = "https://canary.internal:8443"  # Shadow backend URL
  runtime_fraction = { percent = 10 }       # Sample 10% of requests
  add_headers = true                # Override global add_headers

  [[proxy.mapping.shadow]]
  name = "analytics"
  service = "https://analytics.internal"
  runtime_fraction = { numerator = 1, denominator = 1000 }  # 0.1% sampling
  timeout = "2s"                    # Override global timeout

  # Shadow target at a remote site via connector tunnel:
  [[proxy.mapping.shadow]]
  name = "staging-mirror"
  service = "https://staging.internal:8443"
  site = "staging-eu"               # Routes through connector tunnel

Sampling configuration:

  Percentage mode (0-100):
    runtime_fraction = { percent = 10 }   # 10% of requests

  Fractional mode (precise low rates):
    runtime_fraction = { numerator = 1, denominator = 1000 }  # 0.1%

  No runtime_fraction: 100% of requests are shadowed.
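
The three cases above map to a single probability. A small sketch, assuming runtime_fraction arrives as a dict with the same keys as the config (the function names are illustrative):

```python
import random


def sample_probability(runtime_fraction=None):
    """Convert a runtime_fraction setting to a probability in [0, 1]."""
    if runtime_fraction is None:
        return 1.0  # no runtime_fraction: shadow every request
    if "percent" in runtime_fraction:
        p = runtime_fraction["percent"] / 100
    else:
        p = runtime_fraction["numerator"] / runtime_fraction["denominator"]
    if not 0 <= p <= 1:
        raise ValueError("runtime_fraction must resolve to 0..1")
    return p


def should_shadow(runtime_fraction=None, rng=random.random):
    """Per-request random sampling decision."""
    return rng() < sample_probability(runtime_fraction)
```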

Hot-reloadable: runtime_fraction, timeout, add_headers, max_body_size. Cold (restart required): enabled, transport pool settings (max_idle_conns, etc.).
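
When planning a config change, the hot/cold split above can be checked mechanically. The sketch below treats every setting outside the hot-reloadable list as requiring a restart; the function name and flat dict layout are assumptions.

```python
HOT_RELOADABLE = {"runtime_fraction", "timeout", "add_headers", "max_body_size"}


def classify_changes(old, new):
    """Return (hot, cold) lists of changed setting names."""
    changed = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
    hot = sorted(changed & HOT_RELOADABLE)
    cold = sorted(changed - HOT_RELOADABLE)  # non-empty means a restart is needed
    return hot, cold
```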

Troubleshooting

Common symptoms and diagnostic steps:

Shadow requests not reaching the backend:

  - Verify [proxy.shadow] enabled = true
  - Check shadow target name matches in proxy mapping configuration
  - Verify shadow service URL is reachable from the Hexon server
  - Test connectivity: 'net tcp <shadow-host>:<port> --tls'
  - Check runtime_fraction is set correctly (0% = no traffic)
  - Verify max_body_size is sufficient for the request payload
  - Check shadow metrics: shadow_requests_total should be incrementing

Shadow requests timing out:

  - Increase timeout setting (default 5s may be too short for slow backends)
  - Check shadow backend health and response times
  - Verify network path between Hexon and shadow backend
  - Check max_conns_per_host limit (100 default) is not exhausted
  - Monitor shadow_request_duration histogram for latency distribution

Transport pool exhaustion:

  - Increase max_idle_conns (default 50) for high-traffic deployments
  - Increase max_conns_per_host (default 100) for single-target shadows
  - Reduce timeout to free connections faster
  - Check idle_conn_timeout (default 90s) is appropriate
  - Use short timeouts (2-5s) to prevent connection buildup

TLS errors to shadow backend:

  - Verify shadow backend TLS certificate is valid
  - Check tls_verify setting (set to false only for testing)
  - Verify tls_handshake_timeout is sufficient (default 10s)
  - Check if shadow backend requires specific TLS version or ciphers

Sampling rate seems incorrect:

  - Percentage mode: percent = 10 means approximately 10% (not exact)
  - Fractional mode: numerator/denominator for precise low rates
  - Sampling is per-request, random; short windows may show variance
  - Check shadow_requests_total vs total proxy requests for actual rate
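
To judge whether an observed rate is within normal variance, the expected spread of independent per-request sampling can be estimated from the binomial distribution. This is a generic statistical sketch, not a product feature.

```python
import math


def expected_sample_range(total_requests, probability, sigmas=3):
    """Return (low, high) bounds for the number of shadowed requests."""
    mean = total_requests * probability
    std = math.sqrt(total_requests * probability * (1 - probability))
    low = max(0.0, mean - sigmas * std)
    high = min(float(total_requests), mean + sigmas * std)
    return low, high
```

For 1000 requests at percent = 10, anything between roughly 72 and 128 shadowed requests is consistent with correct sampling.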

Shadow affecting primary request latency:

  - Shadow dispatch should be fire-and-forget (no Wait())
  - If primary slows down, check if body buffering is the cause
  - Reduce max_body_size to limit memory allocation for large payloads
  - Verify shadow dispatch is non-blocking (asynchronous)

Metrics and monitoring:

  - shadow_requests_total{shadow_name}: total dispatched shadow requests
  - shadow_success_total{shadow_name}: requests with 2xx/3xx responses
  - shadow_errors_total{shadow_name, error_type}: requests with errors
  - shadow_request_duration{shadow_name}: latency histogram

Relationships

Module dependencies and interactions:

proxy: Primary consumer. The reverse proxy handler evaluates shadow configuration for each proxy mapping and dispatches shadow requests asynchronously when sampling criteria are met. Shadow config is nested under [[proxy.mapping.shadow]].
config: Global defaults from [proxy.shadow] are merged with per-mapping shadow overrides. Hot-reloadable settings: runtime_fraction, timeout, add_headers, max_body_size. Changing enabled or the transport pool settings requires a restart.
telemetry: Shadow metrics exported for monitoring: request counts, success/error counts, and latency histograms per shadow target name. Structured logging for dispatch and response events.
dns: Shadow backend hostnames resolved via the DNS module with standard resolution and caching behavior.
certs: TLS certificate verification for shadow backends uses the system trust store. tls_verify controls whether verification is enforced.
Cluster RPC: Shadow dispatch uses the fire-and-forget pattern to ensure zero impact on primary request path. No cluster coordination needed; shadow runs on the receiving node only.
connector: When a shadow target specifies a “site” parameter, requests route through the QUIC connector tunnel to the remote site instead of direct connection. Transport is cached per site for connection reuse.

Logs

Log entries emitted by shadow dispatch. Search with: logs search “shadow.dispatch”. All entries use log name “shadow.dispatch”. Levels: WARN > DEBUG. No AUDIT entries in this module.

Dispatch Lifecycle:

  shadow.dispatch  DEBUG  Shadow request succeeded (shadow_name, status_code, latency_ms)
  shadow.dispatch  WARN   Shadow request failed with status (shadow_name, status_code, latency_ms)
  shadow.dispatch  WARN   Shadow request error (shadow_name, error_type, latency_ms, error)

Metrics

Prometheus metrics. Query with: metrics prometheus shadow_<name>

Request Counts:

  shadow_requests_total       counter    {shadow_name}                          Shadow requests dispatched
  shadow_success_total        counter    {shadow_name}                          Shadow requests with 2xx/3xx responses
  shadow_errors_total         counter    {shadow_name, error_type}              Shadow request errors (error_type: request_creation|timeout|network|http_error)

Note: error_type “http_error” also includes a “status_code” label.

Latency:

  shadow_request_duration     latency    {shadow_name}                          Shadow request round-trip duration

Alerts:

  rate(shadow_errors_total[5m]) > 0                                             Shadow errors occurring — check backend health
  rate(shadow_errors_total{error_type="timeout"}[5m]) > 5                       High timeout rate — increase timeout or check backend
  histogram_quantile(0.99, shadow_request_duration) > 5                         p99 latency exceeds 5s — shadow backend slow