Reverse Proxy

Load Balancer

Distributes traffic across backend pools with health checks, circuit breakers, and outlier detection

Overview

Distributes requests across backend servers with health-aware routing, circuit breakers, and outlier detection. Replaces separate load balancer, health checker, and circuit breaker tools with one integrated module. Applies to every proxy route — each mapping can have its own backend pool, algorithm, and health check configuration.

Key differentiators:

  • Cluster-native state: all nodes share circuit breaker, health, and connection state via the memory module. One node’s health check benefits all nodes. Circuit trips are replicated to every node in the cluster.
  • HTTP/3 backend support: full QUIC support for backend connections with HTTP/3 health checks. Per-protocol circuit breaking allows HTTP/3 to fail independently while HTTP/2 continues working (e.g., QUIC blocked by firewall).
  • Expression-based circuit breaker trip conditions: custom logic like “error_ratio > 0.1 && p99_latency > 2000” (latency variables are in milliseconds) instead of rigid thresholds.
  • JA4/JA4H fingerprint routing: session affinity based on TLS fingerprint, no cookies required. Works for any protocol, including non-HTTP.
  • 7 health check types: TCP, HTTP, HTTP/3, gRPC, MySQL, PostgreSQL, Redis with native protocol checks (not just TCP port probes).
  • 7 load balancing algorithms: adaptive (default), round_robin, weighted (EDF scheduling), least_conn (Power of Two Choices), hash (xxhash), maglev (Google consistent hashing), random. The adaptive algorithm uses epsilon-greedy reinforcement learning — it starts exploring all backends equally, measures real latency and error rates, then gradually shifts traffic toward the best performers while still probing others to detect recovery. Zero configuration.
  • Outlier detection with three independent detection mechanisms: consecutive failures, success rate analysis (statistical), and failure percentage (threshold-based).
  • DNS-based service discovery with automatic backend add/remove on DNS changes.
  • Token bucket rate limiting with per-pool or per-user limits, distributed across cluster. Per-user mode enabled via rate_limit_per_user = true on proxy mappings.
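
The adaptive strategy in the list above can be pictured with a small epsilon-greedy selector. This is a minimal sketch, not the module's implementation: the 50-request round-robin warm-up and the 95%/5% exploit/explore split come from the algorithm details in this document, while the class name, the EMA smoothing factor, and the scoring formula (success EMA minus latency EMA in seconds) are assumptions made for illustration.

```python
import random


class AdaptiveSelector:
    """Epsilon-greedy backend picker (illustrative; the scoring formula is assumed)."""

    def __init__(self, backends, learning_requests=50, epsilon=0.05, alpha=0.2, rng=None):
        self.backends = list(backends)
        self.learning_requests = learning_requests  # round-robin warm-up length
        self.epsilon = epsilon                      # share of exploratory picks
        self.alpha = alpha                          # EMA smoothing factor (assumed)
        self.rng = rng or random.Random()
        self.requests = 0
        self.latency_ema = {b: 0.0 for b in self.backends}
        self.success_ema = {b: 1.0 for b in self.backends}

    def score(self, backend):
        # Assumed score: reward success, penalise latency (in seconds).
        return self.success_ema[backend] - self.latency_ema[backend]

    def select(self):
        self.requests += 1
        if self.requests <= self.learning_requests:
            # Learning phase: plain round-robin so every backend gets measured.
            return self.backends[(self.requests - 1) % len(self.backends)]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.backends)   # explore
        return max(self.backends, key=self.score)   # exploit

    def record(self, backend, latency_s, ok):
        # Fold the request outcome into the exponential moving averages.
        a = self.alpha
        self.latency_ema[backend] = (1 - a) * self.latency_ema[backend] + a * latency_s
        self.success_ema[backend] = (1 - a) * self.success_ema[backend] + a * (1.0 if ok else 0.0)
```

Because exploration never stops, a backend that recovers will eventually be probed again and can win traffic back once its score improves.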

All state is shared cluster-wide. Reads are served from local in-memory cache for near-instant response. Writes are replicated to all nodes with eventual consistency.

Config

Configuration is primarily done through [[proxy.mapping]] sub-tables. The load balancer does not have its own top-level TOML section; pools are created and managed automatically by the proxy service when mappings have multiple backends.

[[proxy.mapping]]
service = ["http://be1:8080", "http://be2:8080"] # Array triggers LB pool creation
lb_strategy = "adaptive" # Algorithm: adaptive, round_robin, weighted, least_connections,
# hash, maglev, random (default: adaptive)
lb_weights = [5, 3] # Backend weights for "weighted" strategy (one per service entry)
lb_hash_key = "cookie:session_id" # Hash key for "hash"/"maglev" strategies
# Options: cookie:<name>, ja4, ja4h, ip, header:<name>
enable_http3 = false # Enable HTTP/3 (QUIC) backend connections
protocol_preference = "prefer_http3" # Protocol preference: prefer_http3, prefer_http2
dns_discovery = false # Enable DNS-based service discovery
dns_refresh = "30s" # DNS refresh interval
# Default health check (active by default even without this section).
# Applies to ALL mappings that don't have an explicit [proxy.mapping.health_check].
# Any non-5xx response = healthy. Only connection errors, timeouts, and 5xx = unhealthy.
# Health check path is derived from each mapping's route path.
[proxy.default_health_check]
enabled = true # Set false to disable default health checks
type = "http" # Check type: tcp, http, http3, grpc
method = "GET" # HTTP method (default: GET)
interval = "15s" # Check interval (default: 15s)
timeout = "5s" # Check timeout (default: 5s)
unhealthy_threshold = 3 # Consecutive failures to mark unhealthy (default: 3)
healthy_threshold = 2 # Consecutive successes to mark healthy (default: 2)
# Per-mapping health check (overrides default_health_check for this mapping)
[proxy.mapping.health_check]
enabled = true # Enable health checking (default: true)
type = "http" # Check type: tcp, http, http3, grpc
path = "/health" # HTTP/HTTP3 health check path
method = "GET" # HTTP method (default: GET)
expected_status = [200] # Expected HTTP status codes (empty = any non-5xx is healthy)
interval = "10s" # Check interval (default: 10s)
timeout = "5s" # Check timeout (default: 5s)
unhealthy_threshold = 3 # Consecutive failures to mark unhealthy (default: 3)
healthy_threshold = 2 # Consecutive successes to mark healthy (default: 2)
tls_skip_verify = false # Skip TLS certificate verification
grpc_service = "" # gRPC service name for grpc.health.v1.Health/Check
[proxy.mapping.circuit_breaker]
enabled = true # Enable circuit breaker (default: false)
error_ratio_threshold = 0.5 # Max error ratio 0.0-1.0 (threshold mode)
latency_p95_threshold = "1s" # Max P95 latency (threshold mode)
network_error_threshold = 0.3 # Max network error ratio (threshold mode)
combine_mode = "or" # Threshold combine: "or" (any) or "and" (all)
trip_expression = "" # Expression mode (overrides thresholds when set)
# Variables: error_ratio, success_ratio, p50_latency,
# p95_latency, p99_latency, avg_latency (all ms),
# network_error_ratio, timeout_ratio, request_count
min_samples = 10 # Minimum requests before evaluation (default: 10)
error_window = "60s" # Sliding window for error tracking (default: 60s)
fallback_duration = "30s" # Duration circuit stays open (default: 30s)
success_threshold = 3 # Successes in half-open to close (default: 3)
fallback_mode = "error" # Open behavior: "error" (503) or "pool" (fallback pool)
fallback_pool_id = "" # Pool ID for fallback_mode="pool"
response_code = 503 # HTTP status when fallback_mode="error" (default: 503)
per_protocol = false # Track circuits per protocol (http3/http2/http1)
fallback_protocol = "http2" # Protocol when per-protocol circuit opens
[proxy.mapping.outlier_detection]
enabled = true # Enable outlier detection (default: false)
consecutive_5xx = 5 # Eject after N consecutive 5xx (default: 5)
consecutive_gateway_failure = 5 # Eject after N consecutive 502/503/504 (default: 5)
consecutive_local_failure = 5 # Eject after N consecutive connection errors (default: 5)
success_rate_enabled = true # Enable statistical success rate analysis
success_rate_min_hosts = 5 # Min hosts for success rate calculation
success_rate_min_requests = 100 # Min requests per host for success rate
success_rate_stdev_factor = 1.9 # Standard deviation factor for outlier threshold
success_rate_enforcing_pct = 100 # Percentage of detected outliers actually ejected
failure_percentage_enabled = true # Enable failure percentage threshold
failure_percentage_threshold = 85 # Failure % above which backend is ejected
failure_percentage_min_hosts = 5 # Min hosts for failure percentage
failure_percentage_min_reqs = 50 # Min requests for failure percentage
interval = "10s" # Detection sweep interval (default: 10s)
base_ejection_time = "30s" # Initial ejection duration (default: 30s)
max_ejection_time = "300s" # Maximum ejection duration (default: 300s)
max_ejection_percent = 10 # Max % of backends ejected at once (default: 10)
ejection_jitter_pct = 10 # Random jitter on re-admission time (default: 10)

Rate limiting (configured via proxy mapping fields, not a sub-table):

rate_limit = "200/1m" # Per-mapping rate limit (count/duration)
rate_limit_per_user = false # Per-user limits (vs shared pool limit, cluster-wide)
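
To make the count/duration format and the token bucket behaviour concrete, here is a minimal single-node sketch. The function and class names are hypothetical, the accepted duration units (ms, s, m, h) are an assumption based on the values used elsewhere in this configuration, and cluster-wide replication of the counters is deliberately left out.

```python
import re

_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}  # assumed unit set


def parse_rate_limit(spec):
    """Parse 'count/duration' (for example '200/1m') into (count, window_seconds)."""
    m = re.fullmatch(r"(\d+)/(\d+)(ms|s|m|h)", spec.strip())
    if not m:
        raise ValueError(f"invalid rate limit: {spec!r}")
    count, amount, unit = int(m.group(1)), int(m.group(2)), m.group(3)
    return count, amount * _UNITS[unit]


class TokenBucket:
    """Token bucket with an injected clock so behaviour is deterministic."""

    def __init__(self, count, window_s, now=0.0):
        self.capacity = float(count)
        self.refill_per_s = count / window_s
        self.tokens = float(count)
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill for the elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With rate_limit_per_user = true, one such bucket would be kept per user key rather than one per pool.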

DNS discovery modes (dns_mode field in programmatic API):

"internal" (default): Use Hexon DNS module with DNSSEC, caching, cluster config
"system": Use system resolver directly (no DNSSEC)
"custom": Use custom resolvers directly (requires dns_resolvers list)

Maglev table size default: 65537 (prime number). Configurable via programmatic API.
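
For readers unfamiliar with Maglev, the sketch below shows how a lookup table of prime size is populated, following the published Maglev scheme. It is an illustration under assumptions: BLAKE2b from the standard library stands in for whatever hash the module uses, and the function names are hypothetical.

```python
import hashlib


def _h(data, seed):
    # Stand-in hash (BLAKE2b); the module's actual hash function may differ.
    digest = hashlib.blake2b(data.encode(), digest_size=8, salt=seed).digest()
    return int.from_bytes(digest, "big")


def build_maglev_table(backends, table_size=65537):
    """Populate a Maglev lookup table; table_size must be prime."""
    if not backends:
        raise ValueError("at least one backend is required")
    n = len(backends)
    offsets = [_h(b, b"offset") % table_size for b in backends]
    skips = [_h(b, b"skip") % (table_size - 1) + 1 for b in backends]
    next_idx = [0] * n
    table = [None] * table_size
    filled = 0
    while True:
        for i in range(n):
            # Walk backend i's permutation until a free slot turns up.
            slot = (offsets[i] + next_idx[i] * skips[i]) % table_size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % table_size
            table[slot] = backends[i]
            next_idx[i] += 1
            filled += 1
            if filled == table_size:
                return table


def lookup(table, key):
    """Map a hash key (for example a cookie value) to a backend."""
    return table[_h(key, b"key") % len(table)]
```

Backends take turns claiming slots, so each one ends up with an almost equal share of the table. A larger table reduces how many entries change when the backend set changes.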

Troubleshooting

Common symptoms and diagnostic steps:

Backend never selected (no healthy backends):

- Check health status: 'proxy health' or 'proxy health <pool-id>'
- Verify backend is reachable: 'dns test <backend-hostname>'
- Wrong health check type: TCP check passes but HTTP check fails (wrong path/status)
- gRPC health check requires grpc.health.v1.Health service on backend
- HTTP/3 health check failing: QUIC/UDP may be blocked by firewall
- Database checks (mysql/postgresql/redis): verify protocol handshake, not auth
- Threshold too strict: raise unhealthy_threshold or increase interval

Circuit breaker tripped unexpectedly:

- Check circuit state: 'proxy circuits'
- Review trip expression: expression variables are in milliseconds for latency
- min_samples too low: brief error bursts trip the circuit prematurely
- error_window too short: few samples in the window, so transient errors dominate the error ratio
- gRPC pitfall: HTTP status is always 200; circuit uses gRPC status codes
(codes 4,8,13,14,15 = server error). Check grpc-status trailers.
- Per-protocol circuit: HTTP/3 circuit may be open while HTTP/2 works fine;
check per-protocol states via 'proxy circuits'
- Manual reset: 'proxy reset <pool-id> <backend-id>'
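
The gRPC pitfall above comes down to reading the grpc-status trailer instead of the HTTP status. A minimal sketch of that classification follows; the function name and the handling of a missing or malformed trailer are assumptions, while the code set is the one listed above (4 DEADLINE_EXCEEDED, 8 RESOURCE_EXHAUSTED, 13 INTERNAL, 14 UNAVAILABLE, 15 DATA_LOSS).

```python
GRPC_SERVER_ERROR_CODES = {4, 8, 13, 14, 15}


def is_grpc_server_error(trailers):
    """trailers: mapping of lowercase trailer names to string values."""
    status = trailers.get("grpc-status")
    if status is None:
        return False  # assumption: a missing trailer is not counted as an error
    try:
        return int(status) in GRPC_SERVER_ERROR_CODES
    except ValueError:
        return False  # assumption: malformed values are ignored
```

A response with HTTP 200 and grpc-status 14 therefore counts toward error_ratio, while grpc-status 5 (NOT_FOUND) does not.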

Uneven traffic distribution:

- Weighted strategy: verify lb_weights array matches service array length
- Least-connections: requires ConnectionOpened/ConnectionClosed tracking;
check 'proxy backends <pool-id>' for connection counts
- Hash/Maglev: same hash key always routes to same backend (by design);
verify lb_hash_key is sufficiently varied across requests
- Maglev table imbalance: small backend count can cause uneven distribution;
increase table size or add more backends

All backends ejected (outlier detection too aggressive):

- Check outlier state: 'proxy outliers' or 'proxy outliers <pool-id>'
- max_ejection_percent too high: set to 10-33% to always keep backends active
- consecutive_5xx threshold too low: increase from 5 to 10+ for noisy backends
- success_rate_stdev_factor too low: increase from 1.9 to 2.5+ for high variance
- Manual re-admit: 'proxy uneject <pool-id> <backend-id>'
- Ejection backoff: duration doubles on each re-ejection (base_ejection_time * 2^(ejection_count - 1)),
capped at max_ejection_time

DNS discovery not updating backends:

- Check discovery state: 'proxy pools <pool-id>' shows discovery config
- DNS resolution failing: 'dns test <service-hostname>'
- DNSSEC validation failing on unsigned zone: use dns_mode="system"
- Custom resolvers unreachable: check dns_resolvers list
- Exponential backoff active: after DNS failures, refresh backs off up to 5 minutes
- Force immediate refresh via programmatic API (RefreshDiscovery operation)

Connection pool exhaustion (high latency, timeouts):

- Check pool stats: 'connpool stats' and 'connpool pools'
- Backend slow to respond: connections pile up, least-conn skews
- Circuit breaker not tripping: error_window may be too long to catch slow failures

Rate limit hitting unexpectedly:

- Per-user vs per-pool: per-pool limit shared across all users
- Cluster-wide counting: nodes may have slight count drift due to replication lag
- Burst exhausted: burst tokens consumed, waiting for window refill
- Check: 'proxy traffic <app-name>' for request rate metrics

Cross-subsystem diagnostics:

- Full domain diagnostic: 'diagnose domain <hostname>'
- Full user access diagnostic: 'diagnose user <username>'

Architecture

Cluster state management and design:

State categories and retention:

Pool configurations: retained for 24h, refreshed on update
Backend health states: retained for 30s, updated by health checker
Circuit breaker states: retained for 24h
Outlier detection states: retained for 24h per backend
Connection counts: retained for 60s per backend
Rate limit counters: retained for window duration + 1 minute
Backend metrics: retained for circuit breaker decision window

Read/write model:

Reads (e.g., backend selection, health status, circuit state) are served
from local in-memory cache for near-instant response.
Writes (e.g., pool creation, health updates, circuit trips) are replicated
to all nodes with eventual consistency.

Load balancing algorithm details:

adaptive: Epsilon-greedy selection that learns from request outcomes. Starts with
round-robin (learning phase, first 50 requests), then exploits the best-scoring
backend 95% of the time and explores randomly 5%. Scores combine success rate,
latency (EMA), timeout rate, and consecutive failures with decay. Default strategy.
round_robin: Sequential rotation across backends. O(1) per selection.
weighted: Earliest Deadline First (EDF) scheduling. Higher weight = smaller deadline
increment = selected more often. Smoother than GCD-based approaches (interleaved,
not batched). O(log n) per selection.
least_conn: Power of Two Choices (P2C) — picks 2 random backends, selects the one
with fewer connections. O(1) complexity with near-optimal distribution. Weighted:
effective_connections = actual_connections * 1000 / weight.
hash: Consistent hashing with xxhash. Same key always routes to same backend unless
pool membership changes. O(1) per selection.
maglev: Google Maglev consistent hashing. Table size default 65537 (prime). Better
distribution than ring hash with minimal disruption on backend changes. O(1) lookup
after O(n * table_size) table build.
random: Uniform random selection. O(1) per selection.
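
The weighted strategy's EDF scheduling can be sketched with a min-heap of deadlines. This is an illustrative version under assumptions, not the module's code: the class name is hypothetical, and exact fractions are used so that ties resolve predictably in the example.

```python
import heapq
from fractions import Fraction


class EDFScheduler:
    """Earliest Deadline First picker: each pick costs a backend 1/weight."""

    def __init__(self, weights):
        # weights: mapping of backend -> positive integer weight.
        self._step = {b: Fraction(1, w) for b, w in weights.items()}
        # Heap entries are (deadline, insertion order, backend).
        self._heap = [(step, i, b) for i, (b, step) in enumerate(self._step.items())]
        heapq.heapify(self._heap)

    def pick(self):
        # Pop the earliest deadline, then push the backend back one step later.
        deadline, order, backend = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (deadline + self._step[backend], order, backend))
        return backend
```

With weights 5, 3, and 2, ten consecutive picks go to the three backends 5, 3, and 2 times respectively, and the heavier backend's picks are spread through the sequence rather than issued as a batch. Each pick is one heap pop and one push, which matches the O(log n) cost noted above.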

Circuit breaker state machine:

closed --> open: Trip conditions met (after min_samples within error_window)
open --> half_open: fallback_duration elapsed
half_open --> closed: success_threshold consecutive successes
half_open --> open: Any failure
Per-protocol mode: independent state machines for http3, http2, http1, tcp.
Automatic fallback chain: http3 -> http2 -> http1 (configurable via fallback_protocol).
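
The transitions above map onto a small state machine. The sketch below is a simplified single-backend, single-protocol version with hypothetical names; evaluating the trip conditions (thresholds or trip_expression) is out of scope and represented by an explicit trip() call, and time is passed in as plain seconds.

```python
class CircuitBreaker:
    """closed -> open -> half_open -> closed, as in the transition list."""

    def __init__(self, fallback_duration=30.0, success_threshold=3):
        self.fallback_duration = fallback_duration
        self.success_threshold = success_threshold
        self.state = "closed"
        self.opened_at = None
        self.half_open_successes = 0

    def trip(self, now):
        """Call when trip conditions are met (evaluation itself is not shown)."""
        self.state, self.opened_at, self.half_open_successes = "open", now, 0

    def allow_request(self, now):
        # open -> half_open once fallback_duration has elapsed.
        if self.state == "open" and now - self.opened_at >= self.fallback_duration:
            self.state = "half_open"
            self.half_open_successes = 0
        return self.state != "open"

    def record(self, ok, now):
        if self.state != "half_open":
            return
        if not ok:
            self.trip(now)  # any failure in half-open reopens the circuit
            return
        self.half_open_successes += 1
        if self.half_open_successes >= self.success_threshold:
            self.state = "closed"
```

In per-protocol mode, one such state machine would exist per protocol for each backend.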

Outlier detection mechanisms:

1. Consecutive failures (immediate): eject after N consecutive 5xx / gateway / local errors
2. Success rate (statistical): eject if success_rate < (avg_success_rate - stdev_factor * stdev), computed across the pool's eligible backends
3. Failure percentage (threshold): eject if failure% > threshold
Ejection backoff: base_ejection_time * 2^(ejection_count - 1), capped at max_ejection_time.
Re-admission: automatic after ejection duration expires; counters reset.
Jitter: ejection_jitter_pct random jitter on re-admission to prevent thundering herd.
Safety: max_ejection_percent prevents ejecting all backends simultaneously.
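
The success-rate rule and the ejection backoff can be worked through in a few lines. This is a sketch under assumptions: the function names are hypothetical, the population standard deviation is assumed (the source does not say which variant is used), and jitter, min-host checks, and the max_ejection_percent cap are omitted.

```python
import statistics


def success_rate_outliers(rates, stdev_factor=1.9):
    """rates: backend -> success rate in percent. Returns backends below the threshold."""
    avg = statistics.fmean(rates.values())
    stdev = statistics.pstdev(rates.values())  # population stdev (assumed)
    threshold = avg - stdev_factor * stdev
    return sorted(b for b, r in rates.items() if r < threshold)


def ejection_duration(ejection_count, base=30.0, max_time=300.0):
    """base_ejection_time * 2^(ejection_count - 1), capped at max_ejection_time."""
    return min(base * 2 ** (ejection_count - 1), max_time)
```

With the defaults, successive ejections of the same backend last 30s, 60s, 120s, 240s, and then stay at the 300s cap.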

Health check architecture:

Each node independently runs health checks for its configured pools.
Results are replicated to all nodes -- the most recent check from any node is used.
Threshold logic: unhealthy_threshold consecutive failures to mark down;
healthy_threshold consecutive successes to mark up.
HTTP/3 checks: quic-go with connection pooling and idle connection cleanup.
gRPC checks: native grpc.health.v1.Health/Check RPC with connection pooling.
Database checks: protocol handshake only (MySQL initial packet, PostgreSQL SSLRequest,
Redis PING/PONG). No authentication or query execution.
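
The threshold logic is a pair of consecutive-result counters. The sketch below uses hypothetical names and assumes a backend starts out healthy, which the source does not state.

```python
class HealthTracker:
    """Flip health state only after N consecutive results in the other direction."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True  # assumed initial state
        self.consecutive_ok = 0
        self.consecutive_fails = 0

    def observe(self, ok):
        """Record one check result; return True if the health state flipped."""
        if ok:
            self.consecutive_ok += 1
            self.consecutive_fails = 0
            if not self.healthy and self.consecutive_ok >= self.healthy_threshold:
                self.healthy = True
                return True
        else:
            self.consecutive_fails += 1
            self.consecutive_ok = 0
            if self.healthy and self.consecutive_fails >= self.unhealthy_threshold:
                self.healthy = False
                return True
        return False
```

A single success between failures resets the failure counter, so a backend that fails intermittently needs unhealthy_threshold failures in a row before it is marked down.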

DNS discovery lifecycle:

1. Periodically resolves hostname to IP addresses (refresh interval)
2. New IPs: automatically added as backends to pool
3. Removed IPs: automatically removed from pool
4. DNS failure: exponential backoff up to 5 minutes
5. Modes: internal (Hexon DNS with DNSSEC), system (OS resolver), custom (direct resolvers)
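
Steps 2 to 4 amount to a set difference plus a capped backoff. In the sketch below the function names are hypothetical, and the doubling factor and the use of the refresh interval as the backoff base are assumptions; the source only states that backoff grows exponentially up to 5 minutes.

```python
def diff_backends(current_ips, resolved_ips):
    """Return (to_add, to_remove) given the pool's IPs and a fresh DNS answer."""
    current, resolved = set(current_ips), set(resolved_ips)
    return sorted(resolved - current), sorted(current - resolved)


def next_refresh_delay(refresh_s, consecutive_failures, max_backoff_s=300.0):
    """Normal interval on success; doubled per failure (assumed), capped at 5 minutes."""
    if consecutive_failures == 0:
        return refresh_s
    return min(refresh_s * 2 ** consecutive_failures, max_backoff_s)
```

For example, if the pool holds 10.0.0.1 and 10.0.0.2 and DNS now returns 10.0.0.2 and 10.0.0.3, then 10.0.0.3 is added and 10.0.0.1 is removed.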

Interpreting tool output:

'proxy pools':
  Healthy: All pools show ActiveBackends = TotalBackends, Strategy listed
  Degraded: ActiveBackends < TotalBackends (some backends ejected or unhealthy)
  Action: Degraded → 'proxy health' for per-backend status, 'proxy outliers' for ejections
'proxy health' (per-pool):
  Healthy: All backends Status=healthy, consecutive failures=0
  Unhealthy: Status=unhealthy with failure reason (connection_refused, timeout, http_error)
  Action: Unhealthy → check backend directly with 'net tcp <host:port>' or 'net http <url>'
'proxy circuits':
  Closed: Normal operation, requests flowing to backend
  Half-open: Testing recovery, limited requests allowed through; do NOT reset manually
  Open: Tripped, backend failing; shows TripCondition and ErrorRate
  Action: Open → fix backend root cause first, then 'proxy reset <pool> <backend>' to clear
'proxy outliers':
  Normal: No ejected backends
  Ejected: Backend removed from rotation; shows EjectionTime and FailureRate
  Action: Fix backend, then 'proxy uneject <pool> <backend>' to re-admit

Logs

Log entries by component. Search with: logs search “loadbalancer”. All entries use log name “loadbalancer”. Levels: ERROR > WARN > INFO > DEBUG. No AUDIT entries in this module.

Pool Management:

loadbalancer INFO pool created (strategy, backends count, health_check/circuit_breaker/outlier_detection enabled)
loadbalancer INFO pool deleted
loadbalancer INFO pool updated
loadbalancer INFO backend added (pool_id, backend_id, address)
loadbalancer INFO backend removed (pool_id, backend_id)
loadbalancer INFO backend draining (pool_id, backend_id)
loadbalancer WARN failed to initialize circuit state (pool_id, backend_id, error)
loadbalancer WARN failed to initialize outlier state (pool_id, backend_id, error)
loadbalancer ERROR failed to check pool existence (pool_id, error)
loadbalancer ERROR failed to store pool config (pool_id, error)
loadbalancer ERROR failed to get pool for deletion (pool_id, error)
loadbalancer ERROR failed to delete pool (pool_id, error)
loadbalancer ERROR failed to get pool (pool_id, error)
loadbalancer ERROR failed to list pools (error)
loadbalancer ERROR failed to get pool for update (pool_id, error)
loadbalancer ERROR failed to update pool (pool_id, error)
loadbalancer ERROR failed to get pool for add backend (pool_id, error)
loadbalancer ERROR failed to add backend (pool_id, backend_id, error)
loadbalancer ERROR failed to get pool for remove backend (pool_id, error)
loadbalancer ERROR failed to remove backend (pool_id, backend_id, error)
loadbalancer ERROR failed to get health state for drain (pool_id, backend_id, error)
loadbalancer ERROR failed to update health state for drain (pool_id, backend_id, error)

Backend Selection:

loadbalancer DEBUG backends excluded from selection (pool_id, total_backends, healthy_backends, excluded)
loadbalancer DEBUG backend selected (pool_id, backend_id, strategy, healthy_backends, latency)

Health Checks:

loadbalancer INFO backend health state changed — unhealthy to healthy (pool_id, backend_id, consecutive_ok, latency)
loadbalancer WARN backend health state changed — healthy to unhealthy (pool_id, backend_id, consecutive_fails, error)
loadbalancer DEBUG health check passed (pool_id, backend_id, consecutive_ok, latency)
loadbalancer DEBUG health check failed (pool_id, backend_id, consecutive_fails, error)
loadbalancer ERROR failed to store health state (pool_id, backend_id, error)

Circuit Breaker:

loadbalancer INFO circuit breaker state changed (pool_id, backend_id, from_state, to_state, error_ratio)
loadbalancer INFO per-protocol circuit breaker state changed (pool_id, backend_id, protocol, from_state, to_state, error_ratio)
loadbalancer INFO circuit breaker reset (pool_id, backend_id, reset_by)
loadbalancer WARN circuit breaker expression compilation failed (expression, error)
loadbalancer WARN circuit breaker expression evaluation failed (expression, error)
loadbalancer DEBUG circuit breaker threshold evaluation (combine_mode, conditions_met, error_ratio, error_threshold, p95_latency_ms, latency_threshold_ms, network_error_ratio, network_threshold)
loadbalancer ERROR failed to store circuit state (pool_id, backend_id, error)
loadbalancer ERROR failed to reset circuit (pool_id, backend_id, error)

Connection Tracking:

loadbalancer ERROR failed to update connection count (pool_id, backend_id, error)

Rate Limiting:

loadbalancer DEBUG rate limit exceeded (pool_id, user_id, limit, current_count, cost, retry_after)
loadbalancer ERROR failed to update rate limit state (pool_id, key, error)

Outlier Detection:

loadbalancer INFO backend ejected due to outlier detection (pool_id, backend_id, reason, ejection_count, duration, re_admit_at)
loadbalancer INFO backend re-admitted after ejection period (pool_id, backend_id, total_ejections)
loadbalancer INFO backend manually un-ejected (pool_id, backend_id)
loadbalancer DEBUG outlier success rate analysis (pool_id, eligible_backends, avg_success_rate, stdev, threshold, stdev_factor)
loadbalancer DEBUG outlier failure percentage analysis (pool_id, eligible_backends, threshold, ejected_count, max_ejectable)
loadbalancer ERROR failed to save outlier state (pool_id, backend_id, error)
loadbalancer ERROR failed to save outlier state on re-admission (pool_id, backend_id, error)
loadbalancer ERROR failed to reset outlier interval stats (pool_id, backend_id, error)

DNS Discovery:

loadbalancer INFO DNS discovery enabled (pool_id, hostname, refresh)
loadbalancer INFO DNS discovery disabled (pool_id)
loadbalancer INFO DNS discovery updated backends (pool_id, hostname, total_ips, added, removed)
loadbalancer WARN DNS discovery resolution failed (pool_id, hostname, error)
loadbalancer WARN DNS discovery returned no IPs (pool_id, hostname)
loadbalancer WARN failed to add discovered backend (pool_id, ip, error)
loadbalancer WARN failed to remove discovered backend (pool_id, ip, error)

Metrics

Prometheus metrics. Query with: metrics prometheus loadbalancer_<name>

Pool Lifecycle:

loadbalancer_pools_created counter {strategy} Pools created
loadbalancer_pools_deleted counter {} Pools deleted

Backend Selection:

loadbalancer_selects counter {pool_id, strategy} Successful backend selections
loadbalancer_select_failures counter {pool_id, reason} Failed selections (reason: pool_not_found|no_healthy_backends|algorithm_returned_nil)
loadbalancer_select_latency latency {pool_id} Backend selection duration

Health Checks:

loadbalancer_health_checks counter {pool_id, healthy} Health check executions (healthy: true|false)

Circuit Breaker:

loadbalancer_circuit_state_changes counter {pool_id, backend_id, from_state, to_state} Circuit state transitions (optionally includes protocol label in per-protocol mode)
loadbalancer_circuit_resets counter {pool_id, backend_id} Manual circuit resets

Connections:

loadbalancer_connections_opened counter {pool_id, backend_id} Connections opened
loadbalancer_connections_closed counter {pool_id, backend_id} Connections closed
loadbalancer_active_connections gauge {pool_id, backend_id} Current active connections
loadbalancer_connection_duration latency {pool_id, backend_id} Connection duration
loadbalancer_bytes_sent counter {pool_id, backend_id} Bytes sent to backends
loadbalancer_bytes_recv counter {pool_id, backend_id} Bytes received from backends

Rate Limiting:

loadbalancer_rate_limit_allowed counter {pool_id} Requests allowed by rate limiter
loadbalancer_rate_limit_denied counter {pool_id} Requests denied by rate limiter
Note: metrics aggregate at pool level. For per-user denial details when
rate_limit_per_user = true, use: logs search "rate limit exceeded" (includes user_id)

Outlier Detection:

loadbalancer_outlier_ejections counter {pool_id, backend_id, reason} Backends ejected (reason: consecutive_5xx|consecutive_gateway|consecutive_local|success_rate|failure_percentage)
loadbalancer_outlier_readmissions counter {pool_id, backend_id} Backends auto-readmitted after ejection period
loadbalancer_outlier_manual_uneject counter {pool_id, backend_id} Backends manually un-ejected

DNS Discovery:

loadbalancer_dns_discovery_failures counter {pool_id, hostname} DNS resolution failures
loadbalancer_dns_discovery_updates counter {pool_id, hostname} Backend set updates from DNS

Alerts:

rate(loadbalancer_select_failures{reason="no_healthy_backends"}[5m]) > 0    All backends down: check health and outlier state
rate(loadbalancer_circuit_state_changes{to_state="open"}[5m]) > 0           Circuit opened: backend degradation
rate(loadbalancer_outlier_ejections[5m]) > 5                                High ejection rate: systemic backend issues
rate(loadbalancer_rate_limit_denied[5m]) > 50                               High rate limit denial: check capacity or limits
rate(loadbalancer_dns_discovery_failures[5m]) > 0                           DNS discovery failing: check DNS config

Relationships

Module dependencies and interactions:

  • proxy: Primary consumer. Creates and manages LB pools automatically when
[[proxy.mapping]] has multiple backends in the service array. Selects backends on
every request, tracks connections, records results for circuit breaker and outlier
detection. All LB configuration flows through proxy mapping sub-tables.
  • distributed cache: State storage backend. All pool configs, health states, circuit breaker states, outlier states, connection counts, rate limit counters, and backend stats are stored in the cluster-wide cache with appropriate TTLs.
  • dns: Backend hostname resolution for DNS-based service discovery. Supports three modes: internal (Hexon DNS module with DNSSEC), system (OS resolver), custom (direct resolvers). Per-pool DNS configuration.
  • sessions: Session affinity for hash-based algorithms. Cookie-based hash keys read session cookies. JA4/JA4H fingerprint routing uses TLS fingerprint from session.
  • connection_pool: Backend HTTP connection management. Circuit breaker integration prevents new connections to tripped backends. Connection counts feed least_conn algorithm decisions.
  • certificates: TLS for backend connections. Health checks honor TLS configuration (tls_skip_verify). HTTP/3 health checks require valid QUIC/TLS setup.

Reverse Proxy

Routes HTTP traffic to backends with authentication, load balancing, identity headers, and circuit breaking

Overview

Routes HTTP requests to backend applications with authentication, group-based authorization, and signed identity headers. Replaces separate reverse proxy, SSO, and load balancer products with one integrated service. Every proxied request carries the user’s identity — backends verify it without implementing auth.

Capabilities:

  • Host-based and path-based routing with 3-tier hybrid matcher (exact → prefix → regex)
  • Per-route OIDC SSO authentication with cross-domain cookie support
  • Group-based authorization (OR semantics — user needs any one listed group)
  • Identity header injection (X-Hexon-User, X-Hexon-Mail, X-Hexon-Name, X-Hexon-Groups)
  • Ed25519 header signing and optional full request signing for backend verification
  • Response header URL rewriting (Link, Content-Location, Refresh) and HTML body rewriting
  • JavaScript interceptor injection for dynamic URL rewrites (fetch, XHR, window.open)
  • Logout toolbar injection for authenticated routes (draggable, shows user + app name)
  • WebSocket and gRPC support, HTTP/3 (QUIC) backend connections
  • Zero-copy streaming mode for API routes (rewrite_host=false, saves 8-15ms, 4x throughput)
  • Zstandard, Brotli, gzip response compression (negotiated via Accept-Encoding)
  • Speculation Rules API injection for prefetch/prerender (Chromium 121+, per-mapping eagerness)
  • Per-mapping mTLS, CIDR subnet restriction, HTTP method filtering
  • Multi-backend load balancing (adaptive, round-robin, weighted, least-conn, consistent hash, Maglev, random)
  • Circuit breakers with expression-based trip conditions and outlier detection
  • Native health checks (TCP, HTTP, HTTP/3, gRPC)
  • Hot-reload of routes, backends, auth rules without restart (atomic, +2ns overhead)
  • Landing page listing all accessible apps filtered by user groups (folder/tag grouping)
  • PROXY protocol v1/v2 support for preserving client IP through L4 load balancers
  • Per-mapping protection overrides (rate limit, size limit bypass, per-user rate limits)
  • Per-user rate limit response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset (on every response), Retry-After (on 429 only)
  • Weighted canary traffic splitting with sticky routing and bypass groups
  • Automatic retry with budget (prevents retry storms) and backend exclusion
  • Hedged requests for tail latency reduction (parallel speculative requests)
  • Custom response headers with three-state logic (set/strip/inherit)
  • Cookie domain rewriting for SSO across subdomains (RFC 6265 compliant)
  • Request shadowing/mirroring for testing (async fire-and-forget, per-route sampling)
  • JWT Bearer token verification cache (sessions module, SHA256 of token as session ID, configurable TTL)
  • Personal Access Token (PAT) authentication via Bearer header with session validation and IP enforcement

Config

Core configuration under [proxy] and [[proxy.mapping]]:

[proxy]
enabled = true # Enable reverse proxy service
hostname = "apps.hexon.es" # Landing page hostname (optional; if unset, serves on service.hostname)
signing_enabled = true # Ed25519 header signing (default: true if cluster_key set)
signing_rotation = "15m" # Key rotation interval (HKDF-SHA256 from cluster_key)
brotli_support = true # Decompress/reencode Brotli for URL rewriting
group_refresh_interval = "15m" # Background session group membership refresh (0 to disable)
bearer_cache_ttl = "5m" # JWT Bearer token verification cache TTL (0 to disable)
gzip = true # Enable gzip compression
headers = {} # Global response header overrides (three-state: value/"-"/empty)
header_user = "X-Hexon-User" # Identity header name overrides (rarely changed)
header_mail = "X-Hexon-Mail"
header_name = "X-Hexon-Name"
header_groups = "X-Hexon-Groups"
[[proxy.mapping]]
app = "Name" # Display name (shown in landing page and toolbar)
host = "app.hexon.es" # Hostname for routing (SNI matching)
path = "^/.*" # Path regex (auto-classified into matching tiers)
service = "https://backend:8080" # Backend URL(s) — string or array for load balancing
auth = true # Require authentication
groups = ["users"] # Authorized groups (OR logic, empty = any authenticated user)
bypass_auth_cidrs = ["10.0.0.0/8"] # CIDR ranges that skip auth + JIT-2FA (straight to backend)
add_auth_headers = true # Inject X-Hexon-* identity headers
add_bearer = true # Inject signed JWT Bearer token to backend (SSO via OIDC)
allow_upgrade = true # WebSocket upgrade support
rewrite_host = true # HTML URL rewriting (default: true; false = zero-copy mode)
inject_toolbar = true # Logout toolbar (default: true when auth=true)
rewrite_hosts = [["backend.com","app.hexon.es"]] # Multi-domain URL mapping pairs
priority = 500 # Route priority (auto 0-1000, manual >1000 overrides)
cert = "/path/to/cert.pem" # Per-mapping TLS certificate
key = "/path/to/key.pem" # Per-mapping TLS private key
tls_check = true # Verify backend TLS certificate
mtls = false # Require client certificate (default: false)
allowed_subnets = ["10.0.0.0/8"] # CIDR subnet restriction (OR logic, 403 if no match)
allowed_methods = ["GET","POST"] # HTTP method filter (empty = all allowed, 405 if no match)
audience = "custom" # Custom audience for header signing (default: mapping app name)
sign_request = false # Full request signing (method, path, query, body)
sign_request_max_body = "10MB" # Max body size to hash (default 10MB; "0" = skip body hash)
bearer_cache_ttl = "5m" # Per-mapping JWT cache TTL override (inherits from global)
oidc_providers = ["internal"] # OIDC provider(s) for authentication
dnssec = true # Per-route DNSSEC override
dns_resolvers = ["10.0.0.1:53"] # Per-route DNS resolver override
brotli_support = true # Per-route Brotli override (falls back to global)
permissions_policy = "..." # Browser Permissions-Policy header
referrer_policy = "..." # Browser Referrer-Policy header
csp_header = "..." # Content-Security-Policy header
headers = {} # Per-route response headers (completely replaces global)
disable_rate_limit = false # Bypass rate limiting for this route
rate_limit = "200/1m" # Custom rate limit
rate_limit_per_user = false # Per-user rate limits (cluster-wide, requires auth=true)
disable_size_limit = false # Bypass size limiting
max_bytes = "100MB" # Custom max body size
[proxy.mapping.canary] # Weighted traffic splitting
enabled = false
sticky = true # Session-pinned: version stored in session/cookie
sticky_key = "user" # Initial selection hash: "user" | "fingerprint" | "ip"
header = "X-Hexon-Version" # Inject version label header (optional)
bypass_groups = ["qa"] # Groups always routed to canary (simple) or first version (multi)
site = "prod-eu-b2c4d1" # Connector site for canary (overrides mapping-level site, optional)
# Simple mode (two versions: stable + canary):
service = ["https://canary:8080"]
weight = 10 # % to canary (0-100), rest goes to stable
label = "v2.1.0" # Prometheus {version} label for canary
# Multi-version mode (A/B/C routing — replaces service/weight/label above):
[[proxy.mapping.canary.versions]]
service = ["https://v1:8080"]
weight = 70 # Weights must sum to 100
label = "v1-stable"
site = "prod-us-a1b2c3" # Per-version connector site (overrides canary-level site, optional)
[[proxy.mapping.canary.versions]]
service = ["https://v2:8080"]
weight = 20
label = "v2-beta"
[[proxy.mapping.canary.versions]]
service = ["https://v3:8080"]
weight = 10
label = "v3-alpha"
[proxy.mapping.retry] # Automatic retry with budget
enabled = false
max_attempts = 3 # Total attempts (1-10)
retry_on = ["5xx", "connect-failure", "reset"] # Also: "retriable-4xx" (429 only)
retriable_methods = ["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]
backoff_base = "50ms"
backoff_max = "1s"
backoff_jitter = true # Full jitter: random in [0, computed backoff]
budget_ratio = 0.10 # Max retries = 10% of requests in window
budget_window = "10s"
budget_min_retries = 3 # Always allow N retries regardless of ratio
max_body_size = "1MB" # Bodies larger than this skip retry
[proxy.mapping.hedge] # Hedged requests for tail latency
enabled = false
delay = "100ms" # Calibrate to route's p99 — too low doubles load, too high adds no value
max_hedges = 1 # 1-3: each fires to a different backend after delay

IMPORTANT: Retry and hedge amplify backend load. rate_limit is enforced once per client request (before retry/hedge), so retries and hedges do NOT consume additional rate-limit tokens. Example: retry.max_attempts=3 with hedge.max_hedges=2 allows up to 3 attempts × (1 primary + 2 hedges) = 9 backend requests per single client request. Plan backend capacity accordingly.
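
The worst case in the example above comes from multiplying the number of attempts by the requests each attempt can send. A minimal sketch of that arithmetic follows; the function name is illustrative and not part of the product.

```python
def worst_case_backend_requests(max_attempts: int, max_hedges: int = 0) -> int:
    """Upper bound on backend requests caused by one client request.

    Each attempt sends one primary request plus up to max_hedges hedged copies.
    max_hedges=0 models a route with hedging disabled.
    """
    if not 1 <= max_attempts <= 10:
        raise ValueError("max_attempts must be in 1-10")
    if not 0 <= max_hedges <= 3:
        raise ValueError("max_hedges must be in 0-3")
    return max_attempts * (1 + max_hedges)
```

With retry.max_attempts=3 and hedge.max_hedges=2 this returns 9, matching the example.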

forward_request_headers = false # Forward Authorization header to backend
forward_response_headers = false # Forward WWW-Authenticate header from backend
folder = "Category" # Landing page folder grouping
tags = ["tag1"] # Landing page tags for filtering
display = true # Show in portal/access list (default: true, set false for API-only)

DNS: Centralized in [dns] section. Proxy uses DNS module by default ([proxy.dns] use_cluster=true). Per-route overrides via dnssec and dns_resolvers fields for backends with special DNS needs (e.g., internal backends with internal DNS, or backends in unsigned DNS zones).

Load balancing: service can be an array of URLs. Configure algorithm and weights via lb_strategy and lb_weights fields. Default strategy is adaptive (epsilon-greedy).
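
To make the epsilon-greedy idea concrete, here is a toy selector. It is only a sketch: the exploration rate, the latency-only scoring, and all names are assumptions, and the real adaptive strategy also weighs error rates.

```python
import random


class EpsilonGreedyPicker:
    """Toy epsilon-greedy backend selector (illustrative, not the product's code)."""

    def __init__(self, backends, epsilon=0.1, rng=None):
        self.epsilon = epsilon                 # assumed exploration rate
        self.rng = rng or random.Random()
        # Per-backend running stats: sample count and mean latency in seconds.
        self.stats = {b: {"n": 0, "mean_latency": 0.0} for b in backends}

    def pick(self):
        backends = list(self.stats)
        untried = [b for b in backends if self.stats[b]["n"] == 0]
        if untried:
            return untried[0]                  # measure every backend at least once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(backends)   # explore: probe any backend
        # Exploit: the lowest observed mean latency wins.
        return min(backends, key=lambda b: self.stats[b]["mean_latency"])

    def record(self, backend, latency_s):
        """Fold one latency sample into the backend's running mean."""
        s = self.stats[backend]
        s["n"] += 1
        s["mean_latency"] += (latency_s - s["mean_latency"]) / s["n"]
```

The occasional random pick is what lets a recovered backend win traffic back after its measurements improve.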

Circuit breaker: [proxy.mapping.circuit_breaker] with trip expression and recovery settings.

Health checks: all mappings get HTTP health checks by default (any non-5xx = healthy, 15s interval). Override globally via [proxy.default_health_check] or per-mapping via [proxy.mapping.health_check]. 4 active check types: tcp, http, http3, grpc. expected_status is an array (e.g. [200, 302]); empty means any non-5xx response is healthy. Health check path is derived from the mapping’s route path when using defaults. Health state is shared cluster-wide; unhealthy backends are removed from rotation until they recover.

Connection pool: [connection_pool.http] for global pool settings (max_connections, adaptive_scaling).
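
The expected_status rule for HTTP health checks fits in a few lines. This is a sketch of the documented behavior only; the function name is hypothetical.

```python
from typing import Sequence


def is_healthy(status_code: int, expected_status: Sequence[int] = ()) -> bool:
    """Apply the expected_status rule to one health check response."""
    if expected_status:
        # Explicit list: only the listed codes count as healthy.
        return status_code in expected_status
    # Empty list: any non-5xx response is healthy.
    return status_code < 500
```

Note that with the default (empty) list a 404 from the health check path still counts as healthy, while an explicit [200, 302] would mark it unhealthy.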

Hot-reloadable: routes, backends, auth rules, paths, rewrite rules, protection overrides, identity header names, per-route DNS, certificates. Cold (restart required): proxy.enabled, global connection pool settings, DNS module config, cache.

Security

Identity Headers (when add_auth_headers=true):

X-Hexon-User: Username
X-Hexon-Mail: Email address
X-Hexon-Name: Full name (display name)
X-Hexon-Groups: Comma-separated group list (e.g. "users,admins,developers")

Groups are fetched fresh from directory on every request (not cached), ensuring immediate enforcement of group changes without re-authentication. Backends can trust these headers for SSO without implementing their own authentication.

Header Signing — Ed25519 (enabled by default when cluster_key is set):

Additional headers injected when signing is enabled:
X-Hexon-Audience: Route audience string. Precedence: explicit audience field on the
mapping, then mapping app name, then service URL fallback. Default
is the app name — stable across deployments and unambiguous when the
service array contains multiple backends.
X-Hexon-Timestamp: Unix epoch seconds when signature was created
X-Hexon-Request-Id: Unique request correlation ID for tracing
X-Hexon-Signature: Signature in format: v2.{timestamp}.{base64_ed25519}
Signed payload (pipe-delimited, 7 fields):
{timestamp}|{request_id}|{audience}|{user}|{email}|{name}|{groups}
IMPORTANT: The groups field (last) may itself contain pipe characters.
Backends MUST parse with SplitN(payload, "|", 7) — NOT Split(payload, "|").
Why Ed25519 instead of HMAC:
- HMAC requires sharing the secret key with verifiers, enabling forgery
- Ed25519 distributes only the PUBLIC key — backends can verify but NOT forge
- The private key never leaves the Hexon cluster
Key derivation: HKDF-SHA256 (RFC 5869) from cluster_key with a versioned
domain-specific salt. Rotates every signing_rotation (default 15m).
Current + previous keypair kept in memory for rotation boundary handling.
Backend Verification — Option 1: Delegated (simple, recommended for most backends):
POST /.well-known/header-signing.verify
Content-Type: application/json
Request body fields: signature, timestamp, request_id, audience, user, email, name, groups
Example:
{"signature":"v2.1732800000.base64ed25519==","timestamp":1732800000,
"request_id":"abc123","audience":"https://backend:8080",
"user":"jdoe","email":"jdoe@example.com","name":"John Doe","groups":"admin,users"}
Responses:
200 OK: {"valid": true}
401 Unauthorized: {"valid": false, "error": "signature mismatch"}
400 Bad Request: {"valid": false, "error": "missing required field: signature"}
503 Service Unavailable: {"valid": false, "error": "signing not enabled"}
Benefits: zero crypto code in backend, automatic key rotation handling,
works with nginx auth_request directive.
Backend Verification — Option 2: Direct (fast, recommended for high-throughput):
GET /.well-known/header-signing.key?t={timestamp}
Response (200 OK):
{"public_key":"base64_32_byte_key","valid_from":1732800000,"valid_until":1732800900}
Verify Ed25519 signature locally:
1. Parse X-Hexon-Signature: split by "." → [version, timestamp, signature_base64]
2. Verify version is "v2", decode signature (64 bytes)
3. Check X-Hexon-Audience matches expected audience
4. Check timestamp is within 30 seconds of current time
5. Fetch public key from /.well-known/header-signing.key?t={timestamp}
6. Reconstruct payload: timestamp|request_id|audience|user|email|name|groups
7. ed25519.Verify(publicKey, payload, signature) → true/false
Public key is safe to cache (32 bytes, cannot create signatures — only verify).
Clock synchronization requirements:
- NTP required on all nodes (chrony or systemd-timesyncd)
- Clock drift should be <1 second for reliable operation
- Verification allows 30-second tolerance for network delays
- Key rotation windows calculated from Unix epoch
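
The parsing and freshness steps of direct verification can be sketched as below. The Ed25519 check itself (step 7) needs a crypto library and is left out; the helper names and the UTF-8 payload encoding are assumptions, not part of the documented interface.

```python
import time

MAX_SKEW_SECONDS = 30  # tolerance stated in the verification steps


def parse_signature_header(value: str):
    """Split X-Hexon-Signature into (timestamp, base64 signature)."""
    parts = value.split(".", 2)      # at most 3 parts, like SplitN(value, ".", 3)
    if len(parts) != 3 or parts[0] != "v2":
        raise ValueError("unsupported signature format")
    return int(parts[1]), parts[2]


def build_payload(timestamp, request_id, audience, user, email, name, groups) -> bytes:
    """Rebuild the 7-field pipe-delimited payload from the received headers."""
    fields = [str(timestamp), request_id, audience, user, email, name, groups]
    return "|".join(fields).encode("utf-8")   # encoding is an assumption


def split_payload(payload: str):
    """Split into exactly 7 fields; pipes inside groups stay in the last field."""
    fields = payload.split("|", 6)   # 6 splits -> 7 fields, like SplitN(payload, "|", 7)
    if len(fields) != 7:
        raise ValueError("expected 7 fields")
    return fields


def timestamp_is_fresh(timestamp: int, now=None) -> bool:
    """True if the signature timestamp is within the 30-second tolerance."""
    current = time.time() if now is None else now
    return abs(current - timestamp) <= MAX_SKEW_SECONDS
```

The bytes returned by build_payload are what would be passed, together with the decoded 64-byte signature and the fetched public key, to an Ed25519 verify call.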

Request Signing — Ed25519 (optional, per-route sign_request=true):

Signs the entire HTTP request for end-to-end integrity verification.
Protects against: method tampering, host header attacks, path manipulation,
query injection, body tampering.
Header: X-Hexon-Request-Signature: v1|{timestamp}|{base64_ed25519}
Signed payload:
REQ|{timestamp}|{method}|{host}|{path}|{query_hash}|{body_hash}
Canonicalization rules:
Path: URL-decoded → dot-segments resolved → slashes collapsed → leading slash ensured
/api/../admin → /admin, /api//users → /api/users, /api/foo%2Fbar → /api/foo/bar
Query: parsed → sorted alphabetically by key → re-encoded with URL escaping
b=2&a=1 → a=1&b=2
Body: SHA256 hash (base64). Bodies over sign_request_max_body → "SKIPPED".
Empty body → hash of empty string (47DEQpj8HBSa...).
Set sign_request_max_body = "0" to always skip body hashing.
Verify via: POST /.well-known/request-signing.verify (same JSON format)
or GET /.well-known/request-signing.key?t={timestamp} (same keypair as header signing).
Header signing vs request signing:
Header signing: covers auth headers only, enabled by default, X-Hexon-Signature
Request signing: covers entire request, opt-in per route, X-Hexon-Request-Signature
Use both on sensitive routes (e.g., payment gateways) for maximum security.
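
A backend that verifies request signatures locally has to reproduce the canonicalization rules exactly. The sketch below follows the rules listed above under a few assumptions: trailing slashes are dropped, query escaping follows Python's urlencode, and the final hashing of the canonical query (query_hash) is not shown because its format is not specified here. Function names are hypothetical.

```python
import base64
import hashlib
from urllib.parse import parse_qsl, unquote, urlencode


def canonical_path(raw_path: str) -> str:
    """URL-decode, resolve dot-segments, collapse slashes, ensure leading slash."""
    segments = []
    for seg in unquote(raw_path).split("/"):
        if seg in ("", "."):
            continue                  # empty segment = duplicate slash
        if seg == "..":
            if segments:
                segments.pop()        # never climbs above the root
            continue
        segments.append(seg)
    return "/" + "/".join(segments)


def canonical_query(raw_query: str) -> str:
    """Parse, sort by key (stable for repeated keys), and re-encode."""
    pairs = parse_qsl(raw_query, keep_blank_values=True)
    return urlencode(sorted(pairs, key=lambda kv: kv[0]))


def body_hash(body: bytes, max_body_bytes: int) -> str:
    """Base64 SHA-256 of the body, or "SKIPPED" when hashing is disabled or too large."""
    if max_body_bytes == 0 or len(body) > max_body_bytes:
        return "SKIPPED"
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
```

For an empty body with the default limit, body_hash returns the base64 SHA-256 of the empty string, which begins with 47DEQpj8HBSa.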

Bearer Token Injection (when add_bearer=true):

Injects a signed JWT ID token as Authorization: Bearer <token> on proxied requests.
Backend verifies the token via the /oidc/cert JWKS endpoint (standard OIDC discovery).
Token signed with the OIDC provider's signing key — supports threshold ECDSA (TSS/DKG)
or deterministic HKDF-derived keys, auto-swaps transparently.
JWT claims: iss (gateway issuer), sub (username), aud (app name or custom audience),
email, preferred_username, groups, exp, iat. Same structure as regular OIDC ID tokens.
Audience defaults to the mapping's app name. Override via audience = "custom" field.
Tokens are cached in the session module (type "proxy_bearer") with deterministic IDs
derived from the user session and audience, distributed across all cluster nodes. With threshold
signing, only one node performs the signing ceremony per user:audience pair. Cached JWTs
are AES-256-GCM encrypted at rest using a cluster key derivative. Refreshed at 80% of TTL.
Pre-minting: hides ECDSA signing latency on the first request by minting the token during the OIDC callback.
Existing Bearer tokens are NOT overwritten — if the request already carries a Bearer
(e.g., kubelogin passthrough), the injection is skipped. This allows mixed usage:
M2M clients with their own tokens and browser users with injected tokens on the same route.
Use case: backends like ArgoCD, Rancher, Grafana, Kubernetes API trust the gateway's
OIDC issuer and get SSO for free — no redirect flow, no separate auth integration.

PAT Bearer Authentication:

Personal Access Tokens work as standard Bearer tokens for proxy access.
Flow: Authorization: Bearer <PAT-JWT> → middleware detects opaque miss →
JWT verify → PAT detection (jti non-empty) → PAT session validation → bearer auth.
Session check on every request ensures instant revocation (~5-50µs local KV).
IP restriction (allowed_ips) enforced from session metadata (exact IP + CIDR).
Last-used tracking: fire-and-forget metadata update preserves fixed PAT expiry.
Cache (bearer_cache session, SHA256 key): stores is_pat, jti, allowed_ips metadata.
Cache hits skip JWT verify but always re-validate session (revocation gate).
Stale cache entries auto-deleted when revocation detected.
Groups from JWT claims used for per-route group authorization (OR logic).
Two access paths to the same proxy mapping:
Browser: PoW challenge → OIDC SSO → session cookie → proxy (human-optimized)
Machine: Authorization: Bearer <token> → proxy (machine-optimized)
Bearer resolves at step 1 of the middleware chain — before PoW, before OIDC redirect.
All three token types (opaque access tokens, JWT ID tokens, PATs) bypass PoW and
browser redirects. Provides direct, redirect-free, cookieless proxy access for CI/CD,
monitoring, CLI tools (kubelogin), and service-to-service calls. Same group authorization,
identity headers, and Ed25519 signing apply — only the authentication on-ramp differs.

Mutual TLS (per-mapping):

mtls=false (default): no client certificate requested (no browser popup)
mtls=true: TLS handshake requires valid client certificate (RequireAndVerifyClientCert)
Certificate validated against ACME CA bundle or configured external PKI.
Applied at TLS layer via GetConfigForClient callback during handshake.

Subnet Restriction:

allowed_subnets uses CIDR notation, OR logic (client IP must match at least one).
Uses X-Forwarded-For if present (for CDN/LB scenarios), falls back to direct IP.
Enforced AFTER authentication but BEFORE proxy forwarding (defense-in-depth).
CIDR validated at config load time — startup fails on invalid format.
All violations logged at LevelWarn with app and host labels.
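
The OR-logic subnet check and the fail-at-load CIDR validation can be illustrated with the standard ipaddress module. This is a sketch with hypothetical function names; it assumes the check is only invoked when at least one subnet is configured.

```python
import ipaddress


def load_subnets(cidrs):
    """Parse CIDR strings up front; a bad entry raises ValueError (fail at load time)."""
    return [ipaddress.ip_network(c) for c in cidrs]


def client_allowed(client_ip: str, subnets) -> bool:
    """OR logic: the client IP must fall inside at least one configured subnet."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in subnets)
```

A request for which client_allowed returns False would be answered with 403.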

Cookie Handling:

Set-Cookie domains rewritten from backend domain to proxy domain for SSO.
Cookies intentionally shared across all subdomains (*.hexon.es) for single sign-on.
RFC 6265 compliant: case-insensitive attribute parsing (Domain=/domain=/DOMAIN=).
HttpOnly and Secure flags preserved during rewriting.

Response Header Overrides (three-state logic):

"" (empty/omit): Inherit from backend (pass through unchanged)
"-" (dash): Strip header from response
"value": Override with the specified value
Per-route headers completely replace global [proxy].headers (no merging).
Empty map (headers = {}) disables all header processing for that route.
Forbidden headers (blocked at config validation):
Transfer-Encoding, Content-Length, Connection, Keep-Alive, Upgrade, Proxy-Connection, TE
When both legacy fields (permissions_policy, referrer_policy, csp_header) and headers
map target the same header, the headers map takes precedence.
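
The three-state override logic and the forbidden-header validation can be sketched as follows. Names are hypothetical, and lowercasing header names is a simplification chosen for the example.

```python
FORBIDDEN = {"transfer-encoding", "content-length", "connection",
             "keep-alive", "upgrade", "proxy-connection", "te"}


def validate_overrides(overrides: dict) -> None:
    """Reject hop-by-hop and framing headers at config validation time."""
    for name in overrides:
        if name.lower() in FORBIDDEN:
            raise ValueError(f"header not allowed in overrides: {name}")


def apply_overrides(backend_headers: dict, overrides: dict) -> dict:
    """Apply inherit / strip / override to a backend response's headers."""
    result = {k.lower(): v for k, v in backend_headers.items()}
    for name, value in overrides.items():
        key = name.lower()
        if value == "":
            continue                  # inherit: pass the backend value through
        if value == "-":
            result.pop(key, None)     # strip the header from the response
        else:
            result[key] = value       # override with the configured value
    return result
```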

Troubleshooting

Common symptoms and diagnostic steps:

502 Bad Gateway / 503 Service Unavailable:

- Backend unreachable: 'proxy health' shows backend down
- Circuit breaker open: 'proxy circuits' shows tripped breakers
- DNS resolution failure: 'dns test <backend-hostname>' to verify
- DNSSEC failure on unsigned zone: set dnssec=false on that route
- All custom resolvers failing: falls back to system DNS, check 'dns resolvers'
- Start with: 'diagnose domain <hostname>' for cross-subsystem check

Auth redirect loops:

- OIDC callback failing: check oidc_providers configuration
- Cross-domain cookie issue: verify proxy hostname matches cookie domain
- Session group mismatch: the periodic group refresh (group_refresh_interval) updated the session's groups
- Multiple providers: ensure provider selection page renders correctly
- Check: 'sessions list --user=X' and 'auth status'

WebSocket upgrade failures:

- Missing allow_upgrade=true on the mapping
- Backend not responding to Upgrade handshake
- Rate limiting blocking upgrade requests: check disable_rate_limit
- TLS verification failing: check tls_check setting

gRPC errors (circuit breaker tripping unexpectedly):

- gRPC always returns HTTP 200; actual status is in grpc-status trailer
- Circuit breaker uses gRPC-aware status extraction (codes 4,8,13,14,15 = server error)
- Expression variables: grpc_error_rate, grpc_unavailable_rate, grpc_timeout_rate
- Backend must implement grpc.health.v1.Health for native gRPC health checks
- Enable: grpc=true on the mapping, grpc_health_check=true on circuit_breaker config
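
Because the HTTP status is always 200, error classification has to read the grpc-status trailer. The sketch below shows that classification with the codes listed above; how the product actually aggregates grpc_error_rate is an assumption.

```python
# gRPC status codes treated as server errors:
# 4 DEADLINE_EXCEEDED, 8 RESOURCE_EXHAUSTED, 13 INTERNAL, 14 UNAVAILABLE, 15 DATA_LOSS
GRPC_SERVER_ERROR_CODES = {4, 8, 13, 14, 15}


def is_grpc_server_error(grpc_status: int) -> bool:
    """Classify one grpc-status trailer value."""
    return grpc_status in GRPC_SERVER_ERROR_CODES


def grpc_error_rate(grpc_statuses) -> float:
    """Fraction of responses in a window whose grpc-status is a server error."""
    statuses = list(grpc_statuses)
    if not statuses:
        return 0.0
    errors = sum(1 for code in statuses if is_grpc_server_error(code))
    return errors / len(statuses)
```

Client-side codes such as 5 (NOT_FOUND) do not count, so a burst of them alone should not trip the breaker.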

Slow responses:

- HTML buffering: set rewrite_host=false for API routes (saves 8-15ms, 4x throughput)
- Brotli decompression cost: set brotli_support=false on specific routes
- Backend health degrading: 'proxy backends' for connection stats
- Circuit breaker half-open: 'proxy circuits' for breaker states
- Connection pool exhaustion: 'connpool stats' for pool metrics
- Route matching slow: too many regex routes in Tier 3, convert to prefix patterns
- Enable debug mode for Server-Timing header: shows route/auth/backend_ttfb/tls timing
breakdown in browser DevTools Network tab (do NOT enable in production)

Header signing verification failures:

- Clock skew >1s between proxy and backend: ensure NTP is running on all nodes
- Key rotation window boundary: verification allows 30s tolerance
- Wrong audience: check mapping's audience field matches backend expectation
- Signature format: expect v2.{timestamp}.{base64}, parse with SplitN(sig, ".", 3)
- Groups with pipes: backend must use SplitN(payload, "|", 7), not Split(payload, "|")

Request signing verification failures:

- Path canonicalization mismatch: backend must URL-decode, resolve dots, collapse slashes
- Query parameter order: backend must sort alphabetically before hashing
- Body hash "SKIPPED": body exceeded sign_request_max_body, backend must handle this case
- Large file uploads: set sign_request_max_body = "0" to skip body hashing

Landing page not showing apps:

- User not in required groups: 'directory user <username>'
- proxy.hostname not configured: landing page serves on service.hostname at /
- Route auth=false: public apps show with PUBLIC badge
- App not visible: check display=true (default) and folder/tags grouping in mapping config

PROXY protocol issues:

- Backend not expecting PROXY protocol: set proxy_protocol=false
- Wrong protocol version: check proxy_protocol_version (v1 text vs v2 binary)

PAT Bearer token not authenticating:

- PAT falls through to session/OIDC: check 'logs search "handlers.bearer"' for PAT validation logs
- "PAT rejected" in logs: session revoked or expired — check 'sessions list --type=pat --user=X'
- "Cached PAT rejected" in logs: stale bearer_cache entry — auto-invalidated, retry should work
- "source IP not allowed" in logs: PAT has allowed_ips restriction — check 'pats show <session_id>'
- PAT works for QUIC but not proxy: ensure Authorization: Bearer header is sent correctly
- Groups not matching route: PAT carries groups from creation time — if user groups changed,
create new PAT with current groups or use route with groups the PAT carries

403 Forbidden / 405 Method Not Allowed on specific routes:

- Subnet restriction: client IP not in allowed_subnets (check 'proxy traffic' metrics)
- HTTP method not allowed (405): check allowed_methods list
- Group authorization failed: user missing required group membership
- mTLS required but no client certificate: check mtls setting

Canary routing not splitting traffic:

- Verify canary.enabled = true and weight > 0
- Sticky routing: same user always hits same version (deterministic hash)
- bypass_groups: users in these groups always hit canary (simple) or first version (multi)
- Metric: proxy_canary_requests{version="canary"} should show traffic
- Use 'proxy canary <app>' for detailed canary status, pool health, and per-version metrics
- Log: logs search "proxy.canary" shows per-request routing decisions
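
When debugging why a given user keeps landing on one version, it helps to picture sticky selection as a deterministic hash into weight buckets. The sketch below assumes SHA-256 and a 0-99 bucket; the hash the product uses and the function name are assumptions.

```python
import hashlib


def pick_version(sticky_value: str, versions):
    """Deterministically map a sticky key (user, fingerprint, or IP) to a version.

    versions: list of (label, weight) pairs whose weights sum to 100.
    """
    if sum(weight for _, weight in versions) != 100:
        raise ValueError("weights must sum to 100")
    digest = hashlib.sha256(sticky_value.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    upper = 0
    for label, weight in versions:
        upper += weight
        if bucket < upper:
            return label
    raise AssertionError("unreachable when weights sum to 100")
```

Simple mode corresponds to two entries such as [("stable", 90), ("canary", 10)]. The same key always yields the same label, so a single test user is not a useful probe of the split.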

Canary backend isolation:

- Canary backends have their own LB pool (separate circuit breaker, health checks, outlier detection)
- Health checks inherited from parent mapping config
- Canary errors do NOT contaminate primary circuit breaker
- Response cache keys include version label (no cross-contamination)
- E2OE WebSocket connections route to canary when canary selected
- URL/cookie rewriting: canary must use same hostname as primary (limitation)
- Retry/hedge after canary failure: selects from canary pool (same pool isolation)
- Canary site: canary.site overrides mapping-level site for connector routing
- Per-version site: versions[].site overrides canary.site (multi-version mode)
- Site resolution: version.site > canary.site > mapping.site > direct connection
- CLI probe: 'proxy mappings' probes both primary and canary backends

Session-based canary pinning:

- First request: version computed via weight/sticky, stored in session metadata (auth) or cookie (no-auth)
- Subsequent requests: pinned version read from session/cookie — same user always gets same version
- Session priority: session metadata > cookie > fresh computation
- Logout/login: session destroyed → fresh selection on new session
- Cookie: hexon_cv_{app} session cookie (no Max-Age = cleared on browser close)
- Works for auth=false routes via cookie fallback

Retries not firing / firing too much:

- Method not in retriable_methods: POST/PATCH never retried by default
- Budget exhausted: rate(proxy_retry_budget_exceeded[5m]) > 0
- Body too large: bodies > max_body_size skip retry silently
- Check: X-Hexon-Attempts response header shows actual attempt count
- Log: logs search "proxy.retry" for attempt details and budget blocks
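
To reason about whether a retry should have fired, the backoff and budget settings can be modeled as below. This is a sketch: exponential doubling between backoff_base and backoff_max is an assumption (only the base, cap, and full jitter are documented), and the function names are hypothetical.

```python
import random


def backoff_delay(attempt: int, base_s=0.05, max_s=1.0, jitter=True, rng=random):
    """Delay in seconds before retry number `attempt` (1 = first retry)."""
    computed = min(max_s, base_s * (2 ** (attempt - 1)))   # assumed exponential growth
    # Full jitter: uniform random value in [0, computed backoff].
    return rng.uniform(0, computed) if jitter else computed


def retry_allowed(requests_in_window: int, retries_in_window: int,
                  budget_ratio=0.10, budget_min_retries=3) -> bool:
    """True if one more retry fits in the budget for the current window."""
    allowed = max(budget_min_retries, int(requests_in_window * budget_ratio))
    return retries_in_window < allowed
```

With the defaults, a route serving 1000 requests in the 10s window may spend 100 retries, while a nearly idle route can still retry 3 times because of budget_min_retries.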

Hedging not reducing tail latency:

- delay too high: hedge fires too late to help — calibrate to route p99
- Insufficient backends: hedge requires max_hedges+1 backends (validated at config load)
- max_hedges > 1: fires N hedges simultaneously after delay, each to a different backend
- Metric ratio: proxy_hedge_fired_total vs proxy_hedge_won_total
Low won/fired = delay well-calibrated; high = persistent tail latency problem
- Hedge skipped: proxy_hedge_skipped_total — no different backend available or body replay failed

Interpreting tool output:

'proxy health':
Healthy: All backends Status=healthy, Latency < 100ms
Warning: Status=healthy but Latency > 500ms — backend is slow, not down
Degraded: Status=unhealthy with Reason: connection_refused | tls_error | timeout | dns_failed
No pools: "No load balancer pools configured" means all mappings are single-backend (normal)
Action: All backends unhealthy → 'proxy circuits' for open breakers
'proxy circuits':
Healthy: all breakers State=closed — normal operation
Half-open: breaker is testing recovery — allow a few requests through, do NOT reset
Open: breaker tripped — backend is failing, check Reason and TripCondition
No pools: same as above — circuit breakers only exist for multi-backend pools
Action: Open breaker → check 'proxy backends' for error counts, fix backend, then 'proxy reset'
'proxy backends':
Healthy: ActiveConns reasonable, ErrorRate < 1%, Latency stable
Degraded: ErrorRate > 5% or Latency spiking — backend may be overloaded
Ejected: Outlier detection removed backend — 'proxy uneject' to re-admit after fixing
'proxy traffic':
Normal: RequestRate steady, ErrorRate < 1%, Latency p99 < 500ms
Abnormal: Sudden RequestRate spike (possible attack), ErrorRate > 5% (backend issue)
Zero traffic: Route exists but no requests — check DNS/certificate for that hostname
'proxy canary':
List view: shows all canary-enabled routes with mode, weight, sticky, pool health
Detail view (proxy canary <app>): full config, canary pool health, per-version request counts
No canary routes: "No routes with canary enabled" — canary not configured
Pool health N/M: N healthy backends out of M total in canary pool
Version metrics: proxy_canary_requests broken down by version label

Relationships

Module dependencies and interactions:

  • loadbalancer: Pool management, backend selection, health checks, circuit breakers, outlier detection. Multi-algorithm support (adaptive, round-robin, weighted, least-conn, consistent hash, Maglev, random).
  • sessions: Authentication enforcement via session cookies. Session creation during OIDC callback. Session group monitor updates groups in place on changes (no re-login).
  • certificates: TLS termination, SNI-based certificate selection for per-mapping certs. Falls back to service.tls_cert/tls_key if no mapping-specific cert. Invalid or missing certificates prevent route from mounting.
  • waf: Request filtering applied before proxy forwarding (WAF rules checked first).
  • authentication.oidc: SSO via internal OIDC provider. Uses a dedicated internal OIDC client with PKCE S256. Back-channel token exchange is in-process (no network hairpin in K8s).
  • directory: Group membership lookup on every request (fresh, not cached). Powers both per-request authorization and X-Hexon-Groups header, plus landing page app filtering.
  • dns: Backend hostname resolution with DNSSEC validation. Centralized in [dns] section. Per-route overrides for internal backends or unsigned DNS zones.
  • firewall: Network-level access rules applied before proxy routing.
  • protection: Rate limiting (JA4 fingerprint-based) and size limiting at router level. Per-mapping bypass via disable_rate_limit/disable_size_limit context keys.
  • connection_pool: Backend HTTP connection management with adaptive pool sizing, circuit breaker integration, and performance metrics. Pool consolidation for routes with identical transport configuration.
  • render: Landing page and toolbar asset serving (CSS, JS, images).

Architecture

Request flow:

  1. Client request arrives → TLS termination with SNI certificate selection
  2. Hostname match: O(1) hash map lookup → PathMatcher for that hostname
  3. Path match: 3-tier hybrid matcher (exact → prefix tree → regex scan)
  4. Middleware chain: rate limit → size limit → subnet check → method filter
  5. Authentication: Bearer token check (opaque cache → JWT session cache → Ed25519 verify). PAT detection: if JWT has jti → PAT session validation (revocation + IP restriction) → last_used update. Then OIDC SSO check → redirect to /oidc/auth if no session.
  6. Authorization: group membership verified (fresh directory lookup, OR logic)
  7. Director: canary selection (if enabled) → inject identity headers, sign headers (Ed25519)
  8. Transport chain: RetryTransport → HedgeTransport → ProtocolFallbackTransport → backend
  9. Response processing: URL rewriting (HTML + response headers), cookie domain rewriting, toolbar + JS interceptor injection, custom header overrides
  10. Response sent to client (zero-copy mode skips step 9 for API routes)

Route matching — 3-tier hybrid matcher:

Tier 1 (exact): O(1) hash map — static paths like ^/api/v1/users$
Tier 2 (prefix): O(log n) radix tree — wildcard paths like ^/api/.*
Tier 3 (regex): O(n) sequential — complex patterns with alternation
Auto-priority: (prefix_length × 100) + end_anchor_bonus(100) - alternation_penalty(50)
Manual priority >1000 overrides auto-calculation. Catch-all always priority 0.
Performance: exact ~50ns, prefix ~100-200ns, regex ~500ns-50μs (scales with route count).
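
The tier order and the auto-priority formula can be illustrated with a small model. It is a sketch under stated assumptions: prefix_length is taken as an input (how it is measured is not specified here), the result is clamped to the 0-1000 auto range, and a longest-first list stands in for the radix tree. All names are hypothetical.

```python
import re


def auto_priority(prefix_length: int, end_anchored: bool, has_alternation: bool) -> int:
    """(prefix_length x 100) + end-anchor bonus (100) - alternation penalty (50)."""
    score = prefix_length * 100
    if end_anchored:
        score += 100
    if has_alternation:
        score -= 50
    return max(0, min(1000, score))    # clamp to the auto range (assumed)


class TieredMatcher:
    """Exact -> prefix -> regex lookup, tried in that order."""

    def __init__(self):
        self.exact = {}       # Tier 1: hash map
        self.prefixes = []    # Tier 2 stand-in: longest-first list, not a radix tree
        self.regexes = []     # Tier 3: sequential scan

    def add_exact(self, path, route):
        self.exact[path] = route

    def add_prefix(self, prefix, route):
        self.prefixes.append((prefix, route))
        self.prefixes.sort(key=lambda item: len(item[0]), reverse=True)

    def add_regex(self, pattern, route):
        self.regexes.append((re.compile(pattern), route))

    def match(self, path):
        if path in self.exact:
            return self.exact[path]
        for prefix, route in self.prefixes:
            if path.startswith(prefix):
                return route
        for pattern, route in self.regexes:
            if pattern.search(path):
                return route
        return None
```

The model also shows why many Tier 3 routes hurt: every miss in the first two tiers pays for a scan over all compiled patterns.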

OIDC SSO flow (solves cross-domain cookie problem):

1. User hits proxy host with no session cookie
2. Redirect to OIDC provider with PKCE challenge
3. If user has existing session on main domain → auto-approved (no login prompt)
4. Redirect back: /_hexon/oidc/callback?code=...
5. Token exchange in-process (no network hairpin)
6. Session cookie set ON the proxy host domain → future requests have session
Security: PKCE S256, AES-GCM encrypted state with cluster key derivative,
CSRF double-submit cookie, 10-minute state expiry, host binding in state,
open redirect prevention (return URL validated against proxy hostname).
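
For readers unfamiliar with PKCE S256, the challenge sent in step 2 is the base64url-encoded SHA-256 of a random verifier that is revealed only at token exchange. The sketch below follows RFC 7636; the function names are illustrative and the verifier length is an arbitrary valid choice.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Random URL-safe verifier (32 random bytes -> 43 characters)."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The provider recomputes the challenge from the verifier presented at token exchange and rejects the exchange if they differ.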

Hot-reload mechanism:

1. Config change detected (file watcher or SIGHUP)
2. Config hash compared to detect actual changes (skip if identical)
3. Routes rebuilt from new configuration
4. HTTP transport cache checked for connection pool reuse
5. Circuit breaker state preserved for unchanged routes
6. Response cache selectively invalidated (only changed routes)
7. Server routes re-registered (atomic hostname updates)
8. Proxy state swapped atomically (lock-free reads, +2ns overhead)
Reload is all-or-nothing: failure keeps old routes active, error logged.
Duration: 50-200ms for typical configs (10-50 routes).
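
The skip-if-identical and all-or-nothing properties of the reload can be modeled in a few lines. This is a sketch only: the config shape (mappings with app, host, path, service keys) and the class name are hypothetical, and the real reload also handles transports, circuit breakers, and caches.

```python
import hashlib
import json
import threading


class ProxyState:
    """Holds the active route table and swaps it only after a full rebuild succeeds."""

    def __init__(self, config: dict):
        self._lock = threading.Lock()    # serializes reloads; readers take no lock
        self._hash = self._digest(config)
        self.routes = self._build_routes(config)

    @staticmethod
    def _digest(config: dict) -> str:
        canonical = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    @staticmethod
    def _build_routes(config: dict) -> dict:
        routes = {}
        for m in config["mappings"]:
            if not m.get("service"):
                raise ValueError("mapping %r has no service" % m.get("app"))
            routes[(m["host"], m["path"])] = m["service"]
        return routes

    def reload(self, new_config: dict) -> bool:
        """Return True if routes changed; raise (and keep old routes) on bad config."""
        with self._lock:
            new_hash = self._digest(new_config)
            if new_hash == self._hash:
                return False                              # identical config: skip
            new_routes = self._build_routes(new_config)   # raises before any swap
            self.routes = new_routes                      # one reference assignment
            self._hash = new_hash
            return True
```

Because the new table is built completely before the single assignment, a request in flight sees either the old routes or the new ones, never a mix.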

HTML processing pipeline (when rewrite_host=true):

1. Backend response received → check Content-Type (only process text/html)
2. Decompress if needed (gzip always; Brotli if brotli_support=true)
3. Replace backend URLs with proxy URLs in HTML body
4. Rewrite URL-containing response headers (Link, Content-Location, Refresh)
5. Rewrite Set-Cookie domains (case-insensitive attribute matching)
6. Inject JavaScript interceptor (rewrites fetch, XHR, dynamic elements, window.open)
7. Inject logout toolbar before </body> (if auth=true and inject_toolbar=true)
8. Re-compress response (gzip or Brotli based on client Accept-Encoding)
Zero-copy mode (rewrite_host=false, inject_toolbar=false): skip all HTML processing,
eliminating 10MB allocation per request. Ideal for APIs, WebSocket, streaming.
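
The cost of the pipeline is easier to see in a stripped-down model of steps 1, 2, 3, and 8. The sketch handles gzip only and does plain byte replacement; header and cookie rewriting, script injection, and the toolbar are omitted, and the function name is hypothetical.

```python
import gzip


def rewrite_html_response(body: bytes, content_type: str, content_encoding: str,
                          backend_url: str, proxy_url: str) -> bytes:
    """Rewrite backend URLs in an HTML body, leaving other content types untouched."""
    if not content_type.lower().startswith("text/html"):
        return body                                        # step 1: only text/html
    compressed = content_encoding.lower() == "gzip"
    html = gzip.decompress(body) if compressed else body   # step 2: decompress
    html = html.replace(backend_url.encode("utf-8"),
                        proxy_url.encode("utf-8"))         # step 3: replace URLs
    return gzip.compress(html) if compressed else html     # step 8: re-compress
```

Every rewritten response is fully buffered, decompressed, scanned, and compressed again, which is the work zero-copy mode avoids.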

Logs

Log entries by component. Search with: logs search “proxy”. Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Routing & Dispatch:

proxy.dispatcher DEBUG Matched proxy route / no path match / no routes for hostname
proxy.request DEBUG Proxying request to backend
proxy.error INFO Request canceled (client disconnected — expected)
proxy.error ERROR Proxy request failed (timeout, connection refused, etc.)
proxy.redirect DEBUG Rewriting redirect Location header
proxy.assets WARN Path traversal attempt / invalid path detected
proxy.debug_timing DEBUG Full proxy roundtrip timing summary

Authentication & Authorization:

proxy.reauth INFO AUDIT Re-authentication required (reauth rule matched)
proxy.oidc DEBUG Redirecting to internal/external OIDC provider
proxy.oidc ERROR Failed to generate PKCE verifier, CSRF token, or state encryption
proxy.oidc.callback INFO AUDIT OIDC proxy authentication completed
proxy.oidc.callback WARN AUDIT State expired, CSRF validation failed
proxy.oidc.callback WARN OAuth error from IdP, state decryption failed, host mismatch
proxy.oidc.callback ERROR AUDIT Token exchange failed
proxy.oidc.callback ERROR Session creation failed

Header Signing:

proxy.signing WARN Rotation interval too short / cluster key too short
proxy.signing ERROR Key derivation failed (initial or rotation)

Bearer Token Injection:

proxy.bearer_inject WARN No username/session, decryption failed (will re-mint)
proxy.bearer_inject ERROR MintBearerToken failed, wrong response type, encryption error
proxy.bearer_inject DEBUG Bearer token minted for backend
proxy.bearer_refresh WARN Background refresh failed
proxy.bearer_refresh DEBUG Background refresh completed

HTML Rewriting:

proxy.rewrite WARN Response too large to buffer, streaming without rewrite
proxy.rewrite DEBUG Chunked/binary response streaming, Brotli/Zstd disabled

WebSocket E2OE:

proxy.ws_e2oe.* INFO Relay started / ended
proxy.ws_e2oe.* ERROR Accept failed / backend dial failed

Session Monitoring:

proxy.group_monitor INFO Monitor started, check completed
proxy.group_monitor INFO AUDIT User groups changed, updating session
proxy.group_monitor WARN Session update wait failed
proxy.group_monitor ERROR Group fetch failed, session update failed

Lifecycle:

proxy.init INFO Proxy service initialized
proxy.init ERROR Initialization failed
proxy.reload ERROR Reload failed
proxy.ca_rotation INFO Transport caches invalidated (CA rotation)
proxy.director WARN PROXY protocol: invalid source IP

Transport & DNS:

proxy.transport DEBUG Transport configured for route
proxy.dns DEBUG/INFO Backend DNS resolution, DNSSEC validation
proxy.dns WARN/ERROR DNS resolution failed, fallback to system DNS
proxy.dns.quic DEBUG/WARN QUIC-specific DNS resolution and connection

Circuit Breaker & Load Balancing:

proxy.circuit_breaker WARN Circuit breaker open / fallback activated
proxy.outlier_detection WARN Outlier detection config warnings
proxy.dns_discovery WARN DNS discovery config warnings
proxy.health_check WARN Health check config warnings
proxy.fallback ERROR Invalid fallback URL / fallback service error

JIT-2FA:

proxy.jit2fa ERROR JIT-2FA middleware creation failure

Request Signing:

proxy.request_signing WARN Body hash failure
proxy.request_signing DEBUG Request signed successfully
proxy.signing_key DEBUG/WARN Public key endpoint access and validation
proxy.signature_verify DEBUG/WARN Signature verification endpoint handler
proxy.request_signature_verify DEBUG/WARN Request signature verification handler

Shadow/Mirror:

proxy.shadow DEBUG Shadow request dispatched

Co-browsing:

proxy.cobrowse.started INFO Co-browse session started
proxy.cobrowse.stopped INFO Session stopped
proxy.cobrowse.recorder_connected INFO Recorder WebSocket connected
proxy.cobrowse.recorder_disconnected INFO Recorder disconnected
proxy.cobrowse.reconnected INFO Recorder reconnected
proxy.cobrowse.grace_expired INFO Cleanup grace period expired
proxy.cobrowse.ws_upgrade_failed WARN WebSocket upgrade failed
proxy.cobrowse.publish_failed WARN Event publish failed
proxy.cobrowse.input_write_failed WARN Input forwarding failed
proxy.cobrowse.recorder_ws_not_found WARN Recorder WS not found in cluster store
proxy.cobrowse.input_received DEBUG Forwarding interaction event to recorder
proxy.cobrowse.input_subscribe_failed WARN Input channel subscription failed
proxy.cobrowse.recorder_stats INFO Recorder WebSocket session ended

Configuration:

proxy.route INFO Route configured (full route details)
proxy.config INFO Global config summary, cert loading
proxy.config WARN Duplicate route detection, config validation

Access Control:

proxy.access DEBUG Route access check (app, host, groups, reason)

Canary:

proxy.canary DEBUG Routing to stable/canary backend (app, version, backend)

Retry:

proxy.retry INFO Retrying request (app, attempt, backend)
proxy.retry DEBUG Retry succeeded (app, attempt)
proxy.retry WARN Retry budget exceeded (app, pool_id)
proxy.retry WARN All retry attempts exhausted (app, max_attempts)

Hedge:

proxy.hedge DEBUG Hedge fired (app, hedges, delay, primary_backend)
proxy.hedge DEBUG Hedge skipped (app, reason)
proxy.hedge WARN All hedge attempts failed (app, total_attempts)

Metrics

Prometheus metrics. Query with: metrics prometheus proxy_<name>

Request Flow:

proxy_requests counter {app, host, auth} Successful proxied requests
proxy_errors counter {app, host} Proxy request errors
proxy_backend_duration latency {app} Backend response time
proxy_auth_failures counter {app, host} Authentication failures
proxy_authz_failures counter {app, host} Group authorization failures
proxy_auth_bypass_total counter {app, host} Auth bypassed (bypass_auth_cidrs)
proxy_subnet_failures counter {app, host} Subnet restriction failures
proxy_reauth_required counter {app, host} Re-authentication triggered

Caching:

proxy_cache_hits counter {app, type} Response cache hits (304/full)
proxy_cache_misses counter {app} Response cache misses
proxy_cache_size gauge {} Response cache entries
proxy_cache_invalidated counter {} Cache entries invalidated on reload

Header Signing:

proxy_signing_total counter {status, app} Header signing operations
proxy_signing_duration latency {app} Header signing time
proxy_request_signing_total counter {status, app} Request signing operations
proxy_request_signing_duration latency {app} Request signing time
proxy_sign_payload_total counter {status} Payload signing outcomes
proxy_key_derivation_total counter {status} Key derivation attempts
proxy_key_rotation_total counter {status} Key rotations
proxy_key_operation_total counter {status, operation} Key operation failures
proxy_key_request_total counter {status} Public key endpoint requests
proxy_key_request_duration latency {} Public key endpoint latency
proxy_signature_verify_total counter {status} Signature verification requests
proxy_signature_verify_duration latency {} Signature verification latency
proxy_request_signature_verify_total counter {status} Request signature verification
proxy_request_signature_verify_duration latency {} Request signature verify latency

OIDC SSO:

proxy_oidc_flow_initiated counter {host, provider} OIDC auth flows started
proxy_oidc_flow_completed counter {host, provider} OIDC auth flows completed
proxy_oidc_flow_failed counter {host, reason, provider?} OIDC auth flow failures (provider absent on pre-state errors)
proxy_oidc_state_validation_failed counter {reason} State validation failures

Per-User Rate Limiting:

proxy_rate_limit_per_user_denied counter {app} Requests denied by per-user rate limit

Canary:

proxy_canary_requests counter {app, version} Requests routed per version (stable/canary label)

Retry:

proxy_retry_attempts_total counter {app, attempt} Retry attempts by attempt number
proxy_retry_success_total counter {app} Retries that succeeded
proxy_retry_budget_exceeded counter {app} Retries blocked by budget
proxy_retry_exhausted counter {app} All retry attempts failed

Hedge:

proxy_hedge_fired_total counter {app} Hedge requests sent (primary too slow, value = number of hedges)
proxy_hedge_won_total counter {app} Hedge response used (primary was slower or failed)
proxy_hedge_lost_total counter {app} All attempts failed (primary + all hedges)
proxy_hedge_skipped_total counter {app} Hedge skipped (no different backend, body replay failure)

Circuit Breaker:

proxy_circuit_breaker_rejections counter {app} Requests rejected (breaker open)
proxy_circuit_breaker_fallbacks counter {app} Fallback service activated

Transport:

proxy_transport_cache_hits counter {} HTTP transport reused
proxy_transport_cache_misses counter {} New HTTP transport created
proxy_transport_cache_size gauge {} Cached transports
proxy_transport_cache_invalidated counter {reason} Cache invalidated (CA rotation)
proxy_http3_transport_cache_hits counter {} HTTP/3 transport reused
proxy_http3_transport_cache_misses counter {} HTTP/3 transport created
proxy_proxyprotocol_sent counter {version} PROXY protocol headers sent
proxy_proxyprotocol_skipped counter {reason} PROXY protocol skipped
proxy_ca_pool_version gauge {} CA pool version
proxy_optimized_transport_hits counter {route, pool} Optimized transport cache hits
proxy_optimized_transport_fallbacks counter {route, reason} Optimized transport fallbacks
proxy_transport_pools_created counter {pool, route} Transport pools created
proxy_transport_pools_cleaned counter {pool} Transport pools cleaned
proxy_pool_registration_failures counter {pool, error} Pool registration failures
proxy_pool_cleanup_errors counter {pool, error} Pool cleanup errors

Rewriting:

proxy_rewrite_duration latency {app} HTML rewriting time
proxy_buffer_pool_gets counter {} Buffer pool acquisitions

Reload:

proxy_reload_attempts counter {trigger} Reload attempts
proxy_reload_total counter {success, reason?} Reload results (reason on failure only)
proxy_reload_skipped counter {reason} Reload skipped
proxy_reload_duration latency {success} Reload time (success path only)
proxy_routes_configured gauge {} Active routes
proxy_routes_changed gauge {} Routes changed on reload
proxy_routes_unchanged gauge {} Routes unchanged on reload
proxy_routes_added counter {} Routes added
proxy_routes_removed counter {} Routes removed
proxy_config_hash_changed counter {} Config hash changes
proxy_lb_pools_preserved counter {app} LB pools preserved
proxy_lb_pools_created counter {app, reason} LB pools created

Session Monitoring:

proxy_group_monitor_changes_total counter {username} Group membership changes
proxy_group_monitor_updates_total counter {username} Session metadata updates
proxy_group_monitor_errors_total counter {error_type} Monitor errors
proxy_group_monitor_check_duration latency {} Check cycle time

FastCGI:

proxy_fastcgi_requests_total counter {mapping, status_class} FastCGI RoundTrip exits (status_class: 2xx/3xx/4xx/5xx/error)
proxy_fastcgi_request_duration latency {mapping} RoundTrip wall-clock time (auto-bucketed)
proxy_fastcgi_pool_total counter {mapping, result} Conn pool acquire (hit/fresh/retry)
proxy_fastcgi_stderr_bytes_total counter {mapping, severity} PHP-FPM STDERR bytes routed to audit log
proxy_fastcgi_proto_status_total counter {mapping, status} Non-success FCGI_END_REQUEST signals (cant_mpx/overloaded/unknown_role)

Alerts:

rate(proxy_errors[5m]) / rate(proxy_requests[5m]) > 0.05 Error rate > 5%
rate(proxy_circuit_breaker_rejections[5m]) > 0 Backend unhealthy (breaker currently rejecting requests)
rate(proxy_auth_failures[5m]) > 10 Brute-force attempt
rate(proxy_oidc_state_validation_failed[5m]) > 5 CSRF/state attack
proxy_transport_cache_invalidated > 0 CA rotation event
rate(proxy_reload_total{success="false"}[5m]) > 0 Config reload failing
rate(proxy_retry_budget_exceeded[5m]) > 10 Retry storm — budget protecting cluster
rate(proxy_retry_exhausted[5m]) > 5 Backend failures exhausting retries
rate(proxy_hedge_fired_total[5m]) / rate(proxy_requests[5m]) > 0.1 >10% requests hedging — check tail latency
rate(proxy_fastcgi_requests_total{status_class="error"}[5m]) > 5 FastCGI transport-level failures (backend unreachable)
rate(proxy_fastcgi_stderr_bytes_total{severity="error"}[5m]) > 1000 Sustained PHP-FPM error output (investigate php-fpm.log)

FastCGI

FastCGI 1.0 RESPONDER backends — PHP-FPM, Python WSGI (flup, gunicorn FCGI mode), Perl FCGI::ProcManager, Ruby rack-fastcgi, Lua, and any other language with a FastCGI 1.0 RESPONDER implementation. Selected per-mapping via the service URL scheme.

For deeper detail see: man fastcgi

URL schemes:

fastcgi://host:port Plain TCP. The default for most FastCGI deployments (PHP-FPM in containers, gunicorn FCGI listener, etc.).
fastcgi+tls://host:port TCP + TLS handshake before FCGI records. Reuses tls_check from the mapping for cert verify policy.
fastcgi+unix://... REJECTED at config load. Configure the FastCGI backend to listen on a TCP port instead.
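
The scheme rules above can be sketched as a small config-load check. This is an illustrative sketch only, not the product's validator: the function name classify_service and the error messages are hypothetical, and the only behavior taken from this page is which schemes are accepted or rejected.

```python
from urllib.parse import urlsplit

# Schemes this page lists as accepted for FastCGI backends.
ACCEPTED_SCHEMES = {"fastcgi", "fastcgi+tls"}


def classify_service(url):
    """Return the FastCGI scheme of a service URL, or raise ValueError.

    Hypothetical helper: mirrors the documented accept/reject rules only.
    """
    scheme = urlsplit(url).scheme.lower()
    if scheme == "fastcgi+unix":
        # Documented as rejected at config load.
        raise ValueError("fastcgi+unix:// is rejected; use a TCP listener")
    if scheme not in ACCEPTED_SCHEMES:
        raise ValueError("not a FastCGI scheme: %r" % scheme)
    return scheme
```

A caller would run this once per entry in service and use the result to decide whether a TLS handshake precedes the FCGI records.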

Required mapping fields:

service = ["fastcgi://php-fpm.internal:9000"]
fcgi_script_root = "/var/www/html" (absolute filesystem path on backend)

Common optional fields:

fcgi_index "index.php" front-controller for paths without a SplitPath suffix
fcgi_split_path [".php"] suffixes splitting SCRIPT_NAME from PATH_INFO
fcgi_env {} extra CGI envs (passthrough)
fcgi_inject_identity true inject HEXON_USER/MAIL/etc. envs
fcgi_dial_timeout "5s" backend connect timeout
fcgi_read_timeout "60s" full response read timeout — bump for heavy/long-running backends
fcgi_write_timeout "10s" request body write timeout
fcgi_idle_timeout "90s" idle TTL for pooled conns (set BELOW your FastCGI worker idle timeout, e.g. PHP-FPM pm.process_idle_timeout)
fcgi_max_idle_conns 32 pool size per backend (set 0 to disable pooling)
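
To illustrate how fcgi_split_path and fcgi_index could interact, here is a minimal sketch. It assumes the split happens at the first occurrence of a configured suffix and that paths without a suffix are handed to the front controller with the original path as PATH_INFO. Both are assumptions made for illustration; the function name split_script_path is hypothetical.

```python
def split_script_path(path, split_suffixes=(".php",), index="index.php"):
    """Split a URL path into (SCRIPT_NAME, PATH_INFO).

    Assumption: the first occurrence of any suffix ends SCRIPT_NAME.
    Assumption: with no suffix match, the front controller (index) handles
    the request and the whole path becomes PATH_INFO.
    """
    for suffix in split_suffixes:
        pos = path.find(suffix)
        if pos != -1:
            end = pos + len(suffix)
            return path[:end], path[end:]
    return "/" + index, path
```

With the defaults, "/blog/index.php/2024/post" splits into "/blog/index.php" and "/2024/post", while "/about" falls through to "/index.php".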

Composes with existing mapping fields:

auth = true + add_auth_headers = true Drives identity envs (HEXON_USER/MAIL/NAME/GROUPS).
add_extended_auth_headers When true, also injects X-Hexon-Auth-Method (passkey/oidc/x509/...) and X-Hexon-JA4 headers, which surface as HEXON_AUTH_METHOD and HEXON_JA4 envs.
mtls = true Auto-injects SSL_CLIENT_* envs from the peer cert (also fires for X.509 SSO and voluntary client auth, that is, anytime a peer cert is presented).
tls_check For fastcgi+tls://, verifies the backend cert chain.
site = "..." Routes FCGI traffic through the connector tunnel. The FastCGI backend host doesn't need a public IP.
host_header Overrides the Host the backend sees; propagates to HTTP_HOST and SERVER_NAME envs.
lb_strategy Load balancing across multiple FCGI backends. Mixed schemes (one http://, one fastcgi://) rejected at config load.

Identity envs (when fcgi_inject_identity = true):

HEXON_USER, HEXON_MAIL, HEXON_NAME, HEXON_GROUPS From the standard add_auth_headers block. Read from the configured header_user/mail/name/groups names (defaults X-Hexon-User/...).
HEXON_AUTH_METHOD, HEXON_JA4 Require add_extended_auth_headers = true (header names are fixed).
HTTP_X_HEXON_* Same values, also exposed via the standard HTTP_* loop.

Computed CGI envs:

GATEWAY_INTERFACE, REQUEST_METHOD, REQUEST_URI, QUERY_STRING,
SCRIPT_NAME, SCRIPT_FILENAME, PATH_INFO, PATH_TRANSLATED,
DOCUMENT_ROOT, DOCUMENT_URI, REQUEST_SCHEME,
SERVER_NAME, SERVER_PORT, SERVER_PROTOCOL, SERVER_SOFTWARE,
REMOTE_ADDR, REMOTE_HOST, REMOTE_PORT, REMOTE_USER,
CONTENT_LENGTH, CONTENT_TYPE, HTTP_HOST,
HTTPS, SSL_PROTOCOL, SSL_CIPHER (when TLS),
SSL_CLIENT_VERIFY, SSL_CLIENT_S_DN, SSL_CLIENT_S_DN_CN,
SSL_CLIENT_I_DN, SSL_CLIENT_M_SERIAL, SSL_CLIENT_V_START,
SSL_CLIENT_V_END (when client cert presented),
HTTP_* (all request headers, dashes→underscores, uppercased).
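
The HTTP_* rule in the last line (dashes to underscores, uppercased) can be sketched as follows. This is a minimal illustration with a hypothetical function name; it applies the rule to every header it is given and does not model how duplicate headers are merged.

```python
def header_envs(headers):
    """Map request headers to CGI-style HTTP_* environment variables."""
    envs = {}
    for name, value in headers.items():
        # Uppercase the header name and turn dashes into underscores.
        envs["HTTP_" + name.upper().replace("-", "_")] = value
    return envs
```

For example, an X-Hexon-User header surfaces as HTTP_X_HEXON_USER, which matches the HTTP_X_HEXON_* entries listed under identity envs.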

AUTH_TYPE: derived from request properties:

Bearer Authorization: Bearer ... (also auto when add_bearer=true)
Basic Authorization: Basic ...
Negotiate Authorization: Negotiate ... (Kerberos/SPNEGO)
Digest Authorization: Digest ...
Certificate client peer cert presented (mTLS, X.509 SSO, voluntary)
"" cookie-based session (passkey, OIDC, magic link, JIT-2FA)

Backends that want the Hexon-canonical auth method (passkey vs oidc vs x509 vs …) read HEXON_AUTH_METHOD instead — populated when add_extended_auth_headers = true on the mapping.
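
The AUTH_TYPE table above can be read as a small decision function. The sketch below is illustrative only: the function name and parameters are hypothetical, and the precedence between an Authorization header, add_bearer, and a peer certificate is an assumption, since this page does not state it.

```python
KNOWN_SCHEMES = ("Bearer", "Basic", "Negotiate", "Digest")


def derive_auth_type(authorization=None, has_peer_cert=False, add_bearer=False):
    """Return the AUTH_TYPE value for a request.

    Assumed precedence: Authorization header, then add_bearer, then peer cert.
    """
    if authorization:
        scheme = authorization.split(" ", 1)[0]
        for known in KNOWN_SCHEMES:
            if scheme.lower() == known.lower():
                return known
    if add_bearer:
        return "Bearer"
    if has_peer_cert:
        return "Certificate"
    # Cookie-based sessions (passkey, OIDC, magic link, JIT-2FA) map to "".
    return ""
```

A backend that needs to tell passkey from OIDC sessions cannot do so from the empty AUTH_TYPE, which is why HEXON_AUTH_METHOD exists.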

Health checks:

type = "tcp" Probes the FastCGI listener directly (recommended).
type = "http" Requires the backend to expose an HTTP health endpoint on a separate listener (PHP-FPM's pm.status_path, gunicorn /healthz, etc.).

Security:

- Null-byte URL paths rejected with 400.
- Missing/invalid Content-Length on POST/PUT/PATCH rejected with 411
(PHP-FPM hangs without it).
- SCRIPT_FILENAME path-traversal rejected (cleaned path must stay
under fcgi_script_root).
- Backend error output (FCGI_STDERR) routed to the audit log with the
request's correlation_id; never appears in the response body.
- Only Responder role; Authorizer/Filter roles silently ignored.
- SSRF blocklist (cloud metadata endpoints, link-local) applies when
routing through the connector.
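
As an illustration of the null-byte and SCRIPT_FILENAME checks listed above, the following sketch cleans the path lexically and verifies it stays under fcgi_script_root. The function name is hypothetical, and the sketch does not resolve symlinks; whether the product does is not stated here.

```python
import posixpath


def resolve_script_filename(script_root, script_name):
    """Join script_root and script_name, rejecting escapes from the root.

    Lexical check only (no symlink resolution); raises ValueError on reject.
    """
    if "\x00" in script_name:
        raise ValueError("null byte in path")
    root = posixpath.normpath(script_root)
    candidate = posixpath.normpath(posixpath.join(root, script_name.lstrip("/")))
    prefix = root if root.endswith("/") else root + "/"
    if candidate != root and not candidate.startswith(prefix):
        raise ValueError("path escapes fcgi_script_root")
    return candidate
```

Comparing against the root plus a trailing slash avoids accepting sibling directories such as /var/www/html-old when the root is /var/www/html.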

Operational notes:

- Connection pool reuses conns via FCGI_KEEP_CONN. When the backend
cycles a worker (e.g., PHP-FPM pm.max_requests), the next pooled-conn
use detects the close and dials fresh on retry.
- Retries: first-byte failures on pooled conns retry once on a fresh
dial. Failures after request body bytes have been sent do NOT retry
(avoids double-execution on POST/PUT/PATCH).
- STDERR severity: WARN by default, ERROR when status >= 400.
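
The retry rule in the notes above can be written as a single predicate. This is a sketch under assumptions: the function and parameter names are hypothetical, and "first-byte failure" is interpreted as a failure before any response byte was read.

```python
def should_retry(pooled_conn, attempt, body_bytes_sent, response_bytes_read):
    """Decide whether a failed FastCGI round trip may retry on a fresh dial."""
    if not pooled_conn:
        return False  # only failures on pooled conns are retried
    if attempt >= 1:
        return False  # retry once, never more
    if body_bytes_sent > 0:
        return False  # avoids double execution on POST/PUT/PATCH
    return response_bytes_read == 0  # first-byte failure only
```

The body-bytes check is what keeps a stale pooled connection from causing a non-idempotent request to run twice.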

Common deployment pattern:

[[proxy.mappings]]
app = "wordpress"
host = "blog.example.com"
path = "/.*"
service = ["fastcgi://php-fpm.internal:9000"]
fcgi_script_root = "/var/www/html"
auth = true
add_auth_headers = true
add_extended_auth_headers = true
groups = ["staff"]
# Heavy admin operations (Composer, large reports)
fcgi_read_timeout = "300s"

References:

RFC 3875 CGI/1.1 environment variables
FastCGI 1.0 spec https://fastcgi-archives.github.io/FastCGI_Specification.html

Request Shadow/Mirror

Mirrors live traffic to secondary backends for testing or migration — fully async, never affects the primary path

Overview

Duplicates live HTTP requests to a secondary backend for testing, canary validation, or migration. Shadow traffic is fully asynchronous — it never affects the primary request/response path or adds latency. Configurable per proxy mapping with sampling control.

Core capabilities:

  • Asynchronous dispatch (no blocking the main request path)
  • Configurable sampling via runtime_fraction (percentage or fractional modes)
  • Dedicated HTTP transport pool separate from main proxy traffic
  • Shadow identification headers for backend awareness (X-Hexon-Shadow-*)
  • Per-shadow timeout and body size limits to prevent resource exhaustion
  • Distributed trace ID propagation for end-to-end observability
  • Per-mapping shadow configuration with global defaults
  • Multiple shadow targets per proxy mapping (e.g., canary + analytics)
  • Connector site routing: shadow targets at remote sites via QUIC tunnels

Shadow dispatch flow:

1. Client request arrives at proxy handler
2. Proxy forwards to primary backend (normal flow)
3. For each configured shadow target, sampling decision is evaluated
4. If sampled, request is dispatched asynchronously to the shadow module
5. Shadow module replays the request to the shadow backend with its own transport
6. Shadow response is discarded (metrics only, no client impact)
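
The flow above can be sketched as a fire-and-forget dispatch. This is a conceptual illustration in Python, not the product's code: the function names, the thread pool, and its size are hypothetical. The sketch submits shadows before calling the primary, which is an assumption about ordering; the point it demonstrates is that the primary path never waits on a shadow.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker pool dedicated to shadow traffic.
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def _run_shadow(send, request):
    """Replay the request to one shadow target and discard the outcome."""
    try:
        send(request)
    except Exception:
        pass  # a real implementation would record an error metric here


def handle_request(request, primary, shadows, sampled):
    """Serve from the primary; dispatch sampled shadows without waiting."""
    for name, send in shadows.items():
        if sampled(name):
            _shadow_pool.submit(_run_shadow, send, request)  # no result awaited
    return primary(request)
```

Here primary and each value in shadows are callables taking the request, and sampled is a callable returning the sampling decision for a shadow name.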

Shadow identification headers (when add_headers is enabled):

X-Hexon-Shadow: "true" - Identifies this as a shadow request
X-Hexon-Shadow-Name: "<name>" - Shadow target name for routing/filtering
X-Hexon-Shadow-Source: "<host>" - Original request host
X-Hexon-Shadow-Time: "<unix>" - Unix timestamp of original request
X-Hexon-Trace-ID: "<uuid>" - Distributed trace ID
X-Forwarded-Host: "<host>" - Standard forwarded host header
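
A shadow backend can use these headers to recognize mirrored traffic, for example to skip side effects such as sending email. The helper below is a hypothetical backend-side sketch; only the header names and the "true" value come from the list above.

```python
def shadow_info(headers):
    """Return shadow metadata for a mirrored request, or None otherwise."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-hexon-shadow") != "true":
        return None
    return {
        "name": lowered.get("x-hexon-shadow-name"),
        "source": lowered.get("x-hexon-shadow-source"),
        "time": lowered.get("x-hexon-shadow-time"),
        "trace_id": lowered.get("x-hexon-trace-id"),
    }
```

Because the headers are only sent when add_headers is enabled, a backend should treat their absence as "unknown" rather than as proof of primary traffic.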

Sampling modes:

Percentage: runtime_fraction = { percent = 10 } for 10% of requests
Fractional: runtime_fraction = { numerator = 1, denominator = 1000 } for 0.1%

Use cases:

- Canary deployments: shadow 10% of traffic to new version before cutover
- Analytics pipelines: mirror requests to analytics backend for processing
- Migration validation: compare primary and shadow responses offline
- Load testing: replay production traffic to staging environments
- A/B backend testing: shadow to alternative implementation

Config

Global shadow defaults under [proxy.shadow]:

[proxy.shadow]
enabled = true # Enable shadow dispatch globally
timeout = "5s" # Default timeout for shadow requests
max_body_size = "10MB" # Maximum request body size to shadow
add_headers = true # Add X-Hexon-Shadow-* identification headers
max_idle_conns = 50 # Transport pool: max idle connections
max_idle_conns_per_host = 10 # Transport pool: max idle per host
max_conns_per_host = 100 # Transport pool: max total per host
idle_conn_timeout = 90 # Idle connection timeout (seconds)
tls_handshake_timeout = 10 # TLS handshake timeout (seconds)
tls_verify = true # Verify shadow backend TLS certificates

Per-mapping shadow configuration (overrides global defaults):

[[proxy.mapping]]
host = "api.example.com"
path = ".*"
service = ["https://primary.internal"]
[[proxy.mapping.shadow]]
name = "canary" # Shadow target name (used in metrics and headers)
service = "https://canary.internal:8443" # Shadow backend URL
runtime_fraction = { percent = 10 } # Sample 10% of requests
add_headers = true # Override global add_headers
[[proxy.mapping.shadow]]
name = "analytics"
service = "https://analytics.internal"
runtime_fraction = { numerator = 1, denominator = 1000 } # 0.1% sampling
timeout = "2s" # Override global timeout
# Shadow target at a remote site via connector tunnel:
[[proxy.mapping.shadow]]
name = "staging-mirror"
service = "https://staging.internal:8443"
site = "staging-eu" # Routes through connector tunnel

Sampling configuration:

Percentage mode (0-100):
runtime_fraction = { percent = 10 } # 10% of requests
Fractional mode (precise low rates):
runtime_fraction = { numerator = 1, denominator = 1000 } # 0.1%
No runtime_fraction: 100% of requests are shadowed.
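
The two sampling modes reduce to a single probability per request. The sketch below shows one way to evaluate it; the function names are hypothetical, the config is modeled as a plain dict, and the use of a uniform random draw is an assumption about how the decision is made.

```python
import random


def sampling_fraction(runtime_fraction):
    """Convert a runtime_fraction setting into a probability in [0, 1]."""
    if runtime_fraction is None:
        return 1.0  # no runtime_fraction: shadow every request
    if "percent" in runtime_fraction:
        return runtime_fraction["percent"] / 100.0
    return runtime_fraction["numerator"] / runtime_fraction["denominator"]


def should_shadow(runtime_fraction, rng=random.random):
    """Per-request sampling decision using a uniform draw in [0, 1)."""
    return rng() < sampling_fraction(runtime_fraction)
```

Passing rng explicitly makes the decision reproducible in tests; in normal use the default random source applies.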

Hot-reloadable: runtime_fraction, timeout, add_headers, max_body_size.
Cold (restart required): enabled, transport pool settings (max_idle_conns, etc.).

Troubleshooting

Common symptoms and diagnostic steps:

Shadow requests not reaching the backend:

- Verify [proxy.shadow] enabled = true
- Check shadow target name matches in proxy mapping configuration
- Verify shadow service URL is reachable from the Hexon server
- Test connectivity: 'net tcp <shadow-host>:<port> --tls'
- Check runtime_fraction is set correctly (0% = no traffic)
- Verify max_body_size is sufficient for the request payload
- Check shadow metrics: shadow_requests_total should be incrementing

Shadow requests timing out:

- Increase timeout setting (default 5s may be too short for slow backends)
- Check shadow backend health and response times
- Verify network path between Hexon and shadow backend
- Check max_conns_per_host limit (100 default) is not exhausted
- Monitor shadow_request_duration histogram for latency distribution

Transport pool exhaustion:

- Increase max_idle_conns (default 50) for high-traffic deployments
- Increase max_conns_per_host (default 100) for single-target shadows
- Reduce timeout to free connections faster
- Check idle_conn_timeout (default 90s) is appropriate
- Use short timeouts (2-5s) to prevent connection buildup

TLS errors to shadow backend:

- Verify shadow backend TLS certificate is valid
- Check tls_verify setting (set to false only for testing)
- Verify tls_handshake_timeout is sufficient (default 10s)
- Check if shadow backend requires specific TLS version or ciphers

Sampling rate seems incorrect:

- Percentage mode: percent = 10 means approximately 10% (not exact)
- Fractional mode: numerator/denominator for precise low rates
- Sampling is per-request, random; short windows may show variance
- Check shadow_requests_total vs total proxy requests for actual rate
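
To judge whether an observed rate is within normal variance, the expected count and its spread can be estimated with a binomial model. This sketch assumes each request is sampled independently with the configured probability; the function name is hypothetical.

```python
import math


def expected_shadow_count(total_requests, fraction):
    """Return (mean, standard deviation) of the shadowed-request count."""
    mean = total_requests * fraction
    stddev = math.sqrt(total_requests * fraction * (1.0 - fraction))
    return mean, stddev
```

For 1000 proxied requests at percent = 10, this gives a mean of 100 with a standard deviation of about 9.5, so a count anywhere from roughly 81 to 119 is unremarkable.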

Shadow affecting primary request latency:

- Shadow dispatch should be fire-and-forget (no Wait())
- If primary slows down, check if body buffering is the cause
- Reduce max_body_size to limit memory allocation for large payloads
- Verify shadow dispatch is non-blocking (asynchronous)

Metrics and monitoring:

- shadow_requests_total{shadow_name}: total dispatched shadow requests
- shadow_success_total{shadow_name}: requests with 2xx/3xx responses
- shadow_errors_total{shadow_name, error_type}: requests with errors
- shadow_request_duration{shadow_name}: latency histogram

Relationships

Module dependencies and interactions:

  • proxy: Primary consumer. The reverse proxy handler evaluates shadow configuration for each proxy mapping and dispatches shadow requests asynchronously when sampling criteria are met. Shadow config is nested under [[proxy.mapping.shadow]].
  • config: Global defaults from [proxy.shadow] merged with per-mapping shadow overrides. runtime_fraction, timeout, add_headers, and max_body_size are hot-reloadable. enabled and the transport pool settings require restart.
  • telemetry: Shadow metrics exported for monitoring: request counts, success/error counts, and latency histograms per shadow target name. Structured logging for dispatch and response events.
  • dns: Shadow backend hostnames resolved via the DNS module with standard resolution and caching behavior.
  • certs: TLS certificate verification for shadow backends uses the system trust store. tls_verify controls whether verification is enforced.
  • Cluster RPC: Shadow dispatch uses the fire-and-forget pattern to ensure zero impact on primary request path. No cluster coordination needed; shadow runs on the receiving node only.
  • connector: When a shadow target specifies a “site” parameter, requests route through the QUIC connector tunnel to the remote site instead of direct connection. Transport is cached per site for connection reuse.

Logs

Log entries emitted by shadow dispatch. Search with: logs search “shadow.dispatch”

All entries use log name “shadow.dispatch”. Levels: WARN > DEBUG. No AUDIT entries in this module.

Dispatch Lifecycle:

shadow.dispatch DEBUG Shadow request succeeded (shadow_name, status_code, latency_ms)
shadow.dispatch WARN Shadow request failed with status (shadow_name, status_code, latency_ms)
shadow.dispatch WARN Shadow request error (shadow_name, error_type, latency_ms, error)

Metrics

Prometheus metrics. Query with: metrics prometheus shadow_<name>

Request Counts:

shadow_requests_total counter {shadow_name} Shadow requests dispatched
shadow_success_total counter {shadow_name} Shadow requests with 2xx/3xx responses
shadow_errors_total counter {shadow_name, error_type} Shadow request errors (error_type: request_creation|timeout|network|http_error)

Note: error_type “http_error” also includes a “status_code” label.

Latency:

shadow_request_duration latency {shadow_name} Shadow request round-trip duration

Alerts:

rate(shadow_errors_total[5m]) > 0 Shadow errors occurring — check backend health
rate(shadow_errors_total{error_type="timeout"}[5m]) > 5 High timeout rate — increase timeout or check backend
histogram_quantile(0.99, shadow_request_duration) > 5 p99 latency exceeds 5s — shadow backend slow