Connectivity

Client Access (HexonClient)

Transparent L3 network access via QUIC tunnels for CLI tools and native applications

Overview

The Client Access subsystem enables end users (DBAs, developers, operators) to transparently access internal resources through a lightweight QUIC tunnel. The HexonClient binary captures IP packets via TUN + gVisor netstack, extracts TCP flows, and dials each flow as a QUIC stream to the gateway.

The gateway side (this module) handles:

  • QUIC listener on a dedicated port with ALPN “hexon-client” and TLS 1.3
  • Two authentication paths: server-side device code (RFC 8628) for interactive use, JWT with RFC 5705 channel binding for reconnect/automation
  • Per-user route derivation from firewall ACL rules (CIDR + Site routes)
  • Virtual IP allocation from a dedicated subnet (default 100.64.208.0/22)
  • Per-stream firewall ACL check before dialing backends
  • Direct dial or connector tunnel routing based on HostAlias Site field
  • Bidirectional splice with 32KB pooled buffers and half-close propagation
  • DNS resolution on the control stream for split DNS
  • DNS defense-in-depth: per-session O(1) rate limiting + ACL enforcement (RFC 8914)
  • Token refresh with group-change detection and mid-session route updates
  • Cluster-wide session tracking

This mirrors the connector architecture but reversed: the client opens streams, the gateway accepts and dials backends.

Configuration

Configuration uses the [client_access] TOML section:

[client_access]
enabled = true
port = 8445
# network_interface = "" # Bind to specific interface (falls back to service.network_interface)
# cert = "" # Dedicated TLS cert (falls back to SNI/auto-TLS)
# key = "" # Dedicated TLS key
subnet = "100.64.208.0/22" # Virtual IP pool for clients (1022 addresses)
gateway_ip = "100.64.208.1" # Gateway IP within subnet (excluded from pool)
dns_upstream = ["10.0.0.53"] # DNS resolvers for client queries
dns_domains = [] # Additional DNS domains pushed to all clients
# cidrs = ["10.0.0.0/22"] # Additional CIDR routes pushed to all clients
heartbeat_interval = "30s" # Heartbeat interval (session TTL = 3x this)
token_refresh_interval = "45m" # Client token refresh interval
max_idle_timeout = "5m" # QUIC idle timeout
max_clients = 1000 # Maximum concurrent client connections
max_streams_per_client = 100 # Maximum concurrent TCP streams per client
dns_rate_limit = 100 # Maximum DNS queries per second per client
# required_groups = ["engineers", "operators"] # Empty = any authenticated user

Each connected client gets one virtual IP from the pool — use a dedicated CGN-space subnet to avoid overlap with other networks.
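As an illustration of the pool sizing above, here is a minimal Python sketch that enumerates the assignable addresses for the default subnet. The function name build_vip_pool is hypothetical, and the sketch assumes the network and broadcast addresses are never handed out; the gateway's real allocator may differ.

```python
import ipaddress

def build_vip_pool(subnet: str, gateway_ip: str):
    """Return client-assignable addresses in subnet, excluding the gateway IP."""
    net = ipaddress.ip_network(subnet)
    gateway = ipaddress.ip_address(gateway_ip)
    if gateway not in net:
        raise ValueError("gateway_ip must lie inside subnet")
    # hosts() skips the network and broadcast addresses (assumption: those
    # are not assignable to clients).
    return [ip for ip in net.hosts() if ip != gateway]

pool = build_vip_pool("100.64.208.0/22", "100.64.208.1")

# 100.64.0.0/10 is the shared (CGN) address space defined by RFC 6598.
cgn_space = ipaddress.ip_network("100.64.0.0/10")
in_cgn_space = ipaddress.ip_network("100.64.208.0/22").subnet_of(cgn_space)
```

Under these assumptions the default /22 leaves 1021 addresses for clients once the gateway IP is excluded, which still covers the default max_clients of 1000.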

Routes pushed to clients come from two sources:

  1. Firewall host aliases: CIDRs and IPs from aliases matched by user groups
  2. Config-level cidrs: pushed to all clients regardless of group membership

Both sources are merged and deduplicated before being sent in ClientAck.
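The merge of the two route sources can be sketched as follows. This is an illustrative Python example, not the gateway's code: merge_routes is a hypothetical name, and treating a bare IP as a /32 (or /128) route is an assumption.

```python
import ipaddress

def merge_routes(alias_routes, config_cidrs):
    """Union two route lists, dropping duplicates, in a stable sorted order."""
    nets = set()
    for entry in list(alias_routes) + list(config_cidrs):
        # strict=False lets a bare IP such as "10.0.1.5" become a host route.
        nets.add(ipaddress.ip_network(entry, strict=False))
    # Sort by IP version first so IPv4 and IPv6 networks never get compared.
    ordered = sorted(nets, key=lambda n: (n.version, int(n.network_address), n.prefixlen))
    return [str(n) for n in ordered]

routes = merge_routes(
    ["10.0.0.0/22", "10.0.1.5"],          # from firewall host aliases
    ["10.0.0.0/22", "192.168.1.0/24"],    # from [client_access] cidrs
)
```

The duplicate 10.0.0.0/22 entry appears only once in the result.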

Admin commands

Admin CLI commands:

clients list [--user=X]           List connected hexonclient sessions (cluster-wide)
clients show <session_id>         Show full session details (device, network, streams, traffic, timing)
clients disconnect <user> [id]    Disconnect all sessions for user, or a specific session [WRITE]

Bastion shell commands (self-service, filtered to own sessions):

clients                            List your active hexonclient sessions
clients list                       Same as above
clients disconnect [session_id]    Disconnect your own session(s)

Security

Two authentication paths (determined by whether client sends a token):

Device code flow (interactive — RFC 8628, same as bastion SSH):

  1. Client establishes QUIC/TLS 1.3 connection (ALPN: “hexon-client”)
  2. Client sends ClientAuth with empty token (signals device code request)
  3. Gateway initiates device code authorization (server-side, no HTTP from client)
  4. Gateway sends DeviceCodeChallenge: verification URI, user code, expiry
  5. Client displays QR code + clickable URL + user code
  6. Gateway polls the device code service until authorized, denied, or expired
  7. On authorization: gateway extracts claims (username, email, groups) from poll response
  8. Gateway checks required_groups, derives routes, allocates VIP
  9. Gateway sends ClientAck with VIP, routes, DNS, and JWT tokens for reconnection

Reconnected sessions use the JWT path below (no re-authentication needed).

JWT flow (reconnect / automation):

  1. Client establishes QUIC/TLS 1.3 connection (ALPN: “hexon-client”)
  2. Client sends ClientAuth: JWT + HMAC-SHA256(token, TLS exporter) proof
  3. Gateway validates JWT (extracts username, groups)
  4. Gateway verifies channel binding proof (RFC 5705 prevents token replay)
  5. Gateway checks required_groups (if configured): user must have ANY listed group
  6. Client sends ClientRegister with device metadata
  7. Gateway derives per-user routes from firewall ACL rules
  8. Gateway sends ClientAck with VIP, routes, DNS, token refresh interval
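Steps 2 and 4 of the JWT flow can be illustrated with a short Python sketch. The source only states HMAC-SHA256(token, TLS exporter); which value serves as the HMAC key is an assumption here (the exporter keys the HMAC, the token is the message), and the function names are hypothetical.

```python
import hashlib
import hmac

def make_proof(token: bytes, exporter: bytes) -> bytes:
    """Client side: bind the JWT to this TLS session."""
    # Assumption: exported keying material is the key, the token the message.
    return hmac.new(exporter, token, hashlib.sha256).digest()

def verify_proof(token: bytes, exporter: bytes, proof: bytes) -> bool:
    """Gateway side: recompute the proof from its own exporter value."""
    expected = make_proof(token, exporter)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, proof)

# Stand-ins for the RFC 5705 exporter output of two different TLS sessions.
session_a = b"\x01" * 32
session_b = b"\x02" * 32
jwt = b"header.payload.signature"
proof = make_proof(jwt, session_a)
```

Because each TLS session yields different exporter output, a token and proof captured from one connection do not verify on another, which is the replay protection the flow relies on.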

Per-stream access control:

  • Each QUIC stream carries a DialHeader (host, port, protocol)
  • Gateway checks firewall access control (user groups vs target host/port/protocol)
  • Denied streams get DialStatusDenied response immediately
  • Only allowed streams proceed to backend dial

DNS defense-in-depth:

  • Per-session rate limiting: O(1) time-bucketed rolling window (dns_rate_limit qps)
  • DNS ACL: after resolve, firewall checks user groups vs host aliases
  • ACL-denied queries return DNSStatusDenied (RFC 8914 REFUSED) — prevents information leakage
  • ACL call failure fails open (dial-time ACL is the authoritative control)
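One way to get a constant-time rolling window is a fixed ring of time buckets with a running total. The Python sketch below is illustrative only: the class name, the choice of 10 buckets per second, and the explicit now argument are assumptions, not details from the gateway.

```python
class RollingWindowLimiter:
    """Rolling-window limiter built from a fixed ring of time buckets."""

    def __init__(self, limit: int, window: float = 1.0, buckets: int = 10):
        self.limit = limit
        self.buckets = buckets
        self.width = window / buckets      # seconds covered by one bucket
        self.counts = [0] * buckets
        self.total = 0                     # sum of all live buckets
        self.last_tick = 0

    def allow(self, now: float) -> bool:
        # Ignore clock regressions by never moving backwards.
        tick = max(int(now / self.width), self.last_tick)
        # Expire buckets that slid out of the window. The loop runs at most
        # `buckets` times, so the cost is bounded by a constant.
        steps = min(tick - self.last_tick, self.buckets)
        for i in range(1, steps + 1):
            slot = (self.last_tick + i) % self.buckets
            self.total -= self.counts[slot]
            self.counts[slot] = 0
        self.last_tick = tick
        if self.total >= self.limit:
            return False
        self.counts[tick % self.buckets] += 1
        self.total += 1
        return True

limiter = RollingWindowLimiter(limit=3)
burst = [limiter.allow(0.0) for _ in range(4)]
still_limited = limiter.allow(0.55)   # the burst is still inside the window
recovered = limiter.allow(1.05)       # the burst bucket has expired
```

Memory and per-query work stay fixed regardless of query volume, at the cost of bucket-sized granularity at the window edge.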

Token refresh:

  • Client sends TokenRefresh with new JWT + proof before token expires
  • Gateway re-validates JWT and channel binding
  • Gateway re-checks required_groups: if user lost membership, connection is terminated
  • If groups changed: re-derive routes, send RouteUpdate with add/remove entries
  • Bad token on refresh kills the connection (security boundary)
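The group check and the route diff performed on refresh can be sketched in Python as below. The function names and the dictionary shape of the update are hypothetical; only the add/remove semantics and the "ANY listed group" rule come from the text above.

```python
def has_required_group(user_groups, required_groups) -> bool:
    """True if the user holds at least one required group."""
    # An empty required_groups list means any authenticated user is accepted.
    if not required_groups:
        return True
    return bool(set(user_groups) & set(required_groups))

def route_update(old_routes, new_routes) -> dict:
    """Compute the add/remove entries a RouteUpdate would carry."""
    old, new = set(old_routes), set(new_routes)
    return {"add": sorted(new - old), "remove": sorted(old - new)}

update = route_update(
    ["10.0.0.0/22", "10.1.0.0/24"],   # routes derived at connect time
    ["10.0.0.0/22", "10.2.0.0/24"],   # routes derived after the group change
)
```

Sending only the difference keeps unchanged routes, and the streams using them, untouched during a mid-session update.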

Troubleshooting

Common symptoms and diagnostic steps:

Client cannot connect:

- Check listener: 'status summary' shows clientaccess listener state
- Check config: 'config show client_access' (enabled, port, subnet)
- Check required_groups: 'config show client_access' (user must be in at least one listed group; empty = any)
- Check certs: 'certs list' or 'diagnose domain <hostname>'
- Max clients reached: 'logs search clientaccess --level=warn'
- Group denied: 'logs search clientaccess --level=warn' shows "group access denied"

Client connected but cannot reach services:

- Check pushed routes: 'config show client_access' (the destination subnet must be covered by cidrs or by a host alias matched by the user's groups)
- Check firewall rules: user's groups must match rule sources
- Check HostAlias: destination alias must have matching hosts (CIDRs for TUN routes, wildcards for DNS only)
- Check connector: if Site is set, connector must be connected
- 'logs search clientaccess-dial --level=warn' for denied dials

DNS not resolving:

- Check dns_upstream config: must point to reachable resolvers
- Check dns_domains: domains must be in the pushed list for split DNS
- 'logs search clientaccess-dns' for resolution errors
- DNS ACL denied (REFUSED): check user's groups match firewall rules for the hostname
- DNS rate limited (SERVFAIL): check dns_rate_limit setting (default 100 qps)

Token refresh failures:

- 'logs search clientaccess-refresh --level=warn'
- Invalid token: OIDC provider may have rotated keys
- Channel binding failure: possible MITM or TLS session change

Relationships

Module dependencies:

  • devicecode: Server-side device code authorization (RFC 8628) for interactive authentication
  • oidc: JWT validation for reconnect/automation authentication
  • firewall: Per-stream access control, DNS ACL enforcement, host alias route derivation
  • dns: DNS resolution for client split DNS queries
  • sessions: Cluster-wide session tracking (create, validate, revoke)
  • connectors: Site-based routing through connector tunnels
  • IP pool: Virtual IP allocation from dedicated subnet
  • listener: QUIC listener with TLS 1.3 and idle timeout
  • telemetry: Structured logging and Prometheus metrics

Logs

Log entries by component. Search with: logs search "clientaccess"

Levels: ERROR > WARN > INFO > DEBUG.

Lifecycle:

clientaccess INFO initializing client access subsystem
clientaccess ERROR failed to create IP pool
clientaccess ERROR TLS config not available, client access listener disabled
clientaccess ERROR failed to create client access listener
clientaccess ERROR failed to start client access listener
clientaccess INFO client access listener started

Connection:

clientaccess INFO AUDIT client connected (VIP, routes, hostname)
clientaccess INFO AUDIT client disconnected (duration, traffic stats)
clientaccess WARN client rejected: max clients reached
clientaccess WARN unexpected first message type

Registration:

clientaccess INFO client registered (session, VIP, hostname)
clientaccess INFO client unregistered (session, duration, traffic counters)

Authentication — JWT:

clientaccess INFO/WARN client auth failed (INFO for PAT rejection, WARN otherwise)
clientaccess WARN channel binding failed

Authentication — Device Code:

clientaccess WARN device code auth rejected: concurrency limit reached
clientaccess WARN device code authorization request failed
clientaccess INFO device code challenge sent, waiting for authorization
clientaccess INFO client disconnected during device code auth
clientaccess INFO device code authorized
clientaccess INFO device code denied by user
clientaccess INFO device code expired

Authorization:

clientaccess WARN group access denied

Token Refresh:

clientaccess WARN token refresh failed: invalid token
clientaccess WARN token refresh failed: channel binding
clientaccess WARN group access revoked on refresh
clientaccess INFO token refreshed with group change
clientaccess DEBUG token refreshed

PAT Revocation:

clientaccess INFO disconnected clients after PAT revocation

Dial:

clientaccess WARN dial denied by ACL
clientaccess DEBUG dial failed
clientaccess DEBUG udp dial failed
clientaccess DEBUG dial accept stream error

Traffic:

clientaccess DEBUG client traffic

Hexdcall Module:

clientaccess.list_sessions WARN Registry not initialized
clientaccess.list_sessions DEBUG Listed client access sessions
clientaccess.disconnect_session WARN Username missing in disconnect request
clientaccess.disconnect_session WARN Registry not initialized
clientaccess.disconnect_session INFO Session not found on this node
clientaccess.disconnect_session INFO Disconnected client access session
clientaccess.disconnect_session INFO Disconnected all client access sessions for user

Metrics

Prometheus metrics. Query with: metrics prometheus clientaccess_<name>

Connections:

clientaccess_connections_total counter {} QUIC connections accepted
clientaccess_connections_active gauge {} Currently active QUIC connections
clientaccess_connections_rejected counter {reason} Connections rejected before auth
clientaccess_connection_duration latency {username?} Connection lifetime

Authentication:

clientaccess_auth_success_total counter {username?} Successful authentications
clientaccess_auth_failures_total counter {reason} Failed authentications

Clients:

clientaccess_clients_active gauge {} Registered client instances

Heartbeat:

clientaccess_heartbeat_latency latency {username?} Heartbeat RTT (raw)

Dial:

clientaccess_dials_total counter {} Dial requests received
clientaccess_dials_denied_total counter {} Dials denied by ACL
clientaccess_dials_success_total counter {} Dials completed successfully
clientaccess_dials_errors_total counter {} Dial errors (connect refused, timeout)
clientaccess_dial_latency latency {} Backend dial time
clientaccess_streams_active gauge {} Active QUIC dial streams

DNS:

clientaccess_dns_queries_total counter {} DNS queries processed

Alerts:

clientaccess_connections_active > max_clients * 0.9     Approaching client limit
rate(clientaccess_connections_rejected[5m]) > 10        Connection rejection spike
rate(clientaccess_auth_failures_total[5m]) > 10         Authentication failure spike
rate(clientaccess_dials_denied_total[5m]) > 20          ACL denial spike

QUIC Connector

Connects remote sites to the gateway via outbound QUIC tunnel — no inbound ports required at the remote site

Overview

Enables access to services at remote sites without IPsec or opening inbound ports. A lightweight binary (hexonconnect) deployed at the remote site establishes an outbound QUIC connection to Hexon, and the gateway routes traffic through that tunnel. All protocols work through connectors: HTTP proxy, SSH bastion, forward proxy, and SQL bastion.

Hexon sends “dial” commands through the tunnel whenever a proxy mapping, bastion session, forward proxy rule, or firewall policy references that site via the “site” parameter.

Key capabilities:

  • Zero-trust remote access: connector dials only what Hexon asks, nothing else
  • Opaque site namespace: same IPs and DNS names across sites are irrelevant
  • Stateless token auth: HMAC-derived tokens validated without storage
  • Channel-bound authentication: RFC 5705 TLS Exported Keying Material prevents replay and MITM attacks — the token never travels on the wire
  • Multi-instance HA: multiple connectors per site with adaptive load balancing
  • Cross-node routing: any cluster node can route to any connector via adaptive inter-node forwarding — requests arriving at a node without connector instances are transparently forwarded to a node that has them
  • Auto-reconnect: connector never gives up, exponential backoff on disconnect
  • CDN-compatible: optional dedicated hostname and TLS certificate for direct access

Configuration

Configuration uses the [connector] TOML section:

[connector]
enabled = true
port = 8444
# hostname = "connector.example.com" # Optional: dedicated hostname (CDN bypass)
# cert = "/path/to/cert.pem" # Optional: file path or inline PEM
# key = "/path/to/key.pem" # Optional: file path or inline PEM
[[connector.sites]]
id = "prod-asia-a8f3c1"
name = "Production Asia"
cidrs = ["203.0.113.0/24"]
max_instances = 3
rebalance = true # Distribute across cluster nodes (default: true)
rebalance_retries = 5 # Accept after N soft-rejects (default: 5, 1-10)

TLS certificate resolution:

1. connector.cert/key when set (static certificate)
2. SNI callback: auto-TLS (ACME), certmanager, wildcard, or service certificate

If connector.hostname is set and no cert/key is provided, ACME automatically provisions a certificate for the connector hostname.

Usage across subsystems — add “site” parameter:

[[proxy.mapping]]
app = "API Asia"
host = "api-asia.example.com"
service = "http://api.default.svc.cluster.local:8080"
site = "prod-asia-a8f3c1"
# Shadow targets can also route through connectors:
[[proxy.mapping.shadow]]
name = "staging-mirror"
service = "https://staging.internal:8443"
site = "staging-eu"
# Circuit breaker fallback can use a different connector site:
[proxy.mapping.circuit_breaker]
fallback_mode = "service"
fallback_service = ["http://dr-backend:8080"]
fallback_site = "dr-europe"
# SSH cert rules — route bastion SSH through connector:
[[bastion.ssh_cert.rules]]
name = "remote-dc-ssh"
groups = ["devops"]
destinations = ["*.internal"]
site = "prod-asia-a8f3c1"
# SQL bastion — route database connections through connector:
[[sql_bastion.sites]]
name = "postgres-remote"
type = "postgres"
host = "pg.internal"
port = 5432
site = "prod-asia-a8f3c1"
# Firewall host aliases — route forward proxy traffic through connector:
[[firewall.aliases.hosts]]
name = "remote_services"
hosts = ["gitlab.internal", "jenkins.internal"]
site = "prod-asia-a8f3c1"
# Aliases with site skip nft rules — traffic goes through userspace QUIC tunnel

Token generation is deterministic from the cluster key — any node can validate.

Admin commands

Admin CLI:

connector list                   List configured sites and live connections
connector show <site-id>         Show site config, token, and connected instances
                                 (includes platform, origin with geo/ASN, system labels)
connector create <site-id>       Create new site (generates token)
connector revoke <site-id>       Block site, disconnect active QUIC tunnels
connector instances <site-id>    List connected instances with metrics

The “connector show” output includes per-instance details: platform (OS/arch), origin IP with country and ASN (via geo module), and system labels reported by the connector binary (kernel, OS version, runtime environment, memory, virtualization, PID, UID/GID).

The “connector revoke” command disconnects active QUIC tunnels in addition to revoking cluster sessions, causing connectors to reconnect (and be rejected).

Config reload cleanup: when a site is removed from config (via GitOps or hot reload), active QUIC connections for that site are automatically disconnected. The connector binary will reconnect but be rejected because the site is no longer in config. This prevents stale sessions from lingering in JetStream KV.

Security

Trust boundaries:

  • Hexon Cluster (full trust): policy enforcement, identity, routing
  • Connector (minimal trust): dials only what Hexon asks, no autonomous access

Authentication flow:

  1. QUIC/TLS 1.3 connection established (server cert, ECDHE, forward secrecy)
  2. Both sides compute TLS exporter keying material (RFC 5705) with an application-specific label
  3. Connector sends: site_id + HMAC of token bound to the TLS channel
  4. Hexon validates by recomputing from cluster key

Additional protections:

  • Optional CIDR allowlist per site restricts connector source IPs
  • max_instances limit prevents token abuse
  • Instance selection uses epsilon-greedy adaptive algorithm with circuit breaker
  • QUIC relay loop prevention: relay handler only dispatches locally, preventing infinite forwarding loops between nodes
  • Cluster-wide rebalancing: soft-rejects excess connectors so they redistribute across gateway nodes (configurable per site, default 5 retries before accepting)

Inter-node forwarding

All cluster nodes can route to any connector site through QUIC relay.

When a request arrives at a node without local connector instances (or after local retries are exhausted), the dispatcher transparently relays through a peer node. The relay uses QUIC on the same connector port (8444) with ALPN “hexon-relay” and mTLS for peer authentication. Each relay request opens a QUIC stream, sends a dispatch header, and the peer dispatches locally through its QUIC connector tunnel.

All traffic types converge through the same dispatch path — this covers reverse proxy, forward proxy, client access (TCP/UDP), SSH bastion, SQL bastion, shadow targets, and probes.

Remote node IPs are cached (5s refresh) from cluster-wide connector sessions. Failed nodes are tracked by the cluster discovery health checks.

Loop prevention:

  • The relay handler only dispatches locally (never relays further)
  • A peer with no local instances returns an immediate error

Troubleshooting relay:

  • Client-side metrics: relay_total (attempts), relay_success_total, relay_errors_total
  • Server-side metrics: relay_served (requests handled), relay_rejected_total (auth failures)
  • Relay rejected with “no_certificate”: peer isn’t presenting its service cert
  • Relay rejected with “not_peer”: source IP not in cluster discovery peer list
  • Relay “no_instances”: the peer node also has no local connectors for the site
  • Check logs: 'logs search connectors.relay --level=warn'

QUIC tuning

QUIC performance tuning applied to both gateway and connector sides:

Flow control windows (tuned for database and bulk transfer workloads):

- Stream: 2MB initial, 8MB max
- Connection: 4MB initial, 20MB max
- Stream-to-connection ratio: 1:2 initial, 2:5 max

Larger initial windows reduce round-trips for big responses (SQL results, file transfers).
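To see why window size matters, a rough rule of thumb is that a sender can have at most one window of data outstanding per round trip, so throughput is bounded by window / RTT. The Python sketch below applies that bound to the stream windows listed above; the 50 ms RTT is an arbitrary example and reading "MB" as 2^20 bytes is an assumption.

```python
def throughput_ceiling_mbit(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in Mbit/s when at most window_bytes are in flight per RTT."""
    return window_bytes * 8 / rtt_seconds / 1_000_000

MIB = 1024 * 1024
RTT = 0.050  # example round-trip time: 50 ms

stream_initial = throughput_ceiling_mbit(2 * MIB, RTT)   # about 335.5 Mbit/s
stream_max = throughput_ceiling_mbit(8 * MIB, RTT)       # about 1342.2 Mbit/s
```

Real throughput also depends on congestion control and loss, so these figures are ceilings rather than predictions.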

Persistent QUIC transport (connector side):

- hexonconnect reuses one UDP socket across reconnections
- Avoids per-connection socket allocation and kernel offload state loss
- Enables future QUIC connection migration if network interface changes

Stream error handling:

- Error paths immediately release QUIC stream resources instead of graceful close
- Frees resources under load without waiting for peer acknowledgment

Max concurrent streams:

- Gateway: configurable per listener (default 100)
- Connector: 1024 (high concurrency for multiplexed tunnel streams)

Rebalancing

When multiple connector replicas start simultaneously (e.g., Kubernetes Deployment with 3 replicas), they may all connect to the same gateway node via DNS or a load balancer. The rebalance mechanism redistributes them:

  1. First connector for a site on a node is always accepted
  2. Subsequent connectors check cluster distribution: if this node has more instances than the least-loaded remote node, the registration is soft-rejected
  3. The connector reconnects with a short backoff (2 seconds) — DNS/LB randomness typically sends it to a different node
  4. After N soft-rejects (configurable, default 5), the node accepts anyway

Per-site configuration:

rebalance = true # Enable cluster-wide load distribution (default: true)
rebalance_retries = 5 # Max soft-rejects before accepting (1-10, default: 5)

Rebalance is best-effort — sticky load balancers may prevent redistribution, so the retry budget ensures connectors are never stuck. Metrics: rebalance_reject_total and rebalance_accept_total track distribution activity per site.
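The acceptance rule described in this section can be sketched as a single decision function. This Python example is hypothetical: the function name, the way soft-reject counts are tracked, and the inputs are assumptions layered on the four steps above.

```python
def should_soft_reject(local_count: int, remote_counts: list,
                       rejects_so_far: int, rebalance: bool = True,
                       rebalance_retries: int = 5) -> bool:
    """Decide whether a registering connector should be asked to reconnect."""
    if not rebalance:
        return False
    # Step 1: the first connector for a site on this node is always accepted.
    if local_count == 0:
        return False
    # Step 4: after N soft-rejects the node accepts anyway.
    if rejects_so_far >= rebalance_retries:
        return False
    # With no remote nodes to compare against there is nowhere to redistribute.
    if not remote_counts:
        return False
    # Step 2: reject when this node already holds more instances than the
    # least-loaded remote node.
    return local_count > min(remote_counts)
```

For example, a node already holding two instances while a peer holds none would soft-reject a third registration until the retry budget is spent.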

Logs

Log entries by component. Search with: logs search "connectors"

Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Initialization:

connectors INFO initializing connector subsystem
connectors ERROR TLS config not available, connector listener disabled
connectors ERROR failed to create connector listener
connectors ERROR failed to start connector listener
connectors INFO connector listener started

Authentication:

connectors.handler WARN AUDIT connector auth failed: invalid proof
connectors.handler WARN AUDIT connector auth failed: unknown site
connectors.handler WARN AUDIT connector auth failed: source IP not allowed

Connection lifecycle:

connectors.handler INFO AUDIT connector connected
connectors.handler INFO AUDIT connector disconnected

Registry:

connectors.registry INFO AUDIT Connector instance registered
connectors.registry INFO AUDIT Connector instance unregistered

Session management:

connectors WARN failed to create session
connectors WARN session create wait failed
connectors WARN unexpected session create response type
connectors DEBUG failed to extend session
connectors DEBUG session extend wait failed
connectors WARN failed to revoke session
connectors WARN session revoke wait failed

Config reload:

connectors.reload INFO disconnected instances for removed site

Relay:

connectors.relay WARN AUDIT relay rejected: source IP not a cluster peer
connectors.relay DEBUG relay connection accepted
connectors.relay WARN relay fallback also failed after local exhaustion

Metrics

Prometheus metrics. Query with: metrics prometheus connectors_<name>

Connections:

connectors_connections_total counter {} Total connector connections
connectors_connections_active gauge {} Active connector connections
connectors_connections_rejected counter {reason} Rejected connections
connectors_connection_duration latency {site_id} Connection lifetime

Authentication:

connectors_auth_success_total counter {site_id} Successful authentications
connectors_auth_failures_total counter {site_id, reason} Authentication failures

Instances:

connectors_instances_active gauge {site_id} Active connector instances
connectors_heartbeat_latency latency {site_id} Heartbeat round-trip time

Dial (tunnel dispatch):

connectors_dials_total counter {site_id} Dial attempts through tunnel
connectors_dials_success_total counter {site_id} Successful dials
connectors_dials_errors_total counter {site_id, reason} Failed dials
connectors_dial_latency latency {site_id} Dial latency
connectors_streams_active gauge {} Active QUIC streams

Rebalance:

connectors_rebalance_reject_total counter {site_id} Soft-rejected for rebalance
connectors_rebalance_accept_total counter {site_id} Accepted after rebalance check

Inter-node forwarding (TCP-level):

connectors_forward_total counter {site_id, target} Forward attempts to peer node
connectors_forward_success_total counter {site_id, target} Successful forwards
connectors_forward_errors_total counter {site_id, target} Failed forwards
connectors_forward_latency latency {site_id, target} Forward latency
connectors_forward_local_total counter {site_id} Requests handled locally

Relay (QUIC inter-node dispatch):

connectors_relay_total counter {site_id, target} Client-side relay attempts
connectors_relay_served counter {site_id, target} Server-side relay requests handled
connectors_relay_success_total counter {site_id, target} Successful relay dispatches
connectors_relay_errors_total counter {site_id, reason} Failed relay dispatches
connectors_relay_rejected_total counter {reason} Relay connections rejected (auth)

Alerts:

rate(connectors_auth_failures_total[5m]) > 5       High auth failure rate (brute force or misconfiguration)
connectors_instances_active == 0                   No connector instances (site unreachable)
rate(connectors_dials_errors_total[5m]) > 10       High dial failure rate (tunnel health)
rate(connectors_relay_rejected_total[5m]) > 0      Relay auth failures (cluster misconfiguration)
connectors_connections_active > 100                High connection count

DNS Resolution

Resolves DNS for all gateway components — custom resolvers, DNSSEC validation, caching, and health-aware failover

Overview

Handles DNS resolution for all gateway components — proxy backends, bastion hosts, cluster discovery, and ACME validation. Provides custom resolvers with automatic failover, caching, and DNSSEC validation.

Capabilities:

  • Custom DNS resolvers with automatic failover and circuit breaker pattern
  • DNSSEC validation in two modes: resolver-trust (fast) and full cryptographic (secure)
  • Distributed DNS caching via the memory storage module (local reads, broadcast writes)
  • Lookup coalescing to prevent cache poisoning from concurrent requests
  • Hostname validation to block DNS injection attacks (null bytes, CRLF)
  • IPv4 preference when both A and AAAA records are available
  • CNAME flattening with configurable depth limit (default 16, per RFC 1034)
  • DNS-over-TLS (DoT) support for encrypted transport (RFC 7858)
  • Adaptive resolver selection using epsilon-greedy algorithm (20-40% lower latency)
  • Health checking with exponential backoff and automatic system DNS fallback
  • Typed DNS queries for 30+ record types (A, AAAA, CAA, TLSA, SRV, MX, etc.)
  • Context propagation for request cancellation and graceful shutdown
  • TTL sanitization to prevent integer overflow attacks (capped at 1 week)

Operations:

  • Resolve: DNS resolution with optional DNSSEC, caching, and resolver selection
  • ValidateHostname: RFC-compliant hostname validation against injection attacks
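A hostname check of the kind ValidateHostname performs might look like the Python sketch below. The exact rules of the real operation are not documented here, so this version assumes the conventional letters-digits-hyphen label syntax with a 253-character limit, plus explicit rejection of null bytes and CR/LF.

```python
import re

# One DNS label: 1-63 chars, letters/digits/hyphens, no leading/trailing hyphen.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def validate_hostname(name: str) -> bool:
    """Reject hostnames that are malformed or carry injection characters."""
    if not name or len(name) > 253:
        return False
    # Null bytes and CR/LF are the injection vectors named in the text.
    if any(ch in name for ch in "\x00\r\n"):
        return False
    # Allow a single trailing dot (fully qualified form).
    if name.endswith("."):
        name = name[:-1]
    return all(_LABEL.fullmatch(label) for label in name.split("."))
```

Using fullmatch on every label means anything outside the allowed alphabet, including whitespace and control characters, fails the check.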

Config

Core configuration under [dns]:

[dns]
timeout = 5 # DNS query timeout in seconds (default: 5)
cache_ttl = 300 # Default cache TTL in seconds (default: 300)
cache_override = false # Ignore DNS server TTL, always use cache_ttl (default: false)
resolvers = ["1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"] # DNS resolvers (default: cluster.cluster_dns_resolvers)
flatten_cname = true # Follow CNAMEs to final A/AAAA records (default: true)
max_cname_depth = 16 # Max CNAME chain depth to prevent loops (default: 16)

DNSSEC settings:

dnssec_full_validation = false # Full cryptographic RRSIG/DNSKEY validation (default: false)
dnssec_strict = false # Fail if zone is not DNSSEC-signed (default: false)

DNS-over-TLS (DoT):

dot_enabled = false # Enable DNS-over-TLS transport (default: false)
dot_port = 853 # DoT port per RFC 7858 (default: 853)
dot_verify_server_cert = true # Verify resolver TLS certificate (default: true)

Health checking:

health_check_enabled = true # Enable resolver health monitoring (default: true)
health_check_interval = 30 # Health check interval in seconds (default: 30)
health_failure_threshold = 2 # Consecutive failures before marking unhealthy (default: 2)
health_check_query = "google.com" # Domain used for health check probes (default: "google.com")

Adaptive resolver selection (epsilon-greedy ML):

adaptive_selector_enabled = true # Enable adaptive resolver selection (default: true)
adaptive_exploration_rate = 0.10 # Exploration rate 0.0-1.0 (default: 0.10 = 10%)
adaptive_smoothing_factor = 0.3 # EMA smoothing factor for latency tracking (default: 0.3)
adaptive_min_sample_size = 100 # Queries before switching from learning to intelligent mode (default: 100)
adaptive_load_balance_enabled = true # Penalize recently-used resolvers to spread load (default: true)
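To show how the exploration rate and smoothing factor interact, here is a minimal epsilon-greedy selector in Python. It is a sketch under stated assumptions: the class and method names are hypothetical, and the learning phase (adaptive_min_sample_size) and load-balance penalty are deliberately left out.

```python
import random

class AdaptiveSelector:
    """Epsilon-greedy resolver choice over EMA-smoothed latencies."""

    def __init__(self, resolvers, exploration_rate=0.10, smoothing=0.3, rng=None):
        self.resolvers = list(resolvers)
        self.exploration_rate = exploration_rate
        self.smoothing = smoothing
        self.latency = {}                  # resolver -> EMA latency in ms
        self.rng = rng or random.Random()

    def record(self, resolver: str, latency_ms: float) -> None:
        prev = self.latency.get(resolver)
        if prev is None:
            self.latency[resolver] = latency_ms
        else:
            # EMA: new = alpha * sample + (1 - alpha) * previous
            a = self.smoothing
            self.latency[resolver] = a * latency_ms + (1 - a) * prev

    def pick(self) -> str:
        # Try every resolver once before trusting the scores.
        for r in self.resolvers:
            if r not in self.latency:
                return r
        # Explore with probability epsilon, otherwise exploit the fastest.
        if self.rng.random() < self.exploration_rate:
            return self.rng.choice(self.resolvers)
        return min(self.resolvers, key=self.latency.__getitem__)

selector = AdaptiveSelector(["1.1.1.1:53", "8.8.8.8:53"], exploration_rate=0.0)
first = selector.pick()
selector.record("1.1.1.1:53", 100.0)
selector.record("1.1.1.1:53", 0.0)     # EMA becomes 0.3*0 + 0.7*100 = 70
selector.record("8.8.8.8:53", 20.0)
best = selector.pick()
```

A higher smoothing factor makes the score react faster to recent samples; a higher exploration rate spends more queries re-measuring resolvers that currently look slow.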

Resolver architecture — three separate resolver pools:

dns.resolvers # Infrastructure resolvers (health-checked, used by all modules)
cluster.cluster_dns_resolvers # Cluster discovery resolvers (fallback if dns.resolvers unset)
proxy.dns.resolvers # Proxy-specific override (must be subset of dns.resolvers)

Per-route proxy DNS overrides in [[proxy.mapping]]:

dnssec = true # Override global DNSSEC setting for this route
dns_resolvers = ["10.0.0.1:53"] # Override resolvers for this route (must be in dns.resolvers)

TTL precedence (cache_override=false): DNS server TTL > dns.cache_ttl > 300s default.
TTL precedence (cache_override=true): dns.cache_ttl > 300s default.
TTL bounds: minimum 1 second, maximum 604800 seconds (1 week).
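The precedence and bounds above can be expressed as one small function. This Python sketch uses a hypothetical name, effective_ttl, and represents "no value" with None; both are assumptions for illustration.

```python
MIN_TTL = 1          # seconds
MAX_TTL = 604800     # seconds (1 week)
DEFAULT_TTL = 300    # seconds

def effective_ttl(server_ttl=None, cache_ttl=None, cache_override=False) -> int:
    """Pick the cache TTL by precedence, then clamp it to the allowed bounds."""
    configured = cache_ttl if cache_ttl is not None else DEFAULT_TTL
    if cache_override or server_ttl is None:
        ttl = configured        # ignore (or lack) the DNS server's TTL
    else:
        ttl = server_ttl        # the DNS server's TTL wins
    return max(MIN_TTL, min(MAX_TTL, ttl))
```

Clamping after the precedence step means an oversized or zero TTL from a resolver can never reach the cache unchanged.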

Cache key format: “dns_cache:{hostname}:{resolver_hash}” (the hash is SHA-256 truncated to 128 bits).
Cache reads are local (no network). Cache writes broadcast to the cluster (fire-and-forget).
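A key in this format could be built as in the Python sketch below. What exactly feeds the resolver hash is not documented, so hashing the sorted resolver list and lowercasing the hostname are assumptions, as are the function names.

```python
import hashlib

def resolver_hash(resolvers) -> str:
    """128-bit fingerprint of a resolver set, as 32 hex characters."""
    # Assumption: sort first so the same set in a different order
    # maps to the same cache entry.
    joined = ",".join(sorted(resolvers)).encode()
    return hashlib.sha256(joined).hexdigest()[:32]   # 32 hex chars = 128 bits

def cache_key(hostname: str, resolvers) -> str:
    # Assumption: hostnames are case-insensitive, so normalize to lowercase.
    return f"dns_cache:{hostname.lower()}:{resolver_hash(resolvers)}"

key = cache_key("DB.internal", ["8.8.8.8:53", "1.1.1.1:53"])
```

Including the resolver fingerprint keeps answers obtained from different resolver sets, such as a per-route override, from sharing a cache entry.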

Hot-reloadable: resolvers, DNSSEC settings, cache TTL, health check parameters, adaptive settings.
Cold (restart required): dot_enabled, dot_port.

Troubleshooting

Common symptoms and diagnostic steps:

DNS resolution failures:

- Check resolver health: 'dns resolvers' shows status, latency, and failure counts
- Test specific hostname: 'dns test <hostname>' performs live resolution
- All resolvers unhealthy: module falls back to system DNS (/etc/resolv.conf)
- Resolver filtered out: proxy resolvers must be a subset of dns.resolvers
- Cross-subsystem check: 'diagnose domain <hostname>' tests DNS + proxy + TLS together

DNSSEC validation errors:

- Zone not signed: set dnssec_strict=false to allow unsigned zones (default)
- Resolver-trust mode: compromised resolver can fake AD bit — use dnssec_full_validation=true
- Full validation slow: first query ~200ms (chain of trust), cached queries ~50ms
- Clock skew: DNSSEC signatures have validity windows — ensure NTP is running
- Check validation: 'dns test <hostname> --dnssec' shows validation result and mode
- Strict mode blocking: dnssec_strict=true rejects all unsigned zones — check per-route override

Slow DNS resolution:

- Check cache hit rate: 'dns cache' shows hit/miss ratio and entry count
- High cache miss: increase cache_ttl or set cache_override=true for static backends
- Resolver latency: 'dns resolvers' shows per-resolver average latency (EMA)
- Adaptive selector: 'dns adaptive' shows resolver scores and selection distribution
- Learning phase: first 100 queries use round-robin — performance improves after
- CNAME chains: deep chains add latency per hop — check with 'dns test <hostname>'

All resolvers down (circuit breaker tripped):

- Health checker marks resolver unhealthy after 2 consecutive failures (configurable)
- Backoff schedule: 30s, 1m, 2m, 4m, 8m, 15m (max)
- System DNS fallback activates automatically when all custom resolvers fail
- Recovery is automatic — resolver returns to pool when health check succeeds
- Force re-check: 'dns health --reset' clears backoff timers
- Check: 'dns resolvers' shows healthy/unhealthy status and next retry time
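The backoff schedule listed above matches a doubling delay that starts at 30 seconds and is capped at 15 minutes. A minimal Python sketch, with a hypothetical function name:

```python
def backoff_seconds(retry: int, base: int = 30, cap: int = 900) -> int:
    """Delay before health-check retry number `retry` (0-based)."""
    # Double the delay on every retry, never exceeding the 15-minute cap.
    return min(base * 2 ** retry, cap)

schedule = [backoff_seconds(n) for n in range(7)]
```

The sixth step would be 16 minutes uncapped, which is why the listed schedule ends at 15m and stays there.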

Cache poisoning concerns:

- Lookup coalescing: concurrent requests for same hostname share single lookup result
- Per-hostname locking prevents race conditions (no global bottleneck)
- Enable DNSSEC (dnssec_full_validation=true) for cryptographic validation
- Use DoT (dot_enabled=true) to encrypt DNS transport against snooping

CNAME resolution issues:

- CNAME not followed: check flatten_cname=true (default)
- CNAME loop detected: max_cname_depth exceeded (default 16) — check DNS zone config
- CNAME + ACL: ACL checks use original hostname, not CNAME target (prevents bypass)
- Metrics: dns.cname_resolutions_total tracks success and depth_exceeded counts
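The depth limit and loop behaviour described above can be illustrated with a small sketch. It is not the module's implementation: the `records` dictionary stands in for real DNS answers, `flatten_cname` and `CnameDepthExceeded` are hypothetical names, and the exact boundary (16 hops allowed, the 17th rejected) is an assumption.

```python
MAX_CNAME_DEPTH = 16  # default max_cname_depth quoted above


class CnameDepthExceeded(Exception):
    """Raised when a chain is longer than the limit (a loop always is)."""


def flatten_cname(name, records, max_depth=MAX_CNAME_DEPTH):
    """Follow CNAME records until an address record is reached.

    `records` maps a name to ("CNAME", target) or ("A", [ips]).
    """
    current = name
    # Up to max_depth CNAME hops, plus one final read of the address record.
    for _ in range(max_depth + 1):
        kind, value = records[current]
        if kind != "CNAME":
            return value
        current = value
    raise CnameDepthExceeded(name)
```

A CNAME loop never reaches an address record, so it ends in the same error as an overly deep chain. Any ACL decision would still be made on `name`, the original hostname, rather than on the final target.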

DoT connection failures:

- Port blocked: DoT uses port 853 (RFC 7858) — verify firewall rules
- Certificate error: set dot_verify_server_cert=false to diagnose (re-enable after)
- Non-standard port: module warns if dot_port is not 853

502/503 from proxy due to DNS:

- DNSSEC failure blocks connection (no system DNS fallback for security)
- DNS infrastructure failure falls back to system DNS (availability)
- Fix: set dnssec=false on specific proxy routes for unsigned internal zones
- Verify: 'dns test <backend-hostname> --dnssec' to check DNSSEC status

Interpreting tool output:

'dns health':
  Healthy: Status=healthy, Healthy resolvers = total resolvers
  Degraded: Healthy < total; some resolvers failing, but DNS still works
  Down: Healthy=0; all resolvers failed, system DNS fallback active
  Action: Degraded/Down → 'dns resolvers' for per-resolver breakdown

'dns resolvers':
  Healthy: Status=healthy, Latency < 50ms, Score > 100
  Degraded: Status=unhealthy with BackoffUntil timestamp; resolver is in circuit-breaker backoff
  Learning: Score near 100 with low QueryCount; adaptive selector still calibrating (normal)
  Action: All unhealthy → check network connectivity to resolver IPs, verify port 53/853 open

'dns test <hostname>':
  Success: IPs returned, TTL shown, DNSSEC=valid (if enabled)
  DNSSEC failure: DNSSEC=invalid; zone is unsigned or signatures expired
  No results: hostname does not resolve; check DNS zone configuration
  Action: DNSSEC failure + proxy 502 → set dnssec=false on that proxy route

Architecture

Resolution flow:

  1. Resolve request arrives (from proxy, bastion, ACME, or discovery)
  2. Hostname validation: RFC compliance check, injection prevention (null bytes, CRLF, length)
  3. Cache lookup: local memory read for “dns_cache:{hostname}:{resolver_hash}”
  4. If cache hit: return cached IPs immediately (no network call)
  5. If cache miss: acquire per-hostname lock (coalescing for concurrent requests)
  6. Resolver selection: adaptive selector picks best resolver (or round-robin during learning)
  7. Health filter: only healthy resolvers considered (circuit breaker pattern)
  8. DNS query: send query via UDP (or DoT if enabled) with configured timeout
  9. DNSSEC validation (if enabled):
     a. Resolver-trust mode: check AD bit in response
     b. Full validation: verify RRSIG signatures, validate DNSKEY chain to root trust anchor
  10. CNAME handling: if CNAME response and flatten_cname=true, recursively resolve target
  11. IPv4 preference: sort results with A records before AAAA records
  12. TTL extraction: from DNS response (DNSSEC/custom resolver) or use configured default
  13. TTL sanitization: clamp to [1s, 604800s], zero defaults to 300s
  14. Cache store: broadcast write to cluster memory (fire-and-forget, best-effort)
  15. Release per-hostname lock, waiting callers receive same result
  16. Return ResolveResponse with IPs, TTL, cached flag, DNSSEC validity
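Steps 2, 11, and 13 of the flow above are simple enough to sketch. This is an illustration under stated assumptions, not the module's code: the function names are hypothetical, and the 253-character and 63-character limits are the usual RFC 1035 bounds, assumed here because the text only says "length".

```python
import ipaddress

MIN_TTL = 1          # seconds, lower clamp from step 13
MAX_TTL = 604800     # seconds (7 days), upper clamp from step 13
DEFAULT_TTL = 300    # used when the response TTL is zero


def validate_hostname(name: str) -> bool:
    """Reject empty, over-long, or injection-prone hostnames (step 2)."""
    if not name or len(name) > 253:
        return False
    if any(ch in name for ch in ("\x00", "\r", "\n")):
        return False
    labels = name.rstrip(".").split(".")
    return all(0 < len(label) <= 63 for label in labels)


def sanitize_ttl(ttl: int) -> int:
    """Zero falls back to the default; other values are clamped (step 13)."""
    if ttl == 0:
        return DEFAULT_TTL
    return max(MIN_TTL, min(MAX_TTL, ttl))


def ipv4_first(addresses: list[str]) -> list[str]:
    """Stable sort that places IPv4 addresses before IPv6 (step 11)."""
    return sorted(addresses, key=lambda a: ipaddress.ip_address(a).version)
```

Because the sort is stable, addresses keep the order the resolver returned them in within each address family.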

Adaptive resolver selection (epsilon-greedy):

Learning phase (first 100 queries): round-robin across all healthy resolvers
Intelligent phase: 90% exploitation (best score), 10% exploration (random)
Score = 100 + (success_rate * 50) - (avg_latency_ms / 10) - (timeout_pct * 30) - (consecutive_failures * 20) - (recently_used * 10)
Latency tracked via EMA: new_avg = 0.3 * sample + 0.7 * old_avg
Load balancing penalty: -10 points if resolver used within last 1 second
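The scoring formula, the EMA update, and the two selection phases can be put together in a short sketch. It assumes that success_rate and timeout_pct are fractions between 0 and 1 and that recently_used is a 0/1 flag; the source does not state the units. The function names and the `rng` parameter are hypothetical, and the reason labels follow the values listed for dns_adaptive_resolver_selected.

```python
import random

LEARNING_QUERIES = 100   # round-robin for the first 100 queries
EPSILON = 0.10           # 10% exploration in the intelligent phase
EMA_ALPHA = 0.3          # weight of the newest latency sample


def update_latency(old_avg_ms: float, sample_ms: float) -> float:
    """new_avg = 0.3 * sample + 0.7 * old_avg"""
    return EMA_ALPHA * sample_ms + (1 - EMA_ALPHA) * old_avg_ms


def score(success_rate, avg_latency_ms, timeout_pct,
          consecutive_failures, recently_used):
    """Resolver score; rates are assumed to be fractions in [0, 1]."""
    return (100
            + success_rate * 50
            - avg_latency_ms / 10
            - timeout_pct * 30
            - consecutive_failures * 20
            - (10 if recently_used else 0))


def select(resolvers, scores, query_count, rng=random):
    """Return (resolver, reason) for the next query."""
    if query_count < LEARNING_QUERIES:
        return resolvers[query_count % len(resolvers)], "round_robin"
    if rng.random() < EPSILON:
        return rng.choice(resolvers), "exploration"
    return max(resolvers, key=lambda r: scores[r]), "best_score"
```

With these assumptions, a resolver with a perfect success rate and a 20 ms average latency scores 148, and drops to 138 while the recently-used penalty applies.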

Health checker circuit breaker:

Healthy: failure_count = 0, available for selection
Unhealthy: failure_count >= threshold (default 2), excluded from selection
Backoff: 30s -> 1m -> 2m -> 4m -> 8m -> 15m (max)
Recovery: single successful health check returns resolver to healthy state
System DNS fallback: automatic when ALL custom resolvers are unhealthy
Memory cleanup: Resolver sync removes stale entries on config reload
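A minimal sketch of the circuit breaker state described above follows. The class and method names are hypothetical, and the way the backoff step advances on each further failure is an assumption; only the threshold, the schedule, and the single-success recovery come from the text.

```python
from dataclasses import dataclass

BACKOFF_SECONDS = [30, 60, 120, 240, 480, 900]  # 30s up to 15m (max)
FAILURE_THRESHOLD = 2                           # default, configurable


@dataclass
class ResolverHealth:
    failure_count: int = 0
    backoff_step: int = 0     # index into BACKOFF_SECONDS
    retry_at: float = 0.0     # earliest time of the next health check

    @property
    def healthy(self) -> bool:
        return self.failure_count < FAILURE_THRESHOLD

    def record_failure(self, now: float) -> None:
        self.failure_count += 1
        if not self.healthy:
            # Schedule the next check, then move one step up the schedule.
            self.retry_at = now + BACKOFF_SECONDS[self.backoff_step]
            self.backoff_step = min(self.backoff_step + 1,
                                    len(BACKOFF_SECONDS) - 1)

    def record_success(self) -> None:
        # A single successful check returns the resolver to the pool.
        self.failure_count = 0
        self.backoff_step = 0
        self.retry_at = 0.0


def use_system_dns(resolvers) -> bool:
    """System DNS fallback applies only when no custom resolver is healthy."""
    return not any(r.healthy for r in resolvers)
```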

DNSSEC full validation chain:

1. Query resolver with DO bit set
2. Extract RRSIG from response
3. Fetch DNSKEY for target zone (cached with TTL)
4. Verify RRSIG signature using DNSKEY (RSA/SHA-256, ECDSA P-256, Ed25519)
5. Fetch DS record from parent zone
6. Verify DNSKEY hash matches DS record
7. Recurse up to root zone
8. Validate root DNSKEY against hardcoded IANA trust anchor (KSK 20326)
9. Validate NSEC/NSEC3 for authenticated denial of existence
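Two parts of the chain above are easy to show in isolation: the walk from the queried name towards the root (step 7) and the signature validity window that makes clock skew a problem. Both helpers are hypothetical simplifications. Real zone cuts do not exist at every label, and real validators compare RRSIG timestamps with 32-bit serial-number arithmetic rather than plain integers.

```python
def zones_to_root(name: str) -> list[str]:
    """Candidate zone names from `name` up to the root, most specific first."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = [".".join(labels[i:]) + "." for i in range(len(labels))]
    chain.append(".")
    return chain


def rrsig_time_valid(inception: int, expiration: int, now: int) -> bool:
    """True when `now` (Unix seconds) lies inside the signature window."""
    return inception <= now <= expiration
```

For example, `zones_to_root("www.example.com")` yields `www.example.com.`, `example.com.`, `com.`, and finally the root zone `.`, where the trust anchor check takes place.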

Distributed caching via memory module:

Read path: local-only (no network, no quorum)
Write path: broadcast to all cluster nodes (fire-and-forget)
Key format: "dns_cache:{hostname}:{sha256_hash_of_resolvers}" (collision-resistant)
Eviction: TTL-based (respects DNS TTL or configured override)
Coalescing: per-hostname mutex prevents concurrent duplicate lookups
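The key format and the per-hostname coalescing can be sketched as follows. The normalisation choices (lowercasing the hostname, sorting and comma-joining the resolver list before hashing) are assumptions, as are the names `cache_key` and `Coalescer`. TTL eviction and the cluster broadcast are left out.

```python
import hashlib
import threading
from collections import defaultdict


def cache_key(hostname: str, resolvers: list[str]) -> str:
    """Build "dns_cache:{hostname}:{sha256_hash_of_resolvers}"."""
    digest = hashlib.sha256(",".join(sorted(resolvers)).encode()).hexdigest()
    return f"dns_cache:{hostname.lower().rstrip('.')}:{digest}"


class Coalescer:
    """Concurrent callers for one key share a single lookup result."""

    def __init__(self):
        self._guard = threading.Lock()                # protects the lock table
        self._locks = defaultdict(threading.Lock)     # one lock per key
        self._results = {}                            # no TTL eviction here

    def resolve(self, key, lookup):
        with self._guard:
            lock = self._locks[key]
        with lock:
            # Only the first caller runs the lookup; the rest reuse its result.
            if key not in self._results:
                self._results[key] = lookup()
            return self._results[key]
```

The short-lived `_guard` lock only protects the lock table, so lookups for different hostnames do not wait on each other.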

Metrics emitted:

dns.resolve_total (tags: status, cached, dnssec)
dns.resolve_latency_ms (histogram)
dns.cache_hit_total / dns.cache_miss_total
dns.health_check_total (tags: resolver, status)
dns.adaptive_resolver_selected (tags: resolver, reason)
dns.resolver_score (gauge, tags: resolver)
dns.resolver_avg_latency_ms (gauge, tags: resolver)
dns.cname_resolutions_total (tags: status)

Relationships

Module dependencies and interactions:

  • proxy: Backend hostname resolution for all proxy routes. Uses [dns] configuration by default. Per-route overrides via dnssec and dns_resolvers fields in [[proxy.mapping]]. DNSSEC validation failure blocks connection (no system DNS fallback — prevents downgrade). DNS infrastructure failure falls back to system DNS (availability).

  • bastion: SSH connection and port forwarding hostname resolution. Uses [dns] configuration directly (no bastion-specific overrides). DNSSEC protects against SSH destination poisoning.

  • discovery: Cluster peer discovery via DNS SRV records. Uses [dns] configuration for resolver settings. Critical for cluster formation and membership.

  • acme: ACME challenge validation uses typed DNS queries (CAA record checking per RFC 8659). SERVFAIL handling distinguishes “no records” from “DNS infrastructure error” for security.

  • memory: Distributed DNS cache storage. Local reads (fast), broadcast writes (best-effort). No quorum required — cache is opportunistic, falls back to fresh lookup on miss.

  • config: Reads [dns] and [cluster] TOML sections. Hot-reload updates resolvers, DNSSEC settings, cache parameters, health check configuration, and adaptive selection tuning. Resolver sync cleans up stale resolver state on reload (memory leak prevention).

  • metrics (telemetry): Emits counters, histograms, and gauges for resolution, caching, health checks, and adaptive selection. Enables monitoring dashboards and alerting.

Logs

Log entries by component. Search with: logs search “dns”. Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Init & Lifecycle:

dns.init INFO DNS module initialized
dns.health INFO DNS resolvers not configured, using cluster resolvers for health checking
dns.health WARN Failed to initialize resolver health manager
dns.health INFO Resolver health manager started
dns.health INFO Health checking enabled but no resolvers configured
dns.health INFO Resolver health checking disabled
dns.adaptive INFO Adaptive resolver selector initialized
dns.adaptive INFO Adaptive selector enabled but no resolvers configured
dns.adaptive INFO Adaptive resolver selector disabled

Resolution:

dns.resolve DEBUG DNS resolution request
dns.resolve DEBUG DNS cache hit
dns.resolve DEBUG Waiting for concurrent DNS lookup to complete
dns.resolve ERROR DNS lookup panicked
dns.resolve ERROR DNS resolution failed
dns.resolve INFO DNS resolution succeeded - no records found
dns.resolve INFO DNS resolution succeeded

Hostname Validation:

dns.validate WARN Hostname validation failed

Health Status:

dns.gethealth DEBUG DNS health status requested

Cache Operations:

dns.cache WARN Invalid cache entry type
dns.cache WARN Failed to broadcast DNS cache update
dns.cache DEBUG DNS result cached

DNSSEC Core:

dns.dnssec DEBUG Using DNS-over-TLS
dns.dnssec WARN DNS query failed
dns.dnssec DEBUG DNS query returned error
dns.dnssec.full DEBUG RRSIG present but AD bit not set - performing full validation
dns.dnssec.full ERROR Full DNSSEC validation failed
dns.dnssec.full INFO Full DNSSEC validation succeeded
dns.dnssec ERROR DNSSEC validation failed: RRSIG present but AD bit not set
dns.dnssec ERROR DNSSEC strict mode: zone not signed
dns.dnssec WARN DNSSEC validation skipped: zone not signed
dns.dnssec DEBUG DNSSEC validation succeeded (resolver-trust mode)

DNSSEC Validation:

dns.dnssec.validate WARN RRSIG signature verification failed
dns.dnssec.validate WARN RRSIG signature expired or not yet valid
dns.dnssec.validate DEBUG RRSIG signature validated successfully
dns.dnssec.dnskey WARN Failed to query DNSKEY
dns.dnssec.dnskey WARN DNSKEY query returned error
dns.dnssec.dnskey WARN No DNSKEY records found in zone
dns.dnssec.dnskey DEBUG DNSKEY records fetched successfully
dns.dnssec.validate ERROR DNSSEC strict mode: RRset not signed
dns.dnssec.validate DEBUG RRset has no RRSIG (zone not signed)
dns.dnssec.validate ERROR No matching DNSKEY found for RRSIG
dns.dnssec.validate INFO DNSSEC validation completed

DNSSEC Cache:

dns.dnssec.cache DEBUG DNSKEY cache hit
dns.dnssec.cache DEBUG DNSKEY cache expired
dns.dnssec.cache DEBUG DNSKEY cached
dns.dnssec.cache DEBUG DS cache hit
dns.dnssec.cache DEBUG DS cache expired
dns.dnssec.cache DEBUG DS cached
dns.dnssec.cache INFO DNSSEC cache cleared

DNSSEC Chain of Trust:

dns.dnssec WARN DEPRECATED: SHA-1 used in DNSSEC validation
dns.dnssec.ds WARN Failed to query DS
dns.dnssec.ds WARN DS query returned error
dns.dnssec.ds DEBUG No DS records found (zone may be unsigned or at root)
dns.dnssec.ds DEBUG DS records fetched successfully
dns.dnssec.chain WARN Failed to compute DS digest
dns.dnssec.chain DEBUG DNSKEY validated successfully using DS
dns.dnssec.chain ERROR DNSKEY validation failed: no matching DS found
dns.dnssec.chain DEBUG Validating chain of trust
dns.dnssec.chain INFO Root DNSKEY validated against trust anchor
dns.dnssec.chain ERROR Root DNSKEY validation failed: no matching trust anchor

DNSSEC NSEC/NSEC3:

dns.dnssec.nsec DEBUG No NSEC records found in response
dns.dnssec.nsec DEBUG Found NSEC records for validation
dns.dnssec.nsec INFO NSEC authenticated denial validated
dns.dnssec.nsec WARN NSEC validation failed: name not in range
dns.dnssec.nsec3 DEBUG No NSEC3 records found in response
dns.dnssec.nsec3 DEBUG Found NSEC3 records for validation
dns.dnssec.nsec3 WARN Unsupported NSEC3 hash algorithm
dns.dnssec.nsec3 ERROR Failed to compute NSEC3 hash
dns.dnssec.nsec3 INFO NSEC3 authenticated denial validated
dns.dnssec.nsec3 WARN NSEC3 validation failed: hash not in range

Resolver:

dns.resolve WARN Hostname validation failed
dns.ttl DEBUG Cache override enabled, using configured TTL
dns.ttl DEBUG Using DNS server TTL
dns.ttl DEBUG DNS server TTL not available, using fallback
dns.health DEBUG Filtered unhealthy resolvers
dns.resolve DEBUG DNS resolution succeeded
dns.resolve DEBUG DNS resolution failed, trying next resolver
dns.resolve ERROR All DNS resolvers failed
dns.resolve DEBUG Using system DNS resolver
dns.resolve DEBUG Using configured DNS cache TTL for system resolver
dns.resolve DEBUG DNS resolution succeeded
dns.dnssec DEBUG DNSSEC resolution succeeded
dns.dnssec WARN DNSSEC lookup failed, trying next resolver
dns.cname DEBUG Resolving CNAME target
dns.cname DEBUG CNAME record found
dns.cname WARN Failed to resolve CNAME target
dns.cname DEBUG CNAME chain returned (flatten disabled)
dns.query DEBUG Using DNS-over-TLS
dns.query WARN DNS query failed
dns.query DEBUG DNS query returned error
dns.query DEBUG DNS query completed
dns.query WARN DNS query returned SERVFAIL

Adaptive Resolver:

dns.adaptive ERROR Failed to create adaptive selector
dns.adaptive INFO Cleaned up performance data for removed resolvers
dns.adaptive INFO Adaptive resolver selector initialized
dns.adaptive TRACE Resolver performance updated
dns.adaptive INFO Adaptive selector learning phase completed, switching to intelligent selection
dns.adaptive DEBUG Adaptive DNS resolution succeeded
dns.adaptive DEBUG Adaptive DNS resolution failed, selecting another resolver
dns.adaptive ERROR All adaptive DNS resolution attempts failed

Health Manager:

dns.health INFO Initializing resolver health checks
dns.health ERROR Invalid resolver address format
dns.health WARN Initial health check failed
dns.health INFO Initial health check passed
dns.health ERROR No healthy DNS resolvers available
dns.health INFO Resolver health initialization complete
dns.health DEBUG Starting health check
dns.health WARN Health check query failed
dns.health WARN Health check returned nil response
dns.health DEBUG Health check returned error response
dns.health DEBUG Health check successful
dns.health DEBUG GetHealthyResolvers called
dns.fallback WARN All custom DNS resolvers unhealthy, falling back to system DNS
dns.fallback INFO Custom DNS resolver recovered, switching back from system DNS
dns.health WARN RecordSuccess called for unknown resolver
dns.health INFO Resolver recovered
dns.fallback INFO Custom DNS resolver recovered, switching back from system DNS
dns.health WARN RecordFailure called for unknown resolver
dns.health WARN Resolver marked unhealthy
dns.health INFO Starting resolver health checker
dns.health INFO Stopping resolver health checker
dns.health DEBUG Performing health checks
dns.health DEBUG Health check still failing
dns.health INFO Resolver recovered via health check
dns.health INFO Removed resolvers no longer in configuration

Metrics

Prometheus metrics. Query with: metrics prometheus dns_<name>

Resolution (namespace: dns):

dns_resolve_total counter {result, cached, dnssec} Resolution outcomes
  result=success, cached=true|false Successful resolution
  result=nxdomain, cached=true|false Domain not found (valid response)
  result=failure, cached=false Resolution failed
dns_nxdomain_total counter {} NXDOMAIN responses (uncached)
dns_cache_hits counter {} Cache hits
dns_cache_misses counter {} Cache misses
dns_lookup_coalesced counter {} Lookups coalesced (shared concurrent result)
dns_lookup_performed counter {} Lookups actually performed
dns_cache_operations_total counter {operation, result} Cache write operations
  operation=set, result=success|error Broadcast cache set outcomes

Resolver Selection (namespace: dns):

dns_resolver_queries_total counter {resolver, result} Per-resolver query outcomes
  result=success|nxdomain|failure Query result per resolver
dns_system_dns_queries_total counter {result} System DNS fallback queries
  result=success|nxdomain|failure System resolver outcomes

Transport (namespace: dns):

dns_transport_used counter {type, resolver} DNS transport protocol used
  type=udp|dot UDP or DNS-over-TLS

CNAME Resolution (namespace: dns):

dns_cname_resolutions_total counter {status} CNAME chain resolution outcomes
  status=success|depth_exceeded CNAME follow results

DNSSEC Validation (namespace: dns):

dns_dnssec_validations_total counter {result, resolver} Resolver-trust mode validations
  result=valid|invalid|unsigned AD bit check outcomes
dns_dnssec_full_validations counter {result, resolver} Full cryptographic validations
  result=valid|invalid RRSIG/DNSKEY verification outcomes
dns_dnssec_signature_validations counter {result, algorithm} RRSIG signature verifications
  result=valid Successful signature check
dns_dnssec_dnskey_queries counter {result} DNSKEY record fetches
  result=success DNSKEY query succeeded
dns_dnssec_response_validations counter {result} Full response validations
  result=valid All RRsets validated
dns_dnssec_chain_validations counter {result} Chain of trust DS validations
  result=valid|invalid DNSKEY-DS digest match
dns_dnssec_root_validations counter {result} Root trust anchor validations
  result=valid|invalid Root DNSKEY match
dns_dnssec_nsec_validations counter {result, type} NSEC/NSEC3 denial validations
  result=valid|invalid, type=nsec|nsec3 Authenticated denial outcomes

DNSSEC Cache (namespace: dns):

dns_dnssec_cache_hits counter {type} DNSSEC record cache hits
  type=dnskey|ds Cached record type
dns_dnssec_cache_misses counter {type} DNSSEC record cache misses
  type=dnskey|ds Record type queried
dns_dnssec_cache_clears counter {} DNSSEC cache full clears

Health Management (namespace: dns):

dns_resolver_latency latency {resolver} Per-resolver query latency
dns_resolver_healthy gauge {resolver} Resolver health status (1=healthy, 0=unhealthy)
dns_resolver_avg_latency_ms gauge {resolver} Resolver average latency EMA (ms)
dns_resolver_consecutive_failures gauge {resolver} Consecutive failure count per resolver
dns_resolver_failures_total counter {resolver} Total resolver failures
dns_system_fallback gauge {} System DNS fallback active (1=active, 0=inactive)
dns_fallback_activations counter {} System DNS fallback activations

Adaptive Selection (namespace: dns):

dns_adaptive_resolver_selected counter {resolver, reason} Adaptive resolver selections
  reason=exploration|best_score|round_robin|... Selection strategy used
dns_adaptive_selection_total counter {mode, resolver} Selection mode distribution
  mode=explore|exploit Exploration vs exploitation
dns_resolver_score histogram {resolver} Resolver scores (intelligent phase)

Forward Proxy

Browser-native access to internal resources — no client software needed, just configure the browser’s proxy settings

Overview

Provides browser-native access to internal resources — no client software needed. Users configure their browser’s proxy settings (or use the auto-generated PAC file) and access internal services as if they were local. Handles HTTP CONNECT for TCP tunneling and CONNECT-UDP for UDP proxying via MASQUE.

Core capabilities:

  • HTTP CONNECT handling for TCP proxy tunneling
  • CONNECT-UDP handling for UDP proxy tunneling (MASQUE/QUIC)
  • PAC file endpoint serving at configurable path (default /proxy.pac)
  • Browser extension config endpoint at /proxy/config
  • Browser extension setup/login endpoint at /proxy/setup
  • CONNECT rejected on main service port (421 Misdirected) — proxy port only
  • Geo-IP and time-based restriction enforcement before tunneling
  • DNS resolution with system DNS fallback
  • Bidirectional TCP relay with idle timeout and max connection duration
  • HTTP/2+ full duplex CONNECT stream support (RFC 8441)
  • HTTP/1.1 connection hijacking for classic CONNECT tunneling
  • Connection tracking and byte-level metrics recording

The service runs on a dedicated port (forward_proxy.port) separate from the main service port for security isolation. CONNECT requests on the main port receive 421 Misdirected Request, directing clients to the correct proxy port.

TCP CONNECT request flow:

1. Extract client IP (CDN bypass mode uses RemoteAddr directly)
2. Check geo-IP and time-based restrictions
3. Validate target host:port format (RFC 1035 hostname length limit)
4. Extract bearer token from Proxy-Authorization header
5. Authenticate token and check user is not disabled
6. Check ACL (firewall group rules for target destination)
7. Check per-user rate limit (fail-closed)
8. Resolve hostname via DNS module (system DNS fallback)
9. Establish backend TCP connection with configurable timeout
10. Start bidirectional relay with idle timeout and max duration
11. Record metrics (bytes sent/recv, duration, success)
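Step 3 of the flow above, target validation, can be sketched as below. The function name, the error messages, and the requirement that IPv6 literals be bracketed are assumptions made for illustration; the 253-character bound is the RFC 1035 hostname limit the step refers to.

```python
MAX_HOSTNAME_LEN = 253  # RFC 1035 presentation-format limit


def parse_target(target: str) -> tuple[str, int]:
    """Split a CONNECT target into (host, port) or raise ValueError."""
    host, sep, port_text = target.rpartition(":")
    if not sep or not host:
        raise ValueError("target must be host:port")
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError("port must be numeric")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range")
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]                     # bracketed IPv6 literal
    elif ":" in host:
        raise ValueError("IPv6 literals must be bracketed")
    if not host or len(host) > MAX_HOSTNAME_LEN:
        raise ValueError("invalid hostname length")
    if any(ch in host for ch in "\x00\r\n "):
        raise ValueError("illegal character in hostname")
    return host, port
```

Rejecting malformed targets this early keeps the later steps (authentication, ACL, DNS) from ever seeing unvalidated input.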

CONNECT-UDP request flow:

1-7. Same as TCP (restrictions, auth, ACL, rate limit)
8. MASQUE UDP proxying (capsule protocol, socket management)
9. Record metrics after session completes

Bearer token authentication supports two formats:

- "Bearer <token>" header (direct bearer token)
- "Basic <base64>" header where username is "_bearer_" and password is the token (Chrome's onAuthRequired format for Proxy-Authorization)
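Both header formats reduce to the same token, as the following sketch shows. `extract_token` is a hypothetical helper, not the service's API. The 8192-byte limit is the maximum token length quoted in the troubleshooting notes, and returning `None` stands in for the 407 response.

```python
import base64
import binascii
from typing import Optional

MAX_TOKEN_BYTES = 8192  # maximum token length from the troubleshooting notes


def extract_token(header: Optional[str]) -> Optional[str]:
    """Return the bearer token from a Proxy-Authorization value, or None."""
    if not header:
        return None
    scheme, _, value = header.strip().partition(" ")
    value = value.strip()
    token = None
    if scheme.lower() == "bearer":
        token = value
    elif scheme.lower() == "basic":
        try:
            decoded = base64.b64decode(value, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            return None
        user, sep, password = decoded.partition(":")
        # Only the reserved "_bearer_" username carries a token.
        if sep and user == "_bearer_":
            token = password
    if not token or len(token.encode()) > MAX_TOKEN_BYTES:
        return None
    return token
```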

Config

Service-level configuration under [forward_proxy] in hexon.toml:

[forward_proxy]
enabled = true # Enable forward proxy (default: false)
port = 8443 # Dedicated proxy port (must differ from service.port)
public_port = 8443 # External port for PAC URLs (NAT/LB scenarios)
hostname = "proxy.example.com" # Separate hostname for CDN bypass (optional)
enable_tcp = true # Enable TCP CONNECT handling (default: true)
enable_udp = true # Enable CONNECT-UDP/MASQUE handling (default: true)
udp_proxy_path = "/masque" # URI path for CONNECT-UDP requests (default: /masque)
auth_mode = "bearer" # Authentication mode for CONNECT requests
buffer_size = "32KB" # TCP relay buffer size (default: 32KB)
connect_timeout = "10s" # Backend connection timeout
idle_timeout = "5m" # Idle connection timeout (no data flowing)
max_connection_duration = "24h" # Maximum connection duration (hard limit)
preserve_client_port = true # Use client's port in Alt-Svc header
# Token settings (used by /proxy/config endpoint)
token_ttl = "5m" # Token validity duration (default: 5m, min: 30s)
token_refresh_interval = "60s" # Extension refresh interval (default: 60s, min: 5s)
# TLS certificate for the proxy hostname (when hostname differs from service)
cert = "/path/to/cert.pem" # File path or inline PEM
key = "/path/to/key.pem" # File path or inline PEM
# Geo-IP restrictions (overrides [service] if set)
geo_enabled = true # Enable geo-IP restrictions
geo_allow_countries = ["US", "CA"] # Allowed country codes (ISO 3166-1 alpha-2)
geo_deny_countries = [] # Denied country codes
geo_bypass_cidr = ["10.0.0.0/8"] # CIDR ranges that bypass geo checks
geo_deny_code = 403 # HTTP status for geo denial
geo_deny_message = "Access denied from your location"
# Time-based restrictions (overrides [service] if set)
time_enabled = true # Enable time-based restrictions
time_timezone = "America/New_York" # Timezone for time checks
time_allow_days = ["Mon","Tue","Wed","Thu","Fri"]
time_allow_hours = "09:00-18:00" # Allowed hours range
time_deny_code = 403 # HTTP status for time denial
time_deny_message = "Access not permitted at this time"
# PAC file settings
[forward_proxy.pac]
enabled = true # Enable PAC endpoint (default: true)
path = "/proxy.pac" # PAC file URL path
cache_ttl = "15m" # PAC response Cache-Control max-age
group = "proxy-users" # Required group for PAC/config/setup access (optional)
use_firewall_targets = true # Derive PAC targets from firewall rules
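The token settings above carry two documented minimums (30s for token_ttl, 5s for token_refresh_interval), and the refresh interval has to be shorter than the TTL for the extension to renew in time. A hedged sketch of such a check follows. The helper names are hypothetical, and the duration parser accepts only single-unit values such as "60s", "5m", or "24h", which may be narrower than what the real config loader allows.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert "60s", "5m" or "24h" to seconds (single-unit values only)."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_token_settings(token_ttl: str, refresh_interval: str) -> list[str]:
    """Return a list of problems; an empty list means the pair is usable."""
    problems = []
    ttl = parse_duration(token_ttl)
    refresh = parse_duration(refresh_interval)
    if ttl < 30:
        problems.append("token_ttl is below the 30s minimum")
    if refresh < 5:
        problems.append("token_refresh_interval is below the 5s minimum")
    if refresh >= ttl:
        problems.append("token_refresh_interval must be shorter than token_ttl")
    return problems
```

With the defaults shown above ("5m" and "60s") the check returns no problems.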

Endpoints registered by the service:

GET /proxy.pac - PAC file (requires auth, optional group)
GET /proxy/config - JSON: PAC + token + refresh interval + username + server_time
GET /proxy/setup - Login trigger page for browser extensions

CDN bypass mode:

When forward_proxy.hostname differs from service.hostname, the proxy accepts direct connections (no CDN in between). Client IP is extracted from RemoteAddr instead of X-Forwarded-For. This setup is typical because CDNs generally do not pass HTTP CONNECT requests through.

Hot-reloadable: token_ttl, token_refresh_interval, geo/time restrictions, PAC settings, rate_limit_per_user, bandwidth_limit_per_user, buffer_size, idle_timeout, max_connection_duration.

Cold (restart required): enabled, port, hostname, enable_tcp, enable_udp, udp_proxy_path, preserve_client_port.

Troubleshooting

Common symptoms and diagnostic steps:

CONNECT requests returning 421 Misdirected Request:

- Client is sending CONNECT to the main service port instead of the proxy port
- The forward proxy middleware rejects CONNECT on the main port by design
- Verify client is configured to use forward_proxy.port (or public_port)
- Check error message for the correct proxy hostname:port

407 Proxy Authentication Required:

- Missing Proxy-Authorization header on CONNECT request
- Token format not recognized (must be "Bearer <token>" or "Basic <base64>")
- For Chrome extension: username must be "_bearer_" in Basic auth format
- Token exceeds max length (8192 bytes) — check token generation
- Verify token is being refreshed before expiry: check /proxy/config response

403 Forbidden on CONNECT:

- ACL denied: user's groups do not match firewall rules for the target
- Check: 'forwardproxy check <user> <target>' for ACL evaluation
- Check: 'forwardproxy targets <user>' to see allowed destinations
- Check: 'firewall check <user>' for firewall rule details
- Geo-IP denial: 'geo lookup <client_ip>' and 'geo check <client_ip>'
- Time-based denial: verify time_timezone and time_allow_hours in config

429 Too Many Requests:

- Per-user rate limit exceeded: check rate_limit_per_user setting
- Per-user bandwidth limit exceeded: check bandwidth_limit_per_user
- Retry-After header in response indicates when to retry
- Monitor: 'forwardproxy metrics' for per-user rate limit stats
- Consider increasing limits for legitimate high-volume users

502 Bad Gateway on CONNECT:

- DNS resolution failed: 'dns test <target_hostname>'
- Backend unreachable: 'net tcp <target_host:port>'
- Connect timeout too short: check forward_proxy.connect_timeout
- All resolved IPs failed (tries IPv4 first, then IPv6)
- DNS module failure with system DNS fallback also failing

Connection drops or timeouts during tunnel:

- Idle timeout: no data flowing for forward_proxy.idle_timeout (default 5m)
- Max duration exceeded: forward_proxy.max_connection_duration hard limit
- Check relay buffer_size: default 32KB, increase for high-throughput tunnels
- HTTP/2 full duplex not supported by server: check error logs for full duplex support errors
- Intermediate firewall blocking long-lived connections or UDP (QUIC)

PAC file returns DIRECT for all traffic:

- PAC endpoint requires authentication; verify session cookie is sent
- Check forward_proxy.pac.enabled = true
- Check use_firewall_targets = true and user has firewall rules
- Unauthenticated PAC intentionally returns DIRECT-only (security by design)
- Inspect PAC: curl -b session=<cookie> https://host/proxy.pac

/proxy/config returns 401 or 403:

- 401: session cookie missing or expired; trigger re-login via /proxy/setup
- 403: user not in required group (forward_proxy.pac.group)
- Verify group membership: 'directory user <username>'

Extension not refreshing token:

- Verify token_refresh_interval < token_ttl in config
- Check /proxy/config endpoint accessibility from extension
- Look for clock skew between client and server (server_time in response)
- Monitor: 'forwardproxy metrics' for token generation counts

CONNECT-UDP/MASQUE failures:

- QUIC port (UDP) blocked by intermediate firewall
- forward_proxy.enable_udp = false in config
- URI template mismatch: check udp_proxy_path setting
- MASQUE parse error: malformed CONNECT-UDP request
- Verify: 'net tcp <proxy_hostname:port> --tls' for TLS connectivity

Geo/time restriction inconsistencies:

- Forward proxy has its own geo/time config that overrides [service] settings
- Check both forward_proxy.geo_enabled and service.geo_enabled
- Restrictions on /proxy/config and CONNECT may behave differently
- CONNECT restrictions fail-open if the cluster is not ready

Metrics and monitoring:

- 'forwardproxy metrics' — cluster-wide connection counts and byte totals
- 'forwardproxy metrics <user>' — per-user breakdown
- Bytes sent/recv recorded per TCP connection; UDP records duration and success only (MASQUE library limitation)

Relationships

Dependencies and interactions:

  • Forward proxy module: Handles all authentication, ACL, rate limiting, PAC generation, metrics, and restriction checks cluster-wide.
  • DNS: Hostname resolution for CONNECT targets. Falls back to system DNS if the DNS module is unavailable. IPv4 preferred over IPv6 in resolution order.
  • Firewall: ACL rules determine which groups can access which destination host:port. Firewall rules also drive PAC file generation (use_firewall_targets).
  • Directory: User disabled status checked during authentication. Group membership resolved server-side from the directory memory index during ACL evaluation (not embedded in the bearer token).
  • Geo/Time access: Location and time-based access checks on both /proxy/config endpoint and CONNECT requests. Forward proxy can override [service] geo/time settings with its own configuration.
  • Sessions: Session cookies used for /proxy/config, /proxy/setup, and /proxy.pac. Browser extension first authenticates via session, then receives a bearer token for subsequent CONNECT requests.
  • Reverse proxy: Complementary service — reverse proxy handles inbound traffic to backends, forward proxy handles outbound traffic from users. Both share the same TLS listener and session subsystem.

Logs

Log entries by component. Search with: logs search “forwardproxy”. Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Lifecycle & Middleware:

forwardproxy.service.init INFO Forward proxy service disabled in config
forwardproxy.service.init INFO Forward proxy service initialized
forwardproxy.middleware INFO Forward proxy disabled, passing CONNECT to next handler
forwardproxy.middleware WARN CONNECT request rejected on main service port

PAC & Config Endpoints:

forwardproxy.pac DEBUG Generating PAC file for authenticated user
forwardproxy.pac ERROR Failed to generate PAC
forwardproxy.config DEBUG Generating proxy config for extension
forwardproxy.config WARN Access blocked by restriction
forwardproxy.config ERROR Failed to generate PAC
forwardproxy.config ERROR Failed to generate proxy token
forwardproxy.config INFO Proxy config generated successfully
forwardproxy.setup INFO Proxy setup authorized

Restrictions:

forwardproxy.restrictions ERROR Failed to call restrictions check

SSRF Protection:

forwardproxy.ssrf WARN AUDIT blocked non-routable IP from DNS resolution
forwardproxy.ssrf WARN AUDIT all resolved IPs are non-routable — request blocked

DNS & Connectivity:

forwardproxy.dns DEBUG Resolving hostname via DNS module
forwardproxy.dns DEBUG DNS resolution successful
forwardproxy.dns DEBUG Using system DNS resolver
forwardproxy.dns DEBUG Successfully connected to backend
forwardproxy.dns WARN DNS module failure - falling back to system DNS
forwardproxy.dns WARN DNS resolution timeout - falling back to system DNS
forwardproxy.dns WARN DNS module returned error - falling back to system DNS
forwardproxy.dns WARN Failed to connect to IP, trying next
forwardproxy.connector DEBUG Dialing via connector site
forwardproxy.connector DEBUG Connected via connector site

TCP CONNECT Authentication:

forwardproxy.tcp.auth INFO AUDIT Missing or invalid Proxy-Authorization header
forwardproxy.tcp.auth INFO AUDIT Token too long
forwardproxy.tcp.auth INFO AUDIT Authentication failed

TCP CONNECT ACL & Rate Limiting:

forwardproxy.tcp.acl WARN AUDIT ACL denied
forwardproxy.tcp.ratelimit ERROR Rate limit service unavailable
forwardproxy.tcp.ratelimit ERROR Rate limit check failed
forwardproxy.tcp.ratelimit WARN AUDIT Rate limit exceeded

TCP CONNECT Connection:

forwardproxy.tcp.connect INFO Proxy connection established
forwardproxy.tcp.dial ERROR Failed to connect to backend
forwardproxy.tcp.http2 DEBUG Using HTTP/2+ full duplex CONNECT stream
forwardproxy.tcp.http2 ERROR Failed to enable full duplex mode
forwardproxy.tcp.http2 ERROR Failed to flush response
forwardproxy.tcp.hijack ERROR ResponseWriter does not support hijacking
forwardproxy.tcp.hijack ERROR Failed to hijack connection
forwardproxy.tcp.error ERROR Request validation or service errors (dynamic message)

HTTP Proxy Authentication:

forwardproxy.http.auth INFO AUDIT Missing or invalid Proxy-Authorization header
forwardproxy.http.auth INFO AUDIT Token too long
forwardproxy.http.auth INFO AUDIT Authentication failed

HTTP Proxy ACL & Rate Limiting:

forwardproxy.http.acl WARN AUDIT ACL denied
forwardproxy.http.ratelimit ERROR Rate limit service unavailable
forwardproxy.http.ratelimit ERROR Rate limit check failed
forwardproxy.http.ratelimit WARN AUDIT Rate limit exceeded

HTTP Proxy Forwarding:

forwardproxy.http.forward INFO HTTP proxy request forwarded
forwardproxy.http.forward ERROR Failed to forward request
forwardproxy.http.copy DEBUG Response body copy error
forwardproxy.http.error ERROR Request validation or service errors (dynamic message)

UDP/MASQUE Authentication:

forwardproxy.udp.auth INFO AUDIT Missing or invalid Proxy-Authorization header
forwardproxy.udp.auth INFO AUDIT Token too long
forwardproxy.udp.auth INFO AUDIT Authentication failed

UDP/MASQUE ACL & Rate Limiting:

forwardproxy.udp.acl WARN AUDIT ACL denied
forwardproxy.udp.ratelimit ERROR Rate limit service unavailable
forwardproxy.udp.ratelimit ERROR Rate limit check failed
forwardproxy.udp.ratelimit WARN Rate limit exceeded

UDP/MASQUE Connection & Session:

forwardproxy.udp.parse WARN Failed to parse CONNECT-UDP request
forwardproxy.udp.parse WARN Invalid CONNECT-UDP request
forwardproxy.udp.parse WARN Invalid target hostname
forwardproxy.udp.connect INFO UDP proxy session authorized
forwardproxy.udp.ssrf WARN AUDIT SSRF blocked: UDP target resolves to non-routable IP
forwardproxy.udp.dial WARN Failed to dial UDP IP, trying next
forwardproxy.udp.dial ERROR All UDP dial attempts failed
forwardproxy.udp.proxy ERROR UDP proxy error
forwardproxy.udp.complete INFO UDP proxy session completed
forwardproxy.udp.error ERROR Request validation or service errors (dynamic message)

Shared (TCP, HTTP, UDP):

forwardproxy.ratelimit.status DEBUG Rate limit check passed

Metrics

No Prometheus metrics are emitted directly by this service layer. Metrics are recorded by the forward proxy infrastructure module after each connection. Query with: metrics prometheus forwardproxy_<name>


Forward Proxy Engine

Authentication, ACL evaluation, rate limiting, and PAC generation engine for the forward proxy

Overview

The forward proxy module provides browser-native access to backend services using the MASQUE protocol (RFC 9298) over QUIC. It enables authenticated, policy-controlled tunneling of TCP and UDP traffic through the Hexon gateway without requiring any client software.

Core capabilities:

  • Bearer token authentication using HMAC-SHA256 signed tokens with configurable TTL
  • Firewall ACL integration for group-based destination access control
  • Per-user rate limiting (requests/sec) and bandwidth limiting (bytes/sec)
  • PAC (Proxy Auto-Configuration) file generation for browser proxy setup
  • JA4/JA4Q fingerprint binding for session-based authentication
  • Geo-IP and time-based access restrictions (fail-closed)
  • Active connection tracking with per-user and per-target metrics
  • DNS resolution via the DNS module (prevents DNS poisoning)
  • Separate proxy hostname and TLS certificate support for CDN bypass
  • Token refresh mechanism for long-lived browser sessions

Transport security model:

The PAC file returns "HTTPS host:port", so the browser always connects to
the proxy over TLS. The forward proxy listener only speaks TLS.

HTTPS target (e.g. https://example.com):
  Browser --TLS--> Proxy --TLS--> Target
  Browser to proxy: CONNECT + token
  Proxy to target:  tunnel (end-to-end encrypted; raw bytes, no proxy headers)

Plain HTTP target (e.g. http://ifconfig.io):
  Browser --TLS--> Proxy --plain--> Target
  Browser to proxy: GET http://... + token
  Proxy to target:  content visible on last hop; token STRIPPED before forwarding

The bearer token only travels on the encrypted browser-to-proxy leg.
Hop-by-hop headers (Proxy-Authorization, Connection, etc.) are removed
before forwarding. The token never reaches the target server.

Authentication flow (bearer token):

1. User logs in via any method, receives session cookie
2. Browser extension fetches /proxy/config with session cookie
3. Service generates HMAC-SHA256 signed token with user/groups/expiry
4. Extension sends Proxy-Authorization: Bearer <token> on CONNECT
5. Token validated locally (no round-trip for validation)
6. User disabled status checked against directory
7. CheckAccess enforces firewall ACL rules
8. Connection established and traffic relayed
9. Extension periodically refreshes token via /proxy/config
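
Steps 3 and 5 of the flow can be illustrated with a minimal sketch. It assumes a token laid out as base64url(JSON payload) + "." + base64url(signature); the field names (user, groups, exp) and the helper names are hypothetical and chosen only for illustration, since the actual wire format is not documented here.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64(data):
    # URL-safe base64 without padding (assumed encoding, not confirmed by the docs)
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def generate_token(secret, user, groups, ttl_seconds=300, now=None):
    """Sign user/groups/expiry with HMAC-SHA256 (hypothetical payload layout)."""
    issued = time.time() if now is None else now
    payload = json.dumps(
        {"user": user, "groups": groups, "exp": int(issued) + ttl_seconds},
        separators=(",", ":"),
        sort_keys=True,
    ).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(signature)


def validate_token(secret, token, now=None):
    """Return the claims if signature and expiry check out, else None.

    Validation is purely local: no lookup, no server-side token store.
    """
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = _unb64(payload_b64)
        signature = _unb64(signature_b64)
    except ValueError:  # wrong part count or bad base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if claims["exp"] <= current:
        return None
    return claims
```

The signature is compared in constant time before the payload is parsed, so unsigned input never reaches the JSON decoder. The disabled-user and ACL checks (steps 6 and 7) would run after this function returns claims.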

Config

Core configuration under [forward_proxy] section in hexon.toml:

[forward_proxy]
enabled = true # Enable forward proxy module
port = 8443 # Dedicated proxy port (must differ from service.port)
public_port = 8443 # External port for PAC URLs (for NAT/LB scenarios)
preserve_client_port = true # Use client's port in Alt-Svc header
hostname = "proxy.example.com" # Separate hostname for CDN bypass (optional)
fingerprint_binding = true # Enable JA4/JA4Q fingerprint-to-session binding
fingerprint_binding_ttl = "8h" # Fingerprint binding TTL (match session TTL)
rate_limit_per_user = 1000 # Max requests per second per user
bandwidth_limit_per_user = "100mbps" # Max bandwidth per user
# Token settings
token_ttl = "5m" # Token validity duration (default: 5m)
token_refresh_interval = "60s" # Extension refresh interval (default: 60s)
# Probe resistance — hide proxy presence from unauthenticated probes
probe_resistance_mode = "off" # off | fingerprint | ip | secret_host
probe_resistance_decoy = "404" # 404 (Not Found) | empty (204 No Content)
probe_resistance_ttl = "" # cache TTL; defaults to fingerprint_binding_ttl
probe_resistance_secret_host = "" # required when mode=secret_host
# TLS certificate for the proxy hostname (optional)
# Only needed when hostname differs from service.hostname
# Value can be a file path or inline PEM content
# If not set, uses ACME (add hostname to acme.additional_domains) or service cert
cert = "/path/to/cert.pem"
key = "/path/to/key.pem"
# Geo-IP restrictions (optional, falls back to [service] if not set)
geo_enabled = true # Enable geo-IP restrictions
geo_allow_countries = ["US", "CA"] # Allowed country codes (ISO 3166-1 alpha-2)
geo_deny_countries = [] # Denied country codes
geo_bypass_cidr = ["10.0.0.0/8"] # CIDR ranges that bypass geo checks
geo_deny_code = 403 # HTTP status code for geo-denied requests
geo_deny_message = "Access denied from your location"
# Time-based restrictions (optional, falls back to [service] if not set)
time_enabled = true # Enable time-based restrictions
time_timezone = "America/New_York" # Timezone for time checks
time_allow_days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
time_allow_hours = "09:00-18:00" # Allowed hours range
time_deny_code = 403 # HTTP status code for time-denied requests
time_deny_message = "Access not permitted at this time"
# PAC file configuration
[forward_proxy.pac]
enabled = true # Enable PAC endpoint
path = "/proxy.pac" # PAC file URL path
cache_ttl = "15m" # PAC response cache TTL
use_firewall_targets = true # Derive PAC targets from firewall rules

PAC authentication requirement: unauthenticated requests receive a minimal PAC that routes all traffic directly. Authenticated users get a PAC with targets derived from their firewall rules.
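
The two PAC variants can be sketched as follows. This is an illustration only: the function name generate_pac, the use of shell-style host patterns, and the choice to fall back to DIRECT when an authenticated user has no targets are assumptions. FindProxyForURL and shExpMatch are the standard PAC entry point and helper.

```python
import json

# Minimal PAC served to unauthenticated requests: everything goes direct,
# so nothing about proxied targets is revealed.
DIRECT_ONLY_PAC = 'function FindProxyForURL(url, host) {\n  return "DIRECT";\n}\n'


def generate_pac(proxy_host, proxy_port, targets, authenticated):
    """Build PAC text; targets are host patterns derived from firewall rules."""
    if not authenticated or not targets:
        return DIRECT_ONLY_PAC
    lines = [
        "function FindProxyForURL(url, host) {",
        "  var targets = %s;" % json.dumps(sorted(targets)),
        "  for (var i = 0; i < targets.length; i++) {",
        "    if (shExpMatch(host, targets[i])) {",
        # "HTTPS host:port" makes the browser speak TLS to the proxy itself.
        '      return "HTTPS %s:%d";' % (proxy_host, proxy_port),
        "    }",
        "  }",
        '  return "DIRECT";',
        "}",
    ]
    return "\n".join(lines) + "\n"
```

For example, generate_pac("proxy.example.com", 8443, ["*.internal.example.com"], True) yields a PAC that routes matching hosts through "HTTPS proxy.example.com:8443" and everything else directly.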

Hot-reloadable: rate_limit_per_user, bandwidth_limit_per_user, geo/time restrictions, PAC settings, token_ttl, token_refresh_interval, probe_resistance_* (mode and TTL apply on next request). Cold (restart required): enabled, port, hostname, fingerprint_binding.

Security

Security layers and hardening measures:

Bearer token security:

Tokens signed with HMAC-SHA256 using the cluster-wide secret key.
Short TTL (default 5 minutes) limits exposure window for stolen tokens.
Token contains user ID, groups, and expiry; validated locally without
round-trip for minimal latency.
Tokens are not stored server-side (stateless validation via signature).
Token transport is always encrypted: the browser-to-proxy connection is
TLS (PAC returns "HTTPS"), and the token is stripped (hop-by-hop header)
before forwarding to the target. Even for plain HTTP targets, the token
never leaves the TLS tunnel.

Fingerprint binding:

JA4/JA4Q TLS fingerprint bound to session via BindFingerprint operation.
Prevents token replay from a different client/browser. Binding has its own
TTL that should match the session TTL for consistent expiry. The same
binding doubles as the signal for probe_resistance_mode=fingerprint.

Probe resistance (probe_resistance_mode):

Without this gate, every CONNECT to the proxy receives a 407 with
"Basic realm=Hexon Proxy", and the IAP-port middleware leaks the
dedicated proxy port in plaintext via 421. Both responses are fingerprintable.

Modes:

  off          Legacy 407-on-everything (default).
  fingerprint  407 only when the request's JA4Q TLS fingerprint is
               currently bound (BindFingerprint cache, populated on
               sign-in). Recommended for browser-extension deployments;
               the binding is auto-populated whenever a user signs in.
  ip           407 only when the request's source IP authenticated
               within probe_resistance_ttl. Survives JA4Q drift but
               leaks the 407 to every client behind a shared egress
               once one user has signed in. Best for office/VPN-fronted
               access.
  secret_host  407 only when r.Host equals probe_resistance_secret_host.
               Best for non-extension manual proxy configuration where
               the secret host is distributed out-of-band.

In all non-off modes the IAP-port middleware mirrors the same gate so
a probe diffing responses across listeners cannot fingerprint either.

Metrics:

forwardproxy_probe_decisions_total{mode, decision, path}
  mode      configured mode at decision time
  decision  "challenge" (407 emitted) or "decoy" (decoy served)
  path      "tcp" / "http" / "udp" / "iap_middleware"
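
The gate described above reduces to a small decision function. The sketch below is hypothetical: the parameter names are invented, the inputs (whether a fingerprint is bound, whether the IP authenticated recently) are assumed to be looked up elsewhere, and the case-insensitive host comparison is an assumption rather than documented behavior.

```python
def probe_decision(mode, fingerprint_bound=False, ip_recently_authenticated=False,
                   request_host="", secret_host=""):
    """Return "challenge" (emit 407) or "decoy" (serve the decoy response)."""
    if mode == "off":
        return "challenge"  # legacy behavior: 407 on everything
    if mode == "fingerprint":
        known = fingerprint_bound
    elif mode == "ip":
        known = ip_recently_authenticated
    elif mode == "secret_host":
        # An empty secret never matches, so a misconfiguration cannot open the gate.
        known = bool(secret_host) and request_host.lower() == secret_host.lower()
    else:
        raise ValueError("unknown probe_resistance_mode: %r" % mode)
    return "challenge" if known else "decoy"


def decoy_status(decoy):
    """Map probe_resistance_decoy to an HTTP status code."""
    statuses = {"404": 404, "empty": 204}
    if decoy not in statuses:
        raise ValueError("unknown probe_resistance_decoy: %r" % decoy)
    return statuses[decoy]
```

The returned string matches the decision label of forwardproxy_probe_decisions_total, so the same value could feed both the response path and the counter.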

Access control (multi-layer):

1. Bearer token authentication (identity verification)
2. User disabled check via directory.IsUserDisabled (account status)
3. Firewall ACL via CheckAccess (group-based destination control)
4. Rate limiting per user (abuse prevention)
5. Bandwidth limiting per user (network saturation prevention)
6. Geo-IP restrictions (location-based access, fail-closed)
7. Time-based restrictions (schedule-based access, fail-closed)
8. DNS resolution via the DNS module (prevents DNS poisoning)

Geo-IP and time restrictions:

Both use fail-closed semantics: if the check cannot be performed
(e.g., GeoIP database unavailable), access is denied.
Forward proxy has its own geo/time config that overrides [service] defaults,
allowing different policies for proxy vs. web access.
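
Fail-closed semantics can be sketched as a wrapper around any restriction check. The helper name check_fail_closed and the convention that a check returns a boolean are assumptions made for this illustration.

```python
def check_fail_closed(check, *args):
    """Run a restriction check; inability to evaluate it counts as a denial."""
    try:
        result = check(*args)
    except Exception:
        # e.g. GeoIP database unavailable or lookup timed out: deny.
        return False
    # Only an explicit True grants access; an unexpected result type also denies.
    return result is True
```

The three denial paths (check raised, check returned a non-boolean, check returned False) correspond to the "failed", "invalid response type", and "blocked" log entries listed for forwardproxy.restrictions.geo and forwardproxy.restrictions.time.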

PAC file security:

PAC endpoint requires authentication to return proxy-routed targets.
Unauthenticated PAC returns DIRECT-only routing (no information leak).
Username embedded in PAC for browser extension display only.

Rate and bandwidth limiting:

Per-user rate limiting prevents connection flooding.
Per-user bandwidth limiting prevents single-user network saturation.
Both return RetryAfter hints for well-behaved clients.
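
The algorithm behind the limiter is not documented here. As one common way to produce a retry hint, the sketch below uses a token bucket; the class name, the burst parameter, and the explicit clock argument are all assumptions for illustration.

```python
class TokenBucket:
    """Per-user limiter that reports how long a rejected caller should wait."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now, cost=1.0):
        """Return (allowed, retry_after_seconds)."""
        elapsed = max(0.0, now - self.updated)
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        # Time until enough tokens have accumulated for this request.
        return False, (cost - self.tokens) / self.rate
```

The same structure works for bandwidth limiting by passing the byte count as cost and the bytes/sec limit as rate_per_sec.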

Troubleshooting

Common symptoms and diagnostic steps:

User cannot connect through forward proxy:

- Verify forward_proxy.enabled = true and port is correct
- Check bearer token: token_ttl may have expired, verify refresh is working
- Check user disabled status: directory user <username>
- Verify firewall rules allow the target: forwardproxy check <user> <target>
- Check geo restrictions: geo lookup <client_ip> and geo check <client_ip>
- Check time restrictions: ensure current time is within allowed window
- DNS resolution: verify target hostname resolves via dns test <hostname>

PAC file returns DIRECT for all traffic:

- PAC requires authentication; check session cookie is being sent
- Verify forward_proxy.pac.enabled = true
- Check use_firewall_targets = true and firewall rules exist for the user
- Inspect PAC content: curl -b session=<cookie> https://host/proxy.pac

Token refresh failing (extension shows expired):

- Check token_refresh_interval is shorter than token_ttl
- Verify /proxy/config endpoint is accessible with session cookie
- Check for clock skew between client and server
- Monitor token generation metrics via forwardproxy metrics
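
The first check can be automated with a short script. This sketch assumes durations use a single integer followed by s, m, or h (as in "5m", "60s", "8h"); other formats the gateway may accept are not handled, and the function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert strings such as "5m" or "60s" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError("unsupported duration: %r" % text)
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def refresh_margin(token_ttl, token_refresh_interval):
    """Seconds of validity left when the next refresh fires.

    A value of zero or less means tokens expire before the extension refreshes.
    """
    return parse_duration(token_ttl) - parse_duration(token_refresh_interval)
```

With the defaults, refresh_margin("5m", "60s") gives 240 seconds of slack, which also leaves room for moderate clock skew.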

Rate limited (429 responses):

- Check rate_limit_per_user setting (requests/sec)
- Check bandwidth_limit_per_user setting
- Monitor per-user metrics: forwardproxy metrics <username>
- RetryAfter header indicates when to retry

Fingerprint binding failures:

- Verify fingerprint_binding = true in config
- Check fingerprint_binding_ttl matches session TTL
- JA4 fingerprint changes between requests indicate client switching
- Browser updates can change JA4 fingerprint (rebind needed)

Connection drops or timeouts:

- Check backend connectivity: net tcp <target_host:port>
- Check QUIC port (UDP) is not blocked by intermediate firewalls
- Verify TLS certificate: net tls <proxy_hostname:port>
- Check active connections: forwardproxy metrics to see connection counts

Geo-IP or time-based denial (403/451):

- Geo denial: geo lookup <ip> shows country, geo check <ip> shows policy
- Time denial: verify time_timezone is correct, check time_allow_hours
- Bypass CIDR: add client network to geo_bypass_cidr for exemption
- Forward proxy geo/time overrides [service] config if set

Metrics and monitoring:

- Active connections: forwardproxy metrics (cluster-wide)
- Per-user breakdown: forwardproxy metrics <username>
- Connection success/failure rates tracked via RecordMetrics
- Bytes sent/received per user for bandwidth accounting

Relationships

Module dependencies and interactions:

  • Firewall: ACL rule evaluation determines which destinations each user group can reach. Firewall rules also drive PAC file generation when use_firewall_targets is enabled.
  • Directory: User disabled check on every authentication call. Group membership embedded in token for ACL evaluation.
  • Forward proxy service: Service layer handles HTTP CONNECT (TCP tunneling), CONNECT-UDP (UDP tunneling), and absolute-form HTTP requests (plain HTTP forwarding), plus HTTP endpoints (/proxy/config, /proxy/setup, /proxy.pac). Service calls this engine for auth, ACL, metrics.
  • DNS: Hostname resolution for target destinations, with system DNS fallback.
  • Rate limiting: Per-user request throttling and bandwidth controls.
  • Geo-IP: Location-based access restrictions. Forward proxy can override [service] geo config with its own settings.
  • Sessions: Session cookie used for initial token generation. Fingerprint binding ties proxy session to TLS fingerprint.
  • Configuration: Hot-reload of rate limits, bandwidth limits, geo/time restrictions, PAC settings. Token TTL changes apply to new tokens only.
  • Telemetry: Structured logging for authentication, ACL decisions, rate limit events. Metrics for active connections, bytes transferred, token generation.
  • Auto TLS: ACME certificate for proxy hostname when using a separate hostname (add to acme.additional_domains).

Logs

Log entries by component. Search with: logs search "forwardproxy". Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Initialize:

forwardproxy.init INFO Forward proxy disabled in config
forwardproxy.init ERROR Failed to initialize forward proxy
forwardproxy.init INFO Initializing forward proxy module

Access Control:

forwardproxy.checkaccess ERROR Failed to resolve user groups
forwardproxy.checkaccess ERROR Failed to call firewall.CheckProxyAccess
forwardproxy.checkaccess ERROR Invalid response type from firewall

Allowed Targets:

forwardproxy.getallowedtargets ERROR Failed to resolve user groups
forwardproxy.getallowedtargets ERROR Failed to call firewall.GetAllowedTargets
forwardproxy.getallowedtargets ERROR Invalid response type from firewall

PAC Generation:

forwardproxy.generatepac WARN PAC requested without authentication
forwardproxy.generatepac DEBUG Generated PAC file

Authentication:

forwardproxy.auth WARN Token validation failed
forwardproxy.auth WARN User account is disabled
forwardproxy.auth INFO AUDIT Token authentication successful
forwardproxy.auth DEBUG Invalidated fingerprint binding

Token Generation:

forwardproxy.token ERROR Failed to generate token
forwardproxy.token DEBUG Generated proxy token

Fingerprint Binding:

forwardproxy.bind WARN Failed to broadcast fingerprint binding
forwardproxy.bind WARN Failed to achieve quorum for fingerprint binding
forwardproxy.bind INFO Fingerprint bound to session

Rate Limiting:

forwardproxy.ratelimit WARN Rate limit check called without UserID
forwardproxy.ratelimit WARN User rate limit exceeded
forwardproxy.ratelimit WARN Destination rate limit exceeded
forwardproxy.ratelimit WARN User bandwidth limit exceeded

Rate Limit Cleanup:

forwardproxy.cleanup DEBUG Cleaned up stale rate limit entries

Geo Restrictions:

forwardproxy.restrictions.geo ERROR Geo check failed - denying access (fail-closed)
forwardproxy.restrictions.geo ERROR Geo check wait failed - denying access (fail-closed)
forwardproxy.restrictions.geo ERROR Invalid geo check response type - denying access (fail-closed)
forwardproxy.restrictions.geo INFO Access blocked by geo restriction

Time Restrictions:

forwardproxy.restrictions.time ERROR Time check failed - denying access (fail-closed)
forwardproxy.restrictions.time ERROR Time check wait failed - denying access (fail-closed)
forwardproxy.restrictions.time ERROR Invalid time check response type - denying access (fail-closed)
forwardproxy.restrictions.time INFO Access blocked by time restriction

Metrics

Prometheus metrics. Query with: metrics prometheus forwardproxy_<name>

Connection Metrics (namespace: forwardproxy):

Metric                             Type     Labels               Description
forwardproxy_connections_total     counter  {protocol, user_id}  Proxy connections recorded
forwardproxy_bytes_sent_total      counter  {protocol, user_id}  Bytes sent through proxy
forwardproxy_bytes_received_total  counter  {protocol, user_id}  Bytes received through proxy
forwardproxy_connection_duration   latency  {protocol, user_id}  Connection duration
forwardproxy_errors_total          counter  {protocol, error}    Failed proxy connections
forwardproxy_active_connections    gauge    {}                   Currently active proxy connections

Network Listener

Manages all network connections — TLS termination, client fingerprinting, HTTP middleware chain, and protocol detection

Overview

Manages all incoming network connections — TLS termination, protocol detection, client fingerprinting, and the HTTP middleware chain. Every request to the gateway passes through the listener before reaching any service or proxy route. Supports TCP, TLS, HTTP/1.1, HTTP/2, HTTP/3 (QUIC), UDP, and gRPC.

Client fingerprinting combines three layers into a composite hash:

JA4 (TLS) — cipher and extension hash, extracted during TLS handshake
HTTP/2 — SETTINGS frame parameters and pseudo-header ordering
TCP/IP Stack — window size, MSS, TTL for OS identification
Composite — SHA256(ja4|http2|tcp) truncated to 32 hex chars
JA4Q (QUIC) — QUIC transport parameter fingerprint for HTTP/3 clients

Used for rate limiting, session affinity, and client identification — resistant to IP spoofing and NAT. Fingerprint data is stored in a unified structure across all protocols (HTTP/1.1, HTTP/2, HTTP/3).
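
The composite hash can be sketched directly from the formula SHA256(ja4|http2|tcp). The string encoding of each layer and the literal "|" separator are assumptions; only the formula and the 32-character truncation come from the description above.

```python
import hashlib


def composite_fingerprint(ja4, http2, tcp):
    """SHA256 over the three layer fingerprints, truncated to 32 hex chars."""
    material = "|".join([ja4, http2, tcp]).encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:32]
```

Because all three layers feed one digest, a change in any single layer (for example a different TCP window size) produces an unrelated composite value.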

HTTP middleware chain (applied in order): security headers, geo restriction, time restriction, rate limiting, size limiting, proof-of-work, WAF. Each layer runs independently.

Additional capabilities:

  • Deployment behind CDN/load balancer with header-based client identification (proxy mode)
  • Per-SNI mTLS with dynamic CA rotation
  • HXEP (Hexon Edge Protocol) for real client IP through edge proxies and SNAT
  • Correlation ID propagation for end-to-end distributed tracing
  • Malformed TLS blocking to reject invalid ClientHello messages
  • Graceful shutdown with configurable connection draining timeout

Config

Core configuration under [service] in config TOML:

[service]
hostname = "auth.example.com" # Service hostname
tls_cert = "/path/to/cert.pem" # TLS certificate path
tls_key = "/path/to/key.pem" # TLS private key path
handshake_timeout = 10 # TLS handshake timeout in seconds (default: 10)
block_malformed_tls = true # Reject invalid TLS ClientHello (default: true)
max_header_bytes = 65536 # Max ClientHello size in bytes (default: 64KB)
disable_server_header = false # Suppress HexonGateway/<version> header (default: false)
correlation_id_header = "X-Hexon-ID" # Correlation ID header name (default: "X-Hexon-ID")
cookie_name = "hexon" # Session cookie name (default: "hexon")
# Mutual TLS
mtls_mode = "none" # "none", "optional", "mandatory" (default: "none")
# HTTP/2 settings
http2_enable = true # Enable HTTP/2 (default: true)
http2_maxstreams = 1000 # Max concurrent streams per connection
http2_maxframesize = 1048576 # Max frame payload size (default: 1MB)
http2_idletimeout = 120 # Idle timeout in seconds
http2_keepalive = true # Enable HTTP/2 keepalive
http2_keepaliveseconds = 30 # Keepalive interval in seconds
# Fingerprint cache
fingerprint_max_entries = 10000 # Max entries in addr fingerprint map (default: 10000)
fingerprint_ttl_seconds = 300 # Base TTL in seconds (default: 5 min)
fingerprint_cleanup_seconds = 30 # Cleanup sweep interval (default: 30s)
fingerprint_max_entries_per_ip = 10 # Max fingerprints per IP, anti-abuse (default: 10)
# JA4 parsing security limits
ja4_max_extensions = 200 # Max TLS extensions to parse (default: 200, typical: 10-30)
ja4_max_sigalgs = 100 # Max signature algorithms to parse (default: 100)
# HTTP/2 fingerprint cache
http2_fingerprint_cache_size = 10000 # Max entries (default: 10000)
http2_fingerprint_cache_evict_pct = 10 # % of oldest entries to evict when full (1-50)
# QUIC fingerprint reassembly
quic_fingerprint_reassembly_max_packets = 10 # Max packets for reassembly (default: 10)
quic_fingerprint_reassembly_max_bytes = 15360 # Max reassembly buffer (default: 15KB)
quic_fingerprint_reassembly_timeout_s = 5 # State timeout (default: 5s)
quic_max_crypto_frame_offset = 65536 # Max CRYPTO frame offset (default: 64KB)
# Proxy mode (behind CDN/LB)
proxy = false # Enable proxy mode (default: false)
proxy_cidr = ["10.0.0.0/8"] # Trusted proxy IPs (REQUIRED when proxy=true)
proxy_header_clientip = "X-Forwarded-For" # Real client IP header (REQUIRED when proxy=true)
proxy_header_clientcert = "SSL_CLIENT_CERT" # Client certificate header (optional)
proxy_header_clientfingerprint = "CF-Ray" # Client fingerprint header (optional)
proxy_header_traceid = "X-Request-ID" # Trace ID header for distributed tracing (optional)
# Geo restriction (router-level middleware)
geo_enabled = false # Enable geo restrictions (default: false)
geo_database = "GeoLite2-Country.mmdb"
geo_asn_database = "GeoLite2-ASN.mmdb"
geo_allow_countries = [] # ISO 3166-1 alpha-2 codes (empty = all)
geo_deny_countries = [] # Deny takes precedence over allow
geo_allow_asn = [] # ASN allow list
geo_deny_asn = [] # ASN deny list
geo_bypass_cidr = [] # CIDRs that skip geo checks
geo_deny_code = 403 # HTTP status for denials
geo_deny_message = "" # Custom denial message
# Time restriction (router-level middleware)
time_enabled = false # Enable time restrictions (default: false)
time_bypass_cidr = [] # CIDRs that skip time checks
time_default_timezone = "UTC" # Default timezone (IANA format)
[protection]
rate_limit = "100/1m" # Requests per interval (empty = disabled)
rate_limit_type = "fingerprint" # "fingerprint" or "ip" (default: "ip")
rate_limit_bantime = "5m" # Ban duration when limit exceeded

Fingerprint adaptive TTL (based on cache utilization):

Normal (<60%): base TTL (default 5 min)
Medium (60-80%): base TTL / 2 (min 2 min)
High (>80%): base TTL / 5 (min 1 min)
LRU eviction triggers when TTL cleanup is insufficient.
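
The tiers above can be written as a small function. This is a sketch of the stated thresholds only; the function name is hypothetical, and the handling of the exact 60% and 80% boundaries (both treated as the medium tier) is an assumption.

```python
def adaptive_ttl(base_ttl_seconds, entries, max_entries):
    """Pick the fingerprint TTL in seconds from current cache utilization."""
    # Integer arithmetic avoids float rounding at the 60% / 80% boundaries.
    used_pct_x100 = entries * 100
    if used_pct_x100 > max_entries * 80:
        return max(base_ttl_seconds // 5, 60)   # high: base / 5, floor 1 min
    if used_pct_x100 >= max_entries * 60:
        return max(base_ttl_seconds // 2, 120)  # medium: base / 2, floor 2 min
    return base_ttl_seconds                      # normal: base TTL
```

With the defaults (fingerprint_ttl_seconds = 300, fingerprint_max_entries = 10000), a cache holding 9,000 entries would use a 60-second TTL.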

[service]
# HXEP (Hexon Edge Protocol)
hexon_edge_protocol = false # Enable HXEP header parsing (default: false)
hexon_edge_cidr = [ # Trusted CIDRs for HXEP (default: trust all)
"10.244.0.0/16", # Kubernetes pod network
]

HXEP (Hexon Edge Protocol) — real client IP through edge proxies:

When traffic flows: External Client → Edge Proxy → Gateway (via k8s Service/LB),
the edge proxy prepends a binary header with the original client IP and port.
Format: Magic "HXEP" (4B) + Type (1B: 0x04=IPv4, 0x06=IPv6) + IP (4/16B) + Port (2B)
Required for: geo-IP accuracy, rate limiting, IDS, and RADIUS NAS identification
when the gateway sits behind an edge proxy or Kubernetes service with SNAT.
Config:
- service.hexon_edge_protocol = true → enables HXEP parsing on all listeners
- service.hexon_edge_cidr = [...] → only these source CIDRs are trusted for HXEP
Default: ["0.0.0.0/0", "::/0"] (trust all) — restrict to pod CIDR in production
- Packets from untrusted CIDRs: HXEP header stripped, socket address used
- Set automatically via Helm when edge.enabled=true
Protocols: TCP (parsed on first read, before TLS handshake), UDP (PacketConn wrapper),
HTTP/3 QUIC (HXEP wrapping applied transparently, GSO/ECN/GRO OOB data preserved).
Used by: reverse proxy, RADIUS (RADSEC + UDP), SSH bastion, QUIC connector, QUIC client access.
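
The header format can be decoded with a few lines. This sketch assumes the port is in network byte order (big-endian), which the format line does not state; the function names parse_hxep and resolve_client are hypothetical.

```python
import ipaddress
import struct

MAGIC = b"HXEP"


def parse_hxep(data):
    """Return (ip, port, remaining_bytes), or None if no valid header is present."""
    if len(data) < 5 or data[:4] != MAGIC:
        return None
    addr_type = data[4]
    if addr_type == 0x04:
        ip_len = 4
    elif addr_type == 0x06:
        ip_len = 16
    else:
        return None
    end = 5 + ip_len + 2
    if len(data) < end:
        return None
    ip = ipaddress.ip_address(data[5:5 + ip_len])
    (port,) = struct.unpack("!H", data[5 + ip_len:end])  # assumed big-endian
    return ip, port, data[end:]


def resolve_client(data, socket_ip, trusted_cidrs):
    """Apply the trust rule: untrusted sources get the header stripped and ignored."""
    parsed = parse_hxep(data)
    if parsed is None:
        return socket_ip, data
    ip, _port, rest = parsed
    source = ipaddress.ip_address(socket_ip)
    trusted = any(source in ipaddress.ip_network(cidr) for cidr in trusted_cidrs)
    return (str(ip) if trusted else socket_ip), rest
```

An IPv4 header is 11 bytes and an IPv6 header is 23 bytes; the bytes that follow (for TCP, the start of the TLS handshake) are passed through unchanged.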

Hot-reloadable: TLS certificates, mTLS CA pool, proxy mappings, geo/time rules, rate limit settings, fingerprint cache limits. Cold (restart required): listen addresses, HTTP/2 enable, proxy mode toggle, HXEP settings.

Troubleshooting

Common symptoms and diagnostic steps:

TLS handshake failures:

- Malformed ClientHello blocked: check 'logs search "Malformed TLS"' for details
- block_malformed_tls=true rejects missing SNI, invalid TLS version, oversized ClientHello
- ClientHello too large: check max_header_bytes setting (default 64KB)
- TLS version rejected: only 0x0301-0x0304 (TLS 1.0-1.3) accepted
- mTLS certificate popup on proxy routes: check per-SNI mTLS config, set mtls=false on mapping
- CA rotation issues: 'certs list' to verify CA bundle, check 'logs search "CA rotation"'
- Start with: 'diagnose domain <hostname>' for cross-subsystem check

Fingerprint cache exhaustion:

- High memory from fingerprint storage: check fingerprint_max_entries setting
- Adaptive TTL kicking in too aggressively: increase fingerprint_ttl_seconds
- Per-IP abuse: 'logs search "fingerprint limit exceeded"' to identify attackers
- fingerprint_max_entries_per_ip controls anti-abuse threshold (default: 10)
- LRU eviction warnings: 'logs search "evict"' to monitor cache pressure
- Check: 'metrics prometheus fingerprint' for cache utilization metrics

Session affinity not working:

- Verify cluster_affinity=true in global config
- Loopback connections (127.0.0.1, ::1) bypass affinity by design
- Circuit breaker open for target node: 'proxy circuits' to check breaker states
- No TLS = no fingerprint = no affinity: ensure clients connect via HTTPS
- Check: 'cluster status' for node health, 'health components' for listener status

Proxy mode issues (behind CDN/LB):

- 403 Forbidden: source IP not in proxy_cidr, check 'logs search "CIDR"'
- 400 Bad Request: missing client IP header, verify proxy_header_clientip config
- Rate limiting all users as one: JA4 unavailable in proxy mode, use proxy_header_clientfingerprint
- Wrong client IP: X-Forwarded-For uses FIRST IP only (original client, not proxy chain)
- Header injection: ensure proxy_cidr is restricted to actual proxy IPs
- Distributed tracing broken: configure proxy_header_traceid for end-to-end correlation
- mTLS through proxy: set proxy_header_clientcert and mtls_mode="optional" or "mandatory"

QUIC/HTTP/3 fingerprint failures:

- Large ClientHello spanning packets: check quic_fingerprint_reassembly_max_packets
- Reassembly timeout: increase quic_fingerprint_reassembly_timeout_s for slow networks
- CRYPTO frame offset too large: quic_max_crypto_frame_offset default 64KB should suffice
- Connection ID too long (>20 bytes): RFC 9000 violation, likely malicious traffic

Rate limiting misbehavior:

- All clients sharing one rate bucket: check rate_limit_type ("fingerprint" vs "ip")
- Composite fingerprint unavailable: falls back to IP automatically
- Per-route bypass not working: verify disable_rate_limit=true on the proxy mapping
- Cluster-wide consistency: rate limits use distributed memory cache
- Check: 'ratelimit stats' for current rate limiting state, 'metrics ratelimit' for counters

HXEP (Hexon Edge Protocol) issues:

- HXEP not resolving real client IP: verify service.hexon_edge_protocol = true
- Wrong client IP after HXEP: verify source IP falls within service.hexon_edge_cidr
- "HXEP header stripped": source IP is outside trusted CIDRs — add pod/edge CIDR
- Geo/rate limiting sees edge proxy IP instead of client: HXEP not enabled or CIDR mismatch
- RADIUS NAS rejected after HXEP: real NAS IP doesn't match any [[radius.client]] CIDR
- Default trust-all CIDRs in production: security risk — restrict to actual pod network CIDR
- Config: 'config show service' and check hexon_edge_protocol + hexon_edge_cidr fields
- Helm sets HXEP automatically when edge.enabled=true in values.yaml

Connection metrics missing:

- Metrics batched (flush every 100ms or on close): short-lived connections may lag
- Check: 'health components' for listener health status
- 'metrics prometheus listener' for per-listener connection counters

Geo/time restriction issues:

- Geo blocking wrong country: verify MaxMind database is current
- Bypass CIDR not working: geo_bypass_cidr checked before country/ASN rules
- Time window mismatch: verify IANA timezone spelling (e.g., "America/New_York")
- Overnight ranges supported: "22:00-06:00" spans midnight correctly
- Check: 'geo lookup <ip>' to verify classification, 'geo timecheck <ip>' for time rules
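
The overnight-range behavior can be reproduced with a short check. This sketch assumes the window is start-inclusive and end-exclusive, which the documentation does not specify; the function name in_window is hypothetical.

```python
import datetime


def in_window(spec, now):
    """Check whether a time of day falls inside an "HH:MM-HH:MM" window."""
    start_text, end_text = spec.split("-")
    start = datetime.time.fromisoformat(start_text)
    end = datetime.time.fromisoformat(end_text)
    if start <= end:
        # Same-day window, e.g. 09:00-18:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00-06:00.
    return now >= start or now < end
```

The caller is expected to convert the current instant to the configured IANA timezone before passing the time of day in.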

Architecture

Connection lifecycle:

  1. Client connects to TCP socket
  2. First bytes peeked to detect TLS, extract JA4 fingerprint + SNI
  3. TCP fingerprint extracted (window size, TTL, MSS, options ordering)
  4. Session affinity check: fingerprint hash maps to a cluster node
  5. If affinity target is a remote node: forward connection to that node
  6. If local: proceed with TLS handshake (per-SNI mTLS selection)
  7. If HTTP/2: extract HTTP/2 fingerprint from SETTINGS frame
  8. Compute composite hash: SHA256(ja4|http2|tcp) truncated to 32 hex chars
  9. Assign correlation ID, begin connection tracking
  10. HTTP middleware chain: telemetry -> client identification -> connection info -> security headers -> geo restriction -> time restriction -> rate limit -> handler
  11. Handler processes request, correlation ID propagates as trace_id across modules
  12. Metrics flushed on connection close

Fingerprint extraction pipeline:

Accept-level (before TLS): JA4 from ClientHello peek (zero-copy, buffered I/O)
TLS callback: per-SNI mTLS mode selection
Post-handshake: HTTP/2 SETTINGS fingerprint from connection preface
TCP layer: p0f-style OS fingerprint from socket options (window, MSS, TTL)
QUIC path: JA4Q from Initial packet, transport params fingerprint, multi-packet reassembly

GSO/ECN/GRO preservation:

All UDP wrappers (HXEP edge protocol and JA4Q fingerprint) preserve kernel offload
capabilities so that QUIC can use:
- GSO (Generic Segmentation Offload): send 64KB in one syscall, kernel splits into MTU packets
- GRO (Generic Receive Offload): kernel coalesces packets, fewer syscalls on receive
- ECN (Explicit Congestion Notification): congestion signals via IP header bits
Without these, QUIC silently falls back to one syscall per packet.
This affects both HTTP/3 reverse proxy and QUIC connector listeners.

Fingerprint memory protection:

Address fingerprint map: configurable max entries (default 10,000) with adaptive TTL
Per-IP limit: configurable (default 10), oldest replaced on overflow
LRU eviction: sorts by timestamp, evicts oldest when TTL cleanup insufficient
HTTP/2 cache: configurable size with percentage-based LRU eviction (1-50%)
All maps use lock-free concurrent reads for performance
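
The per-IP limit with oldest-entry replacement can be sketched as below. This is a single-threaded illustration with hypothetical names; it does not attempt the lock-free concurrent reads mentioned above.

```python
from collections import OrderedDict, defaultdict


class PerIPFingerprints:
    """Keep at most max_per_ip fingerprints per source IP, replacing the oldest."""

    def __init__(self, max_per_ip=10):
        self.max_per_ip = max_per_ip
        self._by_ip = defaultdict(OrderedDict)

    def add(self, ip, fingerprint, now):
        entries = self._by_ip[ip]
        if fingerprint in entries:
            # Seen again: refresh its position so it counts as most recent.
            entries.move_to_end(fingerprint)
        elif len(entries) >= self.max_per_ip:
            # Overflow: drop the oldest entry for this IP only.
            entries.popitem(last=False)
        entries[fingerprint] = now

    def fingerprints(self, ip):
        """Fingerprints for an IP, oldest first."""
        return list(self._by_ip.get(ip, ()))
```

Bounding the map per IP means a single address cycling through synthetic fingerprints can only evict its own entries, not those of other clients.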

Proxy mode flow:

Step 1: Validate source IP against configured proxy_cidr
Step 2: Extract trace ID from proxy header, update correlation context
Step 3: Extract and sanitize client IP (first IP from comma-separated list)
Step 4: Fingerprint priority: dedicated header > client cert hash > client IP
Step 5: Update context with real client identifiers for downstream modules
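
Steps 1 and 3 can be sketched with the standard ipaddress module. The function name and the use of exceptions to signal the 403 and 400 cases are assumptions for illustration.

```python
import ipaddress


def client_ip_from_proxy(source_ip, forwarded_for, proxy_cidrs):
    """Return the real client IP, trusting the header only from known proxies."""
    source = ipaddress.ip_address(source_ip)
    # Step 1: the TCP peer must be inside proxy_cidr (otherwise 403).
    if not any(source in ipaddress.ip_network(cidr) for cidr in proxy_cidrs):
        raise PermissionError("source %s is not in proxy_cidr" % source_ip)
    # Step 3: take the FIRST entry of the comma-separated list (original client).
    first = forwarded_for.split(",")[0].strip()
    if not first:
        raise ValueError("missing client IP header")  # maps to 400
    # Parsing also sanitizes: anything that is not a valid IP raises ValueError.
    return str(ipaddress.ip_address(first))
```

Restricting proxy_cidrs to the actual proxy addresses matters here: any peer inside those ranges can set the header to an arbitrary client IP.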

mTLS CA rotation flow:

1. ACME CA rotates, triggers listener update
2. CA pool rebuilt atomically (config CA + ACME CA merged)
3. HTTPS listeners gracefully restarted
4. Existing connections drain gracefully, new connections get fresh CA pool
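
The atomic rebuild in step 2 follows a common pattern: build the merged pool off to the side, then publish it with a single reference swap so that readers never observe a half-built pool. The sketch below models that pattern in Python with a hypothetical `CAPool` class; CA certificates are represented as plain strings for brevity.

```python
import threading

class CAPool:
    """Hypothetical holder for the merged (config CA + ACME CA) trust pool."""

    def __init__(self, config_cas):
        self._config_cas = frozenset(config_cas)
        self._current = self._config_cas
        self._lock = threading.Lock()  # serializes writers only

    def rotate(self, acme_cas):
        # Build the new immutable pool first, then publish it in one assignment.
        merged = self._config_cas | frozenset(acme_cas)
        with self._lock:
            self._current = merged

    def snapshot(self):
        # Readers take a reference; an old snapshot stays valid after rotation.
        return self._current
```

A connection that took its snapshot before the rotation keeps verifying against the old pool while it drains, which matches step 4.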

Graceful shutdown sequence:

1. Stop accepting new connections on all listeners
2. Close all listener sockets
3. Wait for active connections up to configurable timeout
4. Cancel contexts for remaining connections
5. Force-close any connections still open after timeout
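
Steps 3 through 5 amount to a bounded wait followed by cancellation. A minimal `asyncio` sketch, assuming each active connection is represented by a task; the names `drain` and `demo` are illustrative, and closing the listener sockets (steps 1 and 2) is not modeled here.

```python
import asyncio

async def drain(connections, timeout):
    """Wait for connection tasks up to `timeout` seconds, then cancel the rest.

    Returns a (finished, cancelled) pair of counts.
    """
    if not connections:
        return 0, 0
    # Step 3: wait for active connections, bounded by the timeout.
    done, pending = await asyncio.wait(connections, timeout=timeout)
    # Step 4: cancel whatever is still running.
    for task in pending:
        task.cancel()
    # Step 5: reap the cancelled tasks so nothing is left open.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)

async def demo():
    fast = asyncio.create_task(asyncio.sleep(0.01))  # finishes in time
    slow = asyncio.create_task(asyncio.sleep(60))    # outlives the timeout
    return await drain({fast, slow}, timeout=0.5)
```

Running `asyncio.run(demo())` yields one finished and one cancelled connection.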

Performance characteristics:

- Pooled slice allocations reduce GC pressure during fingerprint extraction
- Buffered I/O to minimize syscalls
- Metrics batched to reduce overhead (flush every 100ms)
- TCP Fast Open: 15-30% latency reduction for repeat clients (Linux 3.7+, macOS)
- TCP Window Scaling: 20-40% throughput improvement for large transfers
- SO_REUSEPORT on Linux for load balancing across cores

Relationships

Module dependencies and interactions:

  • Proxy: Provides per-SNI mTLS lookup. Listener provides fingerprint and client IP context consumed by proxy for rate limiting, identity headers, and session affinity.
  • Sessions: Listener middleware manages session cookie extraction. Session validation uses correlation IDs propagated through listener context.
  • Certificates: TLS termination uses certificates from the cert module. Per-mapping certificates loaded via SNI callback. CA pool for mTLS verification rebuilt atomically on ACME CA rotation.
  • WAF: WAF rules applied in middleware chain after listener accepts connection. Fingerprint available in context for WAF correlation.
  • X.509 authentication: mTLS mode controls TLS client auth level. In proxy mode, client certificates injected from HTTP header. Certificate validation uses dynamic CA pool.
  • Rate limiting: Middleware reads either the composite fingerprint (JA4 + HTTP/2 + TCP) or the client IP from context. The choice between fingerprint-based and IP-based limiting is configurable per route.
  • Geo restriction: Middleware at router level uses client IP from context with MaxMind GeoLite2 databases for country/ASN lookup.
  • Time restriction: Middleware after geo restriction uses client country for timezone-aware time window matching.
  • Cluster affinity: Fingerprint hash selects cluster node for session routing. Node health checked before forwarding. Forwarded connections use inter-node communication for transparent routing.
  • DNS: The listener does not use DNS directly, but proxy backends are resolved via the DNS module.
  • Distributed tracing: Correlation IDs generated at listener level propagate as trace_id through all operations, enabling end-to-end tracing across cluster nodes.
  • Connection pool: Backend connection management operates downstream of listener. Listener handles inbound connections; connection pool handles outbound to backends.
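
The cluster affinity behavior described above (hash the fingerprint, pick a node, check health before forwarding) can be sketched as follows. The hashing scheme is not specified in this section, so the SHA-256 modulo approach and the function name `pick_node` are assumptions chosen only to make the idea concrete.

```python
import hashlib

def pick_node(fingerprint, nodes, healthy):
    """Map a fingerprint to a cluster node, skipping unhealthy nodes.

    Assumed scheme: hash the fingerprint, index into the sorted node list,
    and walk forward until a healthy node is found.
    """
    ordered = sorted(nodes)
    if not ordered:
        return None
    digest = hashlib.sha256(fingerprint.encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    for offset in range(len(ordered)):
        node = ordered[(start + offset) % len(ordered)]
        if node in healthy:
            return node
    return None  # no healthy node available
```

The same fingerprint always lands on the same node while that node is healthy, which is what gives session affinity; when it is unhealthy, the walk moves on to the next candidate.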

Encrypted Client Hello (ECH)

Encrypted Client Hello (ECH) hides the SNI of proxied apps behind the gateway's service hostname.

Without ECH, a network observer sees which app a user accesses via the plaintext SNI in the TLS ClientHello (e.g., “app.internal.com”). With ECH enabled, the observer only sees the gateway’s service hostname (e.g., “gateway.example.com”). The real hostname is encrypted inside the ClientHello using HPKE (X25519 + HKDF-SHA256 + AES-128-GCM).

Configuration:

[service]
ech = true # Default: false (opt-in)

How it works:

1. Gateway generates an HPKE key pair (X25519) and ECH config on startup
2. The ECH config is logged as base64 — publish it in a DNS HTTPS record
3. Clients that support ECH (Chrome 117+, Firefox 118+, Safari 17.4+) encrypt the real SNI in the ClientHello
4. The gateway decrypts the inner ClientHello using its HPKE private key
5. GetCertificate receives the decrypted (inner) SNI, so certificate selection and proxy routing work unchanged
6. Non-ECH clients connect normally with plaintext SNI (graceful fallback)
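
Before publishing the logged value from step 2, it can help to decode it and confirm that the public name and HPKE suite are what you expect. The sketch below assumes the logged base64 is a standard ECHConfigList as laid out in the TLS ECH specification (version 0xfe0d); `parse_ech_config_list` and `https_record` are hypothetical helper names, and the record owner name is whatever hostname clients look up.

```python
import base64
import struct

def parse_ech_config_list(b64):
    """Decode a base64 ECHConfigList and return its version-0xfe0d entries."""
    data = base64.b64decode(b64)
    (total,) = struct.unpack_from("!H", data, 0)
    if total != len(data) - 2:
        raise ValueError("ECHConfigList length mismatch")
    configs, pos = [], 2
    while pos < len(data):
        version, length = struct.unpack_from("!HH", data, pos)
        body = data[pos + 4 : pos + 4 + length]
        pos += 4 + length
        if version != 0xFE0D:
            continue  # skip versions this sketch does not understand
        # HpkeKeyConfig: config_id, kem_id, public_key, cipher_suites
        config_id, kem_id, pk_len = struct.unpack_from("!BHH", body, 0)
        off = 5 + pk_len
        (cs_len,) = struct.unpack_from("!H", body, off)
        off += 2
        suites = [struct.unpack_from("!HH", body, off + i) for i in range(0, cs_len, 4)]
        off += cs_len
        # maximum_name_length, then the length-prefixed public_name
        _max_name_len, name_len = struct.unpack_from("!BB", body, off)
        public_name = body[off + 2 : off + 2 + name_len].decode("ascii")
        configs.append({
            "config_id": config_id,
            "kem_id": kem_id,            # 0x0020 is X25519
            "cipher_suites": suites,     # (kdf_id, aead_id) pairs
            "public_name": public_name,
        })
    return configs

def https_record(owner, b64, ttl=300):
    """Format a zone-file style HTTPS record carrying the ECH config."""
    return f'{owner} {ttl} IN HTTPS 1 . ech="{b64}"'
```

With the suite named earlier, one would expect `kem_id` 0x0020 and a cipher suite of `(0x0001, 0x0001)` (HKDF-SHA256, AES-128-GCM), with `public_name` equal to the gateway's service hostname.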

The ECH config must be published in a DNS HTTPS record (RR type 65, the HTTPS form of SVCB) for clients to discover ECH support. The gateway logs the config as base64 at startup:

"ech_config_list_base64": "<base64>" — copy to your DNS HTTPS record

What doesn’t change with ECH:

- Certificate selection (GetCertificateForSNI receives inner SNI)
- Proxy routing (uses HTTP Host header, not SNI)
- JA4 fingerprinting (computed from outer ClientHello)
- mTLS (client cert validation after ECH decryption)

Limitations:

- CDN termination: If a CDN terminates TLS before the gateway, ECH at the gateway layer has no effect because the CDN has already seen the SNI
- HTTP/3 QUIC: Uses a different ECH mechanism (not covered by this feature)
- DNS requirement: Without the HTTPS record, clients fall back to plaintext SNI

Logs

Log entries by component. Search with: logs search “listener”
Levels: ERROR > WARN > INFO > DEBUG > TRACE. DEBUG/TRACE require log level configuration.

HTTP Errors:

listener.http.error DEBUG/WARN HTTP server errors (DEBUG for client TLS/connection failures, WARN otherwise)

Proxy Mode:

listener.proxy_validation WARN Rejected connection not from trusted proxy
listener.proxy_validation ERROR Client IP header missing in proxy mode
listener.proxy_cert WARN Oversized cert header (DoS) / parse failed
listener.proxy_cert DEBUG/INFO Client cert injected / invalid PEM block

CORS:

listener.cors WARN AUDIT CORS origin rejected

Sessions:

listener.session DEBUG Session created / validated / expired
listener.session ERROR/WARN Session creation/validation failures

Proof-of-Work:

listener.pow INFO PoW challenge passed / application session valid / body restored
listener.pow WARN Body too large / session validation failures / invalid body format
listener.pow ERROR PoW handler not registered / body encryption failures
listener.pow DEBUG Session checks, challenge served, body stored

Rate Limiting:

listener.ratelimit WARN AUDIT Request blocked by rate limit
listener.ratelimit WARN Config fallback (invalid rate_limit_type)
listener.ratelimit ERROR Ratelimit module call/response failures / no fingerprint
listener.ratelimit DEBUG Fingerprint fallback to IP
listener.ratelimit TRACE Per-entity rate limiting applied
listener.ratelimit.status DEBUG Rate limit check passed
listener.ratelimit.circuitbreaker ERROR Circuit breaker open — blocking request

Size Limiting:

listener.sizelimit WARN AUDIT Request blocked — size limit exceeded
listener.sizelimit ERROR Sizelimit module call/response failures
listener.sizelimit TRACE Size limit applied / exception / within limit

Compression:

listener.compression DEBUG Response compressed

Geo Restrictions:

listener.geo INFO AUDIT Request blocked by geo restriction
listener.geo ERROR Geo check failed (allowing request)

Time Restrictions:

listener.time INFO AUDIT Request blocked by time restriction
listener.time ERROR Time check failed (allowing request)

ECH (Encrypted Client Hello):

ech.generate INFO ECH key pair derived from cluster key

PoW Body Preservation:

pow.body DEBUG POST body stored / retrieved / deleted / restored
pow.body WARN Body not found (expired) / cleanup failures
pow.body ERROR Storage / retrieval / decryption failures

Metrics

Prometheus metrics. Query with: metrics prometheus listener_<name>

Lifecycle:

listener_starts counter {type, name} Listener startups
listener_stops counter {type, name} Listener shutdowns
listener_restarts counter {type, name} Listener restarts
listener_errors counter {type, name} Listener errors

Rate & Size Limiting:

listener_rate_limit_hits counter {reason} Requests blocked by rate limit
listener_ratelimit_circuit_breaker_trips_total counter {} Circuit breaker trips
listener_size_limit_hits counter {host, path} Size limit exceeded

TLS Security:

listener_connections_accepted counter {protocol} Successful TLS connections
listener_security_non_tls_dropped counter {reason} Non-TLS connections rejected
listener_security_malformed_tls counter {reason} Invalid TLS versions
listener_security_oversized_record counter {reason} TLS records exceeding RFC limits
listener_security_oversized_clienthello counter {reason} ClientHello too large
listener_security_small_clienthello counter {reason} Suspiciously small ClientHello
listener_security_malformed_clienthello counter {reason} Malformed ClientHello
listener_security_no_sni counter {reason} TLS handshakes without SNI

QUIC Affinity:

listener_quic_affinity_packets_received counter {} QUIC packets received
listener_quic_affinity_packets_dropped counter {reason} QUIC packets dropped
listener_quic_affinity_decryption_failures counter {} QUIC decryption failures
listener_quic_affinity_packets_local counter {} QUIC packets processed locally
listener_quic_affinity_packets_forwarded counter {target_node} QUIC packets forwarded to cluster
listener_quic_affinity_forward_failures counter {target_node} Forward failures
listener_quic_affinity_response_dropped counter {reason} QUIC response packets dropped
listener_quic_affinity_cid_mappings gauge {} Active connection ID mappings
listener_quic_connection_migrations counter {} QUIC connection migrations

QUIC Forwarding:

listener_quic_forward_connect_errors counter {target_node} Forwarding connect errors
listener_quic_forward_write_errors counter {target_node} Forwarding write errors
listener_quic_forward_bytes counter {target_node} Bytes forwarded

HXEP (Edge Protocol):

hxep_parsed_trusted counter {} TCP HXEP parsed (trusted)
hxep_stripped_untrusted counter {} TCP HXEP stripped (untrusted)
hxep_parse_failed counter {} TCP HXEP parse failures
hxep_partial_header counter {} TCP HXEP incomplete headers
hxep_udp_parsed_trusted counter {} UDP HXEP parsed (trusted)
hxep_udp_stripped_untrusted counter {} UDP HXEP stripped (untrusted)
hxep_udp_parse_failed counter {} UDP HXEP parse failures

Alerts:

rate(listener_rate_limit_hits[5m]) > 50 High rate limiting (possible attack)
increase(listener_ratelimit_circuit_breaker_trips_total[5m]) > 0 Circuit breaker tripped in the last 5 minutes
rate(listener_security_no_sni[5m]) > 10 SNI probing
rate(hxep_stripped_untrusted[5m]) > 0 HXEP spoofing attempt
rate(listener_quic_affinity_forward_failures[5m]) > 0 Cluster QUIC forwarding issues
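
The `rate(...[5m])` thresholds above are per-second averages over the window, so `> 50` means more than 50 blocked requests per second sustained across five minutes. The sketch below is a simplified Python model of that calculation for sanity-checking thresholds; it handles counter resets but omits the extrapolation that Prometheus applies at window edges, and `per_second_rate` is a hypothetical helper, not part of any tooling described here.

```python
def per_second_rate(samples):
    """Approximate a per-second rate from counter samples.

    `samples` is a list of (unix_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so count from zero again.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0

# Hypothetical scrape data for listener_rate_limit_hits over a 5-minute window.
window = [(0, 100), (150, 9100), (300, 18100)]
should_alert = per_second_rate(window) > 50
```

For low-volume counters such as the HXEP and forwarding failure metrics, a threshold of `> 0` fires on any increase inside the window.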