Connectivity

Client Access (HexonClient)

Transparent L3 network access via QUIC tunnels for CLI tools and native applications

Overview

The Client Access subsystem enables end users (DBAs, developers, operators) to transparently access internal resources through a lightweight QUIC tunnel. The HexonClient binary captures IP packets via TUN + gVisor netstack, extracts TCP flows, and dials each flow as a QUIC stream to the gateway.

The gateway side (this module) handles:

QUIC listener on a dedicated port with ALPN “hexon-client” and TLS 1.3
Two authentication paths: server-side device code (RFC 8628) for interactive use, JWT with RFC 5705 channel binding for reconnect/automation
Per-user route derivation from firewall ACL rules (CIDR + Site routes)
Virtual IP allocation from a dedicated subnet (default 100.64.208.0/22)
Per-stream firewall ACL check before dialing backends
Direct dial or connector tunnel routing based on HostAlias Site field
Bidirectional splice with 32KB pooled buffers and half-close propagation
DNS resolution on the control stream for split DNS
DNS defense-in-depth: per-session O(1) rate limiting + ACL enforcement (RFC 8914)
Token refresh with group-change detection and mid-session route updates
Cluster-wide session tracking

This mirrors the connector architecture but reversed: the client opens streams, the gateway accepts and dials backends.

Configuration

Configuration uses the [client_access] TOML section:

  [client_access]
  enabled = true
  port = 8445
  # network_interface = ""        # Bind to specific interface (falls back to service.network_interface)
  # cert = ""                     # Dedicated TLS cert (falls back to SNI/auto-TLS)
  # key = ""                      # Dedicated TLS key
  subnet = "100.64.208.0/22"         # Virtual IP pool for clients (1022 addresses)
  gateway_ip = "100.64.208.1"        # Gateway IP within subnet (excluded from pool)
  dns_upstream = ["10.0.0.53"]    # DNS resolvers for client queries
  dns_domains = []                # Additional DNS domains pushed to all clients
  # cidrs = ["10.0.0.0/22"]      # Additional CIDR routes pushed to all clients
  heartbeat_interval = "30s"      # Heartbeat frequency (session TTL = 3x this)
  token_refresh_interval = "45m"  # Client token refresh interval
  max_idle_timeout = "5m"         # QUIC idle timeout
  max_clients = 1000              # Maximum concurrent client connections
  max_streams_per_client = 100    # Maximum concurrent TCP streams per client
  dns_rate_limit = 100            # Maximum DNS queries per second per client
  # required_groups = ["engineers", "operators"]  # Empty = any authenticated user

Each connected client gets one virtual IP from the pool — use a dedicated CGN-space subnet to avoid overlap with other networks.
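As a rough illustration of pool sizing, the sketch below uses Python's standard ipaddress module to list the host addresses a subnet offers once the gateway IP is set aside. The helper name client_pool is hypothetical, and the gateway's real allocator may reserve addresses differently.

```python
import ipaddress

def client_pool(subnet: str, gateway_ip: str) -> list:
    """Return usable host addresses in `subnet`, excluding the gateway.

    Hypothetical helper for illustration only.
    """
    net = ipaddress.ip_network(subnet)
    gw = ipaddress.ip_address(gateway_ip)
    if gw not in net:
        raise ValueError(f"{gateway_ip} is not inside {subnet}")
    # hosts() already skips the network and broadcast addresses.
    return [ip for ip in net.hosts() if ip != gw]

pool = client_pool("100.64.208.0/22", "100.64.208.1")
print(len(pool), pool[0], pool[-1])
```

Under these assumptions a /22 offers 1022 usable host addresses, one of which is the gateway, which still covers the example max_clients of 1000.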

Routes pushed to clients come from two sources:

Firewall host aliases: CIDRs and IPs from aliases matched by user groups
Config-level cidrs: pushed to all clients regardless of group membership

Both are merged (deduplicated) before sending in ClientAck.
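The merge of the two route sources can be pictured with a short sketch. It assumes bare IPs from host aliases are normalised to /32 networks and that the order of first appearance is kept; merge_routes is a hypothetical name, not a gateway API.

```python
import ipaddress

def merge_routes(alias_routes: list, config_cidrs: list) -> list:
    """Merge both route sources, dropping duplicates while keeping order."""
    seen = set()
    merged = []
    for entry in [*alias_routes, *config_cidrs]:
        # strict=False lets a bare IP such as "10.0.1.5" normalise to a /32.
        net = ipaddress.ip_network(entry, strict=False)
        if net not in seen:
            seen.add(net)
            merged.append(str(net))
    return merged

routes = merge_routes(
    ["10.0.1.5", "10.0.0.0/22"],          # from firewall host aliases
    ["10.0.0.0/22", "192.168.10.0/24"],   # from [client_access] cidrs
)
print(routes)
```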

Admin commands

Admin CLI commands:

  clients list [--user=X]              List connected hexonclient sessions (cluster-wide)
  clients show <session_id>            Show full session details (device, network, streams, traffic, timing)
  clients disconnect <user> [id]       Disconnect all sessions for user, or a specific session [WRITE]

Bastion shell commands (self-service, filtered to own sessions):

  clients                              List your active hexonclient sessions
  clients list                         Same as above
  clients disconnect [session_id]      Disconnect your own session(s)

Security

Two authentication paths (determined by whether client sends a token):

Device code flow (interactive — RFC 8628, same as bastion SSH):

Client establishes QUIC/TLS 1.3 connection (ALPN: “hexon-client”)
Client sends ClientAuth with empty token (signals device code request)
Gateway initiates device code authorization (server-side, no HTTP from client)
Gateway sends DeviceCodeChallenge: verification URI, user code, expiry
Client displays QR code + clickable URL + user code
Gateway polls the device code service until authorized, denied, or expired
On authorization: gateway extracts claims (username, email, groups) from poll response
Gateway checks required_groups, derives routes, allocates VIP
Gateway sends ClientAck with VIP, routes, DNS, and JWT tokens for reconnection

Reconnected sessions use the JWT path below (no re-authentication needed).

JWT flow (reconnect / automation):

Client establishes QUIC/TLS 1.3 connection (ALPN: “hexon-client”)
Client sends ClientAuth: JWT + HMAC-SHA256(token, TLS exporter) proof
Gateway validates JWT (extracts username, groups)
Gateway verifies channel binding proof (RFC 5705 prevents token replay)
Gateway checks required_groups (if configured): user must have ANY listed group
Client sends ClientRegister with device metadata
Gateway derives per-user routes from firewall ACL rules
Gateway sends ClientAck with VIP, routes, DNS, token refresh interval
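The channel-binding proof in the JWT flow can be sketched as follows. This is a minimal illustration, assuming the TLS exporter output is used as the HMAC key and the token as the message; the document does not state the argument order, and the exporter bytes below are stand-ins rather than real RFC 5705 output.

```python
import hashlib
import hmac

def make_proof(token: str, exporter: bytes) -> bytes:
    """Bind a token to one TLS session.

    Assumption: exporter output is the HMAC key, the token is the message.
    """
    return hmac.new(exporter, token.encode(), hashlib.sha256).digest()

def verify_proof(token: str, exporter: bytes, proof: bytes) -> bool:
    expected = make_proof(token, exporter)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected, proof)

session_a = b"\x01" * 32  # stand-in for the exporter output of session A
session_b = b"\x02" * 32  # a different TLS session yields different bytes
proof = make_proof("example.jwt.value", session_a)
same_session = verify_proof("example.jwt.value", session_a, proof)
replayed = verify_proof("example.jwt.value", session_b, proof)
print(same_session, replayed)
```

Because both ends derive the exporter value from their own TLS session, a proof captured on one connection does not verify on another, which is what blocks token replay.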

Per-stream access control:

Each QUIC stream carries a DialHeader (host, port, protocol)
Gateway checks firewall access control (user groups vs target host/port/protocol)
Denied streams get DialStatusDenied response immediately
Only allowed streams proceed to backend dial
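A simplified sketch of the per-stream check is shown below. The Rule shape, the exact-match host comparison, and the default-deny fallback are assumptions made for illustration; the real firewall also handles aliases, CIDRs, and wildcards.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DialHeader:
    host: str
    port: int
    protocol: str  # "tcp" or "udp"

@dataclass(frozen=True)
class Rule:  # hypothetical rule shape
    groups: frozenset
    hosts: frozenset
    ports: frozenset
    protocol: str

def check_dial(user_groups: set, header: DialHeader, rules: list) -> str:
    """Return "allowed" if any rule matches, otherwise "denied" (default deny)."""
    for rule in rules:
        if (user_groups & rule.groups
                and header.host in rule.hosts
                and header.port in rule.ports
                and header.protocol == rule.protocol):
            return "allowed"
    return "denied"

rules = [Rule(frozenset({"dba"}), frozenset({"pg.internal"}), frozenset({5432}), "tcp")]
header = DialHeader("pg.internal", 5432, "tcp")
print(check_dial({"dba"}, header, rules), check_dial({"dev"}, header, rules))
```

A "denied" result here corresponds to the DialStatusDenied response; only "allowed" streams would go on to the backend dial.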

DNS defense-in-depth:

Per-session rate limiting: O(1) time-bucketed rolling window (dns_rate_limit qps)
DNS ACL: after resolve, firewall checks user groups vs host aliases
ACL-denied queries return DNSStatusDenied (RFC 8914 REFUSED) — prevents information leakage
ACL call failure fails open (dial-time ACL is the authoritative control)
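One way to build a time-bucketed rolling window with constant work per query is a small ring of counters. The sketch below is an assumption about the general technique, not the gateway's implementation; the bucket count, the class name, and the injectable clock are all illustrative.

```python
import time

class BucketedRateLimiter:
    """Rolling window split into a fixed number of buckets (constant work per call)."""

    def __init__(self, limit: int, window: float = 1.0, buckets: int = 10,
                 clock=time.monotonic):
        self.limit = limit
        self.buckets = buckets
        self.width = window / buckets
        self.clock = clock
        self.counts = [0] * buckets
        self.ids = [-1] * buckets  # absolute bucket number held by each slot

    def allow(self) -> bool:
        current = int(self.clock() / self.width)
        slot = current % self.buckets
        if self.ids[slot] != current:
            # Slot still holds a bucket from an earlier window: recycle it.
            self.ids[slot] = current
            self.counts[slot] = 0
        oldest = current - self.buckets + 1
        in_window = sum(c for c, i in zip(self.counts, self.ids) if i >= oldest)
        if in_window >= self.limit:
            return False
        self.counts[slot] += 1
        return True

now = [0.0]  # fake clock so the demo is deterministic
limiter = BucketedRateLimiter(limit=3, clock=lambda: now[0])
burst = [limiter.allow() for _ in range(4)]
now[0] = 1.05  # move past the one-second window
after_window = limiter.allow()
print(burst, after_window)
```

The cost per query depends only on the fixed bucket count, never on how many queries arrived, which is the property the O(1) claim refers to.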

Token refresh:

Client sends TokenRefresh with new JWT + proof before token expires
Gateway re-validates JWT and channel binding
Gateway re-checks required_groups: if user lost membership, connection is terminated
If groups changed: re-derive routes, send RouteUpdate with add/remove entries
Bad token on refresh kills the connection (security boundary)
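The group re-check and the add/remove computation on refresh can be sketched with plain set operations. The function names and the dictionary layout of the update are hypothetical; the wire format of RouteUpdate is not described here.

```python
def has_required_group(user_groups, required_groups) -> bool:
    # An empty required_groups list means any authenticated user is accepted.
    return not required_groups or bool(set(user_groups) & set(required_groups))

def route_update(old_routes: set, new_routes: set) -> dict:
    """Compute the entries a RouteUpdate would need to add and remove."""
    return {
        "add": sorted(new_routes - old_routes),
        "remove": sorted(old_routes - new_routes),
    }

before = {"10.0.0.0/22", "10.1.0.0/24"}
after = {"10.0.0.0/22", "10.2.0.0/24"}
update = route_update(before, after)
print(update)
```

If has_required_group returns False on refresh, the sketch's caller would terminate the connection instead of sending an update.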

Troubleshooting

Common symptoms and diagnostic steps:

Client cannot connect:

  - Check listener: 'status summary' shows clientaccess listener state
  - Check config: 'config show client_access' (enabled, port, subnet)
  - Check required_groups: 'config show client_access' — user must be in listed groups (empty = any)
  - Check certs: 'certs list' or 'diagnose domain <hostname>'
  - Max clients reached: 'logs search clientaccess --level=warn'
  - Group denied: 'logs search clientaccess --level=warn' shows "group access denied"

Client connected but cannot reach services:

  - Check pushed routes: 'config show client_access' — cidrs must include destination subnet
  - Check firewall rules: user's groups must match rule sources
  - Check HostAlias: destination alias must have matching hosts (CIDRs for TUN routes, wildcards for DNS only)
  - Check connector: if Site is set, connector must be connected
  - 'logs search clientaccess-dial --level=warn' for denied dials

DNS not resolving:

  - Check dns_upstream config: must point to reachable resolvers
  - Check dns_domains: domains must be in the pushed list for split DNS
  - 'logs search clientaccess-dns' for resolution errors
  - DNS ACL denied (REFUSED): check user's groups match firewall rules for the hostname
  - DNS rate limited (SERVFAIL): check dns_rate_limit setting (default 100 qps)

Token refresh failures:

  - 'logs search clientaccess-refresh --level=warn'
  - Invalid token: OIDC provider may have rotated keys
  - Channel binding failure: possible MITM or TLS session change

Relationships

Module dependencies:

devicecode: Server-side device code authorization (RFC 8628) for interactive authentication
oidc: JWT validation for reconnect/automation authentication
firewall: Per-stream access control, DNS ACL enforcement, host alias route derivation
dns: DNS resolution for client split DNS queries
sessions: Cluster-wide session tracking (create, validate, revoke)
connectors: Site-based routing through connector tunnels
IP pool: Virtual IP allocation from dedicated subnet
listener: QUIC listener with TLS 1.3 and idle timeout
telemetry: Structured logging and Prometheus metrics

Logs

Log entries by component. Search with: logs search “clientaccess”. Levels: ERROR > WARN > INFO > DEBUG.

Lifecycle:

  clientaccess                       INFO          initializing client access subsystem
  clientaccess                       ERROR         failed to create IP pool
  clientaccess                       ERROR         TLS config not available, client access listener disabled
  clientaccess                       ERROR         failed to create client access listener
  clientaccess                       ERROR         failed to start client access listener
  clientaccess                       INFO          client access listener started

Connection:

  clientaccess                       INFO   AUDIT  client connected (VIP, routes, hostname)
  clientaccess                       INFO   AUDIT  client disconnected (duration, traffic stats)
  clientaccess                       WARN          client rejected: max clients reached
  clientaccess                       WARN          unexpected first message type

Registration:

  clientaccess                       INFO          client registered (session, VIP, hostname)
  clientaccess                       INFO          client unregistered (session, duration, traffic counters)

Authentication — JWT:

  clientaccess                       INFO/WARN     client auth failed (INFO for PAT rejection, WARN otherwise)
  clientaccess                       WARN          channel binding failed

Authentication — Device Code:

  clientaccess                       WARN          device code auth rejected: concurrency limit reached
  clientaccess                       WARN          device code authorization request failed
  clientaccess                       INFO          device code challenge sent, waiting for authorization
  clientaccess                       INFO          client disconnected during device code auth
  clientaccess                       INFO          device code authorized
  clientaccess                       INFO          device code denied by user
  clientaccess                       INFO          device code expired

Authorization:

  clientaccess                       WARN          group access denied

Token Refresh:

  clientaccess                       WARN          token refresh failed: invalid token
  clientaccess                       WARN          token refresh failed: channel binding
  clientaccess                       WARN          group access revoked on refresh
  clientaccess                       INFO          token refreshed with group change
  clientaccess                       DEBUG         token refreshed

PAT Revocation:

  clientaccess                       INFO          disconnected clients after PAT revocation

Dial:

  clientaccess                       WARN          dial denied by ACL
  clientaccess                       DEBUG         dial failed
  clientaccess                       DEBUG         udp dial failed
  clientaccess                       DEBUG         dial accept stream error

Traffic:

  clientaccess                       DEBUG         client traffic

Hexdcall Module:

  clientaccess.list_sessions         WARN          Registry not initialized
  clientaccess.list_sessions         DEBUG         Listed client access sessions
  clientaccess.disconnect_session    WARN          Username missing in disconnect request
  clientaccess.disconnect_session    WARN          Registry not initialized
  clientaccess.disconnect_session    INFO          Session not found on this node
  clientaccess.disconnect_session    INFO          Disconnected client access session
  clientaccess.disconnect_session    INFO          Disconnected all client access sessions for user

Metrics

Prometheus metrics. Query with: metrics prometheus clientaccess_<name>

Connections:

  clientaccess_connections_total              counter    {}                    QUIC connections accepted
  clientaccess_connections_active             gauge      {}                    Currently active QUIC connections
  clientaccess_connections_rejected           counter    {reason}              Connections rejected before auth
  clientaccess_connection_duration            latency    {username?}           Connection lifetime

Authentication:

  clientaccess_auth_success_total             counter    {username?}           Successful authentications
  clientaccess_auth_failures_total            counter    {reason}              Failed authentications

Clients:

  clientaccess_clients_active                 gauge      {}                    Registered client instances

Heartbeat:

  clientaccess_heartbeat_latency              latency    {username?}           Heartbeat RTT (raw)

Dial:

  clientaccess_dials_total                    counter    {}                    Dial requests received
  clientaccess_dials_denied_total             counter    {}                    Dials denied by ACL
  clientaccess_dials_success_total            counter    {}                    Dials completed successfully
  clientaccess_dials_errors_total             counter    {}                    Dial errors (connect refused, timeout)
  clientaccess_dial_latency                   latency    {}                    Backend dial time
  clientaccess_streams_active                 gauge      {}                    Active QUIC dial streams

DNS:

  clientaccess_dns_queries_total              counter    {}                    DNS queries processed

Alerts:

  clientaccess_connections_active > max_clients * 0.9     Approaching client limit
  rate(clientaccess_connections_rejected[5m]) > 10        Connection rejection spike
  rate(clientaccess_auth_failures_total[5m]) > 10         Authentication failure spike
  rate(clientaccess_dials_denied_total[5m]) > 20          ACL denial spike

QUIC Connector

Connects remote sites to the gateway via outbound QUIC tunnel — no inbound ports required at the remote site

Overview

Enables access to services at remote sites without IPsec or opening inbound ports. A lightweight binary at the remote site dials out to the gateway over QUIC — the gateway routes traffic through the tunnel. All protocols work through connectors: HTTP proxy, SSH bastion, forward proxy, and SQL bastion.

The binary, hexonconnect, establishes an outbound QUIC connection from the remote site to Hexon. Hexon then sends “dial” commands through this tunnel whenever a proxy mapping, bastion session, forward proxy rule, or firewall policy references that site via the “site” parameter.

Key capabilities:

Zero-trust remote access: connector dials only what Hexon asks, nothing else
Opaque site namespace: same IPs and DNS names across sites are irrelevant
Stateless token auth: HMAC-derived tokens validated without storage
Channel-bound authentication: RFC 5705 TLS Exported Keying Material prevents replay and MITM attacks — the token never travels on the wire
Multi-instance HA: multiple connectors per site with adaptive load balancing
Cross-node routing: any cluster node can route to any connector via adaptive inter-node forwarding — requests arriving at a node without connector instances are transparently forwarded to a node that has them
Auto-reconnect: connector never gives up, exponential backoff on disconnect
CDN-compatible: optional dedicated hostname and TLS certificate for direct access
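The auto-reconnect behaviour can be pictured as an endless sequence of growing delays. The base delay, growth factor, cap, and optional jitter below are assumptions chosen for illustration; the actual values used by hexonconnect are not documented here.

```python
import itertools
import random

def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.0, rng=random.random):
    """Yield reconnect delays forever: base, base*factor, ... capped at `cap`."""
    delay = base
    while True:
        # Optional jitter spreads reconnects from many connectors over time.
        yield delay * (1 + jitter * rng())
        delay = min(delay * factor, cap)

delays = list(itertools.islice(backoff_delays(), 8))
print(delays)
```

The generator never terminates, mirroring a connector that never gives up; a caller would reset it after a successful connection.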

Configuration

Configuration uses the [connector] TOML section:

  [connector]
  enabled = true
  port = 8444
  # hostname = "connector.example.com"  # Optional: dedicated hostname (CDN bypass)
  # cert = "/path/to/cert.pem"          # Optional: file path or inline PEM
  # key = "/path/to/key.pem"            # Optional: file path or inline PEM

  [[connector.sites]]
  id = "prod-asia-a8f3c1"
  name = "Production Asia"
  cidrs = ["203.0.113.0/24"]
  max_instances = 3
  rebalance = true              # Distribute across cluster nodes (default: true)
  rebalance_retries = 5         # Accept after N soft-rejects (default: 5, 1-10)

TLS certificate resolution:

  1. connector.cert/key when set (static certificate)
  2. SNI callback: auto-TLS (ACME), certmanager, wildcard, or service certificate
  If connector.hostname is set and no cert/key is provided, ACME will automatically
  provision a certificate for the connector hostname.

Usage across subsystems — add “site” parameter:

  [[proxy.mapping]]
  app = "API Asia"
  host = "api-asia.example.com"
  service = "http://api.default.svc.cluster.local:8080"
  site = "prod-asia-a8f3c1"

  # Shadow targets can also route through connectors:
  [[proxy.mapping.shadow]]
  name = "staging-mirror"
  service = "https://staging.internal:8443"
  site = "staging-eu"

  # Circuit breaker fallback can use a different connector site:
  [proxy.mapping.circuit_breaker]
  fallback_mode = "service"
  fallback_service = ["http://dr-backend:8080"]
  fallback_site = "dr-europe"

  # SSH cert rules — route bastion SSH through connector:
  [[bastion.ssh_cert.rules]]
  name = "remote-dc-ssh"
  groups = ["devops"]
  destinations = ["*.internal"]
  site = "prod-asia-a8f3c1"

  # SQL bastion — route database connections through connector:
  [[sql_bastion.sites]]
  name = "postgres-remote"
  type = "postgres"
  host = "pg.internal"
  port = 5432
  site = "prod-asia-a8f3c1"

  # Firewall host aliases — route forward proxy traffic through connector:
  [[firewall.aliases.hosts]]
  name = "remote_services"
  hosts = ["gitlab.internal", "jenkins.internal"]
  site = "prod-asia-a8f3c1"
  # Aliases with site skip nft rules — traffic goes through userspace QUIC tunnel

Token generation is deterministic from the cluster key — any node can validate.

Admin commands

Admin CLI:

  connector list                        List configured sites and live connections
  connector show <site-id>              Show site config, token, and connected instances
                                        (includes platform, origin with geo/ASN, system labels)
  connector create <site-id>            Create new site (generates token)
  connector revoke <site-id>            Block site, disconnect active QUIC tunnels
  connector instances <site-id>         List connected instances with metrics

The “connector show” output includes per-instance details: platform (OS/arch), origin IP with country and ASN (via geo module), and system labels reported by the connector binary (kernel, OS version, runtime environment, memory, virtualization, PID, UID/GID).

The “connector revoke” command disconnects active QUIC tunnels in addition to revoking cluster sessions, causing connectors to reconnect (and be rejected).

Config reload cleanup: when a site is removed from config (via GitOps or hot reload), active QUIC connections for that site are automatically disconnected. The connector binary will reconnect but be rejected because the site is no longer in config. This prevents stale sessions from lingering in JetStream KV.

Security

Trust boundaries:

Hexon Cluster (full trust): policy enforcement, identity, routing
Connector (minimal trust): dials only what Hexon asks, no autonomous access

Authentication flow:

QUIC/TLS 1.3 connection established (server cert, ECDHE, forward secrecy)
Both sides compute TLS exporter keying material (RFC 5705) with an application-specific label
Connector sends: site_id + HMAC of token bound to the TLS channel
Hexon validates by recomputing from cluster key
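The stateless validation in the flow above can be sketched as two HMAC steps. This is an illustration under stated assumptions: the token is taken to be HMAC-SHA256 of the site ID under the cluster key, and the proof to be HMAC-SHA256 of the exporter output under that token. The real derivation inputs and the exporter label are not specified in this document.

```python
import hashlib
import hmac

def derive_token(cluster_key: bytes, site_id: str) -> bytes:
    """Deterministic per-site token; any node holding cluster_key can recompute it."""
    return hmac.new(cluster_key, site_id.encode(), hashlib.sha256).digest()

def channel_proof(token: bytes, exporter: bytes) -> bytes:
    """Proof sent by the connector: the token keys an HMAC over the exporter output."""
    return hmac.new(token, exporter, hashlib.sha256).digest()

def gateway_verify(cluster_key, site_id, exporter, proof, known_sites) -> bool:
    if site_id not in known_sites:
        return False
    expected = channel_proof(derive_token(cluster_key, site_id), exporter)
    return hmac.compare_digest(expected, proof)

cluster_key = b"k" * 32            # stand-in cluster key
exporter = b"\x0a" * 32            # stand-in for RFC 5705 exporter output
sites = {"prod-asia-a8f3c1"}
proof = channel_proof(derive_token(cluster_key, "prod-asia-a8f3c1"), exporter)
print(gateway_verify(cluster_key, "prod-asia-a8f3c1", exporter, proof, sites))
```

Only the site ID and the proof cross the wire in this sketch, so the token itself is never transmitted and nothing has to be stored per site.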

Additional protections:

Optional CIDR allowlist per site restricts connector source IPs
max_instances limit prevents token abuse
Instance selection uses epsilon-greedy adaptive algorithm with circuit breaker
QUIC relay loop prevention: relay handler only dispatches locally, preventing infinite forwarding loops between nodes
Cluster-wide rebalancing: soft-rejects excess connectors so they redistribute across gateway nodes (configurable per site, default 5 retries before accepting)

Inter-node forwarding

All cluster nodes can route to any connector site through QUIC relay.

When a request arrives at a node without local connector instances (or after local retries are exhausted), the dispatcher transparently relays through a peer node. The relay uses QUIC on the same connector port (8444) with ALPN “hexon-relay” and mTLS for peer authentication. Each relay request opens a QUIC stream, sends a dispatch header, and the peer dispatches locally through its QUIC connector tunnel.

All traffic types converge through the same dispatch path — this covers reverse proxy, forward proxy, client access (TCP/UDP), SSH bastion, SQL bastion, shadow targets, and probes.

Remote node IPs are cached (5s refresh) from cluster-wide connector sessions. Failed nodes are tracked by the cluster discovery health checks.

Loop prevention:

The relay handler only dispatches locally (never relays further)
A peer with no local instances returns an immediate error

Troubleshooting relay:

Client-side metrics: relay_total (attempts), relay_success_total, relay_errors_total
Server-side metrics: relay_served (requests handled), relay_rejected_total (auth failures)
Relay rejected with “no_certificate”: peer isn’t presenting its service cert
Relay rejected with “not_peer”: source IP not in cluster discovery peer list
Relay “no_instances”: the peer node also has no local connectors for the site
Check logs: 'logs search connectors.relay --level=warn'

QUIC tuning

QUIC performance tuning applied to both gateway and connector sides:

Flow control windows (tuned for database and bulk transfer workloads):

  - Stream: 2MB initial, 8MB max
  - Connection: 4MB initial, 20MB max
  - Stream-to-connection ratio: 1:2 initial, 2:5 max
  Larger initial windows reduce round-trips for big responses (SQL results, file transfers).

Persistent QUIC transport (connector side):

  - hexonconnect reuses one UDP socket across reconnections
  - Avoids per-connection socket allocation and kernel offload state loss
  - Enables future QUIC connection migration if network interface changes

Stream error handling:

  - Error paths immediately release QUIC stream resources instead of graceful close
  - Frees resources under load without waiting for peer acknowledgment

Max concurrent streams:

  - Gateway: configurable per listener (default 100)
  - Connector: 1024 (high concurrency for multiplexed tunnel streams)

Rebalancing

When multiple connector replicas start simultaneously (e.g., Kubernetes Deployment with 3 replicas), they may all connect to the same gateway node via DNS or a load balancer. The rebalance mechanism redistributes them:

First connector for a site on a node is always accepted
Subsequent connectors check cluster distribution: if this node has more instances than the least-loaded remote node, the registration is soft-rejected
The connector reconnects with a short backoff (2 seconds) — DNS/LB randomness typically sends it to a different node
After N soft-rejects (configurable, default 5), the node accepts anyway

Per-site configuration:

  rebalance = true             # Enable cluster-wide load distribution (default: true)
  rebalance_retries = 5        # Max soft-rejects before accepting (1-10, default: 5)

Rebalance is best-effort — sticky load balancers may prevent redistribution, so the retry budget ensures connectors are never stuck. Metrics: rebalance_reject_total and rebalance_accept_total track distribution activity per site.
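The accept or soft-reject decision described in this section can be sketched as a single function. How the gateway counts prior soft-rejects per connector is an assumption here, as are the function and parameter names.

```python
def rebalance_decision(local_count: int, remote_counts: list, prior_rejects: int,
                       rebalance: bool = True, retries: int = 5) -> str:
    """Return "accept" or "soft_reject" for a new connector registration."""
    if not rebalance or local_count == 0:
        return "accept"        # rebalancing off, or first instance on this node
    if prior_rejects >= retries:
        return "accept"        # retry budget spent: never leave a connector stuck
    if remote_counts and local_count > min(remote_counts):
        return "soft_reject"   # a peer node is less loaded; ask the connector to retry
    return "accept"

first = rebalance_decision(local_count=0, remote_counts=[0, 0], prior_rejects=0)
crowded = rebalance_decision(local_count=2, remote_counts=[0, 1], prior_rejects=1)
exhausted = rebalance_decision(local_count=2, remote_counts=[0, 1], prior_rejects=5)
print(first, crowded, exhausted)
```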

Logs

Log entries by component. Search with: logs search “connectors”. Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Initialization:

  connectors                      INFO          initializing connector subsystem
  connectors                      ERROR         TLS config not available, connector listener disabled
  connectors                      ERROR         failed to create connector listener
  connectors                      ERROR         failed to start connector listener
  connectors                      INFO          connector listener started

Authentication:

  connectors.handler              WARN   AUDIT  connector auth failed: invalid proof
  connectors.handler              WARN   AUDIT  connector auth failed: unknown site
  connectors.handler              WARN   AUDIT  connector auth failed: source IP not allowed

Connection lifecycle:

  connectors.handler              INFO   AUDIT  connector connected
  connectors.handler              INFO   AUDIT  connector disconnected

Registry:

  connectors.registry             INFO   AUDIT  Connector instance registered
  connectors.registry             INFO   AUDIT  Connector instance unregistered

Session management:

  connectors                      WARN          failed to create session
  connectors                      WARN          session create wait failed
  connectors                      WARN          unexpected session create response type
  connectors                      DEBUG         failed to extend session
  connectors                      DEBUG         session extend wait failed
  connectors                      WARN          failed to revoke session
  connectors                      WARN          session revoke wait failed

Config reload:

  connectors.reload               INFO          disconnected instances for removed site

Relay:

  connectors.relay                WARN   AUDIT  relay rejected: source IP not a cluster peer
  connectors.relay                DEBUG         relay connection accepted
  connectors.relay                WARN          relay fallback also failed after local exhaustion

Metrics

Prometheus metrics. Query with: metrics prometheus connectors_<name>

Connections:

  connectors_connections_total           counter    {}                   Total connector connections
  connectors_connections_active          gauge      {}                   Active connector connections
  connectors_connections_rejected        counter    {reason}             Rejected connections
  connectors_connection_duration         latency    {site_id}            Connection lifetime

Authentication:

  connectors_auth_success_total          counter    {site_id}            Successful authentications
  connectors_auth_failures_total         counter    {site_id, reason}    Authentication failures

Instances:

  connectors_instances_active            gauge      {site_id}            Active connector instances
  connectors_heartbeat_latency           latency    {site_id}            Heartbeat round-trip time

Dial (tunnel dispatch):

  connectors_dials_total                 counter    {site_id}            Dial attempts through tunnel
  connectors_dials_success_total         counter    {site_id}            Successful dials
  connectors_dials_errors_total          counter    {site_id, reason}    Failed dials
  connectors_dial_latency                latency    {site_id}            Dial latency
  connectors_streams_active              gauge      {}                   Active QUIC streams

Rebalance:

  connectors_rebalance_reject_total      counter    {site_id}            Soft-rejected for rebalance
  connectors_rebalance_accept_total      counter    {site_id}            Accepted after rebalance check

Inter-node forwarding (TCP-level):

  connectors_forward_total               counter    {site_id, target}    Forward attempts to peer node
  connectors_forward_success_total       counter    {site_id, target}    Successful forwards
  connectors_forward_errors_total        counter    {site_id, target}    Failed forwards
  connectors_forward_latency             latency    {site_id, target}    Forward latency
  connectors_forward_local_total         counter    {site_id}            Requests handled locally

Relay (QUIC inter-node dispatch):

  connectors_relay_total                 counter    {site_id, target}    Client-side relay attempts
  connectors_relay_served                counter    {site_id, target}    Server-side relay requests handled
  connectors_relay_success_total         counter    {site_id, target}    Successful relay dispatches
  connectors_relay_errors_total          counter    {site_id, reason}    Failed relay dispatches
  connectors_relay_rejected_total        counter    {reason}             Relay connections rejected (auth)

Alerts:

  rate(connectors_auth_failures_total[5m]) > 5       High auth failure rate (brute force or misconfiguration)
  connectors_instances_active == 0                    No connector instances (site unreachable)
  rate(connectors_dials_errors_total[5m]) > 10        High dial failure rate (tunnel health)
  rate(connectors_relay_rejected_total[5m]) > 0       Relay auth failures (cluster misconfiguration)
  connectors_connections_active > 100                 High connection count

DNS Resolution

Resolves DNS for all gateway components — custom resolvers, DNSSEC validation, caching, and health-aware failover

Overview

Handles DNS resolution for all gateway components — proxy backends, bastion hosts, cluster discovery, and ACME validation. Provides custom resolvers with automatic failover, caching, and DNSSEC validation.

Capabilities:

Custom DNS resolvers with automatic failover and circuit breaker pattern
DNSSEC validation in two modes: resolver-trust (fast) and full cryptographic (secure)
Distributed DNS caching via the memory storage module (local reads, broadcast writes)
Lookup coalescing so concurrent requests for the same name share one upstream query (prevents cache stampedes)
Hostname validation to block DNS injection attacks (null bytes, CRLF)
IPv4 preference when both A and AAAA records are available
CNAME flattening with configurable depth limit (default 16, per RFC 1034)
DNS-over-TLS (DoT) support for encrypted transport (RFC 7858)
Adaptive resolver selection using epsilon-greedy algorithm (20-40% lower latency)
Health checking with exponential backoff and automatic system DNS fallback
Typed DNS queries for 30+ record types (A, AAAA, CAA, TLSA, SRV, MX, etc.)
Context propagation for request cancellation and graceful shutdown
TTL sanitization to prevent integer overflow attacks (capped at 1 week)

Operations:

Resolve: DNS resolution with optional DNSSEC, caching, and resolver selection
ValidateHostname: RFC-compliant hostname validation against injection attacks
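A hostname check of the kind ValidateHostname performs might look like the sketch below. It is an illustration only: the exact rules of the module are not listed in this document, and allowing underscores (for names such as _sip._tcp) is an assumption.

```python
import re

# One DNS label: 1-63 chars, no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9_-]{1,63}(?<!-)")

def validate_hostname(name: str) -> bool:
    # Reject control characters outright: null bytes and CR/LF are injection vectors.
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in name):
        return False
    if name.endswith("."):
        name = name[:-1]  # a single trailing dot marks a fully qualified name
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))

print(validate_hostname("db.internal"), validate_hostname("evil.com\r\nHost: x"))
```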

Config

Core configuration under [dns]:

  [dns]
  timeout = 5                          # DNS query timeout in seconds (default: 5)
  cache_ttl = 300                      # Default cache TTL in seconds (default: 300)
  cache_override = false               # Ignore DNS server TTL, always use cache_ttl (default: false)
  resolvers = ["1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"]  # DNS resolvers (default: cluster.cluster_dns_resolvers)
  flatten_cname = true                 # Follow CNAMEs to final A/AAAA records (default: true)
  max_cname_depth = 16                 # Max CNAME chain depth to prevent loops (default: 16)

DNSSEC settings:

  dnssec_full_validation = false       # Full cryptographic RRSIG/DNSKEY validation (default: false)
  dnssec_strict = false                # Fail if zone is not DNSSEC-signed (default: false)

DNS-over-TLS (DoT):

  dot_enabled = false                  # Enable DNS-over-TLS transport (default: false)
  dot_port = 853                       # DoT port per RFC 7858 (default: 853)
  dot_verify_server_cert = true        # Verify resolver TLS certificate (default: true)

Health checking:

  health_check_enabled = true          # Enable resolver health monitoring (default: true)
  health_check_interval = 30           # Health check interval in seconds (default: 30)
  health_failure_threshold = 2         # Consecutive failures before marking unhealthy (default: 2)
  health_check_query = "google.com"    # Domain used for health check probes (default: "google.com")

Adaptive resolver selection (epsilon-greedy ML):

  adaptive_selector_enabled = true     # Enable adaptive resolver selection (default: true)
  adaptive_exploration_rate = 0.10     # Exploration rate 0.0-1.0 (default: 0.10 = 10%)
  adaptive_smoothing_factor = 0.3      # EMA smoothing factor for latency tracking (default: 0.3)
  adaptive_min_sample_size = 100       # Queries before switching from learning to intelligent mode (default: 100)
  adaptive_load_balance_enabled = true # Penalize recently-used resolvers to spread load (default: true)

Resolver architecture — three separate resolver pools:

  dns.resolvers                        # Infrastructure resolvers (health-checked, used by all modules)
  cluster.cluster_dns_resolvers        # Cluster discovery resolvers (fallback if dns.resolvers unset)
  proxy.dns.resolvers                  # Proxy-specific override (must be subset of dns.resolvers)

Per-route proxy DNS overrides in [[proxy.mapping]]:

  dnssec = true                        # Override global DNSSEC setting for this route
  dns_resolvers = ["10.0.0.1:53"]      # Override resolvers for this route (must be in dns.resolvers)
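
The subset rule for proxy and per-route resolvers can be pictured as a simple filter against dns.resolvers. This Python sketch is illustrative only: the function name and return shape are hypothetical, and it assumes resolvers are compared as exact "host:port" strings.

```python
def filter_route_resolvers(infrastructure: list[str], override: list[str]) -> tuple[list[str], list[str]]:
    """Split an override list into (accepted, rejected) against dns.resolvers.

    Assumption: resolvers are compared as exact "host:port" strings.
    """
    allowed = set(infrastructure)
    accepted = [r for r in override if r in allowed]
    rejected = [r for r in override if r not in allowed]
    return accepted, rejected
```

For example, with dns.resolvers set to ["1.1.1.1:53", "10.0.0.1:53"], a route override of ["10.0.0.1:53", "192.168.1.1:53"] would keep only "10.0.0.1:53"; the second entry is filtered out because it is not health-checked infrastructure.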

TTL precedence and bounds:

  cache_override=false: DNS server TTL > dns.cache_ttl > 300s default
  cache_override=true:  dns.cache_ttl > 300s default
  Bounds: minimum 1 second, maximum 604800 seconds (1 week)
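
The precedence and bounds can be sketched in a few lines of Python. The function and constant names are hypothetical, and treating a missing or zero TTL as "fall through to the next source" is an assumption based on the rule that a zero TTL defaults to 300s.

```python
DEFAULT_TTL = 300      # seconds, final fallback
MIN_TTL = 1            # lower bound after sanitization
MAX_TTL = 604800       # upper bound: 1 week


def effective_ttl(server_ttl, cache_ttl, cache_override):
    """Pick the cache TTL in seconds following the documented precedence.

    server_ttl / cache_ttl may be None or 0 when unavailable (assumption).
    """
    if cache_override:
        candidates = [cache_ttl]              # server TTL is ignored entirely
    else:
        candidates = [server_ttl, cache_ttl]  # server TTL wins when present
    ttl = next((t for t in candidates if t), DEFAULT_TTL)
    # Clamp so oversized or negative values cannot overflow or disable caching.
    return max(MIN_TTL, min(MAX_TTL, ttl))
```

Under these assumptions, a server TTL of 60 with cache_ttl = 300 yields 60 when cache_override is false and 300 when it is true.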

Cache key format: “dns_cache:{hostname}:{resolver_hash}”, where resolver_hash is a 128-bit value derived from a SHA-256 hash of the resolver set. Cache reads are local (no network). Cache writes are broadcast to the cluster (fire-and-forget).
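
One way such a key could be built is shown below. This is a sketch under stated assumptions, not the module's actual derivation: it assumes the resolver list is sorted and comma-joined before hashing, that the SHA-256 digest is truncated to 128 bits (32 hex characters), and that the hostname is lowercased.

```python
import hashlib


def cache_key(hostname: str, resolvers: list[str]) -> str:
    """Build a "dns_cache:{hostname}:{resolver_hash}" key (hypothetical derivation)."""
    # Assumption: sort so that resolver order in the config does not change the key.
    material = ",".join(sorted(resolvers)).encode("utf-8")
    # Assumption: keep the first 128 bits (32 hex chars) of the SHA-256 digest.
    resolver_hash = hashlib.sha256(material).hexdigest()[:32]
    return f"dns_cache:{hostname.lower()}:{resolver_hash}"
```

Including the resolver hash in the key means that answers obtained from one resolver set are never served to a route configured with a different set.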

Hot-reloadable: resolvers, DNSSEC settings, cache TTL, health check parameters, adaptive settings. Cold (restart required): dot_enabled, dot_port.

Troubleshooting

Common symptoms and diagnostic steps:

DNS resolution failures:

  - Check resolver health: 'dns resolvers' shows status, latency, and failure counts
  - Test specific hostname: 'dns test <hostname>' performs live resolution
  - All resolvers unhealthy: module falls back to system DNS (/etc/resolv.conf)
  - Resolver filtered out: proxy resolvers must be a subset of dns.resolvers
  - Cross-subsystem check: 'diagnose domain <hostname>' tests DNS + proxy + TLS together

DNSSEC validation errors:

  - Zone not signed: set dnssec_strict=false to allow unsigned zones (default)
  - Resolver-trust mode: compromised resolver can fake AD bit — use dnssec_full_validation=true
  - Full validation slow: first query ~200ms (chain of trust), cached queries ~50ms
  - Clock skew: DNSSEC signatures have validity windows — ensure NTP is running
  - Check validation: 'dns test <hostname> --dnssec' shows validation result and mode
  - Strict mode blocking: dnssec_strict=true rejects all unsigned zones — check per-route override

Slow DNS resolution:

  - Check cache hit rate: 'dns cache' shows hit/miss ratio and entry count
  - High cache miss: increase cache_ttl or set cache_override=true for static backends
  - Resolver latency: 'dns resolvers' shows per-resolver average latency (EMA)
  - Adaptive selector: 'dns adaptive' shows resolver scores and selection distribution
  - Learning phase: first 100 queries use round-robin — performance improves after
  - CNAME chains: deep chains add latency per hop — check with 'dns test <hostname>'

All resolvers down (circuit breaker tripped):

  - Health checker marks resolver unhealthy after 2 consecutive failures (configurable)
  - Backoff schedule: 30s, 1m, 2m, 4m, 8m, 15m (max)
  - System DNS fallback activates automatically when all custom resolvers fail
  - Recovery is automatic — resolver returns to pool when health check succeeds
  - Force re-check: 'dns health --reset' clears backoff timers
  - Check: 'dns resolvers' shows healthy/unhealthy status and next retry time

Cache poisoning concerns:

  - Lookup coalescing: concurrent requests for same hostname share single lookup result
  - Per-hostname locking prevents race conditions (no global bottleneck)
  - Enable DNSSEC (dnssec_full_validation=true) for cryptographic validation
  - Use DoT (dot_enabled=true) to encrypt DNS transport against snooping and on-path tampering

CNAME resolution issues:

  - CNAME not followed: check flatten_cname=true (default)
  - CNAME loop detected: max_cname_depth exceeded (default 16) — check DNS zone config
  - CNAME + ACL: ACL checks use original hostname, not CNAME target (prevents bypass)
  - Metrics: dns.cname_resolutions_total tracks success and depth_exceeded counts
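
The interaction between flatten_cname and max_cname_depth can be illustrated with a small Python sketch over an in-memory record table. The record layout, function name, and exception type are hypothetical; the depth limit is interpreted here as the maximum number of CNAME hops followed.

```python
class CnameDepthExceeded(Exception):
    """Raised when a CNAME chain is longer than max_depth (likely a loop)."""


def flatten_cname(name, records, max_depth=16):
    """Follow CNAMEs from name to its final address records.

    records maps a name to ("CNAME", target) or ("A", [ips]); this table
    stands in for real DNS queries and is purely illustrative.
    """
    current = name
    # Allow up to max_depth CNAME hops, plus one final address lookup.
    for _ in range(max_depth + 1):
        rtype, value = records[current]
        if rtype != "CNAME":
            return value
        current = value
    raise CnameDepthExceeded(name)
```

A loop such as a -> b -> a exhausts the hop budget and raises instead of spinning forever. Note that any ACL decision would still be made on the original name passed in, not on the targets visited along the way.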

DoT connection failures:

  - Port blocked: DoT uses port 853 (RFC 7858) — verify firewall rules
  - Certificate error: set dot_verify_server_cert=false to diagnose (re-enable after)
  - Non-standard port: module warns if dot_port is not 853

502/503 from proxy due to DNS:

  - DNSSEC failure blocks connection (no system DNS fallback for security)
  - DNS infrastructure failure falls back to system DNS (availability)
  - Fix: set dnssec=false on specific proxy routes for unsigned internal zones
  - Verify: 'dns test <backend-hostname> --dnssec' to check DNSSEC status

Interpreting tool output:

  'dns health':
    Healthy: Status=healthy, Healthy resolvers = total resolvers
    Degraded: Healthy < total — some resolvers failing, but DNS still works
    Down: Healthy=0 — all resolvers failed, system DNS fallback active
    Action: Degraded/Down → 'dns resolvers' for per-resolver breakdown

  'dns resolvers':
    Healthy: Status=healthy, Latency < 50ms, Score > 100
    Degraded: Status=unhealthy with BackoffUntil timestamp — resolver in circuit breaker
    Learning: Score near 100 with low QueryCount — adaptive selector still calibrating (normal)
    Action: All unhealthy → check network connectivity to resolver IPs, verify port 53/853 open

  'dns test <hostname>':
    Success: IPs returned, TTL shown, DNSSEC=valid (if enabled)
    DNSSEC failure: DNSSEC=invalid — zone is unsigned or signatures expired
    No results: hostname does not resolve — check DNS zone configuration
    Action: DNSSEC failure + proxy 502 → set dnssec=false on that proxy route

Architecture

Resolution flow:

Resolve request arrives (from proxy, bastion, ACME, or discovery)
Hostname validation: RFC compliance check, injection prevention (null bytes, CRLF, length)
Cache lookup: local memory read for “dns_cache:{hostname}:{resolver_hash}”
If cache hit: return cached IPs immediately (no network call)
If cache miss: acquire per-hostname lock (coalescing for concurrent requests)
Resolver selection: adaptive selector picks best resolver (or round-robin during learning)
Health filter: only healthy resolvers considered (circuit breaker pattern)
DNS query: send query via UDP (or DoT if enabled) with configured timeout
DNSSEC validation (if enabled):
  a. Resolver-trust mode: check AD bit in response
  b. Full validation: verify RRSIG signatures, validate DNSKEY chain to root trust anchor
CNAME handling: if CNAME response and flatten_cname=true, recursively resolve target
IPv4 preference: sort results with A records before AAAA records
TTL extraction: from DNS response (DNSSEC/custom resolver) or use configured default
TTL sanitization: clamp to [1s, 604800s], zero defaults to 300s
Cache store: broadcast write to cluster memory (fire-and-forget, best-effort)
Release per-hostname lock, waiting callers receive same result
Return ResolveResponse with IPs, TTL, cached flag, DNSSEC validity
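
The coalescing step in the flow above can be sketched as follows. This Python example uses an in-flight call record per hostname rather than a literal per-hostname mutex, which is a substitution made for brevity; the class and attribute names are hypothetical, and the lookup callable stands in for the real resolver.

```python
import threading


class _Call:
    """State shared between the caller doing the lookup and those waiting on it."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class CoalescingResolver:
    def __init__(self, lookup):
        self._lookup = lookup          # callable: hostname -> list of IPs
        self._guard = threading.Lock() # protects the in-flight table only
        self._inflight = {}            # hostname -> _Call

    def resolve(self, hostname):
        with self._guard:
            call = self._inflight.get(hostname)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[hostname] = call
        if leader:
            try:
                call.result = self._lookup(hostname)
            except Exception as exc:   # propagate the failure to every waiter
                call.error = exc
            finally:
                with self._guard:
                    del self._inflight[hostname]
                call.done.set()
        else:
            call.done.wait()           # reuse the leader's result
        if call.error is not None:
            raise call.error
        return call.result
```

Because the guard lock is held only while the in-flight table is read or updated, lookups for different hostnames never block one another.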

Adaptive resolver selection (epsilon-greedy):

  Learning phase (first 100 queries): round-robin across all healthy resolvers
  Intelligent phase: 90% exploitation (best score), 10% exploration (random)
  Score = 100 + (success_rate * 50) - (avg_latency_ms / 10) - (timeout_pct * 30)
        - (consecutive_failures * 20) - (recently_used * 10)
  Latency tracked via EMA: new_avg = 0.3 * sample + 0.7 * old_avg
  Load balancing penalty: -10 points if resolver used within last 1 second
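
The scoring formula, EMA update, and epsilon-greedy choice above can be expressed in Python roughly as follows. This is a sketch: the function names are hypothetical, success_rate and timeout_pct are assumed to be fractions in the range 0.0 to 1.0, and round-robin order during learning is assumed to follow sorted resolver names.

```python
import random


def resolver_score(success_rate, avg_latency_ms, timeout_pct,
                   consecutive_failures, recently_used):
    """Score a resolver; rates are assumed to be fractions in [0.0, 1.0]."""
    return (100
            + success_rate * 50
            - avg_latency_ms / 10
            - timeout_pct * 30
            - consecutive_failures * 20
            - (10 if recently_used else 0))


def update_latency(old_avg, sample, smoothing=0.3):
    """Exponential moving average: new_avg = 0.3 * sample + 0.7 * old_avg."""
    return smoothing * sample + (1 - smoothing) * old_avg


def select_resolver(scores, total_queries, rng,
                    exploration_rate=0.10, min_sample_size=100):
    """Return (resolver, reason) for the next query."""
    names = sorted(scores)
    if total_queries < min_sample_size:
        # Learning phase: plain round-robin over healthy resolvers.
        return names[total_queries % len(names)], "round_robin"
    if rng.random() < exploration_rate:
        # Exploration: occasionally try a random resolver to refresh its stats.
        return rng.choice(names), "exploration"
    # Exploitation: pick the highest-scoring resolver.
    return max(names, key=scores.get), "best_score"
```

As a worked example under these assumptions, a resolver with a 100% success rate, 20 ms average latency, and no timeouts or failures scores 100 + 50 - 2 = 148, while one that was used within the last second would score 138.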

Health checker circuit breaker:

  Healthy: failure_count = 0, available for selection
  Unhealthy: failure_count >= threshold (default 2), excluded from selection
  Backoff: 30s -> 1m -> 2m -> 4m -> 8m -> 15m (max)
  Recovery: single successful health check returns resolver to healthy state
  System DNS fallback: automatic when ALL custom resolvers are unhealthy
  Memory cleanup: Resolver sync removes stale entries on config reload
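
The circuit breaker states and backoff schedule can be modelled with a small state object. This Python sketch is illustrative: the class and method names are hypothetical, and it assumes the backoff advances one step per failure once the resolver is unhealthy.

```python
from dataclasses import dataclass

# Backoff schedule in seconds: 30s -> 1m -> 2m -> 4m -> 8m -> 15m (max).
BACKOFF_SECONDS = [30, 60, 120, 240, 480, 900]


@dataclass
class ResolverHealth:
    failure_threshold: int = 2
    consecutive_failures: int = 0
    backoff_step: int = 0
    healthy: bool = True

    def record_failure(self) -> int:
        """Register a failure; return seconds until the next re-check (0 if still healthy)."""
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return 0
        self.healthy = False
        # Cap at the last schedule entry so the delay never exceeds 15 minutes.
        delay = BACKOFF_SECONDS[min(self.backoff_step, len(BACKOFF_SECONDS) - 1)]
        self.backoff_step += 1
        return delay

    def record_success(self) -> None:
        """A single successful check returns the resolver to the healthy pool."""
        self.consecutive_failures = 0
        self.backoff_step = 0
        self.healthy = True
```

With the default threshold of 2, the first failure leaves the resolver healthy, the second marks it unhealthy with a 30 second backoff, and repeated failures walk the schedule up to the 15 minute cap.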

DNSSEC full validation chain:

  1. Query resolver with DO bit set
  2. Extract RRSIG from response
  3. Fetch DNSKEY for target zone (cached with TTL)
  4. Verify RRSIG signature using DNSKEY (RSA/SHA-256, ECDSA P-256, Ed25519)
  5. Fetch DS record from parent zone
  6. Verify DNSKEY hash matches DS record
  7. Recurse up to root zone
  8. Validate root DNSKEY against hardcoded IANA trust anchor (KSK 20326)
  9. Validate NSEC/NSEC3 for authenticated denial of existence
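
Step 6 of the chain (matching a DNSKEY against its DS record) follows RFC 4034: the DS digest is a hash over the DNSKEY owner name in wire format followed by the DNSKEY RDATA (flags, protocol, algorithm, public key). The Python sketch below shows the SHA-256 variant (digest type 2). The function names are hypothetical, and a complete check would also compare the key tag and algorithm fields of the DS record.

```python
import hashlib
import struct


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    labels = [label for label in name.rstrip(".").lower().split(".") if label]
    encoded = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    return encoded + b"\x00"  # the root label terminates every name


def ds_digest_sha256(owner: str, flags: int, protocol: int,
                     algorithm: int, public_key: bytes) -> bytes:
    """Compute the SHA-256 DS digest for a DNSKEY (RFC 4034 section 5.1.4)."""
    rdata = struct.pack("!HBB", flags, protocol, algorithm) + public_key
    return hashlib.sha256(name_to_wire(owner) + rdata).digest()


def dnskey_matches_ds(owner, flags, protocol, algorithm, public_key, ds_digest):
    """True if the DNSKEY hashes to the digest published in the parent zone."""
    return ds_digest_sha256(owner, flags, protocol, algorithm, public_key) == ds_digest
```

Lowercasing the owner name before hashing matters: the digest is computed over the canonical form, so "Example.COM." and "example.com." must produce the same result.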

Distributed caching via memory module:

  Read path: local-only (no network, no quorum)
  Write path: broadcast to all cluster nodes (fire-and-forget)
  Key format: "dns_cache:{hostname}:{sha256_hash_of_resolvers}" (collision-resistant)
  Eviction: TTL-based (respects DNS TTL or configured override)
  Coalescing: per-hostname mutex prevents concurrent duplicate lookups

Metrics emitted:

  dns.resolve_total (tags: status, cached, dnssec)
  dns.resolve_latency_ms (histogram)
  dns.cache_hit_total / dns.cache_miss_total
  dns.health_check_total (tags: resolver, status)
  dns.adaptive_resolver_selected (tags: resolver, reason)
  dns.resolver_score (gauge, tags: resolver)
  dns.resolver_avg_latency_ms (gauge, tags: resolver)
  dns.cname_resolutions_total (tags: status)

Relationships

Module dependencies and interactions:

proxy: Backend hostname resolution for all proxy routes. Uses [dns] configuration by default. Per-route overrides via dnssec and dns_resolvers fields in [[proxy.mapping]]. DNSSEC validation failure blocks connection (no system DNS fallback — prevents downgrade). DNS infrastructure failure falls back to system DNS (availability).
bastion: SSH connection and port forwarding hostname resolution. Uses [dns] configuration directly (no bastion-specific overrides). DNSSEC protects against SSH destination poisoning.
discovery: Cluster peer discovery via DNS SRV records. Uses [dns] configuration for resolver settings. Critical for cluster formation and membership.
acme: ACME challenge validation uses typed DNS queries (CAA record checking per RFC 8659). SERVFAIL handling distinguishes “no records” from “DNS infrastructure error” for security.
memory: Distributed DNS cache storage. Local reads (fast), broadcast writes (best-effort). No quorum required — cache is opportunistic, falls back to fresh lookup on miss.
config: Reads [dns] and [cluster] TOML sections. Hot-reload updates resolvers, DNSSEC settings, cache parameters, health check configuration, and adaptive selection tuning. Resolver sync cleans up stale resolver state on reload (memory leak prevention).
metrics (telemetry): Emits counters, histograms, and gauges for resolution, caching, health checks, and adaptive selection. Enables monitoring dashboards and alerting.

Logs

Log entries by component. Search with: logs search “dns”. Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Init & Lifecycle:

  dns.init                INFO          DNS module initialized
  dns.health              INFO          DNS resolvers not configured, using cluster resolvers for health checking
  dns.health              WARN          Failed to initialize resolver health manager
  dns.health              INFO          Resolver health manager started
  dns.health              INFO          Health checking enabled but no resolvers configured
  dns.health              INFO          Resolver health checking disabled
  dns.adaptive            INFO          Adaptive resolver selector initialized
  dns.adaptive            INFO          Adaptive selector enabled but no resolvers configured
  dns.adaptive            INFO          Adaptive resolver selector disabled

Resolution:

  dns.resolve             DEBUG         DNS resolution request
  dns.resolve             DEBUG         DNS cache hit
  dns.resolve             DEBUG         Waiting for concurrent DNS lookup to complete
  dns.resolve             ERROR         DNS lookup panicked
  dns.resolve             ERROR         DNS resolution failed
  dns.resolve             INFO          DNS resolution succeeded - no records found
  dns.resolve             INFO          DNS resolution succeeded

Hostname Validation:

  dns.validate            WARN          Hostname validation failed

Health Status:

  dns.gethealth           DEBUG         DNS health status requested

Cache Operations:

  dns.cache               WARN          Invalid cache entry type
  dns.cache               WARN          Failed to broadcast DNS cache update
  dns.cache               DEBUG         DNS result cached

DNSSEC Core:

  dns.dnssec              DEBUG         Using DNS-over-TLS
  dns.dnssec              WARN          DNS query failed
  dns.dnssec              DEBUG         DNS query returned error
  dns.dnssec.full         DEBUG         RRSIG present but AD bit not set - performing full validation
  dns.dnssec.full         ERROR         Full DNSSEC validation failed
  dns.dnssec.full         INFO          Full DNSSEC validation succeeded
  dns.dnssec              ERROR         DNSSEC validation failed: RRSIG present but AD bit not set
  dns.dnssec              ERROR         DNSSEC strict mode: zone not signed
  dns.dnssec              WARN          DNSSEC validation skipped: zone not signed
  dns.dnssec              DEBUG         DNSSEC validation succeeded (resolver-trust mode)

DNSSEC Validation:

  dns.dnssec.validate     WARN          RRSIG signature verification failed
  dns.dnssec.validate     WARN          RRSIG signature expired or not yet valid
  dns.dnssec.validate     DEBUG         RRSIG signature validated successfully
  dns.dnssec.dnskey       WARN          Failed to query DNSKEY
  dns.dnssec.dnskey       WARN          DNSKEY query returned error
  dns.dnssec.dnskey       WARN          No DNSKEY records found in zone
  dns.dnssec.dnskey       DEBUG         DNSKEY records fetched successfully
  dns.dnssec.validate     ERROR         DNSSEC strict mode: RRset not signed
  dns.dnssec.validate     DEBUG         RRset has no RRSIG (zone not signed)
  dns.dnssec.validate     ERROR         No matching DNSKEY found for RRSIG
  dns.dnssec.validate     INFO          DNSSEC validation completed

DNSSEC Cache:

  dns.dnssec.cache        DEBUG         DNSKEY cache hit
  dns.dnssec.cache        DEBUG         DNSKEY cache expired
  dns.dnssec.cache        DEBUG         DNSKEY cached
  dns.dnssec.cache        DEBUG         DS cache hit
  dns.dnssec.cache        DEBUG         DS cache expired
  dns.dnssec.cache        DEBUG         DS cached
  dns.dnssec.cache        INFO          DNSSEC cache cleared

DNSSEC Chain of Trust:

  dns.dnssec              WARN          DEPRECATED: SHA-1 used in DNSSEC validation
  dns.dnssec.ds           WARN          Failed to query DS
  dns.dnssec.ds           WARN          DS query returned error
  dns.dnssec.ds           DEBUG         No DS records found (zone may be unsigned or at root)
  dns.dnssec.ds           DEBUG         DS records fetched successfully
  dns.dnssec.chain        WARN          Failed to compute DS digest
  dns.dnssec.chain        DEBUG         DNSKEY validated successfully using DS
  dns.dnssec.chain        ERROR         DNSKEY validation failed: no matching DS found
  dns.dnssec.chain        DEBUG         Validating chain of trust
  dns.dnssec.chain        INFO          Root DNSKEY validated against trust anchor
  dns.dnssec.chain        ERROR         Root DNSKEY validation failed: no matching trust anchor

DNSSEC NSEC/NSEC3:

  dns.dnssec.nsec         DEBUG         No NSEC records found in response
  dns.dnssec.nsec         DEBUG         Found NSEC records for validation
  dns.dnssec.nsec         INFO          NSEC authenticated denial validated
  dns.dnssec.nsec         WARN          NSEC validation failed: name not in range
  dns.dnssec.nsec3        DEBUG         No NSEC3 records found in response
  dns.dnssec.nsec3        DEBUG         Found NSEC3 records for validation
  dns.dnssec.nsec3        WARN          Unsupported NSEC3 hash algorithm
  dns.dnssec.nsec3        ERROR         Failed to compute NSEC3 hash
  dns.dnssec.nsec3        INFO          NSEC3 authenticated denial validated
  dns.dnssec.nsec3        WARN          NSEC3 validation failed: hash not in range

Resolver:

  dns.resolve             WARN          Hostname validation failed
  dns.ttl                 DEBUG         Cache override enabled, using configured TTL
  dns.ttl                 DEBUG         Using DNS server TTL
  dns.ttl                 DEBUG         DNS server TTL not available, using fallback
  dns.health              DEBUG         Filtered unhealthy resolvers
  dns.resolve             DEBUG         DNS resolution succeeded
  dns.resolve             DEBUG         DNS resolution failed, trying next resolver
  dns.resolve             ERROR         All DNS resolvers failed
  dns.resolve             DEBUG         Using system DNS resolver
  dns.resolve             DEBUG         Using configured DNS cache TTL for system resolver
  dns.resolve             DEBUG         DNS resolution succeeded
  dns.dnssec              DEBUG         DNSSEC resolution succeeded
  dns.dnssec              WARN          DNSSEC lookup failed, trying next resolver
  dns.cname               DEBUG         Resolving CNAME target
  dns.cname               DEBUG         CNAME record found
  dns.cname               WARN          Failed to resolve CNAME target
  dns.cname               DEBUG         CNAME chain returned (flatten disabled)
  dns.query               DEBUG         Using DNS-over-TLS
  dns.query               WARN          DNS query failed
  dns.query               DEBUG         DNS query returned error
  dns.query               DEBUG         DNS query completed
  dns.query               WARN          DNS query returned SERVFAIL

Adaptive Resolver:

  dns.adaptive            ERROR         Failed to create adaptive selector
  dns.adaptive            INFO          Cleaned up performance data for removed resolvers
  dns.adaptive            INFO          Adaptive resolver selector initialized
  dns.adaptive            TRACE         Resolver performance updated
  dns.adaptive            INFO          Adaptive selector learning phase completed, switching to intelligent selection
  dns.adaptive            DEBUG         Adaptive DNS resolution succeeded
  dns.adaptive            DEBUG         Adaptive DNS resolution failed, selecting another resolver
  dns.adaptive            ERROR         All adaptive DNS resolution attempts failed

Health Manager:

  dns.health              INFO          Initializing resolver health checks
  dns.health              ERROR         Invalid resolver address format
  dns.health              WARN          Initial health check failed
  dns.health              INFO          Initial health check passed
  dns.health              ERROR         No healthy DNS resolvers available
  dns.health              INFO          Resolver health initialization complete
  dns.health              DEBUG         Starting health check
  dns.health              WARN          Health check query failed
  dns.health              WARN          Health check returned nil response
  dns.health              DEBUG         Health check returned error response
  dns.health              DEBUG         Health check successful
  dns.health              DEBUG         GetHealthyResolvers called
  dns.fallback            WARN          All custom DNS resolvers unhealthy, falling back to system DNS
  dns.fallback            INFO          Custom DNS resolver recovered, switching back from system DNS
  dns.health              WARN          RecordSuccess called for unknown resolver
  dns.health              INFO          Resolver recovered
  dns.fallback            INFO          Custom DNS resolver recovered, switching back from system DNS
  dns.health              WARN          RecordFailure called for unknown resolver
  dns.health              WARN          Resolver marked unhealthy
  dns.health              INFO          Starting resolver health checker
  dns.health              INFO          Stopping resolver health checker
  dns.health              DEBUG         Performing health checks
  dns.health              DEBUG         Health check still failing
  dns.health              INFO          Resolver recovered via health check
  dns.health              INFO          Removed resolvers no longer in configuration

Metrics

Prometheus metrics. Query with: metrics prometheus dns_<name>

Resolution (namespace: dns):

  dns_resolve_total                          counter    {result, cached, dnssec}     Resolution outcomes
    result=success, cached=true|false                                                 Successful resolution
    result=nxdomain, cached=true|false                                                Domain not found (valid response)
    result=failure, cached=false                                                      Resolution failed
  dns_nxdomain_total                         counter    {}                           NXDOMAIN responses (uncached)
  dns_cache_hits                             counter    {}                           Cache hits
  dns_cache_misses                           counter    {}                           Cache misses
  dns_lookup_coalesced                       counter    {}                           Lookups coalesced (shared concurrent result)
  dns_lookup_performed                       counter    {}                           Lookups actually performed
  dns_cache_operations_total                 counter    {operation, result}          Cache write operations
    operation=set, result=success|error                                               Broadcast cache set outcomes

Resolver Selection (namespace: dns):

  dns_resolver_queries_total                 counter    {resolver, result}           Per-resolver query outcomes
    result=success|nxdomain|failure                                                   Query result per resolver
  dns_system_dns_queries_total               counter    {result}                     System DNS fallback queries
    result=success|nxdomain|failure                                                   System resolver outcomes

Transport (namespace: dns):

  dns_transport_used                         counter    {type, resolver}             DNS transport protocol used
    type=udp|dot                                                                      UDP or DNS-over-TLS

CNAME Resolution (namespace: dns):

  dns_cname_resolutions_total                counter    {status}                     CNAME chain resolution outcomes
    status=success|depth_exceeded                                                     CNAME follow results

DNSSEC Validation (namespace: dns):

  dns_dnssec_validations_total               counter    {result, resolver}           Resolver-trust mode validations
    result=valid|invalid|unsigned                                                     AD bit check outcomes
  dns_dnssec_full_validations                counter    {result, resolver}           Full cryptographic validations
    result=valid|invalid                                                              RRSIG/DNSKEY verification outcomes
  dns_dnssec_signature_validations           counter    {result, algorithm}          RRSIG signature verifications
    result=valid                                                                      Successful signature check
  dns_dnssec_dnskey_queries                  counter    {result}                     DNSKEY record fetches
    result=success                                                                    DNSKEY query succeeded
  dns_dnssec_response_validations            counter    {result}                     Full response validations
    result=valid                                                                      All RRsets validated
  dns_dnssec_chain_validations               counter    {result}                     Chain of trust DS validations
    result=valid|invalid                                                              DNSKEY-DS digest match
  dns_dnssec_root_validations                counter    {result}                     Root trust anchor validations
    result=valid|invalid                                                              Root DNSKEY match
  dns_dnssec_nsec_validations                counter    {result, type}               NSEC/NSEC3 denial validations
    result=valid|invalid, type=nsec|nsec3                                             Authenticated denial outcomes

DNSSEC Cache (namespace: dns):

  dns_dnssec_cache_hits                      counter    {type}                       DNSSEC record cache hits
    type=dnskey|ds                                                                    Cached record type
  dns_dnssec_cache_misses                    counter    {type}                       DNSSEC record cache misses
    type=dnskey|ds                                                                    Record type queried
  dns_dnssec_cache_clears                    counter    {}                           DNSSEC cache full clears

Health Management (namespace: dns):

  dns_resolver_latency                       latency    {resolver}                   Per-resolver query latency
  dns_resolver_healthy                       gauge      {resolver}                   Resolver health status (1=healthy, 0=unhealthy)
  dns_resolver_avg_latency_ms                gauge      {resolver}                   Resolver average latency EMA (ms)
  dns_resolver_consecutive_failures          gauge      {resolver}                   Consecutive failure count per resolver
  dns_resolver_failures_total                counter    {resolver}                   Total resolver failures
  dns_system_fallback                        gauge      {}                           System DNS fallback active (1=active, 0=inactive)
  dns_fallback_activations                   counter    {}                           System DNS fallback activations

Adaptive Selection (namespace: dns):

  dns_adaptive_resolver_selected             counter    {resolver, reason}           Adaptive resolver selections
    reason=exploration|best_score|round_robin|...                                     Selection strategy used
  dns_adaptive_selection_total               counter    {mode, resolver}             Selection mode distribution
    mode=explore|exploit                                                              Exploration vs exploitation
  dns_resolver_score                         histogram  {resolver}                   Resolver scores (intelligent phase)

Forward Proxy

Browser-native access to internal resources — no client software needed, just configure the browser’s proxy settings

Overview

Provides browser-native access to internal resources — no client software needed. Users configure their browser’s proxy settings (or use the auto-generated PAC file) and access internal services as if they were local. Handles HTTP CONNECT for TCP tunneling and CONNECT-UDP for UDP proxying via MASQUE.

Core capabilities:

HTTP CONNECT handling for TCP proxy tunneling
CONNECT-UDP handling for UDP proxy tunneling (MASQUE/QUIC)
PAC file endpoint serving at configurable path (default /proxy.pac)
Browser extension config endpoint at /proxy/config
Browser extension setup/login endpoint at /proxy/setup
CONNECT rejected on main service port (421 Misdirected) — proxy port only
Geo-IP and time-based restriction enforcement before tunneling
DNS resolution with system DNS fallback
Bidirectional TCP relay with idle timeout and max connection duration
HTTP/2+ full duplex CONNECT stream support (RFC 8441)
HTTP/1.1 connection hijacking for classic CONNECT tunneling
Connection tracking and byte-level metrics recording

The service runs on a dedicated port (forward_proxy.port) separate from the main service port for security isolation. CONNECT requests on the main port receive 421 Misdirected Request, directing clients to the correct proxy port.

TCP CONNECT request flow:

  1. Extract client IP (CDN bypass mode uses RemoteAddr directly)
  2. Check geo-IP and time-based restrictions
  3. Validate target host:port format (RFC 1035 hostname length limit)
  4. Extract bearer token from Proxy-Authorization header
  5. Authenticate token and check user is not disabled
  6. Check ACL (firewall group rules for target destination)
  7. Check per-user rate limit (fail-closed)
  8. Resolve hostname via DNS module (system DNS fallback)
  9. Establish backend TCP connection with configurable timeout
  10. Start bidirectional relay with idle timeout and max duration
  11. Record metrics (bytes sent/recv, duration, success)
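
Step 3 of the flow above (target validation) might look like the following Python sketch. The function name and the exact rules are assumptions; only the 253-character hostname limit from RFC 1035 comes from the text above.

```python
MAX_HOSTNAME_LEN = 253  # RFC 1035 hostname length limit


def parse_connect_target(target: str) -> tuple[str, int]:
    """Split a CONNECT target into (host, port) or raise ValueError."""
    host, sep, port_text = target.rpartition(":")
    if not sep or not host:
        raise ValueError("target must be host:port")
    # IPv6 literals arrive bracketed, e.g. "[::1]:443".
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]
    if not host or len(host) > MAX_HOSTNAME_LEN:
        raise ValueError("invalid host length")
    if any(ch.isspace() or ord(ch) < 0x20 for ch in host):
        raise ValueError("host contains whitespace or control characters")
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError("port must be numeric")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range")
    return host, port
```

Rejecting malformed targets before authentication and ACL evaluation keeps later steps working with a clean (host, port) pair.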

CONNECT-UDP request flow:

  1-7. Same as TCP (restrictions, auth, ACL, rate limit)
  8. MASQUE UDP proxying (capsule protocol, socket management)
  9. Record metrics after session completes

Bearer token authentication supports two formats:

  - "Bearer <token>" header (direct bearer token)
  - "Basic <base64>" header where username is "_bearer_" and password is the token
    (Chrome's onAuthRequired format for Proxy-Authorization)
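
A minimal Python sketch of extracting the token from either header format is shown below. The function name is hypothetical; the 8192-byte limit is the maximum token length mentioned in the troubleshooting notes, and returning None for any unusable header is an assumption about how a caller would map failures to a 407 response.

```python
import base64
import binascii

MAX_TOKEN_LEN = 8192  # bytes


def extract_proxy_token(header_value: str):
    """Return the bearer token from a Proxy-Authorization value, or None."""
    scheme, _, credentials = header_value.strip().partition(" ")
    credentials = credentials.strip()
    token = None
    if scheme.lower() == "bearer":
        token = credentials
    elif scheme.lower() == "basic":
        try:
            decoded = base64.b64decode(credentials, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            return None
        username, sep, password = decoded.partition(":")
        # Only the reserved "_bearer_" username carries a token as password.
        if sep and username == "_bearer_":
            token = password
    if not token or len(token.encode("utf-8")) > MAX_TOKEN_LEN:
        return None
    return token
```

For example, a browser extension that answers an auth challenge with username "_bearer_" and password "abc123" sends "Basic X2JlYXJlcl86YWJjMTIz", which yields the same token as "Bearer abc123".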

Config

Service-level configuration under [forward_proxy] in hexon.toml:

[forward_proxy]
  enabled = true                       # Enable forward proxy (default: false)
  port = 8443                          # Dedicated proxy port (must differ from service.port)
  public_port = 8443                   # External port for PAC URLs (NAT/LB scenarios)
  hostname = "proxy.example.com"       # Separate hostname for CDN bypass (optional)
  enable_tcp = true                    # Enable TCP CONNECT handling (default: true)
  enable_udp = true                    # Enable CONNECT-UDP/MASQUE handling (default: true)
  udp_proxy_path = "/masque"           # URI path for CONNECT-UDP requests (default: /masque)
  auth_mode = "bearer"                 # Authentication mode for CONNECT requests
  buffer_size = "32KB"                 # TCP relay buffer size (default: 32KB)
  connect_timeout = "10s"              # Backend connection timeout
  idle_timeout = "5m"                  # Idle connection timeout (no data flowing)
  max_connection_duration = "24h"      # Maximum connection duration (hard limit)
  preserve_client_port = true          # Use client's port in Alt-Svc header

  # Token settings (used by /proxy/config endpoint)
  token_ttl = "5m"                     # Token validity duration (default: 5m, min: 30s)
  token_refresh_interval = "60s"       # Extension refresh interval (default: 60s, min: 5s)

  # TLS certificate for the proxy hostname (when hostname differs from service)
  cert = "/path/to/cert.pem"           # File path or inline PEM
  key = "/path/to/key.pem"             # File path or inline PEM

  # Geo-IP restrictions (overrides [service] if set)
  geo_enabled = true                   # Enable geo-IP restrictions
  geo_allow_countries = ["US", "CA"]   # Allowed country codes (ISO 3166-1 alpha-2)
  geo_deny_countries = []              # Denied country codes
  geo_bypass_cidr = ["10.0.0.0/8"]    # CIDR ranges that bypass geo checks
  geo_deny_code = 403                  # HTTP status for geo denial
  geo_deny_message = "Access denied from your location"

  # Time-based restrictions (overrides [service] if set)
  time_enabled = true                  # Enable time-based restrictions
  time_timezone = "America/New_York"   # Timezone for time checks
  time_allow_days = ["Mon","Tue","Wed","Thu","Fri"]
  time_allow_hours = "09:00-18:00"     # Allowed hours range
  time_deny_code = 403                 # HTTP status for time denial
  time_deny_message = "Access not permitted at this time"

# PAC file settings
[forward_proxy.pac]
  enabled = true                       # Enable PAC endpoint (default: true)
  path = "/proxy.pac"                  # PAC file URL path
  cache_ttl = "15m"                    # PAC response Cache-Control max-age
  group = "proxy-users"                # Required group for PAC/config/setup access (optional)
  use_firewall_targets = true          # Derive PAC targets from firewall rules

Endpoints registered by the service:

  GET /proxy.pac       - PAC file (requires auth, optional group)
  GET /proxy/config    - JSON: PAC + token + refresh interval + username + server_time
  GET /proxy/setup     - Login trigger page for browser extensions

CDN bypass mode:

  When forward_proxy.hostname differs from service.hostname, the proxy accepts
  direct connections (no CDN in between). Client IP is extracted from RemoteAddr
  instead of X-Forwarded-For. This setup is typical, because CDNs generally do
  not pass HTTP CONNECT through.

Hot-reloadable: token_ttl, token_refresh_interval, geo/time restrictions, PAC settings, rate_limit_per_user, bandwidth_limit_per_user, buffer_size, idle_timeout, max_connection_duration.

Cold (restart required): enabled, port, hostname, enable_tcp, enable_udp, udp_proxy_path, preserve_client_port.

Troubleshooting

Common symptoms and diagnostic steps:

CONNECT requests returning 421 Misdirected Request:

  - Client is sending CONNECT to the main service port instead of the proxy port
  - The forward proxy middleware rejects CONNECT on the main port by design
  - Verify client is configured to use forward_proxy.port (or public_port)
  - Check error message for the correct proxy hostname:port

407 Proxy Authentication Required:

  - Missing Proxy-Authorization header on CONNECT request
  - Token format not recognized (must be "Bearer <token>" or "Basic <base64>")
  - For Chrome extension: username must be "_bearer_" in Basic auth format
  - Token exceeds max length (8192 bytes) — check token generation
  - Verify token is being refreshed before expiry: check /proxy/config response
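
The accepted header formats listed above can be illustrated with a minimal Python sketch. It is not the gateway's implementation: the function name parse_proxy_authorization is hypothetical, and the sketch assumes that in the Basic form the token travels in the password field next to the "_bearer_" username (the list above only states the username requirement).

```python
import base64

MAX_TOKEN_BYTES = 8192  # max token length stated in the list above


def parse_proxy_authorization(header):
    """Return the token, or None when the request should get a 407.

    Illustrative only. Assumes the Basic form carries the token in the
    password field with the username "_bearer_".
    """
    if not header:
        return None
    scheme, _, value = header.partition(" ")
    scheme = scheme.lower()
    value = value.strip()
    if scheme == "bearer":
        token = value
    elif scheme == "basic":
        try:
            decoded = base64.b64decode(value, validate=True).decode("utf-8")
        except ValueError:  # covers bad base64 and bad UTF-8
            return None
        username, sep, token = decoded.partition(":")
        if not sep or username != "_bearer_":
            return None
    else:
        return None
    if not token or len(token.encode("utf-8")) > MAX_TOKEN_BYTES:
        return None
    return token
```

A None result corresponds to the cases in the list: missing header, unrecognized format, wrong Basic username, or an oversized token.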

403 Forbidden on CONNECT:

  - ACL denied: user's groups do not match firewall rules for the target
  - Check: 'forwardproxy check <user> <target>' for ACL evaluation
  - Check: 'forwardproxy targets <user>' to see allowed destinations
  - Check: 'firewall check <user>' for firewall rule details
  - Geo-IP denial: 'geo lookup <client_ip>' and 'geo check <client_ip>'
  - Time-based denial: verify time_timezone and time_allow_hours in config

429 Too Many Requests:

  - Per-user rate limit exceeded: check rate_limit_per_user setting
  - Per-user bandwidth limit exceeded: check bandwidth_limit_per_user
  - Retry-After header in response indicates when to retry
  - Monitor: 'forwardproxy metrics' for per-user rate limit stats
  - Consider increasing limits for legitimate high-volume users

502 Bad Gateway on CONNECT:

  - DNS resolution failed: 'dns test <target_hostname>'
  - Backend unreachable: 'net tcp <target_host:port>'
  - Connect timeout too short: check forward_proxy.connect_timeout
  - All resolved IPs failed (tries IPv4 first, then IPv6)
  - DNS module failure with system DNS fallback also failing
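
The "IPv4 first, then IPv6" dial order mentioned above can be sketched as a stable sort by address family. This is only an illustration: the helper name dial_order is hypothetical, and the assumption that the resolver's order is preserved within each family is not stated in the source.

```python
import ipaddress


def dial_order(resolved):
    """Order resolved addresses so that IPv4 is tried before IPv6."""
    addrs = [ipaddress.ip_address(a) for a in resolved]
    # sorted() is stable, so the resolver's order is kept within a family.
    return [str(a) for a in sorted(addrs, key=lambda a: a.version)]
```

A 502 is returned only after every address in this list has failed to connect.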

Connection drops or timeouts during tunnel:

  - Idle timeout: no data flowing for forward_proxy.idle_timeout (default 5m)
  - Max duration exceeded: forward_proxy.max_connection_duration hard limit
  - Check relay buffer_size: default 32KB, increase for high-throughput tunnels
  - HTTP/2 full duplex not supported by server: check error logs for full duplex support errors
  - Intermediate firewall blocking long-lived connections or UDP (QUIC)

PAC file returns DIRECT for all traffic:

  - PAC endpoint requires authentication; verify session cookie is sent
  - Check forward_proxy.pac.enabled = true
  - Check use_firewall_targets = true and user has firewall rules
  - Unauthenticated PAC intentionally returns DIRECT-only (security by design)
  - Inspect PAC: curl -b session=<cookie> https://host/proxy.pac

/proxy/config returns 401 or 403:

  - 401: session cookie missing or expired; trigger re-login via /proxy/setup
  - 403: user not in required group (forward_proxy.pac.group)
  - Verify group membership: 'directory user <username>'

Extension not refreshing token:

  - Verify token_refresh_interval < token_ttl in config
  - Check /proxy/config endpoint accessibility from extension
  - Look for clock skew between client and server (server_time in response)
  - Monitor: 'forwardproxy metrics' for token generation counts
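
As a quick offline check of the first item, the following Python sketch validates the two token settings against the documented minimums (token_ttl min 30s, token_refresh_interval min 5s) and against each other. The helper names are hypothetical, and the duration parser only understands the simple "<number><s|m|h>" form used in the config examples.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse durations such as "60s", "5m", or "24h" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_token_settings(token_ttl, token_refresh_interval):
    """Return a list of problems; an empty list means the settings look sane."""
    ttl = parse_duration(token_ttl)
    refresh = parse_duration(token_refresh_interval)
    problems = []
    if ttl < 30:
        problems.append("token_ttl is below the 30s minimum")
    if refresh < 5:
        problems.append("token_refresh_interval is below the 5s minimum")
    if refresh >= ttl:
        problems.append("token_refresh_interval must be shorter than token_ttl")
    return problems
```

With the defaults (token_ttl = "5m", token_refresh_interval = "60s") the check returns an empty list.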

CONNECT-UDP/MASQUE failures:

  - QUIC port (UDP) blocked by intermediate firewall
  - forward_proxy.enable_udp = false in config
  - URI template mismatch: check udp_proxy_path setting
  - MASQUE parse error: malformed CONNECT-UDP request
  - Verify: 'net tcp <proxy_hostname:port> --tls' for TLS connectivity

Geo/time restriction inconsistencies:

  - Forward proxy has its own geo/time config that overrides [service] settings
  - Check both forward_proxy.geo_enabled and service.geo_enabled
  - Restrictions on /proxy/config and CONNECT may behave differently
  - CONNECT restrictions fail-open if the cluster is not ready

Metrics and monitoring:

  - 'forwardproxy metrics' — cluster-wide connection counts and byte totals
  - 'forwardproxy metrics <user>' — per-user breakdown
  - Bytes sent/recv recorded per TCP connection; UDP records duration and
    success only (MASQUE library limitation)

Relationships

Dependencies and interactions:

Forward proxy module: Handles all authentication, ACL, rate limiting, PAC generation, metrics, and restriction checks cluster-wide.
DNS: Hostname resolution for CONNECT targets. Falls back to system DNS if the DNS module is unavailable. IPv4 preferred over IPv6 in resolution order.
Firewall: ACL rules determine which groups can access which destination host:port. Firewall rules also drive PAC file generation (use_firewall_targets).
Directory: User disabled status checked during authentication. Group membership resolved server-side from the directory memory index during ACL evaluation (not embedded in the bearer token).
Geo/Time access: Location and time-based access checks on both /proxy/config endpoint and CONNECT requests. Forward proxy can override [service] geo/time settings with its own configuration.
Sessions: Session cookies used for /proxy/config, /proxy/setup, and /proxy.pac. Browser extension first authenticates via session, then receives a bearer token for subsequent CONNECT requests.
Reverse proxy: Complementary service — reverse proxy handles inbound traffic to backends, forward proxy handles outbound traffic from users. Both share the same TLS listener and session subsystem.

Logs

Log entries by component. Search with: logs search "forwardproxy". Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Lifecycle & Middleware:

  forwardproxy.service.init   INFO          Forward proxy service disabled in config
  forwardproxy.service.init   INFO          Forward proxy service initialized
  forwardproxy.middleware      INFO          Forward proxy disabled, passing CONNECT to next handler
  forwardproxy.middleware      WARN          CONNECT request rejected on main service port

PAC & Config Endpoints:

  forwardproxy.pac            DEBUG         Generating PAC file for authenticated user
  forwardproxy.pac            ERROR         Failed to generate PAC
  forwardproxy.config         DEBUG         Generating proxy config for extension
  forwardproxy.config         WARN          Access blocked by restriction
  forwardproxy.config         ERROR         Failed to generate PAC
  forwardproxy.config         ERROR         Failed to generate proxy token
  forwardproxy.config         INFO          Proxy config generated successfully
  forwardproxy.setup          INFO          Proxy setup authorized

Restrictions:

  forwardproxy.restrictions   ERROR         Failed to call restrictions check

SSRF Protection:

  forwardproxy.ssrf           WARN   AUDIT  blocked non-routable IP from DNS resolution
  forwardproxy.ssrf           WARN   AUDIT  all resolved IPs are non-routable — request blocked

DNS & Connectivity:

  forwardproxy.dns            DEBUG         Resolving hostname via DNS module
  forwardproxy.dns            DEBUG         DNS resolution successful
  forwardproxy.dns            DEBUG         Using system DNS resolver
  forwardproxy.dns            DEBUG         Successfully connected to backend
  forwardproxy.dns            WARN          DNS module failure - falling back to system DNS
  forwardproxy.dns            WARN          DNS resolution timeout - falling back to system DNS
  forwardproxy.dns            WARN          DNS module returned error - falling back to system DNS
  forwardproxy.dns            WARN          Failed to connect to IP, trying next
  forwardproxy.connector      DEBUG         Dialing via connector site
  forwardproxy.connector      DEBUG         Connected via connector site

TCP CONNECT Authentication:

  forwardproxy.tcp.auth       INFO   AUDIT  Missing or invalid Proxy-Authorization header
  forwardproxy.tcp.auth       INFO   AUDIT  Token too long
  forwardproxy.tcp.auth       INFO   AUDIT  Authentication failed

TCP CONNECT ACL & Rate Limiting:

  forwardproxy.tcp.acl        WARN   AUDIT  ACL denied
  forwardproxy.tcp.ratelimit  ERROR         Rate limit service unavailable
  forwardproxy.tcp.ratelimit  ERROR         Rate limit check failed
  forwardproxy.tcp.ratelimit  WARN   AUDIT  Rate limit exceeded

TCP CONNECT Connection:

  forwardproxy.tcp.connect    INFO          Proxy connection established
  forwardproxy.tcp.dial       ERROR         Failed to connect to backend
  forwardproxy.tcp.http2      DEBUG         Using HTTP/2+ full duplex CONNECT stream
  forwardproxy.tcp.http2      ERROR         Failed to enable full duplex mode
  forwardproxy.tcp.http2      ERROR         Failed to flush response
  forwardproxy.tcp.hijack     ERROR         ResponseWriter does not support hijacking
  forwardproxy.tcp.hijack     ERROR         Failed to hijack connection
  forwardproxy.tcp.error      ERROR         Request validation or service errors (dynamic message)

HTTP Proxy Authentication:

  forwardproxy.http.auth      INFO   AUDIT  Missing or invalid Proxy-Authorization header
  forwardproxy.http.auth      INFO   AUDIT  Token too long
  forwardproxy.http.auth      INFO   AUDIT  Authentication failed

HTTP Proxy ACL & Rate Limiting:

  forwardproxy.http.acl       WARN   AUDIT  ACL denied
  forwardproxy.http.ratelimit ERROR         Rate limit service unavailable
  forwardproxy.http.ratelimit ERROR         Rate limit check failed
  forwardproxy.http.ratelimit WARN   AUDIT  Rate limit exceeded

HTTP Proxy Forwarding:

  forwardproxy.http.forward   INFO          HTTP proxy request forwarded
  forwardproxy.http.forward   ERROR         Failed to forward request
  forwardproxy.http.copy      DEBUG         Response body copy error
  forwardproxy.http.error     ERROR         Request validation or service errors (dynamic message)

UDP/MASQUE Authentication:

  forwardproxy.udp.auth       INFO   AUDIT  Missing or invalid Proxy-Authorization header
  forwardproxy.udp.auth       INFO   AUDIT  Token too long
  forwardproxy.udp.auth       INFO   AUDIT  Authentication failed

UDP/MASQUE ACL & Rate Limiting:

  forwardproxy.udp.acl        WARN   AUDIT  ACL denied
  forwardproxy.udp.ratelimit  ERROR         Rate limit service unavailable
  forwardproxy.udp.ratelimit  ERROR         Rate limit check failed
  forwardproxy.udp.ratelimit  WARN          Rate limit exceeded

UDP/MASQUE Connection & Session:

  forwardproxy.udp.parse      WARN          Failed to parse CONNECT-UDP request
  forwardproxy.udp.parse      WARN          Invalid CONNECT-UDP request
  forwardproxy.udp.parse      WARN          Invalid target hostname
  forwardproxy.udp.connect    INFO          UDP proxy session authorized
  forwardproxy.udp.ssrf       WARN   AUDIT  SSRF blocked: UDP target resolves to non-routable IP
  forwardproxy.udp.dial       WARN          Failed to dial UDP IP, trying next
  forwardproxy.udp.dial       ERROR         All UDP dial attempts failed
  forwardproxy.udp.proxy      ERROR         UDP proxy error
  forwardproxy.udp.complete   INFO          UDP proxy session completed
  forwardproxy.udp.error      ERROR         Request validation or service errors (dynamic message)

Shared (TCP, HTTP, UDP):

  forwardproxy.ratelimit.status  DEBUG      Rate limit check passed

Metrics

No Prometheus metrics emitted directly by this service layer. Metrics are recorded by the forward proxy infrastructure module after each connection. Query with: metrics prometheus forwardproxy_<name>

Forward Proxy Engine

Authentication, ACL evaluation, rate limiting, and PAC generation engine for the forward proxy

Overview

The forward proxy module provides browser-native access to backend services using the MASQUE protocol (RFC 9298) over QUIC. It enables authenticated, policy-controlled tunneling of TCP and UDP traffic through the Hexon gateway without requiring any client software.

Core capabilities:

Bearer token authentication using HMAC-SHA256 signed tokens with configurable TTL
Firewall ACL integration for group-based destination access control
Per-user rate limiting (requests/sec) and bandwidth limiting (bytes/sec)
PAC (Proxy Auto-Configuration) file generation for browser proxy setup
JA4/JA4Q fingerprint binding for session-based authentication
Geo-IP and time-based access restrictions (fail-closed)
Active connection tracking with per-user and per-target metrics
DNS resolution via the DNS module (prevents DNS poisoning)
Separate proxy hostname and TLS certificate support for CDN bypass
Token refresh mechanism for long-lived browser sessions

Transport security model:

  The PAC file returns "HTTPS host:port", so the browser always connects to
  the proxy over TLS. The forward proxy listener only speaks TLS.

  HTTPS target (e.g. https://example.com):
    Browser --TLS--> Proxy --TLS--> Target
             CONNECT        tunnel (end-to-end encrypted)
             + token        (raw bytes, no proxy headers)

  Plain HTTP target (e.g. http://ifconfig.io):
    Browser --TLS--> Proxy --plain--> Target
             GET http://...           (content visible on last hop)
             + token                  (token STRIPPED before forwarding)

  The bearer token only travels on the encrypted browser-to-proxy leg.
  Hop-by-hop headers (Proxy-Authorization, Connection, etc.) are removed
  before forwarding. The token never reaches the target server.
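
The header stripping described above can be sketched in Python as follows. This is a simplified illustration, not the gateway's code: the function name is hypothetical, headers are modeled as a plain dict, and the hop-by-hop set is the conventional HTTP list rather than one confirmed by the source.

```python
# Conventional hop-by-hop header names, lower-cased for comparison.
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
})


def strip_hop_by_hop(headers):
    """Return a copy of headers that is safe to forward to the target."""
    # Headers named in Connection are hop-by-hop for this connection too.
    connection_value = next(
        (v for k, v in headers.items() if k.lower() == "connection"), ""
    )
    listed = {p.strip().lower() for p in connection_value.split(",") if p.strip()}
    drop = HOP_BY_HOP | listed
    return {k: v for k, v in headers.items() if k.lower() not in drop}
```

Because Proxy-Authorization is always in the drop set, the bearer token is removed on the plain-HTTP path regardless of what the client lists in Connection.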

Authentication flow (bearer token):

  1. User logs in via any method, receives session cookie
  2. Browser extension fetches /proxy/config with session cookie
  3. Service generates HMAC-SHA256 signed token with user/groups/expiry
  4. Extension sends Proxy-Authorization: Bearer <token> on CONNECT
  5. Token validated locally (no round-trip for validation)
  6. User disabled status checked against directory
  7. CheckAccess enforces firewall ACL rules
  8. Connection established and traffic relayed
  9. Extension periodically refreshes token via /proxy/config

Config

Core configuration under [forward_proxy] section in hexon.toml:

[forward_proxy]
  enabled = true                       # Enable forward proxy module
  port = 8443                          # Dedicated proxy port (must differ from service.port)
  public_port = 8443                   # External port for PAC URLs (for NAT/LB scenarios)
  preserve_client_port = true          # Use client's port in Alt-Svc header
  hostname = "proxy.example.com"       # Separate hostname for CDN bypass (optional)
  fingerprint_binding = true           # Enable JA4/JA4Q fingerprint-to-session binding
  fingerprint_binding_ttl = "8h"       # Fingerprint binding TTL (match session TTL)
  rate_limit_per_user = 1000           # Max requests per second per user
  bandwidth_limit_per_user = "100mbps" # Max bandwidth per user

  # Token settings
  token_ttl = "5m"                     # Token validity duration (default: 5m)
  token_refresh_interval = "60s"       # Extension refresh interval (default: 60s)

  # Probe resistance — hide proxy presence from unauthenticated probes
  probe_resistance_mode  = "off"       # off | fingerprint | ip | secret_host
  probe_resistance_decoy = "404"       # 404 (Not Found) | empty (204 No Content)
  probe_resistance_ttl   = ""          # cache TTL; defaults to fingerprint_binding_ttl
  probe_resistance_secret_host = ""    # required when mode=secret_host

  # TLS certificate for the proxy hostname (optional)
  # Only needed when hostname differs from service.hostname
  # Value can be a file path or inline PEM content
  # If not set, uses ACME (add hostname to acme.additional_domains) or service cert
  cert = "/path/to/cert.pem"
  key = "/path/to/key.pem"

  # Geo-IP restrictions (optional, falls back to [service] if not set)
  geo_enabled = true                   # Enable geo-IP restrictions
  geo_allow_countries = ["US", "CA"]   # Allowed country codes (ISO 3166-1 alpha-2)
  geo_deny_countries = []              # Denied country codes
  geo_bypass_cidr = ["10.0.0.0/8"]    # CIDR ranges that bypass geo checks
  geo_deny_code = 403                  # HTTP status code for geo-denied requests
  geo_deny_message = "Access denied from your location"

  # Time-based restrictions (optional, falls back to [service] if not set)
  time_enabled = true                  # Enable time-based restrictions
  time_timezone = "America/New_York"   # Timezone for time checks
  time_allow_days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
  time_allow_hours = "09:00-18:00"     # Allowed hours range
  time_deny_code = 403                 # HTTP status code for time-denied requests
  time_deny_message = "Access not permitted at this time"

# PAC file configuration
[forward_proxy.pac]
  enabled = true                       # Enable PAC endpoint
  path = "/proxy.pac"                  # PAC file URL path
  cache_ttl = "15m"                    # PAC response cache TTL
  use_firewall_targets = true          # Derive PAC targets from firewall rules

PAC authentication requirement: unauthenticated requests receive a minimal PAC that routes all traffic directly. Authenticated users get a PAC with targets derived from their firewall rules.
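
The two PAC variants can be sketched with a small Python generator. This is an illustration only: the function name generate_pac and the use of shExpMatch host patterns are assumptions, and the real PAC may match targets differently.

```python
import json


def generate_pac(proxy_host, proxy_port, targets=None):
    """Build PAC text; with no targets the result routes everything DIRECT."""
    # json.dumps produces a correctly quoted JavaScript string literal.
    directive = json.dumps(f"HTTPS {proxy_host}:{proxy_port}")
    lines = ["function FindProxyForURL(url, host) {"]
    for pattern in targets or []:
        lines.append(
            f"  if (shExpMatch(host, {json.dumps(pattern)})) return {directive};"
        )
    lines.append('  return "DIRECT";')
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling generate_pac("proxy.example.com", 8443) models the unauthenticated case and yields a DIRECT-only function; passing the user's allowed host patterns models the authenticated case.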

Hot-reloadable: rate_limit_per_user, bandwidth_limit_per_user, geo/time restrictions, PAC settings, token_ttl, token_refresh_interval, probe_resistance_* (mode and TTL apply on next request).

Cold (restart required): enabled, port, hostname, fingerprint_binding.

Security

Security layers and hardening measures:

Bearer token security:

  Tokens signed with HMAC-SHA256 using the cluster-wide secret key.
  Short TTL (default 5 minutes) limits exposure window for stolen tokens.
  Token contains user ID, groups, and expiry; validated locally without
  round-trip for minimal latency.
  Tokens are not stored server-side (stateless validation via signature).
  Token transport is always encrypted: the browser-to-proxy connection is
  TLS (PAC returns "HTTPS"), and the token is stripped (hop-by-hop header)
  before forwarding to the target. Even for plain HTTP targets, the token
  never leaves the TLS tunnel.

Fingerprint binding:

  JA4/JA4Q TLS fingerprint bound to session via BindFingerprint operation.
  Prevents token replay from a different client/browser. Binding has its own
  TTL that should match the session TTL for consistent expiry. The same
  binding doubles as the signal for probe_resistance_mode=fingerprint.

Probe resistance (probe_resistance_mode):

  Without this gate, every CONNECT to the proxy receives a 407 with
  "Basic realm=Hexon Proxy", and the IAP-port middleware leaks the
  dedicated proxy port in plaintext via 421 — both fingerprintable.

  off          — legacy 407-on-everything (default).

  fingerprint  — 407 only when the request's JA4Q TLS fingerprint is
                 currently bound (BindFingerprint cache, populated on
                 sign-in). Recommended for browser-extension deployments;
                 the binding is auto-populated whenever a user signs in.

  ip           — 407 only when the request's source IP authenticated
                 within probe_resistance_ttl. Survives JA4Q drift but
                 leaks the 407 to every client behind a shared egress
                 once one user has signed in. Best for office/VPN-fronted
                 access.

  secret_host  — 407 only when r.Host equals probe_resistance_secret_host.
                 Best for non-extension manual proxy configuration where
                 the secret host is distributed out-of-band.

  In all non-off modes the IAP-port middleware mirrors the same gate so
  a probe diffing responses across listeners cannot fingerprint either.

  Metrics:
    forwardproxy_probe_decisions_total{mode, decision, path}
      mode      configured mode at decision time
      decision  "challenge" (407 emitted) or "decoy" (decoy served)
      path      "tcp" / "http" / "udp" / "iap_middleware"
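
The per-mode gate described above reduces to a small decision function for requests that arrive without valid credentials. The Python sketch below is illustrative: the function and parameter names are hypothetical, and the lookups that would produce the boolean inputs (fingerprint binding cache, recent-IP cache) are left out.

```python
# probe_resistance_decoy values mapped to the status codes named in the config.
DECOY_STATUS = {"404": 404, "empty": 204}


def probe_decision(mode, *, fingerprint_bound=False,
                   ip_recently_authenticated=False,
                   host="", secret_host=""):
    """Return "challenge" (send 407) or "decoy" for an unauthenticated request."""
    if mode == "off":
        return "challenge"  # legacy behavior: 407 on everything
    if mode == "fingerprint":
        return "challenge" if fingerprint_bound else "decoy"
    if mode == "ip":
        return "challenge" if ip_recently_authenticated else "decoy"
    if mode == "secret_host":
        matched = bool(secret_host) and host == secret_host
        return "challenge" if matched else "decoy"
    raise ValueError(f"unknown probe_resistance_mode: {mode!r}")
```

The return values match the decision label of forwardproxy_probe_decisions_total.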

Access control (multi-layer):

  1. Bearer token authentication (identity verification)
  2. User disabled check via directory.IsUserDisabled (account status)
  3. Firewall ACL via CheckAccess (group-based destination control)
  4. Rate limiting per user (abuse prevention)
  5. Bandwidth limiting per user (network saturation prevention)
  6. Geo-IP restrictions (location-based access, fail-closed)
  7. Time-based restrictions (schedule-based access, fail-closed)
  8. DNS resolution via the DNS module (prevents DNS poisoning)

Geo-IP and time restrictions:

  Both use fail-closed semantics: if the check cannot be performed
  (e.g., GeoIP database unavailable), access is denied.
  Forward proxy has its own geo/time config that overrides [service] defaults,
  allowing different policies for proxy vs. web access.
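
Fail-closed evaluation can be sketched in Python as below. This is a simplified model: lookup_country stands in for whatever GeoIP lookup the gateway uses, and the precedence shown (bypass CIDR first, then deny list, then allow list) is an assumption rather than documented behavior.

```python
import ipaddress


def geo_allowed(client_ip, lookup_country, allow_countries, deny_countries,
                bypass_cidrs):
    """Return True only when the geo check positively allows the client."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in ipaddress.ip_network(cidr) for cidr in bypass_cidrs):
        return True  # geo_bypass_cidr: no lookup needed
    try:
        country = lookup_country(client_ip)
    except Exception:
        return False  # fail-closed: the check could not be performed
    if not country:
        return False  # fail-closed: unknown location
    if country in deny_countries:
        return False
    if allow_countries and country not in allow_countries:
        return False
    return True
```

The key property is that every error path returns False, so a missing GeoIP database denies access instead of silently allowing it.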

PAC file security:

  PAC endpoint requires authentication to return proxy-routed targets.
  Unauthenticated PAC returns DIRECT-only routing (no information leak).
  Username embedded in PAC for browser extension display only.

Rate and bandwidth limiting:

  Per-user rate limiting prevents connection flooding.
  Per-user bandwidth limiting prevents single-user network saturation.
  Both return RetryAfter hints for well-behaved clients.
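
The source does not say which limiting algorithm the engine uses. As one possible model, the Python sketch below uses a token bucket (a swapped-in technique chosen for illustration) to show how a per-user requests/sec limit can yield a retry hint. The class and method names are hypothetical.

```python
class TokenBucket:
    """Per-user bucket: refills at rate_per_sec, holds at most one second's worth."""

    def __init__(self, rate_per_sec, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(rate_per_sec)
        self.tokens = self.capacity
        self.updated = now

    def allow(self, now):
        """Return (allowed, retry_after_seconds) for one request at time now."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Time until one full token has been refilled.
        return False, (1.0 - self.tokens) / self.rate
```

In a deployment like the one described, one bucket per user ID would be kept, with the rate taken from rate_limit_per_user; the second tuple element is what a 429 response would surface as its retry hint.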

Troubleshooting

Common symptoms and diagnostic steps:

User cannot connect through forward proxy:

  - Verify forward_proxy.enabled = true and port is correct
  - Check bearer token: token_ttl may have expired, verify refresh is working
  - Check user disabled status: directory user <username>
  - Verify firewall rules allow the target: forwardproxy check <user> <target>
  - Check geo restrictions: geo lookup <client_ip> and geo check <client_ip>
  - Check time restrictions: ensure current time is within allowed window
  - DNS resolution: verify target hostname resolves via dns test <hostname>

PAC file returns DIRECT for all traffic:

  - PAC requires authentication; check session cookie is being sent
  - Verify forward_proxy.pac.enabled = true
  - Check use_firewall_targets = true and firewall rules exist for the user
  - Inspect PAC content: curl -b session=<cookie> https://host/proxy.pac

Token refresh failing (extension shows expired):

  - Check token_refresh_interval is shorter than token_ttl
  - Verify /proxy/config endpoint is accessible with session cookie
  - Check for clock skew between client and server
  - Monitor token generation metrics via forwardproxy metrics

Rate limited (429 responses):

  - Check rate_limit_per_user setting (requests/sec)
  - Check bandwidth_limit_per_user setting
  - Monitor per-user metrics: forwardproxy metrics <username>
  - Retry-After header indicates when to retry

Fingerprint binding failures:

  - Verify fingerprint_binding = true in config
  - Check fingerprint_binding_ttl matches session TTL
  - JA4 fingerprint changes between requests indicate client switching
  - Browser updates can change JA4 fingerprint (rebind needed)

Connection drops or timeouts:

  - Check backend connectivity: net tcp <target_host:port>
  - Check QUIC port (UDP) is not blocked by intermediate firewalls
  - Verify TLS certificate: net tls <proxy_hostname:port>
  - Check active connections: forwardproxy metrics to see connection counts

Geo-IP or time-based denial (403/451):

  - Geo denial: geo lookup <ip> shows country, geo check <ip> shows policy
  - Time denial: verify time_timezone is correct, check time_allow_hours
  - Bypass CIDR: add client network to geo_bypass_cidr for exemption
  - Forward proxy geo/time overrides [service] config if set

Metrics and monitoring:

  - Active connections: forwardproxy metrics (cluster-wide)
  - Per-user breakdown: forwardproxy metrics <username>
  - Connection success/failure rates tracked via RecordMetrics
  - Bytes sent/received per user for bandwidth accounting

Relationships

Module dependencies and interactions:

Firewall: ACL rule evaluation determines which destinations each user group can reach. Firewall rules also drive PAC file generation when use_firewall_targets is enabled.
Directory: User disabled check on every authentication call. Group membership embedded in token for ACL evaluation.
Forward proxy service: Service layer handles HTTP CONNECT (TCP tunneling), CONNECT-UDP (UDP tunneling), and absolute-form HTTP requests (plain HTTP forwarding), plus HTTP endpoints (/proxy/config, /proxy/setup, /proxy.pac). Service calls this engine for auth, ACL, metrics.
DNS: Hostname resolution for target destinations, with system DNS fallback.
Rate limiting: Per-user request throttling and bandwidth controls.
Geo-IP: Location-based access restrictions. Forward proxy can override [service] geo config with its own settings.
Sessions: Session cookie used for initial token generation. Fingerprint binding ties proxy session to TLS fingerprint.
Configuration: Hot-reload of rate limits, bandwidth limits, geo/time restrictions, PAC settings. Token TTL changes apply to new tokens only.
Telemetry: Structured logging for authentication, ACL decisions, rate limit events. Metrics for active connections, bytes transferred, token generation.
Auto TLS: ACME certificate for proxy hostname when using a separate hostname (add to acme.additional_domains).

Logs

Log entries by component. Search with: logs search "forwardproxy". Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Initialize:

  forwardproxy.init                    INFO          Forward proxy disabled in config
  forwardproxy.init                    ERROR         Failed to initialize forward proxy
  forwardproxy.init                    INFO          Initializing forward proxy module

Access Control:

  forwardproxy.checkaccess             ERROR         Failed to resolve user groups
  forwardproxy.checkaccess             ERROR         Failed to call firewall.CheckProxyAccess
  forwardproxy.checkaccess             ERROR         Invalid response type from firewall

Allowed Targets:

  forwardproxy.getallowedtargets       ERROR         Failed to resolve user groups
  forwardproxy.getallowedtargets       ERROR         Failed to call firewall.GetAllowedTargets
  forwardproxy.getallowedtargets       ERROR         Invalid response type from firewall

PAC Generation:

  forwardproxy.generatepac             WARN          PAC requested without authentication
  forwardproxy.generatepac             DEBUG         Generated PAC file

Authentication:

  forwardproxy.auth                    WARN          Token validation failed
  forwardproxy.auth                    WARN          User account is disabled
  forwardproxy.auth                    INFO  AUDIT   Token authentication successful
  forwardproxy.auth                    DEBUG         Invalidated fingerprint binding

Token Generation:

  forwardproxy.token                   ERROR         Failed to generate token
  forwardproxy.token                   DEBUG         Generated proxy token

Fingerprint Binding:

  forwardproxy.bind                    WARN          Failed to broadcast fingerprint binding
  forwardproxy.bind                    WARN          Failed to achieve quorum for fingerprint binding
  forwardproxy.bind                    INFO          Fingerprint bound to session

Rate Limiting:

  forwardproxy.ratelimit               WARN          Rate limit check called without UserID
  forwardproxy.ratelimit               WARN          User rate limit exceeded
  forwardproxy.ratelimit               WARN          Destination rate limit exceeded
  forwardproxy.ratelimit               WARN          User bandwidth limit exceeded

Rate Limit Cleanup:

  forwardproxy.cleanup                 DEBUG         Cleaned up stale rate limit entries

Geo Restrictions:

  forwardproxy.restrictions.geo        ERROR         Geo check failed - denying access (fail-closed)
  forwardproxy.restrictions.geo        ERROR         Geo check wait failed - denying access (fail-closed)
  forwardproxy.restrictions.geo        ERROR         Invalid geo check response type - denying access (fail-closed)
  forwardproxy.restrictions.geo        INFO          Access blocked by geo restriction

Time Restrictions:

  forwardproxy.restrictions.time       ERROR         Time check failed - denying access (fail-closed)
  forwardproxy.restrictions.time       ERROR         Time check wait failed - denying access (fail-closed)
  forwardproxy.restrictions.time       ERROR         Invalid time check response type - denying access (fail-closed)
  forwardproxy.restrictions.time       INFO          Access blocked by time restriction

Metrics

Prometheus metrics. Query with: metrics prometheus forwardproxy_<name>

Connection Metrics (namespace: forwardproxy):

  forwardproxy_connections_total                  counter    {protocol, user_id}         Proxy connections recorded
  forwardproxy_bytes_sent_total                   counter    {protocol, user_id}         Bytes sent through proxy
  forwardproxy_bytes_received_total               counter    {protocol, user_id}         Bytes received through proxy
  forwardproxy_connection_duration                latency    {protocol, user_id}         Connection duration
  forwardproxy_errors_total                       counter    {protocol, error}           Failed proxy connections
  forwardproxy_active_connections                 gauge      {}                          Currently active proxy connections

Network Listener

Manages all network connections — TLS termination, client fingerprinting, HTTP middleware chain, and protocol detection

Overview

Manages all incoming network connections — TLS termination, protocol detection, client fingerprinting, and the HTTP middleware chain. Every request to the gateway passes through the listener before reaching any service or proxy route. Supports TCP, TLS, HTTP/1.1, HTTP/2, HTTP/3 (QUIC), UDP, and gRPC.

Client fingerprinting combines three layers into a composite hash:

  JA4 (TLS)     — cipher and extension hash, extracted during TLS handshake
  HTTP/2        — SETTINGS frame parameters and pseudo-header ordering
  TCP/IP Stack  — window size, MSS, TTL for OS identification
  Composite     — SHA256(ja4|http2|tcp) truncated to 32 hex chars
  JA4Q (QUIC)   — QUIC transport parameter fingerprint for HTTP/3 clients

Used for rate limiting, session affinity, and client identification — resistant to IP spoofing and NAT. Fingerprint data is stored in a unified structure across all protocols (HTTP/1.1, HTTP/2, HTTP/3).
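
The composite formula SHA256(ja4|http2|tcp) truncated to 32 hex chars can be written out as a short Python sketch. The UTF-8 encoding of the inputs and the literal "|" separator are taken at face value from the formula; the exact byte layout the listener hashes is an assumption.

```python
import hashlib


def composite_fingerprint(ja4, http2, tcp):
    """Hash the three layer fingerprints and keep the first 32 hex characters."""
    material = f"{ja4}|{http2}|{tcp}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:32]
```

Thirty-two hex characters correspond to the first 128 bits of the digest, and a change in any one layer produces a different composite value.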

HTTP middleware chain (applied in order): security headers, geo restriction, time restriction, rate limiting, size limiting, proof-of-work, WAF. Each layer runs independently.

Additional capabilities:

Deployment behind CDN/load balancer with header-based client identification (proxy mode)
Per-SNI mTLS with dynamic CA rotation
HXEP (Hexon Edge Protocol) for real client IP through edge proxies and SNAT
Correlation ID propagation for end-to-end distributed tracing
Malformed TLS blocking to reject invalid ClientHello messages
Graceful shutdown with configurable connection draining timeout

Config

Core configuration under [service] in config TOML:

[service]
  hostname = "auth.example.com"        # Service hostname
  tls_cert = "/path/to/cert.pem"       # TLS certificate path
  tls_key = "/path/to/key.pem"         # TLS private key path
  handshake_timeout = 10               # TLS handshake timeout in seconds (default: 10)
  block_malformed_tls = true           # Reject invalid TLS ClientHello (default: true)
  max_header_bytes = 65536             # Max ClientHello size in bytes (default: 64KB)
  disable_server_header = false        # Suppress HexonGateway/<version> header (default: false)
  correlation_id_header = "X-Hexon-ID" # Correlation ID header name (default: "X-Hexon-ID")
  cookie_name = "hexon"                # Session cookie name (default: "hexon")

  # Mutual TLS
  mtls_mode = "none"                   # "none", "optional", "mandatory" (default: "none")

  # HTTP/2 settings
  http2_enable = true                  # Enable HTTP/2 (default: true)
  http2_maxstreams = 1000              # Max concurrent streams per connection
  http2_maxframesize = 1048576         # Max frame payload size (default: 1MB)
  http2_idletimeout = 120              # Idle timeout in seconds
  http2_keepalive = true               # Enable HTTP/2 keepalive
  http2_keepaliveseconds = 30          # Keepalive interval in seconds

  # Fingerprint cache
  fingerprint_max_entries = 10000      # Max entries in addr fingerprint map (default: 10000)
  fingerprint_ttl_seconds = 300        # Base TTL in seconds (default: 5 min)
  fingerprint_cleanup_seconds = 30     # Cleanup sweep interval (default: 30s)
  fingerprint_max_entries_per_ip = 10  # Max fingerprints per IP, anti-abuse (default: 10)

  # JA4 parsing security limits
  ja4_max_extensions = 200             # Max TLS extensions to parse (default: 200, typical: 10-30)
  ja4_max_sigalgs = 100                # Max signature algorithms to parse (default: 100)

  # HTTP/2 fingerprint cache
  http2_fingerprint_cache_size = 10000     # Max entries (default: 10000)
  http2_fingerprint_cache_evict_pct = 10   # % of oldest entries to evict when full (1-50)

  # QUIC fingerprint reassembly
  quic_fingerprint_reassembly_max_packets = 10   # Max packets for reassembly (default: 10)
  quic_fingerprint_reassembly_max_bytes = 15360  # Max reassembly buffer (default: 15KB)
  quic_fingerprint_reassembly_timeout_s = 5      # State timeout (default: 5s)
  quic_max_crypto_frame_offset = 65536           # Max CRYPTO frame offset (default: 64KB)

  # Proxy mode (behind CDN/LB)
  proxy = false                                  # Enable proxy mode (default: false)
  proxy_cidr = ["10.0.0.0/8"]                    # Trusted proxy IPs (REQUIRED when proxy=true)
  proxy_header_clientip = "X-Forwarded-For"      # Real client IP header (REQUIRED when proxy=true)
  proxy_header_clientcert = "SSL_CLIENT_CERT"    # Client certificate header (optional)
  proxy_header_clientfingerprint = "CF-Ray"      # Client fingerprint header (optional)
  proxy_header_traceid = "X-Request-ID"          # Trace ID header for distributed tracing (optional)

  # Geo restriction (router-level middleware)
  geo_enabled = false                  # Enable geo restrictions (default: false)
  geo_database = "GeoLite2-Country.mmdb"
  geo_asn_database = "GeoLite2-ASN.mmdb"
  geo_allow_countries = []             # ISO 3166-1 alpha-2 codes (empty = all)
  geo_deny_countries = []              # Deny takes precedence over allow
  geo_allow_asn = []                   # ASN allow list
  geo_deny_asn = []                    # ASN deny list
  geo_bypass_cidr = []                 # CIDRs that skip geo checks
  geo_deny_code = 403                  # HTTP status for denials
  geo_deny_message = ""                # Custom denial message

  # Time restriction (router-level middleware)
  time_enabled = false                 # Enable time restrictions (default: false)
  time_bypass_cidr = []                # CIDRs that skip time checks
  time_default_timezone = "UTC"        # Default timezone (IANA format)

[protection]
  rate_limit = "100/1m"                # Requests per interval (empty = disabled)
  rate_limit_type = "fingerprint"      # "fingerprint" or "ip" (default: "ip")
  rate_limit_bantime = "5m"            # Ban duration when limit exceeded

Fingerprint adaptive TTL (based on cache utilization):

  Normal (<60%):  base TTL (default 5 min)
  Medium (60-80%): base TTL / 2 (min 2 min)
  High (>80%):    base TTL / 5 (min 1 min)
  LRU eviction triggers when TTL cleanup is insufficient.
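
The tiers above can be sketched as a small function. This is an illustration under stated assumptions, not the gateway's implementation: utilization is taken as entries divided by fingerprint_max_entries, exactly 80% counts as Medium, and the floor is assumed never to raise the TTL above the configured base.

```python
def adaptive_ttl(base_ttl_s: float, utilization: float) -> float:
    """Return the effective fingerprint TTL in seconds.

    utilization is the cache fill level as a fraction (0.0 to 1.0).
    """
    if utilization < 0.60:            # Normal: keep the base TTL
        return base_ttl_s
    if utilization <= 0.80:           # Medium: halve, floor at 2 min
        reduced = max(base_ttl_s / 2, 120.0)
    else:                             # High: divide by 5, floor at 1 min
        reduced = max(base_ttl_s / 5, 60.0)
    # Assumption: the floor never raises the TTL above the configured base.
    return min(base_ttl_s, reduced)


for used in (0.40, 0.70, 0.90):
    print(f"{used:.0%} full -> {adaptive_ttl(300, used):.0f}s")
```

With the default base of 300 seconds, the three tiers give 300, 150, and 60 seconds.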

HXEP settings (also under [service]):

[service]
  # HXEP (Hexon Edge Protocol)
  hexon_edge_protocol = false         # Enable HXEP header parsing (default: false)
  hexon_edge_cidr = [                 # Trusted CIDRs for HXEP (default: trust all)
    "10.244.0.0/16",                  # Kubernetes pod network
  ]

HXEP (Hexon Edge Protocol) — real client IP through edge proxies:

  When traffic flows: External Client → Edge Proxy → Gateway (via k8s Service/LB),
  the edge proxy prepends a binary header with the original client IP and port.
  Format: Magic "HXEP" (4B) + Type (1B: 0x04=IPv4, 0x06=IPv6) + IP (4/16B) + Port (2B)
  Required for: geo-IP accuracy, rate limiting, IDS, and RADIUS NAS identification
  when the gateway sits behind an edge proxy or Kubernetes service with SNAT.

  Config:
  - service.hexon_edge_protocol = true   → enables HXEP parsing on all listeners
  - service.hexon_edge_cidr = [...]      → only these source CIDRs are trusted for HXEP
    Default: ["0.0.0.0/0", "::/0"] (trust all) — restrict to pod CIDR in production
  - Packets from untrusted CIDRs: HXEP header stripped, socket address used
  - Set automatically via Helm when edge.enabled=true

  Protocols: TCP (parsed on first read, before TLS handshake), UDP (PacketConn wrapper),
  HTTP/3 QUIC (HXEP wrapping applied transparently, GSO/ECN/GRO OOB data preserved).

  Used by: reverse proxy, RADIUS (RADSEC + UDP), SSH bastion, QUIC connector, QUIC client access.
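
A minimal parsing sketch for the header format described above, together with the trusted-CIDR rule. The port byte order is not stated in this document; network byte order (big-endian) is assumed. The function names are hypothetical.

```python
import ipaddress
import struct

MAGIC = b"HXEP"
ADDR_LEN = {0x04: 4, 0x06: 16}  # type byte -> address length in bytes


def parse_hxep(data: bytes):
    """Return (client_ip, client_port, payload), or None if no valid header."""
    if len(data) < 5 or data[:4] != MAGIC:
        return None
    addr_len = ADDR_LEN.get(data[4])
    if addr_len is None:
        return None                      # unknown type byte
    end = 5 + addr_len + 2
    if len(data) < end:
        return None                      # partial header
    ip = ipaddress.ip_address(data[5:5 + addr_len])
    # Assumption: the port is big-endian (network byte order).
    (port,) = struct.unpack("!H", data[5 + addr_len:end])
    return str(ip), port, data[end:]


def resolve_client(data: bytes, socket_addr, trusted_cidrs):
    """Honour the HXEP header only when the sender is in a trusted CIDR."""
    parsed = parse_hxep(data)
    if parsed is None:
        return socket_addr, data
    ip, port, payload = parsed
    src = ipaddress.ip_address(socket_addr[0])
    nets = [ipaddress.ip_network(c) for c in trusted_cidrs]
    if any(src.version == n.version and src in n for n in nets):
        return (ip, port), payload
    return socket_addr, payload          # header stripped, socket address kept


header = (MAGIC + bytes([0x04])
          + ipaddress.ip_address("203.0.113.7").packed
          + struct.pack("!H", 44321))
print(resolve_client(header + b"\x16\x03\x01", ("10.244.1.5", 30000), ["10.244.0.0/16"]))
```

The same packet arriving from a source outside 10.244.0.0/16 would keep the socket address, with the header removed from the payload.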

Hot-reloadable: TLS certificates, mTLS CA pool, proxy mappings, geo/time rules, rate limit settings, fingerprint cache limits. Cold (restart required): listen addresses, HTTP/2 enable, proxy mode toggle, HXEP settings.

Troubleshooting

Common symptoms and diagnostic steps:

TLS handshake failures:

  - Malformed ClientHello blocked: check 'logs search "Malformed TLS"' for details
  - block_malformed_tls=true rejects missing SNI, invalid TLS version, oversized ClientHello
  - ClientHello too large: check max_header_bytes setting (default 64KB)
  - TLS version rejected: only 0x0301-0x0304 (TLS 1.0-1.3) accepted
  - mTLS certificate popup on proxy routes: check per-SNI mTLS config, set mtls=false on mapping
  - CA rotation issues: 'certs list' to verify CA bundle, check 'logs search "CA rotation"'
  - Start with: 'diagnose domain <hostname>' for cross-subsystem check
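
To make the version and size checks concrete, here is a sketch of how the first peeked bytes can be classified from the 5-byte TLS record header. It assumes the check runs on the record header only; the returned labels are hypothetical and only loosely modeled on the metric names in this document.

```python
TLS_HANDSHAKE = 0x16          # TLS record content type for handshake
MAX_RECORD_LEN = 16384        # 2^14 bytes, the TLS plaintext record limit


def classify_first_bytes(head: bytes) -> str:
    """Classify a connection from its first 5 bytes (TLS record header)."""
    if len(head) < 5:
        return "short_read"
    if head[0] != TLS_HANDSHAKE:
        return "non_tls"
    version = int.from_bytes(head[1:3], "big")
    if not 0x0301 <= version <= 0x0304:      # TLS 1.0 through 1.3
        return "malformed_tls"
    length = int.from_bytes(head[3:5], "big")
    if length > MAX_RECORD_LEN:
        return "oversized_record"
    return "ok"


print(classify_first_bytes(b"\x16\x03\x01\x02\x00"))  # plausible ClientHello record
print(classify_first_bytes(b"GET /"))                 # plaintext HTTP on a TLS port
```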

Fingerprint cache exhaustion:

  - High memory from fingerprint storage: check fingerprint_max_entries setting
  - Adaptive TTL kicking in too aggressively: increase fingerprint_ttl_seconds
  - Per-IP abuse: 'logs search "fingerprint limit exceeded"' to identify attackers
  - fingerprint_max_entries_per_ip controls anti-abuse threshold (default: 10)
  - LRU eviction warnings: 'logs search "evict"' to monitor cache pressure
  - Check: 'metrics prometheus fingerprint' for cache utilization metrics

Session affinity not working:

  - Verify cluster_affinity=true in global config
  - Loopback connections (127.0.0.1, ::1) bypass affinity by design
  - Circuit breaker open for target node: 'proxy circuits' to check breaker states
  - No TLS = no fingerprint = no affinity: ensure clients connect via HTTPS
  - Check: 'cluster status' for node health, 'health components' for listener status
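
The conditions in this list can be read as a decision function. The sketch below is purely illustrative: the modulo placement over a sorted node list and the fallback to the local node are assumptions, since this document does not describe the actual mapping algorithm.

```python
import ipaddress


def pick_node(fingerprint, client_ip, nodes, healthy, local):
    """Return the node that should handle this connection."""
    if ipaddress.ip_address(client_ip).is_loopback:
        return local                      # loopback bypasses affinity
    if not fingerprint:
        return local                      # no TLS -> no fingerprint -> no affinity
    # Assumption: simple modulo placement over a sorted node list.
    ordered = sorted(nodes)
    target = ordered[int(fingerprint[:8], 16) % len(ordered)]
    # Assumption: unhealthy or circuit-broken targets are handled locally.
    return target if target in healthy else local


nodes = ["node-c", "node-a", "node-b"]
fp = "0000000a" + "0" * 24               # hypothetical 32-char hex fingerprint
print(pick_node(fp, "198.51.100.9", nodes, set(nodes), "node-a"))
```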

Proxy mode issues (behind CDN/LB):

  - 403 Forbidden: source IP not in proxy_cidr, check 'logs search "CIDR"'
  - 400 Bad Request: missing client IP header, verify proxy_header_clientip config
  - Rate limiting all users as one: JA4 unavailable in proxy mode, use proxy_header_clientfingerprint
  - Wrong client IP: X-Forwarded-For uses FIRST IP only (original client, not proxy chain)
  - Header injection: ensure proxy_cidr is restricted to actual proxy IPs
  - Distributed tracing broken: configure proxy_header_traceid for end-to-end correlation
  - mTLS through proxy: set proxy_header_clientcert and mtls_mode="optional" or "mandatory"

QUIC/HTTP/3 fingerprint failures:

  - Large ClientHello spanning packets: check quic_fingerprint_reassembly_max_packets
  - Reassembly timeout: increase quic_fingerprint_reassembly_timeout_s for slow networks
  - CRYPTO frame offset too large: quic_max_crypto_frame_offset default 64KB should suffice
  - Connection ID too long (>20 bytes): RFC 9000 violation, likely malicious traffic

Rate limiting misbehavior:

  - All clients sharing one rate bucket: check rate_limit_type ("fingerprint" vs "ip")
  - Composite fingerprint unavailable: falls back to IP automatically
  - Per-route bypass not working: verify disable_rate_limit=true on the proxy mapping
  - Cluster-wide consistency: rate limits use distributed memory cache
  - Check: 'ratelimit stats' for current rate limiting state, 'metrics ratelimit' for counters
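
A short sketch of how a rate_limit value such as "100/1m" and the fingerprint-to-IP fallback might be interpreted. The accepted unit suffixes (s, m, h) and the key prefixes are assumptions, not documented behavior.

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600}  # assumed unit suffixes


def parse_rate_limit(spec: str):
    """Parse "100/1m" into (max_requests, window_seconds); None if disabled."""
    if not spec:
        return None                    # empty = disabled
    m = re.fullmatch(r"(\d+)/(\d+)([smh])", spec.strip())
    if m is None:
        raise ValueError(f"invalid rate_limit: {spec!r}")
    count, n, unit = int(m.group(1)), int(m.group(2)), m.group(3)
    return count, n * UNITS[unit]


def rate_limit_key(limit_type: str, fingerprint, client_ip: str) -> str:
    """Pick the bucket key, falling back to IP when no fingerprint exists."""
    if limit_type == "fingerprint" and fingerprint:
        return f"fp:{fingerprint}"
    return f"ip:{client_ip}"


print(parse_rate_limit("100/1m"))
print(rate_limit_key("fingerprint", None, "203.0.113.7"))
```

If every client ends up with the same key, for example because all requests arrive from one proxy address and no fingerprint is available, they share one bucket, which matches the first symptom in the list.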

HXEP (Hexon Edge Protocol) issues:

  - HXEP not resolving real client IP: verify service.hexon_edge_protocol = true
  - Wrong client IP after HXEP: verify source IP falls within service.hexon_edge_cidr
  - "HXEP header stripped": source IP is outside trusted CIDRs — add pod/edge CIDR
  - Geo/rate limiting sees edge proxy IP instead of client: HXEP not enabled or CIDR mismatch
  - RADIUS NAS rejected after HXEP: real NAS IP doesn't match any [[radius.client]] CIDR
  - Default trust-all CIDRs in production: security risk — restrict to actual pod network CIDR
  - Config: 'config show service' and check hexon_edge_protocol + hexon_edge_cidr fields
  - Helm sets HXEP automatically when edge.enabled=true in values.yaml

Connection metrics missing:

  - Metrics batched (flush every 100ms or on close): short-lived connections may lag
  - Check: 'health components' for listener health status
  - 'metrics prometheus listener' for per-listener connection counters

Geo/time restriction issues:

  - Geo blocking wrong country: verify MaxMind database is current
  - Bypass CIDR not working: geo_bypass_cidr checked before country/ASN rules
  - Time window mismatch: verify IANA timezone spelling (e.g., "America/New_York")
  - Overnight ranges supported: "22:00-06:00" spans midnight correctly
  - Check: 'geo lookup <ip>' to verify classification, 'geo timecheck <ip>' for time rules
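
Two of the rules above, the overnight time window and the geo evaluation order (bypass CIDR first, deny before allow, empty allow list meaning all), can be sketched as follows. The exclusive end of the window and the function names are assumptions.

```python
import ipaddress
from datetime import time


def in_window(window: str, now: time) -> bool:
    """True if now falls inside "HH:MM-HH:MM"; handles ranges past midnight."""
    start_s, end_s = window.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    if start <= end:
        return start <= now < end          # same-day range
    return now >= start or now < end       # overnight range, e.g. 22:00-06:00


def geo_allowed(ip, country, allow, deny, bypass_cidrs) -> bool:
    """Bypass CIDRs first, then the deny list, then the allow list."""
    addr = ipaddress.ip_address(ip)
    for cidr in bypass_cidrs:
        net = ipaddress.ip_network(cidr)
        if addr.version == net.version and addr in net:
            return True                    # bypass skips all geo checks
    if country in deny:
        return False                       # deny takes precedence over allow
    return not allow or country in allow   # empty allow list = all countries


print(in_window("22:00-06:00", time(23, 30)))
print(geo_allowed("198.51.100.1", "DE", ["DE", "FR"], ["DE"], []))
```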

Architecture

Connection lifecycle:

  1. Client connects to TCP socket
  2. First bytes peeked to detect TLS, extract JA4 fingerprint + SNI
  3. TCP fingerprint extracted (window size, TTL, MSS, options ordering)
  4. Session affinity check: fingerprint hash maps to a cluster node
  5. If affinity target is a remote node: forward connection to that node
  6. If local: proceed with TLS handshake (per-SNI mTLS selection)
  7. If HTTP/2: extract HTTP/2 fingerprint from SETTINGS frame
  8. Compute composite hash: SHA256(ja4|http2|tcp) truncated to 32 hex chars
  9. Assign correlation ID, begin connection tracking
  10. HTTP middleware chain: telemetry -> client identification -> connection info -> security headers -> geo restriction -> time restriction -> rate limit -> handler
  11. Handler processes request, correlation ID propagates as trace_id across modules
  12. Metrics flushed on connection close

Fingerprint extraction pipeline:

  Accept-level (before TLS): JA4 from ClientHello peek (zero-copy, buffered I/O)
  TLS callback: per-SNI mTLS mode selection
  Post-handshake: HTTP/2 SETTINGS fingerprint from connection preface
  TCP layer: p0f-style OS fingerprint from socket options (window, MSS, TTL)
  QUIC path: JA4Q from Initial packet, transport params fingerprint, multi-packet reassembly

GSO/ECN/GRO preservation:

  All UDP wrappers (HXEP edge protocol and JA4Q fingerprint) preserve kernel offload
  capabilities so that QUIC can use:
  - GSO (Generic Segmentation Offload): send 64KB in one syscall, kernel splits into MTU packets
  - GRO (Generic Receive Offload): kernel coalesces packets, fewer syscalls on receive
  - ECN (Explicit Congestion Notification): congestion signals via IP header bits
  Without these, QUIC silently falls back to one syscall per packet.
  This affects both HTTP/3 reverse proxy and QUIC connector listeners.

Fingerprint memory protection:

  Address fingerprint map: configurable max entries (default 10,000) with adaptive TTL
  Per-IP limit: configurable (default 10), oldest replaced on overflow
  LRU eviction: sorts by timestamp, evicts oldest when TTL cleanup insufficient
  HTTP/2 cache: configurable size with percentage-based LRU eviction (1-50%)
  All maps use lock-free concurrent reads for performance
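
The percentage-based eviction used for the HTTP/2 cache can be illustrated with a small bounded map. This is a single-threaded sketch only; it does not attempt the lock-free reads mentioned above, and the class name is hypothetical.

```python
from collections import OrderedDict


class PercentEvictCache:
    """Bounded map that evicts a percentage of its oldest entries when full."""

    def __init__(self, max_entries: int = 10000, evict_pct: int = 10):
        if not 1 <= evict_pct <= 50:
            raise ValueError("evict_pct must be within 1-50")
        self.max_entries = max_entries
        self.evict_pct = evict_pct
        self._data = OrderedDict()        # least recently used entry first

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)   # refresh recency on update
        elif len(self._data) >= self.max_entries:
            n = max(1, self.max_entries * self.evict_pct // 100)
            for _ in range(n):
                self._data.popitem(last=False)   # drop the oldest entry
        self._data[key] = value

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)       # refresh recency on read
        return self._data[key]

    def __len__(self):
        return len(self._data)


cache = PercentEvictCache(max_entries=10, evict_pct=20)
for i in range(11):
    cache.put(i, str(i))
print(len(cache))
```

Evicting a batch rather than a single entry means the cache does not have to evict again on every subsequent insert while it is under pressure.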

Proxy mode flow:

  Step 1: Validate source IP against configured proxy_cidr
  Step 2: Extract trace ID from proxy header, update correlation context
  Step 3: Extract and sanitize client IP (first IP from comma-separated list)
  Step 4: Fingerprint priority: dedicated header > client cert hash > client IP
  Step 5: Update context with real client identifiers for downstream modules
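
The five steps can be sketched as one function. Config keys mirror the [service] options above; everything else is an assumption made for illustration: headers are a plain case-sensitive dict, the 403 and 400 statuses follow the troubleshooting notes, and the certificate hash is taken as SHA-256 of the header value.

```python
import hashlib
import ipaddress


class ProxyReject(Exception):
    def __init__(self, status: int, reason: str):
        super().__init__(reason)
        self.status = status


def identify_client(source_ip, headers, cfg):
    """Return (client_ip, fingerprint, trace_id) for a proxied request."""
    # Step 1: the connection must come from a trusted proxy.
    src = ipaddress.ip_address(source_ip)
    nets = [ipaddress.ip_network(c) for c in cfg["proxy_cidr"]]
    if not any(src.version == n.version and src in n for n in nets):
        raise ProxyReject(403, "source not in proxy_cidr")
    # Step 2: optional trace ID from the proxy.
    trace_id = headers.get(cfg.get("proxy_header_traceid", ""))
    # Step 3: first IP of the comma-separated list is the original client.
    raw = headers.get(cfg["proxy_header_clientip"])
    if not raw:
        raise ProxyReject(400, "client IP header missing")
    try:
        client_ip = str(ipaddress.ip_address(raw.split(",")[0].strip()))
    except ValueError:
        raise ProxyReject(400, "client IP header invalid")
    # Step 4: dedicated header > client cert hash > client IP.
    fp = headers.get(cfg.get("proxy_header_clientfingerprint", ""))
    if not fp:
        cert = headers.get(cfg.get("proxy_header_clientcert", ""))
        # Assumption: the cert hash is SHA-256 over the raw header value.
        fp = hashlib.sha256(cert.encode()).hexdigest() if cert else client_ip
    # Step 5: the caller stores these values in the request context.
    return client_ip, fp, trace_id


cfg = {
    "proxy_cidr": ["10.0.0.0/8"],
    "proxy_header_clientip": "X-Forwarded-For",
    "proxy_header_clientfingerprint": "CF-Ray",
    "proxy_header_traceid": "X-Request-ID",
}
print(identify_client("10.1.1.1", {"X-Forwarded-For": "203.0.113.7, 10.0.0.2"}, cfg))
```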

mTLS CA rotation flow:

  1. ACME CA rotates, triggers listener update
  2. CA pool rebuilt atomically (config CA + ACME CA merged)
  3. HTTPS listeners gracefully restarted
  4. Existing connections drain gracefully, new connections get fresh CA pool

Graceful shutdown sequence:

  1. Stop accepting new connections on all listeners
  2. Close all listener sockets
  3. Wait for active connections up to configurable timeout
  4. Cancel contexts for remaining connections
  5. Force-close any connections still open after timeout

Performance characteristics:

  - Pooled slice allocations reduce GC pressure during fingerprint extraction
  - Buffered I/O to minimize syscalls
  - Metrics batched to reduce overhead (flush every 100ms)
  - TCP Fast Open: 15-30% latency reduction for repeat clients (Linux 3.7+, macOS)
  - TCP Window Scaling: 20-40% throughput improvement for large transfers
  - SO_REUSEPORT on Linux for load balancing across cores

Relationships

Module dependencies and interactions:

Proxy: Provides per-SNI mTLS lookup. Listener provides fingerprint and client IP context consumed by proxy for rate limiting, identity headers, and session affinity.
Sessions: Listener middleware manages session cookie extraction. Session validation uses correlation IDs propagated through listener context.
Certificates: TLS termination uses certificates from the cert module. Per-mapping certificates loaded via SNI callback. CA pool for mTLS verification rebuilt atomically on ACME CA rotation.
WAF: WAF rules applied in middleware chain after listener accepts connection. Fingerprint available in context for WAF correlation.
X.509 authentication: mTLS mode controls TLS client auth level. In proxy mode, client certificates injected from HTTP header. Certificate validation uses dynamic CA pool.
Rate limiting: Middleware reads composite fingerprint or client IP from context. Composite fingerprint (JA4+HTTP/2+TCP) or IP-based, configurable per route.
Geo restriction: Middleware at router level uses client IP from context with MaxMind GeoLite2 databases for country/ASN lookup.
Time restriction: Middleware after geo restriction uses client country for timezone-aware time window matching.
Cluster affinity: Fingerprint hash selects cluster node for session routing. Node health checked before forwarding. Forwarded connections use inter-node communication for transparent routing.
DNS: Listener does not directly use DNS, but proxy backends resolved via DNS module.
Distributed tracing: Correlation IDs generated at listener level propagate as trace_id through all operations, enabling end-to-end tracing across cluster nodes.
Connection pool: Backend connection management operates downstream of listener. Listener handles inbound connections; connection pool handles outbound to backends.

Encrypted Client Hello (ECH)

Encrypted Client Hello (ECH) hides proxied app SNI behind the service hostname.

Without ECH, a network observer sees which app a user accesses via the plaintext SNI in the TLS ClientHello (e.g., “app.internal.com”). With ECH enabled, the observer only sees the gateway’s service hostname (e.g., “gateway.example.com”). The real hostname is encrypted inside the ClientHello using HPKE (X25519 + HKDF-SHA256 + AES-128-GCM).

Configuration:

  [service]
  ech = true    # Default: false (opt-in)

How it works:

  1. Gateway generates an HPKE key pair (X25519) and ECH config on startup
  2. The ECH config is logged as base64 — publish it in a DNS HTTPS record
  3. Clients that support ECH (Chrome 117+, Firefox 118+, Safari 17.4+) encrypt
     the real SNI in the ClientHello
  4. The gateway decrypts the inner ClientHello using its HPKE private key
  5. GetCertificate receives the decrypted (inner) SNI — certificate selection
     and proxy routing work unchanged
  6. Non-ECH clients connect normally with plaintext SNI (graceful fallback)

The ECH config must be published in a DNS HTTPS record (RR type 65, the HTTPS form of SVCB) for clients to discover ECH support. The gateway logs the config as base64 at startup:

  "ech_config_list_base64": "<base64>" — copy to your DNS HTTPS record
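
As an illustration of the publishing step, the helper below formats a zone-file HTTPS record from the logged base64 value. The owner name, TTL, and base64 string are placeholders, and the exact record your DNS provider expects may differ; treat this as a sketch of the standard presentation format rather than gateway tooling.

```python
import base64


def https_record(owner: str, ech_config_list_b64: str, ttl: int = 300) -> str:
    """Format a zone-file HTTPS record that advertises an ECH config list."""
    # Reject values that are not valid base64 before they reach DNS.
    base64.b64decode(ech_config_list_b64, validate=True)
    name = owner if owner.endswith(".") else owner + "."
    # Priority 1 with target "." means service mode at the owner name itself.
    return f'{name} {ttl} IN HTTPS 1 . ech="{ech_config_list_b64}"'


# Placeholder value; use the string logged by the gateway at startup.
print(https_record("app.example.com", "AEX+DQBB"))
```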

What doesn’t change with ECH:

  - Certificate selection (GetCertificateForSNI receives inner SNI)
  - Proxy routing (uses HTTP Host header, not SNI)
  - JA4 fingerprinting (computed from outer ClientHello)
  - mTLS (client cert validation after ECH decryption)

Limitations:

  - CDN termination: If a CDN terminates TLS before the gateway, ECH at the
    gateway layer has no effect — the CDN already saw the SNI
  - HTTP/3 QUIC: Uses a different ECH mechanism (not covered by this feature)
  - DNS requirement: Without the HTTPS record, clients fall back to plaintext SNI

Logs

Log entries by component. Search with: logs search “listener”. Levels: ERROR > WARN > INFO > DEBUG > TRACE. DEBUG/TRACE require log level configuration.

HTTP Errors:

  listener.http.error           DEBUG/WARN    HTTP server errors (DEBUG for client TLS/connection failures, WARN otherwise)

Proxy Mode:

  listener.proxy_validation     WARN          Rejected connection not from trusted proxy
  listener.proxy_validation     ERROR         Client IP header missing in proxy mode
  listener.proxy_cert           WARN          Oversized cert header (DoS) / parse failed
  listener.proxy_cert           DEBUG/INFO    Client cert injected / invalid PEM block

CORS:

  listener.cors                 WARN   AUDIT  CORS origin rejected

Sessions:

  listener.session              DEBUG         Session created / validated / expired
  listener.session              ERROR/WARN    Session creation/validation failures

Proof-of-Work:

  listener.pow                  INFO          PoW challenge passed / application session valid / body restored
  listener.pow                  WARN          Body too large / session validation failures / invalid body format
  listener.pow                  ERROR         PoW handler not registered / body encryption failures
  listener.pow                  DEBUG         Session checks, challenge served, body stored

Rate Limiting:

  listener.ratelimit            WARN   AUDIT  Request blocked by rate limit
  listener.ratelimit            WARN          Config fallback (invalid rate_limit_type)
  listener.ratelimit            ERROR         Ratelimit module call/response failures / no fingerprint
  listener.ratelimit            DEBUG         Fingerprint fallback to IP
  listener.ratelimit            TRACE         Per-entity rate limiting applied
  listener.ratelimit.status     DEBUG         Rate limit check passed
  listener.ratelimit.circuitbreaker ERROR     Circuit breaker open — blocking request

Size Limiting:

  listener.sizelimit            WARN   AUDIT  Request blocked — size limit exceeded
  listener.sizelimit            ERROR         Sizelimit module call/response failures
  listener.sizelimit            TRACE         Size limit applied / exception / within limit

Compression:

  listener.compression          DEBUG         Response compressed

Geo Restrictions:

  listener.geo                  INFO   AUDIT  Request blocked by geo restriction
  listener.geo                  ERROR         Geo check failed (allowing request)

Time Restrictions:

  listener.time                 INFO   AUDIT  Request blocked by time restriction
  listener.time                 ERROR         Time check failed (allowing request)

ECH (Encrypted Client Hello):

  ech.generate                  INFO          ECH key pair derived from cluster key

PoW Body Preservation:

  pow.body                      DEBUG         POST body stored / retrieved / deleted / restored
  pow.body                      WARN          Body not found (expired) / cleanup failures
  pow.body                      ERROR         Storage / retrieval / decryption failures

Metrics

Prometheus metrics. Query with: metrics prometheus listener_<name>

Lifecycle:

  listener_starts                              counter    {type, name}          Listener startups
  listener_stops                               counter    {type, name}          Listener shutdowns
  listener_restarts                            counter    {type, name}          Listener restarts
  listener_errors                              counter    {type, name}          Listener errors

Rate & Size Limiting:

  listener_rate_limit_hits                     counter    {reason}              Requests blocked by rate limit
  listener_ratelimit_circuit_breaker_trips_total counter  {}                    Circuit breaker trips
  listener_size_limit_hits                     counter    {host, path}          Size limit exceeded

TLS Security:

  listener_connections_accepted                counter    {protocol}            Successful TLS connections
  listener_security_non_tls_dropped            counter    {reason}              Non-TLS connections rejected
  listener_security_malformed_tls              counter    {reason}              Invalid TLS versions
  listener_security_oversized_record           counter    {reason}              TLS records exceeding RFC limits
  listener_security_oversized_clienthello      counter    {reason}              ClientHello too large
  listener_security_small_clienthello          counter    {reason}              Suspiciously small ClientHello
  listener_security_malformed_clienthello      counter    {reason}              Malformed ClientHello
  listener_security_no_sni                     counter    {reason}              TLS handshakes without SNI

QUIC Affinity:

  listener_quic_affinity_packets_received      counter    {}                    QUIC packets received
  listener_quic_affinity_packets_dropped       counter    {reason}              QUIC packets dropped
  listener_quic_affinity_decryption_failures   counter    {}                    QUIC decryption failures
  listener_quic_affinity_packets_local         counter    {}                    QUIC packets processed locally
  listener_quic_affinity_packets_forwarded     counter    {target_node}         QUIC packets forwarded to cluster
  listener_quic_affinity_forward_failures      counter    {target_node}         Forward failures
  listener_quic_affinity_response_dropped      counter    {reason}              QUIC response packets dropped
  listener_quic_affinity_cid_mappings          gauge      {}                    Active connection ID mappings
  listener_quic_connection_migrations          counter    {}                    QUIC connection migrations

QUIC Forwarding:

  listener_quic_forward_connect_errors         counter    {target_node}         Forwarding connect errors
  listener_quic_forward_write_errors           counter    {target_node}         Forwarding write errors
  listener_quic_forward_bytes                  counter    {target_node}         Bytes forwarded

HXEP (Edge Protocol):

  hxep_parsed_trusted                          counter    {}                    TCP HXEP parsed (trusted)
  hxep_stripped_untrusted                      counter    {}                    TCP HXEP stripped (untrusted)
  hxep_parse_failed                            counter    {}                    TCP HXEP parse failures
  hxep_partial_header                          counter    {}                    TCP HXEP incomplete headers
  hxep_udp_parsed_trusted                      counter    {}                    UDP HXEP parsed (trusted)
  hxep_udp_stripped_untrusted                  counter    {}                    UDP HXEP stripped (untrusted)
  hxep_udp_parse_failed                        counter    {}                    UDP HXEP parse failures

Alerts:

  rate(listener_rate_limit_hits[5m]) > 50                   High rate limiting (possible attack)
  listener_ratelimit_circuit_breaker_trips_total > 0        Circuit breaker tripped
  rate(listener_security_no_sni[5m]) > 10                   SNI probing
  rate(hxep_stripped_untrusted[5m]) > 0                     HXEP spoofing attempt
  rate(listener_quic_affinity_forward_failures[5m]) > 0     Cluster QUIC forwarding issues