Cluster & Operations
Admin Unix Socket
Server-side CLI access via Unix domain socket — run admin commands directly on the server without bastion
Overview
The admin socket enables operators to run admin CLI commands directly on the server host without opening a bastion SSH session.
The hexon binary operates in two modes:
- Server mode (default): starts the gateway, creates the Unix socket listener
- Client mode (hexon admin …): connects to the socket, sends a command, renders output
Same binary, same command registry, same execution path as bastion and MCP. Commands are executed as user “root” with source “cli” in the audit trail.
The socket is created at /tmp/hexon-admin.sock (override with HEXON_ADMIN_SOCK). File permissions are 0600 (owner-only read/write). Stale sockets from crashed instances are detected and cleaned up automatically on startup.
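As an illustration of the stale-socket cleanup described above, the following is a minimal sketch that assumes staleness is detected with a plain connect attempt. The helper names `socket_path` and `remove_if_stale` are hypothetical, and the actual startup logic in hexon may differ.

```python
import os
import socket
import stat

DEFAULT_SOCK = "/tmp/hexon-admin.sock"  # default path stated in this section


def socket_path() -> str:
    """Resolve the socket path, honoring the HEXON_ADMIN_SOCK override."""
    return os.environ.get("HEXON_ADMIN_SOCK", DEFAULT_SOCK)


def remove_if_stale(path: str) -> bool:
    """Remove `path` if it is a socket file with no listener behind it.

    Returns True when a stale socket was removed, False otherwise.
    Hypothetical helper: the real server may detect staleness differently.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False  # nothing to clean up
    if not stat.S_ISSOCK(mode):
        return False  # never delete a file that is not a socket
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        # The file exists but nobody is accepting: left behind by a crash.
        os.unlink(path)
        return True
    finally:
        probe.close()
    return False  # a live server answered; leave the socket alone
```

A server following this pattern would call `remove_if_stale(socket_path())` before binding its listener, then restrict the new socket file to mode 0600.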
Usage
Commands that require a running server:
    hexon admin cluster status
    hexon admin proxy health
    hexon admin sessions list --user=alice
    hexon admin --json proxy backends
    hexon admin --cluster health status
Help commands work offline (no running server needed):
    hexon admin              # List all commands
    hexon admin help proxy   # Detailed help for proxy command
Custom socket path:
    HEXON_ADMIN_SOCK=/tmp/hexon.sock hexon admin ping
Exit codes: 0 on success, 1 on error.
Troubleshooting
Common errors:
- “hexon is not running (socket not found at …)” — server not started or different socket path. Check HEXON_ADMIN_SOCK env var.
- “hexon is not running” — socket file exists but connection refused. The previous instance crashed without cleanup. Restart the server.
- “command timed out” — command took longer than 30 seconds. Check server logs.
- Socket permission denied — socket is mode 0600, must run as the same user that started the hexon server (typically root).
The socket is cleaned up on graceful shutdown. If the server crashes, the stale socket is automatically removed on next startup.
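The error cases listed above map onto distinct operating-system errors when a client connects to a Unix socket. The sketch below shows one way to tell them apart. The function name `diagnose`, the returned labels, and the 2-second connect timeout are all assumptions made for illustration, not part of the hexon client.

```python
import socket


def diagnose(path: str, connect_timeout: float = 2.0) -> str:
    """Classify why a connection to the admin socket fails (hypothetical helper)."""
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.settimeout(connect_timeout)
    try:
        probe.connect(path)
    except FileNotFoundError:
        return "socket_not_found"    # server not started, or wrong HEXON_ADMIN_SOCK
    except ConnectionRefusedError:
        return "stale_socket"        # file exists but no server is listening
    except PermissionError:
        return "permission_denied"   # socket is 0600 and owned by another user
    except socket.timeout:
        return "timeout"
    finally:
        probe.close()
    return "ok"
```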
Logs
No structured log entries. A single console message is emitted on startup. Command execution logging is handled by the admin CLI module.
Metrics
No Prometheus metrics emitted by this module. Admin command metrics are handled by the admin CLI module.
Threshold Signing & Cluster Cryptography
Signs certificates and tokens without any single node holding the full private key — quorum-based threshold cryptography
Overview
Signs certificates and OIDC tokens without any single node holding the complete private key. The signing key is split across cluster nodes — a quorum collaborates to produce each signature. Replaces external HSMs with distributed key management built into the gateway.
Threshold signing means that certificates and tokens are signed by a quorum of cluster nodes working together. No single node ever holds the full private key — the key is split into shares, and a minimum number of nodes (the “threshold”) must cooperate to produce a valid signature.
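The arithmetic behind the threshold can be sketched as follows, using the convention found in the configuration and resharing notes of this document: t is the threshold value, t+1 nodes must cooperate to sign, and the automatic default is t = floor(n/2). The function names are hypothetical and only illustrate the numbers.

```python
def default_threshold(n: int) -> int:
    """Automatic threshold for an n-node cluster: floor(n/2)."""
    return n // 2


def signers_required(t: int) -> int:
    """Number of nodes that must cooperate to produce a signature."""
    return t + 1


def tolerated_failures(n: int, t: int) -> int:
    """Nodes that can be offline while signing still works."""
    return n - signers_required(t)


def min_size_after_reshare(old_t: int) -> int:
    """Smallest cluster a single reshare can shrink to (oldThreshold + 1)."""
    return old_t + 1


# Default 3-node cluster: t=1, any 2 nodes sign, 1 failure is tolerated.
t = default_threshold(3)
print(t, signers_required(t), tolerated_failures(3, t))
```

For a 7-node cluster the same functions give t=3 and a floor of 4 nodes after a single reshare, which matches the shrink table in the resharing notes.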
The cluster runs two signing schemes in parallel:
- ECDSA (ES256/ES384/ES512): for EXTERNAL tokens
  Used for: OIDC tokens, Personal Access Tokens (PATs), standard OAuth
  Why: industry-standard algorithms that third-party tools verify natively
- FROST Ed25519: for INTERNAL operations
  Used for: proxy bearer tokens, bastion device codes, internal service auth
  Why: faster signing (~15ms) optimized for high-volume internal operations
These two schemes are not fallbacks for each other — they run in parallel, each serving different consumers. The only brief fallback window is during cluster startup: internal tokens temporarily use ECDSA until FROST key generation completes. This is a few seconds, not a steady-state condition.
Token routing (when signing_algorithm = ES256):
Token Type             | Scheme | Reason
-----------------------|--------|---------------------------------------------
Proxy bearer tokens    | FROST  | Internal: speed, backend verifies via JWKS
Bastion device codes   | FROST  | Internal: bastion authentication
Internal device codes  | FROST  | Internal service callers
Personal Access Tokens | ECDSA  | External: distributed to users
Standard OIDC tokens   | ECDSA  | External: third-party OAuth clients
Quorum model (default: 2-of-3 nodes):
- Any 2 nodes can sign; 1 node alone cannot forge signatures
- 1 node failure = still operational (2 remaining nodes can sign)
- 2 nodes down = signing blocked (quorum lost)
- 1 node compromised = attacker has 1 share, cannot forge (needs 2)
Startup sequence:
1. Nodes perform distributed key generation (DKG) for the ECDSA scheme
2. Once ECDSA is ready, FROST key generation auto-triggers
3. Once both complete, all signing paths are available
During the brief window between steps 1-3, internal tokens use ECDSA. Zero downtime throughout.
When signing_algorithm is EdDSA: everything uses FROST (no dual mode needed).
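The routing rules and the startup fallback described above can be summarized in a small decision function. This is a sketch under stated assumptions: the token type identifiers and the `frost_ready` flag are hypothetical names, ES384 and ES512 are assumed to route like ES256, and FROST readiness in EdDSA mode is not modeled.

```python
from enum import Enum


class Scheme(Enum):
    ECDSA = "ECDSA"
    FROST = "FROST"


# Hypothetical identifiers for the token types in the routing table.
INTERNAL = {"proxy_bearer", "bastion_device_code", "internal_device_code"}
EXTERNAL = {"pat", "oidc"}


def pick_scheme(token_type: str, signing_algorithm: str, frost_ready: bool) -> Scheme:
    """Choose the signing scheme for a token, following the table above."""
    if token_type not in INTERNAL | EXTERNAL:
        raise ValueError(f"unknown token type: {token_type}")
    if signing_algorithm == "EdDSA":
        return Scheme.FROST  # EdDSA mode: everything uses FROST
    if token_type in INTERNAL and frost_ready:
        return Scheme.FROST  # steady state for internal tokens
    # External tokens, or internal tokens during the startup window.
    return Scheme.ECDSA
```

In this sketch `pick_scheme("proxy_bearer", "ES256", frost_ready=False)` returns `Scheme.ECDSA`, which corresponds to the brief window before FROST key generation completes.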
Key management
Key lifecycle and rotation:
Threshold signing keys (both ECDSA and FROST):
- Generated collaboratively by all cluster nodes (distributed key generation)
- No single node ever holds the full private key, only its own share
- Stored encrypted at rest using authenticated encryption derived from cluster_key
- When nodes join or leave, key shares are redistributed while preserving the same public key (external verifiers like JWKS consumers are not affected)
ECDSA key rotation (for OIDC tokens):
- Automatic: triggered when key generation completes or cluster membership changes
- New JWKS entry published; old key retained for verification (grace period)
- Relying parties cache JWKS; existing tokens remain valid until cache refresh
FROST key rotation (for internal tokens):
- Independent from ECDSA rotation (separate lifecycle)
- Auto-triggered when cluster membership changes
- Internal tokens are short-lived, so rotation is seamless with no visible impact
Inter-node encryption:
Nodes communicate over an encrypted channel with forward secrecy. This means each inter-node session uses unique encryption keys, and even if long-term keys were compromised, past communications remain unreadable.
- Encryption keys rotate automatically on a schedule (2PC protocol with quorum)
- Grace period: old + new keys accepted simultaneously during rotation
- Temporary fallback to derived keys if the key exchange is briefly unavailable
- SPK rotation uses publish-before-swap: new bundle published before private key swaps
- Key rotation defers if SPK just rotated (SPK recency guard, 5s window)
- On quorum failure, key rotation retries once after flushing stale bundle caches
- All rotation events emitted as audit entries via OnKeyRotationEvent callback
IMPORTANT (all rotations are automatic):
Certificate rotation, signing key rotation, and encryption key rotation are all handled by background health monitors. Operators do NOT need to set calendar reminders or manually trigger rotations. Only investigate when 'health components' or 'hexdcall status' shows warnings.
Threshold resharing
Cluster scaling under threshold signing — what’s safe and what isn’t:
Resharing rotates the cluster’s share material across a (possibly) different party set while preserving the same group public key. The CA certificate, JWKS public keys, and any downstream verifiers stay valid across the rotation. Both threshold algorithms support resharing:
ECDSA threshold (ca_algorithm=ES256, signing_algorithm=ES256/384/512): GG18
FROST threshold (ca_algorithm=EdDSA, signing_algorithm=EdDSA): FROST
Auto-trigger:
The leader's health monitor watches for cluster-membership changes vs the set of nodes that hold shares in JetStream KV. When they diverge (a node joins or leaves), resharing fires automatically after a short stabilization window (~10s) plus exponential backoff (30s base, 5min cap). Operators do NOT need to manually trigger resharing for normal scaling.
Adding nodes (+N): no upper bound.
You can add any number of nodes. The original cooperating-old subset (t+1 contributors) reshares with their existing share values; joiners receive fresh shares. The CA / JWKS public key stays unchanged.
Removing nodes (-N): bounded by the OLD threshold.
Resharing math requires |cooperating-old| >= oldThreshold + 1, and the cooperating-old set must be a subset of the new party set. So:
    minimum new cluster size = oldThreshold + 1
With the default majority-quorum config (t = floor(n/2) at birth), this means:
Initial cluster | Old threshold (t) | Minimum after shrinking via reshare
----------------|-------------------|--------------------------------------
3 nodes         | 1                 | 2 nodes
5 nodes         | 2                 | 3 nodes
7 nodes         | 3                 | 4 nodes
9 nodes         | 4                 | 5 nodes
Example: a 7-node cluster (t=3) cannot shrink to 3 in a single reshare: it would need 4 cooperating-old shareholders but the new set has only 3 slots. The system rejects this synchronously with a clear "insufficient cooperators" error.
To shrink past the floor:
- Stage the reduction (e.g., 7 → 5 with a lower new threshold → 3 with a still-lower threshold). Each step preserves the CA cert.
- OR perform a hard CA rotation (delete birth metadata + restart), which produces a new CA cert and forces every leaf cert to renew.
Choosing initial threshold with shrinkage in mind:
Lower threshold = more shrinkage room but weaker security (fewer cooperators needed to sign = lower attack threshold). Override via threshold_nodes config at birth time if your cluster needs to scale down significantly.
Resharing failure mode is non-destructive:
If a reshare protocol run fails (cooperator unreachable, KV write failure, byzantine commitment), the manager reverts to its old shares and the CA cert stays valid. Operators see the failure in 'hexdcall status' / health components and can retry manually or wait for the next auto-trigger.
Deterministic signing
Clarification on “deterministic” in cluster crypto context:
The word “deterministic” appears in two contexts — they mean different things:
- AutoTLS deterministic KEY DERIVATION:
  - Private keys derived deterministically from cluster_key for each renewal cycle
  - Ensures all nodes produce the same public key for SPKI pinning
  - ECDSA SIGNATURE nonces remain fully random (standard randomness)
  - This is NOT “deterministic signing”; only the key material is deterministic
  - Security equivalent to random key generation (entropy from cluster_key)
- Threshold ECDSA CA signing:
  - Uses multi-party nonce generation across nodes
  - Each node contributes randomness; combined nonce is unpredictable
  - No single node controls the nonce (distributed trust model)
  - The signing ceremony is interactive (requires t+1 nodes), not deterministic
Neither case reduces cryptographic security. “Deterministic” in these contexts refers to reproducibility (same inputs → same key) or coordination protocol, not to removing randomness from signatures.
When operators ask about “no randomness” or “deterministic signing”:
- AutoTLS: "The leaf certificate keys are derived deterministically for cluster consistency and SPKI pinning. Signatures still use standard randomness."
- Threshold CA: "The signing uses multi-party nonce generation: each node contributes randomness, and the combined nonce is cryptographically secure."
Do NOT present deterministic key derivation as a security concern.
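The distinction between deterministic key derivation and random signature nonces can be illustrated with a short sketch. The construction below (HMAC-SHA256 over a label and a renewal-cycle counter) is an assumption chosen for illustration only. The document does not state which key derivation function AutoTLS uses, and the names `derive_key_seed` and `fresh_nonce` are hypothetical.

```python
import hashlib
import hmac
import secrets


def derive_key_seed(cluster_key: bytes, cycle: int) -> bytes:
    """Derive per-cycle key material reproducibly from the cluster key.

    Assumed construction (HMAC-SHA256); the real derivation may differ.
    """
    info = b"autotls-leaf|" + str(cycle).encode()
    return hmac.new(cluster_key, info, hashlib.sha256).digest()


def fresh_nonce() -> bytes:
    """Signature nonces stay random: a new value on every call."""
    return secrets.token_bytes(32)


key = b"0123456789abcdef0123456789abcdef"  # placeholder 32-byte cluster key
# Two nodes with the same cluster_key and cycle derive identical key material.
print(derive_key_seed(key, 7) == derive_key_seed(key, 7))
# Nonces are independent of the derivation and differ between signatures.
print(fresh_nonce() != fresh_nonce())
```

The point of the sketch is that reproducibility lives entirely in the key material; nothing about the derivation feeds into the per-signature randomness.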
Troubleshooting
Common symptoms and diagnostic steps:
ECDSA threshold signing errors (external tokens not being issued):
- Run 'hexdcall status' to check signer state and health
- State=Active + Health=Healthy → signing should work
- Health=Degraded → at minimum quorum, one more node failure blocks signing
- Health=Unhealthy → cannot sign, check node reachability with 'cluster nodes'
- Run 'hexdcall threshold test' to verify end-to-end signing
FROST signing errors (internal tokens failing, e.g. proxy bearer or device codes):
- Check FROST state and health in 'hexdcall status'
- FROST key generation runs after ECDSA completes. If FROST shows Idle but ECDSA is Active, FROST key generation has not triggered yet
- Internal tokens fall back to ECDSA while FROST initializes (no outage)
- Run 'hexdcall threshold test --trace' for detailed phase-level timing
Key generation not completing:
Key generation (DKG) is the process where nodes collaboratively create a shared signing key. It requires all participating nodes to be reachable.
- Check 'cluster nodes': all expected nodes must be online
- Key generation requires the inter-node encryption channel to be healthy
- Check for membership mismatches: all nodes must agree on the participant set
- Rolling restarts are handled gracefully: key generation is not re-triggered unnecessarily
Inter-node encryption issues:
Nodes encrypt all cluster communication using a forward-secret key exchange.
- Low key pool → automatic replenishment triggers (usually self-healing)
- Key exchange failures → check NATS JetStream connectivity ('cluster status')
- Signature verification failures → possible clock skew between nodes (check NTP)
- During degradation, non-critical operations are deferred and auto-retry on recovery
- Key rotation audit events: search 'logs tail --audit' for module=keyrotation
  Events: initiated, deferred, commit_all, commit_quorum, retry, aborted, completed, activated, abort_received, spk_completed, spk_failed
  - "deferred" = SPK recency guard fired (normal when SPK and key rotation intervals match)
  - "retry" = first PREPARE attempt failed, retried after bundle cache refresh
  - "commit_quorum" = some nodes missed PREPARE, committed with partial ACKs
Interpreting ‘hexdcall status’ output:
ECDSA: Active/Healthy + FROST: Active/Healthy → optimal state (all signing paths available)
ECDSA: Active/Healthy + FROST: Idle → FROST key generation pending, internal tokens use ECDSA
ECDSA: Active/Degraded → at minimum quorum, lost fault tolerance margin; monitor closely
ECDSA: DKG → key generation in progress, signing not yet available
Inter-node encryption: Healthy → encrypted communication between nodes is nominal
Monitoring thresholds for CA certificate:
>90 days until expiry: HEALTHY (normal; renewal is automatic)
20-90 days: INFO (approaching renewal window; still automatic)
5-20 days: WARN (renewal should have happened; check logs)
<5 days: CRITICAL (rotation may have failed; investigate immediately)
Diagnostic commands:
'hexdcall status'         - Signing health, key generation state, inter-node encryption
'hexdcall threshold test' - End-to-end ECDSA signing test
'cluster nodes'           - List cluster nodes and reachability
'cluster status'          - Overall cluster health including NATS connectivity
'health components'       - All system components with health status
Relationships
Module dependencies and interactions:
- OIDC provider: Consumes ECDSA threshold signer for JWT signing (ES256/384/512). JWKS endpoint serves the threshold public key. When signing_algorithm changes, new DKG runs and JWKS updates.
- OIDC provider (internal tokens): Uses FROST signer for proxy bearer tokens and device codes. Falls back to ECDSA if FROST is not yet ready.
- ACME CA (threshold mode): When acme_ca_threshold=true, CA signing uses the ECDSA threshold scheme. Quorum of nodes must cooperate to issue certificates.
- Bastion: Device code authentication uses FROST-signed tokens (internal path).
- Proxy: Bearer token minting uses FROST for low-latency signing.
- X3DH: Forward secrecy for DKG messages and key rotation coordination. Threshold signing uses a dedicated encrypted data plane, separate from X3DH.
- NATS JetStream: Persistent storage for DKG state, key shares, and PreKey bundles.
- Health monitor: Periodically computes signer health from peer reachability. Auto-triggers FROST DKG when ECDSA is Active. Detects membership mismatches.
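The health monitor entry above says signer health is computed from peer reachability. A minimal sketch of such a classification follows. It assumes that the reachable count includes the local node, that Degraded means exactly t+1 nodes are reachable, and that Unhealthy means fewer than t+1. The function name `signer_health` and these exact cut-offs are assumptions, inferred from the troubleshooting descriptions in this document rather than taken from the implementation.

```python
def signer_health(reachable: int, t: int) -> str:
    """Classify signer health from the number of reachable share holders.

    reachable: nodes currently reachable, including the local node (assumed)
    t:         threshold value; t+1 nodes must cooperate to sign
    """
    needed = t + 1
    if reachable < needed:
        return "Unhealthy"  # quorum lost, cannot sign
    if reachable == needed:
        return "Degraded"   # at minimum quorum, no fault tolerance left
    return "Healthy"


# Default 2-of-3 cluster (t=1) as nodes drop out one by one.
for up in (3, 2, 1):
    print(up, signer_health(up, t=1))
```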
Logs
Log entries emitted by this module. Search with: logs search “threshold” / “keyrotation” / “hexdcall” / “ca.” Levels: ERROR > WARN > INFO > DEBUG > TRACE. Note: The bridge module IS the logging infrastructure — it provides bridge.Log() and the hexdcall Logger adapter. The entries below are emitted by bridge code itself via bridge.Log(telemetry.LogEntry(…)). The hexdcall Logger adapter (GetLogger()) routes hexdcall-internal logs through telemetry but those are hexdcall entries, not bridge entries.
Threshold State Changes:
threshold    INFO   AUDIT  Threshold signing ready
threshold    WARN   AUDIT  Threshold signing unavailable
threshold    WARN   AUDIT  Threshold signing degraded
threshold    INFO   AUDIT  DKG initiated
threshold    INFO   AUDIT  DKG complete
threshold    ERROR  AUDIT  DKG failed
threshold    ERROR  AUDIT  DKG timed out
threshold    ERROR  AUDIT  CRITICAL: DKG failed after max retries
threshold    INFO   AUDIT  Threshold share persisted to KV
threshold    WARN   AUDIT  Corrupt threshold share deleted
threshold    ERROR  AUDIT  Threshold signing failed
threshold    ERROR  AUDIT  Threshold signing timed out
threshold    INFO   AUDIT  Threshold CA birth complete
threshold    INFO   AUDIT  CA resharing initiated
threshold    INFO   AUDIT  CA resharing complete
threshold    ERROR  AUDIT  CA resharing failed
threshold    ERROR  AUDIT  CA resharing timed out
threshold    ERROR  AUDIT  CRITICAL: CA public key changed during resharing
threshold    INFO   AUDIT  Threshold share migration pending
threshold    ERROR  AUDIT  TSS replay attack detected
threshold    ERROR  AUDIT  TSS envelope signature verification failed
threshold    ERROR  AUDIT  TSS mandatory signature missing
Key Rotation Events:
keyrotation  ERROR  AUDIT  Key rotation aborted
keyrotation  ERROR  AUDIT  Key rotation spk_failed
keyrotation  WARN   AUDIT  Key rotation retry
keyrotation  WARN   AUDIT  Key rotation commit_quorum
keyrotation  WARN   AUDIT  Key rotation abort_received
keyrotation  INFO   AUDIT  Key rotation <event> (initiated, deferred, commit_all, completed, activated, spk_completed)
Hexon Readiness:
hexdcall     INFO   AUDIT  HexonReady: All subsystems operational - Hexon is ready to serve traffic
CA Module (GetCABundle):
ca.getcabundle       ERROR         Failed to get ACME CA bundle
ca.getcabundle       DEBUG         ACME CA bundle retrieved successfully
CA Module (SignCertificate):
ca.signcertificate   WARN          Certificate template is required
ca.signcertificate   WARN          Public key DER is required
ca.signcertificate   WARN          Failed to parse public key DER
ca.signcertificate   ERROR         Failed to sign certificate with ACME CA
ca.signcertificate   INFO   AUDIT  Certificate signed successfully with ACME CA
CA Module (SignCRL):
ca.signcrl           WARN          CRL number is required
ca.signcrl           WARN          CRL number must be positive
ca.signcrl           WARN          NextUpdate must be after ThisUpdate
ca.signcrl           ERROR         Failed to sign CRL with ACME CA
ca.signcrl           INFO   AUDIT  CRL signed successfully with ACME CA
CA Module (SignOCSPResponse):
ca.signocspresponse  WARN          Serial number is required
ca.signocspresponse  WARN          Serial number must be positive
ca.signocspresponse  WARN          Invalid OCSP status
ca.signocspresponse  WARN          NextUpdate must be after ThisUpdate
ca.signocspresponse  ERROR         Failed to sign OCSP response with ACME CA
ca.signocspresponse  INFO   AUDIT  OCSP response signed successfully with ACME CA
Metrics
No Prometheus metrics are emitted by the bridge module. The bridge provides infrastructure (bridge.Log, hexdcall Logger adapter) but does not itself emit counters, gauges, or latency metrics.
Configuration System
One TOML file defines the entire gateway — hot-reload, env overrides, GitOps, and Kubernetes CRDs
Overview
Defines the entire gateway in one TOML file — every module, every route, every policy. Supports hot-reload without restart, environment variable overrides, GitOps via git repository, and Kubernetes CRDs. Multiple config sources with a well-defined precedence order:
1. Default values (security-focused, applied automatically)
2. TOML literal values (single file, directory of files, or Git repository)
3. ${VAR} template substitution in TOML (arbitrary env var names, pre-parse)
4. HEXON_* auto-computed overrides (post-parse, highest priority)
Key capabilities:
- Thread-safe access with atomic reads and mutex-protected writes
- Hot-reload with SHA256 change detection, callback throttling (default 100ms window), section caching, and delta change logging
- Environment variable overrides for all fields including array items: HEXON_SECTION_KEY for singletons, HEXON_SECTION_ARRAY_<NAME>_KEY for array items, with automatic type conversion (string, int, bool, comma-separated arrays)
- ${VAR} template substitution: embed arbitrary env var names in TOML values, expanded pre-parse. Operators choose their own naming convention.
- GitOps: clone from Git repo (HTTPS or SSH), automatic polling with cluster-aware leader-only execution, multi-TOML file merge
- Directory-based config: pass a directory path, all *.toml files merged recursively in alphabetical order (maps merge, arrays concatenate, scalars last-wins)
- Self-documenting schema: struct tags (desc, hint, default, min, max, enum, format, example, required, sensitive, rfc, depends) drive runtime documentation
- Config diff history: ring buffer (default 10 entries) tracking per-key old/new values, exposed via “config diff” admin CLI command
- Invalid config handling: hash-based dedup prevents retry storms, logs every 5 minutes
- File deletion handling: service continues with last valid config, ALERT logged, status set to “file_missing” for health check visibility
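The HEXON_* naming and type-conversion rules mentioned in the list above can be sketched as follows. The function names are hypothetical, and two details are assumptions: leading or trailing underscores produced by sanitization are not trimmed, and whitespace around list items is stripped. Use 'config describe <section>' for the authoritative variable names.

```python
import re
from typing import List, Optional


def sanitize_item(name: str) -> str:
    """Uppercase, map non-alphanumerics to '_', and collapse repeated '_'."""
    return re.sub(r"[^A-Z0-9]+", "_", name.upper())


def env_name(section: str, key: str,
             array: Optional[str] = None, item: Optional[str] = None) -> str:
    """Compute the HEXON_* override name for a singleton or array-item field."""
    parts = [section]
    if array is not None and item is not None:
        parts += [array, sanitize_item(item)]
    parts.append(key)
    # Dots in TOML paths become underscores; everything is uppercased.
    return "HEXON_" + "_".join(p.upper().replace(".", "_") for p in parts)


def parse_bool(raw: str) -> bool:
    """Accept true/false, 1/0, yes/no (case-insensitive)."""
    value = raw.strip().lower()
    if value in ("true", "1", "yes"):
        return True
    if value in ("false", "0", "no"):
        return False
    raise ValueError(f"not a boolean: {raw!r}")


def parse_list(raw: str) -> List[str]:
    """Split a comma-separated override into array items."""
    return [part.strip() for part in raw.split(",")]
```

With these helpers, `env_name("authentication", "clientsecret", array="oidc.clients", item="MyApp")` yields `HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET`, the example used elsewhere in this document.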
Configuration is organized into domain-specific sections:
Service, Telemetry, Cluster, Operations, Protection, Authentication, Filesystem
The config package is imported by virtually every component in the system. It has no dependencies on other gateway modules (only standard library + go-toml/v2).
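The ${VAR} template substitution listed among the capabilities above runs on the TOML source before parsing. The sketch below follows the rules given in this document: braces are required, names match [a-zA-Z_][a-zA-Z0-9_]*, unset variables are left as-is, and expansion is not recursive. The function name `expand` and the optional `env` argument are illustrative assumptions.

```python
import os
import re
from typing import Mapping, Optional

# Only the braced form is recognized; a bare $VAR is left untouched.
_VAR = re.compile(r"\$\{([a-zA-Z_][a-zA-Z0-9_]*)\}")


def expand(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Expand ${VAR} references in TOML source text in a single pass."""
    values = os.environ if env is None else env

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        # Unset variables keep their ${NAME} placeholder unchanged.
        return values[name] if name in values else match.group(0)

    # One pass only: substituted text is never rescanned, so there is
    # no recursive expansion.
    return _VAR.sub(replace, text)


source = 'clientsecret = "${VAULT_OIDC_SECRET}"'
print(expand(source, {"VAULT_OIDC_SECRET": "s3cret"}))
```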
Config
Configuration is loaded from TOML files. Default path: /tmp/hexon.toml
    [service]
    hostname = "auth.example.com"           # Public hostname (required)
    port = 443                              # HTTPS listen port (required)
    public_port = 8443                      # Public-facing port for URL generation (behind NAT/LB)
    tls_cert = "/path/to/cert.pem"          # TLS certificate (file path or inline PEM)
    tls_key = "/path/to/key.pem"            # TLS private key (file path or inline PEM)
    read_timeout = 30                       # HTTP read timeout in seconds (default: 30)
    write_timeout = 30                      # HTTP write timeout in seconds (default: 30)
    idle_timeout = 120                      # HTTP idle timeout in seconds (default: 120)
    max_header_bytes = 65536                # Max header size in bytes (default: 65536)
    http2_enable = true                     # Enable HTTP/2 (default: true)
    handshake_timeout = 10                  # TLS handshake timeout in seconds (default: 10)
    block_malformed_tls = true              # Reject invalid TLS (default: true)
    mtls_mode = "none"                      # mTLS mode: "none", "optional", "mandatory"
    x509_auto_auth = true                   # Auto-authenticate with client certificate (default: true)
    hot_reload_enabled = true               # Enable automatic file watching (default: true)
    hot_reload_poll_interval = "1s"         # File polling interval (default: 1s)
    hot_reload_callback_throttle = "100ms"  # Callback throttle window (default: 100ms)

    [telemetry]
    log_level = "info"                      # trace|debug|info|warn|error|fatal (default: info)
    log_format = "json"                     # json|human (default: json)
    output = "stdout"                       # stdout|otlp|both (default: stdout)
    otlp_endpoint = "otel-collector:4317"   # Required when output is otlp or both
    log_buffer_size = 10000                 # Ring buffer for log queries (default: 10000)

    [cluster]
    cluster_mode = true                     # Enable clustering (default: false)
    cluster_peers = ["10.0.0.2", "10.0.0.3"]  # Static peers (IPs or hostnames)
    cluster_dns = "hexon.cluster.local"     # OR DNS-based discovery (ignored when cluster_peers is set)
    cluster_key = "32-char-secret"          # Cluster key, exactly 32 chars (required)
    cluster_refresh = "15s"                 # Peer refresh interval (default: 15s)
    threshold_required = false              # Fail-closed threshold signing after bootstrap grace (default: false)
    threshold_bootstrap_grace = "2m"        # Grace period for DKG completion (default: 2m)
    threshold_nodes = 0                     # Threshold t value: 0=auto (n/2), explicit integer for override
Environment variable overrides (three layers):
Precedence: HEXON_* override > ${VAR} expansion > TOML literal > defaults
HEXON_* auto-computed overrides (post-parse, highest priority):
  Singleton fields:
    HEXON_<SECTION>_<KEY>=value    # e.g., HEXON_SERVICE_PORT=8443
  Array item fields:
    HEXON_<SECTION>_<ARRAY>_<ITEMNAME>_<KEY>=value    # e.g., HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET=secret
  Item names are sanitized: uppercased, non-alphanumeric → underscore, collapsed.
  Only existing items (defined in TOML) can be overridden.
  Use 'config describe <section>' to see the exact env var for each field.
${VAR} template substitution (pre-parse, in TOML source):
    clientsecret = "${VAULT_OIDC_SECRET}"    # Arbitrary env var names
  Pattern: ${VARNAME}. Unset vars are left as-is; no recursive expansion.
Type conversion: string, int, bool (true/false/1/0/yes/no), arrays (comma-separated)
GitOps environment variables:
    CONFIG_GIT_REPO                      # Repository URL (HTTPS or SSH, required for GitOps)
    CONFIG_GIT_BRANCH                    # Branch name (required for GitOps)
    CONFIG_GIT_PATH                      # Local clone path (default: /tmp/hexon-config)
    CONFIG_GIT_POLLING                   # Enable remote polling (default: false)
    CONFIG_GIT_POLLING_TIME              # Polling interval (default: 5m, min: 30s)
    CONFIG_GIT_USER / CONFIG_GIT_TOKEN   # HTTPS authentication
    CONFIG_GIT_SSH_KEY                   # SSH private key (inline PEM or file path)
Directory-based config:
Pass a directory path instead of a file: --config /etc/hexon/conf.d/
All *.toml files are merged recursively in alphabetical order.
Merge: maps merge recursively, arrays concatenate, scalars last-wins.
Use numeric prefixes for ordering: 00-base.toml, 90-overrides.toml.
World-writable files (permission bit 0002 set) are rejected for security.
Config diff history:
    config_diff_history_enabled = true    # Enable/disable diff storage (default: true)
    config_diff_history_size = 10         # Max entries retained, range 1-100 (default: 10)
Hot-reloadable: all config values via Get(). Application code must handle changes.
Cold (restart required): listener bind address/port, TLS certificate paths at startup.
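The directory merge rules described above (maps merge recursively, arrays concatenate, scalars last-wins, files applied in alphabetical order) can be sketched on already-parsed TOML documents. The function names are hypothetical, and Python's default string sort stands in for the gateway's alphabetical ordering, which is an assumption.

```python
from typing import Any, Dict


def merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Merge `overlay` into a copy of `base` using the directory merge rules."""
    result = dict(base)
    for key, value in overlay.items():
        current = result.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            result[key] = merge(current, value)  # maps merge recursively
        elif isinstance(current, list) and isinstance(value, list):
            result[key] = current + value        # arrays concatenate
        else:
            result[key] = value                  # scalars: last file wins
    return result


def merge_files(parsed: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Fold parsed files together in path order (keys are file paths)."""
    result: Dict[str, Any] = {}
    for path in sorted(parsed):
        result = merge(result, parsed[path])
    return result


files = {
    "90-overrides.toml": {"service": {"port": 8443},
                          "proxy": {"mappings": [{"app": "b"}]}},
    "00-base.toml": {"service": {"hostname": "auth.example.com", "port": 443},
                     "proxy": {"mappings": [{"app": "a"}]}},
}
print(merge_files(files))
```

In this example the port from 90-overrides.toml replaces the base value, while the two proxy.mappings arrays are combined rather than overridden.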
Troubleshooting
Common symptoms and diagnostic steps:
Config file not loading at startup:
- TOML syntax error: check error message for line number, validate with 'config validate'
- Missing required fields: hostname, port, tls_cert, tls_key must be present
- Invalid CIDR notation: check proxy_cidr, ip_whitelist, ip_blacklist format
- World-writable file: chmod to remove 0002 permission from TOML files
Environment variable overrides not applying:
- Check naming: HEXON_<SECTION>_<KEY> in uppercase (e.g., HEXON_SERVICE_PORT)
- Dots become underscores: HEXON_SERVICE_HEXON_EDGE_CIDR for [service] hexon_edge_cidr
- Boolean values: accepts true/false, 1/0, yes/no (case-insensitive)
- Arrays: comma-separated (HEXON_SERVICE_PROXY_CIDR=10.0.0.0/8,172.16.0.0/12)
- Array items: item must exist in TOML first; env var uses sanitized name (e.g., HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET for client "MyApp")
- ${VAR} not expanding: variable must be set (os.LookupEnv), pattern must use braces (${VAR} not $VAR), name must match [a-zA-Z_][a-zA-Z0-9_]*
- Use 'config describe <section>' to see the exact env var name for each field
- Check active overrides: 'config env' shows all HEXON_* variables in effect
Hot-reload not detecting changes:
- File hash unchanged: hot-reload uses SHA256, not mtime
- Throttle window: rapid changes coalesce within 100ms window
- Check status: 'config diff' for recent changes
- Callback timeout: callbacks exceeding 30s are logged at WARN
- hot_reload_enabled=false: file watching is disabled entirely
Config file deleted while running:
- Service continues with last valid config (graceful degradation)
- ALERT logged immediately, reminder every 5 minutes
- Status set to "file_missing" visible in 'health components'
- When file is restored, normal operation resumes automatically
GitOps config not syncing:
- Repository credentials: verify CONFIG_GIT_USER/TOKEN or CONFIG_GIT_SSH_KEY
- Polling disabled: CONFIG_GIT_POLLING must be "true" for automatic updates
- Cluster leader-only: in cluster mode, only the leader node polls Git
- Multi-file merge: check logs for "[CONFIG] Multi-file mode:" to verify
Directory config merge issues:
- File order: alphabetical by full path, use numeric prefixes (00-base, 90-overrides)
- Scalar override: later files win
- Array concatenation: proxy.mappings from multiple files combine (not override)
- Only *.toml files included, rename to .disabled or .bak to exclude
Threshold signing issues:
- threshold_required=true but tokens return 503 after bootstrap grace: DKG did not complete in time. Check 'status summary' for threshold state, 'logs search threshold' for DKG errors. Ensure cluster_mode=true and ≥2 nodes.
- Threshold signing not activating: requires cluster_mode=true, ≥2 nodes, X3DH healthy. Check 'health components' for x3dh status.
- Re-DKG not triggering after node join/leave: stale route timeout is 5 minutes, then 10s stabilization. Wait ~5m10s after membership change.
- threshold_nodes: 0 = auto (floor(n/2)), explicit value sets t directly. t+1 nodes must cooperate to sign. With t=1 and n=2, both nodes required.
Relationships
Module dependencies and interactions:
- listener: Consumes service config for TLS settings, bind address, port, HTTP/2 parameters, handshake timeout. Listener reads config via Get() on startup and handles hot-reload for certificate rotation.
- cluster: Config changes propagate to all nodes via cluster broadcast. GitOps polling runs on the cluster leader only.
- telemetry: Reads log_level, log_format, output, otlp_endpoint. log_buffer_size controls ring buffer for admin CLI log queries.
- protection: Rate limiting, PoW, IP whitelist/blacklist settings all loaded from [protection] section. Hot-reloadable for threshold tuning without restart.
- authentication: All auth backend configuration (LDAP, OIDC, TOTP, WebAuthn, x509) loaded from [authentication] sub-sections.
- Git config sync: Handles CONFIG_GIT_* env vars, repository cloning, SSH/HTTPS auth, multi-file merge, and polling coordination.
- Hot reload: Infrastructure module that manages file watching lifecycle, callback registration, and reload orchestration.
- proxy: Reverse proxy mappings, load balancer, circuit breaker settings from [proxy] section.
- threshold signing: [cluster] threshold_required, threshold_bootstrap_grace, threshold_nodes control the threshold signing subsystem (GG18 ECDSA / FROST Ed25519). The algorithm is driven by [authentication.oidc] signing_algorithm. Config is cold (restart required). The threshold signing subsystem consumes these values at startup.
- admin CLI: ‘config show’, ‘config describe’, ‘config example’, ‘config set’, ‘config diff’, ‘config validate’, ‘config env’ commands for operational visibility and management.
- schema: Self-documenting system driven by struct tags. Schema extraction produces field metadata, description formatting, TOML example generation, and auto-computed env var names for operator-facing output. Each field shows its HEXON_* env var in ‘config describe’. The config guide MCP resource is generated from this schema data.
Logs
This module does not emit structured log entries. All config output goes to process stdout/stderr as console messages.
Console output categories:
startup and reload:
    fmt.Printf "[CONFIG] Warning: Failed to start hot-reload system: %v"
    fmt.Printf "[CONFIG] Loading configuration from directory: %s"
    fmt.Printf "[%s] %s" (license periodic check callback)
    fmt.Fprintf "[CONFIG] DEPRECATED: %s is deprecated — %s"
    fmt.Fprintf "[CONFIG] Warning: %s: expected %s, got %s — %s" (type mismatch auto-correction)
cross-module validation:
    fmt.Fprintf "[CONFIG] WARNING: signin.magiclink.enabled=true but SMTP is not configured — magic link disabled"
    fmt.Fprintf "[CONFIG] INFO: auto-enabling authentication.devicecode (required by signin.magiclink)"
git clone and metadata:
    fmt.Printf "[CONFIG] Git TLS config: ..."
    fmt.Printf "[CONFIG] Loading configuration from git repository: %s (branch: %s)"
    fmt.Printf "[CONFIG] Git configuration loaded successfully: ..."
    fmt.Printf "[CONFIG] Warning: Failed to extract git metadata: %v"
    fmt.Printf "[CONFIG] Using HTTP basic authentication"
    fmt.Printf "[CONFIG] Using SSH authentication"
file watching and reload (via logHotReloadEvent helper):
    fmt.Printf "[CONFIG-HOTRELOAD] Hot reload system started"
    fmt.Printf "[CONFIG-HOTRELOAD] Hot reload system stopped"
    fmt.Printf "[CONFIG-HOTRELOAD] Config file changed, triggering reload"
    fmt.Printf "[CONFIG-HOTRELOAD] Config reload successful"
    fmt.Printf "[CONFIG-HOTRELOAD] Config reload failed - keeping previous config"
    fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config file deleted - running with last valid config"
    fmt.Printf "[CONFIG-HOTRELOAD] Config file restored"
    fmt.Printf "[CONFIG-HOTRELOAD] Config still invalid - not retrying same broken config"
    fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config parse failure"
    fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config validation failure"
    fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config file missing"
    fmt.Printf "[CONFIG-HOTRELOAD] Config reload triggered by cluster broadcast"
    fmt.Printf "[CONFIG-HOTRELOAD] Config reload from cluster successful"
    fmt.Printf "[CONFIG-HOTRELOAD] Config reload from cluster failed"
    fmt.Printf "[CONFIG-HOTRELOAD] Cluster notified of config reload"
    fmt.Printf "[CONFIG-HOTRELOAD] Config changes detected"
    fmt.Printf "[CONFIG-HOTRELOAD] Config reloaded with no detected changes"
    fmt.Printf "[CONFIG-HOTRELOAD] Config callback panicked"
    fmt.Printf "[CONFIG-HOTRELOAD] WARN: Legacy config callback timed out (goroutine leaked)"
    fmt.Printf "[CONFIG-HOTRELOAD] WARN: Config callback timed out (context cancelled)"
    fmt.Printf "[CONFIG-HOTRELOAD] WARN: Context-aware callback not respecting cancellation"
    fmt.Printf "[CONFIG-HOTRELOAD] Config cache invalidated"
    fmt.Printf "[CONFIG-HOTRELOAD] Hot reload configuration optimized"
None of these are queryable via ‘logs search’. They appear only in process stdout/stderr. The infrastructure/hotreload module wraps some of this via hexdcall manager logger (slog).
Metrics
No Prometheus metrics are emitted directly by this module.
Reload counters are available via the health system:
- Reload attempts, successes, failures
- Parse errors, validation errors, file-not-found errors
- Callback timeouts, callback duration, last reload duration
Query reload status: health components | config status
Git Configuration Management
Pulls configuration from a git repository — GitOps workflow with automatic cluster-wide reload on changes
Overview
Synchronizes the gateway configuration from a git repository — every change auditable through git history. The leader polls for changes, validates the configuration, and broadcasts a reload to all cluster nodes. Supports SSH and HTTPS with PAT authentication, webhook-triggered pulls, and automatic rollback on invalid config.
Core capabilities:
- Leader-only git repository polling (prevents duplicate change detection)
- Cluster-wide reload to all members on change detection
- Hard reset to remote HEAD for deterministic config state
- Commit tracking with hash, author, message, and timestamp
- Quorum wait for cluster-wide consistency confirmation
- Integration with config hot-reload pipeline for seamless updates
Cluster synchronization flow:
1. Leader node polls git repository at configured interval
2. When changes detected, leader pulls and applies config locally
3. Leader notifies all cluster members to pull the latest config
4. Each member pulls latest git config and triggers hot-reload
5. Quorum wait ensures cluster-wide consistency
The module provides GitOps-style configuration management where infrastructure teams push configuration changes to a git repository, and the cluster automatically picks up and applies those changes. This enables:
- Version-controlled configuration with full audit trail
- Pull request review workflows for config changes
- Rollback capability via git revert
- Branch-based staging/production config separation
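The leader-side ordering described above (compare commit SHAs, reload locally, notify members only on success) can be sketched in Go. This is a minimal illustration, not the gateway's implementation: pollOnce, syncResult, errInvalid, and the apply/broadcast hooks are hypothetical names introduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalid stands in for any validation or parse error (hypothetical).
var errInvalid = errors.New("invalid config")

// syncResult records what a single poll cycle did.
type syncResult struct {
	Pulled      bool // remote SHA differed from the local SHA
	Applied     bool // local reload succeeded on the leader
	Broadcasted bool // cluster members were notified
}

// pollOnce runs one poll cycle on the leader.
// apply and broadcast are illustrative hooks, not HexonGateway APIs.
func pollOnce(localSHA, remoteSHA string, apply func() error, broadcast func()) syncResult {
	var r syncResult
	if localSHA == remoteSHA {
		return r // no change: commit SHAs are compared, not file timestamps
	}
	r.Pulled = true
	if err := apply(); err != nil {
		return r // leader reload failed: do not notify the cluster
	}
	r.Applied = true
	broadcast() // local reload happens first, then the notification
	r.Broadcasted = true
	return r
}

func main() {
	ok := func() error { return nil }
	bad := func() error { return errInvalid }
	notify := func() {}

	fmt.Printf("%+v\n", pollOnce("abc", "abc", ok, notify))  // nothing to do
	fmt.Printf("%+v\n", pollOnce("abc", "def", ok, notify))  // pull, apply, notify
	fmt.Printf("%+v\n", pollOnce("abc", "def", bad, notify)) // pull only
}
```

The third call shows the protective behavior: a config that fails on the leader is never announced to members.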
Leadership determines which node polls the repository:
- Only the cluster leader runs the git polling loop
- If leadership changes, the new leader automatically starts polling
- In standalone mode, the single node polls directly
Config
Git configuration is managed under the [config] section in hexon.toml:
    [config]
    # Git repository settings
    git_enabled = true                      # Enable git-based config management
    git_repo = "/etc/hexon/config.git"      # Local path to git repository
    git_remote = "origin"                   # Git remote name (default: origin)
    git_branch = "main"                     # Branch to track (default: main)
    git_poll_interval = "30s"               # Polling interval (default: 30s)
    # Authentication
    git_ssh_key = "/etc/hexon/deploy.key"   # SSH key for git authentication
    git_username = ""                       # Username for HTTPS auth (optional)
    git_password = ""                       # Password for HTTPS auth (optional)
    # Directory-based config
    config_dir = "/etc/hexon/config.d"      # Directory for split config files
    merge_strategy = "deep"                 # How to merge directory configs
The git repository should contain the hexon.toml (or split config files) at the repository root. The module performs a hard reset to the remote branch HEAD on each pull, ensuring deterministic state regardless of local modifications.
Polling behavior:
- Only the cluster leader polls the git repository
- Poll interval determines change detection latency
- SHA comparison detects changes (not file timestamps)
- On detection, local reload happens first, then broadcast
Hot-reloadable: git_poll_interval.
Cold (restart required): git_enabled, git_repo, git_remote, git_branch, git_ssh_key, git_username, git_password.
Architecture
Operational model and design:
Pull operation details:
Each successful pull reports: commit hash, commit author, commit message, and pull timestamp. These are visible in structured logs and health status for auditing which config version is active on each node.
Operational model:
The module is passive on member nodes: it responds to cluster-wide pull notifications by performing a local git pull and triggering config reload. The active polling runs only on the leader node, which detects changes and initiates the cluster-wide pull.
Leader election dependency:
The module relies on the cluster's leader election mechanism. Only the elected leader runs the git polling loop. If leadership changes, the new leader automatically starts polling. This prevents duplicate pulls and conflicting notification storms.
Troubleshooting
Common symptoms and diagnostic steps:
Config changes in git not being applied:
- Verify git_enabled = true in [config] section
- Check if this node is the leader: cluster status shows leader node
- Verify git remote is accessible: net tcp <git_host:port>
- Check git_poll_interval (default 30s); changes may be within latency
- Look for git pull errors in logs: logs search "gitconfig" --level=error
- Verify branch name matches: git_branch must match remote branch
Authentication failures (git pull fails):
- SSH: verify git_ssh_key path exists and has correct permissions (0600)
- SSH: check host key is in known_hosts for the git server
- HTTPS: verify git_username and git_password are correct
- HTTPS: check if token has expired (for token-based auth)
- Look for auth errors: logs search "git" --level=error
Cluster members out of sync:
- Check cluster health: cluster status shows all nodes
- Verify pull delivery: logs search "gitconfig" on member nodes
- Member pull failure is local only; check individual node logs
- Force sync: trigger a manual git push (any change) to cause re-poll
- Check quorum: if quorum lost, broadcast may not reach all members
Config validation failure after pull:
- Invalid TOML in repository causes reload failure
- Leader reload failure prevents broadcast (protects cluster)
- Member reload failure logged locally, does not affect other nodes
- Check: config validate to verify current config
- Check git log for the problematic commit
Hard reset behavior:
- The module performs git reset --hard to remote HEAD
- Local modifications to the config file are overwritten
- This is intentional: git is the source of truth
- If local changes are needed, commit them to the repository
Standalone mode (no cluster):
- Git polling runs on the single node directly
- No broadcast occurs (no cluster to notify)
- Config reload happens locally after pull
- Suitable for development and single-node deployments
Relationships
Module dependencies and interactions:
- config: Primary integration point. The config system performs the actual git fetch and hard reset. Config hot-reload pipeline processes the updated TOML after pull.
- cluster: Leader election determines which node runs the git polling loop. Cluster-wide notification delivers the pull signal to all members. Quorum wait (optional) ensures cluster-wide consistency.
- Hot reload: Complementary module — gitconfig handles git-based config changes while hot reload handles file-based config changes. Both trigger the same cluster-wide reload pipeline.
- telemetry: Structured logging for pull operations with commit hash, author, and success/failure status. Metrics for pull frequency and latency.
Logs
No structured log entries. A console message is emitted on successful pull (commit hash, author, message — not queryable via logs search).
Related logs from other modules:
- config: logs git fetch, hard reset, and reload results
- cluster: logs broadcast delivery to member nodes
Metrics
This module does not emit its own Prometheus metrics.
Observability is provided indirectly through dependent modules:
- config: metrics for config reload success/failure and reload latency
- cluster: metrics for broadcast delivery and quorum wait
Hot Reload
Applies configuration changes without restart — leader detects file changes, broadcasts reload to all nodes
Overview
Applies configuration changes to the entire cluster without restarting any node. The leader watches for config file changes, reloads locally, and broadcasts the update to all cluster members. Most settings take effect immediately — a few require restart (documented per module).
Core capabilities:
- Leader-only file watching (prevents duplicate change detection across cluster)
- SHA256 hash comparison for reliable change detection (1-second poll interval)
- Cluster-wide reload notification to all members after leader detects changes
- Graceful degradation to standalone mode (single node, no coordination)
- Atomic config swap with validation before apply
- Independent node recovery (each node can recover on next poll or restart)
Cluster reload flow:
1. Leader's file watcher polls config file every 1 second
2. SHA256 hash computed and compared to previous hash
3. On change: leader re-reads config, validates, applies defaults
4. Atomic config swap on leader node
5. On success: leader notifies all cluster members to reload
6. Each member independently re-reads file, validates, and swaps config
7. Notification is best-effort (local success is sufficient)
Standalone mode:
When running as a single node or when cluster coordination is not initialized, every node watches and reloads independently. No broadcast occurs. This mode provides backward compatibility for development environments, single-node deployments, and testing scenarios.
Error handling philosophy (best effort):
- Leader reload success: always broadcast to cluster
- Leader reload failure: do NOT propagate (protect cluster from bad config)
- Member reload failure: logged locally, does not affect other nodes
- Cluster propagation failure: logged, local reload already succeeded
Config
Hot reload is an infrastructure module that watches the main config file. Its behavior is controlled by the overall config system rather than a dedicated config section.
The file watcher monitors the main hexon.toml config file path. The poll interval is fixed at 1 second for responsive change detection without excessive I/O overhead.
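Content-hash change detection of this kind can be sketched in a few lines of Go. The detector type and its method are hypothetical names used only for illustration; the real watcher's internals are not documented here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns the hex-encoded SHA256 digest of data.
func contentHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// detector remembers the hash of the last content it saw (hypothetical type).
type detector struct{ last string }

// changed reports whether data differs from the previously seen content
// and records the new hash when it does.
func (d *detector) changed(data []byte) bool {
	h := contentHash(data)
	if h == d.last {
		return false // same bytes: a touched mtime alone never triggers a reload
	}
	d.last = h
	return true
}

func main() {
	d := &detector{last: contentHash([]byte("port = 8080\n"))}
	fmt.Println(d.changed([]byte("port = 8080\n"))) // false: content unchanged
	fmt.Println(d.changed([]byte("port = 9090\n"))) // true: content changed
	fmt.Println(d.changed([]byte("port = 9090\n"))) // false: already seen
}
```

In a polling watcher, changed would be called once per tick with the bytes read from the config file, so a rewrite with identical content is ignored.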
Key behaviors:
- File watcher only runs on the cluster leader node
- SHA256 hash comparison avoids false-positive reloads from timestamp changes
- Config validation occurs before applying changes (fail-safe)
- Invalid config is rejected; previous config remains active
- Atomic swap ensures no partial config state is visible to readers
Leadership determines which node watches the config file:
- Only the cluster leader runs the file watcher
- If leadership changes, the new leader automatically starts watching
- In standalone mode, every node watches independently
The config system also exposes reload status and metrics for health checks and monitoring.
Hot-reloadable fields vary by module. Each module documents which of its config fields support hot-reload vs. requiring a restart. The hotreload module itself has no user-configurable settings.
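The validate-then-swap behavior (invalid config rejected, previous config kept, version bumped only on success) can be illustrated with Go's sync/atomic package. Config, validate, and store below are placeholders invented for this sketch, and the validation rule is arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Config is a stand-in for the real gateway config (hypothetical).
type Config struct {
	ListenAddr string
}

// validate is a placeholder check; the real validation rules are not shown here.
func validate(c *Config) error {
	if c.ListenAddr == "" {
		return errors.New("listen address must not be empty")
	}
	return nil
}

// store holds the active config and a reload version counter.
type store struct {
	current atomic.Pointer[Config]
	version atomic.Int64
}

// reload validates next before publishing it. On failure the previous
// config stays active and the version is left unchanged.
func (s *store) reload(next *Config) error {
	if err := validate(next); err != nil {
		return err
	}
	s.current.Store(next) // readers see either the old or the new pointer, never a mix
	s.version.Add(1)
	return nil
}

func main() {
	s := &store{}
	_ = s.reload(&Config{ListenAddr: ":8443"})
	err := s.reload(&Config{}) // rejected by validate
	fmt.Println(err != nil, s.current.Load().ListenAddr, s.version.Load())
	// prints: true :8443 1
}
```

Readers call s.current.Load() and work with that snapshot, which is why a half-applied config is never visible to them.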
Architecture
Operational model and design:
Config version tracking:
Each successful reload increments a version counter. This allows health checks and monitoring tools to detect whether a node is on the latest config by comparing version numbers. The version, reload timestamp, and any error message are exposed via health status.
File watching approach:
The file watcher uses a polling approach (not inotify/kqueue) for maximum portability across Linux, macOS, and container environments. The 1-second poll interval provides a good balance between responsiveness and overhead. SHA256 hashing is more reliable than mtime/ctime comparison, which can produce false positives with NFS or container volume mounts.
Separation from gitconfig:
hotreload handles direct file modifications (edit, cp, mount update). gitconfig handles git repository-based changes (git pull, merge). Both trigger the same config reload pipeline but through different detection mechanisms. They complement each other:
- Use gitconfig for GitOps workflows with version control
- Use hotreload for direct file modifications or mounted config maps
Troubleshooting
Common symptoms and diagnostic steps:
Config changes not being picked up:
- Verify this node is the cluster leader: cluster status
- In standalone mode, every node watches independently
- Check if file was actually modified: SHA256 hash must change
- Editing in place (vi, nano) changes hash; truncate+write may race
- NFS/mount delays: file may not be visible for up to 1 second
- Check logs for reload attempts: logs search "reload" --level=info
Reload fails with validation error:
- Invalid TOML syntax: config validate to check current file
- Missing required fields after edit
- Leader detects failure and does NOT broadcast to cluster
- Fix the config file; watcher will detect next change automatically
- Check error details: logs search "reload" --level=error
Cluster members not reloading:
- Check cluster connectivity: cluster status and health status
- Verify reload delivery: logs search "reload" on member nodes
- Member failure is independent: check individual node logs
- Network partition: members reload on next local file change or restart
- Quorum issues: cluster-wide reload requires quorum for delivery
Reload succeeded but feature not updated:
- Not all config fields are hot-reloadable
- Check module documentation for which fields require restart
- Cold fields (e.g., listen ports, TLS certs) need full restart
- Verify config version incremented: health status shows config version
Standalone mode issues:
- No broadcast occurs in standalone mode (expected behavior)
- Each node watches independently when cluster is not initialized
- Verify the config file path is correct and accessible
- File permissions: process must have read access to config file
File watcher consuming resources:
- SHA256 computation on large config files is negligible
- 1-second poll interval is fixed and not configurable
- For very large configs (rare), hashing overhead is still minimal
- If concerned, monitor CPU via metrics prometheus "cpu"
Relationships
Module dependencies and interactions:
- config: Primary integration point. The config system owns file reading, TOML parsing, validation, and atomic swap logic. Reload status and metrics are exposed for health checks.
- cluster: Leader election determines which node runs the file watcher. Cluster-wide notification delivers the reload signal to all members. Leadership changes automatically transfer the file watching responsibility.
- GitOps config: Complementary module for git-based config changes. Both modules trigger the same config reload pipeline. gitconfig is for GitOps workflows; hotreload is for direct file modifications.
- All modules with hot-reloadable config: When reload occurs, each module receives updated config via their registered reload callbacks. Modules include firewall (ACL rules), proxy (mappings), ratelimit (limits), forwardproxy (rate/bandwidth), and many others.
- telemetry: Structured logging for reload events with success/failure status, config version, and timing. Metrics for reload frequency and duration.
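How registered reload callbacks might be invoked so that one panicking module does not block the others can be sketched as below. This is an assumption-laden illustration: registry, callback, and safeCall are hypothetical names, and callback timeouts (which the console messages also mention) are not modeled.

```go
package main

import "fmt"

// callback pairs a module name with its reload hook (hypothetical shape).
type callback struct {
	module string
	fn     func(version int)
}

// registry keeps callbacks in registration order.
type registry struct{ callbacks []callback }

// register adds a reload hook for a module.
func (r *registry) register(module string, fn func(version int)) {
	r.callbacks = append(r.callbacks, callback{module: module, fn: fn})
}

// safeCall runs one callback and converts a panic into an error.
func safeCall(cb callback, version int) (err error) {
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("callback %s panicked: %v", cb.module, p)
		}
	}()
	cb.fn(version)
	return nil
}

// notify invokes every callback and returns the modules whose hooks panicked.
func (r *registry) notify(version int) []string {
	var failed []string
	for _, cb := range r.callbacks {
		if err := safeCall(cb, version); err != nil {
			failed = append(failed, cb.module)
		}
	}
	return failed
}

func main() {
	r := &registry{}
	r.register("proxy", func(v int) { fmt.Println("proxy reloaded, version", v) })
	r.register("firewall", func(v int) { panic("bad ACL rule") })
	r.register("ratelimit", func(v int) { fmt.Println("ratelimit reloaded, version", v) })
	fmt.Println("failed:", r.notify(2))
}
```

Here the firewall hook panics, yet the ratelimit hook still runs and the failure is reported by module name.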
Logs
No structured log entries. A single console message is emitted on initialization.
Related logs from other modules:
- config: logs file watcher start/stop, hash comparison, reload success/failure
- cluster: logs broadcast delivery to member nodes
Metrics
This module does not emit its own Prometheus metrics.
Observability is provided indirectly through dependent modules:
- config: metrics for config reload success/failure, reload latency, and config version
- cluster: metrics for broadcast delivery and quorum wait
Kubernetes CRD Configuration
Kubernetes-native configuration via Custom Resource Definitions with bootstrap reconciliation, live watching, and status feedback
Overview
HexonGateway supports Kubernetes-native configuration through Custom Resource Definitions (CRDs). When running in Kubernetes, operators can manage gateway configuration using standard kubectl commands instead of (or alongside) TOML files.
The system defines 55 CRD types covering every configuration section:
- Service, cluster, telemetry, health, DNS, SMTP, filesystem, memory
- Proxy mappings, connection pools, TCP proxy, forward proxy, subrequest
- Authentication: OIDC clients, auth flows, signup flows
- Identity: LDAP, OIDC providers, SCIM providers
- Protection: WAF config, WAF rules, firewall rules/aliases, rate limiting
- Infrastructure: bastion, SQL bastion, SSH certificates, port forwarding, connector, client access
- Certificates: ACME CA server, ACME client
- Operations: admin, MCP, LLM, playbooks, webhooks, SPIFFE, RADIUS
- Observability: log intelligence, notifications
CRDs are optional: the gateway runs identically on VMs, Docker, or Kubernetes using TOML configuration. CRDs provide a Kubernetes-native alternative that integrates with GitOps tools like ArgoCD and Flux.
All CRDs belong to the config.hexon.io API group with v1alpha1 version. Namespaced scope — instances live in the hexon-system namespace by default.
Config
CRD Lifecycle:
1. Bootstrap: On first start, the cluster leader creates CRD instances from the running config (TOML + env overrides + defaults merged). Each instance is labeled config.hexon.io/origin: bootstrap.
2. Pruning: Bootstrap-owned array CRDs no longer in config are automatically deleted, along with their companion Secrets. Operator-owned CRDs are never touched. This ensures TOML deletions propagate to Kubernetes.
3. Watching: Informers watch for CRD changes via the Kubernetes API. Changes are debounced (500ms window) to batch rapid edits.
4. Apply: CRD spec is converted to the internal config struct, validated, and applied atomically. Config reload callbacks fire for all modules.
5. Status: Each CRD instance gets status conditions reflecting apply success/failure.
Example (create a proxy mapping):
    apiVersion: config.hexon.io/v1alpha1
    kind: HexonProxy
    metadata:
      name: dashboard
      namespace: hexon-system
    spec:
      hostname: dashboard.example.com
      target: http://dashboard-service:3000
      auth_type: oidc
Sensitive fields (SecretKeyRef):
Sensitive config fields (certificates, private keys, passwords, API secrets, RADIUS shared secrets, OIDC client secrets) are never stored in CRD specs. Instead, they are stored in companion Kubernetes Secrets and referenced via SecretKeyRef entries in the CRD spec:
    spec:
      apiKey:
        name: hexon-hexonproxies-dashboard   # Secret name
        key: apiKey                          # Key within the Secret
- Empty sensitive fields (e.g., no custom certificate) produce no Secret. The field stays empty and the gateway uses its default (e.g., wildcard cert).
- Non-empty fields are stored in a companion Secret named hexon-<plural>-<instance> (e.g., hexon-hexonproxies-dashboard).
- Operators can reference any Secret they create, not only those following the bootstrap naming convention.
- RBAC: The gateway pod needs get/list/create/update/delete on core Secrets.
Ownership model:
- Bootstrap-created CRDs and companion Secrets have label: config.hexon.io/origin: bootstrap
- Remove the label to "take ownership"; bootstrap will no longer overwrite
- Operator-created CRDs (no label) are never modified or deleted by bootstrap
- Bootstrap-owned array CRDs removed from config are pruned on next restart
Singleton vs Array CRDs:
- Singleton: one instance named "default" (e.g., HexonClusterConfig, HexonDNSConfig)
- Array: multiple instances, name derived from config (e.g., HexonProxy per mapping)
Resource naming:
- K8s resource names are sanitized from config names (lowercased, spaces/underscores to dashes, special chars to dashes, max 253 chars). Example: config app "Kubernetes / Production" becomes resource name "kubernetes---production".
- The original config name is preserved in the CRD spec (e.g., spec.app for proxies).
- The "crd show" command accepts either the K8s resource name or the original config name.
- Use "crd list <kind>" to see both resource names and config names side by side.
Status conditions:
Every CRD instance reports an "Applied" condition:
    Applied=True   reason=ConfigValid       config applied successfully
    Applied=False  reason=ApplyError        config apply failed
    Applied=False  reason=ConversionError   CRD-to-config conversion failed
Check status: kubectl get hexonproxies -o wide
The "Applied" printer column shows the current phase (Ready/Error).
Troubleshooting
CRD instances not being created on startup:
- Only the cluster leader runs bootstrap reconciliation
- Check logs for "bootstrap reconciliation complete" message
- Verify CRD definitions are installed: kubectl get crd | grep hexon
CRD changes not applying:
- Changes are debounced with a 500ms window; wait briefly
- Check status conditions: kubectl describe <crd-kind> <name>
- Look for Applied=False with reason and message
- Verify RBAC: the gateway pod needs get/list/watch/create/update/patch permissions on all config.hexon.io resources and their status subresources
Status shows Applied=False reason=ConversionError:
- The CRD spec doesn't match the expected config structure
- Check field names match TOML keys (snake_case in spec)
- Verify enum values are valid (e.g., auth_type must be a recognized method)
Bootstrap keeps overwriting my changes:
- Remove the config.hexon.io/origin label from the CRD instance: kubectl label hexonproxy <name> config.hexon.io/origin-
- Once the label is removed, bootstrap treats it as operator-owned and skips it
- Do the same for companion Secrets if you want to manage them independently
Sensitive field shows empty after CRD apply:
- Check the companion Secret exists: kubectl get secret hexon-<plural>-<name>
- Verify the Secret has the expected key: kubectl get secret <name> -o jsonpath='{.data}'
- Check RBAC allows Secret read: the gateway pod needs get on core/v1 secrets
- If the Secret was manually deleted, restart the gateway to recreate it via bootstrap
CRD still exists after removing mapping from TOML:
- Bootstrap prunes only CRDs with the config.hexon.io/origin: bootstrap label
- If the label was removed (operator-owned), delete it manually: kubectl delete hexonproxy <name>
Config export for migration:
- Use the admin CLI: config export
- Exports running config as multi-document YAML CRD manifests
- Filter by section: config export proxy
- JSON format: config export --format=json
- Only available when running in Kubernetes
Relationships
Module dependencies and interactions:
- Configuration system: CRD changes are applied to the same config store used by TOML and environment variables. All modules see changes via the standard config reload mechanism. CRDs have the same precedence as TOML; environment variables still override.
- Cluster coordination: Bootstrap reconciliation runs on the cluster leader only. Config changes from CRDs propagate to all nodes via the standard config reload broadcast (NATS-based).
- Admin CLI: The “config export” command generates CRD YAML from running config, enabling migration from TOML to Kubernetes-native management. When using “config export --apply”, companion Secrets are created for sensitive fields (without the bootstrap label, so they are operator-owned). The “config show” and “config describe” commands work regardless of config source.
- Helm chart: CRDs are distributed separately from the Helm chart. Install CRDs first, then deploy the chart. This avoids Helm’s CRD lifecycle limitations (no update on upgrade, deletion on uninstall).
- CI/CD integration: CRD manifests are versioned and compatible with ArgoCD, Flux, and other GitOps tools. The all-in-one bundle (hexon-crds.yaml) contains all CRD definitions.
- Codegen tool: CRD YAML manifests are generated from Go struct tags using the build tool (build-crd.sh). OpenAPI v3 schemas include validation constraints derived from struct tags (required, enum, min, max, default, desc).
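The resource naming rule described under Config (lowercase, spaces/underscores and special characters to dashes, at most 253 characters) can be approximated with the following Go sketch. sanitizeName is a hypothetical helper, and the exact set of characters the gateway treats as "special" is an assumption: here everything outside a-z, 0-9, "-", and "." becomes a dash.

```go
package main

import (
	"fmt"
	"strings"
)

// maxNameLen is the documented upper bound for a resource name.
const maxNameLen = 253

// sanitizeName derives a Kubernetes-style resource name from a config name.
// Assumption: any rune outside [a-z0-9-.] is replaced by a dash.
func sanitizeName(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9', r == '-', r == '.':
			b.WriteRune(r)
		default:
			b.WriteByte('-')
		}
	}
	out := b.String() // ASCII only at this point, so byte slicing is safe
	if len(out) > maxNameLen {
		out = out[:maxNameLen]
	}
	return out
}

func main() {
	fmt.Println(sanitizeName("Kubernetes / Production")) // kubernetes---production
	fmt.Println(sanitizeName("my_app v2"))               // my-app-v2
}
```

The first call reproduces the documented example: the two spaces and the slash each become one dash, giving three consecutive dashes.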
Logs
Log entries for Kubernetes operations. No AUDIT entries — all operational/debug. Levels: ERROR > WARN > INFO > DEBUG.
CRD Definition Management:
  CRD definition ensure failed  WARN   Schema ensure error for a single CRD kind
  CRD definition created        INFO   New CRD definition created in cluster
  CRD definition updated        INFO   Existing CRD definition updated with new schema
  CRD definitions ensured       INFO   Summary: created/updated/unchanged counts for all CRDs
Manager Lifecycle:
  CRD auto-apply failed, using existing definitions     WARN   CRD ensure failed (RBAC or network); continues with existing
  starting K8s CRD informers                            INFO   Informer startup with namespace and CRD count
  K8s API watch interrupted, will retry                 WARN   Transient network error on watch stream (auto-retries)
  K8s API watch failed                                  ERROR  Non-network watch error (permissions, API server issue)
  failed to set watch error handler                     WARN   Could not install custom watch error handler
  informer cache sync failed                            WARN   Individual informer cache did not sync
  K8s informers synced                                  INFO   All informer caches synced, ready to process events
  K8s manager stopped                                   INFO   Manager shutdown complete
  K8s manager restarting after CRD definitions applied  INFO   Manager restart after CRD sync timeout recovery
Config Apply:
  failed to convert CRD to config                         ERROR  UnstructuredToConfig failed for a CRD change
  skipping CRD change with unresolved sensitive fields    DEBUG  SecretKeyRef not yet populated, skip to avoid empty overwrite
  failed to apply singleton change                        ERROR  Config mutation failed for singleton CRD
  failed to apply array change                            ERROR  Config mutation failed for array/map CRD item
  failed to apply delete                                  ERROR  Config deletion failed for array/map item
  CRD config validation failed, reload skipped            ERROR  Config.Validate() failed after applying CRD changes
  applied CRD config changes                              INFO   Config updated from CRD changes with apply/skip/error counts
  all CRD changes matched current config, reload skipped  DEBUG  All CRD changes identical to running config
Bootstrap Reconciliation:
  bootstrap singleton failed                      ERROR  Failed to reconcile a singleton CRD from config
  bootstrap array failed                          ERROR  Failed to reconcile an array CRD type from config
  bootstrap reconciliation complete               INFO   Summary: created/updated/skipped/pruned counts
  bootstrap array item failed                     ERROR  Failed to create/update a single array item CRD
  bootstrap map item failed                       ERROR  Failed to create/update a single map-keyed CRD
  failed to prune bootstrap CRD                   ERROR  Could not delete orphaned bootstrap-owned CRD
  pruned bootstrap CRD removed from config        INFO   Deleted bootstrap CRD no longer in TOML config
  failed to delete companion Secret during prune  WARN   Companion Secret cleanup failed during CRD prune
  failed to create companion Secret               ERROR  Could not create K8s Secret for sensitive fields
Secrets:
  created companion Secret for CRD              INFO   New K8s Secret created for sensitive fields
  updated companion Secret for CRD              DEBUG  Existing K8s Secret updated with new sensitive data
  failed to resolve Secret for sensitive field  WARN   Could not read SecretKeyRef value from K8s Secret
Status:
  status update: failed to write status  WARN   Could not write status condition to CRD instance
Health Sync:
  health status synced                           INFO   Health status written to CRD resources (with update count)
  cluster status sync: failed to get resource    WARN   Could not read cluster CRD for status update
  cluster status sync: failed to write status    WARN   Could not write leader/nodes/health to cluster CRD
  connector status sync: failed to get resource  WARN   Could not read connector site CRD for status update
  connector status sync: failed to write status  WARN   Could not write rich status to connector site CRD
  health sync: failed to get resource            WARN   Could not read CRD resource for health update
  health sync: failed to write status            WARN   Could not write health field to CRD resource
Resource Apply:
  CRD resource created  INFO   CRD instance created via CLI apply
  CRD resource updated  INFO   CRD instance updated via CLI apply (may include ownership transfer)
Watcher:
  unexpected object type in informer event  WARN   Informer delivered non-Unstructured object
Metrics
Prometheus metrics. Query with: metrics prometheus k8s_<name>
Reconciliation:
  k8s_reconciliations_total  counter  {result}  Config-to-CRD reconciliation cycles
    result=success | failure
Health Sync:
  k8s_health_syncs_total  counter  {result}  Periodic health status writes to CRD .status
    result=success | failure
CRD Operations:
  k8s_crd_operations_total  counter  {operation, result}  Individual CRD operations
    operation=ensure_definition, result=created | updated | failure
    operation=status_write, result=success | failure
Alerts:
  rate(k8s_reconciliations_total{result="failure"}[5m]) > 0  Reconciliation failing
  rate(k8s_health_syncs_total{result="failure"}[5m]) > 0     Health sync failing
AI Assistant
Built-in AI-powered natural language interface for gateway operations via the bastion shell
Overview
The AI assistant enables natural language interaction with all gateway admin tools through the bastion shell’s “ai” command. It shares the same tool set and execution path as MCP, ensuring identical tool visibility, read/write enforcement, metrics, and audit logging.
Capabilities:
- Tool execution: Runs any admin CLI command via an agentic loop. The AI reads tool results, reasons about them, and decides what to run next. Read-only commands execute automatically. Write operations pause for interactive operator approval in the SSH session.
- Multi-provider support: Works with Anthropic (Claude), OpenAI (GPT-4), Azure OpenAI, Google Gemini, and Ollama/vLLM for local models. Provider auto-detected from the API URL or set explicitly.
- Conversation context: Maintains per-session conversation history so follow-up questions build on prior answers. Operators can set session context hints and the AI sees recent shell commands for awareness.
- Background monitoring: The schedule_task tool runs commands periodically in the background. Results appear between shell prompts. Operators manage tasks with "task list", "task stop".
- Inline monitoring loops: The sleep tool pauses the AI within its reasoning loop, then resumes with full context. Enables "check health, wait 30s, check again, compare, report changes" patterns. Each sleep extends the tool-calling budget so monitoring does not fight the per-query round limit. Governed by max_sleep_duration (default 5m per call) and max_sleeps_per_query (default 60 iterations). Ctrl+C interrupts immediately.
- Cluster knowledge base: Persistent cross-session memory for operational insights and rules. The AI learns from investigations and applies that knowledge in future sessions.
- Prompt caching: Anthropic provider supports prompt caching (5m or 1h TTL) to reduce token costs on repeated interactions.
Configuration: [llm] section with api_url, api_key, model, required_groups. Enable in bastion with [bastion] use_llm = true. Per-user custom instructions via moduledata or config.
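The interaction between the per-query round limit (max_tool_rounds, default 15, covered under Safety) and the sleep tool can be pictured with a small Go model. This is a sketch under a stated assumption: the documentation says each sleep extends the tool-calling budget but not by how much, so the model grants exactly one extra round per sleep. The budget type and its methods are hypothetical.

```go
package main

import "fmt"

// budget tracks per-query limits using the documented defaults.
type budget struct {
	roundsLeft int // max_tool_rounds, default 15
	sleepsLeft int // max_sleeps_per_query, default 60
}

// newBudget returns a budget initialized with the default limits.
func newBudget() *budget { return &budget{roundsLeft: 15, sleepsLeft: 60} }

// useRound consumes one tool round and reports false once the limit is hit.
func (b *budget) useRound() bool {
	if b.roundsLeft == 0 {
		return false
	}
	b.roundsLeft--
	return true
}

// sleep consumes one sleep and, by assumption, grants one extra round.
func (b *budget) sleep() bool {
	if b.sleepsLeft == 0 {
		return false
	}
	b.sleepsLeft--
	b.roundsLeft++
	return true
}

func main() {
	b := newBudget()
	checks := 0
	// Monitoring loop: run a check, then sleep, until a limit stops it.
	for b.useRound() {
		checks++
		if !b.sleep() {
			break
		}
	}
	fmt.Println("checks performed:", checks)
}
```

Under this model the loop is bounded by the sleep limit instead of the 15-round limit, which matches the stated intent that monitoring should not fight the round budget.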
Safety
Multiple layers prevent runaway AI behavior:
- Tool round limit: max_tool_rounds (default 15) caps the number of reasoning cycles per query. Sleep calls extend this budget so monitoring loops get additional rounds.
- Write operation limit: max_write_ops_per_query (default 3) caps mutations per query. The AI cannot retry failing write commands with slight variations.
- Sleep guardrails: max_sleep_duration (default 5m) caps individual pauses. max_sleeps_per_query (default 60) caps total iterations. Token cost on each wake-up naturally limits runaway loops.
- Failed operation dedup: Commands that fail are tracked by operation key. The AI cannot re-execute the same failing command.
- RBAC: required_groups restricts which operators can use AI features. allowed_commands whitelist limits which tools the AI can call.
- Interactive approval: Write operations prompt the operator for y/n confirmation in the SSH session before execution.
- Audit trail: All AI interactions logged with distributed tracing. Sensitive data redacted by default.
- Rate limiting: Per-user query rate limit (default 10/1m) prevents excessive API usage.
Security
Multiple defense layers protect the AI assistant:
- RBAC - required_groups restricts which operators can use AI features. Only operators in the configured groups can access the "ai" command.
- Command whitelist - allowed_commands limits which tools the AI can call. Operators cannot override this from within the AI session.
- Write protection - Read-only commands execute automatically. Write operations pause for interactive operator approval (y/n) in the SSH session before execution. Cannot be overridden by the AI.
- Rate limiting - Per-user query rate limit (default 10/1m) prevents excessive API usage and token cost accumulation.
- Audit trail - All AI interactions logged with distributed tracing. Sensitive data redacted by default (redact_sensitive = true).
- Runaway prevention - Tool round limit (default 15), write operation limit (default 3), sleep guardrails (5m per call, 60 iterations max), and failed operation dedup all prevent excessive token consumption.
Troubleshooting
Common symptoms and diagnostic steps:
AI command not available in bastion:
- Verify [bastion] use_llm = true in config
- Verify [llm] section is configured with api_url, api_key, model
- Check required_groups: operator must be in one of the listed groups
- Check: 'config show llm' to verify configuration
AI returns errors or empty responses:
- Check API connectivity: verify api_url is reachable from the gateway
- Check API key validity: invalid keys produce authentication errors
- Check provider detection: auto-detect uses api_url hostname, set provider explicitly if using a proxy or non-standard endpoint
- Check: 'logs search llm --since=5m' for API errors
Write operations not being approved:
- Write ops require interactive SSH session (not available via MCP)
- Operator must respond y/n to the approval prompt
- max_write_ops_per_query (default 3) may be exhausted for the query
- Check allowed_commands whitelist if specific commands are blocked
AI stops responding mid-conversation:
- max_tool_rounds (default 15) reached: increase if needed for complex queries, but be aware of token cost implications
- Sleep monitoring loop: max_sleeps_per_query (default 60) may be exhausted. Ctrl+C interrupts immediately
- Check: 'logs search llm "round limit"' for limit violations
Background tasks not running:
- 'task list' shows scheduled tasks and their status
- 'task stop <id>' to cancel a misbehaving task
- Only read-only commands can be scheduled as background tasks
High API costs:
- Enable prompt caching for Anthropic provider (cache_ttl setting)
- Reduce max_tool_rounds to limit reasoning cycles
- Review max_sleeps_per_query for monitoring loops
- Check per-user rate limits (default 10 queries/minute)
Relationships
Cross-subsystem interactions:
- Admin CLI: Single source of truth for all tools. The AI calls the same command handlers available via MCP and the bastion shell.
- MCP: Shares system instructions, tool definitions, and response formatting. Both interfaces use the same execution path.
- Bastion shell: Hosts the “ai” command and interactive AI mode. Manages conversation history, approval prompts, and background task lifecycle.
- Cluster knowledge: Memory entries (insights and rules) stored in cluster-wide distributed storage with configurable TTL.
- Admin CLI commands: diagnose, health, proxy, sessions, certs, dns, directory, config, and 30+ more — all available as AI tools.
Logs
Log entries emitted by the LLM module. Search with: logs search "llm"
Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.
Query lifecycle:
llm.query.start       INFO   Starting LLM query
llm.query.complete    INFO   LLM query completed
llm.query.api_error   ERROR  LLM API call failed
llm.query.max_rounds  WARN   LLM query exceeded maximum tool rounds
Tool execution:
llm.tool.execute   INFO        Executing tool via hexdcall
llm.tool.approved  INFO AUDIT  Write operation approved by operator
llm.tool.denied    INFO AUDIT  Write operation denied by operator
Metrics
Prometheus metrics. Query with: metrics prometheus llm_<name>
Queries:
llm_queries_total counter {success} Query completion count
    success=true Query completed with a final answer
llm_query_duration_seconds latency (none) End-to-end query duration including all tool rounds
Tool calls:
llm_tool_calls_total counter {tool, success} Per-tool execution count
    tool=<command> CLI command name (e.g. "proxy", "cluster")
    success=true/false Whether the command executed successfully
Prompt caching (Anthropic provider only, emitted when cache tokens > 0):
llm_cache_read_tokens_total counter (none) Tokens read from Anthropic prompt cache
llm_cache_creation_tokens_total counter (none) Tokens written to Anthropic prompt cache
Module Data Storage
Stores per-user credentials and settings — passkeys, TOTP secrets, X.509 enrollment data, and preferences
Overview
Stores per-user data for authentication and service modules — passkey credentials, TOTP secrets, X.509 enrollment data, and user preferences. Each module gets isolated storage with automatic cluster replication. Used by WebAuthn, TOTP, X.509, and any module that needs persistent per-user state.
Core capabilities:
- Hexon KV (NATS JetStream) storage with automatic cluster replication
- Per-user, per-module namespace isolation (e.g., “totp”, “webauthn”, “x509”)
- Reserved “preferences” namespace for cross-module user settings (language, etc.)
- Automatic language preference storage when Language field is set on SetRequest
- Directory cache refresh broadcast after Set and Delete operations
- Input validation at facade and storage levels
- Base64url key encoding for NATS KV compatibility (handles @, :, spaces)
Operations: Get, Set, Delete, check existence, get all data for a user, and bulk load.
Key format uses base64url-encoded usernames for storage compatibility.
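To make the encoding and validation concrete, here is a minimal Python sketch of how a storage key could be derived. The limits (200-character usernames, 64-character module names, the module-name pattern, and the 256-character combined key) are the ones documented in the Config section. The key layout "<module>/<encoded username>", the function names, and the choice to keep base64 padding are assumptions for illustration only; the actual layout is not documented here.

```python
import base64
import re

MODULE_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9\-_]*")
MAX_USERNAME = 200   # characters, before encoding
MAX_MODULE = 64      # characters
MAX_KEY = 256        # characters, after encoding


def encode_username(username: str) -> str:
    """Base64url-encode a username so '@', ':' and spaces are key-safe."""
    if not username:
        raise ValueError("username cannot be empty")
    if len(username) > MAX_USERNAME:
        raise ValueError("username too long")
    return base64.urlsafe_b64encode(username.encode("utf-8")).decode("ascii")


def validate_module(module: str) -> str:
    """Reject empty, oversized, or pattern-violating module names."""
    if not module or len(module) > MAX_MODULE or not MODULE_RE.fullmatch(module):
        raise ValueError("invalid module name")
    return module


def storage_key(username: str, module: str) -> str:
    # Hypothetical layout; the real key layout may differ.
    key = f"{validate_module(module)}/{encode_username(username)}"
    if len(key) > MAX_KEY:
        raise ValueError("combined key too long")
    return key


print(storage_key("alice@example.com", "totp"))
```

One consequence worth noting under these assumptions: base64url expands input by roughly a third, so a 200-byte username alone encodes to 268 characters. For long usernames the 256-character combined limit is therefore the binding constraint, not the 200-character username limit.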
Config
Configuration for moduledata storage:
Hexon KV Requirements:
    [cluster]
    cluster_path = "/var/lib/hexon" # Required for NATS JetStream persistence
- NATS JetStream must be available (cluster mode)
- Data automatically replicated across cluster nodes
- LoadAll returns all stored data (efficient for bootstrap)
Input Validation Rules:
Username:
- Cannot be empty
- Maximum 200 characters (before base64url encoding)
- Any characters allowed (gets base64url encoded)
Module Name:
- Cannot be empty
- Maximum 64 characters
- Pattern: [a-zA-Z0-9][a-zA-Z0-9\-_]* (no dots or colons)
- Examples: "totp", "webauthn", "ssh_keys", "user-preferences"
Combined key maximum: 256 characters after encoding
Reserved Namespaces:
- "preferences": User-wide settings (language, notification preferences)
Troubleshooting
Common symptoms and diagnostic steps:
“Backend unavailable” error (ErrBackendUnavailable):
- Check cluster_path exists and NATS JetStream is running
- Check cluster status for NATS availability
"Invalid username" or "Invalid module name" errors:
- Username must be non-empty and at most 200 characters
- Module name must match [a-zA-Z0-9][a-zA-Z0-9\-_]* pattern
- Module name must be at most 64 characters
- No dots or colons allowed in module name (NATS KV restriction)
Data not appearing across cluster nodes:
- Verify NATS JetStream cluster health
- Check if directory cache refresh broadcast is working
- Run 'moduledata inspect <username>' to check data on local node
Language preference not being stored:
- Language is stored asynchronously (fire-and-forget) in "preferences" namespace
- Check if Set operation for the primary module succeeded first
- Verify language code is a valid string (e.g., "en", "es", "fr", "zh")
- Query preferences directly: Get with ModuleName="preferences"
Encoding/decoding errors:
- ErrEncodingFailed: data contains types that cannot be JSON-serialized
- ErrDecodingFailed: stored data is corrupted or not valid JSON
- Check NATS KV key format (base64url encoding)
- Verify data values are JSON-compatible (maps, strings, numbers, bools)
Performance and metrics:
- moduledata_operations_total: counter by operation type and status
- moduledata_operation_duration_seconds: latency histogram
- High latency: check NATS JetStream performance
Security
Security considerations for module data storage:
User enumeration prevention:
HTTP handlers should return generic error messages to clients (e.g., "Invalid credentials" instead of "User not found"). Detailed errors are logged internally with traceID for debugging.
Input validation (defense in depth):
All inputs validated at facade and storage levels. Username length limit (200 chars) prevents DoS via oversized inputs. Module name character restrictions prevent injection attacks in NATS KV keys. Base64url encoding of usernames prevents NATS KV key injection.
Data isolation:
Each module's data is stored under its own namespace key. Modules cannot accidentally overwrite another module's data. The "preferences" namespace is reserved for cross-module user settings.
Thread safety:
All state managed by NATS JetStream. Concurrent operations are safe and independent.
Cache consistency:
After Set and Delete operations, a directory cache refresh is replicated cluster-wide to keep all node caches consistent. This is fire-and-forget; transient broadcast failures are non-fatal.
Relationships
Module dependencies and interactions:
- Directory: Provides user existence validation and group lookups. After Set/Delete, moduledata broadcasts RefreshUserCache to directory for cluster-wide cache consistency.
- WebAuthn: Stores passkey credentials per user in “webauthn” namespace. Uses Get/Set for credential CRUD operations.
- X.509: Stores X.509 certificate data per user.
- signin: Stores sign-in flow state and user preferences. Uses Language field on Set to automatically store user language preference.
- UI templates: Language preference from “preferences” namespace used for localized email rendering and UI template selection.
- smtp: Looks up user language preference from “preferences” namespace for localized email delivery (OTP, cert renewal, passkey expiration).
- cluster: Requires NATS JetStream (cluster_path configured). Data automatically replicated across cluster nodes.
- telemetry: Metrics exported for operation counts and latency histograms. Structured logging with operation type, username (redacted), and traceID.
Logs
Log entries by component. Search with: logs search "moduledata"
Levels: ERROR > WARN > INFO > DEBUG > TRACE.
Initialization:
moduledata.init WARN module_data_storage=ldap is deprecated and no longer supported; using hexon KV backend. Migrate existing module data to hexon KV before upgrading.
moduledata.init WARN cluster_path not set - module data may be lost on restart
moduledata.init WARN Persistent storage not enabled - module data will NOT survive restarts
moduledata.init INFO Module data storage initialized (hexon KV)
Get Operation:
moduledata.get DEBUG Getting module data
moduledata.get ERROR Backend.Get failed
Set Operation:
moduledata.set INFO Setting module data
moduledata.set ERROR Backend.Set failed
moduledata.set.preferences WARN Failed to store language preference
Delete Operation:
moduledata.delete INFO Deleting module data
moduledata.delete ERROR Backend.Delete failed
GetAllForUser Operation:
moduledata.getallforuser DEBUG Getting all module data for user
moduledata.getallforuser ERROR Backend.GetAllForUser failed
LoadAll Operation:
moduledata.loadall INFO Loading all module data
moduledata.loadall ERROR Backend.LoadAll failed
Exists Operation:
moduledata.exists ERROR Backend.Exists failed
Hexon KV Backend - Get:
moduledata.hexon.get ERROR PersistentGet failed
moduledata.hexon.get WARN Unexpected value type in KV
Hexon KV Backend - Set:
moduledata.hexon.set ERROR PersistentSet failed
moduledata.hexon.set DEBUG Module data stored in Hexon KV
Hexon KV Backend - Delete:
moduledata.hexon.delete DEBUG Key not found in Hexon KV (nothing to delete)
moduledata.hexon.delete ERROR PersistentDelete failed
moduledata.hexon.delete DEBUG Module data deleted from Hexon KV
Hexon KV Backend - GetAllForUser:
moduledata.hexon.getallforuser DEBUG Retrieved all module data for user
Hexon KV Backend - LoadAll:
moduledata.hexon.loadall INFO Loaded all module data from Hexon KV
Metrics
Prometheus metrics. Query with: metrics prometheus moduledata_<name>
Operations (namespace: moduledata):
moduledata_operations_total counter {operation, backend, result} Operation count
    operation=get|set|delete|getallforuser|loadall|exists
    backend=hexon
    result=success|error
moduledata_operation_duration latency {operation, backend} Operation duration
    operation=get|set|delete|getallforuser|loadall|exists
    backend=hexon
Notification Service
Routes alerts and events to Slack, Teams, Discord, PagerDuty, email, and custom webhooks
Overview
Routes operational events and alerts to configured notification channels — Slack, Teams, Discord, PagerDuty, email, and custom webhooks. Supports single events, digest notifications, and endpoint health checks. All notifications use template-driven payloads that can be customized per channel.
Core capabilities:
- Multi-channel routing: email, Slack, Teams, Discord, PagerDuty, custom webhooks
- Single event notifications (Send) with subject, body, and severity
- Digest notifications (SendDigest) batching multiple results into one message
- Five builtin webhook payload formats: generic, slack, teams, discord, pagerduty
- Custom Go text/template payloads with json, severityColor, severityEmoji helpers
- Partial success model: Success=true if at least one endpoint delivers
- Branded HTML email templates rendered via the render module
- Plain text fallback when render module is unavailable
- Targeted routing: empty Webhook sends to all, “email” for email only, or a specific webhook name for single-target delivery
- Health checking for all configured notification endpoints
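The targeted-routing and partial-success items in the list above can be sketched in a few lines of Python. This is an illustrative model, not the notify module's code: the function and class names are hypothetical, delivery is injected as a callback, and the behaviour for an unknown webhook name or a disabled email channel (no targets selected) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SendResult:
    sent: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # Partial success model: one delivered endpoint is enough.
        return len(self.sent) > 0


def select_targets(webhook: str, webhook_names: List[str], email_enabled: bool) -> List[str]:
    """Map the Webhook field to the list of channels to try."""
    if webhook == "":
        # Empty: broadcast to email plus every configured webhook.
        return (["email"] if email_enabled else []) + list(webhook_names)
    if webhook == "email":
        return ["email"] if email_enabled else []
    # Otherwise: a single named webhook (assumed empty if the name is unknown).
    return [webhook] if webhook in webhook_names else []


def send(webhook: str, webhook_names: List[str], email_enabled: bool,
         deliver: Callable[[str], bool]) -> SendResult:
    """Try every selected target; one failure never blocks the others."""
    result = SendResult()
    for target in select_targets(webhook, webhook_names, email_enabled):
        try:
            ok = deliver(target)
        except Exception:
            ok = False
        (result.sent if ok else result.failed).append(target)
    return result


names = ["slack-ops", "teams-infra"]
outcome = send("", names, True, lambda target: target != "teams-infra")
print(outcome.success, outcome.sent, outcome.failed)
```

In this run the broadcast reports success even though one webhook failed, which is why callers that care about a specific channel should inspect the failed list rather than the overall flag.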
Routing logic for the Webhook field:
- "" (empty): broadcast to all enabled channels (email + all webhooks)
- "email": send to email channel only (requires To field)
- "<name>": send to the named webhook only (e.g., "slack-ops")
Email delivery chain:
1. Notify module requests email rendering with template + data
2. Render module loads the appropriate notification and digest templates
3. Rendered HTML and plain text forwarded to SMTP module as multipart
4. Fallback: if render unavailable, plain text auto-wrapped in <pre> tags
Webhook payload formats:
- generic: flat JSON with subject, body, severity, username, hostname, timestamp
- slack: Block Kit with header, severity emoji, code block body, context footer
- teams: Adaptive Card v1.4 with TextBlock, FactSet, monospace body
- discord: Embed with severity color mapping, code block body, footer
- pagerduty: Events API v2 with routing_key from Metadata, severity mapping
Config
Notification configuration under [notify] section:
    [notify]
    digest_window = "5m" # Window for batching digest items
    [notify.email]
    enabled = true # Enable email notifications (uses [smtp] config)
    [[notify.webhooks]]
    name = "slack-ops" # Webhook name (used for targeted routing)
    url = "https://hooks.slack.com/services/T00/B00/XXX" # Webhook endpoint URL
    format = "slack" # Payload format: generic, slack, teams, discord, pagerduty
    timeout = "10s" # Request timeout (default: 10s)
    [[notify.webhooks]]
    name = "teams-infra"
    url = "https://outlook.office.com/webhook/XXX"
    format = "teams"
    timeout = "15s"
    [[notify.webhooks]]
    name = "pagerduty-critical"
    url = "https://events.pagerduty.com/v2/enqueue"
    format = "pagerduty"
    [[notify.webhooks]]
    name = "custom-endpoint"
    url = "https://api.internal/alerts"
    format = "generic" # Base format (overridden by body_template)
    content_type = "application/json" # Custom content type
    body_template = '{"alert": "{{json .Subject}}", "detail": "{{json .Body}}"}'
    [notify.webhooks.headers]
    Authorization = "Bearer token123" # Custom headers sent with every request
Template functions available in custom body_template:
    {{json .Field}} - JSON-escape a string (quotes, backslashes, control chars)
    {{severityColor .Sev}} - Map severity to Discord embed color (int)
    {{severityEmoji .Sev}} - Map severity to Slack emoji string
    {{severityPD .Sev}} - Map severity to PagerDuty severity level
Template variables (TemplateData fields):
    .Subject, .Body, .Severity, .Username, .Hostname, .Timestamp, .Metadata (map[string]string), .Items (digest), .ItemCount (digest)
Email template variables (passed to render module):
    Subject, Body, Severity, SeverityLabel, Username, Hostname, Timestamp, Disclaimer, Items (digest), ItemCount (digest)
The HTMLBody field on SendRequest bypasses template rendering entirely, allowing callers to provide custom HTML content. Email requires the To field to be set (recipient address).
Hot-reloadable: webhook URLs, formats, headers, timeouts, email enabled. Cold (restart required): none (all notify config is hot-reloadable).
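Why the json helper matters in a custom body_template can be shown with a short Python stand-in. This is not Go's text/template and not the module's helper: json_escape is a hypothetical equivalent that escapes a string without adding surrounding quotes, an assumption inferred from the config example above, where the placeholder already sits inside double quotes.

```python
import json


def json_escape(value: str) -> str:
    """Escape a string for use inside JSON double quotes (quotes not included)."""
    return json.dumps(value)[1:-1]


# A subject containing quotes and a newline, as real alerts often do.
subject = 'Disk "sda" at 95%\nnode: gw-1'

# Naive interpolation: the embedded quotes and newline break the payload.
unsafe = '{"alert": "' + subject + '"}'
# Escaped interpolation: the payload stays valid JSON.
safe = '{"alert": "' + json_escape(subject) + '"}'

try:
    json.loads(unsafe)
    unsafe_is_valid = True
except json.JSONDecodeError:
    unsafe_is_valid = False

print(unsafe_is_valid)
print(json.loads(safe)["alert"] == subject)
```

The unescaped payload fails to parse while the escaped one round-trips the original subject, which is the failure mode the recommendation to use {{json .Field}} for all string interpolation is meant to prevent.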
Troubleshooting
Common symptoms and diagnostic steps:
Webhook not receiving notifications:
- Run 'notify health' to check connectivity to all endpoints
- Verify webhook URL is correct and accessible from the Hexon server
- Check webhook name matches exactly (case-sensitive) when using targeted routing
- Verify format is one of: generic, slack, teams, discord, pagerduty
- Check timeout setting (default 10s) is sufficient for the endpoint
- For custom endpoints, verify content_type and body_template are valid
- Check custom headers (Authorization, API keys) are correct
Email notifications not being delivered:
- Verify [notify.email] enabled = true
- Check SMTP module health: 'smtp health'
- Verify To field is set on the SendRequest
- Check render module is available for template rendering
- If render unavailable, plain text fallback should still work
- Check spam folder - configure SPF/DKIM/DMARC for production
Partial success (some channels fail, others succeed):
- This is expected behavior with the partial success model
- Check resp.Failed[] for endpoints that failed and resp.Sent[] for successes
- resp.Error contains a summary of failures
- Individual endpoint failures do not block other deliveries
- Success=true means at least one endpoint delivered successfully
Digest notifications empty or missing items:
- Verify Items array is populated in SendDigestRequest
- Each DigestItem needs TaskID and Description at minimum
- Check digest_window setting if items seem to be batched incorrectly
Custom template errors:
- Templates are parsed and validated at config load time
- Template execution errors prevent notification delivery
- Use {{json .Field}} for all string interpolation to prevent JSON injection
- Check template syntax matches Go text/template format
- Verify all referenced fields exist in TemplateData
Slack/Teams formatting issues:
- Slack format uses Block Kit (verify workspace supports it)
- Teams format uses Adaptive Card v1.4 (verify connector version)
- Discord embeds have 4096 character limit for description
- PagerDuty requires routing_key in Metadata map
Test notifications:
- Use 'notify test <webhook-name>' to test a specific webhook
- Use 'notify test email <address>' to test email delivery
- Use 'notify list' to see all configured endpoints
Security
Security considerations for notification delivery:
Webhook URL protection:
Webhook URLs are marked as sensitive in configuration and are not exposed in config dumps or diagnostic output. Only HTTPS URLs are recommended for production deployments. HTTP is allowed for internal or testing endpoints.
Authentication headers:
Webhook headers (Authorization, API keys) are stored in configuration. Headers are sent with every webhook request to the endpoint. Consider using environment variable references for secrets in production.
Custom template safety:
Templates are parsed and validated at config load time to catch syntax errors early. Always use {{json .Field}} for string interpolation in custom templates to prevent JSON injection attacks. Template execution errors are returned and the notification is not sent (fail-safe).
HTML content handling:
Email body text is HTML-escaped when auto-converting plain text to HTML. The HTMLBody field content is sent as-is without sanitization - callers are responsible for ensuring HTML content is safe. The render module handles template escaping for branded email templates.
Credential exposure prevention:
SMTP credentials are handled by the smtp module (never exposed by notify). Webhook URLs with embedded tokens (Slack, Teams) are redacted in logs. Error messages from failed webhook deliveries do not include full URLs.
Relationships
Module dependencies and interactions:
- smtp: Email delivery backend. Notify sends emails through the SMTP module for all email notifications. SMTP configuration ([smtp] section) determines the mail server, credentials, and encryption mode.
- UI templates: Email template rendering. Notify renders branded HTML email templates cluster-wide. If template rendering is unavailable, plain text fallback is used.
- Admin CLI: Exposes notify CLI commands (list, health, test) for management and diagnostics.
- mcp: Notify operations available as MCP tools for AI-assisted operations. LLM and bastion AI assistant can send notifications through MCP tools.
- config: All notification settings are hot-reloadable. Webhook URLs, formats, headers, timeouts, and email enabled flag can be changed without restart.
- telemetry: Structured logging for notification delivery with endpoint name, delivery status, and latency. Metrics for send counts and failures.
- Rate limiting: Callers should apply rate limiting in HTTP handlers to prevent notification flooding. Notify module itself does not rate limit.
- Various callers: Any module can send notifications cluster-wide. Common callers include authentication modules (login anomalies), certificate management (renewal notifications), and the bastion AI assistant.
Logs
Log entries emitted by the notify module. Search with: logs search "notify"
Levels: ERROR > WARN > INFO > DEBUG > TRACE.
Send — single event delivery:
notify.send.email_failed WARN Email notification failed
notify.send.webhook_failed WARN Webhook notification failed
notify.send.webhook_ok DEBUG Webhook notification sent
notify.send.render_fallback WARN Email template rendering failed, using plain text fallback
Digest - batched digest delivery:
notify.digest.email_failed WARN Digest email failed
notify.digest.webhook_failed WARN Digest webhook failed
notify.digest.render_fallback WARN Digest template rendering failed, using plain text fallback
Health check:
notify.healthcheck DEBUG Health check completed
Metrics
Prometheus metrics emitted by the notify module:
notify_sent_total counter {channel, result} Incremented after each single-event delivery attempt.
    channel=email|webhook, result=success|failure.
notify_digest_sent_total counter {channel, result} Incremented after each digest delivery attempt.
    channel=email|webhook, result=success|failure.
Downstream metrics from related modules:
- smtp_send_total (from smtp module): covers email delivery outcomes
- render_email_total (from render module): covers template rendering
Distributed Sessions
Manages sessions across all protocols — HTTP and SSH share the same session store with instant revocation
Overview
Manages sessions for every protocol the gateway handles — HTTP, SSH, and PoW. Replaces per-service session stores with one cluster-wide store that supports instant revocation across all protocols. Disable a user once — every session terminates cluster-wide. Supports:
- Unique session IDs (crypto/rand UUID v4, base64url-encoded, 256-bit) or custom IDs (e.g., SHA256 hash)
- Dual-key indexing: primary by session ID, secondary by type+module_key
- Automatic TTL expiration managed by distributed memory storage
- Saga-based atomic session+index creation with rollback on failure
- Pluggable extend validators (e.g., X.509 certificate revocation checks)
- Pluggable create callbacks (e.g., post-create notifications)
- Pluggable delete callbacks (e.g., per-type resource cleanup)
- Session ID regeneration for session fixation protection
- Lazy index cleanup on GetByModuleKey (handles missed OnDelete callbacks)
- Thread-safe callback/validator registration (RWMutex)
- Metrics: sessions_created, validations_success, validations_failed, sessions_extended, sessions_revoked, sessions_bulk_revoked, sessions_regenerated, activity_persisted
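The dual-key indexing, rollback-on-failure creation, and lazy index cleanup in the list above can be modelled with an in-memory Python sketch. It is a simplification under stated assumptions: a plain dict stands in for the distributed store, the SessionStore class and its fail_index switch are hypothetical, and the ID is assumed to be 32 random bytes encoded as unpadded base64url. The key shapes follow the sessions/{id} and sessions_index/{type}/{module_key} layout given in the Architecture section.

```python
import base64
import secrets
from typing import Dict, List


def new_session_id() -> str:
    # Assumption: 32 random bytes (256 bits), base64url without padding.
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")


class SessionStore:
    """Toy single-node stand-in for the cluster-wide session store."""

    def __init__(self) -> None:
        self.kv: Dict[str, object] = {}

    def create(self, session_type: str, module_key: str, data: dict,
               fail_index: bool = False) -> str:
        sid = new_session_id()
        primary = f"sessions/{sid}"
        index_key = f"sessions_index/{session_type}/{module_key}"
        # Step 1: store the session under its primary key.
        self.kv[primary] = dict(data, type=session_type, module_key=module_key)
        try:
            if fail_index:  # test hook to simulate a failing second step
                raise RuntimeError("simulated index failure")
            # Step 2: add the ID to the secondary index.
            self.kv.setdefault(index_key, []).append(sid)
        except Exception:
            # Compensation for step 1: remove the orphaned session.
            del self.kv[primary]
            raise
        return sid

    def get_by_module_key(self, session_type: str, module_key: str) -> List[str]:
        index_key = f"sessions_index/{session_type}/{module_key}"
        ids = self.kv.get(index_key, [])
        # Lazy cleanup: drop index entries whose session no longer exists.
        live = [sid for sid in ids if f"sessions/{sid}" in self.kv]
        if index_key in self.kv:
            self.kv[index_key] = live
        return live


store = SessionStore()
sid = store.create("user", "alice", {"ip": "10.0.0.1"})
print(store.get_by_module_key("user", "alice") == [sid])
```

The design point the sketch tries to capture is that the index is never trusted blindly: creation either writes both keys or neither, and reverse lookups repair any stale entries they encounter.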
Available operations:
Create         - Create session with atomic dual-key indexing
Validate       - Validate session, update LastActivity (does NOT extend TTL)
Extend         - Extend TTL (runs validators first, caps to cert_not_after for X.509)
Revoke         - Delete single session (index cleaned automatically)
RevokeAll      - Delete all sessions for a type+module_key
List           - List all sessions of a given type (filters expired)
GetByModuleKey - Reverse lookup by type+module_key with lazy cleanup
RegenerateID   - New ID with same data (session fixation protection)
Session types in use:
user             - Authenticated user sessions (web login, OIDC callback, X.509 auto-auth)
bastion          - SSH bastion connection tracking
cobrowse         - Proxy co-browse viewer sessions
password_expired - Temporary session for password change flow (short TTL)
mfa_pending      - MFA verification pending (short TTL)
flow_pending     - Signup/enrollment flow pending
jit2fa_pending   - JIT 2FA OTP verification pending
jit2fa_auth      - JIT 2FA authenticated session
pow              - Proof-of-Work challenge session
bearer_cache     - JWT Bearer token verification cache (custom ID = SHA256 of token)
Memory usage: ~600 bytes per active session (500 bytes session + 100 bytes index entry). For 1 million active sessions: ~600 MB cluster-wide.
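The difference between Validate and Extend in the operations list comes down to two small calculations, described under TTL behavior in the Config section: how stale LastActivity must be before Validate persists it (sessionTTL/10, clamped to 1m–5m), and how Extend derives the stored TTL (capped to cert_not_after for X.509, with a 1-minute floor). The Python sketch below illustrates both. The function names are hypothetical, and applying the floor after the certificate cap is an assumption about ordering.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

MIN_TTL = timedelta(minutes=1)  # documented minimum effective storage TTL


def activity_persist_threshold(session_ttl: timedelta) -> timedelta:
    """sessionTTL/10, clamped to the 1m-5m range."""
    return max(timedelta(minutes=1), min(session_ttl / 10, timedelta(minutes=5)))


def effective_ttl(requested: timedelta, now: datetime,
                  cert_not_after: Optional[datetime] = None) -> Tuple[timedelta, bool]:
    """Return (ttl, ttl_capped) for a Create or Extend request."""
    ttl = requested
    capped = False
    if cert_not_after is not None and now + ttl > cert_not_after:
        # X.509 sessions never outlive the certificate.
        ttl = cert_not_after - now
        capped = True
    # Assumption: the 1-minute floor is applied after the certificate cap.
    return max(ttl, MIN_TTL), capped


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(activity_persist_threshold(timedelta(hours=24)))
print(effective_ttl(timedelta(hours=24), now, now + timedelta(hours=2)))
```

With the default 24h session TTL the raw threshold would be 144 minutes, so the 5-minute upper clamp is what actually applies; the lower clamp only matters for sessions shorter than 10 minutes.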
Config
Sessions have no dedicated [sessions] config section. TTL and cookie settings are controlled by the calling module via [service] and per-feature config:
    [service]
    cookie_name = "hexon" # Default session cookie name (default: "hexon")
    cookie_domain = ".example.com" # Cookie domain for cross-subdomain sharing (default: current hostname only)
    cookie_ttl = "12h" # Default session cookie TTL (default: "12h")
    session_ttl = "24h" # Authenticated user session TTL (default: "24h")
    session_password_expired = "15m" # Password expired session TTL (default: "15m")
    session_mfa_pending = "5m" # MFA pending session TTL (default: "5m")
    max_concurrent_sessions = 1 # Max concurrent sessions per user (default: 1, 0=unlimited)
    [jit2fa]
    cookie_name = "jit2fa_key" # Cookie name for JIT 2FA sessions (default: "jit2fa_key")
    session_ttl = "8h" # JIT 2FA authenticated session TTL (default: "8h")
    [forward_proxy]
    session_cookie = "hexon_session" # Forward proxy session cookie name (default: "hexon_session")
    [protection]
    pow_cookie_name = "hexon_pow" # PoW session cookie (default: "hexon_pow", MUST differ from session cookie)
Recommended TTL values by session type:
    Interactive web sessions (user): 12-24 hours
    API tokens: 30-90 days
    OAuth state: 5-10 minutes
    MFA pending (mfa_pending): 5 minutes
    Password expired (password_expired): 15 minutes
    PoW/temporary tokens: 1-5 minutes
    JIT 2FA (jit2fa_auth): 8 hours
    Bastion: Caller-determined (bastion manager)
    Bearer cache (bearer_cache): 5 minutes (default, configurable via [proxy].bearer_cache_ttl)
TTL behavior:
- Validate does NOT extend TTL but persists LastActivity when stale > sessionTTL/10 (clamped 1m–5m), fire-and-forget
- Extend explicitly sets new TTL from current time, requires cluster broadcast
- X.509 sessions: TTL capped to cert_not_after on both Create and Extend
- Minimum effective storage TTL is 1 minute (enforced as floor)
- Expired sessions are filtered out by List and GetByModuleKey
- Storage-level TTL expiry triggers OnDelete callback for automatic index cleanup
- TTLCapped field in Create/Extend responses indicates certificate-based capping
Troubleshooting
Common symptoms and diagnostic commands:
Session not persisting across requests:
- Cookie domain mismatch: verify [service].cookie_domain includes all subdomains
- Secure flag on non-HTTPS: cookies with Secure=true require HTTPS transport
- SameSite=Strict blocking cross-origin: check if auth redirect crosses domains
- Cookie name conflict: ensure cookie_name differs from pow_cookie_name and jit2fa cookie
- max_concurrent_sessions exceeded: new session may evict previous one
- Check: 'sessions list --user=<username>' to verify session exists in storage
Cross-node session loss (works on one node, fails on another):
- JetStream KV replication lag: check cluster quorum status with 'status'
- Saga partial failure: session created but index missing, or vice versa
- Network partition: quorum requirement (>50% nodes) prevents writes during partition
- Validate is local-only: session must be replicated to the validating node
- Check: 'sessions show <session_id>' from multiple nodes to compare
- Check: 'status' for cluster health and node connectivity
Premature session expiration:
- TTL too short: check [service].session_ttl (default 24h) or caller-specific TTL
- Clock skew between nodes: ensure NTP is running (chrony or systemd-timesyncd)
- X.509 TTL capping: session capped to cert_not_after, verify certificate validity
- TTLCapped=true in response indicates certificate-based cap was applied
- Check: 'sessions show <session_id>' to compare ExpiresAt vs current time
Session extend rejected:
- Extend validator rejecting: check 'logs --module=sessions --level=warn'
- x509_revocation validator: certificate revoked (check OCSP/serial index)
- Certificate already expired: X.509 sessions cannot extend past cert_not_after
- Session not found: already expired or revoked before extend attempt
- Check: 'logs --module=sessions --keyword=validator' for rejection details
Stale sessions appearing in index (ghost sessions):
- OnDelete callback failed during network partition or node crash
- GetByModuleKey performs lazy cleanup: stale entries removed on next lookup
- Manual cleanup: 'sessions revoke <session_id>' for individual sessions
- Bulk cleanup: 'sessions revoke-user <username>' to clear all user sessions
Session fixation concerns:
- RegenerateID should be called after authentication or privilege escalation
- RegenerateID atomically creates new ID with same data, revokes old session
- Uses Saga: new session stored, index updated, old session deleted (with compensation)
- Check: 'logs --module=sessions --keyword=regenerated' for regeneration events
Diagnostic commands:
    sessions list                   - List first 20 sessions (all types)
    sessions list --type=user       - List authenticated user sessions
    sessions list --user=alice      - List sessions for specific user
    sessions list --offset=20       - Paginate to next page
    sessions list --limit=50        - Show 50 sessions per page
    sessions show <session_id>      - Show full session details with metadata
    sessions revoke <session_id>    - Revoke a single session
    sessions revoke-user <username> - Revoke all sessions for a user
    diagnose user <username>        - Full access diagnostic including session info
    logs --module=sessions          - Session operation logs
    status                          - Cluster health (affects quorum operations)
Architecture
Dual-key storage strategy:
    Primary key: sessions/{uuid} -> Session object
    Secondary key: sessions_index/{type}/{module_key} -> SessionIndex (list of session IDs)
Uses '/' separator because NATS KV disallows ':' in key names.
Session lifecycle:
1. Create: custom ID or crypto/rand 32-byte UUID (base64url)
   -> Saga(store session + update index)
   -> OnDelete callback registered for automatic index cleanup
   -> Create callbacks fired post-commit
   -> Replicated to cluster with quorum requirement (>50% nodes)
2. Validate: Local read from memorystorage
   -> update LastActivity (local + throttled persist)
   -> Persists to storage when stale > sessionTTL/10 (clamped 1m–5m), fire-and-forget
   -> Does NOT extend TTL (explicit Extend call required for renewal)
3. Extend: Load session
   -> run all registered validators in sequence
   -> Cap to cert_not_after for X.509
   -> broadcast with quorum, OnDelete preserved
4. Revoke: Replicated delete to all nodes -> callback fires -> index cleaned
5. RevokeAll: Two-stage with cluster-wide safety net. The fast path uses the per-user index; the safety-net path scans the cache for any sessions the index missed (covers stale-index cases on peer nodes after partition or partial replication). Both paths confirm each delete reached at least one node before counting it. Bounded at 10000 sessions per call; truncations are recorded as a metric. Requires memory.cold_enabled=true for the safety net to be effective; when disabled, a throttled warning per affected user surfaces the gap.
6. RegenerateID: Saga(store new session + update index + delete old session)
   -> Preserves original CreatedAt timestamp, copies all metadata
   -> Compensation: rollback new session if old session deletion fails
Delete reason audit:
Every session deletion emits an audit log entry and a counter labelled with the cause: expired (TTL), revoked (single-session admin/user revoke), bulk (force-logout / password-change), rotated (session regeneration after privilege change), saga_rollback (creation failed mid-flow). Audit emission is once-per-cluster, so operators can alert on reason=bulk spikes for force-logout activity, on reason=expired baselines for TTL behaviour, etc.
    Counter: sessions_deleted_total{type, reason}
    Audit log: sessions.delete (LevelInfo, AsAudit); fields include session_type, reason, expires_at; correlates to begin/extend audit lines via trace_id.
Saga operations (atomic multi-step with rollback):
- Create: Step 1 store session (compensate: delete), Step 2 update index
- RegenerateID: Step 1 store new (compensate: delete), Step 2 add to index, Step 3 delete old (compensate: restore old session with TTL and OnDelete callback)
- Saga commit marks success; saga finalization defers cleanup/rollback
Index consistency model:
- Automatic cleanup: OnDelete callback removes session_id on TTL expiry or manual delete
- Lazy cleanup: GetByModuleKey validates each session in index, removes stale entries
- Saga atomicity: Create and RegenerateID use compensating transactions
- Delete callbacks execute even if index removal fails (resource cleanup not blocked)
Cluster behavior:
Sessions (Create/Extend): Replicated with quorum (>50% nodes must confirm)
Indices (Create/RegenerateID): Replicated with quorum (consistency required)
Validate: Local read + throttled fire-and-forget broadcast when LastActivity stale (sessionTTL/10, clamped 1m–5m)
Revoke/RevokeAll: Replicated to all nodes (eventual consistency acceptable)
OnDelete callbacks: Local execution per node, fire-and-forget, independent of cluster
Callback and validator architecture:
ExtendValidator: called BEFORE extend, CAN reject (returning error rejects extension)
  Built-in: x509_revocation (checks cert revocation via OCSP cache and serial index)
    For internal certs: checks serial index and moduledata
    For external certs: checks OCSP cache and responder (soft-fails on infra errors)
CreateCallback: called AFTER successful create, fire-and-forget with panic recovery
DeleteCallback: called AFTER delete and index cleanup, fire-and-forget with panic recovery
Registration: thread-safe via RWMutex, map copied under read lock before execution
Execution: sequential, each callback wrapped in defer/recover, panics logged not propagated
Performance:
Direct lookup (Validate): O(1) by session ID, local read only
Reverse lookup (GetByModuleKey): O(1) index lookup + O(n) session loads
List all of type: O(N) scan of all sessions in storage, filtered by type
Typical sessions per user: 1-5 (bounded by max_concurrent_sessions)
Session object: ~500 bytes with metadata, index entry: ~100 bytes per reference
Security:
Session IDs: 256-bit crypto/rand, base64url (RawURLEncoding), no padding
Collision probability: ~2^-197 for 1 billion sessions (birthday bound n^2 / 2^257 with n ≈ 2^30 and 256-bit IDs)
X.509 TTL capping: cert_not_after metadata enforced on Create and Extend
Revocation: instant via Revoke/RevokeAll (stateful, no blacklist needed)
Session fixation: RegenerateID for post-authentication ID rotation
Metadata privacy: plaintext module_keys for lookup (hash sensitive identifiers)
Type registration:
All request/response types registered for cluster RPC serialization during init.
Interpreting tool output:
'sessions list':
  Normal: Active sessions show User, Type, IP, Age (all expected)
  Stale: Sessions with Age > max_session_duration; cleanup may be delayed (runs every 5m)
  Types: "authenticated" (normal), "mfa_pending" (waiting for MFA, 5min TTL), "password_expired"
  High count: Many sessions for one user → check max_concurrent_sessions setting
  Action: Suspicious session → 'sessions show <id>' for details, 'sessions revoke <id>' to terminate
'sessions list --user=<username>':
  Empty: User has no active sessions; they are not logged in anywhere
  Multiple types: "authenticated" + "mfa_pending" = user may be stuck in MFA flow
  Action: Clear stuck MFA → 'sessions revoke-user <username>' (terminates ALL sessions)
Relationships
Module dependencies and interactions:
- Distributed memory cache: KV store backend. Sessions stored in “sessions” cache type, indices in “sessions_index” cache type. Provides TTL expiration and OnDelete callbacks. All session CRUD operations delegate to the distributed cache.
- proxy: Creates “user” sessions during OIDC SSO callback. Validates sessions on every proxied request for authentication enforcement. Session group monitor refreshes group membership and revokes sessions on group changes. Creates “cobrowse” sessions for co-browse viewer tracking. Creates “bearer_cache” sessions to cache JWT ID token verifications (SHA256 of token as custom session ID, configurable TTL).
- signin: Creates “user” sessions after successful authentication, “password_expired” sessions for password change flow, “mfa_pending” sessions for MFA verification.
- signup: Creates “flow_pending” sessions during enrollment, “mfa_pending” during TOTP/passkey setup, “user” sessions after completed registration.
- bastion: Creates “bastion” sessions for SSH connection tracking. Session metadata includes connection details for audit trail and session sharing features.
- authentication.x509: Registers x509_revocation extend validator. Checks certificate revocation status before allowing session extension. Sets cert_not_after metadata for TTL capping on both Create and Extend operations.
- authentication.jit2fa: Creates “jit2fa_pending” and “jit2fa_auth” sessions with separate cookie (jit2fa_key) and configurable TTL (default 8h).
- passwordchange: Validates “user” and “password_expired” session types. Creates new “user” session after successful password change. Triggers revocation of old sessions.
- pow: Creates “pow” sessions after successful proof-of-work challenge. Uses separate cookie (hexon_pow) to avoid conflicts with main session cookie.
- profile: Creates “user” sessions during profile management operations.
- Directory: Group membership changes can trigger session revocation via proxy session monitor. Provides fresh group lookups for per-request authorization.
- middleware (handlers): Creates “user” sessions during X.509 auto-authentication in the middleware chain when client certificate is present.
- telemetry: All operations log with structured entries including trace IDs and security context (session ID, username). Levels: Error (storage/saga failures), Warn (not found, expired, validator rejections), Info (create/revoke events), Debug (normal validate/extend operations).
- metrics: Runtime counters for all session operations (created, validated, extended, revoked, bulk_revoked, regenerated, validation failures by reason).
- config ([service]): Provides default TTL values, cookie configuration, and max_concurrent_sessions limit. No dedicated [sessions] config section; TTL policies are caller-determined (each module passes its own TTL to Create).
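Because each caller passes its own TTL to Create, the activity-persist threshold described under Architecture (sessionTTL/10, clamped 1m–5m) differs per session type. The sketch below illustrates that arithmetic together with the session ID format stated under Security (256 random bits, base64url, no padding). It is an illustration only: the function names new_session_id and persist_threshold are hypothetical and are not taken from the gateway's code.

```python
import base64
import secrets
from datetime import timedelta


def new_session_id() -> str:
    """Return 32 cryptographically random bytes as unpadded base64url text."""
    raw = secrets.token_bytes(32)  # 256 bits of entropy
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def persist_threshold(session_ttl: timedelta) -> timedelta:
    """Staleness after which LastActivity is persisted: TTL/10, clamped to 1m-5m."""
    lower, upper = timedelta(minutes=1), timedelta(minutes=5)
    return max(lower, min(upper, session_ttl / 10))


if __name__ == "__main__":
    print(len(new_session_id()))                    # 43 characters, no '=' padding
    print(persist_threshold(timedelta(hours=8)))    # 48m is clamped down to 5m
    print(persist_threshold(timedelta(minutes=5)))  # 30s is clamped up to 1m
```

With these assumptions, an 8h session persists activity at most every 5 minutes, while a 5-minute mfa_pending session persists at most every minute.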
Logs
Log entries by operation. Search with: logs search “sessions” Levels: ERROR > WARN > INFO > DEBUG.
Session Create:
sessions.create            INFO   Session created (type, module_key, TTL)
sessions.create            WARN   TTL capped to certificate validity / DurableKV not available
sessions.create            ERROR  Failed to generate ID / store session / update index
Session Validate:
sessions.validate          DEBUG  Session validated (type, module_key)
sessions.validate          ERROR  Invalid session type in storage
Session Extend:
sessions.extend            DEBUG  Session TTL extended
sessions.extend            WARN   Extension rejected by validator / cert expired / TTL capped
Session Revoke:
sessions.revoke            INFO   Session revoked
sessions.revoke            WARN   Failed to broadcast deletion
sessions.revoke_all        INFO   All sessions revoked for module_key
Session Regenerate:
sessions.regenerate        INFO   Session ID regenerated successfully
sessions.regenerate        WARN   Session not found for regeneration
sessions.regenerate        ERROR  Fetch/generate/store/index/delete failures
Activity Tracking:
sessions.persist_activity  ERROR  Panic recovered persisting LastActivity
Callbacks & Validators:
sessions.validator         INFO   Session extend validator registered
sessions.callback          INFO   Session create/delete/delete_v2 callback registered
sessions.callback          ERROR  Callback panicked (create/delete/delete_v2)
Index:
sessions.index             DEBUG  Index cleanup / session removed / index deleted
Metrics
Prometheus metrics. Query with: metrics prometheus sessions_<name>
Lifecycle:
sessions_sessions_created       counter  {type}    Sessions created
sessions_sessions_revoked       counter  {}        Sessions revoked (single)
sessions_sessions_bulk_revoked  counter  {type}    Sessions bulk-revoked
sessions_sessions_regenerated   counter  {type}    Session IDs regenerated
sessions_sessions_extended      counter  {type}    Session TTLs extended
sessions_activity_persisted     counter  {type}    Activity timestamps persisted
Validation:
sessions_validations_success    counter  {type}    Successful validations
sessions_validations_failed     counter  {reason}  Failed validations (storage_error, wait_error, not_found, invalid_type)
Alerts:
rate(sessions_validations_failed{reason="not_found"}[5m]) > 50
  High session-not-found rate (expired or stale cookies)
rate(sessions_sessions_bulk_revoked[5m]) > 0
  Bulk revocation event (user disabled or password change)
rate(sessions_validations_failed{reason="storage_error"}[5m]) > 0
  Storage backend issues
SMTP Email Delivery
Sends emails for OTP codes, magic links, certificate notifications, and alerts — templated and localized
Overview
Handles all outbound email delivery for the gateway — OTP codes, magic links, certificate notifications, and alerts. Other modules request email delivery; this module handles connection management, templates, and localization. Supports SSL, STARTTLS, HTML/plain-text multipart, and file attachments.
Core capabilities:
- Generic email sending with HTML and plain text multipart content
- OTP (One-Time Password) emails for authentication flows
- Certificate renewal notification emails with cert and CA bundle attached
- Passkey expiration reminder emails with re-enrollment link
- Health checks for SMTP server connectivity verification
- Multi-part email composition with file attachments
- Three encryption modes: SSL (port 465), STARTTLS (port 587), plain (port 25)
- Multi-language email localization (en, es, fr, zh, ca)
- Template rendering with branded HTML email templates
- User language preference lookup for automatic localization
- RFC 5321 compliant address validation (local ≤ 64, domain ≤ 255 chars)
- RFC 5322 compliant headers (Date, Message-ID on every email)
- RFC 8255 Content-Language header on templated emails
- Message-ID in structured logs (success + failure) for MTA correlation
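As an illustration of the RFC 5321 length limits listed above, here is a minimal sketch of such a check. The function name address_error is hypothetical; the two limit messages reuse the wording shown under Troubleshooting, and the check deliberately covers only the length limits, not full address syntax.

```python
from typing import Optional

MAX_LOCAL = 64    # RFC 5321 limit for the local part
MAX_DOMAIN = 255  # RFC 5321 limit for the domain


def address_error(addr: str) -> Optional[str]:
    """Return a description of the length violation, or None if the address passes."""
    # Split on the last '@' so a quoted local part containing '@' stays intact.
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        return "address must have a local part and a domain"
    if len(local.encode("utf-8")) > MAX_LOCAL:
        return "local part exceeds 64 character limit"
    if len(domain.encode("utf-8")) > MAX_DOMAIN:
        return "domain part exceeds 255 character limit"
    return None


if __name__ == "__main__":
    print(address_error("alice@example.com"))
    print(address_error("a" * 65 + "@example.com"))
```

Lengths are measured in UTF-8 bytes here because the RFC expresses the limits in octets; that choice is an assumption about how the gateway counts.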
Localization priority for templated emails:
1. Language field explicitly set in the request
2. User preference from stored preferences
3. Default fallback to "en" (English)
Supported languages: en (English), es (Spanish), fr (French), zh (Chinese), ca (Catalan)
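The priority order above can be sketched as a small resolver. This is a hedged illustration, not the module's code: resolve_language is a hypothetical name, and the rule that an unsupported choice falls back to English follows the "Missing locale" note under Troubleshooting.

```python
from typing import Optional

SUPPORTED = ("en", "es", "fr", "zh", "ca")
DEFAULT = "en"


def resolve_language(request_lang: Optional[str], user_pref: Optional[str]) -> str:
    """Pick the email language: request field, then stored preference, then 'en'."""
    chosen = request_lang or user_pref or DEFAULT
    # An unsupported language falls back to English rather than failing the send.
    return chosen if chosen in SUPPORTED else DEFAULT


if __name__ == "__main__":
    print(resolve_language("fr", "es"))   # request wins
    print(resolve_language(None, "ca"))   # stored preference
    print(resolve_language(None, None))   # default
```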
Config
SMTP configuration under [smtp] section:
[smtp]
host = "smtp.gmail.com"             # SMTP server hostname (required)
port = 587                          # SMTP server port (required)
encryption = "starttls"             # Encryption mode: "ssl", "starttls", or "none"
user = "noreply@example.com"        # SMTP authentication username
password = "app-specific-password"  # SMTP authentication password (sensitive)
from = "noreply@example.com"        # Sender email address (From header)
reply_to = "support@example.com"    # Reply-To header address (optional)
name = "HexonAuth"                  # Sender display name (optional)
skip_tls = false                    # Skip TLS certificate verification (default: false)
skip_tls: Disables server certificate validation for SSL and STARTTLS modes. Logs a WARN on every send when enabled. Use only when the SMTP server presents an untrusted or hostname-mismatched certificate. NOT recommended for production.
Encryption modes:
  ssl (port 465): Direct TLS connection from the start.
  starttls (port 587): Plain connection upgraded to TLS. Recommended.
  none (port 25): Unencrypted. Not recommended for production.
Common SMTP provider configurations:
  Gmail: host = "smtp.gmail.com", port = 587, encryption = "starttls" (requires App Passwords with 2FA enabled)
  SendGrid: host = "smtp.sendgrid.net", port = 587, encryption = "starttls", user = "apikey", password = "<sendgrid-api-key>"
  AWS SES: host = "email-smtp.<region>.amazonaws.com", port = 587
  Mailgun: host = "smtp.mailgun.org", port = 587, encryption = "starttls"
Hot-reloadable: all SMTP settings (host, port, encryption, credentials).
Cold (restart required): none.
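A mismatch between encryption mode and port is a common cause of connection failures. The sketch below shows one way an operator could lint the [smtp] values before deploying them. It is not a gateway feature; check_smtp_config is a hypothetical helper and the port table simply restates the conventional pairings listed above.

```python
CONVENTIONAL_PORTS = {"ssl": 465, "starttls": 587, "none": 25}


def check_smtp_config(encryption: str, port: int) -> list:
    """Return human-readable warnings for an encryption/port combination."""
    if encryption not in CONVENTIONAL_PORTS:
        raise ValueError(f"unknown encryption mode: {encryption!r}")
    warnings = []
    expected = CONVENTIONAL_PORTS[encryption]
    if port != expected:
        # Non-standard ports can be valid, so this is a warning, not an error.
        warnings.append(f"{encryption} is normally used on port {expected}, got {port}")
    if encryption == "none":
        warnings.append("unencrypted SMTP is not recommended for production")
    return warnings


if __name__ == "__main__":
    for mode, port in [("starttls", 587), ("ssl", 587), ("none", 25)]:
        print(mode, port, check_smtp_config(mode, port))
```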
Troubleshooting
Common symptoms and diagnostic steps:
SMTP connection failures:
- Check SMTP health: 'smtp health' tests connectivity and authentication
- Verify host and port match encryption mode (SSL=465, STARTTLS=587)
- Firewall blocking outbound: verify server can reach SMTP host:port
- Network probe: 'net tcp <smtp-host>:<port> --tls' for SSL
- DNS resolution: 'dns test <smtp-host>' to verify hostname resolves
Authentication failures:
- Gmail: requires App Passwords (regular password won't work with 2FA)
- SendGrid: user must be literal string "apikey", password is the API key
- AWS SES: IAM credentials, not root account credentials
- Check: 'config show smtp' to verify configuration (password redacted)
Emails not being received:
- Check spam/junk folder at recipient mail provider
- Verify from address matches authenticated user or authorized alias
- Configure SPF, DKIM, and DMARC DNS records for sending domain
- Test delivery: 'smtp test <to-address>' sends a test message
- Check: 'notify health' for notification system status
OTP emails delayed or missing:
- SMTP latency: 200-1000ms is normal, check 'smtp health'
- OTP code expired before email arrived: check OTP validity window
- Rate limiting by SMTP provider: check provider dashboard
Passkey expiration emails not sent:
- Expired passkeys intentionally do not receive reminder emails
- Verify DaysRemaining is positive (zero or negative triggers no email)
TLS certificate verification failures:
- "STARTTLS failed: tls: failed to verify certificate" or "TLS dial failed"
- Common cause: SMTP relay hostname differs from certificate CN/SAN (e.g., smtp.company.com forwards to smtp.gmail.com)
- Temporary fix: set skip_tls = true in [smtp] config (logs WARN per send)
- Proper fix: configure the SMTP server with a valid certificate matching its hostname
- Check: 'net tls <smtp-host>:<port>' to inspect the certificate chain
Template rendering errors:
- Missing locale: unsupported language falls back to English
- Template rendering failures prevent email send (no fallback)
Address validation failures (RFC 5321):
- "local part exceeds 64 character limit": email local part too long
- "domain part exceeds 255 character limit": email domain too long
- These limits are per RFC 5321 §4.5.3.1
Correlating email delivery with MTA logs:
- Every send logs Message-ID (success and failure)
- Use Message-ID to trace through relay MTAs, bounce reports, DMARC feedback
- Search gateway logs: 'logs search message_id=<value>'
Relationships
Module dependencies and interactions:
- OTP authentication: Delivers one-time passwords for email-based auth.
- Certificate management: Sends renewal notification emails with cert and CA bundle attachments.
- WebAuthn/Passkey: Sends passkey expiration reminder emails.
- Notification service: Uses SMTP for email channel delivery alongside webhooks for multi-channel routing.
- Directory: User full name lookup for personalized email greetings.
- Localization: Localized email text loaded from locale files.
- Configuration: Reads [smtp] TOML section. All settings hot-reloadable.
- Admin CLI: ‘smtp health’ and ‘smtp test’ commands for diagnostics.
Logs
Log entries emitted by the smtp module. Search with: logs search “smtp” Levels: ERROR > WARN > INFO > DEBUG > TRACE. AUDIT = persisted to tamper-proof audit log.
PAT expiry callback (init):
smtp.pat_expiry       INFO AUDIT  Personal Access Token expired
TLS certificate warnings (sendViaSSL / sendViaSTARTTLS):
smtp.send             WARN        TLS certificate verification failed, retrying with skip_tls=true — not recommended for production, configure a valid certificate
smtp.send             WARN        STARTTLS certificate verification failed, retrying with skip_tls=true — not recommended for production, configure a valid certificate
Magic link validation (SendMagicLinkEmail):
smtp.magiclink        WARN AUDIT  Magic link email blocked — invalid sealed return URL
Skip notifications (SendPasskeyExpirationEmail / SendVPNPSKExpirationEmail):
smtp.send             DEBUG       Skipping email for expired passkey
smtp.send             DEBUG       Skipping email for expired PSK
Generic email (SendEmail):
smtp.send             ERROR       SMTP send failed
smtp.send             INFO        Email sent successfully
OTP email (SendOTPEmail):
smtp.send             ERROR       SMTP send failed
smtp.send             INFO        Email sent successfully
Certificate renewal email (SendCertRenewalEmail):
smtp.send             ERROR       SMTP cert renewal send failed
smtp.send             INFO        Certificate renewal email sent
Passkey expiration email (SendPasskeyExpirationEmail):
smtp.send             ERROR       SMTP passkey expiration send failed
smtp.send             INFO        Passkey expiration email sent
Magic link email (SendMagicLinkEmail):
smtp.send             ERROR       SMTP send failed
smtp.send             INFO        Magic link email sent
Test email (SendTestEmail):
smtp.test             ERROR       SMTP test email failed
smtp.test             INFO        SMTP test email sent
PAT created email (SendPATCreatedEmail):
smtp.pat_created      ERROR       PAT creation notification email failed
smtp.pat_created      INFO        PAT creation notification email sent
PAT revoked email (SendPATRevokedEmail):
smtp.pat_revoked      ERROR       PAT revocation notification email failed
smtp.pat_revoked      INFO        PAT revocation notification email sent
PAT expired email (SendPATExpiredEmail):
smtp.pat_expired      ERROR       PAT expiration notification email failed
smtp.pat_expired      INFO        PAT expiration notification email sent
Passkey created email (SendPasskeyCreatedEmail):
smtp.passkey_created  ERROR       Passkey creation notification email failed
smtp.passkey_created  INFO        Passkey creation notification email sent
Passkey revoked email (SendPasskeyRevokedEmail):
smtp.passkey_revoked  ERROR       Passkey revocation notification email failed
smtp.passkey_revoked  INFO        Passkey revocation notification email sent
TOTP created email (SendTOTPCreatedEmail):
smtp.totp_created     ERROR       TOTP creation notification email failed
smtp.totp_created     INFO        TOTP creation notification email sent
TOTP revoked email (SendTOTPRevokedEmail):
smtp.totp_revoked     ERROR       TOTP revocation notification email failed
smtp.totp_revoked     INFO        TOTP revocation notification email sent
Certificate created email (SendCertCreatedEmail):
smtp.cert_created     ERROR       Certificate creation notification email failed
smtp.cert_created     INFO        Certificate creation notification email sent
Certificate revoked email (SendCertRevokedEmail):
smtp.cert_revoked     ERROR       Certificate revocation notification email failed
smtp.cert_revoked     INFO        Certificate revocation notification email sent
Metrics
Prometheus metrics. Query with: metrics prometheus smtp_<name>
Email delivery:
smtp_emails_sent_total  counter  {type, result}  Emails sent per type and outcome
smtp_send_duration      latency  {type, result}  Email send duration per type and outcome
Label values:
  type: generic | otp | cert_renewal | passkey_expiration | vpn_enrollment | vpn_device_code | vpn_psk_expiration | magiclink | test | pat_created | pat_revoked | pat_expired | passkey_created | passkey_revoked | totp_created | totp_revoked | cert_created | cert_revoked
  result: success | failure
Note: Only core email types emit latency (generic, otp, cert_renewal, passkey_expiration, vpn_enrollment, vpn_device_code, vpn_psk_expiration, magiclink). All other types (test, pat_, passkey_, totp_, cert_) emit the counter only, with no latency metric.
Alerts:
rate(smtp_emails_sent_total{result="failure"}[5m]) > 5
  SMTP delivery issues
smtp_send_duration{quantile="0.99"} > 5s
  SMTP latency degradation
Persistent File Storage
Persistent on-disk storage for durable module data — supports shared NFS and per-node replication
Overview
Provides persistent, crash-safe file storage for modules that need durable on-disk data. Two deployment modes: shared (NFS) where all nodes see the same filesystem, and replicated (local) where each node maintains its own copy with broadcast synchronization.
Core capabilities:
- Module-namespaced directories (each module gets isolated storage)
- Atomic writes via temporary file + rename pattern (crash-safe)
- Optional file locking via flock for NFS shared mode
- JSON marshaling/unmarshaling for structured data
- Full file lifecycle: Save, Load, Delete, Move, List, Exists
- Path traversal protection with multi-layer validation
- Fuzz-tested security boundary (traversal, null bytes, unicode attacks)
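To make the multi-layer path validation concrete, here is a minimal sketch that mirrors the checks this page lists under the ErrPathTraversal troubleshooting entry. It is an assumption-laden illustration for POSIX paths: resolve_path and PathTraversalError are hypothetical names, and the real module may order or implement its checks differently.

```python
import os.path


class PathTraversalError(ValueError):
    """Raised when a module name or subpath would escape the module directory."""


def resolve_path(base_path: str, module: str, subpath: str) -> str:
    # Layer 1: module names may not contain separators or '..'.
    if not module or "/" in module or "\\" in module or ".." in module:
        raise PathTraversalError("invalid module name")
    # Layer 2: reject null bytes and absolute subpaths.
    if "\x00" in module or "\x00" in subpath:
        raise PathTraversalError("null byte in path")
    if subpath.startswith(("/", "\\")):
        raise PathTraversalError("subpath must be relative")
    # Layer 3: no '..' may survive path cleaning.
    cleaned = os.path.normpath(subpath)
    if cleaned == ".." or cleaned.startswith(".." + os.sep):
        raise PathTraversalError("subpath contains traversal sequence")
    # Layer 4: the resolved path must stay inside base_path/module.
    module_dir = os.path.abspath(os.path.join(base_path, module))
    target = os.path.abspath(os.path.join(module_dir, cleaned))
    if os.path.commonpath([module_dir, target]) != module_dir:
        raise PathTraversalError("resolved path escapes module directory")
    return target


if __name__ == "__main__":
    print(resolve_path("/shared", "webauthn", "passkeys/active/abc123.json"))
```

Each layer alone would stop most attacks; stacking them means a bug in one check does not immediately open the boundary.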
Storage modes:
Shared (NFS): All nodes see the same files. Operations are local only (no broadcast needed). File locking prevents race conditions between nodes.
  Example path: /shared/webauthn/passkeys/active/abc123.json
Replicated (Local): Each node maintains its own filesystem. Write operations are replicated to all nodes. No locking needed since each node owns its local copy.
  Example path: /data/webauthn/passkeys/active/abc123.json
File permissions: 0644 (files), 0755 (directories). Module directories are created on demand during Save operations.
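The temporary file + rename pattern named in the capability list, combined with the permissions above, can be sketched as follows. This is a generic illustration of the technique rather than the module's implementation; save_json is a hypothetical helper and file locking is omitted.

```python
import json
import os
import tempfile


def save_json(path: str, data) -> None:
    """Write data as indented JSON so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, mode=0o755, exist_ok=True)  # created on demand
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # force contents to disk before the rename
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)     # atomic swap on POSIX filesystems
    except BaseException:
        try:
            os.unlink(tmp_path)        # do not leave orphaned .tmp files behind
        except OSError:
            pass
        raise


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    target = os.path.join(root, "webauthn", "passkeys", "active", "abc123.json")
    save_json(target, {"id": "abc123"})
    print(open(target, encoding="utf-8").read())
```

A crash before os.replace leaves the previous file untouched; a crash after it leaves the complete new file.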
Config
Configuration under [filesystem]:
[filesystem]
base_path = "/shared"  # Root directory for all module storage
mode = "shared"        # "shared" (NFS) or "local" (replicated per node)
use_flock = true       # Enable file locking (recommended for shared mode)
Mode selection guidance:
shared: Use when all nodes mount the same NFS/distributed filesystem.
- Set use_flock = true to prevent concurrent write races
- Operations are local only (no cluster replication needed)
- Simplest setup, but requires reliable NFS infrastructure
local (replicated): Use when each node has independent local storage.
- Write operations (Save, Delete, Move) are replicated to all nodes
- Read operations (Load, List, Exists) are local only
- No file locking needed (each node owns its storage)
- More resilient to NFS failures, but eventual consistency
Operation routing by mode:
Shared mode:
  Save, Load, Delete, Move -> all execute locally (no cluster broadcast)
Replicated mode:
  Save, Delete, Move -> replicated to all cluster nodes
  Load -> local only (read from local storage)
Hot-reloadable: None. Changes to base_path, mode, or use_flock require restart.
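The routing rules above reduce to a small decision table. The sketch below expresses them as a function; is_replicated is a hypothetical name used only for illustration, and the mode strings are the two values accepted by the mode setting.

```python
WRITE_OPS = frozenset({"Save", "Delete", "Move"})
READ_OPS = frozenset({"Load", "List", "Exists"})


def is_replicated(mode: str, operation: str) -> bool:
    """Return True when the operation is sent to all cluster nodes."""
    if operation not in WRITE_OPS | READ_OPS:
        raise ValueError(f"unknown operation: {operation!r}")
    if mode == "shared":
        return False                   # every node already sees the same files
    if mode == "local":
        return operation in WRITE_OPS  # only writes are replicated
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("shared", "local"):
        for op in ("Save", "Load"):
            print(mode, op, is_replicated(mode, op))
```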
Troubleshooting
Common symptoms and diagnostic steps:
File not found after Save (replicated mode):
- Verify Save used cluster-wide replication (replicated mode requires it)
- Check if querying node received the write replication (network partition)
- Replication is eventually consistent; small delay before Load on other nodes
- Verify base_path is correct on all nodes (must match across cluster)
Permission denied errors:
- Check filesystem permissions: files need 0644, directories need 0755
- Verify the hexon process user has write access to base_path
- NFS mount options: ensure no_root_squash or correct uid/gid mapping
- SELinux/AppArmor may block writes to NFS mounts
Path traversal error (ErrPathTraversal):
- Module name contains '/', '\', or '..' (invalid characters)
- Subpath starts with '/' or '\' (must be relative)
- Subpath contains '..' traversal sequences after path cleaning
- Resolved path escapes the module directory boundary
- This is a security feature; do not attempt to bypass it
File locking issues (shared/NFS mode):
- Stale locks after crash: flock is released on process exit by the OS
- NFS lock daemon (lockd/statd) must be running on all nodes
- NFSv4 has built-in locking; NFSv3 requires separate lock services
- Deadlock: operations hold locks briefly (JSON marshal + write + rename)
- If use_flock = false on shared mode, concurrent writes may corrupt files
Atomic write failures:
- Disk full: temporary file creation fails before rename
- Cross-device rename: base_path and temp dir must be on same filesystem
- Check disk space: df -h on the base_path partition
- Temp file cleanup: orphaned .tmp files indicate interrupted writes
List operation returns empty:
- Verify the subpath directory exists (directories created on Save only)
- Check glob pattern syntax (uses filepath.Glob matching rules)
- Pattern is matched against filenames only, not full paths
- Module directory is base_path/module_name/subpath
Data corruption or invalid JSON:
- Atomic writes prevent partial writes; corruption suggests disk issues
- NFS cache coherence: mount with actimeo=0 for immediate consistency
- Check for concurrent writes without flock enabled
- Validate JSON: load the file directly and check for syntax errors
Architecture
Write path (Save operation):
1. Validate path (module name + subpath traversal checks)
2. Create module directory tree if needed (MkdirAll with 0755)
3. Marshal data to JSON with indentation
4. Create temporary file in same directory
5. Write JSON content to temporary file
6. Sync to disk (fsync)
7. Atomic rename: tmp file -> target path
8. Optional: acquire/release flock around steps 4-7 (shared mode)
Read path (Load operation):
1. Validate path
2. Read file contents (os.ReadFile)
3. Unmarshal JSON into interface{}
4. Return data with Found=true, or Found=false if file not found
File locking (shared mode only):
Uses syscall.Flock with LOCK_EX (exclusive) for writes and LOCK_SH (shared) for reads. Locks are advisory and only effective when all accessors use flock. Lock scope is per-file, not per-directory.
Module isolation:
Each module's storage is confined to base_path/module_name/. Path validation ensures no module can read or write outside its own directory. The validation is defense-in-depth: multiple checks at different levels prevent escape.
Relationships
Module dependencies and interactions:
- webauthn: Stores passkey credentials as JSON files. Uses shared mode for cross-node passkey availability. Files organized in active/revoked subdirectories.
- acme (CA): Stores issued certificates, private keys, and ACME account data. Requires persistent storage that survives restarts.
- config: Filesystem base_path and mode read from TOML configuration. No hot-reload; changes require restart.
- telemetry: Structured logging for all file operations (save, load, delete, move) with module name, subpath, and error details.
- memory (memorystorage): Complementary storage. Use filesystem for persistent data that must survive restarts; use memory for ephemeral data with TTL. Some modules use both: memory for fast lookups, filesystem for durable backup.
- cluster: In replicated mode, cluster health affects write propagation. Node failures may result in missed broadcasts (eventually consistent).
Logs
Log entries emitted by this module. Search with: logs search “storage.filesystem” Levels: ERROR > WARN > INFO > DEBUG > TRACE.
Save Operation:
storage.filesystem  WARN   Path traversal attempt blocked
storage.filesystem  ERROR  Failed to create directory
storage.filesystem  ERROR  Failed to marshal JSON
storage.filesystem  ERROR  Failed to save file
storage.filesystem  DEBUG  File saved
Load Operation:
storage.filesystem  WARN   Path traversal attempt blocked
storage.filesystem  DEBUG  File not found
storage.filesystem  ERROR  Failed to read file
storage.filesystem  ERROR  Failed to unmarshal JSON
storage.filesystem  DEBUG  File loaded
Delete Operation:
storage.filesystem  WARN   Path traversal attempt blocked
storage.filesystem  DEBUG  File not found for deletion
storage.filesystem  ERROR  Failed to delete file
storage.filesystem  DEBUG  File deleted
Move Operation:
storage.filesystem  WARN   Path traversal attempt blocked (source)
storage.filesystem  WARN   Path traversal attempt blocked (target)
storage.filesystem  ERROR  Failed to create target directory
storage.filesystem  ERROR  Failed to move file
storage.filesystem  DEBUG  File moved
List Operation:
storage.filesystem  WARN   Path traversal attempt blocked
storage.filesystem  DEBUG  Directory not found
storage.filesystem  ERROR  Failed to read directory
storage.filesystem  DEBUG  Directory listed
Exists Operation:
storage.filesystem  WARN   Path traversal attempt blocked
storage.filesystem  DEBUG  File existence checked
Metrics
No Prometheus metrics are emitted by this module.
Distributed Memory Storage
Ephemeral key-value storage shared across all nodes — used by sessions, OTP, PoW, and tokens
Overview
Provides fast, in-memory key-value storage replicated across all cluster nodes with automatic TTL expiration. Used by sessions, OTP codes, PoW challenges, OIDC tokens, and other time-sensitive data that needs cluster-wide availability. Data expires automatically — no manual cleanup required.
Core capabilities:
- Namespace-isolated caches (cache types prevent key collisions)
- Automatic TTL-based expiration with background eviction every 30 seconds
- OnSet and OnDelete callback support (fire-and-forget, local only)
- Thread-safe operations with mutex protection
- Cluster-wide replication (writes replicated to all nodes)
- Eventually consistent reads (local only, no network overhead)
- NATS JetStream KV persistence for crash recovery (optional)
- Peer-to-peer bootstrap fallback when JetStream unavailable
- Two-tier hot/cold cache for large-scale deployments (30M+ users)
- SetNX for atomic set-if-not-exists (distributed locks)
- Touch for TTL renewal without value modification
Consistency model:
Reads (Get): Local first, O(1). With cold_enabled=true, falls through to KV on miss (~1ms).
Reads (All): Local only, O(1). Returns hot entries only (no KV scan in cold mode).
Writes (Set, Delete): Local immediate + optional replication to all nodes
Writes are best-effort with no quorum requirement by default. For strong consistency, use cluster-wide replication with quorum confirmation.
Storage architecture: two-level map structure
caches[cache_type][key] -> storageEntry with Value, Expiration, Callbacks
Data types stored in memory must be compatible with the cluster serialization layer. Custom structs, slices, and maps with custom types are supported.
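The two-level map, TTL semantics, and SetNX behaviour described in this overview can be modelled in a few lines. The sketch below is a single-node toy with no replication, callbacks, or cold tier; MemoryStore and its method names are hypothetical, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import threading
import time


class MemoryStore:
    """caches[cache_type][key] -> (value, expires_at), where None means no expiry."""

    def __init__(self, clock=time.monotonic):
        self._caches = {}
        self._lock = threading.Lock()
        self._clock = clock

    def _live(self, cache_type, key):
        """Return the entry if present and not expired. Caller holds the lock."""
        entry = self._caches.get(cache_type, {}).get(key)
        if entry is None:
            return None
        expires_at = entry[1]
        if expires_at is not None and self._clock() >= expires_at:
            return None  # expired entries are invisible even before eviction
        return entry

    def _store(self, cache_type, key, value, ttl):
        expires_at = self._clock() + ttl if ttl > 0 else None  # ttl 0 = never expires
        self._caches.setdefault(cache_type, {})[key] = (value, expires_at)

    def set(self, cache_type, key, value, ttl=0.0):
        with self._lock:
            self._store(cache_type, key, value, ttl)

    def get(self, cache_type, key):
        """Return (value, found)."""
        with self._lock:
            entry = self._live(cache_type, key)
        return (entry[0], True) if entry is not None else (None, False)

    def set_nx(self, cache_type, key, value, ttl=0.0):
        """Set only if no live entry exists; check and store happen under one lock."""
        with self._lock:
            if self._live(cache_type, key) is not None:
                return False
            self._store(cache_type, key, value, ttl)
            return True

    def evict_expired(self):
        """Physically remove expired entries; returns how many were removed."""
        now, removed = self._clock(), 0
        with self._lock:
            for entries in self._caches.values():
                expired = [k for k, (_, exp) in entries.items()
                           if exp is not None and now >= exp]
                for key in expired:
                    del entries[key]
                removed += len(expired)
        return removed
```

The separate cache_type level is what keeps a session key and an OTP key with the same name from colliding.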
Config
Configuration under [cluster] (memory persistence):
[cluster] cluster_path = "/var/lib/hexon/cluster" # Base path for JetStream storage persist_memory = true # Use FileStorage for KV bucket memory_kv_max_write = 10 # Max concurrent KV writes (1-100)When persist_memory = true and cluster_path is set:
- NATS JetStream KV bucket "hexon_storage_memory" is created - Writes are asynchronously persisted to JetStream after local cache update - Concurrent KV writes throttled by memory_kv_max_write (default 10) - On startup, all entries are bootstrapped from JetStream KV - JetStream uses Raft consensus in 3+ node clusters for durability - Data survives full cluster restartsWhen persist_memory = false or cluster_path is unset:
- KV bucket uses MemoryStorage (data lost on restart) - Falls back to peer-to-peer bootstrap from live cluster nodes - Suitable for truly ephemeral data (PoW challenges, rate limit counters)Key encoding for NATS KV:
NATS KV keys only allow [-/_=\.a-zA-Z0-9]+. Keys from external sources (LDAP groups with spaces, email addresses) are base64url encoded: Format: {cacheType}/{base64url(key)} Example: "directory_groups/UmVwbGljYXRpb24gQWRtaW5pc3RyYXRvcnM"Bootstrap sequence on node startup:
1. Attempt to read all entries from NATS JetStream KV 2. Populate in-memory cache with non-expired entries 3. If JetStream unavailable, request data from cluster peers 4. Merge peer responses into local cache 5. Live broadcasts during bootstrap take precedence over stale KV dataHot/cold cache:
[memory] cold_enabled = true # Two-tier hot/cold cache (default: true) cold_ttl = "72h" # How long idle entries stay in hot cacheWhen cold_enabled = true (default):
- Entries load lazily from durable storage on first access - Subsequent reads are in-memory - Idle entries evicted from memory after cold_ttl (still durable) - No startup warmup — node starts instantly, cache fills on demand - Only active entries consume memory; idle entries are served from durable storage on demand - Cluster-wide features that need to enumerate cache contents (admin force-logout safety-net, audit listing) are fully supportedWhen cold_enabled = false (operator opt-out):
- Full in-memory replication; all entries loaded at startup
- Best for very small deployments where memory headroom is generous and cluster-wide enumeration features are not needed
- A startup warning announces the degraded mode for cluster-wide operations: admin force-logout may not catch sessions on peer nodes if the per-user index is stale
No hot-reloadable settings. Changes require a full restart.
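The key encoding rule in this section can be illustrated with a short sketch. It assumes the base64url output is written without "=" padding, which is inferred from the documented example rather than confirmed; the function names are hypothetical and not part of hexon.

```python
import base64
import re

# Character set the document gives for NATS KV keys.
NATS_KEY_RE = re.compile(r"^[-/_=.a-zA-Z0-9]+$")


def encode_kv_key(cache_type: str, key: str) -> str:
    """Build "{cacheType}/{base64url(key)}"; stripping padding is an assumption."""
    raw = base64.urlsafe_b64encode(key.encode("utf-8")).decode("ascii")
    return f"{cache_type}/{raw.rstrip('=')}"


def decode_kv_key(kv_key: str) -> tuple:
    """Split on the first "/" and restore padding before decoding."""
    cache_type, _, encoded = kv_key.partition("/")
    padded = encoded + "=" * (-len(encoded) % 4)
    return cache_type, base64.urlsafe_b64decode(padded).decode("utf-8")


kv_key = encode_kv_key("directory_groups", "Replication Administrators")
print(kv_key)  # directory_groups/UmVwbGljYXRpb24gQWRtaW5pc3RyYXRvcnM
print(decode_kv_key(kv_key))
```

A decoder like this can help when inspecting bucket contents by hand, for example to work out which group or address a key listed in hexon_storage_memory refers to.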
Troubleshooting
Common symptoms and diagnostic steps:
Key not found after Set (cross-node):
- Verify Set used cluster-wide replication, not local-only (replication required for cross-node visibility)
- Reads are local only; small propagation delay is normal
- Use quorum-confirmed replication before reading for strong consistency
- Check cluster health: nodes must be reachable for broadcast delivery
- Verify the stored type is compatible with cluster serialization
Serialization errors (encoding/decoding failures):
- Custom types stored in memory must be compatible with cluster serialization
- Type registration happens during module initialization
- Built-in types (string, int, bool, []byte) work out of the box
- Error message includes the unregistered type name
TTL expiration not working (entries persist beyond TTL):
- Background eviction runs every 30 seconds (not instantaneous)
- Expired entries are immediately invisible to Get (Found=false)
- Physical cleanup happens on next eviction cycle
- Very large caches (100K+ entries) may slow eviction scans
- Check if TTL was set to 0 (zero TTL means no expiration)
OnDelete callback not firing:
- Callbacks are local only (fire on the node that runs eviction)
- Callbacks are fire-and-forget (errors are logged but not returned)
- The callback module and operation must exist and be registered
- Check telemetry logs at ERROR level for callback failures
- Callbacks do NOT fire on nodes that receive broadcast deletions (only the originating node triggers the callback)
Data lost after cluster restart:
- Verify persist_memory = true in [cluster] config
- Verify cluster_path is set and writable
- Check NATS JetStream health (3+ nodes needed for Raft consensus)
- 2-node clusters: JetStream may not achieve quorum, data at risk
- Without persistence, data is only in memory (lost on restart)
- Bootstrap logs show how many entries were recovered from KV
Memory usage growing unbounded:
- Check TTL values: missing or zero TTL entries never expire
- Use All operation to inspect cache sizes per cache type
- Per-entry overhead: approximately 150 bytes plus key and value sizes
- Monitor eviction cycle: entries should be cleaned every 30 seconds
- Consider partitioning large cache types into smaller namespaces
SetNX returning Set=false unexpectedly:
- Key already exists in local cache (including expired-but-not-evicted)
- Another node set the key via broadcast before your SetNX
- SetNX is local atomic only; not a distributed lock by itself
- For distributed locking, combine SetNX with cluster-wide replication + short TTL
Bootstrap failures on startup:
- JetStream KV unavailable and no peer nodes responding
- Node starts with empty cache; data populates as broadcasts arrive
- Check NATS connection health and cluster discovery
- Verify cluster_path directory exists and has correct permissions
- Base64url decoding errors: corrupted KV keys (manual cleanup needed)
KV "too many requests" errors at startup (memory.kv.put_error):
- Caused by bulk operations (e.g. directory fullSync) spawning many concurrent KV writes that overwhelm JetStream rate limits
- Each sync fires ~3 Set() calls per user and ~2 per group
- A directory with 40 users and 20 groups = ~160 concurrent writes (40 × 3 + 20 × 2)
- Fix: increase memory_kv_max_write in [cluster] config (default 10, max 100)
- These errors are non-fatal: data is already in local cache, only persistence is delayed. Entries will be persisted on subsequent writes or next restart.
- Monitor: logs search "memory.kv.put_error" --since=5m
Architecture
Data flow for write operations:
1. Caller invokes Set/Delete (local-only or cluster-wide)
2. Local cache updated immediately (mutex-protected)
3. OnSet callback triggered if registered (fire-and-forget)
4. If cluster-wide: replicated to all cluster nodes
5. Async persistence goroutine acquires semaphore slot (bounded by memory_kv_max_write)
6. KV write to NATS JetStream (best-effort; skipped on shutdown)
7. JetStream Raft replicates to follower nodes (3+ node clusters)
Note: SyncSet bypasses the semaphore (synchronous, caller-blocking, used for signing keys)
Data flow for read operations:
1. Caller invokes Get/All
2. Local cache lookup (O(1) for Get, O(n) for All)
3. Expired entries filtered out (Found=false)
4. If cold_enabled=true and Get misses local: KV fallback (~1ms), lazy-load into hot cache
5. All always returns hot entries only (no KV scan in cold mode)
Background eviction loop:
1. Wakes every 30 seconds
2. Scans all cache types and all entries
3. Identifies entries with Expiration < now (TTL eviction)
4. Deletes expired entries from local cache and KV
5. Replicates eviction to cluster nodes, triggers OnDelete callbacks
6. Cold eviction pass (cold_enabled=true only): entries idle > cold_ttl are removed from memory only; they stay in KV for future lazy-load, and no callbacks are triggered
NATS JetStream KV architecture (when persistence enabled):
- Bucket: hexon_storage_memory
- Raft consensus for writes (3+ nodes)
- Leader election with automatic failover
- Write-ahead log replicated to followers
- Can tolerate floor((N-1)/2) node failures (e.g., 1 of 3, 2 of 5)
Peer-to-peer bootstrap fallback:
- Used when JetStream is unavailable (2-node clusters, JetStream down)
- Requests data from all cluster peers
- Merges responses, preferring newest entries on conflict
- Graceful degradation: memory storage works without persistence
Relationships
Module dependencies and interactions:
- sessions: Primary consumer. Stores user sessions with 12-24h TTL. Uses OnDelete callback for session cleanup and index removal. Session indices stored in separate “sessions_index” cache type.
- OTP: Stores one-time passwords with 5-10 minute TTL. Keys are hashed email addresses for privacy. Replicated cluster-wide for OTP availability. OnDelete triggers expiration notifications.
- OIDC provider: Stores authorization codes, access tokens, refresh tokens, and DPoP JTI values. Each in separate cache types with appropriate TTLs (codes: 5-10min, tokens: 1-24h). Critical for OAuth2 flow integrity.
- Proof-of-work: Stores proof-of-work challenge tokens with short TTL. Local-only storage (challenges are node-specific).
- WebAuthn: Stores WebAuthn challenges during registration and authentication ceremonies. Short TTL (5 minutes).
- Kerberos: Stores Kerberos ticket data with ticket lifetime TTL.
- firewall: Uses SetNX for cluster-wide hostname tracking (wildcard DNS). Replicated to all nodes for cross-node hostname state. OnDelete for TTL-based rule cleanup.
- storage.filesystem: Complementary module. Use memory for fast ephemeral lookups; use filesystem for persistent data surviving restarts.
- telemetry: All operations logged at DEBUG level. Errors (callback failures, eviction issues) logged at ERROR. Metrics for cache sizes and hit rates.
- cluster (NATS): JetStream KV persistence layer. Raft consensus provides durability for 3+ node clusters. Bootstrap reads from JetStream on startup.
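The consumers listed above rely on the TTL, SetNX, and OnDelete behaviour described in the Troubleshooting and Architecture sections. The toy model below is a single-node sketch of those rules only: it is not the module's implementation, the class and method names are hypothetical, and replication, KV persistence, and the cold tier are left out.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Entry:
    value: Any
    expires_at: Optional[float]  # None models a zero TTL (no expiration)


class ToyCache:
    """Single-node model of TTL visibility, SetNX, and eviction callbacks."""

    def __init__(self, clock: Callable[[], float],
                 on_delete: Optional[Callable[[str], None]] = None):
        self._clock = clock
        self._on_delete = on_delete
        self._entries: Dict[str, Entry] = {}

    def set(self, key: str, value: Any, ttl: float = 0) -> None:
        expires_at = self._clock() + ttl if ttl > 0 else None
        self._entries[key] = Entry(value, expires_at)

    def get(self, key: str) -> Tuple[Any, bool]:
        entry = self._entries.get(key)
        if entry is None or self._expired(entry):
            return None, False  # expired entries are invisible before eviction
        return entry.value, True

    def setnx(self, key: str, value: Any, ttl: float = 0) -> bool:
        # An expired-but-not-evicted key still counts as existing.
        if key in self._entries:
            return False
        self.set(key, value, ttl)
        return True

    def evict(self) -> List[str]:
        """One pass of the background loop (documented as running every 30 seconds)."""
        expired = [k for k, e in self._entries.items() if self._expired(e)]
        for key in expired:
            del self._entries[key]
            if self._on_delete is not None:
                self._on_delete(key)  # fires only on the node running eviction
        return expired

    def _expired(self, entry: Entry) -> bool:
        return entry.expires_at is not None and entry.expires_at < self._clock()


now = [0.0]          # fake clock so the example is deterministic
deleted: List[str] = []
cache = ToyCache(clock=lambda: now[0], on_delete=deleted.append)
cache.set("session:alice", "token", ttl=10)
cache.set("config", "static")                 # zero TTL: never expires
now[0] = 11.0
print(cache.get("session:alice"))             # (None, False)
print(cache.setnx("session:alice", "new"))    # False
cache.evict()
print(cache.setnx("session:alice", "new"))    # True
```

The second SetNX succeeds only after the eviction pass, which is why a caller can see Set=false for a key that Get already reports as missing.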
Logs
Log entries emitted by this module. Search with: logs search "memory"
Levels: ERROR > WARN > INFO > DEBUG.
Bootstrap — KV:
memory.bootstrap.start            INFO   Starting JetStream KV bootstrap
memory.kv.init                    DEBUG  Requesting JetStream KV bucket
memory.kv.retry                   DEBUG  JetStream not ready, retrying in {duration} (attempt N/M)
memory.bootstrap.kv_unavailable   INFO   JetStream KV unavailable after retries, falling back to peer broadcast
memory.kv.ready                   DEBUG  JetStream KV bucket ready
memory.bootstrap.cold             INFO   Cold mode enabled — skipping bootstrap warmup, cache will populate on demand
memory.bootstrap.read_keys        DEBUG  Reading keys from JetStream KV
memory.bootstrap.empty            INFO   JetStream KV bucket is empty, nothing to restore
memory.bootstrap.failed           ERROR  Failed to read KV keys
memory.bootstrap.keys_found       DEBUG  Found N keys in JetStream KV
memory.bootstrap.process_key      DEBUG  Processing KV key
memory.bootstrap.retry_transient  INFO   Retrying N keys after transient NATS errors (JetStream leader stabilizing)
memory.bootstrap.complete         INFO   Bootstrap complete (loaded, skipped, errors, duration)
Bootstrap — Key Processing:
memory.bootstrap.get_tombstone         DEBUG  KV key listed but not found (tombstone)
memory.bootstrap.get_transient         DEBUG  Transient NATS error, will retry
memory.bootstrap.get_error             WARN   Failed to get KV entry
memory.bootstrap.decode_error          WARN   Failed to decode KV entry, deleting corrupted key
memory.bootstrap.decode_error_cleanup  WARN   Failed to delete corrupted KV entry
memory.bootstrap.parse_error           WARN   Failed to parse KV key format
memory.bootstrap.skip_expired          DEBUG  Skipping expired entry
memory.bootstrap.skip_exists           DEBUG  Skipping key (already in memory from broadcast)
memory.bootstrap.skip_deleted          DEBUG  Skipping key (deleted during bootstrap)
memory.bootstrap.loaded                DEBUG  Loaded entry from KV
memory.bootstrap.tracking_stopped      DEBUG  Stopped tracking deletes, bootstrap complete / peer bootstrap complete
memory.bootstrap.track_delete          DEBUG  Tracking delete during bootstrap
Bootstrap — Peer Fallback:
memory.bootstrap.peers_encryption_timeout  WARN   Encryption not ready after timeout, proceeding with bootstrap anyway
memory.bootstrap.peers_wait_encryption     DEBUG  Waiting for encryption to be ready (X3DH/shared key sync)
memory.bootstrap.peers_start               INFO   Starting peer-to-peer bootstrap via Broadcast
memory.bootstrap.peers_failed              ERROR  Failed to broadcast BootstrapGetAll
memory.bootstrap.peers_responses           INFO   Collected responses from N peers
memory.bootstrap.peers_timeout             WARN   Failed to collect all peer responses
memory.bootstrap.peers_operation_error     WARN   Operation error from node
memory.bootstrap.peers_invalid_response    WARN   Invalid response type from node
memory.bootstrap.peers_merge               DEBUG  Merging snapshot from node
memory.bootstrap.peers_complete            INFO   Peer bootstrap complete (loaded, skipped, duration)
KV Persistence:
memory.kv.encode_error     WARN   Failed to encode entry for KV
memory.kv.put_error        WARN   Failed to write to KV
memory.kv.persist_success  DEBUG  Entry persisted to KV
memory.kv.delete_error     WARN   Failed to delete from KV
memory.kv.delete_success   DEBUG  Entry deleted from KV
CRUD Operations:
memory  DEBUG  Memory storage Set
memory  DEBUG  Triggering OnSet callback
memory  WARN   OnSet callback failed
memory  DEBUG  Memory storage Delete
memory  DEBUG  Triggering OnDelete callback
memory  WARN   OnDelete callback failed
memory  DEBUG  Memory storage All
memory  DEBUG  Memory storage Touch
memory  DEBUG  Memory storage SetNX
memory  DEBUG  Memory storage SyncSet
memory  DEBUG  Memory storage SyncGet (lazy-loaded from KV)
Bootstrap Snapshot:
memory.bootstrap  DEBUG  BootstrapGetAll returning snapshot
Cold Cache:
memory.cold  WARN   Corrupted KV entry, deleting
memory.cold  DEBUG  Cold eviction sweep
Eviction:
memory.eviction  INFO  Eviction loop shutting down gracefully
Metrics
Prometheus metrics. Query with: metrics prometheus memory_storage_<name>
CRUD Counters:
memory_storage_gets       counter  {cache_type, result}  Cache reads (result: hit, miss, cold_hit, decode_error, expired)
memory_storage_sets       counter  {cache_type}          Cache writes
memory_storage_deletes    counter  {cache_type}          Cache deletions
memory_storage_touches    counter  {cache_type, result}  TTL renewals (result: hit, miss, expired)
memory_storage_setnx      counter  {cache_type, result}  Atomic set-if-not-exists (result: set, exists)
memory_storage_sync_sets  counter  {cache_type}          Synchronous KV-persisted writes
memory_storage_sync_gets  counter  {cache_type, result}  Synchronous reads with KV fallback (result: hit, miss, kv_hit, decode_error, expired)
memory_storage_evictions  counter  {cache_type, reason}  Entries evicted (reason: expired, cold)
Gauges:
memory_storage_entries  gauge  {cache_type}  Current entry count per cache type (updated via GetCacheStats)
Alerts:
rate(memory_storage_gets{result="miss"}[5m]) > 100
    High cache miss rate (check TTLs or missing Set calls)
rate(memory_storage_evictions{reason="expired"}[5m]) > 500
    Excessive TTL evictions (entries expiring faster than expected)
rate(memory_storage_gets{result="decode_error"}[5m]) > 0
    Corrupted KV entries (cold cache decode failures)
Telemetry & Logging
Structured logging with OTLP export, per-module log levels, audit class, ring buffer queries, and trace correlation
Overview
The telemetry module provides structured logging with key-value pairs, multiple output targets, and cross-module trace correlation for cluster-wide observability.
Core capabilities:
- Structured logging with key-value pairs and fluent builder API
- Six log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL
- AUDIT log class: bypasses level filtering for security events
- Per-module log level configuration (override global level per module)
- OTLP gRPC log export to OpenTelemetry-compatible collectors
- Trace ID correlation across modules (128-bit hex IDs per request)
- In-memory ring buffer for admin CLI log queries
- JSON and human-readable output formats
- Security context builder for auth-related log entries
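As a rough illustration of the trace ID and output format capabilities listed above, the sketch below generates a 128-bit hex ID and renders one entry both ways. The field names and the human-readable layout are hypothetical, and it is an assumption that the truncated 8-character form is the prefix of the full ID.

```python
import json
import secrets


def new_trace_id() -> str:
    """Random 128-bit identifier rendered as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def render(entry: dict, log_format: str) -> str:
    """Render one log entry; the field names are hypothetical."""
    if log_format == "json":
        return json.dumps(entry, sort_keys=True)  # carries the full trace ID
    short_id = entry["trace_id"][:8]  # assumed: human form keeps the first 8 characters
    return f'{entry["level"].upper():<5} [{short_id}] {entry["module"]}: {entry["msg"]}'


entry = {"level": "info", "module": "oidc", "msg": "token issued", "trace_id": new_trace_id()}
print(render(entry, "json"))
print(render(entry, "human"))
```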
Output modes:
stdout: Structured logs written to standard output (default)
otlp: Logs exported via gRPC to an OpenTelemetry collector
both: Simultaneous stdout and OTLP export
OTLP export includes:
- timestamp, severity, body (message), module attribute
- service.name, service.version, environment, host.name, host.ip
- Native OTLP TraceId field for trace-to-log correlation
- Batched async export via SDK log processor
Ring buffer:
Configurable in-memory buffer (default 10,000 entries) for admin CLI log queries ('logs tail', 'logs search'). Provides instant access to recent logs without external log aggregation. Set to 0 to disable.
Config
Configuration under [telemetry] section:
[telemetry]
log_level = "info"                     # Global: trace|debug|info|warn|error|fatal
log_format = "json"                    # Output format: "json" or "human"
output = "stdout"                      # Output target: "stdout", "otlp", or "both"
otlp_endpoint = "otel-collector:4317"  # Required when output is "otlp" or "both"
log_buffer_size = 10000                # Ring buffer entries for log queries (0 = disabled)
audit = true                           # Audit class: always display security events regardless of log_level

[telemetry.module_levels]
oidc = "debug"                         # Per-module override (module name = level)
webauthn = "info"
bastion = "trace"
OTLP endpoint format:
"host:port" - Plain gRPC connection
"http://host:port" - Insecure gRPC (http:// stripped, WithInsecure applied)
"https://host:port" - TLS gRPC connection
Compatible collectors: Grafana Alloy, OpenTelemetry Collector, Datadog Agent, Splunk OTel Collector, any OTLP/gRPC compatible receiver.
If the OTLP endpoint is unreachable at startup, the system falls back to stdout and logs a warning. gRPC connections are lazy (connect on first export).
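The three documented endpoint forms can be normalized with a small helper like the one below. This is a sketch for validating a config value before deployment, not the gateway's parser: the function name and the "plain", "insecure", and "tls" labels are this example's own.

```python
from typing import NamedTuple


class OtlpTarget(NamedTuple):
    address: str  # host:port with any scheme prefix removed
    mode: str     # "plain", "insecure", or "tls" (labels invented for this sketch)


def parse_otlp_endpoint(endpoint: str) -> OtlpTarget:
    endpoint = endpoint.strip()
    if endpoint.startswith("https://"):
        address, mode = endpoint[len("https://"):], "tls"
    elif endpoint.startswith("http://"):
        address, mode = endpoint[len("http://"):], "insecure"
    else:
        address, mode = endpoint, "plain"
    host, sep, port = address.rpartition(":")
    # A trailing slash or path leaves non-digits in the port and is rejected here.
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {endpoint!r}")
    return OtlpTarget(address, mode)


for value in ("otel-collector:4317", "http://otel-collector:4317", "https://otel-collector:4317"):
    print(value, "->", parse_otlp_endpoint(value))
```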
Audit class:
When audit = true (default), log entries marked with AsAudit() bypass level filtering. Security events (SFTP ops, SSH connections, admin commands, TLS protection) are always visible even when log_level is set to "error".
Hot-reloadable: log_level, module_levels, log_format.
Cold (restart required): output, otlp_endpoint, log_buffer_size, audit.
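How the global level, per-module overrides, and the audit class might combine can be sketched as a single predicate. This is a simplified model under stated assumptions: module names are matched exactly, an override simply replaces the global level for that module, and the function name is hypothetical. The actual precedence rules may be stricter.

```python
# Severity order taken from the six documented log levels.
LEVELS = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4, "fatal": 5}


def should_emit(level, module, global_level="info", module_levels=None,
                audit_enabled=True, is_audit=False):
    """Return True if an entry would be written under this simplified model."""
    if is_audit and audit_enabled:
        return True  # audit-class entries bypass level filtering
    # Assumed: exact-name lookup, falling back to the global level.
    threshold = (module_levels or {}).get(module, global_level)
    return LEVELS[level] >= LEVELS[threshold]


overrides = {"oidc": "debug", "webauthn": "info", "bastion": "trace"}
print(should_emit("debug", "oidc", "info", overrides))       # True
print(should_emit("debug", "webauthn", "info", overrides))   # False
print(should_emit("info", "sftp", "error", is_audit=True))   # True
```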
Troubleshooting
Common symptoms and diagnostic steps:
Logs not appearing in OTLP collector:
- Verify output is set to "otlp" or "both" in [telemetry]
- Check otlp_endpoint format (host:port, no trailing slash)
- Network connectivity: 'net tcp <collector-host>:<port>'
- Collector may reject due to resource limits or auth requirements
- Startup fallback: if endpoint was unreachable at startup, logs go to stdout
- Check: 'logs tail' to verify logs are being generated locally
Per-module log level not working:
- Verify [telemetry.module_levels] has exact module name (case-sensitive)
- Module names use dot notation: "oidc", "bastion.session", "identity.scim"
- Per-module level must be lower priority than global to have effect
- Check: 'config show telemetry' to verify active configuration
Ring buffer queries returning no results:
- Verify log_buffer_size > 0 (0 disables the ring buffer)
- Buffer is in-memory only; cleared on restart
- 'logs tail' shows most recent entries
- 'logs search <keyword>' filters by content
- Buffer wraps around: oldest entries are overwritten when full
Log format issues:
- "json": structured key-value JSON (recommended for log aggregation)
- "human": colored, readable format (recommended for development)
- Trace IDs: full 128-bit hex in JSON, truncated 8-char in human format
High log volume impacting performance:
- Raise global log_level to "warn" or "error"
- Use per-module levels to keep verbose logging only where needed
- OTLP batched export is async and does not block request processing
- Ring buffer size: reduce log_buffer_size if memory is a concern
Relationships
Module dependencies and interactions:
- All modules: Every module in the system uses telemetry for structured logging with trace correlation.
- Admin CLI: ‘logs tail’, ‘logs search’, ‘logs stats’, ‘logs anomalies’, ‘logs patterns’ commands query the ring buffer.
- Configuration: Reads [telemetry] section. Log level and format are hot-reloadable without restart.
- Cluster: Each node maintains its own ring buffer. Admin CLI log queries fan out to all nodes and merge results.
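The per-node ring buffer and the cluster-wide merge can be modelled in a few lines. This sketch assumes entries carry a timestamp and that merged results are ordered by it; the class, function, and node names are hypothetical, and the real fan-out happens over the cluster transport rather than in one process.

```python
import heapq
from collections import deque


class RingBuffer:
    def __init__(self, size: int):
        # deque(maxlen=...) drops the oldest entry once full; size 0 disables the buffer.
        self._entries = deque(maxlen=size) if size > 0 else None

    def append(self, ts: float, message: str) -> None:
        if self._entries is not None:
            self._entries.append((ts, message))

    def search(self, keyword: str) -> list:
        if self._entries is None:
            return []
        return [e for e in self._entries if keyword in e[1]]


def merged_search(buffers: dict, keyword: str) -> list:
    """Query every node's buffer and merge the results by timestamp."""
    per_node = [
        [(ts, node, msg) for ts, msg in buf.search(keyword)]
        for node, buf in buffers.items()
    ]
    # Each per-node list is already in append (time) order, as heapq.merge requires.
    return list(heapq.merge(*per_node))


node_a, node_b = RingBuffer(2), RingBuffer(2)
node_a.append(1, "oidc login ok")
node_a.append(3, "oidc token issued")
node_a.append(5, "oidc logout")      # overwrites the entry at ts=1
node_b.append(4, "oidc refresh")
print(merged_search({"a": node_a, "b": node_b}, "oidc"))
```

The entry at timestamp 1 is missing from the output because node_a's buffer holds only two entries, which mirrors the wrap-around behaviour noted under Troubleshooting.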
Logs
The telemetry module is the logging infrastructure itself — it does not emit structured log entries through its own pipeline. Diagnostic messages are written directly to stderr for bootstrap and shutdown scenarios where the log pipeline is unavailable.
Stderr diagnostics (not structured LogEntry calls):
[TELEMETRY] Failed to initialize OTLP exporter: <err> (falling back to stdout)
    Startup: OTLP gRPC connection failed, output mode reverts to stdout
Failed to marshal log entry: <err>
    Runtime: JSON encoding of a log entry failed (entry is dropped)
[TELEMETRY] OTLP provider shutdown error: <err>
    Shutdown: OTLP provider flush/close returned an error
[TELEMETRY] Shutdown complete: N logs processed, N logs dropped due to overflow
    Shutdown: final stats when logs were dropped (includes audit count if any)
These messages appear only in stderr, never in the structured log stream or ring buffer. They indicate infrastructure-level issues with the telemetry pipeline itself.
Metrics
Prometheus metrics emitted by the telemetry module. Query with: metrics prometheus telemetry_<name>
Audit event tracking:
telemetry_audit_log_entries_total       counter  {}  Audit-class entries successfully written
telemetry_audit_dropped_total           counter  {}  Audit-class entries dropped (channel overflow)
telemetry_converging_log_entries_total  counter  {}  Converging-class entries successfully written
All three counters have no labels (nil label map). They are incremented in the single backgroundWriter goroutine (no contention).
Alerts:
telemetry_audit_dropped_total > 0
    Audit entries lost: increase channel buffer or reduce log volume
rate(telemetry_audit_log_entries_total[5m]) == 0
    No audit events: possible pipeline failure or misconfiguration