Cluster & Operations

Admin Unix Socket

Server-side CLI access via Unix domain socket — run admin commands directly on the server without bastion

Overview

The admin socket enables operators to run admin CLI commands directly on the server host without opening a bastion SSH session.

The hexon binary operates in two modes:

  • Server mode (default): starts the gateway, creates the Unix socket listener
  • Client mode (hexon admin …): connects to the socket, sends a command, renders output

Same binary, same command registry, same execution path as bastion and MCP. Commands are executed as user “root” with source “cli” in the audit trail.

The socket is created at /tmp/hexon-admin.sock (override with HEXON_ADMIN_SOCK). File permissions are 0600 (owner-only read/write). Stale sockets from crashed instances are detected and cleaned up automatically on startup.

Usage

Commands that require a running server:

hexon admin cluster status
hexon admin proxy health
hexon admin sessions list --user=alice
hexon admin --json proxy backends
hexon admin --cluster health status

Help commands work offline (no running server needed):

hexon admin # List all commands
hexon admin help proxy # Detailed help for proxy command

Custom socket path:

HEXON_ADMIN_SOCK=/tmp/hexon.sock hexon admin ping

Exit codes: 0 on success, 1 on error.

Troubleshooting

Common errors:

  • “hexon is not running (socket not found at …)” — server not started or different socket path. Check HEXON_ADMIN_SOCK env var.
  • “hexon is not running” — socket file exists but connection refused. The previous instance crashed without cleanup. Restart the server.
  • “command timed out” — command took longer than 30 seconds. Check server logs.
  • Socket permission denied — socket is mode 0600, must run as the same user that started the hexon server (typically root).

The socket is cleaned up on graceful shutdown. If the server crashes, the stale socket is automatically removed on next startup.
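
The socket-related errors above differ only in how a connect attempt on the socket path fails. The following is a minimal sketch of that distinction, assuming a Linux-style Unix domain socket. The function name probe_admin_socket and its return labels are illustrative and are not part of the hexon CLI.

```python
import os
import socket


def probe_admin_socket(path):
    """Classify the state of a Unix socket path (illustrative labels only)."""
    if not os.path.exists(path):
        return "missing"      # server not started, or a different socket path
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
    except ConnectionRefusedError:
        return "stale"        # file exists but nothing is listening
    except PermissionError:
        return "denied"       # mode 0600 and a different user
    else:
        return "live"
    finally:
        client.close()
```

A result of "stale" corresponds to the crashed-instance case: the file is left behind and is removed on the next server startup.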

Logs

No structured log entries. A single console message is emitted on startup. Command execution logging is handled by the admin CLI module.

Metrics

No Prometheus metrics emitted by this module. Admin command metrics are handled by the admin CLI module.


Threshold Signing & Cluster Cryptography

Signs certificates and tokens without any single node holding the full private key — quorum-based threshold cryptography

Overview

Signs certificates and OIDC tokens without any single node holding the complete private key. The signing key is split across cluster nodes — a quorum collaborates to produce each signature. Replaces external HSMs with distributed key management built into the gateway.

Threshold signing means that certificates and tokens are signed by a quorum of cluster nodes working together. No single node ever holds the full private key — the key is split into shares, and a minimum number of nodes (the “threshold”) must cooperate to produce a valid signature.

The cluster runs two signing schemes in parallel:

  1. ECDSA (ES256/ES384/ES512), for EXTERNAL tokens
     Used for: OIDC tokens, Personal Access Tokens (PATs), standard OAuth
     Why: industry-standard algorithms that third-party tools verify natively

  2. FROST Ed25519, for INTERNAL operations
     Used for: proxy bearer tokens, bastion device codes, internal service auth
     Why: faster signing (~15ms), optimized for high-volume internal operations

These two schemes are not fallbacks for each other — they run in parallel, each serving different consumers. The only brief fallback window is during cluster startup: internal tokens temporarily use ECDSA until FROST key generation completes. This is a few seconds, not a steady-state condition.

Token routing (when signing_algorithm = ES256):

Token Type | Scheme | Reason
──────────────────────────────────────────────────────────
Proxy bearer tokens | FROST | Internal — speed, backend verifies via JWKS
Bastion device codes | FROST | Internal — bastion authentication
Internal device codes | FROST | Internal service callers
Personal Access Tokens | ECDSA | External — distributed to users
Standard OIDC tokens | ECDSA | External — third-party OAuth clients

Quorum model (default: 2-of-3 nodes):

- Any 2 nodes can sign; 1 node alone cannot forge signatures
- 1 node failure = still operational (2 remaining nodes can sign)
- 2 nodes down = signing blocked (quorum lost)
- 1 node compromised = attacker has 1 share, cannot forge (needs 2)
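
The quorum figures above follow from simple arithmetic on the threshold t. This sketch assumes the rule stated elsewhere in this documentation that t+1 nodes must cooperate to sign and that the automatic default is t = floor(n/2). The function name quorum_summary is hypothetical.

```python
def quorum_summary(nodes, threshold=None):
    """Derive quorum facts for a cluster of `nodes` members.

    `threshold` is the t value; None selects the documented auto default,
    floor(n/2). Signing needs t+1 cooperating nodes.
    """
    t = nodes // 2 if threshold is None else threshold
    signers_needed = t + 1
    return {
        "threshold": t,
        "signers_needed": signers_needed,
        # Nodes that can be offline while signing still works.
        "tolerated_failures": nodes - signers_needed,
        # An attacker holding this many shares still cannot sign.
        "shares_insufficient_to_forge": t,
    }
```

For the default 3-node cluster this gives 2 signers needed, 1 tolerated failure, and 1 compromised share that is not enough to forge.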

Startup sequence:

1. Nodes perform distributed key generation (DKG) for the ECDSA scheme
2. Once ECDSA is ready, FROST key generation auto-triggers
3. Once both complete, all signing paths are available
In the brief window after step 1 completes and before step 3, internal tokens use ECDSA.
Zero downtime throughout.

When signing_algorithm is EdDSA: everything uses FROST (no dual mode needed).
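
The routing table, the startup fallback, and the EdDSA case can be summarized as one decision function. This is a sketch only: the token type identifiers and the function name select_scheme are made up for illustration, and the behavior for EdDSA while FROST is still initializing is not described in this document, so the sketch simply returns FROST.

```python
# Hypothetical identifiers for the token types listed in the routing table.
INTERNAL_TOKENS = {"proxy_bearer", "bastion_device_code", "internal_device_code"}
EXTERNAL_TOKENS = {"personal_access_token", "oidc_token"}


def select_scheme(token_type, signing_algorithm, frost_ready):
    """Pick the signing scheme for a token, following the routing table."""
    if token_type not in INTERNAL_TOKENS | EXTERNAL_TOKENS:
        raise ValueError(f"unknown token type: {token_type}")
    if signing_algorithm == "EdDSA":
        return "FROST"                    # single mode: everything uses FROST
    if token_type in INTERNAL_TOKENS and frost_ready:
        return "FROST"                    # steady state for internal tokens
    return "ECDSA"                        # external tokens, or startup fallback
```

The only path where an internal token is signed with ECDSA is frost_ready being false, which matches the short startup window described above.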

Key management

Key lifecycle and rotation:

Threshold signing keys (both ECDSA and FROST):

- Generated collaboratively by all cluster nodes (distributed key generation)
- No single node ever holds the full private key — only its own share
- Stored encrypted at rest using authenticated encryption derived from cluster_key
- When nodes join or leave, key shares are redistributed while preserving the
same public key (external verifiers like JWKS consumers are not affected)

ECDSA key rotation (for OIDC tokens):

- Automatic: triggered when key generation completes or cluster membership changes
- New JWKS entry published; old key retained for verification (grace period)
- Relying parties cache JWKS — existing tokens remain valid until cache refresh

FROST key rotation (for internal tokens):

- Independent from ECDSA rotation — separate lifecycle
- Auto-triggered when cluster membership changes
- Internal tokens are short-lived — rotation is seamless with no visible impact

Inter-node encryption:

Nodes communicate over an encrypted channel with forward secrecy. This means
each inter-node session uses unique encryption keys, and even if long-term
keys were compromised, past communications remain unreadable.
- Encryption keys rotate automatically on a schedule (2PC protocol with quorum)
- Grace period: old + new keys accepted simultaneously during rotation
- Temporary fallback to derived keys if the key exchange is briefly unavailable
- SPK rotation uses publish-before-swap: new bundle published before private key swaps
- Key rotation defers if SPK just rotated (SPK recency guard, 5s window)
- On quorum failure, key rotation retries once after flushing stale bundle caches
- All rotation events emitted as audit entries via OnKeyRotationEvent callback

IMPORTANT — All rotations are automatic:

Certificate rotation, signing key rotation, and encryption key rotation are
all handled by background health monitors. Operators do NOT need to set
calendar reminders or manually trigger rotations.
Only investigate when 'health components' or 'hexdcall status' shows warnings.

Threshold resharing

Cluster scaling under threshold signing — what’s safe and what isn’t:

Resharing rotates the cluster’s share material across a (possibly) different party set while preserving the same group public key. The CA certificate, JWKS public keys, and any downstream verifiers stay valid across the rotation. Both threshold algorithms support resharing:

ECDSA threshold (ca_algorithm=ES256, signing_algorithm=ES256/384/512): GG18
FROST threshold (ca_algorithm=EdDSA, signing_algorithm=EdDSA): FROST

Auto-trigger:

The leader's health monitor watches for cluster-membership changes vs the set
of nodes that hold shares in JetStream KV. When they diverge (a node joins or
leaves), resharing fires automatically after a short stabilization window
(~10s) plus exponential backoff (30s base, 5min cap). Operators do NOT need
to manually trigger resharing for normal scaling.
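
For a sense of how long repeated reshare attempts may be spaced, here is a sketch of a capped exponential backoff with the 30s base and 5min cap quoted above. The doubling factor is an assumption (the multiplier is not stated here), as is the function name backoff_seconds. How the backoff combines with the ~10s stabilization window is also not specified, so it is left out.

```python
def backoff_seconds(retry, base=30, cap=300, factor=2):
    """Delay before retry number `retry` (0-based), capped at `cap` seconds.

    factor=2 is an assumed multiplier, not a documented value.
    """
    return min(base * factor ** retry, cap)
```

Under these assumptions the delays run 30s, 60s, 120s, 240s, and then stay at the 300s cap.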

Adding nodes (+N): no upper bound.

You can add any number of nodes. The original cooperating-old subset (t+1
contributors) reshares with their existing share values; joiners receive
fresh shares. The CA / JWKS public key stays unchanged.

Removing nodes (-N): bounded by the OLD threshold.

Resharing math requires |cooperating-old| >= oldThreshold + 1, and the
cooperating-old set must be a subset of the new party set. So:
minimum new cluster size = oldThreshold + 1
With the default majority-quorum config (t = floor(n/2) at birth), this means:
Initial cluster | Old threshold (t) | Minimum after shrinking via reshare
----------------|-------------------|--------------------------------------
3 nodes | 1 | 2 nodes
5 nodes | 2 | 3 nodes
7 nodes | 3 | 4 nodes
9 nodes | 4 | 5 nodes
Example: a 7-node cluster (t=3) cannot shrink to 3 in a single reshare —
it would need 4 cooperating-old shareholders but the new set has only 3
slots. The system rejects this synchronously with a clear "insufficient
cooperators" error.
To shrink past the floor: stage the reduction (e.g., 7 → 5 with a lower
new threshold → 3 with a still-lower threshold). Each step preserves the
CA cert. OR perform a hard CA rotation (delete birth metadata + restart),
which produces a new CA cert and forces every leaf cert to renew.
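
The shrink floor can be checked before attempting a reduction. The sketch below reproduces the table's arithmetic under the default t = floor(n/2). The function names are hypothetical, and the error text only imitates the "insufficient cooperators" message mentioned above.

```python
def min_size_after_reshare(old_threshold):
    """Smallest new cluster size a single reshare can reach."""
    return old_threshold + 1


def check_shrink(old_size, new_size, old_threshold=None):
    """Raise if `new_size` has too few slots for the cooperating-old set."""
    t = old_size // 2 if old_threshold is None else old_threshold
    floor_size = min_size_after_reshare(t)
    if new_size < floor_size:
        raise ValueError(
            f"insufficient cooperators: need {floor_size}, new set has {new_size}"
        )
    return floor_size
```

A call such as check_shrink(7, 3) fails because 4 cooperating-old shareholders are required, while check_shrink(7, 5) passes and leaves room for a later, second step.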

Choosing initial threshold with shrinkage in mind:

Lower threshold = more shrinkage room but weaker security (fewer cooperators
needed to sign = lower attack threshold). Override via threshold_nodes
config at birth time if your cluster needs to scale down significantly.

Resharing failure mode is non-destructive:

If a reshare protocol run fails (cooperator unreachable, KV write failure,
byzantine commitment), the manager reverts to its old shares and the CA
cert stays valid. Operators see the failure in 'hexdcall status' / health
components and can retry manually or wait for the next auto-trigger.

Deterministic signing

Clarification on “deterministic” in cluster crypto context:

The word “deterministic” appears in two contexts — they mean different things:

  1. AutoTLS deterministic KEY DERIVATION:

    • Private keys derived deterministically from cluster_key for each renewal cycle
    • Ensures all nodes produce the same public key for SPKI pinning
    • ECDSA SIGNATURE nonces remain fully random (standard randomness)
    • This is NOT “deterministic signing” — only the key material is deterministic
    • Security equivalent to random key generation (entropy from cluster_key)
  2. Threshold ECDSA CA signing:

    • Uses multi-party nonce generation across nodes
    • Each node contributes randomness; combined nonce is unpredictable
    • No single node controls the nonce — distributed trust model
    • The signing ceremony is interactive (requires t+1 nodes), not deterministic

Neither case reduces cryptographic security. “Deterministic” in these contexts refers to reproducibility (same inputs → same key) or coordination protocol, not to removing randomness from signatures.
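
The difference between deterministic key derivation and random signature nonces can be illustrated in a few lines. This is a conceptual sketch only: the actual derivation function used by AutoTLS is not specified here, so HMAC-SHA256 is a stand-in, and the names derive_key_seed and fresh_signature_nonce and the "autotls-leaf-key" label are invented for the example.

```python
import hashlib
import hmac
import secrets


def derive_key_seed(cluster_key: bytes, renewal_cycle: int) -> bytes:
    """Same cluster_key and cycle always give the same seed (stand-in KDF)."""
    info = b"autotls-leaf-key:" + str(renewal_cycle).encode()
    return hmac.new(cluster_key, info, hashlib.sha256).digest()


def fresh_signature_nonce() -> bytes:
    """Stand-in for per-signature randomness, drawn from the OS CSPRNG."""
    return secrets.token_bytes(32)
```

Every node that holds the same cluster_key computes the same seed for a given cycle, which is what makes SPKI pinning consistent, while each signature still draws new randomness.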

When operators ask about “no randomness” or “deterministic signing”:

- AutoTLS: "The leaf certificate keys are derived deterministically for cluster
consistency and SPKI pinning. Signatures still use standard randomness."
- Threshold CA: "The signing uses multi-party nonce generation — each node
contributes randomness, and the combined nonce is cryptographically secure."

Do NOT present deterministic key derivation as a security concern.

Troubleshooting

Common symptoms and diagnostic steps:

ECDSA threshold signing errors (external tokens not being issued):

- Run 'hexdcall status' to check signer state and health
- State=Active + Health=Healthy → signing should work
- Health=Degraded → at minimum quorum, one more node failure blocks signing
- Health=Unhealthy → cannot sign, check node reachability with 'cluster nodes'
- Run 'hexdcall threshold test' to verify end-to-end signing

FROST signing errors (internal tokens failing, e.g. proxy bearer or device codes):

- Check FROST state and health in 'hexdcall status'
- FROST key generation runs after ECDSA completes — if FROST shows Idle
but ECDSA is Active, FROST key generation has not triggered yet
- Internal tokens fall back to ECDSA while FROST initializes (no outage)
- Run 'hexdcall threshold test --trace' for detailed phase-level timing

Key generation not completing:

Key generation (DKG) is the process where nodes collaboratively create a
shared signing key. It requires all participating nodes to be reachable.
- Check 'cluster nodes' — all expected nodes must be online
- Key generation requires the inter-node encryption channel to be healthy
- Check for membership mismatches: all nodes must agree on the participant set
- Rolling restarts are handled gracefully — key generation is not re-triggered unnecessarily

Inter-node encryption issues:

Nodes encrypt all cluster communication using a forward-secret key exchange.
- Low key pool → automatic replenishment triggers (usually self-healing)
- Key exchange failures → check NATS JetStream connectivity ('cluster status')
- Signature verification failures → possible clock skew between nodes (check NTP)
- During degradation, non-critical operations are deferred and auto-retry on recovery
- Key rotation audit events: search 'logs tail --audit' for module=keyrotation
Events: initiated, deferred, commit_all, commit_quorum, retry, aborted, completed,
activated, abort_received, spk_completed, spk_failed
- "deferred" = SPK recency guard fired (normal when SPK and key rotation intervals match)
- "retry" = first PREPARE attempt failed, retried after bundle cache refresh
- "commit_quorum" = some nodes missed PREPARE, committed with partial ACKs

Interpreting ‘hexdcall status’ output:

ECDSA: Active/Healthy + FROST: Active/Healthy → optimal state (all signing paths available)
ECDSA: Active/Healthy + FROST: Idle → FROST key generation pending, internal tokens use ECDSA
ECDSA: Active/Degraded → at minimum quorum, lost fault tolerance margin — monitor closely
ECDSA: DKG → key generation in progress, signing not yet available
Inter-node encryption: Healthy → encrypted communication between nodes is nominal

Monitoring thresholds for CA certificate:

>90 days until expiry: HEALTHY (normal — renewal is automatic)
20-90 days: INFO (approaching renewal window — still automatic)
5-20 days: WARN (renewal should have happened — check logs)
<5 days: CRITICAL (rotation may have failed — investigate immediately)
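
For external monitoring, the bands above translate into a small classifier. The function name ca_expiry_status is hypothetical. The listed ranges overlap at exactly 20 and 5 days, so this sketch assigns each boundary value to the less severe band, which is an assumption.

```python
def ca_expiry_status(days_left):
    """Map days until CA certificate expiry to the documented status bands."""
    if days_left > 90:
        return "HEALTHY"   # normal, renewal is automatic
    if days_left >= 20:    # boundary at 20 treated as INFO (assumption)
        return "INFO"
    if days_left >= 5:     # "<5 days" is CRITICAL, so exactly 5 is WARN
        return "WARN"
    return "CRITICAL"
```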

Diagnostic commands:

'hexdcall status' - Signing health, key generation state, inter-node encryption
'hexdcall threshold test' - End-to-end ECDSA signing test
'cluster nodes' - List cluster nodes and reachability
'cluster status' - Overall cluster health including NATS connectivity
'health components' - All system components with health status

Relationships

Module dependencies and interactions:

  • OIDC provider: Consumes ECDSA threshold signer for JWT signing (ES256/384/512). JWKS endpoint serves the threshold public key. When signing_algorithm changes, new DKG runs and JWKS updates.
  • OIDC provider (internal tokens): Uses FROST signer for proxy bearer tokens and device codes. Falls back to ECDSA if FROST is not yet ready.
  • ACME CA (threshold mode): When acme_ca_threshold=true, CA signing uses the ECDSA threshold scheme. Quorum of nodes must cooperate to issue certificates.
  • Bastion: Device code authentication uses FROST-signed tokens (internal path).
  • Proxy: Bearer token minting uses FROST for low-latency signing.
  • X3DH: Forward secrecy for DKG messages and key rotation coordination. Threshold signing uses a dedicated encrypted data plane, separate from X3DH.
  • NATS JetStream: Persistent storage for DKG state, key shares, and PreKey bundles.
  • Health monitor: Periodically computes signer health from peer reachability. Auto-triggers FROST DKG when ECDSA is Active. Detects membership mismatches.

Logs

Log entries emitted by this module.
Search with: logs search “threshold” / “keyrotation” / “hexdcall” / “ca.”
Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Note: The bridge module IS the logging infrastructure: it provides bridge.Log() and the hexdcall Logger adapter. The entries below are emitted by bridge code itself via bridge.Log(telemetry.LogEntry(…)). The hexdcall Logger adapter (GetLogger()) routes hexdcall-internal logs through telemetry, but those are hexdcall entries, not bridge entries.

Threshold State Changes:

threshold INFO AUDIT Threshold signing ready
threshold WARN AUDIT Threshold signing unavailable
threshold WARN AUDIT Threshold signing degraded
threshold INFO AUDIT DKG initiated
threshold INFO AUDIT DKG complete
threshold ERROR AUDIT DKG failed
threshold ERROR AUDIT DKG timed out
threshold ERROR AUDIT CRITICAL: DKG failed after max retries
threshold INFO AUDIT Threshold share persisted to KV
threshold WARN AUDIT Corrupt threshold share deleted
threshold ERROR AUDIT Threshold signing failed
threshold ERROR AUDIT Threshold signing timed out
threshold INFO AUDIT Threshold CA birth complete
threshold INFO AUDIT CA resharing initiated
threshold INFO AUDIT CA resharing complete
threshold ERROR AUDIT CA resharing failed
threshold ERROR AUDIT CA resharing timed out
threshold ERROR AUDIT CRITICAL: CA public key changed during resharing
threshold INFO AUDIT Threshold share migration pending
threshold ERROR AUDIT TSS replay attack detected
threshold ERROR AUDIT TSS envelope signature verification failed
threshold ERROR AUDIT TSS mandatory signature missing

Key Rotation Events:

keyrotation ERROR AUDIT Key rotation aborted
keyrotation ERROR AUDIT Key rotation spk_failed
keyrotation WARN AUDIT Key rotation retry
keyrotation WARN AUDIT Key rotation commit_quorum
keyrotation WARN AUDIT Key rotation abort_received
keyrotation INFO AUDIT Key rotation <event> (initiated, deferred, commit_all, completed, activated, spk_completed)

Hexon Readiness:

hexdcall INFO AUDIT HexonReady: All subsystems operational - Hexon is ready to serve traffic

CA Module — GetCABundle:

ca.getcabundle ERROR Failed to get ACME CA bundle
ca.getcabundle DEBUG ACME CA bundle retrieved successfully

CA Module — SignCertificate:

ca.signcertificate WARN Certificate template is required
ca.signcertificate WARN Public key DER is required
ca.signcertificate WARN Failed to parse public key DER
ca.signcertificate ERROR Failed to sign certificate with ACME CA
ca.signcertificate INFO AUDIT Certificate signed successfully with ACME CA

CA Module — SignCRL:

ca.signcrl WARN CRL number is required
ca.signcrl WARN CRL number must be positive
ca.signcrl WARN NextUpdate must be after ThisUpdate
ca.signcrl ERROR Failed to sign CRL with ACME CA
ca.signcrl INFO AUDIT CRL signed successfully with ACME CA

CA Module — SignOCSPResponse:

ca.signocspresponse WARN Serial number is required
ca.signocspresponse WARN Serial number must be positive
ca.signocspresponse WARN Invalid OCSP status
ca.signocspresponse WARN NextUpdate must be after ThisUpdate
ca.signocspresponse ERROR Failed to sign OCSP response with ACME CA
ca.signocspresponse INFO AUDIT OCSP response signed successfully with ACME CA

Metrics

No Prometheus metrics are emitted by the bridge module. The bridge provides infrastructure (bridge.Log, hexdcall Logger adapter) but does not itself emit counters, gauges, or latency metrics.


Configuration System

One TOML file defines the entire gateway — hot-reload, env overrides, GitOps, and Kubernetes CRDs

Overview

Defines the entire gateway in one TOML file: every module, every route, every policy. Supports hot-reload without restart, environment variable overrides, GitOps via a Git repository, and Kubernetes CRDs. Config sources are applied in a well-defined precedence order, listed here from lowest to highest priority:

1. Default values (security-focused, applied automatically)
2. TOML literal values (single file, directory of files, or Git repository)
3. ${VAR} template substitution in TOML (arbitrary env var names, pre-parse)
4. HEXON_* auto-computed overrides (post-parse, highest priority)

Key capabilities:

  • Thread-safe access with atomic reads and mutex-protected writes
  • Hot-reload with SHA256 change detection, callback throttling (default 100ms window), section caching, and delta change logging
  • Environment variable overrides for all fields including array items: HEXON_SECTION_KEY for singletons, HEXON_SECTION_ARRAY_<NAME>_KEY for array items. Automatic type conversion (string, int, bool, comma-separated arrays).
  • ${VAR} template substitution: embed arbitrary env var names in TOML values, expanded pre-parse. Operators choose their own naming convention.
  • GitOps: clone from Git repo (HTTPS or SSH), automatic polling with cluster-aware leader-only execution, multi-TOML file merge
  • Directory-based config: pass a directory path, all *.toml files merged recursively in alphabetical order (maps merge, arrays concatenate, scalars last-wins)
  • Self-documenting schema: struct tags (desc, hint, default, min, max, enum, format, example, required, sensitive, rfc, depends) drive runtime documentation
  • Config diff history: ring buffer (default 10 entries) tracking per-key old/new values, exposed via “config diff” admin CLI command
  • Invalid config handling: hash-based dedup prevents retry storms, logs every 5 minutes
  • File deletion handling: service continues with last valid config, ALERT logged, status set to “file_missing” for health check visibility

Configuration is organized into domain-specific sections:

Service, Telemetry, Cluster, Operations, Protection, Authentication, Filesystem

The config package is imported by virtually every component in the system. It has no dependencies on other gateway modules (only standard library + go-toml/v2).

Config

Configuration is loaded from TOML files. Default path: /tmp/hexon.toml

[service]
hostname = "auth.example.com" # Public hostname (required)
port = 443 # HTTPS listen port (required)
public_port = 8443 # Public-facing port for URL generation (behind NAT/LB)
tls_cert = "/path/to/cert.pem" # TLS certificate (file path or inline PEM)
tls_key = "/path/to/key.pem" # TLS private key (file path or inline PEM)
read_timeout = 30 # HTTP read timeout in seconds (default: 30)
write_timeout = 30 # HTTP write timeout in seconds (default: 30)
idle_timeout = 120 # HTTP idle timeout in seconds (default: 120)
max_header_bytes = 65536 # Max header size in bytes (default: 65536)
http2_enable = true # Enable HTTP/2 (default: true)
handshake_timeout = 10 # TLS handshake timeout in seconds (default: 10)
block_malformed_tls = true # Reject invalid TLS (default: true)
mtls_mode = "none" # mTLS mode: "none", "optional", "mandatory"
x509_auto_auth = true # Auto-authenticate with client certificate (default: true)
hot_reload_enabled = true # Enable automatic file watching (default: true)
hot_reload_poll_interval = "1s" # File polling interval (default: 1s)
hot_reload_callback_throttle = "100ms" # Callback throttle window (default: 100ms)
[telemetry]
log_level = "info" # trace|debug|info|warn|error|fatal (default: info)
log_format = "json" # json|human (default: json)
output = "stdout" # stdout|otlp|both (default: stdout)
otlp_endpoint = "otel-collector:4317" # Required when output is otlp or both
log_buffer_size = 10000 # Ring buffer for log queries (default: 10000)
[cluster]
cluster_mode = true # Enable clustering (default: false)
cluster_peers = ["10.0.0.2", "10.0.0.3"] # Static peers (IPs or hostnames)
cluster_dns = "hexon.cluster.local" # OR DNS-based discovery (ignored when cluster_peers is set)
cluster_key = "32-char-secret" # Cluster key, exactly 32 chars (required)
cluster_refresh = "15s" # Peer refresh interval (default: 15s)
threshold_required = false # Fail-closed threshold signing after bootstrap grace (default: false)
threshold_bootstrap_grace = "2m" # Grace period for DKG completion (default: 2m)
threshold_nodes = 0 # Threshold t value: 0=auto (floor(n/2)), explicit integer for override

Environment variable overrides (three layers):

Precedence: HEXON_* override > ${VAR} expansion > TOML literal > defaults
HEXON_* auto-computed overrides (post-parse, highest priority):
Singleton fields: HEXON_<SECTION>_<KEY>=value # e.g., HEXON_SERVICE_PORT=8443
Array item fields: HEXON_<SECTION>_<ARRAY>_<ITEMNAME>_<KEY>=value
# e.g., HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET=secret
Item names are sanitized: uppercased, non-alphanumeric → underscore, collapsed.
Only existing items (defined in TOML) can be overridden.
Use 'config describe <section>' to see the exact env var for each field.
${VAR} template substitution (pre-parse, in TOML source):
clientsecret = "${VAULT_OIDC_SECRET}" # Arbitrary env var names
Pattern: ${VARNAME} — unset vars left as-is, no recursive expansion.
Type conversion: string, int, bool (true/false/1/0/yes/no), arrays (comma-separated)
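
The ${VAR} substitution rules (braces required, unset variables left as-is, no recursive expansion) can be mimicked with a single regular-expression pass. This is an illustrative sketch rather than the gateway's implementation. The function name expand_template is hypothetical and the environment is passed in as a plain dict.

```python
import re

# Braced names only, matching [a-zA-Z_][a-zA-Z0-9_]*
_VAR_PATTERN = re.compile(r"\$\{([a-zA-Z_][a-zA-Z0-9_]*)\}")


def expand_template(text, env):
    """Expand ${VAR} occurrences in `text` using the mapping `env`."""
    def replace(match):
        name = match.group(1)
        # Unset variables are left untouched, including the ${...} wrapper.
        return env[name] if name in env else match.group(0)

    # re.sub scans the input once; substituted text is not rescanned,
    # so a value that itself contains ${...} is not expanded again.
    return _VAR_PATTERN.sub(replace, text)
```

Passing dict(os.environ) as env would apply the same rules to the real process environment before the TOML text is parsed.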

GitOps environment variables:

CONFIG_GIT_REPO # Repository URL (HTTPS or SSH, required for GitOps)
CONFIG_GIT_BRANCH # Branch name (required for GitOps)
CONFIG_GIT_PATH # Local clone path (default: /tmp/hexon-config)
CONFIG_GIT_POLLING # Enable remote polling (default: false)
CONFIG_GIT_POLLING_TIME # Polling interval (default: 5m, min: 30s)
CONFIG_GIT_USER / CONFIG_GIT_TOKEN # HTTPS authentication
CONFIG_GIT_SSH_KEY # SSH private key (inline PEM or file path)

Directory-based config:

Pass a directory path instead of file: --config /etc/hexon/conf.d/
All *.toml files merged recursively in alphabetical order.
Merge: maps merge recursively, arrays concatenate, scalars last-wins.
Use numeric prefixes for ordering: 00-base.toml, 90-overrides.toml.
World-writable files (mode bit 0002, o+w) are rejected for security.
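
The merge rule (maps merge recursively, arrays concatenate, scalars last-wins) is easiest to see on already-parsed data. A minimal sketch, assuming each file has been parsed into a Python dict; the function name merge_config is hypothetical.

```python
def merge_config(base, overlay):
    """Merge `overlay` onto `base` without mutating either input."""
    result = dict(base)
    for key, value in overlay.items():
        current = result.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            result[key] = merge_config(current, value)   # maps merge recursively
        elif isinstance(current, list) and isinstance(value, list):
            result[key] = current + value                # arrays concatenate
        else:
            result[key] = value                          # scalars: last wins
    return result
```

Applying this pairwise over files in sorted path order, for example 00-base.toml and then 90-overrides.toml, gives the later file the final say on scalars while array entries from both files are kept.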

Config diff history:

config_diff_history_enabled = true # Enable/disable diff storage (default: true)
config_diff_history_size = 10 # Max entries retained, range 1-100 (default: 10)
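
A bounded history of per-key changes behaves like a ring buffer: once the configured size is reached, the oldest entry is dropped. The sketch below shows the idea with collections.deque. The class and function names are hypothetical, and the output of the real 'config diff' command is not reproduced here.

```python
from collections import deque


def flatten(config, prefix=""):
    """Turn nested dicts into dotted keys, e.g. {"service.port": 443}."""
    flat = {}
    for key, value in config.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def diff_config(old, new):
    """Return {key: (old_value, new_value)} for every key whose value changed."""
    a, b = flatten(old), flatten(new)
    return {k: (a.get(k), b.get(k))
            for k in sorted(a.keys() | b.keys()) if a.get(k) != b.get(k)}


class DiffHistory:
    def __init__(self, size=10):
        if not 1 <= size <= 100:          # documented range for the history size
            raise ValueError("size must be between 1 and 100")
        self._entries = deque(maxlen=size)

    def record(self, old, new):
        changes = diff_config(old, new)
        if changes:                        # reloads without changes add nothing
            self._entries.append(changes)
        return changes

    def entries(self):
        return list(self._entries)         # oldest first
```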

Hot-reloadable: all config values via Get(). Application code must handle changes.
Cold (restart required): listener bind address/port, TLS certificate paths at startup.

Troubleshooting

Common symptoms and diagnostic steps:

Config file not loading at startup:

- TOML syntax error: check error message for line number, validate with 'config validate'
- Missing required fields: hostname, port, tls_cert, tls_key must be present
- Invalid CIDR notation: check proxy_cidr, ip_whitelist, ip_blacklist format
- World-writable file: remove the world-write bit (0002) from TOML files, e.g. chmod o-w <file>

Environment variable overrides not applying:

- Check naming: HEXON_<SECTION>_<KEY> in uppercase (e.g., HEXON_SERVICE_PORT)
- Dots become underscores: HEXON_SERVICE_HEXON_EDGE_CIDR for [service] hexon_edge_cidr
- Boolean values: accepts true/false, 1/0, yes/no (case-insensitive)
- Arrays: comma-separated (HEXON_SERVICE_PROXY_CIDR=10.0.0.0/8,172.16.0.0/12)
- Array items: item must exist in TOML first; env var uses sanitized name
(e.g., HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET for client "MyApp")
- ${VAR} not expanding: variable must be set (os.LookupEnv), pattern must use
braces (${VAR} not $VAR), name must match [a-zA-Z_][a-zA-Z0-9_]*
- Use 'config describe <section>' to see the exact env var name for each field
- Check active overrides: 'config env' shows all HEXON_* variables in effect
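
When an override does not apply, it helps to compute the expected variable name by hand. The sketch below follows the naming and conversion rules listed above; it is not the gateway's own code, and 'config describe <section>' remains the authoritative source. The function names, the dotted section path, and whitespace trimming in arrays are assumptions.

```python
import re


def sanitize(name):
    """Uppercase, then turn each run of non-alphanumerics into one underscore."""
    return re.sub(r"[^A-Z0-9]+", "_", name.upper())


def env_var_name(section, key, item=None):
    """Build HEXON_<SECTION>_<KEY> or HEXON_<SECTION>_<ITEMNAME>_<KEY>."""
    parts = [section] + ([item] if item is not None else []) + [key]
    return "HEXON_" + "_".join(sanitize(p) for p in parts)


_TRUE = {"true", "1", "yes"}
_FALSE = {"false", "0", "no"}


def convert(raw, kind):
    """Convert an env var string to the field type named by `kind`."""
    if kind == "bool":
        value = raw.strip().lower()        # case-insensitive
        if value in _TRUE:
            return True
        if value in _FALSE:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    if kind == "int":
        return int(raw)
    if kind == "array":
        return [part.strip() for part in raw.split(",")]  # trimming is assumed
    return raw
```

For example, env_var_name("authentication.oidc.clients", "clientsecret", item="MyApp") yields HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET, matching the example in the list above.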

Hot-reload not detecting changes:

- File hash unchanged: hot-reload uses SHA256, not mtime
- Throttle window: rapid changes coalesce within 100ms window
- Check status: 'config diff' for recent changes
- Callback timeout: callbacks exceeding 30s are logged at WARN
- hot_reload_enabled=false: file watching is disabled entirely
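
Because change detection is based on the content hash, touching a file or saving it unchanged never triggers a reload, and a config that already failed validation is not retried until its content changes. A rough sketch of that decision logic follows; the class name ReloadGate, the result labels, and the validator callback are all hypothetical.

```python
import hashlib


class ReloadGate:
    """Decide whether new file content should trigger a reload."""

    def __init__(self, initial: bytes):
        self._active = hashlib.sha256(initial).hexdigest()
        self._rejected = None    # hash of the last content that failed validation

    def decide(self, content: bytes, is_valid) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest == self._active:
            return "unchanged"          # same bytes, mtime is irrelevant
        if digest == self._rejected:
            return "still_invalid"      # do not retry the same broken config
        if not is_valid(content):
            self._rejected = digest
            return "rejected"           # keep running on the previous config
        self._active, self._rejected = digest, None
        return "reloaded"
```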

Config file deleted while running:

- Service continues with last valid config (graceful degradation)
- ALERT logged immediately, reminder every 5 minutes
- Status set to "file_missing" visible in 'health components'
- When file is restored, normal operation resumes automatically

GitOps config not syncing:

- Repository credentials: verify CONFIG_GIT_USER/TOKEN or CONFIG_GIT_SSH_KEY
- Polling disabled: CONFIG_GIT_POLLING must be "true" for automatic updates
- Cluster leader-only: in cluster mode, only the leader node polls Git
- Multi-file merge: check logs for "[CONFIG] Multi-file mode:" to verify

Directory config merge issues:

- File order: alphabetical by full path, use numeric prefixes (00-base, 90-overrides)
- Scalar override: later files win
- Array concatenation: proxy.mappings from multiple files combine (not override)
- Only *.toml files included, rename to .disabled or .bak to exclude

Threshold signing issues:

- threshold_required=true but tokens return 503 after bootstrap grace:
DKG did not complete in time. Check 'status summary' for threshold state,
'logs search threshold' for DKG errors. Ensure cluster_mode=true and ≥2 nodes.
- Threshold signing not activating: requires cluster_mode=true, ≥2 nodes,
X3DH healthy. Check 'health components' for x3dh status.
- Re-DKG not triggering after node join/leave: stale route timeout is 5 minutes,
then 10s stabilization. Wait ~5m10s after membership change.
- threshold_nodes: 0 = auto (floor(n/2)), explicit value sets t directly.
t+1 nodes must cooperate to sign. With t=1 and n=2, both nodes required.

Relationships

Module dependencies and interactions:

  • listener: Consumes service config for TLS settings, bind address, port, HTTP/2 parameters, handshake timeout. Listener reads config via Get() on startup and handles hot-reload for certificate rotation.

  • cluster: Config changes propagate to all nodes via cluster broadcast. GitOps polling runs on the cluster leader only.

  • telemetry: Reads log_level, log_format, output, otlp_endpoint. log_buffer_size controls ring buffer for admin CLI log queries.

  • protection: Rate limiting, PoW, IP whitelist/blacklist settings all loaded from [protection] section. Hot-reloadable for threshold tuning without restart.

  • authentication: All auth backend configuration (LDAP, OIDC, TOTP, WebAuthn, x509) loaded from [authentication] sub-sections.

  • Git config sync: Handles CONFIG_GIT_* env vars, repository cloning, SSH/HTTPS auth, multi-file merge, and polling coordination.

  • Hot reload: Infrastructure module that manages file watching lifecycle, callback registration, and reload orchestration.

  • proxy: Reverse proxy mappings, load balancer, circuit breaker settings from [proxy] section.

  • threshold signing: [cluster] threshold_required, threshold_bootstrap_grace, threshold_nodes control the threshold signing subsystem (GG18 ECDSA / FROST Ed25519). The algorithm is driven by [authentication.oidc] signing_algorithm. Config is cold (restart required). The threshold signing subsystem consumes these values at startup.

  • admin CLI: ‘config show’, ‘config describe’, ‘config example’, ‘config set’, ‘config diff’, ‘config validate’, ‘config env’ commands for operational visibility and management.

  • schema: Self-documenting system driven by struct tags. Schema extraction produces field metadata, description formatting, TOML example generation, and auto-computed env var names for operator-facing output. Each field shows its HEXON_* env var in ‘config describe’. The config guide MCP resource is generated from this schema data.

Logs

This module does not emit structured log entries. All config output goes to process stdout/stderr as console messages.

Console output categories:

startup and reload:
fmt.Printf "[CONFIG] Warning: Failed to start hot-reload system: %v"
fmt.Printf "[CONFIG] Loading configuration from directory: %s"
fmt.Printf "[%s] %s" (license periodic check callback)
fmt.Fprintf "[CONFIG] DEPRECATED: %s is deprecated — %s"
fmt.Fprintf "[CONFIG] Warning: %s: expected %s, got %s — %s" (type mismatch auto-correction)
cross-module validation:
fmt.Fprintf "[CONFIG] WARNING: signin.magiclink.enabled=true but SMTP is not configured — magic link disabled"
fmt.Fprintf "[CONFIG] INFO: auto-enabling authentication.devicecode (required by signin.magiclink)"
git clone and metadata:
fmt.Printf "[CONFIG] Git TLS config: ..."
fmt.Printf "[CONFIG] Loading configuration from git repository: %s (branch: %s)"
fmt.Printf "[CONFIG] Git configuration loaded successfully: ..."
fmt.Printf "[CONFIG] Warning: Failed to extract git metadata: %v"
fmt.Printf "[CONFIG] Using HTTP basic authentication"
fmt.Printf "[CONFIG] Using SSH authentication"

file watching and reload (via logHotReloadEvent helper):
fmt.Printf "[CONFIG-HOTRELOAD] Hot reload system started"
fmt.Printf "[CONFIG-HOTRELOAD] Hot reload system stopped"
fmt.Printf "[CONFIG-HOTRELOAD] Config file changed, triggering reload"
fmt.Printf "[CONFIG-HOTRELOAD] Config reload successful"
fmt.Printf "[CONFIG-HOTRELOAD] Config reload failed - keeping previous config"
fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config file deleted - running with last valid config"
fmt.Printf "[CONFIG-HOTRELOAD] Config file restored"
fmt.Printf "[CONFIG-HOTRELOAD] Config still invalid - not retrying same broken config"
fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config parse failure"
fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config validation failure"
fmt.Printf "[CONFIG-HOTRELOAD] ALERT: Config file missing"
fmt.Printf "[CONFIG-HOTRELOAD] Config reload triggered by cluster broadcast"
fmt.Printf "[CONFIG-HOTRELOAD] Config reload from cluster successful"
fmt.Printf "[CONFIG-HOTRELOAD] Config reload from cluster failed"
fmt.Printf "[CONFIG-HOTRELOAD] Cluster notified of config reload"
fmt.Printf "[CONFIG-HOTRELOAD] Config changes detected"
fmt.Printf "[CONFIG-HOTRELOAD] Config reloaded with no detected changes"
fmt.Printf "[CONFIG-HOTRELOAD] Config callback panicked"
fmt.Printf "[CONFIG-HOTRELOAD] WARN: Legacy config callback timed out (goroutine leaked)"
fmt.Printf "[CONFIG-HOTRELOAD] WARN: Config callback timed out (context cancelled)"
fmt.Printf "[CONFIG-HOTRELOAD] WARN: Context-aware callback not respecting cancellation"
fmt.Printf "[CONFIG-HOTRELOAD] Config cache invalidated"
fmt.Printf "[CONFIG-HOTRELOAD] Hot reload configuration optimized"

None of these are queryable via ‘logs search’. They appear only in process stdout/stderr. The infrastructure/hotreload module wraps some of this via hexdcall manager logger (slog).
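Because these messages reach only stdout/stderr, anything that collects them has to classify lines by their bracketed prefix. The following is a minimal sketch of such a classifier, assuming the `[CONFIG]` and `[CONFIG-HOTRELOAD]` prefixes listed above. The function name and the severity labels are hypothetical and are not part of hexon.

```python
import re

# Matches the bracketed prefix at the start of a console line, e.g. "[CONFIG] ...".
PREFIX_RE = re.compile(r"^\[(?P<prefix>[A-Z-]+)\]\s*(?P<message>.*)$")


def classify_console_line(line):
    """Return (prefix, severity, message) for a config console line, or None.

    Severity is inferred from keywords at the start of the message; the labels
    ("alert", "warning", "info") are illustrative, not emitted by hexon.
    """
    match = PREFIX_RE.match(line.strip())
    if match is None or not match.group("prefix").startswith("CONFIG"):
        return None
    message = match.group("message")
    upper = message.upper()
    if upper.startswith("ALERT"):
        severity = "alert"
    elif upper.startswith(("WARN", "DEPRECATED")):
        severity = "warning"
    else:
        severity = "info"
    return match.group("prefix"), severity, message
```

Lines from other sources (for example the license check callback, whose prefix varies) are returned as None and can be handled separately.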

Metrics

No Prometheus metrics are emitted directly by this module.

Reload counters are available via the health system:

- Reload attempts, successes, failures
- Parse errors, validation errors, file-not-found errors
- Callback timeouts, callback duration, last reload duration

Query reload status: health components | config status


Git Configuration Management

Pulls configuration from a git repository — GitOps workflow with automatic cluster-wide reload on changes

Overview

Synchronizes the gateway configuration from a git repository — every change auditable through git history. The leader polls for changes, validates the configuration, and broadcasts a reload to all cluster nodes. Supports SSH and HTTPS with PAT authentication, webhook-triggered pulls, and automatic rollback on invalid config.

Core capabilities:

  • Leader-only git repository polling (prevents duplicate change detection)
  • Cluster-wide reload to all members on change detection
  • Hard reset to remote HEAD for deterministic config state
  • Commit tracking with hash, author, message, and timestamp
  • Quorum wait for cluster-wide consistency confirmation
  • Integration with config hot-reload pipeline for seamless updates

Cluster synchronization flow:

1. Leader node polls git repository at configured interval
2. When changes detected, leader pulls and applies config locally
3. Leader notifies all cluster members to pull the latest config
4. Each member pulls latest git config and triggers hot-reload
5. Quorum wait ensures cluster-wide consistency

The module provides GitOps-style configuration management where infrastructure teams push configuration changes to a git repository, and the cluster automatically picks up and applies those changes. This enables:

  • Version-controlled configuration with full audit trail
  • Pull request review workflows for config changes
  • Rollback capability via git revert
  • Branch-based staging/production config separation

Leadership determines which node polls the repository:

- Only the cluster leader runs the git polling loop
- If leadership changes, the new leader automatically starts polling
- In standalone mode, the single node polls directly

Config

Git configuration is managed under the [config] section in hexon.toml:

[config]
# Git repository settings
git_enabled = true # Enable git-based config management
git_repo = "/etc/hexon/config.git" # Local path to git repository
git_remote = "origin" # Git remote name (default: origin)
git_branch = "main" # Branch to track (default: main)
git_poll_interval = "30s" # Polling interval (default: 30s)
# Authentication
git_ssh_key = "/etc/hexon/deploy.key" # SSH key for git authentication
git_username = "" # Username for HTTPS auth (optional)
git_password = "" # Password for HTTPS auth (optional)
# Directory-based config
config_dir = "/etc/hexon/config.d" # Directory for split config files
merge_strategy = "deep" # How to merge directory configs

The git repository should contain the hexon.toml (or split config files) at the repository root. The module performs a hard reset to the remote branch HEAD on each pull, ensuring deterministic state regardless of local modifications.

Polling behavior:

- Only the cluster leader polls the git repository
- Poll interval determines change detection latency
- SHA comparison detects changes (not file timestamps)
- On detection, local reload happens first, then broadcast
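The polling rules above can be sketched as a single cycle. This is an illustration under stated assumptions, not the module's implementation: the three callables are hypothetical stand-ins for fetching the remote SHA, reloading locally, and notifying the cluster, and the choice to keep the old SHA after a failed reload is also an assumption.

```python
def poll_once(is_leader, last_sha, fetch_remote_sha, reload_local, broadcast):
    """Run one polling cycle and return the SHA now considered current.

    All callables are hypothetical stand-ins for the module's internals:
      fetch_remote_sha() -> str   commit SHA at the tracked branch head
      reload_local(sha)  -> bool  pull + validate + apply on this node
      broadcast(sha)     -> None  notify cluster members to pull
    """
    if not is_leader:
        return last_sha            # members never poll; they wait for a notification
    remote_sha = fetch_remote_sha()
    if remote_sha == last_sha:
        return last_sha            # SHA unchanged: nothing to do
    if not reload_local(remote_sha):
        return last_sha            # leader reload failed: do not broadcast
    broadcast(remote_sha)          # local reload succeeded first, then notify
    return remote_sha
```

A caller would invoke `poll_once` every `git_poll_interval` and feed the returned SHA into the next cycle.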

Hot-reloadable: git_poll_interval.

Cold (restart required): git_enabled, git_repo, git_remote, git_branch, git_ssh_key, git_username, git_password.

Architecture

Operational model and design:

Pull operation details:

Each successful pull reports: commit hash, commit author, commit message,
and pull timestamp. These are visible in structured logs and health status
for auditing which config version is active on each node.

Operational model:

The module is passive on member nodes -- it responds to cluster-wide pull
notifications by performing a local git pull and triggering config reload.
The active polling runs only on the leader node, which detects changes and
initiates the cluster-wide pull.

Leader election dependency:

The module relies on the cluster's leader election mechanism. Only the
elected leader runs the git polling loop. If leadership changes, the new
leader automatically starts polling. This prevents duplicate pulls and
conflicting notification storms.

Troubleshooting

Common symptoms and diagnostic steps:

Config changes in git not being applied:

- Verify git_enabled = true in [config] section
- Check if this node is the leader: cluster status shows leader node
- Verify git remote is accessible: net tcp <git_host:port>
- Check git_poll_interval (default 30s): a change pushed within the last interval may not have been polled yet
- Look for git pull errors in logs: logs search "gitconfig" --level=error
- Verify branch name matches: git_branch must match remote branch

Authentication failures (git pull fails):

- SSH: verify git_ssh_key path exists and has correct permissions (0600)
- SSH: check host key is in known_hosts for the git server
- HTTPS: verify git_username and git_password are correct
- HTTPS: check if token has expired (for token-based auth)
- Look for auth errors: logs search "git" --level=error

Cluster members out of sync:

- Check cluster health: cluster status shows all nodes
- Verify pull delivery: logs search "gitconfig" on member nodes
- Member pull failure is local only - check individual node logs
- Force sync: push a new commit (any change) so the leader's next poll detects a new SHA
- Check quorum: if quorum lost, broadcast may not reach all members

Config validation failure after pull:

- Invalid TOML in repository causes reload failure
- Leader reload failure prevents broadcast (protects cluster)
- Member reload failure logged locally, does not affect other nodes
- Run config validate to verify the current config
- Check git log for the problematic commit

Hard reset behavior:

- The module performs git reset --hard to remote HEAD
- Local modifications to the config file are overwritten
- This is intentional: git is the source of truth
- If local changes are needed, commit them to the repository
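For reproducing the hard reset by hand, the equivalent git invocations look roughly like the sketch below. It only builds the argument lists and assumes the defaults from the config example (remote "origin", branch "main"). Whether the module shells out to git or uses a library internally is not stated here, so treat this as an approximation of the effect rather than of the mechanism.

```python
def hard_reset_commands(repo, remote="origin", branch="main"):
    """Build the git invocations for a fetch followed by a hard reset of `repo`.

    Returns argument lists suitable for subprocess.run(); nothing is executed here.
    """
    return [
        # Update the remote-tracking ref for the tracked branch.
        ["git", "-C", repo, "fetch", remote, branch],
        # Discard local modifications and move to the remote branch head.
        ["git", "-C", repo, "reset", "--hard", f"{remote}/{branch}"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Run it against a scratch clone first, since the reset discards uncommitted local changes.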

Standalone mode (no cluster):

- Git polling runs on the single node directly
- No broadcast occurs (no cluster to notify)
- Config reload happens locally after pull
- Suitable for development and single-node deployments

Relationships

Module dependencies and interactions:

  • config: Primary integration point. The config system performs the actual git fetch and hard reset. Config hot-reload pipeline processes the updated TOML after pull.
  • cluster: Leader election determines which node runs the git polling loop. Cluster-wide notification delivers the pull signal to all members. Quorum wait (optional) ensures cluster-wide consistency.
  • Hot reload: Complementary module — gitconfig handles git-based config changes while hot reload handles file-based config changes. Both trigger the same cluster-wide reload pipeline.
  • telemetry: Structured logging for pull operations with commit hash, author, and success/failure status. Metrics for pull frequency and latency.

Logs

No structured log entries. A console message is emitted on successful pull (commit hash, author, message — not queryable via logs search).

Related logs from other modules:

- config: logs git fetch, hard reset, and reload results
- cluster: logs broadcast delivery to member nodes

Metrics

This module does not emit its own Prometheus metrics.

Observability is provided indirectly through dependent modules:

- config: metrics for config reload success/failure and reload latency
- cluster: metrics for broadcast delivery and quorum wait

Hot Reload

Applies configuration changes without restart — leader detects file changes, broadcasts reload to all nodes

Overview

Applies configuration changes to the entire cluster without restarting any node. The leader watches for config file changes, reloads locally, and broadcasts the update to all cluster members. Most settings take effect immediately — a few require restart (documented per module).

Core capabilities:

  • Leader-only file watching (prevents duplicate change detection across cluster)
  • SHA256 hash comparison for reliable change detection (1-second poll interval)
  • Cluster-wide reload notification to all members after leader detects changes
  • Graceful degradation to standalone mode (single node, no coordination)
  • Atomic config swap with validation before apply
  • Independent node recovery (each node can recover on next poll or restart)

Cluster reload flow:

1. Leader's file watcher polls config file every 1 second
2. SHA256 hash computed and compared to previous hash
3. On change: leader re-reads config, validates, applies defaults
4. Atomic config swap on leader node
5. On success: leader notifies all cluster members to reload
6. Each member independently re-reads file, validates, and swaps config
7. Notification is best-effort (local success is sufficient)

Standalone mode:

When running as a single node or when cluster coordination is not initialized,
every node watches and reloads independently. No broadcast occurs. This mode
provides backward compatibility for development environments, single-node
deployments, and testing scenarios.

Error handling philosophy (best effort):

- Leader reload success: always broadcast to cluster
- Leader reload failure: do NOT propagate (protect cluster from bad config)
- Member reload failure: logged locally, does not affect other nodes
- Cluster propagation failure: logged, local reload already succeeded

Config

Hot reload is an infrastructure module that watches the main config file. Its behavior is controlled by the overall config system rather than a dedicated config section.

The file watcher monitors the main hexon.toml config file path. The poll interval is fixed at 1 second for responsive change detection without excessive I/O overhead.

Key behaviors:

- File watcher only runs on the cluster leader node
- SHA256 hash comparison avoids false-positive reloads from timestamp changes
- Config validation occurs before applying changes (fail-safe)
- Invalid config is rejected; previous config remains active
- Atomic swap ensures no partial config state is visible to readers
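The validate-then-swap behavior can be illustrated with a small in-memory store. This is a sketch under assumptions: the class, the validator signature, and the version counter starting at 1 are all hypothetical, chosen only to show that readers see either the old config or the new one and never a mixture.

```python
import threading


class ConfigStore:
    """Holds the active config; replaces it only after validation passes."""

    def __init__(self, initial, validate):
        self._validate = validate      # hypothetical validator: returns a list of errors
        self._lock = threading.Lock()
        self._active = initial
        self.version = 1

    def current(self):
        # Readers get a reference to one complete config object.
        return self._active

    def reload(self, candidate):
        """Validate `candidate`; swap it in on success, keep the old config on failure."""
        errors = self._validate(candidate)
        if errors:
            return False, errors       # invalid config rejected, previous stays active
        with self._lock:               # serialize concurrent writers
            self._active = candidate   # a single reference assignment is the swap
            self.version += 1
        return True, []
```

Building the complete candidate first and swapping one reference is what keeps partial state invisible; mutating the active object field by field would not give that guarantee.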

Leadership determines which node watches the config file:

- Only the cluster leader runs the file watcher
- If leadership changes, the new leader automatically starts watching
- In standalone mode, every node watches independently

The config system also exposes reload status and metrics for health checks and monitoring.

Hot-reloadable fields vary by module. Each module documents which of its config fields support hot-reload vs. requiring a restart. The hotreload module itself has no user-configurable settings.

Architecture

Operational model and design:

Config version tracking:

Each successful reload increments a version counter. This allows health
checks and monitoring tools to detect whether a node is on the latest
config by comparing version numbers. The version, reload timestamp, and
any error message are exposed via health status.

File watching approach:

The file watcher uses a polling approach (not inotify/kqueue) for maximum
portability across Linux, macOS, and container environments. The 1-second
poll interval provides a good balance between responsiveness and overhead.
SHA256 hashing is more reliable than mtime/ctime comparison, which can
produce false positives with NFS or container volume mounts.
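Content hashing of this kind takes only a few lines. The sketch below uses Python's standard hashlib and illustrates the technique, not the gateway's own code; the function names are hypothetical.

```python
import hashlib


def file_sha256(path):
    """Return the hex SHA256 digest of the file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def detect_change(path, previous_hash):
    """Return (changed, current_hash); a touch that leaves the bytes intact is not a change."""
    current = file_sha256(path)
    return current != previous_hash, current
```

A watcher would call `detect_change` once per second and trigger a reload only when the first element is true, which is why updating a file's mtime without changing its content does not cause a reload.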

Separation from gitconfig:

hotreload handles direct file modifications (edit, cp, mount update).
gitconfig handles git repository-based changes (git pull, merge).
Both trigger the same config reload pipeline but through different
detection mechanisms. They complement each other:
- Use gitconfig for GitOps workflows with version control
- Use hotreload for direct file modifications or mounted config maps

Troubleshooting

Common symptoms and diagnostic steps:

Config changes not being picked up:

- Verify this node is the cluster leader: cluster status
- In standalone mode, every node watches independently
- Check if file was actually modified: SHA256 hash must change
- Editing in place (vi, nano) changes hash; truncate+write may race
- NFS/mount delays: file may not be visible for up to 1 second
- Check logs for reload attempts: logs search "reload" --level=info

Reload fails with validation error:

- Invalid TOML syntax: run config validate to check the current file
- Missing required fields after edit
- Leader detects failure and does NOT broadcast to cluster
- Fix the config file; watcher will detect next change automatically
- Check error details: logs search "reload" --level=error

Cluster members not reloading:

- Check cluster connectivity: cluster status and health status
- Verify reload delivery: logs search "reload" on member nodes
- Member failure is independent: check individual node logs
- Network partition: members reload on next local file change or restart
- Quorum issues: cluster-wide reload requires quorum for delivery

Reload succeeded but feature not updated:

- Not all config fields are hot-reloadable
- Check module documentation for which fields require restart
- Cold fields (e.g., listen ports, TLS certs) need full restart
- Verify config version incremented: health status shows config version

Standalone mode issues:

- No broadcast occurs in standalone mode (expected behavior)
- Each node watches independently when cluster is not initialized
- Verify the config file path is correct and accessible
- File permissions: process must have read access to config file

File watcher consuming resources:

- SHA256 computation on large config files is negligible
- 1-second poll interval is fixed and not configurable
- For very large configs (rare), hashing overhead is still minimal
- If concerned, monitor CPU via metrics prometheus "cpu"

Relationships

Module dependencies and interactions:

  • config: Primary integration point. The config system owns file reading, TOML parsing, validation, and atomic swap logic. Reload status and metrics are exposed for health checks.
  • cluster: Leader election determines which node runs the file watcher. Cluster-wide notification delivers the reload signal to all members. Leadership changes automatically transfer the file watching responsibility.
  • GitOps config: Complementary module for git-based config changes. Both modules trigger the same config reload pipeline. gitconfig is for GitOps workflows; hotreload is for direct file modifications.
  • All modules with hot-reloadable config: When reload occurs, each module receives updated config via their registered reload callbacks. Modules include firewall (ACL rules), proxy (mappings), ratelimit (limits), forwardproxy (rate/bandwidth), and many others.
  • telemetry: Structured logging for reload events with success/failure status, config version, and timing. Metrics for reload frequency and duration.

Logs

No structured log entries. A single console message is emitted on initialization.

Related logs from other modules:

- config: logs file watcher start/stop, hash comparison, reload success/failure
- cluster: logs broadcast delivery to member nodes

Metrics

This module does not emit its own Prometheus metrics.

Observability is provided indirectly through dependent modules:

- config: metrics for config reload success/failure, reload latency, and config version
- cluster: metrics for broadcast delivery and quorum wait

Kubernetes CRD Configuration

Kubernetes-native configuration via Custom Resource Definitions with bootstrap reconciliation, live watching, and status feedback

Overview

HexonGateway supports Kubernetes-native configuration through Custom Resource Definitions (CRDs). When running in Kubernetes, operators can manage gateway configuration using standard kubectl commands instead of (or alongside) TOML files.

The system defines 55 CRD types covering every configuration section:

- Service, cluster, telemetry, health, DNS, SMTP, filesystem, memory
- Proxy mappings, connection pools, TCP proxy, forward proxy, subrequest
- Authentication: OIDC clients, auth flows, signup flows
- Identity: LDAP, OIDC providers, SCIM providers
- Protection: WAF config, WAF rules, firewall rules/aliases, rate limiting
- Infrastructure: bastion, SQL bastion, SSH certificates, port forwarding, connector, client access
- Certificates: ACME CA server, ACME client
- Operations: admin, MCP, LLM, playbooks, webhooks, SPIFFE, RADIUS
- Observability: log intelligence, notifications

CRDs are optional — the gateway runs identically on VMs, Docker, or Kubernetes using TOML configuration. CRDs provide a Kubernetes-native alternative that integrates with GitOps tools like ArgoCD and Flux.

All CRDs belong to the config.hexon.io API group, version v1alpha1. They are namespace-scoped; instances live in the hexon-system namespace by default.

Config

CRD Lifecycle:

1. Bootstrap: On first start, the cluster leader creates CRD instances from
the running config (TOML + env overrides + defaults merged). Each instance
is labeled config.hexon.io/origin: bootstrap.
2. Pruning: Bootstrap-owned array CRDs no longer in config are automatically
deleted, along with their companion Secrets. Operator-owned CRDs are never
touched. This ensures TOML deletions propagate to Kubernetes.
3. Watching: Informers watch for CRD changes via the Kubernetes API. Changes are
debounced (500ms window) to batch rapid edits.
4. Apply: CRD spec is converted to the internal config struct, validated, and
applied atomically. Config reload callbacks fire for all modules.
5. Status: Each CRD instance gets status conditions reflecting apply success/failure.
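The debounce step in the lifecycle can be pictured as grouping events by quiet periods. The sketch below assumes a trailing debounce, where a batch closes once 500 ms pass with no further event. The source states only the window size, so the exact semantics and the function name are assumptions.

```python
def debounce_batches(events, window=0.5):
    """Group (timestamp_seconds, name) events into batches.

    A batch closes once `window` seconds pass with no further event, so a burst
    of rapid edits yields a single apply. Events must be sorted by timestamp.
    """
    batches = []
    current = []
    last_ts = None
    for ts, name in events:
        if current and ts - last_ts > window:
            batches.append(current)    # quiet period exceeded: close the batch
            current = []
        current.append(name)
        last_ts = ts
    if current:
        batches.append(current)
    return batches
```

With this model, three edits made within a few hundred milliseconds of each other are applied together, while an edit two seconds later forms its own batch.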

Example — create a proxy mapping:

apiVersion: config.hexon.io/v1alpha1
kind: HexonProxy
metadata:
  name: dashboard
  namespace: hexon-system
spec:
  hostname: dashboard.example.com
  target: http://dashboard-service:3000
  auth_type: oidc

Sensitive fields (SecretKeyRef):

Sensitive config fields (certificates, private keys, passwords, API secrets,
RADIUS shared secrets, OIDC client secrets) are never stored in CRD specs.
Instead, they are stored in companion Kubernetes Secrets and referenced via
SecretKeyRef entries in the CRD spec:
spec:
  apiKey:
    name: hexon-hexonproxies-dashboard   # Secret name
    key: apiKey                          # Key within the Secret
- Empty sensitive fields (e.g., no custom certificate) produce no Secret.
The field stays empty and the gateway uses its default (e.g., wildcard cert).
- Non-empty fields are stored in a companion Secret named
hexon-<plural>-<instance> (e.g., hexon-hexonproxies-dashboard).
- Operators can reference any Secret they create — not limited to the
bootstrap naming convention.
- RBAC: The gateway pod needs get/list/create/update/delete on core Secrets.
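The naming convention and the lookup of a SecretKeyRef can be sketched as follows. The `secrets` dictionary is a hypothetical in-memory stand-in for the Kubernetes API, and both function names are invented for illustration.

```python
def companion_secret_name(plural, instance):
    """Bootstrap naming convention described above: hexon-<plural>-<instance>."""
    return f"hexon-{plural}-{instance}"


def resolve_secret_ref(ref, secrets):
    """Look up a SecretKeyRef ({"name": ..., "key": ...}) in `secrets`.

    `secrets` maps Secret name -> {key: value}. Returns None when the Secret
    or the key is missing, so the caller can skip the change instead of
    overwriting the field with an empty value.
    """
    return secrets.get(ref["name"], {}).get(ref["key"])
```

Because the reference carries an explicit name and key, an operator-created Secret with any name resolves the same way as a bootstrap-created one.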

Ownership model:

- Bootstrap-created CRDs and companion Secrets have label:
config.hexon.io/origin: bootstrap
- Remove the label to "take ownership" — bootstrap will no longer overwrite
- Operator-created CRDs (no label) are never modified or deleted by bootstrap
- Bootstrap-owned array CRDs removed from config are pruned on next restart
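The ownership rules reduce to a label check. The following sketch shows how prune candidates could be selected under that model. The `existing` mapping is a simplified stand-in for objects listed from the Kubernetes API, and the function names are hypothetical.

```python
ORIGIN_LABEL = "config.hexon.io/origin"


def is_bootstrap_owned(labels):
    """True when the resource still carries the bootstrap origin label."""
    return labels.get(ORIGIN_LABEL) == "bootstrap"


def prune_candidates(existing, configured_names):
    """Return names of bootstrap-owned array CRDs that are no longer in config.

    `existing` maps resource name -> labels dict. Resources without the
    origin label are operator-owned and are never returned.
    """
    return sorted(
        name
        for name, labels in existing.items()
        if is_bootstrap_owned(labels) and name not in configured_names
    )
```

Removing the label from a resource therefore takes it out of both the overwrite path and the prune path in one step.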

Singleton vs Array CRDs:

- Singleton: one instance named "default" (e.g., HexonClusterConfig, HexonDNSConfig)
- Array: multiple instances, name derived from config (e.g., HexonProxy per mapping)

Resource naming:

- K8s resource names are sanitized from config names (lowercased, spaces/underscores
to dashes, special chars to dashes, max 253 chars). Example: config app "Kubernetes /
Production" becomes resource name "kubernetes---production".
- The original config name is preserved in the CRD spec (e.g., spec.app for proxies).
- The "crd show" command accepts either the K8s resource name or the original config name.
- Use "crd list <kind>" to see both resource names and config names side by side.
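The sanitization rule can be approximated in a few lines. This sketch follows only the rules stated above (lowercase, every other character to a dash, 253-character cap); the gateway's real function may treat dots or leading and trailing dashes differently, and the function name is hypothetical.

```python
import re


def sanitize_resource_name(config_name, max_len=253):
    """Lowercase, map every character outside [a-z0-9-] to a dash, and truncate."""
    lowered = config_name.lower()
    dashed = re.sub(r"[^a-z0-9-]", "-", lowered)
    return dashed[:max_len]
```

Each replaced character becomes its own dash, which is why the space, slash, and space in "Kubernetes / Production" produce three consecutive dashes.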

Status conditions:

Every CRD instance reports an "Applied" condition:
  Applied=True   reason=ConfigValid      config applied successfully
  Applied=False  reason=ApplyError       config apply failed
  Applied=False  reason=ConversionError  CRD-to-config conversion failed
Check status: kubectl get hexonproxies -o wide
The "Applied" printer column shows the current phase (Ready/Error).

Troubleshooting

CRD instances not being created on startup:

- Only the cluster leader runs bootstrap reconciliation
- Check logs for "bootstrap reconciliation complete" message
- Verify CRD definitions are installed: kubectl get crd | grep hexon

CRD changes not applying:

- Changes are debounced with a 500ms window — wait briefly
- Check status conditions: kubectl describe <crd-kind> <name>
- Look for Applied=False with reason and message
- Verify RBAC: the gateway pod needs get/list/watch/create/update/patch permissions
on all config.hexon.io resources and their status subresources

Status shows Applied=False reason=ConversionError:

- The CRD spec doesn't match the expected config structure
- Check field names match TOML keys (snake_case in spec)
- Verify enum values are valid (e.g., auth_type must be a recognized method)

Bootstrap keeps overwriting my changes:

- Remove the config.hexon.io/origin label from the CRD instance:
kubectl label hexonproxy <name> config.hexon.io/origin-
- Once the label is removed, bootstrap treats it as operator-owned and skips it
- Do the same for companion Secrets if you want to manage them independently

Sensitive field shows empty after CRD apply:

- Check the companion Secret exists: kubectl get secret hexon-<plural>-<name>
- Verify the Secret has the expected key: kubectl get secret <name> -o jsonpath='{.data}'
- Check RBAC allows Secret read: the gateway pod needs get on core/v1 secrets
- If the Secret was manually deleted, restart the gateway to recreate it via bootstrap

CRD still exists after removing mapping from TOML:

- Bootstrap prunes only CRDs with the config.hexon.io/origin: bootstrap label
- If the label was removed (operator-owned), delete it manually:
kubectl delete hexonproxy <name>

Config export for migration:

- Use the admin CLI: config export
- Exports running config as multi-document YAML CRD manifests
- Filter by section: config export proxy
- JSON format: config export --format=json
- Only available when running in Kubernetes

Relationships

Module dependencies and interactions:

  • Configuration system: CRD changes are applied to the same config store used by TOML and environment variables. All modules see changes via the standard config reload mechanism. CRDs have the same precedence as TOML — environment variables still override.

  • Cluster coordination: Bootstrap reconciliation runs on the cluster leader only. Config changes from CRDs propagate to all nodes via the standard config reload broadcast (NATS-based).

  • Admin CLI: The “config export” command generates CRD YAML from running config, enabling migration from TOML to Kubernetes-native management. When using “config export --apply”, companion Secrets are created for sensitive fields (without the bootstrap label, so they are operator-owned). The “config show” and “config describe” commands work regardless of config source.

  • Helm chart: CRDs are distributed separately from the Helm chart. Install CRDs first, then deploy the chart. This avoids Helm’s CRD lifecycle limitations (no update on upgrade, deletion on uninstall).

  • CI/CD integration: CRD manifests are versioned and compatible with ArgoCD, Flux, and other GitOps tools. The all-in-one bundle (hexon-crds.yaml) contains all CRD definitions.

  • Codegen tool: CRD YAML manifests are generated from Go struct tags using the build tool (build-crd.sh). OpenAPI v3 schemas include validation constraints derived from struct tags (required, enum, min, max, default, desc).

Logs

Log entries for Kubernetes operations. No AUDIT entries — all operational/debug. Levels: ERROR > WARN > INFO > DEBUG.

CRD Definition Management:

CRD definition ensure failed  WARN  Schema ensure error for a single CRD kind
CRD definition created        INFO  New CRD definition created in cluster
CRD definition updated        INFO  Existing CRD definition updated with new schema
CRD definitions ensured       INFO  Summary: created/updated/unchanged counts for all CRDs

Manager Lifecycle:

CRD auto-apply failed, using existing definitions     WARN   CRD ensure failed (RBAC or network); continues with existing
starting K8s CRD informers                            INFO   Informer startup with namespace and CRD count
K8s API watch interrupted, will retry                 WARN   Transient network error on watch stream (auto-retries)
K8s API watch failed                                  ERROR  Non-network watch error (permissions, API server issue)
failed to set watch error handler                     WARN   Could not install custom watch error handler
informer cache sync failed                            WARN   Individual informer cache did not sync
K8s informers synced                                  INFO   All informer caches synced, ready to process events
K8s manager stopped                                   INFO   Manager shutdown complete
K8s manager restarting after CRD definitions applied  INFO   Manager restart after CRD sync timeout recovery

Config Apply:

failed to convert CRD to config                         ERROR  UnstructuredToConfig failed for a CRD change
skipping CRD change with unresolved sensitive fields    DEBUG  SecretKeyRef not yet populated, skip to avoid empty overwrite
failed to apply singleton change                        ERROR  Config mutation failed for singleton CRD
failed to apply array change                            ERROR  Config mutation failed for array/map CRD item
failed to apply delete                                  ERROR  Config deletion failed for array/map item
CRD config validation failed, reload skipped            ERROR  Config.Validate() failed after applying CRD changes
applied CRD config changes                              INFO   Config updated from CRD changes with apply/skip/error counts
all CRD changes matched current config, reload skipped  DEBUG  All CRD changes identical to running config

Bootstrap Reconciliation:

bootstrap singleton failed                      ERROR  Failed to reconcile a singleton CRD from config
bootstrap array failed                          ERROR  Failed to reconcile an array CRD type from config
bootstrap reconciliation complete               INFO   Summary: created/updated/skipped/pruned counts
bootstrap array item failed                     ERROR  Failed to create/update a single array item CRD
bootstrap map item failed                       ERROR  Failed to create/update a single map-keyed CRD
failed to prune bootstrap CRD                   ERROR  Could not delete orphaned bootstrap-owned CRD
pruned bootstrap CRD removed from config        INFO   Deleted bootstrap CRD no longer in TOML config
failed to delete companion Secret during prune  WARN   Companion Secret cleanup failed during CRD prune
failed to create companion Secret               ERROR  Could not create K8s Secret for sensitive fields

Secrets:

created companion Secret for CRD              INFO   New K8s Secret created for sensitive fields
updated companion Secret for CRD              DEBUG  Existing K8s Secret updated with new sensitive data
failed to resolve Secret for sensitive field  WARN   Could not read SecretKeyRef value from K8s Secret

Status:

status update: failed to write status WARN Could not write status condition to CRD instance

Health Sync:

health status synced                           INFO  Health status written to CRD resources (with update count)
cluster status sync: failed to get resource    WARN  Could not read cluster CRD for status update
cluster status sync: failed to write status    WARN  Could not write leader/nodes/health to cluster CRD
connector status sync: failed to get resource  WARN  Could not read connector site CRD for status update
connector status sync: failed to write status  WARN  Could not write rich status to connector site CRD
health sync: failed to get resource            WARN  Could not read CRD resource for health update
health sync: failed to write status            WARN  Could not write health field to CRD resource

Resource Apply:

CRD resource created INFO CRD instance created via CLI apply
CRD resource updated INFO CRD instance updated via CLI apply (may include ownership transfer)

Watcher:

unexpected object type in informer event WARN Informer delivered non-Unstructured object

Metrics

Prometheus metrics. Query with: metrics prometheus k8s_<name>

Reconciliation:

k8s_reconciliations_total  counter  {result}  Config-to-CRD reconciliation cycles
  result=success | failure

Health Sync:

k8s_health_syncs_total  counter  {result}  Periodic health status writes to CRD .status
  result=success | failure

CRD Operations:

k8s_crd_operations_total  counter  {operation, result}  Individual CRD operations
  operation=ensure_definition, result=created | updated | failure
  operation=status_write, result=success | failure

Alerts:

rate(k8s_reconciliations_total{result="failure"}[5m]) > 0  Reconciliation failing
rate(k8s_health_syncs_total{result="failure"}[5m]) > 0     Health sync failing

AI Assistant

Built-in AI-powered natural language interface for gateway operations via the bastion shell

Overview

The AI assistant enables natural language interaction with all gateway admin tools through the bastion shell’s “ai” command. It shares the same tool set and execution path as MCP, ensuring identical tool visibility, read/write enforcement, metrics, and audit logging.

Capabilities:

Tool execution - Runs any admin CLI command via an agentic loop. The AI
reads tool results, reasons about them, and decides what to run next.
Read-only commands execute automatically. Write operations pause for
interactive operator approval in the SSH session.
Multi-provider support - Works with Anthropic (Claude), OpenAI (GPT-4),
Azure OpenAI, Google Gemini, and Ollama/vLLM for local models. Provider
auto-detected from the API URL or set explicitly.
Conversation context - Maintains per-session conversation history so
follow-up questions build on prior answers. Operators can set session
context hints and the AI sees recent shell commands for awareness.
Background monitoring - The schedule_task tool runs commands periodically
in the background. Results appear between shell prompts. Operators
manage tasks with "task list", "task stop".
Inline monitoring loops - The sleep tool pauses the AI within its
reasoning loop, then resumes with full context. Enables "check health,
wait 30s, check again, compare, report changes" patterns. Each sleep
extends the tool-calling budget so monitoring does not fight the
per-query round limit. Governed by max_sleep_duration (default 5m per
call) and max_sleeps_per_query (default 60 iterations). Ctrl+C
interrupts immediately.
Cluster knowledge base - Persistent cross-session memory for operational
insights and rules. The AI learns from investigations and applies that
knowledge in future sessions.
Prompt caching - Anthropic provider supports prompt caching (5m or 1h
TTL) to reduce token costs on repeated interactions.

Configuration: the [llm] section sets api_url, api_key, model, and required_groups. Enable the assistant in bastion with [bastion] use_llm = true. Per-user custom instructions can be supplied via moduledata or config.

Safety

Multiple layers prevent runaway AI behavior:

Tool round limit - max_tool_rounds (default 15) caps the number of
reasoning cycles per query. Sleep calls extend this budget so
monitoring loops get additional rounds.
Write operation limit - max_write_ops_per_query (default 3) caps
mutations per query. The AI cannot retry failing write commands
with slight variations.
Sleep guardrails - max_sleep_duration (default 5m) caps individual
pauses. max_sleeps_per_query (default 60) caps total iterations.
Token cost on each wake-up naturally limits runaway loops.
Failed operation dedup - Commands that fail are tracked by operation
key. The AI cannot re-execute the same failing command.
RBAC - required_groups restricts which operators can use AI features.
allowed_commands whitelist limits which tools the AI can call.
Interactive approval - Write operations prompt the operator for y/n
confirmation in the SSH session before execution.
Audit trail - All AI interactions logged with distributed tracing.
Sensitive data redacted by default.
Rate limiting - Per-user query rate limit (default 10/1m) prevents
excessive API usage.
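
How these limits interact can be sketched as a small per-query budget tracker. This is an illustration only, not the gateway's implementation: the class and method names are hypothetical, and the rule that each sleep adds exactly one round to the budget is an assumption (the text above only says that sleeps extend it). The defaults mirror the documented values.

```python
from dataclasses import dataclass, field


@dataclass
class QueryBudget:
    """Hypothetical per-query guardrail tracker using the documented defaults."""
    max_tool_rounds: int = 15
    max_write_ops_per_query: int = 3
    max_sleep_duration_s: int = 300   # 5m per sleep call
    max_sleeps_per_query: int = 60
    rounds_used: int = 0
    writes_used: int = 0
    sleeps_used: int = 0
    failed_ops: set = field(default_factory=set)

    def allow_round(self) -> bool:
        # Assumption: every granted sleep adds one extra round to the budget.
        if self.rounds_used >= self.max_tool_rounds + self.sleeps_used:
            return False
        self.rounds_used += 1
        return True

    def allow_sleep(self, seconds: int) -> int:
        """Return the granted sleep in seconds, or 0 once the sleep cap is hit."""
        if self.sleeps_used >= self.max_sleeps_per_query:
            return 0
        self.sleeps_used += 1
        return min(seconds, self.max_sleep_duration_s)

    def allow_write(self, operation_key: str) -> bool:
        # An operation that already failed is never re-executed.
        if operation_key in self.failed_ops:
            return False
        if self.writes_used >= self.max_write_ops_per_query:
            return False
        self.writes_used += 1
        return True

    def record_failure(self, operation_key: str) -> None:
        self.failed_ops.add(operation_key)
```

Under these assumptions a query gets 15 rounds, a requested 10-minute sleep is shortened to 5 minutes, and a write that failed once is refused on retry.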

Security

Multiple defense layers protect the AI assistant:

RBAC - required_groups restricts which operators can use AI features.
Only operators in the configured groups can access the "ai" command.
Command whitelist - allowed_commands limits which tools the AI can call.
Operators cannot override this from within the AI session.
Write protection - Read-only commands execute automatically. Write
operations pause for interactive operator approval (y/n) in the SSH
session before execution. Cannot be overridden by the AI.
Rate limiting - Per-user query rate limit (default 10/1m) prevents
excessive API usage and token cost accumulation.
Audit trail - All AI interactions logged with distributed tracing.
Sensitive data redacted by default (redact_sensitive = true).
Runaway prevention - Tool round limit (default 15), write operation
limit (default 3), sleep guardrails (5m per call, 60 iterations max),
and failed operation dedup all prevent excessive token consumption.
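
The per-user limit of 10 queries per minute could be enforced in several ways. The sketch below uses a sliding window, which is an assumption: the source does not name the algorithm, and QueryRateLimiter is a hypothetical name.

```python
from collections import defaultdict, deque


class QueryRateLimiter:
    """Hypothetical sliding-window limiter: at most `limit` queries per `window_s`."""

    def __init__(self, limit: int = 10, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # username -> timestamps of accepted queries

    def allow(self, user: str, now: float) -> bool:
        hits = self._hits[user]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Each user has an independent window, so one operator exhausting the limit does not affect another.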

Troubleshooting

Common symptoms and diagnostic steps:

AI command not available in bastion:

- Verify [bastion] use_llm = true in config
- Verify [llm] section is configured with api_url, api_key, model
- Check required_groups: operator must be in one of the listed groups
- Check: 'config show llm' to verify configuration

AI returns errors or empty responses:

- Check API connectivity: verify api_url is reachable from the gateway
- Check API key validity: invalid keys produce authentication errors
- Check provider detection: auto-detect uses api_url hostname, set
provider explicitly if using a proxy or non-standard endpoint
- Check: 'logs search llm --since=5m' for API errors

Write operations not being approved:

- Write ops require interactive SSH session (not available via MCP)
- Operator must respond y/n to the approval prompt
- max_write_ops_per_query (default 3) may be exhausted for the query
- Check allowed_commands whitelist if specific commands are blocked

AI stops responding mid-conversation:

- max_tool_rounds (default 15) reached: increase if needed for complex
queries, but be aware of token cost implications
- Sleep monitoring loop: max_sleeps_per_query (default 60) may be
exhausted. Ctrl+C interrupts immediately
- Check: 'logs search llm "round limit"' for limit violations

Background tasks not running:

- 'task list' shows scheduled tasks and their status
- 'task stop <id>' to cancel a misbehaving task
- Only read-only commands can be scheduled as background tasks

High API costs:

- Enable prompt caching for Anthropic provider (cache_ttl setting)
- Reduce max_tool_rounds to limit reasoning cycles
- Review max_sleeps_per_query for monitoring loops
- Check per-user rate limits (default 10 queries/minute)

Relationships

Cross-subsystem interactions:

  • Admin CLI: Single source of truth for all tools. The AI calls the same command handlers available via MCP and the bastion shell.
  • MCP: Shares system instructions, tool definitions, and response formatting. Both interfaces use the same execution path.
  • Bastion shell: Hosts the “ai” command and interactive AI mode. Manages conversation history, approval prompts, and background task lifecycle.
  • Cluster knowledge: Memory entries (insights and rules) stored in cluster-wide distributed storage with configurable TTL.
  • Admin CLI commands: diagnose, health, proxy, sessions, certs, dns, directory, config, and 30+ more — all available as AI tools.

Logs

Log entries emitted by the LLM module. Search with: logs search “llm”
Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Query lifecycle:

llm.query.start INFO Starting LLM query
llm.query.complete INFO LLM query completed
llm.query.api_error ERROR LLM API call failed
llm.query.max_rounds WARN LLM query exceeded maximum tool rounds

Tool execution:

llm.tool.execute INFO Executing tool via hexdcall
llm.tool.approved INFO AUDIT Write operation approved by operator
llm.tool.denied INFO AUDIT Write operation denied by operator

Metrics

Prometheus metrics. Query with: metrics prometheus llm_<name>

Queries:

llm_queries_total counter {success} Query completion count
success=true Query completed with a final answer
llm_query_duration_seconds latency (none) End-to-end query duration including all tool rounds

Tool calls:

llm_tool_calls_total counter {tool, success} Per-tool execution count
tool=<command> CLI command name (e.g. "proxy", "cluster")
success=true/false Whether the command executed successfully

Prompt caching (Anthropic provider only, emitted when cache tokens > 0):

llm_cache_read_tokens_total counter (none) Tokens read from Anthropic prompt cache
llm_cache_creation_tokens_total counter (none) Tokens written to Anthropic prompt cache

Module Data Storage

Stores per-user credentials and settings — passkeys, TOTP secrets, X.509 enrollment data, and preferences

Overview

Stores per-user data for authentication and service modules — passkey credentials, TOTP secrets, X.509 enrollment data, and user preferences. Each module gets isolated storage with automatic cluster replication. Used by WebAuthn, TOTP, X.509, and any module that needs persistent per-user state.

Core capabilities:

  • Hexon KV (NATS JetStream) storage with automatic cluster replication
  • Per-user, per-module namespace isolation (e.g., “totp”, “webauthn”, “x509”)
  • Reserved “preferences” namespace for cross-module user settings (language, etc.)
  • Automatic language preference storage when Language field is set on SetRequest
  • Directory cache refresh broadcast after Set and Delete operations
  • Input validation at facade and storage levels
  • Base64url key encoding for NATS KV compatibility (handles @, :, spaces)

Operations: Get, Set, Delete, Exists (existence check), GetAllForUser (all data for a user), and LoadAll (bulk load).

Key format uses base64url-encoded usernames for storage compatibility.
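
The username encoding can be sketched as follows. The function names are hypothetical, and stripping the "=" padding is an assumption, since the source does not say whether padding is kept.

```python
import base64


def encode_username(username: str) -> str:
    """Base64url-encode a username so it contains only key-safe characters."""
    raw = username.encode("utf-8")
    # Assumption: padding is stripped; the source does not specify this.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_username(encoded: str) -> str:
    """Reverse encode_username by restoring the padding before decoding."""
    padding = "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(encoded + padding).decode("utf-8")
```

The encoded form contains only letters, digits, "-", and "_", which is why usernames with "@", ":", or spaces are safe to use in keys.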

Config

Configuration for moduledata storage:

Hexon KV Requirements:

[cluster]
cluster_path = "/var/lib/hexon" # Required for NATS JetStream persistence
- NATS JetStream must be available (cluster mode)
- Data automatically replicated across cluster nodes
- LoadAll returns all stored data (efficient for bootstrap)

Input Validation Rules:

Username:
- Cannot be empty
- Maximum 200 characters (before base64url encoding)
- Any characters allowed (gets base64url encoded)
Module Name:
- Cannot be empty
- Maximum 64 characters
- Pattern: [a-zA-Z0-9][a-zA-Z0-9\-_]* (no dots or colons)
- Examples: "totp", "webauthn", "ssh_keys", "user-preferences"
Combined key maximum: 256 characters after encoding
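
The rules above translate into a short validation sketch. The function names and error messages are hypothetical. The combined-key check is omitted because the source does not give the exact key layout.

```python
import re

MODULE_NAME_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9\-_]*")
MAX_USERNAME_LEN = 200
MAX_MODULE_LEN = 64


def validate_username(username: str) -> None:
    """Raise ValueError unless the username is 1 to 200 characters long."""
    if not username:
        raise ValueError("invalid username: empty")
    if len(username) > MAX_USERNAME_LEN:
        raise ValueError("invalid username: longer than 200 characters")


def validate_module_name(name: str) -> None:
    """Raise ValueError unless the name is 1 to 64 characters and matches the pattern."""
    if not name:
        raise ValueError("invalid module name: empty")
    if len(name) > MAX_MODULE_LEN:
        raise ValueError("invalid module name: longer than 64 characters")
    if not MODULE_NAME_RE.fullmatch(name):
        raise ValueError("invalid module name: disallowed characters")
```

One consequence worth checking: 200 ASCII characters encode to 267 base64url characters without padding. If the combined key contains the full encoded username, the 256-character key limit is reached before the 200-character username limit.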

Reserved Namespaces:

- "preferences": User-wide settings (language, notification preferences)

Troubleshooting

Common symptoms and diagnostic steps:

“Backend unavailable” error (ErrBackendUnavailable):

- Check cluster_path exists and NATS JetStream is running
- Check cluster status for NATS availability

“Invalid username” or “Invalid module name” errors:

- Username must be non-empty and at most 200 characters
- Module name must match [a-zA-Z0-9][a-zA-Z0-9\-_]* pattern
- Module name must be at most 64 characters
- No dots or colons allowed in module name (NATS KV restriction)

Data not appearing across cluster nodes:

- Verify NATS JetStream cluster health
- Check if directory cache refresh broadcast is working
- Run 'moduledata inspect <username>' to check data on local node

Language preference not being stored:

- Language is stored asynchronously (fire-and-forget) in "preferences" namespace
- Check if Set operation for the primary module succeeded first
- Verify language code is a valid string (e.g., "en", "es", "fr", "zh")
- Query preferences directly: Get with ModuleName="preferences"

Encoding/decoding errors:

- ErrEncodingFailed: data contains types that cannot be JSON-serialized
- ErrDecodingFailed: stored data is corrupted or not valid JSON
- Check NATS KV key format (base64url encoding)
- Verify data values are JSON-compatible (maps, strings, numbers, bools)
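
To check whether a value would trip the encoding step, it helps to serialize it the way a JSON-backed store would. The sketch below is a stand-in, not the module's code: encode_module_data is a hypothetical name, and ValueError stands in for ErrEncodingFailed.

```python
import json


def encode_module_data(data: dict) -> bytes:
    """Serialize module data, surfacing non-JSON types as an encoding failure."""
    try:
        return json.dumps(data).encode("utf-8")
    except (TypeError, ValueError) as exc:
        raise ValueError(f"encoding failed: {exc}") from exc
```

Maps, strings, numbers, and booleans pass. Raw bytes and sets are typical values that fail.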

Performance and metrics:

- moduledata_operations_total: counter labeled by operation, backend, and result
- moduledata_operation_duration_seconds: latency histogram
- High latency: check NATS JetStream performance

Security

Security considerations for module data storage:

User enumeration prevention:

HTTP handlers should return generic error messages to clients (e.g.,
"Invalid credentials" instead of "User not found"). Detailed errors are
logged internally with traceID for debugging.

Input validation (defense in depth):

All inputs validated at facade and storage levels.
Username length limit (200 chars) prevents DoS via oversized inputs.
Module name character restrictions prevent injection attacks in NATS KV keys.
Base64url encoding of usernames prevents NATS KV key injection.

Data isolation:

Each module's data is stored under its own namespace key.
Modules cannot accidentally overwrite another module's data.
The "preferences" namespace is reserved for cross-module user settings.

Thread safety:

All state managed by NATS JetStream.
Concurrent operations are safe and independent.

Cache consistency:

After Set and Delete operations, a directory cache refresh is
replicated cluster-wide to keep all node caches consistent.
This is fire-and-forget; transient broadcast failures are non-fatal.

Relationships

Module dependencies and interactions:

  • Directory: Provides user existence validation and group lookups. After Set/Delete, moduledata broadcasts RefreshUserCache to directory for cluster-wide cache consistency.
  • WebAuthn: Stores passkey credentials per user in “webauthn” namespace. Uses Get/Set for credential CRUD operations.
  • X.509: Stores X.509 certificate data per user.
  • signin: Stores sign-in flow state and user preferences. Uses Language field on Set to automatically store user language preference.
  • UI templates: Language preference from “preferences” namespace used for localized email rendering and UI template selection.
  • smtp: Looks up user language preference from “preferences” namespace for localized email delivery (OTP, cert renewal, passkey expiration).
  • cluster: Requires NATS JetStream (cluster_path configured). Data automatically replicated across cluster nodes.
  • telemetry: Metrics exported for operation counts and latency histograms. Structured logging with operation type, username (redacted), and traceID.

Logs

Log entries by component. Search with: logs search “moduledata”
Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Initialization:

moduledata.init WARN module_data_storage=ldap is deprecated and no longer supported; using hexon KV backend. Migrate existing module data to hexon KV before upgrading.
moduledata.init WARN cluster_path not set - module data may be lost on restart
moduledata.init WARN Persistent storage not enabled - module data will NOT survive restarts
moduledata.init INFO Module data storage initialized (hexon KV)

Get Operation:

moduledata.get DEBUG Getting module data
moduledata.get ERROR Backend.Get failed

Set Operation:

moduledata.set INFO Setting module data
moduledata.set ERROR Backend.Set failed
moduledata.set.preferences WARN Failed to store language preference

Delete Operation:

moduledata.delete INFO Deleting module data
moduledata.delete ERROR Backend.Delete failed

GetAllForUser Operation:

moduledata.getallforuser DEBUG Getting all module data for user
moduledata.getallforuser ERROR Backend.GetAllForUser failed

LoadAll Operation:

moduledata.loadall INFO Loading all module data
moduledata.loadall ERROR Backend.LoadAll failed

Exists Operation:

moduledata.exists ERROR Backend.Exists failed

Hexon KV Backend — Get:

moduledata.hexon.get ERROR PersistentGet failed
moduledata.hexon.get WARN Unexpected value type in KV

Hexon KV Backend — Set:

moduledata.hexon.set ERROR PersistentSet failed
moduledata.hexon.set DEBUG Module data stored in Hexon KV

Hexon KV Backend — Delete:

moduledata.hexon.delete DEBUG Key not found in Hexon KV (nothing to delete)
moduledata.hexon.delete ERROR PersistentDelete failed
moduledata.hexon.delete DEBUG Module data deleted from Hexon KV

Hexon KV Backend — GetAllForUser:

moduledata.hexon.getallforuser DEBUG Retrieved all module data for user

Hexon KV Backend — LoadAll:

moduledata.hexon.loadall INFO Loaded all module data from Hexon KV

Metrics

Prometheus metrics. Query with: metrics prometheus moduledata_<name>

Operations (namespace: moduledata):

moduledata_operations_total counter {operation, backend, result} Operation count
operation=get|set|delete|getallforuser|loadall|exists
backend=hexon
result=success|error
moduledata_operation_duration latency {operation, backend} Operation duration
operation=get|set|delete|getallforuser|loadall|exists
backend=hexon

Notification Service

Routes alerts and events to Slack, Teams, Discord, PagerDuty, email, and custom webhooks

Overview

Routes operational events and alerts to configured notification channels — Slack, Teams, Discord, PagerDuty, email, and custom webhooks. Supports single events, digest notifications, and endpoint health checks. All notifications use template-driven payloads that can be customized per channel.

Core capabilities:

  • Multi-channel routing: email, Slack, Teams, Discord, PagerDuty, custom webhooks
  • Single event notifications (Send) with subject, body, and severity
  • Digest notifications (SendDigest) batching multiple results into one message
  • Five builtin webhook payload formats: generic, slack, teams, discord, pagerduty
  • Custom Go text/template payloads with json, severityColor, severityEmoji, severityPD helpers
  • Partial success model: Success=true if at least one endpoint delivers
  • Branded HTML email templates rendered via the render module
  • Plain text fallback when render module is unavailable
  • Targeted routing: empty Webhook sends to all, “email” for email only, or a specific webhook name for single-target delivery
  • Health checking for all configured notification endpoints

Routing logic for the Webhook field:

- "" (empty): broadcast to all enabled channels (email + all webhooks)
- "email": send to email channel only (requires To field)
- "<name>": send to the named webhook only (e.g., "slack-ops")
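
The routing rules can be sketched as a small resolver. The function name and parameters are hypothetical. Skipping email during a broadcast when no To address is present is an assumption, since the source only states that email requires the To field.

```python
def resolve_targets(webhook: str, to: str, email_enabled: bool, webhooks: list[str]) -> list[str]:
    """Return the channels a request is delivered to, per the Webhook field rules."""
    if webhook == "":
        # Broadcast: email (when usable) plus every configured webhook.
        targets = ["email"] if email_enabled and to else []
        return targets + list(webhooks)
    if webhook == "email":
        if not to:
            raise ValueError("email routing requires the To field")
        return ["email"] if email_enabled else []
    if webhook not in webhooks:
        raise ValueError(f"unknown webhook: {webhook}")
    return [webhook]
```

The name comparison is exact, so "Slack-Ops" would not match a webhook configured as "slack-ops".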

Email delivery chain:

1. Notify module requests email rendering with template + data
2. Render module loads the appropriate notification and digest templates
3. Rendered HTML and plain text forwarded to SMTP module as multipart
4. Fallback: if render unavailable, plain text auto-wrapped in <pre> tags
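
The fallback in step 4 amounts to escaping the text and wrapping it in <pre>. The sketch below illustrates that step. The function name is hypothetical and the real module may add further markup.

```python
import html


def plain_text_to_html(body: str) -> str:
    """Fallback rendering: escape the text and wrap it in <pre> to keep its layout."""
    return "<pre>" + html.escape(body) + "</pre>"
```

Escaping first means characters such as "<" and "&" in an alert body are shown literally rather than interpreted as markup.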

Webhook payload formats:

- generic: flat JSON with subject, body, severity, username, hostname, timestamp
- slack: Block Kit with header, severity emoji, code block body, context footer
- teams: Adaptive Card v1.4 with TextBlock, FactSet, monospace body
- discord: Embed with severity color mapping, code block body, footer
- pagerduty: Events API v2 with routing_key from Metadata, severity mapping

Config

Notification configuration under [notify] section:

[notify]
digest_window = "5m" # Window for batching digest items
[notify.email]
enabled = true # Enable email notifications (uses [smtp] config)
[[notify.webhooks]]
name = "slack-ops" # Webhook name (used for targeted routing)
url = "https://hooks.slack.com/services/T00/B00/XXX" # Webhook endpoint URL
format = "slack" # Payload format: generic, slack, teams, discord, pagerduty
timeout = "10s" # Request timeout (default: 10s)
[[notify.webhooks]]
name = "teams-infra"
url = "https://outlook.office.com/webhook/XXX"
format = "teams"
timeout = "15s"
[[notify.webhooks]]
name = "pagerduty-critical"
url = "https://events.pagerduty.com/v2/enqueue"
format = "pagerduty"
[[notify.webhooks]]
name = "custom-endpoint"
url = "https://api.internal/alerts"
format = "generic" # Base format (overridden by body_template)
content_type = "application/json" # Custom content type
body_template = '{"alert": "{{json .Subject}}", "detail": "{{json .Body}}"}'
[notify.webhooks.headers]
Authorization = "Bearer token123" # Custom headers sent with every request

Template functions available in custom body_template:

{{json .Field}} - JSON-escape a string (quotes, backslashes, control chars)
{{severityColor .Severity}} - Map severity to Discord embed color (int)
{{severityEmoji .Severity}} - Map severity to Slack emoji string
{{severityPD .Severity}} - Map severity to PagerDuty severity level

Template variables (TemplateData fields):

.Subject, .Body, .Severity, .Username, .Hostname, .Timestamp,
.Metadata (map[string]string), .Items (digest), .ItemCount (digest)

Email template variables (passed to render module):

Subject, Body, Severity, SeverityLabel, Username, Hostname,
Timestamp, Disclaimer, Items (digest), ItemCount (digest)

The HTMLBody field on SendRequest bypasses template rendering entirely, allowing callers to provide custom HTML content. Email requires the To field to be set (recipient address).

Hot-reloadable: webhook URLs, formats, headers, timeouts, email enabled.
Cold (restart required): none (all notify config is hot-reloadable).

Troubleshooting

Common symptoms and diagnostic steps:

Webhook not receiving notifications:

- Run 'notify health' to check connectivity to all endpoints
- Verify webhook URL is correct and accessible from the Hexon server
- Check webhook name matches exactly (case-sensitive) when using targeted routing
- Verify format is one of: generic, slack, teams, discord, pagerduty
- Check timeout setting (default 10s) is sufficient for the endpoint
- For custom endpoints, verify content_type and body_template are valid
- Check custom headers (Authorization, API keys) are correct

Email notifications not being delivered:

- Verify [notify.email] enabled = true
- Check SMTP module health: 'smtp health'
- Verify To field is set on the SendRequest
- Check render module is available for template rendering
- If render unavailable, plain text fallback should still work
- Check spam folder - configure SPF/DKIM/DMARC for production

Partial success (some channels fail, others succeed):

- This is expected behavior with the partial success model
- Check resp.Failed[] for endpoints that failed and resp.Sent[] for successes
- resp.Error contains a summary of failures
- Individual endpoint failures do not block other deliveries
- Success=true means at least one endpoint delivered successfully
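
The partial success model can be sketched as a loop that records each endpoint's outcome independently. SendResponse and deliver_all are hypothetical names that mirror the response fields mentioned above. The real delivery code is not shown in the source.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SendResponse:
    """Hypothetical mirror of the Success, Sent, Failed, and Error fields."""
    success: bool = False
    sent: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    error: str = ""


def deliver_all(endpoints: dict[str, Callable[[], None]]) -> SendResponse:
    resp = SendResponse()
    problems = []
    for name, send in endpoints.items():
        try:
            send()
            resp.sent.append(name)
        except Exception as exc:  # one failing endpoint must not block the rest
            resp.failed.append(name)
            problems.append(f"{name}: {exc}")
    resp.success = bool(resp.sent)  # true if at least one endpoint delivered
    resp.error = "; ".join(problems)
    return resp
```

Callers that need every channel to succeed should check that the failed list is empty rather than relying on the success flag alone.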

Digest notifications empty or missing items:

- Verify Items array is populated in SendDigestRequest
- Each DigestItem needs TaskID and Description at minimum
- Check digest_window setting if items seem to be batched incorrectly

Custom template errors:

- Templates are parsed and validated at config load time
- Template execution errors prevent notification delivery
- Use {{json .Field}} for all string interpolation to prevent JSON injection
- Check template syntax matches Go text/template format
- Verify all referenced fields exist in TemplateData
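
Why escaping matters is easiest to see with a hostile value. The sketch below imitates the effect of the json helper using Python string formatting in place of Go text/template. The names json_escape, TEMPLATE, and render are illustrative only.

```python
import json


def json_escape(value: str) -> str:
    """Escape a string for use inside an already-quoted JSON string."""
    return json.dumps(value)[1:-1]


# Doubled braces are literal braces in str.format.
TEMPLATE = '{{"alert": "{subject}", "detail": "{body}"}}'


def render(subject: str, body: str, escape: bool) -> str:
    fix = json_escape if escape else (lambda s: s)
    return TEMPLATE.format(subject=fix(subject), body=fix(body))
```

With escaping, a body containing quotes stays inside the "detail" string. Without it, the same body can close the string early and add fields of its own to the payload.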

Channel-specific formatting issues:

- Slack format uses Block Kit (verify workspace supports it)
- Teams format uses Adaptive Card v1.4 (verify connector version)
- Discord embeds have a 4096-character limit for the description
- PagerDuty requires routing_key in the Metadata map

Test notifications:

- Use 'notify test <webhook-name>' to test a specific webhook
- Use 'notify test email <address>' to test email delivery
- Use 'notify list' to see all configured endpoints

Security

Security considerations for notification delivery:

Webhook URL protection:

Webhook URLs are marked as sensitive in configuration and are not exposed
in config dumps or diagnostic output. HTTPS URLs are recommended for
production deployments; HTTP is allowed for internal or testing endpoints.

Authentication headers:

Webhook headers (Authorization, API keys) are stored in configuration.
Headers are sent with every webhook request to the endpoint. Consider
using environment variable references for secrets in production.

Custom template safety:

Templates are parsed and validated at config load time to catch syntax
errors early. Always use {{json .Field}} for string interpolation in
custom templates to prevent JSON injection attacks. Template execution
errors are returned and the notification is not sent (fail-safe).

HTML content handling:

Email body text is HTML-escaped when auto-converting plain text to HTML.
The HTMLBody field content is sent as-is without sanitization - callers
are responsible for ensuring HTML content is safe.
The render module handles template escaping for branded email templates.

Credential exposure prevention:

SMTP credentials are handled by the smtp module (never exposed by notify).
Webhook URLs with embedded tokens (Slack, Teams) are redacted in logs.
Error messages from failed webhook deliveries do not include full URLs.

Relationships

Module dependencies and interactions:

  • smtp: Email delivery backend. Notify sends emails through the SMTP module for all email notifications. SMTP configuration ([smtp] section) determines the mail server, credentials, and encryption mode.
  • UI templates: Email template rendering. Notify renders branded HTML email templates cluster-wide. If template rendering is unavailable, plain text fallback is used.
  • Admin CLI: Exposes notify CLI commands (list, health, test) for management and diagnostics.
  • mcp: Notify operations available as MCP tools for AI-assisted operations. LLM and bastion AI assistant can send notifications through MCP tools.
  • config: All notification settings are hot-reloadable. Webhook URLs, formats, headers, timeouts, and email enabled flag can be changed without restart.
  • telemetry: Structured logging for notification delivery with endpoint name, delivery status, and latency. Metrics for send counts and failures.
  • Rate limiting: Callers should apply rate limiting in HTTP handlers to prevent notification flooding. Notify module itself does not rate limit.
  • Various callers: Any module can send notifications cluster-wide. Common callers include authentication modules (login anomalies), certificate management (renewal notifications), and the bastion AI assistant.

Logs

Log entries emitted by the notify module. Search with: logs search “notify”
Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Send — single event delivery:

notify.send.email_failed WARN Email notification failed
notify.send.webhook_failed WARN Webhook notification failed
notify.send.webhook_ok DEBUG Webhook notification sent
notify.send.render_fallback WARN Email template rendering failed, using plain text fallback

Digest — batched digest delivery:

notify.digest.email_failed WARN Digest email failed
notify.digest.webhook_failed WARN Digest webhook failed
notify.digest.render_fallback WARN Digest template rendering failed, using plain text fallback

Health check:

notify.healthcheck DEBUG Health check completed

Metrics

Prometheus metrics emitted by the notify module:

notify_sent_total counter {channel, result} Incremented after each single-event delivery attempt.
channel=email|webhook, result=success|failure.
notify_digest_sent_total counter {channel, result} Incremented after each digest delivery attempt.
channel=email|webhook, result=success|failure.

Downstream metrics from related modules:

- smtp_send_total (from smtp module) — covers email delivery outcomes
- render_email_total (from render module) — covers template rendering

Distributed Sessions

Manages sessions across all protocols — HTTP and SSH share the same session store with instant revocation

Overview

Manages sessions for every protocol the gateway handles — HTTP, SSH, and PoW. Replaces per-service session stores with one cluster-wide store that supports instant revocation across all protocols. Disable a user once — every session terminates cluster-wide. Supports:

  • Unique session IDs (32 random bytes from crypto/rand, base64url-encoded, 256-bit) or custom IDs (e.g., SHA256 hash)
  • Dual-key indexing: primary by session ID, secondary by type+module_key
  • Automatic TTL expiration managed by distributed memory storage
  • Saga-based atomic session+index creation with rollback on failure
  • Pluggable extend validators (e.g., X.509 certificate revocation checks)
  • Pluggable create callbacks (e.g., post-create notifications)
  • Pluggable delete callbacks (e.g., per-type resource cleanup)
  • Session ID regeneration for session fixation protection
  • Lazy index cleanup on GetByModuleKey (handles missed OnDelete callbacks)
  • Thread-safe callback/validator registration (RWMutex)
  • Metrics: sessions_created, validations_success, validations_failed, sessions_extended, sessions_revoked, sessions_bulk_revoked, sessions_regenerated, activity_persisted
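
The two ID styles can be sketched as follows. The gateway itself uses crypto/rand; here the secrets module stands in for the operating system's CSPRNG. The function names are hypothetical, and hex encoding of the SHA-256 digest is an assumption.

```python
import hashlib
import secrets


def new_session_id() -> str:
    """32 random bytes (256 bits) from the OS CSPRNG, base64url-encoded."""
    return secrets.token_urlsafe(32)


def custom_session_id(token: str) -> str:
    """Deterministic custom ID: SHA-256 digest of the input (hex is an assumption)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()
```

A random ID is unguessable and unique per session. A hash-derived ID lets the same input, such as a bearer token, map to the same cache entry every time.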

Available operations:

Create - Create session with atomic dual-key indexing
Validate - Validate session, update LastActivity (does NOT extend TTL)
Extend - Extend TTL (runs validators first, caps to cert_not_after for X.509)
Revoke - Delete single session (index cleaned automatically)
RevokeAll - Delete all sessions for a type+module_key
List - List all sessions of a given type (filters expired)
GetByModuleKey - Reverse lookup by type+module_key with lazy cleanup
RegenerateID - New ID with same data (session fixation protection)

Session types in use:

user - Authenticated user sessions (web login, OIDC callback, X.509 auto-auth)
bastion - SSH bastion connection tracking
cobrowse - Proxy co-browse viewer sessions
password_expired - Temporary session for password change flow (short TTL)
mfa_pending - MFA verification pending (short TTL)
flow_pending - Signup/enrollment flow pending
jit2fa_pending - JIT 2FA OTP verification pending
jit2fa_auth - JIT 2FA authenticated session
pow - Proof-of-Work challenge session
bearer_cache - JWT Bearer token verification cache (custom ID = SHA256 of token)

Memory usage: ~600 bytes per active session (500 bytes session + 100 bytes index entry). For 1 million active sessions: ~600 MB cluster-wide.
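
The estimate is a straight multiplication, shown here so that other session counts can be plugged in. The per-session byte figures are the approximate values quoted above, and the function name is illustrative.

```python
SESSION_BYTES = 500  # approximate size of one session object
INDEX_BYTES = 100    # approximate size of its index entry


def estimate_bytes(active_sessions: int) -> int:
    """Approximate memory for the given number of active sessions."""
    return active_sessions * (SESSION_BYTES + INDEX_BYTES)
```

For one million sessions this gives 600,000,000 bytes, which is the ~600 MB figure (decimal megabytes).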

Config

Sessions have no dedicated [sessions] config section. TTL and cookie settings are controlled by the calling module via [service] and per-feature config:

[service]
cookie_name = "hexon" # Default session cookie name (default: "hexon")
cookie_domain = ".example.com" # Cookie domain for cross-subdomain sharing (default: current hostname only)
cookie_ttl = "12h" # Default session cookie TTL (default: "12h")
session_ttl = "24h" # Authenticated user session TTL (default: "24h")
session_password_expired = "15m" # Password expired session TTL (default: "15m")
session_mfa_pending = "5m" # MFA pending session TTL (default: "5m")
max_concurrent_sessions = 1 # Max concurrent sessions per user (default: 1, 0=unlimited)
[jit2fa]
cookie_name = "jit2fa_key" # Cookie name for JIT 2FA sessions (default: "jit2fa_key")
session_ttl = "8h" # JIT 2FA authenticated session TTL (default: "8h")
[forward_proxy]
session_cookie = "hexon_session" # Forward proxy session cookie name (default: "hexon_session")
[protection]
pow_cookie_name = "hexon_pow" # PoW session cookie (default: "hexon_pow", MUST differ from session cookie)

Recommended TTL values by session type:

Interactive web sessions (user): 12-24 hours
API tokens: 30-90 days
OAuth state: 5-10 minutes
MFA pending (mfa_pending): 5 minutes
Password expired (password_expired): 15 minutes
PoW/temporary tokens: 1-5 minutes
JIT 2FA (jit2fa_auth): 8 hours
Bastion: Caller-determined (bastion manager)
Bearer cache (bearer_cache): 5 minutes (default, configurable via [proxy].bearer_cache_ttl)

TTL behavior:

- Validate does NOT extend TTL but persists LastActivity when stale > sessionTTL/10 (clamped 1m–5m), fire-and-forget
- Extend explicitly sets new TTL from current time, requires cluster broadcast
- X.509 sessions: TTL capped to cert_not_after on both Create and Extend
- Minimum effective storage TTL is 1 minute (enforced as floor)
- Expired sessions are filtered out by List and GetByModuleKey
- Storage-level TTL expiry triggers OnDelete callback for automatic index cleanup
- TTLCapped field in Create/Extend responses indicates certificate-based capping
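
The clamping rules above can be sketched in a few lines. Function names are hypothetical. Applying the certificate cap before the 1-minute floor, and rejecting an already expired certificate with an exception, are assumptions about ordering and error handling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_TTL = timedelta(minutes=1)


def activity_persist_threshold(session_ttl: timedelta) -> timedelta:
    """sessionTTL/10, clamped to the range 1m to 5m."""
    return max(timedelta(minutes=1), min(session_ttl / 10, timedelta(minutes=5)))


def effective_ttl(requested: timedelta, now: datetime,
                  cert_not_after: Optional[datetime] = None) -> tuple[timedelta, bool]:
    """Return (ttl, ttl_capped) after the X.509 cap and the 1-minute floor."""
    ttl, capped = requested, False
    if cert_not_after is not None:
        if cert_not_after <= now:
            raise ValueError("certificate already expired")
        if now + ttl > cert_not_after:
            ttl, capped = cert_not_after - now, True
    return max(ttl, MIN_TTL), capped
```

With the default 24h session TTL the persist threshold is 5 minutes, so Validate writes LastActivity to storage at most once every 5 minutes per session.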

Troubleshooting

Common symptoms and diagnostic commands:

Session not persisting across requests:

- Cookie domain mismatch: verify [service].cookie_domain includes all subdomains
- Secure flag on non-HTTPS: cookies with Secure=true require HTTPS transport
- SameSite=Strict blocking cross-origin: check if auth redirect crosses domains
- Cookie name conflict: ensure cookie_name differs from pow_cookie_name and jit2fa cookie
- max_concurrent_sessions exceeded: new session may evict previous one
- Check: 'sessions list --user=<username>' to verify session exists in storage

Cross-node session loss (works on one node, fails on another):

- JetStream KV replication lag: check cluster quorum status with 'status'
- Saga partial failure: session created but index missing, or vice versa
- Network partition: quorum requirement (>50% nodes) prevents writes during partition
- Validate is local-only: session must be replicated to the validating node
- Check: 'sessions show <session_id>' from multiple nodes to compare
- Check: 'status' for cluster health and node connectivity
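
The quorum rule (more than 50% of nodes) explains why writes stop on the minority side of a partition. A minimal sketch with hypothetical function names:

```python
def quorum_size(nodes: int) -> int:
    """Smallest node count that is strictly more than half the cluster."""
    return nodes // 2 + 1


def can_write(reachable: int, nodes: int) -> bool:
    """True when the reachable nodes (including this one) form a quorum."""
    return reachable >= quorum_size(nodes)
```

In a 4-node cluster split 2/2, neither side reaches the required 3 nodes, so session creation fails on both sides until connectivity returns.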

Premature session expiration:

- TTL too short: check [service].session_ttl (default 24h) or caller-specific TTL
- Clock skew between nodes: ensure NTP is running (chrony or systemd-timesyncd)
- X.509 TTL capping: session capped to cert_not_after, verify certificate validity
- TTLCapped=true in response indicates certificate-based cap was applied
- Check: 'sessions show <session_id>' to compare ExpiresAt vs current time

Session extend rejected:

- Extend validator rejecting: check 'logs --module=sessions --level=warn'
- x509_revocation validator: certificate revoked (check OCSP/serial index)
- Certificate already expired: X.509 sessions cannot extend past cert_not_after
- Session not found: already expired or revoked before extend attempt
- Check: 'logs --module=sessions --keyword=validator' for rejection details

Stale sessions appearing in index (ghost sessions):

- OnDelete callback failed during network partition or node crash
- GetByModuleKey performs lazy cleanup: stale entries removed on next lookup
- Manual cleanup: 'sessions revoke <session_id>' for individual sessions
- Bulk cleanup: 'sessions revoke-user <username>' to clear all user sessions

Session fixation concerns:

- RegenerateID should be called after authentication or privilege escalation
- RegenerateID atomically creates new ID with same data, revokes old session
- Uses Saga: new session stored, index updated, old session deleted (with compensation)
- Check: 'logs --module=sessions --keyword=regenerated' for regeneration events

Diagnostic commands:

sessions list - List first 20 sessions (all types)
sessions list --type=user - List authenticated user sessions
sessions list --user=alice - List sessions for specific user
sessions list --offset=20 - Paginate to next page
sessions list --limit=50 - Show 50 sessions per page
sessions show <session_id> - Show full session details with metadata
sessions revoke <session_id> - Revoke a single session
sessions revoke-user <username> - Revoke all sessions for a user
diagnose user <username> - Full access diagnostic including session info
logs --module=sessions - Session operation logs
status - Cluster health (affects quorum operations)

Architecture

Dual-key storage strategy:

Primary key: sessions/{uuid} -> Session object
Secondary key: sessions_index/{type}/{module_key} -> SessionIndex (list of session IDs)
Uses '/' separator because NATS KV disallows ':' in key names.
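As an illustration of the two key shapes, the sketch below builds both keys with the '/' separator. The helper names are hypothetical, and the per-segment character check assumes the NATS KV key character set quoted in the memory storage section ([-/_=.a-zA-Z0-9]); how module keys containing other characters are encoded is not covered here.

```python
import re

# One key segment: the NATS KV character set minus '/', which this sketch
# reserves as the separator between segments (an assumption of the sketch).
_SEGMENT = re.compile(r"[-_=.a-zA-Z0-9]+")


def _segment(value: str) -> str:
    """Reject values that could not be used verbatim as one key segment."""
    if not _SEGMENT.fullmatch(value):
        raise ValueError(f"unsupported key segment: {value!r}")
    return value


def session_key(session_id: str) -> str:
    """Primary key: sessions/{id} -> Session object."""
    return "sessions/" + _segment(session_id)


def index_key(session_type: str, module_key: str) -> str:
    """Secondary key: sessions_index/{type}/{module_key} -> list of session IDs."""
    return "/".join(("sessions_index", _segment(session_type), _segment(module_key)))
```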

Session lifecycle:

1. Create: custom ID or crypto/rand 32-byte random ID (base64url) -> Saga(store session + update index)
-> OnDelete callback registered for automatic index cleanup
-> Create callbacks fired post-commit
-> Replicated to cluster with quorum requirement (>50% nodes)
2. Validate: Local read from memorystorage -> update LastActivity (local + throttled persist)
-> Persists to storage when stale > sessionTTL/10 (clamped 1m–5m), fire-and-forget
-> Does NOT extend TTL (explicit Extend call required for renewal)
3. Extend: Load session -> run all registered validators in sequence
-> Cap to cert_not_after for X.509 -> broadcast with quorum, OnDelete preserved
4. Revoke: Replicated delete to all nodes -> callback fires -> index cleaned
5. RevokeAll: Two-stage with cluster-wide safety net. The fast path
uses the per-user index; the safety-net path scans the cache for
any sessions the index missed (covers stale-index cases on peer
nodes after partition or partial replication). Both paths confirm
each delete reached at least one node before counting it.
Bounded at 10000 sessions per call; truncations are recorded as a
metric. Requires memory.cold_enabled=true for the safety net to
be effective; when disabled, a throttled warning per affected
user surfaces the gap.
6. RegenerateID: Saga(store new session + update index + delete old session)
-> Preserves original CreatedAt timestamp, copies all metadata
-> Compensation: rollback new session if old session deletion fails

Delete reason audit:

Every session deletion emits an audit log entry and a counter labelled
with the cause: expired (TTL), revoked (single-session admin/user
revoke), bulk (force-logout / password-change), rotated (session
regeneration after privilege change), saga_rollback (creation failed
mid-flow). Audit emission is once-per-cluster; operators alert on
reason=bulk spikes for force-logout activity, on reason=expired
baselines for TTL behaviour, etc.
Counter: sessions_deleted_total{type, reason}
Audit log: sessions.delete (LevelInfo, AsAudit). Fields include
session_type, reason, expires_at; correlates to begin/extend
audit lines via trace_id.
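The two-stage RevokeAll flow from the lifecycle list can be sketched roughly as below. The function signature, the dictionary standing in for the cache scan, and the delete callable are all assumptions made for illustration; only the fast path, the safety-net scan, the confirmation requirement, and the 10000-session bound come from the text.

```python
from typing import Callable, Dict, Iterable, Tuple

MAX_SESSIONS_PER_CALL = 10_000  # bound stated in the text


def revoke_all(module_key: str,
               index_ids: Iterable[str],
               cache: Dict[str, str],
               delete: Callable[[str], bool]) -> Tuple[int, bool]:
    """Revoke every session for module_key; return (revoked, truncated).

    index_ids: session IDs listed in the per-user index (fast path).
    cache:     session_id -> owning module_key, standing in for the cache
               scan used by the safety net (hypothetical shape).
    delete:    returns True once at least one node confirmed the delete.
    """
    candidates = list(dict.fromkeys(index_ids))  # fast path, de-duplicated
    seen = set(candidates)
    for session_id, owner in cache.items():      # safety net: index misses
        if owner == module_key and session_id not in seen:
            candidates.append(session_id)
            seen.add(session_id)

    truncated = len(candidates) > MAX_SESSIONS_PER_CALL
    revoked = 0
    for session_id in candidates[:MAX_SESSIONS_PER_CALL]:
        if delete(session_id):                   # count confirmed deletes only
            revoked += 1
    return revoked, truncated
```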

Saga operations (atomic multi-step with rollback):

- Create: Step 1 store session (compensate: delete), Step 2 update index
- RegenerateID: Step 1 store new (compensate: delete), Step 2 add to index,
Step 3 delete old (compensate: restore old session with TTL and OnDelete callback)
- Saga commit marks success; saga finalization defers cleanup/rollback
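The compensation pattern behind these sagas can be shown with a small generic runner. This is a sketch under simplifying assumptions (in-process dictionaries instead of replicated storage, hypothetical function names), not the module's code: each step pairs an action with an undo, and a failure undoes the completed steps in reverse order.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


def create_session(store: Dict[str, dict], index: Optional[Dict[str, List[str]]],
                   session_id: str, module_key: str, data: dict) -> bool:
    """Create saga: step 1 stores the session, step 2 updates the index."""
    def store_session() -> None:
        store[session_id] = data

    def delete_session() -> None:
        store.pop(session_id, None)

    def add_to_index() -> None:
        index.setdefault(module_key, []).append(session_id)

    return run_saga([
        (store_session, delete_session),
        (add_to_index, lambda: None),  # last step: nothing later can fail
    ])
```

If the index update raises, the stored session is deleted again, so no session is left without an index entry.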

Index consistency model:

- Automatic cleanup: OnDelete callback removes session_id on TTL expiry or manual delete
- Lazy cleanup: GetByModuleKey validates each session in index, removes stale entries
- Saga atomicity: Create and RegenerateID use compensating transactions
- Delete callbacks execute even if index removal fails (resource cleanup not blocked)
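Lazy cleanup during a reverse lookup can be sketched as below. The dictionaries stand in for the index and session caches, and the function name mirrors GetByModuleKey only for readability; the real signature is not given in the text.

```python
from typing import Dict, List


def get_by_module_key(index: Dict[str, List[str]],
                      sessions: Dict[str, dict],
                      index_key: str) -> List[dict]:
    """Return live sessions for index_key, pruning stale IDs from the index."""
    ids = index.get(index_key, [])
    live = [session_id for session_id in ids if session_id in sessions]
    if len(live) != len(ids):
        if live:
            index[index_key] = live      # keep only IDs that still resolve
        else:
            index.pop(index_key, None)   # nothing left: drop the index entry
    return [sessions[session_id] for session_id in live]
```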

Cluster behavior:

Sessions (Create/Extend): Replicated with quorum (>50% nodes must confirm)
Indices (Create/RegenerateID): Replicated with quorum (consistency required)
Validate: Local read + throttled fire-and-forget broadcast when LastActivity stale (sessionTTL/10, clamped 1m–5m)
Revoke/RevokeAll: Replicated to all nodes (eventual consistency acceptable)
OnDelete callbacks: Local execution per node, fire-and-forget, independent of cluster
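The LastActivity throttle used by Validate (sessionTTL/10, clamped between 1 and 5 minutes) works out as in this short sketch. The function names are hypothetical; only the formula is taken from the text.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(minutes=1)
MAX_INTERVAL = timedelta(minutes=5)


def persist_threshold(session_ttl: timedelta) -> timedelta:
    """sessionTTL/10, clamped to the 1m-5m range."""
    return min(max(session_ttl / 10, MIN_INTERVAL), MAX_INTERVAL)


def should_persist(now: datetime, last_persisted: datetime,
                   session_ttl: timedelta) -> bool:
    """True when the stored LastActivity is stale enough to broadcast again."""
    return now - last_persisted > persist_threshold(session_ttl)
```

With the default 24h TTL the threshold is capped at 5 minutes, so an active session triggers at most one activity broadcast per 5 minutes.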

Callback and validator architecture:

ExtendValidator: called BEFORE extend, CAN reject (returning error rejects extension)
Built-in: x509_revocation (checks cert revocation via OCSP cache and serial index)
For internal certs: checks serial index and moduledata
For external certs: checks OCSP cache and responder (soft-fails on infra errors)
CreateCallback: called AFTER successful create, fire-and-forget with panic recovery
DeleteCallback: called AFTER delete and index cleanup, fire-and-forget with panic recovery
Registration: thread-safe via RWMutex, map copied under read lock before execution
Execution: sequential, each callback wrapped in defer/recover, panics logged not propagated
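The registration and execution rules above follow a common pattern: copy the callback map under a lock, release the lock, then run each callback inside its own error boundary. The sketch below illustrates that pattern; the class is hypothetical, a plain threading.Lock stands in for the RWMutex, and Python exceptions stand in for recovered panics.

```python
import threading
from typing import Callable, Dict, List, Tuple


class CallbackRegistry:
    """Illustrative registry for fire-and-forget session callbacks."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._callbacks: Dict[str, Callable[[str], None]] = {}
        self.errors: List[Tuple[str, str]] = []  # stands in for error logging

    def register(self, name: str, callback: Callable[[str], None]) -> None:
        with self._lock:
            self._callbacks[name] = callback

    def fire(self, session_id: str) -> None:
        # Copy under the lock so callbacks run without holding it.
        with self._lock:
            snapshot = dict(self._callbacks)
        # Sequential execution; one failing callback does not stop the rest.
        for name, callback in snapshot.items():
            try:
                callback(session_id)
            except Exception as exc:
                self.errors.append((name, repr(exc)))
```

Running callbacks outside the lock means a slow or failing callback cannot block new registrations.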

Performance:

Direct lookup (Validate): O(1) by session ID, local read only
Reverse lookup (GetByModuleKey): O(1) index lookup + O(n) session loads
List all of type: O(N) scan of all sessions in storage, filtered by type
Typical sessions per user: 1-5 (bounded by max_concurrent_sessions)
Session object: ~500 bytes with metadata, index entry: ~100 bytes per reference

Security:

Session IDs: 256-bit crypto/rand, base64url (RawURLEncoding), no padding
Collision probability: ~2^-197 for 1 billion sessions (birthday bound, n^2 / 2^257)
X.509 TTL capping: cert_not_after metadata enforced on Create and Extend
Revocation: instant via Revoke/RevokeAll (stateful, no blacklist needed)
Session fixation: RegenerateID for post-authentication ID rotation
Metadata privacy: plaintext module_keys for lookup (hash sensitive identifiers)
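To make the ID format and the collision estimate concrete, the sketch below generates an ID the way the Security notes describe (32 random bytes, base64url without padding) and evaluates the birthday-bound approximation n^2 / 2^(bits+1). It uses Python's secrets module as a stand-in for crypto/rand; the function names are hypothetical.

```python
import base64
import math
import secrets


def new_session_id() -> str:
    """256 bits from the OS CSPRNG, base64url-encoded without '=' padding."""
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def collision_log2(n_sessions: int, id_bits: int = 256) -> float:
    """log2 of the birthday-bound approximation n^2 / 2^(id_bits + 1)."""
    return 2 * math.log2(n_sessions) - (id_bits + 1)
```

A 32-byte value encodes to 43 characters once the single padding character is stripped, and the alphabet (letters, digits, '-' and '_') is safe in cookies and URLs.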

Type registration:

All request/response types registered for cluster RPC serialization during init.

Interpreting tool output:

'sessions list':
Normal: Active sessions show User, Type, IP, Age — all expected
Stale: Sessions with Age > max_session_duration — cleanup may be delayed (runs every 5m)
Types: "authenticated" (normal), "mfa_pending" (waiting for MFA, 5min TTL), "password_expired"
High count: Many sessions for one user → check max_concurrent_sessions setting
Action: Suspicious session → 'sessions show <id>' for details, 'sessions revoke <id>' to terminate
'sessions list --user=<username>':
Empty: User has no active sessions — they are not logged in anywhere
Multiple types: "authenticated" + "mfa_pending" = user may be stuck in MFA flow
Action: Clear stuck MFA → 'sessions revoke-user <username>' (terminates ALL sessions)

Relationships

Module dependencies and interactions:

  • Distributed memory cache: KV store backend. Sessions stored in “sessions” cache type, indices in “sessions_index” cache type. Provides TTL expiration and OnDelete callbacks. All session CRUD operations delegate to the distributed cache.

  • proxy: Creates “user” sessions during OIDC SSO callback. Validates sessions on every proxied request for authentication enforcement. Session group monitor refreshes group membership and revokes sessions on group changes. Creates “cobrowse” sessions for co-browse viewer tracking. Creates “bearer_cache” sessions to cache JWT ID token verifications (SHA256 of token as custom session ID, configurable TTL).

  • signin: Creates “user” sessions after successful authentication, “password_expired” sessions for password change flow, “mfa_pending” sessions for MFA verification.

  • signup: Creates “flow_pending” sessions during enrollment, “mfa_pending” during TOTP/passkey setup, “user” sessions after completed registration.

  • bastion: Creates “bastion” sessions for SSH connection tracking. Session metadata includes connection details for audit trail and session sharing features.

  • authentication.x509: Registers x509_revocation extend validator. Checks certificate revocation status before allowing session extension. Sets cert_not_after metadata for TTL capping on both Create and Extend operations.

  • authentication.jit2fa: Creates “jit2fa_pending” and “jit2fa_auth” sessions with separate cookie (jit2fa_key) and configurable TTL (default 8h).

  • passwordchange: Validates “user” and “password_expired” session types. Creates new “user” session after successful password change. Triggers revocation of old sessions.

  • pow: Creates “pow” sessions after successful proof-of-work challenge. Uses separate cookie (hexon_pow) to avoid conflicts with main session cookie.

  • profile: Creates “user” sessions during profile management operations.

  • Directory: Group membership changes can trigger session revocation via proxy session monitor. Provides fresh group lookups for per-request authorization.

  • middleware (handlers): Creates “user” sessions during X.509 auto-authentication in the middleware chain when client certificate is present.

  • telemetry: All operations log with structured entries including trace IDs and security context (session ID, username). Levels: Error (storage/saga failures), Warn (not found, expired, validator rejections), Info (create/revoke events), Debug (normal validate/extend operations).

  • metrics: Runtime counters for all session operations (created, validated, extended, revoked, bulk_revoked, regenerated, validation failures by reason).

  • config ([service]): Provides default TTL values, cookie configuration, and max_concurrent_sessions limit. No dedicated [sessions] config section; TTL policies are caller-determined (each module passes its own TTL to Create).

Logs

Log entries by operation. Search with: logs search “sessions”. Levels: ERROR > WARN > INFO > DEBUG.

Session Create:

sessions.create INFO Session created (type, module_key, TTL)
sessions.create WARN TTL capped to certificate validity / DurableKV not available
sessions.create ERROR Failed to generate ID / store session / update index

Session Validate:

sessions.validate DEBUG Session validated (type, module_key)
sessions.validate ERROR Invalid session type in storage

Session Extend:

sessions.extend DEBUG Session TTL extended
sessions.extend WARN Extension rejected by validator / cert expired / TTL capped

Session Revoke:

sessions.revoke INFO Session revoked
sessions.revoke WARN Failed to broadcast deletion
sessions.revoke_all INFO All sessions revoked for module_key

Session Regenerate:

sessions.regenerate INFO Session ID regenerated successfully
sessions.regenerate WARN Session not found for regeneration
sessions.regenerate ERROR Fetch/generate/store/index/delete failures

Activity Tracking:

sessions.persist_activity ERROR Panic recovered persisting LastActivity

Callbacks & Validators:

sessions.validator INFO Session extend validator registered
sessions.callback INFO Session create/delete/delete_v2 callback registered
sessions.callback ERROR Callback panicked (create/delete/delete_v2)

Index:

sessions.index DEBUG Index cleanup / session removed / index deleted

Metrics

Prometheus metrics. Query with: metrics prometheus sessions_<name>

Lifecycle:

sessions_sessions_created counter {type} Sessions created
sessions_sessions_revoked counter {} Sessions revoked (single)
sessions_sessions_bulk_revoked counter {type} Sessions bulk-revoked
sessions_sessions_regenerated counter {type} Session IDs regenerated
sessions_sessions_extended counter {type} Session TTLs extended
sessions_activity_persisted counter {type} Activity timestamps persisted

Validation:

sessions_validations_success counter {type} Successful validations
sessions_validations_failed counter {reason} Failed validations (storage_error, wait_error, not_found, invalid_type)

Alerts:

rate(sessions_validations_failed{reason="not_found"}[5m]) > 50 High session-not-found rate (expired or stale cookies)
rate(sessions_sessions_bulk_revoked[5m]) > 0 Bulk revocation event (user disabled or password change)
rate(sessions_validations_failed{reason="storage_error"}[5m]) > 0 Storage backend issues

SMTP Email Delivery

Sends emails for OTP codes, magic links, certificate notifications, and alerts — templated and localized

Overview

Handles all outbound email delivery for the gateway — OTP codes, magic links, certificate notifications, and alerts. Other modules request email delivery; this module handles connection management, templates, and localization. Supports SSL, STARTTLS, HTML/plain-text multipart, and file attachments.

Core capabilities:

  • Generic email sending with HTML and plain text multipart content
  • OTP (One-Time Password) emails for authentication flows
  • Certificate renewal notification emails with cert and CA bundle attached
  • Passkey expiration reminder emails with re-enrollment link
  • Health checks for SMTP server connectivity verification
  • Multi-part email composition with file attachments
  • Three encryption modes: SSL (port 465), STARTTLS (port 587), plain (port 25)
  • Multi-language email localization (en, es, fr, zh, ca)
  • Template rendering with branded HTML email templates
  • User language preference lookup for automatic localization
  • RFC 5321 compliant address validation (local ≤ 64, domain ≤ 255 chars)
  • RFC 5322 compliant headers (Date, Message-ID on every email)
  • RFC 8255 Content-Language header on templated emails
  • Message-ID in structured logs (success + failure) for MTA correlation
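As a rough illustration of what a multipart message with these headers looks like, the sketch below builds one with Python's standard email package. It is not the module's composer; the function name and parameters are hypothetical, and the msgid_domain argument is an assumption made so the example does not depend on the local hostname.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def build_email(sender: str, recipient: str, subject: str, text: str,
                html: str, language: str = "en",
                msgid_domain: str = "example.com") -> EmailMessage:
    """Compose a plain-text + HTML message with Date and Message-ID set."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = formatdate(localtime=False)
    msg["Message-ID"] = make_msgid(domain=msgid_domain)
    msg.set_content(text)                      # text/plain part
    msg.add_alternative(html, subtype="html")  # becomes multipart/alternative
    # Set after the body: set_content() clears existing Content-* headers.
    msg["Content-Language"] = language
    return msg
```

The Message-ID generated here is the value to log alongside the send result, since it is what relay MTAs and bounce reports will reference.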

Localization priority for templated emails:

1. Language field explicitly set in the request
2. User preference from stored preferences
3. Default fallback to "en" (English)

Supported languages: en (English), es (Spanish), fr (French), zh (Chinese), ca (Catalan)
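The priority order can be expressed as a small resolver. This is a sketch with a hypothetical function name; in particular, falling through from an unsupported explicit language to the stored preference (rather than straight to English) is an assumption of this example.

```python
from typing import Optional

SUPPORTED_LANGUAGES = ("en", "es", "fr", "zh", "ca")


def resolve_language(requested: Optional[str], preferred: Optional[str],
                     default: str = "en") -> str:
    """Pick the email language: request field, then user preference, then default."""
    for candidate in (requested, preferred):
        if candidate and candidate.lower() in SUPPORTED_LANGUAGES:
            return candidate.lower()
    return default
```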

Config

SMTP configuration under [smtp] section:

[smtp]
host = "smtp.gmail.com" # SMTP server hostname (required)
port = 587 # SMTP server port (required)
encryption = "starttls" # Encryption mode: "ssl", "starttls", or "none"
user = "noreply@example.com" # SMTP authentication username
password = "app-specific-password" # SMTP authentication password (sensitive)
from = "noreply@example.com" # Sender email address (From header)
reply_to = "support@example.com" # Reply-To header address (optional)
name = "HexonAuth" # Sender display name (optional)
skip_tls = false # Skip TLS certificate verification (default: false)

skip_tls: Disables server certificate validation for SSL and STARTTLS modes.
Logs a WARN on every send when enabled. Use only when the SMTP server presents
an untrusted or hostname-mismatched certificate. NOT recommended for production.

Encryption modes:

ssl (port 465): Direct TLS connection from the start.
starttls (port 587): Plain connection upgraded to TLS. Recommended.
none (port 25): Unencrypted. Not recommended for production.
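The difference between the three modes is easiest to see at the connection level. The sketch below uses Python's smtplib and ssl modules purely for illustration; the function names are hypothetical and the skip_tls parameter only mirrors the meaning of the config key, not the gateway's own client code.

```python
import smtplib
import ssl


def tls_context(skip_tls: bool = False) -> ssl.SSLContext:
    """Default verification, or none at all when skip_tls is set."""
    context = ssl.create_default_context()
    if skip_tls:
        context.check_hostname = False       # must be disabled first
        context.verify_mode = ssl.CERT_NONE  # accept any certificate
    return context


def connect(host: str, port: int, encryption: str,
            skip_tls: bool = False, timeout: float = 10.0) -> smtplib.SMTP:
    """Open an SMTP connection using one of the three encryption modes."""
    if encryption not in ("ssl", "starttls", "none"):
        raise ValueError(f"unknown encryption mode: {encryption!r}")
    if encryption == "ssl":
        # TLS from the first byte (typically port 465).
        return smtplib.SMTP_SSL(host, port, timeout=timeout,
                                context=tls_context(skip_tls))
    client = smtplib.SMTP(host, port, timeout=timeout)
    if encryption == "starttls":
        # Plain connection upgraded to TLS (typically port 587).
        client.starttls(context=tls_context(skip_tls))
        client.ehlo()  # re-identify over the encrypted channel
    return client
```

With skip_tls enabled the connection is still encrypted, but the server's identity is no longer verified, which is why it is discouraged in production.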

Common SMTP provider configurations:

Gmail: host = "smtp.gmail.com", port = 587, encryption = "starttls"
(requires App Passwords with 2FA enabled)
SendGrid: host = "smtp.sendgrid.net", port = 587, encryption = "starttls"
user = "apikey", password = "<sendgrid-api-key>"
AWS SES: host = "email-smtp.<region>.amazonaws.com", port = 587
Mailgun: host = "smtp.mailgun.org", port = 587, encryption = "starttls"

Hot-reloadable: all SMTP settings (host, port, encryption, credentials). Cold (restart required): none.

Troubleshooting

Common symptoms and diagnostic steps:

SMTP connection failures:

- Check SMTP health: 'smtp health' tests connectivity and authentication
- Verify host and port match encryption mode (SSL=465, STARTTLS=587)
- Firewall blocking outbound: verify server can reach SMTP host:port
- Network probe: 'net tcp <smtp-host>:<port> --tls' for SSL
- DNS resolution: 'dns test <smtp-host>' to verify hostname resolves

Authentication failures:

- Gmail: requires App Passwords (regular password won't work with 2FA)
- SendGrid: user must be literal string "apikey", password is the API key
- AWS SES: IAM credentials, not root account credentials
- Check: 'config show smtp' to verify configuration (password redacted)

Emails not being received:

- Check spam/junk folder at recipient mail provider
- Verify from address matches authenticated user or authorized alias
- Configure SPF, DKIM, and DMARC DNS records for sending domain
- Test delivery: 'smtp test <to-address>' sends a test message
- Check: 'notify health' for notification system status

OTP emails delayed or missing:

- SMTP latency: 200-1000ms is normal, check 'smtp health'
- OTP code expired before email arrived: check OTP validity window
- Rate limiting by SMTP provider: check provider dashboard

Passkey expiration emails not sent:

- Expired passkeys intentionally do not receive reminder emails
- Verify DaysRemaining is positive (zero or negative triggers no email)

TLS certificate verification failures:

- "STARTTLS failed: tls: failed to verify certificate" or "TLS dial failed"
- Common cause: SMTP relay hostname differs from certificate CN/SAN
(e.g., smtp.company.com forwards to smtp.gmail.com)
- Temporary fix: set skip_tls = true in [smtp] config (logs WARN per send)
- Proper fix: configure the SMTP server with a valid certificate matching its hostname
- Check: 'net tls <smtp-host>:<port>' to inspect the certificate chain

Template rendering errors:

- Missing locale: unsupported language falls back to English
- Template rendering failures prevent email send (no fallback)

Address validation failures (RFC 5321):

- "local part exceeds 64 character limit": email local part too long
- "domain part exceeds 255 character limit": email domain too long
- These limits are per RFC 5321 §4.5.3.1
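A length check matching these two limits can be sketched as follows. The function name is hypothetical and the check is deliberately limited to the length rules; it counts UTF-8 octets, since the RFC limits are expressed in octets, and it is not a full address parser.

```python
def validate_address_length(address: str) -> None:
    """Raise ValueError when the address breaks the RFC 5321 length limits."""
    local, separator, domain = address.rpartition("@")
    if not separator or not local or not domain:
        raise ValueError("address must contain a local part and a domain")
    if len(local.encode("utf-8")) > 64:
        raise ValueError("local part exceeds 64 character limit")
    if len(domain.encode("utf-8")) > 255:
        raise ValueError("domain part exceeds 255 character limit")
```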

Correlating email delivery with MTA logs:

- Every send logs Message-ID (success and failure)
- Use Message-ID to trace through relay MTAs, bounce reports, DMARC feedback
- Search gateway logs: 'logs search message_id=<value>'

Relationships

Module dependencies and interactions:

  • OTP authentication: Delivers one-time passwords for email-based auth.
  • Certificate management: Sends renewal notification emails with cert and CA bundle attachments.
  • WebAuthn/Passkey: Sends passkey expiration reminder emails.
  • Notification service: Uses SMTP for email channel delivery alongside webhooks for multi-channel routing.
  • Directory: User full name lookup for personalized email greetings.
  • Localization: Localized email text loaded from locale files.
  • Configuration: Reads [smtp] TOML section. All settings hot-reloadable.
  • Admin CLI: ‘smtp health’ and ‘smtp test’ commands for diagnostics.

Logs

Log entries emitted by the smtp module. Search with: logs search “smtp”. Levels: ERROR > WARN > INFO > DEBUG > TRACE. AUDIT = persisted to tamper-proof audit log.

PAT expiry callback (init):

smtp.pat_expiry INFO AUDIT Personal Access Token expired

TLS certificate warnings (sendViaSSL / sendViaSTARTTLS):

smtp.send WARN TLS certificate verification failed, retrying with skip_tls=true — not recommended for production, configure a valid certificate
smtp.send WARN STARTTLS certificate verification failed, retrying with skip_tls=true — not recommended for production, configure a valid certificate

Magic link validation (SendMagicLinkEmail):

smtp.magiclink WARN AUDIT Magic link email blocked — invalid sealed return URL

Skip notifications (SendPasskeyExpirationEmail / SendVPNPSKExpirationEmail):

smtp.send DEBUG Skipping email for expired passkey
smtp.send DEBUG Skipping email for expired PSK

Generic email (SendEmail):

smtp.send ERROR SMTP send failed
smtp.send INFO Email sent successfully

OTP email (SendOTPEmail):

smtp.send ERROR SMTP send failed
smtp.send INFO Email sent successfully

Certificate renewal email (SendCertRenewalEmail):

smtp.send ERROR SMTP cert renewal send failed
smtp.send INFO Certificate renewal email sent

Passkey expiration email (SendPasskeyExpirationEmail):

smtp.send ERROR SMTP passkey expiration send failed
smtp.send INFO Passkey expiration email sent

Magic link email (SendMagicLinkEmail):

smtp.send ERROR SMTP send failed
smtp.send INFO Magic link email sent

Test email (SendTestEmail):

smtp.test ERROR SMTP test email failed
smtp.test INFO SMTP test email sent

PAT created email (SendPATCreatedEmail):

smtp.pat_created ERROR PAT creation notification email failed
smtp.pat_created INFO PAT creation notification email sent

PAT revoked email (SendPATRevokedEmail):

smtp.pat_revoked ERROR PAT revocation notification email failed
smtp.pat_revoked INFO PAT revocation notification email sent

PAT expired email (SendPATExpiredEmail):

smtp.pat_expired ERROR PAT expiration notification email failed
smtp.pat_expired INFO PAT expiration notification email sent

Passkey created email (SendPasskeyCreatedEmail):

smtp.passkey_created ERROR Passkey creation notification email failed
smtp.passkey_created INFO Passkey creation notification email sent

Passkey revoked email (SendPasskeyRevokedEmail):

smtp.passkey_revoked ERROR Passkey revocation notification email failed
smtp.passkey_revoked INFO Passkey revocation notification email sent

TOTP created email (SendTOTPCreatedEmail):

smtp.totp_created ERROR TOTP creation notification email failed
smtp.totp_created INFO TOTP creation notification email sent

TOTP revoked email (SendTOTPRevokedEmail):

smtp.totp_revoked ERROR TOTP revocation notification email failed
smtp.totp_revoked INFO TOTP revocation notification email sent

Certificate created email (SendCertCreatedEmail):

smtp.cert_created ERROR Certificate creation notification email failed
smtp.cert_created INFO Certificate creation notification email sent

Certificate revoked email (SendCertRevokedEmail):

smtp.cert_revoked ERROR Certificate revocation notification email failed
smtp.cert_revoked INFO Certificate revocation notification email sent

Metrics

Prometheus metrics. Query with: metrics prometheus smtp_<name>

Email delivery:

smtp_emails_sent_total counter {type, result} Emails sent per type and outcome
smtp_send_duration latency {type, result} Email send duration per type and outcome

Label values:

type: generic | otp | cert_renewal | passkey_expiration | vpn_enrollment |
vpn_device_code | vpn_psk_expiration | magiclink | test |
pat_created | pat_revoked | pat_expired |
passkey_created | passkey_revoked |
totp_created | totp_revoked |
cert_created | cert_revoked
result: success | failure

Note: Only core email types emit latency (generic, otp, cert_renewal, passkey_expiration, vpn_enrollment, vpn_device_code, vpn_psk_expiration, magiclink). All other types (test, pat_*, passkey_created, passkey_revoked, totp_*, cert_created, cert_revoked) emit the counter only, with no latency metric.

Alerts:

rate(smtp_emails_sent_total{result="failure"}[5m]) > 5 SMTP delivery issues
smtp_send_duration{quantile="0.99"} > 5s SMTP latency degradation

Persistent File Storage

Persistent on-disk storage for durable module data — supports shared NFS and per-node replication

Overview

Provides persistent, crash-safe file storage for modules that need durable on-disk data. Two deployment modes: shared (NFS) where all nodes see the same filesystem, and replicated (local) where each node maintains its own copy with broadcast synchronization.

Core capabilities:

  • Module-namespaced directories (each module gets isolated storage)
  • Atomic writes via temporary file + rename pattern (crash-safe)
  • Optional file locking via flock for NFS shared mode
  • JSON marshaling/unmarshaling for structured data
  • Full file lifecycle: Save, Load, Delete, Move, List, Exists
  • Path traversal protection with multi-layer validation
  • Fuzz-tested security boundary (traversal, null bytes, unicode attacks)

Storage modes:

Shared (NFS): All nodes see the same files. Operations are local only
(no broadcast needed). File locking prevents race conditions between nodes.
Example path: /shared/webauthn/passkeys/active/abc123.json
Replicated (Local): Each node maintains its own filesystem. Write operations
are replicated to all nodes. No locking needed since each node
owns its local copy.
Example path: /data/webauthn/passkeys/active/abc123.json

File permissions: 0644 (files), 0755 (directories). Module directories are created on demand during Save operations.

Config

Configuration under [filesystem]:

[filesystem]
base_path = "/shared" # Root directory for all module storage
mode = "shared" # "shared" (NFS) or "local" (replicated per node)
use_flock = true # Enable file locking (recommended for shared mode)

Mode selection guidance:

shared: Use when all nodes mount the same NFS/distributed filesystem.
- Set use_flock = true to prevent concurrent write races
- Operations are local only (no cluster replication needed)
- Simplest setup, but requires reliable NFS infrastructure
local (replicated): Use when each node has independent local storage.
- Write operations (Save, Delete, Move) are replicated to all nodes
- Read operations (Load, List, Exists) are local only
- No file locking needed (each node owns its storage)
- More resilient to NFS failures, but only eventually consistent across nodes

Operation routing by mode:

Shared mode:
Save, Load, Delete, Move -> all execute locally (no cluster broadcast)
Replicated mode:
Save, Delete, Move -> replicated to all cluster nodes
Load -> local only (read from local storage)

Hot-reloadable: None. Changes to base_path, mode, or use_flock require restart.

Troubleshooting

Common symptoms and diagnostic steps:

File not found after Save (replicated mode):

- Verify Save used cluster-wide replication (replicated mode requires it)
- Check if querying node received the write replication (network partition)
- Replication is eventually consistent; small delay before Load on other nodes
- Verify base_path is correct on all nodes (must match across cluster)

Permission denied errors:

- Check filesystem permissions: files need 0644, directories need 0755
- Verify the hexon process user has write access to base_path
- NFS export options (server side): ensure no_root_squash or correct uid/gid mapping
- SELinux/AppArmor may block writes to NFS mounts

Path traversal error (ErrPathTraversal):

- Module name contains '/', '\', or '..' (invalid characters)
- Subpath starts with '/' or '\' (must be relative)
- Subpath contains '..' traversal sequences after path cleaning
- Resolved path escapes the module directory boundary
- This is a security feature; do not attempt to bypass it

File locking issues (shared/NFS mode):

- Stale locks after crash: flock is released on process exit by the OS
- NFS lock daemon (lockd/statd) must be running on all nodes
- NFSv4 has built-in locking; NFSv3 requires separate lock services
- Deadlock is unlikely: operations hold locks only briefly (JSON marshal + write + rename)
- If use_flock = false on shared mode, concurrent writes may corrupt files

Atomic write failures:

- Disk full: temporary file creation fails before rename
- Cross-device rename: base_path and temp dir must be on same filesystem
- Check disk space: df -h on the base_path partition
- Temp file cleanup: orphaned .tmp files indicate interrupted writes

List operation returns empty:

- Verify the subpath directory exists (directories created on Save only)
- Check glob pattern syntax (uses filepath.Glob matching rules)
- Pattern is matched against filenames only, not full paths
- Module directory is base_path/module_name/subpath

Data corruption or invalid JSON:

- Atomic writes prevent partial writes; corruption suggests disk issues
- NFS cache coherence: mount with actimeo=0 for immediate consistency
- Check for concurrent writes without flock enabled
- Validate JSON: load the file directly and check for syntax errors

Architecture

Write path (Save operation):

1. Validate path (module name + subpath traversal checks)
2. Create module directory tree if needed (MkdirAll with 0755)
3. Marshal data to JSON with indentation
4. Create temporary file in same directory
5. Write JSON content to temporary file
6. Sync to disk (fsync)
7. Atomic rename: tmp file -> target path
8. Optional: acquire/release flock around steps 4-7 (shared mode)
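The write path above is the standard temp-file-plus-rename pattern. The Python sketch below walks through the same steps on a POSIX system; it is an illustration rather than the module's code, and locking a sidecar ".lock" file is an assumption of this example (the text does not say which file descriptor is locked).

```python
import fcntl
import json
import os
import tempfile


def save_json(path: str, data, use_flock: bool = False) -> None:
    """Write data as indented JSON to path via temp file, fsync and rename."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, mode=0o755, exist_ok=True)        # step 2
    payload = json.dumps(data, indent=2).encode("utf-8")     # step 3

    lock_fd = None
    if use_flock:
        # Assumption: an exclusive advisory lock on a sidecar lock file.
        lock_fd = os.open(path + ".lock", os.O_CREAT | os.O_RDWR, 0o644)
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
    try:
        # Same directory as the target, so the rename stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")  # step 4
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(payload)                            # step 5
                tmp.flush()
                os.fsync(tmp.fileno())                        # step 6
            os.chmod(tmp_path, 0o644)
            os.replace(tmp_path, path)                        # step 7
        except BaseException:
            # Interrupted write: remove the orphaned temp file.
            try:
                os.unlink(tmp_path)
            except FileNotFoundError:
                pass
            raise
    finally:
        if lock_fd is not None:
            fcntl.flock(lock_fd, fcntl.LOCK_UN)
            os.close(lock_fd)
```

Readers either see the previous complete file or the new complete file, never a partial write, because the rename is the only step that changes what the target path points to.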

Read path (Load operation):

1. Validate path
2. Read file contents (os.ReadFile)
3. Unmarshal JSON into interface{}
4. Return data with Found=true, or Found=false if file not found

File locking (shared mode only):

Uses syscall.Flock with LOCK_EX (exclusive) for writes and LOCK_SH (shared)
for reads. Locks are advisory and only effective when all accessors use flock.
Lock scope is per-file, not per-directory.

Module isolation:

Each module's storage is confined to base_path/module_name/. Path validation
ensures no module can read or write outside its own directory. The validation
is defense-in-depth: multiple checks at different levels prevent escape.
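The layered checks can be sketched as below, following the conditions listed under the ErrPathTraversal troubleshooting entry. The function and exception names are hypothetical, and the null-byte checks are an assumption based on the fuzz-testing note in the overview.

```python
import os


class PathTraversalError(ValueError):
    """Raised when a module name or subpath would escape the module directory."""


def resolve_path(base_path: str, module: str, subpath: str) -> str:
    """Return the absolute path for module/subpath, or raise PathTraversalError."""
    # Layer 1: the module name must be a single plain directory name.
    if (not module or "/" in module or "\\" in module
            or ".." in module or "\x00" in module):
        raise PathTraversalError(f"invalid module name: {module!r}")
    # Layer 2: the subpath must be relative.
    if subpath.startswith(("/", "\\")) or "\x00" in subpath:
        raise PathTraversalError(f"invalid subpath: {subpath!r}")
    # Layer 3: no '..' may survive path cleaning.
    cleaned = os.path.normpath(subpath)
    if cleaned == ".." or cleaned.startswith(".." + os.sep):
        raise PathTraversalError(f"subpath escapes module directory: {subpath!r}")
    # Layer 4: the resolved path must stay inside base_path/module.
    module_root = os.path.realpath(os.path.join(base_path, module))
    resolved = os.path.realpath(os.path.join(module_root, cleaned))
    if os.path.commonpath([module_root, resolved]) != module_root:
        raise PathTraversalError(f"resolved path escapes module directory: {subpath!r}")
    return resolved
```

The last check also catches escapes that the string checks cannot see, such as a symlink inside the module directory that points elsewhere.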

Relationships

Module dependencies and interactions:

  • webauthn: Stores passkey credentials as JSON files. Uses shared mode for cross-node passkey availability. Files organized in active/revoked subdirectories.
  • acme (CA): Stores issued certificates, private keys, and ACME account data. Requires persistent storage that survives restarts.
  • config: Filesystem base_path and mode read from TOML configuration. No hot-reload; changes require restart.
  • telemetry: Structured logging for all file operations (save, load, delete, move) with module name, subpath, and error details.
  • memory (memorystorage): Complementary storage. Use filesystem for persistent data that must survive restarts; use memory for ephemeral data with TTL. Some modules use both: memory for fast lookups, filesystem for durable backup.
  • cluster: In replicated mode, cluster health affects write propagation. Node failures may result in missed broadcasts (eventually consistent).

Logs

Log entries emitted by this module. Search with: logs search “storage.filesystem”. Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Save Operation:

storage.filesystem WARN Path traversal attempt blocked
storage.filesystem ERROR Failed to create directory
storage.filesystem ERROR Failed to marshal JSON
storage.filesystem ERROR Failed to save file
storage.filesystem DEBUG File saved

Load Operation:

storage.filesystem WARN Path traversal attempt blocked
storage.filesystem DEBUG File not found
storage.filesystem ERROR Failed to read file
storage.filesystem ERROR Failed to unmarshal JSON
storage.filesystem DEBUG File loaded

Delete Operation:

storage.filesystem WARN Path traversal attempt blocked
storage.filesystem DEBUG File not found for deletion
storage.filesystem ERROR Failed to delete file
storage.filesystem DEBUG File deleted

Move Operation:

storage.filesystem WARN Path traversal attempt blocked (source)
storage.filesystem WARN Path traversal attempt blocked (target)
storage.filesystem ERROR Failed to create target directory
storage.filesystem ERROR Failed to move file
storage.filesystem DEBUG File moved

List Operation:

storage.filesystem WARN Path traversal attempt blocked
storage.filesystem DEBUG Directory not found
storage.filesystem ERROR Failed to read directory
storage.filesystem DEBUG Directory listed

Exists Operation:

storage.filesystem WARN Path traversal attempt blocked
storage.filesystem DEBUG File existence checked

Metrics

No Prometheus metrics are emitted by this module.


Distributed Memory Storage

Ephemeral key-value storage shared across all nodes — used by sessions, OTP, PoW, and tokens

Overview

Provides fast, in-memory key-value storage replicated across all cluster nodes with automatic TTL expiration. Used by sessions, OTP codes, PoW challenges, OIDC tokens, and other time-sensitive data that needs cluster-wide availability. Data expires automatically — no manual cleanup required.

Core capabilities:

  • Namespace-isolated caches (cache types prevent key collisions)
  • Automatic TTL-based expiration with background eviction every 30 seconds
  • OnSet and OnDelete callback support (fire-and-forget, local only)
  • Thread-safe operations with mutex protection
  • Cluster-wide replication (writes replicated to all nodes)
  • Eventually consistent reads (local only, no network overhead)
  • NATS JetStream KV persistence for crash recovery (optional)
  • Peer-to-peer bootstrap fallback when JetStream unavailable
  • Two-tier hot/cold cache for large-scale deployments (30M+ users)
  • SetNX for atomic set-if-not-exists (distributed locks)
  • Touch for TTL renewal without value modification

Consistency model:

Reads (Get): Local first, O(1). With cold_enabled=true, falls through to KV on miss (~1ms).
Reads (All): Local only, O(n) in the number of hot entries. Returns hot entries only (no KV scan in cold mode).
Writes (Set, Delete): Local immediate + optional replication to all nodes
Writes are best-effort with no quorum requirement by default.
For strong consistency, use cluster-wide replication with quorum confirmation.

Storage architecture: two-level map structure

caches[cache_type][key] -> storageEntry with Value, Expiration, Callbacks
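
As an illustration of the two-level layout, here is a minimal Python sketch. It is a model of the behaviour described in this section, not the hexon implementation: the class and method names are hypothetical, callbacks and replication are left out, and the mutex is omitted for brevity.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StorageEntry:
    value: Any
    expiration: Optional[float]  # absolute time in seconds; None = never expires


class MemoryStore:
    """Hypothetical model of caches[cache_type][key] -> entry."""

    def __init__(self):
        self.caches = {}  # cache_type -> {key -> StorageEntry}

    def set(self, cache_type, key, value, ttl=0.0, now=None):
        now = time.time() if now is None else now
        # A zero TTL means the entry never expires.
        expiration = now + ttl if ttl > 0 else None
        self.caches.setdefault(cache_type, {})[key] = StorageEntry(value, expiration)

    def get(self, cache_type, key, now=None):
        """Return (value, found). Expired entries are invisible even before eviction."""
        now = time.time() if now is None else now
        entry = self.caches.get(cache_type, {}).get(key)
        if entry is None:
            return None, False
        if entry.expiration is not None and entry.expiration < now:
            return None, False
        return entry.value, True

    def touch(self, cache_type, key, ttl, now=None):
        """Renew the TTL without modifying the value."""
        now = time.time() if now is None else now
        _, found = self.get(cache_type, key, now)
        if not found:
            return False
        self.caches[cache_type][key].expiration = now + ttl
        return True
```

Because the cache type is the outer key, the same key string can live in two cache types without colliding.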

Data types stored in memory must be compatible with the cluster serialization layer. Custom structs, slices, and maps with custom types are supported.

Config

Configuration under [cluster] (memory persistence):

[cluster]
cluster_path = "/var/lib/hexon/cluster" # Base path for JetStream storage
persist_memory = true # Use FileStorage for KV bucket
memory_kv_max_write = 10 # Max concurrent KV writes (1-100)

When persist_memory = true and cluster_path is set:

- NATS JetStream KV bucket "hexon_storage_memory" is created
- Writes are asynchronously persisted to JetStream after local cache update
- Concurrent KV writes throttled by memory_kv_max_write (default 10)
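- On startup, all entries are bootstrapped from JetStream KV (with cold_enabled = true the warmup is skipped and entries load on demand; see Hot/cold cache below)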
- JetStream uses Raft consensus in 3+ node clusters for durability
- Data survives full cluster restarts

When persist_memory = false or cluster_path is unset:

- KV bucket uses MemoryStorage (data lost on restart)
- Falls back to peer-to-peer bootstrap from live cluster nodes
- Suitable for truly ephemeral data (PoW challenges, rate limit counters)

Key encoding for NATS KV:

NATS KV keys only allow [-/_=\.a-zA-Z0-9]+. Keys from external sources
(LDAP groups with spaces, email addresses) are base64url encoded:
Format: {cacheType}/{base64url(key)}
Example: "directory_groups/UmVwbGljYXRpb24gQWRtaW5pc3RyYXRvcnM"
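
The key format can be reproduced with a short Python sketch. The function names are hypothetical, and stripping the "=" padding is an assumption inferred from the example above, which carries none.

```python
import base64


def encode_kv_key(cache_type, key):
    """Build {cacheType}/{base64url(key)} with padding stripped (assumed)."""
    encoded = base64.urlsafe_b64encode(key.encode("utf-8")).rstrip(b"=")
    return cache_type + "/" + encoded.decode("ascii")


def decode_kv_key(kv_key):
    """Split a KV key back into (cache_type, original_key)."""
    cache_type, _, encoded = kv_key.partition("/")
    padding = "=" * (-len(encoded) % 4)  # restore padding before decoding
    key = base64.urlsafe_b64decode(encoded + padding).decode("utf-8")
    return cache_type, key


print(encode_kv_key("directory_groups", "Replication Administrators"))
# -> directory_groups/UmVwbGljYXRpb24gQWRtaW5pc3RyYXRvcnM
```

Decoding the example key this way yields the LDAP group name "Replication Administrators", which contains a space and so could not be used as a NATS KV key directly.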

Bootstrap sequence on node startup:

1. Attempt to read all entries from NATS JetStream KV
2. Populate in-memory cache with non-expired entries
3. If JetStream unavailable, request data from cluster peers
4. Merge peer responses into local cache
5. Live broadcasts during bootstrap take precedence over stale KV data
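
The merge rules in this sequence can be sketched in Python as follows. This is an illustrative model only: the function name, the data shapes, and the `deleted_during_bootstrap` set are assumptions chosen to mirror the skip_expired, skip_exists, and skip_deleted log entries listed under Logs.

```python
def bootstrap_from_kv(cache, kv_entries, deleted_during_bootstrap, now):
    """Merge KV entries into the local cache.

    cache and kv_entries map (cache_type, key) -> (value, expiration),
    where expiration is an absolute time or None for "never expires".
    """
    stats = {"loaded": 0, "skipped": 0}
    for ident, (value, expiration) in kv_entries.items():
        if expiration is not None and expiration < now:
            stats["skipped"] += 1  # expired while the node was down
        elif ident in cache:
            stats["skipped"] += 1  # a live broadcast already set it; KV copy is stale
        elif ident in deleted_during_bootstrap:
            stats["skipped"] += 1  # deleted by a broadcast during bootstrap
        else:
            cache[ident] = (value, expiration)
            stats["loaded"] += 1
    return stats
```

Tracking deletes while bootstrap runs prevents a key that was just deleted cluster-wide from being resurrected out of older KV data.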

Hot/cold cache:

[memory]
cold_enabled = true # Two-tier hot/cold cache (default: true)
cold_ttl = "72h" # How long idle entries stay in hot cache

When cold_enabled = true (default):

- Entries load lazily from durable storage on first access
- Subsequent reads are in-memory
- Idle entries evicted from memory after cold_ttl (still durable)
- No startup warmup — node starts instantly, cache fills on demand
- Only active entries consume memory; idle entries are served from durable storage on demand
- Cluster-wide features that need to enumerate cache contents (admin force-logout safety-net, audit listing) are fully supported

When cold_enabled = false (operator opt-out):

- Full in-memory replication; all entries loaded at startup
- Best for very small deployments where memory headroom is generous and cluster-wide enumeration features are not needed
- A startup warning announces the degraded mode for cluster-wide operations: admin force-logout may not catch sessions on peer nodes if the per-user index is stale

No hot-reloadable settings. Changes require a full restart.

Troubleshooting

Common symptoms and diagnostic steps:

Key not found after Set (cross-node):

- Verify Set used cluster-wide replication, not local-only (replication required for cross-node visibility)
- Reads are local only; small propagation delay is normal
- Use quorum-confirmed replication before reading for strong consistency
- Check cluster health: nodes must be reachable for broadcast delivery
- Verify the stored type is compatible with cluster serialization

Serialization errors (encoding/decoding failures):

- Custom types stored in memory must be compatible with cluster serialization
- Type registration happens during module initialization
- Built-in types (string, int, bool, []byte) work out of the box
- Error message includes the unregistered type name

TTL expiration not working (entries persist beyond TTL):

- Background eviction runs every 30 seconds (not instantaneous)
- Expired entries are immediately invisible to Get (Found=false)
- Physical cleanup happens on next eviction cycle
- Very large caches (100K+ entries) may slow eviction scans
- Check if TTL was set to 0 (zero TTL means no expiration)

OnDelete callback not firing:

- Callbacks are local only (fire on the node that runs eviction)
- Callbacks are fire-and-forget (errors are logged but not returned)
- The callback module and operation must exist and be registered
- Check telemetry logs for "OnDelete callback failed" entries (listed at WARN level under Logs)
- Callbacks do NOT fire on nodes that receive broadcast deletions (only the originating node triggers the callback)

Data lost after cluster restart:

- Verify persist_memory = true in [cluster] config
- Verify cluster_path is set and writable
- Check NATS JetStream health (3+ nodes needed for Raft consensus)
- 2-node clusters: JetStream may not achieve quorum, data at risk
- Without persistence, data is only in memory (lost on restart)
- Bootstrap logs show how many entries were recovered from KV

Memory usage growing unbounded:

- Check TTL values: missing or zero TTL entries never expire
- Use All operation to inspect cache sizes per cache type
- Per-entry overhead: approximately 150 bytes plus key and value sizes
- Monitor eviction cycle: entries should be cleaned every 30 seconds
- Consider partitioning large cache types into smaller namespaces

SetNX returning Set=false unexpectedly:

- Key already exists in local cache (including expired-but-not-evicted)
- Another node set the key via broadcast before your SetNX
- SetNX is local atomic only; not a distributed lock by itself
- For distributed locking, combine SetNX with cluster-wide replication + short TTL
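
To make the expired-but-not-evicted case concrete, here is a small Python model of the SetNX behaviour described above. The class and function names are hypothetical, and cluster-wide replication is not modelled; the sketch only shows why a lock with a short TTL can stay unavailable until the next eviction cycle.

```python
class LocalCache:
    """Hypothetical single-node model of SetNX with TTL."""

    def __init__(self):
        self.entries = {}  # key -> (value, expiration)

    def setnx(self, key, value, ttl, now):
        # An expired entry that has not been evicted yet still counts as
        # present, matching the first bullet above.
        if key in self.entries:
            return False
        self.entries[key] = (value, now + ttl)
        return True

    def evict_expired(self, now):
        for key in [k for k, (_, exp) in self.entries.items() if exp < now]:
            del self.entries[key]


def try_lock(cache, name, owner, ttl, now):
    """Attempt to take a named lock; True means this owner holds it."""
    return cache.setnx("lock/" + name, owner, ttl, now)
```

With a 30-second eviction interval, a lock TTL of 5 seconds may in practice block a second caller for up to roughly 35 seconds in this model.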

Bootstrap failures on startup:

- JetStream KV unavailable and no peer nodes responding
- Node starts with empty cache; data populates as broadcasts arrive
- Check NATS connection health and cluster discovery
- Verify cluster_path directory exists and has correct permissions
- Base64url decoding errors: corrupted KV keys (manual cleanup needed)

KV “too many requests” errors at startup (memory.kv.put_error):

- Caused by bulk operations (e.g. directory fullSync) spawning many concurrent
KV writes that overwhelm JetStream rate limits
- Each user/group sync fires ~3 Set() calls per user + ~2 per group
- A directory with 40 users and 20 groups = ~160 concurrent writes
- Fix: increase memory_kv_max_write in [cluster] config (default 10, max 100)
- These errors are non-fatal: data is already in local cache, only persistence
is delayed. Entries will be persisted on subsequent writes or next restart.
- Monitor: logs search "memory.kv.put_error" --since=5m

Architecture

Data flow for write operations:

1. Caller invokes Set/Delete (local-only or cluster-wide)
2. Local cache updated immediately (mutex-protected)
3. OnSet callback triggered if registered (fire-and-forget)
4. If cluster-wide: replicated to all cluster nodes
5. Async persistence goroutine acquires semaphore slot (bounded by memory_kv_max_write)
6. KV write to NATS JetStream (best-effort; skipped on shutdown)
7. JetStream Raft replicates to follower nodes (3+ node clusters)
Note: SyncSet bypasses the semaphore (synchronous, caller-blocking, used for signing keys)
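
Steps 5 and 6 amount to a bounded-concurrency pattern. The sketch below uses Python's asyncio as a stand-in for the goroutine and semaphore described above; the function names are hypothetical and the KV write is simulated. SyncSet, which bypasses the semaphore, is not modelled.

```python
import asyncio


async def persist_all(entries, max_write=10):
    """Persist entries with at most max_write writes in flight at once."""
    sem = asyncio.Semaphore(max_write)
    in_flight = 0
    peak = 0
    persisted = []

    async def persist(key, value):
        nonlocal in_flight, peak
        async with sem:  # wait for a free slot
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the JetStream KV write
            persisted.append(key)
            in_flight -= 1

    await asyncio.gather(*(persist(k, v) for k, v in entries.items()))
    return persisted, peak


# 160 queued writes, as in the directory sync example under Troubleshooting
persisted, peak = asyncio.run(persist_all({f"k{i}": i for i in range(160)}))
```

However many writes are queued, the number running concurrently never exceeds max_write; the rest wait for a slot rather than hitting JetStream all at once.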

Data flow for read operations:

1. Caller invokes Get/All
2. Local cache lookup (O(1) for Get, O(n) for All)
3. Expired entries filtered out (Found=false)
4. If cold_enabled=true and Get misses local: KV fallback (~1ms), lazy-load into hot cache
5. All always returns hot entries only (no KV scan in cold mode)

Background eviction loop:

1. Wakes every 30 seconds
2. Scans all cache types and all entries
3. Identifies entries with Expiration < now (TTL eviction)
4. Deletes expired entries from local cache and KV
5. Replicates eviction to cluster nodes, triggers OnDelete callbacks
6. Cold eviction pass (cold_enabled=true only): entries idle > cold_ttl removed
from memory only — stay in KV for future lazy-load, no callbacks triggered
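
One pass of the loop might look like the following Python sketch. The function name, the `last_access` map, and the data shapes are assumptions; replication of evictions and OnDelete callbacks are not modelled.

```python
def eviction_pass(hot, kv, last_access, now, cold_enabled=True, cold_ttl=72 * 3600):
    """Run one eviction cycle and return (expired_keys, cold_keys).

    hot and kv map key -> (value, expiration); last_access maps key -> time.
    """
    # TTL eviction: remove from memory and from KV.
    expired = [k for k, (_, exp) in hot.items() if exp is not None and exp < now]
    for k in expired:
        del hot[k]
        kv.pop(k, None)
        last_access.pop(k, None)

    # Cold eviction: remove idle entries from memory only; they stay in KV.
    cold = []
    if cold_enabled:
        cold = [k for k in hot if now - last_access.get(k, now) > cold_ttl]
        for k in cold:
            del hot[k]
            last_access.pop(k, None)
    return expired, cold
```

The two passes differ in what they touch: a TTL eviction deletes the entry everywhere, while a cold eviction only frees memory and leaves the durable copy for a later lazy-load.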

NATS JetStream KV architecture (when persistence enabled):

- Bucket: hexon_storage_memory
- Raft consensus for writes (3+ nodes)
- Leader election with automatic failover
- Write-ahead log replicated to followers
- Can tolerate floor((N-1)/2) node failures (e.g., 1 of 3, 2 of 5)
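
The fault tolerance follows from the Raft majority rule: a write needs acknowledgement from more than half of the nodes. A small Python sketch of that arithmetic (the function names are illustrative):

```python
def quorum(nodes):
    """Smallest majority of a Raft group."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """Nodes that can fail while a majority remains."""
    return nodes - quorum(nodes)


for n in (2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 2 2 0
# 3 2 1
# 5 3 2
```

A 2-node cluster needs both nodes for a majority and therefore tolerates no failures, which is why 3 or more nodes are called for throughout this section.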

Peer-to-peer bootstrap fallback:

- Used when JetStream is unavailable (2-node clusters, JetStream down)
- Requests data from all cluster peers
- Merges responses, preferring newest entries on conflict
- Graceful degradation: memory storage works without persistence

Relationships

Module dependencies and interactions:

  • sessions: Primary consumer. Stores user sessions with 12-24h TTL. Uses OnDelete callback for session cleanup and index removal. Session indices stored in separate “sessions_index” cache type.
  • OTP: Stores one-time passwords with 5-10 minute TTL. Keys are hashed email addresses for privacy. Replicated cluster-wide for OTP availability. OnDelete triggers expiration notifications.
  • OIDC provider: Stores authorization codes, access tokens, refresh tokens, and DPoP JTI values. Each in separate cache types with appropriate TTLs (codes: 5-10min, tokens: 1-24h). Critical for OAuth2 flow integrity.
  • Proof-of-work: Stores proof-of-work challenge tokens with short TTL. Local-only storage (challenges are node-specific).
  • WebAuthn: Stores WebAuthn challenges during registration and authentication ceremonies. Short TTL (5 minutes).
  • Kerberos: Stores Kerberos ticket data with ticket lifetime TTL.
  • firewall: Uses SetNX for cluster-wide hostname tracking (wildcard DNS). Replicated to all nodes for cross-node hostname state. OnDelete for TTL-based rule cleanup.
  • storage.filesystem: Complementary module. Use memory for fast ephemeral lookups; use filesystem for persistent data surviving restarts.
  • telemetry: All operations logged at DEBUG level. Callback and KV persistence failures are logged at WARN; bootstrap failures at ERROR (see Logs). Metrics for cache sizes and hit rates.
  • cluster (NATS): JetStream KV persistence layer. Raft consensus provides durability for 3+ node clusters. Bootstrap reads from JetStream on startup.

Logs

Log entries emitted by this module. Search with: logs search "memory". Levels: ERROR > WARN > INFO > DEBUG.

Bootstrap — KV:

memory.bootstrap.start INFO Starting JetStream KV bootstrap
memory.kv.init DEBUG Requesting JetStream KV bucket
memory.kv.retry DEBUG JetStream not ready, retrying in {duration} (attempt N/M)
memory.bootstrap.kv_unavailable INFO JetStream KV unavailable after retries, falling back to peer broadcast
memory.kv.ready DEBUG JetStream KV bucket ready
memory.bootstrap.cold INFO Cold mode enabled — skipping bootstrap warmup, cache will populate on demand
memory.bootstrap.read_keys DEBUG Reading keys from JetStream KV
memory.bootstrap.empty INFO JetStream KV bucket is empty, nothing to restore
memory.bootstrap.failed ERROR Failed to read KV keys
memory.bootstrap.keys_found DEBUG Found N keys in JetStream KV
memory.bootstrap.process_key DEBUG Processing KV key
memory.bootstrap.retry_transient INFO Retrying N keys after transient NATS errors (JetStream leader stabilizing)
memory.bootstrap.complete INFO Bootstrap complete (loaded, skipped, errors, duration)

Bootstrap — Key Processing:

memory.bootstrap.get_tombstone DEBUG KV key listed but not found (tombstone)
memory.bootstrap.get_transient DEBUG Transient NATS error, will retry
memory.bootstrap.get_error WARN Failed to get KV entry
memory.bootstrap.decode_error WARN Failed to decode KV entry, deleting corrupted key
memory.bootstrap.decode_error_cleanup WARN Failed to delete corrupted KV entry
memory.bootstrap.parse_error WARN Failed to parse KV key format
memory.bootstrap.skip_expired DEBUG Skipping expired entry
memory.bootstrap.skip_exists DEBUG Skipping key (already in memory from broadcast)
memory.bootstrap.skip_deleted DEBUG Skipping key (deleted during bootstrap)
memory.bootstrap.loaded DEBUG Loaded entry from KV
memory.bootstrap.tracking_stopped DEBUG Stopped tracking deletes, bootstrap complete / peer bootstrap complete
memory.bootstrap.track_delete DEBUG Tracking delete during bootstrap

Bootstrap — Peer Fallback:

memory.bootstrap.peers_encryption_timeout WARN Encryption not ready after timeout, proceeding with bootstrap anyway
memory.bootstrap.peers_wait_encryption DEBUG Waiting for encryption to be ready (X3DH/shared key sync)
memory.bootstrap.peers_start INFO Starting peer-to-peer bootstrap via Broadcast
memory.bootstrap.peers_failed ERROR Failed to broadcast BootstrapGetAll
memory.bootstrap.peers_responses INFO Collected responses from N peers
memory.bootstrap.peers_timeout WARN Failed to collect all peer responses
memory.bootstrap.peers_operation_error WARN Operation error from node
memory.bootstrap.peers_invalid_response WARN Invalid response type from node
memory.bootstrap.peers_merge DEBUG Merging snapshot from node
memory.bootstrap.peers_complete INFO Peer bootstrap complete (loaded, skipped, duration)

KV Persistence:

memory.kv.encode_error WARN Failed to encode entry for KV
memory.kv.put_error WARN Failed to write to KV
memory.kv.persist_success DEBUG Entry persisted to KV
memory.kv.delete_error WARN Failed to delete from KV
memory.kv.delete_success DEBUG Entry deleted from KV

CRUD Operations:

memory DEBUG Memory storage Set
memory DEBUG Triggering OnSet callback
memory WARN OnSet callback failed
memory DEBUG Memory storage Delete
memory DEBUG Triggering OnDelete callback
memory WARN OnDelete callback failed
memory DEBUG Memory storage All
memory DEBUG Memory storage Touch
memory DEBUG Memory storage SetNX
memory DEBUG Memory storage SyncSet
memory DEBUG Memory storage SyncGet (lazy-loaded from KV)

Bootstrap Snapshot:

memory.bootstrap DEBUG BootstrapGetAll returning snapshot

Cold Cache:

memory.cold WARN Corrupted KV entry, deleting
memory.cold DEBUG Cold eviction sweep

Eviction:

memory.eviction INFO Eviction loop shutting down gracefully

Metrics

Prometheus metrics. Query with: metrics prometheus memory_storage_<name>

CRUD Counters:

memory_storage_gets counter {cache_type, result} Cache reads (result: hit, miss, cold_hit, decode_error, expired)
memory_storage_sets counter {cache_type} Cache writes
memory_storage_deletes counter {cache_type} Cache deletions
memory_storage_touches counter {cache_type, result} TTL renewals (result: hit, miss, expired)
memory_storage_setnx counter {cache_type, result} Atomic set-if-not-exists (result: set, exists)
memory_storage_sync_sets counter {cache_type} Synchronous KV-persisted writes
memory_storage_sync_gets counter {cache_type, result} Synchronous reads with KV fallback (result: hit, miss, kv_hit, decode_error, expired)
memory_storage_evictions counter {cache_type, reason} Entries evicted (reason: expired, cold)

Gauges:

memory_storage_entries gauge {cache_type} Current entry count per cache type (updated via GetCacheStats)

Alerts:

rate(memory_storage_gets{result="miss"}[5m]) > 100 High cache miss rate (check TTLs or missing Set calls)
rate(memory_storage_evictions{reason="expired"}[5m]) > 500 Excessive TTL evictions (entries expiring faster than expected)
rate(memory_storage_gets{result="decode_error"}[5m]) > 0 Corrupted KV entries (cold cache decode failures)

Telemetry & Logging

Structured logging with OTLP export, per-module log levels, audit class, ring buffer queries, and trace correlation

Overview

The telemetry module provides structured logging with key-value pairs, multiple output targets, and cross-module trace correlation for cluster-wide observability.

Core capabilities:

  • Structured logging with key-value pairs and fluent builder API
  • Six log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL
  • AUDIT log class: bypasses level filtering for security events
  • Per-module log level configuration (override global level per module)
  • OTLP gRPC log export to OpenTelemetry-compatible collectors
  • Trace ID correlation across modules (128-bit hex IDs per request)
  • In-memory ring buffer for admin CLI log queries
  • JSON and human-readable output formats
  • Security context builder for auth-related log entries

Output modes:

stdout: Structured logs written to standard output (default)
otlp: Logs exported via gRPC to an OpenTelemetry collector
both: Simultaneous stdout and OTLP export

OTLP export includes:

- timestamp, severity, body (message), module attribute
- service.name, service.version, environment, host.name, host.ip
- Native OTLP TraceId field for trace-to-log correlation
- Batched async export via SDK log processor

Ring buffer:

Configurable in-memory buffer (default 10,000 entries) for admin CLI log
queries ('logs tail', 'logs search'). Provides instant access to recent
logs without external log aggregation. Set to 0 to disable.
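
The ring buffer semantics (fixed capacity, oldest entries overwritten, 0 disables) can be modelled in Python with a bounded deque. The class and method names are hypothetical and only loosely mirror the 'logs tail' and 'logs search' commands.

```python
from collections import deque


class LogRingBuffer:
    """Hypothetical model of the in-memory log ring buffer."""

    def __init__(self, size=10000):
        # A deque with maxlen=0 silently drops every append, so size 0
        # behaves as "disabled".
        self.entries = deque(maxlen=max(size, 0))

    def append(self, line):
        self.entries.append(line)  # overwrites the oldest entry when full

    def tail(self, n=10):
        if n <= 0:
            return []
        return list(self.entries)[-n:]

    def search(self, keyword):
        return [line for line in self.entries if keyword in line]
```

Once the buffer wraps, a search for an old entry returns nothing even though the entry was logged; that is expected behaviour, not data loss in the log pipeline.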

Config

Configuration under [telemetry] section:

[telemetry]
log_level = "info" # Global: trace|debug|info|warn|error|fatal
log_format = "json" # Output format: "json" or "human"
output = "stdout" # Output target: "stdout", "otlp", or "both"
otlp_endpoint = "otel-collector:4317" # Required when output is "otlp" or "both"
log_buffer_size = 10000 # Ring buffer entries for log queries (0 = disabled)
audit = true # Audit class: always display security events regardless of log_level
[telemetry.module_levels]
oidc = "debug" # Per-module override (module name = level)
webauthn = "info"
bastion = "trace"

OTLP endpoint format:

"host:port" - Plain gRPC connection
"http://host:port" - Insecure gRPC (http:// stripped, WithInsecure applied)
"https://host:port" - TLS gRPC connection

Compatible collectors: Grafana Alloy, OpenTelemetry Collector, Datadog Agent, Splunk OTel Collector, any OTLP/gRPC compatible receiver.

If the OTLP endpoint is unreachable at startup, the system falls back to stdout and logs a warning. gRPC connections are lazy (connect on first export).
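
The three endpoint forms listed above can be told apart with a small parser. This Python sketch is an assumption-laden illustration: the function name and the mode strings are hypothetical, and stripping the "https://" prefix is assumed by analogy with the documented "http://" handling.

```python
def parse_otlp_endpoint(endpoint):
    """Return (target, mode) for an otlp_endpoint value."""
    if endpoint.startswith("http://"):
        # Documented: scheme stripped, insecure transport applied.
        return endpoint[len("http://"):], "insecure"
    if endpoint.startswith("https://"):
        # Assumed: scheme stripped, TLS transport applied.
        return endpoint[len("https://"):], "tls"
    # Bare host:port is passed through unchanged.
    return endpoint, "plain"
```

For example, parse_otlp_endpoint("http://otel-collector:4317") yields the target "otel-collector:4317" with the insecure mode.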

Audit class:

When audit = true (default), log entries marked with AsAudit() bypass level
filtering. Security events (SFTP ops, SSH connections, admin commands, TLS
protection) are always visible even when log_level is set to "error".
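
The interaction between the global level, per-module overrides, and the audit class can be sketched as a single decision function. This is a Python model under stated assumptions: the function name and parameters are hypothetical, and module names are matched exactly, as the Troubleshooting notes suggest.

```python
LEVELS = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4, "fatal": 5}


def should_emit(level, module, global_level="info", module_levels=None,
                audit=True, is_audit=False):
    """Decide whether a log entry passes level filtering."""
    if is_audit and audit:
        return True  # audit-class entries bypass level filtering
    # A per-module override replaces the global threshold for that module.
    threshold = (module_levels or {}).get(module, global_level)
    return LEVELS[level] >= LEVELS[threshold]


overrides = {"oidc": "debug", "bastion": "trace"}
print(should_emit("debug", "oidc", module_levels=overrides))      # True
print(should_emit("debug", "webauthn", module_levels=overrides))  # False
print(should_emit("info", "bastion", global_level="error", is_audit=True))  # True
```

In this model an audit entry at INFO is emitted even with log_level = "error", while the same entry without the audit mark is filtered out.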

Hot-reloadable: log_level, module_levels, log_format. Cold (restart required): output, otlp_endpoint, log_buffer_size, audit.

Troubleshooting

Common symptoms and diagnostic steps:

Logs not appearing in OTLP collector:

- Verify output is set to "otlp" or "both" in [telemetry]
- Check otlp_endpoint format (host:port, no trailing slash)
- Network connectivity: 'net tcp <collector-host>:<port>'
- Collector may reject due to resource limits or auth requirements
- Startup fallback: if endpoint was unreachable at startup, logs go to stdout
- Check: 'logs tail' to verify logs are being generated locally

Per-module log level not working:

- Verify [telemetry.module_levels] has exact module name (case-sensitive)
- Module names use dot notation: "oidc", "bastion.session", "identity.scim"
- Per-module level must be lower in severity (more verbose) than the global level to have a visible effect, e.g. "debug" for one module under a global "info"
- Check: 'config show telemetry' to verify active configuration

Ring buffer queries returning no results:

- Verify log_buffer_size > 0 (0 disables the ring buffer)
- Buffer is in-memory only; cleared on restart
- 'logs tail' shows most recent entries
- 'logs search <keyword>' filters by content
- Buffer wraps around: oldest entries are overwritten when full

Log format issues:

- "json": structured key-value JSON (recommended for log aggregation)
- "human": colored, readable format (recommended for development)
- Trace IDs: full 128-bit hex in JSON, truncated 8-char in human format

High log volume impacting performance:

- Raise global log_level to "warn" or "error"
- Use per-module levels to keep verbose logging only where needed
- OTLP batched export is async and does not block request processing
- Ring buffer size: reduce log_buffer_size if memory is a concern

Relationships

Module dependencies and interactions:

  • All modules: Every module in the system uses telemetry for structured logging with trace correlation.
  • Admin CLI: ‘logs tail’, ‘logs search’, ‘logs stats’, ‘logs anomalies’, ‘logs patterns’ commands query the ring buffer.
  • Configuration: Reads [telemetry] section. Log level and format are hot-reloadable without restart.
  • Cluster: Each node maintains its own ring buffer. Admin CLI log queries fan out to all nodes and merge results.

Logs

The telemetry module is the logging infrastructure itself — it does not emit structured log entries through its own pipeline. Diagnostic messages are written directly to stderr for bootstrap and shutdown scenarios where the log pipeline is unavailable.

Stderr diagnostics (not structured LogEntry calls):

[TELEMETRY] Failed to initialize OTLP exporter: <err> (falling back to stdout)
— Startup: OTLP gRPC connection failed, output mode reverts to stdout
Failed to marshal log entry: <err>
— Runtime: JSON encoding of a log entry failed (entry is dropped)
[TELEMETRY] OTLP provider shutdown error: <err>
— Shutdown: OTLP provider flush/close returned an error
[TELEMETRY] Shutdown complete: N logs processed, N logs dropped due to overflow
— Shutdown: final stats when logs were dropped (includes audit count if any)

These messages appear only in stderr, never in the structured log stream or ring buffer. They indicate infrastructure-level issues with the telemetry pipeline itself.

Metrics

Prometheus metrics emitted by the telemetry module. Query with: metrics prometheus telemetry_<name>

Audit event tracking:

telemetry_audit_log_entries_total counter {} Audit-class entries successfully written
telemetry_audit_dropped_total counter {} Audit-class entries dropped (channel overflow)
telemetry_converging_log_entries_total counter {} Converging-class entries successfully written

All three counters have no labels (nil label map). They are incremented in the single backgroundWriter goroutine (no contention).

Alerts:

telemetry_audit_dropped_total > 0 Audit entries lost — increase channel buffer or reduce log volume
rate(telemetry_audit_log_entries_total[5m]) == 0 No audit events — possible pipeline failure or misconfiguration