Cluster & Operations

Admin Unix Socket

Server-side CLI access via Unix domain socket — run admin commands directly on the server without bastion

Overview

The admin socket enables operators to run admin CLI commands directly on the server host without opening a bastion SSH session.

The hexon binary operates in two modes:

Server mode (default): starts the gateway, creates the Unix socket listener
Client mode (hexon admin …): connects to the socket, sends a command, renders output

Same binary, same command registry, same execution path as bastion and MCP. Commands are executed as user “root” with source “cli” in the audit trail.

The socket is created at /tmp/hexon-admin.sock (override with HEXON_ADMIN_SOCK). File permissions are 0600 (owner-only read/write). Stale sockets from crashed instances are detected and cleaned up automatically on startup.
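
The stale-socket check can be illustrated with a small probe. This is a minimal sketch, not hexon's startup code: the names socket_path and classify_socket and the three return labels are hypothetical. Only the default path and the HEXON_ADMIN_SOCK variable come from the text above.

```python
import os
import socket

DEFAULT_SOCK = "/tmp/hexon-admin.sock"  # default path documented above


def socket_path(env=os.environ):
    """Resolve the socket path, honoring the HEXON_ADMIN_SOCK override."""
    return env.get("HEXON_ADMIN_SOCK") or DEFAULT_SOCK


def classify_socket(path):
    """Return 'missing', 'stale', or 'live' (hypothetical labels)."""
    if not os.path.exists(path):
        return "missing"
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except (ConnectionRefusedError, FileNotFoundError):
        # The file exists but nothing is listening: left over from a crash.
        return "stale"
    else:
        return "live"
    finally:
        probe.close()
```

A server following this approach would unlink the path only when the result is 'stale', so a second instance never removes the socket of a running one.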

Usage

Commands that require a running server:

  hexon admin cluster status
  hexon admin proxy health
  hexon admin sessions list --user=alice
  hexon admin --json proxy backends
  hexon admin --cluster health status

Help commands work offline (no running server needed):

  hexon admin                     # List all commands
  hexon admin help proxy          # Detailed help for proxy command

Custom socket path:

  HEXON_ADMIN_SOCK=/tmp/hexon.sock hexon admin ping

Exit codes: 0 on success, 1 on error.

Troubleshooting

Common errors:

“hexon is not running (socket not found at …)” — server not started or different socket path. Check HEXON_ADMIN_SOCK env var.
“hexon is not running” — socket file exists but connection refused. The previous instance crashed without cleanup. Restart the server.
“command timed out” — command took longer than 30 seconds. Check server logs.
Socket permission denied — socket is mode 0600, must run as the same user that started the hexon server (typically root).

The socket is cleaned up on graceful shutdown. If the server crashes, the stale socket is automatically removed on next startup.

Logs

No structured log entries. A single console message is emitted on startup. Command execution logging is handled by the admin CLI module.

Metrics

No Prometheus metrics emitted by this module. Admin command metrics are handled by the admin CLI module.

Threshold Signing & Cluster Cryptography

Signs certificates and tokens without any single node holding the full private key — quorum-based threshold cryptography

Overview

Signs certificates and OIDC tokens without any single node holding the complete private key. The signing key is split across cluster nodes — a quorum collaborates to produce each signature. Replaces external HSMs with distributed key management built into the gateway.

Each node holds only its own share of the key, and a minimum number of nodes (the “threshold”) must cooperate to produce a valid signature.

The cluster runs two signing schemes in parallel:

ECDSA (ES256/ES384/ES512): for EXTERNAL tokens
  Used for: OIDC tokens, Personal Access Tokens (PATs), standard OAuth
  Why: industry-standard algorithms that third-party tools verify natively
FROST Ed25519: for INTERNAL operations
  Used for: proxy bearer tokens, bastion device codes, internal service auth
  Why: faster signing (~15ms) optimized for high-volume internal operations

These two schemes are not fallbacks for each other — they run in parallel, each serving different consumers. The only brief fallback window is during cluster startup: internal tokens temporarily use ECDSA until FROST key generation completes. This is a few seconds, not a steady-state condition.

Token routing (when signing_algorithm = ES256):

  Token Type                 | Scheme  | Reason
  ──────────────────────────────────────────────────────────
  Proxy bearer tokens        | FROST   | Internal — speed, backend verifies via JWKS
  Bastion device codes       | FROST   | Internal — bastion authentication
  Internal device codes      | FROST   | Internal service callers
  Personal Access Tokens     | ECDSA   | External — distributed to users
  Standard OIDC tokens       | ECDSA   | External — third-party OAuth clients

Quorum model (default: 2-of-3 nodes):

  - Any 2 nodes can sign; 1 node alone cannot forge signatures
  - 1 node failure = still operational (2 remaining nodes can sign)
  - 2 nodes down = signing blocked (quorum lost)
  - 1 node compromised = attacker has 1 share, cannot forge (needs 2)
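
The quorum arithmetic can be checked with a few lines of Python. This is a sketch under the document's own rule that t+1 nodes must cooperate and that the automatic threshold is floor(n/2). The function names are illustrative, not part of any hexon API.

```python
def threshold(n, threshold_nodes=0):
    """t value: 0 means auto (floor(n/2)); any other value is an explicit override."""
    return n // 2 if threshold_nodes == 0 else threshold_nodes


def signers_required(n, threshold_nodes=0):
    """Number of nodes that must cooperate to produce a signature (t + 1)."""
    return threshold(n, threshold_nodes) + 1


def can_sign(online, n, threshold_nodes=0):
    """True when enough nodes are reachable to form a quorum."""
    return online >= signers_required(n, threshold_nodes)


def tolerated_failures(n, threshold_nodes=0):
    """How many nodes can be lost before signing is blocked."""
    return n - signers_required(n, threshold_nodes)
```

For the default 3-node cluster this gives 2 required signers and 1 tolerated failure, matching the list above.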

Startup sequence:

  1. Nodes perform distributed key generation (DKG) for the ECDSA scheme
  2. Once ECDSA is ready, FROST key generation auto-triggers
  3. Once both complete, all signing paths are available
  During the brief window after step 1 completes and before step 3, internal tokens use ECDSA.
  Zero downtime throughout.

When signing_algorithm is EdDSA: everything uses FROST (no dual mode needed).
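
The routing rules can be summarized as a small decision function. This is an illustrative sketch only: the token-type labels and the function name are hypothetical. It assumes ES384 and ES512 route the same way as ES256. Startup behavior under EdDSA is not described here, so frost_ready is ignored in that mode.

```python
EXTERNAL_TOKENS = {"personal_access_token", "oidc_token"}
INTERNAL_TOKENS = {"proxy_bearer", "bastion_device_code", "internal_device_code"}


def signing_scheme(token_type, signing_algorithm="ES256", frost_ready=True):
    """Pick the scheme that signs a token, following the routing table above."""
    if token_type not in EXTERNAL_TOKENS | INTERNAL_TOKENS:
        raise ValueError(f"unknown token type: {token_type}")
    if signing_algorithm == "EdDSA":
        # EdDSA mode: everything uses FROST, no dual mode.
        return "FROST"
    if token_type in EXTERNAL_TOKENS:
        return "ECDSA"
    # Internal tokens fall back to ECDSA only while FROST key generation runs.
    return "FROST" if frost_ready else "ECDSA"
```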

Key management

Key lifecycle and rotation:

Threshold signing keys (both ECDSA and FROST):

  - Generated collaboratively by all cluster nodes (distributed key generation)
  - No single node ever holds the full private key — only its own share
  - Stored encrypted at rest using authenticated encryption derived from cluster_key
  - When nodes join or leave, key shares are redistributed while preserving the
    same public key (external verifiers like JWKS consumers are not affected)

ECDSA key rotation (for OIDC tokens):

  - Automatic: triggered when key generation completes or cluster membership changes
  - New JWKS entry published; old key retained for verification (grace period)
  - Relying parties cache JWKS — existing tokens remain valid until cache refresh

FROST key rotation (for internal tokens):

  - Independent from ECDSA rotation — separate lifecycle
  - Auto-triggered when cluster membership changes
  - Internal tokens are short-lived — rotation is seamless with no visible impact

Inter-node encryption:

  Nodes communicate over an encrypted channel with forward secrecy. This means
  each inter-node session uses unique encryption keys, and even if long-term
  keys were compromised, past communications remain unreadable.
  - Encryption keys rotate automatically on a schedule (2PC protocol with quorum)
  - Grace period: old + new keys accepted simultaneously during rotation
  - Temporary fallback to derived keys if the key exchange is briefly unavailable
  - SPK rotation uses publish-before-swap: new bundle published before private key swaps
  - Key rotation defers if SPK just rotated (SPK recency guard, 5s window)
  - On quorum failure, key rotation retries once after flushing stale bundle caches
  - All rotation events emitted as audit entries via OnKeyRotationEvent callback

IMPORTANT — All rotations are automatic:

  Certificate rotation, signing key rotation, and encryption key rotation are
  all handled by background health monitors. Operators do NOT need to set
  calendar reminders or manually trigger rotations.
  Only investigate when 'health components' or 'hexdcall status' shows warnings.

Threshold resharing

Cluster scaling under threshold signing — what’s safe and what isn’t:

Resharing rotates the cluster’s share material across a (possibly) different party set while preserving the same group public key. The CA certificate, JWKS public keys, and any downstream verifiers stay valid across the rotation. Both threshold algorithms support resharing:

  ECDSA threshold (ca_algorithm=ES256, signing_algorithm=ES256/384/512): GG18
  FROST threshold (ca_algorithm=EdDSA, signing_algorithm=EdDSA):          FROST

Auto-trigger:

  The leader's health monitor watches for cluster-membership changes vs the set
  of nodes that hold shares in JetStream KV. When they diverge (a node joins or
  leaves), resharing fires automatically after a short stabilization window
  (~10s) plus exponential backoff (30s base, 5min cap). Operators do NOT need
  to manually trigger resharing for normal scaling.
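
A capped exponential backoff with the stated base and cap might look like the sketch below. The doubling factor and the function name are assumptions. The text gives only the 30s base and the 5min cap.

```python
BACKOFF_BASE_S = 30    # base delay from the text above
BACKOFF_CAP_S = 300    # 5 minute cap from the text above


def backoff_delay(attempt):
    """Delay in seconds before retry number `attempt` (0-based), doubling each time."""
    return min(BACKOFF_BASE_S * 2 ** attempt, BACKOFF_CAP_S)
```

Under these assumptions the delays run 30s, 60s, 120s, 240s, then stay at 300s.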

Adding nodes (+N): no upper bound.

  You can add any number of nodes. The original cooperating-old subset (t+1
  contributors) reshares with their existing share values; joiners receive
  fresh shares. The CA / JWKS public key stays unchanged.

Removing nodes (-N): bounded by the OLD threshold.

  Resharing math requires |cooperating-old| >= oldThreshold + 1, and the
  cooperating-old set must be a subset of the new party set. So:

    minimum new cluster size  =  oldThreshold + 1

  With the default majority-quorum config (t = floor(n/2) at birth), this means:

    Initial cluster | Old threshold (t) | Minimum after shrinking via reshare
    ----------------|-------------------|--------------------------------------
    3 nodes         | 1                 | 2 nodes
    5 nodes         | 2                 | 3 nodes
    7 nodes         | 3                 | 4 nodes
    9 nodes         | 4                 | 5 nodes

  Example: a 7-node cluster (t=3) cannot shrink to 3 in a single reshare —
  it would need 4 cooperating-old shareholders but the new set has only 3
  slots. The system rejects this synchronously with a clear "insufficient
  cooperators" error.

  To shrink past the floor, either:
  - Stage the reduction (e.g., 7 → 5 with a lower new threshold, then 5 → 3
    with a still-lower threshold). Each step preserves the CA cert.
  - Perform a hard CA rotation (delete birth metadata + restart), which
    produces a new CA cert and forces every leaf cert to renew.
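
The shrink floor is plain arithmetic and can be verified before planning a scale-down. This is a sketch that assumes the automatic threshold floor(n/2) unless an explicit old threshold is passed. The function names are illustrative.

```python
def min_size_after_reshare(old_n, old_t=None):
    """Smallest cluster a single reshare can produce: oldThreshold + 1."""
    t = old_n // 2 if old_t is None else old_t
    return t + 1


def can_reshare(old_n, new_n, old_t=None):
    """True when the new party set has room for t+1 cooperating old shareholders."""
    return new_n >= min_size_after_reshare(old_n, old_t)
```

For example, can_reshare(7, 3) is False, while the staged path can_reshare(7, 5) followed by can_reshare(5, 3) succeeds at each step.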

Choosing initial threshold with shrinkage in mind:

  Lower threshold = more shrinkage room but weaker security (fewer cooperators
  needed to sign = lower attack threshold). Override via threshold_nodes
  config at birth time if your cluster needs to scale down significantly.

Resharing failure mode is non-destructive:

  If a reshare protocol run fails (cooperator unreachable, KV write failure,
  byzantine commitment), the manager reverts to its old shares and the CA
  cert stays valid. Operators see the failure in 'hexdcall status' / health
  components and can retry manually or wait for the next auto-trigger.

Deterministic signing

Clarification on “deterministic” in cluster crypto context:

The word “deterministic” appears in two contexts — they mean different things:

AutoTLS deterministic KEY DERIVATION:
- Private keys derived deterministically from cluster_key for each renewal cycle
- Ensures all nodes produce the same public key for SPKI pinning
- ECDSA SIGNATURE nonces remain fully random (standard randomness)
- This is NOT “deterministic signing” — only the key material is deterministic
- Security equivalent to random key generation (entropy from cluster_key)

Threshold ECDSA CA signing:
- Uses multi-party nonce generation across nodes
- Each node contributes randomness; combined nonce is unpredictable
- No single node controls the nonce — distributed trust model
- The signing ceremony is interactive (requires t+1 nodes), not deterministic

Neither case reduces cryptographic security. “Deterministic” in these contexts refers to reproducibility (same inputs → same key) or coordination protocol, not to removing randomness from signatures.
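
To make "same inputs, same key" concrete, here is a minimal sketch of deterministic seed derivation. The actual KDF, label, and cycle encoding used by AutoTLS are not specified in this document. HMAC-SHA256, the label string, and the function name are stand-ins chosen for illustration.

```python
import hashlib
import hmac


def derive_key_seed(cluster_key: bytes, cycle: int) -> bytes:
    """Derive a per-cycle seed from cluster_key (illustrative construction only)."""
    label = f"autotls-leaf-key:{cycle}".encode()  # hypothetical domain-separation label
    return hmac.new(cluster_key, label, hashlib.sha256).digest()
```

Every node holding the same cluster_key computes the same seed for a given cycle, which is what lets all nodes arrive at the same public key. Signatures made with the resulting key still draw their nonces from a random source.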

When operators ask about “no randomness” or “deterministic signing”:

  - AutoTLS: "The leaf certificate keys are derived deterministically for cluster
    consistency and SPKI pinning. Signatures still use standard randomness."
  - Threshold CA: "The signing uses multi-party nonce generation — each node
    contributes randomness, and the combined nonce is cryptographically secure."

Do NOT present deterministic key derivation as a security concern.

Troubleshooting

Common symptoms and diagnostic steps:

ECDSA threshold signing errors (external tokens not being issued):

  - Run 'hexdcall status' to check signer state and health
  - State=Active + Health=Healthy → signing should work
  - Health=Degraded → at minimum quorum, one more node failure blocks signing
  - Health=Unhealthy → cannot sign, check node reachability with 'cluster nodes'
  - Run 'hexdcall threshold test' to verify end-to-end signing

FROST signing errors (internal tokens failing, e.g. proxy bearer or device codes):

  - Check FROST state and health in 'hexdcall status'
  - FROST key generation runs after ECDSA completes — if FROST shows Idle
    but ECDSA is Active, FROST key generation has not triggered yet
  - Internal tokens fall back to ECDSA while FROST initializes (no outage)
  - Run 'hexdcall threshold test --trace' for detailed phase-level timing

Key generation not completing:

  Key generation (DKG) is the process where nodes collaboratively create a
  shared signing key. It requires all participating nodes to be reachable.
  - Check 'cluster nodes' — all expected nodes must be online
  - Key generation requires the inter-node encryption channel to be healthy
  - Check for membership mismatches: all nodes must agree on the participant set
  - Rolling restarts are handled gracefully — key generation is not re-triggered unnecessarily

Inter-node encryption issues:

  Nodes encrypt all cluster communication using a forward-secret key exchange.
  - Low key pool → automatic replenishment triggers (usually self-healing)
  - Key exchange failures → check NATS JetStream connectivity ('cluster status')
  - Signature verification failures → possible clock skew between nodes (check NTP)
  - During degradation, non-critical operations are deferred and auto-retry on recovery
  - Key rotation audit events: search 'logs tail --audit' for module=keyrotation
    Events: initiated, deferred, commit_all, commit_quorum, retry, aborted, completed,
    activated, abort_received, spk_completed, spk_failed
  - "deferred" = SPK recency guard fired (normal when SPK and key rotation intervals match)
  - "retry" = first PREPARE attempt failed, retried after bundle cache refresh
  - "commit_quorum" = some nodes missed PREPARE, committed with partial ACKs
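
One way to read the event names is as the trace of a prepare/commit round with a single retry. The sketch below is a reading aid, not the actual protocol: the event ordering, the quorum parameter, and both function names are assumptions. Only the event labels come from the list above.

```python
def commit_decision(acks, total, quorum):
    """Map PREPARE acknowledgements to a commit event, or None if quorum is missed."""
    if acks == total:
        return "commit_all"
    if acks >= quorum:
        return "commit_quorum"
    return None


def rotation_events(first_acks, retry_acks, total, quorum):
    """Event trace for one rotation: one PREPARE attempt plus at most one retry."""
    events = ["initiated"]
    decision = commit_decision(first_acks, total, quorum)
    if decision is None:
        events.append("retry")  # retried after bundle cache refresh
        decision = commit_decision(retry_acks, total, quorum)
    if decision is None:
        return events + ["aborted"]
    return events + [decision, "completed"]
```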

Interpreting ‘hexdcall status’ output:

  ECDSA: Active/Healthy + FROST: Active/Healthy → optimal state (all signing paths available)
  ECDSA: Active/Healthy + FROST: Idle → FROST key generation pending, internal tokens use ECDSA
  ECDSA: Active/Degraded → at minimum quorum, lost fault tolerance margin — monitor closely
  ECDSA: DKG → key generation in progress, signing not yet available
  Inter-node encryption: Healthy → encrypted communication between nodes is nominal

Monitoring thresholds for CA certificate:

  >90 days until expiry: HEALTHY (normal — renewal is automatic)
  20-90 days: INFO (approaching renewal window — still automatic)
  5-20 days: WARN (renewal should have happened — check logs)
  <5 days: CRITICAL (rotation may have failed — investigate immediately)
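
For external alerting, the same bands can be expressed as a function. This is a sketch. The exact boundary handling (which band 90, 20, and 5 days fall into) is an assumption, since the ranges above overlap at their edges.

```python
def ca_expiry_status(days_left):
    """Classify days until CA certificate expiry into the bands listed above."""
    if days_left > 90:
        return "HEALTHY"
    if days_left >= 20:   # assumption: 90 and 20 both count as INFO
        return "INFO"
    if days_left >= 5:    # assumption: 5 counts as WARN
        return "WARN"
    return "CRITICAL"
```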

Diagnostic commands:

  'hexdcall status'            - Signing health, key generation state, inter-node encryption
  'hexdcall threshold test'    - End-to-end ECDSA signing test
  'cluster nodes'              - List cluster nodes and reachability
  'cluster status'             - Overall cluster health including NATS connectivity
  'health components'          - All system components with health status

Relationships

Module dependencies and interactions:

OIDC provider: Consumes ECDSA threshold signer for JWT signing (ES256/384/512). JWKS endpoint serves the threshold public key. When signing_algorithm changes, new DKG runs and JWKS updates.
OIDC provider (internal tokens): Uses FROST signer for proxy bearer tokens and device codes. Falls back to ECDSA if FROST is not yet ready.
ACME CA (threshold mode): When acme_ca_threshold=true, CA signing uses the ECDSA threshold scheme. Quorum of nodes must cooperate to issue certificates.
Bastion: Device code authentication uses FROST-signed tokens (internal path).
Proxy: Bearer token minting uses FROST for low-latency signing.
X3DH: Forward secrecy for DKG messages and key rotation coordination. Threshold signing uses a dedicated encrypted data plane, separate from X3DH.
NATS JetStream: Persistent storage for DKG state, key shares, and PreKey bundles.
Health monitor: Periodically computes signer health from peer reachability. Auto-triggers FROST DKG when ECDSA is Active. Detects membership mismatches.

Logs

Log entries emitted by this module.

Search with: logs search “threshold” / “keyrotation” / “hexdcall” / “ca.”
Levels: ERROR > WARN > INFO > DEBUG > TRACE

Note: The bridge module IS the logging infrastructure: it provides bridge.Log() and the hexdcall Logger adapter. The entries below are emitted by bridge code itself via bridge.Log(telemetry.LogEntry(…)). The hexdcall Logger adapter (GetLogger()) routes hexdcall-internal logs through telemetry, but those are hexdcall entries, not bridge entries.

Threshold State Changes:

  threshold                            INFO   AUDIT  Threshold signing ready
  threshold                            WARN   AUDIT  Threshold signing unavailable
  threshold                            WARN   AUDIT  Threshold signing degraded
  threshold                            INFO   AUDIT  DKG initiated
  threshold                            INFO   AUDIT  DKG complete
  threshold                            ERROR  AUDIT  DKG failed
  threshold                            ERROR  AUDIT  DKG timed out
  threshold                            ERROR  AUDIT  CRITICAL: DKG failed after max retries
  threshold                            INFO   AUDIT  Threshold share persisted to KV
  threshold                            WARN   AUDIT  Corrupt threshold share deleted
  threshold                            ERROR  AUDIT  Threshold signing failed
  threshold                            ERROR  AUDIT  Threshold signing timed out
  threshold                            INFO   AUDIT  Threshold CA birth complete
  threshold                            INFO   AUDIT  CA resharing initiated
  threshold                            INFO   AUDIT  CA resharing complete
  threshold                            ERROR  AUDIT  CA resharing failed
  threshold                            ERROR  AUDIT  CA resharing timed out
  threshold                            ERROR  AUDIT  CRITICAL: CA public key changed during resharing
  threshold                            INFO   AUDIT  Threshold share migration pending
  threshold                            ERROR  AUDIT  TSS replay attack detected
  threshold                            ERROR  AUDIT  TSS envelope signature verification failed
  threshold                            ERROR  AUDIT  TSS mandatory signature missing

Key Rotation Events:

  keyrotation                          ERROR  AUDIT  Key rotation aborted
  keyrotation                          ERROR  AUDIT  Key rotation spk_failed
  keyrotation                          WARN   AUDIT  Key rotation retry
  keyrotation                          WARN   AUDIT  Key rotation commit_quorum
  keyrotation                          WARN   AUDIT  Key rotation abort_received
  keyrotation                          INFO   AUDIT  Key rotation <event> (initiated, deferred, commit_all, completed, activated, spk_completed)

Hexon Readiness:

  hexdcall                             INFO   AUDIT  HexonReady: All subsystems operational - Hexon is ready to serve traffic

CA Module — GetCABundle:

  ca.getcabundle                       ERROR         Failed to get ACME CA bundle
  ca.getcabundle                       DEBUG         ACME CA bundle retrieved successfully

CA Module — SignCertificate:

  ca.signcertificate                   WARN          Certificate template is required
  ca.signcertificate                   WARN          Public key DER is required
  ca.signcertificate                   WARN          Failed to parse public key DER
  ca.signcertificate                   ERROR         Failed to sign certificate with ACME CA
  ca.signcertificate                   INFO   AUDIT  Certificate signed successfully with ACME CA

CA Module — SignCRL:

  ca.signcrl                           WARN          CRL number is required
  ca.signcrl                           WARN          CRL number must be positive
  ca.signcrl                           WARN          NextUpdate must be after ThisUpdate
  ca.signcrl                           ERROR         Failed to sign CRL with ACME CA
  ca.signcrl                           INFO   AUDIT  CRL signed successfully with ACME CA

CA Module — SignOCSPResponse:

  ca.signocspresponse                  WARN          Serial number is required
  ca.signocspresponse                  WARN          Serial number must be positive
  ca.signocspresponse                  WARN          Invalid OCSP status
  ca.signocspresponse                  WARN          NextUpdate must be after ThisUpdate
  ca.signocspresponse                  ERROR         Failed to sign OCSP response with ACME CA
  ca.signocspresponse                  INFO   AUDIT  OCSP response signed successfully with ACME CA

Metrics

No Prometheus metrics are emitted by the bridge module. The bridge provides infrastructure (bridge.Log, hexdcall Logger adapter) but does not itself emit counters, gauges, or latency metrics.

Configuration System

One TOML file defines the entire gateway — hot-reload, env overrides, GitOps, and Kubernetes CRDs

Overview

Defines the entire gateway in one TOML file: every module, every route, every policy. Supports hot-reload without restart, environment variable overrides, GitOps via git repository, and Kubernetes CRDs. Multiple config sources are applied in a well-defined precedence order (lowest to highest priority):

  1. Default values (security-focused, applied automatically)
  2. TOML literal values (single file, directory of files, or Git repository)
  3. ${VAR} template substitution in TOML (arbitrary env var names, pre-parse)
  4. HEXON_* auto-computed overrides (post-parse, highest priority)

Key capabilities:

Thread-safe access with atomic reads and mutex-protected writes
Hot-reload with SHA256 change detection, callback throttling (default 100ms window), section caching, and delta change logging
Environment variable overrides for all fields including array items: HEXON_SECTION_KEY for singletons, HEXON_SECTION_ARRAY_<NAME>_KEY for array items. Automatic type conversion (string, int, bool, comma-separated arrays).
${VAR} template substitution: embed arbitrary env var names in TOML values, expanded pre-parse. Operators choose their own naming convention.
GitOps: clone from Git repo (HTTPS or SSH), automatic polling with cluster-aware leader-only execution, multi-TOML file merge
Directory-based config: pass a directory path, all *.toml files merged recursively in alphabetical order (maps merge, arrays concatenate, scalars last-wins)
Self-documenting schema: struct tags (desc, hint, default, min, max, enum, format, example, required, sensitive, rfc, depends) drive runtime documentation
Config diff history: ring buffer (default 10 entries) tracking per-key old/new values, exposed via “config diff” admin CLI command
Invalid config handling: hash-based dedup prevents retry storms, logs every 5 minutes
File deletion handling: service continues with last valid config, ALERT logged, status set to “file_missing” for health check visibility

Configuration is organized into domain-specific sections:

  Service, Telemetry, Cluster, Operations, Protection, Authentication, Filesystem

The config package is imported by virtually every component in the system. It has no dependencies on other gateway modules (only standard library + go-toml/v2).

Config

Configuration is loaded from TOML files. Default path: /tmp/hexon.toml

[service]
  hostname = "auth.example.com"        # Public hostname (required)
  port = 443                           # HTTPS listen port (required)
  public_port = 8443                   # Public-facing port for URL generation (behind NAT/LB)
  tls_cert = "/path/to/cert.pem"       # TLS certificate (file path or inline PEM)
  tls_key = "/path/to/key.pem"         # TLS private key (file path or inline PEM)
  read_timeout = 30                    # HTTP read timeout in seconds (default: 30)
  write_timeout = 30                   # HTTP write timeout in seconds (default: 30)
  idle_timeout = 120                   # HTTP idle timeout in seconds (default: 120)
  max_header_bytes = 65536             # Max header size in bytes (default: 65536)
  http2_enable = true                  # Enable HTTP/2 (default: true)
  handshake_timeout = 10               # TLS handshake timeout in seconds (default: 10)
  block_malformed_tls = true           # Reject invalid TLS (default: true)
  mtls_mode = "none"                   # mTLS mode: "none", "optional", "mandatory"
  x509_auto_auth = true                # Auto-authenticate with client certificate (default: true)
  hot_reload_enabled = true            # Enable automatic file watching (default: true)
  hot_reload_poll_interval = "1s"      # File polling interval (default: 1s)
  hot_reload_callback_throttle = "100ms"  # Callback throttle window (default: 100ms)

[telemetry]
  log_level = "info"                   # trace|debug|info|warn|error|fatal (default: info)
  log_format = "json"                  # json|human (default: json)
  output = "stdout"                    # stdout|otlp|both (default: stdout)
  otlp_endpoint = "otel-collector:4317"  # Required when output is otlp or both
  log_buffer_size = 10000              # Ring buffer for log queries (default: 10000)

[cluster]
  cluster_mode = true                  # Enable clustering (default: false)
  cluster_peers = ["10.0.0.2", "10.0.0.3"]  # Static peers (IPs or hostnames)
  cluster_dns = "hexon.cluster.local"  # OR DNS-based discovery (ignored when cluster_peers is set)
  cluster_key = "32-char-secret"       # Cluster key, exactly 32 chars (required)
  cluster_refresh = "15s"              # Peer refresh interval (default: 15s)
  threshold_required = false           # Fail-closed threshold signing after bootstrap grace (default: false)
  threshold_bootstrap_grace = "2m"     # Grace period for DKG completion (default: 2m)
  threshold_nodes = 0                  # Threshold t value: 0=auto (floor(n/2)), explicit integer for override

Environment variable overrides (three layers):

  Precedence: HEXON_* override > ${VAR} expansion > TOML literal > defaults

  HEXON_* auto-computed overrides (post-parse, highest priority):
    Singleton fields:  HEXON_<SECTION>_<KEY>=value    # e.g., HEXON_SERVICE_PORT=8443
    Array item fields: HEXON_<SECTION>_<ARRAY>_<ITEMNAME>_<KEY>=value
                       # e.g., HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET=secret
    Item names are sanitized: uppercased, non-alphanumeric → underscore, collapsed.
    Only existing items (defined in TOML) can be overridden.
    Use 'config describe <section>' to see the exact env var for each field.

  ${VAR} template substitution (pre-parse, in TOML source):
    clientsecret = "${VAULT_OIDC_SECRET}"   # Arbitrary env var names
    Pattern: ${VARNAME} — unset vars left as-is, no recursive expansion.

  Type conversion: string, int, bool (true/false/1/0/yes/no), arrays (comma-separated)
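
The ${VAR} substitution rules (braces required, unset variables left as-is, no recursive expansion) can be sketched with one regular expression pass. The function name expand_vars is illustrative. The name pattern [a-zA-Z_][a-zA-Z0-9_]* is the one given in the troubleshooting notes.

```python
import re

_VAR_PATTERN = re.compile(r"\$\{([a-zA-Z_][a-zA-Z0-9_]*)\}")


def expand_vars(toml_source, env):
    """Replace ${NAME} with env[NAME]; leave unset variables untouched."""
    def substitute(match):
        name = match.group(1)
        return env[name] if name in env else match.group(0)

    # A single pass: text produced by a substitution is never re-scanned,
    # so expansion is not recursive.
    return _VAR_PATTERN.sub(substitute, toml_source)
```

For example, with VAULT_OIDC_SECRET set to s3cret, the line clientsecret = "${VAULT_OIDC_SECRET}" becomes clientsecret = "s3cret" before the TOML is parsed.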

GitOps environment variables:

  CONFIG_GIT_REPO                      # Repository URL (HTTPS or SSH, required for GitOps)
  CONFIG_GIT_BRANCH                    # Branch name (required for GitOps)
  CONFIG_GIT_PATH                      # Local clone path (default: /tmp/hexon-config)
  CONFIG_GIT_POLLING                   # Enable remote polling (default: false)
  CONFIG_GIT_POLLING_TIME              # Polling interval (default: 5m, min: 30s)
  CONFIG_GIT_USER / CONFIG_GIT_TOKEN   # HTTPS authentication
  CONFIG_GIT_SSH_KEY                   # SSH private key (inline PEM or file path)

Directory-based config:

  Pass a directory path instead of file: --config /etc/hexon/conf.d/
  All *.toml files merged recursively in alphabetical order.
  Merge: maps merge recursively, arrays concatenate, scalars last-wins.
  Use numeric prefixes for ordering: 00-base.toml, 90-overrides.toml.
  World-writable files (mode bit 0002 set) are rejected for security.
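
The merge rules can be illustrated on already-parsed TOML documents. This is a minimal sketch, not the gateway's loader. The function names are hypothetical, and Python's default string sort stands in for "alphabetical order".

```python
def merge_config(base, overlay):
    """Merge overlay into base: maps merge recursively, arrays concatenate, scalars last-wins."""
    merged = dict(base)
    for key, value in overlay.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_config(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            merged[key] = current + value
        else:
            merged[key] = value
    return merged


def merge_files(parsed_by_path):
    """Merge {path: parsed_dict} in sorted path order."""
    merged = {}
    for path in sorted(parsed_by_path):
        merged = merge_config(merged, parsed_by_path[path])
    return merged
```

With 00-base.toml setting port = 443 and 90-overrides.toml setting port = 8443, the merged value is 8443, while proxy mappings from both files are kept.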

Config diff history:

  config_diff_history_enabled = true   # Enable/disable diff storage (default: true)
  config_diff_history_size = 10        # Max entries retained, range 1-100 (default: 10)
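
A bounded diff history behaves like a ring buffer: once full, each new entry evicts the oldest. The sketch below assumes flat dotted keys and an entry shape of {key: (old, new)}. Both are illustrative choices, as are the class and function names.

```python
from collections import deque


def diff_configs(old, new):
    """Per-key diff of two flat configs: {key: (old_value, new_value)} for changed keys."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


class DiffHistory:
    """Keeps the most recent `size` diffs (default 10, allowed range 1-100)."""

    def __init__(self, size=10):
        if not 1 <= size <= 100:
            raise ValueError("size must be between 1 and 100")
        self._entries = deque(maxlen=size)

    def record(self, old, new):
        changes = diff_configs(old, new)
        if changes:  # reloads without changes are not stored
            self._entries.append(changes)

    def entries(self):
        return list(self._entries)
```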

Hot-reloadable: all config values via Get(). Application code must handle changes.
Cold (restart required): listener bind address/port, TLS certificate paths at startup.

Troubleshooting

Common symptoms and diagnostic steps:

Config file not loading at startup:

  - TOML syntax error: check error message for line number, validate with 'config validate'
  - Missing required fields: hostname, port, tls_cert, tls_key must be present
  - Invalid CIDR notation: check proxy_cidr, ip_whitelist, ip_blacklist format
  - World-writable file: remove the world-write bit (0002) from TOML files, e.g. chmod o-w

Environment variable overrides not applying:

  - Check naming: HEXON_<SECTION>_<KEY> in uppercase (e.g., HEXON_SERVICE_PORT)
  - Dots become underscores: HEXON_SERVICE_HEXON_EDGE_CIDR for [service] hexon_edge_cidr
  - Boolean values: accepts true/false, 1/0, yes/no (case-insensitive)
  - Arrays: comma-separated (HEXON_SERVICE_PROXY_CIDR=10.0.0.0/8,172.16.0.0/12)
  - Array items: item must exist in TOML first; env var uses sanitized name
    (e.g., HEXON_AUTHENTICATION_OIDC_CLIENTS_MYAPP_CLIENTSECRET for client "MyApp")
  - ${VAR} not expanding: variable must be set (os.LookupEnv), pattern must use
    braces (${VAR} not $VAR), name must match [a-zA-Z_][a-zA-Z0-9_]*
  - Use 'config describe <section>' to see the exact env var name for each field
  - Check active overrides: 'config env' shows all HEXON_* variables in effect
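
When an override does not apply, it can help to compute the expected variable name and parsed value by hand. The helpers below are a sketch of the naming and conversion rules listed above, with hypothetical names. Splitting a section path into separate parts and trimming whitespace are assumptions. 'config describe <section>' remains the authoritative source for names.

```python
import re


def sanitize(name):
    """Uppercase; runs of non-alphanumeric characters collapse to one underscore."""
    return re.sub(r"[^A-Z0-9]+", "_", name.upper())


def env_var_name(*parts):
    """Join section, optional array and item name, and key into a HEXON_* name."""
    return "HEXON_" + "_".join(sanitize(part) for part in parts)


def parse_bool(raw):
    """Accept true/false, 1/0, yes/no (case-insensitive)."""
    value = raw.strip().lower()
    if value in ("true", "1", "yes"):
        return True
    if value in ("false", "0", "no"):
        return False
    raise ValueError(f"not a boolean: {raw!r}")


def parse_array(raw):
    """Split a comma-separated value into items."""
    return [item.strip() for item in raw.split(",") if item.strip()]
```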

Hot-reload not detecting changes:

  - File hash unchanged: hot-reload uses SHA256, not mtime
  - Throttle window: rapid changes coalesce within 100ms window
  - Check status: 'config diff' for recent changes
  - Callback timeout: callbacks exceeding 30s are logged at WARN
  - hot_reload_enabled=false: file watching is disabled entirely
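
Hash-based change detection explains why touching a file without editing it triggers nothing. A minimal sketch follows. The class name is hypothetical, and the real watcher also applies the polling interval and callback throttle described above.

```python
import hashlib


class ChangeDetector:
    """Reports a change only when the SHA256 of the content differs from the last seen."""

    def __init__(self):
        self._last_digest = None

    def changed(self, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if digest == self._last_digest:
            return False  # same bytes: mtime changes alone are ignored
        self._last_digest = digest
        return True
```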

Config file deleted while running:

  - Service continues with last valid config (graceful degradation)
  - ALERT logged immediately, reminder every 5 minutes
  - Status set to "file_missing" visible in 'health components'
  - When file is restored, normal operation resumes automatically

GitOps config not syncing:

  - Repository credentials: verify CONFIG_GIT_USER/TOKEN or CONFIG_GIT_SSH_KEY
  - Polling disabled: CONFIG_GIT_POLLING must be "true" for automatic updates
  - Cluster leader-only: in cluster mode, only the leader node polls Git
  - Multi-file merge: check logs for "[CONFIG] Multi-file mode:" to verify

Directory config merge issues:

  - File order: alphabetical by full path, use numeric prefixes (00-base, 90-overrides)
  - Scalar override: later files win
  - Array concatenation: proxy.mappings from multiple files combine (not override)
  - Only *.toml files included, rename to .disabled or .bak to exclude

Threshold signing issues:

  - threshold_required=true but tokens return 503 after bootstrap grace:
    DKG did not complete in time. Check 'status summary' for threshold state,
    'logs search threshold' for DKG errors. Ensure cluster_mode=true and ≥2 nodes.
  - Threshold signing not activating: requires cluster_mode=true, ≥2 nodes,
    X3DH healthy. Check 'health components' for x3dh status.
  - Re-DKG not triggering after node join/leave: stale route timeout is 5 minutes,
    then 10s stabilization. Wait ~5m10s after membership change.
  - threshold_nodes: 0 = auto (floor(n/2)), explicit value sets t directly.
    t+1 nodes must cooperate to sign. With t=1 and n=2, both nodes required.

Relationships

Module dependencies and interactions:

listener: Consumes service config for TLS settings, bind address, port, HTTP/2 parameters, handshake timeout. Listener reads config via Get() on startup and handles hot-reload for certificate rotation.
cluster: Config changes propagate to all nodes via cluster broadcast. GitOps polling runs on the cluster leader only.
telemetry: Reads log_level, log_format, output, otlp_endpoint. log_buffer_size controls ring buffer for admin CLI log queries.
protection: Rate limiting, PoW, IP whitelist/blacklist settings all loaded from [protection] section. Hot-reloadable for threshold tuning without restart.
authentication: All auth backend configuration (LDAP, OIDC, TOTP, WebAuthn, x509) loaded from [authentication] sub-sections.
Git config sync: Handles CONFIG_GIT_* env vars, repository cloning, SSH/HTTPS auth, multi-file merge, and polling coordination.
Hot reload: Infrastructure module that manages file watching lifecycle, callback registration, and reload orchestration.
proxy: Reverse proxy mappings, load balancer, circuit breaker settings from [proxy] section.
threshold signing: [cluster] threshold_required, threshold_bootstrap_grace, threshold_nodes control the threshold signing subsystem (GG18 ECDSA / FROST Ed25519). The algorithm is driven by [authentication.oidc] signing_algorithm. Config is cold (restart required). The threshold signing subsystem consumes these values at startup.
admin CLI: ‘config show’, ‘config describe’, ‘config example’, ‘config set’, ‘config diff’, ‘config validate’, ‘config env’ commands for operational visibility and management.
schema: Self-documenting system driven by struct tags. Schema extraction produces field metadata, description formatting, TOML example generation, and auto-computed env var names for operator-facing output. Each field shows its HEXON_* env var in ‘config describe’. The config guide MCP resource is generated from this schema data.

Logs

This module does not emit structured log entries. All config output goes to process stdout/stderr as console messages.

Console output categories:

  startup and reload:
    fmt.Printf   "[CONFIG] Warning: Failed to start hot-reload system: %v"
    fmt.Printf   "[CONFIG] Loading configuration from directory: %s"
    fmt.Printf   "[%s] %s"  (license periodic check callback)
    fmt.Fprintf   "[CONFIG] DEPRECATED: %s is deprecated — %s"
    fmt.Fprintf   "[CONFIG] Warning: %s: expected %s, got %s — %s"  (type mismatch auto-correction)

  cross-module validation:
    fmt.Fprintf   "[CONFIG] WARNING: signin.magiclink.enabled=true but SMTP is not configured — magic link disabled"
    fmt.Fprintf   "[CONFIG] INFO: auto-enabling authentication.devicecode (required by signin.magiclink)"

  git clone and metadata:
    fmt.Printf   "[CONFIG] Git TLS config: ..."
    fmt.Printf   "[CONFIG] Loading configuration from git repository: %s (branch: %s)"
    fmt.Printf   "[CONFIG] Git configuration loaded successfully: ..."
    fmt.Printf   "[CONFIG] Warning: Failed to extract git metadata: %v"
    fmt.Printf   "[CONFIG] Using HTTP basic authentication"
    fmt.Printf   "[CONFIG] Using SSH authentication"

  file watching and reload (via logHotReloadEvent helper):
    fmt.Printf   "[CONFIG-HOTRELOAD] Hot reload system started"
    fmt.Printf   "[CONFIG-HOTRELOAD] Hot reload system stopped"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config file changed, triggering reload"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config reload successful"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config reload failed - keeping previous config"
    fmt.Printf   "[CONFIG-HOTRELOAD] ALERT: Config file deleted - running with last valid config"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config file restored"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config still invalid - not retrying same broken config"
    fmt.Printf   "[CONFIG-HOTRELOAD] ALERT: Config parse failure"
    fmt.Printf   "[CONFIG-HOTRELOAD] ALERT: Config validation failure"
    fmt.Printf   "[CONFIG-HOTRELOAD] ALERT: Config file missing"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config reload triggered by cluster broadcast"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config reload from cluster successful"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config reload from cluster failed"
    fmt.Printf   "[CONFIG-HOTRELOAD] Cluster notified of config reload"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config changes detected"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config reloaded with no detected changes"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config callback panicked"
    fmt.Printf   "[CONFIG-HOTRELOAD] WARN: Legacy config callback timed out (goroutine leaked)"
    fmt.Printf   "[CONFIG-HOTRELOAD] WARN: Config callback timed out (context cancelled)"
    fmt.Printf   "[CONFIG-HOTRELOAD] WARN: Context-aware callback not respecting cancellation"
    fmt.Printf   "[CONFIG-HOTRELOAD] Config cache invalidated"
    fmt.Printf   "[CONFIG-HOTRELOAD] Hot reload configuration optimized"

None of these are queryable via ‘logs search’. They appear only in process stdout/stderr. The infrastructure/hotreload module wraps some of this via hexdcall manager logger (slog).

Metrics

No Prometheus metrics are emitted directly by this module.

Reload counters are available via the health system:

  - Reload attempts, successes, failures
  - Parse errors, validation errors, file-not-found errors
  - Callback timeouts, callback duration, last reload duration

Query reload status: health components | config status

Git Configuration Management

Pulls configuration from a git repository — GitOps workflow with automatic cluster-wide reload on changes

Overview

Synchronizes the gateway configuration from a git repository — every change auditable through git history. The leader polls for changes, validates the configuration, and broadcasts a reload to all cluster nodes. Supports SSH and HTTPS with PAT authentication, webhook-triggered pulls, and automatic rollback on invalid config.

Core capabilities:

Leader-only git repository polling (prevents duplicate change detection)
Cluster-wide reload to all members on change detection
Hard reset to remote HEAD for deterministic config state
Commit tracking with hash, author, message, and timestamp
Quorum wait for cluster-wide consistency confirmation
Integration with config hot-reload pipeline for seamless updates

Cluster synchronization flow:

  1. Leader node polls git repository at configured interval
  2. When changes detected, leader pulls and applies config locally
  3. Leader notifies all cluster members to pull the latest config
  4. Each member pulls latest git config and triggers hot-reload
  5. Quorum wait ensures cluster-wide consistency

The module provides GitOps-style configuration management where infrastructure teams push configuration changes to a git repository, and the cluster automatically picks up and applies those changes. This enables:

Version-controlled configuration with full audit trail
Pull request review workflows for config changes
Rollback capability via git revert
Branch-based staging/production config separation

Leadership determines which node polls the repository:

  - Only the cluster leader runs the git polling loop
  - If leadership changes, the new leader automatically starts polling
  - In standalone mode, the single node polls directly

Config

Git configuration is managed under the [config] section in hexon.toml:

[config]
  # Git repository settings
  git_enabled = true                    # Enable git-based config management
  git_repo = "/etc/hexon/config.git"   # Local path to git repository
  git_remote = "origin"                 # Git remote name (default: origin)
  git_branch = "main"                   # Branch to track (default: main)
  git_poll_interval = "30s"             # Polling interval (default: 30s)

  # Authentication
  git_ssh_key = "/etc/hexon/deploy.key" # SSH key for git authentication
  git_username = ""                      # Username for HTTPS auth (optional)
  git_password = ""                      # Password for HTTPS auth (optional)

  # Directory-based config
  config_dir = "/etc/hexon/config.d"    # Directory for split config files
  merge_strategy = "deep"               # How to merge directory configs

The git repository should contain the hexon.toml (or split config files) at the repository root. The module performs a hard reset to the remote branch HEAD on each pull, ensuring deterministic state regardless of local modifications.

Polling behavior:

  - Only the cluster leader polls the git repository
  - Poll interval determines change detection latency
  - SHA comparison detects changes (not file timestamps)
  - On detection, local reload happens first, then broadcast
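
One way to picture the polling behavior is a single poll step. This is a hedged sketch: the class and callback names are hypothetical, the remote SHA is passed in rather than fetched, and not recording a commit that failed to reload is an assumption rather than documented behavior.

```python
from typing import Callable, Optional


class GitPoller:
    """Leader-only change detection based on commit SHA comparison."""

    def __init__(self, reload_local: Callable[[str], bool],
                 broadcast: Callable[[str], None]) -> None:
        self.reload_local = reload_local  # returns True when the new config is valid
        self.broadcast = broadcast        # notifies cluster members to pull
        self.last_sha: Optional[str] = None

    def poll_once(self, remote_sha: str, is_leader: bool) -> str:
        if not is_leader:
            return "skipped"              # only the leader polls
        if remote_sha == self.last_sha:
            return "unchanged"            # same SHA, nothing to do
        if not self.reload_local(remote_sha):
            return "reload_failed"        # bad config is never broadcast
        self.last_sha = remote_sha
        self.broadcast(remote_sha)        # local reload first, then broadcast
        return "broadcast"
```

Calling poll_once on every git_poll_interval tick gives the change detection latency described above.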

Hot-reloadable: git_poll_interval. Cold (restart required): git_enabled, git_repo, git_remote, git_branch, git_ssh_key, git_username, git_password.

Architecture

Operational model and design:

Pull operation details:

  Each successful pull reports: commit hash, commit author, commit message,
  and pull timestamp. These are visible in structured logs and health status
  for auditing which config version is active on each node.

Operational model:

  The module is passive on member nodes -- it responds to cluster-wide pull
  notifications by performing a local git pull and triggering config reload.
  The active polling runs only on the leader node, which detects changes and
  initiates the cluster-wide pull.

Leader election dependency:

  The module relies on the cluster's leader election mechanism. Only the
  elected leader runs the git polling loop. If leadership changes, the new
  leader automatically starts polling. This prevents duplicate pulls and
  conflicting notification storms.

Troubleshooting

Common symptoms and diagnostic steps:

Config changes in git not being applied:

  - Verify git_enabled = true in [config] section
  - Check if this node is the leader: cluster status shows leader node
  - Verify git remote is accessible: net tcp <git_host:port>
  - Check git_poll_interval (default 30s): a recent change may not have been polled yet
  - Look for git pull errors in logs: logs search "gitconfig" --level=error
  - Verify branch name matches: git_branch must match remote branch

Authentication failures (git pull fails):

  - SSH: verify git_ssh_key path exists and has correct permissions (0600)
  - SSH: check host key is in known_hosts for the git server
  - HTTPS: verify git_username and git_password are correct
  - HTTPS: check if token has expired (for token-based auth)
  - Look for auth errors: logs search "git" --level=error
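
The SSH key permission check can be scripted before digging into logs. A small sketch, assuming a POSIX host; the function name is illustrative and not part of the admin CLI.

```python
import os
import stat


def ssh_key_mode_ok(path: str) -> bool:
    """Return True when `path` is a regular file with mode exactly 0600."""
    st = os.stat(path)
    return stat.S_ISREG(st.st_mode) and stat.S_IMODE(st.st_mode) == 0o600
```

Run it against the path configured in git_ssh_key; a False result usually means the key is group- or world-readable and should be fixed with chmod 600.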

Cluster members out of sync:

  - Check cluster health: cluster status shows all nodes
  - Verify pull delivery: logs search "gitconfig" on member nodes
  - Member pull failure is local only - check individual node logs
  - Force sync: push any new commit to the tracked branch so the leader detects a change on its next poll
  - Check quorum: if quorum lost, broadcast may not reach all members

Config validation failure after pull:

  - Invalid TOML in repository causes reload failure
  - Leader reload failure prevents broadcast (protects cluster)
  - Member reload failure logged locally, does not affect other nodes
  - Check: config validate to verify current config
  - Check git log for the problematic commit

Hard reset behavior:

  - The module performs git reset --hard to remote HEAD
  - Local modifications to the config file are overwritten
  - This is intentional: git is the source of truth
  - If local changes are needed, commit them to the repository
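
For reproducing the hard reset by hand, the equivalent git CLI sequence looks roughly like this. It is an approximation: the gateway may use a git library rather than the git binary, and the helper names are hypothetical.

```python
import subprocess


def hard_reset_commands(repo: str, remote: str = "origin",
                        branch: str = "main") -> list[list[str]]:
    """Build the fetch + hard reset sequence for a local repository path."""
    return [
        ["git", "-C", repo, "fetch", remote, branch],
        ["git", "-C", repo, "reset", "--hard", f"{remote}/{branch}"],
    ]


def run_hard_reset(repo: str, remote: str = "origin", branch: str = "main") -> None:
    """Run the sequence; raises CalledProcessError if any step fails."""
    for argv in hard_reset_commands(repo, remote, branch):
        subprocess.run(argv, check=True)
```

Because the second command discards uncommitted edits in the working tree, run it only against a checkout where git is meant to be the source of truth.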

Standalone mode (no cluster):

  - Git polling runs on the single node directly
  - No broadcast occurs (no cluster to notify)
  - Config reload happens locally after pull
  - Suitable for development and single-node deployments

Relationships

Module dependencies and interactions:

config: Primary integration point. The config system performs the actual git fetch and hard reset. Config hot-reload pipeline processes the updated TOML after pull.
cluster: Leader election determines which node runs the git polling loop. Cluster-wide notification delivers the pull signal to all members. Quorum wait (optional) ensures cluster-wide consistency.
Hot reload: Complementary module — gitconfig handles git-based config changes while hot reload handles file-based config changes. Both trigger the same cluster-wide reload pipeline.
telemetry: Structured logging for pull operations with commit hash, author, and success/failure status. Metrics for pull frequency and latency.

Logs

No structured log entries. A console message is emitted on successful pull (commit hash, author, message — not queryable via logs search).

Related logs from other modules:

  - config: logs git fetch, hard reset, and reload results
  - cluster: logs broadcast delivery to member nodes

Metrics

This module does not emit its own Prometheus metrics.

Observability is provided indirectly through dependent modules:

  - config: metrics for config reload success/failure and reload latency
  - cluster: metrics for broadcast delivery and quorum wait

Hot Reload

Applies configuration changes without restart — leader detects file changes, broadcasts reload to all nodes

Overview

Applies configuration changes to the entire cluster without restarting any node. The leader watches for config file changes, reloads locally, and broadcasts the update to all cluster members. Most settings take effect immediately — a few require restart (documented per module).

Core capabilities:

Leader-only file watching (prevents duplicate change detection across cluster)
SHA256 hash comparison for reliable change detection (1-second poll interval)
Cluster-wide reload notification to all members after leader detects changes
Graceful degradation to standalone mode (single node, no coordination)
Atomic config swap with validation before apply
Independent node recovery (each node can recover on next poll or restart)

Cluster reload flow:

  1. Leader's file watcher polls config file every 1 second
  2. SHA256 hash computed and compared to previous hash
  3. On change: leader re-reads config, validates, applies defaults
  4. Atomic config swap on leader node
  5. On success: leader notifies all cluster members to reload
  6. Each member independently re-reads file, validates, and swaps config
  7. Notification is best-effort (local success is sufficient)

Standalone mode:

  When running as a single node or when cluster coordination is not initialized,
  every node watches and reloads independently. No broadcast occurs. This mode
  provides backward compatibility for development environments, single-node
  deployments, and testing scenarios.

Error handling philosophy (best effort):

  - Leader reload success: always broadcast to cluster
  - Leader reload failure: do NOT propagate (protect cluster from bad config)
  - Member reload failure: logged locally, does not affect other nodes
  - Cluster propagation failure: logged, local reload already succeeded

Config

Hot reload is an infrastructure module that watches the main config file. Its behavior is controlled by the overall config system rather than a dedicated config section.

The file watcher monitors the main hexon.toml config file path. The poll interval is fixed at 1 second for responsive change detection without excessive I/O overhead.

Key behaviors:

  - File watcher only runs on the cluster leader node
  - SHA256 hash comparison avoids false-positive reloads from timestamp changes
  - Config validation occurs before applying changes (fail-safe)
  - Invalid config is rejected; previous config remains active
  - Atomic swap ensures no partial config state is visible to readers

Leadership determines which node watches the config file:

  - Only the cluster leader runs the file watcher
  - If leadership changes, the new leader automatically starts watching
  - In standalone mode, every node watches independently

The config system also exposes reload status and metrics for health checks and monitoring.

Hot-reloadable fields vary by module. Each module documents which of its config fields support hot-reload vs. requiring a restart. The hotreload module itself has no user-configurable settings.

Architecture

Operational model and design:

Config version tracking:

  Each successful reload increments a version counter. This allows health
  checks and monitoring tools to detect whether a node is on the latest
  config by comparing version numbers. The version, reload timestamp, and
  any error message are exposed via health status.
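
The validate-then-swap behavior and the version counter can be modelled as follows. This is a simplified sketch under stated assumptions: ConfigStore and its fields are hypothetical names, and the validator is supplied by the caller.

```python
import threading
from typing import Any, Callable, Optional


class ConfigStore:
    """Holds the active config; a reload swaps it only after validation passes."""

    def __init__(self, initial: dict[str, Any]) -> None:
        self._lock = threading.Lock()
        self._config = initial
        self.version = 0
        self.last_error: Optional[str] = None

    def get(self) -> dict[str, Any]:
        with self._lock:
            return self._config

    def reload(self, candidate: dict[str, Any],
               validate: Callable[[dict[str, Any]], Optional[str]]) -> bool:
        error = validate(candidate)       # validate before touching live state
        with self._lock:
            if error is not None:
                self.last_error = error   # previous config stays active
                return False
            self._config = candidate      # readers see old or new, never a mix
            self.version += 1
            self.last_error = None
            return True
```

Comparing the version value across nodes is then enough to spot a member that missed a reload.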

File watching approach:

  The file watcher uses a polling approach (not inotify/kqueue) for maximum
  portability across Linux, macOS, and container environments. The 1-second
  poll interval provides a good balance between responsiveness and overhead.
  SHA256 hashing is more reliable than mtime/ctime comparison, which can
  produce false positives with NFS or container volume mounts.
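
A content-hash watcher of this kind is short to write. The sketch below illustrates the technique only; HashWatcher is a hypothetical name and the real watcher may differ in detail.

```python
import hashlib


def file_sha256(path: str) -> str:
    """Hash the whole file; config files are small enough to read at once."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


class HashWatcher:
    """Reports a change only when the file content hash differs."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.last_hash = file_sha256(path)

    def changed(self) -> bool:
        current = file_sha256(self.path)
        if current == self.last_hash:
            return False          # mtime-only changes are ignored
        self.last_hash = current
        return True
```

A caller would invoke changed() once per second and start a reload whenever it returns True; touching the file without editing it does not trigger anything.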

Separation from gitconfig:

  hotreload handles direct file modifications (edit, cp, mount update).
  gitconfig handles git repository-based changes (git pull, merge).
  Both trigger the same config reload pipeline but through different
  detection mechanisms. They complement each other:
    - Use gitconfig for GitOps workflows with version control
    - Use hotreload for direct file modifications or mounted config maps

Troubleshooting

Common symptoms and diagnostic steps:

Config changes not being picked up:

  - Verify this node is the cluster leader: cluster status
  - In standalone mode, every node watches independently
  - Check if file was actually modified: SHA256 hash must change
  - Editing in place (vi, nano) changes the hash; a truncate-then-write save may race with the watcher, which can read a partially written file
  - NFS/mount delays: file may not be visible for up to 1 second
  - Check logs for reload attempts: logs search "reload" --level=info

Reload fails with validation error:

  - Invalid TOML syntax: config validate to check current file
  - Missing required fields after edit
  - Leader detects failure and does NOT broadcast to cluster
  - Fix the config file; watcher will detect next change automatically
  - Check error details: logs search "reload" --level=error

Cluster members not reloading:

  - Check cluster connectivity: cluster status and health status
  - Verify reload delivery: logs search "reload" on member nodes
  - Member failure is independent: check individual node logs
  - Network partition: members reload on next local file change or restart
  - Quorum issues: cluster-wide reload requires quorum for delivery

Reload succeeded but feature not updated:

  - Not all config fields are hot-reloadable
  - Check module documentation for which fields require restart
  - Cold fields (e.g., listen ports, TLS certs) need full restart
  - Verify config version incremented: health status shows config version

Standalone mode issues:

  - No broadcast occurs in standalone mode (expected behavior)
  - Each node watches independently when cluster is not initialized
  - Verify the config file path is correct and accessible
  - File permissions: process must have read access to config file

File watcher consuming resources:

  - SHA256 computation is negligible, even for very large config files (rare)
  - 1-second poll interval is fixed and not configurable
  - If concerned, monitor CPU via metrics prometheus "cpu"

Relationships

Module dependencies and interactions:

config: Primary integration point. The config system owns file reading, TOML parsing, validation, and atomic swap logic. Reload status and metrics are exposed for health checks.
cluster: Leader election determines which node runs the file watcher. Cluster-wide notification delivers the reload signal to all members. Leadership changes automatically transfer the file watching responsibility.
GitOps config: Complementary module for git-based config changes. Both modules trigger the same config reload pipeline. gitconfig is for GitOps workflows; hotreload is for direct file modifications.
All modules with hot-reloadable config: When reload occurs, each module receives updated config via their registered reload callbacks. Modules include firewall (ACL rules), proxy (mappings), ratelimit (limits), forwardproxy (rate/bandwidth), and many others.
telemetry: Structured logging for reload events with success/failure status, config version, and timing. Metrics for reload frequency and duration.

Logs

No structured log entries. A single console message is emitted on initialization.

Related logs from other modules:

  - config: logs file watcher start/stop, hash comparison, reload success/failure
  - cluster: logs broadcast delivery to member nodes

Metrics

This module does not emit its own Prometheus metrics.

Observability is provided indirectly through dependent modules:

  - config: metrics for config reload success/failure, reload latency, and config version
  - cluster: metrics for broadcast delivery and quorum wait

Kubernetes CRD Configuration

Kubernetes-native configuration via Custom Resource Definitions with bootstrap reconciliation, live watching, and status feedback

Overview

HexonGateway supports Kubernetes-native configuration through Custom Resource Definitions (CRDs). When running in Kubernetes, operators can manage gateway configuration using standard kubectl commands instead of (or alongside) TOML files.

The system defines 55 CRD types covering every configuration section:

  - Service, cluster, telemetry, health, DNS, SMTP, filesystem, memory
  - Proxy mappings, connection pools, TCP proxy, forward proxy, subrequest
  - Authentication: OIDC clients, auth flows, signup flows
  - Identity: LDAP, OIDC providers, SCIM providers
  - Protection: WAF config, WAF rules, firewall rules/aliases, rate limiting
  - Infrastructure: bastion, SQL bastion, SSH certificates, port forwarding, connector, client access
  - Certificates: ACME CA server, ACME client
  - Operations: admin, MCP, LLM, playbooks, webhooks, SPIFFE, RADIUS
  - Observability: log intelligence, notifications

CRDs are optional — the gateway runs identically on VMs, Docker, or Kubernetes using TOML configuration. CRDs provide a Kubernetes-native alternative that integrates with GitOps tools like ArgoCD and Flux.

All CRDs belong to the config.hexon.io API group, version v1alpha1. They are namespace-scoped; instances live in the hexon-system namespace by default.

Config

CRD Lifecycle:

  1. Bootstrap: On first start, the cluster leader creates CRD instances from
     the running config (TOML + env overrides + defaults merged). Each instance
     is labeled config.hexon.io/origin: bootstrap.
  2. Pruning: Bootstrap-owned array CRDs no longer in config are automatically
     deleted, along with their companion Secrets. Operator-owned CRDs are never
     touched. This ensures TOML deletions propagate to Kubernetes.
  3. Watching: Informers watch for CRD changes via the Kubernetes API. Changes are
     debounced (500ms window) to batch rapid edits.
  4. Apply: CRD spec is converted to the internal config struct, validated, and
     applied atomically. Config reload callbacks fire for all modules.
  5. Status: Each CRD instance gets status conditions reflecting apply success/failure.
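
The debounce step in the lifecycle can be illustrated with a small batcher. This sketch assumes a trailing debounce (the window restarts on every new event), which the text above does not specify; timestamps are passed in explicitly so the behavior is easy to follow, and the class name is hypothetical.

```python
from typing import Optional


class Debouncer:
    """Collects events and releases them once the stream has been quiet."""

    def __init__(self, window: float = 0.5) -> None:
        self.window = window                     # seconds of quiet required
        self.pending: list[str] = []
        self.last_event_at: Optional[float] = None

    def add(self, event: str, now: float) -> None:
        self.pending.append(event)
        self.last_event_at = now                 # each event restarts the window

    def flush_if_quiet(self, now: float) -> list[str]:
        if self.last_event_at is None or now - self.last_event_at < self.window:
            return []                            # still inside the window
        batch, self.pending = self.pending, []
        self.last_event_at = None
        return batch                             # apply the whole batch at once
```

Several kubectl edits made within the window therefore arrive as one batch and trigger one validation and one reload.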

Example — create a proxy mapping:

  apiVersion: config.hexon.io/v1alpha1
  kind: HexonProxy
  metadata:
    name: dashboard
    namespace: hexon-system
  spec:
    hostname: dashboard.example.com
    target: http://dashboard-service:3000
    auth_type: oidc

Sensitive fields (SecretKeyRef):

  Sensitive config fields (certificates, private keys, passwords, API secrets,
  RADIUS shared secrets, OIDC client secrets) are never stored in CRD specs.
  Instead, they are stored in companion Kubernetes Secrets and referenced via
  SecretKeyRef entries in the CRD spec:

    spec:
      apiKey:
        name: hexon-hexonproxies-dashboard    # Secret name
        key: apiKey                           # Key within the Secret

  - Empty sensitive fields (e.g., no custom certificate) produce no Secret.
    The field stays empty and the gateway uses its default (e.g., wildcard cert).
  - Non-empty fields are stored in a companion Secret named
    hexon-<plural>-<instance> (e.g., hexon-hexonproxies-dashboard).
  - Operators can reference any Secret they create — not limited to the
    bootstrap naming convention.
  - RBAC: The gateway pod needs get/list/create/update/delete on core Secrets.
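
The companion Secret naming convention can be expressed as a pair of helpers. The function names are illustrative; only the hexon-<plural>-<instance> pattern and the name/key shape of the reference come from the text above.

```python
def companion_secret_name(plural: str, instance: str) -> str:
    """Bootstrap naming convention: hexon-<plural>-<instance>."""
    return f"hexon-{plural}-{instance}"


def secret_key_ref(plural: str, instance: str, field: str) -> dict[str, str]:
    """Build the SecretKeyRef entry that replaces a sensitive value in the spec."""
    return {"name": companion_secret_name(plural, instance), "key": field}
```

For the dashboard proxy above, secret_key_ref("hexonproxies", "dashboard", "apiKey") yields the same name and key as the spec fragment shown earlier.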

Ownership model:

  - Bootstrap-created CRDs and companion Secrets have label:
    config.hexon.io/origin: bootstrap
  - Remove the label to "take ownership" — bootstrap will no longer overwrite
  - Operator-created CRDs (no label) are never modified or deleted by bootstrap
  - Bootstrap-owned array CRDs removed from config are pruned on next restart
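
The ownership rules reduce to a single label check. The sketch below is a simplified decision function for array CRDs, with hypothetical names; it only models the rules listed above.

```python
ORIGIN_LABEL = "config.hexon.io/origin"


def is_bootstrap_owned(labels: dict[str, str]) -> bool:
    """A resource is bootstrap-owned only while it carries the origin label."""
    return labels.get(ORIGIN_LABEL) == "bootstrap"


def bootstrap_action(labels: dict[str, str], in_config: bool) -> str:
    """Decide what bootstrap does with an existing array CRD instance."""
    if not is_bootstrap_owned(labels):
        return "skip"      # operator-owned: never modified or deleted
    if in_config:
        return "update"    # still in config: bootstrap may overwrite it
    return "prune"         # removed from config: deleted on next restart
```

Removing the label flips the result to "skip" in every case, which is what taking ownership means in practice.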

Singleton vs Array CRDs:

  - Singleton: one instance named "default" (e.g., HexonClusterConfig, HexonDNSConfig)
  - Array: multiple instances, name derived from config (e.g., HexonProxy per mapping)

Resource naming:

  - K8s resource names are sanitized from config names (lowercased, spaces/underscores
    to dashes, special chars to dashes, max 253 chars). Example: config app "Kubernetes /
    Production" becomes resource name "kubernetes---production".
  - The original config name is preserved in the CRD spec (e.g., spec.app for proxies).
  - The "crd show" command accepts either the K8s resource name or the original config name.
  - Use "crd list <kind>" to see both resource names and config names side by side.
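
The sanitization rule can be approximated with one regular expression. This is a sketch, not the gateway's code: it treats every character outside a-z, 0-9 and the dash as special (including dots, which is an assumption) and does not trim leading or trailing dashes.

```python
import re

MAX_NAME_LEN = 253  # upper bound stated above


def sanitize_resource_name(config_name: str) -> str:
    """Lowercase the config name and map unsupported characters to dashes."""
    name = config_name.lower()
    name = re.sub(r"[^a-z0-9-]", "-", name)  # spaces, underscores, specials
    return name[:MAX_NAME_LEN]
```

Each replaced character becomes its own dash, which is why "Kubernetes / Production" ends up with three consecutive dashes.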

Status conditions:

  Every CRD instance reports an "Applied" condition:
    Applied=True   reason=ConfigValid      — config applied successfully
    Applied=False  reason=ApplyError       — config apply failed
    Applied=False  reason=ConversionError  — CRD-to-config conversion failed

  Check status: kubectl get hexonproxies -o wide
  The "Applied" printer column shows the current phase (Ready/Error).

Troubleshooting

CRD instances not being created on startup:

  - Only the cluster leader runs bootstrap reconciliation
  - Check logs for "bootstrap reconciliation complete" message
  - Verify CRD definitions are installed: kubectl get crd | grep hexon

CRD changes not applying:

  - Changes are debounced with a 500ms window — wait briefly
  - Check status conditions: kubectl describe <crd-kind> <name>
  - Look for Applied=False with reason and message
  - Verify RBAC: the gateway pod needs get/list/watch/create/update/patch permissions
    on all config.hexon.io resources and their status subresources

Status shows Applied=False reason=ConversionError:

  - The CRD spec doesn't match the expected config structure
  - Check field names match TOML keys (snake_case in spec)
  - Verify enum values are valid (e.g., auth_type must be a recognized method)

Bootstrap keeps overwriting my changes:

  - Remove the config.hexon.io/origin label from the CRD instance:
    kubectl label hexonproxy <name> config.hexon.io/origin-
  - Once the label is removed, bootstrap treats it as operator-owned and skips it
  - Do the same for companion Secrets if you want to manage them independently

Sensitive field shows empty after CRD apply:

  - Check the companion Secret exists: kubectl get secret hexon-<plural>-<name>
  - Verify the Secret has the expected key: kubectl get secret <name> -o jsonpath='{.data}'
  - Check RBAC allows Secret read: the gateway pod needs get on core/v1 secrets
  - If the Secret was manually deleted, restart the gateway to recreate it via bootstrap

CRD still exists after removing mapping from TOML:

  - Bootstrap prunes only CRDs with the config.hexon.io/origin: bootstrap label
  - If the label was removed (operator-owned), delete it manually:
    kubectl delete hexonproxy <name>

Config export for migration:

  - Use the admin CLI: config export
  - Exports running config as multi-document YAML CRD manifests
  - Filter by section: config export proxy
  - JSON format: config export --format=json
  - Only available when running in Kubernetes

Relationships

Module dependencies and interactions:

Configuration system: CRD changes are applied to the same config store used by TOML and environment variables. All modules see changes via the standard config reload mechanism. CRDs have the same precedence as TOML — environment variables still override.
Cluster coordination: Bootstrap reconciliation runs on the cluster leader only. Config changes from CRDs propagate to all nodes via the standard config reload broadcast (NATS-based).
Admin CLI: The “config export” command generates CRD YAML from running config, enabling migration from TOML to Kubernetes-native management. When using “config export --apply”, companion Secrets are created for sensitive fields (without the bootstrap label, so they are operator-owned). The “config show” and “config describe” commands work regardless of config source.
Helm chart: CRDs are distributed separately from the Helm chart. Install CRDs first, then deploy the chart. This avoids Helm’s CRD lifecycle limitations (no update on upgrade, deletion on uninstall).
CI/CD integration: CRD manifests are versioned and compatible with ArgoCD, Flux, and other GitOps tools. The all-in-one bundle (hexon-crds.yaml) contains all CRD definitions.
Codegen tool: CRD YAML manifests are generated from Go struct tags using the build tool (build-crd.sh). OpenAPI v3 schemas include validation constraints derived from struct tags (required, enum, min, max, default, desc).

Logs

Log entries for Kubernetes operations. No AUDIT entries — all operational/debug. Levels: ERROR > WARN > INFO > DEBUG.

CRD Definition Management:

  CRD definition ensure failed                    WARN   Schema ensure error for a single CRD kind
  CRD definition created                          INFO   New CRD definition created in cluster
  CRD definition updated                          INFO   Existing CRD definition updated with new schema
  CRD definitions ensured                         INFO   Summary: created/updated/unchanged counts for all CRDs

Manager Lifecycle:

  CRD auto-apply failed, using existing definitions  WARN   CRD ensure failed (RBAC or network); continues with existing
  starting K8s CRD informers                      INFO   Informer startup with namespace and CRD count
  K8s API watch interrupted, will retry            WARN   Transient network error on watch stream (auto-retries)
  K8s API watch failed                            ERROR  Non-network watch error (permissions, API server issue)
  failed to set watch error handler               WARN   Could not install custom watch error handler
  informer cache sync failed                      WARN   Individual informer cache did not sync
  K8s informers synced                            INFO   All informer caches synced, ready to process events
  K8s manager stopped                             INFO   Manager shutdown complete
  K8s manager restarting after CRD definitions applied  INFO   Manager restart after CRD sync timeout recovery

Config Apply:

  failed to convert CRD to config                 ERROR  UnstructuredToConfig failed for a CRD change
  skipping CRD change with unresolved sensitive fields  DEBUG  SecretKeyRef not yet populated, skip to avoid empty overwrite
  failed to apply singleton change                ERROR  Config mutation failed for singleton CRD
  failed to apply array change                    ERROR  Config mutation failed for array/map CRD item
  failed to apply delete                          ERROR  Config deletion failed for array/map item
  CRD config validation failed, reload skipped    ERROR  Config.Validate() failed after applying CRD changes
  applied CRD config changes                      INFO   Config updated from CRD changes with apply/skip/error counts
  all CRD changes matched current config, reload skipped  DEBUG  All CRD changes identical to running config

Bootstrap Reconciliation:

  bootstrap singleton failed                      ERROR  Failed to reconcile a singleton CRD from config
  bootstrap array failed                          ERROR  Failed to reconcile an array CRD type from config
  bootstrap reconciliation complete               INFO   Summary: created/updated/skipped/pruned counts
  bootstrap array item failed                     ERROR  Failed to create/update a single array item CRD
  bootstrap map item failed                       ERROR  Failed to create/update a single map-keyed CRD
  failed to prune bootstrap CRD                   ERROR  Could not delete orphaned bootstrap-owned CRD
  pruned bootstrap CRD removed from config        INFO   Deleted bootstrap CRD no longer in TOML config
  failed to delete companion Secret during prune  WARN   Companion Secret cleanup failed during CRD prune
  failed to create companion Secret               ERROR  Could not create K8s Secret for sensitive fields

Secrets:

  created companion Secret for CRD                INFO   New K8s Secret created for sensitive fields
  updated companion Secret for CRD                DEBUG  Existing K8s Secret updated with new sensitive data
  failed to resolve Secret for sensitive field    WARN   Could not read SecretKeyRef value from K8s Secret

Status:

  status update: failed to write status           WARN   Could not write status condition to CRD instance

Health Sync:

  health status synced                            INFO   Health status written to CRD resources (with update count)
  cluster status sync: failed to get resource     WARN   Could not read cluster CRD for status update
  cluster status sync: failed to write status     WARN   Could not write leader/nodes/health to cluster CRD
  connector status sync: failed to get resource   WARN   Could not read connector site CRD for status update
  connector status sync: failed to write status   WARN   Could not write rich status to connector site CRD
  health sync: failed to get resource             WARN   Could not read CRD resource for health update
  health sync: failed to write status             WARN   Could not write health field to CRD resource

Resource Apply:

  CRD resource created                            INFO   CRD instance created via CLI apply
  CRD resource updated                            INFO   CRD instance updated via CLI apply (may include ownership transfer)

Watcher:

  unexpected object type in informer event        WARN   Informer delivered non-Unstructured object

Metrics

Prometheus metrics. Query with: metrics prometheus k8s_<name>

Reconciliation:

  k8s_reconciliations_total              counter    {result}              Config-to-CRD reconciliation cycles
    result=success | failure

Health Sync:

  k8s_health_syncs_total                 counter    {result}              Periodic health status writes to CRD .status
    result=success | failure

CRD Operations:

  k8s_crd_operations_total               counter    {operation, result}   Individual CRD operations
    operation=ensure_definition, result=created | updated | failure
    operation=status_write, result=success | failure

Alerts:

  rate(k8s_reconciliations_total{result="failure"}[5m]) > 0       Reconciliation failing
  rate(k8s_health_syncs_total{result="failure"}[5m]) > 0          Health sync failing

AI Assistant

Built-in AI-powered natural language interface for gateway operations via the bastion shell

Overview

The AI assistant enables natural language interaction with all gateway admin tools through the bastion shell’s “ai” command. It shares the same tool set and execution path as MCP, ensuring identical tool visibility, read/write enforcement, metrics, and audit logging.

Capabilities:

  Tool execution - Runs any admin CLI command via an agentic loop. The AI
    reads tool results, reasons about them, and decides what to run next.
    Read-only commands execute automatically. Write operations pause for
    interactive operator approval in the SSH session.

  Multi-provider support - Works with Anthropic (Claude), OpenAI (GPT-4),
    Azure OpenAI, Google Gemini, and Ollama/vLLM for local models. Provider
    auto-detected from the API URL or set explicitly.

  Conversation context - Maintains per-session conversation history so
    follow-up questions build on prior answers. Operators can set session
    context hints and the AI sees recent shell commands for awareness.

  Background monitoring - The schedule_task tool runs commands periodically
    in the background. Results appear between shell prompts. Operators
    manage tasks with "task list" and "task stop <id>".

  Inline monitoring loops - The sleep tool pauses the AI within its
    reasoning loop, then resumes with full context. Enables "check health,
    wait 30s, check again, compare, report changes" patterns. Each sleep
    extends the tool-calling budget so monitoring does not fight the
    per-query round limit. Governed by max_sleep_duration (default 5m per
    call) and max_sleeps_per_query (default 60 iterations). Ctrl+C
    interrupts immediately.

  Cluster knowledge base - Persistent cross-session memory for operational
    insights and rules. The AI learns from investigations and applies that
    knowledge in future sessions.

  Prompt caching - Anthropic provider supports prompt caching (5m or 1h
    TTL) to reduce token costs on repeated interactions.

Configuration: the [llm] section sets api_url, api_key, model, and required_groups. Enable the assistant in bastion with [bastion] use_llm = true. Per-user custom instructions can be supplied via moduledata or config.

Safety

Multiple layers prevent runaway AI behavior:

  Tool round limit - max_tool_rounds (default 15) caps the number of
    reasoning cycles per query. Sleep calls extend this budget so
    monitoring loops get additional rounds.

  Write operation limit - max_write_ops_per_query (default 3) caps
    mutations per query. The AI cannot retry failing write commands
    with slight variations.

  Sleep guardrails - max_sleep_duration (default 5m) caps individual
    pauses. max_sleeps_per_query (default 60) caps total iterations.
    Token cost on each wake-up naturally limits runaway loops.

  Failed operation dedup - Commands that fail are tracked by operation
    key. The AI cannot re-execute the same failing command.

  RBAC - required_groups restricts which operators can use AI features.
    allowed_commands whitelist limits which tools the AI can call.

  Interactive approval - Write operations prompt the operator for y/n
    confirmation in the SSH session before execution.

  Audit trail - All AI interactions logged with distributed tracing.
    Sensitive data redacted by default.

  Rate limiting - Per-user query rate limit (default 10/1m) prevents
    excessive API usage.

Security

Multiple defense layers protect the AI assistant:

  RBAC - required_groups restricts which operators can use AI features.
    Only operators in the configured groups can access the "ai" command.

  Command whitelist - allowed_commands limits which tools the AI can call.
    Operators cannot override this from within the AI session.

  Write protection - Read-only commands execute automatically. Write
    operations pause for interactive operator approval (y/n) in the SSH
    session before execution. Cannot be overridden by the AI.

  Rate limiting - Per-user query rate limit (default 10/1m) prevents
    excessive API usage and token cost accumulation.

  Audit trail - All AI interactions logged with distributed tracing.
    Sensitive data redacted by default (redact_sensitive = true).

  Runaway prevention - Tool round limit (default 15), write operation
    limit (default 3), sleep guardrails (5m per call, 60 iterations max),
    and failed operation dedup all prevent excessive token consumption.

Troubleshooting

Common symptoms and diagnostic steps:

AI command not available in bastion:

  - Verify [bastion] use_llm = true in config
  - Verify [llm] section is configured with api_url, api_key, model
  - Check required_groups: operator must be in one of the listed groups
  - Check: 'config show llm' to verify configuration

AI returns errors or empty responses:

  - Check API connectivity: verify api_url is reachable from the gateway
  - Check API key validity: invalid keys produce authentication errors
  - Check provider detection: auto-detect uses api_url hostname, set
    provider explicitly if using a proxy or non-standard endpoint
  - Check: 'logs search llm --since=5m' for API errors

Write operations not being approved:

  - Write ops require interactive SSH session (not available via MCP)
  - Operator must respond y/n to the approval prompt
  - max_write_ops_per_query (default 3) may be exhausted for the query
  - Check allowed_commands whitelist if specific commands are blocked

AI stops responding mid-conversation:

  - max_tool_rounds (default 15) reached: increase if needed for complex
    queries, but be aware of token cost implications
  - Sleep monitoring loop: max_sleeps_per_query (default 60) may be
    exhausted. Ctrl+C interrupts immediately
  - Check: 'logs search llm "round limit"' for limit violations

Background tasks not running:

  - 'task list' shows scheduled tasks and their status
  - 'task stop <id>' to cancel a misbehaving task
  - Only read-only commands can be scheduled as background tasks

High API costs:

  - Enable prompt caching for Anthropic provider (cache_ttl setting)
  - Reduce max_tool_rounds to limit reasoning cycles
  - Review max_sleeps_per_query for monitoring loops
  - Check per-user rate limits (default 10 queries/minute)

Relationships

Cross-subsystem interactions:

Admin CLI: Single source of truth for all tools. The AI calls the same command handlers available via MCP and the bastion shell.
MCP: Shares system instructions, tool definitions, and response formatting. Both interfaces use the same execution path.
Bastion shell: Hosts the “ai” command and interactive AI mode. Manages conversation history, approval prompts, and background task lifecycle.
Cluster knowledge: Memory entries (insights and rules) stored in cluster-wide distributed storage with configurable TTL.
Admin CLI commands: diagnose, health, proxy, sessions, certs, dns, directory, config, and 30+ more — all available as AI tools.

Logs

Log entries emitted by the LLM module. Search with: logs search “llm”. Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Query lifecycle:

  llm.query.start        INFO          Starting LLM query
  llm.query.complete     INFO          LLM query completed
  llm.query.api_error    ERROR         LLM API call failed
  llm.query.max_rounds   WARN          LLM query exceeded maximum tool rounds

Tool execution:

  llm.tool.execute       INFO          Executing tool via hexdcall
  llm.tool.approved      INFO   AUDIT  Write operation approved by operator
  llm.tool.denied        INFO   AUDIT  Write operation denied by operator

Metrics

Prometheus metrics. Query with: metrics prometheus llm_<name>

Queries:

  llm_queries_total                        counter    {success}                     Query completion count
    success=true            Query completed with a final answer
  llm_query_duration_seconds               latency    (none)                        End-to-end query duration including all tool rounds

Tool calls:

  llm_tool_calls_total                     counter    {tool, success}               Per-tool execution count
    tool=<command>          CLI command name (e.g. "proxy", "cluster")
    success=true/false      Whether the command executed successfully

Prompt caching (Anthropic provider only, emitted when cache tokens > 0):

  llm_cache_read_tokens_total              counter    (none)                        Tokens read from Anthropic prompt cache
  llm_cache_creation_tokens_total          counter    (none)                        Tokens written to Anthropic prompt cache

Module Data Storage

Stores per-user credentials and settings — passkeys, TOTP secrets, X.509 enrollment data, and preferences

Overview

Stores per-user data for authentication and service modules — passkey credentials, TOTP secrets, X.509 enrollment data, and user preferences. Each module gets isolated storage with automatic cluster replication. Used by WebAuthn, TOTP, X.509, and any module that needs persistent per-user state.

Core capabilities:

Hexon KV (NATS JetStream) storage with automatic cluster replication
Per-user, per-module namespace isolation (e.g., “totp”, “webauthn”, “x509”)
Reserved “preferences” namespace for cross-module user settings (language, etc.)
Automatic language preference storage when Language field is set on SetRequest
Directory cache refresh broadcast after Set and Delete operations
Input validation at facade and storage levels
Base64url key encoding for NATS KV compatibility (handles @, :, spaces)

Operations: Get, Set, Delete, Exists (check existence), GetAllForUser (all data for a user), and LoadAll (bulk load).

Key format uses base64url-encoded usernames for storage compatibility.

Config

Configuration for moduledata storage:

Hexon KV Requirements:

  [cluster]
  cluster_path = "/var/lib/hexon"   # Required for NATS JetStream persistence

  - NATS JetStream must be available (cluster mode)
  - Data automatically replicated across cluster nodes
  - LoadAll returns all stored data (efficient for bootstrap)

Input Validation Rules:

  Username:
    - Cannot be empty
    - Maximum 200 characters (before base64url encoding)
    - Any characters allowed (gets base64url encoded)

  Module Name:
    - Cannot be empty
    - Maximum 64 characters
    - Pattern: [a-zA-Z0-9][a-zA-Z0-9\-_]* (no dots or colons)
    - Examples: "totp", "webauthn", "ssh_keys", "user-preferences"

  Combined key maximum: 256 characters after encoding

Reserved Namespaces:

  - "preferences": User-wide settings (language, notification preferences)

Troubleshooting

Common symptoms and diagnostic steps:

“Backend unavailable” error (ErrBackendUnavailable):

  - Check cluster_path exists and NATS JetStream is running
  - Check cluster status for NATS availability

“Invalid username” or “Invalid module name” errors:

  - Username must be non-empty and at most 200 characters
  - Module name must match [a-zA-Z0-9][a-zA-Z0-9\-_]* pattern
  - Module name must be at most 64 characters
  - No dots or colons allowed in module name (NATS KV restriction)

Data not appearing across cluster nodes:

  - Verify NATS JetStream cluster health
  - Check if directory cache refresh broadcast is working
  - Run 'moduledata inspect <username>' to check data on local node

Language preference not being stored:

  - Language is stored asynchronously (fire-and-forget) in "preferences" namespace
  - Check if Set operation for the primary module succeeded first
  - Verify language code is a valid string (e.g., "en", "es", "fr", "zh")
  - Query preferences directly: Get with ModuleName="preferences"

Encoding/decoding errors:

  - ErrEncodingFailed: data contains types that cannot be JSON-serialized
  - ErrDecodingFailed: stored data is corrupted or not valid JSON
  - Check NATS KV key format (base64url encoding)
  - Verify data values are JSON-compatible (maps, strings, numbers, bools)

Performance and metrics:

  - moduledata_operations_total: counter by operation, backend, and result
  - moduledata_operation_duration: latency histogram by operation and backend
  - High latency: check NATS JetStream performance

Security

Security considerations for module data storage:

User enumeration prevention:

  HTTP handlers should return generic error messages to clients (e.g.,
  "Invalid credentials" instead of "User not found"). Detailed errors are
  logged internally with traceID for debugging.

Input validation (defense in depth):

  All inputs validated at facade and storage levels.
  Username length limit (200 chars) prevents DoS via oversized inputs.
  Module name character restrictions prevent injection attacks in NATS KV keys.
  Base64url encoding of usernames prevents NATS KV key injection.

Data isolation:

  Each module's data is stored under its own namespace key.
  Modules cannot accidentally overwrite another module's data.
  The "preferences" namespace is reserved for cross-module user settings.

Thread safety:

  All state managed by NATS JetStream.
  Concurrent operations are safe and independent.

Cache consistency:

  After Set and Delete operations, a directory cache refresh is
  replicated cluster-wide to keep all node caches consistent.
  This is fire-and-forget; transient broadcast failures are non-fatal.

Relationships

Module dependencies and interactions:

Directory: Provides user existence validation and group lookups. After Set/Delete, moduledata broadcasts RefreshUserCache to directory for cluster-wide cache consistency.
WebAuthn: Stores passkey credentials per user in “webauthn” namespace. Uses Get/Set for credential CRUD operations.
X.509: Stores X.509 enrollment and certificate data per user in the “x509” namespace.
signin: Stores sign-in flow state and user preferences. Uses Language field on Set to automatically store user language preference.
UI templates: Language preference from “preferences” namespace used for localized email rendering and UI template selection.
smtp: Looks up user language preference from “preferences” namespace for localized email delivery (OTP, cert renewal, passkey expiration).
cluster: Requires NATS JetStream (cluster_path configured). Data automatically replicated across cluster nodes.
telemetry: Metrics exported for operation counts and latency histograms. Structured logging with operation type, username (redacted), and traceID.

Logs

Log entries by component. Search with: logs search “moduledata”. Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Initialization:

  moduledata.init                    WARN          module_data_storage=ldap is deprecated and no longer supported; using hexon KV backend. Migrate existing module data to hexon KV before upgrading.
  moduledata.init                    WARN          cluster_path not set - module data may be lost on restart
  moduledata.init                    WARN          Persistent storage not enabled - module data will NOT survive restarts
  moduledata.init                    INFO          Module data storage initialized (hexon KV)

Get Operation:

  moduledata.get                     DEBUG         Getting module data
  moduledata.get                     ERROR         Backend.Get failed

Set Operation:

  moduledata.set                     INFO          Setting module data
  moduledata.set                     ERROR         Backend.Set failed
  moduledata.set.preferences         WARN          Failed to store language preference

Delete Operation:

  moduledata.delete                  INFO          Deleting module data
  moduledata.delete                  ERROR         Backend.Delete failed

GetAllForUser Operation:

  moduledata.getallforuser           DEBUG         Getting all module data for user
  moduledata.getallforuser           ERROR         Backend.GetAllForUser failed

LoadAll Operation:

  moduledata.loadall                 INFO          Loading all module data
  moduledata.loadall                 ERROR         Backend.LoadAll failed

Exists Operation:

  moduledata.exists                  ERROR         Backend.Exists failed

Hexon KV Backend — Get:

  moduledata.hexon.get               ERROR         PersistentGet failed
  moduledata.hexon.get               WARN          Unexpected value type in KV

Hexon KV Backend — Set:

  moduledata.hexon.set               ERROR         PersistentSet failed
  moduledata.hexon.set               DEBUG         Module data stored in Hexon KV

Hexon KV Backend — Delete:

  moduledata.hexon.delete            DEBUG         Key not found in Hexon KV (nothing to delete)
  moduledata.hexon.delete            ERROR         PersistentDelete failed
  moduledata.hexon.delete            DEBUG         Module data deleted from Hexon KV

Hexon KV Backend — GetAllForUser:

  moduledata.hexon.getallforuser     DEBUG         Retrieved all module data for user

Hexon KV Backend — LoadAll:

  moduledata.hexon.loadall           INFO          Loaded all module data from Hexon KV

Metrics

Prometheus metrics. Query with: metrics prometheus moduledata_<name>

Operations (namespace: moduledata):

  moduledata_operations_total                    counter    {operation, backend, result}   Operation count
    operation=get|set|delete|getallforuser|loadall|exists
    backend=hexon
    result=success|error
  moduledata_operation_duration                  latency    {operation, backend}            Operation duration
    operation=get|set|delete|getallforuser|loadall|exists
    backend=hexon

Notification Service

Routes alerts and events to Slack, Teams, Discord, PagerDuty, email, and custom webhooks

Overview

Routes operational events and alerts to configured notification channels — Slack, Teams, Discord, PagerDuty, email, and custom webhooks. Supports single events, digest notifications, and endpoint health checks. All notifications use template-driven payloads that can be customized per channel.

Core capabilities:

Multi-channel routing: email, Slack, Teams, Discord, PagerDuty, custom webhooks
Single event notifications (Send) with subject, body, and severity
Digest notifications (SendDigest) batching multiple results into one message
Five builtin webhook payload formats: generic, slack, teams, discord, pagerduty
Custom Go text/template payloads with json, severityColor, severityEmoji, and severityPD helpers
Partial success model: Success=true if at least one endpoint delivers
Branded HTML email templates rendered via the render module
Plain text fallback when render module is unavailable
Targeted routing: empty Webhook sends to all, “email” for email only, or a specific webhook name for single-target delivery
Health checking for all configured notification endpoints

Routing logic for the Webhook field:

  - "" (empty): broadcast to all enabled channels (email + all webhooks)
  - "email": send to email channel only (requires To field)
  - "<name>": send to the named webhook only (e.g., "slack-ops")
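The routing rules above amount to a small decision function. The following Go sketch is illustrative only: Config and Resolve are hypothetical names, and skipping the email channel on a broadcast when no To address is given is an assumption rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, reduced view of the [notify] settings.
type Config struct {
	EmailEnabled bool
	Webhooks     []string // configured webhook names
}

// Resolve returns the channels a request would be delivered to, based on
// the Webhook field and the To address.
func Resolve(cfg Config, webhook, to string) ([]string, error) {
	switch webhook {
	case "":
		// Broadcast: email (when usable) plus every configured webhook.
		var out []string
		if cfg.EmailEnabled && to != "" {
			out = append(out, "email")
		}
		out = append(out, cfg.Webhooks...)
		return out, nil
	case "email":
		if !cfg.EmailEnabled {
			return nil, errors.New("email channel is disabled")
		}
		if to == "" {
			return nil, errors.New("email routing requires the To field")
		}
		return []string{"email"}, nil
	default:
		// Targeted delivery: names are compared case-sensitively.
		for _, name := range cfg.Webhooks {
			if name == webhook {
				return []string{name}, nil
			}
		}
		return nil, fmt.Errorf("unknown webhook %q", webhook)
	}
}

func main() {
	cfg := Config{EmailEnabled: true, Webhooks: []string{"slack-ops", "teams-infra"}}
	fmt.Println(Resolve(cfg, "", "ops@example.com"))
	fmt.Println(Resolve(cfg, "slack-ops", ""))
	fmt.Println(Resolve(cfg, "email", ""))
}
```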

Email delivery chain:

  1. Notify module requests email rendering with template + data
  2. Render module loads the appropriate notification and digest templates
  3. Rendered HTML and plain text forwarded to SMTP module as multipart
  4. Fallback: if render unavailable, plain text auto-wrapped in <pre> tags

Webhook payload formats:

  - generic: flat JSON with subject, body, severity, username, hostname, timestamp
  - slack: Block Kit with header, severity emoji, code block body, context footer
  - teams: Adaptive Card v1.4 with TextBlock, FactSet, monospace body
  - discord: Embed with severity color mapping, code block body, footer
  - pagerduty: Events API v2 with routing_key from Metadata, severity mapping

Config

Notification configuration under [notify] section:

[notify]
  digest_window = "5m"              # Window for batching digest items

[notify.email]
  enabled = true                    # Enable email notifications (uses [smtp] config)

[[notify.webhooks]]
  name = "slack-ops"                # Webhook name (used for targeted routing)
  url = "https://hooks.slack.com/services/T00/B00/XXX"  # Webhook endpoint URL
  format = "slack"                  # Payload format: generic, slack, teams, discord, pagerduty
  timeout = "10s"                   # Request timeout (default: 10s)

[[notify.webhooks]]
  name = "teams-infra"
  url = "https://outlook.office.com/webhook/XXX"
  format = "teams"
  timeout = "15s"

[[notify.webhooks]]
  name = "pagerduty-critical"
  url = "https://events.pagerduty.com/v2/enqueue"
  format = "pagerduty"

[[notify.webhooks]]
  name = "custom-endpoint"
  url = "https://api.internal/alerts"
  format = "generic"                # Base format (overridden by body_template)
  content_type = "application/json" # Custom content type
  body_template = '{"alert": "{{json .Subject}}", "detail": "{{json .Body}}"}'
  [notify.webhooks.headers]
  Authorization = "Bearer token123" # Custom headers sent with every request

Template functions available in custom body_template:

  {{json .Field}}               - JSON-escape a string (quotes, backslashes, control chars)
  {{severityColor .Severity}}   - Map severity to Discord embed color (int)
  {{severityEmoji .Severity}}   - Map severity to Slack emoji string
  {{severityPD .Severity}}      - Map severity to PagerDuty severity level

Template variables (TemplateData fields):

  .Subject, .Body, .Severity, .Username, .Hostname, .Timestamp,
  .Metadata (map[string]string), .Items (digest), .ItemCount (digest)

Email template variables (passed to render module):

  Subject, Body, Severity, SeverityLabel, Username, Hostname,
  Timestamp, Disclaimer, Items (digest), ItemCount (digest)

The HTMLBody field on SendRequest bypasses template rendering entirely, allowing callers to provide custom HTML content. Email requires the To field to be set (recipient address).

Hot-reloadable: webhook URLs, formats, headers, timeouts, email enabled. Cold (restart required): none (all notify config is hot-reloadable).

Troubleshooting

Common symptoms and diagnostic steps:

Webhook not receiving notifications:

  - Run 'notify health' to check connectivity to all endpoints
  - Verify webhook URL is correct and accessible from the Hexon server
  - Check webhook name matches exactly (case-sensitive) when using targeted routing
  - Verify format is one of: generic, slack, teams, discord, pagerduty
  - Check timeout setting (default 10s) is sufficient for the endpoint
  - For custom endpoints, verify content_type and body_template are valid
  - Check custom headers (Authorization, API keys) are correct

Email notifications not being delivered:

  - Verify [notify.email] enabled = true
  - Check SMTP module health: 'smtp health'
  - Verify To field is set on the SendRequest
  - Check render module is available for template rendering
  - If render unavailable, plain text fallback should still work
  - Check spam folder - configure SPF/DKIM/DMARC for production

Partial success (some channels fail, others succeed):

  - This is expected behavior with the partial success model
  - Check resp.Failed[] for endpoints that failed and resp.Sent[] for successes
  - resp.Error contains a summary of failures
  - Individual endpoint failures do not block other deliveries
  - Success=true means at least one endpoint delivered successfully

Digest notifications empty or missing items:

  - Verify Items array is populated in SendDigestRequest
  - Each DigestItem needs TaskID and Description at minimum
  - Check digest_window setting if items seem to be batched incorrectly

Custom template errors:

  - Templates are parsed and validated at config load time
  - Template execution errors prevent notification delivery
  - Use {{json .Field}} for all string interpolation to prevent JSON injection
  - Check template syntax matches Go text/template format
  - Verify all referenced fields exist in TemplateData

Channel-specific formatting issues:

  - Slack format uses Block Kit (verify workspace supports it)
  - Teams format uses Adaptive Card v1.4 (verify connector version)
  - Discord embeds have 4096 character limit for description
  - PagerDuty requires routing_key in Metadata map

Test notifications:

  - Use 'notify test <webhook-name>' to test a specific webhook
  - Use 'notify test email <address>' to test email delivery
  - Use 'notify list' to see all configured endpoints

Security

Security considerations for notification delivery:

Webhook URL protection:

  Webhook URLs are marked as sensitive in configuration and are not exposed
  in config dumps or diagnostic output. Only HTTPS URLs are recommended for
  production deployments. HTTP is allowed for internal or testing endpoints.

Authentication headers:

  Webhook headers (Authorization, API keys) are stored in configuration.
  Headers are sent with every webhook request to the endpoint. Consider
  using environment variable references for secrets in production.

Custom template safety:

  Templates are parsed and validated at config load time to catch syntax
  errors early. Always use {{json .Field}} for string interpolation in
  custom templates to prevent JSON injection attacks. Template execution
  errors are returned and the notification is not sent (fail-safe).

HTML content handling:

  Email body text is HTML-escaped when auto-converting plain text to HTML.
  The HTMLBody field content is sent as-is without sanitization - callers
  are responsible for ensuring HTML content is safe.
  The render module handles template escaping for branded email templates.

Credential exposure prevention:

  SMTP credentials are handled by the smtp module (never exposed by notify).
  Webhook URLs with embedded tokens (Slack, Teams) are redacted in logs.
  Error messages from failed webhook deliveries do not include full URLs.

Relationships

Module dependencies and interactions:

smtp: Email delivery backend. Notify sends emails through the SMTP module for all email notifications. SMTP configuration ([smtp] section) determines the mail server, credentials, and encryption mode.
UI templates: Email template rendering. Notify renders branded HTML email templates cluster-wide. If template rendering is unavailable, plain text fallback is used.
Admin CLI: Exposes notify CLI commands (list, health, test) for management and diagnostics.
mcp: Notify operations available as MCP tools for AI-assisted operations. LLM and bastion AI assistant can send notifications through MCP tools.
config: All notification settings are hot-reloadable. Webhook URLs, formats, headers, timeouts, and email enabled flag can be changed without restart.
telemetry: Structured logging for notification delivery with endpoint name, delivery status, and latency. Metrics for send counts and failures.
Rate limiting: Callers should apply rate limiting in HTTP handlers to prevent notification flooding. Notify module itself does not rate limit.
Various callers: Any module can send notifications cluster-wide. Common callers include authentication modules (login anomalies), certificate management (renewal notifications), and the bastion AI assistant.

Logs

Log entries emitted by the notify module. Search with: logs search “notify”. Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Send — single event delivery:

  notify.send.email_failed       WARN    Email notification failed
  notify.send.webhook_failed     WARN    Webhook notification failed
  notify.send.webhook_ok         DEBUG   Webhook notification sent
  notify.send.render_fallback    WARN    Email template rendering failed, using plain text fallback

Digest — batched digest delivery:

  notify.digest.email_failed     WARN    Digest email failed
  notify.digest.webhook_failed   WARN    Digest webhook failed
  notify.digest.render_fallback  WARN    Digest template rendering failed, using plain text fallback

Health check:

  notify.healthcheck             DEBUG   Health check completed

Metrics

Prometheus metrics emitted by the notify module:

  notify_sent_total          counter  {channel, result}  Incremented after each single-event delivery attempt.
                                                         channel=email|webhook, result=success|failure.
  notify_digest_sent_total   counter  {channel, result}  Incremented after each digest delivery attempt.
                                                         channel=email|webhook, result=success|failure.

Downstream metrics from related modules:

  - smtp_send_total (from smtp module) — covers email delivery outcomes
  - render_email_total (from render module) — covers template rendering

Distributed Sessions

Manages sessions across all protocols — HTTP and SSH share the same session store with instant revocation

Overview

Manages sessions for every protocol the gateway handles — HTTP, SSH, and PoW. Replaces per-service session stores with one cluster-wide store that supports instant revocation across all protocols. Disable a user once — every session terminates cluster-wide. Supports:

Unique session IDs (crypto/rand UUID v4, base64url-encoded, 256-bit) or custom IDs (e.g., SHA256 hash)
Dual-key indexing: primary by session ID, secondary by type+module_key
Automatic TTL expiration managed by distributed memory storage
Saga-based atomic session+index creation with rollback on failure
Pluggable extend validators (e.g., X.509 certificate revocation checks)
Pluggable create callbacks (e.g., post-create notifications)
Pluggable delete callbacks (e.g., per-type resource cleanup)
Session ID regeneration for session fixation protection
Lazy index cleanup on GetByModuleKey (handles missed OnDelete callbacks)
Thread-safe callback/validator registration (RWMutex)
Metrics: sessions_created, validations_success, validations_failed, sessions_extended, sessions_revoked, sessions_bulk_revoked, sessions_regenerated, activity_persisted

Available operations:

  Create         - Create session with atomic dual-key indexing
  Validate       - Validate session, update LastActivity (does NOT extend TTL)
  Extend         - Extend TTL (runs validators first, caps to cert_not_after for X.509)
  Revoke         - Delete single session (index cleaned automatically)
  RevokeAll      - Delete all sessions for a type+module_key
  List           - List all sessions of a given type (filters expired)
  GetByModuleKey - Reverse lookup by type+module_key with lazy cleanup
  RegenerateID   - New ID with same data (session fixation protection)

Session types in use:

  user              - Authenticated user sessions (web login, OIDC callback, X.509 auto-auth)
  bastion           - SSH bastion connection tracking
  cobrowse          - Proxy co-browse viewer sessions
  password_expired  - Temporary session for password change flow (short TTL)
  mfa_pending       - MFA verification pending (short TTL)
  flow_pending      - Signup/enrollment flow pending
  jit2fa_pending    - JIT 2FA OTP verification pending
  jit2fa_auth       - JIT 2FA authenticated session
  pow               - Proof-of-Work challenge session
  bearer_cache      - JWT Bearer token verification cache (custom ID = SHA256 of token)

Memory usage: ~600 bytes per active session (500 bytes session + 100 bytes index entry). For 1 million active sessions: ~600 MB cluster-wide.

Config

Sessions have no dedicated [sessions] config section. TTL and cookie settings are controlled by the calling module via [service] and per-feature config:

[service]
  cookie_name = "hexon"                # Default session cookie name (default: "hexon")
  cookie_domain = ".example.com"       # Cookie domain for cross-subdomain sharing (default: current hostname only)
  cookie_ttl = "12h"                   # Default session cookie TTL (default: "12h")
  session_ttl = "24h"                  # Authenticated user session TTL (default: "24h")
  session_password_expired = "15m"     # Password expired session TTL (default: "15m")
  session_mfa_pending = "5m"           # MFA pending session TTL (default: "5m")
  max_concurrent_sessions = 1          # Max concurrent sessions per user (default: 1, 0=unlimited)

[jit2fa]
  cookie_name = "jit2fa_key"           # Cookie name for JIT 2FA sessions (default: "jit2fa_key")
  session_ttl = "8h"                   # JIT 2FA authenticated session TTL (default: "8h")

[forward_proxy]
  session_cookie = "hexon_session"     # Forward proxy session cookie name (default: "hexon_session")

[protection]
  pow_cookie_name = "hexon_pow"        # PoW session cookie (default: "hexon_pow", MUST differ from session cookie)
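
Because the PoW cookie must not share a name with the session cookie, a small pre-deployment check can help. The following Python sketch assumes the configuration has already been parsed into nested dictionaries; the function name and the choice of which three settings to compare are illustrative assumptions, and the defaults are the ones listed above.

```python
# Cookie-name settings compared by this sketch, with their documented defaults.
DEFAULTS = {
    ("service", "cookie_name"): "hexon",
    ("jit2fa", "cookie_name"): "jit2fa_key",
    ("protection", "pow_cookie_name"): "hexon_pow",
}


def cookie_name_conflicts(cfg: dict) -> list:
    """Return (name, setting_a, setting_b) for every cookie name used twice."""
    seen = {}
    conflicts = []
    for (section, key), default in DEFAULTS.items():
        name = cfg.get(section, {}).get(key, default)
        label = f"[{section}].{key}"
        if name in seen:
            conflicts.append((name, seen[name], label))
        else:
            seen[name] = label
    return conflicts
```

An empty result means the three names are distinct; any tuple in the result names the two settings that collide.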

Recommended TTL values by session type:

  Interactive web sessions (user):    12-24 hours
  API tokens:                         30-90 days
  OAuth state:                        5-10 minutes
  MFA pending (mfa_pending):          5 minutes
  Password expired (password_expired): 15 minutes
  PoW/temporary tokens:               1-5 minutes
  JIT 2FA (jit2fa_auth):              8 hours
  Bastion:                            Caller-determined (bastion manager)
  Bearer cache (bearer_cache):        5 minutes (default, configurable via [proxy].bearer_cache_ttl)

TTL behavior:

  - Validate does NOT extend TTL but persists LastActivity when stale > sessionTTL/10 (clamped 1m–5m), fire-and-forget
  - Extend explicitly sets new TTL from current time, requires cluster broadcast
  - X.509 sessions: TTL capped to cert_not_after on both Create and Extend
  - Minimum effective storage TTL is 1 minute (enforced as floor)
  - Expired sessions are filtered out by List and GetByModuleKey
  - Storage-level TTL expiry triggers OnDelete callback for automatic index cleanup
  - TTLCapped field in Create/Extend responses indicates certificate-based capping
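
The throttle and capping rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the gateway's implementation: the function names are hypothetical, rejecting an already expired certificate with an exception is an assumption, and the 1-minute storage floor is left out because its interaction with certificate capping is not specified here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def activity_persist_threshold(session_ttl: timedelta) -> timedelta:
    """sessionTTL/10, clamped to the range 1 minute to 5 minutes."""
    low, high = timedelta(minutes=1), timedelta(minutes=5)
    return max(low, min(high, session_ttl / 10))


def cap_ttl(requested: timedelta, now: datetime,
            cert_not_after: Optional[datetime]) -> Tuple[timedelta, bool]:
    """Return (ttl, ttl_capped); the TTL never runs past cert_not_after."""
    if cert_not_after is None:
        return requested, False  # not an X.509 session, nothing to cap
    remaining = cert_not_after - now
    if remaining <= timedelta(0):
        # Assumption: an expired certificate is rejected outright.
        raise ValueError("certificate already expired")
    if remaining < requested:
        return remaining, True
    return requested, False
```

With the default 24h session TTL the threshold is clamped to 5 minutes, so LastActivity is written to storage at most once per 5 minutes per session.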

Troubleshooting

Common symptoms and diagnostic commands:

Session not persisting across requests:

  - Cookie domain mismatch: verify [service].cookie_domain includes all subdomains
  - Secure flag on non-HTTPS: cookies with Secure=true require HTTPS transport
  - SameSite=Strict blocking cross-origin: check if auth redirect crosses domains
  - Cookie name conflict: ensure cookie_name differs from pow_cookie_name and jit2fa cookie
  - max_concurrent_sessions exceeded: new session may evict previous one
  - Check: 'sessions list --user=<username>' to verify session exists in storage

Cross-node session loss (works on one node, fails on another):

  - JetStream KV replication lag: check cluster quorum status with 'status'
  - Saga partial failure: session created but index missing, or vice versa
  - Network partition: quorum requirement (>50% nodes) prevents writes during partition
  - Validate is local-only: session must be replicated to the validating node
  - Check: 'sessions show <session_id>' from multiple nodes to compare
  - Check: 'status' for cluster health and node connectivity

Premature session expiration:

  - TTL too short: check [service].session_ttl (default 24h) or caller-specific TTL
  - Clock skew between nodes: ensure NTP is running (chrony or systemd-timesyncd)
  - X.509 TTL capping: session capped to cert_not_after, verify certificate validity
  - TTLCapped=true in response indicates certificate-based cap was applied
  - Check: 'sessions show <session_id>' to compare ExpiresAt vs current time

Session extend rejected:

  - Extend validator rejecting: check 'logs --module=sessions --level=warn'
  - x509_revocation validator: certificate revoked (check OCSP/serial index)
  - Certificate already expired: X.509 sessions cannot extend past cert_not_after
  - Session not found: already expired or revoked before extend attempt
  - Check: 'logs --module=sessions --keyword=validator' for rejection details

Stale sessions appearing in index (ghost sessions):

  - OnDelete callback failed during network partition or node crash
  - GetByModuleKey performs lazy cleanup: stale entries removed on next lookup
  - Manual cleanup: 'sessions revoke <session_id>' for individual sessions
  - Bulk cleanup: 'sessions revoke-user <username>' to clear all user sessions

Session fixation concerns:

  - RegenerateID should be called after authentication or privilege escalation
  - RegenerateID atomically creates new ID with same data, revokes old session
  - Uses Saga: new session stored, index updated, old session deleted (with compensation)
  - Check: 'logs --module=sessions --keyword=regenerated' for regeneration events

Diagnostic commands:

  sessions list                   - List first 20 sessions (all types)
  sessions list --type=user       - List authenticated user sessions
  sessions list --user=alice      - List sessions for specific user
  sessions list --offset=20       - Paginate to next page
  sessions list --limit=50        - Show 50 sessions per page
  sessions show <session_id>      - Show full session details with metadata
  sessions revoke <session_id>    - Revoke a single session
  sessions revoke-user <username> - Revoke all sessions for a user
  diagnose user <username>        - Full access diagnostic including session info
  logs --module=sessions          - Session operation logs
  status                          - Cluster health (affects quorum operations)

Architecture

Dual-key storage strategy:

  Primary key:   sessions/{session_id}               -> Session object
  Secondary key: sessions_index/{type}/{module_key}   -> SessionIndex (list of session IDs)

  Uses '/' separator because NATS KV disallows ':' in key names.

Session lifecycle:

  1. Create: custom ID or 32 random bytes from crypto/rand (base64url) -> Saga(store session + update index)
     -> OnDelete callback registered for automatic index cleanup
     -> Create callbacks fired post-commit
     -> Replicated to cluster with quorum requirement (>50% nodes)
  2. Validate: Local read from memorystorage -> update LastActivity (local + throttled persist)
     -> Persists to storage when stale > sessionTTL/10 (clamped 1m–5m), fire-and-forget
     -> Does NOT extend TTL (explicit Extend call required for renewal)
  3. Extend: Load session -> run all registered validators in sequence
     -> Cap to cert_not_after for X.509 -> broadcast with quorum, OnDelete preserved
  4. Revoke: Replicated delete to all nodes -> callback fires -> index cleaned
  5. RevokeAll: Two-stage with cluster-wide safety net. The fast path
     uses the per-user index; the safety-net path scans the cache for
     any sessions the index missed (covers stale-index cases on peer
     nodes after partition or partial replication). Both paths confirm
     each delete reached at least one node before counting it.
     Bounded at 10000 sessions per call; truncations are recorded as a
     metric. Requires memory.cold_enabled=true for the safety net to
     be effective; when disabled, a throttled warning per affected
     user surfaces the gap.
  6. RegenerateID: Saga(store new session + update index + delete old session)
     -> Preserves original CreatedAt timestamp, copies all metadata
     -> Compensation: rollback new session if old session deletion fails

Delete reason audit:

  Every session deletion emits an audit log entry and a counter labelled
  with the cause: expired (TTL), revoked (single-session admin/user
  revoke), bulk (force-logout / password-change), rotated (session
  regeneration after privilege change), saga_rollback (creation failed
  mid-flow). Audit emission is once-per-cluster. Operators can alert on
  reason=bulk spikes for force-logout activity, on reason=expired
  baselines for TTL behaviour, etc.

  Counter: sessions_deleted_total{type, reason}
  Audit log: sessions.delete (LevelInfo, AsAudit). Fields include
    session_type, reason, expires_at; correlates to begin/extend
    audit lines via trace_id.

Saga operations (atomic multi-step with rollback):

  - Create: Step 1 store session (compensate: delete), Step 2 update index
  - RegenerateID: Step 1 store new (compensate: delete), Step 2 add to index,
    Step 3 delete old (compensate: restore old session with TTL and OnDelete callback)
  - Saga commit marks success; saga finalization defers cleanup/rollback
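
The compensation pattern for Create can be sketched with in-memory dictionaries standing in for the distributed store. This is a simplified illustration, not the module's saga engine: the function, the SagaError class, and the fail_index test hook are all hypothetical.

```python
class SagaError(Exception):
    """Raised when a saga step fails (hypothetical error type)."""


def create_session(store: dict, index: dict, session_id: str, session: dict,
                   fail_index: bool = False) -> None:
    """Two-step create: store the session, then update the index.

    If step 2 fails, step 1 is undone so no orphaned session remains.
    """
    compensations = []
    try:
        # Step 1: store the session under its primary key.
        primary = f"sessions/{session_id}"
        store[primary] = session
        compensations.append(lambda: store.pop(primary, None))

        # Step 2: add the ID to the secondary index.
        if fail_index:  # test hook standing in for a storage error
            raise SagaError("index update failed")
        index_key = f"sessions_index/{session['type']}/{session['module_key']}"
        index.setdefault(index_key, []).append(session_id)
    except Exception:
        # Roll back completed steps in reverse order, then re-raise.
        for undo in reversed(compensations):
            undo()
        raise
```

Registering each compensation right after its step succeeds keeps the rollback list in sync with what actually happened.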

Index consistency model:

  - Automatic cleanup: OnDelete callback removes session_id on TTL expiry or manual delete
  - Lazy cleanup: GetByModuleKey validates each session in index, removes stale entries
  - Saga atomicity: Create and RegenerateID use compensating transactions
  - Delete callbacks execute even if index removal fails (resource cleanup not blocked)
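
Lazy cleanup can be pictured with the following Python sketch, again using plain dictionaries in place of the distributed cache. The function name, the expires_at field, and the use of Unix timestamps are assumptions made for the example.

```python
import time
from typing import Optional


def get_by_module_key(store: dict, index: dict, session_type: str,
                      module_key: str, now: Optional[float] = None) -> list:
    """Return live sessions for type+module_key and prune stale index entries."""
    now = time.time() if now is None else now
    index_key = f"sessions_index/{session_type}/{module_key}"
    live_ids, sessions = [], []
    for session_id in index.get(index_key, []):
        session = store.get(f"sessions/{session_id}")
        if session is None or session["expires_at"] <= now:
            continue  # stale entry: session is missing or expired
        live_ids.append(session_id)
        sessions.append(session)
    if live_ids:
        index[index_key] = live_ids  # write back only the live IDs
    else:
        index.pop(index_key, None)   # drop the index once it is empty
    return sessions
```

The lookup therefore repairs the index as a side effect, which is why ghost entries disappear on the next read even when a delete callback was missed.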

Cluster behavior:

  Sessions (Create/Extend): Replicated with quorum (>50% nodes must confirm)
  Indices (Create/RegenerateID): Replicated with quorum (consistency required)
  Validate: Local read + throttled fire-and-forget broadcast when LastActivity stale (sessionTTL/10, clamped 1m–5m)
  Revoke/RevokeAll: Replicated to all nodes (eventual consistency acceptable)
  OnDelete callbacks: Local execution per node, fire-and-forget, independent of cluster
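
The quorum rule (more than 50% of nodes must confirm) reduces to a one-line check. The helper name below is hypothetical.

```python
def has_quorum(acks: int, cluster_size: int) -> bool:
    """True when strictly more than half of the nodes confirmed the write."""
    return 2 * acks > cluster_size
```

In a 4-node cluster split 2/2 by a partition, neither side reaches quorum, so Create and Extend fail on both sides until connectivity returns.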

Callback and validator architecture:

  ExtendValidator: called BEFORE extend, CAN reject (returning error rejects extension)
    Built-in: x509_revocation (checks cert revocation via OCSP cache and serial index)
    For internal certs: checks serial index and moduledata
    For external certs: checks OCSP cache and responder (soft-fails on infra errors)
  CreateCallback: called AFTER successful create, fire-and-forget with panic recovery
  DeleteCallback: called AFTER delete and index cleanup, fire-and-forget with panic recovery
  Registration: thread-safe via RWMutex, map copied under read lock before execution
  Execution: sequential, each callback wrapped in defer/recover, panics logged not propagated
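
The registration and execution rules above can be sketched as follows. This is an analogy in Python rather than the module's code: the class is hypothetical, a plain threading.Lock stands in for the RWMutex (the Python standard library has no reader/writer lock), and exception handling stands in for defer/recover.

```python
import logging
import threading

log = logging.getLogger("sessions.callback")


class CallbackRegistry:
    """Hypothetical registry: thread-safe registration, isolated execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._callbacks = {}

    def register(self, name, fn):
        with self._lock:
            self._callbacks[name] = fn

    def fire(self, session):
        """Run every callback in order; a failing callback never stops the rest."""
        with self._lock:
            snapshot = dict(self._callbacks)  # copy under the lock, run outside it
        failed = []
        for name, fn in snapshot.items():
            try:
                fn(session)
            except Exception:
                log.exception("callback %s failed", name)  # logged, not propagated
                failed.append(name)
        return failed
```

Copying the map before execution means a slow callback never blocks registration, and a callback that registers another callback cannot deadlock.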

Performance:

  Direct lookup (Validate): O(1) by session ID, local read only
  Reverse lookup (GetByModuleKey): O(1) index lookup + O(n) session loads
  List all of type: O(N) scan of all sessions in storage, filtered by type
  Typical sessions per user: 1-5 (bounded by max_concurrent_sessions)
  Session object: ~500 bytes with metadata, index entry: ~100 bytes per reference

Security:

  Session IDs: 256-bit crypto/rand, base64url (RawURLEncoding), no padding
  Collision probability: ~2^-197 for 1 billion sessions (birthday bound n^2 / 2^257)
  X.509 TTL capping: cert_not_after metadata enforced on Create and Extend
  Revocation: instant via Revoke/RevokeAll (stateful, no blacklist needed)
  Session fixation: RegenerateID for post-authentication ID rotation
  Metadata privacy: plaintext module_keys for lookup (hash sensitive identifiers)
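
For readers who want to check collision figures for other populations or ID widths, the birthday approximation p ≈ n^2 / 2^(b+1) is easy to evaluate in log form. The helper below is a hypothetical convenience, and the approximation only holds while p is small.

```python
import math


def collision_log2(num_ids: int, id_bits: int) -> float:
    """log2 of the birthday-bound collision probability n^2 / 2^(bits+1)."""
    return 2 * math.log2(num_ids) - (id_bits + 1)
```

Calling collision_log2(10**9, 256) evaluates the bound for one billion 256-bit IDs; passing 122 instead of 256 shows how much weaker a standard random UUID would be.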

Type registration:

  All request/response types registered for cluster RPC serialization during init.

Interpreting tool output:

  'sessions list':
    Normal: Active sessions show User, Type, IP, Age — all expected
    Stale: Sessions with Age > max_session_duration — cleanup may be delayed (runs every 5m)
    Types: "authenticated" (normal), "mfa_pending" (waiting for MFA, 5min TTL), "password_expired"
    High count: Many sessions for one user → check max_concurrent_sessions setting
    Action: Suspicious session → 'sessions show <id>' for details, 'sessions revoke <id>' to terminate

  'sessions list --user=<username>':
    Empty: User has no active sessions — they are not logged in anywhere
    Multiple types: "authenticated" + "mfa_pending" = user may be stuck in MFA flow
    Action: Clear stuck MFA → 'sessions revoke-user <username>' (terminates ALL sessions)

Relationships

Module dependencies and interactions:

Distributed memory cache: KV store backend. Sessions stored in “sessions” cache type, indices in “sessions_index” cache type. Provides TTL expiration and OnDelete callbacks. All session CRUD operations delegate to the distributed cache.
proxy: Creates “user” sessions during OIDC SSO callback. Validates sessions on every proxied request for authentication enforcement. Session group monitor refreshes group membership and revokes sessions on group changes. Creates “cobrowse” sessions for co-browse viewer tracking. Creates “bearer_cache” sessions to cache JWT ID token verifications (SHA256 of token as custom session ID, configurable TTL).
signin: Creates “user” sessions after successful authentication, “password_expired” sessions for password change flow, “mfa_pending” sessions for MFA verification.
signup: Creates “flow_pending” sessions during enrollment, “mfa_pending” during TOTP/passkey setup, “user” sessions after completed registration.
bastion: Creates “bastion” sessions for SSH connection tracking. Session metadata includes connection details for audit trail and session sharing features.
authentication.x509: Registers x509_revocation extend validator. Checks certificate revocation status before allowing session extension. Sets cert_not_after metadata for TTL capping on both Create and Extend operations.
authentication.jit2fa: Creates “jit2fa_pending” and “jit2fa_auth” sessions with separate cookie (jit2fa_key) and configurable TTL (default 8h).
passwordchange: Validates “user” and “password_expired” session types. Creates new “user” session after successful password change. Triggers revocation of old sessions.
pow: Creates “pow” sessions after successful proof-of-work challenge. Uses separate cookie (hexon_pow) to avoid conflicts with main session cookie.
profile: Creates “user” sessions during profile management operations.
Directory: Group membership changes can trigger session revocation via proxy session monitor. Provides fresh group lookups for per-request authorization.
middleware (handlers): Creates “user” sessions during X.509 auto-authentication in the middleware chain when client certificate is present.
telemetry: All operations log with structured entries including trace IDs and security context (session ID, username). Levels: Error (storage/saga failures), Warn (not found, expired, validator rejections), Info (create/revoke events), Debug (normal validate/extend operations).
metrics: Runtime counters for all session operations (created, validated, extended, revoked, bulk_revoked, regenerated, validation failures by reason).
config ([service]): Provides default TTL values, cookie configuration, and max_concurrent_sessions limit. No dedicated [sessions] config section; TTL policies are caller-determined (each module passes its own TTL to Create).

Logs

Log entries by operation. Search with: logs search “sessions”. Levels: ERROR > WARN > INFO > DEBUG.

Session Create:

  sessions.create         INFO          Session created (type, module_key, TTL)
  sessions.create         WARN          TTL capped to certificate validity / DurableKV not available
  sessions.create         ERROR         Failed to generate ID / store session / update index

Session Validate:

  sessions.validate       DEBUG         Session validated (type, module_key)
  sessions.validate       ERROR         Invalid session type in storage

Session Extend:

  sessions.extend         DEBUG         Session TTL extended
  sessions.extend         WARN          Extension rejected by validator / cert expired / TTL capped

Session Revoke:

  sessions.revoke         INFO          Session revoked
  sessions.revoke         WARN          Failed to broadcast deletion
  sessions.revoke_all     INFO          All sessions revoked for module_key

Session Regenerate:

  sessions.regenerate     INFO          Session ID regenerated successfully
  sessions.regenerate     WARN          Session not found for regeneration
  sessions.regenerate     ERROR         Fetch/generate/store/index/delete failures

Activity Tracking:

  sessions.persist_activity ERROR       Panic recovered persisting LastActivity

Callbacks & Validators:

  sessions.validator      INFO          Session extend validator registered
  sessions.callback       INFO          Session create/delete/delete_v2 callback registered
  sessions.callback       ERROR         Callback panicked (create/delete/delete_v2)

Index:

  sessions.index          DEBUG         Index cleanup / session removed / index deleted

Metrics

Prometheus metrics. Query with: metrics prometheus sessions_<name>

Lifecycle:

  sessions_sessions_created       counter    {type}                Sessions created
  sessions_sessions_revoked       counter    {}                    Sessions revoked (single)
  sessions_sessions_bulk_revoked  counter    {type}                Sessions bulk-revoked
  sessions_sessions_regenerated   counter    {type}                Session IDs regenerated
  sessions_sessions_extended      counter    {type}                Session TTLs extended
  sessions_activity_persisted     counter    {type}                Activity timestamps persisted

Validation:

  sessions_validations_success    counter    {type}                Successful validations
  sessions_validations_failed     counter    {reason}              Failed validations (storage_error, wait_error, not_found, invalid_type)

Alerts:

  rate(sessions_validations_failed{reason="not_found"}[5m]) > 50   High session-not-found rate (expired or stale cookies)
  rate(sessions_sessions_bulk_revoked[5m]) > 0                      Bulk revocation event (user disabled or password change)
  rate(sessions_validations_failed{reason="storage_error"}[5m]) > 0 Storage backend issues

SMTP Email Delivery

Sends emails for OTP codes, magic links, certificate notifications, and alerts — templated and localized

Overview

Handles all outbound email delivery for the gateway — OTP codes, magic links, certificate notifications, and alerts. Other modules request email delivery; this module handles connection management, templates, and localization. Supports SSL, STARTTLS, HTML/plain-text multipart, and file attachments.

Core capabilities:

Generic email sending with HTML and plain text multipart content
OTP (One-Time Password) emails for authentication flows
Certificate renewal notification emails with cert and CA bundle attached
Passkey expiration reminder emails with re-enrollment link
Health checks for SMTP server connectivity verification
Multi-part email composition with file attachments
Three encryption modes: SSL (port 465), STARTTLS (port 587), plain (port 25)
Multi-language email localization (en, es, fr, zh, ca)
Template rendering with branded HTML email templates
User language preference lookup for automatic localization
RFC 5321 compliant address validation (local ≤ 64, domain ≤ 255 chars)
RFC 5322 compliant headers (Date, Message-ID on every email)
RFC 8255 Content-Language header on templated emails
Message-ID in structured logs (success + failure) for MTA correlation
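
As a rough picture of what such a message looks like, the sketch below builds a plain-text plus HTML email with Date, Message-ID, and Content-Language headers using Python's standard email package. It is not the module's code; the function name, parameters, and the example.com Message-ID domain are placeholders.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def build_message(sender, recipient, subject, text, html,
                  language="en", msgid_domain="example.com"):
    """Build a multipart/alternative message with Date and Message-ID headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = formatdate(usegmt=True)
    msg["Message-ID"] = make_msgid(domain=msgid_domain)
    msg.set_content(text)                      # text/plain part
    msg.add_alternative(html, subtype="html")  # text/html part
    # Set last: set_content/add_alternative reset or move Content-* headers.
    msg["Content-Language"] = language
    return msg
```

Logging msg["Message-ID"] alongside the send result is what makes later correlation with relay MTA logs possible.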

Localization priority for templated emails:

  1. Language field explicitly set in the request
  2. User preference from stored preferences
  3. Default fallback to "en" (English)

Supported languages: en (English), es (Spanish), fr (French), zh (Chinese), ca (Catalan)
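
The priority order can be expressed in a few lines of Python. The function name is hypothetical, and the handling of an explicitly requested but unsupported language (falling straight back to English rather than to the stored preference) is an assumption based on the fallback behaviour described under Troubleshooting.

```python
SUPPORTED_LANGUAGES = ("en", "es", "fr", "zh", "ca")


def resolve_language(request_lang=None, user_pref=None):
    """Request language first, then stored preference, then English."""
    chosen = request_lang or user_pref or "en"
    # Assumption: an unsupported choice falls back to English.
    return chosen if chosen in SUPPORTED_LANGUAGES else "en"
```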

Config

SMTP configuration under [smtp] section:

[smtp]
  host = "smtp.gmail.com"           # SMTP server hostname (required)
  port = 587                        # SMTP server port (required)
  encryption = "starttls"           # Encryption mode: "ssl", "starttls", or "none"
  user = "noreply@example.com"      # SMTP authentication username
  password = "app-specific-password" # SMTP authentication password (sensitive)
  from = "noreply@example.com"      # Sender email address (From header)
  reply_to = "support@example.com"  # Reply-To header address (optional)
  name = "HexonAuth"                # Sender display name (optional)
  skip_tls = false                  # Skip TLS certificate verification (default: false)

skip_tls: Disables server certificate validation for SSL and STARTTLS modes.

  Logs a WARN on every send when enabled. Use only when the SMTP server presents
  an untrusted or hostname-mismatched certificate. NOT recommended for production.

Encryption modes:

  ssl (port 465): Direct TLS connection from the start.
  starttls (port 587): Plain connection upgraded to TLS. Recommended.
  none (port 25): Unencrypted. Not recommended for production.
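
For comparison, this is roughly how the three modes and skip_tls map onto Python's smtplib. It is a client-side sketch for testing a relay from a workstation, not the gateway's sender; the function names are hypothetical and the port defaults are the ones listed above.

```python
import smtplib
import ssl

DEFAULT_PORTS = {"ssl": 465, "starttls": 587, "none": 25}


def tls_context(skip_tls: bool = False) -> ssl.SSLContext:
    """Default verification, or no verification when skip_tls is set."""
    ctx = ssl.create_default_context()
    if skip_tls:
        ctx.check_hostname = False       # must be cleared before verify_mode
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def connect(host, encryption="starttls", port=None, skip_tls=False, timeout=10):
    """Open an SMTP connection using the requested encryption mode."""
    if encryption not in DEFAULT_PORTS:
        raise ValueError(f"unknown encryption mode: {encryption!r}")
    port = port or DEFAULT_PORTS[encryption]
    if encryption == "ssl":
        # TLS from the first byte.
        return smtplib.SMTP_SSL(host, port, timeout=timeout,
                                context=tls_context(skip_tls))
    client = smtplib.SMTP(host, port, timeout=timeout)
    if encryption == "starttls":
        client.starttls(context=tls_context(skip_tls))  # upgrade the plain connection
    return client
```

A mismatch between mode and port (for example ssl against port 587) typically shows up as a handshake error or a timeout rather than an authentication failure.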

Common SMTP provider configurations:

  Gmail:    host = "smtp.gmail.com", port = 587, encryption = "starttls"
            (requires App Passwords with 2FA enabled)
  SendGrid: host = "smtp.sendgrid.net", port = 587, encryption = "starttls"
            user = "apikey", password = "<sendgrid-api-key>"
  AWS SES:  host = "email-smtp.<region>.amazonaws.com", port = 587, encryption = "starttls"
  Mailgun:  host = "smtp.mailgun.org", port = 587, encryption = "starttls"

Hot-reloadable: all SMTP settings (host, port, encryption, credentials). Cold (restart required): none.

Troubleshooting

Common symptoms and diagnostic steps:

SMTP connection failures:

  - Check SMTP health: 'smtp health' tests connectivity and authentication
  - Verify host and port match encryption mode (SSL=465, STARTTLS=587)
  - Firewall blocking outbound: verify server can reach SMTP host:port
  - Network probe: 'net tcp <smtp-host>:<port> --tls' for SSL
  - DNS resolution: 'dns test <smtp-host>' to verify hostname resolves

Authentication failures:

  - Gmail: requires App Passwords (regular password won't work with 2FA)
  - SendGrid: user must be literal string "apikey", password is the API key
  - AWS SES: use SES SMTP credentials generated for an IAM user, not the account root credentials or a raw IAM secret access key
  - Check: 'config show smtp' to verify configuration (password redacted)

Emails not being received:

  - Check spam/junk folder at recipient mail provider
  - Verify from address matches authenticated user or authorized alias
  - Configure SPF, DKIM, and DMARC DNS records for sending domain
  - Test delivery: 'smtp test <to-address>' sends a test message
  - Check: 'notify health' for notification system status

OTP emails delayed or missing:

  - SMTP latency: 200-1000ms is normal, check 'smtp health'
  - OTP code expired before email arrived: check OTP validity window
  - Rate limiting by SMTP provider: check provider dashboard

Passkey expiration emails not sent:

  - Expired passkeys intentionally do not receive reminder emails
  - Verify DaysRemaining is positive (zero or negative triggers no email)

TLS certificate verification failures:

  - "STARTTLS failed: tls: failed to verify certificate" or "TLS dial failed"
  - Common cause: SMTP relay hostname differs from certificate CN/SAN
    (e.g., smtp.company.com forwards to smtp.gmail.com)
  - Temporary fix: set skip_tls = true in [smtp] config (logs WARN per send)
  - Proper fix: configure the SMTP server with a valid certificate matching its hostname
  - Check: 'net tls <smtp-host>:<port>' to inspect the certificate chain

Template rendering errors:

  - Missing locale: unsupported language falls back to English
  - Template rendering failures prevent email send (no fallback)

Address validation failures (RFC 5321):

  - "local part exceeds 64 character limit": email local part too long
  - "domain part exceeds 255 character limit": email domain too long
  - These limits are per RFC 5321 §4.5.3.1
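
The length checks amount to the following, shown here as a Python sketch. The function name is hypothetical and the first error message is invented for the example; the two length messages mirror the ones quoted above.

```python
def validate_address(address: str) -> None:
    """Raise ValueError when the address breaks the RFC 5321 length limits."""
    # Split on the last "@" so quoted local parts containing "@" stay intact.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("address must contain a local part and a domain")
    if len(local) > 64:
        raise ValueError("local part exceeds 64 character limit")
    if len(domain) > 255:
        raise ValueError("domain part exceeds 255 character limit")
```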

Correlating email delivery with MTA logs:

  - Every send logs Message-ID (success and failure)
  - Use Message-ID to trace through relay MTAs, bounce reports, DMARC feedback
  - Search gateway logs: 'logs search message_id=<value>'

Relationships

Module dependencies and interactions:

OTP authentication: Delivers one-time passwords for email-based auth.
Certificate management: Sends renewal notification emails with cert and CA bundle attachments.
WebAuthn/Passkey: Sends passkey expiration reminder emails.
Notification service: Uses SMTP for email channel delivery alongside webhooks for multi-channel routing.
Directory: User full name lookup for personalized email greetings.
Localization: Localized email text loaded from locale files.
Configuration: Reads [smtp] TOML section. All settings hot-reloadable.
Admin CLI: ‘smtp health’ and ‘smtp test’ commands for diagnostics.

Logs

Log entries emitted by the smtp module. Search with: logs search “smtp”. Levels: ERROR > WARN > INFO > DEBUG > TRACE. AUDIT = persisted to tamper-proof audit log.

PAT expiry callback (init):

  smtp.pat_expiry              INFO  AUDIT  Personal Access Token expired

TLS certificate warnings (sendViaSSL / sendViaSTARTTLS):

  smtp.send                    WARN         TLS certificate verification failed, retrying with skip_tls=true — not recommended for production, configure a valid certificate
  smtp.send                    WARN         STARTTLS certificate verification failed, retrying with skip_tls=true — not recommended for production, configure a valid certificate

Magic link validation (SendMagicLinkEmail):

  smtp.magiclink               WARN  AUDIT  Magic link email blocked — invalid sealed return URL

Skip notifications (SendPasskeyExpirationEmail / SendVPNPSKExpirationEmail):

  smtp.send                    DEBUG        Skipping email for expired passkey
  smtp.send                    DEBUG        Skipping email for expired PSK

Generic email (SendEmail):

  smtp.send                    ERROR        SMTP send failed
  smtp.send                    INFO         Email sent successfully

OTP email (SendOTPEmail):

  smtp.send                    ERROR        SMTP send failed
  smtp.send                    INFO         Email sent successfully

Certificate renewal email (SendCertRenewalEmail):

  smtp.send                    ERROR        SMTP cert renewal send failed
  smtp.send                    INFO         Certificate renewal email sent

Passkey expiration email (SendPasskeyExpirationEmail):

  smtp.send                    ERROR        SMTP passkey expiration send failed
  smtp.send                    INFO         Passkey expiration email sent

Magic link email (SendMagicLinkEmail):

  smtp.send                    ERROR        SMTP send failed
  smtp.send                    INFO         Magic link email sent

Test email (SendTestEmail):

  smtp.test                    ERROR        SMTP test email failed
  smtp.test                    INFO         SMTP test email sent

PAT created email (SendPATCreatedEmail):

  smtp.pat_created             ERROR        PAT creation notification email failed
  smtp.pat_created             INFO         PAT creation notification email sent

PAT revoked email (SendPATRevokedEmail):

  smtp.pat_revoked             ERROR        PAT revocation notification email failed
  smtp.pat_revoked             INFO         PAT revocation notification email sent

PAT expired email (SendPATExpiredEmail):

  smtp.pat_expired             ERROR        PAT expiration notification email failed
  smtp.pat_expired             INFO         PAT expiration notification email sent

Passkey created email (SendPasskeyCreatedEmail):

  smtp.passkey_created         ERROR        Passkey creation notification email failed
  smtp.passkey_created         INFO         Passkey creation notification email sent

Passkey revoked email (SendPasskeyRevokedEmail):

  smtp.passkey_revoked         ERROR        Passkey revocation notification email failed
  smtp.passkey_revoked         INFO         Passkey revocation notification email sent

TOTP created email (SendTOTPCreatedEmail):

  smtp.totp_created            ERROR        TOTP creation notification email failed
  smtp.totp_created            INFO         TOTP creation notification email sent

TOTP revoked email (SendTOTPRevokedEmail):

  smtp.totp_revoked            ERROR        TOTP revocation notification email failed
  smtp.totp_revoked            INFO         TOTP revocation notification email sent

Certificate created email (SendCertCreatedEmail):

  smtp.cert_created            ERROR        Certificate creation notification email failed
  smtp.cert_created            INFO         Certificate creation notification email sent

Certificate revoked email (SendCertRevokedEmail):

  smtp.cert_revoked            ERROR        Certificate revocation notification email failed
  smtp.cert_revoked            INFO         Certificate revocation notification email sent

Metrics

Prometheus metrics. Query with: metrics prometheus smtp_<name>

Email delivery:

  smtp_emails_sent_total             counter    {type, result}       Emails sent per type and outcome
  smtp_send_duration                 latency    {type, result}       Email send duration per type and outcome

Label values:

  type:   generic | otp | cert_renewal | passkey_expiration | vpn_enrollment |
          vpn_device_code | vpn_psk_expiration | magiclink | test |
          pat_created | pat_revoked | pat_expired |
          passkey_created | passkey_revoked |
          totp_created | totp_revoked |
          cert_created | cert_revoked
  result: success | failure

Note: Only core email types emit latency (generic, otp, cert_renewal, passkey_expiration, vpn_enrollment, vpn_device_code, vpn_psk_expiration, magiclink). All other types (test, pat_created, pat_revoked, pat_expired, passkey_created, passkey_revoked, totp_created, totp_revoked, cert_created, cert_revoked) emit the counter only, with no latency metric.

Alerts:

  rate(smtp_emails_sent_total{result="failure"}[5m]) > 5         SMTP delivery issues
  smtp_send_duration{quantile="0.99"} > 5s                       SMTP latency degradation

Persistent File Storage

Persistent on-disk storage for durable module data — supports shared NFS and per-node replication

Overview

Provides persistent, crash-safe file storage for modules that need durable on-disk data. Two deployment modes: shared (NFS) where all nodes see the same filesystem, and replicated (local) where each node maintains its own copy with broadcast synchronization.

Core capabilities:

Module-namespaced directories (each module gets isolated storage)
Atomic writes via temporary file + rename pattern (crash-safe)
Optional file locking via flock for NFS shared mode
JSON marshaling/unmarshaling for structured data
Full file lifecycle: Save, Load, Delete, Move, List, Exists
Path traversal protection with multi-layer validation
Fuzz-tested security boundary (traversal, null bytes, unicode attacks)
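
One common way to layer path validation is sketched below in Python. It is an illustration of the general technique, not the module's actual checks; the function name and the specific rules are assumptions.

```python
import os


def resolve_module_path(base_path: str, module: str, relative: str) -> str:
    """Return the absolute path for a module file, or raise ValueError."""
    # Layer 1: reject empty components and embedded null bytes.
    for part in (module, relative):
        if not part or "\x00" in part:
            raise ValueError("empty path or null byte")
    # Layer 2: reject absolute paths and module names that are not a single component.
    if os.path.isabs(relative) or os.sep in module or module in (".", ".."):
        raise ValueError("absolute path or invalid module name")
    # Layer 3: resolve symlinks and "..", then confirm containment.
    root = os.path.realpath(os.path.join(base_path, module))
    target = os.path.realpath(os.path.join(root, relative))
    if target == root or os.path.commonpath([root, target]) != root:
        raise ValueError("path escapes module directory")
    return target
```

Checking containment after resolution, rather than scanning the string for "..", also catches escapes through symlinks.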

Storage modes:

  Shared (NFS): All nodes see the same files. Operations are local only
  (no broadcast needed). File locking prevents race conditions between nodes.
  Example path: /shared/webauthn/passkeys/active/abc123.json

  Replicated (Local): Each node maintains its own filesystem. Write operations
  are replicated to all nodes. No locking needed since each node
  owns its local copy.
  Example path: /data/webauthn/passkeys/active/abc123.json

File permissions: 0644 (files), 0755 (directories). Module directories are created on demand during Save operations.

Config

Configuration under [filesystem]:

[filesystem]
  base_path = "/shared"    # Root directory for all module storage
  mode = "shared"          # "shared" (NFS) or "local" (replicated per node)
  use_flock = true         # Enable file locking (recommended for shared mode)

Mode selection guidance:

  shared: Use when all nodes mount the same NFS/distributed filesystem.
    - Set use_flock = true to prevent concurrent write races
    - Operations are local only (no cluster replication needed)
    - Simplest setup, but requires reliable NFS infrastructure

  local (replicated): Use when each node has independent local storage.
    - Write operations (Save, Delete, Move) are replicated to all nodes
    - Read operations (Load, List, Exists) are local only
    - No file locking needed (each node owns its storage)
    - More resilient to NFS failures, but only eventually consistent across nodes

Operation routing by mode:

  Shared mode:
    Save, Load, Delete, Move -> all execute locally (no cluster broadcast)

  Replicated mode:
    Save, Delete, Move -> replicated to all cluster nodes
    Load               -> local only (read from local storage)

Hot-reloadable: None. Changes to base_path, mode, or use_flock require restart.

Troubleshooting

Common symptoms and diagnostic steps:

File not found after Save (replicated mode):

  - Verify Save used cluster-wide replication (replicated mode requires it)
  - Check whether the querying node received the replicated write (a network partition can block it)
  - Replication is eventually consistent; expect a small delay before Load succeeds on other nodes
  - Verify base_path is correct on all nodes (must match across cluster)

Permission denied errors:

  - Check filesystem permissions: files need 0644, directories need 0755
  - Verify the hexon process user has write access to base_path
  - NFS export options: ensure no_root_squash or a correct uid/gid mapping
  - SELinux/AppArmor may block writes to NFS mounts

Path traversal error (ErrPathTraversal):

  - Module name contains '/', '\', or '..' (invalid characters)
  - Subpath starts with '/' or '\' (must be relative)
  - Subpath contains '..' traversal sequences after path cleaning
  - Resolved path escapes the module directory boundary
  - This is a security feature; do not attempt to bypass it
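
The checks listed above can be sketched in Go as follows. This is a minimal illustration, not the module's actual code: the function name resolvePath and its signature are hypothetical, and the NUL-byte rejection is an assumption based on the fuzz-testing note in the overview. Only ErrPathTraversal is a name taken from this document.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// ErrPathTraversal mirrors the error name used in this document.
var ErrPathTraversal = errors.New("path traversal")

// resolvePath is a hypothetical helper that applies the layered checks
// and returns the target path under basePath/module.
func resolvePath(basePath, module, subpath string) (string, error) {
	// Check 1: module name must not contain separators or "..".
	if module == "" || strings.ContainsAny(module, `/\`) || strings.Contains(module, "..") {
		return "", ErrPathTraversal
	}
	// Check 2: subpath must be relative (and, as an assumption, free of NUL bytes).
	if strings.HasPrefix(subpath, "/") || strings.HasPrefix(subpath, `\`) || strings.ContainsRune(subpath, 0) {
		return "", ErrPathTraversal
	}
	// Check 3: no ".." component may survive path cleaning.
	cleaned := filepath.Clean(subpath)
	for _, part := range strings.Split(filepath.ToSlash(cleaned), "/") {
		if part == ".." {
			return "", ErrPathTraversal
		}
	}
	// Check 4: the resolved path must stay inside the module directory.
	moduleDir := filepath.Join(basePath, module)
	target := filepath.Join(moduleDir, cleaned)
	rel, err := filepath.Rel(moduleDir, target)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", ErrPathTraversal
	}
	return target, nil
}

func main() {
	fmt.Println(resolvePath("/shared", "webauthn", "passkeys/active/abc123.json"))
	fmt.Println(resolvePath("/shared", "webauthn", "../acme/key.pem"))
}
```

Check 4 is redundant when checks 1 to 3 pass, which is the point of defense-in-depth: each layer still holds if an earlier one is weakened later.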

File locking issues (shared/NFS mode):

  - Stale locks after crash: flock is released on process exit by the OS
  - NFSv3 requires separate lock services: the lock daemon (lockd/statd) must be running on all nodes
  - NFSv4 has built-in locking and needs no separate lock daemon
  - Deadlock is unlikely: operations hold locks only briefly (JSON marshal + write + rename)
  - If use_flock = false on shared mode, concurrent writes may corrupt files

Atomic write failures:

  - Disk full: temporary file creation fails before rename
  - Cross-device rename: base_path and temp dir must be on same filesystem
  - Check disk space: df -h on the base_path partition
  - Temp file cleanup: orphaned .tmp files indicate interrupted writes

List operation returns empty:

  - Verify the subpath directory exists (directories created on Save only)
  - Check glob pattern syntax (uses filepath.Glob matching rules)
  - Pattern is matched against filenames only, not full paths
  - Module directory is base_path/module_name/subpath
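
To illustrate why a pattern must target filenames rather than full paths, here is a small Go sketch of a List-style helper. The names listFiles and makeDemoDir are hypothetical; the sketch uses filepath.Match, which follows the same pattern rules as filepath.Glob, and treats a missing directory as an empty result, which is an assumption drawn from the symptoms described above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listFiles is a hypothetical helper: it returns the names of regular
// entries in dir whose filename (not full path) matches pattern.
func listFiles(dir, pattern string) ([]string, error) {
	entries, err := os.ReadDir(dir) // sorted by filename
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil // directory not created yet: empty result, not an error
	}
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		ok, err := filepath.Match(pattern, e.Name())
		if err != nil {
			return nil, err // malformed pattern
		}
		if ok {
			names = append(names, e.Name())
		}
	}
	return names, nil
}

// makeDemoDir creates a throwaway directory with three files for the example.
func makeDemoDir() (string, error) {
	dir, err := os.MkdirTemp("", "hexon-list-demo-")
	if err != nil {
		return "", err
	}
	for _, name := range []string{"abc123.json", "def456.json", "notes.txt"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("{}"), 0o644); err != nil {
			return "", err
		}
	}
	return dir, nil
}

func main() {
	dir, err := makeDemoDir()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)
	names, err := listFiles(dir, "*.json")
	fmt.Println(names, err)
}
```

A pattern such as "active/*.json" would match nothing here, because each candidate is a bare filename with no directory component.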

Data corruption or invalid JSON:

  - Atomic writes prevent partial writes; corruption suggests disk issues
  - NFS cache coherence: mount with actimeo=0 for immediate consistency
  - Check for concurrent writes without flock enabled
  - Validate JSON: load the file directly and check for syntax errors

Architecture

Write path (Save operation):

  1. Validate path (module name + subpath traversal checks)
  2. Create module directory tree if needed (MkdirAll with 0755)
  3. Marshal data to JSON with indentation
  4. Create temporary file in same directory
  5. Write JSON content to temporary file
  6. Sync to disk (fsync)
  7. Atomic rename: tmp file -> target path
  8. Optional: acquire/release flock around steps 4-7 (shared mode)
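
Steps 2 through 7 of the write path can be sketched in Go as below. This is a minimal sketch under stated assumptions, not the module's implementation: saveJSON and demoSave are hypothetical names, the temp-file naming pattern is invented for illustration, and path validation (step 1) and flock (step 8) are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// saveJSON is a hypothetical helper that writes data to target atomically.
func saveJSON(target string, data any) error {
	dir := filepath.Dir(target)
	// Step 2: create the directory tree on demand.
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	// Step 3: marshal with indentation.
	buf, err := json.MarshalIndent(data, "", "  ")
	if err != nil {
		return err
	}
	// Step 4: temp file in the SAME directory, so the rename never crosses devices.
	tmp, err := os.CreateTemp(dir, filepath.Base(target)+".*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // cleans up on failure; harmless after a successful rename
	// Step 5: write the content.
	if _, err := tmp.Write(buf); err != nil {
		tmp.Close()
		return err
	}
	// Step 6: flush to disk before the rename makes the file visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// os.CreateTemp uses mode 0600; widen to the documented 0644.
	if err := os.Chmod(tmpName, 0o644); err != nil {
		return err
	}
	// Step 7: atomic rename onto the target path.
	return os.Rename(tmpName, target)
}

// demoSave writes one record into a throwaway directory and reads it back.
func demoSave() (string, error) {
	dir, err := os.MkdirTemp("", "hexon-save-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "webauthn", "passkeys", "active", "abc123.json")
	if err := saveJSON(target, map[string]string{"id": "abc123"}); err != nil {
		return "", err
	}
	raw, err := os.ReadFile(target)
	return string(raw), err
}

func main() {
	fmt.Println(demoSave())
}
```

Readers either see the old file or the complete new one, never a partial write, because the rename is the only step that touches the target path.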

Read path (Load operation):

  1. Validate path
  2. Read file contents (os.ReadFile)
  3. Unmarshal JSON into interface{}
  4. Return data with Found=true, or Found=false if file not found
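
The read path can be sketched the same way. The LoadResult type and the loadJSON and writeDemoFile names are hypothetical; only the Found flag and the use of os.ReadFile come from the steps above. Path validation is omitted.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// LoadResult is a hypothetical result shape carrying the Found flag.
type LoadResult struct {
	Data  any
	Found bool
}

// loadJSON reads and decodes a JSON file. A missing file is reported
// as Found=false rather than as an error.
func loadJSON(path string) (LoadResult, error) {
	raw, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return LoadResult{Found: false}, nil
	}
	if err != nil {
		return LoadResult{}, err
	}
	var data any
	if err := json.Unmarshal(raw, &data); err != nil {
		return LoadResult{}, fmt.Errorf("unmarshal %s: %w", path, err)
	}
	return LoadResult{Data: data, Found: true}, nil
}

// writeDemoFile creates a small JSON file in a throwaway directory.
func writeDemoFile() (string, error) {
	dir, err := os.MkdirTemp("", "hexon-load-demo-")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "abc123.json")
	return path, os.WriteFile(path, []byte(`{"id": "abc123"}`), 0o644)
}

func main() {
	path, err := writeDemoFile()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(filepath.Dir(path))
	fmt.Println(loadJSON(path))
	fmt.Println(loadJSON(path + ".missing"))
}
```

Because the value is decoded into an empty interface, JSON objects arrive as map[string]any and numbers as float64, so callers need a type assertion before using the data.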

File locking (shared mode only):

  Uses syscall.Flock with LOCK_EX (exclusive) for writes and LOCK_SH (shared)
  for reads. Locks are advisory and only effective when all accessors use flock.
  Lock scope is per-file, not per-directory.
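
A minimal Go sketch of advisory locking with syscall.Flock follows (Unix-like systems only). The helper names withFileLock and demoLock are hypothetical, and the sketch locks the data file itself; whether the module locks the target file or a separate lock file is not stated in this document.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// withFileLock is a hypothetical helper: it opens path, takes an advisory
// lock (syscall.LOCK_EX or syscall.LOCK_SH), runs fn, then unlocks.
func withFileLock(path string, how int, fn func(f *os.File) error) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	// Blocks until the lock is granted.
	if err := syscall.Flock(int(f.Fd()), how); err != nil {
		return err
	}
	// Deferred calls run last-in first-out: unlock happens before Close.
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn(f)
}

// demoLock writes under an exclusive lock, then reads under a shared lock.
func demoLock() (string, error) {
	dir, err := os.MkdirTemp("", "hexon-flock-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "abc123.json")

	err = withFileLock(path, syscall.LOCK_EX, func(f *os.File) error {
		_, werr := f.WriteString(`{"id":"abc123"}`)
		return werr
	})
	if err != nil {
		return "", err
	}

	var content []byte
	err = withFileLock(path, syscall.LOCK_SH, func(f *os.File) error {
		var rerr error
		content, rerr = io.ReadAll(f)
		return rerr
	})
	return string(content), err
}

func main() {
	fmt.Println(demoLock())
}
```

The lock only protects against other accessors that also call flock on the same file; a process that writes without taking the lock is not blocked.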

Module isolation:

  Each module's storage is confined to base_path/module_name/. Path validation
  ensures no module can read or write outside its own directory. The validation
  is defense-in-depth: multiple checks at different levels prevent escape.

Relationships

Module dependencies and interactions:

webauthn: Stores passkey credentials as JSON files. Uses shared mode for cross-node passkey availability. Files organized in active/revoked subdirectories.
acme (CA): Stores issued certificates, private keys, and ACME account data. Requires persistent storage that survives restarts.
config: Filesystem base_path and mode read from TOML configuration. No hot-reload; changes require restart.
telemetry: Structured logging for all file operations (save, load, delete, move) with module name, subpath, and error details.
memory (memorystorage): Complementary storage. Use filesystem for persistent data that must survive restarts; use memory for ephemeral data with TTL. Some modules use both: memory for fast lookups, filesystem for durable backup.
cluster: In replicated mode, cluster health affects write propagation. Node failures may result in missed broadcasts (eventually consistent).

Logs

Log entries emitted by this module. Search with: logs search "storage.filesystem". Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Save Operation:

  storage.filesystem                 WARN          Path traversal attempt blocked
  storage.filesystem                 ERROR         Failed to create directory
  storage.filesystem                 ERROR         Failed to marshal JSON
  storage.filesystem                 ERROR         Failed to save file
  storage.filesystem                 DEBUG         File saved

Load Operation:

  storage.filesystem                 WARN          Path traversal attempt blocked
  storage.filesystem                 DEBUG         File not found
  storage.filesystem                 ERROR         Failed to read file
  storage.filesystem                 ERROR         Failed to unmarshal JSON
  storage.filesystem                 DEBUG         File loaded

Delete Operation:

  storage.filesystem                 WARN          Path traversal attempt blocked
  storage.filesystem                 DEBUG         File not found for deletion
  storage.filesystem                 ERROR         Failed to delete file
  storage.filesystem                 DEBUG         File deleted

Move Operation:

  storage.filesystem                 WARN          Path traversal attempt blocked (source)
  storage.filesystem                 WARN          Path traversal attempt blocked (target)
  storage.filesystem                 ERROR         Failed to create target directory
  storage.filesystem                 ERROR         Failed to move file
  storage.filesystem                 DEBUG         File moved

List Operation:

  storage.filesystem                 WARN          Path traversal attempt blocked
  storage.filesystem                 DEBUG         Directory not found
  storage.filesystem                 ERROR         Failed to read directory
  storage.filesystem                 DEBUG         Directory listed

Exists Operation:

  storage.filesystem                 WARN          Path traversal attempt blocked
  storage.filesystem                 DEBUG         File existence checked

Metrics

No Prometheus metrics are emitted by this module.

Distributed Memory Storage

Ephemeral key-value storage shared across all nodes — used by sessions, OTP, PoW, and tokens

Overview

Provides fast, in-memory key-value storage replicated across all cluster nodes with automatic TTL expiration. Used by sessions, OTP codes, PoW challenges, OIDC tokens, and other time-sensitive data that needs cluster-wide availability. Data expires automatically — no manual cleanup required.

Core capabilities:

Namespace-isolated caches (cache types prevent key collisions)
Automatic TTL-based expiration with background eviction every 30 seconds
OnSet and OnDelete callback support (fire-and-forget, local only)
Thread-safe operations with mutex protection
Cluster-wide replication (writes replicated to all nodes)
Eventually consistent reads (local only, no network overhead)
NATS JetStream KV persistence for crash recovery (optional)
Peer-to-peer bootstrap fallback when JetStream unavailable
Two-tier hot/cold cache for large-scale deployments (30M+ users)
SetNX for atomic set-if-not-exists (distributed locks)
Touch for TTL renewal without value modification

Consistency model:

  Reads (Get): Local first, O(1). With cold_enabled=true, falls through to KV on miss (~1ms).
  Reads (All): Local only, O(n) in the number of hot entries. Returns hot entries only (no KV scan in cold mode).
  Writes (Set, Delete): Local immediate + optional replication to all nodes
  Writes are best-effort with no quorum requirement by default.
  For strong consistency, use cluster-wide replication with quorum confirmation.

Storage architecture: two-level map structure

  caches[cache_type][key] -> storageEntry with Value, Expiration, Callbacks

Data types stored in memory must be compatible with the cluster serialization layer. Custom structs, slices, and maps with custom types are supported.
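
The two-level map and the "expired entries are invisible to Get" behavior can be sketched as below. This is an illustration only: the store type, its methods, and the advance helper (a clock offset used to demonstrate expiry without sleeping) are hypothetical, and callbacks, replication, and persistence are left out. Only the caches[cache_type][key] layout and the storageEntry name come from this document.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// storageEntry holds a value and its expiry; the zero time means no expiration.
type storageEntry struct {
	Value      any
	Expiration time.Time
}

// store is a hypothetical two-level cache: caches[cacheType][key].
type store struct {
	mu     sync.RWMutex
	caches map[string]map[string]storageEntry
	skew   time.Duration // demo-only clock offset
}

func newStore() *store {
	return &store{caches: make(map[string]map[string]storageEntry)}
}

func (s *store) now() time.Time { return time.Now().Add(s.skew) }

// advance shifts the store's clock forward; used only to demonstrate expiry.
func (s *store) advance(d time.Duration) {
	s.mu.Lock()
	s.skew += d
	s.mu.Unlock()
}

// Set stores value under cacheType/key. A ttl of 0 means no expiration.
func (s *store) Set(cacheType, key string, value any, ttl time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.caches[cacheType] == nil {
		s.caches[cacheType] = make(map[string]storageEntry)
	}
	e := storageEntry{Value: value}
	if ttl > 0 {
		e.Expiration = s.now().Add(ttl)
	}
	s.caches[cacheType][key] = e
}

// Get returns the value and true, or false if missing or already expired.
func (s *store) Get(cacheType, key string) (any, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	e, ok := s.caches[cacheType][key]
	if !ok {
		return nil, false
	}
	if !e.Expiration.IsZero() && s.now().After(e.Expiration) {
		return nil, false // expired but not yet physically evicted
	}
	return e.Value, true
}

func main() {
	s := newStore()
	s.Set("sessions", "abc", "session-data", 12*time.Hour)
	s.Set("otp", "abc", "123456", 5*time.Minute) // same key, different cache type
	fmt.Println(s.Get("sessions", "abc"))
	fmt.Println(s.Get("otp", "abc"))
	s.advance(10 * time.Minute)
	fmt.Println(s.Get("otp", "abc")) // expired
}
```

The same key "abc" lives in two cache types without colliding, which is what the namespace isolation bullet describes.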

Config

Configuration under [cluster] (memory persistence):

[cluster]
  cluster_path = "/var/lib/hexon/cluster"  # Base path for JetStream storage
  persist_memory = true                    # Use FileStorage for KV bucket

  memory_kv_max_write = 10               # Max concurrent KV writes (1-100)

When persist_memory = true and cluster_path is set:

  - NATS JetStream KV bucket "hexon_storage_memory" is created
  - Writes are asynchronously persisted to JetStream after local cache update
  - Concurrent KV writes throttled by memory_kv_max_write (default 10)
  - On startup, all entries are bootstrapped from JetStream KV
  - JetStream uses Raft consensus in 3+ node clusters for durability
  - Data survives full cluster restarts

When persist_memory = false or cluster_path is unset:

  - KV bucket uses MemoryStorage (data lost on restart)
  - Falls back to peer-to-peer bootstrap from live cluster nodes
  - Suitable for truly ephemeral data (PoW challenges, rate limit counters)

Key encoding for NATS KV:

  NATS KV keys only allow [-/_=\.a-zA-Z0-9]+. Keys from external sources
  (LDAP groups with spaces, email addresses) are base64url encoded:
    Format: {cacheType}/{base64url(key)}
    Example: "directory_groups/UmVwbGljYXRpb24gQWRtaW5pc3RyYXRvcnM"
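
The key format above can be reproduced with Go's standard library. The function names encodeKVKey and decodeKVKey are hypothetical. The sketch assumes unpadded base64url (base64.RawURLEncoding), which matches the example key shown above; that key decodes to the group name "Replication Administrators".

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// encodeKVKey builds {cacheType}/{base64url(key)} using unpadded base64url.
func encodeKVKey(cacheType, key string) string {
	return cacheType + "/" + base64.RawURLEncoding.EncodeToString([]byte(key))
}

// decodeKVKey splits a KV key back into cache type and original key.
func decodeKVKey(kvKey string) (string, string, error) {
	cacheType, enc, ok := strings.Cut(kvKey, "/")
	if !ok {
		return "", "", fmt.Errorf("malformed KV key %q", kvKey)
	}
	raw, err := base64.RawURLEncoding.DecodeString(enc)
	if err != nil {
		return "", "", fmt.Errorf("decode KV key %q: %w", kvKey, err)
	}
	return cacheType, string(raw), nil
}

func main() {
	k := encodeKVKey("directory_groups", "Replication Administrators")
	fmt.Println(k)
	fmt.Println(decodeKVKey(k))
}
```

Encoding every external key this way keeps spaces, "@", and other characters outside the allowed set out of the KV key space.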

Bootstrap sequence on node startup:

  1. Attempt to read all entries from NATS JetStream KV
  2. Populate in-memory cache with non-expired entries
  3. If JetStream unavailable, request data from cluster peers
  4. Merge peer responses into local cache
  5. Live broadcasts during bootstrap take precedence over stale KV data

Hot/cold cache:

[memory]
  cold_enabled = true    # Two-tier hot/cold cache (default: true)
  cold_ttl = "72h"       # How long idle entries stay in hot cache

When cold_enabled = true (default):

  - Entries load lazily from durable storage on first access
  - Subsequent reads are in-memory
  - Idle entries evicted from memory after cold_ttl (still durable)
  - No startup warmup — node starts instantly, cache fills on demand
  - Only active entries consume memory; idle entries are served from
    durable storage on demand
  - Cluster-wide features that need to enumerate cache contents
    (admin force-logout safety-net, audit listing) are fully supported

When cold_enabled = false (operator opt-out):

  - Full in-memory replication; all entries loaded at startup
  - Best for very small deployments where memory headroom is generous
    and cluster-wide enumeration features are not needed
  - A startup warning announces the degraded mode for cluster-wide
    operations: admin force-logout may not catch sessions on peer
    nodes if the per-user index is stale

No hot-reloadable settings. Changes require a full restart.

Troubleshooting

Common symptoms and diagnostic steps:

Key not found after Set (cross-node):

  - Verify Set used cluster-wide replication, not local-only (replication required for cross-node visibility)
  - Reads are local only; small propagation delay is normal
  - Use quorum-confirmed replication before reading for strong consistency
  - Check cluster health: nodes must be reachable for broadcast delivery
  - Verify the stored type is compatible with cluster serialization

Serialization errors (encoding/decoding failures):

  - Custom types stored in memory must be compatible with cluster serialization
  - Type registration happens during module initialization
  - Built-in types (string, int, bool, []byte) work out of the box
  - Error message includes the unregistered type name

TTL expiration not working (entries persist beyond TTL):

  - Background eviction runs every 30 seconds (not instantaneous)
  - Expired entries are immediately invisible to Get (Found=false)
  - Physical cleanup happens on next eviction cycle
  - Very large caches (100K+ entries) may slow eviction scans
  - Check if TTL was set to 0 (zero TTL means no expiration)

OnDelete callback not firing:

  - Callbacks are local only (fire on the node that runs eviction)
  - Callbacks are fire-and-forget (errors are logged but not returned)
  - The callback module and operation must exist and be registered
  - Check telemetry logs at ERROR level for callback failures
  - Callbacks do NOT fire on nodes that receive broadcast deletions
    (only the originating node triggers the callback)

Data lost after cluster restart:

  - Verify persist_memory = true in [cluster] config
  - Verify cluster_path is set and writable
  - Check NATS JetStream health (3+ nodes needed for Raft consensus)
  - 2-node clusters: JetStream may not achieve quorum, data at risk
  - Without persistence, data is only in memory (lost on restart)
  - Bootstrap logs show how many entries were recovered from KV

Memory usage growing unbounded:

  - Check TTL values: missing or zero TTL entries never expire
  - Use All operation to inspect cache sizes per cache type
  - Per-entry overhead: approximately 150 bytes plus key and value sizes
  - Monitor eviction cycle: entries should be cleaned every 30 seconds
  - Consider partitioning large cache types into smaller namespaces

SetNX returning Set=false unexpectedly:

  - Key already exists in local cache (including expired-but-not-evicted)
  - Another node set the key via broadcast before your SetNX
  - SetNX is local atomic only; not a distributed lock by itself
  - For distributed locking, combine SetNX with cluster-wide replication + short TTL
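
The local atomicity of SetNX can be sketched as a check-and-set under one mutex. All names here (cache, entry, raceSetNX) are hypothetical, and the sketch covers only the local part: replication and TTL eviction are omitted. As described above, an existing entry blocks SetNX even if it has expired but has not yet been evicted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   any
	expires time.Time
}

// cache is a hypothetical single-namespace cache guarded by one mutex.
type cache struct {
	mu      sync.Mutex
	entries map[string]entry
}

// SetNX stores value only if key is absent. The existence check and the
// write happen under the same lock, so exactly one local caller can win.
func (c *cache) SetNX(key string, value any, ttl time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, exists := c.entries[key]; exists {
		return false // also true for expired-but-not-evicted entries
	}
	c.entries[key] = entry{value: value, expires: time.Now().Add(ttl)}
	return true
}

// raceSetNX starts n goroutines competing for one key and returns how many won.
func raceSetNX(n int) int {
	c := &cache{entries: make(map[string]entry)}
	var wg sync.WaitGroup
	wins := make(chan int, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if c.SetNX("lock:hostname", id, 5*time.Second) {
				wins <- id
			}
		}(i)
	}
	wg.Wait()
	return len(wins)
}

func main() {
	fmt.Println("winners:", raceSetNX(50))
}
```

This guarantee holds on one node only. Two nodes can each win their local SetNX before either broadcast arrives, which is why the short TTL matters when SetNX is used for cross-node coordination.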

Bootstrap failures on startup:

  - JetStream KV unavailable and no peer nodes responding
  - Node starts with empty cache; data populates as broadcasts arrive
  - Check NATS connection health and cluster discovery
  - Verify cluster_path directory exists and has correct permissions
  - Base64url decoding errors: corrupted KV keys (manual cleanup needed)

KV “too many requests” errors at startup (memory.kv.put_error):

  - Caused by bulk operations (e.g. directory fullSync) spawning many concurrent
    KV writes that overwhelm JetStream rate limits
  - Each user/group sync fires ~3 Set() calls per user + ~2 per group
  - A directory with 40 users and 20 groups = ~160 concurrent writes
  - Fix: increase memory_kv_max_write in [cluster] config (default 10, max 100)
  - These errors are non-fatal: data is already in local cache, only persistence
    is delayed. Entries will be persisted on subsequent writes or next restart.
  - Monitor: logs search "memory.kv.put_error" --since=5m

Architecture

Data flow for write operations:

  1. Caller invokes Set/Delete (local-only or cluster-wide)
  2. Local cache updated immediately (mutex-protected)
  3. OnSet callback triggered if registered (fire-and-forget)
  4. If cluster-wide: replicated to all cluster nodes
  5. Async persistence goroutine acquires semaphore slot (bounded by memory_kv_max_write)
  6. KV write to NATS JetStream (best-effort; skipped on shutdown)
  7. JetStream Raft replicates to follower nodes (3+ node clusters)
  Note: SyncSet bypasses the semaphore (synchronous, caller-blocking, used for signing keys)
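
Step 5 describes a semaphore that bounds concurrent KV writes. A common Go idiom for this is a buffered channel, sketched below. The function persistAll and its parameters are hypothetical, and the put callback stands in for the real JetStream write; the sketch only demonstrates the bounding behavior, using the 160-write figure from the troubleshooting example (40 users x 3 + 20 groups x 2).

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// persistAll starts one goroutine per write but lets at most maxWrites
// run put concurrently. It returns the peak concurrency observed.
func persistAll(writes, maxWrites int, put func(i int)) int64 {
	sem := make(chan struct{}, maxWrites) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64

	for i := 0; i < writes; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when all are taken)
			defer func() { <-sem }() // release the slot

			n := atomic.AddInt64(&inFlight, 1)
			for { // record the highest concurrency seen so far
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			put(i)
			atomic.AddInt64(&inFlight, -1)
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	writes := 40*3 + 20*2 // 160 writes from the directory sync example
	peak := persistAll(writes, 10, func(int) { time.Sleep(time.Millisecond) })
	fmt.Println("writes:", writes, "peak within limit:", peak <= 10)
}
```

All 160 goroutines exist at once, but only 10 are inside put at any moment; the rest wait for a slot instead of hitting the KV store together.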

Data flow for read operations:

  1. Caller invokes Get/All
  2. Local cache lookup (O(1) for Get, O(n) for All)
  3. Expired entries filtered out (Found=false)
  4. If cold_enabled=true and Get misses local: KV fallback (~1ms), lazy-load into hot cache
  5. All always returns hot entries only (no KV scan in cold mode)

Background eviction loop:

  1. Wakes every 30 seconds
  2. Scans all cache types and all entries
  3. Identifies entries with Expiration < now (TTL eviction)
  4. Deletes expired entries from local cache and KV
  5. Replicates eviction to cluster nodes, triggers OnDelete callbacks
  6. Cold eviction pass (cold_enabled=true only): entries idle > cold_ttl removed
     from memory only — stay in KV for future lazy-load, no callbacks triggered
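
One eviction pass, with its two distinct outcomes, can be sketched as follows. The names hotEntry, evictOnce, and demoEvict are hypothetical, as is the LastAccess field used to measure idleness. The sketch only mutates the in-memory map; the comments mark where the KV delete, replication, and OnDelete callback described above would take place.

```go
package main

import (
	"fmt"
	"time"
)

// hotEntry is a hypothetical in-memory record.
type hotEntry struct {
	Expiration time.Time // zero value means no TTL
	LastAccess time.Time
}

// evictOnce performs a single pass over the hot cache. It returns the keys
// removed for TTL expiry and the keys dropped for being idle longer than coldTTL.
func evictOnce(hot map[string]hotEntry, now time.Time, coldEnabled bool, coldTTL time.Duration) (expired, cold []string) {
	for k, e := range hot {
		switch {
		case !e.Expiration.IsZero() && e.Expiration.Before(now):
			// TTL eviction: the real module would also delete from KV,
			// replicate the eviction, and trigger OnDelete here.
			delete(hot, k)
			expired = append(expired, k)
		case coldEnabled && now.Sub(e.LastAccess) > coldTTL:
			// Cold eviction: memory only; the entry stays in KV, no callback.
			delete(hot, k)
			cold = append(cold, k)
		}
	}
	return expired, cold
}

// demoEvict runs one pass over three sample entries.
func demoEvict() (expired, cold []string, remaining int) {
	now := time.Now()
	hot := map[string]hotEntry{
		"otp:expired":    {Expiration: now.Add(-time.Minute), LastAccess: now},
		"group:idle":     {LastAccess: now.Add(-96 * time.Hour)},
		"session:active": {Expiration: now.Add(12 * time.Hour), LastAccess: now},
	}
	expired, cold = evictOnce(hot, now, true, 72*time.Hour)
	return expired, cold, len(hot)
}

func main() {
	fmt.Println(demoEvict())
}
```

A loop driven by a 30-second time.Ticker would call evictOnce repeatedly; it is left out here to keep the pass itself easy to read.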

NATS JetStream KV architecture (when persistence enabled):

  - Bucket: hexon_storage_memory
  - Raft consensus for writes (3+ nodes)
  - Leader election with automatic failover
  - Write-ahead log replicated to followers
  - Can tolerate floor((N-1)/2) node failures (e.g., 1 of 3, 2 of 5)
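
Raft needs a majority of nodes to agree on each write, so the number of failures a cluster survives follows from the quorum size. The arithmetic is sketched below with hypothetical helper names; it also shows why a 2-node cluster tolerates no failures.

```go
package main

import "fmt"

// quorum returns the majority size for an n-node Raft group.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many nodes can fail while a majority remains.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

Going from 3 to 4 nodes raises the quorum to 3 without raising the tolerated failures, which is why odd cluster sizes are the usual choice.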

Peer-to-peer bootstrap fallback:

  - Used when JetStream is unavailable (2-node clusters, JetStream down)
  - Requests data from all cluster peers
  - Merges responses, preferring newest entries on conflict
  - Graceful degradation: memory storage works without persistence

Relationships

Module dependencies and interactions:

sessions: Primary consumer. Stores user sessions with 12-24h TTL. Uses OnDelete callback for session cleanup and index removal. Session indices stored in separate “sessions_index” cache type.
OTP: Stores one-time passwords with 5-10 minute TTL. Keys are hashed email addresses for privacy. Replicated cluster-wide for OTP availability. OnDelete triggers expiration notifications.
OIDC provider: Stores authorization codes, access tokens, refresh tokens, and DPoP JTI values. Each in separate cache types with appropriate TTLs (codes: 5-10min, tokens: 1-24h). Critical for OAuth2 flow integrity.
Proof-of-work: Stores proof-of-work challenge tokens with short TTL. Local-only storage (challenges are node-specific).
WebAuthn: Stores WebAuthn challenges during registration and authentication ceremonies. Short TTL (5 minutes).
Kerberos: Stores Kerberos ticket data with ticket lifetime TTL.
firewall: Uses SetNX for cluster-wide hostname tracking (wildcard DNS). Replicated to all nodes for cross-node hostname state. OnDelete for TTL-based rule cleanup.
storage.filesystem: Complementary module. Use memory for fast ephemeral lookups; use filesystem for persistent data surviving restarts.
telemetry: All operations logged at DEBUG level. Errors (callback failures, eviction issues) logged at ERROR. Metrics for cache sizes and hit rates.
cluster (NATS): JetStream KV persistence layer. Raft consensus provides durability for 3+ node clusters. Bootstrap reads from JetStream on startup.

Logs

Log entries emitted by this module. Search with: logs search "memory". Levels: ERROR > WARN > INFO > DEBUG.

Bootstrap — KV:

  memory.bootstrap.start               INFO    Starting JetStream KV bootstrap
  memory.kv.init                       DEBUG   Requesting JetStream KV bucket
  memory.kv.retry                      DEBUG   JetStream not ready, retrying in {duration} (attempt N/M)
  memory.bootstrap.kv_unavailable      INFO    JetStream KV unavailable after retries, falling back to peer broadcast
  memory.kv.ready                      DEBUG   JetStream KV bucket ready
  memory.bootstrap.cold                INFO    Cold mode enabled — skipping bootstrap warmup, cache will populate on demand
  memory.bootstrap.read_keys           DEBUG   Reading keys from JetStream KV
  memory.bootstrap.empty               INFO    JetStream KV bucket is empty, nothing to restore
  memory.bootstrap.failed              ERROR   Failed to read KV keys
  memory.bootstrap.keys_found          DEBUG   Found N keys in JetStream KV
  memory.bootstrap.process_key         DEBUG   Processing KV key
  memory.bootstrap.retry_transient     INFO    Retrying N keys after transient NATS errors (JetStream leader stabilizing)
  memory.bootstrap.complete            INFO    Bootstrap complete (loaded, skipped, errors, duration)

Bootstrap — Key Processing:

  memory.bootstrap.get_tombstone       DEBUG   KV key listed but not found (tombstone)
  memory.bootstrap.get_transient       DEBUG   Transient NATS error, will retry
  memory.bootstrap.get_error           WARN    Failed to get KV entry
  memory.bootstrap.decode_error        WARN    Failed to decode KV entry, deleting corrupted key
  memory.bootstrap.decode_error_cleanup WARN   Failed to delete corrupted KV entry
  memory.bootstrap.parse_error         WARN    Failed to parse KV key format
  memory.bootstrap.skip_expired        DEBUG   Skipping expired entry
  memory.bootstrap.skip_exists         DEBUG   Skipping key (already in memory from broadcast)
  memory.bootstrap.skip_deleted        DEBUG   Skipping key (deleted during bootstrap)
  memory.bootstrap.loaded              DEBUG   Loaded entry from KV
  memory.bootstrap.tracking_stopped    DEBUG   Stopped tracking deletes, bootstrap complete / peer bootstrap complete
  memory.bootstrap.track_delete        DEBUG   Tracking delete during bootstrap

Bootstrap — Peer Fallback:

  memory.bootstrap.peers_encryption_timeout WARN  Encryption not ready after timeout, proceeding with bootstrap anyway
  memory.bootstrap.peers_wait_encryption    DEBUG Waiting for encryption to be ready (X3DH/shared key sync)
  memory.bootstrap.peers_start              INFO  Starting peer-to-peer bootstrap via Broadcast
  memory.bootstrap.peers_failed             ERROR Failed to broadcast BootstrapGetAll
  memory.bootstrap.peers_responses          INFO  Collected responses from N peers
  memory.bootstrap.peers_timeout            WARN  Failed to collect all peer responses
  memory.bootstrap.peers_operation_error    WARN  Operation error from node
  memory.bootstrap.peers_invalid_response   WARN  Invalid response type from node
  memory.bootstrap.peers_merge              DEBUG Merging snapshot from node
  memory.bootstrap.peers_complete           INFO  Peer bootstrap complete (loaded, skipped, duration)

KV Persistence:

  memory.kv.encode_error               WARN    Failed to encode entry for KV
  memory.kv.put_error                  WARN    Failed to write to KV
  memory.kv.persist_success            DEBUG   Entry persisted to KV
  memory.kv.delete_error               WARN    Failed to delete from KV
  memory.kv.delete_success             DEBUG   Entry deleted from KV

CRUD Operations:

  memory                               DEBUG   Memory storage Set
  memory                               DEBUG   Triggering OnSet callback
  memory                               WARN    OnSet callback failed
  memory                               DEBUG   Memory storage Delete
  memory                               DEBUG   Triggering OnDelete callback
  memory                               WARN    OnDelete callback failed
  memory                               DEBUG   Memory storage All
  memory                               DEBUG   Memory storage Touch
  memory                               DEBUG   Memory storage SetNX
  memory                               DEBUG   Memory storage SyncSet
  memory                               DEBUG   Memory storage SyncGet (lazy-loaded from KV)

Bootstrap Snapshot:

  memory.bootstrap                     DEBUG   BootstrapGetAll returning snapshot

Cold Cache:

  memory.cold                          WARN    Corrupted KV entry, deleting
  memory.cold                          DEBUG   Cold eviction sweep

Eviction:

  memory.eviction                      INFO    Eviction loop shutting down gracefully

Metrics

Prometheus metrics. Query with: metrics prometheus memory_storage_<name>

CRUD Counters:

  memory_storage_gets           counter   {cache_type, result}   Cache reads (result: hit, miss, cold_hit, decode_error, expired)
  memory_storage_sets           counter   {cache_type}           Cache writes
  memory_storage_deletes        counter   {cache_type}           Cache deletions
  memory_storage_touches        counter   {cache_type, result}   TTL renewals (result: hit, miss, expired)
  memory_storage_setnx          counter   {cache_type, result}   Atomic set-if-not-exists (result: set, exists)
  memory_storage_sync_sets      counter   {cache_type}           Synchronous KV-persisted writes
  memory_storage_sync_gets      counter   {cache_type, result}   Synchronous reads with KV fallback (result: hit, miss, kv_hit, decode_error, expired)
  memory_storage_evictions      counter   {cache_type, reason}   Entries evicted (reason: expired, cold)

Gauges:

  memory_storage_entries        gauge     {cache_type}           Current entry count per cache type (updated via GetCacheStats)

Alerts:

  rate(memory_storage_gets{result="miss"}[5m]) > 100                High cache miss rate (check TTLs or missing Set calls)
  rate(memory_storage_evictions{reason="expired"}[5m]) > 500        Excessive TTL evictions (entries expiring faster than expected)
  rate(memory_storage_gets{result="decode_error"}[5m]) > 0          Corrupted KV entries (cold cache decode failures)

Telemetry & Logging

Structured logging with OTLP export, per-module log levels, audit class, ring buffer queries, and trace correlation

Overview

The telemetry module provides structured logging with key-value pairs, multiple output targets, and cross-module trace correlation for cluster-wide observability.

Core capabilities:

Structured logging with key-value pairs and fluent builder API
Six log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL
AUDIT log class: bypasses level filtering for security events
Per-module log level configuration (override global level per module)
OTLP gRPC log export to OpenTelemetry-compatible collectors
Trace ID correlation across modules (128-bit hex IDs per request)
In-memory ring buffer for admin CLI log queries
JSON and human-readable output formats
Security context builder for auth-related log entries

Output modes:

  stdout: Structured logs written to standard output (default)
  otlp:   Logs exported via gRPC to an OpenTelemetry collector
  both:   Simultaneous stdout and OTLP export

OTLP export includes:

  - timestamp, severity, body (message), module attribute
  - service.name, service.version, environment, host.name, host.ip
  - Native OTLP TraceId field for trace-to-log correlation
  - Batched async export via SDK log processor

Ring buffer:

  Configurable in-memory buffer (default 10,000 entries) for admin CLI log
  queries ('logs tail', 'logs search'). Provides instant access to recent
  logs without external log aggregation. Set to 0 to disable.

Config

Configuration under [telemetry] section:

[telemetry]
  log_level = "info"               # Global: trace|debug|info|warn|error|fatal
  log_format = "json"              # Output format: "json" or "human"
  output = "stdout"                # Output target: "stdout", "otlp", or "both"
  otlp_endpoint = "otel-collector:4317"  # Required when output is "otlp" or "both"
  log_buffer_size = 10000          # Ring buffer entries for log queries (0 = disabled)
  audit = true                     # Audit class: always display security events regardless of log_level

[telemetry.module_levels]
  oidc = "debug"                   # Per-module override (module name = level)
  webauthn = "info"
  bastion = "trace"

OTLP endpoint format:

  "host:port"          - Plain gRPC connection
  "http://host:port"   - Insecure gRPC (http:// stripped, WithInsecure applied)
  "https://host:port"  - TLS gRPC connection

Compatible collectors: Grafana Alloy, OpenTelemetry Collector, Datadog Agent, Splunk OTel Collector, any OTLP/gRPC compatible receiver.

If the OTLP endpoint is unreachable at startup, the system falls back to stdout and logs a warning. gRPC connections are lazy (connect on first export).
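
The three endpoint formats listed above can be told apart with simple prefix checks, as in this sketch. The endpoint type, the mode labels, and parseOTLPEndpoint are hypothetical names. Stripping the https:// prefix is an assumption made by analogy with the documented http:// behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// endpoint is a hypothetical parsed form of otlp_endpoint.
type endpoint struct {
	Target string // host:port handed to the gRPC dialer
	Mode   string // "plain", "insecure", or "tls" (labels are illustrative)
}

// parseOTLPEndpoint classifies the configured value by its scheme prefix.
func parseOTLPEndpoint(raw string) endpoint {
	switch {
	case strings.HasPrefix(raw, "http://"):
		// Documented: the prefix is stripped and an insecure connection is used.
		return endpoint{Target: strings.TrimPrefix(raw, "http://"), Mode: "insecure"}
	case strings.HasPrefix(raw, "https://"):
		// Assumption: the prefix is stripped the same way before dialing with TLS.
		return endpoint{Target: strings.TrimPrefix(raw, "https://"), Mode: "tls"}
	default:
		return endpoint{Target: raw, Mode: "plain"}
	}
}

func main() {
	for _, raw := range []string{
		"otel-collector:4317",
		"http://otel-collector:4317",
		"https://otel-collector:4317",
	} {
		fmt.Printf("%-30s -> %+v\n", raw, parseOTLPEndpoint(raw))
	}
}
```

The sketch does not trim a trailing slash, so a value such as "http://otel-collector:4317/" would keep the slash in the target and fail to dial.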

Audit class:

  When audit = true (default), log entries marked with AsAudit() bypass level
  filtering. Security events (SFTP ops, SSH connections, admin commands, TLS
  protection) are always visible even when log_level is set to "error".
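
How the global level, per-module overrides, and the audit class interact can be sketched as a single filter function. The types and names below (Level, Config, shouldEmit) are hypothetical, and the sketch assumes module names are matched exactly; whether a parent name such as "bastion" also covers "bastion.session" is not stated in this document.

```go
package main

import "fmt"

// Level orders the six log levels from most to least verbose.
type Level int

const (
	LevelTrace Level = iota
	LevelDebug
	LevelInfo
	LevelWarn
	LevelError
	LevelFatal
)

// Config is a hypothetical in-memory form of the [telemetry] section.
type Config struct {
	Global       Level
	ModuleLevels map[string]Level
	Audit        bool
}

// shouldEmit reports whether an entry passes filtering.
func (c Config) shouldEmit(module string, lvl Level, isAudit bool) bool {
	// Audit-class entries bypass level filtering when audit = true.
	if isAudit && c.Audit {
		return true
	}
	threshold := c.Global
	if l, ok := c.ModuleLevels[module]; ok {
		threshold = l // per-module override replaces the global level
	}
	return lvl >= threshold
}

func main() {
	cfg := Config{
		Global:       LevelError,
		ModuleLevels: map[string]Level{"oidc": LevelDebug},
		Audit:        true,
	}
	fmt.Println(cfg.shouldEmit("proxy", LevelInfo, false))  // filtered by global level
	fmt.Println(cfg.shouldEmit("oidc", LevelDebug, false))  // allowed by module override
	fmt.Println(cfg.shouldEmit("bastion", LevelInfo, true)) // audit bypass
}
```

With log_level set to "error", the audit entry still passes while the ordinary INFO entry from the same kind of module does not.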

Hot-reloadable: log_level, module_levels, log_format. Cold (restart required): output, otlp_endpoint, log_buffer_size, audit.

Troubleshooting

Common symptoms and diagnostic steps:

Logs not appearing in OTLP collector:

  - Verify output is set to "otlp" or "both" in [telemetry]
  - Check otlp_endpoint format (host:port, no trailing slash)
  - Network connectivity: 'net tcp <collector-host>:<port>'
  - Collector may reject due to resource limits or auth requirements
  - Startup fallback: if endpoint was unreachable at startup, logs go to stdout
  - Check: 'logs tail' to verify logs are being generated locally

Per-module log level not working:

  - Verify [telemetry.module_levels] has exact module name (case-sensitive)
  - Module names use dot notation: "oidc", "bastion.session", "identity.scim"
  - A per-module level only changes output when it differs from the global level (e.g., "debug" for one module while the global level is "info")
  - Check: 'config show telemetry' to verify active configuration

Ring buffer queries returning no results:

  - Verify log_buffer_size > 0 (0 disables the ring buffer)
  - Buffer is in-memory only; cleared on restart
  - 'logs tail' shows most recent entries
  - 'logs search <keyword>' filters by content
  - Buffer wraps around: oldest entries are overwritten when full

Log format issues:

  - "json": structured key-value JSON (recommended for log aggregation)
  - "human": colored, readable format (recommended for development)
  - Trace IDs: full 128-bit hex in JSON, truncated 8-char in human format

High log volume impacting performance:

  - Raise global log_level to "warn" or "error"
  - Use per-module levels to keep verbose logging only where needed
  - OTLP batched export is async and does not block request processing
  - Ring buffer size: reduce log_buffer_size if memory is a concern

Relationships

Module dependencies and interactions:

All modules: Every module in the system uses telemetry for structured logging with trace correlation.
Admin CLI: 'logs tail', 'logs search', 'logs stats', 'logs anomalies', 'logs patterns' commands query the ring buffer.
Configuration: Reads [telemetry] section. Log level and format are hot-reloadable without restart.
Cluster: Each node maintains its own ring buffer. Admin CLI log queries fan out to all nodes and merge results.

Logs

The telemetry module is the logging infrastructure itself — it does not emit structured log entries through its own pipeline. Diagnostic messages are written directly to stderr for bootstrap and shutdown scenarios where the log pipeline is unavailable.

Stderr diagnostics (not structured LogEntry calls):

  [TELEMETRY] Failed to initialize OTLP exporter: <err> (falling back to stdout)
      — Startup: OTLP gRPC connection failed, output mode reverts to stdout
  Failed to marshal log entry: <err>
      — Runtime: JSON encoding of a log entry failed (entry is dropped)
  [TELEMETRY] OTLP provider shutdown error: <err>
      — Shutdown: OTLP provider flush/close returned an error
  [TELEMETRY] Shutdown complete: N logs processed, N logs dropped due to overflow
      — Shutdown: final stats when logs were dropped (includes audit count if any)

These messages appear only in stderr, never in the structured log stream or ring buffer. They indicate infrastructure-level issues with the telemetry pipeline itself.

Metrics

Prometheus metrics emitted by the telemetry module. Query with: metrics prometheus telemetry_<name>

Audit event tracking:

  telemetry_audit_log_entries_total     counter    {}    Audit-class entries successfully written
  telemetry_audit_dropped_total         counter    {}    Audit-class entries dropped (channel overflow)
  telemetry_converging_log_entries_total counter   {}    Converging-class entries successfully written

All three counters have no labels (nil label map). They are incremented in the single backgroundWriter goroutine (no contention).

Alerts:

  telemetry_audit_dropped_total > 0                         Audit entries lost — increase channel buffer or reduce log volume
  rate(telemetry_audit_log_entries_total[5m]) == 0           No audit events — possible pipeline failure or misconfiguration