Authentication

Handles user authentication and session management — ten methods, MFA enforcement, cluster-wide sessions

Overview

Handles user authentication for all access paths — HTTP, SSH bastion, RADIUS, and API. Replaces separate identity providers, MFA systems, and session stores with one integrated layer. Applies to every session the gateway creates, regardless of protocol.

Supported primary authentication methods:

passwd: LDAP password authentication with bind verification
passkey: WebAuthn/FIDO2 passwordless authentication (Touch ID, YubiKey)
x509: X.509 client certificate authentication (Subject DN mapping)
oidc: OpenID Connect SSO via internal provider or external IdP (RP)
magiclink: Email-based passwordless sign-in using RFC 8628 device code polling
kerberos: SPNEGO/Kerberos ticket-based authentication (Active Directory)

Supported MFA methods (second factor):

otp: Email-delivered one-time password via SMTP (per-device fingerprinting)
totp: Time-based one-time password (RFC 6238, authenticator apps)

Additional modules:

devicecode: RFC 8628 device authorization grant (bastion SSH, magic link infra)
jit2fa: Just-in-time second factor enrollment and verification
scim: SCIM 2.0 identity provider with multi-provider merge and webhooks

The signin service orchestrates authentication flows. It selects the primary method (configurable via service.signin.primary), falls through to secondary methods, enforces MFA requirements per method, and manages session creation with cluster-wide replication.

Architecture

Authentication flow (signin service orchestration):

Client request arrives at /signin (or /api/signin for API clients)
Method selection: primary method presented first, secondary methods available
Credential verification dispatched to the appropriate auth module:
- passwd: LDAP bind (no password storage)
- passkey: WebAuthn challenge-response ceremony
- x509: Certificate chain validation + Subject DN mapping
- oidc: Authorization Code + PKCE exchange
- magiclink: Device code creation + magic link email delivery
- kerberos: SPNEGO token validation
Identity lookup: username resolved to user record from directory
Group resolution: group memberships fetched from directory
Account status check: disabled/locked accounts rejected synchronously
MFA gate (if require_mfa includes the method):
  a. Pre-authentication session created (limited, 5-minute TTL)
  b. MFA challenge presented (OTP email or TOTP authenticator)
  c. MFA code verified
  d. Pre-auth session revoked, new authenticated session created (rotation)
Session creation: replicated to all nodes (cluster-wide quorum)
Directory sync: user record synchronized cluster-wide
Session cookie set, redirect to return_url or landing page

All auth modules are invoked cluster-wide, ensuring consistency and observability regardless of which node handles the request.

Session types:

Authenticated: full access, configurable TTL (default 24h)
MFA pending: limited capabilities, short TTL (default 5min)
Password expired: forced password change, restricted access

Configuration:

  [service.signin]
  primary = "passkey"              # Default authentication method
  secondary = ["passwd", "x509"]   # Alternative methods shown on signin page
  require_mfa = ["passwd"]         # Methods that require MFA after primary auth
  mfa_methods = ["otp", "totp"]    # Available MFA methods for users
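
As a rough illustration of how these four keys interact, the Go sketch below decides whether a method is offered on the signin page and whether it triggers the MFA gate. The SigninConfig type and its methods are hypothetical names chosen for this example; they are not the gateway's API.

```go
package main

import "fmt"

// SigninConfig mirrors the [service.signin] keys shown above.
// Hypothetical type for illustration only.
type SigninConfig struct {
	Primary    string
	Secondary  []string
	RequireMFA []string
	MFAMethods []string
}

// contains reports whether list includes value.
func contains(list []string, value string) bool {
	for _, item := range list {
		if item == value {
			return true
		}
	}
	return false
}

// MethodAllowed reports whether a method is the primary or a listed secondary.
func (c SigninConfig) MethodAllowed(method string) bool {
	return method == c.Primary || contains(c.Secondary, method)
}

// NeedsMFA reports whether the method appears in require_mfa.
func (c SigninConfig) NeedsMFA(method string) bool {
	return contains(c.RequireMFA, method)
}

func main() {
	cfg := SigninConfig{
		Primary:    "passkey",
		Secondary:  []string{"passwd", "x509"},
		RequireMFA: []string{"passwd"},
		MFAMethods: []string{"otp", "totp"},
	}
	for _, m := range []string{"passkey", "passwd", "kerberos"} {
		fmt.Printf("%s allowed=%v mfa=%v\n", m, cfg.MethodAllowed(m), cfg.NeedsMFA(m))
	}
}
```

With the configuration above, a passkey sign-in would skip the MFA gate while a password sign-in would enter the pre-authentication session first.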

Relationships

Child modules (authentication.*):

oidc: OIDC provider — SSO hub for proxy, bastion, external apps
webauthn: FIDO2/WebAuthn — passwordless passkey authentication
ldap: LDAP authentication backend — password bind verification
x509: X.509 certificate auth — client cert to username mapping
kerberos: SPNEGO/Kerberos — Active Directory ticket authentication
otp: Email OTP — one-time codes via SMTP with device fingerprinting
totp: TOTP — RFC 6238 authenticator app verification
devicecode: RFC 8628 device authorization — bastion SSH, magic link infra
magiclink: Email-based passwordless — magic link token generation/verification
jit2fa: Just-in-time 2FA — enrollment and verification middleware

Upstream dependencies:

directory: User lookup, group membership, account status (disabled/locked)
sessions: Session creation (quorum), revocation, TTL management
smtp: Email delivery for OTP codes and magic link emails
firewall: Network-level access rules applied before auth endpoints

Downstream consumers:

proxy: Proxy SSO via OIDC provider (dedicated internal client)
bastion: SSH authentication via device authorization grant
services: All HTTP services check session cookies for access control
radius: RADIUS authentication for external NAS hardware (password, x509)

Cross-cutting:

protection: Rate limiting on signin endpoints (JA4 fingerprint-based)
cluster: All auth operations are cluster-wide for consistency
notify: Authentication event notifications (webhooks, email alerts)

Device Code Authorization

Authenticates devices without a browser — CLI tools, IoT, and headless systems enter a code on another device

Overview

The device code module implements RFC 8628 (OAuth 2.0 Device Authorization Grant) for authenticating input-constrained devices such as smart TVs, CLI tools, IoT devices, and headless systems that lack a web browser.

Core capabilities:

Full RFC 8628 compliance (Sections 3.1 through 3.5, 6.1)
BASE20 user codes using consonants only (BCDFGHJKLMNPQRSTVWXZ) to avoid profanity in generated codes
Configurable code length (default: 8 characters) and expiration TTL
Constant-time comparison for user code validation (timing attack prevention)
SHA-256 hashed cache keys to prevent code enumeration
Optimistic locking with version-based concurrency control to prevent double-authorization race conditions in distributed environments
Distributed code storage with cluster-wide replication and quorum consensus
Automatic expiration with configurable TTL (default: 10 minutes)
Single-use enforcement: codes cannot be reused after authorization or denial
Directory integration for fresh user claims at authorization time

Flow summary:

  1. Device requests authorization codes
  2. Device displays short user_code to the user (e.g., "BCDFGHJK")
  3. User visits verification URI on another device (phone or computer)
  4. User enters user_code and authorizes or denies the device
  5. Device polls token endpoint until authorized, denied, or expired
  6. On authorization, device receives access token via OIDC token endpoint

The OIDC service handles the HTTP endpoints (/device page) and token endpoint with device_code grant type. The device code module provides the core logic; the OIDC service provides the HTTP transport.

Config

Device code behavior is configured under the OIDC authentication section:

[authentication.oidc]
  device_code_ttl = "10m"            # Code expiration (default: 10 minutes)
  device_code_interval = 5           # Minimum polling interval in seconds (default: 5)
  device_code_user_code_length = 8   # User code character count (default: 8)

Code generation parameters:

  - Device code: 40-digit cryptographically random token for client polling
  - User code: 8-character BASE20 string (consonants only) for human entry
  - Verification URI: auto-generated from server base URL + /device path
  - VerificationURIComplete: includes pre-filled user_code query parameter
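
A minimal sketch of user code generation, assuming each character is drawn uniformly from the BASE20 alphabet using a cryptographic random source. GenerateUserCode is a hypothetical name; the module's actual generator is not shown in this document.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// base20 is the consonant-only alphabet recommended in RFC 8628 Section 6.1.
const base20 = "BCDFGHJKLMNPQRSTVWXZ"

// GenerateUserCode returns a random BASE20 string of the requested length.
// rand.Int draws uniformly from [0, 20), so no modulo bias is introduced.
func GenerateUserCode(length int) (string, error) {
	if length <= 0 {
		return "", fmt.Errorf("length must be positive, got %d", length)
	}
	limit := big.NewInt(int64(len(base20)))
	code := make([]byte, length)
	for i := range code {
		n, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		code[i] = base20[n.Int64()]
	}
	return string(code), nil
}

func main() {
	code, err := GenerateUserCode(8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(code) // different on every run
}
```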

Polling behavior (per RFC 8628 Section 3.5):

  - Clients must wait at least device_code_interval seconds between polls
  - "slow_down" response instructs client to add 5 seconds to interval
  - "authorization_pending" means user has not yet acted
  - "expired_token" means device_code TTL has passed

Hot-reloadable: device_code_ttl, device_code_interval. Cold (restart required): device_code_user_code_length.

The module auto-enables when OIDC is configured; no separate enable flag is needed. The magic link module also auto-enables the device code module when activated.

Troubleshooting

Common symptoms and diagnostic steps:

User code not accepted at verification page:

  - Verify code format: must be exactly 8 uppercase consonants (BASE20 charset)
  - Check expiration: codes expire after device_code_ttl (default: 10 minutes)
  - Check single-use: codes cannot be reused after authorization or denial
  - Verify AlreadyHandled flag: VerifyUserCode returns AlreadyHandled=true if
    the code was already authorized or denied
  - Case sensitivity: user codes are case-insensitive but stored uppercase
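
When reproducing a rejected code, it can help to run the same checks by hand. This sketch assumes the verification page uppercases the input and then requires exactly the configured number of BASE20 characters; NormalizeUserCode is a hypothetical helper, and whether the gateway also trims whitespace is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

const base20 = "BCDFGHJKLMNPQRSTVWXZ"

// NormalizeUserCode uppercases the input and checks length and charset.
// It returns the normalized code and whether it is well-formed.
func NormalizeUserCode(input string, length int) (string, bool) {
	code := strings.ToUpper(strings.TrimSpace(input)) // trimming is an assumption
	if len(code) != length {
		return "", false
	}
	for _, r := range code {
		if !strings.ContainsRune(base20, r) {
			return "", false
		}
	}
	return code, true
}

func main() {
	for _, in := range []string{"bcdfghjk", "ABCDEFGH", "BCDF"} {
		code, ok := NormalizeUserCode(in, 8)
		fmt.Printf("%q -> %q valid=%v\n", in, code, ok)
	}
}
```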

Device polling returns “expired_token” too quickly:

  - Check device_code_ttl configuration (default: 10m)
  - Verify cluster time synchronization (NTP) across nodes
  - Check if code was created with custom TTL override via AdditionalData

Device polling returns “slow_down” repeatedly:

  - Client must increase polling interval by 5 seconds on each slow_down
  - Minimum interval: device_code_interval (default: 5 seconds)
  - Verify client implements backoff correctly per RFC 8628 Section 3.5

“authorization_pending” never resolves:

  - Verify user visited the correct verification URI
  - Check that user entered the correct user_code
  - Verify OIDC service handlers are registered and accessible
  - Check network connectivity to the verification endpoint
  - Confirm user completed the full authorization flow (not just code entry)

Race condition or double authorization:

  - Optimistic locking detects concurrent modifications via version counter
  - Post-broadcast verification rejects stale version authorization attempts
  - Check structured logs for "version mismatch" warnings
  - Multiple users entering the same code: statistically improbable with
    8-character BASE20 codes (20^8 = 25.6 billion combinations)

Token exchange fails after authorization:

  - Verify OIDC token endpoint is configured and accessible
  - Check client_id matches between authorization and token request
  - Verify scope is valid for the OIDC provider configuration
  - Check directory module health (user claims fetched at authorization time)

Codes not replicating across cluster nodes:

  - Check cluster health and quorum status
  - Verify memory storage module is healthy
  - Check cluster connectivity between nodes
  - Codes use distributed storage with quorum; partial cluster may cause issues

Diagnostic commands:

  - auth devicecodes: list active device code authorization flows
  - auth status: check authentication system overview
  - health components: verify device code subsystem health

Security

Security features and hardening measures:

BASE20 charset (RFC 8628 Section 6.1):

  User codes use only consonants (BCDFGHJKLMNPQRSTVWXZ) to prevent profanity
  in randomly generated codes. This is an explicit RFC recommendation.

Constant-time comparison:

  User code validation uses crypto/subtle.ConstantTimeCompare to prevent
  timing side-channel attacks that could leak valid codes. This complements
  the user code brute-force protections in RFC 8628 Section 5.1.
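
For reference, a minimal use of the comparison primitive named above. UserCodesEqual is a hypothetical wrapper; only subtle.ConstantTimeCompare comes from the Go standard library.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// UserCodesEqual compares two codes without an early exit on the first
// differing byte. ConstantTimeCompare returns 0 immediately when the
// lengths differ, so only the content of equal-length inputs is protected.
func UserCodesEqual(submitted, stored string) bool {
	return subtle.ConstantTimeCompare([]byte(submitted), []byte(stored)) == 1
}

func main() {
	fmt.Println(UserCodesEqual("BCDFGHJK", "BCDFGHJK"))
	fmt.Println(UserCodesEqual("BCDFGHJK", "BCDFGHJL"))
}
```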

SHA-256 hashed storage keys:

  Cache keys for device codes are SHA-256 hashed to prevent enumeration
  attacks. Even with access to the storage layer, codes cannot be extracted
  from their hash keys.
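
A sketch of hashed key derivation, assuming the key is a namespace prefix followed by the hex-encoded SHA-256 of the code. The "devicecode:" prefix and the CacheKey name are illustrative; the module's real key layout is not documented here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// CacheKey derives a storage key from a code so the code itself never
// appears in the key space.
func CacheKey(prefix, code string) string {
	sum := sha256.Sum256([]byte(code))
	return prefix + hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(CacheKey("devicecode:", "BCDFGHJK"))
}
```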

Optimistic locking (distributed race prevention):

  - Each authorization increments a version counter
  - Post-broadcast verification detects concurrent modifications
  - Rejects authorization if version mismatch detected
  - Prevents double-authorization in multi-node clusters
  - Critical for environments where multiple users may attempt simultaneous auth
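
The version check can be illustrated with a single-process stand-in. This sketch uses a mutex-guarded map in place of the distributed store and does not model the post-broadcast verification step; Store, Record, and Decide are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrNotFound        = errors.New("device code not found")
	ErrVersionMismatch = errors.New("version mismatch")
)

// Record is a stored device code entry with a version counter.
type Record struct {
	Status  string // "pending", "authorized", or "denied"
	Version uint64
}

// Store is an in-memory stand-in for the distributed code storage.
type Store struct {
	mu      sync.Mutex
	records map[string]Record
}

func NewStore() *Store { return &Store{records: make(map[string]Record)} }

// Put creates a pending record at version 1.
func (s *Store) Put(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[key] = Record{Status: "pending", Version: 1}
}

// Decide applies a status change only if the caller saw the current version.
func (s *Store) Decide(key string, expected uint64, status string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.records[key]
	if !ok {
		return ErrNotFound
	}
	if rec.Version != expected {
		return ErrVersionMismatch
	}
	s.records[key] = Record{Status: status, Version: rec.Version + 1}
	return nil
}

func main() {
	s := NewStore()
	s.Put("code-1")
	// Two approvers both read version 1; only the first write succeeds.
	fmt.Println(s.Decide("code-1", 1, "authorized"))
	fmt.Println(s.Decide("code-1", 1, "authorized"))
}
```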

Single-use enforcement:

  Once a code is authorized or denied, it cannot be reused. The AlreadyHandled
  flag prevents replay attacks on consumed codes.

Directory re-validation:

  CompleteAuthorization fetches the latest user data from the directory module
  rather than relying on stale session data. While the directory is reachable,
  this ensures:
  - Disabled users cannot complete device authorization
  - Group memberships reflect current state (security-critical)
  - ID tokens contain fresh, authoritative user claims
  If the directory is temporarily unavailable, the module falls back to
  session metadata, so claims may be stale until the directory recovers.

Automatic expiration:

  Codes expire after configurable TTL (default: 10 minutes). Expired codes
  are automatically cleaned up from distributed storage.

Fuzz testing coverage:

  - FuzzUserCodeValidation: injection attack resistance
  - FuzzUserCodeConstantTimeComparison: timing attack verification
  - FuzzDeviceCodeGeneration: cryptographic randomness quality
  - FuzzOptimisticLockingVersionHandling: race condition prevention
  - FuzzDeviceAuthorizationRequest: parameter handling validation

Relationships

Module dependencies and interactions:

OIDC service: Primary consumer. OIDC token endpoint handles the device_code grant type. OIDC service provides HTTP handlers for the /device verification page. Token generation occurs after CompleteAuthorization.
Magic link: Reuses device code infrastructure for its polling mechanism. Magic link auto-enables device code module when activated.
Directory: Canonical source for user attributes at authorization time. CompleteAuthorization fetches email, full_name, given_name, surname, and group memberships from directory. Graceful fallback to session metadata if directory is temporarily unavailable.
Distributed memory cache: Cache for code storage. Codes replicated across cluster with quorum consensus. TTL-based automatic cleanup.
Sessions: Session integration for authenticated user context during the verification flow.
Client access: Server-side device code auth for hexonclient QUIC tunnels. Gateway generates device code, sends challenge to client over QUIC control stream, polls until authorized. Same pattern as bastion SSH.
Bastion SSH: Server-side device code auth for SSH sessions. Gateway generates device code, displays QR in terminal.
OIDC service: HTTP transport layer. Handles /device endpoint rendering, /oidc/device/authorize for code generation, and token endpoint for code exchange.
config: Runtime configuration access for TTL, interval, and code length settings. Hot-reload supported for TTL and interval.
telemetry: Structured logging for all device code operations including authorization attempts, completions, and expiration events.

Logs

Log entries by component. Search with: logs search “devicecode”. Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Init (module startup):

  devicecode.init        INFO          Device Code authorization disabled in config
  devicecode.init        INFO          Device Code authorization (RFC 8628) initialized

Authorize (code generation, RFC 8628 Section 3.1-3.2):

  devicecode.authorize   ERROR         Failed to generate device code
  devicecode.authorize   ERROR         Failed to generate user code
  devicecode.authorize   ERROR         Failed to store device code
  devicecode.authorize   ERROR         Failed to achieve quorum for device code storage
  devicecode.authorize   WARN          Failed to store user code reverse lookup
  devicecode.authorize   INFO          Device authorization codes generated

Verify (user code validation):

  devicecode.verify      INFO          Invalid user code format (not BASE20)

Complete (user authorization/denial):

  devicecode.complete    INFO          Device code already handled
  devicecode.complete    ERROR         Failed to generate tokens for device authorization
  devicecode.complete    ERROR         Failed to get token response
  devicecode.complete    ERROR         Invalid token response type
  devicecode.complete    INFO          Generated tokens for device authorization
  devicecode.complete    ERROR         Failed to broadcast authorization update
  devicecode.complete    ERROR         Failed to achieve quorum for authorization
  devicecode.complete    WARN          Concurrent modification detected (version mismatch)
  devicecode.complete    INFO          Device authorization completed

Poll (device code polling, RFC 8628 Section 3.4-3.5):

  devicecode.poll        WARN          Failed to lookup device code
  devicecode.poll        WARN          Client ID mismatch
  devicecode.poll        DEBUG         Client polling too fast
  devicecode.poll        WARN          Failed to replicate LastPoll update across cluster
  devicecode.poll        WARN          Failed to initiate LastPoll broadcast
  devicecode.poll        INFO          Device authorization denied by user
  devicecode.poll        INFO          Device authorization granted

Metrics

Prometheus metrics. Query with: metrics prometheus devicecode_<name>

Codes:

  devicecode_codes_issued_total          counter    {client_id}           Device codes generated

Authorization:

  devicecode_authorizations_total        counter    {result}              Authorization decisions
    result=authorized                                                     User approved device
    result=denied                                                         User denied device

Polling:

  devicecode_polls_total                 counter    {status}              Poll requests by outcome
    status=pending                                                        Awaiting user action
    status=authorized                                                     User authorized
    status=denied                                                         User denied
    status=slow_down                                                      Client polling too fast
    status=expired                                                        Code expired (not instrumented — returns early)

Alerts:

  rate(devicecode_authorizations_total{result="denied"}[5m]) > 10         High denial rate
  rate(devicecode_polls_total{status="slow_down"}[5m]) > 50               Clients ignoring poll interval

Just-In-Time Two-Factor Authentication

Transparent OTP-based 2FA for legacy applications via login interception and credential replay

Overview

JIT-2FA adds two-factor authentication to legacy web applications without any backend modifications. It operates as a transparent middleware layer within a proxy mapping, intercepting form-based login submissions and gating access with email-based OTP verification.

Core capabilities:

Transparent login interception: intercepts POST submissions to configurable login paths
Webhook credential validation: validates username/password via external HTTP webhook
Email OTP challenge: sends one-time password to user email extracted from webhook response
Credential replay: after OTP success, replays the original POST request to the backend
Auth header mode: alternative to replay, injects X-Hexon-* headers for proxy-aware backends
Asymmetric encryption: NaCl box (X25519 + XSalsa20 + Poly1305) for credential storage
Split-knowledge security: server holds ciphertext, client holds private key in HttpOnly cookie
Secure memory handling: plaintext credentials zeroed immediately after encryption
OTP resend without re-encryption: same ciphertext and cookie reused across resends
Session-based access: authenticated sessions bypass 2FA for subsequent requests
Double logout: destroys both JIT-2FA session and forwards logout to backend

Two trust models controlled by inject_credentials config option:

Credential Replay Mode (inject_credentials = true, default):

  For legacy apps with no proxy-auth support. The full NaCl encryption pipeline
  encrypts the login POST body, stores ciphertext in session, then decrypts and
  replays the original request after OTP verification.
  Flow: Login POST -> Encrypt body -> Store ciphertext -> OTP -> Decrypt -> Replay POST -> Backend

Auth Header Mode (inject_credentials = false):

  For apps supporting trusted reverse proxy authentication (Grafana, GitLab, Gitea,
  Jenkins, etc.). Eliminates the encryption pipeline entirely. After OTP success,
  redirects user to login URL. The proxy layer injects auth headers (X-Hexon-User,
  X-Hexon-Mail, etc.) on every authenticated request.
  Flow: Login POST -> Webhook validate -> OTP -> Redirect 302 -> Auth headers injected -> Backend
  Requires add_auth_headers = true on the parent proxy mapping.

Request flow:

  1. Request arrives at proxy mapping with JIT-2FA enabled
  2. Logout path check: if match, destroy session and forward to backend
  3. Login path POST check: if match, extract credentials and call webhook
  4. Webhook success with email: encrypt body (replay mode) or store username (header mode)
  5. Send OTP email and render verification page
  6. User submits OTP: verify code, decrypt and replay POST (or redirect with headers)
  7. Authenticated session established for subsequent requests
  8. Non-login requests: check session validity, forward if authenticated or redirect to login

Config

JIT-2FA is configured per proxy mapping under [proxy.mapping.jit2fa]:

[proxy.mapping.jit2fa]
  enabled = true                    # Enable JIT-2FA for this mapping
  login_url = "/login"              # Redirect target for unauthenticated users
  login_path_regex = "^/login$"     # Regex matching login POST endpoint
  logout_path_regex = "^/logout$"   # Regex matching logout endpoint
  username_field = "username"       # Form field name for username extraction
  password_field = "password"       # Form field name for password extraction
  inject_credentials = true         # true = credential replay, false = auth header mode

Webhook configuration under [proxy.mapping.jit2fa.webhook]:

[proxy.mapping.jit2fa.webhook]
  url = "https://api.internal/validate"   # Webhook endpoint URL
  method = "GET"                          # HTTP method (GET or POST)
  timeout = "5s"                          # Webhook response timeout (default: 5s)
  success_field = "$.status"              # JSONPath to success indicator in response
  success_value = "ok"                    # Expected value at success_field
  extract_email = "$.email"               # JSONPath to user email for OTP delivery
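
The three response-handling keys can be read as a small evaluation step. The sketch below supports only simple dotted paths such as $.status or $.user.email, which is an assumption; the JSONPath subset the gateway actually accepts is not specified here. Evaluate and lookup are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookup resolves a dotted path like "$.a.b" against a decoded JSON object.
func lookup(doc map[string]interface{}, path string) (interface{}, bool) {
	if !strings.HasPrefix(path, "$.") {
		return nil, false
	}
	var cur interface{} = doc
	for _, key := range strings.Split(path[2:], ".") {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		if cur, ok = obj[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// Evaluate returns the extracted email when the webhook body reports success.
func Evaluate(body []byte, successField, successValue, extractEmail string) (string, bool) {
	var doc map[string]interface{}
	if err := json.Unmarshal(body, &doc); err != nil {
		return "", false
	}
	status, found := lookup(doc, successField)
	if !found || fmt.Sprint(status) != successValue {
		return "", false
	}
	raw, _ := lookup(doc, extractEmail)
	addr, isString := raw.(string)
	if !isString || addr == "" {
		return "", false
	}
	return addr, true
}

func main() {
	body := []byte(`{"status":"ok","email":"alice@example.com"}`)
	fmt.Println(Evaluate(body, "$.status", "ok", "$.email"))
}
```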

Optional HTTP transport tuning (defaults aligned with proxy connection pool):

  max_idle_conns = 50                     # Total idle connections (default: 50)
  max_idle_conns_per_host = 20            # Idle connections per host (default: 20)
  force_attempt_http2 = true              # Force HTTP/2 (default: true)
  disable_compression = true              # Disable compression (default: true)
  write_buffer_size = 32768               # Write buffer bytes (default: 32768)
  read_buffer_size = 32768                # Read buffer bytes (default: 32768)
  dial_timeout = "30s"                    # TCP dial timeout (default: 30s)
  keep_alive = "30s"                      # TCP keepalive interval (default: 30s)

OTP configuration under [proxy.mapping.jit2fa.otp]:

[proxy.mapping.jit2fa.otp]
  type = "numeric"                  # OTP type: "numeric" or "base20" (default: global)
  length = 6                        # OTP length in characters (digits when type = "numeric")
  valid = "5m"                      # OTP validity duration
  max_retries = 3                   # Maximum OTP entry attempts (default: global)
  resend_time = 30                  # Seconds before resend allowed (default: global)

When using auth header mode (inject_credentials = false), the parent proxy mapping must also set add_auth_headers = true to inject X-Hexon-User, X-Hexon-Mail, and other identity headers on authenticated requests.

All OTP settings fall back to global email OTP defaults when not specified per mapping.

Token Handoff (optional sub-feature for mobile/SPA/CLI clients):

Add an optional [proxy.mapping.jit2fa.token_handoff] block to expose a bearer-token handoff flow for callable clients. Native mobile apps, SPAs, desktop tools, and CLIs go through the JIT-2FA login + OTP pipeline and receive a signed bearer token at a caller-registered return URL. Subsequent API calls authenticate with the token via a top-of-tree bearer check that injects identity headers and forwards to the backend without running the rest of the middleware chain.

Two entry paths produce the same token handoff flow — callers pick whichever fits their client architecture:

  1. GET /_jit2fa/authorize?return_url=...&dpop_jkt=...
       Gateway-owned URL. Caller opens it in the system browser
       (ASWebAuthenticationSession / Custom Tabs on mobile,
       window.location in SPAs, plain GET in CLI tools with a
       loopback callback). Used when the client cannot submit
       credentials inline — e.g. native apps that delegate the
       login UI to a system browser sheet.

  2. POST to the mapping's login_path_regex with form fields:
       - the username and password fields configured on the mapping
         (username_field / password_field — whatever the backend's
         own login form expects)
       - plus _jit2fa_return_url (required to trigger the handoff)
       - plus optional _jit2fa_dpop_jkt (required when require_dpop=true)
     Used when the client HAS the login form in its own UI — browser-
     based SPAs with a built-in login page, test harnesses, etc.
     Eliminates the bounce through the GET entry path and keeps
     credentials in a single form submission.

Both paths end up at the same post-OTP mint step — the only difference is how the return_url (and optional dpop_jkt) are carried into the flow.

[proxy.mapping.jit2fa.token_handoff]
  enabled = true

  # Path on the mapping where callers start the flow. Must begin with
  # /_jit2fa/ (reserved prefix that guarantees no collision with backend
  # URLs). Default: /_jit2fa/authorize.
  entry_path = "/_jit2fa/authorize"

  # Whitelist of caller return URL patterns. Glob-style: "*" matches any
  # sequence of characters (including slashes, dots, colons), everything
  # else is literal, the pattern is anchored on both ends.
  allowed_return_urls = [
    "com.example.mobile://*",               # native iOS/Android app
    "https://app.example.com/auth/callback", # SPA callback
    "http://127.0.0.1:*/cb",                # CLI tool loopback
  ]

  # Access token lifetime (1m–24h, default 12h). Short-lived by
  # design: when the access token expires the client either uses a
  # refresh token (if enabled) or re-authenticates.
  access_token_ttl = "12h"

  # Audience (aud) claim baked into minted tokens. Required. Callers
  # validate this on receipt to make sure the token was intended for them.
  audience = "myapp.mobile"

  # Accept minted bearer tokens on subsequent requests to this mapping.
  # Default true. Set to false if the mapping should only issue tokens
  # (one-way handoff, e.g. the backend accepts the tokens itself via its
  # own bearer check).
  accept_bearer = true

  # Require DPoP (RFC 9449) proof-of-possession binding. Default false
  # (opportunistic mode — callers may supply dpop_jkt and get bound
  # tokens, non-DPoP flows still work). Set to true to enforce: every
  # entry GET MUST include a dpop_jkt query parameter, and every bearer-
  # authenticated request MUST include a DPoP header whose proof key
  # thumbprint matches the token's cnf.jkt. See "DPoP (RFC 9449)"
  # section below for client-side implementation guidance.
  require_dpop = true

  # Refresh token / max session lifetime (1h–90d). Requires
  # require_dpop=true — refresh without DPoP key binding is rejected
  # at config validation because a stolen refresh token without PoP
  # would grant indefinite access. The refresh token is bound to the
  # SAME DPoP key as the access token (RFC 9449 section 5 strict binding).
  #
  # When set: the fragment delivery includes refresh_token alongside
  # access_token. The client calls POST /_jit2fa/refresh with the
  # refresh token + DPoP proof to get a new access token. On refresh
  # token expiry the client MUST re-authenticate (full login + OTP).
  #
  # When empty or "0": no refresh token issued. Client re-authenticates
  # when the access token expires.
  refresh_token_ttl = "30d"

Writing allowed_return_urls:

The allowed_return_urls list is the ONLY protection against open-redirect attacks in the token handoff flow. Operators are responsible for writing patterns conservatively. The gateway enforces the patterns exactly as written — it does not second-guess them.

DO write exact URLs when possible:

  - "com.example.mobile://auth"
  - "https://app.example.com/auth/callback"

DO use wildcards for legitimate dynamic portions:

  - "com.example.mobile://*"                    any path on a scheme you own
  - "https://*.example.com/auth/callback"       subdomains of a domain you own
  - "http://127.0.0.1:*/cb"                     loopback ephemeral port for CLI

DO NOT write open-redirect patterns:

  - "*"                                         matches literally anything
  - "https://*/*"                               matches any HTTPS URL
  - "https://*.com/callback"                    matches any .com host, including attacker-registered domains
  - "*://example.com/callback"                  allows arbitrary schemes

A single badly-written pattern can turn your token handoff flow into a credential exfiltration vector. Review patterns against your actual mobile apps, SPAs, and CLI tools; reject any pattern you cannot justify.
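
Patterns can be tested offline before deployment. The sketch below implements the glob semantics as described in the config comments ("*" matches any run of characters, everything else is literal, anchored on both ends); it is an approximation for review purposes, not the gateway's matcher, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compilePattern turns one allowed_return_urls entry into an anchored regexp:
// "*" becomes ".*" and every other character is matched literally.
func compilePattern(pattern string) *regexp.Regexp {
	parts := strings.Split(pattern, "*")
	for i, p := range parts {
		parts[i] = regexp.QuoteMeta(p)
	}
	return regexp.MustCompile("(?s)^" + strings.Join(parts, ".*") + "$")
}

// ReturnURLAllowed reports whether returnURL matches any pattern.
func ReturnURLAllowed(patterns []string, returnURL string) bool {
	for _, p := range patterns {
		if compilePattern(p).MatchString(returnURL) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{
		"com.example.mobile://*",
		"https://app.example.com/auth/callback",
		"http://127.0.0.1:*/cb",
	}
	for _, u := range []string{
		"http://127.0.0.1:49152/cb",
		"https://app.example.com/auth/callback?next=x",
		"https://evil.test/cb",
	} {
		fmt.Println(u, ReturnURLAllowed(allowed, u))
	}
}
```

Under these semantics, a subdomain pattern such as "https://*.example.com/auth/callback" would also match "https://evil.test/.example.com/auth/callback", because "*" may span slashes. That is a property of this sketch; whether the gateway's matcher behaves the same way is worth verifying before relying on wildcard hosts.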

DPoP (RFC 9449) proof-of-possession:

Enabling require_dpop = true on the mapping turns every minted bearer token into a key-bound token: a stolen token without the matching private key cannot be replayed against the mapping. This is the primary mitigation for the URL-fragment-delivery threat model — the token is briefly visible in the browser address bar and in devtools, but without the private key it is useless.

When require_dpop = true:

  - Every entry GET MUST carry a dpop_jkt query parameter (RFC 9449
    §10): the base64url SHA-256 thumbprint of the caller's public
    JWK. The gateway validates the charset and length (exactly 43
    chars, base64url alphabet), stashes it alongside the return_url
    in a sibling cookie, and binds the minted token via the cnf.jkt
    confirmation claim.
  - Every bearer-authenticated request to the mapping MUST include a
    DPoP header (RFC 9449 §4) carrying a proof JWT signed with the
    bound private key. The gateway validates the proof (signature,
    htm, htu, iat, jti replay) and checks that its JWK thumbprint
    matches the token's cnf.jkt before forwarding to the backend.

When require_dpop = false (the default), the flow is opportunistic: clients that provide dpop_jkt get bound tokens, clients that don’t get non-DPoP tokens and continue to work. This lets operators roll out DPoP gradually — watch the metric jit2fa_handoff_bearer_checks_total{result=accepted,reason=""} for DPoP adoption, then flip require_dpop to true once metrics show 100%.

Client-side implementation:

  Native mobile apps: generate an ECDSA P-256 keypair via the
  platform keystore (iOS Keychain / Android Keystore), pin the
  private key to the device (hardware-backed where available),
  compute the JWK thumbprint, and sign a DPoP proof JWT on every
  API call. Proofs have htm/htu/iat/jti fields; jti must be unique
  per proof (UUID is fine).

  Browser SPAs: generate via crypto.subtle.generateKey({name:"ECDSA",
  namedCurve:"P-256"}) with extractable=false on the private key,
  store in IndexedDB (CryptoKey is structured-clone-serializable so
  the key persists across page navigations without ever touching
  its bytes), compute the thumbprint with crypto.subtle.digest. A
  working example lives in recipes/ges-html/{test,callback}.html.

  CLI tools: generate via the host OS keystore (Secret Service on
  Linux, Keychain on macOS, Windows Credential Manager). Never
  write the private key to a plaintext file — that defeats the
  whole threat model.

Thumbprint format: RFC 7638 §3.1. For EC keys, canonical JSON of {crv, kty, x, y} with lex-sorted keys and no whitespace, SHA-256 digest, base64url-encoded without padding. The result is exactly 43 ASCII characters.

Common DPoP failure modes and how to diagnose:

  HTTP 400 from /_jit2fa/authorize with "dpop_jkt query parameter is
  required":
    → require_dpop = true but client did not append dpop_jkt.

  HTTP 400 from /_jit2fa/authorize with "dpop_jkt is not a well-
  formed base64url SHA-256 thumbprint":
    → Length != 43, or charset contains non-base64url chars.
      Check the thumbprint computation — the JOSE base64url
      encoding must be padding-free.

  HTTP 401 from /api/* with DPoP challenge and "DPoP proof header
  required for this token":
    → Token carries cnf.jkt but the request has no DPoP header.
      Client has a bound token but is not signing proofs.

  HTTP 401 with "DPoP proof does not match token binding":
    → The proof validates but its key thumbprint does not match
      the token's cnf.jkt. This is "stolen token + forged proof"
      from the gateway's perspective and is logged at Warn with
      both thumbprints for incident review. From the client's
      perspective: check that you are signing proofs with the
      same keypair that was used to compute dpop_jkt at entry
      time. The most common bug is regenerating the keypair on
      every page load.

  HTTP 401 with "DPoP proof is not valid":
    → Signature failure, iat outside replay window, jti already
      seen, or malformed proof. Check the oidc module's
      dpop_validation_total metric to see which.
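
A client can catch the "not a well-formed base64url SHA-256 thumbprint" case before calling /_jit2fa/authorize. The check below is a sketch of the two conditions named above (length 43, base64url charset only); the function name is hypothetical and the gateway's own validation may be stricter.

```python
import re

# 43 characters from the base64url alphabet, no padding.
_JKT_RE = re.compile(r"[A-Za-z0-9_-]{43}")


def is_well_formed_jkt(value: str) -> bool:
    """True if value looks like a padding-free base64url SHA-256 thumbprint."""
    # fullmatch rejects trailing newlines and any extra characters.
    return _JKT_RE.fullmatch(value) is not None
```

A value that ends in "=" or contains "+" or "/" fails this check, which usually means the thumbprint was encoded with standard base64 instead of base64url.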

Token refresh endpoint (/_jit2fa/refresh):

When refresh_token_ttl is configured (requires require_dpop=true), the gateway issues a refresh token alongside the access token in the URL fragment. Both tokens have the same short TTL (access_token_ttl, e.g. 1h). The client calls the refresh endpoint before expiry to get a new pair.

  Request:
    POST /_jit2fa/refresh
    Content-Type: application/x-www-form-urlencoded
    DPoP: <proof-jwt bound to POST https://host/_jit2fa/refresh>

    refresh_token=<refresh-jwt>

  Success response (HTTP 200):
    {
      "access_token": "<new-jwt>",
      "id_token": "<same-jwt>",
      "token_type": "DPoP",
      "expires_in": 3600,
      "refresh_token": "<rotated-jwt>",
      "scope": "openid email profile groups"
    }

  Note: id_token is the same value as access_token (the access token
  IS an ID token). Included per OIDC Core Section 12.2. Standard
  OIDC client libraries use it to update user profile claims.
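
The refresh request shown above could be assembled like this with the Python standard library. It is a sketch under assumptions: build_refresh_request and its parameters are hypothetical names, and the DPoP proof is taken as an already-signed JWT string produced elsewhere.

```python
import urllib.parse
import urllib.request


def build_refresh_request(base_url: str, refresh_token: str,
                          dpop_proof: str) -> urllib.request.Request:
    """Build (but do not send) the POST to /_jit2fa/refresh."""
    body = urllib.parse.urlencode({"refresh_token": refresh_token})
    return urllib.request.Request(
        base_url.rstrip("/") + "/_jit2fa/refresh",
        data=body.encode("ascii"),
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # The proof must be bound to this method and this URL.
            "DPoP": dpop_proof,
        },
    )
```

Sending the request with urllib.request.urlopen and decoding the JSON body yields the rotated pair; the client should then discard the old refresh token.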

  Error responses use standard OAuth error codes (RFC 6749 Section 5.2):
    HTTP 400 {"error":"invalid_request"} — missing refresh_token or form parse error
    HTTP 401 {"error":"invalid_grant"} — token expired, invalid, wrong audience,
             not DPoP-bound, missing auth_time, max session exceeded
    HTTP 401 {"error":"invalid_dpop_proof"} — DPoP proof missing, invalid, or
             key mismatch (RFC 9449 extension)
    HTTP 403 {"error":"invalid_request"} — refresh_token_ttl not configured
    HTTP 500 {"error":"server_error"} — access token mint failed

  All error responses include error_description with a human-readable
  reason. Standard client libraries (AppAuth, oidc-client-ts) parse the
  error code to decide: invalid_grant = re-authenticate, server_error =
  retry, invalid_dpop_proof = fix the DPoP proof.
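
The error-code dispatch described above might look like the following in a hand-written client. The action names and the function are hypothetical. Treating unknown or unparseable errors as "reauthenticate" is an assumption made for this sketch, not documented gateway behavior.

```python
import json

# Mapping taken from the paragraph above; action labels are illustrative.
_ACTIONS = {
    "invalid_grant": "reauthenticate",
    "server_error": "retry",
    "invalid_dpop_proof": "fix_proof",
}


def refresh_error_action(body: str) -> str:
    """Decide what to do after a failed call to the refresh endpoint."""
    try:
        code = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object (assumption: start over).
        return "reauthenticate"
    return _ACTIONS.get(code, "reauthenticate")
```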

  Token rotation: every refresh call returns a NEW access token + a
  NEW refresh token. Both get TTL = access_token_ttl. The rotated
  refresh token inherits the original auth_time claim so the absolute
  session lifetime is preserved through every rotation.

  DPoP binding: the DPoP proof on the refresh request MUST be signed
  with the SAME key that was used at the original dpop_jkt entry
  (RFC 9449 section 5 strict binding). The gateway checks:
    proof.thumbprint == token.cnf.jkt
  A different key = rejected. Key rotation = re-authenticate.

  Stateless design: refresh tokens are signed JWTs (not opaque strings),
  validated by signature + exp + audience suffix (":refresh"). No
  server-side storage. The auth_time claim is the session boundary.

  Absolute session lifetime: enforced via auth_time on the refresh JWT.
  The handler checks: now - auth_time > refresh_token_ttl. When exceeded,
  returns HTTP 401 and the client must re-authenticate (full login + OTP).
  A refresh token with auth_time=0 (malformed or crafted) is also rejected
  to prevent bypassing this check.
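
The absolute-lifetime rule can be expressed in a few lines. This is a sketch of the check as described above, with all values in seconds; the function name is hypothetical and the real handler may apply further validation.

```python
def refresh_allowed(auth_time: int, now: int, refresh_token_ttl: int) -> bool:
    """Return False when the session has outlived refresh_token_ttl."""
    if auth_time <= 0:
        # Missing or crafted auth_time: the lifetime cannot be enforced.
        return False
    # Exceeded when now - auth_time > refresh_token_ttl.
    return now - auth_time <= refresh_token_ttl
```

For example, with refresh_token_ttl set to 8 hours (28800 seconds), a token whose auth_time is 9 hours old is refused regardless of how recently it was rotated.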

  Client stops refreshing: if the client doesn't refresh before the
  current refresh token expires (TTL = access_token_ttl), the JWT
  validator rejects it (exp passed) and the client must re-authenticate.

Bearer token 401 responses (what mobile apps see):

  When a bearer-authenticated request fails, the gateway returns HTTP 401
  with a WWW-Authenticate challenge header. The response format follows
  RFC 6750 (Bearer) and RFC 9449 (DPoP) so standard OAuth client libraries
  can branch on the scheme.

  Token expired or invalid (non-DPoP):
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Bearer realm="<audience>", error="invalid_token",
                      error_description="token is not valid"
    Content-Type: text/plain

    token is not valid

  Token expired or invalid (DPoP-bound):
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: DPoP algs="ES256 ES384 ES512 RS256 EdDSA",
                      realm="<audience>", error="invalid_token",
                      error_description="token is not valid"
    Content-Type: text/plain

    token is not valid

  Missing DPoP proof on a DPoP-bound token:
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: DPoP algs="ES256 ES384 ES512 RS256 EdDSA",
                      realm="<audience>", error="invalid_token",
                      error_description="DPoP proof header required for this token"

  DPoP proof thumbprint mismatch (possible theft):
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: DPoP ..., error_description="DPoP proof does not match token binding"

  Audience mismatch (cross-mapping replay attempt):
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Bearer realm="<audience>", error="invalid_token",
                      error_description="token audience does not match this mapping"

  Mobile app standard response to 401:
    1. If refresh token available: POST /_jit2fa/refresh with DPoP proof
    2. If refresh succeeds: retry the original request with the new access token
    3. If refresh fails (401/403): re-run the full authentication flow
    4. If no refresh token: re-run the full authentication flow
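
For clients that do not use an OAuth library, the challenge parsing and the four-step response above can be sketched as follows. Both function names and the step labels are hypothetical. The parser handles only the quoted key="value" form shown in the examples above and expects the header as a single line.

```python
import re
from typing import Optional


def parse_challenge(value: str) -> tuple[str, dict[str, str]]:
    """Split a WWW-Authenticate value into (scheme, parameters)."""
    scheme, _, rest = value.strip().partition(" ")
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return scheme, params


def next_step(has_refresh_token: bool,
              refresh_succeeded: Optional[bool] = None) -> str:
    """Mirror the numbered list above; None means refresh not attempted yet."""
    if not has_refresh_token:
        return "reauthenticate"   # step 4
    if refresh_succeeded is None:
        return "refresh"          # step 1
    if refresh_succeeded:
        return "retry_request"    # step 2
    return "reauthenticate"       # step 3
```

The scheme returned by parse_challenge tells the client whether the token it holds is DPoP-bound, and error_description gives the reason string to log.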

Known limitations:

  - Server-side callers are not supported. Tokens are delivered in the URL
    fragment (#access_token=...) which is not sent to servers. If a future
    caller needs server-side delivery, a POST-based mode will be added in
    a follow-up release.
  - When a mapping has both credential replay (inject_credentials = true)
    and token handoff enabled, browser users going through the regular
    login flow still get credential replay as today. Only callers who
    entered via the token handoff entry URL skip replay in favor of the
    bearer-token handoff. Both modes coexist cleanly.

Troubleshooting

Common symptoms and diagnostic steps:

User submits login but sees an error instead of OTP page:

  - Webhook failure: check webhook URL reachability and response format
  - JSONPath mismatch: verify success_field and success_value match the webhook response
  - No email in response: extract_email JSONPath must resolve to a valid email address
  - Webhook timeout: increase timeout if backend validation is slow (default 5s)
  - Form field names wrong: username_field and password_field must match the HTML form

OTP email not received:

  - Check SMTP configuration: 'smtp health' to verify email delivery system
  - Email address extraction: webhook must return email at the configured JSONPath
  - Rate limiting: protection module may throttle OTP requests
  - Check email OTP module health: OTP generation depends on the emailotp service

OTP verification fails (invalid code):

  - Expired OTP: default validity is 5 minutes, user may have waited too long
  - Max retries exceeded: after max_retries (default 3), session is invalidated
  - Wrong mapping context: DeviceID is mappingID:sessionID, must match original
  - Clock skew: cluster nodes must have synchronized time for OTP validation

Credential replay fails after OTP success:

  - Private key cookie missing: browser may have cleared cookies or cookie expired (5 min)
  - Session expired: NATS session data has TTL, check if ciphertext still exists
  - Decryption error: private key cookie must match the public key used for encryption
  - Backend rejected replayed POST: CSRF token in original form may have expired
  - Content-Type mismatch: replayed request preserves original Content-Type header

Auth header mode not working (inject_credentials = false):

  - Missing add_auth_headers = true on parent proxy mapping configuration
  - Backend not configured to trust X-Hexon-* headers
  - Redirect loop: login_url must match the path the backend expects for login
  - Session cookie not set: check browser cookie settings and SameSite policy

Session issues (user keeps getting redirected to login):

  - Cookie blocked: Secure flag requires HTTPS, SameSite=Strict blocks cross-origin
  - Session storage: verify NATS/JetStream connectivity for session persistence
  - Multiple domains: session cookies are domain-scoped, check cookie domain setting
  - Logout path regex matching too broadly: verify logout_path_regex specificity

Login POST not intercepted:

  - Regex syntax: login_path_regex uses Go regexp syntax (RE2)
  - Path normalization: check if proxy rewrites the path before JIT-2FA sees it
  - Method filter: only POST requests to login_path_regex trigger interception
  - Case sensitivity: regex is case-sensitive by default
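
To test a pattern offline, the interception rule in the list above can be approximated as below. This is a sketch under assumptions: Python's re module stands in for Go's RE2 engine (the two agree on simple patterns but not on every construct), unanchored matching is assumed, and the function name is hypothetical.

```python
import re


def should_intercept(method: str, path: str, login_path_regex: str) -> bool:
    """Approximate the documented rule: POST only, case-sensitive regex."""
    if method != "POST":
        return False  # other methods never trigger interception
    # Assumption: the pattern may match anywhere unless anchored with ^ and $.
    return re.search(login_path_regex, path) is not None
```

Run the check against the path as the middleware sees it, that is, after any proxy rewrite. A leading (?i) makes the pattern case-insensitive in both engines.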

Performance and webhook diagnostics:

  - Webhook latency: high timeout values block the user login flow
  - Connection pooling: webhook HTTP transport shares pool settings with proxy
  - Cluster-wide OTP tracking: retries tracked across all cluster nodes

Security

Cryptographic design and security properties:

Encryption model (credential replay mode):

  NaCl box authenticated encryption using X25519 key agreement, XSalsa20 stream
  cipher, and Poly1305 message authentication. Fresh X25519 keypair generated per
  login attempt. Ciphertext includes 32-byte ephemeral public key and 16-byte
  authentication tag (48 bytes overhead total).

Split-knowledge architecture:

  Server stores: encrypted body ciphertext and public key (cannot decrypt alone)
  Client stores: private key in HttpOnly cookie (cannot access ciphertext alone)
  Both halves required to recover plaintext credentials. Compromise of either
  storage in isolation reveals nothing about the original credentials.

Cookie security:

  Private key cookie attributes: HttpOnly, Secure, SameSite=Strict, Max-Age=300
  - HttpOnly: prevents JavaScript access to private key
  - Secure: only transmitted over HTTPS connections
  - SameSite=Strict: cookie is never sent on cross-site requests, so a forged cross-site request cannot carry the private key
  - Max-Age=300: 5-minute window to complete OTP verification
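
The attribute set above corresponds to a Set-Cookie value like the one this sketch produces with Python's http.cookies. It illustrates the attributes only and is not the gateway's code; the function name is hypothetical and the cookie name is supplied by the caller because the document does not state it.

```python
from http.cookies import SimpleCookie


def private_key_cookie(name: str, value: str) -> str:
    """Build a Set-Cookie value carrying the four documented attributes."""
    cookie = SimpleCookie()
    cookie[name] = value
    morsel = cookie[name]
    morsel["httponly"] = True      # not readable from JavaScript
    morsel["secure"] = True        # HTTPS only
    morsel["samesite"] = "Strict"  # never sent on cross-site requests
    morsel["max-age"] = 300        # 5 minutes to finish OTP verification
    return morsel.OutputString()
```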

Memory safety:

  - Plaintext credentials zeroed immediately after encryption
  - Private key zeroed on server side immediately after decryption
  - Zeroing uses subtle.ConstantTimeCopy to prevent compiler optimization
  - No plaintext credentials ever written to disk or session storage

OTP security:

  - OTP hashed with bcrypt before storage (not stored in plaintext)
  - Constant-time comparison prevents timing side-channel attacks
  - Cluster-wide retry tracking prevents distributed brute-force attempts
  - Rate limiting inherited from protection module
  - DeviceID binding: OTP tied to specific mapping and session (prevents reuse)

Webhook security:

  - Webhook URL should use HTTPS for credential transmission
  - Webhook timeout prevents slow-loris style resource exhaustion
  - Credentials sent to webhook only, never stored in plaintext on server
  - JSONPath extraction validates response structure before proceeding

Auth header mode security:

  - No credential storage or encryption needed (eliminates cryptographic attack surface)
  - Backend must be configured to only trust headers from the gateway IP
  - X-Hexon-* headers stripped from external requests by the proxy layer
  - Session-based: authentication state maintained via secure session cookie

CSRF protection:

  - Original form CSRF tokens preserved in encrypted body for replay
  - OTP form uses separate anti-replay mechanism
  - SameSite=Strict cookies prevent cross-site request forgery

Relationships

Module dependencies and interactions:

proxy: Parent module. JIT-2FA is configured per proxy mapping and runs as middleware in the proxy request pipeline. Auth header mode requires add_auth_headers = true on the mapping. Proxy handles X-Hexon-* header injection on authenticated requests.
authentication.emailotp: Provides OTP generation, delivery, and verification. JIT-2FA delegates all OTP operations to emailotp using DeviceID format of mappingID:sessionID for cluster-wide tracking. OTP settings (type, length, validity, max_retries, resend_time) can be overridden per mapping or fall back to global emailotp defaults.
smtp: Email delivery for OTP codes. SMTP health directly affects OTP delivery. Check smtp health when OTP emails are not received.
sessions: Session storage via NATS/JetStream. Stores encrypted credentials (replay mode) or username/email (header mode). Session TTL governs how long authenticated state persists. Session destruction on logout.
protection.ratelimit: Rate limiting for login attempts and OTP submissions. Prevents brute-force attacks on both webhook validation and OTP verification.
identity.directory: User identity enrichment. In auth header mode, directory attributes populate X-Hexon-* headers (user, email, groups, display name).
config: Per-mapping configuration under [proxy.mapping.jit2fa]. Webhook, OTP, and transport settings are all configurable. Changes require proxy mapping reload to take effect.
protection.pow: Related but independent POST body preservation mechanism. PoW uses symmetric AES-256-GCM for short-lived form data during proof-of-work challenges. JIT-2FA uses asymmetric NaCl box for longer-lived credential storage during OTP verification. Both implement split-knowledge security but with different threat models and durations.
telemetry: Structured logging for login interceptions, webhook calls, OTP events, encryption operations, and session lifecycle. Metrics for monitoring JIT-2FA health and usage patterns.

Logs

Log entries by operation. Search with: logs search "jit2fa". Levels: ERROR > WARN > INFO > DEBUG.

Login Interception:

  jit2fa.intercept        INFO   AUDIT  Login POST intercepted
  jit2fa.parse_error      WARN          Failed to extract credentials from login form
  jit2fa.credentials      INFO   AUDIT  Credentials extracted from login form

Webhook Validation:

  jit2fa.validate_webhook DEBUG         Validating credentials via webhook
  jit2fa.webhook          INFO   AUDIT  Webhook validation successful / invalid credentials
  jit2fa.webhook          ERROR  AUDIT  Webhook validation failed (HTTP error)

OTP:

  jit2fa.otp              INFO   AUDIT  OTP sent successfully
  jit2fa.otp              ERROR  AUDIT  Failed to send OTP
  jit2fa.otp.verify       INFO   AUDIT  OTP verification successful / failed
  jit2fa.resend           WARN   AUDIT  Failed to extend session expiry on resend

Session:

  jit2fa.session          INFO   AUDIT  Authenticated session created (replay/header/two-phase/token_handoff)
  jit2fa.redirect         INFO   AUDIT  No valid session, redirecting to login
  jit2fa.logout           INFO   AUDIT  Logout intercepted, clearing session

Rate Limiting:

  jit2fa.ratelimit.status DEBUG         Rate limit check passed
  jit2fa.ratelimit        WARN          Rate limit check failed (fail-open)

Token Handoff — Entry Path:

  jit2fa.handoff.entry      INFO   AUDIT  Rejected: missing return_url query parameter
  jit2fa.handoff.entry      WARN   AUDIT  Rejected: return_url not in allowed_return_urls
  jit2fa.handoff.entry      INFO   AUDIT  Rejected: dpop_jkt malformed (charset or length)
  jit2fa.handoff.entry      INFO   AUDIT  Rejected: require_dpop=true but caller did not supply dpop_jkt
  jit2fa.handoff.entry      INFO   AUDIT  Valid URL, no session — redirecting to login (dpop_bound=true|false)
  jit2fa.handoff.entry      INFO   AUDIT  Valid session — minting directly (fast path, dpop_bound=true|false)

Token Handoff — JKT Cookie:

  jit2fa.handoff.jkt_cookie WARN   AUDIT  Handoff JKT cookie failed revalidation (tampered or truncated)

Token Handoff — Mint Step:

  jit2fa.handoff.mint       ERROR  AUDIT  Revalidation failed before mint (cookie tamper suspected)
  jit2fa.handoff.mint       ERROR  AUDIT  Refusing to mint without username
  jit2fa.handoff.mint       ERROR  AUDIT  require_dpop=true but no dpop_jkt reached finalize (caller bypassed entry)
  jit2fa.handoff.mint       ERROR  AUDIT  return_url malformed after fragment strip (operator wildcard too permissive)
  jit2fa.handoff.mint       ERROR  AUDIT  oidc.MintBearerToken call failed
  jit2fa.handoff.mint       ERROR  AUDIT  oidc.MintBearerToken returned error
  jit2fa.handoff.mint       INFO   AUDIT  Minted access token and redirecting caller
                                           (fields: username, audience, expires_in, dpop_bound, dpop_jkt?)

Token Handoff — Bearer Top-of-Tree Check:

  jit2fa.handoff.bearer     INFO   AUDIT  Authorization header present but token is empty
  jit2fa.handoff.bearer     ERROR  AUDIT  Validator call failed (oidc.ValidateIDToken hexdcall error)
  jit2fa.handoff.bearer     WARN   AUDIT  Token rejected by validator (bad sig / expired / wrong issuer)
  jit2fa.handoff.bearer     WARN   AUDIT  Audience mismatch (cross-mapping token replay attempt — alert signal)
  jit2fa.handoff.bearer     INFO   AUDIT  require_dpop=true but token has no cnf.jkt (legacy client post-rollout)
  jit2fa.handoff.bearer     INFO   AUDIT  DPoP-bound token but no DPoP header on request (client bug)
  jit2fa.handoff.bearer     ERROR  AUDIT  oidc.ValidateDPoP hexdcall call failed
  jit2fa.handoff.bearer     INFO   AUDIT  DPoP proof rejected by validator (stale iat / wrong htu / replayed jti)
  jit2fa.handoff.bearer     WARN   AUDIT  DPoP proof thumbprint does not match token cnf.jkt — possible token theft
  jit2fa.handoff.bearer     INFO   AUDIT  Accepted, forwarding to backend
                                           (fields: username, audience, dpop_bound, dpop_jkt?)

Token Handoff — DPoP Proof Validation:

  jit2fa.handoff.bearer.dpop INFO  AUDIT  DPoP proof validated, thumbprint matches token cnf.jkt
                                           (fields: username, dpop_jkt, htm, htu — one line per
                                           bearer-authenticated API call on a DPoP-bound mapping)

Token Handoff — Refresh:

  jit2fa.handoff.refresh     INFO   AUDIT  Missing refresh_token parameter
  jit2fa.handoff.refresh     INFO   AUDIT  Token rejected by validator (expired or invalid)
  jit2fa.handoff.refresh     INFO   AUDIT  Audience mismatch (not a refresh token for this mapping)
  jit2fa.handoff.refresh     INFO   AUDIT  Token not DPoP-bound
  jit2fa.handoff.refresh     INFO   AUDIT  Missing DPoP proof header
  jit2fa.handoff.refresh     INFO   AUDIT  DPoP proof rejected by validator
  jit2fa.handoff.refresh     WARN   AUDIT  DPoP thumbprint mismatch — different key (abuse signal)
  jit2fa.handoff.refresh     INFO   AUDIT  Token has no valid auth_time (cannot enforce session lifetime)
  jit2fa.handoff.refresh     INFO   AUDIT  Absolute session lifetime exceeded (now - auth_time > max)
  jit2fa.handoff.refresh     ERROR         ValidateIDToken call failed (hexdcall error)
  jit2fa.handoff.refresh     ERROR         DPoP proof validation call failed (hexdcall error)
  jit2fa.handoff.refresh     ERROR         Failed to mint new access token
  jit2fa.handoff.refresh     WARN          Failed to mint rotated refresh token (returning access only)
  jit2fa.handoff.refresh     INFO   AUDIT  Minted new token pair (success)
                                           (fields: username, audience, access_expires_in,
                                           session_remaining_hours, dpop_jkt)

Log level policy:

  - INFO+AUDIT for routine rejections caused by malformed client input
    (missing params, stale proofs, client-side bugs, rollout friction).
    These land in the audit stream for trace reconstruction but do
    not trigger operator alerts.
  - WARN+AUDIT only for events that indicate abuse or attack:
    open-redirect whitelist probing, signature forgery, cross-mapping
    replay attempts, DPoP thumbprint mismatches. Alert on these.
  - ERROR+AUDIT for internal system errors (hexdcall failures, signing
    key missing, cookie tamper on revalidation) that need operator
    investigation regardless of attack status.

The bearer “accepted” path fires per request on DPoP-bound mappings. On high-throughput SPAs hitting the backend at 50 rps, this can generate 50 audit lines per second per user per mapping. Filter at the log sink by event name + result if volume is a problem — losing the accepted-path record at the emit site is a security regression, so the event is always emitted.

Full per-user audit trace pattern (grep):

  mapping_id=<ID> AND username=<user> AND event in
    {jit2fa.handoff.entry, jit2fa.handoff.mint,
     jit2fa.handoff.bearer, jit2fa.handoff.bearer.dpop}
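
When the audit stream is exported as structured records, the same trace pattern can be applied in code. The sketch below assumes each record is a dict with mapping_id, username, and event keys. That record shape is an assumption for illustration, not the documented export format.

```python
# Event names copied from the trace pattern above.
TRACE_EVENTS = {
    "jit2fa.handoff.entry",
    "jit2fa.handoff.mint",
    "jit2fa.handoff.bearer",
    "jit2fa.handoff.bearer.dpop",
}


def user_trace(records: list[dict], mapping_id: str, username: str) -> list[dict]:
    """Keep only token-handoff events for one user on one mapping."""
    return [
        r for r in records
        if r.get("mapping_id") == mapping_id
        and r.get("username") == username
        and r.get("event") in TRACE_EVENTS
    ]
```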

Metrics

Prometheus metrics. Query with: metrics prometheus jit2fa_<name>

Operations:

  jit2fa_login_attempts_total             counter    {mapping_id}              Login interceptions
  jit2fa_webhook_validations_total        counter    {mapping_id, result}      Webhook results (success/failure)
  jit2fa_webhook_validation_duration      latency    {mapping_id}              Webhook response time
  jit2fa_otp_verifications_total          counter    {mapping_id, result, reason?}  OTP results (success/invalid/expired/max_retries/error)
  jit2fa_sessions_created_total           counter    {mapping_id}              Sessions created
  jit2fa_otp_resends_total                counter    {mapping_id, result}      OTP resend attempts
  jit2fa_rate_limited_total               counter    {mapping_id}              Rate-limited requests

Token Handoff:

  jit2fa_handoff_entry_total              counter    {mapping_id, reason, dpop_bound}
                                                     Entry path visits by outcome and DPoP binding state
                                                     reasons: missing_return_url, invalid_return_url,
                                                              missing_dpop_jkt, invalid_dpop_jkt,
                                                              redirect_login, direct_mint,
                                                              form_post (parallel entry: the login POST
                                                              carried _jit2fa_return_url + optional
                                                              _jit2fa_dpop_jkt, and the middleware treated
                                                              the whole thing as a handoff request rather
                                                              than the traditional credential-replay flow)
                                                     dpop_bound: "true" when the caller supplied a valid
                                                              dpop_jkt query parameter (or form field),
                                                              "false" otherwise. Early-rejection paths
                                                              (before dpop_jkt parse) always emit "false".
  jit2fa_handoff_mints_total              counter    {mapping_id, result, reason?, dpop_bound}
                                                     Mint step outcomes by result, reason, and binding
                                                     failure reasons: revalidate_failed, malformed_return_url,
                                                              missing_identity, missing_dpop_jkt, oidc_error
                                                     dpop_bound: "true" when the minted (or attempted)
                                                              token carries a cnf.jkt confirmation claim.
                                                              Use this dimension for DPoP adoption tracking:
                                                                sum by (dpop_bound) (rate(
                                                                  jit2fa_handoff_mints_total{
                                                                    result="success"
                                                                  }[5m]))
  jit2fa_handoff_mint_duration            latency    {mapping_id}
                                                     Time from finalizeTokenHandoff entry to mint response
  jit2fa_handoff_bearer_checks_total      counter    {mapping_id, result, reason?, dpop_bound}
                                                     Bearer check outcomes by result, reason, binding
                                                     rejected reasons: empty_token, validator_error, invalid_token,
                                                              audience_mismatch, token_not_dpop_bound, missing_dpop_header,
                                                              dpop_validator_error, dpop_proof_invalid, dpop_jkt_mismatch
                                                     dpop_bound: "true" when the presented token has a
                                                              cnf.jkt claim, "false" otherwise. Early-rejection
                                                              paths (empty_token, validator_error, invalid_token)
                                                              emit "false" since the token was not parsed.
                                                     DPoP usage query:
                                                       sum by (dpop_bound) (rate(
                                                         jit2fa_handoff_bearer_checks_total{
                                                           result="accepted"
                                                         }[5m]))
  jit2fa_handoff_bearer_check_duration    latency    {mapping_id}
                                                     Time from bearer header parse to validation outcome
                                                     (full cost: JWT validate + optional DPoP proof validate)
  jit2fa_handoff_dpop_validation_duration latency    {mapping_id}
                                                     Isolated cost of oidc.ValidateDPoP alone — component of
                                                     handoff_bearer_check_duration, emitted on every DPoP
                                                     proof validation attempt (success or failure). Use this
                                                     to tell JWT slowness apart from DPoP slowness when the
                                                     bearer check p99 regresses.

Token Refresh:

  jit2fa_handoff_refresh_total              counter    {mapping_id, result, reason?}
                                                       Refresh endpoint outcomes (success/failure)
                                                       failure reasons: disabled, parse_error, missing_token,
                                                                invalid_token, wrong_audience, not_dpop_bound,
                                                                missing_dpop, dpop_invalid, dpop_mismatch,
                                                                missing_auth_time, max_session, mint_failed
  jit2fa_handoff_refresh_duration           latency    {mapping_id}
                                                       Full refresh handler wall-clock latency

Alerts:

  # Backend / operational
  rate(jit2fa_webhook_validations_total{result="failure"}[5m]) > 5                       Webhook backend issues
  increase(jit2fa_otp_verifications_total{reason="max_retries"}[5m]) > 0                 OTP brute-force attempt
  rate(jit2fa_rate_limited_total[5m]) > 10                                               High rate limiting

  # Token handoff — abuse signals (page on these)
  rate(jit2fa_handoff_entry_total{reason="invalid_return_url"}[5m]) > 2                  Possible open-redirect probing against the whitelist
  rate(jit2fa_handoff_bearer_checks_total{reason="audience_mismatch"}[5m]) > 0           Cross-mapping token replay attempt (alert immediately)
  rate(jit2fa_handoff_bearer_checks_total{reason="invalid_token"}[5m]) > 20              High invalid-token rate (bot scan or clock drift)
  rate(jit2fa_handoff_bearer_checks_total{reason="dpop_jkt_mismatch"}[5m]) > 0           DPoP thumbprint mismatch — possible stolen token (alert immediately)
  rate(jit2fa_handoff_refresh_total{reason="dpop_mismatch"}[5m]) > 0                    Refresh with wrong DPoP key — stolen refresh token attempt

  # Token handoff — capacity / latency
  histogram_quantile(0.99, sum by (le) (rate(jit2fa_handoff_mint_duration_bucket[5m]))) > 0.5              Token signing p99 slow (OIDC signer degraded)
  histogram_quantile(0.99, sum by (le) (rate(jit2fa_handoff_bearer_check_duration_bucket[5m]))) > 0.1      Bearer check p99 slow (hexdcall / oidc validation contention)
  histogram_quantile(0.99, sum by (le) (rate(jit2fa_handoff_dpop_validation_duration_bucket[5m]))) > 0.05  DPoP proof validation p99 slow (ECDSA cost or replay cache contention)

  # Token handoff — DPoP rollout tracking (not alerts, dashboard panels)
  sum by (dpop_bound) (rate(jit2fa_handoff_mints_total{result="success"}[5m]))           Mint-time DPoP adoption ratio
  sum by (dpop_bound) (rate(jit2fa_handoff_bearer_checks_total{result="accepted"}[5m]))  Bearer-use DPoP adoption ratio
  rate(jit2fa_handoff_bearer_checks_total{reason="token_not_dpop_bound"}[5m])            Legacy clients on a require_dpop mapping (expected to drop to 0 after rollout)
  rate(jit2fa_handoff_entry_total{reason="missing_dpop_jkt"}[5m])                        Clients hitting a require_dpop entry without dpop_jkt (same signal, earlier in the flow)

Kerberos Ticket Management & SPNEGO Browser SSO

Authenticates users via Kerberos tickets — browser SSO through SPNEGO and ticket proxying for SSH bastion

Overview

Authenticates users via Kerberos — browser SSO through SPNEGO negotiation and ticket proxying for the SSH bastion. The gateway is not part of the Kerberos realm. It authenticates to the KDC on behalf of users and manages tickets in memory. Applies to Active Directory and FreeIPA environments where Kerberos is the primary authentication protocol.

Two modes:

  - Browser SSO (SPNEGO) — transparent authentication for domain-joined browsers
  - Ticket proxy (bastion) — acquires TGTs for SSH jump host delegation

Passwords never touch disk. Tickets are stored as encrypted sessions with TTL synchronized to the Kerberos ticket lifetime. CCache output is MIT Kerberos compatible (version 4, big-endian) — works with SSH GSSAPI, kinit, klist, and all standard tools.
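
A quick way to confirm that an exported credential cache is the version 4 file format is to inspect its first two bytes: in the MIT file format the first byte is always 5 and the second byte is the version number. The helper below is a sketch with a hypothetical name; it checks only the header and does not parse principals or credentials.

```python
def ccache_version(data: bytes) -> int:
    """Return the format version from the first bytes of a ccache file."""
    if len(data) < 2 or data[0] != 0x05:
        raise ValueError("not an MIT credential cache file")
    return data[1]  # 4 for the format described above
```

Reading the file with open(path, "rb").read(2) is enough input for this check.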

Additional capabilities:

ACL protection for ticket retrieval operations
Password change via kpasswd protocol (RFC 3244)
Security audit logging for all Kerberos operations
Prometheus metrics for ticket lifecycle monitoring

Platform notes: memory locking requires CAP_IPC_LOCK (container: --cap-add=IPC_LOCK). Degrades gracefully if memory locking is unavailable.

Config

Kerberos module configuration:

[authentication.kerberos]
  realm = "EXAMPLE.COM"           # Kerberos realm (uppercase by convention)
  kdc = "kdc.example.com"         # Key Distribution Center address
  ticket_ttl = "8h"               # Ticket lifetime (default: 8 hours)
  password_change = true           # Enable kpasswd password change (default: false)
  kpasswd_path = "/usr/bin/kpasswd"  # Optional: override kpasswd binary path

Ticket storage model:

  Tickets are stored as sessions (type: "kerberos") indexed by the Kerberos
  principal (e.g., "alice@EXAMPLE.COM"). Session metadata includes: CCache
  bytes (auto-encrypted), ticket type, realm, principal, creation timestamp,
  and authentication method.

  This provides:
    - Principal-based indexing for fast user lookup
    - Automatic TTL expiration matching Kerberos ticket lifetime
    - Distributed storage with encryption across cluster
    - Cluster-wide ticket access from any node

Password change feature:

  When password_change = true, users can change their Kerberos passwords
  via the ChangePassword operation. Uses standard kpasswd protocol (RFC 3244).
  Password complexity is enforced by the KDC policy, not Hexon.
  All existing tickets are automatically revoked after a successful change.
  Requires kpasswd binary (auto-detected in PATH or specify kpasswd_path).

Hot-reloadable: ticket_ttl, password_change. Cold (restart required): realm, kdc, kpasswd_path.
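
The hot/cold split can be illustrated with a minimal sketch. The function name classify_changes and the dict representation of the [authentication.kerberos] table are assumptions made for illustration, not Hexon's reload API; only the key names come from the list above.

```python
# Keys taken from the hot/cold split documented above.
HOT_KEYS = {"ticket_ttl", "password_change"}
COLD_KEYS = {"realm", "kdc", "kpasswd_path"}


def classify_changes(old: dict, new: dict):
    """Return (hot, cold): changed keys that apply live vs. need a restart."""
    changed = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
    return changed & HOT_KEYS, changed & COLD_KEYS


old = {"realm": "EXAMPLE.COM", "kdc": "kdc.example.com", "ticket_ttl": "8h"}
new = {"realm": "EXAMPLE.COM", "kdc": "kdc2.example.com", "ticket_ttl": "4h"}
hot, cold = classify_changes(old, new)
print("applied live:", sorted(hot))
print("restart required:", sorted(cold))
```

A non-empty second set is the signal that a restart is needed before the new values take effect.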

Troubleshooting

Common symptoms and diagnostic steps:

AcquireTicket fails with authentication error:

  - Verify KDC is reachable: check network connectivity to kdc address
  - Verify realm is correct (must be uppercase by Kerberos convention)
  - Check user credentials: invalid password returns auth_failed
  - KDC clock skew: Kerberos requires clocks within 5 minutes (check NTP)
  - DNS resolution: KDC hostname must resolve correctly
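
  The clock-skew item can be checked mechanically once a KDC-side timestamp is available. The sketch below only shows the comparison; how the KDC time is obtained is left open, and within_skew is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # tolerance stated in the checklist above


def within_skew(local: datetime, kdc: datetime, max_skew: timedelta = MAX_SKEW) -> bool:
    """True when the two clocks differ by no more than max_skew."""
    return abs(local - kdc) <= max_skew


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(within_skew(now, now + timedelta(minutes=4)))  # inside the window
print(within_skew(now, now - timedelta(minutes=6)))  # outside the window
```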

Ticket not found after acquisition:

  - Check session storage health across cluster
  - Verify session TTL has not expired (matches ticket_ttl config)
  - Check cluster quorum status: tickets require quorum for distributed write
  - Verify cluster connectivity between nodes

GetTicket returns access denied:

  - ACLs control which modules can retrieve tickets
  - Only authorized modules (SSH proxy, bastion) should have access
  - Check ACL configuration in the cluster authorization policy
  - Verify the calling module is in the allowed list

SSH GSSAPI authentication fails with valid ticket:

  - Verify CCache format compatibility: use 'klist -c <file>' to inspect
  - Check KRB5CCNAME environment variable is set to the temp file path
  - Verify the ticket principal matches the SSH service principal
  - Check ticket expiration: expired tickets are rejected by SSH server
  - Ensure SSH server has GSSAPIAuthentication enabled

WriteTicketFile fails:

  - Check filesystem permissions for temp directory
  - Verify disk space available for temporary file creation
  - Remember: caller MUST securely delete temp file after use

Reflection errors (TGT extraction):

  - gokrb5 internal structure may change between versions
  - Module is pinned to gokrb5 v8.4.4; do not upgrade without testing
  - Check structured logs for reflection failure messages
  - Fallback behavior may apply if structure changes

Password change fails:

  - Verify password_change = true in configuration
  - Check kpasswd binary availability (auto-detect or kpasswd_path)
  - KDC password policy may reject the new password (complexity requirements)
  - Check structured logs for kpasswd protocol errors
  - Verify KDC supports kpasswd protocol (RFC 3244)

Memory locking warnings:

  - CAP_IPC_LOCK capability required for mlockall
  - Container: add --cap-add=IPC_LOCK to docker run
  - Kubernetes: add IPC_LOCK to securityContext capabilities
  - Without memory locking, passwords may be swapped to disk (security risk)

Ticket lifecycle monitoring:

  - kerberos_ticket_acquisition_total: track acquisition success/failure
  - kerberos_ticket_refresh_total: monitor refresh operations
  - kerberos_ticket_revocation_total: verify revocation operations
  - kerberos_password_change_total: audit password changes

Diagnostic commands:

  - auth kerberos: check Kerberos health and configuration
  - sessions list --type=kerberos: list active Kerberos ticket sessions
  - health components: verify Kerberos subsystem health

Security

Security model and hardening measures:

In-memory password handling:

  Passwords are typed as []byte (not string) to enable secure clearing.
  Every password is cleared immediately after use.
  gokrb5 authenticates with the KDC entirely in memory. Passwords are
  NEVER written to disk, logs, or any persistent storage.
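
  The reason for a mutable byte type can be shown with a Python analogue (bytearray in place of Go's []byte). This is a sketch of the idea only: the helper names are hypothetical, and Python gives weaker guarantees than Go because the immutable bytes literal passed in cannot itself be wiped.

```python
from contextlib import contextmanager


def wipe(buf: bytearray) -> None:
    """Overwrite every byte of buf with zero, in place."""
    for i in range(len(buf)):
        buf[i] = 0


@contextmanager
def secret(value: bytes):
    """Yield a mutable copy of value and zero it when the block exits."""
    buf = bytearray(value)
    try:
        yield buf
    finally:
        wipe(buf)


with secret(b"correct horse") as password:
    used_length = len(password)  # stand-in for handing the bytes to the KDC client
# password is still bound here, but its contents are now all zero bytes
print(used_length, password == bytearray(used_length))
```

  An immutable string offers no equivalent of wipe, which is why the byte-slice type matters.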

Memory locking:

  mlockall(MCL_CURRENT) prevents the process memory (including passwords
  and ticket data) from being swapped to disk. Requires CAP_IPC_LOCK
  capability. Graceful degradation: logs a warning if locking fails but
  continues operating.
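
  A minimal sketch of the graceful-degradation pattern, written in Python with ctypes for illustration. MCL_CURRENT = 1 is the Linux value and should be treated as an assumption elsewhere; lock_memory and its lock_fn parameter are hypothetical names, and the demo calls use simulated results so the sketch runs without CAP_IPC_LOCK.

```python
import ctypes
import logging

MCL_CURRENT = 1  # Linux value; an assumption on other platforms

log = logging.getLogger("kerberos.init")


def _mlockall() -> int:
    """Call mlockall(MCL_CURRENT) through libc; returns 0 on success."""
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.mlockall(MCL_CURRENT)


def lock_memory(lock_fn=_mlockall) -> bool:
    """Try to lock process memory; warn and carry on if that is not possible."""
    try:
        ok = lock_fn() == 0
    except (OSError, AttributeError):
        ok = False
    if ok:
        log.info("Memory locking enabled")
    else:
        log.warning("Memory locking failed; passwords may be swapped")
    return ok


# Simulated outcomes, so the sketch does not depend on process capabilities.
print(lock_memory(lambda: 0))   # locking succeeded
print(lock_memory(lambda: -1))  # locking refused, service keeps running
```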

Pure in-memory TGT extraction:

  gokrb5 stores the TGT in private internal fields. The module uses Go
  reflection to extract the TGT, session key, timestamps, and renewal data.
  This is version-pinned to gokrb5 v8.4.4, with error handling for structural changes.

CCache format security:

  CCache bytes are built manually in standard MIT Kerberos format (version 4,
  big-endian). This ensures compatibility with all Kerberos tools while
  maintaining full control over the byte layout. No external dependencies
  for marshaling.
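
  The "version 4, big-endian" layout can be checked from the first bytes of any cache. The sketch below reads only the preamble (version marker plus header length) as described in the MIT credential cache file format; read_ccache_preamble is a hypothetical helper, not part of the module.

```python
import struct

CCACHE_V4 = b"\x05\x04"  # file format version 4 marker


def read_ccache_preamble(data: bytes):
    """Return (version, header_length) from the start of a ccache blob.

    All multi-byte integers in a version 4 file are big-endian, hence '>'.
    """
    if len(data) < 4 or data[:2] != CCACHE_V4:
        raise ValueError("not a version 4 credential cache")
    (header_len,) = struct.unpack(">H", data[2:4])
    if len(data) < 4 + header_len:
        raise ValueError("truncated header")
    return 4, header_len


# Version marker, then a 12-byte header holding one field:
# tag 1 (KDC time offset), length 8, two 32-bit zero values.
sample = CCACHE_V4 + struct.pack(">HHHII", 12, 1, 8, 0, 0)
print(read_ccache_preamble(sample))
```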

Sessions encryption:

  CCache bytes stored in session metadata are automatically encrypted at
  rest by the sessions module. No manual encryption is needed. Encryption
  keys are managed by the sessions infrastructure.

Access control (defense in depth):

  - ACLs restrict which modules can retrieve and revoke tickets
  - Typically limited to SSH proxy, bastion, and service delegation modules
  - ACL configuration in the cluster authorization policy
  - Encryption provides second layer even if ACL is misconfigured

Constant-time comparison:

  Uses crypto/subtle.ConstantTimeCompare for security-sensitive comparisons.
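
  For readers more familiar with Python, hmac.compare_digest plays the role that crypto/subtle.ConstantTimeCompare plays in Go. The sketch below is an analogue only; tokens_match is a hypothetical name and the values are placeholders.

```python
import hmac


def tokens_match(presented: bytes, expected: bytes) -> bool:
    """Compare two secrets without an early exit on the first differing byte."""
    return hmac.compare_digest(presented, expected)


print(tokens_match(b"s3cret-token", b"s3cret-token"))
print(tokens_match(b"s3cret-token", b"s3cret-tokem"))
```

  An ordinary equality check may return as soon as one byte differs, which can leak how much of a guess was correct.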

Secure file handling:

  WriteTicketFile creates files with 0600 permissions. Secure file deletion
  overwrites with random data before removal. Callers MUST securely delete
  temp files after use.
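
  A minimal sketch of the caller-side contract (create with 0600, overwrite, then remove), assuming a POSIX filesystem. The function names mirror the description but are hypothetical Python stand-ins; overwriting in place is best effort, since journaling or copy-on-write filesystems and SSDs may keep older blocks.

```python
import os
import secrets
import tempfile


def write_ticket_file(ccache: bytes, directory: str) -> str:
    """Create a new file readable only by its owner and write the cache to it."""
    path = os.path.join(directory, "krb5cc_" + secrets.token_hex(8))
    # O_EXCL refuses to reuse an existing path; 0o600 = owner read/write only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(ccache)
    return path


def secure_delete(path: str) -> None:
    """Overwrite the file with random bytes, flush to disk, then unlink it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(secrets.token_bytes(size))
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)


with tempfile.TemporaryDirectory() as tmp:
    ticket_path = write_ticket_file(b"\x05\x04 placeholder bytes", tmp)
    mode = os.stat(ticket_path).st_mode & 0o777
    # A caller would export KRB5CCNAME=ticket_path and run its tool here.
    secure_delete(ticket_path)
    still_there = os.path.exists(ticket_path)
print(oct(mode), still_there)
```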

Audit logging:

  All ticket operations (acquire, access, revoke, password change) are logged
  via the telemetry system with structured fields. Security events logged at
  appropriate severity levels for SIEM integration.

On-behalf-of trust boundary:

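  In the proxy model, Hexon is NOT part of the Kerberos realm and requires no
  keytab: users provide credentials directly. (SPNEGO server mode is the
  exception and validates tickets against a service keytab.) The Hexon cluster
  is the security perimeter. Tickets are used for SSH jump hosts, proxies, and
  delegation.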

SPNEGO

SPNEGO/Negotiate browser authentication (server model):

SPNEGO over HTTP (RFC 4559) enables transparent SSO for domain-joined workstations. When a browser hits a protected route, the gateway challenges with "WWW-Authenticate: Negotiate"; the browser then obtains a service ticket from the KDC and sends it back. The gateway validates the ticket against a keytab file, so no password crosses the wire.

This is the SERVER model, contrasting with the existing PROXY model (AcquireTicket) where Hexon authenticates to the KDC on behalf of users.

Two authentication paths (mirrors the X.509 pattern):

  1. Explicit: /signin/kerberos — user navigates here, browser gets 401
     Negotiate challenge, sends SPNEGO token, session created, redirect.
  2. Auto-SPNEGO: When spnego_auto_auth=true, proxy routes try a Negotiate
     challenge before falling back to OIDC redirect. Uses a marker cookie
     (hexon_spnego_tried, 60s TTL) to prevent infinite 401 loops for
     non-domain browsers.
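
  The auto-SPNEGO path can be sketched as a small decision function. The cookie name and the exclusion list come from the text; the function name, the return labels, and the order in which the checks run are assumptions for illustration, and header lookup is simplified to an exact-case dict key.

```python
import ipaddress

MARKER_COOKIE = "hexon_spnego_tried"  # cookie name from the text; TTL is 60s


def spnego_decision(headers, cookies, client_ip, auto_auth, exclude_nets):
    """Return 'validate', 'challenge', or 'fallback' for a proxy-route request."""
    if headers.get("Authorization", "").startswith("Negotiate "):
        return "validate"   # browser answered the challenge with a token
    if not auto_auth:
        return "fallback"   # auto-SPNEGO disabled: go straight to OIDC
    ip = ipaddress.ip_address(client_ip)
    if any(ip in ipaddress.ip_network(net) for net in exclude_nets):
        return "fallback"   # client is inside spnego_exclude_nets
    if MARKER_COOKIE in cookies:
        return "fallback"   # already challenged once: avoid a 401 loop
    return "challenge"      # send 401 with Negotiate and set the marker cookie


nets = ["10.200.0.0/16"]
print(spnego_decision({}, {}, "192.0.2.10", True, nets))
print(spnego_decision({}, {MARKER_COOKIE: "1"}, "192.0.2.10", True, nets))
print(spnego_decision({}, {}, "10.200.3.4", True, nets))
print(spnego_decision({"Authorization": "Negotiate <token>"}, {}, "192.0.2.10", True, nets))
```

  The marker cookie is what turns a second unauthenticated request from a non-domain browser into an OIDC redirect instead of another 401.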

Configuration:

  [authentication.kerberos]
    spnego_enabled = true
    keytab_path = "/etc/krb5.keytab"         # File path (traditional)
    keytab_base64 = ""                       # Base64 string (K8s/containers)
    service_principal = "HTTP/gw.example.com" # Default: HTTP/<service.hostname>
    spnego_auto_auth = false                 # Transparent SPNEGO on proxy routes
    spnego_exclude_nets = ["10.200.0.0/16"]  # Skip auto-SPNEGO for external nets

Keytab setup (FreeIPA example):

  ipa service-add HTTP/gw.example.com
  ipa-getkeytab -s ipa.example.com -p HTTP/gw.example.com -k /etc/krb5.keytab
  chmod 0600 /etc/krb5.keytab

Browser compatibility:

  - Chrome/Edge (Windows/macOS): automatic for domain-joined machines
  - Firefox: requires network.negotiate-auth.trusted-uris configuration
  - Safari (macOS): uses system Kerberos ticket
  - Mobile browsers: no SPNEGO support, falls through to OIDC/password

Troubleshooting:

  - "keytab unavailable": check keytab_path permissions (should be 0600)
  - SPNEGO token unmarshal fails: token may not be a valid SPNEGO token
  - Auth failure: check SPN matches keytab (klist -k /etc/krb5.keytab)
  - Clock skew: Kerberos requires clocks within 5 minutes (check NTP)
  - Non-domain browser loop: hexon_spnego_tried cookie should prevent it
  - "user disabled": valid Kerberos ticket but user disabled in directory

Relationships

Module dependencies and interactions:

SSH bastion: Primary consumer. SSH bastion uses Kerberos tickets for GSSAPI authentication to target hosts. Retrieves tickets via GetTicket and sets KRB5CCNAME for SSH connections. Writes temp files via WriteTicketFile for tools requiring file-based credential caches.
Sessions: Distributed ticket storage with automatic encryption at rest. Sessions provide TTL expiration, cluster-wide replication, principal-based indexing via ModuleKey, and atomic operations.
Directory: User identity verification. Directory provides the canonical username and group memberships used in ticket principal construction and access control decisions.
Cluster: ACL definitions control which modules can retrieve tickets.
Config: Hot-reloadable configuration for ticket_ttl and password_change. Realm, KDC address, and kpasswd_path require restart.
Telemetry: Security audit logging for all ticket operations. Metrics exported as Prometheus counters for monitoring ticket lifecycle, KDC health, authentication failures, and password change operations.
External dependency: gokrb5 v8.4.4 for pure Go Kerberos protocol. Version-pinned due to in-memory TGT extraction from internal fields.
External dependency: kpasswd binary for password change operations (auto-detected in PATH or configured via kpasswd_path).

Logs

Log entries by component. Search with: logs search "kerberos". Levels: ERROR > WARN > INFO > DEBUG.

SPNEGO (Browser SSO):

  kerberos.security       WARN   AUDIT  SPNEGO token exceeds size limit
  kerberos.security       INFO   AUDIT  SPNEGO auth successful / failed / decode failed / unmarshal failed
  kerberos.security       ERROR  AUDIT  SPNEGO validated but no credentials in context
  kerberos.security       WARN   AUDIT  SPNEGO auth for disabled user
  kerberos.spnego         ERROR         Failed to load keytab
  kerberos.spnego         WARN          User not found in directory / unexpected type / lookup failed
  kerberos.spnego         WARN          Keytab permissive permissions / missing service principal
  kerberos.spnego         INFO          Keytab loaded (from base64 or file)

Ticket Acquisition:

  kerberos.security       INFO   AUDIT  Kerberos authentication successful
  kerberos.security       INFO          Kerberos authentication failed
  kerberos.acquire        ERROR         Failed to load krb5.conf

Ticket Access:

  kerberos.security       INFO   AUDIT  Ticket access denied — invalid or expired session
  kerberos.write_file     INFO   AUDIT  Created temporary ticket file

Ticket Lifecycle:

  kerberos.refresh        INFO          Ticket refreshed
  kerberos.refresh        ERROR         Failed to refresh ticket
  kerberos.revoke         INFO          Ticket revoked
  kerberos.revoke_user    INFO          User tickets revoked

Password Change:

  kerberos.security       INFO          Password change failed / successful / tickets revoked after change
  kerberos.password_change ERROR        kpasswd pipe/start/write failures

Initialization:

  kerberos.init           INFO          Memory locking enabled
  kerberos.init           WARN          Memory locking failed — passwords may be swapped

Metrics

Prometheus metrics. Query with: metrics prometheus kerberos_<name>

SPNEGO:

  kerberos_spnego_validation_total        counter    {result, reason?}         SPNEGO validation results
    result=success
    result=failure, reason=invalid_base64|invalid_token|auth_failed|no_credentials|user_disabled

Tickets:

  kerberos_ticket_acquisition_total       counter    {result, reason?}         Ticket acquisition
    result=success | result=failure, reason=auth_failed
  kerberos_ticket_refresh_total           counter    {result}                  Ticket refresh (success/failure)
  kerberos_ticket_revocation_total        counter    {result}                  Ticket revocation (success)
  kerberos_tickets_revoked                counter    {}                        Total tickets revoked (bulk count)

Password:

  kerberos_password_change_total          counter    {result}                  Password changes (success/failure)

Alerts:

  rate(kerberos_spnego_validation_total{result="failure"}[5m]) > 10            SPNEGO failures (keytab/config)
  rate(kerberos_ticket_refresh_total{result="failure"}[5m]) > 0                Ticket refresh failing (KDC)
  increase(kerberos_spnego_validation_total{reason="user_disabled"}[5m]) > 0   Disabled user SPNEGO attempt

LDAP Authentication

Authenticates users with username and password against LDAP — Active Directory, FreeIPA, or OpenLDAP

Overview

The LDAP authentication module provides username/password verification by performing LDAP bind operations against configured directory servers. It acts as a bridge between the directory cache (for fast pre-flight checks) and the LDAP provider (for live password verification).

Core capabilities:

LDAP bind authentication (no local password storage)
Pre-flight account status checks via directory cache (disabled, expired)
Group membership retrieval from directory cache
Full user profile enrichment on successful authentication (email, name, groups)
Graceful degradation when directory details unavailable after successful bind
Prometheus metrics for authentication success/failure with labeled reasons
Stateless operation suitable for any cluster node

Authentication flow (5-step pipeline):

  1. Input validation: trim username, reject empty fields
  2. Directory status check: existence, disabled, password expiry (via directory cache)
  3. LDAP bind: live password verification against LDAP server
  4. User details retrieval: full profile from directory cache
  5. Response construction: comprehensive result with user metadata

The module never stores, caches, or logs passwords. Every authentication attempt requires a live LDAP bind, ensuring password policy enforcement is always delegated to the LDAP server (lockouts, complexity, expiry).

Failure reasons returned in AuthenticateResponse.Reason:

  - "username required" / "password required" (input validation)
  - "user not found" (not in directory cache)
  - "account disabled" / "password expired" (pre-flight status)
  - "invalid credentials" (LDAP bind failed)
  - "directory unavailable" / "authentication service unavailable" (module errors)
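
The five steps and their failure reasons can be sketched end to end. This is an illustration under stated assumptions: DirEntry, the dict-based directory cache, and the ldap_bind callable are hypothetical stand-ins for the directory and LDAP provider modules, and the two service-error reasons are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    """Hypothetical cached directory record."""
    disabled: bool = False
    password_expired: bool = False
    email: str = ""
    groups: list = field(default_factory=list)


def authenticate(username, password, directory, ldap_bind):
    """Run the five steps; returns (ok, reason, details)."""
    username = username.strip()                  # 1. input validation
    if not username:
        return False, "username required", None
    if not password:
        return False, "password required", None
    entry = directory.get(username)              # 2. directory status check
    if entry is None:
        return False, "user not found", None
    if entry.disabled:
        return False, "account disabled", None
    if entry.password_expired:
        return False, "password expired", None
    if not ldap_bind(username, password):        # 3. live LDAP bind
        return False, "invalid credentials", None
    details = directory.get(username)            # 4. profile retrieval
    return True, "", {                           # 5. response construction
        "username": username,
        "email": details.email if details else "",
        "groups": details.groups if details else [],
    }


cache = {"alice": DirEntry(email="alice@example.com", groups=["ops"]),
         "bob": DirEntry(disabled=True)}
bind = lambda user, pw: pw == "right"            # stand-in for the LDAP provider

print(authenticate(" alice ", "right", cache, bind))
print(authenticate("alice", "wrong", cache, bind))
print(authenticate("bob", "right", cache, bind))
print(authenticate("carol", "right", cache, bind))
```

Note that the bind in step 3 is never reached for unknown, disabled, or expired accounts, which matches the pre-flight ordering.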

Config

The LDAP authentication module itself has no dedicated configuration section. It depends entirely on configuration from two upstream modules:

Directory module [directory]:

  url = "ldaps://ldap.example.com:636"    # LDAP server URL
  bind_dn = "cn=svc,dc=example,dc=com"   # Service account for searches
  bind_password = "secret"                # Service account password
  user_base = "ou=users,dc=example,dc=com"  # User search base DN
  group_base = "ou=groups,dc=example,dc=com" # Group search base DN
  sync_interval = "5m"                    # Delta sync interval (default: 5m)
  full_sync_interval = "60m"              # Full sync interval (default: 60m)

LDAP provider module [ldap]:

  url = "ldaps://ldap.example.com:636"    # LDAP server URL for bind operations
  bind_dn = "cn=svc,dc=example,dc=com"   # Service account DN
  user_base = "ou=users,dc=example,dc=com" # User search base for DN resolution
  user_filter = "(uid=%s)"               # User lookup filter (%s = username)
  user_attribute = "uid"                  # Username attribute (uid, sAMAccountName)

Active Directory considerations:

  - Use user_attribute = "sAMAccountName" for AD environments
  - Use user_filter = "(sAMAccountName=%s)" for AD user lookups
  - AD lockout policies enforced server-side via LDAP bind
  - Password expiry detected via directory cache sync

Connection pooling is managed by the LDAP provider module, not this module. LDAP bind operations reuse pooled connections for reduced overhead.

Cache staleness window:

  - Account status changes (disable, expiry) reflected within sync_interval
  - Default: up to 5 minutes delay for status changes to propagate
  - Full sync ensures eventual consistency every 60 minutes
  - Immediate effect: password changes always verified live via LDAP bind

Troubleshooting

Common symptoms and diagnostic steps:

User gets “Invalid username or password” but credentials are correct:

  - Run 'diagnose user <username>' to check cross-subsystem status
  - Run 'directory user <username>' to verify user exists in cache
  - Check directory sync status: 'directory status' for last sync time
  - If user recently created, wait for sync or trigger manual sync
  - Verify LDAP server reachability: 'auth ldap' for connection health
  - Check if account locked in LDAP (server-side lockout policy)
  - Verify user_attribute matches LDAP schema (uid vs sAMAccountName)

User gets “account disabled” but account is active in LDAP:

  - Directory cache may be stale; check last sync: 'directory status'
  - Trigger manual sync: 'directory sync <username>' to refresh user
  - Verify the disabled attribute mapping in directory config
  - Check delta sync interval (default 5m) for expected propagation delay

User gets “password expired” unexpectedly:

  - Verify password expiry attribute mapping in directory config
  - Check LDAP password policy (ppolicy overlay or AD fine-grained policy)
  - Trigger user sync to refresh expiry status: 'directory sync <username>'

Authentication returns “directory unavailable”:

  - Check directory module health: 'directory status'
  - Verify cluster bridge status: 'cluster status'
  - Check LDAP server connectivity: 'auth ldap'
  - Review logs: 'logs search "directory"' for connection errors
  - Verify directory module is registered and running

Authentication returns “authentication service unavailable”:

  - Check LDAP provider module health: 'auth ldap'
  - Verify LDAP server URL and port in configuration
  - Check TLS certificate validity for ldaps:// connections
  - Test LDAP connectivity: 'net tcp <ldap-host>:636 --tls'
  - Review logs: 'logs search "ldap"' for bind or connection errors
  - Check connection pool: 'connpool stats' for pool exhaustion

Authentication is slow:

  - LDAP bind is the slow path (50-200ms typical, network dependent)
  - Check LDAP server latency: 'net latency <ldap-host>:636 --tls'
  - Verify connection pooling is working: 'connpool pools'
  - High latency indicates LDAP server load or network issues
  - Directory cache lookups should be <5ms (fast path)

All logins failing simultaneously:

  - LDAP server down: 'auth ldap' for health status
  - Network partition: 'net tcp <ldap-host>:636' for connectivity
  - TLS certificate expired: 'net tls <ldap-host>:636' to inspect cert
  - DNS failure: 'dns test <ldap-hostname>' for resolution check
  - Check cluster health: 'health status' for node-level issues

Metrics for monitoring:

  - ldap_authentication_total{result="success"} -- successful logins
  - ldap_authentication_total{result="failure",reason="invalid_credentials"} -- wrong passwords
  - ldap_authentication_total{result="failure",reason="user_not_found"} -- unknown users
  - ldap_authentication_total{result="failure",reason="account_disabled"} -- disabled accounts
  - ldap_authentication_total{result="failure",reason="directory_unavailable"} -- infra issues
  - ldap_authentication_total{result="failure",reason="ldap_unavailable"} -- LDAP down
  - Spike in invalid_credentials may indicate brute force or credential stuffing
  - Spike in directory_unavailable or ldap_unavailable indicates infrastructure problems

Security

Password handling and credential security:

No local password storage:

  Passwords are never stored, cached, or hashed locally. Every authentication
  requires a live LDAP bind, eliminating the risk of a local password database
  compromise. No password appears in logs, telemetry, metrics, or response objects.

Pre-authentication checks (fail-fast security):

  Account status is verified BEFORE attempting LDAP bind. This prevents
  unnecessary LDAP queries for disabled or expired accounts, reducing load on
  the LDAP server and providing faster rejection of invalid accounts.
  Evaluation order: existence -> disabled -> expired -> LDAP bind.

Enumeration prevention:

  The module returns distinct internal reasons ("user not found" vs "invalid
  credentials") but consuming services MUST map these to a generic message
  (e.g., "Invalid username or password") to prevent username enumeration.
  Timing is kept consistent: directory cache lookups are fast (<5ms) regardless
  of user existence. The module itself does not expose any public API that
  reveals user existence.
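
  A consuming service might implement the mapping along these lines. The sketch is hypothetical: only the two reason strings and the generic message come from the text, and passing other reasons through unchanged is an assumption that a stricter service could tighten.

```python
GENERIC = "Invalid username or password"

# Internal reasons that must look identical to the end user.
INDISTINGUISHABLE = {"user not found", "invalid credentials"}


def user_facing_message(reason: str) -> str:
    """Collapse reasons that would reveal whether a username exists."""
    return GENERIC if reason in INDISTINGUISHABLE else reason


print(user_facing_message("user not found"))
print(user_facing_message("invalid credentials"))
```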

Brute force and credential stuffing:

  Account lockout is delegated to the LDAP server's password policy (ppolicy
  overlay or Active Directory lockout settings). The module does not implement
  its own lockout or rate limiting. Consuming services (signin, proxy auth)
  should implement:
  - Per-IP rate limiting (recommended: 10 attempts/minute)
  - Per-username rate limiting (recommended: 5 attempts/minute)
  - CAPTCHA after repeated failures
  - Device fingerprinting for anomaly detection

Injection prevention:

  Username is trimmed of whitespace before use. LDAP filter escaping is handled
  by the downstream LDAP provider module. There are no local database queries
  or command executions, eliminating SQL injection and command injection
  vectors entirely.

Password policy enforcement:

  All password complexity, history, and rotation requirements are enforced by the
  LDAP server. The module reports password expiry status from the directory cache
  but does not enforce policies locally. This ensures a single source of truth
  for password policy (the LDAP directory).

Credential logging policy:

  Debug level: username and operation stage (never password)
  Info level: successful authentication with username and groups
  Warn level: authentication failures with reason (never password)
  Error level: infrastructure failures with error details
  Never logged: password, email (unless required for specific audit)

Memory safety:

  Password memory clearing after LDAP bind is handled by the LDAP provider
  module. The authentication module passes the password through to the bind
  operation and does not retain references after the call completes.

Relationships

Module dependencies and interactions:

Directory: Primary dependency for pre-flight checks. Provides cached user metadata for account status checks (existence, disabled, expired, groups) and full profile retrieval (email, name). Directory cache is synced from LDAP on configurable intervals (delta: 5m, full: 60m). Cache staleness determines the window for status change propagation.
LDAP provider: Primary dependency for password verification. Performs LDAP bind operations for username/password verification. Manages LDAP connection pooling, user DN resolution, and TLS negotiation. Bind success/failure is the authoritative password check.
Sign-in service: Primary consumer. The sign-in flow engine calls ldapauth Authenticate as part of the username/password authentication stage. The flow engine maps internal failure reasons to user-facing messages and manages session creation on success.
Reverse proxy: Consumer for proxy authentication. HTTP proxied applications can require LDAP authentication via proxy auth provider configuration. Uses the same Authenticate operation with credentials from Basic Auth or form POST.
Telemetry: All operations logged with structured fields (username, groups, error, type). Prometheus metrics exported for authentication success/failure counts with reason labels. Metrics enable real-time monitoring, security event detection, and capacity planning.
Cluster: All operations are node-local with no cluster coordination required. The module is stateless and does not require session affinity or leader election.
Rate limiting: Not directly integrated. Rate limiting for authentication endpoints should be configured at the service layer (signin, proxy) using the rate limit module. Recommended: per-IP and per-username limits.
Sessions: On successful authentication, the consuming service creates a session with the returned user metadata (username, email, groups). Session lifecycle is managed by the session module, not the authentication module.

Cluster behavior:

  Fully stateless -- no local state, no cluster coordination required. All state
  lives in the directory cache (distributed via NATS/JetStream) and the LDAP
  server. Any cluster node can handle authentication independently. No session
  affinity needed. Directory cache consistency is bounded by sync intervals.

Logs

Log entries. Search with: logs search "ldapauth". All entries use the name ldapauth.authenticate.

  ldapauth.authenticate   DEBUG         Empty username / empty password provided
  ldapauth.authenticate   DEBUG         Attempting LDAP bind
  ldapauth.authenticate   INFO          Bind successful / bind failed (invalid credentials)
  ldapauth.authenticate   ERROR         LDAP bind call failed (service error)

Metrics

Prometheus metrics. Query with: metrics prometheus ldap_<name>

  ldap_authentication_total       counter    {result, reason?}     Authentication attempts
    result=success                                                  Successful bind
    result=failure, reason=empty_username                           Missing username
    result=failure, reason=empty_password                           Missing password
    result=failure, reason=service_unavailable                      LDAP service error
    result=failure, reason=invalid_credentials                      Wrong password

Alerts:

  rate(ldap_authentication_total{result="failure",reason="service_unavailable"}[5m]) > 0   LDAP server down
  rate(ldap_authentication_total{result="failure",reason="invalid_credentials"}[5m]) > 20  Brute-force attempt

Magic Link Authentication

Passwordless sign-in via email magic links with cross-device support

Overview

The magic link module implements passwordless authentication by sending a sign-in link to the user’s email address. Users click the link to authenticate without entering a password or code.

Core capabilities:

Passwordless authentication via email-delivered links
Cross-device support: request link on one device, click on another
Three verification actions: authorize (remote), sign-in-here (local), deny
Anti-enumeration: identical response shape regardless of email validity
Session-based tokens with 122 bits of randomness (UUID v4)
Atomic single-use via cluster-wide session revocation
Per-IP and per-email rate limiting to prevent abuse and inbox flooding
Directory re-validation at verify time (disabled users cannot complete auth)
PreVerify is read-only (safe from link-preview bots consuming tokens)
Confirmation page shows request context (IP, location, browser) for phishing detection
Geo-enriched emails showing request origin for user awareness

Flow summary:

  1. User enters email on /signin/magiclink
  2. Module creates device code pair with geo context in AdditionalData
  3. If email matches an active directory user, a "magiclink" session is
     created (cluster-replicated) containing user info and device code key
  4. Email sent with link: /signin/magiclink/verify?token=<SESSION_ID>
  5. Frontend polls /api/signin/magiclink/poll with the device_code
  6. User clicks link in email (possibly on a different device)
  7. PreVerify validates session (read-only) and renders confirmation page
     showing destination, browser, IP, and geographic location
  8. User chooses: Authorize, Sign in here, or Deny
  9. Verify revokes session (atomic single-use) and acts on device code
  10. Polling returns "authorized", "completed_elsewhere", or "denied"

The module reuses the device code module (RFC 8628) for the polling mechanism and the sessions module for cluster-replicated token storage.

Config

Magic link is configured under the signin service section:

[service.signin.magiclink]
  enabled = true              # Master switch (default: false)
  code_ttl = "10m"            # Link validity duration (default: 10 minutes)
  rate_limit = "5/1m"         # Per-IP rate limit (default: 5 per minute)
  rate_limit_email = "3/10m"  # Per-email rate limit (default: 3 per 10 minutes)

Prerequisites:

  - SMTP must be configured for email delivery
  - Device code module is auto-enabled when magic link is activated
  - Directory module must be available for user lookup by email

UI integration:

  When enabled, sign-in templates render a "Send me a sign in link" text link
  below the secondary method buttons. Magic link is NOT injected into the
  secondary methods array. It appears as a separate, lower-emphasis option
  via the "magiclink_enabled" template variable. Operators only need to set
  enabled = true; the link appears on all sign-in pages (passkey, password,
  x509) automatically.

Rate limiting behavior:

  - Per-IP limit (rate_limit): returns error "rate_limited" when exceeded,
    service responds with HTTP 429
  - Per-email limit (rate_limit_email): silently creates orphaned device code
    as decoy (anti-enumeration), no email sent
  - Both limits reset on their respective sliding windows
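
  The "count/window" notation and the sliding-window behavior can be sketched as follows. The accepted unit suffixes (s, m, h) and the class and function names are assumptions for illustration; the real limiter is cluster-aware, whereas this one is an in-process stand-in.

```python
import re
from collections import defaultdict, deque

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_rate(spec: str):
    """Parse '5/1m' into (limit, window_seconds)."""
    m = re.fullmatch(r"(\d+)/(\d+)([smh])", spec)
    if not m:
        raise ValueError("bad rate spec: " + repr(spec))
    return int(m.group(1)), int(m.group(2)) * _UNITS[m.group(3)]


class SlidingWindow:
    """Allow at most `limit` hits per key within the trailing window."""

    def __init__(self, spec: str):
        self.limit, self.window = parse_rate(spec)
        self.hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        q = self.hits[key]
        while q and now - q[0] >= self.window:  # drop hits that left the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


per_ip = SlidingWindow("5/1m")
results = [per_ip.allow("198.51.100.7", t) for t in (0, 1, 2, 3, 4, 5, 61)]
print(results)
```

  In this sketch the sixth request inside the same minute is refused, and a request is accepted again once the oldest hits have aged out of the window.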

Anti-enumeration design:

  Initiate always returns the same response shape (DeviceCode + ExpiresIn)
  regardless of whether the email exists, is disabled, or is rate-limited.
  When the email is invalid or per-email rate-limited, a real but orphaned
  device code is created as a decoy so timing and response structure are
  identical. The frontend polls normally and eventually gets "expired",
  which is indistinguishable from a valid request where the user never
  clicked the link.
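
  A minimal sketch of the decoy pattern, assuming dict-based stand-ins for the directory, session store, and mail queue (all hypothetical, as are the function and parameter names). It shows the identical response shape and the differing side effects; it does not demonstrate timing equivalence.

```python
import secrets
import uuid


def initiate(email, directory, email_allowed, sessions, outbox, ttl=600):
    """Return {'device_code', 'expires_in'} for every input."""
    device_code = secrets.token_urlsafe(32)        # always created
    user = directory.get(email)
    if user is not None and not user["disabled"] and email_allowed(email):
        session_id = str(uuid.uuid4())             # doubles as the link token
        sessions[session_id] = {"user": email, "device_code": device_code}
        outbox.append((email, "/signin/magiclink/verify?token=" + session_id))
    # Otherwise the device code stays orphaned and simply expires.
    return {"device_code": device_code, "expires_in": ttl}


directory = {"alice@example.com": {"disabled": False},
             "bob@example.com": {"disabled": True}}
sessions, outbox = {}, []
replies = [initiate(e, directory, lambda _e: True, sessions, outbox)
           for e in ("alice@example.com", "bob@example.com", "nobody@example.com")]
print([sorted(r) for r in replies], len(outbox))
```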

Hot-reloadable: code_ttl, rate_limit, rate_limit_email. Cold (restart required): enabled.

Troubleshooting

Common symptoms and diagnostic steps:

User never receives magic link email:

  - Check SMTP health: 'smtp health' to verify email delivery is working
  - Verify email belongs to an active directory user
  - Check per-email rate limit: silent suppression after 3/10m (no error shown)
  - Check spam/junk folders for the magic link email
  - Verify the user's email address in directory matches what was entered
  - Check structured logs for SMTP delivery errors

Magic link says “expired” or “invalid” when clicked:

  - Default TTL is 10 minutes; check if user clicked in time
  - Token is single-use: clicking a second time returns "already consumed"
  - Check cluster time synchronization (NTP) across nodes
  - Verify session replication health across cluster

Polling returns “expired” and no email ever arrived (anti-enumeration):

  - This is expected behavior for non-existent emails (by design)
  - Per-email rate limit exceeded: creates orphaned decoy device code
  - User disabled in directory: treated as non-existent (anti-enumeration)
  - No way to distinguish from legitimate "user never clicked" scenario

“completed_elsewhere” status on polling device:

  - User chose "Sign in here" on the verifying device (the device where
    they clicked the email link)
  - This is intentional: the session was created on the verifying device only
  - Polling browser displays a friendly message, not an error
  - Detected via a cluster-wide signal for cross-device coordination

Confirmation page shows wrong location or IP:

  - Geo data comes from the GeoAccess module's IP-to-country/ASN lookup
  - Check geo database freshness and availability
  - Proxy or CDN may mask the original client IP
  - X-Forwarded-For header processing depends on trusted proxy configuration

Rate limiting triggered unexpectedly:

  - Per-IP limit: 5 requests per minute (shared across all emails from one IP)
  - Per-email limit: 3 requests per 10 minutes (shared across all IPs)
  - Corporate NAT may cause many users to share one IP
  - Adjust rate_limit and rate_limit_email in config as needed

Magic link feature not visible on sign-in page:

  - Verify enabled = true in [service.signin.magiclink]
  - Check that SMTP is configured (prerequisite)
  - Template variable "magiclink_enabled" drives visibility
  - The link appears below secondary method buttons, not in the methods array

Diagnostic commands:

  - smtp health: verify email delivery subsystem
  - auth status: check authentication system overview
  - sessions list --type=magiclink: list active magic link sessions
  - health components: verify magic link subsystem health

Security

Security features and hardening measures:

Token entropy:

  Magic link tokens are session IDs (UUID v4): 128-bit values carrying 122
  bits of cryptographic randomness. The session ID doubles as the magic link
  token in the verification URL, providing sufficient randomness to resist
  brute-force guessing.
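
  As a rough illustration using Python's standard uuid module (not the
  gateway's own generator, which this document does not show), a version 4
  UUID is 128 bits wide but reserves 6 bits for the version and variant
  fields, leaving 122 random bits:

```python
import uuid

# A version 4 UUID is 128 bits wide. Four bits encode the version and
# two bits encode the RFC 4122 variant; the rest are chosen at random.
TOTAL_BITS = 128
VERSION_BITS = 4
VARIANT_BITS = 2
RANDOM_BITS = TOTAL_BITS - VERSION_BITS - VARIANT_BITS

token = uuid.uuid4()
print(token.version)                   # version field of the generated UUID
print(token.variant == uuid.RFC_4122)  # variant field check
print(RANDOM_BITS)                     # random bits available to resist guessing
```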

Single-use enforcement:

  Tokens are consumed via atomic session revocation (replicated to all nodes).
  Once revoked, the token cannot be reused. Double-click on the verification
  link returns AlreadyDone=true (idempotent, no error).

Anti-enumeration:

  The Initiate operation returns identical response structure regardless of
  whether the email exists, the user is disabled, or per-email rate limit
  is exceeded. Orphaned device codes serve as timing-identical decoys.
  This prevents attackers from using magic link requests to discover valid
  email addresses in the directory.

Directory re-validation:

  At Verify time, the module re-validates the user against the directory.
  If the user has been disabled between Initiate and Verify, authentication
  fails. This prevents race conditions where an admin disables a user who
  already has a pending magic link.

Link-preview bot protection:

  PreVerify (GET request when link is clicked) is read-only and does not
  consume the token. Link-preview bots that fetch URLs in emails cannot
  accidentally authorize or deny the request.

Phishing detection:

  The confirmation page displays the request context (source IP, browser
  User-Agent, country, ISP/ASN) so the user can verify whether they
  initiated the request. Suspicious requests can be denied.

Cross-device security:

  The "sign-in-here" action denies the device code and stores a signal,
  so the polling browser sees "completed_elsewhere" rather than "authorized".
  This prevents unintended sessions on the original (potentially shared) device.

Rate limiting:

  - Per-IP: prevents abuse from a single source (default: 5/1m)
  - Per-email: prevents inbox flooding for a target user (default: 3/10m)
  - Per-email limit is silent (anti-enumeration): no error, decoy created
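
  The defaults above are written as count/window. A minimal sketch of how
  such a value could be parsed and enforced follows. parse_rate_limit and
  SlidingWindowLimiter are hypothetical names, and the sliding-window
  algorithm is an assumption, since the gateway's actual limiter is not
  described here:

```python
import re
from collections import defaultdict, deque

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_rate_limit(spec):
    """Parse 'count/window' (for example '3/10m') into (count, window_seconds)."""
    match = re.fullmatch(r"(\d+)/(\d+)([smh])", spec.strip())
    if match is None:
        raise ValueError(f"unrecognized rate limit: {spec!r}")
    count, amount, unit = match.groups()
    return int(count), int(amount) * _UNIT_SECONDS[unit]


class SlidingWindowLimiter:
    """In-memory limiter keyed by IP or email; timestamps are passed in explicitly."""

    def __init__(self, spec):
        self.limit, self.window = parse_rate_limit(spec)
        self._hits = defaultdict(deque)

    def allow(self, key, now):
        hits = self._hits[key]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


per_email = SlidingWindowLimiter("3/10m")
decisions = [per_email.allow("user@example.com", t) for t in (0, 1, 2, 3)]
print(decisions)
```

  With the per-email default, the fourth request inside ten minutes is
  refused. In the real flow that refusal is silent: no error is shown and a
  decoy is created instead.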

Relationships

Module dependencies and interactions:

Device code: Core dependency. Provides RFC 8628 device code pair generation and polling infrastructure. Magic link auto-enables device code when activated. Device code handles the polling lifecycle; magic link provides the email-based authorization trigger.
Sessions: Cluster-replicated token storage. Magic link tokens are stored as sessions with automatic TTL cleanup and atomic single-use via cluster-wide revocation. Session metadata contains user info, device code key, and request context.
Directory: User lookup by email at Initiate time and re-validation at Verify time. Disabled users are treated as non-existent (anti-enumeration). Directory provides canonical user attributes (username, email, full name, groups) stored in session metadata.
SMTP: HTML/text magic link email delivery. SMTP must be configured as a prerequisite for magic link functionality.
Geo access: IP-to-country and ASN lookup for email context and confirmation page display. Helps users detect phishing attempts.
Rate limiting: Per-IP and per-email request throttling. Per-IP returns HTTP 429; per-email silently creates decoy (anti-enumeration).
config: Runtime access to [service.signin.magiclink] settings. Hot-reload supported for TTL and rate limit values.
Sign-in service: HTTP handlers for /signin/magiclink routes and /api/signin/magiclink/poll that delegate to this module.
Distributed memory cache: Stores cross-device flow coordination signals so the polling browser knows when authentication completed elsewhere.

Logs

Log entries by component. Search with: 'logs search "magiclink"'. Levels: ERROR > WARN > INFO > DEBUG.

Rate Limiting:

  magiclink.ratelimit.ip.status    DEBUG  Per-IP rate limit check passed
  magiclink.ratelimit.email.status DEBUG  Per-email rate limit check passed

Initiate (magic link request):

  magiclink.initiate        INFO   Per-email rate limit exceeded
  magiclink.initiate        ERROR  Failed to create device code
  magiclink.initiate        ERROR  Failed to create magiclink session
  magiclink.initiate        WARN   Failed to dispatch magic link email
  magiclink.initiate        INFO   Magic link email queued

Poll (device code polling):

  magiclink.poll            ERROR  PollDeviceCode failed
  magiclink.poll            ERROR  Directory lookup failed during poll
  magiclink.poll            INFO   User invalid at poll time

PreVerify (read-only token validation):

  magiclink.preverify       INFO   Pre-verification successful, showing confirmation page

Verify (token consumption + action):

  magiclink.verify          INFO   Magic link denied by user
  magiclink.verify          ERROR  Directory lookup failed during verify
  magiclink.verify          INFO   Magic link signin_here — session on verifying device only
  magiclink.verify          ERROR  Failed to update device code authorization
  magiclink.verify          INFO   Magic link authorized

Metrics

Prometheus metrics emitted by this module:

  magiclink_initiated_total        counter  Incremented when a magic link email is
                                            successfully queued (valid user, within
                                            rate limits). Not incremented for decoy
                                            flows or unknown emails.

  magiclink_verifications_total    counter  Incremented when Verify completes a user
                                   {result} action. Labels:
                                            authorized  — user approved sign-in
                                            denied      — user rejected the request
                                            signin_here — user chose local sign-in

  magiclink_polls_total            counter  Incremented on every Poll response.
                                   {status} Labels mirror the returned status:
                                            pending, authorized, denied, expired,
                                            slow_down, completed_elsewhere,
                                            invalid (empty device code).

Additional observability via dependent modules:

  - devicecode: device_code_* metrics cover code creation and polling
  - ratelimit: ratelimit_* metrics cover per-IP and per-email throttling
  - sessions: session_* metrics cover magiclink session create/revoke
  - smtp: smtp_* metrics cover magic link email delivery

OIDC Provider

Built-in OpenID Connect provider — issues tokens for proxy SSO, bastion SSH, M2M, and personal access tokens

Overview

Issues and manages OAuth 2.0 / OpenID Connect tokens for all gateway services. Replaces external OIDC providers for proxy SSO, bastion device authorization, M2M workload auth, and personal access tokens. All token operations are cluster-wide — storage, revocation, and signing keys replicated across every node.

User authentication:

Authorization Code Flow with PKCE, prompt/max_age/consent support
Dynamic ACR/AMR claims reflecting the actual authentication method used
DPoP token binding and mTLS certificate-bound tokens for high-security flows

Machine-to-machine:

Client Credentials Grant for service auth, JWT Bearer Grant for certificate-based M2M
Dynamic Client Registration for native OAuth clients

Additional capabilities:

Pushed Authorization Requests (PAR) for enhanced request security
Device Authorization Grant for headless device flows (bastion SSH, CLI)
Personal Access Tokens (PATs) for CLI, CI/CD, and automation
Token introspection and revocation (access tokens, refresh tokens, and PATs)
UserInfo endpoint for retrieving user claims
Response modes: query (default) and form_post
Per-client skip_consent for trusted first-party applications
Optional PKCE plain method deprecation (OAuth 2.1 hardening — S256 only)
JWKS and OpenID Configuration discovery endpoints

A built-in proxy SSO client provides unified single sign-on for all proxy mappings. Its redirect URIs are validated against live proxy configuration to prevent open redirect attacks. This client is managed automatically and does not appear in the TOML configuration.

Config

Core configuration under [authentication.oidc]:

[authentication.oidc]
  signing_key = "..."              # REQUIRED: Min 32 chars, used for deterministic key derivation via HKDF
  signing_algorithm = "ES256"      # ES256 (default), ES384, ES512, or EdDSA
                                   # MUST be identical across all cluster nodes
  hostname = "auth.example.com"    # REQUIRED: OIDC issuer URL (appears in token claims and discovery)
  enable_test_callback = false     # Enable test callback URL (NEVER enable in production)
  dpop_proactive_nonce = true      # Send DPoP-Nonce header in all token responses (default: true)
  par_ttl = "5m"                   # PAR request_uri TTL (range: 1m-10m per RFC 9126)
  enable_dcr = false               # Enable Dynamic Client Registration (RFC 7591)
  rate_limit_dcr = "10/1m"         # DCR endpoint rate limit per IP
  allow_dcr_from = []              # CIDR allowlist for DCR (empty = allow all)
  allow_dcr_redirect_domains = [] # Allowed redirect URI domains (loopback always allowed, supports *.example.com)
  disable_plain_pkce = false       # Reject "plain" PKCE method (OAuth 2.1 hardening, S256 only when true)
  pat_enabled = false              # Master switch for PATs (default: disabled)
  pat_max_ttl = "2160h"            # Maximum PAT lifetime (default 90 days, max 365 days)
  pat_max_per_user = 10            # Maximum PATs per user (default 10)
  pat_required_groups = []         # Groups allowed to create PATs (empty = any authenticated user)

[[authentication.oidc.clients]]
  name = "my-app"                  # REQUIRED: Client identifier (used as client_id)
  clientsecret = "..."             # Min 32 chars with entropy validation (omit for public/mTLS clients)
  redirect_urls = ["https://..."]  # REQUIRED: Allowed redirect URIs (strict validation, wildcard support)
  origin_urls = ["https://..."]    # Allowed CORS origins
  allowed_scopes = ["openid", "profile", "email", "groups"]  # Permitted scopes
  allowed_grant_types = ["authorization_code", "refresh_token"]  # Default grant types
  require_pkce = false             # Enforce PKCE (MUST be true for public clients)
  skip_consent = false             # Skip consent screen for trusted first-party clients
  allow_client_from = ["0.0.0.0/0"]  # IP allowlist in CIDR notation
  client_credentials_ttl = "1h"   # Access token TTL for client_credentials grant

  # mTLS configuration (RFC 8705)
  token_endpoint_auth_method = "tls_client_auth"  # Enable mTLS client auth
  tls_client_auth_san_uri = "spiffe://..."         # URI SAN identity (SPIFFE)
  tls_client_auth_san_dns = "service.local"        # DNS SAN identity
  tls_client_auth_san_email = "svc@example.com"    # Email SAN identity
  tls_client_auth_subject_dn = "CN=service"        # Subject DN identity
  certificate_bound_tokens = true                  # Bind tokens to client certificate
  client_ca_pem = "/path/to/ca.pem"               # Per-client CA trust (inline PEM or file path)

  # JWT Bearer configuration (RFC 7523)
  jwt_public_key = "-----BEGIN PUBLIC KEY-----..."  # Public key for JWT assertion verification
  jwt_algorithm = "RS256"          # RS256/384/512, ES256/384/512, EdDSA
  jwt_issuer = "service-name"      # Expected issuer claim
  jwt_subject = "service-name"     # Expected subject claim

  # Scope-to-group mapping for M2M authorization
  scope_group_mapping = { "api:read" = ["readers"], "api:write" = ["writers"] }
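
To make the scope-to-group mapping concrete, here is a small sketch of how granted scopes could be translated into groups for an M2M token. The function name groups_for_scopes and the rule that unmapped scopes contribute no groups are assumptions for illustration:

```python
# Mirrors the scope_group_mapping example from the client configuration.
SCOPE_GROUP_MAPPING = {"api:read": ["readers"], "api:write": ["writers"]}


def groups_for_scopes(granted_scopes, mapping=SCOPE_GROUP_MAPPING):
    """Collect the groups implied by each granted scope, without duplicates."""
    groups = []
    for scope in granted_scopes:
        for group in mapping.get(scope, []):
            if group not in groups:
                groups.append(group)
    return groups


print(groups_for_scopes(["api:read", "api:write"]))
```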

Token storage and TTL defaults:

  Authorization codes:  10 minutes, single-use
  Access tokens:        1 hour (configurable), replicated cluster-wide
  Refresh tokens:       30 days (configurable), replicated cluster-wide
  DPoP JTIs:            120 seconds, replicated cluster-wide (best-effort)
  DPoP nonces:          60 seconds, single-use, replicated cluster-wide (best-effort)
  PAR requests:         5 minutes (configurable 1-10m), replicated cluster-wide (best-effort)
  PAT sessions:         up to pat_max_ttl (default 90 days), managed by sessions module

Key management:

  Signing keys are derived deterministically from the signing_key using
  HKDF (RFC 5869). Supports ES256 (ECDSA P-256, default), ES384 (P-384),
  ES512 (P-521), and EdDSA (Ed25519). All cluster nodes derive identical
  keypairs from the same signing_key, requiring no key synchronization.
  Keys remain stable across restarts.
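
  The property that matters here is determinism: the same input always
  yields the same key material. A minimal HKDF-SHA256 sketch using only the
  Python standard library shows this. The empty salt and the info label are
  assumptions; the values the gateway actually uses are not documented in
  this section:

```python
import hashlib
import hmac


def hkdf_sha256(ikm, salt, info, length):
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output too long")
    # Extract: an empty salt is treated as hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output has been produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


signing_key = b"example-signing-key-of-at-least-32-chars"
info = b"oidc-signing/ES256"  # hypothetical context label
node_a = hkdf_sha256(signing_key, b"", info, 32)
node_b = hkdf_sha256(signing_key, b"", info, 32)
print(node_a == node_b)
```

  Turning the derived bytes into an ECDSA or Ed25519 private key takes a
  further curve-specific step that is omitted here.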

Hot-reloadable: client configurations, scopes, redirect URIs, IP allowlists. Cold (restart required): signing_key, signing_algorithm, hostname (issuer URL).

Security

Token signing:

  ID tokens signed with configurable algorithm: ES256 (default), ES384, ES512, or EdDSA.
  ES256/384/512 are compatible with Kubernetes kube-apiserver --oidc-signing-algs.
  Two signing modes with automatic failover:
    Threshold: distributed key — no single node holds the full private key (requires cluster quorum).
    Deterministic fallback: all nodes derive identical keys from signing_key for cross-node consistency.
  The module auto-switches between modes based on cluster health.
  All token issuance logs include signer_type attribute ("threshold" or "deterministic").
  Signing key entropy validated at startup.
  Keys derived via HKDF-SHA256 (RFC 5869) for cross-node consistency.

JWT algorithm hardening:

  All JWT parsing enforces strict algorithm allowlists.
  ID token validation: ES256, ES384, ES512, EdDSA only (server-issued tokens).
  JWT Bearer assertion: RS256-512, ES256-512, EdDSA (client-signed assertions).
  DPoP proof validation: RS256-512, ES256-512, EdDSA (client-signed proofs).
  id_token_hint validation: ES256, ES384, ES512, EdDSA only (server-issued tokens).
  Symmetric algorithms (HS256/384/512) always rejected, preventing algorithm
  confusion attacks. DPoP proofs validate typ header per RFC 9449 Section 4.3.
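
  The allowlist idea can be sketched as a header check that runs before any
  signature verification. The set contents come from the list above;
  alg_allowed and header_only_token are hypothetical helpers, not the
  gateway's API:

```python
import base64
import json

SERVER_ISSUED_ALGS = frozenset({"ES256", "ES384", "ES512", "EdDSA"})
CLIENT_SIGNED_ALGS = SERVER_ISSUED_ALGS | {"RS256", "RS384", "RS512"}


def _b64url_decode(segment):
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def alg_allowed(jwt, allowed):
    """Return True only if the header names an allowlisted algorithm."""
    try:
        header = json.loads(_b64url_decode(jwt.split(".", 1)[0]))
    except ValueError:
        return False
    if not isinstance(header, dict):
        return False
    alg = header.get("alg")
    return isinstance(alg, str) and alg in allowed


def header_only_token(alg):
    """Build an unsigned token purely to exercise the check."""
    header = json.dumps({"alg": alg, "typ": "JWT"}).encode()
    return base64.urlsafe_b64encode(header).rstrip(b"=").decode() + ".e30.sig"


print(alg_allowed(header_only_token("HS256"), CLIENT_SIGNED_ALGS))
```

  Because HS256/384/512 and "none" appear in neither set, a token that names
  them is refused before its signature is ever examined.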

PKCE (Proof Key for Code Exchange, RFC 7636):

  Supports S256 (SHA-256) and plain methods. S256 strongly recommended.
  Optional disable_plain_pkce config rejects plain method (OAuth 2.1 hardening).
  When disable_plain_pkce=true, discovery advertises only S256.
  MANDATORY for public clients (no client_secret configured).
  RECOMMENDED for all confidential clients as defense-in-depth.
  Prevents authorization code interception in mobile and SPA scenarios.
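
  For reference, the S256 transformation defined by RFC 7636 is
  base64url(SHA-256(code_verifier)) without padding. The sketch below shows
  the check a token endpoint performs. The function names and the
  allow_plain parameter (standing in for the inverse of disable_plain_pkce)
  are illustrative assumptions:

```python
import base64
import hashlib
import hmac


def s256_challenge(code_verifier):
    """Compute the S256 code_challenge for a code_verifier (RFC 7636)."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(code_verifier, code_challenge, method="S256", allow_plain=True):
    """Compare the verifier from the token request with the stored challenge."""
    if method == "S256":
        expected = s256_challenge(code_verifier)
    elif method == "plain" and allow_plain:
        expected = code_verifier
    else:
        return False
    # Constant-time comparison, in line with the timing protections described later.
    return hmac.compare_digest(expected.encode(), code_challenge.encode())


verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"  # example from RFC 7636 Appendix B
print(s256_challenge(verifier))
```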

DPoP (Demonstrating Proof-of-Possession, RFC 9449):

  Binds tokens to client cryptographic key, preventing token theft and replay.
  Supports RSA, ECDSA, and Ed25519 proof keys.
  JTI replay prevention with 120-second distributed cache TTL.
  Optional nonce-based replay protection (proactive nonce delivery by default).
  Server issues DPoP-Nonce header in all token responses when enabled.
  Introspection returns cnf claim with jkt field for DPoP-bound tokens.

  Replay protection has two modes (dpop_strict_replay config option):
    - false (default): lower latency, small replay window during propagation
    - true: strict quorum wait, no replay window, higher latency
  Set to true for high-assurance deployments or regulated environments.
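
  JTI replay prevention amounts to remembering each proof's jti until its
  TTL passes. The sketch below is a single-node, in-memory stand-in
  (JtiReplayCache is a hypothetical name). The real cache is distributed,
  and the propagation behavior controlled by dpop_strict_replay is not
  modeled:

```python
class JtiReplayCache:
    """Remember DPoP proof JTIs for a fixed TTL and flag repeats."""

    def __init__(self, ttl_seconds=120):
        self.ttl = ttl_seconds
        self._seen = {}  # jti -> expiry timestamp

    def check_and_store(self, jti, now):
        """Return True for a first-time JTI, False for a replay within the TTL."""
        # Drop entries whose TTL has elapsed.
        self._seen = {j: exp for j, exp in self._seen.items() if exp > now}
        if jti in self._seen:
            return False
        self._seen[jti] = now + self.ttl
        return True


cache = JtiReplayCache()
first = cache.check_and_store("proof-1", now=0)
replay = cache.check_and_store("proof-1", now=30)
print(first, replay)
```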

Mutual TLS (RFC 8705):

  Client authentication via X.509 certificate presented during TLS handshake.
  Four identity methods: URI SAN (SPIFFE), DNS SAN, Email SAN, Subject DN.
  Configure exactly one identity method per client.
  Certificate-bound tokens contain cnf.x5t#S256 (SHA-256 thumbprint).
  Binding validated at token refresh and UserInfo endpoints.
  Mutual exclusion: tokens are DPoP-bound OR cert-bound, never both.
  Per-client CA trust via client_ca_pem provides defense-in-depth.
  Certificate DER size limited to 16KB. Raw certificates never logged.
  SPIFFE integration: workloads authenticate with existing X.509-SVIDs.
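
  The x5t#S256 confirmation value defined by RFC 8705 is the base64url
  encoding of the SHA-256 hash of the DER-encoded certificate. A minimal
  sketch of computing it and checking a token's binding follows. The
  function names are illustrative, and the placeholder bytes stand in for a
  real certificate:

```python
import base64
import hashlib
import hmac

MAX_CERT_DER_BYTES = 16 * 1024  # size limit stated above


def x5t_s256(cert_der):
    """Thumbprint of a DER certificate: base64url(SHA-256(DER)), unpadded."""
    if len(cert_der) > MAX_CERT_DER_BYTES:
        raise ValueError("certificate exceeds 16KB limit")
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def token_bound_to_cert(token_claims, cert_der):
    """Check that the presented certificate matches the token's cnf claim."""
    bound = token_claims.get("cnf", {}).get("x5t#S256", "")
    return hmac.compare_digest(bound.encode(), x5t_s256(cert_der).encode())


placeholder_der = b"\x30\x82" + bytes(100)  # not a real certificate
claims = {"cnf": {"x5t#S256": x5t_s256(placeholder_der)}}
print(token_bound_to_cert(claims, placeholder_der))
```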

Pushed Authorization Requests (RFC 9126):

  Authorization parameters stored server-side, not exposed in browser URL.
  request_uri enforces single-use consumption (prevents replay attacks).
  request_uri format: urn:ietf:params:oauth:request_uri:<base64url(32 bytes)>.
  Client binding: request_uri locked to creating client_id.
  DPoP integration: optional key binding at PAR time.
  Claims/id_token_hint limited to 8KB to prevent DoS.
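
  A request_uri in the format above can be generated as sketched below.
  Whether the gateway keeps the base64 padding character is an assumption
  here: the unpadded form is 77 characters long and the padded form is 78,
  the latter matching the length check mentioned under PAR troubleshooting:

```python
import base64
import secrets

REQUEST_URI_PREFIX = "urn:ietf:params:oauth:request_uri:"


def new_request_uri(padded=False):
    """Build a request_uri from 32 random bytes, base64url-encoded."""
    reference = base64.urlsafe_b64encode(secrets.token_bytes(32)).decode("ascii")
    if not padded:
        reference = reference.rstrip("=")
    return REQUEST_URI_PREFIX + reference


print(len(new_request_uri()), len(new_request_uri(padded=True)))
```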

OIDC Core compliance (§2, §3.1.2.1, §3.1.3.6, §5.5.1):

  prompt parameter:
    prompt=none: returns error if user not authenticated (no login redirect).
    prompt=login: forces re-authentication even with active session.
    prompt=consent: forces consent screen even for skip_consent clients.
    The three values are mutually exclusive. Validated at the authorization endpoint.
    Error redirects (login_required, consent_required) validated against
    registered redirect URIs to prevent open redirect.

  max_age parameter:
    Limits maximum authentication age in seconds.
    If session is older than max_age, forces re-authentication.
    Validates session CreatedAt against current time.

  auth_time claim:
    Reflects the real time the user authenticated, not when the token was issued.
    Carried through the entire token lifecycle (auth code, refresh, ID token).

  at_hash claim (§3.1.3.6):
    Left-most half of the hash of the access token, base64url-encoded.
    Hash algorithm matched to signing algorithm:
      ES256 → SHA-256, ES384 → SHA-384, ES512/EdDSA → SHA-512.
    Included in all ID tokens issued alongside an access token.
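
    The at_hash computation can be sketched as follows, using the
    algorithm-to-hash mapping listed above. The function name and the sample
    token are illustrative:

```python
import base64
import hashlib

_HASH_FOR_ALG = {
    "ES256": hashlib.sha256,
    "ES384": hashlib.sha384,
    "ES512": hashlib.sha512,
    "EdDSA": hashlib.sha512,
}


def at_hash(access_token, signing_alg):
    """Left-most half of the token hash, base64url-encoded without padding."""
    digest = _HASH_FOR_ALG[signing_alg](access_token.encode("ascii")).digest()
    left_half = digest[: len(digest) // 2]
    return base64.urlsafe_b64encode(left_half).rstrip(b"=").decode("ascii")


sample_token = "example-access-token"
print(at_hash(sample_token, "ES256"))
```

    A relying party recomputes this value from the access token it received
    and compares it with the at_hash claim in the ID token.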

  ACR/AMR claims (RFC 8176):
    ACR (Authentication Context Class Reference):
      "1" = single factor (password only)
      "2" = multi-factor or strong single factor (WebAuthn, x509)
    AMR (Authentication Methods References):
      Values: pwd (password), otp (TOTP/email OTP), and hwk (WebAuthn) per RFC 8176, plus x509 for client certificates.
      Carried through the entire token lifecycle.
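
    One way to read the ACR rule above is as a function of the AMR list.
    This is a sketch under that reading, not the gateway's actual logic:
    acr_for is a hypothetical name, and treating any two distinct methods as
    multi-factor is an assumption:

```python
# Methods the list above treats as strong on their own.
STRONG_SINGLE_FACTOR = {"hwk", "x509"}


def acr_for(amr):
    """Return "2" for multi-factor or strong single factor, otherwise "1"."""
    methods = set(amr)
    if len(methods) >= 2 or methods & STRONG_SINGLE_FACTOR:
        return "2"
    return "1"


print(acr_for(["pwd"]), acr_for(["pwd", "otp"]), acr_for(["hwk"]))
```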

  response_mode parameter:
    query (default): authorization code delivered via redirect query string.
    form_post: code delivered via auto-submitting HTML form (POST).
    form_post includes security headers (X-Frame-Options, Referrer-Policy).

  Consent:
    Per-client skip_consent config skips consent screen for first-party apps.
    The built-in proxy SSO client and DCR clients skip consent.
    Unknown clients always show consent screen.
    prompt=consent overrides skip_consent.

Timing attack protection:

  All security-sensitive comparisons use crypto/subtle.ConstantTimeCompare:
  client secrets, PKCE verifiers, authorization code validation, refresh token
  client binding, DPoP thumbprints, token ownership, mTLS SAN/DN matching,
  and certificate thumbprint binding.
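
  For readers more familiar with Python, the standard-library counterpart
  of Go's crypto/subtle.ConstantTimeCompare is hmac.compare_digest. A
  minimal sketch, with secrets_match as an illustrative name:

```python
import hmac


def secrets_match(presented, expected):
    """Compare two secrets without leaking the position of the first mismatch."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


stored = "a" * 32
print(secrets_match("a" * 32, stored), secrets_match("a" * 31 + "b", stored))
```

  An ordinary equality check can return as soon as it finds a differing
  byte, which is the timing signal these comparisons are meant to avoid.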

Client security:

  Client secrets require minimum 32 characters with entropy validation.
  Strict redirect URI validation with wildcard security (HTTPS enforced).
  State parameter minimum entropy requirements (32+ characters).
  IP allowlisting per client via CIDR notation.
  Public clients (no secret) MUST set require_pkce=true.
  mTLS clients authenticate via certificate (no secret needed).

Proxy SSO client:

  Automatically managed — not configured via TOML.
  Secret derived from cluster key (consistent across all nodes).
  PKCE S256 required. Redirect URIs validated against live proxy mappings.
  Token exchange handled internally (no external HTTP round-trips).
  Invalid or disabled proxy mappings excluded from redirect URI validation.

Personal Access Tokens (PATs):

  Pre-issued long-lived tokens for hexonclient CLI, CI pipelines, and automation.
  Each PAT is a signed JWT backed by a server-side session for revocation control.
  The JWT allows stateless validation; the session enables instant revocation.
  Step-up 2FA required before creation (TOTP or email OTP) — even if already logged in.
  Server-side revocation: revoking a PAT invalidates the JWT at the next validation check.
  Per-user limit (pat_max_per_user, default 10) prevents token accumulation.
  Max TTL cap (pat_max_ttl, default 90d, max 365d) limits blast radius of stolen tokens.
  Optional IP restriction (allowed_ips) checked at validation time.
  Email notification on creation — user alerted if PAT created without their knowledge.
  Last-used tracking (IP + timestamp) for forensics and audit trail.
  Auto-revoke on user disable — directory bulk revocation includes PATs.
  Active connector (QUIC) connections severed immediately on revocation.
  PATs are distinguished from other token types by a dedicated audience claim.
  PAT names optional (default "Token <date>"), duplicate names rejected (case-insensitive).
  Optional group restriction (pat_required_groups) — when set, user must have any listed group.
  Group check enforced at issuance (OIDC module), profile UI (hides section), and bastion CLI.
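
  The combination of a signed JWT and a server-side session can be pictured
  as the check below, which runs after the JWT signature has been verified.
  The claim name "sid", the session dictionary layout, and the function name
  are all assumptions made for illustration; only exp and allowed_ips
  correspond to concepts named in this section:

```python
import ipaddress


def pat_is_valid(claims, sessions, remote_ip, now):
    """Post-signature checks: expiry, backing session, optional IP restriction."""
    if claims.get("exp", 0) <= now:
        return False
    session = sessions.get(claims.get("sid"))
    if session is None:
        # Session deleted: the PAT was revoked even though the JWT still verifies.
        return False
    allowed = session.get("allowed_ips") or []
    if allowed:
        addr = ipaddress.ip_address(remote_ip)
        if not any(addr in ipaddress.ip_network(cidr, strict=False) for cidr in allowed):
            return False
    return True


sessions = {"s1": {"allowed_ips": ["10.0.0.0/8"]}}
claims = {"sid": "s1", "exp": 1000}
print(pat_is_valid(claims, sessions, "10.1.2.3", now=500))
```

  Deleting the session entry is what makes revocation take effect at the
  next validation, independent of the JWT's own expiry.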

  PoW-free proxy access:
    All Bearer tokens (opaque access tokens, JWT ID tokens, PATs) bypass Proof-of-Work
    challenges and OIDC browser redirects entirely. The proxy middleware chain resolves
    Bearer tokens at step 1 — before PoW, before OIDC redirect. Two on-ramps:
      Browser: PoW → OIDC SSO → cookie → proxy (human path)
      Machine: Bearer <token> → proxy (machine path, no round-trips)
    Token types: client_credentials grant (M2M), kubelogin ID tokens, PATs (long-lived
    with session-backed revocation + IP restrictions). Same group authorization, identity
    headers, and Ed25519 signing apply to both paths.

Dynamic Client Registration (RFC 7591):

  Fully stateless — no database, no KV storage, no cache.
  Client IDs use "dcr-" prefix + UUID for recognition.
  Client secrets deterministically derived from the cluster signing key.
  All cluster nodes derive identical secrets. PKCE always required.
  Redirect URIs: loopback always allowed (RFC 8252 §7.3):
    http://localhost[:port][/path], http://127.0.0.1[:port][/path], http://[::1][:port][/path].
  Additional domains via allow_dcr_redirect_domains (exact match or *.example.com wildcard).
  Use allow_dcr_redirect_domains = ["*"] to allow any HTTPS domain (for web-based MCP clients).
  Non-loopback redirect URIs require HTTPS.
  CIDR allowlist (allow_dcr_from) controls which IPs can register.
  Rate limited per IP via rate_limit_dcr config.
  Cannot revoke individual DCR clients — toggle enable_dcr=false to disable all.
  MCP service requires enable_dcr = true for OAuth-based MCP client authentication.
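
  The redirect URI rules above can be sketched as a single predicate. This
  is an illustration under stated assumptions rather than the gateway's
  validator: redirect_uri_allowed is a hypothetical name, and it assumes
  that "*.example.com" does not match the bare domain "example.com":

```python
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def redirect_uri_allowed(uri, allowed_domains):
    """Loopback over http is always allowed; anything else needs https and an allowlisted domain."""
    try:
        parts = urlsplit(uri)
    except ValueError:
        return False
    host = (parts.hostname or "").lower()
    if parts.scheme == "http":
        return host in LOOPBACK_HOSTS
    if parts.scheme != "https" or not host:
        return False
    for pattern in allowed_domains:
        if pattern == "*":
            return True
        if pattern.startswith("*."):
            # "*.example.com" matches subdomains by their ".example.com" suffix.
            if host.endswith(pattern[1:].lower()):
                return True
        elif host == pattern.lower():
            return True
    return False


print(redirect_uri_allowed("http://127.0.0.1:5555/callback", []))
```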

Troubleshooting

Common symptoms and diagnostic steps:

Token exchange failures (invalid_grant):

  - Authorization code expired (10-minute TTL): user took too long to complete flow
  - Code already consumed (single-use): possible replay attack or double-submit
  - PKCE verifier mismatch: client sent wrong code_verifier for the code_challenge
  - Client ID mismatch: code was issued to a different client
  - Redirect URI mismatch: URI in token request differs from authorization request
  - Start with: 'auth status' to check OIDC module health
  - Check: 'diagnose user <username>' for cross-subsystem user access diagnostic

DPoP validation failures:

  - proof_too_old: DPoP proof timestamp older than 60 seconds (clock skew?)
  - proof_from_future: client clock ahead of server (NTP issue)
  - jti_replay: same JTI used twice within 120 seconds (SECURITY: possible attack)
    Note: default mode has a small replay window during cluster propagation.
    Set dpop_strict_replay = true to eliminate this window.
  - invalid_nonce: nonce not found or expired (60-second TTL, single-use)
  - htm_mismatch / htu_mismatch: proof HTTP method or URI does not match request
  - thumbprint_error: JWK thumbprint computation failed (malformed key)
  - Monitor: alert on ANY oidc_dpop_jti_replay_total increments

mTLS authentication failures:

  - No client certificate: TLS handshake did not include certificate
  - SAN/DN mismatch: certificate identity does not match client config
  - Certificate too large: DER exceeds 16KB limit
  - CA trust failure: certificate not signed by expected CA (check client_ca_pem)
  - Wrong identity method: client configured with san_uri but cert has san_dns
  - Check: 'auth status' for authentication system overview

Token refresh failures:

  - Refresh token expired (default 30-day TTL)
  - Client ID mismatch: refresh token bound to different client
  - Certificate binding mismatch (mTLS): presented cert differs from original
  - DPoP key mismatch: different key used than at token issuance
  - Token revoked: check if bulk revocation was triggered
  - Check: 'sessions list --user=<username>' for active sessions

M2M (client_credentials / jwt-bearer) failures:

  - ip_not_allowed: source IP not in client allow_client_from CIDR list
  - Invalid client secret: ensure 32+ chars, check for trailing whitespace
  - Wrong grant type: client must have grant type in allowed_grant_types
  - Scope not allowed: requested scope not in client allowed_scopes
  - JWT assertion: check algorithm matches jwt_algorithm, verify issuer/subject
  - JWT public key: ensure PEM format is correct and algorithm matches key type

PAR (Pushed Authorization Request) failures:

  - replay_attempt: request_uri already consumed (SECURITY: possible replay attack)
  - expired: request_uri TTL exceeded (default 5 minutes)
  - client_mismatch: different client_id attempting to use another client's request_uri
  - invalid_length: request_uri format does not match expected 78-character URN
  - Monitor: alert on oidc_par_consume_total result=replay_attempt

Authorization endpoint (OIDC Core) issues:

  - prompt=none returns login_required: user has no active session; expected behavior
  - prompt=none returns consent_required: client requires consent but prompt=none forbids it
  - prompt=login redirect loop: session freshness check prevents infinite loops (30s guard)
  - max_age forces re-auth: session age exceeds max_age seconds; user must re-authenticate
  - "Unsupported prompt value": client sent invalid prompt value (only none, login, consent allowed)
  - "Invalid max_age": client sent non-numeric max_age value
  - Consent screen shown unexpectedly: check skip_consent on client config ('config show authentication')
  - at_hash missing in ID token: at_hash only present when ID token issued alongside access token
  - ACR shows "1" despite MFA: check session auth_method metadata matches expected method
  - form_post not working: ensure client accepts POST at redirect_uri; check response_mode=form_post

Dynamic Client Registration (RFC 7591) failures:

  - 404 on POST /oidc/register: enable_dcr is false in configuration
  - access_denied: source IP not in allow_dcr_from CIDR allowlist
  - invalid_redirect_uri: redirect domain not in allow_dcr_redirect_domains and not loopback
  - Non-loopback redirect URIs must use HTTPS
  - For web-based MCP clients: set allow_dcr_redirect_domains = ["*"] to allow any HTTPS domain
  - For native CLI MCP clients: no domain config needed (loopback always allowed)
  - Client secret not working: ensure client is using the client_secret returned at registration
  - Token exchange fails: DCR clients require PKCE (S256); ensure code_challenge is sent
  - Check: 'config show authentication' to verify enable_dcr and allow_dcr_redirect_domains settings

PAT (Personal Access Token) failures:

  - "PAT revoked or expired": session deleted or TTL exceeded — check 'sessions list --type=pat --user=X'
  - "maximum PAT limit reached": user has pat_max_per_user tokens — revoke unused ones first
  - "PAT name already exists": case-insensitive duplicate — use a different name
  - "authentication failed" after revoke: expected — session deletion invalidates JWT at next check
  - Token not working after creation: ensure hexonclient uses --token flag with full JWT string
  - IP restriction error: remote IP not in allowed_ips metadata — check 'pats show <session_id>'
  - "your groups do not permit PAT creation": user not in pat_required_groups — check 'config show authentication' and user's groups
  - PAT section hidden in profile: pat_required_groups is set and user not in any listed group
  - Step-up verification required: user must complete TOTP or email OTP before PAT creation
  - PAT not working as proxy Bearer token: check 'logs search "handlers.bearer"' — look for
    "PAT rejected" (revoked session) or "Cached PAT rejected" (stale cache, auto-invalidated)
  - PAT introspection returns {active: false}: ensure token_type_hint is "" or "pat",
    check session exists ('sessions list --type=pat'), verify JWT not expired
  - PAT proxy access denied despite valid token: check allowed_ips — proxy enforces IP restriction
    from session metadata. Use 'pats show <session_id>' to see allowed_ips list
  - Check: 'pats list --user=X' to see all PATs for a user
  - Check: 'sessions list --type=pat' for all PAT sessions cluster-wide
  - Check: 'logs search "oidc.pat"' for PAT issuance and validation logs
  - Check: 'logs search "handlers.bearer"' for proxy bearer middleware PAT validation logs

Proxy SSO redirect loops:

  - OIDC callback failing: check proxy oidc_providers configuration
  - Token exchange fails: proxy exchanges tokens internally (no external HTTP hairpin)
  - Cross-domain cookie: verify proxy hostname matches cookie domain
  - Check: 'sessions list --type=proxy --user=<username>'
  - Check: 'proxy traffic <app>' for per-route metrics

Threshold signing issues:

  - signer_type=deterministic when threshold expected: check cluster quorum, 'cluster status'
  - "Threshold signing unavailable but required": threshold_required is set but quorum lost
  - "OIDC switched to deterministic fallback signing": threshold signer lost, using HKDF key
  - Algorithm mismatch: threshold signer algorithm must match signing_algorithm config
  - Check logs: 'logs search "oidc.keys"' for signing mode transitions

Key rotation / history issues:

  - Token validation fails after key rotation: check 'auth keys' — is the old kid still listed?
  - Key history empty: keys are recorded on first token signing or key rotation
  - Historical key expired from history: TTL may be too short relative to token lifetimes
  - Token signed with unknown kid: historical key may have expired from KV — restart loads from KV
  - Check: 'auth keys' — shows kid, algorithm, curve, expiry, and remaining TTL

Health check failures:

  - signing_key_loaded=false: signing key derivation failed (check signing_key length)
  - entropy_validated=false: signing key has insufficient entropy (weak key)
  - issuer_configured=false: hostname not set in configuration
  - Use: 'auth status' for OIDC health overview

General diagnostic commands:

  'auth status'              - Authentication system status overview
  'auth tokens'              - Active OIDC tokens and sessions
  'auth oidc'                - OIDC provider config and registered clients
  'auth keys'                - Active signing keys with kid, algorithm, and TTL
  'diagnose user <username>' - Cross-subsystem user access diagnostic
  'sessions list --user=X'   - List active sessions for a user
  'sessions revoke-user X'   - Revoke all sessions for a user (emergency)
  'logs search oidc'         - Search logs for OIDC-related entries
  'metrics prometheus oidc'  - Raw OIDC Prometheus metrics

Architecture

How the OIDC provider works at the cluster level:

The OIDC module operates cluster-wide. All token operations, key management, and revocation are replicated to all nodes automatically. The HTTP service layer handles request parsing and delegates to the OIDC module internally.

Operation categories:

Authorization (user login flows)
- Authorization code generated after user authentication (10-minute single-use TTL)
- Code exchange validates PKCE, client credentials, redirect URI, then issues tokens
- Supports: prompt (none/login/consent), max_age, response_mode (query/form_post)
- Per-client skip_consent controls whether the consent screen is shown
Token management
- Refresh validates client binding, DPoP key, and certificate binding
- Bulk revocation is replicated to all nodes for immediate effect
- Introspection returns confirmation claims for DPoP-bound and cert-bound tokens
- Introspection also supports PATs (returns token name and ID)
Machine-to-machine (M2M)
- Client credentials: secret-based auth with scope-to-group mapping
- JWT bearer: certificate-based auth with public key validation
- Both return access tokens only (no refresh token, no ID token)
- Scope-to-group mapping bridges OAuth scopes to Hexon group authorization
Device authorization
- Issues tokens after the device authorization flow completes
- Used by bastion SSH for user authentication via browser
Discovery
- JWKS exposes signing public keys for external JWT verification
- OpenID Configuration provides standard OIDC discovery metadata
- Discovery advertises supported response modes, claims, and PKCE methods
Dynamic Client Registration (DCR)
- Stateless: each DCR client gets a unique ID (dcr- prefix) and derived secret
- No storage needed — client credentials are deterministically reproducible
- PKCE required; redirect URIs: loopback always allowed + operator-configured domains
Pushed Authorization Requests (PAR)
- Authorization parameters stored server-side (not exposed in browser URL)
- Single-use consumption prevents replay attacks
- Client binding enforced with constant-time comparison
Personal Access Tokens (PATs)
- JWT signed and displayed once at creation — never stored server-side
- Three validation paths: connector (QUIC), HTTP proxy Bearer header, introspection
- All paths verify JWT signature + server-side session existence + optional IP restriction
- Revocation deletes the session and immediately disconnects active connections
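
The code exchange step listed under Authorization validates PKCE before tokens are issued. A minimal sketch of an S256 check as defined in RFC 7636, assuming the rules hinted at by the oidc.pkce log entries (verifier length and character checks, plain method rejected). The function names are hypothetical and this is not the gateway's implementation.

```python
import base64
import hashlib
import hmac
import re

# RFC 7636: the verifier is 43-128 characters from the unreserved set.
_VERIFIER_RE = re.compile(r"[A-Za-z0-9\-._~]{43,128}")


def s256_challenge(verifier: str) -> str:
    """Return BASE64URL(SHA256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(verifier: str, challenge: str, method: str) -> bool:
    """Hypothetical check: accept only S256 with a well-formed verifier."""
    if method != "S256":                      # "plain" is rejected
        return False
    if not _VERIFIER_RE.fullmatch(verifier):  # wrong length or characters
        return False
    return hmac.compare_digest(s256_challenge(verifier), challenge)
```

The client sends the challenge with the authorization request and the verifier with the code exchange; the server recomputes the challenge and compares.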

Token replication model:

  - Authorization codes: local node only (short-lived, single-use)
  - Access/refresh tokens: replicated to all nodes with quorum
  - DPoP JTIs and nonces: best-effort replication (short TTL)
  - PAR requests: best-effort replication (short TTL, single-use)
  - PAT sessions: managed by the sessions module (TTL per token, up to pat_max_ttl)

Key management:

  Signing keypair derived deterministically from signing_key using HKDF-SHA256.
  Supports ES256 (P-256, default), ES384 (P-384), ES512 (P-521), and EdDSA (Ed25519).
  All cluster nodes produce identical keys from the same signing_key — no key
  synchronization needed. Keys remain stable across restarts.
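
The property described above (every node derives the same key material from the same signing_key) follows from HKDF being a pure function of its inputs. A minimal sketch of HKDF-SHA256 (RFC 5869) using only the standard library. The salt and info values are assumptions, since the labels used by the gateway are not documented here, and the step that turns the derived bytes into an EC or Ed25519 private key is not shown.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if length > 255 * 32:
        raise ValueError("requested length too large for HKDF-SHA256")
    # Extract: PRK = HMAC(salt, IKM); an empty salt means 32 zero bytes.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) | info | i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Hypothetical inputs: the key and the info label are illustrative only.
signing_key = b"example-signing-key-with-enough-entropy-0123456789"
node_a = hkdf_sha256(signing_key, b"", b"oidc-signing-es256", 32)
node_b = hkdf_sha256(signing_key, b"", b"oidc-signing-es256", 32)
```

Both calls return the same 32 bytes, which is why no key synchronization between nodes is needed in this model.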

  Threshold signing is preferred when cluster quorum is available. The OIDC module
  auto-switches signing mode based on cluster health. If threshold_required is set
  in config, the deterministic fallback is disabled (fail-closed on quorum loss).

  Key history (rotation support):
  On key rotation, old signing keys are retained so that tokens signed with the
  previous key can still be verified. Each key is identified by its kid (Key ID).
  Historical keys have a TTL based on the longest-lived token signed with them.
  JWKS endpoint serves all active keys (current + historical).
  Inspect active keys: 'auth keys' shows kid, algorithm, curve, and TTL.

Metrics and observability:

  Comprehensive Prometheus metrics exported for all operations:
  - Token operations: exchange, refresh, revocation, introspection, userinfo
  - DPoP: validation, JTI replay detection, nonce generation and validation
  - mTLS: authentication attempts, certificate binding validations
  - PAR: request creation, consumption, replay detection
  - Latency histograms: ID token, auth code, access token generation
  - Validation failures: PKCE, scope, redirect URI, signing key entropy

Relationships

Module dependencies and interactions:

proxy: Provides SSO authentication via a built-in proxy client. Authorization codes are exchanged internally (no external HTTP round-trips). Redirect URIs validated against live proxy mapping config. Proxy sessions use 24-hour token TTL.
devicecode: Issues tokens after device authorization flow completes. Used by bastion SSH — trusted internal callers skip client validation.
directory: Provides user information (groups, email, name) for token claims. When a user is disabled in the directory, all their tokens are revoked cluster-wide. Group memberships are included in ID tokens and used for scope-to-group mapping in M2M flows.
sessions: OIDC tokens create sessions for proxy and bastion flows. Session revocation triggers token revocation for the associated user.
authentication.x509: TLS layer validates client certificates against the global CA pool. OIDC performs identity matching (SAN/DN) and optional per-client CA trust validation on top of TLS-layer authentication.
spiffe: SPIFFE X.509-SVIDs used for mTLS client authentication via URI SAN. No separate CA infrastructure needed; reuses ACME SPIFFE profile certificates.
bastion: Bastion SSH uses device authorization flow for user authentication. Bastion shell also provides 'pat create/list/revoke' commands with inline TOTP/email OTP verification.
firewall: Network-level access rules applied before OIDC HTTP endpoints. IP allowlisting per client provides additional application-layer restriction.
protection: Rate limiting applied to token and authorization endpoints. Prevents brute-force attacks on client credentials and authorization codes.
mcp: MCP service uses DCR for OAuth-based authentication. MCP clients register dynamically via POST /oidc/register, then complete Authorization Code + PKCE flow. Also supports static bearer token auth as fallback.
connector (hexonclient): PATs are used for QUIC connector authentication. Validates JWT signature + session existence. Active connections are severed immediately when a PAT is revoked. Last-used metadata updated on each use.
proxy (Bearer tokens): PATs can be used as HTTP Bearer tokens for proxy access. Bearer middleware validates the JWT and checks the server-side session on every request (revocation takes effect immediately). IP restrictions from the PAT are enforced at the proxy layer.
profile: Profile web UI allows PAT creation (with step-up 2FA gate), listing, and revocation.
admin CLI: 'pats' command for cross-user PAT management with step-up verification.
smtp: Email notification sent on PAT creation, including token name, expiry, and the IP address used during creation.
cluster: All token operations are replicated cluster-wide. Key derivation ensures all nodes produce identical signing keypairs from the same signing_key.

Logs

Log entries by operation. Search with: 'logs search "oidc"'. Levels: ERROR > WARN > INFO > DEBUG > TRACE. DEBUG/TRACE require log level configuration.

Authorization Code:

  oidc.authcode.generate  INFO   AUDIT  Generating authorization code
  oidc.authcode.generate  WARN   AUDIT  Rate limited / unknown client / invalid redirect URI
  oidc.authcode.generate  WARN          PKCE missing, unauthorized scope, IP not allowed
  oidc.auth               ERROR         RNG failure during code generation (critical)

Token Generation & Exchange:

  oidc.token.exchange     INFO   AUDIT  Authorization code exchanged for tokens
  oidc.token.exchange     WARN          Invalid/expired code, PKCE failed, client/redirect mismatch
  oidc.tokens.generate    INFO   AUDIT  Tokens issued successfully
  oidc.tokens.generate    ERROR         Token generation failed (signing key, RNG)
  oidc.tokens.saga        ERROR         Saga step failed during token storage
  oidc.token.refresh      INFO   AUDIT  Token refresh requested
  oidc.token.refresh      WARN          Token not found, client mismatch, invalid scope
  oidc.tokens.refresh     INFO   AUDIT  Tokens refreshed (internal)
  oidc.tokens.refresh     WARN          Refresh generation failed
  oidc.token.signing      WARN          Signing retry (threshold signer unavailable)
  oidc.token.signing      ERROR         All signing attempts failed
  oidc.ratelimit.status   DEBUG         Rate limit check result

ID Token:

  oidc.idtoken            ERROR         Signing key not loaded, signing failed
  oidc.idtoken            DEBUG         DPoP/cert binding applied, signer type

Crypto:

  oidc.crypto             ERROR         RNG failure in secure token generation (critical)

Introspection & Revocation:

  oidc.introspect         DEBUG         Token introspected (active true/false, type)
  oidc.revoke             INFO   AUDIT  Token revoked
  oidc.revoke_user_tokens INFO          Bulk user token revocation (account disable/delete)

Client Authentication & Validation:

  oidc.client_auth        WARN          Secret mismatch, JWT assertion failed, unknown method
  oidc.validation         WARN          Redirect URI invalid, wildcard rejected, entropy check
  oidc.pkce               WARN          Invalid verifier length/chars, plain method rejected
  oidc.pkce               TRACE         PKCE validation result

DPoP (RFC 9449):

  oidc.dpop               WARN          JTI replay detected
  oidc.dpop               DEBUG         Proof validation (htm/htu mismatch, expired, future)
  oidc.dpop.nonce         WARN          Nonce validation failed, storage error
  oidc.dpop.nonce         DEBUG         Nonce generated, validated, stored

PAR (RFC 9126):

  oidc.par                INFO          PAR request created
  oidc.par                WARN          Auth failed, request too large, replay attempt
  oidc.par                ERROR         Failed to generate request_uri

mTLS (RFC 8705):

  oidc.mtls               WARN          No certificate, CA mismatch, no identity fields
  oidc.mtls               DEBUG         SAN mismatch (URI/DNS/email/subject DN)
  oidc.mtls               TRACE         Client authenticated via matched method

M2M:

  oidc.client_credentials INFO   AUDIT  Access token generated
  oidc.jwt_bearer         WARN          Invalid JWT assertion

Keys & Init:

  oidc.init               INFO          OIDC provider initializing/disabled
  oidc.init               ERROR         Signing key validation failed (critical)
  oidc.keys               INFO          Key generated, threshold signing active
  oidc.keys               WARN          Threshold signer unhealthy/algorithm mismatch
  oidc.keys               ERROR         Key not configured, too short, low entropy
  oidc.key_history        INFO          Key history loaded/rotated
  oidc.key_history        WARN          Key history storage failed
  oidc.jwks               DEBUG         JWKS requested
  oidc.jwks               WARN          Unknown client requesting JWKS

UserInfo:

  oidc.userinfo           INFO   AUDIT  UserInfo served
  oidc.userinfo           WARN          Token invalid, user not found, scope insufficient

Bearer Token Minting:

  oidc.mint_bearer        INFO   AUDIT  Bearer token minted for proxy
  oidc.mint_bearer        ERROR         Minting failed (signing key, invalid request)

DCR (Dynamic Client Registration):

  oidc.dcr                INFO   AUDIT  Dynamic client registered

PAT (Personal Access Tokens):

  oidc.pat.issue          INFO   AUDIT  PAT issued
  oidc.pat.issue          ERROR         Signing key not loaded, signing/session failed

Token Validation:

  oidc.validate_id_token  INFO          ID token validated

Device Code:

  oidc.device_code        INFO          Generating tokens for device authorization
  oidc.device_code        INFO   AUDIT  Device code grant successful
  oidc.device_code        ERROR         Token generation failed

Logout:

  oidc.logout             INFO   AUDIT  Logout completed, tokens revoked

Health:

  oidc.healthcheck        DEBUG         Health check performed

Metrics

Prometheus metrics. Query with: metrics prometheus oidc_<name>

Token Issuance:

  oidc_authcode_generation_total           counter    {result, reason}          Auth code generation
  oidc_token_exchange_total                counter    {result, reason}          Code-for-token exchanges
  oidc_token_refresh_total                 counter    {result, reason}          Token refreshes
  oidc_tokens_revoked                      counter    {}                        Tokens revoked on logout
  oidc_token_signing_retry_total           counter    {result, reason|attempt}  Signing retries (threshold signer)

Client Auth:

  oidc_validation_failure_total            counter    {type, client_id}         PKCE/scope/redirect failures
  oidc_mtls_auth_total                     counter    {result, reason|method}   mTLS auth (failure: reason, success: method)

DPoP:

  oidc_dpop_validation_total               counter    {result, reason}          Proof validation
  oidc_dpop_jti_replay_total               counter    {detected}                Replay detections
  oidc_dpop_jti_storage_total              counter    {result}                  JTI cache operations
  oidc_dpop_nonce_generation_total         counter    {result}                  Nonce generation
  oidc_dpop_nonce_storage_total            counter    {result}                  Nonce cache operations
  oidc_dpop_nonce_validation_total         counter    {result, reason}          Nonce validation

PAR:

  oidc_par_requests_total                  counter    {result, client_id}       PAR creation
  oidc_par_consume_total                   counter    {result, client_id}       PAR consumption
  oidc_par_request_duration                histogram  {client_id}               PAR processing latency

M2M:

  oidc_client_credentials_total            counter    {result, reason}          Client Credentials grants
  oidc_jwt_bearer_total                    counter    {result, reason}          JWT Bearer grants

Operations:

  oidc_token_introspection_total           counter    {result, token_type, active}  Token introspection
  oidc_token_revocation_total              counter    {result, token_type}      Token revocation
  oidc_userinfo_requests_total             counter    {result, reason}          UserInfo requests
  oidc_logout_total                        counter    {result}                  Logouts
  oidc_device_code_total                   counter    {result, reason}          Device code grants
  oidc_pat_issued_total                    counter    {username}                PAT issuance

Latency:

  oidc_id_token_generation_duration_ms     histogram  {}                        ID token generation
  oidc_access_token_generation_duration_ms histogram  {}                        Access token generation
  oidc_auth_code_generation_duration_ms    histogram  {}                        Auth code generation
  oidc_entropy_validation_duration_ms      histogram  {}                        Entropy validation

Alerts:

  rate(oidc_dpop_jti_replay_total[5m]) > 0              DPoP replay attack
  rate(oidc_validation_failure_total[5m]) > 10           High validation failure rate
  oidc_token_signing_retry_total > 0                     Signing key issues
  rate(oidc_par_consume_total{result="replay_attempt"}[5m]) > 0  PAR replay attempt

Email OTP

Delivers one-time codes via email for second-factor authentication — brute-force and replay protected

Overview

Sends a one-time code to the user’s email for second-factor verification. Used as an MFA step after primary authentication — no app installation required, works with any email provider. Applies when the signin flow requires MFA and email OTP is configured as an available method.

How it works:

  1. User completes primary authentication
  2. The gateway generates a one-time code and emails it
  3. User submits the code — validated with constant-time comparison
  4. Code consumed on use — replay and brute-force protected

Two code formats:

  - Numeric (digits 0-9) — standard, most familiar
  - BASE20 (20 uppercase consonants: BCDFGHJKLMNPQRSTVWXZ) — avoids profanity, easier to read aloud

Security features: device-based rate limiting, resend delay enforcement, configurable max retry limits with OTP locking, email domain allowlisting, and hashed storage keys for privacy.

JIT-2FA override: when a webhook-validated scenario has already confirmed the user’s identity, the OTP step can be bypassed via JIT-2FA integration.

Config

Configuration under [authentication.otp]:

[authentication.otp]
  length = 6                        # OTP code length (4-12, recommended: 4-8)
  type = "numeric"                  # Code type: "numeric" or "base20"
  valid = "5m"                      # OTP expiration duration (bounds: 1m-30m)
  resend_time = 60                  # Minimum seconds between OTP requests per device
  max_retries = 5                   # Max failed validation attempts before OTP locked
  mask_email = true                 # Mask email in MFA page ("user****@example.com")
  domains = [                       # Allowed email domains (empty = all allowed)
    "example.com",
    "company.org",
  ]

Code type selection:

  "numeric": Standard digit-only codes, works with any keyboard layout
  "base20": Consonant-only uppercase codes, prevents generating offensive words
  Invalid type values fall back to "numeric" with a warning log

Override fields for JIT-2FA and programmatic callers:

  TypeOverride: Override code type per-request (empty = global config)
  CodeLengthOverride: Override code length per-request (bounds: 4-12)
  TTLOverride: Override expiration per-request (bounds: 1m-30m)
  ResendTimeOverride: Override resend cooldown per-request (bounds: 10s-5m)
  SkipDomainCheck: Bypass email domain allowlist (for webhook-validated flows)
  MaxRetriesOverride: Override max failed attempts per-request (bounds: 1-10)
  Resolution chain for all overrides: per-request > global config > default
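
The resolution chain above can be sketched as a small helper. This is an illustration only: the names are hypothetical, None stands in for "not set", the default of 6 is taken from the sample config, and rejecting out-of-range values is an assumption (the source does not say whether the module rejects or clamps them).

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def resolve(per_request: Optional[T], global_config: Optional[T], default: T) -> T:
    """First value that is set wins: per-request > global config > default."""
    if per_request is not None:
        return per_request
    if global_config is not None:
        return global_config
    return default


def resolve_code_length(override: Optional[int], configured: Optional[int]) -> int:
    """Resolve the OTP code length and check it against the 4-12 bounds."""
    length = resolve(override, configured, 6)
    # Assumption: out-of-range values are rejected rather than clamped.
    if not 4 <= length <= 12:
        raise ValueError(f"code length {length} outside 4-12")
    return length
```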

MaxRetries behavior:

  When retry count reaches max_retries, OTP is locked (not deleted).
  Locked OTPs block both validation AND resend requests.
  This prevents brute-force bypass via the resend trick (request new code
  after exhausting retries on the current one).
  Locked OTPs expire naturally via TTL for automatic cleanup.

Resend behavior:

  Retry and attempt counters are preserved across resends for the same email.
  This prevents attackers from resetting counters by requesting a new code.
  Counters only reset when a different email is used from the same device.
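
The lock and resend rules above interact, so a small model may help. This sketch uses a hypothetical in-memory record in place of the cluster-stored OTP metadata; field and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class OtpRecord:
    """Hypothetical in-memory stand-in for the stored OTP metadata."""
    code: str
    retries: int = 0
    locked: bool = False
    last_sent: float = 0.0


def record_failure(rec: OtpRecord, max_retries: int = 5) -> None:
    """Count a failed attempt and lock (not delete) the OTP at the limit."""
    rec.retries += 1
    if rec.retries >= max_retries:
        rec.locked = True


def resend(rec: OtpRecord, new_code: str, now: float, resend_time: float = 60.0) -> bool:
    """Issue a fresh code only if unlocked and the cooldown has elapsed."""
    if rec.locked or now - rec.last_sent < resend_time:
        return False
    rec.code = new_code      # new code, same counters
    rec.last_sent = now
    return True
```

Because resend never touches retries, an attacker who requests a fresh code after four wrong guesses still has only one guess left.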

All settings are hot-reloadable (read dynamically on each operation).

Troubleshooting

Common symptoms and diagnostic steps:

User does not receive OTP email:

  - Check email domain is in the allowed domains list
  - Verify SMTP module health: 'smtp health'
  - Check telemetry logs for "Failed to send verification email"
  - GenerateOTP propagates SMTP errors to callers — a successful API
    response means the email was accepted by the SMTP server. If
    GenerateOTP returned an error, the user definitely did not get
    the email; check the error message for SMTP-specific detail
  - Verify the user's email address format is valid (must contain @)

“email domain not allowed” error:

  - Email domain not in [authentication.otp] domains list
  - Domain check is case-insensitive
  - Empty domains list or ["*"] allows all domains
  - JIT-2FA callers should set SkipDomainCheck=true if webhook validates
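
The allowlist rules above can be sketched as follows. The function name is hypothetical, exact domain matching (no subdomain matching) is an assumption, and so is rejecting malformed addresses even when the domain check is skipped.

```python
from typing import Sequence


def domain_allowed(email: str, domains: Sequence[str], skip_domain_check: bool = False) -> bool:
    """Hypothetical allowlist check following the rules listed above."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return False                 # invalid format: must contain @
    if skip_domain_check:
        return True                  # SkipDomainCheck=true
    if not domains or "*" in domains:
        return True                  # empty list or ["*"] allows all
    return domain.lower() in {d.lower() for d in domains}
```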

“unidentified device” error:

  - DeviceID is empty in the GenerateOTP request
  - Handler must generate device fingerprint before calling OTP module
  - DeviceID is required for rate limiting and device-email binding

“this device has already requested a code” error:

  - Device has an active (non-expired) OTP for a different email address
  - Prevents attacker from using victim's device session for their email
  - Wait for existing OTP to expire, or use a different device identifier

“please wait X before requesting another code” error:

  - Resend delay not elapsed (default: 60 seconds between requests)
  - Check resend_time config or ResendTimeOverride bounds (10s-5m)

“too many failed attempts” error:

  - OTP locked after max_retries exceeded (default: 5 attempts)
  - Locked OTPs also block resend requests to prevent bypass
  - User must wait for OTP to expire (TTL) then request a new one
  - Check logs for "SECURITY: OTP locked due to max retry attempts exceeded"

OTP validation returns Valid=false without error:

  - Code expired (check valid duration in config)
  - Incorrect code submitted (case-insensitive comparison)
  - No OTP found for email/device combination
  - OTP already consumed (cluster-atomic single-use via authclaim; Reason="already_used")
  - OTP locked from previous max retries exceeded

“OTP storage quorum not reached” error:

  - Insufficient cluster nodes confirmed storage (need >50%)
  - Check cluster health: 'cluster status'
  - May indicate network partition or node failures

Metrics for monitoring:

  - otp.codes_generated (type=numeric|base20): Generation count by type
  - otp.validations_total (result=valid|invalid): Overall validation outcomes
  - otp.validation_failures (reason=not_found|expired|invalid_code|max_retries|locked):
    Failure breakdown by reason
  - otp.replay_prevented: Successful validations where OTP was deleted

Security

Security design and hardening:

Code generation:

  Cryptographically secure random generation using crypto/rand.
  Rejection sampling eliminates modulo bias in digit selection:
    For numeric (base 10): Accept bytes 0-249, reject 250-255 (2.3% rejection rate).
    For BASE20 (base 20): Accept bytes 0-239, reject 240-255.
  This ensures perfectly uniform distribution across all code characters.
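
A minimal sketch of the rejection sampling described above. The module itself uses Go's crypto/rand; Python's secrets module is used here as a stand-in, and the function names are hypothetical.

```python
import secrets

NUMERIC = "0123456789"
BASE20 = "BCDFGHJKLMNPQRSTVWXZ"


def accept_limit(alphabet_size: int) -> int:
    """Largest multiple of alphabet_size that fits in one byte (256 values)."""
    return 256 - (256 % alphabet_size)


def generate_code(length: int, alphabet: str) -> str:
    """Draw uniform characters by discarding bytes at or above the limit."""
    limit = accept_limit(len(alphabet))   # 250 for base 10, 240 for base 20
    out = []
    while len(out) < length:
        for byte in secrets.token_bytes(length):
            if byte < limit and len(out) < length:
                out.append(alphabet[byte % len(alphabet)])
    return "".join(out)
```

Without the rejection step, byte values 250-255 would map to digits 0-5 one extra time each, making those digits slightly more likely than 6-9.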

Constant-time validation (timing attack resistance):

  All code paths execute identical operations regardless of OTP existence.
  When OTP not found: dummy code "DUMMY0000" and expired metadata are used.
  crypto/subtle.ConstantTimeCompare always called, even on storage errors.
  No early returns before the comparison operation.
  Prevents attackers from determining OTP existence via response time analysis.
  Prevents code enumeration through timing side channels.
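
The "compare first, decide afterwards" shape described above might look like this. It is a sketch only: hmac.compare_digest is the Python counterpart of crypto/subtle.ConstantTimeCompare, the function name and parameters are hypothetical, and upper-casing the input models the case-insensitive comparison under the assumption that stored codes are uppercase.

```python
import hmac
from typing import Optional

DUMMY_CODE = "DUMMY0000"


def validate(submitted: str, stored: Optional[str], expired: bool) -> bool:
    """Run the comparison on every path, then combine the results."""
    found = stored is not None
    reference = stored if found else DUMMY_CODE
    # No early return before this call, even when no OTP exists.
    matches = hmac.compare_digest(submitted.upper().encode(), reference.encode())
    return matches and found and not expired
```

Submitting the dummy value itself still fails, because the found flag is checked after the comparison.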

Brute-force protection:

  Configurable max retry limit (default: 5 failed attempts).
  OTP locked (not deleted) after max retries — blocks both validation and resend.
  For 6-digit numeric: 5/1,000,000 = 0.0005% success probability per OTP.
  Retry counters preserved across resends to prevent counter-reset bypass.
  Security event logged at WARN level when max retries exceeded.
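
The success probability quoted above generalizes to other lengths and code types. A short worked calculation, assuming uniformly random codes and distinct guesses; the function name is illustrative.

```python
from fractions import Fraction


def guess_probability(alphabet_size: int, length: int, max_retries: int) -> Fraction:
    """Chance that max_retries distinct guesses hit one uniformly random code."""
    return Fraction(max_retries, alphabet_size ** length)


numeric6 = guess_probability(10, 6, 5)   # 5 / 1,000,000
base20_6 = guess_probability(20, 6, 5)   # 5 / 64,000,000
print(f"{float(numeric6):.4%}")          # 0.0005%
```

A 6-character BASE20 code has 64 times as many possible values as a 6-digit numeric code, so the same retry limit gives a proportionally smaller chance of a correct guess.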

Device-email binding:

  Each device can only have one active OTP at a time.
  Device cannot switch to a different email while an active OTP exists.
  Prevents attacker from using a compromised device session for their own email.

Email privacy protection:

  Cache keys are SHA-256 hashes of "email|deviceID" (base64url encoded).
  Email addresses never stored directly in cache keys.
  Prevents email enumeration via cache key inspection.
  Deterministic hashing ensures consistent key derivation across cluster nodes.
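
A sketch of the key derivation described above. The separator and hash come from the text; everything else is an assumption, including the function name, the absence of email normalization, and keeping the base64 padding.

```python
import base64
import hashlib


def otp_cache_key(email: str, device_id: str) -> str:
    """SHA-256 of "email|deviceID", base64url encoded."""
    digest = hashlib.sha256(f"{email}|{device_id}".encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")
```

Any node given the same email and device ID computes the same key, so lookups work cluster-wide without storing the address itself.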

Replay prevention:

  Cluster-atomic single-use enforced unconditionally via authclaim.
  Marker written to JetStream KV (cache_type "otp_consumed") before
  declaring success. Concurrent successful submissions on different
  cluster nodes resolve to exactly one Won; remainders return
  Reason="already_used". Strict policy fails closed if cluster is
  degraded (Reason="infra_error"). Prevents code reuse cluster-wide.
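
The single-winner behavior above can be modeled with an atomic create-if-absent operation. The class below is a hypothetical in-memory stand-in, not the authclaim API or JetStream KV; it only illustrates why concurrent submissions resolve to exactly one winner.

```python
import threading


class ClaimStore:
    """Hypothetical in-memory stand-in for an atomic create-if-absent KV."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._markers = set()
        self.healthy = True

    def claim(self, key: str) -> str:
        """Return "won" for the first caller, a reason string otherwise."""
        if not self.healthy:
            return "infra_error"      # strict policy: fail closed
        with self._lock:
            if key in self._markers:
                return "already_used"
            self._markers.add(key)
            return "won"
```

Success is declared only after the marker write returns "won", which mirrors the write-before-success ordering described above.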

Resend abuse prevention:

  Per-device resend delay (configurable, default 60 seconds).
  Locked OTPs block resend requests (prevents brute-force via fresh codes).
  Retry counters preserved across resends for the same email.

Cluster storage security:

  OTP broadcast to all cluster nodes with quorum requirement (>50%).
  Ensures OTP availability across node failures.
  Retry count updates also require cluster quorum.
  TTL-based automatic expiration prevents stale OTP accumulation.
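
The quorum rule above (more than 50% of nodes) reduces to a strict-majority check. A one-function sketch; the name is illustrative, and counting the local node among the acknowledgements is an assumption.

```python
def quorum_reached(acks: int, cluster_size: int) -> bool:
    """True when strictly more than 50% of nodes confirmed the write."""
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return acks * 2 > cluster_size
```

With four nodes, two acknowledgements are exactly 50% and do not reach quorum; three do.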

Relationships

Module dependencies and interactions:

signin: Primary consumer for email-based MFA. When MFAMethods includes “otp”, users see the email OTP option on the MFA page. The signin flow engine calls GenerateOTP to send a code, then ValidateOTP when the user submits it. Successful validation completes the login flow.
smtp: Email delivery for OTP codes. OTP generation triggers synchronous email delivery via the SMTP module; SMTP errors propagate back to the GenerateOTP caller. Email includes the code, validity duration, and is localized using the Language field from the request.
Distributed memory cache: Backend for OTP metadata. Uses cache type “otp_codes” with SHA-256 hashed keys. All writes use cluster broadcast with quorum for consistency.
authentication.totp: Sibling MFA method. Users may see both email OTP and TOTP options on the MFA page. Email OTP requires no prior enrollment but depends on email delivery; TOTP is faster but requires authenticator app setup.
config: Reads [authentication.otp] settings dynamically at runtime. All settings are hot-reloadable. Override fields in requests take precedence over global config values.
telemetry: Structured logging with email context for all operations. Security events logged at WARN level (max retries exceeded, OTP locked). Metrics counters for generation, validation outcomes, and failure reasons.
Rate limiting: External rate limiting layer. Handlers should implement IP-based rate limiting in addition to the module’s device-based limiting.
jit_2fa: JIT-2FA webhook flow uses override fields (SkipDomainCheck, TTLOverride, CodeLengthOverride) for customized OTP behavior when the webhook has already validated the user.

Logs

Log entries by component. Search with: 'logs search "otp"'. Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Generate (OTP creation and delivery):

  otp.generate        INFO   AUDIT  Email domain not allowed
  otp.generate        INFO          Device ID missing
  otp.generate        INFO   AUDIT  Device already has OTP for different email
  otp.generate        INFO   AUDIT  OTP resend blocked - max retries exceeded
  otp.generate        DEBUG         OTP resend denied - too soon
  otp.generate        DEBUG         Generating BASE20 OTP (consonants only)
  otp.generate        DEBUG         Generating numeric OTP
  otp.generate        WARN          Invalid UserpassOTPType configuration, defaulting to numeric
  otp.generate        ERROR         Failed to generate OTP code
  otp.generate        ERROR         Invalid OTP TTL configuration
  otp.generate        ERROR         Failed to broadcast OTP to cluster
  otp.generate        ERROR         Failed to achieve quorum for OTP storage
  otp.generate        DEBUG         OTP stored with cluster quorum
  otp.generate        INFO   AUDIT  OTP code generated
  otp.generate        WARN          Failed to send OTP email

Validate (OTP code verification):

  otp.validate        ERROR         Failed to query OTP from storage
  otp.validate        ERROR         Failed to retrieve OTP
  otp.validate        DEBUG         No OTP found
  otp.validate        ERROR         Invalid OTP type in storage
  otp.validate        DEBUG         OTP validation attempt
  otp.validate        INFO   AUDIT  OTP validation rejected - OTP is locked
  otp.validate        ERROR         Failed to delete expired OTP
  otp.validate        INFO   AUDIT  OTP code expired
  otp.validate        ERROR         Failed to lock OTP after max retries exceeded
  otp.validate        WARN   AUDIT  SECURITY: OTP locked due to max retry attempts exceeded
  otp.validate        ERROR         Failed to update OTP retry count
  otp.validate        ERROR         Failed to achieve quorum for OTP retry update
  otp.validate        INFO   AUDIT  Invalid OTP code submitted
  otp.validate        ERROR         Failed to delete OTP after validation
  otp.validate        DEBUG         OTP deleted after successful validation
  otp.validate        INFO   AUDIT  OTP validated and removed (replay prevention)
  otp.validate        INFO   AUDIT  OTP validated successfully

Domain Check:

  otp.domain          TRACE         Invalid email format
  otp.domain          TRACE         Domain allowed
  otp.domain          TRACE         Domain not in allowed list

Metrics

Prometheus metrics. Query with: metrics prometheus otp_<name>

Generation:

  otp_codes_generated               counter    {type}                 OTP codes generated (type: numeric, base20)

Validation:

  otp_validations_total             counter    {result}               Validation outcomes (result: valid, invalid)
  otp_validation_failures           counter    {reason}               Failure breakdown (reason: not_found, locked, expired, max_retries, invalid_code)
  otp_replay_prevented              counter    (none)                 OTPs deleted after successful validation (replay prevention)

Alerts:

  rate(otp_validation_failures{reason="max_retries"}[5m]) > 0        Brute-force attempt (OTP locked after max retries)
  rate(otp_validation_failures{reason="not_found"}[5m]) > 5          Probing for non-existent OTPs
  rate(otp_codes_generated[5m]) > 20                                  Unusual OTP generation rate

RADIUS Authentication (RADSEC + UDP)

Authenticates network devices via RADIUS — VPN concentrators, WiFi controllers, and switches with group-based authorization

Overview

Handles RADIUS authentication and authorization for network devices — VPN concentrators, WiFi controllers, switches, and other NAS equipment. Replaces standalone RADIUS servers by using the gateway’s own user directory and group policies for access decisions. Applies to any RADIUS-capable network device pointed at the gateway.

Two transport modes:

  - RADSEC (TCP+TLS, default) — encrypted RADIUS on port 2083
  - Dual mode — RADSEC + plain UDP on port 1812 for legacy devices

Core capabilities:

RADSEC listener for Access-Request packets over TCP+TLS (always active)
Plain UDP RADIUS listener for legacy NAS equipment (when dual mode enabled)
TLS certificate cascade: per-client → module-level → auto_tls (ACME) → service default
Per-client mTLS: optional NAS device certificate verification via client_ca_pem
NAS client validation via CIDR matching and shared secret verification (CIDR defaults to 0.0.0.0/0 if empty)
HXEP (Hexon Edge Protocol) support: real NAS IP through SNAT/edge proxy
Password authentication via LDAP bind (standard RADIUS User-Password)
X.509 certificate authentication via RADSEC peer certificates — uses the same authentication.x509 module (7-layer validation: expiry, chain, CRL, identity extraction via cert_subject_map, directory lookup, revocation check)
Group-based authorization mappings with priority ordering (first match wins)
RADIUS attribute-value pair (AVP) responses: VLANs, ACLs, privilege levels
Per-NAS rate limiting (sliding window) and per-user lockout after failed attempts
Global concurrent authentication cap for DoS protection
Full audit logging of authentication decisions with NAS and user context

Both transports share the same packet processing pipeline — authentication, authorization, and response building are transport-independent.
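
Two of the capabilities above, NAS validation by CIDR and priority-ordered group mappings, can be sketched with the standard library. The types and function names are hypothetical, and treating membership in any one listed group as a match is an assumption.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Mapping:
    """Hypothetical view of one group-to-attributes mapping."""
    name: str
    groups: list
    priority: int
    attributes: dict = field(default_factory=dict)


def nas_matches(source_ip: str, cidr: str) -> bool:
    """CIDR check for a NAS client; an empty CIDR falls back to 0.0.0.0/0."""
    network = ipaddress.ip_network(cidr or "0.0.0.0/0")
    return ipaddress.ip_address(source_ip) in network


def select_mapping(user_groups: set, mappings: list) -> Optional[Mapping]:
    """Evaluate mappings by priority, highest first; the first match wins."""
    for mapping in sorted(mappings, key=lambda m: m.priority, reverse=True):
        if user_groups & set(mapping.groups):
            return mapping
    return None
```

The attributes of the selected mapping would then be returned as AVPs in the Access-Accept response.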

Config

RADIUS configuration under [radius] section:

[radius]
  enabled = true                    # Enable RADIUS service
  radsec_only = true                # true: RADSEC TCP+TLS only; false: dual mode (UDP + RADSEC)
  network_interface = ""            # Bind interface (defaults to service.network_interface → "eth0")
  radsec_port = 2083                # RADSEC TCP+TLS port (default 2083, RFC 6614)
  plain_port = 1812                 # Plain UDP RADIUS port (default 1812, RFC 2865, dual mode only)
  accounting_port = 2083            # Reserved for future accounting
  auth_methods = ["password"]       # Methods: "password" (LDAP bind), "x509" (RADSEC peer cert)
  idle_timeout = "30s"              # Per-connection idle timeout (default: 30s)
  session_ttl = "1h"               # Auth event visibility in session list (1m-24h)
  tls_min_version = "1.2"          # Minimum TLS version: "1.1", "1.2", "1.3"

  # TLS: module-level certificate (optional, falls back to service default)
  tls_cert = ""                     # Server cert (file path or inline PEM)
  tls_key = ""                      # Server private key (file path or inline PEM)
  auto_tls = false                  # Issue cert from internal ACME CA

[radius.rate_limit]
  max_requests_per_second_per_nas = 100   # Per-NAS rate limit
  max_auth_attempts_per_user = 5          # Failed attempts before lockout
  auth_lockout_duration = "5m"            # Lockout period after max failures
  max_concurrent_authentications = 1000   # Global concurrent auth cap

# NAS client definitions (at least one required)
[[radius.client]]
  name = "vpn-concentrator"
  description = "Fortinet FG-100F at DC1"
  cidr = "10.0.1.0/24"               # Defaults to 0.0.0.0/0 if empty (WARNING logged)
  secret = "base64:c2VjdXJlLXJhbmRvbS1zZWNyZXQ="  # min 16 bytes decoded

  # Per-client TLS overrides (optional)
  tls_cert = ""                     # NAS-specific server cert
  tls_key = ""                      # NAS-specific server key
  client_ca_pem = ""                # CA to verify NAS device cert (enables mTLS)

# Group-based authorization mappings (evaluated by priority, highest first)
[[radius.mapping]]
  name = "network-admins"
  groups = ["admins", "network-ops"]
  priority = 100
  [radius.mapping.attributes]
    "Service-Type" = "6"                 # Administrative
    "Tunnel-Type" = "13"                 # VLAN
    "Tunnel-Medium-Type" = "6"           # IEEE 802
    "Tunnel-Private-Group-ID" = "10"     # VLAN 10

[radius.mfa]
  enabled = false                    # Enable MFA for RADIUS password auth
  mode = "challenge"                 # "challenge" (Access-Challenge) or "append" (password+code)
  methods = ["totp"]                 # Priority list: "totp", "otp" (email)
  separator = ":"                    # Append mode separator (split at last occurrence)
  challenge_timeout = "60s"          # Access-Challenge response timeout (10s-300s)
  required_groups = []               # Groups requiring MFA (empty = all users)
  skip_if_unavailable = false        # Skip MFA if no method available (false = reject)
  otp_ttl = "5m"                     # Email OTP validity override (1m-10m)
  otp_code_length = 6                # Email OTP code length (4-8)

Per-client MFA override (optional field on [[radius.client]]):

  mfa_override = ""                  # "" = inherit global, "off" = disable, "challenge", "append"

Hot-reloadable: all settings except ports and TLS settings, which require a restart.

Troubleshooting

Common RADIUS issues and diagnostic steps:

NAS cannot connect to RADIUS server:

  - RADSEC: verify port 2083/tcp is open; 'firewall show' to check rules
  - UDP (dual mode): verify configured port (default 1812/udp) is open
  - Verify NAS IP falls within a configured [[radius.client]] CIDR
  - Test connectivity from NAS to gateway on configured port
  - Check: 'config show radius' to verify enabled = true and radsec_only setting
  - TLS handshake failures logged with NAS name and source IP (RADSEC only)

TLS handshake failures:

  - "no TLS certificate available": no cert configured at any level
  - Check TLS cascade: per-client tls_cert → module tls_cert → auto_tls → service cert
  - If using auto_tls, verify ACME CA is configured and reachable
  - If client_ca_pem set: NAS must present valid client certificate (mTLS)
  - Minimum TLS version defaults to 1.2 — check tls_min_version setting
  - Set tls_min_version = "1.1" only for legacy NAS devices that don't support 1.2+
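
The certificate cascade in the checklist above can be sketched as a simple ordered fallback. This Python snippet is a hedged illustration: select_tls_source and its parameters are hypothetical names that mirror the config keys, not the gateway's actual code.

```python
from typing import Optional


def select_tls_source(client_cert: str, module_cert: str,
                      auto_tls: bool, service_cert: str) -> Optional[str]:
    """Walk the cascade: per-client, module, auto_tls, service default."""
    if client_cert:
        return "per-client"
    if module_cert:
        return "module"
    if auto_tls:
        return "auto_tls"
    if service_cert:
        return "service"
    return None  # corresponds to "no TLS certificate available"


print(select_tls_source("", "", True, "/etc/ssl/service.pem"))
```

A None result is the situation reported as "no TLS certificate available": nothing is configured at any level.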

Authentication failures (Access-Reject):

  - Access-Reject always returns "Access denied" in Reply-Message (no internal detail leak)
  - Check server logs for the actual reason (detailed reason logged at each reject point)
  - "bad authenticator" in logs: shared secret mismatch between NAS and config
  - "LDAP bind failed" in logs: user credentials incorrect or user not in directory
  - "User account disabled" in logs: user is disabled in directory
  - "Account temporarily locked" in logs: too many failed attempts, wait for lockout to expire
  - Lockout auto-clears after auth_lockout_duration expires (default 5m)
  - Abandoned lockout entries (< max failures, then idle) are cleaned up after 2× auth_lockout_duration
  - Check rate_limit settings if legitimate users are being locked out
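
To make the lockout behaviour easier to reason about, here is a minimal Python sketch of a per-user failure counter. The LockoutTracker class is hypothetical. Clearing the counter on a successful login and on lockout expiry are assumptions, and the 2× cleanup of abandoned entries is left out.

```python
class LockoutTracker:
    """Per-user lockout after max_attempts failures (illustrative only)."""

    def __init__(self, max_attempts: int = 5, lockout_seconds: float = 300.0):
        self.max_attempts = max_attempts        # max_auth_attempts_per_user
        self.lockout_seconds = lockout_seconds  # auth_lockout_duration
        self._failures = {}      # username -> failure count
        self._locked_until = {}  # username -> unlock timestamp

    def is_locked(self, user: str, now: float) -> bool:
        until = self._locked_until.get(user)
        if until is None:
            return False
        if now >= until:
            # Lockout expired: clear state so the user can try again.
            del self._locked_until[user]
            self._failures.pop(user, None)
            return False
        return True

    def record_failure(self, user: str, now: float) -> None:
        count = self._failures.get(user, 0) + 1
        self._failures[user] = count
        if count >= self.max_attempts:
            self._locked_until[user] = now + self.lockout_seconds

    def record_success(self, user: str) -> None:
        # Assumption: a successful login resets the failure counter.
        self._failures.pop(user, None)
        self._locked_until.pop(user, None)


tracker = LockoutTracker()
for _ in range(5):
    tracker.record_failure("alice", now=0.0)
print(tracker.is_locked("alice", now=1.0))
```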

X.509 certificate authentication issues:

  - x509 only works on RADSEC (TCP+TLS) — NAS must present client cert during TLS handshake
  - "Certificate validation service unavailable": [authentication.x509] not enabled or bridge error
  - "Certificate validation failed": cert expired, chain untrusted, revoked, or identity not in directory
  - Identity from cert is authoritative (RADIUS User-Name attribute is optional for x509)
  - Uses same authentication.x509 config (ca_pem, cert_subject_map, OCSP) as web signin
  - Check: 'config show authentication.x509' for CA pool and identity mapping settings

No RADIUS response (NAS timeout):

  - RADSEC: connection drops for unknown NAS IPs (no TLS handshake for unknowns)
  - UDP: unknown source IPs silently dropped (no information leak)
  - Per-NAS rate limit exceeded: increase max_requests_per_second_per_nas
  - Global concurrent auth limit reached: increase max_concurrent_authentications
  - LDAP service not ready: check directory service health
  - Idle timeout (default 30s): increase idle_timeout if NAS sends infrequent requests

HXEP (edge proxy / SNAT) issues:

  - "HXEP resolved real NAS IP" log: normal — shows socket IP → real NAS IP resolution
  - NAS rejected after HXEP: real NAS IP doesn't match any client CIDR — add correct CIDR
  - HXEP not resolving: verify service.hexon_edge_protocol = true and edge IP in service.hexon_edge_cidr
  - TLS handshake fails via edge: HXEP header parsed during TLS handshake read — check edge proxy config
  - UDP via edge: HXEP wrapping is transparent — no RADIUS-specific config needed
  - "Rejecting HXEP connection — NAS has per-client mTLS": client_ca_pem is incompatible
    with HXEP edge proxy — mTLS cannot be enforced because TLS handshake occurs before
    HXEP reveals the real NAS IP. Remove client_ca_pem or connect the NAS directly (no edge)

MFA issues:

  - "MFA enrollment required": user has no TOTP enrolled and skip_if_unavailable=false
    → Enroll user's TOTP via bastion 'totp enroll' or web signup, or set skip_if_unavailable=true
  - "Challenge expired or invalid": user took too long, increase challenge_timeout (max 300s)
  - Access-Challenge not working: NAS may not support Access-Challenge — use mfa_override="append"
  - Append mode "Invalid credentials": password+code not split correctly
    → Check separator config (default ":"), user must type password:123456
  - Email OTP not delivered: verify SMTP configured and user has email in directory
  - Per-client MFA override: set mfa_override on [[radius.client]] to "off", "challenge", or "append"
  - MFA only applies to password auth — x509 certificate is the second factor
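
Append mode splits the submitted value at the last occurrence of the separator, so a password that itself contains the separator still parses. A minimal Python sketch follows; split_append_credential is a hypothetical name, and rejecting an empty password or an empty code is an assumption.

```python
from typing import Optional, Tuple


def split_append_credential(value: str, separator: str = ":") -> Optional[Tuple[str, str]]:
    """Split 'password<sep>code' at the LAST separator; None if malformed."""
    password, sep, code = value.rpartition(separator)
    if not sep or not password or not code:
        return None
    return password, code


print(split_append_credential("pa:ss:123456"))
print(split_append_credential("no-code-here"))
```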

Mapping not applied (wrong VLAN/attributes):

  - Mappings evaluated by priority (highest first), first match wins
  - Empty groups = catch-all; give the catch-all mapping the lowest priority so it cannot shadow others
  - Verify user's group membership in directory matches mapping groups
  - Check: user groups via directory service
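
The evaluation order above can be sketched in Python as follows. The Mapping type and select_mapping function are hypothetical. Matching on membership in any one of the listed groups is an assumption, since the text does not say whether any or all groups must match.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Mapping:
    name: str
    groups: List[str]
    priority: int
    attributes: Dict[str, str] = field(default_factory=dict)


def select_mapping(user_groups: List[str], mappings: List[Mapping]) -> Optional[Mapping]:
    """Highest priority first; the first matching mapping wins."""
    for mapping in sorted(mappings, key=lambda m: m.priority, reverse=True):
        # Empty groups acts as a catch-all.
        if not mapping.groups or set(mapping.groups) & set(user_groups):
            return mapping
    return None  # corresponds to the "No matching mapping" log entry


mappings = [
    Mapping("default", [], 0, {"Tunnel-Private-Group-ID": "99"}),
    Mapping("network-admins", ["admins", "network-ops"], 100,
            {"Tunnel-Private-Group-ID": "10"}),
]
print(select_mapping(["admins"], mappings).name)
print(select_mapping(["staff"], mappings).name)
```

If the catch-all were given priority 200 in this sketch, it would shadow network-admins for every user, which is the usual cause of a wrong VLAN assignment.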

Relationships

Module dependencies and interactions:

LDAP module: Password authentication uses LDAP bind for credential verification. RADIUS waits for LDAP readiness before accepting connections.
X.509 auth module: Certificate authentication validates client certificates against the CA. Full 7-layer validation: expiry, chain, CRL, identity extraction, directory, revocation. Uses same [authentication.x509] config as web signin (ca_pem, cert_subject_map, OCSP). Identity extracted from certificate is authoritative (RADIUS User-Name optional for x509).
Directory service: Group membership lookups for authorization mapping evaluation. User disabled status checked before authentication.
Certmanager: TLS certificate cascade — module cert, auto_tls (ACME), or service default. Per-client TLS overrides built at init time for NAS-specific certificates.
Managed listener: TCP and UDP socket lifecycle managed by Hexon’s listener infrastructure. RADSEC: TLS applied per-connection (not at listener level) for per-client cert selection. UDP: packets matched to NAS by source IP, dispatched directly to handlePacket. HXEP (Hexon Edge Protocol): real NAS IP resolved through SNAT/edge proxy. TCP: two-phase NAS matching (socket IP for TLS config → HXEP real IP for final NAS match). UDP: HXEP PacketConn wrapper transparently resolves real IP — no handler changes needed.
TOTP module: MFA checks TOTP enrollment and validates codes (including recovery codes).
Email OTP module: MFA generates and validates email OTP codes. Bypasses web domain allowlist since RADIUS users may not match web-configured domains.
Cluster: All cross-module calls use standard cluster communication.
Metrics: Exposes the radius_connections_total, radius_packets_total, radius_auth_total, radius_errors_total, and radius_mapping_matches_total counters, plus the radius_auth_duration latency metric.
Sessions module: Auth events recorded as type "radius" sessions on Access-Accept. Visible via 'sessions list --type=radius', 'sessions show', cluster-wide. TTL controlled by session_ttl config (default 1h). Rich metadata per session: NAS name/IP, transport (tcp/udp), TLS version, auth method, mapping, RADIUS attributes, user groups, packet ID, timing metrics (total_ms, auth_ms, authz_ms), and cert info for x509 (serial, subject, issuer, expiry, CA type).
Configuration: Reads [radius] TOML section. Validated at startup.
Admin CLI: RADIUS status and diagnostics available through admin commands.

Logs

Log entries emitted by this module (runtime/radius). Levels: ERROR > WARN > INFO > DEBUG. AUDIT = security-auditable event.

Initialization:

  radius.init                              INFO          RADIUS service disabled in configuration
  radius.init                              INFO          RADIUS initialization starting (RADSEC TCP+TLS)...
  radius.init                              INFO          RADIUS initialization starting (dual-mode: UDP + RADSEC TCP+TLS)...
  radius.init                              INFO          Waiting for LDAP service to initialize
  radius.init                              INFO          Shutdown requested during LDAP wait, aborting initialization
  radius.init                              INFO          LDAP service ready, creating RADIUS server
  radius.init                              INFO          Shutdown requested before server creation, aborting initialization
  radius.init                              ERROR         Failed to create RADIUS server
  radius.init                              INFO          Shutdown requested before listener creation
  radius.init                              ERROR         Failed to resolve network interface IP
  radius.init                              INFO          Resolved network interface for RADIUS
  radius.init                              ERROR         Failed to create RADSEC listener
  radius.init                              ERROR         Failed to start RADSEC listener
  radius.init                              INFO          RADSEC listener started
  radius.init                              ERROR         Failed to create UDP RADIUS listener
  radius.init                              ERROR         Failed to start UDP RADIUS listener
  radius.init                              INFO          UDP RADIUS listener started
  radius.init                              INFO          RADIUS server started successfully
  radius.init                              WARN          RADIUS auth_methods includes x509 but [authentication.x509] is not enabled — x509 auth will fail at runtime

Connection handling:

  radius.handler                           ERROR         No TLS configuration available
  radius.handler                           WARN          TLS handshake failed
  radius.handler                           INFO          HXEP resolved real NAS IP
  radius.handler                           ERROR         Rejecting HXEP connection — NAS has per-client mTLS (client_ca_pem) which cannot be enforced through edge proxy
  radius.handler                           WARN   AUDIT  Unknown NAS — connection from unregistered IP
  radius.handler                           DEBUG         RADSEC connection established

UDP listener:

  radius.handler                           WARN          UDP temporary read error, continuing
  radius.handler                           ERROR         UDP fatal read error, stopping listener

RADSEC framing:

  radius.handler                           WARN          Failed to read RADSEC frame header
  radius.handler                           WARN          Invalid RADIUS packet length
  radius.handler                           WARN          Incomplete RADSEC frame

Packet processing:

  radius.handler                           WARN   AUDIT  NAS rate limit exceeded
  radius.handler                           WARN   AUDIT  Concurrent authentication limit reached
  radius.handler                           WARN          Failed to parse RADIUS packet
  radius.handler                           WARN          Unexpected RADIUS packet code
  radius.handler                           INFO          Missing User-Name attribute in Access-Request
  radius.handler                           WARN   AUDIT  User locked out

Authentication:

  radius.auth                              DEBUG         Skipping x509 auth — no client certificate
  radius.auth                              ERROR         x509auth bridge call failed
  radius.auth                              ERROR         x509auth validation timed out or failed
  radius.auth                              INFO   AUDIT  Certificate validation rejected
  radius.auth                              INFO   AUDIT  Authentication failed
  radius.auth                              ERROR         Authorization failed
  radius.auth                              INFO          No matching mapping
  radius.auth                              INFO          Authentication and authorization successful

MFA:

  radius.mfa                               WARN          TOTP status check failed
  radius.mfa                               ERROR         Failed to generate challenge token
  radius.mfa                               INFO          MFA validated via recovery code
  radius.mfa                               ERROR         Failed to encode Access-Challenge
  radius.mfa                               WARN          Failed to send Access-Challenge
  radius.mfa                               ERROR         Failed to get user info for MFA check
  radius.mfa                               ERROR  AUDIT  MFA method resolution failed
  radius.mfa                               INFO          MFA skipped — no method available, skip_if_unavailable=true
  radius.mfa                               ERROR         Failed to send email OTP
  radius.mfa                               INFO   AUDIT  Sending MFA challenge
  radius.mfa                               WARN          Invalid or expired MFA challenge state
  radius.mfa                               WARN          MFA challenge response from different NAS
  radius.mfa                               INFO          MFA challenge response missing verification code
  radius.mfa                               INFO          MFA validation failed
  radius.mfa                               ERROR         Authorization failed after MFA
  radius.mfa                               INFO          MFA authentication and authorization successful

Response encoding:

  radius.handler                           ERROR         Failed to encode Access-Reject
  radius.handler                           WARN          Failed to send Access-Reject
  radius.handler                           WARN          Failed to set RADIUS attribute
  radius.handler                           ERROR         Failed to encode Access-Accept
  radius.handler                           WARN          Failed to send Access-Accept

Session recording:

  radius.session                           WARN          Failed to create RADIUS session

Restrictions:

  radius.restrictions.geo                  ERROR         Geo check failed - denying access (fail-closed)
  radius.restrictions.geo                  ERROR         Geo check wait failed - denying access (fail-closed)
  radius.restrictions.geo                  ERROR         Invalid geo check response type - denying access (fail-closed)
  radius.restrictions.geo                  INFO          Access blocked by geo restriction
  radius.restrictions.time                 ERROR         Time check failed - denying access (fail-closed)
  radius.restrictions.time                 ERROR         Time check wait failed - denying access (fail-closed)
  radius.restrictions.time                 ERROR         Invalid time check response type - denying access (fail-closed)
  radius.restrictions.time                 INFO          Access blocked by time restriction

Metrics

Prometheus metrics. Query with: metrics prometheus radius_<name>

Connections:

  radius_connections_total                 counter    {nas}                         TCP connections accepted (RADSEC)

Packets:

  radius_packets_total                     counter    {transport, nas}              RADIUS packets received (transport: tcp or udp)

Authentication:

  radius_auth_total                        counter    {result, method, nas}         Auth outcomes (result: accept/reject, method: password/x509/none)
  radius_auth_total                        counter    {result, reason, nas}         Auth rejections with reason (reason: geo, time)
  radius_auth_duration                     latency    {result}                      End-to-end auth+authz latency (result: accept/reject)

Mappings:

  radius_mapping_matches_total             counter    {mapping, nas}                Mapping match counts per mapping name

Errors:

  radius_errors_total                      counter    {reason, nas}                 Error counts by reason:
    reason=tls_handshake       TLS handshake failure on RADSEC connection
    reason=hxep_mtls_conflict  HXEP connection rejected — NAS has per-client mTLS
    reason=invalid_frame       RADIUS packet length out of range (< 20 or > 4096)
    reason=incomplete_frame    RADSEC frame body read failed (truncated)
    reason=rate_limit          Per-NAS rate limit exceeded (silent drop)
    reason=concurrent_limit    Global concurrent auth limit reached (silent drop)
    reason=parse_error         RADIUS packet parse failed (bad authenticator / malformed)
    reason=invalid_state       MFA challenge state token invalid or expired
    reason=nas_mismatch        MFA challenge response from different NAS than original
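
The invalid_frame and incomplete_frame reasons relate to how a RADIUS packet is delimited on a TCP stream: the 2-byte Length field in the packet header (RFC 2865) tells the reader how many bytes to consume. The Python sketch below is illustrative. read_radius_frame is a hypothetical helper, and the "short_header" reason is invented for the example.

```python
import io
import struct
from typing import BinaryIO, Optional, Tuple

MIN_LEN, MAX_LEN = 20, 4096  # valid RADIUS packet length range


def read_radius_frame(stream: BinaryIO) -> Tuple[Optional[bytes], Optional[str]]:
    """Read one packet from a stream; return (packet, None) or (None, reason)."""
    header = stream.read(4)  # Code (1), Identifier (1), Length (2, big-endian)
    if len(header) < 4:
        return None, "short_header"
    _code, _identifier, length = struct.unpack("!BBH", header)
    if length < MIN_LEN or length > MAX_LEN:
        return None, "invalid_frame"
    body = stream.read(length - 4)  # rest of the packet, incl. authenticator
    if len(body) < length - 4:
        return None, "incomplete_frame"
    return header + body, None


# Minimal Access-Request (code 1): header plus 16-byte authenticator, no AVPs.
packet = struct.pack("!BBH", 1, 7, 20) + bytes(16)
print(read_radius_frame(io.BytesIO(packet)))
```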

TOTP Authenticator

Authenticator app verification for second-factor authentication — QR enrollment, replay protection, recovery codes

Overview

Verifies time-based one-time passwords from authenticator apps like Google Authenticator, Authy, or 1Password. Used as an MFA step after primary authentication — requires the user to have enrolled via QR code scan. Applies when the signin flow requires MFA and TOTP is configured as an available method.

Enrollment flow:

  1. The gateway generates a 160-bit secret and QR code (secret not persisted until confirmed)
  2. User scans the QR code with their authenticator app
  3. User submits the first code to confirm enrollment — proves the QR was scanned correctly
  4. The gateway generates 10 one-time recovery codes (returned in plaintext exactly once)
  5. Subsequent logins verify the 6-digit code from the authenticator app
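
Step 1 of the flow can be sketched in Python: generate a 160-bit secret and build the otpauth URI that a QR code would encode. The function names are hypothetical, and the URI layout follows the widely used Key URI convention for authenticator apps rather than anything stated in this document.

```python
import base64
import secrets
from urllib.parse import quote, urlencode


def new_totp_secret() -> str:
    """160-bit random secret, Base32-encoded (32 characters, no padding)."""
    return base64.b32encode(secrets.token_bytes(20)).decode("ascii")


def otpauth_uri(secret_b32: str, username: str, issuer: str = "HexonGateway",
                algorithm: str = "SHA1", digits: int = 6, period: int = 30) -> str:
    """Build an otpauth:// URI (Key URI convention) for QR encoding."""
    label = quote(f"{issuer}:{username}")
    params = urlencode({
        "secret": secret_b32,
        "issuer": issuer,
        "algorithm": algorithm,
        "digits": digits,
        "period": period,
    })
    return f"otpauth://totp/{label}?{params}"


secret = new_totp_secret()
print(otpauth_uri(secret, "alice"))
```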

Replay protection rejects codes that match or precede the last accepted time step. Recovery codes are hashed and consumed on use — each code works exactly once.

HMAC-SHA1 by default (SHA256/SHA512 configurable but reduce app compatibility). Configurable time skew window for clock drift tolerance between the gateway and authenticator apps. Per-user enrollment status and secret deletion available via admin CLI.

Config

Configuration under [authentication.totp]:

[authentication.totp]
  enabled = true                    # Enable TOTP module
  issuer = "HexonGateway"          # Shown in authenticator apps (otpauth URI)
  algorithm = "SHA1"               # HMAC algorithm: SHA1 (most compatible), SHA256, SHA512
  digits = 6                       # Code length: 6 (standard) or 8
  period = 30                      # Time step in seconds (30 is RFC default)
  skew = 1                         # Allow +/- N steps for clock drift (1 = 30s tolerance)
  recovery_codes = 10              # Number of one-time recovery codes generated
  recovery_code_length = 6         # Character length of each recovery code
  rate_limit_auth = "10/1m"        # Rate limit for validation attempts

Algorithm compatibility notes:

  SHA1: Works with all authenticator apps (Google, Authy, 1Password, etc.)
  SHA256: Limited app support (may not work with Google Authenticator)
  SHA512: Minimal app support (not recommended for broad deployments)

Period and skew interaction:

  With period=30 and skew=1, codes are valid for ~90 seconds (current + 1 past + 1 future).
  Increasing skew improves tolerance for clock drift but reduces security.
  Period changes require re-enrollment of all users.
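
  The window arithmetic can be checked with a small RFC 6238 sketch in Python. The function names are hypothetical; the secret and expected codes come from the RFC 4226 and RFC 6238 published test vectors.

```python
import hmac
import struct
from typing import List


def totp_code(secret: bytes, step: int, digits: int = 6, algorithm: str = "sha1") -> str:
    """HOTP value for a given time step (RFC 4226 dynamic truncation)."""
    mac = hmac.new(secret, struct.pack(">Q", step), algorithm).digest()
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def accepted_steps(now: float, period: int = 30, skew: int = 1) -> List[int]:
    """Time steps whose codes are accepted at time `now`."""
    current = int(now // period)
    return list(range(current - skew, current + skew + 1))


secret = b"12345678901234567890"          # RFC test secret
print(totp_code(secret, 59 // 30))        # 287082
print(len(accepted_steps(59)) * 30)       # 90 seconds with skew=1
print(len(accepted_steps(59, skew=2)) * 30)  # 150 seconds with skew=2
```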

Storage: Hexon KV (NATS JetStream) — no user password needed for writes.

All settings are cold (restart required to take effect on new enrollments). Existing enrollments retain their original algorithm, digits, and period.

Troubleshooting

Common symptoms and diagnostic steps:

User cannot enroll TOTP (enrollment fails):

  - Verify [authentication.totp] enabled = true
  - Check if user already has TOTP enrolled: 'totp status <username>'
  - If re-enrolling, delete first: an admin must call the Delete operation
  - Check telemetry logs for "Failed to generate TOTP secret" errors

QR code not scanning in authenticator app:

  - Verify issuer is set (some apps reject empty issuer)
  - Check algorithm compatibility: SHA1 works universally, SHA256/SHA512 may not
  - Ensure digits=6 and period=30 for maximum compatibility
  - Try manual entry using the Base32 secret string instead of QR

TOTP code rejected during authentication:

  - Clock drift: user device clock may be off by more than skew * period seconds
  - Replay protection: code was already used (step <= last_used_step)
  - Wrong authenticator entry: user may have multiple entries for same issuer
  - Check enrollment status: 'totp status <username>' to confirm enrollment exists
  - Verify algorithm matches: stored secret uses algorithm from enrollment time

Recovery code rejected:

  - Code already consumed (one-time use, removed from storage after validation)
  - No codes remaining: check RecoveryCodesRemaining in status response
  - Case sensitivity: codes are case-sensitive
  - Storage update failure: check logs for "Failed to consume recovery code"

Replay detection false positives:

  - Rapid successive code submissions: same 30-second window generates same code
  - Step update failed: if persisting the step counter fails, validation is rejected (fail-closed)
  - Check logs for "TOTP replay detected" with step and last_used_step values

TOTP Delete fails:

  - Cluster not ready: moduledata requires cluster connectivity
  - Delete is idempotent: returns Success=true even if no enrollment exists

Metrics for monitoring:

  - totp_enrollments_initiated: Enroll calls (QR generated)
  - totp_enrollments_confirmed: Successful ConfirmEnroll (secret persisted)
  - totp_enrollments_deleted: Successful Delete calls
  - totp_validations_total (result=valid|invalid|replay|clock_backward): Validate outcomes
  - totp_recovery_validations_total (result=valid|invalid|no_codes): Recovery code outcomes

Security

Security design and hardening:

Secret generation:

  160-bit random secrets (20 bytes) from crypto/rand, Base32-encoded.
  Provides 160 bits of entropy (2^160 possible secrets), so brute-forcing the secret is computationally infeasible.

Code validation:

  Constant-time comparison via crypto/subtle prevents timing attacks.
  Attacker cannot determine partial code correctness from response time.

Replay protection:

  Each successful validation records the time step (LastUsedStep).
  Subsequent codes at step <= LastUsedStep are rejected.
  Step update is synchronous (not fire-and-forget) to prevent race conditions.
  If step persistence fails, validation is rejected (fail-closed).
  This prevents concurrent requests from replaying the same code.
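
  A minimal Python sketch of the step check and fail-closed persistence. accept_step and the persist callback are hypothetical names, and the real module's storage call and locking are not shown.

```python
from typing import Callable


def accept_step(matched_step: int, last_used_step: int,
                persist: Callable[[int], None]) -> str:
    """Decide the outcome for a code that matched at `matched_step`."""
    if matched_step <= last_used_step:
        return "replay"    # code at or before the last accepted step
    try:
        persist(matched_step)  # synchronous write, completed before success
    except Exception:
        return "rejected"  # fail-closed: no success without a stored step
    return "valid"


state = {"last_used_step": 100}


def save(step: int) -> None:
    state["last_used_step"] = step


print(accept_step(101, state["last_used_step"], save))  # valid
print(accept_step(101, state["last_used_step"], save))  # replay
```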

Recovery codes:

  Generated with crypto/rand, stored as SHA-256 hashes.
  Plaintext returned to user exactly once during enrollment confirmation.
  Each code is consumed (removed) after successful validation.
  Matching uses constant-time comparison for timing-attack resistance.
  Consumption is synchronous with fail-closed semantics.
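
  The recovery code lifecycle can be sketched in Python as below. The alphabet, the function names, and the in-memory list standing in for the storage backend are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets
from typing import List, Tuple

ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"  # hypothetical code alphabet


def generate_recovery_codes(count: int = 10, length: int = 6) -> Tuple[List[str], List[str]]:
    """Return (plaintext codes shown once, SHA-256 hashes to store)."""
    codes = ["".join(secrets.choice(ALPHABET) for _ in range(length))
             for _ in range(count)]
    hashes = [hashlib.sha256(code.encode()).hexdigest() for code in codes]
    return codes, hashes


def consume_recovery_code(candidate: str, stored_hashes: List[str]) -> bool:
    """Constant-time match against stored hashes; remove the hash on success."""
    digest = hashlib.sha256(candidate.encode()).hexdigest()
    match = None
    for index, stored in enumerate(stored_hashes):
        if hmac.compare_digest(digest, stored):
            match = index
    if match is None:
        return False
    del stored_hashes[match]  # each code works exactly once
    return True


codes, hashes = generate_recovery_codes()
print(len(codes), len(hashes))
```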

Enrollment security:

  Two-phase enrollment: Enroll generates secret, ConfirmEnroll verifies first code.
  This proves the user successfully scanned the QR and their authenticator works.
  Re-enrollment blocked while existing enrollment exists (prevents overwrite race).

Clock drift tolerance:

  Configurable skew parameter allows +/- N time steps.
  Default skew=1 with period=30 accepts codes from 3 consecutive 30-second windows.
  Wider skew reduces security: skew=2 means a valid code window of 150 seconds.

Authentication flow integration:

  TOTP is a second factor only — never used as primary authentication.
  Requires prior successful primary authentication (password, certificate, etc.).
  MFA pending session must exist before TOTP validation is attempted.
  Failed TOTP does not reveal whether the user has TOTP enrolled.

Audit logging:

  All operations logged via telemetry with security context (username).
  Enrollment initiation, confirmation, validation (success/failure/replay),
  recovery code use, and deletion all generate structured log entries.
  Replay attempts logged at WARN level for security monitoring.

Relationships

Module dependencies and interactions:

signin: Primary consumer via MFA flow. When RequireMFA includes “passwd” and MFAMethods includes “totp”, users with TOTP enrolled see the authenticator option on the MFA page. After primary auth creates “mfa_pending” session, user submits 6-digit code, signin calls totp.Validate, and on success the signin flow completes the login.
moduledata: Storage backend for TOTP secrets. Module name “totp” in moduledata stores the per-user secret, algorithm, digits, period, last used step, and recovery codes.
Directory: Provides user context and group membership. TOTP enrollment status can influence access policies.
sessions: MFA pending session must exist before TOTP validation. Successful TOTP validation triggers session upgrade to fully authenticated.
authentication.otp: Sibling MFA method. Users may see both TOTP and email OTP options on the MFA page. TOTP is preferred when enrolled (no email delivery delay).
config: Reads [authentication.totp] settings dynamically at runtime. Algorithm, digits, and period from enrollment time are stored with the secret, so config changes only affect new enrollments.
telemetry: Structured logging with security context for all operations. Metrics counters for enrollment, validation, and recovery code operations.
Admin CLI: TOTP management commands (list enrollments, check status, delete). Admin can delete TOTP enrollment for locked-out users.

Logs

Log entries by component. Search with: logs search "totp". Levels: ERROR > WARN > INFO > DEBUG > TRACE.

Enroll (secret + QR generation):

  totp.enroll          ERROR         Failed to generate TOTP secret
  totp.enroll          ERROR         Failed to generate QR code
  totp.enroll          INFO          TOTP enrollment initiated

ConfirmEnroll (first-code verification and secret persistence):

  totp.enroll.confirm  INFO          TOTP enrollment verification failed - invalid code
  totp.enroll.confirm  ERROR         Failed to generate recovery codes
  totp.enroll.confirm  ERROR         Failed to store TOTP secret
  totp.enroll.confirm  INFO          TOTP enrollment confirmed and persisted

Validate (TOTP code verification):

  totp.validate        INFO   AUDIT  TOTP validation failed - no enrollment found
  totp.validate        ERROR  AUDIT  Failed to decode stored TOTP secret
  totp.validate        INFO   AUDIT  TOTP validation failed - invalid code
  totp.validate        WARN   AUDIT  Clock backward detected during TOTP validation - allowing code
  totp.validate        WARN   AUDIT  TOTP replay detected - code already used
  totp.validate        ERROR  AUDIT  Failed to update last used step - rejecting for safety
  totp.validate        INFO   AUDIT  TOTP validation successful

Recovery (one-time recovery code validation):

  totp.recovery        INFO          Recovery code validation failed - no enrollment found
  totp.recovery        INFO          Recovery code validation failed - no codes remaining
  totp.recovery        INFO          Recovery code validation failed - invalid code
  totp.recovery        ERROR         Failed to consume recovery code - rejecting for safety
  totp.recovery        INFO          Recovery code validated and consumed

Delete (enrollment removal):

  totp.delete          INFO          No TOTP enrollment found to delete
  totp.delete          INFO          TOTP enrollment deleted

Metrics

Prometheus metrics. Query with: metrics prometheus totp_<name>

Enrollment:

  totp_enrollments_initiated          counter    (none)                 Enroll calls (QR + secret generated)
  totp_enrollments_confirmed          counter    (none)                 First code verified, secret persisted
  totp_enrollments_deleted            counter    (none)                 TOTP enrollment deleted

Validation:

  totp_validations_total              counter    {result}               Validation outcomes (result: valid, invalid, replay, clock_backward)

Recovery:

  totp_recovery_validations_total     counter    {result}               Recovery code outcomes (result: valid, invalid, no_codes)

Alerts:

  rate(totp_validations_total{result="replay"}[5m]) > 0                Replay attack attempt detected
  rate(totp_validations_total{result="invalid"}[5m]) > 10              Brute-force attempt on TOTP codes
  rate(totp_validations_total{result="clock_backward"}[5m]) > 0        Server clock drift — check NTP sync
  rate(totp_recovery_validations_total{result="invalid"}[5m]) > 5      Recovery code probing attempt

WebAuthn Passkeys

FIDO2/WebAuthn passwordless authentication with passkey management and clone detection

Overview

The WebAuthn module implements FIDO2/WebAuthn Level 2 passwordless authentication, acting as a WebAuthn Relying Party (RP). It manages the full passkey lifecycle: registration, authentication, revocation, and expiration monitoring.

Key capabilities:

Multiple passkeys per user (laptop, phone, YubiKey, etc.)
Challenge-response registration and authentication ceremonies
Platform authenticators (Touch ID, Face ID, Windows Hello)
Cross-platform authenticators (YubiKey, other FIDO2 security keys)
Attestation statement validation (none, packed, fido-u2f formats)
Clone detection via signature counter monitoring
ECDSA P-256 (ES256), RSA-2048 (RS256), and EdDSA public key cryptography
Passkey expiration scheduler with email reminders
Distributed passkey storage (replicated or shared filesystem)
Session creation after successful authentication
Optional device naming for passkey identification

Operations: registration ceremonies, authentication ceremonies, passkey management (revoke, get, list), observability metrics, and scheduled expiration reminders.

Storage architecture follows a layered approach:

LDAP is the single source of truth for passkey data
Multi-passkey format: supports multiple passkeys per user with revocation tracking
Legacy single-passkey format auto-detected and migrated on first write
Directory module syncs LDAP to memory cache (including passkey data)
WebAuthn reads passkey data from the directory cache
No separate passkey cache — eliminates synchronization issues
Temporary challenge sessions use in-memory storage with 5-10 minute TTL
Passkey records also persisted to distributed file storage

Config

Configuration under [authentication.webauthn]:

  name = "Hexon Identity"              # RP name shown to users during ceremony
  rpid = "login.example.com"           # Relying Party ID (must match origin domain)
  origin = "https://login.example.com" # Origin URL (must match browser origin exactly)
  skip_port_check = true               # Skip port in origin validation (default: true)
  type = "preferred"                   # Authenticator type: "platform", "cross-platform", "preferred"
  user_verification = "preferred"      # UV policy: required|preferred|discouraged (default: preferred)
  validity = "8760h"                   # Passkey validity (default: 8760h = 1 year; "0" = no expiry)
  algorithms = ["ES256", "RS256", "EdDSA"]  # Signature algorithms in preference order
  attestation = "none"                 # Attestation conveyance: none|indirect|direct|enterprise
  allowed_aaguids = []                 # AAGUID allowlist; empty = any (requires attestation=direct)
  denied_aaguids = []                  # AAGUID denylist; checked first (requires attestation=direct)
  rate_limit_register = "5/1h"         # Registration rate limit per user
  rate_limit_auth = "20/1m"            # Authentication rate limit per user

Signature algorithms (default ["ES256", "RS256", "EdDSA"]):

  - ES256 (ECDSA P-256 + SHA-256) — universal authenticator support
  - RS256 (RSA-2048 + SHA-256) — covers older smartcards
  - EdDSA (Ed25519) — modern hardware: recent YubiKeys, Solo Keys,
    iOS 18+ / Android 14+ platform authenticators. Smaller signatures.
  Only these three are accepted; unknown names are rejected at boot.
  Order is the operator's preference list — the authenticator picks the
  first algorithm it supports. To force ES256-only (for compatibility
  with strict regulators or legacy verifiers downstream), set
  algorithms = ["ES256"]. Removing RS256 also blocks legacy smartcards.

Validity semantics:

  - Default 8760h (1 year) — annual re-confirmation of credential possession
  - "0" → no expiry — matches Apple/Google/Microsoft platform-passkey UX
    (credentials live until explicit revocation). Storage omits valid_until
    on the record; authentication treats IsZero() as never-expiring; the
    renewal-reminder scheduler skips zero-validity credentials automatically.

Attestation conveyance (default “none”):

  - "none" — authenticator omits attestation; AAGUID arrives as zero bytes.
    Best privacy, fewest browser prompts. Recommended unless you actually
    consume AAGUID downstream.
  - "indirect" — authenticator may send anonymized attestation. Browser
    may strip identifying material; AAGUID enforcement is unreliable.
  - "direct" — full attestation including real AAGUID and certificate
    chain. REQUIRED for AAGUID allow/deny enforcement. May trigger an
    extra browser prompt on some platforms.
  - "enterprise" — non-anonymized identifiers. Most authenticators require
    an allow-listed RP ID configured in their manufacturer policy;
    coordinate with your hardware vendor before flipping.

AAGUID allow/deny lists:

  AAGUID = 16-byte UUID identifying authenticator make/model. Use these
  lists to restrict registration to specific devices (e.g. hardware-key-only
  deployments). Both lists require attestation="direct" — boot validation
  rejects the inconsistent combination because non-direct modes anonymize
  or omit the AAGUID and would silently block every user.

  Denylist is checked first — a denied AAGUID is rejected even if it
  appears in the allowlist. This lets you express "any hardware key, except
  this revoked batch" by populating both lists.

  When a list is non-empty and the authenticator returns no AAGUID (zero
  bytes), the registration is rejected with a clear error rather than
  silently admitting an unidentified credential.

  AAGUID values come from the FIDO Metadata Service. See
  tools/config/authentication/webauthn.toml for a curated starter list of
  hardware keys (YubiKey, SoloKey, Feitian, Google Titan) and software /
  platform passkey managers (iCloud Keychain, Google Password Manager,
  Windows Hello, 1Password, Bitwarden).
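
The evaluation order described above (zero AAGUID rejected when any list is set, denylist before allowlist) can be sketched as below. The type and function names are hypothetical and the error texts are placeholders, not the module's real messages.

```go
package main

import (
	"errors"
	"fmt"
)

// AAGUID is the 16-byte authenticator model identifier.
type AAGUID [16]byte

// Placeholder errors for illustration only.
var (
	errMissingAAGUID = errors.New("authenticator returned no AAGUID")
	errDenied        = errors.New("authenticator AAGUID is denylisted")
	errNotAllowed    = errors.New("authenticator AAGUID is not in the allowlist")
)

func contains(list []AAGUID, id AAGUID) bool {
	for _, v := range list {
		if v == id {
			return true
		}
	}
	return false
}

// checkAAGUID applies the rules from the text at registration time.
func checkAAGUID(id AAGUID, allowed, denied []AAGUID) error {
	if len(allowed) == 0 && len(denied) == 0 {
		return nil // no restriction configured
	}
	if id == (AAGUID{}) {
		return errMissingAAGUID // a list is set but the AAGUID is all zero bytes
	}
	if contains(denied, id) {
		return errDenied // denylist wins even if the allowlist also matches
	}
	if len(allowed) > 0 && !contains(allowed, id) {
		return errNotAllowed
	}
	return nil
}

func main() {
	revoked := AAGUID{0x01}
	other := AAGUID{0x02}
	fmt.Println(checkAAGUID(revoked, []AAGUID{revoked}, []AAGUID{revoked}))
	fmt.Println(checkAAGUID(other, nil, []AAGUID{revoked}))
}
```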

User verification tradeoff (single value, applied to both registration and auth):

  - "preferred" (default): authenticator decides. Touch ID where available,
    falls back gracefully. Non-UV-capable credentials can enroll AND
    authenticate. Best UX, no fallback prompts on macOS. Weakest
    phishing resistance — suitable when another auth layer (mTLS,
    network ACL, IAP session binding) is the primary defence.
  - "required": TouchID/PIN every ceremony. Server rejects UV=0 in authData
    at BOTH registration and authentication — non-UV-capable authenticators
    cannot enroll, and an enrolled credential that skips UV at auth time
    is rejected. Strongest phishing resistance per FIDO2 §7.2.9. On macOS:
    can fall back to account-password prompt if Touch ID isn't accepted
    first-try.
  - "discouraged": skip UV. Reserved for deployments behind another strong
    auth layer that already provides UV-equivalent guarantees.

Registration and auth MUST share the same value. A “preferred” registration accepts non-UV credentials, which then fail a “required” auth with no recovery path. The getter returns one value consumed by both ceremonies and falls back to “preferred” (same as the default) on any unrecognised input so a config typo never bricks passkey auth.

Migration: flipping from “preferred” to “required” mid-deployment can lock out users whose credentials enrolled without UV. Plan a re-registration window before flipping.
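
As a rough sketch of the user verification check described above: the UV flag is bit 2 (0x04) of the flags byte that follows the 32-byte RP ID hash in authenticator data. The function names normalizeUV and checkUV are hypothetical; only the fallback to "preferred" and the rejection of UV=0 under "required" come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// flagUV is the User Verified bit in the authenticator data flags byte.
const flagUV = 0x04

// normalizeUV falls back to "preferred" on unrecognised input,
// mirroring the getter behavior described in the text.
func normalizeUV(v string) string {
	switch v {
	case "required", "preferred", "discouraged":
		return v
	default:
		return "preferred"
	}
}

// checkUV rejects a ceremony when the policy is "required" and UV is not set.
// authData layout: 32-byte RP ID hash, 1 flags byte, 4-byte signature counter.
func checkUV(policy string, authData []byte) error {
	if len(authData) < 37 {
		return errors.New("authenticator data too short")
	}
	flags := authData[32]
	if normalizeUV(policy) == "required" && flags&flagUV == 0 {
		return errors.New("user verification required but UV flag not set")
	}
	return nil
}

func main() {
	authData := make([]byte, 37)
	authData[32] = 0x01 // user present only, UV not set

	fmt.Println(checkUV("required", authData))
	fmt.Println(checkUV("preferred", authData))
}
```

The same function is meant to be called from both ceremonies, which is one way to guarantee that registration and authentication never disagree on the policy.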

Expiration reminder settings:

  renewal_reminder_enabled = true      # Enable expiration reminder emails (default: true)
  renewal_reminder_interval = "24h"    # Check frequency (default: "24h")
  renewal_reminder_before = "360h"     # Lead time before expiry to start sending (default: 360h = 15 days)
  renewal_reminder_timeout = "5m"      # Operation timeout (default: "5m")
  renewal_reminder_retries = 3         # Max retry attempts (default: 3)
  renewal_reminder_retry_delay = "30s" # Delay between retries (default: "30s")

Hot-reload behavior:

  Hot-reloaded (effect on next ceremony / next scheduler tick):
    - validity: new value applies to passkeys registered after the reload;
      existing credentials keep their previously-stored expiry
    - user_verification: applies to the next registration / authentication
    - algorithms: applies to the next registration ceremony; existing
      credentials remain verifiable as long as their algorithm is still
      one the server supports (ES256, RS256, EdDSA)
    - attestation, allowed_aaguids, denied_aaguids: apply to the next
      registration ceremony; existing credentials are not retroactively
      re-evaluated against new lists
    - Scheduler settings: interval, timeout, retries, retry_delay

  Require restart:
    - rpid, origin, type, skip_port_check
    - Changing these mid-flight breaks validation of already-enrolled passkeys

Cluster storage modes:

  Replicated mode (filesystem.mode = "replicated"):
    - Passkeys broadcast to all nodes with quorum (>50% must confirm)
    - Automatic cross-node synchronization
  Shared mode (filesystem.mode = "shared"):
    - Passkeys on shared filesystem (NFS), no replication needed

Troubleshooting

Common symptoms and diagnostic steps:

Registration failures (“invalid attestation”):

  - RP ID mismatch: rpid must match the domain portion of origin
  - Origin mismatch: origin must exactly match the browser URL (scheme + host + port)
  - Port issues in containers: set skip_port_check=true for K8s/Docker deployments
  - Unsupported attestation format: only none, packed, fido-u2f are supported
  - Check config: 'config show authentication.webauthn'
  - Diagnose user: 'diagnose user <username>'

Authentication failures (“signature verification failed”):

  - Passkey expired: check valid_until in passkey record ('webauthn list <username>')
  - Wrong RP ID hash: rpid changed since passkey was registered (requires re-registration)
  - Corrupted public key: revoke and re-register the passkey
  - Check passkey details: 'webauthn list <username>'

Clone detection alerts (“counter did not increase”):

  - Possible cloned authenticator: investigate immediately (security event)
  - Counter validation only enforced when both stored and new counters are non-zero
  - Some authenticators do not support counters (always 0) -- this is normal
  - Counter wrapped around (rare, requires 2^32 uses)
  - Authenticator reset: requires re-registration after investigation
  - Check logs: 'logs search "clone" --module=webauthn'

Challenge expired or not found:

  - Challenge TTL is 5-10 minutes; user took too long to respond
  - Challenge already consumed (single-use; cannot retry with same challenge)
  - Memory storage broadcast delay in large clusters
  - Retry the ceremony from the beginning (BeginRegistration/BeginAuthentication)

Expiration reminders not being sent:

  - Verify scheduler is enabled: renewal_reminder_enabled = true
  - Check SMTP health: 'smtp health'
  - Verify user has email in directory: 'directory user <username>'
  - Disabled users are skipped (by design)
  - Check scheduler status: 'health components'
  - Only the cluster leader runs the check (leader-only scheduling)
  - Look for errors: 'logs search "expiration" --module=webauthn'

Passkey not found during authentication:

  - User has no passkey registered: 'webauthn list <username>'
  - Specific passkey was revoked: 'webauthn list <username>' shows revoked status
  - Credential ID mismatch: browser sending different credential than stored
  - Directory sync delay: passkey in LDAP but not yet in memory cache
  - Trigger sync: 'directory sync <username>'
  - Legacy format issue: check if user's moduledata has old flat format vs new array

502/503 during WebAuthn ceremony:

  - Filestorage unavailable: check filesystem health
  - Quorum not reached in replicated mode: check cluster status ('cluster status')
  - Memory storage broadcast failure: check cluster connectivity ('ping')

Metrics not updating:

  - Check metrics endpoint: 'webauthn metrics'
  - Verify telemetry module is healthy: 'health components'

Security

Critical security requirements:

Challenge-Response Protocol:

  - 32-byte cryptographic random challenges (crypto/rand)
  - Single-use: challenge deleted immediately after validation
  - TTL: 5-10 minutes, expired challenges rejected
  - Prevents replay of a captured ceremony response: a consumed or expired challenge is never accepted again

Clone Detection (Signature Counter):

  - Authenticator maintains incrementing signature counter
  - On each authentication: new counter must exceed stored counter
  - If new <= stored (both non-zero): REJECT -- possible cloned authenticator
  - Counter=0 authenticators exempt (per WebAuthn specification)
  - Counter updates NOT persisted to LDAP (avoids write on every auth)
  - Detection works by comparing against registration-time stored value
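
The counter rule above reduces to a small comparison. A sketch, with a hypothetical function name:

```go
package main

import "fmt"

// counterOK reports whether an assertion's signature counter is acceptable.
// Validation is enforced only when both values are non-zero; authenticators
// that always report 0 are exempt.
func counterOK(stored, received uint32) bool {
	if stored == 0 || received == 0 {
		return true
	}
	return received > stored
}

func main() {
	fmt.Println(counterOK(10, 11)) // true: counter advanced
	fmt.Println(counterOK(10, 10)) // false: possible cloned authenticator
	fmt.Println(counterOK(0, 0))   // true: counterless authenticator
}
```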

Attestation Validation:

  - Performed during registration for all supported formats
  - Current mode: permissive (registration succeeds even if validation fails)
  - Validation results logged for security auditing
  - For stricter enforcement: modify FinishRegistration to reject failures
  - Future: FIDO Metadata Service (MDS) for authenticator trust verification

Origin and RP ID Validation:

  - Origin must be HTTPS (WebAuthn specification requirement)
  - RP ID must match the domain in the origin URL
  - Browser enforces same-origin policy on credentials
  - skip_port_check=true relaxes port matching only (not scheme or domain)
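
To illustrate what skip_port_check relaxes and what it does not, here is a sketch of an origin comparison. The function names are hypothetical, and treating a missing port on an https origin as 443 is an assumption of this example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// effectivePort returns the explicit port, or 443 for https origins without one.
func effectivePort(u *url.URL) string {
	if p := u.Port(); p != "" {
		return p
	}
	if u.Scheme == "https" {
		return "443"
	}
	return ""
}

// originMatches compares the configured origin with the one reported by the
// browser. Scheme and host always have to match; the port is compared only
// when skipPortCheck is false.
func originMatches(configured, received string, skipPortCheck bool) (bool, error) {
	want, err := url.Parse(configured)
	if err != nil {
		return false, fmt.Errorf("parse configured origin: %w", err)
	}
	got, err := url.Parse(received)
	if err != nil {
		return false, fmt.Errorf("parse received origin: %w", err)
	}
	if want.Scheme != "https" {
		return false, fmt.Errorf("configured origin must be https, got %q", want.Scheme)
	}
	if got.Scheme != want.Scheme || !strings.EqualFold(got.Hostname(), want.Hostname()) {
		return false, nil
	}
	if skipPortCheck {
		return true, nil
	}
	return effectivePort(got) == effectivePort(want), nil
}

func main() {
	cfg := "https://login.example.com"
	fmt.Println(originMatches(cfg, "https://login.example.com:8443", true))
	fmt.Println(originMatches(cfg, "https://login.example.com:8443", false))
	fmt.Println(originMatches(cfg, "http://login.example.com", true))
}
```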

Public Key Cryptography:

  - Keys stored in COSE format (RFC 8152)
  - ES256 (ECDSA P-256): primary algorithm
  - RS256 (RSA-2048): secondary algorithm
  - EdDSA (Ed25519): accepted when listed in algorithms (modern authenticators)
  - Private keys never leave the authenticator hardware
  - Public keys stored base64-encoded in LDAP ModuleData

Rate Limiting:

  - Registration: configurable per-user limit (default 5/1h)
  - Authentication: configurable per-user limit (default 20/1m)
  - Limits brute-force and denial-of-service attempts

Operational security recommendations:

  - Monitor clone detection alerts as critical security events
  - Set an appropriate validity for your security policy ("8760h" = 1 year is the default; "0" disables expiry)
  - Implement passkey rotation procedures
  - Revoke passkeys immediately on device loss or compromise
  - Enable expiration reminders to prevent credential lapses
  - Audit all authentication events via telemetry logs
  - Consider enabling stricter attestation for high-security deployments

Relationships

Module dependencies and interactions:

directory: Primary passkey data source. WebAuthn reads passkeys from the directory’s in-memory cache (synced from LDAP). Also provides user listing for expiration checks. User’s FullName used for personalized reminder emails.
LDAP: Ultimate source of truth for passkey storage. Passkeys stored in the module data LDAP attribute. The calling layer is responsible for writing passkey data to LDAP after registration.
filestorage: Distributed credential storage with active/ and revoked/ directories. Supports replicated mode (quorum broadcast) and shared mode (NFS). Used for passkey record persistence alongside LDAP.
sessions: Creates authenticated sessions after successful WebAuthn authentication. Session module and TTL configurable per-authentication request (e.g., “sshproxy” module, 8h TTL).
storage.memory: Temporary challenge session storage with broadcast to all cluster nodes. TTL-based expiration (5-10 minutes). Challenges stored under cache type “webauthn_sessions”.
smtp: Sends passkey expiration reminder emails via SMTP module. ACL enforced: only the webauthn module is authorized to call this operation.
telemetry: Security audit logging at multiple levels. LevelError for clone detection and signature failures. LevelWarn for expired passkeys and invalid challenges. LevelInfo for successful operations.
scheduler: Expiration check runs as a leader-only scheduled task (distributed lock for safety). Configurable interval, timeout, retries, and retry delay.
config: Hot-reloadable configuration via the configuration system. Some fields cached at init (rpid, origin, type) to prevent mid-flight breakage.

External dependency:

CBOR decoding for attestation objects and COSE key parsing (RFC 8152).

Logs

Log entries by component. Search with: logs search "webauthn". Levels: ERROR > WARN > INFO > DEBUG.

Registration:

  webauthn.registration   INFO   AUDIT  Begin/finish registration request
  webauthn.registration   INFO          Passkey registered / attestation validated
  webauthn.registration   WARN          Challenge mismatch / origin mismatch / attestation failed
  webauthn.registration   ERROR         Challenge generation / session storage / marshal failures

Authentication:

  webauthn.authentication INFO   AUDIT  New challenge issued
  webauthn.authentication ERROR  AUDIT  E2OE commitment mismatch — Tier 1 binding rejected
  webauthn.authentication INFO          Auth successful / passkey not found / expired / invalid session
  webauthn.authentication WARN          Origin mismatch / RP ID hash mismatch / signature verification failed
  webauthn.authentication ERROR         ECDH keygen / challenge generation / session storage / cloned device / COSE key failures
  webauthn.authentication DEBUG         Begin/finish request trace / counter validation / auth successful

Enrollment:

  webauthn.enroll         INFO   AUDIT  Passkey enrolled (hash, device, active count)
  webauthn.enroll         ERROR         Failed to load existing passkeys / failed to store
  webauthn.enroll         DEBUG         Enroll request

Revocation:

  webauthn.revoke         INFO   AUDIT  Passkey revoked (hash, device, reason, revoked_by)
  webauthn.revoke         WARN          No passkeys found / passkey not found in active list
  webauthn.revoke         ERROR         Failed to store revoked passkey
  webauthn.revoke         DEBUG         Revoke request

Storage:

  webauthn.storage        DEBUG         Loading/storing passkeys (active/revoked counts)
  webauthn.storage        INFO          Passkeys stored to moduledata

Expiration:

  webauthn.expiration     INFO          Check started / completed / reminder sent / disabled / skipping
  webauthn.expiration     WARN          Lock acquisition failed
  webauthn.expiration     ERROR         Scheduler registration / LoadAll / GetAllUsers failures

Initialization:

  webauthn.init           INFO          Provider initialized (RPID, origin, type, validity) / disabled
  webauthn.init           ERROR         Initialization failed

Lookup:

  webauthn.get            DEBUG         Passkey lookup
  webauthn.list           DEBUG         Passkey listing

Metrics

Prometheus metrics. Query with: metrics prometheus webauthn_<name>

Passkey Inventory:

  webauthn_passkeys_issued                gauge      {}                        Total passkeys ever issued
  webauthn_passkeys_active                gauge      {}                        Currently active passkeys
  webauthn_passkeys_revoked               gauge      {}                        Revoked passkeys
  webauthn_passkeys_expired               gauge      {}                        Expired passkeys

Authentication:

  webauthn_auth_attempts                  counter    {}                        Authentication attempts
  webauthn_auth_success                   counter    {}                        Successful authentications
  webauthn_auth_failed                    counter    {}                        Failed authentications

Expiration Monitoring:

  webauthn_expiration_check_total         counter    {result}                  Expiration checks (success/failure)
  webauthn_expiration_passkeys_checked    gauge      {}                        Passkeys checked in last run
  webauthn_expiration_emails_sent         gauge      {}                        Reminder emails sent in last run
  webauthn_expiration_reminder_total      counter    {result}                  Reminder send attempts (success/failure)

Alerts:

  rate(webauthn_auth_failed[5m]) > 20                         High auth failure rate
  webauthn_passkeys_active == 0                                No active passkeys (service unusable)
  rate(webauthn_expiration_check_total{result="failure"}[1h]) > 0  Expiration check failing

X.509 Client Certificate Authentication

Authenticates users via client certificates — validates external PKI or issues internal certificates with auto-renewal

Overview

Authenticates users by verifying client certificates presented during the TLS handshake. Two modes:

External PKI validation:

  Validates client certificates from external PKI infrastructure (FreeIPA, Active Directory).
  The gateway performs validation only — certificate lifecycle is managed by the external PKI.

Internal CA enrollment:

  Issues and manages client certificates via the gateway's built-in ACME CA. Users self-enroll
  at /signup/x509 after authenticating. Supports auto-renewal, self-revocation, and
  multi-certificate overlap (max 2 active per user during renewal windows).

Validation is performed as an ordered, defense-in-depth pipeline:

  1. Certificate expiration check (NotBefore/NotAfter)
  2. TLS handshake validation against ClientCAs pool (chain, signature, trust)
  3. Application-level chain validation (full chain verify with client auth usage check)
  4. CRL check -- O(1) in-memory lookup with atomic map swap (if enabled)
  5. Identity extraction from certificate subject (cn, uid, email, or upn)
  6. Directory lookup (user exists and is active)
  7. OCSP check with cluster-cached responses and configurable soft-fail (if enabled)
  8. Session creation with username, email, groups, and certificate metadata

All validation operations are cluster-wide, ensuring consistent behavior regardless of which node handles the authentication request.

Typical authentication latency:

  - Cached path (CRL + cached OCSP): 20-30ms total
  - Uncached path (first OCSP query): 70ms-5s depending on ocsp_timeout
  - CRL lookup: less than 1ms (in-memory hash map)
  - OCSP cached lookup: less than 1ms (cluster memory)

Memory footprint:

  - CRL map: ~100 bytes per revoked certificate (10K certs = ~1MB)
  - OCSP cache: ~200 bytes per response (1K users = ~200KB)

Config

Core configuration under [authentication.x509]:

  [authentication.x509]
  enabled = true                     # Enable X.509 authentication
  ca_pem = """..."""                 # CA certificate(s) in PEM format (root + intermediates)

CRL (Certificate Revocation List):

  crl_enabled = true                 # Enable CRL-based revocation checking
  crl_url = "http://ca.example.com/ca.crl"  # CRL distribution point URL
  crl_refresh = "1h"                # CRL refresh interval (default: 1h)
  crl_timeout = "30s"               # HTTP download timeout (default: 30s)
  crl_max_size = 0                  # Max CRL size in bytes (0 = unlimited)

OCSP (Online Certificate Status Protocol):

  ocsp_enabled = true               # Enable OCSP revocation checking
  ocsp_url = "http://ocsp.example.com"  # OCSP responder URL
  ocsp_cache = "15m"                # Cache duration for OCSP responses (default: 15m)
  ocsp_timeout = "5s"               # HTTP timeout for OCSP queries (default: 5s)
  ocsp_soft_fail = true             # Allow auth if OCSP is unreachable (default: true)

IMPORTANT: OCSP timeout is independent of operations.wait_timeout. X.509 validation uses a dynamic timeout of ocsp_timeout + 5s buffer. This ensures OCSP queries complete with their full configured timeout regardless of the global wait_timeout.

Identity Mapping:

  [identity.cert_subject_map]
  username = "cn"                   # Certificate field for username extraction
                                     # Options: "cn" (CommonName), "uid" (LDAP UID OID),
                                     # "email" (email address), "upn" (AD User Principal Name)

Internal CA Enrollment:

  enroll_enabled = true              # Enable self-service certificate enrollment
  enroll_validity_days = 365         # Certificate validity period (default: 365)
  enroll_algorithm = "ECDSA-P256"   # Key algorithm: "ECDSA-P256" or "RSA-2048"
  enroll_max_active_certs = 10       # Max active certificates per user (1-50, default: 10)
  enroll_rate_limit = "3/1h"         # Enrollment rate limit per user (default: "3/1h")
  revoke_rate_limit = "5/1h"         # Revocation rate limit per user (default: "5/1h")
  enroll_p12_min_entropy = 60        # Min entropy bits for PKCS#12 password (default: 60)

Auto-Renewal:

  enroll_auto_renew = true           # Enable automatic renewal before expiry (default: true)
  enroll_auto_renew_days = 15        # Days before expiry to trigger renewal (default: 15)
  enroll_auto_renew_interval = "24h" # Check interval for expiring certs (default: "24h")
  enroll_auto_renew_timeout = "5m"   # Scheduler operation timeout (default: "5m")
  enroll_auto_renew_retries = 3      # Max retry attempts on failure (default: 3)
  enroll_auto_renew_retry_delay = "30s"  # Delay between retries (default: "30s")

PKI-Specific Identity Mapping:

  FreeIPA:          username = "uid"   (FreeIPA uses UID, not CN)
  Active Directory: username = "upn"   (AD uses User Principal Name)
  Generic LDAP:     username = "cn"    (CommonName is default)

Hot-reloadable: ca_pem, CRL settings, OCSP settings, identity mapping, enrollment settings. Cold (restart required): enabled.

Troubleshooting

Common error messages and diagnostic steps:

“certificate revoked (CRL)”:

  - Certificate serial number found in the downloaded CRL
  - Verify revocation status with external CA tools
  - User must obtain a new certificate from the PKI
  - Check CRL freshness: 'certs x509 metrics' for last refresh time

“user not found in directory”:

  - Identity field extracted from certificate does not match any directory user
  - Check cert_subject_map.username setting matches your PKI convention
  - Use 'directory user <username>' to verify user exists in directory
  - Use 'diagnose user <username>' for cross-subsystem check
  - Verify directory sync is current: 'directory status'

“failed to extract identity”:

  - The configured subject field (cn/uid/email/upn) is missing from the certificate
  - Inspect certificate subject with: openssl x509 -in cert.pem -noout -subject
  - Change cert_subject_map.username to a field present in the certificate

“OCSP query failed (soft-fail)”:

  - OCSP responder is unreachable but authentication proceeds (warning only)
  - Soft-fail is the default behavior (ocsp_soft_fail = true)
  - Check OCSP URL: 'net http <ocsp_url>'
  - Verify OCSP responder is operational
  - If hard-fail is required, set ocsp_soft_fail = false

“OCSP query failed (hard-fail)”:

  - OCSP responder is unreachable and ocsp_soft_fail = false
  - Authentication is blocked until OCSP responder recovers
  - Consider enabling soft-fail if OCSP outages are frequent
  - Check connectivity: 'net tcp <ocsp_host:port>'

“failed to download CRL”:

  - CRL URL is unreachable or returned an error
  - Check URL: 'net http <crl_url>'
  - Existing in-memory CRL continues to be used until refresh succeeds
  - Check for size limits: crl_max_size may be rejecting a large CRL

“certificate validation timeout”:

  - OCSP query or validation step exceeded the dynamic timeout
  - X.509 uses a dynamic timeout of ocsp_timeout + 5s, NOT operations.wait_timeout
  - Increase ocsp_timeout if OCSP responder is slow
  - Check OCSP responder latency: 'net latency <ocsp_host:port>'

“certificate expired or not yet valid”:

  - Certificate NotBefore/NotAfter check failed
  - Check certificate dates: openssl x509 -in cert.pem -noout -dates
  - Verify system clock is correct (NTP drift can cause false failures)

Session extension rejected (“x509_revocation”):

  - Certificate was revoked after the initial session was created
  - Internal CA: serial checked against the revocation index
  - External CA: OCSP check performed using stored certificate data from session
  - User must obtain a new certificate and re-authenticate

Enrollment failures:

  - "rate limit exceeded": user hit enroll_rate_limit, wait for window to reset
  - "PKCS#12 password too weak": password entropy below enroll_p12_min_entropy
  - "enrollment not enabled": set enroll_enabled = true in config
  - Check enrollment metrics: 'certs x509 metrics'

Auto-renewal not working:

  - User has no email in directory (skipped with warning)
  - User opted out via /signup/x509 status page (auto-renewal opt-out)
  - Certificate missing stored certificate data (older certificates)
  - enroll_auto_renew = false in config
  - Cluster lock contention: only one node processes renewals at a time
  - Check: 'certs x509 list' for certificate status per user

Browser not prompting for certificate:

  - Firefox: Settings > Privacy & Security > Certificates > View Certificates > Import
  - Chrome: Settings > Privacy and Security > Security > Manage Certificates > Import
  - Certificate must include ExtKeyUsageClientAuth
  - CA certificate must be in browser trust store
  - Verify TLS listener has ClientCAs configured (check logs for "x509 CA loaded")

Security

Defense-in-Depth Validation Pipeline:

Six independent validation layers ensure no single check failure compromises security:

  1. Certificate expiration (NotBefore/NotAfter checked first, fail-fast)
  2. TLS handshake with ClientCAs pool (chain, signature, trust anchor verification)
  3. Application-level chain validation (full chain verify with client auth usage check)
  4. CRL revocation check -- O(1) in-memory, race-condition safe (if enabled)
  5. Directory lookup confirms user exists and is active
  6. OCSP real-time revocation check with cluster caching (if enabled)

Identity is extracted ONLY after successful validation. Unvalidated certificate fields are never trusted.

TOCTOU Protection for CRL:

  CRL updates use atomic.Value to prevent Time-of-Check-Time-of-Use race conditions.
  The entire revoked serial map is built from the new CRL, then atomically swapped.
  Readers always see a consistent snapshot. No locks required for O(1) lookups.

Memory Exhaustion Protection:

  - CRL downloads have configurable timeout (crl_timeout, default 30s)
  - CRL size capped by crl_max_size (prevents DoS via malicious CRL files)
  - OCSP responses cached with TTL to limit memory growth

Configurable Soft-Fail OCSP:

  When ocsp_soft_fail = true (default), OCSP infrastructure failures allow authentication
  to proceed. The certificate is already validated by expiration + TLS handshake + chain
  validation + CRL + directory lookup before OCSP is checked.
  IMPORTANT: Revoked certificates ALWAYS block authentication regardless of soft-fail mode.
  Only infrastructure failures (unreachable, timeout) are affected by the soft-fail setting.
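
The soft-fail rule can be summarized as a three-way decision. This sketch uses hypothetical names and models only the three outcomes the text mentions (good, revoked, responder unavailable); how other OCSP statuses are handled is not specified here.

```go
package main

import "fmt"

// ocspStatus is a simplified outcome of an OCSP check.
type ocspStatus int

const (
	ocspGood        ocspStatus = iota // responder confirmed the certificate
	ocspRevoked                       // responder reported revocation
	ocspUnavailable                   // responder unreachable or timed out
)

// allowAfterOCSP decides whether authentication may proceed.
// Revocation always blocks; soft-fail only affects infrastructure failures.
func allowAfterOCSP(status ocspStatus, softFail bool) bool {
	switch status {
	case ocspGood:
		return true
	case ocspRevoked:
		return false
	default:
		return softFail
	}
}

func main() {
	fmt.Println(allowAfterOCSP(ocspRevoked, true))      // false
	fmt.Println(allowAfterOCSP(ocspUnavailable, true))  // true
	fmt.Println(allowAfterOCSP(ocspUnavailable, false)) // false
}
```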

Session TTL Capping:

  X.509 sessions are automatically capped to the certificate validity period.
  Session TTL = min(configured_TTL, cert_not_after - now). This prevents sessions from
  outliving their authenticating certificate. Applied at both signin (caller-side) and
  sessions module (defense-in-depth). Example: if certificate expires in 12h but config
  TTL is 24h, session TTL is capped to 12h.

Session Extension Revocation Check:

  When an X.509 session is extended, revocation is re-checked automatically:
  - Internal CA: serial checked against the revocation index
  - External CA: OCSP cache checked, full OCSP query if certificate data is available
  - Revoked certificates always block extension; soft-fail allows extension if OCSP is down

Internal CA Enrollment Security:

  - PKCS#12 bundles encrypted with Modern2023 profile (AES-256-CBC, SHA-256 HMAC)
  - Minimum password entropy enforced (enroll_p12_min_entropy, default 60 bits)
  - Rate limiting on enrollment and revocation endpoints (per-user)
  - Re-enrollment auto-revokes ALL existing certificates (fresh start with new key)
  - Auto-renewal preserves existing public key (only re-signs with new validity)
  - Maximum 2 active certificates per user (oldest auto-revoked when limit exceeded)
  - Revocation reason codes follow RFC 5280

Logging Security:

  Certificate serial numbers are logged only at DEBUG level. INFO logs contain username
  only, preventing information disclosure in production log aggregation systems.

Cluster Caching:

  OCSP responses are replicated to all nodes asynchronously. Eventual consistency
  is acceptable for cache data. Cache TTL is controlled by ocsp_cache config
  (default 15m).

Relationships

Module dependencies and interactions:

directory: User lookup during validation step 6. Confirms user exists and is active, returns email, full name, and group memberships. Also provides email addresses for auto-renewal notifications.
sessions: Session creation after successful validation. Session TTL capped to certificate validity. Revocation is re-checked when sessions are extended. Session metadata stores certificate data for external CA OCSP re-checks.
acme: Internal CA certificate signing for enrollment. Certificate revocation triggers CRL rebuild. Updated CRL is replicated to all nodes immediately.
identity: cert_subject_map configuration determines which certificate field maps to username (cn, uid, email, upn). Shared config section [identity.cert_subject_map].
signin: The /signin/x509 route triggers X.509 authentication flow. Validates the certificate and creates a session on success.
proxy: Per-mapping mTLS support (mtls=true) uses X.509 for mutual TLS at the route level. Certificate validated against ACME CA bundle or external PKI.
cluster: OCSP responses cached in distributed memory and replicated to all nodes. Auto-renewal uses a distributed lock to prevent duplicate processing across cluster nodes.
smtp: Auto-renewal sends renewed certificate bundles to users via email. Users without email addresses in directory are skipped with a warning.
moduledata: Certificate records stored per-user in the directory backend. Each user can have up to 2 active certificates (during renewal overlap), plus a revocation history and an auto-renewal opt-out flag.

Logs

Log entries by component. Search with: logs search "x509". Levels: ERROR > WARN > INFO > DEBUG.

Init & Lifecycle:

  x509.init               WARN          JetStream temporarily unavailable, retrying serial index rebuild
  x509.init               ERROR         Failed to rebuild serial index after retries
  x509.init               ERROR         Failed to initialize CRL
  x509.init               INFO          X.509 authentication enabled (CRL disabled)
  x509.cleanup            INFO   AUDIT  X.509 module cleanup complete

Validate (certificate authentication pipeline):

  x509.validate           ERROR         Failed to parse DER certificate
  x509.validate           WARN          Certificate not yet valid / Certificate expired
  x509.validate           ERROR         No CA certificates available (config + ACME bundle empty)
  x509.validate           WARN          Certificate chain validation failed
  x509.validate           WARN          Failed to extract identity from certificate
  x509.validate           ERROR         Directory lookup failed
  x509.validate           WARN          User not found in directory
  x509.validate           WARN          Failed to check serial index, falling back to moduledata
  x509.validate           ERROR         Failed to check moduledata revocation
  x509.validate           WARN          Internal certificate revoked / not in registry - rejecting
  x509.validate           WARN          OCSP check failed
  x509.validate           INFO          Certificate validated successfully
  x509.validate           DEBUG         Validation stage progress (expiration, chain, CRL, identity, OCSP)

Enroll (internal CA certificate issuance):

  x509.enroll             INFO          Starting certificate enrollment
  x509.enroll             WARN          Invalid username format / Failed to load existing certificate
  x509.enroll             ERROR         Failed to enforce certificate limit / generate keypair
  x509.enroll             ERROR         Failed to sign certificate with CA / get CA bundle
  x509.enroll             ERROR         Failed to generate PKCS#12 password / build PKCS#12 bundle
  x509.enroll             ERROR         Failed to store certificate record
  x509.enroll             WARN          Failed to store serial index
  x509.enroll             INFO   AUDIT  Certificate enrolled successfully

Revoke:

  x509.revoke             INFO          Revoking certificate
  x509.revoke             WARN          Failed to update serial index
  x509.revoke             INFO   AUDIT  Certificate revoked successfully

Revoke By Serial (self-service):

  x509.revokeBySerial     INFO          Revoking certificate by serial
  x509.revokeBySerial     WARN          Failed to update serial index
  x509.revokeBySerial     INFO   AUDIT  Certificate revoked by serial

Revoke All & Enforce Max:

  x509.revokeAll          WARN          Failed to update serial index
  x509.revokeAll          INFO   AUDIT  Revoked certificates for user
  x509.enforceMax         WARN          Failed to update serial index
  x509.enforceMax         INFO   AUDIT  Revoked oldest cert for user (max reached)

CRL:

  x509.crl.init           ERROR         Failed to download CRL from any server
  x509.crl.init           INFO          CRL loaded successfully
  x509.crl                WARN          CRL download failed, trying next URL
  x509.crl.refresh        ERROR         Failed to refresh CRL from any server
  x509.crl.refresh        INFO          CRL refreshed successfully
  x509.crl.refresh        DEBUG         Refreshing CRL
  x509.crl.rebuild        WARN          Failed to trigger CRL rebuild

OCSP:

  x509.ocsp               DEBUG         OCSP cache hit / cache miss - querying responder(s)
  x509.ocsp               WARN          No OCSP URLs configured and certificate has no AIA OCSP extension
  x509.ocsp               WARN          OCSP responder failed, trying next
  x509.ocsp.check         WARN          All OCSP responders unreachable (soft-fail enabled, allowing authentication)
  x509.ocsp.check         ERROR         All OCSP responders unreachable (hard-fail enabled, blocking authentication)
  x509.ocsp.check         DEBUG         OCSP query successful
  x509.ocsp.serial        WARN          OCSP cache lookup failed / cache wait failed
  x509.ocsp.serial        DEBUG         OCSP cache miss for session extension check / OCSP cache hit

Auto-Renewal:

  x509.renewal            INFO          Auto-renewal is disabled by configuration
  x509.renewal            ERROR         Failed to schedule auto-renewal
  x509.renewal            INFO          Auto-renewal scheduler registered
  x509.renewal            WARN          Failed to acquire renewal lock / wait for lock acquisition
  x509.renewal            INFO          Renewal check already in progress on another node, skipping
  x509.renewal            INFO          Starting certificate renewal check
  x509.renewal            ERROR         Failed to get all users / GetAllUsers failed / Invalid response
  x509.renewal            ERROR         Failed to renew certificate
  x509.renewal            INFO          Certificate renewal check completed
  x509.renewal            WARN          Skipping renewal - user has no email / no CertificateDER stored
  x509.renewal            WARN          Failed to enforce max certs limit
  x509.renewal            WARN          Failed to update serial index / get CA bundle / send renewal email
  x509.renewal            INFO          Certificate renewed successfully

Session Extension Validator:

  x509.session_validator  DEBUG         Checking certificate revocation for session extension
  x509.session_validator  WARN   AUDIT  X.509 session missing required metadata - allowing extension
  x509.session_validator  WARN          Failed to check serial index, falling back to moduledata
  x509.session_validator  WARN   AUDIT  Session extension rejected: internal certificate revoked
  x509.session_validator  WARN          Session extension rejected: internal certificate not in registry
  x509.session_validator  WARN          Session extension rejected: external certificate revoked (OCSP/cache)
  x509.session_validator  WARN          Soft-fail warnings (revocation check, OCSP, cert parse failures)
  x509.session_validator  WARN          OCSP check failed, rejecting extension (hard-fail)
  x509.session_validator  WARN          Unknown CA type in session metadata - allowing extension

Revocation Check (hexdcall operation):

  x509.check_revoked      DEBUG         Checking certificate revocation status / valid / OCSP passed
  x509.check_revoked      WARN          Failed to check serial index / not in registry / no cert DER
  x509.check_revoked      INFO          Internal certificate is revoked / External revoked (OCSP)
  x509.check_revoked      ERROR         Failed to parse certificate DER
  x509.check_revoked      WARN          OCSP check failed for external cert

Recovery (serial index rebuild at startup):

  x509.recovery           INFO          Starting serial index recovery from moduledata
  x509.recovery           WARN          Invalid x509 data format for user
  x509.recovery           WARN          Failed to store serial index for legacy/active/revoked cert
  x509.recovery           INFO          Serial index recovery completed / cancelled during shutdown

Storage:

  x509.storage            INFO          X509 certificate stored to moduledata
  x509.storage            DEBUG         Load/store operations, format parsing

Auto-Renew Opt-Out:

  x509.auto_renew         INFO          Auto-renewal opt-out updated

Revoked Certificates Query:

  x509.revoked            ERROR         Failed to retrieve serial index
  x509.revoked            INFO          Retrieved revoked certificates
  x509.revoked            DEBUG         Retrieving all revoked certificates

Metrics

Prometheus metrics. Query with: metrics prometheus x509_<name>

Validation:

  x509_validation_total                 counter    {result, reason?}         Certificate validation attempts
    result=success                                                            Valid certificate authenticated
    result=failure, reason=not_yet_valid                                      Certificate NotBefore in future
    result=failure, reason=expired                                            Certificate past NotAfter
    result=failure, reason=no_ca_available                                    No CA certs configured
    result=failure, reason=chain_validation_failed                            Chain/signature verification failed
    result=failure, reason=revoked_crl                                        Revoked via CRL (external cert)
    result=failure, reason=invalid_identity                                   Identity field missing from cert
    result=failure, reason=directory_error                                    Directory lookup call failed
    result=failure, reason=directory_timeout                                  Directory lookup timed out
    result=failure, reason=user_not_found                                     User not in directory
    result=failure, reason=revoked_internal                                   Revoked via serial index (internal cert)
    result=failure, reason=not_registered                                     Internal cert not in enrollment registry
    result=failure, reason=revoked_ocsp                                       Revoked via OCSP (external cert)

Enrollment:

  x509_enrollment_total                 counter    {result, reason?}         Certificate enrollment attempts
    result=success                                                            Certificate issued successfully
    result=failure, reason=invalid_username                                   Username validation failed

Revocation:

  x509_revocation_total                 counter    {result, reason}          Certificate revocations
    result=success, reason=<RFC5280 code>                                     Revocation completed

CRL:

  x509_crl_refresh_total                counter    {result}                  CRL download/refresh attempts
    result=success                                                            CRL loaded/refreshed
    result=failure                                                            Download failed from all URLs
  x509_crl_revoked_count                gauge      {}                        Number of revoked certs in CRL
  x509_crl_size_bytes                   gauge      {}                        Raw CRL size in bytes

OCSP:

  x509_ocsp_query_total                 counter    {result, cached}          OCSP lookups
    result=success, cached=true                                               Cache hit (memory)
    result=success, cached=false                                              Responder queried successfully
    result=failure, cached=false                                              All responders unreachable

Auto-Renewal:

  x509_auto_renewal_check_total         counter    {result}                  Renewal check runs
  x509_auto_renewal_total               counter    {result}                  Individual cert renewals
    result=success                                                            Cert renewed and emailed
    result=failure                                                            Renewal failed
  x509_auto_renewal_skipped_total       counter    {reason}                  Renewals skipped
    reason=no_email                                                           User has no email in directory
    reason=no_certificate_der                                                 No stored cert for key extraction
  x509_auto_renewal_certs_checked       gauge      {}                        Certs checked in last run
  x509_auto_renewal_certs_renewed       gauge      {}                        Certs renewed in last run
  x509_auto_renewal_certs_skipped       gauge      {}                        Certs skipped (opt-out) in last run
  x509_auto_renewal_errors              gauge      {}                        Errors in last renewal run

Alerts:

  rate(x509_validation_total{result="failure"}[5m]) > 10           High validation failure rate
  rate(x509_validation_total{reason="revoked_crl"}[5m]) > 0        CRL-revoked cert used (possible compromise)
  rate(x509_validation_total{reason="revoked_internal"}[5m]) > 0   Revoked internal cert used
  rate(x509_crl_refresh_total{result="failure"}[5m]) > 0           CRL server unreachable
  rate(x509_ocsp_query_total{result="failure"}[5m]) > 0            OCSP responder down
  x509_auto_renewal_errors > 0                                     Auto-renewal failures need attention

Onboarding Service

Self-service user onboarding with magic link verification and passkey enrollment

Overview

The onboarding service provides a streamlined SPA flow for new users to verify their email and enroll a passkey. It combines the magic link passwordless flow with WebAuthn passkey registration into a single guided experience.

The service is a single GET endpoint at /onboarding that renders different steps based on the user’s authentication state. All actual operations (magic link, passkey enrollment) are delegated to existing API endpoints — no new backend APIs are needed.

Onboarding flow (4 steps):

  Step 0: Email entry — user submits email address
  Step 1: Magic link polling — browser polls for authorization while user clicks link in email
  Step 2: Passkey enrollment — WebAuthn ceremony to register a biometric/hardware key
  Step 3: Success — animated confirmation, auto-redirect to /profile

Three handler states:

  1. No session — render email step (unauthenticated users start here)
  2. Authenticated session + no passkey — create mfa_pending session, render passkey step
  3. Authenticated session + has passkey — redirect to /profile (already onboarded)
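
The three handler states can be read as a simple decision function. The sketch below is illustrative only; the function name and the returned action labels are hypothetical.

```python
def onboarding_action(has_valid_session: bool, has_passkey: bool) -> str:
    """Map the user's authentication state to what GET /onboarding does."""
    if not has_valid_session:
        # State 1: unauthenticated users start at the email step.
        return "render_email_step"
    if not has_passkey:
        # State 2: create an mfa_pending session, then show passkey enrollment.
        return "create_mfa_pending_and_render_passkey_step"
    # State 3: already onboarded.
    return "redirect:/profile"
```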

The service is gated by the portal being enabled (portal = true). When portal is disabled, the /onboarding route is not registered.

Config

The onboarding service has no dedicated configuration section. It relies on:

  [service]
    portal = true                    # Must be enabled for onboarding route registration
    session_mfa_pending = "5m"       # TTL for the mfa_pending session during passkey enrollment
    cookie_name = "hexon"            # Session cookie name (for detecting authenticated users)
    cookie_domain = ""               # Cookie domain for cross-subdomain support

  [service.signin.magiclink]         # Magic link settings used by /api/signin/magiclink
    enabled = true
    code_ttl = "10m"
    rate_limit = "5/1m"

  [protection]
    pow = true                       # PoW protection applied automatically (no DisablePoW on route)

The onboarding page inherits PoW protection from the global middleware. Authenticated users skip PoW automatically (valid session cookies are detected).

Endpoints

UI endpoint:

  GET /onboarding                    Onboarding SPA page (all steps rendered client-side)

The SPA calls existing API endpoints via fetch():

  POST /api/signin/magiclink         Send magic link email (existing signin service)
  POST /api/signin/magiclink/poll    Poll for magic link authorization (existing signin service)
  POST /api/signup/passkey/begin     Begin WebAuthn registration ceremony (existing signup service)
  POST /api/signup/passkey/finish    Complete WebAuthn registration (existing signup service)

On magic link authorization, the poll handler (in the signin service) creates an authenticated "user" session. The onboarding JS then reloads the page, and the handler detects the session, creates an mfa_pending session for passkey enrollment, and renders the passkey step.

Session flow:

  1. Poll authorized → signin poll handler creates "user" session + sets hexon cookie
  2. Page reload → handler reads hexon cookie → validates user session
  3. No passkey found → creates mfa_pending session + sets mfa_session_id cookie
  4. Passkey begin/finish use mfa_session_id cookie for authorization
  5. On passkey success → JS redirects to /profile
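
The client side of steps 1 and 2 can be sketched as a polling loop. The real SPA uses fetch() in the browser; this Python version only illustrates the control flow. It assumes both endpoints return JSON objects, that expires_in is in seconds, and that the poll interval is 5 seconds. The post callable and the function name are hypothetical.

```python
import time
from typing import Callable, Dict

# Hypothetical transport: sends a form-encoded POST and returns parsed JSON.
PostForm = Callable[[str, Dict[str, str]], dict]


def wait_for_magic_link(post: PostForm, email: str, interval: float = 5.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Request a magic link, then poll until authorized or expired."""
    start = post("/api/signin/magiclink", {"email": email})
    device_code = start["device_code"]
    max_polls = int(start["expires_in"] // interval)
    for _ in range(max_polls):
        resp = post("/api/signin/magiclink/poll", {"device_code": device_code})
        if resp.get("status") == "authorized":
            # The session cookie is now set; the page should be reloaded.
            return True
        sleep(interval)
    return False
```

When the function returns True, the caller reloads /onboarding so the handler can detect the new session.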

Troubleshooting

Common issues and diagnostic steps:

Onboarding page shows email step despite being logged in:

  - Verify session exists: 'sessions list --user=<username>'
  - Check session type is "user" with auth_status "authenticated"
  - Check cookie: session cookie name must match config (default: hexon)
  - PoW interference: if PoW cookie expired, user may be redirected to challenge first

Passkey step not appearing after magic link click:

  - Check magic link poll response: should return status "authorized"
  - Verify the "user" session was created: 'sessions list --user=<username>'
  - JS reloads page after authorized — check for network/redirect issues
  - Server log should show "Onboarding: authenticated user entering passkey enrollment"

Passkey registration failing:

  - Check mfa_session_id cookie exists and session is valid
  - Session TTL: mfa_pending session defaults to 5 minutes (session_mfa_pending config)
  - WebAuthn RP ID must match hostname
  - Browser must support PublicKeyCredential API (HTTPS required)
  - Server logs: look for "Begin registration request" and "FinishRegistration failed"

PoW challenge blocking onboarding:

  - Normal behavior for first-time visitors without PoW session cookie
  - Authenticated users skip PoW (middleware checks application session)
  - PoW session TTL: default 30 minutes (pow_session_ttl config)

Page redirect loop or landing on / after magic link:

  - return_url must be HMAC-sealed (handler passes sealed URL to template data)
  - Unsealed URLs fall back to "/"
  - Check that sealed_return_url is present in onboarding-data JSON
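
The sealing behavior can be illustrated with a generic HMAC construction. The wire format below (base64 tag, a dot, then the URL) and the function names are assumptions for illustration; the document does not specify how the gateway encodes sealed URLs.

```python
import base64
import hashlib
import hmac


def _tag(url: str, key: bytes) -> str:
    digest = hmac.new(key, url.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def seal(url: str, key: bytes) -> str:
    """Prefix the URL with an HMAC-SHA256 tag (hypothetical format)."""
    return _tag(url, key) + "." + url


def unseal(sealed: str, key: bytes) -> str:
    """Return the URL if the tag verifies, otherwise fall back to "/"."""
    tag, sep, url = sealed.partition(".")
    if not sep:
        return "/"  # no tag present: treat as unsealed
    if hmac.compare_digest(tag, _tag(url, key)):
        return url
    return "/"
```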

Session proliferation on page refresh:

  - Handler reuses existing valid mfa_pending session (checks mfa_session_id cookie first)
  - If mfa_session_id expired, a new session is created on refresh (normal behavior)
  - Old expired sessions are cleaned up by session TTL

Relationships

Module dependencies and interactions:

signin (magiclink): Provides the magic link email flow. POST /api/signin/magiclink initiates the flow, POST /api/signin/magiclink/poll checks status. On authorization, the poll handler creates the "user" session that onboarding detects.
signup (passkey): Provides WebAuthn enrollment. POST /api/signup/passkey/begin and /finish handle the ceremony. Both require a valid mfa_session_id cookie pointing to an mfa_pending session with signup_flow="passkey".
sessions: Used for session detection (Validate) and mfa_pending session creation (Create). The handler checks the main session cookie for authenticated users, and creates a separate mfa_session_id cookie for the passkey enrollment session.
webauthn: Used to check if user already has a passkey. Users with an existing passkey are redirected to /profile immediately.
render: Template rendering. Uses the onboarding manifest entry for CSS/JS asset bundling.
locale: i18n translations via template {{t "onb.*"}} function. All UI text comes from locale TOML files ([onb] section in 10 language files).
protection (PoW): Global PoW middleware protects the route — unauthenticated users solve PoW challenge before seeing the page.
portal: Onboarding route registration is gated by IsPortalEnabled(). Both services share the same user-facing domain.

Logs

Log entries by component. Search with: logs search "onboarding". Levels: ERROR > WARN > INFO > DEBUG.

Init (route registration):

  onboarding.init              INFO          Onboarding disabled (console not enabled)
  onboarding.init              INFO          Onboarding service route registered at /onboarding

MFA Session (passkey enrollment session lifecycle):

  onboarding.mfa_session       ERROR         Failed to create mfa_pending session for passkey enrollment
  onboarding.mfa_session       ERROR         Invalid session response type

Passkey (enrollment flow):

  onboarding.passkey           INFO   AUDIT  Onboarding: authenticated user entering passkey enrollment

Metrics

This module does not emit its own Prometheus metrics.

Observability is provided indirectly through dependent modules:

  - sessions: session_* metrics cover mfa_pending session creation and validation
  - webauthn: webauthn_* metrics cover passkey registration ceremonies
  - magiclink: magiclink_* metrics cover magic link email and polling
  - ratelimit: ratelimit_* metrics cover PoW and request throttling

Signin Service

Authentication coordinator with multi-method sign-in, pluggable MFA, magic links, and session management

Overview

The signin service is the central authentication coordinator for Hexon. It orchestrates the complete user sign-in lifecycle across multiple authentication methods and modules, handling primary authentication, multi-factor verification, magic link passwordless flows, and session creation.

Supported primary methods:

  - passwd: LDAP password authentication (bind-based, no local password storage)
  - passkey: WebAuthn/FIDO2 passwordless (hardware keys, biometrics, phishing-resistant)
  - x509: Client certificate authentication (Subject DN to username mapping)
  - oidc: OpenID Connect single sign-on via external identity provider
  - magiclink: Email-based passwordless authentication (BASE-20 tokens, RFC 8628 polling)

Supported MFA methods (pluggable):

  - otp: Email-delivered verification code (via emailotp module)
  - totp: Time-based One-Time Password / authenticator apps (RFC 6238)

Authentication flow stages:

  1. Primary authentication — credential verification against backend (LDAP/WebAuthn/X.509)
  2. MFA challenge (if required) — pre-auth session created, MFA code verified
  3. Session creation — quorum-replicated across cluster, cookie set
  4. Directory sync — fire-and-forget background user data refresh
  5. Redirect — user sent to original destination (return_url)
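
The branch between stages 1, 2, and 3 depends on whether the method that succeeded is listed in require_mfa. The sketch below mirrors the example values from the Config section as a plain dictionary; the function name and stage labels are hypothetical.

```python
# Values mirror the example [service.signin] configuration in this document.
SIGNIN_CONFIG = {
    "primary": "passkey",
    "secondary": ["passwd", "x509"],
    "require_mfa": ["passwd"],
    "mfa_methods": ["otp", "totp"],
}


def next_stage(method: str, primary_ok: bool, config: dict) -> str:
    """Pick the stage that follows primary authentication."""
    if not primary_ok:
        return "reject"
    if method in config["require_mfa"]:
        # A pre-auth (mfa_pending) session is created before the challenge.
        return "mfa_challenge"
    return "create_session"
```

With this configuration, a successful passwd sign-in proceeds to an MFA challenge, while passkey and x509 go straight to session creation.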

Magic link flow (cross-device passwordless):

  1. User submits email on /signin/magiclink
  2. Device code created (RFC 8628), BASE-20 token generated (rejection sampling, no modulo bias)
  3. Token-to-device-code mapping stored as SHA-256 hashes (tokens never in cleartext)
  4. Magic link email sent via SMTP (fire-and-forget, anti-enumeration)
  5. Browser polls /api/signin/magiclink/poll every 5 seconds
  6. User clicks link on any device, token validated, device code marked authorized
  7. Next poll detects authorization, session created on polling browser only
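
Steps 2 and 3 can be sketched as follows. Rejection sampling discards random bytes that would make some characters more likely than others, and only the SHA-256 hash of the token is kept for lookup. The alphabet is an assumption (the base-20 consonant set used as an example in RFC 8628 section 6.1); the document does not state which 20 characters the gateway uses, and the function names are hypothetical.

```python
import hashlib
import secrets

# Assumption: the base-20 alphabet from the RFC 8628 user-code example.
ALPHABET = "BCDFGHJKLMNPQRSTVWXZ"
# Largest multiple of 20 that fits in one byte (240). Bytes at or above it
# are rejected so that every character is equally likely.
LIMIT = 256 - (256 % len(ALPHABET))


def generate_token(length: int = 10) -> str:
    """Generate an unbiased BASE-20 token via rejection sampling."""
    chars = []
    while len(chars) < length:
        b = secrets.token_bytes(1)[0]
        if b >= LIMIT:
            continue  # reject: using this byte would introduce modulo bias
        chars.append(ALPHABET[b % len(ALPHABET)])
    return "".join(chars)


def token_hash(token: str) -> str:
    """Hash stored in place of the cleartext token."""
    return hashlib.sha256(token.encode("ascii")).hexdigest()
```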

Session security:

  - Session rotation after MFA (new ID prevents session fixation attacks)
  - MFA pending sessions are short-lived (default 5 minutes) and revoked after upgrade
  - Sessions bound to IP address and TLS fingerprint
  - Configurable max concurrent sessions per user (default: 1)
  - Cluster-wide session storage with quorum replication (available on all nodes)

Config

Configuration under [service.signin] in TOML:

[service.signin]
  primary = "passkey"              # Default authentication method shown at /signin
                                   # Options: "passwd", "passkey", "x509", "oidc", "magiclink"
  secondary = ["passwd", "x509"]   # Alternative methods (shown as links on sign-in page)
  require_mfa = ["passwd"]         # Methods that require MFA after primary auth
                                   # Empty list = MFA never required
  mfa_methods = ["otp", "totp"]    # Available MFA methods presented to user
                                   # Order determines default selection

[service.signin.magiclink]
  enabled = true                   # Enable magic link passwordless sign-in
  code_length = 10                 # Token length in BASE-20 characters (range: 6-40, default: 10)
  code_ttl = "10m"                 # Link validity duration (default: 10 minutes)
  rate_limit = "5/1m"              # Per-IP rate limit on magic link requests
  rate_limit_email = "3/10m"       # Per-email rate limit (anti-flooding protection)
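
The rate limit values read as "count per window". The parser below is a sketch under that assumption, accepting s, m, and h suffixes; the exact grammar the gateway accepts is not documented here, and the function name is hypothetical.

```python
import re
from datetime import timedelta
from typing import Tuple

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_rate_limit(spec: str) -> Tuple[int, timedelta]:
    """Parse a value such as "5/1m" into (max requests, window)."""
    match = re.fullmatch(r"(\d+)/(\d+)([smh])", spec)
    if match is None:
        raise ValueError("unrecognized rate limit: %r" % spec)
    count = int(match.group(1))
    window = int(match.group(2)) * _UNIT_SECONDS[match.group(3)]
    return count, timedelta(seconds=window)


print(parse_rate_limit("5/1m"))   # 5 requests per 1-minute window
print(parse_rate_limit("3/10m"))  # 3 requests per 10-minute window
```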

Session configuration (under [service.signin] or related session config):

  session_ttl                      # Authenticated session lifetime
  session_password_expired         # Session TTL for expired password flow
  session_mfa_pending              # Pre-auth session TTL (default: 5 minutes)
  max_concurrent_sessions = 1      # Max active sessions per user (default: 1)

Password policy (enforced during passwd authentication):

  - Strength validation via zxcvbn algorithm (configurable score 0-4)
  - Character requirements: uppercase, lowercase, digits, special characters
  - Minimum length and entropy requirements (all configurable via TOML)
  - Password expiry enforcement with dedicated session type

MFA settings:

  max_retries = 5                  # Maximum MFA verification attempts before lockout

Hot-reloadable: primary method, secondary methods, require_mfa list, mfa_methods, magiclink settings, session TTLs, password policy, rate limits. Cold (restart required): service.signin.enabled.

Endpoints

UI endpoints (serve HTML pages):

  GET  /signin                     Redirect to primary authentication method
  GET  /signin/passwd              LDAP password sign-in page
  GET  /signin/passkey             WebAuthn passkey sign-in page
  GET  /signin/x509                X.509 certificate sign-in page
  GET  /signin/magiclink           Magic link email form
  GET  /signin/magiclink/verify    Magic link verification (clicked from email)
  GET  /signin/mfa                 MFA verification page (OTP or TOTP)

API endpoints (JSON/form):

  POST /api/signin                 Authenticate with credentials
                                   Body: {"method", "username", "password", "remember_me"}
                                   Returns: success with session_token, or requires_mfa with
                                   pre-auth session and available mfa_methods

  POST /api/signin/magiclink       Submit magic link request
                                   Body: email, return_url, auth_flow (form-encoded)
                                   Returns: device_code and expires_in for polling
                                   Rate limited: per-IP (5/1m) and per-email (3/10m)

  POST /api/signin/magiclink/poll  Poll magic link authorization status
                                   Body: device_code (form-encoded)
                                   Returns: {"status":"pending"} or {"status":"authorized","redirect":"..."}

  POST /api/signin/mfa             Verify MFA code
                                   Body: {"method", "code", "session_id" (HMAC-sealed), "trust_device"}
                                   Returns: success with redirect (session_id not exposed in response)

  POST /api/signin/mfa/resend      Resend OTP code (email OTP only)

X.509 over HTTP/3 note: QUIC does not support TLS renegotiation. If a user attempts X.509 auth over HTTP/3 without a client certificate, the server responds with Alt-Svc: clear and a 307 redirect to force retry over HTTP/2, which properly prompts for client certificate selection.
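
The fallback described in the note can be sketched as a small decision helper. The function name, the protocol string, and the choice to redirect to the same URL are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def x509_http3_fallback(protocol: str, has_client_cert: bool,
                        url: str) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return a (status, headers) redirect when HTTP/3 lacks a client cert."""
    if protocol == "HTTP/3" and not has_client_cert:
        # Alt-Svc: clear tells the client to drop the advertised alternative,
        # so the retried request goes over a TCP-based HTTP version.
        return 307, {"Alt-Svc": "clear", "Location": url}
    return None  # continue with normal X.509 validation
```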

Troubleshooting

Common symptoms and diagnostic steps:

Authentication failures (generic “Invalid username or password”):

  - LDAP backend unreachable: 'auth ldap' to check connection health
  - Account locked in LDAP (nsAccountLock attribute): 'directory user <username>'
  - User not found in directory: 'directory user <username>' to verify existence
  - Incorrect bind DN or password: check LDAP module configuration
  - Start with: 'diagnose user <username>' for cross-subsystem check

MFA verification failing:

  - TOTP clock drift: user device time must be within 30-second window
  - OTP expired: default validity window is short, check 'auth otp'
  - Email OTP not delivered: 'smtp health' to verify SMTP service
  - Rate limited (429): max_retries exceeded, check 'metrics ratelimit'
  - Session expired: MFA pending session has 5-minute TTL by default
  - Check MFA session: 'sessions list --user=<username>' for pre-auth sessions
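
To see why clock drift breaks TOTP, the sketch below computes RFC 6238 codes (HMAC-SHA1, 30-second step) and accepts a code from the current step or one step on either side. The one-step tolerance and the function names are assumptions; the gateway's actual tolerance is not stated here.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP using HMAC-SHA1 and RFC 4226 dynamic truncation."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, code: str, unix_time: int,
           window: int = 1, step: int = 30) -> bool:
    """Accept codes from the current step and `window` steps either side."""
    for delta in range(-window, window + 1):
        t = unix_time + delta * step
        if t < 0:
            continue
        if hmac.compare_digest(totp(secret, t, step), code):
            return True
    return False
```

A device whose clock is more than the tolerated number of steps away from server time produces codes that never match.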

Magic link issues:

  - Email not received: 'smtp health' and 'notify health' to verify delivery path
  - Anti-enumeration: same response whether email exists or not (by design)
  - Token expired: default code_ttl is 10 minutes, check timing
  - Rate limited: per-IP (5/1m) or per-email (3/10m), check 'metrics ratelimit'
  - Poll returns "pending" indefinitely: verify SMTP delivery, check device code
    status via 'auth devicecodes'
  - "Link already used" error: tokens are single-use, mapping deleted after verify

Session creation failures:

  - Cluster quorum not met: 'cluster status' to verify quorum health
  - Session replication timeout: check cluster health for latency
  - Max concurrent sessions reached: 'sessions list --user=<username>'
  - Cookie not set: verify service hostname matches cookie domain
  - Session bound to wrong IP: check proxy/load balancer X-Forwarded-For headers

WebAuthn/passkey errors:

  - No passkey registered: 'webauthn list <username>' to check enrollments
  - Browser not supporting WebAuthn: requires HTTPS and a supported browser
  - Relying party ID mismatch: hostname must match RP ID in WebAuthn config
  - Challenge expired: WebAuthn challenges are cached temporarily

X.509 certificate sign-in issues:

  - Certificate not requested by browser: check TLS configuration
  - HTTP/3 fallback: Alt-Svc: clear redirect expected for QUIC connections
  - Certificate chain validation failure: check CA bundle configuration
  - Subject DN mapping: verify DN-to-username mapping rules
  - Check: 'certs x509 list' for registered client certificates

Password policy rejections:

  - zxcvbn score too low: user password not meeting strength requirements
  - Missing character classes: check uppercase/lowercase/digit/special requirements
  - Password expired: user gets dedicated session type, must change password
  - Check policy: 'config show service.signin' for password policy settings

Redirect loops after sign-in:

  - return_url invalid or pointing to sign-in page itself
  - Session cookie domain mismatch: verify service.hostname configuration
  - OIDC callback failure: check oidc_providers configuration
  - Check: 'sessions list --user=<username>' and 'auth status'

Relationships

Module dependencies and interactions:

authentication.ldap: Primary backend for passwd method. LDAP bind authentication with connection pooling. Reports account lock status (nsAccountLock). Password policy enforcement (strength, expiry, character requirements).
authentication.webauthn: Primary backend for passkey method. WebAuthn/FIDO2 credential storage and verification. Hardware key and biometric support.
authentication.x509: Primary backend for X.509 certificate method. Certificate chain validation, Subject DN to username mapping, revocation checking.
authentication.oidc: Backend for OIDC single sign-on method. Redirects to external identity provider for authentication.
authentication.magiclink: Magic link token generation, email composition. Uses BASE-20 encoding with rejection sampling for unbiased token generation.
authentication.devicecode: RFC 8628 device code flow. Provides polling infrastructure and expiration for magic link authorization tracking.
authentication.otp: Email OTP generation and verification for MFA. Delivers codes via emailotp module with device fingerprinting.
authentication.totp: TOTP verification for MFA. Validates RFC 6238 codes from authenticator apps (Google Authenticator, Authy, etc.).
sessions: Cluster-wide session management with quorum replication. Creates authenticated sessions, MFA pending sessions, and password-expired sessions. Session rotation after MFA completion.
directory: User data synchronization after authentication (fire-and-forget). Provides user lookup by email (magic link), group membership, account status. Fresh data sync ensures up-to-date authorization after sign-in.
smtp: Email delivery for magic link messages and OTP codes. Fire-and-forget delivery ensures consistent response timing (anti-enumeration).
signout: Companion service for session termination and logout flows.
onboarding: Uses magic link flow for email verification, then transitions to passkey enrollment. The onboarding SPA calls /api/signin/magiclink and /api/signin/magiclink/poll directly via fetch(). After authorization, the poll handler creates a "user" session which onboarding detects on page reload.
passwordchange: Handles password change flows when password-expired session is active. Redirects back to sign-in after successful change.
firewall: Network-level access rules applied before sign-in endpoints.
protection: Rate limiting (fingerprint-based) on all sign-in endpoints. Prevents brute force attacks on credentials and MFA codes.
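
The rejection sampling mentioned for authentication.magiclink can be sketched as below. This shows the general technique only, under stated assumptions: the alphabet is the 20-consonant set that RFC 8628 section 6.1 suggests for user codes, and the token length of 8 and the name generate_token are hypothetical. The module's actual alphabet and length are not documented here.

```python
import secrets

# Assumed alphabet: the base-20 consonant set suggested in RFC 8628 section 6.1.
ALPHABET = "BCDFGHJKLMNPQRSTVWXZ"


def generate_token(length: int = 8) -> str:
    """Draw `length` symbols uniformly from ALPHABET using rejection sampling."""
    # 256 is not a multiple of 20, so `byte % 20` alone would favor the first
    # 16 symbols. Bytes at or above the largest multiple of 20 (240) are discarded.
    limit = 256 - (256 % len(ALPHABET))
    symbols = []
    while len(symbols) < length:
        for byte in secrets.token_bytes(length):
            if byte < limit and len(symbols) < length:
                symbols.append(ALPHABET[byte % len(ALPHABET)])
    return "".join(symbols)
```

Discarding the 16 out-of-range byte values costs a few extra random bytes on average but makes every symbol equally likely.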

Logs

Log entries by component. Search with: 'logs search "signin"'. Levels: ERROR > WARN > INFO > DEBUG. DEBUG entries are recorded only when the log level is configured to include them.
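
The level ordering can be expressed as a small filter. The sketch below parses rows in the layout used by the tables in this section (component, level, optional AUDIT flag, message); the actual output format of the logs command may differ, and the names Entry, parse_row and at_least are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Severity ranks following ERROR > WARN > INFO > DEBUG.
SEVERITY = {"ERROR": 3, "WARN": 2, "INFO": 1, "DEBUG": 0}

# component, level, optional AUDIT flag, message
ROW = re.compile(r"^\s*(\S+)\s+(ERROR|WARN|INFO|DEBUG)\s+(AUDIT\s+)?(.*\S)\s*$")


class Entry(NamedTuple):
    component: str
    level: str
    audit: bool
    message: str


def parse_row(line: str) -> Optional[Entry]:
    """Parse one catalog row; return None for headings and blank lines."""
    m = ROW.match(line)
    if m is None:
        return None
    return Entry(m.group(1), m.group(2), m.group(3) is not None, m.group(4))


def at_least(entries, minimum: str):
    """Keep entries whose level is at or above `minimum`."""
    floor = SEVERITY[minimum]
    return [e for e in entries if SEVERITY[e.level] >= floor]
```

Filtering with at_least(entries, "WARN") keeps WARN and ERROR rows only, while the audit field allows separating audit-flagged events from ordinary ones.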

Authentication completion:

  signin.complete                INFO          Authentication completed

Finalize (session creation after successful auth):

  signin.finalize               ERROR  AUDIT  Failed to create session
  signin.success                INFO   AUDIT  User signed in successfully

Reauth (re-authentication session for protected proxy paths):

  signin.reauth                 ERROR         Failed to create reauth session
  signin.reauth                 ERROR         Unexpected reauth session response type
  signin.reauth                 INFO   AUDIT  Reauth session created during signin

LDAP password authentication:

  signin.ldap                   INFO   AUDIT  Attempting LDAP authentication
  signin.ldap                   ERROR         LDAP bind call failed
  signin.ldap                   DEBUG         LDAP bind successful, syncing user from directory
  signin.ldap                   WARN          Failed to sync user from directory
  signin.ldap                   WARN          User sync returned failure
  signin.ldap                   ERROR         Failed to get user from directory
  signin.ldap                   INFO          User not found in directory after sync
  signin.ldap                   INFO   AUDIT  Account is disabled
  signin.ldap                   INFO          Password expired - creating temporary session for password change
  signin.ldap                   ERROR         Failed to create password_expired session

MFA (multi-factor authentication flow):

  signin.mfa                    INFO   AUDIT  MFA required for user
  signin.mfa                    DEBUG         Validating MFA session
  signin.mfa                    ERROR         Session validation wait failed
  signin.mfa                    INFO          MFA session not valid
  signin.mfa                    DEBUG         MFA session validated successfully

MFA post-verification:

  signin.mfa                    DEBUG         MFA verified - retrieving pending session
  signin.mfa.session            ERROR         Failed to wait for MFA session validation
  signin.mfa.session            DEBUG         MFA session retrieved - creating authenticated session
  signin.mfa.signup             INFO          MFA verified for signup - redirecting to passkey registration
  signin.mfa.groups             WARN          Directory lookup failed after MFA - using cached groups from pending session
  signin.mfa.complete           DEBUG         Returning success response to client
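
The signin.mfa.groups warning above describes a fallback: when the directory lookup fails after MFA, the groups cached in the pending session are used instead. A minimal sketch of that pattern follows; DirectoryError, resolve_groups and the lookup callable are hypothetical names, not the gateway's API.

```python
class DirectoryError(Exception):
    """Hypothetical error raised when the directory lookup fails."""


def resolve_groups(lookup, username, pending_groups):
    """Return (groups, source), preferring fresh directory data.

    `lookup` is any callable that returns the user's groups or raises
    DirectoryError; `pending_groups` are the groups cached in the
    MFA-pending session.
    """
    try:
        return sorted(lookup(username)), "directory"
    except DirectoryError:
        # Corresponds to the WARN entry: fall back to the cached groups.
        return sorted(pending_groups), "pending_session"
```

Returning the source alongside the groups makes it easy to log which path was taken.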

MFA OTP resend:

  signin.mfa.resend             ERROR         Failed to generate OTP
  signin.mfa.resend             INFO          OTP code resent

MFA email OTP verification:

  signin.mfa.otp                ERROR         OTP validation call failed
  signin.mfa.otp                INFO   AUDIT  OTP validation failed
  signin.mfa.otp                WARN          OTP generation failed — user can resend from MFA page

MFA TOTP verification:

  signin.mfa.totp               ERROR         TOTP validation call failed
  signin.mfa.totp               INFO   AUDIT  TOTP and recovery code validation both failed
  signin.mfa.totp               INFO   AUDIT  TOTP validation failed - invalid code
  signin.mfa.totp               INFO   AUDIT  User authenticated via recovery code
  signin.mfa.totp               ERROR         Failed to check TOTP enrollment status

WebAuthn passkey authentication:

  signin.passkey.begin          DEBUG         Beginning passkey authentication
  signin.passkey.begin          ERROR         BeginAuthentication failed
  signin.passkey.begin          DEBUG         WebAuthn challenge created
  signin.passkey.finish         DEBUG         Finishing passkey authentication
  signin.passkey.finish         INFO          FinishAuthentication failed
  signin.passkey.finish         ERROR         Failed to get user from directory
  signin.passkey.finish         INFO          User not found in directory after passkey auth
  signin.passkey.finish         INFO          Account is disabled
  signin.passkey.finish         ERROR  AUDIT  E2OE: failed to persist Tier 1 ECDH state — channel will degrade to baseline

Kerberos SPNEGO authentication:

  signin.kerberos               DEBUG         Sending Negotiate challenge
  signin.kerberos               ERROR  AUDIT  SPNEGO validation call failed
  signin.kerberos               INFO   AUDIT  SPNEGO authentication failed
  signin.kerberos               ERROR  AUDIT  Failed to create session for SPNEGO user
  signin.kerberos               ERROR         Invalid session create response
  signin.kerberos               INFO   AUDIT  Kerberos SPNEGO authentication successful

Magic link passwordless authentication:

  signin.magiclink              ERROR  AUDIT  Initiate failed
  signin.magiclink.verify       INFO   AUDIT  Magic link verified
  signin.magiclink.verify       ERROR         Failed to finalize authentication

X.509 certificate authentication:

  signin.x509                   DEBUG         X.509 signin handler started
  signin.x509                   INFO          No client certificate provided
  signin.x509                   ERROR         Failed to validate certificate
  signin.x509                   INFO   AUDIT  Certificate revoked
  signin.x509                   INFO   AUDIT  Certificate expired
  signin.x509                   INFO          Certificate not yet valid
  signin.x509                   INFO          Certificate chain validation failed
  signin.x509                   ERROR         Certificate validation failed
  signin.x509                   INFO          Certificate validation failed
  signin.x509                   DEBUG         Capping session TTL to certificate validity
  signin.x509                   ERROR         Failed to create session
  signin.x509                   ERROR         Session creation timeout
  signin.x509                   ERROR         Invalid session response
  signin.x509                   INFO   AUDIT  X.509 authentication successful
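
The "Capping session TTL to certificate validity" entry implies that a session never outlives the client certificate that created it. A sketch of that calculation, assuming a configured default TTL and the certificate's notAfter timestamp as inputs (the function name capped_session_ttl and its parameters are hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def capped_session_ttl(
    default_ttl: timedelta,
    not_after: datetime,
    now: Optional[datetime] = None,
) -> timedelta:
    """Return the smaller of the default TTL and the certificate's remaining validity."""
    now = now or datetime.now(timezone.utc)
    remaining = not_after - now
    if remaining <= timedelta(0):
        # An expired certificate should have been rejected during validation.
        raise ValueError("certificate already expired")
    return min(default_ttl, remaining)
```

With a default TTL of 8 hours and a certificate expiring in 2 hours, the session TTL becomes 2 hours; a certificate valid for another 30 days leaves the default unchanged.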

Metrics

This service does not emit its own Prometheus metrics.

Observability is provided indirectly through dependent modules:

  - sessions: session_* metrics cover session creation, validation, and revocation
  - ldapauth: ldap_* metrics cover LDAP bind authentication
  - webauthn: webauthn_* metrics cover passkey authentication ceremonies
  - emailotp: otp_* metrics cover OTP generation and validation
  - totp: totp_* metrics cover TOTP validation
  - magiclink: magiclink_* metrics cover magic link initiation and verification
  - ratelimit: ratelimit_* metrics cover brute force protection on signin endpoints
  - directory: directory_* metrics cover user sync and lookup
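
Because sign-in observability is spread across these modules, it can help to group a metrics scrape by the prefixes listed above. The sketch below does this for text in the Prometheus exposition format; the prefix table is taken from the list above, while the function name group_metric_names is hypothetical and the concrete metric names each module exposes are not documented here.

```python
import re

# Module -> metric name prefix, as listed above.
PREFIXES = {
    "sessions": "session_",
    "ldapauth": "ldap_",
    "webauthn": "webauthn_",
    "emailotp": "otp_",
    "totp": "totp_",
    "magiclink": "magiclink_",
    "ratelimit": "ratelimit_",
    "directory": "directory_",
}


def group_metric_names(exposition_text: str) -> dict:
    """Group metric names from Prometheus text output by owning module."""
    grouped = {module: set() for module in PREFIXES}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        for module, prefix in PREFIXES.items():
            if name.startswith(prefix):
                grouped[module].add(name)
    return {module: sorted(names) for module, names in grouped.items()}
```

A module whose list comes back empty is either not emitting metrics or not being scraped, which is a useful first check when a sign-in path appears to have no telemetry.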