Authentication
Handles user authentication and session management — ten methods, MFA enforcement, cluster-wide sessions
Overview
Handles user authentication for all access paths — HTTP, SSH bastion, RADIUS, and API. Replaces separate identity providers, MFA systems, and session stores with one integrated layer. Applies to every session the gateway creates, regardless of protocol.
Supported primary authentication methods:
- passwd: LDAP password authentication with bind verification
- passkey: WebAuthn/FIDO2 passwordless authentication (Touch ID, YubiKey)
- x509: X.509 client certificate authentication (Subject DN mapping)
- oidc: OpenID Connect SSO via internal provider or external IdP (RP)
- magiclink: Email-based passwordless sign-in using RFC 8628 device code polling
- kerberos: SPNEGO/Kerberos ticket-based authentication (Active Directory)
Supported MFA methods (second factor):
- otp: Email-delivered one-time password via SMTP (per-device fingerprinting)
- totp: Time-based one-time password (RFC 6238, authenticator apps)
Additional modules:
- devicecode: RFC 8628 device authorization grant (bastion SSH, magic link infra)
- jit2fa: Just-in-time second factor enrollment and verification
- scim: SCIM 2.0 identity provider with multi-provider merge and webhooks
The signin service orchestrates authentication flows. It selects the primary method (configurable via service.signin.primary), falls through to secondary methods, enforces MFA requirements per method, and manages session creation with cluster-wide replication.
Architecture
Authentication flow (signin service orchestration):
- Client request arrives at /signin (or /api/signin for API clients)
- Method selection: primary method presented first, secondary methods available
- Credential verification dispatched to the appropriate auth module:
  - passwd: LDAP bind (no password storage)
  - passkey: WebAuthn challenge-response ceremony
  - x509: Certificate chain validation + Subject DN mapping
  - oidc: Authorization Code + PKCE exchange
  - magiclink: Device code creation + magic link email delivery
  - kerberos: SPNEGO token validation
- Identity lookup: username resolved to user record from directory
- Group resolution: group memberships fetched from directory
- Account status check: disabled/locked accounts rejected synchronously
- MFA gate (if require_mfa includes the method):
  a. Pre-authentication session created (limited, 5-minute TTL)
  b. MFA challenge presented (OTP email or TOTP authenticator)
  c. MFA code verified
  d. Pre-auth session revoked, new authenticated session created (rotation)
- Session creation: replicated to all nodes (cluster-wide quorum)
- Directory sync: user record synchronized cluster-wide
- Session cookie set, redirect to return_url or landing page
All auth modules are invoked cluster-wide, ensuring consistency and observability regardless of which node handles the request.
Session types:
- Authenticated: full access, configurable TTL (default 24h)
- MFA pending: limited capabilities, short TTL (default 5min)
- Password expired: forced password change, restricted access
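The MFA gate and the session types above can be illustrated with a minimal Python sketch. The names Session, start_session, and complete_mfa are hypothetical, the TTLs are the documented defaults rather than values read from configuration, and cluster replication and directory checks are left out.

```python
import secrets
from dataclasses import dataclass, field

# Documented defaults; the real values are configurable.
AUTHENTICATED_TTL = 24 * 60 * 60   # seconds (24h)
MFA_PENDING_TTL = 5 * 60           # seconds (5min)


@dataclass
class Session:
    username: str
    kind: str            # "authenticated" or "mfa_pending"
    ttl_seconds: int
    session_id: str = field(default_factory=lambda: secrets.token_urlsafe(32))
    revoked: bool = False


def start_session(username: str, method: str, require_mfa: list[str]) -> Session:
    """Create the session that follows successful primary authentication."""
    if method in require_mfa:
        # Limited pre-authentication session until the second factor is verified.
        return Session(username, "mfa_pending", MFA_PENDING_TTL)
    return Session(username, "authenticated", AUTHENTICATED_TTL)


def complete_mfa(pending: Session, code_ok: bool) -> Session:
    """Rotate a pre-auth session into a full session after MFA succeeds."""
    if pending.kind != "mfa_pending" or pending.revoked:
        raise ValueError("not an active MFA-pending session")
    if not code_ok:
        raise PermissionError("MFA verification failed")
    pending.revoked = True  # the pre-auth identifier is never reused
    return Session(pending.username, "authenticated", AUTHENTICATED_TTL)
```

Issuing a fresh session identifier on rotation, instead of upgrading the pre-auth session in place, means an identifier observed before MFA never carries full access.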
Configuration:
[service.signin]
primary = "passkey"               # Default authentication method
secondary = ["passwd", "x509"]    # Alternative methods shown on signin page
require_mfa = ["passwd"]          # Methods that require MFA after primary auth
mfa_methods = ["otp", "totp"]     # Available MFA methods for users
Relationships
Child modules (authentication.*):
- oidc: OIDC provider — SSO hub for proxy, bastion, external apps
- webauthn: FIDO2/WebAuthn — passwordless passkey authentication
- ldap: LDAP authentication backend — password bind verification
- x509: X.509 certificate auth — client cert to username mapping
- kerberos: SPNEGO/Kerberos — Active Directory ticket authentication
- otp: Email OTP — one-time codes via SMTP with device fingerprinting
- totp: TOTP — RFC 6238 authenticator app verification
- devicecode: RFC 8628 device authorization — bastion SSH, magic link infra
- magiclink: Email-based passwordless — magic link token generation/verification
- jit2fa: Just-in-time 2FA — enrollment and verification middleware
Upstream dependencies:
- directory: User lookup, group membership, account status (disabled/locked)
- sessions: Session creation (quorum), revocation, TTL management
- smtp: Email delivery for OTP codes and magic link emails
- firewall: Network-level access rules applied before auth endpoints
Downstream consumers:
- proxy: Proxy SSO via OIDC provider (dedicated internal client)
- bastion: SSH authentication via device authorization grant
- services: All HTTP services check session cookies for access control
- radius: RADIUS authentication for external NAS hardware (password, x509)
Cross-cutting:
- protection: Rate limiting on signin endpoints (JA4 fingerprint-based)
- cluster: All auth operations are cluster-wide for consistency
- notify: Authentication event notifications (webhooks, email alerts)
Device Code Authorization
Authenticates devices without a browser — CLI tools, IoT, and headless systems enter a code on another device
Overview
The device code module implements RFC 8628 (OAuth 2.0 Device Authorization Grant) for authenticating input-constrained devices such as smart TVs, CLI tools, IoT devices, and headless systems that lack a web browser.
Core capabilities:
- Full RFC 8628 compliance (Sections 3.1 through 3.5, 6.1)
- BASE20 user codes using consonants only (BCDFGHJKLMNPQRSTVWXZ) to avoid profanity in generated codes
- Configurable code length (default: 8 characters) and expiration TTL
- Constant-time comparison for user code validation (timing attack prevention)
- SHA-256 hashed cache keys to prevent code enumeration
- Optimistic locking with version-based concurrency control to prevent double-authorization race conditions in distributed environments
- Distributed code storage with cluster-wide replication and quorum consensus
- Automatic expiration with configurable TTL (default: 10 minutes)
- Single-use enforcement: codes cannot be reused after authorization or denial
- Directory integration for fresh user claims at authorization time
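Three of the capabilities listed above (BASE20 user codes, constant-time comparison, and SHA-256 hashed cache keys) can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's code: the function names and the "devicecode:" key prefix are hypothetical, and hmac.compare_digest stands in for the constant-time comparison.

```python
import hashlib
import hmac
import secrets

BASE20 = "BCDFGHJKLMNPQRSTVWXZ"  # consonants only, per RFC 8628 Section 6.1


def generate_user_code(length: int = 8) -> str:
    """Draw each character from BASE20 using a CSPRNG."""
    return "".join(secrets.choice(BASE20) for _ in range(length))


def normalize_user_code(raw: str) -> str:
    """User codes are case-insensitive on entry and stored uppercase."""
    return raw.strip().upper()


def is_valid_format(code: str, length: int = 8) -> bool:
    return len(code) == length and all(ch in BASE20 for ch in code)


def cache_key(code: str) -> str:
    """Hash the code so the storage key does not reveal it (prefix is hypothetical)."""
    return "devicecode:" + hashlib.sha256(code.encode("utf-8")).hexdigest()


def codes_match(submitted: str, stored: str) -> bool:
    """Constant-time comparison of two user codes."""
    return hmac.compare_digest(submitted.encode("utf-8"), stored.encode("utf-8"))
```

With the default length of 8, the code space is 20^8 (about 2.56 × 10^10) values.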
Flow summary:
1. Device requests authorization codes
2. Device displays short user_code to the user (e.g., "BCDFGHJK")
3. User visits verification URI on another device (phone or computer)
4. User enters user_code and authorizes or denies the device
5. Device polls token endpoint until authorized, denied, or expired
6. On authorization, device receives access token via OIDC token endpoint
The OIDC service handles the HTTP endpoints (/device page) and token endpoint with device_code grant type. The device code module provides the core logic; the OIDC service provides the HTTP transport.
Config
Device code behavior is configured under the OIDC authentication section:
[authentication.oidc]
device_code_ttl = "10m"              # Code expiration (default: 10 minutes)
device_code_interval = 5             # Minimum polling interval in seconds (default: 5)
device_code_user_code_length = 8     # User code character count (default: 8)
Code generation parameters:
- Device code: 40-digit cryptographically random token for client polling
- User code: 8-character BASE20 string (consonants only) for human entry
- Verification URI: auto-generated from server base URL + /device path
- VerificationURIComplete: includes pre-filled user_code query parameter
Polling behavior (per RFC 8628 Section 3.5):
- Clients must wait at least device_code_interval seconds between polls
- "slow_down" response instructs client to add 5 seconds to interval
- "authorization_pending" means user has not yet acted
- "expired_token" means device_code TTL has passed
Hot-reloadable: device_code_ttl, device_code_interval.
Cold (restart required): device_code_user_code_length.
The module auto-enables when OIDC is configured. No separate enable flag is needed. Magic link module also auto-enables device code when activated.
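The polling behavior described in this section can be sketched from the client side. This is a minimal Python sketch, assuming a hypothetical request_token callable that performs one token request and returns the decoded JSON body; the injected sleep function exists only to make the loop testable.

```python
import time
from typing import Callable, Optional


def poll_for_token(
    request_token: Callable[[], dict],
    interval: float = 5.0,
    expires_in: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the device is authorized, denied, or the code expires."""
    waited = 0.0
    while waited < expires_in:
        sleep(interval)
        waited += interval
        body = request_token()
        error = body.get("error")
        if error is None:
            return body                      # success: token response
        if error == "authorization_pending":
            continue                         # user has not acted yet
        if error == "slow_down":
            interval += 5                    # RFC 8628 Section 3.5: add 5 seconds
            continue
        if error in ("access_denied", "expired_token"):
            return None                      # terminal: denied or expired
        raise RuntimeError(f"unexpected token endpoint error: {error}")
    return None
```

A client that keeps the increased interval after each slow_down, as this loop does, should stop receiving that response.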
Troubleshooting
Common symptoms and diagnostic steps:
User code not accepted at verification page:
- Verify code format: must be exactly 8 uppercase consonants (BASE20 charset)
- Check expiration: codes expire after device_code_ttl (default: 10 minutes)
- Check single-use: codes cannot be reused after authorization or denial
- Verify AlreadyHandled flag: VerifyUserCode returns AlreadyHandled=true if the code was already authorized or denied
- Case sensitivity: user codes are case-insensitive but stored uppercase
Device polling returns “expired_token” too quickly:
- Check device_code_ttl configuration (default: 10m)
- Verify cluster time synchronization (NTP) across nodes
- Check if code was created with custom TTL override via AdditionalData
Device polling returns “slow_down” repeatedly:
- Client must increase polling interval by 5 seconds on each slow_down
- Minimum interval: device_code_interval (default: 5 seconds)
- Verify client implements backoff correctly per RFC 8628 Section 3.5
“authorization_pending” never resolves:
- Verify user visited the correct verification URI
- Check that user entered the correct user_code
- Verify OIDC service handlers are registered and accessible
- Check network connectivity to the verification endpoint
- Confirm user completed the full authorization flow (not just code entry)
Race condition or double authorization:
- Optimistic locking detects concurrent modifications via version counter
- Post-broadcast verification rejects stale version authorization attempts
- Check structured logs for "version mismatch" warnings
- Multiple users entering same code: statistically improbable with BASE20x8
Token exchange fails after authorization:
- Verify OIDC token endpoint is configured and accessible
- Check client_id matches between authorization and token request
- Verify scope is valid for the OIDC provider configuration
- Check directory module health (user claims fetched at authorization time)
Codes not replicating across cluster nodes:
- Check cluster health and quorum status
- Verify memory storage module is healthy
- Check cluster connectivity between nodes
- Codes use distributed storage with quorum; partial cluster may cause issues
Diagnostic commands:
- auth devicecodes: list active device code authorization flows
- auth status: check authentication system overview
- health components: verify device code subsystem health
Security
Security features and hardening measures:
BASE20 charset (RFC 8628 Section 6.1):
User codes use only consonants (BCDFGHJKLMNPQRSTVWXZ) to prevent profanity in randomly generated codes. This is an explicit RFC recommendation.
Constant-time comparison:
User code validation uses crypto/subtle.ConstantTimeCompare to prevent timing side-channel attacks that could leak valid codes. This follows RFC 8628 Section 5.2 security recommendations.
SHA-256 hashed storage keys:
Cache keys for device codes are SHA-256 hashed to prevent enumeration attacks. Even with access to the storage layer, codes cannot be extracted from their hash keys.
Optimistic locking (distributed race prevention):
- Each authorization increments a version counter
- Post-broadcast verification detects concurrent modifications
- Rejects authorization if version mismatch detected
- Prevents double-authorization in multi-node clusters
- Critical for environments where multiple users may attempt simultaneous auth
Single-use enforcement:
Once a code is authorized or denied, it cannot be reused. The AlreadyHandled flag prevents replay attacks on consumed codes.
Directory re-validation:
CompleteAuthorization fetches the latest user data from the directory module rather than relying on stale session data. This ensures:
- Disabled users cannot complete device authorization
- Group memberships reflect current state (security-critical)
- ID tokens contain fresh, authoritative user claims
- Graceful fallback to session metadata if directory is temporarily unavailable
Automatic expiration:
Codes expire after configurable TTL (default: 10 minutes). Expired codes are automatically cleaned up from distributed storage.
Fuzz testing coverage:
- FuzzUserCodeValidation: injection attack resistance
- FuzzUserCodeConstantTimeComparison: timing attack verification
- FuzzDeviceCodeGeneration: cryptographic randomness quality
- FuzzOptimisticLockingVersionHandling: race condition prevention
- FuzzDeviceAuthorizationRequest: parameter handling validation
Relationships
Module dependencies and interactions:
- OIDC service: Primary consumer. OIDC token endpoint handles the device_code grant type. OIDC service provides HTTP handlers for the /device verification page. Token generation occurs after CompleteAuthorization.
- Magic link: Reuses device code infrastructure for its polling mechanism. Magic link auto-enables device code module when activated.
- Directory: Canonical source for user attributes at authorization time. CompleteAuthorization fetches email, full_name, given_name, surname, and group memberships from directory. Graceful fallback to session metadata if directory is temporarily unavailable.
- Distributed memory cache: Cache for code storage. Codes replicated across cluster with quorum consensus. TTL-based automatic cleanup.
- Sessions: Session integration for authenticated user context during the verification flow.
- Client access: Server-side device code auth for hexonclient QUIC tunnels. Gateway generates device code, sends challenge to client over QUIC control stream, polls until authorized. Same pattern as bastion SSH.
- Bastion SSH: Server-side device code auth for SSH sessions. Gateway generates device code, displays QR in terminal.
- OIDC service: HTTP transport layer. Handles /device endpoint rendering, /oidc/device/authorize for code generation, and token endpoint for code exchange.
- config: Runtime configuration access for TTL, interval, and code length settings. Hot-reload supported for TTL and interval.
- telemetry: Structured logging for all device code operations including authorization attempts, completions, and expiration events.
Logs
Log entries by component. Search with: logs search “devicecode”
Levels: ERROR > WARN > INFO > DEBUG > TRACE.
Init (module startup):
devicecode.init       INFO   Device Code authorization disabled in config
devicecode.init       INFO   Device Code authorization (RFC 8628) initialized
Authorize (code generation, RFC 8628 Section 3.1-3.2):
devicecode.authorize  ERROR  Failed to generate device code
devicecode.authorize  ERROR  Failed to generate user code
devicecode.authorize  ERROR  Failed to store device code
devicecode.authorize  ERROR  Failed to achieve quorum for device code storage
devicecode.authorize  WARN   Failed to store user code reverse lookup
devicecode.authorize  INFO   Device authorization codes generated
Verify (user code validation):
devicecode.verify     INFO   Invalid user code format (not BASE20)
Complete (user authorization/denial):
devicecode.complete   INFO   Device code already handled
devicecode.complete   ERROR  Failed to generate tokens for device authorization
devicecode.complete   ERROR  Failed to get token response
devicecode.complete   ERROR  Invalid token response type
devicecode.complete   INFO   Generated tokens for device authorization
devicecode.complete   ERROR  Failed to broadcast authorization update
devicecode.complete   ERROR  Failed to achieve quorum for authorization
devicecode.complete   WARN   Concurrent modification detected (version mismatch)
devicecode.complete   INFO   Device authorization completed
Poll (device code polling, RFC 8628 Section 3.4-3.5):
devicecode.poll       WARN   Failed to lookup device code
devicecode.poll       WARN   Client ID mismatch
devicecode.poll       DEBUG  Client polling too fast
devicecode.poll       WARN   Failed to replicate LastPoll update across cluster
devicecode.poll       WARN   Failed to initiate LastPoll broadcast
devicecode.poll       INFO   Device authorization denied by user
devicecode.poll       INFO   Device authorization granted
Metrics
Prometheus metrics. Query with: metrics prometheus devicecode_<name>
Codes:
devicecode_codes_issued_total    counter  {client_id}  Device codes generated
Authorization:
devicecode_authorizations_total  counter  {result}     Authorization decisions
  result=authorized   User approved device
  result=denied       User denied device
Polling:
devicecode_polls_total           counter  {status}     Poll requests by outcome
  status=pending      Awaiting user action
  status=authorized   User authorized
  status=denied       User denied
  status=slow_down    Client polling too fast
  status=expired      Code expired (not instrumented; returns early)
Alerts:
rate(devicecode_authorizations_total{result="denied"}[5m]) > 10    High denial rate
rate(devicecode_polls_total{status="slow_down"}[5m]) > 50          Clients ignoring poll interval
Just-In-Time Two-Factor Authentication
Transparent OTP-based 2FA for legacy applications via login interception and credential replay
Overview
JIT-2FA adds two-factor authentication to legacy web applications without any backend modifications. It operates as a transparent middleware layer within a proxy mapping, intercepting form-based login submissions and gating access with email-based OTP verification.
Core capabilities:
- Transparent login interception: intercepts POST submissions to configurable login paths
- Webhook credential validation: validates username/password via external HTTP webhook
- Email OTP challenge: sends one-time password to user email extracted from webhook response
- Credential replay: after OTP success, replays the original POST request to the backend
- Auth header mode: alternative to replay, injects X-Hexon-* headers for proxy-aware backends
- Asymmetric encryption: NaCl box (X25519 + XSalsa20 + Poly1305) for credential storage
- Split-knowledge security: server holds ciphertext, client holds private key in HttpOnly cookie
- Secure memory handling: plaintext credentials zeroed immediately after encryption
- OTP resend without re-encryption: same ciphertext and cookie reused across resends
- Session-based access: authenticated sessions bypass 2FA for subsequent requests
- Double logout: destroys both JIT-2FA session and forwards logout to backend
Two trust models controlled by inject_credentials config option:
Credential Replay Mode (inject_credentials = true, default):
For legacy apps with no proxy-auth support. The full NaCl encryption pipeline encrypts the login POST body, stores ciphertext in session, then decrypts and replays the original request after OTP verification.
Flow: Login POST -> Encrypt body -> Store ciphertext -> OTP -> Decrypt -> Replay POST -> Backend
Auth Header Mode (inject_credentials = false):
For apps supporting trusted reverse proxy authentication (Grafana, GitLab, Gitea, Jenkins, etc.). Eliminates the encryption pipeline entirely. After OTP success, redirects user to login URL. The proxy layer injects auth headers (X-Hexon-User, X-Hexon-Mail, etc.) on every authenticated request.
Flow: Login POST -> Webhook validate -> OTP -> Redirect 302 -> Auth headers injected -> Backend
Requires add_auth_headers = true on the parent proxy mapping.
Request flow:
1. Request arrives at proxy mapping with JIT-2FA enabled
2. Logout path check: if match, destroy session and forward to backend
3. Login path POST check: if match, extract credentials and call webhook
4. Webhook success with email: encrypt body (replay mode) or store username (header mode)
5. Send OTP email and render verification page
6. User submits OTP: verify code, decrypt and replay POST (or redirect with headers)
7. Authenticated session established for subsequent requests
8. Non-login requests: check session validity, forward if authenticated or redirect to login
Config
JIT-2FA is configured per proxy mapping under [proxy.mapping.jit2fa]:
[proxy.mapping.jit2fa]
enabled = true                    # Enable JIT-2FA for this mapping
login_url = "/login"              # Redirect target for unauthenticated users
login_path_regex = "^/login$"     # Regex matching login POST endpoint
logout_path_regex = "^/logout$"   # Regex matching logout endpoint
username_field = "username"       # Form field name for username extraction
password_field = "password"       # Form field name for password extraction
inject_credentials = true         # true = credential replay, false = auth header mode
Webhook configuration under [proxy.mapping.jit2fa.webhook]:
[proxy.mapping.jit2fa.webhook]
url = "https://api.internal/validate"   # Webhook endpoint URL
method = "GET"                          # HTTP method (GET or POST)
timeout = "5s"                          # Webhook response timeout (default: 5s)
success_field = "$.status"              # JSONPath to success indicator in response
success_value = "ok"                    # Expected value at success_field
extract_email = "$.email"               # JSONPath to user email for OTP delivery
Optional HTTP transport tuning (defaults aligned with proxy connection pool):
max_idle_conns = 50             # Total idle connections (default: 50)
max_idle_conns_per_host = 20    # Idle connections per host (default: 20)
force_attempt_http2 = true      # Force HTTP/2 (default: true)
disable_compression = true      # Disable compression (default: true)
write_buffer_size = 32768       # Write buffer bytes (default: 32768)
read_buffer_size = 32768        # Read buffer bytes (default: 32768)
dial_timeout = "30s"            # TCP dial timeout (default: 30s)
keep_alive = "30s"              # TCP keepalive interval (default: 30s)
OTP configuration under [proxy.mapping.jit2fa.otp]:
[proxy.mapping.jit2fa.otp]
type = "numeric"     # OTP type: "numeric" or "base20" (default: global)
length = 6           # OTP digit count
valid = "5m"         # OTP validity duration
max_retries = 3      # Maximum OTP entry attempts (default: global)
resend_time = 30     # Seconds before resend allowed (default: global)
When using auth header mode (inject_credentials = false), the parent proxy mapping must also set add_auth_headers = true to inject X-Hexon-User, X-Hexon-Mail, and other identity headers on authenticated requests.
All OTP settings fall back to global email OTP defaults when not specified per mapping.
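To make the webhook settings earlier in this section concrete (success_field, success_value, extract_email), here is a minimal Python sketch of how a webhook response might be evaluated. It is an assumption-laden illustration: resolve_path and evaluate_webhook are hypothetical names, and the resolver handles only simple dotted paths such as "$.status", not full JSONPath.

```python
import json
from typing import Any, Optional


def resolve_path(document: Any, path: str) -> Any:
    """Resolve a simple '$.a.b' path; returns None when a segment is missing.

    Only dotted member access is handled here; full JSONPath is out of scope.
    """
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path!r}")
    current = document
    for segment in path[2:].split("."):
        if not isinstance(current, dict) or segment not in current:
            return None
        current = current[segment]
    return current


def evaluate_webhook(body: str, success_field: str, success_value: str,
                     extract_email: str) -> Optional[str]:
    """Return the user's email when the webhook reports success, else None."""
    try:
        document = json.loads(body)
    except json.JSONDecodeError:
        return None
    if resolve_path(document, success_field) != success_value:
        return None
    email = resolve_path(document, extract_email)
    return email if isinstance(email, str) and email else None
```

With the sample configuration, a response of {"status": "ok", "email": "alice@example.com"} yields the address used for OTP delivery, and any other status value yields None.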
Token Handoff (optional sub-feature for mobile/SPA/CLI clients):
Add an optional [proxy.mapping.jit2fa.token_handoff] block to expose a bearer-token handoff flow for callable clients. Native mobile apps, SPAs, desktop tools, and CLIs go through the JIT-2FA login + OTP pipeline and receive a signed bearer token at a caller-registered return URL. Subsequent API calls authenticate with the token via a top-of-tree bearer check that injects identity headers and forwards to the backend without running the rest of the middleware chain.
Two entry paths produce the same token handoff flow — callers pick whichever fits their client architecture:
1. GET /_jit2fa/authorize?return_url=...&dpop_jkt=...
   Gateway-owned URL. Caller opens it in the system browser (ASWebAuthenticationSession / Custom Tabs on mobile, window.location in SPAs, plain GET in CLI tools with a loopback callback). Used when the client cannot submit credentials inline, e.g. native apps that delegate the login UI to a system browser sheet.
2. POST to the mapping's login_path_regex with form fields:
   - the username and password fields configured on the mapping (username_field / password_field, whatever the backend's own login form expects)
   - plus _jit2fa_return_url (required to trigger the handoff)
   - plus optional _jit2fa_dpop_jkt (required when require_dpop=true)
   Used when the client HAS the login form in its own UI: browser-based SPAs with a built-in login page, test harnesses, etc. Eliminates the bounce through the GET entry path and keeps credentials in a single form submission.
Both paths end up at the same post-OTP mint step; the only difference is how the return_url (and optional dpop_jkt) are carried into the flow.
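For the GET entry path, a client has to URL-encode return_url (and dpop_jkt, when used) into the query string. A minimal Python sketch, assuming a hypothetical gateway host gw.example.com and a hypothetical helper name build_authorize_url:

```python
from urllib.parse import urlencode


def build_authorize_url(base_url: str, return_url: str,
                        dpop_jkt: str = "",
                        entry_path: str = "/_jit2fa/authorize") -> str:
    """Build the GET entry URL for the token handoff flow."""
    params = {"return_url": return_url}
    if dpop_jkt:
        params["dpop_jkt"] = dpop_jkt   # needed when require_dpop = true
    return f"{base_url.rstrip('/')}{entry_path}?{urlencode(params)}"


if __name__ == "__main__":
    # Example: a CLI tool with a loopback callback (port chosen arbitrarily).
    print(build_authorize_url("https://gw.example.com", "http://127.0.0.1:8765/cb"))
```

Encoding the return URL matters because it contains ":" and "/" characters that would otherwise be ambiguous inside a query string.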
[proxy.mapping.jit2fa.token_handoff]
enabled = true
# Path on the mapping where callers start the flow. Must begin with
# /_jit2fa/ (reserved prefix that guarantees no collision with backend
# URLs). Default: /_jit2fa/authorize.
entry_path = "/_jit2fa/authorize"
# Whitelist of caller return URL patterns. Glob-style: "*" matches any
# sequence of characters (including slashes, dots, colons), everything
# else is literal, the pattern is anchored on both ends.
allowed_return_urls = [
    "com.example.mobile://*",                   # native iOS/Android app
    "https://app.example.com/auth/callback",    # SPA callback
    "http://127.0.0.1:*/cb",                    # CLI tool loopback
]
# Access token lifetime (1m–24h, default 12h). Short-lived by
# design: when the access token expires the client either uses a
# refresh token (if enabled) or re-authenticates.
access_token_ttl = "12h"
# Audience (aud) claim baked into minted tokens. Required. Callers
# validate this on receipt to make sure the token was intended for them.
audience = "myapp.mobile"
# Accept minted bearer tokens on subsequent requests to this mapping.
# Default true. Set to false if the mapping should only issue tokens
# (one-way handoff, e.g. the backend accepts the tokens itself via its
# own bearer check).
accept_bearer = true
# Require DPoP (RFC 9449) proof-of-possession binding. Default false
# (opportunistic mode: callers may supply dpop_jkt and get bound
# tokens, non-DPoP flows still work). Set to true to enforce: every
# entry GET MUST include a dpop_jkt query parameter, and every
# bearer-authenticated request MUST include a DPoP header whose proof key
# thumbprint matches the token's cnf.jkt. See "DPoP (RFC 9449)"
# section below for client-side implementation guidance.
require_dpop = true
# Refresh token / max session lifetime (1h–90d). Requires
# require_dpop=true: refresh without DPoP key binding is rejected
# at config validation because a stolen refresh token without PoP
# would grant indefinite access. The refresh token is bound to the
# SAME DPoP key as the access token (RFC 9449 section 5 strict binding).
#
# When set: the fragment delivery includes refresh_token alongside
# access_token. The client calls POST /_jit2fa/refresh with the
# refresh token + DPoP proof to get a new access token. On refresh
# token expiry the client MUST re-authenticate (full login + OTP).
#
# When empty or "0": no refresh token issued. Client re-authenticates
# when the access token expires.
refresh_token_ttl = "30d"
Writing allowed_return_urls:
The allowed_return_urls list is the ONLY protection against open-redirect attacks in the token handoff flow. Operators are responsible for writing patterns conservatively. The gateway enforces the patterns exactly as written — it does not second-guess them.
DO write exact URLs when possible:
- "com.example.mobile://auth"
- "https://app.example.com/auth/callback"
DO use wildcards for legitimate dynamic portions:
- "com.example.mobile://*"                 any path on a scheme you own
- "https://*.example.com/auth/callback"    subdomains of a domain you own
- "http://127.0.0.1:*/cb"                  loopback ephemeral port for CLI
DO NOT write open-redirect patterns:
- "*"                          matches literally anything
- "https://*/*"                matches any HTTPS URL
- "https://*.com/callback"     matches attacker-owned subdomains
- "*://example.com/callback"   allows arbitrary schemes
A single badly-written pattern can turn your token handoff flow into a credential exfiltration vector. Review patterns against your actual mobile apps, SPAs, and CLI tools; reject any pattern you cannot justify.
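The matching rule for allowed_return_urls ("*" matches any sequence of characters, everything else is literal, anchored on both ends) can be tried out with a short Python sketch. This is one reading of the documented rule, not the gateway's matcher; the function names are hypothetical.

```python
import re


def compile_return_url_pattern(pattern: str) -> re.Pattern:
    """Translate a glob where only '*' is special into a regex."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    # '*' matches any sequence of characters, including '/', '.', and ':'.
    return re.compile(".*".join(literal_parts), re.DOTALL)


def is_allowed_return_url(url: str, allowed_patterns: list[str]) -> bool:
    # fullmatch anchors the pattern at both ends of the URL.
    return any(
        compile_return_url_pattern(p).fullmatch(url) is not None
        for p in allowed_patterns
    )
```

Running candidate patterns through a helper like this against hostile URLs is a quick way to review them. Under this literal reading the wildcard also spans characters such as "@", so a pattern like "http://127.0.0.1:*/cb" would accept more than a port number.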
DPoP (RFC 9449) proof-of-possession:
Enabling require_dpop = true on the mapping turns every minted bearer token into a key-bound token: a stolen token without the matching private key cannot be replayed against the mapping. This is the primary mitigation for the URL-fragment-delivery threat model — the token is briefly visible in the browser address bar and in devtools, but without the private key it is useless.
When require_dpop = true:
- Every entry GET MUST carry a dpop_jkt query parameter (RFC 9449 §10): the base64url SHA-256 thumbprint of the caller's public JWK. The gateway validates the charset and length (exactly 43 chars, base64url alphabet), stashes it alongside the return_url in a sibling cookie, and binds the minted token via the cnf.jkt confirmation claim.
- Every bearer-authenticated request to the mapping MUST include a DPoP header (RFC 9449 §4) carrying a proof JWT signed with the bound private key. The gateway validates the proof (signature, htm, htu, iat, jti replay) and checks that its JWK thumbprint matches the token's cnf.jkt before forwarding to the backend.
When require_dpop = false (the default), the flow is opportunistic: clients that provide dpop_jkt get bound tokens, clients that don't get non-DPoP tokens and continue to work. This lets operators roll out DPoP gradually: watch the metric jit2fa_handoff_bearer_checks_total{result=accepted,reason=""} for DPoP adoption, then flip require_dpop to true once metrics show 100%.
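Two of the checks described above are simple enough to sketch in Python: the well-formedness test on dpop_jkt (43 characters from the base64url alphabet) and the comparison of a proof's key thumbprint against the token's cnf.jkt claim. Function names are hypothetical, and proof signature validation (htm, htu, iat, jti) is deliberately omitted.

```python
import hmac
import re

# 43 characters from the base64url alphabet: an unpadded SHA-256 digest.
_JKT_RE = re.compile(r"[A-Za-z0-9_-]{43}")


def is_wellformed_jkt(value: str) -> bool:
    """Check charset and length only; says nothing about key ownership."""
    return _JKT_RE.fullmatch(value) is not None


def proof_matches_token(token_claims: dict, proof_thumbprint: str) -> bool:
    """Compare the proof key thumbprint with the token's cnf.jkt claim.

    This helper only answers the binding question. How tokens without a
    cnf.jkt claim are treated depends on require_dpop and is not modeled here.
    """
    bound = token_claims.get("cnf", {}).get("jkt", "")
    if not bound:
        return False
    return hmac.compare_digest(bound.encode(), proof_thumbprint.encode())
```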
Client-side implementation:
- Native mobile apps: generate an ECDSA P-256 keypair via the platform keystore (iOS Keychain / Android Keystore), pin the private key to the device (hardware-backed where available), compute the JWK thumbprint, and sign a DPoP proof JWT on every API call. Proofs have htm/htu/iat/jti fields; jti must be unique per proof (UUID is fine).
- Browser SPAs: generate via crypto.subtle.generateKey({name:"ECDSA", namedCurve:"P-256"}) with extractable=false on the private key, store in IndexedDB (CryptoKey is structured-clone-serializable so the key persists across page navigations without ever touching its bytes), compute the thumbprint with crypto.subtle.digest. A working example lives in recipes/ges-html/{test,callback}.html.
- CLI tools: generate via the host OS keystore (Secret Service on Linux, Keychain on macOS, Windows Credential Manager). Never write the private key to a plaintext file; that defeats the whole threat model.
Thumbprint format: RFC 7638 §3.1. For EC keys, canonical JSON of {crv, kty, x, y} with lex-sorted keys and no whitespace, SHA-256 digest, base64url-encoded without padding. The result is exactly 43 ASCII characters.
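The thumbprint format described above can be reproduced with the Python standard library alone. A minimal sketch: ec_jwk_thumbprint is a hypothetical helper name, and the x and y values in the usage example are illustrative sample coordinates only.

```python
import base64
import hashlib
import json


def ec_jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint of an EC public JWK (unpadded base64url SHA-256)."""
    # Only the required members take part; extras such as "kid" are ignored.
    required = {name: jwk[name] for name in ("crv", "kty", "x", "y")}
    # Lexicographically sorted keys, no whitespace.
    canonical = json.dumps(required, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    sample = {
        "kty": "EC",
        "crv": "P-256",
        "x": "MKBCTNIcKUSDii11ySs3526iDZ8AiTo7Tu6KPAqv7D4",
        "y": "4Etl6SRW2YiLUrN5vfvVHuhp7x8PxltmWWlbbM4IFyM",
        "kid": "not part of the thumbprint",
    }
    print(ec_jwk_thumbprint(sample))
```

A 32-byte digest encodes to 44 base64 characters including one "=" of padding, which is why the unpadded result is always 43 characters long.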
Common DPoP failure modes and how to diagnose:
HTTP 400 from /_jit2fa/authorize with "dpop_jkt query parameter is required":
  → require_dpop = true but client did not append dpop_jkt.
HTTP 400 from /_jit2fa/authorize with "dpop_jkt is not a well-formed base64url SHA-256 thumbprint":
  → Length != 43, or charset contains non-base64url chars. Check the thumbprint computation: the JOSE base64url encoding must be padding-free.
HTTP 401 from /api/* with DPoP challenge and "DPoP proof header required for this token":
  → Token carries cnf.jkt but the request has no DPoP header. Client has a bound token but is not signing proofs.
HTTP 401 with "DPoP proof does not match token binding":
  → The proof validates but its key thumbprint does not match the token's cnf.jkt. This is "stolen token + forged proof" from the gateway's perspective and is logged at Warn with both thumbprints for incident review. From the client's perspective: check that you are signing proofs with the same keypair that was used to compute dpop_jkt at entry time. The most common bug is regenerating the keypair on every page load.
HTTP 401 with "DPoP proof is not valid":
  → Signature failure, iat outside replay window, jti already seen, or malformed proof. Check the oidc module's dpop_validation_total metric to see which.
Token refresh endpoint (/_jit2fa/refresh):
When refresh_token_ttl is configured (requires require_dpop=true), the gateway issues a refresh token alongside the access token in the URL fragment. Both tokens have the same short TTL (access_token_ttl, e.g. 1h). The client calls the refresh endpoint before expiry to get a new pair.
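The checks this endpoint applies to a refresh token (DPoP binding, a non-zero auth_time, and the absolute session lifetime measured as now - auth_time against refresh_token_ttl) can be sketched in Python. This is an illustration only: check_refresh_allowed is a hypothetical name, the order of the checks is an assumption, and signature, exp, and audience validation are taken as already done.

```python
import time
from typing import Optional


def check_refresh_allowed(claims: dict, refresh_token_ttl: int,
                          proof_thumbprint: str,
                          now: Optional[int] = None) -> str:
    """Return "ok" or the OAuth error code a refresh request would receive.

    refresh_token_ttl is in seconds; claims are the refresh JWT's claims.
    """
    now = int(time.time()) if now is None else now
    bound = claims.get("cnf", {}).get("jkt")
    if not bound:
        return "invalid_grant"            # refresh token is not DPoP-bound
    if proof_thumbprint != bound:
        return "invalid_dpop_proof"       # proof signed with a different key
    auth_time = claims.get("auth_time", 0)
    if not auth_time:
        return "invalid_grant"            # missing or zero auth_time
    if now - auth_time > refresh_token_ttl:
        return "invalid_grant"            # absolute session lifetime exceeded
    return "ok"
```

Because rotated refresh tokens inherit the original auth_time, this comparison bounds the whole session no matter how many times the client refreshes.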
Request: POST /_jit2fa/refresh Content-Type: application/x-www-form-urlencoded DPoP: <proof-jwt bound to POST https://host/_jit2fa/refresh> refresh_token=<refresh-jwt> Success response (HTTP 200): { "access_token": "<new-jwt>", "id_token": "<same-jwt>", "token_type": "DPoP", "expires_in": 3600, "refresh_token": "<rotated-jwt>", "scope": "openid email profile groups" } Note: id_token is the same value as access_token (the access token IS an ID token). Included per OIDC Core Section 12.2. Standard OIDC client libraries use it to update user profile claims. Error responses use standard OAuth error codes (RFC 6749 Section 5.2): HTTP 400 {"error":"invalid_request"} — missing refresh_token or form parse error HTTP 401 {"error":"invalid_grant"} — token expired, invalid, wrong audience, not DPoP-bound, missing auth_time, max session exceeded HTTP 401 {"error":"invalid_dpop_proof"} — DPoP proof missing, invalid, or key mismatch (RFC 9449 extension) HTTP 403 {"error":"invalid_request"} — refresh_token_ttl not configured HTTP 500 {"error":"server_error"} — access token mint failed All error responses include error_description with a human-readable reason. Standard client libraries (AppAuth, oidc-client-ts) parse the error code to decide: invalid_grant = re-authenticate, server_error = retry, invalid_dpop_proof = fix the DPoP proof. Token rotation: every refresh call returns a NEW access token + a NEW refresh token. Both get TTL = access_token_ttl. The rotated refresh token inherits the original auth_time claim so the absolute session lifetime is preserved through every rotation. DPoP binding: the DPoP proof on the refresh request MUST be signed with the SAME key that was used at the original dpop_jkt entry (RFC 9449 section 5 strict binding). The gateway checks: proof.thumbprint == token.cnf.jkt A different key = rejected. Key rotation = re-authenticate. 
Stateless design: refresh tokens are signed JWTs (not opaque strings), validated by signature + exp + audience suffix (":refresh"). No server-side storage. The auth_time claim is the session boundary.
Absolute session lifetime: enforced via auth_time on the refresh JWT. The handler checks: now - auth_time > refresh_token_ttl. When exceeded, returns HTTP 401 and the client must re-authenticate (full login + OTP). A refresh token with auth_time=0 (malformed or crafted) is also rejected to prevent bypassing this check.
Client stops refreshing: if the client doesn't refresh before the current refresh token expires (TTL = access_token_ttl), the JWT validator rejects it (exp passed) and the client must re-authenticate.
Bearer token 401 responses (what mobile apps see):
When a bearer-authenticated request fails, the gateway returns HTTP 401 with a WWW-Authenticate challenge header. The response format follows RFC 6750 (Bearer) and RFC 9449 (DPoP) so standard OAuth client libraries can branch on the scheme.
Token expired or invalid (non-DPoP):
  HTTP/1.1 401 Unauthorized
  WWW-Authenticate: Bearer realm="<audience>", error="invalid_token", error_description="token is not valid"
  Content-Type: text/plain

  token is not valid
Token expired or invalid (DPoP-bound):
  HTTP/1.1 401 Unauthorized
  WWW-Authenticate: DPoP algs="ES256 ES384 ES512 RS256 EdDSA", realm="<audience>", error="invalid_token", error_description="token is not valid"
  Content-Type: text/plain

  token is not valid
Missing DPoP proof on a DPoP-bound token:
  HTTP/1.1 401 Unauthorized
  WWW-Authenticate: DPoP algs="ES256 ES384 ES512 RS256 EdDSA", realm="<audience>", error="invalid_token", error_description="DPoP proof header required for this token"
DPoP proof thumbprint mismatch (possible theft):
  HTTP/1.1 401 Unauthorized
  WWW-Authenticate: DPoP ..., error_description="DPoP proof does not match token binding"
Audience mismatch (cross-mapping replay attempt):
  HTTP/1.1 401 Unauthorized
  WWW-Authenticate: Bearer realm="<audience>", error="invalid_token", error_description="token audience does not match this mapping"
Mobile app standard response to 401:
1. If refresh token available: POST /_jit2fa/refresh with DPoP proof
2. If refresh succeeds: retry the original request with the new access token
3. If refresh fails (401/403): re-run the full authentication flow
4. If no refresh token: re-run the full authentication flow
Known limitations:
- Server-side callers are not supported. Tokens are delivered in the URL fragment (#access_token=...) which is not sent to servers. If a future caller needs server-side delivery, a POST-based mode will be added in a follow-up release.
- When a mapping has both credential replay (inject_credentials = true) and token handoff enabled, browser users going through the regular login flow still get credential replay as today. Only callers who entered via the token handoff entry URL skip replay in favor of the bearer-token handoff. Both modes coexist cleanly.
Troubleshooting
Common symptoms and diagnostic steps:
User submits login but sees an error instead of OTP page:
- Webhook failure: check webhook URL reachability and response format
- JSONPath mismatch: verify success_field and success_value match the webhook response
- No email in response: extract_email JSONPath must resolve to a valid email address
- Webhook timeout: increase timeout if backend validation is slow (default 5s)
- Form field names wrong: username_field and password_field must match the HTML form
OTP email not received:
- Check SMTP configuration: 'smtp health' to verify email delivery system
- Email address extraction: webhook must return email at the configured JSONPath
- Rate limiting: protection module may throttle OTP requests
- Check email OTP module health: OTP generation depends on the emailotp service
OTP verification fails (invalid code):
- Expired OTP: default validity is 5 minutes, user may have waited too long
- Max retries exceeded: after max_retries (default 3), session is invalidated
- Wrong mapping context: DeviceID is mappingID:sessionID, must match original
- Clock skew: cluster nodes must have synchronized time for OTP validation
Credential replay fails after OTP success:
- Private key cookie missing: browser may have cleared cookies or cookie expired (5 min)
- Session expired: NATS session data has TTL, check if ciphertext still exists
- Decryption error: private key cookie must match the public key used for encryption
- Backend rejected replayed POST: CSRF token in original form may have expired
- Content-Type mismatch: replayed request preserves original Content-Type header
Auth header mode not working (inject_credentials = false):
- Missing add_auth_headers = true on parent proxy mapping configuration
- Backend not configured to trust X-Hexon-* headers
- Redirect loop: login_url must match the path the backend expects for login
- Session cookie not set: check browser cookie settings and SameSite policy
Session issues (user keeps getting redirected to login):
- Cookie blocked: Secure flag requires HTTPS, SameSite=Strict blocks cross-origin
- Session storage: verify NATS/JetStream connectivity for session persistence
- Multiple domains: session cookies are domain-scoped, check cookie domain setting
- Logout path regex matching too broadly: verify logout_path_regex specificity
Login path regex not matching:
- Regex syntax: login_path_regex uses Go regexp syntax (RE2)
- Path normalization: check if proxy rewrites the path before JIT-2FA sees it
- Method filter: only POST requests to login_path_regex trigger interception
- Case sensitivity: regex is case-sensitive by default
Performance and webhook diagnostics:
- Webhook latency: high timeout values block the user login flow
- Connection pooling: webhook HTTP transport shares pool settings with proxy
- Cluster-wide OTP tracking: retries tracked across all cluster nodes
Security
Cryptographic design and security properties:
Encryption model (credential replay mode):
NaCl box authenticated encryption using X25519 key agreement, XSalsa20 stream cipher, and Poly1305 message authentication. Fresh X25519 keypair generated per login attempt. Ciphertext includes 32-byte ephemeral public key and 16-byte authentication tag (48 bytes overhead total).
Split-knowledge architecture:
Server stores: encrypted body ciphertext and public key (cannot decrypt alone)
Client stores: private key in HttpOnly cookie (cannot access ciphertext alone)
Both halves required to recover plaintext credentials. Compromise of either storage in isolation reveals nothing about the original credentials.
Cookie security:
Private key cookie attributes: HttpOnly, Secure, SameSite=Strict, Max-Age=300
- HttpOnly: prevents JavaScript access to private key
- Secure: only transmitted over HTTPS connections
- SameSite=Strict: prevents CSRF-based cookie theft
- Max-Age=300: 5-minute window to complete OTP verification
Memory safety:
- Plaintext credentials zeroed immediately after encryption
- Private key zeroed on server side immediately after decryption
- Zeroing uses subtle.ConstantTimeCopy to prevent compiler optimization
- No plaintext credentials ever written to disk or session storage
OTP security:
- OTP hashed with bcrypt before storage (not stored in plaintext)
- Constant-time comparison prevents timing side-channel attacks
- Cluster-wide retry tracking prevents distributed brute-force attempts
- Rate limiting inherited from protection module
- DeviceID binding: OTP tied to specific mapping and session (prevents reuse)
Webhook security:
- Webhook URL should use HTTPS for credential transmission
- Webhook timeout prevents slow-loris style resource exhaustion
- Credentials sent to webhook only, never stored in plaintext on server
- JSONPath extraction validates response structure before proceeding
Auth header mode security:
- No credential storage or encryption needed (eliminates cryptographic attack surface)
- Backend must be configured to only trust headers from the gateway IP
- X-Hexon-* headers stripped from external requests by the proxy layer
- Session-based: authentication state maintained via secure session cookie
CSRF protection:
- Original form CSRF tokens preserved in encrypted body for replay
- OTP form uses separate anti-replay mechanism
- SameSite=Strict cookies prevent cross-origin request forgery
Relationships
Module dependencies and interactions:
- proxy: Parent module. JIT-2FA is configured per proxy mapping and runs as middleware in the proxy request pipeline. Auth header mode requires add_auth_headers = true on the mapping. Proxy handles X-Hexon-* header injection on authenticated requests.
- authentication.emailotp: Provides OTP generation, delivery, and verification. JIT-2FA delegates all OTP operations to emailotp using DeviceID format of mappingID:sessionID for cluster-wide tracking. OTP settings (type, length, validity, max_retries, resend_time) can be overridden per mapping or fall back to global emailotp defaults.
- smtp: Email delivery for OTP codes. SMTP health directly affects OTP delivery. Check smtp health when OTP emails are not received.
- sessions: Session storage via NATS/JetStream. Stores encrypted credentials (replay mode) or username/email (header mode). Session TTL governs how long authenticated state persists. Session destruction on logout.
- protection.ratelimit: Rate limiting for login attempts and OTP submissions. Prevents brute-force attacks on both webhook validation and OTP verification.
- identity.directory: User identity enrichment. In auth header mode, directory attributes populate X-Hexon-* headers (user, email, groups, display name).
- config: Per-mapping configuration under [proxy.mapping.jit2fa]. Webhook, OTP, and transport settings are all configurable. Changes require proxy mapping reload to take effect.
- protection.pow: Related but independent POST body preservation mechanism. PoW uses symmetric AES-256-GCM for short-lived form data during proof-of-work challenges. JIT-2FA uses asymmetric NaCl box for longer-lived credential storage during OTP verification. Both implement split-knowledge security but with different threat models and durations.
- telemetry: Structured logging for login interceptions, webhook calls, OTP events, encryption operations, and session lifecycle. Metrics for monitoring JIT-2FA health and usage patterns.
Logs
Log entries by operation. Search with: logs search "jit2fa"
Levels: ERROR > WARN > INFO > DEBUG.
Login Interception:
jit2fa.intercept         INFO   AUDIT  Login POST intercepted
jit2fa.parse_error       WARN          Failed to extract credentials from login form
jit2fa.credentials       INFO   AUDIT  Credentials extracted from login form
Webhook Validation:
jit2fa.validate_webhook  DEBUG         Validating credentials via webhook
jit2fa.webhook           INFO   AUDIT  Webhook validation successful / invalid credentials
jit2fa.webhook           ERROR  AUDIT  Webhook validation failed (HTTP error)
OTP:
jit2fa.otp               INFO   AUDIT  OTP sent successfully
jit2fa.otp               ERROR  AUDIT  Failed to send OTP
jit2fa.otp.verify        INFO   AUDIT  OTP verification successful / failed
jit2fa.resend            WARN   AUDIT  Failed to extend session expiry on resend
Session:
jit2fa.session           INFO   AUDIT  Authenticated session created (replay/header/two-phase/token_handoff)
jit2fa.redirect          INFO   AUDIT  No valid session, redirecting to login
jit2fa.logout            INFO   AUDIT  Logout intercepted, clearing session
Rate Limiting:
jit2fa.ratelimit.status  DEBUG         Rate limit check passed
jit2fa.ratelimit         WARN          Rate limit check failed (fail-open)
Token Handoff - Entry Path:
jit2fa.handoff.entry  INFO  AUDIT  Rejected: missing return_url query parameter
jit2fa.handoff.entry  WARN  AUDIT  Rejected: return_url not in allowed_return_urls
jit2fa.handoff.entry  INFO  AUDIT  Rejected: dpop_jkt malformed (charset or length)
jit2fa.handoff.entry  INFO  AUDIT  Rejected: require_dpop=true but caller did not supply dpop_jkt
jit2fa.handoff.entry  INFO  AUDIT  Valid URL, no session - redirecting to login (dpop_bound=true|false)
jit2fa.handoff.entry  INFO  AUDIT  Valid session - minting directly (fast path, dpop_bound=true|false)
Token Handoff - JKT Cookie:
jit2fa.handoff.jkt_cookie  WARN  AUDIT  Handoff JKT cookie failed revalidation (tampered or truncated)
Token Handoff - Mint Step:
jit2fa.handoff.mint  ERROR  AUDIT  Revalidation failed before mint (cookie tamper suspected)
jit2fa.handoff.mint  ERROR  AUDIT  Refusing to mint without username
jit2fa.handoff.mint  ERROR  AUDIT  require_dpop=true but no dpop_jkt reached finalize (caller bypassed entry)
jit2fa.handoff.mint  ERROR  AUDIT  return_url malformed after fragment strip (operator wildcard too permissive)
jit2fa.handoff.mint  ERROR  AUDIT  oidc.MintBearerToken call failed
jit2fa.handoff.mint  ERROR  AUDIT  oidc.MintBearerToken returned error
jit2fa.handoff.mint  INFO   AUDIT  Minted access token and redirecting caller (fields: username, audience, expires_in, dpop_bound, dpop_jkt?)
Token Handoff - Bearer Top-of-Tree Check:
jit2fa.handoff.bearer  INFO   AUDIT  Authorization header present but token is empty
jit2fa.handoff.bearer  ERROR  AUDIT  Validator call failed (oidc.ValidateIDToken hexdcall error)
jit2fa.handoff.bearer  WARN   AUDIT  Token rejected by validator (bad sig / expired / wrong issuer)
jit2fa.handoff.bearer  WARN   AUDIT  Audience mismatch (cross-mapping token replay attempt - alert signal)
jit2fa.handoff.bearer  INFO   AUDIT  require_dpop=true but token has no cnf.jkt (legacy client post-rollout)
jit2fa.handoff.bearer  INFO   AUDIT  DPoP-bound token but no DPoP header on request (client bug)
jit2fa.handoff.bearer  ERROR  AUDIT  oidc.ValidateDPoP hexdcall call failed
jit2fa.handoff.bearer  INFO   AUDIT  DPoP proof rejected by validator (stale iat / wrong htu / replayed jti)
jit2fa.handoff.bearer  WARN   AUDIT  DPoP proof thumbprint does not match token cnf.jkt - possible token theft
jit2fa.handoff.bearer  INFO   AUDIT  Accepted, forwarding to backend (fields: username, audience, dpop_bound, dpop_jkt?)
Token Handoff - DPoP Proof Validation:
jit2fa.handoff.bearer.dpop  INFO  AUDIT  DPoP proof validated, thumbprint matches token cnf.jkt (fields: username, dpop_jkt, htm, htu; one line per bearer-authenticated API call on a DPoP-bound mapping)
Token Handoff - Refresh:
jit2fa.handoff.refresh  INFO   AUDIT  Missing refresh_token parameter
jit2fa.handoff.refresh  INFO   AUDIT  Token rejected by validator (expired or invalid)
jit2fa.handoff.refresh  INFO   AUDIT  Audience mismatch (not a refresh token for this mapping)
jit2fa.handoff.refresh  INFO   AUDIT  Token not DPoP-bound
jit2fa.handoff.refresh  INFO   AUDIT  Missing DPoP proof header
jit2fa.handoff.refresh  INFO   AUDIT  DPoP proof rejected by validator
jit2fa.handoff.refresh  WARN   AUDIT  DPoP thumbprint mismatch - different key (abuse signal)
jit2fa.handoff.refresh  INFO   AUDIT  Token has no valid auth_time (cannot enforce session lifetime)
jit2fa.handoff.refresh  INFO   AUDIT  Absolute session lifetime exceeded (now - auth_time > refresh_token_ttl)
jit2fa.handoff.refresh  ERROR         ValidateIDToken call failed (hexdcall error)
jit2fa.handoff.refresh  ERROR         DPoP proof validation call failed (hexdcall error)
jit2fa.handoff.refresh  ERROR         Failed to mint new access token
jit2fa.handoff.refresh  WARN          Failed to mint rotated refresh token (returning access only)
jit2fa.handoff.refresh  INFO   AUDIT  Minted new token pair (success) (fields: username, audience, access_expires_in, session_remaining_hours, dpop_jkt)
Log level policy:
- INFO+AUDIT for routine rejections caused by malformed client input (missing params, stale proofs, client-side bugs, rollout friction). These land in the audit stream for trace reconstruction but do not trigger operator alerts.
- WARN+AUDIT only for events that indicate abuse or attack: open-redirect whitelist probing, signature forgery, cross-mapping replay attempts, DPoP thumbprint mismatches. Alert on these.
- ERROR+AUDIT for internal system errors (hexdcall failures, signing key missing, cookie tamper on revalidation) that need operator investigation regardless of attack status.
The bearer "accepted" path fires per request on DPoP-bound mappings. On high-throughput SPAs hitting the backend at 50 rps, this can generate 50 audit lines per second per user per mapping. Filter at the log sink by event name + result if volume is a problem: losing the accepted-path record at the emit site is a security regression, so the event is always emitted.
Full per-user audit trace pattern (grep):
mapping_id=<ID> AND username=<user> AND event in {jit2fa.handoff.entry, jit2fa.handoff.mint, jit2fa.handoff.bearer, jit2fa.handoff.bearer.dpop}
Metrics
Prometheus metrics. Query with: metrics prometheus jit2fa_<name>
Operations:
jit2fa_login_attempts_total          counter  {mapping_id}                   Login interceptions
jit2fa_webhook_validations_total     counter  {mapping_id, result}           Webhook results (success/failure)
jit2fa_webhook_validation_duration   latency  {mapping_id}                   Webhook response time
jit2fa_otp_verifications_total       counter  {mapping_id, result, reason?}  OTP results (success/invalid/expired/max_retries/error)
jit2fa_sessions_created_total        counter  {mapping_id}                   Sessions created
jit2fa_otp_resends_total             counter  {mapping_id, result}           OTP resend attempts
jit2fa_rate_limited_total            counter  {mapping_id}                   Rate-limited requests
Token Handoff:
jit2fa_handoff_entry_total  counter  {mapping_id, reason, dpop_bound}
  Entry path visits by outcome and DPoP binding state
  reasons: missing_return_url, invalid_return_url, missing_dpop_jkt, invalid_dpop_jkt, redirect_login, direct_mint, form_post (parallel entry: the login POST carried _jit2fa_return_url + optional _jit2fa_dpop_jkt, and the middleware treated the whole thing as a handoff request rather than the traditional credential-replay flow)
  dpop_bound: "true" when the caller supplied a valid dpop_jkt query parameter (or form field), "false" otherwise. Early-rejection paths (before dpop_jkt parse) always emit "false".
jit2fa_handoff_mints_total  counter  {mapping_id, result, reason?, dpop_bound}
  Mint step outcomes by result, reason, and binding
  failure reasons: revalidate_failed, malformed_return_url, missing_identity, missing_dpop_jkt, oidc_error
  dpop_bound: "true" when the minted (or attempted) token carries a cnf.jkt confirmation claim. Use this dimension for DPoP adoption tracking:
    sum by (dpop_bound) (rate(jit2fa_handoff_mints_total{result="success"}[5m]))
jit2fa_handoff_mint_duration  latency  {mapping_id}
  Time from finalizeTokenHandoff entry to mint response
jit2fa_handoff_bearer_checks_total  counter  {mapping_id, result, reason?, dpop_bound}
  Bearer check outcomes by result, reason, binding
  rejected reasons: empty_token, validator_error, invalid_token, audience_mismatch, token_not_dpop_bound, missing_dpop_header, dpop_validator_error, dpop_proof_invalid, dpop_jkt_mismatch
  dpop_bound: "true" when the presented token has a cnf.jkt claim, "false" otherwise. Early-rejection paths (empty_token, validator_error, invalid_token) emit "false" since the token was not parsed.
  DPoP usage query:
    sum by (dpop_bound) (rate(jit2fa_handoff_bearer_checks_total{result="accepted"}[5m]))
jit2fa_handoff_bearer_check_duration  latency  {mapping_id}
  Time from bearer header parse to validation outcome (full cost: JWT validate + optional DPoP proof validate)
jit2fa_handoff_dpop_validation_duration  latency  {mapping_id}
  Isolated cost of oidc.ValidateDPoP alone: a component of handoff_bearer_check_duration, emitted on every DPoP proof validation attempt (success or failure). Use this to tell JWT slowness apart from DPoP slowness when the bearer check p99 regresses.
Token Refresh:
jit2fa_handoff_refresh_total  counter  {mapping_id, result, reason?}
  Refresh endpoint outcomes (success/failure)
  failure reasons: disabled, parse_error, missing_token, invalid_token, wrong_audience, not_dpop_bound, missing_dpop, dpop_invalid, dpop_mismatch, missing_auth_time, max_session, mint_failed
jit2fa_handoff_refresh_duration  latency  {mapping_id}
  Full refresh handler wall-clock latency
Alerts:
# Backend / operational
rate(jit2fa_webhook_validations_total{result="failure"}[5m]) > 5
  Webhook backend issues
jit2fa_otp_verifications_total{reason="max_retries"} > 0
  OTP brute-force attempt
rate(jit2fa_rate_limited_total[5m]) > 10
  High rate limiting
# Token handoff - abuse signals (page on these)
rate(jit2fa_handoff_entry_total{reason="invalid_return_url"}[5m]) > 2
  Possible open-redirect probing against the whitelist
rate(jit2fa_handoff_bearer_checks_total{reason="audience_mismatch"}[5m]) > 0
  Cross-mapping token replay attempt (alert immediately)
rate(jit2fa_handoff_bearer_checks_total{reason="invalid_token"}[5m]) > 20
  High invalid-token rate (bot scan or clock drift)
rate(jit2fa_handoff_bearer_checks_total{reason="dpop_jkt_mismatch"}[5m]) > 0
  DPoP thumbprint mismatch - possible stolen token (alert immediately)
rate(jit2fa_handoff_refresh_total{reason="dpop_mismatch"}[5m]) > 0
  Refresh with wrong DPoP key - stolen refresh token attempt
# Token handoff - capacity / latency
histogram_quantile(0.99, jit2fa_handoff_mint_duration_bucket) > 0.5
  Token signing p99 slow (OIDC signer degraded)
histogram_quantile(0.99, jit2fa_handoff_bearer_check_duration_bucket) > 0.1
  Bearer check p99 slow (hexdcall / oidc validation contention)
histogram_quantile(0.99, jit2fa_handoff_dpop_validation_duration_bucket) > 0.05
  DPoP proof validation p99 slow (ECDSA cost or replay cache contention)
# Token handoff - DPoP rollout tracking (not alerts, dashboard panels)
sum by (dpop_bound) (rate(jit2fa_handoff_mints_total{result="success"}[5m]))
  Mint-time DPoP adoption ratio
sum by (dpop_bound) (rate(jit2fa_handoff_bearer_checks_total{result="accepted"}[5m]))
  Bearer-use DPoP adoption ratio
rate(jit2fa_handoff_bearer_checks_total{reason="token_not_dpop_bound"}[5m])
  Legacy clients on a require_dpop mapping (expected to drop to 0 after rollout)
rate(jit2fa_handoff_entry_total{reason="missing_dpop_jkt"}[5m])
  Clients hitting a require_dpop entry without dpop_jkt (same signal, earlier in the flow)
Kerberos Ticket Management & SPNEGO Browser SSO
Authenticates users via Kerberos tickets — browser SSO through SPNEGO and ticket proxying for SSH bastion
Overview
Authenticates users via Kerberos — browser SSO through SPNEGO negotiation and ticket proxying for the SSH bastion. The gateway is not part of the Kerberos realm. It authenticates to the KDC on behalf of users and manages tickets in memory. Applies to Active Directory and FreeIPA environments where Kerberos is the primary authentication protocol.
Two modes:
- Browser SSO (SPNEGO): transparent authentication for domain-joined browsers
- Ticket proxy (bastion): acquires TGTs for SSH jump host delegation
Passwords never touch disk. Tickets are stored as encrypted sessions with TTL synchronized to the Kerberos ticket lifetime. CCache output is MIT Kerberos compatible (version 4, big-endian) and works with SSH GSSAPI, kinit, klist, and all standard tools.
Additional capabilities:
- ACL protection for ticket retrieval operations
- Password change via kpasswd protocol (RFC 3244)
- Security audit logging for all Kerberos operations
- Prometheus metrics for ticket lifecycle monitoring
Platform notes: memory locking requires CAP_IPC_LOCK (container: --cap-add=IPC_LOCK). Degrades gracefully if memory locking is unavailable.
Config
Kerberos module configuration:
[authentication.kerberos]
realm = "EXAMPLE.COM"              # Kerberos realm (uppercase by convention)
kdc = "kdc.example.com"            # Key Distribution Center address
ticket_ttl = "8h"                  # Ticket lifetime (default: 8 hours)
password_change = true             # Enable kpasswd password change (default: false)
kpasswd_path = "/usr/bin/kpasswd"  # Optional: override kpasswd binary path
Ticket storage model:
Tickets are stored as sessions (type: "kerberos") indexed by the Kerberos principal (e.g., "alice@EXAMPLE.COM"). Session metadata includes: CCache bytes (auto-encrypted), ticket type, realm, principal, creation timestamp, and authentication method. This provides:
- Principal-based indexing for fast user lookup
- Automatic TTL expiration matching Kerberos ticket lifetime
- Distributed storage with encryption across cluster
- Cluster-wide ticket access from any node
Password change feature:
When password_change = true, users can change their Kerberos passwords via the ChangePassword operation. Uses standard kpasswd protocol (RFC 3244). Password complexity is enforced by the KDC policy, not Hexon. All existing tickets are automatically revoked after a successful change. Requires kpasswd binary (auto-detected in PATH or specify kpasswd_path).
Hot-reloadable: ticket_ttl, password_change.
Cold (restart required): realm, kdc, kpasswd_path.
Troubleshooting
Common symptoms and diagnostic steps:
AcquireTicket fails with authentication error:
- Verify KDC is reachable: check network connectivity to kdc address
- Verify realm is correct (must be uppercase by Kerberos convention)
- Check user credentials: invalid password returns auth_failed
- KDC clock skew: Kerberos requires clocks within 5 minutes (check NTP)
- DNS resolution: KDC hostname must resolve correctly
Ticket not found after acquisition:
- Check session storage health across cluster
- Verify session TTL has not expired (matches ticket_ttl config)
- Check cluster quorum status: tickets require quorum for distributed write
- Verify cluster connectivity between nodes
GetTicket returns access denied:
- ACLs control which modules can retrieve tickets
- Only authorized modules (SSH proxy, bastion) should have access
- Check ACL configuration in the cluster authorization policy
- Verify the calling module is in the allowed list
SSH GSSAPI authentication fails with valid ticket:
- Verify CCache format compatibility: use 'klist -c <file>' to inspect
- Check KRB5CCNAME environment variable is set to the temp file path
- Verify the ticket principal matches the SSH service principal
- Check ticket expiration: expired tickets are rejected by SSH server
- Ensure SSH server has GSSAPIAuthentication enabled
WriteTicketFile fails:
- Check filesystem permissions for temp directory
- Verify disk space available for temporary file creation
- Remember: caller MUST securely delete temp file after use
Reflection errors (TGT extraction):
- gokrb5 internal structure may change between versions
- Module is pinned to gokrb5 v8.4.4; do not upgrade without testing
- Check structured logs for reflection failure messages
- Fallback behavior may apply if structure changes
Password change fails:
- Verify password_change = true in configuration
- Check kpasswd binary availability (auto-detect or kpasswd_path)
- KDC password policy may reject the new password (complexity requirements)
- Check structured logs for kpasswd protocol errors
- Verify KDC supports kpasswd protocol (RFC 3244)
Memory locking warnings:
- CAP_IPC_LOCK capability required for mlockall
- Container: add --cap-add=IPC_LOCK to docker run
- Kubernetes: add IPC_LOCK to securityContext capabilities
- Without memory locking, passwords may be swapped to disk (security risk)
Ticket lifecycle monitoring:
- kerberos_ticket_acquisition_total: track acquisition success/failure
- kerberos_ticket_refresh_total: monitor refresh operations
- kerberos_ticket_revocation_total: verify revocation operations
- kerberos_password_change_total: audit password changes
Diagnostic commands:
- auth kerberos: check Kerberos health and configuration
- sessions list --type=kerberos: list active Kerberos ticket sessions
- health components: verify Kerberos subsystem health
Security
Security model and hardening measures:
In-memory password handling:
Passwords are typed as []byte (not string) to enable secure clearing. Every password is cleared immediately after use. gokrb5 authenticates with the KDC entirely in memory. Passwords are NEVER written to disk, logs, or any persistent storage.
Memory locking:
mlockall(MCL_CURRENT) prevents the process memory (including passwords and ticket data) from being swapped to disk. Requires CAP_IPC_LOCK capability. Graceful degradation: logs a warning if locking fails but continues operating.
Pure in-memory TGT extraction:
gokrb5 stores TGT in private internal fields. The module uses low-level Go techniques to extract TGT, session key, timestamps, and renewal data. This is version-pinned to gokrb5 v8.4.4 with error handling for structural changes.
CCache format security:
CCache bytes are built manually in standard MIT Kerberos format (version 4, big-endian). This ensures compatibility with all Kerberos tools while maintaining full control over the byte layout. No external dependencies for marshaling.
Sessions encryption:
CCache bytes stored in session metadata are automatically encrypted at rest by the sessions module. No manual encryption is needed. Encryption keys are managed by the sessions infrastructure.
Access control (defense in depth):
- ACLs restrict which modules can retrieve and revoke tickets
- Typically limited to SSH proxy, bastion, and service delegation modules
- ACL configuration in the cluster authorization policy
- Encryption provides second layer even if ACL is misconfigured
Constant-time comparison:
Uses crypto/subtle.ConstantTimeCompare for security-sensitive comparisons.
Secure file handling:
WriteTicketFile creates files with 0600 permissions. Secure file deletion overwrites with random data before removal. Callers MUST securely delete temp files after use.
Audit logging:
All ticket operations (acquire, access, revoke, password change) are logged via the telemetry system with structured fields. Security events logged at appropriate severity levels for SIEM integration.
On-behalf-of trust boundary:
Hexon is NOT part of the Kerberos realm and requires no keytab. Users provide credentials directly. The Hexon cluster is the security perimeter. Tickets are used for SSH jump hosts, proxies, and delegation.
Spnego
SPNEGO/Negotiate browser authentication (server model):
SPNEGO (RFC 4559) enables transparent SSO for domain-joined workstations. When a browser hits a protected route, the gateway challenges with “WWW-Authenticate: Negotiate”, the browser obtains a service ticket from the KDC and sends it back. The gateway validates the ticket against a keytab file — no password crosses the wire.
This is the SERVER model, contrasting with the existing PROXY model (AcquireTicket) where Hexon authenticates to the KDC on behalf of users.
Two authentication paths (mirrors the X.509 pattern):
1. Explicit: /signin/kerberos. The user navigates here, the browser gets a 401 Negotiate challenge and sends a SPNEGO token, a session is created, and the user is redirected.
2. Auto-SPNEGO: When spnego_auto_auth=true, proxy routes try a Negotiate challenge before falling back to OIDC redirect. Uses a marker cookie (hexon_spnego_tried, 60s TTL) to prevent infinite 401 loops for non-domain browsers.
Configuration:
[authentication.kerberos]
spnego_enabled = true
keytab_path = "/etc/krb5.keytab"             # File path (traditional)
keytab_base64 = ""                           # Base64 string (K8s/containers)
service_principal = "HTTP/gw.example.com"    # Default: HTTP/<service.hostname>
spnego_auto_auth = false                     # Transparent SPNEGO on proxy routes
spnego_exclude_nets = ["10.200.0.0/16"]      # Skip auto-SPNEGO for external nets
Keytab setup (FreeIPA example):
ipa service-add HTTP/gateway.example.com
ipa-getkeytab -s ipa.example.com -p HTTP/gateway.example.com -k /etc/krb5.keytab
chmod 0600 /etc/krb5.keytab
Browser compatibility:
- Chrome/Edge (Windows/macOS): automatic for domain-joined machines
- Firefox: requires network.negotiate-auth.trusted-uris configuration
- Safari (macOS): uses system Kerberos ticket
- Mobile browsers: no SPNEGO support, falls through to OIDC/password
Troubleshooting:
- "keytab unavailable": check keytab_path permissions (should be 0600)
- SPNEGO token unmarshal fails: token may not be a valid SPNEGO token
- Auth failure: check SPN matches keytab (klist -k /etc/krb5.keytab)
- Clock skew: Kerberos requires clocks within 5 minutes (check NTP)
- Non-domain browser loop: hexon_spnego_tried cookie should prevent it
- "user disabled": valid Kerberos ticket but user disabled in directory
Relationships
Module dependencies and interactions:
- SSH bastion: Primary consumer. SSH bastion uses Kerberos tickets for GSSAPI authentication to target hosts. Retrieves tickets via GetTicket and sets KRB5CCNAME for SSH connections. Writes temp files via WriteTicketFile for tools requiring file-based credential caches.
- Sessions: Distributed ticket storage with automatic encryption at rest. Sessions provide TTL expiration, cluster-wide replication, principal-based indexing via ModuleKey, and atomic operations.
- Directory: User identity verification. Directory provides the canonical username and group memberships used in ticket principal construction and access control decisions.
- Cluster: ACL definitions control which modules can retrieve tickets.
- config: Hot-reloadable configuration for ticket_ttl and password_change. Realm and KDC address require restart.
- telemetry: Security audit logging for all ticket operations. Metrics exported as Prometheus counters for monitoring ticket lifecycle, KDC health, authentication failures, and password change operations.
- External dependency: gokrb5 v8.4.4 for pure Go Kerberos protocol. Version-pinned due to in-memory TGT extraction from internal fields.
- External dependency: kpasswd binary for password change operations (auto-detected in PATH or configured via kpasswd_path).
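The SSH bastion entry above mentions writing a file-based credential cache and pointing KRB5CCNAME at it. A minimal sketch of that pattern follows, assuming a POSIX host; the helper names write_ticket_file and ssh_environment are hypothetical and are not the module's WriteTicketFile or GetTicket operations.

```python
import os
import stat
import tempfile
from typing import Dict, Optional


def write_ticket_file(ccache_bytes: bytes, directory: Optional[str] = None) -> str:
    """Write credential cache bytes to a private temp file and return its path.

    Hypothetical helper for illustration only.
    """
    # tempfile.mkstemp creates the file readable and writable only by the
    # owner (mode 0600 on POSIX), which suits credential caches.
    fd, path = tempfile.mkstemp(prefix="krb5cc_", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(ccache_bytes)
    return path


def ssh_environment(ticket_path: str,
                    base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with KRB5CCNAME pointing at the file."""
    env = dict(os.environ if base_env is None else base_env)
    # "FILE:" is the MIT Kerberos prefix for file-based credential caches.
    env["KRB5CCNAME"] = "FILE:" + ticket_path
    return env
```

The returned environment can be passed to a child process (for example via subprocess.run(..., env=env)); the caller is responsible for deleting the temp file when the SSH connection ends.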
Logs
Log entries by component. Search with: logs search "kerberos"
Levels: ERROR > WARN > INFO > DEBUG.
SPNEGO (Browser SSO):
kerberos.security WARN AUDIT SPNEGO token exceeds size limit
kerberos.security INFO AUDIT SPNEGO auth successful / failed / decode failed / unmarshal failed
kerberos.security ERROR AUDIT SPNEGO validated but no credentials in context
kerberos.security WARN AUDIT SPNEGO auth for disabled user
kerberos.spnego ERROR Failed to load keytab
kerberos.spnego WARN User not found in directory / unexpected type / lookup failed
kerberos.spnego WARN Keytab permissive permissions / missing service principal
kerberos.spnego INFO Keytab loaded (from base64 or file)
Ticket Acquisition:
kerberos.security INFO AUDIT Kerberos authentication successful
kerberos.security INFO Kerberos authentication failed
kerberos.acquire ERROR Failed to load krb5.conf
Ticket Access:
kerberos.security INFO AUDIT Ticket access denied: invalid or expired session
kerberos.write_file INFO AUDIT Created temporary ticket file
Ticket Lifecycle:
kerberos.refresh INFO Ticket refreshed
kerberos.refresh ERROR Failed to refresh ticket
kerberos.revoke INFO Ticket revoked
kerberos.revoke_user INFO User tickets revoked
Password Change:
kerberos.security INFO Password change failed / successful / tickets revoked after change
kerberos.password_change ERROR kpasswd pipe/start/write failures
Initialization:
kerberos.init INFO Memory locking enabled
kerberos.init WARN Memory locking failed: passwords may be swapped
Metrics
Prometheus metrics. Query with: metrics prometheus kerberos_<name>
SPNEGO:
kerberos_spnego_validation_total counter {result, reason?} SPNEGO validation results
  result=success | result=failure, reason=invalid_base64|invalid_token|auth_failed|no_credentials|user_disabled
Tickets:
kerberos_ticket_acquisition_total counter {result, reason?} Ticket acquisition
  result=success | result=failure, reason=auth_failed
kerberos_ticket_refresh_total counter {result} Ticket refresh (success/failure)
kerberos_ticket_revocation_total counter {result} Ticket revocation (success)
kerberos_tickets_revoked counter {} Total tickets revoked (bulk count)
Password:
kerberos_password_change_total counter {result} Password changes (success/failure)
Alerts:
rate(kerberos_spnego_validation_total{result="failure"}[5m]) > 10    SPNEGO failures (keytab/config)
rate(kerberos_ticket_refresh_total{result="failure"}[5m]) > 0        Ticket refresh failing (KDC)
kerberos_spnego_validation_total{reason="user_disabled"} > 0         Disabled user SPNEGO attempt
LDAP Authentication
Authenticates users with username and password against LDAP — Active Directory, FreeIPA, or OpenLDAP
Overview
The LDAP authentication module provides username/password verification by performing LDAP bind operations against configured directory servers. It acts as a bridge between the directory cache (for fast pre-flight checks) and the LDAP provider (for live password verification).
Core capabilities:
- LDAP bind authentication (no local password storage)
- Pre-flight account status checks via directory cache (disabled, expired)
- Group membership retrieval from directory cache
- Full user profile enrichment on successful authentication (email, name, groups)
- Graceful degradation when directory details unavailable after successful bind
- Prometheus metrics for authentication success/failure with labeled reasons
- Stateless operation suitable for any cluster node
Authentication flow (5-step pipeline):
1. Input validation: trim username, reject empty fields
2. Directory status check: existence, disabled, password expiry (via directory cache)
3. LDAP bind: live password verification against LDAP server
4. User details retrieval: full profile from directory cache
5. Response construction: comprehensive result with user metadata
The module never stores, caches, or logs passwords. Every authentication attempt requires a live LDAP bind, ensuring password policy enforcement is always delegated to the LDAP server (lockouts, complexity, expiry).
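The five-step pipeline above can be sketched as follows. This is an illustrative model only: DirectoryEntry, AuthResult, the dictionary standing in for the directory cache, and the bind callable are all assumptions, not the module's actual types. The reason strings are the ones this document lists for AuthenticateResponse.Reason.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DirectoryEntry:
    """Hypothetical shape of a cached directory record."""
    username: str
    email: str = ""
    groups: List[str] = field(default_factory=list)
    disabled: bool = False
    password_expired: bool = False


@dataclass
class AuthResult:
    success: bool
    reason: str = ""
    user: Optional[DirectoryEntry] = None


def authenticate(username: str, password: str,
                 directory: Dict[str, DirectoryEntry],
                 bind: Callable[[str, str], bool]) -> AuthResult:
    # Step 1: input validation.
    username = username.strip()
    if not username:
        return AuthResult(False, "username required")
    if not password:
        return AuthResult(False, "password required")

    # Step 2: pre-flight status from the directory cache (no LDAP traffic).
    entry = directory.get(username)
    if entry is None:
        return AuthResult(False, "user not found")
    if entry.disabled:
        return AuthResult(False, "account disabled")
    if entry.password_expired:
        return AuthResult(False, "password expired")

    # Step 3: live bind; the password is passed through and never stored.
    try:
        bound = bind(username, password)
    except ConnectionError:
        return AuthResult(False, "authentication service unavailable")
    if not bound:
        return AuthResult(False, "invalid credentials")

    # Steps 4 and 5: attach the cached profile to the response.
    return AuthResult(True, user=entry)
```

Because the status checks run before the bind, a disabled or expired account is rejected without contacting the LDAP server.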
Failure reasons returned in AuthenticateResponse.Reason:
- "username required" / "password required" (input validation)
- "user not found" (not in directory cache)
- "account disabled" / "password expired" (pre-flight status)
- "invalid credentials" (LDAP bind failed)
- "directory unavailable" / "authentication service unavailable" (module errors)
Config
The LDAP authentication module itself has no dedicated configuration section. It depends entirely on configuration from two upstream modules:
Directory module [directory]:
url = "ldaps://ldap.example.com:636"          # LDAP server URL
bind_dn = "cn=svc,dc=example,dc=com"          # Service account for searches
bind_password = "secret"                      # Service account password
user_base = "ou=users,dc=example,dc=com"      # User search base DN
group_base = "ou=groups,dc=example,dc=com"    # Group search base DN
sync_interval = "5m"                          # Delta sync interval (default: 5m)
full_sync_interval = "60m"                    # Full sync interval (default: 60m)
LDAP provider module [ldap]:
url = "ldaps://ldap.example.com:636"          # LDAP server URL for bind operations
bind_dn = "cn=svc,dc=example,dc=com"          # Service account DN
user_base = "ou=users,dc=example,dc=com"      # User search base for DN resolution
user_filter = "(uid=%s)"                      # User lookup filter (%s = username)
user_attribute = "uid"                        # Username attribute (uid, sAMAccountName)
Active Directory considerations:
- Use user_attribute = "sAMAccountName" for AD environments
- Use user_filter = "(sAMAccountName=%s)" for AD user lookups
- AD lockout policies enforced server-side via LDAP bind
- Password expiry detected via directory cache sync
Connection pooling is managed by the LDAP provider module, not this module. LDAP bind operations reuse pooled connections for reduced overhead.
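The user_filter setting above substitutes the username for %s, so the value has to be escaped before substitution. In this gateway that escaping is handled by the LDAP provider module; the sketch below only illustrates what RFC 4515 escaping looks like, and the function names are hypothetical.

```python
# Characters that RFC 4515 requires to be escaped inside a filter value.
_FILTER_ESCAPES = {
    "\\": "\\5c",
    "*": "\\2a",
    "(": "\\28",
    ")": "\\29",
    "\x00": "\\00",
}


def escape_filter_value(value: str) -> str:
    """Escape a string for safe use as an LDAP search filter value."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def build_user_filter(template: str, username: str) -> str:
    """Fill a user_filter template such as "(uid=%s)" with an escaped username."""
    return template.replace("%s", escape_filter_value(username.strip()))
```

With this in place, a username such as a*)(uid=* is treated as a literal value instead of altering the structure of the filter.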
Cache staleness window:
- Account status changes (disable, expiry) reflected within sync_interval
- Default: up to 5 minutes delay for status changes to propagate
- Full sync ensures eventual consistency every 60 minutes
- Immediate effect: password changes always verified live via LDAP bind
Troubleshooting
Common symptoms and diagnostic steps:
User gets “Invalid username or password” but credentials are correct:
- Run 'diagnose user <username>' to check cross-subsystem status
- Run 'directory user <username>' to verify user exists in cache
- Check directory sync status: 'directory status' for last sync time
- If user recently created, wait for sync or trigger manual sync
- Verify LDAP server reachability: 'auth ldap' for connection health
- Check if account locked in LDAP (server-side lockout policy)
- Verify user_attribute matches LDAP schema (uid vs sAMAccountName)
User gets “account disabled” but account is active in LDAP:
- Directory cache may be stale; check last sync: 'directory status'
- Trigger manual sync: 'directory sync <username>' to refresh user
- Verify the disabled attribute mapping in directory config
- Check delta sync interval (default 5m) for expected propagation delay
User gets “password expired” unexpectedly:
- Verify password expiry attribute mapping in directory config
- Check LDAP password policy (ppolicy overlay or AD fine-grained policy)
- Trigger user sync to refresh expiry status: 'directory sync <username>'
Authentication returns “directory unavailable”:
- Check directory module health: 'directory status'
- Verify cluster bridge status: 'cluster status'
- Check LDAP server connectivity: 'auth ldap'
- Review logs: 'logs search "directory"' for connection errors
- Verify directory module is registered and running
Authentication returns “authentication service unavailable”:
- Check LDAP provider module health: 'auth ldap'
- Verify LDAP server URL and port in configuration
- Check TLS certificate validity for ldaps:// connections
- Test LDAP connectivity: 'net tcp <ldap-host>:636 --tls'
- Review logs: 'logs search "ldap"' for bind or connection errors
- Check connection pool: 'connpool stats' for pool exhaustion
Login is very slow (>1 second):
- LDAP bind is the slow path (50-200ms typical, network dependent)
- Check LDAP server latency: 'net latency <ldap-host>:636 --tls'
- Verify connection pooling is working: 'connpool pools'
- High latency indicates LDAP server load or network issues
- Directory cache lookups should be <5ms (fast path)
All logins failing simultaneously:
- LDAP server down: 'auth ldap' for health status
- Network partition: 'net tcp <ldap-host>:636' for connectivity
- TLS certificate expired: 'net tls <ldap-host>:636' to inspect cert
- DNS failure: 'dns test <ldap-hostname>' for resolution check
- Check cluster health: 'health status' for node-level issues
Metrics for monitoring:
- ldap_authentication_total{result="success"} -- successful logins
- ldap_authentication_total{result="failure",reason="invalid_credentials"} -- wrong passwords
- ldap_authentication_total{result="failure",reason="user_not_found"} -- unknown users
- ldap_authentication_total{result="failure",reason="account_disabled"} -- disabled accounts
- ldap_authentication_total{result="failure",reason="directory_unavailable"} -- infra issues
- ldap_authentication_total{result="failure",reason="ldap_unavailable"} -- LDAP down
- Spike in invalid_credentials may indicate brute force or credential stuffing
- Spike in directory_unavailable or ldap_unavailable indicates infrastructure problems
Security
Password handling and credential security:
No local password storage:
Passwords are never stored, cached, or hashed locally. Every authentication requires a live LDAP bind, eliminating the risk of a local password database compromise. No password appears in logs, telemetry, metrics, or response objects.
Pre-authentication checks (fail-fast security):
Account status is verified BEFORE attempting LDAP bind. This prevents unnecessary LDAP queries for disabled or expired accounts, reducing load on the LDAP server and providing faster rejection of invalid accounts. Evaluation order: existence -> disabled -> expired -> LDAP bind.
Enumeration prevention:
The module returns distinct internal reasons ("user not found" vs "invalid credentials") but consuming services MUST map these to a generic message (e.g., "Invalid username or password") to prevent username enumeration. Timing is kept consistent: directory cache lookups are fast (<5ms) regardless of user existence. The module itself does not expose any public API that reveals user existence.
Brute force and credential stuffing:
Account lockout is delegated to the LDAP server's password policy (ppolicy overlay or Active Directory lockout settings). The module does not implement its own lockout or rate limiting. Consuming services (signin, proxy auth) should implement:
- Per-IP rate limiting (recommended: 10 attempts/minute)
- Per-username rate limiting (recommended: 5 attempts/minute)
- CAPTCHA after repeated failures
- Device fingerprinting for anomaly detection
Injection prevention:
Username is trimmed of whitespace before use. LDAP filter escaping is handled by the downstream LDAP provider module. There are no local database queries or command executions, eliminating SQL injection and command injection vectors entirely.
Password policy enforcement:
All password complexity, history, and rotation requirements are enforced by the LDAP server. The module reports password expiry status from the directory cache but does not enforce policies locally. This ensures a single source of truth for password policy (the LDAP directory).
Credential logging policy:
- Debug level: username and operation stage (never password)
- Info level: successful authentication with username and groups
- Warn level: authentication failures with reason (never password)
- Error level: infrastructure failures with error details
- Never logged: password, email (unless required for specific audit)
Memory safety:
Password memory clearing after LDAP bind is handled by the LDAP provider module. The authentication module passes the password through to the bind operation and does not retain references after the call completes.
Relationships
Module dependencies and interactions:
- Directory: Primary dependency for pre-flight checks. Provides cached user metadata for account status checks (existence, disabled, expired, groups) and full profile retrieval (email, name). Directory cache is synced from LDAP on configurable intervals (delta: 5m, full: 60m). Cache staleness determines the window for status change propagation.
- LDAP provider: Primary dependency for password verification. Performs LDAP bind operations for username/password verification. Manages LDAP connection pooling, user DN resolution, and TLS negotiation. Bind success/failure is the authoritative password check.
- Sign-in service: Primary consumer. The sign-in flow engine calls ldapauth Authenticate as part of the username/password authentication stage. The flow engine maps internal failure reasons to user-facing messages and manages session creation on success.
- Reverse proxy: Consumer for proxy authentication. HTTP proxied applications can require LDAP authentication via proxy auth provider configuration. Uses the same Authenticate operation with credentials from Basic Auth or form POST.
- Telemetry: All operations logged with structured fields (username, groups, error, type). Prometheus metrics exported for authentication success/failure counts with reason labels. Metrics enable real-time monitoring, security event detection, and capacity planning.
- Cluster: All operations are node-local with no cluster coordination required. The module is stateless and does not require session affinity or leader election.
- Rate limiting: Not directly integrated. Rate limiting for authentication endpoints should be configured at the service layer (signin, proxy) using the rate limit module. Recommended: per-IP and per-username limits.
- Sessions: On successful authentication, the consuming service creates a session with the returned user metadata (username, email, groups). Session lifecycle is managed by the session module, not the authentication module.
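The sign-in service entry above maps internal failure reasons to user-facing messages. A minimal sketch of such a mapping is shown below. The reason strings are the ones this document lists; the "temporarily unavailable" wording, the pass-through set, and the function name are assumptions for illustration, not the flow engine's actual behavior.

```python
GENERIC_FAILURE = "Invalid username or password"
# Hypothetical wording for infrastructure failures.
SERVICE_UNAVAILABLE = "Sign-in is temporarily unavailable"

# Reasons that would reveal whether a username exists if shown verbatim.
_ENUMERATION_SENSITIVE = {"user not found", "invalid credentials"}
# Reasons caused by infrastructure rather than by the user's input.
_INFRASTRUCTURE = {"directory unavailable", "authentication service unavailable"}
# Reasons assumed safe to show as-is in this sketch.
_PASS_THROUGH = {
    "username required",
    "password required",
    "account disabled",
    "password expired",
}


def user_facing_message(reason: str) -> str:
    """Map an internal failure reason to the message shown to the user."""
    if reason in _ENUMERATION_SENSITIVE:
        return GENERIC_FAILURE
    if reason in _INFRASTRUCTURE:
        return SERVICE_UNAVAILABLE
    if reason in _PASS_THROUGH:
        return reason
    # Unknown reasons fall back to the generic message rather than leaking detail.
    return GENERIC_FAILURE
```

Defaulting unknown reasons to the generic message means a newly added internal reason cannot leak to users by accident.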
Cluster behavior:
Fully stateless -- no local state, no cluster coordination required. All state lives in the directory cache (distributed via NATS/JetStream) and the LDAP server. Any cluster node can handle authentication independently. No session affinity needed. Directory cache consistency is bounded by sync intervals.
Logs
Log entries. Search with: logs search "ldapauth"
All entries use the name ldapauth.authenticate.
ldapauth.authenticate DEBUG Empty username / empty password provided
ldapauth.authenticate DEBUG Attempting LDAP bind
ldapauth.authenticate INFO Bind successful / bind failed (invalid credentials)
ldapauth.authenticate ERROR LDAP bind call failed (service error)
Metrics
Prometheus metrics. Query with: metrics prometheus ldap_<name>
ldap_authentication_total counter {result, reason?} Authentication attempts
  result=success                                  Successful bind
  result=failure, reason=empty_username           Missing username
  result=failure, reason=empty_password           Missing password
  result=failure, reason=service_unavailable      LDAP service error
  result=failure, reason=invalid_credentials      Wrong password
Alerts:
rate(ldap_authentication_total{result="failure",reason="service_unavailable"}[5m]) > 0     LDAP server down
rate(ldap_authentication_total{result="failure",reason="invalid_credentials"}[5m]) > 20    Brute-force attempt
Magic Link Authentication
Passwordless sign-in via email magic links with cross-device support
Overview
The magic link module implements passwordless authentication by sending a sign-in link to the user’s email address. Users click the link to authenticate without entering a password or code.
Core capabilities:
- Passwordless authentication via email-delivered links
- Cross-device support: request link on one device, click on another
- Three verification actions: authorize (remote), sign-in-here (local), deny
- Anti-enumeration: identical response shape regardless of email validity
- Session-based tokens (UUID v4: 128-bit identifiers carrying 122 random bits)
- Atomic single-use via cluster-wide session revocation
- Per-IP and per-email rate limiting to prevent abuse and inbox flooding
- Directory re-validation at verify time (disabled users cannot complete auth)
- PreVerify is read-only (safe from link-preview bots consuming tokens)
- Confirmation page shows request context (IP, location, browser) for phishing detection
- Geo-enriched emails showing request origin for user awareness
Flow summary:
1. User enters email on /signin/magiclink
2. Module creates device code pair with geo context in AdditionalData
3. If email matches an active directory user, a "magiclink" session is created (cluster-replicated) containing user info and device code key
4. Email sent with link: /signin/magiclink/verify?token=<SESSION_ID>
5. Frontend polls /api/signin/magiclink/poll with the device_code
6. User clicks link in email (possibly on a different device)
7. PreVerify validates session (read-only) and renders confirmation page showing destination, browser, IP, and geographic location
8. User chooses: Authorize, Sign in here, or Deny
9. Verify revokes session (atomic single-use) and acts on device code
10. Polling returns "authorized", "completed_elsewhere", or "denied"
The module reuses the device code module (RFC 8628) for the polling mechanism and the sessions module for cluster-replicated token storage.
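Steps 7 through 10 of the flow above can be modeled with a small in-memory state machine. This is a single-process sketch under stated assumptions: the class name, method names, and the "already_consumed" return value are hypothetical, and a lock stands in for the cluster-wide atomic session revocation the real module relies on.

```python
import threading
import uuid
from typing import Dict, Tuple

# Poll status produced by each user action on the confirmation page.
ACTION_TO_STATUS = {
    "authorize": "authorized",
    "signin_here": "completed_elsewhere",
    "deny": "denied",
}


class MagicLinkFlow:
    """In-memory model of token consumption and device code polling."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sessions: Dict[str, str] = {}  # token -> device code
        self._status: Dict[str, str] = {}    # device code -> poll status

    def initiate(self) -> Tuple[str, str]:
        token = str(uuid.uuid4())
        device_code = str(uuid.uuid4())
        with self._lock:
            self._sessions[token] = device_code
            self._status[device_code] = "pending"
        return token, device_code

    def pre_verify(self, token: str) -> bool:
        # Read-only: a link-preview bot hitting this path changes nothing.
        with self._lock:
            return token in self._sessions

    def verify(self, token: str, action: str) -> str:
        if action not in ACTION_TO_STATUS:
            raise ValueError("unknown action: " + action)
        with self._lock:
            # pop() removes the token in the same critical section, so only
            # the first Verify call for a token can act on the device code.
            device_code = self._sessions.pop(token, None)
            if device_code is None:
                return "already_consumed"
            self._status[device_code] = ACTION_TO_STATUS[action]
            return self._status[device_code]

    def poll(self, device_code: str) -> str:
        with self._lock:
            return self._status.get(device_code, "expired")
```

Separating pre_verify from verify is what keeps the token intact when an email client or scanner fetches the link before the user does.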
Config
Magic link is configured under the signin service section:
[service.signin.magiclink]
enabled = true                # Master switch (default: false)
code_ttl = "10m"              # Link validity duration (default: 10 minutes)
rate_limit = "5/1m"           # Per-IP rate limit (default: 5 per minute)
rate_limit_email = "3/10m"    # Per-email rate limit (default: 3 per 10 minutes)
Prerequisites:
- SMTP must be configured for email delivery
- Device code module is auto-enabled when magic link is activated
- Directory module must be available for user lookup by email
UI integration:
When enabled, sign-in templates render a "Send me a sign in link" text link below the secondary method buttons. Magic link is NOT injected into the secondary methods array. It appears as a separate, lower-emphasis option via the "magiclink_enabled" template variable. Operators only need to set enabled = true; the link appears on all sign-in pages (passkey, password, x509) automatically.
Rate limiting behavior:
- Per-IP limit (rate_limit): returns error "rate_limited" when exceeded, service responds with HTTP 429
- Per-email limit (rate_limit_email): silently creates orphaned device code as decoy (anti-enumeration), no email sent
- Both limits reset on their respective sliding windows
Anti-enumeration design:
Initiate always returns the same response shape (DeviceCode + ExpiresIn) regardless of whether the email exists, is disabled, or is rate-limited. When the email is invalid or per-email rate-limited, a real but orphaned device code is created as a decoy so timing and response structure are identical. The frontend polls normally and eventually gets "expired", which is indistinguishable from a valid request where the user never clicked the link.
Hot-reloadable: code_ttl, rate_limit, rate_limit_email. Cold (restart required): enabled.
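The anti-enumeration design above comes down to always creating a device code and only conditionally sending the email. A minimal sketch follows, assuming a set of active user emails and two injected callables; InitiateResponse, initiate, email_rate_limited, and send_link are hypothetical names, and the sketch does not model the per-IP limit or timing equalization.

```python
import secrets
import uuid
from dataclasses import dataclass
from typing import Callable, Set


@dataclass(frozen=True)
class InitiateResponse:
    device_code: str
    expires_in: int


def initiate(email: str,
             active_users: Set[str],
             email_rate_limited: Callable[[str], bool],
             send_link: Callable[[str, str], None],
             ttl_seconds: int = 600) -> InitiateResponse:
    # A device code is created on every request, so the response shape is
    # the same for valid, unknown, disabled, and rate-limited emails.
    device_code = secrets.token_urlsafe(32)
    normalized = email.strip().lower()

    if normalized in active_users and not email_rate_limited(normalized):
        # Only this branch ties the device code to a session token and
        # queues an email; otherwise the device code stays an orphaned decoy.
        token = str(uuid.uuid4())
        send_link(normalized, token)

    return InitiateResponse(device_code=device_code, expires_in=ttl_seconds)
```

A caller polling a decoy device code simply sees it expire after ttl_seconds, which matches the case where a real user never clicks the link.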
Troubleshooting
Common symptoms and diagnostic steps:
User never receives magic link email:
- Check SMTP health: 'smtp health' to verify email delivery is working
- Verify email belongs to an active directory user
- Check per-email rate limit: silent suppression after 3/10m (no error shown)
- Check spam/junk folders for the magic link email
- Verify the user's email address in directory matches what was entered
- Check structured logs for SMTP delivery errors
Magic link says “expired” or “invalid” when clicked:
- Default TTL is 10 minutes; check if user clicked in time
- Token is single-use: clicking a second time returns "already consumed"
- Check cluster time synchronization (NTP) across nodes
- Verify session replication health across cluster
Polling returns “expired” immediately (anti-enumeration):
- This is expected behavior for non-existent emails (by design)
- Per-email rate limit exceeded: creates orphaned decoy device code
- User disabled in directory: treated as non-existent (anti-enumeration)
- No way to distinguish from legitimate "user never clicked" scenario
“completed_elsewhere” status on polling device:
- User chose "Sign in here" on the verifying device (the device where they clicked the email link)
- This is intentional: the session was created on the verifying device only
- Polling browser displays a friendly message, not an error
- Detected via a cluster-wide signal for cross-device coordination
Confirmation page shows wrong location or IP:
- Geo data comes from the GeoAccess module's IP-to-country/ASN lookup
- Check geo database freshness and availability
- Proxy or CDN may mask the original client IP
- X-Forwarded-For header processing depends on trusted proxy configuration
Rate limiting triggered unexpectedly:
- Per-IP limit: 5 requests per minute (shared across all emails from one IP)
- Per-email limit: 3 requests per 10 minutes (shared across all IPs)
- Corporate NAT may cause many users to share one IP
- Adjust rate_limit and rate_limit_email in config as needed
Magic link feature not visible on sign-in page:
- Verify enabled = true in [service.signin.magiclink]
- Check that SMTP is configured (prerequisite)
- Template variable "magiclink_enabled" drives visibility
- The link appears below secondary method buttons, not in the methods array
Diagnostic commands:
- smtp health: verify email delivery subsystem
- auth status: check authentication system overview
- sessions list --type=magiclink: list active magic link sessions
- health components: verify magic link subsystem health
Security
Security features and hardening measures:
Token entropy:
Magic link tokens are session IDs (UUID v4). A version 4 UUID is a 128-bit value with 122 cryptographically random bits; the remaining 6 bits are fixed version and variant markers. The session ID doubles as the magic link token in the verification URL, providing sufficient randomness to resist brute-force guessing.
Single-use enforcement:
Tokens are consumed via atomic session revocation (replicated to all nodes). Once revoked, the token cannot be reused. Double-click on the verification link returns AlreadyDone=true (idempotent, no error).
Anti-enumeration:
The Initiate operation returns identical response structure regardless of whether the email exists, the user is disabled, or per-email rate limit is exceeded. Orphaned device codes serve as timing-identical decoys. This prevents attackers from using magic link requests to discover valid email addresses in the directory.
Directory re-validation:
At Verify time, the module re-validates the user against the directory. If the user has been disabled between Initiate and Verify, authentication fails. This prevents race conditions where an admin disables a user who already has a pending magic link.
Link-preview bot protection:
PreVerify (GET request when link is clicked) is read-only and does not consume the token. Link-preview bots that fetch URLs in emails cannot accidentally authorize or deny the request.
Phishing detection:
The confirmation page displays the request context (source IP, browser User-Agent, country, ISP/ASN) so the user can verify whether they initiated the request. Suspicious requests can be denied.
Cross-device security:
The "sign-in-here" action denies the device code and stores a signal, so the polling browser sees "completed_elsewhere" rather than "authorized". This prevents unintended sessions on the original (potentially shared) device.
Rate limiting:
- Per-IP: prevents abuse from a single source (default: 5/1m)
- Per-email: prevents inbox flooding for a target user (default: 3/10m)
- Per-email limit is silent (anti-enumeration): no error, decoy created
Relationships
Module dependencies and interactions:
- Device code: Core dependency. Provides RFC 8628 device code pair generation and polling infrastructure. Magic link auto-enables device code when activated. Device code handles the polling lifecycle; magic link provides the email-based authorization trigger.
- Sessions: Cluster-replicated token storage. Magic link tokens are stored as sessions with automatic TTL cleanup and atomic single-use via cluster-wide revocation. Session metadata contains user info, device code key, and request context.
- Directory: User lookup by email at Initiate time and re-validation at Verify time. Disabled users are treated as non-existent (anti-enumeration). Directory provides canonical user attributes (username, email, full name, groups) stored in session metadata.
- SMTP: HTML/text magic link email delivery. SMTP must be configured as a prerequisite for magic link functionality.
- Geo access: IP-to-country and ASN lookup for email context and confirmation page display. Helps users detect phishing attempts.
- Rate limiting: Per-IP and per-email request throttling. Per-IP returns HTTP 429; per-email silently creates decoy (anti-enumeration).
- config: Runtime access to [service.signin.magiclink] settings. Hot-reload supported for TTL and rate limit values.
- Sign-in service: HTTP handlers for /signin/magiclink routes and /api/signin/magiclink/poll that delegate to this module.
- Distributed memory cache: Stores cross-device flow coordination signals so the polling browser knows when authentication completed elsewhere.
Logs
Log entries by component. Search with: logs search "magiclink"
Levels: ERROR > WARN > INFO > DEBUG.
Rate Limiting:
magiclink.ratelimit.ip.status DEBUG Per-IP rate limit check passed
magiclink.ratelimit.email.status DEBUG Per-email rate limit check passed
Initiate (magic link request):
magiclink.initiate INFO Per-email rate limit exceeded
magiclink.initiate ERROR Failed to create device code
magiclink.initiate ERROR Failed to create magiclink session
magiclink.initiate WARN Failed to dispatch magic link email
magiclink.initiate INFO Magic link email queued
Poll (device code polling):
magiclink.poll ERROR PollDeviceCode failed
magiclink.poll ERROR Directory lookup failed during poll
magiclink.poll INFO User invalid at poll time
PreVerify (read-only token validation):
magiclink.preverify INFO Pre-verification successful, showing confirmation page
Verify (token consumption + action):
magiclink.verify INFO Magic link denied by user
magiclink.verify ERROR Directory lookup failed during verify
magiclink.verify INFO Magic link signin_here: session on verifying device only
magiclink.verify ERROR Failed to update device code authorization
magiclink.verify INFO Magic link authorized
Metrics
Prometheus metrics emitted by this module:
magiclink_initiated_total counter
  Incremented when a magic link email is successfully queued (valid user, within rate limits). Not incremented for decoy flows or unknown emails.
magiclink_verifications_total counter {result}
  Incremented when Verify completes a user action. Labels:
    authorized: user approved sign-in
    denied: user rejected the request
    signin_here: user chose local sign-in
magiclink_polls_total counter {status}
  Incremented on every Poll response. Labels mirror the returned status: pending, authorized, denied, expired, slow_down, completed_elsewhere, invalid (empty device code).
Additional observability via dependent modules:
- devicecode: device_code_* metrics cover code creation and polling
- ratelimit: ratelimit_* metrics cover per-IP and per-email throttling
- sessions: session_* metrics cover magiclink session create/revoke
- smtp: smtp_* metrics cover magic link email delivery
OIDC Provider
Built-in OpenID Connect provider — issues tokens for proxy SSO, bastion SSH, M2M, and personal access tokens
Overview
Issues and manages OAuth 2.0 / OpenID Connect tokens for all gateway services. Replaces external OIDC providers for proxy SSO, bastion device authorization, M2M workload auth, and personal access tokens. All token operations are cluster-wide — storage, revocation, and signing keys replicated across every node.
User authentication:
- Authorization Code Flow with PKCE, prompt/max_age/consent support
- Dynamic ACR/AMR claims reflecting the actual authentication method used
- DPoP token binding and mTLS certificate-bound tokens for high-security flows
Machine-to-machine:
- Client Credentials Grant for service auth, JWT Bearer Grant for certificate-based M2M
- Dynamic Client Registration for native OAuth clients
Additional capabilities:
- Pushed Authorization Requests (PAR) for enhanced request security
- Device Authorization Grant for headless device flows (bastion SSH, CLI)
- Personal Access Tokens (PATs) for CLI, CI/CD, and automation
- Token introspection and revocation (access tokens, refresh tokens, and PATs)
- UserInfo endpoint for retrieving user claims
- Response modes: query (default) and form_post
- Per-client skip_consent for trusted first-party applications
- Optional PKCE plain method deprecation (OAuth 2.1 hardening — S256 only)
- JWKS and OpenID Configuration discovery endpoints
A built-in proxy SSO client provides unified single sign-on for all proxy mappings. Its redirect URIs are validated against live proxy configuration to prevent open redirect attacks. This client is managed automatically and does not appear in the TOML configuration.
Config
Core configuration under [authentication.oidc]:
[authentication.oidc]
signing_key = "..."              # REQUIRED: Min 32 chars, used for deterministic key derivation via HKDF
signing_algorithm = "ES256"      # ES256 (default), ES384, ES512, or EdDSA
                                 # MUST be identical across all cluster nodes
hostname = "auth.example.com"    # REQUIRED: OIDC issuer URL (appears in token claims and discovery)
enable_test_callback = false     # Enable test callback URL (NEVER enable in production)
dpop_proactive_nonce = true      # Send DPoP-Nonce header in all token responses (default: true)
par_ttl = "5m"                   # PAR request_uri TTL (range: 1m-10m per RFC 9126)
enable_dcr = false               # Enable Dynamic Client Registration (RFC 7591)
rate_limit_dcr = "10/1m"         # DCR endpoint rate limit per IP
allow_dcr_from = []              # CIDR allowlist for DCR (empty = allow all)
allow_dcr_redirect_domains = []  # Allowed redirect URI domains (loopback always allowed, supports *.example.com)
disable_plain_pkce = false       # Reject "plain" PKCE method (OAuth 2.1 hardening, S256 only when true)
pat_enabled = false              # Master switch for PATs (default: disabled)
pat_max_ttl = "2160h"            # Maximum PAT lifetime (default 90 days, max 365 days)
pat_max_per_user = 10            # Maximum PATs per user (default 10)
pat_required_groups = []         # Groups allowed to create PATs (empty = any authenticated user)

[[authentication.oidc.clients]]
name = "my-app"                  # REQUIRED: Client identifier (used as client_id)
clientsecret = "..."             # Min 32 chars with entropy validation (omit for public/mTLS clients)
redirect_urls = ["https://..."]  # REQUIRED: Allowed redirect URIs (strict validation, wildcard support)
origin_urls = ["https://..."]    # Allowed CORS origins
allowed_scopes = ["openid", "profile", "email", "groups"]        # Permitted scopes
allowed_grant_types = ["authorization_code", "refresh_token"]    # Default grant types
require_pkce = false             # Enforce PKCE (MUST be true for public clients)
skip_consent = false             # Skip consent screen for trusted first-party clients
allow_client_from = ["0.0.0.0/0"]    # IP allowlist in CIDR notation
client_credentials_ttl = "1h"    # Access token TTL for client_credentials grant
# mTLS configuration (RFC 8705)
token_endpoint_auth_method = "tls_client_auth"    # Enable mTLS client auth
tls_client_auth_san_uri = "spiffe://..."          # URI SAN identity (SPIFFE)
tls_client_auth_san_dns = "service.local"         # DNS SAN identity
tls_client_auth_san_email = "svc@example.com"     # Email SAN identity
tls_client_auth_subject_dn = "CN=service"         # Subject DN identity
certificate_bound_tokens = true                   # Bind tokens to client certificate
client_ca_pem = "/path/to/ca.pem"                 # Per-client CA trust (inline PEM or file path)
# JWT Bearer configuration (RFC 7523)
jwt_public_key = "-----BEGIN PUBLIC KEY-----..."  # Public key for JWT assertion verification
jwt_algorithm = "RS256"                           # RS256/384/512, ES256/384/512, EdDSA
jwt_issuer = "service-name"                       # Expected issuer claim
jwt_subject = "service-name"                      # Expected subject claim
# Scope-to-group mapping for M2M authorization
scope_group_mapping = { "api:read" = ["readers"], "api:write" = ["writers"] }
Token storage and TTL defaults:
Authorization codes: 10 minutes, single-use
Access tokens: 1 hour (configurable), replicated cluster-wide
Refresh tokens: 30 days (configurable), replicated cluster-wide
DPoP JTIs: 120 seconds, replicated cluster-wide (best-effort)
DPoP nonces: 60 seconds, single-use, replicated cluster-wide (best-effort)
PAR requests: 5 minutes (configurable 1-10m), replicated cluster-wide (best-effort)
PAT sessions: up to pat_max_ttl (default 90 days), managed by sessions module
Key management:
Signing keys are derived deterministically from the signing_key using HKDF (RFC 5869). Supports ES256 (ECDSA P-256, default), ES384 (P-384), ES512 (P-521), and EdDSA (Ed25519). All cluster nodes derive identical keypairs from the same signing_key, requiring no key synchronization. Keys remain stable across restarts.
Hot-reloadable: client configurations, scopes, redirect URIs, IP allowlists.
Cold (restart required): signing_key, signing_algorithm, hostname (issuer URL).
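The determinism claim can be illustrated with a minimal sketch of HKDF-SHA256 (RFC 5869) written against the Python standard library. The empty salt, the `info` label, the 32-byte output length, and the helper name `derive_signing_seed` are assumptions made for illustration; the document does not state which values the gateway uses or how the derived bytes become an ECDSA or Ed25519 keypair.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output too long for HKDF-SHA256")
    # Extract: PRK = HMAC(salt, IKM). An empty salt defaults to hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i), concatenated until long enough.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_signing_seed(signing_key: str, algorithm: str) -> bytes:
    # Hypothetical label: binding the algorithm name into `info` is an assumption.
    info = b"oidc-signing:" + algorithm.encode()
    return hkdf_sha256(signing_key.encode(), salt=b"", info=info, length=32)


# Two "nodes" configured with the same signing_key derive the same seed,
# so no key material has to be exchanged between them.
node_a = derive_signing_seed("example-signing-key-with-32+-characters", "ES256")
node_b = derive_signing_seed("example-signing-key-with-32+-characters", "ES256")
print(node_a == node_b)  # prints True
```

Because the output depends only on the inputs, a restart or a newly joined node reproduces the same seed, which is the property the key management notes rely on.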
Security
Token signing:
ID tokens signed with configurable algorithm: ES256 (default), ES384, ES512, or EdDSA.
ES256/384/512 are compatible with Kubernetes kube-apiserver --oidc-signing-algs.
Two signing modes with automatic failover:
- Threshold: distributed key; no single node holds the full private key (requires cluster quorum).
- Deterministic fallback: all nodes derive identical keys from signing_key for cross-node consistency.
The module auto-switches between modes based on cluster health.
All token issuance logs include signer_type attribute ("threshold" or "deterministic").
Signing key entropy validated at startup.
Keys derived via HKDF-SHA256 (RFC 5869) for cross-node consistency.
JWT algorithm hardening:
All JWT parsing enforces strict algorithm allowlists.
- ID token validation: ES256, ES384, ES512, EdDSA only (server-issued tokens).
- JWT Bearer assertion: RS256-512, ES256-512, EdDSA (client-signed assertions).
- DPoP proof validation: RS256-512, ES256-512, EdDSA (client-signed proofs).
- id_token_hint validation: ES256, ES384, ES512, EdDSA only (server-issued tokens).
Symmetric algorithms (HS256/384/512) always rejected, preventing algorithm confusion attacks.
DPoP proofs validate typ header per RFC 9449 Section 4.3.
PKCE (Proof Key for Code Exchange, RFC 7636):
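A minimal sketch of the PKCE check performed at code exchange, using only the Python standard library. The function names and the way `disable_plain_pkce` is passed in are illustrative assumptions, not the gateway's implementation; the S256 transform itself follows RFC 7636 §4.2.

```python
import base64
import hashlib
import hmac


def s256_challenge(code_verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(code_verifier: str, code_challenge: str, method: str,
                disable_plain_pkce: bool = False) -> bool:
    """Check the verifier sent at token exchange against the stored challenge."""
    if method == "S256":
        expected = s256_challenge(code_verifier)
    elif method == "plain" and not disable_plain_pkce:
        expected = code_verifier
    else:
        return False  # unknown method, or plain rejected by configuration
    # Constant-time comparison, analogous to the comparisons listed under timing attack protection.
    return hmac.compare_digest(expected.encode(), code_challenge.encode())


# Verifier/challenge pair from RFC 7636 Appendix B.
verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
challenge = "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"
print(verify_pkce(verifier, challenge, "S256"))
```

With `disable_plain_pkce=True`, a client that registered a plain challenge fails the check even if the verifier matches, which mirrors the hardening option described in this section.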
Supports S256 (SHA-256) and plain methods. S256 strongly recommended.
Optional disable_plain_pkce config rejects plain method (OAuth 2.1 hardening).
When disable_plain_pkce=true, discovery advertises only S256.
MANDATORY for public clients (no client_secret configured).
RECOMMENDED for all confidential clients as defense-in-depth.
Prevents authorization code interception in mobile and SPA scenarios.
DPoP (Demonstrating Proof-of-Possession, RFC 9449):
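A rough, single-node sketch of JTI replay detection with a 120-second TTL. The class name `JtiReplayCache` and its in-memory dictionary are assumptions for illustration; the real cache is distributed, and cluster propagation (the source of the default mode's small replay window) is not modeled here.

```python
import time


class JtiReplayCache:
    """Single-node stand-in for the distributed JTI cache (120-second TTL)."""

    def __init__(self, ttl_seconds: float = 120.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # jti -> expiry timestamp

    def check_and_store(self, jti: str) -> bool:
        """Return True if the jti is fresh, False if it was already seen within the TTL."""
        now = self._clock()
        # Drop expired entries so the cache does not grow without bound.
        self._seen = {j: exp for j, exp in self._seen.items() if exp > now}
        if jti in self._seen:
            return False
        self._seen[jti] = now + self._ttl
        return True


cache = JtiReplayCache()
print(cache.check_and_store("proof-1"))  # first use is accepted
print(cache.check_and_store("proof-1"))  # second use inside the TTL is a replay
```

The injectable `clock` exists only so the TTL behavior can be exercised without waiting two minutes.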
Binds tokens to client cryptographic key, preventing token theft and replay.
Supports RSA, ECDSA, and Ed25519 proof keys.
JTI replay prevention with 120-second distributed cache TTL.
Optional nonce-based replay protection (proactive nonce delivery by default).
Server issues DPoP-Nonce header in all token responses when enabled.
Introspection returns cnf claim with jkt field for DPoP-bound tokens.
Replay protection has two modes (dpop_strict_replay config option):
- false (default): lower latency, small replay window during propagation
- true: strict quorum wait, no replay window, higher latency
Set to true for high-assurance deployments or regulated environments.
Mutual TLS (RFC 8705):
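A minimal sketch of how a certificate-bound token can be checked against the certificate presented on the connection. The thumbprint encoding follows RFC 8705 (base64url of the SHA-256 digest of the DER certificate, no padding); the function names, the shape of the claims dictionary, and treating the 16KB limit as 16 * 1024 bytes are assumptions for illustration.

```python
import base64
import hashlib
import hmac

MAX_DER_BYTES = 16 * 1024  # assumed interpretation of the 16KB DER size limit


def x5t_s256(cert_der: bytes) -> str:
    """SHA-256 thumbprint of a DER-encoded certificate, base64url without padding."""
    if len(cert_der) > MAX_DER_BYTES:
        raise ValueError("certificate DER exceeds size limit")
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def token_bound_to_cert(token_claims: dict, presented_cert_der: bytes) -> bool:
    """Compare the token's cnf.x5t#S256 claim with the presented certificate."""
    expected = token_claims.get("cnf", {}).get("x5t#S256", "")
    actual = x5t_s256(presented_cert_der)
    return hmac.compare_digest(expected.encode(), actual.encode())


# Placeholder bytes stand in for a real DER certificate.
issued_for = b"placeholder DER bytes of the client certificate"
claims = {"cnf": {"x5t#S256": x5t_s256(issued_for)}}
print(token_bound_to_cert(claims, issued_for))
```

A token presented with any other certificate yields a different thumbprint and fails the comparison, which is what makes a stolen token unusable without the matching private key.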
Client authentication via X.509 certificate presented during TLS handshake.
Four identity methods: URI SAN (SPIFFE), DNS SAN, Email SAN, Subject DN.
Configure exactly one identity method per client.
Certificate-bound tokens contain cnf.x5t#S256 (SHA-256 thumbprint).
Binding validated at token refresh and UserInfo endpoints.
Mutual exclusion: tokens are DPoP-bound OR cert-bound, never both.
Per-client CA trust via client_ca_pem provides defense-in-depth.
Certificate DER size limited to 16KB. Raw certificates never logged.
SPIFFE integration: workloads authenticate with existing X.509-SVIDs.
Pushed Authorization Requests (RFC 9126):
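A minimal in-memory sketch of single-use, client-bound request_uri handling. The `ParStore` class, its method names, and the choice to consume the entry before checking the client are assumptions for illustration; the error strings reuse the labels from the troubleshooting section. The sketch encodes the 32 random bytes as unpadded base64url, which is also an assumption, since the document does not say whether padding is included.

```python
import hmac
import secrets
import time

REQUEST_URI_PREFIX = "urn:ietf:params:oauth:request_uri:"


class ParStore:
    """In-memory, single-node stand-in for the replicated PAR store."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending = {}  # request_uri -> (client_id, params, expires_at)

    def push(self, client_id: str, params: dict) -> str:
        # 32 random bytes, base64url-encoded without padding.
        request_uri = REQUEST_URI_PREFIX + secrets.token_urlsafe(32)
        self._pending[request_uri] = (client_id, dict(params), self._clock() + self._ttl)
        return request_uri

    def consume(self, client_id: str, request_uri: str) -> dict:
        # pop() removes the entry first, so a request_uri can never be used twice.
        entry = self._pending.pop(request_uri, None)
        if entry is None:
            # Unknown and already-consumed URIs are indistinguishable in this sketch.
            raise ValueError("replay_attempt")
        owner, params, expires_at = entry
        if self._clock() >= expires_at:
            raise ValueError("expired")
        if not hmac.compare_digest(owner.encode(), client_id.encode()):
            raise ValueError("client_mismatch")
        return params


store = ParStore()
uri = store.push("my-app", {"scope": "openid", "state": "example-state"})
print(store.consume("my-app", uri)["scope"])
```

Consuming before validating means a mismatched client also burns the request_uri; a real implementation may order these checks differently.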
Authorization parameters stored server-side, not exposed in browser URL.
request_uri enforces single-use consumption (prevents replay attacks).
request_uri format: urn:ietf:params:oauth:request_uri:<base64url(32 bytes)>.
Client binding: request_uri locked to creating client_id.
DPoP integration: optional key binding at PAR time.
Claims/id_token_hint limited to 8KB to prevent DoS.
OIDC Core compliance (§2, §3.1.2.1, §3.1.3.6, §5.5.1):
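One of the claims covered by this compliance list, at_hash, is easy to get subtly wrong, so here is a minimal sketch of its computation. The algorithm-to-hash table mirrors the mapping stated in this section (ES256 to SHA-256, ES384 to SHA-384, ES512 and EdDSA to SHA-512); the function name and the sample token value are illustrative assumptions.

```python
import base64
import hashlib

# Hash function per signing algorithm, as stated in this section.
_HASH_FOR_ALG = {
    "ES256": hashlib.sha256,
    "ES384": hashlib.sha384,
    "ES512": hashlib.sha512,
    "EdDSA": hashlib.sha512,
}


def at_hash(access_token: str, signing_algorithm: str) -> str:
    """Left half of the hash of the access token, base64url-encoded without padding."""
    digest = _HASH_FOR_ALG[signing_algorithm](access_token.encode("ascii")).digest()
    left_half = digest[: len(digest) // 2]
    return base64.urlsafe_b64encode(left_half).rstrip(b"=").decode("ascii")


token = "example-opaque-access-token"  # hypothetical value
for alg in ("ES256", "ES384", "ES512"):
    print(alg, at_hash(token, alg))
```

A relying party recomputes the same value from the access token it received and compares it with the claim in the ID token; a mismatch indicates the two were not issued together.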
prompt parameter:
- prompt=none: returns error if user not authenticated (no login redirect).
- prompt=login: forces re-authentication even with active session.
- prompt=consent: forces consent screen even for skip_consent clients.
- Mutually exclusive with each other. Validated at authorization endpoint.
- Error redirects (login_required, consent_required) validated against registered redirect URIs to prevent open redirect.
max_age parameter:
- Limits maximum authentication age in seconds.
- If session is older than max_age, forces re-authentication.
- Validates session CreatedAt against current time.
auth_time claim:
- Reflects the real time the user authenticated, not when the token was issued.
- Carried through the entire token lifecycle (auth code, refresh, ID token).
at_hash claim (§3.1.3.6):
- Left half of SHA hash of access token, base64url-encoded.
- Hash algorithm matched to signing algorithm: ES256 → SHA-256, ES384 → SHA-384, ES512/EdDSA → SHA-512.
- Included in all ID tokens issued alongside an access token.
ACR/AMR claims (RFC 8176):
- ACR (Authentication Context Class Reference):
  - "1" = single factor (password only)
  - "2" = multi-factor or strong single factor (WebAuthn, x509)
- AMR (Authentication Methods References):
  - Values per RFC 8176: pwd (password), otp (TOTP/email OTP), hwk (WebAuthn), x509.
- Carried through the entire token lifecycle.
response_mode parameter:
- query (default): authorization code delivered via redirect query string.
- form_post: code delivered via auto-submitting HTML form (POST).
- form_post includes security headers (X-Frame-Options, Referrer-Policy).
Consent:
- Per-client skip_consent config skips consent screen for first-party apps.
- The built-in proxy SSO client and DCR clients skip consent.
- Unknown clients always show consent screen.
- prompt=consent overrides skip_consent.
Timing attack protection:
All security-sensitive comparisons use crypto/subtle.ConstantTimeCompare: client secrets, PKCE verifiers, authorization code validation, refresh token client binding, DPoP thumbprints, token ownership, mTLS SAN/DN matching, and certificate thumbprint binding.
Client security:
Client secrets require minimum 32 characters with entropy validation.
Strict redirect URI validation with wildcard security (HTTPS enforced).
State parameter minimum entropy requirements (32+ characters).
IP allowlisting per client via CIDR notation.
Public clients (no secret) MUST set require_pkce=true.
mTLS clients authenticate via certificate (no secret needed).
Proxy SSO client:
Automatically managed; not configured via TOML.
Secret derived from cluster key (consistent across all nodes).
PKCE S256 required.
Redirect URIs validated against live proxy mappings.
Token exchange handled internally (no external HTTP round-trips).
Invalid or disabled proxy mappings excluded from redirect URI validation.
Personal Access Tokens (PATs):
Pre-issued long-lived tokens for hexonclient CLI, CI pipelines, and automation.
Each PAT is a signed JWT backed by a server-side session for revocation control.
The JWT allows stateless validation; the session enables instant revocation.
Step-up 2FA required before creation (TOTP or email OTP), even if already logged in.
Server-side revocation: revoking a PAT invalidates the JWT at the next validation check.
Per-user limit (pat_max_per_user, default 10) prevents token accumulation.
Max TTL cap (pat_max_ttl, default 90d, max 365d) limits blast radius of stolen tokens.
Optional IP restriction (allowed_ips) checked at validation time.
Email notification on creation: user alerted if PAT created without their knowledge.
Last-used tracking (IP + timestamp) for forensics and audit trail.
Auto-revoke on user disable: directory bulk revocation includes PATs.
Active connector (QUIC) connections severed immediately on revocation.
PATs are distinguished from other token types by a dedicated audience claim.
PAT names optional (default "Token <date>"), duplicate names rejected (case-insensitive).
Optional group restriction (pat_required_groups): when set, user must have any listed group.
Group check enforced at issuance (OIDC module), profile UI (hides section), and bastion CLI.
PoW-free proxy access:
All Bearer tokens (opaque access tokens, JWT ID tokens, PATs) bypass Proof-of-Work challenges and OIDC browser redirects entirely.
The proxy middleware chain resolves Bearer tokens at step 1, before PoW and before OIDC redirect.
Two on-ramps:
- Browser: PoW → OIDC SSO → cookie → proxy (human path)
- Machine: Bearer <token> → proxy (machine path, no round-trips)
Token types: client_credentials grant (M2M), kubelogin ID tokens, PATs (long-lived with session-backed revocation + IP restrictions).
Same group authorization, identity headers, and Ed25519 signing apply to both paths.
Dynamic Client Registration (RFC 7591):
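One way stateless registration can work is to compute each client secret from the client ID with a keyed hash, so that any node can re-derive it on demand. The sketch below uses HMAC-SHA256 for that purpose; the construction, the `dcr-client-secret:` label, the hex encoding, and all function names are assumptions for illustration, since the document only states that secrets are derived deterministically from the cluster signing key.

```python
import hashlib
import hmac
import uuid


def new_dcr_client_id() -> str:
    """Client IDs carry the "dcr-" prefix followed by a UUID."""
    return "dcr-" + str(uuid.uuid4())


def derive_dcr_secret(cluster_key: bytes, client_id: str) -> str:
    # Assumed construction: HMAC-SHA256 keyed by the cluster key over a labelled client_id.
    mac = hmac.new(cluster_key, b"dcr-client-secret:" + client_id.encode(), hashlib.sha256)
    return mac.hexdigest()


def verify_dcr_secret(cluster_key: bytes, client_id: str, presented_secret: str) -> bool:
    """Re-derive the secret instead of looking it up; nothing is stored per client."""
    if not client_id.startswith("dcr-"):
        return False
    expected = derive_dcr_secret(cluster_key, client_id)
    return hmac.compare_digest(expected.encode(), presented_secret.encode())


cluster_key = b"example-cluster-signing-key-32-bytes!!"  # hypothetical key material
client_id = new_dcr_client_id()
secret = derive_dcr_secret(cluster_key, client_id)        # returned once at registration
print(verify_dcr_secret(cluster_key, client_id, secret))  # any node can verify it later
```

This also shows why individual clients cannot be revoked in such a scheme: there is no per-client record to delete, so the only switch is disabling registration as a whole.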
Fully stateless: no database, no KV storage, no cache.
Client IDs use "dcr-" prefix + UUID for recognition.
Client secrets deterministically derived from the cluster signing key. All cluster nodes derive identical secrets.
PKCE always required.
Redirect URIs: loopback always allowed (RFC 8252 §7.3): http://localhost[:port][/path], http://127.0.0.1[:port][/path], http://[::1][:port][/path].
Additional domains via allow_dcr_redirect_domains (exact match or *.example.com wildcard).
Use allow_dcr_redirect_domains = ["*"] to allow any HTTPS domain (for web-based MCP clients).
Non-loopback redirect URIs require HTTPS.
CIDR allowlist (allow_dcr_from) controls which IPs can register.
Rate limited per IP via rate_limit_dcr config.
Cannot revoke individual DCR clients; toggle enable_dcr=false to disable all.
MCP service requires enable_dcr = true for OAuth-based MCP client authentication.
Troubleshooting
Common symptoms and diagnostic steps:
Token exchange failures (invalid_grant):
- Authorization code expired (10-minute TTL): user took too long to complete flow
- Code already consumed (single-use): possible replay attack or double-submit
- PKCE verifier mismatch: client sent wrong code_verifier for the code_challenge
- Client ID mismatch: code was issued to a different client
- Redirect URI mismatch: URI in token request differs from authorization request
- Start with: 'auth status' to check OIDC module health
- Check: 'diagnose user <username>' for cross-subsystem user access diagnostic
DPoP validation failures:
- proof_too_old: DPoP proof timestamp older than 60 seconds (clock skew?)
- proof_from_future: client clock ahead of server (NTP issue)
- jti_replay: same JTI used twice within 120 seconds (SECURITY: possible attack)
  Note: default mode has a small replay window during cluster propagation.
  Set dpop_strict_replay = true to eliminate this window.
- invalid_nonce: nonce not found or expired (60-second TTL, single-use)
- htm_mismatch / htu_mismatch: proof HTTP method or URI does not match request
- thumbprint_error: JWK thumbprint computation failed (malformed key)
- Monitor: alert on ANY oidc_dpop_jti_replay_total increments
mTLS authentication failures:
- No client certificate: TLS handshake did not include certificate
- SAN/DN mismatch: certificate identity does not match client config
- Certificate too large: DER exceeds 16KB limit
- CA trust failure: certificate not signed by expected CA (check client_ca_pem)
- Wrong identity method: client configured with san_uri but cert has san_dns
- Check: 'auth status' for authentication system overview
Token refresh failures:
- Refresh token expired (default 30-day TTL)
- Client ID mismatch: refresh token bound to different client
- Certificate binding mismatch (mTLS): presented cert differs from original
- DPoP key mismatch: different key used than at token issuance
- Token revoked: check if bulk revocation was triggered
- Check: 'sessions list --user=<username>' for active sessions
M2M (client_credentials / jwt-bearer) failures:
- ip_not_allowed: source IP not in client allow_client_from CIDR list
- Invalid client secret: ensure 32+ chars, check for trailing whitespace
- Wrong grant type: client must have grant type in allowed_grant_types
- Scope not allowed: requested scope not in client allowed_scopes
- JWT assertion: check algorithm matches jwt_algorithm, verify issuer/subject
- JWT public key: ensure PEM format is correct and algorithm matches key type
PAR (Pushed Authorization Request) failures:
- replay_attempt: request_uri already consumed (SECURITY: possible replay attack)
- expired: request_uri TTL exceeded (default 5 minutes)
- client_mismatch: different client_id attempting to use another client's request_uri
- invalid_length: request_uri format does not match expected 78-character URN
- Monitor: alert on oidc_par_consume_total result=replay_attempt
Authorization endpoint (OIDC Core) issues:
- prompt=none returns login_required: user has no active session; expected behavior
- prompt=none returns consent_required: client requires consent but prompt=none forbids it
- prompt=login redirect loop: session freshness check prevents infinite loops (30s guard)
- max_age forces re-auth: session age exceeds max_age seconds; user must re-authenticate
- "Unsupported prompt value": client sent invalid prompt value (only none, login, consent allowed)
- "Invalid max_age": client sent non-numeric max_age value
- Consent screen shown unexpectedly: check skip_consent on client config ('config show authentication')
- at_hash missing in ID token: at_hash only present when ID token issued alongside access token
- ACR shows "1" despite MFA: check session auth_method metadata matches expected method
- form_post not working: ensure client accepts POST at redirect_uri; check response_mode=form_post
Dynamic Client Registration (RFC 7591) failures:
- 404 on POST /oidc/register: enable_dcr is false in configuration
- access_denied: source IP not in allow_dcr_from CIDR allowlist
- invalid_redirect_uri: redirect domain not in allow_dcr_redirect_domains and not loopback
- Non-loopback redirect URIs must use HTTPS
- For web-based MCP clients: set allow_dcr_redirect_domains = ["*"] to allow any HTTPS domain
- For native CLI MCP clients: no domain config needed (loopback always allowed)
- Client secret not working: ensure client is using the client_secret returned at registration
- Token exchange fails: DCR clients require PKCE (S256); ensure code_challenge is sent
- Check: 'config show authentication' to verify enable_dcr and allow_dcr_redirect_domains settings
PAT (Personal Access Token) failures:
- "PAT revoked or expired": session deleted or TTL exceeded; check 'sessions list --type=pat --user=X'
- "maximum PAT limit reached": user has pat_max_per_user tokens; revoke unused ones first
- "PAT name already exists": case-insensitive duplicate; use a different name
- "authentication failed" after revoke: expected; session deletion invalidates JWT at next check
- Token not working after creation: ensure hexonclient uses --token flag with full JWT string
- IP restriction error: remote IP not in allowed_ips metadata; check 'pats show <session_id>'
- "your groups do not permit PAT creation": user not in pat_required_groups; check 'config show authentication' and user's groups
- PAT section hidden in profile: pat_required_groups is set and user not in any listed group
- Step-up verification required: user must complete TOTP or email OTP before PAT creation
- PAT not working as proxy Bearer token: check 'logs search "handlers.bearer"'; look for "PAT rejected" (revoked session) or "Cached PAT rejected" (stale cache, auto-invalidated)
- PAT introspection returns {active: false}: ensure token_type_hint is "" or "pat", check session exists ('sessions list --type=pat'), verify JWT not expired
- PAT proxy access denied despite valid token: check allowed_ips; proxy enforces IP restriction from session metadata. Use 'pats show <session_id>' to see allowed_ips list
- Check: 'pats list --user=X' to see all PATs for a user
- Check: 'sessions list --type=pat' for all PAT sessions cluster-wide
- Check: 'logs search "oidc.pat"' for PAT issuance and validation logs
- Check: 'logs search "handlers.bearer"' for proxy bearer middleware PAT validation logs
Proxy SSO redirect loops:
- OIDC callback failing: check proxy oidc_providers configuration
- Token exchange fails: proxy exchanges tokens internally (no external HTTP hairpin)
- Cross-domain cookie: verify proxy hostname matches cookie domain
- Check: 'sessions list --type=proxy --user=<username>'
- Check: 'proxy traffic <app>' for per-route metrics
Threshold signing issues:
- signer_type=deterministic when threshold expected: check cluster quorum, 'cluster status'
- "Threshold signing unavailable but required": threshold_required is set but quorum lost
- "OIDC switched to deterministic fallback signing": threshold signer lost, using HKDF key
- Algorithm mismatch: threshold signer algorithm must match signing_algorithm config
- Check logs: 'logs search "oidc.keys"' for signing mode transitions
Key rotation / history issues:
- Token validation fails after key rotation: check 'auth keys' to see whether the old kid is still listed
- Key history empty: keys are recorded on first token signing or key rotation
- Historical key expired from history: TTL may be too short relative to token lifetimes
- Token signed with unknown kid: historical key may have expired from KV; restart loads from KV
- Check: 'auth keys' shows kid, algorithm, curve, expiry, and remaining TTL
Health check failures:
- signing_key_loaded=false: signing key derivation failed (check signing_key length)
- entropy_validated=false: signing key has insufficient entropy (weak key)
- issuer_configured=false: hostname not set in configuration
- Use: 'auth status' for OIDC health overview
General diagnostic commands:
'auth status'               - Authentication system status overview
'auth tokens'               - Active OIDC tokens and sessions
'auth oidc'                 - OIDC provider config and registered clients
'auth keys'                 - Active signing keys with kid, algorithm, and TTL
'diagnose user <username>'  - Cross-subsystem user access diagnostic
'sessions list --user=X'    - List active sessions for a user
'sessions revoke-user X'    - Revoke all sessions for a user (emergency)
'logs search oidc'          - Search logs for OIDC-related entries
'metrics prometheus oidc'   - Raw OIDC Prometheus metrics
Architecture
How the OIDC provider works at the cluster level:
The OIDC module operates cluster-wide. All token operations, key management, and revocation are replicated to all nodes automatically. The HTTP service layer handles request parsing and delegates to the OIDC module internally.
Operation categories:
- Authorization (user login flows)
  - Authorization code generated after user authentication (10-minute single-use TTL)
  - Code exchange validates PKCE, client credentials, redirect URI, then issues tokens
  - Supports: prompt (none/login/consent), max_age, response_mode (query/form_post)
  - Per-client skip_consent controls whether the consent screen is shown
- Token management
  - Refresh validates client binding, DPoP key, and certificate binding
  - Bulk revocation is replicated to all nodes for immediate effect
  - Introspection returns confirmation claims for DPoP-bound and cert-bound tokens
  - Introspection also supports PATs (returns token name and ID)
- Machine-to-machine (M2M)
  - Client credentials: secret-based auth with scope-to-group mapping
  - JWT bearer: certificate-based auth with public key validation
  - Both return access tokens only (no refresh token, no ID token)
  - Scope-to-group mapping bridges OAuth scopes to Hexon group authorization
- Device authorization
  - Issues tokens after the device authorization flow completes
  - Used by bastion SSH for user authentication via browser
- Discovery
  - JWKS exposes signing public keys for external JWT verification
  - OpenID Configuration provides standard OIDC discovery metadata
  - Discovery advertises supported response modes, claims, and PKCE methods
- Dynamic Client Registration (DCR)
  - Stateless: each DCR client gets a unique ID (dcr- prefix) and derived secret
  - No storage needed: client credentials are deterministically reproducible
  - PKCE required; redirect URIs: loopback always allowed + operator-configured domains
- Pushed Authorization Requests (PAR)
  - Authorization parameters stored server-side (not exposed in browser URL)
  - Single-use consumption prevents replay attacks
  - Client binding enforced with constant-time comparison
- Personal Access Tokens (PATs)
  - JWT signed and displayed once at creation; never stored server-side
  - Three validation paths: connector (QUIC), HTTP proxy Bearer header, introspection
  - All paths verify JWT signature + server-side session existence + optional IP restriction
  - Revocation deletes the session and immediately disconnects active connections
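The PAT validation steps listed above (signature, session existence, optional IP restriction) can be sketched as follows. Signature verification is deliberately left out: `claims` is assumed to come from a JWT whose signature has already been checked. The `PatSession` dataclass, the `sid` claim name, and the result strings are hypothetical names chosen for illustration.

```python
import ipaddress
import time
from dataclasses import dataclass, field


@dataclass
class PatSession:
    session_id: str
    username: str
    expires_at: float                                # Unix timestamp
    allowed_ips: list = field(default_factory=list)  # CIDR strings; empty = unrestricted


def validate_pat(claims: dict, sessions: dict, remote_ip: str, now=None) -> str:
    """Check session existence, expiry, and the optional IP restriction."""
    now = time.time() if now is None else now
    session = sessions.get(claims.get("sid"))  # "sid" is a hypothetical claim name
    if session is None or session.expires_at <= now:
        # A deleted session is how revocation takes effect despite a valid signature.
        return "revoked_or_expired"
    if session.allowed_ips:
        addr = ipaddress.ip_address(remote_ip)
        if not any(addr in ipaddress.ip_network(cidr, strict=False)
                   for cidr in session.allowed_ips):
            return "ip_not_allowed"
    return "ok"


sessions = {"s1": PatSession("s1", "alice", expires_at=time.time() + 3600,
                             allowed_ips=["10.0.0.0/8"])}
print(validate_pat({"sid": "s1"}, sessions, "10.1.2.3"))
del sessions["s1"]  # revocation: the JWT itself is unchanged
print(validate_pat({"sid": "s1"}, sessions, "10.1.2.3"))
```

Because the session lookup happens on every validation, deleting the session is enough to invalidate the token on all three paths without touching the JWT.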
Token replication model:
- Authorization codes: local node only (short-lived, single-use)
- Access/refresh tokens: replicated to all nodes with quorum
- DPoP JTIs and nonces: best-effort replication (short TTL)
- PAR requests: best-effort replication (short TTL, single-use)
- PAT sessions: managed by the sessions module (TTL per token, up to pat_max_ttl)
Key management:
Signing keypair derived deterministically from signing_key using HKDF-SHA256. Supports ES256 (P-256, default), ES384 (P-384), ES512 (P-521), and EdDSA (Ed25519). All cluster nodes produce identical keys from the same signing_key, so no key synchronization is needed. Keys remain stable across restarts.
Threshold signing is preferred when cluster quorum is available. The OIDC module auto-switches signing mode based on cluster health. If threshold_required is set in config, the deterministic fallback is disabled (fail-closed on quorum loss).
Key history (rotation support):
On key rotation, old signing keys are retained so that tokens signed with the previous key can still be verified. Each key is identified by its kid (Key ID). Historical keys have a TTL based on the longest-lived token signed with them. JWKS endpoint serves all active keys (current + historical). Inspect active keys: 'auth keys' shows kid, algorithm, curve, and TTL.
Metrics and observability:
Comprehensive Prometheus metrics exported for all operations:
- Token operations: exchange, refresh, revocation, introspection, userinfo
- DPoP: validation, JTI replay detection, nonce generation and validation
- mTLS: authentication attempts, certificate binding validations
- PAR: request creation, consumption, replay detection
- Latency histograms: ID token, auth code, access token generation
- Validation failures: PKCE, scope, redirect URI, signing key entropy
Relationships
Module dependencies and interactions:
- proxy: Provides SSO authentication via a built-in proxy client. Authorization codes are exchanged internally (no external HTTP round-trips). Redirect URIs validated against live proxy mapping config. Proxy sessions use 24-hour token TTL.
- devicecode: Issues tokens after device authorization flow completes. Used by bastion SSH; trusted internal callers skip client validation.
- directory: Provides user information (groups, email, name) for token claims. When a user is disabled in the directory, all their tokens are revoked cluster-wide. Group memberships are included in ID tokens and used for scope-to-group mapping in M2M flows.
- sessions: OIDC tokens create sessions for proxy and bastion flows. Session revocation triggers token revocation for the associated user.
- authentication.x509: TLS layer validates client certificates against the global CA pool. OIDC performs identity matching (SAN/DN) and optional per-client CA trust validation on top of TLS-layer authentication.
- spiffe: SPIFFE X.509-SVIDs used for mTLS client authentication via URI SAN. No separate CA infrastructure needed; reuses ACME SPIFFE profile certificates.
- bastion: Bastion SSH uses device authorization flow for user authentication. Bastion shell also provides 'pat create/list/revoke' commands with inline TOTP/email OTP verification.
- firewall: Network-level access rules applied before OIDC HTTP endpoints. IP allowlisting per client provides additional application-layer restriction.
- protection: Rate limiting applied to token and authorization endpoints. Prevents brute-force attacks on client credentials and authorization codes.
- mcp: MCP service uses DCR for OAuth-based authentication. MCP clients register dynamically via POST /oidc/register, then complete Authorization Code + PKCE flow. Also supports static bearer token auth as fallback.
- connector (hexonclient): PATs are used for QUIC connector authentication. Validates JWT signature + session existence. Active connections are severed immediately when a PAT is revoked. Last-used metadata updated on each use.
- proxy (Bearer tokens): PATs can be used as HTTP Bearer tokens for proxy access. Bearer middleware validates the JWT and checks the server-side session on every request (revocation takes effect immediately). IP restrictions from the PAT are enforced at the proxy layer.
- profile: Profile web UI allows PAT creation (with step-up 2FA gate), listing, and revocation.
- admin CLI: 'pats' command for cross-user PAT management with step-up verification.
- smtp: Email notification sent on PAT creation, including token name, expiry, and the IP address used during creation.
- cluster: All token operations are replicated cluster-wide. Key derivation ensures all nodes produce identical signing keypairs from the same signing_key.
Logs
Log entries by operation. Search with: logs search "oidc"
Levels: ERROR > WARN > INFO > DEBUG > TRACE. DEBUG/TRACE require log level configuration.
Authorization Code:
oidc.authcode.generate   INFO   AUDIT  Generating authorization code
oidc.authcode.generate   WARN   AUDIT  Rate limited / unknown client / invalid redirect URI
oidc.authcode.generate   WARN          PKCE missing, unauthorized scope, IP not allowed
oidc.auth                ERROR         RNG failure during code generation (critical)
Token Generation & Exchange:
oidc.token.exchange      INFO   AUDIT  Authorization code exchanged for tokens
oidc.token.exchange      WARN          Invalid/expired code, PKCE failed, client/redirect mismatch
oidc.tokens.generate     INFO   AUDIT  Tokens issued successfully
oidc.tokens.generate     ERROR         Token generation failed (signing key, RNG)
oidc.tokens.saga         ERROR         Saga step failed during token storage
oidc.token.refresh       INFO   AUDIT  Token refresh requested
oidc.token.refresh       WARN          Token not found, client mismatch, invalid scope
oidc.tokens.refresh      INFO   AUDIT  Tokens refreshed (internal)
oidc.tokens.refresh      WARN          Refresh generation failed
oidc.token.signing       WARN          Signing retry (threshold signer unavailable)
oidc.token.signing       ERROR         All signing attempts failed
oidc.ratelimit.status    DEBUG         Rate limit check result
ID Token:
oidc.idtoken             ERROR         Signing key not loaded, signing failed
oidc.idtoken             DEBUG         DPoP/cert binding applied, signer type
Crypto:
oidc.crypto              ERROR         RNG failure in secure token generation (critical)
Introspection & Revocation:
oidc.introspect          DEBUG         Token introspected (active true/false, type)
oidc.revoke              INFO   AUDIT  Token revoked
oidc.revoke_user_tokens  INFO          Bulk user token revocation (account disable/delete)
Client Authentication & Validation:
oidc.client_auth         WARN          Secret mismatch, JWT assertion failed, unknown method
oidc.validation          WARN          Redirect URI invalid, wildcard rejected, entropy check
oidc.pkce                WARN          Invalid verifier length/chars, plain method rejected
oidc.pkce                TRACE         PKCE validation result
DPoP (RFC 9449):
oidc.dpop                WARN          JTI replay detected
oidc.dpop                DEBUG         Proof validation (htm/htu mismatch, expired, future)
oidc.dpop.nonce          WARN          Nonce validation failed, storage error
oidc.dpop.nonce          DEBUG         Nonce generated, validated, stored
PAR (RFC 9126):
oidc.par                 INFO          PAR request created
oidc.par                 WARN          Auth failed, request too large, replay attempt
oidc.par                 ERROR         Failed to generate request_uri
mTLS (RFC 8705):
oidc.mtls                WARN          No certificate, CA mismatch, no identity fields
oidc.mtls                DEBUG         SAN mismatch (URI/DNS/email/subject DN)
oidc.mtls                TRACE         Client authenticated via matched method
M2M:
oidc.client_credentials  INFO   AUDIT  Access token generated
oidc.jwt_bearer          WARN          Invalid JWT assertion
Keys & Init:
oidc.init                INFO          OIDC provider initializing/disabled
oidc.init                ERROR         Signing key validation failed (critical)
oidc.keys                INFO          Key generated, threshold signing active
oidc.keys                WARN          Threshold signer unhealthy/algorithm mismatch
oidc.keys                ERROR         Key not configured, too short, low entropy
oidc.key_history         INFO          Key history loaded/rotated
oidc.key_history         WARN          Key history storage failed
oidc.jwks                DEBUG         JWKS requested
oidc.jwks                WARN          Unknown client requesting JWKS
UserInfo:
oidc.userinfo            INFO   AUDIT  UserInfo served
oidc.userinfo            WARN          Token invalid, user not found, scope insufficient
Bearer Token Minting:
oidc.mint_bearer         INFO   AUDIT  Bearer token minted for proxy
oidc.mint_bearer         ERROR         Minting failed (signing key, invalid request)
DCR (Dynamic Client Registration):
oidc.dcr                 INFO   AUDIT  Dynamic client registered
PAT (Personal Access Tokens):
oidc.pat.issue           INFO   AUDIT  PAT issued
oidc.pat.issue           ERROR         Signing key not loaded, signing/session failed
Token Validation:
oidc.validate_id_token   INFO          ID token validated
Device Code:
oidc.device_code         INFO          Generating tokens for device authorization
oidc.device_code         INFO   AUDIT  Device code grant successful
oidc.device_code         ERROR         Token generation failed
Logout:
oidc.logout              INFO   AUDIT  Logout completed, tokens revoked
Health:
oidc.healthcheck         DEBUG         Health check performed
Metrics
Prometheus metrics. Query with: metrics prometheus oidc_<name>
Token Issuance:
oidc_authcode_generation_total            counter    {result, reason}              Auth code generation
oidc_token_exchange_total                 counter    {result, reason}              Code-for-token exchanges
oidc_token_refresh_total                  counter    {result, reason}              Token refreshes
oidc_tokens_revoked                       counter    {}                            Tokens revoked on logout
oidc_token_signing_retry_total            counter    {result, reason|attempt}      Signing retries (threshold signer)
Client Auth:
oidc_validation_failure_total             counter    {type, client_id}             PKCE/scope/redirect failures
oidc_mtls_auth_total                      counter    {result, reason|method}       mTLS auth (failure: reason, success: method)
DPoP:
oidc_dpop_validation_total                counter    {result, reason}              Proof validation
oidc_dpop_jti_replay_total                counter    {detected}                    Replay detections
oidc_dpop_jti_storage_total               counter    {result}                      JTI cache operations
oidc_dpop_nonce_generation_total          counter    {result}                      Nonce generation
oidc_dpop_nonce_storage_total             counter    {result}                      Nonce cache operations
oidc_dpop_nonce_validation_total          counter    {result, reason}              Nonce validation
PAR:
oidc_par_requests_total                   counter    {result, client_id}           PAR creation
oidc_par_consume_total                    counter    {result, client_id}           PAR consumption
oidc_par_request_duration                 histogram  {client_id}                   PAR processing latency
M2M:
oidc_client_credentials_total             counter    {result, reason}              Client Credentials grants
oidc_jwt_bearer_total                     counter    {result, reason}              JWT Bearer grants
Operations:
oidc_token_introspection_total            counter    {result, token_type, active}  Token introspection
oidc_token_revocation_total               counter    {result, token_type}          Token revocation
oidc_userinfo_requests_total              counter    {result, reason}              UserInfo requests
oidc_logout_total                         counter    {result}                      Logouts
oidc_device_code_total                    counter    {result, reason}              Device code grants
oidc_pat_issued_total                     counter    {username}                    PAT issuance
Latency:
oidc_id_token_generation_duration_ms      histogram  {}                            ID token generation
oidc_access_token_generation_duration_ms  histogram  {}                            Access token generation
oidc_auth_code_generation_duration_ms     histogram  {}                            Auth code generation
oidc_entropy_validation_duration_ms       histogram  {}                            Entropy validation
Alerts:
rate(oidc_dpop_jti_replay_total[5m]) > 0                          DPoP replay attack
rate(oidc_validation_failure_total[5m]) > 10                      High validation failure rate
oidc_token_signing_retry_total > 0                                Signing key issues
rate(oidc_par_consume_total{result="replay_attempt"}[5m]) > 0     PAR replay attempt
Email OTP
Delivers one-time codes via email for second-factor authentication — brute-force and replay protected
Overview
Sends a one-time code to the user’s email for second-factor verification. Used as an MFA step after primary authentication — no app installation required, works with any email provider. Applies when the signin flow requires MFA and email OTP is configured as an available method.
How it works:
1. User completes primary authentication
2. The gateway generates a one-time code and emails it
3. User submits the code — validated with constant-time comparison
4. Code consumed on use — replay and brute-force protected
Two code formats:
- Numeric (digits 0-9) — standard, most familiar
- BASE20 (20 uppercase consonants: BCDFGHJKLMNPQRSTVWXZ) — avoids profanity, easier to read aloud
Security features: device-based rate limiting, resend delay enforcement, configurable max retry limits with OTP locking, email domain allowlisting, and hashed storage keys for privacy.
JIT-2FA override: when a webhook-validated scenario has already confirmed the user’s identity, the OTP step can be bypassed via JIT-2FA integration.
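As a rough illustration of the two code formats, the sketch below draws each character with rejection sampling, using the byte ranges given in the Security notes for this module (accept 0-249 for numeric, 0-239 for BASE20). It is a Python stand-in, not the gateway's code: the source names Go's crypto/rand, and here Python's `secrets` module plays that role. The function name `generate_code` and its parameters are hypothetical.

```python
import secrets

NUMERIC = "0123456789"
BASE20 = "BCDFGHJKLMNPQRSTVWXZ"  # the 20 uppercase consonants listed above


def generate_code(length: int = 6, code_type: str = "numeric") -> str:
    """Return a code whose characters are drawn uniformly from the alphabet."""
    # Any value other than "base20" falls back to numeric, mirroring the
    # documented fallback for invalid type values.
    alphabet = BASE20 if code_type == "base20" else NUMERIC
    base = len(alphabet)
    # Largest multiple of `base` that fits in one byte:
    # 250 for base 10 (accept 0-249), 240 for base 20 (accept 0-239).
    limit = 256 - (256 % base)
    chars = []
    while len(chars) < length:
        byte = secrets.token_bytes(1)[0]
        if byte >= limit:
            continue  # reject: keeping this byte would bias the low values
        chars.append(alphabet[byte % base])
    return "".join(chars)
```

Rejecting the top few byte values costs an occasional extra random byte but keeps every character equally likely, which a plain `byte % base` would not.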
Config
Configuration under [authentication.otp]:
[authentication.otp]
length = 6              # OTP code length (4-12, recommended: 4-8)
type = "numeric"        # Code type: "numeric" or "base20"
valid = "5m"            # OTP expiration duration (bounds: 1m-30m)
resend_time = 60        # Minimum seconds between OTP requests per device
max_retries = 5         # Max failed validation attempts before OTP locked
mask_email = true       # Mask email in MFA page ("user****@example.com")
domains = [             # Allowed email domains (empty = all blocked)
  "example.com",
  "company.org",
]
Code type selection:
"numeric": Standard digit-only codes, works with any keyboard layout
"base20": Consonant-only uppercase codes, prevents generating offensive words
Invalid type values fall back to "numeric" with a warning log
Override fields for JIT-2FA and programmatic callers:
TypeOverride: Override code type per-request (empty = global config)
CodeLengthOverride: Override code length per-request (bounds: 4-12)
TTLOverride: Override expiration per-request (bounds: 1m-30m)
ResendTimeOverride: Override resend cooldown per-request (bounds: 10s-5m)
SkipDomainCheck: Bypass email domain allowlist (for webhook-validated flows)
MaxRetriesOverride: Override max failed attempts per-request (bounds: 1-10)
Resolution chain for all overrides: per-request > global config > default
MaxRetries behavior:
When retry count reaches max_retries, OTP is locked (not deleted).
Locked OTPs block both validation AND resend requests.
This prevents brute-force bypass via the resend trick (request new code after exhausting retries on the current one).
Locked OTPs expire naturally via TTL for automatic cleanup.
Resend behavior:
Retry and attempt counters are preserved across resends for the same email.
This prevents attackers from resetting counters by requesting a new code.
Counters only reset when a different email is used from the same device.
All settings are hot-reloadable (read dynamically on each operation).
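The MaxRetries and resend rules above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `OtpRecord`, `validate`, and `resend` are hypothetical names, the returned strings borrow the failure reasons listed under the module's metrics, and which reason the locking attempt itself reports is a guess. Cluster storage, TTL expiry, and the resend delay are left out.

```python
import hmac
from dataclasses import dataclass


@dataclass
class OtpRecord:
    code: str
    retries: int = 0
    locked: bool = False


def validate(record: OtpRecord, submitted: str, max_retries: int = 5) -> str:
    """Check a submitted code; lock (do not delete) the record at max_retries."""
    if record.locked:
        return "locked"
    # Constant-time, case-insensitive comparison.
    if hmac.compare_digest(record.code.upper().encode(), submitted.upper().encode()):
        return "valid"
    record.retries += 1
    if record.retries >= max_retries:
        record.locked = True
        return "max_retries"  # assumption: the locking attempt reports this reason
    return "invalid_code"


def resend(record: OtpRecord, new_code: str) -> bool:
    """Issue a fresh code unless the record is locked; the retry count is kept."""
    if record.locked:
        return False
    record.code = new_code  # retries intentionally not reset
    return True
```

Because `resend` never touches `retries`, asking for a new code does not buy an attacker additional guesses, and once the record is locked neither path is available until it expires.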
Troubleshooting
Common symptoms and diagnostic steps:
User does not receive OTP email:
- Check email domain is in the allowed domains list
- Verify SMTP module health: 'smtp health'
- Check telemetry logs for "Failed to send verification email"
- GenerateOTP propagates SMTP errors to callers — a successful API response means the email was accepted by the SMTP server. If GenerateOTP returned an error, the user definitely did not get the email; check the error message for SMTP-specific detail
- Verify the user's email address format is valid (must contain @)
“email domain not allowed” error:
- Email domain not in [authentication.otp] domains list
- Domain check is case-insensitive
- Empty domains list or ["*"] allows all domains
- JIT-2FA callers should set SkipDomainCheck=true if webhook validates
“unidentified device” error:
- DeviceID is empty in the GenerateOTP request
- Handler must generate device fingerprint before calling OTP module
- DeviceID is required for rate limiting and device-email binding
“this device has already requested a code” error:
- Device has an active (non-expired) OTP for a different email address
- Prevents attacker from using victim's device session for their email
- Wait for existing OTP to expire, or use a different device identifier
“please wait X before requesting another code” error:
- Resend delay not elapsed (default: 60 seconds between requests)
- Check resend_time config or ResendTimeOverride bounds (10s-5m)
“too many failed attempts” error:
- OTP locked after max_retries exceeded (default: 5 attempts)
- Locked OTPs also block resend requests to prevent bypass
- User must wait for OTP to expire (TTL) then request a new one
- Check logs for "SECURITY: OTP locked due to max retry attempts exceeded"
OTP validation returns Valid=false without error:
- Code expired (check valid duration in config)
- Incorrect code submitted (case-insensitive comparison)
- No OTP found for email/device combination
- OTP already consumed (cluster-atomic single-use via authclaim; Reason="already_used")
- OTP locked from previous max retries exceeded
“OTP storage quorum not reached” error:
- Insufficient cluster nodes confirmed storage (need >50%)
- Check cluster health: 'cluster status'
- May indicate network partition or node failures
Metrics for monitoring:
- otp.codes_generated (type=numeric|base20): Generation count by type
- otp.validations_total (result=valid|invalid): Overall validation outcomes
- otp.validation_failures (reason=not_found|expired|invalid_code|max_retries|locked): Failure breakdown by reason
- otp.replay_prevented: Successful validations where OTP was deleted
Security
Security design and hardening:
Code generation:
Cryptographically secure random generation using crypto/rand.
Rejection sampling eliminates modulo bias in digit selection:
For numeric (base 10): Accept bytes 0-249, reject 250-255 (2.3% rejection rate).
For BASE20 (base 20): Accept bytes 0-239, reject 240-255.
This ensures perfectly uniform distribution across all code characters.
Constant-time validation (timing attack resistance):
All code paths execute identical operations regardless of OTP existence.
When OTP not found: dummy code "DUMMY0000" and expired metadata are used.
crypto/subtle.ConstantTimeCompare always called, even on storage errors.
No early returns before the comparison operation.
Prevents attackers from determining OTP existence via response time analysis.
Prevents code enumeration through timing side channels.
Brute-force protection:
Configurable max retry limit (default: 5 failed attempts).
OTP locked (not deleted) after max retries — blocks both validation and resend.
For 6-digit numeric: 5/1,000,000 = 0.0005% success probability per OTP.
Retry counters preserved across resends to prevent counter-reset bypass.
Security event logged at WARN level when max retries exceeded.
Device-email binding:
Each device can only have one active OTP at a time.
Device cannot switch to a different email while an active OTP exists.
Prevents attacker from using a compromised device session for their own email.
Email privacy protection:
Cache keys are SHA-256 hashes of "email|deviceID" (base64url encoded).
Email addresses never stored directly in cache keys.
Prevents email enumeration via cache key inspection.
Deterministic hashing ensures consistent key derivation across cluster nodes.
Replay prevention:
Cluster-atomic single-use enforced unconditionally via authclaim.
Marker written to JetStream KV (cache_type "otp_consumed") before declaring success.
Concurrent successful submissions on different cluster nodes resolve to exactly one Won; remainders return Reason="already_used".
Strict policy fails closed if cluster is degraded (Reason="infra_error").
Prevents code reuse cluster-wide.
Resend abuse prevention:
Per-device resend delay (configurable, default 60 seconds).
Locked OTPs block resend requests (prevents brute-force via fresh codes).
Retry counters preserved across resends for the same email.
Cluster storage security:
OTP broadcast to all cluster nodes with quorum requirement (>50%).
Ensures OTP availability across node failures.
Retry count updates also require cluster quorum.
TTL-based automatic expiration prevents stale OTP accumulation.
Relationships
Module dependencies and interactions:
- signin: Primary consumer for email-based MFA. When MFAMethods includes “otp”, users see the email OTP option on the MFA page. The signin flow engine calls GenerateOTP to send a code, then ValidateOTP when the user submits it. Successful validation completes the login flow.
- smtp: Email delivery for OTP codes. OTP generation triggers synchronous email delivery via the SMTP module; SMTP errors propagate back to the GenerateOTP caller. The email includes the code and its validity duration, and is localized using the Language field from the request.
- Distributed memory cache: Backend for OTP metadata. Uses cache type “otp_codes” with SHA-256 hashed keys. All writes use cluster broadcast with quorum for consistency.
- authentication.totp: Sibling MFA method. Users may see both email OTP and TOTP options on the MFA page. Email OTP requires no prior enrollment but depends on email delivery; TOTP is faster but requires authenticator app setup.
- config: Reads [authentication.otp] settings dynamically at runtime. All settings are hot-reloadable. Override fields in requests take precedence over global config values.
- telemetry: Structured logging with email context for all operations. Security events logged at WARN level (max retries exceeded, OTP locked). Metrics counters for generation, validation outcomes, and failure reasons.
- Rate limiting: External rate limiting layer. Handlers should implement IP-based rate limiting in addition to the module’s device-based limiting.
- jit_2fa: JIT-2FA webhook flow uses override fields (SkipDomainCheck, TTLOverride, CodeLengthOverride) for customized OTP behavior when the webhook has already validated the user.
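The hashed cache keys mentioned for the distributed memory cache can be sketched as follows. The input layout ("email|deviceID"), SHA-256, and base64url encoding come from the Security notes; the function name `otp_cache_key`, the absence of any email normalization, and the stripped padding are assumptions made for this illustration.

```python
import base64
import hashlib


def otp_cache_key(email: str, device_id: str) -> str:
    """Derive a deterministic cache key that does not expose the email."""
    digest = hashlib.sha256(f"{email}|{device_id}".encode("utf-8")).digest()
    # Assumption: unpadded base64url; the source only says "base64url encoded".
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Every node that hashes the same email and device ID arrives at the same key, which is what lets any cluster member look up an OTP that another member stored.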
Logs
Log entries by component. Search with: logs search “otp”
Levels: ERROR > WARN > INFO > DEBUG > TRACE.
Generate (OTP creation and delivery):
otp.generate  INFO AUDIT  Email domain not allowed
otp.generate  INFO  Device ID missing
otp.generate  INFO AUDIT  Device already has OTP for different email
otp.generate  INFO AUDIT  OTP resend blocked - max retries exceeded
otp.generate  DEBUG  OTP resend denied - too soon
otp.generate  DEBUG  Generating BASE20 OTP (consonants only)
otp.generate  DEBUG  Generating numeric OTP
otp.generate  WARN  Invalid UserpassOTPType configuration, defaulting to numeric
otp.generate  ERROR  Failed to generate OTP code
otp.generate  ERROR  Invalid OTP TTL configuration
otp.generate  ERROR  Failed to broadcast OTP to cluster
otp.generate  ERROR  Failed to achieve quorum for OTP storage
otp.generate  DEBUG  OTP stored with cluster quorum
otp.generate  INFO AUDIT  OTP code generated
otp.generate  WARN  Failed to send OTP email
Validate (OTP code verification):
otp.validate  ERROR  Failed to query OTP from storage
otp.validate  ERROR  Failed to retrieve OTP
otp.validate  DEBUG  No OTP found
otp.validate  ERROR  Invalid OTP type in storage
otp.validate  DEBUG  OTP validation attempt
otp.validate  INFO AUDIT  OTP validation rejected - OTP is locked
otp.validate  ERROR  Failed to delete expired OTP
otp.validate  INFO AUDIT  OTP code expired
otp.validate  ERROR  Failed to lock OTP after max retries exceeded
otp.validate  WARN AUDIT  SECURITY: OTP locked due to max retry attempts exceeded
otp.validate  ERROR  Failed to update OTP retry count
otp.validate  ERROR  Failed to achieve quorum for OTP retry update
otp.validate  INFO AUDIT  Invalid OTP code submitted
otp.validate  ERROR  Failed to delete OTP after validation
otp.validate  DEBUG  OTP deleted after successful validation
otp.validate  INFO AUDIT  OTP validated and removed (replay prevention)
otp.validate  INFO AUDIT  OTP validated successfully
Domain Check:
otp.domain  TRACE  Invalid email format
otp.domain  TRACE  Domain allowed
otp.domain  TRACE  Domain not in allowed list
Metrics
Prometheus metrics. Query with: metrics prometheus otp_<name>
Generation:
otp_codes_generated  counter  {type}  OTP codes generated (type: numeric, base20)
Validation:
otp_validations_total  counter  {result}  Validation outcomes (result: valid, invalid)
otp_validation_failures  counter  {reason}  Failure breakdown (reason: not_found, locked, expired, max_retries, invalid_code)
otp_replay_prevented  counter  (none)  OTPs deleted after successful validation (replay prevention)
Alerts:
rate(otp_validation_failures{reason="max_retries"}[5m]) > 0  Brute-force attempt (OTP locked after max retries)
rate(otp_validation_failures{reason="not_found"}[5m]) > 5  Probing for non-existent OTPs
rate(otp_codes_generated[5m]) > 20  Unusual OTP generation rate
RADIUS Authentication (RADSEC + UDP)
Authenticates network devices via RADIUS — VPN concentrators, WiFi controllers, and switches with group-based authorization
Overview
Handles RADIUS authentication and authorization for network devices — VPN concentrators, WiFi controllers, switches, and other NAS equipment. Replaces standalone RADIUS servers by using the gateway’s own user directory and group policies for access decisions. Applies to any RADIUS-capable network device pointed at the gateway.
Two transport modes:
- RADSEC (TCP+TLS, default) — encrypted RADIUS on port 2083
- Dual mode — RADSEC + plain UDP on port 1812 for legacy devices
Core capabilities:
- RADSEC listener for Access-Request packets over TCP+TLS (always active)
- Plain UDP RADIUS listener for legacy NAS equipment (when dual mode enabled)
- TLS certificate cascade: per-client → module-level → auto_tls (ACME) → service default
- Per-client mTLS: optional NAS device certificate verification via client_ca_pem
- NAS client validation via CIDR matching and shared secret verification (CIDR defaults to 0.0.0.0/0 if empty)
- HXEP (Hexon Edge Protocol) support: real NAS IP through SNAT/edge proxy
- Password authentication via LDAP bind (standard RADIUS User-Password)
- X.509 certificate authentication via RADSEC peer certificates — uses the same authentication.x509 module (7-layer validation: expiry, chain, CRL, identity extraction via cert_subject_map, directory lookup, revocation check)
- Group-based authorization mappings with priority ordering (first match wins)
- RADIUS attribute-value pair (AVP) responses: VLANs, ACLs, privilege levels
- Per-NAS rate limiting (sliding window) and per-user lockout after failed attempts
- Global concurrent authentication cap for DoS protection
- Full audit logging of authentication decisions with NAS and user context
Both transports share the same packet processing pipeline — authentication, authorization, and response building are transport-independent.
Config
RADIUS configuration under [radius] section:
[radius]
enabled = true                # Enable RADIUS service
radsec_only = true            # true: RADSEC TCP+TLS only; false: dual mode (UDP + RADSEC)
network_interface = ""        # Bind interface (defaults to service.network_interface → "eth0")
radsec_port = 2083            # RADSEC TCP+TLS port (default 2083, RFC 6614)
plain_port = 1812             # Plain UDP RADIUS port (default 1812, RFC 2865, dual mode only)
accounting_port = 2083        # Reserved for future accounting
auth_methods = ["password"]   # Methods: "password" (LDAP bind), "x509" (RADSEC peer cert)
idle_timeout = "30s"          # Per-connection idle timeout (default: 30s)
session_ttl = "1h"            # Auth event visibility in session list (1m-24h)
tls_min_version = "1.2"       # Minimum TLS version: "1.1", "1.2", "1.3"
# TLS: module-level certificate (optional, falls back to service default)
tls_cert = ""                 # Server cert (file path or inline PEM)
tls_key = ""                  # Server private key (file path or inline PEM)
auto_tls = false              # Issue cert from internal ACME CA

[radius.rate_limit]
max_requests_per_second_per_nas = 100   # Per-NAS rate limit
max_auth_attempts_per_user = 5          # Failed attempts before lockout
auth_lockout_duration = "5m"            # Lockout period after max failures
max_concurrent_authentications = 1000   # Global concurrent auth cap

# NAS client definitions (at least one required)
[[radius.client]]
name = "vpn-concentrator"
description = "Fortinet FG-100F at DC1"
cidr = "10.0.1.0/24"          # Defaults to 0.0.0.0/0 if empty (WARNING logged)
secret = "base64:c2VjdXJlLXJhbmRvbS1zZWNyZXQ="   # min 16 bytes decoded
# Per-client TLS overrides (optional)
tls_cert = ""                 # NAS-specific server cert
tls_key = ""                  # NAS-specific server key
client_ca_pem = ""            # CA to verify NAS device cert (enables mTLS)

# Group-based authorization mappings (evaluated by priority, highest first)
[[radius.mapping]]
name = "network-admins"
groups = ["admins", "network-ops"]
priority = 100
[radius.mapping.attributes]
"Service-Type" = "6"               # Administrative
"Tunnel-Type" = "13"               # VLAN
"Tunnel-Medium-Type" = "6"         # IEEE 802
"Tunnel-Private-Group-ID" = "10"   # VLAN 10

[radius.mfa]
enabled = false               # Enable MFA for RADIUS password auth
mode = "challenge"            # "challenge" (Access-Challenge) or "append" (password+code)
methods = ["totp"]            # Priority list: "totp", "otp" (email)
separator = ":"               # Append mode separator (split at last occurrence)
challenge_timeout = "60s"     # Access-Challenge response timeout (10s-300s)
required_groups = []          # Groups requiring MFA (empty = all users)
skip_if_unavailable = false   # Skip MFA if no method available (false = reject)
otp_ttl = "5m"                # Email OTP validity override (1m-10m)
otp_code_length = 6           # Email OTP code length (4-8)
Per-client MFA override (optional field on [[radius.client]]):
mfa_override = ""             # "" = inherit global, "off" = disable, "challenge", "append"
Hot-reloadable: all settings except port and TLS (requires restart).
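How the [[radius.mapping]] entries are evaluated can be sketched as below: sort by priority, highest first, and return the first mapping that matches. The `Mapping` dataclass and `select_mapping` are hypothetical names, and treating a match as "the user belongs to at least one listed group" is an assumption. The empty-groups catch-all follows the troubleshooting notes for this module.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    name: str
    groups: list
    priority: int
    attributes: dict = field(default_factory=dict)


def select_mapping(user_groups: list, mappings: list):
    """Return the highest-priority mapping that matches, or None."""
    for mapping in sorted(mappings, key=lambda m: m.priority, reverse=True):
        # Empty groups acts as a catch-all; otherwise any shared group matches.
        if not mapping.groups or set(mapping.groups) & set(user_groups):
            return mapping  # first match wins
    return None
```

With the "network-admins" mapping from the configuration above at priority 100 and a catch-all at priority 1, a member of "network-ops" receives the VLAN 10 attributes and everyone else falls through to the catch-all.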
Troubleshooting
Common RADIUS issues and diagnostic steps:
NAS cannot connect to RADIUS server:
- RADSEC: verify port 2083/tcp is open; 'firewall show' to check rules
- UDP (dual mode): verify configured port (default 1812/udp) is open
- Verify NAS IP falls within a configured [[radius.client]] CIDR
- Test connectivity from NAS to gateway on configured port
- Check: 'config show radius' to verify enabled = true and radsec_only setting
- TLS handshake failures logged with NAS name and source IP (RADSEC only)
TLS handshake failures:
- "no TLS certificate available": no cert configured at any level
- Check TLS cascade: per-client tls_cert → module tls_cert → auto_tls → service cert
- If using auto_tls, verify ACME CA is configured and reachable
- If client_ca_pem set: NAS must present valid client certificate (mTLS)
- Minimum TLS version defaults to 1.2 — check tls_min_version setting
- Set tls_min_version = "1.1" only for legacy NAS devices that don't support 1.2+
Authentication failures (Access-Reject):
- Access-Reject always returns "Access denied" in Reply-Message (no internal detail leak)
- Check server logs for the actual reason (detailed reason logged at each reject point)
- "bad authenticator" in logs: shared secret mismatch between NAS and config
- "LDAP bind failed" in logs: user credentials incorrect or user not in directory
- "User account disabled" in logs: user is disabled in directory
- "Account temporarily locked" in logs: too many failed attempts, wait for lockout to expire
- Lockout auto-clears after auth_lockout_duration expires (default 5m)
- Abandoned lockout entries (< max failures, then idle) are cleaned up after 2× auth_lockout_duration
- Check rate_limit settings if legitimate users are being locked out
X.509 certificate authentication issues:
- x509 only works on RADSEC (TCP+TLS) — NAS must present client cert during TLS handshake
- "Certificate validation service unavailable": [authentication.x509] not enabled or bridge error
- "Certificate validation failed": cert expired, chain untrusted, revoked, or identity not in directory
- Identity from cert is authoritative (RADIUS User-Name attribute is optional for x509)
- Uses same authentication.x509 config (ca_pem, cert_subject_map, OCSP) as web signin
- Check: 'config show authentication.x509' for CA pool and identity mapping settings
No RADIUS response (NAS timeout):
- RADSEC: connection drops for unknown NAS IPs (no TLS handshake for unknowns)
- UDP: unknown source IPs silently dropped (no information leak)
- Per-NAS rate limit exceeded: increase max_requests_per_second_per_nas
- Global concurrent auth limit reached: increase max_concurrent_authentications
- LDAP service not ready: check directory service health
- Idle timeout (default 30s): increase idle_timeout if NAS sends infrequent requests
HXEP (edge proxy / SNAT) issues:
- "HXEP resolved real NAS IP" log: normal — shows socket IP → real NAS IP resolution
- NAS rejected after HXEP: real NAS IP doesn't match any client CIDR — add correct CIDR
- HXEP not resolving: verify service.hexon_edge_protocol = true and edge IP in service.hexon_edge_cidr
- TLS handshake fails via edge: HXEP header parsed during TLS handshake read — check edge proxy config
- UDP via edge: HXEP wrapping is transparent — no RADIUS-specific config needed
- "Rejecting HXEP connection — NAS has per-client mTLS": client_ca_pem is incompatible with HXEP edge proxy — mTLS cannot be enforced because TLS handshake occurs before HXEP reveals the real NAS IP. Remove client_ca_pem or connect the NAS directly (no edge)
MFA issues:
- "MFA enrollment required": user has no TOTP enrolled and skip_if_unavailable=false
  → Enroll user's TOTP via bastion 'totp enroll' or web signup, or set skip_if_unavailable=true
- "Challenge expired or invalid": user took too long, increase challenge_timeout (max 300s)
- Access-Challenge not working: NAS may not support Access-Challenge — use mfa_override="append"
- Append mode "Invalid credentials": password+code not split correctly
  → Check separator config (default ":"), user must type password:123456
- Email OTP not delivered: verify SMTP configured and user has email in directory
- Per-client MFA override: set mfa_override on [[radius.client]] to "off", "challenge", or "append"
- MFA only applies to password auth — x509 certificate is the second factor
Mapping not applied (wrong VLAN/attributes):
- Mappings evaluated by priority (highest first), first match wins
- Empty groups = catch-all, ensure it has lowest priority
- Verify user's group membership in directory matches mapping groups
- Check: user groups via directory service
Relationships
Module dependencies and interactions:
- LDAP module: Password authentication uses LDAP bind for credential verification. RADIUS waits for LDAP readiness before accepting connections.
- X.509 auth module: Certificate authentication validates client certificates against the CA. Full 7-layer validation: expiry, chain, CRL, identity extraction, directory, revocation. Uses same [authentication.x509] config as web signin (ca_pem, cert_subject_map, OCSP). Identity extracted from certificate is authoritative (RADIUS User-Name optional for x509).
- Directory service: Group membership lookups for authorization mapping evaluation. User disabled status checked before authentication.
- Certmanager: TLS certificate cascade — module cert, auto_tls (ACME), or service default. Per-client TLS overrides built at init time for NAS-specific certificates.
- Managed listener: TCP and UDP socket lifecycle managed by Hexon’s listener infrastructure. RADSEC: TLS applied per-connection (not at listener level) for per-client cert selection. UDP: packets matched to NAS by source IP, dispatched directly to handlePacket. HXEP (Hexon Edge Protocol): real NAS IP resolved through SNAT/edge proxy. TCP: two-phase NAS matching (socket IP for TLS config → HXEP real IP for final NAS match). UDP: HXEP PacketConn wrapper transparently resolves real IP — no handler changes needed.
- TOTP module: MFA checks TOTP enrollment and validates codes (including recovery codes).
- Email OTP module: MFA generates and validates email OTP codes. Bypasses web domain allowlist since RADIUS users may not match web-configured domains.
- Cluster: All cross-module calls use standard cluster communication.
- Metrics: Exposes the radius_connections_total, radius_packets_total, radius_auth_total, radius_errors_total, and radius_mapping_matches_total counters, plus the radius_auth_duration latency metric.
- Sessions module: Auth events recorded as type “radius” sessions on Access-Accept. Visible via 'sessions list --type=radius', 'sessions show', cluster-wide. TTL controlled by session_ttl config (default 1h). Rich metadata per session: NAS name/IP, transport (tcp/udp), TLS version, auth method, mapping, RADIUS attributes, user groups, packet ID, timing metrics (total_ms, auth_ms, authz_ms), and cert info for x509 (serial, subject, issuer, expiry, CA type).
- Configuration: Reads [radius] TOML section. Validated at startup.
- Admin CLI: RADIUS status and diagnostics available through admin commands.
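Before the TOTP or Email OTP module can validate a code in append mode, the combined User-Password value has to be split into password and code. A minimal sketch of that split, assuming the configured separator and the "split at last occurrence" rule from the [radius.mfa] settings; the function name `split_append_mode` and the choice to reject an empty password or code are assumptions.

```python
def split_append_mode(value: str, separator: str = ":"):
    """Split 'password<sep>code' at the LAST separator; None if malformed."""
    password, sep, code = value.rpartition(separator)
    if not sep or not password or not code:
        return None  # no separator found, or one side is empty
    return password, code
```

Splitting at the last occurrence means a password that itself contains the separator still works: `split_append_mode("pa:ss:123456")` yields `("pa:ss", "123456")`.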
Logs
Log entries emitted by this module (runtime/radius). Levels: ERROR > WARN > INFO > DEBUG. AUDIT = security-auditable event.
Initialization:
radius.init  INFO  RADIUS service disabled in configuration
radius.init  INFO  RADIUS initialization starting (RADSEC TCP+TLS)...
radius.init  INFO  RADIUS initialization starting (dual-mode: UDP + RADSEC TCP+TLS)...
radius.init  INFO  Waiting for LDAP service to initialize
radius.init  INFO  Shutdown requested during LDAP wait, aborting initialization
radius.init  INFO  LDAP service ready, creating RADIUS server
radius.init  INFO  Shutdown requested before server creation, aborting initialization
radius.init  ERROR  Failed to create RADIUS server
radius.init  INFO  Shutdown requested before listener creation
radius.init  ERROR  Failed to resolve network interface IP
radius.init  INFO  Resolved network interface for RADIUS
radius.init  ERROR  Failed to create RADSEC listener
radius.init  ERROR  Failed to start RADSEC listener
radius.init  INFO  RADSEC listener started
radius.init  ERROR  Failed to create UDP RADIUS listener
radius.init  ERROR  Failed to start UDP RADIUS listener
radius.init  INFO  UDP RADIUS listener started
radius.init  INFO  RADIUS server started successfully
radius.init  WARN  RADIUS auth_methods includes x509 but [authentication.x509] is not enabled — x509 auth will fail at runtime
Connection handling:
radius.handler  ERROR  No TLS configuration available
radius.handler  WARN  TLS handshake failed
radius.handler  INFO  HXEP resolved real NAS IP
radius.handler  ERROR  Rejecting HXEP connection — NAS has per-client mTLS (client_ca_pem) which cannot be enforced through edge proxy
radius.handler  WARN AUDIT  Unknown NAS — connection from unregistered IP
radius.handler  DEBUG  RADSEC connection established
UDP listener:
radius.handler  WARN  UDP temporary read error, continuing
radius.handler  ERROR  UDP fatal read error, stopping listener
RADSEC framing:
radius.handler  WARN  Failed to read RADSEC frame header
radius.handler  WARN  Invalid RADIUS packet length
radius.handler  WARN  Incomplete RADSEC frame
Packet processing:
radius.handler  WARN AUDIT  NAS rate limit exceeded
radius.handler  WARN AUDIT  Concurrent authentication limit reached
radius.handler  WARN  Failed to parse RADIUS packet
radius.handler  WARN  Unexpected RADIUS packet code
radius.handler  INFO  Missing User-Name attribute in Access-Request
radius.handler  WARN AUDIT  User locked out
Authentication:
radius.auth  DEBUG  Skipping x509 auth — no client certificate
radius.auth  ERROR  x509auth bridge call failed
radius.auth  ERROR  x509auth validation timed out or failed
radius.auth  INFO AUDIT  Certificate validation rejected
radius.auth  INFO AUDIT  Authentication failed
radius.auth  ERROR  Authorization failed
radius.auth  INFO  No matching mapping
radius.auth  INFO  Authentication and authorization successful
MFA:
radius.mfa  WARN  TOTP status check failed
radius.mfa  ERROR  Failed to generate challenge token
radius.mfa  INFO  MFA validated via recovery code
radius.mfa  ERROR  Failed to encode Access-Challenge
radius.mfa  WARN  Failed to send Access-Challenge
radius.mfa  ERROR  Failed to get user info for MFA check
radius.mfa  ERROR AUDIT  MFA method resolution failed
radius.mfa  INFO  MFA skipped — no method available, skip_if_unavailable=true
radius.mfa  ERROR  Failed to send email OTP
radius.mfa  INFO AUDIT  Sending MFA challenge
radius.mfa  WARN  Invalid or expired MFA challenge state
radius.mfa  WARN  MFA challenge response from different NAS
radius.mfa  INFO  MFA challenge response missing verification code
radius.mfa  INFO  MFA validation failed
radius.mfa  ERROR  Authorization failed after MFA
radius.mfa  INFO  MFA authentication and authorization successful
Response encoding:
radius.handler  ERROR  Failed to encode Access-Reject
radius.handler  WARN  Failed to send Access-Reject
radius.handler  WARN  Failed to set RADIUS attribute
radius.handler  ERROR  Failed to encode Access-Accept
radius.handler  WARN  Failed to send Access-Accept
Session recording:
radius.session  WARN  Failed to create RADIUS session
Restrictions:
radius.restrictions.geo  ERROR  Geo check failed - denying access (fail-closed)
radius.restrictions.geo  ERROR  Geo check wait failed - denying access (fail-closed)
radius.restrictions.geo  ERROR  Invalid geo check response type - denying access (fail-closed)
radius.restrictions.geo  INFO  Access blocked by geo restriction
radius.restrictions.time  ERROR  Time check failed - denying access (fail-closed)
radius.restrictions.time  ERROR  Time check wait failed - denying access (fail-closed)
radius.restrictions.time  ERROR  Invalid time check response type - denying access (fail-closed)
radius.restrictions.time  INFO  Access blocked by time restriction
Metrics
Prometheus metrics. Query with: metrics prometheus radius_<name>
Connections:
radius_connections_total  counter  {nas}  TCP connections accepted (RADSEC)
Packets:
radius_packets_total  counter  {transport, nas}  RADIUS packets received (transport: tcp or udp)
Authentication:
radius_auth_total  counter  {result, method, nas}  Auth outcomes (result: accept/reject, method: password/x509/none)
radius_auth_total  counter  {result, reason, nas}  Auth rejections with reason (reason: geo, time)
radius_auth_duration  latency  {result}  End-to-end auth+authz latency (result: accept/reject)
Mappings:
radius_mapping_matches_total  counter  {mapping, nas}  Mapping match counts per mapping name
Errors:
radius_errors_total  counter  {reason, nas}  Error counts by reason:
  reason=tls_handshake  TLS handshake failure on RADSEC connection
  reason=hxep_mtls_conflict  HXEP connection rejected — NAS has per-client mTLS
  reason=invalid_frame  RADIUS packet length out of range (< 20 or > 4096)
  reason=incomplete_frame  RADSEC frame body read failed (truncated)
  reason=rate_limit  Per-NAS rate limit exceeded (silent drop)
  reason=concurrent_limit  Global concurrent auth limit reached (silent drop)
  reason=parse_error  RADIUS packet parse failed (bad authenticator / malformed)
  reason=invalid_state  MFA challenge state token invalid or expired
  reason=nas_mismatch  MFA challenge response from different NAS than original
TOTP Authenticator
Authenticator app verification for second-factor authentication — QR enrollment, replay protection, recovery codes
Overview
Verifies time-based one-time passwords from authenticator apps like Google Authenticator, Authy, or 1Password. Used as an MFA step after primary authentication — requires the user to have enrolled via QR code scan. Applies when the signin flow requires MFA and TOTP is configured as an available method.
Enrollment flow:
1. The gateway generates a 160-bit secret and QR code (secret not persisted until confirmed)
2. User scans the QR code with their authenticator app
3. User submits the first code to confirm enrollment — proves the QR was scanned correctly
4. The gateway generates 10 one-time recovery codes (returned in plaintext exactly once)
5. Subsequent logins verify the 6-digit code from the authenticator app
Replay protection rejects codes that match or precede the last accepted time step. Recovery codes are hashed and consumed on use — each code works exactly once.
Uses HMAC-SHA1 by default (SHA256 and SHA512 are configurable but reduce app compatibility). The time skew window is configurable to tolerate clock drift between the gateway and authenticator apps. Per-user enrollment status and secret deletion are available via the admin CLI.
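The verification and replay rule described in this overview can be sketched with the standard RFC 6238 computation. This is an illustrative Python version, not the gateway's code: `totp_at` and `verify` are hypothetical names, and tracking the last accepted step as a plain integer (-1 for "never used") is an assumption.

```python
import hashlib
import hmac
import struct


def totp_at(secret: bytes, step: int, digits: int = 6, algorithm: str = "sha1") -> str:
    """Compute the TOTP code for one time step (RFC 6238 over RFC 4226)."""
    mac = hmac.new(secret, struct.pack(">Q", step), getattr(hashlib, algorithm)).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation offset
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify(secret: bytes, code: str, now: float, last_used_step: int,
           period: int = 30, skew: int = 1, digits: int = 6):
    """Return the accepted time step, or None if the code is wrong or replayed."""
    current = int(now) // period
    for step in range(current - skew, current + skew + 1):
        if step <= last_used_step:
            continue  # replay protection: never accept this step or an earlier one
        if hmac.compare_digest(totp_at(secret, step, digits), code):
            return step  # caller persists this as the new last_used_step
    return None
```

With the RFC 6238 test secret `b"12345678901234567890"` at time 59, the 6-digit code is `287082`. `verify` accepts it once and returns step 1; passing `last_used_step=1` on a second attempt rejects the same code.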
Config
Configuration under [authentication.totp]:
[authentication.totp]
enabled = true              # Enable TOTP module
issuer = "HexonGateway"     # Shown in authenticator apps (otpauth URI)
algorithm = "SHA1"          # HMAC algorithm: SHA1 (most compatible), SHA256, SHA512
digits = 6                  # Code length: 6 (standard) or 8
period = 30                 # Time step in seconds (30 is RFC default)
skew = 1                    # Allow +/- N steps for clock drift (1 = 30s tolerance)
recovery_codes = 10         # Number of one-time recovery codes generated
recovery_code_length = 6    # Character length of each recovery code
rate_limit_auth = "10/1m"   # Rate limit for validation attempts
Algorithm compatibility notes:
- SHA1: Works with all authenticator apps (Google, Authy, 1Password, etc.)
- SHA256: Limited app support (may not work with Google Authenticator)
- SHA512: Minimal app support (not recommended for broad deployments)
Period and skew interaction:
With period=30 and skew=1, codes are valid for ~90 seconds (current + 1 past + 1 future). Increasing skew improves tolerance for clock drift but reduces security. Period changes require re-enrollment of all users.
Storage: Hexon KV (NATS JetStream); no user password is needed for writes.
All settings are cold (restart required to take effect on new enrollments). Existing enrollments retain their original algorithm, digits, and period.
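The `recovery_codes` and `recovery_code_length` settings, together with the hashed, consume-on-use storage described in the overview, can be sketched as follows. This is an illustration under assumptions: the alphabet (case-sensitive alphanumerics) and both function names are hypothetical, and the real storage layer is replaced by a plain list.

```python
import hashlib
import hmac
import secrets
import string

# Assumption: codes are drawn from case-sensitive letters and digits.
ALPHABET = string.ascii_letters + string.digits


def generate_recovery_codes(count: int = 10, length: int = 6):
    """Return (plaintext_codes, stored_hashes); plaintext is shown to the user once."""
    codes = ["".join(secrets.choice(ALPHABET) for _ in range(length))
             for _ in range(count)]
    hashes = [hashlib.sha256(code.encode()).hexdigest() for code in codes]
    return codes, hashes


def consume_recovery_code(stored_hashes: list, submitted: str) -> bool:
    """Remove the matching hash and return True; return False if nothing matches."""
    candidate = hashlib.sha256(submitted.encode()).hexdigest()
    match = None
    for stored in stored_hashes:  # scan every entry; compare in constant time
        if hmac.compare_digest(stored, candidate):
            match = stored
    if match is None:
        return False
    stored_hashes.remove(match)  # one-time use: the code cannot be accepted again
    return True
```

In a real deployment the removal would have to be persisted before success is reported, matching the fail-closed behavior the documentation describes.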
Troubleshooting
Common symptoms and diagnostic steps:
User cannot enroll TOTP (enrollment fails):
- Verify [authentication.totp] enabled = true
- Check if user already has TOTP enrolled: 'totp status <username>'
- If re-enrolling, delete first: admin must call Delete operation
- Check telemetry logs for "Failed to generate TOTP secret" errors
QR code not scanning in authenticator app:
- Verify issuer is set (some apps reject empty issuer)
- Check algorithm compatibility: SHA1 works universally, SHA256/SHA512 may not
- Ensure digits=6 and period=30 for maximum compatibility
- Try manual entry using the Base32 secret string instead of QR
TOTP code rejected during authentication:
- Clock drift: user device clock may be off by more than skew * period seconds
- Replay protection: code was already used (step <= last_used_step)
- Wrong authenticator entry: user may have multiple entries for same issuer
- Check enrollment status: 'totp status <username>' to confirm enrollment exists
- Verify algorithm matches: stored secret uses algorithm from enrollment time
Recovery code rejected:
- Code already consumed (one-time use, removed from storage after validation)
- No codes remaining: check RecoveryCodesRemaining in status response
- Case sensitivity: codes are case-sensitive
- Storage update failure: check logs for "Failed to consume recovery code"
Replay detection false positives:
- Rapid successive code submissions: same 30-second window generates same code
- Step update failed: if persisting the step counter fails, validation is rejected (fail-closed)
- Check logs for "TOTP replay detected" with step and last_used_step values
TOTP Delete fails:
- Cluster not ready: moduledata requires cluster connectivity
- Delete is idempotent: returns Success=true even if no enrollment exists
Metrics for monitoring:
- totp_enrollments_initiated: Enroll calls (QR generated)
- totp_enrollments_confirmed: Successful ConfirmEnroll (secret persisted)
- totp_enrollments_deleted: Successful Delete calls
- totp_validations_total (result=valid|invalid|replay|clock_backward): Validate outcomes
- totp_recovery_validations_total (result=valid|invalid|no_codes): Recovery code outcomes
Security
Security design and hardening:
Secret generation:
160-bit random secrets (20 bytes) from crypto/rand, Base32-encoded. This gives 160 bits of entropy (2^160 possible secrets), so brute-forcing the secret is computationally infeasible.
Code validation:
Constant-time comparison via crypto/subtle prevents timing attacks. An attacker cannot determine partial code correctness from response time.
Replay protection:
Each successful validation records the time step (LastUsedStep). Subsequent codes at step <= LastUsedStep are rejected. The step update is synchronous (not fire-and-forget) to prevent race conditions. If step persistence fails, validation is rejected (fail-closed). This prevents concurrent requests from replaying the same code.
Recovery codes:
Generated with crypto/rand, stored as SHA-256 hashes. Plaintext is returned to the user exactly once during enrollment confirmation. Each code is consumed (removed) after successful validation. Matching uses constant-time comparison for timing-attack resistance. Consumption is synchronous with fail-closed semantics.
Enrollment security:
Two-phase enrollment: Enroll generates the secret, ConfirmEnroll verifies the first code. This proves the user successfully scanned the QR code and that their authenticator works. Re-enrollment is blocked while an enrollment exists (prevents an overwrite race).
Clock drift tolerance:
The configurable skew parameter allows +/- N time steps. Default skew=1 with period=30 accepts codes from 3 consecutive 30-second windows. Wider skew reduces security: skew=2 means a valid code window of 150 seconds.
Authentication flow integration:
TOTP is a second factor only and is never used as primary authentication. It requires prior successful primary authentication (password, certificate, etc.). An MFA pending session must exist before TOTP validation is attempted. A failed TOTP attempt does not reveal whether the user has TOTP enrolled.
Audit logging:
All operations are logged via telemetry with security context (username). Enrollment initiation, confirmation, validation (success/failure/replay), recovery code use, and deletion all generate structured log entries. Replay attempts are logged at WARN level for security monitoring.
Relationships
Module dependencies and interactions:
- signin: Primary consumer via MFA flow. When RequireMFA includes “passwd” and MFAMethods includes “totp”, users with TOTP enrolled see the authenticator option on the MFA page. After primary auth creates “mfa_pending” session, user submits 6-digit code, signin calls totp.Validate, and on success the signin flow completes the login.
- moduledata: Storage backend for TOTP secrets. Module name “totp” in moduledata stores the per-user secret, algorithm, digits, period, last used step, and recovery codes.
- Directory: Provides user context and group membership. TOTP enrollment status can influence access policies.
- sessions: MFA pending session must exist before TOTP validation. Successful TOTP validation triggers session upgrade to fully authenticated.
- authentication.otp: Sibling MFA method. Users may see both TOTP and email OTP options on the MFA page. TOTP is preferred when enrolled (no email delivery delay).
- config: Reads [authentication.totp] settings dynamically at runtime. Algorithm, digits, and period from enrollment time are stored with the secret, so config changes only affect new enrollments.
- telemetry: Structured logging with security context for all operations. Metrics counters for enrollment, validation, and recovery code operations.
- Admin CLI: TOTP management commands (list enrollments, check status, delete). Admin can delete TOTP enrollment for locked-out users.
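The call that signin makes to totp.Validate combines the skew window, replay protection, and fail-closed persistence described in the Security section. A minimal Python sketch of that logic follows, assuming SHA1 and 6 digits; the function names, the `persist_step` callback, and the result strings are illustrative, and the clock-backward case that appears in the logs is not modelled.

```python
import hmac
import struct


def totp_at_step(secret: bytes, step: int, digits: int = 6) -> str:
    """RFC 6238 code (HMAC-SHA1) for one time step."""
    mac = hmac.new(secret, struct.pack(">Q", step), "sha1").digest()
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def window_seconds(period: int = 30, skew: int = 1) -> int:
    """Total time a code is accepted: current step plus skew steps each side."""
    return (2 * skew + 1) * period


def validate(secret, code, now, last_used_step, persist_step, period=30, skew=1):
    """Return (result, last_used_step) for one validation attempt."""
    current = now // period
    matched = None
    for step in range(current - skew, current + skew + 1):
        # Constant-time comparison; every step in the window is checked.
        if hmac.compare_digest(totp_at_step(secret, step), code):
            matched = step
    if matched is None:
        return "invalid", last_used_step
    if matched <= last_used_step:
        return "replay", last_used_step  # same or earlier step already accepted
    if not persist_step(matched):
        return "rejected", last_used_step  # fail closed if the step cannot be saved
    return "valid", matched
```

With period=30, `window_seconds` gives 90 seconds for skew=1 and 150 seconds for skew=2, the figures quoted in the configuration and security notes.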
Logs
Log entries by component. Search with: logs search “totp”
Levels: ERROR > WARN > INFO > DEBUG > TRACE.
Enroll (secret + QR generation):
totp.enroll          ERROR  Failed to generate TOTP secret
totp.enroll          ERROR  Failed to generate QR code
totp.enroll          INFO   TOTP enrollment initiated
ConfirmEnroll (first-code verification and secret persistence):
totp.enroll.confirm  INFO   TOTP enrollment verification failed - invalid code
totp.enroll.confirm  ERROR  Failed to generate recovery codes
totp.enroll.confirm  ERROR  Failed to store TOTP secret
totp.enroll.confirm  INFO   TOTP enrollment confirmed and persisted
Validate (TOTP code verification):
totp.validate        INFO   AUDIT  TOTP validation failed - no enrollment found
totp.validate        ERROR  AUDIT  Failed to decode stored TOTP secret
totp.validate        INFO   AUDIT  TOTP validation failed - invalid code
totp.validate        WARN   AUDIT  Clock backward detected during TOTP validation - allowing code
totp.validate        WARN   AUDIT  TOTP replay detected - code already used
totp.validate        ERROR  AUDIT  Failed to update last used step - rejecting for safety
totp.validate        INFO   AUDIT  TOTP validation successful
Recovery (one-time recovery code validation):
totp.recovery        INFO   Recovery code validation failed - no enrollment found
totp.recovery        INFO   Recovery code validation failed - no codes remaining
totp.recovery        INFO   Recovery code validation failed - invalid code
totp.recovery        ERROR  Failed to consume recovery code - rejecting for safety
totp.recovery        INFO   Recovery code validated and consumed
Delete (enrollment removal):
totp.delete          INFO   No TOTP enrollment found to delete
totp.delete          INFO   TOTP enrollment deleted
Metrics
Prometheus metrics. Query with: metrics prometheus totp_<name>
Enrollment:
totp_enrollments_initiated       counter  (none)    Enroll calls (QR + secret generated)
totp_enrollments_confirmed       counter  (none)    First code verified, secret persisted
totp_enrollments_deleted         counter  (none)    TOTP enrollment deleted
Validation:
totp_validations_total           counter  {result}  Validation outcomes (result: valid, invalid, replay, clock_backward)
Recovery:
totp_recovery_validations_total  counter  {result}  Recovery code outcomes (result: valid, invalid, no_codes)
Alerts:
rate(totp_validations_total{result="replay"}[5m]) > 0             Replay attack attempt detected
rate(totp_validations_total{result="invalid"}[5m]) > 10           Brute-force attempt on TOTP codes
rate(totp_validations_total{result="clock_backward"}[5m]) > 0     Server clock drift: check NTP sync
rate(totp_recovery_validations_total{result="invalid"}[5m]) > 5   Recovery code probing attempt
WebAuthn Passkeys
FIDO2/WebAuthn passwordless authentication with passkey management and clone detection
Overview
The WebAuthn module implements FIDO2/WebAuthn Level 2 passwordless authentication, acting as a WebAuthn Relying Party (RP). It manages the full passkey lifecycle: registration, authentication, revocation, and expiration monitoring.
Key capabilities:
- Multiple passkeys per user (laptop, phone, YubiKey, etc.)
- Challenge-response registration and authentication ceremonies
- Platform authenticators (Touch ID, Face ID, Windows Hello)
- Cross-platform authenticators (YubiKey, other FIDO2 security keys)
- Attestation statement validation (none, packed, fido-u2f formats)
- Clone detection via signature counter monitoring
- ECDSA P-256 (ES256), RSA-2048 (RS256), and Ed25519 (EdDSA) public key cryptography
- Passkey expiration scheduler with email reminders
- Distributed passkey storage (replicated or shared filesystem)
- Session creation after successful authentication
- Optional device naming for passkey identification
Operations: registration ceremonies, authentication ceremonies, passkey management (revoke, get, list), observability metrics, and scheduled expiration reminders.
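The clone detection capability listed above reduces to a comparison of signature counters. The sketch below restates the rule given in the Security and Troubleshooting sections (enforced only when both counters are non-zero); the function name is hypothetical.

```python
def counter_ok(stored: int, new: int) -> bool:
    """Return False when the signature counter suggests a cloned authenticator."""
    if stored == 0 or new == 0:
        # Authenticators without counter support always report 0 and are exempt.
        return True
    # A genuine authenticator's counter must strictly increase.
    return new > stored
```

A `False` result corresponds to the "counter did not increase" alert and should be treated as a security event.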
Storage architecture follows a layered approach:
- LDAP is the single source of truth for passkey data
- Multi-passkey format: supports multiple passkeys per user with revocation tracking
- Legacy single-passkey format auto-detected and migrated on first write
- Directory module syncs LDAP to memory cache (including passkey data)
- WebAuthn reads passkey data from the directory cache
- No separate passkey cache — eliminates synchronization issues
- Temporary challenge sessions use in-memory storage with 5-10 minute TTL
- Passkey records also persisted to distributed file storage
Config
Configuration under [authentication.webauthn]:
name = "Hexon Identity"                   # RP name shown to users during ceremony
rpid = "login.example.com"                # Relying Party ID (must match origin domain)
origin = "https://login.example.com"      # Origin URL (must match browser origin exactly)
skip_port_check = true                    # Skip port in origin validation (default: true)
type = "preferred"                        # Authenticator type: "platform", "cross-platform", "preferred"
user_verification = "preferred"           # UV policy: required|preferred|discouraged (default: preferred)
validity = "8760h"                        # Passkey validity (default: 8760h = 1 year; "0" = no expiry)
algorithms = ["ES256", "RS256", "EdDSA"]  # Signature algorithms in preference order
attestation = "none"                      # Attestation conveyance: none|indirect|direct|enterprise
allowed_aaguids = []                      # AAGUID allowlist; empty = any (requires attestation=direct)
denied_aaguids = []                       # AAGUID denylist; checked first (requires attestation=direct)
rate_limit_register = "5/1h"              # Registration rate limit per user
rate_limit_auth = "20/1m"                 # Authentication rate limit per user
Signature algorithms (default [“ES256”, “RS256”, “EdDSA”]):
- ES256 (ECDSA P-256 + SHA-256): universal authenticator support
- RS256 (RSA-2048 + SHA-256): covers older smartcards
- EdDSA (Ed25519): modern hardware such as recent YubiKeys, Solo Keys, and iOS 18+ / Android 14+ platform authenticators. Smaller signatures.
Only these three are accepted; unknown names are rejected at boot. Order is the operator's preference list: the authenticator picks the first algorithm it supports. To force ES256-only (for compatibility with strict regulators or legacy verifiers downstream), set algorithms = ["ES256"]. Removing RS256 also blocks legacy smartcards.
Validity semantics:
- Default 8760h (1 year): annual re-confirmation of credential possession
- "0": no expiry. Matches Apple/Google/Microsoft platform-passkey UX (credentials live until explicit revocation). Storage omits valid_until on the record; authentication treats IsZero() as never-expiring; the renewal-reminder scheduler skips zero-validity credentials automatically.
Attestation conveyance (default “none”):
- "none": authenticator omits attestation; AAGUID arrives as zero bytes. Best privacy, fewest browser prompts. Recommended unless you actually consume AAGUID downstream.
- "indirect": authenticator may send anonymized attestation. Browser may strip identifying material; AAGUID enforcement is unreliable.
- "direct": full attestation including real AAGUID and certificate chain. REQUIRED for AAGUID allow/deny enforcement. May trigger an extra browser prompt on some platforms.
- "enterprise": non-anonymized identifiers. Most authenticators require an allow-listed RP ID configured in their manufacturer policy; coordinate with your hardware vendor before flipping.
AAGUID allow/deny lists:
AAGUID = 16-byte UUID identifying authenticator make/model. Use these lists to restrict registration to specific devices (e.g. hardware-key-only deployments).
Both lists require attestation="direct". Boot validation rejects the inconsistent combination because non-direct modes anonymize or omit the AAGUID and would silently block every user.
The denylist is checked first: a denied AAGUID is rejected even if it appears in the allowlist. This lets you express "any hardware key, except this revoked batch" by populating both lists.
When a list is non-empty and the authenticator returns no AAGUID (zero bytes), the registration is rejected with a clear error rather than silently admitting an unidentified credential.
AAGUID values come from the FIDO Metadata Service. See tools/config/authentication/webauthn.toml for a curated starter list of hardware keys (YubiKey, SoloKey, Feitian, Google Titan) and software / platform passkey managers (iCloud Keychain, Google Password Manager, Windows Hello, 1Password, Bitwarden).
User verification tradeoff (single value, applied to both registration and auth):
- "preferred" (default): authenticator decides. Touch ID where available, falls back gracefully. Non-UV-capable credentials can enroll AND authenticate. Best UX, no fallback prompts on macOS. Weakest phishing resistance; suitable when another auth layer (mTLS, network ACL, IAP session binding) is the primary defence.
- "required": Touch ID/PIN every ceremony. Server rejects UV=0 in authData at BOTH registration and authentication: non-UV-capable authenticators cannot enroll, and an enrolled credential that skips UV at auth time is rejected. Strongest phishing resistance per FIDO2 §7.2.9. On macOS: can fall back to account-password prompt if Touch ID isn't accepted first-try.
- "discouraged": skip UV. Reserved for deployments behind another strong auth layer that already provides UV-equivalent guarantees.
Registration and auth MUST share the same value. A “preferred” registration accepts non-UV credentials, which then fail a “required” auth with no recovery path. The getter returns one value consumed by both ceremonies and falls back to “preferred” (same as the default) on any unrecognised input so a config typo never bricks passkey auth.
Migration: flipping from “preferred” to “required” mid-deployment can lock out users whose credentials enrolled without UV. Plan a re-registration window before flipping.
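The user verification and AAGUID rules described in this section can be summarised in a short Python sketch. It is illustrative only: the function names and rejection messages are hypothetical, and AAGUIDs are modelled as raw 16-byte values.

```python
from typing import Optional

ZERO_AAGUID = bytes(16)  # what arrives when attestation omits the AAGUID
UV_POLICIES = {"required", "preferred", "discouraged"}


def effective_uv(configured: str) -> str:
    """Fall back to "preferred" on any unrecognised value (e.g. a config typo)."""
    return configured if configured in UV_POLICIES else "preferred"


def uv_acceptable(policy: str, uv_flag: bool) -> bool:
    """Only "required" rejects a ceremony whose authData has UV=0."""
    return uv_flag or effective_uv(policy) != "required"


def aaguid_rejection(aaguid: bytes, allowed: set, denied: set) -> Optional[str]:
    """Return a rejection reason, or None if registration may proceed."""
    if not allowed and not denied:
        return None  # no lists configured: any authenticator is accepted
    if aaguid == ZERO_AAGUID:
        return "authenticator returned no AAGUID"
    if aaguid in denied:  # the denylist is checked before the allowlist
        return "AAGUID is denied"
    if allowed and aaguid not in allowed:
        return "AAGUID is not in the allowlist"
    return None
```

Because the same `effective_uv` value feeds both ceremonies, a credential enrolled under one policy is always evaluated under that same policy at authentication, until the configuration changes.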
Expiration reminder settings:
renewal_reminder_enabled = true        # Enable expiration reminder emails (default: true)
renewal_reminder_interval = "24h"      # Check frequency (default: "24h")
renewal_reminder_before = "360h"       # Lead time before expiry to start sending (default: 360h = 15 days)
renewal_reminder_timeout = "5m"        # Operation timeout (default: "5m")
renewal_reminder_retries = 3           # Max retry attempts (default: 3)
renewal_reminder_retry_delay = "30s"   # Delay between retries (default: "30s")
Hot-reload behavior:
Hot-reloaded (effect on next ceremony / next scheduler tick):
- validity: new value applies to passkeys registered after the reload; existing credentials keep their previously-stored expiry
- user_verification: applies to the next registration / authentication
- algorithms: applies to the next registration ceremony; existing credentials remain verifiable as long as their algorithm is still one the server supports (ES256, RS256, EdDSA)
- attestation, allowed_aaguids, denied_aaguids: apply to the next registration ceremony; existing credentials are not retroactively re-evaluated against new lists
- Scheduler settings: interval, timeout, retries, retry_delay
Require restart:
- rpid, origin, type, skip_port_check
- Changing these mid-flight breaks validation of already-enrolled passkeys
Cluster storage modes:
Replicated mode (filesystem.mode = "replicated"):
- Passkeys broadcast to all nodes with quorum (>50% must confirm)
- Automatic cross-node synchronization
Shared mode (filesystem.mode = "shared"):
- Passkeys on shared filesystem (NFS), no replication needed
Troubleshooting
Common symptoms and diagnostic steps:
Registration failures (“invalid attestation”):
- RP ID mismatch: rpid must match the domain portion of origin
- Origin mismatch: origin must exactly match the browser URL (scheme + host + port)
- Port issues in containers: set skip_port_check=true for K8s/Docker deployments
- Unsupported attestation format: only none, packed, fido-u2f are supported
- Check config: 'config show authentication.webauthn'
- Diagnose user: 'diagnose user <username>'
Authentication failures (“signature verification failed”):
- Passkey expired: check valid_until in passkey record ('webauthn list <username>')
- Wrong RP ID hash: rpid changed since passkey was registered (requires re-registration)
- Corrupted public key: revoke and re-register the passkey
- Check passkey details: 'webauthn list <username>'
Clone detection alerts (“counter did not increase”):
- Possible cloned authenticator: investigate immediately (security event)
- Counter validation only enforced when both stored and new counters are non-zero
- Some authenticators do not support counters (always 0); this is normal
- Counter wrapped around (rare, requires 2^32 uses)
- Authenticator reset: requires re-registration after investigation
- Check logs: 'logs search "clone" --module=webauthn'
Challenge expired or not found:
- Challenge TTL is 5-10 minutes; user took too long to respond
- Challenge already consumed (single-use; cannot retry with same challenge)
- Memory storage broadcast delay in large clusters
- Retry the ceremony from the beginning (BeginRegistration/BeginAuthentication)
Expiration reminders not being sent:
- Verify scheduler is enabled: renewal_reminder_enabled = true
- Check SMTP health: 'smtp health'
- Verify user has email in directory: 'directory user <username>'
- Disabled users are skipped (by design)
- Check scheduler status: 'health components'
- Only the cluster leader runs the check (leader-only scheduling)
- Look for errors: 'logs search "expiration" --module=webauthn'
Passkey not found during authentication:
- User has no passkey registered: 'webauthn list <username>'
- Specific passkey was revoked: 'webauthn list <username>' shows revoked status
- Credential ID mismatch: browser sending different credential than stored
- Directory sync delay: passkey in LDAP but not yet in memory cache
- Trigger sync: 'directory sync <username>'
- Legacy format issue: check if user's moduledata has old flat format vs new array
502/503 during WebAuthn ceremony:
- Filestorage unavailable: check filesystem health
- Quorum not reached in replicated mode: check cluster status ('cluster status')
- Memory storage broadcast failure: check cluster connectivity ('ping')
Metrics not updating:
- Check metrics endpoint: 'webauthn metrics'
- Verify telemetry module is healthy: 'health components'
Security
Critical security requirements:
Challenge-Response Protocol:
- 32-byte cryptographic random challenges (crypto/rand)
- Single-use: challenge deleted immediately after validation
- TTL: 5-10 minutes, expired challenges rejected
- Together these prevent replay of a captured ceremony response
Clone Detection (Signature Counter):
- Authenticator maintains incrementing signature counter
- On each authentication: new counter must exceed stored counter
- If new <= stored (both non-zero): REJECT, possible cloned authenticator
- Counter=0 authenticators exempt (per WebAuthn specification)
- Counter updates NOT persisted to LDAP (avoids write on every auth)
- Detection works by comparing against registration-time stored value
Attestation Validation:
- Performed during registration for all supported formats
- Current mode: permissive (registration succeeds even if validation fails)
- Validation results logged for security auditing
- For stricter enforcement: modify FinishRegistration to reject failures
- Future: FIDO Metadata Service (MDS) for authenticator trust verification
Origin and RP ID Validation:
- Origin must be HTTPS (WebAuthn specification requirement)
- RP ID must match the domain in the origin URL
- Browser enforces same-origin policy on credentials
- skip_port_check=true relaxes port matching only (not scheme or domain)
Public Key Cryptography:
- Keys stored in COSE format (RFC 8152)
- ES256 (ECDSA P-256): primary algorithm
- RS256 (RSA-2048): secondary algorithm
- EdDSA (Ed25519): accepted when listed in algorithms
- Private keys never leave the authenticator hardware
- Public keys stored base64-encoded in LDAP ModuleData
Rate Limiting:
- Registration: configurable per-user limit (default 5/1h)
- Authentication: configurable per-user limit (default 20/1m)
- Limits brute-force and denial-of-service attempts
Operational security recommendations:
- Monitor clone detection alerts as critical security events
- Set an appropriate validity for your security policy ("8760h" = 1 year is the default; "0" disables expiry)
- Implement passkey rotation procedures
- Revoke passkeys immediately on device loss or compromise
- Enable expiration reminders to prevent credential lapses
- Audit all authentication events via telemetry logs
- Consider enabling stricter attestation for high-security deployments
Relationships
Module dependencies and interactions:
- directory: Primary passkey data source. WebAuthn reads passkeys from the directory’s in-memory cache (synced from LDAP). Also provides user listing for expiration checks. User’s FullName used for personalized reminder emails.
- LDAP: Ultimate source of truth for passkey storage. Passkeys stored in the module data LDAP attribute. The calling layer is responsible for writing passkey data to LDAP after registration.
- filestorage: Distributed credential storage with active/ and revoked/ directories. Supports replicated mode (quorum broadcast) and shared mode (NFS). Used for passkey record persistence alongside LDAP.
- sessions: Creates authenticated sessions after successful WebAuthn authentication. Session module and TTL configurable per-authentication request (e.g., “sshproxy” module, 8h TTL).
- storage.memory: Temporary challenge session storage with broadcast to all cluster nodes. TTL-based expiration (5-10 minutes). Challenges stored under cache type “webauthn_sessions”.
- smtp: Sends passkey expiration reminder emails via SMTP module. ACL enforced: only the webauthn module is authorized to call this operation.
- telemetry: Security audit logging at multiple levels. LevelError for clone detection and signature failures. LevelWarn for expired passkeys and invalid challenges. LevelInfo for successful operations.
- scheduler: Expiration check runs as a leader-only scheduled task (distributed lock for safety). Configurable interval, timeout, retries, and retry delay.
- config: Hot-reloadable configuration via the configuration system. Some fields cached at init (rpid, origin, type) to prevent mid-flight breakage.
External dependency:
- CBOR decoding for attestation objects and COSE key parsing (RFC 8152).
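The scheduler and smtp relationships above come down to two date checks per passkey: whether it has expired, and whether it falls inside the reminder lead time (`renewal_reminder_before`, default 360h). A minimal sketch, assuming a credential with no expiry (validity "0") is modelled as `None` and that boundary handling is as shown; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired(valid_until: Optional[datetime], now: datetime) -> bool:
    """A credential without valid_until never expires."""
    return valid_until is not None and now >= valid_until


def reminder_due(valid_until: Optional[datetime], now: datetime,
                 before: timedelta = timedelta(hours=360)) -> bool:
    """True when the passkey is still valid but expires within the lead time."""
    if valid_until is None:
        return False  # zero-validity credentials are skipped by the scheduler
    return now < valid_until <= now + before
```

A leader-only scheduled task would apply `reminder_due` to each active passkey once per `renewal_reminder_interval` and send an email for each match.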
Logs
Log entries by component. Search with: logs search “webauthn”
Levels: ERROR > WARN > INFO > DEBUG.
Registration:
webauthn.registration    INFO   AUDIT  Begin/finish registration request
webauthn.registration    INFO   Passkey registered / attestation validated
webauthn.registration    WARN   Challenge mismatch / origin mismatch / attestation failed
webauthn.registration    ERROR  Challenge generation / session storage / marshal failures
Authentication:
webauthn.authentication  INFO   AUDIT  New challenge issued
webauthn.authentication  ERROR  AUDIT  E2OE commitment mismatch (Tier 1 binding rejected)
webauthn.authentication  INFO   Auth successful / passkey not found / expired / invalid session
webauthn.authentication  WARN   Origin mismatch / RP ID hash mismatch / signature verification failed
webauthn.authentication  ERROR  ECDH keygen / challenge generation / session storage / cloned device / COSE key failures
webauthn.authentication  DEBUG  Begin/finish request trace / counter validation / auth successful
Enrollment:
webauthn.enroll          INFO   AUDIT  Passkey enrolled (hash, device, active count)
webauthn.enroll          ERROR  Failed to load existing passkeys / failed to store
webauthn.enroll          DEBUG  Enroll request
Revocation:
webauthn.revoke          INFO   AUDIT  Passkey revoked (hash, device, reason, revoked_by)
webauthn.revoke          WARN   No passkeys found / passkey not found in active list
webauthn.revoke          ERROR  Failed to store revoked passkey
webauthn.revoke          DEBUG  Revoke request
Storage:
webauthn.storage         DEBUG  Loading/storing passkeys (active/revoked counts)
webauthn.storage         INFO   Passkeys stored to moduledata
Expiration:
webauthn.expiration      INFO   Check started / completed / reminder sent / disabled / skipping
webauthn.expiration      WARN   Lock acquisition failed
webauthn.expiration      ERROR  Scheduler registration / LoadAll / GetAllUsers failures
Initialization:
webauthn.init            INFO   Provider initialized (RPID, origin, type, validity) / disabled
webauthn.init            ERROR  Initialization failed
Lookup:
webauthn.get             DEBUG  Passkey lookup
webauthn.list            DEBUG  Passkey listing
Metrics
Prometheus metrics. Query with: metrics prometheus webauthn_<name>
Passkey Inventory:
webauthn_passkeys_issued              gauge    {}        Total passkeys ever issued
webauthn_passkeys_active              gauge    {}        Currently active passkeys
webauthn_passkeys_revoked             gauge    {}        Revoked passkeys
webauthn_passkeys_expired             gauge    {}        Expired passkeys
Authentication:
webauthn_auth_attempts                counter  {}        Authentication attempts
webauthn_auth_success                 counter  {}        Successful authentications
webauthn_auth_failed                  counter  {}        Failed authentications
Expiration Monitoring:
webauthn_expiration_check_total       counter  {result}  Expiration checks (success/failure)
webauthn_expiration_passkeys_checked  gauge    {}        Passkeys checked in last run
webauthn_expiration_emails_sent       gauge    {}        Reminder emails sent in last run
webauthn_expiration_reminder_total    counter  {result}  Reminder send attempts (success/failure)
Alerts:
rate(webauthn_auth_failed[5m]) > 20                              High auth failure rate
webauthn_passkeys_active == 0                                    No active passkeys (service unusable)
rate(webauthn_expiration_check_total{result="failure"}[1h])      Expiration check failing
X.509 Client Certificate Authentication
Authenticates users via client certificates — validates external PKI or issues internal certificates with auto-renewal
Overview
Authenticates users by verifying client certificates presented during the TLS handshake. Two modes:
External PKI validation:
Validates client certificates from external PKI infrastructure (FreeIPA, Active Directory). The gateway performs validation only; certificate lifecycle is managed by the external PKI.
Internal CA enrollment:
Issues and manages client certificates via the gateway's built-in ACME CA. Users self-enroll at /signup/x509 after authenticating. Supports auto-renewal, self-revocation, and multi-certificate overlap (max 2 active per user during renewal windows).
Validation is performed as an ordered, defense-in-depth pipeline:
1. Certificate expiration check (NotBefore/NotAfter)
2. TLS handshake validation against ClientCAs pool (chain, signature, trust)
3. Application-level chain validation (full chain verify with client auth usage check)
4. CRL check: O(1) in-memory lookup with atomic map swap (if enabled)
5. Identity extraction from certificate subject (cn, uid, email, or upn)
6. Directory lookup (user exists and is active)
7. OCSP check with cluster-cached responses and configurable soft-fail (if enabled)
8. Session creation with username, email, groups, and certificate metadata
All validation operations are cluster-wide, ensuring consistent behavior regardless of which node handles the authentication request.
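Step 4 of the pipeline relies on an in-memory index that is replaced wholesale on each CRL refresh. The Python sketch below illustrates that pattern under assumptions: the class and method names are hypothetical, serial numbers are plain integers, and the atomicity relies on a single reference assignment rather than on the gateway's actual mechanism.

```python
class CRLIndex:
    """Revoked-serial lookup that is rebuilt off to the side and swapped in."""

    def __init__(self):
        self._revoked = frozenset()

    def refresh(self, serials):
        # Build the complete new index first, so readers never see a
        # half-populated set, then publish it with one reference assignment.
        new_index = frozenset(serials)
        self._revoked = new_index

    def is_revoked(self, serial: int) -> bool:
        # Average O(1) hash lookup against whichever index is current.
        return serial in self._revoked
```

Readers that started a lookup before a refresh simply finish against the previous index; nothing is mutated in place.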
Typical authentication latency:
- Cached path (CRL + cached OCSP): 20-30ms total
- Uncached path (first OCSP query): 70ms-5s depending on ocsp_timeout
- CRL lookup: less than 1ms (in-memory hash map)
- OCSP cached lookup: less than 1ms (cluster memory)
Memory footprint:
- CRL map: ~100 bytes per revoked certificate (10K certs = ~1MB)
- OCSP cache: ~200 bytes per response (1K users = ~200KB)
Config
Core configuration under [authentication.x509]:
[authentication.x509]
enabled = true       # Enable X.509 authentication
ca_pem = """..."""   # CA certificate(s) in PEM format (root + intermediates)
CRL (Certificate Revocation List):
crl_enabled = true                        # Enable CRL-based revocation checking
crl_url = "http://ca.example.com/ca.crl"  # CRL distribution point URL
crl_refresh = "1h"                        # CRL refresh interval (default: 1h)
crl_timeout = "30s"                       # HTTP download timeout (default: 30s)
crl_max_size = 0                          # Max CRL size in bytes (0 = unlimited)
OCSP (Online Certificate Status Protocol):
ocsp_enabled = true                       # Enable OCSP revocation checking
ocsp_url = "http://ocsp.example.com"      # OCSP responder URL
ocsp_cache = "15m"                        # Cache duration for OCSP responses (default: 15m)
ocsp_timeout = "5s"                       # HTTP timeout for OCSP queries (default: 5s)
ocsp_soft_fail = true                     # Allow auth if OCSP is unreachable (default: true)
IMPORTANT: OCSP timeout is independent of operations.wait_timeout. X.509 validation uses a dynamic timeout of ocsp_timeout + 5s buffer. This ensures OCSP queries complete with their full configured timeout regardless of the global wait_timeout.
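The timeout arithmetic and the soft-fail switch above can be made concrete with a small Python sketch. The function names are hypothetical, the duration parser handles only single-unit values such as "5s" or "15m", and `None` stands in for an unreachable responder.

```python
def parse_seconds(value: str) -> float:
    """Parse a single-unit duration such as "5s", "15m", or "1h"."""
    units = {"s": 1, "m": 60, "h": 3600}
    return float(value[:-1]) * units[value[-1]]


def validation_timeout(ocsp_timeout: str = "5s", buffer_s: float = 5.0) -> float:
    """Overall X.509 validation timeout: ocsp_timeout plus a 5s buffer."""
    return parse_seconds(ocsp_timeout) + buffer_s


def ocsp_decision(status, soft_fail: bool = True) -> str:
    """Map an OCSP outcome to "allow" or "deny"; None means unreachable."""
    if status == "good":
        return "allow"
    if status == "revoked":
        return "deny"
    if status is None:
        # Soft-fail lets authentication proceed; hard-fail rejects it.
        return "allow" if soft_fail else "deny"
    raise ValueError(f"status not covered by this sketch: {status!r}")
```

With the defaults, `validation_timeout()` is 10 seconds, so the OCSP query can run for its full 5 seconds whatever operations.wait_timeout is set to.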
Identity Mapping:
[identity.cert_subject_map]
username = "cn"    # Certificate field for username extraction
                   # Options: "cn" (CommonName), "uid" (LDAP UID OID),
                   # "email" (email address), "upn" (AD User Principal Name)
Internal CA Enrollment:
enroll_enabled = true              # Enable self-service certificate enrollment
enroll_validity_days = 365         # Certificate validity period (default: 365)
enroll_algorithm = "ECDSA-P256"    # Key algorithm: "ECDSA-P256" or "RSA-2048"
enroll_max_active_certs = 10       # Max active certificates per user (1-50, default: 10)
enroll_rate_limit = "3/1h"         # Enrollment rate limit per user (default: "3/1h")
revoke_rate_limit = "5/1h"         # Revocation rate limit per user (default: "5/1h")
enroll_p12_min_entropy = 60        # Min entropy bits for PKCS#12 password (default: 60)
Auto-Renewal:
enroll_auto_renew = true                 # Enable automatic renewal before expiry (default: true)
enroll_auto_renew_days = 15              # Days before expiry to trigger renewal (default: 15)
enroll_auto_renew_interval = "24h"       # Check interval for expiring certs (default: "24h")
enroll_auto_renew_timeout = "5m"         # Scheduler operation timeout (default: "5m")
enroll_auto_renew_retries = 3            # Max retry attempts on failure (default: 3)
enroll_auto_renew_retry_delay = "30s"    # Delay between retries (default: "30s")
PKI-Specific Identity Mapping:
FreeIPA: username = "uid" (FreeIPA uses UID, not CN)
Active Directory: username = "upn" (AD uses User Principal Name)
Generic LDAP: username = "cn" (CommonName is default)
Hot-reloadable: ca_pem, CRL settings, OCSP settings, identity mapping, enrollment settings.
Cold (restart required): enabled.
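To make the username mapping concrete, here is a minimal Python sketch of how a configured field could be resolved against an already-validated certificate subject. The function extract_username, the FIELD_FOR_SETTING table, and the attribute labels "CN", "UID", "emailAddress", and "UPN" are assumptions for illustration. The gateway's internal representation is not described here.

```python
# Maps each documented username setting to a subject attribute label.
# The labels on the right are illustrative, not the gateway's internal names.
FIELD_FOR_SETTING = {
    "cn": "CN",
    "uid": "UID",
    "email": "emailAddress",
    "upn": "UPN",
}


def extract_username(subject: dict, setting: str = "cn") -> str:
    """Return the username from a validated certificate subject.

    Raises LookupError when the configured field is absent, which
    corresponds to the "failed to extract identity" condition.
    """
    if setting not in FIELD_FOR_SETTING:
        raise ValueError(f"unsupported username setting: {setting!r}")
    value = subject.get(FIELD_FOR_SETTING[setting], "").strip()
    if not value:
        raise LookupError("failed to extract identity")
    return value


# A FreeIPA-style subject: the UID carries the login name, the CN a display name.
subject = {"CN": "Alice Example", "UID": "alice"}
username = extract_username(subject, "uid")
```

In this example a FreeIPA deployment left on the default "cn" setting would look up "Alice Example" in the directory instead of "alice". That is the mismatch the PKI-specific settings above avoid.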
Troubleshooting
Common error messages and diagnostic steps:
“certificate revoked (CRL)”:
- Certificate serial number found in the downloaded CRL
- Verify revocation status with external CA tools
- User must obtain a new certificate from the PKI
- Check CRL freshness: 'certs x509 metrics' for last refresh time
“user not found in directory”:
- Identity field extracted from certificate does not match any directory user
- Check cert_subject_map.username setting matches your PKI convention
- Use 'directory user <username>' to verify user exists in directory
- Use 'diagnose user <username>' for cross-subsystem check
- Verify directory sync is current: 'directory status'
“failed to extract identity”:
- The configured subject field (cn/uid/email/upn) is missing from the certificate
- Inspect certificate subject with: openssl x509 -in cert.pem -noout -subject
- Change cert_subject_map.username to a field present in the certificate
“OCSP query failed (soft-fail)”:
- OCSP responder is unreachable but authentication proceeds (warning only)
- Soft-fail is the default behavior (ocsp_soft_fail = true)
- Check OCSP URL: 'net http <ocsp_url>'
- Verify OCSP responder is operational
- If hard-fail is required, set ocsp_soft_fail = false
“OCSP query failed (hard-fail)”:
- OCSP responder is unreachable and ocsp_soft_fail = false
- Authentication is blocked until OCSP responder recovers
- Consider enabling soft-fail if OCSP outages are frequent
- Check connectivity: 'net tcp <ocsp_host:port>'
“failed to download CRL”:
- CRL URL is unreachable or returned an error
- Check URL: 'net http <crl_url>'
- Existing in-memory CRL continues to be used until refresh succeeds
- Check for size limits: crl_max_size may be rejecting a large CRL
“certificate validation timeout”:
- OCSP query or validation step exceeded the dynamic timeout
- X.509 uses a dynamic timeout of ocsp_timeout + 5s, NOT operations.wait_timeout
- Increase ocsp_timeout if OCSP responder is slow
- Check OCSP responder latency: 'net latency <ocsp_host:port>'
“certificate expired or not yet valid”:
- Certificate NotBefore/NotAfter check failed
- Check certificate dates: openssl x509 -in cert.pem -noout -dates
- Verify system clock is correct (NTP drift can cause false failures)
Session extension rejected (“x509_revocation”):
- Certificate was revoked after the initial session was created
- Internal CA: serial checked against the revocation index
- External CA: OCSP check performed using stored certificate data from session
- User must obtain a new certificate and re-authenticate
Enrollment failures:
- "rate limit exceeded": user hit enroll_rate_limit, wait for window to reset
- "PKCS#12 password too weak": password entropy below enroll_p12_min_entropy
- "enrollment not enabled": set enroll_enabled = true in config
- Check enrollment metrics: 'certs x509 metrics'
Auto-renewal not working:
- User has no email in directory (skipped with warning)
- User opted out via /signup/x509 status page (auto-renewal opt-out)
- Certificate missing stored certificate data (older certificates)
- enroll_auto_renew = false in config
- Cluster lock contention: only one node processes renewals at a time
- Check: 'certs x509 list' for certificate status per user
Browser not prompting for certificate:
- Firefox: Settings > Privacy & Security > Certificates > View Certificates > Import
- Chrome: Settings > Privacy and Security > Security > Manage Certificates > Import
- Certificate must include ExtKeyUsageClientAuth
- CA certificate must be in browser trust store
- Verify TLS listener has ClientCAs configured (check logs for "x509 CA loaded")
Security
Defense-in-Depth Validation Pipeline:
Six independent validation layers ensure no single check failure compromises security:
1. Certificate expiration (NotBefore/NotAfter checked first, fail-fast)
2. TLS handshake with ClientCAs pool (chain, signature, trust anchor verification)
3. Application-level chain validation (full chain verify with client auth usage check)
4. CRL revocation check -- O(1) in-memory, race-condition safe (if enabled)
5. Directory lookup confirms user exists and is active
6. OCSP real-time revocation check with cluster caching (if enabled)
Identity is extracted ONLY after successful validation. Unvalidated certificate fields are never trusted.
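The fail-fast ordering of the layers can be sketched in Python as a list of checks run in sequence, stopping at the first rejection. This is an illustrative model only: the certificate is a plain dict, the TLS, chain, and OCSP layers are omitted, and run_pipeline plus the check_* factories are hypothetical names. The failure reasons reuse the reason labels of the x509_validation_total metric.

```python
from typing import Callable, Optional

# A check inspects the certificate and returns None on success,
# or a short failure reason on rejection.
Check = Callable[[dict], Optional[str]]


def check_expiration(now: float) -> Check:
    def check(cert: dict) -> Optional[str]:
        if now < cert["not_before"]:
            return "not_yet_valid"
        if now > cert["not_after"]:
            return "expired"
        return None
    return check


def check_crl(revoked_serials: frozenset) -> Check:
    def check(cert: dict) -> Optional[str]:
        return "revoked_crl" if cert["serial"] in revoked_serials else None
    return check


def check_directory(active_users: frozenset) -> Check:
    def check(cert: dict) -> Optional[str]:
        # "cn" stands in for whichever field cert_subject_map selects.
        return None if cert["cn"] in active_users else "user_not_found"
    return check


def run_pipeline(cert: dict, checks: list) -> tuple:
    """Run checks in order and stop at the first failure (fail-fast)."""
    for name, check in checks:
        reason = check(cert)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "validated"


cert = {"not_before": 0, "not_after": 100, "serial": "0a", "cn": "alice"}
checks = [
    ("expiration", check_expiration(now=50)),
    ("crl", check_crl(frozenset({"ff"}))),
    ("directory", check_directory(frozenset({"alice"}))),
]
result = run_pipeline(cert, checks)
```

Because the loop returns on the first failure, later and more expensive layers never run for a certificate that an earlier layer has already rejected.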
TOCTOU Protection for CRL:
CRL updates use atomic.Value to prevent Time-of-Check-Time-of-Use race conditions. The entire revoked serial map is built from the new CRL, then atomically swapped. Readers always see a consistent snapshot. No locks required for O(1) lookups.
Memory Exhaustion Protection:
- CRL downloads have configurable timeout (crl_timeout, default 30s)
- CRL size capped by crl_max_size (prevents DoS via malicious CRL files)
- OCSP responses cached with TTL to limit memory growth
Configurable Soft-Fail OCSP:
When ocsp_soft_fail = true (default), OCSP infrastructure failures allow authentication to proceed. The certificate is already validated by expiration + TLS handshake + chain validation + CRL + directory lookup before OCSP is checked.
IMPORTANT: Revoked certificates ALWAYS block authentication regardless of soft-fail mode. Only infrastructure failures (unreachable, timeout) are affected by the soft-fail setting.
Session TTL Capping:
X.509 sessions are automatically capped to the certificate validity period. Session TTL = min(configured_TTL, cert_not_after - now). This prevents sessions from outliving their authenticating certificate. Applied at both signin (caller-side) and sessions module (defense-in-depth). Example: if certificate expires in 12h but config TTL is 24h, session TTL is capped to 12h.
Session Extension Revocation Check:
When an X.509 session is extended, revocation is re-checked automatically:
- Internal CA: serial checked against the revocation index
- External CA: OCSP cache checked, full OCSP query if certificate data is available
- Revoked certificates always block extension; soft-fail allows extension if OCSP is down
Internal CA Enrollment Security:
- PKCS#12 bundles encrypted with Modern2023 profile (AES-256-CBC, SHA-256 HMAC)
- Minimum password entropy enforced (enroll_p12_min_entropy, default 60 bits)
- Rate limiting on enrollment and revocation endpoints (per-user)
- Re-enrollment auto-revokes ALL existing certificates (fresh start with new key)
- Auto-renewal preserves existing public key (only re-signs with new validity)
- Active certificates per user capped by enroll_max_active_certs (oldest auto-revoked when limit exceeded)
- Revocation reason codes follow RFC 5280
Logging Security:
Certificate serial numbers are logged only at DEBUG level. INFO logs contain username only, preventing information disclosure in production log aggregation systems.
Cluster Caching:
OCSP responses are replicated to all nodes asynchronously. Eventual consistency is acceptable for cache data. Cache TTL is controlled by ocsp_cache config (default 15m).
Relationships
Module dependencies and interactions:
-
directory: User lookup during validation step 5. Confirms user exists and is active, returns email, full name, and group memberships. Also provides email addresses for auto-renewal notifications.
-
sessions: Session creation after successful validation. Session TTL capped to certificate validity. Revocation is re-checked when sessions are extended. Session metadata stores certificate data for external CA OCSP re-checks.
-
acme: Internal CA certificate signing for enrollment. Certificate revocation triggers CRL rebuild. Updated CRL is replicated to all nodes immediately.
-
identity: cert_subject_map configuration determines which certificate field maps to username (cn, uid, email, upn). Shared config section [identity.cert_subject_map].
-
signin: The /signin/x509 route triggers X.509 authentication flow. Validates the certificate and creates a session on success.
-
proxy: Per-mapping mTLS support (mtls=true) uses X.509 for mutual TLS at the route level. Certificate validated against ACME CA bundle or external PKI.
-
cluster: OCSP responses cached in distributed memory and replicated to all nodes. Auto-renewal uses a distributed lock to prevent duplicate processing across cluster nodes.
-
smtp: Auto-renewal sends renewed certificate bundles to users via email. Users without email addresses in directory are skipped with a warning.
-
moduledata: Certificate records stored per-user in the directory backend. Each user can have several active certificates (for example during renewal overlap, up to the enroll_max_active_certs limit), plus a revocation history and an auto-renewal opt-out flag.
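The session TTL cap that links this module to sessions, Session TTL = min(configured_TTL, cert_not_after - now), can be worked through with a short Python sketch. The function name capped_session_ttl is hypothetical, and returning a zero TTL for an already-expired certificate is an assumption of this sketch rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone


def capped_session_ttl(configured_ttl: timedelta,
                       cert_not_after: datetime,
                       now: datetime) -> timedelta:
    """Session TTL = min(configured_TTL, cert_not_after - now)."""
    remaining = cert_not_after - now
    if remaining <= timedelta(0):
        # Assumption: an expired certificate leaves no session lifetime.
        return timedelta(0)
    return min(configured_ttl, remaining)


# Worked example from the text: certificate expires in 12h, config TTL is 24h.
now = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
ttl = capped_session_ttl(
    configured_ttl=timedelta(hours=24),
    cert_not_after=now + timedelta(hours=12),
    now=now,
)
```

Here ttl evaluates to 12 hours, so the session cannot outlive the certificate that authenticated it.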
Logs
Log entries by component. Search with: logs search “x509”
Levels: ERROR > WARN > INFO > DEBUG.
Init & Lifecycle:
x509.init       WARN   JetStream temporarily unavailable, retrying serial index rebuild
x509.init       ERROR  Failed to rebuild serial index after retries
x509.init       ERROR  Failed to initialize CRL
x509.init       INFO   X.509 authentication enabled (CRL disabled)
x509.cleanup    INFO   AUDIT X.509 module cleanup complete
Validate (certificate authentication pipeline):
x509.validate   ERROR  Failed to parse DER certificate
x509.validate   WARN   Certificate not yet valid / Certificate expired
x509.validate   ERROR  No CA certificates available (config + ACME bundle empty)
x509.validate   WARN   Certificate chain validation failed
x509.validate   WARN   Failed to extract identity from certificate
x509.validate   ERROR  Directory lookup failed
x509.validate   WARN   User not found in directory
x509.validate   WARN   Failed to check serial index, falling back to moduledata
x509.validate   ERROR  Failed to check moduledata revocation
x509.validate   WARN   Internal certificate revoked / not in registry - rejecting
x509.validate   WARN   OCSP check failed
x509.validate   INFO   Certificate validated successfully
x509.validate   DEBUG  Validation stage progress (expiration, chain, CRL, identity, OCSP)
Enroll (internal CA certificate issuance):
x509.enroll     INFO   Starting certificate enrollment
x509.enroll     WARN   Invalid username format / Failed to load existing certificate
x509.enroll     ERROR  Failed to enforce certificate limit / generate keypair
x509.enroll     ERROR  Failed to sign certificate with CA / get CA bundle
x509.enroll     ERROR  Failed to generate PKCS#12 password / build PKCS#12 bundle
x509.enroll     ERROR  Failed to store certificate record
x509.enroll     WARN   Failed to store serial index
x509.enroll     INFO   AUDIT Certificate enrolled successfully
Revoke:
x509.revoke     INFO   Revoking certificate
x509.revoke     WARN   Failed to update serial index
x509.revoke     INFO   AUDIT Certificate revoked successfully
Revoke By Serial (self-service):
x509.revokeBySerial  INFO  Revoking certificate by serial
x509.revokeBySerial  WARN  Failed to update serial index
x509.revokeBySerial  INFO  AUDIT Certificate revoked by serial
Revoke All & Enforce Max:
x509.revokeAll   WARN  Failed to update serial index
x509.revokeAll   INFO  AUDIT Revoked certificates for user
x509.enforceMax  WARN  Failed to update serial index
x509.enforceMax  INFO  AUDIT Revoked oldest cert for user (max reached)
CRL:
x509.crl.init     ERROR  Failed to download CRL from any server
x509.crl.init     INFO   CRL loaded successfully
x509.crl          WARN   CRL download failed, trying next URL
x509.crl.refresh  ERROR  Failed to refresh CRL from any server
x509.crl.refresh  INFO   CRL refreshed successfully
x509.crl.refresh  DEBUG  Refreshing CRL
x509.crl.rebuild  WARN   Failed to trigger CRL rebuild
OCSP:
x509.ocsp         DEBUG  OCSP cache hit / cache miss - querying responder(s)
x509.ocsp         WARN   No OCSP URLs configured and certificate has no AIA OCSP extension
x509.ocsp         WARN   OCSP responder failed, trying next
x509.ocsp.check   WARN   All OCSP responders unreachable (soft-fail enabled, allowing authentication)
x509.ocsp.check   ERROR  All OCSP responders unreachable (hard-fail enabled, blocking authentication)
x509.ocsp.check   DEBUG  OCSP query successful
x509.ocsp.serial  WARN   OCSP cache lookup failed / cache wait failed
x509.ocsp.serial  DEBUG  OCSP cache miss for session extension check / OCSP cache hit
Auto-Renewal:
x509.renewal  INFO   Auto-renewal is disabled by configuration
x509.renewal  ERROR  Failed to schedule auto-renewal
x509.renewal  INFO   Auto-renewal scheduler registered
x509.renewal  WARN   Failed to acquire renewal lock / wait for lock acquisition
x509.renewal  INFO   Renewal check already in progress on another node, skipping
x509.renewal  INFO   Starting certificate renewal check
x509.renewal  ERROR  Failed to get all users / GetAllUsers failed / Invalid response
x509.renewal  ERROR  Failed to renew certificate
x509.renewal  INFO   Certificate renewal check completed
x509.renewal  WARN   Skipping renewal - user has no email / no CertificateDER stored
x509.renewal  WARN   Failed to enforce max certs limit
x509.renewal  WARN   Failed to update serial index / get CA bundle / send renewal email
x509.renewal  INFO   Certificate renewed successfully
Session Extension Validator:
x509.session_validator  DEBUG  Checking certificate revocation for session extension
x509.session_validator  WARN   AUDIT X.509 session missing required metadata - allowing extension
x509.session_validator  WARN   Failed to check serial index, falling back to moduledata
x509.session_validator  WARN   AUDIT Session extension rejected: internal certificate revoked
x509.session_validator  WARN   Session extension rejected: internal certificate not in registry
x509.session_validator  WARN   Session extension rejected: external certificate revoked (OCSP/cache)
x509.session_validator  WARN   Soft-fail warnings (revocation check, OCSP, cert parse failures)
x509.session_validator  WARN   OCSP check failed, rejecting extension (hard-fail)
x509.session_validator  WARN   Unknown CA type in session metadata - allowing extension
Revocation Check (hexdcall operation):
x509.check_revoked  DEBUG  Checking certificate revocation status / valid / OCSP passed
x509.check_revoked  WARN   Failed to check serial index / not in registry / no cert DER
x509.check_revoked  INFO   Internal certificate is revoked / External revoked (OCSP)
x509.check_revoked  ERROR  Failed to parse certificate DER
x509.check_revoked  WARN   OCSP check failed for external cert
Recovery (serial index rebuild at startup):
x509.recovery  INFO  Starting serial index recovery from moduledata
x509.recovery  WARN  Invalid x509 data format for user
x509.recovery  WARN  Failed to store serial index for legacy/active/revoked cert
x509.recovery  INFO  Serial index recovery completed / cancelled during shutdown
Storage:
x509.storage  INFO   X509 certificate stored to moduledata
x509.storage  DEBUG  Load/store operations, format parsing
Auto-Renew Opt-Out:
x509.auto_renew  INFO  Auto-renewal opt-out updated
Revoked Certificates Query:
x509.revoked  ERROR  Failed to retrieve serial index
x509.revoked  INFO   Retrieved revoked certificates
x509.revoked  DEBUG  Retrieving all revoked certificates
Metrics
Prometheus metrics. Query with: metrics prometheus x509_<name>
Validation:
x509_validation_total  counter  {result, reason?}  Certificate validation attempts
  result=success                                   Valid certificate authenticated
  result=failure, reason=not_yet_valid             Certificate NotBefore in future
  result=failure, reason=expired                   Certificate past NotAfter
  result=failure, reason=no_ca_available           No CA certs configured
  result=failure, reason=chain_validation_failed   Chain/signature verification failed
  result=failure, reason=revoked_crl               Revoked via CRL (external cert)
  result=failure, reason=invalid_identity          Identity field missing from cert
  result=failure, reason=directory_error           Directory lookup call failed
  result=failure, reason=directory_timeout         Directory lookup timed out
  result=failure, reason=user_not_found            User not in directory
  result=failure, reason=revoked_internal          Revoked via serial index (internal cert)
  result=failure, reason=not_registered            Internal cert not in enrollment registry
  result=failure, reason=revoked_ocsp              Revoked via OCSP (external cert)
Enrollment:
x509_enrollment_total  counter  {result, reason?}  Certificate enrollment attempts
  result=success                                   Certificate issued successfully
  result=failure, reason=invalid_username          Username validation failed
Revocation:
x509_revocation_total  counter  {result, reason}   Certificate revocations
  result=success, reason=<RFC5280 code>            Revocation completed
CRL:
x509_crl_refresh_total  counter  {result}  CRL download/refresh attempts
  result=success                           CRL loaded/refreshed
  result=failure                           Download failed from all URLs
x509_crl_revoked_count  gauge    {}        Number of revoked certs in CRL
x509_crl_size_bytes     gauge    {}        Raw CRL size in bytes
OCSP:
x509_ocsp_query_total  counter  {result, cached}  OCSP lookups
  result=success, cached=true                     Cache hit (memory)
  result=success, cached=false                    Responder queried successfully
  result=failure, cached=false                    All responders unreachable
Auto-Renewal:
x509_auto_renewal_check_total    counter  {result}  Renewal check runs
x509_auto_renewal_total          counter  {result}  Individual cert renewals
  result=success                                    Cert renewed and emailed
  result=failure                                    Renewal failed
x509_auto_renewal_skipped_total  counter  {reason}  Renewals skipped
  reason=no_email                                   User has no email in directory
  reason=no_certificate_der                         No stored cert for key extraction
x509_auto_renewal_certs_checked  gauge    {}        Certs checked in last run
x509_auto_renewal_certs_renewed  gauge    {}        Certs renewed in last run
x509_auto_renewal_certs_skipped  gauge    {}        Certs skipped (opt-out) in last run
x509_auto_renewal_errors         gauge    {}        Errors in last renewal run
Alerts:
rate(x509_validation_total{result="failure"}[5m]) > 10          High validation failure rate
rate(x509_validation_total{reason="revoked_crl"}[5m]) > 0       CRL-revoked cert used (possible compromise)
rate(x509_validation_total{reason="revoked_internal"}[5m]) > 0  Revoked internal cert used
x509_crl_refresh_total{result="failure"} increasing             CRL server unreachable
rate(x509_ocsp_query_total{result="failure"}[5m]) > 0           OCSP responder down
x509_auto_renewal_errors > 0                                    Auto-renewal failures need attention
Onboarding Service
Self-service user onboarding with magic link verification and passkey enrollment
Overview
The onboarding service provides a streamlined SPA flow for new users to verify their email and enroll a passkey. It combines the magic link passwordless flow with WebAuthn passkey registration into a single guided experience.
The service is a single GET endpoint at /onboarding that renders different steps based on the user’s authentication state. All actual operations (magic link, passkey enrollment) are delegated to existing API endpoints — no new backend APIs are needed.
Onboarding flow (4 steps):
Step 0: Email entry (user submits email address)
Step 1: Magic link polling (browser polls for authorization while user clicks link in email)
Step 2: Passkey enrollment (WebAuthn ceremony to register a biometric/hardware key)
Step 3: Success (animated confirmation, auto-redirect to /profile)
Three handler states:
1. No session: render email step (unauthenticated users start here)
2. Authenticated session + no passkey: create mfa_pending session, render passkey step
3. Authenticated session + has passkey: redirect to /profile (already onboarded)
The service is gated by the portal being enabled (portal = true). When portal is disabled, the /onboarding route is not registered.
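The three handler states amount to a small decision function. The Python sketch below models it under stated assumptions: Session, onboarding_action, and the returned action strings are hypothetical names for illustration, not the handler's real types or return values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    username: str
    authenticated: bool


def onboarding_action(session: Optional[Session], has_passkey: bool) -> str:
    """Pick the onboarding step from the session state and passkey status."""
    if session is None or not session.authenticated:
        return "render_email_step"            # state 1: no session
    if not has_passkey:
        return "create_mfa_pending_and_render_passkey_step"  # state 2
    return "redirect:/profile"                # state 3: already onboarded


alice = Session(username="alice", authenticated=True)
first_visit = onboarding_action(None, has_passkey=False)
after_magic_link = onboarding_action(alice, has_passkey=False)
returning_user = onboarding_action(alice, has_passkey=True)
```

Since the decision depends only on the session and the passkey lookup, a page reload after the magic link is authorized moves the user from the email step to the passkey step without any dedicated onboarding API.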
Config
The onboarding service has no dedicated configuration section. It relies on:
[service]
portal = true                 # Must be enabled for onboarding route registration
session_mfa_pending = "5m"    # TTL for the mfa_pending session during passkey enrollment
cookie_name = "hexon"         # Session cookie name (for detecting authenticated users)
cookie_domain = ""            # Cookie domain for cross-subdomain support
[service.signin.magiclink]    # Magic link settings used by /api/signin/magiclink
enabled = true
code_ttl = "10m"
rate_limit = "5/1m"
[protection]
pow = true                    # PoW protection applied automatically (no DisablePoW on route)
The onboarding page inherits PoW protection from the global middleware. Authenticated users skip PoW automatically (valid session cookies are detected).
Endpoints
UI endpoint:
GET /onboarding                    Onboarding SPA page (all steps rendered client-side)
The SPA calls existing API endpoints via fetch():
POST /api/signin/magiclink         Send magic link email (existing signin service)
POST /api/signin/magiclink/poll    Poll for magic link authorization (existing signin service)
POST /api/signup/passkey/begin     Begin WebAuthn registration ceremony (existing signup service)
POST /api/signup/passkey/finish    Complete WebAuthn registration (existing signup service)
On magic link authorization, the poll handler (in the signin service) creates an authenticated “user” session. The onboarding JS then reloads the page, and the handler detects the session, creates an mfa_pending session for passkey enrollment, and renders the passkey step.
Session flow:
1. Poll authorized → session creation creates "user" session + sets hexon cookie
2. Page reload → handler reads hexon cookie → validates user session
3. No passkey found → creates mfa_pending session + sets mfa_session_id cookie
4. Passkey begin/finish use mfa_session_id cookie for authorization
5. On passkey success → JS redirects to /profile
Troubleshooting
Common issues and diagnostic steps:
Onboarding page shows email step despite being logged in:
- Verify session exists: 'sessions list --user=<username>'
- Check session type is "user" with auth_status "authenticated"
- Check cookie: session cookie name must match config (default: hexon)
- PoW interference: if PoW cookie expired, user may be redirected to challenge first
Passkey step not appearing after magic link click:
- Check magic link poll response: should return status "authorized"
- Verify the "user" session was created: 'sessions list --user=<username>'
- JS reloads page after authorized: check for network/redirect issues
- Server log should show "Onboarding: authenticated user entering passkey enrollment"
Passkey registration failing:
- Check mfa_session_id cookie exists and session is valid
- Session TTL: mfa_pending session defaults to 5 minutes (session_mfa_pending config)
- WebAuthn RP ID must match hostname
- Browser must support PublicKeyCredential API (HTTPS required)
- Server logs: look for "Begin registration request" and "FinishRegistration failed"
PoW challenge blocking onboarding:
- Normal behavior for first-time visitors without PoW session cookie
- Authenticated users skip PoW (middleware checks application session)
- PoW session TTL: default 30 minutes (pow_session_ttl config)
Page redirect loop or landing on / after magic link:
- return_url must be HMAC-sealed (handler passes sealed URL to template data)
- Unsealed URLs fall back to "/"
- Check that sealed_return_url is present in onboarding-data JSON
Session proliferation on page refresh:
- Handler reuses existing valid mfa_pending session (checks mfa_session_id cookie first)
- If mfa_session_id expired, a new session is created on refresh (normal behavior)
- Old expired sessions are cleaned up by session TTL
Relationships
Module dependencies and interactions:
-
signin (magiclink): Provides the magic link email flow. POST /api/signin/magiclink initiates the flow, POST /api/signin/magiclink/poll checks status. The poll handler calls session creation which creates the “user” session that onboarding detects.
-
signup (passkey): Provides WebAuthn enrollment. POST /api/signup/passkey/begin and /finish handle the ceremony. Both require a valid mfa_session_id cookie pointing to an mfa_pending session with signup_flow="passkey".
-
sessions: Used for session detection (Validate) and mfa_pending session creation (Create). The handler checks the main session cookie for authenticated users, and creates a separate mfa_session_id cookie for the passkey enrollment session.
-
webauthn: Used to check if user already has a passkey. Users with an existing passkey are redirected to /profile immediately.
-
render: Template rendering. Uses the onboarding manifest entry for CSS/JS asset bundling.
-
locale: i18n translations via template {{t "onb.*"}} function. All UI text comes from locale TOML files ([onb] section in 10 language files).
-
protection (PoW): Global PoW middleware protects the route — unauthenticated users solve PoW challenge before seeing the page.
-
portal: Onboarding route registration is gated by IsPortalEnabled(). Both services share the same user-facing domain.
Logs
Log entries by component. Search with: logs search “onboarding”
Levels: ERROR > WARN > INFO > DEBUG.
Init (route registration):
onboarding.init         INFO   Onboarding disabled (console not enabled)
onboarding.init         INFO   Onboarding service route registered at /onboarding
MFA Session (passkey enrollment session lifecycle):
onboarding.mfa_session  ERROR  Failed to create mfa_pending session for passkey enrollment
onboarding.mfa_session  ERROR  Invalid session response type
Passkey (enrollment flow):
onboarding.passkey      INFO   Onboarding: authenticated user entering passkey enrollment AUDIT
Metrics
This module does not emit its own Prometheus metrics.
Observability is provided indirectly through dependent modules:
- sessions: session_* metrics cover mfa_pending session creation and validation
- webauthn: webauthn_* metrics cover passkey registration ceremonies
- magiclink: magiclink_* metrics cover magic link email and polling
- ratelimit: ratelimit_* metrics cover PoW and request throttling
Sign-In Service
Authentication coordinator with multi-method sign-in, pluggable MFA, magic links, and session management
Overview
The signin service is the central authentication coordinator for Hexon. It orchestrates the complete user sign-in lifecycle across multiple authentication methods and modules, handling primary authentication, multi-factor verification, magic link passwordless flows, and session creation.
Supported primary methods:
- passwd: LDAP password authentication (bind-based, no local password storage)
- passkey: WebAuthn/FIDO2 passwordless (hardware keys, biometrics, phishing-resistant)
- x509: Client certificate authentication (Subject DN to username mapping)
- oidc: OpenID Connect single sign-on via external identity provider
- magiclink: Email-based passwordless authentication (BASE-20 tokens, RFC 8628 polling)
Supported MFA methods (pluggable):
- otp: Email-delivered verification code (via emailotp module)
- totp: Time-based One-Time Password / authenticator apps (RFC 6238)
Authentication flow stages:
1. Primary authentication: credential verification against backend (LDAP/WebAuthn/X.509)
2. MFA challenge (if required): pre-auth session created, MFA code verified
3. Session creation: quorum-replicated across cluster, cookie set
4. Directory sync: fire-and-forget background user data refresh
5. Redirect: user sent to original destination (return_url)
Magic link flow (cross-device passwordless):
1. User submits email on /signin/magiclink
2. Device code created (RFC 8628), BASE-20 token generated (rejection sampling, no modulo bias)
3. Token-to-device-code mapping stored as SHA-256 hashes (tokens never in cleartext)
4. Magic link email sent via SMTP (fire-and-forget, anti-enumeration)
5. Browser polls /api/signin/magiclink/poll every 5 seconds
6. User clicks link on any device, token validated, device code marked authorized
7. Next poll detects authorization, session created on polling browser only
Session security:
- Session rotation after MFA (new ID prevents session fixation attacks)
- MFA pending sessions are short-lived (default 5 minutes) and revoked after upgrade
- Sessions bound to IP address and TLS fingerprint
- Configurable max concurrent sessions per user (default: 1)
- Cluster-wide session storage with quorum replication (available on all nodes)
Config
Configuration under [service.signin] in TOML:
[service.signin]
primary = "passkey"               # Default authentication method shown at /signin
                                  # Options: "passwd", "passkey", "x509", "oidc", "magiclink"
secondary = ["passwd", "x509"]    # Alternative methods (shown as links on sign-in page)
require_mfa = ["passwd"]          # Methods that require MFA after primary auth
                                  # Empty list = MFA never required
mfa_methods = ["otp", "totp"]     # Available MFA methods presented to user
                                  # Order determines default selection
[service.signin.magiclink]
enabled = true                    # Enable magic link passwordless sign-in
code_length = 10                  # Token length in BASE-20 characters (range: 6-40, default: 10)
code_ttl = "10m"                  # Link validity duration (default: 10 minutes)
rate_limit = "5/1m"               # Per-IP rate limit on magic link requests
rate_limit_email = "3/10m"        # Per-email rate limit (anti-flooding protection)
Session configuration (under [service.signin] or related session config):
session_ttl                       # Authenticated session lifetime
session_password_expired          # Session TTL for expired password flow
session_mfa_pending               # Pre-auth session TTL (default: 5 minutes)
max_concurrent_sessions = 1       # Max active sessions per user (default: 1)
Password policy (enforced during passwd authentication):
- Strength validation via zxcvbn algorithm (configurable score 0-4)
- Character requirements: uppercase, lowercase, digits, special characters
- Minimum length and entropy requirements (all configurable via TOML)
- Password expiry enforcement with dedicated session type
MFA settings:
max_retries = 5                   # Maximum MFA verification attempts before lockout
Hot-reloadable: primary method, secondary methods, require_mfa list, mfa_methods, magiclink settings, session TTLs, password policy, rate limits.
Cold (restart required): service.signin.enabled.
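The code_length setting controls how many BASE-20 characters a magic link token contains. A minimal Python sketch of unbiased token generation by rejection sampling, with SHA-256 hashing before storage, is shown here. The alphabet is an assumption: it is the 20-consonant set that RFC 8628 suggests for user codes, and the gateway's actual alphabet is not stated in this document. The function names generate_token and token_digest are hypothetical.

```python
import hashlib
import secrets

# Assumed alphabet: the 20-consonant set suggested in RFC 8628 for user codes.
ALPHABET = "BCDFGHJKLMNPQRSTVWXZ"

# Largest multiple of 20 that fits in one byte. Random bytes at or above this
# limit are discarded so every character is equally likely (no modulo bias).
LIMIT = 256 - (256 % len(ALPHABET))  # 240


def generate_token(code_length: int = 10) -> str:
    """Generate a BASE-20 token using rejection sampling."""
    if not 6 <= code_length <= 40:
        raise ValueError("code_length must be in the range 6-40")
    chars = []
    while len(chars) < code_length:
        byte = secrets.randbits(8)
        if byte >= LIMIT:
            continue  # reject: this byte would skew the distribution
        chars.append(ALPHABET[byte % len(ALPHABET)])
    return "".join(chars)


def token_digest(token: str) -> str:
    """SHA-256 hex digest used as the storage key instead of the cleartext token."""
    return hashlib.sha256(token.encode("ascii")).hexdigest()


token = generate_token()
stored_key = token_digest(token)
```

Taking byte % 20 over all 256 byte values would make the first 16 characters slightly more likely than the last 4. Rejecting bytes from 240 upward removes that skew at the cost of an occasional retry.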
Endpoints
UI endpoints (serve HTML pages):
GET /signin                     Redirect to primary authentication method
GET /signin/passwd              LDAP password sign-in page
GET /signin/passkey             WebAuthn passkey sign-in page
GET /signin/x509                X.509 certificate sign-in page
GET /signin/magiclink           Magic link email form
GET /signin/magiclink/verify    Magic link verification (clicked from email)
GET /signin/mfa                 MFA verification page (OTP or TOTP)
API endpoints (JSON/form):
POST /api/signin                   Authenticate with credentials
  Body: {"method", "username", "password", "remember_me"}
  Returns: success with session_token, or requires_mfa with pre-auth session and available mfa_methods
POST /api/signin/magiclink         Submit magic link request
  Body: email, return_url, auth_flow (form-encoded)
  Returns: device_code and expires_in for polling
  Rate limited: per-IP (5/1m) and per-email (3/10m)
POST /api/signin/magiclink/poll    Poll magic link authorization status
  Body: device_code (form-encoded)
  Returns: {"status":"pending"} or {"status":"authorized","redirect":"..."}
POST /api/signin/mfa               Verify MFA code
  Body: {"method", "code", "session_id" (HMAC-sealed), "trust_device"}
  Returns: success with redirect (session_id not exposed in response)
POST /api/signin/mfa/resend        Resend OTP code (email OTP only)
X.509 over HTTP/3 note: QUIC does not support TLS renegotiation. If a user attempts X.509 auth over HTTP/3 without a client certificate, the server responds with Alt-Svc: clear and a 307 redirect to force retry over HTTP/2, which properly prompts for client certificate selection.
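The two outcomes of POST /api/signin follow from the require_mfa and mfa_methods settings. The Python sketch below models that decision after primary authentication has succeeded. The function signin_outcome and the response keys "status" and "mfa_methods" are hypothetical: the documented response carries a session_token or a pre-auth session, but its exact field names are not given here.

```python
def signin_outcome(method: str, require_mfa: list, mfa_methods: list) -> dict:
    """Decide the next step once primary authentication has succeeded.

    Response keys are illustrative, not the API's confirmed field names.
    """
    if method in require_mfa:
        # MFA is enforced per method; offer the configured second factors
        # in their configured order (the first entry is the default).
        return {"status": "requires_mfa", "mfa_methods": list(mfa_methods)}
    return {"status": "success"}


# Values taken from the example configuration in this section.
require_mfa = ["passwd"]
mfa_methods = ["otp", "totp"]

password_login = signin_outcome("passwd", require_mfa, mfa_methods)
passkey_login = signin_outcome("passkey", require_mfa, mfa_methods)
```

With the example configuration, a password sign-in continues to the MFA step with otp offered first, while a passkey sign-in proceeds directly to session creation. An empty require_mfa list means MFA is never requested.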
Troubleshooting
Common symptoms and diagnostic steps:
Authentication failures (generic “Invalid username or password”):
- LDAP backend unreachable: 'auth ldap' to check connection health - Account locked in LDAP (nsAccountLock attribute): 'directory user <username>' - User not found in directory: 'directory user <username>' to verify existence - Incorrect bind DN or password: check LDAP module configuration - Start with: 'diagnose user <username>' for cross-subsystem checkMFA verification failing:
- TOTP clock drift: user device time must be within 30-second window
- OTP expired: default validity window is short, check 'auth otp'
- Email OTP not delivered: 'smtp health' to verify SMTP service
- Rate limited (429): max_retries exceeded, check 'metrics ratelimit'
- Session expired: MFA pending session has 5-minute TTL by default
- Check MFA session: 'sessions list --user=<username>' for pre-auth sessions
Magic link issues:
- Email not received: 'smtp health' and 'notify health' to verify delivery path
- Anti-enumeration: same response whether email exists or not (by design)
- Token expired: default code_ttl is 10 minutes, check timing
- Rate limited: per-IP (5/1m) or per-email (3/10m), check 'metrics ratelimit'
- Poll returns "pending" indefinitely: verify SMTP delivery, check device code status via 'auth devicecodes'
- "Link already used" error: tokens are single-use, mapping deleted after verify
Session creation failures:
- Cluster quorum not met: 'cluster status' to verify quorum health
- Session replication timeout: check cluster health for latency
- Max concurrent sessions reached: 'sessions list --user=<username>'
- Cookie not set: verify service hostname matches cookie domain
- Session bound to wrong IP: check proxy/load balancer X-Forwarded-For headers
WebAuthn/passkey errors:
- No passkey registered: 'webauthn list <username>' to check enrollments
- Browser not supporting WebAuthn: requires HTTPS and a supported browser
- Relying party ID mismatch: hostname must match RP ID in WebAuthn config
- Challenge expired: WebAuthn challenges are cached temporarily
X.509 certificate sign-in issues:
- Certificate not requested by browser: check TLS configuration
- HTTP/3 fallback: Alt-Svc: clear redirect expected for QUIC connections
- Certificate chain validation failure: check CA bundle configuration
- Subject DN mapping: verify DN-to-username mapping rules
- Check: 'certs x509 list' for registered client certificates
Password policy rejections:
- zxcvbn score too low: user password not meeting strength requirements
- Missing character classes: check uppercase/lowercase/digit/special requirements
- Password expired: user gets dedicated session type, must change password
- Check policy: 'config show service.signin' for password policy settings
Redirect loops after sign-in:
- return_url invalid or pointing to sign-in page itself
- Session cookie domain mismatch: verify service.hostname configuration
- OIDC callback failure: check oidc_providers configuration
- Check: 'sessions list --user=<username>' and 'auth status'
Relationships
Module dependencies and interactions:
- authentication.ldap: Primary backend for passwd method. LDAP bind authentication with connection pooling. Reports account lock status (nsAccountLock). Password policy enforcement (strength, expiry, character requirements).
- authentication.webauthn: Primary backend for passkey method. WebAuthn/FIDO2 credential storage and verification. Hardware key and biometric support.
- authentication.x509: Primary backend for X.509 certificate method. Certificate chain validation, Subject DN to username mapping, revocation checking.
- authentication.oidc: Backend for OIDC single sign-on method. Redirects to external identity provider for authentication.
- authentication.magiclink: Magic link token generation, email composition. Uses BASE-20 encoding with rejection sampling for unbiased token generation.
- authentication.devicecode: RFC 8628 device code flow. Provides polling infrastructure and expiration for magic link authorization tracking.
- authentication.otp: Email OTP generation and verification for MFA. Delivers codes via emailotp module with device fingerprinting.
- authentication.totp: TOTP verification for MFA. Validates RFC 6238 codes from authenticator apps (Google Authenticator, Authy, etc.).
- sessions: Cluster-wide session management with quorum replication. Creates authenticated sessions, MFA pending sessions, and password-expired sessions. Session rotation after MFA completion.
- directory: User data synchronization after authentication (fire-and-forget). Provides user lookup by email (magic link), group membership, account status. Fresh data sync ensures up-to-date authorization after sign-in.
- smtp: Email delivery for magic link messages and OTP codes. Fire-and-forget delivery ensures consistent response timing (anti-enumeration).
- signout: Companion service for session termination and logout flows.
- onboarding: Uses magic link flow for email verification, then transitions to passkey enrollment. The onboarding SPA calls /api/signin/magiclink and /api/signin/magiclink/poll directly via fetch(). After authorization, the poll handler creates a “user” session which onboarding detects on page reload.
- passwordchange: Handles password change flows when password-expired session is active. Redirects back to sign-in after successful change.
- firewall: Network-level access rules applied before sign-in endpoints.
- protection: Rate limiting (fingerprint-based) on all sign-in endpoints. Prevents brute force attacks on credentials and MFA codes.
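The magiclink entry above mentions BASE-20 encoding with rejection sampling. The sketch below illustrates that general technique and is not the module's actual code: the 20-character alphabet, the default token length, and the function name are hypothetical. The idea is that 256 is not a multiple of 20, so mapping every random byte with a modulo would make some symbols slightly more likely than others. Discarding bytes at or above 240 (the largest multiple of 20 that fits in a byte) removes that bias.

```python
import secrets
from typing import Callable

# Hypothetical 20-symbol alphabet; the real one is not given in this document.
ALPHABET = "BCDFGHJKLMNPQRSTVWXZ"

# Largest multiple of len(ALPHABET) that fits in one byte: 240 for base 20.
# Bytes in the range 240..255 are rejected to avoid modulo bias.
LIMIT = 256 - (256 % len(ALPHABET))


def base20_token(
    length: int = 8,  # assumed token length, for illustration only
    rand_bytes: Callable[[int], bytes] = secrets.token_bytes,
) -> str:
    """Return `length` symbols drawn uniformly from ALPHABET."""
    out = []
    while len(out) < length:
        # Draw a batch of random bytes; keep only those below LIMIT.
        for b in rand_bytes(length):
            if b < LIMIT and len(out) < length:
                out.append(ALPHABET[b % len(ALPHABET)])
    return "".join(out)
```

On average 16 of every 256 bytes are rejected, so the loop rarely needs more than one or two batches. The rand_bytes parameter is there so the sampling logic can be checked with fixed input.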
Logs
Log entries by component. Search with: 'logs search "signin"'
Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.
Authentication completion:
- signin.complete INFO Authentication completed
Finalize (session creation after successful auth):
- signin.finalize ERROR AUDIT Failed to create session
- signin.success INFO AUDIT User signed in successfully
Reauth (re-authentication session for protected proxy paths):
- signin.reauth ERROR Failed to create reauth session
- signin.reauth ERROR Unexpected reauth session response type
- signin.reauth INFO AUDIT Reauth session created during signin
LDAP password authentication:
- signin.ldap INFO AUDIT Attempting LDAP authentication
- signin.ldap ERROR LDAP bind call failed
- signin.ldap DEBUG LDAP bind successful, syncing user from directory
- signin.ldap WARN Failed to sync user from directory
- signin.ldap WARN User sync returned failure
- signin.ldap ERROR Failed to get user from directory
- signin.ldap INFO User not found in directory after sync
- signin.ldap INFO AUDIT Account is disabled
- signin.ldap INFO Password expired - creating temporary session for password change
- signin.ldap ERROR Failed to create password_expired session
MFA (multi-factor authentication flow):
- signin.mfa INFO AUDIT MFA required for user
- signin.mfa DEBUG Validating MFA session
- signin.mfa ERROR Session validation wait failed
- signin.mfa INFO MFA session not valid
- signin.mfa DEBUG MFA session validated successfully
MFA post-verification:
- signin.mfa DEBUG MFA verified - retrieving pending session
- signin.mfa.session ERROR Failed to wait for MFA session validation
- signin.mfa.session DEBUG MFA session retrieved - creating authenticated session
- signin.mfa.signup INFO MFA verified for signup - redirecting to passkey registration
- signin.mfa.groups WARN Directory lookup failed after MFA - using cached groups from pending session
- signin.mfa.complete DEBUG Returning success response to client
MFA OTP resend:
- signin.mfa.resend ERROR Failed to generate OTP
- signin.mfa.resend INFO OTP code resent
MFA email OTP verification:
- signin.mfa.otp ERROR OTP validation call failed
- signin.mfa.otp INFO AUDIT OTP validation failed
- signin.mfa.otp WARN OTP generation failed — user can resend from MFA page
MFA TOTP verification:
- signin.mfa.totp ERROR TOTP validation call failed
- signin.mfa.totp INFO AUDIT TOTP and recovery code validation both failed
- signin.mfa.totp INFO AUDIT TOTP validation failed - invalid code
- signin.mfa.totp INFO AUDIT User authenticated via recovery code
- signin.mfa.totp ERROR Failed to check TOTP enrollment status
WebAuthn passkey authentication:
- signin.passkey.begin DEBUG Beginning passkey authentication
- signin.passkey.begin ERROR BeginAuthentication failed
- signin.passkey.begin DEBUG WebAuthn challenge created
- signin.passkey.finish DEBUG Finishing passkey authentication
- signin.passkey.finish INFO FinishAuthentication failed
- signin.passkey.finish ERROR Failed to get user from directory
- signin.passkey.finish INFO User not found in directory after passkey auth
- signin.passkey.finish INFO Account is disabled
- signin.passkey.finish ERROR AUDIT E2OE: failed to persist Tier 1 ECDH state — channel will degrade to baseline
Kerberos SPNEGO authentication:
- signin.kerberos DEBUG Sending Negotiate challenge
- signin.kerberos ERROR AUDIT SPNEGO validation call failed
- signin.kerberos INFO AUDIT SPNEGO authentication failed
- signin.kerberos ERROR AUDIT Failed to create session for SPNEGO user
- signin.kerberos ERROR Invalid session create response
- signin.kerberos INFO AUDIT Kerberos SPNEGO authentication successful
Magic link passwordless authentication:
- signin.magiclink ERROR AUDIT Initiate failed
- signin.magiclink.verify INFO AUDIT Magic link verified
- signin.magiclink.verify ERROR Failed to finalize authentication
X.509 certificate authentication:
- signin.x509 DEBUG X.509 signin handler started
- signin.x509 INFO No client certificate provided
- signin.x509 ERROR Failed to validate certificate
- signin.x509 INFO AUDIT Certificate revoked
- signin.x509 INFO AUDIT Certificate expired
- signin.x509 INFO Certificate not yet valid
- signin.x509 INFO Certificate chain validation failed
- signin.x509 ERROR Certificate validation failed
- signin.x509 INFO Certificate validation failed
- signin.x509 DEBUG Capping session TTL to certificate validity
- signin.x509 ERROR Failed to create session
- signin.x509 ERROR Session creation timeout
- signin.x509 ERROR Invalid session response
- signin.x509 INFO AUDIT X.509 authentication successful
Metrics
This service does not emit its own Prometheus metrics.
Observability is provided indirectly through dependent modules:
- sessions: session_* metrics cover session creation, validation, and revocation
- ldapauth: ldap_* metrics cover LDAP bind authentication
- webauthn: webauthn_* metrics cover passkey authentication ceremonies
- emailotp: otp_* metrics cover OTP generation and validation
- totp: totp_* metrics cover TOTP validation
- magiclink: magiclink_* metrics cover magic link initiation and verification
- ratelimit: ratelimit_* metrics cover brute force protection on signin endpoints
- directory: directory_* metrics cover user sync and lookup