Metrics Reference
Every Prometheus metric across all modules. Types: counter (monotonic), gauge (current value), histogram (distribution buckets), latency (histogram with fixed buckets: <1ms, <10ms, <100ms, <1s, <10s, >10s).
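As an illustration of the fixed buckets used by the latency type, the sketch below maps a duration in seconds to one of the six bucket labels listed above. It is only a sketch: the function and constant names are hypothetical, the buckets are treated as exclusive ranges for readability, and placing a value of exactly 10s in the last bucket is an assumption.

```python
# Upper bounds (in seconds) for the fixed latency buckets named in this reference.
# The names LATENCY_BOUNDS and latency_bucket are illustrative, not part of any module.
LATENCY_BOUNDS = [
    (0.001, "<1ms"),
    (0.010, "<10ms"),
    (0.100, "<100ms"),
    (1.0, "<1s"),
    (10.0, "<10s"),
]


def latency_bucket(seconds: float) -> str:
    """Return the label of the first bucket whose upper bound exceeds the value."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    for bound, label in LATENCY_BOUNDS:
        if seconds < bound:
            return label
    # Assumption: anything at or above 10s lands in the overflow bucket.
    return ">10s"
```

For example, latency_bucket(0.25) falls in the "<1s" bucket and latency_bucket(42.0) in the overflow bucket.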
Reverse Proxy
Load Balancer
Pool Lifecycle:
  loadbalancer_pools_created  counter  {strategy}
      Pools created
  loadbalancer_pools_deleted  counter  {}
      Pools deleted
Backend Selection:
  loadbalancer_selects  counter  {pool_id, strategy}
      Successful backend selections
  loadbalancer_select_failures  counter  {pool_id, reason}
      Failed selections (reason: pool_not_found|no_healthy_backends|algorithm_returned_nil)
  loadbalancer_select_latency  latency  {pool_id}
      Backend selection duration
Health Checks:
  loadbalancer_health_checks  counter  {pool_id, healthy}
      Health check executions (healthy: true|false)
Circuit Breaker:
  loadbalancer_circuit_state_changes  counter  {pool_id, backend_id, from_state, to_state}
      Circuit state transitions (optionally includes a protocol label in per-protocol mode)
  loadbalancer_circuit_resets  counter  {pool_id, backend_id}
      Manual circuit resets
Connections:
  loadbalancer_connections_opened  counter  {pool_id, backend_id}
      Connections opened
  loadbalancer_connections_closed  counter  {pool_id, backend_id}
      Connections closed
  loadbalancer_active_connections  gauge  {pool_id, backend_id}
      Current active connections
  loadbalancer_connection_duration  latency  {pool_id, backend_id}
      Connection duration
  loadbalancer_bytes_sent  counter  {pool_id, backend_id}
      Bytes sent to backends
  loadbalancer_bytes_recv  counter  {pool_id, backend_id}
      Bytes received from backends
Rate Limiting:
  loadbalancer_rate_limit_allowed  counter  {pool_id}
      Requests allowed by rate limiter
  loadbalancer_rate_limit_denied  counter  {pool_id}
      Requests denied by rate limiter
  Note: these metrics aggregate at pool level. For per-user denial details when
  rate_limit_per_user = true, use:
      logs search "rate limit exceeded"    (includes user_id)
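The allowed and denied counters can be combined into a pool-level denial ratio. The sketch below is illustrative only: it assumes two samples of each counter (start and end of a window) have already been read, the helper names are hypothetical, and the reset handling is a simplified version of what a query engine would do.

```python
from typing import Tuple


def counter_increase(start: float, end: float) -> float:
    """Increase of a monotonic counter between two samples.

    If the value dropped, the process is assumed to have restarted and the
    counter began again from zero, so the end value is used as the increase.
    """
    return end if end < start else end - start


def denial_ratio(allowed: Tuple[float, float], denied: Tuple[float, float]) -> float:
    """Fraction of rate-limited requests in the window, per pool.

    Each argument is a (start, end) pair of samples for one pool_id.
    """
    allowed_inc = counter_increase(*allowed)
    denied_inc = counter_increase(*denied)
    total = allowed_inc + denied_inc
    # No traffic in the window: report 0 rather than dividing by zero.
    return denied_inc / total if total > 0 else 0.0
```

Because the counters carry only pool_id, this ratio cannot be broken down per user; the log search in the note above remains the source for that.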
Outlier Detection:
  loadbalancer_outlier_ejections  counter  {pool_id, backend_id, reason}
      Backends ejected (reason: consecutive_5xx|consecutive_gateway|consecutive_local|success_rate|failure_percentage)
  loadbalancer_outlier_readmissions  counter  {pool_id, backend_id}
      Backends auto-readmitted after ejection period
  loadbalancer_outlier_manual_uneject  counter  {pool_id, backend_id}
      Backends manually un-ejected
DNS Discovery:
  loadbalancer_dns_discovery_failures  counter  {pool_id, hostname}
      DNS resolution failures
  loadbalancer_dns_discovery_updates  counter  {pool_id, hostname}
      Backend set updates from DNS
Alerts:
  rate(loadbalancer_select_failures{reason="no_healthy_backends"}[5m]) > 0
      All backends down; check health and outlier state
  rate(loadbalancer_circuit_state_changes{to_state="open"}[5m]) > 0
      Circuit opened; backend degradation
  rate(loadbalancer_outlier_ejections[5m]) > 5
      High ejection rate; systemic backend issues
  rate(loadbalancer_rate_limit_denied[5m]) > 50
      High rate limit denial; check capacity or limits
  rate(loadbalancer_dns_discovery_failures[5m]) > 0
      DNS discovery failing; check DNS config

Request Shadow/Mirror
Request Counts:
  shadow_requests_total  counter  {shadow_name}
      Shadow requests dispatched
  shadow_success_total  counter  {shadow_name}
      Shadow requests with 2xx/3xx responses
  shadow_errors_total  counter  {shadow_name, error_type}
      Shadow request errors (error_type: request_creation|timeout|network|http_error)
Note: error_type "http_error" also includes a "status_code" label.
Latency:
  shadow_request_duration  latency  {shadow_name}
      Shadow request round-trip duration
Alerts:
  rate(shadow_errors_total[5m]) > 0
      Shadow errors occurring; check backend health
  rate(shadow_errors_total{error_type="timeout"}[5m]) > 5
      High timeout rate; increase timeout or check backend
  histogram_quantile(0.99, shadow_request_duration) > 5
      p99 latency exceeds 5s; shadow backend slow

Reverse Proxy
Request Flow:
  proxy_requests  counter  {app, host, auth}
      Successful proxied requests
  proxy_errors  counter  {app, host}
      Proxy request errors
  proxy_backend_duration  latency  {app}
      Backend response time
  proxy_auth_failures  counter  {app, host}
      Authentication failures
  proxy_authz_failures  counter  {app, host}
      Group authorization failures
  proxy_auth_bypass_total  counter  {app, host}
      Auth bypassed (bypass_auth_cidrs)
  proxy_subnet_failures  counter  {app, host}
      Subnet restriction failures
  proxy_reauth_required  counter  {app, host}
      Re-authentication triggered
Caching:
  proxy_cache_hits  counter  {app, type}
      Response cache hits (304/full)
  proxy_cache_misses  counter  {app}
      Response cache misses
  proxy_cache_size  gauge  {}
      Response cache entries
  proxy_cache_invalidated  counter  {}
      Cache entries invalidated on reload
Header Signing:
  proxy_signing_total  counter  {status, app}
      Header signing operations
  proxy_signing_duration  latency  {app}
      Header signing time
  proxy_request_signing_total  counter  {status, app}
      Request signing operations
  proxy_request_signing_duration  latency  {app}
      Request signing time
  proxy_sign_payload_total  counter  {status}
      Payload signing outcomes
  proxy_key_derivation_total  counter  {status}
      Key derivation attempts
  proxy_key_rotation_total  counter  {status}
      Key rotations
  proxy_key_operation_total  counter  {status, operation}
      Key operation failures
  proxy_key_request_total  counter  {status}
      Public key endpoint requests
  proxy_key_request_duration  latency  {}
      Public key endpoint latency
  proxy_signature_verify_total  counter  {status}
      Signature verification requests
  proxy_signature_verify_duration  latency  {}
      Signature verification latency
  proxy_request_signature_verify_total  counter  {status}
      Request signature verification
  proxy_request_signature_verify_duration  latency  {}
      Request signature verify latency
OIDC SSO:
  proxy_oidc_flow_initiated  counter  {host, provider}
      OIDC auth flows started
  proxy_oidc_flow_completed  counter  {host, provider}
      OIDC auth flows completed
  proxy_oidc_flow_failed  counter  {host, reason, provider?}
      OIDC auth flow failures (provider absent on pre-state errors)
  proxy_oidc_state_validation_failed  counter  {reason}
      State validation failures
Per-User Rate Limiting:
  proxy_rate_limit_per_user_denied  counter  {app}
      Requests denied by per-user rate limit
Canary:
  proxy_canary_requests  counter  {app, version}
      Requests routed per version (stable/canary label)
Retry:
  proxy_retry_attempts_total  counter  {app, attempt}
      Retry attempts by attempt number
  proxy_retry_success_total  counter  {app}
      Retries that succeeded
  proxy_retry_budget_exceeded  counter  {app}
      Retries blocked by budget
  proxy_retry_exhausted  counter  {app}
      All retry attempts failed
Hedge:
  proxy_hedge_fired_total  counter  {app}
      Hedge requests sent (primary too slow, value = number of hedges)
  proxy_hedge_won_total  counter  {app}
      Hedge response used (primary was slower or failed)
  proxy_hedge_lost_total  counter  {app}
      All attempts failed (primary + all hedges)
  proxy_hedge_skipped_total  counter  {app}
      Hedge skipped (no different backend, body replay failure)
Circuit Breaker:
  proxy_circuit_breaker_rejections  counter  {app}
      Requests rejected (breaker open)
  proxy_circuit_breaker_fallbacks  counter  {app}
      Fallback service activated
Transport:
  proxy_transport_cache_hits  counter  {}
      HTTP transport reused
  proxy_transport_cache_misses  counter  {}
      New HTTP transport created
  proxy_transport_cache_size  gauge  {}
      Cached transports
  proxy_transport_cache_invalidated  counter  {reason}
      Cache invalidated (CA rotation)
  proxy_http3_transport_cache_hits  counter  {}
      HTTP/3 transport reused
  proxy_http3_transport_cache_misses  counter  {}
      HTTP/3 transport created
  proxy_proxyprotocol_sent  counter  {version}
      PROXY protocol headers sent
  proxy_proxyprotocol_skipped  counter  {reason}
      PROXY protocol skipped
  proxy_ca_pool_version  gauge  {}
      CA pool version
  proxy_optimized_transport_hits  counter  {route, pool}
      Optimized transport cache hits
  proxy_optimized_transport_fallbacks  counter  {route, reason}
      Optimized transport fallbacks
  proxy_transport_pools_created  counter  {pool, route}
      Transport pools created
  proxy_transport_pools_cleaned  counter  {pool}
      Transport pools cleaned
  proxy_pool_registration_failures  counter  {pool, error}
      Pool registration failures
  proxy_pool_cleanup_errors  counter  {pool, error}
      Pool cleanup errors
Rewriting:
  proxy_rewrite_duration  latency  {app}
      HTML rewriting time
  proxy_buffer_pool_gets  counter  {}
      Buffer pool acquisitions
Reload:
  proxy_reload_attempts  counter  {trigger}
      Reload attempts
  proxy_reload_total  counter  {success, reason?}
      Reload results (reason on failure only)
  proxy_reload_skipped  counter  {reason}
      Reload skipped
  proxy_reload_duration  latency  {success}
      Reload time (success path only)
  proxy_routes_configured  gauge  {}
      Active routes
  proxy_routes_changed  gauge  {}
      Routes changed on reload
  proxy_routes_unchanged  gauge  {}
      Routes unchanged on reload
  proxy_routes_added  counter  {}
      Routes added
  proxy_routes_removed  counter  {}
      Routes removed
  proxy_config_hash_changed  counter  {}
      Config hash changes
  proxy_lb_pools_preserved  counter  {app}
      LB pools preserved
  proxy_lb_pools_created  counter  {app, reason}
      LB pools created
Session Monitoring:
  proxy_group_monitor_changes_total  counter  {username}
      Group membership changes
  proxy_group_monitor_updates_total  counter  {username}
      Session metadata updates
  proxy_group_monitor_errors_total  counter  {error_type}
      Monitor errors
  proxy_group_monitor_check_duration  latency  {}
      Check cycle time
FastCGI:
  proxy_fastcgi_requests_total  counter  {mapping, status_class}
      FastCGI RoundTrip exits (status_class: 2xx/3xx/4xx/5xx/error)
  proxy_fastcgi_request_duration  latency  {mapping}
      RoundTrip wall-clock time (auto-bucketed)
  proxy_fastcgi_pool_total  counter  {mapping, result}
      Conn pool acquire (hit/fresh/retry)
  proxy_fastcgi_stderr_bytes_total  counter  {mapping, severity}
      PHP-FPM STDERR bytes routed to audit log
  proxy_fastcgi_proto_status_total  counter  {mapping, status}
      Non-success FCGI_END_REQUEST signals (cant_mpx/overloaded/unknown_role)
Alerts:
  rate(proxy_errors[5m]) / rate(proxy_requests[5m]) > 0.05
      Error rate > 5%
  proxy_circuit_breaker_rejections > 0
      Backend unhealthy
  rate(proxy_auth_failures[5m]) > 10
      Brute-force attempt
  rate(proxy_oidc_state_validation_failed[5m]) > 5
      CSRF/state attack
  proxy_transport_cache_invalidated > 0
      CA rotation event
  rate(proxy_reload_total{success="false"}[5m]) > 0
      Config reload failing
  rate(proxy_retry_budget_exceeded[5m]) > 10
      Retry storm; budget protecting cluster
  rate(proxy_retry_exhausted[5m]) > 5
      Backend failures exhausting retries
  rate(proxy_hedge_fired_total[5m]) / rate(proxy_requests[5m]) > 0.1
      >10% requests hedging; check tail latency
  rate(proxy_fastcgi_requests_total{status_class="error"}[5m]) > 5
      FastCGI transport-level failures (backend unreachable)
  rate(proxy_fastcgi_stderr_bytes_total{severity="error"}[5m]) > 1000
      Sustained PHP-FPM error output (investigate php-fpm.log)

Authentication
Device Code Authorization
Codes:
  devicecode_codes_issued_total  counter  {client_id}
      Device codes generated
Authorization:
  devicecode_authorizations_total  counter  {result}
      Authorization decisions
        result=authorized  User approved device
        result=denied      User denied device
Polling:
  devicecode_polls_total  counter  {status}
      Poll requests by outcome
        status=pending     Awaiting user action
        status=authorized  User authorized
        status=denied      User denied
        status=slow_down   Client polling too fast
        status=expired     Code expired (not instrumented; returns early)
Alerts:
  rate(devicecode_authorizations_total{result="denied"}[5m]) > 10
      High denial rate
  rate(devicecode_polls_total{status="slow_down"}[5m]) > 50
      Clients ignoring poll interval

Just-In-Time Two-Factor Authentication
Operations:
  jit2fa_login_attempts_total  counter  {mapping_id}
      Login interceptions
  jit2fa_webhook_validations_total  counter  {mapping_id, result}
      Webhook results (success/failure)
  jit2fa_webhook_validation_duration  latency  {mapping_id}
      Webhook response time
  jit2fa_otp_verifications_total  counter  {mapping_id, result, reason?}
      OTP results (success/invalid/expired/max_retries/error)
  jit2fa_sessions_created_total  counter  {mapping_id}
      Sessions created
  jit2fa_otp_resends_total  counter  {mapping_id, result}
      OTP resend attempts
  jit2fa_rate_limited_total  counter  {mapping_id}
      Rate-limited requests
Token Handoff:
  jit2fa_handoff_entry_total  counter  {mapping_id, reason, dpop_bound}
      Entry path visits by outcome and DPoP binding state
        reasons: missing_return_url, invalid_return_url, missing_dpop_jkt,
          invalid_dpop_jkt, redirect_login, direct_mint, form_post
          (form_post is the parallel entry: the login POST carried
          _jit2fa_return_url plus an optional _jit2fa_dpop_jkt, and the
          middleware treated the whole thing as a handoff request rather than
          the traditional credential-replay flow)
        dpop_bound: "true" when the caller supplied a valid dpop_jkt query
          parameter (or form field), "false" otherwise. Early-rejection paths
          (before dpop_jkt parse) always emit "false".
  jit2fa_handoff_mints_total  counter  {mapping_id, result, reason?, dpop_bound}
      Mint step outcomes by result, reason, and binding
        failure reasons: revalidate_failed, malformed_return_url,
          missing_identity, missing_dpop_jkt, oidc_error
        dpop_bound: "true" when the minted (or attempted) token carries a
          cnf.jkt confirmation claim. Use this dimension for DPoP adoption
          tracking:
            sum by (dpop_bound) (rate(jit2fa_handoff_mints_total{result="success"}[5m]))
  jit2fa_handoff_mint_duration  latency  {mapping_id}
      Time from finalizeTokenHandoff entry to mint response
  jit2fa_handoff_bearer_checks_total  counter  {mapping_id, result, reason?, dpop_bound}
      Bearer check outcomes by result, reason, and binding
        rejected reasons: empty_token, validator_error, invalid_token,
          audience_mismatch, token_not_dpop_bound, missing_dpop_header,
          dpop_validator_error, dpop_proof_invalid, dpop_jkt_mismatch
        dpop_bound: "true" when the presented token has a cnf.jkt claim,
          "false" otherwise. Early-rejection paths (empty_token,
          validator_error, invalid_token) emit "false" since the token was not
          parsed.
        DPoP usage query:
            sum by (dpop_bound) (rate(jit2fa_handoff_bearer_checks_total{result="accepted"}[5m]))
  jit2fa_handoff_bearer_check_duration  latency  {mapping_id}
      Time from bearer header parse to validation outcome (full cost: JWT
      validate + optional DPoP proof validate)
  jit2fa_handoff_dpop_validation_duration  latency  {mapping_id}
      Isolated cost of oidc.ValidateDPoP alone. This is a component of
      jit2fa_handoff_bearer_check_duration, emitted on every DPoP proof
      validation attempt (success or failure). Use this to tell JWT slowness
      apart from DPoP slowness when the bearer check p99 regresses.
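The adoption-tracking queries above return one series per dpop_bound value. A minimal sketch of turning that result into a single adoption ratio follows; it assumes the query result has already been fetched into a dict keyed by the label value, and the function name and input shape are hypothetical rather than part of any module.

```python
from typing import Dict


def dpop_adoption_ratio(rate_by_binding: Dict[str, float]) -> float:
    """Share of traffic that is DPoP-bound.

    rate_by_binding maps the dpop_bound label value ("true" or "false") to the
    per-second rate returned by a `sum by (dpop_bound) (rate(...))` query.
    """
    bound = rate_by_binding.get("true", 0.0)
    unbound = rate_by_binding.get("false", 0.0)
    total = bound + unbound
    # With no traffic at all, report 0 instead of dividing by zero.
    return bound / total if total > 0 else 0.0
```

The same helper applies to both the mint-time query (result="success") and the bearer-use query (result="accepted"); comparing the two ratios shows whether tokens minted with a binding are the ones actually being presented.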
Token Refresh:
  jit2fa_handoff_refresh_total  counter  {mapping_id, result, reason?}
      Refresh endpoint outcomes (success/failure)
        failure reasons: disabled, parse_error, missing_token, invalid_token,
          wrong_audience, not_dpop_bound, missing_dpop, dpop_invalid,
          dpop_mismatch, missing_auth_time, max_session, mint_failed
  jit2fa_handoff_refresh_duration  latency  {mapping_id}
      Full refresh handler wall-clock latency
Alerts:
  # Backend / operational
  rate(jit2fa_webhook_validations_total{result="failure"}[5m]) > 5
      Webhook backend issues
  jit2fa_otp_verifications_total{reason="max_retries"} > 0
      OTP brute-force attempt
  rate(jit2fa_rate_limited_total[5m]) > 10
      High rate limiting
  # Token handoff: abuse signals (page on these)
  rate(jit2fa_handoff_entry_total{reason="invalid_return_url"}[5m]) > 2
      Possible open-redirect probing against the whitelist
  rate(jit2fa_handoff_bearer_checks_total{reason="audience_mismatch"}[5m]) > 0
      Cross-mapping token replay attempt (alert immediately)
  rate(jit2fa_handoff_bearer_checks_total{reason="invalid_token"}[5m]) > 20
      High invalid-token rate (bot scan or clock drift)
  rate(jit2fa_handoff_bearer_checks_total{reason="dpop_jkt_mismatch"}[5m]) > 0
      DPoP thumbprint mismatch; possible stolen token (alert immediately)
  rate(jit2fa_handoff_refresh_total{reason="dpop_mismatch"}[5m]) > 0
      Refresh with wrong DPoP key; stolen refresh token attempt
  # Token handoff: capacity / latency
  histogram_quantile(0.99, jit2fa_handoff_mint_duration_bucket) > 0.5
      Token signing p99 slow (OIDC signer degraded)
  histogram_quantile(0.99, jit2fa_handoff_bearer_check_duration_bucket) > 0.1
      Bearer check p99 slow (hexdcall / oidc validation contention)
  histogram_quantile(0.99, jit2fa_handoff_dpop_validation_duration_bucket) > 0.05
      DPoP proof validation p99 slow (ECDSA cost or replay cache contention)
  # Token handoff: DPoP rollout tracking (dashboard panels, not alerts)
  sum by (dpop_bound) (rate(jit2fa_handoff_mints_total{result="success"}[5m]))
      Mint-time DPoP adoption ratio
  sum by (dpop_bound) (rate(jit2fa_handoff_bearer_checks_total{result="accepted"}[5m]))
      Bearer-use DPoP adoption ratio
  rate(jit2fa_handoff_bearer_checks_total{reason="token_not_dpop_bound"}[5m])
      Legacy clients on a require_dpop mapping (expected to drop to 0 after rollout)
  rate(jit2fa_handoff_entry_total{reason="missing_dpop_jkt"}[5m])
      Clients hitting a require_dpop entry without dpop_jkt (same signal, earlier in the flow)

Kerberos Ticket Management & SPNEGO Browser SSO
SPNEGO:
  kerberos_spnego_validation_total  counter  {result, reason?}
      SPNEGO validation results
        result=success
        result=failure, reason=invalid_base64|invalid_token|auth_failed|no_credentials|user_disabled
Tickets:
  kerberos_ticket_acquisition_total  counter  {result, reason?}
      Ticket acquisition
        result=success
        result=failure, reason=auth_failed
  kerberos_ticket_refresh_total  counter  {result}
      Ticket refresh (success/failure)
  kerberos_ticket_revocation_total  counter  {result}
      Ticket revocation (success)
  kerberos_tickets_revoked  counter  {}
      Total tickets revoked (bulk count)
Password:
  kerberos_password_change_total  counter  {result}
      Password changes (success/failure)
Alerts:
  rate(kerberos_spnego_validation_total{result="failure"}[5m]) > 10
      SPNEGO failures (keytab/config)
  rate(kerberos_ticket_refresh_total{result="failure"}[5m]) > 0
      Ticket refresh failing (KDC)
  kerberos_spnego_validation_total{reason="user_disabled"} > 0
      Disabled user SPNEGO attempt

LDAP Authentication
  ldap_authentication_total  counter  {result, reason?}
      Authentication attempts
        result=success                              Successful bind
        result=failure, reason=empty_username       Missing username
        result=failure, reason=empty_password       Missing password
        result=failure, reason=service_unavailable  LDAP service error
        result=failure, reason=invalid_credentials  Wrong password
Alerts:
  rate(ldap_authentication_total{result="failure",reason="service_unavailable"}[5m]) > 0
      LDAP server down
  rate(ldap_authentication_total{result="failure",reason="invalid_credentials"}[5m]) > 20
      Brute-force attempt

Magic Link Authentication
  magiclink_initiated_total  counter
      Incremented when a magic link email is successfully queued (valid user,
      within rate limits). Not incremented for decoy flows or unknown emails.
  magiclink_verifications_total  counter  {result}
      Incremented when Verify completes a user action. Values of the result label:
        authorized   user approved sign-in
        denied       user rejected the request
        signin_here  user chose local sign-in
  magiclink_polls_total  counter  {status}
      Incremented on every Poll response. The status label mirrors the returned
      status: pending, authorized, denied, expired, slow_down,
      completed_elsewhere, invalid (empty device code).
Additional observability via dependent modules:
  - devicecode: devicecode_* metrics cover code creation and polling
  - ratelimit: ratelimit_* metrics cover per-IP and per-email throttling
  - sessions: session_* metrics cover magiclink session create/revoke
  - smtp: smtp_* metrics cover magic link email delivery

OIDC Provider
Token Issuance:
  oidc_authcode_generation_total  counter  {result, reason}
      Auth code generation
  oidc_token_exchange_total  counter  {result, reason}
      Code-for-token exchanges
  oidc_token_refresh_total  counter  {result, reason}
      Token refreshes
  oidc_tokens_revoked  counter  {}
      Tokens revoked on logout
  oidc_token_signing_retry_total  counter  {result, reason|attempt}
      Signing retries (threshold signer)
Client Auth:
  oidc_validation_failure_total  counter  {type, client_id}
      PKCE/scope/redirect failures
  oidc_mtls_auth_total  counter  {result, reason|method}
      mTLS auth (failure: reason, success: method)
DPoP:
  oidc_dpop_validation_total  counter  {result, reason}
      Proof validation
  oidc_dpop_jti_replay_total  counter  {detected}
      Replay detections
  oidc_dpop_jti_storage_total  counter  {result}
      JTI cache operations
  oidc_dpop_nonce_generation_total  counter  {result}
      Nonce generation
  oidc_dpop_nonce_storage_total  counter  {result}
      Nonce cache operations
  oidc_dpop_nonce_validation_total  counter  {result, reason}
      Nonce validation
PAR:
  oidc_par_requests_total  counter  {result, client_id}
      PAR creation
  oidc_par_consume_total  counter  {result, client_id}
      PAR consumption
  oidc_par_request_duration  histogram  {client_id}
      PAR processing latency
M2M:
  oidc_client_credentials_total  counter  {result, reason}
      Client Credentials grants
  oidc_jwt_bearer_total  counter  {result, reason}
      JWT Bearer grants
Operations:
  oidc_token_introspection_total  counter  {result, token_type, active}
      Token introspection
  oidc_token_revocation_total  counter  {result, token_type}
      Token revocation
  oidc_userinfo_requests_total  counter  {result, reason}
      UserInfo requests
  oidc_logout_total  counter  {result}
      Logouts
  oidc_device_code_total  counter  {result, reason}
      Device code grants
  oidc_pat_issued_total  counter  {username}
      PAT issuance
Latency:
  oidc_id_token_generation_duration_ms  histogram  {}
      ID token generation
  oidc_access_token_generation_duration_ms  histogram  {}
      Access token generation
  oidc_auth_code_generation_duration_ms  histogram  {}
      Auth code generation
  oidc_entropy_validation_duration_ms  histogram  {}
      Entropy validation
Alerts:
  rate(oidc_dpop_jti_replay_total[5m]) > 0
      DPoP replay attack
  rate(oidc_validation_failure_total[5m]) > 10
      High validation failure rate
  oidc_token_signing_retry_total > 0
      Signing key issues
  rate(oidc_par_consume_total{result="replay_attempt"}[5m]) > 0
      PAR replay attempt

Email OTP
Generation:
  otp_codes_generated  counter  {type}
      OTP codes generated (type: numeric, base20)
Validation:
  otp_validations_total  counter  {result}
      Validation outcomes (result: valid, invalid)
  otp_validation_failures  counter  {reason}
      Failure breakdown (reason: not_found, locked, expired, max_retries, invalid_code)
  otp_replay_prevented  counter  (none)
      OTPs deleted after successful validation (replay prevention)
Alerts:
  rate(otp_validation_failures{reason="max_retries"}[5m]) > 0
      Brute-force attempt (OTP locked after max retries)
  rate(otp_validation_failures{reason="not_found"}[5m]) > 5
      Probing for non-existent OTPs
  rate(otp_codes_generated[5m]) > 20
      Unusual OTP generation rate

TOTP Authenticator
Enrollment:
  totp_enrollments_initiated  counter  (none)
      Enroll calls (QR + secret generated)
  totp_enrollments_confirmed  counter  (none)
      First code verified, secret persisted
  totp_enrollments_deleted  counter  (none)
      TOTP enrollment deleted
Validation:
  totp_validations_total  counter  {result}
      Validation outcomes (result: valid, invalid, replay, clock_backward)
Recovery:
  totp_recovery_validations_total  counter  {result}
      Recovery code outcomes (result: valid, invalid, no_codes)
Alerts:
  rate(totp_validations_total{result="replay"}[5m]) > 0
      Replay attack attempt detected
  rate(totp_validations_total{result="invalid"}[5m]) > 10
      Brute-force attempt on TOTP codes
  rate(totp_validations_total{result="clock_backward"}[5m]) > 0
      Server clock drift; check NTP sync
  rate(totp_recovery_validations_total{result="invalid"}[5m]) > 5
      Recovery code probing attempt

WebAuthn Passkeys
Passkey Inventory:
  webauthn_passkeys_issued  gauge  {}
      Total passkeys ever issued
  webauthn_passkeys_active  gauge  {}
      Currently active passkeys
  webauthn_passkeys_revoked  gauge  {}
      Revoked passkeys
  webauthn_passkeys_expired  gauge  {}
      Expired passkeys
Authentication:
  webauthn_auth_attempts  counter  {}
      Authentication attempts
  webauthn_auth_success  counter  {}
      Successful authentications
  webauthn_auth_failed  counter  {}
      Failed authentications
Expiration Monitoring:
  webauthn_expiration_check_total  counter  {result}
      Expiration checks (success/failure)
  webauthn_expiration_passkeys_checked  gauge  {}
      Passkeys checked in last run
  webauthn_expiration_emails_sent  gauge  {}
      Reminder emails sent in last run
  webauthn_expiration_reminder_total  counter  {result}
      Reminder send attempts (success/failure)
Alerts:
  rate(webauthn_auth_failed[5m]) > 20
      High auth failure rate
  webauthn_passkeys_active == 0
      No active passkeys (service unusable)
  rate(webauthn_expiration_check_total{result="failure"}[1h])
      Expiration check failing

X.509 Client Certificate Authentication
Validation:
  x509_validation_total  counter  {result, reason?}
      Certificate validation attempts
        result=success                                  Valid certificate authenticated
        result=failure, reason=not_yet_valid            Certificate NotBefore in future
        result=failure, reason=expired                  Certificate past NotAfter
        result=failure, reason=no_ca_available          No CA certs configured
        result=failure, reason=chain_validation_failed  Chain/signature verification failed
        result=failure, reason=revoked_crl              Revoked via CRL (external cert)
        result=failure, reason=invalid_identity         Identity field missing from cert
        result=failure, reason=directory_error          Directory lookup call failed
        result=failure, reason=directory_timeout        Directory lookup timed out
        result=failure, reason=user_not_found           User not in directory
        result=failure, reason=revoked_internal         Revoked via serial index (internal cert)
        result=failure, reason=not_registered           Internal cert not in enrollment registry
        result=failure, reason=revoked_ocsp             Revoked via OCSP (external cert)
Enrollment:
  x509_enrollment_total  counter  {result, reason?}
      Certificate enrollment attempts
        result=success                           Certificate issued successfully
        result=failure, reason=invalid_username  Username validation failed
Revocation:
  x509_revocation_total  counter  {result, reason}
      Certificate revocations
        result=success, reason=<RFC5280 code>  Revocation completed
CRL:
  x509_crl_refresh_total  counter  {result}
      CRL download/refresh attempts
        result=success  CRL loaded/refreshed
        result=failure  Download failed from all URLs
  x509_crl_revoked_count  gauge  {}
      Number of revoked certs in CRL
  x509_crl_size_bytes  gauge  {}
      Raw CRL size in bytes
OCSP:
  x509_ocsp_query_total  counter  {result, cached}
      OCSP lookups
        result=success, cached=true   Cache hit (memory)
        result=success, cached=false  Responder queried successfully
        result=failure, cached=false  All responders unreachable
Auto-Renewal:
  x509_auto_renewal_check_total  counter  {result}
      Renewal check runs
  x509_auto_renewal_total  counter  {result}
      Individual cert renewals
        result=success  Cert renewed and emailed
        result=failure  Renewal failed
  x509_auto_renewal_skipped_total  counter  {reason}
      Renewals skipped
        reason=no_email            User has no email in directory
        reason=no_certificate_der  No stored cert for key extraction
  x509_auto_renewal_certs_checked  gauge  {}
      Certs checked in last run
  x509_auto_renewal_certs_renewed  gauge  {}
      Certs renewed in last run
  x509_auto_renewal_certs_skipped  gauge  {}
      Certs skipped (opt-out) in last run
  x509_auto_renewal_errors  gauge  {}
      Errors in last renewal run
Alerts:
  rate(x509_validation_total{result="failure"}[5m]) > 10
      High validation failure rate
  rate(x509_validation_total{reason="revoked_crl"}[5m]) > 0
      CRL-revoked cert used (possible compromise)
  rate(x509_validation_total{reason="revoked_internal"}[5m]) > 0
      Revoked internal cert used
  x509_crl_refresh_total{result="failure"} increasing
      CRL server unreachable
  rate(x509_ocsp_query_total{result="failure"}[5m]) > 0
      OCSP responder down
  x509_auto_renewal_errors > 0
      Auto-renewal failures need attention

RADIUS Authentication (RADSEC + UDP)
Connections:
  radius_connections_total  counter  {nas}
      TCP connections accepted (RADSEC)
Packets:
  radius_packets_total  counter  {transport, nas}
      RADIUS packets received (transport: tcp or udp)
Authentication:
  radius_auth_total  counter  {result, method, nas}
      Auth outcomes (result: accept/reject, method: password/x509/none)
  radius_auth_total  counter  {result, reason, nas}
      Auth rejections with reason (reason: geo, time)
  radius_auth_duration  latency  {result}
      End-to-end auth+authz latency (result: accept/reject)
Mappings:
  radius_mapping_matches_total  counter  {mapping, nas}
      Mapping match counts per mapping name
Errors:
  radius_errors_total  counter  {reason, nas}
      Error counts by reason:
        reason=tls_handshake       TLS handshake failure on RADSEC connection
        reason=hxep_mtls_conflict  HXEP connection rejected; NAS has per-client mTLS
        reason=invalid_frame       RADIUS packet length out of range (< 20 or > 4096)
        reason=incomplete_frame    RADSEC frame body read failed (truncated)
        reason=rate_limit          Per-NAS rate limit exceeded (silent drop)
        reason=concurrent_limit    Global concurrent auth limit reached (silent drop)
        reason=parse_error         RADIUS packet parse failed (bad authenticator / malformed)
        reason=invalid_state       MFA challenge state token invalid or expired
        reason=nas_mismatch        MFA challenge response from different NAS than original

Onboarding Service
Observability is provided indirectly through dependent modules:
  - sessions: session_* metrics cover mfa_pending session creation and validation
  - webauthn: webauthn_* metrics cover passkey registration ceremonies
  - magiclink: magiclink_* metrics cover magic link email and polling
  - ratelimit: ratelimit_* metrics cover PoW and request throttling

Sign-In Service
Observability is provided indirectly through dependent modules:
  - sessions: session_* metrics cover session creation, validation, and revocation
  - ldapauth: ldap_* metrics cover LDAP bind authentication
  - webauthn: webauthn_* metrics cover passkey authentication ceremonies
  - emailotp: otp_* metrics cover OTP generation and validation
  - totp: totp_* metrics cover TOTP validation
  - magiclink: magiclink_* metrics cover magic link initiation and verification
  - ratelimit: ratelimit_* metrics cover brute force protection on signin endpoints
  - directory: directory_* metrics cover user sync and lookup

Identity & Directory
Directory Cache
Sync counters:
  directory_sync_total  counter  {type, result}
      Sync operations completed
        Labels: type="full"|"delta", result="success"
Sync gauges:
  directory_users_synced  gauge  {}
      Users synchronized in last full sync
  directory_groups_synced  gauge  {}
      Groups synchronized in last full sync
Sync latency:
  directory_sync_duration  histogram  {type}
      Sync processing time
        Labels: type="full"|"delta"
Alerts:
  changes(directory_sync_total{result="success"}[10m]) == 0
      No successful syncs (LDAP connectivity)
  directory_sync_duration{type="full"} > 60s
      Full sync taking too long
  changes(directory_users_synced[1h]) == 0
      No syncs completing

LDAP Provider
Operations:
  ldap_operations_total  counter  {operation, status}
      LDAP operation count
        operation=bind, status=success|invalid_credentials|error
            Bind outcomes
        operation=search_users, status=success|error, paged=true|false
            User search outcomes
        operation=search_groups, status=success|error, paged=true|false
            Group search outcomes
  ldap_operation_duration  latency  {operation, status}
      LDAP operation latency (same label sets as ldap_operations_total)
Bind:
  ldap_bind_success  counter  {}
      Successful user binds
  ldap_bind_failures  counter  {reason}
      Failed user binds
        reason=invalid_credentials  Wrong password
        reason=ldap_error           Server/network error
Connection Pool:
  ldap_pool_errors  counter  {reason}
      Pool-level errors
        reason=not_initialized     Pool not ready
        reason=pool_closed         Pool shut down
        reason=config_unavailable  Config missing on reconnect
        reason=reconnect_failed    Reconnect attempt failed
        reason=timeout             Pool wait timeout
  ldap_pool_reconnects  counter  {reason}
      Pool reconnections
        reason=stale_connection  Stale conn on acquire
        reason=stale_on_release  Stale conn on release
  ldap_pool_acquire_duration  latency  {reconnected}
      Time to acquire connection
        reconnected=true|false
  ldap_pool_available  gauge  {}
      Available connections in pool
  ldap_pool_capacity  gauge  {}
      Total pool capacity
  ldap_pool_utilization_pct  gauge  {}
      Pool utilization percentage
Search Results:
  ldap_search_results  gauge  {type}
      Result set size
        type=users|groups
  ldap_paged_search_pages  gauge  {type}
      Pages fetched in paged search
        type=users|groups
Alerts:
  ldap_pool_utilization_pct > 90
      Pool near exhaustion
  rate(ldap_pool_errors{reason="timeout"}[5m]) > 0
      Pool starvation
  rate(ldap_bind_failures{reason="ldap_error"}[5m]) > 0
      LDAP server issues
  rate(ldap_operations_total{status="error"}[5m]) > 5
      Search failures

OIDC Relying Party
Authorization flow:
  oidc_rp_authorization_initiated_total  counter  {provider}
      Authorization URL built successfully
  oidc_rp_state_validation_success_total  counter  {provider}
      State validated and session consumed
  oidc_rp_state_validation_failures_total  counter  {reason}
      State validation failures (reason: decryption_failed, version_mismatch,
      state_expired, session_not_found, state_mismatch, csrf_mismatch)
Token exchange: oidc_rp_token_exchange_success_total counter {provider} Code-for-tokens exchange succeeded oidc_rp_token_exchange_failures_total counter {provider, reason} Exchange failures (reason: network_error, id_token_invalid, at_hash_mismatch, or IdP error code) oidc_rp_token_exchange_duration latency {provider} Token endpoint round-trip time
Token refresh: oidc_rp_token_refresh_success_total counter {provider} Refresh token exchange succeeded oidc_rp_token_refresh_failures_total counter {provider, reason} Refresh failures (reason: network_error, or IdP error code) oidc_rp_token_refresh_duration latency {provider} Token refresh round-trip time
Token revocation: oidc_rp_token_revocation_success_total counter {provider} Revocation acknowledged by IdP oidc_rp_token_revocation_failures_total counter {provider, reason} Revocation failures (reason: network_error, or IdP error code)
Token introspection: oidc_rp_token_introspection_active_total counter {provider} Introspection returned active=true oidc_rp_token_introspection_inactive_total counter {provider} Introspection returned active=false oidc_rp_token_introspection_failures_total counter {provider, reason} Introspection failures (reason: network_error, or IdP error code) oidc_rp_token_introspection_duration latency {provider} Introspection round-trip time
ID token validation: oidc_rp_id_token_validation_success_total counter {provider} ID token signature and claims validated
Discovery: oidc_rp_discovery_cache_hits_total counter {provider} Discovery served from cache oidc_rp_discovery_cache_misses_total counter {provider} Discovery cache miss, fetched from IdP oidc_rp_discovery_fetch_success_total counter {provider} Discovery fetched and validated oidc_rp_discovery_fetch_failures_total counter {provider} Discovery fetch failed oidc_rp_discovery_fetch_duration latency {provider} Discovery endpoint round-trip time
JWKS: oidc_rp_jwks_cache_hits_total counter {provider} JWKS key found in cache oidc_rp_jwks_fetch_success_total counter {provider} JWKS fetched from IdP oidc_rp_jwks_fetch_failures_total counter {provider} JWKS fetch failed oidc_rp_jwks_fetch_duration latency {provider} JWKS endpoint round-trip time
DPoP: oidc_rp_dpop_jti_replay_total counter {provider} DPoP JTI replay attack detected oidc_rp_dpop_validation_success_total counter {provider} DPoP proof validated successfully
PAR (Pushed Authorization Requests): oidc_rp_par_success_total counter {provider} PAR request accepted by IdP oidc_rp_par_failures_total counter {provider, reason} PAR failures (reason: discovery_failed, not_supported, invalid_redirect_uri, network_error, http_error, invalid_expires_in, expires_in_too_large, or IdP error code) oidc_rp_par_request_duration latency {provider} PAR endpoint round-trip time oidc_rp_par_authorization_success_total counter {provider} Authorization URL built via PAR flow
UserInfo: oidc_rp_userinfo_success_total counter {provider} UserInfo fetched successfully oidc_rp_userinfo_failures_total counter {provider, reason} UserInfo failures (reason: network_error, http_<status>) oidc_rp_userinfo_fetch_duration latency {provider} UserInfo endpoint round-trip time
Alerts: rate(oidc_rp_state_validation_failures_total{reason="csrf_mismatch"}[5m]) > 0 CSRF attack attempt rate(oidc_rp_state_validation_failures_total{reason="decryption_failed"}[5m]) > 5 State tampering or key mismatch rate(oidc_rp_dpop_jti_replay_total[5m]) > 0 DPoP replay attack attempt rate(oidc_rp_discovery_fetch_failures_total[5m]) > 3 IdP discovery unreachable rate(oidc_rp_token_exchange_failures_total[5m]) > 10 Elevated token exchange failuresSCIM Identity Provider
SCIM client counters (module: scim_client):
  scim_client_list_failures_total  counter  {provider, endpoint, reason}  Paginated list failures
  scim_client_request_errors_total  counter  {provider, reason}  HTTP request errors (network/timeout)
  scim_client_requests_total  counter  {provider, status}  HTTP requests by status code
  scim_client_oauth2_failures_total  counter  {provider, reason}  OAuth2 token refresh failures
  scim_client_oauth2_success_total  counter  {provider}  OAuth2 token refresh successes
SCIM client latency (module: scim_client):
  scim_client_request_duration  histogram  {provider, method}  Per-request HTTP latency
  scim_client_list_duration  histogram  {provider, endpoint}  Full paginated list latency
SCIM client gauges (module: scim_client):
  scim_client_list_results  gauge  {provider, endpoint}  Resources returned from last list
Sync counters (module: identity.scim):
  identity_scim_sync_started  counter  {provider, sync_type}  Sync operations started
  identity_scim_sync_completed  counter  {provider, sync_type, status}  Sync operations completed
  identity_scim_sync_failed  counter  {provider, sync_type, status}  Sync operations failed
  identity_scim_delta_fallback_to_full  counter  {provider}  Delta syncs that fell back to full
  identity_scim_circuit_opened  counter  {provider}  Circuit breaker open events
  identity_scim_circuit_closed  counter  {provider}  Circuit breaker close events
  identity_scim_deletions_blocked  counter  {provider, reason}  Deletion operations blocked by safety thresholds
Sync gauges (module: identity.scim):
  identity_scim_users_synced  gauge  {provider}  Users from last sync
  identity_scim_groups_synced  gauge  {provider}  Groups from last sync
Sync latency (module: identity.scim):
  identity_scim_sync_duration  histogram  {provider, sync_type, status}  Sync processing time
Directory apply counters (module: identity.scim):
  identity_scim_users_created  counter  {provider, source}  Users created in directory
  identity_scim_users_updated  counter  {provider, source}  Users updated in directory
  identity_scim_users_disabled  counter  {provider, source}  Users disabled in directory
  identity_scim_users_deleted  counter  {provider, source}  Users deleted from directory
  identity_scim_groups_created  counter  {provider, source}  Groups created in directory
  identity_scim_groups_updated  counter  {provider, source}  Groups updated in directory
  identity_scim_groups_deleted  counter  {provider, source}  Groups deleted from directory
  identity_scim_sync_errors  counter  {provider, source}  Per-operation sync errors
Webhook counters (module: identity.scim):
  identity_scim_webhook_total  counter  {provider, result}  Webhook events by result
    Labels: result="success"|"unknown_provider"|"provider_disabled"|"no_secret_configured"|
      "empty_payload"|"payload_too_large"|"missing_signature"|"invalid_signature"|
      "parse_error"|"unknown_event_type"|"missing_timestamp"|"stale_event"|
      "duplicate"|"dedup_failed_closed"|"dedup_impossible"|"deletion_budget_exceeded"|
      "apply_error"
Alerts:
  changes(identity_scim_sync_completed{status="success"}[30m]) == 0  No successful syncs
  identity_scim_circuit_opened > 0  Circuit breaker tripped
  rate(identity_scim_webhook_total{result="invalid_signature"}[5m]) > 0  Webhook signature failures
  identity_scim_sync_duration > 120s  Sync taking too long
SSH & SQL Bastion
SSH Bastion Gateway
Connections:
  bastion_connections_total  counter  {}  Total connections
  bastion_connections_active  gauge  {}  Active connections
  bastion_connections_rejected  counter  {reason}  Rejected connections
  bastion_connection_limit_hits_total  counter  {limit_type}  Connection limit enforced
  bastion_connection_duration  latency  {}  Connection lifetime
Authentication:
  bastion_auth_attempts_total  counter  {client_ip}  Auth attempts
  bastion_auth_success_total  counter  {username, client_ip}  Successful auths
  bastion_auth_failures_total  counter  {client_ip, reason}  Auth failures
  bastion_auth_bans_total  counter  {client_ip, reason}  Client bans
  bastion_auth_duration  latency  {status}  Auth operation time
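Counters such as bastion_auth_failures_total only ever increase, so the alert expressions for this module compare their per-second rate over a window against a threshold. The sketch below approximates that idea from two samples and treats a decrease as a counter reset. It is a simplification: PromQL's rate() also extrapolates to the window boundaries, and the function name is hypothetical.

```python
def simple_rate(prev: float, curr: float, elapsed_seconds: float) -> float:
    """Approximate per-second increase of a counter between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # A counter that went down has been reset (for example after a restart),
    # so the new value is itself the increase since the reset.
    increase = curr - prev if curr >= prev else curr
    return increase / elapsed_seconds
```

For example, a counter that moves from 100 to 400 over 300 seconds has a rate of 1.0 per second under this approximation.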
Sessions:
  bastion_sessions_total  counter  {username, client_ip}  Sessions created
  bastion_sessions_active  gauge  {}  Active sessions
  bastion_session_duration  latency  {username}  Session lifetime
  bastion_sessions_rejected  counter  {reason}  Sessions rejected
  bastion_session_limit_hits_total  counter  {limit_type}  Session limit enforced
Commands:
  bastion_commands_total  counter  {command}  Commands executed
  bastion_command_duration  latency  {command}  Command execution time
  bastion_commands_rate_limited_total  counter  {}  Commands rate limited
DoS Protection:
  bastion_rate_limit_hits_total  counter  {limit_type}  Rate limits enforced
Resources:
  bastion_history_entries_total  gauge  {}  Command history entries
  bastion_port_forwards_active  gauge  {}  Active port forwards
  bastion_health_status  gauge  {}  Health (1=healthy, 0=unhealthy)
  bastion_buffer_pool_gets  counter  {}  Buffer pool acquisitions
  bastion_buffer_pool_puts  counter  {}  Buffer pool releases
SFTP:
  bastion_sftp_sessions_total  counter  {auth_type}  SFTP sessions
  bastion_ssrf_blocks_total  counter  {block_type}  SSRF blocks
SQL Bastion:
  bastion_sql_queries_total  counter  {site, status}  SQL queries executed
  bastion_sql_query_duration  latency  {site, db_type, user}  SQL query time
  bastion_sql_acl_rejections  counter  {site, reason}  SQL ACL rejections
Alerts:
  bastion_auth_bans_total > 0  Client banned (brute force)
  rate(bastion_auth_failures_total[5m]) > 20  High auth failure rate
  bastion_connections_active > 500  Connection saturation
  bastion_sessions_active > 200  Session saturation
  rate(bastion_ssrf_blocks_total[5m]) > 0  SSRF attempt detected
  rate(bastion_sql_acl_rejections[5m]) > 5  SQL ACL violations
Certificates & PKI
ACME CA Server
ACME Provider Rate Limiting (namespace: acme_provider):
  acme_provider_ratelimit_checks_total  counter  {}  Total rate limit check invocations
  acme_provider_ratelimit_check_results_total  counter  {limit_type, result}  Rate limit check outcomes
    limit_type=all, result=passed  All checks passed
    limit_type=<type>, result=blocked  Operation blocked by specific limit type
  acme_provider_ratelimit_check_duration  latency  {}  Rate limit check duration
  acme_provider_ratelimit_circuit_breaker_trips_total  counter  {limit_type}  Circuit breaker trips (blocking after consecutive state errors)
  acme_provider_ratelimit_orders_created_total  counter  {}  Orders recorded for rate limiting
  acme_provider_ratelimit_certs_issued_total  counter  {}  Certificates recorded for rate limiting
  acme_provider_ratelimit_domain_issuances_total  counter  {domain}  Issuances per registered domain
  acme_provider_ratelimit_auth_failures_total  counter  {domain}  Authorization failures per registered domain
  acme_provider_ratelimit_finalization_failures_total  counter  {}  Failed finalization attempts recorded
  acme_provider_ratelimit_state_errors_total  counter  {limit_type, operation}  Distributed state access errors
  acme_provider_ratelimit_approaching_total  counter  {limit_type}  Warning: nearing rate limit capacity (80%+)
  acme_provider_ratelimit_current_usage  gauge  {limit_type}  Current usage count for limit dimension
  acme_provider_ratelimit_limit  gauge  {limit_type}  Configured limit for dimension
  acme_provider_ratelimit_usage_percent  gauge  {limit_type}  Usage as percentage of limit
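The usage gauges and the approaching counter describe the same relationship: current usage measured against the configured limit, with a warning at 80% or more. A minimal sketch of that arithmetic, assuming usage_percent is simply current_usage divided by limit times 100. The helper names and the handling of a zero limit are hypothetical.

```python
APPROACHING_THRESHOLD_PCT = 80.0  # "nearing rate limit capacity (80%+)"


def usage_percent(current_usage: int, limit: int) -> float:
    """Usage as a percentage of the configured limit (assumed formula)."""
    if limit <= 0:
        return 0.0  # assumption: an unset or disabled limit reports 0%
    return 100.0 * current_usage / limit


def is_approaching(current_usage: int, limit: int) -> bool:
    """True when usage is at or above the 80% warning level."""
    return usage_percent(current_usage, limit) >= APPROACHING_THRESHOLD_PCT
```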
SPIFFE (namespace: spiffe):
  spiffe_accounts_created_total  counter  {workload}  SPIFFE accounts created
  spiffe_orders_created_total  counter  {workload}  SPIFFE orders created
  spiffe_orders_finalized_total  counter  {workload}  SPIFFE orders finalized (issuance started)
  spiffe_certificates_issued_total  counter  {workload}  SPIFFE certificates issued successfully
  spiffe_certificate_issuance_errors_total  counter  {workload, reason}  SPIFFE certificate issuance failures
  spiffe_certificates_revoked_total  counter  {workload}  SPIFFE certificates revoked
  spiffe_certificate_retrievals_total  counter  {}  SPIFFE certificate downloads
  spiffe_certificate_retrieval_errors_total  counter  {reason}  SPIFFE certificate download errors
  spiffe_trust_bundle_requests_total  counter  {}  SPIFFE trust bundle requests
  spiffe_issuance_queue_depth  gauge  {}  Current concurrent issuance goroutines
  spiffe_issuance_queue_full_total  counter  {workload}  Issuance rejected due to queue full
  spiffe_ca_signing_duration_ms  histogram  {}  CA signing ceremony latency (ms)
  spiffe_ratelimit_current_usage  gauge  {workload}  Current rate limit usage per workload
  spiffe_ratelimit_blocked_total  counter  {workload}  Requests blocked by rate limit
  spiffe_ratelimit_check_error_total  counter  {workload, reason}  Rate limit check errors (fail-open)
  spiffe_ratelimit_record_error_total  counter  {workload, reason}  Rate limit recording errors
  spiffe_ratelimit_record_success_total  counter  {workload}  Rate limit entries recorded successfully
ACME Client
Issuance counters (module: acmeclient):
  acmeclient_issuance_started_total  counter  {domain}  Certificate issuance started
  acmeclient_issuance_success_total  counter  {domain, key_type}  Certificate issuance succeeded
  acmeclient_issuance_failed_total  counter  {domain, error_type}  Certificate issuance failed
    Labels: error_type="timeout"|"rate_limit"|"authorization"|"network"|"dns"|"invalid_request"|"not_found"|"unknown"|"none"
Issuance latency (module: acmeclient):
  acmeclient_issuance_duration  histogram  {domain, key_type}  End-to-end issuance time
Renewal counters (module: acmeclient):
  acmeclient_renewal_checks_total  counter  (no labels)  Renewal check cycles executed
  acmeclient_renewal_success_total  counter  {domain}  Certificate renewals succeeded
  acmeclient_renewal_failed_total  counter  {domain, error_type}  Certificate renewals failed
Challenge counters (module: acmeclient):
  acmeclient_challenges_stored_total  counter  {domain}  Challenge tokens stored in distributed cache
  acmeclient_challenges_served_total  counter  {status}  Challenge responses served
    Labels: status="success"|"not_found"|"invalid_token"|"lookup_error"|"internal_error"|"invalid_value"|"write_error"
Certificate gauges (module: acmeclient):
  acmeclient_certificates_checked  gauge  (no labels)  Certificates checked in last renewal cycle
  acmeclient_certificates_expiring  gauge  (no labels)  Certificates needing renewal in last cycle
  acmeclient_certificates_loaded  gauge  (no labels)  Total certificates in memory cache
  acmeclient_certificate_days_until_expiry  gauge  {domain}  Days until certificate expires
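acmeclient_certificate_days_until_expiry is a per-domain gauge. One way such a value could be derived from a certificate's notAfter timestamp is sketched below. Whether the exporter reports whole or fractional days is not stated in this reference, so the fractional result and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Fractional days between now and the certificate's notAfter time."""
    if not_after.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    # Dividing two timedeltas yields a float; the result is negative once expired.
    return (not_after - now) / timedelta(days=1)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry(now + timedelta(days=6, hours=12), now))  # 6.5
```

A value of 6.5 would fall under the certificate_days_until_expiry < 7 condition listed in the Alerts for this module.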
ARI counters (module: acmeclient):
  acmeclient_ari_fetch_total  counter  {result}  ARI fetch attempts
    Labels: result="success"|"error"
  acmeclient_ari_early_renewal_suggestions_total  counter  (no labels)  CA-suggested early renewals (possible revocation)
  acmeclient_ari_renewals_total  counter  {domain}  ARI-guided certificate renewals completed
  acmeclient_ari_marked_replaced_total  counter  (no labels)  ARI renewals marked as replaced
  acmeclient_ari_cache_total  counter  {result}  ARI cache lookups
    Labels: result="hit"|"miss"
Rate limit counters (module: acmeclient):
  acmeclient_ratelimit_checks_total  counter  (no labels)  Rate limit pre-flight checks executed
  acmeclient_ratelimit_check_results_total  counter  {limit_type, result}  Rate limit check outcomes
    Labels: limit_type="retry_after"|"min_order_interval"|"orders_per_account"|"certs_per_domain"|
      "auth_failures_per_domain"|"certs_per_exact_set"|"all"
    Labels: result="blocked"|"passed"|"error"
  acmeclient_ratelimit_orders_created_total  counter  (no labels)  ACME orders created (tracked for limits)
  acmeclient_ratelimit_certs_issued_total  counter  {domain}  Certificates issued per domain (tracked for limits)
  acmeclient_ratelimit_auth_failures_total  counter  {domain}  Authorization failures per domain
  acmeclient_ratelimit_retry_after_total  counter  {status_code}  Retry-After responses from CA
    Labels: status_code="429"|"503"|"other"
  acmeclient_ratelimit_state_errors_total  counter  {operation}  Rate limit state storage errors
    Labels: operation="set"|"get"
  acmeclient_ratelimit_approaching_total  counter  {limit_type}  Rate limit approaching capacity warnings (>80%)
Rate limit gauges (module: acmeclient):
  acmeclient_ratelimit_current_usage  gauge  {limit_type}  Current usage count per limit type
  acmeclient_ratelimit_limit  gauge  {limit_type}  Effective limit value per limit type
  acmeclient_ratelimit_usage_percent  gauge  {limit_type}  Usage percentage per limit type
Rate limit latency (module: acmeclient):
  acmeclient_ratelimit_check_duration  histogram  (no labels)  Pre-flight rate limit check duration
Alerts:
  issuance_failed_total increasing -> Check CA reachability, DNS, port 80 access
  certificates_expiring > 0 persisting -> Renewal failing, check error_type labels
  certificate_days_until_expiry < 7 -> Urgent: cert near expiry, check renewal logs
  ratelimit_check_results_total{blocked} -> Client-side rate limit preventing issuance
  ari_early_renewal_suggestions_total -> CA suggesting early renewal, possible revocation
  challenges_served_total{status!=success} -> Challenge failures, check port 80 and DNS
AutoTLS Certificate Management
autotls_issuances_total  counter  {result=success|failure}  Initial certificate issuance on startup (after retry loop)
autotls_renewals_total  counter  {result=success|failure}  Certificate renewal attempts (both automatic renewal loop and manual 'autotls renew')
Certificate state is also observable via the 'autotls status' hexdcall command and log entries.
Certificate Management
Certificate Operations (namespace: certmanager):
  certmanager_set_total  counter  {source}  Certificate store operations (both domain and default)
    source=static|acme|acmeclient  Certificate source type
  certmanager_get_total  counter  {hit}  Cache lookups for TLS certificate retrieval
    hit=true  Certificate found (exact or wildcard match)
    hit=false  Certificate not found in cache
  certmanager_expired_total  counter  {}  Certificates expired from cache (renewal may have failed)
  certmanager_certificates_total  gauge  {}  Total certificates currently held in local cache
SPIFFE Workload Identity
Requests:
  spiffe_requests_total  counter  {endpoint, status}  Requests per endpoint (status: success/error)
  spiffe_request_duration_ms  histogram  {endpoint}  Request latency in milliseconds
Errors:
  spiffe_errors_total  counter  {type, status}  ACME error responses by problem type and HTTP status
  spiffe_cidr_blocked_total  counter  (none)  Requests blocked by CIDR policy
Endpoint label values for requests_total and request_duration_ms: directory, new_nonce, new_account, new_order, get_order, finalize, get_certificate, trust_bundle, revoke_cert, tos
Alerts:
  rate(spiffe_errors_total[5m]) > 10  High ACME error rate
  rate(spiffe_cidr_blocked_total[5m]) > 0  CIDR policy blocking requests
  histogram_quantile(0.99, spiffe_request_duration_ms) > 5000  P99 latency exceeding 5 seconds
Protection
Access Policy Engine
No Prometheus metrics emitted by this module.
Data Loss Prevention
Counters:
  dlp_scanned  counter  {direction,content_type}  Bodies scanned
  dlp_violations  counter  {detector,action,direction}  Violations found
  dlp_blocked  counter  {direction}  Requests/responses blocked
  dlp_redacted  counter  {direction}  Bodies redacted
  dlp_skipped  counter  {reason,direction}  Scan skipped
Histograms:
  dlp_scan_duration_ms  histogram  {direction}  Scan latency in milliseconds
Geo/IP and ASN Access Control
Request outcomes:
  geoaccess_requests_total  counter  {status, reason}  Per-request outcome
  geoaccess_blocked_by_country  counter  {country}  Blocked requests by country code
  geoaccess_blocked_by_asn  counter  {asn}  Blocked requests by ASN number
  geoaccess_cdn_country_used  counter  {country}  Requests using CDN-provided country header
Label values for requests_total:
  status: allowed | blocked
  reason: bypass_cidr | passed | asn_denied | asn_not_allowed | country_denied | country_not_allowed
Cache performance:
  geoaccess_cache  counter  {result, type}  Cache hit/miss tracking
  Label values:
    result: hit | miss
    type: (empty for full lookup) | asn_only (CDN country mode, ASN-only lookup)
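The result label on geoaccess_cache makes it possible to derive a hit ratio from the increase of each series over the same window. A small sketch follows. The function name is hypothetical, and returning 0.0 when there were no lookups is an assumption.

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of lookups served from cache over some window."""
    if hits < 0 or misses < 0:
        raise ValueError("counts must be non-negative")
    total = hits + misses
    if total == 0:
        return 0.0  # assumption: a window with no lookups is reported as 0
    return hits / total
```

A ratio well below 0.5 corresponds to the "misses far exceed hits" condition described in the Alerts for this module.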
Note: blocked_by_country and blocked_by_asn are emitted alongside requests_total for per-entity breakdown. requests_total with reason=asn_not_allowed and reason=country_not_allowed intentionally omit the per-entity label to avoid unbounded cardinality (the blocked entity is not in any configured list).
Alerts:
  rate(geoaccess_requests_total{status="blocked"}[5m]) spike  Unusual geo-block volume; verify rules or check for attack
  geoaccess_cache{result="miss"} >> geoaccess_cache{result="hit"}  Low cache hit rate; high IP diversity or short TTL
Proof-of-Work Challenge
Counters:
  pow_challenges_issued  counter  {}  Challenges generated (generateChallenge + createChallenge)
  pow_challenges_solved  counter  {}  Challenges solved successfully (valid hash + timing + honeypot)
  pow_challenges_failed  counter  {}  Challenges failed (expired, invalid, bot detection, bad hash)
Alerts:
  rate(pow_challenges_failed[5m]) > rate(pow_challenges_solved[5m])  More failures than successes (possible bot wave)
  rate(pow_challenges_issued[5m]) > 1000  High challenge generation rate (DDoS or misconfigured difficulty)
Rate Limiting
Counters:
  ratelimit_requests_total  counter  {result,hostname}  Requests checked (result: "allowed" or "blocked")
  ratelimit_clients_banned  counter  {hostname}  Clients banned (auto rate-limit exceeded + manual bans)
  ratelimit_clients_dropped  counter  {}  Clients refused tracking due to memory capacity limit
  ratelimit_clients_unbanned  counter  {}  Clients manually unbanned
Gauges:
  ratelimit_clients_tracked  gauge  {}  Currently tracked unique clients (exported on GetStats)
Alerts:
  rate(ratelimit_requests_total{result="blocked"}[5m]) > rate(ratelimit_requests_total{result="allowed"}[5m])  More blocks than allows (attack or too-strict config)
  ratelimit_clients_tracked > 0.8 * max_clients  Approaching memory capacity limit
Request Size Limiting
Counters:
  sizelimit_requests_total  counter  {result}  Requests processed (result: "allowed" or "rejected")
  sizelimit_exception_matched  counter  {host,path}  Requests that matched a size limit exception
Time-Based Access Control
Operations:
  timeaccess_requests_total  counter  {status, reason}  Allowed/blocked requests (status=allowed|blocked, reason=bypass_cidr|passed|day_denied|day_not_allowed|hours_denied|hours_not_allowed)
  timeaccess_windows_checked  counter  {matched_by}  Window match distribution (matched_by=cidr|country|default)
Alerts:
  rate(timeaccess_requests_total{status="blocked"}[5m]) > 10  High block rate may indicate misconfigured time windows
  timeaccess_windows_checked{matched_by="default"} increasing  Many requests falling through to default window; consider adding country/CIDR windows
Web Application Firewall
Counters:
  waf_requests  counter  {blocked,method}  Requests inspected by WAF
  waf_blocked  counter  {rule_id,path,action}  Requests blocked by WAF rules
  waf_passed  counter  {path}  Requests that passed WAF inspection
  waf_bypassed  counter  {path}  Requests bypassed (WAF disabled for route)
  waf_body_too_large  counter  {path}  Requests rejected for body size exceeding limit
Histograms:
  waf_duration_ms  histogram  {blocked,method}  WAF inspection duration in milliseconds
End-to-Origin Encryption
e2oe_channels_total{type}  Channel establishments
  type=baseline  Baseline ECDH channel
  type=established  Tier 1 (WebAuthn) first establishment
  type=rebound  Tier 1 rebind on page reload
  type=prf_wrapped  Tier 1 via PRF-wrapped relay (cross-origin promotion)
e2oe_channel_tier_total{tier,origin_match}
  tier=baseline|webauthn  Negotiated tier
  origin_match=auth  Channel established on the auth origin
  origin_match=cross_origin  Channel established on a non-auth origin (PRF-wrapped path)
e2oe_requests_encrypted_total  Requests processed through E2OE
  Incremented for every header-path request (fetch/XHR)
e2oe_decryption_failures_total  Request body decryption failures
e2oe_websocket_frames_total{direction}  WebSocket frames encrypted/decrypted
  direction=encrypt  Server→browser frames
  direction=decrypt  Browser→server frames
e2oe_websocket_failures_total{direction}  WebSocket encrypt/decrypt failures
  direction=encrypt  Server→browser encryption failed
  direction=decrypt  Browser→server decryption failed
  direction=decrypt_seq  Strict-monotonic seq gate rejected a frame (replay or reorder)
e2oe_tier1_relay_total{outcome}  Wrap-relay endpoint outcomes
  outcome=served  Relay HTML served successfully
e2oe_tier1_provision_total{outcome}  Wrap-upload endpoint outcomes
  outcome=full  Browser uploaded a complete wrapped map
e2oe_tier1_wrap_relay_total{outcome,layer}  Per-endpoint rate-limit blocks
e2oe_tier1_wrap_upload_total{outcome,layer}
e2oe_tier1_wrap_state_total{outcome,layer}
  outcome=rate_limited  Block emitted (per-session or per-IP layer)
  layer=session|ip  Which bucket triggered
Connectivity
DNS Resolution
Resolution (namespace: dns):
  dns_resolve_total  counter  {result, cached, dnssec}  Resolution outcomes
    result=success, cached=true|false  Successful resolution
    result=nxdomain, cached=true|false  Domain not found (valid response)
    result=failure, cached=false  Resolution failed
  dns_nxdomain_total  counter  {}  NXDOMAIN responses (uncached)
  dns_cache_hits  counter  {}  Cache hits
  dns_cache_misses  counter  {}  Cache misses
  dns_lookup_coalesced  counter  {}  Lookups coalesced (shared concurrent result)
  dns_lookup_performed  counter  {}  Lookups actually performed
  dns_cache_operations_total  counter  {operation, result}  Cache write operations
    operation=set, result=success|error  Broadcast cache set outcomes
Resolver Selection (namespace: dns):
  dns_resolver_queries_total  counter  {resolver, result}  Per-resolver query outcomes
    result=success|nxdomain|failure  Query result per resolver
  dns_system_dns_queries_total  counter  {result}  System DNS fallback queries
    result=success|nxdomain|failure  System resolver outcomes
Transport (namespace: dns):
  dns_transport_used  counter  {type, resolver}  DNS transport protocol used
    type=udp|dot  UDP or DNS-over-TLS
CNAME Resolution (namespace: dns):
  dns_cname_resolutions_total  counter  {status}  CNAME chain resolution outcomes
    status=success|depth_exceeded  CNAME follow results
DNSSEC Validation (namespace: dns):
  dns_dnssec_validations_total  counter  {result, resolver}  Resolver-trust mode validations
    result=valid|invalid|unsigned  AD bit check outcomes
  dns_dnssec_full_validations  counter  {result, resolver}  Full cryptographic validations
    result=valid|invalid  RRSIG/DNSKEY verification outcomes
  dns_dnssec_signature_validations  counter  {result, algorithm}  RRSIG signature verifications
    result=valid  Successful signature check
  dns_dnssec_dnskey_queries  counter  {result}  DNSKEY record fetches
    result=success  DNSKEY query succeeded
  dns_dnssec_response_validations  counter  {result}  Full response validations
    result=valid  All RRsets validated
  dns_dnssec_chain_validations  counter  {result}  Chain of trust DS validations
    result=valid|invalid  DNSKEY-DS digest match
  dns_dnssec_root_validations  counter  {result}  Root trust anchor validations
    result=valid|invalid  Root DNSKEY match
  dns_dnssec_nsec_validations  counter  {result, type}  NSEC/NSEC3 denial validations
    result=valid|invalid, type=nsec|nsec3  Authenticated denial outcomes
DNSSEC Cache (namespace: dns):
  dns_dnssec_cache_hits  counter  {type}  DNSSEC record cache hits
    type=dnskey|ds  Cached record type
  dns_dnssec_cache_misses  counter  {type}  DNSSEC record cache misses
    type=dnskey|ds  Record type queried
  dns_dnssec_cache_clears  counter  {}  DNSSEC cache full clears
Health Management (namespace: dns):
  dns_resolver_latency  latency  {resolver}  Per-resolver query latency
  dns_resolver_healthy  gauge  {resolver}  Resolver health status (1=healthy, 0=unhealthy)
  dns_resolver_avg_latency_ms  gauge  {resolver}  Resolver average latency EMA (ms)
  dns_resolver_consecutive_failures  gauge  {resolver}  Consecutive failure count per resolver
  dns_resolver_failures_total  counter  {resolver}  Total resolver failures
  dns_system_fallback  gauge  {}  System DNS fallback active (1=active, 0=inactive)
  dns_fallback_activations  counter  {}  System DNS fallback activations
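dns_resolver_avg_latency_ms is described as an exponential moving average (EMA). The sketch below shows the standard EMA update. The smoothing factor alpha = 0.2, the seeding rule for the first sample, and the function name are assumptions, as this reference does not state what the resolver actually uses.

```python
from typing import Optional


def update_ema(previous_ms: Optional[float], sample_ms: float, alpha: float = 0.2) -> float:
    """Blend a new latency sample into the running average."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if previous_ms is None:
        return sample_ms  # assumption: the first sample seeds the average
    # Higher alpha weights recent samples more heavily.
    return alpha * sample_ms + (1.0 - alpha) * previous_ms
```

A small alpha smooths out single slow queries, while a larger one makes the gauge react faster to a resolver that is degrading.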
Adaptive Selection (namespace: dns):
  dns_adaptive_resolver_selected  counter  {resolver, reason}  Adaptive resolver selections
    reason=exploration|best_score|round_robin|...  Selection strategy used
  dns_adaptive_selection_total  counter  {mode, resolver}  Selection mode distribution
    mode=explore|exploit  Exploration vs exploitation
  dns_resolver_score  histogram  {resolver}  Resolver scores (intelligent phase)
Forward Proxy Engine
Connection Metrics (namespace: forwardproxy):
  forwardproxy_connections_total  counter  {protocol, user_id}  Proxy connections recorded
  forwardproxy_bytes_sent_total  counter  {protocol, user_id}  Bytes sent through proxy
  forwardproxy_bytes_received_total  counter  {protocol, user_id}  Bytes received through proxy
  forwardproxy_connection_duration  latency  {protocol, user_id}  Connection duration
  forwardproxy_errors_total  counter  {protocol, error}  Failed proxy connections
  forwardproxy_active_connections  gauge  {}  Currently active proxy connections
Client Access (HexonClient)
Connections:
  clientaccess_connections_total  counter  {}  QUIC connections accepted
  clientaccess_connections_active  gauge  {}  Currently active QUIC connections
  clientaccess_connections_rejected  counter  {reason}  Connections rejected before auth
  clientaccess_connection_duration  latency  {username?}  Connection lifetime
Authentication:
  clientaccess_auth_success_total  counter  {username?}  Successful authentications
  clientaccess_auth_failures_total  counter  {reason}  Failed authentications
Clients:
  clientaccess_clients_active  gauge  {}  Registered client instances
Heartbeat:
  clientaccess_heartbeat_latency  latency  {username?}  Heartbeat RTT (raw)
Dial:
  clientaccess_dials_total  counter  {}  Dial requests received
  clientaccess_dials_denied_total  counter  {}  Dials denied by ACL
  clientaccess_dials_success_total  counter  {}  Dials completed successfully
  clientaccess_dials_errors_total  counter  {}  Dial errors (connect refused, timeout)
  clientaccess_dial_latency  latency  {}  Backend dial time
  clientaccess_streams_active  gauge  {}  Active QUIC dial streams
DNS:
  clientaccess_dns_queries_total  counter  {}  DNS queries processed
Alerts:
  clientaccess_connections_active > max_clients * 0.9  Approaching client limit
  rate(clientaccess_connections_rejected[5m]) > 10  Connection rejection spike
  rate(clientaccess_auth_failures_total[5m]) > 10  Authentication failure spike
  rate(clientaccess_dials_denied_total[5m]) > 20  ACL denial spike
QUIC Connector
Connections:
  connectors_connections_total  counter  {}  Total connector connections
  connectors_connections_active  gauge  {}  Active connector connections
  connectors_connections_rejected  counter  {reason}  Rejected connections
  connectors_connection_duration  latency  {site_id}  Connection lifetime
Authentication:
  connectors_auth_success_total  counter  {site_id}  Successful authentications
  connectors_auth_failures_total  counter  {site_id, reason}  Authentication failures
Instances:
  connectors_instances_active  gauge  {site_id}  Active connector instances
  connectors_heartbeat_latency  latency  {site_id}  Heartbeat round-trip time
Dial (tunnel dispatch):
  connectors_dials_total  counter  {site_id}  Dial attempts through tunnel
  connectors_dials_success_total  counter  {site_id}  Successful dials
  connectors_dials_errors_total  counter  {site_id, reason}  Failed dials
  connectors_dial_latency  latency  {site_id}  Dial latency
  connectors_streams_active  gauge  {}  Active QUIC streams
Rebalance:
  connectors_rebalance_reject_total  counter  {site_id}  Soft-rejected for rebalance
  connectors_rebalance_accept_total  counter  {site_id}  Accepted after rebalance check
Inter-node forwarding (TCP-level):
  connectors_forward_total  counter  {site_id, target}  Forward attempts to peer node
  connectors_forward_success_total  counter  {site_id, target}  Successful forwards
  connectors_forward_errors_total  counter  {site_id, target}  Failed forwards
  connectors_forward_latency  latency  {site_id, target}  Forward latency
  connectors_forward_local_total  counter  {site_id}  Requests handled locally
Relay (QUIC inter-node dispatch):
  connectors_relay_total  counter  {site_id, target}  Client-side relay attempts
  connectors_relay_served  counter  {site_id, target}  Server-side relay requests handled
  connectors_relay_success_total  counter  {site_id, target}  Successful relay dispatches
  connectors_relay_errors_total  counter  {site_id, reason}  Failed relay dispatches
  connectors_relay_rejected_total  counter  {reason}  Relay connections rejected (auth)
Alerts:
  rate(connectors_auth_failures_total[5m]) > 5  High auth failure rate (brute force or misconfiguration)
  connectors_instances_active == 0  No connector instances (site unreachable)
  rate(connectors_dials_errors_total[5m]) > 10  High dial failure rate (tunnel health)
  rate(connectors_relay_rejected_total[5m]) > 0  Relay auth failures (cluster misconfiguration)
  connectors_connections_active > 100  High connection count
Network Listener
Lifecycle:
  listener_starts  counter  {type, name}  Listener startups
  listener_stops  counter  {type, name}  Listener shutdowns
  listener_restarts  counter  {type, name}  Listener restarts
  listener_errors  counter  {type, name}  Listener errors
Rate & Size Limiting:
  listener_rate_limit_hits  counter  {reason}  Requests blocked by rate limit
  listener_ratelimit_circuit_breaker_trips_total  counter  {}  Circuit breaker trips
  listener_size_limit_hits  counter  {host, path}  Size limit exceeded
TLS Security:
  listener_connections_accepted  counter  {protocol}  Successful TLS connections
  listener_security_non_tls_dropped  counter  {reason}  Non-TLS connections rejected
  listener_security_malformed_tls  counter  {reason}  Invalid TLS versions
  listener_security_oversized_record  counter  {reason}  TLS records exceeding RFC limits
  listener_security_oversized_clienthello  counter  {reason}  ClientHello too large
  listener_security_small_clienthello  counter  {reason}  Suspiciously small ClientHello
  listener_security_malformed_clienthello  counter  {reason}  Malformed ClientHello
  listener_security_no_sni  counter  {reason}  TLS handshakes without SNI
QUIC Affinity:
  listener_quic_affinity_packets_received  counter  {}  QUIC packets received
  listener_quic_affinity_packets_dropped  counter  {reason}  QUIC packets dropped
  listener_quic_affinity_decryption_failures  counter  {}  QUIC decryption failures
  listener_quic_affinity_packets_local  counter  {}  QUIC packets processed locally
  listener_quic_affinity_packets_forwarded  counter  {target_node}  QUIC packets forwarded to cluster
  listener_quic_affinity_forward_failures  counter  {target_node}  Forward failures
  listener_quic_affinity_response_dropped  counter  {reason}  QUIC response packets dropped
  listener_quic_affinity_cid_mappings  gauge  {}  Active connection ID mappings
  listener_quic_connection_migrations  counter  {}  QUIC connection migrations
QUIC Forwarding:
  listener_quic_forward_connect_errors  counter  {target_node}  Forwarding connect errors
  listener_quic_forward_write_errors  counter  {target_node}  Forwarding write errors
  listener_quic_forward_bytes  counter  {target_node}  Bytes forwarded
HXEP (Edge Protocol):
  hxep_parsed_trusted  counter  {}  TCP HXEP parsed (trusted)
  hxep_stripped_untrusted  counter  {}  TCP HXEP stripped (untrusted)
  hxep_parse_failed  counter  {}  TCP HXEP parse failures
  hxep_partial_header  counter  {}  TCP HXEP incomplete headers
  hxep_udp_parsed_trusted  counter  {}  UDP HXEP parsed (trusted)
  hxep_udp_stripped_untrusted  counter  {}  UDP HXEP stripped (untrusted)
  hxep_udp_parse_failed  counter  {}  UDP HXEP parse failures
Alerts:
  rate(listener_rate_limit_hits[5m]) > 50  High rate limiting (possible attack)
  listener_ratelimit_circuit_breaker_trips_total > 0  Circuit breaker tripped
  rate(listener_security_no_sni[5m]) > 10  SNI probing
  rate(hxep_stripped_untrusted[5m]) > 0  HXEP spoofing attempt
  rate(listener_quic_affinity_forward_failures[5m]) > 0  Cluster QUIC forwarding issues
Forward Proxy
No Prometheus metrics emitted by this module.
Cluster & Operations
Git Configuration Management
Observability is provided indirectly through dependent modules:
- config: metrics for config reload success/failure and reload latency
- cluster: metrics for broadcast delivery and quorum wait
Hot Reload
Observability is provided indirectly through dependent modules:
- config: metrics for config reload success/failure, reload latency, and config version
- cluster: metrics for broadcast delivery and quorum wait
Module Data Storage
Operations (namespace: moduledata):
  moduledata_operations_total  counter  {operation, backend, result}  Operation count
    operation=get|set|delete|getallforuser|loadall|exists
    backend=hexon
    result=success|error
  moduledata_operation_duration  latency  {operation, backend}  Operation duration
    operation=get|set|delete|getallforuser|loadall|exists
    backend=hexon
Notification Service
notify_sent_total  counter  {channel, result}  Incremented after each single-event delivery attempt. channel=email|webhook, result=success|failure.
notify_digest_sent_total  counter  {channel, result}  Incremented after each digest delivery attempt. channel=email|webhook, result=success|failure.
Downstream metrics from related modules:
- smtp_send_total (from smtp module): covers email delivery outcomes
- render_email_total (from render module): covers template rendering
Distributed Sessions
Lifecycle:
  sessions_sessions_created  counter  {type}  Sessions created
  sessions_sessions_revoked  counter  {}  Sessions revoked (single)
  sessions_sessions_bulk_revoked  counter  {type}  Sessions bulk-revoked
  sessions_sessions_regenerated  counter  {type}  Session IDs regenerated
  sessions_sessions_extended  counter  {type}  Session TTLs extended
  sessions_activity_persisted  counter  {type}  Activity timestamps persisted
Validation:
  sessions_validations_success  counter  {type}  Successful validations
  sessions_validations_failed  counter  {reason}  Failed validations (storage_error, wait_error, not_found, invalid_type)
Alerts:
  rate(sessions_validations_failed{reason="not_found"}[5m]) > 50  High session-not-found rate (expired or stale cookies)
  rate(sessions_sessions_bulk_revoked[5m]) > 0  Bulk revocation event (user disabled or password change)
  rate(sessions_validations_failed{reason="storage_error"}[5m]) > 0  Storage backend issues
SMTP Email Delivery
Email delivery:
  smtp_emails_sent_total  counter  {type, result}  Emails sent per type and outcome
  smtp_send_duration  latency  {type, result}  Email send duration per type and outcome
Label values:
  type: generic | otp | cert_renewal | passkey_expiration | vpn_enrollment | vpn_device_code | vpn_psk_expiration | magiclink | test | pat_created | pat_revoked | pat_expired | passkey_created | passkey_revoked | totp_created | totp_revoked | cert_created | cert_revoked
  result: success | failure
Note: Only core email types emit latency (generic, otp, cert_renewal, passkey_expiration, vpn_enrollment, vpn_device_code, vpn_psk_expiration, magiclink). All other types (test, pat_*, passkey_created, passkey_revoked, totp_*, cert_created, cert_revoked) emit the counter only, with no latency metric.
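The rule in the note above can be written down as a small lookup, which is handy when generating dashboards or alert rules per email type. This is a minimal sketch: the type lists are copied from the label values above, while the function and constant names are hypothetical. An explicit set is used instead of prefix matching because cert_renewal and passkey_expiration are core types even though they share a prefix with non-core ones.

```python
# Core email types: these emit both the counter and the latency metric.
CORE_EMAIL_TYPES = frozenset({
    "generic", "otp", "cert_renewal", "passkey_expiration",
    "vpn_enrollment", "vpn_device_code", "vpn_psk_expiration", "magiclink",
})

# Every documented value of the "type" label.
ALL_EMAIL_TYPES = CORE_EMAIL_TYPES | frozenset({
    "test", "pat_created", "pat_revoked", "pat_expired",
    "passkey_created", "passkey_revoked", "totp_created", "totp_revoked",
    "cert_created", "cert_revoked",
})


def emitted_metrics(email_type: str) -> list:
    """Return the SMTP metric names emitted for a given email type."""
    if email_type not in ALL_EMAIL_TYPES:
        raise ValueError(f"unknown email type: {email_type!r}")
    metrics = ["smtp_emails_sent_total"]
    if email_type in CORE_EMAIL_TYPES:
        metrics.append("smtp_send_duration")
    return metrics


print(emitted_metrics("otp"))
print(emitted_metrics("pat_created"))
```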
Alerts:
  rate(smtp_emails_sent_total{result="failure"}[5m]) > 5  SMTP delivery issues
  smtp_send_duration{quantile="0.99"} > 5s  SMTP latency degradation
Persistent File Storage
No Prometheus metrics emitted by this module.
Distributed Memory Storage
CRUD Counters:
  memory_storage_gets  counter  {cache_type, result}  Cache reads (result: hit, miss, cold_hit, decode_error, expired)
  memory_storage_sets  counter  {cache_type}  Cache writes
  memory_storage_deletes  counter  {cache_type}  Cache deletions
  memory_storage_touches  counter  {cache_type, result}  TTL renewals (result: hit, miss, expired)
  memory_storage_setnx  counter  {cache_type, result}  Atomic set-if-not-exists (result: set, exists)
  memory_storage_sync_sets  counter  {cache_type}  Synchronous KV-persisted writes
  memory_storage_sync_gets  counter  {cache_type, result}  Synchronous reads with KV fallback (result: hit, miss, kv_hit, decode_error, expired)
  memory_storage_evictions  counter  {cache_type, reason}  Entries evicted (reason: expired, cold)
Gauges:
  memory_storage_entries  gauge  {cache_type}  Current entry count per cache type (updated via GetCacheStats)
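The result label on memory_storage_gets makes it possible to derive a hit ratio per cache type. The sketch below shows the arithmetic on a plain mapping of result to count. Treating both hit and cold_hit as successful reads is an assumption of this example, as are the function name and the sample counts.

```python
from typing import Mapping

# Assumption for this sketch: "hit" and "cold_hit" both count as served reads;
# "miss", "expired" and "decode_error" do not.
HIT_RESULTS = ("hit", "cold_hit")


def hit_ratio(gets_by_result: Mapping[str, float]) -> float:
    """Return served reads divided by all reads, or 0.0 if there were none."""
    total = sum(gets_by_result.values())
    if total == 0:
        return 0.0
    hits = sum(gets_by_result.get(result, 0.0) for result in HIT_RESULTS)
    return hits / total


# Hypothetical memory_storage_gets counts for one cache_type.
sample = {"hit": 900, "cold_hit": 50, "miss": 40, "expired": 10, "decode_error": 0}
print(hit_ratio(sample))
```

A ratio is often easier to alert on than an absolute miss rate, because it stays meaningful when traffic volume changes.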
Alerts:
  rate(memory_storage_gets{result="miss"}[5m]) > 100  High cache miss rate (check TTLs or missing Set calls)
  rate(memory_storage_evictions{reason="expired"}[5m]) > 500  Excessive TTL evictions (entries expiring faster than expected)
  rate(memory_storage_gets{result="decode_error"}[5m]) > 0  Corrupted KV entries (cold cache decode failures)
Telemetry & Logging
Audit event tracking:
  telemetry_audit_log_entries_total  counter  {}  Audit-class entries successfully written
  telemetry_audit_dropped_total  counter  {}  Audit-class entries dropped (channel overflow)
  telemetry_converging_log_entries_total  counter  {}  Converging-class entries successfully written
All three counters have no labels (nil label map). They are incremented in the single backgroundWriter goroutine (no contention).
Alerts:
  telemetry_audit_dropped_total > 0  Audit entries lost; increase channel buffer or reduce log volume
  rate(telemetry_audit_log_entries_total[5m]) == 0  No audit events; possible pipeline failure or misconfiguration
AI Assistant
Queries:
  llm_queries_total  counter  {success}  Query completion count
    success=true  Query completed with a final answer
  llm_query_duration_seconds  latency  (none)  End-to-end query duration including all tool rounds
Tool calls:
  llm_tool_calls_total  counter  {tool, success}  Per-tool execution count
    tool=<command>  CLI command name (e.g. "proxy", "cluster")
    success=true/false  Whether the command executed successfully
Prompt caching (Anthropic provider only, emitted when cache tokens > 0):
  llm_cache_read_tokens_total  counter  (none)  Tokens read from Anthropic prompt cache
  llm_cache_creation_tokens_total  counter  (none)  Tokens written to Anthropic prompt cache
Admin Unix Socket
No Prometheus metrics emitted by this module.
Threshold Signing & Cluster Cryptography
No Prometheus metrics emitted by this module.
Configuration System
Reload counters are available via the health system:
- Reload attempts, successes, failures
- Parse errors, validation errors, file-not-found errors
- Callback timeouts, callback duration, last reload duration
Query reload status: health components | config status
Kubernetes CRD Configuration
Reconciliation:
  k8s_reconciliations_total  counter  {result}  Config-to-CRD reconciliation cycles
    result=success | failure
Health Sync:
  k8s_health_syncs_total  counter  {result}  Periodic health status writes to CRD .status
    result=success | failure
CRD Operations:
  k8s_crd_operations_total  counter  {operation, result}  Individual CRD operations
    operation=ensure_definition, result=created | updated | failure
    operation=status_write, result=success | failure
Alerts:
  rate(k8s_reconciliations_total{result="failure"}[5m]) > 0  Reconciliation failing
  rate(k8s_health_syncs_total{result="failure"}[5m]) > 0  Health sync failing
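For ad hoc checks outside of an alerting pipeline, the result label on these counters can be summarized straight from a metrics scrape. The sketch below assumes the counters are exposed in the standard Prometheus text exposition format; the parser is deliberately simplistic (it does not handle escaped quotes or braces inside label values), and the function name and sample values are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches lines such as: metric_name{label="value"} 42
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def failure_ratios(exposition: str) -> Dict[str, float]:
    """Return failures / total per metric name, based on the result label."""
    totals = defaultdict(float)
    failures = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if "result" not in labels:
            continue
        value = float(match.group("value"))
        totals[match.group("name")] += value
        if labels["result"] == "failure":
            failures[match.group("name")] += value
    return {name: failures[name] / total for name, total in totals.items() if total > 0}


# Hypothetical scrape output.
SAMPLE = """\
# TYPE k8s_reconciliations_total counter
k8s_reconciliations_total{result="success"} 98
k8s_reconciliations_total{result="failure"} 2
k8s_health_syncs_total{result="success"} 50
k8s_crd_operations_total{operation="status_write",result="failure"} 1
k8s_crd_operations_total{operation="status_write",result="success"} 3
k8s_crd_operations_total{operation="ensure_definition",result="created"} 1
"""

for name, ratio in sorted(failure_ratios(SAMPLE).items()):
    print(name, ratio)
```

These ratios cover the whole lifetime of the counters; the alert expressions above look at recent change instead, which is usually what matters operationally.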