Metrics Reference

Every Prometheus metric across all modules. Types: counter (monotonic), gauge (current value), histogram (distribution buckets), latency (histogram with fixed buckets: <1ms, <10ms, <100ms, <1s, <10s, >10s).
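
As an illustration of the latency type, the following minimal sketch sorts durations into the six fixed buckets listed above. The function names are hypothetical, and it assumes the buckets are exclusive ranges with a strict upper bound; the source does not say whether the exported buckets are cumulative.

```python
import bisect

# Upper bounds in seconds for the fixed latency buckets; durations of
# 10s or more fall into the final overflow bucket.
BOUNDS = [0.001, 0.01, 0.1, 1.0, 10.0]
LABELS = ["<1ms", "<10ms", "<100ms", "<1s", "<10s", ">10s"]


def latency_bucket(seconds: float) -> str:
    """Return the label of the bucket a duration falls into."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    # bisect_right sends a value equal to a bound into the next bucket,
    # matching the strict "<" in the labels (an assumption about edges).
    return LABELS[bisect.bisect_right(BOUNDS, seconds)]


def bucket_counts(durations):
    """Count how many durations land in each bucket."""
    counts = {label: 0 for label in LABELS}
    for d in durations:
        counts[latency_bucket(d)] += 1
    return counts
```

For example, a 250 ms backend response would be counted under "<1s".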

Reverse Proxy

Load Balancer

Pool Lifecycle:
loadbalancer_pools_created counter {strategy} Pools created
loadbalancer_pools_deleted counter {} Pools deleted
Backend Selection:
loadbalancer_selects counter {pool_id, strategy} Successful backend selections
loadbalancer_select_failures counter {pool_id, reason} Failed selections (reason: pool_not_found|no_healthy_backends|algorithm_returned_nil)
loadbalancer_select_latency latency {pool_id} Backend selection duration
Health Checks:
loadbalancer_health_checks counter {pool_id, healthy} Health check executions (healthy: true|false)
Circuit Breaker:
loadbalancer_circuit_state_changes counter {pool_id, backend_id, from_state, to_state} Circuit state transitions (optionally includes protocol label in per-protocol mode)
loadbalancer_circuit_resets counter {pool_id, backend_id} Manual circuit resets
Connections:
loadbalancer_connections_opened counter {pool_id, backend_id} Connections opened
loadbalancer_connections_closed counter {pool_id, backend_id} Connections closed
loadbalancer_active_connections gauge {pool_id, backend_id} Current active connections
loadbalancer_connection_duration latency {pool_id, backend_id} Connection duration
loadbalancer_bytes_sent counter {pool_id, backend_id} Bytes sent to backends
loadbalancer_bytes_recv counter {pool_id, backend_id} Bytes received from backends
Rate Limiting:
loadbalancer_rate_limit_allowed counter {pool_id} Requests allowed by rate limiter
loadbalancer_rate_limit_denied counter {pool_id} Requests denied by rate limiter
Note: metrics aggregate at pool level. For per-user denial details when
rate_limit_per_user = true, use: logs search "rate limit exceeded" (includes user_id)
Outlier Detection:
loadbalancer_outlier_ejections counter {pool_id, backend_id, reason} Backends ejected (reason: consecutive_5xx|consecutive_gateway|consecutive_local|success_rate|failure_percentage)
loadbalancer_outlier_readmissions counter {pool_id, backend_id} Backends auto-readmitted after ejection period
loadbalancer_outlier_manual_uneject counter {pool_id, backend_id} Backends manually un-ejected
DNS Discovery:
loadbalancer_dns_discovery_failures counter {pool_id, hostname} DNS resolution failures
loadbalancer_dns_discovery_updates counter {pool_id, hostname} Backend set updates from DNS
Alerts:
rate(loadbalancer_select_failures{reason="no_healthy_backends"}[5m]) > 0 All backends down — check health and outlier state
rate(loadbalancer_circuit_state_changes{to_state="open"}[5m]) > 0 Circuit opened — backend degradation
rate(loadbalancer_outlier_ejections[5m]) > 5 High ejection rate — systemic backend issues
rate(loadbalancer_rate_limit_denied[5m]) > 50 High rate limit denial — check capacity or limits
rate(loadbalancer_dns_discovery_failures[5m]) > 0 DNS discovery failing — check DNS config
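
Most alerts on this page compare rate(counter[window]) against a threshold. The sketch below shows the idea behind that calculation for a monotonic counter, including a counter reset after a restart. It is a simplification with hypothetical function names: Prometheus itself also extrapolates to the window boundaries, which is omitted here.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


def alert_fires(samples, threshold):
    """Mirror of an expression such as rate(metric[5m]) > threshold."""
    return per_second_rate(samples) > threshold
```

With a threshold of 0, a single increment inside the window is enough to fire, which is why the "> 0" alerts above react to any occurrence.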

Request Shadow/Mirror

Request Counts:
shadow_requests_total counter {shadow_name} Shadow requests dispatched
shadow_success_total counter {shadow_name} Shadow requests with 2xx/3xx responses
shadow_errors_total counter {shadow_name, error_type} Shadow request errors (error_type: request_creation|timeout|network|http_error)
Note: error_type "http_error" also includes a "status_code" label.
Latency:
shadow_request_duration latency {shadow_name} Shadow request round-trip duration
Alerts:
rate(shadow_errors_total[5m]) > 0 Shadow errors occurring — check backend health
rate(shadow_errors_total{error_type="timeout"}[5m]) > 5 High timeout rate — increase timeout or check backend
histogram_quantile(0.99, shadow_request_duration) > 5 p99 latency exceeds 5s — shadow backend slow
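
Because the latency type uses coarse fixed buckets, a p99 threshold such as 5s falls inside the 1s to 10s bucket and can only be estimated. The sketch below estimates a quantile by linear interpolation within a bucket, in the spirit of histogram_quantile. The function name and the per-bucket (non-cumulative) input layout are assumptions made for illustration.

```python
def quantile_from_buckets(q, bounds, counts):
    """Estimate the q-quantile from per-bucket counts.

    bounds: ascending finite upper bounds in seconds.
    counts: one count per bound, plus a trailing overflow count.
    """
    total = sum(counts)
    if total == 0:
        return None
    rank = q * total
    cumulative = 0
    lower = 0.0
    for upper, count in zip(bounds, counts):
        if count > 0 and cumulative + count >= rank:
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
        lower = upper
    # The quantile lies in the overflow bucket: report the highest
    # finite bound, since no upper limit is known.
    return bounds[-1]
```

With 90 requests under 100ms, 8 under 1s, and 2 under 10s, the estimated p99 lands between 1s and 10s, so the result depends on the even-spread assumption rather than on observed values.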

Reverse Proxy

Request Flow:
proxy_requests counter {app, host, auth} Successful proxied requests
proxy_errors counter {app, host} Proxy request errors
proxy_backend_duration latency {app} Backend response time
proxy_auth_failures counter {app, host} Authentication failures
proxy_authz_failures counter {app, host} Group authorization failures
proxy_auth_bypass_total counter {app, host} Auth bypassed (bypass_auth_cidrs)
proxy_subnet_failures counter {app, host} Subnet restriction failures
proxy_reauth_required counter {app, host} Re-authentication triggered
Caching:
proxy_cache_hits counter {app, type} Response cache hits (304/full)
proxy_cache_misses counter {app} Response cache misses
proxy_cache_size gauge {} Response cache entries
proxy_cache_invalidated counter {} Cache entries invalidated on reload
Header Signing:
proxy_signing_total counter {status, app} Header signing operations
proxy_signing_duration latency {app} Header signing time
proxy_request_signing_total counter {status, app} Request signing operations
proxy_request_signing_duration latency {app} Request signing time
proxy_sign_payload_total counter {status} Payload signing outcomes
proxy_key_derivation_total counter {status} Key derivation attempts
proxy_key_rotation_total counter {status} Key rotations
proxy_key_operation_total counter {status, operation} Key operation failures
proxy_key_request_total counter {status} Public key endpoint requests
proxy_key_request_duration latency {} Public key endpoint latency
proxy_signature_verify_total counter {status} Signature verification requests
proxy_signature_verify_duration latency {} Signature verification latency
proxy_request_signature_verify_total counter {status} Request signature verification
proxy_request_signature_verify_duration latency {} Request signature verify latency
OIDC SSO:
proxy_oidc_flow_initiated counter {host, provider} OIDC auth flows started
proxy_oidc_flow_completed counter {host, provider} OIDC auth flows completed
proxy_oidc_flow_failed counter {host, reason, provider?} OIDC auth flow failures (provider absent on pre-state errors)
proxy_oidc_state_validation_failed counter {reason} State validation failures
Per-User Rate Limiting:
proxy_rate_limit_per_user_denied counter {app} Requests denied by per-user rate limit
Canary:
proxy_canary_requests counter {app, version} Requests routed per version (stable/canary label)
Retry:
proxy_retry_attempts_total counter {app, attempt} Retry attempts by attempt number
proxy_retry_success_total counter {app} Retries that succeeded
proxy_retry_budget_exceeded counter {app} Retries blocked by budget
proxy_retry_exhausted counter {app} All retry attempts failed
Hedge:
proxy_hedge_fired_total counter {app} Hedge requests sent (primary too slow, value = number of hedges)
proxy_hedge_won_total counter {app} Hedge response used (primary was slower or failed)
proxy_hedge_lost_total counter {app} All attempts failed (primary + all hedges)
proxy_hedge_skipped_total counter {app} Hedge skipped (no different backend, body replay failure)
Circuit Breaker:
proxy_circuit_breaker_rejections counter {app} Requests rejected (breaker open)
proxy_circuit_breaker_fallbacks counter {app} Fallback service activated
Transport:
proxy_transport_cache_hits counter {} HTTP transport reused
proxy_transport_cache_misses counter {} New HTTP transport created
proxy_transport_cache_size gauge {} Cached transports
proxy_transport_cache_invalidated counter {reason} Cache invalidated (CA rotation)
proxy_http3_transport_cache_hits counter {} HTTP/3 transport reused
proxy_http3_transport_cache_misses counter {} HTTP/3 transport created
proxy_proxyprotocol_sent counter {version} PROXY protocol headers sent
proxy_proxyprotocol_skipped counter {reason} PROXY protocol skipped
proxy_ca_pool_version gauge {} CA pool version
proxy_optimized_transport_hits counter {route, pool} Optimized transport cache hits
proxy_optimized_transport_fallbacks counter {route, reason} Optimized transport fallbacks
proxy_transport_pools_created counter {pool, route} Transport pools created
proxy_transport_pools_cleaned counter {pool} Transport pools cleaned
proxy_pool_registration_failures counter {pool, error} Pool registration failures
proxy_pool_cleanup_errors counter {pool, error} Pool cleanup errors
Rewriting:
proxy_rewrite_duration latency {app} HTML rewriting time
proxy_buffer_pool_gets counter {} Buffer pool acquisitions
Reload:
proxy_reload_attempts counter {trigger} Reload attempts
proxy_reload_total counter {success, reason?} Reload results (reason on failure only)
proxy_reload_skipped counter {reason} Reload skipped
proxy_reload_duration latency {success} Reload time (success path only)
proxy_routes_configured gauge {} Active routes
proxy_routes_changed gauge {} Routes changed on reload
proxy_routes_unchanged gauge {} Routes unchanged on reload
proxy_routes_added counter {} Routes added
proxy_routes_removed counter {} Routes removed
proxy_config_hash_changed counter {} Config hash changes
proxy_lb_pools_preserved counter {app} LB pools preserved
proxy_lb_pools_created counter {app, reason} LB pools created
Session Monitoring:
proxy_group_monitor_changes_total counter {username} Group membership changes
proxy_group_monitor_updates_total counter {username} Session metadata updates
proxy_group_monitor_errors_total counter {error_type} Monitor errors
proxy_group_monitor_check_duration latency {} Check cycle time
FastCGI:
proxy_fastcgi_requests_total counter {mapping, status_class} FastCGI RoundTrip exits (status_class: 2xx/3xx/4xx/5xx/error)
proxy_fastcgi_request_duration latency {mapping} RoundTrip wall-clock time (auto-bucketed)
proxy_fastcgi_pool_total counter {mapping, result} Conn pool acquire (hit/fresh/retry)
proxy_fastcgi_stderr_bytes_total counter {mapping, severity} PHP-FPM STDERR bytes routed to audit log
proxy_fastcgi_proto_status_total counter {mapping, status} Non-success FCGI_END_REQUEST signals (cant_mpx/overloaded/unknown_role)
Alerts:
rate(proxy_errors[5m]) / rate(proxy_requests[5m]) > 0.05 Error rate > 5%
proxy_circuit_breaker_rejections > 0 Backend unhealthy
rate(proxy_auth_failures[5m]) > 10 Brute-force attempt
rate(proxy_oidc_state_validation_failed[5m]) > 5 CSRF/state attack
proxy_transport_cache_invalidated > 0 CA rotation event
rate(proxy_reload_total{success="false"}[5m]) > 0 Config reload failing
rate(proxy_retry_budget_exceeded[5m]) > 10 Retry storm — budget protecting cluster
rate(proxy_retry_exhausted[5m]) > 5 Backend failures exhausting retries
rate(proxy_hedge_fired_total[5m]) / rate(proxy_requests[5m]) > 0.1 >10% requests hedging — check tail latency
rate(proxy_fastcgi_requests_total{status_class="error"}[5m]) > 5 FastCGI transport-level failures (backend unreachable)
rate(proxy_fastcgi_stderr_bytes_total{severity="error"}[5m]) > 1000 Sustained PHP-FPM error output (investigate php-fpm.log)
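
The first alert divides proxy_errors by proxy_requests. Since proxy_requests is described above as counting successful proxied requests, the quotient is errors relative to successes rather than errors as a share of all requests. The sketch below contrasts the two, assuming that reading of the counters is right; both function names are hypothetical.

```python
def error_to_success_ratio(errors: float, successes: float) -> float:
    """Quotient used by the alert: errors divided by successful requests."""
    if successes > 0:
        return errors / successes
    return float("inf") if errors > 0 else 0.0


def error_share_of_total(errors: float, successes: float) -> float:
    """Errors as a fraction of all requests (errors plus successes)."""
    total = errors + successes
    return errors / total if total > 0 else 0.0
```

With 5 errors and 95 successes, the alert quotient is about 0.053 and exceeds 0.05, while the share of total traffic is exactly 0.05. The difference is small at low error rates and grows as errors increase.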

Authentication

Device Code Authorization

Codes:
devicecode_codes_issued_total counter {client_id} Device codes generated
Authorization:
devicecode_authorizations_total counter {result} Authorization decisions
result=authorized User approved device
result=denied User denied device
Polling:
devicecode_polls_total counter {status} Poll requests by outcome
status=pending Awaiting user action
status=authorized User authorized
status=denied User denied
status=slow_down Client polling too fast
status=expired Code expired (not instrumented — returns early)
Alerts:
rate(devicecode_authorizations_total{result="denied"}[5m]) > 10 High denial rate
rate(devicecode_polls_total{status="slow_down"}[5m]) > 50 Clients ignoring poll interval

Just-In-Time Two-Factor Authentication

Operations:
jit2fa_login_attempts_total counter {mapping_id} Login interceptions
jit2fa_webhook_validations_total counter {mapping_id, result} Webhook results (success/failure)
jit2fa_webhook_validation_duration latency {mapping_id} Webhook response time
jit2fa_otp_verifications_total counter {mapping_id, result, reason?} OTP results (success/invalid/expired/max_retries/error)
jit2fa_sessions_created_total counter {mapping_id} Sessions created
jit2fa_otp_resends_total counter {mapping_id, result} OTP resend attempts
jit2fa_rate_limited_total counter {mapping_id} Rate-limited requests
Token Handoff:
jit2fa_handoff_entry_total counter {mapping_id, reason, dpop_bound} Entry path visits by outcome and DPoP binding state
reasons: missing_return_url, invalid_return_url, missing_dpop_jkt, invalid_dpop_jkt, redirect_login, direct_mint, form_post
form_post: parallel entry where the login POST carried _jit2fa_return_url plus an optional _jit2fa_dpop_jkt, and the middleware treated the request as a handoff rather than the traditional credential-replay flow
dpop_bound: "true" when the caller supplied a valid dpop_jkt query parameter (or form field), "false" otherwise. Early-rejection paths (before dpop_jkt parse) always emit "false".
jit2fa_handoff_mints_total counter {mapping_id, result, reason?, dpop_bound} Mint step outcomes by result, reason, and binding
failure reasons: revalidate_failed, malformed_return_url, missing_identity, missing_dpop_jkt, oidc_error
dpop_bound: "true" when the minted (or attempted) token carries a cnf.jkt confirmation claim. Use this dimension for DPoP adoption tracking:
sum by (dpop_bound) (rate(jit2fa_handoff_mints_total{result="success"}[5m]))
jit2fa_handoff_mint_duration latency {mapping_id} Time from finalizeTokenHandoff entry to mint response
jit2fa_handoff_bearer_checks_total counter {mapping_id, result, reason?, dpop_bound} Bearer check outcomes by result, reason, and binding
rejected reasons: empty_token, validator_error, invalid_token, audience_mismatch, token_not_dpop_bound, missing_dpop_header, dpop_validator_error, dpop_proof_invalid, dpop_jkt_mismatch
dpop_bound: "true" when the presented token has a cnf.jkt claim, "false" otherwise. Early-rejection paths (empty_token, validator_error, invalid_token) emit "false" since the token was not parsed.
DPoP usage query:
sum by (dpop_bound) (rate(jit2fa_handoff_bearer_checks_total{result="accepted"}[5m]))
jit2fa_handoff_bearer_check_duration latency {mapping_id} Time from bearer header parse to validation outcome (full cost: JWT validate + optional DPoP proof validate)
jit2fa_handoff_dpop_validation_duration latency {mapping_id} Isolated cost of oidc.ValidateDPoP alone, a component of handoff_bearer_check_duration, emitted on every DPoP proof validation attempt (success or failure). Use this to tell JWT slowness apart from DPoP slowness when the bearer check p99 regresses.
Token Refresh:
jit2fa_handoff_refresh_total counter {mapping_id, result, reason?} Refresh endpoint outcomes (success/failure)
failure reasons: disabled, parse_error, missing_token, invalid_token, wrong_audience, not_dpop_bound, missing_dpop, dpop_invalid, dpop_mismatch, missing_auth_time, max_session, mint_failed
jit2fa_handoff_refresh_duration latency {mapping_id} Full refresh handler wall-clock latency
Alerts:
# Backend / operational
rate(jit2fa_webhook_validations_total{result="failure"}[5m]) > 5 Webhook backend issues
jit2fa_otp_verifications_total{reason="max_retries"} > 0 OTP brute-force attempt
rate(jit2fa_rate_limited_total[5m]) > 10 High rate limiting
# Token handoff — abuse signals (page on these)
rate(jit2fa_handoff_entry_total{reason="invalid_return_url"}[5m]) > 2 Possible open-redirect probing against the whitelist
rate(jit2fa_handoff_bearer_checks_total{reason="audience_mismatch"}[5m]) > 0 Cross-mapping token replay attempt (alert immediately)
rate(jit2fa_handoff_bearer_checks_total{reason="invalid_token"}[5m]) > 20 High invalid-token rate (bot scan or clock drift)
rate(jit2fa_handoff_bearer_checks_total{reason="dpop_jkt_mismatch"}[5m]) > 0 DPoP thumbprint mismatch — possible stolen token (alert immediately)
rate(jit2fa_handoff_refresh_total{reason="dpop_mismatch"}[5m]) > 0 Refresh with wrong DPoP key — stolen refresh token attempt
# Token handoff — capacity / latency
histogram_quantile(0.99, jit2fa_handoff_mint_duration_bucket) > 0.5 Token signing p99 slow (OIDC signer degraded)
histogram_quantile(0.99, jit2fa_handoff_bearer_check_duration_bucket) > 0.1 Bearer check p99 slow (hexdcall / oidc validation contention)
histogram_quantile(0.99, jit2fa_handoff_dpop_validation_duration_bucket) > 0.05 DPoP proof validation p99 slow (ECDSA cost or replay cache contention)
# Token handoff — DPoP rollout tracking (not alerts, dashboard panels)
sum by (dpop_bound) (rate(jit2fa_handoff_mints_total{result="success"}[5m])) Mint-time DPoP adoption ratio
sum by (dpop_bound) (rate(jit2fa_handoff_bearer_checks_total{result="accepted"}[5m])) Bearer-use DPoP adoption ratio
rate(jit2fa_handoff_bearer_checks_total{reason="token_not_dpop_bound"}[5m]) Legacy clients on a require_dpop mapping (expected to drop to 0 after rollout)
rate(jit2fa_handoff_entry_total{reason="missing_dpop_jkt"}[5m]) Clients hitting a require_dpop entry without dpop_jkt (same signal, earlier in the flow)
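
The two "sum by (dpop_bound)" panels return one rate per dpop_bound value, and the adoption ratio is the "true" series divided by the sum of both. A minimal sketch of that arithmetic follows, assuming the query result has already been collected into a dictionary keyed by label value; the function name is hypothetical.

```python
def dpop_adoption(rates):
    """Fraction of traffic that is DPoP-bound.

    rates: mapping from the dpop_bound label value ("true" or "false")
    to a per-second rate. Returns None when there is no traffic.
    """
    bound = rates.get("true", 0.0)
    total = bound + rates.get("false", 0.0)
    return bound / total if total > 0 else None
```

For example, 3 bound mints per second against 1 unbound mint per second gives an adoption ratio of 0.75.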

Kerberos Ticket Management & SPNEGO Browser SSO

SPNEGO:
kerberos_spnego_validation_total counter {result, reason?} SPNEGO validation results
result=success
result=failure, reason=invalid_base64|invalid_token|auth_failed|no_credentials|user_disabled
Tickets:
kerberos_ticket_acquisition_total counter {result, reason?} Ticket acquisition
result=success | result=failure, reason=auth_failed
kerberos_ticket_refresh_total counter {result} Ticket refresh (success/failure)
kerberos_ticket_revocation_total counter {result} Ticket revocation (success)
kerberos_tickets_revoked counter {} Total tickets revoked (bulk count)
Password:
kerberos_password_change_total counter {result} Password changes (success/failure)
Alerts:
rate(kerberos_spnego_validation_total{result="failure"}[5m]) > 10 SPNEGO failures (keytab/config)
rate(kerberos_ticket_refresh_total{result="failure"}[5m]) > 0 Ticket refresh failing (KDC)
kerberos_spnego_validation_total{reason="user_disabled"} > 0 Disabled user SPNEGO attempt

LDAP Authentication

ldap_authentication_total counter {result, reason?} Authentication attempts
result=success Successful bind
result=failure, reason=empty_username Missing username
result=failure, reason=empty_password Missing password
result=failure, reason=service_unavailable LDAP service error
result=failure, reason=invalid_credentials Wrong password
Alerts:
rate(ldap_authentication_total{result="failure",reason="service_unavailable"}[5m]) > 0 LDAP server down
rate(ldap_authentication_total{result="failure",reason="invalid_credentials"}[5m]) > 20 Brute-force attempt

Magic Link

magiclink_initiated_total counter {} Magic link emails successfully queued (valid user, within rate limits); not incremented for decoy flows or unknown emails
magiclink_verifications_total counter {result} User actions completed by Verify
result=authorized User approved sign-in
result=denied User rejected the request
result=signin_here User chose local sign-in
magiclink_polls_total counter {status} Poll responses, labeled with the returned status
status=pending|authorized|denied|expired|slow_down|completed_elsewhere|invalid (invalid: empty device code)
Additional observability via dependent modules:
- devicecode: devicecode_* metrics cover code creation and polling
- ratelimit: ratelimit_* metrics cover per-IP and per-email throttling
- sessions: session_* metrics cover magiclink session create/revoke
- smtp: smtp_* metrics cover magic link email delivery

OIDC Provider

Token Issuance:
oidc_authcode_generation_total counter {result, reason} Auth code generation
oidc_token_exchange_total counter {result, reason} Code-for-token exchanges
oidc_token_refresh_total counter {result, reason} Token refreshes
oidc_tokens_revoked counter {} Tokens revoked on logout
oidc_token_signing_retry_total counter {result, reason|attempt} Signing retries (threshold signer)
Client Auth:
oidc_validation_failure_total counter {type, client_id} PKCE/scope/redirect failures
oidc_mtls_auth_total counter {result, reason|method} mTLS auth (failure: reason, success: method)
DPoP:
oidc_dpop_validation_total counter {result, reason} Proof validation
oidc_dpop_jti_replay_total counter {detected} Replay detections
oidc_dpop_jti_storage_total counter {result} JTI cache operations
oidc_dpop_nonce_generation_total counter {result} Nonce generation
oidc_dpop_nonce_storage_total counter {result} Nonce cache operations
oidc_dpop_nonce_validation_total counter {result, reason} Nonce validation
PAR:
oidc_par_requests_total counter {result, client_id} PAR creation
oidc_par_consume_total counter {result, client_id} PAR consumption
oidc_par_request_duration histogram {client_id} PAR processing latency
M2M:
oidc_client_credentials_total counter {result, reason} Client Credentials grants
oidc_jwt_bearer_total counter {result, reason} JWT Bearer grants
Operations:
oidc_token_introspection_total counter {result, token_type, active} Token introspection
oidc_token_revocation_total counter {result, token_type} Token revocation
oidc_userinfo_requests_total counter {result, reason} UserInfo requests
oidc_logout_total counter {result} Logouts
oidc_device_code_total counter {result, reason} Device code grants
oidc_pat_issued_total counter {username} PAT issuance
Latency:
oidc_id_token_generation_duration_ms histogram {} ID token generation
oidc_access_token_generation_duration_ms histogram {} Access token generation
oidc_auth_code_generation_duration_ms histogram {} Auth code generation
oidc_entropy_validation_duration_ms histogram {} Entropy validation
Alerts:
rate(oidc_dpop_jti_replay_total[5m]) > 0 DPoP replay attack
rate(oidc_validation_failure_total[5m]) > 10 High validation failure rate
oidc_token_signing_retry_total > 0 Signing key issues
rate(oidc_par_consume_total{result="replay_attempt"}[5m]) > 0 PAR replay attempt

Email OTP

Generation:
otp_codes_generated counter {type} OTP codes generated (type: numeric, base20)
Validation:
otp_validations_total counter {result} Validation outcomes (result: valid, invalid)
otp_validation_failures counter {reason} Failure breakdown (reason: not_found, locked, expired, max_retries, invalid_code)
otp_replay_prevented counter (none) OTPs deleted after successful validation (replay prevention)
Alerts:
rate(otp_validation_failures{reason="max_retries"}[5m]) > 0 Brute-force attempt (OTP locked after max retries)
rate(otp_validation_failures{reason="not_found"}[5m]) > 5 Probing for non-existent OTPs
rate(otp_codes_generated[5m]) > 20 Unusual OTP generation rate

TOTP Authenticator

Enrollment:
totp_enrollments_initiated counter (none) Enroll calls (QR + secret generated)
totp_enrollments_confirmed counter (none) First code verified, secret persisted
totp_enrollments_deleted counter (none) TOTP enrollment deleted
Validation:
totp_validations_total counter {result} Validation outcomes (result: valid, invalid, replay, clock_backward)
Recovery:
totp_recovery_validations_total counter {result} Recovery code outcomes (result: valid, invalid, no_codes)
Alerts:
rate(totp_validations_total{result="replay"}[5m]) > 0 Replay attack attempt detected
rate(totp_validations_total{result="invalid"}[5m]) > 10 Brute-force attempt on TOTP codes
rate(totp_validations_total{result="clock_backward"}[5m]) > 0 Server clock drift — check NTP sync
rate(totp_recovery_validations_total{result="invalid"}[5m]) > 5 Recovery code probing attempt

WebAuthn Passkeys

Passkey Inventory:
webauthn_passkeys_issued gauge {} Total passkeys ever issued
webauthn_passkeys_active gauge {} Currently active passkeys
webauthn_passkeys_revoked gauge {} Revoked passkeys
webauthn_passkeys_expired gauge {} Expired passkeys
Authentication:
webauthn_auth_attempts counter {} Authentication attempts
webauthn_auth_success counter {} Successful authentications
webauthn_auth_failed counter {} Failed authentications
Expiration Monitoring:
webauthn_expiration_check_total counter {result} Expiration checks (success/failure)
webauthn_expiration_passkeys_checked gauge {} Passkeys checked in last run
webauthn_expiration_emails_sent gauge {} Reminder emails sent in last run
webauthn_expiration_reminder_total counter {result} Reminder send attempts (success/failure)
Alerts:
rate(webauthn_auth_failed[5m]) > 20 High auth failure rate
webauthn_passkeys_active == 0 No active passkeys (service unusable)
rate(webauthn_expiration_check_total{result="failure"}[1h]) > 0 Expiration check failing

X.509 Client Certificate Authentication

Validation:
x509_validation_total counter {result, reason?} Certificate validation attempts
result=success Valid certificate authenticated
result=failure, reason=not_yet_valid Certificate NotBefore in future
result=failure, reason=expired Certificate past NotAfter
result=failure, reason=no_ca_available No CA certs configured
result=failure, reason=chain_validation_failed Chain/signature verification failed
result=failure, reason=revoked_crl Revoked via CRL (external cert)
result=failure, reason=invalid_identity Identity field missing from cert
result=failure, reason=directory_error Directory lookup call failed
result=failure, reason=directory_timeout Directory lookup timed out
result=failure, reason=user_not_found User not in directory
result=failure, reason=revoked_internal Revoked via serial index (internal cert)
result=failure, reason=not_registered Internal cert not in enrollment registry
result=failure, reason=revoked_ocsp Revoked via OCSP (external cert)
Enrollment:
x509_enrollment_total counter {result, reason?} Certificate enrollment attempts
result=success Certificate issued successfully
result=failure, reason=invalid_username Username validation failed
Revocation:
x509_revocation_total counter {result, reason} Certificate revocations
result=success, reason=<RFC5280 code> Revocation completed
CRL:
x509_crl_refresh_total counter {result} CRL download/refresh attempts
result=success CRL loaded/refreshed
result=failure Download failed from all URLs
x509_crl_revoked_count gauge {} Number of revoked certs in CRL
x509_crl_size_bytes gauge {} Raw CRL size in bytes
OCSP:
x509_ocsp_query_total counter {result, cached} OCSP lookups
result=success, cached=true Cache hit (memory)
result=success, cached=false Responder queried successfully
result=failure, cached=false All responders unreachable
Auto-Renewal:
x509_auto_renewal_check_total counter {result} Renewal check runs
x509_auto_renewal_total counter {result} Individual cert renewals
result=success Cert renewed and emailed
result=failure Renewal failed
x509_auto_renewal_skipped_total counter {reason} Renewals skipped
reason=no_email User has no email in directory
reason=no_certificate_der No stored cert for key extraction
x509_auto_renewal_certs_checked gauge {} Certs checked in last run
x509_auto_renewal_certs_renewed gauge {} Certs renewed in last run
x509_auto_renewal_certs_skipped gauge {} Certs skipped (opt-out) in last run
x509_auto_renewal_errors gauge {} Errors in last renewal run
Alerts:
rate(x509_validation_total{result="failure"}[5m]) > 10 High validation failure rate
rate(x509_validation_total{reason="revoked_crl"}[5m]) > 0 CRL-revoked cert used (possible compromise)
rate(x509_validation_total{reason="revoked_internal"}[5m]) > 0 Revoked internal cert used
rate(x509_crl_refresh_total{result="failure"}[5m]) > 0 CRL server unreachable
rate(x509_ocsp_query_total{result="failure"}[5m]) > 0 OCSP responder down
x509_auto_renewal_errors > 0 Auto-renewal failures need attention

RADIUS Authentication (RADSEC + UDP)

Connections:
radius_connections_total counter {nas} TCP connections accepted (RADSEC)
Packets:
radius_packets_total counter {transport, nas} RADIUS packets received (transport: tcp or udp)
Authentication:
radius_auth_total counter {result, method, nas} Auth outcomes (result: accept/reject, method: password/x509/none)
radius_auth_total counter {result, reason, nas} Auth rejections with reason (reason: geo, time)
radius_auth_duration latency {result} End-to-end auth+authz latency (result: accept/reject)
Mappings:
radius_mapping_matches_total counter {mapping, nas} Mapping match counts per mapping name
Errors:
radius_errors_total counter {reason, nas} Error counts by reason:
reason=tls_handshake TLS handshake failure on RADSEC connection
reason=hxep_mtls_conflict HXEP connection rejected — NAS has per-client mTLS
reason=invalid_frame RADIUS packet length out of range (< 20 or > 4096)
reason=incomplete_frame RADSEC frame body read failed (truncated)
reason=rate_limit Per-NAS rate limit exceeded (silent drop)
reason=concurrent_limit Global concurrent auth limit reached (silent drop)
reason=parse_error RADIUS packet parse failed (bad authenticator / malformed)
reason=invalid_state MFA challenge state token invalid or expired
reason=nas_mismatch MFA challenge response from different NAS than original
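
The invalid_frame and incomplete_frame reasons both come from RADIUS packet framing: the 16-bit length field sits at bytes 2 and 3 of the header and must be between 20 and 4096. The sketch below shows one way such a check could assign those two reason values. The function name and the exact mapping to reasons are assumptions, not the server's actual code.

```python
import struct

MIN_LEN, MAX_LEN = 20, 4096  # length limits named in the invalid_frame row


def classify_frame(buf: bytes):
    """Return (packet, reason); reason is None when the frame is usable."""
    if len(buf) < 4:
        # Not enough bytes to read the code, identifier, and length fields.
        return None, "incomplete_frame"
    (length,) = struct.unpack_from("!H", buf, 2)  # big-endian length field
    if length < MIN_LEN or length > MAX_LEN:
        return None, "invalid_frame"
    if len(buf) < length:
        # The header promised more bytes than were read.
        return None, "incomplete_frame"
    return buf[:length], None
```

A caller would increment radius_errors_total with the returned reason and drop the frame whenever the reason is not None.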

Onboarding Service

Observability is provided indirectly through dependent modules:
- sessions: session_* metrics cover mfa_pending session creation and validation
- webauthn: webauthn_* metrics cover passkey registration ceremonies
- magiclink: magiclink_* metrics cover magic link email and polling
- ratelimit: ratelimit_* metrics cover PoW and request throttling

Sign-In Service

Observability is provided indirectly through dependent modules:
- sessions: session_* metrics cover session creation, validation, and revocation
- ldapauth: ldap_* metrics cover LDAP bind authentication
- webauthn: webauthn_* metrics cover passkey authentication ceremonies
- emailotp: otp_* metrics cover OTP generation and validation
- totp: totp_* metrics cover TOTP validation
- magiclink: magiclink_* metrics cover magic link initiation and verification
- ratelimit: ratelimit_* metrics cover brute force protection on signin endpoints
- directory: directory_* metrics cover user sync and lookup

Identity & Directory

Directory Cache

Sync counters:
directory_sync_total counter {type, result} Sync operations completed
Labels: type="full"|"delta", result="success"
Sync gauges:
directory_users_synced gauge {} Users synchronized in last full sync
directory_groups_synced gauge {} Groups synchronized in last full sync
Sync latency:
directory_sync_duration histogram {type} Sync processing time
Labels: type="full"|"delta"
Alerts:
changes(directory_sync_total{result="success"}[10m]) == 0 No successful syncs (LDAP connectivity)
directory_sync_duration{type="full"} > 60s Full sync taking too long
changes(directory_users_synced[1h]) == 0 No syncs completing

LDAP Provider

Operations:
ldap_operations_total counter {operation, status} LDAP operation count
operation=bind, status=success|invalid_credentials|error Bind outcomes
operation=search_users, status=success|error, paged=true|false User search outcomes
operation=search_groups, status=success|error, paged=true|false Group search outcomes
ldap_operation_duration latency {operation, status} LDAP operation latency
Same label sets as operations_total
Bind:
ldap_bind_success counter {} Successful user binds
ldap_bind_failures counter {reason} Failed user binds
reason=invalid_credentials Wrong password
reason=ldap_error Server/network error
Connection Pool:
ldap_pool_errors counter {reason} Pool-level errors
reason=not_initialized Pool not ready
reason=pool_closed Pool shut down
reason=config_unavailable Config missing on reconnect
reason=reconnect_failed Reconnect attempt failed
reason=timeout Pool wait timeout
ldap_pool_reconnects counter {reason} Pool reconnections
reason=stale_connection Stale conn on acquire
reason=stale_on_release Stale conn on release
ldap_pool_acquire_duration latency {reconnected} Time to acquire connection
reconnected=true|false
ldap_pool_available gauge {} Available connections in pool
ldap_pool_capacity gauge {} Total pool capacity
ldap_pool_utilization_pct gauge {} Pool utilization percentage
Search Results:
ldap_search_results gauge {type} Result set size
type=users|groups
ldap_paged_search_pages gauge {type} Pages fetched in paged search
type=users|groups
Alerts:
ldap_pool_utilization_pct > 90 Pool near exhaustion
rate(ldap_pool_errors{reason="timeout"}[5m]) > 0 Pool starvation
rate(ldap_bind_failures{reason="ldap_error"}[5m]) > 0 LDAP server issues
rate(ldap_operations_total{status="error"}[5m]) > 5 Search failures
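
The pool gauges are related: utilization can be derived from capacity and available connections. The sketch below shows that arithmetic under the assumption that ldap_pool_utilization_pct is the in-use share of ldap_pool_capacity; the source does not give the exact formula, and the function name is hypothetical.

```python
def pool_utilization_pct(available: int, capacity: int) -> float:
    """Percentage of pool connections currently in use."""
    if capacity <= 0:
        return 0.0
    # Clamp so inconsistent gauge readings cannot yield a value
    # outside the 0 to 100 range.
    in_use = max(0, min(capacity, capacity - available))
    return 100.0 * in_use / capacity
```

With a capacity of 20 and one connection available, utilization is 95, which would cross the 90 percent threshold in the first alert above.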

OIDC Relying Party

Authorization flow:
oidc_rp_authorization_initiated_total counter {provider} Authorization URL built successfully
oidc_rp_state_validation_success_total counter {provider} State validated and session consumed
oidc_rp_state_validation_failures_total counter {reason} State validation failures (reason: decryption_failed, version_mismatch, state_expired, session_not_found, state_mismatch, csrf_mismatch)
Token exchange:
oidc_rp_token_exchange_success_total counter {provider} Code-for-tokens exchange succeeded
oidc_rp_token_exchange_failures_total counter {provider, reason} Exchange failures (reason: network_error, id_token_invalid, at_hash_mismatch, or IdP error code)
oidc_rp_token_exchange_duration latency {provider} Token endpoint round-trip time
Token refresh:
oidc_rp_token_refresh_success_total counter {provider} Refresh token exchange succeeded
oidc_rp_token_refresh_failures_total counter {provider, reason} Refresh failures (reason: network_error, or IdP error code)
oidc_rp_token_refresh_duration latency {provider} Token refresh round-trip time
Token revocation:
oidc_rp_token_revocation_success_total counter {provider} Revocation acknowledged by IdP
oidc_rp_token_revocation_failures_total counter {provider, reason} Revocation failures (reason: network_error, or IdP error code)
Token introspection:
oidc_rp_token_introspection_active_total counter {provider} Introspection returned active=true
oidc_rp_token_introspection_inactive_total counter {provider} Introspection returned active=false
oidc_rp_token_introspection_failures_total counter {provider, reason} Introspection failures (reason: network_error, or IdP error code)
oidc_rp_token_introspection_duration latency {provider} Introspection round-trip time
ID token validation:
oidc_rp_id_token_validation_success_total counter {provider} ID token signature and claims validated
Discovery:
oidc_rp_discovery_cache_hits_total counter {provider} Discovery served from cache
oidc_rp_discovery_cache_misses_total counter {provider} Discovery cache miss, fetched from IdP
oidc_rp_discovery_fetch_success_total counter {provider} Discovery fetched and validated
oidc_rp_discovery_fetch_failures_total counter {provider} Discovery fetch failed
oidc_rp_discovery_fetch_duration latency {provider} Discovery endpoint round-trip time
JWKS:
oidc_rp_jwks_cache_hits_total counter {provider} JWKS key found in cache
oidc_rp_jwks_fetch_success_total counter {provider} JWKS fetched from IdP
oidc_rp_jwks_fetch_failures_total counter {provider} JWKS fetch failed
oidc_rp_jwks_fetch_duration latency {provider} JWKS endpoint round-trip time
DPoP:
oidc_rp_dpop_jti_replay_total counter {provider} DPoP JTI replay attack detected
oidc_rp_dpop_validation_success_total counter {provider} DPoP proof validated successfully
PAR (Pushed Authorization Requests):
oidc_rp_par_success_total counter {provider} PAR request accepted by IdP
oidc_rp_par_failures_total counter {provider, reason} PAR failures (reason: discovery_failed, not_supported, invalid_redirect_uri, network_error, http_error, invalid_expires_in, expires_in_too_large, or IdP error code)
oidc_rp_par_request_duration latency {provider} PAR endpoint round-trip time
oidc_rp_par_authorization_success_total counter {provider} Authorization URL built via PAR flow
UserInfo:
oidc_rp_userinfo_success_total counter {provider} UserInfo fetched successfully
oidc_rp_userinfo_failures_total counter {provider, reason} UserInfo failures (reason: network_error, http_<status>)
oidc_rp_userinfo_fetch_duration latency {provider} UserInfo endpoint round-trip time
Alerts:
rate(oidc_rp_state_validation_failures_total{reason="csrf_mismatch"}[5m]) > 0 CSRF attack attempt
rate(oidc_rp_state_validation_failures_total{reason="decryption_failed"}[5m]) > 5 State tampering or key mismatch
rate(oidc_rp_dpop_jti_replay_total[5m]) > 0 DPoP replay attack attempt
rate(oidc_rp_discovery_fetch_failures_total[5m]) > 3 IdP discovery unreachable
rate(oidc_rp_token_exchange_failures_total[5m]) > 10 Elevated token exchange failures

SCIM Identity Provider

SCIM client counters (module: scim_client):
scim_client_list_failures_total counter {provider, endpoint, reason} Paginated list failures
scim_client_request_errors_total counter {provider, reason} HTTP request errors (network/timeout)
scim_client_requests_total counter {provider, status} HTTP requests by status code
scim_client_oauth2_failures_total counter {provider, reason} OAuth2 token refresh failures
scim_client_oauth2_success_total counter {provider} OAuth2 token refresh successes
SCIM client latency (module: scim_client):
scim_client_request_duration histogram {provider, method} Per-request HTTP latency
scim_client_list_duration histogram {provider, endpoint} Full paginated list latency
SCIM client gauges (module: scim_client):
scim_client_list_results gauge {provider, endpoint} Resources returned from last list
Sync counters (module: identity.scim):
identity_scim_sync_started counter {provider, sync_type} Sync operations started
identity_scim_sync_completed counter {provider, sync_type, status} Sync operations completed
identity_scim_sync_failed counter {provider, sync_type, status} Sync operations failed
identity_scim_delta_fallback_to_full counter {provider} Delta syncs that fell back to full
identity_scim_circuit_opened counter {provider} Circuit breaker open events
identity_scim_circuit_closed counter {provider} Circuit breaker close events
identity_scim_deletions_blocked counter {provider, reason} Deletion operations blocked by safety thresholds
Sync gauges (module: identity.scim):
identity_scim_users_synced gauge {provider} Users from last sync
identity_scim_groups_synced gauge {provider} Groups from last sync
Sync latency (module: identity.scim):
identity_scim_sync_duration histogram {provider, sync_type, status} Sync processing time
Directory apply counters (module: identity.scim):
identity_scim_users_created counter {provider, source} Users created in directory
identity_scim_users_updated counter {provider, source} Users updated in directory
identity_scim_users_disabled counter {provider, source} Users disabled in directory
identity_scim_users_deleted counter {provider, source} Users deleted from directory
identity_scim_groups_created counter {provider, source} Groups created in directory
identity_scim_groups_updated counter {provider, source} Groups updated in directory
identity_scim_groups_deleted counter {provider, source} Groups deleted from directory
identity_scim_sync_errors counter {provider, source} Per-operation sync errors
Webhook counters (module: identity.scim):
identity_scim_webhook_total counter {provider, result} Webhook events by result
Labels: result="success"|"unknown_provider"|"provider_disabled"|"no_secret_configured"|
"empty_payload"|"payload_too_large"|"missing_signature"|"invalid_signature"|
"parse_error"|"unknown_event_type"|"missing_timestamp"|"stale_event"|
"duplicate"|"dedup_failed_closed"|"dedup_impossible"|"deletion_budget_exceeded"|
"apply_error"
Alerts:
changes(identity_scim_sync_completed{status="success"}[30m]) == 0 No successful syncs
increase(identity_scim_circuit_opened[5m]) > 0 Circuit breaker tripped
rate(identity_scim_webhook_total{result="invalid_signature"}[5m]) > 0 Webhook signature failures
identity_scim_sync_duration > 120s Sync taking too long

SSH & SQL Bastion

SSH Bastion Gateway

Connections:
bastion_connections_total counter {} Total connections
bastion_connections_active gauge {} Active connections
bastion_connections_rejected counter {reason} Rejected connections
bastion_connection_limit_hits_total counter {limit_type} Connection limit enforced
bastion_connection_duration latency {} Connection lifetime
Authentication:
bastion_auth_attempts_total counter {client_ip} Auth attempts
bastion_auth_success_total counter {username, client_ip} Successful auths
bastion_auth_failures_total counter {client_ip, reason} Auth failures
bastion_auth_bans_total counter {client_ip, reason} Client bans
bastion_auth_duration latency {status} Auth operation time
Sessions:
bastion_sessions_total counter {username, client_ip} Sessions created
bastion_sessions_active gauge {} Active sessions
bastion_session_duration latency {username} Session lifetime
bastion_sessions_rejected counter {reason} Sessions rejected
bastion_session_limit_hits_total counter {limit_type} Session limit enforced
Commands:
bastion_commands_total counter {command} Commands executed
bastion_command_duration latency {command} Command execution time
bastion_commands_rate_limited_total counter {} Commands rate limited
DoS Protection:
bastion_rate_limit_hits_total counter {limit_type} Rate limits enforced
Resources:
bastion_history_entries_total gauge {} Command history entries
bastion_port_forwards_active gauge {} Active port forwards
bastion_health_status gauge {} Health (1=healthy, 0=unhealthy)
bastion_buffer_pool_gets counter {} Buffer pool acquisitions
bastion_buffer_pool_puts counter {} Buffer pool releases
SFTP:
bastion_sftp_sessions_total counter {auth_type} SFTP sessions
bastion_ssrf_blocks_total counter {block_type} SSRF blocks
SQL Bastion:
bastion_sql_queries_total counter {site, status} SQL queries executed
bastion_sql_query_duration latency {site, db_type, user} SQL query time
bastion_sql_acl_rejections counter {site, reason} SQL ACL rejections
Alerts:
increase(bastion_auth_bans_total[5m]) > 0 Client banned (brute force)
rate(bastion_auth_failures_total[5m]) > 20 High auth failure rate
bastion_connections_active > 500 Connection saturation
bastion_sessions_active > 200 Session saturation
rate(bastion_ssrf_blocks_total[5m]) > 0 SSRF attempt detected
rate(bastion_sql_acl_rejections[5m]) > 5 SQL ACL violations

Certificates & PKI

ACME CA Server

ACME Provider Rate Limiting (namespace: acme_provider):
acme_provider_ratelimit_checks_total counter {} Total rate limit check invocations
acme_provider_ratelimit_check_results_total counter {limit_type, result} Rate limit check outcomes
limit_type=all, result=passed All checks passed
limit_type=<type>, result=blocked Operation blocked by specific limit type
acme_provider_ratelimit_check_duration latency {} Rate limit check duration
acme_provider_ratelimit_circuit_breaker_trips_total counter {limit_type} Circuit breaker trips (blocking after consecutive state errors)
acme_provider_ratelimit_orders_created_total counter {} Orders recorded for rate limiting
acme_provider_ratelimit_certs_issued_total counter {} Certificates recorded for rate limiting
acme_provider_ratelimit_domain_issuances_total counter {domain} Issuances per registered domain
acme_provider_ratelimit_auth_failures_total counter {domain} Authorization failures per registered domain
acme_provider_ratelimit_finalization_failures_total counter {} Failed finalization attempts recorded
acme_provider_ratelimit_state_errors_total counter {limit_type, operation} Distributed state access errors
acme_provider_ratelimit_approaching_total counter {limit_type} Warning: nearing rate limit capacity (80%+)
acme_provider_ratelimit_current_usage gauge {limit_type} Current usage count for limit dimension
acme_provider_ratelimit_limit gauge {limit_type} Configured limit for dimension
acme_provider_ratelimit_usage_percent gauge {limit_type} Usage as percentage of limit
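The three gauges above relate as usage divided by limit, and the approaching counter is described as firing at 80%+. A minimal sketch of that relationship follows. The function names and the handling of a zero limit are assumptions for illustration, not the module's implementation.

```python
def usage_percent(current_usage: float, limit: float) -> float:
    """Usage as a percentage of the configured limit.

    Assumption: a non-positive limit is reported as 0% rather than an error.
    """
    if limit <= 0:
        return 0.0
    return 100.0 * current_usage / limit


def is_approaching(current_usage: float, limit: float,
                   threshold_pct: float = 80.0) -> bool:
    """True once usage reaches the warning threshold (80% per the description above)."""
    return usage_percent(current_usage, limit) >= threshold_pct
```

For example, 40 recorded orders against a limit of 50 gives 80%, which would count as approaching.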
SPIFFE (namespace: spiffe):
spiffe_accounts_created_total counter {workload} SPIFFE accounts created
spiffe_orders_created_total counter {workload} SPIFFE orders created
spiffe_orders_finalized_total counter {workload} SPIFFE orders finalized (issuance started)
spiffe_certificates_issued_total counter {workload} SPIFFE certificates issued successfully
spiffe_certificate_issuance_errors_total counter {workload, reason} SPIFFE certificate issuance failures
spiffe_certificates_revoked_total counter {workload} SPIFFE certificates revoked
spiffe_certificate_retrievals_total counter {} SPIFFE certificate downloads
spiffe_certificate_retrieval_errors_total counter {reason} SPIFFE certificate download errors
spiffe_trust_bundle_requests_total counter {} SPIFFE trust bundle requests
spiffe_issuance_queue_depth gauge {} Current concurrent issuance goroutines
spiffe_issuance_queue_full_total counter {workload} Issuance rejected due to queue full
spiffe_ca_signing_duration_ms histogram {} CA signing ceremony latency (ms)
spiffe_ratelimit_current_usage gauge {workload} Current rate limit usage per workload
spiffe_ratelimit_blocked_total counter {workload} Requests blocked by rate limit
spiffe_ratelimit_check_error_total counter {workload, reason} Rate limit check errors (fail-open)
spiffe_ratelimit_record_error_total counter {workload, reason} Rate limit recording errors
spiffe_ratelimit_record_success_total counter {workload} Rate limit entries recorded successfully

ACME Client

Issuance counters (module: acmeclient):
acmeclient_issuance_started_total counter {domain} Certificate issuance started
acmeclient_issuance_success_total counter {domain, key_type} Certificate issuance succeeded
acmeclient_issuance_failed_total counter {domain, error_type} Certificate issuance failed
Labels: error_type="timeout"|"rate_limit"|"authorization"|"network"|"dns"|"invalid_request"|"not_found"|"unknown"|"none"
Issuance latency (module: acmeclient):
acmeclient_issuance_duration histogram {domain, key_type} End-to-end issuance time
Renewal counters (module: acmeclient):
acmeclient_renewal_checks_total counter (no labels) Renewal check cycles executed
acmeclient_renewal_success_total counter {domain} Certificate renewals succeeded
acmeclient_renewal_failed_total counter {domain, error_type} Certificate renewals failed
Challenge counters (module: acmeclient):
acmeclient_challenges_stored_total counter {domain} Challenge tokens stored in distributed cache
acmeclient_challenges_served_total counter {status} Challenge responses served
Labels: status="success"|"not_found"|"invalid_token"|"lookup_error"|"internal_error"|"invalid_value"|"write_error"
Certificate gauges (module: acmeclient):
acmeclient_certificates_checked gauge (no labels) Certificates checked in last renewal cycle
acmeclient_certificates_expiring gauge (no labels) Certificates needing renewal in last cycle
acmeclient_certificates_loaded gauge (no labels) Total certificates in memory cache
acmeclient_certificate_days_until_expiry gauge {domain} Days until certificate expires
ARI counters (module: acmeclient):
acmeclient_ari_fetch_total counter {result} ARI fetch attempts
Labels: result="success"|"error"
acmeclient_ari_early_renewal_suggestions_total counter (no labels) CA-suggested early renewals (possible revocation)
acmeclient_ari_renewals_total counter {domain} ARI-guided certificate renewals completed
acmeclient_ari_marked_replaced_total counter (no labels) ARI renewals marked as replaced
acmeclient_ari_cache_total counter {result} ARI cache lookups
Labels: result="hit"|"miss"
Rate limit counters (module: acmeclient):
acmeclient_ratelimit_checks_total counter (no labels) Rate limit pre-flight checks executed
acmeclient_ratelimit_check_results_total counter {limit_type, result} Rate limit check outcomes
Labels: limit_type="retry_after"|"min_order_interval"|"orders_per_account"|"certs_per_domain"|
"auth_failures_per_domain"|"certs_per_exact_set"|"all"
Labels: result="blocked"|"passed"|"error"
acmeclient_ratelimit_orders_created_total counter (no labels) ACME orders created (tracked for limits)
acmeclient_ratelimit_certs_issued_total counter {domain} Certificates issued per domain (tracked for limits)
acmeclient_ratelimit_auth_failures_total counter {domain} Authorization failures per domain
acmeclient_ratelimit_retry_after_total counter {status_code} Retry-After responses from CA
Labels: status_code="429"|"503"|"other"
acmeclient_ratelimit_state_errors_total counter {operation} Rate limit state storage errors
Labels: operation="set"|"get"
acmeclient_ratelimit_approaching_total counter {limit_type} Rate limit approaching capacity warnings (>80%)
Rate limit gauges (module: acmeclient):
acmeclient_ratelimit_current_usage gauge {limit_type} Current usage count per limit type
acmeclient_ratelimit_limit gauge {limit_type} Effective limit value per limit type
acmeclient_ratelimit_usage_percent gauge {limit_type} Usage percentage per limit type
Rate limit latency (module: acmeclient):
acmeclient_ratelimit_check_duration histogram (no labels) Pre-flight rate limit check duration
Alerts:
acmeclient_issuance_failed_total increasing -> Check CA reachability, DNS, port 80 access
acmeclient_certificates_expiring > 0 persisting -> Renewal failing, check error_type labels
acmeclient_certificate_days_until_expiry < 7 -> Urgent: cert near expiry, check renewal logs
acmeclient_ratelimit_check_results_total{result="blocked"} increasing -> Client-side rate limit preventing issuance
acmeclient_ari_early_renewal_suggestions_total increasing -> CA suggesting early renewal, possible revocation
acmeclient_challenges_served_total{status!="success"} increasing -> Challenge failures, check port 80 and DNS
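The acmeclient_certificate_days_until_expiry gauge and the 7-day alert above can be reproduced offline from a certificate's notAfter time. This is a sketch under assumptions: the function names are hypothetical, and whether the real gauge reports whole or fractional days is not stated here.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Fractional days between now and the certificate's notAfter time."""
    return (not_after - now).total_seconds() / 86400.0


def is_urgent(not_after: datetime, now: datetime,
              threshold_days: float = 7.0) -> bool:
    """Mirrors the '< 7' alert condition; the threshold is configurable."""
    return days_until_expiry(not_after, now) < threshold_days


# Hypothetical certificate expiring five days after the evaluation time.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2024, 1, 6, tzinfo=timezone.utc)
urgent = is_urgent(not_after, now)
```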

AutoTLS Certificate Management

autotls_issuances_total counter {result=success|failure} Initial certificate issuance on startup (after retry loop)
autotls_renewals_total counter {result=success|failure} Certificate renewal attempts (both automatic renewal loop and manual 'autotls renew')
Certificate state is also observable via the 'autotls status' hexdcall command and log entries.

Certificate Management

Certificate Operations (namespace: certmanager):
certmanager_set_total counter {source} Certificate store operations (both domain and default)
source=static|acme|acmeclient Certificate source type
certmanager_get_total counter {hit} Cache lookups for TLS certificate retrieval
hit=true Certificate found (exact or wildcard match)
hit=false Certificate not found in cache
certmanager_expired_total counter {} Certificates expired from cache (renewal may have failed)
certmanager_certificates_total gauge {} Total certificates currently held in local cache

SPIFFE Workload Identity

Requests:
spiffe_requests_total counter {endpoint, status} Requests per endpoint (status: success/error)
spiffe_request_duration_ms histogram {endpoint} Request latency in milliseconds
Errors:
spiffe_errors_total counter {type, status} ACME error responses by problem type and HTTP status
spiffe_cidr_blocked_total counter (none) Requests blocked by CIDR policy
Endpoint label values for requests_total and request_duration_ms:
directory, new_nonce, new_account, new_order, get_order, finalize, get_certificate,
trust_bundle, revoke_cert, tos
Alerts:
rate(spiffe_errors_total[5m]) > 10 High ACME error rate
rate(spiffe_cidr_blocked_total[5m]) > 0 CIDR policy blocking requests
histogram_quantile(0.99, sum by (le) (rate(spiffe_request_duration_ms_bucket[5m]))) > 5000 P99 latency exceeding 5 seconds
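The P99 alert above depends on estimating a quantile from cumulative histogram buckets. The sketch below shows the usual linear-interpolation approach. The bucket boundaries and counts are hypothetical, and estimate_quantile is an illustrative name, not a Prometheus API.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound; the last bound may be +inf.
    Values are assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total <= 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Cannot interpolate into an open-ended or empty bucket:
            # fall back to the lower edge of that bucket.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for spiffe_request_duration_ms.
buckets = [(100.0, 50), (1000.0, 90), (5000.0, 98), (10000.0, 100), (math.inf, 100)]
p99_ms = estimate_quantile(0.99, buckets)
alert_fires = p99_ms > 5000
```

Because the estimate interpolates within a bucket, its precision is bounded by the bucket widths. A threshold that coincides with a bucket boundary gives the most reliable alert.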

Protection

Access Policy Engine

No Prometheus metrics emitted by this module.

Data Loss Prevention

Counters:
dlp_scanned counter {direction,content_type} Bodies scanned
dlp_violations counter {detector,action,direction} Violations found
dlp_blocked counter {direction} Requests/responses blocked
dlp_redacted counter {direction} Bodies redacted
dlp_skipped counter {reason,direction} Scan skipped
Histograms:
dlp_scan_duration_ms histogram {direction} Scan latency in milliseconds

Geo/IP and ASN Access Control

Request outcomes:
geoaccess_requests_total counter {status, reason} Per-request outcome
geoaccess_blocked_by_country counter {country} Blocked requests by country code
geoaccess_blocked_by_asn counter {asn} Blocked requests by ASN number
geoaccess_cdn_country_used counter {country} Requests using CDN-provided country header
Label values for requests_total:
status: allowed | blocked
reason: bypass_cidr | passed | asn_denied | asn_not_allowed | country_denied | country_not_allowed
Cache performance:
geoaccess_cache counter {result, type} Cache hit/miss tracking
Label values:
result: hit | miss
type: (empty for full lookup) | asn_only (CDN country mode, ASN-only lookup)
Note: blocked_by_country and blocked_by_asn are emitted alongside requests_total
for per-entity breakdown. requests_total with reason=asn_not_allowed and
reason=country_not_allowed intentionally omit the per-entity label to avoid
unbounded cardinality (the blocked entity is not in any configured list).
Alerts:
rate(geoaccess_requests_total{status="blocked"}[5m]) spike Unusual geo-block volume — verify rules or check for attack
geoaccess_cache{result="miss"} >> geoaccess_cache{result="hit"} Low cache hit rate — high IP diversity or short TTL
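The cache alert above is stated informally as misses far exceeding hits. One way to make it concrete is a hit ratio computed from the increase of each series over the same window. This is a sketch: the function name and the 0.5 threshold are illustrative assumptions, not documented defaults.

```python
def hit_ratio(hits: float, misses: float) -> float:
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total > 0 else 0.0


# Hypothetical increases of geoaccess_cache{result="hit"} and {result="miss"}
# over the same window.
ratio = hit_ratio(hits=120, misses=880)
low_hit_rate = ratio < 0.5  # assumed threshold
```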

Proof-of-Work Challenge

Counters:
pow_challenges_issued counter {} Challenges generated (generateChallenge + createChallenge)
pow_challenges_solved counter {} Challenges solved successfully (valid hash + timing + honeypot)
pow_challenges_failed counter {} Challenges failed (expired, invalid, bot detection, bad hash)
Alerts:
rate(pow_challenges_failed[5m]) > rate(pow_challenges_solved[5m]) More failures than successes (possible bot wave)
rate(pow_challenges_issued[5m]) > 1000 High challenge generation rate (DDoS or misconfigured difficulty)

Rate Limiting

Counters:
ratelimit_requests_total counter {result,hostname} Requests checked (result: "allowed" or "blocked")
ratelimit_clients_banned counter {hostname} Clients banned (auto rate-limit exceeded + manual bans)
ratelimit_clients_dropped counter {} Clients refused tracking due to memory capacity limit
ratelimit_clients_unbanned counter {} Clients manually unbanned
Gauges:
ratelimit_clients_tracked gauge {} Currently tracked unique clients (exported on GetStats)
Alerts:
rate(ratelimit_requests_total{result="blocked"}[5m]) > rate(ratelimit_requests_total{result="allowed"}[5m]) More blocks than allows (attack or too-strict config)
ratelimit_clients_tracked > 0.8 * max_clients Approaching memory capacity limit

Request Size Limiting

Counters:
sizelimit_requests_total counter {result} Requests processed (result: "allowed" or "rejected")
sizelimit_exception_matched counter {host,path} Requests that matched a size limit exception

Time-Based Access Control

Operations:
timeaccess_requests_total counter {status, reason} Allowed/blocked requests (status=allowed|blocked, reason=bypass_cidr|passed|day_denied|day_not_allowed|hours_denied|hours_not_allowed)
timeaccess_windows_checked counter {matched_by} Window match distribution (matched_by=cidr|country|default)
Alerts:
rate(timeaccess_requests_total{status="blocked"}[5m]) > 10 High block rate may indicate misconfigured time windows
timeaccess_windows_checked{matched_by="default"} increasing Many requests falling through to default window — consider adding country/CIDR windows

Web Application Firewall

Counters:
waf_requests counter {blocked,method} Requests inspected by WAF
waf_blocked counter {rule_id,path,action} Requests blocked by WAF rules
waf_passed counter {path} Requests that passed WAF inspection
waf_bypassed counter {path} Requests bypassed (WAF disabled for route)
waf_body_too_large counter {path} Requests rejected for body size exceeding limit
Histograms:
waf_duration_ms histogram {blocked,method} WAF inspection duration in milliseconds

End-to-Origin Encryption

e2oe_channels_total counter {type} Channel establishments
type=baseline Baseline ECDH channel
type=established Tier 1 (WebAuthn) first establishment
type=rebound Tier 1 rebind on page reload
type=prf_wrapped Tier 1 via PRF-wrapped relay (cross-origin promotion)
e2oe_channel_tier_total counter {tier, origin_match} Channel establishments by negotiated tier and origin
tier=baseline|webauthn Negotiated tier
origin_match=auth Channel established on the auth origin
origin_match=cross_origin Channel established on a non-auth origin (PRF-wrapped path)
e2oe_requests_encrypted_total counter {} Requests processed through E2OE
Incremented for every header-path request (fetch/XHR)
e2oe_decryption_failures_total counter {} Request body decryption failures
e2oe_websocket_frames_total counter {direction} WebSocket frames encrypted/decrypted
direction=encrypt Server→browser frames
direction=decrypt Browser→server frames
e2oe_websocket_failures_total counter {direction} WebSocket encrypt/decrypt failures
direction=encrypt Server→browser encryption failed
direction=decrypt Browser→server decryption failed
direction=decrypt_seq Strict-monotonic seq gate rejected a frame (replay or reorder)
e2oe_tier1_relay_total counter {outcome} Wrap-relay endpoint outcomes
outcome=served Relay HTML served successfully
e2oe_tier1_provision_total counter {outcome} Wrap-upload endpoint outcomes
outcome=full Browser uploaded a complete wrapped map
e2oe_tier1_wrap_relay_total counter {outcome, layer} Per-endpoint rate-limit blocks
e2oe_tier1_wrap_upload_total counter {outcome, layer}
e2oe_tier1_wrap_state_total counter {outcome, layer}
outcome=rate_limited Block emitted (per-session or per-IP layer)
layer=session|ip Which bucket triggered
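The direction=decrypt_seq failure above counts frames rejected by a strict-monotonic sequence gate. A minimal sketch of such a gate follows. The class name, the initial state, and the choice to accept gaps in the sequence are assumptions for illustration, not the module's implementation.

```python
class SeqGate:
    """Accepts a frame only if its sequence number is strictly greater
    than the last accepted one, rejecting replays and late (reordered) frames."""

    def __init__(self) -> None:
        self.last_seq = -1   # assumption: sequence numbers start at 0
        self.rejected = 0    # would feed a counter such as the decrypt_seq failure

    def accept(self, seq: int) -> bool:
        if seq <= self.last_seq:
            self.rejected += 1
            return False
        self.last_seq = seq
        return True


gate = SeqGate()
results = [gate.accept(s) for s in (0, 1, 3, 3, 2, 4)]
```

In this run the second frame with sequence 3 is a replay and the frame with sequence 2 arrives out of order, so both are rejected while the rest pass.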

Connectivity

DNS Resolution

Resolution (namespace: dns):
dns_resolve_total counter {result, cached, dnssec} Resolution outcomes
result=success, cached=true|false Successful resolution
result=nxdomain, cached=true|false Domain not found (valid response)
result=failure, cached=false Resolution failed
dns_nxdomain_total counter {} NXDOMAIN responses (uncached)
dns_cache_hits counter {} Cache hits
dns_cache_misses counter {} Cache misses
dns_lookup_coalesced counter {} Lookups coalesced (shared concurrent result)
dns_lookup_performed counter {} Lookups actually performed
dns_cache_operations_total counter {operation, result} Cache write operations
operation=set, result=success|error Broadcast cache set outcomes
Resolver Selection (namespace: dns):
dns_resolver_queries_total counter {resolver, result} Per-resolver query outcomes
result=success|nxdomain|failure Query result per resolver
dns_system_dns_queries_total counter {result} System DNS fallback queries
result=success|nxdomain|failure System resolver outcomes
Transport (namespace: dns):
dns_transport_used counter {type, resolver} DNS transport protocol used
type=udp|dot UDP or DNS-over-TLS
CNAME Resolution (namespace: dns):
dns_cname_resolutions_total counter {status} CNAME chain resolution outcomes
status=success|depth_exceeded CNAME follow results
DNSSEC Validation (namespace: dns):
dns_dnssec_validations_total counter {result, resolver} Resolver-trust mode validations
result=valid|invalid|unsigned AD bit check outcomes
dns_dnssec_full_validations counter {result, resolver} Full cryptographic validations
result=valid|invalid RRSIG/DNSKEY verification outcomes
dns_dnssec_signature_validations counter {result, algorithm} RRSIG signature verifications
result=valid Successful signature check
dns_dnssec_dnskey_queries counter {result} DNSKEY record fetches
result=success DNSKEY query succeeded
dns_dnssec_response_validations counter {result} Full response validations
result=valid All RRsets validated
dns_dnssec_chain_validations counter {result} Chain of trust DS validations
result=valid|invalid DNSKEY-DS digest match
dns_dnssec_root_validations counter {result} Root trust anchor validations
result=valid|invalid Root DNSKEY match
dns_dnssec_nsec_validations counter {result, type} NSEC/NSEC3 denial validations
result=valid|invalid, type=nsec|nsec3 Authenticated denial outcomes
DNSSEC Cache (namespace: dns):
dns_dnssec_cache_hits counter {type} DNSSEC record cache hits
type=dnskey|ds Cached record type
dns_dnssec_cache_misses counter {type} DNSSEC record cache misses
type=dnskey|ds Record type queried
dns_dnssec_cache_clears counter {} DNSSEC cache full clears
Health Management (namespace: dns):
dns_resolver_latency latency {resolver} Per-resolver query latency
dns_resolver_healthy gauge {resolver} Resolver health status (1=healthy, 0=unhealthy)
dns_resolver_avg_latency_ms gauge {resolver} Resolver average latency EMA (ms)
dns_resolver_consecutive_failures gauge {resolver} Consecutive failure count per resolver
dns_resolver_failures_total counter {resolver} Total resolver failures
dns_system_fallback gauge {} System DNS fallback active (1=active, 0=inactive)
dns_fallback_activations counter {} System DNS fallback activations
Adaptive Selection (namespace: dns):
dns_adaptive_resolver_selected counter {resolver, reason} Adaptive resolver selections
reason=exploration|best_score|round_robin|... Selection strategy used
dns_adaptive_selection_total counter {mode, resolver} Selection mode distribution
mode=explore|exploit Exploration vs exploitation
dns_resolver_score histogram {resolver} Resolver scores (intelligent phase)
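dns_resolver_avg_latency_ms is described above as an exponential moving average (EMA). The standard EMA update is sketched below. The smoothing factor alpha and the seeding with the first sample are assumptions, since the resolver's actual parameters are not given here.

```python
from typing import Optional


def update_ema(previous: Optional[float], sample_ms: float,
               alpha: float = 0.2) -> float:
    """Return the new EMA after observing one latency sample.

    alpha close to 1 reacts quickly to new samples; close to 0 smooths more.
    """
    if previous is None:
        return sample_ms  # assumption: the first sample seeds the average
    return alpha * sample_ms + (1.0 - alpha) * previous


# Hypothetical latency samples for one resolver, in milliseconds.
ema = None
for latency in (10.0, 20.0, 20.0):
    ema = update_ema(ema, latency, alpha=0.5)
```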

Forward Proxy Engine

Connection Metrics (namespace: forwardproxy):
forwardproxy_connections_total counter {protocol, user_id} Proxy connections recorded
forwardproxy_bytes_sent_total counter {protocol, user_id} Bytes sent through proxy
forwardproxy_bytes_received_total counter {protocol, user_id} Bytes received through proxy
forwardproxy_connection_duration latency {protocol, user_id} Connection duration
forwardproxy_errors_total counter {protocol, error} Failed proxy connections
forwardproxy_active_connections gauge {} Currently active proxy connections

Client Access (HexonClient)

Connections:
clientaccess_connections_total counter {} QUIC connections accepted
clientaccess_connections_active gauge {} Currently active QUIC connections
clientaccess_connections_rejected counter {reason} Connections rejected before auth
clientaccess_connection_duration latency {username?} Connection lifetime
Authentication:
clientaccess_auth_success_total counter {username?} Successful authentications
clientaccess_auth_failures_total counter {reason} Failed authentications
Clients:
clientaccess_clients_active gauge {} Registered client instances
Heartbeat:
clientaccess_heartbeat_latency latency {username?} Heartbeat RTT (raw)
Dial:
clientaccess_dials_total counter {} Dial requests received
clientaccess_dials_denied_total counter {} Dials denied by ACL
clientaccess_dials_success_total counter {} Dials completed successfully
clientaccess_dials_errors_total counter {} Dial errors (connect refused, timeout)
clientaccess_dial_latency latency {} Backend dial time
clientaccess_streams_active gauge {} Active QUIC dial streams
DNS:
clientaccess_dns_queries_total counter {} DNS queries processed
Alerts:
clientaccess_connections_active > max_clients * 0.9 Approaching client limit
rate(clientaccess_connections_rejected[5m]) > 10 Connection rejection spike
rate(clientaccess_auth_failures_total[5m]) > 10 Authentication failure spike
rate(clientaccess_dials_denied_total[5m]) > 20 ACL denial spike

QUIC Connector

Connections:
connectors_connections_total counter {} Total connector connections
connectors_connections_active gauge {} Active connector connections
connectors_connections_rejected counter {reason} Rejected connections
connectors_connection_duration latency {site_id} Connection lifetime
Authentication:
connectors_auth_success_total counter {site_id} Successful authentications
connectors_auth_failures_total counter {site_id, reason} Authentication failures
Instances:
connectors_instances_active gauge {site_id} Active connector instances
connectors_heartbeat_latency latency {site_id} Heartbeat round-trip time
Dial (tunnel dispatch):
connectors_dials_total counter {site_id} Dial attempts through tunnel
connectors_dials_success_total counter {site_id} Successful dials
connectors_dials_errors_total counter {site_id, reason} Failed dials
connectors_dial_latency latency {site_id} Dial latency
connectors_streams_active gauge {} Active QUIC streams
Rebalance:
connectors_rebalance_reject_total counter {site_id} Soft-rejected for rebalance
connectors_rebalance_accept_total counter {site_id} Accepted after rebalance check
Inter-node forwarding (TCP-level):
connectors_forward_total counter {site_id, target} Forward attempts to peer node
connectors_forward_success_total counter {site_id, target} Successful forwards
connectors_forward_errors_total counter {site_id, target} Failed forwards
connectors_forward_latency latency {site_id, target} Forward latency
connectors_forward_local_total counter {site_id} Requests handled locally
Relay (QUIC inter-node dispatch):
connectors_relay_total counter {site_id, target} Client-side relay attempts
connectors_relay_served counter {site_id, target} Server-side relay requests handled
connectors_relay_success_total counter {site_id, target} Successful relay dispatches
connectors_relay_errors_total counter {site_id, reason} Failed relay dispatches
connectors_relay_rejected_total counter {reason} Relay connections rejected (auth)
Alerts:
rate(connectors_auth_failures_total[5m]) > 5 High auth failure rate (brute force or misconfiguration)
connectors_instances_active == 0 No connector instances (site unreachable)
rate(connectors_dials_errors_total[5m]) > 10 High dial failure rate (tunnel health)
rate(connectors_relay_rejected_total[5m]) > 0 Relay auth failures (cluster misconfiguration)
connectors_connections_active > 100 High connection count
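
The rate(...[5m]) thresholds in the alerts above are per-second values, so "> 5" on connectors_auth_failures_total means more than 5 failures per second averaged over the window, or roughly 1500 in five minutes. The following Python sketch illustrates that arithmetic under simplifying assumptions: it uses only two samples and a basic reset rule, whereas Prometheus also extrapolates to the window boundaries. The helper names and sample values are hypothetical.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Approximate per-second increase of a monotonic counter over a window.

    A drop in value is treated as a counter reset (for example a process
    restart); the counter is then assumed to have restarted from zero.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if last_value >= first_value:
        increase = last_value - first_value
    else:
        increase = last_value
    return increase / window_seconds


def exceeds(rate, threshold):
    """True when the per-second rate is strictly above the alert threshold."""
    return rate > threshold


# Hypothetical samples of connectors_auth_failures_total taken 5 minutes apart.
window = 5 * 60
rate = per_second_rate(1200.0, 3000.0, window)  # 1800 failures in 300 s
print(rate, exceeds(rate, 5))
```

With these sample values the rate is 6 failures per second, so the auth-failure alert would fire.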

Network Listener

Lifecycle:
listener_starts counter {type, name} Listener startups
listener_stops counter {type, name} Listener shutdowns
listener_restarts counter {type, name} Listener restarts
listener_errors counter {type, name} Listener errors
Rate & Size Limiting:
listener_rate_limit_hits counter {reason} Requests blocked by rate limit
listener_ratelimit_circuit_breaker_trips_total counter {} Circuit breaker trips
listener_size_limit_hits counter {host, path} Size limit exceeded
TLS Security:
listener_connections_accepted counter {protocol} Successful TLS connections
listener_security_non_tls_dropped counter {reason} Non-TLS connections rejected
listener_security_malformed_tls counter {reason} Invalid TLS versions
listener_security_oversized_record counter {reason} TLS records exceeding RFC limits
listener_security_oversized_clienthello counter {reason} ClientHello too large
listener_security_small_clienthello counter {reason} Suspiciously small ClientHello
listener_security_malformed_clienthello counter {reason} Malformed ClientHello
listener_security_no_sni counter {reason} TLS handshakes without SNI
QUIC Affinity:
listener_quic_affinity_packets_received counter {} QUIC packets received
listener_quic_affinity_packets_dropped counter {reason} QUIC packets dropped
listener_quic_affinity_decryption_failures counter {} QUIC decryption failures
listener_quic_affinity_packets_local counter {} QUIC packets processed locally
listener_quic_affinity_packets_forwarded counter {target_node} QUIC packets forwarded to cluster
listener_quic_affinity_forward_failures counter {target_node} Forward failures
listener_quic_affinity_response_dropped counter {reason} QUIC response packets dropped
listener_quic_affinity_cid_mappings gauge {} Active connection ID mappings
listener_quic_connection_migrations counter {} QUIC connection migrations
QUIC Forwarding:
listener_quic_forward_connect_errors counter {target_node} Forwarding connect errors
listener_quic_forward_write_errors counter {target_node} Forwarding write errors
listener_quic_forward_bytes counter {target_node} Bytes forwarded
HXEP (Edge Protocol):
hxep_parsed_trusted counter {} TCP HXEP parsed (trusted)
hxep_stripped_untrusted counter {} TCP HXEP stripped (untrusted)
hxep_parse_failed counter {} TCP HXEP parse failures
hxep_partial_header counter {} TCP HXEP incomplete headers
hxep_udp_parsed_trusted counter {} UDP HXEP parsed (trusted)
hxep_udp_stripped_untrusted counter {} UDP HXEP stripped (untrusted)
hxep_udp_parse_failed counter {} UDP HXEP parse failures
Alerts:
rate(listener_rate_limit_hits[5m]) > 50 High rate limiting (possible attack)
listener_ratelimit_circuit_breaker_trips_total > 0 Circuit breaker tripped
rate(listener_security_no_sni[5m]) > 10 SNI probing
rate(hxep_stripped_untrusted[5m]) > 0 HXEP spoofing attempt
rate(listener_quic_affinity_forward_failures[5m]) > 0 Cluster QUIC forwarding issues

Forward Proxy

No Prometheus metrics emitted by this module.

Cluster & Operations

Git Configuration Management

Observability is provided indirectly through dependent modules:
- config: metrics for config reload success/failure and reload latency
- cluster: metrics for broadcast delivery and quorum wait

Hot Reload

Observability is provided indirectly through dependent modules:
- config: metrics for config reload success/failure, reload latency, and config version
- cluster: metrics for broadcast delivery and quorum wait

Module Data Storage

Operations (namespace: moduledata):
moduledata_operations_total counter {operation, backend, result} Operation count
operation=get|set|delete|getallforuser|loadall|exists
backend=hexon
result=success|error
moduledata_operation_duration latency {operation, backend} Operation duration
operation=get|set|delete|getallforuser|loadall|exists
backend=hexon

Notification Service

notify_sent_total counter {channel, result} Incremented after each single-event delivery attempt.
channel=email|webhook, result=success|failure.
notify_digest_sent_total counter {channel, result} Incremented after each digest delivery attempt.
channel=email|webhook, result=success|failure.
Downstream metrics from related modules:
- smtp_emails_sent_total (from smtp module): covers email delivery outcomes
- render_email_total (from render module): covers template rendering

Distributed Sessions

Lifecycle:
sessions_sessions_created counter {type} Sessions created
sessions_sessions_revoked counter {} Sessions revoked (single)
sessions_sessions_bulk_revoked counter {type} Sessions bulk-revoked
sessions_sessions_regenerated counter {type} Session IDs regenerated
sessions_sessions_extended counter {type} Session TTLs extended
sessions_activity_persisted counter {type} Activity timestamps persisted
Validation:
sessions_validations_success counter {type} Successful validations
sessions_validations_failed counter {reason} Failed validations (reason: storage_error, wait_error, not_found, invalid_type)
Alerts:
rate(sessions_validations_failed{reason="not_found"}[5m]) > 50 High session-not-found rate (expired or stale cookies)
rate(sessions_sessions_bulk_revoked[5m]) > 0 Bulk revocation event (user disabled or password change)
rate(sessions_validations_failed{reason="storage_error"}[5m]) > 0 Storage backend issues

SMTP Email Delivery

Email delivery:
smtp_emails_sent_total counter {type, result} Emails sent per type and outcome
smtp_send_duration latency {type, result} Email send duration per type and outcome
Label values:
type: generic | otp | cert_renewal | passkey_expiration | vpn_enrollment |
vpn_device_code | vpn_psk_expiration | magiclink | test |
pat_created | pat_revoked | pat_expired |
passkey_created | passkey_revoked |
totp_created | totp_revoked |
cert_created | cert_revoked
result: success | failure
Note: Only core email types emit latency (generic, otp, cert_renewal, passkey_expiration,
vpn_enrollment, vpn_device_code, vpn_psk_expiration, magiclink). All other types
(test, pat_*, passkey_created, passkey_revoked, totp_*, cert_created, cert_revoked) emit the counter only.
Alerts:
rate(smtp_emails_sent_total{result="failure"}[5m]) > 5 SMTP delivery issues
histogram_quantile(0.99, rate(smtp_send_duration_bucket[5m])) > 5 SMTP latency degradation (p99 above 5s)
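
Because latency metrics are histograms with fixed buckets, a p99 value for the alert above is an estimate derived from bucket counts, not a measured percentile. The sketch below shows one common way to make that estimate, linear interpolation inside the bucket that contains the target rank, which is similar in spirit to Prometheus's histogram_quantile. The bucket bounds are taken from the latency definition at the top of this reference; the cumulative counts and the function name are hypothetical.

```python
import math

# Upper bounds in seconds, assumed from the fixed latency buckets
# (<1ms, <10ms, <100ms, <1s, <10s, >10s); the last bucket is open-ended.
BUCKET_BOUNDS = [0.001, 0.01, 0.1, 1.0, 10.0, math.inf]


def estimate_quantile(q, cumulative_counts, bounds=BUCKET_BOUNDS):
    """Estimate the q-quantile from cumulative bucket counts by linear interpolation."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(bounds):
        raise ValueError("one count per bucket is required")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if math.isinf(bounds[index]):
        # The open-ended bucket has no upper bound to interpolate towards.
        return bounds[index - 1]
    lower = bounds[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    if in_bucket == 0:
        return lower
    return lower + (bounds[index] - lower) * (rank - below) / in_bucket


# Hypothetical cumulative counts for smtp_send_duration over a window.
counts = [0, 10, 500, 980, 1000, 1000]
print(estimate_quantile(0.99, counts))
```

With these counts the p99 estimate is about 5.5 s. Any value between 1 s and 10 s is interpolated within a single bucket, so a 5 s threshold should be read as approximate.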

Persistent File Storage

No Prometheus metrics emitted by this module.

Distributed Memory Storage

CRUD Counters:
memory_storage_gets counter {cache_type, result} Cache reads (result: hit, miss, cold_hit, decode_error, expired)
memory_storage_sets counter {cache_type} Cache writes
memory_storage_deletes counter {cache_type} Cache deletions
memory_storage_touches counter {cache_type, result} TTL renewals (result: hit, miss, expired)
memory_storage_setnx counter {cache_type, result} Atomic set-if-not-exists (result: set, exists)
memory_storage_sync_sets counter {cache_type} Synchronous KV-persisted writes
memory_storage_sync_gets counter {cache_type, result} Synchronous reads with KV fallback (result: hit, miss, kv_hit, decode_error, expired)
memory_storage_evictions counter {cache_type, reason} Entries evicted (reason: expired, cold)
Gauges:
memory_storage_entries gauge {cache_type} Current entry count per cache type (updated via GetCacheStats)
Alerts:
rate(memory_storage_gets{result="miss"}[5m]) > 100 High cache miss rate (check TTLs or missing Set calls)
rate(memory_storage_evictions{reason="expired"}[5m]) > 500 Excessive TTL evictions (entries expiring faster than expected)
rate(memory_storage_gets{result="decode_error"}[5m]) > 0 Corrupted KV entries (cold cache decode failures)
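
The miss-rate alert above uses an absolute threshold, which scales with traffic. A ratio across the result label values of memory_storage_gets can be easier to compare between cache types. The following sketch computes such a ratio from per-window counter increases. It assumes cold_hit should count as a successful read, which the reference does not state; the function name and sample numbers are hypothetical.

```python
def hit_ratio(increases):
    """Fraction of reads served from cache over a window.

    `increases` maps each `result` label value of memory_storage_gets to the
    counter increase observed in the window.
    """
    # Assumption: cold_hit counts as a successful read alongside hit.
    served = increases.get("hit", 0) + increases.get("cold_hit", 0)
    total = sum(increases.values())
    if total == 0:
        return None  # no reads in the window, ratio undefined
    return served / total


# Hypothetical 5-minute increases for one cache_type.
window = {"hit": 900, "cold_hit": 50, "miss": 40, "expired": 10, "decode_error": 0}
print(hit_ratio(window))
```

Returning None for an idle window keeps "no traffic" distinct from "every read missed".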

Telemetry & Logging

Audit event tracking:
telemetry_audit_log_entries_total counter {} Audit-class entries successfully written
telemetry_audit_dropped_total counter {} Audit-class entries dropped (channel overflow)
telemetry_converging_log_entries_total counter {} Converging-class entries successfully written
All three counters have no labels (nil label map). They are incremented in the
single backgroundWriter goroutine (no contention).
Alerts:
telemetry_audit_dropped_total > 0 Audit entries lost — increase channel buffer or reduce log volume
rate(telemetry_audit_log_entries_total[5m]) == 0 No audit events — possible pipeline failure or misconfiguration
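
The relationship between the written and dropped counters above can be pictured as a bounded channel drained by a single writer: entries that arrive while the channel is full are counted as dropped. The Python sketch below models this with a bounded queue. It is an illustration only, not the module's implementation; the class and attribute names are hypothetical, and for simplicity the drop is counted at the point of submission.

```python
import queue


class AuditPipeline:
    """Toy model of a bounded audit channel with written and dropped counters."""

    def __init__(self, capacity):
        self._channel = queue.Queue(maxsize=capacity)
        self.written_total = 0  # stands in for telemetry_audit_log_entries_total
        self.dropped_total = 0  # stands in for telemetry_audit_dropped_total
        self.sink = []

    def emit(self, entry):
        """Submit an entry without blocking; count it as dropped if the channel is full."""
        try:
            self._channel.put_nowait(entry)
        except queue.Full:
            self.dropped_total += 1

    def drain(self):
        """Write out everything currently buffered, as a single writer would."""
        while True:
            try:
                entry = self._channel.get_nowait()
            except queue.Empty:
                return
            self.sink.append(entry)
            self.written_total += 1


pipeline = AuditPipeline(capacity=2)
for event in ("login", "logout", "revoke"):
    pipeline.emit(event)  # the third entry overflows the channel
pipeline.drain()
print(pipeline.written_total, pipeline.dropped_total)
```

In this model a larger capacity or a lower emit rate both reduce drops, which matches the remediation suggested in the alert text.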

AI Assistant

Queries:
llm_queries_total counter {success} Query completion count
success=true Query completed with a final answer
llm_query_duration_seconds latency (none) End-to-end query duration including all tool rounds
Tool calls:
llm_tool_calls_total counter {tool, success} Per-tool execution count
tool=<command> CLI command name (e.g. "proxy", "cluster")
success=true/false Whether the command executed successfully
Prompt caching (Anthropic provider only, emitted when cache tokens > 0):
llm_cache_read_tokens_total counter (none) Tokens read from Anthropic prompt cache
llm_cache_creation_tokens_total counter (none) Tokens written to Anthropic prompt cache

Admin Unix Socket

No Prometheus metrics emitted by this module.

Threshold Signing & Cluster Cryptography

No Prometheus metrics emitted by this module.

Configuration System

Reload counters are available via the health system:
- Reload attempts, successes, failures
- Parse errors, validation errors, file-not-found errors
- Callback timeouts, callback duration, last reload duration
Query reload status: health components | config status

Kubernetes CRD Configuration

Reconciliation:
k8s_reconciliations_total counter {result} Config-to-CRD reconciliation cycles
result=success | failure
Health Sync:
k8s_health_syncs_total counter {result} Periodic health status writes to CRD .status
result=success | failure
CRD Operations:
k8s_crd_operations_total counter {operation, result} Individual CRD operations
operation=ensure_definition, result=created | updated | failure
operation=status_write, result=success | failure
Alerts:
rate(k8s_reconciliations_total{result="failure"}[5m]) > 0 Reconciliation failing
rate(k8s_health_syncs_total{result="failure"}[5m]) > 0 Health sync failing