Certificates & PKI
Certificate Management
Manages all TLS certificates — internal CA, Let’s Encrypt, and static PEM in one place with auto-renewal
Overview
Manages the lifecycle of all TLS certificates the gateway uses — issuance, renewal, distribution, and SNI routing. Unifies three certificate sources into one model so operators don’t manage certificates per-service:
- Internal ACME CA: Issues certificates for internal services using a built-in RFC 8555 compliant Certificate Authority. Supports http-01, dns-01, and tls-alpn-01 challenges with OCSP and CRL distribution.
- External ACME Client: Obtains certificates from Let’s Encrypt or any external ACME-compliant CA. Handles automatic renewal, bootstrap fallback, and cluster-wide distribution.
- Static PEM: Certificates loaded directly from configuration files (tls_cert/tls_key). Highest priority source, used when pre-provisioned certificates are available.
Regardless of source, all certificates flow through the certificate manager for unified storage, caching, and distribution. The TLS handshake retrieves certificates from the local in-memory cache with sub-microsecond latency, while cluster-wide consistency is maintained via broadcast operations.
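As a rough illustration of how a handshake lookup against the local cache might resolve a hostname, the sketch below applies the domain-matching order this section describes: exact match, then a one-level wildcard, then the default certificate. The function name `select_certificate`, the plain dict used as the cache, and the strings standing in for parsed certificates are assumptions made for the example, not the gateway's actual types. Source priority (static, ACME, default) is not modelled.

```python
from typing import Dict, Optional


def select_certificate(
    hostname: str,
    cache: Dict[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Pick a cache entry for hostname: exact > one-level wildcard > default."""
    # Tier 1: exact domain match (compared as-is, no case folding).
    if hostname in cache:
        return cache[hostname]

    # Tier 2: wildcard match. Replace only the leftmost label with "*",
    # so *.example.com covers api.example.com but not sub.api.example.com.
    _, dot, parent = hostname.partition(".")
    if dot and parent:
        wildcard = "*." + parent
        if wildcard in cache:
            return cache[wildcard]

    # Tier 3: fall back to the default certificate (may be None).
    return default


if __name__ == "__main__":
    certs = {"api.example.com": "exact-cert", "*.example.com": "wildcard-cert"}
    for name in ("api.example.com", "www.example.com", "sub.api.example.com"):
        print(name, "->", select_certificate(name, certs, default="default-cert"))
```

Because only the leftmost label is substituted, a two-level name such as sub.api.example.com falls through to the default unless a `*.api.example.com` entry exists.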
Certificate selection during TLS handshake follows a three-tier priority:
exact domain match > wildcard match (*.example.com) > default certificate
Storage layers:
1. Distributed cache: keyed by domain with TTL matching certificate expiry
2. Local cache: parsed TLS certificates held in-memory for zero-latency TLS serving (reads are local-only, writes broadcast cluster-wide)
Validation on storage:
- Maximum PEM sizes enforced (256KB cert, 32KB key)
- Certificate chain length limited to 10
- Domain length capped at 253 characters (RFC 1035)
- Date range and domain-SAN match validation
Config
Certificate management is an infrastructure module — it has no dedicated configuration section. Certificate sources are configured via other modules:
Static certificates:
[service]
tls_cert = "/path/to/cert.pem" # Static TLS certificate
tls_key = "/path/to/key.pem" # Static TLS private key
Per-proxy mapping certificates:
[[proxy.mapping]]
hostname = "api.example.com"
tls_cert = "/path/to/api-cert.pem" # Per-route certificate
tls_key = "/path/to/api-key.pem"
Internal ACME CA (automatic):
[acme]
enabled = true # Enables internal certificate issuance
External ACME Client (automatic):
[acme_client]
enabled = true # Obtains certs from Let's Encrypt or similar
AutoTLS (automatic wildcard):
[service]
auto_tls = true # Wildcard cert via internal ACME CA
Certificate selection priority during TLS handshake:
1. Static PEM (from tls_cert/tls_key or per-mapping config)
2. ACME-issued certificate (internal or external)
3. Default certificate (service-level or AutoTLS wildcard)
Domain matching priority:
exact match > wildcard match (*.example.com) > default certificate
Validation limits (enforced on all certificate storage):
- Maximum certificate PEM size: 256KB
- Maximum private key PEM size: 32KB
- Maximum chain length: 10 certificates
- Maximum domain name length: 253 characters (RFC 1035)
- Date range validation: NotBefore <= now <= NotAfter
Troubleshooting
Common symptoms and diagnostic steps:
Certificate not found for domain:
- Check 'certs list' for all managed certificates
- Check 'certs show <domain>' for specific domain
- Verify domain matches exactly (case-sensitive) or has wildcard match
- Check certificate source: static (config), ACME (internal CA), or ACME client
- If ACME: check 'autotls status' and 'certs acme' for issuance status
TLS handshake using wrong certificate:
- Check priority: static > ACME > default
- Per-mapping certificates override service-level defaults
- 'diagnose domain <hostname>' shows which certificate is being served
- Wildcard certificates only match one level (*.example.com matches api.example.com but NOT sub.api.example.com)
Certificate expired or expiring soon:
- Check 'certs list' for expiration dates
- ACME certificates renew automatically before expiry
- Static certificates must be replaced manually and config reloaded
- If auto-renewal failed: check ACME CA health ('health components')
- Check 'logs search certmanager --level=warn' for expiration warnings
Certificate not propagating across cluster:
- Writes are broadcast to all nodes; check cluster health
- Check 'cluster status' for quorum and node connectivity
- Each node maintains a local cache; propagation is near-instant
- If a node missed the broadcast: restart triggers fresh certificate load
Invalid certificate PEM:
- Check PEM encoding: must be valid base64 with proper headers
- Chain must include intermediates (server cert + intermediate CAs)
- Key must match the certificate's public key
- File path must be readable by the gateway process
Metrics for monitoring:
- certmanager_certificates_total: total certificates in cache
- certmanager_set_total: certificate store operations by source
- certmanager_get_total: cache lookups (hit=true/false)
- certmanager_expired_total: certificates that expired from cache
Relationships
Child modules:
- certificates.acme: Internal ACME CA server for issuing certificates to internal services. Acts as the certificate authority that clients can request certificates from.
- certificates.acmeclient: ACME protocol client that obtains certificates from Let's Encrypt or any ACME-compliant CA (including the internal CA). Handles renewal scheduling, bootstrap fallback, and cluster distribution.
Key dependents:
- TLS server: Retrieves certificates for TLS handshake callbacks.
- Proxy: Per-mapping TLS certificates for SNI-based routing. Invalid or missing certificates prevent routes from mounting.
- AutoTLS: Uses the internal ACME CA to issue wildcard certificates, then stores them through the certificate manager for cluster-wide availability.
Infrastructure dependencies:
- Distributed storage: Certificate cache with TTL-based expiration.
- Cluster broadcast: Operations for cluster-wide certificate updates.
- Configuration: Certificate source selection and TLS parameters.
Logs
Log entries by component. Search with: logs search "certmanager"
Levels: ERROR > WARN > INFO > DEBUG.
SetCertificate:
certmanager.set ERROR Failed to parse certificate
certmanager.set ERROR Certificate does not match domain
certmanager.set ERROR Rejecting expired certificate
certmanager.set ERROR Rejecting not-yet-valid certificate
certmanager.set ERROR Failed to store certificate in memorystorage
certmanager.set INFO Certificate stored successfully
SetDefaultCertificate:
certmanager.setdefault ERROR Failed to parse default certificate
certmanager.setdefault INFO Default certificate set successfully
DeleteCertificate:
certmanager.delete INFO Certificate deleted
OnCertificateExpired:
certmanager.expired ERROR Panic in OnCertificateExpired callback
certmanager.expired WARN Certificate expired from cache - renewal may have failed
ClearCache:
certmanager.clearcache INFO Certificate cache cleared
Shutdown:
certmanager.shutdown INFO Certificate manager shutdown complete
Metrics
Prometheus metrics. Query with: metrics prometheus certmanager_<name>
Certificate Operations (namespace: certmanager):
certmanager_set_total counter {source} Certificate store operations (both domain and default)
  source=static|acme|acmeclient Certificate source type
certmanager_get_total counter {hit} Cache lookups for TLS certificate retrieval
  hit=true Certificate found (exact or wildcard match)
  hit=false Certificate not found in cache
certmanager_expired_total counter {} Certificates expired from cache (renewal may have failed)
certmanager_certificates_total gauge {} Total certificates currently held in local cache
ACME CA Server
Built-in certificate authority — issues TLS certificates for internal services via standard ACME protocol
Overview
Issues TLS certificates for internal services using a built-in ACME certificate authority. Replaces external CA infrastructure for internal PKI — compatible with certbot, cert-manager, Caddy, and Traefik. Supports http-01, dns-01, and tls-alpn-01 challenges with OCSP responder and CRL distribution.
Core capabilities:
- Full RFC 8555 ACME protocol compliance
- Stateless accounts derived from JWK thumbprint (no account database)
- All three challenge types: http-01, dns-01, and tls-alpn-01
- IP address certificates via RFC 8738
- CAA checking via RFC 8659 with domain hierarchy walk-up
- OCSP responder (RFC 6960) with caching for real-time certificate status
- CRL distribution (RFC 5280) rebuilt on each revocation
- Deterministic DNS challenges for internal domains without DNS API
- UUID v4 certificate serial numbers (collision-free)
- Cluster-ready with distributed storage across all nodes
- Comprehensive multi-dimensional rate limiting (7 dimensions)
- Saga pattern for atomic distributed updates (TOCTOU prevention)
- Optimistic concurrency control for rate limit counters
- Threshold CA (acme_ca_threshold=true): the CA private key exists only as distributed threshold shares — no single node holds the full key. Protocol selected by ca_algorithm: ES256 → GG18 ECDSA, EdDSA → FROST Ed25519. Fail-closed — no certs until DKG completes. Both algorithms support resharing: cluster scaling preserves the CA cert across membership changes (see threshold_resharing for shrinkage constraints).
- CA signature algorithm selectable via [operations].ca_algorithm — “ES256” (default, ECDSA P-256) or “EdDSA” (Ed25519, RFC 8032). End-to-end Ed25519 supported: deterministic CA generation, CSR validation, JWS/JWK with OKP keys (RFC 8037), thumbprints (RFC 7638 + RFC 8037 §2), OCSP responses (RFC 8419), ACME and SPIFFE leaf certs, threshold mode with FROST resharing. Algorithm is immutable after first bootstrap; mismatched config on a subsequent startup is rejected with a clear migration error.
- CA certificate rotation is automatic — managed by AutoTLS renewalLoop or ACME client renewal scheduler. Operators do NOT need to set calendar reminders for CA cert expiry. Only investigate if ‘health components’ shows CA warnings or ‘certs list’ shows expiring certs.
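To make the stateless-account and challenge items above more concrete, here is a small sketch of the standard derivations involved: the JWK thumbprint (RFC 7638, with OKP members per RFC 8037), the key authorization `{token}.{thumbprint}` from RFC 8555, and the dns-01 TXT value base64url(SHA256(keyAuthorization)). The helper names and the sample token are illustrative assumptions. The sample key is a public Ed25519 test key. This follows the RFCs and is not taken from the gateway's source.

```python
import base64
import hashlib
import json

# Members that enter the thumbprint, per key type (RFC 7638 section 3.2, RFC 8037 section 2).
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "OKP": ("crv", "kty", "x"),
    "RSA": ("e", "kty", "n"),
}


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout JOSE and ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 thumbprint over the required members only, sorted, no whitespace."""
    members = {name: jwk[name] for name in REQUIRED_MEMBERS[jwk["kty"]]}
    canonical = json.dumps(members, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """Value an http-01 client serves at /.well-known/acme-challenge/{token}."""
    return f"{token}.{jwk_thumbprint(jwk)}"


def dns01_txt_value(token: str, jwk: dict) -> str:
    """Value expected in the _acme-challenge.{domain} TXT record."""
    digest = hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest()
    return b64url(digest)


if __name__ == "__main__":
    account_key = {
        "kty": "OKP",
        "crv": "Ed25519",
        "x": "11qYAYKxCrfVS_7TyWQHOg7hcvPapiMlrwIaaPcHURo",
    }
    token = "example-token"  # hypothetical challenge token
    print("thumbprint:", jwk_thumbprint(account_key))
    print("http-01 body:", key_authorization(token, account_key))
    print("dns-01 TXT:", dns01_txt_value(token, account_key))
```

Because the thumbprint depends only on the public key members, the same account key always maps to the same account identity, which is what allows accounts to be derived rather than stored.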
HTTP endpoints under /acme (configurable prefix):
GET /acme/directory -> ACME directory with all endpoint URLs
HEAD /acme/new-nonce -> Anti-replay nonce
POST /acme/new-account -> Create/lookup account
POST /acme/new-order -> Create certificate order
POST /acme/order/{id} -> Order status
POST /acme/authz/{id} -> Authorization status
POST /acme/challenge/{id} -> Challenge response
POST /acme/finalize/{id} -> Finalize order with CSR
POST /acme/cert/{id} -> Download certificate (PEM)
POST /acme/revoke-cert -> Revoke certificate
GET /acme/ca-certs -> CA certificate bundle (PEM)
GET /acme/crl -> CRL (DER-encoded)
GET /acme/ocsp/{req} -> OCSP check (GET)
POST /acme/ocsp -> OCSP check (POST)
Storage model:
- Volatile: in-memory cache for fast access (orders, authorizations, challenges, nonces, OCSP)
- Persistent: NATS JetStream KV for durability (certificates, CRL, serial index)
- All persistent data encrypted at rest (key derived from cluster_key)
- Startup: certificates loaded from persistent storage into memory cache
Cluster behavior:
- Write operations (create order, finalize, revoke): Replicated with quorum
- Read operations (get directory, get order, get cert): Local only
- Validation operations (validate challenge, check CAA): Local with external calls
- Nonces: created across all nodes; validated and consumed locally
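The nonce behavior above (short validity, single use, consumed on validation) can be sketched with a single-process store. This is only an illustration: the class name `NonceStore`, the injectable clock, and the in-memory dict are assumptions, and the real module stores nonces in the distributed cache, not in a local dict.

```python
import secrets
import time
from typing import Callable, Dict


class NonceStore:
    """Single-use nonces with a validity window (single-process sketch)."""

    def __init__(
        self,
        validity_seconds: float = 900.0,  # mirrors the documented 15m default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._validity = validity_seconds
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def create(self) -> str:
        """Issue a fresh random nonce and remember when it expires."""
        nonce = secrets.token_urlsafe(16)
        self._expiry[nonce] = self._clock() + self._validity
        return nonce

    def consume(self, nonce: str) -> bool:
        """Return True only for a known, unexpired nonce, and only once."""
        # pop() removes the entry whether or not it is still valid,
        # so a nonce can never be accepted a second time.
        expires_at = self._expiry.pop(nonce, None)
        return expires_at is not None and self._clock() <= expires_at


if __name__ == "__main__":
    store = NonceStore()
    n = store.create()
    print(store.consume(n))  # first use is accepted
    print(store.consume(n))  # replay is rejected
```

A client that receives a rejection for a consumed or expired nonce would retry with the fresh value from the Replay-Nonce header, as noted under Troubleshooting.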
Config
ACME CA configuration in hexon.toml under [acme]:
[acme]
enabled = true # Enable ACME CA server
path_prefix = "/acme" # URL prefix for ACME endpoints (default: /acme)
external_url = "" # External URL (derived from hostname if not set)
# Access control
allowed_cidrs = ["10.0.0.0/8"] # Restrict ACME API to specific networks (optional)
allowed_identifiers = ["*.internal.example.com"] # Domain patterns (optional, wildcards)
# Challenge configuration
challenges_enabled = ["http-01"] # Enabled challenge types (default: http-01 only)
challenge_validity = "15m" # Challenge validity period (default: 15m)
nonce_validity = "15m" # Nonce validity period (default: 15m)
# Certificate parameters
max_validity = "2160h" # Maximum certificate validity (default: 90 days)
default_validity = "2160h" # Default certificate validity (default: 90 days)
max_san_count = 100 # Maximum SANs per certificate (default: 100)
enable_ip_identifiers = true # Enable IP address identifiers (RFC 8738)
# CAA checking (RFC 8659)
caa_checking = false # Enable CAA record checking (default: false)
caa_identifiers = ["acme.example.com"] # CAA identifiers for this CA
# OCSP Responder (RFC 6960)
ocsp_enabled = true # Enable OCSP responder (default: true)
ocsp_cache_ttl = "5m" # OCSP response cache TTL (default: 5m)
ocsp_cidrs = ["0.0.0.0/0"] # Allowed CIDRs for OCSP (default: all)
# CRL Distribution (RFC 5280)
crl_enabled = true # Enable CRL endpoint (default: true)
crl_cidrs = ["0.0.0.0/0"] # Allowed CIDRs for CRL (default: all)
crl_next_update = "48h" # CRL NextUpdate offset (default: 48h)
# Deterministic DNS for internal domains
dns_deterministic = false # Enable deterministic DNS challenges
dns_deterministic_cidrs = ["10.0.0.0/8"] # Allowed CIDRs for deterministic DNS
# Legacy rate limits (simple)
rate_limit_orders_per_ip = 50 # Orders per IP per hour
rate_limit_certs_per_domain = 50 # Certs per domain per week
# Comprehensive rate limits
[acme.rate_limits]
enabled = true
orders_per_account = 5000 # Max orders per account per window
orders_per_account_window = "3h"
certs_per_domain = 500 # Max certs per eTLD+1 domain per window
certs_per_domain_window = "168h" # 1 week
certs_per_exact_set = 50 # Max certs per exact domain set
certs_per_exact_set_window = "168h"
auth_failures_per_domain = 50 # Max auth failures per domain per window
auth_failures_window = "1h"
orders_per_ip = 1000 # Max orders per IP per window
orders_per_ip_window = "1h"
failed_finalizations_per_order = 10
min_order_interval = "100ms" # Minimum time between orders per account
buffer_percent = 10 # Warning threshold at 90% of limit
Safe defaults with just enabled = true:
- http-01 challenge enabled, CAA checking disabled
- 90-day certificate validity, 15-minute challenge/nonce validity
- 100 SANs maximum, IP identifiers enabled
- OCSP responder enabled (5-minute cache)
- CRL distribution enabled (48-hour NextUpdate)
- No CIDR or domain restrictions
Hot-reloadable: rate limits, allowed_cidrs, allowed_identifiers, OCSP/CRL settings. Cold (restart required): enabled, path_prefix, challenges_enabled.
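As a worked example of the buffer_percent comment in the configuration above ("Warning threshold at 90% of limit" for buffer_percent = 10), the arithmetic can be sketched as follows. The function names, the floor rounding, and the use of ">=" for the comparison are assumptions made for illustration. The module's exact rounding and comparison are not documented here.

```python
def warning_threshold(limit: int, buffer_percent: int) -> int:
    """Usage count at which a limit counts as 'approaching' (floor rounding assumed)."""
    if not 0 <= buffer_percent <= 100:
        raise ValueError("buffer_percent must be between 0 and 100")
    return limit * (100 - buffer_percent) // 100


def is_approaching(usage: int, limit: int, buffer_percent: int = 10) -> bool:
    """True once usage reaches the warning threshold (inclusive comparison assumed)."""
    return usage >= warning_threshold(limit, buffer_percent)


if __name__ == "__main__":
    # orders_per_account = 5000 with buffer_percent = 10
    print(warning_threshold(5000, 10))
    print(is_approaching(4499, 5000), is_approaching(4500, 5000))
```

With the sample values from the configuration, a 5000-order limit and a 10 percent buffer put the warning point at 4500 orders within the window.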
Troubleshooting
Common symptoms and diagnostic steps:
“badNonce” errors from ACME clients:
- Nonce expired: increase nonce_validity (default 15m)
- Nonce already consumed: client must retry with fresh nonce from Replay-Nonce header
- Clock skew between cluster nodes: verify NTP synchronization
- Single-use enforcement: each nonce valid for exactly one request
“connection” errors during http-01 challenge:
- Firewall blocking port 80 from ACME server to client
- Client not serving challenge response at /.well-known/acme-challenge/{token}
- Wrong content at challenge URL (must be {token}.{thumbprint})
- Validation timeout: client has 30 seconds total (2s initial delay, 5 retries)
“dns” errors during dns-01 challenge:
- TXT record _acme-challenge.{domain} not propagated yet
- Wrong record value (must be base64url(SHA256(keyAuthorization)))
- DNS TTL too high, stale cached record
- DNS module must be enabled and healthy
“caa” errors during certificate issuance:
- Domain has CAA records that do not include this CA's identifier
- SERVFAIL on CAA lookup denies issuance per RFC 8659 (mandatory)
- Add CA identifier to domain CAA records or disable caa_checking
- CAA queries always bypass DNS cache for fresh data
“unauthorized” errors:
- Account key mismatch between request JWK and order account
- Order belongs to a different account thumbprint
- Certificate revocation attempted by non-owner without certificate key
“rejectedIdentifier” errors:
- Domain not matching allowed_identifiers patterns
- IP identifier requested but enable_ip_identifiers = false
“rateLimited” errors:
- Check which rate limit dimension was hit (logged at WARN level)
- Rate limits use fail-open design: errors do not block operations
- IPv6 addresses normalized to /64 prefix for rate limiting
- When both legacy and comprehensive rate limits configured, both enforced
OCSP responder returning “unknown” status:
- Serial number not recognized by this CA
- Certificate not loaded into memory on startup (check startup logs)
- Persistent storage lookup failed (check NATS JetStream health)
CRL endpoint returning empty or stale CRL:
- CRL rebuilt only on certificate revocation (not periodically)
- Check crl_enabled = true
- Verify persistent storage (NATS JetStream KV) is healthy
- Lazy load: first access after restart may be slower
Certificate storage issues:
- Memory storage: check distributed memory cache module health
- Persistent storage: check NATS JetStream KV connectivity
- Key naming: NATS KV does not allow ":" in keys, uses "/" separator
- Encryption: all persistent data encrypted with AES-256-GCM
- Startup loading: expired certificates are skipped during reload
Challenge validation timing out:
- Initial delay allows client time to set up challenge response
- Multiple retry attempts with backoff before giving up
- Total validation timeout is 30 seconds
- tls-alpn-01: verify client serves ALPN protocol "acme-tls/1" on port 443
Verify CA threshold signing works:
Run 'hexdcall threshold test' to trigger a test signing ceremony. Shows per-node participation, latency, and signature verification. Use '--trace' for phase-level timing and per-node message counts.
Security
Security model and hardening:
Account security:
Stateless accounts derived from JWK thumbprint. No account credentials stored server-side. Account key compromise allows certificate issuance for any allowed domain. Consider key rotation procedures for high-security deployments.
Challenge validation:
http-01: validates web server access on port 80 (follows up to 10 redirects)
dns-01: validates DNS control via TXT record at _acme-challenge.{domain}
tls-alpn-01: validates TLS server access via acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31)
All challenges have short validity (default 15 minutes) and are single-use.
Certificate security:
UUID v4 serial numbers prevent collision attacks. CAA checking (when enabled) prevents unauthorized issuance per RFC 8659. CIDR restrictions limit API access. Rate limiting prevents abuse across 7 dimensions.
Nonce security:
Cryptographically random, single-use, short validity (15 minutes default). Consumed immediately on use. Prevents replay attacks per RFC 8555.
Deterministic DNS security boundary:
Token derived from cluster_key and domain name. Domain must resolve to IP within dns_deterministic_cidrs. Only for internal domains where DNS API is unavailable.
Persistent storage encryption:
All certificate data encrypted at rest with keys derived from cluster_key. Defense-in-depth on top of transport encryption. Private keys never stored in plaintext.
Distributed consistency:
Write operations use transactional patterns to prevent race conditions. Write operations require quorum consensus. Rate limits fail-open for availability.
Threshold CA key protection:
When acme_ca_threshold=true, the ACME CA private key never exists in full on any node. Generated via distributed key generation, exists only as shares. Signing requires a quorum of nodes. Shares encrypted at rest with keys derived from cluster_key. After initial key generation, membership changes use resharing (no re-generation). Fail-closed: no certificates issued until key generation completes.
OCSP/CRL security:
OCSP responses signed with CA key, cached for performance (configurable TTL). Cache invalidated immediately on certificate revocation. CRL signed with CA key, rebuilt on each revocation (not periodic). Both endpoints support CIDR-based access control.
Relationships
Module dependencies and interactions:
- certmanager: Issued certificates can be stored via certmanager for TLS serving. ACME CA is the issuer; certmanager is the consumer for cluster-wide distribution.
- autotls: AutoTLS uses the internal ACME CA for wildcard certificate issuance. When auto_tls = true, ACME is automatically enabled.
- acmeclient: The ACME client module can point at this internal CA as its ACME server, creating a fully internal PKI without external dependencies.
- dns: Used for dns-01 challenge validation (TXT record queries), CAA record checking (typed DNS lookup with “CAA” query type), and deterministic DNS token validation. SERVFAIL on CAA lookup must deny issuance per RFC 8659.
- Distributed cache: Primary storage for orders, authorizations, challenges, nonces, OCSP cache, and CRL. Distributed across cluster nodes.
- Persistent storage: Durable storage for certificates, serial number index, and CRL. Uses NATS JetStream KV with encryption at rest.
- config: Hot-reload of rate limits, allowed CIDRs, OCSP/CRL settings.
- telemetry: Structured logging and Prometheus metrics for orders, challenges, certificates, OCSP, CRL, and rate limit events.
- Rate limiting: ACME implements its own rate limiting layer (7 dimensions) independent of the global rate limiter. Both legacy and comprehensive limits can be enforced simultaneously (defense in depth).
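For the per-IP dimension of this rate limiting layer, Troubleshooting notes that IPv6 addresses are normalized to their /64 prefix. A minimal sketch of that normalization with Python's standard ipaddress module is shown below. The function name `rate_limit_key` and the choice to leave IPv4 addresses unchanged are assumptions for illustration.

```python
import ipaddress


def rate_limit_key(client_ip: str) -> str:
    """Return the key a per-IP limiter might count against.

    IPv6 clients are grouped by /64 prefix; IPv4 clients keep their full address.
    """
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 6:
        # strict=False masks the host bits instead of raising on them.
        network = ipaddress.ip_network(f"{addr}/64", strict=False)
        return str(network)
    return str(addr)


if __name__ == "__main__":
    for ip in ("2001:db8:1:2::1", "2001:db8:1:2:ffff::9", "192.0.2.7"):
        print(ip, "->", rate_limit_key(ip))
```

Grouping by /64 means a client that rotates through addresses inside one IPv6 subnet still counts against a single key.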
Logs
Log entries by component. Search with: logs search "acme"
Levels: ERROR > WARN > INFO > DEBUG.
Init & Lifecycle:
acme.init INFO ACME CA server disabled in config
acme.init WARN JetStream temporarily unavailable, retrying certificate load
acme.init ERROR Failed to load certificates after retries
acme.init INFO ACME CA server initialized
acme.init DEBUG Restored CRL number from persistent storage
acme.init INFO CRL signing failed on startup, retrying
acme.init ERROR AUDIT Failed to regenerate CRL on startup — revoked certificates may not be enforced
acme.init INFO CRL regenerated on startup
acme.init INFO Skipping CRL rebuild on startup (not leader)
Periodic CRL Health Check:
acme.crl.periodic WARN AUDIT CRL expired or missing — rebuilding
acme.crl.periodic INFO AUDIT Periodic CRL rebuild succeeded
acme.crl.periodic ERROR AUDIT Periodic CRL rebuild failed — revoked certificates may not be enforced
Certificate Load from Persistent Storage:
acme.init.load INFO Persistent storage not enabled, skipping certificate load
acme.init.load ERROR Failed to load certificates from persistent storage
acme.init.load DEBUG Skipping expired certificate
acme.init.load WARN Failed to store certificate in memory cache
acme.init.load DEBUG Loaded certificate from persistent storage
acme.init.load WARN Failed to load certificate from persistent storage
Certificate Issuance:
acme.certificate.issue WARN AUDIT CAA re-check failed at issuance time
acme.certificate.issue WARN Failed to get CA chain
acme.certificate.issue WARN Serial index replication incomplete - revocation may need retry
acme.certificate.issue WARN Failed to save certificate to persistent storage
acme.certificate.issue INFO AUDIT Certificate issued
acme.certificate.issue WARN Failed to record certificate issuance for rate limiting
Certificate Revocation:
acme.certificate.revoke WARN Failed to update revocation in persistent storage
acme.certificate.revoke INFO AUDIT Certificate revoked
CAA Checking:
acme.caa.check DEBUG Checking CAA records
acme.caa.check WARN CAA lookup returned SERVFAIL
acme.caa.check DEBUG CAA lookup returned no records
acme.caa.check WARN CAA records do not authorize this CA
acme.caa.check DEBUG CAA check passed
acme.caa.lookup DEBUG CAA records found
acme.caa.iodef DEBUG CAA iodef record found
Challenge Response:
acme.challenge.respond ERROR Failed to atomically update challenge status
acme.challenge.respond ERROR Failed to update authorization
acme.challenge.respond INFO AUDIT Challenge response received
Challenge Validation:
acme.challenge.validate WARN Async validation cancelled during initial delay
acme.challenge.validate ERROR Failed to reload challenge for async validation
acme.challenge.validate INFO Challenge no longer in processing state, skipping validation
acme.challenge.validate ERROR Failed to reload challenge after validation
acme.challenge.validate WARN Failed to record auth failure for rate limiting
acme.challenge.validate ERROR Failed to store challenge after validation
acme.challenge.validate DEBUG Starting challenge validation
acme.challenge.validate INFO Challenge validation completed
Authorization:
acme.authorization.update INFO AUDIT Authorization status updated
Deterministic DNS Token:
acme.challenge.deterministic ERROR Cluster key not configured for deterministic DNS
acme.challenge.deterministic DEBUG Generated deterministic token
CRL:
acme.crl.get DEBUG CRL served from memory cache
acme.crl.get INFO No CRL found, generating initial CRL
acme.crl.get ERROR Failed to load CRL after rebuild
acme.crl.get ERROR Failed to load CRL from persistent storage
acme.crl.get DEBUG CRL loaded from persistent storage and cached
acme.crl.rebuild INFO Rebuilding CRL
acme.crl.rebuild ERROR Failed to collect revoked certificates
acme.crl.rebuild ERROR Failed to request CRL signing
acme.crl.rebuild ERROR Failed to sign CRL
acme.crl.rebuild ERROR Unexpected response type from CA module
acme.crl.rebuild ERROR CA module failed to sign CRL
acme.crl.rebuild WARN Failed to persist CRL to storage
acme.crl.rebuild INFO CRL rebuilt successfully
acme.crl.rebuild ERROR Background CRL rebuild failed
acme.crl.collect WARN Failed to collect ACME revocations, continuing with X.509
acme.crl.collect WARN Failed to collect X.509 revocations
acme.crl.collect DEBUG Collected revoked certificates
acme.crl.collect WARN Failed to parse certificate serial number, skipping
acme.crl.collect WARN Invalid serial number (zero or negative), skipping
acme.crl.collect.x509 WARN Failed to parse X.509 certificate serial number, skipping
acme.crl.collect.x509 WARN Invalid X.509 serial number (zero or negative), skipping
Nonce:
acme.nonce.create ERROR Failed to generate random nonce
acme.nonce.create ERROR Failed to store nonce
acme.nonce.create ERROR Failed to achieve nonce storage quorum
acme.nonce.create DEBUG Created new nonce
acme.nonce.validate ERROR Failed to get nonce from cache
acme.nonce.validate ERROR Failed to wait for nonce lookup
acme.nonce.validate ERROR Unexpected cache response type
acme.nonce.validate WARN Nonce not found
acme.nonce.validate ERROR Invalid nonce data type in cache
acme.nonce.validate WARN Nonce expired
acme.nonce.validate ERROR Failed to atomically consume nonce
acme.nonce.validate DEBUG Nonce validated and consumed atomically
OCSP:
acme.ocsp.handle WARN Invalid OCSP request
acme.ocsp.handle WARN Invalid serial number in OCSP request
acme.ocsp.handle DEBUG Processing OCSP request
acme.ocsp.handle DEBUG OCSP response served from cache
acme.ocsp.handle ERROR Failed to check certificate status
acme.ocsp.handle ERROR Failed to request OCSP signing
acme.ocsp.handle ERROR Failed to sign OCSP response
acme.ocsp.handle ERROR Unexpected response type from CA module
acme.ocsp.handle ERROR CA module failed to sign OCSP response
acme.ocsp.handle INFO OCSP response generated
acme.ocsp.x509 DEBUG Failed to query X.509 module
acme.ocsp.flush INFO OCSP cache flushed on startup
Order:
acme.order.create ERROR Failed to generate order ID
acme.order.create ERROR Failed to create authorization
acme.order.create ERROR Failed to store order
acme.order.create ERROR Failed to achieve order storage quorum
acme.order.create INFO Created new order
acme.order.create WARN Failed to record order for rate limiting
acme.order.finalize INFO Order finalization started
acme.order.issue ERROR Failed to reload order for async certificate issuance
acme.order.issue INFO Order no longer in processing state, skipping certificate issuance
acme.order.issue WARN Context cancelled before certificate issuance
acme.order.issue ERROR Failed to issue certificate
acme.order.issue WARN Failed to record finalization failure for rate limiting
acme.order.issue ERROR Failed to reload order after certificate issuance
acme.order.issue ERROR Failed to update order after certificate issuance
acme.order.issue INFO AUDIT Certificate issued successfully
Legacy Order Rate Limit:
acme.order.ratelimit WARN Failed to check rate limit, allowing request
acme.order.ratelimit WARN Rate limit optimistic lock failed after retries, allowing request
acme.order.ratelimit WARN Order rate limit exceeded
Validation HTTP-01:
acme.validation.http01 DEBUG Validating HTTP-01 challenge
acme.validation.http01 WARN HTTP-01 validation failed: connection error
acme.validation.http01 WARN HTTP-01 validation failed: wrong status code
acme.validation.http01 WARN HTTP-01 validation failed: invalid key authorization format
acme.validation.http01 WARN HTTP-01 validation failed: key authorization hash mismatch
acme.validation.http01 INFO HTTP-01 validation successful
acme.validation.http01.dns ERROR Failed to resolve hostname via DNS module
acme.validation.http01.dns ERROR DNS returned no addresses
acme.validation.http01.dns DEBUG Resolved hostname via DNS module
acme.validation.http01.dns DEBUG Connected to validation target
acme.validation.http01.dns WARN Failed to connect to IP, trying next
acme.validation.http01.dns ERROR Failed to connect to any resolved IP
Validation DNS-01:
acme.validation.dns01 DEBUG Validating DNS-01 challenge
acme.validation.dns01 WARN DNS-01 validation failed: DNS lookup error
acme.validation.dns01 ERROR DNS-01 validation failed: no expected value computed
acme.validation.dns01 INFO DNS-01 validation successful
acme.validation.dns01 WARN DNS-01 validation failed: no matching TXT record
Validation TLS-ALPN-01:
acme.validation.tlsalpn01 DEBUG Validating TLS-ALPN-01 challenge
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: connection error
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: wrong ALPN protocol
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: certificate doesn't contain identifier
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: no acmeIdentifier extension
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: acmeIdentifier mismatch
acme.validation.tlsalpn01 INFO TLS-ALPN-01 validation successful
acme.validation.tlsalpn01.dns ERROR Failed to resolve hostname via DNS module
acme.validation.tlsalpn01.dns ERROR DNS returned no addresses
acme.validation.tlsalpn01.dns DEBUG Resolved hostname via DNS module
acme.validation.tlsalpn01.dns DEBUG TLS connection established
acme.validation.tlsalpn01.dns WARN Failed to connect to IP, trying next
acme.validation.tlsalpn01.dns ERROR Failed to establish TLS connection to any resolved IP
Validation Deterministic DNS:
acme.validation.deterministic DEBUG Failed to resolve domain for deterministic DNS check
Comprehensive Rate Limiting:
acme.ratelimit.circuitbreaker ERROR Rate limit circuit breaker open — blocking requests
acme.ratelimit.check DEBUG Rate limit checks passed
acme.ratelimit.warn WARN Approaching rate limit capacity
acme.ratelimit.blocked INFO AUDIT Rate limit check blocked operation
acme.ratelimit.error WARN Rate limit state access error
acme.ratelimit.record DEBUG Recorded certificate issuance
acme.ratelimit.record WARN Recorded authorization failure
acme.ratelimit.record WARN Recorded finalization failure
SPIFFE Account:
spiffe.account.create WARN Unknown account key - no matching workload found
spiffe.account.create WARN Client IP not allowed for workload
spiffe.account.create ERROR Failed to store SPIFFE account
spiffe.account.create INFO Created SPIFFE account
spiffe.account.deactivate INFO SPIFFE account deactivated
SPIFFE Order:
spiffe.order.create WARN Client IP not allowed for workload
spiffe.order.create WARN SAN not allowed for workload
spiffe.order.create ERROR Failed to generate order ID
spiffe.order.create ERROR Failed to store SPIFFE order
spiffe.order.create INFO Created SPIFFE order (auto-approved)
spiffe.order.get WARN Client IP not allowed for workload
spiffe.order.finalize DEBUG Using workload snapshot from order creation (hot-reload safe)
spiffe.order.finalize WARN Workload removed from config during order lifetime
spiffe.order.finalize DEBUG Using current workload config (v1 order - upgrade for hot-reload safety)
spiffe.order.finalize WARN Client IP not allowed for workload
spiffe.order.finalize WARN Certificate issuance queue full, waiting for slot
spiffe.order.finalize ERROR Failed to revert order status after timeout
spiffe.order.finalize INFO SPIFFE order finalization started
spiffe.order.issue ERROR Failed to reload SPIFFE order for certificate issuance
spiffe.order.issue INFO SPIFFE order no longer in processing state, skipping
spiffe.order.issue ERROR Failed to issue SPIFFE certificate
spiffe.order.issue ERROR Failed to reload SPIFFE order after certificate issuance
spiffe.order.issue ERROR Failed to update SPIFFE order after certificate issuance
spiffe.order.issue INFO AUDIT SPIFFE certificate issued successfully
SPIFFE Certificate:
spiffe.certificate.issue WARN Failed to get CA chain
spiffe.certificate.issue WARN Serial index replication incomplete
spiffe.certificate.issue WARN Failed to save SPIFFE certificate to persistent storage
spiffe.certificate.get.error WARN Account not found for certificate retrieval
spiffe.certificate.get.cidr WARN Client IP not allowed for workload during certificate retrieval
spiffe.certificate.revoke.cidr WARN Client IP not allowed for workload during revocation
spiffe.certificate.revoke INFO AUDIT SPIFFE certificate revoked
SPIFFE Rate Limiting:
spiffe.ratelimit.check WARN Failed to check rate limit, allowing request
spiffe.ratelimit.blocked WARN SPIFFE rate limit exceeded
spiffe.ratelimit.record WARN Failed to get rate limit state
spiffe.ratelimit.record WARN Failed to store rate limit state
Metrics
Prometheus metrics. Query with: metrics prometheus acme_<name> or spiffe_<name>
ACME Provider Rate Limiting (namespace: acme_provider):
acme_provider_ratelimit_checks_total  counter  {}  Total rate limit check invocations
acme_provider_ratelimit_check_results_total  counter  {limit_type, result}  Rate limit check outcomes
    limit_type=all, result=passed  All checks passed
    limit_type=<type>, result=blocked  Operation blocked by specific limit type
acme_provider_ratelimit_check_duration  latency  {}  Rate limit check duration
acme_provider_ratelimit_circuit_breaker_trips_total  counter  {limit_type}  Circuit breaker trips (blocking after consecutive state errors)
acme_provider_ratelimit_orders_created_total  counter  {}  Orders recorded for rate limiting
acme_provider_ratelimit_certs_issued_total  counter  {}  Certificates recorded for rate limiting
acme_provider_ratelimit_domain_issuances_total  counter  {domain}  Issuances per registered domain
acme_provider_ratelimit_auth_failures_total  counter  {domain}  Authorization failures per registered domain
acme_provider_ratelimit_finalization_failures_total  counter  {}  Failed finalization attempts recorded
acme_provider_ratelimit_state_errors_total  counter  {limit_type, operation}  Distributed state access errors
acme_provider_ratelimit_approaching_total  counter  {limit_type}  Warning: nearing rate limit capacity (80%+)
acme_provider_ratelimit_current_usage  gauge  {limit_type}  Current usage count for limit dimension
acme_provider_ratelimit_limit  gauge  {limit_type}  Configured limit for dimension
acme_provider_ratelimit_usage_percent  gauge  {limit_type}  Usage as percentage of limit

SPIFFE (namespace: spiffe):
spiffe_accounts_created_total  counter  {workload}  SPIFFE accounts created
spiffe_orders_created_total  counter  {workload}  SPIFFE orders created
spiffe_orders_finalized_total  counter  {workload}  SPIFFE orders finalized (issuance started)
spiffe_certificates_issued_total  counter  {workload}  SPIFFE certificates issued successfully
spiffe_certificate_issuance_errors_total  counter  {workload, reason}  SPIFFE certificate issuance failures
spiffe_certificates_revoked_total  counter  {workload}  SPIFFE certificates revoked
spiffe_certificate_retrievals_total  counter  {}  SPIFFE certificate downloads
spiffe_certificate_retrieval_errors_total  counter  {reason}  SPIFFE certificate download errors
spiffe_trust_bundle_requests_total  counter  {}  SPIFFE trust bundle requests
spiffe_issuance_queue_depth  gauge  {}  Current concurrent issuance goroutines
spiffe_issuance_queue_full_total  counter  {workload}  Issuance rejected due to queue full
spiffe_ca_signing_duration_ms  histogram  {}  CA signing ceremony latency (ms)
spiffe_ratelimit_current_usage  gauge  {workload}  Current rate limit usage per workload
spiffe_ratelimit_blocked_total  counter  {workload}  Requests blocked by rate limit
spiffe_ratelimit_check_error_total  counter  {workload, reason}  Rate limit check errors (fail-open)
spiffe_ratelimit_record_error_total  counter  {workload, reason}  Rate limit recording errors
spiffe_ratelimit_record_success_total  counter  {workload}  Rate limit entries recorded successfully

ACME Client
Automatic TLS certificate management via Let’s Encrypt or ACME-compliant CAs with cluster-wide distribution
Overview
The ACME client module obtains and manages TLS certificates from Let’s Encrypt or any ACME-compliant Certificate Authority (including Hexon’s internal ACME CA). It handles certificate issuance, automatic renewal, cluster-wide distribution, and bootstrap fallback for high-availability deployments.
Core capabilities:
- Automatic certificate issuance via ACME protocol (RFC 8555)
- HTTP-01 challenge with dynamic port 80 listener (only during verification)
- Cluster-wide certificate distribution via persistent KV watch (NATS JetStream KV)
- Encrypted persistent storage (AES-256-GCM with cluster_key domain separation)
- Automatic renewal with configurable threshold (default: 30 days before expiry)
- ACME Renewal Information (ARI) support (RFC 9773) for optimal renewal windows
- Bootstrap fallback: self-signed temporary certificate when ACME fails on startup
- Recovery mechanism: exponential backoff retry after bootstrap fallback
- Smart startup with leader detection for cluster-wide deduplication
- Wildcard certificate coverage detection to avoid redundant issuance
- Client-side rate limiting to avoid hitting CA limits
- Startup readiness integration to prevent HTTPS binding without valid certificate
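The wildcard coverage detection listed above follows standard TLS wildcard semantics: a wildcard matches exactly one leftmost label. The following is a minimal sketch of that check, not the gateway's implementation; the function names `wildcard_covers` and `needs_acme` are hypothetical.

```python
def wildcard_covers(cert_name: str, domain: str) -> bool:
    """Return True if a certificate name (exact or wildcard) covers a domain."""
    cert_name = cert_name.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    if not cert_name.startswith("*."):
        # Exact names only cover themselves.
        return cert_name == domain
    suffix = cert_name[1:]  # "*.example.com" -> ".example.com"
    if not domain.endswith(suffix):
        return False
    label = domain[: -len(suffix)]  # the part matched by "*"
    # The wildcard stands for exactly one non-empty label (no nested dots).
    return label != "" and "." not in label


def needs_acme(domain: str, static_names: list[str]) -> bool:
    """Hypothetical helper: ACME issuance is needed only if no static name covers the domain."""
    return not any(wildcard_covers(name, domain) for name in static_names)
```

With a static certificate for `*.example.com`, `needs_acme("api.example.com", ...)` is false, while the apex `example.com` and a nested name such as `sub.api.example.com` would still require issuance.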
Certificate modes (dual mode):
Static: tls_cert/tls_key in config used directly (highest priority)
ACME: acme_client.enabled = true, certificates managed automatically
Both can coexist. Static wildcards suppress redundant ACME issuance.

Dynamic port 80 challenge listener architecture:
1. Certificate issuance starts -> challenge listener started on ALL nodes
2. All nodes start port 80 listener -> Wait for quorum confirmation
3. ACME challenge tokens stored in distributed memory cache (cluster-wide)
4. ACME server validates -> Any node can respond to challenge
5. Certificate issued -> challenge listener stopped on ALL nodes
6. All nodes stop port 80 listener
Port 80 exposed only during brief verification window (~30 seconds).

Cluster coordination:
1. Leader node performs ACME protocol exchange
2. Certificate saved to Persistent Storage (encrypted)
3. persistent KV watch automatically syncs to all cluster nodes
4. All nodes update in-memory TLS configuration via watch handler
No manual certificate distribution needed.

Storage model:
Persistent: NATS JetStream KV with AES-256-GCM encryption
Keys: account, cert/{base64url(domain)}, issuance/{base64url(domain)}
Watch pattern: "cert/*" for automatic cluster sync
Domain encoding: base64url (RFC 4648 without padding) for special characters

Config
ACME client configuration in hexon.toml under [acme_client]:
[acme_client]
enabled = true                     # Enable ACME client
email = "admin@example.com"        # Contact email for CA notifications
accept_tos = true                  # Accept CA terms of service (required)
reset = false                      # Delete all ACME data on startup (default: false)

# Certificate parameters
key_type = "ecdsa256"              # Key type: ecdsa256, ecdsa384, rsa2048, rsa4096, ed25519
renewal_threshold_hours = 720      # Renew when fewer than N hours remain (default: 720 = 30 days)
renewal_check_interval = "6h"      # How often to check for renewals (default: 6h)
auto_proxy_domains = true          # Auto-issue certs for proxy mapping domains

# Challenge configuration
challenge_port = 80                # Port for HTTP-01 challenge listener (default: 80)

# Bootstrap fallback
allow_bootstrap_fallback = true    # Use self-signed cert if ACME fails (default: true)
startup_timeout = "60s"            # Max time for ACME on startup (default: 60s)
startup_retries = 3                # Retries within timeout before fallback (default: 3)

# ACME Renewal Information (ARI) - RFC 9773
ari_enabled = true                 # Fetch optimal renewal windows from CA (default: true)
ari_check_interval = "6h"          # How often to refresh ARI data (default: 6h)

# Client-side rate limits (avoid hitting CA limits)
[acme_client.rate_limits]
enabled = true                     # Enable rate limit tracking (default: true)
orders_per_account = 300           # Max orders per account per window (default: 300)
orders_window = "3h"               # Orders window (default: 3h)
certs_per_domain = 50              # Max certs per domain per window (default: 50)
certs_per_domain_window = "168h"   # 7 days (default: 168h)
buffer_percent = 10                # Safety margin before limit (default: 10%)

TLS certificate source priority:
1. Static certificate (tls_cert + tls_key) -- highest priority
2. AutoTLS (auto_tls = true)
3. ACME client (acme_client.enabled = true)
4. Error -- no TLS configured

Bootstrap certificate characteristics (when ACME fails):
CN: HEXON-BOOTSTRAP-{hostname}, O: HexonGateway
Validity: 7 days, Key: ECDSA P-256, SANs: configured hostname only
NOT persisted -- regenerated on each startup if needed

Recovery schedule after bootstrap fallback:
1 minute -> 5 minutes -> 15 minutes -> 30 minutes -> 1 hour -> normal cycle (6h)

Hot-reloadable: renewal_threshold_hours, renewal_check_interval, rate limits.
Cold (restart required): enabled, email, accept_tos, key_type, challenge_port.
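The recovery schedule above can be read as a fixed list of delays followed by the normal 6h cycle. A minimal sketch of that lookup, assuming attempts are counted from zero; the names `RECOVERY_SCHEDULE`, `NORMAL_CYCLE`, and `recovery_delay` are illustrative and do not come from the gateway.

```python
from datetime import timedelta

# Delays taken from the documented schedule: 1m, 5m, 15m, 30m, 1h.
RECOVERY_SCHEDULE = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(minutes=30),
    timedelta(hours=1),
]
# After the schedule is exhausted, fall back to the normal renewal cycle.
NORMAL_CYCLE = timedelta(hours=6)


def recovery_delay(attempt: int) -> timedelta:
    """Return the wait before recovery attempt number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if attempt < len(RECOVERY_SCHEDULE):
        return RECOVERY_SCHEDULE[attempt]
    return NORMAL_CYCLE
```

Under this reading, the five scheduled attempts span 1 hour 51 minutes in total before the routine settles into the 6h cycle.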
Troubleshooting
Common symptoms and diagnostic steps:
Certificate not issued on startup:
- Check if leader exists: issuance uses leader-only scheduling which requires a leader node
- Smart retry loop polls for leader with exponential backoff
- Verify startup_timeout (default 60s) and startup_retries (default 3)
- If no leader within timeout and allow_bootstrap_fallback = true, bootstrap cert used
- Check logs for "unknown UUID" errors (indicates leader-only scheduling called without leader)

Using bootstrap certificate (self-signed):
- Bootstrap certificate in use indicates ACME failure on startup
- Recovery routine runs with exponential backoff (1m, 5m, 15m, 30m, 1h, then 6h)
- Check if ACME directory is reachable from the node
- Verify DNS resolution for the configured domain
- Check ACME account creation succeeded (email, accept_tos required)

HTTP-01 challenge failing:
- Port 80 must be accessible from the ACME CA server to any cluster node
- Verify port 80 is not already in use (challenge_port config)
- Challenge listener is dynamic: only active during issuance (~30 seconds)
- Check distributed memory cache health: challenge tokens stored cluster-wide
- Challenge tokens have short TTL (5 minutes)
- Path traversal protection on challenge token validation

Certificate not renewing:
- Check renewal_threshold_hours (default 720 = 30 days)
- Verify renewal_check_interval schedule (default 6h)
- Renewal checks run via the internal scheduler automatically
- ARI-suggested renewal windows may differ from threshold-based renewal
- Check rate limits: client-side tracking prevents exceeding CA limits

Certificate not appearing on all nodes:
- persistent KV watch subscribes to "cert/*" for automatic sync
- Check NATS JetStream KV connectivity on all nodes
- WatchEventPut: decrypt and install; WatchEventDelete: remove from cache
- Encryption key mismatch: all nodes must share same cluster_key
- Check AES-256-GCM decryption errors in logs

Rate limiting issues:
- Let's Encrypt limits: 300 orders/3h, 50 certs/domain/7d, 5 exact set/7d
- Client-side tracking uses memory module with TTLs matching CA windows
- HTTP 429 with Retry-After: short waits retried immediately, long waits deferred
- ARI-suggested renewals exempt from some rate limits
- buffer_percent (default 10%) triggers warning at 90% of limit

Wildcard coverage preventing ACME issuance:
- Static wildcard cert (e.g., *.example.com) suppresses ACME for covered domains
- *.example.com covers api.example.com but NOT example.com (apex)
- *.example.com does NOT cover sub.api.example.com (nested subdomain)
- Check if domain is covered by existing static certificate

HexonReady timeout:
- ACME client registers a readiness check for certificate availability
- HexonReady polls every 500ms with 2-minute timeout
- If timeout: either ACME failed and bootstrap disabled, or leader unavailable
- Check certificatesReady atomic flag in module state

Metrics for diagnosis:
acmeclient_issuance_started_total, acmeclient_issuance_success_total
acmeclient_issuance_failed_total (labels: domain, error_type)
acmeclient_renewal_checks_total, acmeclient_certificates_expiring
acmeclient_challenges_served_total (labels: status)
acmeclient_certificates_loaded (gauge), acmeclient_certificate_days_until_expiry

Interpreting tool output:
'certs list':
    Healthy: All certs Status=OK, Days Left > 30
    Warning: Days Left < 30; renewal should happen automatically, check 'certs acme'
    Expiring: Days Left < 7; urgent, check ACME client health immediately
    Bootstrap: Source=bootstrap; self-signed temporary cert, ACME failed on startup
'certs acme list':
    Healthy: All domains show Status=valid with reasonable expiry
    Pending: Status=pending; issuance in progress or waiting for challenge
    Failed: Status=failed with error; check challenge port 80 accessibility
    Action: Failed → 'logs search acmeclient' for issuance error details
'autotls status':
    Healthy: Certificate valid, Days Left > 30, auto-renewal scheduled
    Renewing: Renewal in progress; certificate will update automatically
    Failed: Renewal failed; check internal ACME CA health with 'health components'

Security
Security model and hardening:
Private key protection:
All sensitive data encrypted at rest using AES-256-GCM.
Key derived from cluster_key with module-specific domain separation ("acmeclient").
Defense-in-depth on top of NATS transport encryption.
Private keys never stored in plaintext in persistent storage.

Challenge listener security:
Port 80 only exposed during brief verification window (~30 seconds).
Challenge tokens have short TTL (5 minutes) in distributed memory cache.
Token validation prevents path traversal attacks.
Challenge listener management restricted to internal operations only.

Cluster synchronization security:
Automatic encrypted sync across all nodes (no manual distribution needed).
All nodes must share the same cluster_key for decryption.
NATS JetStream uses TLS encryption in transit.

Bootstrap certificate limitations:
Self-signed, not trusted by any external client.
7-day validity, ECDSA P-256 key.
NOT persisted -- cannot be accidentally used long-term.
Recovery routine continuously attempts real ACME certificate.

Access control:
Certificate issuance and renewal are restricted to internal scheduler and admin commands.
Challenge listener management is internal only (during issuance).
Certificate status queries are available to all services and admin commands.

Relationships
Module dependencies and interactions:
- Certificate manager: Primary consumer. ACME client stores issued certificates and distributes them cluster-wide via the certificate manager.
- TLS server: Retrieves certificates via certificate manager for TLS handshakes. Checks for ACME-managed certificates when no static TLS cert configured.
- Internal ACME CA: Can be configured as the ACME server endpoint, creating a fully internal PKI. The ACME client is a standard client — it works with any RFC 8555 compliant CA.
- AutoTLS: Alternative for internal-only deployments. AutoTLS uses the internal CA directly. Priority: static > autotls > acmeclient.
- Distributed memory: Challenge tokens stored cluster-wide for any-node validation. Rate limit counters tracked with CA-matched TTLs.
- Persistent storage: Durable certificate storage (encrypted, NATS JetStream KV). Automatic cluster distribution via watch on certificate keys.
- Cluster coordination: Leader detection for deduplication of certificate issuance. Readiness check for startup sequencing.
- Config: Certificate mode selection (static vs ACME), renewal parameters, rate limit configuration. Hot-reload for renewal settings.
- Telemetry: Prometheus metrics for issuance, renewal, challenges, and certificate state. Structured logging for all ACME protocol interactions.
- Scheduler: Renewal checks run on configurable interval (default 6h). ARI check runs on separate interval (default 6h).
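The source priority noted in the AutoTLS relationship (static > autotls > acmeclient) amounts to a first-match selection. The sketch below illustrates that order only; the flat `config` dictionary and its key names (`tls_cert`, `tls_key`, `auto_tls`, `acme_client_enabled`) are a simplification assumed for the example and do not mirror the real configuration layout.

```python
def select_tls_source(config: dict) -> str:
    """Pick the certificate source using the documented priority order."""
    # 1. Static PEM wins whenever both the certificate and key are configured.
    if config.get("tls_cert") and config.get("tls_key"):
        return "static"
    # 2. AutoTLS (internal CA) is next.
    if config.get("auto_tls"):
        return "autotls"
    # 3. The ACME client is used only when neither of the above applies.
    if config.get("acme_client_enabled"):
        return "acmeclient"
    # 4. Nothing configured: treated as an error.
    raise ValueError("no TLS configured")
```

For example, a configuration that sets both static PEM paths and enables the ACME client resolves to `"static"`, which matches the note that both modes can coexist.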
Logs
Log entries by component. Search with: logs search "acmeclient"
Levels: ERROR > WARN > INFO > DEBUG > TRACE. AUDIT = persisted to audit trail.
Init (module startup):
acmeclient.init  INFO  Registered ACME certificates readiness check with HexonReady
acmeclient.init  INFO  Static TLS certificate configured - ACME client inactive
acmeclient.init  INFO  ACME client disabled in config
acmeclient.init  INFO  Initializing ACME client
acmeclient.init  ERROR  Failed to initialize ACME client
acmeclient.init  INFO  ACME client initialized successfully
acmeclient.init  WARN  Persistent storage is memory-only (cluster_path not set). Certificates will NOT survive cluster restart.
acmeclient.init  WARN  Failed to load issuance state, starting fresh
acmeclient.init  WARN  Failed to load some certificates from storage
acmeclient.init  WARN  Service certificate acquisition issue - will retry via recovery

Reset (data reset on startup):
acmeclient.reset  WARN  ACME reset requested - deleting all ACME data (account, certificates, issuance state)
acmeclient.reset  ERROR  Failed to reset ACME data
acmeclient.reset  WARN  ACME data reset complete - starting fresh

Fallback (bootstrap fallback):
acmeclient.fallback  WARN  ACME initialization failed, attempting bootstrap fallback
acmeclient.fallback  ERROR  Failed to generate bootstrap certificate - server cannot start with TLS
acmeclient.fallback  WARN  Using bootstrap certificate - ACME unavailable

Startup (service certificate acquisition):
acmeclient.startup  WARN  No service hostname configured - skipping certificate check
acmeclient.startup  INFO  Using existing valid certificate
acmeclient.startup  INFO  No valid certificate found - attempting ACME issuance with leader detection
acmeclient.startup  INFO  Certificate now available
acmeclient.startup  DEBUG  No leader available, waiting...
acmeclient.startup  INFO  Attempting ACME certificate issuance
acmeclient.startup  WARN  Certificate issuance request failed
acmeclient.startup  WARN  Certificate issuance wait failed
acmeclient.startup  INFO  Certificate issued successfully during startup
acmeclient.startup  WARN  ACME issuance timeout, falling back to bootstrap certificate
acmeclient.startup  WARN  Using bootstrap certificate - ACME recovery will be attempted

Account (ACME account management):
acmeclient.account  INFO  Loaded existing ACME account from persistent storage
acmeclient.account  INFO  No existing account found, waiting before creation to prevent race
acmeclient.account  INFO  Account was created by another node during wait, using existing
acmeclient.account  INFO  Creating new ACME account
acmeclient.account  WARN  Failed to save ACME account to persistent storage
acmeclient.account  INFO  Saved new ACME account to persistent storage
acmeclient.account  INFO  Created new ACME account successfully

Request (signed ACME requests):
acmeclient.request  DEBUG  Retrying ACME request after transient error

Rate limit (CA rate limit handling and client-side tracking):
acmeclient.ratelimit  WARN  Rate limited by CA, waiting before retry
acmeclient.ratelimit  WARN  Rate limited by CA, scheduling for later retry
acmeclient.ratelimit  WARN  Rate limited by CA without valid Retry-After, using exponential backoff
acmeclient.ratelimit  DEBUG  Rate limit checking disabled, skipping pre-flight checks
acmeclient.ratelimit  INFO  Starting pre-flight rate limit checks
acmeclient.ratelimit  INFO  All rate limit checks passed
acmeclient.ratelimit  WARN  Failed to check rate limit state
acmeclient.ratelimit  WARN  Approaching rate limit capacity
acmeclient.ratelimit  WARN  Rate limit check blocked operation
acmeclient.ratelimit  DEBUG  Rate limit check passed
acmeclient.ratelimit  WARN  Failed to record last order time
acmeclient.ratelimit  WARN  Failed to retrieve account order state
acmeclient.ratelimit  ERROR  Failed to store account order state
acmeclient.ratelimit  INFO  Recorded order creation
acmeclient.ratelimit  WARN  Failed to retrieve domain state, creating new
acmeclient.ratelimit  WARN  IssuedAt array exceeded max entries, truncating
acmeclient.ratelimit  ERROR  Failed to store domain issuance state
acmeclient.ratelimit  INFO  Recorded domain certificate issuance
acmeclient.ratelimit  WARN  Failed to retrieve exact set state, creating new
acmeclient.ratelimit  WARN  IssuedAt array exceeded max entries, truncating
acmeclient.ratelimit  ERROR  Failed to store exact set issuance state
acmeclient.ratelimit  INFO  Recorded exact set certificate issuance
acmeclient.ratelimit  WARN  Failed to retrieve domain state for auth failure recording
acmeclient.ratelimit  ERROR  Failed to store authorization failure state
acmeclient.ratelimit  WARN  Recorded authorization failure
acmeclient.ratelimit  ERROR  Failed to store Retry-After state
acmeclient.ratelimit  WARN  Stored Retry-After delay from CA

Issue (certificate issuance):
acmeclient.issue  WARN  Certificate issuance attempted but ACME client not initialized
acmeclient.issue  INFO  Starting certificate issuance
acmeclient.issue  ERROR  Certificate issuance failed
acmeclient.issue  INFO  Certificate issued successfully
acmeclient.issue  INFO  All domains covered by static certificate, no ACME issuance needed
acmeclient.issue  WARN  Rate limit check failed, delaying issuance
acmeclient.issue  WARN  Certificate issuance already in progress for domain
acmeclient.issue  DEBUG  Starting ACME certificate issuance
acmeclient.issue  WARN  Failed to record order creation for rate limiting
acmeclient.issue  INFO  Starting challenge listeners cluster-wide
acmeclient.issue  WARN  Node failed to start challenge listener
acmeclient.issue  INFO  Stopping challenge listeners cluster-wide
acmeclient.issue  WARN  Failed to broadcast stop challenge listener
acmeclient.issue  WARN  Failed to save certificate to persistent storage
acmeclient.issue  WARN  Failed to install certificate locally
acmeclient.issue  WARN  Failed to record certificate issuance for rate limiting

Challenge (HTTP-01 challenge handling):
acmeclient.challenge  WARN  Invalid ACME token format
acmeclient.challenge  WARN  Failed to lookup challenge token
acmeclient.challenge  WARN  Failed to wait for challenge lookup
acmeclient.challenge  ERROR  Unexpected response type from memorystorage
acmeclient.challenge  DEBUG  Challenge token not found
acmeclient.challenge  WARN  Challenge token has invalid value type
acmeclient.challenge  WARN  Failed to write challenge response
acmeclient.challenge  INFO  Served ACME challenge
acmeclient.challenge  DEBUG  Challenge token stored, responding to ACME server
acmeclient.challenge  INFO  Authorization validated
acmeclient.challenge  WARN  Failed to record authorization failure for rate limiting

Listener (challenge listener lifecycle):
acmeclient.listener  DEBUG  Challenge listener already running
acmeclient.listener  WARN  Failed to resolve interface IP, falling back to 0.0.0.0
acmeclient.listener  ERROR  Failed to create challenge listener
acmeclient.listener  ERROR  Failed to start challenge listener
acmeclient.listener  INFO  Challenge listener started
acmeclient.listener  DEBUG  Challenge listener not running, nothing to stop
acmeclient.listener  WARN  Challenge listener shutdown error
acmeclient.listener  INFO  Challenge listener stopped

Bootstrap (bootstrap certificate generation):
acmeclient.bootstrap  INFO  Generated CA-signed bootstrap certificate
acmeclient.bootstrap  WARN  CA signing failed, falling back to self-signed
acmeclient.bootstrap  WARN  Generated temporary bootstrap certificate - ACME certificate pending

Renewal (certificate renewal):
acmeclient.renewal  INFO  Scheduling renewal checks
acmeclient.renewal  ERROR  Failed to schedule renewal checks
acmeclient.renewal  INFO  Renewal check scheduler registered
acmeclient.renewal  INFO  Running startup certificate check
acmeclient.renewal  ERROR  Failed to trigger startup renewal check
acmeclient.renewal  ERROR  Startup renewal check failed
acmeclient.renewal  INFO  Startup certificate check completed
acmeclient.renewal  INFO  Cleaned up old failure records
acmeclient.renewal  INFO  Cleaned up stale inProgress entries
acmeclient.renewal  WARN  Failed to fetch ARI info
acmeclient.renewal  INFO  ARI suggests certificate renewal
acmeclient.renewal  DEBUG  ARI window not yet open, skipping renewal
acmeclient.renewal  INFO  Certificate needs renewal
acmeclient.renewal  INFO  Certificate missing for domain
acmeclient.renewal  INFO  Skipping certificate renewal - retry not allowed
acmeclient.renewal  INFO  Renewing certificate
acmeclient.renewal  ERROR  Failed to renew certificate
acmeclient.renewal  DEBUG  Domain covered by static certificate, skipping
acmeclient.renewal  INFO  ARI-guided certificate renewal completed
acmeclient.renewal  INFO  Certificate renewed successfully

Renewals (hexdcall renewal check operation):
acmeclient.renewals  WARN  Renewal check skipped - ACME client not initialized
acmeclient.renewals  INFO  Starting renewal check
acmeclient.renewals  INFO  Renewal check completed

Domains (domain collection for certificate issuance):
acmeclient.domains  DEBUG  Added service hostname to domain list
acmeclient.domains  DEBUG  Added additional domains from config
acmeclient.domains  DEBUG  Added proxy mapping hosts
acmeclient.domains  DEBUG  Added proxy landing page hostname
acmeclient.domains  DEBUG  Added forward proxy hostname
acmeclient.domains  DEBUG  Added connector hostname
acmeclient.domains  INFO  Collected domains for ACME certificates
acmeclient.domains  INFO  Domains skipped (covered by static TLS certificate)
acmeclient.domains  WARN  No domains configured for ACME. Set service.hostname, acme_client.additional_domains, or configure proxy mappings

Load (certificate loading from storage):
acmeclient.load  WARN  Certificate load skipped - ACME client not initialized
acmeclient.load  INFO  Loading certificates from storage
acmeclient.load  INFO  Loaded certificate from persistent storage

Coverage (static certificate coverage checking):
acmeclient.coverage  WARN  Failed to read static certificate for coverage check
acmeclient.coverage  WARN  Failed to decode static certificate PEM
acmeclient.coverage  WARN  Failed to parse static certificate
acmeclient.coverage  INFO  Parsed static certificate for coverage check
acmeclient.coverage  DEBUG  Domain covered by static certificate, skipping ACME

ARI (ACME Renewal Information - RFC 9773):
acmeclient.ari  WARN  Invalid ARI window: end not after start, using window start
acmeclient.ari  WARN  ARI window exceeds maximum, capping duration
acmeclient.ari  WARN  Failed to generate random offset for ARI window, using window start
acmeclient.ari  WARN  CA suggests early renewal - check explanation URL
acmeclient.ari  DEBUG  Using cached ARI info
acmeclient.ari  ERROR  Failed to fetch ARI info from CA
acmeclient.ari  WARN  Failed to cache ARI info
acmeclient.ari  INFO  Fetched and cached ARI info from CA
acmeclient.ari  DEBUG  No ARI info available for domain
acmeclient.ari  WARN  Failed to retrieve ARI info for marking as replaced
acmeclient.ari  DEBUG  No ARI info found to mark as replaced
acmeclient.ari  ERROR  Failed to store ARI replaced state
acmeclient.ari  INFO  Marked ARI renewal as completed

Recovery (bootstrap recovery routine):
acmeclient.recovery  INFO  Starting ACME recovery routine
acmeclient.recovery  INFO  Bootstrap certificate replaced - recovery complete
acmeclient.recovery  INFO  Waiting for next recovery attempt
acmeclient.recovery  INFO  Bootstrap certificate replaced during wait - recovery complete
acmeclient.recovery  WARN  Initial recovery schedule exhausted - switching to normal renewal cycle
acmeclient.recovery  INFO  Attempting ACME recovery
acmeclient.recovery  WARN  ACME client not fully initialized - attempting reinitialization
acmeclient.recovery  WARN  ACME reinitialization failed
acmeclient.recovery  WARN  ACME recovery request failed
acmeclient.recovery  WARN  ACME recovery wait failed
acmeclient.recovery  WARN  ACME recovery got unexpected response type
acmeclient.recovery  WARN  ACME recovery issuance failed
acmeclient.recovery  INFO  ACME recovery successful - real certificate obtained

Watch (PersistentWatch certificate sync):
acmeclient.watch  WARN  PersistentWatch disconnected, will retry
acmeclient.watch  ERROR  Failed to start PersistentWatch
acmeclient.watch  INFO  Started PersistentWatch for certificate updates
acmeclient.watch  INFO  PersistentWatch channel closed
acmeclient.watch  WARN  Received invalid envelope type
acmeclient.watch  ERROR  Failed to decrypt certificate from watch event
acmeclient.watch  WARN  Module state not ready, skipping certificate install
acmeclient.watch  ERROR  Failed to install certificate from watch event
acmeclient.watch  INFO AUDIT  Certificate installed via PersistentWatch
acmeclient.watch  INFO AUDIT  Certificate removed via PersistentWatch

Status, List, Get (certificate queries):
acmeclient.status  DEBUG  Certificate status check - ACME client not initialized
acmeclient.list  DEBUG  Certificate list requested - ACME client not initialized
acmeclient.get  DEBUG  Certificate requested - ACME client not initialized
acmeclient.get  WARN  Failed to load certificate from storage
acmeclient.get  DEBUG  Certificate retrieved

State (issuance state persistence):
acmeclient.state  WARN  Failed to delete issuance state
acmeclient.state  WARN  Failed to save issuance state
acmeclient.state  INFO  Loaded issuance state from persistent storage

Cleanup (stale data removal):
acmeclient.cleanup  WARN  Failed to delete old issuance state
acmeclient.cleanup  WARN  Removed stale inProgress entry

Shutdown:
acmeclient.shutdown  WARN  Shutdown timed out waiting for watch goroutine
acmeclient.shutdown  INFO  ACME client shutdown complete

Metrics
Prometheus metrics. Query with: metrics prometheus acmeclient_<name>
Issuance counters (module: acmeclient):
acmeclient_issuance_started_total  counter  {domain}  Certificate issuance started
acmeclient_issuance_success_total  counter  {domain, key_type}  Certificate issuance succeeded
acmeclient_issuance_failed_total  counter  {domain, error_type}  Certificate issuance failed
    Labels: error_type="timeout"|"rate_limit"|"authorization"|"network"|"dns"|"invalid_request"|"not_found"|"unknown"|"none"

Issuance latency (module: acmeclient):
acmeclient_issuance_duration  histogram  {domain, key_type}  End-to-end issuance time

Renewal counters (module: acmeclient):
acmeclient_renewal_checks_total  counter  (no labels)  Renewal check cycles executed
acmeclient_renewal_success_total  counter  {domain}  Certificate renewals succeeded
acmeclient_renewal_failed_total  counter  {domain, error_type}  Certificate renewals failed

Challenge counters (module: acmeclient):
acmeclient_challenges_stored_total  counter  {domain}  Challenge tokens stored in distributed cache
acmeclient_challenges_served_total  counter  {status}  Challenge responses served
    Labels: status="success"|"not_found"|"invalid_token"|"lookup_error"|"internal_error"|"invalid_value"|"write_error"

Certificate gauges (module: acmeclient):
acmeclient_certificates_checked  gauge  (no labels)  Certificates checked in last renewal cycle
acmeclient_certificates_expiring  gauge  (no labels)  Certificates needing renewal in last cycle
acmeclient_certificates_loaded  gauge  (no labels)  Total certificates in memory cache
acmeclient_certificate_days_until_expiry  gauge  {domain}  Days until certificate expires

ARI counters (module: acmeclient):
acmeclient_ari_fetch_total  counter  {result}  ARI fetch attempts
    Labels: result="success"|"error"
acmeclient_ari_early_renewal_suggestions_total  counter  (no labels)  CA-suggested early renewals (possible revocation)
acmeclient_ari_renewals_total  counter  {domain}  ARI-guided certificate renewals completed
acmeclient_ari_marked_replaced_total  counter  (no labels)  ARI renewals marked as replaced
acmeclient_ari_cache_total  counter  {result}  ARI cache lookups
    Labels: result="hit"|"miss"

Rate limit counters (module: acmeclient):
acmeclient_ratelimit_checks_total  counter  (no labels)  Rate limit pre-flight checks executed
acmeclient_ratelimit_check_results_total  counter  {limit_type, result}  Rate limit check outcomes
    Labels: limit_type="retry_after"|"min_order_interval"|"orders_per_account"|"certs_per_domain"|"auth_failures_per_domain"|"certs_per_exact_set"|"all"
    Labels: result="blocked"|"passed"|"error"
acmeclient_ratelimit_orders_created_total  counter  (no labels)  ACME orders created (tracked for limits)
acmeclient_ratelimit_certs_issued_total  counter  {domain}  Certificates issued per domain (tracked for limits)
acmeclient_ratelimit_auth_failures_total  counter  {domain}  Authorization failures per domain
acmeclient_ratelimit_retry_after_total  counter  {status_code}  Retry-After responses from CA
    Labels: status_code="429"|"503"|"other"
acmeclient_ratelimit_state_errors_total  counter  {operation}  Rate limit state storage errors
    Labels: operation="set"|"get"
acmeclient_ratelimit_approaching_total  counter  {limit_type}  Rate limit approaching capacity warnings (>80%)

Rate limit gauges (module: acmeclient):
acmeclient_ratelimit_current_usage  gauge  {limit_type}  Current usage count per limit type
acmeclient_ratelimit_limit  gauge  {limit_type}  Effective limit value per limit type
acmeclient_ratelimit_usage_percent  gauge  {limit_type}  Usage percentage per limit type

Rate limit latency (module: acmeclient):
acmeclient_ratelimit_check_duration  histogram  (no labels)  Pre-flight rate limit check duration

Alerts:
issuance_failed_total increasing -> Check CA reachability, DNS, port 80 access
certificates_expiring > 0 persisting -> Renewal failing, check error_type labels
certificate_days_until_expiry < 7 -> Urgent: cert near expiry, check renewal logs
ratelimit_check_results_total{blocked} -> Client-side rate limit preventing issuance
ari_early_renewal_suggestions_total -> CA suggesting early renewal, possible revocation
challenges_served_total{status!=success} -> Challenge failures, check port 80 and DNS

AutoTLS Certificate Management
Zero-touch TLS — automatically issues and renews wildcard certificates from the built-in CA
Overview
Automatically issues and renews wildcard TLS certificates for all internal domains — no operator intervention required. When enabled, the gateway provisions certificates from its built-in ACME CA on startup and renews them before expiry. No external dependencies, no manual certificate management, no configuration per service.
Core capabilities:
- Fully automatic certificate issuance and renewal — zero operator intervention required
- Wildcard certificates covering all subdomains of the configured hostname
- Deterministic key derivation (HKDF from cluster_key) for cluster-wide SPKI consistency
- Configurable renewal cycles (default: 30 days) and validity (default: 60 days)
- Seamless key rotation on each renewal cycle (new HKDF-derived key per cycle)
- Background renewal loop with automatic retry on failure
- Hostname change detection with automatic re-issuance
IMPORTANT — Automatic renewal:
Certificate renewal is fully automatic. The renewal loop runs continuously in the background, sleeping until the next cycle boundary (deterministic mode) or until the renewal threshold (random mode). When the timer fires, a new certificate is automatically issued and installed. There is NO manual rotation step. Operators do NOT need to set calendar reminders or monitor expiry dates for routine certificate management. The system handles it. Only investigate if 'autotls status' shows unexpected errors or if certificates are not renewing (check logs for issuance failures).
Each node operates independently:
1. Derives deterministic ECDSA P-256 private key from cluster_key and current cycle
2. Signs wildcard certificate via internal CA
3. Stores certificate and sets as default for SNI fallback
4. Background loop sleeps until next cycle boundary, then repeats
No cluster coordination is needed: deterministic derivation from cluster_key means all nodes produce certificates with the same public key.
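The "sleep until the next cycle boundary" step can be sketched as below. This is a minimal illustration, assuming cycles are counted as whole renewal periods elapsed since a fixed epoch (the logs mention an ACME CA epoch and a cycle of 0 while the epoch is still in the future). The function names and the exact formula are assumptions for illustration, not the gateway's code.

```python
from datetime import datetime, timedelta, timezone


def current_cycle(now, epoch, renewal_days):
    """Whole renewal periods elapsed since the epoch (0 while the epoch is in the future)."""
    if now < epoch:
        return 0
    return (now - epoch) // timedelta(days=renewal_days)


def next_boundary(now, epoch, renewal_days):
    """Point in time at which the next cycle starts."""
    if now < epoch:
        return epoch
    cycle = current_cycle(now, epoch, renewal_days)
    return epoch + timedelta(days=renewal_days) * (cycle + 1)


# Hypothetical values: epoch on 2024-01-01, default 30-day renewal cycle.
epoch = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 2, 15, tzinfo=timezone.utc)
cycle = current_cycle(now, epoch, 30)
sleep_for = next_boundary(now, epoch, 30) - now
```

Because every node evaluates the same arithmetic against the same epoch, all nodes arrive at the same cycle number without exchanging state.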
Deterministic keys
Deterministic key derivation — what it means and why it is secure:
IMPORTANT — “deterministic” refers to KEY MATERIAL ONLY, not signatures. ECDSA signatures still use standard randomness for nonce generation. This is NOT “deterministic signing” in the RFC 6979 sense.
How it works:
Private keys are derived deterministically from cluster_key, the certificate SANs, and a cycle counter. The same inputs always produce the same key. Each renewal cycle increments the counter → new key material.
Why deterministic keys:
1. SPKI pinning: The public key is identical across all cluster nodes, enabling external clients to pin the certificate
2. Cluster consistency: No coordination needed; all nodes derive the same key
3. Reproducibility: Certificate can be re-derived after node restart without fetching state from other nodes
What remains random:
- ECDSA signature nonces use standard randomness; certificate bytes may differ between nodes, but the public key is identical
- The certificate is signed with full cryptographic randomness; only the subject key pair is deterministic
Security properties:
- Key material entropy comes from cluster_key (256-bit minimum)
- Cryptographic domain separation ensures different inputs produce independent keys
- Each cycle produces an independent key; compromise of one cycle's key does not reveal other cycles' keys
- Without cluster_key (single-node mode), keys are fully random
When the admin AI or operators see "deterministic" in AutoTLS context, it means deterministic KEY DERIVATION for cluster consistency, not reduced randomness. The cryptographic security is equivalent to random key generation.
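The derivation described in this section can be illustrated with a small HKDF sketch. It is only a sketch under stated assumptions: the salt, the layout of the info string, and the reduction of the output to a P-256 scalar are hypothetical choices made for the example, and the gateway's actual parameters are not documented here. HKDF itself follows RFC 5869 and is written with the standard library only.

```python
import hashlib
import hmac

# Order n of the NIST P-256 group; valid private scalars lie in [1, n-1].
P256_ORDER = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551


def hkdf_sha256(ikm, salt, info, length):
    """HKDF (RFC 5869) extract-then-expand using HMAC-SHA256."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_private_scalar(cluster_key, sans, cycle):
    """Map (cluster_key, SANs, cycle) to a P-256 private scalar.

    Assumption: the info string binds the sorted SANs and the cycle number,
    which gives domain separation between certificates and between cycles.
    """
    info = b"autotls-example|" + ",".join(sorted(sans)).encode() + b"|" + str(cycle).encode()
    # 48 bytes (more than the 32-byte group size) keeps the modular reduction bias negligible.
    okm = hkdf_sha256(cluster_key, b"autotls-example-salt", info, 48)
    return int.from_bytes(okm, "big") % (P256_ORDER - 1) + 1
```

Two nodes calling `derive_private_scalar` with the same cluster_key, SANs, and cycle obtain the same scalar and therefore the same public key, while a different cycle number yields an unrelated scalar.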
Config
AutoTLS configuration in hexon.toml under [service]:
    [service]
    hostname = "access.corp.internal"
    auto_tls = true
    # auto_tls_renewal = 30     # Renewal cycle in days (default: 30, range: 20-525)
    # auto_tls_validity = 60    # Certificate validity in days (default: 60, range: 30-790)
Certificate timing:
    Renewal cycle:    how often a new certificate is issued (default: 30 days)
    Validity period:  how long each certificate is valid (default: 60 days)
    Overlap:          validity - renewal = 30 days of dual-certificate coverage
    Constraint:       overlap must be 20%-80% of validity (auto-adjusted if not)
When auto_tls = true:
- Internal ACME CA is automatically enabled
- ACME CA endpoints available for trust anchoring (/acme/ca-bundle)
- Static TLS certificates (tls_cert/tls_key) take priority if defined
Certificate details:
    Key:     ECDSA P-256 (deterministically derived when cluster_key set, random otherwise)
    Serial:  Deterministic (derived from SANs and cycle for consistency)
    SANs:    *.{base_domain} + {hostname}
    CN:      HEXON-AUTOTLS-*.{base_domain}
Hot-reloadable: auto_tls_renewal, auto_tls_validity (via renewalLoop detection). Cold (restart required): auto_tls, hostname.
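The certificate timing rules from this Config section can be checked before deploying a change. The helper below is a hypothetical sketch: it applies the documented ranges and the 20%-80% overlap constraint, but it does not reproduce the gateway's auto-adjustment, whose exact behaviour is not described here.

```python
def check_timing(renewal_days=30, validity_days=60):
    """Report the overlap for a renewal/validity pair and whether it meets the constraint."""
    if not 20 <= renewal_days <= 525:
        raise ValueError("auto_tls_renewal must be within 20-525 days")
    if not 30 <= validity_days <= 790:
        raise ValueError("auto_tls_validity must be within 30-790 days")
    overlap = validity_days - renewal_days
    ratio = overlap / validity_days
    return {
        "overlap_days": overlap,
        "overlap_ratio": ratio,
        # Outside 20%-80% the gateway is documented to auto-adjust; this sketch only flags it.
        "within_constraint": 0.20 <= ratio <= 0.80,
    }


defaults = check_timing()          # 30-day overlap, 50% of validity
too_tight = check_timing(50, 60)   # 10-day overlap, about 17% of validity
```

With the defaults the overlap is 30 days, half of the validity period, so the constraint holds. A 50-day renewal with 60-day validity leaves only 10 days of overlap and falls below the 20% floor.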
Troubleshooting
Common symptoms and diagnostic steps:
Certificate not issued on startup:
- Check if internal ACME CA is healthy: 'health components'
- Verify cluster_key is set (required for deterministic mode)
- Check logs for certificate signing errors
- If startup fails, the renewal loop retries automatically
Certificate not renewing:
- Renewal is fully automatic; check if the renewal loop is running
- Check 'autotls status' for current certificate state and days left
- Check logs for "Renewing AutoTLS certificate" messages
Hostname changed but old certificate still served:
- renewalLoop detects hostname changes and re-issues automatically
- Old certificate expires naturally (validity period)
- Force immediate renewal: 'autotls renew' (admin command)
SPKI pin changed unexpectedly:
- Pin changes on each renewal cycle (new HKDF-derived key)
- Update pinned hashes after each renewal cycle
- Pin rotation window = certificate overlap period (default: 30 days)
- Use both current and next pin for seamless rotation
Trust not established:
- Clients must trust the internal CA root certificate
- CA bundle available at /acme/ca-bundle (HTTPS endpoint)
- Add to system trust store: update-ca-certificates (Linux) or Keychain (macOS)
Interpreting 'autotls status' tool output:
    Healthy:   Certificate valid, Days Left > renewal_threshold
    Renewing:  Background loop triggered renewal (automatic, no action needed)
    Failed:    Check logs for certificate signing errors
    Disabled:  auto_tls = false in config
Relationships
Module dependencies and interactions:
- acme (internal CA): AutoTLS uses the internal ACME CA for certificate signing. When auto_tls = true, ACME CA is automatically enabled. Certificate signing is local — no network round-trip, no ACME protocol overhead.
- certmanager: AutoTLS stores issued certificates in the certificate manager. The certificate manager handles TLS handshake certificate selection (exact match > wildcard > default).
- config: Reads hostname, auto_tls, auto_tls_renewal, auto_tls_validity from [service] section. Detects hostname changes for automatic re-issuance.
- server: Server’s TLS handshake retrieves certificates from certmanager. Priority: static (tls_cert) > AutoTLS > ACME client > error.
- acmeclient: Alternative to AutoTLS for external CA-signed certificates. Both can coexist; static certs take highest priority.
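The certificate manager's selection order (exact match > wildcard > default) can be sketched as a lookup over a domain-keyed store. The dictionary store and function name are assumptions for illustration. The sketch also assumes that a wildcard covers exactly one left-most label, which is the usual TLS convention but is not stated in this document.

```python
def select_certificate(sni, certs, default):
    """Pick a certificate for an SNI name: exact match, then wildcard, then default."""
    name = sni.lower().rstrip(".")
    if name in certs:
        return certs[name]
    # Replace the left-most label with "*" and try the wildcard entry.
    _, _, parent = name.partition(".")
    if parent and "*." + parent in certs:
        return certs["*." + parent]
    return default


# Hypothetical store: values stand in for parsed TLS certificates.
store = {"api.example.com": "exact-cert", "*.example.com": "wildcard-cert"}
```

Here `select_certificate("api.example.com", store, "default-cert")` resolves to the exact entry, `www.example.com` falls through to the wildcard, and an unrelated name such as `other.org` receives the default certificate.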
Logs
Log entries by component. Search with: logs search "autotls". Levels: ERROR > WARN > INFO > DEBUG.
Init & Lifecycle:
    autotls.init    ERROR  AutoTLS init panic recovered: <detail>
    autotls.init    INFO   Static TLS certificate configured, AutoTLS skipping
    autotls.init    INFO   AutoTLS enabled, issuing <type> certificate
    autotls.init    INFO   Certificate signing failed on startup, retrying
    autotls.init    ERROR  AutoTLS initialization failed, will retry in renewal loop
    autotls.init    INFO   AutoTLS initialized successfully
Certificate Issuance:
    autotls.issue   INFO   Issuing deterministic certificate
    autotls.issue   WARN   Failed to store wildcard certificate, hostname cert is still active
    autotls.issue   WARN   Failed to set default certificate, hostname cert is still active
    autotls.issue   INFO   AutoTLS certificate issued
Renewal:
    autotls.renew   INFO   Manual certificate renewal requested
    autotls.renew   WARN   Hostname changed, issuing certificate for new hostname
    autotls.renew   INFO   Renewing AutoTLS certificate
    autotls.renew   ERROR  AutoTLS certificate renewal failed
    autotls.renew   INFO   AutoTLS certificate renewed successfully
Epoch Parsing:
    autotls.epoch   WARN   invalid epoch "<value>", falling back to default <default>
    autotls.epoch   WARN   ACME CA epoch is in the future, certificate cycle will be 0 until epoch is reached
Metrics
Prometheus metrics emitted by this module:
    autotls_issuances_total   counter   {result=success|failure}   Initial certificate issuance on startup (after retry loop)
    autotls_renewals_total    counter   {result=success|failure}   Certificate renewal attempts (both automatic renewal loop and manual 'autotls renew')
Certificate state is also observable via the 'autotls status' hexdcall command and log entries.
SPIFFE Workload Identity
Issues workload identity certificates for service-to-service mTLS — services authenticate directly to each other
Overview
Issues SPIFFE workload identity certificates so services can authenticate directly to each other via mTLS. Traffic flows service-to-service without routing through the gateway — the gateway issues identities, it doesn’t need to be the data plane. Uses a modified ACME profile where pre-registered workloads receive certificates without domain validation challenges.
Key capabilities:
- Pre-registration: Workloads configured with public keys in TOML config
- No challenges: Authorization based on JWK thumbprint matching (RFC 7638)
- CIDR enforcement: Per-workload and global IP restrictions, re-validated at every operation
- Rate limiting: Per-workload sliding window (1 hour) with eventual consistency
- Short-lived certificates: Workload-specific TTL (default 24h, max configurable)
- SPIFFE URI SAN: spiffe://{hostname}/workload/{identity}
- AllowedPeers extension: Custom OID (1.3.6.1.4.1.64753.1.1) with JSON peer list
- CRL and OCSP integration: Certificates include CRL Distribution Point and OCSP responder URL
- Hot-reload: New/removed/modified workloads applied without restart
- Workload snapshots: Orders capture config at creation time for zero-downtime updates
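The JWK thumbprint matching listed above follows RFC 7638: keep only the required key members, serialize them as JSON with lexicographically sorted names and no whitespace, hash with SHA-256, and base64url-encode without padding. The sketch below starts from an already-parsed JWK; extracting the JWK from the PEM in account_public_key is omitted. The function name and the sample coordinates are illustrative assumptions.

```python
import base64
import hashlib
import json

# Required members per key type (RFC 7638, section 3.2).
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
}


def jwk_thumbprint(jwk):
    """Return the base64url SHA-256 thumbprint of a public JWK."""
    members = REQUIRED_MEMBERS[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        sort_keys=True,
        separators=(",", ":"),  # no whitespace between tokens
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Arbitrary sample values; optional members such as "kid" do not affect the result.
sample = {"kty": "EC", "crv": "P-256", "x": "AAAA", "y": "BBBB", "kid": "workload-1"}
thumbprint = jwk_thumbprint(sample)
```

A gateway-side check would then compare this value against the thumbprints computed from each configured workload key; that comparison is the assumed use here, not a documented internal API.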
ACME endpoints (default prefix /acme/spiffe):
    GET   /directory       ACME directory with endpoint URLs (public)
    GET   /bundle          CA trust bundle in PEM format (public)
    GET   /tos             Terms of Service (public)
    HEAD  /new-nonce       Get replay nonce for JWS requests
    POST  /new-account     Create or retrieve SPIFFE account (JWS required)
    POST  /new-order       Create certificate order, auto-approved (JWS required)
    POST  /order/{id}      Get order status via POST-as-GET (JWS required)
    POST  /finalize/{id}   Submit CSR to finalize order (JWS required)
    POST  /cert/{id}       Download issued certificate via POST-as-GET (JWS required)
    POST  /revoke-cert     Revoke a certificate (JWS required)
Certificate features:
- SPIFFE URI SAN: spiffe://{hostname}/workload/{identity}
- Extended Key Usage: Server Authentication + Client Authentication
- AllowedPeers X.509 extension with authorized peer SPIFFE IDs
- CRL Distribution Point and OCSP Responder URLs embedded
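The SPIFFE URI format above lends itself to a short helper for building and checking identities. This is a sketch only: how a receiving service evaluates the AllowedPeers extension is not specified in this document, so the `peer_allowed` check below is an assumed usage, and all function names are hypothetical.

```python
from urllib.parse import urlsplit

WORKLOAD_PREFIX = "/workload/"


def spiffe_id(hostname, identity):
    """Build the URI SAN in the documented form spiffe://{hostname}/workload/{identity}."""
    return f"spiffe://{hostname}{WORKLOAD_PREFIX}{identity}"


def parse_workload(uri, hostname):
    """Return the workload identity from a SPIFFE URI, or None if it does not match."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or parts.netloc != hostname:
        return None
    if not parts.path.startswith(WORKLOAD_PREFIX):
        return None
    identity = parts.path[len(WORKLOAD_PREFIX):]
    return identity if identity and "/" not in identity else None


def peer_allowed(peer_uri, hostname, allowed_peers):
    """Assumed check: the peer's identity must appear in the allowed_peers list."""
    identity = parse_workload(peer_uri, hostname)
    return identity is not None and identity in allowed_peers


uri = spiffe_id("access.corp.internal", "api-backend")
```

For the workload configured with `allowed_peers = ["frontend", "db"]`, a peer presenting `spiffe://access.corp.internal/workload/frontend` would pass this check, while a peer from another trust domain or with an unlisted identity would not.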
Config
Core configuration under [spiffe] and [[spiffe.workloads]]:
    [spiffe]
    enabled = true                         # Enable SPIFFE workload identity service
    path_prefix = "/acme/spiffe"           # HTTP endpoint prefix (default: /acme/spiffe)
    allowed_cidrs = ["10.0.0.0/8"]         # Global CIDR allowlist for all workloads
    default_ttl = "24h"                    # Default certificate TTL
    max_ttl = "168h"                       # Maximum certificate TTL (7 days)
    rate_limit_per_workload = 100          # Max certificates per workload per hour
    order_timeout = "1h"                   # Order expiration timeout
    allowed_key_algorithms = ["EC-P256", "EC-P384", "RSA-2048", "Ed25519"]   # Allowed CSR key algorithms (Ed25519 since 0.9.1)

    [[spiffe.workloads]]
    identity = "api-backend"               # Workload identity name (used in SPIFFE URI)
    account_public_key = "-----BEGIN..."   # PEM-encoded public key for JWK thumbprint matching
    sans = ["api.example.com"]             # Allowed DNS SANs for this workload
    allowed_peers = ["frontend", "db"]     # Peer SPIFFE IDs embedded in AllowedPeers extension
    allowed_cidrs = ["10.0.1.0/24"]        # Per-workload CIDR restriction (optional, narrows global)
    ttl = "4h"                             # Per-workload TTL override (optional, must be <= max_ttl)
JWK thumbprint computation (RFC 7638):
1. Parse DER-encoded public key from account_public_key PEM
2. Convert to canonical JWK format (required members only, lexicographically sorted, no whitespace)
3. SHA-256 hash of UTF-8 encoded JWK JSON
4. Base64url encode the hash (no padding)
The workload authenticates by signing JWS requests with the matching private key.
JWS verification requirements for authenticated endpoints:
- Algorithm: ES256 (ECDSA P-256) or RS256/RS384/RS512 (RSA 2048+)
- URL field must match request URL
- Nonce: single-use replay protection via Replay-Nonce header
- Signature verified against account public key
CSR validation rules:
- Maximum size: 64KB
- CSR must be self-signed (signature verified)
- SANs must match order identifiers exactly
- Key algorithm must be in allowed_key_algorithms
- RSA keys require minimum 2048 bits
Hot-reload behavior:
    New workloads:       immediately available for account creation and issuance
    Removed workloads:   in-flight orders (v2) complete via snapshot; new orders blocked
    Modified workloads:  TTL/CIDR/SAN/peer changes apply to new orders only
    Public key changes:  old thumbprint orphaned; new account required with new key
Cluster storage:
    Accounts:      cluster-wide storage with 90-day TTL, quorum required
    Orders:        cluster-wide storage with order_timeout + 10min buffer, quorum required
    Certificates:  cluster-wide storage with configurable TTL, quorum required
    Rate limits:   cluster-wide storage with 2-hour TTL, best-effort eventual consistency
Troubleshooting
Common symptoms and diagnostic steps:
Account creation fails with “unauthorized”:
- JWK thumbprint does not match any configured workload public key
- Verify thumbprint: 'step crypto jwk thumbprint workload-pub.pem'
- Check config: 'spiffe workloads' to list configured workloads
- CIDR mismatch: client IP not in global or per-workload allowed_cidrs
- Check CIDR: 'spiffe check <workload-id>' for workload details
Order creation fails:
- Account deactivated: cannot create new orders
- Rate limit exceeded: per-workload sliding window (1 hour) hit
- SAN validation: requested identifiers not in workload's sans list
- Check status: 'spiffe status' for overall SPIFFE health
Certificate finalization fails:
- CSR too large: maximum 64KB
- CSR signature invalid: CSR must be self-signed
- SAN mismatch: CSR SANs must match order identifiers exactly
- Key algorithm not allowed: check allowed_key_algorithms in config
- RSA key too small: minimum 2048 bits required
- Rate limit re-check: limit may have been reached between order and finalize
- Order expired: check order_timeout setting
Certificate retrieval returns error:
- CIDR re-validation: client IP checked again at retrieval time
- Order not yet valid: certificate issuance is asynchronous (semaphore-limited)
- Issuance queue full: check 'metrics prometheus spiffe_issuance_queue' for queue depth
CIDR enforcement issues:
- Global [spiffe].allowed_cidrs applies to ALL requests
- Per-workload allowed_cidrs narrows the global allowlist
- CIDR is re-validated at every operation (account, order, finalize, retrieve, revoke)
- Check specific IP: 'geo lookup <ip>' for network details
Rate limiting unexpectedly blocking:
- Eventual consistency: under high concurrent load from multiple cluster nodes, limits may be exceeded by up to the number of concurrent requests
- Sliding window is 1 hour; check 'metrics prometheus spiffe_ratelimit' for current usage
- Fail-open: if rate limit state is unavailable, requests are allowed
Nonce errors (badNonce):
- Nonces are single-use; retry with fresh nonce from Replay-Nonce response header
- All error responses include a new Replay-Nonce header for immediate retry
Certificate not trusted by peers:
- Trust bundle: GET /acme/spiffe/bundle returns PEM-encoded CA chain
- AllowedPeers: verify peer SPIFFE ID is in allowed_peers list
- OCSP/CRL: check 'certs ocsp' and 'certs crl' for responder status
Integration with cert-manager:
- Use SPIFFE ACME directory URL as ClusterIssuer server
- No challenge solvers needed (SPIFFE auto-approves)
- Account key secret must match configured workload public key
- cert-manager may require empty solvers list or dummy solver
Relationships
Module dependencies and interactions:
- acme: SPIFFE uses the internal ACME CA for certificate signing. The CA signing operation runs asynchronously after order finalization with a configurable concurrency semaphore (default: 50). CA signing latency tracked via spiffe_ca_signing_duration_ms metric.
- x509: Certificates include SPIFFE URI SAN, AllowedPeers custom extension, and standard Extended Key Usage (serverAuth + clientAuth). CRL Distribution Point and OCSP Responder URLs are embedded in issued certificates.
- sessions: No direct dependency. SPIFFE uses JWS-based authentication (not session cookies). Each request is independently authenticated via JWK thumbprint matching.
- directory: No direct dependency. Workload identity is managed through TOML configuration, not the directory module.
- storage: Accounts, orders, certificates, and rate limits stored cluster-wide via distributed storage with quorum requirements. Rate limits use eventual consistency.
- hotreload: Configuration changes detected via file watcher or SIGHUP. Workload snapshots (v2 orders) enable zero-downtime updates during rolling config changes.
- loadbalancer: No direct dependency. SPIFFE handles its own rate limiting via per-workload sliding window counters in cluster storage.
- geoaccess: No direct dependency. CIDR enforcement is built into the SPIFFE module itself using per-workload and global allowed_cidrs configuration.
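The built-in CIDR enforcement combines the global allowlist with the optional per-workload list, where the per-workload list narrows the global one. A minimal sketch of that rule, using the standard ipaddress module, is shown below. The function name is hypothetical, and the handling of an empty global list (denied here) is an assumption, since that case is not documented.

```python
import ipaddress


def ip_allowed(client_ip, global_cidrs, workload_cidrs=None):
    """Allow the client only if it passes the global list and, when set, the workload list."""
    ip = ipaddress.ip_address(client_ip)

    def in_any(cidrs):
        return any(ip in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)

    # The global allowlist applies to every request.
    if not in_any(global_cidrs):
        return False
    # A per-workload list can only narrow what the global list already permits.
    if workload_cidrs:
        return in_any(workload_cidrs)
    return True
```

With the example configuration, `10.0.1.5` passes both `10.0.0.0/8` and the workload's `10.0.1.0/24`, while `10.0.2.5` passes the global list but is rejected for that workload. Since the gateway re-validates at every operation, the same check would apply at account, order, finalize, retrieve, and revoke time.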
Logs
Log entries by component. Search with: logs search "spiffe". Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.
Route registration:
    spiffe.routes          INFO   Registering SPIFFE ACME routes
    spiffe.routes          INFO   SPIFFE ACME routes registered successfully
CIDR enforcement:
    spiffe.cidr.validate   WARN   Invalid CIDR in AllowedCIDRs, skipping
    spiffe.cidr.blocked    WARN   AUDIT SPIFFE request blocked by CIDR policy
Error responses:
    spiffe.handler.error   WARN   SPIFFE ACME error response
Metrics
Prometheus metrics. Query with: metrics prometheus spiffe_<name>
Requests:
    spiffe_requests_total        counter     {endpoint, status}   Requests per endpoint (status: success/error)
    spiffe_request_duration_ms   histogram   {endpoint}           Request latency in milliseconds
Errors:
    spiffe_errors_total         counter   {type, status}   ACME error responses by problem type and HTTP status
    spiffe_cidr_blocked_total   counter   (none)           Requests blocked by CIDR policy
Endpoint label values for requests_total and request_duration_ms:
    directory, new_nonce, new_account, new_order, get_order, finalize, get_certificate, trust_bundle, revoke_cert, tos
Alerts:
    rate(spiffe_errors_total[5m]) > 10                              High ACME error rate
    rate(spiffe_cidr_blocked_total[5m]) > 0                         CIDR policy blocking requests
    histogram_quantile(0.99, spiffe_request_duration_ms) > 5000     P99 latency exceeding 5 seconds