Certificates & PKI

Certificate Management

Manages all TLS certificates — internal CA, Let’s Encrypt, and static PEM in one place with auto-renewal

Overview

Manages the lifecycle of all TLS certificates the gateway uses — issuance, renewal, distribution, and SNI routing. Unifies three certificate sources into one model so operators don’t manage certificates per-service:

  1. Internal ACME CA: Issues certificates for internal services using a built-in RFC 8555-compliant Certificate Authority. Supports http-01, dns-01, and tls-alpn-01 challenges with OCSP and CRL distribution.

  2. External ACME Client: Obtains certificates from Let’s Encrypt or any external ACME-compliant CA. Handles automatic renewal, bootstrap fallback, and cluster-wide distribution.

  3. Static PEM: Certificates loaded directly from configuration files (tls_cert/tls_key). Highest-priority source, used when pre-provisioned certificates are available.

Regardless of source, all certificates flow through the certificate manager for unified storage, caching, and distribution. The TLS handshake retrieves certificates from the local in-memory cache with sub-microsecond latency, while cluster-wide consistency is maintained via broadcast operations.

Certificate selection during TLS handshake follows a three-tier priority:

exact domain match > wildcard match (*.example.com) > default certificate
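
The matching order can be sketched in Python as follows. This is only an illustration: select_certificate, the certs dictionary, and the string placeholders standing in for parsed certificates are hypothetical names, not the gateway's API. The sketch assumes a wildcard covers exactly one label and that names are compared as given.

```python
from typing import Optional


def select_certificate(sni: str, certs: dict, default: Optional[str]) -> Optional[str]:
    """Pick a certificate for an SNI name: exact > one-level wildcard > default."""
    # Tier 1: exact domain match.
    if sni in certs:
        return certs[sni]
    # Tier 2: wildcard match. Only the left-most label is replaced with "*",
    # so "*.example.com" covers "api.example.com" but not "sub.api.example.com".
    _, dot, parent = sni.partition(".")
    if dot and f"*.{parent}" in certs:
        return certs[f"*.{parent}"]
    # Tier 3: fall back to the default certificate (may be None).
    return default


certs = {"api.example.com": "cert-api", "*.example.com": "cert-wildcard"}
print(select_certificate("api.example.com", certs, "cert-default"))      # cert-api
print(select_certificate("www.example.com", certs, "cert-default"))      # cert-wildcard
print(select_certificate("sub.api.example.com", certs, "cert-default"))  # cert-default
```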

Storage layers:

1. Distributed cache: keyed by domain with TTL matching certificate expiry
2. Local cache: parsed TLS certificates held in memory so that TLS serving never leaves the node (reads are local-only; writes are broadcast cluster-wide)

Validation on storage:

- Maximum PEM sizes enforced (256KB cert, 32KB key)
- Certificate chain length limited to 10
- Domain length capped at 253 characters (RFC 1035)
- Date range and domain-SAN match validation

Config

Certificate management is an infrastructure module — it has no dedicated configuration section. Certificate sources are configured via other modules:

Static certificates:

[service]
tls_cert = "/path/to/cert.pem" # Static TLS certificate
tls_key = "/path/to/key.pem" # Static TLS private key

Per-proxy mapping certificates:

[[proxy.mapping]]
hostname = "api.example.com"
tls_cert = "/path/to/api-cert.pem" # Per-route certificate
tls_key = "/path/to/api-key.pem"

Internal ACME CA (automatic):

[acme]
enabled = true # Enables internal certificate issuance

External ACME Client (automatic):

[acme_client]
enabled = true # Obtains certs from Let's Encrypt or similar

AutoTLS (automatic wildcard):

[service]
auto_tls = true # Wildcard cert via internal ACME CA

Certificate selection priority during TLS handshake:

1. Static PEM (from tls_cert/tls_key or per-mapping config)
2. ACME-issued certificate (internal or external)
3. Default certificate (service-level or AutoTLS wildcard)

Domain matching priority:

exact match > wildcard match (*.example.com) > default certificate

Validation limits (enforced on all certificate storage):

- Maximum certificate PEM size: 256KB
- Maximum private key PEM size: 32KB
- Maximum chain length: 10 certificates
- Maximum domain name length: 253 characters (RFC 1035)
- Date range validation: NotBefore <= now <= NotAfter
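
As a rough illustration of these limits, the sketch below applies them to raw PEM bytes. It is not the gateway's validation code: check_limits and the constant names are hypothetical, "KB" is assumed to mean 1024 bytes, and chain length is approximated by counting PEM certificate blocks rather than parsing them.

```python
from datetime import datetime, timedelta, timezone

MAX_CERT_PEM_BYTES = 256 * 1024   # 256KB certificate PEM limit (assuming KB = 1024 bytes)
MAX_KEY_PEM_BYTES = 32 * 1024     # 32KB private key PEM limit
MAX_CHAIN_LENGTH = 10             # certificates per chain
MAX_DOMAIN_LENGTH = 253           # characters (RFC 1035)


def check_limits(domain, cert_pem, key_pem, not_before, not_after, now=None):
    """Return a list of violated limits; an empty list means all checks passed."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if len(cert_pem) > MAX_CERT_PEM_BYTES:
        problems.append("certificate PEM too large")
    if len(key_pem) > MAX_KEY_PEM_BYTES:
        problems.append("private key PEM too large")
    # Approximation: count PEM certificate blocks instead of parsing the chain.
    if cert_pem.count(b"-----BEGIN CERTIFICATE-----") > MAX_CHAIN_LENGTH:
        problems.append("chain too long")
    if len(domain) > MAX_DOMAIN_LENGTH:
        problems.append("domain too long")
    # Date range validation: NotBefore <= now <= NotAfter.
    if not (not_before <= now <= not_after):
        problems.append("outside validity window")
    return problems


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(check_limits("api.example.com", b"-----BEGIN CERTIFICATE-----\n", b"key",
                   now - timedelta(days=1), now + timedelta(days=89), now=now))  # []
```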

Troubleshooting

Common symptoms and diagnostic steps:

Certificate not found for domain:

- Check 'certs list' for all managed certificates
- Check 'certs show <domain>' for specific domain
- Verify domain matches exactly (case-sensitive) or has wildcard match
- Check certificate source: static (config), ACME (internal CA), or ACME client
- If ACME: check 'autotls status' and 'certs acme' for issuance status

TLS handshake using wrong certificate:

- Check priority: static > ACME > default
- Per-mapping certificates override service-level defaults
- 'diagnose domain <hostname>' shows which certificate is being served
- Wildcard certificates only match one level (*.example.com matches api.example.com but NOT sub.api.example.com)

Certificate expired or expiring soon:

- Check 'certs list' for expiration dates
- ACME certificates renew automatically before expiry
- Static certificates must be replaced manually and config reloaded
- If auto-renewal failed: check ACME CA health ('health components')
- Check 'logs search certmanager --level=warn' for expiration warnings

Certificate not propagating across cluster:

- Writes are broadcast to all nodes; check cluster health
- Check 'cluster status' for quorum and node connectivity
- Each node maintains a local cache — propagation is near-instant
- If a node missed the broadcast: restart triggers fresh certificate load

Invalid certificate PEM:

- Check PEM encoding: must be valid base64 with proper headers
- Chain must include intermediates (server cert + intermediate CAs)
- Key must match the certificate's public key
- File path must be readable by the gateway process

Metrics for monitoring:

- certmanager_certificates_total: total certificates in cache
- certmanager_set_total: certificate store operations by source
- certmanager_get_total: cache lookups (hit=true/false)
- certmanager_expired_total: certificates that expired from cache

Relationships

Child modules:

- certificates.acme: Internal ACME CA server for issuing certificates to internal services. Acts as the certificate authority that clients can request certificates from.
- certificates.acmeclient: ACME protocol client that obtains certificates from Let's Encrypt or any ACME-compliant CA (including the internal CA). Handles renewal scheduling, bootstrap fallback, and cluster distribution.

Key dependents:

- TLS server: Retrieves certificates for TLS handshake callbacks.
- Proxy: Per-mapping TLS certificates for SNI-based routing. Invalid or missing certificates prevent routes from mounting.
- AutoTLS: Uses the internal ACME CA to issue wildcard certificates, then stores them through the certificate manager for cluster-wide availability.

Infrastructure dependencies:

- Distributed storage: Certificate cache with TTL-based expiration.
- Cluster broadcast: Operations for cluster-wide certificate updates.
- Configuration: Certificate source selection and TLS parameters.

Logs

Log entries by component. Search with 'logs search certmanager'. Levels: ERROR > WARN > INFO > DEBUG.

SetCertificate:

certmanager.set ERROR Failed to parse certificate
certmanager.set ERROR Certificate does not match domain
certmanager.set ERROR Rejecting expired certificate
certmanager.set ERROR Rejecting not-yet-valid certificate
certmanager.set ERROR Failed to store certificate in memorystorage
certmanager.set INFO Certificate stored successfully

SetDefaultCertificate:

certmanager.setdefault ERROR Failed to parse default certificate
certmanager.setdefault INFO Default certificate set successfully

DeleteCertificate:

certmanager.delete INFO Certificate deleted

OnCertificateExpired:

certmanager.expired ERROR Panic in OnCertificateExpired callback
certmanager.expired WARN Certificate expired from cache - renewal may have failed

ClearCache:

certmanager.clearcache INFO Certificate cache cleared

Shutdown:

certmanager.shutdown INFO Certificate manager shutdown complete

Metrics

Prometheus metrics. Query with: metrics prometheus certmanager_<name>

Certificate Operations (namespace: certmanager):

Metric                          Type     Labels    Description
certmanager_set_total           counter  {source}  Certificate store operations (both domain and default)
    source=static|acme|acmeclient                  Certificate source type
certmanager_get_total           counter  {hit}     Cache lookups for TLS certificate retrieval
    hit=true                                       Certificate found (exact or wildcard match)
    hit=false                                      Certificate not found in cache
certmanager_expired_total       counter  {}        Certificates expired from cache (renewal may have failed)
certmanager_certificates_total  gauge    {}        Total certificates currently held in local cache
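
One way to use these counters is a cache hit ratio derived from certmanager_get_total. The sketch below parses Prometheus text exposition and is only an illustration: cache_hit_ratio is a hypothetical helper, and the sample assumes the metric carries exactly the hit label shown above.

```python
def cache_hit_ratio(exposition: str) -> float:
    """Compute hit / (hit + miss) from certmanager_get_total samples."""
    counts = {"true": 0.0, "false": 0.0}
    for line in exposition.splitlines():
        line = line.strip()
        if not line.startswith("certmanager_get_total{"):
            continue  # skip comments and other metrics
        # The sample value is the last whitespace-separated field.
        labels, _, value = line.rpartition(" ")
        for hit in counts:
            if f'hit="{hit}"' in labels:
                counts[hit] += float(value)
    total = counts["true"] + counts["false"]
    return counts["true"] / total if total else 0.0


sample = """
# TYPE certmanager_get_total counter
certmanager_get_total{hit="true"} 950
certmanager_get_total{hit="false"} 50
"""
print(cache_hit_ratio(sample))  # 0.95
```

A falling ratio alongside a rising certmanager_expired_total would point at failed renewals rather than unknown hostnames.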

ACME CA Server

Built-in certificate authority — issues TLS certificates for internal services via standard ACME protocol

Overview

Issues TLS certificates for internal services using a built-in ACME certificate authority. Replaces external CA infrastructure for internal PKI — compatible with certbot, cert-manager, Caddy, and Traefik. Supports http-01, dns-01, and tls-alpn-01 challenges with OCSP responder and CRL distribution.

Core capabilities:

  • Full RFC 8555 ACME protocol compliance
  • Stateless accounts derived from JWK thumbprint (no account database)
  • All three challenge types: http-01, dns-01, and tls-alpn-01
  • IP address certificates via RFC 8738
  • CAA checking via RFC 8659 with domain hierarchy walk-up
  • OCSP responder (RFC 6960) with caching for real-time certificate status
  • CRL distribution (RFC 5280) rebuilt on each revocation
  • Deterministic DNS challenges for internal domains without DNS API
  • UUID v4 certificate serial numbers (random, so collisions are practically impossible)
  • Cluster-ready with distributed storage across all nodes
  • Comprehensive multi-dimensional rate limiting (7 dimensions)
  • Saga pattern for atomic distributed updates (TOCTOU prevention)
  • Optimistic concurrency control for rate limit counters
  • Threshold CA (acme_ca_threshold=true): the CA private key exists only as distributed threshold shares — no single node holds the full key. Protocol selected by ca_algorithm: ES256 → GG18 ECDSA, EdDSA → FROST Ed25519. Fail-closed — no certs until DKG completes. Both algorithms support resharing: cluster scaling preserves the CA cert across membership changes (see threshold_resharing for shrinkage constraints).
  • CA signature algorithm selectable via [operations].ca_algorithm — “ES256” (default, ECDSA P-256) or “EdDSA” (Ed25519, RFC 8032). End-to-end Ed25519 supported: deterministic CA generation, CSR validation, JWS/JWK with OKP keys (RFC 8037), thumbprints (RFC 7638 + RFC 8037 §2), OCSP responses (RFC 8419), ACME and SPIFFE leaf certs, threshold mode with FROST resharing. Algorithm is immutable after first bootstrap; mismatched config on a subsequent startup is rejected with a clear migration error.
  • CA certificate rotation is automatic — managed by AutoTLS renewalLoop or ACME client renewal scheduler. Operators do NOT need to set calendar reminders for CA cert expiry. Only investigate if ‘health components’ shows CA warnings or ‘certs list’ shows expiring certs.

HTTP endpoints under /acme (configurable prefix):

GET /acme/directory -> ACME directory with all endpoint URLs
HEAD /acme/new-nonce -> Anti-replay nonce
POST /acme/new-account -> Create/lookup account
POST /acme/new-order -> Create certificate order
POST /acme/order/{id} -> Order status
POST /acme/authz/{id} -> Authorization status
POST /acme/challenge/{id} -> Challenge response
POST /acme/finalize/{id} -> Finalize order with CSR
POST /acme/cert/{id} -> Download certificate (PEM)
POST /acme/revoke-cert -> Revoke certificate
GET /acme/ca-certs -> CA certificate bundle (PEM)
GET /acme/crl -> CRL (DER-encoded)
GET /acme/ocsp/{req} -> OCSP check (GET)
POST /acme/ocsp -> OCSP check (POST)
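
For orientation, the sketch below shows how the documented paths would map onto the standard RFC 8555 directory field names. It is an assumption about what a client could expect from GET /acme/directory, not a dump of the real response, which may contain additional fields; directory_urls and the host name are hypothetical.

```python
def directory_urls(base_url: str, prefix: str = "/acme") -> dict:
    """Map RFC 8555 directory fields to the endpoint paths listed above."""
    root = base_url.rstrip("/") + "/" + prefix.strip("/")
    return {
        "newNonce": f"{root}/new-nonce",
        "newAccount": f"{root}/new-account",
        "newOrder": f"{root}/new-order",
        "revokeCert": f"{root}/revoke-cert",
    }


# Hypothetical gateway host, default path prefix.
for field, url in directory_urls("https://gateway.example.com").items():
    print(f"{field:11s} {url}")
```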

Storage model:

  • Volatile: in-memory cache for fast access (orders, authorizations, challenges, nonces, OCSP)
  • Persistent: NATS JetStream KV for durability (certificates, CRL, serial index)
  • All persistent data encrypted at rest (key derived from cluster_key)
  • Startup: certificates loaded from persistent storage into memory cache

Cluster behavior:

  • Write operations (create order, finalize, revoke): Replicated with quorum
  • Read operations (get directory, get order, get cert): Local only
  • Validation operations (validate challenge, check CAA): Local with external calls
  • Nonces: created across all nodes; validated and consumed locally

Config

ACME CA configuration in hexon.toml under [acme]:

[acme]
enabled = true # Enable ACME CA server
path_prefix = "/acme" # URL prefix for ACME endpoints (default: /acme)
external_url = "" # External URL (derived from hostname if not set)
# Access control
allowed_cidrs = ["10.0.0.0/8"] # Restrict ACME API to specific networks (optional)
allowed_identifiers = ["*.internal.example.com"] # Domain patterns (optional, wildcards)
# Challenge configuration
challenges_enabled = ["http-01"] # Enabled challenge types (default: http-01 only)
challenge_validity = "15m" # Challenge validity period (default: 15m)
nonce_validity = "15m" # Nonce validity period (default: 15m)
# Certificate parameters
max_validity = "2160h" # Maximum certificate validity (default: 90 days)
default_validity = "2160h" # Default certificate validity (default: 90 days)
max_san_count = 100 # Maximum SANs per certificate (default: 100)
enable_ip_identifiers = true # Enable IP address identifiers (RFC 8738)
# CAA checking (RFC 8659)
caa_checking = false # Enable CAA record checking (default: false)
caa_identifiers = ["acme.example.com"] # CAA identifiers for this CA
# OCSP Responder (RFC 6960)
ocsp_enabled = true # Enable OCSP responder (default: true)
ocsp_cache_ttl = "5m" # OCSP response cache TTL (default: 5m)
ocsp_cidrs = ["0.0.0.0/0"] # Allowed CIDRs for OCSP (default: all)
# CRL Distribution (RFC 5280)
crl_enabled = true # Enable CRL endpoint (default: true)
crl_cidrs = ["0.0.0.0/0"] # Allowed CIDRs for CRL (default: all)
crl_next_update = "48h" # CRL NextUpdate offset (default: 48h)
# Deterministic DNS for internal domains
dns_deterministic = false # Enable deterministic DNS challenges
dns_deterministic_cidrs = ["10.0.0.0/8"] # Allowed CIDRs for deterministic DNS
# Legacy rate limits (simple)
rate_limit_orders_per_ip = 50 # Orders per IP per hour
rate_limit_certs_per_domain = 50 # Certs per domain per week
# Comprehensive rate limits
[acme.rate_limits]
enabled = true
orders_per_account = 5000 # Max orders per account per window
orders_per_account_window = "3h"
certs_per_domain = 500 # Max certs per eTLD+1 domain per window
certs_per_domain_window = "168h" # 1 week
certs_per_exact_set = 50 # Max certs per exact domain set
certs_per_exact_set_window = "168h"
auth_failures_per_domain = 50 # Max auth failures per domain per window
auth_failures_window = "1h"
orders_per_ip = 1000 # Max orders per IP per window
orders_per_ip_window = "1h"
failed_finalizations_per_order = 10
min_order_interval = "100ms" # Minimum time between orders per account
buffer_percent = 10 # Warning threshold at 90% of limit
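
The relation between buffer_percent and the warning threshold can be sketched as simple arithmetic. This assumes the buffer is subtracted from 100% of the limit, which matches the "10 means warn at 90%" comment above; warning_threshold is a hypothetical helper, not a gateway function.

```python
def warning_threshold(limit: int, buffer_percent: int) -> int:
    """Usage count at which a warning would fire (assumed: limit minus buffer)."""
    return limit * (100 - buffer_percent) // 100


print(warning_threshold(5000, 10))  # orders_per_account: 4500
print(warning_threshold(500, 10))   # certs_per_domain: 450
```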

Safe defaults with just enabled = true:

  • http-01 challenge enabled, CAA checking disabled
  • 90-day certificate validity, 15-minute challenge/nonce validity
  • 100 SANs maximum, IP identifiers enabled
  • OCSP responder enabled (5-minute cache)
  • CRL distribution enabled (48-hour NextUpdate)
  • No CIDR or domain restrictions

Hot-reloadable: rate limits, allowed_cidrs, allowed_identifiers, OCSP/CRL settings.
Cold (restart required): enabled, path_prefix, challenges_enabled.

Troubleshooting

Common symptoms and diagnostic steps:

“badNonce” errors from ACME clients:

- Nonce expired: increase nonce_validity (default 15m)
- Nonce already consumed: client must retry with fresh nonce from Replay-Nonce header
- Clock skew between cluster nodes: verify NTP synchronization
- Single-use enforcement: each nonce valid for exactly one request

“connection” errors during http-01 challenge:

- Firewall blocking port 80 from ACME server to client
- Client not serving challenge response at /.well-known/acme-challenge/{token}
- Wrong content at challenge URL (must be {token}.{thumbprint})
- Validation timeout: client has 30 seconds total (2s initial delay, 5 retries)
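
When debugging an http-01 failure it can help to compute what the CA will fetch and what it expects to read back. The sketch below does that under the rules listed above; http01_expectation and the sample values are hypothetical, and the thumbprint is a placeholder rather than a real account key thumbprint.

```python
def http01_expectation(domain: str, token: str, thumbprint: str) -> tuple:
    """Return (url, expected_body) for an http-01 challenge."""
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    body = f"{token}.{thumbprint}"  # key authorization: {token}.{thumbprint}
    return url, body


url, body = http01_expectation("app.internal.example.com", "tok123", "THUMBPRINT")
print(url)   # http://app.internal.example.com/.well-known/acme-challenge/tok123
print(body)  # tok123.THUMBPRINT
```

Comparing the body against the output of a plain HTTP fetch of the URL from the gateway's network separates firewall problems from wrong-content problems.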

“dns” errors during dns-01 challenge:

- TXT record _acme-challenge.{domain} not propagated yet
- Wrong record value (must be base64url(SHA256(keyAuthorization)))
- DNS TTL too high, stale cached record
- DNS module must be enabled and healthy
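
The expected TXT record can be computed offline and compared with what DNS actually returns. The sketch below follows the formula above, using unpadded base64url as RFC 8555 specifies; dns01_record and the sample key authorization are hypothetical.

```python
import base64
import hashlib


def dns01_record(domain: str, key_authorization: str) -> tuple:
    """Return (record_name, txt_value) for a dns-01 challenge."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without "=" padding.
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{domain}", value


name, value = dns01_record("app.internal.example.com", "tok123.THUMBPRINT")
print(name)   # _acme-challenge.app.internal.example.com
print(value)  # 43-character base64url string
```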

“caa” errors during certificate issuance:

- Domain has CAA records that do not include this CA's identifier
- SERVFAIL on CAA lookup denies issuance per RFC 8659 (mandatory)
- Add CA identifier to domain CAA records or disable caa_checking
- CAA queries always bypass DNS cache for fresh data

“unauthorized” errors:

- Account key mismatch between request JWK and order account
- Order belongs to a different account thumbprint
- Certificate revocation attempted by non-owner without certificate key

“rejectedIdentifier” errors:

- Domain not matching allowed_identifiers patterns
- IP identifier requested but enable_ip_identifiers = false

“rateLimited” errors:

- Check which rate limit dimension was hit (logged at WARN level)
- Rate limits fail open: if a limit cannot be checked (for example, a rate limit state access error), the request is allowed rather than blocked
- IPv6 addresses normalized to /64 prefix for rate limiting
- When both legacy and comprehensive rate limits configured, both enforced
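
The /64 normalization means all addresses in one IPv6 subnet share a rate limit bucket. A minimal sketch of that idea with Python's ipaddress module follows; rate_limit_key is a hypothetical name and the gateway's actual key format is not documented here.

```python
import ipaddress


def rate_limit_key(client_ip: str) -> str:
    """Return the bucket key for an address: IPv6 collapses to its /64 network."""
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 6:
        # strict=False lets host bits be set; they are masked off.
        return str(ipaddress.ip_network(f"{addr}/64", strict=False))
    return str(addr)


print(rate_limit_key("2001:db8:1:2:aaaa:bbbb:cccc:dddd"))  # 2001:db8:1:2::/64
print(rate_limit_key("2001:db8:1:2::1"))                   # 2001:db8:1:2::/64
print(rate_limit_key("203.0.113.7"))                       # 203.0.113.7
```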

OCSP responder returning “unknown” status:

- Serial number not recognized by this CA
- Certificate not loaded into memory on startup (check startup logs)
- Persistent storage lookup failed (check NATS JetStream health)

CRL endpoint returning empty or stale CRL:

- CRL is rebuilt on certificate revocation, not on a fixed schedule; the periodic health check (acme.crl.periodic) rebuilds it only when it is expired or missing
- Check crl_enabled = true
- Verify persistent storage (NATS JetStream KV) is healthy
- Lazy load: first access after restart may be slower

Certificate storage issues:

- Memory storage: check distributed memory cache module health
- Persistent storage: check NATS JetStream KV connectivity
- Key naming: NATS KV does not allow ":" in keys, uses "/" separator
- Encryption: all persistent data encrypted with AES-256-GCM
- Startup loading: expired certificates are skipped during reload

Challenge validation timing out:

- Initial delay allows client time to set up challenge response
- Multiple retry attempts with backoff before giving up
- Total validation timeout is 30 seconds
- tls-alpn-01: verify client serves ALPN protocol "acme-tls/1" on port 443

Verify CA threshold signing works:

Run 'hexdcall threshold test' to trigger a test signing ceremony.
Shows per-node participation, latency, and signature verification.
Use '--trace' for phase-level timing and per-node message counts.

Security

Security model and hardening:

Account security:

Stateless accounts derived from JWK thumbprint. No account credentials stored
server-side. Account key compromise allows certificate issuance for any
allowed domain. Consider key rotation procedures for high-security deployments.
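
A JWK thumbprint is computed as defined in RFC 7638: keep only the required members for the key type, serialize them as JSON with sorted keys and no whitespace, hash with SHA-256, and base64url-encode without padding. The sketch below shows that computation; how the gateway turns the thumbprint into an account identifier is not documented here, and the key value is a placeholder.

```python
import base64
import hashlib
import json

# Required members per key type (RFC 7638 section 3.2, RFC 8037 section 2).
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "OKP": ("crv", "kty", "x"),
    "RSA": ("e", "kty", "n"),
}


def jwk_thumbprint(jwk: dict) -> str:
    """Return the base64url SHA-256 thumbprint of a public JWK."""
    members = REQUIRED_MEMBERS[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        sort_keys=True,           # lexicographic member order
        separators=(",", ":"),    # no whitespace
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Placeholder key material; optional members such as "kid" do not affect the result.
key = {"kty": "OKP", "crv": "Ed25519", "x": "PLACEHOLDER_PUBLIC_KEY", "kid": "ignored"}
print(jwk_thumbprint(key))
```

Because the thumbprint depends only on the public key, the same key always maps to the same account, which is what makes server-side account storage unnecessary.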

Challenge validation:

http-01: validates web server access on port 80 (follows up to 10 redirects)
dns-01: validates DNS control via TXT record at _acme-challenge.{domain}
tls-alpn-01: validates TLS server access via acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31)
All challenges have short validity (default 15 minutes) and are single-use.

Certificate security:

UUID v4 serial numbers prevent collision attacks. CAA checking (when enabled)
prevents unauthorized issuance per RFC 8659. CIDR restrictions limit API
access. Rate limiting prevents abuse across 7 dimensions.
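
As an illustration of the serial number scheme, a UUID v4 can be read as a 128-bit positive integer, which fits inside the 20-octet limit RFC 5280 places on serial numbers. How the CA actually encodes the UUID is an assumption here, and new_serial_number is a hypothetical helper.

```python
import uuid


def new_serial_number() -> int:
    """Random UUID v4 interpreted as a positive integer serial number."""
    return uuid.uuid4().int


serial = new_serial_number()
# 128 bits = 16 octets, below the 20-octet maximum of RFC 5280.
print(f"{serial:032x}")
print(serial.bit_length() <= 128)  # True
```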

Nonce security:

Cryptographically random, single-use, short validity (15 minutes default).
Consumed immediately on use. Prevents replay attacks per RFC 8555.

Deterministic DNS security boundary:

Token derived from cluster_key and domain name. Domain must resolve to IP
within dns_deterministic_cidrs. Only for internal domains where DNS API is
unavailable.
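
The two halves of this boundary can be sketched as follows. The token derivation shown (HMAC-SHA256 over the lowercased domain, keyed by cluster_key) is purely an assumption for illustration, since the actual construction is not documented here; the CIDR check mirrors the dns_deterministic_cidrs rule. Both function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def deterministic_token(cluster_key: bytes, domain: str) -> str:
    # ASSUMPTION: HMAC-SHA256 keyed by cluster_key; the real derivation may differ.
    return hmac.new(cluster_key, domain.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def ip_allowed(resolved_ip: str, allowed_cidrs: list) -> bool:
    """True if the resolved address falls inside any allowed CIDR."""
    addr = ipaddress.ip_address(resolved_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


cidrs = ["10.0.0.0/8"]  # value of dns_deterministic_cidrs in the example config
print(ip_allowed("10.20.30.40", cidrs))  # True
print(ip_allowed("192.0.2.10", cidrs))   # False
print(deterministic_token(b"example-cluster-key", "db.internal.example.com"))
```

The point of keying the token is that only nodes holding cluster_key can predict it, while the CIDR check keeps the mechanism confined to internal addresses.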

Persistent storage encryption:

All certificate data encrypted at rest with keys derived from cluster_key.
Defense-in-depth on top of transport encryption. Private keys never stored
in plaintext.

Distributed consistency:

Write operations use transactional patterns to prevent race conditions.
Write operations require quorum consensus. Rate limits fail-open for availability.

Threshold CA key protection:

When acme_ca_threshold=true, the ACME CA private key never exists in full on any node.
Generated via distributed key generation, exists only as shares. Signing requires
a quorum of nodes. Shares encrypted at rest with keys derived from cluster_key.
After initial key generation, membership changes use resharing (no re-generation).
Fail-closed: no certificates issued until key generation completes.

OCSP/CRL security:

OCSP responses signed with CA key, cached for performance (configurable TTL).
Cache invalidated immediately on certificate revocation.
CRL signed with CA key, rebuilt on each revocation (not periodic).
Both endpoints support CIDR-based access control.

Relationships

Module dependencies and interactions:

  • certmanager: Issued certificates can be stored via certmanager for TLS serving. ACME CA is the issuer; certmanager is the consumer for cluster-wide distribution.
  • autotls: AutoTLS uses the internal ACME CA for wildcard certificate issuance. When auto_tls = true, ACME is automatically enabled.
  • acmeclient: The ACME client module can point at this internal CA as its ACME server, creating a fully internal PKI without external dependencies.
  • dns: Used for dns-01 challenge validation (TXT record queries), CAA record checking (typed DNS lookup with “CAA” query type), and deterministic DNS token validation. SERVFAIL on CAA lookup must deny issuance per RFC 8659.
  • Distributed cache: Primary storage for orders, authorizations, challenges, nonces, OCSP cache, and CRL. Distributed across cluster nodes.
  • Persistent storage: Durable storage for certificates, serial number index, and CRL. Uses NATS JetStream KV with encryption at rest.
  • config: Hot-reload of rate limits, allowed CIDRs, OCSP/CRL settings.
  • telemetry: Structured logging and Prometheus metrics for orders, challenges, certificates, OCSP, CRL, and rate limit events.
  • Rate limiting: ACME implements its own rate limiting layer (7 dimensions) independent of the global rate limiter. Both legacy and comprehensive limits can be enforced simultaneously (defense in depth).

Logs

Log entries by component. Search with 'logs search acme'. Levels: ERROR > WARN > INFO > DEBUG.

Init & Lifecycle:

acme.init INFO ACME CA server disabled in config
acme.init WARN JetStream temporarily unavailable, retrying certificate load
acme.init ERROR Failed to load certificates after retries
acme.init INFO ACME CA server initialized
acme.init DEBUG Restored CRL number from persistent storage
acme.init INFO CRL signing failed on startup, retrying
acme.init ERROR AUDIT Failed to regenerate CRL on startup — revoked certificates may not be enforced
acme.init INFO CRL regenerated on startup
acme.init INFO Skipping CRL rebuild on startup (not leader)

Periodic CRL Health Check:

acme.crl.periodic WARN AUDIT CRL expired or missing — rebuilding
acme.crl.periodic INFO AUDIT Periodic CRL rebuild succeeded
acme.crl.periodic ERROR AUDIT Periodic CRL rebuild failed — revoked certificates may not be enforced

Certificate Load from Persistent Storage:

acme.init.load INFO Persistent storage not enabled, skipping certificate load
acme.init.load ERROR Failed to load certificates from persistent storage
acme.init.load DEBUG Skipping expired certificate
acme.init.load WARN Failed to store certificate in memory cache
acme.init.load DEBUG Loaded certificate from persistent storage
acme.init.load WARN Failed to load certificate from persistent storage

Certificate Issuance:

acme.certificate.issue WARN AUDIT CAA re-check failed at issuance time
acme.certificate.issue WARN Failed to get CA chain
acme.certificate.issue WARN Serial index replication incomplete - revocation may need retry
acme.certificate.issue WARN Failed to save certificate to persistent storage
acme.certificate.issue INFO AUDIT Certificate issued
acme.certificate.issue WARN Failed to record certificate issuance for rate limiting

Certificate Revocation:

acme.certificate.revoke WARN Failed to update revocation in persistent storage
acme.certificate.revoke INFO AUDIT Certificate revoked

CAA Checking:

acme.caa.check DEBUG Checking CAA records
acme.caa.check WARN CAA lookup returned SERVFAIL
acme.caa.check DEBUG CAA lookup returned no records
acme.caa.check WARN CAA records do not authorize this CA
acme.caa.check DEBUG CAA check passed
acme.caa.lookup DEBUG CAA records found
acme.caa.iodef DEBUG CAA iodef record found

Challenge Response:

acme.challenge.respond ERROR Failed to atomically update challenge status
acme.challenge.respond ERROR Failed to update authorization
acme.challenge.respond INFO AUDIT Challenge response received

Challenge Validation:

acme.challenge.validate WARN Async validation cancelled during initial delay
acme.challenge.validate ERROR Failed to reload challenge for async validation
acme.challenge.validate INFO Challenge no longer in processing state, skipping validation
acme.challenge.validate ERROR Failed to reload challenge after validation
acme.challenge.validate WARN Failed to record auth failure for rate limiting
acme.challenge.validate ERROR Failed to store challenge after validation
acme.challenge.validate DEBUG Starting challenge validation
acme.challenge.validate INFO Challenge validation completed

Authorization:

acme.authorization.update INFO AUDIT Authorization status updated

Deterministic DNS Token:

acme.challenge.deterministic ERROR Cluster key not configured for deterministic DNS
acme.challenge.deterministic DEBUG Generated deterministic token

CRL:

acme.crl.get DEBUG CRL served from memory cache
acme.crl.get INFO No CRL found, generating initial CRL
acme.crl.get ERROR Failed to load CRL after rebuild
acme.crl.get ERROR Failed to load CRL from persistent storage
acme.crl.get DEBUG CRL loaded from persistent storage and cached
acme.crl.rebuild INFO Rebuilding CRL
acme.crl.rebuild ERROR Failed to collect revoked certificates
acme.crl.rebuild ERROR Failed to request CRL signing
acme.crl.rebuild ERROR Failed to sign CRL
acme.crl.rebuild ERROR Unexpected response type from CA module
acme.crl.rebuild ERROR CA module failed to sign CRL
acme.crl.rebuild WARN Failed to persist CRL to storage
acme.crl.rebuild INFO CRL rebuilt successfully
acme.crl.rebuild ERROR Background CRL rebuild failed
acme.crl.collect WARN Failed to collect ACME revocations, continuing with X.509
acme.crl.collect WARN Failed to collect X.509 revocations
acme.crl.collect DEBUG Collected revoked certificates
acme.crl.collect WARN Failed to parse certificate serial number, skipping
acme.crl.collect WARN Invalid serial number (zero or negative), skipping
acme.crl.collect.x509 WARN Failed to parse X.509 certificate serial number, skipping
acme.crl.collect.x509 WARN Invalid X.509 serial number (zero or negative), skipping

Nonce:

acme.nonce.create ERROR Failed to generate random nonce
acme.nonce.create ERROR Failed to store nonce
acme.nonce.create ERROR Failed to achieve nonce storage quorum
acme.nonce.create DEBUG Created new nonce
acme.nonce.validate ERROR Failed to get nonce from cache
acme.nonce.validate ERROR Failed to wait for nonce lookup
acme.nonce.validate ERROR Unexpected cache response type
acme.nonce.validate WARN Nonce not found
acme.nonce.validate ERROR Invalid nonce data type in cache
acme.nonce.validate WARN Nonce expired
acme.nonce.validate ERROR Failed to atomically consume nonce
acme.nonce.validate DEBUG Nonce validated and consumed atomically

OCSP:

acme.ocsp.handle WARN Invalid OCSP request
acme.ocsp.handle WARN Invalid serial number in OCSP request
acme.ocsp.handle DEBUG Processing OCSP request
acme.ocsp.handle DEBUG OCSP response served from cache
acme.ocsp.handle ERROR Failed to check certificate status
acme.ocsp.handle ERROR Failed to request OCSP signing
acme.ocsp.handle ERROR Failed to sign OCSP response
acme.ocsp.handle ERROR Unexpected response type from CA module
acme.ocsp.handle ERROR CA module failed to sign OCSP response
acme.ocsp.handle INFO OCSP response generated
acme.ocsp.x509 DEBUG Failed to query X.509 module
acme.ocsp.flush INFO OCSP cache flushed on startup

Order:

acme.order.create ERROR Failed to generate order ID
acme.order.create ERROR Failed to create authorization
acme.order.create ERROR Failed to store order
acme.order.create ERROR Failed to achieve order storage quorum
acme.order.create INFO Created new order
acme.order.create WARN Failed to record order for rate limiting
acme.order.finalize INFO Order finalization started
acme.order.issue ERROR Failed to reload order for async certificate issuance
acme.order.issue INFO Order no longer in processing state, skipping certificate issuance
acme.order.issue WARN Context cancelled before certificate issuance
acme.order.issue ERROR Failed to issue certificate
acme.order.issue WARN Failed to record finalization failure for rate limiting
acme.order.issue ERROR Failed to reload order after certificate issuance
acme.order.issue ERROR Failed to update order after certificate issuance
acme.order.issue INFO AUDIT Certificate issued successfully

Legacy Order Rate Limit:

acme.order.ratelimit WARN Failed to check rate limit, allowing request
acme.order.ratelimit WARN Rate limit optimistic lock failed after retries, allowing request
acme.order.ratelimit WARN Order rate limit exceeded

Validation HTTP-01:

acme.validation.http01 DEBUG Validating HTTP-01 challenge
acme.validation.http01 WARN HTTP-01 validation failed: connection error
acme.validation.http01 WARN HTTP-01 validation failed: wrong status code
acme.validation.http01 WARN HTTP-01 validation failed: invalid key authorization format
acme.validation.http01 WARN HTTP-01 validation failed: key authorization hash mismatch
acme.validation.http01 INFO HTTP-01 validation successful
acme.validation.http01.dns ERROR Failed to resolve hostname via DNS module
acme.validation.http01.dns ERROR DNS returned no addresses
acme.validation.http01.dns DEBUG Resolved hostname via DNS module
acme.validation.http01.dns DEBUG Connected to validation target
acme.validation.http01.dns WARN Failed to connect to IP, trying next
acme.validation.http01.dns ERROR Failed to connect to any resolved IP

Validation DNS-01:

acme.validation.dns01 DEBUG Validating DNS-01 challenge
acme.validation.dns01 WARN DNS-01 validation failed: DNS lookup error
acme.validation.dns01 ERROR DNS-01 validation failed: no expected value computed
acme.validation.dns01 INFO DNS-01 validation successful
acme.validation.dns01 WARN DNS-01 validation failed: no matching TXT record

Validation TLS-ALPN-01:

acme.validation.tlsalpn01 DEBUG Validating TLS-ALPN-01 challenge
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: connection error
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: wrong ALPN protocol
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: certificate doesn't contain identifier
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: no acmeIdentifier extension
acme.validation.tlsalpn01 WARN TLS-ALPN-01 validation failed: acmeIdentifier mismatch
acme.validation.tlsalpn01 INFO TLS-ALPN-01 validation successful
acme.validation.tlsalpn01.dns ERROR Failed to resolve hostname via DNS module
acme.validation.tlsalpn01.dns ERROR DNS returned no addresses
acme.validation.tlsalpn01.dns DEBUG Resolved hostname via DNS module
acme.validation.tlsalpn01.dns DEBUG TLS connection established
acme.validation.tlsalpn01.dns WARN Failed to connect to IP, trying next
acme.validation.tlsalpn01.dns ERROR Failed to establish TLS connection to any resolved IP

Validation Deterministic DNS:

acme.validation.deterministic DEBUG Failed to resolve domain for deterministic DNS check

Comprehensive Rate Limiting:

acme.ratelimit.circuitbreaker ERROR Rate limit circuit breaker open — blocking requests
acme.ratelimit.check DEBUG Rate limit checks passed
acme.ratelimit.warn WARN Approaching rate limit capacity
acme.ratelimit.blocked INFO AUDIT Rate limit check blocked operation
acme.ratelimit.error WARN Rate limit state access error
acme.ratelimit.record DEBUG Recorded certificate issuance
acme.ratelimit.record WARN Recorded authorization failure
acme.ratelimit.record WARN Recorded finalization failure

SPIFFE Account:

spiffe.account.create WARN Unknown account key - no matching workload found
spiffe.account.create WARN Client IP not allowed for workload
spiffe.account.create ERROR Failed to store SPIFFE account
spiffe.account.create INFO Created SPIFFE account
spiffe.account.deactivate INFO SPIFFE account deactivated

SPIFFE Order:

spiffe.order.create WARN Client IP not allowed for workload
spiffe.order.create WARN SAN not allowed for workload
spiffe.order.create ERROR Failed to generate order ID
spiffe.order.create ERROR Failed to store SPIFFE order
spiffe.order.create INFO Created SPIFFE order (auto-approved)
spiffe.order.get WARN Client IP not allowed for workload
spiffe.order.finalize DEBUG Using workload snapshot from order creation (hot-reload safe)
spiffe.order.finalize WARN Workload removed from config during order lifetime
spiffe.order.finalize DEBUG Using current workload config (v1 order - upgrade for hot-reload safety)
spiffe.order.finalize WARN Client IP not allowed for workload
spiffe.order.finalize WARN Certificate issuance queue full, waiting for slot
spiffe.order.finalize ERROR Failed to revert order status after timeout
spiffe.order.finalize INFO SPIFFE order finalization started
spiffe.order.issue ERROR Failed to reload SPIFFE order for certificate issuance
spiffe.order.issue INFO SPIFFE order no longer in processing state, skipping
spiffe.order.issue ERROR Failed to issue SPIFFE certificate
spiffe.order.issue ERROR Failed to reload SPIFFE order after certificate issuance
spiffe.order.issue ERROR Failed to update SPIFFE order after certificate issuance
spiffe.order.issue INFO AUDIT SPIFFE certificate issued successfully

SPIFFE Certificate:

spiffe.certificate.issue WARN Failed to get CA chain
spiffe.certificate.issue WARN Serial index replication incomplete
spiffe.certificate.issue WARN Failed to save SPIFFE certificate to persistent storage
spiffe.certificate.get.error WARN Account not found for certificate retrieval
spiffe.certificate.get.cidr WARN Client IP not allowed for workload during certificate retrieval
spiffe.certificate.revoke.cidr WARN Client IP not allowed for workload during revocation
spiffe.certificate.revoke INFO AUDIT SPIFFE certificate revoked

SPIFFE Rate Limiting:

spiffe.ratelimit.check WARN Failed to check rate limit, allowing request
spiffe.ratelimit.blocked WARN SPIFFE rate limit exceeded
spiffe.ratelimit.record WARN Failed to get rate limit state
spiffe.ratelimit.record WARN Failed to store rate limit state

Metrics

Prometheus metrics. Query with: metrics prometheus acme_provider_<name> or spiffe_<name>

ACME Provider Rate Limiting (namespace: acme_provider):

acme_provider_ratelimit_checks_total counter {} Total rate limit check invocations
acme_provider_ratelimit_check_results_total counter {limit_type, result} Rate limit check outcomes
limit_type=all, result=passed All checks passed
limit_type=<type>, result=blocked Operation blocked by specific limit type
acme_provider_ratelimit_check_duration latency {} Rate limit check duration
acme_provider_ratelimit_circuit_breaker_trips_total counter {limit_type} Circuit breaker trips (blocking after consecutive state errors)
acme_provider_ratelimit_orders_created_total counter {} Orders recorded for rate limiting
acme_provider_ratelimit_certs_issued_total counter {} Certificates recorded for rate limiting
acme_provider_ratelimit_domain_issuances_total counter {domain} Issuances per registered domain
acme_provider_ratelimit_auth_failures_total counter {domain} Authorization failures per registered domain
acme_provider_ratelimit_finalization_failures_total counter {} Failed finalization attempts recorded
acme_provider_ratelimit_state_errors_total counter {limit_type, operation} Distributed state access errors
acme_provider_ratelimit_approaching_total counter {limit_type} Warning: nearing rate limit capacity (80%+)
acme_provider_ratelimit_current_usage gauge {limit_type} Current usage count for limit dimension
acme_provider_ratelimit_limit gauge {limit_type} Configured limit for dimension
acme_provider_ratelimit_usage_percent gauge {limit_type} Usage as percentage of limit

SPIFFE (namespace: spiffe):

spiffe_accounts_created_total counter {workload} SPIFFE accounts created
spiffe_orders_created_total counter {workload} SPIFFE orders created
spiffe_orders_finalized_total counter {workload} SPIFFE orders finalized (issuance started)
spiffe_certificates_issued_total counter {workload} SPIFFE certificates issued successfully
spiffe_certificate_issuance_errors_total counter {workload, reason} SPIFFE certificate issuance failures
spiffe_certificates_revoked_total counter {workload} SPIFFE certificates revoked
spiffe_certificate_retrievals_total counter {} SPIFFE certificate downloads
spiffe_certificate_retrieval_errors_total counter {reason} SPIFFE certificate download errors
spiffe_trust_bundle_requests_total counter {} SPIFFE trust bundle requests
spiffe_issuance_queue_depth gauge {} Current concurrent issuance goroutines
spiffe_issuance_queue_full_total counter {workload} Issuance rejected due to queue full
spiffe_ca_signing_duration_ms histogram {} CA signing ceremony latency (ms)
spiffe_ratelimit_current_usage gauge {workload} Current rate limit usage per workload
spiffe_ratelimit_blocked_total counter {workload} Requests blocked by rate limit
spiffe_ratelimit_check_error_total counter {workload, reason} Rate limit check errors (fail-open)
spiffe_ratelimit_record_error_total counter {workload, reason} Rate limit recording errors
spiffe_ratelimit_record_success_total counter {workload} Rate limit entries recorded successfully

ACME Client

Automatic TLS certificate management via Let’s Encrypt or ACME-compliant CAs with cluster-wide distribution

Overview

The ACME client module obtains and manages TLS certificates from Let’s Encrypt or any ACME-compliant Certificate Authority (including Hexon’s internal ACME CA). It handles certificate issuance, automatic renewal, cluster-wide distribution, and bootstrap fallback for high-availability deployments.

Core capabilities:

  • Automatic certificate issuance via ACME protocol (RFC 8555)
  • HTTP-01 challenge with dynamic port 80 listener (only during verification)
  • Cluster-wide certificate distribution via persistent KV watch (NATS JetStream KV)
  • Encrypted persistent storage (AES-256-GCM with cluster_key domain separation)
  • Automatic renewal with configurable threshold (default: 30 days before expiry)
  • ACME Renewal Information (ARI) support (RFC 9773) for optimal renewal windows
  • Bootstrap fallback: self-signed temporary certificate when ACME fails on startup
  • Recovery mechanism: exponential backoff retry after bootstrap fallback
  • Smart startup with leader detection for cluster-wide deduplication
  • Wildcard certificate coverage detection to avoid redundant issuance
  • Client-side rate limiting to avoid hitting CA limits
  • Startup readiness integration to prevent HTTPS binding without valid certificate

Certificate modes (dual mode):

Static: tls_cert/tls_key in config used directly (highest priority)
ACME: acme_client.enabled = true, certificates managed automatically
Both can coexist. Static wildcards suppress redundant ACME issuance.

Dynamic port 80 challenge listener architecture:

1. Certificate issuance starts -> challenge listener started on ALL nodes
2. All nodes start port 80 listener -> Wait for quorum confirmation
3. ACME challenge tokens stored in distributed memory cache (cluster-wide)
4. ACME server validates -> Any node can respond to challenge
5. Certificate issued -> challenge listener stopped on ALL nodes
6. All nodes stop port 80 listener
Port 80 is exposed only during the brief verification window (~30 seconds).

Cluster coordination:

1. Leader node performs ACME protocol exchange
2. Certificate saved to Persistent Storage (encrypted)
3. Persistent KV watch automatically syncs to all cluster nodes
4. All nodes update in-memory TLS configuration via watch handler
No manual certificate distribution needed.

Storage model:

Persistent: NATS JetStream KV with AES-256-GCM encryption
Keys: account, cert/{base64url(domain)}, issuance/{base64url(domain)}
Watch pattern: "cert/*" for automatic cluster sync
Domain encoding: base64url (RFC 4648 without padding) for special characters
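
As a rough illustration of the key layout above, the following Python sketch builds a certificate key from a domain using unpadded base64url. The helper names `cert_key` and `domain_from_key` are hypothetical and not part of Hexon; only the `cert/{base64url(domain)}` pattern and the encoding come from the storage model described here.

```python
import base64


def _b64url(data: bytes) -> str:
    """Encode bytes as base64url (RFC 4648) with the padding stripped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def cert_key(domain: str) -> str:
    """Build a key of the form cert/{base64url(domain)} (hypothetical helper)."""
    return "cert/" + _b64url(domain.encode("utf-8"))


def domain_from_key(key: str) -> str:
    """Reverse cert_key: restore the stripped padding, then decode the domain."""
    encoded = key.split("/", 1)[1]
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")
```

Encoding the domain keeps characters such as `*` out of the key itself, so a wildcard domain still produces a key made only of base64url characters after the `cert/` prefix.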

Config

ACME client configuration in hexon.toml under [acme_client]:

[acme_client]
enabled = true # Enable ACME client
email = "admin@example.com" # Contact email for CA notifications
accept_tos = true # Accept CA terms of service (required)
reset = false # Delete all ACME data on startup (default: false)
# Certificate parameters
key_type = "ecdsa256" # Key type: ecdsa256, ecdsa384, rsa2048, rsa4096, ed25519
renewal_threshold_hours = 720 # Renew when fewer than N hours remain (default: 720 = 30 days)
renewal_check_interval = "6h" # How often to check for renewals (default: 6h)
auto_proxy_domains = true # Auto-issue certs for proxy mapping domains
# Challenge configuration
challenge_port = 80 # Port for HTTP-01 challenge listener (default: 80)
# Bootstrap fallback
allow_bootstrap_fallback = true # Use self-signed cert if ACME fails (default: true)
startup_timeout = "60s" # Max time for ACME on startup (default: 60s)
startup_retries = 3 # Retries within timeout before fallback (default: 3)
# ACME Renewal Information (ARI) - RFC 9773
ari_enabled = true # Fetch optimal renewal windows from CA (default: true)
ari_check_interval = "6h" # How often to refresh ARI data (default: 6h)
# Client-side rate limits (avoid hitting CA limits)
[acme_client.rate_limits]
enabled = true # Enable rate limit tracking (default: true)
orders_per_account = 300 # Max orders per account per window (default: 300)
orders_window = "3h" # Orders window (default: 3h)
certs_per_domain = 50 # Max certs per domain per window (default: 50)
certs_per_domain_window = "168h" # 7 days (default: 168h)
buffer_percent = 10 # Safety margin before limit (default: 10%)

TLS certificate source priority:

1. Static certificate (tls_cert + tls_key) -- highest priority
2. AutoTLS (auto_tls = true)
3. ACME client (acme_client.enabled = true)
4. Error -- no TLS configured
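
The priority order above can be read as a simple first-match selection. The sketch below is illustrative only: `TlsConfig` and `select_tls_source` are hypothetical names, and the fields merely mirror the config keys mentioned in this section.

```python
from dataclasses import dataclass


@dataclass
class TlsConfig:
    """Hypothetical view of the relevant settings; fields mirror the config keys."""
    tls_cert: str = ""
    tls_key: str = ""
    auto_tls: bool = False
    acme_client_enabled: bool = False


def select_tls_source(cfg: TlsConfig) -> str:
    """Return the first configured source in priority order, or raise."""
    if cfg.tls_cert and cfg.tls_key:
        return "static"   # 1. static certificate wins over everything else
    if cfg.auto_tls:
        return "autotls"  # 2. AutoTLS
    if cfg.acme_client_enabled:
        return "acme"     # 3. ACME client
    raise ValueError("no TLS configured")  # 4. nothing usable
```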

Bootstrap certificate characteristics (when ACME fails):

CN: HEXON-BOOTSTRAP-{hostname}, O: HexonGateway
Validity: 7 days, Key: ECDSA P-256, SANs: configured hostname only
NOT persisted -- regenerated on each startup if needed

Recovery schedule after bootstrap fallback:

1 minute -> 5 minutes -> 15 minutes -> 30 minutes -> 1 hour -> normal cycle (6h)
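
One way to model this schedule is a finite list of delays followed by the normal cycle repeated indefinitely. This is a minimal sketch under that assumption; the names `RECOVERY_DELAYS`, `recovery_delays`, and `first_delays` are hypothetical.

```python
from datetime import timedelta
from itertools import chain, islice, repeat
from typing import Iterator, List

# Delays taken from the schedule above; afterwards the normal 6h cycle applies.
RECOVERY_DELAYS = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(minutes=30),
    timedelta(hours=1),
]
NORMAL_CYCLE = timedelta(hours=6)


def recovery_delays() -> Iterator[timedelta]:
    """Yield the wait before each recovery attempt, then the normal cycle forever."""
    return chain(RECOVERY_DELAYS, repeat(NORMAL_CYCLE))


def first_delays(n: int) -> List[timedelta]:
    """Convenience helper: the first n delays as a list."""
    return list(islice(recovery_delays(), n))
```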

Hot-reloadable: renewal_threshold_hours, renewal_check_interval, rate limits. Cold (restart required): enabled, email, accept_tos, key_type, challenge_port.

Troubleshooting

Common symptoms and diagnostic steps:

Certificate not issued on startup:

- Check if leader exists: issuance uses leader-only scheduling which requires a leader node
- Smart retry loop polls for leader with exponential backoff
- Verify startup_timeout (default 60s) and startup_retries (default 3)
- If no leader appears within the timeout and allow_bootstrap_fallback = true, the bootstrap certificate is used
- Check logs for "unknown UUID" errors (indicates leader-only scheduling called without leader)

Using bootstrap certificate (self-signed):

- Bootstrap certificate in use indicates ACME failure on startup
- Recovery routine runs with exponential backoff (1m, 5m, 15m, 30m, 1h, then 6h)
- Check if ACME directory is reachable from the node
- Verify DNS resolution for the configured domain
- Check ACME account creation succeeded (email, accept_tos required)

HTTP-01 challenge failing:

- Port 80 must be accessible from the ACME CA server to any cluster node
- Verify port 80 is not already in use (challenge_port config)
- Challenge listener is dynamic: only active during issuance (~30 seconds)
- Check distributed memory cache health: challenge tokens stored cluster-wide
- Challenge tokens have short TTL (5 minutes)
- Path traversal protection on challenge token validation
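
The token check mentioned in the last item can be approximated as follows. RFC 8555 restricts challenge tokens to the base64url alphabet and serves them under `/.well-known/acme-challenge/`, so rejecting anything else also rejects path traversal attempts. How Hexon validates tokens internally is not documented here; `token_from_path` is a hypothetical helper.

```python
import re
from typing import Optional

# RFC 8555: tokens must contain only characters from the base64url alphabet.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_-]+")
_PREFIX = "/.well-known/acme-challenge/"


def token_from_path(path: str) -> Optional[str]:
    """Extract a well-formed challenge token from a request path, else None."""
    if not path.startswith(_PREFIX):
        return None
    token = path[len(_PREFIX):]
    # fullmatch rejects '/', '.', and empty tokens, which blocks traversal.
    return token if _TOKEN_RE.fullmatch(token) else None
```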

Certificate not renewing:

- Check renewal_threshold_hours (default 720 = 30 days)
- Verify renewal_check_interval schedule (default 6h)
- Renewal checks run via the internal scheduler automatically
- ARI-suggested renewal windows may differ from threshold-based renewal
- Check rate limits: client-side tracking prevents exceeding CA limits
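
The threshold-based part of this decision reduces to a time comparison. The sketch below assumes only the documented default of 720 hours and ignores ARI windows and rate limits; `needs_renewal` is a hypothetical name.

```python
from datetime import datetime, timedelta


def needs_renewal(not_after: datetime, now: datetime, threshold_hours: int = 720) -> bool:
    """True when fewer than threshold_hours remain before the certificate expires."""
    return not_after - now < timedelta(hours=threshold_hours)
```

With the default threshold, a certificate expiring in 29 days is due for renewal, while one expiring in 31 days is not.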

Certificate not appearing on all nodes:

- Persistent KV watch subscribes to "cert/*" for automatic sync
- Check NATS JetStream KV connectivity on all nodes
- WatchEventPut: decrypt and install; WatchEventDelete: remove from cache
- Encryption key mismatch: all nodes must share same cluster_key
- Check AES-256-GCM decryption errors in logs
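
The put/delete handling described above can be pictured roughly as below. This is a sketch, not the gateway's handler: the event names `"put"` and `"delete"` stand in for WatchEventPut and WatchEventDelete, the cache is a plain dict, and `decrypt` is an injected callable assumed to raise `ValueError` when decryption fails (for example on a cluster_key mismatch).

```python
from typing import Callable, Dict


def handle_watch_event(
    event_type: str,
    key: str,
    value: bytes,
    cache: Dict[str, bytes],
    decrypt: Callable[[bytes], bytes],
) -> bool:
    """Apply one watch event to the local cache; return True if the cache changed."""
    if not key.startswith("cert/"):
        return False  # outside the "cert/*" watch pattern
    if event_type == "put":
        try:
            cache[key] = decrypt(value)
        except ValueError:
            # Decryption failed: keep whatever was installed before.
            return False
        return True
    if event_type == "delete":
        return cache.pop(key, None) is not None
    return False
```

A node with the wrong cluster_key would take the `except` branch on every put event, which matches the symptom of certificates never appearing on that node.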

Rate limiting issues:

- Let's Encrypt limits: 300 orders/account/3h, 50 certs/registered domain/7d, 5 certs/exact domain set/7d
- Client-side tracking uses memory module with TTLs matching CA windows
- HTTP 429 with Retry-After: short waits retried immediately, long waits deferred
- ARI-suggested renewals exempt from some rate limits
- buffer_percent (default 10%) triggers warning at 90% of limit
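
To make the buffer arithmetic concrete, the sketch below computes the usage count at which a warning would fire. The function names are hypothetical, and treating the limit itself as the blocking point is an assumption; the text above only states that the warning triggers at 90% with the default buffer.

```python
def warning_threshold(limit: int, buffer_percent: int = 10) -> int:
    """Usage count at which a warning fires: the limit minus the safety buffer."""
    return limit * (100 - buffer_percent) // 100


def check_usage(current: int, limit: int, buffer_percent: int = 10) -> str:
    """Classify current usage (assumption: blocking happens at the limit itself)."""
    if current >= limit:
        return "blocked"
    if current >= warning_threshold(limit, buffer_percent):
        return "warn"
    return "passed"
```

With the defaults, the orders limit of 300 warns from 270 orders onward, and the per-domain limit of 50 warns from 45 certificates onward.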

Wildcard coverage preventing ACME issuance:

- Static wildcard cert (e.g., *.example.com) suppresses ACME for covered domains
- *.example.com covers api.example.com but NOT example.com (apex)
- *.example.com does NOT cover sub.api.example.com (nested subdomain)
- Check if domain is covered by existing static certificate
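
The coverage rules above (exactly one extra label, no apex, no nested subdomains) can be checked with a few lines of string handling. `wildcard_covers` is a hypothetical helper written to match those rules, not the gateway's own function.

```python
def wildcard_covers(pattern: str, domain: str) -> bool:
    """True if a certificate name (possibly a wildcard) covers the given domain."""
    pattern = pattern.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == domain  # non-wildcard names must match exactly
    suffix = pattern[1:]  # "*.example.com" -> ".example.com"
    if not domain.endswith(suffix):
        return False  # also rejects the apex, which lacks the leading dot
    label = domain[: -len(suffix)]
    # The wildcard stands for exactly one non-empty label.
    return bool(label) and "." not in label
```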

HexonReady timeout:

- ACME client registers a readiness check for certificate availability
- HexonReady polls every 500ms with 2-minute timeout
- On timeout: either ACME failed with bootstrap fallback disabled, or no leader was available
- Check certificatesReady atomic flag in module state
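
The polling behaviour described above (500ms interval, 2-minute timeout) follows a common pattern, sketched here with an injectable clock so it can be exercised without real waiting. `wait_until_ready` and its parameters are hypothetical and only illustrate the timing.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready every `interval` seconds until it is true or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False  # readiness check never passed within the timeout
        sleep(interval)
```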

Metrics for diagnosis:

acmeclient_issuance_started_total, acmeclient_issuance_success_total
acmeclient_issuance_failed_total (labels: domain, error_type)
acmeclient_renewal_checks_total, acmeclient_certificates_expiring
acmeclient_challenges_served_total (labels: status)
acmeclient_certificates_loaded (gauge), acmeclient_certificate_days_until_expiry

Interpreting tool output:

'certs list':
Healthy: All certs Status=OK, Days Left > 30
Warning: Days Left < 30 — renewal should happen automatically, check 'certs acme'
Expiring: Days Left < 7 — urgent, check ACME client health immediately
Bootstrap: Source=bootstrap — self-signed temporary cert, ACME failed on startup
'certs acme list':
Healthy: All domains show Status=valid with reasonable expiry
Pending: Status=pending — issuance in progress or waiting for challenge
Failed: Status=failed with error — check challenge port 80 accessibility
Action: Failed → 'logs search acmeclient' for issuance error details
'autotls status':
Healthy: Certificate valid, Days Left > 30, auto-renewal scheduled
Renewing: Renewal in progress — certificate will update automatically
Failed: Renewal failed — check internal ACME CA health with 'health components'

Security

Security model and hardening:

Private key protection:

All sensitive data encrypted at rest using AES-256-GCM. Key derived from
cluster_key with module-specific domain separation ("acmeclient"). Defense-in-depth
on top of NATS transport encryption. Private keys never stored in plaintext
in persistent storage.
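
Domain separation means each module derives its own key from the shared cluster_key, so a key derived for one module cannot decrypt another module's data. The derivation function Hexon uses is not documented here; the sketch below assumes HKDF-SHA256 (RFC 5869) with the module name as the `info` input, purely as an illustration. `hkdf_sha256` and `module_key` are hypothetical names.

```python
import hashlib
import hmac


def hkdf_sha256(key_material: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand to `length` bytes."""
    # Extract: an empty salt defaults to a block of zero bytes of hash length.
    prk = hmac.new(salt or b"\x00" * 32, key_material, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    # Expand: T(i) = HMAC(PRK, T(i-1) | info | i)
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def module_key(cluster_key: bytes, module: str) -> bytes:
    """Derive a 32-byte (AES-256 sized) key bound to one module name (assumed scheme)."""
    return hkdf_sha256(cluster_key, info=module.encode("utf-8"))
```

Under this assumption, `module_key(cluster_key, "acmeclient")` and a key derived for any other module name differ even though both start from the same cluster_key.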

Challenge listener security:

Port 80 only exposed during brief verification window (~30 seconds).
Challenge tokens have short TTL (5 minutes) in distributed memory cache.
Token validation prevents path traversal attacks.
Challenge listener management restricted to internal operations only.

Cluster synchronization security:

Automatic encrypted sync across all nodes (no manual distribution needed).
All nodes must share the same cluster_key for decryption.
NATS JetStream uses TLS encryption in transit.

Bootstrap certificate limitations:

Self-signed, not trusted by any external client.
7-day validity, ECDSA P-256 key.
NOT persisted -- cannot be accidentally used long-term.
Recovery routine continuously attempts real ACME certificate.

Access control:

Certificate issuance and renewal are restricted to internal scheduler and
admin commands. Challenge listener management is internal only (during
issuance). Certificate status queries are available to all services and
admin commands.

Relationships

Module dependencies and interactions:

  • Certificate manager: Primary consumer. ACME client stores issued certificates and distributes them cluster-wide via the certificate manager.
  • TLS server: Retrieves certificates via certificate manager for TLS handshakes. Checks for ACME-managed certificates when no static TLS cert configured.
  • Internal ACME CA: Can be configured as the ACME server endpoint, creating a fully internal PKI. The ACME client is a standard client — it works with any RFC 8555 compliant CA.
  • AutoTLS: Alternative for internal-only deployments. AutoTLS uses the internal CA directly. Priority: static > autotls > acmeclient.
  • Distributed memory: Challenge tokens stored cluster-wide for any-node validation. Rate limit counters tracked with CA-matched TTLs.
  • Persistent storage: Durable certificate storage (encrypted, NATS JetStream KV). Automatic cluster distribution via watch on certificate keys.
  • Cluster coordination: Leader detection for deduplication of certificate issuance. Readiness check for startup sequencing.
  • Config: Certificate mode selection (static vs ACME), renewal parameters, rate limit configuration. Hot-reload for renewal settings.
  • Telemetry: Prometheus metrics for issuance, renewal, challenges, and certificate state. Structured logging for all ACME protocol interactions.
  • Scheduler: Renewal checks run on configurable interval (default 6h). ARI check runs on separate interval (default 6h).

Logs

Log entries by component. Search with: logs search “acmeclient”. Levels: ERROR > WARN > INFO > DEBUG > TRACE. AUDIT = persisted to audit trail.

Init (module startup):

acmeclient.init INFO Registered ACME certificates readiness check with HexonReady
acmeclient.init INFO Static TLS certificate configured - ACME client inactive
acmeclient.init INFO ACME client disabled in config
acmeclient.init INFO Initializing ACME client
acmeclient.init ERROR Failed to initialize ACME client
acmeclient.init INFO ACME client initialized successfully
acmeclient.init WARN Persistent storage is memory-only (cluster_path not set). Certificates will NOT survive cluster restart.
acmeclient.init WARN Failed to load issuance state, starting fresh
acmeclient.init WARN Failed to load some certificates from storage
acmeclient.init WARN Service certificate acquisition issue - will retry via recovery

Reset (data reset on startup):

acmeclient.reset WARN ACME reset requested - deleting all ACME data (account, certificates, issuance state)
acmeclient.reset ERROR Failed to reset ACME data
acmeclient.reset WARN ACME data reset complete - starting fresh

Fallback (bootstrap fallback):

acmeclient.fallback WARN ACME initialization failed, attempting bootstrap fallback
acmeclient.fallback ERROR Failed to generate bootstrap certificate - server cannot start with TLS
acmeclient.fallback WARN Using bootstrap certificate - ACME unavailable

Startup (service certificate acquisition):

acmeclient.startup WARN No service hostname configured - skipping certificate check
acmeclient.startup INFO Using existing valid certificate
acmeclient.startup INFO No valid certificate found - attempting ACME issuance with leader detection
acmeclient.startup INFO Certificate now available
acmeclient.startup DEBUG No leader available, waiting...
acmeclient.startup INFO Attempting ACME certificate issuance
acmeclient.startup WARN Certificate issuance request failed
acmeclient.startup WARN Certificate issuance wait failed
acmeclient.startup INFO Certificate issued successfully during startup
acmeclient.startup WARN ACME issuance timeout, falling back to bootstrap certificate
acmeclient.startup WARN Using bootstrap certificate - ACME recovery will be attempted

Account (ACME account management):

acmeclient.account INFO Loaded existing ACME account from persistent storage
acmeclient.account INFO No existing account found, waiting before creation to prevent race
acmeclient.account INFO Account was created by another node during wait, using existing
acmeclient.account INFO Creating new ACME account
acmeclient.account WARN Failed to save ACME account to persistent storage
acmeclient.account INFO Saved new ACME account to persistent storage
acmeclient.account INFO Created new ACME account successfully

Request (signed ACME requests):

acmeclient.request DEBUG Retrying ACME request after transient error

Rate limit (CA rate limit handling and client-side tracking):

acmeclient.ratelimit WARN Rate limited by CA, waiting before retry
acmeclient.ratelimit WARN Rate limited by CA, scheduling for later retry
acmeclient.ratelimit WARN Rate limited by CA without valid Retry-After, using exponential backoff
acmeclient.ratelimit DEBUG Rate limit checking disabled, skipping pre-flight checks
acmeclient.ratelimit INFO Starting pre-flight rate limit checks
acmeclient.ratelimit INFO All rate limit checks passed
acmeclient.ratelimit WARN Failed to check rate limit state
acmeclient.ratelimit WARN Approaching rate limit capacity
acmeclient.ratelimit WARN Rate limit check blocked operation
acmeclient.ratelimit DEBUG Rate limit check passed
acmeclient.ratelimit WARN Failed to record last order time
acmeclient.ratelimit WARN Failed to retrieve account order state
acmeclient.ratelimit ERROR Failed to store account order state
acmeclient.ratelimit INFO Recorded order creation
acmeclient.ratelimit WARN Failed to retrieve domain state, creating new
acmeclient.ratelimit WARN IssuedAt array exceeded max entries, truncating
acmeclient.ratelimit ERROR Failed to store domain issuance state
acmeclient.ratelimit INFO Recorded domain certificate issuance
acmeclient.ratelimit WARN Failed to retrieve exact set state, creating new
acmeclient.ratelimit WARN IssuedAt array exceeded max entries, truncating
acmeclient.ratelimit ERROR Failed to store exact set issuance state
acmeclient.ratelimit INFO Recorded exact set certificate issuance
acmeclient.ratelimit WARN Failed to retrieve domain state for auth failure recording
acmeclient.ratelimit ERROR Failed to store authorization failure state
acmeclient.ratelimit WARN Recorded authorization failure
acmeclient.ratelimit ERROR Failed to store Retry-After state
acmeclient.ratelimit WARN Stored Retry-After delay from CA

Issue (certificate issuance):

acmeclient.issue WARN Certificate issuance attempted but ACME client not initialized
acmeclient.issue INFO Starting certificate issuance
acmeclient.issue ERROR Certificate issuance failed
acmeclient.issue INFO Certificate issued successfully
acmeclient.issue INFO All domains covered by static certificate, no ACME issuance needed
acmeclient.issue WARN Rate limit check failed, delaying issuance
acmeclient.issue WARN Certificate issuance already in progress for domain
acmeclient.issue DEBUG Starting ACME certificate issuance
acmeclient.issue WARN Failed to record order creation for rate limiting
acmeclient.issue INFO Starting challenge listeners cluster-wide
acmeclient.issue WARN Node failed to start challenge listener
acmeclient.issue INFO Stopping challenge listeners cluster-wide
acmeclient.issue WARN Failed to broadcast stop challenge listener
acmeclient.issue WARN Failed to save certificate to persistent storage
acmeclient.issue WARN Failed to install certificate locally
acmeclient.issue WARN Failed to record certificate issuance for rate limiting

Challenge (HTTP-01 challenge handling):

acmeclient.challenge WARN Invalid ACME token format
acmeclient.challenge WARN Failed to lookup challenge token
acmeclient.challenge WARN Failed to wait for challenge lookup
acmeclient.challenge ERROR Unexpected response type from memorystorage
acmeclient.challenge DEBUG Challenge token not found
acmeclient.challenge WARN Challenge token has invalid value type
acmeclient.challenge WARN Failed to write challenge response
acmeclient.challenge INFO Served ACME challenge
acmeclient.challenge DEBUG Challenge token stored, responding to ACME server
acmeclient.challenge INFO Authorization validated
acmeclient.challenge WARN Failed to record authorization failure for rate limiting

Listener (challenge listener lifecycle):

acmeclient.listener DEBUG Challenge listener already running
acmeclient.listener WARN Failed to resolve interface IP, falling back to 0.0.0.0
acmeclient.listener ERROR Failed to create challenge listener
acmeclient.listener ERROR Failed to start challenge listener
acmeclient.listener INFO Challenge listener started
acmeclient.listener DEBUG Challenge listener not running, nothing to stop
acmeclient.listener WARN Challenge listener shutdown error
acmeclient.listener INFO Challenge listener stopped

Bootstrap (bootstrap certificate generation):

acmeclient.bootstrap INFO Generated CA-signed bootstrap certificate
acmeclient.bootstrap WARN CA signing failed, falling back to self-signed
acmeclient.bootstrap WARN Generated temporary bootstrap certificate - ACME certificate pending

Renewal (certificate renewal):

acmeclient.renewal INFO Scheduling renewal checks
acmeclient.renewal ERROR Failed to schedule renewal checks
acmeclient.renewal INFO Renewal check scheduler registered
acmeclient.renewal INFO Running startup certificate check
acmeclient.renewal ERROR Failed to trigger startup renewal check
acmeclient.renewal ERROR Startup renewal check failed
acmeclient.renewal INFO Startup certificate check completed
acmeclient.renewal INFO Cleaned up old failure records
acmeclient.renewal INFO Cleaned up stale inProgress entries
acmeclient.renewal WARN Failed to fetch ARI info
acmeclient.renewal INFO ARI suggests certificate renewal
acmeclient.renewal DEBUG ARI window not yet open, skipping renewal
acmeclient.renewal INFO Certificate needs renewal
acmeclient.renewal INFO Certificate missing for domain
acmeclient.renewal INFO Skipping certificate renewal - retry not allowed
acmeclient.renewal INFO Renewing certificate
acmeclient.renewal ERROR Failed to renew certificate
acmeclient.renewal DEBUG Domain covered by static certificate, skipping
acmeclient.renewal INFO ARI-guided certificate renewal completed
acmeclient.renewal INFO Certificate renewed successfully

Renewals (hexdcall renewal check operation):

acmeclient.renewals WARN Renewal check skipped - ACME client not initialized
acmeclient.renewals INFO Starting renewal check
acmeclient.renewals INFO Renewal check completed

Domains (domain collection for certificate issuance):

acmeclient.domains DEBUG Added service hostname to domain list
acmeclient.domains DEBUG Added additional domains from config
acmeclient.domains DEBUG Added proxy mapping hosts
acmeclient.domains DEBUG Added proxy landing page hostname
acmeclient.domains DEBUG Added forward proxy hostname
acmeclient.domains DEBUG Added connector hostname
acmeclient.domains INFO Collected domains for ACME certificates
acmeclient.domains INFO Domains skipped (covered by static TLS certificate)
acmeclient.domains WARN No domains configured for ACME. Set service.hostname, acme_client.additional_domains, or configure proxy mappings

Load (certificate loading from storage):

acmeclient.load WARN Certificate load skipped - ACME client not initialized
acmeclient.load INFO Loading certificates from storage
acmeclient.load INFO Loaded certificate from persistent storage

Coverage (static certificate coverage checking):

acmeclient.coverage WARN Failed to read static certificate for coverage check
acmeclient.coverage WARN Failed to decode static certificate PEM
acmeclient.coverage WARN Failed to parse static certificate
acmeclient.coverage INFO Parsed static certificate for coverage check
acmeclient.coverage DEBUG Domain covered by static certificate, skipping ACME

ARI (ACME Renewal Information - RFC 9773):

acmeclient.ari WARN Invalid ARI window: end not after start, using window start
acmeclient.ari WARN ARI window exceeds maximum, capping duration
acmeclient.ari WARN Failed to generate random offset for ARI window, using window start
acmeclient.ari WARN CA suggests early renewal - check explanation URL
acmeclient.ari DEBUG Using cached ARI info
acmeclient.ari ERROR Failed to fetch ARI info from CA
acmeclient.ari WARN Failed to cache ARI info
acmeclient.ari INFO Fetched and cached ARI info from CA
acmeclient.ari DEBUG No ARI info available for domain
acmeclient.ari WARN Failed to retrieve ARI info for marking as replaced
acmeclient.ari DEBUG No ARI info found to mark as replaced
acmeclient.ari ERROR Failed to store ARI replaced state
acmeclient.ari INFO Marked ARI renewal as completed

Recovery (bootstrap recovery routine):

acmeclient.recovery INFO Starting ACME recovery routine
acmeclient.recovery INFO Bootstrap certificate replaced - recovery complete
acmeclient.recovery INFO Waiting for next recovery attempt
acmeclient.recovery INFO Bootstrap certificate replaced during wait - recovery complete
acmeclient.recovery WARN Initial recovery schedule exhausted - switching to normal renewal cycle
acmeclient.recovery INFO Attempting ACME recovery
acmeclient.recovery WARN ACME client not fully initialized - attempting reinitialization
acmeclient.recovery WARN ACME reinitialization failed
acmeclient.recovery WARN ACME recovery request failed
acmeclient.recovery WARN ACME recovery wait failed
acmeclient.recovery WARN ACME recovery got unexpected response type
acmeclient.recovery WARN ACME recovery issuance failed
acmeclient.recovery INFO ACME recovery successful - real certificate obtained

Watch (PersistentWatch certificate sync):

acmeclient.watch WARN PersistentWatch disconnected, will retry
acmeclient.watch ERROR Failed to start PersistentWatch
acmeclient.watch INFO Started PersistentWatch for certificate updates
acmeclient.watch INFO PersistentWatch channel closed
acmeclient.watch WARN Received invalid envelope type
acmeclient.watch ERROR Failed to decrypt certificate from watch event
acmeclient.watch WARN Module state not ready, skipping certificate install
acmeclient.watch ERROR Failed to install certificate from watch event
acmeclient.watch INFO AUDIT Certificate installed via PersistentWatch
acmeclient.watch INFO AUDIT Certificate removed via PersistentWatch

Status, List, Get (certificate queries):

acmeclient.status DEBUG Certificate status check - ACME client not initialized
acmeclient.list DEBUG Certificate list requested - ACME client not initialized
acmeclient.get DEBUG Certificate requested - ACME client not initialized
acmeclient.get WARN Failed to load certificate from storage
acmeclient.get DEBUG Certificate retrieved

State (issuance state persistence):

acmeclient.state WARN Failed to delete issuance state
acmeclient.state WARN Failed to save issuance state
acmeclient.state INFO Loaded issuance state from persistent storage

Cleanup (stale data removal):

acmeclient.cleanup WARN Failed to delete old issuance state
acmeclient.cleanup WARN Removed stale inProgress entry

Shutdown:

acmeclient.shutdown WARN Shutdown timed out waiting for watch goroutine
acmeclient.shutdown INFO ACME client shutdown complete

Metrics

Prometheus metrics. Query with: metrics prometheus acmeclient_<name>

Issuance counters (module: acmeclient):

acmeclient_issuance_started_total counter {domain} Certificate issuance started
acmeclient_issuance_success_total counter {domain, key_type} Certificate issuance succeeded
acmeclient_issuance_failed_total counter {domain, error_type} Certificate issuance failed
Labels: error_type="timeout"|"rate_limit"|"authorization"|"network"|"dns"|"invalid_request"|"not_found"|"unknown"|"none"

Issuance latency (module: acmeclient):

acmeclient_issuance_duration histogram {domain, key_type} End-to-end issuance time

Renewal counters (module: acmeclient):

acmeclient_renewal_checks_total counter (no labels) Renewal check cycles executed
acmeclient_renewal_success_total counter {domain} Certificate renewals succeeded
acmeclient_renewal_failed_total counter {domain, error_type} Certificate renewals failed

Challenge counters (module: acmeclient):

acmeclient_challenges_stored_total counter {domain} Challenge tokens stored in distributed cache
acmeclient_challenges_served_total counter {status} Challenge responses served
Labels: status="success"|"not_found"|"invalid_token"|"lookup_error"|"internal_error"|"invalid_value"|"write_error"

Certificate gauges (module: acmeclient):

acmeclient_certificates_checked gauge (no labels) Certificates checked in last renewal cycle
acmeclient_certificates_expiring gauge (no labels) Certificates needing renewal in last cycle
acmeclient_certificates_loaded gauge (no labels) Total certificates in memory cache
acmeclient_certificate_days_until_expiry gauge {domain} Days until certificate expires

ARI counters (module: acmeclient):

acmeclient_ari_fetch_total counter {result} ARI fetch attempts
Labels: result="success"|"error"
acmeclient_ari_early_renewal_suggestions_total counter (no labels) CA-suggested early renewals (possible revocation)
acmeclient_ari_renewals_total counter {domain} ARI-guided certificate renewals completed
acmeclient_ari_marked_replaced_total counter (no labels) ARI renewals marked as replaced
acmeclient_ari_cache_total counter {result} ARI cache lookups
Labels: result="hit"|"miss"

Rate limit counters (module: acmeclient):

acmeclient_ratelimit_checks_total counter (no labels) Rate limit pre-flight checks executed
acmeclient_ratelimit_check_results_total counter {limit_type, result} Rate limit check outcomes
Labels: limit_type="retry_after"|"min_order_interval"|"orders_per_account"|"certs_per_domain"|
"auth_failures_per_domain"|"certs_per_exact_set"|"all"
Labels: result="blocked"|"passed"|"error"
acmeclient_ratelimit_orders_created_total counter (no labels) ACME orders created (tracked for limits)
acmeclient_ratelimit_certs_issued_total counter {domain} Certificates issued per domain (tracked for limits)
acmeclient_ratelimit_auth_failures_total counter {domain} Authorization failures per domain
acmeclient_ratelimit_retry_after_total counter {status_code} Retry-After responses from CA
Labels: status_code="429"|"503"|"other"
acmeclient_ratelimit_state_errors_total counter {operation} Rate limit state storage errors
Labels: operation="set"|"get"
acmeclient_ratelimit_approaching_total counter {limit_type} Rate limit approaching capacity warnings (>80%)

Rate limit gauges (module: acmeclient):

acmeclient_ratelimit_current_usage gauge {limit_type} Current usage count per limit type
acmeclient_ratelimit_limit gauge {limit_type} Effective limit value per limit type
acmeclient_ratelimit_usage_percent gauge {limit_type} Usage percentage per limit type

Rate limit latency (module: acmeclient):

acmeclient_ratelimit_check_duration histogram (no labels) Pre-flight rate limit check duration

Alerts:

issuance_failed_total increasing -> Check CA reachability, DNS, port 80 access
certificates_expiring > 0 persisting -> Renewal failing, check error_type labels
certificate_days_until_expiry < 7 -> Urgent: cert near expiry, check renewal logs
ratelimit_check_results_total{result="blocked"} -> Client-side rate limit preventing issuance
ari_early_renewal_suggestions_total -> CA suggesting early renewal, possible revocation
challenges_served_total{status!="success"} -> Challenge failures, check port 80 and DNS

AutoTLS Certificate Management

Zero-touch TLS — automatically issues and renews wildcard certificates from the built-in CA

Overview

Automatically issues and renews wildcard TLS certificates for all internal domains — no operator intervention required. When enabled, the gateway provisions certificates from its built-in ACME CA on startup and renews them before expiry. No external dependencies, no manual certificate management, no configuration per service.

Core capabilities:

  • Fully automatic certificate issuance and renewal — zero operator intervention required
  • Wildcard certificates covering all subdomains of the configured hostname
  • Deterministic key derivation (HKDF from cluster_key) for cluster-wide SPKI consistency
  • Configurable renewal cycles (default: 30 days) and validity (default: 60 days)
  • Seamless key rotation on each renewal cycle (new HKDF-derived key per cycle)
  • Background renewal loop with automatic retry on failure
  • Hostname change detection with automatic re-issuance

IMPORTANT — Automatic renewal:

Certificate renewal is fully automatic. The renewal loop runs continuously
in the background, sleeping until the next cycle boundary (deterministic mode) or
until the renewal threshold (random mode). When the timer fires, a new certificate
is automatically issued and installed. There is NO manual rotation step.
Operators do NOT need to set calendar reminders or monitor expiry dates for
routine certificate management. The system handles it. Only investigate if
'autotls status' shows unexpected errors or if certificates are not renewing
(check logs for issuance failures).

Each node operates independently:

1. Derives deterministic ECDSA P-256 private key from cluster_key and current cycle
2. Signs wildcard certificate via internal CA
3. Stores certificate and sets as default for SNI fallback
4. Background loop sleeps until next cycle boundary, then repeats

No cluster coordination needed — deterministic derivation from cluster_key means all nodes produce certificates with the same public key.
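The sleep target in step 4 comes down to date arithmetic. The sketch below is illustrative only: it assumes cycles are counted in whole renewal periods from a fixed epoch (the Logs section mentions an ACME CA epoch, but the exact formula is not documented here), and cycle_index and next_boundary are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def cycle_index(now: datetime, epoch: datetime, renewal_days: int) -> int:
    """Whole renewal cycles elapsed since epoch; 0 while the epoch is in the future."""
    if now < epoch:
        return 0
    return (now - epoch) // timedelta(days=renewal_days)


def next_boundary(now: datetime, epoch: datetime, renewal_days: int) -> datetime:
    """Instant at which the next cycle starts (assumed sleep target of the loop)."""
    cycle = cycle_index(now, epoch, renewal_days)
    return epoch + timedelta(days=renewal_days * (cycle + 1))


epoch = datetime(2024, 1, 1, tzinfo=timezone.utc)  # illustrative epoch, not a real default
now = datetime(2024, 2, 15, tzinfo=timezone.utc)
print(cycle_index(now, epoch, 30), next_boundary(now, epoch, 30).date())
```

Because every node evaluates the same pure function of wall-clock time, nodes with reasonably synchronized clocks land on the same cycle without exchanging messages.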

Deterministic keys

Deterministic key derivation — what it means and why it is secure:

IMPORTANT — “deterministic” refers to KEY MATERIAL ONLY, not signatures. ECDSA signatures still use standard randomness for nonce generation. This is NOT “deterministic signing” in the RFC 6979 sense.

How it works:

Private keys are derived deterministically from cluster_key, the certificate
SANs, and a cycle counter. The same inputs always produce the same key.
Each renewal cycle increments the counter → new key material.
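
A minimal sketch of this kind of derivation, using HKDF (RFC 5869) with HMAC-SHA256 from the Python standard library. The info string layout, the empty salt, and the function names are assumptions made for illustration; the gateway's actual encoding is not documented here.

```python
import hashlib
import hmac

# Order of the NIST P-256 group; valid private scalars lie in [1, n-1].
P256_ORDER = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF extract-then-expand (RFC 5869) with HMAC-SHA256."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_private_scalar(cluster_key: bytes, sans: list[str], cycle: int) -> int:
    """Map (cluster_key, SANs, cycle) to a P-256 private scalar."""
    # Hypothetical info layout: fixed label, sorted SANs, cycle counter.
    info = b"autotls-key|" + ",".join(sorted(sans)).encode() + b"|" + str(cycle).encode()
    okm = hkdf_sha256(cluster_key, b"", info, 48)  # 48 bytes keeps modular bias negligible
    return int.from_bytes(okm, "big") % (P256_ORDER - 1) + 1


key = bytes(32)  # placeholder cluster_key for the example only
sans = ["*.corp.internal", "access.corp.internal"]
print(derive_private_scalar(key, sans, 0) == derive_private_scalar(key, sans, 0))
```

Sorting the SANs before hashing makes the result independent of configuration order, and including the cycle counter in the info string is what gives each cycle an unrelated key.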

Why deterministic keys:

1. SPKI pinning: The public key is identical across all cluster nodes,
enabling external clients to pin the certificate
2. Cluster consistency: No coordination needed — all nodes derive the same key
3. Reproducibility: Certificate can be re-derived after node restart without
fetching state from other nodes

What remains random:

- ECDSA signature nonces use standard randomness — certificate bytes may
differ between nodes, but the public key is identical
- The certificate is signed with full cryptographic randomness;
only the subject key pair is deterministic

Security properties:

- Key material entropy comes from cluster_key (256-bit minimum)
- Cryptographic domain separation ensures different inputs produce independent keys
- Each cycle produces an independent key — compromise of one cycle's key
does not reveal other cycles' keys
- Without cluster_key (single-node mode), keys are fully random

When the admin AI or operators see “deterministic” in AutoTLS context, it means deterministic KEY DERIVATION for cluster consistency — not reduced randomness. The cryptographic security is equivalent to random key generation.

Config

AutoTLS configuration in hexon.toml under [service]:

[service]
hostname = "access.corp.internal"
auto_tls = true
# auto_tls_renewal = 30 # Renewal cycle in days (default: 30, range: 20-525)
# auto_tls_validity = 60 # Certificate validity in days (default: 60, range: 30-790)

Certificate timing:

Renewal cycle: how often a new certificate is issued (default: 30 days)
Validity period: how long each certificate is valid (default: 60 days)
Overlap: validity - renewal = 30 days of dual-certificate coverage
Constraint: overlap must be 20%-80% of validity (auto-adjusted if not)
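
The timing rules above can be checked with a few lines of arithmetic. This sketch only validates a pair of values against the documented ranges and the 20%-80% overlap constraint; how the gateway auto-adjusts out-of-range values is not specified here, so no adjustment is attempted. Function names are hypothetical.

```python
def overlap_days(renewal_days: int, validity_days: int) -> int:
    """Days during which the old and the new certificate are both valid."""
    return validity_days - renewal_days


def timing_ok(renewal_days: int, validity_days: int) -> bool:
    """True if both values are in range and overlap is 20%-80% of validity."""
    if not (20 <= renewal_days <= 525 and 30 <= validity_days <= 790):
        return False
    overlap = overlap_days(renewal_days, validity_days)
    # Integer form of 0.20 <= overlap / validity <= 0.80
    return 20 * validity_days <= 100 * overlap <= 80 * validity_days


print(overlap_days(30, 60), timing_ok(30, 60))  # defaults: 30 days of overlap, 50% of validity
```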

When auto_tls = true:

- Internal ACME CA is automatically enabled
- ACME CA endpoints available for trust anchoring (/acme/ca-bundle)
- Static TLS certificates (tls_cert/tls_key) take priority if defined

Certificate details:

Key: ECDSA P-256 (deterministically derived when cluster_key set, random otherwise)
Serial: Deterministic (derived from SANs and cycle for consistency)
SANs: *.{base_domain} + {hostname}
CN: HEXON-AUTOTLS-*.{base_domain}

Hot-reloadable: auto_tls_renewal, auto_tls_validity (via renewalLoop detection). Cold (restart required): auto_tls, hostname.

Troubleshooting

Common symptoms and diagnostic steps:

Certificate not issued on startup:

- Check if internal ACME CA is healthy: 'health components'
- Verify cluster_key is set (required for deterministic mode)
- Check logs for certificate signing errors
- If startup fails, the renewal loop retries automatically

Certificate not renewing:

- Renewal is fully automatic — check if the renewal loop is running
- Check 'autotls status' for current certificate state and days left
- Check logs for "Renewing AutoTLS certificate" messages

Hostname changed but old certificate still served:

- renewalLoop detects hostname changes and re-issues automatically
- Old certificate expires naturally (validity period)
- Force immediate renewal: 'autotls renew' (admin command)

SPKI pin changed unexpectedly:

- Pin changes on each renewal cycle (new HKDF-derived key)
- Update pinned hashes after each renewal cycle
- Pin rotation window = certificate overlap period (default: 30 days)
- Use both current and next pin for seamless rotation
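
A sketch of client-side pin checking that accepts both the current and the next pin. It assumes the common pin format (base64 of the SHA-256 digest of the DER-encoded SubjectPublicKeyInfo, as in RFC 7469); extracting the SPKI bytes from a certificate is left to whatever X.509 tooling is in use, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def spki_pin(spki_der: bytes) -> str:
    """Base64-encoded SHA-256 digest of a DER-encoded SubjectPublicKeyInfo."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")


def pin_matches(spki_der: bytes, accepted_pins: list[str]) -> bool:
    """Accept the key if its pin equals any configured pin (current or next)."""
    candidate = spki_pin(spki_der)
    return any(hmac.compare_digest(candidate, pin) for pin in accepted_pins)


current_spki = b"placeholder-current-spki"  # stand-ins for real DER bytes
next_spki = b"placeholder-next-spki"
pins = [spki_pin(current_spki), spki_pin(next_spki)]
print(pin_matches(next_spki, pins))
```

Shipping the next cycle's pin before the renewal boundary lets clients keep validating through the overlap period without a coordinated update.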

Trust not established:

- Clients must trust the internal CA root certificate
- CA bundle available at /acme/ca-bundle (HTTPS endpoint)
- Add to system trust store: update-ca-certificates (Linux) or Keychain (macOS)

Interpreting ‘autotls status’ tool output:

Healthy: Certificate valid, Days Left > renewal_threshold
Renewing: Background loop triggered renewal — automatic, no action needed
Failed: Check logs for certificate signing errors
Disabled: auto_tls = false in config

Relationships

Module dependencies and interactions:

  • acme (internal CA): AutoTLS uses the internal ACME CA for certificate signing. When auto_tls = true, ACME CA is automatically enabled. Certificate signing is local — no network round-trip, no ACME protocol overhead.
  • certmanager: AutoTLS stores issued certificates in the certificate manager. The certificate manager handles TLS handshake certificate selection (exact match > wildcard > default).
  • config: Reads hostname, auto_tls, auto_tls_renewal, auto_tls_validity from [service] section. Detects hostname changes for automatic re-issuance.
  • server: Server’s TLS handshake retrieves certificates from certmanager. Priority: static (tls_cert) > AutoTLS > ACME client > error.
  • acmeclient: Alternative to AutoTLS for external CA-signed certificates. Both can coexist; static certs take highest priority.
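
The selection order used during the TLS handshake (exact match > wildcard > default) can be sketched as a lookup over a domain-keyed mapping. This is an illustration, not the certificate manager's code: the dictionary store and the single-label wildcard rule are assumptions.

```python
from typing import Optional


def select_certificate(server_name: str, certs: dict, default: Optional[str] = None):
    """Return the entry for server_name: exact match, then wildcard, then default."""
    name = server_name.lower().rstrip(".")
    if name in certs:
        return certs[name]
    # A wildcard is assumed to cover exactly one left-most label.
    _, sep, parent = name.partition(".")
    if sep and parent:
        wildcard = "*." + parent
        if wildcard in certs:
            return certs[wildcard]
    return default


store = {"api.example.com": "exact-cert", "*.example.com": "wildcard-cert"}
print(select_certificate("api.example.com", store, "default-cert"))
print(select_certificate("www.example.com", store, "default-cert"))
print(select_certificate("other.test", store, "default-cert"))
```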

Logs

Log entries by component. Search with: logs search “autotls”. Levels: ERROR > WARN > INFO > DEBUG.

Init & Lifecycle:

autotls.init ERROR AutoTLS init panic recovered: <detail>
autotls.init INFO Static TLS certificate configured, AutoTLS skipping
autotls.init INFO AutoTLS enabled, issuing <type> certificate
autotls.init INFO Certificate signing failed on startup, retrying
autotls.init ERROR AutoTLS initialization failed, will retry in renewal loop
autotls.init INFO AutoTLS initialized successfully

Certificate Issuance:

autotls.issue INFO Issuing deterministic certificate
autotls.issue WARN Failed to store wildcard certificate, hostname cert is still active
autotls.issue WARN Failed to set default certificate, hostname cert is still active
autotls.issue INFO AutoTLS certificate issued

Renewal:

autotls.renew INFO Manual certificate renewal requested
autotls.renew WARN Hostname changed, issuing certificate for new hostname
autotls.renew INFO Renewing AutoTLS certificate
autotls.renew ERROR AutoTLS certificate renewal failed
autotls.renew INFO AutoTLS certificate renewed successfully

Epoch Parsing:

autotls.epoch WARN invalid epoch "<value>", falling back to default <default>
autotls.epoch WARN ACME CA epoch is in the future, certificate cycle will be 0 until epoch is reached

Metrics

Prometheus metrics emitted by this module:

autotls_issuances_total counter {result=success|failure} Initial certificate issuance on startup (after retry loop)
autotls_renewals_total counter {result=success|failure} Certificate renewal attempts (both automatic renewal loop and manual 'autotls renew')

Certificate state is also observable via the ‘autotls status’ hexdcall command and log entries.


SPIFFE Workload Identity

Issues workload identity certificates for service-to-service mTLS — services authenticate directly to each other

Overview

Issues SPIFFE workload identity certificates so services can authenticate directly to each other via mTLS. Traffic flows service-to-service without routing through the gateway — the gateway issues identities, it doesn’t need to be the data plane. Uses a modified ACME profile where pre-registered workloads receive certificates without domain validation challenges.

Key capabilities:

  • Pre-registration: Workloads configured with public keys in TOML config
  • No challenges: Authorization based on JWK thumbprint matching (RFC 7638)
  • CIDR enforcement: Per-workload and global IP restrictions, re-validated at every operation
  • Rate limiting: Per-workload sliding window (1 hour) with eventual consistency
  • Short-lived certificates: Workload-specific TTL (default 24h, max configurable)
  • SPIFFE URI SAN: spiffe://{hostname}/workload/{identity}
  • AllowedPeers extension: Custom OID (1.3.6.1.4.1.64753.1.1) with JSON peer list
  • CRL and OCSP integration: Certificates include CRL Distribution Point and OCSP responder URL
  • Hot-reload: New/removed/modified workloads applied without restart
  • Workload snapshots: Orders capture config at creation time for zero-downtime updates

ACME endpoints (default prefix /acme/spiffe):

GET /directory ACME directory with endpoint URLs (public)
GET /bundle CA trust bundle in PEM format (public)
GET /tos Terms of Service (public)
HEAD /new-nonce Get replay nonce for JWS requests
POST /new-account Create or retrieve SPIFFE account (JWS required)
POST /new-order Create certificate order, auto-approved (JWS required)
POST /order/{id} Get order status via POST-as-GET (JWS required)
POST /finalize/{id} Submit CSR to finalize order (JWS required)
POST /cert/{id} Download issued certificate via POST-as-GET (JWS required)
POST /revoke-cert Revoke a certificate (JWS required)

Certificate features:

  • SPIFFE URI SAN: spiffe://{hostname}/workload/{identity}
  • Extended Key Usage: Server Authentication + Client Authentication
  • AllowedPeers X.509 extension with authorized peer SPIFFE IDs
  • CRL Distribution Point and OCSP Responder URLs embedded

Config

Core configuration under [spiffe] and [[spiffe.workloads]]:

[spiffe]
enabled = true # Enable SPIFFE workload identity service
path_prefix = "/acme/spiffe" # HTTP endpoint prefix (default: /acme/spiffe)
allowed_cidrs = ["10.0.0.0/8"] # Global CIDR allowlist for all workloads
default_ttl = "24h" # Default certificate TTL
max_ttl = "168h" # Maximum certificate TTL (7 days)
rate_limit_per_workload = 100 # Max certificates per workload per hour
order_timeout = "1h" # Order expiration timeout
allowed_key_algorithms = ["EC-P256", "EC-P384", "RSA-2048", "Ed25519"] # Allowed CSR key algorithms (Ed25519 since 0.9.1)

[[spiffe.workloads]]
identity = "api-backend" # Workload identity name (used in SPIFFE URI)
account_public_key = "-----BEGIN..." # PEM-encoded public key for JWK thumbprint matching
sans = ["api.example.com"] # Allowed DNS SANs for this workload
allowed_peers = ["frontend", "db"] # Peer SPIFFE IDs embedded in AllowedPeers extension
allowed_cidrs = ["10.0.1.0/24"] # Per-workload CIDR restriction (optional, narrows global)
ttl = "4h" # Per-workload TTL override (optional, must be <= max_ttl)
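
How default_ttl, max_ttl, and the per-workload ttl interact can be sketched as follows. The parser handles only single-unit values such as "4h" or "30m", and rejecting a ttl above max_ttl with an error is an assumption based on the "must be <= max_ttl" comment; the function names are hypothetical.

```python
import re
from datetime import timedelta
from typing import Optional

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a single-unit duration such as '24h', '30m' or '45s'."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def effective_ttl(workload_ttl: Optional[str],
                  default_ttl: str = "24h",
                  max_ttl: str = "168h") -> timedelta:
    """Per-workload ttl if set, otherwise default_ttl; never above max_ttl."""
    ttl = parse_duration(workload_ttl or default_ttl)
    if ttl > parse_duration(max_ttl):
        raise ValueError("ttl exceeds max_ttl")
    return ttl


print(effective_ttl("4h"), effective_ttl(None))
```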

JWK thumbprint computation (RFC 7638):

1. Parse DER-encoded public key from account_public_key PEM
2. Convert to canonical JWK: required members only, lexicographically sorted, no whitespace
3. SHA-256 hash of UTF-8 encoded JWK JSON
4. Base64url encode the hash (no padding)

Workload authenticates by signing JWS requests with matching private key.
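
Steps 2-4 can be reproduced with the Python standard library once the key is available as a JWK. The sketch below starts from a JWK dictionary rather than a PEM file, because PEM parsing needs a cryptography library; the coordinate values in the example are placeholders, not a real key.

```python
import base64
import hashlib
import json

# Members RFC 7638 requires for each key type; all other members are ignored.
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
    "OKP": ("crv", "kty", "x"),
}


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 SHA-256 thumbprint, base64url-encoded without padding."""
    members = REQUIRED_MEMBERS[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        separators=(",", ":"),  # no whitespace
        sort_keys=True,         # lexicographic member order
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


example_jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y", "kid": "ignored"}
print(jwk_thumbprint(example_jwk))
```

Optional members such as kid or use do not affect the result, so the same key always maps to the same thumbprint regardless of how the client serializes it.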

JWS verification requirements for authenticated endpoints:

- Algorithm: ES256 (ECDSA P-256) or RS256/RS384/RS512 (RSA 2048+)
- URL field must match request URL
- Nonce: single-use replay protection via Replay-Nonce header
- Signature verified against account public key
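
The header checks above (algorithm, URL, nonce) can be sketched without any cryptography. Signature verification is deliberately omitted, the in-memory nonce set stands in for whatever store the gateway uses, and the returned error names follow RFC 8555 problem types; all class and function names are hypothetical.

```python
import secrets
from typing import Optional

ALLOWED_ALGS = {"ES256", "RS256", "RS384", "RS512"}


class NonceStore:
    """Issues nonces and lets each one be consumed exactly once."""

    def __init__(self) -> None:
        self._live = set()

    def issue(self) -> str:
        nonce = secrets.token_urlsafe(16)
        self._live.add(nonce)
        return nonce

    def consume(self, nonce: str) -> bool:
        if nonce in self._live:
            self._live.remove(nonce)
            return True
        return False


def check_protected_header(header: dict, request_url: str, nonces: NonceStore) -> Optional[str]:
    """Return an ACME error type, or None if the header passes these checks."""
    if header.get("alg") not in ALLOWED_ALGS:
        return "badSignatureAlgorithm"
    if header.get("url") != request_url:
        return "unauthorized"
    if not nonces.consume(header.get("nonce", "")):
        return "badNonce"
    return None  # signature verification against the account key would follow


store = NonceStore()
url = "https://gateway.example/acme/spiffe/new-order"  # illustrative URL
hdr = {"alg": "ES256", "url": url, "nonce": store.issue()}
print(check_protected_header(hdr, url, store), check_protected_header(hdr, url, store))
```

The second call fails with badNonce because the nonce was consumed by the first, which is the replay protection the list describes.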

CSR validation rules:

- Maximum size: 64KB
- CSR must be self-signed (signature verified)
- SANs must match order identifiers exactly
- Key algorithm must be in allowed_key_algorithms
- RSA keys require minimum 2048 bits

Hot-reload behavior:

New workloads: immediately available for account creation and issuance
Removed workloads: in-flight orders (v2) complete via snapshot; new orders blocked
Modified workloads: TTL/CIDR/SAN/peer changes apply to new orders only
Public key changes: old thumbprint orphaned; new account required with new key

Cluster storage:

Accounts: cluster-wide storage with 90-day TTL, quorum required
Orders: cluster-wide storage with order_timeout + 10min buffer, quorum required
Certificates: cluster-wide storage with configurable TTL, quorum required
Rate limits: cluster-wide storage with 2-hour TTL, best-effort eventual consistency

Troubleshooting

Common symptoms and diagnostic steps:

Account creation fails with “unauthorized”:

- JWK thumbprint does not match any configured workload public key
- Verify thumbprint: 'step crypto jwk thumbprint workload-pub.pem'
- Check config: 'spiffe workloads' to list configured workloads
- CIDR mismatch: client IP not in global or per-workload allowed_cidrs
- Check CIDR: 'spiffe check <workload-id>' for workload details

Order creation fails:

- Account deactivated: cannot create new orders
- Rate limit exceeded: per-workload sliding window (1 hour) hit
- SAN validation: requested identifiers not in workload's sans list
- Check status: 'spiffe status' for overall SPIFFE health

Certificate finalization fails:

- CSR too large: maximum 64KB
- CSR signature invalid: CSR must be self-signed
- SAN mismatch: CSR SANs must match order identifiers exactly
- Key algorithm not allowed: check allowed_key_algorithms in config
- RSA key too small: minimum 2048 bits required
- Rate limit re-check: limit may have been reached between order and finalize
- Order expired: check order_timeout setting

Certificate retrieval returns error:

- CIDR re-validation: client IP checked again at retrieval time
- Order not yet valid: certificate issuance is asynchronous (semaphore-limited)
- Issuance queue full: check 'metrics prometheus spiffe_issuance_queue' for queue depth

CIDR enforcement issues:

- Global [spiffe].allowed_cidrs applies to ALL requests
- Per-workload allowed_cidrs narrows the global allowlist
- CIDR is re-validated at every operation (account, order, finalize, retrieve, revoke)
- Check specific IP: 'geo lookup <ip>' for network details
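
The two-level check can be sketched with Python's ipaddress module. Two behaviours here are assumptions rather than documented facts: a missing or empty list at either level is treated as "no restriction at that level", and a configured list whose entries are all invalid denies every request. Invalid entries are skipped, mirroring the "Invalid CIDR in AllowedCIDRs, skipping" log entry.

```python
import ipaddress
from typing import Iterable, Optional


def parse_cidrs(cidrs: Iterable[str]) -> list:
    """Parse CIDR strings, silently skipping entries that are not valid networks."""
    networks = []
    for cidr in cidrs:
        try:
            networks.append(ipaddress.ip_network(cidr, strict=False))
        except ValueError:
            continue
    return networks


def ip_allowed(client_ip: str,
               global_cidrs: Optional[list] = None,
               workload_cidrs: Optional[list] = None) -> bool:
    """Client must pass the global allowlist and, if set, the per-workload one."""
    ip = ipaddress.ip_address(client_ip)
    for configured in (global_cidrs, workload_cidrs):
        if configured and not any(ip in net for net in parse_cidrs(configured)):
            return False
    return True


print(ip_allowed("10.0.1.5", ["10.0.0.0/8"], ["10.0.1.0/24"]))
print(ip_allowed("10.0.2.5", ["10.0.0.0/8"], ["10.0.1.0/24"]))
```

Calling the same function at account, order, finalize, retrieve, and revoke time is one way to get the re-validation at every operation described above.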

Rate limiting unexpectedly blocking:

- Eventual consistency: under high concurrent load from multiple cluster nodes,
limits may be exceeded by up to the number of concurrent requests
- Sliding window is 1 hour; check 'metrics prometheus spiffe_ratelimit' for current usage
- Fail-open: if rate limit state is unavailable, requests are allowed
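
A single-process sketch of a one-hour sliding window with fail-open behaviour. The gateway keeps this state in cluster storage with eventual consistency; the in-memory deque and the class and function names below are stand-ins for illustration only.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowLimiter:
    """Allows at most `limit` events per workload within the trailing window."""

    def __init__(self, limit: int = 100, window_seconds: float = 3600) -> None:
        self.limit = limit
        self.window = window_seconds
        self._events = {}

    def allow(self, workload: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        events = self._events.setdefault(workload, deque())
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= now - self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


def allow_fail_open(limiter, workload: str, now: Optional[float] = None) -> bool:
    """Treat any limiter failure as 'allowed', matching the fail-open note above."""
    try:
        return limiter.allow(workload, now)
    except Exception:
        return True


limiter = SlidingWindowLimiter(limit=2)
print([limiter.allow("api-backend", now=t) for t in (0, 1, 2, 3601)])
```

With several nodes each reading a slightly stale count, two requests can both see one remaining slot and both succeed, which is why the effective limit can be overshot under concurrent load.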

Nonce errors (badNonce):

- Nonces are single-use; retry with fresh nonce from Replay-Nonce response header
- All error responses include a new Replay-Nonce header for immediate retry

Certificate not trusted by peers:

- Trust bundle: GET /acme/spiffe/bundle returns PEM-encoded CA chain
- AllowedPeers: verify peer SPIFFE ID is in allowed_peers list
- OCSP/CRL: check 'certs ocsp' and 'certs crl' for responder status

Integration with cert-manager:

- Use SPIFFE ACME directory URL as ClusterIssuer server
- No challenge solvers needed (SPIFFE auto-approves)
- Account key secret must match configured workload public key
- cert-manager may require empty solvers list or dummy solver

Relationships

Module dependencies and interactions:

  • acme: SPIFFE uses the internal ACME CA for certificate signing. The CA signing operation runs asynchronously after order finalization with a configurable concurrency semaphore (default: 50). CA signing latency tracked via spiffe_ca_signing_duration_ms metric.
  • x509: Certificates include SPIFFE URI SAN, AllowedPeers custom extension, and standard Extended Key Usage (serverAuth + clientAuth). CRL Distribution Point and OCSP Responder URLs are embedded in issued certificates.
  • sessions: No direct dependency. SPIFFE uses JWS-based authentication (not session cookies). Each request is independently authenticated via JWK thumbprint matching.
  • directory: No direct dependency. Workload identity is managed through TOML configuration, not the directory module.
  • storage: Accounts, orders, certificates, and rate limits stored cluster-wide via distributed storage with quorum requirements. Rate limits use eventual consistency.
  • hotreload: Configuration changes detected via file watcher or SIGHUP. Workload snapshots (v2 orders) enable zero-downtime updates during rolling config changes.
  • loadbalancer: No direct dependency. SPIFFE handles its own rate limiting via per-workload sliding window counters in cluster storage.
  • geoaccess: No direct dependency. CIDR enforcement is built into the SPIFFE module itself using per-workload and global allowed_cidrs configuration.

Logs

Log entries by component. Search with: logs search “spiffe”. Levels: ERROR > WARN > INFO > DEBUG. DEBUG requires log level configuration.

Route registration:

spiffe.routes INFO Registering SPIFFE ACME routes
spiffe.routes INFO SPIFFE ACME routes registered successfully

CIDR enforcement:

spiffe.cidr.validate WARN Invalid CIDR in AllowedCIDRs, skipping
spiffe.cidr.blocked WARN AUDIT SPIFFE request blocked by CIDR policy

Error responses:

spiffe.handler.error WARN SPIFFE ACME error response

Metrics

Prometheus metrics. Query with: metrics prometheus spiffe_<name>

Requests:

spiffe_requests_total counter {endpoint, status} Requests per endpoint (status: success/error)
spiffe_request_duration_ms histogram {endpoint} Request latency in milliseconds

Errors:

spiffe_errors_total counter {type, status} ACME error responses by problem type and HTTP status
spiffe_cidr_blocked_total counter (none) Requests blocked by CIDR policy

Endpoint label values for requests_total and request_duration_ms:

directory, new_nonce, new_account, new_order, get_order, finalize, get_certificate,
trust_bundle, revoke_cert, tos

Alerts:

rate(spiffe_errors_total[5m]) > 10 High ACME error rate
rate(spiffe_cidr_blocked_total[5m]) > 0 CIDR policy blocking requests
histogram_quantile(0.99, sum by (le) (rate(spiffe_request_duration_ms_bucket[5m]))) > 5000 P99 latency exceeding 5 seconds