Reliability and High Availability
Cribl Validated Architectures (CVA) treat reliability, high availability (HA), and disaster recovery (DR) as first-class design dimensions. The following guardrails apply across all CVA blueprints.
Worker Node Resiliency
Worker Nodes are the “engine room” of your Cribl deployment, performing the heavy lifting of parsing, reducing, and routing data. To ensure this layer remains resilient, we follow two primary strategies.
The N+1 Redundancy Standard
In a production environment, capacity planning must account for the N+1 rule. For example, if your peak data volume requires four Worker Nodes to process without lag, you must deploy at least five. This provides a “buffer” Worker Node, allowing you to take one Worker Node offline for patching or handle an unexpected crash without breaching Service Level Agreements (SLAs) or causing upstream backpressure. For details, see How Many Worker Nodes?
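To make the arithmetic concrete, the sketch below computes N from peak volume and per-node capacity, then adds the +1 buffer. The throughput figures are placeholders for illustration, not Cribl sizing guidance.

```python
import math

# Hypothetical sizing inputs -- substitute your own measured figures.
peak_ingest_gb_per_day = 4000   # peak daily volume across the Worker Group
per_node_gb_per_day = 1000      # sustained volume one Worker Node handles without lag
headroom = 1.0                  # optional extra margin (1.0 = none beyond N+1)

# N = nodes needed to absorb peak volume; deploy N + 1 so one node can be
# patched or lost without breaching SLAs or creating upstream backpressure.
n = math.ceil(peak_ingest_gb_per_day * headroom / per_node_gb_per_day)
required_nodes = n + 1

print(f"N = {n} Worker Nodes for peak load; deploy at least {required_nodes} (N+1)")
```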
Handling Ingress: Push vs. Pull
Resiliency is implemented differently depending on how data arrives at the Worker Group:
| Method | Resiliency Mechanism | Implementation Detail |
|---|---|---|
| Push Sources (Syslog, HTTP, Splunk HEC) | Network Load Balancing (NLB) | Use a non-sticky Virtual IP (VIP) or a load balancer (LB) to distribute traffic. If a Worker Node fails, the LB stops sending traffic to that IP, and the remaining Worker Nodes absorb the load. |
| Pull Sources (S3, SQS, Azure Monitor) | Leader orchestration | If a Worker Node goes offline, the Leader detects the health failure and automatically reassigns those tasks to healthy Worker Nodes in the Worker Group. |
For details, see Data Resilience and Workload Architecture.
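For push Sources, the LB's health checks determine when a failed Worker Node is taken out of rotation. The sketch below is a generic TCP probe of the kind an NLB performs against a listening Source port; the hostnames and port are placeholders, not Cribl defaults.

```python
import socket

# Hypothetical Worker Node addresses and the TCP port a push Source listens on
# (for example, a Syslog Source on 9514). Replace with your own values.
WORKER_NODES = ["worker-1.example.internal", "worker-2.example.internal"]
SYSLOG_PORT = 9514

def tcp_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection succeeds -- the same style of check
    an NLB uses to decide whether to keep a target in rotation."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for node in WORKER_NODES:
    status = "in rotation" if tcp_healthy(node, SYSLOG_PORT) else "removed from rotation"
    print(f"{node}:{SYSLOG_PORT} -> {status}")
```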
Leader High Availability
In an enterprise environment, the Cribl Leader serves as the central control plane, making its availability critical for configuration management and Worker/Edge Node orchestration.
Treat Git as the system of record for Cribl configuration. All Cribl Stream and Edge configuration should be version-controlled, with branches and promotion workflows enabling backup, rollback, and environment promotion through GitOps practices. For details, see Configuration Management.
For customer-managed control planes that require HA, deploy Leaders in an active/standby pair behind an LB, backed by a shared failover volume (for example, NFS) so that the standby can assume control without configuration drift or state re-seeding. For details, see High Availability Architecture.
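As a minimal illustration, the LB fronting the Leader pair can probe each Leader's API health endpoint and route only to the instance that responds. The hostnames, port, and health path below are assumptions; confirm the endpoint your Cribl version exposes.

```python
import urllib.request

# Hypothetical Leader addresses behind the load balancer. The port and health
# path are assumptions -- adapt them to your deployment.
LEADERS = [
    "https://leader-a.example.internal:9000",
    "https://leader-b.example.internal:9000",
]
HEALTH_PATH = "/api/v1/health"

def leader_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Probe a Leader's health endpoint; the LB should route only to a Leader
    that answers 200, so the standby takes over on failover."""
    try:
        with urllib.request.urlopen(base_url + HEALTH_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

for leader in LEADERS:
    print(f"{leader} -> {'healthy' if leader_healthy(leader) else 'unreachable'}")
```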
Data Durability
In production environments, data loss isn’t just a technical glitch; it’s a compliance and operational risk. For critical data flows, you should always route a raw, durable copy of events to an object store (such as Amazon S3, MinIO, or Azure Blob Storage) before the data ever reaches your analytics tools. This creates a “low-cost, long-term” retention tier that is decoupled from your real-time platforms.
While object stores handle long-term durability, Persistent Queues (PQs) handle immediate, short-term disruptions. Enable PQs on appropriate Sources and Destinations so that when downstream systems are slow or unavailable, data is spooled to disk and drained once backpressure clears, minimizing data loss while honoring latency and storage constraints.
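PQ disk sizing follows directly from the inbound rate and the longest outage you intend to absorb. The sketch below shows the arithmetic with placeholder figures; substitute your own measured rates and compression ratios.

```python
# Rough persistent-queue sizing: disk needed per Worker Node to absorb an
# outage of a given length. All figures are placeholders, not Cribl guidance.
ingest_mb_per_sec_per_node = 20   # sustained inbound rate on one Worker Node
outage_minutes = 60               # longest downstream outage you want to ride out
compression_ratio = 0.3           # assumed on-disk compression for queued data
safety_factor = 1.5               # headroom for bursts and drain overlap

pq_disk_gb = (ingest_mb_per_sec_per_node * 60 * outage_minutes
              * compression_ratio * safety_factor) / 1024

print(f"Provision roughly {pq_disk_gb:.1f} GB of PQ disk per Worker Node")
```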
Data Replay
Treat replay as a first-class operating pattern. Use Cribl Stream Collectors to systematically read from an object-store “landing zone” and re-ingest events on demand. You can replay data for deep forensic investigations, to backfill data after a Destination outage, or to migrate historical logs to a new analytics platform.
To maximize the value of this architecture, you must explicitly size your Object Store Destination retention to meet specific business replay SLAs. For example, a policy might mandate “90 days of replayable, full-fidelity events,” which remains independent of the typically shorter (and more expensive) retention periods in SIEM, logging, or metrics platforms.
This decoupling allows you to strategically trade lower-cost cold storage against high-performance historical reprocessing capabilities, ensuring that you only pay for high-performance indexing for data you need immediately, while maintaining the ability to replay data whenever a retrospective need arises. For details, see Using S3 Storage and Replay and Unified Ingest & Replay Data.
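Translating a replay SLA into object-store capacity is simple arithmetic. The figures below are placeholders meant to show the calculation, not a sizing recommendation.

```python
# Sizing an object-store replay tier against a business SLA such as
# "90 days of replayable, full-fidelity events". Placeholder figures only.
daily_ingest_tb = 2.0          # full-fidelity volume landed per day
replay_retention_days = 90     # business replay SLA
compression_ratio = 0.15       # assumed compression for raw events at rest

landing_zone_tb = daily_ingest_tb * replay_retention_days * compression_ratio
print(f"Object store landing zone needs ~{landing_zone_tb:.0f} TB for the replay SLA")
```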
Disaster Recovery
For enterprise-grade DR, you must ensure that both your control plane and your data stream are replicated across geographic boundaries. This prevents a single regional outage from paralyzing your observability pipeline.
To maintain operational continuity, replicate your Git-backed configuration, including all Cribl Packs and environment settings, to a standby deployment in a secondary region. If your primary region fails, this “Config-as-Code” approach allows you to quickly rehydrate Leader Nodes and Worker Groups without manual re-configuration. For organizations using HA architectures, ensure any shared storage volumes (like NFS) are also mirrored to the DR site to eliminate state loss.
Your data resilience strategy hinges on the availability of your object store. Configure your landing buckets (such as S3 or Azure Blob) with cloud-native cross-region replication and encryption. This ensures that even during a total regional failure, your replayable data, long-term archives, and compliance evidence remain accessible from the secondary region for immediate investigation or backfilling.
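For S3 landing buckets, cross-region replication can be enabled through the bucket replication API. The sketch below is a hedged example using boto3: the bucket names, IAM role, and KMS key are placeholders, both buckets must already exist with versioning enabled, and the role needs replication permissions. Azure Blob Storage offers equivalent object-replication controls.

```python
import boto3

# Enable cross-region replication from the primary landing bucket to a DR
# bucket, keeping replicas encrypted with a KMS key in the DR region.
s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="cribl-landing-primary",            # placeholder source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/cribl-s3-replication",
        "Rules": [
            {
                "ID": "replay-tier-dr",
                "Priority": 1,
                "Filter": {},                  # replicate the entire bucket
                "Status": "Enabled",
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::cribl-landing-dr",
                    "EncryptionConfiguration": {
                        "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:123456789012:key/replace-me"
                    },
                },
                "SourceSelectionCriteria": {
                    "SseKmsEncryptedObjects": {"Status": "Enabled"}
                },
            }
        ],
    },
)
```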
Monitoring
To maintain HA, you must treat Cribl Internal Metrics as a vital telemetry source. By exporting these metrics to an enterprise observability platform or using Cribl’s built-in Monitoring dashboards, you can establish proactive operational monitoring to defend your SLAs. For details, see Operational Monitoring.
Effective alerting should focus on these key health indicators (a rule-of-thumb threshold sketch follows the list):
- Worker/Edge Node health: Monitor for frequent restarts or Node disconnects to identify underlying resource contention.
- PQ management: Track PQ utilization and disk usage to prevent “head-of-line” blocking and data loss.
- Traffic integrity: Set thresholds for throughput deviations, Source idleness, or sudden volume drops that may indicate upstream failures.
- Destination stability: Alert on rising retry counts and delivery failures to mitigate downstream backpressure before it impacts the rest of the pipeline.
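A practical starting point is to express those indicators as explicit thresholds that your alerting system evaluates against exported Cribl Internal Metrics. The metric names and numbers below are placeholders to map to whatever your export pipeline or observability platform actually emits.

```python
# Rule-of-thumb alert thresholds for the indicators above, expressed as data
# so they can be loaded into an alerting system. Names and values are
# placeholders, not Cribl metric names.
ALERT_RULES = {
    "worker_restarts_per_hour":    {"warn": 1,  "crit": 3},    # Node health
    "pq_disk_used_pct":            {"warn": 60, "crit": 85},   # PQ management
    "throughput_drop_pct_vs_7d":   {"warn": 30, "crit": 60},   # traffic integrity
    "destination_retries_per_min": {"warn": 50, "crit": 200},  # Destination stability
}

def evaluate(metric: str, value: float) -> str:
    """Return 'ok', 'warn', or 'crit' for an observed metric value."""
    rule = ALERT_RULES[metric]
    if value >= rule["crit"]:
        return "crit"
    if value >= rule["warn"]:
        return "warn"
    return "ok"

print(evaluate("pq_disk_used_pct", 72))   # -> 'warn'
```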
Pipeline Reliability
For Pipeline reliability, you must actively track transformation and routing failures, such as parse errors and malformed payloads, using the Cribl Internal Logs and troubleshooting views.
For environments requiring high data integrity, use JSON Schemas to validate events during processing. This allows you to detect schema violations as events are processed, route invalid events to dead-letter or quarantine Destinations, and feed those errors into operational dashboards for analysis and remediation. For details, see Processing Failure and Recovery.
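The sketch below illustrates the validate-and-quarantine pattern conceptually in Python using the jsonschema library. Inside Cribl Stream itself the equivalent is configured with JSON Schemas and Routes rather than external code, and the schema, field names, and Destination names here are purely illustrative.

```python
from jsonschema import Draft7Validator  # pip install jsonschema

# Illustrative event schema: required fields every event must carry.
EVENT_SCHEMA = {
    "type": "object",
    "required": ["timestamp", "host", "message"],
    "properties": {
        "timestamp": {"type": "string"},
        "host": {"type": "string"},
        "message": {"type": "string"},
    },
}
validator = Draft7Validator(EVENT_SCHEMA)

def route(event: dict) -> str:
    """Send valid events to the analytics Destination and invalid ones to a
    dead-letter/quarantine Destination, carrying the validation errors along
    so dashboards can surface them."""
    errors = [e.message for e in validator.iter_errors(event)]
    if errors:
        event["_schema_errors"] = errors
        return "quarantine-destination"
    return "analytics-destination"

# Missing 'message' -> routed to quarantine with the error attached.
print(route({"timestamp": "2024-01-01T00:00:00Z", "host": "web-1"}))
```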