High Availability Requirements
Before adding a standby Leader, ensure that you fulfill all requirements in this section.
Disk Space Requirements
Ensure that all Leaders, both the primary and the standby ones, have enough disk space available.
You need enough space to hold the contents of the failover volume (except the state directory),
all your configuration files, plus at least 10 GB in reserve for future use.
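As a rough sanity check, you can estimate the required space from the command line. This is a minimal sketch; the failover volume and Cribl install paths below are placeholders to substitute with your own, and it assumes GNU `du`.

```shell
#!/bin/sh
# Estimate the disk space each Leader needs before enabling HA.
# FAILOVER_VOLUME and CRIBL_HOME are example paths -- adjust them.
FAILOVER_VOLUME="${FAILOVER_VOLUME:-/opt/cribl_failover}"
CRIBL_HOME="${CRIBL_HOME:-/opt/cribl}"
RESERVE_GB=10

# Failover volume contents, excluding the state directory
failover_gb=$(du -s --exclude=state -BG "$FAILOVER_VOLUME" 2>/dev/null | cut -f1 | tr -d 'G')
# Local configuration files
config_gb=$(du -s -BG "$CRIBL_HOME/local" 2>/dev/null | cut -f1 | tr -d 'G')

required_gb=$(( ${failover_gb:-0} + ${config_gb:-0} + RESERVE_GB ))
echo "Required: at least ${required_gb} GB free on each Leader"
```

Compare the printed figure against the free space reported by `df` on each Leader.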
Leader Health Check Timeouts
Configure an adequate timeout for Leader health checks to ensure the Leader has enough time to pull all the configuration and the git repository from the failover volume. A timeout that is too low could result in the Leader being terminated prematurely.
The recommended timeout scales with the number of Worker Groups and Fleets the Leader manages.
The exact value depends on the size of the git repository,
size of the groups directory, and the total number of configuration files.
Approximate recommendations are:
- 2 minutes for smaller deployments
- 5 minutes for deployments with > 50 Worker Groups/Fleets
- 10 minutes for deployments with > 100 Worker Groups/Fleets
Git Configuration Requirements
If you have a git timeout configured in local/cribl.yml,
ensure that it is equal to or greater than the system default of 10 minutes.
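You can inspect the configured value from the command line. This sketch assumes the timeout lives on a line containing `timeout` under a top-level `git:` section of local/cribl.yml; the key name and install path are assumptions to adjust for your deployment.

```shell
# Print any git timeout setting from local/cribl.yml.
# The file path is an example install location.
CRIBL_YML="${CRIBL_YML:-/opt/cribl/local/cribl.yml}"
if [ -f "$CRIBL_YML" ]; then
  # Print indented lines containing "timeout" inside the git: section
  awk '/^git:/{in_git=1; next} /^[^[:space:]]/{in_git=0} in_git && /timeout/' "$CRIBL_YML"
fi
```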
Auth Token Requirements
All Leaders must have matching auth tokens. If you configure a custom Auth token, make sure that all Leaders have that same token.
- In Cribl Stream, check that the values of Settings > Global > Distributed Settings > Leader Settings > Auth token match for each Leader.
- Or, from the filesystem, check that the values in `instance.yml` > `master` section > `authToken` match for each Leader.
See How to Secure the Auth Token for the Leader Node for information on changing your auth token and allowed characters.
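To compare tokens from the filesystem, you can print the value on each Leader and diff the results. This is a sketch; the `instance.yml` path is an example install location, and the fake token in the comment is illustrative only.

```shell
# Print this Leader's configured auth token so you can compare it
# across all Leaders. INSTANCE_YML is an example path -- adjust it.
INSTANCE_YML="${INSTANCE_YML:-/opt/cribl/local/_system/instance.yml}"
if [ -f "$INSTANCE_YML" ]; then
  # authToken lives under the "master:" section of instance.yml
  sed -n '/^master:/,/^[^[:space:]]/p' "$INSTANCE_YML" | grep authToken
fi
```

Run the same command on every Leader (for example, over SSH) and confirm the output is identical.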
NFS Requirements
- On all Leader Nodes, use the latest version of the NFS client. NFSv4 is required.
- Ensure that the NFS volume has at least 100 GB available disk space.
- Ensure that the NFS volume’s IOPS (Input/Output Operations per Second) is ≥ 200. (Lower IOPS values can cause excessive latency.)
- Ensure that ping/latency between the Leader Nodes and NFS is < 50 ms.
- Ensure your NFS system supports updating `mtime`.
You can validate the NFS latency using a tool like `ioping`. Navigate to the NFS mount, and enter the following command: `ioping .` For details on this particular option, see the ioping docs.
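For a quick check against the < 50 ms target, you can take a fixed number of samples. The mount point below is a placeholder, and the script skips gracefully when `ioping` is unavailable.

```shell
# Sample NFS latency from the mount point. /mnt/cribl-nfs is an
# example path -- substitute your actual NFS mount.
NFS_MOUNT="${NFS_MOUNT:-/mnt/cribl-nfs}"
if command -v ioping >/dev/null 2>&1 && [ -d "$NFS_MOUNT" ]; then
  # Take 10 latency samples; the reported average should stay well under 50 ms
  ioping -c 10 "$NFS_MOUNT"
else
  echo "skipping latency check: ioping not installed or $NFS_MOUNT not mounted"
fi
```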
NFS Mount Option Requirements
The Leader Node will access large numbers of files whenever you use the UI or deploy configurations to Worker Nodes. When this happens, NFS’s default behavior is to synchronize access time updates for those files, often across multiple availability zones and/or regions. To avoid the problematic latency that this can introduce, Cribl recommends that you add one of the following NFS mount options:
- `relatime`: Update the access time only if it is more than 24 hours old, or if the file is being created or modified. This allows you to track general file usage without introducing significant latency in Cribl Stream. To do the same for folders, add the `reldiratime` option.
- `noatime`: Never update the access time. (Cribl Stream does not need access times to be updated to operate correctly.) This is the most performant option, but you will be unable to see which files are being accessed. To do the same for folders, add the `nodiratime` option.
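These options are typically set in /etc/fstab. The entries below are illustrative only; the server, export, and mount point are placeholders for your own values.

```shell
# /etc/fstab -- example NFS entries (placeholder server and paths)

# Most performant: never update access times for files or folders
nfs.example.com:/cribl  /mnt/cribl-nfs  nfs4  noatime,nodiratime  0 0

# Alternative: coarse access-time tracking instead
# nfs.example.com:/cribl  /mnt/cribl-nfs  nfs4  relatime,reldiratime  0 0
```

After editing /etc/fstab, remount the volume for the options to take effect.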
Recommended NFS Configuration
If you are on AWS, we recommend the following:
- Use Amazon’s Elastic File System (AWS EFS) for your NFS storage.
- Ensure that the user running Cribl Stream has read/write access to the mount point.
- Configure the EFS Throughput mode to `Enhanced` > `Elastic`.
- For details on NFS mount options, see Recommended NFS mount options.
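If you use AWS EFS, you can confirm the throughput mode from the AWS CLI. This is a sketch that assumes the AWS CLI is installed and credentialed; the file system ID is a placeholder.

```shell
# Query the EFS throughput mode. FS_ID is a placeholder -- replace
# it with your file system ID. Requires AWS credentials.
FS_ID="${FS_ID:-fs-0123456789abcdef0}"
mode=$(aws efs describe-file-systems --file-system-id "$FS_ID" \
  --query 'FileSystems[0].ThroughputMode' --output text 2>/dev/null || true)
# If configured as recommended, this prints "elastic"
echo "Throughput mode: ${mode:-unknown (could not query EFS)}"
```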
For best performance, place your Leader Nodes in the same geographic region as the NFS storage. If the Leader and NFS are distant from each other, you might run into the following issues:
- Latency in UI and/or API access.
- Missing metrics between Leader restarts.
- Slower performance on data Collectors.
Load Balancer Requirements
Configure all Leaders behind a load balancer.
- Port `4200` must be exposed via a network load balancer.
- Port `9000` can be exposed via an application load balancer or network load balancer.
Active-Active (Proxy) Mode for HA Leaders
Both primary and standby Leaders use an Active-Active (Proxy) configuration to simplify load balancer (LB) routing. All Leaders can receive UI/API traffic on port 9000, with the standby Leader automatically proxying any incoming requests to the elected primary. This architecture maintains crucial single-writer semantics, as only the primary Leader owns the state.
For proxy mode to work, the standby must be able to reach the primary over port 9000 to forward UI/API traffic. The control-plane connections on port 4200 are also proxied. Ensure firewall rules allow Leader-to-Leader communication on these paths.
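You can verify that these paths are open from the standby Leader. This sketch uses `nc` to probe both ports; the primary's host name is a placeholder.

```shell
# From the standby Leader, verify reachability of the primary on the
# UI/API port (9000) and control-plane port (4200).
# PRIMARY is a placeholder host name -- substitute your primary Leader.
PRIMARY="${PRIMARY:-leader1.example.com}"
for port in 9000 4200; do
  if nc -z -w 3 "$PRIMARY" "$port" 2>/dev/null; then
    echo "port $port: reachable"
  else
    echo "port $port: not reachable -- check firewall rules (or nc missing)"
  fi
done
```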
Load Balancer Health Checks
Health checks over HTTP/HTTPS via the /health endpoint are only supported on port 9000. Load balancers that support such health checks include:
- Amazon Web Services (AWS) Network Load Balancer (NLB). Suitable for TCP, UDP, and TLS traffic.
- AWS Application Load Balancer (ALB). Application-aware, suitable for HTTP/HTTPS traffic.
- HAProxy.
- NGINX Plus.
To ensure reliable load balancing for your Leader Nodes, configure health checks against the /health endpoint. For optimal performance and to avoid potential throttling, adhere to the following guidelines:
- Polling frequency: Set your load balancer to check the health endpoint every `60` seconds. This interval aligns with Cribl.Cloud's internal monitoring and prevents excessive API calls.
- Avoid overly frequent checks: Polling more often than every `60` seconds (for example, every `5` seconds) can overload the Leader's API Process.
- Load balancer routing: Configure your load balancer to direct traffic exclusively to Leader Nodes that return a `200` status code from the health check.
For detailed information on the /health endpoint query and response formats, refer to the Query the Health Endpoint documentation.
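To see what your load balancer sees, you can query the endpoint manually. This sketch assumes the /health endpoint described above is served over HTTPS on port 9000; the host name is a placeholder, and `-k` is only needed with a self-signed certificate.

```shell
# Query a Leader's health endpoint, as a load balancer would.
# LEADER is a placeholder host name -- substitute your own.
LEADER="${LEADER:-leader1.example.com}"
code=$(curl -sk -o /dev/null -w '%{http_code}' --connect-timeout 5 \
  "https://$LEADER:9000/health" 2>/dev/null || true)
echo "health check returned HTTP ${code:-000}"
# Route traffic to this Leader only if the code is 200
```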
AWS Network Load Balancers
If you need to access the same target through a Network Load Balancer, use an IP-based target group and deactivate client IP preservation. For details, see:
Source-Level Health Checks
For many HTTP-based Sources, you can enable a Source-level health check endpoint in the Advanced Settings tab. Load balancers can send periodic test requests to these endpoints, and a 200 OK response indicates that the Source is healthy.
Frequent requests to Source-level health check endpoints can trigger throttling settings. In that case, Cribl returns a 503 Service Unavailable response, which can be misinterpreted as a service failure: the 503 may simply mean that the health checks are running too frequently, not that the Source is down.