High Availability Requirements
Before adding a standby Leader, ensure that you fulfill all requirements in this section.
Disk Space Requirements
Ensure that all Leaders, both the primary and the standby ones, have enough disk space available.
You need enough space to hold the contents of the failover volume (except the state directory),
all your configuration files, plus at least 10 GB in reserve for future use.
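As a rough sanity check, you can estimate the required space from the command line. This is a minimal sketch; the failover volume and Cribl install paths below are placeholders to substitute with your own, and it assumes GNU `du`.

```shell
#!/bin/sh
# Estimate the disk space each Leader needs before enabling HA.
# FAILOVER_VOLUME and CRIBL_HOME are example paths -- adjust them.
FAILOVER_VOLUME="${FAILOVER_VOLUME:-/opt/cribl_failover}"
CRIBL_HOME="${CRIBL_HOME:-/opt/cribl}"
RESERVE_GB=10

# Failover volume contents, excluding the state directory
failover_gb=$(du -s --exclude=state -BG "$FAILOVER_VOLUME" 2>/dev/null | cut -f1 | tr -d 'G')
# Local configuration files
config_gb=$(du -s -BG "$CRIBL_HOME/local" 2>/dev/null | cut -f1 | tr -d 'G')

required_gb=$(( ${failover_gb:-0} + ${config_gb:-0} + RESERVE_GB ))
echo "Required: at least ${required_gb} GB free on each Leader"
```

Compare the printed figure against the free space reported by `df` on each Leader.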
Leader Health Check Timeouts
Configure an adequate timeout for Leader health checks to ensure the Leader has enough time to pull all the configuration and the git repository from the failover volume. A timeout that is too low could result in the Leader being terminated prematurely.
The recommended timeout scales with the number of Worker Groups and Fleets the Leader manages.
The exact value depends on the size of the git repository,
size of the groups directory, and the total number of configuration files.
Approximate recommendations are:
- 2 minutes for smaller deployments
- 5 minutes for deployments with > 50 Worker Groups/Fleets
- 10 minutes for deployments with > 100 Worker Groups/Fleets
Git Configuration Requirements
If you have a git timeout configured in local/cribl.yml,
ensure that it is equal to or greater than the system default of 10 minutes.
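You can inspect the configured value from the command line. This sketch assumes the timeout lives on a line containing `timeout` under a top-level `git:` section of local/cribl.yml; the key name and install path are assumptions to adjust for your deployment.

```shell
# Print any git timeout setting from local/cribl.yml.
# The file path is an example install location.
CRIBL_YML="${CRIBL_YML:-/opt/cribl/local/cribl.yml}"
if [ -f "$CRIBL_YML" ]; then
  # Print indented lines containing "timeout" inside the git: section
  awk '/^git:/{in_git=1; next} /^[^[:space:]]/{in_git=0} in_git && /timeout/' "$CRIBL_YML"
fi
```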
Auth Token Requirements
All Leaders must have matching auth tokens. If you configure a custom Auth token, make sure that all Leaders have that same token.
- In Cribl Stream, check that the values of Settings > Global > Distributed Settings > Leader Settings > Auth token match for each Leader.
- Or, from the filesystem, check that the values in `instance.yml` > `master` section > `authToken` match for each Leader.
See How to Secure the Auth Token for the Leader Node for information on changing your auth token and allowed characters.
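To compare tokens from the filesystem, you can print the value on each Leader and diff the results. This is a sketch; the `instance.yml` path is an example install location, and the fake token in the comment is illustrative only.

```shell
# Print this Leader's configured auth token so you can compare it
# across all Leaders. INSTANCE_YML is an example path -- adjust it.
INSTANCE_YML="${INSTANCE_YML:-/opt/cribl/local/_system/instance.yml}"
if [ -f "$INSTANCE_YML" ]; then
  # authToken lives under the "master:" section of instance.yml
  sed -n '/^master:/,/^[^[:space:]]/p' "$INSTANCE_YML" | grep authToken
fi
```

Run the same command on every Leader (for example, over SSH) and confirm the output is identical.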
NFS Requirements
- On all Leader Nodes, use the latest version of the NFS client. NFSv4 is required.
- Ensure that the NFS volume has at least 100 GB available disk space.
- Ensure that the NFS volume’s IOPS (Input/Output Operations per Second) is ≥ 200. (Lower IOPS values can cause excessive latency.)
- Ensure that ping/latency between the Leader Nodes and NFS is < 50 ms.
- Ensure your NFS system supports updating `mtime`.
You can validate the NFS latency using a tool like `ioping`. Navigate to the NFS mount, and enter the following command: `ioping .` For details on this particular option, see the ioping docs.
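For a quick check against the < 50 ms target, you can take a fixed number of samples. The mount point below is a placeholder, and the script skips gracefully when `ioping` is unavailable.

```shell
# Sample NFS latency from the mount point. /mnt/cribl-nfs is an
# example path -- substitute your actual NFS mount.
NFS_MOUNT="${NFS_MOUNT:-/mnt/cribl-nfs}"
if command -v ioping >/dev/null 2>&1 && [ -d "$NFS_MOUNT" ]; then
  # Take 10 latency samples; the reported average should stay well under 50 ms
  ioping -c 10 "$NFS_MOUNT"
else
  echo "skipping latency check: ioping not installed or $NFS_MOUNT not mounted"
fi
```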
NFS Mount Option Requirements
The Leader Node will access large numbers of files whenever you use the UI or deploy configurations to Worker Nodes. When this happens, NFS’s default behavior is to synchronize access time updates for those files, often across multiple availability zones and/or regions. To avoid the problematic latency that this can introduce, Cribl recommends that you add one of the following NFS mount options:
- `relatime`: Update the access time only if it is more than 24 hours old, or if the file is being created or modified. This allows you to track general file usage without introducing significant latency in Cribl Stream. To do the same for folders, add the `reldiratime` option.
- `noatime`: Never update the access time. (Cribl Stream does not need access times to be updated to operate correctly.) This is the most performant option, but you will be unable to see which files are being accessed. To do the same for folders, add the `nodiratime` option.
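These options are typically set in /etc/fstab. The entries below are illustrative only; the server, export, and mount point are placeholders for your own values.

```shell
# /etc/fstab -- example NFS entries (placeholder server and paths)

# Most performant: never update access times for files or folders
nfs.example.com:/cribl  /mnt/cribl-nfs  nfs4  noatime,nodiratime  0 0

# Alternative: coarse access-time tracking instead
# nfs.example.com:/cribl  /mnt/cribl-nfs  nfs4  relatime,reldiratime  0 0
```

After editing /etc/fstab, remount the volume for the options to take effect.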
Recommended NFS Configuration
If you are on AWS, we recommend the following:
- Use Amazon’s Elastic File System (AWS EFS) for your NFS storage.
- Ensure that the user running Cribl Stream has read/write access to the mount point.
- Configure the EFS Throughput mode to `Enhanced` > `Elastic`.
- For details on NFS mount options, see Recommended NFS mount options.
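If you use AWS EFS, you can confirm the throughput mode from the AWS CLI. This is a sketch that assumes the AWS CLI is installed and credentialed; the file system ID is a placeholder.

```shell
# Query the EFS throughput mode. FS_ID is a placeholder -- replace
# it with your file system ID. Requires AWS credentials.
FS_ID="${FS_ID:-fs-0123456789abcdef0}"
mode=$(aws efs describe-file-systems --file-system-id "$FS_ID" \
  --query 'FileSystems[0].ThroughputMode' --output text 2>/dev/null || true)
# If configured as recommended, this prints "elastic"
echo "Throughput mode: ${mode:-unknown (could not query EFS)}"
```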
For best performance, place your Leader Nodes in the same geographic region as the NFS storage. If the Leader and NFS are distant from each other, you might run into the following issues:
- Latency in UI and/or API access.
- Missing metrics between Leader restarts.
- Slower performance on data Collectors.
Load Balancer Requirements
Configure all Leaders behind a load balancer.
- Port `4200` must be exposed via a network load balancer.
- Port `9000` can be exposed via an application load balancer or network load balancer.
Active-Active (Proxy) Mode for HA Leaders
Both primary and standby Leaders use an Active-Active (Proxy) configuration to simplify load balancer (LB) routing. All Leaders can receive UI/API traffic on port 9000, with the standby Leader automatically proxying any incoming requests to the elected primary. This architecture maintains crucial single-writer semantics, as only the primary Leader owns the state.
For proxy mode to work, the standby must be able to reach the primary over port 9000 to forward UI/API traffic. The control-plane connections on port 4200 are also proxied. Ensure firewall rules allow Leader-to-Leader communication on these paths.
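You can verify that these paths are open from the standby Leader. This sketch uses `nc` to probe both ports; the primary's host name is a placeholder.

```shell
# From the standby Leader, verify reachability of the primary on the
# UI/API port (9000) and control-plane port (4200).
# PRIMARY is a placeholder host name -- substitute your primary Leader.
PRIMARY="${PRIMARY:-leader1.example.com}"
for port in 9000 4200; do
  if nc -z -w 3 "$PRIMARY" "$port" 2>/dev/null; then
    echo "port $port: reachable"
  else
    echo "port $port: not reachable -- check firewall rules (or nc missing)"
  fi
done
```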
Load Balancer Health Checks
Health checks over HTTP/HTTPS via the /health endpoint are only supported on port 9000. Load balancers that support such health checks include:
- Amazon Web Services (AWS) Network Load Balancer (NLB). Suitable for TCP, UDP, and TLS traffic.
- AWS Application Load Balancer (ALB). Application-aware, suitable for HTTP/HTTPS traffic.
- HAProxy.
- NGINX Plus.
To ensure reliable load balancing for your Leader Nodes, configure health checks against the /health endpoint. For optimal performance and to avoid potential throttling, adhere to the following guidelines:
- Polling frequency: Set your load balancer to check the health endpoint every `60` seconds. This interval aligns with Cribl.Cloud's internal monitoring and prevents excessive API calls.
- Avoid overly frequent checks: Polling more often than every `60` seconds (for example, every `5` seconds) can overload the Leader's API Process.
- Load balancer routing: Configure your load balancer to direct traffic exclusively to Leader Nodes that return a `200` status code from the health check.
For detailed information on the /health endpoint query and response formats, refer to the Query the Health Endpoint documentation.
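To see what your load balancer sees, you can query the endpoint manually. This sketch assumes the /health endpoint described above is served over HTTPS on port 9000; the host name is a placeholder, and `-k` is only needed with a self-signed certificate.

```shell
# Query a Leader's health endpoint, as a load balancer would.
# LEADER is a placeholder host name -- substitute your own.
LEADER="${LEADER:-leader1.example.com}"
code=$(curl -sk -o /dev/null -w '%{http_code}' --connect-timeout 5 \
  "https://$LEADER:9000/health" 2>/dev/null || true)
echo "health check returned HTTP ${code:-000}"
# Route traffic to this Leader only if the code is 200
```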
AWS Network Load Balancers
If you need to access the same target through a Network Load Balancer, use an IP-based target group and deactivate client IP preservation. For details, see:
Source-Level Health Checks
For many HTTP-based Sources, you can enable a Source-level health check endpoint in the Advanced Settings tab. Load balancers can send periodic test requests to these endpoints, and a 200 OK response indicates that the Source is healthy.
Frequent requests to Source-level health check endpoints can trigger throttling settings. In that case, Cribl returns a 503 Service Unavailable response, which can be misinterpreted as a service failure: the 503 may simply mean that the health checks are running too frequently, not that the Source is down.