Configure Standby Leader Nodes
To prepare your deployment for High Availability, specify the same failover volume (a shared Network File System, or NFS, volume) in the configuration of every Leader, both primary and standby.
You can configure standby Leader Nodes through the UI, the YAML config file, the command line, or environment variables. These options are similar to those for configuring the primary Leader Node.
Remember, the $CRIBL_VOLUME_DIR environment variable overrides $CRIBL_HOME.
Use the UI
For all Leader Nodes, both primary and standby, configure the following:
- In Settings > Global > Distributed Settings > General Settings, select Mode: Leader.
- Next, on the Leader Settings left tab, select Resiliency: Failover. This exposes several additional fields.
- In the Failover volume field, enter the NFS directory that the configurations and git commits will be replicated to. This directory must be outside of $CRIBL_HOME. One valid solution is to use CRIBL_DIST_MASTER_FAILOVER_VOLUME=<shared_dir>. See Using Environment Variables for more information.
- Optionally, adjust the Lease refresh period setting, which determines how often the primary Leader tries to refresh its hold on the Lease file, and the Missed refresh limit, which determines how many Lease refresh periods elapse before standby Nodes attempt to promote themselves to primary. Increase these two values to avoid unnecessary failovers if your NFS has higher latency or is occasionally slow.
- Select Save to restart.
After you save the Resiliency: Failover setting, further Distributed Settings changes via the UI are locked for both the primary and standby Leader. (This prevents errors in bootstrapping Workers due to incomplete token synchronization between the two Leaders.) However, you can still update each Leader’s distributed settings by modifying its configuration files.
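As a rough illustration of how the two Lease settings interact, the sketch below estimates how long a primary Leader can fail to refresh the Lease before standby Nodes attempt promotion. It assumes the window is simply the Lease refresh period multiplied by the Missed refresh limit, as the description above suggests. The function name and the sample values are hypothetical, not product defaults.

```shell
#!/bin/sh
# Rough estimate only: seconds of missed Lease refreshes before a standby
# attempts promotion, assuming window = refresh period x missed refresh limit.
failover_window_secs() {
  period_secs=$1    # Lease refresh period, in seconds (sample value)
  missed_limit=$2   # Missed refresh limit (sample value)
  echo $(( period_secs * missed_limit ))
}

failover_window_secs 5 3     # prints 15
failover_window_secs 10 6    # prints 60
```

Raising either value widens the window, which tolerates a slow NFS at the cost of a slower reaction to a genuine outage.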
Use the YAML Config File
For all Leader Nodes, both primary and standby, configure the following in $CRIBL_HOME/local/_system/instance.yml, under the distributed section:
- Set resiliency to failover.
- Specify a volume for the NFS disk to automatically add to the Leader Failover cluster and trigger automated data migration.
distributed:
  mode: master
  master:
    host: <IP or 0.0.0.0>
    port: 4200
    resiliency: failover
    failover:
      volume: /path/to/nfs

Note that instance.yml configs are local, not on the shared NFS volume.
Use the Command Line
For all Leader Nodes, both primary and standby, use a command of this form:
./cribl mode-master -r failover -v /tmp/shared
For all options, see the CLI Reference.
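Because the failover volume must be outside of $CRIBL_HOME, a quick pre-flight check before running the command can catch a misplaced path. The helper below is a minimal sketch: its name is hypothetical, it compares path strings only (it does not resolve symlinks or confirm that the directory is an NFS mount), and the /opt/cribl fallback is an assumed install location.

```shell
#!/bin/sh
# Returns 0 if the candidate volume ($1) is neither CRIBL_HOME ($2)
# nor a subdirectory of it. String comparison only; symlinks are not resolved.
is_outside_cribl_home() {
  volume=${1%/}   # drop one trailing slash, if present
  home=${2%/}
  case "$volume/" in
    "$home"/*) return 1 ;;   # same directory or nested under CRIBL_HOME
    *)         return 0 ;;
  esac
}

# Usage: /tmp/shared matches the command above; /opt/cribl is an assumed default.
if is_outside_cribl_home /tmp/shared "${CRIBL_HOME:-/opt/cribl}"; then
  echo "failover volume path is outside CRIBL_HOME"
fi
```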
Use Environment Variables
For all Leader Nodes, both primary and standby, configure environment variables listed in Environment Variables Reference: Adding Fallback Leaders.
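A minimal sketch of the environment-variable approach might look like the following. CRIBL_DIST_MASTER_FAILOVER_VOLUME appears earlier on this page. CRIBL_DIST_MASTER_RESILIENCY is assumed here to be the matching variable for the Resiliency setting, and /tmp/shared is a placeholder path. Verify both names against the referenced list before relying on them.

```shell
#!/bin/sh
# Set on every Leader (primary and standby) before starting Cribl.
# Assumption: confirm variable names in the Environment Variables Reference.
export CRIBL_DIST_MASTER_RESILIENCY=failover
export CRIBL_DIST_MASTER_FAILOVER_VOLUME=/tmp/shared   # placeholder NFS mount
```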
Address HA Migration Timeouts
When enabling Leader High Availability (HA) on a production system with a lot of data, or when booting up new Leader Nodes, the process of moving data can sometimes take longer than expected. If the system runs out of time during this move, it might only copy part of the data, which can cause problems when the system restarts.
To mitigate the risk of incomplete data transfers, temporarily increase the service timeout prior to enabling HA.
You can do this by adjusting the TimeoutSec setting in the cribl.service systemd unit file.
Locate the Service section and add or modify the TimeoutSec directive.
For example, set it to 600 seconds (10 minutes):
[Service]
TimeoutSec=600

A successful data transfer will be confirmed by specific messages in the system’s log, such as:
{"time":"2025-02-17T22:01:46.295Z","cid":"api","channel":"ResiliencyConfigs","level":"info","message":"failover migration is complete"}
A failed migration due to timeout might show logs ending with a system shutdown, like this:
{"time":"2025-02-06T15:29:15.416Z","cid":"api","channel":"ShutdownMgr","level":"info","message":"Starting shutdown ...","reason":"Got SIGTERM","timeout":0}
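One way to check for these two outcomes is to search the log for the messages shown above. The function below is a sketch, not an official tool: its name is hypothetical, the log file location is left to the caller, and "interrupted" only means that a SIGTERM shutdown was logged without a completion message. That can also happen during an ordinary restart.

```shell
#!/bin/sh
# Prints "complete", "interrupted", or "unknown" for the log file given as $1.
migration_status() {
  log_file=$1
  if grep -q '"message":"failover migration is complete"' "$log_file"; then
    echo complete
  elif grep -q '"channel":"ShutdownMgr".*"reason":"Got SIGTERM"' "$log_file"; then
    echo interrupted
  else
    echo unknown
  fi
}

# Usage (the path is a placeholder; point it at your Leader's log file):
# migration_status /path/to/cribl.log
```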
After the HA setup is complete, revert the TimeoutSec setting to its original value. Leaving the timeout extended indefinitely can lead to prolonged Leader downtime if a problem occurs during a future failover. Make sure to document the original setting before making changes to facilitate easy reversion.
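To record the original value before editing, something like the following sketch could be used. The function name is hypothetical and the unit file location varies by installation, so the path is passed in as an argument. Empty output means the file sets no TimeoutSec directive, in which case reverting means removing the line you added.

```shell
#!/bin/sh
# Prints the value of the last TimeoutSec= directive in the unit file ($1).
# Prints nothing if the directive is absent.
current_timeout_sec() {
  unit_file=$1
  sed -n 's/^TimeoutSec=//p' "$unit_file" | tail -n 1
}

# Usage (placeholder path; substitute the location of your cribl.service):
# current_timeout_sec /path/to/cribl.service
```

After changing or reverting a unit file, systemd generally needs `systemctl daemon-reload` before the new value takes effect.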