Configure Standby Leader Nodes

To prepare your deployment for High Availability, you must specify the same failover volume, a shared Network File System (NFS) volume, in the configuration of every Leader, both primary and standby.

You can configure standby Leader Nodes in any of the following ways; the options mirror those for configuring the primary Leader Node:

Remember, the $CRIBL_VOLUME_DIR environment variable overrides $CRIBL_HOME.

Use the UI

For all Leader Nodes, both primary and standby, configure the following:

  1. In Settings > Global > Distributed Settings > General Settings, select Mode: Leader.

  2. Next, on the Leader Settings left tab, select Resiliency: Failover. This exposes several additional fields.

  3. In the Failover volume field, enter the NFS directory to which configurations and git commits will be replicated. This directory must be outside of $CRIBL_HOME. One valid approach is to set CRIBL_DIST_MASTER_FAILOVER_VOLUME=<shared_dir>. See Using Environment Variables for more information.

  4. Optionally, adjust the Lease refresh period setting, which determines how often the primary Leader tries to refresh its hold on the Lease file, and the Missed refresh limit, which determines how many Lease refresh periods elapse before standby Nodes attempt to promote themselves to primary.

    Increase these two values to avoid unnecessary failovers if your NFS has higher latency or is occasionally slow.

  5. Select Save to restart.

When you save the Resiliency: Failover setting, further Distributed Settings changes via the UI are locked for both the primary and standby Leaders. (This prevents errors when bootstrapping Workers due to incomplete token synchronization between the two Leaders.) However, you can still update each Leader’s distributed settings by modifying its configuration files.

Use the YAML Config File

For all Leader Nodes, both primary and standby, configure the following in $CRIBL_HOME/local/_system/instance.yml, under the distributed section:

  1. Set resiliency to failover.
  2. Specify the NFS volume; this automatically adds the Leader to the Failover cluster and triggers automated data migration.
$CRIBL_HOME/local/_system/instance.yml
distributed:
  mode: master
  master:
    host: <IP or 0.0.0.0>
    port: 4200
    resiliency: failover
    failover:
      volume: /path/to/nfs

Note that instance.yml configs are local to each Leader, not stored on the shared NFS volume.

Use the Command Line

For all Leader Nodes, both primary and standby, use a command of this form:

./cribl mode-master -r failover -v /tmp/shared

For all options, see the CLI Reference.
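Before pointing a Leader at the volume, you can sanity-check that the path is actually an NFS mount. This is an illustrative sketch, not part of Cribl's CLI; the `check_nfs` helper name is ours, and `/tmp/shared` is the path from the example command above:

```shell
# check_nfs: warn if a directory is not on an NFS filesystem.
# Illustrative helper only; not a Cribl command.
check_nfs() {
  fstype=$(df -PT "$1" 2>/dev/null | awk 'NR==2 {print $2}')
  case "$fstype" in
    nfs|nfs4) echo "OK: $1 is NFS ($fstype)" ;;
    *)        echo "WARNING: $1 is '${fstype:-missing}', not NFS"; return 1 ;;
  esac
}

check_nfs /tmp/shared || true   # path from the example command above
```

Running this on each Leader before `mode-master` can catch a volume that is mounted locally on one host but not the others.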

Use Environment Variables

For all Leader Nodes, both primary and standby, configure environment variables listed in Environment Variables Reference: Adding Fallback Leaders.
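For example, the failover volume can be supplied through CRIBL_DIST_MASTER_FAILOVER_VOLUME, the variable referenced in the UI steps above. This is a sketch; `/mnt/cribl-failover` is a placeholder path, and the full variable list is in the reference:

```shell
# Sketch: supply the failover volume via the environment before starting
# the Leader. /mnt/cribl-failover is a placeholder for your NFS mount.
export CRIBL_DIST_MASTER_FAILOVER_VOLUME=/mnt/cribl-failover
# ./cribl start   # the Leader reads the variable at startup
echo "$CRIBL_DIST_MASTER_FAILOVER_VOLUME"
```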

Address HA Migration Timeouts

When you enable Leader High Availability (HA) on a production system with a large amount of data, or when booting up new Leader Nodes, migrating the data can take longer than expected. If the migration exceeds the service timeout, only part of the data might be copied, which can cause problems when the system restarts.

To mitigate the risk of incomplete data transfers, temporarily increase the service timeout before enabling HA. To do this, adjust the TimeoutSec directive in the cribl.service configuration file: locate the [Service] section and add or modify the directive. For example, set it to 600 seconds (10 minutes):

[Service]
TimeoutSec=600
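
One way to apply this on a systemd host is a drop-in override rather than editing the unit file directly. The sketch below writes the drop-in to a local demo directory; on a real host the directory would be /etc/systemd/system/cribl.service.d/ (systemd's standard drop-in location), and you would run `systemctl daemon-reload` and restart the service afterward:

```shell
# Demo directory; on a real host use /etc/systemd/system/cribl.service.d/
# and follow up with `systemctl daemon-reload` plus a service restart.
DROPIN_DIR=./cribl.service.d
mkdir -p "$DROPIN_DIR"
printf '[Service]\nTimeoutSec=600\n' > "$DROPIN_DIR/timeout.conf"
cat "$DROPIN_DIR/timeout.conf"
```

A drop-in keeps the change separate from the packaged unit file, which makes the later reversion (described below) as simple as deleting timeout.conf.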

A successful data transfer is confirmed by messages like the following in the system’s log:

{"time":"2025-02-17T22:01:46.295Z","cid":"api","channel":"ResiliencyConfigs","level":"info","message":"failover migration is complete"}

A failed migration due to timeout might show logs ending with a system shutdown, like this:

{"time":"2025-02-06T15:29:15.416Z","cid":"api","channel":"ShutdownMgr","level":"info","message":"Starting shutdown ...","reason":"Got SIGTERM","timeout":0}

After the HA setup is complete, revert TimeoutSec to its original value. Leaving the timeout extended indefinitely can prolong Leader downtime if a problem occurs during a future failover. Document the original value before making changes so that you can revert it easily.
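
The log check above can be scripted. The `migration_ok` helper below is illustrative, not a Cribl command, and the log path in the usage comment is an assumption; adjust it to wherever your Leader writes its logs:

```shell
# migration_ok: succeed if the given log file contains the completion
# message shown above. Helper name is ours; the match string comes from
# the sample log entry.
migration_ok() {
  grep -qF '"message":"failover migration is complete"' "$1"
}

# Example usage (log location varies by install):
# migration_ok "$CRIBL_HOME/log/cribl.log" && echo "HA migration finished"
```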