On This Page

Home /About Load Balancing {#lb}

About Load Balancing

@{product} will attempt to load-balance outbound data as fairly as possibly across all endpoints. For example, if FQDNs/hostnames are used as the Destination addresses, and each resolves to 5 (unique) IPs, then each Worker Process will have its # of outbound connections = {# of IPs x # of FQDNs} for purposes of the Destination.

Data is sent by all Worker Processes to a randomly selected endpoint, and the amount sent to each connection depends on these parameters:

  1. Respective destination weight: If a Load Weight is set to 0, @{product} will not try to connect to that IP (nor to any IPs resolved for that FQDN). When a connection is blocked, @{product} will apply a 10% penalty to the load weight, and will select the connection with the lower adjusted load.
  2. Respective destination historical data: By default, historical data is tracked for 300s. @{product} uses this data to influence the traffic sent to each destination, to ensure that differences decay over time, and that total ratios converge towards configured weights.

Sources share a single Worker Process for load balancing, so you don’t need to dedicate a separate Worker Process for load balancing for each Source. Load balancing Worker Processes have a high throughput rate because they only handle Event Breakers.

On Destinations that support load balancing, Cribl recommends enabling the feature, even if you only have one hostname - because this hostname can expand to multiple IPs. If you must disable load balancing, then to ensure robust connections, enable the alternative Advanced Settings > Round-robin DNS option. This will allow @{product} to follow the DNS host’s IP rotation, preventing connection pinning.

If a request fails, @{product} will resend the data to a different endpoint. @{product} will block only if all endpoints are experiencing problems.

Example: Equal Distribution

Suppose we have two connections, A and B, each with weight of 1 (i.e., they are configured to receive equal amounts of data). Suppose further that the load-balance stats period is set at the default 300s and - to make things easy - for each period, there are 200 events of equal size (bytes) that need to be balanced.

IntervalTime RangeEvents to balance
1time=0s —> time=300s200

Both A and B start this interval with 0 historical stats each.

Let’s assume that, due to various circumstances, 200 events are “balanced” as follows: A = 120 events and B = 80 events - a difference of 40 events and a ratio of 1.5:1.

IntervalTime RangeEvents to balance
2time=300s —> time=600s200

At the beginning of interval 2, the load-balancing algorithm will look back to the previous interval stats and carry half of the receiving stats forward. In other words, connection A will start the interval with 60 and connection B with 40. To determine how many events A and B will receive during this next interval, @{product} will use their weights and their stats as follows:

Total number of events: events to balance + stats carried forward = 200 + 60 + 40 = 300.

Number of events per each destination (weighted): 300/2 = 150 (they’re equal, due to equal weight).

Number of events to send to each destination A: 150 - 60 = 90 and B: 150 - 40 = 110.

Totals at the end of interval 2: A=120+90=210, B=80+110=190, a difference of 20 events and a ratio of 1.1:1.

Over the subsequent intervals, the difference becomes exponentially less pronounced, and eventually insignificant. Thus, the load gets balanced fairly.

Example: Unequal Distribution

Suppose you have three connections, A, B, and C, weighted to receive unequal amounts of data:

  • A: 1
  • B: 2
  • C: 7

Assume that the load-balance stats period is set at the default 300s and for each period, 200 events of equal size (bytes) need to be balanced:

IntervalTime RangeEvents to balance
1time=0s —> time=300s200

All three connections (A, B, and C) start this interval with 0 historical stats each.

Let’s assume that 200 events are “balanced” as follows: A = 30 events, B = 50 events, and C = 120 events - a difference of 90 events and a ratio of 3:5:12.

IntervalTime RangeEvents to balance
2time=600s —> time=600s200

At the beginning of interval 2, the load-balancing algorithm will look back to the previous interval stats and carry half of the receiving stats forward. This means that connection A will start the interval with 15, connection B with 25, and connection C with 60. To determine how many events each connection will receive during this next interval, @{product} will use their weights and their stats as follows:

Total number of events: events to balance + stats carried forward = 200 + 15 + 25 + 60 = 300.

Number of events per each destination (weighted): 300 / (1 + 2 + 7) = 300 / 10 = 30.

Number of events to send to each destination A: 30 - 15 = 15; B: 60 - 25 = 35; C: 210 - 60 = 150.

Totals at the end of interval 2: A: 30 + 15 = 45; B: 50 + 35 = 85; C: 120 + 150 = 270.

The difference between the highest connection (C) and the lowest (A) is 225 events. The ratio at the end of interval 2 is approximately 1:1.89:6.

Over the subsequent intervals, the difference becomes exponentially less pronounced, and eventually insignificant. Thus, the load gets balanced fairly.