
Optimize Destination Persistent Queues

Cribl Stream’s Destination persistent queues supplement in-memory queues to prevent data loss during a backpressure event, such as when a downstream Destination receiver is unreachable or slow to respond. See Supported Destinations for more information about which Destinations support persistent queueing.

Read this guide to learn how to optimize your Destination persistent queue. This guide is helpful when you are in the initial planning and configuration stage and also when you have an active deployment that you want to performance tune. This guide explains:

  • How Destination persistent queues respond to a backpressure event.
  • The overall goals and key metrics involved in optimizing Destination persistent queues.
  • An in-depth discussion of the Destination persistent queue configuration options and their potential impact on your system.
  • The available notifications, monitoring dashboards, and troubleshooting tools for persistent queues.

On Cribl-managed Cribl.Cloud Workers with an Enterprise plan, you can enable persistent queue for supported Destinations. However, be aware that mode options and storage limits have specific restrictions in Cribl.Cloud. See For Cribl.Cloud Only: Whether to Use a Hybrid Group for more information.

How Destination Persistent Queues Work

Backpressure occurs when there is a temporary imbalance between the rate of data coming into Cribl Stream and the rate of data going out.

Different events trigger persistent queues on Destinations. The triggers depend on the kind of Destination, such as whether it is filesystem-based, TCP, UDP, and so on. See Destination Backpressure Triggers for more detailed information.

When persistent queue (PQ) engages, it writes and stores data on a local file system in your environment until the backpressure event resolves.

Destinations handle data rate imbalances using the following order of operations:

  1. In-memory buffers store data under normal conditions. Destinations store data in in-memory buffers before sending it to a receiver. These buffers help absorb temporary imbalances between inbound and outbound data rates.

  2. If inbound data rates spike, data fills the available buffers for that Destination. Cribl Stream then attempts to output events at a rate the downstream receiver can sustain.

    The threshold for triggering a backpressure event varies by Destination type. See Destination Backpressure Triggers for more information.

  3. A backpressure event triggers when buffers are full.

  4. Backpressure behavior engages. During a backpressure event, Cribl Stream responds in one of the following ways, based on the Destination Backpressure behavior options you configured on the Destination:

    • Block: All data is blocked until the backpressure event resolves. The output refuses to accept new data until the receiver is ready, and the system returns block signals back to the sender. If the sender also supports backpressure response mechanisms, it may stop sending data. Consult the sender’s documentation to confirm how that specific sender handles a backpressure signal. In general, TCP-based senders support backpressure and stop sending data once Cribl Stream stops returning TCP acknowledgments.
    • Drop Events: The Destination discards all new events until the backpressure event resolves and the receiver is ready.
    • Persistent Queue: If the Destination supports it, the Destination engages its persistent queue behavior.
  5. If enabled, persistent queue engages. If the persistent queue behavior engages, Cribl Stream responds in one of the following ways, based on its configured backpressure mode:

    • Error: Queues and stores data on a disk only when the Destination is in an error state.
    • Backpressure: After the Destination has been in a backpressure state for a specified amount of time, Cribl Stream queues and stores data to a disk until the backpressure event resolves.
    • Always On: Cribl Stream immediately queues and stores all data on a disk for all events, even when there is no backpressure.

    You can configure additional settings to determine:

    • How much data to store in memory.
    • What happens to data when the queue reaches its capacity limit.
  6. When the backpressure event ends, data starts flowing out of the persistent queue and into the Destination. Cribl Stream begins to drain persistent queues when the receiver can accept data. In the case of load-balanced receivers, this means when at least one receiver can accept data. By default, after data flow is re-established, the persistent queue flushes every 10 seconds. It uses FIFO (first in, first out) event forwarding, ensuring Cribl Stream sends earlier queued events first when receivers recover. However, in Cribl Stream 4.1 and later, you can instead prioritize new events by disabling Strict ordering on a Destination.

If you turn off Destination persistent queue (or shut down the whole Destination) before its queues drain, you will orphan any events still queued on disk. See Steps to Safely Disable and Clear Persistent Queues for more information.
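
The following sketch illustrates this order of operations in simplified Python. It is not Cribl Stream’s implementation; the class, buffer limit, and return values are hypothetical stand-ins for the behavior described above.

    # A simplified, illustrative sketch of the order of operations above.
    # This is not Cribl Stream's implementation; the class, buffer limit, and
    # return values are hypothetical stand-ins for the documented behavior.
    from collections import deque

    class Destination:
        def __init__(self, backpressure_behavior="pq", buffer_limit=1000):
            self.buffer = deque()                     # in-memory buffer (steps 1-2)
            self.buffer_limit = buffer_limit
            self.backpressure_behavior = backpressure_behavior  # "block" | "drop" | "pq"
            self.disk_queue = []                      # stands in for the on-disk queue

        def handle(self, event):
            if len(self.buffer) < self.buffer_limit:
                self.buffer.append(event)             # steps 1-2: absorb imbalance in memory
                return "buffered"
            # steps 3-4: buffers are full, so the configured backpressure behavior engages
            if self.backpressure_behavior == "block":
                return "blocked"                      # sender receives block signals
            if self.backpressure_behavior == "drop":
                return "dropped"                      # event is discarded
            self.disk_queue.append(event)             # step 5: persistent queue engages
            return "queued"

    dest = Destination(backpressure_behavior="pq", buffer_limit=2)
    print([dest.handle(e) for e in ("e1", "e2", "e3")])   # ['buffered', 'buffered', 'queued']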

Configure Destination Persistent Queue

You configure Destination persistent queue behavior when you configure each individual Destination. Consult the documentation for that Destination for more information about initial configuration.

This section provides a configuration quick start for experienced users. However, new Cribl Stream users should consider reading the full guidance in this topic to get a deeper understanding of the decisions and ideal system architecture design for persistent queue.

To configure persistent queue on a Destination:

  1. In the General Settings tab, set the Backpressure Behavior option to Persistent Queue. See Whether to Enable Persistent Queues for Backpressure Events for guidance about the decision to enable persistent queue.

  2. In the Persistent Queue Settings tab, select an appropriate Mode from the available options. See When and How to Store Data to Disk During a Backpressure Event for information about which mode to select.

  3. Set the Max file size to specify the maximum amount of data stored in each queue file before the file closes and starts a new file. Set the Max queue size to determine the maximum amount of disk space that Cribl allows the queue to consume on each Worker Process. See How Much Data to Store in Memory for more information about these settings.

  4. Set the Queue file path to determine the location for the persistent queue files. See Where to Store Data and How Much Disk Space to Allocate for more information.

  5. Optional: Set the Compression type to apply compression to the data before the queue file closes. See Whether to Compress Data for more information.

  6. Set the Queue-full behavior to determine what happens when the persistent queue reaches its Max queue size limit and runs out of space. See What Happens to Data when the Queue Reaches Its Configured Limits for more information.

  7. Optional: If needed, turn off Strict Ordering to more finely tune the Drain rate limit (EPS). See How to Drain Persistent Queues When Backpressure Events End for more information.
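
If it helps to capture these decisions before working in the UI, the following sketch records the settings from steps 1 through 7 as plain Python data. The keys mirror the UI labels above, and the values shown are example choices; this is not Cribl’s configuration file format.

    # Planning worksheet for the persistent queue settings above.
    # Keys mirror the UI labels in steps 1-7; this is not Cribl's config syntax.
    pq_plan = {
        "Backpressure behavior": "Persistent Queue",    # step 1 (General Settings tab)
        "Mode": "Error",                                # step 2: Error | Backpressure | Always On
        "Max file size": "1 MB",                        # step 3: per-file rollover size
        "Max queue size": "5 GB",                       # step 3: disk cap per Worker Process
        "Queue file path": "$CRIBL_HOME/state/queues",  # step 4
        "Compression": "None",                          # step 5: None | Gzip
        "Queue-full behavior": "Block",                 # step 6: Block | Drop new data
        "Strict ordering": True,                        # step 7: FIFO draining
        "Drain rate limit (EPS)": 0,                    # step 7: only used when Strict ordering is off
    }
    print(pq_plan["Max queue size"])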

Overview of Destination Persistent Queue Optimization

A well-optimized data throughput system processes and transfers data from Sources to Destinations efficiently and reliably. Optimized systems maximize performance while minimizing latency, errors, and resource consumption. An optimized system has these characteristics:

  • High efficiency: Data flows seamlessly at the maximum achievable rate without bottlenecks or excessive resource usage. The system gracefully handles backpressure events, ensuring that data intelligently throttles or queues without overwhelming downstream components.
  • Reliability: The system consistently handles variations in data volume or speed without losing or corrupting data, even during spikes or failures.
  • Scalability: The system adapts smoothly to increasing data loads or additional data senders or receivers without degradation in performance.
  • Low latency: The system processes and delivers data in near real-time, meeting the required time constraints for your business use cases.
  • Cost-efficient resource allocation: The system uses CPU, memory, disk, and network resources effectively, avoiding over-provisioning or waste.
  • Security and compliance: The system transmits data securely, meeting all relevant privacy, encryption, and regulatory requirements.

Ultimately, you and your business stakeholders should have confidence in the system’s performance, reliability, and ability to adapt to future needs. Persistent queue is an important tool to achieve this confidence.

Persistent Queue Optimization Metrics

The most effective way to performance tune is to understand which key metrics give insight into your persistent queue system’s overall health. Effective system administrators study their deployments to establish a performance baseline they can compare against when a performance issue arises, and they know their systems well enough to identify peak data periods.

Pay attention to these metrics:

  • CPU utilization: When your system gets sudden increases in data load, you might notice that your Worker Processes operate at full capacity and use all their CPU. Generally, that’s a sign of inefficiency or the need for more resources. Use the system monitoring tools to analyze CPU utilization. If needed, you can also teleport into a single Worker and view the individual CPU utilization of each process.
  • Data latency: Persistent queues can add latency because of the overhead of writing data to disk. As a starting point, assume that each event incurs approximately 1 second of added latency. To measure the latency impact of your persistent queue system, compare your system’s throughput times without persistent queues enabled, then enable persistent queue and measure the difference (see the sketch after this list).
  • Memory: When your system experiences sudden increases in data, your Workers need memory to store that data in temporary buffers. Consider what should happen in your system when memory is fully utilized.
  • Data storage: Backing data up to disk requires available disk space. Allocating memory and disk space can be expensive. Be aware of the rate at which you consume disk space and the general costs you estimate you will incur to write data to disk. Have systems in place to detect when disk space is running out and to add more as needed.
  • Data throughput: You need to understand your system’s throughput under normal conditions so that you can identify potential bottlenecks in the system. Use the system monitoring tools to develop this baseline.
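
For the latency comparison described above, here is a minimal sketch with placeholder measurements that you would replace with values from your own monitoring:

    # Compare average end-to-end delivery delay with and without persistent queue.
    # The numbers below are placeholders; substitute your own monitoring measurements.
    avg_delay_without_pq_s = 0.2    # measured delay from Source receipt to Destination delivery
    avg_delay_with_pq_s = 1.2       # measured again after enabling persistent queue

    added_latency_s = avg_delay_with_pq_s - avg_delay_without_pq_s
    print(f"Persistent queue adds ~{added_latency_s:.1f} s of delivery latency per event")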

Key Decisions When Configuring Destination Persistent Queues

You configure Destination persistent queue behavior when you configure each individual Destination. Consult the documentation for that Destination for more information about initial configuration.

This section explains the decision-making process you should consider when implementing persistent queue on a Destination. It explains the benefits and tradeoffs of different configuration settings for most use cases.

Whether to Enable Persistent Queues for Backpressure Events

The first decision you need to make is whether to use persistent queue to respond to backpressure. See When to Use Persistent Queues for a detailed explanation of which kinds of environments benefit from persistent queues.

In general, you should not use persistent queue if you are already using a backpressure mechanism on a downstream receiver. Implementing backpressure mechanisms such as persistent queue on both Cribl Stream and the Destination is redundant and could cause unnecessary data latency or reduce data throughput.

The alternatives to using persistent queue are to block data or to drop events when a backpressure event occurs. Consider blocking events if you need Cribl to return block signals to an upstream sender. If the sender also supports backpressure response mechanisms, it may stop sending data.

Consult the sender’s documentation to confirm how that specific sender handles a backpressure signal. In general, TCP-based senders support backpressure and stop sending data once Cribl Stream stops returning TCP acknowledgments.

For Cribl.Cloud Only: Whether to Use a Hybrid Group

On Cribl-managed Cribl.Cloud Workers (with an Enterprise plan), persistent queue does not come with many of the configuration options described in the rest of this page. For Cribl.Cloud, enabling persistent queue only exposes the destructive Clear Persistent Queue button. Use this button with caution to avoid data loss. See Steps to Safely Disable and Clear Persistent Queues for more information.

When you enable persistent queues in Cribl.Cloud, Cribl automatically allocates up to 1 GB of disk space per Destination, per Worker Process. This 1 GB limit applies to outbound uncompressed data, and the queue does not perform any compression. You cannot change this limit, and if the queue fills up, Cribl Stream blocks outbound data flow.

Use a hybrid Group if you need to configure:

  • Queue size
  • Compression
  • Backpressure modes
  • Other persistent queue settings

When and How to Store Data to Disk During a Backpressure Event

Once the persistent queue behavior engages, you need to determine when and how Cribl Stream should write data to disk. You have three available modes for storing data during a backpressure event:

  • Error: Queues and stores data on a disk only when the Destination is in an error state.
  • Backpressure: After the Destination has been in a backpressure state for a specified amount of time, Cribl Stream queues and stores data to a disk until the backpressure event resolves.
  • Always On: Cribl Stream immediately queues and stores all data on a disk for all events, even when there is no backpressure.

Consider using the Always On mode if your business has a low tolerance for data loss. However, do so with some caution. Enabling this mode could have impacts on the persistent queue optimization metrics, such as disk space, throughput, data latency, and CPU utilization.

Consider experimenting with this mode to see whether it’s right for your business needs. Establish a baseline for system performance before you enable Always On and then observe the impacts when you turn that mode on. Ensure you have enough disk storage space to handle your data storage needs.

How Much Data to Store in Memory

When you enable persistent queues, you can configure the File size limit/Max file size setting to specify the maximum amount of data stored in each queue file before the file closes and starts a new file. You can enter a value in units such as KB, MB, and more. If left unspecified, Cribl Stream/Edge defaults to 1 MB.

Additionally, you can use the Queue size limit/Max queue size setting to determine the maximum amount of disk space that Cribl allows the queue to consume on each Worker Process. Once the queue reaches this limit, the Destination stops queueing data and applies the configured Queue-full behavior. You can enter a value in units such as KB, MB, GB, or TB.

If the Queue size limit/Max queue size is not specified, it defaults to 5 GB. You can adjust this value to better suit your needs. For Destinations with an undefined Queue size limit/Max queue size, queues exceeding 5 GB trigger the configured Queue-full behavior until you resize the limit.

To set a higher maximum value, navigate to Group Settings > General Settings > Limits > Storage > Worker Process PQ size limit. The corresponding key in the limits.yml file is maxPQSize. (See Config Files: limits.yml for more information.) The default for this setting is 1 TB. Before increasing this limit, consult Cribl Support for guidance.
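
When choosing a Queue size limit/Max queue size, it can help to estimate how long a given limit could absorb a complete downstream outage at your normal per-Worker-Process throughput. A rough back-of-the-envelope sketch, using placeholder numbers:

    # Rough estimate of how long a persistent queue can absorb a complete
    # downstream outage. Replace the inputs with your own measured values.
    def outage_coverage_minutes(max_queue_size_gb, throughput_mb_per_s):
        """Minutes of outage a queue of this size can buffer at a steady inbound rate."""
        return max_queue_size_gb * 1024 / throughput_mb_per_s / 60

    # Example: the default 5 GB limit at 4 MB/s of uncompressed data per Worker Process.
    print(f"{outage_coverage_minutes(5, 4):.0f} minutes of coverage")   # about 21 minutes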

The information in this section applies only to on-prem or hybrid Worker Groups. For Cribl-managed Cribl.Cloud Organizations using persistent queue, Cribl automatically allocates up to 1 GB of disk space per Destination, per Worker Process. You cannot change this limit.

This 1 GB limit applies to outbound uncompressed data, and the queue does not perform any compression. If the queue fills up, Cribl Stream/Edge blocks data flow, including outbound data.

Whether to Compress Data

When the queue file reaches its maximum file size, you can use the optional Compression setting to apply compression to the data before the queue file closes. Currently, the only available compression codec is Gzip.

If you enable Compression and also enter a Queue size limit/Max queue size, be aware of how these options interact:

  • Cribl Stream/Edge currently applies the Queue size limit/Max queue size against the uncompressed data volume entering the queue.
  • With this combination, Cribl recommends that you set the Queue size limit/Max queue size higher than the volume’s total available disk space, not including compression. This will maximize queue saturation and minimize data loss.
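
The following sketch works through this interaction with a purely hypothetical 4:1 gzip ratio (actual ratios depend entirely on your data):

    # Why the Max queue size can exceed available disk space when Compression is
    # enabled: the limit counts uncompressed bytes, but the files on disk are
    # compressed. The 4:1 ratio below is purely hypothetical.
    gzip_ratio = 4.0                # uncompressed bytes per compressed byte (data-dependent)
    available_disk_gb = 100         # disk space dedicated to the queue
    max_queue_size_gb = 300         # configured limit, counted against uncompressed volume

    on_disk_gb = max_queue_size_gb / gzip_ratio
    print(f"The queue reaches its {max_queue_size_gb} GB (uncompressed) limit "
          f"at about {on_disk_gb:.0f} GB on disk of the {available_disk_gb} GB available")
    # Setting the limit equal to the 100 GB of disk space would instead stop the
    # queue at roughly 25 GB on disk, leaving most of the volume unused.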

Where to Store Data and How Much Disk Space to Allocate

Use the Queue file path setting to determine the location for the persistent queue files. The default is $CRIBL_HOME/state/queues. When writing data, Cribl Stream/Edge will append /<worker-id>/<output-id> to this path.

For queuing to operate properly, you must provide sufficient disk space. In Distributed mode, this minimum disk space configuration is set on each Worker Group/Fleet. You can access this setting at Group/Fleet settings > General Settings > Limits > Min free disk space.

If available disk space falls below this minimum threshold, Cribl Stream/Edge will stop maintaining persistent queues, and data loss will begin. The default minimum is 5 GB.
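
For a quick check of the headroom on the volume that backs your queue path, you could use a short script like the following sketch. The path and threshold are assumptions; adjust them to match your environment and your Group’s Min free disk space setting.

    # Check free space on the volume that holds the persistent queue path.
    import os
    import shutil

    queue_path = os.path.expandvars("$CRIBL_HOME/state/queues")   # adjust if you changed Queue file path
    min_free_gb = 5                                               # default Min free disk space

    free_gb = shutil.disk_usage(queue_path).free / 1024**3
    if free_gb < min_free_gb:
        print(f"WARNING: {free_gb:.1f} GB free, below the {min_free_gb} GB minimum -- queuing may stop")
    else:
        print(f"OK: {free_gb:.1f} GB free on the volume backing {queue_path}")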

What Happens to Data when the Queue Reaches Its Configured Limits

You can use the Queue-Full Behavior setting to determine what happens when the persistent queue reaches its Max queue size limit and runs out of space. There are two options:

  • Block: Blocks all data until the backpressure event has resolved. The output refuses to accept new data until the receiver is ready, and the system returns block signals back to the sender.
  • Drop new data: The Destination discards all new events until the backpressure event has resolved and the receiver is ready.

How to Drain Persistent Queues When Backpressure Events End

When a backpressure event ends, the persistent queue begins to drain to the downstream Destination. You can configure the order of event delivery and the rate at which Cribl Stream forwards events.

Use the Strict Ordering setting to configure the order that Cribl Stream sends data to the Destination. Toggle Strict Ordering on (default) to use FIFO (first in, first out) processing. This means Cribl prioritizes earlier queued events before sending newly arrived data. Toggle Strict Ordering off to change this behavior, prioritizing new data over the events already in the queue.

If you toggle Strict Ordering off, you must configure the Drain rate limit (EPS). This setting allows you to throttle the rate at which events drain from the persistent queue to the Destination. The default value of 0 disables throttling. Throttling the drain rate helps Cribl manage the incoming stream of live data more effectively, preventing delays caused by queue processing. It also improves the throughput of active connections by reserving additional resources for them.

The Drain rate limit (EPS) is set in Events Per Second (EPS). To determine your system’s EPS value:

  1. Open the Monitoring page. From Cribl Stream’s sidebar, select Monitoring to access dashboards displaying data for all Worker Groups and all nodes.
  2. Filter by a specific node. Use the Groups menu at the top to restrict monitoring data to a single Worker Node.
  3. Adjust the time granularity. Use the time menu to change the display granularity from the default 15 min to 1 day for better accuracy.
  4. Locate the EPS metrics. On the Monitoring page, find the Events In and Out metrics group. Copy the value listed under THROUGHPUT IN (AVG). For example, a typical value might be 31,000 EPS.

Cribl recommends setting the Drain Rate Limit to 10% of the Worker Process EPS to prioritize live data processing. To calculate an appropriate drain rate:

  1. Determine EPS per Worker Process. Divide the total EPS by the number of Worker Processes (WPs). For example: 31,000 EPS / 14 WPs ≈ 2,200 EPS per WP.
  2. Calculate 10% of EPS per Worker Process. Multiply the result by 0.10 to find the recommended drain rate. For example: 10% of 2,200 EPS = 220 EPS.
  3. Configure the Drain Rate Limit. In the persistent queue settings for the Destination, enter this value in the Drain rate limit (EPS) field. For example: 220.
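
The same calculation, expressed as a small Python helper using the example numbers from the steps above:

    # Recommended Drain rate limit (EPS): 10% of per-Worker-Process EPS.
    def recommended_drain_rate(total_eps, worker_processes, fraction=0.10):
        return round(total_eps / worker_processes * fraction)

    print(recommended_drain_rate(31_000, 14))   # 221 (the steps above round to 220)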

After applying this setting, monitor the queue’s drain rate. If the queue is draining too slowly, carefully increase the Drain rate limit (EPS) while ensuring it doesn’t interfere with live data processing. Avoid setting the limit too high, as this can impact the performance of live streams. If needed, consider optimizing Worker startup connections and managing CPU load to further enhance performance at Group Settings > Worker Processes.

Cribl’s guarantee for data delivery is that all data is delivered at least once, but not exactly once. Sometimes Cribl Stream sends duplicate data events when the persistent queue drains.

Steps to Safely Disable and Clear Persistent Queues

If you want to disable persistent queuing, always wait for the queue to fully drain. The Clear Persistent Queue button is available as a destructive fallback. This action frees up disk space, but it also deletes the queued data without sending it on to receivers.

After the queue has drained, you can select a different backpressure behavior. Save the configuration after changing it.

Destination Persistent Queue Notifications

Some Cribl licenses allow you to configure Notifications for Destination persistent queues for on-prem deployments. In general, Cribl recommends setting Notifications for these events when using persistent queue to ensure you are aware of important system events. These Notifications always appear in Cribl Stream’s user interface and internal logs. You can also send them to external systems.

You can set notifications for these scenarios:

  • Persistent queue files exceed a configurable percentage of allocated storage.
  • A Destination reaches the queue-full state, meaning it is running out of available disk space.
  • A Destination’s backpressure behavior engages, causing blocked or dropped events.
  • Persistent Queue Usage (saturation).

See About Notifications for information about adding Notifications for Destinations.

Monitor Destination Persistent Queues

Cribl Stream provides various Monitoring dashboards to gather information about your persistent queues. You can access these dashboards from Cribl Stream’s sidebar by selecting Monitoring.

To view data about Destination persistent queues:

  1. Open the Monitoring page. From Cribl Stream’s sidebar, select Monitoring to access dashboards displaying data for all Worker Groups and all nodes.
  2. Navigate to System > Queues (Destinations) and view relevant data.

Other Monitoring resources may become unreliable when persistent queue engages during a backpressure event. These include the main Monitoring dashboards for Sources and Destinations. When persistent queue engages, these dashboards do not accurately represent queued data, because Cribl Stream writes data to disk rather than delivering it to the affected Destinations.

The affected Destinations will show EPS and bytes sent while their data is queueing, but these metrics can be misleading: they measure data going into the queue, not data proceeding to downstream receivers.

Some other Monitoring considerations:

  • Activity in the System > Queues (Destinations) dashboards sometimes shows up only when queues drain.
  • With or without persistent queue enabled, Monitoring dashboards do not show usage of in-memory buffers for the Destination.
  • Cribl Stream currently does not provide a Monitoring dashboard for the disk volume of persistent queues.
  • While events go to a Destination persistent queue, metrics reflect events per second and bytes going into that Destination’s queue, not data proceeding beyond Cribl Stream.
  • Data captures might not work while a Destination is in a blocked state.

For these reasons, consider enabling persistent queue notifications.

After the unhealthy condition resolves, and disk queues start to drain, dashboards resume accurately representing delivery rates and volumes.

Troubleshooting Tools

Cribl internal logs for Cribl-managed Workers might include entries labeled Revived PQ Worker Process. These indicate where a Cribl-managed Worker was newly assigned to drain a persistent queue left over from a shut-down Worker Node. You can use these log events to troubleshoot Cloud PQ issues.

Reference: Destination Persistent Queue Fields

The Persistent Queue Settings tab displays when the Backpressure behavior option in General settings is set to Persistent Queue. Persistent queue buffers and preserves incoming events when a downstream Destination has an outage or experiences backpressure.

Before enabling persistent queue, review the earlier sections of this guide to learn more about persistent queue behavior and how to optimize it for your system.

On Cribl-managed Cloud Workers (with an Enterprise plan), this tab exposes only the destructive Clear Persistent Queue button (described at the end of this section). A maximum queue size of 1 GB disk space is automatically allocated per PQ‑enabled Destination, per Worker Process. The 1 GB limit is on outbound uncompressed data, and no compression is applied to the queue.

This limit is not configurable. If the queue fills up, Cribl Stream/Edge will block outbound data. To configure the queue size, compression, queue-full fallback behavior, and other options below, use a hybrid Group.

Mode: Use this menu to select when Cribl Stream/Edge engages the persistent queue in response to backpressure events from this Destination. The options are:

  • Error: Queues and stores data on a disk only when the Destination is in an error state.
  • Backpressure: After the Destination has been in a backpressure state for a specified amount of time, Cribl Stream/Edge queues and stores data to a disk until the backpressure event resolves.
  • Always on: Cribl Stream/Edge immediately queues and stores all data on a disk for all events, even when there is no backpressure.

File size limit: The maximum data volume to store in each queue file before closing it. Enter a numeral with units of KB, MB, etc. Defaults to 1 MB.

Max queue size: The maximum amount of disk space that the queue can consume on each Worker Process. When the queue reaches this limit, the Destination stops queueing data and applies the Queue‑full behavior. Defaults to 5 GB. This field accepts positive numbers with units of KB, MB, GB, and so on. You can set it as high as 1 TB, unless you’ve configured a different Worker Process PQ size limit on the Group Settings/Fleet Settings page.

Queue file path: The location for the persistent queue files. Defaults to $CRIBL_HOME/state/queues. Cribl Stream/Edge will append /<worker‑id>/<output‑id> to this value.

Compression: Set the codec to use when compressing the persisted data after closing a file. Defaults to None. Gzip is also available.

Queue-full behavior: Whether to block or drop events when the queue begins to exert backpressure. A queue begins to exert backpressure when the disk is low or at full capacity. This setting has two options:

  • Block: The output will refuse to accept new data until the receiver is ready. The system will return block signals back to the sender.
  • Drop new data: Discard all new events until the backpressure event has resolved and the receiver is ready.

Strict ordering: Toggle on (default) to enable FIFO (first in, first out) event forwarding, ensuring Cribl Stream/Edge sends earlier queued events first when receivers recover. The persistent queue flushes every 10 seconds in this mode. Toggle off to prioritize new events over queued events, configure a custom drain rate for the queue, and display this option:

  • Drain rate limit (EPS): Optionally, set a throttling rate (in events per second) on writing from the queue to receivers. (The default 0 value disables throttling.) Throttling the queue drain rate can boost the throughput of new and active connections by reserving more resources for them. You can further optimize Worker startup connections and CPU load in the Worker Processes settings.

Clear Persistent Queue: For Cloud Enterprise only, click this button if you want to delete the files that are currently queued for delivery to this Destination. If you click this button, a confirmation modal appears. Clearing the queue frees up disk space by permanently deleting the queued data, without delivering it to downstream receivers. This button only appears after you define the Output ID.

Use the Clear Persistent Queue button with caution to avoid data loss. See Steps to Safely Disable and Clear Persistent Queues for more information.