Optimize Source Persistent Queues
Cribl Stream’s Source persistent queues can supplement in-memory queues to prevent data loss during a backpressure event, such as when a configured Source sends a higher volume of data than Cribl Stream can process. See Supported Sources for more information about which Sources support persistent queueing.
Read this guide to learn how to optimize your Source persistent queues. This guide is helpful both when you are in the initial planning and configuration stage and when you want to performance-tune an active deployment. This guide explains:
- How Source persistent queues respond to a backpressure event.
- The overall goals and key metrics involved in optimizing Source persistent queues.
- The Source persistent queue configuration options, discussed in depth, and their potential impact on your system.
- The available notifications, monitoring dashboards, and troubleshooting tools for Source persistent queues.
For a general overview of persistent queues, see About Persistent Queues.
On Cribl-managed Cribl.Cloud Workers with an Enterprise plan, you can enable persistent queue for supported Sources. However, be aware that mode options and storage limits have specific restrictions in Cribl.Cloud. See For Cribl.Cloud Only: `Always On` is the Only Available Mode for more information.
How Source Persistent Queues Work
Backpressure occurs when there is a temporary imbalance between the rate of data coming into Cribl Stream and the rate of data going out.
Different events trigger persistent queues on Sources:
- Error conditions.
- Degraded performance—such as slowdowns or short backups—in downstream Cribl resources.
- Backpressure from external receivers.
When Source persistent queue engages, it writes and stores data on a local file system in your environment until the backpressure event resolves.
How Sources Handle Backpressure
Source persistent queue can serve as a stopgap for upstream senders that don’t provide their own buffering, or whose small buffers cannot queue for long. This applies to several syslog senders and HTTP-based Sources.
Sources that have persistent queue support have the option to supplement in-memory buffering with persistent queue. When you enable persistent queue, you can choose between these Source persistent queue trigger conditions:
| Condition | Description |
| --- | --- |
| `Always On` | Uses persistent queue as a disk buffer for all events. |
| `Smart` | Only engages persistent queue upon backpressure from downstream receivers or Cribl internal processes. |
In `Always On` mode, the persistent queue constantly writes all data to a backup disk buffer, regardless of whether a backpressure event is in progress. After a backpressure event ends, the queue flushes every 10 seconds as it drains.
In `Smart` mode, persistent queue engages under any of these conditions:
- A connected Destination is fully down and Destination-side persistent queue is not enabled.
- A connected Destination is slow or blocking.
- A downstream Cribl Pipeline or Route is slow.
When persistent queue engages on a Source in `Smart` mode:
- Events queue before any pre-processing Pipeline takes effect.
- Metrics (data in) will show `0` bytes/events while the system buffers events to the disk queue.
- Cribl Stream Pipelines and Routes process the data with a delay as the persistent queue drains.
`Smart` mode engages only when necessary, such as when a downstream Destination becomes blocked and the in-memory buffer reaches its Buffer size limit. When persistent queue is set to `Smart` mode, Cribl attempts to flush the queue as each new event arrives. The only time events stay in the buffer is when a downstream Destination becomes blocked.
Enabling persistent queue on a Source is redundant if the upstream sender already safeguards events in its own disk buffer. Enabling persistent queue on both the sender and Cribl Stream could cause data latency. Check the settings for the sender to confirm whether that specific sender already safeguards against data loss.
If you turn off Source persistent queue (or shut down the whole Source) before its queues drain, you will orphan any events still queued on disk.
Configure Source Persistent Queue
You configure Source persistent queue behavior when you configure each individual Source. Consult the documentation for that Source for more information about initial configuration.
This section provides a configuration quick start for experienced users. However, new Cribl Stream users should consider reading the full guidance in this topic to get a deeper understanding of the decisions and ideal system architecture design for persistent queue.
To configure persistent queue on a Source (see the example configuration after these steps):
1. In the Persistent Queue Settings tab, toggle Enable Persistent Queue on. See Whether to Enable Persistent Queues for Backpressure Events for guidance about making the decision to enable persistent queue.
2. For On-Prem only: Select an appropriate Mode from the available options. See When and How to Store Data to Disk During a Backpressure Event for information about which mode to select.
   On Cribl-managed Cribl.Cloud Workers with an Enterprise plan, you can enable persistent queues for supported Sources. By default, Source persistent queue is not enabled, so you must explicitly decide to turn it on. When you enable Source persistent queue in Cribl.Cloud, the only available mode is `Always On`, which writes data to disk at all times. `Smart` mode is not available. See For Cribl.Cloud Only: `Always On` is the Only Available Mode for more information.
3. For On-Prem only: Set the Buffer size limit to determine the maximum number of events to hold in memory before reporting backpressure to the sender. See When to Report Backpressure to the Sender for more information.
4. For On-Prem only: Set the Commit frequency to determine how many events to send downstream before committing that Cribl Stream has read them. See How Many Events to Commit During a Backpressure Event for more information.
5. For On-Prem only: Set the File size limit to specify the maximum amount of data stored in each queue file before the file closes and starts a new file. Set the Queue size limit to determine the maximum amount of disk space that Cribl allows the queue to consume on each Worker Process. See How Much Data to Store in Memory for more information about these settings.
6. Set the Queue file path to determine the location for the persistent queue files. See Where to Store Data and How Much Disk Space to Allocate for more information.
7. Optional: Set the Compression type to apply compression to the data before the queue file closes. See Whether to Compress Data for more information.
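For reference, the UI settings above map to each Source's entry in the Worker Group's inputs.yml file. The following sketch is illustrative only: the key names shown (pqEnabled, pq.mode, and so on) are assumptions modeled on common Cribl config patterns, so verify them against the inputs.yml that your version generates after you save these settings.

```yaml
# Hypothetical inputs.yml excerpt for a Source with persistent queue enabled.
# Key names are assumptions; confirm against the inputs.yml your deployment
# generates after saving in the UI.
inputs:
  in_syslog:                         # example Source ID (hypothetical)
    type: syslog
    pqEnabled: true                  # Enable Persistent Queue toggle
    pq:
      mode: always                   # Mode: "always" (Always On) or "smart" (Smart)
      maxBufferSize: 1000            # Buffer size limit: events held in memory
      commitFrequency: 42            # Commit frequency: events per downstream commit
      maxFileSize: 1 MB              # File size limit per queue file
      maxSize: 5GB                   # Queue size limit per Worker Process
      path: $CRIBL_HOME/state/queues # Queue file path
      compress: none                 # Compression: none or gzip
```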
Overview of Source Persistent Queue Optimization
A well-optimized data throughput system efficiently and reliably processes and transfers data from Sources to Destinations. Optimized systems maximize performance while minimizing latency, errors, and resource consumption. An optimized system has these characteristics:
- High efficiency: Data flows seamlessly at the maximum achievable rate without bottlenecks or excessive resource usage. The system gracefully handles backpressure events, ensuring that data intelligently throttles or queues without overwhelming downstream components.
- Reliability: The system consistently handles variations in data volume or speed without losing or corrupting data, even during spikes or failures.
- Scalability: The system adapts smoothly to increasing data loads or additional data senders or receivers without degradation in performance.
- Low latency: The system processes and delivers data in near real-time, meeting the required time constraints for your business use cases.
- Cost-efficient resource allocation: The system uses CPU, memory, disk, and network resources effectively, avoiding over-provisioning or waste.
- Security and compliance: The system transmits data securely, meeting all relevant privacy, encryption, and regulatory requirements.
Ultimately, you and your business stakeholders should have confidence in the system’s performance, reliability, and ability to adapt to future needs. Persistent queue is an important tool to achieve this confidence.
Persistent Queue Optimization Metrics
The most effective way to performance-tune is to understand which key metrics give insight into your persistent queue system's overall health. Effective system administrators study their deployments to establish a performance baseline that they can use for comparison if a performance issue arises, and they know their systems well enough to easily identify peak data periods.
Pay attention to these metrics:
- CPU utilization: When your system gets sudden increases in data load, you might notice that your Worker Processes operate at full capacity and use all their CPU. Generally, that's a sign of inefficiency or the need for more resources. Use the system monitoring tools to analyze CPU utilization. If needed, you can also teleport into a single Worker and view all the processes and their individual CPU utilization.
- Data latency: Persistent queues can add latency because of the overhead of writing data to disk. As a starting point, assume that each event incurs approximately 1 second of added latency. To measure the latency impact of your persistent queue system, measure your system's throughput times without persistent queue enabled, then enable persistent queue and measure the difference.
- Memory: When your system experiences sudden increases in data, your Workers need memory to store that data in temporary buffers. Consider what should happen in your system when memory is fully utilized.
- Data storage: Backing up data to disk requires available disk space. Allocating memory and disk space can be expensive. Be aware of the rate at which you consume disk space, and estimate the costs you will incur to write data to disk. Have systems in place to alert you when disk space is running out, and add more capacity as needed.
- Data throughput: You need to understand your system’s throughput under normal conditions so that you can identify potential bottlenecks in the system. Use the system monitoring tools to develop this baseline.
Key Decisions When Configuring Source Persistent Queues
You configure Source persistent queue behavior when you configure each individual Source. Consult the documentation for that Source for more information about initial configuration.
This section explains the decision-making process you should consider when implementing persistent queue on a Source. It explains the benefits and tradeoffs of different configuration settings for most use cases.
Whether to Enable Persistent Queues for Backpressure Events
The first decision you need to make is whether to use persistent queue to respond to backpressure. See When to Use Persistent Queues for a detailed explanation of which kinds of environments benefit from persistent queues.
In general, you should not use persistent queue if you are already using a backpressure mechanism on an upstream sender. Implementing backpressure mechanisms such as persistent queue on both Cribl Stream and the Source is redundant and could cause unnecessary data latency or impact data throughput.
The alternatives to using persistent queue are to block data or to drop events when a backpressure event occurs. Consider blocking events if you need Cribl to return block signals back to an upstream sender. If the sender also supports backpressure response mechanisms, the sender may stop sending data.
Consult the documentation for the sender to ensure you know how that specific sender handles a backpressure signal. In general, TCP-based senders support backpressure and stop sending data once Cribl Stream stops sending TCP acknowledgments back to them.
When and How to Store Data to Disk During a Backpressure Event
Once you decide to use persistent queue, you need to determine when and how it should write data to disk. Source persistent queues have two available options (modes) for storing data during a backpressure event:
| Mode | Description |
| --- | --- |
| `Always On` | Uses persistent queue as a disk buffer for all events. |
| `Smart` | Only engages persistent queue upon backpressure from downstream receivers or Cribl internal processes. |
The default is Always On mode. This mode is helpful if your business has a low tolerance for data loss. However, using Always On mode could affect the persistent queue optimization metrics, such as disk space, throughput, data latency, and CPU utilization.
Considerations for using Always On mode:
- This mode can increase memory demands compared to Sources without persistent queue enabled. If memory is a constraint, you can experiment with tuning the Buffer size limit setting. However, smaller buffers will reduce the efficiency of disk writes, adding more latency.
- Always On mode has higher disk requirements than Smart mode. Cribl generally recommends that you allocate each Worker Node enough disk space for at least 1 day's worth of queued throughput. For example: If one Source is sending 1 TB/day to two evenly distributed Worker Processes, you should provision at least 500 GB of queue storage for each Worker Process. Expand this basic calculation for additional Sources. From there, expand or contract it based on the typical length of outages you experience (see the sizing sketch after this list).
- Cribl also recommends that you deploy the highest-performance storage available. This prevents IOPS (input/output operations per second) and throughput bottlenecks for both read and write operations.
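To make the storage calculation above concrete, here is a hedged sizing sketch; the volumes are illustrative and the maxSize key name is an assumption:

```yaml
# Always On sizing sketch (illustrative numbers, assumed key name):
#   inbound volume for this Source:   1 TB/day
#   Worker Processes sharing load:    2 (evenly distributed)
#   outage coverage target:           1 day
#   per-process queue space needed:   (1 TB / 2) x 1 day = 500 GB
pq:
  mode: always
  maxSize: 500GB   # Queue size limit per Worker Process
```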
Considerations for using Smart mode:
- `Smart` mode engages only when necessary, such as when a downstream Destination becomes blocked and the in-memory buffer reaches its Buffer size limit. When persistent queue is set to `Smart` mode, Cribl attempts to flush the queue as each new event arrives. The only time events stay in the buffer is when a downstream Destination becomes blocked.
- If you're using Smart mode and observe that it causes Source persistent queue to engage frequently, one option is to reduce the File size limit setting (see the sketch after this list). Source persistent queue triggers excessive backpressure only when it reaches the last queued file and that file is large. Keeping this threshold low prevents that condition.
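If you take that approach, the change amounts to lowering the per-file threshold. A minimal sketch, with assumed key names:

```yaml
# Smart mode sketch (assumed key names): keep queue files small so the
# last queued file stays below the size that triggers excess backpressure.
pq:
  mode: smart
  maxFileSize: 1 MB   # File size limit; keep this low in Smart mode
```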
Consider experimenting with these modes to determine what is appropriate for your needs. Establish a baseline for system performance before you enable Always On and then observe the impacts when you turn that mode on.
For Cribl.Cloud Only: `Always On` is the Only Available Mode
On Cribl-managed Cribl.Cloud Workers with an Enterprise plan, you can enable persistent queues for supported Sources. By default, Source persistent queue is not enabled, so you must explicitly decide to turn it on.
When you enable Source persistent queue in Cribl.Cloud, the only available mode is `Always On`, which writes data to disk at all times. `Smart` mode is not available.
When you enable Source persistent queue in Cribl.Cloud, Cribl automatically allocates up to 1 GB of disk space per Source, per Worker Process. You cannot change this limit. This 1 GB limit applies to uncompressed inbound data, and the queue does not perform any compression. If the queue fills up, Cribl Stream blocks data flow, including outbound data.
When to Report Backpressure to the Sender
You can use the Buffer size limit setting to determine the maximum number of events to hold in memory before reporting backpressure to the sender. Once the number of events reaches this limit, Cribl Stream begins writing the queue to disk. This setting defaults to `1000`.
This buffer is for all connections, not just per Worker Process. For that reason, it can dramatically expand memory usage. Connections share this limit, which may result in slightly lower throughput for higher numbers of connections. For higher numbers of connections, consider increasing the limit.
How Many Events to Commit During a Backpressure Event
Use the Commit frequency setting to determine how many events to send downstream before committing that Cribl Stream has read them. The default is `42`.
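Taken together, the Buffer size limit and Commit frequency settings might look like the following in a Source's pq block. This is a hedged sketch: the maxBufferSize and commitFrequency key names are assumptions, so confirm them against your deployment's generated config.

```yaml
# Hedged sketch of the two in-memory settings (assumed key names):
pq:
  maxBufferSize: 5000   # Buffer size limit: raised from the 1000 default to
                        # accommodate a high number of concurrent connections
  commitFrequency: 42   # Commit frequency: default number of events sent
                        # downstream before committing the read
```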
How Much Data to Store in Memory
When you enable persistent queues, you can configure the File size limit/Max file size setting to specify the maximum amount of data stored in each queue file before the file closes and starts a new file. You can enter a value in units such as KB, MB, and more. If left unspecified, Cribl Stream/Edge defaults to 1 MB.
Additionally, you can use the Queue size limit/Max queue size setting to determine the maximum amount of disk space that Cribl allows the queue to consume on each Worker Process. Once the queue reaches this limit, the Source will stop queueing data and engage the configured backpressure mode. You can enter a value in units such as KB, MB, GB, or TB.
If the Queue size limit/Max queue size is not specified, it defaults to 5 GB. You can adjust this value to better suit your needs. For Sources with undefined Queue size limit/Max queue size values, queues exceeding 5 GB will trigger the configured queue-full behavior until you resize the limit.
To set a higher maximum value, navigate to Group Settings > General Settings > Limits > Storage > Worker Process PQ size limit. The corresponding key in the limits.yml file is `maxPQSize`. (See Config Files: limits.yml for more information.) The default for this setting is 1 TB. Before increasing this limit, consult Cribl Support for guidance.
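A minimal limits.yml sketch follows. The maxPQSize key name comes from the setting described above, but its exact placement in the file and the accepted unit format are assumptions; verify before applying, and consult Cribl Support first, as noted.

```yaml
# limits.yml sketch: raise the per-Worker-Process PQ ceiling above the
# 1 TB default. Key placement and unit format are assumptions; verify first.
maxPQSize: 2TB
```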
The information in this section applies only to on-prem or hybrid Worker Groups. For Cribl-managed Cribl.Cloud Organizations using persistent queue, Cribl automatically allocates up to 1 GB of disk space per Source, per Worker Process. You cannot change this limit.
This 1 GB limit applies to uncompressed inbound data, and the queue does not perform any compression. If the queue fills up, Cribl Stream/Edge blocks data flow, including outbound data.
Whether to Compress Data
When the queue file reaches its maximum file size, you can use the optional Compression setting to apply compression to the data before the queue file closes. Currently, the only available compression codec is Gzip.
If you enable Compression and also enter a Queue size limit/Max queue size, be aware of how these options interact:
- Cribl Stream/Edge currently applies the Queue size limit/Max queue size against the uncompressed data volume entering the queue.
- With this combination, Cribl recommends that you set the Queue size limit/Max queue size higher than the volume's total available disk space, because compression reduces the data's actual footprint on disk. This maximizes queue utilization and minimizes data loss (see the sketch after this list).
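For instance, on a volume with roughly 100 GB free, a hedged configuration might look like this (key names and sizes are illustrative):

```yaml
# Compression sketch (assumed key names): the limit is checked against
# UNCOMPRESSED inbound volume, so it can exceed free disk space because
# gzip shrinks what is actually written.
pq:
  compress: gzip   # applied as each queue file closes
  maxSize: 120GB   # Queue size limit on a volume with ~100 GB free (illustrative)
```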
Where to Store Data and How Much Disk Space to Allocate
Use the Queue file path setting to determine the location for the persistent queue files. The default is `$CRIBL_HOME/state/queues`. When writing data, Cribl Stream/Edge will append `/<worker-id>/inputs/<input-id>` to this path.
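For example, with the default path and hypothetical worker and Source IDs, queue files would land in a directory like the one sketched below:

```yaml
# Queue file path sketch (worker and Source IDs are hypothetical):
pq:
  path: $CRIBL_HOME/state/queues
# Files then land under a resolved directory such as:
#   $CRIBL_HOME/state/queues/worker-0/inputs/in_syslog/
```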
For queuing to operate properly, you must provide sufficient disk space. In Distributed mode, this minimum disk space configuration is set on each Worker Group/Fleet. You can access this setting at Group/Fleet settings > General Settings > Limits > Min free disk space.
If available disk space falls below this minimum threshold, Cribl Stream/Edge will stop maintaining persistent queues, and data loss will begin. The default minimum is 5 GB.
Source Persistent Queue Notifications
Some Cribl licenses allow you to configure Notifications for Source persistent queues for on-prem deployments. In general, Cribl recommends setting Notifications for these events when using persistent queue to ensure you are aware of important system events. These Notifications always appear in Cribl Stream’s user interface and internal logs. You can also send them to external systems.
You can set Notifications for these Source persistent queue scenarios:
- Persistent queue files exceed a configurable percentage of allocated storage.
- A Source reaches the queue-full state, meaning it is running out of available disk space.
- Persistent Queue Usage (saturation).
See About Notifications for information about adding Notifications for Sources.
Monitor Source Persistent Queues
Cribl Stream provides various Monitoring dashboards to gather information about your persistent queues. You can access these dashboards from Cribl Stream’s sidebar by selecting Monitoring.
To view data about Source persistent queues:
- Open the Monitoring page. From Cribl Stream’s sidebar, select Monitoring to access dashboards displaying data for all Worker Groups and all nodes.
- Navigate to System > Queues (Sources) and view relevant data.
Other Monitoring resources may become unreliable when persistent queue engages during a backpressure event. These include the main Monitoring dashboards for Sources and Destinations. When persistent queue engages, these dashboards will not accurately represent queued data, because Cribl Stream queues data to disk instead of delivering it to Pipelines or the affected Destinations.
The affected Sources will not show EPS and bytes ingested while their data is queueing.
Some other Monitoring considerations:
- Activity in the System > Queues (Sources) dashboards sometimes shows up only when queues drain.
- With or without persistent queues enabled, Monitoring dashboards do not show usage of in-memory buffers for the Destination.
- Cribl Stream currently does not provide a Monitoring dashboard for the disk volume of persistent queues.
- While events go to a Source persistent queue, Cribl Stream does not record metrics about that Source’s inbound data (events/second or bytes).
For these reasons, consider enabling persistent queue notifications.
After the unhealthy condition resolves, and disk queues start to drain, dashboards resume accurately representing delivery rates and volumes.
Reference: Source Persistent Queue Fields
In the Persistent Queue Settings tab, you can optionally specify persistent queue storage, using the following controls. Persistent queue buffers and preserves incoming events when a downstream Destination has an outage or experiences backpressure.
Before enabling persistent queue, learn more about persistent queue behavior and how to optimize it for your system.
On Cribl-managed Cloud Workers (with an Enterprise plan), this tab exposes only the Enable Persistent Queue toggle. If enabled, PQ is automatically configured in `Always On` mode, with a maximum queue size of 1 GB disk space allocated per PQ-enabled Source, per Worker Process. The 1 GB limit is on uncompressed inbound data, and the queue does not perform any compression. This limit is not configurable. For configurable queue size, compression, mode, and other options below, use a hybrid Group.
Enable persistent queue: Default is toggled off. When toggled on:
Mode: Select a condition for engaging persistent queues.
- `Always On`: This default option will always write events to the persistent queue, before forwarding them to the Cribl Stream data processing engine.
- `Smart`: This option will engage PQ only when the Source detects backpressure from the Cribl Stream data processing engine.
`Smart` mode engages only when necessary, such as when a downstream Destination becomes blocked and the in-memory buffer reaches its Buffer size limit. When persistent queue is set to `Smart` mode, Cribl attempts to flush the queue as each new event arrives. The only time events stay in the buffer is when a downstream Destination becomes blocked.
Buffer size limit: The maximum number of events to hold in memory before reporting backpressure to the sender and writing the queue to disk. Defaults to `1000`. This buffer is for all connections, not just per Worker Process. For that reason, it can dramatically expand memory usage. Connections share this limit, which may result in slightly lower throughput for higher numbers of connections. For higher numbers of connections, consider increasing the limit.
Commit frequency: The number of events to send downstream before committing that Stream has read them. Defaults to `42`.
File size limit: The maximum data volume to store in each queue file before closing it and (optionally) applying the configured Compression. Enter a numeral with units of KB, MB, and so forth. If not specified, Cribl Stream applies the default `1 MB`.
Queue size limit: The maximum amount of disk space that the queue is allowed to consume on each Worker Process. Once this limit is reached, this Source will stop queueing data and block incoming data. Required, and defaults to `5 GB`. Accepts positive numbers with units of `KB`, `MB`, `GB`, and so forth. Can be set as high as `1 TB`, unless you've configured a different Worker Process PQ size limit in Group or Fleet settings.
Queue file path: The location for the persistent queue files. Defaults to `$CRIBL_HOME/state/queues`. To this field's specified path, Cribl Stream will append `/<worker-id>/inputs/<input-id>`.
Compression: Optional codec to compress the persisted data after a file closes. Defaults to `None`; `Gzip` is also available.
In Cribl Stream 4.1 and later, the Source persistent queue default Mode is `Always On`, to best ensure events' delivery. For details on optimizing this selection, see Optimize Source Persistent Queues (sPQ).
You can optimize Workers' startup connections and CPU load at Group/Fleet settings > Worker Processes.