Persistent Queue Shared Storage
By default, persistent queues (PQs) write to each Worker Node’s local disk. If that Worker Node goes down before the queue drains, the data stays on that Worker Node until the same Worker Node returns.
Shared PQ storage configures a shared backing store for all Worker Nodes in a Worker Group. That store is either Network filesystem (NFS) or AWS S3. All Worker Nodes can read queue data on that store. When a Worker Node disappears, surviving Worker Nodes can recover and drain its queues. Pair shared storage with orphan management to control when that recovery starts.
Shared PQ storage and orphan management are available in Cribl Stream on customer-managed and hybrid Worker Groups only. They require an Enterprise license. Cribl Edge uses local filesystem PQ storage only. On Cribl.Cloud-provisioned Worker Groups, Cribl manages shared storage automatically. You cannot edit PQ storage or orphan management settings.
For PQ basics, backpressure behavior, and per-Source or per-Destination settings, see About Persistent Queues, Optimize Source Persistent Queues, and Optimize Destination Persistent Queues.
When to Use Shared Storage
Use shared PQ storage when:
- Worker Nodes in the same Worker Group are replaced, scaled down, or fail during a backpressure event.
- You need queued data to survive the loss of a specific Worker Node, not just a downstream outage.
- You run in elastic environments (Kubernetes autoscaling, cloud instance replacement) where Worker Nodes come and go.
Use local filesystem PQ storage when:
- Worker Nodes have stable identities and local disks you control (for example, fixed on-prem hosts).
- You need the lowest write latency and do not need cross-Worker Node recovery.
For storage type trade-offs and sizing guidance, see Where to Store Data and Worker Group PQ Storage Fields.
How Shared Storage Works
Shared storage namespaces queue data by Worker Node so Worker Nodes in the same Group do not overwrite each other’s queues. Each Worker Node writes to its own area on the shared store. Surviving Worker Nodes can read and drain queues that belonged to a failed Worker Node.
- Local filesystem and NFS: Configure the Queue file path (local) or Mount point (NFS).
- AWS S3: Configure the bucket Key prefix. Cribl Stream uses a local cache on each Worker Node for fast writes, then uploads data to S3 asynchronously.
Migrate from Local Disk
When you first enable shared PQ storage and deploy the configuration, Cribl Stream automatically copies existing local PQ data from each Worker Node to the shared store. Drain queues or plan for this migration window before you switch storage types in production.
Switch Back to Local Storage
You can change Storage type back to Local filesystem at any time. Data is not migrated automatically. Drain all shared queues into your Destinations, or use Clear persistent queue, before you switch. Otherwise, data remains in the S3 bucket or NFS mount and is not attached to the new local queues.
Configure Shared Storage
Configure shared PQ storage at the Worker Group level. All Worker Nodes in the Group must use the same storage type. You cannot mix local and shared storage in one Group.
- Select Products, then Cribl Stream. Under Worker Groups, select a Worker Group.
- Select Worker Group Settings, then System, then PQ Storage.
- Set Storage type to Network filesystem (NFS) or AWS S3.
- Configure connection settings, limits, and compression. See Worker Group PQ Storage Fields for field descriptions and recommendations.
- Select Save, then Commit and Deploy.
After you configure shared storage, enable PQ on individual Sources or Destinations. When shared storage is active, Queue size limit and related capacity fields are managed at the Worker Group level and are hidden on individual Source and Destination PQ settings.
For Kubernetes deployments, mount shared volumes consistently on Worker Nodes and the Leader, and preserve Worker Node identity across restarts. See Persistent Queue Considerations for Kubernetes Deployments.
Orphan Management and PQ Revival
An orphan is queued data on shared storage from a Worker Node that is no longer active in the Worker Group. This can happen when a Worker Node is replaced, scaled down, or cannot reach the Leader for an extended period.
A restart that preserves the same Worker Node identity is not an orphan. The same Worker Node reconnects and resumes draining its own queues.
Revival Flow
When a Worker Node has been absent longer than the Orphan detection threshold, Cribl Stream can reassign its orphaned queues to surviving Worker Nodes for drain. Surviving Worker Nodes read the queued data from shared storage and deliver it through the configured Routes, Pipelines, and Destinations.
If the original Worker Node returns before reassignment completes, it resumes draining its own queues.
Configure Orphan Management
Orphan management controls when the Leader scans shared storage and reassigns orphaned queues to surviving Worker Nodes. It applies only when Storage type is Network filesystem (NFS) or AWS S3.
PQ Orphan Management appears in the UI only after you configure shared PQ storage. If Storage type is Local filesystem, the setting is hidden because orphan management does not apply to per-Worker Node local queues. The menu item also does not appear on Cribl.Cloud-provisioned Worker Groups.
To configure orphan management in the UI:
- Configure shared storage in Worker Group Settings > System > PQ Storage. Set Storage type to Network filesystem (NFS) or AWS S3, then select Save, Commit, and Deploy.
- Return to Worker Group Settings > System. PQ Orphan Management appears next to PQ Storage.
- Adjust Scan interval (minutes), Orphan detection threshold (minutes), or Disable orphan management as needed. Select Save, then Commit and Deploy.
You can also configure orphan management through the /system/pq/orphan-management API on customer-managed and hybrid Worker Groups.
| Setting | Default | Description |
|---|---|---|
| Scan interval (minutes) | 1 | Minimum time between full scans of shared PQ storage per Worker Group. |
| Orphan detection threshold (minutes) | 20 | How long a Worker Node may be absent from the Leader’s active set before its queues are eligible for reassignment. The threshold is evaluated when a scan runs, not every second. |
| Disable orphan management | Off | When toggled on, orphan detection, reassignment, and unassignment are disabled for the Group. Use during long planned maintenance when Worker Nodes may be offline for extended periods. |
Defaults suit auto-scaling fleets. Increase Orphan detection threshold for long maintenance windows, or toggle on Disable orphan management when Worker Nodes may be offline for extended periods and you do not want automatic reassignment.
After you change orphan management settings, Commit and Deploy so all Worker Nodes receive the updated configuration.
The PATCH request for
/system/pq/orphan-managementreplaces the entire configuration document. Include all fields in each update.
Observability
Query metrics from Monitoring or the metrics API. Filter by Worker Group, Worker Node, Source, or Destination as needed. For the full persistent queue metric catalog, see Internal Metrics.
Orphan Management Metrics
| Metric | Type | Meaning |
|---|---|---|
pq.orphan_detected | Gauge | Per Worker Group. Orphan PQ directories detected in the latest scan cycle. Should be 0 in steady state. |
Shared Storage Metrics
Shared storage metrics provide a general indication of remote store activity. They are not designed for precise accounting. Do not use them to calculate data loss or make other functional guarantees about queue durability.
| Metric | Type | Meaning |
|---|---|---|
pq.shared_storage_bytes_in | Counter | Bytes read from shared storage into Cribl Stream. |
pq.shared_storage_bytes_out | Counter | Bytes written from Cribl Stream to shared storage. |
pq.shared_storage_api_calls | Counter | Storage API call volume. Dimensions: type (s3 or nfs), purpose (scan, data, or clear), and operation (read, write, or list). |
pq.shared_storage_api_errors | Counter | Storage API failures. Dimensions: type and purpose (scan, data, or clear). |
pq.disk_buffered_bytes | Gauge | S3 only. Bytes in the local staging cache while uploading or downloading. |
purpose values indicate the type of storage activity:
scan: Storage inventory during orphan detection.data: Normal queue reads and writes.clear: Clear persistent queue operations.
Per-Queue Metrics
These metrics apply to each Source or Destination PQ on each Worker Process, including shared storage deployments:
| Metric | Type | Meaning |
|---|---|---|
pq.queue_size | Gauge | Uncompressed bytes in the PQ journal for this Source or Destination on this Worker Process. |
pq.buffered_events | Gauge | Events held in the in-memory buffer before the PQ journal. |
pq.health | Gauge | PQ health state: 0 Green, 1 Yellow, 2 Red. |
pq.in_bytes / pq.out_bytes | Counter | Bytes written into or flushed out of the queue during the reporting window. |
system.pq_used | Gauge | Leader-side PQ utilization ratio (0 to 1) aggregated across Worker Nodes for a Source or Destination. Updates on a slower cadence than per-Worker Process pq.queue_size. |
For queue depth during backpressure, plot pq.queue_size and pq.buffered_events per Source or Destination. Compare per-Worker Process pq.queue_size to system.pq_used on the Leader when you need a Group-wide utilization view.
Logs
Check Leader and Worker logs for persistent queue errors during investigations. On Cribl.Cloud, monitor PQ health alerts for shared storage connectivity issues. See Optimize Destination Persistent Queues.
Troubleshooting
| Symptom | Likely cause |
|---|---|
pq.orphan_detected stays high | Orphaned queues are not draining. Check Destination health and shared storage connectivity. |
Sustained pq.shared_storage_api_errors | S3 credentials, network, or bucket policy issues. Split by purpose to isolate scan vs. data vs. clear failures. |
pq.health is 2 on a Destination | Downstream blocker or PQ error state (Red). Fix the Destination before you expect the queue to drain. |
| Queues fill but nothing drains | Check shared storage connectivity, then Destination health. |
| Data replays after you cleared a queue | Clear persistent queue removes queued data and any in-progress revival drain for that Source or Destination on each Worker Node. Without that clear, revival could replay old data after you thought the queue was empty. |
See Also
- Worker Group PQ Storage Fields for the storage type settings reference
- Internal Metrics for the full persistent queue metric reference
- Persistent Queue Optimization Metrics for CPU, disk, latency, and cost tuning
- Persistent Queue Considerations for Kubernetes Deployments for autoscaling, volume mounts, and Worker Node identity
- pq CLI command to benchmark local PQ read and write performance