Considerations for Cribl Stream on Kubernetes

Kubernetes can be a powerful platform for deploying Cribl Stream, offering scalability and flexibility. However, there are specific use cases where Kubernetes might not be the optimal choice. This guide will explore the factors to consider when deciding whether Kubernetes is the right fit for your Cribl Stream deployment. We’ll also discuss workarounds and best practices to ensure a smooth integration.

Understanding the Challenges of Dynamic Workloads and Horizontal Pod Autoscaling

To maximize the benefits of Cribl Stream’s integration with Kubernetes, it’s essential to understand its optimal use cases and potential limitations.

Optimal Use Cases

Cribl Stream, when deployed on Kubernetes, excels at handling pull-based Sources, where data is readily available for collection. This is because Kubernetes’s containerized environment and dynamic scaling capabilities align well with data Sources where collection is on demand or scheduled.

By actively pulling data from Sources like Collector Sources, Cribl Stream running on Kubernetes ensures timely data ingestion and minimizes the risk of data loss. This proactive approach also enables Cribl Stream to handle a large number of Sources efficiently, making it ideal for environments with many log files or syslog servers.

Less-Ideal Use Cases

Cribl Stream Workers running on Kubernetes can effectively handle data streams from push-based Sources, provided they are provisioned appropriately with fixed resources and sufficient overhead to accommodate average and peak data volumes. However, challenges can arise when dealing with senders that experience frequent, significant fluctuations in data volume, especially when using auto-scaling with scale-in policies.

Kubernetes environments, with their dynamic scaling capabilities, can be powerful tools to accommodate these fluctuations. However, the ephemeral nature of Kubernetes pods, particularly when managed by the Horizontal Pod Autoscaler (HPA), can introduce complexities. HPA automatically scales the number of pods based on defined metrics, such as CPU utilization. While this is beneficial for resource optimization, it can lead to unexpected behavior in certain scenarios.

When a sudden spike in data volume occurs, HPA may scale up by creating new pods to handle the increased load. However, if the traffic spike is short-lived, these newly created pods may be terminated prematurely to conserve resources. This can disrupt ongoing data processing tasks, leading to data loss or processing delays. Additionally, the constant creation and destruction of pods can introduce latency and overhead, impacting overall system performance.

Consider the following when deploying Cribl Stream within Kubernetes environments.

General Kubernetes Best Practices

When managing large Cribl deployments within Kubernetes, it’s essential to be aware of potential limitations inherent to large Kubernetes clusters. To ensure optimal performance and stability, consider the best practices outlined in the Kubernetes documentation. These guidelines address critical aspects like resource management, network configuration, and monitoring, which become increasingly important as your Kubernetes environment scales.

Tuning Cribl for Kubernetes

Cribl Stream leverages Worker Processes to efficiently handle data ingestion and processing tasks. By configuring the maximum number of Worker Processes per pod, you can optimize resource allocation and prevent overloading.

Kubernetes containers use resource limits that map to shares of the host's CPU and memory. Although a Kubernetes container can see the entire host CPU, it can use only the resources allocated to the pod based on the configured limits. For details, see Resource Management for Pods and Containers.

To ensure optimal performance and avoid overloading your Kubernetes pods, consider the following:

Set the Maximum Worker Processes:
Pass --set env.CRIBL_MAX_WORKERS=NUMBER_WORKER_PROCESSES to Helm to specify the maximum number of Worker Processes per pod. Cribl Stream cannot directly determine the number of CPUs available within a container, so setting this value explicitly helps balance resource utilization and prevent excessive load on individual pods.
Determine the Optimal Worker Process Count:
- Self-Hosted Environments: Calculate the available CPU cores on your host nodes and set the Worker Process count accordingly. To avoid overloading your nodes, it’s recommended to deploy a limited number of Cribl Stream pods per node. A 1:1 mapping, where each node hosts a single Cribl Stream pod, is often a good starting point. For details, see Sizing and Scaling.
- Cloud Environments: Refer to your cloud provider’s instance type specifications or leverage auto-scaling mechanisms like HPA to dynamically adjust the number of pods based on demand.
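As a rough sketch of the tuning described above, a values.yaml fragment might pair the Worker Process cap with the pod's CPU limit. The env.CRIBL_MAX_WORKERS key comes from the --set example above. The resources block and every number shown are illustrative assumptions, so check your chart's values reference and Sizing and Scaling before adopting them.

```yaml
# Hypothetical values.yaml fragment; all numbers are placeholders.
env:
  # Cap Worker Processes per pod. Quoted so YAML treats it as a string;
  # verify how your chart version renders env values.
  CRIBL_MAX_WORKERS: "4"

# Assumed chart key: standard Kubernetes requests/limits for the Worker container.
resources:
  requests:
    cpu: "4"
    memory: 8Gi
  limits:
    # Keep the CPU limit in step with the Worker Process cap above, because the
    # container sees the host's CPUs but can only use what the limit allows.
    cpu: "4"
    memory: 8Gi
```

Setting requests equal to limits is one way to get predictable, fixed resources per pod. Whether that trade-off suits your cluster is a judgment call.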

Security Context Considerations for OpenShift and Non-Root Environments

To make sure your deployments run smoothly on OpenShift or non-root environments, you’ll need to configure the security context through the values.yaml file. The values.yaml file allows you to override default settings and customize your deployment without altering the core Helm chart templates.

To add or update the values.yaml file:

Locate the values.yaml file.
Modify the securityContext section.
Customize the capabilities to match your environment’s needs:

securityContext:
  capabilities:
    add:
      - CAP_NET_BIND_SERVICE
      - CAP_SYS_PTRACE
      - CAP_DAC_READ_SEARCH
      - CAP_NET_ADMIN

Install your updated values with Helm, supplying your own release name and chart reference: helm install <release-name> <chart> -f /bar/values.yaml. For a description of the capabilities, see Set Capabilities for Cribl Stream.

Persistent Queue Considerations for Kubernetes Deployments

Persistent queuing in Cribl Stream on Kubernetes requires careful consideration of autoscaling, storage, and Worker lifecycle behavior.

Persistent Storage on Kubernetes

Use persistent storage with Cribl Stream on Kubernetes when you need durable storage for staging files or persistent queues (PQ). Choose storage that provides the durability and performance your workload requires, and validate the behavior of your storage class in your Kubernetes environment before broad rollout.

For PQ, configure the storage type at the Worker Group level. Cribl Stream supports three storage type options, of which Network filesystem and AWS S3 are shared:

Local filesystem: Stores data on each Worker Node’s local disk.
Network filesystem (NFS): Stores data on a shared network volume. Mount the volume on Worker Nodes and the Leader Node at the same path.
AWS S3: Stores data in an S3-compatible object store. Ensure the bucket is reachable from Worker Nodes and the Leader Node. This is the recommended shared storage option for environments where Workers are replaced or scaled frequently.

For storage selection guidance and configuration details, see Optimize Destination Persistent Queues and Worker Group PQ Storage Fields.

Autoscaling and Worker Lifecycle

When you configure shared PQ storage (Network filesystem or AWS S3) at the Worker Group level, the Leader can detect orphaned PQ data from a Worker that has disconnected and reassign it to surviving Workers for draining. This behavior helps you recover queued data during Worker lifecycle events such as scale-down or pod restart. It does not apply to local filesystem PQ storage, where queued data remains tied to the individual Worker Node’s disk.

Reassignment is not immediate. By default, the Leader waits until a Worker has been absent for 20 minutes before it treats that Worker’s PQ data as orphaned. You can tune this behavior in Worker Group Settings > System > PQ Orphan Management (visible only when shared PQ storage is configured), or through the /system/pq/orphan-management API. See Persistent Queue Shared Storage for orphan management settings, the revival flow, and troubleshooting.

However, aggressive or poorly tested automatic scale-in can still create risk. Follow these guidelines:

Test scale behavior: Validate scale-down and scale-up behavior in a non-production environment before you enable automatic scale-in in production.
Preserve Worker identity: Prefer deployment patterns that preserve Worker identity and storage continuity, such as StatefulSet deployment mode in the Worker Group chart. StatefulSet assigns pods to specific volumes and helps maintain data consistency across restarts.
Manage scale-in deliberately: Set Horizontal Pod Autoscaler (HPA) scale-in policies to allow enough time for PQ data to flush or transfer before pods terminate. Consider limiting automatic scale-in during periods of backpressure.
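The scale-in guidance above can be sketched with the standard Kubernetes autoscaling/v2 behavior fields. This is a minimal example, not a recommended configuration: the resource names (cribl-worker, cribl-worker-hpa), the replica counts, the CPU target, and the timing values are all hypothetical and should be sized against how long your PQ data takes to drain.

```yaml
# Hypothetical HPA manifest; names and numbers are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cribl-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet        # assumes Workers run in StatefulSet deployment mode
    name: cribl-worker
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react to spikes right away
    scaleDown:
      # Act on the highest replica recommendation from the last 30 minutes,
      # so a short dip in load does not remove pods that still hold PQ data.
      stabilizationWindowSeconds: 1800
      policies:
        - type: Pods
          value: 1            # remove at most one pod...
          periodSeconds: 600  # ...per 10-minute period
      selectPolicy: Min       # apply the most conservative scale-in policy
```

To pause automatic scale-in entirely during periods of backpressure, Kubernetes also accepts selectPolicy: Disabled under scaleDown.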

Storage Performance and Configuration

Your storage choice affects performance and recovery behavior:

Block storage (EBS): Amazon Elastic Block Store (EBS) volumes provide high performance for per-node local storage. EBS volumes can move to a new Kubernetes host in the same Availability Zone (AZ) if a host fails. When you use EBS, do not span Worker Groups across Availability Zones.
Shared storage (NFS and S3): Shared storage lets surviving Worker Nodes recover and drain PQ data from a missing Worker Node. S3-backed PQ uses a local on-disk cache for high-speed writes, then asynchronously uploads data to S3. NFS requires a stable, high-performance network mount. See Where to Store Data for when to use each option.

Configure volume mounts with the following settings:

CRIBL_VOLUME_DIR: Set this environment variable to the mount point of your persistent volume. Setting CRIBL_VOLUME_DIR ensures that the Worker GUID is consistent on each startup.
extraVolumeMounts: Use the Helm chart’s extraVolumeMounts option to attach persistent volumes to Worker pods. You also need extraVolumeMounts in non-root Kubernetes environments like Anthos and OpenShift. See Security Context Considerations for OpenShift and Non-Root Environments.
Volume claim templates: When you use StatefulSet deployment mode, define PVC mounts in the chart’s volumeClaimTemplates configuration and align the mount point with CRIBL_VOLUME_DIR.
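To illustrate how these settings line up, here is a minimal sketch written as plain Kubernetes StatefulSet fields rather than the chart's own value names, which are not shown here. The names (cribl-worker, cribl-data), the mount path /opt/cribl/data, the image tag, and the 20Gi size are hypothetical. Leader connection settings are omitted for brevity.

```yaml
# Hypothetical StatefulSet; names, path, and size are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cribl-worker
spec:
  serviceName: cribl-worker
  replicas: 3
  selector:
    matchLabels:
      app: cribl-worker
  template:
    metadata:
      labels:
        app: cribl-worker
    spec:
      containers:
        - name: cribl-worker
          image: cribl/cribl:latest   # pin a specific version in practice
          env:
            # Points Cribl Stream at the persistent volume so the Worker GUID
            # stays consistent across restarts.
            - name: CRIBL_VOLUME_DIR
              value: /opt/cribl/data
          volumeMounts:
            # mountPath must match CRIBL_VOLUME_DIR.
            - name: cribl-data
              mountPath: /opt/cribl/data
  volumeClaimTemplates:
    # Each pod receives its own PVC (cribl-data-cribl-worker-0, -1, ...),
    # which is reattached when the pod restarts.
    - metadata:
        name: cribl-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```

Because no storageClassName is set, the claims use the cluster's default storage class. Set one explicitly if your PQ workload needs a specific performance tier.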

Monitoring

Continuously monitor PQ usage, including storage utilization and performance metrics, to identify potential bottlenecks. Monitor Worker Group/Fleet pods for volume issues. The faster you detect volume problems, the more likely you can resolve them before data loss occurs.

For Cribl-managed Cribl.Cloud Worker Groups, a Cloud persistent queue (PQ) health checker validates that Workers can see their configured shared storage. The health checker runs on the Leader and verifies shared-volume visibility across Workers in the Worker Group. Monitor PQ health alerts and investigate configuration errors before they affect data recovery. See Optimize Destination Persistent Queues for related log events.
