Operational Monitoring

A performant and reliable Cribl deployment relies upon a defined architecture for monitoring its operational health, data flows, and change events.

Visibility and Health

A complete monitoring architecture for Cribl combines several capabilities, all fed by the Cribl Internal Source. Your architectural plan must account for three distinct operational workflows, each serving a different purpose.

  • Built-in Monitoring Dashboards: This is the primary interface for day-to-day operational visibility. It’s designed for immediate troubleshooting, showing real-time data flows, Pipeline performance, and component health (CPU/Memory); a minimal health-poll sketch follows this list.

  • Long-Term Strategic Analysis: For long-term trend analysis, capacity planning, and historical reporting, the architecture must include forwarding internal metrics and logs to an archival Destination like Cribl Lake or a dedicated external platform. This provides data persistence and allows you to correlate performance with your broader infrastructure.

  • Cribl Search: For immediate, deep-dive troubleshooting that goes beyond dashboards, use Cribl Search. It lets you query internal logs in place, whether on your Cribl Nodes or in archival storage, providing rapid answers during incident response without first forwarding and indexing that data in another system.
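
To illustrate the day-to-day visibility workflow, the following minimal Python sketch polls each Node's health endpoint. It assumes the API is reachable on port 9000 at /api/v1/health and returns a JSON body with a status field; the hostnames are placeholders, so verify the endpoint and response shape against your own deployment.

```python
# Minimal health-poll sketch. Assumptions: each Node exposes
# GET /api/v1/health on its API port (9000 here), and the hostnames
# below are placeholders for your own deployment.
import json
import urllib.request

NODES = ["worker-01.example.com", "worker-02.example.com"]  # hypothetical hosts
PORT = 9000

def check_health(host: str) -> str:
    """Return one Node's reported health status, or 'unreachable'."""
    url = f"http://{host}:{PORT}/api/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.loads(resp.read())
            # Assumed response shape: {"status": "healthy"}
            return body.get("status", "unknown")
    except OSError:
        return "unreachable"

if __name__ == "__main__":
    for node in NODES:
        print(f"{node}: {check_health(node)}")
```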

Deployment Model Considerations

Your responsibility for monitoring the Cribl components changes based on your deployment model:

  • Cribl.Cloud: This is a managed service. Cribl provides the monitoring UI for your cloud-hosted Nodes. Your responsibility is to use these tools and to plan whether, and where, you will forward their monitoring data for long-term retention.
  • On-Premises: You are responsible for monitoring your entire Cribl deployment. This includes both the Cribl application data and the health of the underlying host operating system and network for your self-hosted Leader and Worker nodes.
  • Hybrid: You have a shared responsibility. You must provide full-stack monitoring for your self-hosted Cribl Nodes, while using the managed tools for your cloud-hosted Nodes. A key architectural challenge is unifying these two sources of monitoring data into a single view.

Monitor Audit and Change Events

While auditing is a security function, monitoring audit events is an operational task critical for stability, because unplanned changes are a common cause of incidents.

  • Monitor Configuration Changes: The Leader Node uses a Git repository and detailed logs (such as audit.log) to track every configuration change. Your monitoring strategy should include forwarding these logs to a central system. Plan to build dashboards or alerts that track deployment frequency and flag unexpected configuration activity (a sketch of one such tally follows this list). This enables rapid investigation, especially when you use Cribl Search to query these audit logs directly on the Leader.

  • Monitor Administrative Access: The access.log and ui-access.log provide a trail of all administrative activity. While primarily for security, monitoring these for unusual patterns (like a spike in logins) can be an early indicator of operational issues or unauthorized activity affecting the platform.
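
To make "track deployment frequency" concrete, here is a minimal sketch that tallies events in audit.log per user. It assumes the default log location under $CRIBL_HOME/log and that each line is a JSON object carrying a user field; check the actual field names in your environment before relying on them.

```python
# Sketch: count audit events per user from the Leader's audit.log.
# Assumptions: audit.log lives under $CRIBL_HOME/log/ and each line is
# a JSON object with a "user" field (verify against your own logs).
import json
import os
from collections import Counter
from pathlib import Path

CRIBL_HOME = os.environ.get("CRIBL_HOME", "/opt/cribl")  # assumed install path
AUDIT_LOG = Path(CRIBL_HOME) / "log" / "audit.log"

def change_counts(path: Path) -> Counter:
    """Tally audit events per user, skipping lines that aren't valid JSON."""
    counts: Counter = Counter()
    with path.open() as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            counts[event.get("user", "unknown")] += 1
    return counts

if __name__ == "__main__":
    for user, n in change_counts(AUDIT_LOG).most_common():
        print(f"{user}: {n} audit events")
```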

Deployment Model Considerations

Your scope for monitoring these events differs:

  • Cribl.Cloud: Your monitoring scope is limited to the application-level audit logs from your customer-managed Workers and Edge nodes. Logs generated by the Cribl-managed Leader are not exposed for forwarding.
  • On-Premises: You are responsible for monitoring both the Cribl application audit logs from all components and the host-level access logs (like SSH) for the servers running your Cribl components.
  • Hybrid: You have a combined responsibility to monitor application-level logs from all components and host-level logs for your on-premises Workers.

Performance Monitoring and Scaling

Your monitoring strategy is key to understanding when and how to scale your Cribl deployment.

  • Monitor Resource-Intensive Functions: Not all processing is equal. Functions like masking, redaction, and complex parsing are CPU-intensive. Your monitoring plan must include tracking CPU utilization on Worker and Edge Nodes, segmented by the type of workloads they handle. A CPU spike in a specific Worker Group or Fleet can indicate a need to scale out or to optimize a Pipeline (see the sketch after this list).

  • Monitor the Monitoring System: You can control the level of metric detail that Worker and Edge Nodes send to the Leader. For large-scale deployments, your architectural plan should specify a metric level (such as Minimal, Basic, or Custom). This is a critical tuning decision; monitor its effect to ensure that the Leader Node itself does not become a performance bottleneck.
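
The Group-level CPU analysis described above can start as a simple aggregation. This sketch flags Worker Groups whose average CPU exceeds a threshold; the sample readings and the 85% threshold are illustrative assumptions, and in practice the samples would come from the internal metrics you forward for long-term analysis.

```python
# Sketch: flag Worker Groups with sustained high CPU.
# The (group, cpu_percent) samples are illustrative; feed them from the
# internal metrics you forward to your archival Destination.
from collections import defaultdict

samples = [
    ("masking-group", 91.0), ("masking-group", 88.5), ("masking-group", 93.2),
    ("passthru-group", 22.4), ("passthru-group", 19.8),
]

CPU_ALERT_THRESHOLD = 85.0  # illustrative; tune to your own baseline

def groups_over_threshold(samples, threshold):
    """Return average CPU per group, for groups whose mean exceeds threshold."""
    by_group = defaultdict(list)
    for group, cpu in samples:
        by_group[group].append(cpu)
    return {
        group: sum(vals) / len(vals)
        for group, vals in by_group.items()
        if sum(vals) / len(vals) > threshold
    }

if __name__ == "__main__":
    for group, avg in groups_over_threshold(samples, CPU_ALERT_THRESHOLD).items():
        print(f"Scale or optimize candidate: {group} (avg CPU {avg:.1f}%)")
```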

Deployment Model Considerations

Your role in performance monitoring and scaling varies:

  • Cribl.Cloud: You monitor the performance metrics of your Cribl Nodes to forecast resource needs.
  • On-Premises: You are responsible for monitoring the full stack—from the Cribl application metrics down to the underlying compute, storage, and network resources—to make scaling decisions for your infrastructure.
  • Hybrid: You perform full-stack monitoring for your self-hosted Cribl Nodes to manage their capacity.

Monitor Data Flows and Routing

A core function of Cribl is routing data. Your monitoring architecture must be able to validate that these data flows are healthy and correct.

  • Monitor Destination Health and Backpressure: The most critical operational metrics are pq.queue_size (persistent queue depth) and blocked.outputs. Your architecture must include alerting on these metrics, either through the built-in Notifications for immediate alerts or via your external monitoring platform. A growing queue is the primary indicator that a Destination is down or slow, putting data delivery at risk. Note that persistent queues (PQ) are optional and must be enabled and configured per Source or per Destination; see Manage Backpressure. If you enable PQ, also consider Notifications on “PQ engaged” conditions to detect the moment queuing starts.

  • Validate Routing Logic: Use the out_bytes and out_events metrics, which are tagged per Destination, to validate your routing rules. Your monitoring should confirm that the expected volume of data is flowing to the correct Destinations: for example, a high volume going to your low-cost archive and a smaller, reduced volume going to your analytics platform. The sketch after this list illustrates both checks.
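
Both checks above reduce to simple trend and threshold tests over forwarded metrics. The following minimal Python sketch illustrates them; the Destination names, readings, expected shares, and growth factor are all placeholder assumptions, and in practice the values would come from whatever store receives your internal metrics.

```python
# Sketch: two data-flow checks over forwarded internal metrics.
# 1) Backpressure: alert when a Destination's persistent queue is growing.
# 2) Routing: alert when a Destination's share of out_bytes drifts from plan.
# All readings below are placeholders for values pulled from your own store.

# pq.queue_size readings per Destination, oldest to newest
pq_queue_size = {
    "analytics": [0, 0, 0],
    "archive": [10_000, 55_000, 240_000],  # growing: Destination slow or down
}

# out_bytes totals per Destination over the same window
out_bytes = {"archive": 9_000_000_000, "analytics": 1_200_000_000}
expected_share = {"archive": 0.85, "analytics": 0.15}  # from your routing plan
SHARE_TOLERANCE = 0.10  # illustrative drift allowance

def queue_growing(readings, min_growth=2.0):
    """True if the persistent queue grew noticeably across the window."""
    first, last = readings[0], readings[-1]
    if first == 0:
        return last > 0
    return last / first >= min_growth

total = sum(out_bytes.values())
for dest, readings in pq_queue_size.items():
    if queue_growing(readings):
        print(f"ALERT: persistent queue growing for {dest}: {readings}")
for dest, sent in out_bytes.items():
    share = sent / total
    if abs(share - expected_share[dest]) > SHARE_TOLERANCE:
        print(f"ALERT: {dest} carries {share:.0%} of out_bytes, "
              f"expected ~{expected_share[dest]:.0%}")
```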

Deployment Model Considerations

Monitoring data flows has unique challenges in each model:

  • Cribl.Cloud: Data flow monitoring is straightforward within the cloud environment. Your primary focus is on ensuring Destinations are reachable from Cribl.Cloud.
  • On-Premises: You are responsible for monitoring the network paths between your data Sources, Cribl Nodes, and their Destinations, all within your own infrastructure.
  • Hybrid: The network link between your on-premises environment and Cribl.Cloud Nodes is the most critical component to monitor. A failure here will immediately cause backpressure. Your plan must include monitoring the health of this link and, most importantly, alerting on the backpressure metrics of your on-premises Cribl Nodes.

Further Reading

Learn more about Monitoring and Backpressure.