Monitor Leader Health and Logs
A resilient Cribl deployment depends on more than just Worker and Edge Node capacity; the health of the Leader, both the application and its underlying host, is critical for orchestration and configuration integrity.
This topic outlines the recommended architectural patterns for monitoring Leader Nodes across different deployment models.
Responsibility by Deployment Model
Your monitoring strategy and level of data access depend on who manages the control plane. While Cribl.Cloud provides a managed experience with native telemetry integration, customer-managed environments require a proactive architecture to capture host and application health.
| Component | Cribl.Cloud (Managed) | Customer-Managed |
|---|---|---|
| Infrastructure | Cribl manages the high-availability (HA) control plane and underlying infrastructure. | You are responsible for sizing, HA, failover, and OS maintenance of all self-hosted Nodes. |
| Host Metrics | Cribl provides abstracted host metrics, visible in built-in UI dashboards. | You must collect OS metrics (CPU, RAM, disk, network) via Cribl Edge on all self-hosted Nodes. |
| Internal Logs | Curated logs are auto-routed to Cribl Lake and queryable via Cribl Search. | You have access to all filesystem logs, collected from all self-hosted Nodes via Cribl Edge and a Cribl Stream Pipeline. |
| Primary Tooling | Cribl Search, Cribl Lake, and Cribl Insights (System Insights dashboards and Alerts) in Cribl.Cloud. | Cribl Edge and Cribl Stream (monitoring Pipeline), optionally Cribl Lake, in addition to external SIEM/observability tools. |
Cribl.Cloud Pattern
In a Cribl-managed environment, Leader monitoring is treated as “Monitoring-as-a-Service.” Cribl performs the heavy lifting of log collection and infrastructure health monitoring on your behalf.
- Log access: Most operational logs and internal metrics are automatically routed to Cribl Lake for use with Cribl Search. You do not need to configure the collection.
- Analysis and alerting: Use Cribl System Insights dashboards to monitor control-plane and data-plane health, and Insights Alerts for centralized alerting across products. Use Cribl Search when you need custom queries or scheduled searches over raw logs and Lake datasets.
- Operational view: Use the Cribl System Insights views in Cribl.Cloud for real-time health and historical trends across Leaders, Workers, and Edge Nodes.
Customer-Managed Pattern
When you host the Leader, you must replicate the telemetry Pipeline that Cribl manages in the cloud. Because a Leader is a control-plane component, it cannot act as its own data-plane collector. It coordinates the Worker Groups/Fleets but does not tail its own files or collect host-level metrics.
The Recommended Pattern
Deploy Cribl Edge on the Leader Node. Treat the Leader Node as a managed endpoint that forwards telemetry to a dedicated Stream Worker Group.
This pattern mirrors the automation Cribl performs in the cloud, giving you full visibility into the Leader Node and application state.
For details, see Installing Cribl Edge on Linux.
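Before relying on this telemetry, it can help to verify from the Leader host that the Edge agent is running and that the Worker Group's receiver is reachable. The following is a minimal sketch in Python; the hostname, port, and process hint are placeholder assumptions, not Cribl defaults, so substitute the values for your environment.

```python
#!/usr/bin/env python3
"""Sanity check for Cribl Edge on a Leader host.

The values below (host, port, process hint) are assumptions -- replace
them with your Worker Group's actual receiver address and port.
"""

import socket
import subprocess
import sys

WORKER_GROUP_HOST = "stream-workers.example.internal"  # placeholder
WORKER_GROUP_PORT = 10200                              # placeholder
EDGE_PROCESS_HINT = "cribl"  # substring to look for in the process table


def edge_process_running() -> bool:
    """Return True if a Cribl process appears in the process table (Linux pgrep)."""
    result = subprocess.run(
        ["pgrep", "-f", EDGE_PROCESS_HINT], capture_output=True, text=True
    )
    return result.returncode == 0


def worker_group_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the Worker Group receiver succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    ok = True
    if not edge_process_running():
        print("WARN: no Cribl process found on this host")
        ok = False
    if not worker_group_reachable(WORKER_GROUP_HOST, WORKER_GROUP_PORT):
        print(f"WARN: cannot reach {WORKER_GROUP_HOST}:{WORKER_GROUP_PORT}")
        ok = False
    sys.exit(0 if ok else 1)
```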
Components of the Pipeline
- Cribl Edge (Collection): Installed on the Leader Node to collect:
  - Host metrics: CPU, memory, network, and disk (crucial for bundle storage and PQ volumes).
  - Leader logs: Tailing `cribl.log`, `audit.log`, and `access.log`.
- Cribl Stream (Processing): The Cribl Edge Node forwards data to a Cribl Stream Worker Group. Here, you normalize the data and apply tags (such as `env:prod`, `role:leader`), as illustrated in the sketch after this list. For details, see Cribl Edge to Cribl Stream.
- Destinations (Analysis): Cribl Stream routes the telemetry to your primary SIEM, observability tool, or Cribl Lake for long-term retention and auditing.
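Conceptually, the processing step amounts to stamping every Leader event with consistent metadata before routing. The sketch below illustrates that normalization in plain Python; it is not Cribl Pipeline syntax, and the tag values, log path, and field names are illustrative assumptions.

```python
from typing import Any

# Example tags applied to every Leader telemetry event.
LEADER_TAGS = {"env": "prod", "role": "leader"}


def normalize(event: dict[str, Any], source_host: str) -> dict[str, Any]:
    """Attach consistent metadata so downstream tools can filter Leader
    telemetry the same way, regardless of which log or metric it came from."""
    enriched = dict(event)
    enriched.update(LEADER_TAGS)
    enriched["host"] = source_host

    # Derive a consistent sourcetype from the log file name, so cribl.log,
    # audit.log, and access.log events share one naming convention.
    src = enriched.get("source", "")
    base = src.rsplit("/", 1)[-1].removesuffix(".log") if src else "unknown"
    enriched["sourcetype"] = f"cribl:{base}"
    return enriched


if __name__ == "__main__":
    sample = {"source": "/opt/cribl/log/audit.log", "message": "config change"}
    print(normalize(sample, "leader-01.example.internal"))
```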
Key Observability Metrics
Once the Pipeline is established, focus your alerting on these three pillars:
- Control plane health: Leader uptime, API availability, and the success rate of configuration bundle distribution to Worker/Edge Nodes. For details, see Monitor Health and Metrics.
- Security and compliance: Monitor `audit.log` for unauthorized configuration changes and `access.log` for unexpected administrative logins.
- Resource constraints: Exhaustion of host resources is a leading cause of service disruption. Focus your monitoring on these high-impact areas. For details, see Internal Metrics.
  - Disk capacity: Monitor disk utilization for bundle storage and log volumes; running out of disk space is a primary cause of Leader instability. A minimal host-side check is sketched after this list.
  - Persistent queues (PQ): For customer-managed deployments, monitor PQ closely. Backpressure or Destination unavailability can cause queues to fill rapidly, leading to data loss or Leader performance issues. For Cribl.Cloud, while the infrastructure is managed, monitoring PQ metrics remains a critical “early warning” indicator. Rising queue depths often signal downstream Destination issues or throughput bottlenecks before they impact the control plane. For details, see Manage Backpressure.
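For the control-plane and disk-capacity checks above, a small host-side script can serve as a starting point for alerting. This is a minimal sketch, assuming the Leader API listens locally on port 9000 and exposes a health endpoint at `/api/v1/health` (verify both against your deployment); the watched paths and alert threshold are placeholders.

```python
import json
import shutil
import urllib.request

# Assumed values -- adjust the endpoint, paths, and threshold to your deployment.
HEALTH_URL = "http://localhost:9000/api/v1/health"      # Leader API health endpoint (verify the path)
WATCHED_PATHS = ["/opt/cribl/state", "/opt/cribl/log"]  # example bundle-storage and log volumes
DISK_ALERT_PCT = 85.0


def leader_healthy(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True if the Leader API answers and reports a healthy status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
            body = json.loads(resp.read().decode("utf-8"))
        return ok and body.get("status", "").lower() == "healthy"
    except (OSError, ValueError):
        return False


def disk_usage_pct(path: str) -> float:
    """Percentage of the filesystem backing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


if __name__ == "__main__":
    print("leader healthy:", leader_healthy())
    for path in WATCHED_PATHS:
        pct = disk_usage_pct(path)
        flag = "ALERT" if pct >= DISK_ALERT_PCT else "ok"
        print(f"{path}: {pct:.1f}% used ({flag})")
```

Run it from whatever scheduler or agent you already use for host checks; the same thresholds can drive alerts in your SIEM or observability tool.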