System Insights
System Insights provides deep operational visibility into your Cribl deployment. It is designed to help you understand the behavior of your data, the health of your infrastructure, and the performance of your Pipelines.
By correlating throughput, errors, and freshness/latency with system logs, System Insights accelerates incident triage and Root Cause Analysis (RCA). You should use this interface to detect saturation and backpressure early, validate the health of your Nodes, and pinpoint anomalies before they impact downstream systems.
System Insights is organized hierarchically:
- Overview Page: The entry point containing high-level dashboards for each of the four products: Stream, Edge, Lake, and Search.
- Detail Dashboards: Select View Details on any product dashboard to open a dedicated view with deeper analytics, organized into further tabs.
How to Use System Insights
Start at the main Overview to spot anomalies at a high level. If you identify a potential issue, drill down into the per-product pages by selecting View Details in the top-right corner of the product section.
These detailed dashboards are designed to help you:
- Isolate Issues: Use dimension filters (such as Worker Group) to isolate suspect areas.
- Validate Changes: Compare windows before and after a configuration change to ensure stability (see the sketch after this list).
- Navigate Contextually: The tab state is URL-backed, meaning you can share links that preserve your active view and filters.
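If you want to reproduce the before/after comparison outside the dashboards, a Cribl Search query along the lines of the sketch below can help. This is a minimal sketch under stated assumptions: the dataset name cribl_internal_metrics and the fields events_out and worker_group are placeholders for whatever your environment actually exposes.

```
// Bucket throughput into 10-minute bins so you can compare behavior
// before and after the time of a configuration change.
// Dataset and field names are placeholders; adjust to your environment.
dataset="cribl_internal_metrics"
| summarize total_events=sum(events_out) by bin(_time, 10m), worker_group
| sort by _time asc
```

Run it once over a window that spans the change, or run it twice with the time picker set to the before and after windows, and compare the bins.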
Stream System Insights
Stream System Insights monitors Leaders, Worker Groups, and Pipelines. Its primary focus is on data flow throughput, drops, latency, and the efficiency of your Routes, Pipelines, and Packs.
Cribl Stream has the richest System Insights experience, with dedicated tabs for Data, Jobs, Infrastructure, and Logs.
Stream Data
The Data tab is your primary dashboard for understanding how data flows through Cribl Stream and how work is distributed. It focuses on:
Overall health and footprint: High-level counters show how many inputs/outputs are healthy and summarize the number of Worker Groups, Workers, processes, CPUs, memory, and storage in use. This gives you a quick sense of whether Cribl Stream is sized as expected.
Throughput over time: Time-series charts for events in/out and bytes in/out show how much data Cribl Stream is ingesting and emitting. Comparing in vs out helps you see drops, bottlenecks, or configuration changes that affect flow.
Worker Group performance: Per-Worker-Group charts let you see which Groups are carrying the most load. This is where you look when you suspect hot spots or uneven distribution.
Connection behavior: Connection lifecycle views show how often connections are being established, torn down, or flapping. This is useful for spotting unstable sources or destinations.
Key component groupings: Sections for Sources, Destinations, Pipelines, and Queues summarize activity per component type so you can move from a vague suspicion to a clear culprit.
Use the Data tab to answer questions like:
- Which Worker Groups and components are doing the most work?
- Did ingest/egress change after we rolled out a new configuration?
- Where are drops or throughput regressions starting?
Usage Tips:
- Identify heavy talkers: Use the Top Sources panels to find inputs driving excessive load.
- Spot hotspots: Use Top Destinations to find potential ingest caps or bottlenecks.
- Analyze flows: Top Routes helps you spot noisy or misrouted flows that may be consuming resources unnecessarily.
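The dashboards already surface Top Sources, Top Destinations, and Top Routes, but you can ask the same question ad hoc from Cribl Search. The sketch below is a starting point only: the dataset name cribl_internal_metrics and the fields events_in, source, and worker_group are placeholders, not guaranteed names.

```
// Rank inputs by event volume to find heavy talkers per Worker Group.
// Dataset and field names are placeholders; adjust to your environment.
dataset="cribl_internal_metrics"
| summarize total_in=sum(events_in) by source, worker_group
| sort by total_in desc
| limit 10
```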
Stream Jobs
The Jobs tab shows scheduled and background work for Collectors. It focuses on:
Volume and workload: Counters for Total Jobs and Total Tasks show how many jobs and tasks completed in the time range. This tells you how busy your scheduled workloads have been.
Reliability: Task Errors and Task Errors Over Time highlight how often tasks fail and when error spikes occur. You use these to decide if failures are isolated or systemic.
Concurrency and backlog: In-flight Jobs Over Time shows how many jobs are running at once, and Tasks Started vs Completed compares the rate of new work to completed work. If starts outpace completions, you may be building a backlog.
Collector mix and hotspots: Panels like Jobs Completed by Collector Type and Top Collectors by Job Count show which Collectors account for most of the workload. If a particular Collector type or instance dominates, it’s a candidate for optimization or closer monitoring.
Cache efficiency: Collector Cache Hit Rate (%) shows how effectively Collectors are using cache. Low hit rates may indicate misconfiguration or missed optimization opportunities.
Use the Jobs tab for:
- Investigating late or missing data that depends on Collectors.
- Understanding whether new schedules or configuration changes affected job load or reliability.
- Deciding where to focus tuning and operational runbooks for scheduled jobs.
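To check the starts-versus-completions balance described above without waiting on the dashboard, you can sketch it as a query. The sketch assumes task lifecycle events land in a queryable dataset (cribl_internal_logs here is a placeholder) with a message field that distinguishes started and completed tasks; exact field names and message text vary by version, so treat every name below as an assumption.

```
// Compare the rate of task starts to task completions over time.
// If the started series consistently outpaces completed, a backlog is building.
// Dataset, field names, and message patterns are placeholders.
dataset="cribl_internal_logs"
| where message contains "task started" or message contains "task completed"
| extend phase = iff(message contains "started", "started", "completed")
| summarize tasks=count() by bin(_time, 5m), phase
| sort by _time asc
```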
Stream Infrastructure
The Infrastructure tab gives you a capacity and health view of Workers. It focuses on:
CPU load over time: CPU load charts show which Workers or Groups are under the most pressure. Spikes or sustained high load often precede drops, latency, or backpressure.
Memory availability: Free memory over time tells you whether Workers are running with comfortable headroom or flirting with out-of-memory conditions.
Backpressure and queues: Backpressure-oriented visuals (for example, queue fill vs throughput) help you see when Pipelines are pushing back on input. This is where you look when data is technically flowing but latency and queues are growing.
Use the Infrastructure tab when:
- You suspect overload on specific Workers or Worker Groups.
- Metrics show problems (drops, slowdowns), and you want to confirm whether they line up with CPU/memory/backpressure issues.
- You're planning capacity changes or validating the effects of scaling decisions.
Usage Tips:
- Predict saturation: Rising queue fill and backpressure signals are early warnings of saturation.
- Validate scaling: Use CPU and memory trends to validate that your scaling actions are effective.
- Correlate errors: Error spikes that coincide with resource pressure usually indicate throttling or downstream limits.
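One way to do that correlation is to count error-level internal log lines per Worker over the same window you're inspecting on the resource charts. The sketch below assumes the default cribl.log shape (level and host fields); the dataset name cribl_internal_logs is a placeholder.

```
// Count error-level log lines per Worker in 5-minute bins, then line
// the result up with the CPU/memory/backpressure charts for the same window.
// Placeholder dataset name; field names follow cribl.log defaults.
dataset="cribl_internal_logs"
| where level == "error"
| summarize errors=count() by bin(_time, 5m), host
| sort by _time asc
```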
Stream Logs
Use this tab to correlate metric spikes with runtime events like Worker/Leader logs, connection lifecycle, job activity, and errors/warnings.
For internal log locations and fields, see Internal Logs.
Usage:
- Pivot: When you see a metric spike in the Data or Infrastructure tabs, switch to this tab to see time-aligned log lines.
- Refine: The query is pre-scoped to cribl.log. You can refine this predefined query with additional terms, fields, or time ranges to isolate specific errors.
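For example, a refinement might narrow the search to error-level lines from a single channel around the spike you saw in the Data tab. This is a sketch rather than the exact pre-scoped query text: the dataset name is a placeholder and the channel value is illustrative.

```
// Narrow the cribl.log search to recent error-level lines from one channel.
// Dataset name is a placeholder; the channel value is illustrative.
dataset="cribl_internal_logs"
| where level == "error" and channel contains "output"
| sort by _time desc
| limit 100
```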
Edge System Insights
Edge Insights provides operational visibility into Cribl Edge Fleets. It answers critical questions: Are Nodes healthy? Is throughput balanced? Are drops occurring and where?
For the health and throughput signals used here, see Internal Metrics.
Edge Data
The Data tab for Cribl Edge gives you a Fleet-centric view of throughput and Fleet health. It focuses on:
Inventory and footprint: Counters for Fleets, Nodes, Routes, Pipelines, Sources, and Destinations show how large and complex your Edge deployment is.
Fleet throughput and drops: Charts track events/bytes per second and drop behavior per Fleet. This helps you see which Fleets are busiest and where data might be getting lost.
Use the Data tab when you want to:
- Understand which Fleets are doing the most work.
- Verify that collection and forwarding are keeping up with expectations.
- Spot Fleets with higher drop rates or unstable traffic patterns.
Workflow Examples:
- Incident triage: If Edge Dropped Events spikes, filter to the affected Fleet, note where the Events and Bytes charts diverge, and then inspect recent Pipeline or Route changes on that Fleet.
- Optimization check: After enabling sampling or compression, look for Bytes Out to decrease while Events Out remains stable. Confirm that the drop chart remains flat.
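A query-based version of the incident-triage workflow might look like the sketch below: scope to the affected Fleet and watch events in versus events out to see where the divergence starts. The dataset name cribl_edge_metrics and the fields fleet, events_in, and events_out are assumptions; check Built-In Cribl Edge Datasets for the names your deployment actually exposes.

```
// Compare in/out event rates for one Fleet to localize a drop spike.
// Dataset and field names are placeholders; substitute real ones.
dataset="cribl_edge_metrics"
| where fleet == "production-linux"
| summarize in_events=sum(events_in), out_events=sum(events_out) by bin(_time, 5m)
| sort by _time asc
```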
Edge Logs
This is an investigation surface to correlate metrics with runtime events from Edge Nodes, such as agent logs, connection lifecycle events, retries, and authentication failures.
To correlate Edge logs, metrics, and state in Search, see Built-In Cribl Edge Datasets.
- Data sources: Edge Worker logs (for example, cribl.log) or platform-collected Node logs.
- Usage: Filter by time, Fleet, Node, Source, or Destination. Pivot from spikes in EPS or drops to concrete error lines and connection events.
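As a concrete starting point for that pivot, the sketch below filters Edge Node logs to one Fleet and counts warnings and errors per Node. The dataset name cribl_edge_logs and the fleet, host, and level fields are assumptions about how your Node logs are exposed.

```
// Count warning/error lines per Edge Node for a single Fleet.
// Dataset and field names are placeholders; adjust to your environment.
dataset="cribl_edge_logs"
| where fleet == "production-linux" and (level == "error" or level == "warn")
| summarize issues=count() by host, level
| sort by issues desc
```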
Search System Insights
Search Insights monitors Cribl Search services. It answers: Are searches finishing? Are errors spiking? How long do searches run? Is the system constrained?
When to use it:
- Detect incidents: Look for rising error rates, stalled completions, or queue growth.
- Capacity planning: Look for sustained CPU or memory saturation and increasing throughput.
- Goal monitoring: Correlate Jobs Finished versus Error Rate and Duration Over Time to confirm system stability.
Search Overview
The Overview tab for Search provides diagnostics for performance, resource usage, and cost. It answers:
- How many searches finished in this time range?
- What percentage errored?
- How are jobs distributed across statuses (dispatched, queued, errored)?
- Are searches taking longer than they used to?
It does this through:
- Throughput metrics: jobs finished.
- Reliability metrics: error rate, status distributions.
- Performance metrics: total duration/CPU-seconds and duration-over-time charts.
Use the Overview tab when:
- You suspect search-related issues (timeouts, failures, slowness).
- You’re validating the effect of new dashboards, scheduled searches, or data volumes on search behavior.
Key Workflows:
- Distinguish IO vs CPU:
- If Throughput rises and CPU is flat, but Duration increases, suspect I/O bottlenecks.
- If CPU saturates while Throughput is flat, you may need to optimize queries or scale executors.
- Cost Control: Track Billable CPU*Hours against volume. Optimize filters, sampling, and time ranges, then confirm there is no drop in Jobs Finished.
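If your deployment exposes search job telemetry as a queryable dataset, the IO-versus-CPU distinction above can be approximated with a query. Everything in the sketch below is a placeholder: the dataset name search_jobs_history and the fields status, duration_sec, and cpu_sec stand in for whatever your environment actually records.

```
// Track average duration vs CPU time per 15-minute bin.
// Duration rising while CPU stays flat hints at I/O waits; CPU rising
// with flat throughput hints at query or executor limits.
// All dataset and field names are placeholders.
dataset="search_jobs_history"
| where status == "completed"
| summarize avg_duration=avg(duration_sec), avg_cpu=avg(cpu_sec), jobs=count() by bin(_time, 15m)
| sort by _time asc
```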
Search Logs
This investigation surface allows you to analyze service errors and request failures alongside your metrics.
- Pivot: If you observe an Error Rate spike in the Overview or Data tabs, switch to this tab to see the specific error messages associated with that time window.
- Investigation: Use this tab to search for specific error keywords or filter by component ID to isolate failing services.
- Context: The search experience is pre-scoped to system logs (for example, Search service logs), allowing you to quickly verify if an infrastructure issue is causing job failures.
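For instance, to pull the raw error messages behind an Error Rate spike, a query along these lines can work. The dataset name is a placeholder and the channel filter is illustrative; scope it to the component IDs you see failing.

```
// List the most recent error messages for search-related channels
// around the window where the Error Rate spiked.
// Placeholder dataset name; channel filter is illustrative.
dataset="cribl_internal_logs"
| where level == "error" and channel contains "search"
| project _time, channel, message
| sort by _time desc
| limit 100
```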
Lake System Insights
Lake Insights monitors Cribl Lake. It answers: Are datasets growing as expected? Is storage healthy? Are queries succeeding with acceptable latency?
When to use it:
- Capacity Planning: Track total storage, utilization trends, and Dataset growth.
- Cost Control: Identify heavy datasets, growth hotspots, and inefficient data formats.
- Reliability: Monitor errors and health scores to validate ingestion and query stability.
For how Cribl Stream writes to Cribl Lake (including required fields and partitioning), see the Cribl Lake Destination.
Lake Overview
The Overview tab for Lake covers how data is stored in Lake and how it is being read back out. Use these panels to assess storage composition, growth, and Dataset footprint at a glance, then drill into anomalies (sudden growth, skewed formats) to tune retention, format, and ingestion patterns.
Event and byte rates out: Event/byte-per-second charts and their averages show how much data Lake is serving over time.
Baseline and spikes: The averages establish a normal range, while the time-series views highlight spikes or dips in egress that might indicate new workloads, misconfigurations, or issues downstream.
Use the Overview tab to:
- Track how heavily Lake is being used as a source of data.
- Identify sudden increases in downstream consumption that might stress infrastructure or budgets.
Usage Tips:
- Find Heavy Hitters: Use Storage by Dataset to identify the top datasets driving storage costs. Check their daily growth bars.
- Format Optimization: If Storage by Format shows a high share of inefficient formats (like JSON), prioritize compaction or rewriting.
- Capacity Alarms: A rising slope in Storage Utilization combined with sustained Dataset growth indicates an imminent need for capacity action.
- Egress Validation: Peaks in Events/Bytes Out without corresponding downstream consumption may indicate backpressure outside of Lake.
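Because Lake Datasets are also queryable from Cribl Search, you can spot-check a Dataset's recent ingest pattern against the growth bars shown here. The Dataset ID default_logs below is only an example; substitute one of your own Dataset IDs.

```
// Count events landing in one Lake Dataset per day to corroborate
// the daily growth bars. The Dataset ID is an example; use your own.
dataset="default_logs"
| summarize events=count() by bin(_time, 1d)
| sort by _time asc
```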
Lake Logs
This view captures ingest and compaction job logs as well as error traces.
- Correlation: Align the time range with anomalies seen in the Overview.
- Investigation: Filter on error keywords and component IDs to narrow the scope of failed ingest or compaction jobs.
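As one example of keyword scoping, the sketch below groups error lines by channel so you can tell at a glance whether failures cluster around ingest or compaction. The dataset name is a placeholder and the keyword filters are illustrative.

```
// Group error lines by channel to see whether ingest or compaction jobs are failing.
// Placeholder dataset name; keywords are illustrative.
dataset="cribl_internal_logs"
| where level == "error" and (message contains "compaction" or message contains "ingest")
| summarize errors=count() by channel
| sort by errors desc
```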