Monitoring
To get an operational view of a Cribl Stream deployment, you can consult the following resources.
Monitoring Resources
Monitoring Page
From Cribl Stream’s top nav, select Monitoring to access multiple dashboards. The resulting Monitoring page displays information about traffic in and out of the system, as well as collection jobs and tasks. It tracks events, bytes, splits by data fields over time, and broader system metrics.
The initial view (below) shows aggregate data for all Worker Groups and all Workers. You can use the drop-downs at the upper right to isolate individual Groups, or individual Workers.
Also at the upper right, you can change the display’s granularity from the default 15 min, selecting from a variety of time ranges from 1 min up to 1 day. (The latter covers the preceding 24 hours, and this maximum window is not configurable.)
The displayed Storage tile (upper right) represents the amount of free storage remaining on the partition where Cribl Stream is installed. (This quantity might not represent the maximum storage available for the selected Worker or Group. Also, it does not calculate the system free space.)
Similarly, the Free Memory graph reflects only the operating system’s free statistic, matching Linux’s strict free definition by excluding buff/cache memory. So this graph indicates a lower value than the OS’ available memory statistic – and it does not necessarily indicate that the OS is running out of memory to allocate.
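To see how far apart the two statistics can be, here is a minimal sketch that parses /proc/meminfo-style text and compares the free and available values. It assumes the graph's value corresponds to what Linux reports as MemFree; the sample numbers and the parse_meminfo helper are made up for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


# Illustrative sample only; on a Linux host, read the real file instead:
#   with open("/proc/meminfo") as f: info = parse_meminfo(f.read())
SAMPLE = """\
MemTotal:       16303428 kB
MemFree:          812340 kB
MemAvailable:    9214480 kB
Buffers:          412008 kB
Cached:          7856120 kB
"""

info = parse_meminfo(SAMPLE)
free_kb = info["MemFree"]            # strict "free": excludes buff/cache
available_kb = info["MemAvailable"]  # estimate of memory usable without swapping
print(f"free: {free_kb} kB, available: {available_kb} kB")
```

In this sample, free is far below available, yet the host still has plenty of memory it can allocate.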
Metrics are displayed in the Cribl Stream UI in units of KB, MB, GB, and TB. At each level, these are multiples of 1024.
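As a rough illustration of the 1024-based scaling, the following sketch converts a raw byte count into these units. The format_bytes name, the two-decimal rounding, and the plain "B" unit for small values are assumptions for the example, not the UI's actual formatting code.

```python
def format_bytes(n: float) -> str:
    """Scale a byte count by successive factors of 1024."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit that keeps the value under 1024,
        # or at TB, the largest unit listed above.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(format_bytes(1536))       # 1.50 KB
print(format_bytes(1024 ** 3))  # 1.00 GB
```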
Byte-related charts show the uncompressed amounts and rates of data processed over the selected time range:
- Events (total) in and out
- Events per second in and out
- Bytes (total) in and out
- Bytes per second in and out
We measure bytes in and out based on the size of _raw, if this field is present and is of type string. Otherwise, we use the size of read (uncompressed) data.
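The rule above can be sketched as a small function. This is only an illustration: the measured_size name and the read_size parameter are hypothetical, and counting the string as UTF-8 bytes is an assumption about how "size" is computed.

```python
def measured_size(event: dict, read_size: int) -> int:
    """Return the size counted for an event, following the rule above.

    read_size is the size of the uncompressed data as it was read.
    """
    raw = event.get("_raw")
    if isinstance(raw, str):
        # _raw is present and is a string: measure it (assumed: UTF-8 bytes).
        return len(raw.encode("utf-8"))
    # _raw is missing or not a string: fall back to the size of the read data.
    return read_size


print(measured_size({"_raw": "hello"}, read_size=99))      # 5
print(measured_size({"_raw": {"a": 1}}, read_size=99))     # 99
print(measured_size({"message": "hello"}, read_size=42))   # 42
```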
The displayed CPU Load Average is an average per Worker Process, updated at 1‑minute granularity. (It is not an average for the Worker Node as a whole.)
Vertical lines across each chart display configuration changes. Click anywhere on the line to view summary information including time, data, and configuration versions.
Click View CPU load by node to open a modal showing CPU usage per Worker.
Click a View Details button to access a Worker Node’s Worker Settings, with details per Worker Process. (To enable this button, you must first enable the parent Group’s Remote UI Access.)
Except for these configuration change markers, Monitoring data does not persist across Cribl Stream restarts. Keep this in mind before you restart the server.
Data Monitoring
From the Monitoring page’s top nav, open the Data submenu to isolate throughput for any of the following:
- Data Fields
- Destinations
- Packs
- Pipelines
- Projects
- Routes
- Sources
- Subscriptions
Sparklines
Dense displays are condensed to sparklines for legibility. Hover over the right edge to display Maximize buttons that you can click to zoom these up to detailed graphs.
Data Fields
The Data > Data Fields page lets you preview the flow of events that contain a specific data field. The left (blue) side summarizes events in/out. The right (orange) side summarizes bytes in/out.
You can control which data fields are included in these graphs with the Disable field metrics setting located at Settings > (Global Settings) > System > General Settings > Limits > Metrics > Disable field metrics.
Disable field metrics settings are not available on Cribl.Cloud-managed Worker Nodes.
Routes
This Data > Routes page condenses multiple details and options:
- Each row independently summarizes throughput for a separate Route.
- The left (blue) side summarizes events in/out. The right (orange) side summarizes bytes in/out.
- On each row, the top number summarizes average events/bytes into the Route, and the bottom number summarizes events/bytes out. Both are averaged over your selected time granularity.
- On each row, the upper Maximize button zooms up the left (events) graph, and the lower Maximize button zooms up the right (bytes) graph.
Fly-Out Details
You can hover over graphs to display a fly-out with details. On the Free Memory and CPU Load Average graphs, fly-outs are limited to a manageable 10 Worker Nodes, even if more Nodes are active. These fly-outs will occasionally show transient 0 metrics for some Workers, because Cribl Stream prioritizes reporting current throughput over memory/load metrics.
System Monitoring
From the Monitoring page’s top nav, open the System submenu to isolate throughput for any of the following:
- Job Inspector
- Jobs (and tasks in-flight, see Collector Sources)
- Leaders (available only with High Availability enabled; see Second Leader)
- Licensing
- Queues (Destinations)
- Queues (Sources)
Job Inspector
Select System > Job Inspector from the Monitoring page’s top nav to view and manage pending, in-flight, and completed collection jobs and their tasks. For details about the resulting page, see Monitoring and Inspecting Collection Jobs.
Leaders
This menu item is available only in distributed deployments with High Availability (HA) enabled.
Select Monitoring > System > Leaders to view the status of your Leader Nodes. For more information on how to configure a second Leader Node for failover/durability, see Second Leader.
Licensing
Select System > Licensing from the Monitoring page’s top nav to check your licenses’ expiration dates, and daily data throughput and events quotas. You can also compare your daily data throughput against your license quota – and against granular and average throughput over the last 30 to 365 days. Highlights include:
- A horizontal bar indicates license usage against your quota.
- Tooltips display details about data usage, data amount over/under license quota, and data percentage over/under license quota.
- Dots on the daily usage bar graph represent configuration changes in the system.
- The Daily Events In chart only shows events that count towards your license. It filters out datagen, cribl_http, and cribl_tcp events.
Even on single-instance deployments, you must have git installed in order for the Monitoring > Licensing page to display configuration change markers.
For the most current and accurate throughput data, enable the Cribl Internal Source’s CriblMetrics option. Forward metrics to your metrics Destination of choice, and run a report on cribl.total.in_bytes. CriblMetrics aggregates metrics at the Worker Process level, every 2 seconds.
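A report of this kind might look like the sketch below, which sums cribl.total.in_bytes samples per UTC day so the totals can be compared against a daily quota. The sample shape, a list of (epoch seconds, bytes) pairs where each value covers one reporting interval, is an assumption about how your metrics Destination returns the data, and the numbers are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_in_bytes(samples):
    """Sum (epoch_seconds, in_bytes) samples into per-day totals (UTC)."""
    totals = defaultdict(int)
    for ts, value in samples:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        totals[day] += value
    return dict(totals)


# Hypothetical samples exported from a metrics store.
samples = [
    (1700000000, 1000),  # 2023-11-14 UTC
    (1700000002, 2500),  # two seconds later, same day
    (1700086400, 4000),  # exactly one day after the first sample
]

report = daily_in_bytes(samples)
print(report)
```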
Reports/Top Talkers
Select Monitoring > Reports / Top Talkers > Top Talkers, where you can examine your five highest-volume Sources, Destinations, Pipelines, Routes, and Packs. All components are ranked by events throughput. Sources and Destinations get separate rankings by bytes in and out, respectively.
Flows
Select Flows from the Monitoring page’s top nav or ••• overflow menu to see a graphical, left-to-right visualization of data flow through your Cribl Stream deployment.
Internal Logs and Metrics
Select Logs from the Monitoring page’s top nav. Cribl Stream’s internal logs and internal metrics provide comprehensive information about an instance’s status/health, inputs, outputs, Pipelines, Routes, Functions, and traffic.
Health Endpoint
Query this endpoint on any instance to check the instance’s health. (Details below.)
Types of Logs
Cribl Stream provides the following log types, by originating process:
- API Server Logs – These logs are emitted primarily by the API/main process. They correspond to the top-level cribl.log that shows up on the Diag page. These include telemetry/license-validation logs. Filesystem location: $CRIBL_HOME/log/cribl.log
- Worker Process(es) Logs – These logs are emitted by all the Worker Processes, and are very common on single-instance deployments and Workers. Filesystem location: $CRIBL_HOME/log/worker/N/cribl.log
- Worker Group/Fleet Logs – These logs are emitted by all processes that help a Leader Node configure Worker Groups/Fleets. Filesystem location: $CRIBL_HOME/log/group/GROUPNAME/cribl.log
In a distributed deployment (Stream), all Workers forward their metrics to the Leader Node, which then consolidates them to provide a deployment-wide view.
For details about generated log files, see Internal Logs. To work with logs in Cribl Stream’s UI, see Search Internal Logs.
Log Rotation and Retention
Cribl Stream rotates logs every 5 MB, keeping the most recent 5 logs. Logs will reach the 5 MB threshold more quickly with verbose logging levels (such as debug) and with high volumes of system activity.
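To get a feel for how much history this policy retains, the following back-of-the-envelope sketch estimates the retained time window for a given logging rate. It assumes 1 MB = 1024 × 1024 bytes and that all 5 kept files are full, so treat the result as an upper bound; retained_seconds is a hypothetical helper.

```python
ROTATE_BYTES = 5 * 1024 * 1024  # rotation threshold (assumed 1024-based MB)
KEPT_FILES = 5                  # most recent logs kept


def retained_seconds(bytes_per_second: float) -> float:
    """Estimate how many seconds of logs fit in the kept files."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return ROTATE_BYTES * KEPT_FILES / bytes_per_second


# At about 1 KB of log output per second:
seconds = retained_seconds(1024)
print(f"{seconds:.0f} s, or about {seconds / 3600:.1f} hours")
```

A channel set to debug can raise the logging rate many times over, shrinking this window proportionally.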
For long-term retention, Cribl recommends sending logs from the Leader, and from Workers’ Cribl Internal > CriblLogs Sources, to a downstream service of your choice. See the next section for details.
Forward Logs and Metrics Externally
Cribl Stream supports forwarding internal logs and metrics to your preferred external monitoring solution. The Cribl Internal Source captures internal logs and metrics to send to Destinations.
In the following steps, we’ll use the graphical QuickConnect UI to set up the Cribl Internal Source and how data is sent to your Destination. See Cribl Internal for details on how to instead configure the Cribl Internal Source via the Data > Sources (Stream) or More > Sources (Edge) submenus.
- From the top nav, click Manage, then select a Worker Group to configure.
- To open QuickConnect: Stream – select the QuickConnect tile on the Overview page. Edge – click the Collect submenu.
- Click Add Source.
- From the resulting drawer’s tiles, select System and Internal, then hover over the Cribl Internal tile.
- Click Select Existing.
- On the CriblLogs and/or the CriblMetrics row, toggle Enabled to Yes. Confirm your choice in the resulting message box.
- Notice how the Routes/QC column still says Routes. We want to use QuickConnect, so we’ll change this by clicking on the CriblLogs and/or the CriblMetrics row again. Once clicked, you’ll confirm Yes to the message box asking to switch the Source to send to QuickConnect instead of Routes.
- Click and drag the + button on the right side of the Source to your desired Destination. A Connection Configuration modal will prompt you to select a Passthru, Pipeline, or Pack connection. See QuickConnect for details.
- This will send internal logs and metrics to your Destination, just like another data Source. Both logs and metrics will have a field called source, set to the value cribl, which you can use in Routes’ filters.
Note that the only logs supported here are Worker Process logs (see Types of Logs above). You can, however, use a Script Collector to listen for API Server or Worker Group events.
For recommendations about useful Cribl metrics to monitor, see Internal Metrics.
Controlling Metrics Volume
To reduce the volume of metrics sent through Cribl Stream, see options on the Cribl Internal Source.
Disable Field Metrics
To send fewer metrics to the Leader Node, you can specify these metrics types in the blocklist at Settings > (Global Settings) > System > General Settings > Limits > Metrics > Disable field metrics.
The default fields to ignore are host, source, sourcetype, index, and project.
You can remove any of the defaults, or add other fields you do not want to send as metrics.
However, when you enable the Cribl Internal Source, Cribl Stream ignores this Disable field metrics setting, and full-fidelity data will flow down the Routes.
Dropping Metrics
When the number of in-flight requests for sending metrics from Worker to Leader exceeds a limit (1,000 requests), Workers will stop sending metrics. Dropped metrics information is logged under channel="clustercomm".
You can exclude certain metrics from being dropped due to exceeding limits. Specify these metrics at Settings > System > General Settings > Limits > Metrics > Metrics never-drop list.
This setting is available only on Worker Nodes.
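The drop decision described in this section can be pictured with the sketch below. It is only a mental model: the should_send function is hypothetical, the metric names are examples, and exact-name matching against the never-drop list is an assumption.

```python
IN_FLIGHT_LIMIT = 1000  # limit on in-flight Worker-to-Leader requests


def should_send(metric_name: str, in_flight: int, never_drop: set) -> bool:
    """Decide whether a Worker would still send a metric to the Leader."""
    if in_flight <= IN_FLIGHT_LIMIT:
        return True  # limit not exceeded: everything is sent
    # Limit exceeded: only metrics on the never-drop list get through.
    return metric_name in never_drop


never_drop = {"total.in_bytes"}  # example entry, chosen for illustration
print(should_send("pipe.in_events", 500, never_drop))    # True
print(should_send("pipe.in_events", 1001, never_drop))   # False
print(should_send("total.in_bytes", 1001, never_drop))   # True
```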
Metrics Garbage Collection
Metrics garbage collection (GC) runs when the total number of stored metrics exceeds the max metrics limit (default 1,000,000). Metrics are then removed, starting with the oldest ones. This happens both on Workers and Leaders.
You can define how often to run garbage collection on each Worker Group. Select Group Settings > System > General Settings > Limits > Metrics > Metrics GC period. The default is 60 seconds.
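The oldest-first eviction can be sketched with an ordered map. This is an illustration rather than Cribl Stream's implementation: the MetricStore class is hypothetical, "oldest" is taken to mean insertion order, and the sketch assumes a GC pass trims the store back down to the limit.

```python
from collections import OrderedDict


class MetricStore:
    """Toy metric store that evicts the oldest entries on GC."""

    def __init__(self, max_metrics: int = 1_000_000):
        self.max_metrics = max_metrics
        self._metrics = OrderedDict()

    def record(self, key: str, value: float) -> None:
        # New keys are appended; updating a key keeps its original position.
        self._metrics[key] = value

    def collect_garbage(self) -> int:
        """Remove oldest metrics until the store is within the limit."""
        removed = 0
        while len(self._metrics) > self.max_metrics:
            self._metrics.popitem(last=False)  # pop from the oldest end
            removed += 1
        return removed

    def names(self) -> list:
        return list(self._metrics)


store = MetricStore(max_metrics=3)
for name in ["a", "b", "c", "d", "e"]:
    store.record(name, 1.0)
removed = store.collect_garbage()
print(removed, store.names())
```

In a real deployment, a pass like collect_garbage would run once per Metrics GC period (60 seconds by default).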
Metrics Tracking by Worker ID
Typically, all metrics are assigned their own Worker Node ID dimensions so they can be split by worker if needed.
You can define which metrics are not assigned a Worker Node ID dimension by adding them to the list at Group Settings > System > General Settings > Limits > Metrics > Metrics worker tracking.
Control Metrics Lag and Disk Usage
If one or more Worker Groups has a large number of enabled Sources, and clicking into these Sources does not promptly display their status, a workaround is to prevent Cribl Stream from writing metrics to disk. On the Leader (versions 4.1.2 and later), navigate to Settings > Global Settings > System > General Settings > Limits > Storage, and disable the Persist metrics toggle. Then restart the Leader.
If you keep Persist metrics enabled, you can use the adjacent Metrics max disk space field to control the written metrics’ footprint. This threshold defaults to 64 GB.
Cardinality Limit
You can define the cardinality limit for metrics on each Worker Group. Select Group Settings > System > General Settings > Limits > Metrics > Metrics cardinality limit. The default is 1,000.
Refer to Monitoring > Data > Data Fields to identify the fields that have the highest cardinality.
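If you want to check cardinality against a sample of your own events outside the UI, a minimal sketch follows. It assumes cardinality means the number of distinct values seen per field; the helper names and sample events are hypothetical.

```python
from collections import defaultdict


def field_cardinality(events) -> dict:
    """Count distinct values per field across a sample of events."""
    seen = defaultdict(set)
    for event in events:
        for field, value in event.items():
            seen[field].add(str(value))
    return {field: len(values) for field, values in seen.items()}


def over_limit(cardinality: dict, limit: int = 1000) -> list:
    """List fields whose distinct-value count exceeds the limit."""
    return sorted(f for f, count in cardinality.items() if count > limit)


events = [
    {"host": "a", "status": 200},
    {"host": "b", "status": 200},
    {"host": "c", "status": 404},
]
card = field_cardinality(events)
print(card)
print(over_limit(card, limit=2))
```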
Search Internal Logs
Cribl Stream exists because logs are great and wonderful things! Using Cribl Stream’s Monitoring > Logs page, you can search all Cribl Stream’s internal logs at once – from a single location, for both Leader and Worker. This enables you to query across all internal logs for strings of interest.
The labels on this screenshot highlight the key controls you can use (see the descriptions below):
Log file selector: Choose the Node to view. In a distributed deployment (Stream), this list will be hierarchical, with Workers displayed inside their Leader.
Fields selector: Click the Main | All | None toggles to quickly select or deselect multiple check boxes below. Beside these toggles, a Copy button enables you to copy field names to the clipboard in CSV format.
Fields: Select or deselect these check boxes to determine which columns are displayed in the Results pane at right. (The upper Main Fields group will contain data for every event; other fields might not display data for all events.)
Time range selector: Select a standard or custom range of log data to display. (The Custom Time Range pickers use your local time, even though the logs’ timestamps are in UTC.)
Search box: To limit the displayed results, enter a JavaScript expression here. An expression must evaluate to truthy to return results. You can press Shift+Enter to insert a newline.
Typeahead assist is available for expression completion.
Click a field in any event to add it to a query.
Click other fields to append them to a query.
Shift+click to negate a field.
To modify the depth of information that is originally input to the Logs page, see Logging Settings.
- Click the Search box’s history arrow (right side) to retrieve recent queries.
- The Results pane displays most-recent events first. Each event’s icon is color-coded to match the event’s severity level.
Click individual log events to unwrap an expanded view of their fields.
- Export Logs as JSON button: Exports logs as a file initially named CriblMonitoringLogs.json. (You can edit this name upon export.) The logs’ scope will be filtered both by your Time range selector setting, and by how fully you’ve scrolled down to lazy-load that time range into the displayed UI.
Logging Settings
On Cribl Stream’s Settings pages, you can adjust the level (verbosity) of internal logging data processed, per logging channel. You can also redact fields in customized ways. In a distributed deployment, you manage each of these settings per Worker Group.
Change Logging Levels
To adjust logging levels:
In a single-instance deployment, select Settings > System > Logging > Levels.
For the Leader Node’s own logs, select Settings > Global Settings > System > Logging > Levels.
In a distributed deployment, first select Manage, then click into the group you want to configure. Next, select Group Settings (or Fleet Settings) > System > Logging > Levels.
On the resulting Manage Logging Levels page, you can:
Modify one channel by clicking its Level column. In the resulting drop-down, you can set a verbosity level ranging from error up to debug. (Top of composite screenshot below.)
Modify multiple channels by selecting their check boxes, then clicking the Change log level drop-down at the bottom of the page. (Bottom of composite screenshot below.) You can select all channels at once by clicking the top check box. You can search for channels at top right.
The silly (ultra-verbose) logging level is available for all channels. However, it provides additional metrics information only for inputs.
Change Logging Redactions
On the Redact Internal Log Fields page, you can customize the redaction of sensitive, verbose, or just ugly data within Cribl Stream’s internal logs. To access these settings:
In a single-instance deployment, select Settings > System > Logging > Redactions.
In a single-instance deployment, or for the Leader Node’s own logs, select Settings > Global Settings > System > Logging > Redactions.
In a distributed deployment, first select Manage, then click into the group you want to configure. Next, select that group’s Group Settings > System > Logging > Redactions.
It’s easiest to understand the resulting Redact Internal Log Fields page’s fields from bottom to top:
- Default fields: Cribl Stream always redacts these fields, and you can’t modify this list to allow any of them through. However, you can use the two adjacent fields to define stricter redaction:
- Additional fields: Type or paste in the names of extra fields you want to redact. Use a tab or hard return to confirm each entry.
- Custom redact string: Unless this field is empty, it defines a literal string that will override Cribl Stream’s default redaction pattern (explained below) on the affected fields.
Default Redact String
By default, Cribl Stream transforms this page’s selected fields by applying the following redaction pattern:
- Echo the field value’s first two characters.
- Replace all intermediate characters with a literal ... ellipsis.
- Echo the value’s last two characters.
Anything you enter in the Custom redact string field will override this default ??...?? pattern.
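The default pattern and its override can be sketched in a few lines. The redact function is a hypothetical illustration, not Cribl Stream's code. How values shorter than five characters are handled is not stated above, so the sketch applies the same slicing without special-casing them.

```python
def redact(value: str, custom: str = "") -> str:
    """Apply the default ??...?? pattern, or a custom redact string."""
    if custom:
        # A non-empty custom string replaces the default pattern entirely.
        return custom
    # First two characters, a literal ellipsis, last two characters.
    return f"{value[:2]}...{value[-2:]}"


print(redact("supersecrettoken"))              # su...en
print(redact("supersecrettoken", "REDACTED"))  # REDACTED
```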
Health Endpoint
Each Cribl Stream instance exposes a health endpoint – typically used in conjunction with a Load Balancer – that you can use to make operational decisions.
Health Check Endpoint | Healthy Response | Cribl Stream Version |
---|---|---|
curl http(s)://<host>:<port>/api/v1/health | {"status":"healthy"} | Through 2.4.3 |
curl http(s)://<host>:<port>/api/v1/health | {"status":"healthy","startTime":1617814717110} (see details below) | 2.4.4 and later |
Specifically, the health endpoint can return one of the following response codes:
- 200 – Healthy.
- 420 – Shutting down.
- 503 – HTTP engine reports server busy: too many concurrent connections (configurable).
In the above curl examples, <port> stands for the API port (by default, 9000).
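For scripted checks, the curl call can be reproduced with Python's standard library, as in the sketch below. The function names are hypothetical, and reading startTime as milliseconds since the Unix epoch is an assumption based on the value's magnitude. The base URL is a placeholder for your own host and API port.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

STATUS_MEANINGS = {200: "healthy", 420: "shutting down", 503: "server busy"}


def describe_status(code: int) -> str:
    """Map a health endpoint response code to its documented meaning."""
    return STATUS_MEANINGS.get(code, "unexpected")


def parse_health(body: str):
    """Return (status, start time) from a health response body."""
    data = json.loads(body)
    start_ms = data.get("startTime")  # absent in older versions
    started = None
    if start_ms is not None:
        # Assumption: startTime is milliseconds since the Unix epoch.
        started = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return data.get("status"), started


def check_health(base_url: str, timeout: float = 5.0) -> str:
    """Query <base_url>/api/v1/health and describe the result."""
    url = f"{base_url}/api/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return describe_status(resp.status)
    except urllib.error.HTTPError as err:
        return describe_status(err.code)  # non-2xx codes such as 420 or 503
    except urllib.error.URLError:
        return "unreachable"


status, started = parse_health('{"status":"healthy","startTime":1617814717110}')
print(status, started)
# Against a live instance: check_health("http://localhost:9000")
```

A load balancer probe would typically treat anything other than "healthy" as a signal to stop routing traffic to the instance.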