Google Cloud Storage Collector
Cribl Stream supports collecting data objects from Google Cloud Storage buckets. This page covers how to configure the Collector.
Google Cloud Roles and Permissions
Your Google Cloud service account requires, at a minimum, the following specific permissions to collect data:
- `storage.buckets.get`
- `storage.objects.get`
- `storage.objects.list`

Assign one of the following role combinations to grant the required permissions:

- `roles/storage.legacyObjectReader` plus `roles/storage.legacyBucketReader`
- `roles/storage.objectViewer` plus `roles/storage.insightsCollectorService`
For additional details, see the Google Cloud Access Control topic.
Configure a Google Cloud Storage Collector
Navigate to Products > Stream > Worker Groups. Select a Worker Group, then go to Data > Sources. Choose the Source and select Add Collector.
In the New Collector modal, configure the following under Collector Settings:
- Collector ID: Unique ID for this Collector. E.g., `gcs_24-7`. If you clone this Collector, Cribl Stream will add `-CLONE` to the original Collector ID.
- Description: Optionally, enter a description.
- Auto-populate from: Optionally, select a predefined Destination that will be used to auto-populate Collector settings. Useful when replaying data.
- Bucket name: Google Cloud Storage bucket to collect from. This value can be a constant, or a JavaScript expression that can be evaluated only at init time. E.g., referencing a Global Variable: `myBucket-${C.vars.myVar}` (see the sketch after this list).
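As a minimal sketch of how a Global Variable reference in Bucket name might evaluate at init time (the variable name and value here are hypothetical):

```javascript
// Hypothetical Global Variable: C.vars.myVar = 'prod'
// Bucket name expression, evaluated once when the Collector initializes:
`myBucket-${C.vars.myVar}`  // => 'myBucket-prod'
```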
Under Authentication, select an Authentication method from the dropdown:
- Auto: This option uses the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to provide the location of a credential JSON file. When this option is set, Cribl Stream first checks whether the `GOOGLE_APPLICATION_CREDENTIALS` environment variable exists and is properly set with valid credentials. If this environment variable is missing or incorrect, Cribl Stream checks for credentials in two additional fallback locations, as described in the official Google Cloud documentation: How Application Default Credentials works. In addition to using the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, you can use credentials attached to your Google Cloud instance.
- Manual: Use this default option to enter your Google Cloud service account credentials directly. Exposes a Service account credentials field for this purpose. The Service account credentials are the contents of a Google Cloud service account credentials (JSON keys) file. To upload a file, click the upload button at this field’s upper right.
- Secret: This option exposes a Service account credentials (text secret) drop-down, in which you can select a stored secret that references your Google Cloud service account credentials. The secret can reside in Cribl Stream’s internal secrets manager or (if enabled) in an external KMS. A Create link is available if you need to generate a new secret.

You can access service account credentials in the Google Cloud Console under Service Accounts > <service account associated with bucket> > Keys. The key file must be in JSON format.
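For reference, a downloaded service account key file generally follows this JSON structure; all values below are placeholders, and your file may contain a few additional fields:

```json
{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "0123456789abcdef",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "my-collector@my-project.iam.gserviceaccount.com",
  "client_id": "123456789012345678901",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```

Paste the entire contents of this file into the Service account credentials field, or upload the file itself.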
Next, you can configure the following Optional Settings:
- Path: The directory from which to collect data. Templating is supported (e.g., `myDir/${datacenter}/${host}/${app}/`). Time-based tokens are also supported (e.g., `myOtherDir/${_time:%Y}/${_time:%m}/${_time:%d}/`); see the expansion sketch after this list. More on templates and Filters.
- Path extractors: Extractors allow using template tokens as context for expressions that enrich discovery results. Select Add Extractor to add each extractor as a key-value pair, mapping a Token name on the left (of the form `/<path>/${<token>}`) to a custom JavaScript Extractor expression on the right (e.g., `{host: value.toLowerCase()}`). See an example of the Extractor Expression below.
- Endpoint: Google Cloud Storage service endpoint. If empty, the endpoint will be automatically constructed using the service account credentials.
- Disable time filter: Toggle on to bypass the collection job’s event timestamp filtering. By default, Collectors apply time filtering based on the configured time range, and events with timestamps outside this range are discarded. Enabling this setting disables timestamp-based filtering, allowing all returned events to be processed regardless of their timestamps. This is useful when the data source’s timestamps do not align with the configured time range, causing unexpected filtering, or when you need to process all data regardless of timestamps. If you encounter zero results despite the data source returning events, disabling the time filter can help isolate whether timestamp filtering is the issue. For details, see No Discover or Collector Event Results.
- Recursive: If set to `Yes` (the default), data collection will recurse through subdirectories.
- Max batch size (objects): Maximum number of metadata objects to batch before recording as results. Defaults to `10`. To override this limit in the Collector’s Schedule/Run modal, use Advanced Settings > Upper task bundle size.
- Encoding: Character encoding to use when parsing ingested data. If not set, Cribl Stream defaults to UTF-8 but might incorrectly interpret multi-byte characters. UTF-16LE and Latin-1 are also supported. This option is ignored for Parquet files.
- Tags: Optionally, add tags that you can use to filter and group Sources in Cribl Stream’s Manage Sources page. These tags aren’t added to processed events. Use a tab or hard return between (arbitrary) tag names.
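For example, the time-based tokens in the Path above resolve against the job’s time range. Assuming a (hypothetical) run that covers 23 April 2025, the template resolves to a prefix like:

```
myOtherDir/${_time:%Y}/${_time:%m}/${_time:%d}/   →   myOtherDir/2025/04/23/
```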
Optionally, configure any Result, Result Routing, and Advanced settings outlined in the sections below.
Select Save, then Commit & Deploy.
To verify that the Collector actually collects data, you can start a single run in Preview mode.
The sections described below are spread across several tabs. Click the tab links at left to navigate among tabs.
Collector Sources currently cannot be selected or enabled in the QuickConnect UI.
Extractor Expression Example
Each expression accesses its corresponding `<token>` through the `value` variable, and evaluates the token to populate event fields. Here is a complete example:

| Token | Expression | Matched Value | Extracted Result |
|---|---|---|---|
| `/var/log/${foobar}` | `{program: value.split('.')[0]}` | `/var/log/syslog.1` | `{program: syslog, foobar: syslog.1}` |
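As a rough sketch of how this evaluation works (the standalone JavaScript below is illustrative only, not Cribl Stream’s actual implementation):

```javascript
// Token:       /var/log/${foobar}
// Matched key: /var/log/syslog.1  =>  the token captures 'syslog.1'
const value = 'syslog.1';

// The Extractor expression is evaluated with `value` in scope:
const extracted = { program: value.split('.')[0] }; // { program: 'syslog' }

// The discovery result carries both the token and the extracted field:
// { foobar: 'syslog.1', program: 'syslog' }
```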
Result Settings
The Result Settings determine how Cribl Stream transforms and routes the collected data.
Custom Command
In this section, you can pass the data from this input to an external command for processing, before the data continues downstream.
- Enabled: Defaults to `No`. Toggle to `Yes` to enable the custom command.
- Command: Enter the command that will consume the data (via `stdin`) and will process its output (via `stdout`).
- Arguments: Click Add Argument to add each argument to the command. You can drag arguments vertically to resequence them.
Event Breakers
In this section, you can apply event breaking rules to convert data streams to discrete events.
- Event Breaker rulesets: A list of event breaking rulesets that will be applied, in order, to the input data stream. Defaults to `System Default Rule`.
- Event Breaker buffer timeout: How long (in milliseconds) the Event Breaker will wait for new data to be sent to a specific channel, before flushing out the data stream, as-is, to the Routes. Minimum `10` ms, default `10000` (10 sec), maximum `43200000` (12 hours).
Fields
In this section, you can add Fields to each event, using Eval-like functionality.
- Name: Field name.
- Value: JavaScript expression to compute the field’s value (can be a constant).
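For instance, a couple of illustrative Fields you might add here (the names and values are hypothetical, and the second example assumes your events already carry a `host` field):

```javascript
// Name: dataSource    Value (a constant string):
'gcs_archive'

// Name: host_lower    Value (a JavaScript expression over an existing field):
host.toLowerCase()
```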
Result Routing
Send to Routes: If set to `Yes` (the default), Cribl Stream will send events to normal routing and event processing. Toggle to `No` to select a specific Pipeline/Destination combination. The `No` setting exposes these two additional fields:
- Pipeline: Select a Pipeline to process results.
- Destination: Select a Destination to receive results.
The default `Yes` setting instead exposes this field:
- Pre-processing Pipeline: Pipeline to process results before sending to Routes. Optional, and available only when Send to Routes is toggled to `Yes`.
This field is always exposed:
- Throttling: Rate (in bytes per second) to throttle while writing to an output. Also takes values with multiple-byte units, such as `KB`, `MB`, `GB`, etc. (Example: `42 MB`.) Default value of `0` indicates no throttling.
You might disable Send to Routes when configuring a Collector that will connect data from a specific Source to a specific Pipeline and Destination. This keeps the Collector’s configuration self‑contained and separate from Cribl Stream’s routing table for live data – potentially simplifying the Routes structure.
Advanced Settings
Advanced Settings enable you to customize post-processing and administrative options.
- Environment: If you’re using GitOps, optionally use this field to specify a single Git branch on which to enable this configuration. If empty, the config will be enabled everywhere.
- Time to live: How long to keep the job’s artifacts on disk after job completion. This also affects how long a job is listed in Job Inspector. Defaults to `4h`.
- Remove Discover fields: List of fields to remove from the Discover results. This is useful when discovery returns sensitive fields that should not be exposed in the Jobs user interface. You can specify wildcards (such as `aws*`).
- Resume job on boot: Toggle to `Yes` to resume ad hoc collection jobs if Cribl Stream restarts during the jobs’ execution.
Replay
See these resources that demonstrate how to replay data from object storage. Both are written around Amazon S3-compatible stores, but the general principles apply to Google Cloud buckets as well:
Data Collection & Replay sandbox: Step-by-step tutorial, in a hosted environment, with all inputs and outputs preconfigured for you. Takes about 30 minutes.
Using S3 Storage and Replay: Guided walk-through on setting up your own replay.
How the Collector Pulls Data
In the Discover phase, the first available Worker returns the list of files to the Leader Node. In the Collect phase, Cribl Stream spreads the list of files to process across 1..N Workers, based on file size, with the goal of distributing tasks as evenly as possible across Workers. These Workers then stream in their assigned files from the remote Google Cloud Storage location.
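The documentation doesn’t specify the exact assignment algorithm; as a rough illustration of size-balanced distribution, here is a minimal greedy sketch (the function and data shapes are hypothetical):

```javascript
// Hypothetical sketch: assign each discovered file to the Worker whose
// assigned byte total is currently smallest, so workloads stay roughly even.
function distributeBySize(files, workerCount) {
  const workers = Array.from({ length: workerCount }, () => ({ files: [], bytes: 0 }));
  // Placing the largest files first tends to produce more even totals.
  for (const file of [...files].sort((a, b) => b.size - a.size)) {
    const target = workers.reduce((min, w) => (w.bytes < min.bytes ? w : min));
    target.files.push(file.name);
    target.bytes += file.size;
  }
  return workers;
}

// Example: three files split across two Workers.
// distributeBySize(
//   [{ name: 'a.log', size: 900 }, { name: 'b.log', size: 400 }, { name: 'c.log', size: 350 }],
//   2
// );
// => [{ files: ['a.log'], bytes: 900 }, { files: ['b.log', 'c.log'], bytes: 750 }]
```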
Proxying Requests
If you need to proxy HTTP/S requests, see System Proxy Configuration.
Internal Fields
Cribl Stream uses a set of internal fields to assist in data handling. These “meta” fields are not part of an event, but they are accessible, and you can use them in Functions to make processing decisions.
Relevant fields for this Collector:
- `__collectible` – This object’s nested fields contain metadata about each collection job.
  - `collectorType`: Indicates the type of Collector used for the job.
  - `collectorId`: Represents the Collector ID of the Collector, as configured during setup.
- `__inputId` – Uniquely identifies the origin of data for a collection job. Its format varies depending on whether the job is ad hoc or scheduled (see the examples after this list):
  - Ad hoc jobs are formatted as `collection:<timestamp>.<randomId>.adhoc.<Collector ID>`.
    - `<timestamp>`: The Unix timestamp when the job was initiated.
    - `<randomId>`: A random identifier to ensure uniqueness.
    - `adhoc`: Indicates the job was manually triggered.
    - `<Collector ID>`: The ID of the Collector.
  - Scheduled jobs are formatted as `collection:<Collector ID>`.
    - `<Collector ID>`: The ID of the Collector.
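To make the formats concrete, here are hypothetical `__inputId` values (the timestamp, random ID, and Collector ID below are made up):

```
Ad hoc:    collection:1712345678.ab12cd34.adhoc.gcs_24-7
Scheduled: collection:gcs_24-7
```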
Troubleshooting
When permissions are correct on the object store, and events are reaching the Collector, the Preview pane will show events and the Job Inspector will show an Events collected count.
However, if previewing returns no events and throws no error, first check your Filter expression by previewing without it (e.g., simplify the Filter expression to `true`). Then check the Job Inspector: If the Total size is greater than `0`, and the Received size is `NA` or `0`, make sure you have list and read permissions on the object store.
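As an illustration of how a Filter expression can silently drop everything, here is a sketch; it assumes the discovered object’s path is exposed as `source`, which is an assumption rather than something this page confirms:

```javascript
// Overly restrictive filter: only .log objects under one prefix pass Discover.
// `source` is assumed to hold the discovered object path.
source.startsWith('myDir/') && source.endsWith('.log')

// Temporarily simplify to `true` to confirm whether the filter is the problem:
true
```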