S3

Cribl LogStream supports collecting data from Amazon S3 stores. This page covers how to configure the Collector.

👍

For a step-by-step tutorial on using LogStream to replay data from an S3-compatible store, see our Data Collection & Replay sandbox. The sandbox takes about 30 minutes. It provides a hosted environment, with all inputs and outputs preconfigured for you.

Also see our Using S3 Storage and Replay guided walk-through in this documentation.

How the Collector Pulls Data

When you run an S3 Collector in Discovery mode, the first available Worker returns the list of available files to the Leader Node.

In Full Run mode, the Leader distributes the list of files to process across 1 to N Workers as evenly as possible, based on file size. Each Worker then streams the files from the S3 bucket/path to itself.
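
The exact scheduling logic is internal to LogStream; the sketch below is only a rough illustration of size-based balancing, using hypothetical file objects and a greedy assignment that is not necessarily what LogStream uses.

```javascript
// Rough illustration only, not LogStream's actual scheduler.
// Spread discovered files across N Workers, balancing total bytes per Worker.
function distributeFiles(files, workerCount) {
  const sorted = [...files].sort((a, b) => b.size - a.size); // largest first
  const workers = Array.from({ length: workerCount }, () => ({ files: [], bytes: 0 }));
  for (const file of sorted) {
    // Assign each file to whichever Worker currently holds the fewest bytes.
    const lightest = workers.reduce((min, w) => (w.bytes < min.bytes ? w : min));
    lightest.files.push(file.key);
    lightest.bytes += file.size;
  }
  return workers;
}

// Example: three objects spread across two Workers.
console.log(distributeFiles(
  [{ key: 'a.log', size: 900 }, { key: 'b.log', size: 500 }, { key: 'c.log', size: 400 }],
  2
));
```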

🚧

LogStream does not support data preview, collection, or replay from the S3 Glacier or S3 Glacier Deep Archive storage classes, whose stated retrieval lags (variously minutes to 48 hours) cannot guarantee data availability when the Collector needs it.

Configuring an S3 Collector

From the top nav of a LogStream instance or Group, select Sources, then select Collectors > S3 from the Data Sources page's tiles or the Sources left nav. Click + Add New to open the S3 > New Collector modal, which provides the following options and fields.

📘

The sections described below are spread across several tabs. Click the tab links at left, or the Next and Prev buttons, to navigate among tabs. Click Save when you've configured your Collector.

Collector Settings

The Collector Settings determine how data is collected before processing.

Collector ID: Unique ID for this Collector. E.g., Attic42TreasureChest.

Auto-populate from: Select a Destination with which to auto-populate Collector settings. Useful when replaying data.

S3 bucket: Simple Storage Service bucket from which to collect data.

Region: S3 Region from which to retrieve data.

Path: Path, within the bucket, from which to collect data. Templating is supported (e.g., /myDir/${host}/${year}/${month}/). More on templates and Filters.
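
To illustrate what the tokens in a template path capture (this is only a sketch of the idea, not how LogStream implements it), a template can be read as a pattern whose tokens each match one path segment:

```javascript
// Sketch only: shows what path-template tokens capture, not LogStream internals.
function matchTemplate(template, objectKey) {
  // Turn each ${token} into a named capturing group that matches one path segment.
  const pattern = template.replace(/\$\{(\w+)\}/g, '(?<$1>[^/]+)');
  const match = objectKey.match(new RegExp('^' + pattern));
  return match ? match.groups : null;
}

// Example: yields { host: 'web01', year: '2021', month: '09' }.
console.log(matchTemplate('/myDir/${host}/${year}/${month}/', '/myDir/web01/2021/09/access.log'));
```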

Path extractors: Extractors allow using template tokens as context for expressions that enrich discovery results. Click + Add Extractor to add each extractor as a key-value pair, mapping a Token name on the left (of the form /<path>/${<token>}) to a custom JavaScript Extractor expression on the right (for example, {host: value.toLowerCase()}). Each expression accesses its corresponding <token> through the value variable, and evaluates it to populate event fields. Here is a complete example:

| Token | Expression | Matched Value | Extracted Result |
| --- | --- | --- | --- |
| /var/log/${foobar} | foobar: {program: value.split('.')[0]} | /var/log/syslog.1 | {program: syslog, foobar: syslog.1} |
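
In plain JavaScript terms, the example row evaluates roughly as follows (a sketch only, with the matched value hard-coded):

```javascript
// Sketch only: approximates how the example extractor above is evaluated.
const value = 'syslog.1';                           // what the ${foobar} token matched
const extracted = { program: value.split('.')[0] }; // the extractor expression
const fields = { foobar: value, ...extracted };     // token value plus extracted fields
console.log(fields); // { foobar: 'syslog.1', program: 'syslog' }
```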

Recursive: If set to Yes (the default), data collection will recurse through subdirectories.

Max batch size (files): Maximum number of lines written to the discovery results files each time. Defaults to 10. To override this limit in the Collector's Schedule/Run modal, use Advanced Settings > Upper task bundle size.

Authentication

Select an AWS authentication method.

The Manual option (default) provides these fields:

  • Access key: Enter your AWS access key. If not present, will fall back to the env.AWS_ACCESS_KEY_ID environment variable, or to the metadata endpoint for IAM role credentials.

  • Secret key: Enter your AWS secret key. If not present, will fall back to the env.AWS_SECRET_ACCESS_KEY environment variable, or to the metadata endpoint for IAM credentials. Optional when running on AWS.

The Secret Key pair option swaps in this drop-down:

  • Secret key pair: Select a secret key pair that you've configured in LogStream's internal secrets manager or (if enabled) an external KMS. Follow the Create link if you need to configure a key pair.

Assume Role

Enable Assume Role: Slide to Yes to enable Assume Role behavior.

AssumeRole ARN: Amazon Resource Name (ARN) of the role to assume.

External ID: External ID to use when assuming role.
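
For reference, a role ARN follows the standard AWS format; the account ID and role name below are placeholders only.

```
arn:aws:iam::123456789012:role/logstream-s3-read
```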

Additional S3 Settings

Endpoint: S3 service endpoint. If empty, LogStream will automatically construct the endpoint from the region.

Signature version: Signature version to use for signing S3 requests. Defaults to v4.

Reuse connections: Whether to reuse connections between requests. The default setting (Yes) can improve performance.

Reject unauthorized certificates: Whether to reject certificates that cannot be verified against a valid Certificate Authority (e.g., self-signed certificates). Defaults to Yes.

Result Settings

The Result Settings determine how LogStream transforms and routes the collected data.

Custom Command

In this section, you can pass the data from this input to an external command for processing, before the data continues downstream.

Enabled: Defaults to No. Toggle to Yes to enable the custom command.

Command: Enter the command that will consume the data (via stdin) and emit the processed data (via stdout).

Arguments: Click + Add Argument to add each argument to the command. You can drag arguments vertically to resequence them.
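
As an illustration of this contract only (the script name and logic are made up for the example), an external command might be a small Node.js script that reads the stream on stdin and writes its output to stdout; here it drops lines that contain DEBUG:

```javascript
#!/usr/bin/env node
// Hypothetical external command: reads raw data on stdin, writes processed data to stdout.
// This one simply drops lines that contain "DEBUG" and passes everything else through.
const readline = require('readline');

const rl = readline.createInterface({ input: process.stdin, terminal: false });
rl.on('line', (line) => {
  if (!line.includes('DEBUG')) {
    process.stdout.write(line + '\n');
  }
});
```

You would then set Command to something like node /opt/scripts/drop-debug.js (the path and filename here are hypothetical), with any further options supplied as Arguments.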

Event Breakers

In this section, you can apply event breaking rules to convert data streams to discrete events.

Event Breaker rulesets: A list of event breaking rulesets that will be applied, in order, to the input data stream. Defaults to System Default Rule.

Event Breaker buffer timeout: The amount of time (in milliseconds) that the event breaker will wait for new data to be sent to a specific channel, before flushing out the data stream, as-is, to the routes. Defaults to 10000.

Fields (Metadata)

In this section, you can add fields/metadata to each event, using Eval-like functionality.

Name: Field name.

Value: JavaScript expression to compute the field's value (can be a constant).
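
For example (the field names and values below are arbitrary illustrations), one field might use a constant and another a computed expression:

```javascript
// Illustrative only: two example Value expressions for added fields.
// Name: collection_source   Value (a constant string):
's3-replay'

// Name: collected_at        Value (a computed JavaScript expression):
Date.now()
```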

Result Routing

Send to Routes: If set to Yes (the default), events will be sent to normal routing and event processing. Toggle to No to select a specific Pipeline/Destination combination. The No setting exposes these two additional fields:

  • Pipeline: Select a Pipeline to process results.

  • Destination: Select a Destination to receive results.

📘

You might disable Send to Routes when configuring a Collector that will connect data from a specific Source to a specific Pipeline and Destination. This keeps the Collector's configuration self-contained and separate from LogStream's routing table for live data, potentially simplifying the Routes structure.

Pre-processing Pipeline: Pipeline to process results before sending to Routes. Optional.

Throttling: Rate (in bytes per second) to throttle while writing to an output. Also takes values with multiple-byte units, such as KB, MB, GB, etc. (Example: 42 MB.) Default value of 0 indicates no throttling.

Advanced Settings

Advanced Settings enable you to customize post-processing and administrative options.

Time to live: How long to keep the job's artifacts on disk after job completion. This also affects how long a job is listed in Job Inspector. Defaults to 4h.

Remove Discover fields: List of fields to remove from the Discover results. This is useful when discovery returns sensitive fields that should not be exposed in the Jobs user interface. You can specify wildcards (such as aws*).

Resume job on boot: Toggle to Yes to resume ad hoc collection jobs if LogStream restarts during the jobs' execution.

