
S3

Cribl LogStream supports receiving data from Amazon S3 buckets, using event notifications through SQS.

📘

Type: Pull | TLS Support: YES (secure API) | Event Breaker Support: YES

S3 Setup Strategy

📘

The source S3 bucket must be configured to send s3:ObjectCreated:* events to an SQS queue, either directly (easiest) or via SNS (Amazon Simple Notification Service). See the event notification configuration guidelines below.

SQS messages will be deleted after they're read, unless an error occurs, in which case LogStream will retry. This means that although LogStream will ignore files not matching the Filename Filter, their SQS events/notifications will still be read, and then deleted from the queue (along with those from files that match).

These ignored files will no longer be available to other S3 Sources targeting the same SQS queue. If you still need to process these files, we suggest one of these alternatives:

  • Using a different, dedicated SQS queue. (Recommended.)

  • Applying a broad filter on a single Source, and then using pre-processing Pipelines and/or Route filters for further processing.

Configuring Cribl LogStream to Receive Data from Amazon S3

Select Data > Sources, then select S3 from the Data Sources page's tiles or left menu. Click Add New to open the S3 > New Source modal, which provides the following fields.

General Settings

Input ID: Enter a unique name to identify this S3 Source definition.

Queue: The name, URL, or ARN of the SQS queue to read events from. When specifying a non-AWS URL, you must use the format: {url}/<queueName>. (E.g., https://host:port/<queueName>.) This value must be a JavaScript expression (which can evaluate to a constant), enclosed in single quotes, double quotes, or backticks.
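
For example (hypothetical queue name and account ID), any of the following would be a valid Queue value, illustrating the three allowed quoting styles:

  'myLogsQueue'
  "arn:aws:sqs:us-east-1:123456789012:myLogsQueue"
  `https://sqs.us-east-1.amazonaws.com/123456789012/myLogsQueue`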

Filename filter: Regex matching file names to download and process. Defaults to .*, to match all characters.
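
For example (hypothetical naming convention), a filter such as \.json\.gz$ would restrict processing to gzipped JSON files, whereas the default .* processes every file the queue notifies about.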

Region: AWS Region where the S3 bucket and SQS queue are located. Required, unless the Queue entry is a URL or ARN that includes a Region.

Authentication

Authentication method: Select an AWS authentication method.

  • Auto: This default option uses the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or the attached IAM role. Works only when running on AWS.

  • Manual: You must select this option when not running on AWS.

The Manual option exposes these corresponding additional fields:

  • Access key: Enter your AWS access key. If not present, will fall back to env.AWS_ACCESS_KEY_ID, or to the metadata endpoint for IAM role credentials.

  • Secret key: Enter your AWS secret key. If not present, will fall back to env.AWS_SECRET_ACCESS_KEY, or to the metadata endpoint for IAM credentials.

Assume Role

Enable for S3: Whether to use Assume Role credentials to access S3. Defaults to Yes.

Enable for SQS: Whether to use Assume Role credentials when accessing SQS (Amazon Simple Queue Service). Defaults to No.

AWS account ID: SQS queue owner's AWS account ID. Leave empty if the SQS queue is in the same AWS account.

AssumeRole ARN: Enter the Amazon Resource Name (ARN) of the role to assume.

External ID: Enter the External ID to use when assuming role.
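
On the AWS side, the role being assumed must trust the identity that LogStream runs as. A minimal trust-policy sketch (hypothetical account ID and External ID; adjust to your environment) looks like this:

{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
   "Action": "sts:AssumeRole",
   "Condition": {
    "StringEquals": { "sts:ExternalId": "example-external-id" }
   }
  }
 ]
}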

Processing Settings

Custom Command

In this section, you can pass the data from this input to an external command for processing, before the data continues downstream.

Enabled: Defaults to No. Toggle to Yes to enable the custom command.

Command: Enter the command that will consume the data (via stdin); its output (via stdout) will continue downstream.

Arguments: Click + Add Argument to add each argument to the command. You can drag arguments vertically to resequence them.
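
For example (purely illustrative), setting Command to /bin/grep and adding the arguments -v and DEBUG would drop lines containing DEBUG before the data continues downstream; any executable that reads stdin and writes stdout can be used.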

Event Breakers

This section defines event breaking rulesets that will be applied, in order.

Event Breaker Rulesets: A list of event breaking rulesets that will be applied to the input data stream before the data is sent through the Routes. Defaults to System Default Rule.

Event Breaker Buffer Timeout: The amount of time (in milliseconds) that the Event Breaker will wait for new data to be sent to a specific channel, before flushing out the data stream, as-is, to the Routes. Defaults to 10000.

Fields (Metadata)

In this section, you can add fields/metadata to each event, using Eval-like functionality.

Name: Field name.

Value: JavaScript expression to compute field's value (can be a constant).
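
For example (hypothetical field names), adding Name: dc with Value: 'us-east-1' attaches a constant field to every event, while Name: ingest_time with Value: Date.now() computes the value at processing time.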

Pre-Processing

In this section's Pipeline drop-down list, you can select a single existing Pipeline to process data from this input before the data is sent through the Routes.

Advanced Settings

Endpoint: S3 service endpoint. If empty, defaults to AWS's region-specific endpoint. Otherwise, used to point to an S3-compatible endpoint.

Signature version: Signature version to use for signing SQS requests. Defaults to v4.

Num receivers: The number of receiver processes to run. The higher the number, the better the throughput, at the expense of CPU overhead. Defaults to 1.

Max messages: The maximum number of messages that SQS should return in a poll request. Amazon SQS never returns more messages than this value. (However, fewer messages might be returned.) Acceptable values: 1 to 10. Defaults to 1.

Visibility timeout seconds: The duration (in seconds) that the received messages are hidden from subsequent retrieve requests, after being retrieved by a ReceiveMessage request. Defaults to 600.

📘

LogStream will automatically extend this timeout until the initial request's files have been processed – notably, in the case of large files that require additional processing time.

Socket timeout: Socket inactivity timeout (in seconds). Increase this value if retrievals time out during backpressure. Defaults to 300 seconds.

Skip file on error: Toggle to Yes to skip files that trigger a processing error. (E.g., corrupted files.) Defaults to No, which enables retries after a processing error.

Reuse connections: Whether to reuse connections between requests. The default setting (Yes) can improve performance.

Reject unauthorized certificates: Whether to reject certificates that cannot be verified against a valid Certificate Authority (e.g., self-signed certificates). Defaults to Yes.

Internal Fields

Cribl LogStream uses a set of internal fields to assist in handling of data. These "meta" fields are not part of an event, but they are accessible, and Functions can use them to make processing decisions.

Fields for this Source:

  • __inputId
  • __source

How to Configure S3 to Send Event Notifications to SQS

📘

For step-by-step instructions, see AWS' Walkthrough: Configure a Bucket for Notifications (SNS Topic and SQS Queue).

  1. Create a Standard SQS Queue. Note its ARN.

  2. Replace its access policy with one similar to the examples below. To do so, select the queue, and then, on the Permissions tab, click Edit Policy Document (Advanced). (These examples differ only at line 9, showing public access to the SQS queue versus S3-only access to the queue.)

  3. In the Amazon S3 console, add a notification configuration to publish events of the s3:ObjectCreated:* type to the SQS queue.

{
 "Version": "example-2020-04-20",
 "Id": "example-ID",
 "Statement": [
  {
   "Sid": "<SID name>",
   "Effect": "Allow",
   "Principal": {
    "AWS":"*"  
   },
   "Action": [
    "SQS:SendMessage"
   ],
   "Resource": "example-SQS-queue-ARN",
   "Condition": {
      "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:example-bucket-name" }
   }
  }
 ]
}
{
 "Version": "example-2020-04-20",
 "Id": "example-ID",
 "Statement": [
  {
   "Sid": "<SID name>",
   "Effect": "Allow",
   "Principal": {
    "Service": "s3.amazonaws.com"
   },
   "Action": [
    "SQS:SendMessage"
   ],
   "Resource": "example-SQS-queue-ARN",
   "Condition": {
      "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:example-bucket-name" }
   }
  }
 ]
}
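
For step 3, the bucket's notification configuration ends up looking roughly like the following sketch (hypothetical ID; the queue ARN must match the queue whose policy you edited above):

{
 "QueueConfigurations": [
  {
   "Id": "example-notification-ID",
   "QueueArn": "example-SQS-queue-ARN",
   "Events": [ "s3:ObjectCreated:*" ]
  }
 ]
}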

S3 and SQS Permissions

The following permissions are required on the S3 bucket:

  • s3:GetObject
  • s3:ListBucket

The following permissions are required on the SQS queue:

  • sqs:ReceiveMessage
  • sqs:DeleteMessage
  • sqs:ChangeMessageVisibility
  • sqs:GetQueueAttributes
  • sqs:GetQueueUrl
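
As a reference, an IAM policy granting these permissions might look like the following sketch (hypothetical bucket name and queue ARN):

{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Action": [
    "s3:GetObject",
    "s3:ListBucket"
   ],
   "Resource": [
    "arn:aws:s3:::example-bucket-name",
    "arn:aws:s3:::example-bucket-name/*"
   ]
  },
  {
   "Effect": "Allow",
   "Action": [
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage",
    "sqs:ChangeMessageVisibility",
    "sqs:GetQueueAttributes",
    "sqs:GetQueueUrl"
   ],
   "Resource": "example-SQS-queue-ARN"
  }
 ]
}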

Best Practices

  • When LogStream instances are deployed on AWS, use IAM Roles whenever possible.

    • Not only is this safer, but it also makes the configuration simpler to maintain.
  • Although optional, we highly recommend that you use a Filename Filter.

    • This will ensure that LogStream ingests only files of interest.
    • Ingesting only what's strictly needed reduces latency and processing load, and improves data quality.
  • If higher throughput is needed, increase Advanced Settings > Num receivers and/or Max messages. However, do note:

    • These default to 1, which means each Worker Process, on each LogStream Worker Node, will run 1 receiver consuming 1 message (i.e., S3 file) at a time.
    • Total S3 objects processed at a time per Worker Node = Worker Processes x Number of Receivers x Max Messages. (See the worked example after this list.)
    • Increased throughput implies additional CPU utilization.
  • When ingesting large files, tune up the Visibility Timeout, or consider using smaller objects.

    • The default value of 600s works well in most cases, and while you certainly can increase it, we suggest that you also consider using smaller S3 objects.
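
For example (hypothetical sizing): a Worker Node running 4 Worker Processes, with Num receivers set to 2 and Max messages set to 10, can have up to 4 x 2 x 10 = 80 S3 objects in flight at a time, with a corresponding increase in CPU utilization.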

Troubleshooting Notes

  • VPC endpoints for SQS and for S3 might need to be set up in your account. Check with your administrator for details.

  • If you're having connectivity issues, but no problems with the CLI, see if the AWS CLI proxy is in use. Check with your administrator for details.
