Databricks Destination

The Databricks Destination sends data to Databricks Unity Catalog volumes.

Type: Non-Streaming | TLS Support: N/A | PQ Support: N/A

In Cribl.Cloud, the Databricks Destination is only available on customer-managed hybrid Worker Nodes.

Configure a Databricks Destination

  1. On the top bar, select Products, and then select Cribl Stream. Under Worker Groups, select a Worker Group. Next, you have two options:
    • To configure via QuickConnect, navigate to Routing > QuickConnect (Stream) or Collect (Edge). Select Add Destination and select the Destination you want from the list, choosing either Select Existing or Add New.
    • To configure via the Routes, select Data > Destinations or More > Destinations (Edge). Select the Destination you want. Next, select Add Destination.
  2. In the New Destination modal, configure the following under General Settings:
    • Output ID: Enter a unique name to identify this Databricks Destination definition. If you clone this Destination, Cribl Stream will add -CLONE to the original Output ID.
    • Description: Optionally, enter a description.
    • Data format: The output data format defaults to JSON. Raw and Parquet are also available. Selecting Parquet (supported only on Linux, not Windows) exposes a Parquet Settings left tab, where you must configure certain options in order to export data in Parquet format.
  3. Next, you can configure the following Optional Settings:
    • Upload path: Specify an optional directory prefix for files uploaded to the Databricks events volume. It accepts a JavaScript expression (enclosed in quotes or backticks) that evaluates at initialization, allowing dynamic path generation using global variables. For example: `myEventsVolumePath-${C.vars.myVar}`. (See the example expressions after this procedure.)

      If unset, files are uploaded to the root of the events volume. Use this setting to organize output data by customer, date, or other runtime variables for efficient partitioning and querying in Databricks.

    • Staging location: The local filesystem directory where files are temporarily buffered before compression and upload to Databricks. This should be a performant and reliable storage path to avoid bottlenecks or data loss during output operations. Defaults to $CRIBL_HOME/state/outputs/staging.

    • Partitioning expression: JavaScript expression that defines how files are partitioned into subdirectories within the Databricks Upload path. Evaluated per event, typically for time-based or custom partitioning. Defaults to C.Time.strftime(_time ? _time : Date.now()/1000, '%Y/%m/%d'). This default partitions files by event time (or current time if _time is absent), creating a directory structure like 2025/10/29/.

      If blank, Cribl Stream will fall back to the event’s __partition field value (if present), or otherwise to the root directory of the Upload path.

    • Compression: Data compression format used before moving to the final destination. Defaults to gzip (recommended). This setting is not available when Data format is set to Parquet.

    • File name prefix expression: The output file name prefix. Must be a JavaScript expression (which can evaluate to a constant), enclosed in quotes or backticks. Defaults to CriblOut.

    • File name suffix expression: The output file name suffix. Must be a JavaScript expression (which can evaluate to a constant), enclosed in quotes or backticks. Defaults to `.${C.env["CRIBL_WORKER_ID"]}.${__format}${__compression === "gzip" ? ".gz" : ""}`, where __format can be json or raw, and __compression can be none or gzip.

      To prevent files from being overwritten, Cribl Stream adds a random sequence of six characters to each file name, inserted between the evaluated prefix and suffix expressions.

      For example, if you set the File name prefix expression to CriblExec and set the File name suffix expression to .csv, the file name might display as CriblExec-adPRWM.csv with adPRWM appended.

    • Backpressure Behavior: Select whether to block or drop events when all receivers are exerting backpressure. (Causes might include an accumulation of too many files needing to be closed.) Defaults to Block.

    • Tags: Optionally, add tags that you can use to filter and group Destinations on the Destinations page. These tags aren’t added to processed events. Use a tab or hard return between (arbitrary) tag names.

  4. The Unity Catalog OAuth settings allow Cribl Stream to authenticate to your Databricks Workspace as an application (known as a service principal) without using a personal access token.
  5. The Databricks Settings define the specific location within the Unity Catalog three-level namespace (catalog.schema.object) where Cribl Stream will write data.
    • Catalog: The top-level container for your data in Unity Catalog. The default value main is the catalog that Databricks automatically creates within a new Unity Catalog metastore.
    • Schema: The second level of the namespace, also known as a database. A schema contains tables, views, and volumes. Defaults to external to indicate that the data is coming from an external source.
    • Events volume name: Name of the events volume in Databricks where files are stored. Defaults to events.
  6. Optionally, you can adjust the Processing, Parquet, and Advanced settings outlined in the sections below.
  7. Select Save, then Commit & Deploy.
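
The sketch below pulls together the expression-based settings from this procedure (Upload path, Partitioning expression, and the File name prefix/suffix expressions) and shows where a finished file might land given the default Databricks Settings. It is illustrative only: the Global Variable `myVar`, the `customer` field, and the sample paths are hypothetical placeholders rather than values the Destination requires.

```javascript
// Upload path: evaluated once at initialization. `myVar` is a hypothetical
// Global Variable; files are uploaded under this prefix in the events volume.
`myEventsVolumePath-${C.vars.myVar}`

// Partitioning expression: evaluated per event. The default (shown here)
// partitions by event time, producing subdirectories like 2025/10/29/.
C.Time.strftime(_time ? _time : Date.now() / 1000, '%Y/%m/%d')

// A custom variant that also partitions by a hypothetical `customer` field:
`${customer}/${C.Time.strftime(_time ? _time : Date.now() / 1000, '%Y/%m/%d')}`

// File name prefix and suffix expressions (the defaults described in step 3):
'CriblOut'
`.${C.env["CRIBL_WORKER_ID"]}.${__format}${__compression === "gzip" ? ".gz" : ""}`

// With the default Catalog (main), Schema (external), and Events volume name
// (events), a finished file would land under the Unity Catalog volume path:
// /Volumes/main/external/events/<upload path>/<partition>/CriblOut-<random>.<worker>.json.gz
```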

Processing Settings

Post-Processing

Pipeline: Pipeline or Pack to process data before sending the data out using this output.

System fields: A list of fields to automatically add to events that use this output. By default, includes cribl_pipe (identifying the Cribl Stream Pipeline that processed the event). Supports wildcards. Other options include:

  • cribl_host – Cribl Stream Node that processed the event.
  • cribl_input – Cribl Stream Source that processed the event.
  • cribl_output – Cribl Stream Destination that processed the event.
  • cribl_route – Cribl Stream Route (or QuickConnect) that processed the event.
  • cribl_wp – Cribl Stream Worker Process that processed the event.
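
As an illustration of how these fields appear in output (all values below are invented), an event written with the JSON data format might look like the following: cribl_pipe is added by default, while the other cribl_* fields appear only if you add them to System fields.

```javascript
// Hypothetical event as serialized to an output file, shown as a JavaScript
// object so the individual fields can be annotated.
const exampleEvent = {
  _time: 1730160000,               // event timestamp (epoch seconds)
  message: "example event body",
  cribl_pipe: "my_pipeline",       // added by default
  cribl_host: "worker-node-01",    // added only if cribl_host is listed
  cribl_output: "databricks_dest"  // added only if cribl_output is listed
};
```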

Parquet Settings

To write out Parquet files, note that:

  • On Linux, you can use the Cribl Stream CLI’s parquet command to view a Parquet file, its metadata, or its schema.
  • Cribl Edge Workers support Parquet only when running on Linux, not on Windows.
  • See Parquet Schemas for pointers on how to avoid problems such as data mismatches.

Automatic schema: Toggle on to automatically generate a Parquet schema based on the events of each Parquet file that Cribl Stream writes. When toggled off (the default), exposes the following additional field:

  • Parquet schema: Select a schema from the drop-down.

If you need to modify a schema or add a new one, follow the instructions in our Parquet Schemas topic. These steps will propagate the freshest schema back to this drop-down.

Parquet version: Determines which data types are supported, and how they are represented. Defaults to 2.6; 2.4 and 1.0 are also available.

Data page version: Serialization format for data pages. Defaults to V2. If your toolchain includes a Parquet reader that does not support V2, use V1.

Group row limit: The number of rows that every group will contain. The final group can contain a smaller number of rows. Defaults to 10000.

Page size: Set the target memory size for page segments. Generally, set lower values to improve reading speed, or set higher values to improve compression. Value must be a positive integer smaller than the Group row limit value, with appropriate units. Defaults to 1 MB.

Log invalid rows: Toggle on to output up to 20 unique rows that were skipped due to data format mismatch. Log level must be set to debug for output to be visible.

Write statistics: Toggle on (default) if you have Parquet tools configured to view statistics – these profile an entire file in terms of minimum/maximum values within data, numbers of nulls, etc.

Write page indexes: Toggle on (default) if your Parquet reader uses statistics from Page Indexes to enable page skipping. One Page Index contains statistics for one data page.

Write page checksum: Toggle on if you have configured Parquet tools to verify data integrity using the checksums of Parquet pages.

Metadata (optional): The metadata of files the Destination writes will include the properties you add here as key-value pairs. For example, one way to tag events as belonging to the OCSF category for security findings would be to set Key to OCSF Event Class and Value to 2001.

Advanced Settings

File size limit (MB): Maximum uncompressed output file size. Files of this size will be closed and moved to final output location. Defaults to 32.

File open time limit (sec): Maximum amount of time to write to a file. Files open for longer than this will be closed and moved to final output location. Defaults to 300.

Idle time limit (sec): Maximum amount of time to keep inactive files open. Files open for longer than this will be closed and moved to final output location. Defaults to 30.

Open file limit: Maximum number of files to keep open concurrently. When exceeded, the oldest open files will be closed and moved to final output location. Defaults to 100.

Cribl Stream will close files when either the File size limit (MB) or the File open time limit (sec) condition is met.
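
Taken together, these limits behave roughly as sketched below. This is an illustrative model only, not Cribl Stream's implementation, and the function and field names are invented.

```javascript
// Illustrative sketch (invented names): a staged file is closed and moved to
// its final output location once any of these limits is reached.
function shouldCloseFile(file, limits, nowMs) {
  const sizeExceeded = file.uncompressedBytes >= limits.fileSizeLimitMB * 1024 * 1024;
  const openTooLong = nowMs - file.openedAtMs >= limits.fileOpenTimeLimitSec * 1000;
  const idleTooLong = nowMs - file.lastWriteAtMs >= limits.idleTimeLimitSec * 1000;
  return sizeExceeded || openTooLong || idleTooLong;
}

// Separately, if more than the Open file limit (default 100) files are open at
// once, the oldest open files are closed first to bring the count back down.
```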

Compression level: Specifies the level of compression to apply before moving files to the final destination. Higher compression levels reduce file size but increase CPU usage. Choose from Best Speed, Normal, or Best Compression. The default level, Best Speed, will prioritize faster compression over smaller file sizes.

Writing high watermark (KB): Sets the high watermark for writing files, in kilobytes. This determines the buffer size for file writes, which impacts performance and memory usage. The default value, 64, uses a 64 KB buffer for file writes.

Header line: If set, this line will be written to the beginning of each output file, followed by a newline character. This can be useful for adding a header row to CSV files.

Disk space protection: Specifies whether to Block (default) or Drop incoming events when the disk space falls below the globally defined Min free disk space amount.

Add Output ID: When toggled on (default), adds the Output ID field’s value to the staging location’s file path. This ensures that each Destination’s logs will write to its own directory.

Remove empty staging directories: When toggled on (the default), Cribl Stream deletes empty staging directories after moving files. This prevents the proliferation of orphaned empty directories. When enabled, exposes these additional options:

  • Staging cleanup period: How often (in seconds) to delete empty directories when Remove empty staging directories is enabled. Defaults to 300 seconds (every 5 minutes). Minimum configurable interval is 10 seconds; maximum is 86400 seconds (every 24 hours).
  • Directory batch size: Specifies how many directories are scanned per batch during automatic cleanup of empty staging directories. Set between 10 and 10000, defaults to 1000. Higher values speed up cleanup but increase memory usage.

Force close on shutdown: Toggle on to force all staged files to close during an orderly shutdown. This triggers immediate upload or finalization of any in-progress files, regardless of Idle time, File open time, or File size limits. Use this option to minimize data loss risk in ephemeral or auto-scaling environments. This could result in a higher number of small files, as files are closed before reaching their configured thresholds. Ensure your Worker Group or Fleet Drain timeout is set high enough to allow all staged data to be finalized before shutdown completes.

Enable dead-lettering: Toggle on to set a maximum number of retries, and to move files to a designated directory when write failures exceed that limit. This prevents data flow blockage and excessive error logging due to undeliverable files. When enabled, exposes two additional fields:

  • Dead-letter location: Specify the storage location for undeliverable files. Defaults to $CRIBL_HOME/state/outputs/dead-letter.
  • Maximum retry limit: Configure the retry limit for failed file deliveries. This setting defines how many times the system will attempt to move a file to its intended location before it is deemed undeliverable and placed in the dead-letter directory. Defaults to 20.
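
Conceptually, dead-lettering behaves as in the sketch below. This is an illustrative model only, not Cribl Stream's implementation; the helper functions are invented.

```javascript
// Illustrative sketch (invented helpers): retry a failed file move up to the
// Maximum retry limit, then park the file in the dead-letter location.
async function finalizeFile(file, maxRetryLimit, deadLetterLocation) {
  for (let attempt = 1; attempt <= maxRetryLimit; attempt++) {
    try {
      await moveToFinalLocation(file); // hypothetical helper: move/upload the file
      return;                          // delivered successfully
    } catch (err) {
      // log the failure and retry until the limit (default 20) is exhausted
    }
  }
  await moveTo(file, deadLetterLocation); // hypothetical helper: the file is dead-lettered
}
```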

Environment: If you’re using GitOps, optionally use this field to specify a single Git branch on which to enable this configuration. If empty, the config will be enabled everywhere.

Troubleshooting

The Destination’s configuration modal has helpful tabs for troubleshooting:

Live Data: Capture live data to see events in real time as they flow through the Destination. On the Live Data tab, select Start Capture to begin viewing the captured events.

Logs: Review and search the logs that provide detailed information about the delivery process, including any errors or warnings that may have occurred.

Test: Ensures that the Destination is correctly set up and reachable. Verify that sample events are sent correctly by selecting Run Test.

You can also view the Monitoring page, which provides a comprehensive overview of data volume and rate to help you identify delivery issues. Analyze the graphs showing events and bytes in/out over time.