Organize Your Data in Cribl Search

Datasets are logical containers that let you scope searches to the right data and keep things organized as ingest grows. Dataset rules define how Cribl Search routes your events into Datasets.


About Federated Datasets and Search Datasets

  • Search Datasets hold data ingested into a lakehouse engine (Cribl-hosted storage, optimized for fast, schema-aware search and AI workflows).
  • Federated Datasets don’t store data in Search; they point to external storage (such as Amazon S3 or Azure Blob Storage) via a Dataset Provider, and queries run “in place” there.

Both are called “Datasets,” but Search Datasets are Search-hosted and tied to a lakehouse engine, while federated Datasets are customer-hosted and accessed through a Dataset Provider.

How Search Datasets Work

Each engine has a main Dataset (plus any custom Datasets you create). Engines ingest data from Sources, automatically detect the Datatypes, then route the events into Datasets.

You configure data retention per Dataset, which controls how long Cribl Search keeps the data before deleting it. Retention settings are key for both cost and compliance.

Dataset rules let you organize and control data routing so that, for example, firewall logs route to one Dataset and auth logs to another. A rule can also drop matching events instead of storing them.

The main Dataset

The main Dataset is a special Dataset with unique properties. It:

  • Is automatically created when you set up your first lakehouse engine.
  • Always exists as long as you have at least one lakehouse engine.
  • Can’t be moved to a different engine.

If you delete the engine hosting main, Cribl will prompt you to recreate it on another lakehouse engine. If no lakehouse engines remain, no main Dataset exists until you create a new one.

How Dataset Rules Work

After Datatyping parses your events, Cribl Search applies Dataset rules to decide where each event goes. Each rule has:

  • Kusto expression: A filter that matches events (for example, datatype == "syslog" or host == "web-01").
  • Destination: The Dataset to send matching events to, or Drop to discard them.

Rules are evaluated in order from top to bottom. The first rule whose expression matches an event is used. Events that match no rule, or that match a rule pointing to a deleted or invalid Dataset, are marked as Orphaned and land in the main Dataset with a _dataset_reason field indicating why.

By default, a catch-all rule sends all events to the main Dataset. You can add rules above it to route specific subsets.
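The first-match behavior described above can be illustrated with a short sketch. This is hypothetical Python, not Cribl code: the predicate functions stand in for Kusto expressions, and the rule list, Dataset names, and `_dataset_reason` strings are simplified assumptions.

```python
# Hypothetical sketch of first-match Dataset routing (not Cribl source code).
# Each rule pairs a predicate (standing in for a Kusto expression) with a
# destination Dataset name, or None to represent a Drop destination.

def route(event, rules, existing_datasets):
    """Return (dataset, reason) for an event, mimicking first-match routing."""
    for predicate, destination in rules:
        if predicate(event):
            if destination is None:
                return None, "dropped"            # rule destination is Drop
            if destination in existing_datasets:
                return destination, None          # first matching rule wins
            # Rule points at a deleted/invalid Dataset: orphan to main.
            return "main", "destination Dataset does not exist"
    # No rule matched at all: orphan to main.
    return "main", "no matching rule"

rules = [
    (lambda e: e.get("datatype") == "pan_firewall", "firewall_logs"),
    (lambda e: e.get("datatype") == "linux_auth", "auth_logs"),
    (lambda e: True, "main"),                     # catch-all rule at the bottom
]
datasets = {"main", "firewall_logs", "auth_logs"}

print(route({"datatype": "pan_firewall"}, rules, datasets))  # ('firewall_logs', None)
print(route({"datatype": "unknown"}, rules, datasets))       # ('main', None)
```

Because evaluation stops at the first match, more specific rules must sit above the catch-all.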

Dataset Rule Best Practices

To add Dataset rules in the Get Data In workflow, see Add a Dataset Rule.

  • Preferred: Match on the Datatype assigned by Datatype matching. This lets you demultiplex data, routing events of different types from a single Source to separate Datasets.
  • Combine Datatype and __inputId for more specific routing when needed.
  • Use __inputId alone to route all events from a given Source.
  • Use any parsed field for finer-grained routing.
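As an illustration of these options, here is a hypothetical Python sketch. The field names `datatype` and `__inputId` come from this page; the sample values are invented.

```python
# Hypothetical events from two Sources, after Datatype matching has run.
events = [
    {"__inputId": "syslog:in_syslog", "datatype": "pan_firewall"},
    {"__inputId": "syslog:in_syslog", "datatype": "linux_auth"},
    {"__inputId": "http:in_http",     "datatype": "linux_auth"},
]

# Preferred: match on Datatype to demultiplex one Source into Datasets.
firewall = [e for e in events if e["datatype"] == "pan_firewall"]

# Combine Datatype and __inputId for more specific routing.
syslog_auth = [e for e in events
               if e["datatype"] == "linux_auth"
               and e["__inputId"] == "syslog:in_syslog"]

# __inputId alone routes everything from a given Source.
from_syslog = [e for e in events if e["__inputId"] == "syslog:in_syslog"]

print(len(firewall), len(syslog_auth), len(from_syslog))  # 1 1 2
```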

Matching Expressions in Dataset Rulesets

Use matching expressions to define when a Dataset (or Datatype) rule applies. Matching expressions support both simple term matching and full `where`-style filters:

Simple matching (default): Enter a plain expression. Simple matching uses Cribl operator semantics, supporting:

  • Simple term matching (like foo*)
  • Wildcards
  • Convenience operators like =
  • Case-insensitive matching

When you need full where semantics (built-in functions, richer conditions), start your expression with a piped operator:

  • | where ...
  • | find ...
  • | search ...

Examples:

  • | where isnotnull(must_have_this)
  • foo* | where x > y
  • foo* | find where z == "value"
  • foo* | search "error"

You can combine simple matching and piped filters in a single matching expression:

  • foo* | where x > y | find where z == "value" | search "error"

Only filtering expressions are supported here; do not use other operators (such as `stats`, `project`, and so on) in the matching expression.

Timestamps and Timezone Handling

Cribl Search parses each event and interprets the timestamp assuming it includes timezone information. If no timezone is present in the event, Cribl Search assumes UTC.

To correct for timezone offsets, you have two options:

  • From Cribl Stream: Set the _time field in Stream to the correct time for the timezone. Cribl Search uses this value and ignores the timestamp in the event.
  • Without Stream: Create a Datatype in Cribl Search that includes the correct timezone label. This corrects the time offset during Datatype processing.
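The UTC assumption can be illustrated with standard library datetime handling. This is an analogy using Python's `datetime`, not Cribl's actual parser:

```python
from datetime import datetime, timezone

def interpret(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp; assume UTC when no offset is present,
    mimicking how a parser treats events without timezone info."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # no timezone: assume UTC
    return dt

# With an explicit offset, the event's own timezone is honored:
with_tz = interpret("2024-05-01T08:00:00-04:00")
# Without one, the same wall-clock time is read as UTC:
no_tz = interpret("2024-05-01T08:00:00")

print(with_tz.astimezone(timezone.utc).hour)  # 12
print(no_tz.hour)                             # 8
```

If your events lack timezone info and were not produced in UTC, the parsed `_time` will be offset from the true time, which is why the two correction options above matter.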

Inspect Your Datasets

After data is flowing, use the Data Explorer to inspect your Datasets. Go to Data > Datasets, select a Lakehouse Dataset, and open its details panel, which provides these tabs:

  • Overview - See activity summary, volume over time, and field statistics.
  • Search History - See recent searches run against this Dataset.
  • Saved Searches - Manage saved searches scoped to this Dataset.
  • Usage - See who has used the Dataset and when.

Field statistics show the fields present in your data and their types, which helps you understand how datatyping and Dataset rules are shaping your events. For more details, see Data Explorer.

Stream Overrides

When sending data from Cribl Stream to Cribl Search, you can use these override fields:

| Field | Effect |
| --- | --- |
| `dataset` | Forces Cribl Search to ignore the Dataset ruleset and route directly to the specified Dataset. If the Dataset does not exist, events route to `main` with `_dataset_reason` set to “does not exist.” |
| `datatype` | Forces Cribl Search to ignore the Datatype ruleset and use the specified Datatype. If the Datatype does not exist, events run through the AI Datatypes with normal matching logic. |
| `_time` | Forces Cribl Search to ignore the time in the event and use the `_time` value specified in Stream (in UTC). |
| `isParsed` | Tells Cribl Search to skip the datatyping step and use the fields sent from Stream. Use this when you already parse data in Stream and do not want to parse it again. |
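A hypothetical sketch of how the `dataset` override interacts with routing (the field names and the “does not exist” reason come from this page; the routing logic itself is a simplified assumption, not Cribl code):

```python
# Hypothetical sketch of the `dataset` override from Stream (not Cribl code).

def route_with_override(event, existing_datasets, ruleset_default="main"):
    """If the event carries a `dataset` field, bypass the ruleset entirely."""
    override = event.get("dataset")
    if override is not None:
        if override in existing_datasets:
            return override, None            # ruleset is bypassed entirely
        return "main", "does not exist"      # recorded in _dataset_reason
    return ruleset_default, None             # no override: normal ruleset

datasets = {"main", "firewall_logs"}
print(route_with_override({"dataset": "firewall_logs"}, datasets))  # ('firewall_logs', None)
print(route_with_override({"dataset": "missing"}, datasets))        # ('main', 'does not exist')
```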

TCP/UDP Ports

When you create a new Source, Cribl assigns it a default port, which you can change. Any port can be assigned to any Source, as long as another Source isn’t already using it. The set of available cloud ports is limited; if you have used them all, contact your SRE.