Organize Your Data in Cribl Search
Datasets are logical containers that let you scope searches to the right data and keep things organized as ingest grows. Dataset rules define how Cribl Search routes your events into Datasets.
About Search Datasets and Federated Search Datasets
- Search Datasets hold data ingested into a lakehouse engine (Cribl-hosted storage, optimized for fast, schema-aware search and AI workflows).
- Federated Search Datasets don’t store data in Search; they point to external storage such as S3 or Blob via a Dataset Provider, and queries run “in place” there.
Both are called “Datasets,” but Search Datasets are Search-hosted and tied to a lakehouse engine, while Federated Search Datasets are customer-hosted and accessed via a Dataset Provider.
How Search Datasets Work
Each engine has a main Dataset (plus any custom Datasets you create). Engines ingest data from Sources, automatically detect the Datatypes, then route the events into Datasets.
You configure data retention per Dataset, which controls how long Cribl Search keeps the data before deleting it. This is key for both cost and compliance.
Dataset rules let you organize and control data routing, so that, for example, firewall logs route to one Dataset and auth logs to another. A rule's destination can also be Drop, in which case Cribl Search discards matching events instead of storing them.
The main Dataset
The main Dataset is a special Dataset with unique properties. It:
- Is automatically created when you set up your first lakehouse engine.
- Always exists as long as you have at least one lakehouse engine.
- Can’t be moved to a different engine.
If you delete the engine hosting main, Cribl will prompt you to recreate it on another lakehouse engine. If no
lakehouse engines remain, no main Dataset exists until you create a new one.
How Dataset Rules Work
After Datatyping parses your events, Cribl Search applies Dataset rules to decide where each event goes. Each rule has:
- Kusto expression: A filter that matches events (for example, `datatype == "syslog"` or `host == "web-01"`).
- Destination: The Dataset to send matching events to, or Drop to discard them.
Rules are evaluated in order from top to bottom. The first rule whose expression matches an event is used. Events that
match no rule, or that match a rule pointing to a deleted or invalid Dataset, are marked as Orphaned and land in the
main Dataset with a `_dataset_reason` field indicating why.
By default, a catch-all rule sends all events to the main Dataset. You can add rules above it to route specific
subsets.
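As an illustrative sketch (the Datatype values and Dataset names here are hypothetical, not from your environment), an ordered ruleset might look like this:

```
# Evaluated top to bottom; the first matching rule wins
datatype == "pan_firewall"  ->  Dataset: firewall_logs
datatype == "linux_auth"    ->  Dataset: auth_logs
datatype == "debug"         ->  Drop
(catch-all)                 ->  Dataset: main
```

An event with `datatype == "linux_auth"` matches the second rule and never reaches the catch-all.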
Dataset Rule Best Practices
To add Dataset rules in the Get Data In workflow, see Add a Dataset Rule.
- Preferred: Match on the Datatype assigned by Datatype matching. This lets you demultiplex data, routing events of different types from a single Source to separate Datasets.
- Combine Datatype and `__inputId` for more specific routing when needed.
- Use `__inputId` alone to route all events from a given Source.
- Use any parsed field for finer-grained routing.
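As a sketch (the Source ID `in_syslog` and the field values are hypothetical), matching expressions for the approaches above could look like:

```
datatype == "syslog" and __inputId == "in_syslog"   // Datatype plus Source
__inputId == "in_syslog"                            // all events from one Source
datatype == "syslog" and severity == "err"          // parsed field for finer routing
```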
Matching Expressions in Dataset Rulesets
Use matching expressions to define when a Dataset (or Datatype) rule applies. Matching expressions support both simple term matching and full `where`-style filters:
Simple matching (default): Enter a plain expression with Cribl operator semantics for:
- Simple term matching (like `foo*`)
- Wildcards
- Convenience operators like `=`
- Case-insensitive matching
When you need full where semantics (built-in functions, richer conditions), start your expression with a piped
operator:
- `| where ...`
- `| find ...`
- `| search ...`
Examples:
- `| where isnotnull(must_have_this)`
- `foo* | where x > y`
- `foo* | find where z == "value"`
- `foo* | search "error"`
You can combine simple matching and piped filters in a single matching expression:
`foo* | where x > y | find where z == "value" | search "error"`
Only filtering expressions are supported here; do not use other operators (such as `stats` or `project`) in the matching expression.
Timestamps and Timezone Handling
Cribl Search parses each event and interprets the timestamp assuming it includes timezone information. If no timezone is present in the event, Cribl Search assumes UTC.
To correct for timezone offsets, you have two options:
- From Cribl Stream: Set the `_time` field in Stream to the correct time for the timezone. Cribl Search uses this value and ignores the timestamp in the event.
- Without Stream: Create a Datatype in Cribl Search that includes the correct timezone label. This corrects the time offset during Datatype processing.
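For example (illustrative timestamps), the same wall-clock time resolves differently depending on whether the event carries timezone information:

```
2024-01-01 12:00:00         -> treated as 12:00 UTC (no timezone present, UTC assumed)
2024-01-01 12:00:00 -0500   -> treated as 17:00 UTC (offset applied)
```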
Inspect Your Datasets
After data is flowing, use the Data Explorer to inspect your Datasets. Go to Data > Datasets, select a Lakehouse Dataset, and open its details panel. You can:
- Overview - See activity summary, volume over time, and field statistics.
- Search History - See recent searches run against this Dataset.
- Saved Searches - Manage saved searches scoped to this Dataset.
- Usage - See who has used the Dataset and when.
Field statistics show the fields present in your data and their types, which helps you understand how datatyping and Dataset rules are shaping your events. For more details, see Data Explorer.
Stream Overrides
When sending data from Cribl Stream to Cribl Search, you can use these override fields:
| Field | Effect |
|---|---|
| `dataset` | Forces Cribl Search to ignore the Dataset ruleset and route directly to the specified Dataset. If the Dataset does not exist, events route to `main` with `_dataset_reason` set to “does not exist.” |
| `datatype` | Forces Cribl Search to ignore the Datatype ruleset and use the specified Datatype. If the Datatype does not exist, events run through the AI Datatypes with normal matching logic. |
| `_time` | Forces Cribl Search to ignore the time in the event and use the `_time` value specified in Stream (in UTC). |
| `isParsed` | Tells Cribl Search to skip the datatyping step and use fields sent from Stream. Use this when you already parse data in Stream and do not want to parse it again. |
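A minimal sketch of an event shaped in Cribl Stream with these overrides (the values are illustrative, and the comments are annotations only; how you set the fields, such as with an Eval Function, depends on your pipeline):

```
{
  "_raw": "Jan 01 12:00:00 web-01 sshd[1234]: Accepted publickey for admin",
  "_time": 1704110400,        // UTC epoch; Search uses this instead of the event's own timestamp
  "dataset": "auth_logs",     // bypass the Dataset ruleset and route here directly
  "datatype": "linux_auth",   // bypass the Datatype ruleset
  "isParsed": true            // skip datatyping; keep the fields parsed in Stream
}
```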
TCP/UDP Ports
When you create a new Source, it is assigned a default port, which you can change. Any port can be assigned to any Source as long as another Source is not already using it. The set of available cloud ports is limited; if you have used them all, contact your SRE.