Organize Data with Dataset Rules
Route ingested events into individual Search Datasets so you can scope your queries and control retention.
Highlights
- Search Datasets group events ingested into Cribl Search, while federated Datasets point to external storage.
- You control Search Dataset assignment with Dataset rules.
- Each Search Dataset has its own retention period of 1 day to 10 years.
Datasets: Search vs. Federated
A Dataset is a named container that groups related events.
Search Datasets organize events within Cribl-hosted lakehouse engines, using Dataset rules. A single Search Dataset can hold data from multiple Sources and of different Datatypes. Search Datasets are optimized for fast, schema-aware search and AI workflows.
Federated Datasets reference your own storage, such as S3 or Azure Blob. Queries use your federated engine compute capacity to run “in place”, with no ingestion or indexing needed. For more on these, see Connect Cribl Search to External Data.
| | Search Datasets | Federated Datasets |
|---|---|---|
| Best for | Fast search and AI workflows | Search-in-place without indexing |
| Hosted with | Cribl Search | Your external storage |
| Powered by | Lakehouse engines | Federated engine |
Plan Your Search Datasets
Each Search Dataset auto-scales with:
- Amount of ingested data you route to it using Dataset rules.
- Retention period you set for it (1 day to 10 years, default: 30 days).
Adjust retention periods as your needs change. When retention ends, data is dropped.
Plan your Search Datasets ahead to estimate future storage costs. Exact strategy depends on your use case, but here’s general guidance.
Group events by domain
Route firewall logs to one Dataset, auth logs to another, and so on. This lets you set
different retention periods per category, and scope searches precisely.
Set retention to match your investigation window
This way, you can keep the right events close, and store
long-term audit or archive data in Cribl Lake or other object storage for federated search.
Treat main as a fallback
The catch-all main Dataset prevents data from being silently lost.
Treat it as a fallback mechanism, not target storage.
Estimate storage
Storage in a lakehouse engine is the amount of data retained in all of its Search Datasets over
time, measured after compression. Estimated compression ratio is between 10:1 and 12:1.
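As a rough sketch, the same estimate can be computed in Python. The function and its example values are illustrative; only the 10:1 to 12:1 compression ratio comes from this page:

```python
def estimated_storage_gb(daily_volume_gb: float, retention_days: int,
                         compression_ratio: float = 10.0) -> float:
    """Estimate retained storage for one Search Dataset.

    Assumes data compresses at `compression_ratio`
    (this page cites 10:1 to 12:1).
    """
    return (daily_volume_gb / compression_ratio) * retention_days

# Example: 100 GB/day routed to a Dataset with 30-day retention
print(estimated_storage_gb(100, 30))  # → 300.0 GB retained
```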
The formula below
calculates estimated storage for a single Search Dataset, assuming a compression ratio of 10:1. To see how storage
translates to cost, see Cribl Search Pricing.
(Daily volume of data routed to the Dataset / 10) × Days of retention = Estimated storage

Set Up Your Search Datasets
To organize ingested data into Search Datasets:
- Add Datasets: Create your Search Datasets and set their retention periods.
- Add Dataset rules: Define how to route events into your Search Datasets.
- Verify Dataset assignment: Check if your events are routed and retained as expected.
1. Add Search Datasets
Create Datasets that you’ll target with Dataset rules.
- On the Cribl.Cloud top bar, select Products > Search > Data > Datasets (at the top, next to Engines).
- In New Dataset, configure:

| Setting | Description | Example |
|---|---|---|
| ID | Dataset ID, unique across your Cribl.Cloud Workspace. Use letters (A-Z, a-z), numbers (0-9), hyphens, and underscores. Max 512 characters. | my_dataset |
| Description | Describe your Dataset so others know what it's for. | Contains security logs |
| Tags | When you have lots of Datasets, tags help you organize them. | security, logs |
| Dataset Provider | For Search Datasets, the provider is lakehouse. You can't change it. | lakehouse |
| Retention period | Choose how long to keep the data, from 1 day to 10 years. (See Plan Your Search Datasets.) | 30 days (default) |
| Engine | Select a lakehouse engine that will store the Dataset. | palo_alto_logs |

- Confirm with Save.
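If you generate Dataset IDs from a naming convention, you might pre-check them against the documented constraints (letters, numbers, hyphens, underscores; max 512 characters). This is a local sketch, not a Cribl API:

```python
import re

# Hypothetical local check of the documented ID constraints:
# letters (A-Z, a-z), numbers (0-9), hyphens, underscores; max 512 chars.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,512}$")

def is_valid_dataset_id(dataset_id: str) -> bool:
    return bool(ID_PATTERN.match(dataset_id))

print(is_valid_dataset_id("my_dataset"))   # True
print(is_valid_dataset_id("my dataset!"))  # False (space and '!')
```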
Now you can route events to your Datasets by setting up Dataset rules.
2. Add Dataset Rules
Each Dataset rule captures events that match a KQL expression, and then sends those events to the specified Search Dataset.
- On the Cribl.Cloud top bar, select Products > Search > Data > Get Data In > Datatypes: Organize Your Data.
- Select Add Dataset Rule. Name and describe your rule.
- In Kusto expression to match, enter a KQL expression that matches events you want to route.
See Dataset Rule Expressions for syntax and examples.
- In Send data to, choose your target Search Dataset. This is where events matching the KQL expression will land.
You can also select Drop to discard the matching events instead.
- Make sure that Enabled in the top right corner is checked, and confirm with Add.
If you add more rules, drag them to change the order. Rules run top-down, and the first match wins. Put more specific rules above broader ones.
Events that don’t match any rule, or match a rule pointing to a deleted Dataset, fall back to the
main Dataset.
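The routing semantics described above (rules run top-down, first match wins, unmatched or orphaned events fall back to main) can be sketched as follows. The predicates and event shape are illustrative Python stand-ins for KQL expressions, not Cribl's implementation; the `_dataset_reason` values mirror those documented for the main Dataset later on this page:

```python
KNOWN_DATASETS = {"firewall_logs", "auth_logs", "main"}

def route(event: dict, rules: list) -> tuple:
    """Return (dataset, _dataset_reason) for one event.

    `rules` is an ordered list of (predicate, target_dataset) pairs;
    predicates stand in for KQL match expressions.
    """
    for predicate, target in rules:
        if predicate(event):
            if target in KNOWN_DATASETS:
                return target, None
            # Rule points at a deleted/invalid Dataset: orphaned data.
            return "main", f"invalid_dataset_matched ({target})"
    return "main", "no_rules_matched"

rules = [
    (lambda e: e.get("datatype") == "pan_firewall", "firewall_logs"),
    (lambda e: e.get("datatype") == "okta_auth", "auth_logs"),
]

print(route({"datatype": "pan_firewall"}, rules))  # ('firewall_logs', None)
print(route({"datatype": "unknown"}, rules))       # ('main', 'no_rules_matched')
```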
Dataset Rule Expressions
Point your Dataset KQL expressions at these fields:
| Field | Description |
|---|---|
| `datatype` | Datatype assigned through Datatyping. |
| `__inputId` | Source identifier in `type:id` format. Supported types: `cribl_http`, `datadog_agent`, `elastic`, `http_raw`, `open_telemetry`, `prometheus_rw`, `splunk`, `splunk_hec`, `syslog`, `tcp`, `tcpjson`, `wef`, `wiz_webhook`. Example: `syslog:my_source_id`. |
You can also filter by any other field in your parsed data.
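The `type:id` shape of `__inputId` splits cleanly on the first colon, which can help when you build rules from Source IDs. A small illustration:

```python
input_id = "syslog:my_source_id"  # example __inputId value

# partition() splits on the first colon only, so IDs containing
# colons stay intact in the id part.
source_type, _, source_id = input_id.partition(":")
print(source_type)  # syslog
print(source_id)    # my_source_id
```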
Same as with Datatype rule expressions, you can:
- Create KQL expressions that evaluate to true/false for matching events.
- Set case-insensitive conditions using `=` and wildcards (`*`).
- Pipe into `| where ...`, `| find ...`, or `| search ...` for richer logic.

But:
- You can't use expressions that aggregate or reshape data (such as `stats` or `project`).
- You can't use `let` or `set` statements.
See the examples below. For full reference on the Cribl Search implementation of KQL, see Language Reference.
Matches all events of Datatype apache_httpd_accesslog_common:

```kusto
datatype = "apache_httpd_accesslog_common"
```

Matches RFC 3164 syslog events from the my_source_id Source only. You can use rules like this to separate syslog from one Source into its own Dataset:

```kusto
datatype = "syslog_rfc3164" and __inputId = "syslog:my_source_id"
```

Matches AWS VPC Flow Logs v2 events where the parsed host field equals vpc-flow-logs:

```kusto
datatype = "aws_vpc_v2" | where host = "vpc-flow-logs"
```

Matches all events from an OpenTelemetry Source with ID otel. You can use rules like this when you want one Dataset per Source:

```kusto
__inputId = "open_telemetry:otel"
```

Matches healthcheck and heartbeat events, using custom Datatypes. You can route them to Drop to save storage:

```kusto
datatype in ("healthcheck", "heartbeat")
```

3. Verify Dataset Assignment
Check on your Search Datasets to make sure your events are routed and retained as expected.
- Go to Search Home: On the Cribl.Cloud top bar, select Products > Search.
- Under Available Datasets, select a Search Dataset you want to inspect.
Search Datasets are marked with the lakehouse icon.
- In the resulting details panel, look at the Fields section.
If the Fields section is empty, select Retry to load the metadata.
If there’s no Fields section at all, you’re looking at a federated Dataset. Select a Search Dataset instead.
- Verify that the Dataset contains the fields you’d expect from your Datatyping configuration and Dataset rules.
For more information, see Explore Fields in Search Datasets.
For more ways to explore your Datasets, see Inspect Your Datasets.
The main Dataset
When you add your first lakehouse engine, Cribl Search creates a catch-all main Dataset for unrouted events:
| Events Sent to main | Value Set for _dataset_reason Field |
|---|---|
| Events that match no Dataset rule. | no_rules_matched |
| Events that match a rule pointing to a deleted or invalid Dataset (orphaned data). | invalid_dataset_matched (<datasetId>) |
| Events that arrive with a dataset field already set, pointing to an invalid Dataset. | invalid_dataset_provided |
If your events land in main unexpectedly, fix your Dataset rules to send them to the correct Search Dataset instead.
You can change the retention period of your main Dataset, but you can’t delete it. If you delete a lakehouse engine
hosting main, you’ll need to pick another lakehouse engine to take over.
Next Steps
Now that your data is organized into Datasets, you can start putting it to work. For example: