Create Search Datasets in Cribl Search
Add Search Datasets within your lakehouse engine, so you can group ingested events and control retention.
Highlights
- Search Datasets group events ingested into Cribl Search, while federated Datasets point to external storage.
- You control Search Dataset assignment with Dataset rules.
- Each Search Dataset has its own retention period of 1 day to 10 years.
Datasets: Search vs. Federated
A Dataset in Cribl Search is a named container that groups related events.
Search Datasets organize events within Cribl-hosted lakehouse engines, following Dataset rules. A single Search Dataset can hold data from multiple Sources and of different Datatypes. They’re optimized for fast, schema-aware search and AI workflows.
Federated Datasets reference your own storage, such as S3 or Azure Blob. Queries use your federated engine capacity to run searches “in place”, with no ingestion or indexing needed. For more on these, see Connect Cribl Search to External Data.
| | Search Datasets | Federated Datasets |
|---|---|---|
| Best for | Fast search and AI workflows | Search-in-place without indexing |
| Hosted with | Cribl Search | Your external storage |
| Powered by | Lakehouse engines | Federated engine |
Plan Your Search Datasets
Each Search Dataset auto-scales with:
- Amount of ingested data you route to the Dataset using Dataset rules.
- Retention period you set for the Dataset (1 day to 10 years, default: 1 year).
Adjust retention periods as your needs change. When retention ends, data is dropped.
Plan your Search Datasets ahead of time so you can estimate future storage costs. The exact strategy depends on your use case, but here’s some general guidance.
Group events by domain
Route firewall logs to one Dataset, auth logs to another, and so on. This lets you set
different retention periods per category, and scope searches precisely.
Set retention to match your investigation window
This way, you can keep the right events close, and store
long-term audit or archive data in Cribl Lake or other object storage for federated search.
Treat main as a fallback
The catch-all main Dataset prevents data from being silently lost.
Treat it as a fallback mechanism, not target storage.
Estimate storage
Storage in a lakehouse engine is the amount of data retained in all of its Search Datasets over
time, measured after compression. Estimated compression ratio is between 10:1 and 12:1.
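As a rough illustration of how much the compression ratio matters, the Python sketch below computes retained storage for one Search Dataset at both ends of that range. The function name `estimate_storage_gb` and the sample volume of 100 GB per day are hypothetical, chosen only for this example. The sketch models the arithmetic, not how Cribl Search actually meters storage.

```python
def estimate_storage_gb(daily_gb: float, retention_days: int,
                        compression_ratio: float = 10.0) -> float:
    """Estimate compressed storage (GB) retained by one Search Dataset.

    Assumes a constant daily volume and a Dataset that has been
    receiving data for at least the full retention period.
    """
    if daily_gb < 0 or retention_days < 0 or compression_ratio <= 0:
        raise ValueError("volume and retention must be non-negative, ratio positive")
    return daily_gb / compression_ratio * retention_days


# Hypothetical workload: 100 GB/day kept for 1 year (365 days).
conservative = estimate_storage_gb(100, 365, compression_ratio=10)
optimistic = estimate_storage_gb(100, 365, compression_ratio=12)
print(f"10:1 -> {conservative:.0f} GB, 12:1 -> {optimistic:.0f} GB")
```

Planning with the 10:1 figure gives the more conservative (larger) estimate.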
The formula below calculates estimated storage for a single Search Dataset, assuming a compression ratio of 10:1. To see how storage translates to cost, see Cribl Search Pricing.
(Daily volume of data routed to the Dataset / 10) × Days of retention = Estimated storage
Set Up Your Search Datasets
To organize ingested data into Search Datasets:
- Add Datasets: Create your Search Datasets and set their retention periods.
- Add Dataset rules: Define how to route events into your Search Datasets.
- Verify Dataset assignment: Check if your events are routed and retained as expected.
Add Search Datasets
Create Datasets that you’ll target with Dataset rules.
- On the Cribl.Cloud top bar, select Products > Search > Data > Datasets (at the top, next to Engines).
- In New Dataset, configure:
| Setting | Description | Example |
|---|---|---|
| ID | Dataset ID, unique across your Cribl.Cloud Workspace. Use letters (A-Z, a-z), numbers (0-9), hyphens, and underscores. Max 512 characters. | my_dataset |
| Description | Describe your Dataset so others know what it’s for. | Contains security logs |
| Tags | When you have lots of Datasets, tags will help you organize them. | security,logs |
| Dataset Provider | For Search Datasets, the provider is lakehouse. You can’t change it. | lakehouse |
| Retention period | Choose how long to keep the data, from 1 day to 10 years. (See Plan Your Search Datasets.) | 1 year (default) |
| Engine | Select a lakehouse engine that will store the Dataset. | palo_alto_logs |
- Confirm with Save.
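If you generate Dataset IDs programmatically, it can help to check them against the ID constraints listed above before entering them in the UI. The sketch below is one way to express those constraints as a regular expression. The helper name `is_valid_dataset_id` is hypothetical, and rejecting empty IDs is an assumption. Cribl Search may enforce additional rules (such as uniqueness within the Workspace) that this check does not cover.

```python
import re

# Letters, digits, hyphens, and underscores; 1 to 512 characters.
# Assumption: an empty ID is not allowed.
DATASET_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,512}")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the ID uses only allowed characters and fits the length limit."""
    return DATASET_ID_PATTERN.fullmatch(dataset_id) is not None


for candidate in ["my_dataset", "palo-alto-2024", "my dataset", "a" * 513]:
    label = candidate if len(candidate) <= 20 else candidate[:17] + "..."
    print(f"{label!r}: {is_valid_dataset_id(candidate)}")
```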
Now you’ll be able to target the new Dataset with Dataset rules.
The main Dataset
When you add your first lakehouse engine, Cribl Search creates a catch-all main Dataset for unrouted events:
| Events Sent to main | Value Set for _dataset_reason Field |
|---|---|
| Events that match no Dataset rule. | no_rules_matched |
| Events that match a rule pointing to a deleted or invalid Dataset (orphaned data). | invalid_dataset_matched (<datasetId>) |
| Events that arrive with a dataset field already set, pointing to an invalid Dataset. | invalid_dataset_provided |
If your events land in main unexpectedly, fix your Dataset rules to send them to the correct
Search Dataset instead.
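To reason about why an event ended up in main, it can help to model the fallback behavior described in the table above. The following Python sketch is a simplified mental model, not Cribl Search's implementation. The function `assign_dataset`, the `sourcetype` field, and the Dataset names are hypothetical. The sketch also assumes that a pre-set dataset field is checked before any rules and that the first matching rule wins. Neither assumption is confirmed by this page.

```python
from typing import Callable, Optional

Event = dict
Rule = tuple[Callable[[Event], bool], str]  # (predicate, target Dataset ID)


def assign_dataset(event: Event, rules: list[Rule],
                   valid_datasets: set[str]) -> tuple[str, Optional[str]]:
    """Return (dataset, _dataset_reason); the reason is None when routing succeeds."""
    preset = event.get("dataset")
    if preset is not None:
        # Assumption: a pre-set dataset field takes precedence over rules.
        if preset in valid_datasets:
            return preset, None
        return "main", "invalid_dataset_provided"
    for matches, target in rules:
        if matches(event):
            # Assumption: the first matching rule decides the outcome.
            if target in valid_datasets:
                return target, None
            return "main", f"invalid_dataset_matched ({target})"
    return "main", "no_rules_matched"


rules = [
    (lambda e: e.get("sourcetype") == "firewall", "firewall_logs"),
    (lambda e: e.get("sourcetype") == "auth", "auth_logs"),  # target was deleted
]
valid = {"main", "firewall_logs"}

for ev in [{"sourcetype": "firewall"}, {"sourcetype": "auth"},
           {"sourcetype": "dns"}, {"dataset": "gone"}]:
    print(ev, "->", assign_dataset(ev, rules, valid))
```

In this model, every non-empty reason points to a concrete fix: add a rule, repoint a rule at an existing Dataset, or correct the dataset field at the Source.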
You can change the retention period of your main Dataset, but you can’t delete it. If you delete a lakehouse engine
hosting main, you’ll need to pick another lakehouse engine to take over.
Clear a Search Dataset
You can wipe data stored in a Cribl-hosted Search Dataset while preserving your Dataset rules and retention settings. This requires Maintainer access on the Dataset, and can’t be undone.
- On the Cribl.Cloud top bar, select Products > Search > Data > Datasets.
- Select the Search Dataset you want to clear.
- Select Clear Dataset.
- Type CLEAR, and confirm with Clear Dataset.
Delete a Search Dataset
You can delete a Cribl-hosted Search Dataset, removing both its data and configuration. This requires Maintainer access on the Dataset, and can’t be undone.
- On the Cribl.Cloud top bar, select Products > Search > Data > Datasets.
- Select the Search Dataset you want to delete.
- Select Delete Dataset.
- Confirm with Delete.
Next Steps
Now that your Datasets are in place, connect your Sources to start sending data into Cribl Search.
If your Sources are already connected, target your Datasets with Dataset rules.