Create Search Datasets in Cribl Search

Add Search Datasets within your lakehouse engine, so you can group ingested events and control retention.


Highlights
  • Search Datasets group events ingested into Cribl Search, while federated Datasets point to external storage.
  • You control Search Dataset assignment with Dataset rules.
  • Each Search Dataset has its own retention period of 1 day to 10 years.
  • Each Search Dataset applies a configurable expected time range that prevents ingest failures from widely distributed event timestamps.

Datasets: Search vs. Federated

A Dataset in Cribl Search is a named container that groups related events.

Search Datasets organize events within Cribl-hosted lakehouse engines, following Dataset rules. A single Search Dataset can hold data from multiple Sources and of different Datatypes. They’re optimized for fast, schema-aware search and AI workflows.

Federated Datasets reference your own storage, such as S3 or Azure Blob. Queries use your federated engine capacity to run searches “in place”, with no ingestion or indexing needed. For more on these, see Connect Cribl Search to External Data.

|             | Search Datasets              | Federated Datasets               |
|-------------|------------------------------|----------------------------------|
| Best for    | Fast search and AI workflows | Search-in-place without indexing |
| Hosted with | Cribl Search                 | Your external storage            |
| Powered by  | Lakehouse engines            | Federated engine                 |

Datasets: Search vs. federated

Plan Your Search Datasets

Each Search Dataset auto-scales with:

  • Amount of ingested data you route to the Dataset using Dataset rules.
  • Retention period you set for the Dataset (1 day to 10 years, default: 1 year).

Adjust retention periods as your needs change. When retention ends, data is dropped.

Plan your Search Datasets ahead of time to estimate future storage costs. The exact strategy depends on your use case, but here’s some general guidance.

Group events by domain
Route firewall logs to one Dataset, auth logs to another, and so on. This lets you set different retention periods per category, and scope searches precisely.

Set retention to match your investigation window
This way, you can keep the right events close, and store long-term audit or archive data in Cribl Lake or other object storage for federated search.

Treat main as a fallback
The catch-all main Dataset prevents data from being silently lost. Treat it as a fallback mechanism, not target storage.

Estimate storage
Storage in a lakehouse engine is the amount of data retained in all of its Search Datasets over time, measured after compression. Estimated compression ratio is between 10:1 and 12:1.

The formula below calculates estimated storage for a single Search Dataset, assuming a compression ratio of 10:1. To see how storage translates to cost, see Cribl Search Pricing.

(Daily volume of data routed to the Dataset / 10) × Days of retention = Estimated storage
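As a worked example of the formula above, here is a minimal Python sketch. The function name, the gigabyte unit, and the sample figures (500 GB/day, 365 days) are illustrative assumptions, not values from Cribl. The 10:1 default mirrors the ratio the formula assumes.

```python
def estimate_storage_gb(daily_volume_gb: float,
                        retention_days: int,
                        compression_ratio: float = 10.0) -> float:
    """Estimate retained storage for one Search Dataset, after compression.

    Implements: (daily volume / compression ratio) x days of retention.
    The unit (GB) is an assumption; any consistent unit works.
    """
    if daily_volume_gb < 0 or retention_days < 0:
        raise ValueError("volume and retention must be non-negative")
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return daily_volume_gb / compression_ratio * retention_days


# Hypothetical Dataset: 500 GB/day routed in, kept for 1 year.
print(estimate_storage_gb(500, 365))  # 18250.0
```

Passing a different `compression_ratio` (for example, 12.0) shows how the estimate changes across the 10:1 to 12:1 range.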

Set Up Your Search Datasets

To organize ingested data into Search Datasets:

  1. Add Datasets: Create your Search Datasets and set their retention periods, along with performance-enhancing accelerated fields.
  2. Add Dataset rules: Define how to route events into your Search Datasets.
  3. Verify Dataset assignment: Check if your events are routed and retained as expected.

Add Search Datasets

Create Search Datasets that you’ll later target with Dataset rules.

  1. On the Cribl.Cloud top bar, select Products > Search > Data > Datasets > Add Dataset.

  2. Under New Dataset, configure the following:

    | Setting     | Description | Example |
    |-------------|-------------|---------|
    | ID          | Dataset ID, unique across your Cribl.Cloud Workspace. Use letters (A-Z, a-z), numbers (0-9), hyphens, and underscores. Max 512 characters. | my_dataset |
    | Description | Describe your Dataset so others know what it’s for. | Contains security logs |
    | Tags        | When you have lots of Datasets, tags will help you organize them. | security, logs |
  3. Select the Dataset type: Search Dataset.

    Add a new Search Dataset

    If you don’t see the Search Dataset option, wait until your first lakehouse engine is in Ready status.

  4. Under Search Dataset, configure the following:

    | Setting | Description | Example |
    |---------|-------------|---------|
    | Retention period | Choose how long to keep the data, from 1 day to 10 years. (See Plan Your Search Datasets.) | 1 year (default) |
    | Engine | Select a lakehouse engine that will store the Dataset. | palo_alto_logs |
    | Earliest expected timestamp | Relative time expression for the earliest accepted event timestamp. Events with timestamps before this boundary have their _time reset to now() and the original value preserved in _original_time. See Configure Expected Time Range. | -30d (default) |
    | Latest expected timestamp | Relative time expression for the latest accepted event timestamp. Events with timestamps after this boundary have their _time reset to now() and the original value preserved in _original_time. See Configure Expected Time Range. | 7d (default) |
  5. To improve search performance and speed up search results, you can add accelerated fields to your Search Dataset.

    Select Performance > Add Field, and enter the fields you search often.

    We recommend no more than 15 fields. Accelerated fields come with additional ingest and index costs, so use them intentionally.

    Add accelerated fields on a Search Dataset
  6. To generate an AI-powered reference file for your Search Dataset by analyzing its contents, select Dataset Intelligence > Enable Dataset Intelligence. For details, see Dataset Intelligence.

  7. Confirm with Save.

Now you’ll be able to target the new Dataset with Dataset rules.

Configure Expected Time Range

Each Search Dataset applies a configurable expected time window to event timestamps at ingest. When the _time field of an incoming event falls outside the configured window, the system resets _time to now() and preserves the original value in _original_time. This prevents ingest failures from widely distributed event timestamps.

The expected time range settings appear on the Datasets tab when you add or edit a Search Dataset. They apply to every Search Dataset, including the main Dataset, and changes take effect immediately.

Relative Time Syntax

Both the Earliest expected timestamp and Latest expected timestamp fields accept a signed relative time expression. Supported time units are:

  • d: Days. For example, -30d is 30 days before now, and 7d is 7 days after now.
  • weeks: Weeks. For example, -420weeks is 420 weeks before now.
  • mon: Months. For example, 1mon is one month after now.
  • y: Years. For example, -1y is one year before now.

Use a negative value to set a boundary in the past, and a positive value to set a boundary in the future. Both fields accept values of either sign.
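To make the syntax concrete, the following Python sketch parses the four units listed above into an offset from now. It is not Cribl’s parser. The function name, the fixed lengths used for mon (30 days) and y (365 days), and the acceptance of an explicit leading "+" are all assumptions made for illustration.

```python
import re
from datetime import timedelta

# Assumed fixed-length approximations; the real service may resolve
# months and years by calendar arithmetic instead.
_UNIT_DAYS = {"d": 1, "weeks": 7, "mon": 30, "y": 365}

# Optional sign, an integer amount, then one of the documented units.
# Accepting an explicit "+" is an assumption.
_PATTERN = re.compile(r"^([+-]?)(\d+)(d|weeks|mon|y)$")


def parse_relative_time(expr: str) -> timedelta:
    """Convert an expression such as '-30d' or '7d' into a timedelta."""
    match = _PATTERN.match(expr.strip())
    if match is None:
        raise ValueError(f"Unsupported relative time expression: {expr!r}")
    sign, amount, unit = match.groups()
    days = int(amount) * _UNIT_DAYS[unit]
    # A leading "-" places the boundary in the past.
    return timedelta(days=-days if sign == "-" else days)


print(parse_relative_time("-30d"))  # -30 days, 0:00:00
print(parse_relative_time("7d"))    # 7 days, 0:00:00
```

Adding the returned offset to the current time gives the resolved boundary.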

Out-of-Range Timestamp Handling

When an event arrives with a _time value outside the configured window:

  • _time is reset to now() at the time of ingest.
  • The original timestamp is preserved in a new field called _original_time.

The event is still written to the Dataset; only the timestamp is corrected.

Validation

Cribl Search validates the expected time range configuration when you save a Dataset.

| State   | Message | What to Do |
|---------|---------|------------|
| Error   | Earliest must resolve to a time before latest. | The earliest boundary resolves to a time equal to or later than the latest boundary. Correct the values before saving. |
| Warning | The configured time range spans {N} days (the recommended maximum is 40 days) and may cause errors on ingest. | The configured window exceeds 40 days and may cause ingest failures. Narrow the range if possible. |
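The two checks above can be approximated with this Python sketch. The function name, the use of already-resolved offsets as input, and the "OK" return state are hypothetical. The message strings are copied from the validation messages listed above.

```python
from datetime import timedelta

RECOMMENDED_MAX = timedelta(days=40)  # Recommended maximum window.


def validate_expected_time_range(earliest: timedelta,
                                 latest: timedelta) -> tuple[str, str]:
    """Return (state, message) for a pair of offsets from now."""
    # Error: earliest resolves to a time equal to or later than latest.
    if earliest >= latest:
        return ("Error", "Earliest must resolve to a time before latest.")
    span = latest - earliest
    # Warning: the window is wider than the recommended maximum.
    if span > RECOMMENDED_MAX:
        return ("Warning",
                f"The configured time range spans {span.days} days "
                "(the recommended maximum is 40 days) and may cause "
                "errors on ingest.")
    return ("OK", "")  # Hypothetical state for a valid configuration.


# The defaults (-30d to 7d) span 37 days, so they pass without a warning.
print(validate_expected_time_range(timedelta(days=-30), timedelta(days=7)))
```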

Impact on Existing Datasets

The default expected time range (-30d to 7d) applies automatically to all existing Search Datasets, including the main Dataset. No additional configuration is required for the defaults to take effect. If existing Datasets previously accepted events with timestamps significantly outside this range, those events will have their _time corrected on ingest going forward.

Review your Dataset configurations if you rely on events with very old or future-dated timestamps, and adjust the expected time range accordingly.

The main Dataset

When you add your first lakehouse engine, Cribl Search creates a catch-all main Dataset for unrouted events:

| Events Sent to main | Value Set for _dataset_reason Field |
|---------------------|-------------------------------------|
| Events that match no Dataset rule. | no_rules_matched |
| Events that match a rule pointing to a deleted or invalid Dataset (orphaned data). | invalid_dataset_matched (<datasetId>) |
| Events that arrive with a dataset field already set, pointing to an invalid Dataset. | invalid_dataset_provided |

If your events land in main unexpectedly, fix your Dataset rules to send them to the correct Search Dataset instead.
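The fallback cases above can be modeled with a short Python sketch, which may help when reasoning about why an event landed in main. It does not describe how Cribl Search evaluates rules internally. The rule shape (a predicate plus a target Dataset ID), first-match evaluation, and checking a pre-set dataset field before the rules are all assumptions.

```python
from typing import Callable, Optional

# Hypothetical rule shape: (predicate over the event, target Dataset ID).
Rule = tuple[Callable[[dict], bool], str]


def assign_dataset(event: dict,
                   rules: list[Rule],
                   valid_datasets: set[str]) -> tuple[str, Optional[str]]:
    """Return (dataset_id, _dataset_reason); the reason is None unless
    the event falls back to main."""
    provided = event.get("dataset")
    if provided is not None:
        # Assumed precedence: a pre-set dataset field is checked first.
        if provided in valid_datasets:
            return provided, None
        return "main", "invalid_dataset_provided"
    for predicate, target in rules:  # Assumed first-match evaluation.
        if predicate(event):
            if target in valid_datasets:
                return target, None
            return "main", f"invalid_dataset_matched ({target})"
    return "main", "no_rules_matched"


# Hypothetical rules: firewall logs go to fw_logs; auth logs point to a
# Dataset that no longer exists.
rules = [
    (lambda e: e.get("source") == "firewall", "fw_logs"),
    (lambda e: e.get("source") == "auth", "deleted_ds"),
]
valid = {"main", "fw_logs"}
print(assign_dataset({"source": "auth"}, rules, valid))
```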

You can change the retention period of your main Dataset, but you can’t delete it. If you delete a lakehouse engine hosting main, you’ll need to pick another lakehouse engine to take over.

Send Results to a Search Dataset

Use the export operator with the search keyword to send search results into an existing Search Dataset.

This is useful when you want to materialize the output of a federated search or enrichment pipeline only once, and then run fast, repeated searches against the cached results instead of re-scanning the original Datasets.

dataset="cribl_search_sample"
| export to search mySearchDataset

The target must be a Search Dataset, not a federated Dataset, and its lakehouse engine must be in Ready status. Only Admin and Editor Search Members can run the export operator.

For full syntax, arguments, and limitations, see the export operator reference.

Clear a Search Dataset

You can wipe data stored in a Cribl-hosted Search Dataset while preserving your Dataset rules and retention settings. This requires Maintainer access on the Dataset, and can’t be undone.

  1. On the Cribl.Cloud top bar, select Products > Search > Data > Datasets.
  2. Select the Search Dataset you want to clear.
  3. Select Clear Dataset.
  4. Type CLEAR, and confirm with Clear Dataset.

Delete a Search Dataset

You can delete a Cribl-hosted Search Dataset, removing both its data and configuration. This requires Maintainer access on the Dataset, and can’t be undone.

  1. On the Cribl.Cloud top bar, select Products > Search > Data > Datasets.
  2. Select the Search Dataset you want to delete.
  3. Select Delete Dataset.
  4. Confirm with Delete.

Next Steps

Now that your Datasets are in place, connect your Sources to start sending data into Cribl Search.

If your Sources are already connected, target your Datasets with Dataset rules.