Optimize Paths
Partition paths for efficient searchability.
Best Practices for Optimizing Paths in Object Stores
Use these guidelines when you define bucket paths and partitioning for Cribl Search Datasets on Amazon S3, Azure Blob Storage, and Google Cloud Storage to ensure efficient search performance and cost management.
Prioritize Time
Make time the left-most partition. Keep time tokens at the start of the bucket path. For example:
my-bucket/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}/...

Putting time first maximizes the static search prefix, which lets Cribl Search quickly skip irrelevant data and return results faster.
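To see the effect, here is a minimal Python sketch (illustrative only, not Cribl code; the bucket name and search window are hypothetical) of how a time-first layout turns a search window into a small, fixed set of prefixes to list, regardless of how much other data the bucket holds:

from datetime import datetime, timedelta

def hourly_prefixes(bucket, start, end):
    # Enumerate the static prefixes a time-first layout yields for a window.
    prefixes = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        prefixes.append(f"{bucket}/{t:%Y}/{t:%m}/{t:%d}/{t:%H}/")
        t += timedelta(hours=1)
    return prefixes

# A three-hour window touches exactly three prefixes, no matter how much
# data sits outside that window.
print(hourly_prefixes("my-bucket",
                      datetime(2024, 11, 1, 10),
                      datetime(2024, 11, 1, 12)))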
Set Time Granularity
Aim for a balance where each time partition directory contains a manageable number of objects (typically in the hundreds or low thousands). For high-volume Datasets, partition down to the hourly or per-minute level so that Cribl Search can skip entire folders for irrelevant hours.
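As a rough sizing check, a back-of-the-envelope sketch in Python (hypothetical volume figures, not Cribl guidance) shows how granularity affects objects per directory:

def objects_per_partition(objects_per_day, partitions_per_day):
    # Target: hundreds to low thousands of objects per partition directory.
    return objects_per_day / partitions_per_day

daily_objects = 40_000  # hypothetical ingest volume
print(objects_per_partition(daily_objects, 24))     # hourly: ~1,667 objects each
print(objects_per_partition(daily_objects, 1440))   # per-minute: ~28 objects each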
Place Non-Time Attributes After Time
If you need to partition by attributes like region, environment, account, or sourcetype, add those segments under the time hierarchy. For example:
my-bucket/${_time:%Y}/${_time:%m}/${_time:%d}/${region}/${sourcetype}/...

Use search filters (such as where region == "us-east-1") instead of placing these attributes before time. This structure also simplifies applying different S3 retention policies to different data types.
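A quick Python sketch (hypothetical layout and values) of why this ordering is forgiving: supplying the attribute extends the static prefix, while omitting it still leaves time-based pruning intact:

def prefix(hour, region=None):
    # Build the static prefix for one hour; a supplied region extends it.
    p = f"my-bucket/2024/11/01/{hour:02d}/"
    if region:
        p += f"{region}/"
    return p

print(prefix(12, "us-east-1"))  # my-bucket/2024/11/01/12/us-east-1/
print(prefix(12))               # my-bucket/2024/11/01/12/ (time pruning still applies)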
Limit Wildcards
Use wildcards only when necessary (for example, to skip key=value segments). Place them only on the right side of the path and make sure they are applied to segments that do not impact search filtering. For example:
my-bucket/${_time:%Y}/${_time:%m}/${*}/${someVarOfInterest}

Static Values: Use Separate Datasets
If a path segment has a small, known list of values (for example, limited dataImportance or environment values), define multiple Datasets or multiple bucket paths with static values rather than using a wildcard across that segment.
Use Sourcetype/Datatype as a Key Partition
When you control ingestion (for example, using Cribl Stream/Edge), consistently encode sourcetype or similar datatype signals in the path so related events land together and parsing is predictable.
Prefer Cribl Lake
Land data in Cribl Lake and search Lake Datasets instead of manually managing complex partitioning schemes in raw object stores. Cribl Lake automatically applies an optimized Dataset partitioning scheme that aids in newest-first discovery while maintaining intuitive time organization.
Common Pitfalls When Setting Up Paths
Avoid these common path design pitfalls when configuring your object store Datasets.
Avoid Early Wildcards
Any wildcard near the left of the path forces Cribl Search to list and examine every key under that prefix before it can apply time filtering, dramatically increasing search time and cost.
Avoid patterns like:
my-bucket/${*}/${_time:%Y}/${_time:%m}/
or
my-bucket/data/${category:-*}/${_time:%Y}/...
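A small Python illustration (hypothetical keys) of the difference: a time-first layout supports server-side prefix filtering, while a leading wildcard forces a full listing plus client-side matching:

from fnmatch import fnmatch

time_first = [
    "2024/11/01/teamA/evt-1.json",
    "2023/05/09/teamB/evt-2.json",
    "2024/11/01/teamC/evt-3.json",
]
wildcard_first = [
    "teamA/2024/11/01/evt-1.json",
    "teamB/2023/05/09/evt-2.json",
    "teamC/2024/11/01/evt-3.json",
]

# Time-first: one server-side prefix listing returns only the relevant keys.
print([k for k in time_first if k.startswith("2024/11/01/")])

# Wildcard-first: there is no usable static prefix, so every key in the
# bucket must be listed and then filtered client-side.
print([k for k in wildcard_first if fnmatch(k, "*/2024/11/01/*")])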
Avoid Large Directories
Avoid single directories containing tens or hundreds of thousands of objects. For high-volume data, partition down to at least the hourly level.
Avoid Over-Partitioning
Per-second or overly fine per-minute directories can cause excessive list operations and metadata overhead. If most per-minute directories are empty or contain only a few objects, coarsen your time partitioning (for example, from per-minute to hourly).
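A back-of-the-envelope check in Python (hypothetical volume) makes the overhead concrete:

minutes_per_day, hours_per_day = 1440, 24
objects_per_day = 2_000  # hypothetical low-volume feed

print(objects_per_day / minutes_per_day)  # ~1.4 objects per minute directory
print(objects_per_day / hours_per_day)    # ~83 objects per hour directory
# A one-day search at minute granularity issues ~1,440 list calls, most
# returning a single object; hourly granularity needs only 24.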
Avoid Placing Routing Fields Before Time
Placing routing fields (like region or account) before time prevents Search from leveraging time as the longest static prefix. This forces the object store to scan multiple, separate prefixes (one for every region/account queried) instead of using a single, efficient time-based lookup.
Avoid patterns like:
my-bucket/${region}/${account}/${_time:%Y}/${_time:%m}/...
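To quantify the cost, a short Python sketch (hypothetical regions and accounts): a one-month search across three regions and two accounts needs six separate listings with routing fields first, but only one with time first:

from itertools import product

regions = ["us-east-1", "us-west-2", "eu-west-1"]
accounts = ["111111111111", "222222222222"]

# Routing fields before time: one prefix per region/account combination.
before_time = [f"my-bucket/{r}/{a}/2024/11/" for r, a in product(regions, accounts)]
print(len(before_time))  # 6 separate listings for a single month

# Time before routing fields: one static prefix covers all of them.
after_time = ["my-bucket/2024/11/"]
print(len(after_time))   # 1 listing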
Don’t Hide Key Attributes
If segments such as dataImportance, environment, or sourcetype are key to your search criteria, do not place them in segments you intend to wildcard. Placing them behind a wildcard (such as my-bucket/${*}/${_time:%Y}/...) negates their value for path pruning and increases search time and cost.
Newest-First Ordering
The order in which Cribl Search traverses objects in object storage depends directly on the alphanumeric sorting (lexicographic order) of the object keys returned by the object store provider.
Default Behavior (Oldest-First)
- When you use a standard time hierarchy in your path (such as YYYY/MM/DD/...), the object store API lists keys in ascending, oldest-first order.
- Although Cribl Search reads events within each file in reverse-chronological order, the overall traversal of objects is constrained by this oldest-first listing order.
- While this approach is efficient for smaller directory listings (hundreds of objects), for larger listings (thousands or more), Cribl Search must still respect this order to maintain predictable and efficient behavior.
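You can reproduce this with plain lexicographic sorting in Python (hypothetical keys):

keys = [
    "2023/05/09/evt.json",
    "2024/11/01/evt.json",
    "2022/01/15/evt.json",
]
# Ascending lexicographic order of YYYY/MM/DD keys is oldest-first.
print(sorted(keys))  # ['2022/01/15/...', '2023/05/09/...', '2024/11/01/...']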
Ordering in Cribl Lake
- Cribl Lake Datasets automatically apply an optimized “magic prefix” to the bucket path.
- This prefix encodes time in a way that forces the lexicographic listing order to correspond to reverse-chronological (newest-first) order.
- This optimization allows Cribl Search to discover and scan the newest objects first, significantly improving the speed of typical time-range queries.
Approximating Newest-First in Raw Object Stores
While Cribl Lake automatically handles optimal time partitioning for “newest-first” retrieval, you can achieve similar results when writing data directly to S3, Azure, or GCS using Cribl Stream or Edge.
Use C.Time.s3TimePartition()
Use the C.Time.s3TimePartition() function in Cribl Stream to generate your time prefix. This function encodes time in a special format that causes the resulting object keys to be ordered reverse-chronologically.
Incorporate this function into the bucket path expression within your Cribl Stream or Edge Destination:
my-bucket/${C.Time.s3TimePartition(_time, 'h')}/${sourcetype}/${host}/

The function takes a time field (_time) and a granularity specifier:
- C.Time.s3TimePartition(_time, 'h') yields an hourly prefix.
- C.Time.s3TimePartition(_time, 'd') yields a daily prefix.
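As a rough illustration of the general idea (a minimal Python sketch; the actual format that C.Time.s3TimePartition emits is Cribl-internal and may differ), complementing each time component makes ascending lexicographic order correspond to newest-first:

from datetime import datetime

def newest_first_prefix(dt):
    # Complement each component so that a smaller string means a newer time.
    return f"{9999 - dt.year:04d}/{12 - dt.month:02d}/{31 - dt.day:02d}/{23 - dt.hour:02d}"

stamps = [datetime(y, 11, 1, 12) for y in (2023, 2024, 2025)]
print(sorted(newest_first_prefix(d) for d in stamps))
# The lexicographically smallest entry corresponds to 2025, so a plain
# ascending listing returns the newest hour first.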
This technique applies only to data written through Cribl Stream or Edge. Native producer formats (like CloudTrail or Splunk SmartStore) retain their own partitioning schemes. You must still configure Search Datasets with paths that align with your chosen prefix strategy.
Example: Manage Longest Static Prefix in S3
The Amazon S3 API uses your bucket path’s longest static prefix when fetching data. So if your path places two partitioning fields before the time fields, both must be supplied in every search for your time fields to be considered in the prefilter fetch stage.
For example: if you specify field1=foo and field2=bar at noon on November 1, 2024, the longest prefix for fetching data would be:

<mybucket>/inputID/foo/bar/2024/11/01/12/00/

This is exactly what you want, and very efficient: the fetch is limited to the files in that one S3 partition. But a more efficient layout would put time first:

<mybucket>/2024/11/01/12/00/inputID/foo/bar
<mybucket>/2025/11/01/12/00/inputID/foo/bar

And ideally, you’d want them reverse-chronological, with the lexicographic order arranged to match:

<mybucket>/baca-2025/ac-11/aa-01/ad-12/00/inputID/foo/bar
<mybucket>/bada-2024/ac-11/aa-01/ad-12/00/inputID/foo/bar

On the other hand, if you didn’t specify field2 but used the same criteria otherwise, the longest prefix would shrink to:

<mybucket>/inputID/foo/

Here, Cribl Search would have to pull every file within that partition boundary across the S3 API.
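A small Python model of this prefix logic (illustrative only; not how Cribl Search is implemented) makes the effect of omitting a field concrete:

def longest_static_prefix(template, supplied):
    # Walk the template left to right; stop at the first segment whose
    # value the search did not supply.
    parts = []
    for seg in template.split("/"):
        if seg.startswith("${"):
            name = seg[2:-1]
            if name not in supplied:
                break
            parts.append(supplied[name])
        else:
            parts.append(seg)
    return "/".join(parts) + "/"

template = "<mybucket>/inputID/${field1}/${field2}/${year}/${month}/${day}/${hour}"

both = {"field1": "foo", "field2": "bar",
        "year": "2024", "month": "11", "day": "01", "hour": "12"}
print(longest_static_prefix(template, both))
# <mybucket>/inputID/foo/bar/2024/11/01/12/

one = {"field1": "foo", "year": "2024", "month": "11", "day": "01", "hour": "12"}
print(longest_static_prefix(template, one))
# <mybucket>/inputID/foo/ (dropping field2 truncates the prefix before time)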
Another approach is to reverse the order of your criteria, like this:

<mybucket>/inputID/year/month/day/hour/minute/FIELD_1/FIELD_2/

This eliminates the churn caused by omitting either of those criteria.
Note that you’d still get the correct data in your result set if you didn’t specify field2, because Cribl Search still filters each event. However, the pool of objects searched could be much larger, increasing both your search wall-clock time and your proportional retrieval cost.
In Cribl Stream, the Amazon Security Lake Destination automatically partitions buckets this way. It creates time partitions on top, and you can explicitly add further partitioning fields, which this Destination adds below the time boundaries.