Searchable Data Lake
This blueprint defines an architecture where data is routed to low-cost object storage in a structured format, enabling Cribl Search to query the data in-place. This removes the requirement to index all data in a traditional analytics tool or perform a manual “Replay” to access historical events.
Topology Overview
This blueprint is built upon the Distributed (Single or Multi-Worker Group) topology to ingest data and use the Cribl Search engine for at-rest data exploration.
- Ingest tier: Worker Groups/Fleets receive raw data, normalize it, and write it to Cribl Lake (or other storage) using specific partitioning logic.
- Storage tier: Data is stored in object storage (Cribl Lake, S3, Azure Blob, GCS, or MinIO). This serves as the primary repository for both retention and ad-hoc queries.
- Search tier: Cribl Search queries the storage tier directly. It filters data based on metadata and time-based partitions without moving or re-indexing the files.
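To make the pruning idea in the search tier concrete, here is a minimal Python sketch (not Cribl code; the function name and layout are illustrative) showing how a time-bounded query over a `/year/month/day/sourcetype` partition scheme reduces to listing only a handful of object-store prefixes, while every other directory in the bucket is skipped:

```python
from datetime import datetime, timedelta

def prefixes_for_range(start: datetime, end: datetime, sourcetype: str) -> list[str]:
    """Enumerate the /year/month/day/sourcetype partitions a query over
    [start, end] would need to scan; all other prefixes are pruned."""
    prefixes = []
    day = start.date()
    while day <= end.date():
        prefixes.append(f"{day.year:04d}/{day.month:02d}/{day.day:02d}/{sourcetype}/")
        day += timedelta(days=1)
    return prefixes

# A 2-day query touches only 2 of the bucket's daily partitions.
print(prefixes_for_range(datetime(2024, 3, 1), datetime(2024, 3, 2), "syslog"))
# → ['2024/03/01/syslog/', '2024/03/02/syslog/']
```

The cost of a query scales with the number of matching prefixes, not with the total size of the bucket, which is why partition layout matters so much for in-place search.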
Combined Overlays
This blueprint utilizes three specific CVA overlays to optimize data accessibility:
- Hub and Spoke: Worker/Edge Nodes act as the Hub, ensuring a raw or normalized copy of all data is fanned out to the “Spoke” (the object storage destination).
- Regional/Geo Split: To minimize latency and cross-region egress costs, the Search provider and the storage bucket should reside in the same cloud region.
- Cribl Edge and Stream: Used to collect high-volume telemetry from endpoints that is unsuitable for real-time indexing but required for historical lookups.
Operational Guardrails
To ensure the Cribl Lake is performant for in-place querying, consider the following guidelines.
- Partitioning for Search: You must use an optimized Partitioning Expression (such as `/year/month/day/sourcetype/host`). This allows Cribl Search to “prune” data, scanning only the relevant directories rather than the entire bucket.
- Optimized File Formats: Data should be written in Parquet or highly compressed JSON. Parquet is preferred for Cribl Search as it allows for columnar reads, significantly reducing the amount of data scanned per query.
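The write-side counterpart of that partitioning expression can be sketched in Python (an illustration of the semantics, not Cribl's implementation; the event field names `_time`, `sourcetype`, and `host` follow common Cribl conventions, but the function itself is hypothetical):

```python
from datetime import datetime, timezone

def partition_key(event: dict) -> str:
    """Build the object-store path for an event, mirroring a
    /year/month/day/sourcetype/host partitioning expression."""
    t = datetime.fromtimestamp(event["_time"], tz=timezone.utc)
    return (f"{t.year:04d}/{t.month:02d}/{t.day:02d}/"
            f"{event['sourcetype']}/{event['host']}")

event = {"_time": 1718064000, "sourcetype": "syslog", "host": "web-01"}
print(partition_key(event))  # → 2024/06/11/syslog/web-01
```

Because the time fields lead the path, a time-bounded search can skip whole date directories; putting a high-cardinality field like `host` first would defeat that pruning.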
- Small File Avoidance: Ensure Cribl Stream is configured with appropriate File Scaling (such as 5-minute or 100 MB flushes). Thousands of small files in a single partition will degrade search performance.
- Metadata Standardization: Use Cribl Stream to enrich events with consistent labels before they are written. Cribl Search uses these labels as the primary filtering mechanism during the discovery phase.
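The flush policy behind the small-file guardrail can be sketched as a toy Python class (the class name and thresholds are illustrative assumptions, not Cribl settings): a file is rotated when it reaches a byte threshold or a maximum open time, whichever comes first, so each partition accumulates a few large files instead of thousands of tiny ones.

```python
import time

class FileFlusher:
    """Toy flush policy mirroring the 5-minute / 100 MB guidance above:
    rotate the current file at a byte limit or a maximum open time."""

    def __init__(self, max_bytes=100 * 1024 * 1024, max_open_secs=300):
        self.max_bytes = max_bytes
        self.max_open_secs = max_open_secs
        self.bytes_written = 0
        self.opened_at = time.monotonic()

    def write(self, record: bytes) -> bool:
        """Account for a record; return True when the file should be flushed."""
        self.bytes_written += len(record)
        age = time.monotonic() - self.opened_at
        return self.bytes_written >= self.max_bytes or age >= self.max_open_secs

flusher = FileFlusher(max_bytes=1024, max_open_secs=300)
assert not flusher.write(b"x" * 512)  # under both thresholds: keep file open
assert flusher.write(b"x" * 512)      # hits 1024 bytes: flush and rotate
```

Tuning these two knobs trades write latency against file count: longer or larger flushes mean fewer objects to list and open per search, at the cost of data arriving in storage slightly later.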
For more considerations, see Strategic Architectures (Bridging, Edge, and Hybrid Flow).