Structure Events for Cribl Lake
Preprocess your events, optimizing either for long-term retention or frequent access.
Summary
- Before sending events to Cribl Lake, format them in Cribl Stream.
- For maximum storage efficiency, use JSON, and remove all fields except _raw.
- For best search performance, use Parquet, and keep all fields.
Event Structure Matters
Cribl Lake stores data as either gzip-compressed JSON or Parquet files. The format you choose when creating a Lake Dataset, and the way you structure your events in Cribl Stream, together determine whether you optimize for storage efficiency or search performance.
Choose Your Format Based on How You’ll Use the Data
If you’re storing data primarily for compliance or long-term retention, JSON is typically the better choice: gzip-compressed JSON produces smaller files, which minimizes storage costs when you don’t need to query the data frequently.
If you plan to search and analyze the data regularly (building dashboards, running investigations, or performing ad-hoc queries in Cribl Search), Parquet is often the better option. Parquet's columnar format enables faster queries, especially when your events contain structured fields that Cribl Search can filter efficiently.
| Consideration | JSON | Parquet |
|---|---|---|
| Use case | Compliance, archival, infrequent access | Frequent search, dashboards, analysis |
| Event structure | Simplified: keep only _raw | Structured: keep multiple parsed fields |
| Compression | Excellent with simplified events | Excellent with repeated field values |
| Search performance | Slower (scans full events) | Faster (columnar filtering, predicate pushdown) |
Optimize Events for Storage Efficiency (JSON)
JSON works best when you prioritize compression and storage efficiency over query performance. For compliance or
archival use cases, you can simplify events down to just the original log event stored in the _raw field. Use a
Pipeline in Cribl Stream to remove all top-level fields except _raw. This approach preserves the original event for
compliance purposes.
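One way to implement this is with an Eval Function in your Pipeline. The snippet below is a minimal sketch of how such a Function might appear in a Pipeline's JSON configuration; the property names (keep, remove) and wildcard handling are assumptions for illustration, so verify them against your own Pipeline export, or configure the equivalent Keep fields and Remove fields settings in the UI.

```json
{
  "id": "eval",
  "filter": "true",
  "conf": {
    "description": "Illustrative only: keep _raw (and _time), remove every other top-level field",
    "keep": ["_raw", "_time"],
    "remove": ["*"]
  }
}
```

Keeping _time alongside _raw adds only a few bytes per event and preserves the event timestamp, so you can still scope a later replay to a time range.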
Example: Firewall Log for Compliance Storage
A firewall log arrives in Cribl Stream with multiple extracted fields:
```json
{
  "_raw": "Jun 12 14:23:01 fw01 %ASA-6-302013: Built outbound TCP connection 847263 for outside:203.0.113.50/443 to inside:192.168.1.100/52344",
  "_time": 1718194981,
  "action": "Built",
  "protocol": "TCP",
  "src_ip": "192.168.1.100",
  "dest_ip": "203.0.113.50",
  "dest_port": "443"
}
```

If you aim for maximum storage efficiency (for example, for archival purposes), you can configure your Stream Pipeline to remove all non-essential fields and keep only _raw.
You can also keep other top-level fields, which can improve search performance later, but will result in larger file sizes in Cribl Lake.
```json
{
  "_raw": "Jun 12 14:23:01 fw01 %ASA-6-302013: Built outbound TCP connection 847263 for outside:203.0.113.50/443 to inside:192.168.1.100/52344",
  "_time": 1718194981
}
```

The original log remains intact in _raw for compliance. If you ever need to replay this data for analysis, you can re-parse it through Cribl Stream at that time using a Cribl Lake Collector.
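If you do replay the archived events, a Regex Extract Function (or a Parser Function) in the replay Pipeline can rebuild the structured fields from _raw. The sketch below is purely illustrative: the regex, field names, and configuration properties are assumptions, not a tested export.

```json
{
  "id": "regex_extract",
  "filter": "true",
  "conf": {
    "description": "Illustrative only: re-extract connection fields from the ASA log line in _raw",
    "source": "_raw",
    "regex": "/(?<action>Built|Teardown) (?:outbound|inbound) (?<protocol>TCP|UDP) connection \\d+ for outside:(?<dest_ip>[\\d.]+)\\/(?<dest_port>\\d+) to inside:(?<src_ip>[\\d.]+)\\/\\d+/"
  }
}
```

Applied to the sample _raw above, this would repopulate action, protocol, src_ip, dest_ip, and dest_port without you having stored those fields in Lake.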
Optimize Events for Search Performance (Parquet)
Parquet shines when your events are well-structured with multiple fields, particularly fields with repeated values. Unlike JSON, Parquet stores data in columns and compresses repeated values efficiently, which means fields like Kubernetes labels that appear across thousands of events are stored only once per file.
For Parquet to deliver these benefits, your events must arrive at Lake as parsed, structured data with multiple
top-level fields. Sending events with only _raw to a Parquet Lake Dataset produces larger files than JSON and degrades
search performance.
Cribl Search takes advantage of the Parquet structure through predicate pushdown, using min/max values within the files
to skip irrelevant data during queries. This can significantly reduce query times when filtering on fields like
namespace, pod_name, or log_level.
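As a concrete (and purely hypothetical) illustration of what predicate pushdown works with, the snippet below sketches the kind of per-column min/max metadata a Parquet file keeps for each row group; the layout and values are illustrative, not an actual file footer. For a query that filters on kube_namespace == "production", any file whose stored range can't contain that value is skipped without reading its data.

```json
{
  "kube_namespace": { "min": "default", "max": "monitoring" },
  "log_level": { "min": "DEBUG", "max": "WARN" },
  "_time": { "min": 1718190000, "max": 1718193600 }
}
```

Because "production" sorts after "monitoring", this entire file can be pruned from the query, which is why filtering on structured fields is so much faster than scanning _raw.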
For tips on searching Parquet data, see Parquet in the Cribl Search docs.
Example: Kubernetes Event for Active Analysis
A Kubernetes log arrives with container metadata:
```json
{
  "_raw": "2024-06-12T14:23:01.847Z INFO [main] Application started successfully",
  "_time": 1718194981,
  "kube_namespace": "production",
  "kube_pod": "api-server-7d4b8c6f9-x2vnm",
  "kube_container": "api-server",
  "kube_node": "node-pool-1-abc123",
  "log_level": "INFO",
  "message": "Application started successfully"
}
```

Keep all these fields intact when sending to a Parquet Dataset. The repeated values in kube_namespace, kube_node, and log_level compress efficiently, and Cribl Search can filter directly on any of these fields without having to re-parse the full _raw field.
Understand How Cribl Stream Processes Events
To learn more about how Cribl Stream processes and structures events, see:
- Event Model - Understand the internal structure of events in Stream
- Event Processing Order - See how events flow through Pipelines
- Pipelines - Configure Functions to shape events before they reach Lake