Structure Events for Cribl Lake
Preprocess your events, optimizing either for long-term retention or frequent access.
Summary
- Before sending events to Cribl Lake, format them in Cribl Stream.
- For maximum storage efficiency, use JSON, and remove all fields except _raw.
- For best search performance, use Parquet, and keep all fields.
Event Structure Matters
Cribl Lake stores data as either gzip-compressed JSON or Parquet files. The format you choose when creating a Lake Dataset, and the way you structure your events in Cribl Stream, together determine whether you optimize for storage efficiency or search performance.
Choose Your Format Based on How You’ll Use the Data
If you’re storing data primarily for compliance or long-term retention, JSON is typically the better choice: gzip-compressed JSON produces smaller files, which minimizes storage costs when you don’t need to query the data frequently.
If you plan to search and analyze the data regularly (building dashboards, running investigations, or performing ad-hoc queries in Cribl Search), Parquet is often the better option. Parquet's columnar format enables faster queries, especially when your events contain structured fields that Cribl Search can filter efficiently.
| Consideration | JSON | Parquet |
|---|---|---|
| Use case | Compliance, archival, infrequent access | Frequent search, dashboards, analysis |
| Event structure | Simplified: keep only _raw | Structured: keep multiple parsed fields |
| Compression | Excellent with simplified events | Excellent with repeated field values |
| Search performance | Slower (scans full events) | Faster (columnar filtering, predicate pushdown) |
Optimize Events for Storage Efficiency (JSON)
JSON works best when you prioritize compression and storage efficiency over query performance. For compliance or
archival use cases, you can simplify events down to just the original log event stored in the _raw field. Use a
Pipeline in Cribl Stream to remove all top-level fields except _raw. This approach preserves the original event for
compliance purposes.
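One way to implement this is with an Eval Function in your Pipeline. The snippet below is a minimal sketch of how such a Function might appear in a Pipeline's JSON configuration; the property names (keep, remove) and wildcard handling are assumptions for illustration, so verify them against your own Pipeline export, or configure the equivalent Keep fields and Remove fields settings in the UI.

```json
{
  "id": "eval",
  "filter": "true",
  "conf": {
    "description": "Illustrative only: keep _raw (and _time), remove every other top-level field",
    "keep": ["_raw", "_time"],
    "remove": ["*"]
  }
}
```

Keeping _time alongside _raw adds only a few bytes per event and preserves the event timestamp, so you can still scope a later replay to a time range.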
Example: Firewall Log for Compliance Storage
A firewall log arrives in Cribl Stream with multiple extracted fields:
```json
{
  "_raw": "Jun 12 14:23:01 fw01 %ASA-6-302013: Built outbound TCP connection 847263 for outside:203.0.113.50/443 to inside:192.168.1.100/52344",
  "_time": 1718194981,
  "action": "Built",
  "protocol": "TCP",
  "src_ip": "192.168.1.100",
  "dest_ip": "203.0.113.50",
  "dest_port": "443"
}
```

If you aim for maximum storage efficiency (for example, for archival purposes), you can configure your Stream Pipeline to remove all non-essential fields and keep only _raw.
You can also keep other top-level fields, which can improve search performance later, but will result in larger file sizes in Cribl Lake.
```json
{
  "_raw": "Jun 12 14:23:01 fw01 %ASA-6-302013: Built outbound TCP connection 847263 for outside:203.0.113.50/443 to inside:192.168.1.100/52344",
  "_time": 1718194981
}
```

The original log remains intact in _raw for compliance. If you ever need to replay this data for analysis, you can re-parse it through Cribl Stream at that time using a Cribl Lake Collector.
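If you do replay the archived events, a Regex Extract Function (or a Parser Function) in the replay Pipeline can rebuild the structured fields from _raw. The sketch below is purely illustrative: the regex, field names, and configuration properties are assumptions, not a tested export.

```json
{
  "id": "regex_extract",
  "filter": "true",
  "conf": {
    "description": "Illustrative only: re-extract connection fields from the ASA log line in _raw",
    "source": "_raw",
    "regex": "/(?<action>Built|Teardown) (?:outbound|inbound) (?<protocol>TCP|UDP) connection \\d+ for outside:(?<dest_ip>[\\d.]+)\\/(?<dest_port>\\d+) to inside:(?<src_ip>[\\d.]+)\\/\\d+/"
  }
}
```

Applied to the sample _raw above, this would repopulate action, protocol, src_ip, dest_ip, and dest_port without you having stored those fields in Lake.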
Optimize Events for Search Performance (Parquet)
Parquet shines when your events are well-structured with multiple fields, particularly fields with repeated values. Unlike JSON, Parquet stores data in columns and compresses repeated values efficiently, which means fields like Kubernetes labels that appear across thousands of events are stored only once per file.
For Parquet to deliver these benefits, your events must arrive at Lake as parsed, structured data with multiple
top-level fields. Sending events with only _raw to a Parquet Lake Dataset produces larger files than JSON and degrades
search performance.
Cribl Search takes advantage of the Parquet structure through predicate pushdown, using min/max values within the files
to skip irrelevant data during queries. This can significantly reduce query times when filtering on fields like
namespace, pod_name, or log_level.
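As a concrete (and purely hypothetical) illustration of what predicate pushdown works with, the snippet below sketches the kind of per-column min/max metadata a Parquet file keeps for each row group; the layout and values are illustrative, not an actual file footer. For a query that filters on kube_namespace == "production", any file whose stored range can't contain that value is skipped without reading its data.

```json
{
  "kube_namespace": { "min": "default", "max": "monitoring" },
  "log_level": { "min": "DEBUG", "max": "WARN" },
  "_time": { "min": 1718190000, "max": 1718193600 }
}
```

Because "production" sorts after "monitoring", this entire file can be pruned from the query, which is why filtering on structured fields is so much faster than scanning _raw.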
For tips on searching Parquet data, see Parquet in the Cribl Search docs.
Example: Kubernetes Event for Active Analysis
A Kubernetes log arrives with container metadata:
```json
{
  "_raw": "2024-06-12T14:23:01.847Z INFO [main] Application started successfully",
  "_time": 1718194981,
  "kube_namespace": "production",
  "kube_pod": "api-server-7d4b8c6f9-x2vnm",
  "kube_container": "api-server",
  "kube_node": "node-pool-1-abc123",
  "log_level": "INFO",
  "message": "Application started successfully"
}
```

Keep all these fields intact when sending to a Parquet Dataset. The repeated values in kube_namespace, kube_node, and log_level compress efficiently, and Cribl Search can filter directly on any of these fields without having to re-parse the full _raw field.
Understand How Cribl Stream Processes Events
To learn more about how Cribl Stream processes and structures events, see:
- Event Model - Understand the internal structure of events in Stream
- Event Processing Order - See how events flow through Pipelines
- Pipelines - Configure Functions to shape events before they reach Lake