Replay-First Overlay

This overlay writes all data to low-cost object storage first, then selectively sends real-time subsets to Destinations and supports on-demand Replay from storage. The focus is durability, cost control, and flexibility.

This overlay is built on two parallel flows:

  • The Landing path: High-volume raw or lightly processed events are written once to low-cost object storage (for example, Cribl Lake, S3, GCS, Azure Blob) with minimal transformation.

  • The Real-time path: A curated, smaller subset is routed to online tools (SIEMs, observability platforms) for immediate detection and monitoring (sketched in code after this list).
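
To make the two flows concrete, here is a minimal TypeScript sketch of the routing decision. The event shape, sender functions, and curation rule are all hypothetical, not Cribl APIs; in Cribl Stream itself this logic lives in Routes and Pipelines rather than code.

```typescript
// Minimal sketch of the dual-path decision: every event lands in object
// storage, and only a curated subset also goes to real-time Destinations.
interface LogEvent {
  _time: number;        // epoch seconds
  sourcetype: string;
  severity: string;
  _raw: string;         // original payload, kept intact for later Replay
}

// Stand-ins for an object-store Destination (Cribl Lake, S3, GCS, Azure
// Blob) and a real-time Destination (SIEM, observability platform).
function sendToObjectStore(e: LogEvent): void {
  console.log("landing path:", JSON.stringify(e));
}
function sendToRealtime(e: LogEvent): void {
  console.log("real-time path:", JSON.stringify(e));
}

// Example curation rule: only high-severity security events justify
// real-time ingest costs; everything else stays cold until replayed.
function isRealtimeWorthy(e: LogEvent): boolean {
  return e.sourcetype.startsWith("security:") &&
    ["critical", "high"].includes(e.severity);
}

function route(e: LogEvent): void {
  sendToObjectStore(e);     // Landing path: everything, always, written once
  if (isRealtimeWorthy(e)) {
    sendToRealtime(e);      // Real-time path: curated subset only
  }
}

route({ _time: 1714500000, sourcetype: "security:auth", severity: "high",
        _raw: "Failed password for root from 203.0.113.7" });
```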

Replay lets investigations, backfills, and tool migrations pull full-fidelity data directly from storage, eliminating the need to recollect it from the original Sources.
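
The snippet below sketches how a replay narrows work to a time window by enumerating the object-store prefixes that cover it, assuming a hypothetical dataset/yyyy/mm/dd/hh/ partition layout. In practice, an S3-style Collector handles this enumeration based on its configured path and filters.

```typescript
// Sketch: map a replay window onto hourly partition prefixes, assuming an
// illustrative dataset/yyyy/mm/dd/hh/ layout in the object store.
function replayPrefixes(dataset: string, start: Date, end: Date): string[] {
  const pad = (n: number) => String(n).padStart(2, "0");
  const prefixes: string[] = [];
  // Walk hour by hour across the window, truncating the start to the hour.
  const cursor = new Date(Date.UTC(
    start.getUTCFullYear(), start.getUTCMonth(),
    start.getUTCDate(), start.getUTCHours()));
  while (cursor.getTime() <= end.getTime()) {
    prefixes.push(
      `${dataset}/${cursor.getUTCFullYear()}/` +
      `${pad(cursor.getUTCMonth() + 1)}/${pad(cursor.getUTCDate())}/` +
      `${pad(cursor.getUTCHours())}/`);
    cursor.setUTCHours(cursor.getUTCHours() + 1);
  }
  return prefixes;
}

// Example: rehydrate a two-hour incident window from firewall logs.
console.log(replayPrefixes(
  "firewall_logs",
  new Date("2024-05-01T13:00:00Z"),
  new Date("2024-05-01T15:00:00Z")));
```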

Benefits

  • Durability and auditability: Object storage serves as the authoritative system of record, retaining raw or fully enriched data for long-term use cases like investigations, forensics, and compliance.

  • Cost control and flexibility: Decouples collection from Destination choices. Expensive online tools receive only the data they truly need in real time, while historical data remains cold.

  • Separation of ingest and analytics: Insulates data Sources from changes in downstream analytics platforms. This flexibility allows you to handle new tools, schema changes, or vendor migrations later by simply replaying data from storage.

  • Advanced workflows: Unlocks capabilities like backfilling new tools, A/B testing of parsing and routing changes, and reproducing incidents from historical data.

Risks / Trade-offs

  • Latency: Analyses that rely heavily on replayed data, especially investigative or forensics work, see higher latency, because object storage access time adds to processing time.

  • Governance maturity: Requires careful governance to manage dataset naming, partitioning, directory structure, retention, and lifecycle policies in the object store.

  • Operational capacity: Requires operational maturity to manage replay windows, object storage lifecycle policies, and the risk of “replay storms,” where multiple teams triggering large replays simultaneously can overwhelm the dedicated replay Worker Group’s capacity.

Design Notes (Mitigations)

  • Dedicated replay Worker Group: Run replays on a dedicated Worker Group to ensure heavy replay workloads do not interfere with live ingest or routing.

  • Schema alignment: Align the replay schema with the live ingest schema. To downstream tools, replayed data should be indistinguishable from live streams, except for metadata indicating replay context (see the first sketch at the end of these notes).

  • Storage format: Use storage-friendly formats (such as JSON, NDJSON, or columnar formats) with rich, consistent metadata to make replay efficient (see the second sketch at the end of these notes).

  • Replay governance: Clearly document retention policies (how long data is stored and at what fidelity), replay SLAs (how quickly a given time window can be rehydrated), and access controls for who can invoke replay and to which Destinations.

  • Search in place: Combine this overlay with Cribl Lake and Cribl Search (or similar tools) to query data directly in storage. This allows you to identify and replay only the exact data subsets that need to be sent to downstream analytics tools.
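
To illustrate the schema-alignment note, here is a minimal sketch in which a replayed event keeps the live schema untouched and gains only a replay-context envelope. The replay_context field and its contents are hypothetical naming conventions, not Cribl-defined fields; in Cribl Stream you would typically add such metadata with an Eval Function in the replay Pipeline.

```typescript
// Sketch: a replayed event keeps the live schema untouched and gains only
// a small, clearly labeled replay-context envelope. Field names are
// illustrative conventions, not Cribl-defined fields.
interface LiveEvent {
  _time: number;
  sourcetype: string;
  host: string;
  _raw: string;
}

interface ReplayedEvent extends LiveEvent {
  replay_context: {
    job_id: string;       // which replay job produced this event
    replayed_at: number;  // epoch seconds when it was rehydrated
  };
}

function tagForReplay(stored: LiveEvent, jobId: string): ReplayedEvent {
  // The spread keeps every live field as-is; only metadata is added.
  return {
    ...stored,
    replay_context: { job_id: jobId, replayed_at: Date.now() / 1000 },
  };
}
```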
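
And a short sketch of the storage-format note: events serialized as NDJSON, one JSON object per line, with enough embedded metadata (_time, dataset, sourcetype) to filter cheaply at replay time. Field names are again illustrative.

```typescript
// Sketch: NDJSON serialization for the landing path. One JSON object per
// line keeps objects streamable and line-splittable on replay, and the
// embedded metadata lets a replay job filter events without parsing _raw.
interface StoredEvent {
  _time: number;      // epoch seconds; matches the live ingest schema
  dataset: string;    // logical dataset name from your governance docs
  sourcetype: string;
  _raw: string;       // the untouched original payload
}

function toNdjson(events: StoredEvent[]): string {
  return events.map((e) => JSON.stringify(e)).join("\n") + "\n";
}

function fromNdjson(blob: string): StoredEvent[] {
  return blob
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as StoredEvent);
}
```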