LogStream Collectors are a special group of inputs. Unlike other Sources, Collectors are designed to handle intermittent, rather than continuous, data import. You can use Collectors to dispatch on‑demand ("ad hoc") collection tasks that fetch, or "replay" (re-ingest), data from local or remote locations.
Collectors also support scheduled periodic collection jobs – recurring tasks that can make batch collection of stored data more like continual processing of streaming data. You configure Collectors prior to, and independently from, your configuration of ad hoc versus scheduled collection runs.
Collectors are integral to Cribl LogStream's larger story about optimizing your data throughput. Send full-fidelity log and metrics data ("everything") to low-cost storage, and then use LogStream Collectors to selectively route ("replay") only needed data to your systems of analysis.
Cribl LogStream currently provides the following Collector options:
Filesystem/NFS – enables data collection and replay from local or remote filesystem locations.
Azure Blob – enables data collection and replay from Azure Blob Storage objects.
Google Cloud Storage – enables data collection and replay from Google Cloud Storage buckets.
S3 – enables data collection and replay from Amazon S3 buckets or S3-compatible stores.
Script – enables data collection and replay via custom scripts.
REST – enables data collection and replay via REST API calls. Provides four Discover options, to support progressively more complex (and dynamic) item enumerations.
You can configure a LogStream Node to retrieve data from a remote system by selecting Collectors from the top nav. Data collection is a multi-step process:
First, define a Collector instance. In this step, you configure collector-specific settings by selecting a Collector type and pointing it at a specific target. (E.g., the target will be a directory if the type is Filesystem, or an S3 bucket/path if the type is Amazon S3.)
Next, schedule or manually run the Collector. In this step, you configure either scheduled-job–specific or run‑specific settings – such as the run Mode (Preview, Discovery, or Full Run), the Filter expression to match the data against, the time range, etc.
When a Node receives this configuration, it prepares the infrastructure to execute a collection job. A collection job is typically made up of one or more tasks that: discover the data to be fetched; fetch data that match the run filter; and finally, pass the results either through the Routes or (optionally) into a specific Pipeline and Destination.
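The discover-then-fetch task flow described above can be sketched as a minimal simulation. All names below are illustrative stand-ins, not LogStream's actual internals:

```python
# Minimal sketch of a collection job's task flow: discover candidate
# items at the target, fetch those matching the run filter, then hand
# off the results (in LogStream, to the Routes or a specific Pipeline).
# Function and field names here are hypothetical.

def discover(target):
    """Enumerate collectible items at the target (e.g., files in a directory)."""
    return [
        {"path": f"{target}/app-2021-01-0{i}.log", "size": 100 * i}
        for i in range(1, 4)
    ]

def fetch(item):
    """Retrieve one discovered item's data."""
    return {"source": item["path"], "events": ["event-a", "event-b"]}

def run_collection_job(target, run_filter):
    # One task per discovered item that matches the run filter.
    tasks = [item for item in discover(target) if run_filter(item)]
    return [fetch(item) for item in tasks]

results = run_collection_job("/var/log", lambda item: item["path"].endswith(".log"))
print(len(results))  # one fetched result per matching discovered item
```

The key point the sketch captures is the separation of concerns: discovery enumerates what *could* be collected, while the filter decides what *is* collected on this particular run.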
Select Monitoring (side or top nav) > System > Job Inspector to see the results of recent collection runs. You can filter the display by Worker Group (in distributed deployments), and by run type and run timing.
You might process data from inherently non-streaming sources, such as REST endpoints, blob stores, etc. Scheduled jobs enable you to emulate a data stream by scraping data from these sources in batches, on a set interval.
You can schedule a specific job to pick up new data from the source – data that hadn’t been picked up in previous invocations of this scheduled job. This essentially transforms a non-streaming data source into a streaming data source.
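One way to picture this "pick up only new data" behavior is a checkpoint that each scheduled run records and the next run resumes from. This is a simplified sketch with hypothetical names, not LogStream's actual implementation:

```python
# Sketch of emulating a stream with scheduled batch runs: each run
# collects only items newer than the checkpoint left by the prior run.
# Names and data shapes here are illustrative.

def collect_new(items, checkpoint):
    """Return items newer than the checkpoint, plus the advanced checkpoint."""
    new_items = [i for i in items if i["mtime"] > checkpoint]
    new_checkpoint = max((i["mtime"] for i in new_items), default=checkpoint)
    return new_items, new_checkpoint

store = [{"name": "a.log", "mtime": 100}, {"name": "b.log", "mtime": 200}]
batch1, cp = collect_new(store, checkpoint=0)   # first run picks up both items
store.append({"name": "c.log", "mtime": 300})   # new data arrives at the source
batch2, cp = collect_new(store, cp)             # next run picks up only c.log
print(len(batch1), len(batch2))                 # 2 1
```

Run on a short interval, this pattern yields a sequence of small, non-overlapping batches that downstream systems can treat much like a continuous stream.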
In a distributed deployment, you configure Collectors at the Worker Group level, and Worker Nodes execute the tasks. However, the Leader Node oversees the task distribution, and tries to maintain a fair balance across jobs.
When Workers ask for tasks, the Leader will normally try to assign the next task from the job that has the fewest tasks in progress. This is known as "Least-In-Flight Scheduling," and it provides the fairest task distribution for most cases. If desired, you can change this default behavior by opening global ⚙️ Settings (lower left) > General Settings > Job Limits, and then setting Job Dispatching to Round Robin.
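The Least-In-Flight selection rule can be sketched in a few lines. This is an illustrative model of the dispatch decision, not LogStream's actual scheduler code:

```python
# Sketch of Least-In-Flight dispatch: when a Worker asks for work,
# assign the next task from the job with the fewest tasks in flight.
# Job IDs and counts below are hypothetical.

def next_job(in_flight):
    """Pick the job ID with the fewest in-flight tasks."""
    return min(in_flight, key=in_flight.get)

in_flight = {"job-a": 5, "job-b": 1, "job-c": 3}
job = next_job(in_flight)   # job-b currently has the fewest tasks in flight
in_flight[job] += 1         # dispatch one of job-b's tasks to the Worker
print(job)                  # job-b
```

Because each dispatch increments the chosen job's in-flight count, jobs with many queued tasks cannot starve smaller jobs, which is what makes this policy fair across concurrent jobs.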
More generally: In a distributed deployment, you configure Collectors and their jobs on individual Worker Groups. But you configure Collectors' resource allocation globally in the Leader's global ⚙️ Settings (lower left) > General Settings > Job Limits section.
Select Monitoring (side or top nav) > System > Job Inspector to view and manage pending, in-flight, and completed collection jobs and their tasks.
Here are the options available on the Job Inspector page:
All vs. Currently Scheduled tabs: Click Currently Scheduled to see jobs forward-scheduled for future execution – including their cron schedule details, last execution, and next scheduled execution. Click All to see all jobs initiated in the past, regardless of completion status.
Job categories (buttons): Select among Ad hoc, Scheduled, System, and Running. (At this level, Scheduled means scheduled jobs already running or finished.)
Filters: Click the gear icon to open a drop-down with multiple options to filter the jobs shown within your selected category.
Group selectors: Select one or more check boxes to display the Pause, Resume, etc., buttons shown along the bottom.
Sortable headers: Click any column header to sort by that column; click it again to reverse the sort direction.
Search bar: Enter a string to filter the displayed jobs by arbitrary text matches.
Action buttons: For finished jobs, the icons (from left to right) indicate: Re-run; Keep job artifacts; Copy job artifacts; Delete job artifacts; and Display job logs in a modal. For running jobs, the options (again from left to right) are: Pause; Stop; Copy job artifacts; Delete job artifacts; and Live (show collection status in a modal).