OpenAI Source
The OpenAI Source in Cribl Stream collects model invocation logs, audit logs, and other organization-level telemetry from the OpenAI platform. This lets you bring OpenAI activity into your pipelines for AI usage governance, security auditing, and cost/usage analysis.
Type: Pull | TLS Support: Yes | Event Breaker Support: No
TLS is enabled via HTTPS on this Source’s underlying OpenAI REST APIs.
Prerequisites
Before you configure an OpenAI Source, you need:
- A Cribl Stream deployment (Cloud or self-hosted).
- An OpenAI organization with access to the relevant organization-level APIs (such as usage and audit logs).
- An OpenAI organization admin key (API key or equivalent) with permissions to call the endpoints you plan to enable. Visit OpenAI’s organization admin keys page to create an organization admin key.
- Network connectivity from your Cribl Workers to the OpenAI API over HTTPS (direct or via an HTTP/S proxy).
For details on generating and managing OpenAI credentials, see the OpenAI API documentation.
How the OpenAI Source Works
When you enable specific content types (endpoints), the Source:
- Polls each enabled OpenAI endpoint on its own schedule.
- Uses a shared OpenAI organization auth token for all endpoint calls.
- Tracks state independently for each endpoint.
- Emits events into Cribl Stream, where you can route, filter, and enrich them before forwarding to Destinations.
Configuring an OpenAI Source
- From the top nav, click Manage, then select a Worker Group to configure. Next, you have two options:
- To configure via the graphical QuickConnect UI, click Routing then QuickConnect. Next, click Add Source and from the resulting drawer’s tiles, select OpenAI. Next, click either Add Source or (if displayed) Select Existing.
- To configure via the Routing UI, click Data then Sources. Select OpenAI from the list of tiles or the Sources left nav. Next, click Add Source to open the Add Source modal.
- In the Source modal, configure the following under General Settings:
  - Input ID: Enter a unique name to identify this OpenAI Source definition. If you clone this Source, Cribl Stream will add `-CLONE` to the original Input ID.
  - Description: Optionally, enter a description.
  - API Key: Select or create a stored API key.
  - OpenAI Organization: Optionally, enter the `OpenAI-Organization` request header value. Typically `org-xxxxxxxxxxxxxxxxxxxxxxxx`.
  - OpenAI Project: Optionally, enter the `OpenAI-Project` request header value. Typically `proj_xxxxxxxxxxxxxxxxxxxxxxxx`.
  - Tags: Optionally, add tags that you can use to filter and group Sources in Cribl Stream’s UI. These tags aren’t added to processed events. Use a tab or hard return between (arbitrary) tag names.
- Under Content Types, enable the endpoints you’d like to collect data from. See the available Content Types below.
- Optionally, configure any Processing Settings, Retries, and Advanced Settings.
- Click Save, then Commit & Deploy.
Supported Content Types
The OpenAI Source exposes 10 OpenAI endpoints as individually toggleable content types:
- Audit Logs
- Costs
- Users
- Projects
- Completions Usage Details
- Embeddings Usage Details
- Moderations Usage Details
- Images Usage Details
- Audio Speeches Usage Details
- Audio Transcriptions Usage Details
Each content type uses the same controls:
Enable content: Turn collection for that endpoint on or off.
Details: Add or edit query parameters to control what each request returns.
Pagination: Choose a pagination type and specify response attributes and any last-page expression.
Schedule: Adjust how often Cribl Stream polls that endpoint. See Scheduling.
State tracking: Customize how the Source tracks time or cursor state to avoid overlaps and gaps between jobs. See State Tracking.
Fields: Add fixed fields (for example, env, tenant, or a business-unit tag) that are attached to every event from that endpoint.
The Schedule and query-parameter settings are preconfigured with working defaults; adjust them only if you need different behavior.
Processing Settings
Fields: You can add Fields to each event, using Eval-like functionality. A field consists of a Name and Value pair. The Value is a JavaScript expression and can be a constant.
Pre-Processing: Select a Pipeline to process results before sending to Routes. Optional, and available only when Send to Routes is toggled on.
Fields
In this section, you can add Fields to each event using Eval-like functionality.
Name: Field name.
Value: JavaScript expression to determine field’s value (can be a constant).
Fields specified on the Fields tab will normally override fields of the same name in events. But you can specify that fields in events should override these fields’ values.
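As a rough illustration of this evaluation model, the sketch below applies Fields-tab entries to an event. This is not Cribl Stream's actual implementation; the `applyFields` helper and its `eventsOverride` option are hypothetical, standing in for the override toggle described above.

```javascript
// Hypothetical sketch: each field's Value is a JavaScript expression
// evaluated with the event's properties in scope.
function applyFields(event, fields, eventsOverride = false) {
  const out = { ...event };
  for (const { name, value } of fields) {
    // Evaluate the Value expression against the event (sketch only).
    const computed = Function("event", `with (event) { return (${value}); }`)(out);
    if (eventsOverride && name in out) continue; // event's own value wins
    out[name] = computed; // default: Fields-tab value wins
  }
  return out;
}

const event = { _time: 1712280616, host: "w-01" };
const fields = [
  { name: "env", value: "'prod'" },              // constant
  { name: "tenant", value: "host.slice(0, 1)" }, // expression using event data
];
console.log(applyFields(event, fields)); // adds env: 'prod', tenant: 'w'
```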
Pre-Processing
In this section’s Pipeline drop-down list, you can select a single existing Pipeline or Pack to process data from this input before the data is sent through the Routes.
Retries
Optionally adjust the default retry settings for failed collect jobs.
Retry type: The algorithm to use when performing HTTP retries. Options include Backoff (the default), Static, and Disabled.
Initial retry interval (ms): Time interval between failed request and first retry (kickoff). Maximum allowed value is 20,000 ms (1/3 minute). A value of 0 means retry immediately until reaching the limit specified in Retry limit.
Retry limit: Maximum number of times to retry a failed HTTP request. Defaults to 5. Maximum: 20. A value of 0 means don’t retry at all.
Backoff multiplier: Base for exponential backoff. A value of 2 (default) means that Cribl Stream will retry after 2 seconds, then 4 seconds, then 8 seconds, and so forth.
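The backoff schedule in the example above (retry after 2 s, then 4 s, then 8 s, and so on) can be sketched as a simple calculation. This is illustrative only; the exact interaction between the initial retry interval and the multiplier in Cribl Stream may differ.

```javascript
// Sketch of an exponential backoff schedule: delay before retry n is
// multiplier^n seconds, matching the documented example for multiplier 2.
function backoffDelaysSeconds(multiplier, retryLimit) {
  const delays = [];
  for (let attempt = 1; attempt <= retryLimit; attempt++) {
    delays.push(Math.pow(multiplier, attempt));
  }
  return delays;
}

console.log(backoffDelaysSeconds(2, 4)); // [2, 4, 8, 16]
```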
Retry HTTP codes: List of HTTP codes that trigger a retry. Leave unchanged to use the defaults (429 and 503). Cribl Stream does not retry codes in the 200 series.
Honor Retry-After header: When toggled on (the default), retry-after headers up to a maximum of 20 seconds are processed. Delays longer than 20 seconds are ignored.
- Cribl Stream will log a warning message with the delay value retrieved from the `retry-after` header (converted to ms).
- When toggled off, Cribl Stream ignores all `retry-after` headers.
Retry connection timeout: Toggle on to automatically retry a single connection attempt after a timeout (ETIMEDOUT) to ensure data continuity.
Retry connection reset: Toggle on to automatically retry a connection after a peer reset (ECONNRESET) to maintain data flow.
Advanced Settings
Request timeout (seconds): How long to wait for a request to complete before aborting it. The default value is 300. A value of 0 means wait indefinitely.
Time to live (seconds): Time to keep the job’s artifacts on disk after job completion. This also affects how long a job is listed in the Job Inspector. The default value of 4h means Cribl Stream keeps the collector’s job artifacts on disk and visible in Job Inspector for 4 hours, then automatically cleans them up.
Environment: If you’re using GitOps, optionally use this field to specify a single Git branch on which to enable this configuration. If empty, the config will be enabled everywhere.
Connected Destinations
Select Send to Routes to enable conditional routing, filtering, and cloning of this Source’s data via the Routing table.
Select QuickConnect to send this Source’s data to one or more Destinations via independent, direct connections.
Scheduling
When you enable a Content type under General Settings, the following scheduling options are available:
The Cron schedule configures when collection requests are made.
Job timeout: Maximum time this job will be allowed to run. Units are seconds, if not specified. Sample values: 30, 45s, or 15m. Minimum granularity is 10 seconds, so a 45s value would round up to a 50-second timeout. Defaults to 0, meaning unlimited time (no timeout).
Log Level: Level at which to set task logging. More verbose levels are useful for troubleshooting jobs and tasks, but use them sparingly.
The Earliest time and Latest time fields define the time range of events to collect, based on the _time field. These fields accept the following syntax:
[+|-]<time_integer><time_unit>@<snap-to_time_unit>
To break down this syntax:
| Syntax Element | Values Supported |
|---|---|
| Offset | Specify `-` for times in the past, `+` for times in the future, or omit when using `now`. |
| <time_integer> | Specify any integer, or omit with now. |
| <time_unit> | Specify the now constant, or one of the following abbreviations: s[econds], m[inutes], h[ours], d[ays], w[eeks], mon[ths], q[uarters], y[ears]. |
| @<snap-to_time_unit> | Optionally, you can append the @ modifier, followed by any of the above <time_unit>s, to round down to the nearest instance of that unit. (See the next section for details.) |
Cribl Stream validates relative time values using these rules:
- Earliest must not be later than Latest.
- Values without units get interpreted as seconds. For example, `-1` = `-1s`.
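A minimal sketch of how this syntax could be parsed (a hypothetical helper, not part of Cribl Stream; it covers only the forms shown above):

```javascript
// Parse [+|-]<time_integer><time_unit>[@<snap-to_time_unit>], plus "now".
// Illustrative only; real validation is stricter.
function parseRelativeTime(expr) {
  if (expr === "now") return { offset: 0, unit: "s", snap: null };
  const m = expr.match(/^([+-]?)(\d*)(mon|s|m|h|d|w|q|y)?(?:@([a-z0-9]+))?$/);
  if (!m || expr === "") throw new Error(`Invalid expression: ${expr}`);
  const [, sign, num, unit, snap] = m;
  return {
    // Values without units are interpreted as seconds, so -1 === -1s.
    offset: (sign === "-" ? -1 : 1) * (num === "" ? 0 : parseInt(num, 10)),
    unit: unit || "s",
    snap: snap || null,
  };
}

console.log(parseRelativeTime("-1"));      // { offset: -1, unit: 's', snap: null }
console.log(parseRelativeTime("+128m@h")); // { offset: 128, unit: 'm', snap: 'h' }
```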
Snap-to-Time Syntax
The @ snap modifier always rounds down (backwards) from any specified time. This is true even in relative time expressions with + (future) offsets. For example:
- `@d` snaps back to the beginning of today, 12:00 AM (midnight).
- `+128m@h` looks forward 128 minutes, then snaps back to the nearest round hour. (If you specified this in the Latest field, and ran the Source at 4:20 PM, collection would end at 6:00 PM. The expression would look forward to 6:28 PM, but snap back to 6:00 PM.)
Other options:
- `@w` or `@w7` to snap back to the beginning of the week, defined as the preceding Sunday.
- To snap back to other days of a week, use `w1` (Monday) through `w6` (Saturday).
- `@mon` to snap back to the 1st of a month.
- `@q` to snap back to the beginning of the most recent quarter: Jan. 1, Apr. 1, Jul. 1, or Oct. 1.
- `@y` to snap back to Jan. 1.
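The `+128m@h` example above can be sketched in code. The helper is illustrative and works in UTC so the arithmetic is unambiguous:

```javascript
// Look forward by an offset, then snap back (round down) to the top of
// the hour. The @ snap modifier always rounds down, even with + offsets.
function plusMinutesSnapToHour(date, minutes) {
  const d = new Date(date.getTime() + minutes * 60 * 1000);
  d.setUTCMinutes(0, 0, 0); // snap back to the nearest round hour
  return d;
}

// Running at 4:20 PM: 16:20 + 128 minutes = 18:28, which snaps back to 18:00.
const ranAt = new Date("2024-04-05T16:20:00Z");
console.log(plusMinutesSnapToHour(ranAt, 128).toISOString());
// -> 2024-04-05T18:00:00.000Z
```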
Working with State Tracking
You can configure the Source to track state, either by time or another arbitrary value. This can help prevent overlaps between jobs, where subsequent runs may return some of the same results as previous runs. Similarly, it can help prevent gaps in data by allowing a run to pick up from where the last run ended.
State update expression: JavaScript expression that defines how to update the state from an event. Use the event’s data and the current state to compute the new state.
State merge expression: JavaScript expression that defines which state to keep when merging a task’s newly reported state with the previously saved state. Evaluates prevState and newState variables, resolving to the state to keep.
The default values for these fields are configured to track state by the latest _time field found in events gathered in a collection run.
Understanding State Expression Fields
The State update and State merge expressions control how state is derived from a collection run and how it is merged with existing state, respectively. They’re preconfigured to work with the common use case of tracking state by latest _time, but you may need to update them for other use cases. To understand what these fields do, let’s break down the default values.
State update expression
This expression has a default value of:
__timestampExtracted !== false && {latestTime: (state.latestTime || 0) > _time ? state.latestTime : _time}
The __timestampExtracted field is set to false if the Event Breaker was unable to parse a time for the event. In that case, we don’t want to update state, because the event’s _time value defaults to Date.now() when the Event Breaker can’t parse out the correct time. The expression takes advantage of short-circuit evaluation: when __timestampExtracted is false, the right-hand side never runs and state is left unchanged.
State values must resolve to an object, such as:
{ "latestTime": 17122806161 }
If the expression does not resolve to an object, Cribl Stream will ignore the result.
The object expression `{latestTime: (state.latestTime || 0) > _time ? state.latestTime : _time}` compares state.latestTime to the event’s _time value, keeping whichever value is greater.
State merge expression
This expression has a default value of:
prevState.latestTime > newState.latestTime ? prevState : newState
It compares prevState (the state that was previously saved) to newState (the state reported from the most recent collection task), keeping the state with the greatest latestTime value.
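The behavior of both default expressions can be sketched as plain functions. The expression bodies come from the defaults above; the surrounding function scaffolding is illustrative, not how Cribl Stream actually evaluates them.

```javascript
// Default update expression, inlined: skip events whose timestamp wasn't
// extracted; otherwise keep the greater of state.latestTime and _time.
function updateState(state, event) {
  if (event.__timestampExtracted === false) return state;
  return {
    latestTime:
      (state.latestTime || 0) > event._time ? state.latestTime : event._time,
  };
}

// Default merge expression, inlined: keep the state with the greater
// latestTime value.
function mergeState(prevState, newState) {
  return prevState.latestTime > newState.latestTime ? prevState : newState;
}

let state = {};
state = updateState(state, { _time: 100 });                            // { latestTime: 100 }
state = updateState(state, { _time: 50 });                             // still { latestTime: 100 }
state = updateState(state, { _time: 0, __timestampExtracted: false }); // unchanged
console.log(mergeState({ latestTime: 100 }, { latestTime: 90 }));      // { latestTime: 100 }
```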
Managing State
Select Manage State to view, modify, or delete a state. For more information, see Manage State.
OpenAI API Rate Limits
OpenAI enforces per-organization and per-project rate limits on both the number of requests and the number of tokens you can send over a given interval. When the OpenAI Source calls the API, each response includes rate-limit headers that describe your current limits and remaining quota, for example:
- `x-ratelimit-limit-requests` / `x-ratelimit-remaining-requests`
- `x-ratelimit-limit-tokens` / `x-ratelimit-remaining-tokens`
- `x-ratelimit-reset-requests` / `x-ratelimit-reset-tokens`
If your schedules or enabled content types cause the Source to exceed these limits, OpenAI will return 429 responses. The Source relies on its retry settings and your polling schedule to avoid sustained rate-limit errors. If you see repeated 429s in job logs, reduce the polling frequency or disable nonessential endpoints.
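As a sketch, a client could read these headers to decide whether to pause before its next request. The header names are OpenAI's documented ones; the `rateLimitStatus` helper, its policy, and the simplified reset parsing are illustrative assumptions.

```javascript
// Hypothetical helper: inspect OpenAI rate-limit response headers and
// report whether the request quota is exhausted and how long to wait.
function rateLimitStatus(headers) {
  const remaining = parseInt(headers["x-ratelimit-remaining-requests"], 10);
  // Reset values are durations like "20s"; parse only the simple
  // seconds form here (real values can also look like "6m0s").
  const reset = headers["x-ratelimit-reset-requests"] || "0s";
  const resetSeconds = parseFloat(reset); // "20s" -> 20
  return {
    exhausted: remaining === 0,
    waitSeconds: remaining === 0 ? resetSeconds : 0,
  };
}

console.log(rateLimitStatus({
  "x-ratelimit-remaining-requests": "0",
  "x-ratelimit-reset-requests": "20s",
})); // { exhausted: true, waitSeconds: 20 }
```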
Proxying Requests
If you need to proxy HTTP/S requests, see System Proxy Configuration.
Internal Fields
Cribl Stream uses a set of internal fields to assist in handling of data. These “meta” fields are not part of an event, but they are accessible, and Functions can use them to make processing decisions.
Fields for this Source:
- `__collectible` - Contains metadata about each collection job.
- `__collectStats` - Contains per-request metadata.
Troubleshooting
The Source’s configuration modal has helpful tabs for troubleshooting:
Live Data: Try capturing live data to see real-time events as they are ingested. On the Live Data tab, click Start Capture to begin viewing real-time data.
Logs: Review and search the logs that provide detailed information about the ingestion process, including any errors or warnings that may have occurred.
You can also view the Monitoring page that provides a comprehensive overview of data volume and rate, helping you identify ingestion issues. Analyze the graphs showing events and bytes in/out over time.
Response Errors
The Source treats all non-200 responses from configured URL endpoints as errors. This includes 1xx, 3xx, 4xx, and 5xx responses.
A few exceptions are treated as non-fatal errors:
- Where a collect job launches multiple tasks, and only a subset of those tasks fail, Cribl Stream places the job in failed status, but treats the error as non-fatal. (Note that Cribl Stream does not retry the failed tasks.)
- Where a collect job receives a `3xx` redirection error code, it follows the error’s treatment by the underlying library, and does not necessarily treat the error as fatal.