[openai] Add rate limits data stream with headroom dashboard panel and alert #19347
stefans-elastic wants to merge 28 commits into
Elastic Docs Style Checker (Vale)
Summary: 1 warning, 3 suggestions found

| File | Line | Rule | Message |
|---|---|---|---|
| packages/openai/_dev/build/docs/README.md | 50 | Elastic.QuotesPunctuation | Place punctuation inside closing quotation marks. |
💡 Suggestions (3): Optional style improvements. Apply when helpful.
| File | Line | Rule | Message |
|---|---|---|---|
| packages/openai/_dev/build/docs/README.md | 107 | Elastic.HeadingColons | Capitalize ': r'. |
| packages/openai/_dev/build/docs/README.md | 109 | Elastic.WordChoice | Consider using 'can, might' instead of 'may', unless the term is in the UI. |
| packages/openai/_dev/build/docs/README.md | 151 | Elastic.Ellipses | In general, don't use an ellipsis. |
The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to Elastic style guide for Vale.
TL;DR
The Buildkite failure is real, but the available log payload only contains teardown/output-upload lines and does not include the failing command/test error itself. Immediate next step: rerun the failed
Remediation
Investigation details
Root Cause
The provided job log (
Evidence
Verification
Follow-up
Once the full failing section (or failing JUnit XML contents) is available, I can provide a precise root cause classification and patch-level fix recommendation.
Note
🔒 Integrity filter blocked 2 items
The following items were blocked because they don't meet the GitHub integrity level.
To allow these resources, lower tools:
github:
min-integrity: approved # merged | approved | unapproved | none
From workflow: PR Buildkite Detective
The README documented "images ↔ images per minute" as a compared headroom dimension, but the Rate limit headroom panel only computed RPM and TPM, so max_images_per_1_minute was collected and documented yet never shown. Add image headroom to the panel: ES|QL now computes peak per-minute image usage (SUM of openai.images.images) against max_images_per_1_minute, gated independently so images don't inflate TPM/RPM, plus three table columns (Images used/limit/utilization). Audio stays usage-only by design: the limit is in megabytes but the Usage API reports audio only in seconds/characters, so there is no comparable usage figure. Clarify the README deferral note accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
OpenAI's Usage API reports dated model snapshots (e.g. gpt-image-1-2025-04-23, omni-moderation-2024-09-26), while the Rate Limits API often lists only the base name (gpt-image-1). The headroom dashboard panels and the alert rule joined usage to limits on the exact model string, so usage for a snapshot without a matching rate-limit row was dropped from headroom entirely (verified live: gpt-image token usage was invisible on the panel). Strip a trailing -YYYY-MM-DD snapshot suffix on both sides of the join so usage collapses to its base family and joins to the limit. Applied to all four dashboard panel queries and the alert rule template. Re-verified against live data: gpt-image-1 token usage now joins to its limit, and moderation utilization is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
OpenAI's Usage API finalizes per-minute buckets with a multi-minute delay: a bucket's request/token counts keep climbing for several minutes after it ends. The usage CEL only guarded the single newest bucket, so any bucket that was no longer newest but still finalizing was emitted partial, and the cursor advanced past it -- a permanent undercount that did not self-correct (most visible under bursts).
Widen the partial-skip guard across all 6 usage streams (completions, embeddings, moderations, images, audio_speeches, audio_transcriptions):
- New `finalization_grace` config var (default 15m).
- events now emit only buckets with end_time <= now - finalization_grace.
- The cursor is held at min(start_time) of the still-finalizing buckets so they are re-queried (start_time is inclusive) and emitted once final.
Emitted buckets are strictly older than the held cursor, so they are never re-fetched and never double-counted -- no dedup needed.
Docs updated with a Finalization grace period section and corrected collection-process steps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
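The guard described in this commit can be sketched in Python. This is a rough illustration rather than the CEL program itself: `split_buckets` and `fallback_cursor` are hypothetical names, and what the cursor does when nothing is pending is an assumption, since the commit message does not say.

```python
from datetime import datetime, timedelta, timezone

def split_buckets(buckets, now, grace, fallback_cursor):
    """Return (buckets safe to emit, cursor to persist for the next poll)."""
    cutoff = now - grace
    # Only buckets that ended at or before now - grace are treated as final.
    final = [b for b in buckets if b["end_time"] <= cutoff]
    pending = [b for b in buckets if b["end_time"] > cutoff]
    if pending:
        # Hold the cursor at the oldest still-finalizing bucket so it is re-queried.
        cursor = min(b["start_time"] for b in pending)
    else:
        # Assumption: with nothing pending, the caller supplies the next cursor.
        cursor = fallback_cursor
    return final, cursor
```

Because every emitted bucket ends before the held cursor, a later poll that starts at the cursor never sees it again, which is why no dedup step is needed.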
The issue-07 fix seeded finalization_grace (and initial_interval) in the CEL state: block and read state.finalization_grace on every poll, but the program's returned state map never re-emitted those keys. Elastic's CEL input replaces persisted state with the returned map (the state: block only seeds the first run), so from the second poll onward state.finalization_grace was absent and every evaluation failed with "no such key: finalization_grace" (event.kind: pipeline_error), ingesting zero usage across all six streams. Re-emit finalization_grace and initial_interval in both returned objects (success and error path) of every usage stream's cel.yml.hbs, mirroring the existing access_token persistence pattern. Verified live: agent resumed with 0 errors and backfilled the burst; per-bucket ES SUM(num_model_requests) equals the direct OpenAI usage API on all six streams (0 mismatches). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
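A toy Python model of the failure mode, assuming only what the commit states: the returned map replaces the persisted state. `run_polls`, `buggy` and `fixed` are hypothetical names and the loop is not the real CEL input.

```python
def run_polls(program, seed_state, polls):
    """Run `program` repeatedly; its return value replaces the persisted state."""
    state = dict(seed_state)  # the state: block only seeds the first run
    outcomes = []
    for _ in range(polls):
        try:
            state = program(state)
            outcomes.append(None)
        except KeyError as exc:
            outcomes.append(f"no such key: {exc.args[0]}")
    return outcomes

def buggy(state):
    _grace = state["finalization_grace"]
    return {"cursor": "next"}  # the key is not re-emitted, so it is lost

def fixed(state):
    grace = state["finalization_grace"]
    return {"cursor": "next", "finalization_grace": grace}  # key carried forward
```

With a seed of `{"finalization_grace": "15m"}`, `buggy` succeeds once and then fails on every later poll, while `fixed` keeps succeeding.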
…L eval
filebeat's CEL input ends a periodic cycle as soon as an evaluation emits zero events, before it inspects want_more. The old two-phase design listed projects in an evaluation that returned no events (events: [], want_more: true), so that empty eval ended the cycle and rate-limit draining only ran on the next interval -- producing every-other-interval collection. Resetting the drain phase's terminal state (prior attempts) could not help, because the blocker was the empty listing eval upstream.
Fold listing and emitting into the same evaluation: a LOAD step pages the project list and immediately drains the first newly discovered project's first rate-limit page, and a DRAIN step pages each project's remaining rate limits. Every productive evaluation now emits at least one event, so the want_more chain completes within a single interval. An accumulating worklist with a next index plus projects_done/project_after cursors handle project-list pagination, and a top-level want_more==false reset starts each fresh cycle clean.
Verified with the rate_limits system test (batch_size: 1, exercising both project-list and per-project rate-limit pagination): all expected docs are collected in one want_more chain per interval. rate_limits ships unreleased in 2.2.0, so no changelog entry is needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ic/integrations into openai-rate-limit-headroom
… works
The rate_limits pipeline renamed message into openai.rate_limits.results and parsed from there. This broke two paths:
- Logstash/reroute (event.original pre-set): message was removed and openai.rate_limits.results never populated, dropping every parsed field.
- Agent with preserve_original_event=true: event.original was never created, making the option a silent no-op.
Adopt the canonical pattern (audit stream + 38 other pipelines): rename message -> event.original, parse event.original -> openai.rate_limits, and remove event.original unless the preserve_original_event tag is present. The default path output is unchanged; the raw copy is kept only when opted in.
Add a pipeline test asserting both event.original and the parsed openai.rate_limits.* fields are present on the preserve path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The org-wide headroom panel summed per-project rate limits within each 1-minute bucket, then took the MAX across minutes. Because rate_limits docs are periodic snapshots, the per-minute project set depends on poll timing, so the denominator could undercount aggregate capacity and swing between polls when snapshots landed in different minutes. Stabilize each project's limit as the MAX over the look-back window first (matching the per-project panel), then sum by model. Usage becomes sum-of-per-project-peaks. Both numerator and denominator are now poll-timing independent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
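The two denominators can be compared with a small Python sketch. It assumes each snapshot is a `(minute, project_id, model, limit)` tuple; the function names and tuple layout are hypothetical, and the shipped logic is ES|QL.

```python
from collections import defaultdict

def sum_then_max(snapshots):
    """Old behaviour: sum limits per 1-minute bucket, then MAX across minutes."""
    per_minute = defaultdict(int)
    for minute, _project, model, limit in snapshots:
        per_minute[(model, minute)] += limit
    result = defaultdict(int)
    for (model, _minute), total in per_minute.items():
        result[model] = max(result[model], total)
    return dict(result)

def max_then_sum(snapshots):
    """New behaviour: MAX per project over the window, then sum by model."""
    per_project = defaultdict(int)
    for _minute, project, model, limit in snapshots:
        per_project[(model, project)] = max(per_project[(model, project)], limit)
    result = defaultdict(int)
    for (model, _project), limit in per_project.items():
        result[model] += limit
    return dict(result)
```

When two projects are polled in different minutes, `sum_then_max` only ever sees one project per bucket and undercounts, while `max_then_sum` returns the same total regardless of which minute each snapshot landed in.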
The README and the rate-limit-headroom alert investigation guide claimed
limits and usage join on the exact project_id and model strings "without
any normalization", but the shipped alert ES|QL and both dashboard
headroom panels normalize model via
REPLACE(<model>, "-[0-9]{4}-[0-9]{2}-[0-9]{2}$", "").
Rewrite the docs to describe the implemented contract: join on exact
project_id with model normalized on both sides (strip trailing
-YYYY-MM-DD), so dated usage snapshots match base rate-limit names. Also
correct the aggregation note to reflect the MAX-per-bucket dedup of
identical family/snapshot limits. Updates the generated README, its build
template, and the alert blob.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
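As an illustration of the join contract described above, here is a minimal Python sketch using the same regular expression as the REPLACE call. The dict shapes and the `join_headroom` helper are hypothetical; summing usage and taking the MAX of duplicate limits is a simplification of what the ES|QL does.

```python
import re

# Same pattern as the REPLACE call: a trailing -YYYY-MM-DD snapshot suffix.
SNAPSHOT_SUFFIX = re.compile(r"-[0-9]{4}-[0-9]{2}-[0-9]{2}$")

def normalize_model(model):
    """Collapse a dated snapshot name to its base model name."""
    return SNAPSHOT_SUFFIX.sub("", model)

def join_headroom(usage, limits):
    """Join {(project_id, model): value} dicts on project_id and normalized model."""
    base_limits = {}
    for (project_id, model), limit in limits.items():
        key = (project_id, normalize_model(model))
        # Family and snapshot rows with the same limit dedupe via MAX.
        base_limits[key] = max(base_limits.get(key, 0), limit)
    base_usage = {}
    for (project_id, model), used in usage.items():
        key = (project_id, normalize_model(model))
        base_usage[key] = base_usage.get(key, 0) + used
    return {key: (base_usage.get(key, 0), limit) for key, limit in base_limits.items()}
```

For example, usage reported under `gpt-image-1-2025-04-23` now lands on the `gpt-image-1` limit row instead of being dropped by an exact-string match.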
The prebuilt rate-limit headroom alert capped its ES|QL query at | LIMIT 100. With groupBy: row, that explicit limit is the binding cap on alert instances, and the ES|QL executor path never sets groupAggCount, so groups beyond the top 100 were dropped with no truncation warning. termField/termSize are ignored on the ES|QL path, so termSize: 100 was never the real constraint. Raise the cap to 1000 to align with the alerting max-alerts circuit breaker (xpack.alerting.rules.run.alerts.max); at that value truncation surfaces a warning instead of silently dropping breaching project/model pairs. SORT tpm_utilization DESC stays ahead of the LIMIT so the worst offenders survive any truncation. Align the inert termSize to 1000 and document the cap and large-org tuning in the investigation guide. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
filebeat's CEL input ends a polling cycle as soon as an evaluation emits zero events, before it reads want_more. Three paths in the rate_limits CEL could return [] mid want_more chain -- a project with no configured rate limits (LOAD and DRAIN branches) and an empty project-list page that still has more pages -- each re-introducing the every-other-interval stall the data stream was designed to avoid. Emit a dropped keep-alive sentinel instead of [] whenever the chain still wants to continue, and drop the sentinel in the ingest pipeline so it keeps the chain alive in the agent without ever being indexed. Pipeline test covers the drop (rendered as null in expected output). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
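The stall and the sentinel workaround can be reproduced with a toy Python model. It assumes only the behaviour stated above (a cycle ends on an empty event list before want_more is read); `run_cycle`, `make_evaluate`, `ingest` and the `keep_alive` marker are hypothetical names, not the shipped CEL or pipeline.

```python
PROJECTS = {"proj_a": ["limit-1"], "proj_b": [], "proj_c": ["limit-2"]}  # made-up data

def run_cycle(evaluate, state):
    """One periodic cycle: an empty event list ends it before want_more is read."""
    published = []
    while True:
        result = evaluate(state)
        state = result["state"]
        if not result["events"]:
            break
        published.extend(result["events"])
        if not result["want_more"]:
            break
    return published, state

def make_evaluate(use_sentinel):
    order = list(PROJECTS)
    def evaluate(state):
        i = state.get("next", 0)
        events = [{"project": order[i], "limit": name} for name in PROJECTS[order[i]]]
        want_more = i + 1 < len(order)
        if not events and want_more and use_sentinel:
            events = [{"keep_alive": True}]  # keeps the chain going
        return {"events": events, "want_more": want_more,
                "state": {"next": i + 1 if want_more else 0}}
    return evaluate

def ingest(events):
    """Stand-in for the ingest pipeline: drop sentinels so they are never indexed."""
    return [e for e in events if not e.get("keep_alive")]
```

Without the sentinel, the cycle stops at `proj_b` and `proj_c` waits for the next interval; with it, all three projects are visited in one cycle and only real limits are indexed.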
The finalization grace period adds ~15m of latency to all usage streams and the headroom panels/alert that read from them, but the README only described the undercount trade-off. Document the concrete latency cost, that it applies to every usage stream, where to change it (Advanced options), and that 0s disables it for fresher-but-undercounted data. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The README described the org-wide usage rollup as sum-then-peak (sum across projects per 1-minute bucket, then take the peak minute), but the shipped ES|QL does peak-then-sum (each project's peak minute first, then sum across projects). Reconcile the docs to the query: describe it as an indicative upper bound and align the limit-column wording with the issue-10 stabilize-then-sum implementation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
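A short Python sketch of why peak-then-sum is an upper bound on sum-then-peak. It assumes usage is given as `{project_id: {minute: count}}`; the function names and numbers are made up for illustration.

```python
def sum_then_peak(usage):
    """Sum across projects per minute, then take the busiest minute."""
    minutes = {m for per_minute in usage.values() for m in per_minute}
    return max(sum(pm.get(m, 0) for pm in usage.values()) for m in minutes)

def peak_then_sum(usage):
    """Take each project's busiest minute, then sum across projects."""
    return sum(max(per_minute.values()) for per_minute in usage.values())
```

With `{"proj_a": {0: 90, 1: 10}, "proj_b": {0: 20, 1: 80}}` the first gives 110 and the second 170: the projects peak in different minutes, so summing their peaks overstates any single minute's org-wide load.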
✅ All changelog entries have the correct PR link.
💚 Build Succeeded
Proposed commit message
openai: add rate_limits stream and rate-limit headroom view
This change adds an openai.rate_limits data stream and a rate-limit
headroom view that compares OpenAI's configured per-project, per-model
rate limits against actual Usage API consumption, so operators can see
how close each project and model is to being throttled with HTTP 429
responses. The stream provides the limit side of that comparison; the
dashboard panels and alert join it to the existing usage streams.
The stream uses a CEL work-list: it pages the organization projects
list, then drains each project's rate-limit endpoint, paginating both
with after, last_id and has_more. To collect on every interval rather
than every other one, the program never emits an empty event batch
mid-chain (the filebeat CEL input ends a periodic run as soon as an
evaluation yields zero events, before want_more is checked), so
project-listing and the first rate-limit page are folded into a single
emitting step. Each record is annotated with its project_id, name and
status, is stamped with collected_at, and includes an optional limit
field only when the API returns it, so sparse records never write null
into the long-typed mappings. A failed call emits an error event and
the run continues with the rest of the work list, and the admin token
is redacted.
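The work-list traversal can be sketched in Python. This is not the CEL program: `paginate`, `collect_rate_limits` and the two fetch callables are hypothetical, and the optional field names are assumed to follow OpenAI's rate-limit object.

```python
from datetime import datetime, timezone

# Optional limit fields; names assumed from the OpenAI rate-limit object.
OPTIONAL_LIMITS = (
    "max_requests_per_1_minute",
    "max_tokens_per_1_minute",
    "max_images_per_1_minute",
)

def paginate(fetch_page):
    """Yield items from a cursor-paginated list using after/last_id/has_more."""
    after = None
    while True:
        page = fetch_page(after)
        yield from page["data"]
        if not page.get("has_more"):
            return
        after = page["last_id"]

def collect_rate_limits(list_projects, list_rate_limits):
    """Walk every project, then every rate limit of that project."""
    collected_at = datetime.now(timezone.utc).isoformat()
    events = []
    for project in paginate(list_projects):
        try:
            for limit in paginate(lambda after: list_rate_limits(project["id"], after)):
                event = {
                    "project_id": project["id"],
                    "project_name": project["name"],
                    "project_status": project["status"],
                    "model": limit["model"],
                    "collected_at": collected_at,
                }
                # Copy a limit only when the API returned it, so no nulls are written.
                for field in OPTIONAL_LIMITS:
                    if limit.get(field) is not None:
                        event[field] = limit[field]
                events.append(event)
        except Exception as exc:  # a failed call must not stop the work list
            events.append({"error": str(exc), "project_id": project["id"]})
    return events
```

Unlike the real stream, this sketch collects everything in one call; the CEL program spreads the same traversal across evaluations in a want_more chain.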
The OpenAI dashboard gains two ES|QL tables. "Rate limit headroom"
joins limits to usage per project_id and model and reports peak
one-minute used, limit and utilization for requests (RPM), tokens
(TPM) and images (IPM). "Rate limit headroom - by model (org-wide)"
rolls the same metrics up by model across projects; because OpenAI
enforces limits per project, the summed per-project limit is a
synthetic aggregate capacity and is labelled as indicative, not an
exact throttle boundary. The model join normalizes dated model
snapshots (e.g. omni-moderation-2024-09-26, gpt-image-1-2025-04-23) to
their base names, because the Usage API reports per dated snapshot
while rate_limits often lists only the base name; an exact-string join
silently dropped those rows (e.g. gpt-image-1) from the headroom view.
All six usage datasets
(completions, embeddings, moderations, images, audio_speeches,
audio_transcriptions) are added to the dashboard's global
event.dataset filter so RPM is counted for audio and image models too.
A prebuilt "[OpenAI] Rate limit headroom low" .es-query rule fires at
or above 80% peak TPM utilization, grouped by project_id::model. Its
saved-object id and filename use the same "low" wording as the
rule name and dashboard panel.
Token and image usage are normalized into single
openai.base.usage_tokens and openai.base.usage_images fields,
populated by the relevant usage pipelines. The panels and alert
reference these normalized fields rather than per-stream columns, so
the ES|QL resolves even before any usage data exists and keeps working
when only some usage streams have data.
Each usage stream gains a finalization_grace setting (default 15m).
OpenAI does not finalize a per-minute usage bucket when it ends; its
counts keep rising for minutes afterward. The stream holds back any
bucket whose end time is younger than the grace period and re-fetches
it on a later poll, so finalized counts are stored instead of partial
ones; the grace value is persisted in CEL state across polls. A
documented residual limitation remains: OpenAI can revise buckets
upward beyond any fixed grace window under high-volume bursts, and
buckets are ingested once, so a small (few-percent) undercount can
persist for the busiest minutes. This does not affect the alert's
ability to flag over-limit conditions.
Utilization is computed with TO_DOUBLE because the used and limit values
are longs; plain long division truncated to zero until usage reached the
limit, which also left the alert threshold ineffective. Headroom rows are
ordered so that models with usage appear first, ranked by highest TPM
utilization, with unused models sorted to the bottom.
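The truncation problem is easy to reproduce outside ES|QL. The Python sketch below uses floor division to stand in for long division on non-negative values; the function names are hypothetical, the limit is assumed to be positive, and 0.8 is the alert's 80% threshold.

```python
ALERT_THRESHOLD = 0.8  # the rule fires at or above 80% peak TPM utilization

def utilization_long(used, limit):
    """Long division: the fractional part is discarded."""
    return used // limit

def utilization_double(used, limit):
    """Convert to floating point first, as TO_DOUBLE does in the query."""
    return float(used) / float(limit)

def breaches(utilization):
    return utilization >= ALERT_THRESHOLD
```

With 9,000 tokens used against a 10,000 limit, the long version yields 0 and never trips the threshold, while the double version yields 0.9 and does.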
The peak one-minute calculation depends on the usage streams running with
bucket_width set to 1m, which is the default. The package is bumped to
2.2.0 and format_version to 3.5.7 for alerting_rule_template support.