[openai] Add rate limits data stream with headroom dashboard panel and alert#19347

Draft
stefans-elastic wants to merge 28 commits into
elastic:mainfrom
stefans-elastic:openai-rate-limit-headroom

Conversation

@stefans-elastic
Contributor

@stefans-elastic stefans-elastic commented Jun 3, 2026

Proposed commit message

openai: add rate_limits stream and rate-limit headroom view

This change adds an openai.rate_limits data stream and a rate-limit
headroom view that compares OpenAI's configured per-project, per-model
rate limits against actual Usage API consumption, so operators can see
how close each project and model is to being throttled with HTTP 429
responses. The stream provides the limit side of that comparison; the
dashboard panels and alert join it to the existing usage streams.

  • The stream uses a CEL work-list: it pages the organization projects
    list, then drains each project's rate-limit endpoint, paginating both
    with after, last_id and has_more. To collect on every interval rather
    than every other one, the program never emits an empty event batch
    mid-chain (the filebeat CEL input ends a periodic run as soon as an
    evaluation yields zero events, before want_more is checked), so
    project-listing and the first rate-limit page are folded into a single
    emitting step. Each record is annotated with its project_id, name and
    status, is stamped with collected_at, and includes an optional limit
    field only when the API returns it, so sparse records never write null
    into the long-typed mappings. A failed call emits an error event and
    the run continues with the rest of the work list, and the admin token
    is redacted.

  • The OpenAI dashboard gains two ES|QL tables. "Rate limit headroom"
    joins limits to usage per project_id and model and reports peak
    one-minute used, limit and utilization for requests (RPM), tokens
    (TPM) and images (IPM). "Rate limit headroom - by model (org-wide)"
    rolls the same metrics up by model across projects; because OpenAI
    enforces limits per project, the summed per-project limit is a
    synthetic aggregate capacity and is labelled as indicative, not an
    exact throttle boundary. The model join normalizes dated model
    snapshots (e.g. omni-moderation-2024-09-26, gpt-image-1-2025-04-23) to
    their base name, because the Usage API reports per dated snapshot
    while rate_limits lists the base name; an exact-string join silently
    dropped those rows (e.g. gpt-image-1) from the headroom view. All six
    usage datasets
    (completions, embeddings, moderations, images, audio_speeches,
    audio_transcriptions) are added to the dashboard's global
    event.dataset filter so RPM is counted for audio and image models too.

  • A prebuilt "[OpenAI] Rate limit headroom low" .es-query rule fires at
    or above 80% peak TPM utilization, grouped by project_id::model. Its
    saved-object id and filename use the same "low" wording as the
    rule name and dashboard panel.

  • Token and image usage are each normalized into a single field,
    openai.base.usage_tokens and openai.base.usage_images respectively,
    populated by the relevant usage pipelines. The panels and alert
    reference these normalized fields rather than per-stream columns, so
    the ES|QL resolves even before any usage data exists and keeps working
    when only some usage streams have data.

  • Each usage stream gains a finalization_grace setting (default 15m).
    OpenAI does not finalize a per-minute usage bucket when it ends; its
    counts keep rising for minutes afterward. The stream holds back any
    bucket whose end time is younger than the grace period and re-fetches
    it on a later poll, so finalized counts are stored instead of partial
    ones; the grace value is persisted in CEL state across polls. A
    documented residual limitation remains: OpenAI can revise buckets
    upward beyond any fixed grace window under high-volume bursts, and
    buckets are ingested once, so a small (few-percent) undercount can
    persist for the busiest minutes. This does not affect the alert's
    ability to flag over-limit conditions.

Utilization is computed with TO_DOUBLE because the used and limit values
are longs; plain division truncated to zero until usage exceeded the
limit, which also left the alert threshold ineffective. Headroom rows are
ordered so that models with usage appear first, ranked by highest TPM
utilization, with unused models sorted to the bottom.
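
The truncation described above is easy to reproduce outside ES|QL. The following is a minimal Python sketch, not the shipped query: the helper names and the numbers are hypothetical, and Python's `//` stands in for long-by-long division while `float()` stands in for TO_DOUBLE.

```python
def utilization_truncated(used: int, limit: int) -> int:
    # Integer division, analogous to dividing two longs: the fraction is discarded.
    return used // limit


def utilization_double(used: int, limit: int) -> float:
    # Convert one operand first, analogous to wrapping it in TO_DOUBLE.
    return float(used) / limit


# Hypothetical peak one-minute token usage and configured limit.
used, limit = 85_000, 100_000

print(utilization_truncated(used, limit))  # 0
print(utilization_double(used, limit))     # 0.85
```

With the truncating form, a threshold of 0.8 is never crossed until usage reaches the limit itself, which is why the alert was ineffective.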

The peak one-minute calculation depends on the usage streams running with
bucket_width set to 1m, which is the default. The package is bumped to
2.2.0 and format_version to 3.5.7 for alerting_rule_template support.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Screenshots

@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

Elastic Docs Style Checker (Vale)

Summary: 1 warning, 3 suggestions found

⚠️ Warnings (1): Fix when the suggestion improves clarity or correctness.

| File | Line | Rule | Message |
| --- | --- | --- | --- |
| packages/openai/_dev/build/docs/README.md | 50 | Elastic.QuotesPunctuation | Place punctuation inside closing quotation marks. |

💡 Suggestions (3): Optional style improvements. Apply when helpful.

| File | Line | Rule | Message |
| --- | --- | --- | --- |
| packages/openai/_dev/build/docs/README.md | 107 | Elastic.HeadingColons | Capitalize ': r'. |
| packages/openai/_dev/build/docs/README.md | 109 | Elastic.WordChoice | Consider using 'can, might' instead of 'may', unless the term is in the UI. |
| packages/openai/_dev/build/docs/README.md | 151 | Elastic.Ellipses | In general, don't use an ellipsis. |

The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to Elastic style guide for Vale.

@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

TL;DR

The Buildkite failure is real, but the available log payload only contains teardown/output-upload lines and does not include the failing command/test error itself. Immediate next step: rerun the failed Check integrations openai job (or share the full step log/JUnit failure section) so the actual root cause can be identified.

Remediation

  • Re-run Check integrations openai and capture the first failure block (the section before teardown starts), especially the test name and assertion/error text.
  • If available, inspect and share the failing JUnit artifact content from build/test-results/openai-*.xml (the uploaded files indicate one of these contains the real failure).
Investigation details

Root Cause

The provided job log (/tmp/gh-aw/buildkite-logs/integrations-check-integrations-openai.txt) is truncated to cleanup + artifact upload output; it does not contain the original failing stack trace/assertion, so a code-level root cause cannot be proven from available evidence.

Evidence

  • Build: https://buildkite.com/elastic/integrations/builds/44062
  • Job/step: Check integrations openai
  • Key log excerpt:
    • --- [openai] failed (/tmp/gh-aw/buildkite-logs/integrations-check-integrations-openai.txt:71)
    • 🚨 Error: The command exited with status 1 (...:74)
    • user command error: exit status 1 (...:76)
    • Remaining lines are artifact upload + stack teardown only.

Verification

  • Attempted a local repro of the CI script with .buildkite/scripts/test_one_package.sh packages/openai origin/main bc03335307d316073d5af257c78f13ef47115df6, but the local run exits early due to missing CI env/tooling (YQ_VERSION unbound).
  • Could not retrieve PR metadata/comments via the GitHub read tool in this run due to integrity gating, so deduplication against prior detective comments could not be confirmed.

Follow-up

Once the full failing section (or failing JUnit XML contents) is available, I can provide a precise root cause classification and patch-level fix recommendation.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

What is this? | From workflow: PR Buildkite Detective

Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not.

@shmsr shmsr changed the title [openai] Add rate limits data stream with headroom dashboard panel an… [openai] Add rate limits data stream with headroom dashboard panel and alert Jun 3, 2026
@shmsr shmsr added the enhancement New feature or request label Jun 3, 2026
@andrewkroh andrewkroh added dashboard Relates to a Kibana dashboard bug, enhancement, or modification. documentation Improvements or additions to documentation. Applied to PRs that modify *.md files. Integration:openai OpenAI labels Jun 3, 2026
stefans-elastic and others added 7 commits June 3, 2026 15:06
The README documented "images ↔ images per minute" as a compared
headroom dimension, but the Rate limit headroom panel only computed
RPM and TPM, so max_images_per_1_minute was collected and documented
yet never shown.

Add image headroom to the panel: ES|QL now computes peak per-minute
image usage (SUM of openai.images.images) against
max_images_per_1_minute, gated independently so images don't inflate
TPM/RPM, plus three table columns (Images used/limit/utilization).

Audio stays usage-only by design: the limit is in megabytes but the
Usage API reports audio only in seconds/characters, so there is no
comparable usage figure. Clarify the README deferral note accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@elastic-vault-github-plugin-prod

🚀 Benchmarks report

To see the full report, comment with /test benchmark fullreport

stefans-elastic and others added 5 commits June 4, 2026 15:19
OpenAI's Usage API reports dated model snapshots (e.g. gpt-image-1-2025-04-23,
omni-moderation-2024-09-26), while the Rate Limits API often lists only the base
name (gpt-image-1). The headroom dashboard panels and the alert rule joined usage
to limits on the exact model string, so usage for a snapshot without a matching
rate-limit row was dropped from headroom entirely (verified live: gpt-image token
usage was invisible on the panel).

Strip a trailing -YYYY-MM-DD snapshot suffix on both sides of the join so usage
collapses to its base family and joins to the limit. Applied to all four dashboard
panel queries and the alert rule template. Re-verified against live data: gpt-image-1
token usage now joins to its limit, and moderation utilization is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
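
The suffix stripping in the commit above can be illustrated in Python. This is a sketch rather than the shipped ES|QL: it reuses the `-[0-9]{4}-[0-9]{2}-[0-9]{2}$` pattern quoted later in this conversation, while `base_model`, the project id, and the limit values are hypothetical.

```python
import re

# Trailing -YYYY-MM-DD snapshot suffix, anchored at the end of the model name.
SNAPSHOT_SUFFIX = re.compile(r"-[0-9]{4}-[0-9]{2}-[0-9]{2}$")


def base_model(model: str) -> str:
    """Collapse a dated snapshot name to its base family name."""
    return SNAPSHOT_SUFFIX.sub("", model)


# Hypothetical limits keyed by (project_id, base model name).
limits = {("proj_a", "gpt-image-1"): 50_000}

# Hypothetical usage row reported under a dated snapshot name.
usage_row = {"project_id": "proj_a", "model": "gpt-image-1-2025-04-23", "tokens": 1_200}

exact_key = (usage_row["project_id"], usage_row["model"])
normalized_key = (usage_row["project_id"], base_model(usage_row["model"]))

print(limits.get(exact_key))       # None: the exact-string join finds no limit
print(limits.get(normalized_key))  # 50000: the normalized join matches
```

Names without a dated suffix pass through unchanged, so applying the function to both sides of the join is safe.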
OpenAI's Usage API finalizes per-minute buckets with a multi-minute
delay: a bucket's request/token counts keep climbing for several minutes
after it ends. The usage CEL only guarded the single newest bucket, so
any bucket that was no longer newest but still finalizing was emitted
partial, and the cursor advanced past it -- a permanent undercount that
did not self-correct (most visible under bursts).

Widen the partial-skip guard across all 6 usage streams (completions,
embeddings, moderations, images, audio_speeches, audio_transcriptions):

- New `finalization_grace` config var (default 15m).
- events now emit only buckets with end_time <= now - finalization_grace.
- The cursor is held at min(start_time) of the still-finalizing buckets
  so they are re-queried (start_time is inclusive) and emitted once
  final. Emitted buckets are strictly older than the held cursor, so
  they are never re-fetched and never double-counted -- no dedup needed.

Docs updated with a Finalization grace period section and corrected
collection-process steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
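
The guard and cursor rule described above can be sketched in Python. This is an illustration under assumptions, not the CEL program: `split_buckets` is a hypothetical helper, buckets are plain dicts carrying only `start_time` and `end_time`, and the case with no still-finalizing buckets returns `None` because the normal cursor advance is not modeled here.

```python
from datetime import datetime, timedelta, timezone


def split_buckets(buckets, now, grace):
    """Return (final_buckets, held_cursor) for one poll.

    A bucket is final when end_time <= now - grace. The cursor is held at
    the earliest start_time among buckets that are still finalizing.
    """
    cutoff = now - grace
    final = [b for b in buckets if b["end_time"] <= cutoff]
    pending = [b for b in buckets if b["end_time"] > cutoff]
    held_cursor = min(b["start_time"] for b in pending) if pending else None
    return final, held_cursor


now = datetime(2026, 6, 4, 12, 30, tzinfo=timezone.utc)
# Ten hypothetical one-minute buckets covering 12:10 to 12:20.
buckets = [
    {"start_time": now - timedelta(minutes=m), "end_time": now - timedelta(minutes=m - 1)}
    for m in range(20, 10, -1)
]

final, cursor = split_buckets(buckets, now, timedelta(minutes=15))
print(len(final))  # 5
print(cursor)      # 2026-06-04 12:15:00+00:00
```

Every emitted bucket starts before the held cursor, so the next query (inclusive of the cursor) re-fetches only the buckets that were held back.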
The issue-07 fix seeded finalization_grace (and initial_interval) in the
CEL state: block and read state.finalization_grace on every poll, but the
program's returned state map never re-emitted those keys. Elastic's CEL
input replaces persisted state with the returned map (the state: block only
seeds the first run), so from the second poll onward state.finalization_grace
was absent and every evaluation failed with "no such key: finalization_grace"
(event.kind: pipeline_error), ingesting zero usage across all six streams.

Re-emit finalization_grace and initial_interval in both returned objects
(success and error path) of every usage stream's cel.yml.hbs, mirroring the
existing access_token persistence pattern.

Verified live: agent resumed with 0 errors and backfilled the burst; per-bucket
ES SUM(num_model_requests) equals the direct OpenAI usage API on all six
streams (0 mismatches).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
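
The failure mode above follows from the returned state replacing the persisted state. A toy Python model of that behaviour, assuming nothing about the real input beyond what the commit message states (`run_polls`, `program_buggy`, and `program_fixed` are hypothetical names):

```python
def run_polls(program, seed_state, polls):
    """Toy model: the map returned by each evaluation replaces the persisted state."""
    state = dict(seed_state)  # the seed is only used for the first run
    for _ in range(polls):
        state = program(state)
    return state


def program_buggy(state):
    grace = state["finalization_grace"]  # raises KeyError from the second poll on
    return {"cursor": state.get("cursor", 0) + 1}


def program_fixed(state):
    grace = state["finalization_grace"]
    # Re-emit the setting so that it survives into the next poll.
    return {"cursor": state.get("cursor", 0) + 1, "finalization_grace": grace}


seed = {"finalization_grace": "15m"}
print(run_polls(program_fixed, seed, 3))

try:
    run_polls(program_buggy, seed, 2)
except KeyError as err:
    print("second poll failed, missing key:", err)
```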
…L eval

filebeat's CEL input ends a periodic cycle as soon as an evaluation emits
zero events, before it inspects want_more. The old two-phase design listed
projects in an evaluation that returned no events (events: [], want_more: true),
so that empty eval ended the cycle and rate-limit draining only ran on the next
interval -- producing every-other-interval collection. Resetting the drain
phase's terminal state (prior attempts) could not help, because the blocker was
the empty listing eval upstream.

Fold listing and emitting into the same evaluation: a LOAD step pages the
project list and immediately drains the first newly discovered project's first
rate-limit page, and a DRAIN step pages each project's remaining rate limits.
Every productive evaluation now emits at least one event, so the want_more
chain completes within a single interval. An accumulating worklist with a next
index plus projects_done/project_after cursors handle project-list pagination,
and a top-level want_more==false reset starts each fresh cycle clean.

Verified with the rate_limits system test (batch_size: 1, exercising both
project-list and per-project rate-limit pagination): all expected docs are
collected in one want_more chain per interval. rate_limits ships unreleased in
2.2.0, so no changelog entry is needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
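
The difference between the two designs can be seen in a toy Python model of one periodic cycle. It is a simplification under stated assumptions: `run_cycle` only mimics the "stop on an empty batch before checking want_more" behaviour described above, the project and rate-limit names are hypothetical, and each project's rate limits are assumed to fit on one page.

```python
# Hypothetical projects and their rate-limit records.
PROJECTS = {"proj_a": ["rl1", "rl2"], "proj_b": ["rl3"]}


def run_cycle(evaluate, state):
    """One periodic cycle: an empty batch ends it before want_more is read."""
    collected = []
    while True:
        state = evaluate(state)
        if not state["events"]:
            break
        collected.extend(state["events"])
        if not state["want_more"]:
            break
    return collected, state


def two_phase(state):
    if "worklist" not in state:
        # Listing phase: builds the worklist but emits nothing.
        return {"worklist": list(PROJECTS), "events": [], "want_more": True}
    head, rest = state["worklist"][0], state["worklist"][1:]
    return {"worklist": rest, "events": PROJECTS[head], "want_more": bool(rest)}


def folded(state):
    # A fresh cycle (want_more not set) lists projects and drains the first
    # one in the same evaluation, so every evaluation emits events.
    worklist = state["worklist"] if state.get("want_more") else list(PROJECTS)
    head, rest = worklist[0], worklist[1:]
    return {"worklist": rest, "events": PROJECTS[head], "want_more": bool(rest)}


first, carried = run_cycle(two_phase, {})
second, _ = run_cycle(two_phase, carried)
print(first, second)             # [] ['rl1', 'rl2', 'rl3']
print(run_cycle(folded, {})[0])  # ['rl1', 'rl2', 'rl3']
```

The two-phase program needs two intervals to deliver anything, while the folded program completes its chain in one.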
stefans-elastic and others added 10 commits June 5, 2026 13:09
…ic/integrations into openai-rate-limit-headroom
… works

The rate_limits pipeline renamed message into openai.rate_limits.results and
parsed from there. This broke two paths:

- Logstash/reroute (event.original pre-set): message was removed and
  openai.rate_limits.results never populated, dropping every parsed field.
- Agent with preserve_original_event=true: event.original was never created,
  making the option a silent no-op.

Adopt the canonical pattern (audit stream + 38 other pipelines): rename
message -> event.original, parse event.original -> openai.rate_limits, and
remove event.original unless the preserve_original_event tag is present. The
default path output is unchanged; the raw copy is kept only when opted in.

Add a pipeline test asserting both event.original and the parsed
openai.rate_limits.* fields are present on the preserve path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
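
The rename, parse, and conditional-remove sequence described above can be sketched in Python. This is an illustration, not the ingest pipeline: documents are flat dicts with dotted keys, `process` is a hypothetical function, and the JSON payload is made up.

```python
import json


def process(doc, tags):
    doc = dict(doc)  # work on a copy
    # Rename message -> event.original unless event.original was pre-set upstream.
    if "event.original" not in doc:
        doc["event.original"] = doc.pop("message")
    else:
        doc.pop("message", None)
    # Always parse from event.original, never from message.
    doc["openai.rate_limits"] = json.loads(doc["event.original"])
    # Keep the raw copy only when the user opted in.
    if "preserve_original_event" not in tags:
        del doc["event.original"]
    return doc


raw = '{"model": "gpt-image-1", "max_images_per_1_minute": 5}'

print(process({"message": raw}, []))                           # default agent path
print(process({"message": raw}, ["preserve_original_event"]))  # raw copy kept
print(process({"event.original": raw}, []))                    # pre-set event.original
```

All three paths end up with the parsed fields, and only the opted-in path retains `event.original`.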
The org-wide headroom panel summed per-project rate limits within each
1-minute bucket, then took the MAX across minutes. Because rate_limits
docs are periodic snapshots, the per-minute project set depends on poll
timing, so the denominator could undercount aggregate capacity and swing
between polls when snapshots landed in different minutes.

Stabilize each project's limit as the MAX over the look-back window
first (matching the per-project panel), then sum by model. Usage becomes
sum-of-per-project-peaks. Both numerator and denominator are now
poll-timing independent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
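
A small Python sketch shows why the order of aggregation matters for the limit column. The function names, project ids, model name, and limits are hypothetical; the two functions model the old (sum per minute, then MAX) and new (MAX per project, then sum) approaches described above.

```python
from collections import defaultdict


def sum_per_minute_then_max(snapshots):
    """Old approach: sum limits per (minute, model), then take the MAX minute."""
    per_minute = defaultdict(int)
    for minute, _project, model, limit in snapshots:
        per_minute[(minute, model)] += limit
    result = defaultdict(int)
    for (_minute, model), total in per_minute.items():
        result[model] = max(result[model], total)
    return dict(result)


def max_per_project_then_sum(snapshots):
    """New approach: stabilize each project's limit first, then sum by model."""
    per_project = defaultdict(int)
    for _minute, project, model, limit in snapshots:
        per_project[(project, model)] = max(per_project[(project, model)], limit)
    result = defaultdict(int)
    for (_project, model), limit in per_project.items():
        result[model] += limit
    return dict(result)


# Two projects whose snapshots landed in different minutes: (minute, project, model, limit).
snapshots = [(0, "proj_a", "gpt-4o", 1000), (1, "proj_b", "gpt-4o", 2000)]

print(sum_per_minute_then_max(snapshots))   # {'gpt-4o': 2000}
print(max_per_project_then_sum(snapshots))  # {'gpt-4o': 3000}
```

The second result does not change if both snapshots happen to land in the same minute, which is the poll-timing independence the commit aims for.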
The README and the rate-limit-headroom alert investigation guide claimed
limits and usage join on the exact project_id and model strings "without
any normalization", but the shipped alert ES|QL and both dashboard
headroom panels normalize model via
REPLACE(<model>, "-[0-9]{4}-[0-9]{2}-[0-9]{2}$", "").

Rewrite the docs to describe the implemented contract: join on exact
project_id with model normalized on both sides (strip trailing
-YYYY-MM-DD), so dated usage snapshots match base rate-limit names. Also
correct the aggregation note to reflect the MAX-per-bucket dedup of
identical family/snapshot limits. Updates the generated README, its build
template, and the alert blob.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The prebuilt rate-limit headroom alert capped its ES|QL query at
| LIMIT 100. With groupBy: row, that explicit limit is the binding cap
on alert instances, and the ES|QL executor path never sets
groupAggCount, so groups beyond the top 100 were dropped with no
truncation warning. termField/termSize are ignored on the ES|QL path,
so termSize: 100 was never the real constraint.

Raise the cap to 1000 to align with the alerting max-alerts circuit
breaker (xpack.alerting.rules.run.alerts.max); at that value truncation
surfaces a warning instead of silently dropping breaching project/model
pairs. SORT tpm_utilization DESC stays ahead of the LIMIT so the worst
offenders survive any truncation. Align the inert termSize to 1000 and
document the cap and large-org tuning in the investigation guide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
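
The reason for keeping the sort ahead of the limit can be shown with a few hypothetical rows. This Python sketch is only an analogy for SORT followed by LIMIT; `top_breaches` and the sample data are not part of the rule.

```python
def top_breaches(rows, cap):
    # Sort by utilization, highest first, and only then apply the cap.
    ranked = sorted(rows, key=lambda r: r["tpm_utilization"], reverse=True)
    return ranked[:cap]


rows = [
    {"group": "proj_a::model_x", "tpm_utilization": 0.82},
    {"group": "proj_b::model_y", "tpm_utilization": 0.99},
    {"group": "proj_c::model_z", "tpm_utilization": 0.91},
]

print([r["group"] for r in top_breaches(rows, 2)])
# ['proj_b::model_y', 'proj_c::model_z']
```

If the cap were applied before sorting, which rows survive would depend on input order rather than severity.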
filebeat's CEL input ends a polling cycle as soon as an evaluation emits
zero events, before it reads want_more. Three paths in the rate_limits CEL
could return [] mid want_more chain -- a project with no configured rate
limits (LOAD and DRAIN branches) and an empty project-list page that still
has more pages -- each re-introducing the every-other-interval stall the
data stream was designed to avoid.

Emit a dropped keep-alive sentinel instead of [] whenever the chain still
wants to continue, and drop the sentinel in the ingest pipeline so it keeps
the chain alive in the agent without ever being indexed. Pipeline test
covers the drop (rendered as null in expected output).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
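
The sentinel idea above can be sketched in Python. The marker field `keep_alive` and both function names are hypothetical, since the commit message does not name the sentinel's actual shape; the sketch only shows an empty batch being replaced while the chain continues, and the marker being filtered out before indexing.

```python
# Hypothetical sentinel shape; the real marker field is not given in this conversation.
SENTINEL = {"keep_alive": True}


def emit(events, want_more):
    """Never return an empty batch while the chain still wants to continue."""
    if not events and want_more:
        return [dict(SENTINEL)]
    return events


def ingest(events):
    """Stand-in for the pipeline drop: sentinels are never indexed."""
    return [e for e in events if not e.get("keep_alive")]


batch = emit([], want_more=True)  # e.g. a project with no configured rate limits
print(batch)          # [{'keep_alive': True}]
print(ingest(batch))  # []
print(ingest(emit([{"model": "gpt-image-1"}], want_more=True)))
```

An empty batch at the end of the chain (`want_more` false) is still returned as-is, since ending the cycle there is the intended outcome.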
The finalization grace period adds ~15m of latency to all usage streams
and the headroom panels/alert that read from them, but the README only
described the undercount trade-off. Document the concrete latency cost,
that it applies to every usage stream, where to change it (Advanced
options), and that 0s disables it for fresher-but-undercounted data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The README described the org-wide usage rollup as sum-then-peak (sum
across projects per 1-minute bucket, then take the peak minute), but the
shipped ES|QL does peak-then-sum (each project's peak minute first, then
sum across projects). Reconcile the docs to the query: describe it as an
indicative upper bound and align the limit-column wording with the
issue-10 stabilize-then-sum implementation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
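
The "indicative upper bound" wording follows from the order of operations, which a short Python sketch makes concrete. The function names and usage figures are hypothetical; usage is modeled as tokens per minute for each project.

```python
def peak_then_sum(usage):
    """Each project's peak minute first, then the sum across projects."""
    return sum(max(per_minute.values()) for per_minute in usage.values())


def sum_then_peak(usage):
    """Sum across projects per minute first, then the peak minute."""
    minutes = set().union(*[per_minute.keys() for per_minute in usage.values()])
    return max(
        sum(per_minute.get(m, 0) for per_minute in usage.values()) for m in minutes
    )


# Hypothetical tokens per minute; the two projects peak in different minutes.
usage = {
    "proj_a": {0: 100, 1: 10},
    "proj_b": {0: 20, 1: 80},
}

print(peak_then_sum(usage))  # 180
print(sum_then_peak(usage))  # 120
```

Peak-then-sum can never be lower than sum-then-peak, and it is strictly higher whenever projects peak in different minutes, so it overstates the true simultaneous peak in that case.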
@elastic-vault-github-plugin-prod

✅ All changelog entries have the correct PR link.

@elasticmachine

💚 Build Succeeded

History

cc @stefans-elastic


Labels

dashboard Relates to a Kibana dashboard bug, enhancement, or modification. documentation Improvements or additions to documentation. Applied to PRs that modify *.md files. enhancement New feature or request Integration:openai OpenAI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants