🔴 Required Information
Describe the Bug:
FeatureName.PROGRESSIVE_SSE_STREAMING is default-on in ADK 1.28+, and utils/streaming_utils.py:266-296 correctly emits
function-call parts as partial=True LlmResponses as they arrive, for the Gemini-native path. But models/lite_llm.py has its
own inline streaming loop in _generate_content_async (~line 2321) that bypasses StreamingResponseAggregator entirely and
silently buffers FunctionChunks into a local dict until LiteLLM sends finish_reason="tool_calls". The result: when an LLM is
routed through LiteLlm (any non-Gemini model: Anthropic, OpenAI, Mistral, etc.), there are zero events emitted between the last
text chunk and the final aggregated function call. For tools with large argument schemas (e.g. a LongRunningFunctionTool whose
arg is a deeply nested object), that gap is regularly 10-20 seconds of dead air, with no signal that the model is still
working. The TextChunk branch of the same loop yields per token, so the asymmetry is visible side by side.
Steps to Reproduce:
1. pip install google-adk==1.28.1 litellm==1.83.0
2. Configure any non-Gemini model via LiteLlm (we hit it on openai/gpt-4o-mini and anthropic/claude-sonnet-4-5 routed through a LiteLLM proxy).
3. Define an LlmAgent whose tool list contains a LongRunningFunctionTool with a non-trivial argument schema (a nested object with ~7 fields reproduces it every time).
4. Wrap the agent in App(resumability_config=ResumabilityConfig(is_resumable=True)) so the long-running pause path is active.
5. Run the agent via runner.run_async(...) and print the inter-event delta for each yielded event.
6. Send a message that causes the model to call the long-running tool.
Full minimal reproduction code is in the Minimal Reproduction Code section below.
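The "print the inter-event delta" part of the reproduction can be factored into a small generic wrapper. This is an illustrative sketch, not ADK API: the `timed` helper and the dict-shaped events are ours, and `fake_events()` stands in for the stream from `runner.run_async(...)`:

```python
import asyncio
import time


async def timed(events):
    """Wrap any async event stream, yielding (delta_seconds, event) pairs."""
    last = time.monotonic()
    async for event in events:
        now = time.monotonic()
        yield now - last, event
        last = now


async def fake_events():
    # Stand-in for runner.run_async(...): a few text deltas, a stall,
    # then the final aggregated event (the stall mimics the dead air).
    for text in ("I", " need", " details."):
        yield {"partial": True, "text": text}
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.05)
    yield {"partial": False, "text": ""}


async def main():
    deltas = []
    async for delta, event in timed(fake_events()):
        deltas.append(delta)
        print(f"+{delta:.2f}s partial={event['partial']} text={event['text']!r}")
    return deltas
```

Against a real run, any delta far above the per-token cadence marks the buffered function-call window.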
Expected Behavior:
Function-call argument deltas should stream as partial=True events while the model is generating them, the same way the
Gemini-native path already behaves under PROGRESSIVE_SSE_STREAMING. Concretely, between the last text token and the final
aggregated function call, there should be a sequence of partial events whose event.get_function_calls() is truthy and whose event.partial is True. The final aggregated event (partial=False) should still fire on finish_reason="tool_calls",
identical to today.
The Gemini-native trace (for comparison) looks like:
Observed Behavior:
Through the LiteLlm adapter, no events fire between the last text chunk and the final aggregated function call. Every reproduction
shows 10-20 seconds of total silence on the event stream:
+0.00s partial=True fcs=False text='I'
+0.05s partial=True fcs=False text=' need'
... (text deltas stream normally over ~1.4s) ...
+0.05s partial=True fcs=False text=' details.'
+16.78s partial=False fcs=True text=''   <- 16.78s of dead air, no events
+0.01s done
The function call arrives intact in the final event β the data is fine. The problem is the absence of any partial events during
the model's argument-generation phase. From a user-facing application's perspective, the agent appears frozen.
Environment Details:
ADK Library Version (pip show google-adk): 1.28.1
Desktop OS: Linux
Python Version (python -V): Python 3.14.3
Model Information:
Are you using LiteLLM: Yes
Which model is being used: reproduced on openai/gpt-4o-mini and anthropic/claude-sonnet-4-5, both routed through a LiteLLM
proxy
💡 Optional Information
Regression:
N/A; this has never worked through the LiteLlm adapter. PROGRESSIVE_SSE_STREAMING was introduced and made default-on (see #3974), but only the Gemini-native code path (utils/streaming_utils.py) was updated. models/lite_llm.py has its own parallel
implementation that was never wired to the feature flag.
Logs:
Real SSE event trace from a production run (timestamps are wall-clock, not deltas), captured at the application's SSE consumer
right outside runner.run_async:
The 16.79-second gap between 10:22:01.837 and 10:22:18.623 is the model streaming function-call argument deltas to LiteLLM.
LiteLLM is faithfully forwarding them; ADK's lite_llm.py is buffering them silently.
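For reference, the quoted figure is just the difference of the two wall-clock stamps:

```python
from datetime import datetime

fmt = "%H:%M:%S.%f"
start = datetime.strptime("10:22:01.837", fmt)
end = datetime.strptime("10:22:18.623", fmt)
gap = (end - start).total_seconds()  # ~16.79 seconds of silence
```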
Screenshots / Video:
N/A
Additional Context:
Root cause
google/adk/models/lite_llm.py's _generate_content_async (around line 2321 in 1.28.1) has its own streaming aggregation that
doesn't go through StreamingResponseAggregator.process_response:
# google/adk/models/lite_llm.py:2321
async for part in await self.llm_client.acompletion(**completion_args):
  for chunk, finish_reason in _model_response_to_chunk(part):
    if isinstance(chunk, FunctionChunk):
      index = chunk.index or fallback_index
      if index not in function_calls:
        function_calls[index] = {"name": "", "args": "", "id": None}
      if chunk.name:
        function_calls[index]["name"] += chunk.name
      if chunk.args:
        function_calls[index]["args"] += chunk.args  # <- buffer only
      ...
      # NO yield here: the partial function call is invisible to callers
    elif isinstance(chunk, TextChunk):
      text += chunk.text
      yield _message_to_generate_content_response(  # <- text DOES yield
          ChatCompletionAssistantMessage(role="assistant", content=chunk.text),
          is_partial=True,
          model_version=part.model,
      )
    ...
    if function_calls and (
        finish_reason == "tool_calls"
        or finish_reason == "length"
        or (finish_reason == "stop" and chunk is None)
    ):
      aggregated_llm_response_with_tool_call = _finalize_tool_call_response(...)
      # <- only here, after finish_reason, does the function call become an event
The asymmetry is structural: TextChunk yields immediately (good), FunctionChunk accumulates silently (bad). The PROGRESSIVE_SSE_STREAMING feature flag exists in the registry but is never checked in this file:
$ grep -n "PROGRESSIVE_SSE_STREAMING\|StreamingResponseAggregator" \
    google/adk/models/lite_llm.py
# (no matches)
Why it matters
Long-running tools become indistinguishable from a hung server. We use LongRunningFunctionTool to implement an
interactive question-form pattern (the model generates a question_flow schema, the framework pauses on the long-running call,
the frontend renders the form). The form's schema is ~1.5KB of JSON. The model spends 10-20s streaming those arg deltas, during
which we receive zero events from runner.run_async, and there's no callback we can hook to detect "the LLM is now generating
tool args." before_tool_callback fires after the args are complete, so it can't bracket the gap. From the frontend's
perspective, the chat just freezes.
It defeats the documented PROGRESSIVE_SSE_STREAMING contract. The feature is default-on, advertised as ADK's progressive
streaming behavior, and works for Gemini. Users routing through LiteLlm reasonably expect the same behavior; instead they get
silent buffering with no opt-out and no warning.
No external workaround is robust. The cleanest user-side workaround is wrapping LiteLlm.llm_client to inspect the
underlying LiteLLM stream chunks before they reach ADK's adapter, which works, but reaches into private SDK boundaries and is
brittle across upgrades. Anything else (FE timers, prompt-engineered "I'm about to call a tool" markers, periodic heartbeats) is a
guess about what's happening rather than a signal of what is happening.
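For concreteness, the heartbeat stopgap dismissed above looks roughly like this (generic asyncio, no ADK types; `with_heartbeat` is a hypothetical helper). Note it can only report that the stream has gone quiet, not that the model is mid-tool-call:

```python
import asyncio


async def with_heartbeat(events, interval=1.0, marker="heartbeat"):
    """Re-yield `events`, injecting `marker` whenever no real event has
    arrived for `interval` seconds. The marker is a guess that something
    is still running, not a signal of what the model is doing."""
    it = events.__aiter__()
    next_task = asyncio.ensure_future(it.__anext__())
    while True:
        done, _ = await asyncio.wait({next_task}, timeout=interval)
        if next_task in done:
            try:
                event = next_task.result()
            except StopAsyncIteration:
                return
            yield event
            next_task = asyncio.ensure_future(it.__anext__())
        else:
            yield marker
```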
Proposed fix
Make the FunctionChunk branch in _generate_content_async honor PROGRESSIVE_SSE_STREAMING by yielding a partial LlmResponse
containing the current accumulated state of the function call, the same way the TextChunk branch already yields per-token text:
from google.adk.features import FeatureName, is_feature_enabled

# ...
elif isinstance(chunk, FunctionChunk):
  index = chunk.index or fallback_index
  if index not in function_calls:
    function_calls[index] = {"name": "", "args": "", "id": None}
  if chunk.name:
    function_calls[index]["name"] += chunk.name
  if chunk.args:
    function_calls[index]["args"] += chunk.args
  function_calls[index]["id"] = (
      chunk.id or function_calls[index]["id"] or str(index)
  )
  if is_feature_enabled(FeatureName.PROGRESSIVE_SSE_STREAMING):
    # Mirror the TextChunk path: emit a partial LlmResponse so callers
    # can show progress while the model streams tool-call arguments.
    # Args may be partial JSON; consumers should only act on the
    # non-partial aggregated event downstream, the same as Gemini.
    yield _partial_function_call_response(
        function_calls[index],
        model_version=part.model,
    )
Plus a small helper:
def _partial_function_call_response(
    fc_state: dict, *, model_version: str | None
) -> LlmResponse:
  return LlmResponse(
      content=types.Content(
          role="model",
          parts=[
              types.Part(
                  function_call=types.FunctionCall(
                      id=fc_state["id"],
                      name=fc_state["name"] or None,
                      # args may not yet be valid JSON; pass through as-is.
                      # Consumers that need parsed args should wait for the
                      # final non-partial event.
                      args=fc_state["args"] or None,
                  )
              )
          ],
      ),
      partial=True,
      model_version=model_version,
  )
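The caveat in those comments, that args is not necessarily valid JSON mid-stream, is easy to check standalone (plain Python; the fragments are illustrative, not a real LiteLLM trace):

```python
import json

# Argument deltas as a model might stream them for one tool call.
deltas = ['{"quest', 'ion_flow": {"fie', 'lds": [1, 2, 3]}}']

acc = ""
states = []
for d in deltas:
    acc += d
    try:
        json.loads(acc)
        states.append("valid")
    except json.JSONDecodeError:
        states.append("partial")

# Only the fully accumulated string parses; every prefix is partial JSON,
# which is why partial events should be treated as display-only.
```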
Important properties this preserves:
- The aggregated final event is unchanged. _finalize_tool_call_response still fires on finish_reason="tool_calls" and emits the same partial=False event with parsed args. The new partial events are additional, not replacements.
- handle_function_calls_async is unaffected. It only fires on the non-partial event (per the existing "if model_response_event.partial: return" guard at flows/llm_flows/base_llm_flow.py:926-927), so handlers are not double-invoked. This matches what xuanyang15 confirmed in "ADK_ENABLE_PROGRESSIVE_SSE_STREAMING causes excessive tool use by Gemini 3" #3974: "the first function call is a partial event, where ADK doesn't call the function. Only the second one (non partial) will trigger an actually function call."
- The pause-and-resume path for LongRunningFunctionTool is unaffected. should_pause_invocation is checked against the non-partial final event, same as today.
- Default-on, but disable-able. ADK_DISABLE_PROGRESSIVE_SSE_STREAMING=1 already exists for users who want the old behavior. No new env var or API needed.
Happy to put up the PR if the maintainers agree this is the right shape.
Related issues
- Streaming LiteLLM model responses #932 (closed): original request for LiteLLM partial streaming. The resolution added partial events for TextChunk but not for FunctionChunk. This issue asks to complete that work for the function-call branch.
- LiteLLM Streaming Content Duplication in Tool Call Responses #3697 (open): a different bug in the same _generate_content_async block; text gets included twice when planning is enabled. Touches the same code but is about too much content surfacing, not too little. Listed for context, not as a duplicate.
- PROGRESSIVE_SSE_STREAMING is default-on and documented how partial function-call events behave on the Gemini path (Gemini-native code path, same bug family).
- finish_reason is "length" (max output tokens reached) #4482 (closed): another bug in the same _generate_content_async aggregation block (silent drop on finish_reason="length"). Confirms this area is fragile.
- LongRunningFunctionTool resume fails: unresolved pause check + streaming ID mismatch. Orthogonal but related; we hit this as well.

Minimal Reproduction Code:

How often has this issue occurred?: