Description
When using evaluate_categorical() on sessions that involve remote A2A agents (via ADK's TransferToAgentTool with RemoteA2aAgent), all such sessions receive parse_error=True with raw_response=None because AI.GENERATE returns NULL.
Root cause
The transcript-building SQL in categorical_evaluator.py uses this COALESCE to extract text from event content:
COALESCE(
  JSON_VALUE(content, '$.text_summary'),
  JSON_VALUE(content, '$.response'),
  JSON_VALUE(content, '$.tool'),
  ''
)
However, A2A_INTERACTION events (introduced in google/adk-python#5325) store the remote agent's response under a different JSON path:
{
  "artifacts": [{"parts": [{"kind": "text", "text": "The actual agent response..."}]}],
  "contextId": "...",
  "history": [...]
}
None of the existing JSON paths ($.text_summary, $.response, $.tool) match this structure. The transcript for A2A sessions looks like:
USER_MESSAGE_RECEIVED [knowledge_supervisor]: How many PTO days...
LLM_RESPONSE [knowledge_supervisor]: call: transfer_to_agent
TOOL_STARTING [knowledge_supervisor]: transfer_to_agent
A2A_INTERACTION [knowledge_supervisor]: <-- EMPTY, 3000+ bytes of content but nothing extracted
AGENT_COMPLETED [pto_agent]:
The evaluation model sees a question with no answer, cannot classify, and returns NULL.
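The miss can be illustrated outside BigQuery with a small Python sketch. This is only an approximation of the SQL: `extract_text` is a hypothetical helper that mimics the COALESCE over the three top-level keys (string values only), and the payload reuses the shape shown above with an empty `history` list as a placeholder.

```python
import json

# Top-level keys the current COALESCE chain tries, in order.
EXISTING_KEYS = ["text_summary", "response", "tool"]


def extract_text(content_json: str, keys=EXISTING_KEYS) -> str:
    """Approximate COALESCE(JSON_VALUE(content, '$.<key>'), ..., '').

    Simplification for this sketch: only string values count as a hit;
    missing keys and non-scalar values fall through, as with JSON_VALUE.
    """
    content = json.loads(content_json)
    for key in keys:
        value = content.get(key)
        if isinstance(value, str):
            return value
    return ""


# Payload shaped like the A2A_INTERACTION content above (placeholder values).
a2a_content = json.dumps({
    "artifacts": [{"parts": [{"kind": "text", "text": "The actual agent response..."}]}],
    "contextId": "...",
    "history": [],
})

extracted = extract_text(a2a_content)
nested_text = json.loads(a2a_content)["artifacts"][0]["parts"][0]["text"]

print(repr(extracted))    # the existing keys find nothing
print(repr(nested_text))  # the response sits under artifacts[0].parts[0].text
```

None of the three keys exists at the top level of the payload, so the fallback empty string is what ends up in the transcript.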
Impact
- 100% of A2A sessions get parse_error=True and category=None
- In our testing: 38-42 out of 100 sessions were A2A, so ~40% of all evaluations failed
- These show up as "UNKNOWN" in category distributions, inflating the parse error rate
Reproduction
from bigquery_agent_analytics import Client, CategoricalEvaluationConfig, TraceFilter

client = Client(project_id=..., dataset_id=..., table_id=...)
# Configure metrics...
report = client.evaluate_categorical(config=cat_config, filters=TraceFilter(limit=100))

# Check parse errors - all A2A sessions will have raw_response=None
for sr in report.session_results:
    for mr in sr.metrics:
        if mr.parse_error and mr.raw_response is None:
            print(f"NULL result: {sr.session_id}")
Proposed fix
Add JSON_VALUE(content, '$.artifacts[0].parts[0].text') to the COALESCE chain in all 4 transcript-building locations in categorical_evaluator.py:
COALESCE(
  JSON_VALUE(content, '$.text_summary'),
  JSON_VALUE(content, '$.response'),
  JSON_VALUE(content, '$.artifacts[0].parts[0].text'), -- NEW: A2A_INTERACTION events
  JSON_VALUE(content, '$.tool'),
  ''
)
Affected locations in categorical_evaluator.py:
- CATEGORICAL_TRANSCRIPT_QUERY (line ~230)
- CATEGORICAL_AI_GENERATE_QUERY (line ~256)
- build_ai_classify_query() (line ~376)
- build_ai_generate_query() (line ~445)
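Because the same chain is repeated in four places, one option is to generate it from a single list of paths so the locations cannot drift apart. The sketch below is only a suggestion: `TEXT_PATHS` and `build_text_coalesce` are hypothetical names, not existing code in categorical_evaluator.py.

```python
# Hypothetical single source of truth for the text-extraction paths, in priority order.
TEXT_PATHS = [
    "$.text_summary",
    "$.response",
    "$.artifacts[0].parts[0].text",  # A2A_INTERACTION events
    "$.tool",
]


def build_text_coalesce(column: str = "content", paths=TEXT_PATHS) -> str:
    """Build the COALESCE(...) SQL fragment, ending with the '' fallback."""
    args = [f"JSON_VALUE({column}, '{path}')" for path in paths]
    args.append("''")
    return "COALESCE(\n  " + ",\n  ".join(args) + "\n)"


fragment = build_text_coalesce()
print(fragment)
```

Note that the `[0]` indices only read the first part of the first artifact. If a remote agent returns its text in a later part, this path would still come back NULL, so the fix may need extending for such payloads.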