Categorical evaluation fails for A2A sessions — transcript builder missing A2A_INTERACTION content extraction #40

@evekhm

Description

When evaluate_categorical() is run on sessions that involve remote A2A agents (via ADK's TransferToAgentTool with RemoteA2aAgent), every such session comes back with parse_error=True and raw_response=None because AI.GENERATE returns NULL.

Root cause

The transcript-building SQL in categorical_evaluator.py uses this COALESCE to extract text from event content:

COALESCE(
  JSON_VALUE(content, '$.text_summary'),
  JSON_VALUE(content, '$.response'),
  JSON_VALUE(content, '$.tool'),
  ''
)

However, A2A_INTERACTION events (introduced in google/adk-python#5325) store the remote agent's response under a different JSON path:

{
  "artifacts": [{"parts": [{"kind": "text", "text": "The actual agent response..."}]}],
  "contextId": "...",
  "history": [...]
}

None of the existing JSON paths ($.text_summary, $.response, $.tool) match this structure. The transcript for A2A sessions looks like:

USER_MESSAGE_RECEIVED [knowledge_supervisor]: How many PTO days...
LLM_RESPONSE [knowledge_supervisor]: call: transfer_to_agent
TOOL_STARTING [knowledge_supervisor]: transfer_to_agent
A2A_INTERACTION [knowledge_supervisor]:              <-- EMPTY, 3000+ bytes of content but nothing extracted
AGENT_COMPLETED [pto_agent]:

The evaluation model sees a question with no answer, cannot classify the session, and returns NULL.
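
The path mismatch can be checked outside BigQuery. The sketch below is only an illustration, not the library's code: json_value and coalesce are hypothetical Python stand-ins that approximate BigQuery's JSON_VALUE and COALESCE for simple paths, and the contextId and history values are placeholders for the fields elided above.

```python
import json
import re

def json_value(doc, path):
    """Rough stand-in for JSON_VALUE: scalar at a path like '$.a[0].b', else None."""
    node = doc
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", path):
        if key:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            if not isinstance(node, list) or int(index) >= len(node):
                return None
            node = node[int(index)]
    # JSON_VALUE yields NULL for objects, arrays, and JSON null
    if node is None or isinstance(node, (dict, list)):
        return None
    return node if isinstance(node, str) else json.dumps(node)

def coalesce(*values):
    """Rough stand-in for COALESCE: first value that is not None."""
    return next((v for v in values if v is not None), None)

# Shape taken from the A2A_INTERACTION example; placeholder values are assumptions
payload = {
    "artifacts": [{"parts": [{"kind": "text", "text": "The actual agent response..."}]}],
    "contextId": "placeholder-context",
    "history": [],
}

existing_paths = ["$.text_summary", "$.response", "$.tool"]
a2a_path = "$.artifacts[0].parts[0].text"

# Current chain: every path misses, so the fallback '' is used
before = coalesce(*(json_value(payload, p) for p in existing_paths), "")
# Chain with the A2A path added: the remote agent's text is extracted
after = coalesce(*(json_value(payload, p) for p in existing_paths + [a2a_path]), "")

print(repr(before))
print(repr(after))
```

With the current paths the extracted text is the empty string, which matches the empty A2A_INTERACTION line in the transcript.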

Impact

  • 100% of A2A sessions get parse_error=True and category=None
  • In our testing: 38-42 out of 100 sessions were A2A, so ~40% of all evaluations failed
  • These show up as "UNKNOWN" in category distributions and inflate the parse error rate

Reproduction

from bigquery_agent_analytics import Client, CategoricalEvaluationConfig, TraceFilter

client = Client(project_id=..., dataset_id=..., table_id=...)
# Configure metrics (definitions elided)...
cat_config = CategoricalEvaluationConfig(...)
report = client.evaluate_categorical(config=cat_config, filters=TraceFilter(limit=100))

# Check parse errors - all A2A sessions will have raw_response=None
for sr in report.session_results:
    for mr in sr.metrics:
        if mr.parse_error and mr.raw_response is None:
            print(f"NULL result: {sr.session_id}")
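
To put a number on the failures, the same check can be folded into a small helper. This is a sketch under assumptions: it relies only on the attributes used in the loop above (session_results, session_id, metrics, parse_error, raw_response), null_result_sessions is a hypothetical name, and the report here is a hand-built stub rather than real output from evaluate_categorical().

```python
from types import SimpleNamespace

def null_result_sessions(report):
    """IDs of sessions with at least one metric that failed with no raw response."""
    return [
        sr.session_id
        for sr in report.session_results
        if any(mr.parse_error and mr.raw_response is None for mr in sr.metrics)
    ]

# Stub report for illustration only; a real report would come from the client
report = SimpleNamespace(session_results=[
    SimpleNamespace(
        session_id="a2a-session",
        metrics=[SimpleNamespace(parse_error=True, raw_response=None)],
    ),
    SimpleNamespace(
        session_id="local-session",
        metrics=[SimpleNamespace(parse_error=False, raw_response="category: ok")],
    ),
])

failed = null_result_sessions(report)
rate = len(failed) / len(report.session_results)
print(f"{len(failed)} of {len(report.session_results)} sessions returned NULL ({rate:.0%})")
```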

Proposed fix

Add JSON_VALUE(content, '$.artifacts[0].parts[0].text') to the COALESCE chain in all 4 transcript-building locations in categorical_evaluator.py:

COALESCE(
  JSON_VALUE(content, '$.text_summary'),
  JSON_VALUE(content, '$.response'),
  JSON_VALUE(content, '$.artifacts[0].parts[0].text'),   -- NEW: A2A_INTERACTION events
  JSON_VALUE(content, '$.tool'),
  ''
)

Affected locations in categorical_evaluator.py:

  1. CATEGORICAL_TRANSCRIPT_QUERY (line ~230)
  2. CATEGORICAL_AI_GENERATE_QUERY (line ~256)
  3. build_ai_classify_query() (line ~376)
  4. build_ai_generate_query() (line ~445)
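
Because the same expression is repeated in four places, one option is to build it once and reuse it, so a future path only has to be added in a single spot. The following is a hypothetical sketch: CONTENT_TEXT_PATHS and content_text_expr are illustrative names that do not exist in categorical_evaluator.py, and the module may well keep its queries as static strings instead.

```python
# Hypothetical shared definition; order matters because COALESCE takes the first non-NULL
CONTENT_TEXT_PATHS = (
    "$.text_summary",
    "$.response",
    "$.artifacts[0].parts[0].text",  # A2A_INTERACTION events
    "$.tool",
)

def content_text_expr(column="content"):
    """Build the COALESCE expression that extracts transcript text from an event."""
    parts = [f"JSON_VALUE({column}, '{path}')" for path in CONTENT_TEXT_PATHS]
    parts.append("''")  # final fallback so the expression is never NULL
    return "COALESCE(\n  " + ",\n  ".join(parts) + "\n)"

expr = content_text_expr()
print(expr)
```

Each of the four query templates could then interpolate the result of content_text_expr() rather than carrying its own copy of the chain.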
