Categorical evaluation fails for A2A sessions — transcript builder missing A2A_INTERACTION content extraction

### Description

When `evaluate_categorical()` runs on sessions that involve remote A2A agents (via ADK's `TransferToAgentTool` with `RemoteA2aAgent`), every such session comes back with `parse_error=True` and `raw_response=None` because `AI.GENERATE` returns NULL.

### Root cause

The transcript-building SQL in `categorical_evaluator.py` uses this COALESCE to extract text from event content:

```sql
COALESCE(
  JSON_VALUE(content, '$.text_summary'),
  JSON_VALUE(content, '$.response'),
  JSON_VALUE(content, '$.tool'),
  ''
)
```

However, `A2A_INTERACTION` events (introduced in [google/adk-python#5325](https://github.com/google/adk-python/pull/5325)) store the remote agent's response under a different JSON path:

```json
{
  "artifacts": [{"parts": [{"kind": "text", "text": "The actual agent response..."}]}],
  "contextId": "...",
  "history": [...]
}
```

None of the existing JSON paths (`$.text_summary`, `$.response`, `$.tool`) match this structure. The transcript for A2A sessions looks like:

```
USER_MESSAGE_RECEIVED [knowledge_supervisor]: How many PTO days...
LLM_RESPONSE [knowledge_supervisor]: call: transfer_to_agent
TOOL_STARTING [knowledge_supervisor]: transfer_to_agent
A2A_INTERACTION [knowledge_supervisor]:              <-- EMPTY, 3000+ bytes of content but nothing extracted
AGENT_COMPLETED [pto_agent]:
```

The evaluation model sees a question with no answer, cannot classify, and returns NULL.
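
The mismatch can be reproduced without BigQuery. The following is a minimal Python sketch of the COALESCE logic, not the library's code: `extract_text` and `EXISTING_KEYS` are hypothetical names, the `contextId` value is a placeholder, and scalar handling is simplified relative to `JSON_VALUE`. Run against the payload shape shown above, it falls through every existing path and returns the empty string.

```python
import json

# Top-level keys the current COALESCE checks, in order.
EXISTING_KEYS = ["text_summary", "response", "tool"]


def extract_text(content_json: str, keys=EXISTING_KEYS) -> str:
    """Rough stand-in for COALESCE(JSON_VALUE(content, '$.<key>'), ..., '')."""
    content = json.loads(content_json)
    for key in keys:
        value = content.get(key)
        # JSON_VALUE yields NULL for missing keys, JSON null, objects and arrays.
        if value is not None and not isinstance(value, (dict, list)):
            return str(value)
    return ""  # final '' fallback of the COALESCE


# Same shape as the A2A_INTERACTION content above; values are placeholders.
a2a_content = json.dumps({
    "artifacts": [{"parts": [{"kind": "text", "text": "The actual agent response..."}]}],
    "contextId": "ctx-1",
    "history": [],
})

print(repr(extract_text(a2a_content)))  # ''
```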

### Impact

- 100% of A2A sessions get `parse_error=True` and `category=None`
- In our testing: 38-42 out of 100 sessions were A2A, so ~40% of all evaluations failed
- These show up as "UNKNOWN" in category distributions, inflating the parse error rate

### Reproduction

```python
from bigquery_agent_analytics import Client, CategoricalEvaluationConfig, TraceFilter

client = Client(project_id=..., dataset_id=..., table_id=...)
# Configure metrics; `...` is a placeholder for your own metric definitions
cat_config = CategoricalEvaluationConfig(...)
report = client.evaluate_categorical(config=cat_config, filters=TraceFilter(limit=100))

# Check parse errors - all A2A sessions will have raw_response=None
for sr in report.session_results:
    for mr in sr.metrics:
        if mr.parse_error and mr.raw_response is None:
            print(f"NULL result: {sr.session_id}")
```
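
To quantify the failure rather than print each session, the same check can be folded into a small helper. This is a sketch under assumptions: `MetricResult` and `SessionResult` are hypothetical stand-ins that carry only the attributes used in the loop above (`session_id`, `metrics`, `parse_error`, `raw_response`), and the sample data is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetricResult:
    """Stand-in for one metric result; only the fields used above."""
    parse_error: bool
    raw_response: Optional[str]


@dataclass
class SessionResult:
    """Stand-in for one entry of report.session_results."""
    session_id: str
    metrics: List[MetricResult] = field(default_factory=list)


def null_result_sessions(session_results) -> List[str]:
    """Return IDs of sessions with at least one parse error and no raw response."""
    return [
        sr.session_id
        for sr in session_results
        if any(mr.parse_error and mr.raw_response is None for mr in sr.metrics)
    ]


# Illustrative data only: one failing (A2A-like) session, one healthy session.
sample = [
    SessionResult("a2a-session", [MetricResult(True, None)]),
    SessionResult("local-session", [MetricResult(False, '{"category": "ok"}')]),
]

failed = null_result_sessions(sample)
print(f"{len(failed)}/{len(sample)} sessions returned NULL: {failed}")
```

With a real report, passing `report.session_results` to `null_result_sessions` should give the same list the loop above prints, assuming the attribute names match.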

### Proposed fix

Add `JSON_VALUE(content, '$.artifacts[0].parts[0].text')` to the COALESCE chain in all 4 transcript-building locations in `categorical_evaluator.py`:

```sql
COALESCE(
  JSON_VALUE(content, '$.text_summary'),
  JSON_VALUE(content, '$.response'),
  JSON_VALUE(content, '$.artifacts[0].parts[0].text'),   -- NEW: A2A_INTERACTION events
  JSON_VALUE(content, '$.tool'),
  ''
)
```

Affected locations in `categorical_evaluator.py`:
1. `CATEGORICAL_TRANSCRIPT_QUERY` (line ~230)
2. `CATEGORICAL_AI_GENERATE_QUERY` (line ~256)
3. `build_ai_classify_query()` (line ~376)
4. `build_ai_generate_query()` (line ~445)
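
Since the same COALESCE appears in four places, one option is to generate it from a single ordered list of paths so the locations cannot drift apart. The sketch below is only a suggestion: `CONTENT_TEXT_PATHS` and `content_text_expr` are hypothetical names that do not exist in `categorical_evaluator.py` today.

```python
# Hypothetical shared definition; order matters because COALESCE takes the
# first non-NULL value.
CONTENT_TEXT_PATHS = (
    "$.text_summary",
    "$.response",
    "$.artifacts[0].parts[0].text",  # A2A_INTERACTION events
    "$.tool",
)


def content_text_expr(column: str = "content", paths=CONTENT_TEXT_PATHS) -> str:
    """Build the COALESCE(...) SQL fragment from an ordered list of JSON paths."""
    lines = [f"  JSON_VALUE({column}, '{path}')," for path in paths]
    return "COALESCE(\n" + "\n".join(lines) + "\n  ''\n)"


print(content_text_expr())
```

Each of the four queries could then interpolate `content_text_expr()` instead of repeating the literal SQL, so a future event type only needs one new entry in the tuple.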

