[FEATURE] Evals: thin scorer / experiment-item result surface for external evidence consumers

We just shipped a small sample on the Assay side that consumes a frozen Mastra scorer / experiment-item result as external evidence:

https://github.com/Rul1an/assay/tree/main/examples/mastra-scorer-evidence

The sample has a narrow purpose: we wanted to test the smallest honest Mastra reliability surface that an external evidence consumer could ingest without collapsing back into traces, Studio metrics, or dashboard semantics.

So we kept the shape intentionally small:
- scorer name
- score
- outcome
- dataset version ref
- item ref
- target type
- timestamp
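
For concreteness, here is a minimal sketch of how a consumer might model and strictly parse that shape. The field names (`scorerName`, `datasetVersionRef`, `itemRef`, `targetType`), the `ScorerEvidence` type, and the `parseScorerEvidence` helper are illustrative assumptions, not Mastra's API and not necessarily the exact keys in the checked-in fixture. The timestamp is assumed to be a date string that `Date.parse` accepts.

```typescript
// Hypothetical consumer-side model of the seven fields listed above.
// None of these names are confirmed by Mastra; they are assumptions.
interface ScorerEvidence {
  scorerName: string;
  score: number;
  outcome: string; // allowed values are not specified, so kept as a plain string
  datasetVersionRef: string;
  itemRef: string;
  targetType: string;
  timestamp: string; // assumed to be parseable by Date.parse (e.g. ISO 8601)
}

const STRING_FIELDS = [
  "scorerName",
  "outcome",
  "datasetVersionRef",
  "itemRef",
  "targetType",
  "timestamp",
] as const;

const ALLOWED_FIELDS = new Set<string>([...STRING_FIELDS, "score"]);

// Accepts only the bounded shape: unknown keys (trace ids, dashboard
// metrics, and so on) are rejected rather than silently carried along.
function parseScorerEvidence(input: unknown): ScorerEvidence {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    throw new Error("evidence must be a JSON object");
  }
  const record = input as Record<string, unknown>;

  for (const key of Object.keys(record)) {
    if (!ALLOWED_FIELDS.has(key)) {
      throw new Error(`unexpected field: ${key}`);
    }
  }

  for (const key of STRING_FIELDS) {
    const value = record[key];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`missing or empty string field: ${key}`);
    }
  }

  const score = record["score"];
  if (typeof score !== "number" || !Number.isFinite(score)) {
    throw new Error("score must be a finite number");
  }

  if (Number.isNaN(Date.parse(record["timestamp"] as string))) {
    throw new Error("timestamp is not a parseable date");
  }

  return record as unknown as ScorerEvidence;
}
```

Rejecting unknown keys is a deliberate choice in this sketch: it is what keeps the surface bounded instead of letting trace or dashboard fields leak in.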

We are not treating that artifact as Mastra truth, and we are not assuming the checked-in fixture shape is a stable wire contract.

The question is mainly about seam choice:

If an external evidence consumer wants the smallest honest Mastra reliability surface, is a bounded scorer / experiment-item result roughly the right place to start, or is there a thinner scorer result surface you would rather point them at?

If we are aiming at the wrong layer, happy to adjust the sample.

[FEATURE] Evals: thin scorer / experiment-item result surface for external evidence consumers #15206