We just shipped a small sample on the Assay side that consumes a frozen Mastra scorer / experiment-item result as external evidence:
https://github.com/Rul1an/assay/tree/main/examples/mastra-scorer-evidence
The reason for the sample is pretty narrow: we wanted to test the smallest honest Mastra reliability surface an external evidence consumer could ingest without collapsing back into traces, Studio metrics, or dashboard semantics.
So we kept the shape intentionally small:
- scorer name
- score
- outcome
- dataset version ref
- item ref
- target type
- timestamp
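
To make the bounded shape concrete, here is a minimal sketch of how a consumer might type and check those seven fields. The property names, the string/number typing, and the `parseScorerEvidence` helper are our own illustration, not the fixture's actual keys or anything Mastra exports. The timestamp is assumed to be a string that `Date.parse` can read.

```typescript
// Hypothetical names and types: the list above names the fields,
// but not their wire keys or value types.
interface ScorerEvidence {
  scorerName: string;
  score: number;
  outcome: string; // vocabulary left open; no pass/fail enum is assumed
  datasetVersionRef: string;
  itemRef: string;
  targetType: string;
  timestamp: string; // assumed to be a parseable date string
}

const REQUIRED_KEYS = [
  "scorerName",
  "score",
  "outcome",
  "datasetVersionRef",
  "itemRef",
  "targetType",
  "timestamp",
] as const;

// Accepts exactly the seven fields and nothing else, so trace IDs,
// Studio metrics, or dashboard fields cannot leak in unnoticed.
function parseScorerEvidence(input: unknown): ScorerEvidence {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    throw new Error("evidence must be a plain object");
  }
  const record = input as Record<string, unknown>;
  const allowed = new Set<string>(REQUIRED_KEYS);

  for (const key of Object.keys(record)) {
    if (!allowed.has(key)) {
      throw new Error(`unexpected field: ${key}`);
    }
  }

  for (const key of REQUIRED_KEYS) {
    const value = record[key];
    if (key === "score") {
      if (typeof value !== "number" || !Number.isFinite(value)) {
        throw new Error("score must be a finite number");
      }
    } else if (typeof value !== "string" || value.length === 0) {
      throw new Error(`${key} must be a non-empty string`);
    }
  }

  if (Number.isNaN(Date.parse(record.timestamp as string))) {
    throw new Error("timestamp must be a parseable date string");
  }

  return record as unknown as ScorerEvidence;
}
```

Rejecting unknown keys is a deliberate choice in this sketch: it keeps the ingested surface as small as the list above, at the cost of failing loudly if the upstream shape grows.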
We are not treating that artifact as Mastra truth, and we are not assuming the checked-in fixture shape is a stable wire contract.
The question is mainly about seam choice:
If an external evidence consumer wants the smallest honest Mastra reliability surface, is a bounded scorer / experiment-item result roughly the right place to start, or is there a thinner scorer result surface you would rather point them at?
If we are aiming at the wrong layer, we are happy to adjust the sample.