
Clarify validation semantics for omitted dataset dtype (e.g., VectorData) across APIs #1441

@ehennestad

Description

Summary

For datasets whose schema omits dtype, it is not clear what validation behavior should be considered normative across APIs.

A concrete example is hdmf_common/VectorData, which omits dtype in the schema. In Python HDMF, omitted dtype still has practical runtime restrictions based on build-time inference and backend handling, but those rules do not appear to be documented as schema-level validation requirements.
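To make the inference point concrete, here is an illustrative, self-contained sketch of the kind of build-time dtype inference an implementation might perform when the schema omits `dtype`. This is not Python HDMF's actual code; the function name `infer_dtype` and the type-to-dtype mapping are assumptions for illustration only.

```python
# Illustrative sketch only: NOT Python HDMF's actual implementation.
# Shows the kind of build-time dtype inference an API might apply
# when the schema omits dtype.

def infer_dtype(data):
    """Guess a primitive dtype from the first scalar element of `data`."""
    # Drill down through nested lists/tuples to a scalar sample.
    sample = data
    while isinstance(sample, (list, tuple)):
        if not sample:
            return None  # empty data: no dtype can be inferred
        sample = sample[0]
    # Map Python scalar types to schema-style primitive dtypes.
    if isinstance(sample, bool):  # bool must be checked before int
        return "bool"
    if isinstance(sample, int):
        return "int64"
    if isinstance(sample, float):
        return "float64"
    if isinstance(sample, str):
        return "text"
    return None  # e.g. a container object: no primitive dtype applies

print(infer_dtype([1, 2, 3]))       # int64
print(infer_dtype([[1.0], [2.0]]))  # float64
print(infer_dtype(object()))        # None
```

The point of the sketch is that such inference happens at build time and per backend, which is exactly the behavior that is currently not documented as a schema-level validation requirement.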

For another implementation such as MatNWB, this creates ambiguity about what should be accepted or rejected during validation.

Concrete problem in MatNWB

At the moment, the following is possible in MatNWB:

```matlab
types.hdmf_common.VectorData('data', types.hdmf_common.VectorData('data', 1))
```

This seems like it should be invalid: the dataset's data payload is itself another HDMF container object rather than storable leaf data.

However, because MatNWB currently does not perform type validation when dtype is omitted, this is accepted.

Question

What is the intended normative validation behavior for datasets with omitted dtype, especially concrete types such as VectorData?

More specifically:

  • Should omitted dtype be interpreted as "no fixed primitive dtype is prescribed by the schema", while still requiring the dataset payload to be valid leaf/storable data?
  • Should implementations reject HDMF container objects as direct dataset payloads unless the schema explicitly declares a reference dtype?
  • Which parts of Python HDMF's current behavior for omitted dtype are intentional cross-API semantics, and which parts are just implementation details of that API?

Why this matters

As a maintainer of another HDMF-based API, I need to know whether validation for omitted dtype should be:

  • schema-level permissive: no explicit dtype constraint, but still reject obviously non-dataset objects
  • Python-HDMF-compatible: mirror current inference/restriction behavior
  • something else
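The first option above ("schema-level permissive") could be sketched as a minimal check. This is a hypothetical helper, not an existing API; `Container` stands in for the implementation's container base class (e.g. an HDMF Container), and the function name is invented for illustration.

```python
# Minimal sketch of "schema-level permissive" validation: no primitive
# dtype constraint, but reject payloads that are obviously not dataset
# data. `Container` is a placeholder for the API's real container base
# class; this is an assumption for illustration.

class Container:
    """Placeholder for the API's group/dataset container base class."""

def validate_untyped_payload(data):
    """Accept leaf-ish data; reject container objects as payloads."""
    if isinstance(data, Container):
        raise TypeError(
            "dataset payload is a container object; an omitted dtype "
            "does not make container payloads valid"
        )
    if isinstance(data, (list, tuple)):
        for element in data:
            validate_untyped_payload(element)  # recurse into nested data
    return data

validate_untyped_payload([1, 2, 3])        # accepted
try:
    validate_untyped_payload(Container())  # rejected
except TypeError as err:
    print(err)
```

Under this policy, the MatNWB example above would be rejected even though no dtype is declared, while any nesting of primitive scalars would still pass.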

Right now, the schema appears intentionally generic, but the practical validation semantics are underspecified for alternative implementations.

Request

Could the intended semantics for omitted dataset dtype be clarified, especially for VectorData and similar concrete types?

If there is already an intended rule, documenting it in the schema language docs and/or hdmf-common docs would help other implementations validate consistently.

This issue text was drafted with the help of Codex (GPT-5.4).
