Explore integration with Icechunk data engine #5

@aufdenkampe

My vision for this package is that it would work seamlessly with a local and/or remote high-performance data catalog and store (i.e., a data engine). Presently, the Icechunk cloud-native transactional tensor storage engine is the most promising option, as it was recently open-sourced by EarthMover as the source code behind their ArrayLake services.

An ideal workflow would be:

  • The user requests a dataset from a well-known data repository for a specific area of interest.
    • These well-known data repos will be cataloged here in a YAML file, and optionally referenced with Kerchunk or VirtualiZarr.
  • This package first checks whether the requested dataset has already been fetched and saved to a local Icechunk instance.
  • If not, it fetches the dataset from the source repository, saving it locally in its native format.
  • If the user expects to reuse the data, they can choose to convert the dataset into an analysis-ready, cloud-optimized (ARCO) Zarr v3 dataset within Icechunk.
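
The check-then-fetch logic above could be sketched roughly as follows. This is only an illustration using the Python standard library, not a proposed implementation: `CATALOG`, `cache_key`, `LocalCache`, and the `fetch`/`convert` callables are all hypothetical names, the dict stands in for the YAML catalog, and plain directories stand in for the local Icechunk instance. The actual download and the Icechunk/Zarr write would be supplied through the two callables.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical catalog entry; the issue proposes keeping these in a YAML file.
CATALOG = {
    "example-dem": {"url": "https://example.org/dem", "format": "tif"},
}

BBox = tuple[float, float, float, float]  # (west, south, east, north)


def cache_key(dataset_id: str, bbox: BBox) -> str:
    """Derive a stable key from the dataset id and area of interest."""
    payload = json.dumps({"id": dataset_id, "bbox": list(bbox)}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


@dataclass
class LocalCache:
    root: Path
    # fetch(entry, bbox, dest) downloads the subset in its native format.
    fetch: Callable[[dict, BBox, Path], None]
    # convert(src, dest) writes the ARCO copy (e.g. into Icechunk); optional.
    convert: Callable[[Path, Path], None] | None = None

    def get(self, dataset_id: str, bbox: BBox, reuse: bool = False) -> Path:
        entry = CATALOG[dataset_id]
        key = cache_key(dataset_id, bbox)
        native = self.root / "native" / f"{key}.{entry['format']}"
        arco = self.root / "arco" / key

        # 1. Prefer an already converted ARCO copy.
        if arco.exists():
            return arco
        # 2. Fetch from the source repository only if nothing is cached.
        if not native.exists():
            native.parent.mkdir(parents=True, exist_ok=True)
            self.fetch(entry, bbox, native)
        # 3. Convert only when the user expects to reuse the data.
        if reuse and self.convert is not None:
            arco.parent.mkdir(parents=True, exist_ok=True)
            self.convert(native, arco)
            return arco
        return native
```

Keying the cache on both the dataset id and the bounding box means a repeated request for the same area is served locally, while a different area triggers a new fetch. Whether overlapping areas should share stored chunks is a design question this sketch leaves open.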
