
Commit c5d0658

fedorov and claude committed
fix: bring idc_dataset notebook into compliance with contribution guidelines
- Fix license header: use MONAI Consortium copyright, correct format with trailing double spaces and &nbsp; indentation, moved to top of first cell
- Move all imports (os, sys, itkwasm_dicom) into Setup imports cell; simplify Setup environment cell to pip install only
- Add README.md entry for idc_dataset under Modules section
- Add idc_dataset to doesnt_contain_max_epochs and skip_run_papermill in runner.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andrey Fedorov <andrey.fedorov@gmail.com>
1 parent 4b01df9 commit c5d0658

3 files changed

Lines changed: 11 additions & 209 deletions

File tree

README.md

Lines changed: 3 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -331,6 +331,8 @@ Illustrate reading NIfTI files and iterating over image patches of the volumes l
331331
This tutorial illustrates the flexible network APIs and utilities.
332332
##### [postprocessing_transforms](./modules/postprocessing_transforms.ipynb)
333333
This notebook shows the usage of several postprocessing transforms based on the model output of spleen segmentation task.
334+
##### [idc_dataset](./modules/idc_dataset.ipynb)
335+
This notebook shows how to query and download public cancer imaging data from NCI Imaging Data Commons (IDC) using `idc-index`, and how to load DICOM images and DICOM-SEG segmentations into MONAI for AI/ML preprocessing.
334336
##### [public_datasets](./modules/public_datasets.ipynb)
335337
This notebook shows how to quickly set up training workflow based on `MedNISTDataset` and `DecathlonDataset`, and how to create a new dataset.
336338
##### [tcia_csv_processing](./modules/tcia_csv_processing.ipynb)
@@ -386,4 +388,4 @@ Example shows the use cases of using MONAI to evaluate the performance of a gene
386388

387389
#### [VISTA2D](./vista_2d)
388390
This tutorial demonstrates how to train a cell segmentation model using the [MONAI](https://monai.io/) framework and the [Segment Anything Model (SAM)](https://github.com/facebookresearch/segment-anything) on the [Cellpose dataset](https://www.cellpose.org/).
389-
ECHO is on.
391+
ECHO is on.

modules/idc_dataset.ipynb

Lines changed: 6 additions & 208 deletions
Original file line number | Diff line number | Diff line change
@@ -5,58 +5,14 @@
55
"metadata": {
66
"id": "eFLP44iEFCpB"
77
},
8-
"source": [
9-
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ImagingDataCommons/idc-monai/blob/main/monai_contribution/idc_dataset.ipynb)\n",
10-
"\n",
11-
"# Using NCI Imaging Data Commons with MONAI\n",
12-
"\n",
13-
"Copyright 2026 Imaging Data Commons\n",
14-
"\n",
15-
"Licensed under the Apache License, Version 2.0 (the \"License\");\n",
16-
"you may not use this file except in compliance with the License.\n",
17-
"You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
18-
"\n",
19-
"---\n",
20-
"\n",
21-
"## What is IDC?\n",
22-
"\n",
23-
"[NCI Imaging Data Commons (IDC)](https://portal.imaging.datacommons.cancer.gov/) is a free, cloud-hosted repository of publicly available cancer imaging data maintained by the National Cancer Institute (NCI). It provides:\n",
24-
"\n",
25-
"- **~100 TB** of radiology (CT, MR, PET) and pathology images across 160+ cancer collections\n",
26-
"- **No sign-up or authentication required** — data is openly accessible\n",
27-
"- **Expert and AI-generated annotations** (e.g., organ segmentations) paired with images\n",
28-
"- **Standardized format** — all data uses DICOM, the medical imaging industry standard\n",
29-
"- **Cloud-native storage** — data lives in Google Cloud Storage (GCS) buckets, so downloads are fast\n",
30-
"- **Accompanying tools** - you can search, visualize, and subset the data\n",
31-
"\n",
32-
"## What is `idc-index`?\n",
33-
"\n",
34-
"[`idc-index`](https://github.com/ImagingDataCommons/idc-index) is a lightweight Python package that lets you search and download IDC data without any cloud account or special credentials. It ships with a local metadata index — a set of DuckDB tables describing every image series in IDC — so you can run SQL queries locally to find exactly the data you need before downloading anything.\n",
35-
"\n",
36-
"## What this tutorial covers\n",
37-
"\n",
38-
"This tutorial shows how to:\n",
39-
"1. Query IDC metadata with SQL to find cancer imaging data\n",
40-
"2. Download DICOM images and segmentations with one function call\n",
41-
"3. Load the data into MONAI for AI/ML preprocessing\n",
42-
"4. Work with DICOM Segmentation (DICOM-SEG) objects and their rich metadata\n",
43-
"\n",
44-
"> **Tip**: This tutorial was created using [idc-claude-skill](https://github.com/ImagingDataCommons/idc-claude-skill) — an AI assistant skill for navigating IDC data and the `idc-index` API."
45-
]
8+
"source": "Copyright (c) MONAI Consortium \nLicensed under the Apache License, Version 2.0 (the \"License\"); \nyou may not use this file except in compliance with the License. \nYou may obtain a copy of the License at \n&nbsp;&nbsp;&nbsp;&nbsp;http://www.apache.org/licenses/LICENSE-2.0 \nUnless required by applicable law or agreed to in writing, software \ndistributed under the License is distributed on an \"AS IS\" BASIS, \nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. \nSee the License for the specific language governing permissions and \nlimitations under the License.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ImagingDataCommons/idc-monai/blob/main/monai_contribution/idc_dataset.ipynb)\n\n# Using NCI Imaging Data Commons with MONAI\n\n## What is IDC?\n\n[NCI Imaging Data Commons (IDC)](https://portal.imaging.datacommons.cancer.gov/) is a free, cloud-hosted repository of publicly available cancer imaging data maintained by the National Cancer Institute (NCI). It provides:\n\n- **~100 TB** of radiology (CT, MR, PET) and pathology images across 160+ cancer collections\n- **No sign-up or authentication required** — data is openly accessible\n- **Expert and AI-generated annotations** (e.g., organ segmentations) paired with images\n- **Standardized format** — all data uses DICOM, the medical imaging industry standard\n- **Cloud-native storage** — data lives in Google Cloud Storage (GCS) buckets, so downloads are fast\n- **Accompanying tools** — you can search, visualize, and subset the data\n\n## What is `idc-index`?\n\n[`idc-index`](https://github.com/ImagingDataCommons/idc-index) is a lightweight Python package that lets you search and download IDC data without any cloud account or special credentials. It ships with a local metadata index — a set of DuckDB tables describing every image series in IDC — so you can run SQL queries locally to find exactly the data you need before downloading anything.\n\n## What this tutorial covers\n\nThis tutorial shows how to:\n1. Query IDC metadata with SQL to find cancer imaging data\n2. Download DICOM images and segmentations with one function call\n3. Load the data into MONAI for AI/ML preprocessing\n4. Work with DICOM Segmentation (DICOM-SEG) objects and their rich metadata\n\n> **Tip**: This tutorial was created using [idc-claude-skill](https://github.com/ImagingDataCommons/idc-claude-skill) — an AI assistant skill for navigating IDC data and the `idc-index` API."
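The "query locally, then download" pattern described in the cell above can be sketched without the real package. This is a toy illustration using the stdlib `sqlite3` as a stand-in for idc-index's DuckDB tables; the column names mirror the real index, but the table name, rows, and values here are fabricated for illustration (the real workflow runs `IDCClient.sql_query()` against the shipped index).

```python
import sqlite3

# Toy stand-in for idc-index's local metadata index (the real package ships
# DuckDB tables; rows below are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE index_table ("
    "collection_id TEXT, Modality TEXT, SeriesInstanceUID TEXT)"
)
conn.executemany(
    "INSERT INTO index_table VALUES (?, ?, ?)",
    [
        ("nsclc_radiomics", "CT", "1.2.3.100"),
        ("nsclc_radiomics", "SEG", "1.2.3.101"),
        ("tcga_luad", "MR", "1.2.3.200"),
    ],
)

# Filter locally with SQL first, so only the matching series get downloaded.
rows = conn.execute(
    "SELECT SeriesInstanceUID FROM index_table "
    "WHERE collection_id = 'nsclc_radiomics' AND Modality = 'SEG'"
).fetchall()
series_uids = [r[0] for r in rows]
print(series_uids)  # ['1.2.3.101']

# In the real API, these UIDs would then be passed to the client's
# download helper rather than printed.
```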
469
},
4710
{
4811
"cell_type": "markdown",
4912
"metadata": {
5013
"id": "iudwt5GoFCpC"
5114
},
52-
"source": [
53-
"## Setup\n",
54-
"\n",
55-
"Install required packages:\n",
56-
"- `monai` — Medical Open Network for AI, the ML framework used here\n",
57-
"- `idc-index` — Query and download IDC data (includes local metadata index)\n",
58-
"- `itk` / `itkwasm-dicom` — ITK-based DICOM readers used by MONAI's `ITKReader` and our custom DICOM-SEG loader"
59-
]
15+
"source": "## Setup environment\n\nInstall required packages:\n- `monai` — Medical Open Network for AI, the ML framework used here\n- `idc-index` — Query and download IDC data (includes local metadata index)\n- `itk` / `itkwasm-dicom` — ITK-based DICOM readers used by MONAI's `ITKReader` and our custom DICOM-SEG loader\n\n> **Colab users**: After running the cell below, restart the runtime before continuing (Runtime → Restart runtime). ITK requires a fresh runtime to load correctly after installation."
6016
},
6117
{
6218
"cell_type": "code",
@@ -65,21 +21,7 @@
6521
"id": "T4ujv4U7FCpC"
6622
},
6723
"outputs": [],
68-
"source": [
69-
"!pip install -q monai idc-index itk itkwasm-dicom\n",
70-
"\n",
71-
"# Restart runtime after installing ITK (required for ITK to load properly)\n",
72-
"import sys\n",
73-
"\n",
74-
"if \"google.colab\" in sys.modules:\n",
75-
" try:\n",
76-
" import itk # noqa: F401\n",
77-
" except ImportError:\n",
78-
" print(\"Restarting runtime to load ITK...\")\n",
79-
" import os\n",
80-
"\n",
81-
" os.kill(os.getpid(), 9)"
82-
]
24+
"source": "!pip install -q monai idc-index itk itkwasm-dicom"
8325
},
8426
{
8527
"cell_type": "markdown",
@@ -101,34 +43,7 @@
10143
"outputId": "f63082ca-81d8-4b02-8de0-662710f6e508"
10244
},
10345
"outputs": [],
104-
"source": [
105-
"import os\n",
106-
"import tempfile\n",
107-
"from pathlib import Path\n",
108-
"from typing import Hashable, Mapping\n",
109-
"\n",
110-
"import torch\n",
111-
"import numpy as np\n",
112-
"import matplotlib.pyplot as plt\n",
113-
"from matplotlib.colors import ListedColormap\n",
114-
"from idc_index import IDCClient\n",
115-
"\n",
116-
"from monai.config import KeysCollection\n",
117-
"from monai.transforms import (\n",
118-
" Compose,\n",
119-
" LoadImaged,\n",
120-
" EnsureChannelFirstd,\n",
121-
" Orientationd,\n",
122-
" Spacingd,\n",
123-
" ScaleIntensityRanged,\n",
124-
" MapTransform,\n",
125-
")\n",
126-
"from monai.data import Dataset, MetaTensor\n",
127-
"from monai.data.image_reader import ITKReader\n",
128-
"import monai\n",
129-
"\n",
130-
"monai.config.print_config()"
131-
]
46+
"source": "import os\nimport sys\nimport tempfile\nfrom pathlib import Path\nfrom typing import Hashable, Mapping\n\nimport itkwasm_dicom\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport torch\nfrom idc_index import IDCClient\nfrom matplotlib.colors import ListedColormap\n\nimport monai\nfrom monai.config import KeysCollection\nfrom monai.data import Dataset, MetaTensor\nfrom monai.data.image_reader import ITKReader\nfrom monai.transforms import (\n Compose,\n EnsureChannelFirstd,\n LoadImaged,\n MapTransform,\n Orientationd,\n ScaleIntensityRanged,\n Spacingd,\n)\n\nmonai.config.print_config()"
13247
},
13348
{
13449
"cell_type": "markdown",
@@ -562,124 +477,7 @@
562477
"outputId": "386c615d-5349-49fb-f412-7a92cfc421d9"
563478
},
564479
"outputs": [],
565-
"source": [
566-
"import itkwasm_dicom # noqa: E402\n",
567-
"\n",
568-
"\n",
569-
"class LoadDicomSegd(MapTransform):\n",
570-
" \"\"\"Load DICOM Segmentation (DICOM-SEG) files using ITKWasm.\n",
571-
"\n",
572-
" DICOM-SEG is an enhanced multiframe DICOM format that stores segmentation\n",
573-
" masks with segment metadata including recommended display colors.\n",
574-
"\n",
575-
" The affine matrix is derived directly from DICOM metadata (direction cosines,\n",
576-
" spacing, origin) with LPS→RAS conversion applied to match MONAI's ITKReader\n",
577-
" convention. No axis flipping is performed — orientation is fully encoded in\n",
578-
" the affine via the direction cosine matrix.\n",
579-
" \"\"\"\n",
580-
"\n",
581-
" def __init__(self, keys: KeysCollection, allow_missing_keys: bool = False):\n",
582-
" super().__init__(keys, allow_missing_keys)\n",
583-
"\n",
584-
" def _find_dcm_file(self, path: Path) -> Path:\n",
585-
" \"\"\"Find .dcm file in directory or return path if already a file.\"\"\"\n",
586-
" if path.is_file():\n",
587-
" return path\n",
588-
" dcm_files = list(path.glob(\"*.dcm\"))\n",
589-
" if not dcm_files:\n",
590-
" raise FileNotFoundError(f\"No .dcm files found in {path}\")\n",
591-
" return dcm_files[0]\n",
592-
"\n",
593-
" def _build_affine(self, spacing, origin, direction) -> np.ndarray:\n",
594-
" \"\"\"Build 4x4 affine matrix from DICOM spatial metadata.\n",
595-
"\n",
596-
" Converts from ITK/DICOM LPS convention to MONAI's RAS-like convention\n",
597-
" by negating X and Y world coordinates (LPS→RAS). No axis flips are\n",
598-
" applied — orientation is fully encoded in the affine via the direction\n",
599-
" cosine matrix.\n",
600-
"\n",
601-
" Args:\n",
602-
" spacing: Voxel spacing (X, Y, Z) as returned by itkwasm\n",
603-
" origin: Physical coordinates of voxel [0,0,0] in LPS\n",
604-
" direction: 3x3 direction cosine matrix D where D[i,j] is the\n",
605-
" component of voxel-axis-j's unit vector along LPS\n",
606-
" physical axis i. ITK affine formula:\n",
607-
" world_lps = D @ diag(spacing) @ voxel + origin\n",
608-
" \"\"\"\n",
609-
" lps_to_ras = np.diag([-1.0, -1.0, 1.0])\n",
610-
" affine = np.eye(4)\n",
611-
" affine[:3, :3] = lps_to_ras @ direction @ np.diag(spacing)\n",
612-
" affine[:3, 3] = lps_to_ras @ origin\n",
613-
" return affine\n",
614-
"\n",
615-
" def __call__(self, data: Mapping[Hashable, any]) -> dict[Hashable, any]:\n",
616-
" d = dict(data)\n",
617-
" for key in self.key_iterator(d):\n",
618-
" path = Path(d[key])\n",
619-
" dcm_file = self._find_dcm_file(path)\n",
620-
"\n",
621-
" # Read using ITKWasm\n",
622-
" seg_image, overlay_info = itkwasm_dicom.read_segmentation(dcm_file)\n",
623-
"\n",
624-
" # ITKWasm returns array in (Z, Y, X) order but metadata in (X, Y, Z) order.\n",
625-
" # Transpose to (X, Y, Z) to match metadata — this is a layout convention,\n",
626-
" # not an orientation flip.\n",
627-
" seg_array = np.asarray(seg_image.data).copy()\n",
628-
" seg_array = np.transpose(seg_array, (2, 1, 0))\n",
629-
"\n",
630-
" # Build affine from spatial metadata\n",
631-
" spacing = np.array(seg_image.spacing)\n",
632-
" origin = np.array(seg_image.origin)\n",
633-
" direction = np.array(seg_image.direction).reshape(3, 3)\n",
634-
"\n",
635-
" affine = self._build_affine(spacing, origin, direction)\n",
636-
"\n",
637-
" # Make contiguous (array may be non-contiguous after transpose)\n",
638-
" seg_array = np.ascontiguousarray(seg_array)\n",
639-
"\n",
640-
" # Create MONAI MetaTensor with metadata\n",
641-
" meta_tensor = MetaTensor(seg_array)\n",
642-
" meta_tensor.affine = affine\n",
643-
" meta_tensor.meta[\"filename_or_obj\"] = str(dcm_file)\n",
644-
" meta_tensor.meta[\"overlay_info\"] = overlay_info\n",
645-
" meta_tensor.meta[\"original_channel_dim\"] = \"no_channel\"\n",
646-
"\n",
647-
" d[key] = meta_tensor\n",
648-
" d[f\"{key}_meta_dict\"] = dict(meta_tensor.meta)\n",
649-
"\n",
650-
" return d\n",
651-
"\n",
652-
"\n",
653-
"# Load CT with MONAI's ITKReader\n",
654-
"ct_transforms = Compose(\n",
655-
" [\n",
656-
" LoadImaged(keys=[\"image\"], reader=ITKReader()),\n",
657-
" EnsureChannelFirstd(keys=[\"image\"]),\n",
658-
" ]\n",
659-
")\n",
660-
"\n",
661-
"# Load SEG with our custom LoadDicomSegd\n",
662-
"seg_transforms = Compose(\n",
663-
" [\n",
664-
" LoadDicomSegd(keys=[\"label\"]),\n",
665-
" EnsureChannelFirstd(keys=[\"label\"]),\n",
666-
" ]\n",
667-
")\n",
668-
"\n",
669-
"# Load both\n",
670-
"image_path = os.path.join(seg_dir, demo_pair[\"image_uid\"])\n",
671-
"seg_path = os.path.join(seg_dir, demo_pair[\"seg_uid\"])\n",
672-
"\n",
673-
"ct_data = ct_transforms({\"image\": image_path})\n",
674-
"seg_data = seg_transforms({\"label\": seg_path})\n",
675-
"\n",
676-
"ct_image = ct_data[\"image\"]\n",
677-
"seg_label = seg_data[\"label\"]\n",
678-
"\n",
679-
"print(f\"CT image shape: {ct_image.shape}\")\n",
680-
"print(f\"Segmentation shape: {seg_label.shape}\")\n",
681-
"print(f\"Unique labels: {torch.unique(seg_label)[:10].tolist()}...\") # First 10 labels"
682-
]
480+
"source": "class LoadDicomSegd(MapTransform):\n    \"\"\"Load DICOM Segmentation (DICOM-SEG) files using ITKWasm.\n\n    DICOM-SEG is an enhanced multiframe DICOM format that stores segmentation\n    masks with segment metadata including recommended display colors.\n\n    The affine matrix is derived directly from DICOM metadata (direction cosines,\n    spacing, origin) with LPS→RAS conversion applied to match MONAI's ITKReader\n    convention. No axis flipping is performed — orientation is fully encoded in\n    the affine via the direction cosine matrix.\n    \"\"\"\n\n    def __init__(self, keys: KeysCollection, allow_missing_keys: bool = False):\n        super().__init__(keys, allow_missing_keys)\n\n    def _find_dcm_file(self, path: Path) -> Path:\n        \"\"\"Find .dcm file in directory or return path if already a file.\"\"\"\n        if path.is_file():\n            return path\n        dcm_files = list(path.glob(\"*.dcm\"))\n        if not dcm_files:\n            raise FileNotFoundError(f\"No .dcm files found in {path}\")\n        return dcm_files[0]\n\n    def _build_affine(self, spacing, origin, direction) -> np.ndarray:\n        \"\"\"Build 4x4 affine matrix from DICOM spatial metadata.\n\n        Converts from ITK/DICOM LPS convention to MONAI's RAS-like convention\n        by negating X and Y world coordinates (LPS→RAS). No axis flips are\n        applied — orientation is fully encoded in the affine via the direction\n        cosine matrix.\n\n        Args:\n            spacing: Voxel spacing (X, Y, Z) as returned by itkwasm\n            origin: Physical coordinates of voxel [0,0,0] in LPS\n            direction: 3x3 direction cosine matrix D where D[i,j] is the\n                component of voxel-axis-j's unit vector along LPS\n                physical axis i. ITK affine formula:\n                world_lps = D @ diag(spacing) @ voxel + origin\n        \"\"\"\n        lps_to_ras = np.diag([-1.0, -1.0, 1.0])\n        affine = np.eye(4)\n        affine[:3, :3] = lps_to_ras @ direction @ np.diag(spacing)\n        affine[:3, 3] = lps_to_ras @ origin\n        return affine\n\n    def __call__(self, data: Mapping[Hashable, any]) -> dict[Hashable, any]:\n        d = dict(data)\n        for key in self.key_iterator(d):\n            path = Path(d[key])\n            dcm_file = self._find_dcm_file(path)\n\n            # Read using ITKWasm\n            seg_image, overlay_info = itkwasm_dicom.read_segmentation(dcm_file)\n\n            # ITKWasm returns array in (Z, Y, X) order but metadata in (X, Y, Z) order.\n            # Transpose to (X, Y, Z) to match metadata — this is a layout convention,\n            # not an orientation flip.\n            seg_array = np.asarray(seg_image.data).copy()\n            seg_array = np.transpose(seg_array, (2, 1, 0))\n\n            # Build affine from spatial metadata\n            spacing = np.array(seg_image.spacing)\n            origin = np.array(seg_image.origin)\n            direction = np.array(seg_image.direction).reshape(3, 3)\n\n            affine = self._build_affine(spacing, origin, direction)\n\n            # Make contiguous (array may be non-contiguous after transpose)\n            seg_array = np.ascontiguousarray(seg_array)\n\n            # Create MONAI MetaTensor with metadata\n            meta_tensor = MetaTensor(seg_array)\n            meta_tensor.affine = affine\n            meta_tensor.meta[\"filename_or_obj\"] = str(dcm_file)\n            meta_tensor.meta[\"overlay_info\"] = overlay_info\n            meta_tensor.meta[\"original_channel_dim\"] = \"no_channel\"\n\n            d[key] = meta_tensor\n            d[f\"{key}_meta_dict\"] = dict(meta_tensor.meta)\n\n        return d\n\n\n# Load CT with MONAI's ITKReader\nct_transforms = Compose(\n    [\n        LoadImaged(keys=[\"image\"], reader=ITKReader()),\n        EnsureChannelFirstd(keys=[\"image\"]),\n    ]\n)\n\n# Load SEG with our custom LoadDicomSegd\nseg_transforms = Compose(\n    [\n        LoadDicomSegd(keys=[\"label\"]),\n        EnsureChannelFirstd(keys=[\"label\"]),\n    ]\n)\n\n# Load both\nimage_path = os.path.join(seg_dir, demo_pair[\"image_uid\"])\nseg_path = os.path.join(seg_dir, demo_pair[\"seg_uid\"])\n\nct_data = ct_transforms({\"image\": image_path})\nseg_data = seg_transforms({\"label\": seg_path})\n\nct_image = ct_data[\"image\"]\nseg_label = seg_data[\"label\"]\n\nprint(f\"CT image shape: {ct_image.shape}\")\nprint(f\"Segmentation shape: {seg_label.shape}\")\nprint(f\"Unique labels: {torch.unique(seg_label)[:10].tolist()}...\")  # First 10 labels"
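The LPS→RAS affine construction in the notebook's `_build_affine` can be sanity-checked in isolation with plain numpy. A minimal sketch; the spacing, origin, and direction values below are made up for illustration, not taken from any real series:

```python
import numpy as np

def build_affine(spacing, origin, direction):
    # Same formula as the notebook's _build_affine: ITK gives
    # world_lps = D @ diag(spacing) @ voxel + origin; negating the first
    # two world axes converts LPS to RAS.
    lps_to_ras = np.diag([-1.0, -1.0, 1.0])
    affine = np.eye(4)
    affine[:3, :3] = lps_to_ras @ direction @ np.diag(spacing)
    affine[:3, 3] = lps_to_ras @ origin
    return affine

spacing = np.array([0.98, 0.98, 3.0])        # X, Y, Z voxel size in mm
origin = np.array([-250.0, -250.0, -120.0])  # LPS position of voxel (0, 0, 0)
direction = np.eye(3)                        # axis-aligned acquisition

affine = build_affine(spacing, origin, direction)

# With identity direction cosines, LPS->RAS just negates X and Y: the scale
# terms pick up the sign flip and Z is untouched.
assert np.allclose(np.diag(affine), [-0.98, -0.98, 3.0, 1.0])
assert np.allclose(affine[:3, 3], [250.0, 250.0, -120.0])

# Voxel (0, 0, 0) maps to the sign-flipped origin in world coordinates,
# confirming that orientation lives entirely in the affine.
voxel0 = affine @ np.array([0.0, 0.0, 0.0, 1.0])
assert np.allclose(voxel0[:3], [250.0, 250.0, -120.0])
```

Because no axis of the voxel array is flipped, a non-trivial `direction` matrix (e.g. a tilted gantry) flows straight into the affine's upper-left 3x3 block with no special-casing.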
683481
},
684482
{
685483
"cell_type": "code",
@@ -1110,4 +908,4 @@
1110908
},
1111909
"nbformat": 4,
1112910
"nbformat_minor": 0
1113-
}
911+
}
