Fix technical accuracy: update data sizes and URLs (Jan 2026)

rdhyee · claude · rdhyee · commit ebe8420730bd · 2026-01-29T16:11:37.000-08:00
- Update file sizes to ~280 MB wide / ~850 MB narrow (approximate)
- Update row counts: 6.7M MaterialSampleRecords, 20M total rows
- Update source breakdown: SESAR (4.6M), OpenContext (1M), GEOME (605K), Smithsonian (322K)
- zenodo_isamples_analysis.qmd: prioritize Cloudflare R2 URL over Zenodo
- Clarify that data is now served from Cloudflare R2, not Zenodo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
diff --git a/tutorials/index.qmd b/tutorials/index.qmd
@@ -17,7 +17,7 @@ Learn to explore **6.7 million physical samples** from scientific collections wo
 
 All tutorials use **geoparquet files** - no server required:
 
-- **iSamples Full Dataset**: 282 MB, 6.7M samples from SESAR, OpenContext, GEOME, Smithsonian
+- **iSamples Full Dataset**: ~280 MB wide format, 6.7M samples from SESAR, OpenContext, GEOME, Smithsonian
 - **Available via**: Cloudflare R2 with HTTP range requests
 
 ## Why Browser-Based?
diff --git a/tutorials/isamples_explorer.qmd b/tutorials/isamples_explorer.qmd
@@ -12,7 +12,7 @@ Search and explore **6.7 million physical samples** from scientific collections
 
 ::: {.callout-note}
 ### Serverless Architecture
-This app queries a 282 MB Parquet file directly in your browser using DuckDB-WASM. No server required!
+This app queries a ~280 MB Parquet file directly in your browser using DuckDB-WASM. No server required!
 :::
 
 ## Setup
diff --git a/tutorials/narrow_vs_wide_performance.qmd b/tutorials/narrow_vs_wide_performance.qmd
@@ -18,8 +18,8 @@ The iSamples property graph data can be serialized in two different parquet form
 
 | Format | Description | File Size | Row Count | Sources |
 |--------|-------------|-----------|-----------|---------|
-| **Narrow** | Stores relationships as separate edge rows (`otype='_edge_'`) | 844 MB | ~106M rows | All 4 sources |
-| **Wide** | Stores relationships as `p__*` columns on entity rows | 282 MB | ~20M rows | All 4 sources |
+| **Narrow** | Stores relationships as separate edge rows (`otype='_edge_'`) | ~850 MB | ~106M rows | All 4 sources |
+| **Wide** | Stores relationships as `p__*` columns on entity rows | ~280 MB | ~20M rows | All 4 sources |
 
 Both formats represent the **same underlying data** (SESAR, OpenContext, GEOME, Smithsonian) with identical semantics, but the wide format is optimized for analytical queries by eliminating edge rows.
 
diff --git a/tutorials/parquet_cesium_isamples_wide.qmd b/tutorials/parquet_cesium_isamples_wide.qmd
@@ -8,12 +8,12 @@ This page renders points from the **full iSamples wide-format** parquet file (al
 ::: {.callout-note}
 ## iSamples Full Dataset (Wide Format)
 
-This page uses the **iSamples combined dataset** (Dec 2025) which includes:
+This page uses the **iSamples combined dataset** (Jan 2026) which includes:
 
-- **6.68M MaterialSampleRecords** from all iSamples sources
-- **Source breakdown**: SESAR (70%), OpenContext (16%), GEOME (9%), Smithsonian (5%)
-- **242 MB** wide format (vs 709 MB narrow)
-- **19.5M total rows** (entities only, no edge rows)
+- **6.7M MaterialSampleRecords** from all iSamples sources
+- **Source breakdown**: SESAR (4.6M), OpenContext (1M), GEOME (605K), Smithsonian (322K)
+- **~280 MB** wide format (vs ~850 MB narrow) - 67% smaller
+- **20M total rows** (all entity types, no edge rows)
 - **47 columns** with flattened latitude/longitude (direct column access, no JSON parsing)
 
 :::
@@ -102,7 +102,7 @@ python3 -m http.server 8000
 Then use: `http://localhost:8000/isamples_202601_wide.parquet`
 
 **Benefits of wide format file:**
-- 66% smaller than narrow format (242 MB vs 709 MB)
+- 67% smaller than narrow format (~280 MB vs ~850 MB)
 - Much faster initial load (less network transfer)
 - Simpler queries with direct column access
 - Works offline once cached
diff --git a/tutorials/zenodo_isamples_analysis.qmd b/tutorials/zenodo_isamples_analysis.qmd
@@ -27,13 +27,13 @@ This tutorial demonstrates how to efficiently analyze large geospatial datasets
 
 ## Dataset Information
 
-**Primary dataset**:
-- **URL**: `https://labs.dataunbound.com/docs/2025/07/isamples_export_2025_04_21_16_23_46_geo.parquet` *(temporary for testing)*
-- **Original**: `https://zenodo.org/api/records/15278211/files/...` *(currently rate limited)*
-- **Size**: ~300 MB, 6+ million records
-- **Sources**: SESAR, OpenContext, GEOME, Smithsonian
+**Primary dataset** (Jan 2026):
+- **URL**: `https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet`
+- **Size**: ~280 MB wide format, 6.7M MaterialSampleRecords (20M total rows)
+- **Sources**: SESAR (4.6M), OpenContext (1M), GEOME (605K), Smithsonian (322K)
+- **Hosting**: Cloudflare R2 with HTTP range request support
 
-**Note**: *Currently using DataUnbound Labs hosting temporarily to avoid Zenodo rate limiting during development. This will be switched back to Zenodo once the notebook is stable.*
+**Note**: *Data was originally archived on Zenodo and is now served from Cloudflare R2 for better performance and reliability.*
 
 **Fallback dataset** (if remote data fails):
 - **Type**: Generated demo data with realistic structure
@@ -81,14 +81,13 @@ d3 = require("d3@7")
 topojson = require("topojson-client@3")
 
 // Dataset URLs - try multiple options for CORS compatibility
-// TEMPORARY: Using DataUnbound Labs hosting for testing to avoid Zenodo rate limiting
+// Primary: Cloudflare R2 (Jan 2026 wide format)
 parquet_urls = [
+  'https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet',
+
+  // Fallback: older versions
   'https://labs.dataunbound.com/docs/2025/07/isamples_export_2025_04_21_16_23_46_geo.parquet',
-  
-  // Original Zenodo URLs (currently rate limited)
-  'https://zenodo.org/api/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet/content',
-  'https://cors-anywhere.herokuapp.com/https://zenodo.org/api/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet/content',
-  'https://z.rslv.xyz/10.5281/zenodo.15278211/isamples_export_2025_04_21_16_23_46_geo.parquet'
+  'https://zenodo.org/api/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet/content'
 ]
 
 // Test CORS and find working URL - with rate limiting protection
@@ -243,13 +242,13 @@ createDemoData = async (conn) => {
 md`
 ## Connection Status
 
-${working_parquet_url ? 
-  `✅ **Connected to live data**: Using ${working_parquet_url.includes('zenodo.org') ? 'Zenodo direct' : working_parquet_url.includes('cors-anywhere') ? 'CORS proxy' : 'original'} URL  
-📊 **Dataset**: ~6M records from real iSamples database  
-🌐 **Data source**: ${working_parquet_url}` 
-  : 
-  `⚠️ **Using demo data**: Remote file not accessible due to CORS restrictions  
-📊 **Dataset**: 10K synthetic records with realistic structure  
+${working_parquet_url ?
+  `✅ **Connected to live data**: Using ${working_parquet_url.includes('r2.dev') ? 'Cloudflare R2' : working_parquet_url.includes('zenodo.org') ? 'Zenodo' : 'fallback'} hosting
+📊 **Dataset**: 6.7M MaterialSampleRecords from iSamples
+🌐 **Data source**: ${working_parquet_url}`
+  :
+  `⚠️ **Using demo data**: Remote file not accessible due to CORS restrictions
+📊 **Dataset**: 10K synthetic records with realistic structure
 💡 **Note**: This demonstrates the same analysis patterns with representative data`
 }
 `
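
The nested ternary in the updated Connection Status cell maps the working URL to a hosting label. As a minimal sketch of the same mapping, the plain-JavaScript helper below (not an Observable cell) matches on the parsed hostname instead of the substring `includes` checks used in the diff. The name `hostingLabel` and the `parquetUrls` constant are hypothetical and are not part of the commit; the URLs are copied from the `parquet_urls` list above.

```javascript
// Hypothetical helper (not part of the commit): maps a dataset URL to the
// hosting label shown in the Connection Status message.
function hostingLabel(url) {
  // Parse the hostname so a match cannot come from the path or query string.
  const host = new URL(url).hostname;
  if (host === 'r2.dev' || host.endsWith('.r2.dev')) return 'Cloudflare R2';
  if (host === 'zenodo.org' || host.endsWith('.zenodo.org')) return 'Zenodo';
  return 'fallback';
}

// URLs copied from the parquet_urls list in the diff, in the same order.
const parquetUrls = [
  'https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet',
  'https://labs.dataunbound.com/docs/2025/07/isamples_export_2025_04_21_16_23_46_geo.parquet',
  'https://zenodo.org/api/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet/content'
];

const labels = parquetUrls.map(hostingLabel);
console.log(labels); // [ 'Cloudflare R2', 'fallback', 'Zenodo' ]
```

Under this sketch the DataUnbound Labs URL falls through to `'fallback'`, which matches the behavior of the ternary in the diff.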