
Commit ebe8420

rdhyee and claude committed
Fix technical accuracy: update data sizes and URLs (Jan 2026)
- Update file sizes to ~280 MB wide / ~850 MB narrow (approximate)
- Update row counts: 6.7M MaterialSampleRecords, 20M total rows
- Update source breakdown: SESAR (4.6M), OpenContext (1M), GEOME (605K), Smithsonian (322K)
- zenodo_isamples_analysis.qmd: prioritize Cloudflare R2 URL over Zenodo
- Clarify that data is now served from Cloudflare R2, not Zenodo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 2c70702 commit ebe8420

5 files changed

Lines changed: 28 additions & 29 deletions


tutorials/index.qmd

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ Learn to explore **6.7 million physical samples** from scientific collections wo
 
 All tutorials use **geoparquet files** - no server required:
 
-- **iSamples Full Dataset**: 282 MB, 6.7M samples from SESAR, OpenContext, GEOME, Smithsonian
+- **iSamples Full Dataset**: ~280 MB wide format, 6.7M samples from SESAR, OpenContext, GEOME, Smithsonian
 - **Available via**: Cloudflare R2 with HTTP range requests
 
 ## Why Browser-Based?

tutorials/isamples_explorer.qmd

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Search and explore **6.7 million physical samples** from scientific collections
 
 ::: {.callout-note}
 ### Serverless Architecture
-This app queries a 282 MB Parquet file directly in your browser using DuckDB-WASM. No server required!
+This app queries a ~280 MB Parquet file directly in your browser using DuckDB-WASM. No server required!
 :::
 
 ## Setup

tutorials/narrow_vs_wide_performance.qmd

Lines changed: 2 additions & 2 deletions
@@ -18,8 +18,8 @@ The iSamples property graph data can be serialized in two different parquet form
 
 | Format | Description | File Size | Row Count | Sources |
 |--------|-------------|-----------|-----------|---------|
-| **Narrow** | Stores relationships as separate edge rows (`otype='_edge_'`) | 844 MB | ~106M rows | All 4 sources |
-| **Wide** | Stores relationships as `p__*` columns on entity rows | 282 MB | ~20M rows | All 4 sources |
+| **Narrow** | Stores relationships as separate edge rows (`otype='_edge_'`) | ~850 MB | ~106M rows | All 4 sources |
+| **Wide** | Stores relationships as `p__*` columns on entity rows | ~280 MB | ~20M rows | All 4 sources |
 
 Both formats represent the **same underlying data** (SESAR, OpenContext, GEOME, Smithsonian) with identical semantics, but the wide format is optimized for analytical queries by eliminating edge rows.
 
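
The narrow/wide distinction in the table above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the field names `pid`, `s`, `p`, and `o` and the `produced_by` predicate are hypothetical, and only `otype='_edge_'` and the `p__*` column prefix come from the tutorial text.

```javascript
// Hypothetical narrow-format rows: entity rows plus a separate edge row.
// Field names (pid, s, p, o) are illustrative assumptions, not the real schema.
const narrowRows = [
  { pid: "sample-1", otype: "MaterialSampleRecord" },
  { pid: "event-1", otype: "SamplingEvent" },
  { otype: "_edge_", s: "sample-1", p: "produced_by", o: "event-1" },
];

// Fold each edge row into a p__<predicate> array column on its subject entity.
function narrowToWide(rows) {
  const entities = new Map();
  for (const row of rows) {
    if (row.otype !== "_edge_") entities.set(row.pid, { ...row });
  }
  for (const row of rows) {
    if (row.otype !== "_edge_") continue;
    const subject = entities.get(row.s);
    if (!subject) continue; // skip edges whose subject entity is missing
    const column = `p__${row.p}`;
    if (!subject[column]) subject[column] = [];
    subject[column].push(row.o);
  }
  return [...entities.values()];
}

const wideRows = narrowToWide(narrowRows);
console.log(wideRows); // entity rows only; the edge now lives in p__produced_by
```

Because every edge row disappears into a column on its subject, the wide output has fewer rows than the narrow input, which is consistent with the row counts in the table (~20M versus ~106M).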

tutorials/parquet_cesium_isamples_wide.qmd

Lines changed: 6 additions & 6 deletions
@@ -8,12 +8,12 @@ This page renders points from the **full iSamples wide-format** parquet file (al
 ::: {.callout-note}
 ## iSamples Full Dataset (Wide Format)
 
-This page uses the **iSamples combined dataset** (Dec 2025) which includes:
+This page uses the **iSamples combined dataset** (Jan 2026) which includes:
 
-- **6.68M MaterialSampleRecords** from all iSamples sources
-- **Source breakdown**: SESAR (70%), OpenContext (16%), GEOME (9%), Smithsonian (5%)
-- **242 MB** wide format (vs 709 MB narrow)
-- **19.5M total rows** (entities only, no edge rows)
+- **6.7M MaterialSampleRecords** from all iSamples sources
+- **Source breakdown**: SESAR (4.6M), OpenContext (1M), GEOME (605K), Smithsonian (322K)
+- **~280 MB** wide format (vs ~850 MB narrow) - 67% smaller
+- **20M total rows** (all entity types, no edge rows)
 - **47 columns** with flattened latitude/longitude (direct column access, no JSON parsing)
 
 :::
@@ -102,7 +102,7 @@ python3 -m http.server 8000
 Then use: `http://localhost:8000/isamples_202601_wide.parquet`
 
 **Benefits of wide format file:**
-- 66% smaller than narrow format (242 MB vs 709 MB)
+- 67% smaller than narrow format (~280 MB vs ~850 MB)
 - Much faster initial load (less network transfer)
 - Simpler queries with direct column access
 - Works offline once cached
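
The change from "66% smaller" to "67% smaller" follows from the updated file sizes. A quick check in JavaScript, using only the approximate sizes quoted in this diff (the helper name `percentSmaller` is made up for the example):

```javascript
// Percentage by which the wide file is smaller than the narrow file.
function percentSmaller(wideMB, narrowMB) {
  return (1 - wideMB / narrowMB) * 100;
}

// Approximate sizes in MB, as quoted in the diff.
const updated = percentSmaller(280, 850);  // new figures
const previous = percentSmaller(242, 709); // figures being replaced

console.log(`${Math.round(updated)}% smaller`);  // prints "67% smaller"
console.log(`${Math.round(previous)}% smaller`); // prints "66% smaller"
```

Since both sizes are rounded approximations, the percentage should be read as approximate too.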

tutorials/zenodo_isamples_analysis.qmd

Lines changed: 18 additions & 19 deletions
@@ -27,13 +27,13 @@ This tutorial demonstrates how to efficiently analyze large geospatial datasets
 
 ## Dataset Information
 
-**Primary dataset**:
-- **URL**: `https://labs.dataunbound.com/docs/2025/07/isamples_export_2025_04_21_16_23_46_geo.parquet` *(temporary for testing)*
-- **Original**: `https://zenodo.org/api/records/15278211/files/...` *(currently rate limited)*
-- **Size**: ~300 MB, 6+ million records
-- **Sources**: SESAR, OpenContext, GEOME, Smithsonian
+**Primary dataset** (Jan 2026):
+- **URL**: `https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet`
+- **Size**: ~280 MB wide format, 6.7M MaterialSampleRecords (20M total rows)
+- **Sources**: SESAR (4.6M), OpenContext (1M), GEOME (605K), Smithsonian (322K)
+- **Hosting**: Cloudflare R2 with HTTP range request support
 
-**Note**: *Currently using DataUnbound Labs hosting temporarily to avoid Zenodo rate limiting during development. This will be switched back to Zenodo once the notebook is stable.*
+**Note**: *Data was originally archived on Zenodo and is now served from Cloudflare R2 for better performance and reliability.*
 
 **Fallback dataset** (if remote data fails):
 - **Type**: Generated demo data with realistic structure
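
HTTP range request support matters for Parquet because the file metadata sits in a footer at the end of the file, so a client can read a few bytes from the tail instead of downloading all ~280 MB. An unencrypted Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The sketch below parses that 8-byte tail. The function names are hypothetical, `fetchParquetTail` is defined but never called here, and whether a given host honors the `Range` header from a browser is an assumption to verify.

```javascript
// Parse the final 8 bytes of a Parquet file:
// a 4-byte little-endian footer length followed by the magic bytes "PAR1".
function parseParquetTail(bytes) {
  if (bytes.length !== 8) throw new Error("expected exactly 8 bytes");
  const magic = String.fromCharCode(...bytes.slice(4));
  if (magic !== "PAR1") throw new Error("not a Parquet file");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return { footerLength: view.getUint32(0, true) };
}

// Fetch only the last 8 bytes using a suffix byte range.
// Not called here: it needs network access and a server that honors Range.
async function fetchParquetTail(url) {
  const response = await fetch(url, { headers: { Range: "bytes=-8" } });
  if (response.status !== 206) {
    throw new Error(`no range support (HTTP ${response.status})`);
  }
  return parseParquetTail(new Uint8Array(await response.arrayBuffer()));
}

// Offline check with synthetic bytes: footer length 1024, then "PAR1".
const sample = new Uint8Array([0x00, 0x04, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31]);
console.log(parseParquetTail(sample).footerLength); // prints 1024
```

Tools such as DuckDB-WASM handle these reads internally; the sketch only illustrates why a host with range support lets the browser avoid a full download.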
@@ -81,14 +81,13 @@ d3 = require("d3@7")
 topojson = require("topojson-client@3")
 
 // Dataset URLs - try multiple options for CORS compatibility
-// TEMPORARY: Using DataUnbound Labs hosting for testing to avoid Zenodo rate limiting
+// Primary: Cloudflare R2 (Jan 2026 wide format)
 parquet_urls = [
+  'https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet',
+
+  // Fallback: older versions
   'https://labs.dataunbound.com/docs/2025/07/isamples_export_2025_04_21_16_23_46_geo.parquet',
-
-  // Original Zenodo URLs (currently rate limited)
-  'https://zenodo.org/api/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet/content',
-  'https://cors-anywhere.herokuapp.com/https://zenodo.org/api/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet/content',
-  'https://z.rslv.xyz/10.5281/zenodo.15278211/isamples_export_2025_04_21_16_23_46_geo.parquet'
+  'https://zenodo.org/api/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet/content'
 ]
 
 // Test CORS and find working URL - with rate limiting protection
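
The diff does not show the notebook's own URL-testing code, so the following is a minimal sketch of how an ordered candidate list like `parquet_urls` could be resolved to a working URL. Both `findWorkingUrl` and `rangeProbe` are hypothetical names, and the probe strategy (requesting a single byte) is an assumption rather than the notebook's actual method.

```javascript
// Return the first URL for which probe(url) resolves to true, or null if none do.
// Candidates are tried in order, so the preferred host should be listed first.
async function findWorkingUrl(urls, probe) {
  for (const url of urls) {
    try {
      if (await probe(url)) return url;
    } catch (err) {
      // A network or CORS failure just means "try the next candidate".
    }
  }
  return null;
}

// Hypothetical probe: request a single byte and accept any 2xx response.
// Not called here because it needs network access.
async function rangeProbe(url) {
  const response = await fetch(url, { headers: { Range: "bytes=0-0" } });
  return response.ok;
}
```

Passing the probe in as a parameter keeps the selection logic testable without network access. A `null` result would correspond to the notebook falling back to its generated demo data.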
@@ -243,13 +242,13 @@ createDemoData = async (conn) => {
 md`
 ## Connection Status
 
-${working_parquet_url ?
-`✅ **Connected to live data**: Using ${working_parquet_url.includes('zenodo.org') ? 'Zenodo direct' : working_parquet_url.includes('cors-anywhere') ? 'CORS proxy' : 'original'} URL
-📊 **Dataset**: ~6M records from real iSamples database
-🌐 **Data source**: ${working_parquet_url}`
-:
-`⚠️ **Using demo data**: Remote file not accessible due to CORS restrictions
-📊 **Dataset**: 10K synthetic records with realistic structure
+${working_parquet_url ?
+`✅ **Connected to live data**: Using ${working_parquet_url.includes('r2.dev') ? 'Cloudflare R2' : working_parquet_url.includes('zenodo.org') ? 'Zenodo' : 'fallback'} hosting
+📊 **Dataset**: 6.7M MaterialSampleRecords from iSamples
+🌐 **Data source**: ${working_parquet_url}`
+:
+`⚠️ **Using demo data**: Remote file not accessible due to CORS restrictions
+📊 **Dataset**: 10K synthetic records with realistic structure
 💡 **Note**: This demonstrates the same analysis patterns with representative data`
 }
 `
