Commit eaa0751

rdhyee and claude authored
Expand how-to-use.qmd into a full data catalog (#123)
Replaces the minimal (and slightly inaccurate — res4 was listed as ~70 KB, actually 580 KB; lite as ~150 MB, actually 60 MB) Data Files table with a proper catalog organized by use case:

- Architecture note: all files served via data.isamples.org backed by Cloudflare R2 with the immutable cache-control Worker (deployed 2026-04-17). File naming convention documented.
- Primary datasets (wide, wide+H3, narrow) with size, shape, row count, and when-to-use guidance.
- Pre-aggregated helpers (facet_summaries, facet_cross_filter, sample_facets_v2) with their tiny sizes and why they exist.
- H3 geospatial aggregates at three resolutions with typical altitude.
- Lite sample-point file.
- Cross-reference matrix: which tutorial uses which file.
- Python quick-query recipe.

Intended as the single source of truth that tutorials can link into rather than re-describing each file.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 620a675 commit eaa0751

1 file changed

Lines changed: 89 additions & 12 deletions

how-to-use.qmd

@@ -42,15 +42,92 @@ All code is visible and foldable on tutorial pages. Want to build your own analy
 - **[GitHub](https://github.com/isamplesorg/)** — all source code and data pipelines
 - **[Zenodo](https://zenodo.org/communities/isamples)** — archived datasets for reproducible research
 
-## Data Files {.unnumbered}
-
-All data is hosted on Cloudflare R2 with HTTP range request support:
-
-| File | Size | Description |
-|------|------|-------------|
-| Wide format (H3-indexed) | ~292 MB | 20M rows, all entity types with H3 spatial indices |
-| H3 summary (res4) | ~70 KB | Pre-aggregated cluster counts for instant globe load |
-| H3 summary (res6) | ~200 KB | Mid-zoom cluster detail |
-| H3 summary (res8) | ~600 KB | Fine-zoom cluster detail |
-| Samples lite | ~150 MB | Individual sample points with coordinates |
-| Facet summaries | 2 KB | Pre-computed filter counts (source, material, context, specimen type) |
+## Data Catalog {.unnumbered}
+
+All files are served from [`data.isamples.org`](https://data.isamples.org/)
+backed by Cloudflare R2. A Cloudflare Worker in front of the bucket sets
+`Cache-Control: public, max-age=31536000, immutable` on filename-versioned
+parquets (so browsers and the Cloudflare edge cache aggressively) and
+exposes CORS headers required by DuckDB-WASM's HTTP range requests.
+
+File naming convention: `isamples_<YYYYMM>_<variant>.parquet`. The month
+in the filename is the data-generation snapshot — content at a given
+URL never changes.
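
The naming convention added above lends itself to a small helper. The following is a minimal sketch and not part of the commit: `build_url` and `parse_name` are hypothetical names, and the strict `YYYYMM` validation (four-digit year, month 01 to 12) is an assumption about how snapshots are written.

```python
import re

# Host taken from the catalog text; the helper names below are illustrative.
BASE_URL = "https://data.isamples.org"

# isamples_<YYYYMM>_<variant>.parquet, per the documented convention.
_NAME_RE = re.compile(
    r"^isamples_(\d{4})(0[1-9]|1[0-2])_([A-Za-z0-9_]+)\.parquet$"
)


def build_url(snapshot: str, variant: str) -> str:
    """Return the URL for a snapshot (YYYYMM) and a variant such as 'wide_h3'."""
    name = f"isamples_{snapshot}_{variant}.parquet"
    if not _NAME_RE.match(name):
        raise ValueError(f"not a valid catalog filename: {name!r}")
    return f"{BASE_URL}/{name}"


def parse_name(filename: str) -> tuple[str, str]:
    """Split a catalog filename into (snapshot, variant)."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a valid catalog filename: {filename!r}")
    year, month, variant = match.groups()
    return year + month, variant
```

Because content at a given URL never changes, a caller that wants newer data would pass a newer snapshot string rather than re-fetching the same URL.
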
+
+### Primary datasets {.unnumbered}
+
+The two main files carrying the sample records themselves:
+
+| File | Size | Shape | Rows | Use when you need… |
+|---|---:|---|---:|---|
+| [`isamples_202601_wide.parquet`](https://data.isamples.org/isamples_202601_wide.parquet) | 278 MB | Wide (one row per entity, nested relationships in `p__*` array columns) | 20 M | General entity queries, UI filtering, description text |
+| [`isamples_202601_wide_h3.parquet`](https://data.isamples.org/isamples_202601_wide_h3.parquet) | 292 MB | Wide + H3 BIGINT indices (`h3_res4`, `h3_res6`, `h3_res8`) | 20 M | Geospatial queries with H3 clustering at arbitrary zoom |
+| [`isamples_202512_narrow.parquet`](https://data.isamples.org/isamples_202512_narrow.parquet) | 820 MB | Narrow (graph: nodes + explicit `_edge_` rows, s/p/o/n fields) | 106 M | Graph traversals, relationship-centric analysis, PQG work |
+
+All three represent the same underlying data (SESAR + OpenContext + GEOME
++ Smithsonian) with identical semantics — they differ only in serialization
+strategy. See the
+[Technical: Narrow vs Wide tutorial](/tutorials/narrow_vs_wide_performance.html)
+for a performance comparison.
+
+### Pre-aggregated helpers {.unnumbered}
+
+Small lookup tables computed ahead of time so a page can render facets
+and counts instantly, without touching the 278 MB primary file:
+
+| File | Size | Contents | Use when… |
+|---|---:|---|---|
+| [`isamples_202601_facet_summaries.parquet`](https://data.isamples.org/isamples_202601_facet_summaries.parquet) | 2 KB | `(facet_type, facet_value, count)` for source, material, context, object_type | You want instant initial facet counts with no filters applied |
+| [`isamples_202601_facet_cross_filter.parquet`](https://data.isamples.org/isamples_202601_facet_cross_filter.parquet) | 6 KB | Pre-computed counts for single-facet selections | You want instant cross-filtered counts for a single active filter |
+| [`isamples_202601_sample_facets_v2.parquet`](https://data.isamples.org/isamples_202601_sample_facets_v2.parquet) | 63 MB | `(pid, material, context, object_type)` facet URIs per sample | You need to filter on *combinations* of facets at query time |
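
The "Use when…" column above amounts to a small decision rule, sketched here as one possible reading. `facet_file_for` is a hypothetical helper, and it assumes that "a single active filter" means exactly one selected value across all facets; the tutorials may draw that line differently.

```python
SNAPSHOT = "202601"  # snapshot used in the helper table above


def facet_file_for(active_filters: dict[str, list[str]]) -> str:
    """Pick the smallest helper file that can answer the current facet state.

    `active_filters` maps a facet name to the values currently selected for it.
    """
    # Total number of selected values across every facet.
    selected = sum(len(values) for values in active_filters.values())
    if selected == 0:
        variant = "facet_summaries"     # 2 KB: unfiltered counts
    elif selected == 1:
        variant = "facet_cross_filter"  # 6 KB: counts for one active filter
    else:
        variant = "sample_facets_v2"    # 63 MB: per-sample facets for combinations
    return f"https://data.isamples.org/isamples_{SNAPSHOT}_{variant}.parquet"
```

The point of the ordering is that the 63 MB file is only requested once the two few-kilobyte tables can no longer answer the question.
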
+
+### Geospatial aggregates (H3) {.unnumbered}
+
+Hexagonal H3 cells pre-aggregated at three resolutions for zoom-adaptive
+globe rendering. Each row: `h3_cell, center_lat, center_lng, sample_count,
+dominant_source, source_count`.
+
+| File | Size | Cells | Typical altitude |
+|---|---:|---:|---|
+| [`isamples_202601_h3_summary_res4.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res4.parquet) | 580 KB | ~38 K | Continental (world view) |
+| [`isamples_202601_h3_summary_res6.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res6.parquet) | 1.6 MB | ~112 K | Regional (country / state) |
+| [`isamples_202601_h3_summary_res8.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res8.parquet) | 2.4 MB | ~176 K | Neighborhood |
+
+CSV twins exist alongside each parquet (3× larger) for human inspection —
+browsers use the parquet versions.
+
+### Individual sample points (lite) {.unnumbered}
+
+| File | Size | Contents | Use when… |
+|---|---:|---|---|
+| [`isamples_202601_samples_map_lite.parquet`](https://data.isamples.org/isamples_202601_samples_map_lite.parquet) | 60 MB | `pid, label, source, latitude, longitude, place_name, result_time, h3_res8, h3_res8_hex` — no description | Point-level rendering below ~120 km altitude |
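
Zoom-adaptive rendering over the H3 and lite files above could be driven by a lookup like the one below. This is only a sketch: `layer_for_altitude` is a hypothetical name, and of the three thresholds only the ~120 km point-rendering cutoff comes from the catalog. The 500 km and 3000 km values are placeholder assumptions standing in for "Neighborhood", "Regional", and "Continental".

```python
# Only POINTS_BELOW_KM is taken from the catalog text; the other two
# thresholds are assumed placeholders and would need tuning against the globe.
POINTS_BELOW_KM = 120.0
RES8_BELOW_KM = 500.0    # assumption
RES6_BELOW_KM = 3000.0   # assumption


def layer_for_altitude(altitude_km: float) -> str:
    """Return the file variant to load for a given camera altitude."""
    if altitude_km < 0:
        raise ValueError("altitude must be non-negative")
    if altitude_km < POINTS_BELOW_KM:
        return "samples_map_lite"   # individual points
    if altitude_km < RES8_BELOW_KM:
        return "h3_summary_res8"    # neighborhood-scale cells
    if altitude_km < RES6_BELOW_KM:
        return "h3_summary_res6"    # regional cells
    return "h3_summary_res4"        # continental / world view
```
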
+
+### Which tutorial uses which file {.unnumbered}
+
+| | Interactive Explorer | Search Explorer | Deep-Dive Analysis |
+|---|:-:|:-:|:-:|
+| `wide.parquet` | | ✓ | |
+| `wide_h3.parquet` | | | ✓ |
+| `facet_summaries.parquet` | ✓ | ✓ | ✓ |
+| `facet_cross_filter.parquet` | | ✓ | |
+| `sample_facets_v2.parquet` | ✓ | ✓ | |
+| `h3_summary_res4/6/8.parquet` | ✓ | | |
+| `samples_map_lite.parquet` | ✓ | | |
+
+### Quick query recipes {.unnumbered}
+
+From Python:
+
+```python
+import duckdb
+con = duckdb.connect()
+con.sql("""
+SELECT source, COUNT(*) AS n
+FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')
+WHERE otype = 'MaterialSampleRecord'
+GROUP BY 1 ORDER BY 2 DESC
+""").df()
+```
+
+From the browser via DuckDB-WASM — see the
+[tutorials](/tutorials/) for complete examples with HTTP range requests.
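
The same recipe pattern can target the much smaller H3 summary files. The sketch below is not part of the commit: `top_cells_sql` and `run` are hypothetical helpers, and the column names are taken from the row layout the catalog lists for the H3 summaries. It assumes a DuckDB build that can read `https://` URLs.

```python
def top_cells_sql(resolution: int, limit: int = 10) -> str:
    """Build a query for the busiest H3 cells at resolution 4, 6, or 8."""
    if resolution not in (4, 6, 8):
        raise ValueError("the catalog lists summaries for res 4, 6 and 8 only")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    url = (
        "https://data.isamples.org/"
        f"isamples_202601_h3_summary_res{resolution}.parquet"
    )
    return (
        "SELECT h3_cell, sample_count, dominant_source\n"
        f"FROM read_parquet('{url}')\n"
        "ORDER BY sample_count DESC\n"
        f"LIMIT {int(limit)}"
    )


def run(sql: str) -> list:
    """Execute the query; needs the duckdb package and network access."""
    import duckdb  # imported lazily so the SQL builder works without DuckDB

    return duckdb.connect().sql(sql).fetchall()
```

Calling `run(top_cells_sql(4, limit=5))` would fetch from the 580 KB res4 file rather than the 278 MB wide file, which is the reason the pre-aggregated summaries exist.
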
