Expand how-to-use.qmd into a full data catalog (#123)

rdhyee · claude · web-flow · commit eaa075148f2c · 2026-04-17T07:36:08.000-07:00
Replaces the minimal (and slightly inaccurate — res4 was listed as
~70 KB, actually 580 KB; lite as ~150 MB, actually 60 MB) Data Files
table with a proper catalog organized by use case:

- Architecture note: all files served via data.isamples.org backed by
  Cloudflare R2 with the immutable cache-control Worker (deployed
  2026-04-17). File naming convention documented.
- Primary datasets (wide, wide+H3, narrow) with size, shape, row count,
  and when-to-use guidance.
- Pre-aggregated helpers (facet_summaries, facet_cross_filter,
  sample_facets_v2) with their tiny sizes and why they exist.
- H3 geospatial aggregates at three resolutions with typical altitude.
- Lite sample-point file.
- Cross-reference matrix: which tutorial uses which file.
- Python quick-query recipe.

Intended as the single source of truth that tutorials can link into
rather than re-describing each file.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/how-to-use.qmd b/how-to-use.qmd
@@ -42,15 +42,92 @@ All code is visible and foldable on tutorial pages. Want to build your own analy
 - **[GitHub](https://github.com/isamplesorg/)** — all source code and data pipelines
 - **[Zenodo](https://zenodo.org/communities/isamples)** — archived datasets for reproducible research
 
-## Data Files {.unnumbered}
-
-All data is hosted on Cloudflare R2 with HTTP range request support:
-
-| File | Size | Description |
-|------|------|-------------|
-| Wide format (H3-indexed) | ~292 MB | 20M rows, all entity types with H3 spatial indices |
-| H3 summary (res4) | ~70 KB | Pre-aggregated cluster counts for instant globe load |
-| H3 summary (res6) | ~200 KB | Mid-zoom cluster detail |
-| H3 summary (res8) | ~600 KB | Fine-zoom cluster detail |
-| Samples lite | ~150 MB | Individual sample points with coordinates |
-| Facet summaries | 2 KB | Pre-computed filter counts (source, material, context, specimen type) |
+## Data Catalog {.unnumbered}
+
+All files are served from [`data.isamples.org`](https://data.isamples.org/)
+backed by Cloudflare R2. A Cloudflare Worker in front of the bucket sets
+`Cache-Control: public, max-age=31536000, immutable` on filename-versioned
+parquets (so browsers and the Cloudflare edge cache aggressively) and
+exposes CORS headers required by DuckDB-WASM's HTTP range requests.
+
+File naming convention: `isamples_<YYYYMM>_<variant>.parquet`. The month
+in the filename is the data-generation snapshot — content at a given
+URL never changes.
+
+### Primary datasets {.unnumbered}
+
+The three main files carrying the sample records themselves:
+
+| File | Size | Shape | Rows | Use when you need… |
+|---|---:|---|---:|---|
+| [`isamples_202601_wide.parquet`](https://data.isamples.org/isamples_202601_wide.parquet) | 278 MB | Wide (one row per entity, nested relationships in `p__*` array columns) | 20 M | General entity queries, UI filtering, description text |
+| [`isamples_202601_wide_h3.parquet`](https://data.isamples.org/isamples_202601_wide_h3.parquet) | 292 MB | Wide + H3 BIGINT indices (`h3_res4`, `h3_res6`, `h3_res8`) | 20 M | Geospatial queries with H3 clustering at arbitrary zoom |
+| [`isamples_202512_narrow.parquet`](https://data.isamples.org/isamples_202512_narrow.parquet) | 820 MB | Narrow (graph: nodes + explicit `_edge_` rows, s/p/o/n fields) | 106 M | Graph traversals, relationship-centric analysis, PQG work |
+
+All three represent the same underlying data (SESAR, OpenContext, GEOME,
+and Smithsonian) with identical semantics; they differ only in
+serialization strategy. See the
+[Technical: Narrow vs Wide tutorial](/tutorials/narrow_vs_wide_performance.html)
+for a performance comparison.
+
+### Pre-aggregated helpers {.unnumbered}
+
+Small lookup tables computed ahead of time so a page can render facets
+and counts instantly, without touching the 278 MB primary file:
+
+| File | Size | Contents | Use when… |
+|---|---:|---|---|
+| [`isamples_202601_facet_summaries.parquet`](https://data.isamples.org/isamples_202601_facet_summaries.parquet) | 2 KB | `(facet_type, facet_value, count)` for source, material, context, object_type | You want instant initial facet counts with no filters applied |
+| [`isamples_202601_facet_cross_filter.parquet`](https://data.isamples.org/isamples_202601_facet_cross_filter.parquet) | 6 KB | Pre-computed counts for single-facet selections | You want instant cross-filtered counts for a single active filter |
+| [`isamples_202601_sample_facets_v2.parquet`](https://data.isamples.org/isamples_202601_sample_facets_v2.parquet) | 63 MB | `(pid, material, context, object_type)` facet URIs per sample | You need to filter on *combinations* of facets at query time |
+
+### Geospatial aggregates (H3) {.unnumbered}
+
+Hexagonal H3 cells pre-aggregated at three resolutions for zoom-adaptive
+globe rendering. Each row: `h3_cell, center_lat, center_lng, sample_count,
+dominant_source, source_count`.
+
+| File | Size | Cells | Typical altitude |
+|---|---:|---:|---|
+| [`isamples_202601_h3_summary_res4.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res4.parquet) | 580 KB | ~38 K | Continental (world view) |
+| [`isamples_202601_h3_summary_res6.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res6.parquet) | 1.6 MB | ~112 K | Regional (country / state) |
+| [`isamples_202601_h3_summary_res8.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res8.parquet) | 2.4 MB | ~176 K | Neighborhood |
+
+CSV twins exist alongside each parquet (3× larger) for human inspection —
+browsers use the parquet versions.
+
+### Individual sample points (lite) {.unnumbered}
+
+| File | Size | Contents | Use when… |
+|---|---:|---|---|
+| [`isamples_202601_samples_map_lite.parquet`](https://data.isamples.org/isamples_202601_samples_map_lite.parquet) | 60 MB | `pid, label, source, latitude, longitude, place_name, result_time, h3_res8, h3_res8_hex` — no description | Point-level rendering below ~120 km altitude |
+
+### Which tutorial uses which file {.unnumbered}
+
+| | Interactive Explorer | Search Explorer | Deep-Dive Analysis |
+|---|:-:|:-:|:-:|
+| `wide.parquet` | | ● | |
+| `wide_h3.parquet` | | | ● |
+| `facet_summaries.parquet` | ● | ● | ● |
+| `facet_cross_filter.parquet` | | ● | |
+| `sample_facets_v2.parquet` | ● | ● | |
+| `h3_summary_res4/6/8.parquet` | ● | | |
+| `samples_map_lite.parquet` | ● | | |
+
+### Quick query recipes {.unnumbered}
+
+From Python:
+
+```python
+import duckdb  # .df() below also needs pandas installed
+con = duckdb.connect()  # no argument: in-memory database
+con.sql("""
+    SELECT source, COUNT(*) AS n
+    FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')
+    WHERE otype = 'MaterialSampleRecord'
+    GROUP BY 1 ORDER BY 2 DESC
+""").df()
+```
+
+From the browser via DuckDB-WASM — see the
+[tutorials](/tutorials/) for complete examples with HTTP range requests.
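
Note on the naming convention documented in the catalog (`isamples_<YYYYMM>_<variant>.parquet` served from `data.isamples.org`): it can be sketched as a pair of small helpers that build and parse catalog URLs. This is a minimal illustration and not part of the commit. The names `parquet_url` and `parse_filename` are hypothetical, and the assumption that `<variant>` consists only of lowercase letters, digits, and underscores is inferred from the filenames listed in the diff.

```python
import re
from typing import Tuple

BASE_URL = "https://data.isamples.org"  # host named in the catalog

# Assumption: snapshot is a six-digit YYYYMM with a valid month, and
# variant uses only lowercase letters, digits, and underscores.
_SNAPSHOT = r"\d{4}(?:0[1-9]|1[0-2])"
_VARIANT = r"[a-z0-9_]+"
_FILENAME = re.compile(rf"isamples_({_SNAPSHOT})_({_VARIANT})\.parquet")


def parquet_url(snapshot: str, variant: str) -> str:
    """Build a catalog URL from a YYYYMM snapshot and a variant name."""
    if not re.fullmatch(_SNAPSHOT, snapshot):
        raise ValueError(f"snapshot must be YYYYMM, got {snapshot!r}")
    if not re.fullmatch(_VARIANT, variant):
        raise ValueError(f"unexpected variant name: {variant!r}")
    return f"{BASE_URL}/isamples_{snapshot}_{variant}.parquet"


def parse_filename(name: str) -> Tuple[str, str]:
    """Split a catalog filename into (snapshot, variant)."""
    match = _FILENAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a catalog filename: {name!r}")
    return match.group(1), match.group(2)
```

For example, `parquet_url("202601", "wide_h3")` yields the same URL the Primary datasets table links to. Because the snapshot is part of the filename, pinning it in code pins the exact content being read.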