Add cross-filtering to Explorer facet counts by rdhyee · Pull Request #94 · isamplesorg/isamplesorg.github.io

rdhyee · 2026-04-09T00:10:00Z

Summary

When any filter is active, facet counts update to reflect the intersection of all other active filters (standard faceted search behavior)
Selecting SESAR as source → material/context/specimen counts show only what exists in SESAR
4 parallel GROUP BY queries via DuckDB-WASM, each excluding its own dimension
DOM manipulation updates count labels without re-rendering checkboxes (preserves selections)
Zero-count facet values dimmed for visual clarity
When no filters active, pre-computed 2KB summaries used (instant, unchanged)

Test plan

Load Explorer with no filters — counts should match pre-computed summaries
Check SESAR source → material counts should drop (no archaeology materials)
Check SESAR + Rock material → context/specimen counts narrow further
Clear all filters → counts restore to pre-computed values
Verify checkbox selections persist when counts update
Zero-count items should appear dimmed

🤖 Generated with Claude Code

rdhyee · 2026-04-09T00:39:03Z

Pre-cached cross-filter strategy (from Eric Kansa via Slack)

Note: analysis below generated by Claude Code based on Eric's suggestion and the current Explorer architecture.

Eric suggested pre-caching facet counts for filtered subsets, similar to how Open Context uses Django caching with a cache-warming script. Here's the full analysis for our dataset:

Our facet dimensions

Facet	Values	States (any + each value)
Source	4 (SESAR, OpenContext, GEOME, Smithsonian)	5
Material	~10	11
Context (Sampled Feature)	~8	9
Specimen Type	~8	9

Combinatorics

Single-value-per-facet (Eric's model): 5 × 11 × 9 × 9 = 4,455 combinations. Each combination stores counts for all ~30 facet values across all dimensions. That's ~130K rows — trivially small as a parquet file, probably under 1 MB.

Multi-value-per-facet (checkboxes allow this): each facet has 2^n subsets. That's 2^4 × 2^10 × 2^8 × 2^8 = 2^30 ≈ 1 billion combinations. Obviously not pre-cacheable.

Practical middle ground: pre-cache the single-value combinations (covers the most common interaction pattern — user clicks one checkbox at a time), and fall back to on-the-fly for multi-value selections. This is exactly Eric's "not in cache → calculate on the fly" pattern.

Comparison with current PR approach

Approach	Latency	Extra download	Maintenance	Multi-value support
Pre-cached file (Eric's pattern)	Instant (~0ms lookup)	~1 MB parquet	Rebuild when data changes	Falls back to on-the-fly
On-the-fly GROUP BY (this PR)	1-3s per change	None	Zero	Works for any combination
Hybrid (pre-cache + on-the-fly fallback)	Instant for common, 1-3s for complex	~1 MB parquet	Rebuild when data changes	Full coverage

How we'd build the pre-cache

We already have a pipeline that generates isamples_202601_facet_summaries.parquet (2KB, the unfiltered counts). The pre-cache would be a natural extension:

-- For each combination of single-value filters, compute cross-filtered counts
-- Example: "given source=SESAR, what are the material counts?"
SELECT
  'SESAR' as filter_source,
  NULL as filter_material,
  NULL as filter_context,
  NULL as filter_specimen,
  'material' as facet_type,
  has_material_category as facet_value,
  COUNT(*) as count
FROM samples
WHERE n = 'SESAR'
  AND otype = 'MaterialSampleRecord'
  AND latitude IS NOT NULL
GROUP BY has_material_category

Multiply that pattern across all 4,455 combinations. DuckDB can generate the entire file in under a minute locally.

The resulting file schema:

filter_source       | VARCHAR (nullable — NULL means "any")
filter_material     | VARCHAR (nullable)
filter_context      | VARCHAR (nullable)
filter_specimen     | VARCHAR (nullable)
facet_type          | VARCHAR (source/material/context/object_type)
facet_value         | VARCHAR
count               | BIGINT

In the browser, lookup is a simple filtered read — DuckDB-WASM with HTTP range requests would resolve it in milliseconds since the file is tiny.

Recommendation

The hybrid approach is the clear winner:

Ship the pre-cache file alongside the existing facet summaries on data.isamples.org
Use it for instant single-value lookups (covers 90%+ of user interactions)
Fall back to on-the-fly GROUP BY for multi-value or text search combinations
Regenerate the pre-cache whenever we update the main parquet files

This mirrors how Open Context does it with Django caching, just with parquet files instead of a database cache layer. The "cache warming script" is a DuckDB query that runs offline.

When any filter is active, facet counts now reflect the intersection of all OTHER active filters. For example, selecting SESAR as source updates material/context/specimen counts to show only what exists in SESAR data. Uses parallel GROUP BY queries via DuckDB-WASM. Counts update via DOM manipulation to avoid resetting checkbox selections. Zero-count facet values are dimmed for visual clarity. When no filters are active, pre-computed summaries are used (instant). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rdhyee · 2026-04-10T15:13:42Z

Status check + pre-computed cache

I just generated and uploaded a pre-computed cross-filter cache to complement this PR:

File: https://data.isamples.org/isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows)

Schema:

filter_source       VARCHAR (NULL = any)
filter_material     VARCHAR (NULL = any)
filter_context      VARCHAR (NULL = any)
filter_object_type  VARCHAR (NULL = any)
facet_type          VARCHAR (source/material/context/object_type)
facet_value         VARCHAR (URI)
count               BIGINT

This covers all single-value filter combinations (e.g., "given source=SESAR, what are the material counts?"). Browser-side lookup is instant — just a filtered read on a 6KB file.

Bug found: column name mismatch

The on-the-fly cross-filter queries in this PR reference has_material_category, has_context_category, and has_specimen_category, but the wide parquet has:

p__has_material_category (BIGINT[], not VARCHAR)
p__has_context_category (BIGINT[], not VARCHAR)
p__has_sample_object_type (BIGINT[], not VARCHAR)

These are integer foreign keys in arrays, not URI strings. The facet summaries use URIs. The same mismatch exists in the base WHERE clause builder on main — material/context/specimen filtering silently fails.

Recommendation

Ship the pre-computed cache for instant single-filter lookups (covers 90%+ of interactions, per Eric K's suggestion)
Fix the column mapping as a follow-up — either alias columns in the CREATE VIEW, or use a mapping table to resolve BIGINT IDs → URIs
For multi-filter combinations not in the cache, fall back to on-the-fly queries (once column mapping is fixed)

The cache file is already on R2. Next step is updating the Explorer JS to use it.

- Add 6KB pre-computed cross-filter cache for instant single-filter lookups - Add 21MB sample_facets view with URI-string columns for on-the-fly fallback - Fix column name mismatch: wide parquet has p__* BIGINT[] columns, but facet values are URI strings — cross-filter now queries sample_facets - Main whereClause uses pid subquery against sample_facets for facet filters - Source filter still queries wide parquet directly (n column is correct) Supplementary files on data.isamples.org: - isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows) - isamples_202601_sample_facets_v2.parquet (21 MB, 6M rows) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rdhyee · 2026-04-10T15:23:44Z

Fix pushed

The column name mismatch is now fixed. Here's the architecture:

Three tiers of facet data

Tier	File	Size	When used
1. Baseline	`facet_summaries.parquet`	2 KB	Page load — unfiltered counts
2. Pre-computed cache	`facet_cross_filter.parquet`	6 KB	Single-filter lookups — instant
3. On-the-fly fallback	`sample_facets_v2.parquet`	21 MB	Multi-filter or text search combos

What changed

Cross-filter queries now use sample_facets view (URI strings in source, material, context, object_type columns) instead of the wide parquet (which has p__has_material_category as BIGINT arrays)
Single-filter interactions (90%+ of use) hit the 6 KB pre-computed cache — instant response, no scanning
Multi-filter or text search falls back to GROUP BY queries against the 21 MB sample_facets — much faster than scanning the 280 MB wide parquet
Record retrieval (whereClause) uses a pid subquery against sample_facets for material/context/object_type filters, source filter still hits the wide parquet's n column directly

All supplementary files are live on data.isamples.org.

1. Multi-value within single facet: fast path now requires exactly one value in the active facet, not just one active dimension. Multiple selections (e.g., SESAR+GEOME) correctly fall through to on-the-fly queries. 2. Text search participates in cross-filtering: buildCrossFilterWhere now includes ILIKE conditions. sample_facets_v2 regenerated with label, description, place_name columns (63 MB on R2). 3. Clearing filters restores baseline counts: the update cell now resets all facet-count labels to baseline values and removes zero-count dimming when crossFilteredFacets is null. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Codex review found two bugs: 1. facet_summaries counted all 6.68M records but sample_facets only had the 5.98M with coordinates — counts jumped when toggling filters. Regenerated all three parquet files from the same base universe (lat IS NOT NULL). SESAR now consistently 4,389,231 across all files. 2. Baseline summaries included blank-string facet values, but on-the-fly queries excluded them with != ''. Regenerated summaries now exclude blanks, matching the on-the-fly behavior. Also: removed dead getDisplayCounts(), fixed stale 0.3MB comment, added missing quote escaping on source cache lookup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

5 new tests in TestExplorerCrossFiltering: - Baseline SESAR count matches summaries (>4M) - Clicking source updates material counts (organicmaterial decreases) - Clearing filter restores baseline counts - Zero-count items get dimmed (opacity < 1) - New parquet endpoints (cross_filter, sample_facets_v2) reachable Cross-filter tests gracefully skip if data attributes not yet deployed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Convert blank strings to NULL with NULLIF in sample_facets_v2 generation (586 blank context rows → NULL). Remove redundant != '' guards from on-the-fly queries since IS NOT NULL now handles both. Addresses Codex finding #2: blank values in sample_facets caused state mismatch with baseline summaries (which correctly excluded blanks). Finding #1 (count universe mismatch) was a false positive — Codex cached stale files; live CDN has consistent counts across all three artifacts (SESAR=4,389,231, total=5,980,282). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rdhyee · 2026-04-10T21:46:16Z

Addressing Eric K's pre-caching suggestion

We may want to pre-cache facet counts on filtered subsets of data too. We do this with Open Context using Django's caching tools and a little script to warm the cache. Let's say there are 3 different facet fields, each with 5 different values. We'd need to cache 455 different combinations of facet-value filters (assuming 1 value per facet field). It's a lot, but not an overwhelming amount to pre-calculate and cache. If a user makes a request that's not in the cache, the facet counts can be calculated on the fly. But hopefully, that would more usually involve smaller numbers of search results so generating facet counts would happen in a reasonable time.

This is now implemented — same pattern, parquet files instead of Django cache:

Eric's pattern	Implementation
Pre-cache filtered subsets	`facet_cross_filter.parquet` (6 KB) — all single-value filter combos
~455 combinations	170 filter-facet pairs across 4 facets (56 values total)
Cache warming script	DuckDB query against export parquet, runs in ~3 seconds
On-the-fly fallback	GROUP BY against `sample_facets_v2.parquet` (63 MB) for multi-value or text search
"Smaller result sets = reasonable time"	63 MB facets file is ~4x smaller than the 280 MB wide parquet

Three tiers total:

No filter → facet_summaries.parquet (2 KB) — instant
Single-value single-facet → facet_cross_filter.parquet (6 KB) — instant lookup
Multi-value or text search → on-the-fly GROUP BY against sample_facets_v2.parquet (63 MB)

All three files are generated from the same base universe (5,980,282 samples with coordinates) so counts are consistent across tiers.

rdhyee mentioned this pull request Apr 9, 2026

iSamples MVP Cleanup & Simplification Strategy #49

Open

6 tasks

rdhyee force-pushed the feature/cross-filtering branch from f3e3fd3 to 68ec7ee Compare April 10, 2026 15:23

rdhyee and others added 4 commits April 10, 2026 08:31

rdhyee merged commit 1be3fba into isamplesorg:main Apr 10, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add cross-filtering to Explorer facet counts#94

Add cross-filtering to Explorer facet counts#94
rdhyee merged 6 commits intoisamplesorg:mainfrom
rdhyee:feature/cross-filtering

rdhyee commented Apr 9, 2026

Uh oh!

rdhyee commented Apr 9, 2026

Uh oh!

rdhyee commented Apr 10, 2026

Uh oh!

rdhyee commented Apr 10, 2026

Uh oh!

Uh oh!

rdhyee commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rdhyee commented Apr 9, 2026

Summary

Test plan

Uh oh!

rdhyee commented Apr 9, 2026

Pre-cached cross-filter strategy (from Eric Kansa via Slack)

Our facet dimensions

Combinatorics

Comparison with current PR approach

How we'd build the pre-cache

Recommendation

Uh oh!

rdhyee commented Apr 10, 2026

Status check + pre-computed cache

Bug found: column name mismatch

Recommendation

Uh oh!

rdhyee commented Apr 10, 2026

Fix pushed

Three tiers of facet data

What changed

Uh oh!

Uh oh!

rdhyee commented Apr 10, 2026

Addressing Eric K's pre-caching suggestion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant