Skip to content

Add cross-filtering to Explorer facet counts#94

Merged
rdhyee merged 6 commits intoisamplesorg:mainfrom
rdhyee:feature/cross-filtering
Apr 10, 2026
Merged

Add cross-filtering to Explorer facet counts#94
rdhyee merged 6 commits intoisamplesorg:mainfrom
rdhyee:feature/cross-filtering

Conversation

@rdhyee
Copy link
Copy Markdown
Contributor

@rdhyee rdhyee commented Apr 9, 2026

Summary

  • When any filter is active, facet counts update to reflect the intersection of all other active filters (standard faceted search behavior)
  • Selecting SESAR as source → material/context/specimen counts show only what exists in SESAR
  • 4 parallel GROUP BY queries via DuckDB-WASM, each excluding its own dimension
  • DOM manipulation updates count labels without re-rendering checkboxes (preserves selections)
  • Zero-count facet values dimmed for visual clarity
  • When no filters active, pre-computed 2KB summaries used (instant, unchanged)

Test plan

  • Load Explorer with no filters — counts should match pre-computed summaries
  • Check SESAR source → material counts should drop (no archaeology materials)
  • Check SESAR + Rock material → context/specimen counts narrow further
  • Clear all filters → counts restore to pre-computed values
  • Verify checkbox selections persist when counts update
  • Zero-count items should appear dimmed

🤖 Generated with Claude Code

@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented Apr 9, 2026

Pre-cached cross-filter strategy (from Eric Kansa via Slack)

Note: analysis below generated by Claude Code based on Eric's suggestion and the current Explorer architecture.

Eric suggested pre-caching facet counts for filtered subsets, similar to how Open Context uses Django caching with a cache-warming script. Here's the full analysis for our dataset:

Our facet dimensions

Facet Values States (any + each value)
Source 4 (SESAR, OpenContext, GEOME, Smithsonian) 5
Material ~10 11
Context (Sampled Feature) ~8 9
Specimen Type ~8 9

Combinatorics

Single-value-per-facet (Eric's model): 5 × 11 × 9 × 9 = 4,455 combinations. Each combination stores counts for all ~30 facet values across all dimensions. That's ~130K rows — trivially small as a parquet file, probably under 1 MB.

Multi-value-per-facet (checkboxes allow this): each facet has 2^n subsets. That's 2^4 × 2^10 × 2^8 × 2^8 = 2^30 ≈ 1 billion combinations. Obviously not pre-cacheable.

Practical middle ground: pre-cache the single-value combinations (covers the most common interaction pattern — user clicks one checkbox at a time), and fall back to on-the-fly for multi-value selections. This is exactly Eric's "not in cache → calculate on the fly" pattern.

Comparison with current PR approach

Approach Latency Extra download Maintenance Multi-value support
Pre-cached file (Eric's pattern) Instant (~0ms lookup) ~1 MB parquet Rebuild when data changes Falls back to on-the-fly
On-the-fly GROUP BY (this PR) 1-3s per change None Zero Works for any combination
Hybrid (pre-cache + on-the-fly fallback) Instant for common, 1-3s for complex ~1 MB parquet Rebuild when data changes Full coverage

How we'd build the pre-cache

We already have a pipeline that generates isamples_202601_facet_summaries.parquet (2KB, the unfiltered counts). The pre-cache would be a natural extension:

-- For each combination of single-value filters, compute cross-filtered counts
-- Example: "given source=SESAR, what are the material counts?"
SELECT
  'SESAR' as filter_source,
  NULL as filter_material,
  NULL as filter_context,
  NULL as filter_specimen,
  'material' as facet_type,
  has_material_category as facet_value,
  COUNT(*) as count
FROM samples
WHERE n = 'SESAR'
  AND otype = 'MaterialSampleRecord'
  AND latitude IS NOT NULL
GROUP BY has_material_category

Multiply that pattern across all 4,455 combinations. DuckDB can generate the entire file in under a minute locally.

The resulting file schema:

filter_source       | VARCHAR (nullable — NULL means "any")
filter_material     | VARCHAR (nullable)
filter_context      | VARCHAR (nullable)
filter_specimen     | VARCHAR (nullable)
facet_type          | VARCHAR (source/material/context/object_type)
facet_value         | VARCHAR
count               | BIGINT

In the browser, lookup is a simple filtered read — DuckDB-WASM with HTTP range requests would resolve it in milliseconds since the file is tiny.

Recommendation

The hybrid approach is the clear winner:

  1. Ship the pre-cache file alongside the existing facet summaries on data.isamples.org
  2. Use it for instant single-value lookups (covers 90%+ of user interactions)
  3. Fall back to on-the-fly GROUP BY for multi-value or text search combinations
  4. Regenerate the pre-cache whenever we update the main parquet files

This mirrors how Open Context does it with Django caching, just with parquet files instead of a database cache layer. The "cache warming script" is a DuckDB query that runs offline.

When any filter is active, facet counts now reflect the intersection
of all OTHER active filters. For example, selecting SESAR as source
updates material/context/specimen counts to show only what exists
in SESAR data. Uses parallel GROUP BY queries via DuckDB-WASM.

Counts update via DOM manipulation to avoid resetting checkbox
selections. Zero-count facet values are dimmed for visual clarity.
When no filters are active, pre-computed summaries are used (instant).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented Apr 10, 2026

Status check + pre-computed cache

I just generated and uploaded a pre-computed cross-filter cache to complement this PR:

File: https://data.isamples.org/isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows)

Schema:

filter_source       VARCHAR (NULL = any)
filter_material     VARCHAR (NULL = any)
filter_context      VARCHAR (NULL = any)
filter_object_type  VARCHAR (NULL = any)
facet_type          VARCHAR (source/material/context/object_type)
facet_value         VARCHAR (URI)
count               BIGINT

This covers all single-value filter combinations (e.g., "given source=SESAR, what are the material counts?"). Browser-side lookup is instant — just a filtered read on a 6KB file.

Bug found: column name mismatch

The on-the-fly cross-filter queries in this PR reference has_material_category, has_context_category, and has_specimen_category, but the wide parquet has:

  • p__has_material_category (BIGINT[], not VARCHAR)
  • p__has_context_category (BIGINT[], not VARCHAR)
  • p__has_sample_object_type (BIGINT[], not VARCHAR)

These are integer foreign keys in arrays, not URI strings. The facet summaries use URIs. The same mismatch exists in the base WHERE clause builder on main — material/context/specimen filtering silently fails.

Recommendation

  1. Ship the pre-computed cache for instant single-filter lookups (covers 90%+ of interactions, per Eric K's suggestion)
  2. Fix the column mapping as a follow-up — either alias columns in the CREATE VIEW, or use a mapping table to resolve BIGINT IDs → URIs
  3. For multi-filter combinations not in the cache, fall back to on-the-fly queries (once column mapping is fixed)

The cache file is already on R2. Next step is updating the Explorer JS to use it.

- Add 6KB pre-computed cross-filter cache for instant single-filter lookups
- Add 21MB sample_facets view with URI-string columns for on-the-fly fallback
- Fix column name mismatch: wide parquet has p__* BIGINT[] columns, but
  facet values are URI strings — cross-filter now queries sample_facets
- Main whereClause uses pid subquery against sample_facets for facet filters
- Source filter still queries wide parquet directly (n column is correct)

Supplementary files on data.isamples.org:
- isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows)
- isamples_202601_sample_facets_v2.parquet (21 MB, 6M rows)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rdhyee rdhyee force-pushed the feature/cross-filtering branch from f3e3fd3 to 68ec7ee Compare April 10, 2026 15:23
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented Apr 10, 2026

Fix pushed

The column name mismatch is now fixed. Here's the architecture:

Three tiers of facet data

Tier File Size When used
1. Baseline facet_summaries.parquet 2 KB Page load — unfiltered counts
2. Pre-computed cache facet_cross_filter.parquet 6 KB Single-filter lookups — instant
3. On-the-fly fallback sample_facets_v2.parquet 21 MB Multi-filter or text search combos

What changed

  • Cross-filter queries now use sample_facets view (URI strings in source, material, context, object_type columns) instead of the wide parquet (which has p__has_material_category as BIGINT arrays)
  • Single-filter interactions (90%+ of use) hit the 6 KB pre-computed cache — instant response, no scanning
  • Multi-filter or text search falls back to GROUP BY queries against the 21 MB sample_facets — much faster than scanning the 280 MB wide parquet
  • Record retrieval (whereClause) uses a pid subquery against sample_facets for material/context/object_type filters, source filter still hits the wide parquet's n column directly

All supplementary files are live on data.isamples.org.

rdhyee and others added 4 commits April 10, 2026 08:31
1. Multi-value within single facet: fast path now requires exactly
   one value in the active facet, not just one active dimension.
   Multiple selections (e.g., SESAR+GEOME) correctly fall through
   to on-the-fly queries.

2. Text search participates in cross-filtering: buildCrossFilterWhere
   now includes ILIKE conditions. sample_facets_v2 regenerated with
   label, description, place_name columns (63 MB on R2).

3. Clearing filters restores baseline counts: the update cell now
   resets all facet-count labels to baseline values and removes
   zero-count dimming when crossFilteredFacets is null.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codex review found two bugs:

1. facet_summaries counted all 6.68M records but sample_facets only
   had the 5.98M with coordinates — counts jumped when toggling filters.
   Regenerated all three parquet files from the same base universe
   (lat IS NOT NULL). SESAR now consistently 4,389,231 across all files.

2. Baseline summaries included blank-string facet values, but on-the-fly
   queries excluded them with != ''. Regenerated summaries now exclude
   blanks, matching the on-the-fly behavior.

Also: removed dead getDisplayCounts(), fixed stale 0.3MB comment,
added missing quote escaping on source cache lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5 new tests in TestExplorerCrossFiltering:
- Baseline SESAR count matches summaries (>4M)
- Clicking source updates material counts (organicmaterial decreases)
- Clearing filter restores baseline counts
- Zero-count items get dimmed (opacity < 1)
- New parquet endpoints (cross_filter, sample_facets_v2) reachable

Cross-filter tests gracefully skip if data attributes not yet deployed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Convert blank strings to NULL with NULLIF in sample_facets_v2 generation
(586 blank context rows → NULL). Remove redundant != '' guards from
on-the-fly queries since IS NOT NULL now handles both.

Addresses Codex finding #2: blank values in sample_facets caused state
mismatch with baseline summaries (which correctly excluded blanks).
Finding #1 (count universe mismatch) was a false positive — Codex
cached stale files; live CDN has consistent counts across all three
artifacts (SESAR=4,389,231, total=5,980,282).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rdhyee rdhyee merged commit 1be3fba into isamplesorg:main Apr 10, 2026
1 check passed
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented Apr 10, 2026

Addressing Eric K's pre-caching suggestion

We may want to pre-cache facet counts on filtered subsets of data too. We do this with Open Context using Django's caching tools and a little script to warm the cache. Let's say there are 3 different facet fields, each with 5 different values. We'd need to cache 455 different combinations of facet-value filters (assuming 1 value per facet field). It's a lot, but not an overwhelming amount to pre-calculate and cache. If a user makes a request that's not in the cache, the facet counts can be calculated on the fly. But hopefully, that would more usually involve smaller numbers of search results so generating facet counts would happen in a reasonable time.

This is now implemented — same pattern, parquet files instead of Django cache:

Eric's pattern Implementation
Pre-cache filtered subsets facet_cross_filter.parquet (6 KB) — all single-value filter combos
~455 combinations 170 filter-facet pairs across 4 facets (56 values total)
Cache warming script DuckDB query against export parquet, runs in ~3 seconds
On-the-fly fallback GROUP BY against sample_facets_v2.parquet (63 MB) for multi-value or text search
"Smaller result sets = reasonable time" 63 MB facets file is ~4x smaller than the 280 MB wide parquet

Three tiers total:

  1. No filterfacet_summaries.parquet (2 KB) — instant
  2. Single-value single-facetfacet_cross_filter.parquet (6 KB) — instant lookup
  3. Multi-value or text search → on-the-fly GROUP BY against sample_facets_v2.parquet (63 MB)

All three files are generated from the same base universe (5,980,282 samples with coordinates) so counts are consistent across tiers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant