GitHub - pranav-walimbe/PyCanopy: A high-performance spatial query layer for Polars

A spatial query layer for Polars. Rust core, Python API.

Note

Up to 155x on range queries · up to 1,949x on kNN · up to 1,521x on polygon contains · up to 8,522x on within joins (vs GeoPandas) · Full benchmarks

Installation

pip install pycanopy

Pre-built wheels for Linux, macOS, and Windows. No Rust toolchain required.

import polars as pl
from pycanopy import SpatialFrame

sf = SpatialFrame(pl.read_parquet("cities.parquet"), x_col="lon", y_col="lat")
result = sf.lazy().filter(pl.col("population") > 100_000).range_query(-10.0, 35.0, 40.0, 70.0).collect()

Why PyCanopy

Polars has no native spatial support. The standard alternatives each require a tradeoff:

GeoPandas applies linear scans by default. Its STRtree needs an explicit .sindex opt-in and is the only index available.
DuckDB spatial has a mature R-tree and good performance, but requires leaving Polars for SQL and creating the index by hand.

PyCanopy stays native to Polars and adds a query optimizer on top. The optimizer decides execution order, decides whether an index is worth building (and which kind) from a cost model, and fuses consecutive spatial predicates where possible.

	PyCanopy	GeoPandas	GeoPolars	DuckDB spatial
Works natively in Polars	✓	✗	✓	✗ (SQL + convert)
Lazy / declarative API	✓	✗	via Polars	SQL only
Auto index selection	✓	✗ (STR only)	✗ (R-tree, manual)	✗ (R-tree, opt-in)
Cost-based index vs scan	✓	✗	✗	✗
KNN join built-in	✓	✗	✗	✗ (O(N) scan)
Spatial operation ordering	✓	✗	✗	✗
Spatial predicate fusion	✓	✗	✗	✗
Live point ingestion	✓	✗	✗	✗

Operations

Spatial operations chain off sf.lazy() and mix freely with Polars' own .filter(expr). The optimizer orders the whole chain before anything runs.

Point datasets

Operation	Call	Returns
Range query	`.range_query(min_x, min_y, max_x, max_y)`	Rows inside the bounding box
k-nearest neighbours	`.knn(x, y, k)`	The `k` rows nearest a point
kNN join	`.knn_join(df, x_col, y_col, k)`	The `k` nearest rows for each query point
Within-distance join	`.within_distance_join(df, x_col, y_col, distance)`	Rows within `distance` of each query point
Convex-hull area	`SpatialFrame.convex_hull_area(xs, ys)`	Area of the convex hull of a point set

Polygon datasets

Operation	Call	Returns
Point in polygon	`.contains(x, y)`	Polygons that contain the point
MBR range	`.range_query(min_x, min_y, max_x, max_y)`	Polygons whose bounding box meets the query box
Within join	`.within_join(df, x_col, y_col)`	Polygons that contain each query point
Point-to-polygon distance join	`.polygon_within_distance_join(df, x_col, y_col, distance)`	Polygons within `distance` of each query point
Point-to-polygon kNN join	`.polygon_knn_join(df, x_col, y_col, k)`	The `k` nearest polygons for each query point
Intersects self-join	`.intersects_pairs()`	Intersecting polygon pairs with overlap area and IoU
Area	`.polygon_areas()`	Area of each polygon
Points near a polygon	`.points_within_distance_of_polygon(polygon, distance)`	Points within `distance` of a single polygon

Joins and aggregations that return tables (.intersects_pairs, .polygon_areas, .points_within_distance_of_polygon) are called directly on the SpatialFrame; the filtering operations chain off .lazy().

Usage

Point dataset: range and KNN

import polars as pl
from pycanopy import SpatialFrame

df = pl.read_parquet("cities.parquet")
sf = SpatialFrame(df, x_col="lon", y_col="lat")

# Bounding-box filter combined with a scalar predicate.
# Optimizer places the scalar filter first, then runs the range query
# on the reduced row set.
result = (
    sf.lazy()
    .filter(pl.col("population") > 100_000)
    .range_query(min_x=-10.0, min_y=35.0, max_x=40.0, max_y=70.0)
    .collect()
)

# k-nearest neighbours
nearest = sf.lazy().knn(x=2.35, y=48.85, k=5).collect()

Inspecting the plan

# Declare ops in any order. explain() shows what the optimizer will actually run.
lf = (
    sf.lazy()
    .range_query(min_x=-10.0, min_y=35.0, max_x=40.0, max_y=70.0)
    .filter(pl.col("population") > 100_000)
)

print(lf.explain())
# RANGE_QUERY [(-10, 35) → (40, 70)]
# FROM
#   FILTER [(col("population")) > (dyn int: 100000)]
#   FROM
#     DF [N=100,000; path: EXPR]

print(lf.explain(optimized=False))
# FILTER [(col("population")) > (dyn int: 100000)]
# FROM
#   RANGE_QUERY [(-10, 35) → (40, 70)]
#   FROM
#     DF [N=100,000]

The optimizer flipped the declaration order. The scalar filter runs first on all rows, then the spatial query runs on the smaller survivor set. Pass optimized=False to see declaration order instead. Plans follow Polars' FROM-chain convention, so the bottom runs first and the top is the final result.

More examples: point and polygon joins, aggregations, branching, delta buffer, index modes

Chaining multiple spatial predicates

# Two range predicates are fused into a single index build on large datasets.
result = (
    sf.lazy()
    .range_query(0.0, 0.0, 50.0, 50.0)
    .range_query(10.0, 10.0, 40.0, 40.0)
    .collect()
)

KNN join

query_df = pl.DataFrame({"qx": [2.35, 13.4], "qy": [48.85, 52.5]})

# For each row in query_df, find the 3 nearest rows in sf.
result = sf.lazy().knn_join(query_df, x_col="qx", y_col="qy", k=3).collect()

Polygon dataset: contains and range

from shapely.geometry import box
from pycanopy import SpatialFrame

polygons = [box(i, 0, i + 0.9, 0.9) for i in range(100_000)]
df = pl.DataFrame({"id": list(range(100_000)), "geom": polygons})
sf = SpatialFrame.from_polygons(df, geometry_col="geom")

# Which polygons contain this point?
containing = sf.lazy().contains(x=5.5, y=0.5).collect()

# Which polygon MBRs intersect this bbox?
intersecting = sf.lazy().range_query(0.0, 0.0, 10.0, 1.0).collect()

Polygon holes

from shapely.geometry import Polygon

# Interior rings (holes) are fully supported.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole  = [(2, 2), (8, 2),  (8, 8),  (2, 8)]
donut = Polygon(outer, [hole])

sf = SpatialFrame.from_polygons(pl.DataFrame({"id": [0], "geom": [donut]}), geometry_col="geom")

# Point inside the hole is NOT contained.
sf.lazy().contains(x=5.0, y=5.0).collect()   # empty

# Point outside the hole but inside the outer ring IS contained.
sf.lazy().contains(x=1.0, y=1.0).collect()   # returns the polygon row

Within join

# For each query point, find which polygons in sf contain it.
query_df = pl.DataFrame({"qx": [5.5, 12.3], "qy": [0.5, 0.5]})
result = sf.lazy().within_join(query_df, x_col="qx", y_col="qy").collect()

Within-distance join

# For each query point, find all sf points within 50 km.
query_df = pl.DataFrame({"qx": [2.35, 13.4], "qy": [48.85, 52.5]})
result = sf.lazy().within_distance_join(query_df, x_col="qx", y_col="qy", distance=50.0).collect()

Point-to-polygon joins

# (polygon SpatialFrame) For each query point, the polygons within a distance
# of it. Distance is to the polygon boundary, and zero when the point is inside.
query_df = pl.DataFrame({"qx": [5.5, 12.3], "qy": [0.5, 0.5]})
near = sf.lazy().polygon_within_distance_join(query_df, x_col="qx", y_col="qy", distance=2.0).collect()

# For each query point, its k nearest polygons (adds a distance_to_polygon column).
nearest = sf.lazy().polygon_knn_join(query_df, x_col="qx", y_col="qy", k=3).collect()

Polygon aggregations

# Area of every polygon (appends an 'area' column).
areas = sf.polygon_areas()

# All intersecting polygon pairs, with overlap area and IoU.
overlaps = sf.intersects_pairs()

# (point SpatialFrame) rows whose point lies within a distance of one polygon.
from shapely.geometry import box
pts = point_sf.points_within_distance_of_polygon(box(0.0, 0.0, 1.0, 1.0), distance=0.5)

Convex-hull area

import numpy as np

# Area of the convex hull of a standalone point set (no frame needed).
area = SpatialFrame.convex_hull_area(np.array([0.0, 1.0, 0.5]), np.array([0.0, 0.0, 1.0]))

Index mode

# Fixed per frame. "auto" lets the cost model choose index vs scan per query;
# "none" always scans; "eager" (default) always builds the selected index.
sf = SpatialFrame(df, x_col="lon", y_col="lat", index_mode="auto")

Branching from a shared base

from pycanopy import SpatialFrame, SpatialLazyFrame

# Expensive filter applied once; two queries branch from the result.
base = sf.lazy().filter(pl.col("population") > 100_000).range_query(-10.0, 35.0, 40.0, 70.0)

major = base.filter(pl.col("population") > 1_000_000)
minor = base.filter(pl.col("population") <= 1_000_000)

# collect_all detects the shared prefix, caches it in Polars,
# and executes both branches in a single pass.
results = SpatialLazyFrame.collect_all([major, minor])
df_major, df_minor = results

Live updates via delta buffer

# Append new points -- visible to queries immediately, no index rebuild yet.
import numpy as np
sf.engine.append_delta(np.array([2.5]), np.array([48.9]))

# Queries probe the main index and scan the delta in parallel.
result = sf.lazy().range_query(-10.0, 35.0, 40.0, 70.0).collect()

# The buffer flushes automatically when accumulated query cost exceeds
# the estimated index rebuild cost, or when it exceeds 10% of N.
# Force a flush manually if needed.
sf.engine.flush()

Benchmarks

Apple M-series used for benchmarking. Warm = cached index, second call. Index build = one-time cost, amortised across queries. Naive baseline is GeoPandas. Datasets are mocked from random uniform distribution.

Single-query operations

Operation	N	Index build	Warm	Naive	Speedup	Idx mem
Range query (points)	100,000	1.3 ms	29 µs	4.4 ms	155x	783 KB
kNN k=10	100,000	9.3 ms	3 µs	5.4 ms	1,949x	1.9 MB
Polygon contains	100,000	6.2 ms	5 µs	7.0 ms	1,521x	3.7 MB
Polygon range	100,000	5.6 ms	8 µs	3.3 ms	391x	3.7 MB
kNN join k=5	10,000	7.3 ms	2.1 ms	5.4 s	2,601x	180 KB
Within-distance join	10,000	0.5 ms	12.6 ms	1.3 s	102x	n/a
Within join (polygons)	10,000	1.6 ms	0.52 ms	4.4 s	8,522x	354 KB

How It Works

PyCanopy plans a query in two layers, then hands the result to Polars to run.

Query flow

  sf.lazy().filter(...).range_query(...).knn_join(...).collect()
                            |
            +---------------+----------------+
            |   Logical plan (whole chain)   |
            |   order ops, fuse predicates,  |
            |   pick join side, EXPR vs IO   |
            +---------------+----------------+
                            |
            +---------------+----------------+
            |   Access path (per operation)  |
            |   index or scan, and which     |
            |   kind: a cost model decides   |
            +---------------+----------------+
                            |
            +---------------+----------------+
            |   Polars runs the emitted ops  |
            +---------------+----------------+
                            |
                      pl.DataFrame

Logical planning

Decisions about the whole chain, made before any data is touched:

Predicate pushdown: scalar filters run before spatial ops, cheapest first (cost estimated from the Polars expression tree). They shrink the row count for little cost.
Fusion: consecutive spatial predicates merge into one index build and one pass.
Join side: symmetric joins (within_join, within_distance_join) index the smaller side. knn_join always indexes the engine side.
Execution path: very selective filters slice the prebuilt index directly (IO path). Otherwise filters run first and a small index is built on the survivors (EXPR path).

Cost model: index or scan?

Building an index costs about N log N, so it only pays off if queried enough times. For each operation the planner compares two estimates (N is the dataset size, Q the number of query points):

  scan   =  Q * N                          every row, for every query point
  index  =  N log N  (build once)  +  Q * log N   (probe per query point)

Building wins once Q passes roughly log N. A one-off lookup scans; a join with many probes builds the index and reuses it. Selectivity refines this: if a predicate keeps most rows, the planner skips the index, since a tree that prunes nothing loses to a plain scan.

index_mode, set per frame, picks how the estimate is used:

eager (default): always build the selected index.
auto: build only when the estimate beats a scan for this Q.
none: always scan.

Index management

Indexes build lazily, never at load time. Dataset stats (extent, distribution, a 32x32 histogram) are computed once up front and drive the first query's choice, after which the index is cached for all later queries. When a non-brute index is built, its kind comes from:

Condition	Index
N < 500, selectivity > 50%, or k/N > 10%	Brute force
Point range, uniform distribution	Uniform grid
Point range, clustered distribution	KD-tree
Point KNN or contains	KD-tree
Polygons, any query	R-tree

All index types share the same coordinate arrays with no duplication.

Why Rust

The hot paths need packed immutable index structures, zero-copy array slices at the Python boundary, and loop-level parallelism. C++ would require a separate FFI layer and loses the native Polars plugin integration that PyO3/Maturin provides for free.

Accepted input formats

Format	Example
numpy `(N, 2)` array	`np.array([[x, y], ...])`
GeoArrow PyArrow array	`pa.StructArray` or `FixedSizeList<2>`
geopandas `GeoSeries`	`gdf.geometry`
list of shapely Points or Polygons	`[Point(x, y), ...]`
list of `(x, y)` tuples	`[(x, y), ...]`
Separate coordinate sequences	`Engine.from_coords(xs, ys)`

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
.cargo		.cargo
.github/workflows		.github/workflows
assets		assets
bench		bench
python/pycanopy		python/pycanopy
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation

Why PyCanopy

Operations

Usage

Point dataset: range and KNN

Inspecting the plan

Chaining multiple spatial predicates

KNN join

Polygon dataset: contains and range

Polygon holes

Within join

Within-distance join

Point-to-polygon joins

Polygon aggregations

Convex-hull area

Index mode

Branching from a shared base

Live updates via delta buffer

Benchmarks

Single-query operations

How It Works

Query flow

Logical planning

Cost model: index or scan?

Index management

Why Rust

Accepted input formats

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Installation

Why PyCanopy

Operations

Usage

Point dataset: range and KNN

Inspecting the plan

Chaining multiple spatial predicates

KNN join

Polygon dataset: contains and range

Polygon holes

Within join

Within-distance join

Point-to-polygon joins

Polygon aggregations

Convex-hull area

Index mode

Branching from a shared base

Live updates via delta buffer

Benchmarks

Single-query operations

How It Works

Query flow

Logical planning

Cost model: index or scan?

Index management

Why Rust

Accepted input formats

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages