Skip to content

Allow pickling PyExpr #1520

@ntjohnson1

Description

@ntjohnson1

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want to hold a PyExpr defining some item of interest then use multiprocessing. This tries to pickle your objects.
One motivating use case is accessing datafusion data from PyTorch.

If we like this approach I assume it would be nice to extend other things to be picklable using the exisiting FFI feature pathways.

Describe the solution you'd like
Allow PyExpr to be pickleable.

spawn_ctx = mp.get_context("spawn")
expr = (col("a") * lit(2)) + lit(1)
chunks = [[1, 2, 3], [10, 20, 30]]

def _apply_builtin_expr(args: tuple) -> list:
    expr, values = args
    ctx = SessionContext()
    batch = pa.RecordBatch.from_arrays([pa.array(values, type=pa.int64())], names=["a"])
    df = ctx.create_dataframe([[batch]], name="t")
    return df.select(expr.alias("out")).collect()[0].column(0).to_pylist()

with spawn_ctx.Pool(processes=2) as pool:
   results = pool.map(_apply_builtin_expr, [(expr, c) for c in chunks])

Describe alternatives you've considered
Use lazy initialization with a factory that returns a pyexpr.

Additional context
A pyexpr can be related to custom registered items. One mitigation for this is to allow setting the global session context, but I'm not sure how this singleton value is managed with python multiprocessing. Additionally, I'm not sure if this means we need to pickle the SessionContext itself.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions