refactor[next]: Split compilation into a actual compilation and loading of the compilation artifact#2587
refactor[next]: Split compilation into a actual compilation and loading of the compilation artifact#2587havogt wants to merge 40 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Prepares the “next” compilation pipeline for out-of-process / multi-process JIT by splitting “compile” into producing a (mostly) picklable compilation artifact and a separate load() step that materializes the in-process executable callable.
Changes:
- Introduce
stages.CompilationArtifactprotocol and update OTF compilation contracts to return artifacts withload(). - Refactor CPP (nanobind/GTFN) and DaCe compilation steps to return artifact dataclasses instead of live callables; move “decoration” into
artifact.load(). - Update
Backend.compile()and tests to materialize executables viaartifact.load()and add pickle round-trip contract tests.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/next_tests/unit_tests/program_processor_tests/runners_tests/test_gtfn.py | Updates assertions to reflect compilation.device_type living on the compilation step. |
| tests/next_tests/unit_tests/program_processor_tests/runners_tests/dace_tests/test_dace_compilation.py | Adds pickle round-trip contract test for DaCe artifact (ensuring live handles don’t serialize). |
| tests/next_tests/unit_tests/otf_tests/test_compiled_program.py | Adapts a hijacking test to return a dummy artifact with load() instead of a raw callable. |
| tests/next_tests/unit_tests/otf_tests/compilation_tests/test_compiler.py | Adds pickle round-trip contract test for CPP artifact. |
| tests/next_tests/integration_tests/feature_tests/otf_tests/test_nanobind_build.py | Updates integration flow to call .load() on the returned compilation artifact. |
| src/gt4py/next/program_processors/runners/roundtrip.py | Wraps roundtrip’s in-memory callable in a RoundtripArtifact implementing load(). |
| src/gt4py/next/program_processors/runners/gtfn.py | Introduces GTFNCompilationArtifact/GTFNCompiler to stamp device type and apply argument conversion in load(). |
| src/gt4py/next/program_processors/runners/dace/workflow/factory.py | Removes separate decoration stage wiring (now handled by artifact load()). |
| src/gt4py/next/program_processors/runners/dace/workflow/decoration.py | Adjusts imports/types to avoid cycles after moving decoration into artifact load(). |
| src/gt4py/next/program_processors/runners/dace/workflow/compilation.py | Adds DaCeCompilationArtifact and makes DaCe compilation return an on-disk artifact with load(). |
| src/gt4py/next/otf/stages.py | Adds CompilationArtifact protocol. |
| src/gt4py/next/otf/recipes.py | Updates OTFCompileWorkflow typing to end at CompilationArtifact (removes decoration step). |
| src/gt4py/next/otf/definitions.py | Updates CompilationStep protocol to return CompilationArtifact. |
| src/gt4py/next/otf/compilation/compiler.py | Refactors CPP compilation into CPPCompiler + CPPCompilationArtifact(load). |
| src/gt4py/next/backend.py | Changes backend executor to return artifacts; compile() now calls artifact.load(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
This PR refactors the “next” OTF compilation workflow so that compilation steps produce a picklable CompilationArtifact with a load() method, enabling future compilation in separate processes while keeping load/materialization process-local.
Changes:
- Introduces
stages.CompilationArtifactand updatesCompilationStep,OTFCompileWorkflow, andBackend.compile()to compile → return artifact →load()into anExecutableProgram. - Adds/updates concrete artifacts and compilers (CPP/GTFN/DaCe/Roundtrip) so backend-specific loading/wrapping happens in
artifact.load()rather than a separate decoration stage. - Updates tests and adds minimal “pickle round-trip” contract tests for the new artifact types.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/next_tests/unit_tests/program_processor_tests/runners_tests/test_gtfn.py | Updates assertions to reflect device type living on the compilation step/artifact. |
| tests/next_tests/unit_tests/program_processor_tests/runners_tests/dace_tests/test_dace_compilation.py | Adds pickle round-trip contract test for DaCeCompilationArtifact (ensures live handle is dropped). |
| tests/next_tests/unit_tests/otf_tests/test_compiled_program.py | Adjusts test to use a dummy artifact (load() returns a no-op callable). |
| tests/next_tests/unit_tests/otf_tests/compilation_tests/test_compiler.py | Adds pickle round-trip contract test for CPPCompilationArtifact. |
| tests/next_tests/integration_tests/feature_tests/otf_tests/test_nanobind_build.py | Updates integration tests to build → artifact → .load() and passes device_type. |
| src/gt4py/next/program_processors/runners/roundtrip.py | Splits roundtrip into source generation + module load, introduces RoundtripArtifact.load(). |
| src/gt4py/next/program_processors/runners/gtfn.py | Introduces GTFNCompilationArtifact/GTFNCompiler so wrapping happens on load(). |
| src/gt4py/next/program_processors/runners/dace/workflow/factory.py | Removes decoration stage wiring; compilation now returns an artifact. |
| src/gt4py/next/program_processors/runners/dace/workflow/decoration.py | Adjusts imports/types to avoid cycles after moving decoration to artifact load. |
| src/gt4py/next/program_processors/runners/dace/workflow/compilation.py | Introduces DaCeCompilationArtifact and updates compiler to return it (with optional process-local live-program cache). |
| src/gt4py/next/otf/stages.py | Adds CompilationArtifact protocol. |
| src/gt4py/next/otf/recipes.py | Updates OTFCompileWorkflow typing to end at CompilationArtifact (removes decoration stage). |
| src/gt4py/next/otf/definitions.py | Updates CompilationStep protocol to return CompilationArtifact. |
| src/gt4py/next/otf/compilation/compiler.py | Refactors CPP compiler to return CPPCompilationArtifact and adds load(). |
| src/gt4py/next/backend.py | Updates Backend.compile() to call artifact.load() before returning an executable. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Brings PR 2587 (compilation/load artifact split) up to date with the GridTools#2648 fingerprinting refactor. Conflict resolutions: - dace/workflow/compilation.py: keep both — upstream's GPU TX-marker pass and this branch's str(build_folder) (build_folder is now a pathlib.Path here). - dace/workflow/factory.py: keep the `cache` import (used by the cached_translation trait); drop the now-unused `decoration` step/import (decoration moved into the artifact's load()). - roundtrip.py: keep RoundtripArtifact return; upstream's inline decorated_fencil is the pre-split pattern this PR replaces. - test_dace_compilation.py: combine both test sets (TX-marker + artifact pickle). - next/utils.py: restore MetadataBasedPickling (silently dropped by the textual merge — GridTools#2648 removed it as dead code, but this PR's artifacts depend on it). Orthogonal to GridTools#2648's fingerprint opt-out; both share the gt4py_metadata namespace.
Move the in-process CompiledDaceProgram cache from a `pickle=False` field on `DaCeCompilationArtifact` to a module-level dict keyed by build folder. The artifact becomes a plain frozen dataclass with no live process-bound state, restoring the strong `CompilationArtifact` Protocol contract; receivers in another process see an empty cache and fall through to the disk-based `get_program_handle` load path as before. `_live_program` was the only `pickle=False` field in the tree, so the `MetadataBasedPickling` mixin + `_get_metadata_based_state_getstate` helper become dead code and are deleted (re-doing what GridTools#2648 had removed; the merge re-introduced them only to serve this one user). `CPPCompilationArtifact` drops the now-no-op `MetadataBasedPickling` parent. `gt4py_metadata` itself stays, still used for the orthogonal `fingerprint=False` axis introduced in GridTools#2648. The Protocol docstring is restored to the strong form and explicitly flags `RoundtripArtifact.dispatch_backend` as the one remaining deviation, to be lifted into the runner / load-time seam alongside the staged-compilation API end-state from egparedes' layered- architecture proposal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…TS.md Replace `:class:` / `:meth:` Sphinx cross-reference roles introduced in the previous commit with plain backtick literals, per CODING_GUIDELINES.md §3.8 (Google style + Sphinx-napoleon; reST roles forbidden). Add an explicit Do entry in AGENTS.md so future agents do not re-introduce them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sweep the docstrings this PR added or modified relative to upstream/main and replace `:class:` / `:meth:` / `:func:` Sphinx roles with plain backtick literals, per CODING_GUIDELINES.md §3.8. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
egparedes
left a comment
There was a problem hiding this comment.
First quick round of reviews. In general looks good to me. In the second run I would need to pay more attention to the changes in the roundtrip backend and tests.
Keep `sdfg_build_folder` as the str returned by `get_cache_folder` and convert to `pathlib.Path` only where consumers require it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`load()` previously recomputed the .so path by calling two dace-internal helpers (`get_folder_version`, `get_binary_name`) against `build_folder` and `sdfg.name`. That coupled the load path to dace internals — `get_folder_version` was removed in dace 2.0.0a3, silently breaking artifact loading on that release. Capture `sdfg_program.filename` after `sdfg.compile()` and store it on the artifact. `_load_compiled_program` now goes straight to `get_program_handle(self.library_path, sdfg)`; the two helper calls disappear. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without the in-process hand-off, the back-to-back `sdfg.compile()` → `artifact.load()` sequence in thread mode hits dace's "library already loaded, renaming file" path: dace renames the .so on disk and dlopens the renamed copy, silently invalidating `library_path` for any later load. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tehrengruber
left a comment
There was a problem hiding this comment.
Only did a birds eye review. If compilation and loading are cleanly split, it should eventually be possible to only load when compilation already occured in a previous run, which is almost the case in gtfn, but not in dace yet.
| with locking.lock(src_dir): | ||
| data = build_data.read_data(src_dir) | ||
|
|
||
| if not data or not is_compiled(data) or self.force_recompile: | ||
| self.builder_factory(inp, self.cache_lifetime).build() | ||
|
|
||
| new_data = build_data.read_data(src_dir) | ||
|
|
||
| if not new_data or not is_compiled(new_data) or not module_exists(new_data, src_dir): | ||
| raise CompilationError( | ||
| f"On-the-fly compilation unsuccessful for '{inp.program_source.entry_point.name}'." | ||
| ) |
There was a problem hiding this comment.
Except for the builder_factory this part also feels like loading, the only thing necessary to construct a CPPCompilationArtifact appears to be the src_dir.
There was a problem hiding this comment.
I am experimenting to isolate the load to just be an import of a python module. I think will propose as a separate PR after this one.
| bind_func_name=self.bind_func_name, | ||
| device_type=self.device_type, | ||
| ) | ||
| _live_program_cache[artifact.build_folder] = CompiledDaceProgram( |
There was a problem hiding this comment.
Feels a little odd that DaCeCompilationArtifact and DaCeCompiler manage the _live_program_cache, but managing everything from DaCeCompilationArtifact also has its downsides. Ideally we could compile the sdfg without loading it then this part here wouldn't be needed.
There was a problem hiding this comment.
Yes, in a previous iteration that stuff was on the compilation artifact...
Allows ``isinstance(x, CompilationArtifact)`` in tests / asserts. ``runtime_checkable`` on a single-method Protocol only checks for the presence of ``load``, not its signature. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Zero call sites anywhere in src/ or tests/. ``GTFNCompiler`` is built via its own ``GTFNCompilerFactory`` and ``DaCeCompiler`` via ``DaCeCompilationStepFactory``; the base ``CompilerFactory`` was never referenced. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
``sdfg.compile(return_program_handle=False)`` builds the .so without loading it; the subsequent ``DaCeCompilationArtifact.load()`` is then a clean first dlopen via ``get_program_handle``. The "library already loaded, renaming file" warning the compile→load sequence used to trigger is gone, and the artifact no longer needs to pre-populate ``_live_program_cache`` to dodge that rename. The cache stays, but for a narrower reason: hand off the live ``CompiledDaceProgram`` (carrying mutable ``csdfg_argv`` state) across ``Backend.compile()`` invocations for the same build folder, so ``fast_call`` doesn't re-run ``construct_arguments`` every time. Compile-time pre-population removed; the cache is now load-time only. ``library_path`` is resolved inside ``dace_context`` (which sets ``build_folder_mode``), since ``get_binary_name`` falls back to the global config when ``folder_mode`` is unset.
After ``return_program_handle=False``, the cache no longer serves the rename-workaround purpose. Its only remaining role was to share ``CompiledDaceProgram.csdfg_argv`` state across ``Backend.compile()`` invocations for the same build folder — a scenario only triggered by calling ``with_grid_type(...).with_backend(...)`` per program invocation, which is an antipattern that creates a fresh ``CompiledProgramsPool`` every call. The two fastcall tests were going through ``cases.verify`` → ``cases.run`` which rebinds on every call, masking the antipattern with the cache. Rewrite them to bind once at the top of the test and call the bound program directly. They now exercise fast_call reuse via the recommended pattern (single bind, many calls) and would catch a real regression.
…finalize-split # Conflicts: # tests/next_tests/unit_tests/program_processor_tests/runners_tests/dace_tests/test_dace.py
This is a preparation to allow compilation in separate processes for faster jit compilation.
The new workflow: CompilationStep produces a CompilationArtifact which is picklable and has a
load()method which returns the original ExecutableProgram.