[FSTORE-2030] Add support for specifying lookback windows for PIT queries#583

Open
manu-sj wants to merge 1 commit into logicalclocks:main from manu-sj:FSTORE-2030

manu-sj (Contributor) commented May 21, 2026

Summary

  • Adds a user-guide section, "Lookback window for PIT joins", to `feature_view/batch-data.md` covering the two modes, the dict and dataclass call shapes, partition pruning behavior, and the one-sided lower-only form.
  • Cross-links from `feature_view/training-data.md` so users who arrive via `create_training_data` find the same reference.

JIRA

FSTORE-2030

Test plan

  • One-sentence-per-line convention respected.
  • Python code blocks are valid Python (run through ruff via the workspace policy).
  • Reviewer to verify the page renders correctly in the mkdocs preview.

Companion PRs

  • Backend: logicalclocks/hopsworks-ee → branch FSTORE-2030
  • SDK: logicalclocks/hopsworks-api → branch FSTORE-2030
  • Integration tests: logicalclocks/loadtest → branch FSTORE-2030

https://hopsworks.atlassian.net/browse/FSTORE-2030

PIT joins generate predicates of the form `feature_fg.event_time <=
root_fg.event_time` to select the latest matching feature record.
Because this is a range join rather than an equality join, partition
pruning cannot eliminate older partitions of the joined feature group:
the latest valid value may live in any of them. As feature groups
grow with daily ingestion, every PIT query scans more historical
partitions, inflating IO, shuffle volume, and execution time.
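As a rough illustration of the scan problem described above, the toy sketch below models a PIT lookup over daily partitions in plain Python. The data, the `pit_lookup` helper, and the `lower_bound` argument are all hypothetical and are not part of the Hopsworks SDK or backend; they only show why a constant lower bound lets whole partitions be skipped, and what can be missed when the window is too narrow.

```python
from datetime import date

# Hypothetical toy data: one list of (key, event_time, value) rows per daily partition.
partitions = {
    date(2026, 5, 1): [("u1", date(2026, 5, 1), 10)],
    date(2026, 5, 2): [("u1", date(2026, 5, 2), 20)],
    date(2026, 5, 3): [("u2", date(2026, 5, 3), 30)],
}


def pit_lookup(key, root_time, lower_bound=None):
    """Return (latest value with event_time <= root_time, partitions scanned)."""
    scanned = 0
    best = None
    for part_date, rows in partitions.items():
        # A constant lower bound allows a whole partition to be skipped
        # without reading any of its rows.
        if lower_bound is not None and part_date < lower_bound:
            continue
        scanned += 1
        for row_key, event_time, value in rows:
            # The PIT predicate: feature event_time <= root event_time.
            if row_key == key and event_time <= root_time:
                if best is None or event_time > best[0]:
                    best = (event_time, value)
    return (best[1] if best else None), scanned


# Without a bound, every partition is read to find the latest value.
print(pit_lookup("u1", date(2026, 5, 3)))
# With a bound, older partitions are skipped and the result is unchanged.
print(pit_lookup("u1", date(2026, 5, 3), lower_bound=date(2026, 5, 2)))
# A bound that is too tight skips the partition holding the latest value.
print(pit_lookup("u1", date(2026, 5, 3), lower_bound=date(2026, 5, 3)))
```

The last call shows the trade-off: the window has to be wide enough to contain the most recent feature record for every key, otherwise the join finds no match.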

[FSTORE-2030] adds an optional `lookback` parameter to
`FeatureView.get_batch_data`, `create_training_data`, and the split
variants. `Lookback(key=..., start=..., end=...)` declares a
constant-bound window that the backend ANDs onto the root FG and
each joined FG. `key="partition_key"` mode places the bound on the
partition column so that flyingduck and Spark Catalyst can prune
partitions; `key="event_time"` mode emits the predicate on the
`event_time` column, where pruning depends on the engine.
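To make the two modes concrete, here is a minimal stand-in for how a constant-bound window might be turned into a predicate. `LookbackSketch` and `lookback_predicate` are hypothetical names that only mirror the `key`, `start`, and `end` fields named above; the inclusive bounds, the string rendering, and the `ds` partition column are assumptions, not the SDK's or backend's actual behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LookbackSketch:
    """Hypothetical stand-in mirroring Lookback(key=..., start=..., end=...)."""

    key: str                   # "partition_key" or "event_time"
    start: str                 # required lower bound
    end: Optional[str] = None  # optional upper bound


def lookback_predicate(lookback, partition_column, event_time_column):
    """Render the constant-bound predicate for one feature group."""
    if lookback.key == "partition_key":
        column = partition_column
    elif lookback.key == "event_time":
        column = event_time_column
    else:
        raise ValueError(f"unsupported lookback key: {lookback.key!r}")

    # Assumption: bounds are inclusive; the real semantics may differ.
    clauses = [f"{column} >= '{lookback.start}'"]
    if lookback.end is not None:
        clauses.append(f"{column} <= '{lookback.end}'")
    return " AND ".join(clauses)


# One-sided, lower-only form on a hypothetical partition column named "ds".
print(lookback_predicate(LookbackSketch("partition_key", "2026-05-01"), "ds", "event_time"))
# Two-sided form on the event_time column.
print(lookback_predicate(LookbackSketch("event_time", "2026-05-01", "2026-05-07"), "ds", "event_time"))
```

Because both bounds are constants rather than references to the root FG's columns, an engine can evaluate them against partition metadata before reading data, which a range join predicate does not allow.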

The user guides document the new `lookback` parameter on the
batch-data and training-data flows, with worked examples for both
`partition_key` and `event_time` modes. The pages call out that
`start` is required and `end` is optional, and that omitting `end`
falls back to the existing upper-only auto-pruning derived from
`query.end_time`. The pages cross-link to each other so users on
either entry point see the same vocabulary.

Reviewed-by: OpenAI Codex (GPT-5 via codex-plugin-cc 1.0.4) <codex@openai.com>
Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

manu-sj marked this pull request as ready for review May 23, 2026 23:55