feat: :sparkles: raw data to staging by martonvago · Pull Request #94 · onlimit-study/feasibility-data

martonvago · 2026-05-28T09:29:36Z

Description

This PR transforms raw data to staged data. For now, this includes special transformations only for VAS.

Closes #73

This PR needs an in-depth review.

Checklist

Ran just run-all

martonvago · 2026-05-28T09:37:41Z

+VAS_TIME_FIELD_PATTERN = re.compile(
+    r"^vas_(?P<field_name>.+?)(_fasted)?_(?P<time>minus10|30|60|90|120|180|240)min$"
+)


We can share this with the metadata transformation
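To make the sharing concrete, here is a minimal sketch of a helper that both transformations could import. The helper name `parse_vas_column` and the example column names are hypothetical; only the pattern itself comes from this PR.

```python
from __future__ import annotations

import re

# Pattern copied from this PR.
VAS_TIME_FIELD_PATTERN = re.compile(
    r"^vas_(?P<field_name>.+?)(_fasted)?_(?P<time>minus10|30|60|90|120|180|240)min$"
)


def parse_vas_column(column: str) -> tuple[str, int] | None:
    """Split a VAS column name into (field name, minutes), or None if it is not a VAS time field.

    Hypothetical helper, not part of the PR.
    """
    match = VAS_TIME_FIELD_PATTERN.match(column)
    if match is None:
        return None
    # "minus10" encodes the measurement taken 10 minutes before the start.
    time = match.group("time").replace("minus", "-")
    return match.group("field_name"), int(time)


# Made-up column names, used only to illustrate the pattern.
fasted = parse_vas_column("vas_hunger_fasted_minus10min")
after_30 = parse_vas_column("vas_hunger_30min")
not_vas = parse_vas_column("weight_kg")
```

The optional `_fasted` suffix is dropped from the field name, so the fasted and post-meal columns of one VAS question share the same `field_name`.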

martonvago · 2026-05-28T09:44:43Z

+        .rename({"redcap_event_name": "event"})
+        .with_columns(
+            pl.lit("Copenhagen").alias("center"),
+            pl.lit(resource_name).alias("resource_name"),


I'll use this to write the parquet file then drop it

martonvago · 2026-05-28T09:46:34Z

+    for col in vas_cols:
+        match = cast(re.Match[str], VAS_TIME_FIELD_PATTERN.match(col))
+
+        time = match.group("time")
+        if time == "minus10":
+            time = "-10"
+
+        cols_grouped_by_time.setdefault(int(time), []).append(col)


I can rewrite this dictionary construction without a for loop if you want, but I think it will just be longer and more complicated
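For comparison, a sketch of both variants side by side. The function names and the example columns are made up for illustration, and the loop-free version uses `itertools.groupby`, which is only one of several possible rewrites.

```python
import re
from itertools import groupby

VAS_TIME_FIELD_PATTERN = re.compile(
    r"^vas_(?P<field_name>.+?)(_fasted)?_(?P<time>minus10|30|60|90|120|180|240)min$"
)


def _time_of(col: str) -> int:
    """Return the time point in minutes encoded in a VAS column name."""
    match = VAS_TIME_FIELD_PATTERN.match(col)
    if match is None:
        raise ValueError(f"Not a VAS time field: {col}")
    return int(match.group("time").replace("minus", "-"))


def group_with_loop(vas_cols: list[str]) -> dict[int, list[str]]:
    """Group columns by time point with an explicit loop, as in the PR."""
    grouped: dict[int, list[str]] = {}
    for col in vas_cols:
        grouped.setdefault(_time_of(col), []).append(col)
    return grouped


def group_without_loop(vas_cols: list[str]) -> dict[int, list[str]]:
    """Same grouping without a for statement; groupby needs sorted input."""
    ordered = sorted(vas_cols, key=_time_of)
    return {time: list(cols) for time, cols in groupby(ordered, key=_time_of)}


# Made-up column names for illustration.
example_cols = [
    "vas_hunger_fasted_minus10min",
    "vas_hunger_30min",
    "vas_full_fasted_minus10min",
    "vas_full_30min",
]
loop_result = group_with_loop(example_cols)
no_loop_result = group_without_loop(example_cols)
```

The loop-free version needs an extra sort and orders the keys by time rather than by first appearance, which supports the point that it is not obviously simpler.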

martonvago · 2026-05-28T09:49:29Z

This was done automatically, not sure if it's the right way

martonvago · 2026-05-28T09:49:53Z

@@ -1 +1,2 @@
 raw/** filter=lfs diff=lfs merge=lfs -text
+staging/** filter=lfs diff=lfs merge=lfs -text


Or we can track *.parquet

Not sure, is that a question?

Just an alternative

martonvago · 2026-05-28T10:22:19Z

+    """Selects columns and adds base columns common to all dataframes."""
+    return (
+        raw_df.select(["redcap_event_name"] + cols)
+        .rename({"redcap_event_name": "event"})


Having looked at more of the data, I don't think event by itself can be the PK. Raised #95

Yea, I was thinking the same.

lwjohnst86 · 2026-05-28T13:37:54Z

+    """Selects columns and adds base columns common to all dataframes."""
+    return (
+        raw_df.select(["redcap_event_name"] + cols)
+        .rename({"redcap_event_name": "event"})


Yea, I was thinking the same.

lwjohnst86 · 2026-05-28T13:38:59Z

@@ -1 +1,2 @@
 raw/** filter=lfs diff=lfs merge=lfs -text
+staging/** filter=lfs diff=lfs merge=lfs -text


Not sure, is that a question?

lwjohnst86 · 2026-05-28T13:39:18Z

lwjohnst86 · 2026-05-28T13:47:47Z

+    )
+
+
+def raw_to_staged(raw_df: pl.DataFrame) -> list[pl.DataFrame]:


Maybe organize so either all the _fn are above or below the normal functions?

lwjohnst86 · 2026-05-28T13:48:27Z

+            pl.lit("Copenhagen").alias("center"),
+            pl.lit(resource_name).alias("resource_name"),


Suggested change

-            pl.lit("Copenhagen").alias("center"),
-            pl.lit(resource_name).alias("resource_name"),
+            # Only used for creating the Parquet files.
+            pl.lit("Copenhagen").alias("center"),
+            pl.lit(resource_name).alias("resource_name"),

lwjohnst86 · 2026-05-28T13:51:51Z

+    vas_cols = so.keep(
+        raw_df.columns,
+        lambda column: VAS_TIME_FIELD_PATTERN.match(column) is not None,
+    )


This would be better done with Polars itself rather than a filter over the column names, e.g. select() can take a regex pattern, and columns can be dropped with exclude().

lwjohnst86 · 2026-05-28T13:53:51Z

+    vas_dfs = so.pairwise_fmap(
+        list(cols_grouped_by_time.items()), [raw_df], _create_df_for_time_group
+    )
+    return pl.concat(vas_dfs, how="vertical")


I think all of this would be better with a pivot https://docs.pola.rs/user-guide/transformations/pivot/

lwjohnst86

Another comment here

lwjohnst86 · 2026-06-02T11:49:10Z

+)
+
+
+def load_raw_data() -> pl.DataFrame:


This is where we need to think about the design of the Python files, etc, as we will eventually have non-REDCap raw data, while this only loads the REDCap data. Plus this function name implies reading in all raw data, but it's only reading in the latest.

Maybe to start, refactor this to take a path as an arg? And I'm not sure we need to read only the latest file every time; we'll want the orchestrator to handle which files to read and which to skip.

Hmm, yeah, everything here is just about the REDCap data, so like one step in an orchestrated flow. Maybe I will rename things to make that clear, rather than trying to anticipate how it will fit into the flow.

Right now, every time we download the raw REDCap data, we download the full set, that's why I thought taking the latest one made sense. This doesn't quite align with the earlier batch update model, and I was wondering if you had a broader vision for how raw data should be staged. E.g., when would we not stage the latest data?
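One possible split, sketched with the standard library only: the loader takes an explicit path, and picking the latest file is a separate step that an orchestrator could replace. The function names, the CSV format, and the date-prefixed file names are all assumptions for illustration, not how the raw files are actually stored.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path


def find_latest_file(directory: Path, glob: str = "*.csv") -> Path | None:
    """Return the file that sorts last by name, or None if there is none.

    Assumes file names start with a sortable timestamp (hypothetical).
    """
    candidates = sorted(p for p in directory.glob(glob) if p.is_file())
    return candidates[-1] if candidates else None


def load_redcap_raw(path: Path) -> list[dict[str, str]]:
    """Load one REDCap export from an explicit path (CSV assumed)."""
    with path.open(newline="", encoding="utf-8") as file:
        return list(csv.DictReader(file))


# Small demo with temporary, made-up files.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    (raw_dir / "2026-05-01.csv").write_text(
        "redcap_event_name\nvisit_1\n", encoding="utf-8"
    )
    (raw_dir / "2026-05-28.csv").write_text(
        "redcap_event_name\nvisit_1\nvisit_2\n", encoding="utf-8"
    )
    latest = find_latest_file(raw_dir)
    latest_name = latest.name if latest is not None else None
    rows = load_redcap_raw(latest) if latest is not None else []
```

Keeping the "which file" decision out of the loader means the same loader works whether the flow stages only the latest export or a specific one.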

martonvago added 3 commits May 28, 2026 09:19

feat: ✨ stage raw data

c559150

feat: ✨ track staging with git lfs

b4608bb

refactor: ♻️ end early if no raw data

30214ad

martonvago self-assigned this May 28, 2026

add-to-board-token Bot added this to Data development May 28, 2026

github-project-automation Bot moved this to Todo in Data development May 28, 2026

martonvago and others added 2 commits May 28, 2026 10:36

chore: 🔧 add vas to known words

627ee16

Merge branch 'main' into feat/data-raw-to-staging

df75685

martonvago commented May 28, 2026

Comment thread scripts/stage_data.py

martonvago added 2 commits May 28, 2026 10:43

refactor: ♻️ add underscore prefix to functions

455a4ac

Merge branch 'feat/data-raw-to-staging' of github.com:onlimit-study/f…

a7eba5f

…easibility-data into feat/data-raw-to-staging

martonvago commented May 28, 2026

martonvago moved this from Todo to In review in Data development May 28, 2026

martonvago marked this pull request as ready for review May 28, 2026 10:24

martonvago requested a review from a team as a code owner May 28, 2026 10:24

lwjohnst86 requested changes May 28, 2026

github-project-automation Bot moved this from In review to In progress in Data development May 28, 2026

lwjohnst86 requested changes Jun 2, 2026

martonvago commented May 28, 2026 • edited by lwjohnst86
