Skip to content

feat: ✨ raw data to staging#94

Open
martonvago wants to merge 7 commits into
mainfrom
feat/data-raw-to-staging
Open

feat: ✨ raw data to staging#94
martonvago wants to merge 7 commits into
mainfrom
feat/data-raw-to-staging

Conversation

@martonvago

@martonvago martonvago commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR transforms raw data to staged data. For now, this includes special transformations only for VAS.

Closes #73

This PR needs an in-depth review.

Checklist

  • Ran just run-all

Comment thread scripts/stage_data.py
Comment on lines +10 to +12
VAS_TIME_FIELD_PATTERN = re.compile(
r"^vas_(?P<field_name>.+?)(_fasted)?_(?P<time>minus10|30|60|90|120|180|240)min$"
)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can share this with the metadata transformation

Comment thread scripts/stage_data.py
Comment thread scripts/stage_data.py
.rename({"redcap_event_name": "event"})
.with_columns(
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use this to write the parquet file then drop it

Comment thread scripts/stage_data.py
Comment on lines +93 to +100
for col in vas_cols:
match = cast(re.Match[str], VAS_TIME_FIELD_PATTERN.match(col))

time = match.group("time")
if time == "minus10":
time = "-10"

cols_grouped_by_time.setdefault(int(time), []).append(col)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rewrite this dictionary construction without a for loop if you want, but I think it will just be longer and more complicated

Comment thread .typos.toml

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done automatically, not sure if it's the right way

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread .gitattributes
@@ -1 +1,2 @@
raw/** filter=lfs diff=lfs merge=lfs -text
staging/** filter=lfs diff=lfs merge=lfs -text

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can track *.parquet

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, is that a question?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an alternative

Comment thread scripts/stage_data.py
"""Selects columns and adds base columns common to all dataframes."""
return (
raw_df.select(["redcap_event_name"] + cols)
.rename({"redcap_event_name": "event"})

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having looked at more of the data, I don't think event by itself can be the PK. Raised #95

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I was thinking the same.

@martonvago martonvago moved this from Todo to In review in Data development May 28, 2026
@martonvago martonvago marked this pull request as ready for review May 28, 2026 10:24
@martonvago martonvago requested a review from a team as a code owner May 28, 2026 10:24
Comment thread scripts/stage_data.py
Comment thread scripts/stage_data.py
"""Selects columns and adds base columns common to all dataframes."""
return (
raw_df.select(["redcap_event_name"] + cols)
.rename({"redcap_event_name": "event"})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I was thinking the same.

Comment thread .gitattributes
@@ -1 +1,2 @@
raw/** filter=lfs diff=lfs merge=lfs -text
staging/** filter=lfs diff=lfs merge=lfs -text

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, is that a question?

Comment thread .typos.toml

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread scripts/stage_data.py
)


def raw_to_staged(raw_df: pl.DataFrame) -> list[pl.DataFrame]:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe organize so either all the _fn are above or below the normal functions?

Comment thread scripts/stage_data.py
Comment on lines +52 to +53
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),
# Only used for creating the Parquet files.
pl.lit("Copenhagen").alias("center"),
pl.lit(resource_name).alias("resource_name"),

Comment thread scripts/stage_data.py
Comment on lines +87 to +90
vas_cols = so.keep(
raw_df.columns,
lambda column: VAS_TIME_FIELD_PATTERN.match(column) is not None,
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better by using Polars rather than a filter. E.g. select() can take a pattern/exclude

Comment thread scripts/stage_data.py
vas_dfs = so.pairwise_fmap(
list(cols_grouped_by_time.items()), [raw_df], _create_df_for_time_group
)
return pl.concat(vas_dfs, how="vertical")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all of this would be better with a pivot https://docs.pola.rs/user-guide/transformations/pivot/

@github-project-automation github-project-automation Bot moved this from In review to In progress in Data development May 28, 2026

@lwjohnst86 lwjohnst86 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment here

Comment thread scripts/stage_data.py
)


def load_raw_data() -> pl.DataFrame:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we need to think about the design of the Python files, etc, as we will eventually have non-REDCap raw data, while this only loads the REDCap data. Plus this function name implies reading in all raw data, but it's only reading in the latest.

Maybe to start, refactor this to have a path as an arg? And I'm not sure we need to only read the latest all the time, we'll want the orchestrater to handle which files to read and which to not read.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah, everything here is just about the REDCap data, so like one step in an orchestrated flow. Maybe I will rename things to make that clear, rather than trying to anticipate how it will fit into the flow.

Right now, every time we download the raw REDCap data, we download the full set, that's why I thought taking the latest one made sense. This doesn't quite align with the earlier batch update model, and I was wondering if you had a broader vision for how raw data should be staged. E.g., when would we not stage the latest data?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Raw data to staging

2 participants