Clean versions of the CTRPv2 datasets #428

Open
PascalIversen wants to merge 4 commits into development from ctrpv2_clean

@PascalIversen PascalIversen commented Jun 11, 2026

Collaborator

CTRPv2 versions in which we drop drugs whose curves show no real signal (judged by the CurveCurator curve p-value). We should warn when someone runs LDO on them, because I use the test labels to determine what counts as a "bad" drug. The labels of these bad drugs (1) are just noise during training and (2) pull the mean per-drug Pearson correlation below what it is on drugs that actually work.

Three versions: clean, cleaner, cleanest. A drug is kept if it has at least N cell lines with a CurveCurator-significant response, with N = 15 / 30 / 50. The count is absolute because some targeted drugs are very selective, e.g. Venetoclax.
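
The drug-level criterion described here can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the column names `drug`, `cell_line`, and `significant` and the helper `keep_active_drugs` are hypothetical, and the real loaders may derive significance differently.

```python
import pandas as pd

# Thresholds as stated in the PR description; the dict itself is illustrative.
MIN_RESPONDERS = {"clean": 15, "cleaner": 30, "cleanest": 50}


def keep_active_drugs(response: pd.DataFrame, min_responders: int) -> pd.DataFrame:
    """Keep all rows of drugs with at least `min_responders` significant cell lines.

    Assumed (hypothetical) columns: `drug`, `cell_line`, `significant` (bool).
    """
    # Count distinct cell lines with a significant curve, per drug.
    responders = (
        response[response["significant"]]
        .groupby("drug")["cell_line"]
        .nunique()
    )
    active = responders[responders >= min_responders].index
    # Filter at the drug level only: non-significant curves of kept drugs stay in.
    return response[response["drug"].isin(active)]


# Toy data: drug A has 3 significant cell lines, drug B has 1.
toy = pd.DataFrame(
    {
        "drug": ["A"] * 4 + ["B"] * 4,
        "cell_line": ["c1", "c2", "c3", "c4"] * 2,
        "significant": [True, True, True, False, True, False, False, False],
    }
)
filtered = keep_active_drugs(toy, min_responders=2)
```

With the toy threshold of 2, drug A is kept with all four of its rows (including the non-significant one) and drug B is dropped entirely.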

I don't want to just drop non-significant curves per experiment, because that would be leakage in LCO and LPO. With drug-level filtering, the leakage currently only affects LDO.

We could also think about adding a filter that can be adjusted for any N or even %, but I think this is enough for now?

JudithBernett and others added 4 commits June 8, 2026 16:23
Time for a new version - v1.5.0
Add CTRPv2_clean, CTRPv2_cleaner and CTRPv2_cleanest as selectable datasets,
derived from the original CTRPv2 download so nothing new needs hosting. On first
load each one builds a local folder that keeps only drugs with at least N
reproducible CurveCurator-significant dose-response curves (N = 15, 30, 50) and
symlinks CTRPv2's feature files.

Filtering is done at the drug level only, never per experiment, so the sample is
not conditioned on the response. Many CTRPv2 compounds (prodrugs, non-cytotoxics)
have flat curves and meaningless IC50 labels; selective drugs that have a real
cluster of responders are kept because the criterion counts responders rather
than their fraction.
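
The "local folder that symlinks CTRPv2's feature files" step from this commit message could look roughly like the sketch below. The helper `link_feature_files` and the file name `gene_expression.csv` are hypothetical and only illustrate the idea; creating symlinks may also need extra privileges on some platforms (e.g. Windows).

```python
import os
import tempfile
from pathlib import Path


def link_feature_files(source_dir: Path, target_dir: Path, skip: set[str]) -> list[str]:
    """Symlink every entry of `source_dir` into `target_dir`, except names in `skip`."""
    target_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for src in sorted(source_dir.iterdir()):
        if src.name in skip:
            continue  # e.g. the response table, which is rewritten after filtering
        dst = target_dir / src.name
        # lexists also catches dangling symlinks left over from an earlier run.
        if not os.path.lexists(dst):
            os.symlink(src.resolve(), dst)
        linked.append(src.name)
    return linked


# Toy demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "CTRPv2"
    base.mkdir()
    (base / "CTRPv2.csv").write_text("response table")
    (base / "gene_expression.csv").write_text("features")

    clean = Path(tmp) / "CTRPv2_clean"
    linked = link_feature_files(base, clean, skip={"CTRPv2.csv"})
    link_is_symlink = (clean / "gene_expression.csv").is_symlink()
    link_content = (clean / "gene_expression.csv").read_text()
```

Linking rather than copying keeps the derived folder small and guarantees that the filtered variants read exactly the same feature files as the original dataset.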

Register the three loaders in AVAILABLE_DATASETS and update the factory test.
The clean/cleaner/cleanest variants drop inactive drugs, so leave-drug-out (LDO)
metrics on them are optimistic: real screens contain inactive compounds a model
would still face. Emit a warning when one of these datasets is used with LDO.
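
A warning of this kind might be emitted along the following lines. This is only a sketch: the function `warn_if_optimistic` and the way the test mode is passed as the string `"LDO"` are assumptions for illustration, not the code added in this PR.

```python
import warnings

# Dataset names as introduced in this PR.
FILTERED_DATASETS = {"CTRPv2_clean", "CTRPv2_cleaner", "CTRPv2_cleanest"}


def warn_if_optimistic(dataset_name: str, test_mode: str) -> bool:
    """Warn when a drug-filtered dataset is evaluated in leave-drug-out mode.

    Returns True if a warning was emitted. Hypothetical helper for illustration.
    """
    if dataset_name in FILTERED_DATASETS and test_mode == "LDO":
        warnings.warn(
            f"{dataset_name} drops inactive drugs, so LDO metrics on it are "
            "optimistic: real screens contain inactive compounds.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Capture warnings to show which combinations trigger one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned_ldo = warn_if_optimistic("CTRPv2_clean", "LDO")
    warned_lco = warn_if_optimistic("CTRPv2_clean", "LCO")
    warned_raw = warn_if_optimistic("CTRPv2", "LDO")
```

Only the first call warns; the filtered dataset in LCO mode and the unfiltered CTRPv2 in LDO mode pass silently.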
@JudithBernett

Contributor

How is this specific to CTRPv2?

Comment on lines +278 to +291
    meta_path = os.path.join(path_data, "meta", "tissue_mapping.csv")
    if not os.path.exists(meta_path):
        download_dataset("meta", path_data, redownload=True)
    path = os.path.join(path_data, dataset_name, f"{dataset_name}.csv")
    response_data = pd.read_csv(path, dtype={"pubchem_id": str, "cell_line_name": str})
    response_data[DRUG_IDENTIFIER] = response_data[DRUG_IDENTIFIER].str.replace(",", "")
    check_measure(measure, list(response_data.columns), dataset_name)
    return DrugResponseDataset(
        response=response_data[measure].values,
        cell_line_ids=response_data[CELL_LINE_IDENTIFIER].values,
        drug_ids=response_data[DRUG_IDENTIFIER].values,
        tissues=response_data[TISSUE_IDENTIFIER].values,
        dataset_name=dataset_name,
    )

Contributor

This is all duplicated from _load_zenodo_dataset.

@PascalIversen

Collaborator Author

Not specific, but CTRPv2 is the only one I use these days :D

Should I
a) make a general solution with the next CLI flag?
b) just make it for every dataset where it's possible?
c) just do GDSC1 and 2, CTRPv1 and 2, and CCLE?

@JudithBernett

Contributor

I'd say this can be a general solution; maybe with percentages per drug instead of the hardcoded 15, 30, 50? I'd check whether the dataset was curve-curated (i.e., whether it has the Regulation column), and then this could be done for all datasets.

@PascalIversen

Collaborator Author

Hm, I am not sure. Maybe it is good to establish some benchmarking datasets, so we can have a leaderboard on CTRPv2 cleanest, etc. Both are also options. Let's see after the weekend.

But I want to keep it absolute, not percentage-based; I see no reason why a percentage-based threshold would make sense. Maybe a drug has a highly selective signal in only 5 breast cancer cell lines or something. It should be included regardless of how many other cell lines have been measured.
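
To make the two criteria under discussion concrete, here is a toy comparison. All numbers are invented for illustration, and the 10% cutoff is an arbitrary stand-in for a percentage-based rule; nothing here comes from the actual CTRPv2 data.

```python
import pandas as pd

# Invented per-drug summary: how many cell lines were screened and how many
# showed a significant curve.
counts = pd.DataFrame(
    {
        "drug": ["selective", "broad"],
        "n_screened": [800, 40],
        "n_significant": [15, 12],
    }
).set_index("drug")

counts["fraction"] = counts["n_significant"] / counts["n_screened"]

# Absolute criterion: at least 15 significant cell lines.
keep_absolute = counts.index[counts["n_significant"] >= 15].tolist()

# Percentage-based criterion: at least 10% of screened cell lines significant.
keep_fraction = counts.index[counts["fraction"] >= 0.10].tolist()
```

In this toy case the absolute rule keeps only the widely screened, highly selective drug, while the fraction rule keeps only the drug screened on few cell lines. The two rules can therefore disagree in both directions, which is the trade-off being weighed in this thread.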

@JudithBernett

Contributor

OK, let's discuss next week :D But I think 15, 30, 50 are somewhat arbitrarily chosen and would mean different things depending on the overall screening size.
