Added subcaption, summary and modality label generation scripts by saidul-islam98 · Pull Request #38 · VectorInstitute/pmc-data-extraction

saidul-islam98 · 2026-01-19T17:22:19Z

PR Type

Feature

Short Description

Adds a complete vLLM inference pipeline for Open-PMC-18M, including Python inference scripts, Slurm bash launch scripts, and a README documenting the full 3-stage workflow for subcaption extraction, image-context summary generation, and modality labeling.

Python Scripts Added

generate_subcaption_vllm.py
generate_summary_vllm.py
generate_modality_labels_vllm.py

Slurm Scripts Added

run_vllm_subcaption_inference.sh
run_vllm_summary_inference.sh
run_vllm_modality_inference.sh

Tests Added

None

afkanpour · 2026-05-14T15:34:03Z

+echo "Module Loaded and Environment Activated!"
+
+# Specify which GPUs to use
+CUDA_VISIBLE_DEVICES=0,1 \ 


Remove the trailing whitespace.
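For context on why this matters: in a POSIX shell, a backslash continues the line only when it is the very last character. With a space after it, the backslash escapes that space instead, so the `CUDA_VISIBLE_DEVICES` assignment would no longer apply to the command on the next line. A minimal sketch of a check for this pattern follows; the function name and the sample script text are illustrative assumptions, not part of the PR.

```python
import re

# A backslash continues a shell line only if nothing follows it.
# This pattern flags a backslash followed by spaces or tabs at end of line.
BROKEN_CONTINUATION = re.compile(r"\\[ \t]+$")


def find_broken_continuations(script_text: str) -> list[int]:
    """Return 1-based line numbers where whitespace follows a trailing backslash."""
    return [
        lineno
        for lineno, line in enumerate(script_text.splitlines(), start=1)
        if BROKEN_CONTINUATION.search(line)
    ]


if __name__ == "__main__":
    # Hypothetical two-line script: line 1 has a space after the backslash.
    sample = "CUDA_VISIBLE_DEVICES=0,1 \\ \npython script.py\n"
    print(find_broken_continuations(sample))
```

A check like this could run over the Slurm launch scripts before submission; linters such as ruff do not cover shell files, so the whitespace fix here has to be made by hand or with a shell-aware tool.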

afkanpour · 2026-05-14T15:36:14Z

+import argparse
+import re
+from tqdm import tqdm
+from tqdm.auto import tqdm


Two imports of tqdm.

afkanpour · 2026-05-14T15:38:20Z

+        max_new_tokens=args.max_new_tokens,
+    )
+
+    fdf.to_csv(data_path, index=False)


The process_data_batched_vllm saves to csv at the end. So this looks redundant?

afkanpour · 2026-05-14T15:44:03Z

+    enc = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+    if len(enc) > max_length:
+        enc = enc[:max_length]


Since tokenize=False above, I believe the output is text here, max_length applies to character count, not token count.
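One way to make the limit count tokens is to encode the text, slice the token ids, and decode the result. The sketch below is an assumption about how that could look, not the code in this PR: `truncate_to_tokens` and `WhitespaceTokenizer` are hypothetical names, and the toy tokenizer only stands in for a real one so the example runs without downloading a model. Hugging Face tokenizers expose `encode` and `decode` methods that could be used the same way.

```python
from typing import Protocol


class TokenizerLike(Protocol):
    """Subset of a tokenizer interface that this sketch relies on."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, ids: list[int]) -> str: ...


class WhitespaceTokenizer:
    """Toy stand-in (hypothetical): one token per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}
        self.words: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab[word] = len(self.words)
                self.words.append(word)
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.words[i] for i in ids)


def truncate_to_tokens(tokenizer: TokenizerLike, text: str, max_tokens: int) -> str:
    """Cut text to at most max_tokens tokens, as counted by the tokenizer."""
    ids = tokenizer.encode(text)
    if len(ids) <= max_tokens:
        return text
    return tokenizer.decode(ids[:max_tokens])


if __name__ == "__main__":
    tok = WhitespaceTokenizer()
    context = "alpha beta gamma delta epsilon"
    print(truncate_to_tokens(tok, context, 3))  # keeps three tokens
    print(context[:3])  # character slicing keeps three characters instead
```

It may also be safer to truncate the long input field before calling `apply_chat_template`, since slicing the end of the templated string can cut off the generation prompt that `add_generation_prompt=True` appends.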

afkanpour · 2026-05-14T15:44:50Z

+
+* **Stage 1 (Subcaption extraction, VLM):** `Qwen2.5-VL-32B-Instruct` generates a *verbatim* subfigure caption from a full figure caption + subfigure image. 
+* **Stage 2 (Context summary, LLM):** `Qwen2.5-14B-Instruct` generates a focused summary of the context passage relevant to the subcaption. 
+* **Stage 2 (Modality Labeling, VLM):** `Qwen2.5-VL-32B-Instruct` generates L2 labels, then L1 and L0 labels are inferred from a predefined set based on the generated L2 label. 
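The modality labeling bullet describes generating an L2 label and then inferring the L1 and L0 labels from a predefined set. A minimal sketch of that lookup is shown below. The label names and the `infer_parent_labels` function are placeholders invented for illustration; the actual taxonomy and code are in the PR's scripts and are not reproduced here.

```python
from typing import Optional

# Hypothetical hierarchy: each L2 label maps to its (L1, L0) parents.
# These names are illustrative placeholders, not the taxonomy used in the PR.
L2_TO_PARENTS: dict[str, tuple[str, str]] = {
    "x-ray": ("radiology", "medical"),
    "bar chart": ("chart", "non-medical"),
}


def infer_parent_labels(l2_label: str) -> Optional[tuple[str, str]]:
    """Look up (L1, L0) for a generated L2 label; None if the label is unknown."""
    key = l2_label.strip().lower()  # normalize casing and stray whitespace
    return L2_TO_PARENTS.get(key)


if __name__ == "__main__":
    print(infer_parent_labels("  X-Ray "))
    print(infer_parent_labels("unlisted label"))
```

Returning `None` for an unrecognized L2 label keeps model outputs that fall outside the predefined set visible, so they can be counted or re-prompted instead of silently mapped to a wrong parent.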


afkanpour

@afkanpour partially reviewed 12 files, made 2 comments, and resolved 3 discussions.
Reviewable status: 7 of 12 files reviewed, 3 unresolved discussions (waiting on Negiiiin and saidul-islam98).

.DS_Store line 0 at r1 (raw file):
Please delete all .DS_Store files

Added subcaption, summary and modality label generation scripts

dd678f2

saidul-islam98 assigned saidul-islam98, Negiiiin and afkanpour and unassigned saidul-islam98 Jan 19, 2026

saidul-islam98 added 5 commits May 13, 2026 10:58

delete DS_Store file

24bf47c

delete DS_Store file

35b3918

Delete working/process/.DS_Store

bde3ee4

Delete working/.DS_Store

efbc5ad

Delete .DS_Store

b88b7d0

afkanpour reviewed May 14, 2026

fixed some redundant imports and some format fixes

b1db8a9

afkanpour requested a review from Negiiiin May 14, 2026 20:46

afkanpour approved these changes May 14, 2026

saidul-islam98 added 3 commits May 15, 2026 09:54

updated some mypy issues that is blocking merge

d0a2af7

updated ruff issues that is blocking merge

726c556

updated literal issues and trailing whitespaces that is blocking merge

75ac793

Negiiiin approved these changes May 19, 2026

saidul-islam98 wants to merge 10 commits into main from subcaption-summary-generation

3 participants
