Add adapter generation demo scripts#5
Merged
Add two tutorial scripts under tutorials/scripts/ that compose a
Granite Switch model with the RAG, core, and guardian adapter
libraries and demonstrate invoking every adapter end-to-end:
- run_adapter_generation.py: plain HuggingFace transformers path.
Activates adapters by passing adapter_name=... to
tokenizer.apply_chat_template, following the README idiom.
One dedicated demo_<adapter> function per adapter with the
message / document layout the adapter expects. For the three
per-span LoRA adapters (citations, context-attribution,
hallucination_detection) the script also applies io.yaml's
sentence_boundaries tagging and appends the io.yaml instruction
as a final user turn, so their output shape matches Mellea.
Results saved to a timestamped JSON file.
- run_adapter_generation_mellea.py: Mellea + vLLM path. Starts a
vLLM server for the composed model, wires a Mellea OpenAIBackend
with embedded-adapter registration, and calls the corresponding
intrinsic wrapper (rag.rewrite_question, core.check_certainty,
guardian.policy_guardrails, ...) for each adapter.
Both scripts mirror one another's demo inputs so outputs can be
compared side by side. Common flags: --output, --model-dir,
--base-model, --max-tokens (HF only); the Mellea script additionally
exposes --port, --gpu-memory-utilization, --max-model-len for the
vLLM server.
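As a sketch, the flag surface described above could be declared with argparse (the defaults here are illustrative assumptions, not the scripts' actual values):

```python
import argparse

def build_parser(mellea: bool = False) -> argparse.ArgumentParser:
    # Flags shared by both tutorial scripts; defaults are illustrative.
    p = argparse.ArgumentParser(description="Run the adapter generation demos")
    p.add_argument("--output", help="Path for the timestamped results JSON")
    p.add_argument("--model-dir", help="Directory of the composed model")
    p.add_argument("--base-model", help="Base model id to compose against")
    p.add_argument("--max-tokens", type=int, default=512,
                   help="Generation cap (HF script only)")
    if mellea:
        # Extra knobs forwarded to the vLLM server by the Mellea script.
        p.add_argument("--port", type=int, default=8000)
        p.add_argument("--gpu-memory-utilization", type=float, default=0.9)
        p.add_argument("--max-model-len", type=int, default=8192)
    return p
```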
The 14 demos cover:
- RAG: query_rewrite, query_clarification, answerability, citations,
hallucination_detection
- Core: context-attribution, requirement-check, uncertainty
- Guardian: guardian-core (social_bias / harm / safe variants),
policy-guardrails, factuality-detection, factuality-correction
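Counting the demos above works out to 14 once guardian-core is expanded into its three variants, which a quick sketch confirms:

```python
# Demo inventory taken from the list above; guardian-core runs three times,
# once per variant (social_bias / harm / safe).
RAG = ["query_rewrite", "query_clarification", "answerability",
       "citations", "hallucination_detection"]
CORE = ["context-attribution", "requirement-check", "uncertainty"]
GUARDIAN = [f"guardian-core/{v}" for v in ("social_bias", "harm", "safe")]
GUARDIAN += ["policy-guardrails", "factuality-detection", "factuality-correction"]
ALL_DEMOS = RAG + CORE + GUARDIAN
assert len(ALL_DEMOS) == 14  # 5 RAG + 3 core + 6 guardian
```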
- Tighten sentence-boundary tagging and instruction-templates
comment blocks; drop development-flavored asides.
- Rewrite demo_policy_guardrails docstring from pseudocode sketch
to concrete context/returns description.
- Replace magic-number slicing in strip_adapter_suffix with
str.removesuffix (Python 3.9+ idiom).
- Minor rephrasing in _split_sentences, _mark_sentence_boundaries,
and the granite_switch.hf import comment.
antonpibm reviewed May 11, 2026
- Default to ibm-granite/granite-switch-4.1-3b-preview instead of
composing a model at startup. Dropped compose_model(), LIBRARIES,
BUILD_TIMEOUT, --base-model, and the subprocess/tempfile plumbing
from both scripts.
- Load the HF model with device_map="auto" instead of .to("cuda"),
so it works on multi-GPU / no-GPU-0 setups.
- Remove the Mellea-sync claim from run_adapter_generation.py's
top-level docstring; the per-adapter attribution comments stay
since they point at the exact Mellea source the prompts were
copied from.
added 2 commits on May 12, 2026 11:17
Soften the hard cuda.is_available() check to a warning — with device_map="auto", plain HF transformers loads onto whatever is available (CUDA / MPS / CPU). CPU is slow but not unsupported.
Makes the naming symmetric with run_adapter_generation_mellea.py and clarifies that this script talks to the model directly via HuggingFace transformers, with no framework layer in between.
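The softened check from the first commit could look like this (function name and message wording are illustrative; the real script may differ):

```python
import warnings

def warn_if_no_cuda(cuda_available: bool) -> None:
    # With device_map="auto", transformers places weights on whatever backend
    # exists (CUDA / MPS / CPU), so a missing GPU only merits a warning.
    if not cuda_available:
        warnings.warn(
            "CUDA not available; the model will run on CPU or MPS, "
            "which is slow but supported."
        )
```

In the script this would be called with `torch.cuda.is_available()`; it is parameterized here so the sketch runs without torch installed.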
antonpibm reviewed May 12, 2026
Addresses PR #5 review comment. Strips every Mellea mention from the script: section-header attribution comments, "Copied verbatim from mellea..." markers, the CRITERIA_BANK subset note, and the per-demo "Mirrors X.Y()" docstrings (rewritten as plain behavior descriptions). The script now stands on its own without pointing at Mellea's source.
antonpibm approved these changes May 12, 2026