Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill) #3877

Description

@fonfonya

This issue proposes adding support for assistant prefill: allowing an assistant message to be provided as a prefix that the model should continue generating from, rather than treating it as a completed assistant turn.

This capability enables controlled continuation of a partial response, structured output anchoring, and server-controlled tool or schema-guided generation.

Current Behavior

When passing messages like:

messages = [
    {"role": "user", "content": "Write a short apology email to a customer for a delayed shipment."},
    {"role": "assistant", "content": "Hi John,\n\nI'm sorry for the delay with your order. "}
]

the model server currently serializes the assistant message as a completed turn, closing it with <|im_end|>. Because the generation prompt is hard-coded on (constexpr bool add_generation_prompt = true;), it then opens a new, empty assistant turn:

Pipeline input text: <|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,

I'm sorry for the delay with your order. <|im_end|>
<|im_start|>assistant

As a result, the provided assistant content cannot be used as the active generation prefix.

Expected Behavior

The assistant message should be treated as a partial prefix, and generation should continue immediately after it:

<|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,

I'm sorry for the delay with your order. 
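
To make the difference between the two serializations concrete, here is a minimal Python sketch of a ChatML renderer. It is not the server's implementation: `render_chatml` and the `continue_final_message` flag are hypothetical names chosen for illustration, and the template is assumed to be plain ChatML as shown in the log above.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]],
                  continue_final_message: bool = False) -> str:
    """Serialize chat messages into a ChatML prompt string.

    continue_final_message is a hypothetical flag. When it is True and the
    last message comes from the assistant, that turn is left open so the
    model continues from its content.
    """
    prefill = (
        continue_final_message
        and bool(messages)
        and messages[-1]["role"] == "assistant"
    )
    last = len(messages) - 1
    parts = []
    for i, msg in enumerate(messages):
        parts.append(f"{IM_START}{msg['role']}\n{msg['content']}")
        # Close every turn except the one being used as a prefix.
        if not (prefill and i == last):
            parts.append(f"{IM_END}\n")
    if not prefill:
        # Current behavior: always open a fresh assistant turn.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Write a short apology email to a customer for a delayed shipment."},
    {"role": "assistant", "content": "Hi John,\n\nI'm sorry for the delay with your order. "},
]
current = render_chatml(messages)
expected = render_chatml(messages, continue_final_message=True)
```

With the flag off, `current` ends in a closed assistant turn followed by a new `<|im_start|>assistant` header. With it on, `expected` ends right after the prefix text, so the first generated token continues the email.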

Other Use Cases

Structured Output Prefill

messages = [
    {
        "role": "user",
        "content": "Is the customer satisfied? Respond in JSON with fields \"reasoning\" and \"answer\"."
    },
    {
        "role": "assistant",
        # Prefills the JSON shape and anchors a concise reasoning style.
        "content": '{\n  "reasoning": "Based on the user\'s tone and wording, '
    }
]

This lets the server fix the opening of the output structure while the model completes the rest of the response naturally.
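
On the client side, the prefill and the generated continuation have to be joined before parsing, since the model only returns what comes after the prefix. The sketch below assumes that behavior. `complete_json` is a hypothetical helper, and the `completion` string is a made-up stand-in for model output.

```python
import json
from typing import Optional


def complete_json(prefill: str, completion: str) -> Optional[dict]:
    """Join the prefill with the model's continuation and parse the result.

    Returns None when the combined text is not valid JSON, for example when
    generation stopped before the closing brace.
    """
    try:
        return json.loads(prefill + completion)
    except json.JSONDecodeError:
        return None


prefill = '{\n  "reasoning": "Based on the user\'s tone and wording, '
# Hypothetical continuation; real model output will differ.
completion = 'the customer seems pleased.",\n  "answer": "yes"\n}'

result = complete_json(prefill, completion)
truncated = complete_json(prefill, "the customer seems")
```

A prefill anchors the shape of the output but does not guarantee it, so the `None` branch is still needed for truncated or malformed completions.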

Tool-Guided Generation

messages = [
    {
        "role": "user",
        "content": "Tell the user that their package is delayed by 2 days."
    },
    {
        "role": "assistant",
        # Tool name and immutable arguments are injected by the system.
        # The model only needs to generate the remaining message content.
        "content": '{"tool": "send_notification", "args": {"user_id": "uid_541", "message": "'
    }
]

This pattern enables server-controlled tool configuration, while allowing the model to complete the remaining schema-constrained fields.
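
One way a server might build such a prefix and check the result is sketched below. Both helpers are hypothetical and only assume that the tool call is a JSON object with `tool` and `args` keys, as in the example above. The validation step matters because a continuation can repeat a key, and Python's `json.loads` keeps the last value it sees for a duplicated key.

```python
import json


def build_tool_prefill(tool: str, fixed_args: dict, open_field: str) -> str:
    """Return a JSON prefix that ends inside the string value of open_field."""
    if open_field in fixed_args:
        raise ValueError("open_field must not be one of the fixed arguments")
    args = dict(fixed_args)
    args[open_field] = ""  # placeholder, serialized last
    full = json.dumps({"tool": tool, "args": args})
    suffix = '""}}'  # empty string value plus the two closing braces
    return full[: -len(suffix)] + '"'


def parse_tool_call(prefill: str, completion: str, tool: str, fixed_args: dict) -> dict:
    """Parse prefill + completion and verify the server-injected fields."""
    call = json.loads(prefill + completion)
    if call.get("tool") != tool:
        raise ValueError("tool name was altered")
    for key, value in fixed_args.items():
        if call["args"].get(key) != value:
            raise ValueError(f"fixed argument {key!r} was altered")
    return call


fixed = {"user_id": "uid_541"}
prefill = build_tool_prefill("send_notification", fixed, "message")
# Hypothetical continuation; real model output will differ.
call = parse_tool_call(prefill, 'Your package is delayed by 2 days."}}',
                       "send_notification", fixed)
```

Here `prefill` reproduces the assistant content from the example, and `call["args"]["message"]` holds only the model-written text.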

Labels: enhancement (New feature or request)
