[BUG] Model ignores prompt in letter spelling task, repeats robot trajectory instead — potential loss of VLM language understanding or train/inference mismatch #56

@LaFeuilleMorte

Description

I trained the pretrained Holobrain Qwen model for 4 epochs on self-collected Agilex Piperx data for a letter-spelling task (1000 episodes, 15 h), recorded with the RoboOrchard data recording app. When I tested the trained model on the letter-spelling task with the prompt:
"pick the letters and spell the word 'candle'"
the model appears to ignore the instruction completely. Instead, it replays what looks like a memorized robot motion trajectory, performing meaningless grasping actions unrelated to spelling.

Steps to Reproduce

  1. Load the current model weights (including the VLM module).
  2. Provide the above prompt along with the corresponding visual input (scene with letters).
  3. Observe the model's output action sequence.

Actual Result

The model does not follow the prompt to spell the word. It repeats a previously memorized robot trajectory, which is irrelevant to the task.
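
To put a number on "ignores the prompt", one option is to feed the same observation with several different prompts and measure how much the predicted action sequence changes. The sketch below is only an illustration under assumptions: the `policy(observation, prompt)` interface, the two stand-in policies, and the extra prompts are hypothetical and are not part of the Holobrain or RoboOrchard code. A distance near zero for every prompt pair would support the observation that the language input has no effect.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Trajectory = List[List[float]]  # T steps x action_dim
# Hypothetical interface: policy(observation, prompt) -> predicted trajectory
Policy = Callable[[object, str], Trajectory]


def trajectory_distance(a: Trajectory, b: Trajectory) -> float:
    """Mean per-step Euclidean distance between two equally long trajectories."""
    if len(a) != len(b):
        raise ValueError("trajectories must have the same number of steps")
    if not a:
        return 0.0
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def prompt_sensitivity(
    policy: Policy, observation: object, prompts: Sequence[str]
) -> Dict[Tuple[str, str], float]:
    """Run one observation under each prompt and compare the outputs pairwise."""
    outputs = [policy(observation, p) for p in prompts]
    scores = {}
    for i in range(len(prompts)):
        for j in range(i + 1, len(prompts)):
            scores[(prompts[i], prompts[j])] = trajectory_distance(
                outputs[i], outputs[j]
            )
    return scores


def prompt_blind_policy(observation: object, prompt: str) -> Trajectory:
    # Stand-in for the suspected failure mode: the prompt is never used.
    return [[0.1 * t, 0.0, 0.2] for t in range(4)]


def prompt_aware_policy(observation: object, prompt: str) -> Trajectory:
    # Stand-in for a healthy model: the output shifts with the prompt text.
    offset = len(prompt) / 100.0
    return [[0.1 * t + offset, 0.0, 0.2] for t in range(4)]


# Illustrative prompts; only the first one comes from this report.
PROMPTS = [
    "pick the letters and spell the word 'candle'",
    "pick the letters and spell the word 'table'",
    "",
]

blind_scores = prompt_sensitivity(prompt_blind_policy, None, PROMPTS)
aware_scores = prompt_sensitivity(prompt_aware_policy, None, PROMPTS)
```

With a real model, the stand-in policies would be replaced by a thin wrapper around the inference call, and the sampling seed should be fixed so that differences come from the prompt rather than from sampling noise.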

Expected Result

The model should understand the semantics of "pick the letters" and "spell the word 'candle'", and execute correct grasping and spelling actions accordingly.

Preliminary Root Cause Analysis

We suspect the following potential causes:

  1. Train/Inference Code Mismatch

    • The training and inference codebases are not identical; they have diverged through multiple modifications.
    • We need to verify the differences, or run remote inference directly on the server, to rule out environmental factors.
  2. Loss of Language Understanding in VLM

    • The vision-language component in the pretrained weights has a small parameter count.
    • Language and text comprehension capabilities may have been "washed out" (catastrophic forgetting) during pretraining.
    • Additional image-text data may be needed to recover or retain such capabilities.
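
For cause 1, a quick first check is to dump the preprocessing and prompt settings used on each side and diff them. The following is a minimal sketch, assuming both pipelines can export their settings as nested dictionaries; every key and value in the example configs is a made-up placeholder, not a real Holobrain setting.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"


def flatten(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested dict into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(
    train_cfg: Dict[str, Any], infer_cfg: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (train_value, infer_value)} for every mismatch."""
    train_flat, infer_flat = flatten(train_cfg), flatten(infer_cfg)
    diffs = {}
    for path in sorted(set(train_flat) | set(infer_flat)):
        t = train_flat.get(path, _MISSING)
        i = infer_flat.get(path, _MISSING)
        if t != i:
            diffs[path] = (t, i)
    return diffs


# Hypothetical settings, for illustration only.
train_cfg = {
    "image": {"size": [224, 224], "normalize": "imagenet"},
    "prompt": {"template": "{instruction}", "lowercase": True},
    "action": {"normalize": True},
}
infer_cfg = {
    "image": {"size": [224, 224], "normalize": "imagenet"},
    "prompt": {"template": "Task: {instruction}"},
    "action": {"normalize": True},
}

mismatches = diff_configs(train_cfg, infer_cfg)
for path, (t, i) in mismatches.items():
    print(f"{path}: train={t!r} inference={i!r}")
```

A mismatch in the prompt template or tokenization path would be especially relevant here, since it could make the instruction look out-of-distribution at inference time even if the VLM itself is intact.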

Suggested Next Steps / Investigation Path

  • Unify training and inference code to ensure logical consistency.
  • Compare behavior between the current model and the original pretrained weights under the same environment.
  • Evaluate the VLM's basic performance on simple vision-language instruction tasks (e.g., referential understanding, letter recognition, word spelling).
  • Consider incorporating more language-action joint data into the training pipeline, or using freezing/fine-tuning strategies to preserve language capabilities.
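
As an illustration of the last point, mixing image-text samples into robot training batches at a fixed ratio could look roughly like the sketch below. This is an assumption about how such co-training might be wired up, not a description of the existing training pipeline: `mixed_batches`, its parameters, and the sample names are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def mixed_batches(
    robot_samples: Sequence[str],
    vl_samples: Sequence[str],
    vl_fraction: float,
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches where a fixed share of items are vision-language samples.

    Each item is a (source, sample) pair so the training loop can route it
    to the matching loss (action loss vs. text loss).
    """
    if not 0.0 <= vl_fraction <= 1.0:
        raise ValueError("vl_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_vl = round(batch_size * vl_fraction)
    n_robot = batch_size - n_vl
    for _ in range(num_batches):
        # Sample with replacement from each pool, then shuffle within the batch.
        batch = [("robot", rng.choice(robot_samples)) for _ in range(n_robot)]
        batch += [("vl", rng.choice(vl_samples)) for _ in range(n_vl)]
        rng.shuffle(batch)
        yield batch


# Placeholder sample identifiers, for illustration only.
demo_batches = list(
    mixed_batches(
        robot_samples=["episode_000", "episode_001"],
        vl_samples=["caption_000"],
        vl_fraction=0.25,
        batch_size=8,
        num_batches=3,
    )
)
```

The right `vl_fraction` is an open question and would need to be tuned; the point of the sketch is only that the ratio is held constant per batch so the language signal is present throughout training.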

Environment

  • Model version / weight source: (please specify)
  • Differences between training and inference environments: (if any)
  • VLM parameter count: (please specify)

Supplementary Materials

Inference demo video:

default.mp4

The video clearly shows the model ignoring the prompt and repeating a meaningless grasping trajectory.
