[BUG] Model ignores prompt in letter spelling task, repeats robot trajectory instead — potential loss of VLM language understanding or train/inference mismatch #56
I fine-tuned the pretrained Holobrain Qwen model for 4 epochs on self-collected letter-spelling data from an Agilex Piperx robot (1000 episodes, 15 h), recorded with the RoboOrchard data recording app. When I tested the fine-tuned model on a letter-spelling task with the prompt:
"pick the letters and spell the word 'candle'"
the model appears to completely ignore the instruction. Instead, it repeats a memorized robot motion trajectory, performing meaningless grasping actions unrelated to spelling.
Steps to Reproduce
Load the current model weights (including the VLM module).
Provide the above prompt along with the corresponding visual input (scene with letters).
Observe the model's output action sequence.
Actual Result
The model does not follow the prompt to spell the word. It repeats a previously memorized robot trajectory, which is irrelevant to the task.
Expected Result
The model should understand the semantics of "pick the letters" and "spell the word 'candle'", and execute correct grasping and spelling actions accordingly.
Preliminary Root Cause Analysis
We suspect the following potential causes:
Train/Inference Code Mismatch
The training and inference codebases are not identical; they differ in multiple places.
We need to review those differences, or run remote inference directly on the server, to rule out environmental factors.
Loss of Language Understanding in VLM
The vision-language component in the pretrained weights has a small parameter count.
Language and text comprehension capabilities may have been "washed out" (catastrophic forgetting) during pretraining.
Additional image-text data may be needed to recover or retain such capabilities.
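One way to start checking the first suspected cause (train/inference mismatch) is to dump the effective preprocessing and model settings from both codebases and diff them key by key. The sketch below is only an illustration: the config keys and values (`image.size`, `prompt_template`, `action_horizon`) are hypothetical placeholders, not actual Holobrain or RoboOrchard settings.

```python
from typing import Any, Dict, Iterator, Tuple

_MISSING = "<missing>"  # marker for a key present on only one side


def flatten(cfg: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_key, value) pairs for a nested dict."""
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def diff_configs(
    train_cfg: Dict[str, Any], infer_cfg: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (train_value, infer_value)} for every mismatch."""
    train_flat = dict(flatten(train_cfg))
    infer_flat = dict(flatten(infer_cfg))
    diffs: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(train_flat) | set(infer_flat)):
        train_value = train_flat.get(key, _MISSING)
        infer_value = infer_flat.get(key, _MISSING)
        if train_value != infer_value:
            diffs[key] = (train_value, infer_value)
    return diffs


# Hypothetical settings, for illustration only.
train_cfg = {
    "image": {"size": [224, 224], "normalize": True},
    "prompt_template": "{instruction}",
    "action_horizon": 16,
}
infer_cfg = {
    "image": {"size": [256, 256], "normalize": True},
    "prompt_template": "Task: {instruction}",
    "action_horizon": 16,
}

for key, (train_value, infer_value) in diff_configs(train_cfg, infer_cfg).items():
    print(f"{key}: train={train_value!r} infer={infer_value!r}")
```

A mismatch in something like the prompt template or image size would be a candidate explanation worth ruling out before concluding that the language capability itself has degraded.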
Suggested Next Steps / Investigation Path
Unify training and inference code to ensure logical consistency.
Compare behavior between the current model and the original pretrained weights under the same environment.
Evaluate the VLM's basic performance on simple vision-language instruction tasks (e.g., referential understanding, letter recognition, word spelling).
Consider incorporating more language-action joint data into the training pipeline, or using freezing/fine-tuning strategies to preserve language capabilities.
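A quick diagnostic that fits the steps above is to hold the observation fixed, vary only the prompt, and measure how much the predicted actions change. The sketch below assumes a hypothetical callable `policy(observation, prompt)` that returns a flat sequence of action values; this is not the actual Holobrain inference API. It also assumes the policy is deterministic for a fixed input (if actions are sampled, the random seed would need to be fixed first).

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: (observation, prompt) -> flat action vector.
Policy = Callable[[object, str], Sequence[float]]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length action vectors."""
    if len(a) != len(b):
        raise ValueError("action vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prompt_sensitivity(
    policy: Policy,
    observation: object,
    reference_prompt: str,
    probe_prompts: Sequence[str],
) -> Dict[str, float]:
    """Distance between actions for the reference prompt and each probe prompt."""
    reference = policy(observation, reference_prompt)
    return {
        prompt: l2_distance(reference, policy(observation, prompt))
        for prompt in probe_prompts
    }


def ignores_prompt(distances: Dict[str, float], tol: float = 1e-3) -> bool:
    """True if no probe prompt changed the actions by more than tol."""
    return all(d <= tol for d in distances.values())


# Stand-in policies for demonstration; replace with the real model wrapper.
def constant_policy(observation: object, prompt: str) -> Sequence[float]:
    return [0.1, 0.2, 0.3]  # same actions regardless of the prompt


def toy_language_policy(observation: object, prompt: str) -> Sequence[float]:
    return [0.1, 0.2, float(len(prompt))]  # output depends on the prompt


probes = ["pick the letters and spell the word 'table'", "do nothing", ""]
reference = "pick the letters and spell the word 'candle'"

for name, policy in [("constant", constant_policy), ("toy", toy_language_policy)]:
    distances = prompt_sensitivity(policy, None, reference, probes)
    print(name, "ignores prompt:", ignores_prompt(distances))
```

Running the same check on the original pretrained weights and on the fine-tuned checkpoint would indicate whether prompt sensitivity was already missing before fine-tuning or was lost during it.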
Environment
Model version / weight source: (please specify)
Differences between training and inference environments: (if any)
VLM parameter count: (please specify)
Supplementary Materials
Inference demo video:
default.mp4
The video clearly shows the model ignoring the prompt and repeating a meaningless grasping trajectory.