[BUG] Model ignores prompt in letter spelling task, repeats robot trajectory instead — potential loss of VLM language understanding or train/inference mismatch


### Description
I trained the pretrained Holobrain Qwen model for 4 epochs on self-collected **Agilex Piperx** data for a letter spelling task (1000 episodes, 15 h), recorded with the **RoboOrchard data recording app**. When I tested the trained model on the letter spelling task with the prompt:  
_**"pick the letters and spell the word 'candle'"**_
the model appears to completely ignore the instruction. Instead, it seems to replay a memorized robot motion trajectory, performing meaningless grasping actions unrelated to spelling.

### Steps to Reproduce
1. Load the current model weights (including the VLM module).
2. Provide the above prompt along with the corresponding visual input (scene with letters).
3. Observe the model's output action sequence.

### Actual Result
The model does not follow the prompt to spell the word. It repeats a previously memorized robot trajectory, which is irrelevant to the task.

### Expected Result
The model should understand the semantics of "pick the letters" and "spell the word 'candle'", and execute correct grasping and spelling actions accordingly.

### Preliminary Root Cause Analysis
We suspect the following potential causes:

1. **Train/Inference Code Mismatch**
   - The training and inference codebases are not identical; they differ in multiple places.
   - We need to verify these differences, or run remote inference directly on the server, to rule out environmental factors.

2. **Loss of Language Understanding in VLM**
   - The vision-language component in the pretrained weights has a small parameter count.
   - Language and text comprehension capabilities may have been "washed out" (catastrophic forgetting) during pretraining.
   - Additional image-text data may be needed to recover or retain such capabilities.
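
To make the first suspected cause checkable, here is a minimal sketch of how the preprocessing settings used at training and inference time could be diffed. It assumes both sides can be dumped as nested dictionaries; the keys shown (`image.size`, `image.normalize`, `prompt_template`) are hypothetical placeholders, not actual Holobrain or RoboOrchard config fields.

```python
from typing import Any, Dict, List, Tuple

MISSING = "<missing>"  # marker for a key present on only one side


def flatten(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested dict into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(
    train_cfg: Dict[str, Any], infer_cfg: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (path, train_value, inference_value) for every setting that differs."""
    train_flat, infer_flat = flatten(train_cfg), flatten(infer_cfg)
    diffs = []
    for path in sorted(set(train_flat) | set(infer_flat)):
        train_value = train_flat.get(path, MISSING)
        infer_value = infer_flat.get(path, MISSING)
        if train_value != infer_value:
            diffs.append((path, train_value, infer_value))
    return diffs


# Hypothetical settings, for illustration only.
train_cfg = {"image": {"size": 224, "normalize": True}, "prompt_template": "{instruction}"}
infer_cfg = {"image": {"size": 256, "normalize": True}}

for path, train_value, infer_value in diff_configs(train_cfg, infer_cfg):
    print(f"{path}: train={train_value!r} inference={infer_value!r}")
```

A missing prompt template or a different tokenizer setting on the inference side would be a prime suspect, since either could prevent the instruction from reaching the model in the form it saw during training.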

### Suggested Next Steps / Investigation Path
- Unify training and inference code to ensure logical consistency.
- Compare behavior between the current model and the original pretrained weights under the same environment.
- Evaluate the VLM's basic performance on simple vision-language instruction tasks (e.g., referential understanding, letter recognition, word spelling).
- Consider incorporating more language-action joint data into the training pipeline, or using freezing/fine-tuning strategies to preserve language capabilities.
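
One quick way to quantify "ignores the prompt" before the deeper investigation is a prompt-sensitivity check: feed the same observation with several different instructions and measure how much the predicted actions change. The sketch below is only an illustration; the `policy(observation, prompt)` callable is a hypothetical wrapper around whatever inference entry point is actually used, and the two stub policies exist only to show the two extremes.

```python
import math
from typing import Any, Callable, List, Sequence

# Hypothetical interface: maps (observation, prompt) to a flat action vector.
Policy = Callable[[Any, str], Sequence[float]]


def prompt_sensitivity(policy: Policy, observation: Any, prompts: List[str]) -> float:
    """Mean pairwise L2 distance between actions predicted for different prompts.

    A value near zero on a fixed observation suggests the prompt has
    little or no influence on the output.
    """
    actions = [list(policy(observation, prompt)) for prompt in prompts]
    distances = [
        math.dist(actions[i], actions[j])
        for i in range(len(actions))
        for j in range(i + 1, len(actions))
    ]
    return sum(distances) / len(distances) if distances else 0.0


# Stub policies standing in for the real model (assumptions for illustration).
def prompt_blind_policy(observation: Any, prompt: str) -> Sequence[float]:
    return [0.1, 0.2, 0.3]  # same action regardless of the prompt


def prompt_aware_policy(observation: Any, prompt: str) -> Sequence[float]:
    return [float(len(prompt)), 0.0, 0.0]  # action depends on the prompt


prompts = [
    "pick the letters and spell the word 'candle'",
    "pick the letters and spell the word 'cat'",
    "do nothing",
]
print(prompt_sensitivity(prompt_blind_policy, None, prompts))
print(prompt_sensitivity(prompt_aware_policy, None, prompts))
```

Running the same check on the original pretrained weights and on the trained checkpoint would help separate the two hypotheses: if both are prompt-blind, the language pathway was likely already weak; if only the trained checkpoint is, the training run or the inference code is the more likely culprit.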

### Environment
- Model version / weight source: (please specify)
- Differences between training and inference environments: (if any)
- VLM parameter count: (please specify)

### Supplementary Materials
Inference demo video: 

https://github.com/user-attachments/assets/9fb33baa-9b78-4efd-81a1-4aa8c133a746

The video clearly shows the model ignoring the prompt and repeating a meaningless grasping trajectory.
