Dear authors,
Thank you very much for your impressive work and the open-source implementation of StreamVLN. I’ve learned a lot from both the paper and the codebase.
I would like to ask for a clarification regarding the definition of H and W in Algorithm 1 (Voxel-Based Spatial Pruning).
Based on the paper and the implementation, my understanding is that:
- Starting from the raw camera resolution (e.g., 640×480), the image is first resized/cropped to the vision tower’s input resolution (e.g., 384×384 for SigLIP).
- It is then divided into patches (with patch size 14×14), resulting in
num_patches_per_side = floor(384 / 14) = 27.
- Therefore, H and W correspond to num_patches_per_side (H = W = 27 in this example), and the total number of visual tokens per frame is H × W = 729.
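To make the arithmetic above concrete, here is a minimal sketch of how I arrived at these numbers. The function name `patch_grid` and its parameters are my own, for illustration only, and are not taken from the StreamVLN codebase. It assumes a square input and non-overlapping patches.

```python
def patch_grid(image_size: int = 384, patch_size: int = 14) -> tuple[int, int, int]:
    """Return (H, W, num_tokens) for a square image split into square patches.

    Assumption: patches do not overlap, and any remainder pixels at the
    border are dropped, which is what the floor division models.
    """
    side = image_size // patch_size  # floor(384 / 14) = 27
    return side, side, side * side


H, W, num_tokens = patch_grid()
print(H, W, num_tokens)  # 27 27 729
```

If H and W in Algorithm 1 instead refer to a different grid (for example, one obtained after pooling the patch tokens), these numbers would not apply, which is part of what I would like to confirm.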
Could you please confirm whether my understanding is correct?
Thank you again for your time and help!