[Question]: What do H and W represent in the Voxel-Based Spatial Pruning algorithm in StreamVLN?

### Question

Dear authors,

Thank you very much for your impressive work and the open-source implementation of StreamVLN. I’ve learned a lot from both the paper and the codebase.

I would like to ask for a clarification regarding the definition of H and W in Algorithm 1 (Voxel-Based Spatial Pruning).

Based on the paper and the implementation, my understanding is that:

1. Starting from the raw camera resolution (e.g., 640×480), the image is first resized/cropped to the vision tower’s input resolution (e.g., 384×384 for SigLIP).
2. It is then divided into patches (with patch size 14×14), resulting in `num_patches_per_side = floor(384 / 14) = 27`.
3. Therefore, H and W correspond to `num_patches_per_side` (27×27 in this example), and the total number of visual tokens per frame is H × W = 729.

Could you please confirm whether my understanding is correct?
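
To make the arithmetic concrete, here is a minimal sketch of the computation I have in mind. It is only an illustration of my understanding, not code taken from the StreamVLN repository: the function name `patch_grid` and its parameters are hypothetical, and the defaults (384 and 14) are simply the SigLIP example values mentioned above.

```python
def patch_grid(image_size=384, patch_size=14):
    """Return (H, W, num_tokens) for a square input split into square patches.

    Assumption: patches that do not fit completely are dropped, which is
    what the floor division models (384 / 14 = 27.43 -> 27).
    """
    num_patches_per_side = image_size // patch_size
    h = w = num_patches_per_side
    return h, w, h * w


H, W, num_tokens = patch_grid()
print(H, W, num_tokens)  # 27 27 729
```

If this reading is right, H and W in Algorithm 1 would be 27 and 27 regardless of the raw camera resolution, since only the vision tower's input size and patch size enter the computation.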

Thank you again for your time and help!

[Question]: What do H and W represent in the Voxel-Based Spatial Pruning algorithm in StreamVLN? #353