Hi! First of all, thanks a lot for a great example project!
However, I'm pretty sure I have found a bug: onnx_model.encode_text() produces results that are inconsistent with the regular model output.
There is a comment in .utils:
"take features from the eot embedding (eot_token is the highest number in each sequence)."
But ruclip uses eos_id = 3, which is clearly not the highest token id in the sequence. This change was evidently made after you released your repo, but there was no version bump.
So, I tried tracing the model with a hardcoded eos id, using the original .where line from ruclip's encode_text:
x = x[torch.arange(x.shape[0]), torch.where(text == 3)[1]] @ self.text_projection
and it worked. I will submit a pull request with a fix if I have free time later, but for now I just wanted you to know that running the default version of your notebook produces incorrect text encoding results.
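To illustrate the mismatch, here is a minimal sketch (with made-up token ids, assuming eos_id = 3 as in ruclip) showing how argmax-based eot selection picks the wrong position when the eos token is not the highest id in the sequence:

```python
import torch

# Hypothetical tokenized batch: eos_id = 3 is NOT the highest token id.
text = torch.tensor([
    [101, 7, 42, 3, 0, 0],  # eos at index 3, but max token (101) is at index 0
    [55, 99, 3, 0, 0, 0],   # eos at index 2, but max token (99) is at index 1
])

# Selection assumed by the .utils comment (eot = highest token id) -- wrong here:
argmax_idx = text.argmax(dim=-1)     # -> tensor([0, 1])

# Selection matching ruclip's encode_text with eos_id = 3 -- correct:
eos_idx = torch.where(text == 3)[1]  # -> tensor([3, 2])

print(argmax_idx.tolist())  # [0, 1]
print(eos_idx.tolist())     # [3, 2]
```

The two index tensors disagree, so the features projected through text_projection come from different sequence positions, which explains the inconsistent encodings.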