The reason is the resolution of the coarse matching: a coarse match is computed once every 8 pixels. A second reason is the input to the Vision Transformer, where the token size is reduced. This is why it is recommended to use an image size that is divisible by 8.
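To illustrate the arithmetic, here is a minimal sketch that rounds an image size down to the nearest multiple of 8 and computes the matching coarse grid. It is not taken from the model's code: the helper names `round_down_to_multiple`, `divisible_size` and `coarse_grid_shape` are hypothetical, and the only thing assumed from the explanation above is the stride of 8.

```python
def round_down_to_multiple(value: int, multiple: int = 8) -> int:
    """Return the largest multiple of `multiple` that is <= `value`."""
    if value < multiple:
        raise ValueError(f"{value} is smaller than the stride {multiple}")
    return (value // multiple) * multiple


def divisible_size(height: int, width: int, multiple: int = 8) -> tuple[int, int]:
    """Hypothetical helper: image size to resize or crop to before matching."""
    return (
        round_down_to_multiple(height, multiple),
        round_down_to_multiple(width, multiple),
    )


def coarse_grid_shape(height: int, width: int, stride: int = 8) -> tuple[int, int]:
    """Number of coarse cells along each axis, assuming one cell per `stride` pixels."""
    return height // stride, width // stride


if __name__ == "__main__":
    h, w = 481, 643
    new_h, new_w = divisible_size(h, w)
    print(new_h, new_w)                       # 480 640
    print(coarse_grid_shape(new_h, new_w))    # (60, 80)
    # With the original size, the last 1 row and 3 columns do not fill a
    # complete 8-pixel cell, so they fall outside the coarse grid.
    print(h - new_h, w - new_w)               # 1 3
```

Under these assumptions, a 481 x 643 image would be brought to 480 x 640 so that every pixel falls inside a complete 8 x 8 coarse cell.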
The model seems to run fine even when image sizes are not divisible by 8. Why, then, is it stated that they should be divisible by 8?