NaViT not really resolution agnostic? 🤔 #342
Hot take here 🔥. NaViT may have made it possible to handle images with varied aspect ratios, but it did not solve the handling of arbitrary resolutions. For that, interpolation or extrapolation of the positional embeddings is still needed. Fractional factorized positional embeddings (height and width) are initialized as learnable 1-dimensional vectors of fixed size, so if either dimension of the input image, counted in patches, exceeds this fixed size, there will be an indexing error. Maybe I'm wrong, but this is what it looks like to me from reading some publicly available implementations 🤷🏻‍♂️. Would love some input on this, it's driving me crazy 🤯.
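To make the concern concrete, here is a minimal, framework-free sketch. It is not taken from any particular NaViT implementation. It assumes the height and width embeddings are fixed-length tables indexed by patch row and column. Plain Python lists stand in for the learnable parameters, and `lookup`, `resize_table`, and the table sizes are hypothetical names and values chosen for illustration. Linear interpolation is just one possible way to stretch a table.

```python
def lookup(pos_h, pos_w, row, col):
    """Factorized embedding for patch (row, col): row vector + column vector."""
    # Raises IndexError when row or col is past the end of its table.
    return [h + w for h, w in zip(pos_h[row], pos_w[col])]


def resize_table(table, new_len):
    """Linearly interpolate a 1-D table of embedding vectors to new_len rows."""
    old_len = len(table)
    if new_len == 1 or old_len == 1:
        # Nothing to interpolate between: repeat the first vector.
        return [list(table[0]) for _ in range(new_len)]
    resized = []
    for i in range(new_len):
        # Map the new index onto the old table, keeping both endpoints fixed.
        x = i * (old_len - 1) / (new_len - 1)
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        t = x - lo
        resized.append([(1 - t) * a + t * b for a, b in zip(table[lo], table[hi])])
    return resized


# Hypothetical tables: 4 positions per axis, embedding dimension 2.
pos_h = [[float(i), 0.0] for i in range(4)]
pos_w = [[0.0, float(j)] for j in range(4)]

try:
    # The image is 6 patches tall, but the height table only has 4 rows.
    embedding = lookup(pos_h, pos_w, row=5, col=1)
except IndexError:
    # Stretch the height table to 6 positions, then retry the lookup.
    pos_h = resize_table(pos_h, 6)
    embedding = lookup(pos_h, pos_w, row=5, col=1)
```

With tensors, `torch.nn.functional.interpolate` could do the same resampling. Whether the stretched embeddings work well without fine-tuning is a separate question this sketch does not address.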
Replies: 1 comment
Agreed, it is not "truly" flexible to arbitrary image sizes. Image resolutions still need to be a multiple of the patch size. My strategy for this is to zero-pad each dimension up to the next multiple of the patch size, which seems to be a reasonable workaround, but if done on the fly (i.e. inside the torch Dataset `__getitem__`) it can add some overhead.
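The zero-padding workaround can be sketched roughly as follows. This is an illustration under assumptions, not the code actually used: the image is a plain 2-D list of rows, padding goes on the bottom and right edges, and `pad_to_patch_multiple` is a hypothetical helper name.

```python
def pad_to_patch_multiple(image, patch_size, fill=0.0):
    """Pad a 2-D image (list of rows) so both sides are multiples of patch_size."""
    height, width = len(image), len(image[0])
    # (-n) % p is the distance from n up to the next multiple of p (0 if aligned).
    pad_h = (-height) % patch_size
    pad_w = (-width) % patch_size
    # Extend every existing row on the right, then append full rows at the bottom.
    padded = [list(row) + [fill] * pad_w for row in image]
    padded += [[fill] * (width + pad_w) for _ in range(pad_h)]
    return padded


# A 5x7 image with patch size 4 is padded up to 8x8.
image = [[1.0] * 7 for _ in range(5)]
padded = pad_to_patch_multiple(image, patch_size=4)
```

For a tensor whose last two dimensions are height and width, the same amounts could be passed to `torch.nn.functional.pad(image, (0, pad_w, 0, pad_h))`.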