From the config of train_dit, it appears that the conditioner uses a frozen DINOv2 ViT-B/14 model (Dinov2Wrapper) to process image-based conditioning signals. However, I would like to confirm the exact input data format expected by the conditioner and whether the model relies on precomputed DINOv2 embeddings during training.
Could you clarify the following points?
Are the files at cond_url_template expected to contain precomputed DINOv2 image embeddings (e.g., extracted offline using DINOv2 and saved as tensors)?
Does the model avoid processing raw images during training and instead rely on these precomputed features?
Is there a script or guideline for generating these DINOv2 embeddings from raw images?
Thank you in advance!