From the config of train_dit, it appears that the conditioner uses a frozen DINOv2 ViT-B/14 model (Dinov2Wrapper) to process image-based conditioning signals. However, I would like to confirm the exact input data format expected by the conditioner and whether the model relies on precomputed DINOv2 embeddings during training.
Could you clarify the following points?
Are the files at cond_url_template expected to contain precomputed DINOv2 image embeddings (e.g., extracted offline using DINOv2 and saved as tensors)?
Does the model avoid processing raw images during training and instead rely on these precomputed features?
Is there a script or guideline for generating these DINOv2 embeddings from raw images?
Thank you in advance!