Manual model lowvram hint node #9433
This PR is an attempt at logic similar to what llama.cpp uses (ggml-org/llama.cpp#11397). In ComfyUI's case, it means forcing some weights to use `lowvram` by default. The idea here is that when you're low on VRAM to begin with, the default logic ends up marking all tensors past a certain point as lowvram. If we allocate some of the larger tensors as lowvram to begin with, we can reduce the total number of lowvram weights.

Testing with Qwen Image and manually marking the two FFN blocks (`img_mlp` + `txt_mlp`), I get a total of 324 lowvram weights instead of the 640 I get with the default logic.

Now, diffusion models are almost all compute bound, and lowvram weights are still moved to the GPU, so this probably doesn't help much (I get ~2 seconds faster generations at best). The main use case would be if, for some reason, each individual CPU<->CUDA copy op has high-ish overhead (might be the case on PCIe 3.0 or with weird cross-NUMA access?). This may still be useful for debugging lowvram without restarts, though.