
Conversation

@city96 (Contributor) commented Aug 19, 2025

This PR is an attempt at logic similar to what llama.cpp uses (ggml-org/llama.cpp#11397).

In the case of ComfyUI, it means forcing some weights to use lowvram by default. The idea is that when you're low on VRAM to begin with, the default logic ends up marking every tensor past a certain cutoff as lowvram. If we instead allocate some of the larger tensors as lowvram from the start, we can reduce the total number of lowvram weights.
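A rough sketch of that selection, outside of ComfyUI's actual patcher code (the `use_lowvram` flag and the `vram_budget_bytes` argument are placeholders, not real ComfyUI attributes):

```python
import torch

def premark_lowvram(model: torch.nn.Module, vram_budget_bytes: int) -> int:
    """Sketch: offload the largest weight-holding modules first, so that
    whatever remains fits inside the VRAM budget. `use_lowvram` stands in
    for whatever flag the real lowvram path checks."""
    # Collect leaf-level parameter sizes per module, largest first.
    sized = []
    for module in model.modules():
        size = sum(p.numel() * p.element_size() for p in module.parameters(recurse=False))
        if size > 0:
            sized.append((module, size))
    sized.sort(key=lambda item: item[1], reverse=True)

    remaining = sum(size for _, size in sized)
    marked = 0
    for module, size in sized:
        if remaining <= vram_budget_bytes:
            break  # everything left fits on the GPU, stop offloading
        module.use_lowvram = True  # placeholder for the real lowvram marker
        remaining -= size
        marked += 1
    return marked
```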

Testing with Qwen Image and marking the two FFN blocks (img_mlp + txt_mlp) manually, I get a total of 324 lowvram weights instead of the 640 I get with the default logic.
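The manual marking for that test looked roughly like this (simplified; the name match assumes Qwen Image's img_mlp/txt_mlp naming, and `use_lowvram` is the same placeholder flag as above):

```python
import torch

def mark_ffn_lowvram(model: torch.nn.Module) -> int:
    """Sketch: flag the FFN blocks by name before the default logic runs."""
    marked = 0
    for name, module in model.named_modules():
        if "img_mlp" in name or "txt_mlp" in name:
            module.use_lowvram = True  # placeholder lowvram marker
            marked += 1
    return marked
```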

Now, diffusion models are almost all compute bound, and lowvram weights are still moved to the GPU, so this probably doesn't help much (I get ~2 seconds faster generations at best). The main use case would be if, for some reason, each individual CPU<->CUDA copy op has high-ish overhead (might be the case on PCIe 3.0 or with weird cross-NUMA access?).
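One way to check whether per-copy overhead actually matters on a given machine is to compare many small host-to-GPU copies against a few large ones moving the same total payload, e.g. (sizes picked arbitrarily):

```python
import time
import torch

def time_copies(tensors):
    """Time pinned-memory host->CUDA copies, returning seconds."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for t in tensors:
        t.to("cuda", non_blocking=True)  # result discarded, only timing the transfer
    torch.cuda.synchronize()
    return time.perf_counter() - start

# Same total payload (~512 MiB of fp16), split differently.
small = [torch.empty(1024, 1024, dtype=torch.float16).pin_memory() for _ in range(256)]
large = [torch.empty(16, 1024, 1024, dtype=torch.float16).pin_memory() for _ in range(16)]

print("256 small copies:", time_copies(small))
print("16 large copies: ", time_copies(large))
```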

This may still be useful for debugging lowvram without restarts, though.
