[Feature Request] (Long Term): Detailed documentation of stable-diffusion.cpp and its features #864

Description

@MrSnichovitch

At some point, when the addition of new features/model support and bug fixing are at a lull, it would be of great benefit to this project and its users if the documentation got an overhaul.

The best-case scenario would be to "ELI5" it: provide in-depth instructions and definitions of how stable-diffusion.cpp can be used with each supported model and all of their features, written so that the "lowest common denominator" (of which I am a card-carrying member... I can be as dense as a bag of hammers sometimes) can learn how to use the software. Believe me, I have ~15 years of aggregate experience writing technical process documentation, so I know how tedious it can be, but getting it done means productive users and fewer issues being filed. I'd love to do this for you, but I don't know anywhere near enough about any of this stuff--neither the programming aspect nor the technical functions of generative models--to be of any real use; otherwise I'd be bombarding you with .md files instead of posting this long-winded whatnot.

Please do not take this as any sort of complaint about stable-diffusion.cpp. I really like this software and will continue using it as long as it's available. The reason behind this write-up is that I don't have any real understanding of its full capabilities and would like to know what can be done with it. Some problems I'm having may even be due to unreported bugs, but I don't have enough information to make that determination.

A major peeve I have with a lot of indie open-source projects is their lack of documentation, or worse, documentation that assumes the end users of the software already know enough about its intended function to not require instructions. The second scenario is pretty common, especially with generative-AI-related projects. In image-gen usage in general, the most common instance of this is people saying, "Just import my workflow," expecting the end user to be running ComfyUI, which is of no use to anyone who either can't or won't run ComfyUI for whatever reason. Of course, even those who do run Comfy can run into problems when faced with a "switchboard" full of presets: they want to make a change but have no idea what the rat's nest of interconnections before them does without a lot of reckless experimentation and frustrating time in a search engine. The worst part is that, due to the rise of AI-generated tutorial websites and "hallucinatory" answers coming from ChatGPT and the like, finding good answers to a complete newbie's questions is made that much more difficult.

Since stable-diffusion.cpp could be considered the antithesis of ComfyUI and doesn't have anywhere near its adoption rate, much more robust documentation geared toward those who are completely new to GenAI is really necessary, even if one considers this software "simple" by comparison.

Arch Linux provides a prime example of the right way to go about software documentation. The entire philosophy of that particular OS is to "keep it simple," and yet they have the most in-depth and useful wiki of any Linux distro out there today. And that wiki is critical for them, because "simple" doesn't mean "easy." There's an expectation that Arch users need to learn how to use it properly, and they provide all the resources necessary for such self-education. (And no, I don't use Arch, BTW.)

As far as this project goes, the addition of the WAN video models is an example of where documentation is sorely lacking. Considering that WAN is the newest addition and pretty complicated in its own right, the lack of detail is understandable, but plans to correct that should be part of the overall project.

While the example scripts by model are a good starting point for some folks, they don't explain what any of the settings do, whether or not they're necessary, or what other options might be of benefit. There's also no indication of resource usage per model, which would have been good to know before diving in. E.g., the standard "a lovely cat" T2V example for Wan2.2 TI2V 5B (specifically, Wan2.2-TI2V-5B-Q8_0.gguf was tested in my case) at 832x480 has a VAE compute buffer size of >20 GB(!). --vae-tiling doesn't appear to be an option for WAN (at least, the option was ignored when set), which sort of makes sense considering it's producing sequential frame data for video and not a single static image, so --vae-on-cpu is necessary for my 16 GB GPU. It takes ~30 minutes to output a 2-second video because of this, so longer videos are out of the question if I want to use my PC for anything else.
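
For reference, here's roughly the invocation I'm describing. This is a sketch from memory: the model and encoder file names are placeholders for whatever quantizations you've downloaded, and the model-loading flags follow the pattern from the other example scripts, so treat the exact flag names as approximations rather than gospel.

  # Approximate Wan2.2 TI2V 5B T2V run on a 16 GB GPU; --vae-on-cpu is the
  # only way I can get past the >20 GB VAE compute buffer.
  ./sd --diffusion-model Wan2.2-TI2V-5B-Q8_0.gguf \
       --vae wan2.2_vae.safetensors \
       --t5xxl umt5-xxl-encoder-Q8_0.gguf \
       -p "a lovely cat" \
       -W 832 -H 480 \
       --vae-on-cpu

Even a rough table of expected compute buffer sizes per model at a given resolution would have saved me the trial and error here.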

Just to further illustrate the point, here are a few example questions I personally have about WAN:

  1. In general, which stable-diffusion.cpp options are relevant to WAN, and which ones are irrelevant? Which ones are WAN-specific and which ones are shared with other models?
    • E.g., as mentioned, --vae-tiling doesn't appear to work. What other settings are ignored?
  2. What back-ends support WAN?
    • E.g., Vulkan appears to be pretty weak for WAN 2.1 (it requires both --clip-on-cpu and --vae-on-cpu) and doesn't seem to work at all for WAN 2.2. (This could be something similar to the Vulkan problems in "add Qwen Image support" #851.) ROCm works for both 2.1 and 2.2, but requires --clip-on-cpu to avoid black output videos; see the sketch after this list for the exact flag combination I'm using.
  3. What does each of the WAN-relevant options do, exactly?
    • E.g., what does --flow-shift control? What effect does setting a value have, what's the range, and what happens when you adjust the value higher or lower?
    • What about --vace-strength for the 2.1 VACE models? Links to external explainers/tutorials that don't leave the end-user scratching their head in confusion would be beneficial here.
  4. What output sizes do each WAN model support?
    • E.g., on WAN 2.1, I can only seem to get viable results at 416x240 or 832x480. Any other resolution in between produces gibberish video or throws errors. 416x240 requires a high --steps value (50 on average), though, to avoid garbage output (body horror).
    • On WAN 2.2 TI2V 5B, 832x480 works (if I'm willing to wait 30 minutes for the VAE step to trudge along on my Ryzen 7 3700X). 624x360 almost works, in that it at least produces recognizable images in the output video, but requires high --steps like WAN 2.1. Any other resolution in between produces junk.
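
To make the back-end and resolution questions above concrete, this is the rough shape of the WAN 2.1 T2V invocation that behaves for me on ROCm. Again, it's a sketch: the model file names are placeholders, and the flag names may not match your build exactly.

  # Approximate WAN 2.1 T2V run on ROCm; --clip-on-cpu avoids black output
  # videos, and 50 steps keeps 832x480 from turning into body horror.
  ./sd --diffusion-model Wan2.1-T2V-1.3B-Q8_0.gguf \
       --vae wan_2.1_vae.safetensors \
       --t5xxl umt5-xxl-encoder-Q8_0.gguf \
       -p "a lovely cat" \
       -W 832 -H 480 \
       --steps 50 \
       --clip-on-cpu

On Vulkan, the same WAN 2.1 command only produces usable output if I add --vae-on-cpu as well, and for WAN 2.2 it doesn't seem to work at all.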

Of course, the lack of documentation isn't limited to WAN.

Chroma appears to have its own set of command options that aren't explained anywhere...

  --chroma-disable-dit-mask          disable dit mask for chroma
  --chroma-enable-t5-mask            enable t5 mask for chroma
  --chroma-t5-mask-pad  PAD_SIZE     t5 mask pad size of chroma

...but I've discovered that using both --guidance 0 and --chroma-disable-dit-mask when generating with Vulkan greatly reduces instances of body horror and increases prompt adherence. --chroma-disable-dit-mask is present in the example script for the model (which doesn't mean I know what a DiT mask is or why turning it off is important), but I picked up --guidance 0 from the discussion chain of PR #696 and found adding it to be beneficial on my own. It could be an issue of correlation not equaling causation, but if it appears to work, I'm using it.
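
For anyone else fumbling with Chroma on Vulkan, this is roughly the command I've settled on. The model and encoder file names are placeholders for whatever you've downloaded, and everything besides the two flags in question is just lifted from the model's example script as best I remember it.

  # Approximate Chroma run on Vulkan; --guidance 0 plus --chroma-disable-dit-mask
  # is the combination that cut down on body horror for me.
  ./sd --diffusion-model chroma-unlocked-vXX-Q8_0.gguf \
       --vae ae.safetensors \
       --t5xxl t5xxl_fp16.safetensors \
       -p "a lovely cat" \
       --cfg-scale 4.0 \
       --guidance 0 \
       --chroma-disable-dit-mask \
       -o output.png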

Control Net: Outside of "Control Net support with SD 1.5" in the Features description, there's really no further mention of it. As far as I'm aware, one should be able to use Control Net conditioning with SDXL, Chroma, and other models, but no guidance is provided on what files need to be downloaded or how to set its options.
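
For comparison, the one documented case (SD 1.5) appears, from the --help output, to look something like the sketch below; what's missing is exactly this kind of walkthrough for SDXL, Chroma, and the rest, including which Control Net files to download. The file names here are placeholders and the flag names are my best reading of the help text.

  # Approximate SD 1.5 + Control Net run; the control model, the conditioning
  # image, and the strength value are the three pieces the docs never spell out.
  ./sd -m v1-5-pruned-emaonly.safetensors \
       --control-net control_v11p_sd15_canny.safetensors \
       --control-image canny_edges.png \
       --control-strength 0.9 \
       -p "a lovely cat" \
       -W 512 -H 512 \
       -o output.png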

Okay... I'm sure if you've read through all this by now, you get the point. Thanks for sparing the time.
