WanImageToVideoPipeline: Given groups=1, weight of size [160, 12, 3, 3, 3], expected input[1, 3, 3, 258, 258] to have 12 channels, but got 3 channels instead #12113

@nitinmukesh

Describe the bug

I am trying the example code with the following changes:

pipe.vae.enable_tiling()
pipe.enable_sequential_cpu_offload()

I uninstalled and reinstalled diffusers and peft from source before trying the example code.
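
To isolate the failure, here is a smaller sketch (untested; it assumes the tiled VAE encode path alone reproduces the error, since the traceback below enters through tiled_encode) that calls the VAE directly without loading the full pipeline:

import torch
from diffusers import AutoencoderKLWan

vae = AutoencoderKLWan.from_pretrained(
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
vae.enable_tiling()

# dummy clip: (batch, channels, frames, height, width); frames must be 4k + 1
video = torch.randn(1, 3, 9, 1088, 800)
# if tiling is the culprit, this should raise the same RuntimeError
latents = vae.encode(video).latent_dist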

Reproduction

import torch
from diffusers import WanImageToVideoPipeline, AutoencoderKLWan, ModularPipeline
from diffusers.utils import export_to_video


model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
dtype = torch.bfloat16

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
# pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()
pipe.enable_sequential_cpu_offload()

# use default wan image processor to resize and crop the image
image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    max_area=1280*704, output="processed_image")

height, width = image.height, image.width
print(f"height: {height}, width: {width}")
num_frames = 121
num_inference_steps = 10
guidance_scale = 5.0

prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

# Wan's standard negative prompt, kept in Chinese as the model expects; roughly: "garish colors, overexposed, static, blurry details, subtitles, style, artwork, painting, frame, motionless, overall gray, worst quality, low quality, JPEG artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards"
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]
export_to_video(output, "yiyi_test_6_ti2v_5b_output.mp4", fps=24)
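
A possible temporary workaround, assuming the tiled encode path is the culprit (see the traceback below), is to skip VAE tiling and rely on offloading alone, at the cost of higher peak VRAM during the VAE stage:

# pipe.vae.enable_tiling()  # skipping this avoids the tiled_encode path
pipe.enable_sequential_cpu_offload()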

Logs

(sddw-dev) C:\aiOWN\diffuser_webui>python Wan2.1_4steps.py
W0810 17:44:53.607000 9400 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
The config attributes {'clip_output': False} were passed to AutoencoderKLWan, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading checkpoint shards: 100%|████████████████████████████████████| 5/5 [00:24<00:00,  4.86s/it]
Loading checkpoint shards: 100%|████████████████████████████████████| 3/3 [00:00<00:00,  4.15it/s]
Loading pipeline components...: 100%|███████████████████████████████| 5/5 [00:28<00:00,  5.72s/it]
C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\configuration_utils.py:141: FutureWarning: Accessing config attribute `patch_size` directly via 'ModularPipeline' object attribute is deprecated. Please access 'patch_size' over 'ModularPipeline's config object instead, e.g. 'scheduler.config.patch_size'.
  deprecate("direct config name access", "1.0.0", deprecation_message, standard_warn=False)
C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\configuration_utils.py:141: FutureWarning: Accessing config attribute `vae_stride` directly via 'ModularPipeline' object attribute is deprecated. Please access 'vae_stride' over 'ModularPipeline's config object instead, e.g. 'scheduler.config.vae_stride'.
  deprecate("direct config name access", "1.0.0", deprecation_message, standard_warn=False)
 initial image size: (832, 1104)
 processed image size: (800, 1088)
height: 1088, width: 800
Traceback (most recent call last):
  File "C:\aiOWN\diffuser_webui\Wan2.1_4steps.py", line 66, in <module>
    output = pipe(
             ^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\pipelines\wan\pipeline_wan_i2v.py", line 699, in __call__
    latents_outputs = self.prepare_latents(
                      ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\pipelines\wan\pipeline_wan_i2v.py", line 455, in prepare_latents
    latent_condition = retrieve_latents(self.vae.encode(video_condition), sample_mode="argmax")
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_wan.py", line 1191, in encode
    h = self._encode(x)
        ^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_wan.py", line 1149, in _encode
    return self.tiled_encode(x)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_wan.py", line 1313, in tiled_encode
    tile = self.encoder(tile, feat_cache=self._enc_feat_map, feat_idx=self._enc_conv_idx)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_wan.py", line 593, in forward
    x = self.conv_in(x, feat_cache[idx])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\accelerate\hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_wan.py", line 176, in forward
    return super().forward(x)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\torch\nn\modules\conv.py", line 725, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nitin\miniconda3\envs\sddw-dev\Lib\site-packages\torch\nn\modules\conv.py", line 720, in _conv_forward
    return F.conv3d(
           ^^^^^^^^^
RuntimeError: Given groups=1, weight of size [160, 12, 3, 3, 3], expected input[1, 3, 3, 258, 258] to have 12 channels, but got 3 channels instead
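
For what it's worth, the shapes in the error hint at the mismatch: the conv_in weight [160, 12, 3, 3, 3] expects 12 input channels, and 12 = 3 RGB channels * 2 * 2 spatial patch, which lines up with the patch_size attribute mentioned in the deprecation warnings above. My guess (an assumption, not verified against the source) is that the non-tiled encode path patchifies the input before conv_in, while tiled_encode feeds raw 3-channel tiles. A sketch of the space-to-depth step that would produce the expected shape:

import torch

tile = torch.randn(1, 3, 3, 258, 258)  # the tile shape from the error
p = 2  # assumed spatial patch size for the Wan 2.2 VAE
b, c, t, h, w = tile.shape
# fold each 2x2 spatial patch into the channel dim (space-to-depth)
patched = tile.reshape(b, c, t, h // p, p, w // p, p)
patched = patched.permute(0, 1, 4, 6, 2, 3, 5).reshape(b, c * p * p, t, h // p, w // p)
print(patched.shape)  # torch.Size([1, 12, 3, 129, 129]) -- what conv_in accepts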

System Info

- 🤗 Diffusers version: 0.35.0.dev0
- Platform: Windows-10-10.0.26100-SP0
- Running on Google Colab?: No
- Python version: 3.11.9
- PyTorch version (GPU?): 2.7.0+cu126 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.34.3
- Transformers version: 4.52.4
- Accelerate version: 1.8.0.dev0
- PEFT version: 0.17.1.dev0
- Bitsandbytes version: 0.46.0
- Safetensors version: 0.5.3
- xFormers version: 0.0.30
- Accelerator: NVIDIA GeForce RTX 4060 Laptop GPU, 8188 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@yiyixuxu @a-r-r-o-w
