Image-to-Video
Animating still images into short video clips. Stable Video Diffusion, Wan, and CogVideoX-I2V are open-weight options.
Setup walkthrough
- Install ComfyUI via Stability Matrix.
- ComfyUI Manager → Install Models → search "wan-2.1-i2v-14b" → download FP8 version (~16 GB).
- Load a Wan I2V workflow. The workflow takes:
- Input image (the starting frame)
- Text prompt describing the desired motion
- Resolution: 832×480 (Wan default), frames=81 (~5 seconds)
- Prompt: "The person in the image slowly turns their head to look at the camera, gentle smile." Steps=20, CFG=5.
- Queue → first animated video in 8-20 minutes on RTX 3090/4090 24 GB.
- For faster/lighter: install Stable Video Diffusion (SVD, ~6 GB) → 14 frames (~2 seconds) of animation in 2-5 minutes on a 12 GB GPU.
- For best quality-to-speed ratio: CogVideoX-I2V-5B (~10 GB) → 49 frames in 5-10 minutes on a 16 GB GPU (a scripted diffusers alternative is sketched below).
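If you'd rather script the CogVideoX route than click through ComfyUI, the same model can be driven from Hugging Face diffusers. A minimal sketch, assuming a recent diffusers release with `CogVideoXImageToVideoPipeline` and the THUDM/CogVideoX-5b-I2V weights; the file name, prompt, and sampling values are illustrative, and the offload/tiling calls are what keep it inside a 16 GB card.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the I2V pipeline in bf16 (~10 GB of weights), then offload blocks to
# CPU between steps and tile the VAE decode to keep peak VRAM in check.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

image = load_image("input.png")  # the starting frame
prompt = "The person slowly turns their head to look at the camera, gentle smile."

video = pipe(
    prompt=prompt,
    image=image,
    num_frames=49,            # CogVideoX's native clip length (~6 seconds at 8 fps)
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

Expect roughly the same 5-10 minute range as the ComfyUI route; the script form is mainly useful once you want to batch clips.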
Image-to-video is just as compute-heavy as text-to-video. Measure in minutes, not seconds.
The cheap setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Stable Video Diffusion (SVD) at 2-5 minutes for ~2 seconds of animation. CogVideoX-I2V-5B will run with heavy offloading but takes 10-20 minutes for 3 seconds. Wan I2V 14B is not practical on 12 GB — it technically runs with massive offloading (30-60+ minutes for 5 seconds). For $300-400: you get short animated clips (2-3 seconds) from SVD at reasonable speed. For longer/higher quality, save for 24 GB. I2V at this budget is for experiments, not production.
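For the 12 GB tier, the practical path is SVD with aggressive memory savings. A minimal sketch, assuming diffusers' `StableVideoDiffusionPipeline` and the stabilityai/stable-video-diffusion-img2vid weights; the CPU offload and small `decode_chunk_size` are what keep peak VRAM near the 12 GB ceiling.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
)
# CPU offload moves idle components off the GPU between stages.
pipe.enable_model_cpu_offload()

image = load_image("portrait.png").resize((1024, 576))  # SVD's native resolution

frames = pipe(
    image,
    decode_chunk_size=4,      # decode a few frames at a time to save VRAM
    motion_bucket_id=127,     # higher = more motion
    noise_aug_strength=0.02,
).frames[0]

export_to_video(frames, "svd_clip.mp4", fps=7)
```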
The serious setup
Used RTX 4090 24 GB ($1,600, see /hardware/rtx-4090). Runs Wan 2.1 I2V 14B FP8 at ~8-15 minutes per 5-second 832×480 clip. CogVideoX-I2V-5B at 3-6 minutes per 6-second clip. For professional use (social media content, product visualization), ~10 minutes per clip is production-viable if you queue overnight. Total build: ~$2,500-3,000. RTX 5090 32 GB ($2,000, see /hardware/rtx-5090) is the current I2V king — ~5-8 minutes per Wan clip. Image-to-video is a "start the render, get coffee, come back" workflow.
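The "queue overnight" part can be automated against ComfyUI's local HTTP API: export your Wan I2V workflow with Save (API Format), then submit one patched copy per input image. A rough sketch, assuming ComfyUI is running on its default port; the node IDs, file names, and prompts are placeholders you would replace with the values from your own exported JSON.

```python
import json
import urllib.request
from pathlib import Path

COMFY_URL = "http://127.0.0.1:8188/prompt"  # default local ComfyUI API endpoint
WORKFLOW = json.loads(Path("wan_i2v_api.json").read_text())  # "Save (API Format)" export

# Placeholder node IDs: open the API-format JSON and find the IDs of your
# LoadImage and text-prompt nodes, then substitute them here.
LOAD_IMAGE_NODE = "12"
PROMPT_NODE = "6"

jobs = [
    ("product_front.png", "The camera slowly orbits the product, soft studio lighting."),
    ("portrait.png", "The person slowly turns their head toward the camera, gentle smile."),
]

for image_name, motion_prompt in jobs:
    wf = json.loads(json.dumps(WORKFLOW))  # cheap deep copy of the workflow graph
    wf[LOAD_IMAGE_NODE]["inputs"]["image"] = image_name  # file must be in ComfyUI's input/ dir
    wf[PROMPT_NODE]["inputs"]["text"] = motion_prompt
    req = urllib.request.Request(
        COMFY_URL,
        data=json.dumps({"prompt": wf}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # ComfyUI queues the job and returns a prompt_id; clips render in order.
        print(image_name, "->", resp.read().decode())
```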
Common beginner mistake
The mistake: Feeding a complex, highly-detailed image into I2V and expecting the model to animate every element realistically.
Why it fails: I2V models struggle with fine detail motion — a detailed photograph of a person in a patterned shirt will have the face animate correctly but the shirt pattern will warp and swim. The model allocates its "motion budget" to the most salient features (faces, hands) and approximates the rest.
The fix: Simplify the input image. Remove busy backgrounds (inpaint to solid color). Use images with clear subjects and minimal fine detail. If animating a portrait, the face should be the only detailed element. For product videos, use clean product shots on white backgrounds. The simpler the input, the cleaner the animation. I2V is not "bring any image to life" — it's "animate a clean, well-composed image."
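A small preprocessing pass covers the mechanical part of that fix: flatten any transparency onto a plain background and crop/resize the frame to the model's working resolution before feeding it in. A sketch with Pillow, assuming the Wan default of 832×480; background removal or inpainting still happens in your image editor or an inpaint workflow first.

```python
from PIL import Image, ImageOps

TARGET = (832, 480)  # Wan 2.1 I2V default resolution

def prepare_frame(path: str, out_path: str = "i2v_input.png") -> None:
    img = Image.open(path)
    # Flatten transparency onto a plain white background so the model
    # isn't asked to animate an undefined alpha region.
    if img.mode in ("RGBA", "LA"):
        bg = Image.new("RGB", img.size, "white")
        bg.paste(img, mask=img.split()[-1])
        img = bg
    else:
        img = img.convert("RGB")
    # Center-crop to the target aspect ratio, then resize.
    img = ImageOps.fit(img, TARGET, Image.LANCZOS)
    img.save(out_path)

prepare_frame("busy_product_shot.png")
```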
Recommended setup for image-to-video
Browse all tools for runtimes that fit this workload.
Reality check
Local video gen is genuinely possible in 2026 (LTX-Video, Mochi) but VRAM-hungry. 24 GB is the working minimum for the 14B-class models; 32 GB is the comfort zone for long-form workflows. Below 24 GB you're limited to lighter options like SVD and CogVideoX-5B with offloading, not the current flagship models.
Common mistakes
- Trying 14B-class video models on 16 GB cards (weights plus activations don't fit without heavy offloading)
- Underestimating runtime VRAM (peak usage can reach ~1.5x the model size on long frame counts)
- Mixing video gen with concurrent LLM serving on same GPU
- Expecting CUDA-level speed from Apple Silicon (video gen works, but runs 30-50% slower)
What breaks first
The errors most operators hit when running image-to-video locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle image-to-video before committing money.
Local video generation is the most VRAM-hungry workload of 2026 — Hunyuan, Wan, and Mochi all need 24 GB minimum, with 32 GB unlocking longer clips.
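A quick way to run that check on a machine you already have access to is to read the GPU name and total VRAM from PyTorch and compare against the tiers above. A minimal sketch, assuming an NVIDIA card with a working CUDA install; the thresholds simply mirror the numbers in this section.

```python
import torch

# Rough go/no-go check against the VRAM tiers discussed above.
if not torch.cuda.is_available():
    print("No CUDA GPU detected: local I2V will be CPU-bound or need a cloud GPU.")
else:
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb >= 24:
        print("Wan 2.1 I2V 14B FP8 and CogVideoX-I2V-5B are workable.")
    elif vram_gb >= 12:
        print("SVD and CogVideoX-I2V-5B (with offloading) only; expect long waits.")
    else:
        print("Below 12 GB: image-to-video is not practical locally.")
```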