Image-to-Video
Animating still images into short video clips. Stable Video Diffusion, Wan, and CogVideoX-I2V are open-weight options.
Setup walkthrough
- Install ComfyUI via Stability Matrix.
- ComfyUI Manager → Install Models → search "wan-2.1-i2v-14b" → download FP8 version (~16 GB).
- Load a Wan I2V workflow. The workflow takes:
- Input image (the starting frame)
- Text prompt describing the desired motion
- Resolution: 832×480 (Wan default), frames=81 (~5 seconds)
- Prompt: "The person in the image slowly turns their head to look at the camera, gentle smile." Steps=20, CFG=5.
- Queue → first animated video in 8-20 minutes on RTX 3090/4090 24 GB.
- For faster/lighter: install Stable Video Diffusion (SVD, ~6 GB) → 14 frames (~2 seconds) of animation in 2-5 minutes on a 12 GB GPU.
- For best quality-to-speed ratio: CogVideoX-I2V-5B (~10 GB) → 49 frames in 5-10 minutes on a 16 GB GPU (a scripted diffusers alternative is sketched below).
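If you'd rather script the CogVideoX route than click through ComfyUI, the same model can be driven from Hugging Face diffusers. A minimal sketch, assuming a recent diffusers release with `CogVideoXImageToVideoPipeline` and the THUDM/CogVideoX-5b-I2V weights; the file name, prompt, and sampling values are illustrative, and the offload/tiling calls are what keep it inside a 16 GB card.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the I2V pipeline in bf16 (~10 GB of weights), then offload blocks to
# CPU between steps and tile the VAE decode to keep peak VRAM in check.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

image = load_image("input.png")  # the starting frame
prompt = "The person slowly turns their head to look at the camera, gentle smile."

video = pipe(
    prompt=prompt,
    image=image,
    num_frames=49,            # CogVideoX's native clip length (~6 seconds at 8 fps)
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

Expect roughly the same 5-10 minute range as the ComfyUI route; the script form is mainly useful once you want to batch clips.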
Image-to-video is just as compute-heavy as text-to-video. Measure in minutes, not seconds.
The cheap setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Stable Video Diffusion (SVD) at 2-5 minutes for ~2 seconds of animation. CogVideoX-I2V-5B will run with heavy offloading but takes 10-20 minutes for 3 seconds. Wan I2V 14B is not practical on 12 GB — it technically runs with massive offloading (30-60+ minutes for 5 seconds). For $300-400: you get short animated clips (2-3 seconds) from SVD at reasonable speed. For longer/higher quality, save for 24 GB. I2V at this budget is for experiments, not production.
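For the 12 GB tier, the practical path is SVD with aggressive memory savings. A minimal sketch, assuming diffusers' `StableVideoDiffusionPipeline` and the stabilityai/stable-video-diffusion-img2vid weights; the CPU offload and small `decode_chunk_size` are what keep peak VRAM near the 12 GB ceiling.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
)
# CPU offload moves idle components off the GPU between stages.
pipe.enable_model_cpu_offload()

image = load_image("portrait.png").resize((1024, 576))  # SVD's native resolution

frames = pipe(
    image,
    decode_chunk_size=4,      # decode a few frames at a time to save VRAM
    motion_bucket_id=127,     # higher = more motion
    noise_aug_strength=0.02,
).frames[0]

export_to_video(frames, "svd_clip.mp4", fps=7)
```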
The serious setup
Used RTX 4090 24 GB ($1,600, see /hardware/rtx-4090). Runs Wan 2.1 I2V 14B FP8 at ~8-15 minutes per 5-second 832×480 clip. CogVideoX-I2V-5B at 3-6 minutes per 6-second clip. For professional use (social media content, product visualization), ~10 minutes per clip is production-viable if you queue overnight. Total build: ~$2,500-3,000. RTX 5090 32 GB ($2,000, see /hardware/rtx-5090) is the current I2V king — ~5-8 minutes per Wan clip. Image-to-video is a "start the render, get coffee, come back" workflow.
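The "queue overnight" part can be automated against ComfyUI's local HTTP API: export your Wan I2V workflow with Save (API Format), then submit one patched copy per input image. A rough sketch, assuming ComfyUI is running on its default port; the node IDs, file names, and prompts are placeholders you would replace with the values from your own exported JSON.

```python
import json
import urllib.request
from pathlib import Path

COMFY_URL = "http://127.0.0.1:8188/prompt"  # default local ComfyUI API endpoint
WORKFLOW = json.loads(Path("wan_i2v_api.json").read_text())  # "Save (API Format)" export

# Placeholder node IDs: open the API-format JSON and find the IDs of your
# LoadImage and text-prompt nodes, then substitute them here.
LOAD_IMAGE_NODE = "12"
PROMPT_NODE = "6"

jobs = [
    ("product_front.png", "The camera slowly orbits the product, soft studio lighting."),
    ("portrait.png", "The person slowly turns their head toward the camera, gentle smile."),
]

for image_name, motion_prompt in jobs:
    wf = json.loads(json.dumps(WORKFLOW))  # cheap deep copy of the workflow graph
    wf[LOAD_IMAGE_NODE]["inputs"]["image"] = image_name  # file must be in ComfyUI's input/ dir
    wf[PROMPT_NODE]["inputs"]["text"] = motion_prompt
    req = urllib.request.Request(
        COMFY_URL,
        data=json.dumps({"prompt": wf}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # ComfyUI queues the job and returns a prompt_id; clips render in order.
        print(image_name, "->", resp.read().decode())
```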
Common beginner mistake
The mistake: Feeding a complex, highly-detailed image into I2V and expecting the model to animate every element realistically.
Why it fails: I2V models struggle with fine detail motion — a detailed photograph of a person in a patterned shirt will have the face animate correctly but the shirt pattern will warp and swim. The model allocates its "motion budget" to the most salient features (faces, hands) and approximates the rest.
The fix: Simplify the input image. Remove busy backgrounds (inpaint to solid color). Use images with clear subjects and minimal fine detail. If animating a portrait, the face should be the only detailed element. For product videos, use clean product shots on white backgrounds. The simpler the input, the cleaner the animation. I2V is not "bring any image to life" — it's "animate a clean, well-composed image."
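A small preprocessing pass covers the mechanical part of that fix: flatten any transparency onto a plain background and crop/resize the frame to the model's working resolution before feeding it in. A sketch with Pillow, assuming the Wan default of 832×480; background removal or inpainting still happens in your image editor or an inpaint workflow first.

```python
from PIL import Image, ImageOps

TARGET = (832, 480)  # Wan 2.1 I2V default resolution

def prepare_frame(path: str, out_path: str = "i2v_input.png") -> None:
    img = Image.open(path)
    # Flatten transparency onto a plain white background so the model
    # isn't asked to animate an undefined alpha region.
    if img.mode in ("RGBA", "LA"):
        bg = Image.new("RGB", img.size, "white")
        bg.paste(img, mask=img.split()[-1])
        img = bg
    else:
        img = img.convert("RGB")
    # Center-crop to the target aspect ratio, then resize.
    img = ImageOps.fit(img, TARGET, Image.LANCZOS)
    img.save(out_path)

prepare_frame("busy_product_shot.png")
```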
Recommended setup for image-to-video
Browse all tools for runtimes that fit this workload.
Reality check
Local video gen is genuinely possible in 2026 (LTX-Video, Mochi) but VRAM-hungry. 24 GB is the working minimum for the 14B-class models; 32 GB is the comfort zone for long-form workflows. Below 24 GB you're limited to lighter options like SVD and CogVideoX-5B with offloading, not the current flagship models.
Common mistakes
- Trying 14B-class video models on 16 GB cards (weights plus activations don't fit without heavy offloading)
- Underestimating runtime VRAM (peak usage can reach ~1.5x the model size on long frame counts)
- Mixing video gen with concurrent LLM serving on same GPU
- Expecting CUDA-level speed from Apple Silicon (video gen works, but runs 30-50% slower)
What breaks first
The errors most operators hit when running image-to-video locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle image-to-video before committing money.
Local video generation is the most VRAM-hungry workload of 2026 — Hunyuan, Wan, and Mochi all need 24 GB minimum, with 32 GB unlocking longer clips.
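A quick way to run that check on a machine you already have access to is to read the GPU name and total VRAM from PyTorch and compare against the tiers above. A minimal sketch, assuming an NVIDIA card with a working CUDA install; the thresholds simply mirror the numbers in this section.

```python
import torch

# Rough go/no-go check against the VRAM tiers discussed above.
if not torch.cuda.is_available():
    print("No CUDA GPU detected: local I2V will be CPU-bound or need a cloud GPU.")
else:
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb >= 24:
        print("Wan 2.1 I2V 14B FP8 and CogVideoX-I2V-5B are workable.")
    elif vram_gb >= 12:
        print("SVD and CogVideoX-I2V-5B (with offloading) only; expect long waits.")
    else:
        print("Below 12 GB: image-to-video is not practical locally.")
```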