Text-to-Video Generation

Generating short video clips from text prompts. Wan 2.1, HunyuanVideo, and LTX-Video lead the open-weight tier in 2026.

Capability notes

Open-weight text-to-video is the least mature AI modality — models arrived 12-18 months after their text-to-image equivalents and trail closed APIs in temporal consistency, resolution, and prompt adherence. Three models (all running on [ComfyUI](/tools/comfyui)) define the landscape: **Wan 2.1** (Alibaba, strongest all-around quality), **HunyuanVideo** (Tencent, highest resolution at 720×1280), and **LTX-Video** (Lightricks, fastest at 2-4s per clip).

**Wan 2.1** is the default recommendation. It produces 5-second clips at 480×832 (16fps) with class-leading temporal consistency — minimal flickering, and characters maintain appearance across frames. The 14B T2V model requires ~24 GB VRAM at FP16, ~12 GB at FP8. The quality gap vs closed-source (Sora, Kling, Runway Gen-3) is 12-18 months — Wan generates short, medium-quality clips; closed APIs generate longer, higher-resolution, more temporally stable video. Prompt adherence runs ~70-80% of closed-API benchmarks.

**HunyuanVideo** pushes higher resolution — native 720×1280 at 129 frames (~8.5s at 15fps) with a 13B model. It has a higher quality ceiling than Wan, but needs 48 GB minimum VRAM at FP16 for full resolution. Multi-resolution training (256p-1080p) handles aspect ratios naturally, making it the best open-weight option for vertical video (9:16).

**LTX-Video** prioritizes speed — 5-second clips in 2-4 seconds on an [RTX 4090](/hardware/rtx-4090), 10-20× faster than Wan (30-60s) and Hunyuan (60-120s). Quality is noticeably lower (more flickering, less detail), but the speed enables rapid iteration. It fits in 12 GB at FP8.

**Frame limitations**: all open-weight models exhibit temporal degradation beyond 24-32 frames (1.5-2s) — small errors compound exponentially in autoregressive generation. Full-video-attention architectures (processing all frames simultaneously) are in research but not yet open-weight.
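The three-way tradeoff above reduces to a simple selection rule; a sketch based only on the VRAM floors and strengths just listed (function name and thresholds are illustrative, not from any official tool):

```python
def pick_t2v_model(vram_gb: float, priority: str = "quality") -> str:
    """Rough open-weight T2V model selector based on the capability notes.

    priority: "quality", "speed", or "resolution".
    VRAM floors are approximate FP8/FP16 figures, not hard limits.
    """
    if priority == "speed" and vram_gb >= 12:
        return "LTX-Video"            # 2-4 s per clip, lowest quality
    if priority == "resolution" and vram_gb >= 48:
        return "HunyuanVideo"         # native 720x1280, 48 GB+ at FP16
    if vram_gb >= 24:
        return "Wan 2.1"              # default: best temporal consistency
    if vram_gb >= 12:
        return "LTX-Video"            # only practical option under 24 GB
    return "none (rent a cloud GPU)"
```

A 24 GB card with the default quality priority lands on Wan 2.1; anything under 12 GB falls through to renting.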

If you just want to try this

Lowest-friction path to a working setup.

Set expectations first: open-weight video generation produces short (5s), medium-resolution (480p-720p), temporally imperfect clips. This isn't Sora quality. The models are free and private, but they need patience and high-end hardware.

Download [ComfyUI](https://github.com/comfyanonymous/ComfyUI) — the standard runtime for Wan, Hunyuan, and LTX-Video. Install the ComfyUI-WanVideoWrapper custom node via ComfyUI Manager, then download the Wan 2.1 T2V 14B weights from Hugging Face (Wan-AI/Wan2.1-T2V-14B) — FP16 weights are ~27 GB, FP8 ~14 GB with nearly identical quality. You need 24 GB+ VRAM: an [RTX 4090 24GB](/hardware/rtx-4090) generates a 5s clip at 480×832 in 30-60s with Wan 2.1 FP8; an [RTX 3090 24GB](/hardware/rtx-3090) takes 45-90s (lower memory bandwidth: 936 GB/s vs the 4090's 1 TB/s).

Start with the simplest workflow: Load Wan Model → Wan Text-to-Video → Video Combine (frames to MP4). Prompt format matters: "A [subject] [doing action] in [environment], [camera movement], [lighting], [style]." Example: "A woman walking through a bamboo forest, camera tracking left to right, soft morning light, cinematic 24fps." Camera-movement and frame-rate hints help the model lock onto motion patterns.

If 24 GB is unavailable, switch to LTX-Video (Lightricks/LTX-Video) via the ComfyUI-LTXVideo node. It fits in 12 GB at FP8 and generates in 2-4s, though quality is noticeably lower. For CPU-only or [Apple Silicon](/hardware/macbook-pro-16-m4-max): video generation isn't practical — even LTX-Video takes 10-30 minutes per clip on an M-series GPU. Rent a cloud GPU if you don't own suitable hardware.
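The prompt template above can be enforced with a trivial helper so every prompt fills the same slots (function name and defaults are illustrative):

```python
def build_t2v_prompt(subject: str, action: str, environment: str,
                     camera: str = "static shot",
                     lighting: str = "natural light",
                     style: str = "cinematic 24fps") -> str:
    """Fill the '[subject] [action] in [environment], [camera movement],
    [lighting], [style]' template recommended above."""
    return f"{subject} {action} in {environment}, {camera}, {lighting}, {style}"
```

Keeping the slots explicit makes it harder to forget the camera-movement and frame-rate hints that help the model lock onto motion.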

For production deployment

Operator-grade recommendation.

Production text-to-video with open-weight models is early-stage. The quality/cost/speed tradeoff currently favors closed APIs for most commercial use. Self-hosting is for (1) privacy-sensitive content, (2) high volume where API costs dominate hardware, and (3) custom fine-tuning on proprietary styles.

**Batch pipeline**: run [ComfyUI](/tools/comfyui) headless as a video generation API. Architecture: request queue (Redis) → ComfyUI worker (one per GPU) → frame output → FFmpeg encoding → cloud storage → callback. A single [RTX 5090 32GB](/hardware/rtx-5090) produces ~60 Wan clips/hour. For 10,000 clips/month that's ~167 GPU-hours; at $3-5/hour cloud pricing, $500-835/month vs Kling/Runway API at $0.05-0.20/clip ($500-2,000/month) — cost-competitive.

**When closed APIs beat self-hosted**: (1) Quality — Sora, Kling 2.0, and Runway Gen-3 produce visibly better video at higher resolution, longer duration, and better temporal consistency. (2) Speed — Kling: 10-20s per clip vs Wan's 30-60s. (3) Resolution — 1080p by default vs 480p requiring upscaling. (4) Support — frame interpolation, upscaling, and format conversion included.

**When self-hosted wins**: (1) Privacy — proprietary footage where cloud upload violates data policy. (2) Volume — at 100,000+ clips/month, hardware amortization favors self-hosting. (3) Customization — LoRA fine-tuning on proprietary visual styles is possible with open weights, impossible with closed APIs.

**Pipeline architecture**: (1) prompt validation/enrichment → (2) [ComfyUI](/tools/comfyui) generation on a GPU cluster → (3) frame interpolation (RIFE) 16→30fps → (4) AI upscaling (Real-ESRGAN) 480p→1080p → (5) quality gate (CLIP score, aesthetic scoring, NSFW detection) → (6) FFmpeg encode (H.264/H.265) → delivery. Cost per clip (Wan on an [RTX 5090](/hardware/rtx-5090)): $0.05-0.10 GPU compute + $0.01-0.02 post-processing + $0.0001-0.001 storage ≈ $0.06-0.12 total self-hosted, vs $0.05-0.20 per API clip.
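The clips-per-month arithmetic above is worth scripting when comparing against API pricing; a minimal sketch using the ~60 clips/hour RTX 5090 throughput quoted above (function name and defaults are illustrative):

```python
def monthly_self_hosted_cost(clips_per_month: int,
                             clips_per_hour: float = 60.0,
                             gpu_rate_per_hour: float = 4.0) -> tuple:
    """Return (gpu_hours, dollar_cost) for a self-hosted Wan pipeline.

    clips_per_hour: single-GPU throughput (~60 on an RTX 5090).
    gpu_rate_per_hour: cloud rental rate in dollars.
    """
    gpu_hours = clips_per_month / clips_per_hour
    return gpu_hours, gpu_hours * gpu_rate_per_hour

# 10,000 clips/month at $3/hr lands at the low end of the $500-835 range above
hours, cost = monthly_self_hosted_cost(10_000, gpu_rate_per_hour=3.0)
```

Compare the result against your API quote at $0.05-0.20/clip before committing to hardware.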

What breaks

Failure modes operators see in the wild.

- **Temporal inconsistency (flickering, morphing).** Small texture/color shifts accumulate across frames — objects "swim," backgrounds shift. Mitigation: generate at higher fps with temporal smoothing across 3-5 frame windows; use frame interpolation (RIFE) between generated frames. Accept this as the open-weight video generation tax.
- **Motion collapse (static-looking video).** The model generates near-static frames — minimal motion despite a dynamic prompt. Mitigation: include explicit motion descriptors ("camera pans left," "subject walks forward," "particles drift upward") and increase guidance scale (7-10). Wan handles motion better than LTX-Video.
- **Resolution/frame budget OOM.** Requesting 129 frames at 720×1280 on a 24 GB GPU crashes mid-generation — the model loads, allocates frame buffers, and runs out of VRAM around frame 50-80. Mitigation: pre-flight VRAM calculation. HunyuanVideo 720p: ~26 GB model + (129 frames × ~50 MB) ≈ 32 GB. Fall back to 512p or fewer frames.
- **Prompt adherence weaker than image gen.** The model ignores secondary prompt elements as frame count grows — "cat on red couch with window" becomes a cat on a beige couch, no window, by frame 30. Mitigation: keep prompts simple (1-2 subjects, 1 action, 1 environment). Use I2V mode with an established start frame and prompt only the motion.
- **NSFW filter false positives.** Wan and Hunyuan ship with safety filters trained to Chinese content standards — false positives on medical content, fitness content, swimwear, classical art. Mitigation: no clean workaround — the filter is in the model weights. Community fine-tunes may loosen it but exist in a legal gray area.
- **Multi-character identity confusion.** Two distinct characters swap attributes by mid-clip — the model tracks one identity but struggles with two. Mitigation: generate characters separately and composite, or use I2V with an initial frame showing both. Active research; no reliable open-weight solution.
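The pre-flight VRAM calculation from the OOM bullet can be scripted; a sketch using the ~50 MB/frame heuristic from the HunyuanVideo example above (a rough estimate, not an allocator model — function names are illustrative):

```python
def preflight_vram_gb(model_gb: float, frames: int,
                      mb_per_frame: float = 50.0) -> float:
    """Rough peak-VRAM estimate: model weights plus per-frame buffers."""
    return model_gb + frames * mb_per_frame / 1024

def fits(model_gb: float, frames: int, vram_gb: float) -> bool:
    """True if the job should survive on a card with vram_gb of VRAM."""
    return preflight_vram_gb(model_gb, frames) <= vram_gb
```

HunyuanVideo at 720p (26 GB model, 129 frames) estimates to ~32 GB — a predictable OOM on a 24 GB card, fine on a 48 GB one.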

Hardware guidance

**Minimum viable (24 GB VRAM)**: [RTX 3090 24GB](/hardware/rtx-3090) or [RTX 4090 24GB](/hardware/rtx-4090). Wan 2.1 FP8 at 480×832, 5-second clips. This is the absolute floor — below 24 GB, video generation is not practical. Expect 30-90 seconds per clip on the 4090, 45-120 seconds on the 3090 (lower memory bandwidth — 936 GB/s vs the 4090's 1 TB/s — impacts video throughput more than image). [Apple M3 Ultra 64GB](/hardware/apple-m3-ultra) via MLX is 2-4× slower, but unified memory enables higher frame counts or resolution if you accept the speed penalty.

**Recommended (32 GB)**: [RTX 5090 32GB](/hardware/rtx-5090). The single-card sweet spot — Wan 2.1 FP16 (~27 GB model) with full 5s generation, plus HunyuanVideo at a reduced 512×720 resolution. 30-60s per Wan clip, 60-90s per Hunyuan clip.

**Enterprise (48 GB+)**: [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB or [RTX A6000](/hardware/rtx-a6000) 48 GB. HunyuanVideo 720×1280 at 129 frames in FP16; Wan 2.1 FP16 extended 81-frame generation. [NVIDIA H200](/hardware/nvidia-h200) 141 GB for experimental 200+ frame generation (10+ seconds).

**Multi-GPU**: video gen models don't natively support tensor parallelism — you can't split one generation across GPUs. Instead, run one ComfyUI worker per GPU. 4× [RTX 5090](/hardware/rtx-5090) = 4 concurrent generations, ~240 clips/hour; 8× [L40S](/hardware/nvidia-l40s) = 8 concurrent, ~480 clips/hour.

**VRAM scaling**: LTX-Video FP8 480p: 10-12 GB. Wan 2.1 FP8 480p 5s: 12-16 GB. Wan 2.1 FP16 480p 5s: 22-28 GB. HunyuanVideo FP16 512p: 26-32 GB. HunyuanVideo FP16 720p 129 frames: 42-50 GB.
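The VRAM scaling figures can be kept as a lookup table so capacity checks happen before a job is queued; a sketch transcribing the ranges above (keys and helper name are illustrative):

```python
# (model, precision, resolution) -> (low, high) working-VRAM range in GB,
# transcribed from the VRAM scaling notes above.
VRAM_RANGE_GB = {
    ("ltx-video", "fp8", "480p"):     (10, 12),
    ("wan2.1", "fp8", "480p"):        (12, 16),
    ("wan2.1", "fp16", "480p"):       (22, 28),
    ("hunyuanvideo", "fp16", "512p"): (26, 32),
    ("hunyuanvideo", "fp16", "720p"): (42, 50),
}

def card_needed_gb(model: str, precision: str, resolution: str) -> int:
    """Plan for the upper end of the range, not the lower."""
    return VRAM_RANGE_GB[(model, precision, resolution)][1]
```

Sizing to the upper bound avoids the mid-generation OOMs described in the failure-modes section.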

Runtime guidance

**If generating video clips interactively** → [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with the WanVideoWrapper, HunyuanVideoWrapper, or LTXVideo nodes. It is the standard — and only practical — runtime for open-weight video generation: Wan, Hunyuan, and LTX-Video all ship ComfyUI workflow support as their primary integration. Native Diffusers support is experimental — use ComfyUI for reliability.

**If building programmatic video generation** → ComfyUI API mode (`--listen`, POST workflows to `/prompt`). The API accepts parameterizable workflow JSON — inject prompts, seeds, and model paths. Build a Python wrapper: construct workflow JSON → submit → poll → download the video output.

**If speed is the priority** → LTX-Video on an [RTX 4090](/hardware/rtx-4090) or [RTX 5090](/hardware/rtx-5090): 2-4s per clip. Quality is the lowest — use it for real-time previews and iterative prompt refinement to find the best seed, then commit to Wan 2.1 for final quality.

**If quality is the priority** → Wan 2.1 FP16 + RIFE interpolation + Real-ESRGAN upscaling. Pipeline: Wan at 16fps → RIFE to 30fps → Real-ESRGAN 480p→1080p. 90-180s per clip on an [RTX 5090](/hardware/rtx-5090) — the best open-weight video quality achievable.

**If you need image-to-video for controlled generation** → the Wan 2.1 I2V model (Wan2.1-I2V-14B) in ComfyUI. Provide a start frame plus a motion prompt. I2V produces more controllable results — the start frame anchors the scene, and the model only handles temporal evolution.

**Post-processing stack**: RIFE (frame interpolation, 5-15s) for motion smoothing; Real-ESRGAN (upscaling, 10-30s); FFmpeg encoding (H.264/H.265, CRF 18-23) in the ComfyUI Video Combine node.

**What doesn't work**: [Diffusers](/tools/diffusers) native video pipelines are experimental/unreliable for Wan/Hunyuan. [Transformers](/tools/transformers), [Ollama](/tools/ollama), [LM Studio](/tools/lm-studio), [Automatic1111](/tools/automatic1111) — none support video generation. ComfyUI is the single runtime.
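The programmatic path can be sketched as a minimal wrapper. ComfyUI really does expose a POST `/prompt` endpoint and GET `/history/<prompt_id>` for polling, but the node ids (`"6"`, `"3"`) are hypothetical — export your own workflow in API format and inspect it for the actual ids:

```python
import json
import urllib.request

def make_payload(workflow: dict, prompt_text: str, seed: int,
                 prompt_node: str = "6", sampler_node: str = "3") -> dict:
    """Inject prompt and seed into a ComfyUI API-format workflow export.

    Node ids are workflow-specific (assumed here); check your own export.
    """
    wf = json.loads(json.dumps(workflow))          # deep copy, leave input intact
    wf[prompt_node]["inputs"]["text"] = prompt_text
    wf[sampler_node]["inputs"]["seed"] = seed
    return {"prompt": wf}

def submit(host: str, payload: dict) -> str:
    """POST the workflow to ComfyUI's /prompt endpoint; returns the prompt_id
    to poll via GET /history/<prompt_id> until the output video appears."""
    req = urllib.request.Request(
        f"http://{host}/prompt",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]
```

A batch driver then just loops `make_payload` over prompts/seeds and calls `submit("localhost:8188", ...)` against a ComfyUI instance started with `--listen`.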

Setup walkthrough

  1. Install ComfyUI via Stability Matrix.
  2. ComfyUI Manager → Install Models → search "wan-2.1-t2v-14b" and download the FP8 version (~16 GB).
  3. Load the Wan 2.1 T2V workflow from the workflow library.
  4. Configure: resolution=832×480 (Wan default), frames=81 (~5 seconds at 16 fps), steps=20, CFG=5.
  5. Prompt: "A cat walking through a neon-lit Tokyo alley at night, rain reflecting on the ground." Queue.
  6. First video in 5-15 minutes on RTX 4090 24 GB, 15-30 minutes on RTX 3090 24 GB.
  7. For lighter/faster: install LTX-Video (~6 GB, ComfyUI Manager) — 3-5 seconds of video in 1-3 minutes on 12 GB GPU.

Warning: video generation is the most compute-intensive local AI task. Time-to-first-result is measured in minutes, not seconds.

The cheap setup

Honestly: $300 cannot do quality text-to-video locally. A used RTX 3060 12 GB (~$200-250) runs LTX-Video at 832×480, ~5 seconds of video in 3-8 minutes. Wan 2.1 14B requires 24+ GB at reasonable speed — it will run with heavy offloading on 12 GB but takes 30-60 minutes for 5 seconds of video. For realistic expectations: $300 gets you into the game with LTX-Video (basic quality, fast). $300 does NOT get you Wan 2.1 or HunyuanVideo at usable speeds. If you only need short clips (<3 seconds) occasionally, it's workable. For daily use, save up for 24 GB.

The serious setup

A used [RTX 4090 24GB](/hardware/rtx-4090) (~$1,600) runs Wan 2.1 T2V 14B FP8 at ~5-10 minutes per 5-second 832×480 clip, with HunyuanVideo at similar speeds. The [RTX 5090 32GB](/hardware/rtx-5090) (~$2,000) is the current single-GPU video generation king — 3-6 minutes per 5-second clip. Dual RTX 3090s (48 GB total, ~$1,600) give you two concurrent 24 GB Wan workers — note that video models can't pool VRAM across cards. Pair with a Ryzen 7 7700X + 64 GB DDR5 + 2TB NVMe. Total build: ~$2,500-3,500. Video generation pushes hardware harder than any other local AI task.

Common beginner mistake

The mistake: expecting text-to-video to be as fast as text-to-image ("I'll just generate a 10-second 1080p video in 30 seconds, right?").

Why it fails: a 5-second video at 16 fps is 81 frames, and each frame is a full diffusion generation at the target resolution — naively ~81× the compute of one image. Even with temporal attention sharing, video generation lands at 20-50× the cost of image generation.

The fix: start with the fastest model — LTX-Video generates 5 seconds of video in 1-3 minutes on 12 GB. Use Wan 2.1 or HunyuanVideo only when you need higher quality and have the time/VRAM. Lower resolution (832×480 vs 1024×1024) and fewer frames (49 instead of 81) dramatically reduce generation time. Video generation is a "walk away and come back" task, not a "click and get" task.
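The 81-frame figure follows a pattern visible in the presets quoted in this guide (49, 81, 129 — all of the form 4k+1, presumably from a 4× temporal-VAE compression factor; treat that as an assumption and verify against your model's docs). A helper to snap a requested duration to a valid frame count:

```python
def valid_frame_count(seconds: float, fps: int = 16) -> int:
    """Round seconds*fps to the nearest frame count of the form 4k+1
    (the pattern behind the 49/81/129 presets -- an assumption, verify
    for your specific model)."""
    raw = round(seconds * fps)
    k = round((raw - 1) / 4)
    return 4 * k + 1
```

Five seconds at 16 fps snaps to 81 frames; requesting an invalid count typically fails or gets silently rounded by the node, so snapping up front keeps runs predictable.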

Reality check

Local video gen is genuinely possible in 2026 (LTX-Video, Mochi) but VRAM-hungry. 24 GB is the working minimum; 32 GB is the comfort zone for long-form workflows. Below 24 GB, video gen isn't realistic with current models.

Common mistakes

  • Trying video gen on 16 GB cards (model weights plus frame buffers don't fit)
  • Underestimating runtime VRAM (peak draw is ~1.5× model size on long sequences)
  • Mixing video gen with concurrent LLM serving on the same GPU
  • Using Apple Silicon for video gen — viable, but expect 2-4× slower than CUDA
