Text-to-3D Generation
Generating 3D models from text prompts. Hunyuan3D-2, TRELLIS, and Stable Fast 3D lead the open-weight field in 2026.
Setup walkthrough
- pip install gradio, then git clone https://github.com/Tencent/Hunyuan3D-2 (Hunyuan3D-2 — SOTA open-weight text-to-3D).
- Download the model weights (~5 GB for the base model, ~10 GB for the full pipeline with texture generation).
- The pipeline: text prompt → multi-view diffusion (generates 6 views of the object) → 3D reconstruction (creates mesh from views) → texture generation (UV-unwraps and textures the mesh).
- CLI:
python inference.py --prompt "a wooden chair with carved armrests" --output chair.glb
- First 3D model in 2-10 minutes on a 12+ GB GPU. Output is a textured GLB file (a standard format; opens in Blender, Unity, and Unreal).
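For batch asset generation, the CLI invocation above is easy to wrap in a short script. A minimal sketch, assuming the repo's inference.py with the --prompt/--output flags shown in the walkthrough (the helper names and output layout here are illustrative):

```python
import subprocess
from pathlib import Path

def build_cmd(prompt: str, output: Path) -> list[str]:
    """Build one Hunyuan3D-2 CLI invocation (flags as in the walkthrough)."""
    return [
        "python", "inference.py",
        "--prompt", prompt,
        "--output", str(output),
    ]

def generate_batch(prompts: list[str], out_dir: str = "assets") -> list[Path]:
    """Run the CLI once per prompt; each run takes minutes on a 12+ GB GPU."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    results = []
    for i, prompt in enumerate(prompts):
        glb = out / f"asset_{i:03d}.glb"
        subprocess.run(build_cmd(prompt, glb), check=True)  # blocks until done
        results.append(glb)
    return results
```

Run it from the repo root so inference.py resolves; adjust the path if your checkout differs.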
- For a lighter/faster option: pip install shap-e (OpenAI Shap-E, ~1 GB) — generates simple 3D shapes from text in 10-30 seconds on CPU. Lower quality, much faster.
- Alternative: TripoSR (pip install triposr) — image-to-3D, but usable with text via a text→image→3D pipeline.
The cheap setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Hunyuan3D-2 at 5-15 minutes per model. Shap-E runs at 30-60 seconds per model on CPU. For $400: you can generate simple 3D assets (furniture, props, basic characters) for game dev and prototyping. For high-quality textured models: Hunyuan3D-2 on 12 GB works but the multi-view diffusion stage strains VRAM — expect occasional OOM errors on complex prompts. Text-to-3D at $400 works for prototyping; production-quality models need more VRAM or cloud services.
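The occasional OOM on complex prompts can be softened by retrying the heavy stage at a lower resolution instead of failing the whole job. A framework-agnostic sketch — with_fallback, the resolution ladder, and the exception types are all illustrative assumptions; pass whatever your framework actually raises on OOM (PyTorch raises a RuntimeError subclass):

```python
def with_fallback(run, resolutions=(1024, 768, 512), oom_errors=(RuntimeError,)):
    """Call run(resolution) at decreasing resolutions until one fits in VRAM."""
    last = None
    for res in resolutions:
        try:
            return run(res)
        except oom_errors as exc:
            last = exc  # out of memory at this size: retry smaller
    raise last  # even the smallest setting did not fit
```

The trade is explicit: a smaller multi-view resolution costs detail but keeps the overnight batch running unattended.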
The serious setup
Used RTX 3090 24 GB ($700-900, see /hardware/rtx-3090). Runs Hunyuan3D-2 comfortably at 2-5 minutes per model — the full pipeline (multi-view + mesh + texture) fits in 24 GB. For a game asset pipeline generating 20-50 props/day, one RTX 3090 handles it. For high-quality character models: 24 GB enables the highest resolution multi-view diffusion. Total: ~$1,800-2,200. RTX 4090 24 GB ($1,600) drops generation to 1-3 minutes per model — fast enough for interactive prototyping. Text-to-3D is a "generate, review, refine" loop — faster GPU = faster iteration.
Common beginner mistake
The mistake: Generating a 3D model from text, importing it into a game engine or 3D-printer slicer, and wondering why it has 500K triangles, inverted normals, and non-manifold geometry.

Why it fails: AI-generated meshes prioritize visual appearance over geometric correctness. The mesh looks right from the generated views but has topological issues: non-manifold edges, self-intersecting faces, inconsistent normals, and absurd triangle counts (a simple chair shouldn't need 500K tris).

The fix: Always post-process AI-generated meshes. Import into Blender → Decimate modifier (reduce to 5-10K tris for game assets) → Recalculate Normals → 3D Print Toolbox (check for non-manifold geometry) → manual cleanup. AI generates the rough shape; you optimize for the target platform. A raw AI mesh is a starting point, not a deliverable. Budget 10-30 minutes of manual cleanup per AI-generated model.
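Two of those defects — non-manifold edges and inconsistent winding (the cause of flipped normals) — can be flagged programmatically before you ever open Blender. A dependency-free sketch (mesh_report is a hypothetical helper, not part of any tool named above): an edge shared by more than two triangles is non-manifold, and an edge traversed in the same direction by two triangles means their windings disagree.

```python
from collections import Counter

def mesh_report(faces):
    """Flag non-manifold edges and inconsistent winding in a triangle list.

    faces: list of (i, j, k) vertex-index triangles. In a clean mesh every
    interior edge is shared by exactly two faces, once per direction.
    """
    undirected = Counter()  # edge -> number of incident faces
    directed = Counter()    # oriented edge -> traversal count
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            undirected[frozenset((u, v))] += 1
            directed[(u, v)] += 1
    non_manifold = [tuple(e) for e, n in undirected.items() if n > 2]
    flipped = [e for e, n in directed.items()
               if n > 1 and undirected[frozenset(e)] <= 2]
    return {"tris": len(faces),
            "non_manifold_edges": non_manifold,
            "inconsistent_winding": flipped}
```

This catches topology errors only; decimation and self-intersection checks still belong in Blender.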
Recommended setup for text-to-3D generation
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
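The first bullet can be made concrete with the standard transformer KV-cache formula. A back-of-envelope sketch — the model numbers in the test are illustrative, and the 1.5 GiB runtime-overhead figure is a working assumption, not a measurement:

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per=2, batch=1):
    """KV cache bytes: 2 (K and V) x layers x kv_heads x head_dim
    x context length x bytes per element (2 for fp16), as GiB."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per * batch / 2**30

def fits(vram_gib, weights_gib, layers, kv_heads, head_dim, seq_len,
         overhead_gib=1.5):
    """Rough fit check: weights + KV cache + assumed runtime overhead."""
    need = weights_gib + kv_cache_gib(layers, kv_heads, head_dim, seq_len) + overhead_gib
    return need <= vram_gib, round(need, 2)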
What breaks first
The errors most operators hit when running text-to-3D generation locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle text-to-3D generation before committing money.