Does NVIDIA DGX Spark (Project Digits) support CUDA?

Yes. NVIDIA DGX Spark (Project Digits) is a complete NVIDIA system rather than an add-in card, and it has full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all target CUDA; on this machine they need ARM (aarch64) builds.
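A quick way to confirm what the answer above claims on any given machine is to look at the CPU architecture and ask the driver what it sees. The sketch below is a minimal check, assuming `nvidia-smi` is on the PATH (an assumption about the installed image, not something stated above); the function name `cuda_environment_report` is made up for illustration.

```python
import platform
import shutil
import subprocess


def cuda_environment_report() -> dict:
    """Collect basic facts about the CPU architecture and visible NVIDIA GPUs."""
    report = {
        "machine": platform.machine(),            # e.g. 'aarch64' on ARM Linux, 'x86_64' on Intel/AMD
        "nvidia_smi": shutil.which("nvidia-smi"), # None if the driver tools are not on PATH
        "nvcc": shutil.which("nvcc"),             # None if the CUDA toolkit compiler is not on PATH
        "gpus": [],
    }
    if report["nvidia_smi"]:
        result = subprocess.run(
            [report["nvidia_smi"], "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            report["gpus"] = [ln.strip() for ln in result.stdout.splitlines() if ln.strip()]
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

An empty `gpus` list means the driver either is not installed or did not report a device; it says nothing about whether a particular framework build supports the architecture.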

NVIDIA DGX Spark (Project Digits) for local AI

What it does well

The DGX Spark (also marketed as Project Digits) is NVIDIA's first true ARM-based desk-side AI development workstation. The headline feature is unified memory: 128 GB of LPDDR5X shared between the Grace ARM CPU and the Blackwell-generation GPU, accessible via NVLink-C2C at ~600 GB/s, in a small form factor that runs from a 240 W power adapter. For local development of frontier models, this is genuinely unique. It targets Llama 3.1 405B at Q3, DeepSeek V3 671B at Q2, or Qwen 3 235B at Q4, with paged offload wherever the quantized weights exceed 128 GB. No other $3,000 device can host these workloads. The full CUDA stack works (with ARM-compiled binaries; CUDA 12.8+ ships first-class ARM/Grace support), so you can develop against the same software stack you'll deploy on H200/B200 clusters. NVIDIA provides the NeMo + DGX OS + JupyterLab stack pre-configured out of the box. At $3,000 retail, it's roughly 1/3 the price of a 96 GB workstation card alone, and it's a complete system with CPU + RAM + storage + cooling.
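Whether a given model lands inside 128 GB comes down to simple arithmetic: parameters times bits per weight, divided by eight. The sketch below applies that to the models named above. The bit widths are flat, assumed averages; real quantization formats usually average somewhat more bits than their nominal number, and KV cache plus the OS take memory on top, so treat the results as optimistic lower bounds.

```python
UNIFIED_MEMORY_GB = 128  # figure from the text above


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# (label, parameters in billions, assumed average bits per weight)
candidates = [
    ("405B at ~3 bits", 405, 3.0),
    ("671B at ~2 bits", 671, 2.0),
    ("235B at ~4 bits", 235, 4.0),
    ("235B at FP8", 235, 8.0),
]

for label, params_b, bits in candidates:
    size = weight_footprint_gb(params_b, bits)
    verdict = "fits" if size <= UNIFIED_MEMORY_GB else "needs offload"
    print(f"{label}: {size:.1f} GB of weights -> {verdict}")
```

By this rough count only the 235B model at about 4 bits stays under 128 GB on weights alone, which is why paged offload matters for the larger configurations.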

Where it breaks

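Bandwidth ceiling vs discrete GPU. The unified LPDDR5X pool is specified at roughly 273 GB/s, and that figure, not the NVLink-C2C link between CPU and GPU, is what bounds decode. It is dramatically below an H100's 3.35 TB/s or even an RTX 5090's 1.79 TB/s. Decode speed for memory-bound workloads is meaningfully slower: a 405B Q3 model, at roughly 150 GB of weights, would run at low single-digit tok/s rather than the ~25–40 tok/s a 4× H100 box delivers.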
ARM ecosystem still has friction. While CUDA on ARM works, many third-party tools (some Python packages, certain Docker images, niche binaries) don't ship native ARM builds. Expect occasional 'pip install' failures, x86 emulation slowdowns, and Docker images that don't run.
Power envelope limits sustained workloads. 240 W total system power means the GPU never runs at H100 / B200-class wattage. Sustained inference on big models throttles compared to discrete-GPU equivalents.
Single-system, no multi-card scale. No NVLink to other Spark units, no PCIe expansion. NVIDIA describes pairing two units over the built-in ConnectX networking, but that is a network link between two separate 128 GB pools, not one larger shared memory. For workloads that outgrow that, you're committing to a different platform.
First-generation product risk. Software ecosystem (drivers, distro support, niche tooling) will mature over 12–24 months. Buying day-one means living through the maturity curve.
Not for production serving. Single-user dev box, period. Don't deploy this as a production server.
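The bandwidth point above can be put into numbers with a common rule of thumb: for a dense model, each generated token streams all the weights once, so decode speed is capped at memory bandwidth divided by weight size. The sketch below is a rough ceiling under that assumption, not a benchmark. The 273 GB/s entry is an assumed spec-sheet figure for the LPDDR5X pool; the other bandwidths are the ones quoted in the text. Mixture-of-experts models read only their active parameters per token, so they can beat this bound.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when every token streams all weights once."""
    return bandwidth_gb_s / weights_gb


# Bandwidth figures in GB/s.
bandwidths = {
    "DGX Spark LPDDR5X (assumed 273 GB/s)": 273,
    "NVLink-C2C figure quoted in the text": 600,
    "RTX 5090": 1790,
    "H100": 3350,
}

weights_gb = 8.0  # hypothetical: an 8B-parameter dense model at 8 bits per weight

for name, bw in bandwidths.items():
    print(f"{name}: at most ~{decode_ceiling_tok_s(bw, weights_gb):.0f} tok/s")
```

Real throughput sits below the ceiling because of KV-cache reads, kernel overhead, and power limits, but the ratio between devices is a useful first guess.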

Ideal model range

Sweet spot: Local development of 200B–671B-class models. Hold them at low quant, with paged offload at the top of that range, for prompt iteration, prototyping agentic workflows, and validating model behavior before deploying to an H200/B200 cluster.
Sweet spot: 70B–200B-class development at FP8 down to Q4 with comfortable context (a 70B model at FP16 is already about 140 GB of weights, which does not fit). For dev-loop work where you don't need maximum throughput, this is the cheapest path.
Sweet spot: Mixed-model agentic prototyping. Fit quantized 70B + 30B + 7B models simultaneously for draft → review → summarize loops without offload thrashing.
Stretch: Light fine-tuning at 7B–13B QLoRA. Bandwidth-limited but functional.
Bad fit: Production serving (any scale), high-throughput single-model decode, training, multi-card scale-out workloads.
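For the mixed-model case in the list above, the question is whether several models fit at once with room left for KV cache and the OS. The sketch below sums the weight footprints and holds back a reserve. The 16 GB reserve and the function name `fits_together` are illustrative assumptions, not measured values.

```python
def fits_together(models, memory_gb: float = 128.0, reserve_gb: float = 16.0):
    """Return (total weight GB, fits?) for a list of (params_billion, bits_per_weight).

    reserve_gb is a hypothetical allowance for KV cache, activations, and the OS.
    """
    total = sum(params_b * bits / 8 for params_b, bits in models)
    return total, total + reserve_gb <= memory_gb


trio_8bit = [(70, 8), (30, 8), (7, 8)]
trio_16bit = [(70, 16), (30, 16), (7, 16)]

print(fits_together(trio_8bit))   # 70B + 30B + 7B at 8 bits per weight
print(fits_together(trio_16bit))  # the same trio at 16 bits per weight
```

At 8 bits the trio comes to 107 GB and fits with the reserve; at 16 bits it does not, so the simultaneous-models scenario depends on quantization.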

Bad use cases

Production inference deployment. Wrong tier — pick L40S / H100 PCIe / H200.
Maximum tok/s on small models. Sub-13B models at more than ~150 tok/s are consumer GPU territory (RTX 4090 / 5090). DGX Spark is bandwidth-limited.
Training workloads. Training is bandwidth-and-compute-heavy. DGX Spark is for development on top of trained models, not training.
Anyone whose stack doesn't have ARM support already. Audit your toolchain first. If you have many CUDA-x86-only dependencies, this is friction you don't want.
Long-horizon production reliance. First-gen platform. Expect software maturity issues for 12–18 months.
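The toolchain audit suggested above can start from the Python environment you already use. Each installed wheel records its platform tags in a `WHEEL` metadata file, so a short script can flag packages whose installed build is x86-specific. This is a sketch under that assumption: a flagged package only tells you to go check whether an aarch64 build exists, not that none does, and the helper names are made up for illustration.

```python
from importlib import metadata


def platform_tags(wheel_text: str) -> set[str]:
    """Extract the platform part of each 'Tag:' line in a WHEEL metadata file."""
    tags = set()
    for line in wheel_text.splitlines():
        if line.startswith("Tag:"):
            # A tag looks like 'cp311-cp311-manylinux_2_17_x86_64'.
            tags.add(line.split(":", 1)[1].strip().split("-")[-1])
    return tags


def is_x86_only(tags: set[str]) -> bool:
    """True when every platform tag is x86-specific (no 'any', no aarch64/arm64)."""
    return bool(tags) and all(
        "x86_64" in t or "amd64" in t or t.endswith("i686") for t in tags
    )


def audit_installed():
    """Yield (name, tags) for installed distributions whose wheel is x86-only."""
    for dist in metadata.distributions():
        text = dist.read_text("WHEEL")
        if text is None:
            continue  # not installed from a wheel, nothing to inspect
        tags = platform_tags(text)
        if is_x86_only(tags):
            yield dist.metadata["Name"], sorted(tags)


if __name__ == "__main__":
    for name, tags in audit_installed():
        print(f"{name}: {', '.join(tags)}")
```

Run it inside the x86 environment you plan to migrate. Pure-Python packages (tagged `any`) are skipped, which leaves a short list of compiled dependencies to verify by hand.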

Verdict

Buy this if you do local development on frontier (200B+) models that you'll deploy to H200/B200 clusters, you want a desk-side dev box with 128 GB of unified memory for model weights, your software stack is ARM-clean (or you're willing to sort out the long tail), and you understand this is dev/prototyping hardware, not production. The DGX Spark hits a unique price point: there is genuinely no other $3,000 device that does what this does.

Skip this if you need maximum tok/s on smaller models (consumer GPU wins), your workloads fit 24–48 GB (RTX 5090 or used 3090 is dramatically cheaper), you're production-serving (wrong tier entirely), you're CUDA-x86-locked and ARM friction is painful, or you want a mature first-day-functional platform (wait 12 months).

How it compares

vs Mac Studio M3 Ultra (up to 512 GB) → Mac Studio with up to 512 GB of unified memory is the closest comparable: more memory, similar dev-tier positioning, a more mature ARM ecosystem (Apple Silicon has been shipping since 2020), but no CUDA. Pick DGX Spark when CUDA-on-ARM is non-negotiable for your stack; Mac Studio when MLX/Metal works and you want more memory. See /compare/nvidia-dgx-spark-vs-mac-studio-m3-ultra.
vs RTX 5090 (32 GB) → 5090 wins on raw bandwidth (1.79 TB/s vs roughly 273 GB/s of LPDDR5X bandwidth), tensor compute, and price-per-performance for everything that fits 32 GB. DGX Spark wins on memory ceiling (4× the VRAM-equivalent) for frontier-scale dev. Pick 5090 for hobbyist / sub-32 GB; DGX Spark for 200B+ dev box.
vs RTX PRO 6000 Blackwell (96 GB) → PRO 6000 Blackwell wins on bandwidth (1.79 TB/s) and pure tensor compute, at 2.8× the GPU price. DGX Spark at $3,000 includes the full system; PRO 6000 Blackwell at $8,499 is just the card. For cost-conscious frontier dev, DGX Spark wins; for serious workstation-tier prosumer inference, PRO 6000 Blackwell.
vs renting H200 / B200 on cloud → Renting H200 at $3–$4.50/hr lets you actually run frontier inference at speed. DGX Spark dev means slower iteration but no rental clock. Pick DGX Spark for the "no clock" benefit; rent for the "actual production speed" benefit. Most teams should do both.
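The rent-versus-buy trade-off above has a simple break-even: purchase price divided by hourly rental rate. The sketch below uses the $3,000 and $3–$4.50/hr figures from the text and deliberately ignores electricity, resale value, and the speed difference between the two options.

```python
def break_even_hours(purchase_usd: float, rental_usd_per_hour: float) -> float:
    """Rental hours at which cumulative rent equals the purchase price."""
    return purchase_usd / rental_usd_per_hour


PURCHASE_USD = 3000  # DGX Spark price quoted in the text

for rate in (3.00, 4.50):
    hours = break_even_hours(PURCHASE_USD, rate)
    print(f"${rate:.2f}/hr -> about {hours:.0f} rental hours to match ${PURCHASE_USD:,}")
```

Because a rented H200 finishes the same job much faster, hours are not directly comparable between the two; the number is a budgeting reference rather than a verdict.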

Specs

VRAM	0 GB dedicated (128 GB unified memory)
System RAM (typical)	128 GB
Power draw (peak)	200 W
Released	2025
MSRP	$3,000
Backends	CUDA
