RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Frameworks & tools

Ray

Ray is an open-source distributed computing framework for scaling AI workloads across multiple machines. Operators encounter Ray when they need to run inference or training across a cluster of GPUs or CPUs: it handles task scheduling, data parallelism, and fault tolerance. Ray provides a runtime that hides most of the complexity of distributing work, so the same Python code can scale from a single machine to hundreds of nodes. It is commonly paired with vLLM to serve large language models across multiple GPUs, and its Ray Serve library deploys models as microservices.

Deeper dive

Ray consists of a core distributed runtime and higher-level libraries such as Ray Serve (model serving), Ray Train (distributed training), and Ray Data (data processing). The runtime uses a global control store (GCS) for cluster metadata and a distributed scheduler to place tasks on available nodes. Operators interact with Ray through Python decorators: @ray.remote marks a function (which becomes a task) or a class (which becomes an actor) to run remotely. Ray shares objects between workers through a per-node object store backed by shared memory, and it can automatically retry tasks that fail when a worker or node dies. For local AI, Ray is most often paired with vLLM to serve a model across multiple GPUs, enabling larger models or higher throughput than a single GPU can provide. Ray can also be used with Hugging Face Transformers for distributed training. For single-machine inference, simpler tools like Ollama or LM Studio are more common.
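The task model described above can be sketched in a few lines. This is a minimal illustration, not a recommended project layout: the names square and run_squares_on_ray are hypothetical, and Ray is imported lazily inside the function so the file loads even where Ray is not installed. Calling ray.remote(square) has the same effect as decorating the function with @ray.remote.

```python
def square(x: int) -> int:
    """Plain Python function; Ray wraps it without changing its body."""
    return x * x


def run_squares_on_ray(values: list[int]) -> list[int]:
    """Run square() once per value as Ray tasks and collect the results."""
    import ray  # lazy import: only needed when the tasks are actually submitted

    # Starts a local Ray runtime for this script; ignore the error if one is already attached.
    ray.init(ignore_reinit_error=True)

    # Equivalent to putting @ray.remote above the function definition.
    remote_square = ray.remote(square)

    # .remote() returns immediately with an ObjectRef (a future) for each task.
    refs = [remote_square.remote(v) for v in values]

    # ray.get() blocks until all tasks finish and returns results in submission order.
    results = ray.get(refs)

    ray.shutdown()
    return results
```

Calling run_squares_on_ray([1, 2, 3]) on a machine with Ray installed schedules three tasks on the local runtime. The same code runs on a multi-node cluster if ray.init() is pointed at that cluster instead.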

Practical example

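An operator with two RTX 4090 GPUs (24 GB each) wants to serve Llama 3.1 70B quantized to 4 bits (~40 GB of weights). A single GPU cannot fit the model, but vLLM can shard it across both GPUs with tensor parallelism, which splits each layer's weight matrices between the GPUs rather than assigning whole layers to each one. The command takes the form vllm serve <4-bit-checkpoint> --tensor-parallel-size 2, where <4-bit-checkpoint> is a placeholder for a quantized build in a format vLLM supports, such as AWQ or GPTQ. The base meta-llama/Llama-3.1-70B repository holds 16-bit weights (roughly 140 GB) and will not fit in 48 GB. On a single node, recent vLLM releases default to Python multiprocessing for tensor parallelism, so add --distributed-executor-backend ray to run the workers under Ray. The result is roughly 20 tok/s instead of an out-of-memory failure on one GPU, though actual throughput depends on the quantization format, context length, and PCIe bandwidth, and with about 20 GB of weights per card the memory left for the KV cache is tight.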
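The fit check behind this example is simple arithmetic, sketched below. It is a rough estimate only: the helper names are hypothetical, 1 GB is taken as 10^9 bytes, weights are assumed to split evenly across GPUs, and KV cache, activations, and framework overhead are deliberately left out (they must fit in whatever headroom remains).

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


def headroom_per_gpu_gb(total_weights_gb: float, gpu_memory_gb: float,
                        tensor_parallel_size: int) -> float:
    """Memory left on each GPU after an even split of the weights.

    A negative value means the weights alone do not fit.
    """
    return gpu_memory_gb - total_weights_gb / tensor_parallel_size


if __name__ == "__main__":
    # ~40 GB is the 4-bit figure used in the example; 16-bit is computed from 70B params.
    for label, total in [("4-bit (~40 GB)", 40.0), ("16-bit", weights_gb(70, 16))]:
        for gpus in (1, 2):
            left = headroom_per_gpu_gb(total, 24.0, gpus)
            print(f"{label}, {gpus} x 24 GB GPU: {left:+.1f} GB headroom per GPU")
```

With the 4-bit weights, one GPU is 16 GB short while two GPUs leave about 4 GB each. The 16-bit weights do not fit on two 24 GB cards at all.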

Workflow example

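When setting up a multi-node, multi-GPU inference server, an operator installs Ray on every node with pip install "ray[serve]" (the quotes keep shells such as zsh from expanding the brackets), then starts the cluster with ray start --head on the main node and ray start --address=<head-ip>:6379 on each worker node. With the cluster up, the operator launches vLLM on the head node with --distributed-executor-backend ray and a --tensor-parallel-size (plus, across nodes, a --pipeline-parallel-size) that matches the available GPUs. The --ray-workers-use-gpu flag named in older guides may not be accepted by current vLLM releases, so check vllm serve --help for the options your version supports. Ray's dashboard (http://localhost:8265 on the head node by default) shows task status, GPU utilization, and object store usage, helping operators debug scaling issues.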
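Before launching a model, it can help to confirm that every worker actually joined the cluster. The sketch below assumes a cluster already started with ray start as described above. The helper names are hypothetical, while ray.init(address="auto") and ray.cluster_resources() are standard Ray calls. Ray is imported lazily so the pure helpers can be used and tested without it.

```python
def gpu_count(resources: dict) -> int:
    """Number of GPUs in a Ray resource dictionary (0 if the key is absent)."""
    return int(resources.get("GPU", 0))


def cluster_has_enough_gpus(resources: dict, tensor_parallel_size: int) -> bool:
    """True if the cluster reports at least one GPU per tensor-parallel shard."""
    return gpu_count(resources) >= tensor_parallel_size


def check_running_cluster(tensor_parallel_size: int) -> bool:
    """Attach to an existing Ray cluster and check its total GPU count."""
    import ray  # lazy import: only needed when talking to a live cluster

    # "auto" attaches to the cluster started by `ray start` on this machine;
    # it raises ConnectionError if no cluster is running.
    ray.init(address="auto")
    try:
        # cluster_resources() returns totals such as {"CPU": 32.0, "GPU": 2.0, ...}.
        return cluster_has_enough_gpus(ray.cluster_resources(), tensor_parallel_size)
    finally:
        # Disconnects this script only; the cluster itself keeps running.
        ray.shutdown()
```

If check_running_cluster(2) returns False on a two-GPU setup, a worker node likely failed to join. The dashboard's cluster view is the next place to look.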

Reviewed by Fredoline Eruo. See our editorial policy.
