Ray
Ray is an open-source distributed computing framework for scaling AI workloads across multiple machines. Operators encounter Ray when they need to run inference or training across a cluster of GPUs or CPUs, where it handles task scheduling, data parallelism, and fault tolerance. Its runtime hides much of the complexity of distributing work, so the same Python code can scale from a single machine to hundreds of nodes. It is commonly used with vLLM to serve large language models across multiple GPUs or nodes, and with Ray Serve to deploy models as microservices.
Deeper dive
Ray consists of a core distributed runtime and higher-level libraries such as Ray Serve (model serving), Ray Train (distributed training), and Ray Data (data processing). The runtime keeps cluster metadata in the Global Control Service (GCS, called the Global Control Store in earlier versions) and uses a distributed scheduler to place tasks on available nodes. Operators interact with Ray through Python decorators like @ray.remote, which mark functions (tasks) or classes (actors) that can run remotely. Ray shares objects between workers through a shared-memory object store on each node, and it can retry failed tasks and, when configured, restart failed actors. For local AI, Ray is most often met through vLLM, which can use it as an executor backend to spread a model across multiple GPUs or nodes, enabling larger models or higher throughput than a single GPU can provide. Ray Train can also run distributed training of Hugging Face Transformers models. For single-machine inference, simpler tools like Ollama or LM Studio are more common.
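The task and actor model described above can be sketched in a few lines. This is a minimal illustration, not a production pattern: square, Counter, and run_demo are hypothetical names chosen for the example, and the plain definitions are wrapped with ray.remote(...) inside run_demo (the call form of the @ray.remote decorator) so they stay importable and testable on a machine without Ray.

```python
def square(x: int) -> int:
    """Plain Python function; it becomes a Ray task once wrapped below."""
    return x * x


class Counter:
    """Plain Python class; wrapping it with ray.remote turns it into an actor."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, amount: int) -> int:
        self.total += amount
        return self.total


def run_demo() -> None:
    """Run the task and actor on Ray (requires `pip install ray`)."""
    import ray  # imported lazily so the definitions above load without Ray

    # Starts a local Ray runtime when no cluster address is configured.
    ray.init()
    try:
        # ray.remote(obj) is equivalent to decorating obj with @ray.remote.
        remote_square = ray.remote(square)
        remote_counter_cls = ray.remote(Counter)

        # .remote() schedules a task and immediately returns an ObjectRef.
        refs = [remote_square.remote(i) for i in range(4)]
        squares = ray.get(refs)  # blocks until every task has finished
        print("squares:", squares)

        # An actor is a stateful worker; its method calls also return ObjectRefs.
        counter = remote_counter_cls.remote()
        for value in squares:
            counter.add.remote(value)
        print("running total:", ray.get(counter.add.remote(0)))
    finally:
        ray.shutdown()
```

Calling run_demo() on a machine with Ray installed starts a local runtime, runs four tasks in parallel, and accumulates their results in one actor. In ordinary Ray code the decorator form (@ray.remote directly above the function or class) is more common; the wrapping form is used here only to keep the plain definitions usable on their own.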
Practical example
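An operator with two RTX 4090 GPUs (24 GB each) wants to serve Llama 3.1 70B quantized to 4 bits (~40 GB of weights). A single GPU cannot fit the model, but vLLM's tensor parallelism shards each layer's weight matrices across both GPUs, so each card holds roughly half of the weights. A command of the form vllm serve <4-bit Llama 3.1 70B checkpoint> --tensor-parallel-size 2 does this. Pointing the same command at the unquantized meta-llama/Llama-3.1-70B repository would not work on this hardware, because the 16-bit weights take roughly 140 GB. On a single node, recent vLLM versions default to Python multiprocessing for tensor parallelism; Ray is used when --distributed-executor-backend ray is passed or when the deployment spans several nodes. With about 48 GB in total, little memory remains for the KV cache, so context length and batch size must stay modest. The result is a working deployment, on the order of ~20 tok/s depending on quantization format, context length, and batch size, instead of an out-of-memory failure on one GPU.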
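The memory arithmetic behind this example can be checked with a short calculation. The sketch below rests on stated assumptions rather than measurements: 70e9 parameters, an average of 4.5 bits per weight (4-bit formats carry some overhead for scales), 24 GB of usable memory per GPU, and an even split of weights under tensor parallelism. The function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def per_gpu_headroom_gb(
    num_params: float,
    bits_per_weight: float,
    gpu_memory_gb: float,
    tensor_parallel_size: int,
) -> float:
    """Memory left on each GPU after the weights are sharded evenly."""
    shard_gb = weight_memory_gb(num_params, bits_per_weight) / tensor_parallel_size
    return gpu_memory_gb - shard_gb


# Assumed values for the example in the text, not measured figures.
NUM_PARAMS = 70e9
BITS_PER_WEIGHT = 4.5
GPU_MEMORY_GB = 24.0

print(f"weights: {weight_memory_gb(NUM_PARAMS, BITS_PER_WEIGHT):.1f} GB")
for tp in (1, 2):
    headroom = per_gpu_headroom_gb(NUM_PARAMS, BITS_PER_WEIGHT, GPU_MEMORY_GB, tp)
    verdict = "weights fit" if headroom > 0 else "weights do not fit"
    print(f"tensor_parallel_size={tp}: {headroom:+.1f} GB per GPU ({verdict})")
# weights: 39.4 GB
# tensor_parallel_size=1: -15.4 GB per GPU (weights do not fit)
# tensor_parallel_size=2: +4.3 GB per GPU (weights fit)
```

Positive headroom is necessary but not sufficient: the remaining few gigabytes per GPU must also hold the KV cache, activations, and runtime overhead, which is why context length and batch size stay constrained on this hardware.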
Workflow example
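When setting up a multi-GPU inference server, an operator installs Ray with pip install "ray[serve]" (the quotes stop shells such as zsh from interpreting the brackets) and starts a Ray cluster with ray start --head on the main node and ray start --address=<head-ip>:6379 on each worker node. The operator then launches vLLM on the head node with --distributed-executor-backend ray and a --tensor-parallel-size matching the available GPUs to deploy a model across the cluster. Flag names have changed between vLLM releases, so confirm them against vllm serve --help for the installed version. Ray's dashboard (served by the head node, by default at http://localhost:8265) shows task status, GPU utilization, and object store usage, helping operators debug scaling issues.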
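Before launching a model server it can help to confirm that every worker actually joined the cluster. The following is a minimal sketch of such a check, assuming a cluster already started with ray start as described above; has_enough_gpus and check_cluster are hypothetical helper names, while ray.init, ray.cluster_resources, and ray.nodes are part of Ray's public API.

```python
def has_enough_gpus(resources: dict, required_gpus: int) -> bool:
    """True if the cluster-wide resource totals include enough GPUs."""
    return resources.get("GPU", 0) >= required_gpus


def check_cluster(required_gpus: int) -> bool:
    """Attach to a running Ray cluster and report whether it has enough GPUs."""
    import ray  # imported lazily so has_enough_gpus is usable without Ray

    # address="auto" attaches to the cluster started by `ray start` on this
    # machine; it raises an error if no running cluster is found.
    ray.init(address="auto")
    try:
        # Totals across all nodes, e.g. {"CPU": 32.0, "GPU": 2.0, ...}.
        resources = ray.cluster_resources()
        alive_nodes = [node for node in ray.nodes() if node["Alive"]]
        print(f"{len(alive_nodes)} node(s) alive")
        print(f"GPUs visible to Ray: {resources.get('GPU', 0)}")
        return has_enough_gpus(resources, required_gpus)
    finally:
        ray.shutdown()
```

Running check_cluster(2) on the head node returns False when a worker has not connected or when Ray does not detect its GPUs, which is cheaper to discover here than through a failed model load. The same totals appear in the dashboard's cluster view.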
Reviewed by Fredoline Eruo.