Cloud GPU Fallback — Hardware Planning for Local AI (Chapter 12)

Cloud GPU instances provide elastic capacity for workloads that exceed local hardware. Understanding when and how to use cloud resources completes a hardware strategy.

Cloud GPU Pricing

Provider	GPU	VRAM	Price/hr	Price/hr (spot)
Vast.ai	RTX 3080	10GB	$0.20-0.35	$0.10-0.20
Vast.ai	RTX 3090	24GB	$0.40-0.60	$0.20-0.35
Vast.ai	A100 40GB	40GB	$1.50-2.50	$0.80-1.20
Lambda Labs	A100 80GB	80GB	$3.00+	$1.50+
AWS p4d	A100 40GB	40GB	$3.67	N/A

Prices vary by region and demand. Vast.ai typically offers best cost efficiency for short-term needs.

Break-Even Calculation

When does buying vs. renting make sense?

Monthly GPU cost (purchase): ($1500 GPU / 36 months) + ($50 electricity) = $92/month
Equivalent cloud usage: $92 / $0.30/hr = 307 hours/month = 12.8 hrs/day

At typical usage patterns, purchasing makes sense above 8 hours/day of active inference.

Cloud Instance Selection

For 70B model fine-tuning:

1x A100 80GB: Required for 70B QLoRA
8x A100 40GB: Required for 70B full fine-tuning
Spot instances: 40-60% savings with interruption risk

For inference only (70B):

1x A100 40GB at INT4: Handles 70B inference
2x RTX 3090 (parallel): Alternative at lower cost

SSH Access Pattern

# Connect to cloud instance
ssh user@instance-ip
# Port 22 or custom SSH port

# Download model
huggingface-cli download meta-llama/Meta-Llama-3-70b-Instruct
# Requires HuggingFace token with access

# Run inference
python3 -m llama_cpp_server --model models/llama-3-70b.gguf --host 0.0.0.0 --port 8080

Security Considerations

Cloud instances = external attack surface:

Use SSH key authentication, disable password login
Configure firewall to allow only essential ports
Encrypt model storage at rest
Consider VPN tunnel to instance
Terminate instances after use to avoid charges