macOS AI Workflows — Local AI on macOS (Chapter 15)

This chapter combines the tools from previous chapters into end-to-end workflows, each built around a concrete use case.

Workflow 1: Development API with MLX acceleration

Purpose: Serve a local model as an API for a development project, maximizing throughput on Apple Silicon.

# 1. Start Ollama on the host
ollama serve &

# 2. Use LM Studio or a Python server for different model formats
# MLX model via Python server
pip install mlx-lm fastapi uvicorn
python3 << 'EOF'
from fastapi import FastAPI
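# Alias generate so the endpoint function below does not shadow it
from mlx_lm import load, generate as mlx_generate

app = FastAPI()
model, tokenizer = None, None

@app.on_event("startup")
async def startup():
    # Load the weights once at startup instead of on every request
    global model, tokenizer
    model, tokenizer = load("mlx-community/Qwen2.5-3B-Instruct-4bit")

@app.post("/generate")
async def generate(req: dict):
    # mlx_lm exposes generation as a module-level function, not a model method
    response = mlx_generate(
        model,
        tokenizer,
        prompt=req["prompt"],
        max_tokens=req.get("max_tokens", 256)
    )
    return {"response": response}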

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
EOF
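
To exercise the server above from a development project, a small client is enough. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at http://localhost:8000; the names BASE_URL, build_generate_request, and generate are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: the FastAPI server above is listening on localhost:8000.
BASE_URL = "http://localhost:8000"


def build_generate_request(prompt, max_tokens=256, base_url=BASE_URL):
    """Build a POST request for the /generate endpoint defined above."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_tokens=256, timeout=120):
    """Send the prompt to the server and return the generated text."""
    request = build_generate_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Explain unified memory in one sentence.", max_tokens=64) returns the model's reply as a string. The generous timeout is deliberate: the first request can be slow while the model warms up.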

Workflow 2: Team model serving with Open WebUI

Purpose: Give a team a self-hosted web interface to multiple models.

# 1. Start Ollama (host)
ollama serve

# 2. In another terminal, start Open WebUI (Python)
open-webui serve

# 3. Team members access http://<host-ip>:8080 (localhost works only on the host itself)
# Multiple users can chat simultaneously with different models
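
Before pointing a team at the interface, it can help to confirm which models the Ollama server actually has available. The sketch below queries Ollama's /api/tags endpoint on its default port; the helper names parse_model_names and list_local_models are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port


def parse_model_names(tags_payload):
    """Extract model names from the JSON body returned by GET /api/tags."""
    return sorted(m["name"] for m in tags_payload.get("models", []))


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Ask the running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.loads(resp.read()))
```

Calling list_local_models() while ollama serve is running returns a sorted list of model names. An empty list means no models have been pulled yet.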

Workflow 3: Batch processing with CLI scripts

Purpose: Run inference over a dataset without a web interface.

#!/bin/bash
# batch_inference.sh
MODEL="llama3.2:3b"
INPUT_FILE="prompts.txt"
OUTPUT_FILE="results.txt"

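# The "|| [ -n ... ]" clause also processes a final line that lacks a trailing newline
while IFS= read -r prompt || [ -n "$prompt" ]; do
  # Build the JSON body with jq so quotes and backslashes in the prompt are escaped
  payload=$(jq -n --arg model "$MODEL" --arg prompt "$prompt" \
    '{model: $model, prompt: $prompt, stream: false}')
  result=$(curl -s -X POST http://localhost:11434/api/generate \
    -d "$payload" \
    | jq -r '.response')
  echo "$result" >> "$OUTPUT_FILE"
  echo "Processed: ${prompt:0:50}..." >&2
done < "$INPUT_FILE"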

# Run the batch script
chmod +x batch_inference.sh
./batch_inference.sh
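
The same loop can be written in Python when the results need to stay paired with their prompts. This is a sketch rather than a drop-in replacement: it assumes the same Ollama endpoint and model as the shell script, and the function names (ollama_generate, run_batch, write_jsonl) and the JSON Lines output format are choices made here for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model, prompt, timeout=300):
    """POST one prompt to Ollama's /api/generate and return the response text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def run_batch(prompts, model="llama3.2:3b", generate=ollama_generate):
    """Return one {"prompt", "response"} record per non-empty prompt."""
    records = []
    for prompt in prompts:
        prompt = prompt.strip()
        if not prompt:
            continue  # skip blank lines in the input
        records.append({"prompt": prompt, "response": generate(model, prompt)})
    return records


def write_jsonl(records, path):
    """Write one JSON object per line so multi-line responses stay intact."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A typical run opens prompts.txt, passes the file object to run_batch, and hands the result to write_jsonl. Passing generate as a parameter makes it possible to dry-run the batch logic with a stub before any model is loaded.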

Workflow 4: Claude Code agent with local model fallback

Purpose: Have a coding assistant use a local model as a fallback when cloud models are unavailable.

Configure in your AI tool's settings. The key names below are illustrative; the actual schema depends on the tool:

{
  "model": "claude-3-5-sonnet",
  "fallback_model": "ollama/llama3.2:3b",
  "ollama_endpoint": "http://localhost:11434"
}

Tools that support fallbacks follow this pattern: you define the local endpoint as a fallback, and the tool routes requests to it when the primary endpoint is unreachable.
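
For a tool that has no built-in fallback setting, the routing logic itself is short. The sketch below shows one way to express it, assuming two hypothetical callables, primary and fallback, that each take a prompt and return text. It is an illustration of the pattern, not how any particular tool implements it.

```python
import urllib.error


def generate_with_fallback(prompt, primary, fallback):
    """Try the primary backend; route to the fallback if it is unreachable.

    `primary` and `fallback` are hypothetical callables that take a prompt
    and return text. Only connectivity failures trigger the fallback.
    """
    try:
        text = primary(prompt)
    except urllib.error.HTTPError:
        # The server answered with an HTTP error, so it is reachable:
        # surface the error instead of silently switching models.
        raise
    except (urllib.error.URLError, ConnectionError, TimeoutError):
        # No answer at all: treat the primary endpoint as unreachable.
        return {"backend": "fallback", "text": fallback(prompt)}
    return {"backend": "primary", "text": text}
```

Returning the backend name alongside the text lets the caller log or display which model produced the answer, which matters when the local model is noticeably weaker than the cloud one.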