Depth Estimation

Depth estimation is a computer vision task that predicts a depth value for each pixel in an image. The result is a depth map, usually rendered in grayscale, in which closer objects appear brighter or darker depending on the convention. In local AI, operators mostly encounter it through monocular depth models such as MiDaS or Depth Anything, which take a single RGB image and output a depth map. These models are small enough to run on consumer GPUs (Depth Anything V2 Small has about 25M parameters, roughly 100 MB of FP32 weights) and are used for 3D reconstruction, AR effects, or as preprocessing for other models. Inference speed depends on input resolution and model size; a 518×518 image on an RTX 3060 takes roughly 30-50 ms per frame.
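The brighter-or-darker convention comes down to how raw per-pixel values are mapped to gray levels. The following is a minimal sketch, assuming the input is a 2-D NumPy array in which larger values mean closer (as with inverse-depth output); the function name depth_to_uint8 and its near_is_bright flag are illustrative, not part of any library.

```python
import numpy as np

def depth_to_uint8(depth, near_is_bright=True):
    """Scale a 2-D array of per-pixel values to 0..255 for display.

    Assumes larger input values mean closer. Pass near_is_bright=False
    to flip the convention so that closer pixels come out darker.
    """
    d = np.asarray(depth, dtype=np.float64)
    lo, hi = d.min(), d.max()
    if hi == lo:
        # Flat input: avoid dividing by zero.
        norm = np.zeros_like(d)
    else:
        norm = (d - lo) / (hi - lo)
    if not near_is_bright:
        norm = 1.0 - norm
    return np.round(norm * 255).astype(np.uint8)

# Tiny hypothetical inverse-depth map (larger = closer).
inv_depth = np.array([[0.5, 1.0], [2.0, 4.0]])
gray = depth_to_uint8(inv_depth)
gray_inverted = depth_to_uint8(inv_depth, near_is_bright=False)
```

Min-max scaling is per image, so gray levels from two different frames are not directly comparable.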

Deeper dive

Depth estimation models are typically convolutional or transformer-based networks trained on large datasets of RGB-D images. Monocular depth estimation (from a single image) is an ill-posed problem, so models learn statistical cues such as perspective, texture gradients, and object size. Two widely used families are MiDaS (trained on a mix of multiple datasets) and Depth Anything (trained on large-scale pseudo-labeled real images, with V2 adding a teacher trained on synthetic data). Both output relative inverse depth by default, meaning disparity up to an unknown scale and shift. Recovering metric depth requires either a metric fine-tuned checkpoint or reference measurements that fix the scale and shift; camera intrinsics are then needed to back-project the depth into 3D. Operators can run these models via Hugging Face Transformers or ONNX Runtime. For real-time applications, smaller variants (Depth Anything V2 Small, about 25M parameters) achieve ~30 FPS on an RTX 3060, while larger variants (Depth Anything V2 Large, about 335M parameters, roughly 1.3 GB of FP32 weights) provide better accuracy at ~10 FPS. Depth maps are often used as input for 3D point cloud generation or as a conditioning signal for image-to-3D models.
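One common way to turn relative inverse depth into metric depth is to fit a scale and a shift against a few pixels whose true distance is known. The sketch below assumes such reference distances exist (from a tape measure, rangefinder, or LiDAR); the function names and the sample values are illustrative and do not come from MiDaS or Depth Anything.

```python
import numpy as np

def fit_scale_shift(rel_inv_depth, ref_depth_m):
    """Least-squares fit of s, t so that 1 / ref_depth ~= s * rel + t."""
    d = np.asarray(rel_inv_depth, dtype=np.float64).ravel()
    target = 1.0 / np.asarray(ref_depth_m, dtype=np.float64).ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)   # columns: [rel, 1]
    (s, t), *_ = np.linalg.lstsq(A, target, rcond=None)
    return s, t

def to_metric_depth(rel_inv_depth, s, t, eps=1e-6):
    """Apply the fitted scale/shift, then invert to get depth in metres."""
    inv = s * np.asarray(rel_inv_depth, dtype=np.float64) + t
    return 1.0 / np.maximum(inv, eps)            # clamp to avoid divide-by-zero

# Hypothetical example: four pixels with known distances in metres.
rel = np.array([1.0, 2.0, 4.0, 8.0])             # model output at those pixels
true_s, true_t = 0.1, 0.05                       # made-up ground truth
ref = 1.0 / (true_s * rel + true_t)              # the "measured" distances

s, t = fit_scale_shift(rel, ref)
metric = to_metric_depth(rel, s, t)
```

With noisy reference points, a robust fit (for example a median-based estimate) may work better than plain least squares.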

Practical example

A rig with an RTX 3060 (12 GB VRAM) runs Depth Anything V2 Small (about 25M parameters) on a 518×518 image in ~30 ms, producing a 518×518 depth map. The same model on an Apple M1 Max (32 GB unified memory) via MLX takes ~40 ms. For higher accuracy, Depth Anything V2 Large (about 335M parameters) takes ~100 ms on the RTX 3060. VRAM usage for the Small model is minimal (<1 GB at batch size 1).
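Per-frame latency and frame rate are two views of the same number: 30 ms per frame is about 33 FPS, and 100 ms is 10 FPS. To check figures like these on your own hardware, a small timing harness is enough. This sketch uses only the standard library; benchmark and latency_to_fps are illustrative names, and fake_inference stands in for a real model call.

```python
import statistics
import time

def latency_to_fps(ms_per_frame):
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

def benchmark(fn, warmup=3, iters=20):
    """Return (median latency in ms, FPS) for a zero-argument callable.

    On a GPU, fn should block until the result is ready (for example by
    calling torch.cuda.synchronize() inside it); otherwise only the
    kernel launch is timed, not the actual work.
    """
    for _ in range(warmup):
        fn()                                  # warm caches, lazy init, etc.
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    median_ms = statistics.median(samples_ms)
    return median_ms, latency_to_fps(median_ms)

def fake_inference():
    # Placeholder workload; replace with a real model forward pass.
    time.sleep(0.001)

median_ms, fps = benchmark(fake_inference, warmup=1, iters=5)
```

The median is used rather than the mean so that a single slow frame does not skew the result.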

Workflow example

In a local AI pipeline, an operator might run a command such as python run_depth.py --model depth_anything_v2_small --input image.jpg, where run_depth.py is a script built on Hugging Face Transformers. The output depth map can be saved as a PNG and fed into a 3D reconstruction tool such as Open3D to generate a point cloud. LM Studio does not natively support depth estimation models, so operators instead run them in a separate Python environment using the transformers library.
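The core of such a script might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: the mapping from the short model name to a Hugging Face Hub id, the function names, and the output file names are all assumptions, and the back-projection expects depth (not relative inverse depth) plus pinhole intrinsics fx, fy, cx, cy that you supply yourself.

```python
import numpy as np

# Assumed mapping from the command-line name to a Hub checkpoint id.
MODEL_IDS = {
    "depth_anything_v2_small": "depth-anything/Depth-Anything-V2-Small-hf",
}

def estimate_depth(image_path, model_name="depth_anything_v2_small",
                   png_path="depth.png"):
    """Run the Transformers depth-estimation pipeline on one image."""
    # Imported lazily so the geometry helper below works without them.
    from PIL import Image
    from transformers import pipeline

    pipe = pipeline(task="depth-estimation", model=MODEL_IDS[model_name])
    result = pipe(Image.open(image_path).convert("RGB"))
    result["depth"].save(png_path)            # PIL image for viewing
    # Raw prediction as a 2-D float array (relative inverse depth).
    return result["predicted_depth"].squeeze().detach().cpu().numpy()

def backproject(depth, fx, fy, cx, cy):
    """Pinhole back-projection: (H, W) depth -> (H*W, 3) XYZ points."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def save_point_cloud(points, path="cloud.ply"):
    """Write an (N, 3) array as a point cloud using Open3D."""
    import open3d as o3d

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    o3d.io.write_point_cloud(path, pcd)

# Example wiring (commented out because it downloads a model):
# rel = estimate_depth("image.jpg")
# depth = ...  # convert relative inverse depth to depth first
# save_point_cloud(backproject(depth, fx=500.0, fy=500.0, cx=259.0, cy=259.0))
```

Feeding relative inverse depth straight into backproject would distort the geometry, so convert it to depth first, or use a metric checkpoint.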

Reviewed by Fredoline Eruo. See our editorial policy.
