Monocular Depth Estimation, which involves estimating depth from a single image, holds tremendous potential. It can add a third dimension to any image—regardless of when or how it was captured—without requiring specialized hardware or additional data. In recent years, zero-shot monocular depth estimation has become the foundation for a range of applications, including advanced image editing, view synthesis, and conditional image generation.
In a new paper Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, an Apple research team introduces Depth Pro, a state-of-the-art foundation model designed for zero-shot metric monocular depth estimation. This model can generate high-resolution depth maps with exceptional clarity and fine detail, producing a 2.25-megapixel depth map in just 0.3 seconds on a standard GPU.
Depth Pro’s architecture hinges on the use of plain Vision Transformer (ViT) encoders, based on the work of Dosovitskiy et al. (2021), which process patches of the image at multiple scales. These patch predictions are then merged into a single, high-resolution depth map within an end-to-end trainable framework. To predict depth effectively, Depth Pro employs two ViT encoders: one for the patches and another for the entire image. The patch encoder processes image segments across various scales, while the image encoder uses the full, downsampled image (in this case, 384×384 pixels) to ground the patch-based predictions within a global context.
The model always operates at a fixed resolution of 1536×1536, a multiple of the 384×384 ViT input resolution. This guarantees a sufficiently large receptive field and predictable runtimes for images of any size, while keeping memory consumption bounded. A further advantage of Depth Pro's architecture is its reliance on standard ViT encoders, which allows it to leverage a wide array of pretrained ViT-based backbones, improving both efficiency and performance.
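The fixed-resolution, two-encoder design described above can be sketched in a few lines. The following is a simplified illustration, not Apple's implementation: a 1536×1536 input is tiled into 384×384 patches at several downsampled scales for the patch encoder, while the whole image is downsampled to 384×384 for the image encoder. The function names, the naive average-pool downsampling, and the non-overlapping tiling are all assumptions made for clarity (the paper's actual patching uses overlapping tiles).

```python
import numpy as np

PATCH = 384   # ViT input resolution
FULL = 1536   # fixed network resolution (4 * PATCH)

def downsample(img: np.ndarray, size: int) -> np.ndarray:
    """Naive average-pool downsampling to size x size (illustrative only)."""
    h, w = img.shape[:2]
    fy, fx = h // size, w // size
    return img[:size * fy, :size * fx].reshape(size, fy, size, fx).mean(axis=(1, 3))

def split_into_patches(img: np.ndarray, patch: int = PATCH) -> list:
    """Tile an image into non-overlapping patch x patch tiles."""
    h, w = img.shape[:2]
    return [img[y:y + patch, x:x + patch]
            for y in range(0, h, patch)
            for x in range(0, w, patch)]

# A dummy 1536x1536 grayscale image standing in for the network input.
image = np.random.rand(FULL, FULL)

# Patch encoder inputs: 384x384 tiles taken at multiple scales.
multi_scale_patches = []
for scale in (FULL, FULL // 2, PATCH):
    multi_scale_patches.extend(split_into_patches(downsample(image, scale)))

# Image encoder input: the full image downsampled to the ViT resolution,
# providing the global context that anchors the patch predictions.
global_view = downsample(image, PATCH)

print(len(multi_scale_patches))  # 16 + 4 + 1 = 21 tiles
print(global_view.shape)         # (384, 384)
```

In the real model, each tile and the global view are encoded by the shared ViT backbones and the resulting features are fused by a decoder into one high-resolution depth map; the sketch only shows how the fixed input resolution makes the tiling arithmetic exact.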
Depth Pro’s success is rooted in several key innovations:
Multi-Scale ViT-Based Architecture: The model efficiently captures global context while preserving fine image details at high resolution.
Novel Metrics for Boundary Accuracy: The team developed new evaluation metrics based on highly accurate matting datasets, which better assess the precision of boundary tracing in monocular depth maps.
Enhanced Loss Functions and Training Approach: A combination of custom loss functions and a specialized training regimen ensures sharp depth estimates. This curriculum balances training on real-world datasets that have coarse, imprecise supervision near boundaries, and synthetic datasets that provide pixel-accurate ground truth with less realism.
Zero-Shot Focal Length Estimation: Depth Pro also estimates the focal length directly from a single image, without relying on camera metadata, substantially outperforming prior techniques.
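Metric depth combined with an estimated focal length is what makes single-image 3D reconstruction possible: the standard pinhole camera model unprojects every pixel to a metric 3D point. The sketch below is a generic illustration of that geometry, not Depth Pro's code; `f_px` stands in for the focal length (in pixels) that the model's focal-length head would predict, and the principal point is assumed to sit at the image center.

```python
import numpy as np

def unproject(depth: np.ndarray, f_px: float) -> np.ndarray:
    """Unproject a metric depth map to an (H, W, 3) point cloud using the
    pinhole camera model with the principal point at the image center."""
    h, w = depth.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / f_px   # X = (u - cx) * Z / f
    y = (v - cy) * depth / f_px   # Y = (v - cy) * Z / f
    return np.stack([x, y, depth], axis=-1)

# Toy example: a flat wall 2 m away, seen with a 1000-pixel focal length.
depth = np.full((480, 640), 2.0)
points = unproject(depth, f_px=1000.0)
print(points.shape)      # (480, 640, 3)
print(points[240, 320])  # center pixel -> roughly [0, 0, 2]
```

An accurate focal length matters here because any error in `f_px` rescales the X and Y coordinates of every reconstructed point.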
In summary, Depth Pro pushes the boundaries of monocular depth estimation, offering unparalleled speed, precision, and depth map quality without the need for additional input or specialized hardware.
The paper Depth Pro: Sharp Monocular Metric Depth in Less Than a Second is on arXiv.
Author: Hecate He | Editor: Chain Zhang
The post Instant 3D Vision: Apple’s Depth Pro Delivers High-Precision Depth Maps in 0.3 Seconds first appeared on Synced.