metadata

license: mit

DPT 3.1 (BEiT backbone)

This DPT (Dense Prediction Transformer) model was trained on 1.4 million images for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in this repository. DPT uses the BEiT model as its backbone and adds a neck and a head on top for monocular depth estimation.

Disclaimer: The team releasing DPT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

DPT is an encoder-decoder architecture for dense prediction. In this checkpoint, a BEiT backbone encodes the image, a neck turns the backbone's features into image-like feature maps, and a head predicts a depth value for each pixel.

How to use

Here is how to use this model for zero-shot depth estimation on an image:

from transformers import DPTImageProcessor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# load the BEiT-backbone checkpoint, the same one used in the pipeline example
processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-base-384")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-base-384")

# prepare image for the model
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
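
The visualization step above scales the depth map by its maximum only, which works when the values are non-negative and the maximum is non-zero. As a minimal sketch, assuming only relative depth matters for display, a min-max variant with a guard for constant maps could look like the following. The helper name depth_to_uint8 is hypothetical and not part of transformers.

```python
import numpy as np


def depth_to_uint8(depth):
    """Min-max scale a depth map to the 0-255 range (hypothetical helper)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-12:
        # Constant map: there is no range to scale, so return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)  # values now lie in [0, 1]
    return (scaled * 255.0).astype(np.uint8)
```

With the variables from the example above, depth_to_uint8(output) could replace the manual scaling before the call to Image.fromarray.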

Alternatively, use the pipeline API:

from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-beit-base-384")
result = pipe("http://images.cocodataset.org/val2017/000000039769.jpg")
depth = result["depth"]  # PIL image of the predicted depth map
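
To sanity-check a pipeline result, it can help to look at the size and value range of the returned depth image. The sketch below assumes the result is a dict whose "depth" entry is a PIL image (or any array-like of per-pixel values). The helper name summarize_depth is hypothetical.

```python
import numpy as np


def summarize_depth(result):
    """Report the size and value range of a depth result (hypothetical helper).

    Assumes result["depth"] is a single-channel PIL image or any 2-D
    array-like of per-pixel depth values.
    """
    depth = np.asarray(result["depth"], dtype=np.float32)
    return {
        "height": int(depth.shape[0]),
        "width": int(depth.shape[1]),
        "min": float(depth.min()),
        "max": float(depth.max()),
    }
```

Calling summarize_depth(result) on the pipeline output above would return the height and width of the depth image together with its minimum and maximum pixel values.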