[Paper Review] Digging into the Code: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (24.09)


도도걸만단 2025. 1. 12. 21:45

The paper reviewed today is the following.

1. Paper review (see the previous post), 2. Code review

Skip to the section you need!

Apple

arXiv:2410.02073v1 [cs.CV] 2 Oct 2024

Link: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second


https://github.com/apple/ml-depth-pro


Detailed review of the paper:

https://minsunstudio.tistory.com/61


2. Code review

Let's dig through the code piece by piece, for beginners and anyone else using this post as a reference.

First, the structure.

depth_pro/cli/__init__.py

# Copyright (C) 2024 Apple Inc. All Rights Reserved.
"""Depth Pro CLI and tools."""

from .run import main as run_main  # noqa

Imports the main function from run.py in the current package.

The main function is bound to the name run_main, so it can be called as run_main().

depth_pro/cli/run.py: runs depth estimation, saves the result, and visualizes it if requested

import argparse
import logging
from pathlib import Path

import numpy as np
import PIL.Image
import torch
from matplotlib import pyplot as plt
from tqdm import tqdm

from depth_pro import create_model_and_transforms, load_rgb

LOGGER = logging.getLogger(__name__)


def get_torch_device() -> torch.device:
    """Get the Torch device."""
    device = torch.device("cpu")
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    return device

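• argparse: parses command-line arguments.

• logging: records log messages for debugging.

get_torch_device() chooses the device as follows:

1. CPU is the default.

2. If a GPU (CUDA) is available, cuda:0 is used.

3. Otherwise, if Apple Silicon MPS (Metal Performance Shaders) is available, it is used.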

def run(args):
    """Run Depth Pro on a sample image."""
    if args.verbose:
        logging.basicConfig(level=logging.INFO)

    # Load model.
    model, transform = create_model_and_transforms(
        device=get_torch_device(),
        precision=torch.half,
    )
    model.eval()
    
    image_paths = [args.image_path]
    if args.image_path.is_dir():
        image_paths = args.image_path.glob("**/*")
        relative_path = args.image_path
    else:
        relative_path = args.image_path.parent

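• If the user passes --verbose, INFO-level log messages are printed.

• Creates the model and the image transform, in half precision (torch.half).

• eval() puts the model in inference mode.

• Checks whether the input is a directory:

If it is a directory, every path under it is visited recursively (glob("**/*")); anything that fails to load as an image is logged and skipped.
If it is a single file, only that file is processed.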
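Because glob("**/*") also yields subdirectories and non-image files, one option is to filter by extension before the loop. The following is a minimal sketch and not part of the repository: collect_image_paths and the IMAGE_SUFFIXES set are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# Hypothetical set of extensions; adjust it to the formats you actually use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".heic"}


def collect_image_paths(image_path: Path) -> tuple[list[Path], Path]:
    """Return (paths to process, base directory used for relative output paths)."""
    if image_path.is_dir():
        paths = sorted(
            p
            for p in image_path.glob("**/*")
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
        return paths, image_path
    # A single file: process only that file, relative to its parent directory.
    return [image_path], image_path.parent
```

The second return value plays the role of relative_path in run(), so the output tree can mirror the input tree.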

    for image_path in tqdm(image_paths):
        # Load image and focal length from exif info (if found.).
        try:
            LOGGER.info(f"Loading image {image_path} ...")
            image, _, f_px = load_rgb(image_path)
        except Exception as e:
            LOGGER.error(str(e))
            continue
        # Run prediction. If `f_px` is provided, it is used to estimate the final metric depth,
        # otherwise the model estimates `f_px` to compute the depth metricness.
        prediction = model.infer(transform(image), f_px=f_px)

        # Extract the depth and focal length.
        depth = prediction["depth"].detach().cpu().numpy().squeeze()
        if f_px is not None:
            LOGGER.debug(f"Focal length (from exif): {f_px:0.2f}")
        elif prediction["focallength_px"] is not None:
            focallength_px = prediction["focallength_px"].detach().cpu().item()
            LOGGER.info(f"Estimated focal length: {focallength_px}")

        inverse_depth = 1 / depth
        # Visualize inverse depth instead of depth, clipped to [0.1m;250m] range for better visualization.
        max_invdepth_vizu = min(inverse_depth.max(), 1 / 0.1)
        min_invdepth_vizu = max(1 / 250, inverse_depth.min())
        inverse_depth_normalized = (inverse_depth - min_invdepth_vizu) / (
            max_invdepth_vizu - min_invdepth_vizu
        )

        # Save Depth as npz file.
        if args.output_path is not None:
            output_file = (
                args.output_path
                / image_path.relative_to(relative_path).parent
                / image_path.stem
            )
            LOGGER.info(f"Saving depth map to: {str(output_file)}")
            output_file.parent.mkdir(parents=True, exist_ok=True)
            np.savez_compressed(output_file, depth=depth)

            # Save as color-mapped "turbo" jpg image.
            cmap = plt.get_cmap("turbo")
            color_depth = (cmap(inverse_depth_normalized)[..., :3] * 255).astype(
                np.uint8
            )
            color_map_output_file = str(output_file) + ".jpg"
            LOGGER.info(f"Saving color-mapped depth to: : {color_map_output_file}")
            PIL.Image.fromarray(color_depth).save(
                color_map_output_file, format="JPEG", quality=90
            )

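• For each image:

• Load the image (and the focal length from its EXIF data, if present).

• Run inference with the model.

• Save and visualize the result.

• Extract the depth map from the prediction.

• Invert the depth (1 / depth) for visualization.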

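The inverse depth range is limited with the following variables.

(1) max_invdepth_vizu

max_invdepth_vizu = min(inverse_depth.max(), 1 / 0.1)

The minimum depth is taken as 0.1 m, so the inverse depth is capped at 1/0.1 = 10.

(2) min_invdepth_vizu

min_invdepth_vizu = max(1 / 250, inverse_depth.min())

The maximum depth is taken as 250 m, so the inverse depth is floored at 1/250 = 0.004.

(3) inverse_depth_normalized

• Normalization:

• Min-max scales the inverse depth map with the two bounds above so that it suits visualization or saving (e.g. as an image file). Values inside the bounds land in [0, 1]; the code does not clip explicitly, so anything outside the bounds is left to the colormap lookup.

• \text{normalized\_value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}}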
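To make the arithmetic concrete, here is a small worked example in plain Python. It is only a sketch: normalize_inverse_depth is a hypothetical helper that works on a flat list of depths instead of a NumPy array, but the three formulas are the ones from run.py.

```python
def normalize_inverse_depth(depths_m, near_m=0.1, far_m=250.0):
    """Min-max normalize inverse depth with the same bounds as run.py."""
    inverse = [1.0 / d for d in depths_m]
    max_inv = min(max(inverse), 1.0 / near_m)  # upper bound, at most 10
    min_inv = max(1.0 / far_m, min(inverse))   # lower bound, at least 0.004
    return [(v - min_inv) / (max_inv - min_inv) for v in inverse]


# Depths of 0.5 m, 1 m and 2 m give inverse depths 2.0, 1.0 and 0.5.
values = normalize_inverse_depth([0.5, 1.0, 2.0])
print(values)  # the nearest pixel maps to 1.0 and the farthest to 0.0
```

Note that near pixels get large values, so in the turbo colormap nearby objects end up at the "hot" end of the scale.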

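Saving the results, then visualizing

• Saves the prediction as a .npz file.

• Maps the result through the turbo colormap and saves it as a .jpg.

• Displays the input image and the predicted depth map with Matplotlib (the display code is not included in the excerpt above).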

def main():
    """Run DepthPro inference example."""
    parser = argparse.ArgumentParser(
        description="Inference scripts of DepthPro with PyTorch models."
    )
    parser.add_argument(
        "-i", 
        "--image-path", 
        type=Path, 
        default="./data/example.jpg",
        help="Path to input image.",
    )
    parser.add_argument(
        "-o",
        "--output-path",
        type=Path,
        help="Path to store output files.",
    )
    parser.add_argument(
        "--skip-display",
        action="store_true",
        help="Skip matplotlib display.",
    )
    parser.add_argument(
        "-v", 
        "--verbose", 
        action="store_true", 
        help="Show verbose output."
    )
    
    run(parser.parse_args())


if __name__ == "__main__":
    main()

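• Role: uses the argparse module to parse the command-line arguments, then calls run().

• -i (optional, defaults to ./data/example.jpg): input image path.

• -o (optional): output path.

• --skip-display: skips the Matplotlib display.

• -v: prints more detailed log messages.

run(parser.parse_args()): passes the parsed arguments to the run function.

• Role of the if statement: when this file is executed directly, main() is called.

• When the file is imported from another module, main() is not executed.

For example, given the command line python script.py -i ./data/example.jpg -o ./output --verbose, the parsed values are conceptually:

Image Path: ./data/example.jpg
Output Path: ./output
Verbose mode is enabled.

These values are what gets passed to run(). run.py itself does not print these lines; they only illustrate what was parsed.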
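To see what parse_args() actually hands over, the options can be rebuilt in a few lines and fed an explicit argument list. This is a sketch for experimentation only; build_parser is a hypothetical helper, while the four options are copied from main() above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the same four options that main() in run.py defines."""
    parser = argparse.ArgumentParser(description="Sketch of the Depth Pro CLI options.")
    parser.add_argument("-i", "--image-path", type=Path, default="./data/example.jpg")
    parser.add_argument("-o", "--output-path", type=Path)
    parser.add_argument("--skip-display", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Pass an explicit list instead of reading sys.argv, so the result is easy to inspect.
args = build_parser().parse_args(["-i", "./data/example.jpg", "-o", "./output", "--verbose"])
print(args.image_path, args.output_path, args.skip_display, args.verbose)
```

Dashes in option names become underscores in the namespace, which is why run() reads args.image_path and args.output_path. Since -o has no default, args.output_path is None when it is omitted, and run() then skips saving.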

The depth_pro directory

src/
└── depth_pro/
    ├── __init__.py
    ├── depth_pro.py
    ├── utils.py

depth_pro/__init__.py

# Copyright (C) 2024 Apple Inc. All Rights Reserved.
"""Depth Pro package."""

from .depth_pro import create_model_and_transforms  # noqa
from .utils import load_rgb  # noqa

depth_pro/depth_pro.py :

from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Optional, Tuple, Union

import torch
from torch import nn
from torchvision.transforms import (
    Compose,
    ConvertImageDtype,
    Lambda,
    Normalize,
    ToTensor,
)

from .network.decoder import MultiresConvDecoder  # decoder
from .network.encoder import DepthProEncoder  # encoder
from .network.fov import FOVNetwork  # field-of-view (FOV) network
from .network.vit_factory import VIT_CONFIG_DICT, ViTPreset, create_vit  # Vision Transformer (ViT) factory

@dataclass
class DepthProConfig:
    """Configuration for DepthPro."""

    patch_encoder_preset: ViTPreset	
    image_encoder_preset: ViTPreset
    decoder_features: int

    checkpoint_uri: Optional[str] = None  # path to the model weights
    fov_encoder_preset: Optional[ViTPreset] = None
    use_fov_head: bool = True  # whether to use the FOV network


DEFAULT_MONODEPTH_CONFIG_DICT = DepthProConfig(
    patch_encoder_preset="dinov2l16_384",
    image_encoder_preset="dinov2l16_384",
    checkpoint_uri="./checkpoints/depth_pro.pt",
    decoder_features=256,
    use_fov_head=True,
    fov_encoder_preset="dinov2l16_384",
)

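• DEFAULT_MONODEPTH_CONFIG_DICT defines the standard default configuration of the DepthPro model.

1. patch_encoder_preset="dinov2l16_384"

• Preset name for the patch encoder.

• "dinov2l16_384" refers to a specific Vision Transformer (ViT) configuration with 384x384 image input; judging from the name, it is a DINOv2 ViT-L with 16-pixel patches.

• Its role is to extract features patch by patch.

2. image_encoder_preset="dinov2l16_384"

• Preset name for the image encoder.

• "dinov2l16_384" is the same preset as above.

• Extracts features from the image as a whole.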

def create_backbone_model(
    preset: ViTPreset
) -> Tuple[nn.Module, ViTPreset]:
    """Create and load a backbone model given a config.

    Args:
    ----
        preset: A backbone preset to load pre-defind configs.

    Returns:
    -------
        A Torch module and the associated config.

    """
    if preset in VIT_CONFIG_DICT:
        config = VIT_CONFIG_DICT[preset]
        model = create_vit(preset=preset, use_pretrained=False)
    else:
        raise KeyError(f"Preset {preset} not found.")

    return model, config

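• Creates a Vision Transformer (ViT) backbone model:

• preset: the ViT preset name.

• Looks up the configuration in VIT_CONFIG_DICT and builds the model; an unknown preset raises KeyError.

• use_pretrained=False is passed, so no pretrained backbone weights are requested at this point; the weights come from the checkpoint loaded in create_model_and_transforms.

Creating the model and the transform:

This is where the encoder, decoder, and so on are defined.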

def create_model_and_transforms(
    config: DepthProConfig = DEFAULT_MONODEPTH_CONFIG_DICT,
    device: torch.device = torch.device("cpu"),
    precision: torch.dtype = torch.float32,
) -> Tuple[DepthPro, Compose]:
    """Create a DepthPro model and load weights from `config.checkpoint_uri`.

    Args:
    ----
        config: The configuration for the DPT model architecture.
        device: The optional Torch device to load the model onto, default runs on "cpu".
        precision: The optional precision used for the model, default is FP32.

    Returns:
    -------
        The Torch DepthPro model and associated Transform.

    """
    patch_encoder, patch_encoder_config = create_backbone_model(
        preset=config.patch_encoder_preset
    )
    image_encoder, _ = create_backbone_model(
        preset=config.image_encoder_preset
    )

    fov_encoder = None
    if config.use_fov_head and config.fov_encoder_preset is not None:
        fov_encoder, _ = create_backbone_model(preset=config.fov_encoder_preset)

    dims_encoder = patch_encoder_config.encoder_feature_dims
    hook_block_ids = patch_encoder_config.encoder_feature_layer_ids
    encoder = DepthProEncoder(
        dims_encoder=dims_encoder,
        patch_encoder=patch_encoder,
        image_encoder=image_encoder,
        hook_block_ids=hook_block_ids,
        decoder_features=config.decoder_features,
    )
    decoder = MultiresConvDecoder(
        dims_encoder=[config.decoder_features] + list(encoder.dims_encoder),
        dim_decoder=config.decoder_features,
    )
    model = DepthPro(
        encoder=encoder,
        decoder=decoder,
        last_dims=(32, 1),
        use_fov_head=config.use_fov_head,
        fov_encoder=fov_encoder,
    ).to(device)

    if precision == torch.half:
        model.half()

    transform = Compose(
        [
            ToTensor(),
            Lambda(lambda x: x.to(device)),
            Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
            ConvertImageDtype(precision),
        ]
    )

    if config.checkpoint_uri is not None:
        state_dict = torch.load(config.checkpoint_uri, map_location="cpu")
        missing_keys, unexpected_keys = model.load_state_dict(
            state_dict=state_dict, strict=True
        )

        if len(unexpected_keys) != 0:
            raise KeyError(
                f"Found unexpected keys when loading monodepth: {unexpected_keys}"
            )

        # fc_norm is only for the classification head,
        # which we would not use. We only use the encoding.
        missing_keys = [key for key in missing_keys if "fc_norm" not in key]
        if len(missing_keys) != 0:
            raise KeyError(f"Keys are missing when loading monodepth: {missing_keys}")

    return model, transform
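The transform built in create_model_and_transforms chains ToTensor and Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]). For a uint8 image this maps pixel values from [0, 255] to [-1, 1]. The sketch below redoes that arithmetic for a single channel value in plain Python; normalize_pixel is a hypothetical helper and not a torchvision function.

```python
def normalize_pixel(value_uint8: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Mimic ToTensor (divide by 255) followed by Normalize(mean, std) for one value."""
    scaled = value_uint8 / 255.0   # ToTensor: uint8 [0, 255] -> float [0.0, 1.0]
    return (scaled - mean) / std   # Normalize: (x - mean) / std


for v in (0, 128, 255):
    print(v, normalize_pixel(v))   # black -> -1.0, mid-gray -> close to 0, white -> 1.0
```

The last step of the real transform, ConvertImageDtype(precision), only changes the dtype (for example to torch.half) so that the input matches the model weights.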

class DepthPro(nn.Module):
    """DepthPro network."""

    def __init__(
        self,
        encoder: DepthProEncoder,
        decoder: MultiresConvDecoder,
        last_dims: tuple[int, int],
        use_fov_head: bool = True,
        fov_encoder: Optional[nn.Module] = None,
    ):
        """Initialize DepthPro.

        Args:
        ----
            encoder: The DepthProEncoder backbone.
            decoder: The MultiresConvDecoder decoder.
            last_dims: The dimension for the last convolution layers.
            use_fov_head: Whether to use the field-of-view head.
            fov_encoder: A separate encoder for the field of view.

        """
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
    
        dim_decoder = decoder.dim_decoder
        self.head = nn.Sequential(
            nn.Conv2d(
                dim_decoder, dim_decoder // 2, kernel_size=3, stride=1, padding=1
            ),
            nn.ConvTranspose2d(
                in_channels=dim_decoder // 2,
                out_channels=dim_decoder // 2,
                kernel_size=2,
                stride=2,
                padding=0,
                bias=True,
            ),
            nn.Conv2d(
                dim_decoder // 2,
                last_dims[0],
                kernel_size=3,
                stride=1,
                padding=1,
            ),
            nn.ReLU(True),
            nn.Conv2d(last_dims[0], last_dims[1], kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
        )

        # Set the final convolution layer's bias to be 0.
        self.head[4].bias.data.fill_(0)

        # Set the FOV estimation head.
        if use_fov_head:
            self.fov = FOVNetwork(num_features=dim_decoder, fov_encoder=fov_encoder)

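• self.encoder:

• Holds the DepthProEncoder object, which extracts features from the input image.

• self.decoder:

• Holds the MultiresConvDecoder object, which fuses the encoder's output features into the feature map that the head turns into the depth map.

self.head: the final output layers

• self.head is the stack of final layers that produces the depth map.

• Conv2d:

• Reduces the channels from dim_decoder to dim_decoder // 2.

• ConvTranspose2d:

• Performs upsampling; with kernel_size=2 and stride=2 it doubles the resolution.

• ReLU:

• Activation function that adds non-linearity; the final ReLU also keeps the output non-negative.

• Final Conv2d:

• Produces the final number of output channels (last_dims[1], which is 1 here).

Example flow (channel counts):

1. Input: dim_decoder (the decoder's output channels).

2. Intermediate: dim_decoder // 2.

3. Final: last_dims[0] → last_dims[1]. With decoder_features=256 and last_dims=(32, 1) this is 256 → 128 → 32 → 1.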
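The flow above can be checked with the standard output-size formulas for Conv2d and ConvTranspose2d. This is a sketch in plain Python rather than PyTorch; the helper names are hypothetical, and the 768x768 input is an assumed decoder output size chosen only for illustration.

```python
def conv2d_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of nn.Conv2d along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of nn.ConvTranspose2d along one axis (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel


def head_output_shape(dim_decoder: int, last_dims: tuple[int, int], size: int) -> tuple[int, int, int]:
    """Follow (channels, height, width) through the layers of self.head."""
    channels = dim_decoder // 2
    size = conv2d_out(size, kernel=3, stride=1, padding=1)            # size unchanged
    size = conv_transpose2d_out(size, kernel=2, stride=2, padding=0)  # 2x upsampling
    channels = last_dims[0]
    size = conv2d_out(size, kernel=3, stride=1, padding=1)            # size unchanged
    channels = last_dims[1]
    size = conv2d_out(size, kernel=1, stride=1, padding=0)            # size unchanged
    return channels, size, size


# Assumed 768x768 feature map with 256 channels going into the head.
print(head_output_shape(dim_decoder=256, last_dims=(32, 1), size=768))  # (1, 1536, 1536)
```

Only the ConvTranspose2d changes the spatial size; the 3x3 convolutions with padding 1 and the 1x1 convolution keep it as is.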

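self.head[4].bias.data.fill_(0)

• Sets the bias of the last Conv2d layer (index 4 in the nn.Sequential) to 0.

• Presumably this gives the output layer a neutral, stable starting point at initialization.

• Condition: when use_fov_head=True, the FOV network is created.

• FOVNetwork:

• Uses the input features (dim_decoder) and an optional extra encoder (fov_encoder) to estimate the field of view.

• The estimated field of view is returned alongside the depth output; infer() converts it into a focal length when no f_px is supplied.

1. encoder:

• Extracts multi-resolution features from the input image.

2. decoder:

• Fuses the encoder's output features.

3. head:

• The final output layers, which produce the (inverse) depth map.

4. fov (optional):

• The field-of-view estimation network.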

    @property
    def img_size(self) -> int:
        """Return the internal image size of the network."""
        return self.encoder.img_size

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """Decode by projection and fusion of multi-resolution encodings.

        Args:
        ----
            x (torch.Tensor): Input image.

        Returns:
        -------
            The canonical inverse depth map [m] and the optional estimated field of view [deg].

        """
        _, _, H, W = x.shape
        assert H == self.img_size and W == self.img_size

        encodings = self.encoder(x)
        features, features_0 = self.decoder(encodings)
        canonical_inverse_depth = self.head(features)

        fov_deg = None
        if hasattr(self, "fov"):
            fov_deg = self.fov.forward(x, features_0.detach())

        return canonical_inverse_depth, fov_deg

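1. @property def img_size

• img_size:

• Returns the input image size of the DepthPro network.

• The size is determined by the configuration of the internal encoder (self.encoder).

• For example, if img_size = 384, the network processes 384x384 input images. (For the released model, the infer docstring gives the network resolution as 1536x1536.)

• Role:

• Used to check the input image size and to keep the image size consistent inside the network.

2. The forward method

• forward:

• The method that runs the forward pass of a PyTorch model, during both training and inference.

• Processes the input image and returns the inverse depth map and (optionally) the field of view.

• Arguments:

• x (torch.Tensor):

• Input image tensor of shape (batch_size, channels, height, width).

• Example: (1, 3, 384, 384) → an RGB image (3 channels) of size 384x384, which is valid when img_size is 384.

• Return values:

• canonical_inverse_depth (torch.Tensor):

• The canonical inverse depth map.

• Shape: (batch_size, 1, height, width) (a single-channel map).

• fov_deg (Optional[torch.Tensor]):

• The field of view (optional).

• Returned in degrees; None when there is no FOV head.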

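(1) Checking the input image size

• Input size check:

• Asserts that the height (H) and width (W) of the input match the network's internal size (self.img_size).

• The network is designed for a fixed input size, so a mismatch raises an error.

(2) Feature extraction (Encoder)

• self.encoder(x):

• Passes the input image x through the encoder to extract multi-resolution features (encodings).

• encodings is a collection of feature maps taken from several layers of the encoder.

(3) Depth map generation (Decoder)

features, features_0 = self.decoder(encodings)
canonical_inverse_depth = self.head(features)

• self.decoder(encodings):

• Passes the encoder's feature maps (encodings) to the decoder, which fuses the multi-resolution information.

• Return values:

• features: the main feature map used to produce the depth map.

• features_0: an additional, lower-resolution feature map.

• self.head(features):

• Passes the decoder's main output (features) to the final output layers (self.head).

• canonical_inverse_depth:

• The network's final output, the inverse depth map.

(4) Field-of-view estimation (FOV head, optional)

• if hasattr(self, "fov"):

• True when the FOV network is enabled (self.fov exists).

• self.fov.forward(x, features_0.detach()):

• Estimates the field of view from the input image x and the additional feature map features_0. The detach() call keeps FOV gradients from flowing back into the decoder.

• The field of view is returned in degrees (°).

• fov_deg:

• The estimated field of view.

• None when there is no FOV network.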

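(5) Returning the results

1. canonical_inverse_depth: the canonical inverse depth map produced by the network.

2. fov_deg: the field of view (optional).

Summary

1. Input size check:

• Verifies that the input image size matches the size the network expects (self.img_size).

2. Feature extraction (Encoder):

• The encoder turns the input image into multi-resolution feature maps.

3. Depth map generation (Decoder):

• The decoder fuses the encoder's feature maps, and the head produces the inverse depth map from them.

4. Field-of-view estimation (optional):

• The FOV network computes the field of view.

5. Returning the results:

• Returns the inverse depth map and (optionally) the field of view.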

Inference: def infer

    @torch.no_grad()
    def infer(
        self,
        x: torch.Tensor,
        f_px: Optional[Union[float, torch.Tensor]] = None,
        interpolation_mode="bilinear",
    ) -> Mapping[str, torch.Tensor]:
        """Infer depth and fov for a given image.

        If the image is not at network resolution, it is resized to 1536x1536 and
        the estimated depth is resized to the original image resolution.
        Note: if the focal length is given, the estimated value is ignored and the provided
        focal length is use to generate the metric depth values.

        Args:
        ----
            x (torch.Tensor): Input image
            f_px (torch.Tensor): Optional focal length in pixels corresponding to `x`.
            interpolation_mode (str): Interpolation function for downsampling/upsampling. 

        Returns:
        -------
            Tensor dictionary (torch.Tensor): depth [m], focallength [pixels].

        """
        if len(x.shape) == 3:  # no batch dimension: the input is (channels, height, width)
            x = x.unsqueeze(0)  # result: (1, channels, height, width)
        _, _, H, W = x.shape
        resize = H != self.img_size or W != self.img_size

        # If the input size (H, W) differs from the network size img_size, resizing is needed.

        if resize:
            x = nn.functional.interpolate(
                x,
                size=(self.img_size, self.img_size),
                mode=interpolation_mode,
                align_corners=False,
            )

        canonical_inverse_depth, fov_deg = self.forward(x)  # run the network
        if f_px is None:
            f_px = 0.5 * W / torch.tan(0.5 * torch.deg2rad(fov_deg.to(torch.float)))

        # Scale the canonical inverse depth by the image width and the focal length.
        inverse_depth = canonical_inverse_depth * (W / f_px)
        f_px = f_px.squeeze()

        if resize:
            inverse_depth = nn.functional.interpolate(
                inverse_depth, size=(H, W), mode=interpolation_mode, align_corners=False
            )

        depth = 1.0 / torch.clamp(inverse_depth, min=1e-4, max=1e4)

        return {
            "depth": depth.squeeze(),
            "focallength_px": f_px,
        }

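• This method handles resizing and post-processing so that images whose size differs from the network resolution can be processed.

• In the end it returns the depth map and the focal length.

1. Decorator: @torch.no_grad()

• @torch.no_grad():

• This decorator is used for inference.

• It disables gradient tracking, which reduces memory use and speeds things up.

Parameters

1. x: torch.Tensor:

• Input image tensor.

• Shape: (batch_size, channels, height, width); a 3-D (channels, height, width) tensor is also accepted and gets a batch dimension added.

• Example: (1, 3, 384, 384) → an RGB image.

2. f_px: Optional[Union[float, torch.Tensor]]:

• (Optional) focal length.

• Lets the caller supply the focal length of the image directly, in pixels.

• If it is not supplied, it is computed from the FOV predicted by the network.

3. interpolation_mode: str:

• Interpolation method used for resizing.

• The default is "bilinear".

Return value

• A dictionary:

1. depth:

• The predicted depth map, in meters [m].

• A single-channel tensor of size (height, width).

2. focallength_px:

• The focal length, in pixels.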
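The two formulas in infer, focal length from FOV and canonical inverse depth to metric depth, are easy to check by hand. The sketch below redoes them for a single pixel with the math module instead of torch; focal_length_px and metric_depth are hypothetical helper names, and the 1920-pixel width and 90° FOV are made-up inputs.

```python
import math


def focal_length_px(width_px: int, fov_deg: float) -> float:
    """f_px = 0.5 * W / tan(0.5 * FOV), with the FOV in degrees, as in infer()."""
    return 0.5 * width_px / math.tan(0.5 * math.radians(fov_deg))


def metric_depth(canonical_inverse_depth: float, width_px: int, f_px: float) -> float:
    """Scale by W / f_px, clamp the inverse depth to [1e-4, 1e4], then invert."""
    inverse_depth = canonical_inverse_depth * (width_px / f_px)
    inverse_depth = min(max(inverse_depth, 1e-4), 1e4)
    return 1.0 / inverse_depth


f_px = focal_length_px(width_px=1920, fov_deg=90.0)
print(f_px)                             # about 960, since tan(45 deg) is 1
print(metric_depth(0.25, 1920, f_px))   # about 2.0 m: 1 / (0.25 * 1920 / 960)
```

Because of the clamp, the returned depth always stays between 1e-4 m and 1e4 m, even where the network outputs an inverse depth of exactly 0.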

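Metadata and miscellaneous files

SOURCES.txt: lists file paths

Its contents are organized as follows.

It is usually generated automatically when a command such as python setup.py sdist or pip install -e . is run through setup.py.
It keeps track of which files are included in the package when the project is uploaded to PyPI (Python Package Index) or installed in another environment.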

1. Metadata files

• ACKNOWLEDGEMENTS.md: credits and acknowledgements for the project.

• CODE_OF_CONDUCT.md: the project's community code of conduct.

• CONTRIBUTING.md: explains how to contribute.

• LICENSE: the project's license.

• README.md: the project description.

2. Scripts and executables

• get_pretrained_models.sh: script that downloads the pretrained model.

• src/depth_pro/cli/run.py: the command-line interface.

3. Code files

• src/depth_pro/__init__.py: package initialization file.

• src/depth_pro/network/: model network code.

• src/depth_pro/eval/: evaluation code.

4. Data files

• data/example.jpg: example image.

• data/depth-pro-teaser.jpg: teaser image.

5. The egg-info directory

• src/depth_pro.egg-info/: metadata directory generated by setuptools during packaging.

• PKG-INFO: package metadata.

• SOURCES.txt: list of files included in the distribution.

• requires.txt: dependency information.

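entry_points.txt: command definitions

[console_scripts]
depth-pro-run = depth_pro.cli:run_main

This file is generated from the entry points configured in setup.py or pyproject.toml.
Its role is to register command-line scripts (CLI).

• Defines the command that is run by typing depth-pro-run.

• depth_pro.cli: the Python module name; here that is the package depth_pro/cli/__init__.py shown earlier, not a depth_pro/cli.py file.

• run_main: the name of a function available in that module (main from run.py, imported as run_main).

In other words, when the user types depth-pro-run, run_main() from depth_pro/cli/__init__.py is executed, which is main() in run.py.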
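The "name = module:function" format can be unpacked by hand to see what the generated command does. This is a simplified sketch and not how setuptools implements it: resolve_entry_point is a hypothetical helper, and json:dumps from the standard library stands in for depth_pro.cli:run_main so the snippet runs without the package installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Split 'name = package.module:attr' and return (name, the referenced object)."""
    name, target = (part.strip() for part in spec.split("=", 1))
    module_name, attr = target.split(":", 1)
    module = importlib.import_module(module_name)  # e.g. imports depth_pro.cli
    return name, getattr(module, attr)             # e.g. looks up run_main in it


# Stand-in target from the standard library, since depth_pro may not be installed here.
name, func = resolve_entry_point("demo-dumps = json:dumps")
print(name, func([1, 2]))  # demo-dumps [1, 2]
```

With the real spec, the module part resolves to the depth_pro.cli package, and the attribute lookup finds run_main because __init__.py imports it from run.py.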

get_pretrained_models.sh: downloads the model weights

#!/usr/bin/env bash
#
# For licensing see accompanying LICENSE file.
# Copyright (C) 2024 Apple Inc. All Rights Reserved.
#
mkdir -p checkpoints
# Place final weights here:
wget https://ml-site.cdn-apple.com/models/depth-pro/depth_pro.pt -P checkpoints

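Downloads the pretrained model weights (depth_pro.pt) and stores them in the checkpoints directory. This matches the default checkpoint_uri="./checkpoints/depth_pro.pt" in DEFAULT_MONODEPTH_CONFIG_DICT.

• wget:

• A command that downloads files from the internet.

• Here it downloads the model weight file (depth_pro.pt) from Apple's CDN (Content Delivery Network).

• https://ml-site.cdn-apple.com/models/depth-pro/depth_pro.pt: the URL of the file to download.

• -P checkpoints: saves the downloaded file in the checkpoints directory.

• Result: a file named depth_pro.pt is downloaded into the checkpoints directory.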
