114 changes: 114 additions & 0 deletions Mono-depth-Midas-to-Real-depth/README.md
@@ -0,0 +1,114 @@
# Mono-depth-Midas-to-Real-depth
This repository contains code to convert a relative depth map into a real (metric) depth map using [MiDaS](https://github.com/isl-org/MiDaS) and [YOLO](https://github.com/ultralytics/ultralytics).
___
## Feature of MiDaS
The MiDaS deep learning model estimates depth from a single image, but it outputs only relative depth. A stereo camera recovers metric depth from the parallax between two views, much like our eyes, whereas MiDaS works from a single image and therefore has no absolute scale.

So it can only produce a **relative depth** map.

I made this repository to overcome that limitation and *generate real depth*.
___
## Conversion pipeline

1. Grab video frames with OpenCV (`cv2`)
2. Feed each frame to YOLO and MiDaS
3. Get bounding boxes and a relative depth map from the camera frame
4. Compare the average depth inside the **standard box** with the average depth inside each detected **bounding box**
5. Calculate the real depth, using the real distance entered for the standard box (see the sketch below)

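A minimal sketch of steps 4–5 is shown below. It assumes the MiDaS output is treated as proportional to inverse depth, so real depth scales with the ratio of the reference box's average value to the object box's average value; the function and variable names are illustrative, not the repository's exact code.

```py
import numpy as np

REF_REAL_M = 15.0  # known real distance (in meters) of the standard (reference) box

def estimate_real_depth(depth_map, ref_box, obj_box, ref_real_m=REF_REAL_M):
    """Illustrative conversion: scale the object's relative depth by the reference box.

    depth_map: 2-D MiDaS output (relative, inverse-depth-like values)
    ref_box, obj_box: (x1, y1, x2, y2) pixel coordinates
    """
    def mean_in_box(box):
        x1, y1, x2, y2 = box
        return float(np.mean(depth_map[y1:y2, x1:x2]))

    ref_rel = mean_in_box(ref_box)
    obj_rel = mean_in_box(obj_box)
    # Larger MiDaS values mean "closer", so real depth shrinks as the relative value grows.
    return ref_real_m * ref_rel / obj_rel
```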
___
## Code
*For details of the code, refer to the inline comments; here I only note the points that need attention.*

1. ```py
# set base depth (enter the real depth of the bottom-right yellow box)
REF_REAL_M = 15.0
```
You can change this value to match the real distance of your own reference box.

2. ```py
with torch.no_grad():
    fps = 1
    video = VideoStream(0).start()
    time_start = time.time()
    ...
```
This code uses a USB camera for video input; in `VideoStream(0)`, the index selects which of the connected cameras is used.
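
If you are not sure which index to pass, a small helper like the one below (not part of this repository) can probe the first few indices with plain OpenCV and report which cameras respond:
```py
import cv2

def list_camera_indices(max_index=5):
    """Return the indices of cameras that OpenCV can open."""
    available = []
    for i in range(max_index):
        cap = cv2.VideoCapture(i)
        if cap.isOpened():
            available.append(i)
        cap.release()
    return available

print(list_camera_indices())  # e.g. [0, 1] -> pass the chosen index to VideoStream()
```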

3. If you want to change the models, edit the lines below.
```py
yolo_model = YOLO("yolo11n.pt")
```
YOLO
```py
def run(input_path, output_path, model_path, model_type="dpt_levit_224", optimize=False, height=None):
    print("Initialize")
```
```py
parser.add_argument('-t', '--model_type', default='dpt_levit_224')
```
MiDaS

4.
```py
if optimize and device == torch.device("cuda"):
```
```py
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
If you don't want to use CUDA (GPU-accelerated processing), change `cuda` to `cpu`.
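
For example, you can pin the device explicitly instead of relying on the automatic check (a trivial variation, not the repository's exact code):
```py
import torch

# Force CPU even if a CUDA GPU is available:
device = torch.device("cpu")
# Or keep the automatic fallback:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```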

5.
If you want to change the color mode of the MiDaS visualization,
```py
combined = np.hstack((yolo_annotated, create_side_by_side(None, prediction, True)))
```
change `True` to `False` (or back); the last argument is the `grayscale` flag here:
```py
def create_side_by_side(image, depth, grayscale):
```
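
For example, the two calls below (assuming the same variables as above) produce the grayscale and colored outputs shown in the images that follow:
```py
# Grayscale depth visualization:
combined = np.hstack((yolo_annotated, create_side_by_side(None, prediction, True)))
# Colored depth visualization:
combined = np.hstack((yolo_annotated, create_side_by_side(None, prediction, False)))
```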
<p align='center'>
<img src='img/gray.png' align='center'>
<img src='img/rgb.png' align='center'>
</p>

---

## Example
<p align='center'>
<img src='img/Midas v3.1 dpt_levit_224.png'><br>
Midas v3.1 dpt_levit_224
<img src='img/Midas v3.1 Swin2-L 384.png'><br>
Midas v3.1 dpt_swin2_large_384
</p>


**We can observe some interesting points.**
**Top (dpt_levit_224) vs. Bottom (dpt_swin2_large_384)**
1. Yellow base box (bottom right of the YOLO view)
   - The real distance between the camera and the lectern is similar in both runs.
   - The distinguishing factor is the average relative depth inside the base box.
   - The base box is fixed at a real distance of 15 m, but the top and bottom outputs do not agree on its average depth.
   - As a result, the student behind the lectern (wearing a black shirt) shows a considerable depth difference between the two pictures.
2. The MiDaS models differ considerably in the amount of detail. The development environment is a Galaxy Book Flex2 (i7-1165G7, MX450).
   - dpt_levit_224 gives lower-quality output but provides real-time feedback. In this setup the laptop's performance is limited by overheating; under better conditions the average FPS is about 17.
   - dpt_swin2_large_384 gives high-quality output but requires much more computation, so the FPS is low.

---

## Model confidence
The real-depth conversion depends on MiDaS's output, so the credibility of MiDaS directly determines the credibility of this program. I used a point-cloud generation web program to check MiDaS's confidence.
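
For reference, the standard back-projection used to turn a metric depth map into a point cloud looks like the sketch below; the pinhole intrinsics `fx, fy, cx, cy` are assumed to be known and are not provided by this repository or by MiDaS.
```py
import numpy as np

def depth_to_pointcloud(depth_m, fx, fy, cx, cy):
    """Back-project a metric depth map (H x W, in meters) to an N x 3 point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
```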
<p align='center'>
<img src='img/1.jpg'><br>
<img src='img/point2.jpg' align='center' width="49%">
<img src='img/point3.jpg' align='center' width="50%">
</p>

**This output shows a point of caution.**
Near the camera the depth map is quite expressive, but in areas far from the camera it cannot express subtle depth differences, for example between the wall and the brown wastebasket.

The farther from the camera, the more the uncertainty increases and the more the depth disparity decreases.

I think this problem can be mitigated by using a higher-quality model.
(I used dpt_swin2_large_384.pt for the point cloud.)
Binary file added Mono-depth-Midas-to-Real-depth/img/1.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/2.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/3.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/gray.png
Binary file added Mono-depth-Midas-to-Real-depth/img/point1.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/point2.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/point3.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/rgb.png
196 changes: 196 additions & 0 deletions Mono-depth-Midas-to-Real-depth/midas/backbones/beit.py
@@ -0,0 +1,196 @@
import timm
import torch
import types

import numpy as np
import torch.nn.functional as F

from .utils import forward_adapted_unflatten, make_backbone_default
from timm.models.beit import gen_relative_position_index
from torch.utils.checkpoint import checkpoint
from typing import Optional


def forward_beit(pretrained, x):
    return forward_adapted_unflatten(pretrained, x, "forward_features")


def patch_embed_forward(self, x):
"""
Modification of timm.models.layers.patch_embed.py: PatchEmbed.forward to support arbitrary window sizes.
"""
x = self.proj(x)
if self.flatten:
x = x.flatten(2).transpose(1, 2)
x = self.norm(x)
return x


def _get_rel_pos_bias(self, window_size):
"""
Modification of timm.models.beit.py: Attention._get_rel_pos_bias to support arbitrary window sizes.
"""
old_height = 2 * self.window_size[0] - 1
old_width = 2 * self.window_size[1] - 1

new_height = 2 * window_size[0] - 1
new_width = 2 * window_size[1] - 1

old_relative_position_bias_table = self.relative_position_bias_table

old_num_relative_distance = self.num_relative_distance
new_num_relative_distance = new_height * new_width + 3

old_sub_table = old_relative_position_bias_table[:old_num_relative_distance - 3]

old_sub_table = old_sub_table.reshape(1, old_width, old_height, -1).permute(0, 3, 1, 2)
new_sub_table = F.interpolate(old_sub_table, size=(new_height, new_width), mode="bilinear")
new_sub_table = new_sub_table.permute(0, 2, 3, 1).reshape(new_num_relative_distance - 3, -1)

new_relative_position_bias_table = torch.cat(
[new_sub_table, old_relative_position_bias_table[old_num_relative_distance - 3:]])

key = str(window_size[1]) + "," + str(window_size[0])
if key not in self.relative_position_indices.keys():
self.relative_position_indices[key] = gen_relative_position_index(window_size)

relative_position_bias = new_relative_position_bias_table[
self.relative_position_indices[key].view(-1)].view(
window_size[0] * window_size[1] + 1,
window_size[0] * window_size[1] + 1, -1) # Wh*Ww,Wh*Ww,nH
relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
return relative_position_bias.unsqueeze(0)


def attention_forward(self, x, resolution, shared_rel_pos_bias: Optional[torch.Tensor] = None):
"""
Modification of timm.models.beit.py: Attention.forward to support arbitrary window sizes.
"""
B, N, C = x.shape

qkv_bias = torch.cat((self.q_bias, self.k_bias, self.v_bias)) if self.q_bias is not None else None
qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple)

q = q * self.scale
attn = (q @ k.transpose(-2, -1))

if self.relative_position_bias_table is not None:
window_size = tuple(np.array(resolution) // 16)
attn = attn + self._get_rel_pos_bias(window_size)
if shared_rel_pos_bias is not None:
attn = attn + shared_rel_pos_bias

attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)

x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x


def block_forward(self, x, resolution, shared_rel_pos_bias: Optional[torch.Tensor] = None):
"""
Modification of timm.models.beit.py: Block.forward to support arbitrary window sizes.
"""
if self.gamma_1 is None:
x = x + self.drop_path1(self.attn(self.norm1(x), resolution, shared_rel_pos_bias=shared_rel_pos_bias))
x = x + self.drop_path1(self.mlp(self.norm2(x)))
else:
x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), resolution,
shared_rel_pos_bias=shared_rel_pos_bias))
x = x + self.drop_path1(self.gamma_2 * self.mlp(self.norm2(x)))
return x


def beit_forward_features(self, x):
"""
Modification of timm.models.beit.py: Beit.forward_features to support arbitrary window sizes.
"""
resolution = x.shape[2:]

x = self.patch_embed(x)
x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
if self.pos_embed is not None:
x = x + self.pos_embed
x = self.pos_drop(x)

rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
for blk in self.blocks:
if self.grad_checkpointing and not torch.jit.is_scripting():
x = checkpoint(blk, x, shared_rel_pos_bias=rel_pos_bias)
else:
x = blk(x, resolution, shared_rel_pos_bias=rel_pos_bias)
x = self.norm(x)
return x


def _make_beit_backbone(
        model,
        features=[96, 192, 384, 768],
        size=[384, 384],
        hooks=[0, 4, 8, 11],
        vit_features=768,
        use_readout="ignore",
        start_index=1,
        start_index_readout=1,
):
    backbone = make_backbone_default(model, features, size, hooks, vit_features, use_readout, start_index,
                                     start_index_readout)

    backbone.model.patch_embed.forward = types.MethodType(patch_embed_forward, backbone.model.patch_embed)
    backbone.model.forward_features = types.MethodType(beit_forward_features, backbone.model)

    for block in backbone.model.blocks:
        attn = block.attn
        attn._get_rel_pos_bias = types.MethodType(_get_rel_pos_bias, attn)
        attn.forward = types.MethodType(attention_forward, attn)
        attn.relative_position_indices = {}

        block.forward = types.MethodType(block_forward, block)

    return backbone


def _make_pretrained_beitl16_512(pretrained, use_readout="ignore", hooks=None):
    model = timm.create_model("beit_large_patch16_512", pretrained=pretrained)

    hooks = [5, 11, 17, 23] if hooks is None else hooks

    features = [256, 512, 1024, 1024]

    return _make_beit_backbone(
        model,
        features=features,
        size=[512, 512],
        hooks=hooks,
        vit_features=1024,
        use_readout=use_readout,
    )


def _make_pretrained_beitl16_384(pretrained, use_readout="ignore", hooks=None):
    model = timm.create_model("beit_large_patch16_384", pretrained=pretrained)

    hooks = [5, 11, 17, 23] if hooks is None else hooks
    return _make_beit_backbone(
        model,
        features=[256, 512, 1024, 1024],
        hooks=hooks,
        vit_features=1024,
        use_readout=use_readout,
    )


def _make_pretrained_beitb16_384(pretrained, use_readout="ignore", hooks=None):
    model = timm.create_model("beit_base_patch16_384", pretrained=pretrained)

    hooks = [2, 5, 8, 11] if hooks is None else hooks
    return _make_beit_backbone(
        model,
        features=[96, 192, 384, 768],
        hooks=hooks,
        use_readout=use_readout,
    )