114 changes: 114 additions & 0 deletions Mono-depth-Midas-to-Real-depth/README.md
@@ -0,0 +1,114 @@
# Mono-depth-Midas-to-Real-depth
This repository contains code to convert a relative depth map into a real (metric) depth map using [MiDaS](https://github.com/isl-org/MiDaS) and [YOLO](https://github.com/ultralytics/ultralytics).
___
## Feature of MiDaS
The MiDaS deep learning model estimates depth from a single image, but it outputs only relative depth. A stereo camera recovers metric depth from the parallax between two views, much like our eyes, whereas MiDaS works from a single image and therefore has no absolute scale.

So it can only produce a **relative depth** map.

I made this repository to overcome that limitation and *generate real depth*.
___
## Conversion pipeline

1. Grab video frames with OpenCV (`cv2`)
2. Feed each frame to YOLO and MiDaS
3. Get bounding boxes and a relative depth map from the camera frame
4. Compare the average depth inside the **standard box** with the average depth inside each detected **bounding box**
5. Calculate the real depth, using the real distance entered for the standard box (see the sketch below)

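A minimal sketch of steps 4–5 is shown below. It assumes the MiDaS output is treated as proportional to inverse depth, so real depth scales with the ratio of the reference box's average value to the object box's average value; the function and variable names are illustrative, not the repository's exact code.

```py
import numpy as np

REF_REAL_M = 15.0  # known real distance (in meters) of the standard (reference) box

def estimate_real_depth(depth_map, ref_box, obj_box, ref_real_m=REF_REAL_M):
    """Illustrative conversion: scale the object's relative depth by the reference box.

    depth_map: 2-D MiDaS output (relative, inverse-depth-like values)
    ref_box, obj_box: (x1, y1, x2, y2) pixel coordinates
    """
    def mean_in_box(box):
        x1, y1, x2, y2 = box
        return float(np.mean(depth_map[y1:y2, x1:x2]))

    ref_rel = mean_in_box(ref_box)
    obj_rel = mean_in_box(obj_box)
    # Larger MiDaS values mean "closer", so real depth shrinks as the relative value grows.
    return ref_real_m * ref_rel / obj_rel
```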
___
## Code
*For details of the code, refer to the inline comments; here I only note the points that need attention.*

1. ```py
# set base depth (enter the real depth of the bottom-right yellow box)
REF_REAL_M = 15.0
```
You can change this value to match the real distance of your own reference box.

2. ```py
with torch.no_grad():
    fps = 1
    video = VideoStream(0).start()
    time_start = time.time()
    ...
```
This code uses a USB camera for video input; in `VideoStream(0)`, the index selects which of the connected cameras is used.
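
If you are not sure which index to pass, a small helper like the one below (not part of this repository) can probe the first few indices with plain OpenCV and report which cameras respond:
```py
import cv2

def list_camera_indices(max_index=5):
    """Return the indices of cameras that OpenCV can open."""
    available = []
    for i in range(max_index):
        cap = cv2.VideoCapture(i)
        if cap.isOpened():
            available.append(i)
        cap.release()
    return available

print(list_camera_indices())  # e.g. [0, 1] -> pass the chosen index to VideoStream()
```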

3. If you want to change the models, edit the lines below.
```py
yolo_model = YOLO("yolo11n.pt")
```
YOLO
```py
def run(input_path, output_path, model_path, model_type="dpt_levit_224", optimize=False, height=None):
    print("Initialize")
```
```py
parser.add_argument('-t', '--model_type', default='dpt_levit_224')
```
MiDaS

4.
```py
if optimize and device == torch.device("cuda"):
```
```py
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
If you don't want to use CUDA (GPU-accelerated processing), change `cuda` to `cpu`.
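
For example, you can pin the device explicitly instead of relying on the automatic check (a trivial variation, not the repository's exact code):
```py
import torch

# Force CPU even if a CUDA GPU is available:
device = torch.device("cpu")
# Or keep the automatic fallback:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```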

5.
If you want to change the color mode of the MiDaS visualization,
```py
combined = np.hstack((yolo_annotated, create_side_by_side(None, prediction, True)))
```
change `True` to `False` (or back); the last argument is the `grayscale` flag here:
```py
def create_side_by_side(image, depth, grayscale):
```
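
For example, the two calls below (assuming the same variables as above) produce the grayscale and colored outputs shown in the images that follow:
```py
# Grayscale depth visualization:
combined = np.hstack((yolo_annotated, create_side_by_side(None, prediction, True)))
# Colored depth visualization:
combined = np.hstack((yolo_annotated, create_side_by_side(None, prediction, False)))
```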
<p align='center'>
<img src='img/gray.png' align='center'>
<img src='img/rgb.png' align='center'>
</p>

---

## Example
<p align='center'>
<img src='img/Midas v3.1 dpt_levit_224.png'><br>
Midas v3.1 dpt_levit_224
<img src='img/Midas v3.1 Swin2-L 384.png'><br>
Midas v3.1 dpt_swin2_large_384
</p>


**We can observe some interesting points.**
**Top (dpt_levit_224) vs. Bottom (dpt_swin2_large_384)**
1. Yellow base box (bottom right of the YOLO view)
   - The real distance between the camera and the lectern is similar in both runs.
   - The distinguishing factor is the average relative depth inside the base box.
   - The base box is fixed at a real distance of 15 m, but the top and bottom outputs do not agree on its average depth.
   - As a result, the student behind the lectern (wearing a black shirt) shows a considerable depth difference between the two pictures.
2. The MiDaS models differ considerably in the amount of detail. The development environment is a Galaxy Book Flex2 (i7-1165G7, MX450).
   - dpt_levit_224 gives lower-quality output but provides real-time feedback. In this setup the laptop's performance is limited by overheating; under better conditions the average FPS is about 17.
   - dpt_swin2_large_384 gives high-quality output but requires much more computation, so the FPS is low.

---

## Model confidence
The real-depth conversion depends on MiDaS's output, so the credibility of MiDaS directly determines the credibility of this program. I used a point-cloud generation web program to check MiDaS's confidence.
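
For reference, the standard back-projection used to turn a metric depth map into a point cloud looks like the sketch below; the pinhole intrinsics `fx, fy, cx, cy` are assumed to be known and are not provided by this repository or by MiDaS.
```py
import numpy as np

def depth_to_pointcloud(depth_m, fx, fy, cx, cy):
    """Back-project a metric depth map (H x W, in meters) to an N x 3 point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
```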
<p align='center'>
<img src='img/1.jpg'><br>
<img src='img/point2.jpg' align='center' width="49%">
<img src='img/point3.jpg' align='center' width="50%">
</p>

**This output shows a point of caution.**
Near the camera the depth map is quite expressive, but in areas far from the camera it cannot express subtle depth differences, for example between the wall and the brown wastebasket.

The farther from the camera, the more the uncertainty increases and the more the depth disparity decreases.

I think this problem can be mitigated by using a higher-quality model.
(I used dpt_swin2_large_384.pt for the point cloud.)
Binary file added Mono-depth-Midas-to-Real-depth/img/1.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/2.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/3.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/gray.png
Binary file added Mono-depth-Midas-to-Real-depth/img/point1.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/point2.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/point3.jpg
Binary file added Mono-depth-Midas-to-Real-depth/img/rgb.png
196 changes: 196 additions & 0 deletions Mono-depth-Midas-to-Real-depth/midas/backbones/beit.py
@@ -0,0 +1,196 @@
import timm
import torch
import types

import numpy as np
import torch.nn.functional as F

from .utils import forward_adapted_unflatten, make_backbone_default
from timm.models.beit import gen_relative_position_index
from torch.utils.checkpoint import checkpoint
from typing import Optional


def forward_beit(pretrained, x):
    return forward_adapted_unflatten(pretrained, x, "forward_features")


def patch_embed_forward(self, x):
"""
Modification of timm.models.layers.patch_embed.py: PatchEmbed.forward to support arbitrary window sizes.
"""
x = self.proj(x)
if self.flatten:
x = x.flatten(2).transpose(1, 2)
x = self.norm(x)
return x


def _get_rel_pos_bias(self, window_size):
"""
Modification of timm.models.beit.py: Attention._get_rel_pos_bias to support arbitrary window sizes.
"""
old_height = 2 * self.window_size[0] - 1
old_width = 2 * self.window_size[1] - 1

new_height = 2 * window_size[0] - 1
new_width = 2 * window_size[1] - 1

old_relative_position_bias_table = self.relative_position_bias_table

old_num_relative_distance = self.num_relative_distance
new_num_relative_distance = new_height * new_width + 3

old_sub_table = old_relative_position_bias_table[:old_num_relative_distance - 3]

old_sub_table = old_sub_table.reshape(1, old_width, old_height, -1).permute(0, 3, 1, 2)
new_sub_table = F.interpolate(old_sub_table, size=(new_height, new_width), mode="bilinear")
new_sub_table = new_sub_table.permute(0, 2, 3, 1).reshape(new_num_relative_distance - 3, -1)

new_relative_position_bias_table = torch.cat(
[new_sub_table, old_relative_position_bias_table[old_num_relative_distance - 3:]])

key = str(window_size[1]) + "," + str(window_size[0])
if key not in self.relative_position_indices.keys():
self.relative_position_indices[key] = gen_relative_position_index(window_size)

relative_position_bias = new_relative_position_bias_table[
self.relative_position_indices[key].view(-1)].view(
window_size[0] * window_size[1] + 1,
window_size[0] * window_size[1] + 1, -1) # Wh*Ww,Wh*Ww,nH
relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
return relative_position_bias.unsqueeze(0)


def attention_forward(self, x, resolution, shared_rel_pos_bias: Optional[torch.Tensor] = None):
"""
Modification of timm.models.beit.py: Attention.forward to support arbitrary window sizes.
"""
B, N, C = x.shape

qkv_bias = torch.cat((self.q_bias, self.k_bias, self.v_bias)) if self.q_bias is not None else None
qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple)

q = q * self.scale
attn = (q @ k.transpose(-2, -1))

if self.relative_position_bias_table is not None:
window_size = tuple(np.array(resolution) // 16)
attn = attn + self._get_rel_pos_bias(window_size)
if shared_rel_pos_bias is not None:
attn = attn + shared_rel_pos_bias

attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)

x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x


def block_forward(self, x, resolution, shared_rel_pos_bias: Optional[torch.Tensor] = None):
"""
Modification of timm.models.beit.py: Block.forward to support arbitrary window sizes.
"""
if self.gamma_1 is None:
x = x + self.drop_path1(self.attn(self.norm1(x), resolution, shared_rel_pos_bias=shared_rel_pos_bias))
x = x + self.drop_path1(self.mlp(self.norm2(x)))
else:
x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), resolution,
shared_rel_pos_bias=shared_rel_pos_bias))
x = x + self.drop_path1(self.gamma_2 * self.mlp(self.norm2(x)))
return x


def beit_forward_features(self, x):
"""
Modification of timm.models.beit.py: Beit.forward_features to support arbitrary window sizes.
"""
resolution = x.shape[2:]

x = self.patch_embed(x)
x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
if self.pos_embed is not None:
x = x + self.pos_embed
x = self.pos_drop(x)

rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
for blk in self.blocks:
if self.grad_checkpointing and not torch.jit.is_scripting():
x = checkpoint(blk, x, shared_rel_pos_bias=rel_pos_bias)
else:
x = blk(x, resolution, shared_rel_pos_bias=rel_pos_bias)
x = self.norm(x)
return x


def _make_beit_backbone(
        model,
        features=[96, 192, 384, 768],
        size=[384, 384],
        hooks=[0, 4, 8, 11],
        vit_features=768,
        use_readout="ignore",
        start_index=1,
        start_index_readout=1,
):
    backbone = make_backbone_default(model, features, size, hooks, vit_features, use_readout, start_index,
                                     start_index_readout)

    backbone.model.patch_embed.forward = types.MethodType(patch_embed_forward, backbone.model.patch_embed)
    backbone.model.forward_features = types.MethodType(beit_forward_features, backbone.model)

    for block in backbone.model.blocks:
        attn = block.attn
        attn._get_rel_pos_bias = types.MethodType(_get_rel_pos_bias, attn)
        attn.forward = types.MethodType(attention_forward, attn)
        attn.relative_position_indices = {}

        block.forward = types.MethodType(block_forward, block)

    return backbone


def _make_pretrained_beitl16_512(pretrained, use_readout="ignore", hooks=None):
    model = timm.create_model("beit_large_patch16_512", pretrained=pretrained)

    hooks = [5, 11, 17, 23] if hooks is None else hooks

    features = [256, 512, 1024, 1024]

    return _make_beit_backbone(
        model,
        features=features,
        size=[512, 512],
        hooks=hooks,
        vit_features=1024,
        use_readout=use_readout,
    )


def _make_pretrained_beitl16_384(pretrained, use_readout="ignore", hooks=None):
    model = timm.create_model("beit_large_patch16_384", pretrained=pretrained)

    hooks = [5, 11, 17, 23] if hooks is None else hooks
    return _make_beit_backbone(
        model,
        features=[256, 512, 1024, 1024],
        hooks=hooks,
        vit_features=1024,
        use_readout=use_readout,
    )


def _make_pretrained_beitb16_384(pretrained, use_readout="ignore", hooks=None):
    model = timm.create_model("beit_base_patch16_384", pretrained=pretrained)

    hooks = [2, 5, 8, 11] if hooks is None else hooks
    return _make_beit_backbone(
        model,
        features=[96, 192, 384, 768],
        hooks=hooks,
        use_readout=use_readout,
    )