Conversation

lengrongfu (Member) commented Feb 27, 2025

Issue: #55

Test python code:

import torch

# Query device and GPU memory information
def query_device():
    if torch.cuda.is_available():
        print(f"CUDA is available. Using {torch.cuda.device_count()} device(s).")
        for i in range(torch.cuda.device_count()):
            print(f"Device {i}: {torch.cuda.get_device_name(i)}")
            print(f"  Memory Allocated: {torch.cuda.memory_allocated(i) / (1024 ** 2):.2f} MB")
            print(f"  Memory Cached: {torch.cuda.memory_reserved(i) / (1024 ** 2):.2f} MB")
    else:
        print("CUDA is not available. Using CPU.")

# Simulate GPU memory allocation until OOM is triggered
def trigger_oom(device):
    try:
        print("\nStarting memory allocation...")
        tensor_list = []
        while True:
            # Allocate a large tensor on each iteration to consume GPU memory
            tensor = torch.randn((10000, 10000), device=device)
            tensor_list.append(tensor)
            # Print current GPU memory status
            allocated_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)  # MB
            cached_memory = torch.cuda.memory_reserved(device) / (1024 ** 2)  # MB
            print(f"Allocated Memory: {allocated_memory:.2f} MB, Cached Memory: {cached_memory:.2f} MB")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"OOM Triggered! {e}")
        else:
            print(f"Unexpected error: {e}")

# Main program
if __name__ == "__main__":
    # Query device information
    query_device()

    # Use the CUDA device if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if torch.cuda.is_available():
        # Trigger OOM on the device
        trigger_oom(device)
    else:
        print("CUDA not available. Unable to trigger OOM.")
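Since the `except` branch in `trigger_oom` dispatches on the error-message text, that check can be factored into a small helper and unit-tested without a GPU. A minimal sketch (the helper name `classify_cuda_error` is hypothetical, not part of this PR):

```python
def classify_cuda_error(exc: Exception) -> str:
    """Classify a RuntimeError raised during CUDA allocation.

    PyTorch CUDA OOM errors contain the phrase "out of memory"
    in their message, which is what trigger_oom() above matches on.
    """
    if "out of memory" in str(exc):
        return "oom"
    return "other"

# No GPU is needed to exercise the classification logic:
print(classify_cuda_error(RuntimeError("CUDA out of memory. Tried to allocate 381.47 MiB")))  # oom
print(classify_cuda_error(RuntimeError("CUDNN_STATUS_INTERNAL_ERROR")))  # other
```

Note that recent PyTorch versions raise `torch.cuda.OutOfMemoryError` (a `RuntimeError` subclass) on CUDA OOM, so `except torch.cuda.OutOfMemoryError:` is an alternative to matching on the message string.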

[screenshot: test output]

Change before error:
[screenshot]


hami-robott bot commented Feb 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lengrongfu

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robott hami-robott bot added the size/XS label Feb 27, 2025
lengrongfu (Member, Author)

@archlitchi PTAL

archlitchi (Member)

/lgtm

@archlitchi archlitchi merged commit 6039f80 into Project-HAMi:main Mar 3, 2025
4 of 5 checks passed
@hami-robott hami-robott bot added the lgtm label Mar 3, 2025