CUDA Error Happens when using vLLM in inference: an illegal memory access was encountered.

### System Info

Nice work. The model is impressive, and I would like to thank all the contributors for their efforts on it.
However, when I use vLLM for inference, a CUDA-related error occurs after 211 samples have been processed. I have checked every detail of my code but could not find a way to fix it. The same code ran without problems on GLM 4.5 Air.

The error is listed below:

`Traceback (most recent call last):
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     self._process_engine_step()
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 292, in step
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     raise err
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     return model_fn(scheduler_output)
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 176, in execute_model
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     (output, ) = self.collective_rpc(
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 259, in collective_rpc
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 243, in get_response
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720]     raise RuntimeError(
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=94) ERROR 11-05 22:07:50 [core.py:720] ', please check the stack trace above for the root cause
(Worker_TP0 pid=133) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=133) INFO 11-05 22:07:50 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP1 pid=134) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP1 pid=134) INFO 11-05 22:07:50 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP2 pid=135) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=135) INFO 11-05 22:07:50 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP3 pid=136) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP4 pid=137) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP3 pid=136) INFO 11-05 22:07:50 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP4 pid=137) INFO 11-05 22:07:50 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP5 pid=138) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP5 pid=138) INFO 11-05 22:07:50 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP6 pid=139) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP7 pid=140) INFO 11-05 22:07:50 [multiproc_executor.py:546] Parent process exited, terminating worker`
CUDA error: an illegal memory access was encountered

I used 8×H100 GPUs for inference.
The vLLM version I am using is 0.11.0.
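
The error message itself suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`, since CUDA kernel errors are reported asynchronously and the stack trace above may not point at the real failing call. Below is a minimal sketch of how that could be set from Python, assuming the variable is set before torch or vLLM initialise CUDA so that the worker processes inherit it. `DEBUG_ENV` and `apply_debug_env` are hypothetical helper names for illustration, not part of vLLM.

```python
import os

# Debug setting taken from the hint in the error message. It has to be in
# place before torch / vLLM initialise CUDA, so this runs ahead of those imports.
DEBUG_ENV = {
    "CUDA_LAUNCH_BLOCKING": "1",  # make kernel launches synchronous
}


def apply_debug_env(env=os.environ):
    """Copy the debug settings into env and return the keys that were changed."""
    changed = []
    for key, value in DEBUG_ENV.items():
        if env.get(key) != value:
            env[key] = value
            changed.append(key)
    return changed


changed_keys = apply_debug_env()
# From here the script would import torch and vllm and build the LLM
# exactly as in the reproduction snippet.
```

Synchronous launches slow inference down considerably, so this is only meant for a single debugging run. Passing `enforce_eager=True` to `LLM` (which disables CUDA graph capture) might also help narrow the problem down, though whether it avoids this crash is untested.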

Could you help me with that?

### Who can help?

_No response_

### Information

- [x] The official example scripts
- [ ] My own modified scripts and tasks

### Reproduction

`import torch
from vllm import LLM, SamplingParams
llm = LLM(model=model_id, tensor_parallel_size=torch.cuda.device_count(), dtype="auto", trust_remote_code=True, quantization="compressed-tensors")
sampling = SamplingParams(max_tokens=2048, n=1)`
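
Because the crash only appears after a couple of hundred samples, it may help to know which prompts were in flight when it happened. The following is a rough sketch of one way to do that, assuming generation is driven in small chunks. `generate_in_chunks`, `generate_fn`, and `chunk_size` are hypothetical names introduced here and are not part of vLLM. The real call would be wrapped as shown in the trailing comment.

```python
from typing import Callable, List, Sequence


def generate_in_chunks(
    prompts: Sequence[str],
    generate_fn: Callable[[List[str]], list],
    chunk_size: int = 16,
) -> list:
    """Run generate_fn chunk by chunk and report the prompt range that fails."""
    outputs = []
    for start in range(0, len(prompts), chunk_size):
        chunk = list(prompts[start:start + chunk_size])
        try:
            outputs.extend(generate_fn(chunk))
        except Exception as err:
            end = start + len(chunk) - 1
            # An illegal memory access leaves the engine unusable, so the
            # failing range is only logged and the error is re-raised.
            print(f"generation failed for prompts {start}..{end}: {err}", flush=True)
            raise
    return outputs


# Assumed usage with the objects from the reproduction snippet, where
# `prompts` is my own list of input strings:
#   results = generate_in_chunks(prompts, lambda c: llm.generate(c, sampling))
```

Smaller chunks narrow the failing range at the cost of throughput. If the same prompt range fails on every run, the input is the likely trigger. If the range moves between runs, the cause is more likely in the kernels or memory handling.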

### Expected behavior

Inference should run to completion on all samples without the CUDA illegal memory access error.
