Hello there,
First of all, thank you so much for sharing a great model!
Its multi-lingual capability is impressive.
https://lmarena.ai/leaderboard/text/korean
I wanted to use GLM-4.6 to transform our dataset into another one on H100x8 machines, but I found that the currently recommended command lines did not work well for me.
First, I tried vLLM but hit an out-of-memory (OOM) error.
vllm serve zai-org/GLM-4.6-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
As suggested in the README.md, I added --cpu-offload-gb 16, but then the generation speed dropped to about 6 tokens per second. The same issue is also reported here: vllm-project/vllm#22692
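One thing I have not yet verified is whether capping the context length and KV-cache budget would avoid the OOM without any CPU offloading. The --max-model-len and --gpu-memory-utilization values below are just guesses on my part, not a known-good setup:
# Untested idea: limit context length instead of offloading weights to CPU
vllm serve zai-org/GLM-4.6-FP8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92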
In any case, my dataset has 10 million samples, so 6 tokens per second was not acceptable. Therefore, I tried SGLang next.
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.6-FP8 \
--tp-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--disable-shared-experts-fusion \
--host 0.0.0.0 \
--port 8000
This time, it seemed to work well at first glance. However, when I ran a batch job, every response was filled with !!!!!!!!!!... up to the maximum output token length.
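To isolate the problem, my plan is to retry without the EAGLE speculative-decoding flags and see whether the garbage output disappears. This is only a debugging sketch, not a configuration I have confirmed:
# Debugging sketch: same launch with speculative decoding removed,
# to check whether the !!!! output is related to the EAGLE draft settings
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.6-FP8 \
    --tp-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.7 \
    --disable-shared-experts-fusion \
    --host 0.0.0.0 \
    --port 8000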
I assume many people already have a well-tuned setup for serving GLM-4.6. Could you share the configuration you would recommend?
Thank you again!