
vLLM or SGLang Command Line for Maximum Throughput? #107

@seungduk-yanolja

Hello there,

First of all, thank you so much for sharing this great model!
Its multilingual capability is impressive:
https://lmarena.ai/leaderboard/text/korean

I wanted to use GLM-4.6 on 8xH100 machines to transform our dataset into another one, but I found that the currently recommended command line did not work well.

First, I tried vLLM but ran into an out-of-memory (OOM) error.

vllm serve zai-org/GLM-4.6-FP8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

As suggested in the README.md, I added --cpu-offload-gb 16, but then the generation speed dropped to 6 tokens per second. The same slowdown is also reported in vllm-project/vllm#22692.
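Since this is a one-off transformation of a fixed dataset rather than online serving, I also considered skipping the server and using vLLM's offline LLM API. This is only a minimal sketch of what I had in mind (untested with this model; I assume the same memory / offload trade-off applies as with vllm serve, and the prompt is just a placeholder):

from vllm import LLM, SamplingParams

# Same model and parallelism as the serve command above;
# cpu_offload_gb mirrors --cpu-offload-gb from the server attempt.
llm = LLM(
    model="zai-org/GLM-4.6-FP8",
    tensor_parallel_size=8,
    cpu_offload_gb=16,
)

params = SamplingParams(temperature=0.6, max_tokens=2048)

# In practice the prompts would be built from my dataset samples.
prompts = ["Transform the following sample: ..."]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)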

My dataset has 10 million samples, so that speed was not acceptable. Therefore, I tried SGLang.

python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.6-FP8 \
  --tp-size 8 \
  --tool-call-parser glm45  \
  --reasoning-parser glm45  \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3  \
  --speculative-eagle-topk 1  \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.7 \
  --disable-shared-experts-fusion \
  --host 0.0.0.0 \
  --port 8000

This time, it seemed to work well at first glance. However, when I ran a batch job, every response was filled with !!!!!!!!!!... up to the maximum output token length.
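For reference, the batch job just sends requests to the OpenAI-compatible endpoint that the server exposes on the port above, roughly like the sketch below (the prompt and sampling settings here are simplified placeholders):

from openai import OpenAI

# Both vLLM and SGLang expose an OpenAI-compatible API at /v1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="zai-org/GLM-4.6-FP8",
    messages=[{"role": "user", "content": "Transform the following sample: ..."}],
    max_tokens=2048,
    temperature=0.6,
)

# In my batch run against the SGLang server, this content was all "!".
print(resp.choices[0].message.content)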

I assume many people already have a well-tuned setup for serving GLM-4.6. Could you share which configuration works best for maximum throughput?

Thank you, again!
