Hello there,
First of all, thank you so much for sharing a great model!
Its multi-lingual capability is impressive.
https://lmarena.ai/leaderboard/text/korean
I wanted to use GLM-4.6 to transform our dataset into another one on H100x8 machines, but I found that the currently recommended command lines did not work well for me.
First, I tried vLLM but hit an out-of-memory (OOM) error.
vllm serve zai-org/GLM-4.6-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
As suggested in the README.md, I added --cpu-offload-gb 16, but then the generation speed dropped to about 6 tokens per second. The same issue is also reported here: vllm-project/vllm#22692
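One thing I have not yet verified is whether capping the context length and KV-cache budget would avoid the OOM without any CPU offloading. The --max-model-len and --gpu-memory-utilization values below are just guesses on my part, not a known-good setup:
# Untested idea: limit context length instead of offloading weights to CPU
vllm serve zai-org/GLM-4.6-FP8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92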
In any case, my dataset has 10 million samples, so 6 tokens per second was not acceptable. Therefore, I tried SGLang next.
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.6-FP8 \
--tp-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--disable-shared-experts-fusion \
--host 0.0.0.0 \
--port 8000
This time, it seemed to work well at first glance. However, when I ran a batch job, every response was filled with !!!!!!!!!!... up to the maximum output token length.
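To isolate the problem, my plan is to retry without the EAGLE speculative-decoding flags and see whether the garbage output disappears. This is only a debugging sketch, not a configuration I have confirmed:
# Debugging sketch: same launch with speculative decoding removed,
# to check whether the !!!! output is related to the EAGLE draft settings
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.6-FP8 \
    --tp-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.7 \
    --disable-shared-experts-fusion \
    --host 0.0.0.0 \
    --port 8000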
I assume many people already have a well-tuned setup for serving GLM-4.6. Could you share the configuration you would recommend?
Thank you again!