Conversation

@yuanlehome (Collaborator) commented Oct 11, 2024

PR types

New features

PR changes

Others

Description

Refactors the attention network for large-model inference. The new append_attn scheme delivers a 10% to 90% performance improvement over the old scheme.

Inference is currently supported for llama/qwen/qwen-moe/mixtral.

Usage: in the original inference script, simply replace the --block_attn option with --append_attn.
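For example (a hedged sketch; the script name and the elided flags are placeholders for whatever the existing inference command already uses):

python predictor.py ... --block_attn    # before
python predictor.py ... --append_attn   # after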

TODO:

  • fp8 inference support
  • Supplementary performance data, to follow in the llm docs

@paddle-bot bot commented Oct 11, 2024

Thanks for your contribution!

@codecov bot commented Oct 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 60 lines in your changes missing coverage. Please review.

Project coverage is 52.74%. Comparing base (fe8b527) to head (84a6864).
Report is 264 commits behind head on develop.

Files with missing lines Patch % Lines
...erimental/transformers/fused_transformer_layers.py 0.00% 38 Missing ⚠️
...dlenlp/experimental/transformers/qwen2/modeling.py 0.00% 8 Missing ⚠️
...dlenlp/experimental/transformers/llama/modeling.py 0.00% 7 Missing ⚠️
...enlp/experimental/transformers/mixtral/modeling.py 0.00% 5 Missing ⚠️
...lp/experimental/transformers/qwen2_moe/modeling.py 0.00% 1 Missing ⚠️
paddlenlp/experimental/transformers/utils.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #9244   +/-   ##
========================================
  Coverage    52.73%   52.74%           
========================================
  Files          661      661           
  Lines       107422   107371   -51     
========================================
- Hits         56653    56630   -23     
+ Misses       50769    50741   -28     

☔ View full report in Codecov by Sentry.

static_cast<uint8_t>(quant_value2 + 128.0f);
}
// write k
// large block: lane_id / 4 / 2
Contributor

Please remove the Chinese comments.

Collaborator Author

There are too many of them; leaving them in does no real harm.
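For context, the quoted kernel line stores a quantized value in the unsigned 8-bit KV cache by shifting the signed int8 range up by 128. A minimal Python sketch of that mapping (the scale factor and the clamping step are assumptions, not the kernel's exact logic):

def quantize_to_uint8(value, scale):
    # Map a real value to the signed int8 range, then shift by 128 so it
    # fits in unsigned 8-bit cache storage, mirroring
    # static_cast<uint8_t>(quant_value2 + 128.0f) in the kernel above.
    q = max(-128.0, min(127.0, round(value / scale)))
    return int(q) + 128  # stored range: 0..255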

weight_scales_loader = EmptyWeightScale(
    weight_scale_map_dict,
    num_of_layers=self.config.num_hidden_layers,
    num_head=self.num_attention_heads,
Contributor

This has never taken effect; what problems does that cause?

Collaborator Author

This was changed while debugging; it is only a rename and has no effect.

for i_layer, weight_scale in enumerate(v):
    weight_scale = weight_scale.astype("float32")
    if self.config.append_attn:
        weight_scale = paddle.to_tensor(weight_scale).cast(paddle.get_default_dtype())
Contributor

Why can fp32 be skipped under append_attn?

Collaborator Author

Because the kernel implementation requires half precision; the memory-access pattern is different.
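In other words, when append_attn is on the scales are cast from float32 down to the default dtype before the kernel reads them. A minimal sketch of that cast (the tensor and the flag here are stand-ins for the real config and weights):

import paddle

append_attn = True  # stands in for self.config.append_attn
weight_scale = paddle.rand([8], dtype="float32")  # stand-in for a real scale tensor

if append_attn:
    # The append_attn kernels read the scales at half precision, so the
    # float32 values are cast to the default dtype (float16/bfloat16 in
    # typical inference setups) before being handed to the kernel.
    weight_scale = weight_scale.cast(paddle.get_default_dtype())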

print("***********Start Benchmark**********")

warmup_time = 10
test_time = 100
Contributor

What is this change for?

Collaborator Author

This change has no real effect; I didn't notice it had slipped into the commit.

Contributor

Please remember to revert it in the next PR.
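For reference, the warmup_time/test_time pair drives the usual warm-up-then-measure pattern; a generic sketch (run_once is a hypothetical stand-in for one inference step):

import time

def benchmark(run_once, warmup_time=10, test_time=100):
    for _ in range(warmup_time):      # warm up: fill caches, trigger lazy init
        run_once()
    start = time.perf_counter()
    for _ in range(test_time):        # measured iterations
        run_once()
    return (time.perf_counter() - start) / test_time  # average seconds per run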

def set_transformer_block(self, transformer_config):
    if self.use_weight_only:
        self.transformer_block = FusedBlockMultiTransformerWeightOnly(transformer_config)
    elif "a8w8" in self.quant_type:
Contributor

Why was this deleted?

Collaborator Author

Because it was never actually supported; a colleague copied it over along with other reference code, so I removed it here while I was at it.
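Based on that reply, the dispatch keeps only the supported paths. A sketch of the resulting shape, inferred from the discussion rather than copied from the diff (the Fused* classes are the existing PaddleNLP ones; the real method may still have further branches):

def set_transformer_block(self, transformer_config):
    # Dispatch to a fused transformer implementation; the "a8w8" branch
    # is gone because that path was never actually supported.
    if self.use_weight_only:
        self.transformer_block = FusedBlockMultiTransformerWeightOnly(transformer_config)
    else:
        self.transformer_block = FusedBlockMultiTransformer(transformer_config)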

@ZHUI merged commit 31c6b9a into PaddlePaddle:develop Oct 23, 2024
lvdongyi pushed a commit to lvdongyi/PaddleNLP that referenced this pull request Oct 23, 2024
* refine paddle::empty(), fix memory error, support multi_stream for attention

* fix and rename attention as append_attention

* rename file
---------

Co-authored-by: lizhenyun <[email protected]>
Co-authored-by: lizhenyun01 <[email protected]>