Conversation

@yuanlehome (Collaborator) commented Oct 11, 2024

PR types

New features

PR changes

Others

Description

This PR refactors the attention network construction for large-model inference. The new append_attn scheme delivers a 10% to 90% performance improvement over the previous scheme.

Inference is currently supported for llama/qwen/qwen-moe/mixtral.

Usage: in the original inference script, replace the --block_attn option with --append_attn.
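
A minimal usage sketch of the flag switch. Only --append_attn (and the old --block_attn it replaces) come from this PR; the entry-point script path, model name, and other flags below are illustrative placeholders, not part of this change.

```python
import subprocess

# Launch the inference script with the new attention scheme.
# Everything except --append_attn / --block_attn is a hypothetical example value.
cmd = [
    "python", "./predict/predictor.py",                      # hypothetical entry point
    "--model_name_or_path", "meta-llama/Llama-2-7b-chat",    # hypothetical model
    "--dtype", "float16",                                    # hypothetical precision flag
    # "--block_attn",                                        # old scheme (before this PR)
    "--append_attn",                                         # new scheme introduced by this PR
]
subprocess.run(cmd, check=True)
```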

TODO:

  • FP8 inference support
  • Add performance data; numbers to follow in the llm docs

@paddle-bot bot commented Oct 11, 2024

Thanks for your contribution!

@codecov bot commented Oct 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 60 lines in your changes missing coverage. Please review.

Project coverage is 52.74%. Comparing base (fe8b527) to head (84a6864).
Report is 264 commits behind head on develop.

Files with missing lines                                    Patch %   Lines
...erimental/transformers/fused_transformer_layers.py      0.00%     38 Missing ⚠️
...dlenlp/experimental/transformers/qwen2/modeling.py      0.00%     8 Missing ⚠️
...dlenlp/experimental/transformers/llama/modeling.py      0.00%     7 Missing ⚠️
...enlp/experimental/transformers/mixtral/modeling.py      0.00%     5 Missing ⚠️
...lp/experimental/transformers/qwen2_moe/modeling.py      0.00%     1 Missing ⚠️
paddlenlp/experimental/transformers/utils.py               0.00%     1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #9244   +/-   ##
========================================
  Coverage    52.73%   52.74%           
========================================
  Files          661      661           
  Lines       107422   107371   -51     
========================================
- Hits         56653    56630   -23     
+ Misses       50769    50741   -28     

☔ View full report in Codecov by Sentry.