[NEW MODEL] Add OPT model #2659
Conversation
def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
If none of these need changes, you can simply import them from the gpt module first.
OK
Do the parts above still differ from GPT's?
There are still some differences for the 350m model, so this module is not imported.
What is different? Supporting both normalize_before=True/False should be enough, right?
Also remember to register it in transformers/auto/modeling.py.
Another question: if we do not add a tokenizer, how can this model be loaded and created via Auto?
Given how our model is ultimately named, we must also make sure the model names used by the GPT tokenizer and in the OPT modeling code stay consistent, so that they can be saved into the same directory correctly. Apart from that, is there anything else to pay attention to?
How does HF handle this? It does not seem to be provided in the code, and the tokenizer config does not specify using GPT's tokenizer either.
I have completed the work for the 125m ~ 2.7b models, including code alignment and logit alignment. However, the 6.7b ~ 66b model weights are too large for a single weight file, so Hugging Face Transformers splits them across multiple shard files (a sketch of merging the shards follows below). In conclusion, there are still some things to do:
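One of those remaining items is handling the sharded checkpoints. A minimal sketch, assuming the standard Hugging Face shard index layout (pytorch_model.bin.index.json); the actual HF-to-Paddle weight conversion is out of scope here and the function name is illustrative:

import json
import os

import torch  # used only to read the Hugging Face shard files


def merge_hf_shards(hf_model_dir):
    # The index file maps every parameter name to the shard file that stores it;
    # iterating over the unique shard files rebuilds the full state dict in memory.
    index_path = os.path.join(hf_model_dir, "pytorch_model.bin.index.json")
    with open(index_path, "r", encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]

    merged_state_dict = {}
    for shard_file in sorted(set(weight_map.values())):
        shard = torch.load(os.path.join(hf_model_dir, shard_file), map_location="cpu")
        merged_state_dict.update(shard)
    return merged_state_dict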
"pad_token_id": 1, | ||
"num_hidden_layers": 24, | ||
"num_attention_heads": 16, | ||
"max_position_embeddings": 2048 |
If the weights are provided as community (ecosystem) models, then these configs should not be placed here either; you can refer to codegen. Some places decide whether something is a built-in model or a community model based on whether it appears in pretrained_init_configuration, so this could cause confusion.
Yes, I have removed the OPT configuration; it will be loaded automatically from the corresponding path.
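For reference, a minimal sketch of the distinction mentioned above, assuming the usual PaddleNLP convention: names listed in pretrained_init_configuration are treated as built-in models, while everything else is resolved as a community/ecosystem model whose config is read from its own path (the load_config_from_path helper below is hypothetical):

import copy


def resolve_init_configuration(model_cls, model_name, load_config_from_path):
    # Built-in model: its config ships with the code in pretrained_init_configuration.
    if model_name in model_cls.pretrained_init_configuration:
        return copy.deepcopy(model_cls.pretrained_init_configuration[model_name])
    # Community/ecosystem model: the config is read from model_config.json under the
    # given name or path; load_config_from_path stands in for that loading logic.
    return load_config_from_path(model_name)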
hidden_dropout_prob: float = 0.1,
max_position_embeddings: int = 512,
type_vocab_size: Optional[int] = None,
initializer_range=0.02):
If OPT does not have initializer_range and the like, just remove them; there is no need to stay consistent with GPT. The part we want to keep consistent with GPT is mainly the TransformerDecoder, so that some of FasterGeneration can be reused.
The original OPT configuration uses init_std, which serves the same purpose as initializer_range in GPT, so this parameter was kept.
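For illustration, a small sketch of the config-key mapping this implies when converting a Hugging Face OPT config to the Paddle argument names; the mapping dict and function are assumptions, not code from this PR:

# Hypothetical mapping from Hugging Face OPT config keys to the argument names
# used by the Paddle implementation.
HF_TO_PADDLE_KEYS = {
    "init_std": "initializer_range",
    "ffn_dim": "intermediate_size",
    "num_attention_heads": "num_attention_heads",
}


def convert_config(hf_config):
    # Keep only the keys present in the source config and rename them.
    return {
        paddle_key: hf_config[hf_key]
        for hf_key, paddle_key in HF_TO_PADDLE_KEYS.items()
        if hf_key in hf_config
    }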
        return decoder_outputs


class OPTForPretraining(OPTPretrainedModel):
If HF does not have these ForPretraining and PretrainingCriterion classes, do not add them for now.
OK, I will remove it.
        raise e


OPTForCausalLM = OPTLMHeadModel
Just name the OPTLMHeadModel above OPTForCausalLM directly; GPT's naming is mainly kept for backward compatibility.
OK, I will rename it.
decode_strategy = kwargs.get('decode_strategy')
if decode_strategy == "beam_search":
    raise AttributeError(
        "'beam_search' is not supported yet in the faster version of GPT"
GPT->OPT
If the Faster part is not finished yet, remove it for now and merge first.
Integration with FasterGeneration will be done in the next PR, so let's remove it.
At this time, the OPT model cannot work with FasterGeneration, so we should disable the prepare_faster_entry method. I raise an error at the top of this method; what do you think about it? @guoshengCS
def prepare_faster_entry(self, kwargs):
    # TODO(wj-Mcat): this error will be removed when opt can play with FasterGeneration.
    raise AttributeError(
        "FasterGeneration is not supported in OPT Model, please keep eyes on the latest feature of PaddleNLP"
    )
    from paddlenlp.ops import FasterOPT
    use_fp16_decoding = kwargs.get('use_fp16_decoding', False)
    decode_strategy = kwargs.get('decode_strategy')
    if decode_strategy == "beam_search":
        raise AttributeError(
            "'beam_search' is not supported yet in the faster version of OPT"
        )
    # Currently, FasterTransformer only supports restricted size_per_head.
    size_per_head = self.opt.config["hidden_size"] // self.opt.config[
        "num_attention_heads"]
    if size_per_head not in [32, 64, 80, 96, 128]:
        raise AttributeError(
            "'size_per_head = %d' is not supported yet in the faster version of OPT"
            % size_per_head)
    if kwargs['forced_bos_token_id'] is not None:
        # not support for forced_bos_token_id yet in the faster version
        raise AttributeError(
            "'forced_bos_token_id != None' is not supported yet in the faster version"
        )
    if kwargs['min_length'] != 0:
        # not support for min_length yet in the faster version
        raise AttributeError(
            "'min_length != 0' is not supported yet in the faster version")
    self._faster_entry = FasterOPT(
        self, use_fp16_decoding=use_fp16_decoding).forward
    return self._faster_entry
word_embed_proj_dim: int,
norm: Optional[Layer] = None,
normalize_before: bool = False,
remove_final_layer_norm: bool = False):
Is remove_final_layer_norm still needed now? HF keeps it mainly for compatibility with earlier weights; the latest checkpoints should not need it anymore, right?
Yes, we should remove it. I will do it in the next commit.
def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
This method behaves the same as in GPT, but the class as a whole behaves differently.
def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
Do the parts above still differ from GPT's?
I have finished all the code and the local tests covering loading, saving, and generation. Please review it when you are free. @guoshengCS
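For reference, a minimal sketch of that load/save/generate round trip; the model name, the use of GPTTokenizer, and the exact generate arguments are assumptions and may differ from the actual tests depending on the PaddleNLP version:

import paddle
from paddlenlp.transformers import GPTTokenizer, OPTForCausalLM

# Loading: resolve the community weights by name (assumed to be available).
tokenizer = GPTTokenizer.from_pretrained("facebook/opt-125m")
model = OPTForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

# Generation: greedy decoding on a short prompt.
input_ids = paddle.to_tensor([tokenizer("Hello, my name is")["input_ids"]])
output_ids, _ = model.generate(input_ids,
                               max_length=20,
                               decode_strategy="greedy_search")
print(tokenizer.convert_ids_to_string(output_ids[0].tolist()))

# Saving: model and tokenizer must land in the same directory to reload together.
model.save_pretrained("./opt-125m-paddle")
tokenizer.save_pretrained("./opt-125m-paddle")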
if not init_class:
    init_class = init_kwargs.pop("tokenizer_class", None)
Both init_class and tokenizer_class are supported in this module; you can refer to:
PaddleNLP/paddlenlp/transformers/auto/tokenizer.py, lines 230 to 240 in 548d59a
init_class = init_kwargs.pop("init_class", None)
if init_class is None:
    init_class = init_kwargs.pop("tokenizer_class", None)
if init_class:
    class_name = cls._name_mapping[init_class]
    import_class = importlib.import_module(
        f"paddlenlp.transformers.{class_name}.tokenizer")
    tokenizer_class = getattr(import_class, init_class)
    logger.info(
        "We are using %s to load '%s'." %
        (tokenizer_class, pretrained_model_name_or_path))
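Given that lookup, one way to address the earlier Auto-loading question (an assumption, not something this PR already does) is to ship a tokenizer_config.json next to the OPT weights that names the GPT tokenizer class:

import json

# Assumed contents of tokenizer_config.json saved alongside the OPT weights, so that
# AutoTokenizer can resolve GPTTokenizer through the "tokenizer_class" key read above.
with open("tokenizer_config.json", "w", encoding="utf-8") as f:
    json.dump({"tokenizer_class": "GPTTokenizer"}, f, indent=2)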
def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
There are still some differences for the 350m model, so this module is not imported.
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.k
Is the "All Rights Reserved.k" here a typo?
Please take another look at this.
Yes, I indeed missed this; it has already been fixed on my side.
def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
What is different? Supporting both normalize_before=True/False should be enough, right?

What we are mainly discussing now is the difference between

    tgt = self.dropout2(
        self.linear2(F.gelu(self.linear1(tgt), approximate=True)))
    tgt = residual + tgt

and

    tgt = self.dropout2(self.linear2(self.activation(self.linear1(tgt))))
    tgt = residual + tgt

which is why this class is not imported directly at the code level.
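For clarity, a minimal sketch of the two feed-forward variants being contrasted, assuming GPT's decoder layer hard-codes approximate GELU while OPT needs a configurable activation (ReLU in its released checkpoints); the class names are illustrative:

import paddle.nn as nn
import paddle.nn.functional as F


class GPTStyleFFN(nn.Layer):
    # The activation is fixed to approximate GELU inside the layer.
    def __init__(self, d_model, dim_feedforward, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, tgt):
        residual = tgt
        tgt = self.dropout2(
            self.linear2(F.gelu(self.linear1(tgt), approximate=True)))
        return residual + tgt


class ConfigurableFFN(nn.Layer):
    # The activation is passed in, so OPT can use ReLU while GPT keeps GELU.
    def __init__(self, d_model, dim_feedforward, activation=F.relu, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.activation = activation

    def forward(self, tgt):
        residual = tgt
        tgt = self.dropout2(self.linear2(self.activation(self.linear1(tgt))))
        return residual + tgt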
ping @guoshengCS
faster_generation/README.md
**OPT** (opt, batch_size=4, max_length=32)

<p align="left">
<img src="../docs/imgs/opt_perf.png" width="800" height="400" />
Please check whether these images can be uploaded to a GitHub issue or PR and referenced here via GitHub links, to keep the repo size under control.
OK, I will adjust these two.
examples/text_generation/opt/demo.py
demo = Demo(model_name_or_path="facebook/opt-1.3b",
            max_predict_len=10,
            repetition_penalty=1.2)
Why introduce the repetition_penalty parameter here? Is the output problematic without it?
It is not necessary; I will remove it in the next commit.
        pos_emb, linear_weight, normalize_before, topk, topp,
        max_out_len, head_num, size_per_head, num_layer, bos_id,
        eos_id, temperature, use_fp16_decoding):
    helper = LayerHelper('fusion_opt', **locals())
Note: please migrate this later following #2795.
PR types
New features
PR changes
Models
Description
Add OPT (Open Pre-trained Transformer Language Models), based on the facebook/opt-* models.