
Conversation

wj-Mcat
Contributor

@wj-Mcat wj-Mcat commented Jun 27, 2022

PR types

New features

PR changes

Models

Description

Add OPT (Open Pre-trained Transformer Language Models), based on the facebook/opt-* models.
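A quick usage sketch of what the new model enables (the checkpoint name and generate arguments are illustrative; the exact API follows the PR discussion below):

import paddle
from paddlenlp.transformers import GPTTokenizer, OPTForCausalLM

# The OPT checkpoints reuse GPT's BPE tokenizer, as discussed below.
tokenizer = GPTTokenizer.from_pretrained("facebook/opt-125m")
model = OPTForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

input_ids = paddle.to_tensor([tokenizer("My name is")["input_ids"]])
output_ids, _ = model.generate(input_ids, max_length=20, decode_strategy="greedy_search")
print(tokenizer.convert_ids_to_string(output_ids[0].tolist()))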

def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
Contributor

If the code above hasn't been modified, you could just import it from gpt for now.

Contributor Author

OK

Contributor

Do the lines above still differ from GPT's?

Contributor Author

There are still some differences on the 350m model, so this module is not imported.

Contributor

What exactly is different? Supporting both normalize_before=True and False should be enough, right?

@guoshengCS
Contributor

Also remember to add it to transformers/auto/modeling.py.

@guoshengCS
Contributor

One more question: if no tokenizer is added, how can this model be loaded and created via Auto?

@wj-Mcat
Contributor Author

wj-Mcat commented Jun 28, 2022

Since our models will ultimately be named facebook/opt-*, we only need to add the corresponding entries to the gpt tokenizer's pretrained_init_configuration and pretrained_resource_files_map fields.
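For example, entries along these lines (a sketch only; the resource URLs are placeholders, and the vocab_file/merges_file keys assume the GPT BPE tokenizer's resource layout):

# Hypothetical entries added to the GPT tokenizer so that the facebook/opt-* names resolve.
pretrained_init_configuration = {
    "facebook/opt-125m": {},
}
pretrained_resource_files_map = {
    "vocab_file": {
        "facebook/opt-125m": "https://example.com/opt-125m/vocab.json",
    },
    "merges_file": {
        "facebook/opt-125m": "https://example.com/opt-125m/merges.txt",
    },
}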

At the same time, the model names in the gpt tokenizer and in opt modeling must stay consistent so that everything is saved into the same directory correctly.

Besides that, is there anything else to watch out for?

@guoshengCS
Contributor

guoshengCS commented Jun 28, 2022

Since our models will ultimately be named facebook/opt-*, we only need to add the corresponding entries to the gpt tokenizer's pretrained_init_configuration and pretrained_resource_files_map fields.

At the same time, the model names in the gpt tokenizer and in opt modeling must stay consistent so that everything is saved into the same directory correctly.

Besides that, is there anything else to watch out for?

The question is how HF handles this: it doesn't seem to be provided in the code, and the tokenizer config doesn't specify using GPT's tokenizer either.

@wj-Mcat
Contributor Author

wj-Mcat commented Jul 4, 2022

I have completed the work for the 125m ~ 2.7b models, including code alignment and logit alignment. But the 6.7b ~ 66b model weights are too big to fit in a single weight file, so they are split into many weight files in huggingface transformers.

In conclusion, there are some things to do:

  • import the layers from gpt.modeling that are identical to those in opt.modeling, to avoid duplicated code.
  • find a way to handle the very large weight files that are split into multiple pieces (see the sketch below).
  • make OPT loadable via AutoModel.
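A possible way to handle the sharded checkpoints (a sketch only; it assumes the huggingface-style pytorch_model-*.bin shard naming, and renaming keys to the Paddle layout is out of scope here):

import glob
import torch  # only used here to read the HF-format shards in a conversion script

def merge_sharded_checkpoint(shard_dir):
    # Each shard holds a disjoint subset of the parameters, so a plain dict update merges them.
    state_dict = {}
    for shard in sorted(glob.glob(f"{shard_dir}/pytorch_model-*.bin")):
        state_dict.update(torch.load(shard, map_location="cpu"))
    return state_dict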

"pad_token_id": 1,
"num_hidden_layers": 24,
"num_attention_heads": 16,
"max_position_embeddings": 2048
Contributor

If the weights are provided as community (ecosystem) models, these configs shouldn't be placed here either; you can refer to codegen. Some code paths decide whether a model is built-in or a community model based on whether it appears in pretrained_init_configuration, so keeping these here could cause confusion.

Contributor Author

Yes, I have removed the OPT configuration; it will be loaded automatically from the corresponding path.

hidden_dropout_prob: float = 0.1,
max_position_embeddings: int = 512,
type_vocab_size: Optional[int] = None,
initializer_range=0.02):
Contributor

If OPT doesn't have this, just drop initializer_range; there is no need to stay consistent with GPT. What we mainly want to keep consistent with GPT is the TransformerDecoder part, so that some of the FasterGeneration work can be reused.

Contributor Author

@wj-Mcat wj-Mcat Jul 8, 2022

OPT's original configuration uses init_std, which serves the same purpose as initializer_range in GPT, so this parameter was added.
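As a minimal sketch of that equivalence (the helper below is illustrative, not the PR's actual initialization code):

import paddle

def init_weights(layer, initializer_range=0.02):
    # OPT's init_std plays the same role as GPT's initializer_range: the std of the
    # normal initializer used for Linear and Embedding weights.
    if isinstance(layer, (paddle.nn.Linear, paddle.nn.Embedding)):
        layer.weight.set_value(
            paddle.normal(mean=0.0, std=initializer_range, shape=layer.weight.shape))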

return decoder_outputs


class OPTForPretraining(OPTPretrainedModel):
Contributor

If HF doesn't have these ForPretraining and PretrainingCriterion classes either, don't add them for now.

Contributor Author

OK, I will remove it.

raise e


OPTForCausalLM = OPTLMHeadModel
Contributor

Just use OPTForCausalLM directly instead of OPTLMHeadModel above; GPT keeps the extra name mainly for backward compatibility.

Contributor Author

OK, I will rename it.

decode_strategy = kwargs.get('decode_strategy')
if decode_strategy == "beam_search":
    raise AttributeError(
        "'beam_search' is not supported yet in the faster version of GPT"
Contributor

GPT->OPT

Contributor

If the Faster part isn't finished yet, remove it for now and merge without it.

Contributor Author

Integration with FasterGeneration will be done in the next PR, so let's remove it.

Contributor Author

At this time, the OPT model cannot work with FasterGeneration, so we should disable the prepare_faster_entry method. I raise an error at the top of this method; what do you think? @guoshengCS

def prepare_faster_entry(self, kwargs):
    # TODO(wj-Mcat): this error will be removed when opt can play with FasterGeneration. 
    raise AttributeError(
        "FasterGeneration is not supported in OPT Model, please keep eyes on the latest feature of PaddleNLP"
    )
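    # NOTE: everything below is kept for the upcoming FasterOPT integration,
    # but it is unreachable until the raise above is removed.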

    from paddlenlp.ops import FasterOPT
    use_fp16_decoding = kwargs.get('use_fp16_decoding', False)
    decode_strategy = kwargs.get('decode_strategy')
    
    if decode_strategy == "beam_search":
        raise AttributeError(
            "'beam_search' is not supported yet in the faster version of OPT"
        )
    # Currently, FasterTransformer only support restricted size_per_head.
    size_per_head = self.opt.config["hidden_size"] // self.opt.config[
        "num_attention_heads"]
    if size_per_head not in [32, 64, 80, 96, 128]:
        raise AttributeError(
            "'size_per_head = %d' is not supported yet in the faster version of OPT"
            % size_per_head)
    if kwargs['forced_bos_token_id'] is not None:
        # not support for min_length yet in the faster version
        raise AttributeError(
            "'forced_bos_token_id != None' is not supported yet in the faster version"
        )
    if kwargs['min_length'] != 0:
        # not support for min_length yet in the faster version
        raise AttributeError(
            "'min_length != 0' is not supported yet in the faster version")
    self._faster_entry = FasterOPT(
        self, use_fp16_decoding=use_fp16_decoding).forward
    return self._faster_entry

word_embed_proj_dim: int,
norm: Optional[Layer] = None,
normalize_before: bool = False,
remove_final_layer_norm: bool = False):
Contributor

Is remove_final_layer_norm still needed now? HF keeps it mainly for compatibility with earlier released weights; the latest weights shouldn't need it, right?

Contributor Author

Yes, we should remove it. I will do it in the next commit.

Contributor Author

    def gen_cache(self, memory):
        incremental_cache = self.self_attn.gen_cache(memory,
                                                     type=self.self_attn.Cache)
        return incremental_cache

This method behaves the same as in GPT, but the class itself behaves differently.

def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
Contributor

Do the lines above still differ from GPT's?

Contributor Author

@wj-Mcat wj-Mcat left a comment

I have finished all the code and the local tests for loading, saving, and generation.

So, please review it when you are free. @guoshengCS

Comment on lines +288 to +290
if not init_class:
    init_class = init_kwargs.pop("tokenizer_class", None)

Contributor Author

Both init_class and tokenizer_class are supported in this module; you can refer to:

init_class = init_kwargs.pop("init_class", None)
if init_class is None:
    init_class = init_kwargs.pop("tokenizer_class", None)
if init_class:
    class_name = cls._name_mapping[init_class]
    import_class = importlib.import_module(
        f"paddlenlp.transformers.{class_name}.tokenizer")
    tokenizer_class = getattr(import_class, init_class)
    logger.info(
        "We are using %s to load '%s'." %
        (tokenizer_class, pretrained_model_name_or_path))
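So, as a usage sketch (this assumes the saved opt checkpoint directory ships a tokenizer_config.json whose init_class / tokenizer_class field points at GPTTokenizer):

from paddlenlp.transformers import AutoTokenizer

# tokenizer_config.json is assumed to contain e.g. {"tokenizer_class": "GPTTokenizer", ...},
# so the lookup above resolves to paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")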

def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
Contributor Author

There are still some differences on the 350m model, so this module is not imported.

@@ -1,4 +1,4 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.k
Contributor

Is the "All Rights Reserved.k" here a typo?

Contributor

Please double-check this.

Contributor Author

Yes, I indeed hadn't noticed this; it has been fixed on my side.

def gen_cache(self, memory):
    incremental_cache = self.self_attn.gen_cache(memory,
                                                 type=self.self_attn.Cache)
    return incremental_cache
Contributor

What exactly is different? Supporting both normalize_before=True and False should be enough, right?

@wj-Mcat
Contributor Author

wj-Mcat commented Jul 12, 2022

The main point under discussion now is why the TransformerDecoderLayer class is not imported from GPT and reused. The key difference is the activation function:

  • In GPT, gelu is used and it is hard-coded:
        tgt = self.dropout2(
            self.linear2(F.gelu(self.linear1(tgt), approximate=True)))
        tgt = residual + tgt
  • In OPT, the activation function is obtained from the configuration and applied dynamically during computation:
        tgt = self.dropout2(self.linear2(self.activation(self.linear1(tgt))))
        tgt = residual + tgt

So at the code level this class is not imported directly.
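A minimal sketch of the configurable version (an illustrative class only, not the PR's actual TransformerDecoderLayer; resolving the activation by name via getattr is one common pattern):

import paddle
import paddle.nn.functional as F

class FFNBlock(paddle.nn.Layer):
    def __init__(self, d_model, dim_feedforward, activation="relu", act_dropout=0.1):
        super().__init__()
        self.linear1 = paddle.nn.Linear(d_model, dim_feedforward)
        self.linear2 = paddle.nn.Linear(dim_feedforward, d_model)
        self.dropout2 = paddle.nn.Dropout(act_dropout)
        # Look the activation up from the config instead of hard-coding gelu.
        self.activation = getattr(F, activation)

    def forward(self, tgt):
        residual = tgt
        tgt = self.dropout2(self.linear2(self.activation(self.linear1(tgt))))
        return residual + tgt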

@wj-Mcat
Contributor Author

wj-Mcat commented Jul 13, 2022

ping @guoshengCS

@wj-Mcat
Contributor Author

wj-Mcat commented Aug 8, 2022

ping @guoshengCS

**OPT** (opt, batch_size=4, max_length=32)

<p align="left">
<img src="../docs/imgs/opt_perf.png" width="800" height ="400" />
Contributor

See if these images can be uploaded to a GitHub issue or PR and referenced here via GitHub links, to keep the repo size down.

Contributor Author

OK, I will adjust these two.

Contributor Author

@wj-Mcat wj-Mcat Aug 8, 2022

Uploaded the image files:

  • GPT Performance

gpt_perf

  • OPT Performance

opt_perf

  • Bart Performance

bart_perf


demo = Demo(model_name_or_path="facebook/opt-1.3b",
            max_predict_len=10,
            repetition_penalty=1.2)
Contributor

Why introduce the repetition_penalty parameter here? Does the output quality suffer without it?

Contributor Author

It's not necessary; I will remove it in the next commit.

        pos_emb, linear_weight, normalize_before, topk, topp,
        max_out_len, head_num, size_per_head, num_layer, bos_id,
        eos_id, temperature, use_fp16_decoding):
    helper = LayerHelper('fusion_opt', **locals())
Contributor

Note: please migrate this part later following #2795.

@guoshengCS guoshengCS merged commit 70e6a31 into PaddlePaddle:develop Aug 9, 2022