Add new dataset and new collate and new tokenizer functions #4
Conversation
# Add special tokens
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
segment_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
Why is the optional add_special_tokens deliberately dropped here, and also above where total_len is computed?
yield example

return IterDataset(
    generate_examples(example_iter), label_list=label_list)
You could try changing def generate_examples(example_iter) above to def generate_examples(), changing this return to IterDataset(generate_examples, label_list=label_list), and changing for example in self.data: in IterDataset.__iter__ to for example in self.data(): to fix the problem that IterDataset can only be traversed once.
Also, the for example in example_iter: above may need adjusting as well, to for example in self._read(root):, so that example_iter can also be traversed multiple times. With that change, every traversal re-reads the data file. A sketch of the combined pattern follows.
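A minimal sketch of the suggestion, assuming simplified stand-ins for IterDataset and the reader (read_file here plays the role of self._read):

class IterDataset:
    def __init__(self, data, label_list=None):
        # `data` is a generator *function*, not a generator object,
        # so each traversal builds a fresh generator.
        self.data = data
        self.label_list = label_list

    def __iter__(self):
        for example in self.data():  # note the call: self.data()
            yield example


def read_file(path):
    # Stand-in for self._read(root): re-opens the file on every call,
    # so each traversal re-reads the data file.
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')


def load(path, label_list=None):
    def generate_examples():  # no captured generator argument
        for example in read_file(path):
            yield example

    return IterDataset(generate_examples, label_list=label_list)

With this, iterating the returned dataset twice yields the same examples both times, at the cost of one file read per pass.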
Done
if not is_test:
    return example['input_ids'], example['segment_ids'], len(
        example['input_ids']), label
Could the tokenizer's return value be a namedtuple? That way example.input_ids stays as expressive as the current dict, while fields can also be fetched by index as example[0]; the latter is also more consistent with the pre-existing datasets, and users can choose names or indices to select fields.
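A sketch of the idea; the Example type and field values here are hypothetical:

from collections import namedtuple

Example = namedtuple(
    'Example', ['input_ids', 'segment_ids', 'valid_length', 'label'])

example = Example(input_ids=[101, 2023, 102],
                  segment_ids=[0, 0, 0],
                  valid_length=3,
                  label=1)

assert example.input_ids == example[0]  # by name or by index
input_ids, segment_ids, valid_length, label = example  # or unpack directly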
module_path = DATASETS_MODULE_PATH + name

reader_cls = import_main_class(module_path)
reader_instance = reader_cls(lazy)
Could DatasetBuilder get a lazy class attribute, with each DatasetBuilder subclass providing a default lazy value to match its dataset's preferred loading mode? The lazy passed in here via reader_cls(lazy) could then take higher priority.
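A sketch of the suggested priority scheme, with illustrative subclass names:

class DatasetBuilder:
    lazy = False  # class-level default; subclasses override as needed

    def __init__(self, lazy=None):
        # An explicit constructor argument takes priority over the
        # class default.
        self.lazy = type(self).lazy if lazy is None else lazy


class PTBBuilder(DatasetBuilder):
    lazy = False  # small corpus: load eagerly into a MapDataset


class LargeCorpusBuilder(DatasetBuilder):
    lazy = True  # large corpus: stream examples as an IterDataset


assert LargeCorpusBuilder().lazy is True
assert LargeCorpusBuilder(lazy=False).lazy is False  # caller wins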
Done. That said, not every corpus dataset wants lazy reading; small ones like ptb should use MapDataset, because lazily read data cannot be bucketed when forming batches.
encoded_inputs["position_ids"] = list( | ||
range(len(encoded_inputs["input_ids"]))) | ||
|
||
return encoded_inputs |
As mentioned earlier, could example be returned as a namedtuple, so fields can be fetched both by name and by index? With a namedtuple, though, some thought is needed on how to handle cases like return_position_ids == False where a field is not returned.
The drawback of a namedtuple compared with a dict is that new fields cannot be added easily, which makes slightly more complex convert_to_feature methods awkward.
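One possible way to handle the optional-field case, which also illustrates the extensibility drawback (a hypothetical sketch, Python 3.7+):

from collections import namedtuple

# Declare every possible field and default the optional ones to None,
# so return_position_ids=False still yields a well-formed tuple.
EncodedInputs = namedtuple(
    'EncodedInputs', ['input_ids', 'segment_ids', 'position_ids'],
    defaults=[None])  # only the last field gets a default

with_pos = EncodedInputs([101, 102], [0, 0], [0, 1])
without_pos = EncodedInputs([101, 102], [0, 0])  # position_ids is None

# The drawback in practice: adding a field later means redefining
# EncodedInputs and touching every construction site, whereas a dict
# just grows a new key.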