smallv0221
Contributor

No description provided.

# Add special tokens

sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
segment_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
Contributor

Here, and above where total_len is computed, why was the optional add_special_tokens deliberately removed?

yield example

return IterDataset(
generate_examples(example_iter), label_list=label_list)
Contributor

@guoshengCS guoshengCS Feb 8, 2021

You could try changing def generate_examples(example_iter) above to def generate_examples(), changing the return here to IterDataset(generate_examples, label_list=label_list), and changing for example in self.data: in IterDataset.__iter__ to for example in self.data():. That should fix the problem that an IterDataset can only be iterated once.

Contributor

@guoshengCS guoshengCS Feb 8, 2021

Also, the for example in example_iter: loop above probably needs adjusting too: it has to become for example in self._read(root): so that example_iter can also be iterated more than once. The cost is that every pass re-reads the data file.

Contributor Author

Done
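
The re-iteration fix suggested in this thread can be sketched roughly as follows. This is a minimal illustration, not PaddleNLP's actual IterDataset: the class here is a toy stand-in, and read_examples with its in-memory data is a hypothetical reader. The idea is that the dataset stores a zero-argument callable and calls it on every pass, instead of holding a generator that is exhausted after one pass.

```python
class IterDataset:
    """Toy stand-in for the IterDataset discussed above (an assumption,
    not the library's real class)."""

    def __init__(self, data, label_list=None):
        # `data` is a zero-argument callable returning a fresh iterator.
        self.data = data
        self.label_list = label_list

    def __iter__(self):
        # Calling self.data() restarts the generator on every pass.
        for example in self.data():
            yield example


def read_examples():
    # Hypothetical reader; a real one would re-open the data file here,
    # which is why each pass re-reads the file.
    for text, label in [("good", 1), ("bad", 0)]:
        yield {"text": text, "label": label}


dataset = IterDataset(read_examples, label_list=[0, 1])
first_pass = list(dataset)
second_pass = list(dataset)  # same examples again, not an empty list

# For contrast: a generator object can only be consumed once.
one_shot = read_examples()
consumed = list(one_shot)
leftover = list(one_shot)  # empty, the generator is exhausted
```

Passing the function itself rather than its result is what makes the second pass work, at the price of repeating the read on each iteration.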

if not is_test:
return input_ids, segment_ids, valid_length, label
return example['input_ids'], example['segment_ids'], len(example[
'input_ids']), label
Contributor

Could the tokenizer's return value be a namedtuple? Fields could then be accessed as example.input_ids, which is as expressive as the current dict, and also by index as example[0], which is more consistent with the existing datasets. Users could choose whether to select fields by name or by index.
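
A small sketch of the access pattern described in this comment, using the standard library's collections.namedtuple. The type name Encoded and the sample ids are made up for illustration; the field names mirror the dict keys visible in the diff.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the dict keys in the diff.
Encoded = namedtuple("Encoded", ["input_ids", "segment_ids"])

example = Encoded(input_ids=[101, 2023, 102], segment_ids=[0, 0, 0])

by_name = example.input_ids   # attribute access, reads like the dict form
by_index = example[0]         # positional access, like a plain tuple

# Tuple unpacking also works, matching datasets that return plain tuples.
input_ids, segment_ids = example
```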

module_path = DATASETS_MODULE_PATH + name

reader_cls = import_main_class(module_path)
reader_instance = reader_cls(lazy)
Contributor

Could DatasetBuilder get a lazy class attribute, with each DatasetBuilder subclass supplying its own default lazy value so that different datasets can have different default loading modes? The value passed in here via reader_cls(lazy) would then take higher priority.

Contributor Author

Done. That said, not every corpus dataset should be read lazily. A small one such as ptb should use MapDataset, because lazy reading makes it impossible to fold the corpus into batches.
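
The class-attribute default discussed in this thread might look like the sketch below. It is an assumption about the shape of the design, not the merged code: the DatasetBuilder here is a bare stand-in, and BigCorpus and SmallCorpus are hypothetical subclasses.

```python
class DatasetBuilder:
    # Class-level default; subclasses override it per dataset.
    lazy = False

    def __init__(self, lazy=None):
        # An explicit argument takes priority over the class default.
        if lazy is not None:
            self.lazy = lazy


class BigCorpus(DatasetBuilder):
    # Hypothetical large dataset that prefers streaming by default.
    lazy = True


class SmallCorpus(DatasetBuilder):
    # Hypothetical small dataset that is loaded fully into memory by default.
    lazy = False


default_big = BigCorpus().lazy               # falls back to the class attribute
forced_eager = BigCorpus(lazy=False).lazy    # caller override wins
forced_lazy = SmallCorpus(True).lazy         # positional form, as in reader_cls(lazy)
```

Using None as the "not specified" sentinel keeps an explicit False distinguishable from an omitted argument.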

encoded_inputs["position_ids"] = list(
range(len(encoded_inputs["input_ids"])))

return encoded_inputs
Contributor

@guoshengCS guoshengCS Feb 8, 2021

As mentioned earlier, could example be returned as a namedtuple, so that fields can be accessed both by name and by index? With a namedtuple, though, we would need to think about how to handle cases such as return_position_ids == False, where a field is not returned.

Contributor Author

Compared with a dict, the drawback of a namedtuple is that new fields cannot easily be added, which makes slightly more complex convert_to_feature methods awkward to write.
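
The trade-off raised in the last two comments can be illustrated with a short sketch. The Encoded type and the label field are hypothetical. Filling an unreturned field with None through the defaults argument (available from Python 3.7) is one possible way to handle the return_position_ids == False case; the thread does not adopt it.

```python
from collections import namedtuple

# Hypothetical record; `defaults` applies to the rightmost field, so
# position_ids becomes None when it is not produced.
Encoded = namedtuple(
    "Encoded", ["input_ids", "segment_ids", "position_ids"], defaults=[None])

without_positions = Encoded(input_ids=[1, 2, 3], segment_ids=[0, 0, 0])

# A namedtuple instance has no __dict__, so a later processing step
# cannot attach a new field to it.
try:
    without_positions.label = 1
    added_to_tuple = True
except AttributeError:
    added_to_tuple = False

# A dict accepts the extra key directly.
as_dict = without_positions._asdict()
as_dict["label"] = 1
```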

@ZeyuChen ZeyuChen self-assigned this Feb 8, 2021
@ZeyuChen ZeyuChen merged commit 21dd000 into PaddlePaddle:develop Feb 9, 2021
wawltor referenced this pull request in wawltor/PaddleNLP Mar 4, 2021
@smallv0221 smallv0221 deleted the yxp0207 branch June 17, 2021 03:49
wawltor added a commit that referenced this pull request Sep 7, 2021
Replace np.argmax to paddle.argmax
w5688414 referenced this pull request in w5688414/PaddleNLP Jun 8, 2023
Add deepfloyd_if unitests
GhostScreaming pushed a commit to GhostScreaming/PaddleNLP that referenced this pull request Jun 26, 2023
support sharding stage1 in hybrid parallel.
qingzhong1 pushed a commit to qingzhong1/PaddleNLP that referenced this pull request Sep 26, 2023
bmers pushed a commit to bmers/PaddleNLP that referenced this pull request Oct 22, 2023
DesmonDay pushed a commit to DesmonDay/PaddleNLP that referenced this pull request Sep 23, 2024
ming1753 referenced this pull request in ming1753/PaddleNLP Jan 17, 2025
zhoutianzi666 pushed a commit to zhoutianzi666/PaddleNLP that referenced this pull request Feb 28, 2025
3 participants