Add new dataset and new collate and new tokenizer functions #4
Conversation
# Add special tokens
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
segment_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
Why is the optional add_special_tokens deliberately dropped here, and also above where total_len is computed?
yield example

return IterDataset(
    generate_examples(example_iter), label_list=label_list)
You could try changing def generate_examples(example_iter) above to def generate_examples(), changing this return to IterDataset(generate_examples, label_list=label_list), and changing for example in self.data: in IterDataset.__iter__ to for example in self.data(): to fix the problem that IterDataset can only be traversed once.
Also, the for example in example_iter: above may need adjusting as well, to for example in self._read(root):, so that example_iter can also be traversed multiple times. With that change, every traversal re-reads the data file. A sketch of the combined pattern follows.
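A minimal sketch of the suggestion, assuming simplified stand-ins for IterDataset and the reader (read_file here plays the role of self._read):

class IterDataset:
    def __init__(self, data, label_list=None):
        # `data` is a generator *function*, not a generator object,
        # so each traversal builds a fresh generator.
        self.data = data
        self.label_list = label_list

    def __iter__(self):
        for example in self.data():  # note the call: self.data()
            yield example


def read_file(path):
    # Stand-in for self._read(root): re-opens the file on every call,
    # so each traversal re-reads the data file.
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')


def load(path, label_list=None):
    def generate_examples():  # no captured generator argument
        for example in read_file(path):
            yield example

    return IterDataset(generate_examples, label_list=label_list)

With this, iterating the returned dataset twice yields the same examples both times, at the cost of one file read per pass.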
Done
if not is_test:
    return example['input_ids'], example['segment_ids'], len(
        example['input_ids']), label
Could the tokenizer's return value be a namedtuple? That way example.input_ids stays as expressive as the current dict, while fields can also be fetched by index as example[0]; the latter is also more consistent with the pre-existing datasets, and users can choose names or indices to select fields.
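A sketch of the idea; the Example type and field values here are hypothetical:

from collections import namedtuple

Example = namedtuple(
    'Example', ['input_ids', 'segment_ids', 'valid_length', 'label'])

example = Example(input_ids=[101, 2023, 102],
                  segment_ids=[0, 0, 0],
                  valid_length=3,
                  label=1)

assert example.input_ids == example[0]  # by name or by index
input_ids, segment_ids, valid_length, label = example  # or unpack directly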
module_path = DATASETS_MODULE_PATH + name

reader_cls = import_main_class(module_path)
reader_instance = reader_cls(lazy)
Could DatasetBuilder get a lazy class attribute, with each DatasetBuilder subclass providing a default lazy value to match its dataset's preferred loading mode? The lazy passed in here via reader_cls(lazy) could then take higher priority.
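A sketch of the suggested priority scheme, with illustrative subclass names:

class DatasetBuilder:
    lazy = False  # class-level default; subclasses override as needed

    def __init__(self, lazy=None):
        # An explicit constructor argument takes priority over the
        # class default.
        self.lazy = type(self).lazy if lazy is None else lazy


class PTBBuilder(DatasetBuilder):
    lazy = False  # small corpus: load eagerly into a MapDataset


class LargeCorpusBuilder(DatasetBuilder):
    lazy = True  # large corpus: stream examples as an IterDataset


assert LargeCorpusBuilder().lazy is True
assert LargeCorpusBuilder(lazy=False).lazy is False  # caller wins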
Done. That said, not every corpus dataset wants lazy reading; small ones like ptb should use MapDataset, because lazily read data cannot be bucketed when forming batches.
encoded_inputs["position_ids"] = list( | ||
range(len(encoded_inputs["input_ids"]))) | ||
|
||
return encoded_inputs |
As mentioned earlier, could example be returned as a namedtuple, so fields can be fetched both by name and by index? With a namedtuple, though, some thought is needed on how to handle cases like return_position_ids == False where a field is not returned.
The drawback of a namedtuple compared with a dict is that new fields cannot be added easily, which makes slightly more complex convert_to_feature methods awkward.
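One possible way to handle the optional-field case, which also illustrates the extensibility drawback (a hypothetical sketch, Python 3.7+):

from collections import namedtuple

# Declare every possible field and default the optional ones to None,
# so return_position_ids=False still yields a well-formed tuple.
EncodedInputs = namedtuple(
    'EncodedInputs', ['input_ids', 'segment_ids', 'position_ids'],
    defaults=[None])  # only the last field gets a default

with_pos = EncodedInputs([101, 102], [0, 0], [0, 1])
without_pos = EncodedInputs([101, 102], [0, 0])  # position_ids is None

# The drawback in practice: adding a field later means redefining
# EncodedInputs and touching every construction site, whereas a dict
# just grows a new key.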