Steffy-zxf
Contributor

PR types

New features

PR changes

APIs

Describe

  • add the ChnSentiCorp & LCQMC datasets to paddlenlp.experimental.datasets
  • update the text_cls & text_matching examples to use them

Contributor

@smallv0221 left a comment

I don't seem to see the newly added datasets.

example(obj:`list[str]`): List of input data, containing query, title and label if it have label.
tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
which contains most of the methods. Users should refer to the superclass for more information regarding methods.
label_list(obj:`list[str]`): All the labels that the data has.

Contributor

Is this no longer needed?

Contributor Author

Removed.

encoded_inputs = tokenizer(
    text=example["text"],
    max_seq_len=max_seq_length,
    pad_to_max_seq_len=True)

Contributor

Same as above.

Contributor Author

Same reply as above.


- query, title = example[0], example[1]
+ query, title = example["query"], example["title"]
  query_ids = np.array(tokenizer.encode(query), dtype="int64")

Contributor

Let's switch these uniformly to the __call__() method.

Contributor Author

This is a JiebaTokenizer, not a PretrainedTokenizer, so no change is needed.
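
For readers following the thread, the distinction can be sketched with two toy classes. These are hypothetical stand-ins, not the PaddleNLP implementations: the only facts taken from the diff are that the word-level tokenizer is used through encode() and the pretrained-style tokenizer is invoked directly with text= and max_seq_len=. The class names, the whitespace splitting, and the "input_ids" key are assumptions made for illustration.

```python
class ToyWordTokenizer:
    """Hypothetical stand-in for a vocabulary-based tokenizer that only has encode()."""

    def __init__(self, vocab, unk_id=0):
        self.vocab = vocab
        self.unk_id = unk_id

    def encode(self, sentence):
        # Whitespace splitting stands in for real word segmentation.
        return [self.vocab.get(word, self.unk_id) for word in sentence.split()]


class ToyPretrainedTokenizer(ToyWordTokenizer):
    """Hypothetical stand-in for a tokenizer that is invoked through __call__()."""

    def __call__(self, text, max_seq_len=None):
        ids = self.encode(text)
        if max_seq_len is not None:
            ids = ids[:max_seq_len]  # truncate to the requested length
        return {"input_ids": ids}    # assumed output key, for illustration only


vocab = {"[UNK]": 0, "good": 1, "hotel": 2}
word_tok = ToyWordTokenizer(vocab)
pre_tok = ToyPretrainedTokenizer(vocab)

query_ids = word_tok.encode("good hotel")
encoded_inputs = pre_tok(text="good hotel nearby", max_seq_len=2)

print(query_ids)                    # [1, 2]
print(encoded_inputs["input_ids"])  # [1, 2]
print(callable(word_tok))           # False
```

Because the first toy class defines no __call__(), switching its call sites to the callable form would simply fail, which is the point of the reply above.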

Member

@ZeyuChen left a comment

We need to discuss when to use the data.Pad API and when to use the tokenizer's pad_to_max_seq option.
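
To make the trade-off under discussion concrete, here is a minimal sketch in plain Python rather than the PaddleNLP API. The helper names pad_to_fixed_length and pad_batch are hypothetical. The first pads every example to one fixed length at tokenization time, which is roughly what a pad-to-max option implies; the second pads each batch only to its own longest sequence at collate time, which is the role a batch-level padding function usually plays.

```python
def pad_to_fixed_length(ids, max_seq_len, pad_id=0):
    """Pad (or truncate) one sequence to exactly max_seq_len tokens."""
    ids = list(ids)[:max_seq_len]
    return ids + [pad_id] * (max_seq_len - len(ids))


def pad_batch(batch, pad_id=0):
    """Pad every sequence in a batch to the longest sequence in that batch."""
    longest = max(len(ids) for ids in batch)
    return [list(ids) + [pad_id] * (longest - len(ids)) for ids in batch]


batch = [[5, 8, 2], [7, 1], [9, 4, 6, 3]]

fixed = [pad_to_fixed_length(ids, max_seq_len=8) for ids in batch]
dynamic = pad_batch(batch)

print(len(fixed[0]))    # 8
print(len(dynamic[0]))  # 4
print(dynamic[1])       # [7, 1, 0, 0]
```

With fixed-length padding every batch has width 8 regardless of content, while batch-level padding here only needs width 4, so short batches carry fewer padding tokens.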

- train_ds, dev_ds, test_ds = ChnSentiCorp.get_datasets(
-     ['train', 'dev', 'test'])
+ train_ds, dev_ds, test_ds = load_dataset(
+     "chnsenticorp", splits=["train", "dev", "test"], lazy=False)

Member

Is lazy=False the default? @smallv0221 Do we require lazy=True only in the Iterable scenario?

Contributor

Yes, it is the default. lazy=True is only needed in the Iterable scenario.
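
The difference between the two modes can be illustrated with a small sketch in plain Python. It does not call load_dataset; the load_split and read_examples helpers and the tab-separated input format are hypothetical, chosen only to show an eager, indexable dataset next to a lazy, iterate-only one.

```python
def read_examples(lines):
    """Yield one parsed example per line of the form 'label<TAB>text'."""
    for line in lines:
        label, text = line.rstrip("\n").split("\t", 1)
        yield {"text": text, "label": int(label)}


def load_split(lines, lazy=False):
    """Hypothetical loader: a list when lazy=False, a generator when lazy=True."""
    examples = read_examples(lines)
    return examples if lazy else list(examples)


raw = ["1\tgreat hotel\n", "0\tpoor service\n"]

eager_ds = load_split(raw)            # map-style: supports len() and indexing
lazy_ds = load_split(raw, lazy=True)  # iterable-style: can only be iterated

print(len(eager_ds), eager_ds[1]["label"])  # 2 0
print(next(iter(lazy_ds))["text"])          # great hotel
```

The eager form reads everything up front and allows random access and shuffling by index; the lazy form parses examples only as they are consumed, which suits data that is streamed or too large to hold in memory.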

Contributor

@smallv0221 left a comment

LGTM @ZeyuChen

Member

@ZeyuChen left a comment

LGTM

@ZeyuChen merged commit 102ddf3 into PaddlePaddle:develop on Mar 9, 2021