base: develop
Add byt5 Model #1742
Conversation
paddlenlp/datasets/__init__.py
Outdated
from .tweetqa import *
from .xsum import *
We have redesigned our datasets module to be fully compatible with HF datasets; the usage follows the HF interface, so there is currently no need to add dataset-related files to the codebase.
OK, I'll take a look.
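For reference, the redesigned interface the reviewer describes is used like the HF datasets API. A minimal sketch, assuming an already-registered dataset name such as 'chnsenticorp' (the name and splits here are only illustrative):

from paddlenlp.datasets import load_dataset

# Load a registered dataset by name; no new builder file is needed in the repo.
train_ds, dev_ds = load_dataset('chnsenticorp', splits=['train', 'dev'])
print(train_ds[0])  # one example as a dict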
paddlenlp/datasets/tweetqa.py
Outdated
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import collections
import json
import os

from paddle.dataset.common import md5file
from paddle.utils.download import get_path_from_url
from paddlenlp.utils.env import DATA_HOME
from . import DatasetBuilder

__all__ = ['tweetqa']


class tweetqa(DatasetBuilder):
    """TweetQA question answering dataset; answers are withheld for the test split."""

    META_INFO = collections.namedtuple('META_INFO', ('file', 'md5', 'URL'))
    SPLITS = {
        'train': META_INFO(
            os.path.join('train.json'), '4e06fd1cfd5e7f0380499df8cbe17237',
            'https://bj.bcebos.com/paddlenlp/datasets/tweetqa/train.json'),
        'dev': META_INFO(
            os.path.join('dev.json'), '9c39d49d25d5296bdc537409208ddc85',
            'https://bj.bcebos.com/paddlenlp/datasets/tweetqa/dev.json'),
        'test': META_INFO(
            os.path.join('test.json'), '9c39d49d25d5296bdc537409208ddc85',
            'https://bj.bcebos.com/paddlenlp/datasets/tweetqa/test.json')
    }

    def _get_data(self, mode, **kwargs):
        # Download the split file if it is missing or its md5 does not match.
        default_root = os.path.join(DATA_HOME, self.__class__.__name__)
        filename, data_hash, URL = self.SPLITS[mode]
        fullname = os.path.join(default_root, filename)
        if not os.path.exists(fullname) or (
                data_hash and not md5file(fullname) == data_hash):
            get_path_from_url(URL, default_root)
        return fullname

    def _read(self, filename, split):
        with open(filename, encoding="utf-8") as f:
            tweet_qa = json.load(f)
            for data in tweet_qa:
                yield {
                    "Question": data["Question"],
                    "label": [] if split == "test" else data["Answer"],
                    "Tweet": data["Tweet"],
                    "qid": data["qid"]
                }
Same as above.
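Given the review above, a local TweetQA-style JSON file can also be consumed without registering a DatasetBuilder, by passing a read function to load_dataset. A minimal sketch under that assumption (the file path and function name are hypothetical; extra keyword arguments are forwarded to the read function as in the PaddleNLP custom-dataset docs):

import json
from paddlenlp.datasets import load_dataset

def read_tweetqa(data_path, split='train'):
    # Yield one dict per example, mirroring the _read method above.
    with open(data_path, encoding='utf-8') as f:
        for data in json.load(f):
            yield {
                'Question': data['Question'],
                'label': [] if split == 'test' else data['Answer'],
                'Tweet': data['Tweet'],
                'qid': data['qid'],
            }

train_ds = load_dataset(read_tweetqa, data_path='train.json', split='train', lazy=False)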
paddlenlp/datasets/xsum.py
Outdated
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import collections
import json
import os

from paddle.dataset.common import md5file
from paddle.utils.download import get_path_from_url
from paddlenlp.utils.env import DATA_HOME
from . import DatasetBuilder

__all__ = ['xsum']


class xsum(DatasetBuilder):
    """XSum abstractive summarization dataset (document/summary pairs)."""

    META_INFO = collections.namedtuple('META_INFO', ('file', 'md5', 'URL'))
    SPLITS = {
        'train': META_INFO(
            os.path.join('train.json'), '4e06fd1cfd5e7f0380499df8cbe17237',
            'https://bj.bcebos.com/paddlenlp/datasets/xsum/train.json'),
        'dev': META_INFO(
            os.path.join('dev.json'), '9c39d49d25d5296bdc537409208ddc85',
            'https://bj.bcebos.com/paddlenlp/datasets/xsum/dev.json'),
        'test': META_INFO(
            os.path.join('test.json'), '9c39d49d25d5296bdc537409208ddc85',
            'https://bj.bcebos.com/paddlenlp/datasets/xsum/test.json')
    }

    def _get_data(self, mode, **kwargs):
        # Download the split file if it is missing or its md5 does not match.
        default_root = os.path.join(DATA_HOME, self.__class__.__name__)
        filename, data_hash, URL = self.SPLITS[mode]
        fullname = os.path.join(default_root, filename)
        if not os.path.exists(fullname) or (
                data_hash and not md5file(fullname) == data_hash):
            get_path_from_url(URL, default_root)
        return fullname

    def _read(self, filename, split):
        with open(filename, encoding="utf-8") as f:
            xsums = json.load(f)
            for data in xsums:
                yield {
                    "document": data["document"],
                    "label": data["summary"],
                    "id": data["id"]
                }
Same as above.
@@ -0,0 +1,383 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
2022 (the copyright year should be 2022).
This Pull Request is stale because it has been open for 60 days with no activity.
Add model and datasets
PR changes
Add the new byt5 model; add two datasets, tweetqa and xsum.
Description
The byt5-small model is available at https://aistudio.baidu.com/aistudio/datasetdetail/123125, the xsum dataset at https://aistudio.baidu.com/aistudio/datasetdetail/122619, and the tweetqa dataset at https://aistudio.baidu.com/aistudio/datasetdetail/131440. Please upload the model and datasets; I will then update the links and md5 values of these datasets in my code.
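For context on the model being added: ByT5 operates directly on UTF-8 bytes rather than subword tokens, so its inputs need no learned vocabulary. Below is a library-free sketch of the byte-to-id mapping described in the ByT5 paper (ids 0/1/2 are pad/eos/unk, and each byte value b maps to b + 3); the actual tokenizer class added in this PR is not shown here.

def byt5_encode(text, eos_id=1, offset=3):
    # Map each UTF-8 byte to byte_value + offset, then append the EOS id.
    return [b + offset for b in text.encode('utf-8')] + [eos_id]

print(byt5_encode('Hi!'))  # [75, 108, 36, 1]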