dyan-dy
Contributor

@dyan-dy dyan-dy commented Jun 14, 2021

PR types

database

PR changes

Modified bq_corpus.py according to the review feedback.

Description

@ZeyuChen ZeyuChen added data Issues about data pipeline and dataset SIG labels Jun 14, 2021
@ZeyuChen ZeyuChen mentioned this pull request Jun 14, 2021
@ZeyuChen ZeyuChen changed the title from "modified bq_corpus.py" to "Add BQCorpus Dataset" Jun 14, 2021
@ZeyuChen ZeyuChen requested a review from smallv0221 June 14, 2021 08:00
Member

@ZeyuChen ZeyuChen left a comment

It seems an __init__.py is missing in paddlenlp/dataset/.

@dyan-dy dyan-dy mentioned this pull request Jun 14, 2021
Contributor

@smallv0221 smallv0221 left a comment

Actually, there is no need to modify __init__.py to implement new datasets. @ZeyuChen @frozenfish123


class BQCorpus(DatasetBuilder):
"""
BQCorpus: the largest dataset available for for the banking and finance sector
Contributor

Typo here.

META_INFO = collections.namedtuple('META_INFO', ('file', 'md5'))
SPLITS = {
'train': META_INFO(
os.path.join('BQCorpus', 'train.tsv'),
Contributor

The file path seems wrong. Have you checked your dataset using load_dataset()?

Contributor Author

Sorry, I've actually been stuck on this problem for a long time and I don't know how to fix it. I've been working on it for the past two days and still can't solve it. Sorry for slowing down your progress :(
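
The path check raised in this thread can be sketched without PaddleNLP, using only the standard library. The snippet below is a rough illustration, not the code from this PR: it reuses the META_INFO / SPLITS layout shown in the diff, but the helper names check_split and md5_of, the data directory argument, and the None placeholder for the checksum are all assumptions made for the example.

```python
import collections
import hashlib
import os

# Same layout as in the diff. The md5 is a placeholder (None), not the real checksum.
META_INFO = collections.namedtuple('META_INFO', ('file', 'md5'))
SPLITS = {
    'train': META_INFO(os.path.join('BQCorpus', 'train.tsv'), None),
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def check_split(data_dir, split):
    """Hypothetical helper: resolve a split's path and list any problems found.

    Returns (full_path, problems), where problems is a list of strings.
    """
    meta = SPLITS[split]
    full_path = os.path.join(data_dir, meta.file)
    problems = []
    if not os.path.isfile(full_path):
        problems.append('missing file: ' + full_path)
    elif meta.md5 is not None and md5_of(full_path) != meta.md5:
        problems.append('md5 mismatch: ' + full_path)
    return full_path, problems
```

Calling check_split with the directory where the archive was extracted shows whether the relative path stored in SPLITS actually points at a file, which is the kind of mismatch the comment above is asking about.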

"""
BQCorpus: the largest dataset available for for the banking and finance sector

by frozenfish123@Wuhan University
Contributor

@smallv0221 smallv0221 Jun 15, 2021

Please also give the original author information.

Contributor Author

"""
 BQ Corpus (Bank Question Corpus), a large-scale domain-specific Chinese corpus for sentence semantic
 equivalence identification, is constructed from online WeBank customer service logs.

 The BQ corpus contains 120,000 question pairs with manual annotation.

 The following two lines are examples extracted from training data:

     '''
     "微粒贷开通"	你好,我的微粒贷怎么没有开通呢	0
     为什么借款后一直没有给我回拨电话    怎么申请借款后没有打电话过来呢!    1
     '''

 Provider: Intelligent Computing Research Center, Harbin Institute of Technology(Shenzhen)
 Contacts: Qingcai Chen (email: [email protected]; Fax: +86-755-26033182)
 More Info: https://www.luge.ai/

  """

Sorry, I have been preparing for my final exams these days and did not reply promptly. Here are my changes in response to this comment.
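
For reference, a reader for rows in this format might look like the sketch below. It assumes each row is tab separated with exactly three fields (two sentences and a 0/1 label), as the first example row suggests. The function name read_bq_rows and the dictionary keys are illustrative and not necessarily those used in bq_corpus.py.

```python
def read_bq_rows(lines):
    """Yield one dict per well-formed row.

    Rows that do not split into exactly three tab-separated fields are skipped.
    The label is kept as a string ('0' or '1'), not converted to int.
    """
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 3:
            continue
        sentence1, sentence2, label = fields
        yield {'sentence1': sentence1, 'sentence2': sentence2, 'label': label}
```

Because the function accepts any iterable of strings, it can be passed an open file handle directly or a short list of lines when testing.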

Contributor

Great description! Good luck in the final.

@ZeyuChen
Member

Actually, there is no need to modify __init__.py to implement new datasets. @ZeyuChen @frozenfish123

Got it.

Contributor

@smallv0221 smallv0221 left a comment

We have decided to merge your dataset and fix the remaining issues ourselves because of our release schedule. Excellent work! Thanks for your contribution.

@smallv0221 smallv0221 merged commit 6b1fa66 into PaddlePaddle:develop Jun 17, 2021