Contributor

@westfish westfish commented Aug 2, 2022

PR types

New features

PR changes

Others

Description

Add T5-based question generation example

westfish and others added 25 commits August 3, 2022 21:53
* Add retrievalbase classification

* Add requirements and update readme

* Update readme

* refine the dataset structure

* Update readme

* Update text data format
* mv ernie-gen

* delete useless code
* add dalle-mini model

* update

* support auto

* decode->generate

* add dalle-mega v26

* update url

* update url

* update url

* add dallebart tokenizer into autotokenizer

Co-authored-by: Guo Sheng <[email protected]>
…e#2962)

* Add unittest for more outputs in test_modeling_common.py

* Change test import.
* custom op supports eager mode

* ut support eager

* compatibility

* bart

* encoder & decoder

* force decoding

* update

* alter import

* delete dir

Co-authored-by: Guo Sheng <[email protected]>
* add opt model

* update opt modeling

* complete all of test

* remove remote file link from opt configuration

* enable auto model with opt

* remove pretraining and criterion

* remove OPTLMHeadModel

* remove `remove_final_layer_norm`

* add example for opt text generation

* improve auto model & tokenizer

* keep the same code style

* revert changes fixed by PaddlePaddle#2764

* remove unused character

* use transformer-decoder-layer in opt

* align demo with huggingface website

* add opt op operations

* update faster entry

* add opt convert_param supporting

* add Makefile to manualy build faster transformer

* fix fp16 usage

* add performance & sample

* update opt modeling docstring

* update opt by comments

* remove Makefile && recover CMakeLists.txt

* update metric of opt

* update opt perf image

* update opt example

* remove opt & gpt image file to reduce size of repo

* remove bart file

* remove bart file to reduce the size of repo

Co-authored-by: Guo Sheng <[email protected]>
* add artist

* update tokenizer

* support faster generation

* update

* update

* update config
…2945)

* Add Milvus2.0 support for neural search

* Add milvus search file and remove unused blanks and code

* Add milvus_util.py
PaddlePaddle#2965)

* Fix faiss index batch_size bug on python3.7 and update es config for pipelines

* Fix the nltk download bug and Add FAQ for mac support

* Remove update_batch_size for fais
* add support for wo visual backbone

* add re hidden states

* fix hidden save

Co-authored-by: yingyibiao <[email protected]>
Co-authored-by: Jiaqi Liu <[email protected]>
* add text2image taskflow

* update readme

* update readme

* update text2image taskflow

Co-authored-by: Guo Sheng <[email protected]>
* Move pipelines to the root dir

* Add missing files
guoshengCS and others added 10 commits August 12, 2022 17:13
* Add label and loss support for BERT/RoBERTa.

* Add label and loss support for ERNIE.

* Update api docs.
* Add ERNIE 3.0 based RocketQA Ranker models

* Update Docs Introduction
* Add rocketqa DualEncoder models

* Add DualEncoder docs test=document_fix
)

* Upgrade run_system.py to milvus 2.1 for neural search

* recover server port
* run_qa.py add xpu chioce of device,*test=kunlun

* run_qa.py add xpu chioce of device,*test=kunlun
…ddle#2694)

* add label studio to doccano file format conversion function

* add label studio to doccano file format conversion function

* add description in readme
@tianxin1860 tianxin1860 left a comment

Leave some comments

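#### Data loading
**SQuAD** is an English question answering dataset, and existing question generation research is mainly evaluated on it. Each **SQuAD** example consists of three main parts: a passage, a question, and an answer. The passages and questions are taken from Wikipedia, and the answers are annotated by humans.

For quick testing, the PaddleNLP Dataset API has the Squad dataset built in, so the dataset can be loaded in one step. Example code: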

Squad -> SQuAD

@@ -0,0 +1,206 @@
# T5

Why is the title simply "T5"? I suggest using "问题生成 (Question Generation)" as the title instead.


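## Introduction

Question Generation (QG) is the task of automatically generating a fluent question that fits the topic of a given context (a passage or a sentence). It is usually divided into two branches: answer-free question generation (无答案问题生成) and answer-based question generation (有答案问题生成).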

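The Chinese terms 无答案问题生成 and 有答案问题生成 are easy to misread. I suggest stating the corresponding English term explicitly in parentheses after each Chinese term.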

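### Data preparation

#### Data loading
**SQuAD** is an English question answering dataset, and existing question generation research is mainly evaluated on it. Each **SQuAD** example consists of three main parts: a passage, a question, and an answer. The passages and questions are taken from Wikipedia, and the answers are annotated by humans.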

Please keep the wording precise: the questions in the SQuAD dataset were written by human annotators through crowdsourcing.

Comment on lines +47 to +48
answer: {answer_text} context: {context_text}
question: {question_text}

Could you give a concrete example of a converted record here? It would make this easier for users to understand.
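A minimal sketch of what such an example could look like, assuming the template quoted above is filled with plain string formatting. The function name `build_example` and the sample passage are illustrative only and are not taken from the PR or from SQuAD.

```python
def build_example(context_text: str, answer_text: str, question_text: str):
    """Fill the source/target templates quoted in the review comment."""
    # Model input: the answer followed by the passage it was taken from.
    source = f"answer: {answer_text} context: {context_text}"
    # Training target: the question the model should generate.
    target = f"question: {question_text}"
    return source, target


# Hand-written sample for illustration only; not a record from SQuAD.
source, target = build_example(
    context_text="The Eiffel Tower was completed in 1889.",
    answer_text="1889",
    question_text="When was the Eiffel Tower completed?",
)
print(source)
print(target)
```

Showing one filled-in pair like this next to the template makes it clear which field ends up in the model input and which in the label.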


- `device`: the device to use.

When the program finishes, the predicted summaries are saved to `output_path`, and the evaluation results are printed to the terminal.

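"Predicted summaries"? The documentation is the first thing users see as their entry point, so it deserves careful polishing. That also helps promote Question Generation.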


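When the program finishes, the predicted summaries are saved to `output_path`, and the evaluation results are printed to the terminal.

The community fine-tuned model mrm8488/t5-base-finetuned-question-generation-ap gives the following results on the validation set: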

Have you evaluated how a model trained with our own training script compares with the community model on these metrics?

Comment on lines 129 to 135
parser.add_argument(
"--ignore_pad_token_for_loss",
default=True,
type=bool,
help="Whether to ignore the tokens corresponding to "
"padded labels in the loss computation or not.",
)

Is this parameter needed at prediction time?
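Separately from whether the flag belongs in the prediction script, the quoted snippet uses `type=bool`, which in `argparse` converts any non-empty string (including `"False"`) to `True`. The sketch below illustrates that behaviour and one possible workaround; `str2bool` and `build_parser` are hypothetical helpers written for this illustration and are not part of the PR.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common textual spellings of a boolean (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser(bool_type):
    """Build a parser with the flag from the quoted snippet, varying only `type`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ignore_pad_token_for_loss", default=True, type=bool_type)
    return parser


args = ["--ignore_pad_token_for_loss", "False"]
naive = build_parser(bool).parse_args(args)
fixed = build_parser(str2bool).parse_args(args)
print(naive.ignore_pad_token_for_loss)  # True, because bool("False") is truthy
print(fixed.ignore_pad_token_for_loss)  # False
```

If the flag is kept, a converter along these lines (or a `store_true`/`store_false` style flag) would let users actually turn it off from the command line.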