44 commits
834e923
add qg_example
westfish Aug 2, 2022
532776d
Merge branch 'develop' into qg_example
westfish Aug 2, 2022
9677bae
Merge branch 'develop' into qg_example
westfish Aug 3, 2022
2099987
Merge branch 'develop' into qg_example
westfish Aug 4, 2022
9a5f606
update link
westfish Aug 4, 2022
115b231
Merge branch 'qg_example' of github.com:westfish/PaddleNLP into qg_ex…
westfish Aug 4, 2022
a7bebbf
feat: add more device (#2963)
Aug 4, 2022
36ccdcd
Add retrieval based classification (#2836)
w5688414 Aug 5, 2022
3e863a3
Mv ernie-gen (#2970)
FrostML Aug 5, 2022
14d2c93
remove duplicate code in valid text input (#2977)
BasicCoder Aug 6, 2022
516f549
Fix faster wordpiece empty string input (#2975)
joey12300 Aug 7, 2022
63b556e
[NEW MODEL]Add DALL-E mini Model (#2917)
JunnYu Aug 8, 2022
52531dd
Add unittest for more outputs in test_modeling_common.py (#2962)
guoshengCS Aug 8, 2022
4042301
[duee] fix bug brought by using autotokenizer (#2985)
LemonNoel Aug 8, 2022
00a3551
Data augmentation API name change & bug fix (#2956)
lugimzzz Aug 9, 2022
101a02c
[FT] Custom op supports eager mode (#2795)
FrostML Aug 9, 2022
70e6a31
[NEW MODEL]add OPT model (#2659)
wj-Mcat Aug 9, 2022
145333e
Add EasyNLP Text2Image model (#2968)
JunnYu Aug 9, 2022
085ac53
Add and upgrade to Milvus2.0 support for neural search (#2945)
w5688414 Aug 9, 2022
bb1729b
Fix faiss index batch_size bug on python3.7 and update es config for …
w5688414 Aug 9, 2022
c097614
support multi gpus (#3004)
Aug 9, 2022
8dc6c98
fix tokenizer encode bug for layoutxlm (#3006)
linjieccc Aug 10, 2022
54eda93
add support for wo visual backbone (#2935)
littletomatodonkey Aug 10, 2022
2a4a2fb
Add Docker Support for pipelines (#2997)
w5688414 Aug 10, 2022
7ce81e1
[Recompute] Support ernie for dygraph recompute. (#2849)
ZHUI Aug 10, 2022
6b3f40c
Add Text2Image into Taskflow (#2988)
JunnYu Aug 10, 2022
377ecb2
Move pipelines to the root dir (#3012)
w5688414 Aug 10, 2022
3f0d6ce
Make BERT support past_key_values. (#2801)
guoshengCS Aug 10, 2022
4c54f6d
Update compression API (#2777)
LiuChiachi Aug 10, 2022
6952b91
[Unittest]add tinybert unittest (#2992)
wj-Mcat Aug 11, 2022
115d69d
Fix faster_tokenizer arm64 compile (#3016)
joey12300 Aug 11, 2022
8980d64
Add label and loss support for BERT/RoBERTa/ERNIE (#3013)
guoshengCS Aug 12, 2022
baa47b6
Add ERNIE 3.0 based RocketQA Ranker models (#3019)
w5688414 Aug 12, 2022
f61e7ca
Add ERNIE 3.0 based rocketqa DualEncoder models (#3033)
w5688414 Aug 15, 2022
a85059f
Update Pipelines README.md (#3043)
Aug 15, 2022
bf0d3ab
update text2image taskflow (#3040)
JunnYu Aug 15, 2022
42ec8e3
Upgrade run_system.py to milvus 2.1 for neural search (#3047)
w5688414 Aug 15, 2022
56a8957
run_qa.py add xpu chioce of device,*test=kunlun (#3046)
dongfangshenzhu Aug 16, 2022
1022500
Add label studio to doccano file format conversion function (#2694)
Haibarayu Aug 16, 2022
84a6f28
update group code (#3055)
chenxiaozeng Aug 16, 2022
da02704
Merge branch 'develop' of github.com:westfish/PaddleNLP into qg_example
westfish Aug 16, 2022
ea50a43
update
westfish Sep 6, 2022
589eae4
update
westfish Sep 6, 2022
c50dba1
fix generate_run bug
westfish Sep 6, 2022
3 changes: 3 additions & 0 deletions .gitignore
@@ -123,3 +123,6 @@ FETCH_HEAD

# vscode
.vscode

# temp
applications/question_generation
4 changes: 2 additions & 2 deletions README_cn.md
@@ -43,11 +43,11 @@
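## Community

- Scan the QR code with WeChat and fill out the questionnaire, then join the discussion group to receive the following perks
- Get the link to the live course "Industrial-Grade Universal Information Extraction: UIE + ERNIE Lightweight Models", streamed at 20:30 on the evenings of May 18-19
- Talk in depth with community developers and the official team
- A 10 GB NLP learning pack!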

<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168411900-d9f3d777-99ab-4b5c-8cdc-ef747a48b864.jpg" width="150" height="150" />
<img src="https://user-images.githubusercontent.com/11793384/184784832-bb97930f-a738-4480-99be-517aeb65afac.png" width="150" height="150" />
</div>

## Features
3 changes: 2 additions & 1 deletion README_en.md
@@ -323,9 +323,10 @@ To connect with other users and contributors, welcome to join our [Slack channel
Scan the QR code below with WeChat ⬇️ to join the official technical exchange group. We look forward to your participation.

<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168411900-d9f3d777-99ab-4b5c-8cdc-ef747a48b864.jpg" width="150" height="150" />
<img src="https://user-images.githubusercontent.com/11793384/184784832-bb97930f-a738-4480-99be-517aeb65afac.png" width="150" height="150" />
</div>


## Citation

If you find PaddleNLP useful in your research, please consider citing
26 changes: 0 additions & 26 deletions applications/experimental/pipelines/README.md

This file was deleted.

23 changes: 0 additions & 23 deletions applications/experimental/pipelines/requirements-cpu.txt

This file was deleted.

Binary file added applications/neural_search/img/attu.png
4 changes: 2 additions & 2 deletions applications/neural_search/recall/in_batch_negative/README.md
@@ -66,8 +66,8 @@ Recall@K召回率是指预测的前topK(top-k是指从最后的按得分排序
A GPU is recommended for training; either a CPU or a GPU works for prediction.

**Environment dependencies**
* python >= 3.6
* paddlepaddle >= 2.1.3
* python >= 3.6.2
* paddlepaddle >= 2.2.3
* paddlenlp >= 2.2
* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2
* visualdl >= 2.2.2
@@ -40,7 +40,7 @@ class ErnieOp(Op):

def init_op(self):
from paddlenlp.transformers import AutoTokenizer
self.tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
self.tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')

def preprocess(self, input_dicts, data_id, log_id):
from paddlenlp.data import Stack, Tuple, Pad
@@ -34,22 +34,20 @@
# yapf: enable

if __name__ == "__main__":
# If you want to use ernie1.0 model, please uncomment the following code
output_emb_size = 256

pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh")

tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
pretrained_model = AutoModel.from_pretrained("ernie-1.0")
tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
model = SemanticIndexBaseStatic(pretrained_model,
output_emb_size=output_emb_size)

if args.params_path and os.path.isfile(args.params_path):
state_dict = paddle.load(args.params_path)
model.set_dict(state_dict)
print("Loaded parameters from %s" % args.params_path)
else:
raise ValueError(
"Please set --params_path with correct pretrained model file")

model.eval()

# Convert to static graph with specific input description
model = paddle.jit.to_static(
model,
100 changes: 55 additions & 45 deletions applications/neural_search/recall/milvus/README.md
@@ -32,11 +32,11 @@
## 2. Environment Dependencies and Installation

**Environment dependencies**
* python >= 3.6
* python >= 3.6.2
* paddlepaddle >= 2.2
* paddlenlp >= 2.2
* milvus >= 1.1.1
* pymilvus >= 1.1.2
* milvus >= 2.1.0
* pymilvus >= 2.1.0

<a name="代码结构"></a>

@@ -47,17 +47,15 @@
```
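|—— scripts
|—— feature_extract.sh bash script for extracting feature vectors
|—— search.sh bash script for inserting and searching vectors
├── base_model.py # base class for semantic index models
├── config.py # Milvus configuration file
├── data.py # data processing functions
├── embedding_insert.py # insert vectors
├── embedding_recall.py # retrieve the top-K similar results / ANN
├── milvus_ann_search.py # script for vector insertion and search
├── inference.py # vector extraction script for the dynamic-graph model
├── feature_extract.py # batch vector extraction script
├── milvus_insert.py # vector insertion utility class
├── milvus_recall.py # vector recall utility class
├── README.md
└── server_config.yml # Milvus config file with the settings used by this project
├── milvus_util.py # Milvus utility class
└── README.md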
```
<a name="数据准备"></a>

@@ -97,13 +95,14 @@

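## 5. Vector Retrieval

### 5.1 Building a Milvus-Based Vector Retrieval System

Once the data is prepared, we set up a Milvus semantic retrieval engine for fast retrieval of semantic vectors. We use the open-source tool [Milvus](https://milvus.io/) for recall. To set up Milvus, follow the official [Milvus installation guide](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md). This example uses the Milvus 1.1.1 CPU version; the official Docker installation is recommended because it is quick and simple.
Once the data is prepared, we set up a Milvus semantic retrieval engine for fast retrieval of semantic vectors. We use the open-source tool [Milvus](https://milvus.io/) for recall. To set up Milvus, follow the official [Milvus installation guide](https://milvus.io/docs/v2.1.x/install_standalone-docker.md). This example uses Milvus 2.1; the official Docker installation is recommended because it is quick and simple.

Once Milvus is running, vectors can be inserted and searched. First generate the embedding vectors, one 256-dimensional vector per sample; the extraction was run on a 32 GB V100 GPU: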

```
CUDA_VISIBLE_DEVICES=2 python feature_extract.py \
CUDA_VISIBLE_DEVICES=0 python feature_extract.py \
--model_dir=./output \
--corpus_file "data/milvus_data.csv"
```
@@ -127,57 +126,60 @@ MILVUS_PORT = 8530
Then run the following command to insert the vectors into Milvus:

```
python3 embedding_insert.py
python milvus_ann_search.py --data_path milvus/milvus_data.csv \
--embedding_path corpus_embedding.npy \
--batch_size 100000 \
--insert
```
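Parameter descriptions:

* `data_path`: path to the data
* `embedding_path`: path to the vectors that correspond to the data
* `index`: index of the vector to use as the query in vector search
* `insert`: whether to insert vectors
* `search`: whether to search vectors
* `batch_size`: number of vectors inserted in one batch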
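
The `batch_size` flag described above caps how many vectors go into a single insert call. A minimal sketch of that chunking, assuming the ids and embeddings are already in memory; `iter_batches` is a hypothetical helper for illustration, not a function from this PR or from pymilvus:

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    ids: Sequence[int],
    embeddings: Sequence[Sequence[float]],
    batch_size: int,
) -> Iterator[Tuple[List[int], List[Sequence[float]]]]:
    """Yield (ids, embeddings) chunks holding at most batch_size rows each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield list(ids[start:end]), list(embeddings[start:end])


if __name__ == "__main__":
    demo_ids = list(range(5))
    demo_vectors = [[float(i)] * 4 for i in demo_ids]
    for batch_ids, batch_vectors in iter_batches(demo_ids, demo_vectors, batch_size=2):
        # A real script would hand each chunk to the Milvus client here.
        print(batch_ids, len(batch_vectors))
```

The last chunk may be smaller than `batch_size`, so the caller should not assume a fixed batch length.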


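| Data size | Time |
| ------------ | ------------ |
|10 million entries|12min24s|
|10 million entries|21min12s|

Milvus also provides a visual management interface that makes it easy to inspect the data; it can be installed from [Attu](https://github.com/zilliztech/attu).

Milvus also provides a visual management interface that makes it easy to inspect the data; it can be installed from [Milvus Enterprise Manager](https://github.com/zilliztech/attu)
![](../../img/attu.png)


Run the recall script: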

```
python3 embedding_recall.py

python milvus_ann_search.py --data_path milvus/milvus_data.csv \
--embedding_path corpus_embedding.npy \
--batch_size 100000 \
--index 18 \
--search
```
The output is shown below; it lists the recalled ids and their distances to the current query:

The output after running is:

```
10000000
time cost 0.5410025119781494 s
Status(code=0, message='Search vectors successfully!')
[
[
(id:1, distance:0.0),
(id:7109733, distance:0.832247257232666),
(id:6770053, distance:0.8488889932632446),
(id:2653227, distance:0.9032443761825562),
hit: (distance: 0.0, id: 18), text field: 吉林铁合金集团资产管理现状分析及对策资产管理;资金控制;应收帐款风险;造价控制;集中化财务控制
hit: (distance: 0.45325806736946106, id: 7611689), text field: 哈药集团应收账款分析应收账款,流动资产,财务报告
hit: (distance: 0.5440893769264221, id: 4297885), text field: 宝钢集团负债经营风险控制策略研究钢铁行业;负债经营;风险控制
hit: (distance: 0.5455711483955383, id: 5661135), text field: 浅谈电网企业固定资产风险管理大数据,固定资产,风险管理
...
```
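Each hit returns the vector distance, the vector id, and the corresponding text.

The first search takes about 18 s because the data has to be loaded from disk into memory; subsequent searches are fast. The measured speed is:

| Data size | Time |
| ------------ | ------------ |
|100 entries|0.15351247787475586|

If the search is too slow, adjust the cache parameters in the Milvus configuration:
The steps above can also be run with a single command: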

```
cache:
cache_size: 32GB
insert_buffer_size: 8GB
preload_collection:

sh scripts/search.sh
```
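The larger cache_size and insert_buffer_size are, the faster the search; restart Milvus after changing them.

### 5.2 Text Retrieval

Modify the model path and the sample in the code
First, modify the model path and the sample in the code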

```
params_path='checkpoints/model_40/model_state.pdparams'
@@ -194,12 +196,20 @@ python3 inference.py

```
[1, 256]
[[ 0.06374735 -0.08051944 0.05118101 -0.05855767 -0.06969483 0.05318566
0.079629 0.02667932 -0.04501902 -0.01187392 0.09590752 -0.05831281
Tensor(shape=[1, 256], dtype=float32, place=Place(gpu:0), stop_gradient=True,
[[ 0.07830613, -0.14036864, 0.03433795, -0.14967985, -0.03386058,
0.06630671, 0.01357946, 0.03531205, 0.02411086, 0.02000865,
0.05724005, -0.08119474, 0.06286906, 0.06509133, 0.07193415,
....
5677638 国有股权参股对家族企业创新投入的影响混合所有制改革,国有股权,家族企业,创新投入 0.5417419672012329
1321645 高管政治联系对民营企业创新绩效的影响——董事会治理行为的非线性中介效应高管政治联系,创新绩效,民营上市公司,董事会治理行为,中介效应 0.5445536375045776
1340319 国有控股上市公司资产并购重组风险探讨国有控股上市公司,并购重组,防范对策 0.5515031218528748
hit: (distance: 0.40141725540161133, id: 2742485), text field: 完善国有企业技术创新投入机制的探讨--基于经济责任审计实践国有企业,技术创新,投
入机制
hit: (distance: 0.40258315205574036, id: 1472893), text field: 企业技术创新与组织冗余--基于国有企业与非国有企业的情境研究
hit: (distance: 0.4121206998825073, id: 51831), text field: 企业创新影响对外直接投资决策—基于中国制造业上市公司的研究企业创新;对外直接投资;
制造业;上市公司
hit: (distance: 0.42234909534454346, id: 8682312), text field: 政治关联对企业创新绩效的影响——国有企业与民营企业的对比政治关联,创新绩效,国有
企业,民营企业,双重差分
hit: (distance: 0.46187296509742737, id: 9324797), text field: 财务杠杆、股权激励与企业创新——基于中国A股制造业经验数据制造业;上市公司;股权激
励;财务杠杆;企业创新
....
```
## FAQ
31 changes: 18 additions & 13 deletions applications/neural_search/recall/milvus/config.py
@@ -12,20 +12,25 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import os
from milvus import MetricType, IndexType

MILVUS_HOST = '10.21.226.173'
MILVUS_HOST = '10.21.226.175'
MILVUS_PORT = 8530
data_dim = 256
top_k = 100
collection_name = 'literature_search'
partition_tag = 'partition_2'
embedding_name = 'embeddings'

collection_param = {
'dimension': 256,
'index_file_size': 256,
'metric_type': MetricType.L2
index_config = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {
"nlist": 1000
},
}

index_type = IndexType.IVF_FLAT
index_param = {'nlist': 1000}

top_k = 100
search_param = {'nprobe': 20}
search_params = {
"metric_type": "L2",
"params": {
"nprobe": top_k
},
}
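
The `nlist` and `nprobe` values in this config control an IVF-style index: vectors are grouped into `nlist` clusters at build time, and a query scans only the `nprobe` clusters whose centroids are closest to it. The toy below illustrates that idea in pure Python under simplifying assumptions (randomly sampled centroids instead of k-means, squared L2 distance). `build_ivf` and `search_ivf` are hypothetical names, and this is not how Milvus implements `IVF_FLAT`. It shows that `nprobe` counts clusters scanned while `top_k` counts results returned, so the two can be tuned separately.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def l2_sq(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors: List[Vector], nlist: int,
              seed: int = 0) -> Tuple[List[Vector], Dict[int, List[int]]]:
    """Sample nlist vectors as centroids and bucket every vector id
    under its nearest centroid (no k-means refinement in this toy)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    buckets: Dict[int, List[int]] = {i: [] for i in range(nlist)}
    for vid, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: l2_sq(vec, centroids[c]))
        buckets[nearest].append(vid)
    return centroids, buckets


def search_ivf(query: Vector, vectors: List[Vector], centroids: List[Vector],
               buckets: Dict[int, List[int]], nprobe: int,
               top_k: int) -> List[Tuple[float, int]]:
    """Scan the nprobe closest clusters and return up to top_k
    (distance, id) pairs sorted by ascending distance."""
    nprobe = min(nprobe, len(centroids))
    order = sorted(range(len(centroids)),
                   key=lambda c: l2_sq(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in buckets[c]]
    scored = sorted((l2_sq(query, vectors[vid]), vid) for vid in candidates)
    return scored[:top_k]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.random() for _ in range(8)] for _ in range(200)]
    centroids, buckets = build_ivf(data, nlist=10)
    # Scanning more clusters can only widen the candidate set.
    for nprobe in (1, 3, 10):
        hits = search_ivf(data[3], data, centroids, buckets, nprobe, top_k=5)
        print(nprobe, [vid for _, vid in hits])
```

With `nprobe` equal to `nlist` every cluster is scanned, so the result matches a brute-force search; smaller values trade recall for speed.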
38 changes: 0 additions & 38 deletions applications/neural_search/recall/milvus/embedding_insert.py

This file was deleted.
