Merged
Commits
43 commits
0260530
initial commit
1649759610 May 18, 2022
5f66ab1
modify _calc_img_embeddings to support running without img embedding.
1649759610 May 18, 2022
73acf83
remove commented code
1649759610 May 18, 2022
6b457e8
delete README
1649759610 May 18, 2022
1ab0a3f
refine readme.md
1649759610 May 18, 2022
3a5f203
change question
1649759610 May 18, 2022
c54dc88
modify layoutxlm to support training without image embedding
1649759610 May 18, 2022
84a215d
modify _calc_img_embeddings in layoutxlm to support training without …
1649759610 May 18, 2022
6e8582b
modify _calc_img_embeddings in layoutxlm to support training without …
1649759610 May 18, 2022
fd0e0ea
Merge branch 'develop' into develop
yingyibiao May 18, 2022
3255d51
refine .gitignore
1649759610 May 18, 2022
b3144e3
refine Rerank with pre-commit
1649759610 May 18, 2022
d3c3a09
refine Extraction with pre-commit
1649759610 May 18, 2022
fde5156
refine code and readme details
1649759610 May 18, 2022
a812206
Merge branch 'PaddlePaddle:develop' into develop
1649759610 May 18, 2022
93af9e0
Merge branch 'develop' of github.com:1649759610/PaddleNLP into develop
1649759610 May 18, 2022
a497049
refine coding
1649759610 May 18, 2022
53cfcf8
refine code style about imports
1649759610 May 18, 2022
812a508
refine README
1649759610 May 18, 2022
18abfa8
set CUDA_VISIBLE_DEVICES 0
1649759610 May 18, 2022
fe63856
refine code style
1649759610 May 18, 2022
ff9d6ee
refine readme
1649759610 May 18, 2022
7218e61
refine readme
1649759610 May 18, 2022
8b28fc2
delete ocr parsing file
1649759610 May 18, 2022
fb26c50
Merge branch 'develop' into develop
May 18, 2022
37d1456
refine readme
1649759610 May 18, 2022
d5f2451
Merge branch 'develop' of github.com:1649759610/PaddleNLP into develop
1649759610 May 18, 2022
8c39aa9
refine readme
1649759610 May 23, 2022
ee0b712
refine Readme
1649759610 May 23, 2022
2f7262a
Merge branch 'PaddlePaddle-develop' into develop
1649759610 May 23, 2022
7e088d0
refine README.md
1649759610 May 23, 2022
eedb2d4
refine readme
1649759610 May 25, 2022
706566d
refine readme
1649759610 May 25, 2022
3e2c425
refine readme
1649759610 May 25, 2022
8331820
Merge branch 'PaddlePaddle-develop' into develop
1649759610 May 25, 2022
d8f3aaa
refine readme
1649759610 May 25, 2022
fcaa27e
optimize ocr and mrc module
1649759610 May 31, 2022
e8c08d3
Merge branch 'PaddlePaddle:develop' into develop
1649759610 May 31, 2022
312bcb0
Merge branch 'develop' of github.com:1649759610/PaddleNLP into develop
1649759610 May 31, 2022
f0f1edb
refine code style
1649759610 May 31, 2022
a3bb28f
Merge branch 'PaddlePaddle:develop' into develop
1649759610 May 31, 2022
4c6b248
set params with argparse
1649759610 May 31, 2022
f6d85fc
rename max_seq_length to max_seq_len
1649759610 May 31, 2022
initial commit
1649759610 committed May 18, 2022
commit 026053091f0e4b4fb67131a08196b7e573193bea
14 changes: 14 additions & 0 deletions applications/doc_vqa/.gitignore
@@ -0,0 +1,14 @@
checkpoints/*
__pycache__/*
OCR_process/demo_pics/*
Rerank/log/*
Rerank/checkpoints/*
Rerank/data/*
Rerank/output/*
Rerank/__pycache__/*
Extraction/log/*
Extraction/checkpoints/*
Extraction/data/*
Extraction/output/*
Extraction/__pycache__/*

34 changes: 34 additions & 0 deletions applications/doc_vqa/Extraction/README
@@ -0,0 +1,34 @@
# Requirements
paddle 2.2.0+, dynamic-graph training and inference
python 3.7+

# Testing the existing model
bash run_test.sh saved_checkpoints/checkpoint-48000/
Expected result: F1=65.04

I implemented several training variants in paddlenlp/transformers/layoutxlm/modeling.py; see the notes below for details.

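# Model training (single GPU, CRF operator: https://github.com/PaddlePaddle/models/blob/develop/dygraph/lac/sequence_labeling.py#L126)
bash run_train.sh
Results are expected to be good by around epoch 20. See the log log/log_acl.txt for reference; the saved model is checkpoint-48000. When testing with the run_test.sh script, F1>=65 indicates that the baseline training is reproducible.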

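# Model training (single GPU, CRF implemented in paddlenlp, https://github.com/PaddlePaddle/models/blob/develop/dygraph/lac/sequence_labeling.py#L126)
See the implementation at lines 1048-1074 of paddlenlp/transformers/layoutxlm/modeling.py, and comment out the standard CRF operator implementation that precedes it.

Then run bash run_train.sh.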

# Model training (multi-GPU, CRF implemented in paddlenlp)

Run bash run_train_multi.sh.

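# Known issues and goals
Known issues:
1. With the CRF operator implementation, multi-GPU training does not run.
2. The paddlenlp CRF implementation runs on both single and multiple GPUs, but it is much slower than the operator (an estimated 3-5x), and its final accuracy has not been verified.

Goals:
1. CRF training works on multiple GPUs, with results matching or exceeding F1=65.

Two possible approaches:
1. Optimize the existing CRF operator to support multi-GPU training, with results exceeding F1=65.
2. Optimize the existing paddlenlp CRF implementation to speed it up, so that its speed and final accuracy at least match the CRF operator.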
Binary file added applications/doc_vqa/Extraction/answer.png
33 changes: 33 additions & 0 deletions applications/doc_vqa/Extraction/change_to_mrc.py
@@ -0,0 +1,33 @@
import sys
import json
import numpy as np

def get_top1_from_ranker(path):
    with open(path, "r", encoding="utf-8") as f:
        scores = [float(line.strip()) for line in f.readlines()]
        top_id = np.argmax(scores)

    return top_id

def get_ocr_result_by_id(path, top_id):
    with open(path, "r", encoding="utf-8") as f:
        reses = f.readlines()
        res = reses[top_id]
    return json.loads(res)

def write_to_file(doc, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False)
        f.write("\n")

if __name__=="__main__":
    question = sys.argv[1]
    ranker_result_path = "../Rerank/data/demo.score"
    ocr_result_path = "../OCR_process/demo_ocr_res.json"
    save_path = "data/demo_test.json"
    top_id = get_top1_from_ranker(ranker_result_path)
    doc = get_ocr_result_by_id(ocr_result_path, top_id)
    doc["question"] = question
    doc["img_id"] = str(top_id + 1)

    write_to_file(doc, save_path)
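
Reviewer note: the flow in change_to_mrc.py (pick the highest-scoring line from the ranker output, load the OCR record at the same line index, attach the question and a 1-based img_id) can be tried on toy data with the sketch below. It is not part of the commit. It assumes one score per line and one JSON object per line, as the script reads them. The `text` field, the toy scores, and the helper `build_mrc_doc` are hypothetical. To keep the sketch free of third-party dependencies, `np.argmax` is swapped for a standard-library equivalent that also returns the first maximum.

```python
import json
import os
import tempfile


def get_top1_from_ranker(path):
    """Return the 0-based index of the highest score (first one on ties)."""
    with open(path, "r", encoding="utf-8") as f:
        scores = [float(line.strip()) for line in f if line.strip()]
    # Standard-library stand-in for np.argmax(scores).
    return max(range(len(scores)), key=scores.__getitem__)


def get_ocr_result_by_id(path, top_id):
    """Load the JSON record stored on line `top_id` of a JSON-lines file."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    return json.loads(lines[top_id])


def build_mrc_doc(question, score_path, ocr_path):
    """Hypothetical helper mirroring the script's __main__ block."""
    top_id = get_top1_from_ranker(score_path)
    doc = get_ocr_result_by_id(ocr_path, top_id)
    doc["question"] = question
    doc["img_id"] = str(top_id + 1)  # image ids are 1-based in the script
    return doc


# Toy inputs: three ranked pages, the second one scoring highest.
tmp_dir = tempfile.mkdtemp()
score_path = os.path.join(tmp_dir, "demo.score")
ocr_path = os.path.join(tmp_dir, "demo_ocr_res.json")

with open(score_path, "w", encoding="utf-8") as f:
    f.write("0.12\n0.87\n0.45\n")

with open(ocr_path, "w", encoding="utf-8") as f:
    for text in ["page one", "page two", "page three"]:
        # "text" is a made-up field; the real OCR schema is not shown here.
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

doc = build_mrc_doc("What is on the page?", score_path, ocr_path)
print(doc)
```

One thing the toy run makes visible: the script relies on the score file and the OCR file having their lines in the same order, since the only link between them is the line index.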