DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Official PyTorch implementation of DocLayout-YOLO.
We present DocLayout-YOLO, a real-time and robust layout detection model for diverse documents, based on YOLO-v10. This model is enriched with diversified document pre-training and structural optimization tailored for layout detection. In the pre-training phase, we introduce Mesh-candidate BestFit, viewing document synthesis as a two-dimensional bin packing problem, and create a large-scale diverse synthetic document dataset, DocSynth-300K. In terms of model structural optimization, we propose a module with Global-to-Local Controllability for precise detection of document elements across varying scales.
2024.10.25 🎉🎉 Mesh-candidate BestFit code is released. Mesh-candidate BestFit is an automatic pipeline that synthesizes large-scale, high-quality, and visually appealing document layout detection datasets. A tutorial and example data are available here.
2024.10.23 🎉🎉 The DocSynth300K dataset is released on 🤗Huggingface. DocSynth300K is a large-scale and diverse document layout analysis pre-training dataset that can largely boost model performance.
2024.10.21 🎉🎉 Online demo available on 🤗Huggingface.
2024.10.18 🎉🎉 DocLayout-YOLO is integrated into PDF-Extract-Kit for document content extraction.
2024.10.16 🎉🎉 Paper now available on ArXiv.
The online demo is now available. For local development, follow the steps below:
Follow these steps to set up your environment:
```bash
conda create -n doclayout_yolo python=3.10
conda activate doclayout_yolo
pip install -e .
```
Note: If you only need the package for inference, you can simply install it via pip:
```bash
pip install doclayout-yolo
```
You can make predictions using either a script or the SDK:
- Script

Run the following command to make a prediction using the script:

```bash
python demo.py --model path/to/model --image-path path/to/image
```
- SDK

Here is an example of how to use the SDK for prediction:

```python
import cv2
from doclayout_yolo import YOLOv10

# Load the pre-trained model
model = YOLOv10("path/to/provided/model")

# Perform prediction
det_res = model.predict(
    "path/to/image",  # Image to predict
    imgsz=1024,       # Prediction image size
    conf=0.2,         # Confidence threshold
    device="cuda:0",  # Device to use (e.g., 'cuda:0' or 'cpu')
)

# Annotate and save the result
annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)
cv2.imwrite("result.jpg", annotated_frame)
```
We provide a model fine-tuned on DocStructBench for prediction, which is capable of handling various document types. The model can be downloaded from here, and example images can be found under `assets/example`.
Note: For PDF content extraction, please refer to PDF-Extract-Kit and MinerU.
Note: Thanks to NielsRogge, DocLayout-YOLO now supports loading directly from 🤗Huggingface. You can load the model as follows:

```python
from huggingface_hub import hf_hub_download
from doclayout_yolo import YOLOv10

filepath = hf_hub_download(repo_id="juliozhao/DocLayout-YOLO-DocStructBench", filename="doclayout_yolo_docstructbench_imgsz1024.pt")
model = YOLOv10(filepath)
```

or load it directly using `from_pretrained`:

```python
model = YOLOv10.from_pretrained("juliozhao/DocLayout-YOLO-DocStructBench")
```

More details can be found in this PR.
Note: Thanks to luciaganlulu, DocLayout-YOLO supports batch inference and prediction. Instead of passing a single image to `model.predict` in `demo.py`, pass a list of image paths. Also, since batch inference was not implemented before YOLOv11, you must manually change the `batch_size` here; a minimal sketch is shown below.
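For illustration, here is a minimal batch-prediction sketch. The image file names and parameter values are placeholders, and the per-result handling assumes the standard Ultralytics `Results` interface:

```python
from doclayout_yolo import YOLOv10

model = YOLOv10("path/to/provided/model")

# Pass a list of image paths instead of a single path;
# the returned list contains one result per input image.
image_paths = ["page_001.jpg", "page_002.jpg", "page_003.jpg"]
results = model.predict(image_paths, imgsz=1024, conf=0.2, device="cuda:0")

for path, res in zip(image_paths, results):
    print(f"{path}: {len(res.boxes)} detected elements")
```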
Use the following snippet to download the dataset (about 113 GB):
```python
from huggingface_hub import snapshot_download

# Download DocSynth300K
snapshot_download(repo_id="juliozhao/DocSynth300K", local_dir="./docsynth300k-hf", repo_type="dataset")

# If the download was interrupted and the files are incomplete, resume it:
snapshot_download(repo_id="juliozhao/DocSynth300K", local_dir="./docsynth300k-hf", repo_type="dataset", resume_download=True)
```
If you want to perform DocSynth300K pre-training, use `format_docsynth300k.py` to convert the original `.parquet` format into YOLO format. The converted data will be stored at `./layout_data/docsynth300k`.
```bash
python format_docsynth300k.py
```
To perform DocSynth300K pre-training, use this command (a hedged sketch is given below). We use 8 GPUs for pre-training by default. To reach optimal performance, you can adjust hyper-parameters such as `imgsz` and `lr0` according to your downstream fine-tuning data distribution or setting.

Note: Due to a memory leak in the original YOLO data-loading code, pre-training on a large-scale dataset may be interrupted unexpectedly. Use `--pretrain last_checkpoint.pt --resume` to resume the pre-training process.
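For reference, a pre-training invocation could look like the following sketch. It reuses the `train.py` flags from the fine-tuning example later in this README; the dataset name `docsynth300k`, the device list, and the project path are illustrative assumptions, not the repository's exact command:

```bash
# Hypothetical sketch: flags mirror the fine-tuning example below;
# dataset name, devices, and project path are placeholders.
python train.py --data docsynth300k --model m-doclayout --epoch 500 \
    --image-size 1600 --batch-size 64 --optimizer SGD --lr0 0.04 \
    --device 0,1,2,3,4,5,6,7 --project pretrain/docsynth300k --plot 1
```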
- Specify the data root path. Find your Ultralytics config file (for Linux users, at `$HOME/.config/Ultralytics/settings.yaml`) and change `datasets_dir` to the project root path.
- Download the prepared YOLO-format D4LA and DocLayNet data from the links below and put them under `./layout_data`:
| Dataset | Download |
|---|---|
| D4LA | link |
| DocLayNet | link |
The file structure is as follows:

```
./layout_data
├── D4LA
│   ├── images
│   ├── labels
│   ├── test.txt
│   └── train.txt
└── doclaynet
    ├── images
    ├── labels
    ├── val.txt
    └── train.txt
```
Training is conducted on 8 GPUs with a global batch size of 64 (8 images per device). The detailed settings and checkpoints are as follows:
| Dataset | Model | DocSynth300K Pretrained? | imgsz | Learning rate | Finetune | Evaluation | AP50 | mAP | Checkpoint |
|---|---|---|---|---|---|---|---|---|---|
| D4LA | DocLayout-YOLO | ✗ | 1600 | 0.04 | command | command | 81.7 | 69.8 | checkpoint |
| D4LA | DocLayout-YOLO | ✓ | 1600 | 0.04 | command | command | 82.4 | 70.3 | checkpoint |
| DocLayNet | DocLayout-YOLO | ✗ | 1120 | 0.02 | command | command | 93.0 | 77.7 | checkpoint |
| DocLayNet | DocLayout-YOLO | ✓ | 1120 | 0.02 | command | command | 93.4 | 79.7 | checkpoint |
The DocSynth300K pre-trained model can be downloaded from here. For evaluation, change `checkpoint.pt` to the path of the model to be evaluated.
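The exact evaluation invocations are linked in the table above. As a purely hypothetical sketch (the `val.py` entry point and its flags are assumptions, not confirmed by this README), an evaluation run might look like:

```bash
# Hypothetical sketch: entry point and flags are assumptions;
# follow the "command" links in the table above for the exact invocation.
python val.py --data doclaynet --model path/to/checkpoint.pt --device 0
```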
This section shows how to fine-tune and evaluate your own models based on DocLayout-YOLO. These instructions have been tested under Linux.
YOLO expects a config file at `$HOME/.config/Ultralytics/settings.yaml`. Most importantly, this config specifies the dataset root folder, which holds all datasets. Each dataset holds images, labels, and assignments of elements to the training and test sets. Finally, each dataset requires a dataset config file in DocLayout-YOLO.
- Specify YOLO's `datasets_dir`

Edit `$HOME/.config/Ultralytics/settings.yaml` and specify the dataset root folder under `datasets_dir`.
```bash
$ cat ~/.config/Ultralytics/settings.yaml
settings_version: 0.0.4
datasets_dir: /home/sebastian/Datasets/
# <more options left out for brevity>
```
- Prepare dataset

Inside the dataset root folder specified above, create a folder holding the dataset. Place the images inside `images/`, then create labels and place them in `labels/`. Create `train.txt` and `test.txt`, listing which images belong to the training set and which belong to the test set.
```
# sample dataset
mydataset
    images
        001.jpg
        002.jpg
    labels
        001.txt
        002.txt
    train.txt
    test.txt
```
The files `train.txt` and `test.txt` contain the name of a single image file per line, relative to the dataset directory.
```
# train.txt
./images/001.jpg
./images/002.jpg
```
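The label files under `labels/` use the standard YOLO detection format (documented at the Ultralytics link in the dataset-config step below): one object per line, written as `class_id x_center y_center width height`, with all coordinates normalized to the image size. A hypothetical `labels/001.txt` with made-up values:

```
0 0.500 0.120 0.900 0.080
3 0.880 0.950 0.100 0.040
```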
- Clone DocLayout-YOLO

```bash
git clone git@github.com:opendatalab/DocLayout-YOLO.git
cd DocLayout-YOLO
```
- Add Dataset Config

To be able to use a dataset in DocLayout-YOLO, create a config file at `doclayout_yolo/cfg/datasets/mydataset.yml`. For details, see https://docs.ultralytics.com/datasets/detect/.
```yaml
# doclayout_yolo/cfg/datasets/mydataset.yml
path: mydataset
train: train.txt
val: test.txt
test: test.txt

# classes
names:
  0: Index
  1: Line
  2: Location
  3: PageNumber
```
- Download a checkpoint
See "Training and Evaluation" above.
- Start training

Specify the checkpoint from above under `--pretrain`, and the name of the dataset config created above under `--data`, without the `.yaml` extension.
```bash
python ./train.py --data mydataset --model m-doclayout --epoch 500 \
    --image-size 1600 --batch-size 64 --project public_dataset/test \
    --plot 1 --optimizer SGD --lr0 0.04 --device 0 \
    --pretrain layout_data/doclayout_yolo_docsynth300k_imgsz1600.pt
```
The code base is built with ultralytics and YOLO-v10.
Thanks for their great work!
If you find our project useful, please star the repo. Your interest keeps us motivated to continue investing in the project!
```bibtex
@misc{zhao2024doclayoutyoloenhancingdocumentlayout,
  title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception},
  author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},
  year={2024},
  eprint={2410.12628},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.12628},
}

@article{wang2024mineru,
  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}
```