Skip to content

Conversation

@bsowell
Copy link
Contributor

@bsowell bsowell commented May 22, 2025

I found this function useful in my own development. It splits a pdf Document (or each pdf in a DocSet) into multiple Documents, where each resulting Document has the specified number of pages, except possibly the last. Both the pdf binary and elements are split.

I found this function useful in my own development. It splits a pdf
Document (or each pdf in a DocSet) into multiple Documents, where each
resulting Document has the specified number of pages, except possibly the
last. Both the pdf binary and elements are split.
@bsowell bsowell requested review from HenryL27 and Copilot May 22, 2025 23:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a new utility function, split_pdf, which splits a PDF Document (or each PDF within a DocSet) into multiple Documents containing a specified number of pages, while preserving both the binary and element data.

  • Updated logging of missing binary representations with f-strings
  • Introduced split_pdf and its corresponding tests in pdf_utils and test_pdf_utils respectively

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
lib/sycamore/sycamore/utils/pdf_utils.py Adds the split_pdf function with logic to split a PDF document into multiple parts
lib/sycamore/sycamore/tests/unit/utils/test_pdf_utils.py Adds unit tests for split_pdf and its integration with FlatMap
Comments suppressed due to low confidence (1)

lib/sycamore/sycamore/utils/pdf_utils.py:164

  • [nitpick] Consider renaming the variable 'start' to 'start_page' for improved clarity in indicating it represents the starting page number for the current split.
for idx, start in enumerate(range(1, page_count, num_pages)):

Copy link
Collaborator

@HenryL27 HenryL27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Soeb-aryn you were looking for something like this right?

@bsowell bsowell merged commit 1c38e17 into main May 23, 2025
11 of 15 checks passed
@bsowell bsowell deleted the ben/split_pages branch May 23, 2025 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants