Skip to content

Conversation

@karfau
Copy link
Member

@karfau karfau commented Feb 25, 2025

  • enforce timeouts on tests and specifically for the one not using the default normalizeLineEndings, so it actually fails.
  • improve reused regular expression to not contain .* twice, to speed up processing larger files containing certain Unicode chars as demonstrated in Increased processing times when using \u2028 and \u2029 #838

when the `locator` is not disabled
@codecov
Copy link

codecov bot commented Feb 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.12%. Comparing base (bfb5c44) to head (23e2aaf).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #847   +/-   ##
=======================================
  Coverage   95.12%   95.12%           
=======================================
  Files           8        8           
  Lines        2196     2196           
  Branches      577      577           
=======================================
  Hits         2089     2089           
  Misses        107      107           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@karfau karfau linked an issue Feb 25, 2025 that may be closed by this pull request
@karfau karfau changed the title perf: improve line detection regular expression for locator perf: speed up line detection Feb 25, 2025
@karfau karfau requested a review from shunkica February 25, 2025 23:22
@karfau karfau marked this pull request as ready for review February 25, 2025 23:23
@karfau
Copy link
Member Author

karfau commented Feb 25, 2025

@kboshold this one is an actual fix for the performance issue from your initial example, even when not normalizing the input.
Also the timeout of the test is now enforced.

@karfau karfau added this to the 0.9.8 milestone Feb 25, 2025
@karfau karfau enabled auto-merge (squash) February 25, 2025 23:29
@karfau karfau disabled auto-merge February 25, 2025 23:29
@karfau karfau enabled auto-merge (rebase) February 25, 2025 23:29
@karfau karfau requested review from Ponynjaa and removed request for shunkica February 25, 2025 23:29
Copy link
Collaborator

@Ponynjaa Ponynjaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two minor things I would change: the test-descriptions on the test you copied and the new one don't describe what's actually being tested.

Other than that it looks really good! I was a little confused at first when I saw your changes to the regex and loop, but after playing around with it for a little bit and testing the outcomes it seems to work exactly as before but without matching the content inbetween line-endings for no reason. Good job dude!

expect(new XMLSerializer().serializeToString(doc)).toEqual(source.replace(/[\u0085\u2028\u2029]/g, '\n'));
}, 500);
});
test('should be able to open documents with alternative whitespace without creating a bottleneck and replacing them with \\n', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and replacing them with \n

this description doesn't make sense for this test

// issue: https://github.com/xmldom/xmldom/issues/838
const onError = jest.fn();
const { parser } = getTestParser({ onError });
const source = `<root>${'A'.repeat(50000)}\u2029${'A'.repeat(50000)}\u0085${'A'.repeat(50000)}\u2028${'A'.repeat(50000)}\u2029</root>`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without creating a bottleneck

The test description here actually also doesn't make sense, maybe you can rephrase it when you are already at it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, somehow the test I copied is also covering the one that @kboshold added, so I merged them into one and corrected the description.

@karfau karfau disabled auto-merge February 26, 2025 22:17
@karfau karfau enabled auto-merge (squash) February 26, 2025 22:18
const onError = jest.fn();
const normalizeLineEndings = jest.fn((source) => source);
const { parser } = getTestParser({ onError, normalizeLineEndings });
const source = `<root>${'A'.repeat(15000)}\u2029${'A'.repeat(15000)}\u0085${'A'.repeat(15000)}\u2028${'A'.repeat(15000)}\u2029</root>`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense to have a \n as the first character, since the test currently has \u2029 twice.

Solution:

const source = `<root>${'A'.repeat(15000)}\n${'A'.repeat(15000)}\u0085${'A'.repeat(15000)}\u2028${'A'.repeat(15000)}\u2029</root>`;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's such big of a difference, but hey, I applied your suggestion.

@karfau karfau disabled auto-merge February 27, 2025 08:36
@karfau
Copy link
Member Author

karfau commented Feb 27, 2025

I applied all the suggested changes, so please approve it.
I would say the PR is already an improvement that we should get landed and released.
Or just make the changes yourself, I think both of you are able to push to this branch.

@karfau karfau requested review from Ponynjaa and kboshold February 27, 2025 21:28
@Ponynjaa Ponynjaa merged commit d4dc4da into master Feb 27, 2025
38 checks passed
@Ponynjaa Ponynjaa deleted the improve-line-split branch February 27, 2025 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Increased processing times when using \u2028 and \u2029

3 participants