Replace deprecated url basic normalizer #531

HamzaElzarw-2022 · 2025-07-19T14:18:13Z

Description:
Replaces all deprecated usages of new URL(...) constructors in BasicURLNormalizer.java which is the only remaining implementation class with such deprecated constructors.

Details:
Directly replacing new URL(...) with URI.toURL() was not feasible in all cases due to behavioral differences between URL and URI. The reasoning and applied solutions are outlined below:

Handling new URL(String url) in parseStringToURL method
The method parseStringToURL, used at the beginning of filter method to extract URL components, used new URL(String) to deconstruct the input string. Replacing this with URI was not viable, as URI is stricter and would throw exceptions for cases that were meant to be handled later in the filter method (e.g., URLs containing spaces or non-ASCII characters).
To address this, a new method deconstructURL was introduced, which uses a urlDeconstructionPattern regex to extract components without instantiating a URL prematurely.
Handling new URL(protocol, host, port, file) constructor
Replacing this form with its closest URI substitute new URI(schema, userinfo, host, port, path, query, fragment) led to unwanted percent encoding behavior. For instance:

test case: org.opentest4j.AssertionFailedError: normalizing: http://foo.com/%66oo.htm%C0 ==> 
Expected :http://foo.com/foo.htm%C0 
Actual :http://foo.com/foo.htm%25C0

To avoid this, the URL is reconstructed as a string from its components after normalization, and then converted using new URI(String).toURL() to preserve encoding.

Other refactoring details:

The changed flag is now only used after the final URL is constructed, since a URL instance will be created regardless.
Temporary variables like file2 and newHost, previously used just to track changes, were removed for clarity.
The check for whether the port equals the default port was moved after the initial URL construction as it calls url.getDefaultPort().

I tried to justify the changes as possible hope it makes sense :)
I’d be happy to hear any feedback and will gladly update the PR to simplify or reduce the changes as possible.

Partially Addresses #522

HamzaElzarw-2022 · 2025-07-19T14:26:45Z

I forgot to mention that there's one failing test when i manually run the tests:

org.opentest4j.AssertionFailedError: normalizing: file:///foo/bar.txt ==> 
Expected : file:///foo/bar.txt
Actual   : file:/foo/bar.txt

This seems to be caused by normalization behavior within the URI constructor itself as I explicitly add :// to the url string before passing it to the constructor and still URI removes it again. I'm not exactly sure how to work around this, so any guidance would be appreciated.

HamzaElzarw-2022 · 2025-07-21T19:03:09Z

Hello @sebastian-nagel , can you please review these changes when you have time?

rzo1 · 2025-07-21T19:06:07Z

Can you check the build result?

rzo1 · 2025-07-21T19:09:55Z

file:///foo/bar.txt is the standard and correct format of a file URI, so I would expect it in the result (according to https://datatracker.ietf.org/doc/html/rfc3986)

HamzaElzarw-2022 · 2025-07-22T14:23:28Z

I think I was overcomplicating things a bit. I reset the branch and added a commit that now uses URI.toURL within parseStringToURL. Before that, I remove the fragment from the urlString to avoid escaping the # character, then call the escapePath method on the full string to clean up any illegal characters that might cause URI to throw an exception.

Regarding the build results:

Error: EffectiveTldFinderTest.testMatchAllSuffixes:331 Domain under public suffix is expected to be same as host ==> expected: <foo.bar.uberspace.de> but was: <uberspace.de>

Error: EffectiveTldFinderTest.testMatchAllSuffixes:337 Host name is a wildcard public suffix ==> expected: but was: <uberspace.de>

I downloaded the "effective_tld_names.dat" file used in the build workflow and ran the tests locally. These tests also fail on the master branch, so I don't believe they're related to my changes. might be caused by changes in "effective_tld_names.dat" file itself.

Error: BasicURLNormalizerTest.testBasicNormalizer:41 normalizing: http: ==> expected: http:/ but was: http://http/

This happens because the URI constructor (unlike URL) doesn't accept a bare http: scheme. It ends up trying to prepends http://, resulting in http://http/. I'm unsure whether it's worth working around this or how best to handle it, any guidance would be appreciated.

rzo1 · 2025-07-22T17:46:44Z

Yep, I can confirm

[ERROR] Failures: 
[ERROR]   EffectiveTldFinderTest.testMatchAllSuffixes:331 Domain under public suffix is expected to be same as host ==> expected: <foo.bar.uberspace.de> but was: <uberspace.de>
[ERROR]   EffectiveTldFinderTest.testMatchAllSuffixes:337 Host name is a wildcard public suffix ==> expected: <null> but was: <uberspace.de>

from master.

Reason: publicsuffix/list#2540

Will fix on master.

rzo1 · 2025-07-25T09:44:14Z

@HamzaElzarw-2022 Think you can rebase now, so we get a clean view of the build

HamzaElzarw-2022 · 2025-08-07T20:55:17Z

Hi @rzo1,
Any updates on this PR? There's only one failing case remaining, and it seems directly related to the behavior of URI. Do you think it's worth implementing a workaround to fix it, or is it something we can ignore? I can update the test if needed, just let me know what you think.

sebastian-nagel · 2025-08-12T17:16:32Z

Regarding the failing unit test:

After digging into the code and its history: it was originally added to ensure that a malformed input does not cause a NPE, see here.
For sure, you'll find a <a href="http:">...</a> somewhere in the web, and a NPE in the normalizer should not cause a crawler job to fail.

For now, we could just comment out the corresponding test (in weirdToNormalizedUrls.csv) and open a separate issue to address it later: test for the NPE. Since the URL is not crawlable anyway, the exact output does not matter. A null would be the best return value.

sebastian-nagel

@HamzaElzarw-2022, eventually the code can be simplified if the URI class is directly used without converting from and to URL. As far as I can see, all methods which take a URL object as parameter or returning one are private. It would be safe to change them.

For the failing unit test: see comments.

sebastian-nagel · 2025-08-12T17:18:04Z

src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java

            try {
                if (changed) {
-                    url = new URL(protocol, host, port, file);
+                    String tempUrl = protocol + "://" + (host == null ? "" : host) + (port == -1 ? "" : ":" + port) + file;


This seems overtly complex. We could just return the concatenated string. Except we want explicitly verify that the normalized URL is both a valid URI and URL.

... or use the URI internally for now. In getFileWithNormalizedPath(url) we convert again from URL to URI.

sebastian-nagel · 2025-08-12T17:25:52Z

src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java

            if (!hasSchemePattern.matcher(urlString).find()) {
                // no protocol/scheme : try to prefix http://
                try {
-                    url = new URL("http://" + urlString);


Ok, here might be the origin for the failing unit test: "http://" is added in front of "http:".

HamzaElzarw-2022 force-pushed the replace-deprecated-url-basic-normalizer branch from d718e9a to 9debbbf Compare July 21, 2025 23:37

rzo1 mentioned this pull request Jul 22, 2025

Replace deleted *.uberspace.de with a wildcard from Google #532

Merged

Replace deprecated URL usage with URI in BasicURLNormalizer

eaf6158

HamzaElzarw-2022 force-pushed the replace-deprecated-url-basic-normalizer branch from 9debbbf to eaf6158 Compare July 25, 2025 12:06

sebastian-nagel requested changes Aug 12, 2025

View reviewed changes

Replace deprecated url basic normalizer #531

Are you sure you want to change the base?

Replace deprecated url basic normalizer #531

Uh oh!

Conversation

HamzaElzarw-2022 commented Jul 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HamzaElzarw-2022 commented Jul 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HamzaElzarw-2022 commented Jul 21, 2025

Uh oh!

rzo1 commented Jul 21, 2025

Uh oh!

rzo1 commented Jul 21, 2025

Uh oh!

HamzaElzarw-2022 commented Jul 22, 2025

Uh oh!

rzo1 commented Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rzo1 commented Jul 25, 2025

Uh oh!

HamzaElzarw-2022 commented Aug 7, 2025

Uh oh!

sebastian-nagel commented Aug 12, 2025

Uh oh!

sebastian-nagel left a comment

Choose a reason for hiding this comment

Uh oh!

sebastian-nagel Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

sebastian-nagel Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

sebastian-nagel Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HamzaElzarw-2022 commented Jul 19, 2025 •

edited

Loading

HamzaElzarw-2022 commented Jul 19, 2025 •

edited

Loading

rzo1 commented Jul 22, 2025 •

edited

Loading