Skip to content

Conversation

HamzaElzarw-2022
Copy link
Contributor

@HamzaElzarw-2022 HamzaElzarw-2022 commented Jul 19, 2025

Description:
Replaces all deprecated usages of new URL(...) constructors in BasicURLNormalizer.java which is the only remaining implementation class with such deprecated constructors.

Details:
Directly replacing new URL(...) with URI.toURL() was not feasible in all cases due to behavioral differences between URL and URI. The reasoning and applied solutions are outlined below:

  1. Handling new URL(String url) in parseStringToURL method
    The method parseStringToURL, used at the beginning of filter method to extract URL components, used new URL(String) to deconstruct the input string. Replacing this with URI was not viable, as URI is stricter and would throw exceptions for cases that were meant to be handled later in the filter method (e.g., URLs containing spaces or non-ASCII characters).
    To address this, a new method deconstructURL was introduced, which uses a urlDeconstructionPattern regex to extract components without instantiating a URL prematurely.

  2. Handling new URL(protocol, host, port, file) constructor
    Replacing this form with its closest URI substitute new URI(schema, userinfo, host, port, path, query, fragment) led to unwanted percent encoding behavior. For instance:

test case: org.opentest4j.AssertionFailedError: normalizing: http://foo.com/%66oo.htm%C0 ==> 
Expected :http://foo.com/foo.htm%C0 
Actual :http://foo.com/foo.htm%25C0

To avoid this, the URL is reconstructed as a string from its components after normalization, and then converted using new URI(String).toURL() to preserve encoding.

Other refactoring details:

  • The changed flag is now only used after the final URL is constructed, since a URL instance will be created regardless.
  • Temporary variables like file2 and newHost, previously used just to track changes, were removed for clarity.
  • The check for whether the port equals the default port was moved after the initial URL construction as it calls url.getDefaultPort().

I tried to justify the changes as possible hope it makes sense :)
I’d be happy to hear any feedback and will gladly update the PR to simplify or reduce the changes as possible.

Partially Addresses #522

@HamzaElzarw-2022
Copy link
Contributor Author

HamzaElzarw-2022 commented Jul 19, 2025

I forgot to mention that there's one failing test when i manually run the tests:

org.opentest4j.AssertionFailedError: normalizing: file:///foo/bar.txt ==> 
Expected : file:///foo/bar.txt
Actual   : file:/foo/bar.txt

This seems to be caused by normalization behavior within the URI constructor itself as I explicitly add :// to the url string before passing it to the constructor and still URI removes it again. I'm not exactly sure how to work around this, so any guidance would be appreciated.

@HamzaElzarw-2022
Copy link
Contributor Author

Hello @sebastian-nagel , can you please review these changes when you have time?

@rzo1
Copy link
Member

rzo1 commented Jul 21, 2025

Can you check the build result?

@rzo1
Copy link
Member

rzo1 commented Jul 21, 2025

file:///foo/bar.txt is the standard and correct format of a file URI, so I would expect it in the result (according to https://datatracker.ietf.org/doc/html/rfc3986)

@HamzaElzarw-2022 HamzaElzarw-2022 force-pushed the replace-deprecated-url-basic-normalizer branch from d718e9a to 9debbbf Compare July 21, 2025 23:37
@HamzaElzarw-2022
Copy link
Contributor Author

I think I was overcomplicating things a bit. I reset the branch and added a commit that now uses URI.toURL within parseStringToURL. Before that, I remove the fragment from the urlString to avoid escaping the # character, then call the escapePath method on the full string to clean up any illegal characters that might cause URI to throw an exception.

Regarding the build results:

Error: EffectiveTldFinderTest.testMatchAllSuffixes:331 Domain under public suffix is expected to be same as host ==> expected: <foo.bar.uberspace.de> but was: <uberspace.de>

Error: EffectiveTldFinderTest.testMatchAllSuffixes:337 Host name is a wildcard public suffix ==> expected: but was: <uberspace.de>

I downloaded the "effective_tld_names.dat" file used in the build workflow and ran the tests locally. These tests also fail on the master branch, so I don't believe they're related to my changes. might be caused by changes in "effective_tld_names.dat" file itself.

Error: BasicURLNormalizerTest.testBasicNormalizer:41 normalizing: http: ==> expected: http:/ but was: http://http/

This happens because the URI constructor (unlike URL) doesn't accept a bare http: scheme. It ends up trying to prepends http://, resulting in http://http/. I'm unsure whether it's worth working around this or how best to handle it, any guidance would be appreciated.

@rzo1
Copy link
Member

rzo1 commented Jul 22, 2025

Yep, I can confirm

[ERROR] Failures: 
[ERROR]   EffectiveTldFinderTest.testMatchAllSuffixes:331 Domain under public suffix is expected to be same as host ==> expected: <foo.bar.uberspace.de> but was: <uberspace.de>
[ERROR]   EffectiveTldFinderTest.testMatchAllSuffixes:337 Host name is a wildcard public suffix ==> expected: <null> but was: <uberspace.de>

from master.

Reason: publicsuffix/list#2540

Will fix on master.

@rzo1
Copy link
Member

rzo1 commented Jul 25, 2025

@HamzaElzarw-2022 Think you can rebase now, so we get a clean view of the build

@HamzaElzarw-2022 HamzaElzarw-2022 force-pushed the replace-deprecated-url-basic-normalizer branch from 9debbbf to eaf6158 Compare July 25, 2025 12:06
@HamzaElzarw-2022
Copy link
Contributor Author

Hi @rzo1,
Any updates on this PR? There's only one failing case remaining, and it seems directly related to the behavior of URI. Do you think it's worth implementing a workaround to fix it, or is it something we can ignore? I can update the test if needed, just let me know what you think.

@sebastian-nagel
Copy link
Contributor

Regarding the failing unit test:

  • After digging into the code and its history: it was originally added to ensure that a malformed input does not cause a NPE, see here.
  • For sure, you'll find a <a href="http:">...</a> somewhere in the web, and a NPE in the normalizer should not cause a crawler job to fail.

For now, we could just comment out the corresponding test (in weirdToNormalizedUrls.csv) and open a separate issue to address it later: test for the NPE. Since the URL is not crawlable anyway, the exact output does not matter. A null would be the best return value.

Copy link
Contributor

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HamzaElzarw-2022, eventually the code can be simplified if the URI class is directly used without converting from and to URL. As far as I can see, all methods which take a URL object as parameter or returning one are private. It would be safe to change them.

For the failing unit test: see comments.

try {
if (changed) {
url = new URL(protocol, host, port, file);
String tempUrl = protocol + "://" + (host == null ? "" : host) + (port == -1 ? "" : ":" + port) + file;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overtly complex. We could just return the concatenated string. Except we want explicitly verify that the normalized URL is both a valid URI and URL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or use the URI internally for now. In getFileWithNormalizedPath(url) we convert again from URL to URI.

if (!hasSchemePattern.matcher(urlString).find()) {
// no protocol/scheme : try to prefix http://
try {
url = new URL("http://" + urlString);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, here might be the origin for the failing unit test: "http://" is added in front of "http:".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants