-
Notifications
You must be signed in to change notification settings - Fork 86
Replace deprecated url basic normalizer #531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Replace deprecated url basic normalizer #531
Conversation
I forgot to mention that there's one failing test when i manually run the tests:
This seems to be caused by normalization behavior within the URI constructor itself as I explicitly add |
Hello @sebastian-nagel , can you please review these changes when you have time? |
Can you check the build result? |
|
d718e9a
to
9debbbf
Compare
I think I was overcomplicating things a bit. I reset the branch and added a commit that now uses URI.toURL within parseStringToURL. Before that, I remove the fragment from the urlString to avoid escaping the # character, then call the escapePath method on the full string to clean up any illegal characters that might cause URI to throw an exception. Regarding the build results:
I downloaded the "effective_tld_names.dat" file used in the build workflow and ran the tests locally. These tests also fail on the master branch, so I don't believe they're related to my changes. might be caused by changes in "effective_tld_names.dat" file itself.
This happens because the |
Yep, I can confirm
from master. Reason: publicsuffix/list#2540 Will fix on master. |
@HamzaElzarw-2022 Think you can rebase now, so we get a clean view of the build |
9debbbf
to
eaf6158
Compare
Hi @rzo1, |
Regarding the failing unit test:
For now, we could just comment out the corresponding test (in weirdToNormalizedUrls.csv) and open a separate issue to address it later: test for the NPE. Since the URL is not crawlable anyway, the exact output does not matter. A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HamzaElzarw-2022, eventually the code can be simplified if the URI class is directly used without converting from and to URL. As far as I can see, all methods which take a URL object as parameter or returning one are private. It would be safe to change them.
For the failing unit test: see comments.
try { | ||
if (changed) { | ||
url = new URL(protocol, host, port, file); | ||
String tempUrl = protocol + "://" + (host == null ? "" : host) + (port == -1 ? "" : ":" + port) + file; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems overtly complex. We could just return the concatenated string. Except we want explicitly verify that the normalized URL is both a valid URI and URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... or use the URI internally for now. In getFileWithNormalizedPath(url)
we convert again from URL to URI.
if (!hasSchemePattern.matcher(urlString).find()) { | ||
// no protocol/scheme : try to prefix http:// | ||
try { | ||
url = new URL("http://" + urlString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, here might be the origin for the failing unit test: "http://" is added in front of "http:".
Description:
Replaces all deprecated usages of
new URL(...)
constructors inBasicURLNormalizer.java
which is the only remaining implementation class with such deprecated constructors.Details:
Directly replacing new URL(...) with URI.toURL() was not feasible in all cases due to behavioral differences between URL and URI. The reasoning and applied solutions are outlined below:
Handling
new URL(String url)
inparseStringToURL
methodThe method
parseStringToURL
, used at the beginning offilter
method to extract URL components, used new URL(String) to deconstruct the input string. Replacing this withURI
was not viable, asURI
is stricter and would throw exceptions for cases that were meant to be handled later in thefilter
method (e.g., URLs containing spaces or non-ASCII characters).To address this, a new method
deconstructURL
was introduced, which uses aurlDeconstructionPattern
regex to extract components without instantiating a URL prematurely.Handling
new URL(protocol, host, port, file)
constructorReplacing this form with its closest URI substitute
new URI(schema, userinfo, host, port, path, query, fragment)
led to unwanted percent encoding behavior. For instance:To avoid this, the URL is reconstructed as a string from its components after normalization, and then converted using new URI(String).toURL() to preserve encoding.
Other refactoring details:
url.getDefaultPort()
.I tried to justify the changes as possible hope it makes sense :)
I’d be happy to hear any feedback and will gladly update the PR to simplify or reduce the changes as possible.
Partially Addresses #522