@sebastian-nagel

The class EffectiveTldFinder.EffectiveTLD is used for two purposes:

  1. it represents one line (one suffix) of the public suffix list, i.e. the suffix string and its properties: whether it's a wildcard suffix, an exception, or part of the private section of the list
  2. it holds the result of a suffix match returned by getEffectiveTLD(hostname)

Due to this dual use, there's some confusion about what EffectiveTLD should contain as a match result.

Let's look at two problematic matches:

  • matching a wildcard suffix, *.bd in the public suffix list:
    // bd : https://en.wikipedia.org/wiki/.bd
    *.bd
    
    The match result does not state that it's a wildcard suffix. As the domain, it shows the literally matched suffix, but it does not provide the original line from the public suffix list:
    echo abc.def.bd | java -cp ... crawlercommons.domains.EffectiveTldFinder -etld -excludePrivate -strict
    abc.def.bd      [domain=def.bd,wild=false,exception=false,private=false]
  • in case of IDNs, only the ASCII version of the suffix is given; the literally matched suffix is not part of the result:
    中国政府网.政务   [domain=xn--zfr164b,wild=false,exception=false,private=false]
    

This PR addresses these two points – the results are now:

abc.def.bd      [domain=def.bd,suffix=*.bd,idn=null,wildcard=true,exception=false,private=false]
中国政府网.政务   [domain=政务,suffix=xn--zfr164b,idn=政务,wildcard=false,exception=false,private=false]
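
For illustration, the relation between the `suffix` and `idn` fields in the second result line can be reproduced with the JDK's `java.net.IDN`. This is a minimal sketch, not the code of this PR: the class `IdnSuffixSketch` and both helper methods are hypothetical, and returning `null` for non-IDN suffixes is an assumption modelled on `idn=null` in the first result line.

```java
import java.net.IDN;

// Hypothetical helper, not part of crawler-commons.
final class IdnSuffixSketch {

    /** ASCII (Punycode) form of a suffix, the form shown in the suffix field above. */
    static String toAsciiSuffix(String suffix) {
        return IDN.toASCII(suffix);
    }

    /**
     * Unicode form of an ASCII suffix, or null if the suffix is not an IDN
     * (assumption: this mirrors idn=null in the *.bd example).
     */
    static String toUnicodeOrNull(String asciiSuffix) {
        String unicode = IDN.toUnicode(asciiSuffix);
        return unicode.equals(asciiSuffix) ? null : unicode;
    }

    public static void main(String[] args) {
        String ascii = toAsciiSuffix("政务");
        System.out.println(ascii + " <-> " + toUnicodeOrNull(ascii));
        System.out.println("bd <-> " + toUnicodeOrNull("bd"));
    }
}
```
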

In detail:

  • ensure the flag wildcard (and all other flags) is transferred from the ETLD representing the public suffix line to the one holding the match result
  • add the method EffectiveTLD.getSuffix() which returns the suffix as specified in the public suffix list, e.g. *.bd
  • getDomain() still returns the literally matched suffix (ETLD), also for Unicode suffixes (IDNs)
  • add method EffectiveTLD.getUnicodeDomain() which - in case of an IDN - returns the Unicode representation of the suffix
  • rename wild to wildcard and isWild() to isWildcard()
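
The split between the suffix as listed and the suffix as literally matched can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in this PR: the class `WildcardMatchSketch`, its nested `Match` holder and the method `match` are hypothetical names, and exception rules, the private section and IDNs are ignored.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of crawler-commons.
final class WildcardMatchSketch {

    /** Holds a match result: the literally matched suffix plus the rule it came from. */
    static final class Match {
        final String domain;    // literally matched suffix, e.g. def.bd
        final String suffix;    // rule as written in the list, e.g. *.bd
        final boolean wildcard; // flag carried over from the rule

        Match(String domain, String suffix, boolean wildcard) {
            this.domain = domain;
            this.suffix = suffix;
            this.wildcard = wildcard;
        }
    }

    /** Applies a single rule to a host name; returns null if the rule does not match. */
    static Match match(String hostname, String rule) {
        boolean wildcard = rule.startsWith("*.");
        String base = wildcard ? rule.substring(2) : rule;
        String[] hostLabels = hostname.split("\\.");
        String[] baseLabels = base.split("\\.");
        // A wildcard rule consumes one additional label to the left of its base.
        int needed = baseLabels.length + (wildcard ? 1 : 0);
        if (hostLabels.length < needed) {
            return null;
        }
        // Compare the trailing labels of the host name with the rule's base.
        for (int i = 1; i <= baseLabels.length; i++) {
            String hostLabel = hostLabels[hostLabels.length - i];
            String baseLabel = baseLabels[baseLabels.length - i];
            if (!hostLabel.equalsIgnoreCase(baseLabel)) {
                return null;
            }
        }
        String matched = String.join(".",
                Arrays.copyOfRange(hostLabels, hostLabels.length - needed, hostLabels.length));
        return new Match(matched, rule, wildcard);
    }
}
```

With these assumptions, `match("abc.def.bd", "*.bd")` yields `domain=def.bd`, `suffix=*.bd` and `wildcard=true`, which corresponds to the first result line above.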

EffectiveTLD

- ensure that the flag `wildcard` is properly set
- rename `wild` to `wildcard` and `isWild()` to `isWildcard()`
- add method EffectiveTLD.getSuffix() which returns the suffix
  as specified in the public suffix list, e.g. `*.bd`, while
  getDomain() still returns the matched suffix (ETLD)
- ensure that getDomain() returns the literally matched suffix
  also for Unicode suffixes (IDNs)
- add method EffectiveTLD.getUnicodeDomain() which - in case of
  an IDN - returns the Unicode representation of the suffix
- complete Javadoc, add unit tests
@sebastian-nagel sebastian-nagel added this to the 1.5 milestone Oct 31, 2024
@sebastian-nagel sebastian-nagel merged commit ded6697 into crawler-commons:master Nov 12, 2024
4 checks passed
@sebastian-nagel

Thanks for the review, @rzo1!

@sebastian-nagel sebastian-nagel deleted the etld-match-representation branch November 12, 2024 17:59