@sebastian-nagel
Contributor

This PR adds unit tests for all examples provided in RFC 9309. For the examples whose paths contain percent-encoded characters, the unit tests of Google's RFC reference parser and the errata were consulted:

  • /foo/bar/\u30C4 and its percent-encoded form /foo/bar/%E3%83%84: see the errata
  • the paths /foo/bar?baz=https://foo.bar and /foo/bar?baz=https%3A%2F%2Ffoo.bar are matched "as is", without decoding or encoding, which is also how the unit tests of the reference parser treat them. See also the discussion in #309 about crawler-commons' BasicURLNormalizer.
  • /foo/bar/%62%61%7A in the robots.txt matches /foo/bar/baz. This is a minor improvement of SimpleRobotRulesParser over the reference parser.
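
To make the three cases above concrete, here is a minimal sketch in Python of one way to bring paths into a comparable form before prefix matching. It is not the code of SimpleRobotRulesParser or of the reference parser; `normalize_path` and `rule_matches` are hypothetical names, and wildcards (`*`, `$`) are deliberately left out. The assumed rule is: decode a percent-encoded octet only if it is an RFC 3986 unreserved character, leave every other escape in place (with upper-case hex digits), and percent-encode non-ASCII characters as UTF-8.

```python
import re

# RFC 3986 "unreserved" characters; only these are decoded before comparison.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_path(path: str) -> str:
    """Hypothetical helper: normalize percent-encoding in a path."""

    def _decode_unreserved(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch  # e.g. %62 becomes "b"
        return "%" + match.group(1).upper()  # keep the escape, e.g. %3a becomes %3A

    path = _PCT.sub(_decode_unreserved, path)

    # Percent-encode non-ASCII characters as UTF-8 bytes; ASCII is left as is,
    # so literal reserved characters such as ":" and "/" are not encoded.
    out = []
    for ch in path:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def rule_matches(rule_path: str, url_path: str) -> bool:
    """Hypothetical helper: plain prefix match on normalized paths (no wildcards)."""
    return normalize_path(url_path).startswith(normalize_path(rule_path))
```

Under these assumptions, `rule_matches("/foo/bar/%62%61%7A", "/foo/bar/baz")` is true, `/foo/bar/\u30C4` normalizes to `/foo/bar/%E3%83%84`, and the two `?baz=...` paths stay distinct because `%3A` and `%2F` are reserved and therefore never decoded.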

@sebastian-nagel sebastian-nagel merged commit 6523fd2 into crawler-commons:master Jun 13, 2023
@sebastian-nagel
Contributor Author

Thanks, @rzo1!

sebastian-nagel added a commit that referenced this pull request Jun 13, 2023
@sebastian-nagel sebastian-nagel deleted the rfc-9309-examples-as-unit-tests branch March 13, 2025 08:23