
Conversation

@valfirst
Contributor

- increase log levels to raise the importance of certain messages
- optimize performance by using parameterized log messages (see the sketch after this list)

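Parameterized (SLF4J-style) log messages defer message formatting until the logger has confirmed the level is enabled, so suppressed messages cost almost nothing, while a higher level makes a message visible without enabling DEBUG globally. A minimal sketch of both changes, using a hypothetical class and the "Bad url" message as an example rather than the exact lines touched by this PR:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ParameterizedLoggingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(ParameterizedLoggingSketch.class);

    public static void main(String[] args) {
        String url = "htp://example.com/broken";

        // Before: string concatenation builds the message even when DEBUG is disabled,
        // and the message stays invisible unless DEBUG is enabled for this logger.
        LOG.debug("Bad url: [" + url + "]");

        // After: the message is only formatted if the level is enabled, and the higher
        // level (WARN instead of DEBUG) makes the problem visible with default settings.
        LOG.warn("Bad url: [{}]", url);
    }
}
```
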
@sebastian-nagel
Contributor

Hi @valfirst, thanks for the PR! Could you share the objective behind increasing the log level? I can understand that showing all bad lines or URLs without setting the level to DEBUG helps to detect potential format errors in sitemaps and similar. However, there was a reason to decrease the log level (see #145): this was some time ago, and the MIME type detection of the sitemap parser was less precise at that time. The point was to avoid any erroneously "announced" sitemap (for example, an HTML file referenced in the robots.txt) flooding the log file. I will check how this problem affects the sitemap parser with real-world sitemap links from robots.txt files. Alternatively, we could also log only the first N "bad URLs" per sitemap.
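
For reference, a minimal sketch of the "first N bad URLs per sitemap" idea. The class, counter, and limit are hypothetical; this is not part of the PR and, as the discussion below concludes, it was not implemented:

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Hypothetical helper that logs at most MAX_BAD_URLS_LOGGED bad URLs per sitemap. */
public class CappedBadUrlLogger {

    private static final Logger LOG = LoggerFactory.getLogger(CappedBadUrlLogger.class);
    private static final int MAX_BAD_URLS_LOGGED = 10;

    private final AtomicInteger badUrlCount = new AtomicInteger();

    public void logBadUrl(String url) {
        int count = badUrlCount.incrementAndGet();
        if (count <= MAX_BAD_URLS_LOGGED) {
            LOG.warn("Bad url: [{}]", url);
        } else if (count == MAX_BAD_URLS_LOGGED + 1) {
            // Summarize once instead of flooding the log with every further bad URL.
            LOG.warn("More than {} bad URLs in this sitemap, suppressing further messages",
                    MAX_BAD_URLS_LOGGED);
        }
    }
}
```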

@sebastian-nagel added this to the 1.5 milestone on May 17, 2024
@valfirst
Contributor Author

@sebastian-nagel In general your guess is correct. In my case the sitemap.xml was valid but contained a set of invalid URLs; the messages about them were not logged because we hadn't enabled the DEBUG level globally. I checked the source code and enabled the DEBUG level for crawlercommons.sitemaps.SiteMapParser (see the configuration sketch after this comment), but it resulted in a number of not-so-important messages. I thought it could be useful to filter the logged information, so I changed the logging in SiteMapParser based on my understanding of each message's importance.

I haven't faced cases like the one described in #145, and now I see the reasoning behind it.

> Alternatively, we could also log only the first N "bad URLs" per sitemap.

If you decide to go with this approach, I can help implement this solution.
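
As a note on the per-logger workaround described in the first paragraph above, and assuming Logback is the SLF4J backend in use (an assumption; any SLF4J backend offers an equivalent mechanism), the DEBUG level can be enabled for the sitemap parser alone, either in logback.xml or programmatically:

```java
import org.slf4j.LoggerFactory;

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

public class EnableSiteMapParserDebug {

    public static void main(String[] args) {
        // Equivalent logback.xml entry:
        //   <logger name="crawlercommons.sitemaps.SiteMapParser" level="DEBUG"/>
        Logger parserLogger =
                (Logger) LoggerFactory.getLogger("crawlercommons.sitemaps.SiteMapParser");
        parserLogger.setLevel(Level.DEBUG);
    }
}
```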

@sebastian-nagel
Contributor

Hi @valfirst,

> the MIME type detection of the sitemap parser was less precise at that time.
> [...] check how this problem affects the sitemap parser with real-world sitemap links from robots.txt files.

Done. The MIME type detection for sitemaps is quite precise since we implemented a simple MIME detector (#198) that supports only the very limited set of "sitemap MIME types" (see the sketch after this comment). I found no HTML sitemap that was erroneously detected as "text/plain"; consequently, there were no erroneous "Bad url: ..." log messages.

> Alternatively, we could also log only the first N "bad URLs" per sitemap.

> If you decide to go with this approach, I can help implement this solution.

Thanks! It's up to you: it would do no harm, but it no longer seems necessary.
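
The gist of a detector restricted to sitemap MIME types can be illustrated with a simple membership check. This is a hypothetical sketch of the general idea only, not the detector added in #198, and the list of accepted types is assumed:

```java
import java.util.Locale;
import java.util.Set;

public class SitemapMimeTypeCheck {

    // Assumed list for illustration; the actual set used by crawler-commons may differ.
    private static final Set<String> SITEMAP_MIME_TYPES = Set.of(
            "text/xml", "application/xml", "application/atom+xml", "application/rss+xml",
            "text/plain", "application/gzip", "application/x-gzip");

    public static boolean isLikelySitemap(String detectedMimeType) {
        // An HTML page "announced" as a sitemap in robots.txt is rejected here instead of
        // being parsed line by line and producing one "Bad url" message per line.
        return detectedMimeType != null
                && SITEMAP_MIME_TYPES.contains(detectedMimeType.toLowerCase(Locale.ROOT));
    }
}
```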

@valfirst
Contributor Author

> Thanks! It's up to you: it would do no harm, but it no longer seems necessary.

Let's skip it; I like the YAGNI principle.

@sebastian-nagel merged commit b950e38 into crawler-commons:master on May 28, 2024
@sebastian-nagel
Contributor

Thanks, @valfirst!
