
Conversation

@valfirst
Contributor

- increase log levels to raise the importance of certain messages
- optimize performance by using parameterized log messages (see the sketch after this list)

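Parameterized (SLF4J-style) log messages defer message formatting until the logger has confirmed the level is enabled, so suppressed messages cost almost nothing, while a higher level makes a message visible without enabling DEBUG globally. A minimal sketch of both changes, using a hypothetical class and the "Bad url" message as an example rather than the exact lines touched by this PR:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ParameterizedLoggingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(ParameterizedLoggingSketch.class);

    public static void main(String[] args) {
        String url = "htp://example.com/broken";

        // Before: string concatenation builds the message even when DEBUG is disabled,
        // and the message stays invisible unless DEBUG is enabled for this logger.
        LOG.debug("Bad url: [" + url + "]");

        // After: the message is only formatted if the level is enabled, and the higher
        // level (WARN instead of DEBUG) makes the problem visible with default settings.
        LOG.warn("Bad url: [{}]", url);
    }
}
```
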
@sebastian-nagel
Contributor

Hi @valfirst, thanks for the PR! Could you share the objective behind increasing the log level? I can understand that showing all bad lines or URLs without setting the level to DEBUG helps to detect potential format errors in sitemaps and similar. However, there was a reason to decrease the log level (see #145): this was some time ago, and the MIME type detection of the sitemap parser was less precise at that time. The point was to avoid any erroneously "announced" sitemap (for example, an HTML file referenced in the robots.txt) flooding the log file. I will check how this problem affects the sitemap parser with real-world sitemap links from robots.txt files. Alternatively, we could also log only the first N "bad URLs" per sitemap.
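
For reference, a minimal sketch of the "first N bad URLs per sitemap" idea. The class, counter, and limit are hypothetical; this is not part of the PR and, as the discussion below concludes, it was not implemented:

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Hypothetical helper that logs at most MAX_BAD_URLS_LOGGED bad URLs per sitemap. */
public class CappedBadUrlLogger {

    private static final Logger LOG = LoggerFactory.getLogger(CappedBadUrlLogger.class);
    private static final int MAX_BAD_URLS_LOGGED = 10;

    private final AtomicInteger badUrlCount = new AtomicInteger();

    public void logBadUrl(String url) {
        int count = badUrlCount.incrementAndGet();
        if (count <= MAX_BAD_URLS_LOGGED) {
            LOG.warn("Bad url: [{}]", url);
        } else if (count == MAX_BAD_URLS_LOGGED + 1) {
            // Summarize once instead of flooding the log with every further bad URL.
            LOG.warn("More than {} bad URLs in this sitemap, suppressing further messages",
                    MAX_BAD_URLS_LOGGED);
        }
    }
}
```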

@sebastian-nagel added this to the 1.5 milestone on May 17, 2024
@valfirst
Contributor Author

@sebastian-nagel In general your guess is correct. In my case the sitemap.xml was valid but contained a set of invalid URLs; the messages about them were not logged because we hadn't enabled the DEBUG level globally. I checked the source code and enabled the DEBUG level for crawlercommons.sitemaps.SiteMapParser (see the configuration sketch after this comment), but it resulted in a number of not-so-important messages. I thought it could be useful to filter the logged information, so I changed the logging in SiteMapParser based on my understanding of each message's importance.

I haven't faced cases like the one described in #145, and now I see the reasoning behind it.

> Alternatively, we could also log only the first N "bad URLs" per sitemap.

If you decide to go with this approach, I can help implement this solution.
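
As a note on the per-logger workaround described in the first paragraph above, and assuming Logback is the SLF4J backend in use (an assumption; any SLF4J backend offers an equivalent mechanism), the DEBUG level can be enabled for the sitemap parser alone, either in logback.xml or programmatically:

```java
import org.slf4j.LoggerFactory;

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

public class EnableSiteMapParserDebug {

    public static void main(String[] args) {
        // Equivalent logback.xml entry:
        //   <logger name="crawlercommons.sitemaps.SiteMapParser" level="DEBUG"/>
        Logger parserLogger =
                (Logger) LoggerFactory.getLogger("crawlercommons.sitemaps.SiteMapParser");
        parserLogger.setLevel(Level.DEBUG);
    }
}
```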

@sebastian-nagel
Contributor

Hi @valfirst,

> the MIME type detection of the sitemap parser was less precise at that time.
> [...] check how this problem affects the sitemap parser with real-world sitemap links from robots.txt files.

Done. The MIME type detection for sitemaps is quite precise since we implemented a simple MIME detector (#198) that supports only the very limited set of "sitemap MIME types" (see the sketch after this comment). I found no HTML sitemap that was erroneously detected as "text/plain"; consequently, there were no erroneous "Bad url: ..." log messages.

> Alternatively, we could also log only the first N "bad URLs" per sitemap.

> If you decide to go with this approach, I can help implement this solution.

Thanks! It's up to you: it would do no harm, but it no longer seems necessary.
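
The gist of a detector restricted to sitemap MIME types can be illustrated with a simple membership check. This is a hypothetical sketch of the general idea only, not the detector added in #198, and the list of accepted types is assumed:

```java
import java.util.Locale;
import java.util.Set;

public class SitemapMimeTypeCheck {

    // Assumed list for illustration; the actual set used by crawler-commons may differ.
    private static final Set<String> SITEMAP_MIME_TYPES = Set.of(
            "text/xml", "application/xml", "application/atom+xml", "application/rss+xml",
            "text/plain", "application/gzip", "application/x-gzip");

    public static boolean isLikelySitemap(String detectedMimeType) {
        // An HTML page "announced" as a sitemap in robots.txt is rejected here instead of
        // being parsed line by line and producing one "Bad url" message per line.
        return detectedMimeType != null
                && SITEMAP_MIME_TYPES.contains(detectedMimeType.toLowerCase(Locale.ROOT));
    }
}
```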

@valfirst
Contributor Author

> Thanks! It's up to you: it would do no harm, but it no longer seems necessary.

Let's skip it; I like the YAGNI principle.

@sebastian-nagel merged commit b950e38 into crawler-commons:master on May 28, 2024
@sebastian-nagel
Contributor

Thanks, @valfirst!
