Conversation

@sebastian-nagel

For quick testing of whether a robots.txt file is parsed as expected, e.g.

$ java ... SimpleRobotRulesParser http://www.robotstxt.org/robots.txt '*' http://www.robotstxt.org/norobots-rfc.txt
Checking URLs:
allowed         http://www.robotstxt.org/norobots-rfc.txt

The full set of options is:

SimpleRobotRulesParser <robots.txt> [[<agentname>] <URL>...]

Parse a robots.txt file
  <robots.txt>  URL pointing to robots.txt file.
                To read a local file use a file:// URL
                (parsed as http://example.com/robots.txt)
  <agentname>   user agent name to check for exclusion rules.
                If not defined check with '*'
  <URL>         check URL whether allowed or forbidden.
                If no URL is given show robots.txt rules
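The new main method is only a thin command-line wrapper around the existing parser API. As a rough illustration, the programmatic equivalent of the command above looks something like the sketch below; the parseContent signature, the example file name, and the example URLs are assumptions for illustration, not part of this pull request:

// Minimal sketch, assuming the crawler-commons 0.x API:
// parseContent(url, content, contentType, robotNames) and BaseRobotRules.isAllowed(url)
import java.nio.file.Files;
import java.nio.file.Paths;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        // read a local robots.txt (the CLI accepts a file:// URL for the same purpose)
        byte[] content = Files.readAllBytes(Paths.get("robots.txt"));

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // parse the rules as if they were served from http://example.com/robots.txt
        BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt", content, "text/plain", "*");

        // check a URL against the parsed rules, similar to the CLI output above
        String url = "http://example.com/some/page.html";
        System.out.println((rules.isAllowed(url) ? "allowed" : "disallowed") + "\t" + url);
    }
}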

The pull request also includes the following minor changes:

  • implement toString() for robot rules
  • fix line breaks in comments

@jnioche jnioche mentioned this pull request May 31, 2018
@jnioche jnioche added this to the 0.10 milestone Jun 4, 2018
@jnioche jnioche merged commit 0c75e75 into crawler-commons:master Jun 4, 2018

jnioche commented Jun 4, 2018

thanks @sebastian-nagel

jnioche added a commit that referenced this pull request Jun 4, 2018
Add main to SimpleRobotRulesParser for testing (#193)
@sebastian-nagel sebastian-nagel deleted the robots-parser-main branch June 5, 2018 08:40