Google Open Sources Its ‘Web Crawler’ After 20 Years

The Robots Exclusion Protocol (REP), also known as robots.txt, is a standard many websites use to tell automated crawlers which parts of a site may and may not be crawled. However, it has never been an officially adopted standard, which has led to differing interpretations. In a bid to make REP an official web standard, Google has open-sourced its robots.txt parser and the associated C++ library, which it first created 20 years ago. You can find the tool on GitHub.
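
For developers who want to experiment with the parser, the idea is simple: feed it the raw robots.txt content, a user-agent name, and a URL, and it reports whether fetching that URL is allowed. The sketch below follows the usage shown in the google/robotstxt repository; the FooBot agent name, the example URL, and the robots.txt body are made up for illustration, and the class and method names should be checked against the repository's robots.h header (the library has to be built first and depends on Abseil).

    // Minimal sketch of calling Google's open-sourced robots.txt parser.
    // Names follow the google/robotstxt repository; verify against robots.h.
    #include <iostream>
    #include <string>

    #include "robots.h"  // googlebot::RobotsMatcher

    int main() {
      // Hypothetical robots.txt body, as fetched from a site.
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /private/\n";

      const std::string user_agent = "FooBot";  // hypothetical crawler name
      const std::string url = "https://example.com/private/page.html";

      googlebot::RobotsMatcher matcher;
      bool allowed = matcher.OneAgentAllowedByRobots(robots_txt, user_agent, url);

      std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
      return 0;
    }
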
REP was conceived back in 1994 by Dutch software engineer Martijn Koster, and today it is the de facto standard websites use to instruct crawlers. Google's Googlebot crawler reads a site's robots.txt file for instructions on which parts of the website it should ignore; if there is no robots.txt file, the bot assumes it is okay to crawl the entire site.
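
To make that concrete, here is a hypothetical robots.txt file that could sit at https://example.com/robots.txt. Each group starts with a User-agent line naming the crawler it applies to ("*" matches any crawler), Disallow blocks paths from being crawled, and Allow carves out exceptions:

    # Hypothetical robots.txt served at https://example.com/robots.txt

    # Rules for all crawlers:
    User-agent: *
    Disallow: /admin/
    Allow: /admin/help/

    # Rules specific to Google's crawler:
    User-agent: Googlebot
    Disallow: /tmp/

    Sitemap: https://example.com/sitemap.xml

Under the rules Google documents (and has proposed to the IETF), a crawler obeys only the most specific group that matches its name rather than merging groups, so in this example Googlebot would apply just the Disallow: /tmp/ rule, while every other crawler would skip /admin/ except for /admin/help/.
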
However, the protocol has been interpreted “somewhat differently over the years” by developers, leading to ambiguity and difficulty in “writing the rules correctly.” For instance, website owners face uncertainty when their “text editor includes BOM characters in their robots.txt files,” while crawler and tool developers are left asking “how should they deal with robots.txt files that are hundreds of megabytes large?”

This is why Google wants REP to be officially adopted as an internet standard with fixed rules for all. The company says it has documented exactly how REP should be used and has submitted its proposal to the Internet Engineering Task Force (IETF). While we cannot say with certainty that REP will become an official standard, it would certainly help both web visitors and website owners by producing more consistent search results and ensuring that a site's wishes are respected.
