Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the
addition of a single line of code will enable compression support:
    $ua->default_header('Accept-Encoding' => 'gzip');
You then need to make sure that you always refer to 'decoded_content' when dealing with the response object.
For other languages, all you need to do is to add
    Accept-Encoding: gzip
to the HTTP request that you send, and then be prepared to deal with a 'Content-Encoding: gzip' header in the response.
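As an illustration of the "other languages" case, here is a minimal sketch in Python using only the standard library. The function names (build_request, decode_body, fetch) are made up for this example and are not part of any crawler discussed here. The sketch assumes the server either returns gzip or returns the body uncompressed.

```python
import gzip
import urllib.request


def build_request(url):
    """Build a GET request that advertises gzip support to the server."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Return the plain bytes of a response body.

    A server is free to ignore Accept-Encoding, so only decompress when
    the response actually carries Content-Encoding: gzip.
    """
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url):
    """Fetch a URL and return its (decompressed) body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

A spider would call fetch(url) wherever it previously read the raw response. The point of checking the response header, rather than assuming compression, is that the same code keeps working against servers that do not compress.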
Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp do (to name but two). Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that its lack of compression support was a bug that would be fixed shortly.
Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.
|Crawler||Host||Last IP used|
|Aboundex/0.3 (http://www.aboundex.com/crawler/)||www.gladstonefamily.net||22.214.171.124|
|LinqiaMetadataDownloaderBot/1.0 (email@example.com)||blog1.gladstonefamily.net||126.96.36.199|
|Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)||gladstonefamily.net||188.8.131.52|
|Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, firstname.lastname@example.org)||pond1.gladstonefamily.net||184.108.40.206|
|Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)||www.gladstonefamily.net||220.127.116.11|
|Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)||gladstonefamily.net||18.104.22.168|
|Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)||pond1.gladstonefamily.net||22.214.171.124|
|Mozilla/5.0 (compatible; Windows; U; Windows NT 6.2; en-US; rv:12.0) Gecko/20120403211507 Firefox/12.0||pond1.gladstonefamily.net||126.96.36.199|
|Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 OPR/36.0.2130.32||pond1.gladstonefamily.net||188.8.131.52|
|Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 OPR/36.0.2130.32||pond1.gladstonefamily.net:8080||184.108.40.206|
|Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:220.127.116.11) Gecko/20070725 Firefox/18.104.22.168 - James BOT - WebCrawler http://cognitiveseo.com/bot.html||pond1.gladstonefamily.net||22.214.171.124|
|NextGenSearchBot 1 (for information visit http://www.zoominfo.com/About/misc/NextGenSearchBot.aspx)||www.gladstonefamily.net||126.96.36.199|