Listing of web crawlers that do not support compression

If you are the author of any of these spiders, then please add support for content compression when you crawl the web. This will save you bandwidth on your crawling system, and it saves bandwidth on the servers that you crawl.

Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the addition of a single line of code will enable compression support.

$ua->default_header('Accept-Encoding' => 'gzip');
You then need to make sure that you always use 'decoded_content' (rather than 'content') when reading the body from the response object.

For other languages, all you need to do is add

Accept-Encoding: gzip
to the HTTP request that you send, and then be prepared to decompress the body when the response carries a 'Content-Encoding: gzip' header.
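
As a rough illustration of the two steps just described, here is a minimal sketch in Python using only the standard library. The helper names (build_request, decode_body, fetch) are made up for this example and are not part of any crawler listed on this page; a real spider would also want timeouts and error handling.

```python
import gzip
import urllib.request


def build_request(url):
    """Build a GET request that tells the server we accept gzip."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(content_encoding, raw_body):
    """Return the uncompressed body, given the Content-Encoding value (or None)."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(raw_body)
    # The server is free to ignore Accept-Encoding and reply uncompressed.
    return raw_body


def fetch(url):
    """Fetch a URL and return the decoded body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(encoding, response.read())
```

Calling fetch("http://example.com/") would then return the page body whether or not the server chose to compress it. The point to notice is that the decoding step is driven by the response header, not by what the request asked for.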

Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp, to name but two. Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that the lack of support was a bug that would be fixed shortly.

I typically send an email of the form:

Hi,

I noticed that your web spider does not support content compression. This is a pity as it causes increased network bandwidth usage on my web server for no good reason. Of course, it also increases your bandwidth consumption, but that isn't my problem!

Adding support for content compression can be very easy depending on the implementation language of your spider. See the page http://www.gladstonefamily.net/cgi-bin/shame.pl for a list of the current spiders that do not support content compression, and more information on how to fix it.

Thanks, Philip

Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.

Crawler | Last IP used
boitho.com-dc/0.82 ( http://www.boitho.com/dcbot.html ) | 129.241.104.185
boitho.com-dc/0.85 ( http://www.boitho.com/dcbot.html ) | 129.241.104.182
boitho.com-dc/0.86 ( http://www.boitho.com/dcbot.html ) | 129.241.50.34
envolk/1.7 (+http://www.envolk.com/envolkspiderinfo.html) | 98.173.26.142
Gigabot/3.0 (http://www.gigablast.com/spider.html) | 66.231.189.147
holmes/3.12 (OnetSzukaj/5.0; +http://szukaj.onet.pl) | 213.180.137.72
ia_archiver | 209.234.171.44
ia_archiver-web.archive.org | 207.241.229.63
Jakarta Commons-HttpClient/3.0.1 | 72.36.114.238
libwww-perl/5.805 | 127.0.0.1
Mozilla | 70.42.129.250
Mozilla/4.0 (compatible; MSIE 5.00; Windows 98) | 24.64.223.203
Mozilla/4.0 (compatible; MSIE 5.01; Windows NT) | 81.223.254.34
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; obot) | 194.153.113.23
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MSIECrawler) | 211.137.211.67
Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/delete_main.asp) | 202.179.180.53
Mozilla/4.5 [en] (Win98; I) | 71.211.245.27
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0 | 194.170.95.210
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1 | 204.234.220.250
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 | 96.252.219.130
Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/419 (KHTML, like Gecko) Safari/419.3 | 158.165.5.86
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3 | 63.253.32.69
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8 | 81.39.154.195
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.5) Gecko/20070713 Firefox/2.0.0.5 | 128.206.27.76
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3 | 74.13.110.202
Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.8.1) Gecko/20061010 Firefox/2.0 | 190.24.49.153
Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 | 205.200.73.242
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7 | 200.215.79.50
Mozilla/5.0 (Windows;) NimbleCrawler 2.0.2 obeys UserAgent NimbleCrawler For problems contact: crawler@healthline.com | 70.42.186.129
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7 | 69.36.158.37
msnbot-media/1.0 (+http://search.msn.com/msnbot.htm) | 65.55.212.89
multicrawler (+http://sw.deri.org/2006/04/multicrawler/robots.html) | 140.203.155.251
MyRobot/1.000 | 207.106.86.84
psbot/0.1 (+http://www.picsearch.com/bot.html) | 217.212.224.170
Python-urllib/2.4 | 83.36.166.206
Python-urllib/2.5 | 64.41.145.89
StackRambler/2.0 (MSIE incompatible) | 81.222.64.10
WebImages 0.3 ( http://herbert.groot.jebbink.nl/?app=WebImages ) | 213.206.76.79
Wget/1.10.2 | 65.57.245.11
Wget/1.10.2 (Red Hat modified) | 24.61.153.45
YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide) | 203.216.243.112
Yandex/1.01.001 (compatible; Win16; I) | 77.88.26.27
Yeti/1.0 (+http://help.naver.com/robots/) | 61.247.222.52

Comments, problems, etc. to Philip Gladstone

Last modified Sunday, 19 November 2006