Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the
addition of a single line of code will enable compression support.
$ua->default_header('Accept-Encoding' => 'gzip');
You then need to make sure that you always use 'decoded_content' (rather than 'content') when dealing with the response object.
For other languages, all you need to do is add 'Accept-Encoding: gzip' to the HTTP request that you send, and then be prepared to deal with a 'Content-Encoding: gzip' in the response.
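As an illustration of those two steps in a language other than Perl, here is a minimal Python sketch using only the standard library. The function names (build_request, decode_body, fetch) are made up for this example and are not part of any particular spider. It assumes the server uses gzip only; a real crawler would also want error handling and support for other encodings.

```python
import gzip
import urllib.request


def build_request(url):
    """Return a request that tells the server we can accept gzip."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Undo gzip compression, but only if the server says it applied it."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    # No Content-Encoding (or something other than gzip): pass through as-is.
    return body


def fetch(url):
    """Fetch url and return the decoded response body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

The check on the Content-Encoding header matters because a server is free to ignore the request header and send the page uncompressed.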
Happily, some of the large spiders do support compression -- the googlebot and Yahoo Slurp do (to name but two). Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that it was a bug that it didn't work -- which would be fixed shortly.
I typically send an email of the form:
I noticed that your web spider does not support content compression. This is a pity as it causes increased network bandwidth usage on my web server for no good reason. Of course, it also increases your bandwidth consumption, but that isn't my problem!
Adding support for content compression can be very easy depending on the implementation language of your spider. See the page http://www.gladstonefamily.net/cgi-bin/shame.pl for a list of the current spiders that do not support content compression, and more information on how to fix it.
Thanks, Philip
Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.
| Crawler | Last IP used |
|---|---|
| boitho.com-dc/0.82 ( http://www.boitho.com/dcbot.html ) | 129.241.104.185 |
| boitho.com-dc/0.85 ( http://www.boitho.com/dcbot.html ) | 129.241.104.182 |
| boitho.com-dc/0.86 ( http://www.boitho.com/dcbot.html ) | 129.241.50.34 |
| envolk/1.7 (+http://www.envolk.com/envolkspiderinfo.html) | 98.173.26.142 |
| Gigabot/3.0 (http://www.gigablast.com/spider.html) | 66.231.189.147 |
| holmes/3.12 (OnetSzukaj/5.0; +http://szukaj.onet.pl) | 213.180.137.72 |
| ia_archiver | 209.234.171.44 |
| ia_archiver-web.archive.org | 207.241.229.63 |
| Jakarta Commons-HttpClient/3.0.1 | 72.36.114.238 |
| libwww-perl/5.805 | 127.0.0.1 |
| Mozilla | 70.42.129.250 |
| Mozilla/4.0 (compatible; MSIE 5.00; Windows 98) | 24.64.223.203 |
| Mozilla/4.0 (compatible; MSIE 5.01; Windows NT) | 81.223.254.34 |
| Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; obot) | 194.153.113.23 |
| Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MSIECrawler) | 211.137.211.67 |
| Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/delete_main.asp) | 202.179.180.53 |
| Mozilla/4.5 [en] (Win98; I) | 71.211.245.27 |
| Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0 | 194.170.95.210 |
| Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1 | 204.234.220.250 |
| Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 | 96.252.219.130 |
| Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/419 (KHTML, like Gecko) Safari/419.3 | 158.165.5.86 |
| Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3 | 63.253.32.69 |
| Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8 | 81.39.154.195 |
| Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.5) Gecko/20070713 Firefox/2.0.0.5 | 128.206.27.76 |
| Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3 | 74.13.110.202 |
| Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.8.1) Gecko/20061010 Firefox/2.0 | 190.24.49.153 |
| Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 | 205.200.73.242 |
| Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7 | 200.215.79.50 |
| Mozilla/5.0 (Windows;) NimbleCrawler 2.0.2 obeys UserAgent NimbleCrawler For problems contact: crawler@healthline.com | 70.42.186.129 |
| Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7 | 69.36.158.37 |
| msnbot-media/1.0 (+http://search.msn.com/msnbot.htm) | 65.55.212.89 |
| multicrawler (+http://sw.deri.org/2006/04/multicrawler/robots.html) | 140.203.155.251 |
| MyRobot/1.000 | 207.106.86.84 |
| psbot/0.1 (+http://www.picsearch.com/bot.html) | 217.212.224.170 |
| Python-urllib/2.4 | 83.36.166.206 |
| Python-urllib/2.5 | 64.41.145.89 |
| StackRambler/2.0 (MSIE incompatible) | 81.222.64.10 |
| WebImages 0.3 ( http://herbert.groot.jebbink.nl/?app=WebImages ) | 213.206.76.79 |
| Wget/1.10.2 | 65.57.245.11 |
| Wget/1.10.2 (Red Hat modified) | 24.61.153.45 |
| YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide) | 203.216.243.112 |
| Yandex/1.01.001 (compatible; Win16; I) | 77.88.26.27 |
| Yeti/1.0 (+http://help.naver.com/robots/) | 61.247.222.52 |