Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the
addition of a single line of code will enable compression support:
$ua->default_header('Accept-Encoding' => 'gzip');
After that, you need to make sure that you always use 'decoded_content' (rather than 'content') when reading the body of the response object.
For other languages, all you need to do is add 'Accept-Encoding: gzip' to the HTTP request that you send, and then be prepared to deal with a 'Content-Encoding: gzip' in the response.
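As an illustration of the "other languages" case, here is a minimal sketch in Python using only the standard library. The function names (build_request, decode_body, fetch) are made up for this example and are not part of any crawler mentioned here; the sketch assumes the server either sends gzip or sends the body uncompressed.

```python
import gzip
import urllib.request


def build_request(url):
    """Create a request that advertises gzip support to the server."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Return the uncompressed body, given the Content-Encoding header value."""
    if content_encoding and content_encoding.strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    # No (or an unrecognised) encoding: hand the bytes back untouched.
    return body


def fetch(url):
    """Fetch a URL, undoing gzip compression if the server applied it."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Calling fetch() with a page URL returns the page bytes whether or not the server chose to compress them, which is the same guarantee that 'decoded_content' gives in the Perl case.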
Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp do (to name but two). Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that the lack of compression was a bug, which would be fixed shortly.
Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.
| Crawler | Last IP used |
|---|---|
| bitlybot | 184.72.0.132 |
| DoCoMo/2.0 P900i(c100;TB;W24H11) (compatible; ichiro/mobile goo; +http://help.goo.ne.jp/help/article/1142/) | 218.213.137.44 |
| envolk/1.7 (+http://www.envolk.com/envolkspiderinfo.html) | 98.173.26.142 |
| findlinks/2.0.2 (+http://wortschatz.uni-leipzig.de/findlinks/) | 77.21.147.173 |
| Gigabot/3.0 (http://www.gigablast.com/spider.html) | 64.22.106.82 |
| ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com) | 174.129.228.67 |
| ichiro/3.0 (http://help.goo.ne.jp/help/article/1142) | 218.213.29.188 |
| Java/1.6.0_04 | 89.123.3.226 |
| Java/1.6.0_20 | 46.137.54.254 |
| Mail.RU/2.0 | 217.69.133.28 |
| Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot) | 207.241.237.209 |
| Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8 | 74.112.131.128 |
| Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com) | 208.115.113.89 |
| Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots) | 95.108.158.240 |
| Python-urllib/2.6 | 107.22.98.31 |
| Python-urllib/2.7 | 70.32.154.133 |
| Wget/1.11.4 Red Hat modified | 127.0.0.1 |
| Wget/1.12 (linux-gnu) | 206.53.65.108 |
| Wget/1.9+cvs-stable (Red Hat modified) | 69.28.58.6 |
| YahooCacheSystem | 98.139.241.252 |