To Anonymize or to Not Anonymize

Posted in Tips on 11/11/10 by scottw

Lately we find an increasing need to anonymize our scraping sessions. So, as necessity is the mother of invention, we have created and adopted a handful of different approaches to keep our scrapes up and running.

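Keep in mind that the most dependable way for a website to block a web crawler is for its server to refuse connections from the offending IP address, which is why most of the techniques here revolve around changing the address your requests come from.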

Proactive Anonymization

This approach is used before any blocking has occurred. Ideally, a proactive approach would be the only technique needed.

Using screen-scraper’s Anonymization Service, set up your scraping session to spawn between three and five proxy servers when it starts. Then create a script whose job is to shut down one proxy server and spawn a new one at a random interval (say, every 3 to 5 minutes).
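The rotation idea can be sketched outside of screen-scraper as well. The following is a minimal sketch only, not the Anonymization Service's actual API: the `spawn` and `shutdown` callables are hypothetical stand-ins for whatever starts and stops a proxy in your setup, and the class name and parameters are made up for illustration.

```python
import random
import time


class RotatingProxyPool:
    """Keeps a few active proxies and replaces one at a random interval."""

    def __init__(self, spawn, shutdown, size=4, min_wait=180, max_wait=300,
                 clock=time.monotonic, rng=None):
        self._spawn = spawn        # hypothetical: starts a proxy, returns its address
        self._shutdown = shutdown  # hypothetical: stops the proxy at that address
        self._min_wait = min_wait  # shortest gap between rotations, in seconds
        self._max_wait = max_wait  # longest gap between rotations, in seconds
        self._clock = clock
        self._rng = rng or random.Random()
        self.proxies = [spawn() for _ in range(size)]
        self._next_rotation = self._schedule()

    def _schedule(self):
        # Pick the next rotation time somewhere in the 3 to 5 minute window.
        return self._clock() + self._rng.uniform(self._min_wait, self._max_wait)

    def rotate_if_due(self):
        """Swap out one randomly chosen proxy if the interval has elapsed."""
        if self._clock() < self._next_rotation:
            return False
        index = self._rng.randrange(len(self.proxies))
        self._shutdown(self.proxies[index])
        self.proxies[index] = self._spawn()
        self._next_rotation = self._schedule()
        return True

    def pick(self):
        """Return a proxy for the next request, rotating first if one is due."""
        self.rotate_if_due()
        return self._rng.choice(self.proxies)
```

In a scraping loop you would call `pool.pick()` before each request, so rotation happens as a side effect of normal traffic rather than on a separate timer thread.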

It is also useful to switch up the User Agent at least each time you switch out a proxy. It can be even more effective if you switch it up on every request.
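As a rough illustration of switching the User Agent on every request, here is a small sketch using Python's standard library. The strings in `USER_AGENTS` are illustrative placeholders and `build_request` is a hypothetical helper name; in practice you would substitute strings taken from current, real browsers.

```python
import random
import urllib.request

# Illustrative values only; replace with strings from browsers you want to mimic.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]


def build_request(url, rng=random):
    """Build a request object whose User-Agent is chosen at random."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", rng.choice(USER_AGENTS))
    return request
```

Calling `build_request` once per page fetch gives the per-request variation described above; calling it only when a proxy is swapped gives the more conservative variant.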

Similarly, when possible, you can change your referrer to a random URL that is off of the target domain. This makes it appear as though a different user is entering the site from an external source (typically considered positive traffic).
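A sketch of choosing an off-domain referrer might look like the following. The candidate URLs are placeholders on reserved example domains, `pick_referrer` is a hypothetical helper, and the same-site check is deliberately naive (it only compares host names and subdomains of the target host). Note that the HTTP header itself is spelled `Referer`.

```python
import random
from urllib.parse import urlsplit

# Placeholder URLs; substitute pages that plausibly link to the target site.
EXTERNAL_REFERRERS = [
    "https://www.example.org/links",
    "https://search.example.net/?q=widgets",
    "https://blog.example.com/roundup",
]


def same_site(host, target_host):
    """Naive check: same host, or a subdomain of the target host."""
    return host == target_host or host.endswith("." + target_host)


def pick_referrer(target_url, candidates=EXTERNAL_REFERRERS, rng=random):
    """Return a random candidate that is not on the target's domain, or None."""
    target_host = urlsplit(target_url).hostname or ""
    off_domain = [
        url for url in candidates
        if not same_site(urlsplit(url).hostname or "", target_host)
    ]
    return rng.choice(off_domain) if off_domain else None
```

The returned value would then be sent as the `Referer` header on the first request to the target site.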

Reactive Anonymization

This is necessary once a site starts blocking your IP address.

The first approach is to use screen-scraper’s built-in Anonymization Service. The current implementation makes use of Amazon EC2 servers as proxies. Because these are Linux EC2 instances, we have access to Squid, a popular proxy server that is already installed.

A limitation of using Amazon EC2 is that its instances reside in a finite and predictable block of IP addresses. We have had a number of sites block Amazon EC2 addresses wholesale.
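To see why a predictable address block is so easy to shut out, consider how little code a site needs to reject a whole range. This is a sketch from the blocking side, under stated assumptions: the networks listed are reserved documentation ranges (RFC 5737), not Amazon's real ranges, and `is_blocked` is a hypothetical name.

```python
import ipaddress

# Placeholder ranges from the RFC 5737 documentation blocks, NOT real EC2 ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_blocked(client_ip, networks=BLOCKED_NETWORKS):
    """Return True if the client address falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)
```

Two list entries cover 512 addresses here; spawning a fresh proxy inside the same range does nothing to get around a rule like this.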

Once Amazon EC2 instances are no longer effective, you can make use of three other ad hoc techniques.

Tor: The Tor network is spread widely across many different nodes and can prove difficult (almost impossible) to block. However, because its relays run on all kinds of machines with varying connection speeds, the relay speed is roughly 1/10th that of a normal connection. But it’s free.

I2P: Similar to Tor but a bit better maintained, which means faster connections. However, there are far fewer proxy nodes, which leaves a site with fewer IP addresses to block. But it’s free.

Anonymization via Manual Proxy Pools: Using proxy pools should be a last resort because the nature of the proxies is unknown and they are often unreliable. You are making use of computers on the Internet that have been set up with an open port for all the world to relay its traffic through. The owner of a server may close the open port at any time. But it’s free.
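Because open proxies can disappear at any time, a manual pool usually needs to be pruned before use. The following is a minimal sketch, assuming the pool is a simple list of `(host, port)` pairs; `is_reachable` and `prune_pool` are hypothetical helper names, and a successful TCP connection only shows the port is open, not that the proxy will actually relay traffic.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the proxy port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def prune_pool(pool, check=is_reachable):
    """Keep only the (host, port) entries that pass the reachability check."""
    return [(host, port) for host, port in pool if check(host, port)]
```

Running `prune_pool` at the start of a scraping session, and again whenever a request through a proxy fails, keeps dead entries from slowing the scrape down.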

See the following resources to read more about Anonymizing screen-scraper.
