Anonymization through proxy servers
In certain cases a scrape needs to be anonymized in order to get the data you're after. Generally this means sending the HTTP requests through one or more proxy servers, over which you may or may not have control (see How to surf and screen-scrape anonymously for more on this). Up to this point this has been possible in screen-scraper, but the implementation has been relatively inelegant. Because of the needs of a recent client of ours, we've taken the time to flesh this out a bit, so that proxies are now handled much more gracefully in screen-scraper. To use the code cited in this post, you'll need to upgrade to the latest alpha version of screen-scraper.
The best way to explain is often by example, so here you go:
// Create a new ProxyServerPool object. This object will
// control how screen-scraper interacts with proxy servers.
ProxyServerPool proxyServerPool = new ProxyServerPool();
// We give the current scraping session a reference to
// the proxy pool. This step should ideally be done right
// after the object is created (as in the previous step).
session.setProxyServerPool( proxyServerPool );
// This tells the pool to populate itself from a file
// containing a list of proxy servers. The format is very
// simple: you should have one proxy server on each line of
// the file, with the host separated from the port by a colon.
// For example:
//
// one.proxy.example.com:8080
// two.proxy.example.com:3128
//
// But obviously without the slashes at the beginning.
proxyServerPool.populateFromFile( "proxies.txt" );
// screen-scraper can iterate through all of the proxies to
// ensure they’re responsive. This can be a time-consuming
// process unless it’s done in a multi-threaded fashion.
// This method call tells screen-scraper to validate up to
// 25 proxies at a time.
proxyServerPool.setNumProxiesToValidateConcurrently( 25 );
// This method call tells screen-scraper to filter the list of
// proxy servers using 7 seconds as a timeout value. That is,
// if a server doesn’t respond within 7 seconds, it’s deemed
// to be invalid.
proxyServerPool.filter( 7 );
// Once filtering is done, it’s often helpful to write the good
// set of proxies out to a file. That way you may not have to
// filter again the next time.
proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );
// You might also want to write out the list of proxy servers
// to screen-scraper’s log.
// This is the switch that tells the scraping session to make
// use of the proxy servers. Note that this can be turned on
// and off during the course of the scrape. You may want to
// anonymize some pages, but not others.
session.setUseProxyFromPool( true );
// As a scraping session runs, screen-scraper will filter out
// proxies that become non-responsive. If the number of proxies
// drops below a specified level, the pool can repopulate
// itself. That's what this method call controls.
proxyServerPool.setRepopulateThreshold( 5 );
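To make the validation step above a bit more concrete, here's a rough sketch of how a pool might validate proxies concurrently with a timeout. This is not screen-scraper's actual implementation — the class name and logic are our own illustration, using only a plain TCP connection check:

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ProxyFilterSketch {
    // Returns only the proxies that accept a TCP connection within
    // timeoutSeconds, checking up to maxConcurrent proxies at a time.
    // Each entry is a "host:port" string, as in proxies.txt.
    static List<String> filter(List<String> proxies, int maxConcurrent, int timeoutSeconds)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        List<Future<String>> futures = new ArrayList<>();
        for (String proxy : proxies) {
            futures.add(pool.submit(() -> {
                String[] parts = proxy.split(":");
                try (Socket socket = new Socket()) {
                    // If the server doesn't respond within the timeout,
                    // connect() throws and the proxy is deemed invalid.
                    socket.connect(new InetSocketAddress(parts[0],
                            Integer.parseInt(parts[1])), timeoutSeconds * 1000);
                    return proxy;
                } catch (Exception e) {
                    return null; // non-responsive; filter it out
                }
            }));
        }
        pool.shutdown();
        List<String> good = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                String proxy = f.get();
                if (proxy != null) good.add(proxy);
            } catch (ExecutionException ignored) {}
        }
        return good;
    }
}
```

The thread pool is what keeps this from being painfully slow: with 25 worker threads, 25 of the 7-second timeouts can elapse in parallel instead of back to back.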
During the course of the scrape, you may find that a proxy has been blocked. When this happens, you can call a method on the scraping session that tells screen-scraper to remove the proxy from the pool.
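The bookkeeping behind removing a blocked proxy and the repopulate threshold might look roughly like the following. Again, this is just our sketch under assumed names, not screen-scraper's code; here the reserve list stands in for whatever source the pool repopulates from:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class ProxyPoolSketch {
    private final Deque<String> available = new ArrayDeque<>();
    private final List<String> reserveList; // e.g. re-read from a proxies file
    private final int repopulateThreshold;

    public ProxyPoolSketch(List<String> initial, List<String> reserve, int threshold) {
        available.addAll(initial);
        this.reserveList = reserve;
        this.repopulateThreshold = threshold;
    }

    // Called when a proxy turns out to be blocked or non-responsive.
    public void removeProxy(String proxy) {
        available.remove(proxy);
        // Once the pool shrinks below the threshold, refill it.
        if (available.size() < repopulateThreshold) {
            for (String p : reserveList) {
                if (!available.contains(p)) available.add(p);
            }
        }
    }

    // Hands out proxies round-robin for successive requests.
    public String nextProxy() {
        String p = available.pollFirst();
        available.addLast(p);
        return p;
    }

    public int size() { return available.size(); }
}
```

The point of the threshold is that repopulating (or re-validating) is expensive, so you only want to do it once the pool has genuinely thinned out, not on every removal.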
Given that this feature is still in the alpha version of screen-scraper, there's a chance we might change the methods a bit, but for the most part you should be able to use it as you see it here.
It also might be of interest to note that we've done a slightly extended implementation of this technique, which we're using internally, that makes use of Amazon's EC2 service. This allows us to maintain a pool of high-speed proxy servers in arbitrary quantities. As the proxy servers get blocked, they can be automatically terminated, with others spawned to replace them.