Scaling & Optimizing screen-scraper

Posted in Uncategorized on 10/11/10 by jason

I get a lot of requests for help configuring and running screen-scraper so that it scrapes at an optimal rate.  As is often the case with optimization, this is as much art as science, since the many variables that can affect the speed of a scrape are impossible to catalog.  While these steps will help you achieve a higher scraping rate, it is impossible to predict the maximum rate available in your situation and setup.

Server
Generally, screen-scraper is a fairly lightweight application; however, the needs of each scraping server differ.
Screen-scraper is cross-platform and can be deployed successfully on a number of server operating systems.  We have found, however, that for the most part Linux-based servers tend to be somewhat more dependable and scraping-friendly.
The more intense the scraping needs, the more screen-scraper can take advantage of system resources.  One set of very successful, high-end scraping servers is configured thus:

  • Intel Core 2 Duo at 2 GHz
  • 4 GB RAM
  • CentOS
  • Multiple servers

In cases where there is an abundance of scrapes that need to be run simultaneously, it is advisable to have multiple servers for load balancing.  In these cases we have used physical servers in various locations, virtual servers, or a hybrid of the two.  Ekiwi is able to build a custom controller to manage multiple servers, including spawning and closing virtual servers.

Network
The network connection to the site(s) with which you are interacting is the single most important factor in optimizing screen-scraper’s speed.  It is important to have adequate bandwidth available.  As you increase the number of concurrent scrapes, you will need to have greater bandwidth to accommodate them.
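As a rough illustration of how bandwidth needs grow with concurrency, the following Python sketch multiplies out a few assumed figures.  The function name, the request rate, and the average page size are all hypothetical values chosen for the example, not measurements from screen-scraper; substitute numbers observed on your own scrapes.

```python
def required_bandwidth_mbps(concurrent_scrapes, requests_per_second_each, avg_response_kb):
    """Rough sustained download bandwidth, in megabits per second.

    All three inputs are assumptions you supply; this ignores request
    overhead, compression, and bursts.
    """
    kilobytes_per_second = concurrent_scrapes * requests_per_second_each * avg_response_kb
    # 1 kilobyte = 8 kilobits, and 1000 kilobits = 1 megabit
    return kilobytes_per_second * 8 / 1000


if __name__ == "__main__":
    # Hypothetical figures: each scrape makes 2 requests per second,
    # and responses average 60 KB.
    for scrapes in (5, 25, 100):
        mbps = required_bandwidth_mbps(scrapes, 2, 60)
        print(f"{scrapes} concurrent scrapes need roughly {mbps:.1f} Mbit/s")
```

Under these assumed figures the requirement scales linearly, so going from 5 to 100 concurrent scrapes multiplies the bandwidth needed by twenty.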
Screen-scraper is already configured to make only HTTP requests specified, and will not make subsequent requests for images, scripts, CSS files, frames, etc.
Some factors in the network connection are out of your control.  Latency, the travel time between your ISP node and the site, is often dictated by distance to the remote server, and the site's own response time cannot be improved by any setting in screen-scraper or on the network.
In some cases anonymization is desired.  Whenever you anonymize, you introduce additional stops (and distance) between you and the remote site.  These stops can have a substantial and detrimental effect on the speed of your scrape.  Some scenarios include:

  • Tor/Privoxy:  This package is desirable because it is free to use, plus the large number and variation of IPs makes it very difficult to block.  Generally Tor is a slower option, but there is some configuration that can be done to seek fast exit nodes, etc.
  • I2P:  Like Tor/Privoxy, this is free to use, though it has fewer IPs and the drawback of generally limited speed.
  • EC2:  The Amazon EC2 cloud spawns a number of virtual servers, and screen-scraper is set up to tie into and use these virtual servers as proxies.  This option provides consistently fast proxies, but there is a finite number of IPs available so it can be blocked, and in some cases the sites you are scraping can determine that unwanted traffic is coming from EC2 and make an abuse report to Amazon.  The servers cost $0.25 per server per hour.
  • Anonymizer:  This third-party option hosts an array of fast servers that are easy to configure.  This too has a finite number of IP addresses, and can sometimes be blocked.  The company charges per HTTP request made.

Screen-scraper
Screen-scraper is already largely configured for optimal speed.  Scrapes should always be run from a command line or through the server (the workbench is meant for development of scraping sessions).
Make sure that screen-scraper is given an adequate memory allocation; we’ve found that 768 MB is optimal, and that higher settings offer little added benefit.
After a scrape is set up and stable, one should stop logging or reduce the logging level.
Ensure that the connection timeout and data extraction timeout are set no higher than needed for the scrape.  Sometimes too low a timeout will miss an HTTP response if the remote server takes too long to respond, but in many cases missing an occasional record is preferable to waiting for it.

Scraping session
The primary indicator of how much time it will take to run a scrape is a count of how many HTTP requests are required.  Large datasets will usually require more requests, so any steps you can take to focus your results will save time.  Scraping sessions should be designed not to make any unnecessary HTTP requests.
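To make the relationship between request count and run time concrete, here is a back-of-the-envelope estimate in Python.  The function and its parameters are illustrative assumptions, and the model assumes that concurrent sessions divide the work perfectly, which real scrapes rarely achieve; treat the result as a floor on run time rather than a prediction.

```python
def estimated_scrape_seconds(request_count, latency_s, server_response_s, concurrent_sessions=1):
    """Idealized run time: every request costs latency plus server response time.

    Assumes requests are spread evenly across sessions with no contention.
    """
    seconds_per_request = latency_s + server_response_s
    return request_count * seconds_per_request / concurrent_sessions


if __name__ == "__main__":
    # Hypothetical scrape: 10,000 requests, 0.2 s latency, 0.3 s server response.
    full = estimated_scrape_seconds(10_000, 0.2, 0.3)
    # Focusing the results so only 4,000 requests are needed.
    focused = estimated_scrape_seconds(4_000, 0.2, 0.3)
    print(f"full: {full / 60:.0f} min, focused: {focused / 60:.0f} min")
```

Because latency and server response time are largely fixed, the request count is the term you can most readily reduce.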
For some scenarios there can be an advantage in running multiple threads against a site.  This allows a large number of scraping sessions to target smaller subsets of the site in tandem.  This method generally demands more of the server’s resources, but will offer a net gain.  With screen-scraper professional edition, you can run up to 5 concurrent scraping sessions, whereas with enterprise edition you may run as many as the server’s resources will allow; determining the number of scrapes to run on any server is a matter of testing and monitoring.  We will often set the server to run 100 concurrent scrapes to establish a baseline, and adjust from there if needed.  Sometimes screen-scraper or Java will use all of the resources available to it while the server still has capacity; in such cases you will see greater performance by installing an additional instance of screen-scraper instead of further taxing the existing instance.
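The idea of splitting a site into subsets and working on them in tandem can be sketched in plain Python.  This is a generic illustration using the standard library's thread pool, not screen-scraper's own mechanism for spawning scraping sessions; `partition`, `scrape_subset`, and `run_in_parallel` are hypothetical names, and `scrape_subset` is a placeholder that makes no HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split items into `parts` roughly equal subsets, dealt out round-robin."""
    return [items[i::parts] for i in range(parts)]


def scrape_subset(subset):
    """Placeholder for one session; a real one would fetch and extract each item."""
    return [f"scraped {item}" for item in subset]


def run_in_parallel(items, sessions):
    """Run one placeholder session per subset and gather all results."""
    subsets = partition(items, sessions)
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        batches = list(pool.map(scrape_subset, subsets))
    return [row for batch in batches for row in batch]


if __name__ == "__main__":
    # Hypothetical subsets of a site: one category letter per unit of work.
    categories = list("ABCDEFGH")
    for row in run_in_parallel(categories, sessions=3):
        print(row)
```

The number of sessions is the value to tune by testing and monitoring, starting from a baseline and adjusting as resource usage dictates.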
When scraping data from the remote site, it is often faster to write data to a file on the fly so screen-scraper needn’t pause for database queries.  In cases where direct database interaction is required, ensure that the database is optimized, indexed, and has a fast connection to the scraping server(s).
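One way to picture this approach is to append each record to a CSV file during the scrape and load the file into the database in a single batch afterward.  The sketch below uses Python's standard `csv` and `sqlite3` modules purely for illustration; the `products` table, its columns, and the function names are assumptions made up for the example, and your own database and schema will differ.

```python
import csv
import os
import sqlite3
import tempfile


def append_record(path, record):
    """Append one scraped record to a CSV file; no database round trip."""
    with open(path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(record)


def bulk_load(path, connection):
    """After the scrape, insert every row from the file in one transaction."""
    connection.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [(name, float(price)) for name, price in csv.reader(handle)]
    with connection:  # commits on success, rolls back on error
        connection.executemany("INSERT INTO products VALUES (?, ?)", rows)
    return len(rows)


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "records.csv")
    # During the scrape: cheap appends only.
    append_record(out_path, ["widget", "9.99"])
    append_record(out_path, ["gadget", "19.50"])
    # After the scrape: a single batch load.
    db = sqlite3.connect(":memory:")
    print(bulk_load(out_path, db), "rows loaded")
```

The scrape itself never waits on the database; the cost of inserting is paid once, at the end.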
