03.12.10

Capping response length

Posted in Tips at 6:42 pm by Todd Wilson

Once in a while when you’re scraping, you may request a file that turns out to be really large when you only need to pull data from the top portion of it. Downloading a big file in full can slow the scraping process down quite a bit. Not too long ago (somewhere around version 4.5.20a, I think) we added a method to deal with just such cases:

scrapeableFile.setMaxResponseLength( int maxKBytes )

This tells screen-scraper to download only the given number of kilobytes from the beginning of the file. Run this method in a script that gets invoked before the file is scraped. For example, if your script contained this line:

scrapeableFile.setMaxResponseLength( 50 );

screen-scraper would download the first 50 KB of the file, cut off the rest, then continue on.
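To make the idea concrete, here is a minimal plain-Java sketch of what capping a download looks like in general. It is not screen-scraper's internal code: the class name CappedReader and the method readCapped are made up for illustration, and it assumes one kilobyte means 1024 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of screen-scraper.
final class CappedReader {

    // Reads at most maxKBytes * 1024 bytes from the stream, then stops,
    // leaving whatever remains in the stream unread.
    // Assumption: 1 KB = 1024 bytes.
    static byte[] readCapped(InputStream in, int maxKBytes) throws IOException {
        int remaining = maxKBytes * 1024;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];

        while (remaining > 0) {
            // Never ask for more bytes than the cap still allows.
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n == -1) {
                break; // the file was shorter than the cap
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

With a cap of 50, a 200 KB stream yields 51,200 bytes, and a stream shorter than the cap comes back whole. The time saved comes from never reading the rest of the response.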

If the speed of a scraping session is especially critical, this can also be a great way to trim off quite a bit of download time.

03.11.10

Using OCR with screen-scraper

Posted in Tips at 1:47 pm by scottw

Within screen-scraper you can call outside programs directly from your scripts. The following example scraping session uses Tesseract OCR and ImageMagick to take an image from the internet and attempt to read the text in it.

As is, the scraping session is intended to run on Linux. However, both of the programs it depends on can also be run under Windows, either directly or using Cygwin.
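For a rough picture of what calling these programs from Java-style script code can look like, here is a minimal sketch. It is not taken from the sample scraping session, which may do things differently. The class name OcrCommands and the file names are hypothetical, and it assumes the ImageMagick convert command and the tesseract command are both installed and on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of screen-scraper.
final class OcrCommands {

    // ImageMagick: convert the downloaded image into a TIFF for the OCR step.
    // Assumption: the "convert" executable is on the PATH.
    static List<String> convertCommand(String inputImage, String outputTiff) {
        return Arrays.asList("convert", inputImage, outputTiff);
    }

    // Tesseract: read the TIFF and write the recognized text to <outputBase>.txt.
    // Assumption: the "tesseract" executable is on the PATH.
    static List<String> tesseractCommand(String inputTiff, String outputBase) {
        return Arrays.asList("tesseract", inputTiff, outputBase);
    }

    // Runs an external command, waits for it to finish, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A script would call run(convertCommand("captcha.jpg", "captcha.tif")), then run(tesseractCommand("captcha.tif", "captcha")), and finally read captcha.txt for the recognized text. An exit code other than 0 from either step usually means the program failed.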

To use:

Download and import the following scraping session.

http://community.screen-scraper.com/samples/ocr