07.07.08

Techniques for Scraping Large Datasets

Posted in Tips at 2:45 pm by jason

Some of the sites we aspire to scrape contain vast amounts of data. In such cases, a scrape may run fine for a time, but eventually stop prematurely with the following message printed to the log:

The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap space BSF info: null at line: 0 column: columnNo

There can be a variety of causes, but most of the time the culprit is memory use during page iteration. Turning up the memory allocation for screen-scraper may take care of it, but it doesn’t address the root cause.

In a typical site structure, we input search parameters and are presented with a page of results and a link to view subsequent pages. If there are ten to twenty pages of results, it’s easiest to just scrape the “next page” link and run a script after the pattern is applied that scrapes the next page. The problem is that this approach is recursive. Once we’ve requested the search results and two subsequent “next pages,” the scrapeable files are still open in memory like so:

  • Scrapeable file “Search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”

Every “Next search results” opens a new scrapeable file while the previous one is still open. You can instead run the script from the scripts tab after the file is scraped, which keeps the dataSets from remaining in scope, but the scrapeable files themselves remain in memory. The scrape may get further, but memory still fills up with scrapeable files, and that may not be enough to get all the data.

The solution is to use an iterative approach.
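To make the difference concrete, here is a minimal Java simulation of the two strategies. It is not screen-scraper code: the class and method names are hypothetical, and "opening a file" is just a counter. It only tracks how many scrapeable files would be open at once.

```java
// Hypothetical simulation, not screen-scraper code: counts how many
// "scrapeable files" are open at the same time under each paging strategy.
class PagingDepthDemo {
    private int open = 0;     // files currently open
    private int maxOpen = 0;  // most files open at any one moment

    private void openFile() {
        open++;
        maxOpen = Math.max(maxOpen, open);
    }

    private void closeFile() {
        open--;
    }

    // Recursive: each page follows its own "next page" link before it closes,
    // so every earlier page is still open when the last one is requested.
    private void scrapeRecursive(int page, int lastPage) {
        openFile();
        if (page < lastPage) {
            scrapeRecursive(page + 1, lastPage);
        }
        closeFile();
    }

    // Iterative: the first page stays open and loops over the others,
    // each of which closes before the next one opens.
    private void scrapeIterative(int lastPage) {
        openFile();                      // "Search results"
        for (int page = 2; page <= lastPage; page++) {
            openFile();                  // "Next search results"
            closeFile();
        }
        closeFile();
    }

    static int maxOpenRecursive(int lastPage) {
        PagingDepthDemo demo = new PagingDepthDemo();
        demo.scrapeRecursive(1, lastPage);
        return demo.maxOpen;
    }

    static int maxOpenIterative(int lastPage) {
        PagingDepthDemo demo = new PagingDepthDemo();
        demo.scrapeIterative(lastPage);
        return demo.maxOpen;
    }

    public static void main(String[] args) {
        System.out.println("recursive: " + maxOpenRecursive(20));
        System.out.println("iterative: " + maxOpenIterative(20));
    }
}
```

In this model the recursive version holds one file open per page visited, while the iterative version never holds more than two, no matter how many pages there are.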

If the site we’re scraping shows the total number of pages, using an iterative method is easy. For my example, I’ll describe a site that has links for pages 1 through 20, and a “>>” indicator to show there are pages beyond 20.

On the first page of search results, I have three extractor patterns to extract the following information:

  1. Each result listed
  2. All the page numbers shown, and
  3. The next batch of results

When I get to the search results page, the first extractor pattern runs and drills into the details of each result as usual. The second extractor pattern grabs all the pages listed, so I get a dataSet named “Pages” containing links to pages 2 through 20, and I save the dataSet as a session variable. On the scripts tab, I then run this script after the file is scraped:

/*
Script gets all page numbers from the Pages extractor pattern, and iterates through them
*/

// Get variable
pages = session.getVariable("Pages");

// Clear session variable so it doesn't linger
session.setVariable("Pages", null);

// Loop through pages (loop bound assumed: one pass per record in the Pages dataSet)
for (i=0; i<pages.getNumDataRecords(); i++)
{
    // Since the page list appears twice, use only a number larger than that just used
    if (i>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", i);
        session.log("+++Scraping page #" + i);
        session.scrapeFile("Next search results");
    }
    else
    {
        session.log("+++Already have page #" + i + " so not scraping");
    }
}

The “for” loop keeps the first page of search results in memory, but when it calls the “Next search results” scrapeable file to go to page 2, that file only gets the results and doesn’t look for a next page. The loop closes out the second page before it starts the third, closes the third before starting the fourth, and so on.

The last extractor pattern on “Search results” looks for “>>”. I save that dataSet as a session variable named “Next batch pages”, and put this as the last script to run on the scripts tab:
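The guard against the duplicated page list can be sketched in plain Java. This is a simplified model rather than the script itself: it assumes each extracted link carries its page number and compares that number, not the loop index, against the last page scraped. The class name, method name, and sample link list are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the "only scrape a page number larger than the last
// one used" guard. Assumes each extracted link carries its page number.
class PageLoopDemo {
    // Returns the page numbers that would actually be requested, in order.
    static List<Integer> pagesToScrape(List<Integer> pageLinks, int currentPage) {
        List<Integer> scraped = new ArrayList<>();
        int lastPage = currentPage;       // plays the role of the PAGE session variable
        for (int page : pageLinks) {
            if (page > lastPage) {
                lastPage = page;          // remember the page we are about to scrape
                scraped.add(page);        // stands in for scraping "Next search results"
            }
            // otherwise the link is a repeat from the second copy of the list
        }
        return scraped;
    }

    public static void main(String[] args) {
        // The page list appears twice on the page, so every link is extracted twice.
        List<Integer> links = Arrays.asList(2, 3, 4, 2, 3, 4);
        System.out.println(pagesToScrape(links, 1));
    }
}
```

Because the tracked page number only ever increases, the second copy of the list is skipped without keeping a separate record of visited pages.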

import com.screenscraper.common.*;

/*
Script that checks if there is a next batch of pages
*/

if (session.getVariable("Next batch pages")!=null)
{
    pageSet = session.getVariable("Next batch pages");
    session.setVariable("Next batch pages", null);

    pages = pageSet.getDataRecord(0);
    page = Integer.parseInt(pages.get("PAGE"));

    if (page>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", page);
        session.log("+++Scraping page #" + page);
        session.scrapeFile("Next batch search results");
    }
    else
    {
        session.log("+++Already have page #" + page + " so not scraping");
    }
}

Now the “Next batch search results” scrapeable file must do all the things the first page of search results did: get each result, look for next page links, and look for a next batch of results. Using the iterative approach to cycle through pages enables you to request many more pages without keeping as many in memory, and without unnecessary pages in memory, the scrape will run far longer.
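As a rough illustration of why batching keeps fewer pages in memory, the following Java sketch counts open files when pages are looped within a batch and each batch opens the next one. It is a simulation under my own assumptions (a batch file stays open while the following batch runs, and a fixed batch size), not screen-scraper code, and all names are hypothetical.

```java
// Hypothetical simulation: pages inside a batch are scraped in a loop,
// and each batch's first page opens the next batch before it closes.
class BatchPagingDemo {
    private int open = 0;      // files currently open
    private int maxOpen = 0;   // most files open at any one moment
    private int requests = 0;  // total pages requested

    private void openFile() {
        open++;
        requests++;
        maxOpen = Math.max(maxOpen, open);
    }

    private void closeFile() {
        open--;
    }

    private void scrapeBatch(int firstPage, int totalPages, int batchSize) {
        openFile();  // first page of this batch ("Search results" or "Next batch search results")
        int lastInBatch = Math.min(firstPage + batchSize - 1, totalPages);
        for (int page = firstPage + 1; page <= lastInBatch; page++) {
            openFile();   // "Next search results"
            closeFile();  // closed before the next page in the batch opens
        }
        if (lastInBatch < totalPages) {
            // The ">>" link: the next batch starts while this batch file is still open.
            scrapeBatch(lastInBatch + 1, totalPages, batchSize);
        }
        closeFile();
    }

    // Returns {pages requested, most files open at once}.
    static int[] run(int totalPages, int batchSize) {
        BatchPagingDemo demo = new BatchPagingDemo();
        demo.scrapeBatch(1, totalPages, batchSize);
        return new int[] { demo.requests, demo.maxOpen };
    }

    public static void main(String[] args) {
        int[] result = run(100, 20);
        System.out.println("requests=" + result[0] + ", max open=" + result[1]);
    }
}
```

In this model, nesting grows by one file per batch rather than one per page, so 100 pages in batches of 20 peak at six open files instead of 100.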


3 Comments »

  1. kamiloklauss said,

    July 8, 2008 at 9:42 am

    It’s really an interesting approach. But what happens when you have a large amount of data, but to access each page you need to pass a “key” generated in the preceding page? Things like PeopleSoft, where you have to pass a code from the search page to the details page to load, and if you try to go from one details page to another details page you just get a “key error”. Is there any way you can handle that?

  2. jason said,

    July 8, 2008 at 11:21 am

    I have scraped some Peoplesoft sites and seen things like that. You also run into issues like that on some .NET pages which like to fling viewstates around like flapjacks in a country diner.

    You’ll just need one more extractor pattern on each page where the key appears; you can name your token as you please, but I would use something like “KEY” and set it as a session variable. Then on the next page request you just replace the value in that parameter with the ~#KEY#~ token. Since you’re scraping the value on every page, the session variable will always hold the most recently found value, so the next request should work.

  3. ryans said,

    July 10, 2008 at 2:09 pm

    For more information about optimizing screen-scraper, please see the following post on screen-scraper’s blog: http://community.screen-scraper.com/faq#81n882
