07.07.08
Techniques for Scraping Large Datasets
Some of the sites we aspire to scrape contain vast amounts of data. In such cases, an attempt to scrape data from them may run fine for a time, but eventually stop prematurely with the following message printed to the log:
The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap space BSF info: null at line: 0 column: columnNo
There can be a variety of causes, but most of the time the culprit is memory use in page iteration. Turning up the memory allocation for screen-scraper may take care of it, but it doesn’t address the root cause.
In a typical site structure, we input search parameters and are presented with a page of results and a link to view subsequent pages. If there are ten to twenty pages of results, it’s easiest to just scrape the “next page” link and run a script after the pattern is applied that scrapes the next page. The problem is that this approach is recursive. By the time we’ve requested the search results and two subsequent “next pages,” the scrapeable files are all still open in memory:
- Scrapeable file “Search results” and dataSet “Next page”
- Scrapeable file “Next search results” and dataSet “Next page”
- Scrapeable file “Next search results” and dataSet “Next page”
Every “Next search results” opens a new scrapeable file while the previous one is still open. You can run the script on the scripts tab after the file is scraped to prevent the dataSets from remaining in scope, but the scrapeable files remain in memory. The scrape may get further, but the memory still fills up with scrapeable files, and it may not be enough to get all the data.
The solution is to use an iterative approach.
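To make the difference concrete, here is a minimal sketch in plain Java rather than a screen-scraper script. It assumes a simplified model in which every scrapeable file that has been requested but not yet finished counts as one open entry. The class and method names (OpenFileModel, peakRecursive, peakIterative) are made up for illustration and are not part of the screen-scraper API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model: each scrapeable file still in progress is one stack entry.
class OpenFileModel {
    private final Deque<String> stack = new ArrayDeque<>();
    private int peak = 0;

    private void openFile(String name) {
        stack.push(name);
        peak = Math.max(peak, stack.size());
    }

    private void closeFile() {
        stack.pop();
    }

    // Recursive style: each page follows its own "next page" link
    // before it finishes, so every earlier page stays open.
    void scrapeRecursively(int page, int lastPage) {
        openFile("page " + page);
        if (page < lastPage) {
            scrapeRecursively(page + 1, lastPage);
        }
        closeFile();
    }

    // Iterative style: the first page stays open and drives a loop;
    // each later page is opened and closed before the next one starts.
    void scrapeIteratively(int lastPage) {
        openFile("page 1");
        for (int page = 2; page <= lastPage; page++) {
            openFile("page " + page);
            closeFile();
        }
        closeFile();
    }

    static int peakRecursive(int pages) {
        OpenFileModel model = new OpenFileModel();
        model.scrapeRecursively(1, pages);
        return model.peak;
    }

    static int peakIterative(int pages) {
        OpenFileModel model = new OpenFileModel();
        model.scrapeIteratively(pages);
        return model.peak;
    }

    public static void main(String[] args) {
        System.out.println("recursive peak: " + peakRecursive(20));
        System.out.println("iterative peak: " + peakIterative(20));
    }
}
```

Under this model, twenty pages followed recursively leave twenty files open at the deepest point, while the loop never holds more than two: the search results page and whichever page it is currently scraping.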
If the site we’re scraping shows the total number of pages, using an iterative method is easy. For my example, I’ll describe a site that has a link for pages 1 through 20, and a “>>” indicator to show there are pages beyond 20.
On the first page of search results, I have 3 extractor patterns to extract the following information:
- Each result listed
- All the page numbers shown, and
- The next batch of results
When I get to the search results page, the first extractor pattern runs as always and drills into the details of each result as usual. The second extractor pattern grabs all the pages listed so I get a dataSet named “Pages,” containing links to pages 2 through 20, and I save the dataSet as a session variable. On the scripts tab, I then run this script after the file is scraped:
/*
Script gets all page numbers from the Pages extractor pattern, and iterates through them
*/
// Get variable
pages = session.getVariable("Pages");
// Clear session variable so it doesn't linger
session.setVariable("Pages", null);
// Loop through pages
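// NOTE: the loop bound was cut off in the source; getNumDataRecords() on the Pages dataSet is assumed here
for (i=0; i<pages.getNumDataRecords(); i++)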
{
    // Since the page list appears twice, use only a number larger than that just used
    if (i>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", i);
        session.log("+++Scraping page #" + i);
        session.scrapeFile("Next search results");
    }
    else
    {
        session.log("+++Already have page #" + i + " so not scraping");
    }
}
The “for” loop will have the first page of search results in memory, but when it calls the “Next search results” scrapeable file to go to page 2, it only gets the results, and doesn’t try to look for a next page. The loop closes out the second page before it starts the third, and closes the third before starting the fourth, etc.
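The “already have page” check is easiest to see with a small example. The following is a sketch in plain Java, not a screen-scraper script: it assumes the page numbers have already been extracted into a list, and the names PageLoopSketch and pagesToScrape are hypothetical. A local variable plays the role of the “PAGE” session variable, and adding to a list stands in for the call to session.scrapeFile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageLoopSketch {
    // listedPages: page numbers as extracted, possibly listed more than once.
    // lastScraped: the highest page already scraped (1 = the search results page).
    // Returns the pages that would be requested, in order.
    static List<Integer> pagesToScrape(List<Integer> listedPages, int lastScraped) {
        List<Integer> requested = new ArrayList<>();
        int current = lastScraped; // stands in for the "PAGE" session variable
        for (int page : listedPages) {
            if (page > current) {
                current = page;
                requested.add(page); // stands in for session.scrapeFile(...)
            }
            // Otherwise this page was already handled, so it is skipped.
        }
        return requested;
    }

    public static void main(String[] args) {
        // A pagination bar shown at the top and bottom of the page lists 2, 3, 4 twice.
        List<Integer> listed = Arrays.asList(2, 3, 4, 2, 3, 4);
        System.out.println(pagesToScrape(listed, 1)); // prints [2, 3, 4]
    }
}
```

Because only page numbers larger than the last one scraped are requested, the second copy of the page list is ignored.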
The last extractor pattern on “Search results” looks for “>>”. I save that dataSet as a session variable named “Next batch pages”, and put this as the last script to run on the scripts tab:
import com.screenscraper.common.*;
/*
Script that checks if there is a next batch of pages
*/
if (session.getVariable("Next batch pages")!=null)
{
    pageSet = session.getVariable("Next batch pages");
    session.setVariable("Next batch pages", null);
    pages = pageSet.getDataRecord(0);
    page = Integer.parseInt(pages.get("PAGE"));
    if (page>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", page);
        session.log("+++Scraping page #" + page);
        session.scrapeFile("Next batch search results");
    }
    else
    {
        session.log("+++Already have page #" + page + " so not scraping");
    }
}
Now the “Next batch search results” scrapeable file must do all the things the first page of search results did: get each result, look for next page links, and look for a next batch of results. Using the iterative approach to cycle through pages enables you to request many more pages without keeping as many in memory, and without unnecessary pages in memory, the scrape will run far longer.
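As a rough illustration of “not as many in memory,” here is a plain Java sketch of the batch scheme, not a screen-scraper script. It rests on an assumption of mine: that the first page of each batch stays open while its scripts run, including the script that requests the next batch, so batch pages nest while the pages inside a batch do not. The names BatchDepthSketch and peakOpen are hypothetical.

```java
// Hypothetical model of the batch scheme: counts scrapeable files open at once.
class BatchDepthSketch {
    private int open = 0;
    private int peak = 0;

    private void openFile() {
        open++;
        peak = Math.max(peak, open);
    }

    private void closeFile() {
        open--;
    }

    // Scrapes one batch starting at batchStart, then follows ">>" if more pages remain.
    private void scrapeBatch(int batchStart, int totalPages, int batchSize) {
        openFile(); // first page of the batch stays open while its scripts run
        int batchEnd = Math.min(batchStart + batchSize - 1, totalPages);
        // Page loop: each remaining page in the batch is opened and closed in turn.
        for (int page = batchStart + 1; page <= batchEnd; page++) {
            openFile();
            closeFile();
        }
        // Last script: request the next batch, which nests inside this one.
        if (batchEnd < totalPages) {
            scrapeBatch(batchEnd + 1, totalPages, batchSize);
        }
        closeFile();
    }

    static int peakOpen(int totalPages, int batchSize) {
        BatchDepthSketch sketch = new BatchDepthSketch();
        sketch.scrapeBatch(1, totalPages, batchSize);
        return sketch.peak;
    }

    public static void main(String[] args) {
        System.out.println("peak open files: " + peakOpen(100, 20));
    }
}
```

Under that assumption, 100 pages in batches of 20 peak at six open files (five batch pages plus the page currently being scraped), compared with 100 if every page followed its own “next page” link.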



