Techniques for Scraping Large Datasets

Posted in Tips on 07/07/08 by jason

Some of the sites we aspire to scrape contain vast amounts of data. In such cases, a scrape may run fine for a time but eventually stop prematurely with the following message printed to the log:

The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap space BSF info: null at line: 0 column: columnNo

There can be a variety of causes, but most of the time it is caused by memory use in page iteration. Turning up the memory allocation for screen-scraper may take care of it, but it doesn’t address the root cause.

In a typical site structure, we input search parameters and are presented with a page of results and a link to view subsequent pages. If there are ten to twenty pages of results, it’s easiest to just scrape the “next page” link and run a script after the pattern is applied that scrapes the next page. The problem is that this approach is recursive. By the time we’ve requested the search results and two subsequent “next pages,” the scrapeable files are all still open in memory, like this:

  • Scrapeable file “Search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”

Every “Next search results” opens a new scrapeable file while the previous one is still open. You can run the script on the scripts tab after the file is scraped so that the dataSets don’t remain in scope, but the scrapeable files themselves remain in memory. The scrape may get further, but memory still fills up with scrapeable files, and it may not be enough to get all the data.

The solution is to use an iterative approach.
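
To make the difference concrete, here is a minimal Java sketch. It is not screen-scraper code: it only counts how many simulated scrapeable files are open at once under each approach, and the class and method names are made up for illustration.

```java
// Illustrative simulation only: counts how many "scrapeable files" are open
// at the same time. None of these names come from the screen-scraper API.
class OpenFileCounter {
    private int open = 0;   // files currently open
    private int peak = 0;   // most files open at once

    private void openFile()  { open++; peak = Math.max(peak, open); }
    private void closeFile() { open--; }

    // Recursive paging: each page follows its own "next page" link
    // before it gets a chance to close.
    private void scrapeRecursively(int page, int lastPage) {
        openFile();
        if (page < lastPage) {
            scrapeRecursively(page + 1, lastPage);
        }
        closeFile();
    }

    // Iterative paging: the first page stays open and requests the
    // remaining pages one at a time.
    private void scrapeIteratively(int lastPage) {
        openFile();                        // "Search results"
        for (int page = 2; page <= lastPage; page++) {
            openFile();                    // "Next search results"
            closeFile();                   // closed before the next one starts
        }
        closeFile();
    }

    static int peakRecursive(int lastPage) {
        OpenFileCounter counter = new OpenFileCounter();
        counter.scrapeRecursively(1, lastPage);
        return counter.peak;
    }

    static int peakIterative(int lastPage) {
        OpenFileCounter counter = new OpenFileCounter();
        counter.scrapeIteratively(lastPage);
        return counter.peak;
    }

    public static void main(String[] args) {
        System.out.println("Recursive peak: " + peakRecursive(20)); // prints 20
        System.out.println("Iterative peak: " + peakIterative(20)); // prints 2
    }
}
```

In this toy model the recursive version holds one open file per page, while the iterative version never holds more than two, no matter how many pages there are.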

If the site we’re scraping shows the total number of pages, using an iterative method is easy. For my example, I’ll describe a site that has a link for pages 1 through 20, and a “>>” indicator to show there are pages beyond 20.

On the first page of search results, I have 3 extractor patterns to extract the following information:

  1. Each result listed
  2. All the page numbers shown, and
  3. The next batch of results

When I get to the search results page, the first extractor pattern runs as always and drills into the details of each result as usual. The second extractor pattern grabs all the pages listed so I get a dataSet named “Pages,” containing links to pages 2 through 20, and I save the dataSet as a session variable. On the scripts tab, I then run this script after the file is scraped:

/*
Script gets all page numbers from the Pages extractor pattern, and iterates through them
*/

// Get variable
pages = Integer.parseInt(session.getVariable("Pages"));

// Clear session variable so it doesn't linger
session.setVariable("Pages", null);

// Loop through pages
for (i=0; i<pages; i++)
{
    // Since the page list appears twice, use only a number larger than that just used
    if (i>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", i);
        session.log("+++Scraping page #" + i);
        session.scrapeFile("Next search results");
    }
    else
    {
        session.log("+++Already have page #" + i + " so not scraping");
    }
}

The “for” loop keeps the first page of search results in memory, but when it calls the “Next search results” scrapeable file to go to page 2, that file only gets the results and doesn’t look for a next page. The loop closes out the second page before it starts the third, closes the third before starting the fourth, and so on.
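
The “only a number larger than that just used” check is what keeps a page list that appears twice on the page from being scraped twice. Here is a small Java sketch of just that check, assuming the extracted page numbers arrive as a plain list; the class and method names are hypothetical and not part of screen-scraper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: mimics the "PAGE" session variable check with a local variable.
class PageListFilter {

    // Returns the page numbers that would actually be requested, skipping any
    // number that is not larger than the last page already scraped.
    static List<Integer> pagesToScrape(List<Integer> extracted, int lastScraped) {
        List<Integer> toScrape = new ArrayList<>();
        int current = lastScraped;          // plays the role of "PAGE"
        for (int page : extracted) {
            if (page > current) {
                current = page;             // remember the page just used
                toScrape.add(page);
            }
            // otherwise: already have this page, so skip it
        }
        return toScrape;
    }

    public static void main(String[] args) {
        // The same page links extracted twice (say, above and below the results)
        List<Integer> extracted = Arrays.asList(2, 3, 4, 2, 3, 4);
        System.out.println(pagesToScrape(extracted, 1)); // prints [2, 3, 4]
    }
}
```

The check relies on the page numbers being listed in ascending order, which is the usual case for a row of page links.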

The last extractor pattern on “Search results” looks for “>>”. I save that dataSet as a session variable named “Next batch pages”, and put this as the last script to run on the scripts tab:

import com.screenscraper.common.*;

/*
Script that checks if there is a next batch of pages
*/

if (session.getVariable("Next batch pages")!=null)
{
    pageSet = session.getVariable("Next batch pages");
    session.setVariable("Next batch pages", null);

    pages = pageSet.getDataRecord(0);
    page = Integer.parseInt(pages.get("PAGE"));

    if (page>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", page);
        session.log("+++Scraping page #" + page);
        session.scrapeFile("Next batch search results");
    }
    else
    {
        session.log("+++Already have page #" + page + " so not scraping");
    }
}

Now the “Next batch search results” scrapeable file must do everything the first page of search results did: get each result, look for next page links, and look for a next batch of results. Cycling through pages iteratively lets you request many more pages without keeping as many in memory, and without unnecessary pages in memory, the scrape will run far longer.
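
As a rough illustration of how the two scripts combine, the Java sketch below simulates a site that lists pages in batches. It assumes, as the scripts above suggest, that the first page of each batch stays open while the next batch is requested, so nesting grows by one file per batch rather than one per page. It is a simulation with made-up names, not screen-scraper code.

```java
// Illustrative simulation only: pages inside a batch are requested one at a
// time, and each ">>" link opens the first page of the next batch.
class BatchPagingSketch {
    private int open = 0;      // files currently open
    private int peak = 0;      // most files open at once
    private int scraped = 0;   // pages requested in total

    private void openFile()  { open++; scraped++; peak = Math.max(peak, open); }
    private void closeFile() { open--; }

    private void scrapeBatch(int firstPage, int totalPages, int batchSize) {
        openFile();   // "Search results" or "Next batch search results"
        int lastInBatch = Math.min(firstPage + batchSize - 1, totalPages);

        // First script: loop through the pages listed in this batch
        for (int page = firstPage + 1; page <= lastInBatch; page++) {
            openFile();    // "Next search results"
            closeFile();   // closed before the next page starts
        }

        // Second script: follow ">>" only if there is a next batch
        if (lastInBatch < totalPages) {
            scrapeBatch(lastInBatch + 1, totalPages, batchSize);
        }
        closeFile();
    }

    // Returns {pages scraped, peak number of files open at once}
    static int[] run(int totalPages, int batchSize) {
        BatchPagingSketch sketch = new BatchPagingSketch();
        sketch.scrapeBatch(1, totalPages, batchSize);
        return new int[] { sketch.scraped, sketch.peak };
    }

    public static void main(String[] args) {
        int[] result = run(100, 20);
        System.out.println("Pages scraped: " + result[0]);   // prints 100
        System.out.println("Peak open files: " + result[1]); // prints 6
    }
}
```

Under these assumptions, 100 pages in batches of 20 peak at 6 open files (five batch pages plus one results page), compared with 100 if every page followed its own “next page” link.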

24 Comments »

  1. kamiloklauss said,

    July 8, 2008 at 9:42 am

    It’s really an interesting approach. But what happens when you have a large amount of data, but to access each page you need to pass a “key” generated in the preceding page? Things like PeopleSoft, where you have to pass a code from the search page to the details page for it to load, and if you try to go from one details page to another details page you just get a “key error”. Is there any way you can handle that?

  2. jason said,

    July 8, 2008 at 11:21 am

    I have scraped some Peoplesoft sites and seen things like that. You also run into issues like that on some .NET pages which like to fling viewstates around like flapjacks in a country diner.

    You’ll just need one more extractor on each page where the key appears; you can name your token as you please, but I would use something like “KEY” and set it as a session variable. Then on the next page request you just replace the value in that parameter with the ~#KEY#~ token. Since you’re scraping the value on every page, the session variable will always hold the most recently found value, and thusly should work.

  3. ryans said,

    July 10, 2008 at 2:09 pm

    For more information about optimizing screen-scraper, please see the following post on screen-scraper’s blog: http://community.screen-scraper.com/faq#16n778

  4. Stu said,

    September 12, 2011 at 4:10 am

    Well I have played around with this now for a day, still with no luck, errors in the script and failure to go to the next page in the set. Is there an .sss download with this integrated so it can be studied better?

  5. Stu said,

    September 12, 2011 at 6:04 am

    I invoke the first script “When I get the to the search results page, the first extractor runs as always and drills into the details of each result as usual. The second extractor pattern grabs all the pages listed so I get a dataSet named “Pages,” containing links to pages 2 through 20, and I save the dataSet as a session variable. On the scripts tab, I then run this script after the file is scraped:”

    …. da da da Script Runs with result: The error message was: class bsh.EvalError (line 9): session .getVariable ( Pages ) — Error in method invocation: Attempt to pass void argument (position 0) to method: getVariable

    Line Nine reads “pages = session.getVariable(”Pages”);”

    On my Search results page I extract “[email protected]@~”, “[email protected]@~” and “[email protected]@~” all are stored as a Session. My gut is telling me that the Session Requested “session.getVariable(”Pages”);” is empty.

  6. Stu said,

    September 12, 2011 at 8:47 am

    OK, figured out the above problem: replace the curly quotes (“ ”) with straight quotes (") if you cut and paste the above. BUT I have been presented with another… argh!! An error occurred while processing the script: The error message was: class bsh.ParseException (line 23): if– Encountered “if” at line 23, column 5.

    Line 23 reads: if (i>session.getVariable(“PAGE”))

    I do hope all this pain will help someone else…. If anyone can shed some light on the “New” problem, it would be appreciated..

  7. jason said,

    September 12, 2011 at 10:05 am

    Stu,

    You seem to be missing a bit. Look at this:

    /*

    Script gets all page numbers from the Pages extractor pattern, and iterates through them

    */

    // Get variable
    pages = Integer.parseInt(session.getVariable("Pages"));

    // Clear session variable so it doesn't linger
    session.setVariable("Pages", null);

    // Loop through pages
    for (i=0; i<pages; i++)
    {
        // Since the page list appears twice, use only a number larger than that just used
        if (i>session.getVariable("PAGE"))
        {
            session.setVariable("PAGE", i);
            session.log("+++Scraping page #" + i);
            session.scrapeFile("Next search results");
        }
        else
        {
            session.log("+++Already have page #" + i + " so not scraping");
        }
    }

  8. Stu said,

    September 12, 2011 at 12:24 pm

    I figured out there was something wrong with the “for (i=0;” portion because there was no closing bracket in your example, BUT being a sysadmin and not a programmer makes life a little difficult sometimes. Hopefully the posted information will assist someone else with the head scratching…. Thanks Jason.

  9. hanki said,

    February 21, 2014 at 6:07 am

    Hey Guys. I’ve been trying to scrape the garage details from http://www.yell.com, I’ve scraped all the required data and have stored it in the db. The problem is, scrape works absolutely fantastic till first five (5) pages, but from page no 6 n on, it doesn’t scrape. It says:
    The pattern did not find any match.
    As i’ve been working on this for almost a month now (I’m new to scraping) and this problem seems to be one thing I am unable to solve.
    If any of you could shed some light or help me, that would be so sweet of you.
    Waiting eagerly…

  10. Jason said,

    February 21, 2014 at 6:00 pm

    I don’t see anything that changes. Could I point you to the forum http://community.screen-scraper.com/forum to detail how you’re searching so I can emulate it?

  11. Hanki said,

    February 23, 2014 at 8:17 am

    Hey Jason. Thanks for the reply. After your msg, i’ve been trying to dig deeper into the problem and finally came to conclusion that the problem is Yell blocking my IP after exactly 5 pages scraped. Here is one post that presents almost the same issue i’m facing, http://community.screen-scraper.com/node/2181

    Please help me with this, i’m using win 7. Ever since i’ve read that post, i’d been looking for that SSTOR.jar but couldn’t find it. I’ve downloaded the TOR Browser bundle. Guide me from here on.

    Thank you very much

  12. Jason said,

    February 24, 2014 at 1:34 pm

    SSTor.jar is a file that we made, and you can get it here.

  13. Mitch said,

    February 24, 2014 at 6:09 pm

    I’m stuck on this. I followed what Jason has from September 12, 2011 but with no luck. The site I am scraping has EVENTSTATEs so I have a bit of data to keep in memory.

    Maybe I’m dense but the “PAGE” reference should be the current page right?

  14. Mitch said,

    February 24, 2014 at 6:11 pm

    When it gets to the Loop through pages part with the “(i=0; i>pages; i++)” nothing appears to be happening. I can’t get the If Then statement to register either log writes.

  15. Hanki said,

    February 25, 2014 at 12:30 am

    Thank you so very much Jason for the SSTOR.jar file.

    I believe the Scraping session (the ip change/check) is the next thing to get, i’ve also read in the post that I need Polipo set up on my system. As i mentioned earlier, i’ve downloaded the TOR Bundle and it doesn’t have that Polipo or Vidalia thing in it, i’ve searched all over the net and the TORPROJECT now comes with only this setup (without Polipo or Vidalia). What should I do now? Where to GO?

    Your help is highly appreciated.

    Hanki

  16. Jason said,

    February 25, 2014 at 5:25 pm

    Polipo is no longer bundled with Tor, but you can download it independently.

  17. Hanki said,

    February 26, 2014 at 5:12 am

    Hi again. I’ve downloaded the Polipo. Now what is the next step? When are you going to provide configuration steps and the script for checking the blocked ports and changing them to continue scrape from there on.

    Having my fingers crossed and hopes high.

  18. Jason said,

    February 26, 2014 at 3:30 pm

    See this post for instructions: http://community.screen-scraper.com/node/2181

  19. Hanki said,

    February 27, 2014 at 5:50 am

    Hey Jason. Thanks for reply. I’ve read this post, it says something about the scrapeable session that you’ve provided. Kindly do provide me with that session file so I may carry on the scrape.

    Waiting …

  20. Jason said,

    February 27, 2014 at 7:23 am

    There is a link on that post.

  21. Hanki said,

    February 27, 2014 at 7:44 am

    I’m sorry but i couldn’t find it on the page. Would you please share it here, the way you shared the SSTOR.jar?

    Really appreciate your help and prompt reply.

  22. Hanki said,

    February 28, 2014 at 3:45 pm

    Hi there. I’m still waiting for your reply Jason.

  23. Hanki said,

    March 4, 2014 at 5:15 am

    Hey … Sorry to bug you again but i just want to remind that i am still waiting to hear from you guys.

    Regards,
    Hanki

  24. Jason said,

    March 4, 2014 at 11:23 am

    I cannot make an attachment here. You will need to find it on http://community.screen-scraper.com/node/2181.

    I must also redirect all future support requests to the forum.
