To Recurse is Human, to Iterate, Divine

Posted in Tips, Updates on 04/15/10 by Todd Wilson

Well, that’s actually not always true.  Take a quick look at this blog posting here.  The fundamental issue described by that posting is one of recursion vs. iteration.  When recursion is used (a page calls a page which calls a page…), each call stays on the stack along with the objects it references, so memory gradually fills up.  When iteration is used, each page’s objects can be cleaned up before the next page is requested, so memory doesn’t become a problem.  The trouble is, the recursive case is often hard to detect, and unless you’re thinking about it while you’re building your scraping session, you may cause it without realizing it.
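To make the difference concrete, here’s a minimal Java sketch.  It isn’t screen-scraper code: the class and method names (PagingSketch, scrapeRecursively, scrapeIteratively) are made up for illustration, and the “scraping” is just a counter.  All it does is record how deeply the calls nest when the same set of pages is walked each way.

```java
// Hypothetical illustration only; none of these names are screen-scraper's API.
class PagingSketch {
    // Deepest nesting reached by the most recent run.
    static int maxDepth = 0;

    // Recursive style: each page "calls" the next one, so every earlier
    // call (and anything it references) stays alive until the last page.
    static void scrapeRecursively(int page, int lastPage, int depth) {
        maxDepth = Math.max(maxDepth, depth);
        if (page < lastPage) {
            scrapeRecursively(page + 1, lastPage, depth + 1);
        }
    }

    // Iterative style: one loop visits every page, and the work for each
    // page finishes before the next begins, so the depth stays at 1.
    static void scrapeIteratively(int lastPage) {
        for (int page = 1; page <= lastPage; page++) {
            maxDepth = Math.max(maxDepth, 1);
        }
    }

    static int depthOfRecursiveRun(int pages) {
        maxDepth = 0;
        scrapeRecursively(1, pages, 1);
        return maxDepth;
    }

    static int depthOfIterativeRun(int pages) {
        maxDepth = 0;
        scrapeIteratively(pages);
        return maxDepth;
    }

    public static void main(String[] args) {
        System.out.println("recursive depth: " + depthOfRecursiveRun(200));
        System.out.println("iterative depth: " + depthOfIterativeRun(200));
    }
}
```

With 200 pages the recursive version nests 200 calls deep, while the loop never goes past one.  A “next page” script that runs the script for the following page has the same shape as the recursive version.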

An astute screen-scraper user yesterday suggested a solution to this that is both simple and effective.  In the case described in the blog posting, you end up with a big stack of scripts, all of which hold references to objects, and that is what causes the OutOfMemoryError.  The number of scripts on the stack can be viewed in the breakpoint window, and in version 4.5.45a we added a method that lets you see how many scripts are on the stack from within a script:

session.getNumScriptsOnStack()

You can check this number as often as you’d like.  If it keeps growing, that could mean trouble, so you can respond appropriately in your scraping session.  We’ve also added a failsafe mechanism inside of screen-scraper that will hopefully save you from an OutOfMemoryError.  If too many scripts are pushed onto the stack, your scraping session will be stopped and the following message will be output to the log:

ERROR–halting the scraping session because the maximum number of scripts allowed on the stack was reached.

You can control the maximum number of scripts allowed on the stack by invoking this method at any time:

session.setMaxScriptsOnStack( 50 )

Set that number to whatever you’d like.
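Here’s a rough, self-contained Java sketch of how the check and the limit fit together.  Only the two method names, getNumScriptsOnStack and setMaxScriptsOnStack, come from screen-scraper.  StubSession, pushScript, popScript, isHalted and StackGuardSketch are hypothetical stand-ins, and the halting behavior is an assumption about how such a failsafe might work, not a description of screen-scraper’s internals.

```java
// Hypothetical stand-in for the session object. Only the two method names
// getNumScriptsOnStack and setMaxScriptsOnStack come from the post; the
// behavior below is an assumption made for illustration.
class StubSession {
    private int scriptsOnStack = 0;
    private int maxScriptsOnStack = 50; // arbitrary value for this sketch
    private boolean halted = false;

    int getNumScriptsOnStack() { return scriptsOnStack; }

    void setMaxScriptsOnStack(int max) { maxScriptsOnStack = max; }

    boolean isHalted() { return halted; }

    // Simulates a script starting. Once the limit is reached the stub
    // marks the session as halted and refuses to start more scripts.
    boolean pushScript() {
        if (halted || scriptsOnStack >= maxScriptsOnStack) {
            halted = true;
            return false;
        }
        scriptsOnStack++;
        return true;
    }

    // Simulates a script finishing.
    void popScript() {
        if (scriptsOnStack > 0) {
            scriptsOnStack--;
        }
    }
}

class StackGuardSketch {
    // Deepest stack observed by polling getNumScriptsOnStack().
    static int peakDepth = 0;

    // A "next page" script written recursively: each page starts the script
    // for the following page. Returns the number of pages processed.
    static int scrapePage(StubSession session, int page, int lastPage) {
        if (!session.pushScript()) {
            return 0; // the stub halted the session
        }
        peakDepth = Math.max(peakDepth, session.getNumScriptsOnStack());
        int processed = 1;
        if (page < lastPage) {
            processed += scrapePage(session, page + 1, lastPage);
        }
        session.popScript();
        return processed;
    }

    public static void main(String[] args) {
        StubSession session = new StubSession();
        session.setMaxScriptsOnStack(50);
        int processed = scrapePage(session, 1, 1000);
        System.out.println("pages processed: " + processed);
        System.out.println("peak scripts on stack: " + peakDepth);
        System.out.println("halted: " + session.isHalted());
    }
}
```

In this sketch a limit of 50 stops a 1,000-page recursive run after 50 pages, and the polled stack depth never climbs past the limit.  A run that stays under the limit finishes without being halted.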

By design, screen-scraper provides a lot of flexibility and power in the data extraction process, but this same power can also result in our shooting ourselves in the foot on occasion.  The inclusion of this new mechanism will hopefully help some of us avoid this problem down the road.

3 Comments

  1. scottw said,

    April 4, 2011 at 2:00 pm

    See also: How can I optimize screen-scraper’s performance?

  2. biby said,

    July 29, 2013 at 11:43 am

    http://yellowpages.superpages.com/listings.jsp?C=apartment&CS=L&MCBP=true&search=Find+It&SRC=&STYPE=S&SCS=&channelId=&sessionId=

    On this results page there are no page numbers, and the URL does not contain the “pages” parameter. In cases like this, what could be a possible solution?

  3. Todd Wilson said,

    July 29, 2013 at 2:45 pm

    You can still capture the parameters in the “Next” link, then use those to make the subsequent request. The cars.com scraping session you can download from this page exemplifies a technique that should work.
