04.15.10

To Recurse is Human, to Iterate, Divine

Posted in Tips, Updates at 11:08 am by Todd Wilson

Well, that’s actually not always true.  Take a quick look at this blog posting here.  The fundamental issue described in that posting is one of recursion vs. iteration.  When recursion is used (a page calls a page, which calls a page…), each script stays on the stack until the scripts it invoked have finished, so the objects they reference pile up and can eventually fill memory.  When iteration is used, each script finishes before the next one starts, so its objects are properly cleaned up and memory doesn’t become a problem.  The trouble is that recursion is often hard to detect, and unless you’re thinking about it while you’re building your scraping session, you may cause it without realizing it.
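To make the difference concrete, here is a minimal sketch in plain Java.  It does not use screen-scraper at all: the class name PagingDemo, the scriptStack field, and the enter/exit helpers are all made up for illustration, and the "pages" are just integers.  It only shows how the number of in-progress scripts behaves under each approach.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PagingDemo {
    // Hypothetical stand-in for the stack of scripts that are still running.
    private static final Deque<Integer> scriptStack = new ArrayDeque<>();
    private static int peakDepth = 0;

    // A script starts: push it and remember the deepest point reached.
    private static void enter(int page) {
        scriptStack.push(page);
        peakDepth = Math.max(peakDepth, scriptStack.size());
    }

    // A script finishes: pop it so its objects can be released.
    private static void exit() {
        scriptStack.pop();
    }

    // Recursive style: each page invokes the next page before it finishes.
    static void scrapeRecursive(int page, int lastPage) {
        enter(page);
        if (page < lastPage) {
            scrapeRecursive(page + 1, lastPage);
        }
        exit();
    }

    // Iterative style: one loop drives every page, and each page
    // finishes before the next one starts.
    static void scrapeIterative(int lastPage) {
        for (int page = 1; page <= lastPage; page++) {
            enter(page);
            exit();
        }
    }

    static int peakDepthRecursive(int pages) {
        scriptStack.clear();
        peakDepth = 0;
        scrapeRecursive(1, pages);
        return peakDepth;
    }

    static int peakDepthIterative(int pages) {
        scriptStack.clear();
        peakDepth = 0;
        scrapeIterative(pages);
        return peakDepth;
    }

    public static void main(String[] args) {
        System.out.println("recursive peak depth: " + peakDepthRecursive(100));
        System.out.println("iterative peak depth: " + peakDepthIterative(100));
    }
}
```

With 100 pages the recursive version peaks at a depth of 100, while the iterative version never goes deeper than 1, no matter how many pages there are.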

An astute screen-scraper user yesterday suggested a solution to this that is both simple and effective.  In the case described in the blog posting you end up with a big stack of scripts, all of which hold references to objects, and that is what causes the OutOfMemoryError.  The number of scripts on the stack can be viewed in the breakpoint window, and in version 4.5.45a we added a method that lets you see how many scripts are on the stack from within a script:

session.getNumScriptsOnStack()

You can check this number as often as you’d like.  If it keeps growing, that could mean trouble, so you can respond appropriately in your scraping session.  We’ve also added a failsafe mechanism inside of screen-scraper that will hopefully save you from an OutOfMemoryError.  If too many scripts are pushed on the stack, your scraping session will be stopped and the following message will be output to the log:

ERROR–halting the scraping session because the maximum number of scripts allowed on the stack was reached.

You can control the maximum number of scripts allowed on the stack by invoking this method at any time:

session.setMaxScriptsOnStack( 50 )

Set that number to whatever you’d like.
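As a rough sketch of how the two methods might be used together, the plain-Java example below checks the stack depth against a warning threshold that sits below the hard limit.  The SessionLike interface and FakeSession class are hypothetical stand-ins so the example can run outside screen-scraper; only getNumScriptsOnStack and setMaxScriptsOnStack come from this post, and the nearLimit helper and the 80% threshold are my own assumptions.

```java
public class StackGuardDemo {
    // Hypothetical stand-in for screen-scraper's session object.
    // Only the two methods named in this post are modeled.
    interface SessionLike {
        int getNumScriptsOnStack();
        void setMaxScriptsOnStack(int max);
    }

    // Fake session used only to exercise the helper below.
    static class FakeSession implements SessionLike {
        int depth;
        int max;

        FakeSession(int depth) {
            this.depth = depth;
        }

        public int getNumScriptsOnStack() {
            return depth;
        }

        public void setMaxScriptsOnStack(int max) {
            this.max = max;
        }
    }

    // Returns true once the stack has reached the given fraction of the limit,
    // so a script can react before the failsafe halts the session.
    static boolean nearLimit(SessionLike session, int maxScripts, double warnFraction) {
        return session.getNumScriptsOnStack() >= maxScripts * warnFraction;
    }

    public static void main(String[] args) {
        int maxScripts = 50;
        FakeSession session = new FakeSession(40);
        session.setMaxScriptsOnStack(maxScripts);

        if (nearLimit(session, maxScripts, 0.8)) {
            System.out.println("Warning: " + session.getNumScriptsOnStack()
                    + " scripts on the stack, limit is " + maxScripts);
        }
    }
}
```

Inside a real script you would call the two methods directly on the session object, and what you do once the threshold is crossed (log a warning, stop following links, and so on) is up to you.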

By design screen-scraper provides a lot of flexibility and power in the data extraction process, but this same power can also result in our shooting ourselves in the foot on occasion.  The inclusion of this new mechanism will hopefully help some to avoid this problem down the road.


1 Comment »

  1. scottw said,

    April 4, 2011 at 2:00 pm

    See also: How can I optimize screen-scraper’s performance?
