04.15.10

To Recurse is Human, to Iterate, Divine

Posted in Tips, Updates at 11:08 am by Todd Wilson

Well, that’s actually not always true.  Take a quick look at this blog posting here.  The fundamental issue described in that posting is one of recursion vs. iteration.  When recursion is used (a page calls a page which calls a page…), objects tend to stack up and eventually fill memory.  When iteration is used, objects are cleaned up as you go, so memory doesn’t become a problem.  The trouble is that the recursive condition is often hard to detect, and unless you’re thinking about it while you’re building your scraping session, you may cause it without realizing it.
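To make the difference concrete, here is a minimal sketch in plain Java.  It is not screen-scraper code: the class PagingDemo and its methods are hypothetical stand-ins, and "visiting" a page just records its number.  It only illustrates how the depth of active calls grows under recursion and stays flat under iteration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scraping run; not part of screen-scraper.
class PagingDemo {
    int currentDepth = 0;   // how many visits are active right now
    int maxDepth = 0;       // the deepest the stack of visits ever got
    final List<Integer> visited = new ArrayList<>();

    // Recursive style: each page triggers the next page before it returns,
    // so every earlier page stays on the stack until the last one finishes.
    void visitRecursively(int page, int lastPage) {
        currentDepth++;
        maxDepth = Math.max(maxDepth, currentDepth);
        visited.add(page);
        if (page < lastPage) {
            visitRecursively(page + 1, lastPage);
        }
        currentDepth--;
    }

    // Iterative style: each page is visited and finished before the next begins.
    void visitIteratively(int firstPage, int lastPage) {
        for (int page = firstPage; page <= lastPage; page++) {
            currentDepth++;
            maxDepth = Math.max(maxDepth, currentDepth);
            visited.add(page);
            currentDepth--;
        }
    }

    public static void main(String[] args) {
        PagingDemo recursive = new PagingDemo();
        recursive.visitRecursively(1, 100);

        PagingDemo iterative = new PagingDemo();
        iterative.visitIteratively(1, 100);

        System.out.println("recursive max depth: " + recursive.maxDepth);
        System.out.println("iterative max depth: " + iterative.maxDepth);
    }
}
```

With 100 pages the recursive version peaks at a depth of 100, while the iterative one never goes above 1.  Both visit exactly the same pages.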

An astute screen-scraper user yesterday suggested a solution to this that is both simple and effective.  In the case described in the blog posting you end up with a big stack of scripts, all of which have references to objects, which causes the OutOfMemoryError.  The number of scripts on the stack can be viewed in the breakpoint window, and in version 4.5.45a we added a method that will allow you to see how many scripts are on the stack from within a script:

session.getNumScriptsOnStack()

You can check this number as often as you’d like.  If it keeps growing it could mean trouble, so you can respond appropriately in your scraping session.  We’ve also added a failsafe mechanism inside of screen-scraper that will hopefully save you from an OutOfMemoryError.  If too many scripts are pushed on the stack, your scraping session will be stopped and the following message will be output to the log:

ERROR–halting the scraping session because the maximum number of scripts allowed on the stack was reached.

You can control the maximum number of scripts allowed on the stack by invoking this method at any time:

session.setMaxScriptsOnStack( 50 )

Set that number to whatever you’d like.
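Here is a rough sketch of how the two calls might be used together.  Since the real session object only exists inside screen-scraper, the FakeSession class below is a hypothetical stand-in that mimics just getNumScriptsOnStack() and setMaxScriptsOnStack() as described above; pushScript(), isStopped(), and the WARN_AT threshold are invented for the demo, and the starting limit of 50 is an assumption rather than a documented default.

```java
// Hypothetical stand-in for the session object; only the two method names
// getNumScriptsOnStack and setMaxScriptsOnStack come from screen-scraper.
class FakeSession {
    private int numScriptsOnStack = 0;
    private int maxScriptsOnStack = 50; // assumed value, for this demo only
    private boolean stopped = false;

    int getNumScriptsOnStack() { return numScriptsOnStack; }
    void setMaxScriptsOnStack(int max) { maxScriptsOnStack = max; }
    boolean isStopped() { return stopped; }

    // Simulates one script invoking another.  Once the limit is reached the
    // session is marked as stopped, imitating the failsafe described above.
    boolean pushScript() {
        if (numScriptsOnStack >= maxScriptsOnStack) {
            stopped = true;
            return false;
        }
        numScriptsOnStack++;
        return true;
    }
}

class StackGuardDemo {
    // Hypothetical threshold at which our own script decides to back off.
    static final int WARN_AT = 20;

    // The kind of check a script could make before calling yet another script.
    static boolean safeToGoDeeper(FakeSession session) {
        return session.getNumScriptsOnStack() < WARN_AT;
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession();
        session.setMaxScriptsOnStack(50);

        // Keep nesting only while our own check says it is safe.
        while (safeToGoDeeper(session) && session.pushScript()) {
            // a real script would request the next page here
        }
        System.out.println("stopped nesting at depth " + session.getNumScriptsOnStack());
        System.out.println("failsafe triggered: " + session.isStopped());
    }
}
```

The idea is simply to keep your own threshold below the hard limit, so your script gets a chance to react before the failsafe halts the session.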

By design screen-scraper provides a lot of flexibility and power in the data extraction process, but this same power can also result in our shooting ourselves in the foot on occasion.  The inclusion of this new mechanism will hopefully help some to avoid this problem down the road.

Resume points

Posted in Tips at 8:00 am by jason

Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo.  Many times there is nothing to do but to restart, but sometimes there is a way to pick up (pretty close to) where you left off.  You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.

I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
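As a rough idea of what such a state file could look like, here is a sketch in plain Java.  Everything in it is an assumption made for illustration: the ResumePoint class, the "location,company" file format, and the file name are hypothetical, and this is not necessarily how the full entry does it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: remembers which LOCATION and COMPANY were scraped last.
class ResumePoint {
    final int locationIndex;
    final int companyIndex;

    ResumePoint(int locationIndex, int companyIndex) {
        this.locationIndex = locationIndex;
        this.companyIndex = companyIndex;
    }

    // Call after each COMPANY is scraped: overwrite the file with "location,company".
    static void save(Path stateFile, ResumePoint point) {
        String line = point.locationIndex + "," + point.companyIndex;
        try {
            Files.write(stateFile, line.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Call at the start of the scrape: return the saved position,
    // or (0, 0) when there is no usable state file.
    static ResumePoint load(Path stateFile) {
        try {
            if (!Files.exists(stateFile)) {
                return new ResumePoint(0, 0);
            }
            String line = new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8).trim();
            if (line.isEmpty()) {
                return new ResumePoint(0, 0);
            }
            String[] parts = line.split(",");
            return new ResumePoint(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Call when the scrape finishes normally so the next run starts fresh.
    static void clear(Path stateFile) {
        try {
            Files.deleteIfExists(stateFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical file name in the system temp directory.
        Path stateFile = Paths.get(System.getProperty("java.io.tmpdir"), "scrape-state-demo.txt");
        save(stateFile, new ResumePoint(12, 4));
        ResumePoint resumed = load(stateFile);
        System.out.println("resume at location " + resumed.locationIndex
                + ", company " + resumed.companyIndex);
        clear(stateFile);
    }
}
```

On startup the scrape would call load() and skip ahead to the saved LOCATION and COMPANY, which is why you end up pretty close to (rather than exactly) where you left off.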

04.09.10

Tidy Time

Posted in Updates at 6:04 pm by Todd Wilson

So lately we’ve been experimenting with different tidiers in the latest alpha versions of screen-scraper.  A tidier is the little utility that cleans up malformed HTML, making extraction easier.  For some time we’ve used a library called JTidy to handle this, which has worked quite well, but does have a couple of problems.  First, at times it simply fails to tidy the HTML.  If you’ve been using screen-scraper for a while you’ve likely seen a message indicating this in the log.  This isn’t too big of a deal, but can be a bit of a hassle.  Second, in very rare instances we’ve found that it will omit portions of an HTML page that are especially malformed.  This is definitely a problem and can make debugging difficult.

To address these issues we’ve been trying out two other tidiers: NekoHTML and Jericho.  We’ve already found issues with NekoHTML, so Jericho looks to be the favorite as of right now.  Both still require some experimentation, though, so please use them at your own risk for now.  Once we’ve put them both through their paces we’ll likely settle on one as the recommended default.  And not to worry about any scrapeable files that are already using JTidy; they’ll stay just as they are.  At some point, though, you might notice a different tidier as the default for any new scrapeable files.

04.06.10

Exporting & importing scraping sessions in 4.5.42a

Posted in Miscellaneous, Tips at 6:12 pm by Todd Wilson

We try hard to maintain backward compatibility as much as possible, but unfortunately it can’t always be done.  If you recently upgraded to 4.5.42a you may have noticed that scraping sessions exported from that version don’t import correctly into an alpha version prior to it.  This is a result of the alterations to the “tidy HTML” functionality that were implemented in that version, and it was impossible to maintain compatibility with older versions.  So please take note: if you export scraping sessions from version 4.5.42a or later, you should only import them into instances of screen-scraper also running 4.5.42a or later.