03.31.06

Version 2.7.2.1a of screen-scraper available

Posted in Uncategorized at 2:59 pm by Todd Wilson

Today on our support forum we had someone inquire about calling scripts from other scripts within screen-scraper. This has been requested a number of times in the past, and I’ve kind of hemmed and hawed about it, not sure if it would be opening a can of worms. Some of our internal developers have wanted this as well, so I gave it a bit more thought, and came up with a pretty quick and easy way to implement it.

I’m particularly interested in having this one thoroughly tested, so please feel free to upgrade (try this FAQ if you run into trouble). Remember that this is an alpha version, so caveats apply. It should be plenty stable, though, since this is the only addition from 2.7.2.

Once you’ve upgraded, you can do a method call like this within a script in order to invoke another:

session.executeScript( "My Script" );

03.27.06

Example scraping sessions available

Posted in Updates at 5:39 pm by Todd Wilson

I just posted several example scraping sessions that may be of help to those starting out with screen-scraper: http://www.screen-scraper.com/support/examples/scrapbookfinds_examples.php.

Back when screen-scraper was just a babe in my arms I used to include scraping sessions in the download. The scraping sessions extracted stuff from Slashdot, Freshmeat, and Weather.com. The trouble was, the sites would change from time to time, and it was always a pain keeping up with them. What was worse, occasionally people would download screen-scraper, run the scraping sessions, and find that they didn’t work (because the sites had changed). They’d then report back that our software stunk because it didn’t even work with the very examples we provided.

After all of that I decided it simply wasn’t worth providing examples using sites we didn’t have control over. That’s why we set up this mock e-commerce web site on our server. We wanted to provide a “real world” example, but still needed to have control over the site so that we didn’t need to continually update it.

When we started doing ScrapbookFinds, it occurred to me that we could share those scraping sessions with others. We don’t control the sites, but we’re constantly monitoring the scraping sessions and updating them as the sites change. The hope is that these scraping sessions will provide templates and examples that both help people learn screen-scraper and act as boilerplates they can tweak to create their own scraping sessions.

As a side-note, if it’s of interest, we probably average about 15 minutes of time updating scraping sessions per week, and we’re scraping about 15 sites (i.e., the sites either don’t change that often, or we’ve set up our scraping sessions to be fuzzy enough such that they don’t break when minor changes are made).

03.24.06

Version 2.7.2 of screen-scraper available

Posted in Updates at 3:43 pm by Todd Wilson

This is just a minor bug fix release, but anyone invoking screen-scraper from the command line should upgrade. Somehow a semi-critical bug slipped under our radar on the 2.7 release. In 2.7, if you have the workbench open and then run screen-scraper from the command line, the command line instance will close screen-scraper’s database when it ends, leaving the workbench without a way to save any of its information. It wouldn’t lead to database corruption or anything like that, but could get pretty annoying.

03.22.06

Scraping data from similar tables

Posted in Tips at 5:52 pm by Todd Wilson

Astute screen-scraper Fred came up with a scenario that arises from time to time: you’ve got a page containing one or more HTML tables, all of which are nearly identical in structure. You want to pull the data from each table, but need to be able to distinguish which row came from which table. Standard old extractor patterns won’t do the job: they’ll match every row in every table, which destroys the link between each row and its corresponding table.

Fortunately, there are a couple of ways of handling such a scenario, which I’ve just outlined in this FAQ. Not too complicated, but a bit more involved than just using a standard extractor pattern.
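To make the problem concrete, here is a minimal sketch in plain Java of one possible two-pass approach: match each table first, then match rows only within that table's body, so every row keeps its table index. This is not necessarily either of the methods described in the FAQ, and the class name, method name, and sample HTML are all made up for illustration. It also assumes the tables are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableRowExtractor {
    // Outer pattern: one match per table. DOTALL lets "." span line breaks.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*>(.*?)</table>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Inner pattern: one match per single-cell row, applied to one table body at a time.
    private static final Pattern ROW = Pattern.compile(
            "<tr[^>]*>\\s*<td[^>]*>(.*?)</td>\\s*</tr>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns entries of the form "tableIndex:cellText" so each row stays tied to its table. */
    static List<String> extract(String html) {
        List<String> rows = new ArrayList<>();
        Matcher tables = TABLE.matcher(html);
        int tableIndex = 0;
        while (tables.find()) {
            // Only search for rows inside the table that was just matched.
            Matcher rowMatcher = ROW.matcher(tables.group(1));
            while (rowMatcher.find()) {
                rows.add(tableIndex + ":" + rowMatcher.group(1).trim());
            }
            tableIndex++;
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>apple</td></tr><tr><td>pear</td></tr></table>"
                + "<table><tr><td>kiwi</td></tr></table>";
        System.out.println(extract(html));
    }
}
```

The key point is simply that the row pattern never sees more than one table at a time, which is what preserves the row-to-table link.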

03.21.06

Three common methods for data extraction

Posted in Miscellaneous, Thoughts at 3:29 pm by Todd Wilson

Building off of my earlier posting on data discovery vs. data extraction, in the data extraction phase of the web scraping process you’ve already arrived at the page containing the data you’re interested in, and you now need to pull it out of the HTML.

Probably the most common technique used traditionally to do this is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you’re already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.
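As a rough illustration of the regular expression approach, here is a small Java sketch that pulls URLs and link titles out of a chunk of HTML. The pattern is deliberately loose about extra attributes, whitespace, and letter case, but it assumes double-quoted href values and is not a general HTML parser. The class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches <a ... href="URL" ...>TITLE</a>, tolerating other attributes and spacing.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns a map of URL to link title, in the order the links appear. */
    static Map<String, String> extractLinks(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.put(m.group(1), m.group(2).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a class=\"x\" href=\"http://example.com/a\">First</a> "
                + "<A HREF = \"/b\" >Second</A>";
        extractLinks(html).forEach((url, title) -> System.out.println(url + " -> " + title));
    }
}
```

Even a pattern this short shows both sides of the trade-off: it survives small markup changes, but it takes a moment to read.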

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing “ontologies”, or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they’re often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it’s probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what’s the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

  • If you’re already familiar with regular expressions and at least one programming language, this can be a quick solution.
  • Regular expressions allow for a fair amount of “fuzziness” in the matching such that minor changes to the content won’t break them.
  • You likely don’t need to learn any new languages or tools (again, assuming you’re already familiar with regular expressions and a programming language).
  • Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It’s also nice because the various regular expression implementations don’t vary too significantly in their syntax.

Disadvantages:

  • They can be complex for those that don’t have a lot of experience with them. Learning regular expressions isn’t like going from Perl to Java. It’s more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
  • They’re often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you’ll see what I mean.
  • If the content you’re trying to match changes (e.g., they change the web page by adding a new “font” tag) you’ll likely need to update your regular expressions to account for the change.
  • The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You’ll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there’s no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

  • You create it once and it can more or less extract the data from any page within the content domain you’re targeting.
  • The data model is generally built in. For example, if you’re extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
  • There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

  • It’s relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
  • These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you’re targeting.
  • You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you’ll only get into ontologies and artificial intelligence when you’re planning on extracting information from a very large number of sources. It also makes sense to do this when the data you’re trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

  • Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
  • Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.
  • Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

  • The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
  • A potential cost. Most ready-to-go screen-scraping applications are commercial, so you’ll likely be paying in dollars as well as time for this solution.
  • A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you’re locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you’re using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don’t mind paying a bit, you can save yourself a significant amount of time by using one. If you’re doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you’re probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we’ve been involved with that has actually required a hybrid approach of two of the aforementioned methods. We’re currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term “number of bedrooms” can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we’ve done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it’s handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we’ve written that uses ontologies in order to extract out the individual pieces we’re after. Once the data has been extracted we then insert it into a database.
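To give a feel for why classified ads are so messy, here is a toy Java sketch that normalizes a handful of ways of writing the bedroom count. It is only a regular expression over a few abbreviations I picked for the example, not the ontology-based code described above, and a real vocabulary would need to cover far more variants (spelled-out numbers, for instance).

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BedroomNormalizer {
    // A digit count followed by one of a few illustrative spellings of "bedroom".
    private static final Pattern BEDROOMS = Pattern.compile(
            "(\\d+)\\s*-?\\s*(?:bedrooms?|bdrms?|beds?|br|bd)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns the bedroom count if one of the known spellings is found, otherwise empty. */
    static OptionalInt bedrooms(String adText) {
        Matcher m = BEDROOMS.matcher(adText);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        String[] ads = { "Lovely 3 BR home", "2bdrm apt near park", "4-bedroom house", "Studio, 1 bath" };
        for (String ad : ads) {
            System.out.println(ad + " -> " + bedrooms(ad));
        }
    }
}
```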

New screen-scraper tutorial available

Posted in Updates at 12:22 pm by Todd Wilson

We’ve just released a new screen-scraper tutorial: http://www.screen-scraper.com/support/tutorials/tutorial7/tutorial_overview.php. It’s just received the blessing from our project manager and aspiring professional writer/editor, Jason Bellows, so it should be ready for public consumption.

Here’s a snippet from the tutorial introduction:

“It’s often the case in screen-scraping that you want to submit a form multiple times using different parameters each time. For example, you may be extracting locations from the “store locator” service on a site, and need to submit the form for a series of zip codes. In this tutorial we’ll provide an example on how to go about that.”

We’ve had this requested a few times, so hopefully it will provide enough of a template that people can use it for similar projects.
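Outside of screen-scraper, the same idea boils down to a loop that fills in one form parameter per request. Here is a minimal Java sketch that builds the request URLs for a list of zip codes. The base URL and the "zip" parameter name are hypothetical, and this is not how the tutorial itself goes about it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class StoreLocatorRequests {
    /** Builds one GET URL per zip code by appending an encoded query parameter. */
    static List<String> buildUrls(String baseUrl, String paramName, List<String> zipCodes) {
        List<String> urls = new ArrayList<>();
        for (String zip : zipCodes) {
            urls.add(baseUrl + "?"
                    + URLEncoder.encode(paramName, StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(zip, StandardCharsets.UTF_8));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical store locator endpoint and parameter name.
        List<String> urls = buildUrls("http://example.com/locator", "zip",
                List.of("84101", "10001", "94105"));
        urls.forEach(System.out::println);
    }
}
```

Each URL would then be requested in turn and the resulting page handed to the extraction step.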

As always, feel free to let us know what you think. You can post a comment below, post to our support forum, or send us a note.

03.16.06

Untrusted Server Certificate Chain fix

Posted in Updates at 6:18 pm by Todd Wilson

Some of you in the past may have run into this dreaded message when trying to access a site that uses HTTPS:

java.security.cert.CertificateException: Untrusted Server Certificate Chain

I’m happy to report that we’ve just issued a fix for that in version 2.7.0.1a. See this FAQ if you run into any trouble upgrading.

Data discovery vs. data extraction

Posted in Miscellaneous, Thoughts at 1:30 pm by Todd Wilson

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the “details” links within the search results pages to get to the data you’re actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a tool like screen-scraper can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.
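For a sense of what the hand-written version involves, here is a rough Java sketch of just the first step of such a flow: a client that remembers cookies, plus a login POST. The URL and the "username" and "password" field names are hypothetical, and the sketch only builds the request rather than sending it.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class LoginFlowSketch {
    // A client with a cookie manager keeps session cookies across requests,
    // which is what lets later page requests count as "logged in".
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // URL-encodes the form fields; the field names are assumptions for illustration.
    static String formBody(String user, String password) {
        return "username=" + URLEncoder.encode(user, StandardCharsets.UTF_8)
                + "&password=" + URLEncoder.encode(password, StandardCharsets.UTF_8);
    }

    static HttpRequest buildLoginRequest(String loginUrl, String user, String password) {
        return HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(user, password)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest login = buildLoginRequest("http://example.com/login", "demo", "secret");
        System.out.println(login.method() + " " + login.uri());
        // A real run would send this with newSessionClient().send(...), then reuse the
        // same client for the search form, the results pages, and each "details" link.
    }
}
```

Every further step (search form, paging, details links) adds another request and more state to track, which is where the hand-rolled approach starts to get painful.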

In the data extraction phase you’ve already arrived at the page containing the data you’re interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so screen-scraper hides most of those types of details behind the scenes, which simplifies the process. screen-scraper actually uses regular expressions to perform the data extraction, but you may or may not even be aware of that when you use it.

As an addendum, I should probably mention a third phase that is often ignored: what do you do with the data once you’ve extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user’s web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it’s been extracted. One of the primary design goals of screen-scraper was to make it as flexible as possible in this regard. Our FAQ on saving information to a database gives several suggestions on how to do this with screen-scraper.
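As a small example of that third phase, here is a Java sketch that turns extracted records into CSV text, quoting any field that contains a comma, a quote, or a line break. The class name and sample records are made up, and this is plain Java rather than anything screen-scraper provides.

```java
import java.util.List;

class CsvWriter {
    // Quote a field only when it contains a character that would break the row.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Joins each record with commas and terminates each row with CRLF. */
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(escape(row.get(i)));
            }
            sb.append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("name", "price"),
                List.of("Widget, large", "9.99"));
        System.out.print(toCsv(rows));
    }
}
```

The resulting string could be written to disk with something like `java.nio.file.Files.writeString`.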

03.09.06

Version 2.7 of screen-scraper available

Posted in Updates at 12:30 pm by Todd Wilson

Come ‘n get it, friends and neighbors. You can download it fresh from our site or update your existing instance. This is definitely our cleanest release yet. Probably the coolest feature in my opinion is the RSS stuff. Check out our new tutorial on it. It may end up being kind of a “gee whiz” feature, but hopefully people will find ways to make it useful.

For the next day or two we’ll hold off on announcing this to the world, so enjoy the speedy downloads while they last. Generally when we announce it on Freshmeat and other sites our server gets pretty hammered…

03.07.06

Adding numbers to session variables

Posted in Updates, Tips at 5:38 pm by Todd Wilson

Up till now it’s been a pretty big pain to add a number to a session variable. Oftentimes you’ll have something like a page number that you need to increment as you loop through search results pages. The page number is usually stored as a String, and to increment it you normally have to parse it to an int, increment it, then convert it back to a String. Recently, though, we added a “session.addToVariable” method that makes this a lot quicker. Here’s the documentation on it:

  • addToVariable( String variable, int value ). Adds a value to a session variable. Session variables are generally stored as Strings, so it’s normally more difficult than it should be to simply add a number to one. This method takes the name of the variable, which can either hold a String or Integer, and adds a number to it. The number added to it can be positive or negative.
    example: session.addToVariable( "PAGE_NUM", 1 );

Much simpler than the previous way. This will be part of our upcoming 2.7 release (any day now!), but if you’d like to make use of it right now you can simply upgrade to the latest pre-release version (2.6.0.6a).
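For comparison, here is roughly what the manual approach looks like next to a helper that behaves the way the documentation above describes. This is a standalone Java sketch: the map is only a stand-in for screen-scraper's session variables, and the helper is my own approximation for illustration, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

class SessionVariableDemo {
    // Stand-in for the session variable store (an assumption for illustration).
    static final Map<String, Object> variables = new HashMap<>();

    // The manual way: parse the String, add one, convert back to a String.
    static void incrementManually(String name) {
        int current = Integer.parseInt((String) variables.get(name));
        variables.put(name, String.valueOf(current + 1));
    }

    // Hypothetical helper mirroring the documented behaviour: accepts a variable
    // holding either a String or an Integer and adds a positive or negative number.
    // Keeping the original type is an assumption of this sketch.
    static void addToVariable(String name, int value) {
        Object current = variables.get(name);
        if (current instanceof Integer) {
            variables.put(name, (Integer) current + value);
        } else {
            int parsed = Integer.parseInt(String.valueOf(current));
            variables.put(name, String.valueOf(parsed + value));
        }
    }

    public static void main(String[] args) {
        variables.put("PAGE_NUM", "1");
        incrementManually("PAGE_NUM");
        addToVariable("PAGE_NUM", 1);
        System.out.println(variables.get("PAGE_NUM"));
    }
}
```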
