04.06.10

Exporting & importing scraping sessions in 4.5.42a

Posted in Miscellaneous, Tips at 6:12 pm by Todd Wilson

We try hard to maintain backward compatibility, but unfortunately it can’t always be done.  If you recently upgraded to 4.5.42a you may have noticed that scraping sessions exported from that version don’t import correctly into earlier alpha versions.  This is a result of the changes to the “tidy HTML” functionality that were implemented in that version.  From 4.5.42a onward, if you export scraping sessions from screen-scraper you should only import them into instances of screen-scraper also running version 4.5.42a or later.  It was impossible to maintain compatibility with older versions in this case, so please take note.

04.10.09

One-day only 50% off sale!

Posted in Miscellaneous at 12:03 pm by Todd Wilson

Yesterday I opened a fortune cookie that said, “Do something unusual tomorrow.”  I thought about sky-diving or going the whole day blind-folded, but instead opted for something even crazier: selling screen-scraper for half price!  If you’re on the fence about purchasing, now might be a good time to take the plunge.  I don’t see us doing this again any time soon.  The sale will last until April 11, 2009 at 11:00 a.m. Mountain time.

03.25.09

First video tutorial

Posted in Miscellaneous at 10:24 am by Todd Wilson

We’ve had people asking for this for quite a while, and have finally gotten to it.  We now have a video version of our first tutorial, accessible from the tutorial itself:

http://community.screen-scraper.com/Tutorial_1_Page_1

It isn’t perfect, but I think it’s a pretty good first version (and definitely better than what we had previously).  We’re hoping to get some feedback, then will likely do another version soon based on that feedback.  Feel free to give it a try and let us know what you think.

10.27.08

Iowa Workforce Development Uses Screen-Scraper to Enhance Job Search

Posted in Miscellaneous at 11:44 am by Todd Wilson

One of our eagle-eyed developers recently spotted a couple of blog postings by Bronwyn Mauldin (here and here) wherein she discusses Iowa Workforce Development’s use of our screen-scraping technology in building out their job board.  Bronwyn is a great writer and a consultant in the workforce development industry.  After reading Bronwyn’s postings we decided to contact the Iowa office ourselves to catch up on how things have been going for them.  It makes a great story as to how screen-scraping technology is being used in a very effective way.  We decided to make a press release on it, which you can find here:

Iowa Workforce Development Uses Screen-Scraper to Enhance Job Search

01.02.07

How to stop phpBB spam

Posted in Miscellaneous, Tips at 12:29 pm by Todd Wilson

Well, I sure wish someone would have told us about this a while ago, so I’m doing the world a favor and talking about it here. Hopefully this blog posting gets picked up by Google so that others who are new to phpBB can learn how to stop spam up front.

We’ve been battling spam on our phpBB forum for I don’t know how long. The forum software works fine, but it’s so widespread that it seems to be one of the primary targets for forum spammers. After monkeying around with the thing installing mods and making manual changes, we finally hit this mod: Stop Spambot Registration. Once installed, the spam stopped. Amazing.

Now, obviously your mileage may vary with this one. We’ve also tried a bunch of other mods, so it’s possible that some of our mods are helping, but the Stop Spambot Registration was the key for us. If you find that you need more firepower beyond that mod, I’d recommend trying others on the phpBB Security-Related MODs page that relate to spam.

By the way, just one plea to the phpBB folks–please consider building spam control into the base install of the software. You know people are targeting you, so why not give your users some defense out of the box?

***UPDATE***

Well, I declared victory a bit prematurely with that last posting. We got a bit more spam after I installed the mod I mentioned, so I installed one more: spamwords. It seems to work fairly well. My only complaint is that it only allows you to designate words, and not phrases, as indicators of spam.

I should also mention one other change we made early on that stopped a lot of the spam–we deleted the guest user account. This is the user in the database that has an ID of -1. I searched and searched for a way to disable guest posting, to no avail. With the guest account deleted people see an error message if they explicitly log out, but at least it prevents spam from non-registered posters.

09.12.06

Using screen-scraper to automatically test embedded devices

Posted in Miscellaneous, Thoughts at 10:49 am by Todd Wilson

A while back I flew out to Huntsville, AL to work with a government contractor company on automating the testing of embedded devices. To this day I’m not entirely sure what these little machines did, but they each had a web interface that needed testing (much like that of a wireless router, if you’ve worked with those before). This isn’t the most common usage for screen-scraper, but it turned out to be just what they needed.

I worked closely with Greg Chapman, one of their engineers, and he recently wrote an article on the experience entitled Testing aerospace UUTs leads to Web solution. Greg’s a smart guy, and has continued to use screen-scraper in ways that I wouldn’t have even considered.

It’s gratifying to see screen-scraper used in so many different ways, but it’s interesting that its versatility has almost been a curse to us at times. Our software can be used for all kinds of purposes, but we’re finding that, from a business standpoint, we’re often better off narrowing our focus to very specific applications. As one marketing expert we consulted with put it, “You guys have plastic.” Plastic is incredibly useful, but it gains value as you craft it into something with a specific purpose. I’m planning on blogging about this idea more later, but it’s interesting to consider the pros and cons of a general-purpose tool like screen-scraper.

08.24.06

Developing software by the 15% rule

Posted in Miscellaneous, Thoughts at 5:02 pm by Todd Wilson

Writing software on a consulting basis can often be a losing proposition for developers or clients or both. There are too many things that can go wrong, and that ultimately translates into loss of time and money. The “15% rule” we’ve come up with is intended to create a win-win situation for both parties (or at least make it fair for everyone). Clients generally get what they want, and development shops make a fair profit. It’s not a perfect solution, but so far it seems to be working for us.

This may come as a surprise to some, but we make very little money selling software licenses. The vast majority of our revenue comes through consulting services–writing code for hire. Having now done this for several years, we’ve learned some hard lessons. On a few projects the lessons were so hard we actually lost money.

A few months ago I put together somewhat of a manifesto-type document intended to address the difficulties we’ve faced in developing software for clients. I’m pleased to say that it’s made a noticeable difference so far for us. My hope is that this blog entry will be read by others who develop software on a consulting basis, so that they can learn these lessons the easy way rather than the way we learned them.

What follows in this article is a summary of one of the main principles we now follow in developing software–the 15% rule. If you’d like, you’re welcome to read the full “Our Approach to Software Development” document.

For the impatient, the 15% rule goes like this…

Before undertaking a development project we create a statement of work (which acts as a contract and a specification) that outlines what we’ll do, how many hours it will require, and how much it will cost the client. As part of the contract we commit to invest up to the amount of time outlined in the document plus 15%. That is, if the statement of work says that the project will take us 100 hours to complete, we’ll spend up to 115 hours (but no more). As to where-fores and why-tos on how this works, read on.
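To make the arithmetic concrete, here is a minimal sketch in Python. The function names and the whole-number percentage are my own illustration, not something taken from our statement of work; integer math is used so that 100 hours comes out as exactly 115.

```python
def committed_hours(estimated_hours: int, buffer_percent: int = 15) -> float:
    """Upper limit on hours under the rule: the estimate plus the buffer."""
    return estimated_hours * (100 + buffer_percent) / 100


def buffer_remaining(estimated_hours: int, hours_spent: float,
                     buffer_percent: int = 15) -> float:
    """Hours left before the cap is reached (never negative)."""
    cap = committed_hours(estimated_hours, buffer_percent)
    return max(0.0, cap - hours_spent)


# The example from the text: a 100-hour statement of work.
print(committed_hours(100))
print(buffer_remaining(100, 108))
```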

Those that have developed software for hire know that the end product almost never ends up exactly as the client had pictured. There are invariably tweaks that will need to be made (that may or may not have been discussed up front) in order to get the thing to at least resemble what the client has in mind. And, yes, this can happen even if you spend hours upon hours fine tuning the specification to reflect the client’s wishes. Additionally, technical issues can crop up that weren’t anticipated by the programming team. In theory, the better the programming team the less likely this should be, but it doesn’t always end up that way (Microsoft’s Vista operating system is a sterling example). These two factors, among others, equate to the risk that is inherent in the project. Something isn’t going to go right, and that will almost always mean someone pays or loses more money than originally anticipated. The question is, who should be responsible to account for those extra dollars?

Up until relatively recently, we would shoulder almost all of the risk in our projects. If the app didn’t do what the client had in mind, or if unforeseen technical issues cropped up, it generally came out of our pockets. For the most part it wasn’t a huge problem, but always seemed to have at least some effect (the extreme cases obviously being when we lost money on a project).

This seems kind of unfair, doesn’t it? The risk inherent to the project isn’t necessarily the fault of either party. It’s just there. We didn’t put it there, and neither did the client. As such, it shouldn’t be the case that one party shoulders it all. That’s where the 15% rule comes in.

The 15% rule allows both parties to share the risk. By following this rule, we’re acknowledging that something probably won’t go as either party intended, so we need a buffer to handle the stuff that spills over. By capping it at a specific amount, though, we’re also ensuring that the buffer isn’t so big that it devours the profits of the developers.

For the most part, the clients with whom we’ve used the 15% rule are just fine with it. It is a pretty reasonable arrangement, after all. We have had the occasional party that squirms and wiggles about it, but, in the end, they’ve gone along with it and I think everyone has benefited as a result.

06.09.06

Using screen-scraper with Enterprise Content Management Software

Posted in Miscellaneous at 4:32 pm by Richard

Enterprise Content Management (ECM) encompasses a wide variety of software and technologies for retrieving and managing information and applications so that they are accessible to users throughout a company’s network. Integrating these applications and content sources is usually not best accomplished by reprogramming all of them to accommodate a common interface, which quickly becomes time-consuming and expensive. Instead, enterprise content management solutions are most often constructed by extracting data from the web interfaces exposed by these differing systems. This content is then reformatted and made available through a web interface or ECM portal. Users of these systems can seamlessly gain access to the content and applications they need. Advanced ECM systems also facilitate two-way interaction, so that the click of one button by a user of the enterprise’s network on an ECM portal web page may update a legacy database, while the click of another button on the same portal page might send an XML request to perform a document search.

screen-scraper software is a critical piece of enterprise content management systems as described above. The software can be used to extract pertinent information from any number of web sources and to reformat that information for use in building an enterprise portal. When used as a server, screen-scraper can be called from external applications, which gives users the flexibility of calling screen-scraper from a .NET, Java or other program.

The flexibility of screen-scraper in managing data that is accessible via the web (both extracting and publishing) makes it an ideal fit for performing enterprise content management tasks. The developers who built and currently support screen-scraper are experts at analyzing data extraction and formatting needs and can recommend the best approach for incorporating the screen-scraper software into your enterprise’s network.

03.21.06

Three common methods for data extraction

Posted in Miscellaneous, Thoughts at 3:29 pm by Todd Wilson

Building off of my earlier posting on data discovery vs. data extraction, in the data extraction phase of the web scraping process you’ve already arrived at the page containing the data you’re interested in, and you now need to pull it out of the HTML.

Probably the most common technique used traditionally to do this is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you’re already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.
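As a rough illustration of the regular-expression approach, here is a small Python sketch that pulls URLs and link titles out of a chunk of HTML. The sample markup and the names `LINK_RE` and `extract_links` are made up for this example, and the pattern is deliberately simple: it only handles double-quoted href attributes and is in no way a full HTML parser.

```python
import re

# Matches <a ... href="URL" ...>TITLE</a>.
# Assumption: href values are double-quoted and links are not nested.
LINK_RE = re.compile(
    r'<a\s[^>]*?href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (url, title) pairs found in the HTML."""
    return [(url, title.strip()) for url, title in LINK_RE.findall(html)]


# Hypothetical page fragment.
sample = """
<ul>
  <li><a href="/news/1">First headline</a></li>
  <li><a class="x" href="/news/2">Second headline</a></li>
</ul>
"""

for url, title in extract_links(sample):
    print(url, title)
```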

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing “ontologies”, or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they’re often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it’s probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what’s the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

  • If you’re already familiar with regular expressions and at least one programming language, this can be a quick solution.
  • Regular expressions allow for a fair amount of “fuzziness” in the matching such that minor changes to the content won’t break them.
  • You likely don’t need to learn any new languages or tools (again, assuming you’re already familiar with regular expressions and a programming language).
  • Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It’s also nice because the various regular expression implementations don’t vary too significantly in their syntax.

Disadvantages:

  • They can be complex for those that don’t have a lot of experience with them. Learning regular expressions isn’t like going from Perl to Java. It’s more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
  • They’re often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you’ll see what I mean.
  • If the content you’re trying to match changes (e.g., they change the web page by adding a new “font” tag) you’ll likely need to update your regular expressions to account for the change.
  • The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You’ll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there’s no sense in getting into other tools if all you need to do is pull some news headlines off of a site.
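The “fuzziness” advantage and the “new font tag” disadvantage listed above are two sides of the same coin, and a short Python sketch may make that clearer. The markup and both patterns are invented for illustration: the strict pattern breaks as soon as a font tag is wrapped around the text, while the more tolerant one allows for stray tags on either side of it.

```python
import re

# Hypothetical markup before and after the site adds a "font" tag.
before = '<td class="headline">Markets rally</td>'
after = '<td class="headline"><font color="red">Markets rally</font></td>'

# Brittle: expects the text to follow the opening tag immediately.
strict = re.compile(r'<td class="headline">([^<]+)</td>')

# More tolerant: allows any run of tags on either side of the text.
tolerant = re.compile(
    r'<td class="headline">(?:<[^>]+>)*([^<]+)(?:<[^>]+>)*</td>'
)


def first_match(pattern, html):
    """Return the first captured group, or None if the pattern fails."""
    m = pattern.search(html)
    return m.group(1) if m else None


for name, pattern in (("strict", strict), ("tolerant", tolerant)):
    print(name, first_match(pattern, before), first_match(pattern, after))
```

Even the tolerant pattern will break on a bigger change (say, the class name changing), so this only reduces the maintenance burden rather than removing it.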

Ontologies and artificial intelligence

Advantages:

  • You create it once and it can more or less extract the data from any page within the content domain you’re targeting.
  • The data model is generally built in. For example, if you’re extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
  • There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

  • It’s relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
  • These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you’re targeting.
  • You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you’ll only get into ontologies and artificial intelligence when you’re planning on extracting information from a very large number of sources. It also makes sense to do this when the data you’re trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

  • Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
  • Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application, scraping a site takes significantly less time than it would with other methods.
  • Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

  • The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
  • A potential cost. Most ready-to-go screen-scraping applications are commercial, so you’ll likely be paying in dollars as well as time for this solution.
  • A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you’re locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you’re using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don’t mind paying a bit, you can save yourself a significant amount of time by using one. If you’re doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you’re probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we’ve been involved with that has actually required a hybrid approach of two of the aforementioned methods. We’re currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term “number of bedrooms” can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we’ve done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it’s handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we’ve written that uses ontologies in order to extract out the individual pieces we’re after. Once the data has been extracted we then insert it into a database.
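To give a feel for the “number of bedrooms” problem, here is a toy Python sketch. It is not the code from the project described above; it just stands in for the idea of a vocabulary that maps many surface forms onto one field. The term list and the function name `bedrooms` are assumptions made for the example, and a real vocabulary would be far larger.

```python
import re

# Hypothetical vocabulary: surface forms that all mean "bedrooms".
# Longer forms come first so that "bedrooms" is tried before "bed".
BEDROOM_TERMS = ["bedrooms", "bedroom", "bdrms", "bdrm", "beds", "bed", "br", "bd"]

# A number, optional hyphen or spaces, then any of the terms as a whole word.
BEDROOM_RE = re.compile(
    r"(\d+)\s*-?\s*(?:" + "|".join(BEDROOM_TERMS) + r")\b",
    re.IGNORECASE,
)


def bedrooms(ad_text):
    """Return the bedroom count mentioned in an ad, or None if not found."""
    m = BEDROOM_RE.search(ad_text)
    return int(m.group(1)) if m else None


# Hypothetical ad snippets.
ads = [
    "Charming 3 BR, 2 BA home near park",
    "2-bedroom condo, downtown",
    "4bdrm w/ garage",
    "Studio apartment, utilities incl.",
]
for ad in ads:
    print(bedrooms(ad), "<-", ad)
```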

03.16.06

Data discovery vs. data extraction

Posted in Miscellaneous, Thoughts at 1:30 pm by Todd Wilson

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the “details” links within the search results pages to get to the data you’re actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a tool like screen-scraper can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.
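Here is a bare-bones Python sketch of the traversal part of data discovery: walk the search results pages by following “next” links, and collect the “details” links along the way. Everything in it is hypothetical (the URLs, the link labels, the `discover` function), and the site is faked with an in-memory dictionary so the sketch runs without a network. Logging in and submitting the search form are left out entirely.

```python
import re
from urllib.parse import urljoin

# Hypothetical search results pages, keyed by URL.
FAKE_SITE = {
    "http://example.com/search?page=1":
        '<a href="/item/1">details</a> <a href="/item/2">details</a> '
        '<a href="/search?page=2">next</a>',
    "http://example.com/search?page=2":
        '<a href="/item/3">details</a>',
}

# Assumption: links look exactly like <a href="...">label</a>.
HREF_RE = re.compile(r'<a href="([^"]+)">([^<]*)</a>')


def discover(start_url, fetch):
    """Follow 'next' links through results pages; collect 'details' URLs."""
    details, seen, url = [], set(), start_url
    while url and url not in seen:
        seen.add(url)  # guard against pagination loops
        next_url = None
        for href, label in HREF_RE.findall(fetch(url)):
            target = urljoin(url, href)  # resolve relative links
            if label == "details":
                details.append(target)
            elif label == "next":
                next_url = target
        url = next_url
    return details


print(discover("http://example.com/search?page=1", FAKE_SITE.__getitem__))
```

Against a live site, `fetch` would have to make real HTTP requests and hold on to cookies between them. In Python one way to do that is an opener built with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor())`, and that bookkeeping is exactly the part that tends to get painful in hand-written code.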

In the data extraction phase you’ve already arrived at the page containing the data you’re interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so screen-scraper hides most of those types of details behind the scenes, which simplifies the process. screen-scraper actually uses regular expressions to perform the data extraction, but you may or may not even be aware of that when you use it.

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you’ve extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user’s web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it’s been extracted. One of the primary design goals of screen-scraper was to make it as flexible as possible in this regard. Our FAQ on saving information to a database gives several suggestions on how screen-scraper can be used in this regard.
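For this third phase, here is a minimal Python sketch of two of the options mentioned: writing extracted records out as CSV, and saving them to a database. It is a generic illustration and not how screen-scraper itself does it; the records, the `headlines` table, and the function names are all made up, and SQLite stands in for whatever database you actually use.

```python
import csv
import io
import sqlite3

# Hypothetical records, as they might come out of the extraction step.
rows = [
    {"title": "First headline", "url": "/news/1"},
    {"title": "Second headline", "url": "/news/2"},
]


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_sqlite(records, conn):
    """Insert the records into a (hypothetical) headlines table."""
    conn.execute("CREATE TABLE IF NOT EXISTS headlines (title TEXT, url TEXT)")
    conn.executemany(
        "INSERT INTO headlines (title, url) VALUES (:title, :url)", records
    )
    conn.commit()


print(to_csv(rows))

conn = sqlite3.connect(":memory:")  # in-memory database for the example
to_sqlite(rows, conn)
print(conn.execute("SELECT COUNT(*) FROM headlines").fetchone()[0])
```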
