08.02.06

Extracting data from PDF files

Posted in Tips at 11:42 am by Todd Wilson

Periodically people ask if screen-scraper can extract data from PDF files, as well as HTML. We’ve never had a very good answer for this (it can’t, out of the box), but lately we’ve been forced to come up with a solution, as a project we’ve been working on has required it.

When I initially researched how to go about this, I was looking for libraries that would allow for extraction from PDF files. I found a handful of them, but each had its own proprietary method for performing the extraction (e.g., lots of different method calls for handling tables and such). They seemed like possibilities, but I couldn’t come up with an elegant way to integrate them into screen-scraper without completely changing the way the user would need to perform the extraction.

After stepping back from the problem, I decided that it might make more sense to simply convert the PDF to a text-based format (e.g., HTML), then use screen-scraper’s existing extraction mechanism to pull the data out. After poking around a bit, I happened across pdftohtml, which does an excellent job of converting PDF files to HTML or XML. For the project we’re currently working on, screen-scraper is able to easily pull the data we need out of the converted file.
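As a rough sketch of the conversion step, the Java below shells out to pdftohtml with its -xml option. It assumes the pdftohtml binary is on the PATH; the class and method names (PdfToXmlSketch, buildCommand, convert) are made up for illustration and are not part of screen-scraper.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class PdfToXmlSketch {

    // Builds the command line for an XML conversion.
    // Assumption: "pdftohtml" is installed and reachable via the PATH.
    static List<String> buildCommand(Path pdf, Path outputBase) {
        return Arrays.asList("pdftohtml", "-xml", pdf.toString(), outputBase.toString());
    }

    // Runs the conversion and returns the exit code of the pdftohtml process.
    static int convert(Path pdf, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, outputBase))
                .inheritIO() // pass pdftohtml's own messages through to the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToXmlSketch <input.pdf> <output-base>");
            return;
        }
        int exitCode = convert(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("pdftohtml exited with code " + exitCode);
    }
}
```

The converted file (typically the output base name with an .xml extension) can then be loaded like any other text document and run through the usual extractor patterns.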

Our next step will be to integrate this functionality directly within screen-scraper. That is, screen-scraper should be able to seamlessly convert a PDF on the fly, then allow the user to make use of screen-scraper’s existing extraction mechanisms to pull the data out. The tricky part is that pdftohtml is platform dependent. They offer a Windows binary, but on any other OS you have to either compile from source or hope for an existing package (we’re using Ubuntu and were able to just apt-get it).

Here’s how I’m thinking it would work if we were to automate the process within screen-scraper:

  • For any operating systems that allow it, we just ship a binary with screen-scraper that will perform the PDF to HTML/XML conversion locally.
  • In cases where that doesn’t work, we provide a remote web service that will convert the PDF to XML. screen-scraper would invoke this behind the scenes in two different ways:
    • screen-scraper would first attempt to convert the PDF by passing the PDF’s URL to the web service. The web service would attempt to retrieve the PDF via a GET request. Assuming that works, it would then perform the conversion and spit back the resulting XML, which screen-scraper would download.
    • If the web service is unable to grab the PDF directly (e.g., in cases where the PDF is behind an authentication gateway), it would indicate such to screen-scraper, which would then download the PDF file and upload it to the web service. The web service would perform the conversion, then output the resulting XML.

Once the PDF is converted, the user would be free to use the normal extractor patterns to pull the data out.
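The order of attempts described above (local binary, then web service by URL, then web service by upload) can be sketched as a simple fallback chain. Nothing here exists in screen-scraper yet; the Converter interface and the three stand-in strategies are hypothetical, and each real strategy would wrap a local process or an HTTP call.

```java
import java.util.Optional;

class PdfConversionFallbackSketch {

    // One way of turning a PDF into XML. Returns empty when this strategy
    // cannot handle the PDF (e.g., no local binary for this OS, or the
    // web service cannot fetch the URL on its own).
    interface Converter {
        Optional<String> convert(String pdfUrl);
    }

    // Tries each strategy in order and returns the first XML produced.
    static Optional<String> convertWithFallback(String pdfUrl, Converter... strategies) {
        for (Converter strategy : strategies) {
            Optional<String> xml = strategy.convert(pdfUrl);
            if (xml.isPresent()) {
                return xml;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins only: the first two pretend to fail, the third pretends to succeed.
        Converter localBinary = url -> Optional.empty();
        Converter remoteByUrl = url -> Optional.empty();
        Converter remoteByUpload = url -> Optional.of("<converted source=\"" + url + "\"/>");

        Optional<String> xml = convertWithFallback(
                "http://example.com/report.pdf", localBinary, remoteByUrl, remoteByUpload);
        System.out.println(xml.orElse("conversion failed"));
    }
}
```

Keeping each strategy behind the same interface means the cheapest option is always tried first, and the expensive download-then-upload path only runs when the others give up.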

So in a worst-case scenario, the PDF would need to be downloaded from the source site, uploaded to the web service, converted, then downloaded again. Obviously this would add a lot of overhead, so it’s definitely not the best approach. I would guess that in the majority of the cases, however, the PDF could be converted either locally or via the web service where it’s able to request the PDF directly from the web site.

I can’t say just when we’ll get around to implementing this. That will mostly depend on the demand we see for it. This is the first project we’ve done where we’ve had to pull it off, but I’m guessing there will be others down the road. Until we implement this automated method, though, running pdftohtml manually may not be too cumbersome for most.

07.31.06

Extracting data from Java applets, ActiveX controls, and Adobe Flash movies

Posted in Tips at 10:04 am by Todd Wilson

This is a question we get from time to time, so I finally decided to add it to our FAQ. If anyone else has experience with this kind of thing feel free to post a comment. I’m unaware of many packages that can do this.

Here’s the posting from the FAQ:

The short answer to this one is, “Sometimes.” Most widgets (applets, etc.) that communicate with their server via HTTP can be scraped by screen-scraper. Oftentimes, however, they’ll use a proprietary protocol. Most of the time Adobe Flash movies use HTTP when they need to communicate with a server, but Java applets and ActiveX controls don’t always. The easiest way to find out is to use screen-scraper’s proxy server when interacting with a page containing one of these elements. Take a close look at the HTTP requests and responses passing between the web browser and the server. If you see text in there (often XML or URL-encoded lists of parameters) then the chances are good that screen-scraper can extract the information being passed between the client and server. Note, however, that there may be text that the widget is displaying that doesn’t get passed between the client and server. Unfortunately, in such cases, screen-scraper is unable to extract that information. The only utility we’re aware of that may allow for scraping that type of information would be IBM’s Rational Robot software.
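To give a feel for what a URL-encoded list of parameters looks like once decoded, here is a small Java sketch using the standard java.net.URLDecoder. The class name and the sample body in main are invented for illustration; a real widget’s traffic will have its own parameter names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class UrlEncodedBodySketch {

    // Splits a body such as "a=1&b=two+words" into decoded name/value pairs,
    // keeping the order in which the parameters appeared.
    static Map<String, String> parse(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // ignore empty segments, e.g. from a trailing "&"
            }
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            try {
                params.put(URLDecoder.decode(name, "UTF-8"), URLDecoder.decode(value, "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is always available on a Java platform
                throw new IllegalStateException(e);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical response body captured through a proxy
        System.out.println(parse("player=Todd+Wilson&score=42&note=a%26b"));
    }
}
```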

03.22.06

Scraping data from similar tables

Posted in Tips at 5:52 pm by Todd Wilson

Astute screen-scraper Fred came up with a scenario that arises from time to time: you’ve got a page containing one or more HTML tables, all of which are nearly identical in structure. You want to pull the data from each table, but need to be able to distinguish which row came from which table. Standard old extractor patterns won’t do the job; they’ll match every row in every table, which destroys the link between each row and its corresponding table.

Fortunately, there are a couple of ways of handling such a scenario, which I’ve just outlined in this FAQ. Not too complicated, but a bit more involved than just using a standard extractor pattern.
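To illustrate the general idea of keeping rows tied to their table, here is a sketch in plain Java regular expressions: match each table first, then match rows only inside that table’s text. This is not screen-scraper’s extractor pattern syntax and is not necessarily either approach from the FAQ; the class name is made up, and the patterns assume simple, non-nested tables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableRowsSketch {

    private static final Pattern TABLE =
            Pattern.compile("<table[^>]*>(.*?)</table>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern ROW =
            Pattern.compile("<tr[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns one list of rows per table, so result.get(i) holds only the
    // rows found inside the i-th table on the page.
    static List<List<String>> rowsByTable(String html) {
        List<List<String>> tables = new ArrayList<>();
        Matcher tableMatcher = TABLE.matcher(html);
        while (tableMatcher.find()) {
            List<String> rows = new ArrayList<>();
            // Search for rows only within this table's contents
            Matcher rowMatcher = ROW.matcher(tableMatcher.group(1));
            while (rowMatcher.find()) {
                // Replace tags with spaces, then collapse whitespace
                String text = rowMatcher.group(1).replaceAll("<[^>]+>", " ").trim().replaceAll("\\s+", " ");
                rows.add(text);
            }
            tables.add(rows);
        }
        return tables;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>A</td><td>1</td></tr></table>"
                + "<table><tr><td>B</td><td>2</td></tr><tr><td>C</td><td>3</td></tr></table>";
        System.out.println(rowsByTable(html));
    }
}
```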

03.07.06

Adding numbers to session variables

Posted in Tips, Updates at 5:38 pm by Todd Wilson

Up till now it’s been a pretty big pain to add a number to a session variable. Oftentimes you’ll have something like a page number that you need to increment as you loop through search results pages. The page number is usually stored as a String, so to increment it you normally have to parse it into an int, add to it, then convert the result back to a String. Recently, though, we added a “session.addToVariable” method that makes this a lot quicker. Here’s the documentation on it:

  • addToVariable( String variable, int value ). Adds a value to a session variable. Session variables are generally stored as Strings, so it’s normally more difficult than it should be to simply add a number to one. This method takes the name of the variable, which can either hold a String or Integer, and adds a number to it. The number added to it can be positive or negative.
    example: session.addToVariable( "PAGE_NUM", 1 );
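For comparison, the manual parse, add, and convert-back steps look roughly like this in plain Java. A Map stands in for the session here so the sketch runs on its own; the class and helper names are hypothetical and this is not screen-scraper’s internal implementation.

```java
import java.util.HashMap;
import java.util.Map;

class PageNumberSketch {

    // Reads the variable (stored as a String or an Integer), adds the
    // amount, and stores the result back as a String.
    static void addToStringVariable(Map<String, Object> variables, String name, int amount) {
        int current = Integer.parseInt(String.valueOf(variables.get(name)));
        variables.put(name, String.valueOf(current + amount));
    }

    public static void main(String[] args) {
        Map<String, Object> variables = new HashMap<>(); // stand-in for the session
        variables.put("PAGE_NUM", "1");
        addToStringVariable(variables, "PAGE_NUM", 1);
        System.out.println("PAGE_NUM is now " + variables.get("PAGE_NUM"));
    }
}
```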

Much simpler than the previous way. This will be part of our upcoming 2.7 release (any day now!), but if you’d like to make use of it right now you can simply upgrade to the latest pre-release version (2.6.0.6a).
