Extracting data from PDF files

Periodically people ask if screen-scraper can extract data from PDF files, as well as HTML. We’ve never had a very good answer for this (it can’t, out of the box), but lately we’ve been forced to come up with a solution, as a project we’ve been working on has required it.

When I initially researched how to go about this, I was looking for libraries that would allow for extraction from PDF files. I found a handful of them, but each had its own proprietary method for performing the extraction (e.g., lots of different method calls for handling tables and such). They seemed like possibilities, but I couldn’t come up with an elegant way to integrate them into screen-scraper without completely changing the way the user would need to perform the extraction.

After stepping back from the problem, I decided that it might make more sense to simply convert the PDF to a text-based format (e.g., HTML), then use screen-scraper’s existing extraction mechanism to pull the data out. After poking around a bit, I happened across pdftohtml, which does an excellent job of converting PDF files to HTML or XML. For the project we’re currently working on, screen-scraper is able to easily pull the data we need out of the converted file.
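If you'd like to try this yourself before we build anything in, invoking pdftohtml from a screen-scraper script (or any standalone Java code) is straightforward. Here's a minimal sketch; it assumes pdftohtml is on the PATH, the output file name is just illustrative, and the exact flags can vary a bit between pdftohtml versions:

    import java.io.*;

    public class PdfToXml {
        // Shell out to pdftohtml to convert a PDF into XML, then return the XML as a string.
        // Assumes pdftohtml is on the PATH; "converted" is just an illustrative output name.
        public static String convert(String pdfPath) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("pdftohtml", "-xml", "-i", pdfPath, "converted");
            pb.redirectErrorStream(true);
            Process process = pb.start();

            // Drain the process output so it can't block on a full buffer.
            BufferedReader out = new BufferedReader(new InputStreamReader(process.getInputStream()));
            while (out.readLine() != null) { /* discard */ }
            if (process.waitFor() != 0) {
                throw new IOException("pdftohtml exited with an error");
            }

            // pdftohtml tacks .xml onto the output name we gave it.
            StringBuilder xml = new StringBuilder();
            BufferedReader in = new BufferedReader(new FileReader("converted.xml"));
            for (String line; (line = in.readLine()) != null; ) {
                xml.append(line).append("\n");
            }
            in.close();
            return xml.toString();
        }
    }

The resulting XML can then be treated like any other page you'd scrape.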

Our next step will be to integrate this functionality directly into screen-scraper. That is, screen-scraper should be able to seamlessly convert a PDF on the fly, then allow the user to make use of screen-scraper’s existing extraction mechanisms to pull the data out. The tricky part is that pdftohtml is platform-dependent. The project offers a Windows binary, but on any other OS you have to either compile it from source or hope for an existing package (we’re using Ubuntu and were able to just apt-get it).

Here’s how I’m thinking it would work if we were to automate the process within screen-scraper:

  • For any operating systems that allow it, we just ship a binary with screen-scraper that will perform the PDF to HTML/XML conversion locally.
  • In cases where that doesn’t work, we provide a remote web service that will convert the PDF to XML. screen-scraper would invoke this behind the scenes in two different ways (sketched in code after this list):
    • screen-scraper would first attempt the conversion by passing the PDF’s URL to the web service. The web service would attempt to retrieve the PDF via a GET request. Assuming that works, it would then perform the conversion and spit back the resulting XML, which screen-scraper would download.
    • If the web service is unable to grab the PDF directly (e.g., in cases where the PDF is behind an authentication gateway), it would indicate such to screen-scraper, which would then download the PDF file and upload it to the web service. The web service would perform the conversion, then output the resulting XML.
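Just to make that fallback concrete, here's a rough sketch of how the web-service half might look from screen-scraper's side. Nothing about it is final: the service URL, query parameter, and upload convention below are all made up for illustration, since no such service exists yet.

    import java.io.*;
    import java.net.*;

    public class PdfConversionFallback {
        // Hypothetical endpoint for the remote conversion service; not a real URL.
        private static final String SERVICE_URL = "http://example.com/convert";

        // Try the cheaper strategy first, falling back to download-then-upload.
        // (The local pdftohtml step from the earlier sketch would come before either of these.)
        public static String convert(String pdfUrl) throws IOException {
            // 1. Ask the service to fetch the PDF itself by handing it the URL.
            String xml = convertByUrl(pdfUrl);
            if (xml != null) {
                return xml;
            }
            // 2. Worst case: the PDF is behind authentication, so download it
            //    here and upload the raw bytes to the service instead.
            return convertByUpload(download(pdfUrl));
        }

        private static String convertByUrl(String pdfUrl) throws IOException {
            URL url = new URL(SERVICE_URL + "?url=" + URLEncoder.encode(pdfUrl, "UTF-8"));
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // A non-200 response means the service couldn't grab the PDF directly.
            return conn.getResponseCode() == 200 ? readAll(conn.getInputStream()) : null;
        }

        private static String convertByUpload(byte[] pdfBytes) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(SERVICE_URL).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/pdf");
            OutputStream out = conn.getOutputStream();
            out.write(pdfBytes);
            out.close();
            return readAll(conn.getInputStream());
        }

        private static byte[] download(String pdfUrl) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            InputStream in = new URL(pdfUrl).openStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            in.close();
            return out.toByteArray();
        }

        private static String readAll(InputStream in) throws IOException {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (String line; (line = reader.readLine()) != null; ) {
                sb.append(line).append("\n");
            }
            reader.close();
            return sb.toString();
        }
    }

The server side of this would essentially be the same pdftohtml invocation shown earlier, just running on our machine instead of yours.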

Once the PDF is converted, the user would be free to use the normal extractor patterns to pull the data out.
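For instance, pdftohtml's XML output wraps each run of text in a <text> element that carries its position on the page, so a single extractor pattern along these lines (the token names are just examples) could capture each piece of text along with its coordinates:

    <text top="~@TOP@~" left="~@LEFT@~" width="~@WIDTH@~" height="~@HEIGHT@~" font="~@FONT@~">~@TEXT@~</text>

The positioning attributes turn out to be handy for reconstructing tables, since cells in the same row share roughly the same "top" value.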

So in a worst-case scenario, the PDF would need to be downloaded from the source site, uploaded to the web service, converted, then downloaded again. Obviously this would add a lot of overhead, so it’s definitely not the best approach. I would guess that in the majority of cases, however, the PDF could be converted either locally or via the web service requesting the PDF directly from the web site.

I can’t say just when we’ll get around to implementing this. It will mostly depend on the demand we see for it. This is the first project we’ve done that required extracting data from PDFs, but I’m guessing there will be others down the road. Until we implement this automated method, though, running pdftohtml manually may not be too cumbersome for most.
