08.02.06
Posted in Tips at 11:42 am by Todd Wilson
Periodically people ask if screen-scraper can extract data from PDF files, as well as HTML. We’ve never had a very good answer for this (it can’t, out of the box), but lately we’ve been forced to come up with a solution, as a project we’ve been working on has required it.
When I initially researched how to go about this, I was looking for libraries that would allow for extraction from PDF files. I found a handful of them, but each had its own proprietary method for performing the extraction (e.g., lots of different method calls for handling tables and such). They seemed like possibilities, but I couldn’t come up with an elegant way to integrate them into screen-scraper without completely changing the way the user would need to perform the extraction.
After stepping back from the problem, I decided that it might make more sense to simply convert the PDF to a text-based format (e.g., HTML), then use screen-scraper’s existing extraction mechanism to pull the data out. After poking around a bit, I happened across pdftohtml, which does an excellent job of converting PDF files to HTML or XML. For the project we’re currently working on, screen-scraper is able to easily pull the data we need out of the converted file.
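For anyone who wants to script the conversion step, here is a minimal sketch in Python that shells out to pdftohtml. The `-xml` flag is a real pdftohtml option; the function names and the assumption that the output lands at `<output_base>.xml` (or `.html`) are mine, so check them against the build you have installed.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, output_base, as_xml=True):
    """Build the argument list for a pdftohtml invocation (hypothetical helper)."""
    command = ["pdftohtml"]
    if as_xml:
        command.append("-xml")  # ask for XML output instead of HTML
    command += [str(pdf_path), str(output_base)]
    return command


def convert_pdf(pdf_path, output_base, as_xml=True):
    """Run pdftohtml and return the path we expect it to have written.

    Assumes pdftohtml is on the PATH and names its output
    <output_base>.xml or <output_base>.html.
    """
    subprocess.run(build_pdftohtml_command(pdf_path, output_base, as_xml), check=True)
    suffix = ".xml" if as_xml else ".html"
    return Path(str(output_base) + suffix)
```

Calling `convert_pdf("report.pdf", "report")` should leave a text file on disk that screen-scraper can then load like any other page.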
Our next step will be to integrate this functionality directly within screen-scraper. That is, screen-scraper should be able to seamlessly convert a PDF on the fly, then allow the user to make use of screen-scraper’s existing extraction mechanisms to pull the data out. The tricky part is that pdftohtml is platform dependent. They offer a Windows binary, but on any other OS you have to either compile from source or hope for an existing package (we’re using Ubuntu and were able to just apt-get it).
Here’s how I’m thinking it would work if we were to automate the process within screen-scraper:
- For any operating systems that allow it, we just ship a binary with screen-scraper that will perform the PDF to HTML/XML conversion locally.
- In cases where that doesn’t work, we provide a remote web service that will convert the PDF to XML. screen-scraper would invoke this behind the scenes in one of two ways:
  - screen-scraper would first attempt to convert the PDF by passing the PDF’s URL to the web service. The web service would attempt to retrieve the PDF via a GET request. Assuming that works, it would then perform the conversion and spit back the resulting XML, which screen-scraper would download.
  - If the web service is unable to grab the PDF directly (e.g., in cases where the PDF is behind an authentication gateway), it would indicate such to screen-scraper, which would then download the PDF file and upload it to the web service. The web service would perform the conversion, then output the resulting XML.
Once the PDF is converted, the user would be free to use the normal extractor patterns to pull the data out.
So in a worst-case scenario, the PDF would need to be downloaded from the source site, uploaded to the web service, converted, then downloaded again. Obviously this would add a lot of overhead, so it’s definitely not the best approach. I would guess that in the majority of the cases, however, the PDF could be converted either locally or via the web service where it’s able to request the PDF directly from the web site.
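The fallback order described above could be sketched roughly like this in Python. None of this exists yet: every function and class name here is hypothetical, and the converter, downloader and web-service calls are passed in as plain callables so the control flow is the only thing being illustrated.

```python
class DirectFetchFailed(Exception):
    """Hypothetical signal that the web service could not GET the PDF itself."""


def convert_with_fallback(pdf_url, local_convert, remote_by_url, download, remote_by_upload):
    """Return (converted_text, strategy_used) using the cheapest strategy that works.

    local_convert    -- callable(pdf_bytes) -> text, or None if no local binary exists
    remote_by_url    -- callable(url) -> text, raises DirectFetchFailed on failure
    download         -- callable(url) -> pdf_bytes
    remote_by_upload -- callable(pdf_bytes) -> text
    """
    # 1. Prefer a bundled local converter where the platform allows it.
    if local_convert is not None:
        return local_convert(download(pdf_url)), "local"
    # 2. Otherwise ask the web service to fetch the PDF itself.
    try:
        return remote_by_url(pdf_url), "remote-url"
    except DirectFetchFailed:
        # 3. Worst case: download the PDF ourselves, then upload it for conversion.
        pdf_bytes = download(pdf_url)
        return remote_by_upload(pdf_bytes), "remote-upload"
```

Only the last branch pays for the full download, upload and download round trip.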
I can’t say just when we’ll get around to implementing this. It would likely mostly depend on the demand we see for it. This is the first project we’ve done where we’ve had to pull it off, but I’m guessing there will be others down the road. Until we implement this automated method, though, running pdftohtml manually may not be too cumbersome for most.
07.31.06
Posted in Tips at 10:04 am by Todd Wilson
This is a question we get from time to time, so I finally decided to add it to our FAQ. If anyone else has experience with this kind of thing feel free to post a comment. I’m unaware of many packages that can do this.
Here’s the posting from the FAQ:
The short answer to this one is, “Sometimes.” Almost all widgets (applets, etc.) that communicate with their server via HTTP can be scraped by screen-scraper. Oftentimes, however, they’ll use a proprietary protocol. Most of the time Adobe Flash movies use HTTP when they need to communicate with a server, but Java applets and ActiveX controls don’t always. The easiest way to find out is to use screen-scraper’s proxy server when interacting with a page containing one of these elements. Take a close look at the HTTP requests and responses passing between the web browser and the server. If you see text in there (often XML or URL-encoded lists of parameters) then the chances are good that screen-scraper can extract the information being passed between the client and server. Note, however, that there may be text that the widget is displaying that doesn’t get passed between the client and server. Unfortunately, in such cases, screen-scraper is unable to extract that information. The only utility we’re aware of that may allow for scraping that type of information would be IBM’s Rational Robot software.
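If you’ve saved a request or response body from the proxy and want a quick second opinion on whether it’s “text” in the sense above, something like the following Python sketch may help. It is only a rough heuristic of my own (the function name and the category labels are made up for illustration), not anything screen-scraper does internally.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl


def classify_body(body: bytes) -> str:
    """Roughly classify a captured HTTP body as xml, url-encoded, text, binary or empty."""
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"  # not valid UTF-8, likely a proprietary or binary payload
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped.startswith("<"):
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            pass  # looked like markup but is not well-formed XML
    # URL-encoded parameter lists contain no raw whitespace.
    if not any(ch.isspace() for ch in stripped):
        try:
            if parse_qsl(stripped, strict_parsing=True):
                return "url-encoded"
        except ValueError:
            pass
    return "text"
```

Anything that comes back as "xml", "url-encoded" or "text" is a reasonable candidate for extractor patterns; "binary" suggests the widget is speaking its own protocol.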
07.20.06
Posted in Updates at 4:31 pm by Todd Wilson
Phew. Well, sorry it’s been so long. We’ve been swamped with work lately, but I’m happy to report that we’ve recently carved out enough time to get a new alpha version of screen-scraper out the door. For the impatient, here are some quick install instructions (which differ from the usual):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.9a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
  - Close the screen-scraper workbench.
  - Re-open the screen-scraper workbench.
At this point everything should be hunky-dory and you should see a few new features (such as folders).
As to why you need to update screen-scraper in this odd way, we discovered a bug in the updater. It’s surprising that it’s never surfaced before, but hopefully we’ve permanently squashed it so that you can easily update via the standard “Check for updates” menu item in the future.
As always, feel free to send along any feedback. You can post a comment to this blog posting, a message to our forum, or send us a support request.
06.09.06
Posted in Miscellaneous at 4:32 pm by Richard
Enterprise Content Management (ECM) encompasses a wide variety of software and technologies for retrieving and managing information and applications so that they are accessible to users throughout a company’s network. Integration of these applications and content sources is usually best accomplished not by reprogramming all of them to accommodate a common interface, which quickly becomes time consuming and expensive. Instead, enterprise content management solutions are most often constructed by extracting data from the web interfaces exposed by these differing systems. This content is then reformatted and made available through a web interface or ECM portal. Users of these systems can seamlessly gain access to the content and applications they need. Advanced ECM systems also facilitate two-way interaction, so that the click of one button by a user of the enterprise’s network on an ECM portal web page may update a legacy database, while the click of another button on the same portal page might send an XML request to perform a document search.
screen-scraper software is a critical piece of enterprise content management systems as described above. The software can be used to extract pertinent information from any number of web sources and to reformat that information for use in building an enterprise portal. When used as a server, screen-scraper can be called from external applications, which gives users the flexibility of calling screen-scraper from a .NET, Java or other program.
The flexibility of screen-scraper in managing data that is accessible via the web (both extracting and publishing) makes it an ideal fit for performing enterprise content management tasks. The developers who built and currently support screen-scraper are experts at analyzing data extraction and formatting needs and can recommend the best approach for incorporating the screen-scraper software into your enterprise’s network.
05.18.06
Posted in Updates at 1:13 pm by Todd Wilson
We’ve been swamped lately, but I’m happy to report that we’ve finally completed a version of screen-scraper that includes folders. This has been one of the most oft-requested features, so I hope it makes using screen-scraper just that much more pleasant. Upgrade your instance via the “Check for updates” option from the “Options” menu.
Hopefully it’s obvious how they’re used–click the folder icon to create a new folder. To drop an item into a folder, select it, then drag and drop it on to the folder. Right-click a folder to rename it (as well as any other object).
Right now it seems pretty stable, but there are definitely some features we’ll be adding in the not-so-distant future:
- Right now it’s a little obnoxious that you have to select an item and allow it to load before you can drag and drop it. We’re planning on making it so that you can just click an icon and immediately begin to drag it.
- It would be nice to have a bunch of context menus on a folder. That is, you right click and get options like: “Import into…”, “Add scraping session to…” That way you don’t have to create or import a bunch of items to the root folder, then drag them into the folders where you really wanted them in the first place.
- Copying and pasting of scraping sessions, scripts, etc. This one’s also been requested a fair amount. At times you want to be able to create a full copy of a scraping session so that you can experiment with it without nuking the original. You may also want to copy a scrapeable file from one scraping session to another.
Any others you can think of that would be helpful?
As always, let us know of any bugs/issues. We’ve been using this version internally for a few days, and it seems quite stable. Thanks in advance to anyone willing to help us test.
04.04.06
Posted in Updates at 11:29 am by Todd Wilson
This is a pretty small upgrade, but fixes a couple of bugs I’ve personally found to be obnoxious in screen-scraper. They were easy to fix, so my apologies to the world for taking so long to fix them.
The first bug deals with the little divider between the tree on the left and whatever else you might be looking at on the right side (e.g., a scraping session or script). Many might have noticed that oftentimes you can only inch that divider along a few pixels at a time. Pretty annoying, but, fortunately, now fixed.
The second bug is less common, but equally annoying. When adding sub-extractor patterns a vertical scroll bar would often show up on the inner pane, when there was already a vertical scroll bar on the outer pane. You had to resize the window in order to make the inner one go away. Again, obnoxious, but now fixed.
This is a very stable release, so no fears on upgrading. Have at it and save some of your sanity.
03.31.06
Posted in Uncategorized at 2:59 pm by Todd Wilson
Today on our support forum we had someone inquire about calling scripts from other scripts within screen-scraper. This has been requested a number of times in the past, and I’ve kind of hemmed and hawed about it, not sure if it would be opening a can of worms. Some of our internal developers have wanted this as well, so I gave it a bit more thought, and came up with a pretty quick and easy way to implement it.
I’m particularly interested in having this one thoroughly tested, so please feel free to upgrade (try this FAQ if you run into trouble). Remember that this is an alpha version, so caveats apply. It should be plenty stable, though, since this is the only addition from 2.7.2.
Once you’ve upgraded, you can do a method call like this within a script in order to invoke another:
session.executeScript( "My Script" );
03.27.06
Posted in Updates at 5:39 pm by Todd Wilson
I just posted several example scraping sessions that may be of help to those starting out with screen-scraper: http://www.screen-scraper.com/support/examples/scrapbookfinds_examples.php.
Back when screen-scraper was just a babe in my arms I used to include scraping sessions in the download. The scraping sessions extracted stuff from Slashdot, Freshmeat, and Weather.com. The trouble was, the sites would change from time to time, and it was always a pain keeping up with them. What was worse, occasionally people would download screen-scraper, run the scraping sessions, and find that they didn’t work (because the sites had changed). They’d then report back that our software stunk because it didn’t even work with the very examples we provided.
After all of that I decided it simply wasn’t worth providing examples using sites we didn’t have control over. That’s why we set up this mock e-commerce web site on our server. We wanted to provide a “real world” example, but still needed to have control over the site so that we didn’t need to continually update it.
When we started doing ScrapbookFinds, it occurred to me that we could share those scraping sessions with others. We don’t control the sites, but we’re constantly monitoring the scraping sessions and updating them as the sites change. The hope is that these scraping sessions will provide templates and examples that both help people learn screen-scraper and act as boilerplate they can tweak to create their own scraping sessions.
As a side-note, if it’s of interest, we probably average about 15 minutes of time updating scraping sessions per week, and we’re scraping about 15 sites (i.e., the sites either don’t change that often, or we’ve set up our scraping sessions to be fuzzy enough such that they don’t break when minor changes are made).
03.24.06
Posted in Updates at 3:43 pm by Todd Wilson
This is just a minor bug fix release, but anyone invoking screen-scraper from the command line should upgrade. Somehow a semi-critical bug slipped through our radar on the 2.7 release. In 2.7 if you have the workbench open, then run screen-scraper from the command line, when the command line instance ends it will close screen-scraper’s database, leaving the workbench without a way to save any of its information. It wouldn’t lead to database corruption or anything like that, but could get pretty annoying.
03.22.06
Posted in Tips at 5:52 pm by Todd Wilson
Astute screen-scraper Fred came up with a scenario that arises from time to time: you’ve got a page containing one or more HTML tables, all of which are nearly identical in structure. You want to pull the data from each table, but need to be able to distinguish which row came from which table. Standard old extractor patterns won’t do the job–they’ll match every row in every table, which destroys the link between each row and its corresponding table.
Fortunately, there are a couple of ways of handling such a scenario, which I’ve just outlined in this FAQ. Not too complicated, but a bit more involved than just using a standard extractor pattern.
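To illustrate the general idea outside of screen-scraper (this is not necessarily either of the approaches in the FAQ), here is a small Python sketch that first matches each table as a whole, then matches rows only within that table’s text, so every row stays tied to the table it came from. The pattern and function names are my own, and the regular expressions assume simple, non-nested tables.

```python
import re

# Outer pattern: one match per table. Inner patterns run only inside that match.
TABLE_PATTERN = re.compile(r"<table\b[^>]*>(.*?)</table>", re.DOTALL | re.IGNORECASE)
ROW_PATTERN = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_PATTERN = re.compile(r"<td\b[^>]*>(.*?)</td>", re.DOTALL | re.IGNORECASE)


def rows_by_table(html):
    """Return a list of (table_index, [cell, ...]) tuples, one per row."""
    results = []
    for table_index, table_match in enumerate(TABLE_PATTERN.finditer(html)):
        table_body = table_match.group(1)
        for row_match in ROW_PATTERN.finditer(table_body):
            cells = [cell.strip() for cell in CELL_PATTERN.findall(row_match.group(1))]
            results.append((table_index, cells))
    return results
```

Because the row pattern is applied per table rather than to the whole page, the table index travels with each row instead of being lost.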