We’ve just added several new scraping sessions that demonstrate extracting data from sites in various industries. If you go to our home page and click on one of the buttons corresponding to an industry, you’ll be taken to a page where you can download the scraping session. The e-commerce section also has a video to walk you through the process, and we’ll be adding videos to the others shortly.
This is our biggest sale in quite a while. Until December 31, 2012 take 40% off Professional Edition licenses and 60% off Enterprise Edition licenses. Click here to take advantage.
I made a boo-boo. To make a long story short: if you downloaded an installer from our web site between May 3 and May 5, you may actually be running version 5.0 of screen-scraper even though it says 5.5. This was the result of an oversight on my part, and my apologies to those affected. Fortunately for you hapless victims of my carelessness, the solution is simple:
- Ensure screen-scraper is not currently running.
- Edit the “screen-scraper.properties” file in your favorite text editor. This properties file is found in the “resource/conf” folder of the directory where you have screen-scraper installed.
- Inside that properties file you’ll find a “Version” property, which should say 5.5. Change it to 5.0, then save the file. This will cause screen-scraper to think that it’s at 5.0, and that it needs to upgrade.
- Launch the screen-scraper workbench. You should get a message indicating that version 5.5 of screen-scraper is available. Allow screen-scraper to download and install the update.
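If you’re comfortable with a script, the manual edit above can be sketched in a few lines of Python. This assumes the property is stored as a plain `Version=5.5` line, as is typical for Java properties files; verify against your own file before running:

```python
from pathlib import Path

def force_update(props_path: str) -> None:
    """Downgrade the reported version so screen-scraper re-runs its updater."""
    path = Path(props_path)
    text = path.read_text()
    # Assumes the property appears as a plain "Version=5.5" line,
    # as is typical for Java properties files.
    path.write_text(text.replace("Version=5.5", "Version=5.0"))

# Usage (with screen-scraper shut down first):
# force_update("resource/conf/screen-scraper.properties")
```

After running it, launch the workbench as described above and accept the update when prompted.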
That’s it! Hopefully it won’t cause anyone undue angst.
Those of you who have been updating via the normal update process should be unaffected by this snafu.
The Mobile Problem
The proliferation of mobile devices has created a problem. Most web sites these days are designed to be viewed on desktop computers with high-resolution monitors and via web browsers that allow for sophisticated interactivity. Anyone who’s tried to view such sites on mobile devices with small screens can attest to a cramped feeling. Even the very best mobile web browsers leave you wanting more space. The advent of mobile apps has helped some in this respect. Many content providers simply create customized interfaces via apps to make their data usable. Apps are great, but there still exists a significant portion of information on the Web that isn’t easily accessible on mobile devices. This is where screen-scraping can often fill the gap.
Ideally content providers, like travel and news web sites, offer either an app or a mobile-friendly version of their web site. There are a variety of reasons why this may not happen, though, so screen-scraping may be used by third parties to provide alternate interfaces.
The approach you’d take to screen-scrape for mobile devices doesn’t differ too much from any other kind of screen-scraping. I’ll present a couple of scenarios that will likely be similar to many sites you’d want to scrape.
Scraping Real Estate Data
There are a lot of sites out there that list information related to real estate. This includes commercial sites like Realtor.com and Zillow, but there are also a staggering number of government and county web sites that contain invaluable real estate data. If you’re a realtor or home appraiser, it might be helpful to have information related to a specific property while you’re out and about. To meet this need, a software development group might build an app that provides detailed real estate information on a mobile device. Let’s use Arizona’s Maricopa County web site as an example. The site allows you to search for properties via a number of methods, including address and street name. If you’re a software developer, your app might take a street address as an input parameter, then search for a property at that location. If you perform such a search on the Maricopa site you might end up with a property like this one. That page contains all kinds of information about the property, but maybe you’re only interested in a handful of data points:
The parcel number, property description, and most recent valuation information may be the most important parts. You also wouldn’t want to attempt to display too much of this data on a mobile device because of the limited screen real estate. The nice thing about screen-scraping is that you can be very precise in what you extract.
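As a sketch of that precision, here’s how one might pull just those fields out of a detail page with Python. The HTML snippet and field labels below are hypothetical; the real Maricopa markup would need its own extraction patterns:

```python
import re

# Hypothetical fragment of a parcel detail page; the actual site's
# markup will differ, so these patterns are purely illustrative.
html = """
<tr><td>Parcel Number:</td><td>123-45-678</td></tr>
<tr><td>Full Cash Value:</td><td>$250,000</td></tr>
"""

def extract_field(label: str, page: str) -> str:
    # Capture the table cell immediately following the labeled cell.
    match = re.search(re.escape(label) + r":</td><td>([^<]+)</td>", page)
    return match.group(1) if match else ""

parcel = extract_field("Parcel Number", html)
value = extract_field("Full Cash Value", html)
```

The point is that each extraction pattern targets exactly one data point, so everything else on the page is simply ignored.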
It’s likely that this information won’t change too frequently. As such, it may make sense to simply extract all records from the web site, deposit desired data points into a database, then scrape again periodically to ensure that the information is current. Even though it could be a relatively large data set, it may be better to grab it all at once rather than hitting the site in piecemeal fashion as the data is needed. This would likely mean less of a load on the target web site, and also better performance as you wouldn’t be relying on the web site to return the information to you in real time. In such a case the best approach would be to get the information into a database, then, when the data is requested from the mobile device, grab it directly out of your database rather than relying on the Maricopa site. The flow would end up looking something like this:
In other words, the scraping is not done in real time. You extract the information in a batch process, then deposit it into a database. Once it’s there, the mobile device can make a request containing a property address to your web server, which then retrieves the corresponding record from your database, then passes it down to the mobile device. Using either an app or a mobile-friendly web page, you could then display the information on the device in a much more usable format.
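The batch flow above can be sketched with a small database layer. The table and column names here are illustrative assumptions, not the site’s actual schema:

```python
import sqlite3

# Sketch of the batch flow: scraped records go into a local database,
# and later lookups hit the database instead of the Maricopa site.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parcels (address TEXT PRIMARY KEY, parcel TEXT, valuation TEXT)"
)

def store_batch(rows):
    # Step 1: the scraper deposits extracted records in a batch.
    conn.executemany("INSERT INTO parcels VALUES (?, ?, ?)", rows)

def lookup(address):
    # Step 2: a request from the mobile device is answered straight
    # from the database, with no real-time scraping involved.
    return conn.execute(
        "SELECT parcel, valuation FROM parcels WHERE address = ?", (address,)
    ).fetchone()

store_batch([("301 W Jefferson St", "112-23-045", "$310,000")])
```

A periodic re-run of `store_batch` with freshly scraped data keeps the records current.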
Scraping Travel Air Fares
Let’s suppose you’re interested in extracting air fares from a travel site like Southwest Airlines. In contrast to the previous example, air fare information is very volatile and, as such, couldn’t be scraped in a batch to be accessed later from a database. That is, the information would need to be scraped in real time, as the user performs a search. If you perform such a search on the Southwest Airlines site you’ll get a page that looks something like this:
It would be a relatively simple matter to program a screen-scraping application to iterate over each row of search results, extracting out information such as the departure times and the prices. Because this data would need to be scraped in real time the architecture would look a bit different:
In this case the mobile device sends its request to the web server, which in turn passes a request along to a screen-scraper application, which gets the data from the web site, then sends it back down the line. We’ve added a little twist to this example, though: depending on how much traffic the service gets, it may be prudent to add multiple screen-scraping applications to help balance the load. In the case of our own screen-scraper software, a given instance can handle multiple requests simultaneously, but the scraping load can be distributed even further across multiple screen-scraper instances, which may be running on different computers.
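One simple way the web server could spread that load is round-robin dispatch across the scraper pool. The instance addresses and the request shape below are hypothetical placeholders:

```python
import itertools

# Hypothetical pool of screen-scraper instances; in practice these
# would be host:port addresses of machines running the scraper.
scraper_pool = ["scraper-a:8778", "scraper-b:8778", "scraper-c:8778"]
next_scraper = itertools.cycle(scraper_pool)

def dispatch(request):
    # Pick the next instance round-robin; a real deployment would also
    # track per-instance load and retry on failure.
    instance = next(next_scraper)
    return instance, request

assignments = [dispatch({"from": "SLC", "to": "LAX"})[0] for _ in range(4)]
```

After three requests the rotation wraps back to the first instance, so traffic stays evenly spread.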
We recently added experimental support to screen-scraper for client/PKI certificates. Some web sites require that you supply a client certificate, which you would have previously been issued, in order to access them. I call this new feature “experimental” because we’ve only been able to perform limited testing with it. So far, though, it does seem to be working as it should.
In order to account for sites that use client/PKI certificates, we’ve added a feature to screen-scraper that allows it to use JKS files. These are Java keystore files that encapsulate certificates and private keys. The trick is to turn your existing client certificate file(s) into a .jks file. We’ve currently only tested the feature using .pfx files, which we converted into a .jks file via the method described here:
In the current alpha version of screen-scraper, if you look under the “Advanced” tab for a scraping session you’ll see a box where you can enter the location of your .jks file, and a box that will take the password you used when generating the .jks file. There are also corresponding boxes under the “Advanced” tab for a proxy session.
If you’d like to use this new feature you’ll likely need to do some of your own research on how to turn your client certificate file(s) into a .jks file. Here are a few sites that may help you in this:
- http://www.google.com/ 🙂
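As one possible shortcut, the JDK’s own keytool (version 6 or later) can perform the .pfx-to-.jks conversion directly. The file names and passwords below are placeholders you’d replace with your own:

```shell
# Convert a PKCS12 client certificate bundle into a Java keystore.
# "client.pfx", "client.jks", and "changeit" are placeholder values.
keytool -importkeystore \
  -srckeystore client.pfx -srcstoretype PKCS12 -srcstorepass changeit \
  -destkeystore client.jks -deststoretype JKS -deststorepass changeit
```

The resulting .jks file and its password are what you’d enter under the “Advanced” tab.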
You might also find these tools to be helpful:
A few months ago we announced a drop in the price of our anonymization service for those using the latest alpha versions of screen-scraper. Unfortunately, things didn’t work out with this quite as we had planned, so, for the time being, we’re returning to our previous price of 25 cents per server per hour.
To give a bit more detail: we were able to do this because Amazon announced the availability of even smaller virtual machine instances that we could use as proxies. As we’ve used these smaller instances over the past few months, however, we’ve found them to be so unreliable that they’re essentially unusable. Our hope is that Amazon will improve things on their end; once they do, we’ll start using them again and drop the price back down. Cross your fingers and stay posted for when that day comes.
We’ve actually had a 30-day money-back guarantee since almost the beginning, but recently decided to highlight it a bit more. Any time you incorporate a new library or application into a software development project, you’re incurring a certain amount of risk. The new application may appear to do just what you want it to, but later on you realize that it falls short. We’ve put many years of development into making screen-scraper the best data extraction tool on the market, but we also acknowledge that it may not work out for everyone. Because of this we want to help reduce some of the risk that people take when trying us out. We already offer the Professional and Enterprise Editions as fully-functional 30-day trials, but on top of that, if things still don’t work out, you always have the option of simply asking for your money back.
As the 30-day money-back guarantee page on our site mentions, we also offer several ways for people to get help. There’s no question that screen-scraper has a bit of a learning curve, and we try to provide as much help as we can so people can become proficient with it quickly.
On September 5th, 2010 we received a call from a reporter from the Wall Street Journal, Steve Stecklow (2007 Pulitzer Prize winner). He was calling to speak with someone at our company for a story he was doing related to our industry. He and I talked for about 40 minutes, during which he asked a lot of interesting questions about our company and about the industry in general. I explained to him how we are one of three screen-scraping companies in Utah Valley. He then asked if he could fly in and meet with us in person.
About 10 days later Steve pulled into town in a shiny new rent-a-car. Jason Bellows (VP of Operations), Todd Wilson (Owner, President), and I took Steve to a favorite Mexican joint, Diego’s, which is run by a fellow in our building. There he continued his interview and asked us about the various companies we’ve done work for. He was looking for something juicy for his story but did so in a very polite and forthright manner. We told him about work we’ve done for Microsoft, Oracle, Progressive Insurance and others. Some information we were not able to share due to non-disclosure agreements we’ve entered into with our clients.
A few days later they sent Chris Detrick, a freelance photographer who works for the Salt Lake Tribune, to take pictures around the office. Over the next few weeks Steve and I stayed in contact as he had various follow-up questions.
In the end, Todd and I were quoted briefly and Todd got to expose a part of screen-scraper’s source code for all the world to see.
Here I go tooting our horn again: Oracle has just posted a podcast on integrating screen-scraper with their Oracle Secure Enterprise Search product. The page that links to the podcast is here, and the MP3 file of the same can be found here. It might be worth a listen, if only to get a better sense of some of the possibilities screen-scraper allows.
Not to toot our own horn (okay, we will), but our very own screen-scraper software is helping to power the search feature for the currently-running Oracle OpenWorld conference. From the OpenWorld home page, try a search in the box found in the upper-right corner (try something like “SES”). The search results you see were scraped from their content catalog, keynotes, and blog postings, then aggregated and enriched with information like spatial data (e.g., for demos you can click the location to see on a map exactly where it occurs). The excellent search interface is provided by Oracle Secure Enterprise Search, with which screen-scraper has been integrated.
This is actually a great example of the power of screen-scraping. Take information from various web sources, dump it all into a single database, then correlate and enrich the information in a searchable interface. It’s a powerful thing to take disparate pieces and sum them into something that’s much greater than the individual parts.