03.14.11

Screen-Scraping for iPhone, Android, BlackBerry, and Most Any Other Mobile Device

Posted in Miscellaneous at 11:21 am by Todd Wilson

The Mobile Problem

The proliferation of mobile devices has created a problem.  Most web sites these days are designed to be viewed on desktop computers with high-resolution monitors and via web browsers that allow for sophisticated interactivity.  Anyone who’s tried to view such sites on mobile devices with small screens can attest to a cramped feeling.  Even the very best mobile web browsers leave you wanting more space.  The advent of mobile apps has helped some in this respect.  Many content providers simply create customized interfaces via apps to make their data usable.  Apps are great, but there still exists a significant portion of information on the Web that isn’t easily accessible on mobile devices.  This is where screen-scraping can often fill the gap.

Ideally content providers, like travel and news web sites, offer either an app or a mobile-friendly version of their web site.  There are a variety of reasons why this may not happen, though, so screen-scraping may be used by third parties to provide alternate interfaces.

The approach you’d take to screen-scrape for mobile devices doesn’t differ too much from any other kind of screen-scraping.  I’ll present a couple of scenarios that will likely be similar to many sites you’d want to scrape.

Scraping Real Estate Data

There are a lot of sites out there that list information related to real estate.  This includes commercial sites like Realtor.com and Zillow, but there are also a staggering number of government and county web sites that contain invaluable real estate data.  If you’re a realtor or home appraiser, it might be helpful to have information related to a specific property while you’re out and about.  To meet this need, a software development group might build an app that provides detailed real estate information on a mobile device.  Let’s use Arizona’s Maricopa County web site as an example.  The site allows you to search for properties via a number of methods, including address and street name.  If you’re a software developer, your app might take a street address as an input parameter, then search for a property at that location.  If you perform such a search on the Maricopa site you might end up with a property like this one.  That page contains all kinds of information about the property, but maybe you’re only interested in a handful of data points:

The parcel number, property description, and most recent valuation information may be the most important parts.  You also wouldn’t want to attempt to display too much of this data on a mobile device because of the limited screen real estate.  The nice thing about screen-scraping is that you can be very precise in what you extract.

It’s likely that this information won’t change too frequently.  As such, it may make sense to simply extract all records from the web site, deposit desired data points into a database, then scrape again periodically to ensure that the information is current.  Even though it could be a relatively large data set, it may be better to grab it all at once rather than hitting the site in piecemeal fashion as the data is needed.  This would likely mean less of a load on the target web site, and also better performance as you wouldn’t be relying on the web site to return the information to you in real time.  In such a case the best approach would be to get the information into a database, then, when the data is requested from the mobile device, grab it directly out of your database rather than relying on the Maricopa site.  The flow would end up looking something like this:

In other words, the scraping is not done in real time.  You extract the information in a batch process, then deposit it into a database.  Once it’s there, the mobile device can make a request containing a property address to your web server, which then retrieves the corresponding record from your database, then passes it down to the mobile device.  Using either an app or a mobile-friendly web page, you could then display the information on the device in a much more usable format.
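
As a rough illustration of the batch approach, here is a minimal Python sketch of the database side.  It is not tied to our software; the table layout, the field names, and the functions create_store, save_batch, and lookup are all assumptions made up for this example, and SQLite simply stands in for whatever database you would actually use.  The batch scrape writes its records with save_batch, and the web server answers each mobile request with lookup.

```python
import sqlite3


def create_store(conn):
    """Create the (hypothetical) table that holds the scraped data points."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS properties ("
        "address TEXT PRIMARY KEY, parcel_number TEXT, "
        "description TEXT, valuation INTEGER)"
    )


def normalize(address):
    """Upper-case and collapse whitespace so lookups tolerate sloppy input."""
    return " ".join(address.upper().split())


def save_batch(conn, records):
    """Store one batch of scraped records, replacing any stale rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO properties VALUES (?, ?, ?, ?)",
        [
            (
                normalize(r["address"]),
                r["parcel_number"],
                r["description"],
                r["valuation"],
            )
            for r in records
        ],
    )
    conn.commit()


def lookup(conn, address):
    """Answer a mobile request from the database, not from the live site."""
    row = conn.execute(
        "SELECT parcel_number, description, valuation "
        "FROM properties WHERE address = ?",
        (normalize(address),),
    ).fetchone()
    if row is None:
        return None
    return {"parcel_number": row[0], "description": row[1], "valuation": row[2]}
```

Because save_batch replaces rows that share an address, rerunning the batch scrape periodically refreshes the data in place, and a request for an address that was never scraped simply comes back empty.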

Scraping Travel Air Fares

Let’s suppose you’re interested in extracting air fares from a travel site like Southwest Airlines.  In contrast to the previous example, air fare information is very volatile, and, as such, couldn’t be scraped in a batch to be accessed later from a database.  That is, the information would need to be scraped in real time, as the user performs a search.  If you perform such a search on the Southwest Airlines site you’ll get a page that looks something like this:

It would be a relatively simple matter to program a screen-scraping application to iterate over each row of search results, extracting out information such as the departure times and the prices.  Because this data would need to be scraped in real time the architecture would look a bit different:
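
To make the row-by-row idea concrete, here is a small Python sketch using only the standard library.  The markup it expects (a "flight" class on each result row, with "depart" and "price" cells) is invented for illustration and is not the real structure of the Southwest Airlines page; the names FareRowParser and extract_fares are likewise hypothetical.

```python
from html.parser import HTMLParser


class FareRowParser(HTMLParser):
    """Collect departure time and price from each (assumed) result row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # dict for the row being read, if any
        self._field = None  # name of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        css = dict(attrs).get("class", "")
        if tag == "tr" and css == "flight":
            self._row = {}
        elif tag == "td" and self._row is not None and css in ("depart", "price"):
            self._field = css

    def handle_data(self, data):
        if self._field is not None:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_fares(html):
    """Return a list of {'depart': str, 'price': float}, one per result row."""
    parser = FareRowParser()
    parser.feed(html)
    parser.close()
    fares = []
    for row in parser.rows:
        if "depart" in row and "price" in row:
            amount = row["price"].lstrip("$").replace(",", "")
            fares.append({"depart": row["depart"], "price": float(amount)})
    return fares
```

Rows without the assumed class, such as a header row, are skipped, so only the actual search results end up in the list that gets sent back to the mobile device.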

In this case the mobile device sends its request to the web server, which in turn passes a request along to a screen-scraper application, which gets the data from the web site, then sends it back down the line.  We’ve added a little twist to this example, though: depending on how much traffic the service gets, it may be prudent to add multiple screen-scraping applications to help balance the load.  In the case of our own screen-scraping software, a given instance can handle multiple requests simultaneously, but the scraping load can be distributed even further across multiple screen-scraper instances, which may be running on different computers.
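
One simple way the web server might spread requests across several scraper instances is round-robin rotation with failover.  The Python sketch below is a generic illustration under that assumption, not a description of how our software does it: ScraperPool is a hypothetical class, the instance names are placeholders, and the send callable stands in for whatever mechanism actually forwards a request to an instance.

```python
import itertools
import threading


class ScraperPool:
    """Hand out scraper instances in rotation, skipping ones that fail."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one scraper instance is required")
        self._instances = list(instances)
        self._cycle = itertools.cycle(self._instances)
        self._lock = threading.Lock()  # the web server may be multi-threaded

    def next_instance(self):
        """Return the next instance in the rotation."""
        with self._lock:
            return next(self._cycle)

    def dispatch(self, request, send):
        """Try up to one attempt per pooled instance until one succeeds.

        `send(instance, request)` is supplied by the caller and should raise
        ConnectionError when the instance cannot be reached.
        """
        last_error = None
        for _ in range(len(self._instances)):
            instance = self.next_instance()
            try:
                return send(instance, request)
            except ConnectionError as exc:
                last_error = exc
        raise RuntimeError("no scraper instance available") from last_error
```

A pool built as ScraperPool(["scraper-1", "scraper-2"]) would alternate between the two placeholder instances, and a request only fails outright once every instance in the pool has refused it.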

03.02.11

Version 5.0.47a of screen-scraper Released

Posted in Updates at 3:21 pm by Todd Wilson

A number of fixes in this one:

  • Fixed a minor memory leak in the workbench.
  • Fixed a bug related to highlighting data records.
  • Fixed a bug where the scrapeable file view wasn’t updating correctly in some cases.
  • The “Generate scrapeable files in…” menu will now scroll when it contains many items.
  • The term “sutil” will now appear in blue in the script editor.

02.16.11

Version 5.0.46a of screen-scraper Released

Posted in Updates at 6:10 pm by Todd Wilson

Several small fixes in this version:

  • Fixed a bug related to setting the originator edition when exporting.
  • The cursor now returns to normal after attempting to highlight data records for a pattern that doesn’t match.
  • Fixed a bug where data records were not highlighting in the last response the very first time.
  • Fixed an issue where scrollbars weren’t appearing in the proxy/scrapeable file compare window.
  • Now displaying an error message when applying invalid extractor patterns.

02.15.11

Version 5.0.45a of screen-scraper Released

Posted in Updates at 11:18 am by Todd Wilson

Just a couple of little changes in this one:

  • No longer truncating HTML in the “Last Response” tab.
  • Minor bug fix to the DataManager.

02.09.11

Version 5.0.44a of screen-scraper Released

Posted in Updates at 4:42 pm by Todd Wilson

A few more little fixes:

  • The position of the divider bar on the split pane for proxy sessions is now retained.
  • Numeric columns in tables are now rendered using the default font.
  • Fixed a minor bug related to editing extractor pattern tokens.

02.08.11

Version 5.0.43a of screen-scraper Released

Posted in Updates at 10:35 am by Todd Wilson

A few fixes in this one:

  • Fixed a bug where the paste sub-extractor pattern was becoming enabled after a sub-extractor pattern had been deleted.
  • Fixed a bug where data record highlighting wouldn’t work correctly with very large HTML pages.
  • Fixed a bug where parameters sent in a multi-part request were causing invalid responses.

02.03.11

Version 5.0.42a of screen-scraper Released

Posted in Updates at 11:06 am by Todd Wilson

Quite a few little fixes and enhancements in this one:

  • Fixed a bug related to the data set list view not displaying correctly.
  • Fixed an issue where the anonymous proxy pool would not automatically repopulate when proxies were terminated automatically.
  • Fixed an issue in Linux where the extractor pattern panel was a bit too large.
  • Fixed an issue in Linux where the scraping session log panel was a bit too large.
  • Altered how character sets are handled in terms of how specifically set character sets override more global settings.
  • Long parameter values can now be edited in a separate text box.
  • Fixed an issue with extractor pattern token tooltips.
  • Fixed an issue with sub-extractor panels not sequencing after deletion.
  • screen-scraper will now display an error message when an invalid regular expression is entered for an extractor pattern token.
  • Fixed an issue with resizing the proxy transaction compare window.
As always, see the Alpha Log for the full history on changes.

02.01.11

Version 5.0.41a of screen-scraper Released

Posted in Updates at 11:54 am by Todd Wilson

This one just contains a minor bug fix related to the Java keystore functionality we added earlier.

01.26.11

Version 5.0.40a of screen-scraper Released

Posted in Updates at 1:24 pm by Todd Wilson

This one takes care of a couple of bugs that slipped through in the last version:

  • Restored the horizontal scroll bar in the last response tab.
  • Fixed an error that caused screen-scraper to disallow testing extractor patterns.

01.25.11

Version 5.0.39a of screen-scraper Released

Posted in Updates at 11:17 am by Todd Wilson

This one contains several fixes and enhancements:

  • Fixed a bug related to hitting the “Enter” key in the find dialog box.
  • You can now wrap text in the last request and last response panels.
  • Rearranged elements on the last response panel so that overlapping shouldn’t occur.
  • The delay on the script auto-complete box can now be set via the “AutoCompleteDelay” property in the “screen-scraper.properties” file.
  • Rearranged elements in the proxy “Progress panel” so that they don’t overlap.
  • Now dismissing the splash screen before the start page loads.
  • The name text box is now highlighted when proxy sessions, scraping sessions, and scripts are created.
  • Adjusted a few visual elements related to proxy sessions so that they resize correctly.
  • Now filtering out “sitecheck” requests made by Opera.
  • Table columns in the “HTTP Transactions” table are now being sized correctly even when the table is empty.
  • Fixed a bug where less-than symbols weren’t always showing up in the tool-tip for extractor pattern tokens.
As always, full history and details can be found in the Alpha Change Log.  Also, for anyone keeping track, we’re getting very close to releasing another public version of screen-scraper (we’ll probably give it a version number of 5.5).  If you’re running the alpha versions we’d be grateful for any bug reports.  We’ll obviously want to work out any kinks before we release the next stable version.
