Version 5.5.33a of screen-scraper Released

Posted in Updates on 01.04.12 by Todd Wilson

The holiday enhancements have spilled over into 2012:

  • Added “Always at the end” option to force scripts to run at the end of a scraping session, even if it gets stopped prematurely.
  • The prompt-to-save dialog box now only shows on exit when a change has actually been made.
  • Added a keyboard shortcut to the extractor pattern text box such that when text is highlighted and the Control/Command-T key combination is pressed an extractor pattern token will be generated.  This is the equivalent of using the corresponding menu item when the right-click pop-up menu is invoked.
  • Improved error reporting.
  • Added local script variables to the breakpoint frame.
  • When in workbench mode screen-scraper will now breakpoint on a script error.

Version 5.5.32a of screen-scraper Released

Posted in Updates on 12.27.11 by Todd Wilson

Things have cooled down for us a bit over the holidays, so we’ve been able to carve out time for a number of bug fixes and feature enhancements.  Here’s the list:

  • Fixed a threading issue related to the REST interface.
  • Added classes and methods related to decoding images.
  • Fixed a bug related to use of the “Breakpoint” button with RunnableScrapingSessions.
  • Added getStatusMessage, setStatusMessage, and appendStatusMessage to the session object, all of which are synonymous with their corresponding “error” methods (e.g., getStatusMessage = getErrorMessage).
  • In the web UI changed the column “Error Message” to “Status Message”.
  • Added the following methods to the scrapeableFile object: resequenceHTTPParameter( String key, int sequence ), removeHTTPParameter( String key ), addGETHTTPParameter( String key, String value, int sequence ), addGETHTTPParameter( String key, String value ), addPOSTHTTPParameter( String key, String value, int sequence ), addPOSTHTTPParameter( String key, String value )
  • Made a DataManager fix where child rows weren’t getting inserted for duplicate parent rows.
  • Changed default user agent for newly-created scraping sessions to Internet Explorer 8.
  • Now saving in a separate thread so that the GUI won’t get locked up for large objects.

Scraping AMF Sites

Posted in Tips on 11.15.11 by Todd Wilson

Most of the time when extracting information from web sites you’ll deal with HTML, which is generally pretty straightforward.  Occasionally, though, content will be delivered via something like a Java applet or Flash movie.  Just recently I completed a project that dealt with extracting data from a Flash movie, where the data was delivered from the server via Adobe’s Action Message Format (AMF).  I thought I’d share a bit about my experience here, which will hopefully be useful to others, as well as to myself the next time I have to do this 🙂

The main tool you’ll deal with when scraping AMF-based data is Adobe’s Java AMF Client.  It handles most of the heavy lifting for you, though you’ll still need to do a fair amount of coding.  The other tool that is indispensable is Charles proxy, which has a built-in AMF parser.  Without it you’ll be flying blind.

The basic approach you’ll want to take is to proxy the site via Charles with your web browser, pick out the AMF requests that seem relevant, then replicate those in code.  In my case I also had to download PDF files (standard HTTP), so I actually had to run it all in screen-scraper, combining normal screen-scraper stuff with the Java AMF Client stuff.  There was also a login that had to be done outside of AMF.  Anyway, just be aware that you may have to combine both approaches in your own project.

I’m going to be providing some example code below in Interpreted Java (which is just BeanShell) as a screen-scraper script.  You’ll need to do a bit of modification if you want to run this as straight Java.

Digging into the details, here’s the code that sets up the initial AMF connection:

import flex.messaging.io.ArrayCollection;
import flex.messaging.messages.*;
import flex.messaging.io.amf.client.AMFConnection;
import flex.messaging.io.amf.client.exceptions.ClientStatusException;
import flex.messaging.io.amf.client.exceptions.ServerStatusException;
import flex.messaging.util.UUIDUtils;
import flex.messaging.io.amf.ASObject;

// Create the AMF connection.
AMFConnection amfConnection = new AMFConnection();

// Used for debugging...
//Proxy proxy = new Proxy( Proxy.Type.HTTP, new InetSocketAddress( "localhost", 8888 ) );
//amfConnection.setProxy( proxy );

// Connect to the remote url.
url = "http://www.myamfsite.com/messagebroker/amf";
try
{
    amfConnection.connect( url );
}
catch( ClientStatusException cse )
{
    // Without a connection there's nothing more to do, so log it and bail out.
    session.logError( cse );
    return;
}

// Set a few headers we'll want throughout the session.
amfConnection.addHttpRequestHeader( "Content-type", "application/x-amf" );
amfConnection.addHttpRequestHeader( "Referer", "http://www.myamfsite.com/media/MyMovie.swf" );

Here we’re setting up an AMF connection to a server whose AMF end point is found at http://www.myamfsite.com/messagebroker/amf.  The commented-out proxy code allows us to send it all through Charles; that way we can compare the requests our code produces with those we record when browsing the web site via our web browser.  Kind of an apples-to-apples comparison that helps to root out bugs.  If your code doesn’t seem to have the desired effect, compare what’s happening via Charles with the requests from your browser.  Ideally they should match as closely as possible.  I also found that I had to add the two request headers that you’ll find at the end.  The referer may or may not be necessary, but it’s likely that the content-type header is, since the Flash server would normally be expecting requests from a Flash movie, which would probably include that header by default.
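If you move the script to straight Java, the two commented-out proxy lines need the java.net classes imported. Here is a minimal sketch of just that piece. The class and method names (CharlesProxyExample, debuggingProxy) are made up for illustration, and the host and port are simply the ones from the commented-out lines above:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

class CharlesProxyExample {
    // Builds an HTTP proxy definition pointing at a debugging proxy such as Charles.
    // The helper name is hypothetical; host and port come from the caller.
    static Proxy debuggingProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    public static void main(String[] args) {
        Proxy proxy = debuggingProxy("localhost", 8888);
        System.out.println("Proxy type: " + proxy.type());
        // In the script above this object would be handed to the connection:
        // amfConnection.setProxy( proxy );
    }
}
```

Building the proxy in one place makes it easy to switch it off once your requests match what the browser sends.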

Once you’ve done the initialization you can start adding AMF requests to get the data you’re after.  Again, you’ll want to do this by recording the requests from your browser in Charles, then translate those into code.  Here’s a screen-shot of a recorded AMF request from Charles:

And here’s how I translated the request into code:

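// Build a ping command; Charles records this operation only as "5".
CommandMessage message1 = new CommandMessage( CommandMessage.CLIENT_PING_OPERATION );
Object[] params1 = new Object[]
{
    message1
};
// Match the "DSId" header and give the message a unique ID, as in the recorded request.
message1.setHeader( "DSId", "nil" );
message1.setMessageId( UUIDUtils.createUUID() );
// Send the call and log whatever the server returns.
Object result1 = amfConnection.call( "null", params1 );
session.log( "Result 1: " + result1 );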

Based on the request recorded by Charles, it’s obvious that this should be a CommandMessage.  The PING part of it was a bit trickier.  This is the “operation” portion of the request, which you’ll notice is recorded by Charles only as “5”.  This is where I had to do a bit of sleuthing through the Java AMF Client source code (which is fortunately open source and freely downloadable).  If you’ve downloaded that source code you’ll find the CommandMessage class here in the bundle: modules/core/src/flex/messaging/messages/CommandMessage.java.  Notice also in the request how I set the header “DSId” to be “nil”, which is also evident in what Charles recorded.  Again, we’re trying to get our code to match as closely as possible what was recorded by our web browser.  I gave the request a unique ID, then asked the connection to make the call.
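Reading the source works, but a few lines of reflection can also map a numeric code like “5” back to a constant name. The sketch below is not part of the Java AMF Client. ConstantLookup, namesForValue and the SampleOperations stand-in class are hypothetical names used so the example runs on its own. With the flex jars on the classpath you would presumably pass CommandMessage.class instead.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class ConstantLookup {
    // Hypothetical stand-in for a class full of operation codes.
    // Only the 5 -> CLIENT_PING_OPERATION pairing comes from the text above.
    static class SampleOperations {
        public static final int OTHER_OPERATION = 0;   // made-up entry
        public static final int CLIENT_PING_OPERATION = 5;
        public static final String UNRELATED = "x";    // non-int fields are skipped
    }

    // Returns the names of all public static final int fields equal to value.
    static List<String> namesForValue(Class<?> type, int value) {
        List<String> names = new ArrayList<>();
        for (Field field : type.getFields()) {
            int mods = field.getModifiers();
            boolean isIntConstant = Modifier.isStatic(mods)
                    && Modifier.isFinal(mods)
                    && field.getType() == int.class;
            if (!isIntConstant) {
                continue;
            }
            try {
                if (field.getInt(null) == value) {
                    names.add(field.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Could not read " + field.getName(), e);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [CLIENT_PING_OPERATION]
        System.out.println(namesForValue(SampleOperations.class, 5));
    }
}
```

If several constants share a value the list will contain all of them, so you would still need the Charles recording to decide which one fits.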

The next request I needed was a bit different, but not too difficult to recreate from what Charles recorded:

I’ve blurred out the username I used.  Here’s the corresponding code:

// Authenticate the current user.
RemotingMessage message2 = new RemotingMessage();
message2.setOperation( "getUserByUserName" );
Object[] params2 = new Object[]
{
    message2
};
String[] body2 = new String[]
{
    "myUserName"
};
message2.setBody( body2 );
message2.setDestination( "XYZ" );
message2.setMessageId( UUIDUtils.createUUID() );
Object result2 = amfConnection.call( "null", params2 );
session.log( "Result 2: " + result2 );

Again, you can hopefully see how the pieces in the code correlate to what Charles recorded.

From this point it was simply a matter of adding requests as needed, along with a fair amount of trial and error to ensure that I was matching as closely as possible the original AMF requests.  The only item that tripped me up for a while that’s probably worth mentioning was when Charles recorded the body portion of the request as containing simply an “Object”.  When I did the same in code the server didn’t like it, and it took me a bit before I realized what it actually wanted was an “ASObject”.  So the code I used to create the body looks like this:

Object[] body3 = new Object[]
{
    new ASObject()
};

A few last items that might be helpful:

  • The Java AMF Client download contains quite a few dependency files.  You’ll have to figure out exactly which ones of those you truly need.  In my case, in using this within screen-scraper, I ended up only needing two of the jars from the bundle: flex-messaging-common.jar and flex-messaging-core.jar.
  • As it stands the Java AMF Client can’t handle HTTPS, nor can it handle HTTPS sites that utilize an invalid secure certificate.  I ended up modifying the source for the AMFConnection class in order to add this functionality (in the bundle that class is found here: modules/core/src/flex/messaging/io/amf/client/AMFConnection.java).  You can download a zip file here that contains that modified source file as well as a compiled version of the flex-messaging-core.jar file, which contains that modified class.  If you end up modifying that class further in the bundle you can compile it with a simple “ant core” from the command line.  You need not compile the whole thing.
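For anyone curious what accepting an invalid certificate involves in plain Java, here is a rough sketch using only the standard javax.net.ssl classes. This is not the actual change in the modified AMFConnection class. The names LenientSsl, trustAllSocketFactory and anyHostname are made up for illustration, and how the pieces get wired into the AMF client is left open.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

class LenientSsl {
    // Socket factory that accepts ANY server certificate.
    // This switches off certificate validation, so keep it to scraping jobs
    // where you already know the site's certificate is broken.
    static SSLSocketFactory trustAllSocketFactory() {
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            }
        };
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, trustAll, new SecureRandom());
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not set up lenient SSL context", e);
        }
    }

    // Hostname verifier that accepts a certificate issued for a different host.
    static HostnameVerifier anyHostname() {
        return (hostname, sslSession) -> true;
    }

    public static void main(String[] args) {
        System.out.println("Factory created: " + (trustAllSocketFactory() != null));
    }
}
```

On a javax.net.ssl.HttpsURLConnection these would be applied with setSSLSocketFactory and setHostnameVerifier before the request is sent.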

Version 5.5.26a of screen-scraper Released

Posted in Updates on 11.08.11 by Todd Wilson

A few fixes in this release:

  • Fixed a bug that was causing the user-agent header to be duplicated.
  • Fixed a bug where a deleted recent script still shows in the script drop-down list.
  • Fixed a bug related to multi-exports.

Version 5.5.25a of screen-scraper Released

Posted in Updates on 10.25.11 by Todd Wilson

Just a few changes:

  • Deprecated caching and filtering data sets (can be re-enabled with EnableCachingAndFilteringDataSets property).
  • Now automatically swapping extractor pattern tokens for embedded variables in certain fields in the workbench (e.g., in the URL field ~@FOO@~ is changed to ~#FOO#~).
  • Added a “Find” button to the “Last Request” tab.

Version 5.5.23a of screen-scraper Released

Posted in Updates on 10.14.11 by Todd Wilson

Get ready, kiddies, this is a big one!  Found myself with some time on my hands, so I got some things done that have needed doing for a while.  Plus I added in a few little goodies that have been rolling around in my head.  Enjoy!

  • Now outputting message as a warning when extractor pattern times out.
  • Script pane no longer scrolls to the top when finding text fails.
  • The last error message will now always be retained in the Web UI.
  • Now notifying the user if a scrapeable file is generated from an HTTP transaction that contains a multi-part request, but no file parameters.
  • Changed icon to something friendlier on database backup pop-up.
  • Added session.setUserAgent.
  • Fixed an issue related to resolving relative URLs from extracted data.
  • Fixed an issue related to reordering columns in the workbench.
  • Fixed an issue related to truncated server responses.
  • Fixed the PHP driver to allow carriage returns and line feeds to be passed in the setVariable method.
  • Now initializing the last response view to the top of the page.
  • Now displaying recently accessed scripts first in the script instances drop-down list.
  • Enlarged the scraping session notes field a bit.
  • Added back and forward buttons to the workbench.

Version 5.5.17a of screen-scraper Released

Posted in Updates on 09.13.11 by Todd Wilson

Several fixes and enhancements in this one:

  • Fixed a bug where a null parameter was causing rendering problems.
  • Added ability to turn on and off automatic proxy cycling via setAutomaticProxyCycling.
  • Auto-saving can now be enabled by adding an AutoSaveTime=[Time in seconds] entry to the screen-scraper.properties file.
  • Filtered data sets now show up as filtered when using the “Test Pattern” button.
  • Added SetCharacterSet to .NET driver.

Version 5.5.15a of screen-scraper Released

Posted in Updates on 08.11.11 by Todd Wilson

Just a small fix in this one, but definitely a recommended upgrade for those using alpha versions.  This release fixes a bug introduced in the previous alpha related to closing the HTTP connection manager.

Version 5.5.14a of screen-scraper Released

Posted in Updates on 08.09.11 by Todd Wilson

This one contains a few important fixes.  For those using alpha versions I’d recommend upgrading.  Here’s what it contains:

  • Fixed an issue where a blank file HTTP parameter was being sent incorrectly.
  • Fixed an issue with the REST interface where the wrong scrapeable_session_id was being returned.
  • Fixed an issue where the HTTP connection manager was getting closed prematurely.

Version 5.5.12a of screen-scraper Released

Posted in Updates on 07.13.11 by Todd Wilson

Several minor bug fixes and updates in this one:

  • Fixed a race condition where a scraping session could potentially get started by two different threads.
  • DataManager: a few logging changes
  • DataManager: a modification of the order of database writes when foreign keys are manually set
  • DataManager: transactional support for rolling back writes
  • DataManager: a framework for making data assertions
  • Fixed an issue exporting large scripts that call session.executeScript.
