Scraping AMF Sites
Most of the time when you extract information from web sites you’ll deal with HTML, which is generally pretty straightforward. Occasionally, though, content will be delivered via something like a Java applet or Flash movie. Just recently I completed a project that involved extracting data from a Flash movie, where the data was delivered from the server via Adobe’s Action Message Format (AMF). I thought I’d share a bit about my experience here, which will hopefully be useful to others, as well as to me the next time I have to do this 🙂
The main tool you’ll deal with when scraping AMF-based data is Adobe’s Java AMF Client. It handles most of the heavy lifting for you, though you’ll still need to do a fair amount of coding. The other tool that is indispensable is Charles proxy, which has a built-in AMF parser. Without it you’ll be flying blind.
The basic approach you’ll want to take is to proxy the site via Charles with your web browser, pick out the AMF requests that seem relevant, then replicate those in code. In my case I also had to download PDF files (standard HTTP), so I actually had to run it all in screen-scraper, combining normal screen-scraper stuff with the Java AMF Client stuff. There was also a login that had to be done outside of AMF. Anyway, just be aware that you may have to combine both approaches in your own project.
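If the login happens over plain HTTP and the AMF calls need to ride on the same session, one possible way to tie the two together is to carry the session cookie across by hand. This is only a sketch in plain Java, not something taken from my project: the class name `CookieHeader` and the sample cookie values are made up for illustration, and your site may track sessions differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns Set-Cookie response values into one Cookie request header value. */
final class CookieHeader {

    static String fromSetCookie(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Only the leading name=value pair is sent back to the server;
            // attributes such as Path or HttpOnly follow the first ';'.
            String pair = value.split(";", 2)[0].trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Sample values, invented for this example.
        List<String> recorded = Arrays.asList(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "lang=en");
        System.out.println(fromSetCookie(recorded));
    }
}
```

The resulting string could then be handed to the AMF connection with `addHttpRequestHeader( "Cookie", ... )`, the same call used for the other headers further down.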
I’m going to be providing some example code below in Interpreted Java (which is just BeanShell) as a screen-scraper script. You’ll need to do a bit of modification if you want to run this as straight Java.
Digging into the details, here’s how my code looks that sets up the initial AMF stuff:
import java.net.InetSocketAddress;
import java.net.Proxy;
import flex.messaging.io.amf.client.AMFConnection;
import flex.messaging.io.amf.client.exceptions.ClientStatusException;
// Create the AMF connection.
AMFConnection amfConnection = new AMFConnection();
// Used for debugging: uncomment to route the requests through Charles.
//Proxy proxy = new Proxy( Proxy.Type.HTTP, new InetSocketAddress( "localhost", 8888 ) );
//amfConnection.setProxy( proxy );
// Connect to the remote url.
url = "http://www.myamfsite.com/messagebroker/amf";
try
{
    amfConnection.connect( url );
}
catch( ClientStatusException cse )
{
    session.logError( cse );
}
// Set a few headers we'll want throughout the session.
amfConnection.addHttpRequestHeader( "Content-type", "application/x-amf" );
amfConnection.addHttpRequestHeader( "Referer", "http://www.myamfsite.com/media/MyMovie.swf" );
Here we’re setting up an AMF connection to a server whose AMF end point is found at http://www.myamfsite.com/messagebroker/amf. The commented-out proxy code allows us to send it all through Charles; that way we can compare the requests our code produces with those we record when browsing the web site via our web browser. Kind of an apples-to-apples comparison that helps to root out bugs. If your code doesn’t seem to have the desired effect, compare what’s happening via Charles with the requests from your browser. Ideally they should match as closely as possible. I also found that I had to add the two request headers that you’ll find at the end. The referer may or may not be necessary, but it’s likely that the content-type header is, since the Flash server would normally be expecting requests from a Flash movie, which would probably include that header by default.
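When the requests get long, eyeballing two Charles sessions side by side gets tedious. Here is a small sketch of how the header comparison could be automated in plain Java. The class name `HeaderDiff` is my own invention, and it assumes you've already copied the headers out of Charles into two maps by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: reports headers that differ between the browser's request and ours. */
final class HeaderDiff {

    /** Returns one entry per header that is missing or different. Names are compared case-insensitively. */
    static Map<String, String> diff(Map<String, String> browser, Map<String, String> code) {
        Map<String, String> b = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        b.putAll(browser);
        Map<String, String> c = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        c.putAll(code);

        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : b.entrySet()) {
            String ours = c.get(e.getKey());
            if (ours == null) {
                out.put(e.getKey(), "missing from code (browser sent: " + e.getValue() + ")");
            } else if (!ours.equals(e.getValue())) {
                out.put(e.getKey(), "browser=" + e.getValue() + " code=" + ours);
            }
        }
        for (String name : c.keySet()) {
            if (!b.containsKey(name)) {
                out.put(name, "only sent by code");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> browser = new LinkedHashMap<>();
        browser.put("Content-Type", "application/x-amf");
        browser.put("Referer", "http://www.myamfsite.com/media/MyMovie.swf");

        Map<String, String> code = new LinkedHashMap<>();
        code.put("Content-type", "application/x-amf");

        // Prints one line for each header that still needs attention.
        diff(browser, code).forEach((name, problem) -> System.out.println(name + ": " + problem));
    }
}
```

In the sample above only the Referer is reported, since the two Content-type spellings differ only in case.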
Once you’ve done the initialization you can start adding AMF requests to get the data you’re after. Again, you’ll want to do this by recording the requests from your browser in Charles, then translate those into code. Here’s a screen-shot of a recorded AMF request from Charles:
And here’s how I translated the request into code:
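import flex.messaging.messages.CommandMessage;
import flex.messaging.util.UUIDUtils;
// Charles shows the operation as 5, which is CLIENT_PING_OPERATION.
CommandMessage message1 = new CommandMessage( CommandMessage.CLIENT_PING_OPERATION );
// The message itself is the single parameter handed to call().
Object[] params1 = new Object[] { message1 };
HashMap headers1 = new HashMap();
message1.setHeader( "DSId", "nil" );
message1.setMessageId( UUIDUtils.createUUID() );
Object result1 = amfConnection.call( "null", params1 );
session.log( "Result 1: " + result1 );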
Based on the request recorded by Charles, it’s obvious that this should be a CommandMessage. The PING part of it was a bit trickier. This is the “operation” portion of the request, which you’ll notice is recorded by Charles only as “5”. This is where I had to do a bit of sleuthing through the Java AMF Client source code (which is fortunately open source and freely downloadable). If you’ve downloaded that source code you’ll find the CommandMessage class here in the bundle: modules/core/src/flex/messaging/messages/CommandMessage.java. It defines the operation constants, and 5 turns out to be CLIENT_PING_OPERATION. Notice also in the request how I set the header “DSId” to be “nil”, which is also evident in what Charles recorded. Again, we’re trying to get our code to match as closely as possible what was recorded by our web browser. I gave the request a unique ID, then asked the connection to make the call.
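To save a trip through the source next time, here is a little lookup sketch for turning the number Charles shows into a constant name. The class `CommandOperations` is hypothetical, and the values are a partial list transcribed from my reading of CommandMessage.java, so treat them as an assumption and check them against your own copy of the source.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup from the numeric operation shown in Charles to the CommandMessage constant name. */
final class CommandOperations {

    private static final Map<Integer, String> NAMES = new HashMap<>();

    static {
        // Partial list; verify against CommandMessage.java in your copy of the bundle.
        NAMES.put(0, "SUBSCRIBE_OPERATION");
        NAMES.put(1, "UNSUBSCRIBE_OPERATION");
        NAMES.put(2, "POLL_OPERATION");
        NAMES.put(4, "CLIENT_SYNC_OPERATION");
        NAMES.put(5, "CLIENT_PING_OPERATION");
        NAMES.put(8, "LOGIN_OPERATION");
        NAMES.put(9, "LOGOUT_OPERATION");
        NAMES.put(12, "DISCONNECT_OPERATION");
    }

    /** Returns the constant name, or a marker string when the code is not in the table. */
    static String nameOf(int code) {
        return NAMES.getOrDefault(code, "UNKNOWN (" + code + ")");
    }

    public static void main(String[] args) {
        System.out.println(nameOf(5));
    }
}
```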
The next request I needed was a bit different, but not too difficult to recreate from what Charles recorded:
I’ve blurred out the username I used. Here’s the corresponding code:
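import flex.messaging.messages.RemotingMessage;
import flex.messaging.util.UUIDUtils;
// Authenticate the current user.
RemotingMessage message2 = new RemotingMessage();
message2.setOperation( "getUserByUserName" );
// The message itself is the single parameter handed to call().
Object[] params2 = new Object[] { message2 };
// The body carries the operation's arguments; "USERNAME" stands in for the username recorded in Charles.
String[] body2 = new String[] { "USERNAME" };
message2.setBody( body2 );
message2.setDestination( "XYZ" );
message2.setMessageId( UUIDUtils.createUUID() );
Object result2 = amfConnection.call( "null", params2 );
session.log( "Result 2: " + result2 );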
Again, you can hopefully see how the pieces in the code correlate to what Charles recorded.
From this point it was simply a matter of adding requests as needed, along with a fair amount of trial and error to ensure that I was matching as closely as possible the original AMF requests. The only item that tripped me up for a while that’s probably worth mentioning was when Charles recorded the body portion of the request as containing simply an “Object”. When I did the same in code the server didn’t like it, and it took me a bit before I realized what it actually wanted was an “ASObject”. So the code I used to create the body looks like this:
import flex.messaging.io.amf.ASObject;
Object[] body3 = new Object[] { new ASObject() };
A few last items that might be helpful:
- The Java AMF Client download contains quite a few dependency files. You’ll have to figure out exactly which ones of those you truly need. In my case, in using this within screen-scraper, I ended up only needing two of the jars from the bundle: flex-messaging-common.jar and flex-messaging-core.jar.
- As it stands the Java AMF Client can’t handle HTTPS, nor can it handle HTTPS sites that use an invalid security certificate. I ended up modifying the source for the AMFConnection class in order to add this functionality (in the bundle that class is found here: modules/core/src/flex/messaging/io/amf/client/AMFConnection.java). You can download a zip file here that contains that modified source file as well as a compiled version of the flex-messaging-core.jar file, which contains that modified class. If you end up modifying that class further in the bundle you can compile it with a simple “ant core” from the command line. You need not compile the whole thing.
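For anyone curious what accepting an invalid certificate generally involves, here is a sketch of the usual plain-Java technique. To be clear, this is not the modified AMFConnection from the zip file. It's a generic illustration, and the class and method names (`LenientHttps`, `lenientContext`, `relax`) are made up for it. It also turns off certificate checking entirely, so it's only appropriate when you know exactly which site you're talking to.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

/** Hypothetical helper that disables certificate and hostname checks for one connection. */
final class LenientHttps {

    /** Accepts every certificate chain without inspecting it. */
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
        @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
    };

    /** Accepts any hostname, even one that doesn't match the certificate. */
    static final HostnameVerifier ANY_HOST = (hostname, sslSession) -> true;

    /** Builds an SSLContext whose only trust manager is TRUST_ALL. */
    static SSLContext lenientContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { TRUST_ALL }, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build lenient SSL context", e);
        }
    }

    /** Applies the lenient settings to a single connection rather than JVM-wide. */
    static void relax(HttpsURLConnection connection) {
        connection.setSSLSocketFactory(lenientContext().getSocketFactory());
        connection.setHostnameVerifier(ANY_HOST);
    }

    public static void main(String[] args) {
        SSLContext context = lenientContext();
        System.out.println("Built context for protocol: " + context.getProtocol());
    }
}
```

Applying the settings per connection, rather than through the JVM-wide defaults, keeps any other HTTPS traffic in the same process properly validated.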