One’s first experience with a page full of dynamic content can be pretty confusing: generally you can request the HTML, but the data you’re after is missing from it.
Most of the time when extracting information from web sites you’ll be dealing with HTML, which is generally pretty straightforward to work with. Occasionally, though, content will be delivered via something like a Java applet or Flash movie. Just recently I completed a project that involved extracting data from a Flash movie, where the data was delivered from the server via Adobe’s Action Message Format (AMF). I thought I’d share a bit about my experience here, which will hopefully be useful to others, as well as to myself the next time I have to do this 🙂
The main tool you’ll deal with when scraping AMF-based data is Adobe’s Java AMF Client. It handles most of the heavy lifting for you, though you’ll still need to do a fair amount of coding. The other tool that is indispensable is Charles proxy, which has a built-in AMF parser. Without it you’ll be flying blind.
The basic approach you’ll want to take is to proxy the site via Charles with your web browser, pick out the AMF requests that seem relevant, then replicate those in code. In my case I also had to download PDF files (standard HTTP), so I actually had to run it all in screen-scraper, combining normal screen-scraper stuff with the Java AMF Client stuff. There was also a login that had to be done outside of AMF. Anyway, just be aware that you may have to combine both approaches in your own project.
I’m going to be providing some example code below in Interpreted Java (which is just BeanShell) as a screen-scraper script. You’ll need to do a bit of modification if you want to run this as straight Java.
Digging into the details, here’s how my code looks that sets up the initial AMF stuff:
// Set up the AMF connection. Uncomment the proxy lines to route
// requests through Charles for debugging.
//System.setProperty( "http.proxyHost", "localhost" );
//System.setProperty( "http.proxyPort", "8888" );
AMFConnection amfConnection = new AMFConnection();
amfConnection.connect( "http://www.myamfsite.com/messagebroker/amf" );

// Set a few headers we'll want throughout the session.
amfConnection.addHttpRequestHeader( "Content-type", "application/x-amf" );
amfConnection.addHttpRequestHeader( "Referer", "http://www.myamfsite.com/media/MyMovie.swf" );
Here we’re setting up an AMF connection to a server whose AMF end point is found at http://www.myamfsite.com/messagebroker/amf. The commented-out proxy code allows us to send it all through Charles; that way we can compare the requests our code produces with those we recorded when browsing the web site via our web browser: an apples-to-apples comparison that helps root out bugs. If your code doesn’t seem to have the desired effect, compare what’s happening via Charles with the requests from your browser; ideally they should match as closely as possible. I also found that I had to add the two request headers you’ll find at the end. The Referer may or may not be necessary, but it’s likely that the Content-type header is, since the Flash server would normally be expecting requests from a Flash movie, which would probably include that header by default.
Once you’ve done the initialization you can start adding AMF requests to get the data you’re after. Again, you’ll want to do this by recording the requests from your browser in Charles, then translate those into code. Here’s a screen-shot of a recorded AMF request from Charles:
And here’s how I translated the request into code:
// A "ping" command message, mirroring what Charles recorded.
CommandMessage message1 = new CommandMessage( CommandMessage.CLIENT_PING_OPERATION );
message1.setHeader( "DSId", "nil" );
message1.setMessageId( UUIDUtils.createUUID() );

// The arguments to the call consist of just the message itself.
Object[] params1 = new Object[]{ message1 };
Object result1 = amfConnection.call( "null", params1 );
session.log( "Result 1: " + result1 );
Based on the request recorded by Charles, it’s obvious that this should be a CommandMessage. The PING part of it was a bit trickier. This is the “operation” portion of the request, which you’ll notice is recorded by Charles only as “5”. This is where I had to do a bit of sleuthing through the Java AMF Client source code (which is fortunately open source and freely downloadable). If you’ve downloaded that source code you’ll find the CommandMessage class here in the bundle: modules/core/src/flex/messaging/messages/CommandMessage.java, where CLIENT_PING_OPERATION is defined as 5. Notice also in the request how I set the header “DSId” to be “nil”, which is also evident in what Charles recorded. Again, we’re trying to get our code to match as closely as possible what was recorded by our web browser. I gave the request a unique ID, then asked the connection to make the call.
The next request I needed was a bit different, but not too difficult to recreate from what Charles recorded:
I’ve blurred out the username I used. Here’s the corresponding code:
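It followed the same basic pattern as the ping, but as a RemotingMessage rather than a CommandMessage. A hypothetical sketch of it (the destination, operation, and username values here are placeholders for illustration, not the site’s real ones):

```
// Hypothetical reconstruction -- the destination, operation, and
// username parameter are placeholder values.
RemotingMessage message2 = new RemotingMessage();
message2.setDestination( "userService" );
message2.setOperation( "getUserInfo" );
message2.setBody( new Object[]{ "myUsername" } );
message2.setHeader( "DSId", "nil" );
message2.setMessageId( UUIDUtils.createUUID() );
Object result2 = amfConnection.call( "null", message2 );
session.log( "Result 2: " + result2 );
```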
Again, you can hopefully see how the pieces in the code correlate to what Charles recorded.
From this point it was simply a matter of adding requests as needed, along with a fair amount of trial and error to ensure that I was matching as closely as possible the original AMF requests. The only item that tripped me up for a while that’s probably worth mentioning was when Charles recorded the body portion of the request as containing simply an “Object”. When I did the same in code the server didn’t like it, and it took me a bit before I realized what it actually wanted was an “ASObject”. So the code I used to create the body looks like this:
ASObject body3 = new ASObject();
A few last items that might be helpful:
The Java AMF Client download contains quite a few dependency files, and you’ll have to figure out which of those you truly need. In my case, using it within screen-scraper, I ended up needing only two of the jars from the bundle: flex-messaging-common.jar and flex-messaging-core.jar.
As it stands the Java AMF Client can’t handle HTTPS, nor sites that utilize an invalid secure certificate. I ended up modifying the source for the AMFConnection class in order to add this functionality (in the bundle that class is found here: modules/core/src/flex/messaging/io/amf/client/AMFConnection.java). You can download a zip file here that contains that modified source file as well as a compiled version of the flex-messaging-core.jar file, which contains the modified class. If you end up modifying that class further you can compile it with a simple “ant core” from the command line; you need not compile the whole bundle.
Lately we find an increasing need to anonymize our scraping sessions. So, as necessity is the mother of invention, we have created and adopted a handful of different approaches to keep our scrapes up and running.
Keep in mind, the only way to block a web crawler is for a website’s server to refuse connections from an offending IP address.
This approach is used before any blocking has occurred. Ideally, a proactive approach would be the only technique needed.
Using screen-scraper’s Anonymization Service, set up your scraping session to spawn 3-5 proxy servers when it starts. Then create a script whose job is to shut down one proxy server and spawn a new one at a random interval (say, every 3-5 minutes).
It is also useful to switch up the User-Agent header at least each time you switch out a proxy. It can be even more effective to switch it up on every request.
Similarly, when possible, you can change your referrer to a random URL that is off of the target domain. This makes it appear as though a different user is entering the site from an external source (typically considered positive traffic).
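As a sketch of such a script in Interpreted Java (the header values are made-up examples, and applying them assumes screen-scraper’s scrapeableFile.addHTTPHeader method):

```java
import java.util.Random;

// Made-up example pools; in practice you'd maintain a larger, current list.
String[] userAgents = {
    "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.52.7",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.9.168 Version/11.52"
};
String[] referers = {
    "http://www.google.com/search?q=widgets",
    "http://www.bing.com/search?q=widgets"
};

// Pick a random User-Agent and an off-domain referer for this request.
Random random = new Random();
String userAgent = userAgents[ random.nextInt( userAgents.length ) ];
String referer = referers[ random.nextInt( referers.length ) ];

// In a screen-scraper script, run before each file is scraped:
// scrapeableFile.addHTTPHeader( "User-Agent", userAgent );
// scrapeableFile.addHTTPHeader( "Referer", referer );
```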
This is necessary once a site starts blocking your IP address.
The first approach is to use screen-scraper’s built-in Anonymization Service. The current implementation makes use of Amazon EC2 servers as proxies; because we use Amazon’s Linux EC2 instances, we have access to Squid, a popular proxy server that comes already installed.
A limitation of Amazon’s EC2 instances is that they reside in a finite and predictable block of IP addresses, and we have had a number of sites block Amazon EC2 ranges wholesale.
Once Amazon EC2 instances are no longer effective, you can make use of three other ad hoc techniques.
Tor: The Tor network is spread widely across many different nodes and can prove difficult (almost impossible) to block. However, because of the vast distribution across any type of web server (with varying internet speeds) the relay speed is roughly 1/10th that of a normal connection. But, it’s free.
I2P2: Similar to Tor but a bit better maintained. This means faster connections. However, there are many fewer proxy nodes and fewer IP addresses to block. But, it’s free.
Anonymization via Manual Proxy Pools: Using proxy pools should be a last resort because the nature of the proxies is unknown and often unreliable. You are making use of computers on the Internet that have been set up with an open port for all the world to relay its traffic through. It’s possible that the owner of the server may close the open port at any time. But, it’s free.
See the following resources to read more about Anonymizing screen-scraper.
Those of you already familiar with screen-scraper are acquainted with the usual routine of starting off by proxying a site using screen-scraper’s proxy server. Well, it so happens that screen-scraper uses an HTTP proxy. It also so happens that most online videos are served over a protocol other than HTTP (e.g. mms, http to mms, rtsp, http to rtmp, rtmp, rtmpe, rtmps, rtmpt, etc.).
Those of you already familiar with online videos probably know that you view them via the Adobe Flash player. screen-scraper’s built-in client is not a Flash player. So, you wonder, how does screen-scraper scrape online videos?
Source video URL discovery is particularly challenging for the reasons described above and requires a new set of tools to make it happen. Over time our tool set has evolved to include different video stream recording software, Proxy/TCP revealers, and various multimedia players…
Once discovered we create a pretty typical scraping session to recurse over a site, scraping the visible title, description, etc., as well as the non-visible pieces that make up an online video source URL. For example…
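As a hypothetical illustration (all names and URLs here are invented), a page may never show the video URL directly; instead, the scraping session extracts the separate pieces the player uses and assembles them:

```java
// Hypothetical pieces extracted separately from the page's HTML
// or an XHR response (invented values, for illustration only).
String protocol = "rtmp";
String server = "media.example.com";
String app = "vod";
String stream = "mp4:videos/2011/my_clip.mp4";

// The source URL the Flash player would actually request.
String sourceUrl = protocol + "://" + server + "/" + app + "/" + stream;
```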
Extracting embedded video meta-data is required because seldom will a site state outright what the format, codec, dimensions, length, etc. of its online videos are. We use a combination of software to download a portion of the video in order to get to the meta-data.
The ability to easily manage multiple scraping sessions is key because we are currently scraping from around 26 online video portals. To do this we have built a web-based Tomcat controller to coordinate across multiple servers located anywhere in the world. You can manually, or by way of a scheduler, start each scraping session, add additional screen-scraper instances and point to multiple mySQL databases.
This is the first installment in what will hopefully become a series.
Here at screen-scraper we handle a variety of projects for a myriad of different clients. All of our work is centered around our core software, screen-scraper, but is often complemented by third-party software such as PHP, Tomcat, Lucene, Google Web Toolkit, and mySQL, along with our own set of custom-built code.
ScrapbookFinds.com: Our in-house scrapbooking comparison shopping site. Since 2006 we have been scraping many scrapbooking supply websites for product data. As we scrape, the data is added to a mySQL database, where we categorize it and scrub it for duplicates. When you search the site, Lucene quickly finds the results related to your query.
Data normalization is the process of identifying a single product that is found on more than one site. Each site may refer to that product using different characteristics in, say, the title, description, or part number. Finding likeness despite the differences is a common challenge for us. Data normalization is handled by Lucene’s ability to index and tokenize disparate data to find commonality.
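Lucene handles this for us in production; purely to illustrate the idea of token-based matching, here is a toy Jaccard comparison of two product titles (this is not our actual matching code):

```java
import java.util.HashSet;
import java.util.Set;

// Break a title into lowercase alphanumeric tokens.
Set tokenize( String s )
{
    Set tokens = new HashSet();
    for ( String t : s.toLowerCase().split( "[^a-z0-9]+" ) )
        if ( t.length() > 0 )
            tokens.add( t );
    return tokens;
}

// Jaccard similarity: size of token intersection over size of union.
double jaccard( String a, String b )
{
    Set ta = tokenize( a );
    Set tb = tokenize( b );
    Set intersection = new HashSet( ta );
    intersection.retainAll( tb );
    Set union = new HashSet( ta );
    union.addAll( tb );
    return union.isEmpty() ? 0.0 : (double)intersection.size() / union.size();
}

// Two listings of the same product, punctuated and cased differently.
double score = jaccard( "Acme Glue Stick, 12-pack", "ACME glue stick (12 pack)" );
```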
We mitigate changes to a site by monitoring the number of records each time it is scraped. If the current number of records drops below 80% of the previous total, we know to look over the logs for errors and/or warnings issued by screen-scraper.
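The check itself is simple; a sketch with hypothetical counts (in practice both numbers come from the database):

```java
// Hypothetical record counts -- in practice these come from the database.
int previousRecordCount = 1250;
int currentRecordCount = 890;

// Flag the run for review if we collected fewer than 80% of last run's records.
boolean needsReview = currentRecordCount < previousRecordCount * 0.8;
```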
Well, that’s actually not always true. Take a quick look at this blog posting here. The fundamental issue described by that posting is one of recursion vs. iteration. When recursion is used (a page calls a page which calls a page…) objects tend to get stacked up, and subsequently fill up memory. When iteration is used objects are properly cleaned up so memory doesn’t become a problem. The trouble is, this condition is often hard to detect, and unless you’re thinking about it when you’re building your scraping session, you may cause it without realizing it.
An astute screen-scraper user yesterday suggested a solution to this that is both simple and effective. In the case described in the blog posting you end up with a big stack of scripts, all of which have references to objects, which causes the OutOfMemoryError. The number of scripts on the stack can be viewed in the breakpoint window, and in version 4.5.45a we added a method that will allow you to see how many scripts are on the stack from within a script:
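Assuming the method name added in 4.5.45a, the call from within a script looks like this:

```
int numScripts = session.getNumScriptsOnStack();
session.log( "Scripts currently on the stack: " + numScripts );
```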
You can check this number as often as you’d like. As it grows it could mean trouble, so you can respond appropriately in your scraping session. We’ve also added a failsafe mechanism inside of screen-scraper that will hopefully save you from an OutOfMemoryError. If too many scripts are pushed on the stack your scraping session will be stopped and the following message will be output to the log:
ERROR–halting the scraping session because the maximum number of scripts allowed on the stack was reached.
You can control the maximum number of scripts allowed on the stack by invoking this method at any time:
session.setMaxScriptsOnStack( 50 );
Set that number to whatever you’d like.
By design screen-scraper provides a lot of flexibility and power in the data extraction process, but this same power can also result in our shooting ourselves in the foot on occasion. The inclusion of this new mechanism will hopefully help some to avoid this problem down the road.
Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo. Many times there is nothing to do but to restart, but sometimes there is a way to pick up (pretty close to) where you left off. You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.
I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
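As a rough sketch of the idea, using java.util.Properties for the state file (the file name and values are hypothetical; in an actual scraping session the current LOCATION and COMPANY would come from session variables):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

// Where we left off -- hypothetical values standing in for session variables.
Properties state = new Properties();
state.setProperty( "LOCATION", "Denver" );
state.setProperty( "COMPANY", "Acme Widgets" );

// Save the state each time we finish a COMPANY.
File stateFile = new File( "scrape_state.properties" );
FileOutputStream out = new FileOutputStream( stateFile );
state.store( out, "Last scraping state" );
out.close();

// At the start of the scrape, restore the state if present, then
// skip ahead to the saved LOCATION/COMPANY rather than starting over.
Properties restored = new Properties();
if ( stateFile.exists() )
{
    FileInputStream in = new FileInputStream( stateFile );
    restored.load( in );
    in.close();
}
String lastLocation = restored.getProperty( "LOCATION", "" );
```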
We try hard to maintain backward compatibility as much as possible, but unfortunately it can’t always be done. If you recently upgraded to 4.5.42a you may have noticed that scraping sessions exported from that version don’t import correctly into an alpha version prior to it. This is a result of the alterations to the “tidy HTML” functionality implemented in that version. As such, this is one case where you’re going to have to be careful: if you export scraping sessions from 4.5.42a or later, you should only import them into instances of screen-scraper also running 4.5.42a or later. Unfortunately it was impossible to maintain compatibility with older versions here, so please take note.
Once in a while when you’re scraping you may request a file that ends up being really large, but you actually only need to pull data from the top portion of the file. If it’s a big file it can end up slowing down the scraping process quite a bit. Not too long ago (somewhere around version 4.5.20a, I think) we added a method to deal with just such cases:
scrapeableFile.setMaxResponseLength( int maxKBytes )
This tells screen-scraper to only download a given number of kilobytes at the beginning of the file. You would want to run this method in a script that gets invoked before a file is scraped. For example, if your script contained this line:
scrapeableFile.setMaxResponseLength( 50 );
screen-scraper would download the first 50K of the file, cut it off, then continue on.
If the speed of a scraping session is especially critical this can also be a great way to trim off quite a bit of download time.