Handling scraped data in real time

Once screen-scraper extracts data from a web site, typically that data is sent somewhere else. Data is probably most commonly written out to a file, but may also be saved to a database or even submitted to another web site. You can always handle the scraped data in screen-scraper scripts, but what if you want to make use of the data in your own application, which invokes screen-scraper?

In the past, invoking screen-scraper from a remote application has generally meant sending screen-scraper the request to scrape, waiting for extraction to finish, then handling the extracted data in the calling application. It’s that second step that can be hard to deal with: the request to scrape is sent, but the calling application can’t touch the scraped data until screen-scraper finishes its work. This is especially troublesome when scrapes are long and might get interrupted partway through. At best it’s inconvenient; at worst it may mean loss of scraped data.

I recently had a flash of inspiration as to how to deal with these cases, and implemented a new feature in the latest alpha version of screen-scraper (3.0.63a) that greatly facilitates handling data in a remote application as it is getting scraped. First, to give a contrary example, consider the method we advocate in our fourth tutorial for invoking screen-scraper remotely to extract data from our shopping web site. The process goes basically like this:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. The “Shopping Site” scraping session runs.
  4. Once the scraping session completes, control returns to the calling application.
  5. The calling application requests the scraped records from screen-scraper.
  6. The scraped records are output by the calling application.

Now consider this possibility:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. While the scraping session runs it sends scraped records back to the calling application, which outputs them as they get scraped.

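Hopefully the benefits of the second approach are obvious: the calling application can act on each record as soon as it is scraped, and an interrupted scrape no longer means losing everything extracted so far.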
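To make the contrast concrete, here is a minimal sketch that does not use screen-scraper at all. The SimulatedScrape class and its method names are hypothetical, invented purely for illustration; it just produces a few product names and can be told to fail partway through, standing in for a long scrape that gets interrupted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a long-running scrape; NOT part of screen-scraper.
class SimulatedScrape
{
    private final List<String> products;
    private final int failAfter; // records produced before the simulated interruption

    SimulatedScrape( List<String> products, int failAfter )
    {
        this.products = products;
        this.failAfter = failAfter;
    }

    // Second approach: each record is handed to the caller as soon as it exists.
    void scrapeAndStream( Consumer<String> receiver )
    {
        for( int i = 0; i < products.size(); i++ )
        {
            if( i == failAfter )
            {
                throw new IllegalStateException( "scrape interrupted" );
            }
            receiver.accept( products.get( i ) );
        }
    }

    // First approach: the caller sees nothing until the whole scrape completes.
    List<String> scrapeThenReturn()
    {
        List<String> results = new ArrayList<>();
        scrapeAndStream( results::add );
        return results; // never reached if the scrape is interrupted
    }
}

class StreamingVersusBatchDemo
{
    public static void main( String[] args )
    {
        SimulatedScrape scrape =
            new SimulatedScrape( Arrays.asList( "DVD A", "DVD B", "DVD C" ), 2 );

        List<String> batch = new ArrayList<>();
        try
        {
            batch = scrape.scrapeThenReturn();
        }
        catch( IllegalStateException e )
        {
            System.out.println( "Batch scrape failed: " + e.getMessage() );
        }
        System.out.println( "Batch caller holds " + batch.size() + " record(s)" );

        List<String> streamed = new ArrayList<>();
        try
        {
            scrape.scrapeAndStream( streamed::add );
        }
        catch( IllegalStateException e )
        {
            System.out.println( "Streaming scrape failed: " + e.getMessage() );
        }
        System.out.println( "Streaming caller holds " + streamed.size() + " record(s)" );
    }
}
```

With the interruption placed after the second product, the batch caller ends up holding nothing while the streaming caller keeps the two records it had already received.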

Now on to implementation. Consider this Java class:

    import com.screenscraper.scraper.*;
    import com.screenscraper.common.*;

    public class PollTest
    {
        public static void main( String[] args )
        {
            PollTest test = new PollTest();
            test.go();

            System.exit( 0 );
        }

        public void go()
        {
            try
            {
                // Connect to screen-scraper and select the scraping session to run.
                RemoteScrapingSession remoteScrapingSession = new RemoteScrapingSession( "Shopping Site" );
                remoteScrapingSession.setVariable( "SEARCH", "dvd" );
                remoteScrapingSession.setVariable( "PAGE", "1" );

                // Ask for scraped data every second instead of the default five.
                remoteScrapingSession.setPollFrequency( 1 );

                // The receiver must be registered before scrape() is called.
                remoteScrapingSession.setDataReceiver( new MyDataReceiver() );
                remoteScrapingSession.scrape();
                remoteScrapingSession.disconnect();
            }
            catch( Exception e )
            {
                System.err.println( "Exception: " + e.getMessage() );
                e.printStackTrace();
            }
        }

        // Invoked by the driver each time screen-scraper sends data to the client.
        class MyDataReceiver implements DataReceiver
        {
            public void receiveData( String key, Object value )
            {
                System.out.println( "Got data from ss." );
                System.out.println( "Key: " + key );
                System.out.println( "Value: " + value );
            }
        }
    }

The key is the “MyDataReceiver” class, which implements the “DataReceiver” interface. This interface requires the implementation of just one method: receiveData. When the scraping session is configured correctly, this method will get invoked as data is scraped by screen-scraper, allowing you to handle it in your own code. A few other notes on this class:

  • The “setPollFrequency” method sets how often (in seconds) data is sent from screen-scraper to the client. The default is five seconds.
  • The “setDataReceiver” method must be called before “scrape” is called.
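A receiver doesn’t have to print what it gets. As a rough sketch, the hypothetical CollectingDataReceiver below stores every value it receives, grouped by key, so the application can work with the records later. To keep the example compilable without screen-scraper.jar it declares its own DataReceiver interface with the same single-method shape described above; in real code you would delete that declaration and use the interface shipped with screen-scraper. The synchronized methods reflect an assumption (not something confirmed here) that callbacks might arrive on a different thread from the one reading the results.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in with the single-method shape described in the post, so this
// compiles on its own. In real code, remove it and use screen-scraper's
// own DataReceiver interface.
interface DataReceiver
{
    void receiveData( String key, Object value );
}

// Hypothetical receiver that keeps everything it is given, grouped by key.
class CollectingDataReceiver implements DataReceiver
{
    private final Map<String, List<Object>> received = new LinkedHashMap<>();

    // Assumption: callbacks may come from another thread, hence synchronized.
    public synchronized void receiveData( String key, Object value )
    {
        received.computeIfAbsent( key, k -> new ArrayList<Object>() ).add( value );
    }

    // Returns a copy so callers can iterate while more data keeps arriving.
    public synchronized List<Object> valuesFor( String key )
    {
        List<Object> values = received.get( key );
        if( values == null )
        {
            return Collections.<Object>emptyList();
        }
        return new ArrayList<Object>( values );
    }
}

class CollectingDataReceiverDemo
{
    public static void main( String[] args )
    {
        CollectingDataReceiver receiver = new CollectingDataReceiver();

        // Simulate the calls the driver would make during a scrape.
        receiver.receiveData( "DR", "first product" );
        receiver.receiveData( "DR", "second product" );

        System.out.println( "Records so far: " + receiver.valuesFor( "DR" ).size() );
    }
}
```

In the PollTest class, an instance of such a receiver would be passed to “setDataReceiver” in place of MyDataReceiver.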

The implementation in screen-scraper is quite simple. I took the standard “Shopping Site” scraping session from the tutorial, and added the following script:

    session.sendDataToClient( "DR", dataRecord );

The script gets invoked after each product is extracted from the web site. The “sendDataToClient” method will accept almost any object, including strings, integers, DataRecords, and DataSets.
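Because the value can be of several types, the receiving side may want to check what it was handed before using it. The helper below is only a sketch: ReceivedValueFormatter is a hypothetical name, and treating a record as a java.util.Map is an assumption made so the example runs without screen-scraper.jar. Check what your driver actually delivers before relying on it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper a receiveData implementation could delegate to.
class ReceivedValueFormatter
{
    static String describe( String key, Object value )
    {
        if( value == null )
        {
            return key + ": (no value)";
        }
        if( value instanceof Map )
        {
            // Assumption: a scraped record behaves like a map of field names to values.
            Map<?, ?> record = (Map<?, ?>) value;
            return key + ": record with " + record.size() + " field(s) " + record;
        }
        if( value instanceof Number )
        {
            return key + ": number " + value;
        }
        // Strings and anything else fall back to toString().
        return key + ": " + value;
    }
}

class ReceivedValueFormatterDemo
{
    public static void main( String[] args )
    {
        // A plain map standing in for a scraped record.
        Map<String, String> record = new LinkedHashMap<>();
        record.put( "TITLE", "Casablanca" );
        record.put( "PRICE", "9.99" );

        System.out.println( ReceivedValueFormatter.describe( "DR", record ) );
        System.out.println( ReceivedValueFormatter.describe( "COUNT", Integer.valueOf( 3 ) ) );
        System.out.println( ReceivedValueFormatter.describe( "STATUS", "done" ) );
    }
}
```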

So far we’ve only implemented this in the Java and PHP drivers, but the others will be forthcoming.

The example source files can be downloaded here, and include both PHP and Java files. If you decide to give this a try, be sure to upgrade to version 3.0.63a of screen-scraper. You’ll want to reference the latest “screen-scraper.jar” or “misc\php\remote_scraping_session.php” files in your code (found inside the folder where screen-scraper is installed).
