11.12.07

Handling scraped data in real time

Posted in Updates, Tips at 12:40 pm by Todd Wilson

Once screen-scraper extracts data from a web site, typically that data is sent somewhere else. Data is probably most commonly written out to a file, but may also be saved to a database or even submitted to another web site. You can always handle the scraped data in screen-scraper scripts, but what if you want to make use of the data in your own application, which invokes screen-scraper?

In the past, when invoking screen-scraper from a remote application, the process has generally meant sending screen-scraper the request to scrape, waiting for extraction to occur, then handling that extracted data in the application that invoked screen-scraper. It’s that second step that can be a bit hard to deal with–the request to scrape is sent, but the scraped data can’t be touched by the calling application until screen-scraper finishes its work. This can be especially troublesome in cases where the scrapes are long and might even get interrupted in the middle. This is at best inconvenient, and at worst may mean loss of scraped data.

I recently had a flash of inspiration as to how to deal with these cases, and implemented a new feature in the latest alpha version of screen-scraper (3.0.63a) that greatly facilitates handling data in a remote application as it is getting scraped. First, to give a contrary example, consider the method we advocate in our fourth tutorial for invoking screen-scraper remotely to extract data from our shopping web site. The process goes basically like this:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. The “Shopping Site” scraping session runs.
  4. Once the scraping session completes, control returns to the calling application.
  5. The calling application requests the scraped records from screen-scraper.
  6. The scraped records are output by the calling application.

Now consider this possibility:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. While the scraping session runs it sends scraped records back to the calling application, which outputs them as they get scraped.

Hopefully the benefits to the second approach are obvious.

Now on to implementation. Consider this Java class:

import com.screenscraper.scraper.*;
import com.screenscraper.common.*;

public class PollTest
{
    public static void main( String args[] )
    {
        PollTest test = new PollTest();
        test.go();

        System.exit( 0 );
    }

    public void go()
    {
        try
        {
            RemoteScrapingSession remoteScrapingSession = new RemoteScrapingSession( "Shopping Site" );
            remoteScrapingSession.setVariable( "SEARCH", "dvd" );
            remoteScrapingSession.setVariable( "PAGE", "1" );
            // Have screen-scraper send data to this client every second.
            remoteScrapingSession.setPollFrequency( 1 );
            // The receiver must be set before scrape() is called.
            remoteScrapingSession.setDataReceiver( new MyDataReceiver() );
            remoteScrapingSession.scrape();
            remoteScrapingSession.disconnect();
        }
        catch( Exception e )
        {
            System.err.println( "Exception: " + e.getMessage() );
            e.printStackTrace();
        }
    }

    class MyDataReceiver implements DataReceiver
    {
        // Invoked each time screen-scraper sends data back to the client.
        public void receiveData( String key, Object value )
        {
            System.out.println( "Got data from ss." );
            System.out.println( "Key: " + key );
            System.out.println( "Value: " + value );
        }
    }
}

The key is the “MyDataReceiver” class, which implements the “DataReceiver” interface. This interface requires the implementation of just one method: receiveData. When the scraping session is configured correctly, this method will get invoked as data is scraped by screen-scraper, allowing you to handle it in your own code. A few other notes on this class:

  • The “setPollFrequency” method indicates how often (in seconds) data should be sent from screen-scraper to the client. The default is five seconds.
  • The “setDataReceiver” method must be called before “scrape” is called.
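Because receiveData is called while the scrape is still running, one natural use is to persist each record the moment it arrives, so that an interrupted scrape loses only the records that had not yet been sent. The following is just a sketch of that idea. To keep it self-contained, the DataReceiver interface is redeclared locally as a stand-in with the same method signature as in the listing above; the AppendingReceiver class and its tab-separated output format are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Stand-in for screen-scraper's DataReceiver interface (same method signature
// as in the listing above), declared here only so the sketch compiles alone.
interface DataReceiver
{
    void receiveData( String key, Object value );
}

// Hypothetical receiver that appends every record to a file as it arrives.
class AppendingReceiver implements DataReceiver
{
    private final Path outputFile;

    AppendingReceiver( Path outputFile )
    {
        this.outputFile = outputFile;
    }

    public void receiveData( String key, Object value )
    {
        String line = key + "\t" + value + System.lineSeparator();
        try
        {
            // CREATE + APPEND: the file grows by one line per record received.
            Files.write( outputFile, line.getBytes( StandardCharsets.UTF_8 ),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND );
        }
        catch( IOException e )
        {
            throw new UncheckedIOException( e );
        }
    }
}
```

In a real client you would drop the stand-in interface, use the DataReceiver that ships with screen-scraper, and pass an instance of the receiver to setDataReceiver before calling scrape.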

The implementation in screen-scraper is quite simple. I took the standard “Shopping Site” scraping session from the tutorial, and added the following script:

session.sendDataToClient( "DR", dataRecord );

The script gets invoked after each product is extracted from the web site. The “sendDataToClient” method will accept most any object, including strings, integers, DataRecords, and DataSets.

So far we’ve only implemented this in the Java and PHP drivers, but the others will be forthcoming.

The example source files can be downloaded here, and include both PHP and Java files. If you decide to give this a try, be sure to upgrade to version 3.0.63a of screen-scraper. You’ll want to reference the latest “screen-scraper.jar” or “misc\php\remote_scraping_session.php” files in your code (found inside the folder where screen-scraper is installed).

09.13.07

Anonymization through proxy servers

Posted in Tips at 4:38 pm by Todd Wilson

In certain cases a scrape needs to be anonymized in order to get the data you’re after. Generally this means sending the HTTP requests through one or more proxy servers, over which you may or may not have control (see How to surf and screen-scrape anonymously for more on this). Up to this point, this has been possible in screen-scraper, but the implementation has been relatively inelegant. Because of the needs of a recent client of ours, we’ve taken the time to flesh this out a bit more so that screen-scraper now handles proxies much more gracefully. To use the code cited in this post, you’ll need to upgrade to the latest alpha version of screen-scraper.

The best way to explain is often by example, so here you go:

import com.screenscraper.util.*;

// Create a new ProxyServerPool object. This object will
// control how screen-scraper interacts with proxy servers.
ProxyServerPool proxyServerPool = new ProxyServerPool();

// We give the current scraping session a reference to
// the proxy pool. This step should ideally be done right
// after the object is created (as in the previous step).
session.setProxyServerPool( proxyServerPool );

// This tells the pool to populate itself from a file
// containing a list of proxy servers. The format is very
// simple–you should have a proxy server on each line of
// the file, with the host separated from the port by a colon.
// For example:
// one.proxy.com:8888
// two.proxy.com:3128
// 192.0.2.10:8080
// But obviously without the slashes at the beginning.
proxyServerPool.populateFromFile( "proxies.txt" );

// screen-scraper can iterate through all of the proxies to
// ensure they’re responsive. This can be a time-consuming
// process unless it’s done in a multi-threaded fashion.
// This method call tells screen-scraper to validate up to
// 25 proxies at a time.
proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

// This method call tells screen-scraper to filter the list of
// proxy servers using 7 seconds as a timeout value. That is,
// if a server doesn’t respond within 7 seconds, it’s deemed
// to be invalid.
proxyServerPool.filter( 7 );

// Once filtering is done, it’s often helpful to write the good
// set of proxies out to a file. That way you may not have to
// filter again the next time.
proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

// You might also want to write out the list of proxy servers
// to screen-scraper’s log.
proxyServerPool.outputProxyServersToLog();

// This is the switch that tells the scraping session to make
// use of the proxy servers. Note that this can be turned on
// and off during the course of the scrape. You may want to
// anonymize some pages, but not others.
session.setUseProxyFromPool( true );

// As a scraping session runs, screen-scraper will filter out
// proxies that become non-responsive. If the number of proxies
// gets down to a specified level, screen-scraper can repopulate
// itself. That’s what this method call controls.
proxyServerPool.setRepopulateThreshold( 5 );

During the course of the scrape, you may find that a proxy has been blocked. When this happens, you can make this method call to tell screen-scraper to remove the proxy from the pool:

session.currentProxyServerIsBad();

Given that this feature is still in the alpha version of screen-scraper, there’s a chance we might change around the methods a bit, but, for the most part, you should be able to use it as you see it here.
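To make the idea behind setNumProxiesToValidateConcurrently and filter a bit more concrete, here is a rough sketch of how a list of proxies could be checked in parallel against a timeout. This is not screen-scraper's implementation, which isn't shown in this post. The ProxyFilter class, its method names, and the choice of a plain TCP connect as the test for "responsive" are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper, not part of screen-scraper.
class ProxyFilter
{
    // Returns a check that treats a "host:port" entry as responsive if a TCP
    // connection to it can be opened within the given number of seconds.
    static Predicate<String> tcpCheck( int timeoutSeconds )
    {
        return entry ->
        {
            int colon = entry.lastIndexOf( ':' );
            if( colon < 1 ) return false;
            try( Socket socket = new Socket() )
            {
                String host = entry.substring( 0, colon );
                int port = Integer.parseInt( entry.substring( colon + 1 ) );
                socket.connect( new InetSocketAddress( host, port ), timeoutSeconds * 1000 );
                return true;
            }
            catch( IOException | IllegalArgumentException e )
            {
                return false;
            }
        };
    }

    // Runs the check on up to maxConcurrent proxies at a time and returns the
    // entries that passed, in their original order.
    static List<String> filter( List<String> proxies, Predicate<String> check, int maxConcurrent )
    {
        ExecutorService pool = Executors.newFixedThreadPool( maxConcurrent );
        try
        {
            List<Future<Boolean>> results = new ArrayList<>();
            for( String proxy : proxies )
            {
                results.add( pool.submit( () -> check.test( proxy ) ) );
            }
            List<String> good = new ArrayList<>();
            for( int i = 0; i < proxies.size(); i++ )
            {
                if( passed( results.get( i ) ) ) good.add( proxies.get( i ) );
            }
            return good;
        }
        finally
        {
            pool.shutdownNow();
        }
    }

    // A check that fails or is interrupted counts as "not responsive".
    private static boolean passed( Future<Boolean> result )
    {
        try
        {
            return result.get();
        }
        catch( InterruptedException e )
        {
            Thread.currentThread().interrupt();
            return false;
        }
        catch( ExecutionException e )
        {
            return false;
        }
    }
}
```

With this sketch, ProxyFilter.filter( proxies, ProxyFilter.tcpCheck( 7 ), 25 ) would mirror the 7-second timeout and 25-at-a-time settings used in the script above.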

It also might be of interest to note that we’ve done a slightly extended implementation of this technique that we’re using internally, which makes use of Amazon’s EC2 service. This allows us to have a pool of high speed proxy servers at an arbitrary quantity. As the proxy servers get blocked, they can be automatically terminated, with others spawned to replace them.

03.01.07

How to surf and screen-scrape anonymously

Posted in Tips at 2:33 pm by Todd Wilson

Well, this is a topic I’ve been meaning to address for quite a while, and a recent support request on the topic pushed me to finally get it done.

What I’ll describe in this article applies to our own screen-scraper software, but would also apply to most any other screen or web scraping software you might use. Most of it would even apply to web surfing in general.

Why surf or scrape anonymously?

There are a number of reasons why you may want to remain anonymous. It’s often a good idea to protect privacy when concerned about identity theft. You might be scraping from a competitor’s web site, and don’t want them to be able to identify you. Some web sites disallow too many requests from the same client, so you might be trying to circumvent such mechanisms.

I’ll issue a little caveat here by pointing out that, like many other tools, screen-scraping software can be used for good or ill. If you find yourself doing a lot of anonymous scraping, you may want to examine the legitimacy of what you’re doing. Scraping tools can be very useful, but don’t abuse them.

How do web sites discourage screen-scraping?

There are a number of mechanisms that web sites will use to attempt to discourage screen-scraping. Here are the ones I can think of off the top of my head:

  • User tracking through cookies. A web site can easily plant a cookie, then track the number of requests you make by incrementing a server-side value attached to the cookie.
  • User tracking by IP address. A slightly less reliable method used by sites is to track the number of requests you make by associating them with your IP address. I say it’s slightly less reliable because you could potentially have multiple client requests originating from the same IP address (e.g., if they’re all connecting through a common gateway).
  • CAPTCHA mechanisms. There are a number of different types, and many are very difficult to circumvent. They’re also not very common, however.
  • Authentication. This one dovetails with tracking through cookies, but is a slight variation in that some sites will require authentication before allowing access to the information you want to scrape. If a site doesn’t require authentication you might simply be able to block cookies, but a logged-in session generally depends on them, so this one can be tricky to deal with.

Great. So how do I scrape anonymously?

The method(s) you’ll want to avail yourself of to scrape anonymously will depend on what the web site is using (if anything) to attempt to discourage scraping. I’ll describe below the techniques I’d recommend, along with when they’d make the most sense.

Hide your real IP address

How to do it: This is probably the most common technique you’ll use, so I’ll address it first. Every page request has to originate from an IP address, but it doesn’t necessarily need to be your real IP address. There are a few different ways you can trick the web server into thinking the HTTP request is coming from a different IP address:

  • Send the request through a proxy server. There are lots of them out there. Most HTTP clients (e.g., a web browser or screen-scraping software) can be set up to send requests through an HTTP or SOCKS proxy server. Given that this is one of the more common techniques, I’ll also describe a few specific approaches:
    • Send all requests through the same proxy server. If you Google around a bit you can find lists of anonymous proxy servers. Find one that seems to be reliable, then set up your scraping software to send all requests through it. There are also tools that will take a list of proxy servers, then tell you which ones are working, faster, more reliable, etc.
    • Send requests through an application that cycles through proxy servers. These applications act as a proxy server, but with each request they’ll cycle it through a different proxy server. You provide a list and it simply iterates through them one by one. MultiProxy is a bit dated, but it’s one I can think of offhand. This can also be done in our screen-scraper software by simply placing a “proxies.txt” file in screen-scraper’s installation folder. The file should contain a proxy server on each line in the format [host or IP address]:[port] (e.g., myproxy.com:8080).
    • Use tor/privoxy. This little tool can be a gem, but please don’t abuse it. It provides stronger anonymity than regular proxy servers, but may not be quite as fast.
    • Use browser-based anonymization services. There are quite a few online services that allow you to punch in a web address, they send the request from their server, then display the response to you. You likely wouldn’t use this technique for scraping, but it might be useful for a few quick requests from your web browser.
  • Use a virtual private network. This allows you to send all outgoing Internet traffic through a machine external to yours, and will cause the web server you’re scraping to think the request is coming from that computer and not yours. You might already have access to a VPN you can use, but more than likely you’ll just need to pay a bit to use someone else’s. This is probably the best technique for completely anonymizing any HTTP requests you might make, but does have the disadvantage that you won’t be able to cycle through IP addresses. That is, if you want a new IP address you’ll have to disconnect from and reconnect to the network. Two services of this type that I know of are StrongVPN and Relakks. We’ve used Relakks before and have had positive results.
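For the list-cycling approach mentioned above, the mechanics are simple enough to sketch. The example below reads lines in the [host or IP address]:[port] format and hands out proxies one by one, wrapping around at the end of the list. The ProxyRotation class is hypothetical and isn't taken from MultiProxy, screen-scraper, or any other tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the "[host or IP address]:[port]" list
// format and one-by-one cycling; not taken from any particular tool.
class ProxyRotation
{
    private final List<String[]> proxies = new ArrayList<>();
    private int nextIndex = 0;

    // Accepts the lines of a proxy list file; blank lines are skipped.
    ProxyRotation( List<String> lines )
    {
        for( String raw : lines )
        {
            String line = raw.trim();
            if( line.isEmpty() ) continue;
            int colon = line.lastIndexOf( ':' );
            if( colon < 1 || colon == line.length() - 1 )
            {
                throw new IllegalArgumentException( "Expected host:port but got: " + line );
            }
            // Parse the port now so that a bad entry fails at load time.
            Integer.parseInt( line.substring( colon + 1 ) );
            proxies.add( new String[] { line.substring( 0, colon ), line.substring( colon + 1 ) } );
        }
        if( proxies.isEmpty() )
        {
            throw new IllegalArgumentException( "No proxies in list" );
        }
    }

    // Returns the next proxy as { host, port }, wrapping around at the end.
    String[] next()
    {
        String[] proxy = proxies.get( nextIndex );
        nextIndex = ( nextIndex + 1 ) % proxies.size();
        return proxy;
    }
}
```

Each request would call next() and route itself through the returned host and port; in plain Java, for instance, that pair can be wrapped in a java.net.Proxy and passed to URL.openConnection.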

When to do it: This is probably the most common technique, and you should use it any time you want to prevent the web server you’re working with from tracing requests back to you.

It should be noted that this technique is not foolproof. If you’re simply sending requests through an HTTP proxy server, there’s nothing stopping the owner of the proxy server from recording your request and IP address, then divulging the information to others so that the request can be traced back to you. Tools like tor can provide a greater degree of anonymity, but even that isn’t bulletproof. I recently read of an exploit a researcher found in tor that would allow traffic sent through it to be monitored. The strongest method of anonymity is probably the VPN, but, again, that assumes that the owners of the VPN service will keep private any traffic you send through them.

Block cookies

How to do it: This one’s pretty easy. If you’re using a web browser, just find the setting that indicates that all cookies should be blocked. Most screen-scraping software will (or should) also provide a way to do this.

When to do it: If the web site you’re working with is tracking you through cookies, you can simply reject them all. This likely will only work on relatively unsophisticated sites. Most sites trying to discourage screen-scraping will track your IP address.

Avoid authentication

How to do it: If you’re authenticated to a web site, you’re likely not blocking cookies, so the web site will be able to track you.

When to do it: This is probably obvious, but, if you don’t need to authenticate, don’t. That eliminates one other method whereby a site can track you.

In some cases it’s simply not possible to avoid authentication. In these cases, unfortunately, there may not be anything you can do to stay anonymous. Your best bet would probably be to hide your IP address (as described above), which may also require logging in and out of the site each time you acquire a new IP address.

Look for ways to circumvent CAPTCHA mechanisms

How to do it: In cases where a CAPTCHA mechanism is poorly implemented, it may be possible to determine how to circumvent it programmatically (i.e., in programming code). A common CAPTCHA method is to present the user with a series of numbers or characters in a pattern such that a machine wouldn’t be able to read it. In a handful of cases in the past we’ve found that the server simply uses a naming convention with the CAPTCHA images, such that it’s possible to determine what the image says without requiring that a human read it.

Yet another fairly inefficient way of dealing with a CAPTCHA would be to capture the portion of the page containing the CAPTCHA, present it to a human being, have the person type in whatever the CAPTCHA requires, then make the request. We’ve never used this technique (and likely never would), but it’s technically possible to do.

When to do it: If a site is using a CAPTCHA, examine the HTML closely. Refresh the page multiple times to see how it changes. If you’re lucky, there will be a way to circumvent it in code. More than likely, though, you’d simply have to have a human being deal with it.

Behave yourself

So there you have it. I’ve just pointed out a number of tools and techniques to remain anonymous online. Like I said before, don’t abuse them. There are some very legitimate reasons for wanting to do this, but there are a whole host of reasons why you shouldn’t. Part of me says I shouldn’t even be divulging any of this, but I’m not telling you anything you couldn’t find out on your own. So be nice. Behave yourself.

01.02.07

How to stop phpBB spam

Posted in Miscellaneous, Tips at 12:29 pm by Todd Wilson

Well, I sure wish someone would have told us about this a while ago, so I’m doing the world a favor and talking about it here. Hopefully this blog posting gets picked up by Google so that others who are new to phpBB can learn how to stop spam up front.

We’ve been battling spam on our phpBB forum for I don’t know how long. The forum software works fine, but it’s so widespread that it seems to be one of the primary targets for forum spammers. After monkeying around with the thing installing mods and making manual changes, we finally hit this mod: Stop Spambot Registration. Once installed, the spam stopped. Amazing.

Now, obviously your mileage may vary with this one. We’ve also tried a bunch of other mods, so it’s possible that some of our mods are helping, but the Stop Spambot Registration was the key for us. If you find that you need more firepower beyond that mod, I’d recommend trying others on the phpBB Security-Related MODs page that relate to spam.

By the way, just one plea to the phpBB folks–please consider building spam control into the base install of the software. You know people are targeting you, so why not give your users some defense out of the box?

***UPDATE***

Well, I declared victory a bit prematurely with that last posting. We got a bit more spam after I installed the mod I mentioned, so I installed one more: spamwords. It seems to work fairly well. My only complaint is that it only allows you to designate words, and not phrases, as indicators of spam.

I should also mention one other change we made early on that stopped a lot of the spam–we deleted the guest user account. This is the user in the database that has an ID of -1. I searched and searched for a way to disable guest posting, to no avail. With the guest account deleted people see an error message if they explicitly log out, but at least it prevents spam from non-registered posters.

10.18.06

Scraping CAPTCHA forms (you know, those HTML forms with the wavy text)

Posted in Tips at 10:06 am by Todd Wilson

Alert screen-scraper yipa posted an excellent question to our forum this morning:

One of the pages I want to scrape is behind a login with image verification (i.e., you need to enter some text generated in an image to log in). Is there a way to work around this? Maybe something like SS load the image, display/save it to a location, waits for my input after viewing the image, then moves on? Or are there other ways to handle this?

This can be a pretty tricky situation to deal with, but, in most all cases, it should still be doable. I added it to our FAQ, and here’s the explanation for your enlightenment and learning:

I’m trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?

This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working around these:

Oftentimes sites will use a poorly implemented CAPTCHA such that it can be determined up front what the text will read. For example, the site may actually have only four or five images, and it simply cycles through them. By looking at the names of the images one could determine what the corresponding text will be. The text could then be used to populate the appropriate HTML form.

Assuming the CAPTCHA mechanism works as it should (i.e., that a human being would have to type in the text shown in the image), it gets a bit trickier to deal with. The best route would probably be to run a scraping session as you normally would, then, once you arrive at the page containing the CAPTCHA, follow these steps:

  1. Download the CAPTCHA image to the local hard drive (e.g., using the session.downloadFile method).
  2. Using a screen-scraper script, pop up a dialog box using Java code that displays the image, and contains a text box that will accept user input. Within a script you have full access to the Java API, so you could pop up something like a custom JDialog containing the image and text box.
  3. Have a person type into the text box the characters displayed in the image.
  4. Accept the text entered by the user, then drop it into a screen-scraper session variable.
  5. Use the value in the session variable to populate the HTML form element.

This obviously isn’t ideal, but, unfortunately, there may not be another way. The CAPTCHA images are designed such that they can’t be read by a machine. As such, human intervention is required.

10.12.06

Scraping a Date Range

Posted in Tips at 3:53 pm by jason

Much of the time in scraping, one wants to fill in a web form and grab the results, and many forms want the user to fill in a date range. It’s not a daunting prospect if you just want to scrape the form once, but for jobs where you want to run a scrape weekly and get a full week’s worth of data, making a script for that has been challenging. I have therefore developed a simple, generic script that will figure the date a given number of days from today and save it in a session variable.

For the purposes of this post, I’m going to make a script give me the date a week from today in the format of a 2-digit month, 2-digit day, and 4-digit year; however, I’ll make those easy to change.

To start, one needs to import some useful Java components:

import java.util.*;
import java.text.*;

These allow us to go ahead and create an instance of “right now”.

Calendar rightNow = Calendar.getInstance();

This gives me a “right now” to which I can add 7 days, thusly:

rightNow.add( Calendar.DATE, 7 );

And all that is left is to format it:

Date endDate = rightNow.getTime();
SimpleDateFormat formatter = new SimpleDateFormat( "MM/dd/yyyy" );
String newDate = formatter.format( endDate );

Now I have a nicely formatted local variable named newDate that I would just need to set as a session variable for the rest of the scrape to run.

session.setVariable( "NEW_DATE", newDate );

That’s enough to make the script work, but in order to make it into a good template, one should make it easy to find and change the things that will have to be set differently in each application. My attempt to do so ended up like this:

import java.util.*;
import java.text.*;

// Set number of days to add to current date.
int addDays = 7;

// Set the format in which the date should be output.
String dateFormat = "MM/dd/yyyy";

// Figure the new date.
Calendar rightNow = Calendar.getInstance();
rightNow.add( Calendar.DATE, addDays );
Date endDate = rightNow.getTime();
SimpleDateFormat formatter = new SimpleDateFormat( dateFormat );
String newDate = formatter.format( endDate );

// Output the new date.
session.setVariable( "NEW_DATE", newDate );

Of course you can use this process to make more than one date for your form if needed; from here it should just be a matter of some minor editing.
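As one way to do that editing, the same Calendar logic can be pulled into a small helper that is called once per date the form needs. The DateRange class and the offsetFrom method are names made up for this sketch, and passing Locale.US to the formatter is an addition of mine so the output doesn't vary with the machine's locale.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper; the class and method names are made up for this example.
class DateRange
{
    // Formats the date that falls offsetDays after base (or before it, if negative).
    static String offsetFrom( Date base, int offsetDays, String dateFormat )
    {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime( base );
        calendar.add( Calendar.DATE, offsetDays );
        // Locale.US keeps the digits and separators the same on every machine.
        return new SimpleDateFormat( dateFormat, Locale.US ).format( calendar.getTime() );
    }
}
```

For a form that wants the past week, you would call offsetFrom once with -7 and once with 0, using the same base date and format, and store each result in its own session variable (say, START_DATE and END_DATE, both hypothetical names).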

For information on the date formatting, see the Java documentation at: http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html

And for a trick to make the formatting of dates far easier when you’re in screen-scraper, read up on the reformatDate method that is available in the professional edition.

08.02.06

Extracting data from PDF files

Posted in Tips at 11:42 am by Todd Wilson

Periodically people ask if screen-scraper can extract data from PDF files, as well as HTML. We’ve never had a very good answer for this (it can’t, out of the box), but lately we’ve been forced to come up with a solution, as a project we’ve been working on has required it.

When I initially researched how to go about this, I was looking for libraries that would allow for extraction from PDF files. I found a handful of them, but each had its own proprietary method for performing the extraction (e.g., lots of different method calls for handling tables and such). They seemed like possibilities, but I couldn’t come up with an elegant way to integrate them into screen-scraper without completely changing the way the user would need to perform the extraction.

After stepping back from the problem, I decided that it might make more sense to simply convert the PDF to a text-based format (e.g., HTML), then use screen-scraper’s existing extraction mechanism to pull the data out. After poking around a bit, I happened across pdftohtml, which does an excellent job of converting PDF files to HTML or XML. For the project we’re currently working on, screen-scraper is able to easily pull the data we need out of the converted file.

Our next step will be to integrate this functionality directly within screen-scraper. That is, screen-scraper should be able to seamlessly convert a PDF on the fly, then allow the user to make use of screen-scraper’s existing extraction mechanisms to pull the data out. The tricky part is that pdftohtml is platform dependent. They offer a Windows binary, but on any other OS you have to either compile from source or hope for an existing package (we’re using Ubuntu and were able to just apt-get it).

Here’s how I’m thinking it would work if we were to automate the process within screen-scraper:

  • For any operating systems that allow it, we just ship a binary with screen-scraper that will perform the PDF to HTML/XML conversion locally.
  • In cases where that doesn’t work, we provide a remote web service that will convert the PDF to XML. screen-scraper would invoke this behind the scenes in two different ways:
    • screen-scraper would first attempt to convert the PDF by passing the URL to it to the web service. The web service would attempt to retrieve the PDF via a GET request. Assuming that works, it would then perform the conversion and spit back the resulting XML, which screen-scraper would download.
    • If the web service is unable to grab the PDF directly (e.g., in cases where the PDF is behind an authentication gateway), it would indicate such to screen-scraper, which would then download the PDF file and upload it to the web service. The web service would perform the conversion, then output the resulting XML.

Once the PDF is converted, the user would be free to use the normal extractor patterns to pull the data out.

So in a worst-case scenario, the PDF would need to be downloaded from the source site, uploaded to the web service, converted, then downloaded again. Obviously this would add a lot of overhead, so it’s definitely not the best approach. I would guess that in the majority of the cases, however, the PDF could be converted either locally or via the web service where it’s able to request the PDF directly from the web site.

I can’t say just when we’ll get around to implementing this. It would likely mostly depend on the demand we see for it. This is the first project we’ve done where we’ve had to pull it off, but I’m guessing there will be others down the road. Until we implement this automated method, though, running pdftohtml manually may not be too cumbersome for most.
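In the meantime, the manual step can at least be scripted. Here is a minimal sketch of calling pdftohtml from Java, assuming the executable is on the PATH. The PdfConverter class is hypothetical, and the -xml and -noframes options are the ones I understand pdftohtml to accept, so check them against the help output of your own build.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper; assumes a pdftohtml executable is on the PATH.
class PdfConverter
{
    // Builds the command line: pdftohtml (-xml | -noframes) <input.pdf> <output base name>
    static List<String> buildCommand( Path pdf, Path outputBase, boolean asXml )
    {
        List<String> command = new ArrayList<>();
        command.add( "pdftohtml" );
        // -xml asks for XML output; -noframes asks for a single HTML file
        // rather than a frameset (both flags are assumptions, see above).
        command.add( asXml ? "-xml" : "-noframes" );
        command.add( pdf.toString() );
        command.add( outputBase.toString() );
        return command;
    }

    // Runs the conversion and returns the process's exit code.
    static int convert( Path pdf, Path outputBase, boolean asXml )
        throws IOException, InterruptedException
    {
        Process process = new ProcessBuilder( buildCommand( pdf, outputBase, asXml ) )
            .inheritIO()
            .start();
        return process.waitFor();
    }
}
```

The converted file can then be loaded and scraped like any other text-based document.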

07.31.06

Extracting data from Java applets, ActiveX controls, and Adobe Flash movies

Posted in Tips at 10:04 am by Todd Wilson

This is a question we get from time to time, so I finally decided to add it to our FAQ. If anyone else has experience with this kind of thing feel free to post a comment. I’m unaware of many packages that can do this.

Here’s the posting from the FAQ:

The short answer to this one is, “Sometimes.” Most all widgets (applets, etc.) that communicate with their server via HTTP can be scraped by screen-scraper. Oftentimes, however, they’ll use a proprietary protocol. Most of the time Adobe Flash movies use HTTP when they need to communicate with a server, but Java applets and ActiveX controls don’t always. The easiest way to find out is to use screen-scraper’s proxy server when interacting with a page containing one of these elements. Take a close look at the HTTP requests and responses passing between the web browser and the server. If you see text in there (often XML or URL-encoded lists of parameters) then the chances are good that screen-scraper can extract the information being passed between the client and server. Note, however, that there may be text that the widget is displaying that doesn’t get passed between the client and server. Unfortunately, in such cases, screen-scraper is unable to extract that information. The only utility we’re aware of that may allow for scraping that type of information would be IBM’s Rational Robot software.

03.22.06

Scraping data from similar tables

Posted in Tips at 5:52 pm by Todd Wilson

Astute screen-scraper Fred came up with a scenario that arises from time-to-time: you’ve got a page containing one or more HTML tables, all of which are nearly identical in structure. You want to pull the data from each table, but need to be able to distinguish which row came from which table. Standard old extractor patterns won’t do the job–they’ll match every row in every table, which destroys the link between each row and its corresponding table.

Fortunately, there are a couple of ways of handling such a scenario, which I’ve just outlined in this FAQ. Not too complicated, but a bit more involved than just using a standard extractor pattern.
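The FAQ's approaches aren't reproduced here, but the general idea of keeping rows tied to their table can be illustrated with a two-pass match: first isolate each table, then match rows only within that table's HTML. The sketch below uses plain Java regular expressions rather than screen-scraper extractor patterns. The TableRows class is hypothetical, and the patterns are deliberately naive (nested tables, for instance, would confuse them).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration using plain regular expressions rather than
// screen-scraper extractor patterns. It does not handle nested tables.
class TableRows
{
    private static final Pattern TABLE =
        Pattern.compile( "<table[^>]*>(.*?)</table>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE );
    private static final Pattern ROW =
        Pattern.compile( "<tr[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE );

    // Maps each table's position on the page (0, 1, 2, ...) to the contents of its rows.
    static Map<Integer, List<String>> rowsByTable( String html )
    {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        Matcher table = TABLE.matcher( html );
        int tableIndex = 0;
        while( table.find() )
        {
            // Second pass: match rows only inside this one table.
            List<String> rows = new ArrayList<>();
            Matcher row = ROW.matcher( table.group( 1 ) );
            while( row.find() )
            {
                rows.add( row.group( 1 ).trim() );
            }
            result.put( tableIndex++, rows );
        }
        return result;
    }
}
```

Because every row is stored under the index of the table it came from, the link between row and table survives the extraction.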

03.07.06

Adding numbers to session variables

Posted in Updates, Tips at 5:38 pm by Todd Wilson

Up till now it’s been a pretty big pain to add a number to a session variable. Oftentimes you’ll have something like a page number that you need to increment as you loop through search results pages. The page number is usually stored as a String, and to increment it you normally have to parse it to an int, increment it, then convert it back to a String. Recently, though, we added a “session.addToVariable” method that makes this a lot quicker. Here’s the documentation on it:

  • addToVariable( String variable, int value ). Adds a value to a session variable. Session variables are generally stored as Strings, so it’s normally more difficult than it should be to simply add a number to one. This method takes the name of the variable, which can either hold a String or Integer, and adds a number to it. The number added to it can be positive or negative.
    example: session.addToVariable( "PAGE_NUM", 1 );

Much simpler than the previous way. This will be part of our upcoming 2.7 release (any day now!), but if you’d like to make use of it right now you can simply upgrade to the latest pre-release version (2.6.0.6a).
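For comparison, the manual approach looks roughly like the following. A plain Map stands in for the session's variables so the example runs on its own; the ManualIncrement class is hypothetical and is not how screen-scraper stores variables internally.

```java
import java.util.Map;

// Hypothetical stand-in for a scraping session's variables, used only so the
// example runs on its own; it is not screen-scraper's session object.
class ManualIncrement
{
    // Reads a numeric String, adds amount, and stores the result back as a String.
    static void addTo( Map<String, Object> variables, String name, int amount )
    {
        int current = Integer.parseInt( String.valueOf( variables.get( name ) ) );
        variables.put( name, String.valueOf( current + amount ) );
    }
}
```

In a script, session.addToVariable( "PAGE_NUM", 1 ) replaces both of those steps with a single call.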