Resume points

Posted in Tips on 04/15/10 by jason

Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo.  Many times there is nothing to do but to restart, but sometimes there is a way to pick up (pretty close to) where you left off.  You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.

I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.

inputFile = "resume_" + session.getName() + ".txt";
File file = new File(inputFile);
if (file.exists())
{
    session.loadVariables(inputFile); // This may set several variables, but we'll pay no attention to most
    session.log("==============================================================");
    session.log("Scrape is set to resume at " + session.getVariable("RESUME_AT_LOCATION") + " and " + session.getVariable("RESUME_AT_COMPANY"));
    session.log("==============================================================");
}

If the file isn’t there, that’s good.  I’ll just scrape the list of LOCATIONS, and go through all of them.

If the file is there, then I change a couple of variable names so I can throw them away when I’m done with them.

Now I get to the list of locations, and I run this script on each pattern application:

// Function to write the restore point and scrape
scrapeLocation(location)
{
    // Set RESUME_AT_LOCATION to write out
    session.setVariable("RESUME_AT_LOCATION", location);
    session.saveVariables("resume_" + session.getName() + ".txt");
    session.setVariable("RESUME_AT_LOCATION", null); // We don't want this set next time the loop is run
    session.scrapeFile("Single location");
}

if (session.getVariable("RESUME_AT_LOCATION")==null)
{
    // No resume at location set, so scraping
    scrapeLocation(dataRecord.get("LOCATION"));
}
else
{
    // There is a resume at location set, so compare
    if (dataRecord.get("LOCATION").equals(session.getVariable("RESUME_AT_LOCATION")))
    {
        // Matches, so scrape
        scrapeLocation(dataRecord.get("LOCATION"));
    }
    else
    {
        session.log("---Skipping this location: " + dataRecord.get("LOCATION"));
        session.log("-----Looking to resume at: " + session.getVariable("RESUME_AT_LOCATION"));
    }
}

Now when the “RESUME_AT_LOCATION” is set at the beginning of a scrape, it will skip all locations until it finds a match.  It will then scrape that location, and clear the variable so it will get the rest of the locations.
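To make the skip-until-match behavior easier to see, here is a minimal sketch in plain Java that runs outside of screen-scraper. The class name `ResumeSketch`, the method `locationsToScrape`, and the sample city names are all made up for illustration; the `pending` variable stands in for the RESUME_AT_LOCATION session variable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResumeSketch {
    // Returns the locations that would be scraped, in order.
    // resumeAt == null models a fresh run with no resume file.
    static List<String> locationsToScrape(List<String> all, String resumeAt) {
        List<String> scraped = new ArrayList<>();
        String pending = resumeAt; // stand-in for RESUME_AT_LOCATION
        for (String location : all) {
            if (pending == null || pending.equals(location)) {
                pending = null;        // match found (or fresh run): stop skipping
                scraped.add(location); // stand-in for scraping this location
            }
            // otherwise: still looking for the resume point, so skip
        }
        return scraped;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("Austin", "Boise", "Chicago", "Denver");
        System.out.println(locationsToScrape(all, null));      // [Austin, Boise, Chicago, Denver]
        System.out.println(locationsToScrape(all, "Chicago")); // [Chicago, Denver]
    }
}
```

One consequence of this approach worth keeping in mind: if the saved location no longer appears in the list (say the site renamed it), nothing ever matches and every location is skipped.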

You could copy this same logic into the script that runs on the list of COMPANIES, using RESUME_AT_COMPANY in place of RESUME_AT_LOCATION.
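As a rough illustration of how the two levels might interact, the following plain-Java sketch simulates a nested resume outside of screen-scraper. Everything in it is hypothetical: the class `NestedResumeSketch`, the method `companiesToScrape`, and the sample site data are invented, and the two `pending` variables stand in for RESUME_AT_LOCATION and RESUME_AT_COMPANY.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedResumeSketch {
    // Returns "location/company" entries in the order they would be scraped.
    static List<String> companiesToScrape(Map<String, List<String>> site,
                                          String resumeLocation,
                                          String resumeCompany) {
        List<String> scraped = new ArrayList<>();
        String pendingLocation = resumeLocation; // stand-in for RESUME_AT_LOCATION
        String pendingCompany = resumeCompany;   // stand-in for RESUME_AT_COMPANY
        for (Map.Entry<String, List<String>> entry : site.entrySet()) {
            String location = entry.getKey();
            if (pendingLocation != null && !pendingLocation.equals(location)) {
                continue; // still looking for the saved location
            }
            pendingLocation = null; // found it (or fresh run)
            for (String company : entry.getValue()) {
                if (pendingCompany != null && !pendingCompany.equals(company)) {
                    continue; // still looking for the saved company
                }
                pendingCompany = null; // found it: scrape everything from here on
                scraped.add(location + "/" + company);
            }
        }
        return scraped;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new LinkedHashMap<>();
        site.put("Austin", Arrays.asList("Acme", "Bolt"));
        site.put("Boise", Arrays.asList("Cog", "Dyn"));
        System.out.println(companiesToScrape(site, "Austin", "Bolt"));
        // [Austin/Bolt, Boise/Cog, Boise/Dyn]
    }
}
```

The company resume point is cleared as soon as it matches, so it only affects the location where the scrape was interrupted; later locations are scraped in full.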

Finally, when the scrape completes naturally, we should delete the file with the resume point since 1) you’re now at the last location, and 2) you’re done … nothing to resume.  Without that file, the scrape will start fresh next time.

// Delete the file with the resume point
outputFile = "resume_" + session.getName() + ".txt";
File file = new File(outputFile);
file.delete();
session.log("Resume point deleted");
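Note that `File.delete()` simply returns false if the file could not be removed, so the log line above is written either way. If you want to know whether the delete really happened, one option is `java.nio.file.Files.deleteIfExists`, which returns whether a file was removed and throws on failure. The sketch below is plain Java meant to run outside of screen-scraper; the class name `ResumeFileCleanup`, the helper `deleteResumeFile`, and the session name in `main` are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResumeFileCleanup {
    // Returns true if a resume file existed and was removed,
    // false if there was no file to remove.
    static boolean deleteResumeFile(String sessionName) {
        Path resumeFile = Paths.get("resume_" + sessionName + ".txt");
        try {
            return Files.deleteIfExists(resumeFile);
        } catch (IOException e) {
            // The file exists but could not be deleted (permissions, locks, ...)
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "example_session" is a placeholder session name
        boolean deleted = deleteResumeFile("example_session");
        System.out.println(deleted ? "Resume point deleted" : "No resume point to delete");
    }
}
```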

You can see more on loadVariables and saveVariables on the API Page.
