10.24.06
Posted in Updates at 5:07 pm by Todd Wilson
This may be the one that becomes the next public version. There were a few little annoying GUI quirks that we fixed. Feel free to give it a try and let us know if you notice anything out of the ordinary.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.19a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
10.18.06
Posted in Tips at 10:06 am by Todd Wilson
Alert screen-scraper user yipa posted an excellent question to our forum this morning:
One of the pages I want to scrape is behind a login with image verification (i.e., you need to enter some text generated in an image to log in). Is there a way to work around this? Maybe something like SS load the image, display/save it to a location, waits for my input after viewing the image, then moves on? Or are there other ways to handle this?
This can be a pretty tricky situation to deal with, but in almost all cases it should still be doable. I added it to our FAQ, and here’s the explanation for your enlightenment and learning:
I’m trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?
This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working around these:
Oftentimes sites will use a poorly implemented CAPTCHA such that it can be determined up front what the text will read. For example, the site may actually have only four or five images, and it simply cycles through them. By looking at the names of the images one could determine what the corresponding text will be. The text could then be used to populate the appropriate HTML form.
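As a rough sketch of that first approach, assuming you have already worked out by hand which image shows which text, a simple lookup table keyed on the image file name would do. The file names and text values below are invented for illustration, as is the KnownCaptchas class itself:

```java
import java.util.HashMap;
import java.util.Map;

final class KnownCaptchas {
    // Hypothetical image file names and the text each image displays.
    // In practice you would fill this in after inspecting the site by hand.
    private static final Map<String, String> TEXT_BY_IMAGE = new HashMap<String, String>();
    static {
        TEXT_BY_IMAGE.put("captcha1.gif", "X7KQ2");
        TEXT_BY_IMAGE.put("captcha2.gif", "M4PZ9");
        TEXT_BY_IMAGE.put("captcha3.gif", "R8TW5");
    }

    // Returns the text for the image at the given URL, or null if the
    // image is not one we have seen before.
    static String textFor(String imageUrl) {
        String path = imageUrl;
        // Drop any query string so only the path remains.
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        // Keep only the file name after the last slash.
        String name = path.substring(path.lastIndexOf('/') + 1);
        return TEXT_BY_IMAGE.get(name);
    }
}
```

Within a scraping session the result could then be saved with something like session.setVariable( "CAPTCHA_TEXT", KnownCaptchas.textFor( imageUrl ) ), where CAPTCHA_TEXT and imageUrl are placeholder names. A null result means the site served an image that isn't in the table.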
Assuming the CAPTCHA mechanism works as it should (i.e., that a human being would have to type in the text shown in the image), it gets a bit trickier to deal with. The best route would probably be to run a scraping session as you normally would, then, once you arrive at the page containing the CAPTCHA, follow these steps:
- Download the CAPTCHA image to the local hard drive (e.g., using the session.downloadFile method).
- Using a screen-scraper script, pop up a dialog box using Java code that displays the image, and contains a text box that will accept user input. Within a script you have full access to the Java API, so you could pop up something like a custom JDialog containing the image and text box.
- Have a person type into the text box the characters displayed in the image.
- Accept the text entered by the user, then drop it into a screen-scraper session variable.
- Use the value in the session variable to populate the HTML form element.
This obviously isn’t ideal, but, unfortunately, there may not be another way. The CAPTCHA images are designed such that they can’t be read by a machine. As such, human intervention is required.
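For the curious, here is a minimal sketch of what the dialog in the steps above might look like. It uses a stock JOptionPane rather than a custom JDialog, simply because it takes less code, and it assumes the CAPTCHA image has already been downloaded to disk. The CaptchaPrompt class, the file path, and the CAPTCHA_TEXT variable name are all made up for illustration:

```java
import java.awt.BorderLayout;
import java.awt.image.BufferedImage;
import javax.swing.ImageIcon;
import javax.swing.JLabel;
import javax.swing.JOptionPane;
import javax.swing.JPanel;
import javax.swing.JTextField;

final class CaptchaPrompt {
    // Builds a panel with the CAPTCHA image on top and the text box below it.
    static JPanel buildPanel(BufferedImage image, JTextField input) {
        JPanel panel = new JPanel(new BorderLayout(0, 8));
        panel.add(new JLabel(new ImageIcon(image)), BorderLayout.CENTER);
        panel.add(input, BorderLayout.SOUTH);
        return panel;
    }

    // Shows a modal dialog and blocks until the user responds.
    // Returns the typed text, or null if the user cancelled.
    static String ask(BufferedImage image) {
        JTextField input = new JTextField(15);
        int choice = JOptionPane.showConfirmDialog(
                null,
                buildPanel(image, input),
                "Enter the text shown in the image",
                JOptionPane.OK_CANCEL_OPTION,
                JOptionPane.PLAIN_MESSAGE);
        return choice == JOptionPane.OK_OPTION ? input.getText().trim() : null;
    }
}

// Inside a screen-scraper script this might be used roughly as follows
// (hypothetical path and variable name):
//   BufferedImage image = javax.imageio.ImageIO.read( new java.io.File( "C:/temp/captcha.jpg" ) );
//   session.setVariable( "CAPTCHA_TEXT", CaptchaPrompt.ask( image ) );
```

Because the dialog is modal, the script simply waits until someone types the characters and clicks OK, which is the behavior the question asked for.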
10.17.06
Posted in Updates at 3:39 pm by Todd Wilson
Just a few little bug fixes in this one. There was a pretty annoying problem that would cause the GUI to freeze up from time to time. It turned out to be a bug in Sun’s Java code, but fortunately there was a relatively painless workaround.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.18a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
10.12.06
Posted in Tips at 3:53 pm by jason
Much of the time in scraping, one wants to fill in a web form and grab the results, and many of the forms want the user to fill in a date range. It’s not a daunting prospect if you just want to scrape the form once, but for jobs where you want to run a scrape weekly and get a full week’s worth of data, making a script for that has been challenging. I have therefore developed a simple, generic script that will figure the date for a given number of days from today and save it in a session variable.
For the purposes of this post, I’m going to make a script give me a date for a week from today in the format of a 2 digit month, 2 digit day, and 4 digit year; however, I’ll make those easy to change.
To start, one needs to import some useful Java components:
import java.util.*;
import java.text.*;
These allow us to go ahead and create an instance of “right now”.
Calendar rightNow = Calendar.getInstance();
This gives me a “right now” to which I can add 7 days, thusly:
rightNow.add( Calendar.DATE, 7 );
And all that is left is to format it:
Date endDate = rightNow.getTime();
SimpleDateFormat formatter = new SimpleDateFormat( "MM/dd/yyyy" );
String newDate = formatter.format( endDate );
Now I have a nicely formatted local variable named newDate that I would just need to set as a session variable for the rest of the scrape to run.
session.setVariable("NEW_DATE", newDate);
That’s enough to make the script work, but in order to make it into a good template, one should make it easy to find and change the things that will have to be set differently in each application. My attempt to do so ended up like this:
import java.util.*;
import java.text.*;
// Set number of days to add to current date.
int addDays = 7;
// Set the format in which the date should be output.
String dateFormat = "MM/dd/yyyy";
// Figure the new date.
Calendar rightNow = Calendar.getInstance();
rightNow.add( Calendar.DATE, addDays );
Date endDate = rightNow.getTime();
SimpleDateFormat formatter = new SimpleDateFormat( dateFormat );
String newDate = formatter.format( endDate );
// Output the new date.
session.setVariable("NEW_DATE", newDate);
Of course you can use this process to make more than one date for your form if needed; from here it should just be a matter of some minor editing.
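As one possible way to do that, here is a sketch that wraps the same Calendar and SimpleDateFormat steps in a small helper so both ends of a date range come from one base date. The DateRange class, the shifted method, and the START_DATE and END_DATE variable names are my own inventions for this example, and I’ve pinned the formatter to Locale.US so the digits come out the same on any machine:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

final class DateRange {
    // Returns the base date moved by addDays (negative values go back in
    // time), formatted with the given SimpleDateFormat pattern.
    static String shifted(Calendar base, int addDays, String dateFormat) {
        // Work on a copy so the caller's calendar is left untouched.
        Calendar copy = (Calendar) base.clone();
        copy.add(Calendar.DATE, addDays);
        SimpleDateFormat formatter = new SimpleDateFormat(dateFormat, Locale.US);
        return formatter.format(copy.getTime());
    }
}

// Example with a fixed base date of October 28, 2006 (noon):
//   Calendar base = new GregorianCalendar( 2006, Calendar.OCTOBER, 28, 12, 0, 0 );
//   DateRange.shifted( base, -7, "MM/dd/yyyy" );   // 10/21/2006
//   DateRange.shifted( base,  7, "MM/dd/yyyy" );   // 11/04/2006
//
// In a screen-scraper script (hypothetical variable names):
//   Calendar now = Calendar.getInstance();
//   session.setVariable( "START_DATE", DateRange.shifted( now, -7, "MM/dd/yyyy" ) );
//   session.setVariable( "END_DATE", DateRange.shifted( now, 0, "MM/dd/yyyy" ) );
```

Note that the second example crosses a month boundary; Calendar handles the rollover for you, which is the main reason to use it rather than doing the arithmetic by hand.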
For information on the date formatting patterns, see the Java documentation at: http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html
And for a trick to make the formatting of dates far easier when you’re in screen-scraper, read up on the reformatDate method that is available in the professional edition.
10.10.06
Posted in Updates at 4:21 pm by Todd Wilson
Our to-do list is empty! This version contains all of the bug fixes and features we’ve had planned for the next version of screen-scraper. I suppose you could consider it to be more of a beta, or maybe even a release candidate. There really isn’t anything earth-shatteringly new in this version over 2.7.1.16a–mostly just bug fixes and some clean-up.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.17a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!