12.28.06
Posted in Updates at 12:54 am by Todd Wilson
Mostly just some minor fixes and clean-ups in this one. Our to-do list is officially empty, and this time we’re *really* not throwing in anything new. We still have quite a bit more testing to do, but, who knows, maybe this one could be the gold release…
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates, then follow these steps:
- After downloading and installing the update via “Check for updates”, launch the screen-scraper workbench.
- Open the “Settings” dialog box by clicking on the wrench icon in the button bar.
- Click inside one of the text boxes (it doesn’t matter which) to give it focus.
- Close the “Settings” dialog box. This causes certain properties files to be re-written with a new property related to the new HTML renderer.
- Close the workbench.
- Launch the workbench again.
If you’re upgrading from anything prior to version 2.7.2.9a, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.23a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Follow steps 2 through 4 in the upgrade instructions above this one (the instructions corresponding to upgrading from version 2.7.2.9a or higher).
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
12.18.06
Posted in Updates at 12:26 pm by Todd Wilson
This new version contains one more encoding fix that should resolve garbled characters in some Latin character sets. Please let us know if you find any web sites containing characters that screen-scraper incorrectly renders as boxes or question marks. Aside from that, we’ve implemented a new HTML rendering engine, which, while still not perfect, seems to do a much better job than the previous one. This new rendering engine requires, however, that you upgrade to this version in a slightly different way, so please carefully follow the instructions you’ll find below.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates, then follow these steps:
- After downloading and installing the update via “Check for updates”, launch the screen-scraper workbench.
- Open the “Settings” dialog box by clicking on the wrench icon in the button bar.
- Click inside one of the text boxes (it doesn’t matter which) to give it focus.
- Close the “Settings” dialog box. This causes certain properties files to be re-written with a new property related to the new HTML renderer.
- Close the workbench.
- Launch the workbench again.
If you’re upgrading from anything prior to version 2.7.2.9a, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.22a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Follow steps 2 through 4 in the upgrade instructions above this one (the instructions corresponding to upgrading from version 2.7.2.9a or higher).
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
12.01.06
Posted in Updates at 3:05 pm by Todd Wilson
This one includes a few more bug fixes to some of the annoying quirks in the most recent alpha versions. The big breakthrough, though, is that we now seem to have worked out all of the issues related to internationalization. With the most recent set of fixes screen-scraper seems to be handling sites in Japanese, Korean, and Chinese just great. We’re not throwing any confetti just yet, but all tests have checked out perfectly so far.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.21a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
11.28.06
Posted in Updates at 12:19 pm by Todd Wilson
I’m happy to report that any Python lovers out there can now invoke screen-scraper from their scripts. Thanks much to Litao Wei for creating a Python driver for us. The code is excellent.
Those interested can find documentation on the topic here. We also included an example in our fourth tutorial here.
Along these same lines, we’d love to do the same thing with Ruby. If there are any Ruby experts out there who have any interest in working with us on this, feel free to drop me a line. My email address is my first name @screen-scraper.com.
11.14.06
Posted in Updates at 5:53 pm by Todd Wilson
Well, so we found a few annoyances/bugs whose fixes we decided to slip into a new build. There will likely be at least one more before we do a public release. In addition to a few bug fixes, this version contains two slick new methods: session.saveVariables and session.loadVariables, which will respectively save session variables to a file and load them from a file. These can be very handy if you need to save the state of things so that you can pick up where you left off when you restart a scrape.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.20a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
10.24.06
Posted in Updates at 5:07 pm by Todd Wilson
This may be the one that becomes the next public version. There were a few little annoying GUI quirks that we fixed. Feel free to give it a try and let us know if you notice anything out of the ordinary.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.19a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
10.18.06
Posted in Tips at 10:06 am by Todd Wilson
Alert screen-scraper user yipa posted an excellent question to our forum this morning:
One of the pages I want to scrape is behind a login with image verification (i.e., you need to enter some text generated in an image to log in). Is there a way to work around this? Maybe something like SS load the image, display/save it to a location, waits for my input after viewing the image, then moves on? Or are there other ways to handle this?
This can be a pretty tricky situation to deal with, but, in almost all cases, it should still be doable. I added it to our FAQ, and here’s the explanation for your enlightenment and learning:
I’m trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?
This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working around these:
Oftentimes sites will use a poorly implemented CAPTCHA such that it can be determined up front what the text will read. For example, the site may actually have only four or five images, and it simply cycles through them. By looking at the names of the images one could determine what the corresponding text will be. The text could then be used to populate the appropriate HTML form.
Assuming the CAPTCHA mechanism works as it should (i.e., that a human being would have to type in the text shown in the image), it gets a bit trickier to deal with. The best route would probably be to run a scraping session as you normally would, then, once you arrive at the page containing the CAPTCHA, follow these steps:
- Download the CAPTCHA image to the local hard drive (e.g., using the session.downloadFile method).
- Using a screen-scraper script, pop up a dialog box using Java code that displays the image, and contains a text box that will accept user input. Within a script you have full access to the Java API, so you could pop up something like a custom JDialog containing the image and text box.
- Have a person type into the text box the characters displayed in the image.
- Accept the text entered by the user, then drop it into a screen-scraper session variable.
- Use the value in the session variable to populate the HTML form element.
This obviously isn’t ideal, but, unfortunately, there may not be another way. The CAPTCHA images are designed such that they can’t be read by a machine. As such, human intervention is required.
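The human-in-the-loop steps above could be sketched roughly as follows. This is plain, standalone Java rather than an actual screen-scraper script, and the class and method names (CaptchaPromptSketch, askWithDialog, readCaptcha) are made up for illustration. It assumes the CAPTCHA image has already been downloaded to a local file, and it uses a standard Swing JOptionPane instead of a custom JDialog to keep things short.

```java
import java.io.File;
import java.util.function.Function;
import javax.swing.ImageIcon;
import javax.swing.JOptionPane;

// Hypothetical helper: shows a downloaded CAPTCHA image to a person
// and returns the text they type in.
class CaptchaPromptSketch {

    // Pops up a modal Swing dialog with the image and a text box.
    // Returns null if the person cancels the dialog.
    static String askWithDialog(File imageFile) {
        ImageIcon icon = new ImageIcon(imageFile.getPath());
        Object answer = JOptionPane.showInputDialog(
                null,                                   // no parent window
                "Type the characters shown in the image:",
                "CAPTCHA",
                JOptionPane.QUESTION_MESSAGE,
                icon,                                   // the CAPTCHA image
                null,                                   // null choices = free-text box
                "");
        return answer == null ? null : answer.toString();
    }

    // Runs whichever prompt is supplied and tidies up the answer.
    // The prompt is passed in so it can be swapped out (e.g. for testing).
    static String readCaptcha(File imageFile, Function<File, String> prompt) {
        String typed = prompt.apply(imageFile);
        if (typed == null || typed.trim().isEmpty()) {
            throw new IllegalStateException("No CAPTCHA text was entered");
        }
        return typed.trim();
    }

    public static void main(String[] args) {
        // Assumed location of the already-downloaded image.
        File image = new File("captcha.png");
        String text = readCaptcha(image, CaptchaPromptSketch::askWithDialog);
        System.out.println("Entered text: " + text);
    }
}
```

Inside a screen-scraper script, the text returned here would then be dropped into a session variable with session.setVariable so that it can populate the form element.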
10.17.06
Posted in Updates at 3:39 pm by Todd Wilson
Just a few little bug fixes in this one. There was a pretty annoying problem that would cause the GUI to freeze up from time to time. It turned out to be a bug in Sun’s Java code, but fortunately there was a relatively painless workaround.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.18a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!
10.12.06
Posted in Tips at 3:53 pm by jason
Much of the time in scraping, one wants to fill in a web form and grab the results, and many of the forms want the user to fill in a date range. It’s not a daunting prospect if you just want to scrape the form once, but for jobs where you want to run a scrape weekly and get a full week’s worth of data, making a script for that has been challenging. I have therefore developed a simple, generic script that will figure the date for a given number of days from today, and save it in a session variable.
For the purposes of this post, I’m going to make a script give me a date for a week from today in the format of a 2-digit month, 2-digit day, and 4-digit year; however, I’ll make those easy to change.
To start, one needs to import some useful Java components:
import java.util.*;
import java.text.*;
These allow us to go ahead and create an instance of “right now”.
Calendar rightNow = Calendar.getInstance();
This gives me a “right now” to which I can add 7 days, thusly:
rightNow.add( Calendar.DATE, 7 );
And all that is left is to format it:
Date endDate = rightNow.getTime();
SimpleDateFormat formatter = new SimpleDateFormat( "MM/dd/yyyy" );
String newDate = formatter.format( endDate );
Now I have a nicely formatted local variable named newDate that I would just need to set as a session variable for the rest of the scrape to run.
session.setVariable("NEW_DATE", newDate);
That’s enough to make the script work, but in order to make it into a good template, one should make it easy to find and change the things that will have to be set differently in each application. My attempt to do so ended up like this:
import java.util.*;
import java.text.*;
// Set number of days to add to current date.
int addDays = 7;
// Set the format in which the date should be output.
String dateFormat = "MM/dd/yyyy";
// Figure the new date.
Calendar rightNow = Calendar.getInstance();
rightNow.add( Calendar.DATE, addDays );
Date endDate = rightNow.getTime();
SimpleDateFormat formatter = new SimpleDateFormat( dateFormat );
String newDate = formatter.format( endDate );
// Output the new date.
session.setVariable("NEW_DATE", newDate);
Of course you can use this process to make more than one date for your form if needed; from here it should just be a matter of some minor editing.
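As a rough sketch of how the same approach could produce both ends of a date range, the calculation can be pulled into a small helper. This is plain, standalone Java rather than a screen-scraper script; the class name DateRangeSketch, the method offsetDate, and the variable names START_DATE and END_DATE are made up for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;

// Hypothetical helper for building more than one date from the same starting point.
class DateRangeSketch {

    // Returns the date that is addDays days after base, formatted with dateFormat.
    // The base calendar is cloned first so the caller's copy is left untouched.
    static String offsetDate(Calendar base, int addDays, String dateFormat) {
        Calendar copy = (Calendar) base.clone();
        copy.add(Calendar.DATE, addDays);
        return new SimpleDateFormat(dateFormat).format(copy.getTime());
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/yyyy";

        // A range running from today to a week from today.
        Calendar today = Calendar.getInstance();
        String startDate = offsetDate(today, 0, dateFormat);
        String endDate = offsetDate(today, 7, dateFormat);
        System.out.println(startDate + " to " + endDate);

        // The same helper also works from a fixed date, which is handy for checking.
        Calendar fixed = new GregorianCalendar(2006, Calendar.OCTOBER, 12);
        System.out.println(offsetDate(fixed, 7, dateFormat));

        // In a screen-scraper script the two values would be saved with, e.g.:
        //   session.setVariable("START_DATE", startDate);
        //   session.setVariable("END_DATE", endDate);
    }
}
```

Cloning the calendar matters here: Calendar.add changes the object it is called on, so without the copy the second date would be offset from the first one rather than from today.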
For information on the date formatting, see the Java documentation at: http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html
And for a trick to make the formatting of dates far easier when you’re in screen-scraper, read up on the reformatDate method that is available in the professional edition.
10.10.06
Posted in Updates at 4:21 pm by Todd Wilson
Our to-do list is empty! This version contains all of the bug fixes and features we’ve had planned for the next version of screen-scraper. I suppose you could consider it to be more of a beta, or maybe even a release candidate. There really isn’t anything earth-shatteringly new in this version over 2.7.1.16a–mostly just bug fixes and some clean-up.
The usual caveats apply–this is alpha software, so use it at your own risk. Thanks, though, to anyone willing to help us test.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.17a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic!
- Close the screen-scraper workbench.
- Re-open the screen-scraper workbench.
You’re done!