Version 7.0.1a released

Posted in Updates on 04.19.16 by jason

When you update to version 7.0.1a, the first thing you’ll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here.

If you want to use this update, here are the instructions to update.

Screen-scraper 7.0 Released

Posted in Updates on 03.02.16 by jason

This new stable version adds many new features, and gives you the ability to scrape sites that are using the latest SSL features.

The installers are available on the download page.

Screen-scraper 7.0 requires a newer JRE than the previous stable release, so upgrading requires some additional steps.

If you don’t already have all your scrapes exported, or just want to preserve the current configuration, first upgrade your current screen-scraper to the latest alpha version, 6.0.69a (Instructions). Once done, back up the content of the screen-scraper/resource/db directory.

Linux/OSX

  1. The new installer does not include the JRE
  2. You need to have the Java JRE 1.8 installed (1.7 will work, but is not recommended)
    1. Make note of the install location (a symlink isn’t valid)
  3. Run the new setup SH file.
    1. You cannot install over the top of an existing installation. You must either move the current directory or choose a new install location during installation.
  4. In the screen-scraper install directory, locate and edit both the server and screen-scraper scripts. On the line containing “INSTALL4J_JAVA_HOME_OVERRIDE” (at the top of each script), add the path to your JRE install.

Once done, you can replace the content of the resource/db directory with the one you’d backed up.
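
Since step 2 of the Linux/OSX instructions says a symlink isn’t a valid JRE location, it can help to resolve the real path before editing the scripts. The following is a minimal sketch, not part of screen-scraper: the class name JreLocation is hypothetical, and it simply reports the version and the symlink-free location of whichever JRE runs it.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not shipped with screen-scraper.
class JreLocation {

    // Returns the home directory of the running JRE with all symlinks resolved.
    static Path realJavaHome() throws IOException {
        return Paths.get(System.getProperty("java.home")).toRealPath();
    }

    public static void main(String[] args) throws IOException {
        // The version string shows whether this is a 1.8 (recommended) or 1.7 JRE.
        System.out.println("Java version: " + System.getProperty("java.version"));
        // This resolved path is a candidate value for INSTALL4J_JAVA_HOME_OVERRIDE.
        System.out.println("Real JRE path: " + realJavaHome());
    }
}
```

Run it with the same java binary you intend screen-scraper to use, since it only describes the JRE that executes it.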

Windows

  1. Make sure screen-scraper is not running (neither the application, nor in server mode)
  2. Run the setup EXE
    1. You cannot install over the top of an existing installation. You must either move the current directory or choose a new install location during installation.

Once done, you can replace the content of the resource/db directory with the one you’d backed up.

We recommend this update for all scrapers, and if there are any problems, please let us know here or on the support forum.

 

Dynamic Content

Posted in Tips on 10.28.15 by jason

One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.

What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.

Since screen-scraper doesn’t run any JavaScript, what you need to do is make that request, and scrape the response. Here is an example:

  1. If you go to http://screen-scraper.com/infinite%20scroller/demo.html you can see my sample page. In this case it’s one of those pages that keeps tacking content onto the end forever, like Facebook or Pinterest.
  2. If you make a scrapeable file of http://screen-scraper.com/infinite%20scroller/demo.html you can get a successful response, but the content text isn’t there.
  3. Now you need to pull out the screen-scraper proxy, and proxy the request. You will see the one page is making 3 requests:
    1. http://screen-scraper.com/infinite%20scroller/demo.html -> The landing page
    2. http://screen-scraper.com/infinite%20scroller/scroll.js -> A JavaScript file that is making another request for data. On this one I’m just doing a GET request for a static page. Most of the time you will either see GET requests with parameters or POST requests to get different responses. Sometimes they change up the base URL, etc. There’s no real standard.
    3. http://screen-scraper.com/infinite%20scroller/data.json -> The request that gets the JSON content. Here you can see the format, and the JavaScript is parsing it, and writing it to the landing page for you.

Now you have the response, and in this case it’s JSON that you can either use extractor patterns on, or parse.
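
As a rough illustration of the “extractor pattern” route outside of screen-scraper, here is a minimal sketch in plain Java that pulls every string value for a given key out of a JSON response using a regular expression. The class name, the key “title”, and the sample JSON in the usage note are all hypothetical; the real layout of data.json isn’t shown here, so adjust the key to whatever the proxy reveals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of screen-scraper.
class JsonFieldExtractor {

    // Collects every string value stored under the given key, e.g. "title":"...".
    // Escape sequences inside a value are returned as-is, not decoded.
    static List<String> extractValues(String json, String key) {
        Pattern pattern = Pattern.compile(
                "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");
        Matcher matcher = pattern.matcher(json);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample; the actual data.json format may differ.
        String sample = "{\"items\":[{\"title\":\"First\"},{\"title\":\"Second\"}]}";
        for (String value : extractValues(sample, "title")) {
            System.out.println(value);
        }
    }
}
```

A regular expression like this is fine for flat, predictable responses; for deeply nested JSON a real parser is usually the safer choice.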

HTTPS connection issues

Posted in Updates on 04.29.15 by jason

We’ve been seeing lots of issues with scrapes connecting to HTTPS sites. Some of the errors include:

  • ssl_error_rx_record_too_long
  • An input/output error occurred while connecting to https:// … The message was peer not authenticated.
  • javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated

The issue came about when the Heartbleed vulnerability necessitated changes to some HTTPS connections: some of the connection types aren’t secure anymore, and new versions have come out. Screen-scraper needed two changes to catch up, and they are:

  • Update to use Java 8
  • Update of HTTPClient to 4.4

Both of these are pretty large changes, so they aren’t in the stable release yet. In some cases, however, they are the only option to make a scrape work, so here are the instructions to get what you need. Read the rest of this entry »
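
When diagnosing errors like the ones above, it can be useful to see which SSL/TLS protocol versions a given JRE offers. This is a minimal sketch using the standard javax.net.ssl API; the class name TlsSupportCheck is hypothetical, and it only describes the JRE that runs it, not screen-scraper’s own HTTP client configuration.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;

// Hypothetical diagnostic, not part of screen-scraper.
class TlsSupportCheck {

    // Protocol versions this JRE is able to speak.
    static String[] supportedProtocols() throws NoSuchAlgorithmException {
        return SSLContext.getDefault().getSupportedSSLParameters().getProtocols();
    }

    // Protocol versions this JRE turns on by default.
    static String[] enabledProtocols() throws NoSuchAlgorithmException {
        return SSLContext.getDefault().getDefaultSSLParameters().getProtocols();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("Supported: " + Arrays.toString(supportedProtocols()));
        System.out.println("Enabled by default: " + Arrays.toString(enabledProtocols()));
    }
}
```

Comparing the output under an older JRE and under Java 8 shows whether the newer protocol versions a site requires are available.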

Scraping data from various industries

Posted in Miscellaneous on 06.10.13 by Todd Wilson

We’ve just added several new scraping sessions that exemplify extracting data from sites in various industries.  If you go to our home page and click on one of the buttons corresponding to an industry you’ll be taken to a page where you can download the scraping session.  The e-commerce section also has a video to walk you through the process, and we’ll be adding videos to the others shortly.

Apache Commons

Posted in Uncategorized on 05.28.13 by jason

We’ve recently included libraries for Apache Commons Lang. There are a large number of useful things in there, but I find the most use for StringUtils and WordUtils.

For example, some sites one might scrape might have the results in all caps. You could:

// StringUtils and WordUtils both live in the Commons Lang package
import org.apache.commons.lang.*;

name = "GEORGE WASHINGTON CARVER";
// Lowercase first, because capitalize() only changes the first letter of each word
name = StringUtils.lowerCase(name);
name = WordUtils.capitalize(name);
session.log("Name now shows as: " + name);

At the end, the name is formatted as “George Washington Carver”. Almost all of the methods are null-safe, and there are a lot of little tools in there to try.

End-of-year sale!

Posted in Miscellaneous on 11.29.12 by Todd Wilson

This is our biggest sale in quite a while.  Until December 31, 2012 take 40% off Professional Edition licenses and 60% off Enterprise Edition licenses.  Click here to take advantage.

Version 6.0.18a of screen-scraper Released

Posted in Updates on 10.16.12 by Todd Wilson

A few minor updates in this one, along with a long-awaited global find feature!

Let Us Help You Learn screen-scraper

Posted in Uncategorized on 07.19.12 by scottw

We are pleased to announce our new coaching program. To help get started, our new users can receive up to two free hours of one-on-one coaching (click here for details).

Existing users can receive help planning out a project, solving that one tough issue, learning new techniques, and refining current scraping projects. Purchase hours of training by calling our offices at 800-672-0113.

Version 6.0.14a of screen-scraper Released

Posted in Updates on 06.28.12 by Todd Wilson

Several small changes in this one:

  • Extractor patterns invoked manually can now be tested on a sub-set of the HTML page.
  • Added scrapeableFile.setForcePOST.
  • Upgraded internal GWT libraries.
  • Prettied up the web UI.

Check the alpha log for a full list of changes.
