Dynamic Content

Posted in Tips on 10.28.15 by jason

Your first experience with a page full of dynamic content can be pretty confusing. You can usually request the HTML, but the data you're after is missing from it.

What you’re usually seeing is a page containing JavaScript that makes a subsequent HTTP request and adds the returned data to the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.

Since screen-scraper doesn’t run any JavaScript, what you need to do is make that request, and scrape the response. Here is an example:

  1. If you go to http://screen-scraper.com/infinite%20scroller/demo.html you can see my sample page. In this case it’s one of those pages that keeps tacking content onto the end forever, like Facebook or Pinterest.
  2. If you make a scrapeable file of http://screen-scraper.com/infinite%20scroller/demo.html you can get a successful response, but the content text isn’t there.
  3. Now you need to pull out the screen-scraper proxy, and proxy the request. You will see that the one page makes three requests:
    1. http://screen-scraper.com/infinite%20scroller/demo.html -> The landing page
    2. http://screen-scraper.com/infinite%20scroller/scroll.js -> A JavaScript file that is making another request for data. On this one I’m just doing a GET request for a static page. Most of the time you will either see GET requests with parameters or POST requests to get different responses. Sometimes they change up the base URL, etc. There’s no real standard.
    3. http://screen-scraper.com/infinite%20scroller/data.json -> The request that gets the JSON content. Here you can see the format, and the JavaScript is parsing it, and writing it to the landing page for you.

Now you have the response, and in this case it’s JSON that you can either use extractor patterns on, or parse.
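
As a rough illustration of the "parse" option, here is a small sketch that pulls string values out of a JSON response with a regular expression, using only the Java standard library. The format of data.json isn't reproduced here, so the sample payload, the "text" field name, and the JsonFieldExtractor class are all assumptions made for the example. It only copes with simple string values, so extractor patterns or a real JSON library are likely better for anything more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of screen-scraper.
class JsonFieldExtractor {

    // Returns every string value stored under the given key.
    // Handles simple string values only (no escaped quotes inside values).
    static List<String> extract(String json, String key) {
        Pattern pattern = Pattern.compile(
                "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher matcher = pattern.matcher(json);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed sample payload; the real data.json may look different.
        String json = "[{\"id\": 1, \"text\": \"First item\"}, "
                + "{\"id\": 2, \"text\": \"Second item\"}]";
        for (String value : extract(json, "text")) {
            System.out.println(value);
        }
    }
}
```

In a scrape, the string passed to extract would be the body of the data.json response rather than a hard-coded sample.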

HTTPS connection issues

Posted in Updates on 04.29.15 by jason

We’ve been seeing lots of issues with scrapes connecting to HTTPS sites. Some of the errors include

  • ssl_error_rx_record_too_long
  • An input/output error occurred while connecting to https:// … The message was peer not authenticated.
  • javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated

The issue came about when the Heartbleed vulnerability necessitated changes to some HTTPS connections: some types of connection aren’t secure anymore, and new versions have come out. Screen-scraper needed two changes to catch up:

  • Update to use Java 8
  • Update of HTTPClient to 4.4

Both of these are pretty large changes, so they aren’t in the stable release yet. In some cases, however, they are the only option to make a scrape work, so here are the instructions to get what you need. Read the rest of this entry »
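
If you want to check what a given Java installation offers before changing anything, a small diagnostic along these lines may help. It is only a sketch: the TlsCheck class is a made-up name, and it reports on whichever JVM runs it, which is useful only if that is the same Java install screen-scraper uses.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

// Hypothetical diagnostic helper for illustration; not part of screen-scraper.
class TlsCheck {

    // Protocol versions the JVM's default SSL context is able to use.
    static List<String> supportedProtocols() {
        return Arrays.asList(defaultContext().getSupportedSSLParameters().getProtocols());
    }

    // Protocol versions switched on by default for new connections.
    static List<String> defaultProtocols() {
        return Arrays.asList(defaultContext().getDefaultSSLParameters().getProtocols());
    }

    private static SSLContext defaultContext() {
        try {
            return SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSL context available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("Supported protocols: " + supportedProtocols());
        System.out.println("Enabled by default: " + defaultProtocols());
    }
}
```

If the protocol version a site requires is missing from the supported list, that would be consistent with the kind of connection errors listed above, though it doesn't rule out other causes.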

Scraping data from various industries

Posted in Miscellaneous on 06.10.13 by Todd Wilson

We’ve just added several new scraping sessions that exemplify extracting data from sites in various industries.  If you go to our home page and click on one of the buttons corresponding to an industry you’ll be taken to a page where you can download the scraping session.  The e-commerce section also has a video to walk you through the process, and we’ll be adding videos to the others shortly.

Apache Commons

Posted in Uncategorized on 05.28.13 by jason

We’ve recently included libraries for Apache Commons Lang. There are a large number of useful things in there, but I find the most use for StringUtils and WordUtils.

For example, some sites you scrape might return results in all caps. You could fix that with:

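import org.apache.commons.lang.*;

// Example input in all caps, as it might come from a scraped page
name = "GEORGE WASHINGTON CARVER";
name = StringUtils.lowerCase(name);   // lower-case the whole string first
name = WordUtils.capitalize(name);    // then capitalize the first letter of each word
session.log("Name now shows as: " + name);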

At the end, the name is formatted as “George Washington Carver”. Most of the methods are already null-safe, and there are a lot of little tools in there to try.

End-of-year sale!

Posted in Miscellaneous on 11.29.12 by Todd Wilson

This is our biggest sale in quite a while.  Until December 31, 2012 take 40% off Professional Edition licenses and 60% off Enterprise Edition licenses.  Click here to take advantage.

Version 6.0.18a of screen-scraper Released

Posted in Updates on 10.16.12 by Todd Wilson

A few minor updates in this one, along with a long-awaited global find feature!

Let Us Help You Learn screen-scraper

Posted in Uncategorized on 07.19.12 by scottw

We are pleased to announce our new coaching program. To help get started, our new users can receive up to two free hours of one-on-one coaching (click here for details).

Existing users can receive help planning out a project, solving that one tough issue, learning new techniques, and refining current scraping projects. Purchase hours of training by calling our offices at 800-672-0113.

Version 6.0.14a of screen-scraper Released

Posted in Updates on 06.28.12 by Todd Wilson

Several small changes in this one:

  • Extractor patterns invoked manually can now be tested on a sub-set of the HTML page.
  • Added scrapeableFile.setForcePOST.
  • Upgraded internal GWT libraries.
  • Prettied up the web UI.

Check the alpha log for a full list of changes.

New Quick Guide video

Posted in Tips, Updates on 06.15.12 by scottw

We recently released a new Quick Guide video.  In less than three minutes you can get an idea of what it’s like to use screen-scraper.


Version 6.0.6a of screen-scraper Released

Posted in Updates on 05.10.12 by Todd Wilson

Several small changes in this one:

  • Upgraded BeanShell to the latest version.
  • Searches within a proxy session now include notes.
  • Fixed an issue that would cause the workbench to freeze when the breakpoint window was up.
  • Now using global proxy settings if no session proxy settings are found.
  • Improved cookie handling in the proxy server.
  • Fixed a bug that would cause a proxy session to not be completely saved.
  • Added sutil.makeGETRequestNoSessionProxy.
