04.21.08

Screen-Scraping Ethics

Posted in Uncategorized at 12:10 pm by ryans

The internet can be thought of as the world’s largest database, because it is composed of interconnected databases, files, and computer systems. By simply typing in some keywords, one can access hundreds to millions of websites containing treasure troves of facts, statistics, and other kinds of information on an endless array of topics. Because the internet is such a valuable resource, we should seek new and innovative ways to mine the data using ethical means.

You may have never heard of screen-scraping, web-fetching, or web-data extraction, but if you’ve ever surfed the internet, you’ve quite likely been a beneficiary of the technique these terms describe: the increasingly popular practice of methodically retrieving information from the web with specialized tools. Numerous programs, written in many computer languages, exist for the purpose of mining data. Such software often assists users in intercepting HTTP requests and responses by incorporating proxy servers. The software then displays the pages’ source code (HTML, JavaScript, etc.) so that users can extract the desired information. In addition, such software can iterate through pages (sometimes thousands of them), all the while gleaning valuable data in various forms.

The goal of scraping websites is to access information, but the uses of that information can vary. Users may wish to store the information in their own databases or manipulate the data within a spreadsheet. Other users may utilize data extraction techniques as means of obtaining the most recent data possible, particularly when working with information subject to frequent changes. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, or insurance salespeople following insurance prices are a few individuals who might fit this category of users of frequently updated data.

Access to certain information may also provide users with a strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses such as restaurants or video-rental stores that know the locations of competitors can make better decisions about where to focus further growth. Companies that provide complementary (not to be confused with complimentary) products, like software, may wish to know the make, model, cost, and market share of hardware that is compatible with their software.

Another common, but controversial use of information taken from websites is reposting scraped data to other sites. Scrapers may wish to consolidate data from a myriad of websites and then create a new website containing all of the information in one convenient location. In some cases, the new site’s owner may benefit from ads placed on his or her site or from fees charged to access the site. Companies usually go to great lengths to disseminate information about their products or services. So, why would a website owner not wish to have his or her website’s information scraped?

Website owners may have several reasons for not wanting their sites scraped by others (excluding search engines). Some people feel that data reposted to other sites is plagiarized, if not stolen. These individuals may feel that they made the effort to gather information and make it available on their websites only to have it copied to other sites. Are individuals justified in feeling that they have been taken advantage of, even if their websites are posted publicly?

Interpretation of what exactly “republish” means is widely disputed. One of the most authoritative explanations may be found in the 1991 Supreme Court case of Feist Publications v. Rural Telephone Service. This case involved Rural Telephone Service suing Feist Publications for copyright infringement when Feist copied telephone listings after Rural denied Feist’s request to license the information. While information has never been copyrightable under U.S. law, a collection of information, defined mostly in terms of creative arrangement or original ideas, can be copyrighted. The Supreme Court’s ruling in Feist Publications v. Rural Telephone Service stated that “information contained in Rural’s phone directory was not copyrightable, and that therefore no infringement existed.” Justice O’Connor focused on the need for information to have a “creative” element in order to be termed a “collection” (1). Similarly, information taken from publicly available websites should not be considered plagiarism or even theft if only the information (numbers, statistics, etc.) is reposted to new sites or used for other purposes.

Scraped websites also experience an increase in used bandwidth as a result of being scraped. Some scrapes take place once, but many scrapes must be performed over and over to achieve the desired results. In such cases, the servers that host the pages being scraped inevitably experience an increased load. Site owners may not wish to have the increased bandwidth, but more importantly, excessive page requests can cause a web server to function slowly or even fail. Rarely, however, do most scrapes cause such strain on a server on their own. Accessing a page through scraping is no different from visiting a page manually, except that scraping allows more pages to be visited over a shorter period. Additionally, scrapes can be adjusted to run more slowly, so as to minimize the strain on the server. Scraping is usually slowed when more than a few scraping sessions are being run against a single server at one time.
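As a rough illustration of running a scrape more slowly, here is a minimal Java sketch of a throttle that enforces a minimum gap between page requests. The class name PoliteThrottle and its methods are hypothetical and are not part of any scraping tool; the right delay for a given site is a judgment call.

```java
/** Hypothetical helper: enforces a minimum gap between page requests. */
final class PoliteThrottle {
    private final long minDelayMillis;
    private long lastRequestAt;
    private boolean anyRequestYet = false;

    PoliteThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /** How long the caller should still wait before sending the next request. */
    long millisToWait(long nowMillis) {
        if (!anyRequestYet) {
            return 0;
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0, minDelayMillis - elapsed);
    }

    /** Records that a request went out at the given time. */
    void markRequest(long nowMillis) {
        anyRequestYet = true;
        lastRequestAt = nowMillis;
    }

    /** Sleeps until the minimum gap has passed, then records the request. */
    void awaitTurn() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }
}
```

Calling awaitTurn() before each page request keeps a single scrape from hitting the server faster than the chosen pace.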

Interestingly, having one’s website scraped can have positive effects. Of course the recipient of the scraped data is pleased to have the desired data, but owners of scraped sites may also benefit. Think of the case mentioned above in which home listings are scraped from a site. Whether the information is reposted or stored in a database for later querying to match homebuyers’ needs, the purpose of the original site is met: to get the home-listing information into the hands of potential buyers.

Individuals who scrape websites can do so while still following guidelines for ethical data extraction. Perhaps it would be helpful to review a list of tips for maintaining ethical scraping. One website I consulted gave the following suggestions:

· Obey robots.txt.

· Don’t flood a site.

· Don’t republish, especially not anything that might be copyrighted.

· Abide by the site terms of service (2).
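To make the first tip concrete, here is a simplified Java sketch of checking a path against a site’s robots.txt. It is an assumption-laden illustration: the class name RobotsRules is hypothetical, and it only honors Disallow lines in “User-agent: *” groups using plain prefix matching. It ignores Allow lines, wildcards, and agent-specific groups, so a real scraper would want a fuller parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt check (Disallow prefixes for "User-agent: *" only). */
final class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    RobotsRules(String robotsTxt) {
        boolean inWildcardGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Strip comments, then split "field: value".
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group; a new run starts a new group.
                if (!lastWasAgent) {
                    inWildcardGroup = false;
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    /** True if no collected Disallow prefix matches the start of the path. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A scraper would fetch /robots.txt once per site, build the rules, and skip any page for which isAllowed returns false.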

Occasionally, individuals who scrape websites have paid for access to the material being scraped. Many job- and résumé-posting websites fall into this category. Employers must pay a monthly fee for an account which provides access to the résumés of potential new hires. Certainly, the fact that employers pay for the service entitles them to use whatever means are necessary to sort through and record the desired data. The only exception would be where the site’s terms of service specifically prohibit scraping.

While republishing images, artwork, and other original content without permission is unethical and in many cases illegal, using scraped data for personal purposes is certainly within the limits of ethical behavior. Nevertheless, page scrapers should always avoid taking copyrighted materials. Use of bandwidth is no more deserved by any one person than another. Even making scraped data available to others online can be argued as ethical, especially when the scraped website is posted on public space and the data taken doesn’t include any creative content. After all, the purpose of hosting a website in the first place is to provide information.

(1) http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Telephone_Service

(2) http://www.perlmonks.org/?node_id=477825

01.23.08

screen-scraper version 4.0 released!

Posted in Updates at 10:49 am by Todd Wilson

Well, it’s now official.  It’s been just over a full year in development, and we’re now happy to release it to the world.  Thanks to all who have helped in testing alpha versions and provided feedback.

In order to upgrade an existing instance, you’ll need to un-install and re-install.  Take a look at this FAQ for details as to the whys and wherefores.

01.14.08

Version 4.0 of screen-scraper coming soon…

Posted in Updates at 4:50 pm by Todd Wilson

We’re anticipating releasing version 4.0 of screen-scraper quite soon.  Perhaps as soon as this week.  There will be quite a few changes that come along with this.  Aside from the usual new features and bug fixes, we’ll be adding a new edition–screen-scraper Enterprise Edition.  Essentially what is now the latest pre-release version of screen-scraper Professional Edition will become screen-scraper Enterprise Edition.  The new screen-scraper Professional Edition will simply be the Enterprise Edition with a number of features stripped out.  Additionally, those who license the Enterprise Edition will get phone support, as well as a few other non-tangibles.

Along with all of this there will be a pricing change.  The Professional Edition will be available for $399 USD, and the Enterprise Edition will cost $2,499.  Those who licensed screen-scraper Professional Edition before the release of the Enterprise Edition will be eligible for a free upgrade to it (though they will not get the phone support that subsequent licensees will get).  In the interest of fairness, I thought it would be a good idea to point this out prior to the release of 4.0.  Those considering licensing screen-scraper Professional Edition right now might want to consider it a bit more seriously, given the price increase that will take place with the new version.  As always, don’t hesitate to drop us a line with any questions.

11.20.07

Anonymization now built in to screen-scraper

Posted in Updates at 6:29 pm by Todd Wilson

If you’re currently (or will be at some point) dealing with sites for which you’d like to anonymize the scraping process, I’m happy to announce the availability of a very slick anonymization feature built right in to screen-scraper. If you upgrade to version 3.0.65a (try this link if you have trouble upgrading), you’ll now find a new section in the “Settings” window, and a new “Anonymization” tab for scraping sessions. Once you’ve done the initial setup to use the anonymization service, which is pretty quick, it can be as simple as checking the “Anonymize this scrape” check box. See this page in our docs for all of the details.

We’ve tried several different methods for anonymization, and this is by far the simplest, fastest, and most reliable. Drop us a line if you’re interested in making use of it in your own scrapes.

11.12.07

Handling scraped data in real time

Posted in Updates, Tips at 12:40 pm by Todd Wilson

Once screen-scraper extracts data from a web site, typically that data is sent somewhere else. Data is probably most commonly written out to a file, but may also be saved to a database or even submitted to another web site. You can always handle the scraped data in screen-scraper scripts, but what if you want to make use of the data in your own application, which invokes screen-scraper?

In the past, when invoking screen-scraper from a remote application, the process has generally meant sending screen-scraper the request to scrape, waiting for extraction to occur, then handling that extracted data in the application that invoked screen-scraper. It’s that second step that can be a bit hard to deal with–the request to scrape is sent, but the scraped data can’t be touched by the calling application until screen-scraper finishes its work. This can be especially troublesome in cases where the scrapes are long and might even get interrupted in the middle. This is at best inconvenient, and at worst may mean loss of scraped data.

I recently had a flash of inspiration as to how to deal with these cases, and implemented a new feature in the latest alpha version of screen-scraper (3.0.63a) that greatly facilitates handling data in a remote application as it is getting scraped. First, to give a contrary example, consider the method we advocate in our fourth tutorial for invoking screen-scraper remotely to extract data from our shopping web site. The process goes basically like this:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. The “Shopping Site” scraping session runs.
  4. Once the scraping session completes, control returns to the calling application.
  5. The calling application requests the scraped records from screen-scraper.
  6. The scraped records are output by the calling application.

Now consider this possibility:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. While the scraping session runs it sends scraped records back to the calling application, which outputs them as they get scraped.

Hopefully the benefits to the second approach are obvious.

Now on to implementation. Consider this Java class (sorry for the odd formatting):

import com.screenscraper.scraper.*;
import com.screenscraper.common.*;

public class PollTest
{
    public static void main( String args[] )
    {
        PollTest test = new PollTest();
        test.go();

        System.exit( 0 );
    }

    public void go()
    {
        try
        {
            RemoteScrapingSession remoteScrapingSession = new RemoteScrapingSession( "Shopping Site" );
            remoteScrapingSession.setVariable( "SEARCH", "dvd" );
            remoteScrapingSession.setVariable( "PAGE", "1" );
            remoteScrapingSession.setPollFrequency( 1 );
            remoteScrapingSession.setDataReceiver( new MyDataReceiver() );
            remoteScrapingSession.scrape();
            remoteScrapingSession.disconnect();
        }
        catch( Exception e )
        {
            System.err.println( "Exception: " + e.getMessage() );
            e.printStackTrace();
        }
    }

    class MyDataReceiver implements DataReceiver
    {
        public void receiveData( String key, Object value )
        {
            System.out.println( "Got data from ss." );
            System.out.println( "Key: " + key );
            System.out.println( "Value: " + value );
        }
    }
}

The key is the “MyDataReceiver” class, which implements the “DataReceiver” interface. This interface requires the implementation of just one method: receiveData. When the scraping session is configured correctly, this method will get invoked as data is scraped by screen-scraper, allowing you to handle it in your own code. A few other notes on this class:

  • The “setPollFrequency” method indicates how often (in seconds) data should be sent from screen-scraper to the client. The default is five seconds.
  • The “setDataReceiver” method must be called before “scrape” is called.
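Printing each record is enough for a demo, but a receiver can just as easily hold on to what has arrived so far, so that records survive even if a long scrape is interrupted. The sketch below is an illustration only: it declares a local stand-in DataReceiver interface with the single receiveData method described above so that it compiles without screen-scraper.jar, and the CollectingReceiver class is hypothetical. In real code you would drop the stand-in and implement the interface from the jar. The synchronized list is a precaution in case the callback arrives on a different thread, which I have not verified.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for screen-scraper's interface so this sketch compiles on its own.
interface DataReceiver {
    void receiveData(String key, Object value);
}

/** Hypothetical receiver that keeps every record it has been handed so far. */
final class CollectingReceiver implements DataReceiver {
    private final List<String> received =
            Collections.synchronizedList(new ArrayList<String>());

    @Override
    public void receiveData(String key, Object value) {
        // Store a simple "key=value" line; real code might write to a file or database here.
        received.add(key + "=" + value);
    }

    /** Returns a copy of everything received up to this point. */
    List<String> snapshot() {
        synchronized (received) {
            return new ArrayList<String>(received);
        }
    }
}
```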

The implementation in screen-scraper is quite simple. I took the standard “Shopping Site” scraping session from the tutorial, and added the following script:

session.sendDataToClient( "DR", dataRecord );

The script gets invoked after each product is extracted from the web site. The “sendDataToClient” method will accept most any object, including strings, integers, DataRecords, and DataSets.

So far we’ve only implemented this in the Java and PHP drivers, but the others will be forthcoming.

The example source files can be downloaded here, and include both PHP and Java files. If you decide to give this a try, be sure to upgrade to version 3.0.63a of screen-scraper. You’ll want to reference the latest “screen-scraper.jar” or “misc\php\remote_scraping_session.php” files in your code (found inside the folder where screen-scraper is installed).

09.13.07

Anonymization through proxy servers

Posted in Tips at 4:38 pm by Todd Wilson

In certain cases a scrape needs to be anonymized in order to get the data you’re after. Generally this means sending the HTTP requests through one or more proxy servers, over which you may or may not have control (see How to surf and screen-scrape anonymously for more on this). Up to this point, this has been possible in screen-scraper, but the implementation has been relatively inelegant. Because of the needs of a recent client of ours, we’ve taken the time to flesh this out a bit more, so that proxies are now handled much more gracefully in screen-scraper. To use the code cited in this post, you’ll need to upgrade to the latest alpha version of screen-scraper.

The best way to explain is often by example, so here you go:

import com.screenscraper.util.*;

// Create a new ProxyServerPool object. This object will
// control how screen-scraper interacts with proxy servers.
ProxyServerPool proxyServerPool = new ProxyServerPool();

// We give the current scraping session a reference to
// the proxy pool. This step should ideally be done right
// after the object is created (as in the previous step).
session.setProxyServerPool( proxyServerPool );

// This tells the pool to populate itself from a file
// containing a list of proxy servers. The format is very
// simple–you should have a proxy server on each line of
// the file, with the host separated from the port by a colon.
// For example:
// one.proxy.com:8888
// two.proxy.com:3128
// 29.283.928.10:8080
// But obviously without the slashes at the beginning.
proxyServerPool.populateFromFile( "proxies.txt" );

// screen-scraper can iterate through all of the proxies to
// ensure they’re responsive. This can be a time-consuming
// process unless it’s done in a multi-threaded fashion.
// This method call tells screen-scraper to validate up to
// 25 proxies at a time.
proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

// This method call tells screen-scraper to filter the list of
// proxy servers using 7 seconds as a timeout value. That is,
// if a server doesn’t respond within 7 seconds, it’s deemed
// to be invalid.
proxyServerPool.filter( 7 );

// Once filtering is done, it’s often helpful to write the good
// set of proxies out to a file. That way you may not have to
// filter again the next time.
proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

// You might also want to write out the list of proxy servers
// to screen-scraper’s log.
proxyServerPool.outputProxyServersToLog();

// This is the switch that tells the scraping session to make
// use of the proxy servers. Note that this can be turned on
// and off during the course of the scrape. You may want to
// anonymize some pages, but not others.
session.setUseProxyFromPool( true );

// As a scraping session runs, screen-scraper will filter out
// proxies that become non-responsive. If the number of proxies
// gets down to a specified level, screen-scraper can repopulate
// itself. That’s what this method call controls.
proxyServerPool.setRepopulateThreshold( 5 );

During the course of the scrape, you may find that a proxy has been blocked. When this happens, you can make this method call to tell screen-scraper to remove the proxy from the pool:

session.currentProxyServerIsBad();

Given that this feature is still in the alpha version of screen-scraper, there’s a chance we might change around the methods a bit, but, for the most part, you should be able to use it as you see it here.
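For readers who want a feel for the bookkeeping involved, here is a self-contained Java sketch of a toy proxy pool. It is not screen-scraper’s implementation and does not use its classes: SimpleProxyPool and its methods are hypothetical names. It parses host:port lines in the file format described above, hands proxies out in rotation, drops one that has gone bad, and reports when the pool has shrunk to a chosen threshold.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy pool illustrating rotation, removal, and a repopulate threshold. */
final class SimpleProxyPool {
    private final List<String> proxies = new ArrayList<>();
    private final int repopulateThreshold;
    private int next = 0;

    SimpleProxyPool(List<String> lines, int repopulateThreshold) {
        this.repopulateThreshold = repopulateThreshold;
        for (String line : lines) {
            String entry = line.trim();
            int colon = entry.lastIndexOf(':');
            // Require a non-empty host and a non-empty port.
            if (colon <= 0 || colon == entry.length() - 1) {
                continue;
            }
            try {
                int port = Integer.parseInt(entry.substring(colon + 1));
                if (port >= 1 && port <= 65535) {
                    proxies.add(entry);
                }
            } catch (NumberFormatException e) {
                // Not a numeric port; skip the line.
            }
        }
    }

    /** Returns the next proxy in round-robin order. */
    String nextProxy() {
        if (proxies.isEmpty()) {
            throw new IllegalStateException("proxy pool is empty");
        }
        int i = next % proxies.size();
        next = i + 1;
        return proxies.get(i);
    }

    /** Removes a proxy that has been blocked or stopped responding. */
    void markBad(String proxy) {
        proxies.remove(proxy);
    }

    /** True once the pool has shrunk to the threshold or below. */
    boolean needsRepopulating() {
        return proxies.size() <= repopulateThreshold;
    }

    int size() {
        return proxies.size();
    }
}
```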

It also might be of interest to note that we’ve done a slightly extended implementation of this technique that we’re using internally, which makes use of Amazon’s EC2 service. This allows us to have a pool of high speed proxy servers at an arbitrary quantity. As the proxy servers get blocked, they can be automatically terminated, with others spawned to replace them.

07.06.07

Methods to hinder scraping

Posted in Uncategorized at 4:20 pm by jason

Sometimes we’re asked how one might hinder a person who is trying to scrape data from their site. (The irony, of course, is that it comes from people who contacted me to scrape data for them.) The standard answer is that if you’re publishing data for the world to see, it can be scraped. There’s no stopping it … but it can be made harder. We’ve seen a variety of methods that make things more difficult:

Turing tests

The most common implementation of the Turing test is the old CAPTCHA that tries to make a human read the text in an image and fill it into a form. The idea is to determine if you are man or machine. We have found a large number of sites that implement a very weak CAPTCHA that takes only a few minutes to get around. On the other hand, there are some very good implementations of Turing tests that we would opt not to deal with given the choice, though sophisticated OCR can sometimes overcome those, and many bulletin board spammers have clever tricks to get past them.

Data as images

Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace such text with an image. As with the Turing test, there is OCR software that can read it, and there’s no reason we can’t save the image and have someone read it later.

Sometimes this doesn’t work out, however, as it makes a site less accessible to the disabled.

Code obfuscation

Using something like a JavaScript function to show data on the page even though it’s not anywhere in the HTML source is a good trick. Other examples include scattering prolific, extraneous comments through the page or having an interactive page that orders things in an unpredictable way (the example I have in mind used CSS to make the display the same no matter the arrangement of the code).

Limit search results

Most of the data we want to get at is behind some sort of form. Some are easy, and submitting a blank form will yield all of the results. Some need an asterisk or percent sign put in the form. The hardest ones are those that will give you only so many results per query. Sometimes we just make a loop that will submit the letters of the alphabet to the form, but if that’s too general, we must make a loop to submit all combinations of 2 or 3 letters. With 3 letters, that’s 17,576 page requests.
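The arithmetic is easy to check: 26 single letters, 26 × 26 = 676 two-letter combinations, and 26 × 26 × 26 = 17,576 three-letter combinations. A small Java sketch that generates the search terms (the SearchTerms class is a hypothetical helper, and actually submitting each term to a form is left out) might look like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds every lowercase a-z search term of a given length. */
final class SearchTerms {
    /** Returns all terms of exactly the given length, in alphabetical order. */
    static List<String> ofLength(int length) {
        List<String> terms = new ArrayList<>();
        build(new char[length], 0, terms);
        return terms;
    }

    // Fills the buffer one position at a time, trying each letter in turn.
    private static void build(char[] buffer, int position, List<String> out) {
        if (position == buffer.length) {
            out.add(new String(buffer));
            return;
        }
        for (char c = 'a'; c <= 'z'; c++) {
            buffer[position] = c;
            build(buffer, position + 1, out);
        }
    }
}
```

SearchTerms.ofLength(3) yields the 17,576 terms from "aaa" through "zzz", each of which would be one page request.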

IP Filtering

On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from that address.
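The noticing can also be automated. As a rough sketch of the idea from the site owner’s side (the IpRequestCounter class is hypothetical, and real servers usually rely on their own rate-limiting features), a fixed-window counter per address could look like this in Java:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fixed-window counter: flags an IP that exceeds a request budget. */
final class IpRequestCounter {
    private final int maxRequestsPerWindow;
    private final long windowMillis;
    // Maps an IP address to {window start time, requests seen in that window}.
    private final Map<String, long[]> windows = new HashMap<>();

    IpRequestCounter(int maxRequestsPerWindow, long windowMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Counts the request and returns false once the address is over its budget. */
    boolean allow(String ip, long nowMillis) {
        long[] window = windows.get(ip);
        if (window == null || nowMillis - window[0] >= windowMillis) {
            // Start a fresh window for this address.
            window = new long[] { nowMillis, 0 };
            windows.put(ip, window);
        }
        window[1]++;
        return window[1] <= maxRequestsPerWindow;
    }
}
```

This is also why a scrape that paces itself, or spreads its requests across several proxies, is much less likely to trip such a filter.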

Sometimes these techniques work simply because they increase the effort required, and the data doesn’t merit the work involved. Nevertheless, if you have something that you really don’t want a scraper to access, the only foolproof way of keeping it safe is to resist publishing it.

07.05.07

Version 3.0.31a of screen-scraper available

Posted in Updates at 11:46 am by Todd Wilson

This one’s definitely a recommended upgrade. It contains a few bug fixes that should remedy some obnoxious behavior you might notice in the previous alpha version.

Aside from bug fixes, this version now allows for sub-extractor patterns to be applied in sequence. It’s not something that’s often needed, but once in a while it can be handy (and even necessary).

Feel free to give it a try and let us know of any more trouble. As always, be sure to back up your work before upgrading to an alpha version.

06.11.07

Version 3.0.28a of screen-scraper available

Posted in Updates at 5:12 pm by Todd Wilson

Aside from a few bug fixes and other niceties since the last announced alpha, this one now handles file uploads in the proxy server. This isn’t a really common need, but it’s been a hole in screen-scraper’s functionality that, happily, is now filled.

Feel free to give it a try and let us know of any more trouble. As always, be sure to back up your work before upgrading to an alpha version.

05.21.07

Version 3.0.22a of screen-scraper available

Posted in Updates at 6:05 pm by Todd Wilson

Thank goodness for alpha versions! That last alpha (3.0.21a) contained somewhat of a nasty bug that would wipe out sub-extractor patterns. Apologies to anyone negatively affected by that (but thanks for helping us test the bleeding edge version!). The upside is that we’ve made a fix for that in version 3.0.22a of screen-scraper. Please give it a try and let us know of any more trouble. As always, be sure to back up your work before upgrading to an alpha version.

On a happier note, version 3.0.22a contains a number of changes that Mac users should love. I recently migrated to a Mac as my primary machine (one of the best decisions I’ve made, by the way), and I’m especially happy about the new changes we’ve made. screen-scraper finally looks, feels and behaves like a Mac app should. This includes menus, look ‘n feel, and keyboard shortcuts. The last little bit we need to take care of is a screen-scraper icon for the dock. Watch for that in a future version.

Speaking of Macs, this is totally unrelated, but I have to share it, only because I’ve been looking for this since I switched. If you haven’t tried it yet, run, do not walk, to binarynights and download ForkLift. This is easily one of the most useful Mac apps I’ve run across.
