Screen-Scraping Ethics

Posted in Uncategorized on 04/21/08 by ryans

The internet can be thought of as the world’s largest database, because it is composed of interconnected databases, files, and computer systems. By simply typing in some keywords, one can access hundreds to millions of websites containing treasure troves of facts, statistics, and other information on an endless array of topics. Because the internet is such a valuable resource, we should seek new and innovative ways to mine its data using ethical means.

You may never have heard of screen-scraping, web-fetching, or web-data extraction, but if you’ve ever surfed the internet, you’ve quite likely benefited from the technique these terms describe: the increasingly popular practice of methodically retrieving information from the web with specialized tools. Numerous programs, written in many computer languages, exist for the purpose of mining data. Such software often uses a proxy server to help users intercept HTTP requests and responses. The software then displays the pages’ source code (HTML, JavaScript, etc.) so that users can extract the desired information. In addition, such software can help iterate through pages (sometimes thousands of them), all the while gleaning valuable data in various forms.
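As a rough illustration of what "extracting the desired information" from a page's source can look like, here is a minimal Python sketch that pulls link targets out of HTML using only the standard library. The class name `LinkExtractor`, the helper `extract_links`, and the sample markup are assumptions made up for this example; they do not come from any particular scraping product.

```python
from html.parser import HTMLParser
from typing import List


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees (illustrative name)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return the link targets found in a string of HTML, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="/page/2">Next</a> <a href="/page/3">More</a></p>'
    print(extract_links(sample))
```

A scraper that iterates through many pages would typically feed each downloaded page to something like this and then follow the "next page" links it finds.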

The goal of scraping websites is to access information, but the uses of that information can vary. Users may wish to store the information in their own databases or manipulate the data within a spreadsheet. Other users may utilize data extraction techniques as a means of obtaining the most recent data possible, particularly when working with information subject to frequent changes. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, and insurance salespeople following insurance prices are a few of the people who might rely on frequently updated data in this way.

Access to certain information may also provide users with a strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses such as restaurants or video-rental stores that know the locations of their competitors can make better decisions about where to focus further growth. Companies that provide complementary (not to be confused with complimentary) products, like software, may wish to know the make, model, cost, and market share of the hardware that is compatible with their software.

Another common but controversial use of information taken from websites is reposting scraped data to other sites. Scrapers may wish to consolidate data from a myriad of websites and then create a new website containing all of the information in one convenient location. In some cases, the new site’s owner may benefit from ads placed on his or her site or from fees charged to access the site. Companies usually go to great lengths to disseminate information about their products or services. So why would a website owner not wish to have his or her website’s information scraped?

There are several reasons why website owners may not wish to have their sites scraped by others (excluding search engines). Some people feel that data reposted to other sites is plagiarized, if not stolen. These individuals may feel that they made the effort to gather information and make it available on their websites, only to have it copied to other sites. Are individuals justified in feeling that they have been taken advantage of, even if their websites are posted publicly?

Interpretation of what exactly “republish” means is widely disputed. One of the most authoritative explanations may be found in the 1991 Supreme Court case of Feist Publications v. Rural Telephone Service. In this case, Rural Telephone Service sued Feist Publications for copyright infringement after Feist copied telephone listings once Rural had denied Feist’s request to license the information. While information has never been copyrightable under U.S. law, a collection of information, defined mostly in terms of creative arrangement or original ideas, can be copyrighted. The Supreme Court’s ruling in Feist Publications v. Rural Telephone Service stated that “information contained in Rural’s phone directory was not copyrightable, and that therefore no infringement existed.” Justice O’Connor focused on the need for information to have a “creative” element in order to be termed a “collection” (1). Similarly, information taken from publicly available websites should not be considered plagiarism or even theft if only the information (numbers, statistics, etc.) is reposted to new sites or used for other purposes.

Scraped websites also experience an increase in bandwidth use as a result of being scraped. Some scrapes take place once, but many must be performed over and over to achieve the desired results. In such cases, the servers that host the pages being scraped inevitably experience an increased load. Site owners may not welcome the extra bandwidth use, but more importantly, excessive page requests can cause a web server to function slowly or even fail. Rarely, however, does a single scrape cause such strain on a server on its own. Accessing a page through scraping is no different from visiting a page manually, except that scraping allows more pages to be visited over a shorter period. Additionally, scrapes can be adjusted to run more slowly, so as to minimize the strain on the server. Scraping is usually slowed when more than a few scraping sessions are being run against a single server at one time.
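One simple way to run a scrape "more slowly" is to pause between page requests. The sketch below shows the idea in Python with the standard library only. The function names `fetch` and `polite_scrape`, the two-second default delay, and the example URLs are assumptions chosen for illustration, not settings recommended by any particular tool.

```python
import time
import urllib.request
from typing import Callable, Iterable, List


def fetch(url: str) -> bytes:
    """Download one page using only the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def polite_scrape(
    urls: Iterable[str],
    delay_seconds: float = 2.0,  # assumed default; tune for the target site
    fetch_page: Callable[[str], bytes] = fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first, so the server
            # never receives two requests from us back to back.
            sleep(delay_seconds)
        pages.append(fetch_page(url))
    return pages


if __name__ == "__main__":
    # Hypothetical URLs; replace with pages you are permitted to scrape.
    for page in polite_scrape(["http://example.com/", "http://example.com/"]):
        print(len(page), "bytes")
```

Passing the fetch and sleep functions in as parameters keeps the pacing logic separate from the network code, which also makes it easy to try out without touching a real server.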

Interestingly, having one’s website scraped can have positive effects. Of course the recipient of the scraped data is pleased to have the desired data, but owners of scraped sites may also benefit. Think of the case mentioned above in which home listings are scraped from a site. Whether the information is reposted or stored in a database for later querying to match homebuyers’ needs, the purpose of the original site is met: to get the home-listing information into the hands of potential buyers.

Individuals who scrape websites can do so while still following guidelines for ethical data extraction. It may help to review a short list of tips for scraping ethically. One website I consulted gave the following suggestions:

· Obey robots.txt.

· Don’t flood a site.

· Don’t republish, especially not anything that might be copyrighted.

· Abide by the site terms of service (2).
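The first suggestion, obeying robots.txt, can be checked in code before a scrape starts. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The sample robots.txt rules, the user-agent string `example-bot`, and the helper name `build_parser` are assumptions invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in (
        "http://example.com/listings/1.html",
        "http://example.com/private/page.html",
    ):
        allowed = rules.can_fetch("example-bot", url)
        print(url, "allowed" if allowed else "disallowed")
    # Some sites also state how long to wait between requests.
    print("requested delay:", rules.crawl_delay("example-bot"))
```

Against a live site, the same parser can load the file itself with `set_url(...)` followed by `read()`. Note that robots.txt covers only the first suggestion; the site's terms of service still have to be read by a person.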

Occasionally, individuals who scrape websites have paid for access to the material being scraped. Many job- and résumé-posting websites fall into this category. Employers must pay a monthly fee for an account that provides access to the résumés of potential new hires. In my view, the fact that employers pay for the service entitles them to use whatever means are necessary to sort through and record the desired data. The only exception would be where the site’s terms of service specifically prohibit scraping.

While republishing images, artwork, and other original content without permission is unethical and in many cases illegal, using scraped data for personal purposes is certainly within the limits of ethical behavior. Nevertheless, page scrapers should always avoid taking copyrighted materials. No one person is more entitled to a public site’s bandwidth than another. Even making scraped data available to others online can be argued to be ethical, especially when the scraped website is publicly accessible and the data taken doesn’t include any creative content. After all, the purpose of hosting a website in the first place is to provide information.

(1) http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Telephone_Service

(2) http://www.perlmonks.org/?node_id=477825

5 Comments »

  1. Peter said,

    April 23, 2008 at 6:00 pm

    Great read. I think reposting of data already in the public domain should be ok as long as the originating site is referenced.

    My 2cents.

    Peter

  2. WebScraper said,

    February 25, 2009 at 4:47 pm

    good one.
    How can I adjust screen-scraper to run more slowly, so as to minimize the strain on the server?

    thanks,

  3. Todd Wilson said,

    February 25, 2009 at 5:53 pm

    Hi,

    The easiest way would probably be to sprinkle calls to session.pause (http://community.screen-scraper.com/API/pause) throughout your scripts.

    Todd

  4. webscraper said,

    March 4, 2009 at 1:53 am

    thanks, Todd

  5. shay rapaport said,

    November 24, 2010 at 5:08 pm

    Indeed, some sites wish not to be scraped. In many cases, the courts won’t be of help, and common technological measures will fail to nail the trickiest bots.

    The company I have co-founded, SiteBlackBox, offers web publishers a service which recognizes and blocks bots in real time.

    I guess we have the only effective measure against unethical scrapers.
