Methods to hinder scraping

Posted in Uncategorized on 07/06/07 by jason

Sometimes we’re asked how one might hinder a person who is trying to scrape data from a site. (The irony, of course, is that the question comes from people who contacted us to scrape data for them.) The standard answer is that if you’re publishing data for the world to see, it can be scraped. There’s no stopping it … but it can be made harder. We’ve seen a variety of methods that make things more difficult:

Turing tests

The most common implementation of the Turing Test is the familiar CAPTCHA, which asks a human to read the text in an image and type it into a form. The idea is to determine whether you are man or machine. We have found a large number of sites that implement a very weak CAPTCHA that takes only a few minutes to get around. On the other hand, there are some very good implementations of Turing Tests that we would opt not to deal with given the choice, though sophisticated OCR can sometimes overcome those, and many bulletin board spammers have clever tricks to get past them.

Data as images

Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace such text with an image. As with the Turing Test, there is OCR software that can read it, and there’s no reason we can’t save the image and have someone read it later.

This approach has a cost, however: text that exists only as an image can’t be read by screen readers, which makes the site less accessible to people with disabilities.

Code obfuscation

Using something like a JavaScript function to show data on the page even though it isn’t anywhere in the HTML source is a good trick. Other examples include scattering plentiful, extraneous comments through the page, or having an interactive page that orders things in an unpredictable way (the example I think of used CSS to make the display the same no matter the arrangement of the code).
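As a rough illustration of the JavaScript trick, here is a minimal Python sketch that generates a page fragment whose text only appears once the browser runs a script. The function name `obfuscated_span`, the element id `price`, and the choice of base64 are all assumptions made for this example, not a description of any particular site. It also assumes the text is plain ASCII and the id is a simple identifier.

```python
import base64
import json


def obfuscated_span(text, element_id="price"):
    """Return HTML that shows `text` only after the browser runs JavaScript.

    Assumes `text` is ASCII and `element_id` is a simple identifier.
    """
    # Base64 keeps the literal text out of the HTML source.
    payload = base64.b64encode(text.encode("ascii")).decode("ascii")
    return (
        f'<span id="{element_id}"></span>\n'
        f"<script>document.getElementById({json.dumps(element_id)})"
        f".textContent = atob({json.dumps(payload)});</script>"
    )


snippet = obfuscated_span("$19.99")
print(snippet)
```

A scraper that only reads the raw HTML will not find the value, but one that runs the script (or simply decodes the payload) still will, which is the point made above: this raises the effort, it doesn’t prevent scraping.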

Limit search results

Most of the data we want to get at is behind some sort of form. Some are easy, and submitting a blank form will yield all of the results. Some need an asterisk or percent sign put in the form. The hardest ones are those that will give you only so many results per query. Sometimes we just make a loop that will submit the letters of the alphabet to the form, but if that’s too general, we must make a loop to submit all combinations of 2 or 3 letters. That’s 676 page requests for two letters and 17,576 for three.
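To make the arithmetic concrete, here is a minimal Python sketch of generating the search terms for such a loop. The function name `letter_queries` is hypothetical, and the actual form submission is left out since it depends entirely on the target site.

```python
from itertools import product
from string import ascii_lowercase


def letter_queries(length):
    """Yield every lowercase letter combination of the given length."""
    for combo in product(ascii_lowercase, repeat=length):
        yield "".join(combo)


# Count how many form submissions each strategy would need.
counts = {n: sum(1 for _ in letter_queries(n)) for n in (1, 2, 3)}
print(counts)  # {1: 26, 2: 676, 3: 17576}
```

Each extra letter multiplies the number of requests by 26, which is why a low results-per-query cap makes this kind of scrape expensive.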

IP Filtering

On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from that address.
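On the webmaster’s side, the noticing can be automated. The following is a minimal sketch, assuming a simple sliding-window count per address; the class name `IpRateFilter`, the thresholds, and the sample address are all made up for illustration, and a real site would more likely do this in its web server or firewall.

```python
from collections import defaultdict, deque


class IpRateFilter:
    """Block an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip, now):
        """Record a request at time `now` (seconds); return False if blocked."""
        if ip in self.blocked:
            return False
        stamps = self.recent[ip]
        stamps.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_requests:
            self.blocked.add(ip)
            return False
        return True


# Hypothetical traffic: one address sends five requests in five seconds.
flt = IpRateFilter(max_requests=3, window_seconds=60.0)
results = [flt.allow("203.0.113.7", now=float(t)) for t in range(5)]
print(results)  # [True, True, True, False, False]
```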

Sometimes these techniques work simply because they increase the effort required to the point where the data doesn’t merit the work involved. Nevertheless, if you have something that you really don’t want a scraper to access, the only foolproof way of keeping it safe is to not publish it.
