03.27.07

Ruby Tuesday

Posted in Updates at 10:30 am by Todd Wilson

We’ve had requests for this in the past, and I’m happy to report that we’re now able to deliver. You can now invoke screen-scraper from a Ruby script, via a driver we’ve just released. This was cooked up by a former employee of ours (thanks Adam!), who needed the functionality, and graciously donated his code. We’re still working a bit on documentation and example code, but if you already know some Ruby it should be pretty straightforward to use. To get started visit our Invoking screen-scraper from Ruby page.

03.16.07

How to Measure Anything

Posted in Thoughts at 11:35 am by Todd Wilson

A while back I was contacted by Douglas Hubbard regarding a book he was writing entitled How to Measure Anything. He was interested in finding out more about tools that could automate online data collection, and screen-scraping popped up on his list as one method to go about this. Last week Douglas contacted me indicating that he was essentially done with the work, and it was on its way to press. He sent me a recent draft copy, and asked if I might blog a bit about it. I happily consented, and, I have to admit, I’ve really enjoyed what I’ve read so far.

Before digging into my commentary, I thought I’d include a snippet from the book that deals specifically with screen-scraping:

There is quite a lot of information on the internet and it changes fast. If you use a standard search engine, you get a list of websites, but that’s it. Suppose, instead, you needed to measure the number of times your firms name comes up in certain news sites or measure the blog traffic about a new product. You might even need to use this information in concert with other specific data reported in structured formats on other sites such as economic data from government agencies, etc.

Internet “Screen-scrapers” are a way to gather all this information on a regular basis without hiring a 24×7 staff of interns to do it all. You could use a tool like this to track used-market versions of your product on www.ebay.com, correlate your stores sales in different cities to the local weather by screen-scraping data from www.weather.com , or even just the number of hits on your firms name on various search engines hour-by-hour. As a search on the internet will reveal, there are several examples on the web of “mashups” where data is pulled from multiple sources and presented in a way that provides new insight. A common angle with mashups now is to plot information about business, real estate, traffic, and so on against a map site like Mapquest or Google Earth. I’ve found a mashup of Google Earth and real-estate data on www.housingmaps.com that allows you to see recently sold home prices on a map. Another mashup on socaltech.com shows a map that plots locations of businesses that recently received venture capital. At first glance, someone might think these are just for looking to buy a house or find a job with a new company. But how about research for a construction business or forecasting business growth in a new industry? We are limited only by our resourcefulness.

You can imagine almost limitless combinations of analysis by creating mashups of sites like Myspace and/or YouTube to measure cultural trends or public opinion. Ebay gives us tons of free data about behavior of sellers, buyers and what is being bought and sold and there are already several powerful analytical tools to summarize all the data on Ebay. Comments and reviews of individual products on the sites of Sears, Walmart, Target, and Overstock.com are a source of free input from consumers if we are clever enough to exploit it. The mind reels.

If you step back from it, fundamentally screen-scraping simply deals with repurposing information. The information you’re after happens to be in a format that makes it less usable, and screen-scraping allows you to put it in a format that is more usable. As Douglas points out, the ability to do this leads to infinite possibilities.

He touches on a few basic reasons for doing screen-scraping:

  1. Watching information as it changes over time.
  2. Aggregating data into a single repository.
  3. Combining information from multiple sources in such a way that the whole is greater than the sum of the parts.

Chances are, any one of us could come up with all kinds of examples of each, and many of them would apply directly to the type of work we do. Every industry deals with information. It’s likely that some of the information you deal with on a day-to-day basis would be more useful to you if it could be repurposed in one of the three ways listed above. How would your business benefit if you could be notified when one of your products is mentioned? How much time could you save if you were able to take any existing set of data you deal with frequently, and enrich it by aggregating information onto it? For example, you might take real estate property listings, and enhance them by adding information for each property that can be readily obtained from a county assessor’s web site. The end product could be quite useful, but it would be unreasonable to manually copy and paste the information from the web site. Screen-scraping allows this kind of thing to be done in an automated fashion.

How to Measure Anything isn’t available just yet, but I’d highly recommend keeping an eye out for it. If you work in an industry that deals with information and measurement (and I can’t think of one that doesn’t), you’d likely benefit from the principles Douglas teaches. Keep an eye on his How to Measure Anything web site for updates, or if you’d like to pre-order the book.

03.01.07

How to surf and screen-scrape anonymously

Posted in Tips at 2:33 pm by Todd Wilson

Well, this is a topic I’ve been meaning to address for quite a while, and a recent support request on the topic pushed me to finally get it done.

What I’ll describe in this article applies to our own screen-scraper software, but would also apply to most any other screen or web scraping software you might use. Most of it would even apply to web surfing in general.

Why surf or scrape anonymously?

There are a number of reasons why you may want to remain anonymous. It’s often a good idea to protect privacy when concerned about identity theft. You might be scraping from a competitor’s web site, and don’t want them to be able to identify you. Some web sites disallow too many requests from the same client, so you might be trying to circumvent such mechanisms.

I’ll issue a little caveat here by pointing out that, like many other tools, screen-scraping software can be used for good or ill. If you find yourself doing a lot of anonymous scraping, you may want to examine the legitimacy of what you’re doing. Scraping tools can be very useful, but don’t abuse them.

How do web sites discourage screen-scraping?

There are a number of mechanisms that web sites will use to attempt to discourage screen-scraping. Here are the ones I can think of off the top of my head:

  • User tracking through cookies. A web site can easily plant a cookie, then track the number of requests you make by incrementing a server-side value attached to the cookie.
  • User tracking by IP address. A slightly less reliable method used by sites is to track the number of requests you make by associating them with your IP address. I say it’s slightly less reliable because you could potentially have multiple client requests originating from the same IP address (e.g., if they’re all connecting through a common gateway).
  • CAPTCHA mechanisms. There are a number of different types, and many are very difficult to circumvent. They’re also not very common, however.
  • Authentication. This one dovetails with tracking through cookies, but is a slight variation in that some sites will require authentication before allowing access to the information you want to scrape. If a site doesn’t require authentication, you might simply be able to block cookies. If it does, blocking cookies generally isn’t an option, so this one can be tricky to deal with.
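
To make the first two mechanisms a little more concrete, here is a minimal sketch of how a site might count requests per client. It is only an illustration under assumptions of my own: the `RequestTracker` class, its parameters, and the sliding-window approach are hypothetical and not taken from any particular web site or framework. The client key could be a cookie value or an IP address.

```python
import time
from collections import defaultdict, deque


class RequestTracker:
    """Hypothetical server-side counter: allows at most `max_requests`
    per client key within a sliding window of `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps a client key (cookie value or IP address) to request timestamps.
        self._hits = defaultdict(deque)

    def allow(self, client_key, now=None):
        """Record a request and return True if it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_key]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A site keyed on cookies would pass the cookie value as `client_key`, which is why rejecting cookies defeats it. A site keyed on IP address would pass the remote address instead, which is why clients behind a common gateway end up sharing one counter.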

Great. So how do I scrape anonymously?

The method(s) you’ll want to avail yourself of to scrape anonymously will depend on what the web site is using (if anything) to attempt to discourage scraping. I’ll describe below the techniques I’d recommend, along with when they’d make the most sense.

Hide your real IP address

How to do it: This is probably the most common technique you’ll use, so I’ll address it first. Every page request has to originate from an IP address, but it doesn’t necessarily need to be your real IP address. There are a few different ways you can trick the web server into thinking the HTTP request is coming from a different IP address:

  • Send the request through a proxy server. There are lots of them out there. Most HTTP clients (e.g., a web browser or screen-scraping software) can be set up to send requests through an HTTP or SOCKS proxy server. Given that this is one of the more common techniques, I’ll also describe a few specific approaches:
    • Send all requests through the same proxy server. If you Google around a bit you can find lists of anonymous proxy servers. Find one that seems to be reliable, then set up your scraping software to send all requests through it. There are also tools that will take a list of proxy servers, then tell you which ones are working, faster, more reliable, etc.
    • Send requests through an application that cycles through proxy servers. These applications act as a proxy server, but with each request they’ll cycle it through a different proxy server. You provide a list and it simply iterates through them one by one. MultiProxy is a bit dated, but it’s one I can think of offhand. This can also be done in our screen-scraper software by simply placing a “proxies.txt” file in screen-scraper’s installation folder. The file should contain a proxy server on each line in the format [host or IP address]:[port] (e.g., myproxy.com:8080).
    • Use tor/privoxy. This little tool can be a gem, but please don’t abuse it. It provides stronger anonymity than regular proxy servers, but may not be quite as fast.
    • Use browser-based anonymization services. There are quite a few online services that allow you to punch in a web address, they send the request from their server, then display the response to you. You likely wouldn’t use this technique for scraping, but it might be useful for a few quick requests from your web browser.
  • Use a virtual private network. This allows you to send all outgoing Internet traffic through a machine external to yours, and will cause the web server you’re scraping to think the request is coming from that computer and not yours. You might already have access to a VPN you can use, but more than likely you’ll just need to pay a bit to use someone else’s. This is probably the best technique for completely anonymizing any HTTP requests you might make, but does have the disadvantage that you won’t be able to cycle through IP addresses. That is, if you want a new IP address you’ll have to disconnect from and reconnect to the network. Two services of this type that I know of are StrongVPN and Relakks. We’ve used Relakks before and have had positive results.
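
For the proxy-cycling approach, the idea can be sketched in a few lines of Python. This is not how screen-scraper or MultiProxy is implemented. It only illustrates reading a list in the [host or IP address]:[port] format described above and rotating through it. The function names are hypothetical, and the proxy addresses in the usage note are placeholders.

```python
import itertools
import urllib.request


def parse_proxy_lines(lines):
    """Turn 'host:port' lines into (host, port) tuples, skipping blank lines."""
    proxies = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        host, _, port = line.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {line!r}")
        proxies.append((host, int(port)))
    return proxies


def proxy_cycle(proxies):
    """Return an endless iterator that visits the proxies one by one."""
    return itertools.cycle(proxies)


def opener_for(proxy):
    """Build a urllib opener that sends HTTP(S) requests through one proxy."""
    host, port = proxy
    address = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": address, "https": address})
    return urllib.request.build_opener(handler)
```

In use, you would call `next()` on the cycle before each request, build an opener for that proxy, and fetch the page with `opener.open(url, timeout=...)`. A real tool would also want to drop proxies that stop responding, which this sketch leaves out.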

When to do it: This is probably the most common technique, and you should use it any time you want to prevent the web server you’re working with from tracing requests back to you.

It should be noted that this technique is not foolproof. If you’re simply sending requests through an HTTP proxy server, there’s nothing stopping the owner of the proxy server from recording your request and IP address, then divulging the information to others so that the request can be traced back to you. Tools like tor can provide a greater degree of anonymity, but even that isn’t bulletproof. I recently read of an exploit a researcher found in tor that would allow traffic sent through it to be monitored. The strongest method of anonymity is probably the VPN, but, again, that assumes that the owners of the VPN service will keep private any traffic you send through them.

Block cookies

How to do it: This one’s pretty easy. If you’re using a web browser, just find the setting that indicates that all cookies should be blocked. Most screen-scraping software will (or should) also provide a way to do this.

When to do it: If the web site you’re working with is tracking you through cookies, you can simply reject them all. This likely will only work on relatively unsophisticated sites. Most sites trying to discourage screen-scraping will track your IP address.
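
If you’re writing your own scraping code rather than using a packaged tool, rejecting cookies usually comes down to a policy setting. As one hedged example, Python’s standard library lets you give a cookie jar a policy with an empty list of allowed domains, which causes every cookie to be refused. The helper names and the `FakeResponse` class below are hypothetical and exist only so the behavior can be demonstrated without touching the network.

```python
import email.message
import http.cookiejar
import urllib.request


def make_cookie_blocking_jar():
    """Return a cookie jar whose policy accepts cookies from no domain at all."""
    policy = http.cookiejar.DefaultCookiePolicy(allowed_domains=[])
    return http.cookiejar.CookieJar(policy)


def make_cookie_blocking_opener():
    """Return a urllib opener that silently discards every cookie it is sent."""
    processor = urllib.request.HTTPCookieProcessor(make_cookie_blocking_jar())
    return urllib.request.build_opener(processor)


class FakeResponse:
    """Stand-in for an HTTP response carrying a single Set-Cookie header."""

    def __init__(self, set_cookie):
        self._headers = email.message.Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        return self._headers


def cookies_stored(jar, url, set_cookie):
    """Feed one Set-Cookie header to the jar and report how many cookies it kept."""
    request = urllib.request.Request(url)
    jar.extract_cookies(FakeResponse(set_cookie), request)
    return len(jar)
```

Passing a default `http.cookiejar.CookieJar()` to `cookies_stored` keeps the cookie, while the jar from `make_cookie_blocking_jar()` keeps nothing, so the site has no cookie to attach a request count to.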

Avoid authentication

How to do it: If you’re authenticated to a web site, you’re likely not blocking cookies, so the web site will be able to track you.

When to do it: This is probably obvious, but, if you don’t need to authenticate, don’t. That eliminates one other method whereby a site can track you.

In some cases it’s simply not possible to avoid authentication. In these cases, unfortunately, there may not be anything you can do to stay anonymous. Your best bet would probably be to hide your IP address (as described above), which may also require logging in and out of the site each time you acquire a new IP address.

Look for ways to circumvent CAPTCHA mechanisms

How to do it: In cases where a CAPTCHA mechanism is poorly implemented, it may be possible to determine how to circumvent it programmatically (i.e., in programming code). A common CAPTCHA method is to present the user with a series of numbers or characters in a pattern such that a machine wouldn’t be able to read it. In a handful of cases in the past we’ve found that the server simply uses a naming convention with the CAPTCHA images, such that it’s possible to determine what the image says without requiring that a human read it.

Yet another fairly inefficient way of dealing with a CAPTCHA would be to capture the portion of the page containing the CAPTCHA, present it to a human being, have the person type in whatever the CAPTCHA requires, then make the request. We’ve never used this technique (and likely never would), but it’s technically possible to do.

When to do it: If a site is using a CAPTCHA, examine the HTML closely. Refresh the page multiple times to see how it changes. If you’re lucky, there will be a way to circumvent it in code. More than likely, though, you’d simply have to have a human being deal with it.

Behave yourself

So there you have it. I’ve just pointed out a number of tools and techniques to remain anonymous online. Like I said before, don’t abuse them. There are some very legitimate reasons for wanting to do this, but there are a whole host of reasons why you shouldn’t. Part of me says I shouldn’t even be divulging any of this, but I’m not telling you anything you couldn’t find out on your own. So be nice. Behave yourself.