How to surf and screen-scrape anonymously
Well, this is a topic I’ve been meaning to address for quite a while, and a recent support request on the topic pushed me to finally get it done.
What I’ll describe in this article applies to our own screen-scraper software, but would also apply to most any other screen or web scraping software you might use. Most of it would even apply to web surfing in general.
Why surf or scrape anonymously?
There are a number of reasons why you may want to remain anonymous. It’s often a good idea to protect privacy when concerned about identity theft. You might be scraping from a competitor’s web site, and don’t want them to be able to identify you. Some web sites disallow too many requests from the same client, so you might be trying to circumvent such mechanisms.
I’ll issue a little caveat here by pointing out that, like many other tools, screen-scraping software can be used for good or ill. If you find yourself doing a lot of anonymous scraping, you may want to examine the legitimacy of what you’re doing. Scraping tools can be very useful, but don’t abuse them.
How do web sites discourage screen-scraping?
There are a number of mechanisms that web sites will use to attempt to discourage screen-scraping. Here are the ones I can think of off the top of my head:
- User tracking through cookies. A web site can easily plant a cookie, then track the number of requests you make by incrementing a server-side value attached to the cookie.
- User tracking by IP address. A slightly less reliable method used by sites is to track the number of requests you make by associating them with your IP address. I say it’s slightly less reliable because you could potentially have multiple client requests originating from the same IP address (e.g., if they’re all connecting through a common gateway).
- CAPTCHA mechanisms. There are a number of different types, and many are very difficult to circumvent. They’re also not very common, however.
- Authentication. This one dovetails with tracking through cookies, but is a slight variation in that some sites will require authentication before allowing access to the information you want to scrape. If a site doesn’t require authentication, you might simply be able to block cookies. Once you have to log in, though, the session cookie identifies you, so this one can be tricky to deal with.
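To make the first two mechanisms a bit more concrete, here is a minimal sketch of how a site might count requests per client. It isn't taken from any particular web site or framework. The RequestTracker class, its allow method, and the max_requests limit are all hypothetical names made up for illustration.

```python
from collections import Counter


class RequestTracker:
    """Counts requests per client, keyed by cookie value or by IP address."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.counts = Counter()

    def allow(self, session_cookie, ip_address):
        # Prefer the cookie when the client sent one; otherwise fall back to the IP
        key = ("cookie", session_cookie) if session_cookie else ("ip", ip_address)
        self.counts[key] += 1
        # Refuse the request once this client has gone over the limit
        return self.counts[key] <= self.max_requests
```

Note that two clients without cookies behind the same gateway would share one counter in this sketch, which is the reliability problem with IP tracking mentioned above.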
Great. So how do I scrape anonymously?
The methods you’ll want to use to scrape anonymously will depend on what the web site is using (if anything) to discourage scraping. I’ll describe below the techniques I’d recommend, along with when they’d make the most sense.
Hide your real IP address
How to do it: This is probably the most common technique you’ll use, so I’ll address it first. Every page request has to originate from an IP address, but it doesn’t necessarily need to be your real IP address. There are a few different ways you can trick the web server into thinking the HTTP request is coming from a different IP address:
- Send the request through a proxy server. There are lots of them out there. Most HTTP clients (e.g., a web browser or screen-scraping software) can be set up to send requests through an HTTP or SOCKS proxy server. Given that this is one of the more common techniques, I’ll also describe a few specific approaches:
  - Send all requests through the same proxy server. If you Google around a bit you can find lists of anonymous proxy servers. Find one that seems to be reliable, then set up your scraping software to send all requests through it. There are also tools that will take a list of proxy servers, then tell you which ones are working, which are faster, which are more reliable, etc.
  - Send requests through an application that cycles through proxy servers. These applications act as a proxy server, but route each request through a different upstream proxy. You provide a list and the application simply iterates through it one by one. MultiProxy is one I can think of offhand, though it’s a bit dated. This can also be done in our screen-scraper software by simply placing a “proxies.txt” file in screen-scraper’s installation folder. The file should contain one proxy server per line in the format [host or IP address]:[port] (e.g., myproxy.com:8080).
  - Use Tor/Privoxy. This combination can be a gem, but please don’t abuse it. It provides stronger anonymity than regular proxy servers, but may not be quite as fast.
- Use browser-based anonymization services. There are quite a few online services that allow you to punch in a web address, they send the request from their server, then display the response to you. You likely wouldn’t use this technique for scraping, but it might be useful for a few quick requests from your web browser.
- Use a virtual private network. This allows you to send all outgoing Internet traffic through a machine external to yours, and will cause the web server you’re scraping to think the request is coming from that computer and not yours. You might already have access to a VPN you can use, but more than likely you’ll just need to pay a bit to use someone else’s. This is probably the best technique for completely anonymizing any HTTP requests you might make, but does have the disadvantage that you won’t be able to cycle through IP addresses. That is, if you want a new IP address you’ll have to disconnect from and reconnect to the network. Two services of this type that I know of are StrongVPN and Relakks. We’ve used Relakks before and have had positive results.
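If you’d rather cycle through proxy servers in your own code, the idea can be sketched with nothing but Python’s standard library. This is a rough illustration and not how screen-scraper or MultiProxy do it internally. The function names (parse_proxy_list, cycling_openers, fetch) are made up for this example, and the input is assumed to use the same [host]:[port] per line format described above for “proxies.txt”.

```python
import itertools
import urllib.request


def parse_proxy_list(text):
    """Return 'host:port' entries from proxies.txt-style text, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def cycling_openers(proxies):
    """Yield (proxy, opener) pairs forever, moving to the next proxy each time."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler(
            {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        )
        yield proxy, urllib.request.build_opener(handler)


def fetch(url, openers, timeout=10):
    """Fetch url through the next proxy in the rotation."""
    proxy, opener = next(openers)
    with opener.open(url, timeout=timeout) as response:
        return proxy, response.read()
```

A list with a single entry gives you the “same proxy for every request” approach. If you have Privoxy running locally and chained to Tor, an entry of 127.0.0.1:8118 (Privoxy’s default listening address) should send requests through it, assuming you haven’t changed that setting.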
When to do it: This is probably the most common technique, and you should use it any time you want to prevent the web server you’re working with from having a way to trace requests back to you.
It should be noted that this technique is not foolproof. If you’re simply sending requests through an HTTP proxy server, there’s nothing stopping the owner of the proxy server from recording your request and IP address, then divulging the information to others so that the request can be traced back to you. Tools like Tor can provide a greater degree of anonymity, but even that isn’t bulletproof. I recently read of an exploit a researcher found in Tor that would allow traffic sent through it to be monitored. The strongest method of anonymity is probably the VPN, but, again, that assumes that the owners of the VPN service will keep private any traffic you send through them.
Block cookies
How to do it: This one’s pretty easy. If you’re using a web browser, just find the setting that indicates that all cookies should be blocked. Most screen-scraping software will (or should) also provide a way to do this.
When to do it: If the web site you’re working with is tracking you through cookies, you can simply reject them all. This likely will only work on relatively unsophisticated sites. Most sites trying to discourage screen-scraping will track your IP address.
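For a scraper written by hand, rejecting every cookie might look something like the sketch below, which uses Python’s standard http.cookiejar module. The function name make_cookie_blocking_opener is just something I made up for illustration. The trick it relies on is that a cookie policy with an empty list of allowed domains accepts cookies from nobody.

```python
import http.cookiejar
import urllib.request


def make_cookie_blocking_opener():
    """Return (opener, jar) where the jar refuses every cookie it is offered."""
    # An empty allow-list means no domain is permitted to set or receive cookies
    policy = http.cookiejar.DefaultCookiePolicy(allowed_domains=[])
    jar = http.cookiejar.CookieJar(policy=policy)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

Requests made with opener.open(...) should then leave the jar empty no matter how many Set-Cookie headers the server sends back. Keeping the jar around, rather than leaving cookie handling out altogether, makes it easy to check that nothing slipped through.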
Avoid authentication
How to do it: Simply don’t log in. If you’re authenticated to a web site, you’re likely not blocking cookies, so the web site will be able to track you.
When to do it: This is probably obvious, but, if you don’t need to authenticate, don’t. That eliminates one other method whereby a site can track you.
In some cases it’s simply not possible to avoid authentication. In these cases, unfortunately, there may not be anything you can do to stay anonymous. Your best bet would probably be to hide your IP address (as described above), which may also require logging in and out of the site each time you acquire a new IP address.
Look for ways to circumvent CAPTCHA mechanisms
How to do it: In cases where a CAPTCHA mechanism is poorly implemented, it may be possible to determine how to circumvent it programmatically (i.e., in programming code). A common CAPTCHA method is to present the user with a series of numbers or characters in a pattern such that a machine wouldn’t be able to read it. In a handful of cases in the past we’ve found that the server simply uses a naming convention with the CAPTCHA images, such that it’s possible to determine what the image says without requiring that a human read it.
Yet another fairly inefficient way of dealing with a CAPTCHA would be to capture the portion of the page containing the CAPTCHA, present it to a human being, have the person type in whatever the CAPTCHA requires, then make the request. We’ve never used this technique (and likely never would), but it’s technically possible to do.
When to do it: If a site is using a CAPTCHA, examine the HTML closely. Refresh the page multiple times to see how it changes. If you’re lucky, there will be a way to circumvent it in code. More than likely, though, you’d simply have to have a human being deal with it.
So there you have it. I’ve just pointed out a number of tools and techniques to remain anonymous online. Like I said before, don’t abuse them. There are some very legitimate reasons for wanting to do this, but there are a whole host of reasons why you shouldn’t. Part of me says I shouldn’t even be divulging any of this, but I’m not telling you anything you couldn’t find out on your own. So be nice. Behave yourself.