Using OCR with screen-scraper

Posted in Tips on 03.11.10 by scottw

Within screen-scraper you have the ability to call outside programs directly from your scripts.  The following is an example scraping session that makes use of Tesseract OCR and ImageMagick to take an image from the Internet and attempt to read the text it contains.

As is, the scraping session is intended to run on Linux.  However, it is possible to run both dependent programs under Windows either directly or using Cygwin.
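
To give a feel for what the session does, here’s a minimal sketch (in a screen-scraper script) of shelling out to both tools.  The file names and the “OCR_TEXT” variable are made up, and it assumes the image has already been saved locally (e.g., via session.downloadFile); the downloadable session below is the working version.

import java.io.*;

// Convert the downloaded image into a TIFF that Tesseract can read.
Process convert = Runtime.getRuntime().exec( new String[] { "convert", "image.png", "image.tif" } );
convert.waitFor();

// Run Tesseract; it writes its result to image.txt.
Process ocr = Runtime.getRuntime().exec( new String[] { "tesseract", "image.tif", "image" } );
ocr.waitFor();

// Read the OCR output back in and store it in a session variable.
BufferedReader reader = new BufferedReader( new FileReader( "image.txt" ) );
StringBuffer text = new StringBuffer();
String line;
while( ( line = reader.readLine() ) != null )
{
    text.append( line );
}
reader.close();

session.setVariable( "OCR_TEXT", text.toString().trim() );
session.log( "OCR result: " + session.getVariable( "OCR_TEXT" ) );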

To use:

Download and import the following scraping session.

http://community.screen-scraper.com/samples/ocr

How we use version control

Posted in Tips on 02.04.10 by Todd Wilson

Any reasonably-sized software development project benefits greatly from some type of version control system, such as CVS, Subversion, or Git.  Internally we use Subversion, and I thought it might be helpful to share a bit how we go about it.  What I describe here is primarily applicable to a project where you have many scrapes being developed by multiple developers, but we even use Subversion for small projects handled by a single developer.

Each developer on a project will have his own instance of screen-scraper, but may be using some scraping sessions and scripts that are also used by other developers.  Generally speaking, though, a given developer is in charge of a certain set of scraping sessions, and we have a series of general scripts that might be used by all developers.  These general scripts can be edited by anyone, but when edits are made everyone needs to be notified so that they can update their own instances of screen-scraper with the latest scripts.  Each time a new scraping session is created or an existing scraping session is modified, it gets exported then committed to the repository.  This isn’t quite as automated as some IDEs allow, so developers need to be conscientious about exporting and committing at the appropriate times.

We often also make use of debug scripts, which each developer will generally cater to his own work.  It’s likely that he won’t want these scripts overwritten by those of other developers, so for each of these scripts he need only un-check the “Overwrite this script on import” box in the workbench to protect such a script.

We also typically keep a separate folder in our version control repository for the scripts that are general to a series of scraping sessions.  It’s possible that a particular developer has a slightly outdated script, and when he exports, that script may go out with the scraping session.  To keep it from getting imported into a production environment we’ll copy all of the general scripts (which are always kept current) into screen-scraper’s “import” folder along with the scraping session(s) to be deployed.  screen-scraper will always import scraping sessions first, then scripts.  That way you can guarantee that the current scripts don’t get overwritten.

Because screen-scraper doesn’t use a purely file-based approach to persist its objects, version control can require another step or two beyond what you’d normally find in a modern-day IDE.  Our experience has been, though, that once developers get accustomed to it it’s not too burdensome.  That said, we have plans in the near future to add features that will make working with version control systems even easier with screen-scraper.

Exporting scraping sessions that use session.executeScript

Posted in Tips on 01.14.10 by Todd Wilson

Many have probably noticed that when a scraping session is exported from screen-scraper all of the scripts invoked from within that scraping session get exported along with it.  All of the scripts, that is, except those that get invoked via the session.executeScript method.  The exporter isn’t quite smart enough to actually parse the text of scripts to look for scripts that should be exported because they’re invoked via that method.
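
A call like the following, for instance, is invisible to the exporter (the script name here is just an illustration):

session.executeScript( "Clean up address data" );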

Fortunately, there’s an easy workaround.  For scripts that get invoked via session.executeScript simply associate them with the scraping session itself, but then disable them.  That is, on the “General” tab for a scraping session add the scripts via the “Add Script” button, then under the “Enabled?” column in the scripts table un-check the box.  This way the scripts won’t get executed at the beginning of the scraping session, but they will get exported.

Further thoughts on hindering screen-scraping

Posted in Thoughts, Tips on 08.17.09 by jason

We previously listed some means to try to stop screen-scraping, but since it is an ongoing topic for us, it bears revisiting.  Any site can be scraped, but some require such an investment of time and resources as to make scraping prohibitively expensive.  Some of the common methods used to hinder it are:

Turing tests

The most common implementation of the Turing Test is the old CAPTCHA, which tries to ensure that a human reads the text in an image and feeds it into a form.

We have found a large number of sites that implement a very weak CAPTCHA that takes only a few minutes to get around. On the other hand, there are some very good implementations of Turing Tests that we would opt not to deal with given the choice, though sophisticated OCR can sometimes overcome them, and many bulletin board spammers have clever tricks for getting past them as well.

Data as images

Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace such text with an image. As with the Turing Test, there is OCR software that can read it, and there’s no reason we can’t save the image and have someone read it later.

Oftentimes, however, listing data as an image without a text alternative is in violation of the Americans with Disabilities Act (ADA), and so it can be overcome with a couple of phone calls to a company’s legal department.

Code obfuscation

Using something like a JavaScript function to show data on the page even though it’s not anywhere in the HTML source is a good trick. Other examples include scattering prolific, extraneous comments through the page, or serving an interactive page that orders things in an unpredictable way (the example I think of used CSS to make the display the same no matter the arrangement of the code).

CSS Sprites

Recently we’ve encountered some instances where a page has one image containing numbers and letters, and uses CSS to display only the characters desired.  This is in effect a combination of the previous two methods.  First we have to get that master image and read which characters are there, then we’d need to read the CSS in the site and determine which character each tag was pointing to.

While this is very clever, I suspect this too would run afoul of the ADA, though I’ve not tested that yet.

Limit search results

Most of the data we want to get at is behind some sort of form. Some are easy, and submitting a blank form will yield all of the results. Some need an asterisk or a percent sign put in the form. The hardest ones are those that will give you only so many results per query. Sometimes we just make a loop that submits each letter of the alphabet to the form, but if that’s too general, we must make a loop that submits every combination of two or three letters; at three letters that’s 17,576 page requests. A rough sketch of such a loop follows.
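
As a rough sketch (the “SEARCH_TERM” variable and the “Search results” scrapeable file are made-up names), the two-letter version of that loop might look like this in a screen-scraper script:

String letters = "abcdefghijklmnopqrstuvwxyz";

// 26 x 26 = 676 queries; nesting a third loop brings it to 26 x 26 x 26 = 17,576.
for( int i = 0; i < letters.length(); i++ )
{
    for( int j = 0; j < letters.length(); j++ )
    {
        session.setVariable( "SEARCH_TERM", letters.substring( i, i + 1 ) + letters.substring( j, j + 1 ) );
        session.scrapeFile( "Search results" );
    }
}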

IP Filtering

On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from it.  There are a number of ways to pass requests through alternate IP addresses, however, so this method isn’t generally very effective.

Site Tinkering

Scraping always keys off of certain things in the HTML.  Some sites have the resources to constantly tweak their HTML so that any scrapes are perpetually out of date, making it cost-ineffective to continually update the scrape for the constantly changing conditions.

Techniques for Scraping Large Datasets

Posted in Tips on 07.07.08 by jason

Some of the sites we aspire to scrape contain vast amounts of data. In such cases, an attempt to scrape the data may run fine for a time, but eventually stop prematurely with the following message printed to the log:

The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap space BSF info: null at line: 0 column: columnNo

There can be a variety of causes, but most of the time it is caused by memory use in page iteration. Turning up the memory allocation for screen-scraper may take care of it, but it doesn’t address the root cause.

In a typical site structure, we input search parameters and are presented with a page of results and a link to view subsequent pages. If there are ten to twenty pages of results, it’s easiest to just scrape the “next page” link and run a script after the pattern is applied that scrapes the next page. The problem lies in the fact that this is recursive. When we’ve requested the search results and two subsequent “next pages,” the scrapeable files are still open in memory like so:

  • Scrapeable file “Search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”

Every “Next search results” opens a new scrapeable file while the previous one is still open. While you can run the script on the scripts tab after the file is scraped to prevent the dataSets from remaining in scope, the scrapeable files remain in memory; the scrape may get further, but memory still fills up with scrapeable files, and it may not be enough to get all the data.
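
For reference, the recursive version is usually just a tiny script run after the “Next page” extractor pattern is applied, something like this (the variable and scrapeable file names are made up):

// If a "next page" link was extracted, request it immediately.
// The current scrapeable file stays in memory until this call returns.
if( dataRecord.get( "NEXT_PAGE_URL" ) != null )
{
    session.setVariable( "NEXT_PAGE_URL", dataRecord.get( "NEXT_PAGE_URL" ) );
    session.scrapeFile( "Next search results" );
}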

The solution is to use an iterative approach.

If the site we’re scraping shows the total number of pages, using an iterative method is easy. For my example, I’ll describe a site that has a link for pages 1 through 20, and a “>>” indicator to show there are pages beyond 20.

On the first page of search results, I have 3 extractor patterns to extract the following information:

  1. Each result listed
  2. All the page numbers shown, and
  3. The next batch of results

When I get to the search results page, the first extractor pattern runs and drills into the details of each result as usual. The second extractor pattern grabs all the pages listed, so I get a dataSet named “Pages,” containing links to pages 2 through 20, and I save the dataSet as a session variable. On the scripts tab, I then run this script after the file is scraped:

/*
    Gets all page numbers from the "Pages" extractor pattern, and iterates through them.
*/

// Get the dataSet that was saved as a session variable.
pages = session.getVariable( "Pages" );

// Clear the session variable so it doesn't linger.
session.setVariable( "Pages", null );

// Loop through the pages.
for( i = 0; i < pages.getNumDataRecords(); i++ )
{
    pageRecord = pages.getDataRecord( i );
    pageNumber = Integer.parseInt( pageRecord.get( "PAGE" ) );

    // Since the page list appears twice, use only a number larger than the one just used.
    if( pageNumber > session.getVariable( "PAGE" ) )
    {
        session.setVariable( "PAGE", pageNumber );
        session.log( "+++Scraping page #" + pageNumber );
        session.scrapeFile( "Next search results" );
    }
    else
    {
        session.log( "+++Already have page #" + pageNumber + " so not scraping" );
    }
}

The “for” loop will have the first page of search results in memory, but when it calls the “Next search results” scrapeable file to go to page 2, it only gets the results and doesn’t try to look for a next page. The loop closes out the second page before it starts the third, and closes the third before starting the fourth, etc.

The last extractor pattern on “Search results” looks for “>>”. I save that dataSet as a session variable named “Next batch pages”, and put this as the last script to run on the scripts tab:

import com.screenscraper.common.*;

/*
    Script that checks if there is a next batch of pages.
*/

if( session.getVariable( "Next batch pages" ) != null )
{
    pageSet = session.getVariable( "Next batch pages" );
    session.setVariable( "Next batch pages", null );

    pages = pageSet.getDataRecord( 0 );
    page = Integer.parseInt( pages.get( "PAGE" ) );

    if( page > session.getVariable( "PAGE" ) )
    {
        session.setVariable( "PAGE", page );
        session.log( "+++Scraping page #" + page );
        session.scrapeFile( "Next batch search results" );
    }
    else
    {
        session.log( "+++Already have page #" + page + " so not scraping" );
    }
}

Now the “Next batch search results” scrapeable file must do all the things the first page of search results did: get each result, look for next-page links, and look for a next batch of results. Using the iterative approach to cycle through pages lets you request many more pages without keeping as many in memory, and without unnecessary pages in memory the scrape will run far longer.

Scraping ASP.NET Sites

Posted in Tips on 06.04.08 by scottw

Microsoft ASP.NET sites have consistently proven to be some of the most difficult to scrape. This is due to their unconventional nature and the cryptic information passed between your browser and the server. You’ll know you’re at an ASP.NET site when your URLs end in .aspx and your links look like this:

javascript:__doPostBack('gvLicensing','Select$0')

And your POSTs look like this*:

/wEPDwUJNDczODExNjY1D2QWAgIFD2QWAgIXDzwrAA0CAA8WBB4LXyFEYXRhQm91
bmRnHgtfIUl0ZW1Db3VudGZkDBQrAABkGAIFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJh
Y2tLZXlfXxYBBQdidG5JbmZvBQtndkxpY2Vuc2luZw88KwAJAgYVAQ1saWNfc2VyaWFsX
2lkCGZk/kjEfRuqcTBAeylGENOP9dFkERc=

If you’re at all familiar with conventional HTTP transactions, prepare to forgo what you’ve come to expect. Once again, Microsoft manages to defy many standard practices, in ways that, in this case, have gone unnoticed by everyone but your tireless browser tasked with making sense of it all. But now it is your job to pick apart what’s going on and to try to reconstruct the mixed-up conversation your poor browser’s been having with the server. In this blog entry I’ll attempt to cover the more common (and not so common) characteristics of ASP.NET sites and offer techniques for how best you can play the role of your submissive browser to an unforgiving taskmaster.

If you’ve already been down this road before, please post your own stories of things you’ve encountered and how you went about slaying the dragon.

As you begin the process of scraping data from a website, we recommend that you start by using screen-scraper’s proxy to record the HTTP transactions while you navigate the site. You’ll then need to identify which of the proxy transactions should be made into scrapeable files, add extractor patterns to your scrapeable files so their extracted values can be used as session variables by other scrapeable files**, and tie the whole thing together with scripts that run recursively and in the proper sequence while traversing the site and scraping the data you need.

Here are some general rules and recommendations:

  • The first rule of screen-scraping: As closely as you can, imitate the requests to the server that your browser makes. Study the raw contents of a successful request from your proxy session while constructing your scrapeable files.
  • Run pages in the correct order. ASP.NET sites are very picky about the order in which pages occur. The server tracks this by referencing the referer found in the request. To ensure you pass the correct referer:
    • Run your scrapeable files in the same order as when you navigated the site during your proxy session (repeated for emphasis).
    • All of your scrapeable files should have the check box checked under the Properties tab where it says, “This scrapeable file will be invoked manually from a script” and should be called using the scrapeFile method. This way you’re in direct control of when scrapeable files are run.
    • Sometimes you’ll need to include a scrapeable file just to ensure you maintain the correct page order by passing the right referer. When calling scrapeable files for this purpose, basic users should use the scrapeFile method. Professional and enterprise users can use a shortcut by implementing the setReferer method within a script. Then, call this method in place of an actual scrapeable file.
    • Prior to calling a scrapeable file, you may need to manually reset certain values when your scraping session rolls back up on itself.
      • For example, say you’re iterating through a list of categories that return a list of products. For each category you also iterate through the list of products and a details page for each product. When you complete the first category iteration screen-scraper will recursively roll back up to the next category. And it’s here that you might need to manually set the values for the next category page since the values for the last details page would still be in memory.
      • One helpful approach is to name the extractor patterns for recurring parameters like the VIEWSTATE with something that indicates which page it was extracted from.  For example, the VIEWSTATE found on the details page may be named VIEWSTATE_DETAILS, while the VIEWSTATE from the search results would be called VIEWSTATE_SEARCH_RESULTS.  Doing so will help you to use the correct session variables when passing the post parameters in the request.
  • POST parameters should NOT be ignored. Almost all ASP.NET transactions rely on very specific POST data in order to respond as you’d expect.
    • Include every POST parameter whether or not it has a value.
    • Generally, parameters with cryptic string values must have those values extracted from the referring page and passed as session variables in the request.
    • If you need to programmatically add or alter a POST parameter, make use of the addHTTPParameter method, which allows you to set both the key and the value as well as control the sequence (see the sketch after this list).
    • Oddities that can keep you up all night:
      • Occasionally, two different POST parameters will exchange the same value. This has happened with EVENTTARGET & EVENTARGUMENT. When it does, the next bullet point may also apply.
      • POST key/value pairs may not always be found together in the same HTML tag of the requesting page. ASP.NET POST values are typically created via JavaScript at the moment you click a button or link. Generally, the value you want to pass can easily be found in the HTML of the referring page but occasionally it will hide off in a corner where it doesn’t belong. Try searching for the value in the requesting page’s HTML to know what you need to extract in order to get the value you’re after.
      • Watch for parameters that may be included and/or omitted between pages where you would expect them to always be the same.
        • For example, sometimes parameters will show up on, say, page one of a search results page but will not show up for page two. This can continue for additional results pages and may become even more complex. In order to handle a situation like this you may need to programmatically assign the wayward parameters manually using the addHTTPParameter method.
      • It’s not just the values that can change. Watch for POST parameter names that may also dynamically change.
  • Don’t worry about all the JavaScript. A lot is being handled with JavaScript, but it’s been our experience that you don’t need to understand the logic behind the JavaScript. 99 percent of the time you can find what you need from within the page that is making the request.
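
To illustrate the addHTTPParameter approach mentioned above, here’s a minimal sketch of a script that might run “Before file is scraped”. The parameter values, sequence numbers, and session variable names (e.g., VIEWSTATE_SEARCH_RESULTS) are assumptions for illustration, and the exact HTTPParameter constructor arguments may differ slightly in your version of screen-scraper:

import com.screenscraper.common.*;

// Re-add the ASP.NET POST parameters this request needs, in the order the browser sends them.
scrapeableFile.addHTTPParameter( new HTTPParameter( "__EVENTTARGET", "gvLicensing", 1 ) );
scrapeableFile.addHTTPParameter( new HTTPParameter( "__EVENTARGUMENT", "Page$2", 2 ) );

// The VIEWSTATE extracted from the search results page and saved earlier as a session variable.
scrapeableFile.addHTTPParameter( new HTTPParameter( "__VIEWSTATE", session.getVariable( "VIEWSTATE_SEARCH_RESULTS" ), 3 ) );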

* If a page’s VIEWSTATE is too large, screen-scraper can hang when you click on the offending proxy transaction. Wait for a while and it should recover.

** As you’re converting proxy transactions into scrapeable files, a good approach is to replace the values of parameters that look like they’re generated dynamically with session variables containing values extracted from the referring page, then test it and compare the raw request from your proxy session side by side with that of your test run.  Repeat until you’ve successfully given the server what it wants in order to get back what you want.

Handling scraped data in real time

Posted in Tips, Updates on 11.12.07 by Todd Wilson

Once screen-scraper extracts data from a web site, typically that data is sent somewhere else. Data is probably most commonly written out to a file, but may also be saved to a database or even submitted to another web site. You can always handle the scraped data in screen-scraper scripts, but what if you want to make use of the data in your own application, which invokes screen-scraper?

In the past, when invoking screen-scraper from a remote application, the process has generally meant sending screen-scraper the request to scrape, waiting for extraction to occur, then handling that extracted data in the application that invoked screen-scraper. It’s that second step that can be a bit hard to deal with–the request to scrape is sent, but the scraped data can’t be touched by the calling application until screen-scraper finishes its work. This can be especially troublesome in cases where the scrapes are long and might even get interrupted in the middle. This is at best inconvenient, and at worst may mean loss of scraped data.

I recently had a flash of inspiration as to how to deal with these cases, and implemented a new feature in the latest alpha version of screen-scraper (3.0.63a) that greatly facilitates handling data in a remote application as it is getting scraped. First, to give a contrary example, consider the method we advocate in our fourth tutorial for invoking screen-scraper remotely to extract data from our shopping web site. The process goes basically like this:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. The “Shopping Site” scraping session runs.
  4. Once the scraping session completes, control returns to the calling application.
  5. The calling application requests the scraped records from screen-scraper.
  6. The scraped records are output by the calling application.

Now consider this possibility:

  1. An external application starts up (e.g., a Java application or PHP script).
  2. The application invokes screen-scraper, telling it to run the “Shopping Site” scraping session.
  3. While the scraping session runs it sends scraped records back to the calling application, which outputs them as they get scraped.

Hopefully the benefits to the second approach are obvious.

Now on to implementation. Consider this Java class:

import com.screenscraper.scraper.*;
import com.screenscraper.common.*;

public class PollTest
{
    public static void main( String args[] )
    {
        PollTest test = new PollTest();
        test.go();

        System.exit( 0 );
    }

    public void go()
    {
        try
        {
            RemoteScrapingSession remoteScrapingSession = new RemoteScrapingSession( "Shopping Site" );
            remoteScrapingSession.setVariable( "SEARCH", "dvd" );
            remoteScrapingSession.setVariable( "PAGE", "1" );
            remoteScrapingSession.setPollFrequency( 1 );
            remoteScrapingSession.setDataReceiver( new MyDataReceiver() );
            remoteScrapingSession.scrape();
            remoteScrapingSession.disconnect();
        }
        catch( Exception e )
        {
            System.err.println( "Exception: " + e.getMessage() );
            e.printStackTrace();
        }
    }

    class MyDataReceiver implements DataReceiver
    {
        public void receiveData( String key, Object value )
        {
            System.out.println( "Got data from ss." );
            System.out.println( "Key: " + key );
            System.out.println( "Value: " + value );
        }
    }
}

The key is the “MyDataReceiver” class, which implements the “DataReceiver” interface. This interface requires the implementation of just one method: receiveData. When the scraping session is configured correctly, this method will get invoked as data is scraped by screen-scraper, allowing you to handle it in your own code. A few other notes on this class:

  • The “setPollFrequency” indicates how often (in seconds) data should be sent from screen-scraper to the client. The default is five seconds.
  • The “setDataReceiver” method must be called before “scrape” is called.

The implementation in screen-scraper is quite simple. I took the standard “Shopping Site” scraping session from the tutorial, and added the following script:

session.sendDataToClient( "DR", dataRecord );

The script gets invoked after each product is extracted from the web site. The “sendDataToClient” method will accept most any object, including strings, integers, DataRecords, and DataSets.

So far we’ve only implemented this in the Java and PHP drivers, but the others will be forthcoming.

The example source files can be downloaded here, and include both PHP and Java files. If you decide to give this a try, be sure to upgrade to version 3.0.63a of screen-scraper. You’ll want to reference the latest “screen-scraper.jar” or “misc\php\remote_scraping_session.php” file in your code (found inside the folder where screen-scraper is installed).

Anonymization through proxy servers

Posted in Tips on 09.13.07 by Todd Wilson

In certain cases a scrape needs to be anonymized in order to get the data you’re after. Generally this means sending the HTTP requests through one or more proxy servers, over which you may or may not have control (see How to surf and screen-scrape anonymously for more on this). Up to this point, this has been possible in screen-scraper, but the implementation has been relatively inelegant. Because of the needs of a recent client of ours, we’ve taken the time to flesh this out a bit more such that handling proxies is handled much more gracefully in screen-scraper. To use the code cited in this post, you’ll need to upgrade to the latest alpha version of screen-scraper.

The best way to explain is often by example, so here you go:

import com.screenscraper.util.*;

// Create a new ProxyServerPool object. This object will
// control how screen-scraper interacts with proxy servers.
ProxyServerPool proxyServerPool = new ProxyServerPool();

// We give the current scraping session a reference to
// the proxy pool. This step should ideally be done right
// after the object is created (as in the previous step).
session.setProxyServerPool( proxyServerPool );

// This tells the pool to populate itself from a file
// containing a list of proxy servers. The format is very
// simple–you should have a proxy server on each line of
// the file, with the host separated from the port by a colon.
// For example:
// one.proxy.com:8888
// two.proxy.com:3128
// 29.283.928.10:8080
// But obviously without the slashes at the beginning.
proxyServerPool.populateFromFile( "proxies.txt" );

// screen-scraper can iterate through all of the proxies to
// ensure they’re responsive. This can be a time-consuming
// process unless it’s done in a multi-threaded fashion.
// This method call tells screen-scraper to validate up to
// 25 proxies at a time.
proxyServerPool.setNumProxiesToValidateConcurrently( 25 );

// This method call tells screen-scraper to filter the list of
// proxy servers using 7 seconds as a timeout value. That is,
// if a server doesn’t respond within 7 seconds, it’s deemed
// to be invalid.
proxyServerPool.filter( 7 );

// Once filtering is done, it’s often helpful to write the good
// set of proxies out to a file. That way you may not have to
// filter again the next time.
proxyServerPool.writeProxyPoolToFile( "good_proxies.txt" );

// You might also want to write out the list of proxy servers
// to screen-scraper’s log.
proxyServerPool.outputProxyServersToLog();

// This is the switch that tells the scraping session to make
// use of the proxy servers. Note that this can be turned on
// and off during the course of the scrape. You may want to
// anonymize some pages, but not others.
session.setUseProxyFromPool( true );

// As a scraping session runs, screen-scraper will filter out
// proxies that become non-responsive. If the number of proxies
// gets down to a specified level, screen-scraper can repopulate
// itself. That’s what this method call controls.
proxyServerPool.setRepopulateThreshold( 5 );

During the course of the scrape, you may find that a proxy has been blocked. When this happens, you can make this method call to tell screen-scraper to remove the proxy from the pool:

session.currentProxyServerIsBad();

Given that this feature is still in the alpha version of screen-scraper, there’s a chance we might change around the methods a bit, but, for the most part, you should be able to use it as you see it here.

It also might be of interest to note that we’ve done a slightly extended implementation of this technique that we’re using internally, which makes use of Amazon’s EC2 service. This allows us to have a pool of high speed proxy servers at an arbitrary quantity. As the proxy servers get blocked, they can be automatically terminated, with others spawned to replace them.

How to surf and screen-scrape anonymously

Posted in Tips on 03.01.07 by Todd Wilson

Well, this is a topic I’ve been meaning to address for quite a while, and a recent support request on the topic pushed me to finally get it done.

What I’ll describe in this article applies to our own screen-scraper software, but would also apply to most any other screen or web scraping software you might use. Most of it would even apply to web surfing in general.

Why surf or scrape anonymously?

There are a number of reasons why you may want to remain anonymous. It’s often a good idea to protect privacy when concerned about identity theft. You might be scraping from a competitor’s web site, and don’t want them to be able to identify you. Some web sites disallow too many requests from the same client, so you might be trying to circumvent such mechanisms.

I’ll issue a little caveat here by pointing out that, like many other tools, screen-scraping software can be used for good or ill. If you find yourself doing a lot of anonymous scraping, you may want to examine the legitimacy of what you’re doing. Scraping tools can be very useful, but don’t abuse them.

How do web sites discourage screen-scraping?

There are a number of mechanisms that web sites will use to attempt to discourage screen-scraping. Here are the ones I can think of off the top of my head:

  • User tracking through cookies. A web site can easily plant a cookie, then track the number of requests you make by incrementing a server-side value attached to the cookie.
  • User tracking by IP address. A slightly less reliable method used by sites is to track the number of requests you make by associating them with your IP address. I say it’s slightly less reliable because you could potentially have multiple client requests originating from the same IP address (e.g., if they’re all connecting through a common gateway).
  • CAPTCHA mechanisms. There are a number of different types, and many are very difficult to circumvent. They’re also not very common, however.
  • Authentication. This one dovetails with tracking through cookies, but is a slight variation in that some sites will require authentication before allowing access to the information you want to scrape. If a site doesn’t require authentication you might simply be able to block cookies, but when it does, this one can be tricky to deal with.

Great. So how do I scrape anonymously?

The method(s) you’ll want to avail yourself of to scrape anonymously will depend on what the web site is using (if anything) to attempt to discourage scraping. I’ll describe below the techniques I’d recommend, along with when they’d make the most sense.

Hide your real IP address

How to do it: This is probably the most common technique you’ll use, so I’ll address it first. Every page request has to originate from an IP address, but it doesn’t necessarily need to be your real IP address. There are a few different ways you can trick the web server into thinking the HTTP request is coming from a different IP address:

  • Send the request through a proxy server. There are lots of them out there. Most HTTP clients (e.g., a web browser or screen-scraping software) can be set up to send requests through an HTTP or SOCKS proxy server. Given that this is one of the more common techniques, I’ll also describe a few specific approaches:
    • Send all requests through the same proxy server. If you Google around a bit you can find lists of anonymous proxy servers. Find one that seems to be reliable, then set up your scraping software to send all requests through it. There are also tools that will take a list of proxy servers, then tell you which ones are working, faster, more reliable, etc.
    • Send requests through an application that cycles through proxy servers. These applications act as a proxy server, but with each request they cycle to a different proxy server. You provide a list and the application simply iterates through it, one server at a time. MultiProxy is a bit dated, but it’s one I can think of offhand. This can also be done in our screen-scraper software by simply placing a “proxies.txt” file in screen-scraper’s installation folder. The file should contain a proxy server on each line in the format [host or IP address]:[port] (e.g., myproxy.com:8080).
    • Use tor/privoxy. This little tool can be a gem, but please don’t abuse it. It provides stronger anonymity than regular proxy servers, but may not be quite as fast.
    • Use browser-based anonymization services. There are quite a few online services that allow you to punch in a web address, they send the request from their server, then display the response to you. You likely wouldn’t use this technique for scraping, but it might be useful for a few quick requests from your web browser.
  • Use a virtual private network. This allows you to send all outgoing Internet traffic through a machine external to yours, and will cause the web server you’re scraping to think the request is coming from that computer and not yours. You might already have access to a VPN you can use, but more than likely you’ll just need to pay a bit to use someone else’s. This is probably the best technique for completely anonymizing any HTTP requests you might make, but it does have the disadvantage that you won’t be able to cycle through IP addresses. That is, if you want a new IP address you’ll have to disconnect from and reconnect to the network. Two services of this type that I know of are StrongVPN and Relakks. We’ve used Relakks before and have had positive results.

When to do it: This is probably the most common technique, and you should use it any time you want to keep the web server you’re working with from having a way to trace requests back to you.

It should be noted that this technique is not foolproof. If you’re simply sending requests through an HTTP proxy server, there’s nothing stopping the owner of the proxy server from recording your request and IP address, then divulging the information to others so that the request can be traced back to you. Tools like tor can provide a greater degree of anonymity, but even that isn’t bulletproof. I recently read of an exploit a researcher found in tor that would allow traffic sent through it to be monitored. The strongest method of anonymity is probably the VPN, but, again, that assumes that the owners of the VPN service will keep private any traffic you send through them.

Block cookies

How to do it: This one’s pretty easy. If you’re using a web browser, just find the setting that indicates that all cookies should be blocked. Most screen-scraping software will (or should) also provide a way to do this.

When to do it: If the web site you’re working with is tracking you through cookies, you can simply reject them all. This likely will only work on relatively unsophisticated sites. Most sites trying to discourage screen-scraping will track your IP address.

Avoid authentication

How to do it: If you’re authenticated to a web site, you’re likely not blocking cookies, so the web site will be able to track you.

When to do it: This is probably obvious, but, if you don’t need to authenticate, don’t. That eliminates one other method whereby a site can track you.

In some cases it’s simply not possible to avoid authentication. In these cases, unfortunately, there may not be anything you can do to stay anonymous. Your best bet would probably be to hide your IP address (as described above), which may also require logging in and out of the site each time you acquire a new IP address.

Look for ways to circumvent CAPTCHA mechanisms

How to do it: In cases where a CAPTCHA mechanism is poorly implemented, it may be possible to determine how to circumvent it programmatically (i.e., in programming code). A common CAPTCHA method is to present the user with a series of numbers or characters in a pattern such that a machine wouldn’t be able to read it. In a handful of cases in the past we’ve found that the server simply uses a naming convention with the CAPTCHA images, such that it’s possible to determine what the image says without requiring that a human read it.

Yet another fairly inefficient way of dealing with a CAPTCHA would be to capture the portion of the page containing the CAPTCHA, present it to a human being, have the person type in whatever the CAPTCHA requires, then make the request. We’ve never used this technique (and likely never would), but it’s technically possible to do.

When to do it: If a site is using a CAPTCHA, examine the HTML closely. Refresh the page multiple times to see how it changes. If you’re lucky, there will be a way to circumvent it in code. More than likely, though, you’d simply have to have a human being deal with it.

Behave yourself

So there you have it. I’ve just pointed out a number of tools and techniques to remain anonymous online. Like I said before, don’t abuse them. There are some very legitimate reasons for wanting to do this, but there are a whole host of reasons why you shouldn’t. Part of me says I shouldn’t even be divulging any of this, but I’m not telling you anything you couldn’t find out on your own. So be nice. Behave yourself.

How to stop phpBB spam

Posted in Miscellaneous, Tips on 01.02.07 by Todd Wilson

Well, I sure wish someone would have told us about this a while ago, so I’m doing the world a favor and talking about it here. Hopefully this blog posting gets picked up by Google so that others who are new to phpBB can learn how to stop spam up front.

We’ve been battling spam on our phpBB forum for I don’t know how long.  The forum software works fine, but it’s so widespread that it seems to be one of the primary targets for forum spammers.  After monkeying around with the thing, installing mods and making manual changes, we finally hit on this mod: Stop Spambot Registration.  Once installed, the spam stopped.  Amazing.

Now, obviously your mileage may vary with this one. We’ve also tried a bunch of other mods, so it’s possible that some of our mods are helping, but the Stop Spambot Registration was the key for us. If you find that you need more firepower beyond that mod, I’d recommend trying others on the phpBB Security-Related MODs page that relate to spam.

By the way, just one plea to the phpBB folks–please consider building spam control into the base install of the software. You know people are targeting you, so why not give your users some defense out of the box?

***UPDATE***

Well, I declared victory a bit prematurely with that last posting. We got a bit more spam after I installed the mod I mentioned, so I installed one more: spamwords. It seems to work fairly well. My only complaint is that it only allows you to designate words, and not phrases, as indicators of spam.

I should also mention one other change we made early on that stopped a lot of the spam–we deleted the guest user account. This is the user in the database that has an ID of -1. I searched and searched for a way to disable guest posting, to no avail. With the guest account deleted people see an error message if they explicitly log out, but at least it prevents spam from non-registered posters.
