11.24.10
Posted in Updates at 6:20 pm by Todd Wilson
Changes in this release:
- Fixed a bug on Mac OS X where no overwrite prompt was shown when exporting scraping sessions.
- Fixed a message formatting issue in certain script errors.
- Fixed an issue with anonymous proxies being terminated externally.
You can view a cumulative list of changes since 5.0 on our Change Log page.
11.23.10
Posted in Thoughts at 5:19 pm by Todd Wilson
One of the main aspects that I think differentiates screen-scraper from many other solutions is its ability to handle large-scale scraping needs. Additionally, it was designed from the ground up to integrate with other systems, so it generally fits nicely into almost any existing setup.
If you’re doing a simple one-off data extraction project, screen-scraper could certainly handle it, but, truthfully, you may be better off with something that’s a little more quick-and-easy. On the other hand, if you’re looking to pull data from multiple web sites, and need the extracted data to be made available to other solutions, screen-scraper is an excellent option. There are many solutions out there that may get you up and running fairly quickly, but would fall apart when faced with some of the jobs screen-scraper tackles.
Along these lines we’ve added a new Enterprise-Ready page to our site that summarizes some of what screen-scraper can do. If you need big iron for your project, take a close look at what screen-scraper offers.
11.12.10
Posted in Updates at 4:13 pm by Todd Wilson
Changes:
- Based on feedback, the screen-scraper workbench and server can now be run simultaneously by adding the “AllowMultipleSimultaneousInstances” property to the screen-scraper.properties file.
- Fixed a bug where screen-scraper would freeze up when very large requests were included in proxy sessions and scrapeable files.
- Fixed a bug where space characters in URLs would generate an error.
11.11.10
Posted in Tips at 7:31 pm by scottw
Lately we have found an increasing need to anonymize our scraping sessions. So, as necessity is the mother of invention, we have created and adopted a handful of different approaches to keep our scrapes up and running.
Keep in mind that the most common way for a website to block a web crawler is for its server to refuse connections from an offending IP address.
Proactive Anonymization
This approach is used before any blocking has occurred. Ideally, a proactive approach would be the only technique needed.
Using screen-scraper’s Anonymization Service, set up your scraping session to spawn between 3 and 5 proxy servers when it starts. Create a script whose job is to shut down one proxy server and spawn a new one at a random interval (say, every 3-5 minutes).
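The rotation idea can be sketched roughly as follows. This is a generic Python illustration of the logic only, not screen-scraper's Anonymization Service API: `ProxyPool`, `spawn_proxy`, and `shutdown_proxy` are hypothetical names standing in for whatever actually starts and stops a proxy server in your setup.

```python
import random


class ProxyPool:
    """Keeps a small pool of proxies and replaces one at random intervals.

    spawn_proxy and shutdown_proxy are hypothetical callables supplied by
    the caller; they stand in for the real start/stop mechanism.
    """

    def __init__(self, spawn_proxy, shutdown_proxy, rng=None):
        self.spawn_proxy = spawn_proxy
        self.shutdown_proxy = shutdown_proxy
        self.rng = rng or random.Random()
        # Start with between 3 and 5 proxies, as suggested in the text.
        self.proxies = [spawn_proxy() for _ in range(self.rng.randint(3, 5))]

    def next_interval(self):
        """Seconds to wait before the next rotation (3 to 5 minutes)."""
        return self.rng.uniform(3 * 60, 5 * 60)

    def rotate_one(self):
        """Shut down one randomly chosen proxy and spawn a replacement."""
        index = self.rng.randrange(len(self.proxies))
        old = self.proxies[index]
        self.shutdown_proxy(old)
        self.proxies[index] = self.spawn_proxy()
        return old, self.proxies[index]
```

A background loop would then sleep for `pool.next_interval()` seconds and call `pool.rotate_one()`, repeating for as long as the scrape runs.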
It is also useful to switch up the User Agent at least each time you switch out a proxy. It can be even more effective if you switch it up on every request.
Similarly, when possible, you can change your referrer to a random URL that is off of the target domain. This makes it appear as though a different user is entering the site from an external source (typically considered positive traffic).
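As a rough illustration of switching the User Agent and referrer, here is a minimal Python sketch that builds a fresh set of request headers each time it is called. The `random_headers` function and both lists are hypothetical; the strings are placeholder examples rather than a recommended set, and this is not how screen-scraper itself sets headers.

```python
import random
from urllib.parse import urlsplit

# Placeholder values for illustration; a real list would be larger.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
]
EXTERNAL_REFERRERS = [
    "http://www.google.com/search?q=example",
    "http://www.bing.com/search?q=example",
    "http://search.yahoo.com/search?p=example",
]


def is_on_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def random_headers(target_domain, rng=random):
    """Pick a random User-Agent and a Referer that is off the target domain."""
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    off_site = [r for r in EXTERNAL_REFERRERS if not is_on_domain(r, target_domain)]
    if off_site:
        headers["Referer"] = rng.choice(off_site)
    return headers
```

Calling `random_headers("example.com")` before every request gives the per-request variation described above; calling it only when a proxy is swapped gives the less aggressive variant.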
Reactive Anonymization
This is necessary once a site starts blocking your IP address.
The first approach is to use screen-scraper’s built-in Anonymization Service. The current implementation makes use of Amazon EC2 servers as proxies. Because we use Amazon’s Linux EC2 instances, we have access to Squid, a popular proxy server that comes already installed.
A limitation of using Amazon EC2 is that the instances reside in a finite and predictable block of IP addresses. We have had a number of sites block Amazon EC2 addresses wholesale.
Once Amazon EC2 instances are no longer effective, you can make use of three other ad-hoc techniques.
Tor: The Tor network is spread widely across many different nodes and can prove difficult (almost impossible) to block. However, because of the vast distribution across any type of web server (with varying internet speeds) the relay speed is roughly 1/10th that of a normal connection. But, it’s free.
I2P2: Similar to Tor but a bit better maintained. This means faster connections. However, there are far fewer proxy nodes, and therefore fewer IP addresses for a site to block. But, it’s free.
Anonymization via Manual Proxy Pools: Using proxy pools should be a last resort because the nature of the proxies is unknown and often unreliable. You are making use of computers on the Internet that have been set up with an open port for all the world to relay its traffic through. It’s possible that the owner of the server may close the open port at any time. But, it’s free.
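Because open proxies can disappear at any time, it can help to weed out dead ones before each run. The following is a minimal Python sketch under that assumption; `proxy_is_reachable` and `live_proxies` are hypothetical helper names, and the check only confirms that the port accepts a TCP connection.

```python
import socket


def proxy_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the proxy can be opened.

    This only shows that the port is open; it does not prove the proxy
    will actually relay traffic.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def live_proxies(candidates, check=proxy_is_reachable):
    """Filter (host, port) pairs down to those that pass the check."""
    return [(host, port) for host, port in candidates if check(host, port)]
```

Passing a different `check` function makes it possible to swap in a stricter test, such as fetching a known page through each proxy.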
See the following resources to read more about Anonymizing screen-scraper.
11.10.10
Posted in Thoughts at 4:08 pm by Todd Wilson
Yesterday ReadWriteWeb published an article entitled “Overwhelmed Executives Still Crave Big Data, Says Survey”. The basic gist of it is that data is vital to making business decisions, and many managers feel that they don’t have enough of it. This got me thinking about how screen-scraping plays into all of this.
At a basic level, as a data extraction company, we deal in information. It really doesn’t make much difference what industry the information pertains to; if it’s out there on the Web, we can probably grab it. There’s a lot of talk these days about information overload, which is unquestionably a real phenomenon, but oftentimes it’s not so much the quantity of the information as it is getting access to that information in a usable format. If the data you’re interested in consists of hundreds of thousands of records spread across dozens of web sites, it may not be nearly as useful as if it could be searched and analyzed in a single repository. Much of the time this is what we do. We’re tasked with aggregating large numbers of data points, normalizing and cleaning them up, then consolidating them all into a highly-structured central repository. Once the data is in such a repository the real value of it surfaces. It’s at this point that the information can be analyzed statistically, summarized, or browsed in a structured way. This leads to business intelligence, which in turn (hopefully) yields good business decisions.
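To make the aggregate, normalize, and consolidate steps concrete, here is a small Python sketch. Everything in it is hypothetical: the car-listing fields, the `listings` table, and the `normalize` and `consolidate` functions are illustrative stand-ins, and an in-memory SQLite database plays the role of the central repository.

```python
import sqlite3


def normalize(raw):
    """Convert one raw scraped record into a cleaned (make, model, year, price) tuple."""
    make = raw["make"].strip().title()
    model = raw["model"].strip()
    year = int(raw["year"])
    # Strip currency symbols and thousands separators, e.g. "$12,500" -> 12500.0
    price = float(str(raw["price"]).replace("$", "").replace(",", "").strip())
    return make, model, year, price


def consolidate(sources, conn):
    """Normalize records from every source and store them in one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "source TEXT, make TEXT, model TEXT, year INTEGER, price REAL)"
    )
    for source_name, records in sources.items():
        rows = [(source_name, *normalize(record)) for record in records]
        conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
```

Once records from every site share one table, a single query such as `SELECT make, AVG(price) FROM listings GROUP BY make` can summarize data that was previously scattered across many sites.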
On a related note, as mentioned in the article, timeliness of information can also be critical. Once again, screen-scraping can play an important role here. I can’t count the number of times a client has approached us for a project when they already have access to all (or most of) the information they want us to acquire. The trouble is that much of the time the data they already have is old, inaccurate, and/or incomplete. Web sites and other data providers will often provide an API to their information. This can be a great thing; however, much of the time the API is insufficient because it provides access to information that is old or incomplete. For example, if you want information about automobile sales, an API may give you the make, model, and year of a car that was sold, but not the asking price. In contrast, live web sites generally contain the most up-to-date, complete, and accurate representation of the information. As such, even when data may be available via an API (or, gasp, a mailed CD), it’s often better to go directly to the web site if you want the best data.
11.09.10
Posted in Updates at 5:49 pm by Todd Wilson
Just a few changes in this one:
- Fixed a scrolling bug related to displaying script instances associated with extractor patterns.
- Removed a log message that was appearing each time a redirect occurred.
- screen-scraper will now display a “start page” when the workbench initially launches.
The start page will hopefully be especially helpful for newer users. Also, we’ll likely be holding future sales and such that will be advertised on that start page, so keep an eye on it…
11.03.10
Posted in Updates at 5:16 pm by Todd Wilson
Fixed a number of bugs in this one:
- Fixed a bug that arose when the number of available anonymous proxy servers was depleted to zero.
- Now disallowing running multiple screen-scraper interfaces simultaneously. For example, previously the screen-scraper workbench could be run concurrently with the server. This ended up causing database corruption in some cases, though, so we’re now disallowing it.
- When clicking a search result after performing a find in a proxy session the HTTP transactions table will now scroll to the corresponding transaction.
- When clicking a search result after performing a find in a proxy session, the associated proxy session will now be shown in the right pane if it isn’t already visible.
- Fixed a bug where, when exporting objects, an XML comment in any of the text fields would cause the exported file to contain an invalid sequence of characters.