10.29.10

Version 5.0.25a of screen-scraper Released

Posted in Updates at 5:39 pm by Todd Wilson

This one’s just a minor update with a couple of bug fixes:

  • With the update of JRE 1.6.0_22 on Mac OS X the “sss” file extension was being truncated when exporting.
  • In some cases anonymous proxies would spawn, but never become available.  Added code to handle this situation.
Anyone using anonymization with an alpha version will want to upgrade.

10.27.10

Money-Back Guarantee for screen-scraper

Posted in Miscellaneous at 3:30 pm by Todd Wilson

We’ve actually had a 30-day money-back guarantee since almost the beginning, but recently decided to highlight it a bit more.  Any time in a software development project you incorporate a new library or application you’re incurring a certain amount of risk.  The new application may appear to do just what you want it to, but later on you realize that it falls short.  We’ve put many years of development into making screen-scraper the best data extraction tool on the market, but we also acknowledge that it may not work out for everyone.  Because of this we want to help reduce some of the risk that people take when trying us out.  We already offer the Professional and Enterprise Editions as fully-functional 30-day trials, but on top of that, if things still don’t work out, you always have the option of simply asking for your money back.

As the 30-day money-back guarantee page on our site mentions, we also offer several ways for people to get help.  There’s no question that screen-scraper has a bit of a learning curve, and we try to provide as much help as we can so people can become proficient with it quickly.

10.25.10

Oh, the possibilities (screen-scraping online video)

Posted in Tips at 6:21 pm by scottw

Here we go for the second installment.

The topic for today is online video.

  • Online video

    You may be familiar with certain sites that allow you to view your favorite TV episodes or watch a poor squirrel being launched into the woods off of some guy’s deck via a salad strainer and 20 feet of bungee cord. Well, we’ve been asked by one of our clients to scrape the source video URL, title, rating, description, etc. of thousands of online animal torture videos and general moving multi-colored malaise.

    Those of you already familiar with screen-scraper are acquainted with the usual routine of starting off by proxying a site using screen-scraper’s proxy server. Well, it so happens that screen-scraper uses an HTTP proxy. It also so happens that most online videos are served over a protocol other than HTTP (e.g. mms, http to mms, rtsp, http to rtmp, rtmp, rtmpe, rtmps, rtmpt, etc.).

    Those of you already familiar with online videos probably know that you view them via the Adobe Flash player. screen-scraper’s built-in client is not a Flash player. So, you wonder, how does screen-scraper scrape online videos?

    Challenges:

    Source video URL discovery is particularly challenging for the reasons described above and requires a new set of tools to make it happen. Over time our tool set has evolved to include different video stream recording software, Proxy/TCP revealers, and various multimedia players…

    Once the source URL is discovered we create a pretty typical scraping session to recurse over a site, scraping the visible title, description, etc., as well as the non-visible pieces that make up an online video source URL. For example…

    1. Proxy: http://news.bbc.co.uk/sport2/hi/football/world_cup_2010/video/default.stm
    2. Note “connection” node: http://news.bbc.co.uk/media/emp/8680000/8682600/8682671.xml
    3. Compile URL: rtmp://72.246.119.70:80/ondemand?_fcs_vhost=cp45414.edgefcs.net&undefined/public/flash/sport/football/553...
    4. Test via Akamai

    Extracting embedded video meta-data is required because a site will seldom state outright the format, codec, dimensions, length, etc. of its online videos. We use a combination of software to download a portion of the video in order to get to the meta-data.

    • wget: Download a URL
    • wpro: Download a non-HTTP, non-RTMP URL
    • rtmpdump: Download an RTMP URL
    • mediainfo: Reveal the meta-data
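As a rough illustration of how the tools above divide up the work, here is a small Python sketch that picks a downloader by URL scheme. The function name and the scheme sets are assumptions made for this example; it only returns the tool's name and does not invoke any of the programs.

```python
from urllib.parse import urlparse

# Assumed scheme groupings, based on the protocols mentioned earlier in this post.
HTTP_SCHEMES = {"http", "https"}
RTMP_SCHEMES = {"rtmp", "rtmpe", "rtmps", "rtmpt"}


def pick_downloader(url):
    """Return the name of the tool from the list above that would fetch this URL."""
    scheme = urlparse(url).scheme.lower()
    if scheme in HTTP_SCHEMES:
        return "wget"
    if scheme in RTMP_SCHEMES:
        return "rtmpdump"
    # Anything else (mms, rtsp, ...) falls to the non-HTTP, non-RTMP tool.
    return "wpro"
```

Whatever gets downloaded would then be handed to mediainfo to reveal the meta-data.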

    The ability to easily manage multiple scraping sessions is key because we are currently scraping from around 26 online video portals. To do this we have built a web-based Tomcat controller to coordinate across multiple servers located anywhere in the world. You can manually, or by way of a scheduler, start each scraping session, add additional screen-scraper instances and point to multiple mySQL databases.

    Once the data is in a database the video information can be accessed by way of PHP, Java, Coldfusion or a number of other technologies, making Blondstar only a click or two away.

10.22.10

Price Drop in Anonymization Service

Posted in Updates at 10:32 am by Todd Wilson

Due to some recent pricing changes Amazon has effected in their EC2 service I’m happy to report that we’re likewise able to reduce the price of our anonymization service.  Whereas previously the cost per hour per proxy was 25 cents, it will now be 10 cents.  The one hitch to this is that you’ll need to upgrade to the latest alpha version of screen-scraper (currently 5.0.24a) in order to take advantage of the change.  It required a few minor changes to screen-scraper’s internal code, which prevented us from releasing it to 5.0 users.  Once we release the next public version of screen-scraper, the new pricing will be available to all users of that version as well.

10.21.10

Oh, the possibilities (ScrapbookFinds.com)

Posted in Tips at 12:43 pm by scottw

This is the first installment in what will hopefully become a series.

Here at screen-scraper we handle a variety of projects for a myriad of different clients. All of our work is centered around our core software, screen-scraper, but is often complemented by third-party software such as PHP, Tomcat, Lucene, Google Web Toolkit, mySQL, along with our own set of custom-built code.

  • ScrapbookFinds.com:  Our in-house scrapbooking comparison shopping site. Since 2006 we have been scraping many scrapbooking supply websites for product data. While scraping, the data is added to a mySQL database where we categorize and scrub it for duplicates. When you search the site Lucene quickly handles the finding of results related to your query.

    Challenges:

    Data normalization is the process of identifying a single product that is found on more than one site. Each site may refer to that product using different characteristics in, say, the title, description, or part number. Finding likeness despite the differences is a common challenge for us. Data normalization is handled by Lucene’s ability to index and tokenize disparate data to find commonality.
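To give a feel for what finding "likeness despite the differences" means, here is a toy Python sketch that tokenizes two product titles and scores their overlap. This is a simple Jaccard similarity swapped in for illustration; it is not Lucene and not the matching we actually run, and the function names are made up for the example.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Jaccard similarity of the two token sets: shared tokens / all tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Two listings such as "Acme Paper Trimmer 12-inch" and "ACME 12 inch paper trimmer" share every token under this scheme, so they would be flagged as likely the same product even though the raw strings differ.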

    We mitigate changes to a site by monitoring the number of records returned each time we scrape it. If the current number of records drops below 80% of the previous total then we know to look over the logs for errors and/or warnings issued by screen-scraper.
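The 80% rule is simple enough to sketch in a few lines of Python. The function and parameter names here are hypothetical; the only part taken from the text is the threshold.

```python
def needs_review(previous_count, current_count, threshold=0.8):
    """Return True when the current scrape yielded fewer than
    threshold * previous_count records."""
    if previous_count <= 0:
        # No earlier total to compare against, so nothing to flag.
        return False
    return current_count < previous_count * threshold
```

For example, if the last run found 1,000 records, a run that finds 799 would be flagged while a run that finds 800 would not.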

    Technology used:

Stay tuned for more to come…

10.12.10

My Take on the Wall Street Journal Article

Posted in Uncategorized at 2:38 pm by Todd Wilson

As Scott pointed out, we were featured in a Wall Street Journal article yesterday.  I thought it might be worthwhile to share my point of view on what information it presents.

On the whole, I think the article largely misrepresents the type of work we do.  The tone of the article seems to be fairly sensationalistic, and I believe even resorts to scare tactics.  There’s no question that information is programmatically extracted from web sites on a regular basis.  It’s also true that this is a technology that can be (and is) abused by some users of it.  The flip-side is also true, however.  Sites like Zillow, Pricegrabber, and, yes, even Google make heavy use of screen-scraping, yet also provide completely legitimate and very valuable services to users.  Technology (including ours) is simply a tool–it can be used in both positive and negative ways.

The article also makes it sound as though one of the primary purposes of screen-scraping is to extract private and sensitive information about people, then sell that information to the highest bidder.  This definitely isn’t the type of thing we do.  It’s true that people may be using our software for nefarious ends.  When we look at taking on contract work, we simply refuse obviously shady dealings.  We’ve turned away many potential contracts because of this, and will continue to do so.

All of that said, I suppose this is the type of journalism that sells, so perhaps I can’t fault the authors.  Hopefully those who read the article, though, will take the time to read up a bit more on the type of thing we do instead of making assumptions based on the skewed tone of the article.  Along those lines, you might take a look at this part of our web site for a bit more explanation.

screen-scraper.com in the Wall Street Journal

Posted in Miscellaneous at 12:58 am by scottw

On September 5th, 2010 we received a call from a reporter from the Wall Street Journal, Steve Stecklow (2007 Pulitzer Prize winner). He was calling to speak with someone at our company for a story he was doing related to our industry. He and I talked for about 40 minutes, during which he asked a lot of interesting questions about our company and about the industry in general. I explained to him how we are one of three screen-scraping companies in Utah Valley. He then asked if he could fly in and meet with us in person.

About 10 days later Steve pulled into town in a shiny new rent-a-car. Jason Bellows (VP of Operations), Todd Wilson (Owner, President), and I took Steve to a favorite Mexican joint, Diego’s, which is run by a fellow in our building. There he continued his interview and asked us about the various companies we’ve done work for. He was looking for something juicy for his story but did so in a very polite and forthright manner. We told him about work we’ve done for Microsoft, Oracle, Progressive Insurance and others. Some information we were not able to share due to non-disclosure agreements we’ve entered into with our clients.

A few days later they sent Chris Detrick, a free-lance photographer who works for the Salt Lake Tribune, to take pictures around the office. Over the next few weeks Steve and I stayed in contact as he had various follow up questions.

In the end, Todd and I were quoted briefly and Todd got to expose a part of screen-scraper’s source code for all the world to see.

Read ‘Scrapers’ Dig Deep for Data on Web.

10.11.10

Scaling & Optimizing screen-scraper

Posted in Uncategorized at 3:41 pm by jason

I get a lot of requests for help to configure and run screen-scraper to scrape at an optimal rate.  As is often the case with optimization, it is as much art as science, since the many variables that can affect the speed of a scrape are impossible to catalog.  While these steps will help to achieve a higher rate of scraping, it is impossible to foretell what maximum rate is available in your situation and setup.

Server
Generally, screen-scraper is a fairly lightweight application; however, the needs of each scraping server differ.
Screen-scraper is cross-platform, and can be successfully deployed on a number of server operating systems.  We have found, however, that for the most part Linux-based servers tend to be somewhat more dependable and scraping friendly.
The more intense the scraping needs, the more screen-scraper can take advantage of system resources.  One set of very successful, high-end scraping servers is configured thus:

  • Intel Core2 Duo at 2 GHz
  • 4 GB RAM
  • CentOS
  • Multiple servers

In cases where there is an abundance of scrapes that need to be run simultaneously it is advisable to have multiple servers for load balancing.  In these cases we have used physical servers in various locations, virtual servers, or a hybrid solution of the two.  Ekiwi is able to build a custom controller to manage multiple servers, including spawning/closing virtual servers.

Network
The network connection to the site(s) with which you are interacting is the single most important factor in optimizing screen-scraper’s speed.  It is important to have adequate bandwidth available.  As you increase the number of concurrent scrapes, you will need to have greater bandwidth to accommodate them.
Screen-scraper is already configured to make only the HTTP requests specified, and will not make subsequent requests for images, scripts, CSS files, frames, etc.
Some factors in the network connection are out of your control.  Speed from your ISP node to the site, aka latency, is often dictated by distance to the remote server, and response time from the site cannot be improved by any setting of screen-scraper or the network.
In some cases anonymization is desired.  Any time you anonymize, you are introducing additional stops (and distance) between you and the remote site.  These steps can have a substantial and detrimental effect on the speed of your scrape.  Some scenarios include:

  • Tor/Privoxy:  This package is desirable because it is free to use, plus the large number and variation of IPs makes it very difficult to block.  Generally Tor is a slower option, but there is some configuration that can be done to seek fast exit nodes, etc.
  • I2P:  Like Tor/Privoxy, this is free to use, though has fewer IPs, and the drawback of generally limited speed.
  • EC2:  The Amazon EC2 cloud spawns a number of virtual servers, and screen-scraper is set up to tie into and use these virtual servers as proxies.  This option provides consistently fast proxies, but there is a finite number of IPs available so it can be blocked, and in some cases the sites you are scraping can determine that unwanted traffic is coming from EC2, and make an abuse report to Amazon.  The servers cost $0.25 per server per hour.
  • Anonymizer:  This 3rd party option hosts an array of fast servers that are easy to configure.  This too has a finite number of IP addresses, and can sometimes be blocked.  The company charges per HTTP request made.

Screen-scraper
Screen-scraper is already largely configured for optimal speed.  Scrapes should always be run from a command line or through the server (the workbench is meant for development of scraping sessions).
Make sure that screen-scraper is set to an adequate memory usage setting; we’ve found that 768M of memory allocation is optimal, and that higher settings offer little added benefit.
After a scrape is set up and stable, one should stop logging or reduce the logging level.
Ensure that the connection timeout and data extraction timeout are set no higher than needed for the scrape.  Sometimes too low a timeout will miss an HTTP response if the remote server takes too long to respond, but in many cases missing an occasional record is preferable to waiting for it.

Scraping session
The primary indicator of how much time it will take to run a scrape is a count of how many HTTP requests are required.  Large datasets will usually require more requests, so any steps you can take to focus your results will save time.  Scraping sessions should be designed not to make any unnecessary HTTP requests.
For some scenarios there can be an advantage in running multiple threads against a site.  This allows a large number of scraping sessions to target smaller subsets of the site in tandem.  Using this method is generally more intensive of the server’s resources, but will offer a net gain.  With screen-scraper professional edition, you can run up to 5 concurrent scraping sessions, whereas with enterprise edition you may run as many as the server’s resources will allow; determining the number of scrapes to run on any server is a matter of testing and monitoring.  We will often set the server to run 100 concurrent scrapes to make a base-line, and adjust from there if needed.  Sometimes screen-scraper or Java will use all of the resources available to it while the server still has capacity; in such cases you will see greater performance by installing an additional instance of screen-scraper instead of further taxing the existing instance.
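To make the idea of targeting smaller subsets of a site a little more concrete, here is a minimal Python sketch that deals a list of work items (say, category names or search letters) out to a fixed number of concurrent sessions. The round-robin split and the function name are assumptions for illustration only; screen-scraper does not do this for you, and how you divide a site depends on how it is organized.

```python
def partition(items, sessions):
    """Split items round-robin into `sessions` lists, one per scraping session."""
    if sessions < 1:
        raise ValueError("sessions must be at least 1")
    # Session i takes every sessions-th item starting at index i.
    return [items[i::sessions] for i in range(sessions)]
```

Each resulting list would then be passed to its own scraping session, and the number of sessions is the knob to adjust while testing and monitoring.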
When scraping data from the remote site, it is often faster to write data to a file on the fly so screen-scraper needn’t pause for database queries.  In cases where direct database interaction is required, ensure that the database is optimized, indexed, and has a fast connection to the scraping server(s).
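Writing to a file on the fly can be as simple as appending one row per record. The sketch below uses Python's standard csv module purely to show the pattern; the function name, field names, and file layout are assumptions, and inside screen-scraper you would do the equivalent from a script in the scraping session.

```python
import csv
import os


def append_record(path, record, fieldnames):
    """Append one record to a CSV file, writing the header row only
    when the file does not exist yet or is empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(record)
```

The file can then be bulk-loaded into the database after the scrape finishes, so the scrape itself never waits on a query.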

10.04.10

Oracle/screen-scraper Podcast

Posted in Miscellaneous at 5:01 pm by Todd Wilson

Here I go tooting our horn again–Oracle has just posted a podcast on integrating screen-scraper with their Oracle Secure Enterprise Search product.  The page that links to the podcast is here, and the MP3 file of the same can be found here.  Might be worth a listen if only to get a better sense for some of the possibilities screen-scraper allows.