How to Extract Text from PDFs and Images

A surprising amount of valuable data is locked away in PDF files. There are a variety of methods for extracting data from them, but the job is made more difficult when they contain embedded images that hold the text. We recently experimented with the Google Cloud Vision API to handle the task, and have had …

Version 7.0.14a of screen-scraper Released

We just released a new alpha version of screen-scraper. Here are the changes: a bug fix to the data manager's awaitCompletionOfPendingWrites method, which could cause it to permanently block; the addition of new HTTP callback event fire times; a fix for a data manager issue when building schemas with some newer MySQL drivers; and the addition of sutil.makeGETRequestRequired(String), which issues a request even if the …

Screen-Scraping vs. API

On occasion we’re asked to acquire data from a site that already offers an API. In almost all cases it’s going to be simpler to acquire data from a site by accessing an API as opposed to crawling. In theory, there should be no need to scrape data from a site if an API to the content is already made available. That said, there are a number of reasons why it still may make sense to scrape a site that also provides an API.
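To see why an API is usually the simpler route, consider pulling the same record both ways. This is an illustrative sketch only: the JSON payload, HTML markup, and field names below are made up, not taken from any real site.

```python
import json
from html.parser import HTMLParser

# The same record as an API might return it (JSON) and as a page might render it (HTML).
api_response = '{"title": "Example Product", "price": 19.99}'
page_html = '<html><body><h1 class="title">Example Product</h1></body></html>'

# API path: one line of parsing, with a stable, documented structure.
title_from_api = json.loads(api_response)["title"]

# Scraping path: walk the HTML looking for the element that holds the title.
# This breaks whenever the site changes its markup.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and ("class", "title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

parser = TitleParser()
parser.feed(page_html)
title_from_page = parser.title

print(title_from_api, title_from_page)
```

Both paths recover the same value, but the scraping path carries all the fragility; the reasons to scrape anyway (rate limits, missing fields in the API, cost) are what the rest of this post covers.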


Combining Scraped Data from Multiple Sites

Often data sets become richer when they're combined. A good example of this is a small study done by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. This was determined by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which ranks the quality of movies.
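The core of that kind of study is a join on a shared key. Here is a minimal sketch of the idea: the movie titles, scores, and the 75-point "quality" threshold are all made up for illustration, not taken from the actual Streaming Observer data.

```python
# Hypothetical rows scraped from streaming catalogs, plus hypothetical
# Rotten Tomatoes scores keyed by title.
catalog = [
    {"title": "Movie A", "service": "Netflix"},
    {"title": "Movie B", "service": "Netflix"},
    {"title": "Movie C", "service": "Amazon"},
]
ratings = {"Movie A": 91, "Movie B": 45, "Movie C": 78}

# Join the two data sets on the title, then count "quality" movies
# (here, score >= 75) per service.
quality_counts = {}
for movie in catalog:
    score = ratings.get(movie["title"])
    if score is not None and score >= 75:
        service = movie["service"]
        quality_counts[service] = quality_counts.get(service, 0) + 1

print(quality_counts)  # {'Netflix': 1, 'Amazon': 1}
```

In practice the hard part is the key itself: titles scraped from two sites rarely match exactly, so real joins usually need normalization (case, punctuation, release year) before this step.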


8 Ways to Handle Scraped Data

In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.
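Two of the most common destinations for extracted data are CSV and JSON files. A minimal sketch of both, using made-up records (the field names here are hypothetical):

```python
import csv
import io
import json

# Hypothetical records extracted by a scrape.
records = [
    {"name": "Widget", "price": "4.99"},
    {"name": "Gadget", "price": "12.50"},
]

# CSV: a good fit for flat, tabular data headed for a spreadsheet or database.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)

# JSON: better when records are nested or fields vary between rows.
json_text = json.dumps(records, indent=2)

print(csv_buf.getvalue())
print(json_text)
```

The same records could just as easily be POSTed to an HTTP endpoint or inserted into a database; the choice mostly depends on who consumes the data next.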


Large-Scale Web Scraping

I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that have required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at a large scale that you may not need to consider when doing smaller jobs.
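The basic shape of running bots in parallel can be sketched with a worker pool. This is a toy sketch, not how screen-scraper itself is implemented: the fetch function below is a stand-in that returns a fake page instead of making a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real HTTP fetch. A real bot would issue the request here,
# and would need per-site rate limiting so parallel workers don't hammer
# one host -- one of the special considerations mentioned above.
def fetch(url):
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{n}" for n in range(10)]

# Run up to 4 fetches at once; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 10
```

At real scale the pool is usually spread across processes or machines rather than threads, and the queue of URLs lives in shared storage so workers can be added or restarted independently.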


Version 7.0.1a released

When you update to version 7.0.1a, the first thing you'll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here. If you want to use this update, here are the instructions for updating.

Dynamic Content

One’s first experience with a page full of dynamic content can be pretty confusing. Generally you can request the HTML, but it’s missing the data you’re after.

What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
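The pattern above can be sketched in a few lines. The HTML placeholder, endpoint payload, and field names here are hypothetical; on a real site you would find the actual request in the browser dev tools' Network tab.

```python
import json

# What the initial HTML response contains: an empty placeholder that
# the page's JavaScript fills in after load.
initial_html = '<div id="results"></div><script src="app.js"></script>'

# What the subsequent HTTP request (the one the JavaScript makes) returns.
# Scraping this endpoint directly is usually easier than rendering the page.
xhr_response = '{"results": [{"name": "Item 1"}, {"name": "Item 2"}]}'

names = [item["name"] for item in json.loads(xhr_response)["results"]]
print(names)  # ['Item 1', 'Item 2']
```

When the subsequent response is HTML or XML instead of JSON, the same approach applies; only the parsing step changes.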
