screen-scrapeable - Thoughts, tips, and updates on screen-scraping

How to Extract Text from PDFs and Images

December 12, 2019 by Todd Wilson

Overview: A surprising amount of valuable data is locked away in PDF files. There are a variety of methods for extracting data from them, but the job is made more difficult when they contain embedded images that hold the text. We recently experimented with Google Cloud Vision API to handle the task, and have had …

Version 7.0.14a of screen-scraper Released

October 28, 2019 by Todd Wilson

Just released a new alpha version of screen-scraper. Here are the changes: Bug fix to datamanager awaitCompletionOfPendingWrites method that could cause it to permanently block. Addition of new HTTP callback event fire times. Fixed a data manager issue when building schemas with some newer mysql drivers. Added sutil.makeGETRequestRequired(String) that issues a request even if the …

Screen-Scraping vs. API

March 11, 2019 by Todd Wilson

On occasion we’re asked to acquire data from a site that already offers an API. In almost all cases it’s going to be simpler to acquire data from a site by accessing an API as opposed to crawling. In theory, there should be no need to scrape data from a site if an API to the content is already made available. That said, there are a number of reasons why it still may make sense to scrape a site that also provides an API.

Combining Scraped Data from Multiple Sites

March 11, 2019 / January 30, 2019 by Todd Wilson

Data sets often become richer when they’re combined. A good example is a small study done by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. This was determined by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which ranks the quality of movies.
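As a rough illustration of that kind of combination, here is a minimal Python sketch that joins a scraped catalog with scraped ratings on a normalized title, then counts the titles per service that clear a score threshold. The records, field names, service names, and threshold are all made-up placeholders, not data from the study.

```python
# Hypothetical records: titles, services, and scores are placeholders only.
catalog = [
    {"title": "Movie A", "service": "Service X"},
    {"title": "Movie B", "service": "Service X"},
    {"title": "Movie A", "service": "Service Y"},
    {"title": "Movie C", "service": "Service Y"},
]
ratings = [
    {"title": "movie a ", "score": 91},
    {"title": "Movie B", "score": 48},
]


def normalize(title):
    """Lowercase and collapse whitespace so titles from different sites match."""
    return " ".join(title.lower().split())


def combine(catalog, ratings):
    """Attach a score to each catalog row; rows with no match get None."""
    scores = {normalize(r["title"]): r["score"] for r in ratings}
    combined = []
    for row in catalog:
        merged = dict(row)
        merged["score"] = scores.get(normalize(row["title"]))
        combined.append(merged)
    return combined


def count_quality(combined, threshold):
    """Count titles per service whose score meets the threshold."""
    counts = {}
    for row in combined:
        if row["score"] is not None and row["score"] >= threshold:
            counts[row["service"]] = counts.get(row["service"], 0) + 1
    return counts


combined = combine(catalog, ratings)
quality_counts = count_quality(combined, threshold=75)
print(quality_counts)
```

In practice the hard part is usually the matching key. Titles scraped from two sites rarely agree character for character, so the normalization step tends to need more care than shown here.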

8 Ways to Handle Scraped Data

February 1, 2019 / January 30, 2019 by Todd Wilson

In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.

Large-Scale Web Scraping

January 17, 2019 by Todd Wilson

I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that have required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at a large scale that you may not need to consider when doing smaller jobs.

Complex Forms

January 3, 2019 / March 22, 2017 by jason

Some sites have pretty complex forms, whether because of the sheer number of parameters or because the parameters are incomprehensible to humans. In such cases we have a method to get all the form elements for you.
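To show the general idea of collecting every form element automatically, here is a small sketch using Python's standard `html.parser`. It is not screen-scraper's own method, just an illustration of the technique. The sample HTML and field names are invented, and the sketch only handles `input` tags, so `select` and `textarea` elements would need extra handling.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from <input> tags that appear inside a <form>."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self._in_form = True
        elif self._in_form and tag == "input" and attributes.get("name"):
            # A missing or empty value attribute becomes an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


# Hypothetical page source; real forms can carry dozens of hidden parameters.
sample_html = """
<form action="/search" method="post">
  <input type="hidden" name="session_token" value="abc123">
  <input type="text" name="query" value="">
  <input type="hidden" name="page" value="1">
</form>
<input name="outside" value="ignored">
"""

collector = FormFieldCollector()
collector.feed(sample_html)
form_fields = collector.fields
print(form_fields)
```

The resulting dictionary can be used as the starting point for a POST request, overriding only the few parameters you actually care about.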

Version 7.0.1a released

April 19, 2016 by jason

When you update to version 7.0.1a, the first thing you’ll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here. If you want to use this update, here are the instructions to update.

Screen-scraper 7.0 Released

January 3, 2019 / March 2, 2016 by jason

This new stable version adds many new features, and gives you the ability to scrape sites that are using the latest SSL features.

Dynamic Content

January 3, 2019 / October 28, 2015 by jason

One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.

What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
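Once you have found that subsequent request (the network tab of a browser's developer tools is a common place to look), you can often skip the HTML and read the data straight from the response. The Python sketch below parses a JSON body into records. The body shown here is a made-up stand-in for whatever the real request returns, and the field names are assumptions.

```python
import json

# Assumed shape of the JSON returned by the page's background request.
response_body = (
    '{"results": ['
    '{"name": "Widget", "price": "9.99"}, '
    '{"name": "Gadget", "price": "24.50"}'
    '], "total": 2}'
)


def extract_records(body):
    """Parse the JSON body and return one cleaned record per result."""
    payload = json.loads(body)
    return [
        {"name": item["name"], "price": float(item["price"])}
        for item in payload.get("results", [])
    ]


records = extract_records(response_body)
print(records)
```

When the response is XML or an HTML fragment instead, the same pattern applies with a different parser.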