Some sites have pretty complex forms, whether in the sheer number of parameters or in being incomprehensible to humans. For such cases we have a method to gather all the form elements for you.
On the page with the form, you need to extract the whole form, including the “<form” and “</form>” tags. I make an extractor pattern with a token named ~@_FORM@~, and use the RegEx in the token properties to define which form I need. An example RegEx:
Once I have it extracted, there is a script to run on each pattern application. Therein you need to set any fields, selections, radio buttons, etc., and save the form as a session variable.
Form form = scrapeableFile.buildForm(dataRecord.get("_FORM"));
form.setValue("SESSION_TOKEN", session.getv("SESSION_TOKEN")); // Set a field; add as many as needed
form.setValueChecked("values", session.getv("TO_CHECK")); // Mark a checkbox as selected; add as many as needed
session.setv("_FORM", form); // Save the form as a session variable
Then you request the next scrapeableFile, and run a script on that file before it is scraped; the script clears the current URL and parameters, and replaces them with those from _FORM. I rarely change this script:
Form form = session.getv("_FORM");
We’ve recently included the Apache Commons Lang libraries. There are a number of useful things in there, but I find the most use for StringUtils and WordUtils.
For example, some sites you scrape might have their results in all caps. You could:
name = "GEORGE WASHINGTON CARVER";
name = StringUtils.lowerCase(name);
name = WordUtils.capitalize(name);
session.log("Name now shows as: " + name);
At the end, the name is formatted as “George Washington Carver”. Almost all of the methods are null-safe, and there are a lot of little tools in there to try.
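If you’re curious what the StringUtils.lowerCase + WordUtils.capitalize combination is doing under the hood, here is a minimal sketch using only the standard library (the class and method names here are hypothetical, not part of Commons Lang):

```java
// Sketch of the lowercase-then-capitalize normalization shown above,
// mirroring the null-safe behavior of the Commons Lang methods.
public class NameCase {
    static String normalizeName(String name) {
        if (name == null) return null; // null-safe, like the Commons Lang methods
        StringBuilder sb = new StringBuilder(name.length());
        boolean startOfWord = true;
        for (char c : name.toLowerCase().toCharArray()) {
            sb.append(startOfWord ? Character.toUpperCase(c) : c);
            startOfWord = Character.isWhitespace(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeName("GEORGE WASHINGTON CARVER")); // George Washington Carver
    }
}
```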
We are pleased to announce our new coaching program. To help get started, our new users can receive up to two free hours of one-on-one coaching (click here for details).
Existing users can receive help planning out a project, solving that one tough issue, learning new techniques, and refining current scraping projects. Purchase hours of training by calling our offices at 800-672-0113.
As Scott pointed out, we were featured in a Wall Street Journal article yesterday. I thought it might be worthwhile to share my point of view on what information it presents.
On the whole, I think the article largely misrepresents the type of work we do. The tone of the article seems to be fairly sensationalistic, and I believe even resorts to scare tactics. There’s no question that information is programmatically extracted from web sites on a regular basis. It’s also true that this is a technology that can be (and is) abused by some users of it. The flip-side is also true, however. Sites like Zillow, Pricegrabber, and, yes, even Google make heavy use of screen-scraping, yet also provide completely legitimate and very valuable services to users. Technology (including ours) is simply a tool; it can be used in both positive and negative ways.
The article also makes it sound as though one of the primary purposes of screen-scraping is to extract private and sensitive information about people, then sell that information to the highest bidder. This definitely isn’t the type of thing we do. It’s true that some people may use our software for nefarious ends, but when we look at taking on contract work, we simply refuse obviously shady dealings. We’ve turned away many potential contracts because of this, and will continue to do so.
All of that said, I suppose this is the type of journalism that sells, so perhaps I can’t fault the authors. Hopefully those who read the article, though, will take the time to read up a bit more on the type of thing we do instead of making assumptions based on the skewed tone of the article. Along those lines, you might take a look at this part of our web site for a bit more explanation.
I get a lot of requests for help to configure and run screen-scraper at an optimal rate. As is often the case with optimization, it is as much art as science, since the many variables that can affect the speed of a scrape are impossible to catalog. While these steps will help you achieve a higher rate of scraping, it is impossible to foretell the maximum rate available in your situation and setup.
Generally, screen-scraper is a fairly lightweight application; however, the needs of each scraping server differ.
Screen-scraper is cross-platform, and can be successfully deployed on a number of server operating systems. We have found, however, that for the most part Linux-based servers tend to be somewhat more dependable and scraping-friendly.
The more intense the scraping needs, the more screen-scraper can take advantage of system resources. One set of very successful, high-end scraping servers are configured thus:
- Intel Core2 Duo at 2 GHz
- 4 GB RAM
- Multiple servers
In cases where there is an abundance of scrapes that need to be run simultaneously, it is advisable to have multiple servers for load balancing. In these cases we have used physical servers in various locations, virtual servers, or a hybrid of the two. Ekiwi is able to build a custom controller to manage multiple servers, including spawning/closing virtual servers.
The network connection to the site(s) with which you are interacting is the single most important factor in optimizing screen-scraper’s speed. It is important to have adequate bandwidth available. As you increase the number of concurrent scrapes, you will need to have greater bandwidth to accommodate them.
Screen-scraper is already configured to make only the HTTP requests specified, and will not make subsequent requests for images, scripts, CSS files, frames, etc.
Some factors in the network connection are out of your control. The time from your ISP node to the site, aka latency, is often dictated by distance to the remote server, and the site’s response time cannot be improved by any setting of screen-scraper or the network.
In some cases anonymization is desired. Any time you anonymize, you introduce additional stops (and distance) between you and the remote site. These extra hops can have a substantial and detrimental effect on the speed of your scrape. Some scenarios include:
- Tor/Privoxy: This package is desirable because it is free to use, plus the large number and variation of IPs make it very difficult to block. Generally Tor is a slower option, but some configuration can be done to seek fast exit nodes, etc.
- I2P: Like Tor/Privoxy, this is free to use, though has fewer IPs, and the drawback of generally limited speed.
- EC2: The Amazon EC2 cloud spawns a number of virtual servers, and screen-scraper is set up to tie into and use these virtual servers as proxies. This option provides consistently fast proxies, but there is a finite number of IPs available, so it can be blocked; in some cases the sites you are scraping can determine that unwanted traffic is coming from EC2, and file an abuse report with Amazon. The servers cost $0.25 per server per hour.
- Anonymizer: This 3rd party option hosts an array of fast servers that are easy to configure. This too has a finite number of IP addresses, and can sometimes be blocked. The company charges per HTTP request made.
Screen-scraper is already largely configured for optimal speed. Scrapes should always be run from the command line or through the server (the workbench is meant for developing scraping sessions).
Make sure that screen-scraper is set to an adequate memory usage setting; we’ve found that 768M of memory allocation is optimal, and that higher settings offer little added benefit.
After a scrape is set up and stable, one should stop logging or reduce the logging level.
Ensure that the connection timeout and data extraction timeout are set no higher than needed for the scrape. Sometimes too low a timeout will miss an HTTP response if the remote server takes too long to respond, but in many cases missing an occasional record is preferable to waiting for it.
The primary indicator of how much time it will take to run a scrape is a count of how many HTTP requests are required. Large datasets will usually require more requests, so any steps you can take to focus your results will save time. Scraping sessions should be designed not to make any unnecessary HTTP requests.
For some scenarios there can be an advantage in running multiple threads against a site. This allows a large number of scraping sessions to target smaller subsets of the site in tandem. Using this method is generally more intensive on the server’s resources, but will offer a net gain. With screen-scraper professional edition, you can run up to 5 concurrent scraping sessions, whereas with enterprise edition you may run as many as the server’s resources will allow; determining the number of scrapes to run on any server is a matter of testing and monitoring. We will often set the server to run 100 concurrent scrapes to establish a baseline, and adjust from there if needed. Sometimes screen-scraper or Java will use all of the resources available to it while the server still has capacity; in such cases you will see greater performance by installing an additional instance of screen-scraper instead of further taxing the existing instance.
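Screen-scraper manages its own concurrency, but the idea of running sessions against smaller subsets of a site in tandem can be sketched with a generic thread pool. The ScrapeTask class and the subset strings below are hypothetical stand-ins, not part of the screen-scraper API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Generic sketch of running several scraping tasks concurrently, each
// scoped to a smaller subset of the target site.
public class ConcurrentScrapes {
    static class ScrapeTask implements Callable<String> {
        private final String subset;
        ScrapeTask(String subset) { this.subset = subset; }
        public String call() {
            // A real task would launch a scraping session for this subset.
            return "finished " + subset;
        }
    }

    public static List<String> runAll(List<String> subsets, int threads) throws Exception {
        List<ScrapeTask> tasks = new ArrayList<>();
        for (String s : subsets) tasks.add(new ScrapeTask(s));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> results = new ArrayList<>();
        // invokeAll blocks until every task completes, preserving input order
        for (Future<String> f : pool.invokeAll(tasks)) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(List.of("a-f", "g-m", "n-z"), 3));
    }
}
```

As in the baseline-then-adjust approach above, the thread count is just a starting point to tune against observed resource usage.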
When scraping data from the remote site, it is often faster to write data to a file on the fly so screen-scraper needn’t pause for database queries. In cases where direct database interaction is required, ensure that the database is optimized, indexed, and has a fast connection to the scraping server(s).
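Writing records to a file on the fly can be sketched with a buffered writer; the file name, record fields, and CSV format below are illustrative, not prescribed by screen-scraper:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of appending extracted records to a CSV file as they arrive,
// so the scrape never pauses for a database round trip.
public class OnTheFlyWriter {
    private final BufferedWriter out;

    public OnTheFlyWriter(Path file) throws IOException {
        this.out = Files.newBufferedWriter(file);
    }

    public void writeRecord(String name, String price) throws IOException {
        out.write(name + "," + price);
        out.newLine();
        out.flush(); // flush so an interrupted scrape still leaves data on disk
    }

    public void close() throws IOException {
        out.close();
    }
}
```

The file can then be bulk-loaded into the database afterward, when speed no longer matters.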
One of the primary design goals of screen-scraper from the very beginning has been to emphasize extensibility. We’ve tried to build in a number of features and tools to make screen-scraping easier, but we also realize that we can’t fit it all in. Features such as the internal scripting engine and the ability to invoke screen-scraper from external applications allow it to be extended according to the whims of the developer.
Recently astute scraper Rodney Aiglstorfer came up with an excellent way to link data extracted within screen-scraper to custom-built classes. He’s dubbed it “Screen-Scraper Annotations for Java”, and you can find it here: http://code.google.com/p/ssa4j/. Rodney’s been good enough to release the library under an open source license, so others can benefit as well.
The internet can be thought of as the world’s largest database, because it is composed of inter-connected databases, files, and computer systems. By simply typing in some keywords, one can access hundreds to millions of websites containing treasure troves of facts, statistics, and other information on an endless array of topics. Because the internet is such a valuable resource, we should seek new and innovative ways to mine the data using ethical means.
The goal of scraping websites is to access information, but the uses of that information can vary. Users may wish to store the information in their own databases or manipulate the data within a spreadsheet. Other users may utilize data extraction techniques as means of obtaining the most recent data possible, particularly when working with information subject to frequent changes. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, or insurance salespeople following insurance prices are a few individuals who might fit this category of users of frequently updated data.
Access to certain information may also provide users with a strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses, such as restaurants or video-rental stores, that know the locations of competitors can make better decisions about where to focus further growth. Companies that provide complementary (not to be confused with complimentary) products, like software, may wish to know the make, model, cost, and market share of hardware that is compatible with their software.
Another common, but controversial use of information taken from websites is reposting scraped data to other sites. Scrapers may wish to consolidate data from a myriad of websites and then create a new website containing all of the information in one convenient location. In some cases, the new site’s owner may benefit from ads placed on his or her site or from fees charged to access the site. Companies usually go to great lengths to disseminate information about their products or services. So, why would a website owner not wish to have his or her website’s information scraped?
Several reasons exist for why website owners may not wish to have their sites scraped by others (excluding search engines). Some people feel that data reposted to other sites is plagiarized, if not stolen. These individuals may feel that they made the effort to gather information and make it available on their websites only to have it copied to other sites. Are individuals justified in feeling that they have been taken advantage of, even if their websites are posted publicly?
Interpretation of what exactly “republish” means is widely disputed. One of the most authoritative explanations may be found in the 1991 Supreme Court case of Feist Publications v. Rural Telephone Service. This case involved Rural Telephone Service suing Feist Publications for copyright infringement when Feist copied telephone listings after Rural denied Feist’s request to license the information. While information has never been copyrightable under U.S. law, a collection of information, defined mostly in terms of creative arrangement or original ideas, can be copyrighted. The Supreme Court’s ruling in Feist Publications v. Rural Telephone Service stated that “information contained in Rural’s phone directory was not copyrightable, and that therefore no infringement existed.” Justice O’Connor focused on the need for information to have a “creative” element in order to be termed a “collection” (1). Similarly, information taken from publicly available websites should not be considered plagiarism or even theft if only the information (numbers, statistics, etc.) is reposted to new sites or used for other purposes.
Scraped websites also experience an increase in used bandwidth as a result of being scraped. Some scrapes take place once, but many scrapes must be performed over and over to achieve the desired results. In such cases, the servers that host the pages being scraped inevitably experience an increased load. Site owners may not wish to have the increased bandwidth, but more importantly, excessive page requests can cause a web server to function slowly or even fail. Rarely, however, do most scrapes cause such strain on a server on their own. Accessing a page through scraping is no different from visiting a page manually, except that scraping allows more pages to be visited over a shorter period. Additionally, scrapes can be adjusted to run more slowly, so as to minimize the strain on the server. Scraping is usually slowed when more than a few scraping sessions are being run against a single server at one time.
Interestingly, having one’s website scraped can have positive effects. Of course the recipient of the scraped data is pleased to have the desired data, but owners of scraped sites may also benefit. Think of the case mentioned above in which home listings are scraped from a site. Whether the information is reposted or stored in a database for later querying to match homebuyers’ needs, the purpose of the original site is met: to get the home-listing information into the hands of potential buyers.
Individuals who scrape websites can do so, while still following guidelines for ethical data extraction. Perhaps it would be helpful to review a list of tips for maintaining ethical scraping. One website I consulted gave the following suggestions:
· Obey robots.txt.
· Don’t flood a site.
· Don’t republish, especially not anything that might be copyrighted.
· Abide by the site terms of service (2).
Occasionally, individuals who scrape websites have paid for access to the material being scraped. Many job- and résumé-posting websites fall into this category. Employers must pay a monthly fee for an account which provides access to the résumés of potential new hires. Certainly, the fact that employers pay for the service entitles them to use whatever means are necessary to sort through and record the desired data. The only exception would be where the site’s terms of service specifically prohibit scraping.
While republishing images, artwork, and other original content without permission is unethical and in many cases illegal, using scraped data for personal purposes is certainly within the limits of ethical behavior. Nevertheless, page scrapers should always avoid taking copyrighted materials. No one person is more entitled to the use of bandwidth than another. Even making scraped data available to others online can be argued to be ethical, especially when the scraped website is posted in public space and the data taken doesn’t include any creative content. After all, the purpose of hosting a website in the first place is to provide information.
Sometimes we’re asked how one might hinder a person who is trying to scrape data from their site. (The irony, of course, is that it comes from people who contacted me to scrape data for them.) The standard answer is that if you’re publishing data for the world to see, it can be scraped. There’s no stopping it … but it can be made harder. We’ve seen a variety of methods that make things more difficult:
Turing Tests
The most common implementation of the Turing Test is the old CAPTCHA that makes a human read the text in an image and fill it into a form. The idea is to determine whether you are man or machine. We have found a large number of sites implement a very weak CAPTCHA that takes only a few minutes to get around. On the other hand, there are some very good implementations of Turing Tests that we would opt not to deal with given the choice, though a sophisticated OCR can sometimes overcome those, and many bulletin board spammers have clever tricks to get past them.
Data as images
Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace such text with an image. As with the Turing Test, there is OCR software that can read it, and there’s no reason we can’t save the image and have someone read it later.
Sometimes this doesn’t work out, however, as it makes a site less accessible to the disabled.
Limit search results
Most of the data we want to get at is behind some sort of form. Some are easy: submitting a blank form will yield all of the results. Some need an asterisk or percent sign put in the form. The hardest ones are those that will give you only so many results per query. Sometimes we just make a loop that submits each letter of the alphabet to the form, but if that’s too general, we must make a loop to submit all combinations of 2 or 3 letters; three letters alone means 26^3 = 17,576 page requests.
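The brute-force loop described above can be sketched as a simple combination generator; the step that actually submits each term to the form is omitted, since it depends on the site:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of generating every three-letter search term for a brute-force
// loop over a search form, showing why the request count balloons
// to 26^3 = 17,576.
public class SearchTerms {
    public static List<String> threeLetterTerms() {
        List<String> terms = new ArrayList<>();
        for (char a = 'a'; a <= 'z'; a++)
            for (char b = 'a'; b <= 'z'; b++)
                for (char c = 'a'; c <= 'z'; c++)
                    terms.add("" + a + b + c);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(threeLetterTerms().size()); // 17576 page requests
    }
}
```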
Blocking IP addresses
On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from that address.
Sometimes these techniques work simply because they increase the effort required beyond what the data merits. Nevertheless, if you have something that you really don’t want a scraper to access, the only foolproof way of keeping it safe is to resist publishing it.
Today on our support forum someone inquired about calling scripts from other scripts within screen-scraper. This has been requested a number of times in the past, and I’ve kind of hemmed and hawed about it, not sure if it would open a can of worms. Some of our internal developers have wanted this as well, so I gave it a bit more thought, and came up with a pretty quick and easy way to implement it.
I’m particularly interested in having this one thoroughly tested, so please feel free to upgrade (try this FAQ if you run into trouble). Remember that this is an alpha version, so caveats apply. It should be plenty stable, though, since this is the only addition since 2.7.2.
Once you’ve upgraded, you can do a method call like this within a script in order to invoke another:
session.executeScript( "My Script" );