03.12.10
Posted in Tips at 6:42 pm by Todd Wilson
Once in a while when you’re scraping you may request a file that ends up being really large, but you actually only need to pull data from the top portion of the file. If it’s a big file it can end up slowing down the scraping process quite a bit. Not too long ago (somewhere around version 4.5.20a, I think) we added a method to deal with just such cases:
scrapeableFile.setMaxResponseLength( int maxKBytes )
This tells screen-scraper to only download a given number of kilobytes at the beginning of the file. You would want to run this method in a script that gets invoked before a file is scraped. For example, if your script contained this line:
scrapeableFile.setMaxResponseLength( 50 );
screen-scraper would download the first 50K of the file, cut it off, then continue on.
If the speed of a scraping session is especially critical this can also be a great way to trim off quite a bit of download time.
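To make the idea concrete, here is a minimal sketch of the same "read the top, then stop" behavior outside of screen-scraper. It is not how screen-scraper implements setMaxResponseLength; the helper name read_prefix is made up for illustration, and it assumes a kilobyte means 1024 bytes.

```python
import io

def read_prefix(stream, max_kbytes):
    """Read at most max_kbytes kilobytes from a binary stream, then stop.

    Assumption: 1 kilobyte = 1024 bytes.
    """
    remaining = max_kbytes * 1024
    chunks = []
    while remaining > 0:
        chunk = stream.read(min(8192, remaining))
        if not chunk:  # the stream ended before the limit was reached
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# A 200 KB in-memory "file" cut off at 50 KB.
big_file = io.BytesIO(b"x" * (200 * 1024))
head = read_prefix(big_file, 50)
print(len(head))  # 51200
```

The same function works on any file-like object, including an HTTP response, so the rest of the download is simply never read.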
03.11.10
Posted in Tips at 1:47 pm by scottw
Within screen-scraper you have the ability to call outside programs directly from your scripts. The following is an example scraping session that uses Tesseract OCR and ImageMagick to take an image from the internet and attempt to read the text in it.
As is, the scraping session is intended to run on Linux. However, it is possible to run both of the programs it depends on under Windows, either directly or using Cygwin.
To use:
Download and import the following scraping session.
http://community.screen-scraper.com/samples/ocr
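For a rough idea of what calling those two programs involves, here is a sketch in standalone Python. It is not taken from the sample scraping session, and the file names, the grayscale and resize options, and the function names are all assumptions chosen for illustration. It expects the ImageMagick `convert` and `tesseract` commands to be on the PATH (newer ImageMagick releases prefer `magick` as the command name).

```python
import subprocess
from pathlib import Path

def preprocess_command(source_image, cleaned_image):
    """ImageMagick call: convert to grayscale and enlarge (assumed clean-up steps)."""
    return ["convert", str(source_image), "-colorspace", "Gray",
            "-resize", "200%", str(cleaned_image)]

def ocr_command(cleaned_image, output_base):
    """Tesseract call; Tesseract writes its result to <output_base>.txt."""
    return ["tesseract", str(cleaned_image), str(output_base)]

def read_text_from_image(source_image, work_dir):
    """Run both programs and return whatever text Tesseract recognized."""
    work_dir = Path(work_dir)
    cleaned = work_dir / "cleaned.tif"       # hypothetical intermediate file
    output_base = work_dir / "ocr_result"    # hypothetical output name
    subprocess.run(preprocess_command(source_image, cleaned), check=True)
    subprocess.run(ocr_command(cleaned, output_base), check=True)
    return (work_dir / "ocr_result.txt").read_text().strip()
```

Calling read_text_from_image("downloaded.png", "/tmp") would run both programs in turn. How well the text comes out depends heavily on the image, so treat the clean-up options as a starting point.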
02.04.10
Posted in Tips at 12:20 pm by Todd Wilson
Any reasonably-sized software development project benefits greatly from some type of version control system, such as CVS, Subversion, or Git. Internally we use Subversion, and I thought it might be helpful to share a bit about how we go about it. What I describe here is primarily applicable to a project where you have many scrapes being developed by multiple developers, but we even use Subversion for small projects handled by a single developer.
Each developer on a project will have his own instance of screen-scraper, but may be using some scraping sessions and scripts that are also used by other developers. Generally speaking, though, a given developer is in charge of a certain set of scraping sessions, and we have a series of general scripts that might be used by all developers. These general scripts can be edited by anyone, but when edits are made everyone needs to be notified so that they can update their own instances of screen-scraper with the latest scripts. Each time a new scraping session is created or an existing scraping session is modified, it gets exported then committed to the repository. This isn’t quite as automated as some IDEs allow, so developers need to be conscientious about their work so that they export and commit at the appropriate times.
We often also make use of debug scripts, which each developer will generally tailor to his own work. It’s likely that he won’t want these scripts overwritten by those of other developers, so for each of these scripts he need only un-check the “Overwrite this script on import” box in the workbench to protect such a script.
We also typically keep a separate folder in our version control repository for the scripts that are general to a series of scraping sessions. It’s possible that a particular developer has a slightly out-dated script, and when he exports, that script may go along with the scraping session. To keep it from getting imported into a production environment we’ll copy all of the general scripts (which are always kept current) into screen-scraper’s “import” folder along with the scraping session(s) to be deployed. screen-scraper will always import scraping sessions first, then scripts, so the current general scripts are imported last and replace any out-dated copies that came along with a scraping session.
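The deployment step can be scripted. What follows is only a sketch of the copying described here, not a tool we ship: the function name, the folder layout and the ".sss" file names in the demo are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

def stage_for_import(session_exports, general_scripts_dir, import_dir):
    """Copy session exports plus every current general script into import_dir.

    Returns the names of the staged files in the order they were copied.
    """
    import_dir = Path(import_dir)
    import_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for export in session_exports:
        staged.append(shutil.copy2(export, import_dir))
    for script in sorted(Path(general_scripts_dir).iterdir()):
        if script.is_file():
            staged.append(shutil.copy2(script, import_dir))
    return [Path(path).name for path in staged]

# Demo with throwaway folders standing in for a working copy and a
# screen-scraper "import" folder (all names are hypothetical).
root = Path(tempfile.mkdtemp())
(root / "general").mkdir()
(root / "general" / "Init (Script).sss").write_text("current version")
session_export = root / "Shopping Site (Scraping Session).sss"
session_export.write_text("exported session")
print(stage_for_import([session_export], root / "general", root / "import"))
```

Copying every general script on each deployment is deliberate: it doesn’t matter which developer’s export carried a stale copy, because the current one always ends up in the folder too.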
Because screen-scraper doesn’t use a purely file-based approach to persist its objects, version control can require another step or two beyond what you’d normally find in a modern-day IDE. Our experience has been, though, that once developers get accustomed to it, it’s not too burdensome. That said, we have plans in the near future to add features that will make working with version control systems even easier with screen-scraper.
01.14.10
Posted in Tips at 4:07 pm by Todd Wilson
Many have probably noticed that when a scraping session is exported from screen-scraper all of the scripts invoked from within that scraping session get exported along with it. All of the scripts, that is, except those that get invoked via the session.executeScript method. The exporter isn’t quite smart enough to actually parse the text of scripts to look for scripts that should be exported because they’re invoked via that method.
Fortunately, there’s an easy workaround. For scripts that get invoked via session.executeScript simply associate them with the scraping session itself, but then disable them. That is, on the “General” tab for a scraping session add the scripts via the “Add Script” button, then under the “Enabled?” column in the scripts table un-check the box. This way the scripts won’t get executed at the beginning of the scraping session, but they will get exported.
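If you have a lot of scripts, finding every one that is invoked this way can be tedious. Here is a small sketch, run outside of screen-scraper, that scans script text for session.executeScript calls whose argument is a quoted string. The script names in the example are made up, and the pattern is an assumption about how the calls are typically written; a call that passes a variable instead of a literal will not be found.

```python
import re

# Matches session.executeScript( "Some Script Name" ) with a string literal argument.
EXECUTE_SCRIPT_CALL = re.compile(r'session\.executeScript\(\s*"([^"]+)"\s*\)')

def scripts_invoked_by_name(script_text):
    """Return the distinct script names found, in order of first appearance."""
    names = []
    for name in EXECUTE_SCRIPT_CALL.findall(script_text):
        if name not in names:
            names.append(name)
    return names

example = '''
session.executeScript( "Write to CSV" );
if( done ) session.executeScript("Send Report");
session.executeScript( "Write to CSV" );
session.executeScript( nextScriptName );
'''
print(scripts_invoked_by_name(example))  # ['Write to CSV', 'Send Report']
```

Each name it reports is a script you would add to the scraping session and then disable.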
11.11.09
Posted in Uncategorized at 1:22 pm by Todd Wilson
One of the primary design goals of screen-scraper from the very beginning has been to emphasize extensibility. We’ve tried to build in a number of features and tools to make screen-scraping easier, but we also realize that we can’t fit it all in. Features such as the internal scripting engine and the ability to invoke screen-scraper from external applications allow it to be extended according to the whims of the developer.
Recently astute scraper Rodney Aiglstorfer came up with an excellent way to link data extracted within screen-scraper to custom-built classes. He’s dubbed it “Screen-Scraper Annotations for Java”, and you can find it here: http://code.google.com/p/ssa4j/. Rodney’s been good enough to release the library under an open source license, so others can benefit as well.
09.02.09
Posted in Updates at 5:38 pm by Todd Wilson
We’re constantly updating screen-scraper with bug fixes and new features, but haven’t always been good about documenting changes. These newer features are typically only available in our alpha versions. Whereas previously you were on your own to figure out what was new, we’re now going to do our best to document new features here:
Alpha documentation
These docs might not be quite as neat and clean as the others, but if you’re using our alpha versions and want to see what’s new, this is a good page to watch.
08.28.09
Posted in Updates at 5:27 pm by Todd Wilson
Actually, I should probably call it a REST-like API. I have no doubt the purists will point out that it isn’t a REST API at all. How about we call it an “API accessible via GET requests”?
With that loquacious introduction, I’m happy to announce that, as of version 4.5.18a, you can access screen-scraper via GET requests. Let me just state right here and now that this is alpha functionality and may very well change before the next public release. Use it at your own risk. As with any of our alpha features the documentation is scant, so I’ll simply provide a long list of examples as to how you might use it. Hopefully you’ll get the idea.
You’ll first need to start up screen-scraper in server mode. Once that’s done you can then access a slew of features you’d normally only be able to access via the web interface. Here they are:
http://localhost:8779/ss/rest?action=get_runnable_scraping_sessions
http://localhost:8779/ss/rest?action=get_scrapeable_sessions
http://localhost:8779/ss/rest?action=run_scraping_session&scraping_session_name=Shopping+Site
http://localhost:8779/ss/rest?action=stop_running_scraping_session&scrapeable_session_id=43
http://localhost:8779/ss/rest?action=stop_all_running_scraping_session
http://localhost:8779/ss/rest?action=remove_scrapeable_session&scrapeable_session_id=29
http://localhost:8779/ss/rest?action=reload_settings
http://localhost:8779/ss/rest?action=peek_scrapeable_session_log&scrapeable_session_id=42&num_lines=50
http://localhost:8779/ss/rest?action=get_scheduled_scraping_sessions
http://localhost:8779/ss/rest?action=disable_enable_scheduled_scraping_session&scheduled_scraping_session_id=110&enable=false
http://localhost:8779/ss/rest?action=remove_scheduled_scraping_session&scheduled_scraping_session_id=0
http://localhost:8779/ss/rest?action=set_scheduled_scraping_session&scheduled_scraping_session_id=3&scraping_session_name=Shopping+Site&timeout=123&schedule_date=08%2F20%2F2009&schedule_time=11:22:33&repeat_days=4&repeat_hours=3&repeat_minutes=2&repeat_seconds=1&threshold_time=21&threshold_record_count=43&settable_session_variables=this%3Dthatx%26foo%3Dbar
http://localhost:8779/ss/rest?action=save_settings&default_timeout=89&default_repeat_days=9&default_repeat_hours=8&default_repeat_minutes=7&default_repeat_seconds=6&default_threshold_time=4&default_threshold_record_count=3
http://localhost:8779/ss/rest?action=set_session_variable_on_scrapeable_session&scrapeable_session_id=3&key=foo&value=bap
http://localhost:8779/ss/rest?action=get_session_variable_from_scrapeable_session&scrapeable_session_id=3&key=foo
http://localhost:8779/ss/rest?action=get_memory_usage
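If you would rather build these URLs in code than by hand, something like the following sketch would do it. The helper names are made up for illustration, and since the response format isn’t documented here, the call function simply returns whatever text comes back. It assumes screen-scraper is running in server mode on localhost port 8779, as in the examples above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8779/ss/rest"

def build_url(action, **params):
    """Build a GET URL for the given action, URL-encoding every parameter."""
    query = {"action": action}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)

def call(action, **params):
    """Send the request and return the raw response text (format not assumed)."""
    with urlopen(build_url(action, **params)) as response:
        return response.read().decode("utf-8", errors="replace")

print(build_url("run_scraping_session", scraping_session_name="Shopping Site"))
# http://localhost:8779/ss/rest?action=run_scraping_session&scraping_session_name=Shopping+Site

# Session variables are themselves a query string, so they get encoded twice:
session_vars = urlencode({"this": "thatx", "foo": "bar"})
print(urlencode({"settable_session_variables": session_vars}))
# settable_session_variables=this%3Dthatx%26foo%3Dbar
```

Letting urlencode handle the escaping avoids the easy mistake of leaving a space or an ampersand unescaped in a session name or variable value.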
As with any alpha feature we appreciate bug reports and feedback. Please don’t hesitate to drop us a line.
08.17.09
Posted in Thoughts, Tips at 4:36 pm by jason
We previously listed some ways to try to stop screen-scraping, but since it is an ongoing topic for us, it bears revisiting. Any site can be scraped, but some require such an investment of time and resources that scraping them becomes prohibitively expensive. Some of the common methods sites use to accomplish this are:
Turing tests
The most common implementation of the Turing test is the familiar CAPTCHA, which tries to ensure that a human reads the text in an image and types it into a form.
We have found a large number of sites that implement a very weak CAPTCHA that takes only a few minutes to get around. On the other hand, there are some very good implementations of Turing tests that we would opt not to deal with, given the choice. Even those can sometimes be overcome by sophisticated OCR, and many bulletin board spammers have clever tricks to get past them.
Data as images
Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace such text with an image. As with the Turing test, there is OCR software that can read it, and there’s no reason we can’t save the image and have someone read it later.
Often, however, listing data as an image without a text alternative is in violation of the Americans with Disabilities Act (ADA), and can be overcome with a couple of phone calls to a company’s legal department.
Code obfuscation
Using something like a JavaScript function to show data on the page even though it isn’t anywhere in the HTML source is a good trick. Other examples include scattering copious, extraneous comments throughout the page, or having an interactive page that orders things in an unpredictable way. (The example I have in mind used CSS to make the display look the same no matter how the code was arranged.)
CSS Sprites
Recently we’ve encountered some instances where a page has one image containing numbers and letters, and uses CSS to display only the characters it wants. This is in effect a combination of the previous two methods. First we have to get that master image and read what characters are there, then we need to read the CSS in the site and determine which character each tag points to.
While this is very clever, I suspect this too would run afoul of the ADA, though I’ve not tested that yet.
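Once the characters in the master image are known, working out which one a tag points to is mostly arithmetic. This sketch assumes the simplest possible layout, which a real site may not use: a single row of equally wide characters, with each tag’s CSS shifting the background left by a whole number of character widths. The function names and the 12-pixel width are made up for illustration.

```python
def char_at_offset(sprite_chars, cell_width, x_offset):
    """Map a CSS background-position x offset (0 or negative, in pixels) to a character."""
    index, remainder = divmod(-x_offset, cell_width)
    if remainder != 0 or not 0 <= index < len(sprite_chars):
        raise ValueError(f"offset {x_offset} does not line up with a character cell")
    return sprite_chars[index]

def decode(sprite_chars, cell_width, offsets):
    """Turn the offsets found on consecutive tags back into text."""
    return "".join(char_at_offset(sprite_chars, cell_width, x) for x in offsets)

# The master image reads "0123456789" and each character is 12 pixels wide.
print(decode("0123456789", 12, [-60, -60, -60, 0, -12, -24]))  # 555012
```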
Limit search results
Most of the data we want to get at is behind some sort of form. Some are easy, and submitting a blank form will yield all of the results. Some need an asterisk or percent sign put in the form. The hardest ones are those that will give you only so many results per query. Sometimes we just make a loop that submits each letter of the alphabet to the form, but if that’s too general, we must make a loop to submit all combinations of two or three letters. With three letters, that’s 17,576 page requests.
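Generating the search terms for such a loop is straightforward. This is a standalone sketch rather than a screen-scraper script; the function name is made up, and it assumes the form accepts lowercase letters a through z.

```python
from itertools import product
from string import ascii_lowercase

def search_terms(length):
    """Yield every combination of lowercase letters of the given length."""
    for letters in product(ascii_lowercase, repeat=length):
        yield "".join(letters)

print(list(search_terms(2))[:3])           # ['aa', 'ab', 'ac']
print(sum(1 for _ in search_terms(2)))     # 676
print(sum(1 for _ in search_terms(3)))     # 17576
```

Each term would be submitted to the form as one request, which is why the count grows so quickly with the length.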
IP Filtering
On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from that address. There are a number of methods to pass requests through alternate addresses, however, so this method isn’t generally very effective.
Site Tinkering
Scraping always keys off of certain things in the HTML. Some sites have the resources to constantly tweak their HTML so that any scrapes are constantly out of date. Therefore it becomes cost ineffective to continually update the scrape for the constantly changing conditions.
04.10.09
Posted in Miscellaneous at 12:03 pm by Todd Wilson
Yesterday I opened a fortune cookie that said, “Do something unusual tomorrow.” I thought about sky-diving or going the whole day blind-folded, but instead opted for something even crazier: sell screen-scraper for half price! If you’re on the fence about purchasing, now might be a good time to take the plunge. I don’t see us doing this again any time soon. The sale will last until April 11, 2009 at 11:00 a.m. Mountain time.
03.25.09
Posted in Miscellaneous at 10:24 am by Todd Wilson
We’ve had people asking for this for quite a while, and have finally gotten to it. We now have a video version of our first tutorial, accessible from the tutorial itself:
http://community.screen-scraper.com/Tutorial_1_Page_1
It isn’t perfect, but I think it’s a pretty good first version (and definitely better than what we had previously). We’re hoping to get some feedback, then will likely do another version soon based on that feedback. Feel free to give it a try and let us know what you think.