08.31.06
Posted in Updates at 10:34 am by Todd Wilson
I think we’re getting awfully close to a public release. This one looks to be pretty stable. Not too many major changes this time around. You’ll discover a few little niceties, along with some bug fixes and clean-ups to minor issues that have probably been annoying you. As always, please let us know of any trouble you encounter.
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.12a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic! Close the screen-scraper workbench, then re-open it.
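If you’d rather not hand-edit the properties file, the “Version” change could be scripted. What follows is only a rough sketch: it assumes the file uses Java-style “Version=...” lines (I haven’t confirmed the exact layout here), and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: the file holds a Java-style "Version=..." (or "Version: ...") line.
_VERSION_LINE = re.compile(r"^(Version[ \t]*[=:][ \t]*)[^\r\n]*", re.MULTILINE)


def set_version(properties_text: str, new_version: str) -> str:
    """Return the properties text with the Version property set to new_version."""
    updated, count = _VERSION_LINE.subn(
        lambda match: match.group(1) + new_version, properties_text
    )
    if count == 0:
        raise ValueError("no Version property found")
    return updated


def update_properties_file(path: Path, new_version: str) -> None:
    """Rewrite the Version property in place, keeping a .bak copy of the original."""
    original = path.read_text()
    # Keep a copy of the untouched file next to it before overwriting.
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(set_version(original, new_version))
```

With screen-scraper closed, you would call something like `update_properties_file(Path("resource/conf/screen-scraper.properties"), "2.7.2.12a")` from the install folder.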
Serve chilled and enjoy!
08.24.06
Posted in Miscellaneous, Thoughts at 5:02 pm by Todd Wilson
Writing software on a consulting basis can often be a losing proposition for developers or clients or both. There are too many things that can go wrong, and that ultimately translates into loss of time and money. The “15% rule” we’ve come up with is intended to create a win-win situation for both parties (or at least make it fair for everyone). Clients generally get what they want, and development shops make a fair profit. It’s not a perfect solution, but so far it seems to be working for us.
This may come as a surprise to some, but we make very little money selling software licenses. The vast majority of our revenue comes through consulting services–writing code for hire. Having now done this for several years, we’ve learned some hard lessons. On a few projects the lessons were so hard we actually lost money.
A few months ago I put together something of a manifesto intended to address the difficulties we’ve faced in developing software for clients. I’m pleased to say that it’s made a noticeable difference for us so far. My hope is that this blog entry will be read by others who develop software on a consulting basis, so that they can learn these lessons the easy way rather than the way we learned them.
What follows in this article is a summary of one of the main principles we now follow in developing software–the 15% rule. If you’d like, you’re welcome to read the full “Our Approach to Software Development” document.
For the impatient, the 15% rule goes like this…
Before undertaking a development project we create a statement of work (which acts as a contract and a specification) that outlines what we’ll do, how many hours it will require, and how much it will cost the client. As part of the contract we commit to invest up to the amount of time outlined in the document plus 15%. That is, if the statement of work says that the project will take us 100 hours to complete, we’ll spend up to 115 hours (but no more). For the whys and wherefores of how this works, read on.
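For those who like to see the arithmetic spelled out, here’s a tiny sketch of the cap. The function names are mine, invented for illustration; the only thing taken from the rule itself is the 15% figure.

```python
def max_hours(estimated_hours: float, buffer_percent: int = 15) -> float:
    """Most hours we commit to: the estimate plus the buffer percentage."""
    return estimated_hours * (100 + buffer_percent) / 100


def buffer_remaining(
    estimated_hours: float, hours_spent: float, buffer_percent: int = 15
) -> float:
    """Hours left before the cap is reached (never negative)."""
    return max(0.0, max_hours(estimated_hours, buffer_percent) - hours_spent)


print(max_hours(100))              # 115.0
print(buffer_remaining(100, 108))  # 7.0
```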
Those who have developed software for hire know that the end product almost never ends up exactly as the client had pictured. There are invariably tweaks that will need to be made (that may or may not have been discussed up front) in order to get the thing to at least resemble what the client has in mind. And, yes, this can happen even if you spend hours upon hours fine-tuning the specification to reflect the client’s wishes. Additionally, technical issues can crop up that weren’t anticipated by the programming team. In theory, the better the programming team the less likely this should be, but it doesn’t always end up that way (Microsoft’s Vista operating system is a sterling example). These two factors, among others, equate to the risk that is inherent in the project. Something isn’t going to go right, and that will almost always mean someone pays or loses more money than originally anticipated. The question is, who should be responsible to account for those extra dollars?
Up until relatively recently, we would shoulder almost all of the risk in our projects. If the app didn’t do what the client had in mind, or if unforeseen technical issues cropped up, it generally came out of our pockets. For the most part it wasn’t a huge problem, but it always seemed to have at least some effect (the extreme cases obviously being when we lost money on a project).
This seems kind of unfair, doesn’t it? The risk inherent to the project isn’t necessarily the fault of either party. It’s just there. We didn’t put it there, and neither did the client. As such, it shouldn’t be the case that one party shoulders it all. That’s where the 15% rule comes in.
The 15% rule allows both parties to share the risk. By following this rule, we’re acknowledging that something probably won’t go as either party intended, so we need a buffer to handle the stuff that spills over. By capping it at a specific amount, though, we’re also ensuring that the buffer isn’t so big that it devours the profits of the developers.
For the most part, the clients with whom we’ve used the 15% rule are just fine with it. It is a pretty reasonable arrangement, after all. We have had the occasional party that squirms and wiggles about it, but, in the end, they’ve gone along with it and I think everyone has benefited as a result.
08.10.06
Posted in Updates at 2:19 pm by Todd Wilson
Just took version 2.7.2.11a fresh out of the oven. This one contains some overdue GUI enhancements that I think will delight you. I’d especially recommend the context menus (try right-clicking on items in the tree).
If you’re currently running version 2.7.2.9a or higher you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.11a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic! Close the screen-scraper workbench, then re-open it.
Provecho!
08.03.06
Posted in Updates at 2:23 pm by Todd Wilson
For those of you who have noticed some of the annoying quirks in 2.7.2.9a, try 2.7.2.10a. It should clear up a lot of them. Nothing too major in this release; mostly minor bug fixes.
If you’re currently running version 2.7.2.9a you can upgrade via Options -> Check for updates. If you’re using anything else, follow these instructions (see this page for details on why you need to follow these steps):
- Back up your scraping sessions (check here for help on that).
- Ensure screen-scraper isn’t currently running (close the workbench and server, if running).
- Download this file, and unzip it.
- Copy the contents of the zip file on top of your existing files in the screen-scraper install folder. For example, the zip file contains a “screen-scraper.jar” file which should be copied on top of your existing “screen-scraper.jar” file.
- Edit your “resource\conf\screen-scraper.properties” file in a text editor. Change the “Version” property to “2.7.2.10a”.
- Launch the screen-scraper workbench.
- If all of your scraping sessions have disappeared, don’t panic! Close the screen-scraper workbench, then re-open it.
Enjoy!
08.02.06
Posted in Tips at 11:42 am by Todd Wilson
Periodically people ask if screen-scraper can extract data from PDF files, as well as HTML. We’ve never had a very good answer for this (it can’t, out of the box), but lately we’ve been forced to come up with a solution, as a project we’ve been working on has required it.
When I initially researched how to go about this, I was looking for libraries that would allow for extraction from PDF files. I found a handful of them, but each had its own proprietary method for performing the extraction (e.g., lots of different method calls for handling tables and such). They seemed like possibilities, but I couldn’t come up with an elegant way to integrate them into screen-scraper without completely changing the way the user would need to perform the extraction.
After stepping back from the problem, I decided that it might make more sense to simply convert the PDF to a text-based format (e.g., HTML), then use screen-scraper’s existing extraction mechanism to pull the data out. After poking around a bit, I happened across pdftohtml, which does an excellent job of converting PDF files to HTML or XML. For the project we’re currently working on, screen-scraper is able to easily pull the data we need out of the converted file.
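For anyone wanting to script the manual conversion in the meantime, here’s a minimal sketch of wrapping pdftohtml. It assumes the pdftohtml binary is on your PATH and that your build accepts the -xml and -noframes options (worth checking against your version’s help output); the file names and function names are hypothetical.

```python
import subprocess
from typing import List


def build_pdftohtml_command(
    pdf_path: str, output_base: str, as_xml: bool = True
) -> List[str]:
    """Build the argument list for a pdftohtml run."""
    command = ["pdftohtml"]
    # -xml asks for XML output; -noframes asks for a single HTML file instead.
    command.append("-xml" if as_xml else "-noframes")
    command += [pdf_path, output_base]
    return command


def convert_pdf(pdf_path: str, output_base: str, as_xml: bool = True) -> None:
    """Run pdftohtml and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftohtml_command(pdf_path, output_base, as_xml), check=True)
```

Calling `convert_pdf("report.pdf", "report")` should leave a converted file next to the PDF, which can then be fed to screen-scraper like any other text document.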
Our next step will be to integrate this functionality directly within screen-scraper. That is, screen-scraper should be able to seamlessly convert a PDF on the fly, then allow the user to make use of screen-scraper’s existing extraction mechanisms to pull the data out. The tricky part is that pdftohtml is platform dependent. They offer a Windows binary, but on any other OS you have to either compile from source or hope for an existing package (we’re using Ubuntu and were able to just apt-get it).
Here’s how I’m thinking it would work if we were to automate the process within screen-scraper:
- For any operating systems that allow it, we just ship a binary with screen-scraper that will perform the PDF to HTML/XML conversion locally.
- In cases where that doesn’t work, we provide a remote web service that will convert the PDF to XML. screen-scraper would invoke this behind the scenes in one of two ways:
  - screen-scraper would first attempt the conversion by passing the PDF’s URL to the web service. The web service would attempt to retrieve the PDF via a GET request. Assuming that works, it would then perform the conversion and spit back the resulting XML, which screen-scraper would download.
  - If the web service is unable to grab the PDF directly (e.g., in cases where the PDF is behind an authentication gateway), it would indicate such to screen-scraper, which would then download the PDF file and upload it to the web service. The web service would perform the conversion, then output the resulting XML.
Once the PDF is converted, the user would be free to use the normal extractor patterns to pull the data out.
So in a worst-case scenario, the PDF would need to be downloaded from the source site, uploaded to the web service, converted, then downloaded again. Obviously this would add a lot of overhead, so it’s definitely not the best approach. I would guess that in the majority of cases, however, the PDF could be converted either locally or via the web service where it’s able to request the PDF directly from the web site.
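To make the order of attempts concrete, here’s a rough sketch of the fallback logic. None of this exists yet: every function and parameter name is hypothetical, and the converters are passed in as plain callables so the sketch doesn’t pretend to know how the real pieces would be built.

```python
from typing import Callable, Optional, Tuple


def convert_with_fallback(
    pdf_url: str,
    download: Callable[[str], bytes],
    convert_locally: Optional[Callable[[bytes], str]],
    convert_remote_by_url: Callable[[str], Optional[str]],
    convert_remote_by_upload: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (xml, route), trying the cheapest conversion route first."""
    # 1. A bundled binary exists for this OS: download once, convert locally.
    if convert_locally is not None:
        return convert_locally(download(pdf_url)), "local"
    # 2. Ask the web service to fetch the PDF itself; None means it could not.
    xml = convert_remote_by_url(pdf_url)
    if xml is not None:
        return xml, "remote-url"
    # 3. Worst case: download the PDF ourselves, then upload it for conversion.
    return convert_remote_by_upload(download(pdf_url)), "remote-upload"
```

Returning the route alongside the XML is just a convenience for logging how often the expensive third path gets hit.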
I can’t say just when we’ll get around to implementing this. It will likely depend mostly on the demand we see for it. This is the first project we’ve done where we’ve had to pull it off, but I’m guessing there will be others down the road. Until we implement this automated method, though, running pdftohtml manually may not be too cumbersome for most.