Web site scraping software
JDMurray
Does anyone have a recommendation for (free) Web site scraping software? I need to collect information from Web board discussion forums en masse using a scraping program and package the scraped content for easy review later. It'd be a plus if the program could save the results as a PDF or Word (RTF) file. Java/Python/Perl programs are OK too.
TIA
Comments
UnixGeek
I can only speak to web scraping software that takes a straight HTML copy, but the simplest is using wget. For some reason the post content filters are preventing me from giving you the full syntax, but check out the --mirror option.
However, this is not 100% reliable at retrieving all CSS files, so if the above command doesn't work for you, try HTTrack. I use it to rescue new customers' websites from proprietary and/or uncooperative hosting platforms, and have had zero problems with it:
HTTrack Website Copier - Offline Browser
eMeS
A few places you can look...
Red Oak Software at one point acquired a company named "Blue Lobster" that was heavy in the screen scraping business. They might have something.
There is also the ScrAPI toolkit for Ruby. I'm not at all familiar with the methods that are available, but I know the intent is to make screen scraping a bit more elegant and useful.
You might also try IBM's WebSphere sMash Developer Edition at Project Zero: Download sMash. This is the free version of a product that they sell for about $10k per seat.
Of the three I would go with the WebSphere one, but I've kinda got a hard on for IBM stuff. Additionally, given IBM's diversity, they probably have at least 3 other products that will do what you want.
If you need access to any IBM stuff that's not free and for which trials are not publicly offered, I can likely get it for you on a limited trial/evaluation access basis. I have access to their entire software catalog.
In any event, what you want to do will in no way be an "out of the box" thing (at least not with these options)....but being the resourceful guy that you are I'm sure that won't be a problem.
MS
NightShade03
HTTrack Website Copier - Offline Browser
It's free, easy to use, and I love it.
I pulled down a 10GB website (between files, profiles, scripts, etc...) in about 30 minutes. It is a pretty nice piece of software and has some good options.
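For the Python route JDMurray mentions, here is a minimal sketch of the "package for review" step. It assumes the pages have already been copied to disk (for example with wget or HTTrack) and uses only the standard library. The names PageExtractor and extract_page are made up for illustration, and real forum markup will likely need more specific handling than this:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect visible text and absolute link targets from one HTML page."""

    SKIP = {"script", "style"}  # tags whose contents are not readable text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(html, base_url):
    """Return (plain_text, links) for one page (hypothetical helper)."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.text_parts), parser.links
```

Running extract_page over each saved file and appending the text to one document gives something reviewable in a single pass; converting that to PDF or RTF would need a separate tool, which this sketch does not cover.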