Web site scraping software

JDMurray Admin Posts: 13,023
Does anyone have a recommendation for (free) Web site scraping software? I need to collect information from Web board discussion forums en masse using a scraping program and package the scraped content for easy review later. It'd be a plus if the program could save the results as a PDF or Word (RTF) file. Java/Python/Perl programs are OK too.

TIA

Comments

  • UnixGeek Member Posts: 151
    I can only speak to web scraping software that takes a straight HTML copy, but the simplest option is wget. For some reason the post content filters are preventing me from giving you the full syntax, but check out the --mirror option.

    However, wget is not 100% reliable at retrieving all CSS files, so if the above command doesn't work for you, try HTTrack. I use it to rescue new customers' websites from proprietary and/or uncooperative hosting platforms, and have had zero problems with it:

    HTTrack Website Copier - Offline Browser
  • eMeS Member Posts: 1,875
    A few places you can look...

    Red Oak Software at one point acquired a company named "Blue Lobster" that was heavy in the screen scraping business. They might have something.

    There is also the ScrAPI toolkit for Ruby. I'm not at all familiar with the methods that are available, but I know the intent is to make screen scraping a bit more elegant and useful.

    You might also try IBM's WebSphere sMash Developer Edition at Project Zero: Download sMash. This is the free version of a product that they sell for about $10k per seat.

    Of the three I would go with the WebSphere one, but I'm admittedly partial to IBM stuff. Additionally, given IBM's diversity, they probably have at least 3 other products that will do what you want.

    If you need access to any IBM stuff that's not free and for which trials are not publicly offered, I can likely get it for you on a limited trial/evaluation access basis. I have access to their entire software catalog.

    In any event, what you want to do will in no way be an "out of the box" thing (at least not with these options), but being the resourceful guy that you are, I'm sure that won't be a problem.

    MS
  • NightShade03 Member Posts: 1,383
    HTTrack Website Copier - Offline Browser

    It's free, easy to use, and I love it.

    I pulled down a 10GB website (between files, profiles, scripts, etc.) in about 30 minutes. It is a pretty nice piece of software and has some good options.
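
For the Python route the original question allows, a minimal sketch of the "package it for later review" step might look like the following. It assumes the forum pages have already been saved locally (for example with wget --mirror or HTTrack, as suggested above) and that each post sits in a <div> whose class is "Message". That class name and every function name here are made up for illustration and will differ from forum to forum. It uses only the standard library and writes a bare-bones RTF file, so treat it as a starting point and not a finished tool.

```python
from html.parser import HTMLParser
from pathlib import Path


class PostTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains post_class."""

    def __init__(self, post_class):
        super().__init__()
        self.post_class = post_class
        self.posts = []    # finished posts, one string each
        self._depth = 0    # <div> nesting depth inside the current post
        self._chunks = []  # text fragments of the current post

    def handle_starttag(self, tag, attrs):
        if self._depth and tag in ("br", "p"):
            self._chunks.append(" ")  # keep words on separate lines apart
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested <div> inside a post
        elif self.post_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html, post_class="Message"):
    """Return the post texts found in one HTML page ("Message" is an assumption)."""
    parser = PostTextExtractor(post_class)
    parser.feed(html)
    parser.close()
    return parser.posts


def rtf_escape(text):
    """Escape RTF control characters and encode non-ASCII as \\uN? sequences."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "little")
                if unit > 32767:  # RTF expects a signed 16-bit number
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)


def posts_to_rtf(posts):
    """Wrap the posts in a minimal RTF document, one paragraph per post."""
    body = "".join(rtf_escape(p) + "\\par\\par\n" for p in posts)
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\n" + body + "}"


def mirror_to_rtf(mirror_dir, out_path, post_class="Message"):
    """Read every *.html file under mirror_dir and write all posts to out_path."""
    posts = []
    for page in sorted(Path(mirror_dir).rglob("*.html")):
        html = page.read_text(encoding="utf-8", errors="replace")
        posts.extend(extract_posts(html, post_class))
    Path(out_path).write_text(posts_to_rtf(posts), encoding="ascii")
    return len(posts)
```

Calling mirror_to_rtf("path/to/mirror", "posts.rtf") on a mirrored copy would then produce a single file that Word can open. To adapt it, look at the saved HTML of the forum in question and pass the class name that actually wraps each post as post_class.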