
BLU Discuss list archive



[Discuss] What's the best site-crawler utility?



Plus one for HTTrack.  I used it a couple of months ago to convert a
terrible, hacked Joomla site to static HTML.  It was a bit of a pain to use
at first (having to drive it through Firefox, for one), but it worked as
advertised.

Hope that helps.

On Tue, Jan 7, 2014 at 10:34 PM, Greg Rundlett (freephile)
<greg at freephile.com> wrote:
> Hi Bill,
>
> The GPL-licensed HTTrack Website Copier works well (http://www.httrack.com/).
> I have not tried it on a MediaWiki site, but it's pretty adept at copying
> websites, including dynamically generated ones.
>
> They say: "It allows you to download a World Wide Web site from the
> Internet to a local directory, building recursively all directories,
> getting HTML, images, and other files from the server to your computer.
> HTTrack arranges the original site's relative link-structure. Simply open a
> page of the "mirrored" website in your browser, and you can browse the site
> from link to link, as if you were viewing it online. HTTrack can also
> update an existing mirrored site, and resume interrupted downloads. HTTrack
> is fully configurable, and has an integrated help system.
>
> WinHTTrack is the Windows 2000/XP/Vista/Seven release of HTTrack, and
> WebHTTrack the Linux/Unix/BSD release which works in your browser. There is
> also a command-line version 'httrack'."
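>
> A command-line mirror run looks roughly like this (untested here; adjust
> the output directory and the filter to your own site):
>
>   httrack "http://mysite.example.com/" -O /tmp/mysite-mirror \
>       "+mysite.example.com/*" -v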
>
> HTTrack is actually similar in its results to the wget -k -m -np
> http://mysite that Matt mentions, but it may be easier to use in general
> and offers a GUI to drive the options you want.
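>
> Spelled out with long options, that wget run would be roughly as follows
> (the --page-requisites and --adjust-extension flags are extras I'd add so
> that images/CSS come along and dynamic pages get .html extensions):
>
>   wget --mirror --convert-links --no-parent --page-requisites \
>       --adjust-extension http://mysite/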
>
> Using the MediaWiki API to export pages is another option if you have
> specific needs that cannot be addressed by a "mirror" operation (e.g., your
> wiki has namespaced content that you want to treat differently). If you
> end up exporting via "Special:Export" or the API, you will then face the
> task of converting the resulting XML to HTML.  I have some notes about wiki
> format conversions at https://freephile.org/wiki/index.php/Format_conversion
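>
> As a rough sketch (wiki.example.com is a placeholder for your wiki's URL),
> the API side of that looks something like:
>
>   # list pages in the main namespace
>   curl 'https://wiki.example.com/api.php?action=query&list=allpages&aplimit=500&format=json'
>   # fetch a single page already rendered to HTML (returned in the "text" field)
>   curl 'https://wiki.example.com/api.php?action=parse&page=Main_Page&format=json'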
>
> For that conversion step, there's pandoc: "If you need to convert files
> from one markup format into another, pandoc is your swiss-army knife."
> http://johnmacfarlane.net/pandoc/
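>
> E.g., once you have the raw wikitext of a page saved to a file, something
> like this should give you a standalone HTML page:
>
>   pandoc -f mediawiki -t html -s -o Some_Page.html Some_Page.wiki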
>
> ~ Greg
>
> Greg Rundlett
>
>
> On Tue, Jan 7, 2014 at 6:49 PM, Bill Horne <bill at horne.net> wrote:
>
>> I need to copy the contents of a wiki into static pages, so please
>> recommend a good web crawler that can download an existing site as static
>> content. It needs to run on Debian 6.0.
>>
>> Bill
>>
>> --
>> Bill Horne
>> 339-364-8487
>>



-- 
Eric Chadbourne
617.249.3377
http://theMnemeProject.org/
http://WebnerSolutions.com/


