Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month at the Massachusetts Institute of Technology, in Building E51.

BLU Discuss list archive



[Discuss] My first contribution to MediaWiki



Greg Rundlett (freephile) wrote:
> The project page: http://www.mediawiki.org/wiki/Extension:Html2Wiki
> 
> It's an extension to MediaWiki that lets you "import a website or web page
> into your wiki".

  "It does this by first "normalizing" the content with HTMLTidy, and
  then "sanitizing" it with Purify and Regular Expressions. Then the
  content is "converted" from HTML to WikiText using Regular Expressions
  and a Parsoid service."

Amazing that such a conversion is even possible, given how problematic
most HTML is. In some ways this job is harder than what browsers do when
parsing HTML, as you aren't just rendering the result, but trying to
extract structure - or semantic meaning - from it.

Does HTMLTidy do a lot of the heavy lifting for you? Do you still end up
with a lot of situations where you have multiple HTML constructs that
map to a single wiki markup construct?
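
To make the many-to-one question concrete, here is a minimal sketch in Python of how several HTML tags could collapse into one wiki markup construct. It is not taken from Html2Wiki; the names TAG_TO_WIKI and inline_to_wikitext are hypothetical, and it assumes the input was already tidied so the tags are well formed and carry no attributes.

```python
import re

# Hypothetical mapping table (not from Html2Wiki): several HTML tags
# collapse to the same wikitext delimiter.
TAG_TO_WIKI = {
    "b": "'''", "strong": "'''",   # bold
    "i": "''", "em": "''",         # italic
}

def inline_to_wikitext(html: str) -> str:
    """Replace simple, attribute-free inline tags with wikitext quote markup."""
    for tag, delim in TAG_TO_WIKI.items():
        # Non-greedy match of <tag>...</tag>; tags with attributes are left alone.
        pattern = re.compile(r"<{0}>(.*?)</{0}>".format(tag),
                             re.IGNORECASE | re.DOTALL)
        html = pattern.sub(lambda m, d=delim: d + m.group(1) + d, html)
    return html

print(inline_to_wikitext("<b>bold</b>, <strong>also bold</strong>, <em>italic</em>"))
```

Anything with attributes, or nested same-name tags, would need a real parser rather than regular expressions, which is presumably where HTMLTidy and Parsoid come in.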

How does it handle HTML generated or loaded by JS, as is quite common
now? (You might be able to work around that with one of the projects
that use an embedded and programmatically controlled web rendering
engine, like WebKit.)

What are the advantages of implementing this as a plugin rather than a
separate command line tool (which could then support other markup
formats, like Markdown)?

If you couldn't find an existing HTML to wiki markup converter, did you
look for something similar, like a converter to Markdown? A search for
this turns up hits, such as:

http://johnmacfarlane.net/pandoc/README.html

with an example:

  pandoc -f html -t markdown http://www.fsf.org

which presumably retrieves content from http://www.fsf.org, specified to
be in HTML format, and outputs Markdown. (It also supports MediaWiki
format.)

If you're using a tool that doesn't support MediaWiki directly, I imagine
the conversion from Markdown to MediaWiki markup is relatively easy.
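
As a rough illustration of why that second step looks easy, here is a minimal sketch in Python covering only three constructs: ATX headings, bold, and inline links. The function name markdown_to_mediawiki is hypothetical, and the sketch ignores italics, lists, code blocks, and nesting, so treat it as a toy rather than a real converter.

```python
import re

def markdown_to_mediawiki(text: str) -> str:
    """Convert a small Markdown subset to MediaWiki markup (illustrative only)."""
    out = []
    for line in text.splitlines():
        # "## Title" -> "== Title ==" (same number of '=' as '#')
        m = re.match(r"^(#{1,6})\s+(.+?)\s*$", line)
        if m:
            bars = "=" * len(m.group(1))
            line = "{0} {1} {0}".format(bars, m.group(2))
        # "**bold**" -> "'''bold'''"
        line = re.sub(r"\*\*(.+?)\*\*", r"'''\1'''", line)
        # "[text](url)" -> "[url text]" (MediaWiki external link syntax)
        line = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", r"[\2 \1]", line)
        out.append(line)
    return "\n".join(out)

print(markdown_to_mediawiki("## Links\nSee the **[FSF](http://www.fsf.org)** site."))
```

The hard cases (tables, nested lists, raw HTML embedded in the Markdown) are where a dedicated tool would earn its keep.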

 -Tom

-- 
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/



BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org