Last modified: 2014-07-10 16:33:13 UTC

Bug 54617 - Replace Tidy with a library that doesn't suck
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: 1.22.0
Hardware: All All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: tidy
Reported: 2013-09-25 23:01 UTC by Bartosz Dziewoński
Modified: 2014-07-10 16:33 UTC

Description Bartosz Dziewoński 2013-09-25 23:01:00 UTC
I mean, just look at https://bugzilla.wikimedia.org/buglist.cgi?quicksearch=tidy or blockers to bug 2542.

Tidy is awful and we need to get rid of it. There's gotta be a less crappy library that will just close unclosed HTML tags without messing with their contents.
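
As a rough illustration of what "just close unclosed HTML tags without messing with their contents" could mean, here is a minimal Python sketch built on the standard library's html.parser. It is not what Tidy, the core parser, or Parsoid actually do, and it does not implement the HTML5 tree-building algorithm. The names TagBalancer and balance are hypothetical, made up for this example. It keeps a stack of open elements, drops stray closing tags, and closes whatever is still open at the end.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag in HTML5.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emits markup and closes what was left open."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []     # output fragments, joined at the end
        self.stack = []   # names of currently open elements

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is passed through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element opened after the matching one, then the match.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def balance(html):
    """Return html with unclosed tags closed and stray closing tags removed."""
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    # Close anything still open, innermost element first.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

For example, balance("<div><b>bold text") returns "<div><b>bold text</b></div>", and balance("text</div> more") returns "text more". Text and attributes pass through unchanged, but mis-nested markup is only closed in stack order, so the result can differ from what a browser's HTML5 parser would build.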
Comment 1 MZMcBride 2013-09-25 23:05:14 UTC
Brion suggested that switching the parser to use Parsoid would make HTMLTidy unnecessary.
Comment 2 Brion Vibber 2013-09-26 16:12:05 UTC
So the main reason we went with Tidy in the first place I think was to ensure that we had well-formed (X)HTML output, so XML parsers wouldn't die and browsers wouldn't do exciting things if you had a stray </div> somewhere.

The core parser tries to do some HTML cleanup, but it was never very complete.

Possibilities include:

a) Fix the HTML fixups in the core parser, and make sure non-Tidy output is compatible with the current Tidy output

b) Replace Tidy with another tool that's less annoying

c) Replace the core parser with something that already outputs valid HTML5 (such as Parsoid)


Long-term I like c) but I don't think we're there yet. :)

a) and b) are the things I'd recommend looking at if we really want to kill tidy in the short/medium term.
Comment 3 Gabriel Wicke 2013-10-08 21:09:21 UTC
We hope to be at a point where we can consider using Parsoid output for regular page views by next summer. See https://www.mediawiki.org/wiki/Parsoid/Roadmap.

In Parsoid, an HTML5 treebuilder provides the bulk of the required clean-up. We also approximate the PHP / tidy parser's deviation from the standard cleanup in custom passes to make sure that the semantics of content written against the current setup are preserved.
Comment 4 Bartosz Dziewoński 2013-10-08 21:11:12 UTC
You can't possibly want to require every MediaWiki installation everywhere to use Parsoid? The node.js dependency is unacceptable in most scenarios.
Comment 5 Krinkle 2013-10-08 21:13:07 UTC
Rephrasing summary to reflect that we don't intend to get rid of fixing unclosed tags, but Tidy specifically (we shouldn't kill Tidy without adding something else, so that makes the bug more "atomic")
Comment 6 Gabriel Wicke 2013-10-08 21:14:15 UTC
Parsoid would only be needed for wikitext editing and -templating. HTML-only wikis would basically serve XHTML straight from storage.
Comment 7 Bartosz Dziewoński 2013-10-08 21:16:38 UTC
(In reply to comment #6)
> Parsoid would only be needed for wikitext editing and -templating. HTML-only
> wikis would basically serve XHTML straight from storage.

You can't possibly want to require every MediaWiki installation everywhere to switch to editing raw HTML by hand (VE depends on Parsoid…).
Comment 8 Gabriel Wicke 2013-10-08 21:22:35 UTC
(In reply to comment #7)
> You can't possibly want to require every MediaWiki installation everywhere to
> switch to editing raw HTML by hand (VE depends on Parsoid…).

VE is an HTML editor, so can be used without Parsoid.
Comment 9 Bartosz Dziewoński 2013-10-09 11:30:13 UTC
(In reply to comment #8)
> VE is an HTML editor, so can be used without Parsoid.

Well yeah, okay, this could work. VE, however, has certain software and hardware requirements not all computers meet. And there's the entire issue of "templating" which you dismissed with a single word, which I assume is currently not implemented without wikitext backing it.

VE also currently doesn't work for, say, talk pages (and please don't mention Flow, it will not be ready by next summer) or edit summaries, and there are certain pieces of the interface which show raw source code like diffs (I don't think anybody has implemented rich text diffs yet in MediaWiki, but this is something I'd really like to see).

Using Parsoid for page view is just not workable in short or mid term, no matter how much we would want it.

/offtopic
Comment 10 Gabriel Wicke 2013-11-08 22:39:43 UTC
(In reply to comment #9)
> Using Parsoid for page view is just not workable in short or mid term, no
> matter how much we would want it.

Which issues do you see apart from rendering quality / compatibility?
Comment 11 Bartosz Dziewoński 2013-11-08 23:13:34 UTC
(In reply to comment #10)
> Which issues do you see apart from rendering quality / compatibility?

Compatibility/availability is the single showstopper issue here. I can't run server-side JavaScript on most free hostings.
Comment 12 James Forrester 2013-11-08 23:16:34 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > Which issues do you see apart from rendering quality / compatibility?
> 
> Compatibility/availability is the single showstopper issue here. I can't run
> server-side JavaScript on most free hostings.

So it's back to the policy question of what MediaWiki is intended to be - a great wiki for large- and medium-scale wikis, or a hodge-podge of tools which are limited by ease of download-the-zip-file installation over a proper management tool, rather than by what is best for users?
Comment 13 Bartosz Dziewoński 2013-11-08 23:20:12 UTC
MediaWiki is intended to be both, if you ask me. I don't see how your question is relevant to the bug, since I am not proposing to make it a hodge-podge.
Comment 14 Erwin Dokter 2014-02-20 13:14:00 UTC
I run into too many problems because of Tidy. Its main flaw is that it is not compatible with HTML5; it hasn't been updated since 2008(!). Most problems stem from Tidy not allowing any block elements inside inline elements (which is allowed in HTML5): it kicks them out, which results in broken HTML, even though its goal is to prevent exactly that.

Is there no library that has the same functionality and is up to date?
Comment 15 Erwin Dokter 2014-02-20 13:48:20 UTC
Found a library called HTML Purifier, but that's more of an 'evil code' filter with some 'Tidy inspired' features. Probably not what we want.

There is also tidy-html5 [1], a fork that aims for full HTML5 support.

[1] https://github.com/w3c/tidy-html5
Comment 16 Gabriel Wicke 2014-02-20 17:53:14 UTC
(In reply to Bartosz Dziewoński from comment #11)
> (In reply to comment #10)
> > Which issues do you see apart from rendering quality / compatibility?
> 
> Compatibility/availability is the single showstopper issue here. I can't run
> server-side JavaScript on most free hostings.

Nor can you typically run tidy there. Virtual machines are really cheap these days (starting at about $30 / year), so cost is no longer the issue that prevents people from installing better tools for the job. Missing packaging is another point, but that is also being addressed (parsoid is now debianized).

In any case, we are working on being ready to start using Parsoid HTML for normal page views this summer. We might not want to maintain the PHP parser in the longer term, and are thus less likely to spend much effort on replacing tidy right now.
Comment 17 Bartosz Dziewoński 2014-02-20 18:13:33 UTC
(In reply to Gabriel Wicke from comment #16)
> Nor can you typically run tidy there.

Citation needed. http://www.php.net/manual/en/book.tidy.php It's definitely more likely to be accessible than having node and being able to shell out.


> Virtual machines are really cheap
> these days (starting at about $30 / year), so cost is no longer the issue
> that prevents people from installing better tools for the job.

$30 is not within the reach of everyone. There's also the fact that you have to have a credit card to get any reputable paid hosting, and that's also not a given in the whole world.
Comment 18 Gabriel Wicke 2014-02-20 18:33:22 UTC
(In reply to Bartosz Dziewoński from comment #17) 
> $30 is not within the reach of everyone. There's also the fact that you have
> to have a credit card to get any reputable paid hosting, and that's also not
> a given in the whole world.

Depending on your use case there are also free options like Wikia and other non-profit options without ads. Free shared hosting is not automatically going to be more reputable than free VM hosting, nor do I see systematic differences in payment methods.

You are free to work on MediaWiki on shared hosting of course. All I'm saying is that there are few remaining reasons for us to

- spend major resources on shared hosting support, and 
- let it hold back our architectural development at the expense of security, performance and maintainability
Comment 19 C. Scott Ananian 2014-07-10 16:33:13 UTC
At some point I would like to replace tidy with an API-compatible binary which uses the standard HTML5 parser mechanism. It's on my list of 'free time projects'. There are lots of HTML5 parser libraries now.
