Last modified: 2014-07-04 16:32:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. Logging in is not possible, and apart from the bug reports and their history, links might be broken. See T51143, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 49143 - Store HTML and page properties with multi-part content handler
Status: NEW
Product: Parsoid
Classification: Unclassified
Component: General
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Gabriel Wicke
Duplicates: 52796
Depends on:
Blocks: 851 53508 53784
Reported: 2013-06-04 17:37 UTC by Gabriel Wicke
Modified: 2014-07-04 16:32 UTC

Description Gabriel Wicke 2013-06-04 17:37:25 UTC
MediaWiki currently stores the entire page content as WikiText. In addition to WikiText, we would like to store

* The fully expanded HTML DOM
* Page properties: categories, magic word flags (notoc etc), DISPLAYTITLE, bug 48812, etc
* Parsoid-internal information: Basically data-parsoid moved out of the main page DOM

Eventually we'd also like to be able to drop WikiText storage without having to rework the storage architecture.

In the current MediaWiki external storage and ContentHandler architecture this can be achieved by adding a multi-part content type with a corresponding ContentHandler. This could be a JSON object or some other serialization.
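
As a rough illustration of the JSON option, a compound document could be serialized along the following lines. This is only a sketch: the `serializeCompound()` and `unserializeCompound()` helpers, the part names and the `version` field are hypothetical and are not part of MediaWiki's ContentHandler API.

```php
<?php
// Hypothetical helpers for illustration only; not part of MediaWiki.

/**
 * Serialize named parts into a single JSON blob.
 *
 * @param array $parts Map of part name => string payload
 */
function serializeCompound( array $parts ): string {
	return json_encode( [
		'version' => 1, // assumed format version marker
		'parts' => $parts,
	] );
}

/**
 * Decode a blob produced by serializeCompound() back into its parts.
 */
function unserializeCompound( string $blob ): array {
	$data = json_decode( $blob, true );
	if ( !is_array( $data ) || !isset( $data['parts'] ) ) {
		throw new InvalidArgumentException( 'Not a compound document' );
	}
	return $data['parts'];
}

$blob = serializeCompound( [
	'wikitext' => "'''Hello'''",
	'html' => '<p><b>Hello</b></p>',
	'pageprops' => json_encode( [ 'notoc' => true ] ),
] );
$parts = unserializeCompound( $blob );
```

A different serialization could be swapped in behind the same two functions without changing callers.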

A possible downside of the compound document approach stems from the need to update transclusion or image expansions for a given revision. With append-only and immutable external storage this can be implemented by storing a new compound document and then updating the revision to point to it. Without garbage collection this will result in several copies of unmodified WikiText and page properties in external storage. However, this issue should probably be addressed in the storage layer.
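
The duplication described above can be made concrete with a small in-memory stand-in for append-only storage. The `AppendOnlyStore` class and the `$revisionPointer` array are hypothetical and only model the behaviour; real external storage differs in detail.

```php
<?php
// Hypothetical stand-in for append-only, immutable external storage.
class AppendOnlyStore {
	/** @var string[] */
	private $blobs = [];

	/** Append a blob and return its address; existing blobs never change. */
	public function append( string $blob ): int {
		$this->blobs[] = $blob;
		return count( $this->blobs ) - 1;
	}

	public function fetch( int $address ): string {
		return $this->blobs[$address];
	}

	/** Number of blobs written so far. */
	public function size(): int {
		return count( $this->blobs );
	}
}

$store = new AppendOnlyStore();
$revisionPointer = []; // revision id => blob address (assumed layout)

$doc = [ 'wikitext' => '{{Infobox}}', 'html' => '<table>v1</table>' ];
$revisionPointer[42] = $store->append( json_encode( $doc ) );

// A template changed: only the HTML part differs, but the whole compound
// document is written again and the revision is pointed at the new copy.
$doc['html'] = '<table>v2</table>';
$revisionPointer[42] = $store->append( json_encode( $doc ) );

// Without garbage collection, the old blob (including its unchanged
// wikitext) stays in the store.
$old = json_decode( $store->fetch( 0 ), true );
$new = json_decode( $store->fetch( $revisionPointer[42] ), true );
```

Both `$old` and `$new` carry the same wikitext, which is the redundant copy the description refers to.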
Comment 1 Gabriel Wicke 2013-06-12 18:47:46 UTC
Flow needs something very similar: https://www.mediawiki.org/wiki/Flow_Portal/Architecture/Discussion_Storage
Comment 2 Gabriel Wicke 2013-07-15 22:25:23 UTC
We should provide an abstract interface to retrieve parts of multi-part content, so that the storage implementation in the backend can be optimized independently. Possibly something like this:

$html = $rev->getPart( "html" );
$wikitext = $rev->getPart( "wikitext" );
$pageProps = json_decode( $rev->getPart( "pageprops" ) );

Parts can be set / updated with $rev->setPart( "key", "value" );

The backend is free to store each part independently or concatenate parts with some efficient segmentation mechanism. This part interface can be used by higher-level content handlers to implement a consistent ContentHandler interface.
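
A minimal sketch of how two backends could sit behind the same part interface. Only the `getPart()` / `setPart()` method names come from the proposal above; the `PartStore` interface and both classes are hypothetical, and the offset index is just one possible segmentation mechanism.

```php
<?php
// Hypothetical names; only getPart()/setPart() come from the proposal.
interface PartStore {
	public function getPart( string $key ): ?string;
	public function setPart( string $key, string $value ): void;
}

/** Backend A: every part is kept as its own entry. */
class SeparatePartStore implements PartStore {
	private $parts = [];

	public function getPart( string $key ): ?string {
		return $this->parts[$key] ?? null;
	}

	public function setPart( string $key, string $value ): void {
		$this->parts[$key] = $value;
	}
}

/** Backend B: parts are concatenated into one blob with an offset index. */
class ConcatenatedPartStore implements PartStore {
	private $blob = '';
	private $index = []; // key => [ offset, length ]

	public function getPart( string $key ): ?string {
		if ( !isset( $this->index[$key] ) ) {
			return null;
		}
		[ $offset, $length ] = $this->index[$key];
		return substr( $this->blob, $offset, $length );
	}

	public function setPart( string $key, string $value ): void {
		// An updated part is appended and the index is repointed;
		// the old bytes remain in the blob.
		$this->index[$key] = [ strlen( $this->blob ), strlen( $value ) ];
		$this->blob .= $value;
	}
}

$results = [];
foreach ( [ new SeparatePartStore(), new ConcatenatedPartStore() ] as $rev ) {
	$rev->setPart( 'html', '<p>Hi</p>' );
	$rev->setPart( 'pageprops', json_encode( [ 'notoc' => true ] ) );
	$rev->setPart( 'html', '<p>Hello</p>' ); // update an existing part
	$results[get_class( $rev )] = [
		$rev->getPart( 'html' ),
		json_decode( $rev->getPart( 'pageprops' ), true ),
	];
}
```

Callers see identical results from either backend, which is what would let the storage layout be optimized independently.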
Comment 3 C. Scott Ananian 2013-08-14 17:01:08 UTC
gwicke and I seem to disagree about whether this plan involves eventually removing page properties (i.e., #REDIRECT, __NOTOC__) from the wikitext/DOM or not.
Comment 4 Gabriel Wicke 2013-08-14 17:35:23 UTC
(In reply to comment #3)
> gwicke and I seem to disagree about whether this plan involves eventually removing
> page properties (i.e., #REDIRECT, __NOTOC__) from the wikitext/DOM or not.

This bug is primarily about the multi-part storage, but I'll reply nevertheless:

The VE page property dialog makes page-global properties easier to discover and modify. This dialog (or something very close to it) can also be used in combination with wikitext editing.

With a page property UI and diffing support for properties in place I don't see a good reason for keeping page properties both in a versioned page property structure *and* inline in the longer term.
Comment 5 Gabriel Wicke 2013-08-28 19:13:07 UTC
Just a clarification for those who have not been following the discussions in the last year or so:

* We plan to store fully expanded snapshots of HTML for each revision. This makes HTML retrieval fast, and also lets us provide a view of the page as it looked in the past, including all transclusion/extension/file dependencies. There are some storage volume trade-offs (which can likely be addressed with compression), but if we decide to store a snapshot on each re-render after a template/file change, we can provide a copy of each page at any point in the past. Retrieving 'yesterday's Main Page' becomes possible. So far that has only been possible with the flagged revisions extension, at considerable expense.

* Other data structures that change with transclusion/extension/file updates will be snapshotted similarly. This applies to dynamic page properties for example.
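
With per-re-render snapshots, retrieving a page as of some point in the past amounts to finding the newest snapshot rendered at or before that time. A minimal sketch, assuming snapshots are available as a list sorted by render time; the `snapshotAt()` helper and the array layout are hypothetical.

```php
<?php
/**
 * Return the newest snapshot rendered at or before $timestamp, or null
 * if the page had no snapshot yet.
 *
 * @param array $snapshots List of [ 'rendered' => int, 'html' => string ],
 *   sorted ascending by 'rendered' (assumed layout)
 * @param int $timestamp Unix timestamp to look up
 */
function snapshotAt( array $snapshots, int $timestamp ): ?array {
	$found = null;
	foreach ( $snapshots as $snapshot ) {
		if ( $snapshot['rendered'] > $timestamp ) {
			break; // everything after this is newer than requested
		}
		$found = $snapshot;
	}
	return $found;
}

$snapshots = [
	[ 'rendered' => 1000, 'html' => '<p>edit saved</p>' ],
	[ 'rendered' => 2000, 'html' => '<p>template changed</p>' ],
	[ 'rendered' => 3000, 'html' => '<p>file re-uploaded</p>' ],
];
$yesterday = snapshotAt( $snapshots, 2500 );
```

Here `$yesterday` is the snapshot taken after the template change, since the later file re-upload had not happened yet at time 2500.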
Comment 6 Gabriel Wicke 2013-08-28 19:41:59 UTC
See bug 851 (from 2004) for the "Yesterday's Main Page" problem.
Comment 7 Gabriel Wicke 2013-08-30 02:52:19 UTC
Moving page properties to page metadata is tracked in bug 53508.
Comment 8 Gabriel Wicke 2013-10-01 20:58:40 UTC
*** Bug 52796 has been marked as a duplicate of this bug. ***
