Last modified: 2014-07-04 16:32:40 UTC
MediaWiki currently stores the entire page content as WikiText. In addition to WikiText, we would like to store

* The fully expanded HTML DOM
* Page properties: categories, magic word flags (notoc etc), DISPLAYTITLE, bug 48812, etc
* Parsoid-internal information: Basically data-parsoid moved out of the main page DOM

Eventually we'd also like to be able to drop WikiText storage without having to rework the storage architecture.

In the current MediaWiki external storage and ContentHandler architecture this can be achieved by adding a multi-part content type with a corresponding ContentHandler. This could be a JSON object or some other serialization.

A possible downside of the compound document approach stems from the need to update transclusion or image expansions for a given revision. With append-only and immutable external storage this can be implemented by storing a new compound document and then updating the revision to point to it. Without garbage collection this will result in several copies of unmodified WikiText and page properties in external storage. However, this issue should probably be addressed in the storage layer.
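To make the duplication concern concrete, here is a minimal sketch of the compound document approach on top of append-only storage. The `AppendOnlyStore` class, the `content_address` field and the part names are made up for illustration; they are not existing MediaWiki or external storage APIs.

```php
<?php
// Hypothetical append-only blob store: a blob is never modified once written.
final class AppendOnlyStore {
    /** @var array<int, string> address => serialized blob */
    private array $blobs = [];

    public function append( string $blob ): int {
        $this->blobs[] = $blob;
        return count( $this->blobs ) - 1; // address of the new blob
    }

    public function fetch( int $address ): string {
        return $this->blobs[$address];
    }
}

$store = new AppendOnlyStore();

// Compound document serialized as one JSON object (one possible serialization).
$doc = [
    'wikitext'  => "{{Infobox}}\nSome text.",
    'pageprops' => [ 'notoc' => true ],
    'html'      => '<p>Some text.</p>',
];

// The revision only holds a pointer to the compound document.
$revision = [ 'id' => 1, 'content_address' => $store->append( json_encode( $doc ) ) ];

// A transcluded template changes: only the HTML part differs, but the whole
// document is stored again and the revision is re-pointed to the new copy.
$doc['html'] = '<table>new infobox</table><p>Some text.</p>';
$oldAddress = $revision['content_address'];
$revision['content_address'] = $store->append( json_encode( $doc ) );
```

After the re-render the old blob is still in the store, so the unmodified WikiText and page properties now exist twice. That is the duplication that would have to be handled by garbage collection or some other mechanism in the storage layer.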
Flow needs something very similar: https://www.mediawiki.org/wiki/Flow_Portal/Architecture/Discussion_Storage
We should provide an abstract interface to retrieve parts of multi-part content, so that the storage implementation in the backend can be optimized independently. Possibly something like this:

    $html = $rev->getPart( "html" );
    $wikitext = $rev->getPart( "wikitext" );
    $pageProps = json_decode( $rev->getPart( "pageprops" ) );

Parts can be set / updated with

    $rev->setPart( "key", "value" );

The backend is free to store each part independently or concatenate parts with some efficient segmentation mechanism.

This part interface can be used by higher-level content handlers to implement a consistent ContentHandler interface.
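As a rough sketch of how such a part interface could hide the storage layout, the following uses a backend that concatenates all parts of a revision into one JSON blob. Only `getPart()` and `setPart()` come from the proposal above; `PartBackend`, `JsonBlobBackend` and `SketchRevision` are hypothetical names and not existing MediaWiki classes.

```php
<?php
// Hypothetical backend contract; callers never see how parts are laid out.
interface PartBackend {
    public function load( int $revId, string $key ): ?string;
    public function save( int $revId, string $key, string $value ): void;
}

// One possible backend: all parts of a revision live in a single JSON blob.
final class JsonBlobBackend implements PartBackend {
    /** @var array<int, string> revision id => JSON blob */
    private array $blobs = [];

    public function load( int $revId, string $key ): ?string {
        $parts = json_decode( $this->blobs[$revId] ?? '{}', true );
        return $parts[$key] ?? null;
    }

    public function save( int $revId, string $key, string $value ): void {
        $parts = json_decode( $this->blobs[$revId] ?? '{}', true );
        $parts[$key] = $value;
        $this->blobs[$revId] = json_encode( $parts );
    }
}

// Stand-in for the revision object used in the proposal.
final class SketchRevision {
    private int $id;
    private PartBackend $backend;

    public function __construct( int $id, PartBackend $backend ) {
        $this->id = $id;
        $this->backend = $backend;
    }

    public function getPart( string $key ): ?string {
        return $this->backend->load( $this->id, $key );
    }

    public function setPart( string $key, string $value ): void {
        $this->backend->save( $this->id, $key, $value );
    }
}

$rev = new SketchRevision( 42, new JsonBlobBackend() );
$rev->setPart( 'wikitext', "__NOTOC__\nHello" );
$rev->setPart( 'html', '<p>Hello</p>' );
$rev->setPart( 'pageprops', json_encode( [ 'notoc' => true ] ) );

$html = $rev->getPart( 'html' );
$pageProps = json_decode( $rev->getPart( 'pageprops' ) );
```

A backend that stores each part under its own key could implement the same `PartBackend` contract, and the calling code would not need to change.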
gwicke and i seem to disagree whether this plan involves eventually removing page properties (ie, #REDIRECT, __NOTOC__) from the wikitext/DOM or not.
(In reply to comment #3)
> gwicke and i seem to disagree whether this plan involves eventually removing
> page properties (ie, #REDIRECT, __NOTOC__) from the wikitext/DOM or not.

This bug is primarily about the multi-part storage, but I'll reply nevertheless:

The VE page property dialog makes page-global properties easier to discover and modify. This dialog (or something very close to it) can also be used in combination with wikitext editing. With a page property UI and diffing support for properties in place I don't see a good reason for keeping page properties both in a versioned page property structure *and* inline in the longer term.
Just a clarification for those who have not been following the discussions in the last year or so:

* We plan to store fully expanded snapshots of HTML for each revision. This makes HTML retrieval fast, but also lets us provide a view of the page as it looked in the past, including all transclusion/extension/file dependencies. There are some storage volume trade-offs (which can likely be addressed with compression), but if we decide to store a snapshot on each re-render after a template/file change, we can provide a copy of each page at any point in the past. Retrieving 'yesterday's Main Page' becomes possible. So far that has only been possible with the flagged revision extension at considerable expense.
* Other data structures that change with transclusion/extension/file updates will be snapshotted similarly. This applies to dynamic page properties for example.
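To illustrate the 'yesterday's Main Page' lookup: if a snapshot is stored on every edit and on every re-render, retrieving a past view amounts to picking the latest snapshot at or before the requested time. A minimal sketch, assuming snapshots are keyed by render timestamp; `snapshotAt()` and the sample data are hypothetical and not part of any existing API.

```php
<?php
/**
 * Return the HTML of the most recent snapshot rendered at or before
 * $timestamp, or null if the page had no snapshot yet at that time.
 *
 * @param array<int, string> $snapshots render time (Unix seconds) => HTML
 */
function snapshotAt( array $snapshots, int $timestamp ): ?string {
    $best = null;
    $bestTime = null;
    foreach ( $snapshots as $time => $html ) {
        if ( $time <= $timestamp && ( $bestTime === null || $time > $bestTime ) ) {
            $bestTime = $time;
            $best = $html;
        }
    }
    return $best;
}

// Made-up snapshots of one page, one per re-render.
$snapshots = [
    strtotime( '2014-07-01 00:00:00 UTC' ) => '<p>Featured: A</p>',
    strtotime( '2014-07-02 00:00:00 UTC' ) => '<p>Featured: B</p>',
    strtotime( '2014-07-03 00:00:00 UTC' ) => '<p>Featured: C</p>',
];

// The page as it looked on the evening of 2014-07-02.
$yesterday = snapshotAt( $snapshots, strtotime( '2014-07-02 18:30:00 UTC' ) );
```

A real backend would presumably answer this with an index on (page, timestamp) rather than a linear scan; the scan only keeps the sketch short.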
See bug 851 (from 2004) for the "Yesterday's Main Page" problem.
Moving page properties to page metadata is tracked in bug 53508.
*** Bug 52796 has been marked as a duplicate of this bug. ***