Last modified: 2014-11-12 19:02:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T54936, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 52936 - Move data-parsoid and possibly data-mw out of the DOM, add uids
Product: Parsoid
Classification: Unclassified
Component: General
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Arlo Breault
Depends on: 48483
Blocks: 53784 64171
Reported: 2013-08-16 17:55 UTC by Gabriel Wicke
Modified: 2014-11-12 19:02 UTC (History)

Description Gabriel Wicke 2013-08-16 17:55:44 UTC
We need a general way to associate information with DOM nodes without storing that information inline. The current idea is to set a UID on each DOM node that has associated information, and to use that UID as the key into externally stored metadata. This would let us remove private information such as data-parsoid from the DOM we send to the client.

An issue to consider is copy & pasting between pages of the same wiki or even different wikis.

A simple and safe solution would be to discard all associated private information for modified (copy & pasted) content. This means that we would have to leave all semantic information (primarily data-mw) in the DOM even on page views. It also means that blame map information, for example, would be lost when a paragraph is moved around.

An alternative would be to make uids unique within a wiki, or even across wikis, for example as <wiki id>:<revision id>:<node id>. The uid 1000:40233066:100000, for instance, can be encoded as Po:CZehq:Yag. This would allow us to move data-mw out of the view DOM as well, and would open up interesting ways to preserve associated metadata like blame maps across copy & pastes. The wiki id would need to be unique, though, and there would need to be a public API to retrieve associated metadata. When the wiki id is not recognized or data retrieval fails, we might lose the associated data-mw as well.
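
The encoding in the example above is consistent with writing each component as a base-64 number whose digits are the standard base64 alphabet (A-Z, a-z, 0-9, +, /). The report does not name the encoding, so this is an inference from the one example given, and the function names below are hypothetical. A minimal sketch:

```python
import string

# Assumption: the standard base64 alphabet is used as positional digits.
# This is inferred from the single example in the description.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_int(n: int) -> str:
    """Write a non-negative integer in base 64 using ALPHABET as digits."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 64)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_int(s: str) -> int:
    """Inverse of encode_int; raises ValueError on an unknown digit."""
    n = 0
    for ch in s:
        n = n * 64 + ALPHABET.index(ch)
    return n


def encode_uid(wiki_id: int, revision_id: int, node_id: int) -> str:
    """Join the three encoded components with ':' as in the description."""
    return ":".join(encode_int(p) for p in (wiki_id, revision_id, node_id))


def decode_uid(uid: str) -> tuple:
    """Split an encoded uid back into (wiki id, revision id, node id)."""
    return tuple(decode_int(p) for p in uid.split(":"))


print(encode_uid(1000, 40233066, 100000))  # Po:CZehq:Yag
print(decode_uid("Po:CZehq:Yag"))          # (1000, 40233066, 100000)
```

Because ':' is not part of the digit alphabet, the three components can always be split apart again without ambiguity.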
Comment 1 ssastry 2013-08-27 18:24:37 UTC
As for data-mw, moving inlined data-mw into a single global data-mw JSON (or JSON-LD) object with information about all typed nodes might be a good intermediate fix. 

data-mw = {
  "#mwt1": {
    "@type": "mw:Transclusion",
    target: {...},
    params: {...}
  },
  "#mwt2": {
    "@type": "mw:Extension",
    target: "math",
    attrs: {...},  // or could be called params as well
    body: {...}
  }
}

This way, the DOM and the data about the DOM are separate and can be served separately if necessary. Clients that don't care about this information can ignore it completely, and the DOM itself is not bloated. It also eliminates one level of escaping, and clients like VE can process the data concurrently.
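
As a rough illustration of the proposal in this comment, the sketch below collects inline data-mw attributes into one object keyed by "#" plus the element id and removes them from the markup. It assumes the input is a well-formed XML serialization and that every element carrying data-mw already has an id. The function name and the sample attribute values are hypothetical and are not taken from Parsoid.

```python
import json
import xml.etree.ElementTree as ET


def split_data_mw(xhtml: str):
    """Return (markup without data-mw attributes, global data-mw object)."""
    root = ET.fromstring(xhtml)
    data_mw = {}
    for el in root.iter():
        if "data-mw" not in el.attrib:
            continue
        node_id = el.get("id")
        if node_id is None:
            raise ValueError("element with data-mw has no id")
        entry = json.loads(el.attrib.pop("data-mw"))
        # Mirror the "@type" key from the comment using the typeof attribute.
        typeof = el.get("typeof")
        if typeof is not None:
            entry = {"@type": typeof, **entry}
        data_mw["#" + node_id] = entry
    return ET.tostring(root, encoding="unicode"), data_mw


# Hypothetical input; the attribute payload is made up for illustration.
doc = (
    "<p>"
    '<span id="mwt1" typeof="mw:Transclusion" '
    'data-mw=\'{"target": {"name": "example"}, "params": {}}\'>hi</span>'
    "</p>"
)
html, data_mw = split_data_mw(doc)
print(html)
print(json.dumps(data_mw, indent=2))
```

The JSON payload is parsed once here instead of being carried as an escaped attribute string, which is the "one level of escaping" that the separate object avoids.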
Comment 2 Gabriel Wicke 2013-08-28 15:21:34 UTC
@subbu: Before we can remove data-mw from the content, we will need a solution for copy & pasting from a view. HTML copied from a read-only page will only have attributes, not data-mw. A paste target (VE, for example) would need to be able to retrieve the associated metadata, such as data-mw, based solely on the attributes in the pasted HTML fragment. Hence the UID scheme above.
Comment 3 Gabriel Wicke 2013-09-19 01:33:40 UTC
As an interim solution until we have separate storage for data-parsoid, we should consider moving data-parsoid into a single JSON structure in the head of the document and inserting locally unique ids to reference it.

This will make it easier to strip this out in the VE frontend (ideally just with a regexp), and can make our output usable for mobile.
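
A minimal sketch of this interim approach, assuming a well-formed XML serialization of the document. The script element id "mw-data-parsoid", the generated "mw<N>" ids, and both function names are hypothetical choices made for this illustration and are not what the eventual patch uses.

```python
import json
import re
import xml.etree.ElementTree as ET

SCRIPT_ID = "mw-data-parsoid"  # hypothetical id for the JSON block in <head>


def hoist_data_parsoid(xhtml: str) -> str:
    """Move every data-parsoid attribute into one JSON block in <head>."""
    root = ET.fromstring(xhtml)
    head = root.find("head")
    if head is None:
        raise ValueError("document has no <head>")
    used = {el.get("id") for el in root.iter() if el.get("id")}
    store = {}
    counter = 0
    for el in root.iter():
        raw = el.attrib.pop("data-parsoid", None)
        if raw is None:
            continue
        node_id = el.get("id")
        if node_id is None:
            # Generate an id that is unique within this document.
            while f"mw{counter}" in used:
                counter += 1
            node_id = f"mw{counter}"
            used.add(node_id)
            el.set("id", node_id)
        store[node_id] = json.loads(raw)
    script = ET.SubElement(
        head, "script", {"type": "application/json", "id": SCRIPT_ID}
    )
    script.text = json.dumps(store)
    return ET.tostring(root, encoding="unicode")


# A front end can drop the whole block with one regular expression.
STRIP_RE = re.compile(
    r'<script\b[^>]*\bid="' + re.escape(SCRIPT_ID) + r'"[^>]*>.*?</script>',
    re.DOTALL,
)


def strip_data_parsoid(html: str) -> str:
    return STRIP_RE.sub("", html, count=1)


# Made-up sample document and attribute values.
doc = (
    "<html><head><title>t</title></head><body>"
    '<p data-parsoid=\'{"dsr": [0, 5]}\'>hello</p>'
    '<p id="x" data-parsoid=\'{"dsr": [6, 11]}\'>world</p>'
    "</body></html>"
)
hoisted = hoist_data_parsoid(doc)
stripped = strip_data_parsoid(hoisted)
print(stripped)
```

Keeping all private data in a single, easily recognized element is what makes the regexp-only stripping in the front end feasible.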
Comment 4 Gabriel Wicke 2013-09-19 23:40:10 UTC
Subbu and I developed a solution for this based on simple id attributes and revision URL injection on copy. See the spec for the details:

I'll start by assigning ids and moving all data-parsoid info into the head of the document as a stop-gap until we have separate metadata storage in the revision store (see bug 49143). This makes it easier to strip it in a front-end to minimize the transferred size.
Comment 5 Gerrit Notification Bot 2013-10-08 01:24:03 UTC
Change 88395 had a related patch set uploaded by Arlolra:
WIP: Move data-parsoid into a JSON structure in the head
Comment 6 Gerrit Notification Bot 2013-10-25 18:43:45 UTC
Change 88395 merged by jenkins-bot:
Move data-parsoid into a JSON structure outside the DOM
Comment 7 Gabriel Wicke 2013-12-03 22:33:26 UTC
Support for separate data-parsoid is now merged. The next step will be to store this separately in Rashomon, so that our DOM output is actually free of this data. Keeping this bug open to track work on that as well.
Comment 8 Gabriel Wicke 2014-08-27 18:59:00 UTC
Next steps:

* Return a (JSON?) compound response with separate data-parsoid, data-mw & HTML from a Parsoid web API end point
* Accept the same as an input, or (alternatively) pull data-parsoid from restface separately.
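
One possible shape for such a compound payload is sketched below. The key names "html", "data-parsoid" and "data-mw" follow the terms used in this report, but the exact structure, the choice of required keys, and the helper names are assumptions made for illustration and do not describe the actual Parsoid API.

```python
import json

# Assumption: html and data-parsoid are mandatory, data-mw is optional.
REQUIRED_KEYS = ("html", "data-parsoid")


def build_compound(html: str, data_parsoid: dict, data_mw: dict = None) -> str:
    """Serialize HTML and its out-of-band metadata into one JSON body."""
    payload = {"html": html, "data-parsoid": data_parsoid}
    if data_mw is not None:
        payload["data-mw"] = data_mw
    return json.dumps(payload, sort_keys=True)


def parse_compound(body: str):
    """Accept the same structure as input and validate the required keys."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("compound body must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return payload["html"], payload["data-parsoid"], payload.get("data-mw", {})


# Made-up sample values.
body = build_compound('<p id="mw0">hello</p>', {"mw0": {"dsr": [0, 5]}})
print(parse_compound(body))
```

Using the same structure for output and input lets a client send back exactly what it received, with only the HTML part edited.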
Comment 9 Gerrit Notification Bot 2014-09-08 18:36:15 UTC
Change 159111 had a related patch set uploaded by Arlolra:
Return a JSON response with separate html and data-parsoid
Comment 10 Gerrit Notification Bot 2014-09-11 19:56:16 UTC
Change 159111 merged by jenkins-bot:
Return a JSON response with separate html and data-parsoid
Comment 11 Andre Klapper 2014-11-12 15:15:35 UTC
All patches mentioned in this report were merged - is there more work left to do here (if yes: please reset the bug report status to NEW or ASSIGNED), or can you close this ticket as RESOLVED FIXED?
