Last modified: 2014-09-23 19:55:58 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T18294, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 16294 - Update to cite.php refs: ids fixed, footnotes, globbed links
Update to cite.php refs: ids fixed, footnotes, globbed links
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Cite (Other open bugs)
unspecified
All All
: Low enhancement with 1 vote (vote)
: ---
Assigned To: Nobody - You can work on this!
: patch, patch-need-review
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-11-09 23:55 UTC by Nicholas Wilson
Modified: 2014-09-23 19:55 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
The diff (24.39 KB, patch)
2008-11-09 23:55 UTC, Nicholas Wilson
Details
the diff (24.40 KB, patch)
2008-11-10 20:19 UTC, Nicholas Wilson
Details
above, but with parser tests updated and expanded (plus other small fixes) (100.65 KB, patch)
2008-12-11 03:42 UTC, Nicholas Wilson
Details
patch against r44396 (114.64 KB, patch)
2008-12-13 11:34 UTC, Nicholas Wilson
Details
more parsertests, cleared up return values from refArg (120.29 KB, patch)
2008-12-14 23:02 UTC, Nicholas Wilson
Details

Description Nicholas Wilson 2008-11-09 23:55:43 UTC
Created attachment 5514 [details]
The diff

* There were a couple of spots in the code with slight problems on the id-choosing scheme. There were indeed tricksy things you could do to break the old code by choosing group names with strategically placed hyphens. I have altered this, so that all ids are valid XHTML, and map entered names properly injectively to generated ids to ensure uniqueness.
* Previously, refs used once were treated separately and the default group is treated slightly oddly.  I think some of this was a holdover from an earlier coding of the groups functionality.  I have homogenised treatment for these cases a little.

Now, the above make it a bit harder to separate out the changes in my local version to individual patches, so this patch encompasses everthing I did on Friday evening, including the following new exciting features:

* Support for nice footnote symbols. If you create a group called "footnote", you get a nice progression of *†‡ symbols etc. instead of numbers. This is neat and works well, and should be pretty un-controversial.
* Secondly, and more majorly, I have built a parser stage that globs links to notes together. It maps [1][2][nb 3][5][nb 5][6][nb 4][7] ↦ [1,2,5-7,nb 3-nb 4] (an obviously extreme example to show the scheme). In general, the most common case is [13][14][15][16] ↦ [13-16], an improvement that has been requested many times. All links and things work nicely.
Comment 1 Nicholas Wilson 2008-11-10 20:19:55 UTC
Created attachment 5517 [details]
the diff

I have update the patch to fix another annoying bug present in the upstream version: a ref like [Note 1] (as used copiously in some WP articles) should in fact be [Note 1] for obvious reasons.
Comment 2 Brion Vibber 2008-12-08 22:53:13 UTC
This'll need some parser test cases that cover:

a) the previously broken cases, to confirm they don't regress in future

b) any new functionality, to confirm it works as expected and doesn't regress in future


A couple quick notes...

+            preg_match("#{$this->mPre}<span id=\"(cite_ref-(\w*)::(\d*)::.*)\"><a href=\"(.*)\".*>(.*)</a>$#U", $strings[$i], $match);
^ I have to admit I'm pretty leery about this kind of regex on HTML code. These can be fragile, and a minor change in markup could result in breakage, even assuming the regexes have no flaws with the original markup.

Adding regression tests helps. ;)


+    'cite_references_footnote_labels' => '* † ⁑ ‡ § ‖ ¶',
^ Is this really an appropriate progression for footnote labels? § and ¶ refer to section and paragraph numbers, normally; do they get reused for footnotes in this fashion?

+    'cite_reference_bra'    => '[',
+    'cite_reference_ket'    => ']',
^ This naming scheme is cute, but a bit unclear. :) I'd recommend using clear, legible names such as "open_bracket" and "close_bracket", or more simply a single message with the value '[$1]'.
Comment 3 Nicholas Wilson 2008-12-09 00:47:02 UTC
Thanks for the feedback. I will add the parser tests soon. The change in naming scheme is causing the current parser tests to all fail, as expected — the correct answers need to be updated for the existing tests (the naming scheme had to be changed because there were a couple of characters not correctly escaped by the old scheme, while the new one is robust).

So, for the time being, I have checked and updated the old parser tests, which now work, and I will update the new set of tests when I have added suitably tricksy ones to thoroughly test the new functionality. In fact, a minor syntax change to one of regexps has been needed to properly pass the tests, so it was worth showing them to me.

Notes:
1. I agree that regexps are nasty (and slow), but this one is as simple as I think it can get. I am making nearly all my regexps non-greedy, which helps, and the expression is literally just grabbing the tags out of the wikitext which is set above. I can see problems with maintainability, but it should be able to take any wikitext correctly.

Ultimately, I am a bit unhappy about the need for this stage, but there seems no way around it, since tag hooks only work on one tag at a time. Implementing ref merging therefore does need a parser stage, and, short of parsing the whole text into a tree, regexps seem the only semi-efficient way of picking out the tags.

If the grouping of refs is not optional, then there is a slightly more efficient way of managing this, including simpler regexps, but the great advantage of the way I have done it is that simply commenting out the line where the parser hook is registered (l. 641) removes the whole processing stage, and links are output as normal.  (With this line commented out, the patch still works and becomes rather less controversial, with referencesMerge and callbackMerge never used.)

2. This is good. I have some books using this style, and [[Footnote#cite_note-0]] agrees with this use of § and ¶.

3. Sorry; I have just essentially scratch-written the [[probability amplitude]] article, so I have things like [[bra-ket notation]] on the brain. It may look 'cute', but this terminology is actually used by serious physicists!
Comment 4 Brion Vibber 2008-12-09 00:51:25 UTC
Ah, quantum mechanics... For the sanity of translators, I would recommend using non-physics jargon here. ;)
Comment 5 Nicholas Wilson 2008-12-11 03:42:56 UTC
Created attachment 5572 [details]
above, but with parser tests updated and expanded (plus other small fixes)

Right. The latest patch now has updated parser tests (which are valid, display correctly, etc.), including some new ones covering the new functionality. I have also (while tweaking) removed a bizarre legacy restriction on using numbers as names for refs. Most importantly, I have seriously cut down the size of the regexps by adding another article-wide array storing tag data, instead of writing it to the page and then separating it back out later. The expressions are now much simpler and quicker.

Overall, the patch now comprises:
1. A change in the name -> HTML id scheme to tighten up some corner cases, simultaneously cutting down on redundant legacy cases in the code, and clearing up a few other 'fixme' comments.

2. A special case added for the group name "footnote", which now uses symbols like * † ‡ §.

3. A fix to the old error when numeric ids are entered, removing it, since this case is now dealt with correctly.

4. Most majorly, a new stage has been added which globs togther adjacent <ref> tags into one bracketed superscript.

5. Corresponding updates and expansions to the parser tests to ensure everything works.
Comment 6 Brad Jorsch 2008-12-11 16:11:27 UTC
How odd. Although the headers in the patch claim it to be against r43263, I had to go back to r38094 to get it to apply cleanly. Anyway, I have a few criticisms (all testing done on a local MW installation at r44443, except the Cite extension at r38094 + patch):

Since the patch is against r38094 any changes since then are lost, until someone goes to the trouble of merging by hand. In particular, I notice:
* A fix so {{#tag:ref||name=foo}} won't blow up.
* Code to detect unclosed refs.
* $wgAllowCiteGroups.
* Use of Sanitizer::escapeId instead of validateName.

Something like "$wgMergeCitations" would probably be preferable to "Comment out this line to disable merging".

If I enter something like '<span class="ref_DB">1</span>' into the wikitext, it gets magically changed into a reference (or "[]"!). And if I enter '<span class="ref_DB"></span>', I get an internal error. You should probably (somehow) pick an identifier that cannot occur in wikitext.

Customized MediaWiki:Cite_references_link_one and friends need to be updated if the "footnote" group is used, as they are unlikely to contain the needed $4 otherwise. This should be noted somewhere prominent.

Output like "[†–‡]" makes little sense to me, as there is not really a natural ordering to the symbols as there is with numbers. OTOH, if someone customized MediaWiki:cite_references_footnote_labels to use "a b c" instead of "* † ⁑" then it would make sense to merge there.
Comment 7 Nicholas Wilson 2008-12-13 11:34:04 UTC
Created attachment 5577 [details]
patch against r44396

Thanks for the useful feedback. It seems I was indeed working from a version of cite quite a bit older than my svn MW. I have manually merged the recent improvements into my code. In particular, great minds must think alike, as I was surprised to find that after switching to Sanitizer::escapeId from my validateName, I still passed the parser tests, because my carefully chose regexps were identical to the ones in the Sanitizer class! (Also, I had removed $wgAllowCiteGroups per the comment to 'remove when groups are fully tested', and given that they are now on by default and used on the main WP.)

The way the new tags are written however makes it a little problematic to just turn off merging. If a $wgMergeCitations feature is wanted, then it can be done easily, but it would be only for aesthetic reasons, not reliability, as nearly the same amount of new code would still be run in each case.

I have cunningly fixed the problem of entering identifiers which can occur in wikitext by re-using the <ref> tag itself, which is neat, because it is the only one we can count on being parsed the way the want. It also happens to work well with the 'unclosed ref' code to make sure that any dangerous <ref> or </ref> still in the wikitext is not a full tag which would mess things up. It looks a spot fragile, but theoretically should work, and I have tested both the cases of too many <ref>s and too many </ref>s, neither of which cause problems.

Thanks for the reminder to update all the documentation, which I can definitely remember to do if/when this goes through.

I have shortened and improved the dashing function now, and it adds footnotes as a special case. Thanks for pointing this out.
Comment 8 Brad Jorsch 2008-12-13 20:36:40 UTC
(Line numbers are after applying the patch)

I see you added a 'cite_error_references_nested' error! Shouldn't the "ref was" part be localizable, though? Also, can the old check (lines 158-161) be removed, or is it still useful?

{{#tag:ref||name=0}} fails. At line 143, it should be is_null($key) instead of $key==false.

When the ref tag has unknown parameters (e.g. <ref name="foo" value="bar">), false is returned from refArg into $key. Instead of generating an error, these are now all being treated as <ref name="" group="0"> would be were name="" allowed. Either restore the check for $key===false in guardedRef, or change refArg to silently ignore unknown parameters.

You're not escaping the group when outputting the inline ref tags: try <ref group="&lt;script&gt;alert('Pwned!')&lt;/script&gt;">Uh-oh</ref>. Passing $text through htmlspecialchars in referencePlaceText seems to take care of it.

For the 'cite_error_references_no_text' error, you are not properly escaping the key. But here it seems you can only inject wikitext, so it's just aesthetic rather than a security hole.
Comment 9 Nicholas Wilson 2008-12-14 23:02:25 UTC
Created attachment 5584 [details]
more parsertests, cleared up return values from refArg

(In reply to comment #8)
> (Line numbers are after applying the patch)
> 
> I see you added a 'cite_error_references_nested' error! Shouldn't the "ref was"
> part be localizable, though? Also, can the old check (lines 158-161) be
> removed, or is it still useful?
The old check was there to warn the user against the annoying problem of unbalanced tags, which the parser just chews up silently.  There is another check to prevent the XML-valid nesting of tags (for various reasons). Before, this failed fairly silently, so the new error is there to alert the user that something is wrong. Both pieces of code are meant to be in use.
> 
> {{#tag:ref||name=0}} fails. At line 143, it should be is_null($key) instead of
> $key==false.
Thanks.  I have thoroughly rewritten the parser tests now to test nearly exhaustively all possible ways of producing a single error, and some ways of producing two. This should catch more of this sort of thing (but not so many false negatives) in future. Perhaps another set of tests for 'pathological positives' would be helpful sometime.
> 
> When the ref tag has unknown parameters (e.g. <ref name="foo" value="bar">),
> false is returned from refArg into $key. Instead of generating an error, these
> are now all being treated as <ref name="" group="0"> would be were name=""
> allowed. Either restore the check for $key===false in guardedRef, or change
> refArg to silently ignore unknown parameters.
The problem was just a slightly sloppy merge of a line from the stuff that had changed since the version I started working from. I have removed a few cases from refArg to neaten up the possible return values and fixed up guardedRef.
> 
> You're not escaping the group when outputting the inline ref tags: try <ref
> group="&lt;script&gt;alert('Pwned!')&lt;/script&gt;">Uh-oh</ref>. Passing $text
> through htmlspecialchars in referencePlaceText seems to take care of it.
Yikes! This one really is my bad. mRefs now holds dangerous text, is documented as such, and everything is escaped when written out to the page (I hope!). I think everthing is safe now, but I should not have slipped up there. Thanks for double-checking my 'late night' code!
> 
> For the 'cite_error_references_no_text' error, you are not properly escaping
> the key. But here it seems you can only inject wikitext, so it's just aesthetic
> rather than a security hole.
This is fine. I had thought that anything the user types, they are happy with having echoed back at them as wikitext.
Comment 10 Gabriel Wicke 2011-11-14 13:07:49 UTC
Nicholas, I'd like to help with getting this patch finally reviewed and checked in. Cite has changed in the meantime, so the patch will need to be updated to latest trunk. Do you have a more up-to-date version? If you do, then this would save the duplicate effort of forward-porting it.
Comment 11 Sumana Harihareswara 2012-05-16 19:41:36 UTC
Also, Nicholas, now that we have moved to Git as a source control system, you're welcome to get developer access https://www.mediawiki.org/wiki/Developer_access to put this & future patches directly into Git.  That will get them reviewed faster, too!

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links