Last modified: 2014-11-18 18:07:31 UTC
provides a wide range of various sections generated using wiki syntax.
The scope of this report is to identify which cases where "return to section"
does not work properly *and* wrong autocomment links are generated could be
The set of characters which are not allowed in page titles *differs* from the
set of characters which don't make sense in sections (as "#"). Beside this there
is a range of characters which require *escaping* when used in anchors (as "&").
However the page provides many examples where the the "text" rendered /
displayed in the TOC contains only characters which can be used in "return"
anchors and in "autocomment" links. This is mainly
Bug 2831: Return after edit sections using headings with wikilinks returns to
other anchors then the ones used in TOC
but there are many examples using wiki syntax, TeX etc. where the TOC-text is
more suitable as return-to anchor and autocomment as what is generated now. Bug
4987 is only one of them.
Another "discrepancy" I want to address is the multiple identical TOC-texts. A
sollution which would make sense is to render the TOC-text *same* as the anchors
*automatically*. This will make the problem more visible while editing and they
users could easy fix such sections immediately. This is
Bug 111: Return to page after edit of section (where section name occurs
The algorithm should work also for insertion of new sections inserted "out of
sequence" by "first in" -> "first number"; "next in" -> "last number++" (because
of posible gaps). This will not generate duplicate TOC-texts any more. Existing
pages should render according to this algorithm at the next change or while purged.
The algorithm would require (sorry Brion) identical handling of spaces and
underscores as addressed in bug 2339.
The examples from the url shows many complications or section headers about
known bugs or implementation dependent issues:
[[foo|]] does not render to [[foo]] today - bug 1034
[[foo bar#foobar|]] does not generate a valid link today but it might generate
[[foo bar#foobar|foobar]] in the future - bug 845
The main / primary focus should not be on mentioned or documented complications
or other curiosities but on *efficiency* and unifying user interface functionality.
There will still be situations (conflicts) where characters used inside the TOC
are not valid in anchors. These characters need to be escaped anyway.
Wikies live. After "changing the rules" it will be possible to "simplify" the
stored autocomments in existing summary fields. This will not be a manipulation
but a improuvement / usefull fix for users.
best regards reinhardt [[user:gangleri]]
(In reply to comment #0)
> ... *differs* from the
> set of characters which don't make sense in sections (as "#").
A "#"-section generetes
" " generates the return link
This is equivalent to an anchor the start / top of the page
Why is this request nontrivial to implement? The TOC works perfectly for every
case I've seen, so why can't that code be plugged in to the post-edit section
link-generator and the page history link-generator in an hour or whatever?
It's nontrivial because the multipass parser is nontrivial.
The set of text that is present at some stage partway through
parsing isn't necessarily the same as you have at some other
Hmm. Generating the section-edit link could be rather simpler than it is, since
it doesn't actually have to accord *that* closely with the markup used. It
could potentially be arbitrary, in fact, although we don't really want that. An
* Identify internal links and replace with their display text.
* Identify external links and replace with their display text.
* Strip the contents of <ref>s, if Cite.php is installed (ideally; this may not
be reasonable for an extension, but I did once see someone included a very
lengthy ref in a header).
* Strip all SGML-style tags (but not their contents).
* Replace all template parameters with their defaults, if they have any.
* Replace all accented Latin characters with their unaccented versions.
* Strip all apostrophes.
* For all sequences of nonalphabetic symbols that aren't printable in URLs:
** If the sequence has a printable punctuation mark (e.g., _ or -) to one side,
delete the sequence.
** Otherwise, replace the sequence with an underscore.
* See if the array passed to the function contains a string that's identical to
the current string. If it does, append an appropriate number.
* Return the string.
could be implemented as an Article method. Anything that wanted to know what
the title should be for a given heading would request the section heading from
the Article object. This general algorithm would probably be simpler and
cheaper to implement than what I believe is being generated now, because it
doesn't transclude templates or generally rely on anything recursive or
otherwise tiresome. Furthermore, the id wouldn't change if an included template
changed, or in fact if anything whatsoever changed other than the actual text of
the header. And finally, on Latin-alphabet wikis it will provide entirely
readable section ids, because all reserved characters are just dropped.
Created attachment 2093 [details]
* Creates basically consistent anchors no matter where you come from (TOC,
* Certain things in headers, such as variables and seemingly extension hooks
(at least <ref>), will remain inconsistent. No regression, they'll still work
from TOC, but they will not work from history or return on edit.
* Duplicate header names will still work from TOC, but can't conceivably work
from elsewhere with the current setup. Still no regression.
* Legacy anchors are preserved, so old links don't break.
* Anchors are more readable and more easily typed where possible(e.g., "One
(Two), Three" -> "One_Two_Three" not "One_.28Two.29.2C_Three", "spéçïål
çhärâçtêrs" -> "special_characters" not
* I noticed no performance hit in some quick benchmarking.
Created attachment 2094 [details]
Test cases and output for proposed patch
Thanks mainly to Gangleri. Note: Cite.php is enabled, but Inputbox and math
(among other things) are not.
Created attachment 2151 [details]
Patch with no debugging code
The patch isn't great, so I hope to improve upon it, but in case someone *does*
want to use this one, I'd better remove the debugging code. :D
Created attachment 2171 [details]
Patch against r15884
It finally occurred to me that templates don't work with this either; I didn't
think to add that as a test case, stupidly enough. This isn't a great
solution, then, although still no regression, and it should be
forward-compatible, so this is really more "incomplete" than "bad". Might
still be useful to commit the patch in its current form; to handle templates
and variables correctly would require fairly intricate changes to
Parser::replaceVariables or Parser::replace_callback, as far as I can see,
albeit probably a lot simpler if you're familiar with such things. I can't
figure out how to do it, sadly.
*** Bug 4987 has been marked as a duplicate of this bug. ***
Thanks for making the patch. Here are some comments:
* As described, the patch introduces a new mapping
of section names to anchors. The old name is kept as
a "legacy" name, but why make this change at all? To
me it seems orthogonal to the bug. The anchor names
are not arbitrary; they are supposed to be ASCII-only,
human-readable when possible, and stable enough for
use in URLs. Both the existing method or the new
one will be unreadable at times, but at least the
old one was established and unambiguous.
* I noticed that you add an 'x' before anchor names that
don't begin with a letter. Technically you are correct
in that standards require this, but is there any
practical reason for it? Section links like
[[Example#1990s]] already work fine, and changing the
anchor to "x1990s" seems ugly and arbitrary.
* The base64_encode() / base64_decode() process seems
unnecessary; why not just save the section titles in
an array and store an index into the array in the
placeholder instead of the encoded string? To be fair,
this isn't specific to your patch.
* Why make wfCreateAnchor a global function?
(In reply to comment #10)
> * As described, the patch introduces a new mapping
> of section names to anchors. The old name is kept as
> a "legacy" name, but why make this change at all?
I couldn't use the mapping currently used by the parser, because that's based
off rendered HTML, not wikitext, and parsing each fragment on a history page is
unacceptable. I could have used the one used by EditPage and Linker (which I
suspect are slightly different), but that would break past links anyway, because
everyone makes links that point to the parser output, not the EditPage/Linker
output. So I figured I might as well make the output more readable while I was
at it. Note that the system I used accords with the "legacy" system much more
often than the EditPage/Linker system does, too.
> * I noticed that you add an 'x' before anchor names that
> don't begin with a letter. Technically you are correct
> in that standards require this, but is there any
> practical reason for it?
No, but MediaWiki likes to maintain XHTML compliance, all things being equal.
> * The base64_encode() / base64_decode() process seems
> unnecessary; why not just save the section titles in
> an array and store an index into the array in the
> placeholder instead of the encoded string?
Good question. I'm kind of new at programming, and the Base64 idea seemed more
obvious at the time to me. Of course, everything else that I saw in the parser
is done the way you say, and in retrospect I should have done it that way.
> * Why make wfCreateAnchor a global function?
Because it needs to be accessible to Parser, EditPage, and Linker, which don't
all talk to each other, and I didn't know about static methods at the time. :P
(Tim pointed out to me on IRC, by the way, that EditPage would be very simple to
fix: just have Parser pass on the correctly-formatted anchor to EditPage. This
tactic wouldn't solve the history-page problem, but it's better than nothing.)
(In reply to comment #11)
> I couldn't use the mapping currently used by the parser, because that's based
> off rendered HTML, not wikitext, and parsing each fragment on a history page is
But would it really be that expensive, considering that we
already parse wikilinks on history pages? The rendered HTML is
stripped out when making anchors, so you wouldn't need to
generate it on history pages or for section edits. I wonder how
much it would slow things down to replace wikilinks by their text,
remove all single-quote (bold/italic) formatting, and maybe
replace bracketed external links with their descriptions. Is
here something I missed?
I see now why you chose to create a new format, and maybe that
is indeed the way to go. I like your idea of unifying the three
different places that generate anchor/fragment names.
> (Tim pointed out to me on IRC, by the way, that EditPage would be very simple to
> fix: just have Parser pass on the correctly-formatted anchor to EditPage. This
> tactic wouldn't solve the history-page problem, but it's better than nothing.)
I believe we have to do that anyway if we want to fix bug 111.
(In reply to comment #12)
> But would it really be that expensive, considering that we
> already parse wikilinks on history pages?
Well, not if you wrote a mini-parser, but that would be quite complicated (the
parser is rather intricate) and almost certainly not worth the effort. Even if
it's somewhat faster, there can be a maximum of 5000 items on a history page.
The current parser takes about 800 ms to render a typical page, which obviously
is totally unacceptable when multiplied by even ten, let alone a thousand.
> Is here something I missed?
Templates. Which require database queries, not infrequently multiple database
queries. Parser extensions can also potentially be expensive, because they can
do absolutely anything. Tim confirmed what I suspected when I chatted with him
about this on IRC: running the parser for every section edit link would cause
the WMF servers to die. That's why I tried to make a wikitext-based format in
preference to a rendered text-based format in the first place.
> I believe we have to do that anyway if we want to fix bug 111.
. . . hmm. I just had an idea. What if we took the finished anchor from the
parser, stuck it onto something, used it for the return value of EditPage, *and
then put it into the history entry*? Nobody said the history entry's link
target *has* to be determined on loading the history page, right? It can be
determined when adding it to the edit history in the first place.
Of course, this would likely require database schema changes (store the actual
fragment separately from the text displayed to the right of the arrow), but it
seems like a flawless solution. Everything is inherently centralized, and we
don't have to deal with the headache of adding anchor stuff into the template pass.
(In reply to comment #13)
> Well, not if you wrote a mini-parser, but that would be quite complicated (the
> parser is rather intricate) and almost certainly not worth the effort. Even if
My point was that I don't see anything intricate or complicated about
converting wikitext into plain text for the important cases: one regular
expression search-and-replace to remove wikilinks, one to remove quotes,
and one to remove external links. This doesn't have to happen for every
history entry, only those corresponding to section edits, and even then
the replacement only takes place on section headings, not the whole edit
The problem with templates is that they are expanded in the parser before
headings are parsed, and as you say expanding templates on history pages
is not feasible. Your patch doesn't fix templates-within-headings either,
and I'm not convinced it's worth supporting anyway.
> . . . hmm. I just had an idea. What if we took the finished anchor from the
> parser, stuck it onto something, used it for the return value of EditPage, *and
> then put it into the history entry*?
Not a bad idea; the only objection I can think of is that it seems
like overkill to change the schema for what's ultimately a minor
Does this bug cover the problems with linking to sections which include
apostrophes for bolding and italicising? See Talk:€2 commemorative coins at
Many of these issues have been fixed by TimLaqua in r25445, r25446, r25450 and r25573 (see also bug 2831 and bug 10836). This should reduce the list of problematic section headers significantly.
*** Bug 17786 has been marked as a duplicate of this bug. ***
*** Bug 12414 has been marked as a duplicate of this bug. ***
*** Bug 24262 has been marked as a duplicate of this bug. ***
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
+reviewed since comment 16 implies some obsoletion of Aryeh's patches