Last modified: 2014-11-18 18:07:31 UTC
Hallo! http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax provides a wide range of various sections generated using wiki syntax. The scope of this report is to identify which cases where "return to section" does not work properly *and* wrong autocomment links are generated could be improved. The set of characters which are not allowed in page titles *differs* from the set of characters which don't make sense in sections (as "#"). Beside this there is a range of characters which require *escaping* when used in anchors (as "&"). However the page provides many examples where the the "text" rendered / displayed in the TOC contains only characters which can be used in "return" anchors and in "autocomment" links. This is mainly Bug 2831: Return after edit sections using headings with wikilinks returns to other anchors then the ones used in TOC but there are many examples using wiki syntax, TeX etc. where the TOC-text is more suitable as return-to anchor and autocomment as what is generated now. Bug 4987 is only one of them. Another "discrepancy" I want to address is the multiple identical TOC-texts. A sollution which would make sense is to render the TOC-text *same* as the anchors *automatically*. This will make the problem more visible while editing and they users could easy fix such sections immediately. This is Bug 111: Return to page after edit of section (where section name occurs multiple times) The algorithm should work also for insertion of new sections inserted "out of sequence" by "first in" -> "first number"; "next in" -> "last number++" (because of posible gaps). This will not generate duplicate TOC-texts any more. Existing pages should render according to this algorithm at the next change or while purged. The algorithm would require (sorry Brion) identical handling of spaces and underscores as addressed in bug 2339. The examples from the url shows many complications or section headers about known bugs or implementation dependent issues: [[foo|]] does not render to [[foo]] today - bug 1034 [[foo bar#foobar|]] does not generate a valid link today but it might generate [[foo bar#foobar|foobar]] in the future - bug 845 The main / primary focus should not be on mentioned or documented complications or other curiosities but on *efficiency* and unifying user interface functionality. There will still be situations (conflicts) where characters used inside the TOC are not valid in anchors. These characters need to be escaped anyway. Wikies live. After "changing the rules" it will be possible to "simplify" the stored autocomments in existing summary fields. This will not be a manipulation but a improuvement / usefull fix for users. best regards reinhardt [[user:gangleri]]
(In reply to comment #0) > ... *differs* from the > set of characters which don't make sense in sections (as "#"). A "#"-section generetes http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax#.23 " " generates the return link http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax# This is equivalent to an anchor the start / top of the page
Why is this request nontrivial to implement? The TOC works perfectly for every case I've seen, so why can't that code be plugged in to the post-edit section link-generator and the page history link-generator in an hour or whatever?
It's nontrivial because the multipass parser is nontrivial. The set of text that is present at some stage partway through parsing isn't necessarily the same as you have at some other time.
Hmm. Generating the section-edit link could be rather simpler than it is, since it doesn't actually have to accord *that* closely with the markup used. It could potentially be arbitrary, in fact, although we don't really want that. An algorithm like: * Identify internal links and replace with their display text. * Identify external links and replace with their display text. * Strip the contents of <ref>s, if Cite.php is installed (ideally; this may not be reasonable for an extension, but I did once see someone included a very lengthy ref in a header). * Strip all SGML-style tags (but not their contents). * Replace all template parameters with their defaults, if they have any. * Replace all accented Latin characters with their unaccented versions. * Strip all apostrophes. * For all sequences of nonalphabetic symbols that aren't printable in URLs: ** If the sequence has a printable punctuation mark (e.g., _ or -) to one side, delete the sequence. ** Otherwise, replace the sequence with an underscore. * See if the array passed to the function contains a string that's identical to the current string. If it does, append an appropriate number. * Return the string. could be implemented as an Article method. Anything that wanted to know what the title should be for a given heading would request the section heading from the Article object. This general algorithm would probably be simpler and cheaper to implement than what I believe is being generated now, because it doesn't transclude templates or generally rely on anything recursive or otherwise tiresome. Furthermore, the id wouldn't change if an included template changed, or in fact if anything whatsoever changed other than the actual text of the header. And finally, on Latin-alphabet wikis it will provide entirely readable section ids, because all reserved characters are just dropped.
Created attachment 2093 [details] Proposed patch * Creates basically consistent anchors no matter where you come from (TOC, Linker.php, EditPage.php). * Certain things in headers, such as variables and seemingly extension hooks (at least <ref>), will remain inconsistent. No regression, they'll still work from TOC, but they will not work from history or return on edit. * Duplicate header names will still work from TOC, but can't conceivably work from elsewhere with the current setup. Still no regression. * Legacy anchors are preserved, so old links don't break. * Anchors are more readable and more easily typed where possible(e.g., "One (Two), Three" -> "One_Two_Three" not "One_.28Two.29.2C_Three", "spéçïål çhärâçtêrs" -> "special_characters" not "sp.C3.A9.C3.A7.C3.AF.C3.A5l_.C3.A7h.C3.A4r.C3.A2.C3.A7t.C3.AArs"). * I noticed no performance hit in some quick benchmarking.
Created attachment 2094 [details] Test cases and output for proposed patch Thanks mainly to Gangleri. Note: Cite.php is enabled, but Inputbox and math (among other things) are not.
Created attachment 2151 [details] Patch with no debugging code The patch isn't great, so I hope to improve upon it, but in case someone *does* want to use this one, I'd better remove the debugging code. :D
Created attachment 2171 [details] Patch against r15884 It finally occurred to me that templates don't work with this either; I didn't think to add that as a test case, stupidly enough. This isn't a great solution, then, although still no regression, and it should be forward-compatible, so this is really more "incomplete" than "bad". Might still be useful to commit the patch in its current form; to handle templates and variables correctly would require fairly intricate changes to Parser::replaceVariables or Parser::replace_callback, as far as I can see, albeit probably a lot simpler if you're familiar with such things. I can't figure out how to do it, sadly.
*** Bug 4987 has been marked as a duplicate of this bug. ***
Thanks for making the patch. Here are some comments: * As described, the patch introduces a new mapping of section names to anchors. The old name is kept as a "legacy" name, but why make this change at all? To me it seems orthogonal to the bug. The anchor names are not arbitrary; they are supposed to be ASCII-only, human-readable when possible, and stable enough for use in URLs. Both the existing method or the new one will be unreadable at times, but at least the old one was established and unambiguous. * I noticed that you add an 'x' before anchor names that don't begin with a letter. Technically you are correct in that standards require this, but is there any practical reason for it? Section links like [[Example#1990s]] already work fine, and changing the anchor to "x1990s" seems ugly and arbitrary. * The base64_encode() / base64_decode() process seems unnecessary; why not just save the section titles in an array and store an index into the array in the placeholder instead of the encoded string? To be fair, this isn't specific to your patch. * Why make wfCreateAnchor a global function?
(In reply to comment #10) > * As described, the patch introduces a new mapping > of section names to anchors. The old name is kept as > a "legacy" name, but why make this change at all? I couldn't use the mapping currently used by the parser, because that's based off rendered HTML, not wikitext, and parsing each fragment on a history page is unacceptable. I could have used the one used by EditPage and Linker (which I suspect are slightly different), but that would break past links anyway, because everyone makes links that point to the parser output, not the EditPage/Linker output. So I figured I might as well make the output more readable while I was at it. Note that the system I used accords with the "legacy" system much more often than the EditPage/Linker system does, too. > * I noticed that you add an 'x' before anchor names that > don't begin with a letter. Technically you are correct > in that standards require this, but is there any > practical reason for it? No, but MediaWiki likes to maintain XHTML compliance, all things being equal. > * The base64_encode() / base64_decode() process seems > unnecessary; why not just save the section titles in > an array and store an index into the array in the > placeholder instead of the encoded string? Good question. I'm kind of new at programming, and the Base64 idea seemed more obvious at the time to me. Of course, everything else that I saw in the parser is done the way you say, and in retrospect I should have done it that way. > * Why make wfCreateAnchor a global function? Because it needs to be accessible to Parser, EditPage, and Linker, which don't all talk to each other, and I didn't know about static methods at the time. :P (Tim pointed out to me on IRC, by the way, that EditPage would be very simple to fix: just have Parser pass on the correctly-formatted anchor to EditPage. This tactic wouldn't solve the history-page problem, but it's better than nothing.)
(In reply to comment #11) > I couldn't use the mapping currently used by the parser, because that's based > off rendered HTML, not wikitext, and parsing each fragment on a history page is > unacceptable. But would it really be that expensive, considering that we already parse wikilinks on history pages? The rendered HTML is stripped out when making anchors, so you wouldn't need to generate it on history pages or for section edits. I wonder how much it would slow things down to replace wikilinks by their text, remove all single-quote (bold/italic) formatting, and maybe replace bracketed external links with their descriptions. Is here something I missed? I see now why you chose to create a new format, and maybe that is indeed the way to go. I like your idea of unifying the three different places that generate anchor/fragment names. > (Tim pointed out to me on IRC, by the way, that EditPage would be very simple to > fix: just have Parser pass on the correctly-formatted anchor to EditPage. This > tactic wouldn't solve the history-page problem, but it's better than nothing.) I believe we have to do that anyway if we want to fix bug 111.
(In reply to comment #12) > But would it really be that expensive, considering that we > already parse wikilinks on history pages? Well, not if you wrote a mini-parser, but that would be quite complicated (the parser is rather intricate) and almost certainly not worth the effort. Even if it's somewhat faster, there can be a maximum of 5000 items on a history page. The current parser takes about 800 ms to render a typical page, which obviously is totally unacceptable when multiplied by even ten, let alone a thousand. > Is here something I missed? Templates. Which require database queries, not infrequently multiple database queries. Parser extensions can also potentially be expensive, because they can do absolutely anything. Tim confirmed what I suspected when I chatted with him about this on IRC: running the parser for every section edit link would cause the WMF servers to die. That's why I tried to make a wikitext-based format in preference to a rendered text-based format in the first place. > I believe we have to do that anyway if we want to fix bug 111. Correct. . . . hmm. I just had an idea. What if we took the finished anchor from the parser, stuck it onto something, used it for the return value of EditPage, *and then put it into the history entry*? Nobody said the history entry's link target *has* to be determined on loading the history page, right? It can be determined when adding it to the edit history in the first place. Of course, this would likely require database schema changes (store the actual fragment separately from the text displayed to the right of the arrow), but it seems like a flawless solution. Everything is inherently centralized, and we don't have to deal with the headache of adding anchor stuff into the template pass.
(In reply to comment #13) > > Well, not if you wrote a mini-parser, but that would be quite complicated (the > parser is rather intricate) and almost certainly not worth the effort. Even if My point was that I don't see anything intricate or complicated about converting wikitext into plain text for the important cases: one regular expression search-and-replace to remove wikilinks, one to remove quotes, and one to remove external links. This doesn't have to happen for every history entry, only those corresponding to section edits, and even then the replacement only takes place on section headings, not the whole edit summary. The problem with templates is that they are expanded in the parser before headings are parsed, and as you say expanding templates on history pages is not feasible. Your patch doesn't fix templates-within-headings either, and I'm not convinced it's worth supporting anyway. > . . . hmm. I just had an idea. What if we took the finished anchor from the > parser, stuck it onto something, used it for the return value of EditPage, *and > then put it into the history entry*? Not a bad idea; the only objection I can think of is that it seems like overkill to change the schema for what's ultimately a minor annoyance.
Does this bug cover the problems with linking to sections which include apostrophes for bolding and italicising? See Talk:€2 commemorative coins at en.wikipedia.org.
Many of these issues have been fixed by TimLaqua in r25445, r25446, r25450 and r25573 (see also bug 2831 and bug 10836). This should reduce the list of problematic section headers significantly.
*** Bug 17786 has been marked as a duplicate of this bug. ***
*** Bug 12414 has been marked as a duplicate of this bug. ***
*** Bug 24262 has been marked as a duplicate of this bug. ***
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
+reviewed since comment 16 implies some obsoletion of Aryeh's patches