Last modified: 2014-11-18 18:07:31 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T7019, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 5019 - Fix fragment identifiers in links to sections
Fix fragment identifiers in links to sections
Status: NEW
Product: MediaWiki
Classification: Unclassified
Page editing (Other open bugs)
unspecified
All All
: Low minor with 9 votes (vote)
: ---
Assigned To: Nobody - You can work on this!
http://test.wikipedia.org/w/index.php...
: patch, patch-reviewed
: 4987 12414 17786 24262 (view as bug list)
Depends on: 111 2831 2346 18700
Blocks: 13108
  Show dependency treegraph
 
Reported: 2006-02-17 03:14 UTC by lɛʁi לערי ריינהארט
Modified: 2014-11-18 18:07 UTC (History)
11 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Proposed patch (13.85 KB, patch)
2006-07-16 05:58 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)
Details
Test cases and output for proposed patch (8.49 KB, text/plain)
2006-07-16 06:00 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)
Details
Patch with no debugging code (13.74 KB, patch)
2006-07-25 23:11 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)
Details
Patch against r15884 (13.74 KB, patch)
2006-07-30 06:00 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)
Details

Description lɛʁi לערי ריינהארט 2006-02-17 03:14:21 UTC
Hallo!

http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax
provides a wide range of various sections generated using wiki syntax.

The scope of this report is to identify which cases where "return to section"
does not work properly *and* wrong autocomment links are generated could be
improved.

The set of characters which are not allowed in page titles *differs* from the
set of characters which don't make sense in sections (as "#"). Beside this there
is a range of characters which require *escaping* when used in anchors (as "&").

However the page provides many examples where the the "text" rendered /
displayed in the TOC contains only characters which can be used in "return"
anchors and in "autocomment" links. This is mainly
Bug 2831: Return after edit sections using headings with wikilinks returns to
other anchors then the ones used in TOC
but there are many examples using wiki syntax, TeX etc. where the TOC-text is
more suitable as return-to anchor and autocomment as what is generated now. Bug
4987 is only one of them.

Another "discrepancy" I want to address is the multiple identical TOC-texts. A
sollution which would make sense is to render the TOC-text *same* as the anchors
*automatically*. This will make the problem more visible while editing and they
users could easy fix such sections immediately. This is
Bug 111: Return to page after edit of section (where section name occurs
multiple times)
The algorithm should work also for insertion of new sections inserted "out of
sequence" by "first in" -> "first number"; "next in" -> "last number++" (because
of posible gaps). This will not generate duplicate TOC-texts any more. Existing
pages should render according to this algorithm at the next change or while purged.
The algorithm would require (sorry Brion) identical handling of spaces and
underscores as addressed in bug 2339.

The examples from the url shows many complications or section headers about
known bugs or implementation dependent issues:
[[foo|]] does not render to [[foo]] today - bug 1034
[[foo bar#foobar|]] does not generate a valid link today but it might generate
[[foo bar#foobar|foobar]] in the future - bug 845

The main / primary focus should not be on mentioned or documented complications
or other curiosities but on *efficiency* and unifying user interface functionality.

There will still be situations (conflicts) where characters used inside the TOC
are not valid in anchors. These characters need to be escaped anyway.

Wikies live. After "changing the rules" it will be possible to "simplify" the
stored autocomments in existing summary fields. This will not be a manipulation
but a improuvement / usefull fix for users.

best regards reinhardt [[user:gangleri]]
Comment 1 lɛʁi לערי ריינהארט 2006-02-17 03:40:34 UTC
(In reply to comment #0)
> ... *differs* from the
> set of characters which don't make sense in sections (as "#").

A "#"-section generetes
http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax#.23

" " generates the return link
http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax#
This is equivalent to an anchor the start / top of the page
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-04-23 21:45:04 UTC
Why is this request nontrivial to implement?  The TOC works perfectly for every
case I've seen, so why can't that code be plugged in to the post-edit section
link-generator and the page history link-generator in an hour or whatever?
Comment 3 Brion Vibber 2006-04-23 21:58:18 UTC
It's nontrivial because the multipass parser is nontrivial.
The set of text that is present at some stage partway through
parsing isn't necessarily the same as you have at some other
time.
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-26 03:32:52 UTC
Hmm.  Generating the section-edit link could be rather simpler than it is, since
it doesn't actually have to accord *that* closely with the markup used.  It
could potentially be arbitrary, in fact, although we don't really want that.  An
algorithm like:

* Identify internal links and replace with their display text.
* Identify external links and replace with their display text.
* Strip the contents of <ref>s, if Cite.php is installed (ideally; this may not
be reasonable for an extension, but I did once see someone included a very
lengthy ref in a header).
* Strip all SGML-style tags (but not their contents).
* Replace all template parameters with their defaults, if they have any.
* Replace all accented Latin characters with their unaccented versions.
* Strip all apostrophes.
* For all sequences of nonalphabetic symbols that aren't printable in URLs:
** If the sequence has a printable punctuation mark (e.g., _ or -) to one side,
delete the sequence.
** Otherwise, replace the sequence with an underscore.
* See if the array passed to the function contains a string that's identical to
the current string.  If it does, append an appropriate number.
* Return the string.

could be implemented as an Article method.  Anything that wanted to know what
the title should be for a given heading would request the section heading from
the Article object.  This general algorithm would probably be simpler and
cheaper to implement than what I believe is being generated now, because it
doesn't transclude templates or generally rely on anything recursive or
otherwise tiresome.  Furthermore, the id wouldn't change if an included template
changed, or in fact if anything whatsoever changed other than the actual text of
the header.  And finally, on Latin-alphabet wikis it will provide entirely
readable section ids, because all reserved characters are just dropped.
Comment 5 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-16 05:58:53 UTC
Created attachment 2093 [details]
Proposed patch

* Creates basically consistent anchors no matter where you come from (TOC,
Linker.php, EditPage.php).
* Certain things in headers, such as variables and seemingly extension hooks
(at least <ref>), will remain inconsistent.  No regression, they'll still work
from TOC, but they will not work from history or return on edit.
* Duplicate header names will still work from TOC, but can't conceivably work
from elsewhere with the current setup.	Still no regression.
* Legacy anchors are preserved, so old links don't break.
* Anchors are more readable and more easily typed where possible(e.g., "One
(Two), Three" -> "One_Two_Three" not "One_.28Two.29.2C_Three", "spéçïål
çhärâçtêrs" -> "special_characters" not
"sp.C3.A9.C3.A7.C3.AF.C3.A5l_.C3.A7h.C3.A4r.C3.A2.C3.A7t.C3.AArs").
* I noticed no performance hit in some quick benchmarking.
Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-16 06:00:49 UTC
Created attachment 2094 [details]
Test cases and output for proposed patch

Thanks mainly to Gangleri.  Note: Cite.php is enabled, but Inputbox and math
(among other things) are not.
Comment 7 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-25 23:11:44 UTC
Created attachment 2151 [details]
Patch with no debugging code

The patch isn't great, so I hope to improve upon it, but in case someone *does*
want to use this one, I'd better remove the debugging code.  :D
Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-30 06:00:28 UTC
Created attachment 2171 [details]
Patch against r15884

It finally occurred to me that templates don't work with this either; I didn't
think to add that as a test case, stupidly enough.  This isn't a great
solution, then, although still no regression, and it should be
forward-compatible, so this is really more "incomplete" than "bad".  Might
still be useful to commit the patch in its current form; to handle templates
and variables correctly would require fairly intricate changes to
Parser::replaceVariables or Parser::replace_callback, as far as I can see,
albeit probably a lot simpler if you're familiar with such things.  I can't
figure out how to do it, sadly.
Comment 9 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-30 06:05:17 UTC
*** Bug 4987 has been marked as a duplicate of this bug. ***
Comment 10 Wil Mahan 2006-08-26 18:59:39 UTC
Thanks for making the patch. Here are some comments:

* As described, the patch introduces a new mapping
of section names to anchors. The old name is kept as
a "legacy" name, but why make this change at all? To
me it seems orthogonal to the bug. The anchor names
are not arbitrary; they are supposed to be ASCII-only,
human-readable when possible, and stable enough for
use in URLs. Both the existing method or the new
one will be unreadable at times, but at least the
old one was established and unambiguous.
* I noticed that you add an 'x' before anchor names that
don't begin with a letter. Technically you are correct
in that standards require this, but is there any
practical reason for it? Section links like
[[Example#1990s]] already work fine, and changing the
anchor to "x1990s" seems ugly and arbitrary.
* The base64_encode() / base64_decode() process seems
unnecessary; why not just save the section titles in
an array and store an index into the array in the
placeholder instead of the encoded string? To be fair,
this isn't specific to your patch.
* Why make wfCreateAnchor a global function?
Comment 11 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-27 02:27:51 UTC
(In reply to comment #10)
> * As described, the patch introduces a new mapping
> of section names to anchors. The old name is kept as
> a "legacy" name, but why make this change at all?

I couldn't use the mapping currently used by the parser, because that's based
off rendered HTML, not wikitext, and parsing each fragment on a history page is
unacceptable.  I could have used the one used by EditPage and Linker (which I
suspect are slightly different), but that would break past links anyway, because
everyone makes links that point to the parser output, not the EditPage/Linker
output.  So I figured I might as well make the output more readable while I was
at it.  Note that the system I used accords with the "legacy" system much more
often than the EditPage/Linker system does, too.

> * I noticed that you add an 'x' before anchor names that
> don't begin with a letter. Technically you are correct
> in that standards require this, but is there any
> practical reason for it?

No, but MediaWiki likes to maintain XHTML compliance, all things being equal.

> * The base64_encode() / base64_decode() process seems
> unnecessary; why not just save the section titles in
> an array and store an index into the array in the
> placeholder instead of the encoded string? 

Good question.  I'm kind of new at programming, and the Base64 idea seemed more
obvious at the time to me.  Of course, everything else that I saw in the parser
is done the way you say, and in retrospect I should have done it that way.

> * Why make wfCreateAnchor a global function?

Because it needs to be accessible to Parser, EditPage, and Linker, which don't
all talk to each other, and I didn't know about static methods at the time.  :P

(Tim pointed out to me on IRC, by the way, that EditPage would be very simple to
fix: just have Parser pass on the correctly-formatted anchor to EditPage.  This
tactic wouldn't solve the history-page problem, but it's better than nothing.)
Comment 12 Wil Mahan 2006-08-27 06:46:01 UTC
(In reply to comment #11)

> I couldn't use the mapping currently used by the parser, because that's based
> off rendered HTML, not wikitext, and parsing each fragment on a history page is
> unacceptable.

But would it really be that expensive, considering that we
already parse wikilinks on history pages? The rendered HTML is
stripped out when making anchors, so you wouldn't need to
generate it on history pages or for section edits. I wonder how
much it would slow things down to replace wikilinks by their text,
remove all single-quote (bold/italic) formatting, and maybe
replace bracketed external links with their descriptions. Is
here something I missed?

I see now why you chose to create a new format, and maybe that
is indeed the way to go. I like your idea of unifying the three
different places that generate anchor/fragment names.

> (Tim pointed out to me on IRC, by the way, that EditPage would be very simple to
> fix: just have Parser pass on the correctly-formatted anchor to EditPage.  This
> tactic wouldn't solve the history-page problem, but it's better than nothing.)

I believe we have to do that anyway if we want to fix bug 111.
Comment 13 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-27 07:12:52 UTC
(In reply to comment #12)
> But would it really be that expensive, considering that we
> already parse wikilinks on history pages?

Well, not if you wrote a mini-parser, but that would be quite complicated (the
parser is rather intricate) and almost certainly not worth the effort.  Even if
it's somewhat faster, there can be a maximum of 5000 items on a history page. 
The current parser takes about 800 ms to render a typical page, which obviously
is totally unacceptable when multiplied by even ten, let alone a thousand.

> Is here something I missed?

Templates.  Which require database queries, not infrequently multiple database
queries.  Parser extensions can also potentially be expensive, because they can
do absolutely anything.  Tim confirmed what I suspected when I chatted with him
about this on IRC: running the parser for every section edit link would cause
the WMF servers to die.  That's why I tried to make a wikitext-based format in
preference to a rendered text-based format in the first place.

> I believe we have to do that anyway if we want to fix bug 111.

Correct.

. . . hmm.  I just had an idea.  What if we took the finished anchor from the
parser, stuck it onto something, used it for the return value of EditPage, *and
then put it into the history entry*?  Nobody said the history entry's link
target *has* to be determined on loading the history page, right?  It can be
determined when adding it to the edit history in the first place.

Of course, this would likely require database schema changes (store the actual
fragment separately from the text displayed to the right of the arrow), but it
seems like a flawless solution.  Everything is inherently centralized, and we
don't have to deal with the headache of adding anchor stuff into the template pass.
Comment 14 Wil Mahan 2006-08-27 18:04:16 UTC
(In reply to comment #13)
> 
> Well, not if you wrote a mini-parser, but that would be quite complicated (the
> parser is rather intricate) and almost certainly not worth the effort.  Even if

My point was that I don't see anything intricate or complicated about
converting wikitext into plain text for the important cases: one regular
expression search-and-replace to remove wikilinks, one to remove quotes,
and one to remove external links. This doesn't have to happen for every
history entry, only those corresponding to section edits, and even then
the replacement only takes place on section headings, not the whole edit
summary.

The problem with templates is that they are expanded in the parser before
headings are parsed, and as you say expanding templates on history pages
is not feasible. Your patch doesn't fix templates-within-headings either,
and I'm not convinced it's worth supporting anyway.

> . . . hmm.  I just had an idea.  What if we took the finished anchor from the
> parser, stuck it onto something, used it for the return value of EditPage, *and
> then put it into the history entry*?

Not a bad idea; the only objection I can think of is that it seems
like overkill to change the schema for what's ultimately a minor
annoyance.
Comment 15 Rensualk Sakul 2007-02-01 13:10:47 UTC
Does this bug cover the problems with linking to sections which include
apostrophes for bolding and italicising? See Talk:€2 commemorative coins at
en.wikipedia.org.
Comment 16 Roan Kattouw 2007-09-06 11:49:15 UTC
Many of these issues have been fixed by TimLaqua in r25445, r25446, r25450 and r25573 (see also bug 2831 and bug 10836). This should reduce the list of problematic section headers significantly.
Comment 17 Conrad Irwin 2010-06-21 17:57:33 UTC
*** Bug 17786 has been marked as a duplicate of this bug. ***
Comment 18 Conrad Irwin 2010-09-03 22:58:48 UTC
*** Bug 12414 has been marked as a duplicate of this bug. ***
Comment 19 Conrad Irwin 2010-09-03 22:59:13 UTC
*** Bug 24262 has been marked as a duplicate of this bug. ***
Comment 20 p858snake 2011-04-30 00:10:04 UTC
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
Comment 21 Sumana Harihareswara 2011-11-09 23:45:01 UTC
+reviewed since comment 16 implies some obsoletion of Aryeh's patches

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links