Last modified: 2010-05-15 15:33:46 UTC
In 1.4pre I made a change that broke image captions containing external links. In essence, the problem is that when images are parsed before external links, things like the image "alt" text break. I'm filing this bug to track the possibilities for fixing it. (Background: my change was calling replaceInternalLinks() before replaceExternalLinks(). I did this to fix bug 2, and I think it is best way to fix that bug. However, I'm willing to reconsider if necessary. One possibility is to back out my change and WONTFIX bug 2, since it represents context sensitivity, and is tricky to fix, for a rare case.) Rowan Collins noticed this problem with image captions and made a patch to fix it. I also made a patch using a different approach. Both patches pass all the relevant parser test cases; I believe they are both functionally correct. To me, the only issue is which is better from a design standpoint. Currently, replaceInternalLinks() is called twice; the first pass replaces any links inside image captions, and the second pass replaces the images with captions containing links, themselves. With either patch, replaceInternalLinks() is instead called only once.
Created attachment 63 [details] Rowan's replaceImageLinks patch Rowan's patch adds a new method, replaceImageLinks(), which called after replaceExternalLinks(). Thus replaceInternalLinks() no longer handles images. The bug is avoided because all images are processed after external links. Rowan described the patch himself at http://mail.wikipedia.org/pipermail/wikitech-l/2004-October/025647.html I think this patch is well done, and shows a good understanding of the problem. My main criticism is that it represents a new parser pass, and I think we should move toward fewer passes, not more. A counteragument is that calling replaceInternalLinks() twice is already an extra pass, so the patch only replaces one pass with another.
Created attachment 64 [details] Wil's patch My patch checks every link to see if it is an image, and if so, recursively calls replace{Internal,External}Links on the link text. After an image is processed, it is replaced by a placeholder, so it is not parsed further. This should be OK, because tables, lists, etc. aren't allowed in captions anyway. I prefer this approach because there is only one replaceInternalLinks() pass. Of course, any comments or other ideas would be welcome.
(In reply to comment #2) > Created an attachment (id=64) [edit] > Wil's patch I can see one major problem with your patch, which is that you use a text check for the image prefix rather than Title::getNamespace(). This isn't guaranteed to work, since a namespace can have multiple "aliases". I'll have to have a better look to see if there are any other issues.
(In reply to comment #3) > I can see one major problem with your patch, which is that you use a text check > for the image prefix rather than Title::getNamespace(). This isn't guaranteed to > work, since a namespace can have multiple "aliases". I'll have to have a better > look to see if there are any other issues. Thanks for looking at the patch. Title::secureAndSplit() uses the same code for checking for the Image namespace. Could you give an example of such an "alias"?
(In reply to comment #4) > Thanks for looking at the patch. Title::secureAndSplit() uses the same code > for checking for the Image namespace. Could you give an example of such an > "alias"? I think that $imgpre stuff is actually redundant: the actual check for :Image: (and also :Category:) comes within replaceInternalLinks() itself; after all, it doesn't really constitute a different namespace, just a different link behaviour: <code> $noforce = (substr($m[1], 0, 1) != ':'); if (!$noforce) { # Strip off leading ':' $link = substr($link, 1); } </code> As for namespace aliases, I had it on good authority but couldn't find them either, until I looked at Title::secureAndSplit(), which checks first Namespace::getCanonicalIndex() and then $wgContLang->getNsIndex(). So a wiki with a localised image namespace called "Bild:" will still treat "Image:" as being the same namespace. I think the only sensible way of doing it is creating a Title object, since somebody might want to change this behaviour further in the future. On the other hand, I think there are a few ways of doing what your patch does slightly more efficiently, so I'll probably post a new one later.
(In reply to comment #5) > I think that $imgpre stuff is actually redundant: the actual check for :Image: > (and also :Category:) comes within replaceInternalLinks() itself; after all, it > doesn't really constitute a different namespace, just a different link behaviour: In replaceInternalLinks(), yes, but that is not the only user of Title. For example, when including a template, that code is necessary. > As for namespace aliases, I had it on good authority but couldn't find them > either, until I looked at Title::secureAndSplit(), which checks first > Namespace::getCanonicalIndex() and then $wgContLang->getNsIndex(). So a wiki > with a localised image namespace called "Bild:" will still treat "Image:" as > being the same namespace. I think the only sensible way of doing it is creating "Bild:" should still work with my patch. The assumption is that the localization of "Image" is unique for a given language; I could be wrong, but I haven't seen a counterexample. > a Title object, since somebody might want to change this behaviour further in > the future. I would rather avoid creating a Title object, because I don't see a reason to incur the overhead. However, if it is indeed necessary, it would not be that difficult to do, and we already create the Title for all valid links. So I don't see it as a major problem with the patch. > On the other hand, I think there are a few ways of doing what your patch does > slightly more efficiently, so I'll probably post a new one later. Indeed, I think the replaceInternalLinks() method is slow in general, and I agree there is room for improvement. I believe correctness should come before optimizations, though.
(In reply to comment #6) > (In reply to comment #5) > > I think that $imgpre stuff is actually redundant: the actual check for :Image: > > (and also :Category:) comes within replaceInternalLinks() itself; after all, it > > doesn't really constitute a different namespace, just a different link behaviour: > > In replaceInternalLinks(), yes, but that is not the only user of Title. > For example, when including a template, that code is necessary. Why? If it is, then we need to fix that code. The conflict is between ":" as the name of the MAIN namespace, and ":" as a prefix to suppress the magicality of links to namespaces such as "Image:" and "Category:". I'm not sure why the latter case would ever occur anywhere other than links (in the sense that if it's not a link, it's not going to do anything "magic" anyway), which is why I assumed that stripping the prefix in replaceInternalLinks() was sufficient to distiguish. Can you identify any other cases where it would be needed? > "Bild:" should still work with my patch. The assumption is that the localization > of "Image" is unique for a given language; I could be wrong, but I haven't seen > a counterexample. But "Image:" will not, since you don't take into account the existence of the "canonical names"; this isn't a big deal, but it creates an anomaly, and disables a function unrelated to the problem you are trying to solve. The mapping of namespace_name to namespace is potentially many-to-one; currently, a localisation can have "Bild:" and "Image:" both mapping to the image namespace. > I would rather avoid creating a Title object, because I don't see a > reason to incur the overhead. However, if it is indeed necessary, > it would not be that difficult to do, and we already create the Title > for all valid links. So I don't see it as a major problem with the > patch. No, it's not a major problem; but I *can't* see a "correct" [non-hacky] way of avoiding it, so it's a need for revision. Hopefully, there won't be too many cases of something that could be an image link, but actually is just an invalid one (those are the cases where creating a Title object gives extra overhead).
Created attachment 66 [details] Reworking of Wil's concept, only different Right, I've finally finished my patch, based on Wil's. Basically, I think this is the better approach, because only problem links have to go through extra processing, not all image links; this should make it more efficient. The differences with Wil's patch are: * do the special case code after the Title object has been created, so that we can reliably check the namespace * use a couple or regexes, rather than counting the number of ]]'s on a line * try and take every opportunity to give up, to minimise the load of the special case code This works for me, but then that was true of the last one. I just hope there's nothing wrong with this, cos I've been banging my head against it for hours!
Your patch looks good. I committed it, with a few cosmetic changes. Thanks! Could you clarify the category part of your change (not creating $nnt)? It looks ok at first glance, but presumably it was done that way for a reason. Also, I'm not sure stripping tags from the alt text is the best approach, but I committed it pending a resolution to bug 368, since it makes the parser tests work.
Thanks for the vote of confidence :) The $nnt thing was just a coincidental fix I spotted: I asked on IRC, and no-one could come up with any purpose, or indeed any difference that it would make to anything; Brion blamed it on a "brainfart" on Magnus' part when he first checked in the category namespace. I couldn't make anything break without it, so I tidied it away for efficiency's sake. As for stripping tags, it was already done somewhere else in the code; it nicely unwraps ext.links, and used to unwrap internal links, too, I think, but now the placeholders are comments they just disappear altogether. I've filed that as bug 648, since we may still want to generate a title using this method, even if we eventually get an agreement on bug 368. (Come to think, it will get rid of things like "<small>" or "<em>", too, which would otherwise get escaped and be ugly) Meanwhile, I think I found a case where my patch is sub-optimal: in "[[Image:foo| [[link]]", the [[link]] should still be parsed, but at the moment it will get eaten in the (failed) attempt to turn it into an image caption. Not a huge problem, since that markup is essentially broken anyway, but probably not expected behaviour. I haven't tested this (I haven't even tested that it breaks yet!) but I think all we need is to add the following: # we couldn't find the end of this imageLink, so output it raw +$text = $this->replaceExternalLinks($text); $s .= $prefix . '[[' . $link . '|' . $text; No need to do it on the next }else{ down, because we won't have eaten any "[["s
Created attachment 69 [details] trivial fix Confirmed: the patch as committed will fail to render all links after a broken Image caption: "[[Image:Rowan.jpeg|thumb|This is a broken caption. But [[foo|this]] is just an ordinary link." is rendered entirely as raw text. Adding that line (as per this tiny patch) does indeed make that link work. Annoyingly, it will still break any other captions with links in further down the page, but I think that's an obscure enough case that we don't need to worry about it.
(In reply to comment #11) > Adding that line (as per this tiny patch) does indeed make that link work. > Annoyingly, it will still break any other captions with links in further down > the page, but I think that's an obscure enough case that we don't need to worry > about it. Applied, thanks. I agree that broken image captions aren't important enough to worry about further.