Last modified: 2014-04-01 23:40:48 UTC
https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg was uploaded as part of the first batch upload with the "GWToolset Batch Upload" tool. It has several strange properties:
1) The original file page did not have any associated media or description and was deleted as a page "with no valid content". However, the page did have a thumbnail, and there was a full-size image (https://upload.wikimedia.org/wikipedia/commons/f/f8/A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg) which was not accessible from the file page.
2) The file had no history, but in the user contribution log one could find edits with full metadata (https://commons.wikimedia.org/w/index.php?oldid=118131591).
3) The deleted file is still in several categories, such as https://commons.wikimedia.org/wiki/Category:Pages_using_Artwork_template_with_incorrect_parameter, and cannot be removed: a deleted file cannot be edited, and tools like Cat-a-lot or HotCat crash spectacularly when used with this file.
4) The file is a duplicate of https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L%27Art_de_Cr%C3%A9er_les_Jardins_%281835%29,_pl.1_-_BL.jpg . It is picked up by the "Process Duplicates" tool; however, the tool crashes when applied.
See also the discussion at https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/03#Disappearing_image. Can someone delete the image for good (it was reuploaded under the correct name) so that it does not show up in categories or in the duplicate tool?
Kind of like bug 32551 in some ways.
Like in that bug:
1) There is an incorrect negative entry in the file memcache (the upload seems to abort somewhere in the middle of the doEdit function, which is before the cache gets saved, so the file doesn't show up for a little while unless someone does action=purge).
Unlike in that bug:
2) Actual edits for the file exist, although perhaps not associated entirely correctly.
3) Links still exist, which means the page had to be saved at some point, and there is still an entry in the page table.
4) The entry in the image table still exists.
So there is some sort of referential integrity issue. Not sure if much else can be said without seeing the original db records, which are gone now.
Further investigation. Note that the page is still accessible at https://commons.wikimedia.org/?curid=31451688
Basically, it appears that somewhere along the line gwtoolset didn't normalize the page title correctly, thus creating it with an 'é' built from combining characters (a U+65 followed by a U+301) instead of the precomposed 'é' (U+E9). Titles are supposed to be in NFC, so various things subtly explode when the non-NFC U+65 U+301 is used. All the symptoms mentioned are consistent with an incorrectly normalized db entry, except maybe symptom 1, which seems to imply there was a page at one point using the other form of the é. Kind of unclear what happened there, given the page is now moved/deleted. Perhaps there were page entries for both variants, but the proper variant was broken (e.g. It was fully uploaded to the wrong é, but as part of the process, it was partially uploaded to the correct é too). Hard to know.
My previous comment (comment 1) seems to have been incorrect, and this has nothing to do with bug 32551.
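To make the two spellings of é concrete, here is a minimal sketch in JavaScript (not MediaWiki or GWToolset code). It assumes a runtime where the built-in String.prototype.normalize is available; the variable names are only for illustration.

```javascript
// The same visible letter, built two ways.
const decomposed = 'e\u0301';  // U+0065 (e) followed by U+0301 (combining acute accent)
const precomposed = '\u00e9';  // U+00E9 (e with acute), the NFC form

// List the code points of a string as hex, e.g. to inspect a suspicious title.
function codePoints(s) {
  return Array.from(s, (ch) => ch.codePointAt(0).toString(16));
}

console.log(codePoints(decomposed));   // two code points
console.log(codePoints(precomposed));  // one code point

// A plain comparison treats the two spellings as different strings ...
console.log(decomposed === precomposed);

// ... but after NFC normalization they are identical.
console.log(decomposed.normalize('NFC') === precomposed);
```

Anything that looks a title up by its exact bytes will see two different titles here, which fits the symptoms described above.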
(In reply to Bawolff (Brian Wolff) from comment #2)
> Kind of unclear what happened there,
> given the page is now moved/deleted. Perhaps there were page entries for
> both variants, but the proper variant was broken (e.g. It was fully uploaded
> to the wrong é, but as part of the process, it was partially uploaded to the
> correct é too). Hard to know.
Sorry about the "crime-scene contamination"; I guess I was trying to fix the problems without calling the cavalry. Let me try to recall some of the actions related to this file:
* When I found the file, it was a thumbnail in the "Artwork" template category for files using unsupported parameters (https://commons.wikimedia.org/wiki/Category:Pages_using_Artwork_template_with_incorrect_parameter). I could not edit the file since it was deleted. The deleted file did not have any media and the only description was "x".
* I asked about it, and was eventually pointed to the Village pump discussion, where user:Rillke and others dug out some more strange facts about the image, including a link to the full version and the description wikicode.
* After some time I reuploaded the media (taken from https://upload.wikimedia.org/wikipedia/commons/f/f8/A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg) and added the file description based on https://commons.wikimedia.org/w/api.php?action=query&prop=info|revisions|imageinfo&pageids=31451688&format=jsonfm&rvprop=content|flags|ids|user|comment&iiprop=url|user|timestamp|comment|dimensions
* At this point the file was still odd: it claimed to be a duplicate of itself, and categories could be added and removed, but I could not remove it from the "Artwork" category. It looked like there were 2 files with the same name.
* After half a day User:Jheald moved the file to its present name with the correct é.
* Afterwards I deleted the redirect page associated with the old name.
(In reply to Jarek Tuszynski from comment #3)
> (In reply to Bawolff (Brian Wolff) from comment #2)
> > Kind of unclear what happened there,
> > given the page is now moved/deleted. Perhaps there were page entries for
> > both variants, but the proper variant was broken (e.g. It was fully uploaded
> > to the wrong é, but as part of the process, it was partially uploaded to the
> > correct é too). Hard to know.
>
> Sorry about "crime-scene contamination", I guess I was trying to fix the
> problems without calling the cavalry.
No worries. There's enough here to reproduce the problem if need be. If it turns out we really need to know exactly what happened, we could just try to make gwtoolset upload a non-normalized title and see.
In MediaWiki, we generally prefer to normalize Unicode at input. That means the input should be run through $wgContLang->normalize() as soon as it comes out of the XML file, i.e. in methods like XmlDetectHandler::createExampleDOMElement, XmlDetectHandler::findExampleDOMNodes, and XmlMappingHandler::getFilteredNodeValue.
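The idea of normalizing at the input boundary can be sketched roughly as follows. This is JavaScript for illustration only, not the PHP that GWToolset uses; normalizeInput and the sample record are hypothetical, and the built-in String.prototype.normalize stands in for $wgContLang->normalize().

```javascript
// Hypothetical helper: normalize every string in a parsed metadata record
// to NFC before anything else (title building, db writes) gets to see it.
function normalizeInput(value) {
  if (typeof value === 'string') {
    return value.normalize('NFC');
  }
  if (Array.isArray(value)) {
    return value.map(normalizeInput);
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, v]) => [key.normalize('NFC'), normalizeInput(v)])
    );
  }
  return value; // numbers, booleans, null, undefined pass through unchanged
}

// Made-up record with decomposed accents, as it might come out of an XML file.
const record = { title: 'Cre\u0301er les Jardins', tags: ['e\u0301ducation'], year: 1835 };
const clean = normalizeInput(record);

console.log(clean.title === 'Cr\u00e9er les Jardins');
```

Doing this once where the data enters means no later code path has to remember to normalize.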
Change 121097 had a related patch set uploaded by Dan-nl: make sure unicode characters are normalized https://gerrit.wikimedia.org/r/121097
Can we also do some database cleanup to remove the remnants of this issue? https://commons.wikimedia.org/wiki/Category:Pages_using_Artwork_template_with_incorrect_parameter still lists the deleted file https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg. I do not think I can fix it with the tools available through the Commons interface.
https://commons.wikimedia.org/wiki/File:Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-_De_la_production_a%CC%80_l--39-utilisation_de_ressources_e%CC%81ducatives_libres_-_-1.webm.webm is another file that seems to have the same issue. Can someone fix or delete this file?
(In reply to Jarek Tuszynski from comment #8)
> https://commons.wikimedia.org/wiki/File:
> Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-
> _De_la_production_a%CC%80_l--39-
> utilisation_de_ressources_e%CC%81ducatives_libres_-_-1.webm.webm is another
> file that seems to have the same issue. Can someone fix or delete this file?
I don't have filemover rights to move the file. However, someone with filemover or admin rights and some knowledge of the API can fix these files by using the API action=move module with the fromid parameter (fromid takes the page id number; this is the same as the curid parameter on normal requests). Similarly, they are deletable from the API too. (Actually, deletion is also possible via the normal web interface, but you need to do fancy stuff with something like Firebug to add curid to the POST parameters of the confirmation screen.)
The id for File:Rencontres Wikimédia et Éducation 2012 - De la production à l--39-utilisation de ressources éducatives libres - -1.webm.webm ( https://commons.wikimedia.org/wiki/?curid=31747120 ) is 31747120.
Interestingly enough, for that title, the first é and É are fine; it's the last two, à and é, that are the issue.
-----
Until the patch for this bug gets reviewed and deployed to Commons (which should happen quite soon), may I suggest converting XML files to NFC before uploading them to gwtoolset. On Linux, if you have the libicu-dev package installed, you can do this with the command
uconv -x any-NFC -o output.xml input.xml
(I have no idea how to do this on other operating systems)
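For anyone attempting the move by page id, the request parameters might be assembled roughly like this. It is a sketch only: buildMoveParams, the reason text, and the placeholder token are made up for illustration, and nothing is sent. The body would have to be POSTed to https://commons.wikimedia.org/w/api.php from a logged-in session with a real token obtained separately.

```javascript
// Hypothetical helper: build the POST body for action=move, addressing the
// page by id (fromid) because its non-NFC title cannot be reached by name.
function buildMoveParams(pageId, newTitle, csrfToken) {
  return new URLSearchParams({
    action: 'move',
    fromid: String(pageId),
    to: newTitle.normalize('NFC'), // make sure the target title is in NFC
    reason: 'Fix non-NFC title',   // illustrative wording
    token: csrfToken,
    format: 'json',
  });
}

// Page id taken from the comment above; the target title is written with the
// two decomposed letters on purpose to show that they get normalized.
const body = buildMoveParams(
  31747120,
  'File:Rencontres Wikim\u00e9dia et \u00c9ducation 2012 - De la production a\u0300 ' +
    'l--39-utilisation de ressources e\u0301ducatives libres - -1.webm',
  'PLACEHOLDER_TOKEN' // not a real token
);

console.log(body.toString());
```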
Updated [[:commons:Commons:User scripts/Invisible charaters]] https://commons.wikimedia.org/w/index.php?title=Commons:User_scripts/Invisible_charaters&withJS=MediaWiki:Invisible_characters_unveiled.js#val/File:Rencontres%20Wikim%C3%A9dia%20et%20%C3%89ducation%202012%20-%20De%20la%20production%20a%CC%80%20l--39-utilisation%20de%20ressources%20e%CC%81ducatives%20libres%20-%20-1.webm.webm Moved: https://commons.wikimedia.org/wiki/File:Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-_De_la_production_%C3%A0_l--39-utilisation_de_ressources_%C3%A9ducatives_libres_-_-1.webm Moved: https://commons.wikimedia.org/w/index.php?title=File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Créer_les_Jardins_-1835--_pl.1_-_BL.jpg Firefox decided to crash 2 times while writing in the firebug console, so sorry for the delay.
(In reply to Bawolff (Brian Wolff) from comment #9)
> (I have no idea how to do this on other operating systems)
You can use node.js with https://github.com/walling/unorm
C:\Users\XXX> npm install unorm
Create a script named "pr.js" at "C:\Users\XXX" with the following content:
var fileName = 'sample.txt',
    fs = require('fs'),
    unorm = require('unorm');
// Read the file as UTF-8, convert it to NFC, and write it back in place.
fs.readFile(fileName, { encoding: 'utf-8' }, function (err, stData) {
    if (err) throw err;
    stData = unorm.nfc(stData);
    fs.writeFileSync(fileName, stData, { encoding: 'utf-8' });
});
(assuming that "C:\Users\XXX\sample.txt" is the file you'd like to process) and run node.js:
C:\Users\XXX> node pr.js
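To check whether a metadata file needs converting at all, something like the following might do. It is a sketch that uses the built-in String.prototype.normalize instead of unorm, so it assumes a node.js version that provides it; findNonNfcLines and the sample text are hypothetical.

```javascript
// Hypothetical helper: return the 1-based numbers of the lines that are not in NFC.
function findNonNfcLines(text) {
  const hits = [];
  text.split('\n').forEach(function (line, index) {
    if (line !== line.normalize('NFC')) {
      hits.push(index + 1);
    }
  });
  return hits;
}

// Made-up sample: line 2 uses e + U+0301, lines 1 and 3 are already NFC.
const sample = [
  '<title>Wikim\u00e9dia</title>',
  '<title>Cre\u0301er les Jardins</title>',
  '<title>plain ascii</title>',
].join('\n');

console.log(findNonNfcLines(sample));
```

For a real file, the text would come from fs.readFileSync(fileName, 'utf-8') as in the script above.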
Change 121097 merged by jenkins-bot: make sure unicode characters are normalized https://gerrit.wikimedia.org/r/121097
The fix for this issue is scheduled to be deployed on commons on Tuesday, 8 April 2014