Last modified: 2014-04-01 23:40:48 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T64870, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 62870 - GWtoolset uploaded a file with non-normalized unicode characters causing subtle breakage
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: GWToolset
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: easy
Reported: 2014-03-20 13:12 UTC by Jarek Tuszynski
Modified: 2014-04-01 23:40 UTC (History)
Description Jarek Tuszynski 2014-03-20 13:12:13 UTC
https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg was uploaded as a part of the first batch upload with "GWToolset Batch Upload" tool. It has several strange properties:
1) Original file page did not have any associated media or description and was deleted as a page "with no valid content". However the page did have a thumbnail and there was a full size image (https://upload.wikimedia.org/wikipedia/commons/f/f8/A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg) which was not accessible from the file page. 
2) The file had no history but in user contribution log one could find edits with full metadata (https://commons.wikimedia.org/w/index.php?oldid=118131591) 
3) The deleted file is still in several categories, such as https://commons.wikimedia.org/wiki/Category:Pages_using_Artwork_template_with_incorrect_parameter, and cannot be removed from them: a deleted file cannot be edited, and tools like Cat-a-lot or HotCat crash spectacularly when used with this file.
4) The file is a duplicate of https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L%27Art_de_Cr%C3%A9er_les_Jardins_%281835%29,_pl.1_-_BL.jpg . It is picked up by the "Process Duplicates" tool; however, the tool crashes when applied.

See also discussion https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/03#Disappearing_image.

Can someone delete the image for good (it was reuploaded under the correct name) so it does not show up in categories or in the duplicate tool?
Comment 1 Bawolff (Brian Wolff) 2014-03-20 14:53:55 UTC
Kind of like bug 32551 in some ways.

Like in that bug:

1) incorrect negative entry in file memcache (the upload seems to abort somewhere in the middle of the doEdit function, which is before the cache gets saved, so the file doesn't show up for a little while unless someone does action=purge)


Unlike in that bug:
2) actual edits for the file exist, although perhaps not associated entirely correctly
3) links still exist. Which means page had to be saved at some point, and there is still an entry in page table.
4) entry in image table still exists.

So some sort of referential integrity issue. Not sure if much else could be said without seeing original db records which are gone now.
Comment 2 Bawolff (Brian Wolff) 2014-03-21 09:20:45 UTC
Further investigation.

Note still accessible at https://commons.wikimedia.org/?curid=31451688

Basically, it appears that somewhere along the line gwtoolset didn't normalize the page title correctly, thus creating it with the letter 'é' written using combining characters (a U+0065 followed by a U+0301) instead of the precomposed 'é' (U+00E9). Titles are supposed to be in NFC, so various things subtly explode when the non-NFC sequence U+0065 U+0301 is used.
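
To make the distinction concrete, here is a minimal sketch in JavaScript (Node.js) of the two encodings of 'é' and what NFC does to them. The helper name codePoints is made up for this illustration; String.prototype.normalize is the standard built-in.

```javascript
// The same visible "é" in two encodings.
const decomposed = 'e\u0301';  // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
const precomposed = '\u00e9';  // LATIN SMALL LETTER E WITH ACUTE

// Illustrative helper: list the code points of a string as U+XXXX labels.
function codePoints(s) {
  return Array.from(s, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

console.log(codePoints(decomposed));   // U+0065, U+0301
console.log(codePoints(precomposed));  // U+00E9

// The two strings look identical but do not compare equal...
console.log(decomposed === precomposed);                   // false
// ...until the decomposed form is normalized to NFC.
console.log(decomposed.normalize('NFC') === precomposed);  // true
```

Any lookup that compares titles byte for byte treats the two forms as different pages, which matches the symptoms described above.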

All the symptoms mentioned are consistent with an incorrectly normalized db entry, except maybe symptom 1, which seems to imply there was a page at one point using the other form of the é. Kind of unclear what happened there, given the page is now moved/deleted. Perhaps there were page entries for both variants, but the proper variant was broken (e.g. it was fully uploaded to the wrong é, but as part of the process, it was partially uploaded to the correct é too). Hard to know.

My previous comment (comment 1) seems to have been incorrect, and this has nothing to do with bug 32551.
Comment 3 Jarek Tuszynski 2014-03-21 13:02:04 UTC
(In reply to Bawolff (Brian Wolff) from comment #2)
> Kind of unclear what happened there,
> given the page is now moved/deleted. Perhaps there were page entries for
> both variants, but the proper variant was broken (e.g. It was fully uploaded
> to the wrong é, but as part of the process, it was partially uploaded to the
> correct é too). Hard to know.

Sorry about "crime-scene contamination", I guess I was trying to fix the problems without calling the cavalry. Let me try to recall some of the actions related to this file:
* When I found the file it was a thumbnail in the "Artwork" template category for files using unsupported parameters (https://commons.wikimedia.org/wiki/Category:Pages_using_Artwork_template_with_incorrect_parameter). I could not edit the file since it was deleted. The deleted file did not have any media and the only description was "x".
* I asked about it, and was eventually pointed to the Village pump discussion, where user:Rillke and others dug out some more strange facts about the image, including a link to the full version and the description wikicode.
* After some time I reuploaded the media (taken from https://upload.wikimedia.org/wikipedia/commons/f/f8/A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg) and added the file description based on https://commons.wikimedia.org/w/api.php?action=query&prop=info|revisions|imageinfo&pageids=31451688&format=jsonfm&rvprop=content|flags|ids|user|comment&iiprop=url|user|timestamp|comment|dimensions
* At this point the file was still odd: it claimed to be a duplicate of itself, and categories could be added and removed, but I could not remove it from the "Artwork" category. It looked like there were 2 files with the same name.
* After half a day User:Jheald moved the file to its present name with the correct é.
* Afterwards I deleted the redirect page associated with the old name.
Comment 4 Bawolff (Brian Wolff) 2014-03-21 16:48:11 UTC
(In reply to Jarek Tuszynski from comment #3)
> (In reply to Bawolff (Brian Wolff) from comment #2)
> > Kind of unclear what happened there,
> > given the page is now moved/deleted. Perhaps there were page entries for
> > both variants, but the proper variant was broken (e.g. It was fully uploaded
> > to the wrong é, but as part of the process, it was partially uploaded to the
> > correct é too). Hard to know.
> 
> Sorry about "crime-scene contamination", I guess I was trying to fix the
> problems without calling the cavalry.

No worries. There's enough here to reproduce the problem if need be. If it turns out we really need to know exactly what happened, we could just try to make gwtoolset upload a non-normalized title and see.
Comment 5 Bawolff (Brian Wolff) 2014-03-26 14:40:21 UTC
So in MediaWiki, we generally prefer to normalize unicode at input.

That means input should be run through $wgContLang->normalize() as soon as it comes out of the XML file. So that would be in methods like XmlDetectHandler::createExampleDOMElement, XmlDetectHandler::findExampleDOMNodes, XmlMappingHandler::getFilteredNodeValue
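
As a language-neutral illustration of "normalize at input", the sketch below NFC-normalizes every string in a metadata record right after it is read, so nothing downstream sees a non-NFC title. It is written in JavaScript and is not GWToolset code: normalizeRecord and the record shape are hypothetical, and the real fix would call $wgContLang->normalize() in PHP.

```javascript
// Hypothetical helper, not GWToolset code: NFC-normalize the keys and the
// string values of a metadata record as soon as it has been read from XML.
function normalizeRecord(record) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    out[key.normalize('NFC')] =
      typeof value === 'string' ? value.normalize('NFC') : value;
  }
  return out;
}

// Assumed example input: a title containing a decomposed "é" (e + U+0301).
const raw = { title: "L'Art de Cre\u0301er les Jardins", year: 1835 };
const clean = normalizeRecord(raw);

console.log(clean.title === "L'Art de Cr\u00e9er les Jardins"); // true
console.log(clean.year); // non-string values pass through unchanged
```

Normalizing once at the boundary avoids having to remember to do it at every later point where a title is built or compared.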
Comment 6 Gerrit Notification Bot 2014-03-26 16:03:41 UTC
Change 121097 had a related patch set uploaded by Dan-nl:
make sure unicode characters are normalized

https://gerrit.wikimedia.org/r/121097
Comment 7 Jarek Tuszynski 2014-03-26 16:09:27 UTC
Can we also do some database cleanup to remove remnants of this issue? https://commons.wikimedia.org/wiki/Category:Pages_using_Artwork_template_with_incorrect_parameter still has the deleted file https://commons.wikimedia.org/wiki/File:A_new_and_accurate_plan_of_Blenheim_Palace_-_L--39-Art_de_Cre%CC%81er_les_Jardins_-1835--_pl.1_-_BL.jpg. I do not think I can fix it with the tools available through the Commons interface.
Comment 8 Jarek Tuszynski 2014-03-28 17:12:49 UTC
https://commons.wikimedia.org/wiki/File:Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-_De_la_production_a%CC%80_l--39-utilisation_de_ressources_e%CC%81ducatives_libres_-_-1.webm.webm is another file that seems to have the same issue. Can someone fix or delete this file?
Comment 9 Bawolff (Brian Wolff) 2014-03-29 07:28:44 UTC
(In reply to Jarek Tuszynski from comment #8)
> https://commons.wikimedia.org/wiki/File:
> Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-
> _De_la_production_a%CC%80_l--39-
> utilisation_de_ressources_e%CC%81ducatives_libres_-_-1.webm.webm is another
> file that seem the have the same issue. Can someone fix  or delete this file?

I don't have filemover rights to move the file. However, someone with filemover or admin rights and some knowledge of the API can fix these files by using the API action=move module with the fromid parameter (fromid takes the page id number; this is the same as the curid parameter on normal requests). Similarly, they are deletable from the API too. (Actually, deletion is also possible via the normal web interface, but you need to do fancy stuff with something like Firebug to add curid to the POST parameters of the confirmation screen.)
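
A rough sketch of what such a request could look like, in JavaScript. It only builds the POST body and sends nothing. buildMoveParams, the target title, the reason text and the token value are placeholders invented for this example; action, fromid, to, reason, token and format are parameters of the MediaWiki action=move module.

```javascript
// Illustrative only: build the POST body for a move addressed by page id.
// The caller must supply a valid token from an authenticated session with
// filemover or admin rights; nothing here obtains one.
function buildMoveParams(pageId, newTitle, token) {
  return new URLSearchParams({
    action: 'move',
    fromid: String(pageId),        // page id, the same number as curid
    to: newTitle.normalize('NFC'), // make sure the target title is NFC
    reason: 'Fix non-NFC title',   // placeholder reason text
    token: token,
    format: 'json',
  });
}

// Placeholder values: page id 31451688 is the one mentioned in this report;
// the target title and the token are made up for the example.
const params = buildMoveParams(
  31451688,
  'File:Example with Cre\u0301er in the name.jpg',
  'TOKEN_PLACEHOLDER'
);

// This string would be POSTed to https://commons.wikimedia.org/w/api.php
console.log(params.toString());
```

Addressing the page by id sidesteps the title lookup, which is the step that fails for a non-NFC title.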


The id for File:Rencontres Wikimédia et Éducation 2012 - De la production à l--39-utilisation de ressources éducatives libres - -1.webm.webm ( https://commons.wikimedia.org/wiki/?curid=31747120 ) is 31747120.

Interestingly enough, for that title, the first é and É are fine; it's the last two, à and é, that are the issue.
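
This can be checked mechanically. The sketch below (JavaScript, assuming Node.js 12 or later for matchAll and Unicode property escapes) decodes the percent-encoded title from the URL in comment 8 and lists the sequences that NFC would change. findDecomposedSequences is a name made up for this illustration.

```javascript
// Illustrative helper: find "base character + combining mark(s)" sequences
// that are not already in NFC, with their position in the title.
function findDecomposedSequences(title) {
  const found = [];
  // \P{M} matches one non-mark character, \p{M}+ one or more combining marks.
  for (const match of title.matchAll(/\P{M}\p{M}+/gu)) {
    const nfc = match[0].normalize('NFC');
    if (nfc !== match[0]) {
      found.push({ index: match.index, sequence: match[0], nfc: nfc });
    }
  }
  return found;
}

// Percent-encoded title as it appears in the URL from comment 8.
const encoded =
  'Rencontres_Wikim%C3%A9dia_et_%C3%89ducation_2012_-_De_la_production_' +
  'a%CC%80_l--39-utilisation_de_ressources_e%CC%81ducatives_libres_-_-1.webm.webm';
const title = decodeURIComponent(encoded);

const hits = findDecomposedSequences(title);
console.log(hits.length);              // 2
console.log(hits.map((h) => h.nfc));   // the precomposed à and é
console.log(title === title.normalize('NFC')); // false
```

The %C3%A9 and %C3%89 sequences decode to precomposed characters and are left alone; only a%CC%80 and e%CC%81 are reported.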

-----

Until the patch for this bug gets reviewed and deployed to Commons (which should happen quite soon), may I suggest converting XML files to NFC before uploading them to gwtoolset. On Linux, if you have the libicu-dev package installed, you can do this with the command

 uconv -x any-NFC -o output.xml input.xml

(I have no idea how to do this on other operating systems)
Comment 11 Rainer Rillke @commons.wikimedia 2014-03-29 11:24:56 UTC
(In reply to Bawolff (Brian Wolff) from comment #9)
> (I have no idea how to do this on other operating systems)

You can use node.js with https://github.com/walling/unorm

C:\Users\XXX> npm install unorm

Create a script named "ps.js" at "C:\Users\XXX" with the following content

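// File to normalize in place; adjust the name as needed.
var fileName = 'sample.txt',
	fs = require('fs'),
	unorm = require('unorm');

// Read the file as UTF-8, convert the text to NFC, and write it back.
fs.readFile(fileName, { encoding: 'utf-8' }, function (err, stData) {
  if (err) throw err;
  stData = unorm.nfc(stData);
  fs.writeFileSync(fileName, stData, { encoding: 'utf-8' });
});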

(assuming that "C:\Users\XXX\sample.txt" is the file you'd like to process)

and run node.js:

C:\Users\XXX> node ps.js
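
On a reasonably recent Node.js the same conversion can probably be done without an extra package, since String.prototype.normalize is part of the language. A minimal sketch under that assumption; normalizeFileToNfc and the demo file are made up for the example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Rewrites a UTF-8 text file in place in NFC.
// Returns true if the file content had to be changed.
function normalizeFileToNfc(fileName) {
  const original = fs.readFileSync(fileName, 'utf8');
  const normalized = original.normalize('NFC');
  if (normalized !== original) {
    fs.writeFileSync(fileName, normalized, 'utf8');
  }
  return normalized !== original;
}

// Demonstration on a throwaway file in the system temp directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'nfc-'));
const demoFile = path.join(demoDir, 'sample.xml');
fs.writeFileSync(demoFile, '<title>Cre\u0301er</title>', 'utf8');

console.log(normalizeFileToNfc(demoFile)); // true: the file was rewritten
console.log(fs.readFileSync(demoFile, 'utf8') === '<title>Cr\u00e9er</title>'); // true
```

Running it a second time on the same file returns false, because the content is already in NFC.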
Comment 12 Gerrit Notification Bot 2014-04-01 23:36:39 UTC
Change 121097 merged by jenkins-bot:
make sure unicode characters are normalized

https://gerrit.wikimedia.org/r/121097
Comment 13 Bawolff (Brian Wolff) 2014-04-01 23:40:48 UTC
The fix for this issue is scheduled to be deployed on commons on Tuesday, 8 April 2014
