Last modified: 2014-05-13 09:31:50 UTC
GWToolset is uploading a lot of files with names like https://commons.wikimedia.org/w/index.php?title=File:The_King_of_Hungary_holding_council_in_his_tent_on_the_battlefield_-_Froissart--39-s_Chronicles_-Volume_IV-_part_2-_-1470-1475--_f.84_-_BL_Harley_MS_4380.jpg&redirect=no . The proper name is [[commons:File:The King of Hungary holding council in his tent on the battlefield - Froissart's Chronicles (Volume IV, part 2) (1470-1475), f.84 - BL Harley MS 4380.jpg]] (and it has since been renamed to that). Notice how characters like ), (, ', and & are being stripped and replaced with '-'. This is wrong; those characters are perfectly valid in a title. Even worse, characters like the apostrophe (') are being converted to their HTML entity "&#39;", with &, # and ; then being replaced with dashes, resulting in "--39-". This is wrong, as HTML entities in titles should be converted to the character they represent, and that character should then be dealt with as appropriate (as is done in normal titles). See: https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/03#Renaming_multiple_files.3F
Related: http://lists.wikimedia.org/pipermail/glamtools/2014-March/000035.html
(In reply to MZMcBride from comment #1) > Related: > http://lists.wikimedia.org/pipermail/glamtools/2014-March/000035.html And from the email: >'#','<','>','[',']','|','{','}',':','¬','`','!','"','£','$','^','&','*','(',')','+','=','~','?',',',';',"'",'@' Many of these characters are very common in file names (apostrophes, parentheses) and absolutely allowed both socially and technically. I think that GWToolset should simply follow $wgIllegalFileChars and the things that Title::secureAndSplit blocks (to be specific, only blacklist '#','<','>','[',']','|','{','}', and ':'). If there really is a need to blacklist additional characters for social reasons (I'm not convinced there is), then the blacklist should be configurable on-wiki as a MediaWiki: namespace message, since social conventions change over time.
Sorry, to be more specific (because I got questions): GWToolset should use the built-in function wfStripIllegalFilenameChars instead of trying to re-implement title validation rules in Utils::stripIllegalTitleChars. This bug is also about HTML entities, so the full process for normalizing the title should be: 1) Run it through Sanitizer::decodeCharReferences() 2) Run it through wfStripIllegalFilenameChars()
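For illustration only, the two-step order proposed above (decode entities first, then strip illegal characters) can be sketched in Python. This is not the MediaWiki code: html.unescape stands in for Sanitizer::decodeCharReferences(), a small regex stands in for wfStripIllegalFilenameChars(), the function name normalize_title is hypothetical, the character set is the one listed in the earlier comment, and replacing with '-' is an assumption based on the behaviour described in this report.

```python
import html
import re

# Assumption: the characters named earlier in this thread as the only ones
# that need blocking. The real set in MediaWiki comes from its configuration.
ILLEGAL_TITLE_CHARS = re.compile(r"[#<>\[\]|{}:]")


def normalize_title(raw_title: str, replacement: str = "-") -> str:
    """Hypothetical helper: decode entities, then replace illegal characters."""
    # Step 1: turn entities such as &#39; back into the character they represent.
    decoded = html.unescape(raw_title)
    # Step 2: only now replace characters that are not allowed in a title.
    return ILLEGAL_TITLE_CHARS.sub(replacement, decoded)


print(normalize_title("Froissart&#39;s Chronicles (Volume IV, part 2)"))
# prints: Froissart's Chronicles (Volume IV, part 2)
```

Doing the steps in the opposite order is what produces names like "--39-": the &, # and ; of the entity are replaced before the entity can be decoded.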
* working on a patch
Change 121094 had a related patch set uploaded by Dan-nl: relax wiki title restrictions https://gerrit.wikimedia.org/r/121094
Change 121094 merged by jenkins-bot: relax wiki title restrictions https://gerrit.wikimedia.org/r/121094
The fix for this issue is scheduled to be deployed to Commons on Tuesday, 8 April 2014.
Change 125401 had a related patch set uploaded by Dan-nl: wfStripIllegalFilenameChars truncates title https://gerrit.wikimedia.org/r/125401
Change 125401 merged by jenkins-bot: wfStripIllegalFilenameChars truncates title https://gerrit.wikimedia.org/r/125401
Things seem to work better now, but automatically transforming titles that contain illegal characters is IMO not a good approach. The reason is that there is no way to track these files after the transformation. Why not simply check the titles just after the XML upload and report that something is wrong with them (listing the affected titles)?
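A rough sketch of that alternative, in Python for illustration only: instead of rewriting titles, collect the ones that contain illegal characters and report them. The function name find_invalid_titles and the character set are assumptions (the set is the one proposed earlier in this thread); nothing here reflects how GWToolset actually processes the XML.

```python
import re

# Assumption: same illegal character set as proposed earlier in this thread.
ILLEGAL_TITLE_CHARS = re.compile(r"[#<>\[\]|{}:]")


def find_invalid_titles(titles):
    """Hypothetical check: return (title, offending characters) pairs."""
    report = []
    for title in titles:
        # Collect each distinct illegal character found in this title.
        bad = sorted(set(ILLEGAL_TITLE_CHARS.findall(title)))
        if bad:
            report.append((title, bad))
    return report


for title, bad in find_invalid_titles(["Good title.jpg", "Bad: title [1].jpg"]):
    print(f"{title!r} contains illegal characters: {' '.join(bad)}")
```

With a report like this the uploader can fix the metadata at the source, so the resulting file names stay traceable to the original titles.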
(In reply to Kelson [Emmanuel Engelhart] from comment #10) > Things seem to me to work better now, but transforming auto. titles with > illegal characters is IMO not a good approach. The reason is that there is > no way to track these files (after transformation). Why not simply checking > this just after the XML upload and telling that something is wrong with the > titles (and listing them)? This is moved to bug 65070. Marking as closed.
*** Bug 64843 has been marked as a duplicate of this bug. ***