Last modified: 2009-01-02 00:34:40 UTC
Double UTF-8 conversion is turning up on a large number of edits on Danish wikis which have just been run through recompressTracked.php.

Examples shown to me:

A widely-used template:
http://da.wikipedia.org/w/index.php?title=Skabelon:Standardstub&diff=prev&oldid=1936801

Various articles such as:
http://da.wikipedia.org/w/index.php?title=Lake_Torrens&diff=prev&oldid=1790478
http://da.wikipedia.org/w/index.php?title=Lind%C3%A5&diff=prev&oldid=1478894

I've stopped the jobs running on Hume pending Tim's investigation and fix. The job was partway into dewiki at the time.

If only wikis using $wgLegacyEncoding and running the recompressTracked script are affected, then dawiki and dawiktionary need cleanup.
Yeah, $wgLegacyEncoding. I *think* I see the issue. It calls Revision::LoadRevisionText(), which converts to utf-8, then saves that blob in the concatenated diff blob but doesn't go back and mark old_flags with 'utf-8'. MW still thinks it is in legacy encoding then, and double encodes.
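To make the failure mode concrete, here is a minimal Python sketch of what double conversion does to a title such as "Lindå". It is an illustration only, not MediaWiki code: the helper name legacy_to_utf8 is hypothetical, and ISO-8859-1 is assumed as the legacy encoding (the thread does not say which value $wgLegacyEncoding has on these wikis).

```python
def legacy_to_utf8(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Treat raw bytes as legacy-encoded text and re-encode them as UTF-8.

    Hypothetical helper; the legacy encoding is an assumption.
    """
    return raw.decode(legacy_encoding).encode("utf-8")


# Text as stored before any conversion (legacy encoding).
original = "Lind\u00e5".encode("iso-8859-1")

# First conversion: correct, the blob is now valid UTF-8.
once = legacy_to_utf8(original)

# Second conversion: happens when the blob is still believed to be
# legacy-encoded, so each UTF-8 byte is re-encoded as its own character.
twice = legacy_to_utf8(once)

print(once.decode("utf-8"))
print(twice.decode("utf-8"))
```

The second print shows the two-character garbage sequence in place of the single "å", which is the pattern visible in the example diffs.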
The affected revisions are after the conversion, and their old_flags includes 'utf8':

+---------+---------+------------------+---------------+
| rev_id  | old_id  | old_text         | old_flags     |
+---------+---------+------------------+---------------+
| 1478894 | 1468741 | DB://rc1/43592/0 | external,utf8 |
| 1790478 | 1777869 | DB://rc1/3210/0  | external,utf8 |
| 1936801 | 1923062 | DB://rc1/26644/4 | external,utf8 |
+---------+---------+------------------+---------------+

They're clearly getting run through without the flags at some step, though...

I've locked dawiki and dawiktionary to editing (wgReadOnly in InitialiseSettings) per Wegge's request until we get this sorted out, since any further edits on broken revisions are going to be pretty nasty and won't get automatically fixed by something that rolls back to the original ES entries for the old revs.
It needs to be 'utf-8', not 'utf8'
Urrggglgllleeeeehhhhh :D
ahh, line 489 of recompressTracked.php:

    $dbw->update( 'text',
        array( // set
            'old_text' => $url,
            'old_flags' => 'external,utf8',
        ),

...it *does* in fact try to set utf-8, it just has a typo :)
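Why a one-character typo matters can be sketched as follows. This is a simplified Python model of a loader that decides whether to apply legacy conversion by looking for an exact 'utf-8' entry in the comma-separated old_flags value. The function name and its parameters are hypothetical, and the real MediaWiki logic may differ in detail.

```python
def needs_legacy_conversion(old_flags: str, legacy_encoding_enabled: bool) -> bool:
    """Return True if the stored text would be treated as legacy-encoded.

    Simplified model (an assumption, not the actual MediaWiki code):
    only an exact 'utf-8' flag marks the blob as already converted.
    """
    flags = old_flags.split(",") if old_flags else []
    return legacy_encoding_enabled and "utf-8" not in flags


# The flag written by the typo is not recognised, so the text gets converted again.
print(needs_legacy_conversion("external,utf8", True))
# The intended flag suppresses the conversion.
print(needs_legacy_conversion("external,utf-8", True))
# Wikis without a legacy encoding are never converted.
print(needs_legacy_conversion("external,utf8", False))
```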
Aaron fixed the code typo in r45205.

Should be possible to clean up the entries, then clear all the cache entries. :P Revision cache, diff cache, parser cache, squid cache.......

dawiki:

+----------+---------------------+
| count(*) | old_flags           |
+----------+---------------------+
|     3785 |                     |
|    10483 | external            |
|     6983 | external,gzip       |
|     2714 | external,object     |
|   461676 | external,utf-8      |
|   336780 | external,utf8       | <- borken
|     1094 | gzip                |
|    29477 | object              |
|    39973 | utf-8,gzip          |
|  1783011 | utf-8,gzip,external |
+----------+---------------------+

dawiktionary:

+----------+---------------------+
| count(*) | old_flags           |
+----------+---------------------+
|     1818 |                     |
|     1620 | external,utf-8      |
|     3744 | external,utf8       | <- borken
|        5 | gzip                |
|     2382 | object              |
|     1631 | utf-8,gzip          |
|    25576 | utf-8,gzip,external |
+----------+---------------------+

Alternatively to the DB cleanup we could hack the loader to accept 'utf8' as well as 'utf-8'. Still requires cache cleanup...
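The DB cleanup amounts to rewriting the misspelled flag on the affected rows. A rough Python sketch of that per-row transformation is below. The helper name fix_flags is hypothetical, and the actual cleanup would presumably be done with an SQL update on the text table rather than in application code.

```python
def fix_flags(old_flags: str) -> str:
    """Rewrite a misspelled 'utf8' entry to 'utf-8', leaving other flags alone.

    Hypothetical helper illustrating the cleanup, not the script that was run.
    """
    return ",".join(
        "utf-8" if flag == "utf8" else flag
        for flag in old_flags.split(",")
    )


for flags in ("external,utf8", "external,utf-8", "utf-8,gzip,external", "gzip", ""):
    print(repr(flags), "->", repr(fix_flags(flags)))
```

Matching on whole comma-separated entries rather than doing a substring replace keeps already-correct rows such as 'utf-8,gzip,external' untouched.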
Running cache cleanup...
Ok, all the automated cleanup should be done at this point. However, pages which were edited from the corrupted views still need to be fixed up by hand, since they "legitimately" contain the broken chars.
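For pages that were saved with the broken characters, one possible repair is to reverse the extra conversion on the page text. The Python sketch below is an assumption-laden illustration, not a tool used in this cleanup: undo_double_encoding is a hypothetical name, ISO-8859-1 is assumed as the legacy encoding, and the function leaves text unchanged whenever the reversal is not possible.

```python
def undo_double_encoding(text: str, legacy_encoding: str = "iso-8859-1") -> str:
    """Try to reverse one round of legacy-to-UTF-8 double conversion.

    Hypothetical helper. If the text cannot be mapped back to legacy bytes,
    or those bytes are not valid UTF-8, the input is returned unchanged.
    """
    try:
        return text.encode(legacy_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "Lind\u00c3\u00a5"            # double-converted form of the title
print(undo_double_encoding(broken))     # repaired
print(undo_double_encoding("Lind\u00e5"))    # already correct, left alone
print(undo_double_encoding("Lake Torrens"))  # plain ASCII, left alone
```

This is only safe with human review. A page that mixes broken and correctly typed characters fails the round trip as a whole and is returned unchanged, so it would have to be repaired piece by piece, and text that legitimately contains a sequence like "Ã¥" would be altered incorrectly.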
Sorry about that.
You mentioned that the job was partly into dewiki - what about this issue in de?
de does not have $wgLegacyEncoding.