Last modified: 2009-01-02 00:34:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T18841, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 16841 - Data corruption apparently related to recompressTracked.php on wikis with $wgLegacyEncoding set
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Highest critical
Target Milestone: ---
Assigned To: Tim Starling
URL: http://da.wikipedia.org/w/index.php?t...
Keywords: shell
Depends on:
Blocks: 16660
Reported: 2008-12-30 20:18 UTC by Brion Vibber
Modified: 2009-01-02 00:34 UTC (History)
CC List: 4 users

Description Brion Vibber 2008-12-30 20:18:59 UTC
Double UTF-8 conversion is turning up on a large number of edits on Danish wikis which have just been run through recompressTracked.php. Examples shown to me:

A widely-used template:
http://da.wikipedia.org/w/index.php?title=Skabelon:Standardstub&diff=prev&oldid=1936801

Various articles such as:
http://da.wikipedia.org/w/index.php?title=Lake_Torrens&diff=prev&oldid=1790478
http://da.wikipedia.org/w/index.php?title=Lind%C3%A5&diff=prev&oldid=1478894

I've stopped the jobs running on Hume pending Tim's investigation and fix. The job was partway into dewiki at the time.

If only wikis using $wgLegacyEncoding and running the recompressTracked script are affected, then dawiki and dawiktionary need cleanup.
Comment 1 Aaron Schulz 2008-12-30 20:36:35 UTC
Yeah, $wgLegacyEncoding.

I *think* I see the issue. It calls Revision::LoadRevisionText(), which converts to utf-8, then saves that blob in the concatenated diff blob but doesn't go back and mark old_flags with 'utf-8'. MW still thinks it is in legacy encoding then, and double encodes.
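
To illustrate the double conversion described in this comment, here is a minimal Python sketch (not MediaWiki's PHP code). It assumes the legacy encoding is windows-1252; the thread does not state the actual $wgLegacyEncoding value on dawiki, and `legacy_to_utf8` is a hypothetical helper name.

```python
# Assumption: the wiki's legacy encoding is windows-1252 (cp1252).
# The thread does not state the real $wgLegacyEncoding value.
LEGACY_ENCODING = "cp1252"


def legacy_to_utf8(raw: bytes) -> bytes:
    """Convert bytes that are believed to be in the legacy encoding to UTF-8."""
    return raw.decode(LEGACY_ENCODING).encode("utf-8")


# Text as it might have been stored before recompression (legacy bytes).
original = "Lindå".encode(LEGACY_ENCODING)

# First conversion: legacy bytes become correct UTF-8.
converted = legacy_to_utf8(original)

# Second conversion: the UTF-8 bytes are wrongly treated as legacy bytes
# again, so each multi-byte character turns into two characters.
double = legacy_to_utf8(converted)

print(converted.decode("utf-8"))  # Lindå
print(double.decode("utf-8"))     # LindÃ¥
```

The second line of output shows the kind of corruption visible in the linked diffs: a single "å" rendered as "Ã¥".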
Comment 2 Brion Vibber 2008-12-30 21:19:58 UTC
The affected revisions are after the conversion, and their old_flags includes 'utf8':

+---------+---------+------------------+---------------+
| rev_id  | old_id  | old_text         | old_flags     |
+---------+---------+------------------+---------------+
| 1478894 | 1468741 | DB://rc1/43592/0 | external,utf8 |
| 1790478 | 1777869 | DB://rc1/3210/0  | external,utf8 |
| 1936801 | 1923062 | DB://rc1/26644/4 | external,utf8 |
+---------+---------+------------------+---------------+

They're clearly getting run through without the flags at some step, though...

I've locked dawiki and dawiktionary to editing (wgReadOnly in InitialiseSettings) per Wegge's request until we get this sorted out, since any further edits on broken revisions are going to be pretty nasty and won't get automatically fixed by something that rolls back to the original ES entries for the old revs.
Comment 3 Aaron Schulz 2008-12-30 21:25:25 UTC
It needs to be 'utf-8', not 'utf8'
Comment 4 Brion Vibber 2008-12-30 21:28:32 UTC
Urrggglgllleeeeehhhhh :D
Comment 5 Aaron Schulz 2008-12-30 21:30:20 UTC
ahh, line 489 of recompressTracked.php:

$dbw->update( 'text',
	array( // set
		'old_text' => $url,
		'old_flags' => 'external,utf8', // typo: the flag should be 'utf-8'
	),

...it *does* in fact try to set utf-8, it just has a typo :)
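
Why a one-character typo matters can be sketched as follows. This is a simplified Python model, not the actual MediaWiki loader: `load_text` is a hypothetical function, the handling of the 'external', 'gzip' and 'object' flags is left out, and windows-1252 is again only an assumed legacy encoding.

```python
# Assumption: windows-1252 stands in for the wiki's $wgLegacyEncoding.
LEGACY_ENCODING = "cp1252"


def load_text(blob: bytes, old_flags: str) -> str:
    """Decode a stored blob according to its comma-separated old_flags.

    Hypothetical, simplified model: only the encoding flag is considered.
    """
    flags = old_flags.split(",") if old_flags else []
    if "utf-8" in flags:
        # The blob is marked as already being UTF-8.
        return blob.decode("utf-8")
    # No exact 'utf-8' flag: the blob is treated as legacy-encoded.
    return blob.decode(LEGACY_ENCODING)


blob = "Lindå".encode("utf-8")  # what the recompression step stored

print(load_text(blob, "external,utf-8"))  # Lindå
print(load_text(blob, "external,utf8"))   # LindÃ¥
```

Because the check is an exact string match, 'utf8' is just an unknown flag and the blob falls through to the legacy conversion path.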
Comment 6 Brion Vibber 2008-12-30 21:40:35 UTC
Aaron fixed the code typo in r45205.

Should be possible to clean up the entries, then clear all the cache entries. :P

Revision cache, diff cache, parser cache, squid cache.......

dawiki:
+----------+---------------------+
| count(*) | old_flags           |
+----------+---------------------+
|     3785 |                     |
|    10483 | external            |
|     6983 | external,gzip       |
|     2714 | external,object     |
|   461676 | external,utf-8      |
|   336780 | external,utf8       | <- borken
|     1094 | gzip                |
|    29477 | object              |
|    39973 | utf-8,gzip          |
|  1783011 | utf-8,gzip,external |
+----------+---------------------+

dawiktionary:
+----------+---------------------+
| count(*) | old_flags           |
+----------+---------------------+
|     1818 |                     |
|     1620 | external,utf-8      |
|     3744 | external,utf8       | <- borken
|        5 | gzip                |
|     2382 | object              |
|     1631 | utf-8,gzip          |
|    25576 | utf-8,gzip,external |
+----------+---------------------+

As an alternative to the DB cleanup, we could hack the loader to accept 'utf8' as well as 'utf-8'. That would still require cache cleanup...
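
The DB cleanup option can be sketched against an in-memory SQLite table. The thread does not show the statement that was actually run on the production MySQL servers, so the SQL below is an assumption; `fix_flags` and `count_flags` are hypothetical helpers, and the last sample row is made up for illustration (the first three come from Comment 2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT, old_flags TEXT)"
)
conn.executemany(
    "INSERT INTO text (old_id, old_text, old_flags) VALUES (?, ?, ?)",
    [
        (1468741, "DB://rc1/43592/0", "external,utf8"),
        (1777869, "DB://rc1/3210/0", "external,utf8"),
        (1923062, "DB://rc1/26644/4", "external,utf8"),
        (1, "DB://rc1/1/0", "external,utf-8"),  # made-up row, already correct
    ],
)


def fix_flags(db: sqlite3.Connection) -> int:
    """Rewrite the mistyped flag value; return the number of rows changed."""
    cur = db.execute(
        "UPDATE text SET old_flags = 'external,utf-8' "
        "WHERE old_flags = 'external,utf8'"
    )
    db.commit()
    return cur.rowcount


def count_flags(db: sqlite3.Connection) -> dict:
    """Return {old_flags: row count}, like the tables in this comment."""
    return dict(
        db.execute("SELECT old_flags, COUNT(*) FROM text GROUP BY old_flags")
    )


fixed = fix_flags(conn)
print(fixed)
print(count_flags(conn))
```

Matching on the exact value 'external,utf8' keeps the update narrow: per the counts above, that is the only flag combination in which the typo appears.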
Comment 7 Brion Vibber 2008-12-30 22:07:52 UTC
Running cache cleanup...
Comment 8 Brion Vibber 2008-12-30 22:34:47 UTC
OK, all the automated cleanup should be done at this point. However, pages which were edited from the corrupted views need to be fixed up, since they "legitimately" contain the broken chars.
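
For pages that were saved from a corrupted view, one possible repair heuristic is to reverse the extra conversion and keep the result only if it is valid UTF-8. This is a rough Python sketch, not a tool mentioned in the thread: `undo_double_encoding` is a hypothetical helper, and windows-1252 is an assumed legacy encoding.

```python
# Assumption: windows-1252 stands in for the wiki's $wgLegacyEncoding.
LEGACY_ENCODING = "cp1252"


def undo_double_encoding(text: str) -> str:
    """Try to reverse one round of double UTF-8 conversion.

    Returns the input unchanged when the reversal is not possible, which
    is the usual outcome for text that was never corrupted.
    """
    try:
        return text.encode(LEGACY_ENCODING).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(undo_double_encoding("LindÃ¥"))        # Lindå
print(undo_double_encoding("Lindå"))         # Lindå (already fine)
print(undo_double_encoding("Lake Torrens"))  # Lake Torrens
```

The heuristic works on whole strings, so a page that mixes corrupted text with correctly typed new text fails the reversal and is returned as-is; such pages would still need manual review.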
Comment 9 Tim Starling 2008-12-31 00:41:34 UTC
Sorry about that. 
Comment 10 Marco 2009-01-01 10:44:34 UTC
You mentioned that the job was partly into dewiki - what about this issue in de?
Comment 11 Tim Starling 2009-01-02 00:34:40 UTC
de does not have $wgLegacyEncoding.
