Last modified: 2011-08-29 16:41:09 UTC
Adding a character-count delta to each revision is needed for edit analytics, in both the stub and full article dumps. Rob suggested that using PHP's UTF-8 support (e.g. just calling mb_strlen($buffer, 'UTF-8')) to quickly dispose of the multi-byte problem would give us a fairly accurate character count. Counting characters will allow us to compare across different languages. If there are serious performance concerns then we can fall back to byte count.
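For reference, a minimal sketch of how the two counts differ on multi-byte text. The helper name revisionSize() and the sample string are made up for illustration; strlen() and mb_strlen() are the standard PHP functions, and mb_strlen() needs the mbstring extension.

```php
<?php
/**
 * Hypothetical helper (not part of MediaWiki): report both sizes of a
 * revision's text so the two candidate metrics can be compared.
 */
function revisionSize(string $buffer): array
{
    return [
        'bytes' => strlen($buffer),             // raw byte length
        'chars' => mb_strlen($buffer, 'UTF-8'), // UTF-8 code points
    ];
}

// Sample text "Zürich 東京", written with escapes so the file encoding
// cannot change the result.
$sample = "Z\u{00FC}rich \u{6771}\u{4EAC}";
$size = revisionSize($sample);

printf("bytes=%d chars=%d\n", $size['bytes'], $size['chars']);
// prints "bytes=14 chars=9"
```

The gap between the two numbers grows with the share of non-ASCII text, which is why byte counts are harder to compare across languages.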
Byte count will be way easier, and might happen sooner than character count, since we already have revision length in the database. Ariel asks that we update the version number of the dumps if that happens, so users of the dumps can correlate contents to versions. The code to modify is here: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Export.php?view=markup To update the version, we need to update schemaVersion(). In order for this to get into production, it of course needs to get deployed to the production branch. Ariel doesn't have time to implement this right now, so an interested volunteer would be appreciated.
Committed r79856 into trunk. I did bytes because characters was a little more involved. I added byte counts to both stub and full dumps. I thought about not including the byte count in the full dump because it's pretty trivial to get that count from most XML parsers. However, it is nice to have the byte count that doesn't include any XML escaping introduced by the dump, so I left it in. I'll document how I'd go about characters, just in case anyone wants to tackle it. The JOIN of the "text" table in WikiExporter::dumpFrom would have to be performed even in the case of a stub dump. WikiExporter()->text would need to be passed as a new parameter into XMLDumpWriter::writeRevision(). The stub logic in XMLDumpWriter::writeRevision() would need to be changed to use the new parameter to see if we're dealing with a stub dump, rather than inferring it from the absence of text. Finally, mb_strlen($foo, 'UTF-8') could be called. It's not a ton of code (probably 10-15 lines of code change, tops) but that's less likely to get fast-tracked to production.
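To illustrate the point about XML escaping, here is a small sketch. It assumes htmlspecialchars() with ENT_XML1 as a stand-in for whatever escaping the dump writer applies; the helper name dumpLengths() and the sample text are hypothetical.

```php
<?php
/**
 * Hypothetical helper: compare the stored byte length of a revision's
 * text with the length of the same text once it is XML-escaped.
 */
function dumpLengths(string $text): array
{
    // Assumption: this escaping approximates what ends up in the dump.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

    return [
        'raw'     => strlen($text),
        'escaped' => strlen($escaped),
    ];
}

$lengths = dumpLengths('a < b && "c"');
printf("raw=%d escaped=%d\n", $lengths['raw'], $lengths['escaped']);
// prints "raw=12 escaped=33"
```

A consumer who measures the text as it appears in the XML stream sees the larger number, whereas a byte count emitted alongside the revision reflects the unescaped text.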
(In reply to comment #2)
> I'll document how I'd go about characters, just in case anyone wants to tackle
> it. The JOIN of the "text" table in WikiExporter::dumpFrom would have to be
> performed even in the case of a stub dump. WikiExporter()->text would need to
> be passed as a new parameter into XMLDumpWriter::writeRevision(). The stub
> logic in XMLDumpWriter::writeRevision() would need to be changed to use the new
> parameter to see if we're dealing with a stub dump, rather than inferring it
> from the absence of text. Finally, mb_strlen($foo, 'UTF-8') could be called.
> It's not a ton of code (probably 10-15 lines of code change, tops) but that's
> less likely to get fast-tracked to production.

Wouldn't this cause stub dumps to load the text of each revision, significantly slowing down their generation?
Exactly. What we want to do is follow the same procedure we did for bytes: add a field in the revision table, automatically populate it for new revs, run a job to populate for old revs.
Even more reason to punt on character count. :) If we ever add character count to the database, we really ought to address bug 21860 (checksum per rev) while we're at it.
This is fixed in r79856, and will be deployed as part of 1.17.
Maybe we should include the delta byte count or cumulative number of bytes in the database to enable feature requests such as:
* Show size of current text in edit form (https://bugzilla.wikimedia.org/show_bug.cgi?id=3890)
* Sorting language pane by article size (https://bugzilla.wikimedia.org/show_bug.cgi?id=6559)
* Page character counts: denote simple vs. complex changes (https://bugzilla.wikimedia.org/show_bug.cgi?id=8571)
* Special page for statistics about specific articles (https://bugzilla.wikimedia.org/show_bug.cgi?id=547)
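As a rough sketch of what a delta byte count amounts to, the per-revision deltas can be derived from the revision lengths alone. The function name byteDeltas() and the sample lengths are hypothetical, and treating the first revision as a change from 0 bytes is an assumption.

```php
<?php
/**
 * Hypothetical helper: turn a page's revision byte lengths (oldest first)
 * into per-revision deltas. The first revision is compared against 0.
 */
function byteDeltas(array $lengths): array
{
    $deltas = [];
    $previous = 0;
    foreach ($lengths as $length) {
        $deltas[] = $length - $previous;
        $previous = $length;
    }
    return $deltas;
}

// Made-up lengths for three successive revisions of one page.
echo implode(', ', byteDeltas([120, 150, 90])), "\n";
// prints "120, 30, -60"
```

Summing the deltas gives back the current length, so storing either the delta or the cumulative size lets the other be recomputed.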
The updated schema never got published on mediawiki.org: bug 22750. This will break anything trying to automatically run XSD validation, since the schema file can't be fetched.
that's bug 29819 rather. bah!
I think we can close this bug, or not?
I'm closing it now.