Last modified: 2011-08-29 16:41:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T28563, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 26563 - Add characters changed per revision for stub and full article dumps
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Rob Lanphier
Keywords: analytics
Depends on: 22750
Blocks: 29819
Reported: 2011-01-04 11:58 UTC by Diederik van Liere
Modified: 2011-08-29 16:41 UTC (History)
6 users (show)

Description Diederik van Liere 2011-01-04 11:58:18 UTC
Edit analytics needs a per-revision character delta (the number of characters changed) in both the stub and full article dumps.
Rob suggested using PHP's UTF-8 support (e.g. just calling mb_strlen($buffer, 'UTF-8')) to quickly dispose of the multi-byte problem, which would give us a fairly accurate character count. Counting characters will allow us to compare across different languages.

If there are serious performance concerns then we can fall back to byte count.
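
To illustrate why the two counts diverge across languages, here is a minimal sketch in Python rather than PHP. It assumes that len() on a Python string (a count of Unicode code points) is a fair stand-in for mb_strlen($buffer, 'UTF-8'), and that the length of the UTF-8 encoding stands in for a byte count. The sample words and helper names are made up for illustration.

```python
# Sketch: character count vs. UTF-8 byte count for the same word in
# different scripts. The sample words are illustrative only.
samples = {
    "English": "encyclopedia",
    "Russian": "энциклопедия",
    "Japanese": "百科事典",
}


def char_count(text: str) -> int:
    """Number of Unicode code points, analogous to mb_strlen(..., 'UTF-8')."""
    return len(text)


def byte_count(text: str) -> int:
    """Number of bytes once the text is encoded as UTF-8."""
    return len(text.encode("utf-8"))


for language, word in samples.items():
    print(f"{language}: {char_count(word)} characters, {byte_count(word)} bytes")
```

The English and Russian words have the same number of characters, but the Russian one takes twice as many bytes, which is the kind of skew a byte-based delta would introduce when comparing wikis.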
Comment 1 Rob Lanphier 2011-01-07 20:00:24 UTC
Byte count will be way easier, and might happen sooner than character count, since we already have revision length in the database. Ariel asks that we update the version number of the dumps if that happens, so users of the dumps can correlate contents to versions.

The code to modify is here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Export.php?view=markup

To update the version, we need to update schemaVersion().

In order for this to get into production, it of course needs to get deployed to the production branch.

Ariel doesn't have time to implement this right now, so an interested volunteer would be appreciated.
Comment 2 Rob Lanphier 2011-01-08 03:31:37 UTC
Committed r79856 into trunk.  I did bytes because characters was a little more involved.  I added byte counts to both stub and full dumps.  

I thought about not including the byte count in the full dump because it's pretty trivial to get that count from most XML parsers.  However, it is nice to have the byte count that doesn't include any XML escaping introduced by the dump, so I left it in.
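
As a rough illustration of that point, the sketch below (in Python, not the PHP used by Export.php) compares the byte length of a piece of wikitext before and after XML escaping. The sample text and function names are hypothetical, and the escaping done by xml.sax.saxutils.escape is only an assumption about what a dump writer might do; the real dump may escape a different set of characters.

```python
from xml.sax.saxutils import escape


def raw_byte_count(text: str) -> int:
    """Bytes of the revision text as stored, before any XML escaping."""
    return len(text.encode("utf-8"))


def escaped_byte_count(text: str) -> int:
    """Bytes a consumer would see if it measured the escaped XML payload."""
    return len(escape(text).encode("utf-8"))


# Hypothetical revision text containing characters that XML must escape.
wikitext = "<ref>Smith & Jones</ref>"

print("raw:", raw_byte_count(wikitext))
print("escaped:", escaped_byte_count(wikitext))
```

A consumer that measured the escaped payload would overcount, which is why a length recorded from the unescaped text is convenient to have in the dump.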

I'll document how I'd go about characters, just in case anyone wants to tackle it.  The JOIN of the "text" table in WikiExporter::dumpFrom would have to be performed even in the case of a stub dump.  WikiExporter()->text would need to be passed as a new parameter into XMLDumpWriter::writeRevision().  The stub logic in XMLDumpWriter::writeRevision() would need to be changed to use the new parameter to see if we're dealing with a stub dump, rather than inferring it from the absence of text.  Finally, mb_strlen($foo, 'UTF-8') could be called.  It's not a ton of code (probably 10-15 lines of code change, tops) but that's less likely to get fast-tracked to production.
Comment 3 Roan Kattouw 2011-01-08 11:57:37 UTC
(In reply to comment #2)
> I'll document how I'd go about characters, just in case anyone wants to tackle
> it.  The JOIN of the "text" table in WikiExporter::dumpFrom would have to be
> performed even in the case of a stub dump.  WikiExporter()->text would need to
> be passed as a new parameter into XMLDumpWriter::writeRevision().  The stub
> logic in XMLDumpWriter::writeRevision() would need to be changed to use the new
> parameter to see if we're dealing with a stub dump, rather than inferring it
> from the absence of text.  Finally, mb_strlen($foo, 'UTF-8') could be called. 
> It's not a ton of code (probably 10-15 lines of code change, tops) but that's
> less likely to get fast-tracked to production.
Wouldn't this cause stub dumps to load the text of each revision, significantly slowing down their generation?
Comment 4 Ariel T. Glenn 2011-01-08 12:11:59 UTC
Exactly. What we want to do is follow the same procedure we did for bytes: add a field in the revision table, automatically populate it for new revs, run a job to populate for old revs.
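
The procedure described here could look roughly like the following sketch, written in Python against an in-memory SQLite database. Everything in it is an assumption for illustration: the simplified revision table with inline text, the rev_char_len column name, and the batch size. It is not MediaWiki's actual schema or maintenance code.

```python
import sqlite3


def create_legacy_table(conn: sqlite3.Connection) -> None:
    # Simplified, hypothetical stand-in for the revision table.
    conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text TEXT)")


def add_char_len_column(conn: sqlite3.Connection) -> None:
    # Step 1: add the new field; existing rows get NULL.
    conn.execute("ALTER TABLE revision ADD COLUMN rev_char_len INTEGER")


def save_revision(conn: sqlite3.Connection, rev_id: int, text: str) -> None:
    # Step 2: new revisions record their character count at save time.
    conn.execute(
        "INSERT INTO revision (rev_id, rev_text, rev_char_len) VALUES (?, ?, ?)",
        (rev_id, text, len(text)),
    )
    conn.commit()


def backfill_char_len(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    # Step 3: a job fills in old revisions in small batches.
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT rev_id, rev_text FROM revision "
            "WHERE rev_char_len IS NULL ORDER BY rev_id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return updated
        conn.executemany(
            "UPDATE revision SET rev_char_len = ? WHERE rev_id = ?",
            [(len(text), rev_id) for rev_id, text in rows],
        )
        conn.commit()
        updated += len(rows)


conn = sqlite3.connect(":memory:")
create_legacy_table(conn)
conn.executemany(
    "INSERT INTO revision (rev_id, rev_text) VALUES (?, ?)",
    [(1, "abc"), (2, "энциклопедия"), (3, "")],
)
add_char_len_column(conn)
save_revision(conn, 4, "百科事典")
print("backfilled rows:", backfill_char_len(conn))
```

With the count stored per revision, a stub dump can read it from the row without loading the text.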
Comment 5 Rob Lanphier 2011-01-09 03:03:19 UTC
Even more reason to punt on character count.  :)  If we ever add character count to the database, we really ought to address bug 21860 (checksum per rev) while we're at it.
Comment 6 Rob Lanphier 2011-01-27 20:09:20 UTC
This is fixed in r79856, and will be deployed as part of 1.17.
Comment 7 Diederik van Liere 2011-02-06 08:10:44 UTC
Maybe we should include the delta byte count or cumulative number of bytes in the database to enable feature requests such as: 
* Show size of current text in edit form (https://bugzilla.wikimedia.org/show_bug.cgi?id=3890)
* Sorting language pane by article size (https://bugzilla.wikimedia.org/show_bug.cgi?id=6559)
* Page character counts: denote simple vs. complex changes (https://bugzilla.wikimedia.org/show_bug.cgi?id=8571)
* Special page for statistics about specific articles (https://bugzilla.wikimedia.org/show_bug.cgi?id=547)
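
For reference, a delta byte count can be derived from the per-revision byte lengths alone. The sketch below is in Python with a made-up list of lengths; it assumes the revisions of one page are given in chronological order and that the first revision's delta is its full length.

```python
from typing import List


def byte_deltas(lengths: List[int]) -> List[int]:
    """Change in byte length between consecutive revisions of one page.

    Assumes `lengths` is ordered oldest to newest; the first revision is
    compared against an empty page.
    """
    deltas = []
    previous = 0
    for length in lengths:
        deltas.append(length - previous)
        previous = length
    return deltas


# Hypothetical byte lengths of three successive revisions.
print(byte_deltas([120, 150, 90]))
```

A negative delta marks a revision that shrank the page. Storing the delta in the database, as suggested above, would avoid having to look up the previous revision each time.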
Comment 8 Brion Vibber 2011-07-11 23:10:48 UTC
The updated schema never got published on mediawiki.org: bug 22750

This will break anything trying to automatically run XSD validation due to being unable to fetch the schema file.
Comment 9 Brion Vibber 2011-07-11 23:15:47 UTC
that's bug 29819  rather. bah!
Comment 10 Diederik van Liere 2011-08-12 21:16:57 UTC
I think we can close this bug, or not?
Comment 11 Ariel T. Glenn 2011-08-29 16:41:09 UTC
I'm closing it now.
