Last modified: 2013-03-16 12:55:43 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links may be broken. See T14188, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 12188 - Run maintenance/populateRevisionLength.php on all WMF wikis
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Importance: High normal, with 6 votes
Target Milestone: ---
Assigned To: JeLuF
Keywords: shell
Duplicates: 17421
Depends on: 18881, 24538
Blocks: 16660, 18998, 46183
Reported: 2007-12-03 15:15 UTC by Sérgio Nunes
Modified: 2013-03-16 12:55 UTC
CC: 17 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments
Proposed patch (971 bytes, patch)
2009-03-30 22:11 UTC, Bryan Tong Minh

Description Sérgio Nunes 2007-12-03 15:15:47 UTC
The "size" (bytes) of each revision is not available in revisions prior to the introduction of this feature.
Please recompute all the missing "size" values on the entire Wiki. I notice this while using the API to extract the size of old revisions from Wikipedia.
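
For context, the gap is visible directly in the database: rev_len is simply NULL for affected rows. A minimal query along these lines (a sketch; the LIMIT just keeps the scan cheap) would surface them:

-- Find revisions whose length was never recorded.
SELECT rev_id, rev_timestamp
FROM revision
WHERE rev_len IS NULL
LIMIT 10;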
Comment 1 Bryan Tong Minh 2008-05-12 16:33:50 UTC
Should we perhaps compute rev_size on demand when it isn't available, and update the database accordingly?
Comment 2 Sérgio Nunes 2008-05-13 09:27:40 UTC
@Bryan Tong Minh:

I think that is a good option. Eventually all articles would be updated, and it would work in all MediaWiki installations.
However, it might be easier to run an update script to solve the issue.
In the end, it's up to the sysadmins to decide :)

Please vote this bug up so that it gets fixed.
Comment 3 Mike.lifeguard 2009-03-19 18:05:19 UTC
Removed the shell keyword, as there's nothing to do on shell yet - a maintenance script would have to be written first, I guess?
Comment 4 Bryan Tong Minh 2009-03-30 22:11:16 UTC
Created attachment 5981 [details]
Proposed patch

Proposed patch. I have no opportunity to test it at the moment.
Comment 5 Brion Vibber 2009-03-30 23:02:17 UTC
*** Bug 17421 has been marked as a duplicate of this bug. ***
Comment 6 Brion Vibber 2009-05-26 21:48:15 UTC
On-demand size expansion could potentially be expensive for interactive use (e.g. pulling up a history view with a few hundred revisions); it also wouldn't handle cases like Special:Longpages / Special:Shortpages generation.

Bug 18881 covers a request for a maintenance script to update them in bulk.
Comment 7 Ilmari Karonen 2010-03-03 20:28:21 UTC
A maintenance script to populate rev_len for old revisions was added in r63221.  Now someone just needs to run it.  Adding back "shell" keyword, removing "patch" and "need-review".
Comment 8 JeLuF 2010-05-10 19:29:06 UTC
Running right now...
Comment 9 JeLuF 2010-05-16 08:44:15 UTC
Completed for all wikis but enwiki.

650 million records reviewed so far.

enwiki has some 350 million more that need to be reviewed, and the process runs much more slowly on enwiki than on any other wiki.
Comment 10 Nemo 2010-05-16 09:39:55 UTC
Great! Thank you, JeLuF.
Comment 11 JeLuF 2010-05-26 06:36:39 UTC
About 600-700 hours to go...
Comment 12 Happy-melon 2010-05-26 09:19:27 UTC
(In reply to comment #11)
> About 600-700 hours to go...

It takes longer than the dump script?!?  That's slightly bizarre.  And an average of less than 2 revs/sec?  Is it spending most of its time waiting for slaves or something?
Comment 13 JeLuF 2010-05-26 09:35:03 UTC
enwiki ... doing rev_id from 43'814'001 to 43'816'000 (total: 363'892'360)
            in 14.42 seconds, estimated completion in 641.3 hours

2000 revisions in 14 seconds is the current speed. It will get faster toward the end, once it reaches revisions that already have rev_len set.
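
As a rough cross-check, the figures from that progress line can be plugged into a throwaway query (a sketch, assuming the rate of 2000 revisions per 14.42 seconds stays steady):

-- Remaining revisions divided by the observed rate, in hours.
SELECT (363892360 - 43816000)   -- revisions still to process
       / (2000 / 14.42)         -- observed revisions per second
       / 3600                   -- seconds per hour
       AS estimated_hours;      -- roughly 641, matching the reported 641.3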
Comment 14 Ilmari Karonen 2010-05-26 09:46:57 UTC
(In reply to comment #12)
I haven't checked, but my guess would be that the dump scripts know how to deal efficiently with Wikimedia's peculiar ways of storing article text. populateRevisionLength.php doesn't: it just fetches the text of one revision at a time, which is probably not optimal when multiple revisions are stored in a single compressed blob. I'll let others who know more about the storage backend decide whether optimizing it somehow would be worth the effort, given that it's never going to be run on Wikimedia wikis again.
Comment 15 JeLuF 2010-05-26 14:47:22 UTC
The ticket was open for about 30 months. I think it can wait for another month for completion...
Comment 16 Ariel T. Glenn 2010-05-26 22:55:14 UTC
The dump scripts read old revisions from previous dump files; that's why they are faster.  When we run them without the so-called prefetch option, they are very slow, much slower than the populateRevisionLength script.  We are seeing about 38 revs/second.
Comment 17 Happy-melon 2010-05-26 23:08:51 UTC
(In reply to comment #16)
> The dump scripts read old revisions from previous dump files; that's why they
> are faster.  When we run them without the so-called prefetch option, they are
> very slow, much slower than the populateRevisionLength script.  

Slightly OT, but is that why getting the dump process going again was so painful?  Because the longer it went without a complete dump, the more work was needed to generate the next one?
Comment 18 Ariel T. Glenn 2010-05-26 23:16:03 UTC
Well, it's a bit worse than that: the revision text content of the previous dumps is suspect, so we are running the dumps without prefetch, which is extremely painful. Once the revision length is populated in the db, we can compare it against the length of the revision in the previous XML file and, if they don't match, refresh from the db. This will be a big improvement over the current approach.
Comment 19 Ariel T. Glenn 2010-05-27 00:01:14 UTC
My rough back-of-the-napkin calculations indicate that, since rev_len started to be populated around rev_id 124 million and the script has processed up to about 51 million revs so far in ascending order, at 2000 revs per 14 secs it should catch up in about 6 days. Do bear in mind, though, that I can't add :-P
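
Spelled out, under the same assumptions (the 2000 revs per 14 secs rate from comment 13):

-- Revisions left before reaching the already-populated range,
-- at the observed rate, converted to days.
SELECT (124000000 - 51000000)   -- revs between current position and rev_id ~124M
       * 14 / 2000              -- seconds per revision
       / 86400                  -- seconds per day
       AS estimated_days;       -- roughly 5.9 days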
Comment 20 Ariel T. Glenn 2010-06-01 18:51:11 UTC
So we're at around revision number 115 600 000 now, and it's taking about 26 seconds for 2000 revs. This extends my estimate.  If we see no further slowdowns we will catch up in two days (i.e. by Thursday June 3 at this time).
Comment 21 JeLuF 2010-06-03 20:52:12 UTC
Done.
Comment 22 Liangent 2010-07-25 06:22:46 UTC
Columns in archive tables are not fully populated currently.
Comment 23 MZMcBride 2010-09-30 06:24:06 UTC
(In reply to comment #22)
> Columns in archive tables are not fully populated currently.

I'm closing this again as "fixed." This bug was about easily retrieving the size of current revisions (from the revision table). It seems reasonable to have a separate script (or an option in populateRevisionLength.php) to calculate the lengths of deleted revisions (in the archive table), but that's a distinct issue and should be filed (if it isn't already) separately.
Comment 24 Sam Reed (reedy) 2013-03-16 01:12:20 UTC
Per a request on #wikimedia-tech

mysql:wikiadmin@db1019 [lbwiki]> select rev_id, rev_user, rev_page, rev_deleted, rev_len, rev_timestamp from revision where rev_id = 185751;
+--------+----------+----------+-------------+---------+----------------+
| rev_id | rev_user | rev_page | rev_deleted | rev_len | rev_timestamp  |
+--------+----------+----------+-------------+---------+----------------+
| 185751 |      580 |    83446 |           0 |    NULL | 20061203231418 |
+--------+----------+----------+-------------+---------+----------------+
1 row in set (0.03 sec)


I'm just re-running the whole script on lbwiki; it won't take long. Let's see how many rows it reckons it has set/fixed.
Comment 25 Sam Reed (reedy) 2013-03-16 01:17:54 UTC
(In reply to comment #24)
> I'm just re-running the whole script on lbwiki, won't take long. Let's see
> the
> number of rows it reckons it has set/fixed


...doing rev_id from 1520601 to 1520800
rev_len population complete ... 89 rows changed (0 missing)

89/1520800 = 0.00585% unpopulated.

For sanity, I'm going to run it over all wikis with --force to clean up any stragglers that may be lying around.
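
A natural verification once the --force run finishes (a hypothetical check, not output from this thread):

-- After populateRevisionLength.php completes, no row should be left unset.
SELECT COUNT(*) AS unpopulated
FROM revision
WHERE rev_len IS NULL;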
Comment 26 MZMcBride 2013-03-16 01:20:34 UTC
(In reply to comment #25)
> For sanity, I'm going to run it over all wikis with --force to clean up any
> stragglers that may be laying around

Does this cover the archive table?

My suspicion was that these were archive.ar_len rows that were never populated, and that the revisions got re-inserted into the revision table at some point (probably through page undeletion). Maybe.
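
That suspicion could be probed with a query along these lines (a sketch; ar_len is the archive-table counterpart of rev_len):

-- Count deleted revisions whose length was never recorded.
SELECT COUNT(*) AS unpopulated_archived
FROM archive
WHERE ar_len IS NULL;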
Comment 27 MZMcBride 2013-03-16 01:24:34 UTC
mzmcbride@willow:~$ sql lbwiki_p;
mysql> select * from logging where log_page = 83446\G
*************************** 1. row ***************************
       log_id: 57108
     log_type: delete
   log_action: restore
log_timestamp: 20110331152146
     log_user: 120
log_namespace: 0
  log_deleted: 0
log_user_text: Otets
    log_title: Stéier_(Astrologie)
  log_comment: 5 5 Versioune goufe restauréiert: Läschgrond gëtt eliminéiert wou existent
   log_params: 
     log_page: 83446
1 row in set (0.00 sec)

Beep boop.
Comment 28 MZMcBride 2013-03-16 01:35:23 UTC
(In reply to comment #23)
> (In reply to comment #22)
>> Columns in archive tables are not fully populated currently.
> 
> I'm closing this again as "fixed." This bug was about easily retrieving the
> size of current revisions (from the revision table). It seems reasonable to
> have a separate script (or an option in populateRevisionLength.php) to
> calculate the lengths of deleted revisions (in the archive table), but
> that's a distinct issue and should be filed (if it isn't already) separately.

Okay, this is being covered by bug 24538 and bug 46183.
