Last modified: 2013-03-16 12:55:43 UTC
The "size" (bytes) of each revision is not available in revisions prior to the introduction of this feature. Please recompute all the missing "size" values on the entire Wiki. I notice this while using the API to extract the size of old revisions from Wikipedia.
Probably we should compute rev_size on demand if it isn't available and update the database?
@Bryan Tong Minh: I think that is a good option. Eventually all articles would be updated, and it would work on all MediaWiki installations. However, it might be easier to run an update script to solve the issue. In the end, it's up to the sysadmins to decide :) Please vote the bug up so that it gets fixed.
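(For illustration only, assuming the standard MediaWiki schema: a quick way to see how many revisions still lack a recorded size is a query along these lines; rev_len is NULL for revisions saved before the feature existed.)

-- Count revisions whose length was never recorded (hypothetical spot check).
SELECT COUNT(*) AS missing_len
FROM revision
WHERE rev_len IS NULL;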
Removed shell keyword, as there's nothing to do on shell yet - a maintenance script would have to be written I guess?
Created attachment 5981 [details]
Proposed patch

Proposed patch. I have no opportunity to test it at the moment.
*** Bug 17421 has been marked as a duplicate of this bug. ***
On-demand size computation could potentially be expensive for interactive use (e.g. pulling up a history view with a few hundred revisions); it also wouldn't handle cases like Special:Longpages / Special:Shortpages generation. Bug 18881 covers a request for a maintenance script to update them in bulk.
A maintenance script to populate rev_len for old revisions was added in r63221. Now someone just needs to run it. Adding back "shell" keyword, removing "patch" and "need-review".
Running right now...
Completed for all wikis but enwiki. 650 million records have been reviewed so far. enwiki has some 350 million more that need to be reviewed, and the process runs much, much slower on enwiki than on any other wiki.
Great! Thank you, JeLuF.
About 600-700 hours to go...
(In reply to comment #11)
> About 600-700 hours to go...

It takes longer than the dump script?!? That's slightly bizarre. And an average of less than 2 revs/sec? Is it spending most of its time waiting for slaves or something?
enwiki ... doing rev_id from 43'814'001 to 43'816'000 (total: 363'892'360) in 14.42 seconds, estimated completion in 641.3 hours

2000 revisions in 14 seconds is the current speed. It will become faster at the end when the revisions already have rev_len set.
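(For reference, the 641-hour figure is simply the remaining revisions divided by the observed rate; a quick arithmetic check using the numbers quoted above:)

SELECT (363892360 - 43816000)   -- revisions still to process
       / (2000 / 14.42)         -- observed rate, revisions per second
       / 3600                   -- seconds per hour
       AS estimated_hours;      -- roughly 641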
(In reply to comment #12)
I haven't checked, but my guess would be that the dump scripts know how to deal efficiently with Wikimedia's peculiar ways of storing article text. populateRevisionLength.php doesn't; it just fetches the text of one revision at a time, which is probably not optimal when multiple revisions are stored in a single compressed blob. I'll let others who know more about the storage backend decide whether optimizing it would be worth the effort, given that it's never going to be run on Wikimedia wikis again.
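(To illustrate the batching described above: the script walks rev_id ranges and then looks up each revision's text individually. A simplified sketch of one batch, not the script's actual code, with made-up rev_id and length values:)

-- Find revisions in the current rev_id window that still need a length.
SELECT rev_id, rev_text_id
FROM revision
WHERE rev_id BETWEEN 43814001 AND 43816000
  AND rev_len IS NULL;

-- For each row, the script loads the text (possibly from a shared compressed
-- blob) and writes the byte length back:
UPDATE revision
SET rev_len = 12345           -- length computed outside SQL; value here is made up
WHERE rev_id = 43814001;      -- example rev_id from the batch above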
The ticket was open for about 30 months. I think it can wait for another month for completion...
The dump scripts read old revisions from previous dump files; that's why they are faster. When we run them without the so-called prefetch option, they are very slow, much slower than the populateRevisionLength script. We are seeing about 38 revs/second.
(In reply to comment #16)
> The dump scripts read old revisions from previous dump files; that's why they
> are faster. When we run them without the so-called prefetch option, they are
> very slow, much slower than the populateRevisionLength script.

Slightly OT, but is that why getting the dump process going again was so painful? Because the longer it went without a complete dump, the more work was needed to generate the next one?
Well, it's a bit worse than that; the revision text content of the previous dumps is suspect, so we are running the dumps without prefetch. This is extremely painful. Once the revision length is populated in the db, we can compare it against the length of the revision in the previous XML file and, if they don't match, refresh from the db. This will be a big improvement over the current approach.
My rough back-of-the-napkin calculations indicate that since rev_len started to be populated around rev_id 124 million, and the script has processed up to about 51 million revs so far in ascending order, at 2000 revs per 14 secs it should catch up in about 6 days. Do bear in mind though that I can't add :-P
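(The napkin math holds up, using the figures quoted above:)

SELECT (124000000 - 51000000)  -- revisions between the current position and where rev_len began
       / (2000 / 14)           -- observed rate, revisions per second
       / 86400                 -- seconds per day
       AS estimated_days;      -- roughly 6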
So we're at around revision number 115 600 000 now, and it's taking about 26 seconds for 2000 revs. This extends my estimate. If we see no further slowdowns we will catch up in two days (i.e. by Thursday June 3 at this time).
Done.
Columns in archive tables are not fully populated currently.
(In reply to comment #22)
> Columns in archive tables are not fully populated currently.

I'm closing this again as "fixed." This bug was about easily retrieving the size of current revisions (from the revision table). It seems reasonable to have a separate script (or an option in populateRevisionLength.php) to calculate the lengths of deleted revisions (in the archive table), but that's a distinct issue and should be filed (if it isn't already) separately.
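(For the record, the corresponding check for deleted revisions would look at ar_len in the archive table; a sketch, again assuming the standard schema:)

-- Deleted revisions whose length was never recorded.
SELECT COUNT(*) AS missing_ar_len
FROM archive
WHERE ar_len IS NULL;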
Per a request on #wikimedia-tech:

mysql:wikiadmin@db1019 [lbwiki]> select rev_id, rev_user, rev_page, rev_deleted, rev_len, rev_timestamp from revision where rev_id = 185751
+--------+----------+----------+-------------+---------+----------------+
| rev_id | rev_user | rev_page | rev_deleted | rev_len | rev_timestamp  |
+--------+----------+----------+-------------+---------+----------------+
| 185751 |      580 |    83446 |           0 |    NULL | 20061203231418 |
+--------+----------+----------+-------------+---------+----------------+
1 row in set (0.03 sec)

I'm just re-running the whole script on lbwiki, won't take long. Let's see the number of rows it reckons it has set/fixed
(In reply to comment #24)
> I'm just re-running the whole script on lbwiki, won't take long. Let's see the
> number of rows it reckons it has set/fixed

...doing rev_id from 1520601 to 1520800
rev_len population complete ... 89 rows changed (0 missing)

89/1520800 = 0.00585% unpopulated. For sanity, I'm going to run it over all wikis with --force to clean up any stragglers that may be lying around.
(In reply to comment #25)
> For sanity, I'm going to run it over all wikis with --force to clean up any
> stragglers that may be lying around.

Does this cover the archive table? My suspicion was that these were archive.ar_len rows that were never populated and the revisions got re-inserted into the revision table at some point (probably through page undeletion). Maybe.
mzmcbride@willow:~$ sql lbwiki_p;

mysql> select * from logging where log_page = 83446\G
*************************** 1. row ***************************
       log_id: 57108
     log_type: delete
   log_action: restore
log_timestamp: 20110331152146
     log_user: 120
log_namespace: 0
  log_deleted: 0
log_user_text: Otets
    log_title: Stéier_(Astrologie)
  log_comment: 5 5 Versioune goufe restauréiert: Läschgrond gëtt eliminéiert wou existent
   log_params:
     log_page: 83446
1 row in set (0.00 sec)

Beep boop.
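(If more such rows exist, a query along these lines should surface revisions with no recorded length on pages that have a restore log entry; illustrative only, and potentially slow without an index on log_page.)

-- Revisions with no length whose page was restored at some point,
-- i.e. candidates for the "undeleted with NULL ar_len" theory.
SELECT r.rev_id, r.rev_page, l.log_timestamp
FROM revision r
JOIN logging l
  ON l.log_page = r.rev_page
 AND l.log_type = 'delete'
 AND l.log_action = 'restore'
WHERE r.rev_len IS NULL;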
(In reply to comment #23) > (In reply to comment #22) >> Columns in archive tables are not fully populated currently. > > I'm closing this again as "fixed." This bug was about easily retrieving the > size of current revisions (from the revision table). It seems reasonable to > have a separate script (or an option in populateRevisionLength.php) to > calculate the lengths of deleted revisions (in the archive table), but > that's a distinct issue and should be filed (if it isn't already) separately. Okay, this is being covered by bug 24538 and bug 46183.