Last modified: 2013-01-31 16:26:23 UTC

This static website is read-only and kept for historical purposes. Logging in is not possible, and apart from bug reports and their history, links might be broken. See T35409, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 33409 - Transient CDB read/write failures
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: 1.18.x
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody
Keywords: platformeng
Duplicates: 32170 32404
Depends on:
Blocks: 31576

Reported: 2011-12-29 02:44 UTC by Tim Starling
Modified: 2013-01-31 16:26 UTC (History)
10 users (show)

Description Tim Starling 2011-12-29 02:44:17 UTC
On the main Wikimedia app cluster we log occasional read and write failures in CDB, causing bug 31576 and other issues.

Analysis of the CDB code and limited testing indicate that a disk-full condition may cause a truncated CDB file to be moved into the destination. Most of our app servers have /tmp mounted on a tiny (2 GB) partition, which is commonly 60-70% full. A spike in disk space usage could cause a transient disk-full condition.

More defensive handling of CDB writes might fix this issue. The return values of the underlying write() and close() syscalls should be checked, and if an error occurs, the temporary file should be deleted without the rename being done, and an exception thrown.
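
The defensive pattern described above can be sketched as follows. This is a Python analogue, not MediaWiki code: the function name write_file_atomically is hypothetical, and creating the temporary file in the destination directory (so that the final rename stays on one filesystem) is an assumption of the sketch rather than something stated in this report.

```python
import os
import tempfile


def write_file_atomically(dest_path, data):
    """Write data via a temporary file; never install a partial file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Assumption: the temp file lives next to the destination.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        try:
            view = memoryview(data)
            while view:
                # os.write raises OSError (e.g. ENOSPC) on failure and
                # may write fewer bytes than requested, so loop.
                written = os.write(fd, view)
                view = view[written:]
            os.fsync(fd)  # push data to disk; raises OSError on failure
        finally:
            os.close(fd)  # a failing close() also raises OSError
        # Only reached if every write, the sync and the close succeeded.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Any failure: delete the temp file, skip the rename, re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this shape, a failed write leaves the previous destination file untouched and the caller sees an exception instead of a silently truncated file.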

There are problems in many places:

* PHP's dba_close() does not check whether the write of the hashtable or the close() succeeded. This is not easy to fix, since dba_close() just calls the resource destructor, and the resource destructor interface (rsrc_dtor_func_t) does not provide a return value.
* Using CdbWriter_PHP would at least allow us to check for failures during the hashtable write. However, fclose() suffers from the same problem as dba_close(): it calls the resource destructor and so cannot provide a meaningful error return. It doesn't even check for failures during the final flush of the internal write buffer. See https://bugs.php.net/bug.php?id=60110
* CdbWriter_DBA::set() just passes through the return value of dba_insert(); it does not throw an exception. Thus most callers will continue despite a disk-full error.
* CdbWriter_PHP::set() throws an exception on a write error; however, this leads to __destruct() being called, and thus to a truncated CDB file being moved into the destination.

Also, the fact that CdbWriter objects are finalised and closed on __destruct() may lead to an incomplete file being moved into place if some unrelated exception is thrown in the caller during rebuild. Fixing this will mean requiring that callers call CdbWriter::close() before letting the object go out of scope, breaking backwards compatibility for callers like dumpInterwiki.php and extensions/Babel/txt2cdb.php.
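
A minimal sketch of the "explicit close only" design, again as a Python analogue. The class ExplicitCloseWriter, its discard() method, the ".tmp" suffix and the key=value record format are all hypothetical stand-ins, not the CdbWriter API. The point is that the destructor discards the temporary file instead of finalising it.

```python
import os


class ExplicitCloseWriter:
    """Hypothetical writer: only an explicit close() installs the file."""

    def __init__(self, dest_path):
        self._file = None
        self.dest_path = dest_path
        self.tmp_path = dest_path + ".tmp"  # hypothetical naming scheme
        self._file = open(self.tmp_path, "wb")

    def set(self, key, value):
        # Stand-in record format; a failing write raises OSError.
        self._file.write(key + b"=" + value + b"\n")

    def close(self):
        """Flush, sync and close, then move the file into place."""
        if self._file is None:
            raise ValueError("writer already closed or discarded")
        try:
            self._file.flush()
            os.fsync(self._file.fileno())
            self._file.close()
            os.replace(self.tmp_path, self.dest_path)
        except BaseException:
            self.discard()
            raise
        self._file = None

    def discard(self):
        """Abandon the build: delete the temp file, never rename."""
        if self._file is not None:
            f, self._file = self._file, None
            try:
                f.close()
            except OSError:
                pass
            if os.path.exists(self.tmp_path):
                os.unlink(self.tmp_path)

    def __del__(self):
        # Going out of scope without close() means the build did not
        # finish, so the partial file must not reach the destination.
        self.discard()
```

A caller that raises an unrelated exception between set() and close() therefore leaves the old destination file in place. The cost is the one noted above: existing callers that rely on the destructor to finalise the file would stop producing output until they call close().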

My recommendations are:

* Add a configuration variable to configure which CdbWriter subclass is used, and make it CdbWriter_PHP by default. It looks like it makes sense to use the C reader but the PHP writer.
* Fix the exception and error handling in both the DBA and PHP versions of CdbWriter.
* Encourage the PHP devs to do something about resource destructor error propagation.
* Increase the size of the /tmp partition on the app servers.
Comment 1 Tim Starling 2011-12-29 04:22:05 UTC
A couple more thoughts: 

* Calling fflush() before fclose() would be useful, since fflush() calls write() and returns false on error. Then fclose() will only call close(), which is somewhat less likely to fail on disk-full.

* Maybe a cache stampede after l10n cache invalidation causes /tmp to fill up. Switching to manualRecache mode would fix this.
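
The first point, flushing explicitly so that the write error becomes visible, can be illustrated with Python's buffered I/O. This is an analogue only and says nothing about PHP's internals: FullDisk is a hypothetical raw stream that simulates a full disk by failing every write with ENOSPC.

```python
import errno
import io


class FullDisk(io.RawIOBase):
    """Hypothetical raw stream standing in for a file on a full disk."""

    def writable(self):
        return True

    def write(self, data):
        raise OSError(errno.ENOSPC, "No space left on device")


def write_then_flush(data):
    """Return (bytes accepted by write, errno raised by flush or None)."""
    stream = io.BufferedWriter(FullDisk(), buffer_size=8192)
    # A small write only fills the in-memory buffer, so it "succeeds".
    accepted = stream.write(data)
    flush_errno = None
    try:
        # The underlying write happens here, and so does the error.
        stream.flush()
    except OSError as exc:
        flush_errno = exc.errno
    try:
        stream.close()  # close retries the flush; ignore the repeat
    except OSError:
        pass
    return accepted, flush_errno
```

In this sketch the buffered write reports all bytes as accepted and the disk-full error only appears at flush time, which is why checking the result of an explicit flush before closing catches failures that a bare close would hide.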
Comment 2 Tim Starling 2011-12-29 05:15:17 UTC
Filed PHP bug about dba_close(): https://bugs.php.net/bug.php?id=60621

Probably it will be closed "bogus", but at least it's a starting point.
Comment 3 Aaron Schulz 2011-12-29 17:54:50 UTC
We will want to enlarge the /tmp partitions anyway, to hold the temporary files used for file concatenation in the future. Using TempFSFile may also ensure that more temporary files get deleted promptly.
Comment 4 Tim Starling 2012-01-03 21:58:53 UTC
*** Bug 32404 has been marked as a duplicate of this bug. ***
Comment 5 Tim Starling 2012-01-03 22:01:33 UTC
*** Bug 32170 has been marked as a duplicate of this bug. ***
Comment 6 Gabriel Wicke 2012-05-03 18:35:17 UTC
Repartitioning is tracked in bug 36488.
Comment 7 Malafaya 2013-01-31 16:26:23 UTC
Regarding bug 32404 (marked as a duplicate of this one):

New occurrences of links to pages in non-zero namespaces being recorded with namespace zero, found in the ptwiktionary dump of 24 Jan 2013:

SELECT CONCAT('# [[', pl_title, ']] ([[Especial:Whatlinkshere/',
              pl_title, '|', COUNT(*), ' lig.]])') AS linha
FROM pagelinks
LEFT JOIN page AS pg1
    ON pl_namespace = pg1.page_namespace
    AND pl_title = pg1.page_title
LEFT JOIN page AS pg2
    ON pl_from = pg2.page_id
WHERE pg1.page_namespace IS NULL
    AND pl_namespace = 0
    AND pg2.page_namespace <> 8
GROUP BY pl_namespace, pl_title
ORDER BY COUNT(*) DESC
LIMIT 0, 3;

(note that pl_namespace = 0)

yields:

'# [[Usuário:ValJor]] ([[Especial:Whatlinkshere/Usuário:ValJor|4860 lig.]])'
'# [[Usuário_Discussão:ValJor]] ([[Especial:Whatlinkshere/Usuário_Discussão:ValJor|4833 lig.]])'
'# [[Imagem:Flag_of_Canada.svg]] ([[Especial:Whatlinkshere/Imagem:Flag_of_Canada.svg|1604 lig.]])'

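The first title belongs to the User namespace, the second to the user discussion namespace (Usuário_Discussão), and the third to the File namespace, yet all three rows have pl_namespace = 0.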
