Last modified: 2014-04-08 23:43:04 UTC
Today a bunch of DB transactions from mw1207 hung open. The offending Apache threads showed this stack trace:

(gdb) zbacktrace
[0xbe6a1190] usleep() /usr/local/apache/common-local/php-1.23wmf18/includes/objectcache/BagOStuff.php:188
[0xbe6a0ae8] lock() /usr/local/apache/common-local/php-1.23wmf18/includes/filerepo/file/LocalFile.php:1832
[0xbe6a00b0] lock() /usr/local/apache/common-local/php-1.23wmf18/includes/filerepo/file/LocalFile.php:1164
[0xbe69fc30] upload() /usr/local/apache/common-local/php-1.23wmf18/includes/upload/UploadBase.php:692
[0xbe69e790] performUpload() /usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiUpload.php:649
[0xbe69e3e0] performUpload() /usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiUpload.php:144
[0xbe69cc98] getContextResult() /usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiUpload.php:111
[0xbe69c610] execute() /usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiMain.php:900
[0xbe69c050] executeAction() /usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiMain.php:364
[0xbe69be48] executeActionWithErrorHandling() /usr/local/apache/common-local/php-1.23wmf18/includes/api/ApiMain.php:335
[0xbe69aee8] execute() /usr/local/apache/common-local/php-1.23wmf18/api.php:86
[0xbe69add8] ??? /usr/local/apache/common-local/w/api.php:3

Paravoid noticed that twemproxy on mw1207 had stopped listening on port 11211. Restarting it let the open transactions complete, and the Apache threads continued as normal. MediaWiki shouldn't wait forever if memcached is not responding.
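For illustration, a lock wait that cannot block a request indefinitely can be bounded by a deadline. The sketch below is in Python rather than MediaWiki's PHP and is not BagOStuff's actual code: `acquire_lock`, its parameters, and the `add` callable (a stand-in for a memcached-style atomic add that returns True only if it created the key) are all hypothetical.

```python
import time
from typing import Callable


def acquire_lock(
    add: Callable[[str], bool],
    key: str,
    timeout: float = 6.0,
    initial_sleep: float = 0.001,
    max_sleep: float = 0.5,
) -> bool:
    """Poll a memcached-style atomic add() until it succeeds or `timeout` elapses.

    `add` is a hypothetical callable: it returns True only if it created the
    lock key, i.e. nobody else currently holds the lock.
    """
    deadline = time.monotonic() + timeout
    sleep = initial_sleep
    while True:
        if add(key + ":lock"):
            return True  # lock acquired
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # give up rather than blocking the request forever
        # Back off exponentially, but never sleep past the deadline.
        time.sleep(min(sleep, remaining))
        sleep = min(sleep * 2, max_sleep)


if __name__ == "__main__":
    # A backend that never grants the lock (e.g. an unreachable proxy):
    # the call returns False after roughly 0.2 s instead of hanging.
    print(acquire_lock(lambda k: False, "file:Example.jpg", timeout=0.2))
```

Note that with a plain boolean `add`, an unreachable cache looks the same as "lock held by someone else", so the caller still waits out the full timeout.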
Change 122550 had a related patch set uploaded by Aaron Schulz: Speed up LocalFile locking behavior https://gerrit.wikimedia.org/r/122550
Change 122550 merged by jenkins-bot: Speed up LocalFile locking behavior https://gerrit.wikimedia.org/r/122550
Change 123041 had a related patch set uploaded by Aaron Schulz: Made BagOStuff fail fast in cas/lock on certain errors https://gerrit.wikimedia.org/r/123041
Change 123041 merged by jenkins-bot: Made BagOStuff fail fast in cas/lock on certain errors https://gerrit.wikimedia.org/r/123041
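The idea named in the title of change 123041, failing fast in cas/lock on certain errors, can be sketched by having the add operation report a backend error separately from "key already exists", so that retrying stops as soon as the cache itself is known to be down. This is an illustrative Python sketch, not the merged PHP change; `AddResult` and `acquire_lock_fail_fast` are hypothetical names.

```python
import enum
import time
from typing import Callable


class AddResult(enum.Enum):
    STORED = "stored"  # key was created: we hold the lock
    EXISTS = "exists"  # key already present: someone else holds the lock
    ERROR = "error"    # backend unreachable or protocol error


def acquire_lock_fail_fast(
    add: Callable[[str], AddResult],
    key: str,
    timeout: float = 6.0,
    poll_interval: float = 0.01,
) -> bool:
    """Retry only while the lock is genuinely held; bail out on backend errors."""
    deadline = time.monotonic() + timeout
    while True:
        result = add(key + ":lock")
        if result is AddResult.STORED:
            return True
        if result is AddResult.ERROR:
            # Retrying cannot help if the cache is down; return now so the
            # request does not sit in a sleep loop with transactions open.
            return False
        if time.monotonic() >= deadline:
            return False  # lock still held by someone else after the timeout
        time.sleep(poll_interval)
```

The design choice is that only `EXISTS` is worth waiting on; an `ERROR` result returns after a single attempt regardless of `timeout`.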
Aaron: Both patches have been merged. Should we wait to see whether this still happens and whether more work is needed, or can this issue be considered fixed?