Last modified: 2011-03-22 21:25:24 UTC
Earlier today, when the page [[en:User:Chzz]] was loaded by IP users, it would display the error box corresponding to the 'view-pool-error' and 'pool-timeout' messages. The page could still be viewed by logged-in users. This persisted for a while; finally I blanked the page and restored it, and that seemed to fix the problem.

At the moment, the page shows up blank for both IP and registered users (no content between the `<!-- bodytext` and `<!-- NewPP limit report` comments). The page source contains this message:

    <!-- Saved in parser cache with key enwiki:pcache:idhash:17204717-0!*!0!!*!* and timestamp 20110321024905 -->
    <!-- parser cache is expired, sending anyway due to pool overload-->

I am unable to get the page to load by blanking it and restoring the content. The user page is very complicated and transcludes a lot, including a call to Special:Prefixindex. I have no idea what is causing the problem, but it looks like there is some bug in the pool counter system.
I've disabled the pool counter for now. It looks like a client disconnected without releasing its locks somehow. After I disabled it, there was only one TCP connection remaining, from a deadlocked apache process, but that was for a different article. Despite there being no relevant connections in lsof, attempting to acquire the relevant lock (`enwiki:pcache:idhash:17204717-0!*!0!!*!*`) failed with a timeout. Unfortunately poolcounterd wasn't compiled with debug symbols, so it wasn't feasible to examine the hashtable. Also, it would have been nice if there were a list of client_data structures that could be examined with gdb or via a monitoring command.
(In reply to comment #0)

> Earlier today, when the page [[en:User:Chzz]] was loaded by IP users, it would
> display the error box corresponding to the 'view-pool-error' and 'pool-timeout'
> messages. The page could still be viewed by logged-in users. This persisted for
> a while; finally I blanked the page and restored it, and that seemed to fix the
> problem.

This sounds like the poolcounter error messages were Squid-cached, which should never, ever, happen.
(In reply to comment #2)

> This sounds like the poolcounter error messages were Squid-cached, which should
> never, ever, happen.

I suspect that the reason it happened for IP users is that they have default parser options. There would have been no Squid cache, and error messages weren't cached, so every request for that page tried to do a parse. That's the situation it was in when I got to see it, except that it had a stale cache entry that could be returned. If there had been no stale cache, it would have returned an error message.
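The fallback behaviour described above (serve a stale parser-cache entry on pool timeout, otherwise show the error box) can be sketched roughly as follows. This is a hypothetical illustration, not MediaWiki's actual code: `render_page`, `PoolTimeout`, and the `pool_acquire` callable are invented names for this sketch.

```python
# Sketch of pool-gated parsing: if the pool lock times out, serve a
# stale parser-cache entry when one exists, otherwise return the
# 'pool-timeout' error. All names here are illustrative.

class PoolTimeout(Exception):
    """Raised when the pool counter lock cannot be acquired in time."""

def render_page(pool_acquire, parse, cache, key):
    try:
        with pool_acquire(key):            # may raise PoolTimeout
            html = parse()
            cache[key] = html              # refresh the parser cache
            return html
    except PoolTimeout:
        if key in cache:                   # stale entry: send it anyway
            return cache[key]
        return "ERROR: pool-timeout"       # nothing to fall back on
```

With default parser options shared by all IP users, every anonymous request funnels through the same `key`, which is why the whole anonymous audience saw the same stale copy or error at once.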
It looks like the reported problem is fixed, Tim has determined that this isn't a bug in the code, and, while this might be good to keep in mind if a similar bug pops up later, it can now be closed. If I've misunderstood this, please reopen with an explanation.
It's not fixed, I just disabled the whole extension. If I re-enable the extension, it will start happening again. There is a bug in the server code. We haven't isolated it yet.
Since I wasn't making much progress finding the bug by code review, I decided to have a crack at debugging the running process with gdb, despite the lack of symbols. I've determined the following:

* The hashtable hasn't grown; it still has hashpower=16.
* There are 134 entries still in the hashtable, so this wasn't an isolated case.
* I looked at three entries; they all had count = processing = 2.
* By subtracting an appropriate offset from the address of the locks structure, I could look at the client_data structure. For one of the hashtable entries, there were two clients, with FDs 387 and 466. lsof says:

    COMMAND   PID        USER FD   TYPE DEVICE SIZE/OFF     NODE NAME
    poolcount 1838 poolcounter 387u sock    0,6      0t0 64556254 can't identify protocol
    poolcount 1838 poolcounter 466u sock    0,6      0t0 64552887 can't identify protocol

There are 275 FDs which give "can't identify protocol", which is suspiciously close to double the number of hashtable entries. Maybe these FDs were closed on the remote side, but poolcounterd never called close() on them. That wouldn't be a surprising scenario, since the structures seem to indicate that free_client_data() was never called on them, and on_client() never calls close() without first calling free_client_data().
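The state observed in gdb can be modelled with a toy data structure (not poolcounterd's actual C structures; `LockEntry`, `acquire`, and `release` are invented for this sketch). If a client's disconnect is never delivered to the event loop, the free_client_data()/close() path never runs, so each leaked entry keeps its two dangling FDs, matching the roughly 2:1 ratio of "can't identify protocol" FDs (275) to hashtable entries (134).

```python
# Toy model of a poolcounterd hashtable entry: count/processing
# counters plus a list of client_data stand-ins (here, bare FDs).
# All names are illustrative.

class LockEntry:
    def __init__(self):
        self.count = 0        # clients holding or waiting on the lock
        self.processing = 0   # clients currently processing
        self.clients = []     # client_data records (leak if not freed)

def acquire(table, key, fd):
    entry = table.setdefault(key, LockEntry())
    entry.count += 1
    entry.processing += 1
    entry.clients.append(fd)
    return entry

def release(table, key, fd):
    entry = table[key]
    entry.count -= 1
    entry.processing -= 1
    entry.clients.remove(fd)     # free_client_data() analogue
    if entry.count == 0:
        del table[key]           # entry reclaimed only when empty

table = {}
acquire(table, "enwiki:pcache:idhash:17204717-0", 387)
acquire(table, "enwiki:pcache:idhash:17204717-0", 466)
# If both clients disconnect but no RELEASE (or close event) is ever
# processed, release() never runs: the entry stays behind with
# count == processing == 2, exactly the state seen in gdb.
```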
I have been able to reproduce it. The strace is as follows; fd 7 is the first one that locks the entry, fd 11 is the leaked one.

    epoll_wait(3, {{EPOLLIN, {u32=0, u64=0}}}, 32, 4294967295) = 1
    accept4(0, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 7
    epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 32, 4294967295) = 1
    epoll_ctl(3, EPOLL_CTL_DEL, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    recvfrom(7, "ACQ4ME bug 1 2 3\r\n", 1024, 0, NULL, NULL) = 18
    sendto(7, "LOCKED\n", 7, 0, NULL, 0) = 7
    epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0

fd 7 locked an entry.

    epoll_wait(3, {{EPOLLIN, {u32=0, u64=0}}}, 32, 4294967295) = 1
    accept4(0, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 11
    epoll_ctl(3, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0
    epoll_wait(3, {{EPOLLIN, {u32=11, u64=11}}}, 32, 4294967295) = 1
    epoll_ctl(3, EPOLL_CTL_DEL, 11, {EPOLLIN, {u32=11, u64=11}}) = 0
    recvfrom(11, "ACQ4ME bug 1 2 3\r\n", 1024, 0, NULL, NULL) = 18
    epoll_ctl(3, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0

fd 11 is waiting on that same entry.

    epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 32, 3000) = 1
    epoll_ctl(3, EPOLL_CTL_DEL, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    recvfrom(7, "RELEASE\r\n", 1024, 0, NULL, NULL) = 9
    sendto(11, "LOCKED\n", 7, 0, NULL, 0) = 7
    sendto(7, "RELEASED\n", 9, 0, NULL, 0) = 9
    epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    epoll_wait(3, {}, 32, 594) = 0
    epoll_ctl(3, EPOLL_CTL_DEL, 11, {EPOLLIN, {u32=11, u64=11}}) = 0
    epoll_wait(3,

poolcounterd does not receive anything else about fd 11 (including releases or the connection closing). That seems due to the timeout set when re-adding the event. Re-adding the event without a timeout when on_client gets EV_TIMEOUT fixes the problem. <s>However, it still reads after a normal timeout, which is odd.</s> send_client() not only sends the message, it also re-adds the event. Fixed in r84560.
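The failure mode above can be reduced to a minimal model of the event loop (hypothetical Python, not poolcounterd's C/libevent code; `run_events`, `buggy_handler`, and `fixed_handler` are invented names). A waiting client's event is re-added with a timeout; when that timeout fires, the buggy handler removes the event and never re-adds it, so a later RELEASE or close on the fd is simply never seen, and the lock entry leaks.

```python
# Minimal event-loop model: 'watched' is the set of fds with a
# registered event. The loop deletes the event before dispatching
# (event_del analogue); the handler must re-add it to keep watching.

def run_events(events, handler, watched):
    """Feed (fd, what) pairs to handler; return events actually seen."""
    delivered = []
    for fd, what in events:
        if fd in watched:
            watched.discard(fd)          # event removed before dispatch
            delivered.append((fd, what))
            handler(fd, what, watched)
    return delivered

def buggy_handler(fd, what, watched):
    if what == "EV_TIMEOUT":
        return                           # bug: event never re-added
    watched.add(fd)                      # EV_READ: process and re-add

def fixed_handler(fd, what, watched):
    watched.add(fd)                      # always re-add, no timeout

# fd 11's history: wait timeout fires, then the client sends RELEASE.
events = [(11, "EV_TIMEOUT"), (11, "EV_READ")]
```

With `buggy_handler`, only the timeout is delivered and the later RELEASE is invisible; with `fixed_handler`, both arrive, which is the behaviour the r84560 fix restores.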