Last modified: 2011-03-22 21:25:24 UTC
Earlier today, when the page [[en:User:Chzz]] was loaded by IP users, it would display the error box corresponding to the 'view-pool-error' and 'pool-timeout' messages. The page could still be viewed by logged-in users. This persisted for a while; finally I blanked the page and restored it, and that seemed to fix the problem.

At the moment, the page shows up blank for both IP and registered users (no content between the `<!-- bodytext` and `<!-- NewPP limit report` comments). The page source contains this message:

    <!-- Saved in parser cache with key enwiki:pcache:idhash:17204717-0!*!0!!*!* and timestamp 20110321024905 -->
    <!-- parser cache is expired, sending anyway due to pool overload-->

I am unable to get the page to load by blanking it and restoring the content. The user page is very complicated and transcludes a lot, including a call to Special:Prefixindex. I have no idea what is causing the problem, but it looks like there is some bug in the pool counter system.
I've disabled the pool counter for now. It looks like a client disconnected without releasing its locks somehow. After I disabled it, there was only one TCP connection remaining, from a deadlocked apache process, but that was for a different article. Despite there being no relevant connections in lsof, attempting to acquire the relevant lock (`enwiki:pcache:idhash:17204717-0!*!0!!*!*`) failed with a timeout. Unfortunately poolcounterd wasn't compiled with debug symbols, so it wasn't feasible to examine the hashtable. Also, it would have been nice if there were a list of client_data structures that could be examined with gdb or via a monitoring command.
(In reply to comment #0)

> Earlier today, when the page [[en:User:Chzz]] was loaded by IP users, it would
> display the error box corresponding to the 'view-pool-error' and 'pool-timeout'
> messages. The page could still be viewed by logged-in users. This persisted for
> a while; finally I blanked the page and restored it, and that seemed to fix the
> problem.

This sounds like the poolcounter error messages were Squid-cached, which should never, ever, happen.
(In reply to comment #2)

> This sounds like the poolcounter error messages were Squid-cached, which should
> never, ever, happen.

I suspect that the reason it happened for IP users is that they have default parser options. There would have been no Squid cache, and error messages weren't cached, so every request for that page tried to do a parse. That's the situation it was in when I got to see it, except that it had a stale cache entry that could be returned. If there had been no stale cache, it would have returned an error message.
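The fallback behaviour described above (serve a stale parser-cache entry on pool timeout, otherwise show the error box) can be sketched roughly as follows. This is a hypothetical illustration, not MediaWiki's actual code: `render_page`, `PoolTimeout`, and the `pool_acquire` callable are invented names for this sketch.

```python
# Sketch of pool-gated parsing: if the pool lock times out, serve a
# stale parser-cache entry when one exists, otherwise return the
# 'pool-timeout' error. All names here are illustrative.

class PoolTimeout(Exception):
    """Raised when the pool counter lock cannot be acquired in time."""

def render_page(pool_acquire, parse, cache, key):
    try:
        with pool_acquire(key):            # may raise PoolTimeout
            html = parse()
            cache[key] = html              # refresh the parser cache
            return html
    except PoolTimeout:
        if key in cache:                   # stale entry: send it anyway
            return cache[key]
        return "ERROR: pool-timeout"       # nothing to fall back on
```

With default parser options shared by all IP users, every anonymous request funnels through the same `key`, which is why the whole anonymous audience saw the same stale copy or error at once.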
It looks like the reported problem is fixed, Tim has determined that this isn't a bug in the code, and, while this might be good to keep in mind if a similar bug pops up later, it can now be closed. If I've misunderstood this, please reopen with an explanation.
It's not fixed, I just disabled the whole extension. If I re-enable the extension, it will start happening again. There is a bug in the server code. We haven't isolated it yet.
Since I wasn't making much progress finding the bug by code review, I decided to have a crack at debugging the running process with gdb, despite the lack of symbols. I've determined the following:

* The hashtable hasn't grown; it still has hashpower=16.
* There are 134 entries still in the hashtable, so this wasn't an isolated case.
* I looked at three entries; they all had count = processing = 2.
* By subtracting an appropriate offset from the address of the locks structure, I could look at the client_data structure. For one of the hashtable entries, there were two clients, with FDs 387 and 466. lsof says:

    COMMAND   PID        USER FD   TYPE DEVICE SIZE/OFF     NODE NAME
    poolcount 1838 poolcounter 387u sock    0,6      0t0 64556254 can't identify protocol
    poolcount 1838 poolcounter 466u sock    0,6      0t0 64552887 can't identify protocol

There are 275 FDs which give "can't identify protocol", which is suspiciously close to double the number of hashtable entries. Maybe these FDs were closed on the remote side, but poolcounterd never called close() on them. That wouldn't be a surprising scenario, since the structures seem to indicate that free_client_data() was never called on them, and on_client() never calls close() without first calling free_client_data().
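The state observed in gdb can be modelled with a toy data structure (not poolcounterd's actual C structures; `LockEntry`, `acquire`, and `release` are invented for this sketch). If a client's disconnect is never delivered to the event loop, the free_client_data()/close() path never runs, so each leaked entry keeps its two dangling FDs, matching the roughly 2:1 ratio of "can't identify protocol" FDs (275) to hashtable entries (134).

```python
# Toy model of a poolcounterd hashtable entry: count/processing
# counters plus a list of client_data stand-ins (here, bare FDs).
# All names are illustrative.

class LockEntry:
    def __init__(self):
        self.count = 0        # clients holding or waiting on the lock
        self.processing = 0   # clients currently processing
        self.clients = []     # client_data records (leak if not freed)

def acquire(table, key, fd):
    entry = table.setdefault(key, LockEntry())
    entry.count += 1
    entry.processing += 1
    entry.clients.append(fd)
    return entry

def release(table, key, fd):
    entry = table[key]
    entry.count -= 1
    entry.processing -= 1
    entry.clients.remove(fd)     # free_client_data() analogue
    if entry.count == 0:
        del table[key]           # entry reclaimed only when empty

table = {}
acquire(table, "enwiki:pcache:idhash:17204717-0", 387)
acquire(table, "enwiki:pcache:idhash:17204717-0", 466)
# If both clients disconnect but no RELEASE (or close event) is ever
# processed, release() never runs: the entry stays behind with
# count == processing == 2, exactly the state seen in gdb.
```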
I have been able to reproduce it. The strace is as follows; fd 7 is the first one that locks the entry, fd 11 is the leaked one.

    epoll_wait(3, {{EPOLLIN, {u32=0, u64=0}}}, 32, 4294967295) = 1
    accept4(0, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 7
    epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 32, 4294967295) = 1
    epoll_ctl(3, EPOLL_CTL_DEL, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    recvfrom(7, "ACQ4ME bug 1 2 3\r\n", 1024, 0, NULL, NULL) = 18
    sendto(7, "LOCKED\n", 7, 0, NULL, 0) = 7
    epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0

fd 7 locked an entry.

    epoll_wait(3, {{EPOLLIN, {u32=0, u64=0}}}, 32, 4294967295) = 1
    accept4(0, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 11
    epoll_ctl(3, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0
    epoll_wait(3, {{EPOLLIN, {u32=11, u64=11}}}, 32, 4294967295) = 1
    epoll_ctl(3, EPOLL_CTL_DEL, 11, {EPOLLIN, {u32=11, u64=11}}) = 0
    recvfrom(11, "ACQ4ME bug 1 2 3\r\n", 1024, 0, NULL, NULL) = 18
    epoll_ctl(3, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0

fd 11 is waiting on that same entry.

    epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 32, 3000) = 1
    epoll_ctl(3, EPOLL_CTL_DEL, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    recvfrom(7, "RELEASE\r\n", 1024, 0, NULL, NULL) = 9
    sendto(11, "LOCKED\n", 7, 0, NULL, 0) = 7
    sendto(7, "RELEASED\n", 9, 0, NULL, 0) = 9
    epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
    epoll_wait(3, {}, 32, 594) = 0
    epoll_ctl(3, EPOLL_CTL_DEL, 11, {EPOLLIN, {u32=11, u64=11}}) = 0
    epoll_wait(3,

poolcounterd does not receive anything else about fd 11 (including releases or the connection closing). That seems due to the timeout set when re-adding the event. Re-adding the event without a timeout when on_client gets EV_TIMEOUT fixes the problem. <s>However, it still reads after a normal timeout, which is odd.</s> send_client() not only sends the message, it also re-adds the event. Fixed in r84560.
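The failure mode above can be reduced to a minimal model of the event loop (hypothetical Python, not poolcounterd's C/libevent code; `run_events`, `buggy_handler`, and `fixed_handler` are invented names). A waiting client's event is re-added with a timeout; when that timeout fires, the buggy handler removes the event and never re-adds it, so a later RELEASE or close on the fd is simply never seen, and the lock entry leaks.

```python
# Minimal event-loop model: 'watched' is the set of fds with a
# registered event. The loop deletes the event before dispatching
# (event_del analogue); the handler must re-add it to keep watching.

def run_events(events, handler, watched):
    """Feed (fd, what) pairs to handler; return events actually seen."""
    delivered = []
    for fd, what in events:
        if fd in watched:
            watched.discard(fd)          # event removed before dispatch
            delivered.append((fd, what))
            handler(fd, what, watched)
    return delivered

def buggy_handler(fd, what, watched):
    if what == "EV_TIMEOUT":
        return                           # bug: event never re-added
    watched.add(fd)                      # EV_READ: process and re-add

def fixed_handler(fd, what, watched):
    watched.add(fd)                      # always re-add, no timeout

# fd 11's history: wait timeout fires, then the client sends RELEASE.
events = [(11, "EV_TIMEOUT"), (11, "EV_READ")]
```

With `buggy_handler`, only the timeout is delivered and the later RELEASE is invisible; with `fixed_handler`, both arrive, which is the behaviour the r84560 fix restores.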