Last modified: 2014-09-15 01:15:37 UTC
This is from an HHVM jobrunner:

<!-- NewPP limit report
Parsed by mw1053
CPU time usage: 41.412 seconds
Real time usage: 3.443 seconds
Preprocessor visited node count: 13901/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 316970/2048000 bytes
Template argument size: 35984/2048000 bytes
Highest expansion depth: 23/40
Expensive parser function count: 16/500
Lua time usage: 10.352/10.000 seconds
Lua memory usage: 2.54 MB/50 MB
-->
<!-- Saved in parser cache with key enwiki:pcache:idhash:13146361-0!*!0!*!*!4!* and timestamp 20140829142304 and revision id 269146468 -->

This is the same page, also HHVM, but from a null edit rather than a job:

<!-- NewPP limit report
Parsed by mw1019
CPU time usage: 0.986 seconds
Real time usage: 1.276 seconds
Preprocessor visited node count: 13994/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 319685/2048000 bytes
Template argument size: 35683/2048000 bytes
Highest expansion depth: 23/40
Expensive parser function count: 16/500
Lua time usage: 0.290/10.000 seconds
Lua memory usage: 2.59 MB/50 MB
-->
<!-- Saved in parser cache with key enwiki:pcache:idhash:13146361-0!*!0!*!*!4!* and timestamp 20140829142922 and revision id 269146468 -->

The job runner's CPU time is much higher for no apparent reason. This has been going on for several days, possibly weeks.
The issue is that LuaSandbox is not thread-safe. The figure you're seeing conflates the CPU time of several concurrent threads of execution.

* LuaSandbox uses CLOCK_PROCESS_CPUTIME_ID to measure time, but CLOCK_PROCESS_CPUTIME_ID measures CPU time consumed by *all* threads of the calling process.
* LuaSandbox requests to be notified with a SIGEV_SIGNAL upon timer expiration, but at any point in time, only one such signal is queued per timer per process.
* LuaSandbox uses the sigprocmask() system call to block signals, but its behavior in a multithreaded process is unspecified; pthread_sigmask() must be used instead.
* When the normal or emergency timers are stopped, any queued SIGEV_SIGNAL for the entire process is flushed.
* When multiple threads are executing concurrently, the timer for the entire process is repeatedly re-armed (i.e., the expiration is reset).
* Confusingly, LUASANDBOX_G(signal_handler_installed) is thread-local, not global. This means every thread calls luasandbox_timer_install_handler, which caches the old handler in another thread-local variable and reinstalls it when LuaSandbox is deactivating. When LuaSandbox is active in multiple threads, individual threads end up re-installing each other's signal handlers, keeping each other alive.

The above applies specifically to the normal and emergency timers. The profiling feature has its own bag of thread-safety / reentrancy issues, which I propose we deal with in the context of bug 68413 rather than this bug.
I notice that it's not just Lua time usage that's bad; core's "CPU time usage" is also really high. Is that problem related to this one?
(In reply to Jackmcbarn from comment #2)
> I notice that it's not just Lua time usage that's bad; core's "CPU time
> usage" is also really high. Is that problem related to this one?

Yes. The fact that concurrent threads reinstall each other's signal handlers means that requests can make each other linger, increasing the backend response time.

Your question prompted me to take a look at ParserOutput::getTimes. I found that it also has a separate bug which, like this one, stems from the assumptions it makes about the underlying runtime's threading model. Filed as bug 70227.
Removing performance keyword, since this is a bug in the timer rather than anything actually being slow.
So, is there more to this bug than replacing CLOCK_PROCESS_CPUTIME_ID with CLOCK_THREAD_CPUTIME_ID?
Created attachment 16326 [details]
A naive fix that doesn't work

(In reply to Victor Vasiliev from comment #5)
> So, is there more to this bug than replacing CLOCK_PROCESS_CPUTIME_ID with
> CLOCK_THREAD_CPUTIME_ID?

Yes. See the attached patch, which doesn't work (it segfaults in luasandbox_timer_set_one_time). There are things which should be thread-local but aren't. It's going to take some very careful work to fix this.
Two ideas:

* In a multithreaded environment, it might be a lot saner to use the timerfd_*[1] family of system calls, which operate on timers that notify via file descriptors rather than signals. But this is a Linux-specific feature, so it would come at the cost of portability.
* In an ideal world, LuaSandbox would use high-level timing and code-profiling interfaces provided by HHVM, such as Timer::GetRusageMicros[2], cpuCycles[3], etc.

[1]: http://man7.org/linux/man-pages/man2/timerfd_create.2.html
[2]: https://github.com/facebook/hhvm/blob/master/hphp/util/timer.cpp#L108
[3]: https://github.com/facebook/hhvm/blob/master/hphp/util/cycles.h#L29
Created attachment 16332 [details]
Small helper script for reproducing the bug

Fires off multiple requests to the enwiki API to parse [[en:Barack Obama]]. Only usable from the WMF cluster.
Change 159011 had a related patch set uploaded by Tim Starling: [WIP] Make timer/profiler be thread-safe https://gerrit.wikimedia.org/r/159011
Change 159011 merged by jenkins-bot: Make timer/profiler be thread-safe https://gerrit.wikimedia.org/r/159011