Last modified: 2013-10-02 06:45:38 UTC
Per bug 42423 comment #17, it would be much better if we received an error when the search index is broken rather than receiving "0 results", as currently happens. This assumes that Lucene actually returns a useful error message that gets tossed, rather than providing MediaWiki with 0 results in this case.
Related to https://bugzilla.wikimedia.org/show_bug.cgi?id=6090 and https://bugzilla.wikimedia.org/show_bug.cgi?id=35691
*** Bug 43553 has been marked as a duplicate of this bug. ***
For RMI errors, of the kind which were the cause of "zero results" being returned last time I debugged a Lucene problem, there are error handling issues in multiple parts of the stack:
* The SearchEngine interface in the core has no way to report errors to SpecialSearch (short of throwing an exception).
* MWSearch makes no attempt to extract an error message from the body of HTTP 500 errors, it just returns null.
* In lucene-search-2, some RMIMessengerClient methods respond to network errors by returning an empty result set, instead of returning an error status as they should.
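The common thread in the points above is that a backend failure and a genuinely empty result set end up looking identical to the caller. A minimal Python sketch of the alternative follows; it is an illustration only, not the MediaWiki or lucene-search-2 code. `SearchStatus`, `search`, `render` and the message strings are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SearchStatus:
    """Hypothetical result wrapper: an error is distinct from an empty hit list."""
    hits: List[str] = field(default_factory=list)
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def search(query: str, backend: Callable[[str], List[str]]) -> SearchStatus:
    """Call the backend and turn network failures into an error status."""
    try:
        return SearchStatus(hits=backend(query))
    except OSError as exc:  # OSError covers timeouts and connection errors
        return SearchStatus(error=f"search backend unavailable: {exc}")


def render(status: SearchStatus) -> str:
    """Show an error, a 'no results' notice, or the hits, in that order."""
    if not status.ok:
        return f"An error occurred while searching: {status.error}. Please try again."
    if not status.hits:
        return "There were no results matching the query."
    return "\n".join(status.hits)
```

With this shape, the "no results" message can only appear when the backend actually answered with an empty list.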
We should at least log errors to UDP.
Created attachment 11684
Ruby script to reproduce failure

Sends the same search query 100 times with a 10 s delay between iterations. I ran it a few times and was able to reproduce the failure every time, sometimes very quickly, sometimes after around 70 iterations.
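The attachment itself is not reproduced in this thread. As a rough illustration of the same kind of probe, here is a minimal Python sketch (not the attached Ruby script). `probe` and `fetch_count` are hypothetical names, and `fetch_count` stands in for whatever code issues the query and returns the number of hits.

```python
import time
from typing import Callable, List


def probe(fetch_count: Callable[[str], int], query: str,
          iterations: int = 100, delay: float = 10.0,
          sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Run the same query repeatedly; return the iterations that reported zero hits."""
    failures = []
    for i in range(1, iterations + 1):
        if fetch_count(query) == 0:
            failures.append(i)
        if i < iterations:
            sleep(delay)  # pause between iterations, not after the last one
    return failures
```

The query should be one that is known to have results, so that any iteration number in the returned list marks a spurious "zero results" response.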
I disagree on this being an enhancement: it's an actual bug because it's highly misleading.
Sorry, but this is as critical as bug 42423. A system that fails to report failed searches (and instead reports that nothing was found) is plainly misleading and makes users lose confidence in the system, especially at the rate it fails, sometimes 100% of the time for several minutes. At the least, it should tell the user to try again.
Over the weekend my script was again able to reproduce the "zero results" error, so the issue is still with us. Some log analysis indicates that failure of the 'highlight' call due to socket timeouts may be the problem, so returning a failure in this case is easy to do. The broader issue of why we are getting socket timeouts is more difficult.
https://gerrit.wikimedia.org/r/#/c/55841/ Instruments code to dump entire GlobalConfiguration singleton to a file to aid debugging.
https://gerrit.wikimedia.org/r/#/c/56354/ Adds better error handling to return an error status instead of hiding internal errors and falsely reporting "zero results". Similar changes to the PHP side of things mentioned by TimS in comment 3 above are still being worked on.
https://gerrit.wikimedia.org/r/#/c/57350/ https://gerrit.wikimedia.org/r/#/c/57368/ I've pushed fixes to the PHP side in the above 2 commits. Apparently Chad has also been working on this and his slightly different fixes are here: https://gerrit.wikimedia.org/r/#/c/57337/ https://gerrit.wikimedia.org/r/#/c/57336/
I amended mine based on our discussion on IRC/e-mail; it now combines both approaches.
https://gerrit.wikimedia.org/r/57350 (Gerrit Change Idb42d64987164ba099228b154729c9c86af7407f) | change ABANDONED [by Ram]
https://gerrit.wikimedia.org/r/57368 (Gerrit Change Ic07ce8f32be8358fbb2f5a60f3c8c324cb27694c) | change ABANDONED [by Ram]
Chad's patches in https://gerrit.wikimedia.org/r/#/c/57336/ and https://gerrit.wikimedia.org/r/#/c/57337/ got merged, but that broke ApiQuerySearch (see bug 47353).
https://gerrit.wikimedia.org/r/56354 (Gerrit Change Ibeef63f45a3276e870afbcadbd08c7bd2967b9e6) | change APPROVED and MERGED [by Tim Starling]
All three patches (that I'm aware of) got merged. Can this be closed as FIXED, or is more needed?
Let's wait for it to be deployed (in a few days, hopefully) before closing.
https://gerrit.wikimedia.org/r/55841 (Gerrit Change I178fba54a42173bce0b941f143bbc5ecf2bac15d) | change ABANDONED [by Tim Starling]
Search failure has been seen a couple of times at Commons in the last few days.
(In reply to comment #20)
> Search failure has been seen a couple of times at Commons last days.

Yes, I saw it too yesterday. Sometimes it's nice to see errors. ;)
Closing this since we are now seeing proper errors instead of spurious "zero results".
(In reply to comment #4)
> We should at least log errors to UDP.

Tim, should this be filed as a separate bug report? The log was (re)enabled and then disabled as too spammy on April 24: a1c62a08.
Currently we have:

MWSearch_body.php
500: wfDebugLog( 'mwsearch', "Search timeout requesting $searchUrl" );
508: wfDebugLog( 'mwsearch', 'Search backend error: ' . $m[1] );

Maybe the second log could be removed/renamed so that we can at least have some way to count the first.
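To illustrate what counting the first message separately could look like, here is a minimal Python sketch that tallies the two 'mwsearch' messages quoted above. The on-disk layout of the log lines is an assumption: the sketch only relies on the two message strings from MWSearch_body.php appearing somewhere in each line. `count_search_errors` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable

# Message prefixes taken from the two wfDebugLog() calls quoted above.
TIMEOUT_MARK = "Search timeout requesting "
BACKEND_MARK = "Search backend error: "


def count_search_errors(lines: Iterable[str]) -> Counter:
    """Tally timeout and backend-error messages; other lines are ignored."""
    counts = Counter()
    for line in lines:
        if TIMEOUT_MARK in line:
            counts["timeout"] += 1
        elif BACKEND_MARK in line:
            counts["backend_error"] += 1
    return counts
```

Because both messages currently go to the same log group, a filter like this is needed to separate them after the fact; giving the second message its own group would make the filter unnecessary.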
(In reply to comment #23)
> (In reply to comment #4)
> > We should at least log errors to UDP.
>
> Tim, should this be filed as separate bug report? The log was (re)enabled and
> then disabled as too spammy on April 24: a1c62a08.
> Currently we have:
>
> MWSearch_body.php
> 500: wfDebugLog( 'mwsearch', "Search timeout requesting $searchUrl" );
> 508: wfDebugLog( 'mwsearch', 'Search backend error: ' . $m[1] );
>
> Maybe the second log could be removed/renamed so that we can at least have
> some way to count the first.

That was filed as bug 54865.