Last modified: 2013-05-16 15:59:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T44423, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 42423 - Wikimedia wiki search is broken (outputting inconsistent results)
Status: RESOLVED WORKSFORME
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2
Version: unspecified
Hardware: All All
Importance: Highest critical with 2 votes
Target Milestone: ---
Assigned To: Munagala Ramanath (Ram)
URL: http://wikitech.wikimedia.org/view/Se...
Keywords: ops, platformeng
Duplicates: 42424 42426 42431 43920
Depends on: 43544 43553 43869 43894
Blocks:
Reported: 2012-11-25 04:56 UTC by MZMcBride
Modified: 2013-05-16 15:59 UTC
CC List: 20 users


Attachments
Screenshot of mediawiki.org incorrectly showing no search results, 2013-01-05 (127.22 KB, image/png)
2013-01-05 18:03 UTC, MZMcBride
Screenshot of mediawiki.org correctly showing search results, 2013-01-05 (200.32 KB, image/png)
2013-01-05 18:03 UTC, MZMcBride
Page generated on failed search (21.43 KB, text/html)
2013-01-11 04:06 UTC, Valerie Juarez
HTML source diffed between responses with and without results (same URL/query) (17.39 KB, patch)
2013-01-11 16:03 UTC, jeremyb

Description MZMcBride 2012-11-25 04:56:41 UTC
When I go to <https://hi.wikipedia.org/w/index.php?search=incategory%3A%22%E0%A4%95%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A3%E0%A4%BE+%E0%A4%9C%E0%A4%BF%E0%A4%B2%E0%A4%BE%22&title=%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7%3A%E0%A4%96%E0%A5%8B%E0%A4%9C>, I'm getting inconsistent results. Sometimes when the page loads, it shows a proper listing (1 to 20 of 955 results). Other times when the page loads, it shows an improper listing (no results found).

This leads me to believe that the search indices may not be properly synchronized. Or perhaps data is getting dropped somewhere.
Comment 1 MZMcBride 2012-11-25 06:02:22 UTC
I don't think anyone but roots has access to the search cluster at this point. Hrmph.

I'm seeing similar inconsistent output at <https://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=default&search=Boston&fulltext=Search>. On certain page loads, the results are "1 - 20 of 22"; on other page loads, the results are "no results matching the query". Something is plainly broken.
Comment 2 MZMcBride 2012-11-25 07:38:09 UTC
*** Bug 42426 has been marked as a duplicate of this bug. ***
Comment 3 MZMcBride 2012-11-25 07:38:32 UTC
*** Bug 42424 has been marked as a duplicate of this bug. ***
Comment 4 jeremyb 2012-11-25 08:07:36 UTC
Can anyone still reproduce this?
Comment 5 MZMcBride 2012-11-25 15:32:18 UTC
(In reply to comment #4)
> Can anyone still reproduce this?

Yes. Why do you ask?
Comment 6 Jesús Martínez Novo (Ciencia Al Poder) 2012-11-25 15:42:04 UTC
*** Bug 42431 has been marked as a duplicate of this bug. ***
Comment 7 Andre Klapper 2012-11-25 15:51:29 UTC
Looking at http://wikitech.wikimedia.org/view/Server_admin_log there are several entries which might be related: 

November 25
08:11 apergos: from about half an hour ago, restarted lucene search on search13 and forgot to log it

November 23 
04:04 Tim: oh yeah, and I upgraded lucene to my version with the timeouts, deployed to pmtpa only via puppet
04:02 Tim: many lucene search servers failed to bind to port 1099 when they were restarted by the upgrade, restarting manually
Comment 8 Sven Manguard 2012-11-25 15:53:03 UTC
Hey there. I'm just confirming that search is still useless. 

Sven
Comment 9 billinghurst 2012-11-25 16:49:47 UTC
(In reply to comment #8)
> Hey there. I'm just confirming that search is still useless. 
> 
> Sven

It would probably be useful for you to identify where you are having issues. The wikis reported in the duplicates and the Wikidata URL above all seem to return data now, so where is there still a problem?
Comment 10 Sven Manguard 2012-11-25 16:54:11 UTC
It's returning results now, but it wasn't when I made the above post.
Comment 11 Andre Klapper 2012-11-25 17:42:35 UTC
The link in the URL field now works reliably for me too (see the IRC log below and the line by nagios-wm). However, it is worth investigating so that this doesn't happen again.

<apergos> I got what I think is a no results page
<apergos> I don't see anything useful in the log about hiwiki (on search13 and search14)
<nagios-wm> RECOVERY - Lucene on search14 is OK: TCP OK - 0.002 second response time on port 8123
<apergos> that's odd, I didn't know it was out to lunch (and it didn't behave like it was)
<apergos> there's a lot of 'thread is waiting' messages
<apergos> tim might have some insight (given his recent change to the code)
<apergos> there are also messages like these:
<apergos>  Cannot contact RMI registry for host search0x : Unknown host: search0x
<apergos> but it's hard to tell what is setting that off
Comment 12 jeremyb 2012-11-25 21:39:09 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Can anyone still reproduce this?
> 
> Yes. Why do you ask?

Because several people who had seen it broken (across multiple wikis) found that it was no longer broken for them. I figured this bug was a good place to fish for people (and cases) where it was still not working.
Comment 13 Andre Klapper 2012-11-26 18:00:38 UTC
Issue isn't reproducible anymore for me, lowering severity/priority.
Comment 14 Sumana Harihareswara 2012-11-26 18:42:08 UTC
CC'ing Patrick, as Patrick and Tim (already cc'd) are working on fixes to
immediate search problems.
Comment 15 Andre Klapper 2012-11-27 04:16:16 UTC
Probably related: https://gerrit.wikimedia.org/r/35345 was submitted in response to this alert from 50 minutes ago:
<nagios-wm> PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
Comment 16 Tim Starling 2012-11-27 04:18:39 UTC
(In reply to comment #11)
> <apergos>  Cannot contact RMI registry for host search0x : Unknown host:
> search0x

That's just a configuration hack, flooding the logs with exception backtraces to avoid the need to disable those indexes properly.
Comment 17 Foroa 2012-11-29 15:06:40 UTC
Sorry, but if it fails, it should state that it failed rather than pretend that it found 0 results.
A possible solution is for it to always return the date of the latest search database update, and to display an impossible date when nothing was received.

Foroa
Comment 18 Andre Klapper 2012-12-31 19:32:58 UTC
Bug 16236 comment 14 implies that this still happens on mediawiki.org.

Patrick and Tim: Has there been any outcome from the investigations into this four weeks ago?
Comment 19 Foroa 2012-12-31 20:44:37 UTC
Remains a problem. Search failed several times, even 5 minutes ago.

See http://commons.wikimedia.org/wiki/Commons:Village_pump#Search_faulty.3F too.
Comment 20 Rob Lanphier 2012-12-31 21:52:49 UTC
Comment #17 is filed as bug 43544. As far as I can tell, the search problems on Commons aren't occurring right now.
Comment 21 Foroa 2013-01-01 09:26:13 UTC
This morning, I've got at least 10 search failures. In general, retrying it after a few tens of seconds works.
Comment 22 Nemo 2013-01-01 09:42:28 UTC
From yesterday's logs http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20121231.txt :
[21:17:23] <binasher>	 robla: that bugzilla ticket should probably be closed unless it's worth have a ticket to report new search problems as they happen.  search was broken over the thanksgiving holiday, which was when mzmcbride opened it
[23:48:36] <robla>	 thanks for the update.  I think, after asking you about that, we established that things are (as of this instant) in ok shape, but something we could use a little more monitoring of

So I guess they've deteriorated again?
It was also suggested to split comment 17 to another bug: bug 43553.
Comment 24 MZMcBride 2013-01-05 18:01:13 UTC
Isarra just experienced this problem on mediawiki.org and I was able to reproduce (after many tries). The HTML source of the page with no results contained "<!-- Served by srv275 in 10.143 secs. -->", though I'm not sure this is very helpful to debugging.

My suspicion is that the search cluster's indices are not all fully synchronized. Or possibly one of the search boxes is simply broken/unresponsive (ten seconds is an awfully long time to take to respond).

I'll upload screenshots in short order.
Comment 25 MZMcBride 2013-01-05 18:03:06 UTC
Created attachment 11593 [details]
Screenshot of mediawiki.org incorrectly showing no search results, 2013-01-05
Comment 26 MZMcBride 2013-01-05 18:03:45 UTC
Created attachment 11594 [details]
Screenshot of mediawiki.org correctly showing search results, 2013-01-05
Comment 27 Andre Klapper 2013-01-07 13:17:47 UTC
Bug 43663 is a potential duplicate.

I'm increasing priority as this seems to affect quite a few people and makes finding information cumbersome and error-prone.
Comment 28 Andre Klapper 2013-01-07 21:45:50 UTC
Have not been able to reproduce this on mediawiki.org, whether logged in or not. Haven't seen anything suspicious since 2013-01-05 in the server admin log at http://wikitech.wikimedia.org/view/Server_admin_log either (except for the job queue having lots of items).
Decreasing prio/seve again.
Comment 29 Foroa 2013-01-08 07:26:47 UTC
Problem still persists on Commons. Search is a major tool for checking and completing categories. (There is a backlog of hundreds of thousands of images awaiting categorisation.)
Comment 30 Andre Klapper 2013-01-08 11:08:28 UTC
If the problem exists, please provide explicit and exact steps to reproduce (what to do when, Search in page name vs. page contains etc, a URL / search term to reproduce with) so others can try to reproduce. 
"It still happens" only is unfortunately not helpful.
Comment 31 Foroa 2013-01-08 11:52:51 UTC
The basic problem is that search randomly fails to return results without any indication that it failed; we know that it failed only because there are no results and because we know that there should be results. I got it a couple of times this morning on Commons. In general, redoing the same search once or a couple of times, one or more seconds later, finally returns some results. So the basic problem is that the service is not reliable and does not report that there is a problem. That is why I proposed on another bug report to return at least a status and the date of the search database update (which is another source of frustration, as it looks as if it takes between 1 and 5 days before new files are included in the search database).

As a test procedure, one could easily write a script that uses the name of a random category (or a word from it) as the search string and searches in files, galleries and categories: each search should yield some results. Obviously, such tests should be run on en:wiki (which contains the largest volume of data) or Commons (which probably has the most items in its database).
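
A minimal sketch of such a test script, in Python. Everything named here is an assumption made for illustration: the helper names `build_search_url` and `probe` are hypothetical, the query parameters are copied from the Wikidata URL in comment 1, and the "empty page" marker text differs per wiki and language, so it is passed in rather than hard-coded. The page fetcher is injected so the logic can be exercised without touching the live site.

```python
from urllib.parse import urlencode


def build_search_url(base, term):
    """Build a Special:Search URL for `term` (parameters as in comment 1)."""
    query = urlencode({
        "title": "Special:Search",
        "search": term,
        "fulltext": "Search",
    })
    return f"{base}/w/index.php?{query}"


def probe(fetch, url, attempts, empty_marker):
    """Call fetch(url) `attempts` times and count responses that look empty.

    `fetch` is any callable returning the page HTML as a string;
    `empty_marker` is the (wiki-specific, assumed) "no results" text.
    """
    empty = 0
    for _ in range(attempts):
        if empty_marker in fetch(url):
            empty += 1
    return empty


if __name__ == "__main__":
    # Canned responses stand in for the live site in this demonstration.
    canned = iter([
        "<p>Results 1 - 20 of 68</p>",
        "<p>no results matching the query</p>",
        "<p>Results 1 - 20 of 68</p>",
    ])
    url = build_search_url("https://commons.wikimedia.org", "Bottle filling")
    failures = probe(lambda _url: next(canned), url, 3,
                     "no results matching the query")
    print(url)
    print(f"{failures} of 3 responses looked empty")
```

For a real run, `fetch` could wrap `urllib.request.urlopen` and the search terms could be category names known to have members; any request rate should be kept low enough not to add load to an already struggling search cluster.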
Comment 32 MZMcBride 2013-01-08 18:12:48 UTC
(In reply to comment #30)
> If the problem exists, please provide explicit and exact steps to reproduce
> (what to do when, Search in page name vs. page contains etc, a URL / search
> term to reproduce with) so others can try to reproduce. 
> "It still happens" only is unfortunately not helpful.

Comment 24, comment 25, and comment 26 could not be any more explicit, showing very clearly both the symptom of the problem and the steps to reproduce (the URL bar was intentionally included in both screenshots).

This problem happens intermittently, through absolutely no fault of users. This bug is waiting on a sysadmin to investigate, debug, and resolve the problem.
Comment 33 Nemo 2013-01-08 18:22:01 UTC
(In reply to comment #28)
> Have not been able to reproduce this [...]
> Decreasing prio/seve again.

Per comment 32, I've marked it "critical" again; it still needs an assignee.
(It could be a legitimate "blocker"; it surely blocks any search-related development/debugging/whatever.)
Comment 34 Andre Klapper 2013-01-08 19:29:07 UTC
(In reply to comment #32)
> Comment 24, comment 25, and comment 26 could not be any more explicit

For your case (mediawiki.org) yes.
But comment 29 (that I answered) was about Commons.
Comment 35 Foroa 2013-01-08 19:33:39 UTC
Failed several times in the last hour. At the bottom of the page source, it reads:
* <!-- Served by srv232 in 10.193 secs. --> when it returns no results, no idea what time it took
* <!-- Served by srv192 in 0.631 secs. --> after second attempt with 36 results

A second case http://commons.wikimedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=Bottle+filling+-incategory%3A%22Bottle_filling%22&fulltext=Search&ns0=1&ns6=1&ns14=1&redirs=1&profile=advanced:
* fail: <!-- Served by mw39 in 10.283 secs. --> after roughly 10 seconds
* fail: <!-- Served by mw49 in 10.188 secs. --> after roughly 10 seconds
* 68 results :<!-- Served by mw38 in 0.619 secs. -->

It looks as if a search query that takes more than 10 seconds causes the request to be aborted. I think that some searches without results return in much less than 10 seconds; I will try to estimate that better.
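
The timings quoted in this comment can be checked mechanically. The following is a minimal sketch in Python that assumes only the footer-comment format shown above; the function names are hypothetical and the observations are copied from this comment, not from any server log.

```python
import re

# Matches footer comments such as "<!-- Served by mw39 in 10.283 secs. -->".
SERVED_BY = re.compile(r"Served by (\S+) in ([0-9.]+) secs")


def parse_served_by(html):
    """Return (host, seconds) from the footer comment, or None if absent."""
    match = SERVED_BY.search(html)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


# (footer comment, did the page show results?) as reported in this comment.
OBSERVATIONS = [
    ("<!-- Served by srv232 in 10.193 secs. -->", False),
    ("<!-- Served by srv192 in 0.631 secs. -->", True),
    ("<!-- Served by mw39 in 10.283 secs. -->", False),
    ("<!-- Served by mw49 in 10.188 secs. -->", False),
    ("<!-- Served by mw38 in 0.619 secs. -->", True),
]


def split_timings(observations):
    """Split response times into (failed, ok) lists of seconds."""
    failed, ok = [], []
    for html, has_results in observations:
        parsed = parse_served_by(html)
        if parsed is None:
            continue  # no footer comment found; nothing to record
        (ok if has_results else failed).append(parsed[1])
    return failed, ok


if __name__ == "__main__":
    failed, ok = split_timings(OBSERVATIONS)
    print("failed:", failed)
    print("ok:    ", ok)
```

On this small sample every failed response lands just above 10 seconds and every successful one well below, which is consistent with a 10-second timeout somewhere behind the Apache front end, though it does not prove one.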
Comment 36 Foroa 2013-01-09 07:39:04 UTC
The previous failed tests were done within 25 minutes: 3 failures out of 10 or so attempts. Then I did about 100 tests without failure. Below are a couple of failures from this morning, within about 45 minutes, during which about 50 searches returned valid results.

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=B%C3%A9rat+-incategory:%22B%C3%A9rat%22
* Fail: <!-- Served by mw26 in 10.187 secs. -->
* 320 results: <!-- Served by mw48 in 1.273 secs. -->

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=Commonwealth%20War%20Graves%20Commission%20cemeteries%20in%20England+-incategory:%22Commonwealth_War_Graves_Commission_cemeteries_in_England%22
* Fail: <!-- Served by mw28 in 10.253 secs. -->
* 22 results: <!-- Served by mw52 in 0.552 secs. -->

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=Images%20from%20the%20Geograph%20British%20Isles%20project%20needing%20categories%20in%20grid%20NW9667+-incategory:%22Images_from_the_Geograph_British_Isles_project_needing_categories_in_grid_NW9667%22
* Fail: <!-- Served by srv245 in 10.196 secs. -->
* 1 result: <!-- Served by mw55 in 0.353 secs. -->

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=Schwalm-Radweg+-incategory:%22Schwalm-Radweg%22

* Fail: <!-- Served by srv237 in 10.204 secs. -->
* Fail: <!-- Served by srv267 in 10.181 secs. -->
* Fail: <!-- Served by mw28 in 10.288 secs. -->
* 1 result: <!-- Served by srv261 in 0.224 secs. -->
Comment 37 MZMcBride 2013-01-09 15:54:30 UTC
(In reply to comment #34)
> (In reply to comment #32)
>> Comment 24, comment 25, and comment 26 could not be any more explicit
> 
> For your case (mediawiki.org) yes.
> But comment 29 (that I answered) was about Commons.

Yeah, I'm not sure how many times it needs to be confirmed as broken. It's broken. Really. It needs to be fixed and that requires a sysadmin to investigate, debug, and resolve the issue. Can you find a willing sysadmin, please?
Comment 38 Valerie Juarez 2013-01-10 22:13:34 UTC
(In reply to comment #24)
> Isarra just experienced this problem on mediawiki.org and I was able to
> reproduce (after many tries). The HTML source of the page with no results
> contained "<!-- Served by srv275 in 10.143 secs. -->"...

I could reproduce this error about 10% of the time (About 3 out of 30ish reloads would return no results).

"<!-- Served by mw27 in 10.181 secs. -->" Was contained in the source of one of the pages.

>I'm not sure this is very helpful to debugging.

I would love to know if there is any way we can provide more info from the client side to help track down this issue.
Comment 39 MZMcBride 2013-01-10 22:22:37 UTC
(In reply to comment #38)
> (In reply to comment #24)
> "<!-- Served by mw27 in 10.181 secs. -->" Was contained in the source of one
> of the pages.
> 
>> I'm not sure this is very helpful to debugging.
> 
> I would love to know if there is any way we can provide more info from the
> client side to help track down this issue.

As I understand it, the "served by" HTML comment generally tells a user which server last parsed that particular page. In the context of search results, it tells the user which Apache server served the results. However, I believe for this bug we're interested in which _search_ server or cluster produced the results, not which Apache served the results. I don't believe the search server information is exposed anywhere.
Comment 40 Bawolff (Brian Wolff) 2013-01-10 22:56:03 UTC
(In reply to comment #39)
> (In reply to comment #38)
> > (In reply to comment #24)
> > "<!-- Served by mw27 in 10.181 secs. -->" Was contained in the source of one
> > of the pages.
> > 
> >> I'm not sure this is very helpful to debugging.
> > 
> > I would love to know if there is any way we can provide more info from the
> > client side to help track down this issue.
> 
> As I understand it, the "served by" HTML comment generally tells a user which
> server last parsed that particular page. In the context of search results, it
> tells the user which Apache server served the results. However, I believe for
> this bug we're interested in which _search_ server or cluster produced the
> results, not which Apache served the results.

That's correct, the "Served by mwXX" comment identifies the Apache server. However, there is a comment in the HTML of Special:Search that looks like:

<!-- Search results fetched via search=[search14,search14], highlight=[search14], suggest=[search16] in 476 ms -->

That would probably be more helpful (I assume, anyhow; I'm not overly familiar with the search infrastructure).
Comment 41 Bawolff (Brian Wolff) 2013-01-10 23:03:50 UTC
> 
> <!-- Search results fetched via search=[search14,search14],
> highlight=[search14], suggest=[search16] in 476 ms -->
> 

Just to be 100% clear. That comment is *NOT* from a search that failed. I just copied and pasted from a successful search to show what the comment looked like.
Comment 42 Valerie Juarez 2013-01-10 23:46:09 UTC
I don't see a comment like that on a page where the search failed.
Comment 43 Valerie Juarez 2013-01-11 04:06:36 UTC
Created attachment 11616 [details]
Page generated on failed search
Comment 44 Valerie Juarez 2013-01-11 04:09:53 UTC
Attached the page generated when a search fails, in case that helps.
Comment 45 jeremyb 2013-01-11 16:03:31 UTC
Created attachment 11618 [details]
HTML source diffed between responses with and without results (same URL/query)

took less than 10 tries to get a no results page. (and then had to do it again because my phone OOM'd and still it was <10x)

the fetched via line is in fact missing for the empty result set.
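
Since the "fetched via" line is missing from the empty responses, its presence could serve as a client-side signal. The following is a minimal sketch in Python, assuming the comment format is exactly as quoted in comment 40; the function name is hypothetical. Whether a search that legitimately finds nothing carries the comment is not established in this report, so a missing comment is only a hint that the backend was never reached.

```python
import re

# Format as quoted in comment 40:
# <!-- Search results fetched via search=[a,b], highlight=[c], suggest=[d] in 476 ms -->
FETCHED_VIA = re.compile(
    r"<!-- Search results fetched via (.*?) in (\d+) ms -->", re.DOTALL
)
ROLE = re.compile(r"(\w+)=\[([^\]]*)\]")


def parse_fetched_via(html):
    """Return {"roles": {role: [hosts]}, "ms": int}, or None if absent."""
    match = FETCHED_VIA.search(html)
    if match is None:
        return None
    roles = {
        name: [host.strip() for host in hosts.split(",") if host.strip()]
        for name, hosts in ROLE.findall(match.group(1))
    }
    return {"roles": roles, "ms": int(match.group(2))}


if __name__ == "__main__":
    sample = (
        "<!-- Search results fetched via search=[search14,search14], "
        "highlight=[search14], suggest=[search16] in 476 ms -->"
    )
    print(parse_fetched_via(sample))
    print(parse_fetched_via("<!-- Served by mw27 in 10.181 secs. -->"))
```

Recording the parsed hosts alongside each result would show which search backends answered the successful requests, even though the failed ones expose no backend at all.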
Comment 46 Sumana Harihareswara 2013-02-04 13:17:24 UTC
I am still running into this.  Just now I searched on mediawiki.org for "blog" and got 0 results at first, then reran the search and got a lot.

(See Bug 16236 for more repro cases, in case anyone wants them.)

I am adding Munagala Ramanath (Ram) to cc and raising priority to "Highest" - Ram, can you take a look at this?
Comment 47 Bawolff (Brian Wolff) 2013-02-04 13:41:36 UTC
Per the comments above, where it was discovered that timed-out search requests do not include a comment saying which search server was used, we should probably change that (unless I missed something; I'm not too familiar with search).

More specifically for this problem, logging all failed searches and seeing if there is an obvious pattern in terms of which search host failed would probably be a good idea.
Comment 48 Munagala Ramanath (Ram) 2013-02-04 15:11:08 UTC
Status of this issue is now being tracked in bug 43544. I've attached a script there that allows this problem to be reproduced at will.

I'm fully engaged on this issue but it will be another week or two before there is any material progress since it is taking time to understand the PHP code at one end and the Java code at the other.
Comment 49 Andre Klapper 2013-02-07 18:02:21 UTC
So, apologies for the problems with unreliable search results on several wikipages so far.
Ram is going to take a look at these problems (see bug 42423 comment 48), but it'll take some more time. I'm tentatively assigning this report to Ram.


Trying to summarize the situation:

Issues with unreliable search on Commons: 
Bug 42431 (hmm, marked as dup of bug 42423), bug 43920, bug 35691

Bug 42423 itself is very generic. 
Initial comment mentions hi.wikipedia.org.
wikidata.org is mentioned (bug 42424 marked as dup), {en|fr}.wikisource.org (bug 42426 marked as dup).
It also mentions mediawiki.org (copied from bug 16236 comment 14, and bug 42423 comment 24).
Bug 42423 comment 19 states Commons problems.

Bug 42423 comment 35 and bug 42423 comment 36 imply that some requests take longer than 10 seconds and are then aborted.

Better debugging of such problems is the subject of bug 43544: Show an error message instead of "zero results".
Thanks to Ram, bug 43544 also has a script to reproduce these problems.

A totally separate issue is bug 43663: Search on ua.wikimedia.org (chapter website) does not work AT ALL.

For general information on the Search situation, also see the posting by Ram at
http://lists.wikimedia.org/pipermail/wikitech-l/2013-February/066273.html
Comment 50 Nemo 2013-02-16 13:26:08 UTC
*** Bug 43920 has been marked as a duplicate of this bug. ***
Comment 51 billinghurst 2013-02-28 07:47:54 UTC
I experienced a failure on Meta today. I ran a specific PrefixIndex-directed search, which succeeded; ran the same search with broader criteria, which failed; then went back a minute later and it worked.

Search run from m:User:Billinghurst using the COIBot search boxes, search word Abercrombie, circa 22:33, 27 February 2013 (UTC)
Comment 52 Andre Klapper 2013-03-14 18:35:59 UTC
Quick update:

Ram (who started a few weeks ago) is trying to improve the Search debugging infrastructure first by working on
* bug 45266
* bug 43544
so it will be easier to find potential reasons (some bugs are expected to get fixed by this, or at least easier to identify).

After these two bugs have been resolved, bug 42423 and bug 43663 are very likely next on the list. Sorry that this takes a bit longer, but the plan is to "do it right".
Comment 53 Andre Klapper 2013-05-14 12:22:37 UTC
Bug 43544 is fixed now so there should be at least error messages when the results are inconsistent.

Can anybody say if this is the case? 

With the given examples in this bug report I could not manage so far to get inconsistent results or errors.
Comment 54 MZMcBride 2013-05-14 15:32:36 UTC
I think this bug can probably be marked resolved/fixed at this point.
Comment 55 Foroa 2013-05-14 15:56:10 UTC
I could not observe a false empty search return with zero results. I notice a red time-out message from time to time (say 15 times per week); maybe the message can be massaged a bit, such as "Temporary search engine overload, please try again ...".

I must admit that, now that I have done some additional tests, the system impresses me; I did not manage to get it to time out.
Comment 56 Andre Klapper 2013-05-16 15:59:56 UTC
Closing as per comment 54 and comment 55.
Thanks everybody, and again sorry that improving the situation took a while (and improving the Search is still ongoing work and complicated enough). :-/

Followup issues: bug 45266, bug 47761.
