Last modified: 2014-06-25 13:59:12 UTC
Hello, since the Unicode normalization analyzer was installed for Hebrew, some expected search results are missing. How to reproduce: compare the results from this search in wikidata: https://www.wikidata.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=Special%3ASearch&go=%D7%9C%D7%93%D7%A3 to the same search in hebrew wiki: https://he.wikipedia.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=%D7%9E%D7%99%D7%95%D7%97%D7%93%3A%D7%97%D7%99%D7%A4%D7%95%D7%A9&go=%D7%9C%D7%A2%D7%A8%D7%9A One would expect the five results shown in the wikidata search to show up in hebrew wiki as well, but the first and last results on wikidata don't appear in the hebrew wiki search results.
Results 1 and 5 in wikidata look like results 1 and 2 on hewiki. I wonder if we lost those pages temporarily? That'd be bad.
Or, am I reading it wrong?
This is a better comparison:
https://www.wikidata.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=Special%3ASearch&go=%D7%9C%D7%93%D7%A3
to
https://he.wikipedia.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=%D7%9E%D7%99%D7%95%D7%97%D7%93%3A%D7%97%D7%99%D7%A4%D7%95%D7%A9&go=%D7%9C%D7%A2%D7%A8%D7%9A&fulltext=1
The first result in wikidata (https://www.wikidata.org/wiki/Q7003270) isn't in the hewiki results. On further digging, the page exists at (https://he.wikipedia.org/wiki/%D7%A7%D7%9C%D7%99%D7%A4%D7%95%D7%A8%D7%93_%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99) but when I try to fetch it from the search index it isn't in there:
manybubbles@elastic1003:~$ curl localhost:9200/hewiki_content/page/495403
{"_index":"hewiki_content_1401724632","_type":"page","_id":"495403","found":false}
So what is the deal?
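For anyone wanting to script the same check, here is a minimal Python sketch that reads the JSON body returned by the curl call above and reports whether the page is in the index. The function name `is_indexed` is made up for illustration; the only thing taken from the thread is the response body and its `found` field.

```python
import json


def is_indexed(response_body: str) -> bool:
    """Return True if a get-document response says the page is in the index.

    Hypothetical helper: it only inspects the "found" field seen in the
    response quoted in this thread, and treats a missing field as "not found".
    """
    doc = json.loads(response_body)
    return bool(doc.get("found", False))


# Response body copied from the curl call against hewiki_content above.
response = (
    '{"_index":"hewiki_content_1401724632","_type":"page",'
    '"_id":"495403","found":false}'
)

if not is_indexed(response):
    print("page 495403 is missing from the search index")
```

Working on the response body rather than making the HTTP request keeps the sketch independent of any particular cluster address.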
(In reply to Nik Everett from comment #3)
> So what is the deal?
That is rhetorical - I'm going to figure it out.
I added that page back into the index:
manybubbles@terbium:~$ mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki hewiki --fromId 495402 --toId 495403
Indexed 1 pages ending at 495403 at 6/second
Indexed a total of 1 pages at 6/second
manybubbles@terbium:~$
That's just remediation. Now to figure out why it wasn't in there in the first place.
Change 138835 had a related patch set uploaded by Manybubbles: Add a maintenance script to make the index sane https://gerrit.wikimedia.org/r/138835
I've written a tool to scan the index and look for insanity. I'm tempted to chalk some of the insanity in Hebrew up to the Hebrew analyzer, which was buggy and was in production for two weeks. The tool should heal whatever damage it did. Then we'll run it again a few days later and see if we get _more_ insanity. Anything it finds then will have the benefit of being recent.
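To illustrate the kind of scan described above, here is a rough Python sketch. It is not the code from change 138835. It assumes "insanity" means disagreement between the page IDs the wiki knows about and the page IDs present in the search index, and the function name `find_insanity` and all IDs other than 495403 are made up.

```python
def find_insanity(db_page_ids, indexed_page_ids):
    """Compare page IDs from the wiki with page IDs found in the index.

    Returns (missing, ghosts):
      missing - pages that exist on the wiki but are absent from the index
      ghosts  - pages present in the index that no longer exist on the wiki
    Assumption: both inputs are plain iterables of integer page IDs.
    """
    db = set(db_page_ids)
    indexed = set(indexed_page_ids)
    missing = sorted(db - indexed)
    ghosts = sorted(indexed - db)
    return missing, ghosts


# Illustrative data only: 495403 is the page from this bug, the rest is invented.
missing, ghosts = find_insanity(
    db_page_ids=[495401, 495402, 495403],
    indexed_page_ids=[495401, 495402, 500000],
)
print("needs reindexing:", missing)
print("needs deleting from index:", ghosts)
```

Missing pages would then be fed back through something like forceSearchIndex.php, as was done by hand for page 495403.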
Change 138835 merged by jenkins-bot: Add a maintenance script to make the index sane https://gerrit.wikimedia.org/r/138835
Saneitizer seems to have done the trick here. I'm going to claim it was the broken analyzer. If we lose more pages I'll revise that claim.