Last modified: 2014-02-04 17:23:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T44234, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 42234 - Full text search index is corrupt (index rebuild ignores content model)
Status: VERIFIED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: High major with 2 votes
Target Milestone: ---
Assigned To: Munagala Ramanath (Ram)
Keywords: i18n
Duplicates: 45860
Depends on: 41532 45983 54201
Blocks: 44529
Reported: 2012-11-18 01:03 UTC by jeblad
Modified: 2014-02-04 17:23 UTC (History)
Description jeblad 2012-11-18 01:03:19 UTC
When searching for some items with non-English letters, like "Kimi Räikkönen", the search does not give any results. However, it recognizes the letter "ü". When searching for just "Kimi", Kimi Räikkönen can be found, but the matched string uses the following code: "Kimi R\u00e4ikk\u00f6nen".

A possible reason could be that the unserialization fails for the indexed document.
Comment 1 Daniel Kinzler 2012-11-18 20:03:34 UTC
This is caused (or rather, fixed) by bug 41532, which is closed because the fix is in master now. Closing. 

If the problem is still there in about two weeks (when the new version should have gone live), please re-open.
Comment 2 Stryn 2013-01-24 06:56:07 UTC
I'm not sure whether this problem is related, but if I search for "täsmennyssivu" (which means disambiguation page in Finnish) I don't get any results. The search engine then asks "did you mean: u00e4smennyssivu", and "u00e4smennyssivu" gives results for täsmennyssivu.

And if I search for "Räikkönen", the search engine (http://www.wikidata.org/w/index.php?search=r%C3%A4ikk%C3%B6nen&title=Special%3ASearch) gives only one result: Ville Räikkönen. It does not find e.g. Kimi Räikkönen.

And Kimi Räikkönen can be found if I search Kimi R\u00e4ikk\u00f6nen, but not if I search Kimi Räikkönen.
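
The symptoms above are consistent with the raw JSON serialization of an item being indexed instead of readable text. The following is a minimal Python sketch of that effect, not the actual indexing code: the item layout, the `tokenize` helper, and the regex-based tokenizer are assumptions made for illustration, and it assumes the stored JSON escapes non-ASCII letters as `\uXXXX`.

```python
import json
import re

# Hypothetical, heavily simplified item content (assumption).
item = {"label": {"fi": "täsmennyssivu", "en": "Kimi Räikkönen"}}

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so "ä" is stored as the six characters \u00e4.
raw = json.dumps(item)

def tokenize(text):
    """Naive word tokenizer standing in for the search engine's analyzer."""
    return [t.lower() for t in re.findall(r"\w+", text)]

# The backslash is not a word character, so each escape splits the word
# and leaves a token starting with "u00e4...".
tokens = tokenize(raw)
print(tokens)
```

Under these assumptions the index contains `u00e4smennyssivu` but never `täsmennyssivu`, which would explain both the "did you mean" suggestion and why only the escaped spelling matches.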
Comment 3 jeblad 2013-01-24 15:23:45 UTC
JSON is now being indexed again. Are we using an old version of the OAI extension?
Comment 4 Daniel Kinzler 2013-01-24 15:31:28 UTC
The OAI extension is providing a flat text version for indexing: https://www.wikidata.org/w/index.php/Special:OAIRepository?verb=ListRecords&metadataPrefix=lsearch&from=2013-01-10T20:30:00Z

Was LuceneSearch changed to no longer use this?
Comment 5 jeblad 2013-01-24 17:40:17 UTC
The small municipality Höör (Q765434) in Sweden got updated, and it is not possible to find it in the search (http://www.wikidata.org/w/index.php?search=H%C3%B6%C3%B6r&title=Special%3ASearch). That could mean the indexes are not updated, but it could also mean the search itself is broken.

A new item is "The Man Who Shook the Hand of Vicente Fernandez" (http://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=default&search=The+Man+Who+Shook+the+Hand+of+Vicente+Fernandez&fulltext=Search) and that too can't be found.

The city of "Ålesund" (http://www.wikidata.org/w/index.php?search=%C3%85lesund&title=Special%3ASearch) can be found; that is an old item.

The city of "Göteborg" (http://www.wikidata.org/w/index.php?search=g%C3%B6teborg&title=Special%3ASearch) can also be found, and this too is an old item.

Seems to me that the index is broken.
Comment 6 Rob Lanphier 2013-01-25 00:52:34 UTC
Tim, could you take a look at this?
Comment 7 Nemo 2013-01-31 09:09:40 UTC
(In reply to comment #5)
> The small municipality Höör (Q765434) in Sweden got updated and it is not
> possible to find it in the search
> (http://www.wikidata.org/w/index.php?search=H%C3%B6%C3%B6r&title=Special%3ASearch)
> that could mean the indexes are not updated, but note that it could also mean
> the search is broken also.

This still doesn't work.

> 
> A new item is "The Man Who Shook the Hand of Vicente Fernandez"
> (http://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=default&search=The+Man+Who+Shook+the+Hand+of+Vicente+Fernandez&fulltext=Search)
> and that too can't be found.

This now works.

Another example reported on it.wiki is "Iván Moro". You have to use the search gadget to find it.
Comment 8 Andre Klapper 2013-02-04 17:34:57 UTC
Tim: Did you have a chance to take a look at this?
Comment 10 jeblad 2013-02-24 22:03:54 UTC
Dilts, created 19:52, 24 February 2013: http://www.wikidata.org/w/index.php?title=Q5277075&action=history

Searched 23:03, no go
Comment 11 Tim Starling 2013-02-25 19:29:43 UTC
The search index for wikidatawiki probably needs to be rebuilt.
Comment 12 Tim Starling 2013-02-27 23:16:43 UTC
Bash history and file modification timestamps on searchidx2 and searchidx1001 seem to indicate that the wikidatawiki index hasn't been rebuilt since November 14.
Comment 13 Lydia Pintscher 2013-03-01 15:17:46 UTC
Thanks for investigating, Tim. Any chance you can fix this? Anything I can tell the community (who's rather unhappy about the search)?
Comment 14 Munagala Ramanath (Ram) 2013-03-01 15:35:34 UTC
Looks like Tim fixed it -- timestamp on searchidx1001 for wikidatawiki is today:

cat ../status/wikidatawiki 
#Last incremental update timestamp
#Fri Mar 01 03:42:21 UTC 2013
timestamp=2013-03-01T03\:41\:07Z

Many of the index files have a timestamp of yesterday or today.
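
The status file shown above looks like a Java properties file (note the escaped colons), so the last incremental update can be read mechanically. A minimal Python sketch, assuming that format; the function names `last_update` and `is_stale` and the one-day threshold are hypothetical and not part of the lsearch tooling.

```python
from datetime import datetime, timedelta, timezone

# Contents copied from the status file quoted in this comment.
STATUS = r"""#Last incremental update timestamp
#Fri Mar 01 03:42:21 UTC 2013
timestamp=2013-03-01T03\:41\:07Z
"""

def last_update(status_text):
    """Return the 'timestamp' property as an aware UTC datetime, or None."""
    for line in status_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        if key.strip() == "timestamp":
            value = value.strip().replace("\\:", ":")  # undo properties escaping
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None

def is_stale(status_text, now, max_age=timedelta(days=1)):
    """True if no timestamp is present or it is older than max_age (assumed threshold)."""
    ts = last_update(status_text)
    return ts is None or now - ts > max_age

print(last_update(STATUS))
```

A check like this only shows that incremental updates are running; it says nothing about whether the indexed text is correct.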
Comment 15 Leinad 2013-03-01 15:45:21 UTC
(In reply to comment #13)
> Thanks for investigating, Tim. Any chance you can fix this? Anything I can
> tell the community (who's rather unhappy about the search)?

Hi,
I would like to suggest postponing the deployment of Wikidata on projects like plwiki until this bug is fixed. This is a really important issue, and in my opinion it will cause a negative impression of the new tool. On plwiki we still have trouble convincing the community of the advantages of Wikidata, and such bugs won't help us.

Names like "Łódź" are impossible to search: http://www.wikidata.org/w/index.php?search=%C5%81%C3%B3d%C5%BA&title=Special%3ASearch
Comment 16 Nemo 2013-03-01 15:58:52 UTC
(In reply to comment #15)
> Names like "Łódź" are impossible to search:
> http://www.wikidata.org/w/index.php?search=%C5%81%C3%B3d%C5%BA&title=Special%3ASearch

On it.wiki users were just told not to use Special:Search at all, because it's completely useless, and to rely on the search gadget (enabled by default on Vector) which is activated by clicking the arrow next to the search bar. You should probably do the same and forget the standard search: this helped a lot on it.wiki.
Comment 17 Daniel Zahn 2013-03-01 22:31:35 UTC
link to RT-4625
Comment 18 Andre Klapper 2013-03-06 08:44:11 UTC
Make that last comment RT #4625
Comment 19 jeremyb 2013-03-07 03:43:51 UTC
<notpeter>  I have rebuilt the index from a fresh dump of wikidatawiki. this should hopefully fix the problem. if the problem persists, please re-open this ticket.
Comment 20 Nemo 2013-03-07 06:07:50 UTC
(In reply to comment #15)
> Names like "Łódź" are impossible to search:
> http://www.wikidata.org/w/index.php?search=%C5%81%C3%B3d%C5%BA&title=Special%3ASearch

Still getting no result as of now. The other examples here seem to work.
Comment 21 Andre Klapper 2013-03-07 12:05:38 UTC
Confirming that Łódź is still a problem for wikidata.org.
Reopening as per comment 19, though not sure if this is the same problem.
Comment 22 Daniel Kinzler 2013-03-07 12:57:33 UTC
(In reply to comment #19)
> <notpeter>  I have rebuilt the index from a fresh dump of wikidatawiki. this
> should hopefully fix the problem. if the problem persists, please re-open
> this ticket.

Oh... how does rebuilding the index from a dump work? Which code does it use? Can it handle non-wikitext content at all? If not, it will index the JSON...

For the live updates, I have implemented the required support in the OAI extension, so OAI's lsearch output is not JSON but (generated) plain text. The same needs to be done when re-indexing based on dumps, I suppose. So far, I assumed that the rebuild would be using the same interface to access the data. If that is not the case, rebuilding the index might actually cause *more* breakage.
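
To illustrate the difference between indexing JSON and indexing generated plain text, here is a minimal Python sketch. It is not the OAI extension's code: the entity layout (`label`, `description`, `aliases` keyed by language code) is a simplified assumption, and `flatten_for_index` is a hypothetical helper.

```python
import json

# Hypothetical, simplified entity serialization (assumption).
entity_json = json.dumps({
    "label": {"pl": "Łódź", "de": "Lodz"},
    "description": {"en": "city in Poland"},
    "aliases": {"en": ["Lodz"]},
})

def flatten_for_index(serialized):
    """Turn the JSON blob into newline-separated human-readable text."""
    data = json.loads(serialized)  # decoding restores the real characters
    parts = []
    for field in ("label", "description", "aliases"):
        for value in data.get(field, {}).values():
            if isinstance(value, list):
                parts.extend(value)   # aliases: several strings per language
            else:
                parts.append(value)   # label/description: one string per language
    return "\n".join(parts)

flat = flatten_for_index(entity_json)
print(flat)
```

The flat text contains "Łódź" as written, with no braces, keys, or `\uXXXX` escapes, which is what a full text analyzer needs to see.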
Comment 23 Munagala Ramanath (Ram) 2013-03-07 16:45:26 UTC
Not sure exactly how notpeter did it but one way is to use the import-file()
function in puppet/files/lucene/lucene.jobs.sh. There is also an import-db()
function that dumps the DB to a file and runs the former function on that file.

It uses the Java class org.wikimedia.lsearch.importer.BuildAll. I don't yet know
this part of the code well enough to answer the other questions.
Comment 24 Daniel Kinzler 2013-03-07 16:52:19 UTC
(In reply to comment #23)
> Not sure exactly how notpeter did it but one way is to use the import-file()
> function in puppet/files/lucene/lucene.jobs.sh. There is also an import-db()
> function that dumps the DB to a file and runs the former function on that
> file.
> 
> It uses the Java class org.wikimedia.lsearch.importer.BuildAll. I don't yet
> know this part of the code well enough to answer the other questions.

We don't have any handling of non-wikitext content in Java, and I don't see how it could be added... we'd either have to create specialized dumps, or implement the entire content handler infrastructure in Java (including Java versions of content handlers supplied by extensions), or not use dumps and always call the API.

None of the options sounds good :\
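
For reference, the general shape of a content-model-aware rebuild can be sketched in a few lines of Python. This is only an illustration of the problem, not a proposal for the Java importer: it assumes each revision in the dump carries a `model` element, uses a simplified dump without XML namespaces, and the `EXTRACTORS` registry and helper names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, hypothetical dump fragment (real dumps are namespaced and larger).
DUMP = r"""<mediawiki>
  <page><title>Help:Contents</title>
    <revision><model>wikitext</model><text>Welcome to the help pages.</text></revision>
  </page>
  <page><title>Q1</title>
    <revision><model>wikibase-item</model><text>{"label": {"sv": "H\u00f6\u00f6r"}}</text></revision>
  </page>
</mediawiki>"""

def json_to_text(raw):
    """Collect every string value in a JSON document (keys are ignored)."""
    out = []
    def walk(node):
        if isinstance(node, str):
            out.append(node)
        elif isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
    walk(json.loads(raw))
    return " ".join(out)

# One text extractor per content model; wikitext is passed through unchanged.
EXTRACTORS = {"wikitext": lambda raw: raw, "wikibase-item": json_to_text}

def index_texts(dump_xml):
    """Map page title to indexable text, dispatching on the content model."""
    result = {}
    for page in ET.fromstring(dump_xml).iter("page"):
        rev = page.find("revision")
        model = rev.findtext("model", default="wikitext")
        extract = EXTRACTORS.get(model)
        if extract is None:
            continue  # unknown model: skip rather than index the raw blob
        result[page.findtext("title")] = extract(rev.findtext("text", default=""))
    return result

texts = index_texts(DUMP)
print(texts)
```

The catch described above remains: every content model supplied by an extension would need its own extractor on the indexing side, which is why a specialized dump that already contains the flat text is attractive.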
Comment 25 Daniel Kinzler 2013-03-11 10:14:28 UTC
*** Bug 45860 has been marked as a duplicate of this bug. ***
Comment 26 Daniel Kinzler 2013-03-11 10:20:24 UTC
A brief discussion on wikitech-l suggests using a special XML dump for this purpose, see http://www.gossamer-threads.com/lists/wiki/wikitech/340638

I filed that as bug 45983.
Comment 27 Andre Klapper 2013-03-14 18:57:39 UTC
RT comment is "Nothing else for ops to do right now."
Tentatively assigning to Ram, though this needs more time (see comment 23).
Comment 28 denny vrandecic 2013-06-27 10:06:52 UTC
Currently the index rewrite mechanism is being reworked. Until then, there's not much we will do here.

Also note that searching for entities by their label actually does work for the given examples. It is merely the full text search that does not return appropriate results.
Comment 29 Chad H. 2014-01-29 18:26:41 UTC
Is this still a problem since we're using Cirrus on wikidatawiki?
Comment 30 Lydia Pintscher 2014-01-29 18:32:55 UTC
The examples I could find all work. I'm closing it. If there are still issues please reopen.
