Last modified: 2013-09-30 19:57:24 UTC
I search for "savepage" and "screenshot" and "savepage-greyed" on test2, e.g., https://test2.wikipedia.org/w/index.php?title=Special:Search&search=screenshot&fulltext=Search&profile=all&redirs=0 , and do not find https://test2.wikipedia.org/wiki/File:Savepage-greyed.png in the results, even though that file has the word "Screenshot" in its file summary.
(In reply to comment #0)
> I search for "savepage" and "screenshot" and "savepage-greyed" on test2, e.g.,
> https://test2.wikipedia.org/w/index.php?title=Special:Search&search=screenshot&fulltext=Search&profile=all&redirs=0 ,
> and do not find https://test2.wikipedia.org/wiki/File:Savepage-greyed.png in
> the results, even though that file has the word "Screenshot" in its file
> summary.

I would be more concerned about it not picking up "Screenshot" from the image description page body than about it not picking the word out of the img_comment.
Triaging to high. Weird.
The following searches seem to find it just fine:
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=Savepage-greyed.png&fulltext=Search&srbackend=CirrusSearch
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=Savepage&fulltext=Search
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=greyed.png&fulltext=Search
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=screenshot&fulltext=Search

But this search didn't find the file and probably should:
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=greyed&fulltext=Search

It seems to fail because Savepage-greyed.png is tokenized as "savepag" and "greyed.png" [1], which isn't really what we want. I'm not sure what we do want, though. Maybe "savepag", "grey", and ".png".

[1] Running http://<elasticsearch_host>:9200/nikwiki_general/_analyze?analyzer=text&text=Savepage-greyed.png spits out
{
  "tokens": [
    { "token": "savepag", "start_offset": 0, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "greyed.png", "start_offset": 9, "end_offset": 19, "type": "<ALPHANUM>", "position": 2 }
  ]
}
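For anyone who wants to repeat the check in [1], here is a minimal Python sketch. It only builds the _analyze URL and pulls the token strings out of a response body; it does not contact a server. The host "localhost" and the helper names analyze_url and token_strings are assumptions made for illustration, and the sample response is the output quoted above.

```python
import json
from urllib.parse import urlencode


def analyze_url(host, index, analyzer, text):
    """Build an _analyze URL of the form used in [1] (host is a placeholder)."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"http://{host}:9200/{index}/_analyze?{query}"


def token_strings(response_body):
    """Return just the token text from an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


# Response body copied from the comment above.
SAMPLE_RESPONSE = """
{ "tokens": [
  { "token": "savepag", "start_offset": 0, "end_offset": 8,
    "type": "<ALPHANUM>", "position": 1 },
  { "token": "greyed.png", "start_offset": 9, "end_offset": 19,
    "type": "<ALPHANUM>", "position": 2 } ] }
"""

if __name__ == "__main__":
    # "localhost" is an assumed host, not taken from the bug report.
    print(analyze_url("localhost", "nikwiki_general", "text", "Savepage-greyed.png"))
    print(token_strings(SAMPLE_RESPONSE))
```

The token list makes the failure easier to see: "greyed" never appears as a token on its own, only as part of "greyed.png".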
By the way, I'm not sure why it wasn't working when you reported the problem but is working now. I'm going to add a few more regression tests to make sure that the searches that do work continue to work. Then I might merge the tokenizing problem into Bug 53013 and continue to work through the remaining bugs.
I've added some file search regression tests in https://gerrit.wikimedia.org/r/#/c/80074/ . Now that most searches seem to be working and we've got regression tests for them I'm going to lower this to normal priority and work on the stemming problem I mentioned in Comment 3 when I've knocked out the higher priority bugs.
I want to fix this with a Pattern Capture token filter, but that isn't in the version of Elasticsearch we're using (0.90.2). It _is_ in 0.90.3, but since 0.90.4 is supposed to be coming out "early next week" and we've got a bunch of bugs waiting on that, I'm tagging this as waiting on that too. With this filter, fixing the last portion of this bug should be pretty simple.
I looked into this some more and I'm still not happy with it. I can fix it by adding a PatternCaptureFilter with the pattern "([^\.]+)" but that has some problems:
1. Highlighting just gets really really confused. If one part matches then the whole thing matches.
2. Adding a regex like that to every single token can't be quick.
3. I just looked at Bug 54669 which wanted _more_ precision around funky token patterns.

I'm going to resolve this as fixed now because the original problem, not being able to find a file name, is pretty well fixed. I'd love to hear arguments for splitting file names around the ".", but I'm currently convinced it isn't a good idea.
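To make the trade-off concrete, here is a rough Python simulation of what a pattern capture filter with the pattern "([^\.]+)" would emit for the two tokens from the earlier analysis. It is a simplified sketch, not the Elasticsearch or Lucene implementation: the function name pattern_capture is hypothetical, and it assumes the original token is kept alongside the captures and that duplicates are dropped.

```python
import re


def pattern_capture(token, pattern=r"([^\.]+)", preserve_original=True):
    """Simulate a pattern capture filter on a single token.

    Emits the original token (if preserve_original) followed by every
    capture-group match found anywhere in the token, skipping duplicates.
    """
    emitted = [token] if preserve_original else []
    for match in re.finditer(pattern, token):
        for group in match.groups():
            if group and group not in emitted:
                emitted.append(group)
    return emitted


if __name__ == "__main__":
    for token in ["savepag", "greyed.png"]:
        print(token, "->", pattern_capture(token))
```

Under these assumptions "greyed.png" also yields "greyed" and "png", so a search for "greyed" would have something to match. The regex still runs against every token, including ones like "savepag" that contain no dot, which is the cost raised in point 2.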