Last modified: 2014-08-13 20:46:34 UTC
I tried to search for articles with "intitle:dari Spanyol" (from Spain) in the title, but it returned 0 results. The same happens when I search for "intitle:dari" (from), but searching for "intitle:Spanyol" (Spain) gives the expected results.
1. https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3Adari+spanyol&fulltext=Search&uselang=en
2. https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3Adari&fulltext=Search&uselang=en
3. https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3Aspanyol&fulltext=Search&uselang=en
Expecting some kind of error message other than "There were no results matching the query."
Something is certainly weird here. Temporary workaround (quote the phrase): https://id.wikipedia.org/w/index.php?title=Istimewa%3APencarian&profile=default&search=intitle%3A%22dari+spanyol%22&fulltext=Search
I suspect this is caused by language-based stop words, in this case for Indonesian, for three reasons:
1. other Indonesian stop words also return nothing ("intitle:di" - in, "intitle:ke" - to)
2. those words ("intitle:di", "intitle:ke", "intitle:dari") are found in other projects
3. in my experience, id.wp's CirrusSearch employs some kind of Indonesian-language stemmer
If that is true, is it possible to disable the stop words?
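To illustrate the suspicion above, here is a minimal Python sketch of one way stop-word removal at index time could make "intitle:dari" return nothing. This is not CirrusSearch's actual code: the stop-word set, the page titles, and the helper names (`analyze`, `build_title_index`, `intitle`) are all made up for illustration.

```python
from collections import defaultdict

# Assumption: a tiny illustrative subset, not the real Indonesian stop-word list.
ASSUMED_STOPWORDS = frozenset({"dari", "di", "ke"})

def analyze(text, stopwords):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

def build_title_index(titles, stopwords):
    """Map each surviving token to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for tok in analyze(title, stopwords):
            index[tok].add(title)
    return index

def intitle(index, word):
    """Look up a single word in the title index."""
    return index.get(word.lower(), set())

# Hypothetical page titles.
titles = ["Isabella dari Spanyol", "Spanyol"]

with_stopwords = build_title_index(titles, ASSUMED_STOPWORDS)
without_stopwords = build_title_index(titles, frozenset())

print(intitle(with_stopwords, "dari"))     # "dari" was never indexed
print(intitle(with_stopwords, "Spanyol"))  # both titles
print(intitle(without_stopwords, "dari"))  # found once stop words are disabled
```

In this model the word is simply absent from the index, so the search engine has nothing to report except "no results", which would match what I see.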
Further investigation: Searching "intitle:di" in Italian Wikipedia also failed:
https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&profile=advanced&search=intitle%3Adi&fulltext=Search&ns0=1&profile=advanced
But searching "intitle:from" in English Wikipedia and "intitle:von" in German Wikipedia yields the expected results. (btw, my search context was noble titles, e.g. "ABC from XYZ", which translates to "ABC dari XYZ" in id.wp and "ABC di XYZ" in it.wp, and so on)
Further investigation: comparing searches across closely related projects (id.wp with ms.wp, it.wp with scn.wp, and en.wp with simple.wp):
"intitle:dari"
1 id.wp - failed
2 ms.wp - success
"intitle:di"
3 it.wp - failed
4 scn.wp - failed
"intitle:of"
5 en.wp - error
6 simple.wp - success
links:
1 https://id.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Adari
2 https://ms.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Adari
3 https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&profile=advanced&fulltext=Search&search=intitle%3Adi
4 https://scn.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Adi
5 https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Aof
6 https://simple.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&fulltext=Search&search=intitle%3Aof
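The comparison links above all follow the same pattern, so anyone who wants to repeat the test on other wikis could generate them instead of typing them by hand. A small Python sketch using only the standard library; the helper name `search_url` and the `CASES` table are my own, and the special-page title is a parameter because it is localized on some wikis (e.g. Speciale:Ricerca on it.wp).

```python
from urllib.parse import urlencode

def search_url(host, word, special_page="Special:Search"):
    """Build an advanced-profile full-text search URL for intitle:<word>."""
    params = {
        "title": special_page,
        "profile": "advanced",
        "fulltext": "Search",
        "search": f"intitle:{word}",
    }
    # urlencode percent-encodes ":" as %3A, matching the links in this report.
    return f"https://{host}/w/index.php?{urlencode(params)}"

# (host, word, special page) for the six cases compared in this report.
CASES = [
    ("id.wikipedia.org", "dari", "Special:Search"),
    ("ms.wikipedia.org", "dari", "Special:Search"),
    ("it.wikipedia.org", "di", "Speciale:Ricerca"),
    ("scn.wikipedia.org", "di", "Special:Search"),
    ("en.wikipedia.org", "of", "Special:Search"),
    ("simple.wikipedia.org", "of", "Special:Search"),
]

for number, (host, word, page) in enumerate(CASES, start=1):
    print(number, search_url(host, word, page))
```

This only builds the links; judging "failed" versus "success" still means opening each one and looking at the result page.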
Probably related:
* [[bugzilla:54875]] Automatic stopwords for the 200+ languages without their own analyzer available
* [[bugzilla:60362]] CirrusSearch: Stopwords are not optional and are worth as much as exact matches
* https://www.mail-archive.com/mediawiki-commits@lists.wikimedia.org/msg169298.html
So, where can I look at the Indonesian stopwords list, and/or stemmer?
Looks like this is the stemmer: https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/id/IndonesianStemmer.java
These are the stopwords: https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt
Those bugs are related. The reason we haven't fixed them is that it's a pretty large effort and we're still concentrating on performance. It's on the list, but it isn't as high as I'd like it to be.