Last modified: 2010-05-15 16:03:20 UTC
Created attachment 5161 [details] bad search results - Note: text content of image is (c) 2008 ontolawgy LLC because it comes from a for-profit custom-designed wiki.

I am building a wiki where some page titles share some of the first few words with other page titles. It seems that there may be an issue with search terms being dropped from the index when page titles start with two or more of the same words.

For example: "Reasonably available control technology" is the title of one page, and "Reasonably available control measures" is the title of another page. All pages are in the "Main" namespace, which is being searched by default.

Searches for "Reasonably", "Available", or "reasonably available" come up COMPLETELY blank (0 hits); however, searches for "control" or "reasonably available control" bring up both pages, as well as all the other pages that mention those pages. The results are the same whether or not I include quotes in the query.

A search for "reasonab*" turns up all the relevant pages (i.e., those including the terms "reasonable", "reasonably", etc.) EXCEPT for the pages with "reasonably" in their titles. This seems very strange to me...

There are also pages that share only the first word, in this case "national". Those pages are titled "National primary ambient air quality standard" and "National secondary ambient air quality standard". Searches for "national" turn up all the relevant pages.

I have rebuilt the text index (rebuildtextindex.php), rebuilt the search index (updateSearchIndex.php), run the update script (update.php), etc., with no change in results.

This is a fresh wiki on 1.13.0rc1 (that is, built entirely using 1.13.0rc1) using Semantic MediaWiki 1.2 (which appears to break the refreshLinks.php script, but that is an issue for SMW and should not affect searching...).

Any help/suggestions would be most welcome.
Note - just upgraded to rc2; bug persists in 1.13.0rc2.
Ah, MySQL's fulltext stopword list strikes again! http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html "reasonably" and "available" are on the list and will thus be ignored in the fulltext index used for searching. You can override the default stopword list in reasonably current versions of MySQL; see http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html
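To illustrate why the queries above behave as they do, here is a minimal sketch of stopword filtering, written in Python rather than anything MySQL actually runs. It assumes a two-word stopword set (the real list is much longer), a whitespace tokenizer, and AND-style matching; the names `STOPWORDS`, `tokenize`, `build_index`, and `search` are made up for the example, and MySQL's minimum word length setting is not modelled.

```python
# Simplified model of stopword filtering in a fulltext index.
# STOPWORDS is a tiny illustrative subset, not the real MySQL list.
STOPWORDS = {"reasonably", "available"}

TITLES = [
    "Reasonably available control technology",
    "Reasonably available control measures",
    "National primary ambient air quality standard",
    "National secondary ambient air quality standard",
]


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_index(titles):
    """Map each indexed word to the set of titles containing it."""
    index = {}
    for title in titles:
        for word in tokenize(title):
            index.setdefault(word, set()).add(title)
    return index


def search(index, query):
    """Return titles containing every non-stopword term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()  # only stopwords were given, so nothing can match
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)


INDEX = build_index(TITLES)
print(len(search(INDEX, "reasonably available")))          # 0
print(len(search(INDEX, "reasonably available control")))  # 2
print(len(search(INDEX, "national")))                      # 2
```

In this model "reasonably" and "available" never reach the index, so a query made up only of those words has nothing left to match, while "reasonably available control" is effectively a search for "control".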
Thanks - that's frustrating. If I can get Lucene running on the server, that seems to be a viable option, but I doubt my host would allow it. 1.13.0 at least seems to offer the "pages starting with" and "pages linking to" options, which look pretty helpful...
Can this bug be closed, or are there actions that still need to be taken?