Last modified: 2010-05-15 16:03:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T17120, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 15120 - Words dropped from index if page titles start with two or more of the same words
Status: RESOLVED INVALID
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: 1.13.x
Hardware: All All
Importance: Normal major
Assigned To: Nobody - You can work on this!
Reported: 2008-08-11 14:05 UTC by Alex
Modified: 2010-05-15 16:03 UTC


Attachments
bad search results - Note: text content of image is (c) 2008 ontolawgy LLC because it comes from a for-profit custom-designed wiki. (241.85 KB, image/png)
2008-08-11 14:05 UTC, Alex

Description Alex 2008-08-11 14:05:19 UTC
Created attachment 5161
bad search results - Note: text content of image is (c) 2008 ontolawgy LLC because it comes from a for-profit custom-designed wiki.

I am building a wiki where some page titles share their first few words with other page titles.
It seems that search terms may be dropped from the index when page titles start with two or more of the same words.

For example: 

"Reasonably available control technology" is the title of one page, and "Reasonably available control measures" is the title of another page. All pages are in the "Main" namespace, which is being searched by default. 

Searches for "Reasonably", "Available", or "reasonably available" come up COMPLETELY blank (0 hits). However, searches for "control" or "reasonably available control" bring up both pages, as well as all the other pages that mention those pages. The results are the same whether or not I include quotes in the query.

A search for "reasonab*" turns up all the relevant pages (i.e., those including the terms "reasonable", "reasonably", etc.) EXCEPT for the pages with "reasonably" in their titles. This seems very strange to me.

There are also pages that share only the first word, in this case "national". Those pages are titled "National primary ambient air quality standard" and "National secondary ambient air quality standard". Searches for "national" turn up all the relevant pages.

I have rebuilt the text index (rebuildtextindex.php) and rebuilt the search index (updateSearchIndex.php), run the update script (update.php), etc., with no change in results. 

This is a fresh wiki on 1.13.0rc1 (that is, built entirely using 1.13.0rc1) using Semantic MediaWiki 1.2 (which appears to break the refreshLinks.php script, but that is an issue for SMW and should not affect searching).

Any help/suggestions would be most welcome.
Comment 1 Alex 2008-08-11 14:27:25 UTC
Note - just upgraded to rc2; bug persists in 1.13.0rc2.
Comment 2 Brion Vibber 2008-08-11 17:06:09 UTC
Ah, MySQL's fulltext stopword list strikes again!

http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html

"reasonably" and "available" are on the list and will thus be ignored in the fulltext index used for searching.

You can override the default stopword list in reasonably current versions of MySQL; see http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html
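
To illustrate the behaviour described in this comment, here is a minimal Python sketch of a toy title index that drops stopwords at both index time and query time. It is a simplified model for illustration only, not MySQL's actual implementation. The function names are hypothetical, and STOPWORDS holds just the two words named above rather than the full default list.

```python
# Toy model of a fulltext index with a stopword list.
# Assumption: STOPWORDS contains only the two words this comment names;
# MySQL's real default list is much longer.
STOPWORDS = {"reasonably", "available"}

TITLES = [
    "Reasonably available control technology",
    "Reasonably available control measures",
    "National primary ambient air quality standard",
    "National secondary ambient air quality standard",
]


def tokenize(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


def build_index(titles):
    """Map each indexed (non-stopword) term to the set of titles containing it."""
    index = {}
    for title in titles:
        for word in tokenize(title):
            index.setdefault(word, set()).add(title)
    return index


def search(index, query):
    """Return titles matching every non-stopword term in the query.

    A query made up only of stopwords has no usable terms and returns nothing.
    """
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


INDEX = build_index(TITLES)

if __name__ == "__main__":
    for query in ("reasonably available", "control",
                  "reasonably available control", "national"):
        print(query, "->", sorted(search(INDEX, query)))
```

In this model, "reasonably available" finds nothing because neither word was ever indexed, while "reasonably available control" still matches through "control" alone. "National" is unaffected because it is not in the stopword set, which matches the results in the original report.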
Comment 3 Alex 2008-08-28 17:07:27 UTC
Thanks - that's frustrating. If I can get Lucene running on the server, that seems to be a viable option, but I doubt my host would allow that. 1.13.0 at least seems to offer the "pages starting with" and "pages linking to" options, which look pretty helpful.
Comment 4 Niklas Laxström 2008-09-28 07:58:36 UTC
Can this bug be closed, or are there any actions that should be taken?
