Last modified: 2008-05-19 17:51:19 UTC
This is an enhancement bug to track the suggestion searching feature request, as discussed recently on wikitech-l. To summarize:

* The behaviour of the current MediaWiki search box would change, so that instead of a straight text field, it would be more akin to an electronic index that tries to help you find what you are looking for as you type, without submitting the whole form.
* The user in their preferences would specify (opt-in) that they want to use suggestion searching. Since suggestion searching uses AJAX, it would probably be best to default this to being off so that backwards compatibility is retained for older non-JavaScript browsers, or clients with slow, expensive, high-latency connections (e.g. mobile phone devices).
* Suggestion searching would show a list of the possible page names matching what the user has typed thus far (limited to, say, the top 10 matches, possibly ranked by popularity).
* As the user types more, the suggestions would become more specific.
* The user is able to arrow up or arrow down through the list of suggestions to select / highlight their choice.
* Pressing enter should probably open the topmost choice in the list of suggestions, or the highlighted suggestion (if the first item is not the highlighted one).
* Potentially the suggestion has autocomplete functionality, whereby the next few letters are filled in and highlighted where it seems probable that this is what the user is going to type.

If it helps to visualize what's being described, there are some screenshots to give an idea of what it could potentially look like here: http://nickj.org/images/8/80/03-autocompletion-kicks-in.png and here: http://nickj.org/images/b/b3/04-found-desired-article.png

One potential implementation for this idea would be the server-side program from Julien Lemoine. A web interface to this program can be accessed at http://en.suggest.speedblue.org/ (e.g. the current MediaWiki Search box would behave something like the search box on this site, except integrated into MediaWiki), and GPL source code can also be downloaded.

However, there are a few things which would be good to see happen to the above Suggestion Searching to help integrate it into MediaWiki:

* Currently the index generation uses the pages-articles.xml + all-titles-in-ns0 dump files downloaded from download.wikipedia.org. It would probably be better to also be able to generate the indexes directly from the database, instead of requiring a dump stage first. This would probably be faster, and would allow sites which don't currently generate dump files to use this as well.
* Currently some articles can't be reached using the search suggest because they're "masked" by more popular articles. An example for the English Wikipedia would be the "AM" disambiguation page being masked by the pages that start with "American" (i.e. you cannot get to the "AM" search result). Potentially the exact matches could be included in the search result (although _maybe_ they want to be towards the end of the list if they're less popular articles, since they're probably not what the user is looking for).
* Currently the index does not include non namespace 0 articles. It would probably be best to include other namespaces (e.g. Template:, MediaWiki:, Talk:, etc), so that the suggestion searching box would have "functional-parity" with the current search box (e.g. a user should be able to type "Template:Cleanup" into the search box, and have it appear in the list of possible results).
* Potential case-sensitive ordering of the results. For example, if the user searches for "Adfa" on the English Wikipedia, it lists three results, including "ADFA" (listed first) and "Adfa" (the Welsh town, listed later). Should "Adfa" come first, because it is an exact match for what the user typed?
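To make the "masking" and exact-match questions above concrete, here is a minimal Python sketch of one possible ordering policy. It is not how Julien Lemoine's engine works; the function name `suggest`, the `exact_first` flag and the popularity numbers are all made up for illustration.

```python
def suggest(prefix, popularity, limit=10, exact_first=True):
    """Return up to `limit` titles that start with `prefix` (case-insensitive).

    `popularity` maps title -> score; the scores are hypothetical example data.
    """
    lowered = prefix.lower()
    matches = [t for t in popularity if t.lower().startswith(lowered)]
    # Most popular first; ties are broken alphabetically for a stable order.
    matches.sort(key=lambda t: (-popularity[t], t))
    top = matches[:limit]

    # Titles equal to the typed text (ignoring case) that were pushed out of
    # the top `limit` by more popular titles, e.g. "AM" behind "American".
    masked = [t for t in matches if t.lower() == lowered and t not in top][:limit]
    if masked:
        # Make room at the end of the list for the masked exact matches.
        top = top[:limit - len(masked)] + masked

    # Optionally move the case-sensitive exact match to the front.
    if exact_first and prefix in top:
        top.remove(prefix)
        top.insert(0, prefix)
    return top
```

With `exact_first=False` the masked exact match stays at the end of the list, which corresponds to the "towards the end of the list" option raised in the bullet about masking.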
(In reply to comment #0)
> * The user in their preferences would specify (opt-in) that they want to use
> suggestion searching. Since suggestion searching uses AJAX, it would probably be
> best to default this to being off so that backwards compatibility is retained
> for older non-JavaScript browsers, or clients with slow & expensive &
> high-latency connections (e.g. mobile phone devices).

Surely that's a complete waste of this feature? Non-JavaScript browsers just wouldn't do anything, of course, and users of high-latency connections can just ignore the results (since they'll be unhelpful).
No need to have it disabled for non-JavaScript browsers. The JavaScript should be able to handle it so that it falls back gracefully. Slow connections are a problem though, so an option to disable it should be available anyway.
This feature has already been implemented in part by the API (http://en.wikipedia.org/w/api.php) - the opensearch feature. http://en.wikipedia.org/w/api.php?action=opensearch&search=Te will return the first 10 titles beginning with "Te".

The page titles returned are currently not very relevant. See http://meta.wikimedia.org/wiki/Proposed_Database_Schema_Changes for a suggested improvement to the search result relevancy.

Moreover, this feature is already being used for the Firefox 2.0 search box. To use it, visit any MediaWiki site, click on the search engine selector button, and select "add wiki" - autocomplete will work, except that there is a 500 ms timeout set by default in firefox_install_dir/components/nsSearchSuggestions.js -- search for the "_suggestionTimeout: 500" line, and set a much higher timeout if you are on a slow connection.
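For anyone wanting to script against this, a small Python sketch of building the request and reading the reply. It assumes the reply is a JSON array whose first element is the query and whose second element is the list of titles (the OpenSearch suggestions layout); the helper names and the sample body are invented for illustration, not captured output.

```python
import json
from urllib.parse import urlencode

# Endpoint quoted in the comment above.
API_URL = "http://en.wikipedia.org/w/api.php"


def opensearch_url(prefix, api_url=API_URL):
    """Build the opensearch request URL for the text typed so far."""
    return api_url + "?" + urlencode({"action": "opensearch", "search": prefix})


def parse_opensearch(body):
    """Split a reply into (query, titles).

    Assumes a JSON array: [query, [title, title, ...], ...].
    Any further elements are ignored.
    """
    data = json.loads(body)
    return data[0], list(data[1])


# Hypothetical reply body, for illustration only.
sample_body = '["Te", ["Tennis", "Texas"]]'
query, titles = parse_opensearch(sample_body)
```

The actual fetch (e.g. with `urllib.request.urlopen`) is left out so the sketch runs without network access.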
This has been brought up in Mozilla's bug for adding Wikipedia to the default search engine list for Firefox: <https://bugzilla.mozilla.org/show_bug.cgi?id=380785>. CCing to rainman in case he has any thoughts about implementing this at some point using the relevancy algorithms already existing in Lucene, or if these would be inappropriate. This is maybe better for a built-in approach as suggested at <http://www.mediawiki.org/wiki/Proposed_Database_Schema_Changes#table_for_auto-suggest_page_title_search> using backlinks or some other metric (in which case redirects need to be counted as well).
I think we should be trying to integrate http://suggest.speedblue.org/

Building a prefix tree with articles ordered by rank is probably the most efficient way to go. Things to do:

1) Make the suggest engine rebuild its prefix tree from the lucene index, and not from a database dump - frequent rebuilds from db dumps have proven not to be very reliable. And the needed data is already in the index, i.e. article titles and their ranks. This way, we only need to worry about keeping the main index up-to-date.
2) Figure out an update scheme that will minimize downtime. Rsync+restart could be enough for starters, but it would be nice if there were, say, an extra thread that would check the contents of some rsync path for updates.

I'll invite Julien (the author of the suggest engine) to give some comments on this as well.
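As a rough illustration of "a prefix tree with articles ordered by rank", here is a minimal Python sketch in which every node caches the best-ranked titles below it, so a lookup only walks the typed prefix. The class name `RankedTrie` and its layout are assumptions for this sketch and are not taken from the suggest engine's source.

```python
class RankedTrie:
    """Prefix tree where each node keeps its `top_k` best-ranked titles."""

    def __init__(self, top_k=10):
        self.top_k = top_k
        self.root = {"children": {}, "top": []}

    def insert(self, title, rank):
        """Add a title; a higher `rank` means a more prominent article."""
        node = self.root
        for ch in title.lower():
            node = node["children"].setdefault(ch, {"children": {}, "top": []})
            self._remember(node, title, rank)

    def _remember(self, node, title, rank):
        # Keep the per-node list sorted by rank (descending), then by title,
        # and trim it so memory per node stays bounded by `top_k`.
        node["top"].append((rank, title))
        node["top"].sort(key=lambda item: (-item[0], item[1]))
        del node["top"][self.top_k:]

    def suggest(self, prefix):
        """Return the cached best titles for `prefix` (case-insensitive).

        An empty prefix yields no suggestions, since the root caches nothing.
        """
        node = self.root
        for ch in prefix.lower():
            node = node["children"].get(ch)
            if node is None:
                return []
        return [title for _, title in node["top"]]
```

The sketch does not deduplicate repeated inserts of the same title, and it trades memory for lookup speed by storing the top list at every node.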
This sounds good. If you have an efficient way to extract titles and ranks from the index, this is the best way to get an efficient completion structure.

To minimize the downtime, the best solution is to keep the old prefix trie loaded until the new trie is built and loaded (on a new TCP port, for example) and to redirect queries to the new version when it is available. You will have two tries in memory for a short time, but no downtime.

I can provide you with some support and improvements for the prefix trie I implemented; I have some ideas to reduce the memory footprint and improve performance.

Best Regards.
Julien
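The "keep the old one loaded until the new one is ready" idea can be sketched in Python as below. Note this swaps an in-process reference rather than switching TCP ports as suggested above, and it uses a plain ranked list in place of a real trie; `SuggestService` and `build_index` are hypothetical names for this sketch only.

```python
import threading


def build_index(ranked_titles):
    """Build a lookup structure from (rank, title) pairs.

    A list sorted by rank (descending) stands in for the real prefix trie.
    """
    ordered = sorted(ranked_titles, key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in ordered]


class SuggestService:
    def __init__(self, index):
        self._index = index
        # Only one rebuild at a time; queries never take this lock.
        self._rebuild_lock = threading.Lock()

    def suggest(self, prefix, limit=10):
        # Take a snapshot of the current index, so a swap that happens
        # mid-query cannot change the data this query is reading.
        index = self._index
        lowered = prefix.lower()
        return [t for t in index if t.lower().startswith(lowered)][:limit]

    def rebuild(self, load_titles):
        """Build a new index while the old one keeps serving, then swap."""
        with self._rebuild_lock:
            new_index = build_index(load_titles())
            # Both indexes are in memory here; rebinding the attribute
            # makes the new one visible to subsequent queries.
            self._index = new_index
```

`rebuild` could be called from a background thread that watches for a fresh title listing; the old index is freed once no query still holds a reference to it.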
We could use CLucene to access the lucene index, but I don't know if they maintain full compatibility with the latest java lucene file-structure changes.

Or, we could use something like this:
http://svn.wikimedia.org/viewvc/mediawiki/branches/lucene-search-2.1/src/org/wikimedia/lsearch/util/ExtractTitles.java?view=markup

This would open the latest consistent version of the lucene index and print out one title per line:
<rank> <namespace> <title> [<redirects>]

This could then be piped into a trie rebuild tool. Extracting the complete listing for en.wiki takes less than 2 minutes.

As things stand now, redirects won't be included. They are in the index but are not stored in raw form, and are thus not very easy to extract. But if we were to go ahead and integrate this, I believe I could easily add them without enlarging the index much, and without hurting performance.

So, if this gets worked out, I would be happy to set up a test on wmf servers, with some help from the root-access people of course :)
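On the consuming side, a rebuild tool could read that listing roughly as in the Python sketch below. It assumes the three fields are separated by single spaces, that the rank is an integer, and that no redirects field is present (as is the case for now); none of this has been checked against the actual ExtractTitles output, and the function names are made up.

```python
def parse_title_line(line):
    """Parse one '<rank> <namespace> <title>' line.

    Assumptions: single-space separators and an integer rank. The title is
    everything after the second space, so titles may contain spaces. The
    optional redirects field is not handled, as its format is not specified.
    """
    rank, namespace, title = line.rstrip("\n").split(" ", 2)
    return int(rank), namespace, title


def read_titles(stream):
    """Yield (rank, namespace, title) for every non-blank line of `stream`."""
    for line in stream:
        if line.strip():
            yield parse_title_line(line)
```

In a pipeline, `read_titles(sys.stdin)` would feed the tuples straight into whatever builds the prefix tree.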
*** Bug 12412 has been marked as a duplicate of this bug. ***
Resolving as FIXED -- $wgEnableMWSuggest is available for 1.13.