Last modified: 2008-05-19 17:51:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T9288, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 7288 - Suggestion searching
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: 1.8.x
Hardware: All All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 12412
Depends on:
Blocks:

Reported: 2006-09-11 07:54 UTC by Nick Jenkins
Modified: 2008-05-19 17:51 UTC (History)

Description Nick Jenkins 2006-09-11 07:54:19 UTC
This is an enhancement bug to track the suggestion searching feature request, as
discussed recently on wikitech-l.

To summarize:
* The behaviour of the current MediaWiki search box would change, so that
instead of a straight text field, it would be more akin to an electronic index
that tries to help you find what you are looking for as you type, without
submitting the whole form.
* The user in their preferences would specify (opt-in) that they want to use
suggestion searching. Since suggestion searching uses AJAX, it would probably be
best to default this to being off so that backwards compatibility is retained
for older non-JavaScript browsers, or clients with slow & expensive &
high-latency connections (e.g. mobile phone devices).
* Suggestion searching would show a list of the possible page names matching
what the user has typed thus far (limited to say the top 10 matches, possibly
ranked by popularity).
* As the user types more, the suggestions would become more specific.
* The user is able to arrow up or arrow down through the list of suggestions to
select / highlight their choice.
* Pressing enter should probably open the topmost choice in the list of
suggestions, or the highlighted suggestion (if the first item is not the
highlighted one).
* Potentially the suggestion has autocomplete functionality, whereby the next
few letters are filled in and highlighted where it seems probable that this is
what the user is going to type.

If it helps to visualize what's being described, there are some screenshots to
give an idea of what it could potentially look like here:
http://nickj.org/images/8/80/03-autocompletion-kicks-in.png and here:
http://nickj.org/images/b/b3/04-found-desired-article.png

One potential implementation for this idea would be the server-side program from
Julien Lemoine. A web interface to this program can be accessed at
http://en.suggest.speedblue.org/ (e.g. the current MediaWiki Search box would
behave something like the search box on this site, except integrated into
MediaWiki), and GPL source code can also be downloaded.

However, there are a few things which would be good to see happen to the above
Suggestion Searching to help integrate it into MediaWiki:
* Currently the index generation uses the pages-articles.xml + all-titles-in-ns0
dump files downloaded from download.wikipedia.org. It would probably be better
to also be able to generate the indexes directly from the database, instead
of requiring a dump stage first. This would probably be faster, and allow sites
which don't currently generate dump files to also use this.
* Currently some articles can't be reached using the search suggest because
they're "masked" by more popular articles. An example for the English Wikipedia
would be the "AM" disambiguation page being masked by the pages that start with
"American" (i.e. you cannot get to the "AM" search result). Potentially the
exact matches could be included in the search result (although _maybe_ they want
to be towards the end of the list if they're less popular articles, since
they're probably not what the user is looking for).
* Currently the index does not include non namespace 0 articles. It would
probably be best to include other namespaces (e.g. Template:, MediaWiki:, Talk:,
etc), so that the suggestion searching box would have "functional-parity" with
the current search box (e.g. should be able to type "Template:Cleanup" into the
search box, and have it appear in the list of possible results).
* Potential case-sensitive ordering of the results. For example, if the user
searches for "Adfa" on the English Wikipedia, it lists three results, including
"ADFA" (listed first) and "Adfa" (the Welsh town, listed later). Should "Adfa"
come first, because it is an exact match for what the user typed?
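
As an illustration of the masking and ordering questions above, here is a
minimal Python sketch of one possible answer: exact matches are always kept and
listed first, and the remaining prefix matches are ordered by popularity. The
function name, the popularity scores and the sample titles are assumptions made
for this example; this is not how the speedblue.org engine or MediaWiki works.

```python
def suggest(query, pages, limit=10):
    """Return up to `limit` titles that start with `query`.

    `pages` maps title -> popularity score (a hypothetical metric).
    Exact matches are moved to the front so that a short title such as
    "AM" cannot be masked by more popular, longer titles.
    """
    q = query.lower()
    matches = [title for title in pages if title.lower().startswith(q)]
    # Sort key: case-sensitive exact match first, then case-insensitive
    # exact match, then descending popularity, then title as a tie-breaker.
    matches.sort(key=lambda t: (t != query, t.lower() != q, -pages[t], t))
    return matches[:limit]


pages = {"American": 900, "America": 800, "Amsterdam": 700,
         "AM": 50, "ADFA": 40, "Adfa": 5}
print(suggest("AM", pages, limit=3))
print(suggest("Adfa", pages))
```

Placing exact matches at the end of the list instead, as the description also
considers, would only require changing the sort key.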
Comment 1 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-12 00:37:58 UTC
(In reply to comment #0)
> * The user in their preferences would specify (opt-in) that they want to use
> suggestion searching. Since suggestion searching uses AJAX, it would probably be
> best to default this to being off so that backwards compatibility is retained
> for older non-JavaScript browsers, or clients with slow & expensive &
> high-latency connections (e.g. mobile phone devices).

Surely that's a complete waste of this feature?  Non-JavaScript browsers just
wouldn't do anything, of course, and users of high-latency connections can just
ignore the results (since they'll be unhelpful).
Comment 2 Platonides 2006-10-01 13:35:36 UTC
No need to have it disabled for non-javascript browsers. JavaScript should be
able to handle it so it falls back gracefully.
Slow connections are a problem though, so an option to disable it should be
available anyway.
Comment 3 Yuri Astrakhan 2006-10-24 01:20:12 UTC
This feature has already been implemented in part by the API
(http://en.wikipedia.org/w/api.php) - the opensearch feature.
http://en.wikipedia.org/w/api.php?action=opensearch&search=Te will return the
first 10 titles beginning with "Te".
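
For readers who want to try the opensearch call from a script, a minimal Python
sketch follows. It only builds the request URL and parses a response body; it
assumes the response is a JSON array whose second element is the list of
titles. The helper names and the sample response are made up for illustration.

```python
import json
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"


def opensearch_url(prefix, limit=10):
    """Build the opensearch request URL for a title prefix."""
    params = {"action": "opensearch", "search": prefix,
              "limit": limit, "format": "json"}
    return API_URL + "?" + urlencode(params)


def parse_opensearch(body):
    """Extract the suggested titles from a response body.

    Assumption: the body is a JSON array of the form
    [query, [title, title, ...], ...].
    """
    data = json.loads(body)
    return data[1]


# Hypothetical response body, not real API output.
sample = '["Te", ["Tea", "Texas", "Tennis"]]'
print(opensearch_url("Te"))
print(parse_opensearch(sample))
```

Fetching the URL (for example with urllib.request.urlopen) is left out so that
the sketch runs without network access.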

Page titles are currently not very relevant. See
http://meta.wikimedia.org/wiki/Proposed_Database_Schema_Changes for suggested
improvement to the search result relevancy.

Moreover, this feature is already being used for the Firefox 2.0 search box.
To use it, visit any MediaWiki site, click on the search engine selector button,
and select "add wiki" - autocomplete will work, except that there is a 500 ms
timeout set by default in firefox_install_dir/components/nsSearchSuggestions.js
-- search for the "_suggestionTimeout: 500" line, and set a much higher timeout
if you are on a slow connection.
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-08-12 23:23:22 UTC
This has been brought up in Mozilla's bug for adding Wikipedia to the default search engine list for Firefox: <https://bugzilla.mozilla.org/show_bug.cgi?id=380785>.  CCing to rainman in case he has any thoughts about implementing this at some point using the relevancy algorithms already existing in Lucene, or if these would be inappropriate.  This is maybe better for a built-in approach as suggested at <http://www.mediawiki.org/wiki/Proposed_Database_Schema_Changes#table_for_auto-suggest_page_title_search> using backlinks or some other metric (in which case redirects need to be counted as well).
Comment 5 Robert Stojnic 2007-08-13 00:01:56 UTC
I think we should be trying to integrate http://suggest.speedblue.org/
Building a prefix tree with articles ordered by rank is probably the
most efficient way to go. 
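
To make the idea concrete, here is a minimal Python sketch of a prefix tree in
which every node caches its best-ranked titles, so that a lookup only has to
walk the prefix. It is an illustration under assumptions, not the data
structure used by suggest.speedblue.org: the class names, the per-node cache
and the sample ranks are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.top = []       # best (rank, title) pairs below this node


class SuggestTrie:
    def __init__(self, limit=10):
        self.root = TrieNode()
        self.limit = limit

    def insert(self, title, rank):
        """Add a title; every node on its path remembers the top titles."""
        node = self.root
        for ch in title.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((rank, title))
            # Highest rank first, title as tie-breaker; keep `limit` entries.
            node.top.sort(key=lambda pair: (-pair[0], pair[1]))
            del node.top[self.limit:]

    def suggest(self, prefix):
        """Return the cached best titles for a non-empty prefix."""
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return [title for _, title in node.top]


trie = SuggestTrie(limit=2)
for title, rank in [("Tea", 30), ("Texas", 50), ("Tennis", 40)]:
    trie.insert(title, rank)
print(trie.suggest("te"))
```

Caching the top entries in each node trades memory for lookup speed, which is
one reason the memory footprint of such a structure matters.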

Things to do:
1) Make the suggest engine rebuild its prefix tree from the Lucene index,
and not from a database dump - frequent rebuilds from db dumps have proven
not to be very reliable. And the needed data is already in the index,
i.e. article titles and their ranks. This way, we can worry only about
keeping the main index up-to-date.

2) Figure out an update scheme that will minimize downtime. 
Rsync+restart could be enough for starters, but it would be nice
if there would be, say, an extra thread that would check the contents
of some rsync path for updates.

I'll invite Julien (the author of suggest engine) to give some comments
on this as well. 
Comment 6 Julien Lemoine 2007-08-13 08:50:56 UTC
This sounds good.

If you have an efficient way to extract titles and ranks from the index, this
is the best way to get an efficient completion structure.

To minimize the downtime, the best solution is to keep the old prefix
trie loaded until the new tree is built and loaded (on a new TCP port, for
example) and to redirect queries to the new version when it is
available. You will have two trees in memory for a short time, but
without downtime.
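
The same build-then-switch idea can be sketched inside a single process. The
Python example below swaps an object reference instead of switching TCP ports,
so it is a simplified stand-in for the scheme described above, not the scheme
itself. The class, the function names and the sorted list used in place of a
real trie are assumptions for illustration.

```python
import threading


def build_index(pages):
    """Stand-in for the slow trie build: titles sorted by descending rank."""
    return sorted(pages, key=lambda title: (-pages[title], title))


class SuggestService:
    def __init__(self, pages):
        self._index = build_index(pages)
        self._reload_lock = threading.Lock()

    def suggest(self, prefix, limit=10):
        index = self._index  # take one reference; it is never mutated in place
        p = prefix.lower()
        return [t for t in index if t.lower().startswith(p)][:limit]

    def reload(self, pages):
        with self._reload_lock:             # only one rebuild at a time
            new_index = build_index(pages)  # old index keeps serving meanwhile
            self._index = new_index         # swap: later queries use the new one


service = SuggestService({"Tea": 1})
print(service.suggest("te"))
service.reload({"Tea": 1, "Texas": 5})
print(service.suggest("te"))
```

As in the description above, both the old and the new index exist in memory
while the rebuild is running.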

I can provide you with some support and improvements for the prefix trie I
implemented; I have some ideas to reduce the memory footprint and improve
performance.

Best Regards.
Julien
Comment 7 Robert Stojnic 2007-08-13 23:12:22 UTC
We could use CLucene to access the lucene index, but I don't know if
they maintain full compatibility with the latest java lucene file-structure
changes. Or, we could use something like this:
http://svn.wikimedia.org/viewvc/mediawiki/branches/lucene-search-2.1/src/org/wikimedia/lsearch/util/ExtractTitles.java?view=markup

This would open the latest consistent version of the lucene index and 
print out a title per line: <rank> <namespace> <title> [<redirects>]
This could then be piped into a trie rebuild tool. Extracting the 
complete listing for en.wiki takes less than 2 minutes. 
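
A consumer of that output might look like the following Python sketch. The
exact field encoding is not specified here, so the sketch makes loud
assumptions: fields are separated by single spaces, the rank is an integer,
titles contain no spaces (for example because underscores are used), and
anything after the title is kept as an unparsed string. All names are
hypothetical.

```python
from collections import namedtuple

TitleRecord = namedtuple("TitleRecord", "rank namespace title rest")


def parse_title_line(line):
    """Parse one '<rank> <namespace> <title> [<redirects>]' line.

    Assumptions: space-separated fields, integer rank, no spaces in the
    title; the optional trailing part is returned unparsed in `rest`.
    """
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3:
        raise ValueError("expected at least 3 fields: %r" % line)
    rank, namespace, title = parts[:3]
    rest = parts[3] if len(parts) > 3 else ""
    return TitleRecord(int(rank), namespace, title, rest)


def read_titles(lines):
    """Parse all non-empty lines, for example from sys.stdin."""
    return [parse_title_line(line) for line in lines if line.strip()]


# Made-up sample input, not real extractor output.
records = read_titles(["120 0 ADFA\n", "3 0 Adfa\n", "\n"])
print(records)
```

In a pipeline, read_titles(sys.stdin) would feed the records to whatever tool
rebuilds the trie.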

As things stand now, redirects won't be included. They are in the
index but are not stored in raw form, thus not very easy to extract.
But if we would go ahead to integrate this, I believe I could easily 
add them without enlarging the index much, and without hurting performance. 

So, if this can be worked out, I would be happy to set up a test
on wmf servers, with some help from the root-access people of course :)



Comment 8 Brion Vibber 2008-03-19 00:13:47 UTC
*** Bug 12412 has been marked as a duplicate of this bug. ***
Comment 9 Brion Vibber 2008-05-19 17:51:19 UTC
Resolving as FIXED -- $wgEnableMWSuggest is available for 1.13.
