Last modified: 2014-04-25 03:40:42 UTC
It would be helpful for Wikipedia to detect typos the way Google does. For instance, if I type "Nathaz Larson" into Google, it completes the search but also shows a hyperlink saying "Did you mean Nathan Larson?" This would save users the time of re-entering mistyped search strings.
*** Bug 2236 has been marked as a duplicate of this bug. ***
*** Bug 3418 has been marked as a duplicate of this bug. ***
I think we should integrate Aspell with MediaWiki to suggest spellings in case the string the user typed is *not* an exact match, or has more than one match. It makes a lot of sense to have something like Aspell generate suggestions and present them to the user when those conditions are met.
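For what it's worth, PHP's pspell extension already wraps Aspell, so a first experiment would not need much glue code. A minimal sketch along the lines of the comment above, assuming the pspell extension is installed (suggestSpelling() is a hypothetical helper, not an existing MediaWiki function):

    <?php
    // Sketch only: collect Aspell suggestions for the words of a failed query.
    function suggestSpelling( string $term, string $lang = 'en' ): array {
        $dict = pspell_new( $lang );
        if ( !$dict ) {
            return [];
        }
        $suggestions = [];
        // Check each word separately and keep suggestions for the ones
        // Aspell does not recognize.
        foreach ( preg_split( '/\s+/', trim( $term ) ) as $word ) {
            if ( $word !== '' && !pspell_check( $dict, $word ) ) {
                $suggestions[$word] = array_slice( pspell_suggest( $dict, $word ), 0, 3 );
            }
        }
        return $suggestions;
    }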
The Lucene search includes a fuzzy-match feature which does something similar; it's currently disabled for speed reasons.
If 'speed reasons' means the time to complete a search, then I'd guess it's not worth disabling, and you should enable this feature. If instead you're worried about server load, then it's probably a real issue. Still, I think it's reasonable to assume that someone will be more satisfied if a search query takes some time but returns useful results than if it turns up nothing quickly. If we can return relevant results, it's worth trying for.
time to complete search ~= server usage
*** Bug 3638 has been marked as a duplicate of this bug. ***
Is this the same as bug 2486: Automatic wiki page name suggestion similar to "Google Suggest"?
Could we consider an algorithm that uses fewer server resources? For example, keep a mapping from the "lowercaseletteronly" version of each title to any matching "Lower-case Letter Onlys". There could also be such a mapping for What links here, Wiktionary, and Commons links, so the "no such article" page could tell whether those links actually lead to something.
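A rough sketch of what that mapping could look like, assuming a precomputed array of all page titles (normalizeTitle() and the variable names are hypothetical):

    <?php
    // Keep only ASCII letters, lower-cased: "Lower-case Letter Onlys"
    // becomes "lowercaseletteronlys".
    function normalizeTitle( string $title ): string {
        return preg_replace( '/[^a-z]/', '', strtolower( $title ) );
    }

    // Built once (or maintained incrementally as pages are created/renamed).
    $map = [];
    foreach ( $allTitles as $title ) {          // $allTitles: every page title
        $map[ normalizeTitle( $title ) ][] = $title;
    }

    // At search time, a failed exact lookup can still hit the normalized key.
    $key = normalizeTitle( $userInput );
    $candidates = isset( $map[$key] ) ? $map[$key] : [];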
*** Bug 5533 has been marked as a duplicate of this bug. ***
*** Bug 3139 has been marked as a duplicate of this bug. ***
Can't you just run the levenshtein function (which is included in php >= 3.0.17) against the list of titles of the wiki pages, and pick the two closest results as the suggested keyword(s)? I hope that wouldn't be too much work for the servers, because it would certainly be a very good thing for the users :D Sorry for my bad English, I hope you understand. __ petdog@gmail.com
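For illustration, a sketch of this brute-force approach, assuming $allTitles holds every page title (which, as the next comment points out, is the impractical part on a large wiki):

    <?php
    // Sketch: compare the search term against every title and keep the
    // two titles with the smallest edit distance.
    function closestTitles( string $term, array $allTitles, int $n = 2 ): array {
        $distances = [];
        foreach ( $allTitles as $title ) {
            $distances[$title] = levenshtein( strtolower( $term ), strtolower( $title ) );
        }
        asort( $distances );                    // smallest edit distance first
        return array_slice( array_keys( $distances ), 0, $n );
    }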
*** This bug has been marked as a duplicate of 2486 ***
*** Bug 2486 has been marked as a duplicate of this bug. ***
@petdog: run the levenshtein function against all titles in a wiki for every search? The English Wikipedia has more than a million entries and several searches per second; that would never work. It might be nice for a very small wiki, though. It shouldn't be too hard to write a search plugin that does it.
Why would it be any harder to put levenshtein function values into a database and search them than it is to put regular words into a database and search them? Is the load on the servers so high that you can't levenshtein-ize the user's input to search against the data?
@Karl: that's not how it works. The levenshtein function takes two parameters: the search term and the title. You can't pre-compute distances for all possible search terms.
My bad. Ah well. But you could still use soundex, although you'd have to find equivalent functions for languages other than English. Hmm... While you can't pre-compute distances for all possible search terms, you might be able to pre-compute them for search terms that users enter and don't find. I'm sure Google is doing something like this, using the user's input to guide the search process. Alternately, you could just watch what users do: if a search fails, a later one succeeds, and the Levenshtein distance between the two is small, save the found page as a suggestion. Or whatever. There's a lot of information supplied to Wikipedia by users' searches. Research project? Maybe. Unique opportunity, surely.
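A sketch of the phonetic-key idea, assuming an index built ahead of time with PHP's built-in soundex(); unlike an edit distance, the key depends only on one string, so it can be precomputed for every title (the variable names here are hypothetical):

    <?php
    // Precomputed index: soundex key => titles sharing that key.
    $soundexIndex = [];
    foreach ( $allTitles as $title ) {
        $soundexIndex[ soundex( $title ) ][] = $title;
    }

    // At search time, look up titles with the query's phonetic key, then
    // rank the handful of candidates by true edit distance.
    $key = soundex( $userInput );
    $candidates = isset( $soundexIndex[$key] ) ? $soundexIndex[$key] : [];
    usort( $candidates, function ( $a, $b ) use ( $userInput ) {
        return levenshtein( $userInput, $a ) - levenshtein( $userInput, $b );
    } );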
*** Bug 6778 has been marked as a duplicate of this bug. ***
What about creating a list of unsuccessful search terms and running the new entries through the levenshtein function each week/month? The results are saved in a table. When someone searches with a mistyped term that matches an entry in the list of unsuccessful searches, they get the pre-computed result of the algorithm.
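Roughly, the batch job could look like this (a sketch only; $failedTerms, $allTitles and the cutoff of 3 are assumptions, and the results would really be written to a table rather than kept in an array):

    <?php
    // Offline, once a week/month: find the closest title for each logged
    // search term that returned nothing, and cache the suggestion.
    $precomputed = [];
    foreach ( $failedTerms as $term ) {
        $best = null;
        $bestDist = PHP_INT_MAX;
        foreach ( $allTitles as $title ) {
            $d = levenshtein( strtolower( $term ), strtolower( $title ) );
            if ( $d < $bestDist ) {
                $bestDist = $d;
                $best = $title;
            }
        }
        if ( $best !== null && $bestDist <= 3 ) {  // ignore hopeless misses
            $precomputed[$term] = $best;           // would go into a table
        }
    }
    // At search time: a miss that matches a stored term is answered instantly.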
*** Bug 8694 has been marked as a duplicate of this bug. ***
(In reply to comment #20)
> What about creating a list of unsuccessful search terms and running the new
> entries through the levenshtein function each week/month? The results are
> saved in a table.
>
> When someone searches with a mistyped term that matches an entry in the list
> of unsuccessful searches, they get the pre-computed result of the algorithm.

Sounds like a fine workaround to me. Site admins: do you think it would work?

(In reply to comment #5)
> [...] someone will be more satisfied if a search query takes some time but
> returns useful results than if it turns up nothing quickly.

I agree. IMO, spell checking is so useful for so many users that even if it slowed all searches down by twenty-five percent, it would still be worthwhile. Should I perhaps take a poll of a few fellow Wikipedia readers at my community college and see what they think?

I am taking the liberty of bumping this bug's priority to High.
Don't go reinventing the wheel. This is a generic query task that could apply to any MySQL database. Lucene is ideal, but I can see why it's disabled.

One relatively straightforward approach is to do the Levenshtein distance thing, but first use a spatial query to narrow down the number of points you have to evaluate. Search terms (and titles) are associated with vectors specifying the frequency of each letter. We then perform a spatial query, using a spatial index, to find the nearest 50 or so points in this space, and do Levenshtein against each of them for the final ranking.

Unfortunately, the OpenGIS spatial support appears to handle only low numbers of dimensions. Nevertheless, this could be carried out by keeping a separate spatial index "off to the side" outside of MySQL and updating it as titles are updated. Just an idea.
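To make the two-stage idea concrete, here is a PHP sketch that uses letter-frequency vectors for the coarse filter and levenshtein() for the final ranking. A real implementation would keep the vectors in a spatial index rather than scanning them; $titleVectors, $userInput and the shortlist size of 50 are all assumptions.

    <?php
    // 26-dimensional letter-count vector for a string.
    function letterVector( string $s ): array {
        $v = array_fill( 0, 26, 0 );
        foreach ( str_split( strtolower( $s ) ) as $ch ) {
            $i = ord( $ch ) - ord( 'a' );
            if ( $i >= 0 && $i < 26 ) {
                $v[$i]++;
            }
        }
        return $v;
    }

    // Cheap L1 distance between two letter-count vectors.
    function vectorDistance( array $a, array $b ): int {
        $d = 0;
        for ( $i = 0; $i < 26; $i++ ) {
            $d += abs( $a[$i] - $b[$i] );
        }
        return $d;
    }

    // Stage 1: shortlist the ~50 titles whose letter counts are closest.
    $qv = letterVector( $userInput );
    $coarse = [];
    foreach ( $titleVectors as $title => $tv ) {   // $titleVectors: precomputed
        $coarse[$title] = vectorDistance( $qv, $tv );
    }
    asort( $coarse );
    $shortlist = array_keys( array_slice( $coarse, 0, 50, true ) );

    // Stage 2: rank only the shortlist by true edit distance.
    usort( $shortlist, function ( $a, $b ) use ( $userInput ) {
        return levenshtein( $userInput, $a ) - levenshtein( $userInput, $b );
    } );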
The developers may already be aware of this, but there is a Levenshtein distance user-defined function in C for MySQL: http://joshdrew.com/mysql_levenshtein_udf-1.0.tar.gz Though as the previous poster points out, it would still be necessary to narrow down the number of entries to consider for a large database like Wikipedia.
Does Rainman's Lucene 2.0 extension address this?
Simetrical: Not yet. Probably in 2.1.

Derrick: One way of narrowing down the results is using ngram indexes. Ngrams do a fair job of determining whether two strings are statistically similar, and they're well suited to spelling errors because they preserve the linkage between adjacent letters (I'm not sure whether this is the case with letter-frequency-based spatial indexes as well?). So you could query an ngram Lucene index, fetch the first 50 results, and then do some more sophisticated analysis to figure out which word(s) suit the query best.

There are some caveats, however. One is that you want spell checking to be context-dependent, i.e. the spell checker should always look at the phrase, not individual words. Indexing all phrases is probably not feasible. A possible solution might be to build the ngram index in iterations, adding phrases that could be confused with long words and such. Other caveats are that you might want to favor spellings that correspond to article titles, or to some interwiki link (in which case you might also want a translation), and that you might want to do spell checking even if the search returns some results, and so on.

However, this is still theory. I guess it makes a huge difference when you go from a small system with a few hundred MB of text to a system with tens of GB of text, and want the spell checker always to find the most relevant article and words.
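To illustrate why ngrams tolerate typos, here is a small PHP sketch of character-trigram similarity. In the real setup the ngrams would live in the Lucene index; this only shows the similarity measure itself, and the function names are hypothetical.

    <?php
    // Unique character trigrams of a string, padded so word edges count too.
    function trigrams( string $s ): array {
        $s = ' ' . strtolower( $s ) . ' ';
        $grams = [];
        for ( $i = 0; $i <= strlen( $s ) - 3; $i++ ) {
            $grams[] = substr( $s, $i, 3 );
        }
        return array_unique( $grams );
    }

    // Dice coefficient over trigram sets: 1.0 for identical strings,
    // close to 0 for unrelated ones.
    function trigramSimilarity( string $a, string $b ): float {
        $ga = trigrams( $a );
        $gb = trigrams( $b );
        $shared = count( array_intersect( $ga, $gb ) );
        return ( 2 * $shared ) / ( count( $ga ) + count( $gb ) );
    }

    // e.g. trigramSimilarity( 'nathaz larson', 'nathan larson' ) is high,
    // so the typo still survives the first-pass filter and can be ranked further.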
*** Bug 14680 has been marked as a duplicate of this bug. ***
*** Bug 15021 has been marked as a duplicate of this bug. ***
Note that the "did you mean" feature is now active in the Lucene search backend used on Wikipedia. Yay! However, there's not yet support for it in the default MySQL backend.
Note that the votes are for enabling this on WMF wikis (which is done), not necessarily to implement it in MySQL.