Last modified: 2014-04-25 03:40:42 UTC
It would be helpful for Wikipedia to detect typos the way Google does. For instance, if I type "Nathaz Larson" into Google, it completes the search but also shows a hyperlink saying "Did you mean Nathan Larson?" This would save users the time of re-entering mistyped search strings.
*** Bug 2236 has been marked as a duplicate of this bug. ***
*** Bug 3418 has been marked as a duplicate of this bug. ***
I think we should integrate Aspell with MediaWiki to suggest spellings in case the string the user typed is *not* an exact match, or has more than one match. It makes a lot of sense to have something like Aspell generate suggestions and present them to the user when those conditions are met.
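For what it's worth, PHP's pspell extension already wraps Aspell, so a first experiment would not need much glue code. A minimal sketch along the lines of the comment above, assuming the pspell extension is installed (suggestSpelling() is a hypothetical helper, not an existing MediaWiki function):

    <?php
    // Sketch only: collect Aspell suggestions for the words of a failed query.
    function suggestSpelling( string $term, string $lang = 'en' ): array {
        $dict = pspell_new( $lang );
        if ( !$dict ) {
            return [];
        }
        $suggestions = [];
        // Check each word separately and keep suggestions for the ones
        // Aspell does not recognize.
        foreach ( preg_split( '/\s+/', trim( $term ) ) as $word ) {
            if ( $word !== '' && !pspell_check( $dict, $word ) ) {
                $suggestions[$word] = array_slice( pspell_suggest( $dict, $word ), 0, 3 );
            }
        }
        return $suggestions;
    }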
The Lucene search includes a fuzzy-match feature which does something similar; it's currently disabled for speed reasons.
If 'speed reasons' means the time to complete a search, then I'd guess it's not worth disabling, and you should enable this feature. If instead you're worried about server load, then it's probably a real issue. Still, I think it's reasonable to assume that someone will be more satisfied if a search query takes some time but returns useful results than if it turns up nothing quickly. If we can return relevant results, it's worth trying for.
time to complete search ~= server usage
*** Bug 3638 has been marked as a duplicate of this bug. ***
Is this the same as bug 2486: Automatic wiki page name suggestion similar to "Google Suggest"?
Could we consider an algorithm that uses fewer server resources? For example, keep a mapping from the "lowercaseletteronly" version of each title to any matching "Lower-case Letter Onlys". There could also be such a mapping for What links here, Wiktionary, and Commons links, so the "no such article" page could tell whether those links actually lead to something.
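A rough sketch of what that mapping could look like, assuming a precomputed array of all page titles (normalizeTitle() and the variable names are hypothetical):

    <?php
    // Keep only ASCII letters, lower-cased: "Lower-case Letter Onlys"
    // becomes "lowercaseletteronlys".
    function normalizeTitle( string $title ): string {
        return preg_replace( '/[^a-z]/', '', strtolower( $title ) );
    }

    // Built once (or maintained incrementally as pages are created/renamed).
    $map = [];
    foreach ( $allTitles as $title ) {          // $allTitles: every page title
        $map[ normalizeTitle( $title ) ][] = $title;
    }

    // At search time, a failed exact lookup can still hit the normalized key.
    $key = normalizeTitle( $userInput );
    $candidates = isset( $map[$key] ) ? $map[$key] : [];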
*** Bug 5533 has been marked as a duplicate of this bug. ***
*** Bug 3139 has been marked as a duplicate of this bug. ***
Can't you just run the levenshtein function (which is included in php >= 3.0.17) against the list of titles of the wiki pages, and pick the two closest results as the suggested keyword(s)? I hope that wouldn't be too much work for the servers, because it would certainly be a very good thing for the users :D Sorry for my bad English, I hope you understand. __ petdog@gmail.com
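For illustration, a sketch of this brute-force approach, assuming $allTitles holds every page title (which, as the next comment points out, is the impractical part on a large wiki):

    <?php
    // Sketch: compare the search term against every title and keep the
    // two titles with the smallest edit distance.
    function closestTitles( string $term, array $allTitles, int $n = 2 ): array {
        $distances = [];
        foreach ( $allTitles as $title ) {
            $distances[$title] = levenshtein( strtolower( $term ), strtolower( $title ) );
        }
        asort( $distances );                    // smallest edit distance first
        return array_slice( array_keys( $distances ), 0, $n );
    }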
*** This bug has been marked as a duplicate of 2486 ***
*** Bug 2486 has been marked as a duplicate of this bug. ***
@petdog: run the levenshtein function against all titles in a wiki for every search? The English Wikipedia has more than a million entries and several searches per second; that would never work. It might be nice for a very small wiki, though. It shouldn't be too hard to write a search plugin that does it.
Why would it be any harder to put levenshtein function values into a database and search them than it is to put regular words into a database and search them? Is the load on the servers so high that you can't levenshtein-ize the user's input to search against the data?
@Karl: that's not how it works. The levenshtein function takes two parameters: the search term and the title. You can't pre-compute distances for all possible search terms.
My bad. Ah well. But you could still use soundex, although you'd have to find equivalent functions for languages other than English. Hmm... While you can't pre-compute distances for all possible search terms, you might be able to pre-compute them for search terms that users enter and don't find. I'm sure Google is doing something like this, using the user's input to guide the search process. Alternately, you could just watch what users do: if a search fails, a later one succeeds, and the Levenshtein distance between the two is small, save the found page as a suggestion. Or whatever. There's a lot of information supplied to Wikipedia by users' searches. Research project? Maybe. Unique opportunity, surely.
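A sketch of the phonetic-key idea, assuming an index built ahead of time with PHP's built-in soundex(); unlike an edit distance, the key depends only on one string, so it can be precomputed for every title (the variable names here are hypothetical):

    <?php
    // Precomputed index: soundex key => titles sharing that key.
    $soundexIndex = [];
    foreach ( $allTitles as $title ) {
        $soundexIndex[ soundex( $title ) ][] = $title;
    }

    // At search time, look up titles with the query's phonetic key, then
    // rank the handful of candidates by true edit distance.
    $key = soundex( $userInput );
    $candidates = isset( $soundexIndex[$key] ) ? $soundexIndex[$key] : [];
    usort( $candidates, function ( $a, $b ) use ( $userInput ) {
        return levenshtein( $userInput, $a ) - levenshtein( $userInput, $b );
    } );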
*** Bug 6778 has been marked as a duplicate of this bug. ***
What about creating a list of unsuccessful search terms and running the new entries through the levenshtein function each week/month? The results are saved in a table. When someone searches with a mistyped term that matches an entry in the list of unsuccessful searches, they get the pre-computed result of the algorithm.
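Roughly, the batch job could look like this (a sketch only; $failedTerms, $allTitles and the cutoff of 3 are assumptions, and the results would really be written to a table rather than kept in an array):

    <?php
    // Offline, once a week/month: find the closest title for each logged
    // search term that returned nothing, and cache the suggestion.
    $precomputed = [];
    foreach ( $failedTerms as $term ) {
        $best = null;
        $bestDist = PHP_INT_MAX;
        foreach ( $allTitles as $title ) {
            $d = levenshtein( strtolower( $term ), strtolower( $title ) );
            if ( $d < $bestDist ) {
                $bestDist = $d;
                $best = $title;
            }
        }
        if ( $best !== null && $bestDist <= 3 ) {  // ignore hopeless misses
            $precomputed[$term] = $best;           // would go into a table
        }
    }
    // At search time: a miss that matches a stored term is answered instantly.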
*** Bug 8694 has been marked as a duplicate of this bug. ***
(In reply to comment #20)
> What about creating a list of unsuccessful search terms and running the new
> entries through the levenshtein function each week/month? The results are
> saved in a table.
>
> When someone searches with a mistyped term that matches an entry in the list
> of unsuccessful searches, they get the pre-computed result of the algorithm.

Sounds like a fine workaround to me. Site admins: do you think it would work?

(In reply to comment #5)
> [...] someone will be more satisfied if a search query takes some time but
> returns useful results than if it turns up nothing quickly.

I agree. IMO, spell checking is so useful for so many users that even if it slowed all searches down by twenty-five percent, it would still be worthwhile. Should I perhaps take a poll of a few fellow Wikipedia readers at my community college and see what they think?

I am taking the liberty of bumping this bug's priority to High.
Don't go reinventing the wheel. This is a generic query task that could apply to any MySQL database. Lucene is ideal, but I can see why it's disabled.

One relatively straightforward approach is to do the Levenshtein distance thing, but first use a spatial query to narrow down the number of points you have to evaluate. Search terms (and titles) are associated with vectors specifying the frequency of each letter. We then perform a spatial query, using a spatial index, to find the nearest 50 or so points in this space, and do Levenshtein against each of them for the final ranking.

Unfortunately, the OpenGIS spatial support appears to handle only low numbers of dimensions. Nevertheless, this could be carried out by keeping a separate spatial index "off to the side" outside of MySQL and updating it as titles are updated. Just an idea.
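To make the two-stage idea concrete, here is a PHP sketch that uses letter-frequency vectors for the coarse filter and levenshtein() for the final ranking. A real implementation would keep the vectors in a spatial index rather than scanning them; $titleVectors, $userInput and the shortlist size of 50 are all assumptions.

    <?php
    // 26-dimensional letter-count vector for a string.
    function letterVector( string $s ): array {
        $v = array_fill( 0, 26, 0 );
        foreach ( str_split( strtolower( $s ) ) as $ch ) {
            $i = ord( $ch ) - ord( 'a' );
            if ( $i >= 0 && $i < 26 ) {
                $v[$i]++;
            }
        }
        return $v;
    }

    // Cheap L1 distance between two letter-count vectors.
    function vectorDistance( array $a, array $b ): int {
        $d = 0;
        for ( $i = 0; $i < 26; $i++ ) {
            $d += abs( $a[$i] - $b[$i] );
        }
        return $d;
    }

    // Stage 1: shortlist the ~50 titles whose letter counts are closest.
    $qv = letterVector( $userInput );
    $coarse = [];
    foreach ( $titleVectors as $title => $tv ) {   // $titleVectors: precomputed
        $coarse[$title] = vectorDistance( $qv, $tv );
    }
    asort( $coarse );
    $shortlist = array_keys( array_slice( $coarse, 0, 50, true ) );

    // Stage 2: rank only the shortlist by true edit distance.
    usort( $shortlist, function ( $a, $b ) use ( $userInput ) {
        return levenshtein( $userInput, $a ) - levenshtein( $userInput, $b );
    } );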
The developers may already be aware of this, but there is a Levenshtein distance user-defined function in C for MySQL: http://joshdrew.com/mysql_levenshtein_udf-1.0.tar.gz Though as the previous poster points out, it would still be necessary to narrow down the number of entries to consider for a large database like Wikipedia.
Does Rainman's Lucene 2.0 extension address this?
Simetrical: Not yet. Probably in 2.1.

Derrick: One way of narrowing down the results is using ngram indexes. Ngrams do a fair job of determining whether two strings are statistically similar, and they're well suited to spelling errors because they preserve the linkage between adjacent letters (I'm not sure whether this is the case with letter-frequency-based spatial indexes as well?). So you could query an ngram Lucene index, fetch the first 50 results, and then do some more sophisticated analysis to figure out which word(s) suit the query best.

There are some caveats, however. One is that you want spell checking to be context-dependent, i.e. the spell checker should always look at the phrase, not individual words. Indexing all phrases is probably not feasible. A possible solution might be to build the ngram index in iterations, adding phrases that could be confused with long words and such. Other caveats are that you might want to favor spellings that correspond to article titles, or to some interwiki link (in which case you might also want a translation), and that you might want to do spell checking even if the search returns some results, and so on.

However, this is still theory. I guess it makes a huge difference when you go from a small system with a few hundred MB of text to a system with tens of GB of text, and want the spell checker always to find the most relevant article and words.
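To illustrate why ngrams tolerate typos, here is a small PHP sketch of character-trigram similarity. In the real setup the ngrams would live in the Lucene index; this only shows the similarity measure itself, and the function names are hypothetical.

    <?php
    // Unique character trigrams of a string, padded so word edges count too.
    function trigrams( string $s ): array {
        $s = ' ' . strtolower( $s ) . ' ';
        $grams = [];
        for ( $i = 0; $i <= strlen( $s ) - 3; $i++ ) {
            $grams[] = substr( $s, $i, 3 );
        }
        return array_unique( $grams );
    }

    // Dice coefficient over trigram sets: 1.0 for identical strings,
    // close to 0 for unrelated ones.
    function trigramSimilarity( string $a, string $b ): float {
        $ga = trigrams( $a );
        $gb = trigrams( $b );
        $shared = count( array_intersect( $ga, $gb ) );
        return ( 2 * $shared ) / ( count( $ga ) + count( $gb ) );
    }

    // e.g. trigramSimilarity( 'nathaz larson', 'nathan larson' ) is high,
    // so the typo still survives the first-pass filter and can be ranked further.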
*** Bug 14680 has been marked as a duplicate of this bug. ***
*** Bug 15021 has been marked as a duplicate of this bug. ***
Note that the "did you mean" feature is now active in the Lucene search backend used on Wikipedia. Yay! However, there's not yet support for it in the default MySQL backend.
Note that the votes are for enabling this on WMF wikis (which is done), not necessarily to implement it in MySQL.