Last modified: 2009-07-24 12:13:29 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T2883, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 883 - Fuzzy (approximate) wiki page title access - fuzzy bookmarking - auto search
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://wiki.tcl.tk/391
Keywords:
Depends on:
Blocks:
Reported: 2004-11-15 05:53 UTC by T. Gries
Modified: 2009-07-24 12:13 UTC (History)

Description T. Gries 2004-11-15 05:53:08 UTC
I recently came across a wiki which implements a more useful way to access
(search) pages by actually implementing a form of fuzzy (approximate) bookmarking.

I am copying the relevant text from http://wiki.tcl.tk/391 :

To search for the word "cgi" in all page titles, you can use the URL:
  
        http://purl.org/tcl/wiki/cgi

To search for this word in all titles and in the full texts, use:

        http://purl.org/tcl/wiki/cgi*  (in general: a regular expression)

Or, if you prefer, you can enter the search word on the search page, at:

        http://purl.org/tcl/wiki/search

But there's a little more to it. That last URL is actually a form of fuzzy
bookmarking. There is no web page called "search". Wikit presents its contents
as if it were a directory with pages, but it's all smoke and mirrors...

First of all, note that all Wikit pages have a unique identifying number. The
"About" page is at http://purl.org/tcl/wiki/1.html, for example. But although
these unique IDs are effective for internal links, they are quite awkward as
bookmarks, since they convey no information whatsoever about the title or
contents of a page.

To offer a more useful way of bookmarking, pages which are not of the form
<number>.html are treated as search instructions to locate a page. The following
URL is an instruction to look for a page titled "hawaii":

        http://purl.org/tcl/wiki/hawaii

Assuming there is a page titled "hawaii" (case is ignored), the above URL will
lead directly to that page.

But wikis change. So do page titles, occasionally. Some page titles are long
and may contain embedded spaces or other inconvenient characters. This all makes
the above search mechanism a bit too brittle for long-lasting URLs.

The solution which has been adopted here is to refine the search process as
follows (everything after the slash will be called the search term):

   1. If the search term is a reference to a page (<number>.html), then simply
go to that page
   2. If the search term matches a page title (while ignoring case), then jump
to the page with that title
   3. If the search term includes one or more upper-case letters, modify the
search to be approximate (see below). If the approximate match finds exactly one
page, jump to that page.
   4. Otherwise, treat the search term as a regular search, and present the
search results.
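
The four steps above can be sketched in Python roughly as follows. This is an illustration only, not Wikit's actual code (Wikit is written in Tcl): the function name resolve, the titles dictionary, and the two callbacks approx and search are hypothetical names introduced for this sketch. The approximate matcher is passed in as a function because it is described separately.

```python
import re
from typing import Callable, Dict, List, Tuple

# ("page", [id]) means "jump to that page";
# ("results", [ids...]) means "show a search results list".
Result = Tuple[str, List[int]]


def resolve(term: str,
            titles: Dict[int, str],
            approx: Callable[[str, str], bool],
            search: Callable[[str], List[int]]) -> Result:
    """Resolve the part of the URL after the slash (hypothetical sketch)."""
    # Step 1: a direct page reference of the form <number>.html.
    m = re.fullmatch(r"(\d+)\.html", term)
    if m:
        return ("page", [int(m.group(1))])

    # Step 2: a page title match, ignoring case.
    for page_id, title in titles.items():
        if title.lower() == term.lower():
            return ("page", [page_id])

    # Step 3: upper-case letters trigger an approximate match;
    # jump only if exactly one page matches.
    if any(c.isupper() for c in term):
        hits = [pid for pid, title in titles.items() if approx(term, title)]
        if len(hits) == 1:
            return ("page", hits)

    # Step 4: fall back to a regular search.
    return ("results", search(term))
```

Returning a tagged result instead of redirecting directly keeps the decision logic separate from whatever the web layer does with it.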

Approximate matching - if the search term has upper-case letters, for example
"OneTwoThree", it is turned into a match pattern (using the glob / string match
syntax). In the example given, a search would be performed on page titles
matching the pattern "*[Oo]ne*[Tt]wo*[Tt]hree*".
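
As a minimal sketch of this conversion in Python: the function name approximate_pattern is hypothetical, and Python's fnmatch module is used here as a stand-in for Tcl's string match, on the assumption that the two behave alike for simple patterns such as this one. Glob metacharacters already present in the search term are not escaped in this sketch.

```python
from fnmatch import fnmatchcase


def approximate_pattern(term: str) -> str:
    """Turn a term such as "OneTwoThree" into a glob pattern.

    Each upper-case letter becomes "*[Xx]", and a trailing "*" is added,
    so the words may appear anywhere in the title, in the given order.
    """
    parts = []
    for ch in term:
        if ch.isupper():
            parts.append(f"*[{ch}{ch.lower()}]")
        else:
            parts.append(ch)
    return "".join(parts) + "*"


pattern = approximate_pattern("OneTwoThree")
print(pattern)
print(fnmatchcase("Chapter one, two, three", pattern))
```

Note that the words must appear in the same order as in the search term; a title containing "three" before "two" would not match.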

What's the point of all this? Well... this mechanism allows you to specify URLs
pointing into the Tcl'ers Wiki with some quite attractive properties:

    * If the search keyword is accurate enough, it's equivalent to a real URL
    * If the search is general enough, it'll survive minor title changes (e.g.
typos)
    * The URL has a meaningful word in it, so people can remember what it was about
    * If more pages are added to the wiki, the search will turn up more than one
match
    * This is an extremely useful feature, because the original match will be
one of the search results listed, and so will new - probably related - pages

For an example, here's a link to Don Libes' book on Expect:

        http://purl.org/tcl/wiki/Expect

And here's a search which lists all pages where the word "expect" is used:

        http://purl.org/tcl/wiki/expect*
Comment 1 soloturn99 2004-11-24 17:51:20 UTC
does this solve the problem of not finding "my_faq" and just getting "faq" wiki,
when searching for "faq"?
Comment 2 T. Gries 2004-11-24 18:00:28 UTC
(In reply to comment #1)
> does this solve the problem of not finding "my_faq" and just getting "faq" wiki,
> when searching for "faq"?

Of course it will, as long as the user input (call it the "needle") is not too far away from a matching title in the
"haystack". I am an expert in AGREP (see http://www.tgries.de/agrep; there are several spin-offs which could be
integrated into MediaWiki). AGREP used with the option "-By" would automatically first try an exact match (my_faq = faq), which
fails, and in that case it increments the allowed number of errors to 1 and searches again with one allowed error. This loop
repeats until at least one match has been found, usually several similar spellings ...

... which then would be presented to the user to select from OR
... to really create a new page with the "my_faq" page title, if the user wants this.
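
The loop described above could be sketched in Python as follows. This is not AGREP's actual algorithm or interface: it assumes a plain Levenshtein edit distance computed over whole page titles, and the names edit_distance, best_matches and the max_errors cap are hypothetical, introduced only for illustration.

```python
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def best_matches(needle: str, titles: List[str],
                 max_errors: int = 3) -> Tuple[Optional[int], List[str]]:
    """Try 0 allowed errors first, then 1, 2, ... until something matches."""
    for allowed in range(max_errors + 1):
        hits = [t for t in titles
                if edit_distance(needle.lower(), t.lower()) <= allowed]
        if hits:
            return allowed, hits
    # Nothing close enough: the caller could offer to create the page.
    return None, []
```

With this sketch, best_matches("my_faq", ["faq", "about", "my_faqs"]) finds nothing at zero errors and stops at one allowed error with the single hit "my_faqs".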

Are you a developer?
Comment 3 T. Gries 2005-06-22 20:25:24 UTC
(added for documentation completeness only)

See also my other enhancement bug 
http://bugzilla.wikimedia.org/show_bug.cgi?id=2486

Automatic wiki page name suggestion similar to "Google Suggest"
Comment 4 Siebrand Mazeland 2009-02-02 13:32:34 UTC
Changed component to "RecentChanges"
Comment 5 Happy-melon 2009-07-24 12:13:29 UTC
This bug is totally stale, but most of the features requested seem to have been developed in the intervening period.  We have a much better search functionality through LuceneSearch, which includes "did you mean", fuzzy matching, etc.  We have mwsuggest that does useful things in the search box.  I don't think a fuzzy-matching algorithm being automatically triggered on all URLs is a good idea. Resolving FIXED.
