Last modified: 2014-11-20 16:57:54 UTC
Background: In our Mobile apps, we're implementing full-text searching when the user types a search term. The way we query the api is as follows: - We use prop=pageprops with generator=search to get the list of results, which also gives us the image thumbnail and Wikidata ID for each result, but this returns the results in the wrong order, so: - We combine the query with list=search, which should give the same set of results (albeit without thumbnails or Wikidata ID), but in the correct order. - We then correlate the data from the two lists (expecting both lists to contain the exact same items) to arrive at the full results in the correct order. For example, our query is something like this: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=search&prop=pageprops&format=json&srsearch=Barack&srnamespace=0&srwhat=text&srinfo=suggestion&srlimit=12&ppprop=wikibase_item&generator=search&gsrsearch=Barack&gsrnamespace=0&gsrwhat=text&gsrprop=redirecttitle&gsrlimit=12 However: The two sets of results often don't match up with each other -- there are sometimes one or more pages in one list that are not present in the other. What's even more strange, this doesn't happen every time: the lists might match up in one instance, and then be different in another (for the exact same query). For example, try executing the query linked above, and observe the last item in the returned "pages" array. 4 out of 5 times, the item will be "There's No One as Irish as Barack O'Bama", but sometimes it becomes "Statewide opinion polling for the United States presidential election, 2012", which is no longer consistent with the other list returned by list=search.
Just looking at the "last" item in the pages object isn't going to tell you much, because JSON doesn't guarantee ordering in hashes. After repeated trials, I cannot reproduce a single trial where 'pages' contains a title not in 'search' or vice versa. If this does somehow occur, it's most likely due to the underlying search engine rather than anything in the API.
Just to clarify, the "search" results list is not a hash - it's an array. It was needed because the "pages" results are a hash, so the order was meaningless.
(In reply to Monte Hurd from comment #2) > Just to clarify, the "search" results list is not a hash - it's an array. It > was needed because the "pages" results are a hash, so the order was > meaningless. Why not change the search results lists to allow it to include the properties that you need? That seems about 1000000% more efficient than making two http calls. Search results should be mostly consisten across requests but there is nothing that keeps them entirely consistent. We're changing the underlying index constantly and we make no effort to cache because the tail is so long. You search may not even be sent to the same set of machines when you repeat it. This could cause slight shifting around the edges. It'd be a bug if something was in the middle one time and not there the next. Seriously, make just one query. Save some load on the search cluster and save the user some time.
Created attachment 17176 [details] Results set for "fish" before update
Created attachment 17177 [details] Result set for "fish" after update
Those results are super duper different.
Could you turn those screenshots into urls? Also Cirrus intentionally returns redirects in the list so that funny search for fishspinner is actually a result for tropical cyclone. One thing we could do is penalize results that we find via a redirect as compared to a page title match. We don't do that right now. It'd probably make the results make more sense. Still I think the underlying issue is that the list=search api doesn't return lwhat you want it to.
Here is the exact query the iOS app used for those two screenshots: https://en.m.wikipedia.org/w/api.php?action=query&format=json&generator=prefixsearch&gpslimit=24&gpsnamespace=0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=144&ppprop=wikibase_item&prop=pageprops%7Cpageimages&pslimit=24&pssearch=fish
Note that all of my usual daily search testing terms - "art", "cat", "bird", "toast", have underwent a similar degradation as "fish". The first result is still always a match, but, for example, the second result when searching for "bird" is now "Bird-Magic rivalry", which may be a great article, but it's probably not the second-most relevant bird article on enwiki :)
I'll work on the relevance issue. Is this bug about that or the different between list=search and generator=search? I don't know much about the api issue but I can totally work on relevance.
Good point - we may be conflating two issues. Thanks for tackling the relevance part! I think this bug could then just track the "different between list=search and generator=search" part if that sounds ok?
OK. I'm not working on that now because I don't think it's as bad. Also I don't understand our API too well so I'd have to do some research. I don't even know enough to know if the bug is in CirrusSearch. I'm going to reset the assignee to default for now. Finally: you really should be able to get the results in one query. That you can't is also a bug. I'm tracking the relevance issue as bug 73636
(In reply to Nik Everett from comment #12) > OK. I'm not working on that now because I don't think it's as bad. Also I > don't understand our API too well so I'd have to do some research. I don't > even know enough to know if the bug is in CirrusSearch. The general idea is that lists return results directly, and generators use the output of a list to feed into another module (like a UNIX pipe). So generators essentially are the intended way to combine modules in one query like this. The problem is for some reason list= is also used/necessary in this case. See https://www.mediawiki.org/wiki/API:Query#Generators (In reply to Monte Hurd from comment #8) > Here is the exact query the iOS app used for those two screenshots: > > https://en.m.wikipedia.org/w/api. > php?action=query&format=json&generator=prefixsearch&gpslimit=24&gpsnamespace= > 0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=14 > 4&ppprop=wikibase_item&prop=pageprops%7Cpageimages&pslimit=24&pssearch=fish I notice this does not use indexpageids. So you're only getting it in the correct order if whatever iOS JSON parser you're using preserves order in hashes/associative arrays (possible, but not certain). See https://en.m.wikipedia.org/w/api.php?action=query&format=jsonfm&generator=prefixsearch&gpslimit=24&gpsnamespace=0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=144&ppprop=wikibase_item&prop=pageprops|pageimages&pslimit=24&pssearch=fish&indexpageids for a version with indexpageids with a clear array of results. Also, it's using generator=prefixsearch not generator=search (which comment #0 mentions).
Matt - it seems the indexpageids are not in the right order. And the native json parser on iOS doesn't preserve order within resulting associative arrays - just arrays. I believe this is true w/android as well. We could write custom parsing logic to determine order if these hash entries were made to be in the correct order, but that's a hack every non-php consumer of this data would have to implement. Would be perfect if the entries could be both ordered correctly and returned in order-preserving json array format. Can there be a flag that just caused results to come back array format?
*to clarify, the array format flag would return array of page hashes, rather than hash of page hashes*
(In reply to Matthew Flaschen from comment #13) > (In reply to Nik Everett from comment #12) > > OK. I'm not working on that now because I don't think it's as bad. Also I > > don't understand our API too well so I'd have to do some research. I don't > > even know enough to know if the bug is in CirrusSearch. > > The general idea is that lists return results directly, and generators use > the output of a list to feed into another module (like a UNIX pipe). So > generators essentially are the intended way to combine modules in one query > like this. The problem is for some reason list= is also used/necessary in > this case. See https://www.mediawiki.org/wiki/API:Query#Generators This. The general problem people seem to be running into with search as a generator is that they want to keep some indication of the ranking despite generators currently being defined as producing an unordered list of pageids (or revids). I'm not opposed to allowing generators to provide additional properties for the generated pages (reopening bug 14859, although I'd probably do it in a different manner than requested there) but it'll take some thought as to how to best do that. I'll put something on [[mw:API/Architecture work/Planning]] so I don't forget to look into it. As for how the API code works in this situation, it instantiates two instances of ApiQueryPrefixSearch, and each instance runs the same code to get a list of titles: $searcher = new TitlePrefixSearch; $titles = $searcher->searchWithVariants( $search, $limit, $namespaces ); The processing of that list of titles is different, but it should theoretically be the same titles in the list. What seems to be going on is that this code is not entirely deterministic. (In reply to Monte Hurd from comment #8) > Here is the exact query the iOS app used for those two screenshots: > > https://en.m.wikipedia.org/w/api. > php?action=query&format=json&generator=prefixsearch&gpslimit=24&gpsnamespace= > 0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=14 > 4&ppprop=wikibase_item&prop=pageprops%7Cpageimages&pslimit=24&pssearch=fish I note you're using gpsnamespace=0, but not psnamespace=0. In this case that doesn't matter, though, since 0 is the default. I have been able to reproduce the issue with this query, BTW. (In reply to Matthew Flaschen from comment #13) > I notice this does not use indexpageids. So you're only getting it in the > correct order if whatever iOS JSON parser you're using preserves order in > hashes/associative arrays (possible, but not certain). They're presumably using query.prefixsearch (which *is* a JSON array) for the ordering, using the pageids from there to look up the additional info in the query.pages hash.
(In reply to Brad Jorsch from comment #16) > As for how the API code works in this situation, it instantiates two > instances of ApiQueryPrefixSearch, and each instance runs the same code to > get a list of titles: > > $searcher = new TitlePrefixSearch; > $titles = $searcher->searchWithVariants( $search, $limit, $namespaces ); > > The processing of that list of titles is different, but it should > theoretically be the same titles in the list. What seems to be going on is > that this code is not entirely deterministic. > That is why I originally WONTFIXED it. Searches aren't deterministic because the data changes under them. lsearchd was _more_ deterministic than cirrus because it changed slower and had fewer shards and replicas. The list shouldn't drastically change but it will change some. How drastic are the changes? Also, are there things from the search results that you need that you can't get by using it as a generator? As in, if we allowed generators to return things in order somehow would that be enough for you to only make a single call?
(In reply to Nik Everett from comment #17) > Also, are there things from the search results that you need that you can't > get by using it as a generator? As in, if we allowed generators to return > things in order somehow would that be enough for you to only make a single > call? That's correct -- if the generator returned the results in order, then we wouldn't need to do the other call.
(In reply to Nik Everett from comment #17) > The list shouldn't drastically change but it will change some. How drastic are > the changes? The changes don't seem to be very drastic; in some quick checking, it seems to mainly be in the ordering of the results. For example, in one query using only list=prefixsearch "Fisherman's Atlantic City Wind Farm" was at #14 while it dropped to #16 when the query was repeated, with no change in the relative order of any of the other 24 results. > Also, are there things from the search results that you need that you can't > get by using it as a generator? As in, if we allowed generators to return > things in order somehow would that be enough for you to only make a single > call? Allowing generators to return things in order isn't likely to happen. More likely is that generator=prefixsearch would be able to add an 'index' field into the hashes inside the 'pages' hash, so the client could sort based on that field's value. More thoughts on this are at https://www.mediawiki.org/wiki/API/Architecture_work/Planning#Allow_generators_to_provide_data
(In reply to Brad Jorsch from comment #19) > (In reply to Nik Everett from comment #17) > > The list shouldn't drastically change but it will change some. How drastic are > > the changes? > > The changes don't seem to be very drastic; in some quick checking, it seems > to mainly be in the ordering of the results. For example, in one query using > only list=prefixsearch "Fisherman's Atlantic City Wind Farm" was at #14 > while it dropped to #16 when the query was repeated, with no change in the > relative order of any of the other 24 results. Cool. That's pretty much what I expect. > > > Also, are there things from the search results that you need that you can't > > get by using it as a generator? As in, if we allowed generators to return > > things in order somehow would that be enough for you to only make a single > > call? > > Allowing generators to return things in order isn't likely to happen. More > likely is that generator=prefixsearch would be able to add an 'index' field > into the hashes inside the 'pages' hash, so the client could sort based on > that field's value. > > More thoughts on this are at > https://www.mediawiki.org/wiki/API/Architecture_work/ > Planning#Allow_generators_to_provide_data Cool.
Chad just SWATed our fix to prefix search relevance. You still get redirects if they are very good and there aren't any good regular pages. Try searching for "ozymandius". Its a common misspelling of Ozymandias.