Last modified: 2014-11-17 10:35:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T19993, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 17993 - Option on API lists to only have count of links/categories/whatever returned, rather than a full resultset
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: API
Version: 1.14.x
Hardware: All All
Importance: Normal enhancement with 4 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://en.wikipedia.org/w/api.php?act...
Duplicates: 20504
Depends on: 36912

Reported: 2009-03-15 21:07 UTC by Sam Reed (reedy)
Modified: 2014-11-17 10:35 UTC (History)

Description Sam Reed (reedy) 2009-03-15 21:07:24 UTC
Would it be possible for (possibly all?) queries that get links on a page, categories on a page, pages in a category, etc. to return the number of results, either as an option or even just by default?

That is, for the link above,

instead of 
        <categories>
          <cl ns="14" title="Category:1879 births" />
          <cl ns="14" title="Category:1955 deaths" />
          <cl ns="14" title="Category:Academics of the Charles University" />
          <cl ns="14" title="Category:Albert Einstein" />
          <cl ns="14" title="Category:American humanists" />
          <cl ns="14" title="Category:American pacifists" />
          <cl ns="14" title="Category:American philosophers" />
          <cl ns="14" title="Category:American physicists" />
          <cl ns="14" title="Category:American socialists" />
          <cl ns="14" title="Category:American vegetarians" />
        </categories>

something like

        <categories count="10">
          <cl ns="14" title="Category:1879 births" />
          <cl ns="14" title="Category:1955 deaths" />
          <cl ns="14" title="Category:Academics of the Charles University" />
          <cl ns="14" title="Category:Albert Einstein" />
          <cl ns="14" title="Category:American humanists" />
          <cl ns="14" title="Category:American pacifists" />
          <cl ns="14" title="Category:American philosophers" />
          <cl ns="14" title="Category:American physicists" />
          <cl ns="14" title="Category:American socialists" />
          <cl ns="14" title="Category:American vegetarians" />
        </categories>


And an option to pass &countonly (or something), and just get back

<categories count="10" />

Please?


Thanks
Comment 1 Roan Kattouw 2009-03-16 13:51:21 UTC
(In reply to comment #0)
> Would it be possible on (possibly all?) queries that are getting links on
> pages, categories on page, pages in category, etc, have an option (or even just
> by default), a number of results returned?
> 
> ie for the link above,
> 
> instead of 
>         <categories>
>           <cl ns="14" title="Category:1879 births" />
>           <cl ns="14" title="Category:1955 deaths" />
>           <cl ns="14" title="Category:Academics of the Charles University" />
>           <cl ns="14" title="Category:Albert Einstein" />
>           <cl ns="14" title="Category:American humanists" />
>           <cl ns="14" title="Category:American pacifists" />
>           <cl ns="14" title="Category:American philosophers" />
>           <cl ns="14" title="Category:American physicists" />
>           <cl ns="14" title="Category:American socialists" />
>           <cl ns="14" title="Category:American vegetarians" />
>         </categories>
> 
> something like
> 
>         <categories count="10">
>           <cl ns="14" title="Category:1879 births" />
>           <cl ns="14" title="Category:1955 deaths" />
>           <cl ns="14" title="Category:Academics of the Charles University" />
>           <cl ns="14" title="Category:Albert Einstein" />
>           <cl ns="14" title="Category:American humanists" />
>           <cl ns="14" title="Category:American pacifists" />
>           <cl ns="14" title="Category:American philosophers" />
>           <cl ns="14" title="Category:American physicists" />
>           <cl ns="14" title="Category:American socialists" />
>           <cl ns="14" title="Category:American vegetarians" />
>         </categories>
> 
> 
I don't see the use here, as counting results on the client side is trivial and inexpensive.
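
As a rough illustration of the client-side counting mentioned here, the following Python sketch parses a response fragment shaped like the one in the description and counts its <cl> elements. The sample data and the function name count_categories are made up for the example; a real client would first have to fetch the response from api.php.

```python
import xml.etree.ElementTree as ET

# Sample fragment in the shape quoted in the description (shortened).
SAMPLE = """
<categories>
  <cl ns="14" title="Category:1879 births" />
  <cl ns="14" title="Category:1955 deaths" />
  <cl ns="14" title="Category:Albert Einstein" />
</categories>
"""


def count_categories(xml_text: str) -> int:
    """Count the <cl> children of a <categories> fragment."""
    root = ET.fromstring(xml_text.strip())
    return len(root.findall("cl"))


print(count_categories(SAMPLE))
```

Note that this only counts what one response contains; it says nothing about results beyond the query limit.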

> And an option to be able to go &countonly (or something), and just be returned
> 
> <categories count="10" />
> 
I get that this would maybe save bandwidth; I'll look into implementing it.
Comment 2 Sam Reed (reedy) 2009-03-16 14:28:00 UTC
Fair enough..

The idea came from cases in AWB where I want a count of the categories on a page but couldn't care less what they are, so I might as well just get a count.

The <categories count="10" /> style, if implemented on most of the query types, would be useful!

But I suppose you're right: if you want the list of categories, the count is redundant there.

Thanks
Comment 3 Roan Kattouw 2009-09-04 17:35:29 UTC
*** Bug 20504 has been marked as a duplicate of this bug. ***
Comment 4 Le Chat 2009-09-05 09:15:27 UTC
This would indeed be useful, primarily in cases where the size of the result set is greater than the limit for a single query (for example, you want to know how many backlinks there are without having to execute a large number of consecutive queries). I don't see why this should be a problem (I don't know SQL, but thinking abstractly - if you're able to establish that exactly 10 pages are backlinks in a single query, you should be able to establish that exactly 10 pages are not backlinks, i.e. that N-10 pages are backlinks, and similarly for all numbers in between.)
Comment 5 Roan Kattouw 2009-09-05 09:24:10 UTC
(In reply to comment #4)
> This would indeed be useful. Primarily in cases where the size of the result
> set is greater than the limit for a single query (for example, you want to know
> how many backlinks there are without having to execute a large number of
> consecutive queries). I don't see why this should be a problem (I don't know
> SQL, but thinking abstractly - if you're able to establish that exactly 10
> pages are backlinks in a single query, you should be able to establish that
> exactly 10 pages are not backlinks, i.e. that N-10 pages are backlinks, and
> similarly for all numbers in between.)
> 

It doesn't work this way. It's *possible* to find out how many backlinks there are without listing them all, but it's not *efficient* enough to be acceptable on Wikipedia.
Comment 6 Le Chat 2009-09-05 12:53:41 UTC
So can you explain how it does work? Is the list of backlinks maintained explicitly in some table? Or does the software compile the list each time, by looking through the table of forward links? (I just can't imagine how there would be any efficiency difference between counting N positive results and effectively counting N negative ones.)
Comment 7 Roan Kattouw 2009-09-05 13:01:43 UTC
(In reply to comment #6)
> So can you explain how it does work? Is the list of backlinks maintained
> explicitly in some table? Or does the software compile the list each time, by
> looking through the table of forward links? (I just can't imagine how there
> would be any efficiency difference between counting N positive results and
> effectively counting N negative ones.)
> 

They're in the pagelinks table, which has the fields pl_from (page ID of the source page), pl_namespace and pl_title (NS+title of the target page). There is an index on the table which allows us to retrieve data sorted by pl_namespace, then pl_title, then pl_from. Since we're looking for rows with e.g. pl_namespace=0 and pl_title=Foo, all rows we're looking for are consecutive, and the first one can easily be located using binary search.

This means we're not examining any rows that aren't in our list: we know our list is consecutive and where it starts. However, counting how many items are in the list still requires us to examine all of them, which means examining an arbitrary and possibly very large number of rows, which we don't want to do for performance reasons.

Another caveat is that the N-10 approach assumes that rows that don't satisfy our criterion are rare, which is definitely not the case in the pagelinks table for an enwiki-sized wiki.

(Disclaimer: all of this is based on my limited and possibly misguided understanding of how MySQL indexes work; all of this stuff happens in MySQL, not in MediaWiki)
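
The index behaviour described in this comment can be modelled with a toy Python sketch: a list of (pl_namespace, pl_title, pl_from) tuples kept in sorted order, a binary search to find the first matching row, and a forward scan that tallies how many rows it touches. The data and the function name scan_backlinks are invented for illustration, and this is only a model of the explanation above, not of how MySQL actually executes the query.

```python
from bisect import bisect_left

# Toy stand-in for the index: rows sorted by
# (pl_namespace, pl_title, pl_from). All data here is made up.
INDEX = sorted(
    [(0, "Bar", pl_from) for pl_from in range(1, 4)]
    + [(0, "Foo", pl_from) for pl_from in range(1, 1001)]
    + [(0, "Qux", pl_from) for pl_from in range(1, 6)]
)


def scan_backlinks(index, namespace, title, limit=None):
    """Return (matching pl_from values, number of index rows examined)."""
    # Binary search locates the first row for the target page.
    start = bisect_left(index, (namespace, title))
    found = []
    examined = 0
    for i in range(start, len(index)):
        if limit is not None and len(found) >= limit:
            break  # a limited listing stops early
        examined += 1
        if index[i][:2] != (namespace, title):
            break  # walked past the end of the consecutive run
        found.append(index[i][2])
    return found, examined


rows, examined = scan_backlinks(INDEX, 0, "Foo", limit=50)
print(len(rows), examined)

rows, examined = scan_backlinks(INDEX, 0, "Foo")
print(len(rows), examined)
```

In this model the limited scan touches only as many rows as it returns, while the unlimited scan has to walk the whole run of matching rows before it knows the count.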
Comment 8 Bryan Tong Minh 2010-11-27 18:39:47 UTC
This can only be implemented as returning "x", where x is either a specific number or something that indicates "more than y". Would such a feature still be useful, or should this be closed as WONTFIX?
Comment 9 Sam Reed (reedy) 2010-11-27 22:48:51 UTC
It's still useful, ish.

The use case was for the damn stupid "A page is an orphan if it has fewer than X incoming links", and other such stupid responses.

Technically, the point of the request is to avoid being sent information that we're just going to ignore: we care about the count, not what the items are (in most cases).

To an extent, just setting the request limit to, say, wanted + 1 and seeing what we get back would do. But we're still getting useless information.

But returning "x", where x is a specific number or "more than y", amounts to the same thing.
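
A minimal Python sketch of the "wanted + 1" idea, which also yields the "specific number or more than y" answer discussed above. The names capped_count and is_orphan are hypothetical, and the backlinks are represented by any iterable rather than a real API response.

```python
from itertools import islice


def capped_count(results, cap):
    """Count items from an iterable, reading at most cap + 1 of them.

    Returns (count, exceeded). When exceeded is False, count is exact;
    when it is True, count equals cap and means "more than cap".
    """
    seen = sum(1 for _ in islice(results, cap + 1))
    if seen > cap:
        return cap, True
    return seen, False


def is_orphan(backlinks, threshold):
    """Hypothetical rule: orphan if fewer than `threshold` incoming links."""
    _, exceeded = capped_count(backlinks, threshold - 1)
    return not exceeded


print(capped_count(range(10), 5))
print(is_orphan(["Page A", "Page B"], 3))
```

The check never reads more than threshold items, however many backlinks exist, but as noted above the items themselves are still transferred.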

Though, technically, just doing the request without the SELECT columns (well, we'd need to select something trivial) would do it, surely? And then using the DB object to do a row count?
Comment 10 Bergi 2011-06-14 16:19:09 UTC
Isn't there even an SQL command to return just the length of the matched set instead of the items themselves? Of course Roan's explanation is good, but this is only the price for one query. If my aim were to count the set, I would have to make all the continue queries, which means the same searching through the table as it would have been for one query. Of course, this might be an argument to repeal any API limits, but the real advantage is the saving of bandwidth and PHP requests.

A script that could profit from this would be http://de.wikipedia.org/wiki/MediaWiki:Gadget-revisionCounter.js. Here just two queries would have to be done: 
* api.php?action=query&prop=revisions&titles=Foo&countonly
* api.php?action=query&prop=revisions&titles=Foo&rvuser=Bar&countonly
Another example would be a tool to retrieve the number of template transclusions, just like http://toolserver.org/~jarry/templatecount/. There a simple call to api.php?action=query&list=embeddedin&eititle=Template:Foo&countonly would be enough.

The countonly parameter should work for all properties but "info", "categoryinfo" and "pageprops", and for all lists but "random". The "search" list already provides a "totalhits" parameter, which might be interesting. I don't think counting would be useful for meta queries.
Comment 11 Roan Kattouw 2011-06-14 18:32:59 UTC
(In reply to comment #10)
> Isn't there even an SQL command to return just the length of the matched set
> instead of the items themselves?
Yes. It's COUNT(*)

> Of course Roans explanation is good, but this
> is only the price for one query.
It's not the same price. A LIMIT 50 query only inspects 50 rows (or maybe a bit more if there's a WHERE clause that can't be done with an index), whereas a COUNT(*) query will inspect all rows in the entire result set in order to count them. That could be a million rows in extreme cases (e.g. counting the category members of Living_people on enwiki, which I think has something like 750k members). It should be obvious that a query examining 100 rows is much, much faster than a query that inspects almost a million.
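
For concreteness, here is a small Python sketch of the two query shapes being compared, run against an in-memory SQLite database. It is an assumption-laden stand-in: it uses SQLite rather than MySQL, a simplified pagelinks table with only the three fields named earlier in this thread, a made-up index name pl_target, and made-up data. It shows the queries and their results, not how many rows each one examines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pagelinks ("
    "pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT)"
)
# Index ordered by target namespace, target title, then source page ID.
conn.execute(
    "CREATE INDEX pl_target ON pagelinks (pl_namespace, pl_title, pl_from)"
)
conn.executemany(
    "INSERT INTO pagelinks VALUES (?, 0, 'Foo')",
    ((page_id,) for page_id in range(1, 10001)),
)

# Paged listing: the query can stop after the first 50 matching rows.
page = conn.execute(
    "SELECT pl_from FROM pagelinks "
    "WHERE pl_namespace = 0 AND pl_title = 'Foo' "
    "ORDER BY pl_from LIMIT 50"
).fetchall()

# Counting: every matching row has to be visited to produce the total.
(total,) = conn.execute(
    "SELECT COUNT(*) FROM pagelinks "
    "WHERE pl_namespace = 0 AND pl_title = 'Foo'"
).fetchone()

print(len(page), total)
```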

> If my aim was to count the set, I would have
> to make all the continue-queries, which means the same searching through the
> table as it would have been for one query. Of course, this might be an argument
> to repeal any api limits, but the real advantage is the save of bandwidth and
> PHP-requests.
>
Yes, it means the entire result set will be scanned eventually. But there's an advantage to not doing that all at once. Count queries of a high magnitude can easily take a minute, and at some point things start timing out (PHP max exec time limit, timeouts on the client side, timeouts in intermediate caching proxies).
 
So we could return a count, but we'd have to keep paging in place.
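
A sketch of what counting with paging kept in place could look like from the client's side, in Python. Everything here is hypothetical: fetch_page stands for whatever call retrieves one page of results plus a continuation value, and make_fake_source is an in-memory stand-in so the loop can be run without a wiki.

```python
def count_by_paging(fetch_page, limit=50):
    """Add up page sizes until no continuation value is returned.

    fetch_page(continue_from, limit) is a hypothetical callable that
    returns (items, next_continue); next_continue is None on the last page.
    Returns (total items, number of requests made).
    """
    total = 0
    requests = 0
    continue_from = None
    while True:
        items, continue_from = fetch_page(continue_from, limit)
        total += len(items)
        requests += 1
        if continue_from is None:
            return total, requests


def make_fake_source(size):
    """In-memory stand-in for a paged list, for demonstration only."""
    data = list(range(size))

    def fetch_page(continue_from, limit):
        start = continue_from or 0
        items = data[start:start + limit]
        next_continue = start + limit if start + limit < size else None
        return items, next_continue

    return fetch_page


print(count_by_paging(make_fake_source(120)))
```

Each request stays small and bounded, at the cost of more round trips for large result sets.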
