Last modified: 2008-07-28 19:05:43 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T16923, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 14923 - Add methods for getting an array of titles from a category
Add methods for getting an array of titles from a category
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Categories (Other open bugs)
unspecified
All All
: Normal enhancement (vote)
: ---
Assigned To: Robert Leverington
: patch, patch-need-review
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-07-25 18:17 UTC by Robert Leverington
Modified: 2008-07-28 19:05 UTC (History)
1 user (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Adds two method to Category, getMemberIds() and getMemberTitles(). (1.56 KB, patch)
2008-07-25 18:17 UTC, Robert Leverington
Details
Revised method adding only the Category::getMembers() function with limit and offset parameters (1.02 KB, patch)
2008-07-28 16:05 UTC, Robert Leverington
Details
Revised version with better comments and string offsets. (1.18 KB, patch)
2008-07-28 17:08 UTC, Robert Leverington
Details

Description Robert Leverington 2008-07-25 18:17:45 UTC
Created attachment 5094 [details]
Adds two method to Category, getMemberIds() and getMemberTitles().

Several extension deal with all the members of a category in some way, it would be useful to have methods that allow one to get an array containing titles for all members in a category.

The attached patch adds two methods to the Category class:
 * getMemberIds() for getting an array of category page ids,
 * getMemberTitles() for getting an array of title objects for these page ids.
The arrays behind these are only generated when the methods are actually called (but are cached in member variables), to save on memory usage; and are ordered by the category sort key.

While this solution is clearly only adequate on smaller categories, most of the extensions that require such methods are aimed at smaller wikis - and this is therefore not a significant issue.  It is often the case that certain extensions would break or need redesigning if faced with thousands, rather than hundreds, of pages in a category (this is not necesserily a bad point however, due to the primary target of such extensions).  In the future if large categories need to be scanned, heavy duty progressive methods could be introduced in some way - but this is not yet necessary.

I'd appreciate it if someone could take a look over the patch, give comments, and possibly advice as to whether or not it would be suitable for core integration - after which case I will commit it if there is agreement.
Comment 1 Max Semenik 2008-07-25 18:27:16 UTC
And what if one of these functions is run on [[Category:Living people]]? There must be some kind of limit.
Comment 2 Robert Leverington 2008-07-25 18:33:48 UTC
As I pointed out in the bug description, this is aimed at extension being used on smaller wikis - but placed in core due to the number of extensions that could use it.  It would be silly to use this in core at all, due to the very reason you pointed out, and thus I see no reason to place any artificial limitation on it.  As I was careful to indicate, any traversal of a larger category would require a progressive function.  If this feature is ever commited I will, however, place a comment pointing out the danger of using this in core and how it would not be suitable for anything aimed at large wikis.
Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-07-25 18:38:41 UTC
Thoughts:

* If this is going to be done, I'd add offset and limit parameters and incorporate it into CategoryPage.php, replacing doCategoryQuery().  This would also allow third-party extensions to iterate through categories in a reasonable fashion.
* What's the point of getting the ID's?  ID's are practically useless by themselves.  I'd ditch that method and just return a title array.
* I would *not* cache the result.  There's a pretty good chance that it's large, so caching is a bad idea.  The caller can cache it if necessary.  If it's not, we don't want a ton of memory being hogged.
* Look over your whitespace.  You seem to have lots and lots of blank lines for no particular reason.  Don't make the first or last line of a function blank, and use blank lines within a function to separate groups of lines.  Don't put a blank line between every line.
* To get the titles, you're:
    1) querying the database for ID's,
    2) copying all those ID's into a PHP array (which is much less efficient than a MySQL result, according to Tim),
    3) caching that array even though nobody ever asked for it,
    4) querying the database *again* to get the info you need from the ID's (in Title::newFromIDs),
    5) copying those into an array of Title objects (which is horribly memory-inefficient: how big is each Title? how much overhead per array item),
    6) caching *that* array,
    7) returning it.
  I would strongly suggest that instead you do the full query to retrieve all necessary fields directly, then return the result wrapper.  Or better yet, an efficient TitleArray class akin to the UserArray class we now have would not go amiss.
Comment 4 Robert Leverington 2008-07-28 16:05:48 UTC
Created attachment 5100 [details]
Revised method adding only the Category::getMembers() function with limit and offset parameters

I have taken into account your advice here and on IRC and have created a revised method Category::getMembers() that accepts limit and offset parameters.  It uses the TitleArray class you developed and contains only one blank line (this seperates the actual querying and the construction of the limit\offset query options, which have to be added to the ORDER BY if needed).  No caching is done as this would no longer be sensible given the limit and offset parameters.
Comment 5 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-07-28 16:33:43 UTC
+	/**
+	 * @param integer $limit
+	 * @param integer $offset
+	 * @return TitleArray object for category members.
+	 */

More documentation, please.  Give a one-sentence description of the method's purpose, at least, like "Fetch a TitleArray of up to $limit category members, beginning after the category sort key $offset."  (See remarks below on changing $offset to a string.)

Also, I'm told we're supposed to be using "@param $name type", not the reverse:

http://www.mediawiki.org/wiki/Manual:Coding_conventions#Inline_documentation

+	public function getMembers( $limit = false, $offset = 0 ) {
+		$options = array( 'ORDER BY' => 'cl_sortkey' );
+		if( $limit ) $options += array( 'LIMIT' => $limit, 'OFFSET' => $offset);

Do not use OFFSET if you can possibly avoid it (except if you know that it's going to be acceptably small, which you don't).  What you want to do is something like

		$dbr = wfGetDB( DB_SLAVE );
		$conds = array( 'cl_to' => $this->getName() );
		$options = array( 'ORDER BY' => 'cl_sortkey' );
		if( $limit ) {
			$options['LIMIT'] = $limit;
		}
		if( $offset !== '' ) {
			$conds []= 'cl_sortkey > ' . $dbr->addQuotes( $offset );
		}

LIMIT m OFFSET n is inefficient: it basically just calculates the first m + n rows and throws away the first n.  The technique of using your sort value as an offset (which is used in Pager.php, for instance, and all pages that use it) is much more efficient for very large offsets, since it only scans the rows that are needed.  Pagination can be achieved by passing the last sort key in the previous batch as the offset in the new batch.  Note how category pages have offsets like "from=Bob_Smith", not "from=200", and the same is true for most other pages.

+		$dbr = wfGetDB( DB_SLAVE );
+		$members = TitleArray::newFromResult(
+			$dbr->select(
+				array( 'page', 'categorylinks' ),
+				array( 'page_id', 'page_namespace', 'page_title', 'cl_to', 'cl_from', 'cl_sortkey' ),

This is wrong.  You're never going to use cl_to, cl_from, cl_sortkey.  Make it

				array( 'page_id', 'page_namespace', 'page_title', 'page_len', 'page_is_redirect', 'page_latest' ),

as documented in the comment for TitleArray::newFromResult().

+				array( 'cl_to' => $this->getName() ),
+				__METHOD__,
+				$options,
+				array( 'categorylinks' => array( 'INNER JOIN', 'cl_from = page_id' ) )

Using the join thing here should work, but it's clunky.  Really this is only needed for left/outer joins.  You could drop this line and add 'cl_from = page_id' to the conditions, for the same effect.  But either way works.

+			)
+		);
+		return $members;

It's not a problem, but it would be a little simpler (one line less) if you just replaced "$members = ..." above with "return ...", and skipped this line.
Comment 6 Robert Leverington 2008-07-28 17:08:15 UTC
Created attachment 5101 [details]
Revised version with better comments and string offsets.

Takes into account all of the comments.
Comment 7 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-07-28 18:15:04 UTC
Looks good.  If you've tested it and it works, I say go ahead and check it in.
Comment 8 Robert Leverington 2008-07-28 19:05:43 UTC
Added in r38141

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links