Last modified: 2011-12-14 17:37:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T28614, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 26614 - action=parse shows different sortkey then one outputted by prop=categories (prefix vs actual binary sortkey)
action=parse shows different sortkey then one outputted by prop=categories (p...
Status: NEW
Product: MediaWiki
Classification: Unclassified
API (Other open bugs)
1.18.x
All All
: Normal enhancement (vote)
: ---
Assigned To: Roan Kattouw
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-01-06 23:15 UTC by db [inactive,noenotif]
Modified: 2011-12-14 17:37 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description db [inactive,noenotif] 2011-01-06 23:15:10 UTC
Using a very long sortkey (>255 bytes) let the api module action=parse to output that long sortkey, but the database stored later a truncated sortkey.

Please truncate the sortkey when adding to the ParserOutput, so action=parse shows the right sortkey.

Please truncate whole multibyte characters and avoid half bytes inside the database.

By the way, you can remove the truncation from LinksUpdate (r79706).

Thanks.
Comment 1 Bawolff (Brian Wolff) 2011-01-06 23:45:21 UTC
The reason i didn't do it in ParserOutput (well in addition to the parser being scary), is that an extension can add stuff to that array in several different ways, which makes it hard to put the check directly in ParserOutput.


Perhaps we could add a check in ParserOutput::addCategory where categories are added 99% of the time, and still keep the extra substring check in LinksUpdate. Does that sound sane?
Comment 2 db [inactive,noenotif] 2011-01-06 23:55:30 UTC
(In reply to comment #1)
> The reason i didn't do it in ParserOutput (well in addition to the parser being
> scary), is that an extension can add stuff to that array in several different
> ways, which makes it hard to put the check directly in ParserOutput.
> Perhaps we could add a check in ParserOutput::addCategory where categories are
> added 99% of the time, and still keep the extra substring check in LinksUpdate.
> Does that sound sane?

Yes, when the truncate inside LinksUpdate is needed, because of extensions, than keep the extra substring.
Comment 3 Bawolff (Brian Wolff) 2011-01-07 01:03:38 UTC
On second thought, even if it was truncated in ParserOutput, the api would still output the wrong sortkey on action=parse, because its outputting the equivalent of cl_sortkey_prefix, not cl_sortkey. The question is, should it be outputting the prefix, or the final (binary) sortkey?

Does anyone know how these values are actually used in practise (so which is appropriate?). Perhaps we should output both? I'm leaning towards outputting the final binary sortkey here, but I could also imagine someone would have use for the human readable sort key as well, so I'm unsure.

cc'ing Aryeh Gregor in case he has any thoughts on this, seeing as he did most of the category stuff.
Comment 4 Sam Reed (reedy) 2011-01-07 02:12:15 UTC
I'm guessing this has some relation to bug 24650
Comment 5 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-01-07 17:01:19 UTC
The way they're used in practice is that if cl_sortkey_prefix is empty, then cl_sortkey = $wgContLang->convertToSortkey( page_title ).  Otherwise, cl_sortkey = $wgContLang->convertToSortkey( cl_sortkey_prefix . "\0" . page_title ).  See Title::getCategorySortkey() and Language::convertToSortkey().

cl_sortkey is in general an arbitrary binary string which may bear no discernible relationship to the original page title or sortkey prefix, once we start using proper ICU or CLDR or whatever for sortkeys.  So I can't imagine why anyone would want the actual value; logically, you'd only want to sort by it.

I don't know why you'd want cl_sortkey_prefix either, for that matter (although it is going to be human-readable text).  I can't see any use for it other than constructing cl_sortkey.  Do you know of anyone who actually wants or uses these?
Comment 6 Sam Reed (reedy) 2011-01-15 01:01:18 UTC
This isn't actually an API par se, it's more the parser sort of.

Where is it actually output from a parse query? I can't see it anywhere obviously.

bug 24650 has been fixed now.

Where are these actually used anywhere by anyone? I'm poking around, and have logged bug 26736 as a related bug, depending on this perceived usefulness..

Especially, as per Aryeh, it may move further from being a human readable thing, which is probably useless.


+------------+-------------------+
| cl_sortkey | cl_sortkey_prefix |
+------------+-------------------+
| FULLTWOLOW |                   |
| MAIN PAGE  |                   |
+------------+-------------------+
Comment 7 Bawolff (Brian Wolff) 2011-01-15 02:07:06 UTC
>Where is it actually output from a parse query? I can't see it anywhere
>obviously.

The sortkey attribute of the cl element in this query:
api.php?action=parse&text=[[category:Foo|bar]]&prop=categories

Honestly, it wouldn't be horrible to leave it the way it is. In a sense, its outputing what it reads the sortkey as, not what the final sortkey is.
Comment 8 Mark A. Hershberger 2011-01-20 18:56:22 UTC
Since Bug #24650 is fixed now, this obviously wasn't blocking it.  Removing blocker status.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links