Last modified: 2009-01-30 22:55:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T18852, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 16852 - padleft and padright do not handle multibyte characters properly
padleft and padright do not handle multibyte characters properly
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
1.14.x
All All
: Normal major with 1 vote (vote)
: ---
Assigned To: Nobody - You can work on this!
:
: 12324 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-12-31 22:08 UTC by rememberthedot
Modified: 2009-01-30 22:55 UTC (History)
3 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Proposed patch v1 (2.48 KB, patch)
2008-12-31 22:08 UTC, rememberthedot
Details

Description rememberthedot 2008-12-31 22:08:20 UTC
Created attachment 5635 [details]
Proposed patch v1

The built-in parser functions "padleft" and "padright" do not handle multibyte characters properly. For example, including {{padright:Hello-|7|Æ}} anywhere on a page blanks the entire page. Including multibyte characters in the first parameter produces unexpected results, for example:

{{padleft:Æ|5}} = 000Æ when 0000Æ was expected
{{padleft:本|5}} = 00本 when 0000本 was expected

These problems severely reduce the usefulness of the padleft and padright functions because there is no guarantee that the actual output will even remotely resemble the expected output. The output could have the wrong amount of padding, or blank the page entirely.

I'm including a patch I wrote that cleans up MediaWiki's pad function and gives it proper support for multibyte characters. It also adds a couple test cases to parserTests.txt.
Comment 1 Chad H. 2009-01-02 13:31:04 UTC
*** Bug 12324 has been marked as a duplicate of this bug. ***
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-01-14 14:36:49 UTC
+		$lengthOfPadding = mb_strlen( $padding );		
+		if ( $lengthOfPadding == 0 ) return $string;
+		
+		#the remaining length to add counts down to 0 as padding is added
+		$length = min( $length, 500 ) - mb_strlen( $string );
+		$finalPadding = ''; # $finalPadding is just $padding repeated enough times so that strlen( $string ) + strlen ( $finalPadding ) == $length
+		while ( $length > 0 ) {
+			$finalPadding .= mb_substr( $padding, 0, $length );
+			$length -= $lengthOfPadding;
+		}

This seems pretty elaborate.  Why don't you just do:

		$char = mb_substr( $padding, 0, 1 );
		if ( strlen( $padding ) == 0 ) return $string;
		$length = min( $length, 500 ) - mb_strlen( $string );
		$finalPadding = str_repeat( $char, $length );

I don't understand a lot of what you've written.  Like:

+		$lengthOfPadding = mb_strlen( $padding );		

Why do you care?  You're only padding with the first character anyway, aren't you?

+			$finalPadding .= mb_substr( $padding, 0, $length );

Isn't $length usually going to be *longer* than $padding?  And again, why do you care about more than the first character?  Current code only uses the first character of the padding and simply discards the rest -- if you're changing that (other than by using only the first *Unicode* character), you should say so, and say why.  You should especially address the question of what's expected to happen if $length isn't a multiple of $lengthOfPadding.
Comment 3 rememberthedot 2009-01-14 16:38:40 UTC
All right, those are good comments. I tried to make the pad function behave like PHP's str_pad function as much as possible. That included adding the ability for $padding to be more than one character. If the padding does not evenly fill the desired length, it is truncated. For example, {{padright:abc|8|def}} = abcdefde

You're right that I should have explained this better earlier. I was hoping to make padleft and padright just as flexible as PHP's str_pad, so that we can design more powerful wikitemplates than is currently possible. Since padleft and padright currently break with Unicode input, we really can't design reliable templates at all. If support for multibyte characters is to be added, and I feel that it should be, it would be a good idea to add it at the same time as we fix the glaring bugs. This will help prevent backwards-compatibility problems - the new functionality will be added at the same time that this function becomes reliable enough for general use.
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-01-14 17:56:07 UTC
I wouldn't consider dropping non-initial padding characters to be a bug, more like an omission.  But in that case I see what you're doing here, the patch makes sense now.  Committed as r45734.
Comment 5 rememberthedot 2009-01-14 17:59:59 UTC
I agree. Thank you!

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links