Last modified: 2014-11-17 10:35:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You could still run searches in Bugzilla or access your list of votes but bug reports will obviously not be up-to-date in Bugzilla.
Bug 22967 - Edit comment/summary accepts 200 chars but is truncated on save.
Edit comment/summary accepts 200 chars but is truncated on save.
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Page editing (Other open bugs)
1.15.x
All All
: Normal major with 1 vote (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks: 4714
  Show dependency treegraph
 
Reported: 2010-03-26 18:46 UTC by Arod Shatil
Modified: 2014-11-17 10:35 UTC (History)
6 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
patch that uses js to give better limit for summary field (1.83 KB, patch)
2010-04-01 21:08 UTC, Bawolff (Brian Wolff)
Details
version 2 of js enforcing byte limit on summary field (2.07 KB, patch)
2010-04-02 01:17 UTC, Bawolff (Brian Wolff)
Details
patch to make Summary not be cut off. (the better version ;) (2.09 KB, patch)
2010-05-06 03:34 UTC, Bawolff (Brian Wolff)
Details

Description Arod Shatil 2010-03-26 18:46:12 UTC
Edit comment/summary is limited to 200 chars by use of maxlength=200 in the edit form.
however, the database field (rev_comment column in revisions table) is "tinyblob" which can only hold 255 bytes. in unicode, 200 characters can be up to 600 or even 800 bytes or so (may be rare, but 2:1 is pretty much the rule for any non-latin-alphabeth language).

in addition, EditPage.php contains the line:
$this->summary = $wgLang->truncate( $request->getText( 'wpSummary' ), 250, '' );
(around line 559 in 1.15.1). 
"truncate" also works on bytes and not characters, due to its use of "substr".

remedy:
1)
<code=sql>
alter table revisions modify column rev_comment blob not null;
(can be safely run on upgrades - mysql will copy the content of the tinyblob without complains)
</code>
2)
change the "250" to "600" or so in EditPage.php#559.
alternatively, you might be able to teach "truncate" to count utf-8 chars instead of bytes. in this case you probably want to change the "250" in EditPage.php to "200".
this can be achieved by something like:
<code=php>
function mysubstr($str, $length) {
   return function_exists("mb_substr")?mb_substr($str,$len):substr($str,$len);
}
</code>
and replace the calls to substr by calls to "mysubstr". (this is a sketch - you need to accommodate for some optional params).
if you do the latter, i think you can change the "250" in EditPage.php to "200", which may actually be nicer that kicking it up to 600 and staying with the broken substr.
(apologies for not submitting a "real" patch - i will once you guys upgrade to git.)


the alternative, i.e. to limit "maxlength" to less than 200 is not a good idea, because sometimes you *do* need more than 100 chars to describe an edit, and anyway, to be on the safe side you'll need to limit it to 50 which is unacceptable.

the above will solve the problem for storage of comments and for displaying them on the page history, but for "latest changes" there is a further truncation somewhere which i haven't located yet, so you'll need to hunt it also.
Comment 1 Arod Shatil 2010-03-26 18:49:02 UTC
and of course, the existing situation, where users enter edit summaries and (sometimes) sweat to make them fit the 200-characters limit just to see them truncated on the 125 character is *totally* unacceptable.
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-26 18:52:17 UTC
See bug 4715.  It isn't so straightforward to just make the field longer.  An alternative would be to work around it by writing JS that would prohibit further input at exactly 255 UTF-8 bytes.  Personally I think we should do something to allow longer edit summaries to be stored, but someone will have to get the sysadmins to approve that if it's going to be used by Wikimedia.
Comment 3 Arod Shatil 2010-03-26 19:11:38 UTC
i disagree.
first, it is *very* straightforward to change one tinyblob to blob and you know it. it only costs one extra byte per revision - hardly something that will overwhelm the system.

second, if English has 200 chars, why should Hebrew be limited to 125? (and Malaysian to 60?) it may be a more compact language, but not *that* compact.
also, your JS script really doesn't do it, because it only triggers after the user have sweated on making her comment fit in the confines of 200 (or whatever) characters, and then inform her that "it doesn't fit"?
and writing a JS that will trigger on keypress is very unstable and will not work well for older, slower systems, not to mention that any JS dependency is bad - not everyone has it enabled.

the current system renders the edit-comment field barely useful, or almost useless for anything other than latin-alphabet-based languages.
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-26 19:20:29 UTC
(In reply to comment #3)
> first, it is *very* straightforward to change one tinyblob to blob and you know
> it. it only costs one extra byte per revision - hardly something that will
> overwhelm the system.

Domas Mitzuas said otherwise, and he's Wikimedia's DBA.  If you think you know better, I suggest you take that up with him.

> second, if English has 200 chars, why should Hebrew be limited to 125? (and
> Malaysian to 60?) it may be a more compact language, but not *that* compact.

It shouldn't.  However, as long as the field is limited to 255 bytes, it would be better to have an accurate JS limit instead of an inaccurate hardcoded maxlength=200.  This will give more room for mostly-ASCII languages, and stop mostly-not-ASCII languages from having their comments truncated.

> also, your JS script really doesn't do it, because it only triggers after the
> user have sweated on making her comment fit in the confines of 200 (or
> whatever) characters, and then inform her that "it doesn't fit"?

No.  It would be an oninput script or something, and would return false (stopping the keypress from having any effect) if the limit would be exceeded.  So it would work just like maxlength.

> and writing a JS that will trigger on keypress is very unstable and will not
> work well for older, slower systems

A JS handler is not unstable or slow.  It would work indistinguishably from the maxlength attribute.

> not to mention that any JS dependency is
> bad - not everyone has it enabled.

Sure, agreed.  We'd keep the maxlength=200 for people without JS.  There's no way to do this well without JS, though, as long as it has to be fit into 255 bytes -- you could have a server-side error message instead of truncation, but that would be *way* more annoying.

> the current system renders the edit-comment field barely useful, or almost
> useless for anything other than latin-alphabet-based languages.

Sure.  I agree that we should increase the edit summary storage somehow if possible.  If that's not so easy to get done in the near future, however, it would be nice to at least be able to accurately limit the input so it's not mysteriously truncated.  (I'm amazed that users of non-Latin languages have put up with the mysterious truncation for this long.)
Comment 5 Roan Kattouw 2010-03-26 19:25:21 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > first, it is *very* straightforward to change one tinyblob to blob and you know
> > it. it only costs one extra byte per revision - hardly something that will
> > overwhelm the system.
> 
> Domas Mitzuas said otherwise, and he's Wikimedia's DBA.  If you think you know
> better, I suggest you take that up with him.
> 
ALTER TABLE queries on standard garden variety tables are straightforward, yes. ALTER TABLE queries on tables with 350 million rows and 6 indexes in a replication environment, not so much.
Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-26 19:31:20 UTC
The ALTER TABLE is a nonissue.  There are procedures in place to do that, although it's a pain and it might take a while if that was the only issue.  The problem here is Domas said a simple schema change isn't acceptable.  See bug 4715 comment 8 and bug 4715 comment 10.
Comment 7 Bawolff (Brian Wolff) 2010-04-01 19:39:37 UTC
Going over here from Bug 4715 as not really relevant over there per Aryeh)
(In reply to Bug 4715 comment #45)
> please note that my routine *did not* "block keystrokes". rather it manipulated
> the maxlength property of the input field and let the browser handle the actual
> keystrokes.

What I meant to say, is blocking keystrokes is a bad idea unless you check for non-printing keystrokes, as it'd block backspaces and such. I just mentioned that as it seemed an obvious way of dealing with the problem.

The concern i have is I'm not sure if maxlength is interpreted uniformly. I only tested in firefox, and it uses how many UTF-16 code points are used. That means that sometimes you can have 1 character count as 2 (for example LINEAR B SYLLABLE B008 A (U+10000) is represented by 0xD800 0xDC00 in utf-16, which is what firefox counts in the maxlength property)

Which is all great and all, but when I read the html5 specification it says that the maxlength specifies the maximun number of Unicode code points in the field. Which is ambiguous to me, as its unclear (at least to me) if stuff in supplementary characters are one code point (U+10000) [what i would think] or as a surogate pair of 2 code points (what firefox seems to do).

Anyways, my understanding of unicode is limited, so i could be totally wrong about all this.
Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-04-01 19:51:39 UTC
If some browsers interpret maxlength as code points and some as half the number of UTF-16 bytes, that's not a big issue, because it only affects astral-plane stuff.  May as well get a working implementation for the BMP before worrying about slight overcounting (one byte per character) in the astral plane.
Comment 9 Bawolff (Brian Wolff) 2010-04-01 21:08:16 UTC
Created attachment 7256 [details]
patch that uses js to give better limit for summary field

Ok, here's a patch that does much more accurate length limit on summaries.

I have only tested this on firefox so far. I need to test it on other browsers.

It still reserves the extra 5 bytes for a "..." in case something goes over, so it doesn't use the maximun possible length, but i doubt anyone is going to care about 5 bytes.
Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-04-01 23:01:34 UTC
Looks good at a glance.  Can't test it right now, will do so later.  Preliminary comments:

+			document.getElementById('wpSave').focus(); //to show user they hit limit.

This doesn't look like a good idea to me.  This might result in the user typing quickly and hitting space, submitting the form before they're ready.  We should try to match the UI of maxlength here exactly -- it's what users are familiar with, and has been proven to work.

+			summary.value = text.substring(0, text.length - 1);
+			//We don't know how many characters over it is, so only plop off one
+			trimSummary(); //in case we need to plop off more then one (surogates, copy/paste etc).

What actually happens here with surrogates?  Won't it remove only one, find it's now under the limit, and allow submission of invalid Unicode?  (JS handling of Unicode is insane, sigh.)  Probably not the end of the world, but worth noting in a comment.  Also worth making sure we know what happens, because pathological behavior is conceivable . . . for instance, if substring() actually returns the full string rather than removing half a character, the code will recurse infinitely.
Comment 11 Bawolff (Brian Wolff) 2010-04-02 01:17:11 UTC
Created attachment 7258 [details]
version 2 of js enforcing byte limit on summary field

Yeah, I could imagine that if it submitted when someone thought they were typing, it could be annoying. I've revised it so it doesn't switch focus (of course its not identical. for example it will delete from the end regardless of where person is typing from). I don't think an infite recursion would happen, but its better to be safe then sorry in these types of things, so i added a check to see if substring actually made the string shorter.

-----

I've tested it on Opera, firefox, konquor and chrome (IE is the only one i haven't tested yet). The only issue i noticed is that if javascript modifies the summary, field, it appears as if the focus of the field is at the beginning of it rather then where you were just typing in  firefox (but not opera or chrome), which is rather annoying. However, since the cursor stays at the end, I think thats probably acceptable. 

As for surogates, if half a surogate is cut off, then an invalid sequence goes into summary box (on firefox anyways, this seems to vary by browser). Whats really weird is on firefox, this happens even when just using maxlength attribute. If you only have 1 character of room left, and you put in a U+10000, firefox will insert half a surogate pair. In any case, I don't really think thats an issue, as its validated on the server-side.

How browsers handle unicode does seem to vary, especially on how they interpert maxlength. for example, in chrome you can have document.getElementById('summary').value.length > document.getElementById('summary').maxlength since js string lengths are measured differently then maxlength in that browser. All and all unicode makes things weird :s
Comment 12 Arod Shatil 2010-04-02 02:04:53 UTC
nitpick:
len = text.replace(/[^\u0000-\u007F]/g, '').length +
		text.replace(/[^\u0080-\u07FF\uD800-\uDFFF]/g, '').length*2 +
		text.replace(/[^\u0800-\uD7FF\uE000-\uFFFF]/g, '').length*3;

could (and should) be:
len = text.replace(/[\u0080-\u07FF\uD800-\uDFFF]/g, '**').replace(/[\u0800-\uD7FF\uE000-\uFFFF]/g, '***').length;
not so much for efficiency (though it *is* more efficient), it just makes more sense.

(you could use spaces instead of *'s, i used them for code readability)

and i still think that "my" handling, i.e. manipulating the maxlength property makes more sense, though i readily admit that "your" way is less browser-dependent.

"my" way does not change the current user experience: now, when someone fills all 200 chars the input field simply refuses to get more characters, but still process keystrokes (backspace, arrow keys etc.) and does not throw focus elsewhere. 
also trimming from the end is new behavior that better be avoided (methinks).


again,all the above are nitpicks. if you guys can get it into the code it would be great!
thanks.
Comment 13 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-04-12 20:56:09 UTC
Sorry for the long delay!  I was caught up in non-MediaWiki stuff for the last couple of weeks and didn't have time to look at this.  I've reviewed the patch, and I agree with Arod that it's very jarring.  Try typing out to almost full length, then inserting some multibyte characters in the middle . . . everything jumps around and stuff is removed from the end.

Maybe his suggested approach would work better.  I'd be happy to review further patches, hopefully more expeditiously than I reviewed this one.

Also, another thing -- it's probably best to keep the maxlength at 200, and only change it in JavaScript.  This way, if JavaScript is disabled, you don't have a regression for users of Latin languages that use only a few multibyte characters.
Comment 14 Bawolff (Brian Wolff) 2010-05-06 03:34:18 UTC
Created attachment 7350 [details]
patch to make Summary not be cut off. (the better version ;)

Third times the charm :). I think this patch is much better then my previous ones. It cancels key strokes if the summary has too many characters, it does not block special keys (backspace, etc). It keeps the maxlength to 200 for people without js, then overrides it to 250 in javascript. It behaves much more like maxlength parameter.

One thing to note is that this will consider ctrl+v as a non-printing keystroke, and not filter it. I think thats ok.

Tested on the linux versions of: firefox (actually iceweasel) 3.0.6, MSIE 6.0, Opera 10.10, Chrome 5.0.342.7 beta, Konqueror 5.3.9
Comment 15 Arod Shatil 2010-05-06 08:13:20 UTC
nice.
one small nitpick, though: 

i find it aesthetically displeasing that in the three different points the function "checkSummary" returns, three different mechanisms are used for return value:
for special chars you use the return value of the function itself.
when the string is too long you set both e.preventDefault() and e.returnValue, but not the return value of the function, and in all other cases you do not set any property on e, and leave the function return value undefined.
i think you probably want to add
    return false; 
when len > 250, and
    return true;
just before the last "}" of checkSummary.


and a practical question:
could you please add a couple of lines instructing me how to add this script to my user space (personal scripts or somesuch) in he.wikipedia, so i can test it? (can't test it on my own wiki, because i removed the 200 chars limit in the DB and increased the php limit to 400...)

peace.
Comment 16 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-05-06 15:21:36 UTC
Thanks for the patch!  I'm currently in the middle of finals, but should be able to try this out in a week or two.
Comment 17 Bawolff (Brian Wolff) 2010-05-06 21:33:20 UTC
(In reply to comment #15)
> nice.
> one small nitpick, though: 
> 
> i find it aesthetically displeasing that in the three different points the
> function "checkSummary" returns, three different mechanisms are used for return
> value:
> for special chars you use the return value of the function itself.
> when the string is too long you set both e.preventDefault() and e.returnValue,
> but not the return value of the function, and in all other cases you do not set
> any property on e, and leave the function return value undefined.
> i think you probably want to add
>     return false; 
> when len > 250, and
>     return true;
> just before the last "}" of checkSummary.
I'm not sure if that matters too much, but I could see it being slightly confusing. the e.preventDefault() and e.returnValue is to override the default action of the event, the other return is just to get out of the function.


> and a practical question:
> could you please add a couple of lines instructing me how to add this script to
> my user space (personal scripts or somesuch) in he.wikipedia, so i can test it?
> (can't test it on my own wiki, because i removed the 200 chars limit in the DB
> and increased the php limit to 400...)
> 
> peace.

yeah, just take the patch, wrap it in an if (wgAction === 'edit') { /*js from patch*/} and put it in your personal js. Should work (I think, no reason why it shouldn't, but i didn't test).

-bawolff

p.s. I have commit access now so i could commit it, but I was unsure if i should or not since the last version was rejected.
Comment 18 Arod Shatil 2010-05-09 15:10:33 UTC
intermediate report:
the patch seems to work as advertised, with no ill effects so far.
tested with FF, IE, (8 only - i do not have access to older versions :( ), Safari, Konqueror, Opera and google chrom.

i had to change the test from (len < 250) to (len < 248) (off-by-one), although conceptually, i assume that changing it to (len <= 250) would have been the "more correct" thing to do, but practically it makes sense to use 248 - with any language there are still some single-byte chars, such as punctuation and spaces, so a total size of 249 is possible, yet one would not want to add one more 2-byte character in this case. (maybe 247 would be safer still - there *are* 3-bytes-per-char languages. i use Hebrew, so 248 works for me).

i also changed the "return true;" to "return;": if the return value is unimportant, we should not use it in just one case (control characters).

i also do not think the line
summary.maxLength = 250;
is required or desired. leaving a limit of 200 chars seems reasonable. this whole bug is about *adding* a test for no-more-than-250-bytes to the existing no-more-than-200-chars limitation.

the issue/question of extending the length of summary above 200 chars should be separated from the bug associated with more-than-one-byte-per-char languages and the existing 250 bytes DB limitation.

peace.
Comment 19 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-05-26 00:07:06 UTC
Sorry for the long delay.  It looks okay to me at a glance, although I haven't tested it.  If you've tested that it doesn't break things in IE, go ahead and commit it IMO.  People will report it if there are any bugs.

(In reply to comment #18)
> i also do not think the line
> summary.maxLength = 250;
> is required or desired. leaving a limit of 200 chars seems reasonable. this
> whole bug is about *adding* a test for no-more-than-250-bytes to the existing
> no-more-than-200-chars limitation.

The 200-char limit exists only as a crude hack to prevent most Latin-based languages from hitting 255 bytes.  It's not necessary if we have a better technique, like this.
Comment 20 Bawolff (Brian Wolff) 2010-05-26 09:11:37 UTC
Fixed in r66913
Comment 21 Arod Shatil 2010-05-30 17:16:22 UTC
(In reply to comment #20)
> Fixed in r66913
unfortunately this fix has an off-by-one bug:
+		if (len > 250) {
+			if (e.preventDefault) e.preventDefault();
+			e.returnValue = false; //IE
+			return false;
+		}

this means that if len *is* 250 (or 249 for that matter) the script will allow you to add one more character, which can be 2 or 3 extra bytes which will get truncated.
we need to do one of the following:
-- concatenate the "event.data" to summary.value before calculating the length, thus:
+		len = (summary.value + event.data).replace(/[\u0080-\u07FF\uD800-\uDFFF]/g, '**').replace(/[\u0800-\uD7FF\uE000-\uFFFF]/g, '***').length;
-- use 247 instead of 250 for the comparison

peace.
Comment 22 Platonides 2010-05-30 20:18:41 UTC
> this means that if len *is* 250 (or 249 for that matter) the script will allow
> you to add one more character, which can be 2 or 3 extra bytes 

which will still be less than 255 bytes.
Comment 23 Arod Shatil 2010-05-31 19:16:22 UTC
(In reply to comment #22)
> > this means that if len *is* 250 (or 249 for that matter) the script will allow
> > you to add one more character, which can be 2 or 3 extra bytes 
> 
> which will still be less than 255 bytes.

This is true. and tinyblob can indeed support up to 255 bytes. 
However, the php code truncates at 250, which causes problems with r66913.
Comment 24 Bawolff (Brian Wolff) 2010-06-01 02:42:48 UTC
Technically it will truncate it at 250, then add "..." to the end then return the original if its shorter or equal to the truncated string, so in english at least (where the  "..." message is 3 bytes long) all is well, but I'm not sure if relying on that is really good practise.
Comment 25 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-06-01 17:10:54 UTC
The point is, we don't want to truncate it at all on the server side.  It should prohibit you from entering enough characters to be truncated on the server side, that's the point of this bug.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links