Last modified: 2014-11-08 18:25:11 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T6715, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 4715 - Allow comments longer than 255 bytes


Summary:	Allow comments longer than 255 bytes

Status:	REOPENED

Product:	MediaWiki
Classification:	Unclassified
Component:	Database (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Low enhancement with 9 votes (vote)
Target Milestone:	---
Assigned To:	Jan Paul Posma

URL:	http://dev.mysql.com/doc/refman/5.0/e...
Whiteboard:
Keywords:	patch, patch-need-review, performance, schema-change

Duplicates:	13200 22696 (view as bug list)
Depends on:
Blocks:	4714 41348 56295 34586
	Show dependency tree / graph

Reported:	2006-01-22 08:34 UTC by Ævar Arnfjörð Bjarmason
Modified:	2014-11-08 18:25 UTC (History)
CC List:	29 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Schema change (25.05 KB, patch) 2011-04-24 19:50 UTC, Jan Paul Posma	Details
Add an attachment (proposed patch, testcase, etc.)

Description Ævar Arnfjörð Bjarmason 2006-01-22 08:34:28 UTC

To support longer edit summaries the rev_deleted field has to be changed from
TINYBLOB to BLOB and a relevant database upgrade has to be written.

Comment 1 Brion Vibber 2006-01-22 19:49:43 UTC

That'd be rev_comment, not rev_deleted. :)

Comment 2 Brion Vibber 2008-02-29 19:37:31 UTC

*** Bug 13200 has been marked as a duplicate of this bug. ***

Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-07 19:11:16 UTC

This should also be done for log_comment.  Don't know if that should be a separate bug.

Comment 4 Siebrand Mazeland 2009-02-02 16:00:58 UTC

Removed 'schema-change' keyword. No patch attached that requires a schema change.

Comment 5 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-02-02 17:39:30 UTC

Um, what?  The necessary schema change is in the bug summary: "Change rev_comment from TINYBLOB to BLOB".  A patch to do that would be about two lines, there's no need to post one.

(Changed summary to include log_comment)

Comment 6 Raimond Spekking 2010-03-02 15:41:33 UTC

*** Bug 22696 has been marked as a duplicate of this bug. ***

Comment 7 praveenp 2010-03-02 15:49:07 UTC

(In reply to comment #6)
> *** Bug 22696 has been marked as a duplicate of this bug. ***

Actually 200 characters are enough for summary. But our problem is different :(

Comment 8 Domas Mituzas 2010-03-02 15:56:21 UTC

by the way, what about "no" ?;-)

Comment 9 praveenp 2010-03-02 16:01:37 UTC

(In reply to comment #8)
> by the way, what about "no" ?;-)

May be that is an easiest way to solve a real problem.

Comment 10 Domas Mituzas 2010-03-02 16:07:08 UTC

I'm sure it is :-)

Anyway, all these comments are shown in too many reports (watchlist, RCL) - that are already quite expensive to produce - introducing blob data to them without database/code refactoring wouldn't make much sense.

Comment 11 praveenp 2010-03-02 17:08:09 UTC

(In reply to comment #10)

> Anyway, all these comments are shown in too many reports (watchlist, RCL) -
> that are already quite expensive to produce - introducing blob data to them
> without database/code refactoring wouldn't make much sense.

Because of that kind of definition, only a handful languages (those which using Latin script probably) get any advantage of Edit-summary :(. 

There may be no update on edit summary since it developed in very initial stage of mediawiki software.

Comment 12 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-02 17:24:55 UTC

(In reply to comment #10)
> I'm sure it is :-)
> 
> Anyway, all these comments are shown in too many reports (watchlist, RCL) -
> that are already quite expensive to produce - introducing blob data to them
> without database/code refactoring wouldn't make much sense.

What sort of database/code refactoring were you thinking of?

Comment 13 Victor Vasiliev 2010-03-02 17:35:49 UTC

(In reply to comment #12)
> What sort of database/code refactoring were you thinking of?

Maybe addition of rev_description/log_description which are pointers to appropriate descriptions in text table? Or even to a page in special namespace (long rationales may require fixes)?

Comment 14 Domas Mituzas 2010-03-02 18:28:04 UTC

well, as long as that doesn't get included everywhere (recentchanges, feeds, RCL, watchlists, etc ;-), you can probably store whatever you want anywhere you want ;-)

the refactoring would mean moving that data out of tables, and doing load of it in second pass "on display". it would suck though - another huge table with billions of rows to maintain/join against, and there'd be questions how to structure it to have good locality, etc. 

pain oh pain.

Comment 15 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-02 18:58:18 UTC

How about this.  Add new rev_text and log_text as foreign keys to old_id.  If the comment fits in rev_comment/log_comment, keep it there, and set rev_text/log_text = 0.  Otherwise, keep a truncated version in rev_comment/log_comment and use that for RC and watchlist, and set rev_text/log_text and use that on history pages and other appropriate places.  (We probably don't want super-long summaries on RC anyway, it would add clutter.)

We'd likely have a limit of 1000 Unicode characters or less, though, so would using the text table be overkill?  Would it make more sense to just have rev_comment_long/log_comment_long or something?

Either way, this scheme wouldn't increase the database size noticeably for 95%+ of revisions, since most people aren't that verbose.  So that shouldn't be a big problem, right?

Comment 16 praveenp 2010-03-03 09:02:35 UTC

(In reply to comment #15)
> Either way, this scheme wouldn't increase the database size noticeably for 95%+
> of revisions, since most people aren't that verbose.  So that shouldn't be a
> big problem, right?

Then it is possible, isn't it?

Comment 17 Shiju Alex 2010-03-03 13:26:15 UTC

Currently when we create a new page, if nothing is entered on the edit summary box, the first few lines of the content will be automatically displayed on the edit summary. 

If we are increasing the character limit, then it is better to remove the above functionality.

Comment 18 Gurch 2010-03-03 17:51:34 UTC

(In reply to comment #11)
> Because of that kind of definition, only a handful languages (those which using
> Latin script probably) get any advantage of Edit-summary :(. 
> 
> There may be no update on edit summary since it developed in very initial stage
> of mediawiki software.

It's not so much the age of the software as the inefficiency of adding and converting to larger fields in the database schema.

It also doesn't help that most people equate one character with one byte and forget that users of non-Latin scripts are stuck with an encoding that takes two to three times as much storage space.

> (In reply to comment #12)
> > What sort of database/code refactoring were you thinking of?
> 
> Maybe addition of rev_description/log_description which are pointers to
> appropriate descriptions in text table? Or even to a page in special namespace
> (long rationales may require fixes)?

Not sure where you're getting the "long rationales may require fixes" idea from. 200 characters in a multibyte-encoded script doesn't necessarily convey any more information than 200 characters in a Latin script, it just takes more space to store.

(In reply to comment #15)
> We'd likely have a limit of 1000 Unicode characters or less, though, so would
> using the text table be overkill?  Would it make more sense to just have
> rev_comment_long/log_comment_long or something?

Nobody is suggesting anything like 1000 Unicode characters. 200 Unicode characters -- i.e. the same length Latin-script users get already -- would be more than enough, but currently scripts that are encoded with multibyte characters (almost anything non-Latin) don't get anywhere near that much.

(In reply to comment #17)
> Currently when we create a new page, if nothing is entered on the edit summary
> box, the first few lines of the content will be automatically displayed on the
> edit summary. 
> 
> If we are increasing the character limit, then it is better to remove the above
> functionality.

Not really. Just cap the automatic summary length at a fixed number of characters, rather than bytes. No reason why non-Latin script shouldn't get to see the first ~200 characters of their articles too.

Comment 19 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-03 18:54:20 UTC

(In reply to comment #18)
> It's not so much the age of the software as the inefficiency of adding and
> converting to larger fields in the database schema.

This isn't a big issue.  We have a system in place for schema changes, even on large tables.  The problem is the efficiency of the resulting deployed system.

> It also doesn't help that most people equate one character with one byte and
> forget that users of non-Latin scripts are stuck with an encoding that takes
> two to three times as much storage space.

I hope by "most people" you don't mean to include developers and sysadmins.

> Nobody is suggesting anything like 1000 Unicode characters. 200 Unicode
> characters -- i.e. the same length Latin-script users get already -- would be
> more than enough, but currently scripts that are encoded with multibyte
> characters (almost anything non-Latin) don't get anywhere near that much.

I disagree.  I regularly want to use English-language summaries longer than 255 ASCII characters.  If we're going to allow more than 255 bytes, we'll be allowing at least 64K bytes (likely 16M, if we use the text table) on the backend, so how many are allowed in an actual comment will be admin-configurable.  I imagine some wikis would request more than 255 characters -- I'd support a software default of 500 or so.  The current limit is only because we use varchar(255).

> Not really. Just cap the automatic summary length at a fixed number of
> characters, rather than bytes. No reason why non-Latin script shouldn't get to
> see the first ~200 characters of their articles too.

Agreed.

Comment 20 Gurch 2010-03-03 23:10:23 UTC

(In reply to comment #19)
> I disagree.  I regularly want to use English-language summaries longer than 255
> ASCII characters.

Well don't :) Long summaries would make histories and related pages more difficult to read. Increasing the character count for other scripts doesn't have that issue. Besides which, anyone writing a mini-essay for a summary would be better off summarizing it briefly and putting a detailed explanation on the discussion page. I've seen a lot of edit warring using the summary field as a mechanism for argument, and longer summaries would only encourage that.

Comment 21 praveenp 2010-03-04 09:42:31 UTC

> Nobody is suggesting anything like 1000 Unicode characters. 200 Unicode
> characters -- i.e. the same length Latin-script users get already -- would be
> more than enough, but currently scripts that are encoded with multibyte
> characters (almost anything non-Latin) don't get anywhere near that much.

> Long summaries would make histories and related pages more
> difficult to read. Increasing the character count for other scripts doesn't
> have that issue. Besides which, anyone writing a mini-essay for a summary would
> be better off summarizing it briefly and putting a detailed explanation on the
> discussion page. I've seen a lot of edit warring using the summary field as a
> mechanism for argument, and longer summaries would only encourage that.

agreed

Comment 22 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-04 17:23:56 UTC

Well, that's orthogonal to the discussion here.  We want the backend for foreign languages either way, and once we have it we can discuss how many characters should be allowed for edit summaries independent of the technical constraints that now force it to 255 bytes.  It would certainly be admin-configurable (and therefore configurable per-project on Wikimedia).

Comment 23 praveenp 2010-03-04 18:21:42 UTC

(In reply to comment #22)
> Well, that's orthogonal to the discussion here.  We want the backend for
> foreign languages either way, and once we have it we can discuss how many
> characters should be allowed for edit summaries independent of the technical
> constraints that now force it to 255 bytes.  It would certainly be
> admin-configurable (and therefore configurable per-project on Wikimedia).

Or close this bug and open independent bugs for different languages (like Bug 22696 this)

Comment 24 Raimond Spekking 2010-03-04 18:27:17 UTC

(In reply to comment #23)

> Or close this bug and open independent bugs for different languages (like Bug
> 22696 this)


No, not yet pls. First this base bug has to be fixed. Than we have (hopefully) configuration per-project.

Comment 25 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-04 18:53:22 UTC

(In reply to comment #23)
> Or close this bug and open independent bugs for different languages (like Bug
> 22696 this)

This makes no sense.  We cannot do this for any language until the infrastructure is in place.

Domas, do you have an opinion on comment 15?

Comment 26 Arod Shatil 2010-03-26 19:21:02 UTC

i just entered a bug-report ( http://bugzilla.wikimedia.org/show_bug.cgi?id=22967 ) and learned about this one (btw: i did try to search first)

i do not think the title of this bug is adequate. we are conflating here two completely different issues.
one is allowing more than 200 characters in a comment. please open another bug for this.
the other is that the existing system is built for 200 characters (e.g., the "maxlength" field in the html form is 200), but we actually support only 250 bytes (we truncate to 250 in php, so the 255 limit of the backend doesn't even ever have to truncate).
so, forget the discussion about "200 or more than 200", or rather, conduct it elsewhere.
having a limit of 200, and taking into account that some languages need more than 255 bytes to have 200 chars, the answer is very simple and straightforward: change the DB schema from tinyblob to blob for this one column (rev_comment), and teach the "truncate" function to count characters and not bytes, and oh, by the way, truncate at 200 (chars) and not 250 (bytes).
all in all one simple schema change, and one simple code change.

Comment 27 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-26 19:32:53 UTC

(In reply to comment #26)
> i do not think the title of this bug is adequate. we are conflating here two
> completely different issues.
> one is allowing more than 200 characters in a comment. please open another bug
> for this.
> the other is that the existing system is built for 200 characters (e.g., the
> "maxlength" field in the html form is 200), but we actually support only 250
> bytes (we truncate to 250 in php, so the 255 limit of the backend doesn't even
> ever have to truncate).
> so, forget the discussion about "200 or more than 200", or rather, conduct it
> elsewhere.

That discussion is irrelevant, yes.  Cf. comment 22.

> having a limit of 200, and taking into account that some languages need more
> than 255 bytes to have 200 chars, the answer is very simple and
> straightforward: change the DB schema from tinyblob to blob for this one column
> (rev_comment), and teach the "truncate" function to count characters and not
> bytes, and oh, by the way, truncate at 200 (chars) and not 250 (bytes).
> all in all one simple schema change, and one simple code change.

Simple to do, but that doesn't mean it will perform well.  See comment 8 and comment 10 by Domas Mitzuas.  Comment 15 outlines a possible solution, but I haven't gotten his approval for it.

Comment 28 Arod Shatil 2010-03-26 19:45:05 UTC

(In reply to comment #27)
> Simple to do, but that doesn't mean it will perform well.  See comment 8 and
> comment 10 by Domas Mitzuas.  Comment 15 outlines a possible solution, but I
> haven't gotten his approval for it.

comment 8 just doesn't compute. "just say no" was mighty stupid when nancy reagan said it wrt drugs, and one really expect a bit better reasoned answer from the intelligent people who run mediawiki.

as to #10: i will not pretend i understood it entirely, so i quote it here:
> Anyway, all these comments are shown in too many reports (watchlist, RCL) -
> that are already quite expensive to produce - introducing blob data to them
> without database/code refactoring wouldn't make much sense.

what does that mean? the data is *already* a "blob data" (it is currently defined as "tinyblob", not "varchar" or somesuch). and changing it from "tinyblob" to "blob" will only cost one extra byte.
for latin-based languages that's it. no an extra cost, because the software limit the length to 200 *chars* anyway.
for the rest of us, believe me it will be an improvement, not to see all those truncated edit messages any more.

on "my" wiki i already did it, but i would also like he.wikipedia.org to stop displaying all those truncated edit summaries.
and using a clever JS trick to limit it at the input form level will just leave us with inadequate edit messages (125 chars is less than a tweat, fgs. and what about malaysian which gets truncated at ~60?

peace.

Comment 29 Arod Shatil 2010-03-26 19:55:08 UTC

(In reply to comment #27)
> comment 10 by Domas Mitzuas.  Comment 15 outlines a possible solution, but I
> haven't gotten his approval for it.

sorry, i missed this reference to Comment #15.

i do not think it makes sense. 
a lot of complication and zero performance gain (at least i could not see any gain). to the best of my understanding, nowhere will it perform better than the simple solution of changing the tinyblob to blob, and in many places it will perform significantly worse (and will cost more storage).

if i am wrong it won't be the first nor (i hope) the last time, but i am pretty sure that *in this case* i am not wrong.

peace.

Comment 30 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-26 22:30:00 UTC

(In reply to comment #28)
> what does that mean? the data is *already* a "blob data" (it is currently
> defined as "tinyblob", not "varchar" or somesuch). and changing it from
> "tinyblob" to "blob" will only cost one extra byte.

It's "varchar(255) binary" in maintenance/tables.sql.  I believe it's varbinary(255) on Wikimedia; it is on the toolserver.  I'm not a MySQL expert, but I know that any blob type (even tinyblob) has significant gotchas compared to varbinary(255), although logically they're the same -- see the bit about "constraints" at <http://dev.mysql.com/doc/refman/5.1/en/blob.html>.  I don't know how many of these would apply to, e.g., varbinary(500).  I do know that longer columns would not be able to be fully indexed, for instance, which would make use of covering indexes impossible, but I don't know what the hard cutoff is for that.  There are probably other issues.

Basically, no, it's not just one byte difference in practice, although it is logically.

Comment 31 Arod Shatil 2010-03-26 22:38:07 UTC

when you say "it" i just want to make sure we are talking about the same thing:
i was talking about the column [rev_comment] in table [revision]
this is where the edit comments are stored, and chaanging it on *my* wiki allowed me to have full (=200 chars) comments.

in the publicly available source (mediawiki 1.15.1) this column is tinyblob.
so i do not claim that there is little difference between varbinary and blob, but according to the same documentation you linked, there is *very* little difference between tinyblob and blob - exactly one byte...

Comment 32 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-26 22:43:15 UTC

Sorry, you're right, rev_comment is tinyblob on trunk.  However, it's varbinary(255) on Wikimedia, or at least on the toolserver, and that's what's relevant from a sysadmin perspective.  Anyway, rc_comment is varchar(255) binary even on trunk, and that would have to be updated too for full functionality.

Comment 33 Arod Shatil 2010-03-26 22:51:45 UTC

i did not realize that you'd need to change rc_comment also. i'll do it on my wiki - thanks for the tip.

in this case, i would not suggest changing [rc_comment] from varbinary to blob - just from varbinary(255) to varbinary(2048) or somesuch (again, a single byte change, no effect on performance or storage afaik).


as to the difference between the trunk and the toolserver: does this mean the trunk is *behind* the toolserver or the other way around?

also, if [rev_comment] is varbinary(255) on the toolserver, how about changing *it* to varbinary(2048) ?

as i indicated, i do not "demand" a fix next week - next version would be plenty fine.

peace.

Comment 34 Domas Mituzas 2010-03-27 08:15:42 UTC

we shouldn't be using tinyblob anywhere, probably either I forgot to change, or someone changed it back :)

Comment 35 Arod Shatil 2010-03-27 13:56:28 UTC

far be it from me to advocate blobs - i don't like them (especially the green ones).

so, assuming we'll change it (or change it back) to varbinary, may i ask to make it varbinary(1024) rather than varbinary(255)?

the code guards this field at 200 chars, so for all latin scripts this practically means paying one more byte per revision, and 0 performance implications.
for all non-latin scripts this is going to be a fix of an old and very annoying bug.

thanks.

Comment 36 Domas Mituzas 2010-03-27 14:01:50 UTC

there are performance implications, as we can't do covering indexes on large fields, to speed up non-PK based range views.

Comment 37 Arod Shatil 2010-03-27 15:23:54 UTC

(In reply to comment #36)
> there are performance implications, as we can't do covering indexes on large
> fields, to speed up non-PK based range views.

but in tables.sql rev_commant does not seem to be in *any* index?
for sure there are many things i do not understand about the internal structure (i am definitely *not* DBA master), but unless this is one more discrepancy between trunk and "real life", rev_comment is not indexed, so i don't think performance would be impacted.

Comment 38 Domas Mituzas 2010-03-27 15:36:58 UTC

we switch certain indexes (include all table fields) to covering to avoid extra i/o for certain access patterns.

Comment 39 Arod Shatil 2010-03-27 16:23:40 UTC

will it be able then to truncate them for indexing and yet keep the whole (200 chars) thing for storage?
this would deprive non-latin-script users from searchability on the tail of the comment, but the comment itself, at least, will not get truncated?

Comment 40 Arod Shatil 2010-03-27 16:24:34 UTC

(i meant "will it be possible", of course, not "will it be able...)

Comment 41 Domas Mituzas 2010-03-27 16:27:10 UTC

it is not possible

Comment 42 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-28 18:22:08 UTC

(In reply to comment #33)
> as to the difference between the trunk and the toolserver: does this mean the
> trunk is *behind* the toolserver or the other way around?
> 
> also, if [rev_comment] is varbinary(255) on the toolserver, how about changing
> *it* to varbinary(2048) ?

Toolserver is theoretically the same as Wikimedia.  Wikimedia uses trunk but with some modifications, e.g., extra indexes and different field definitions.  Covering index on recentchanges (that means covering the full width of every column) is necessary for performance on Wikimedia sites.

(In reply to comment #41)
> it is not possible

Can we make it 500 bytes or such for wikis with scripts that are mainly two bytes or longer?  So keeping it at 255 for en, de, fr, it, pt, es, pl, nl, etc, almost all the biggest wikis.  Could we still do a covering index on 500 bytes?  What's the limit on that?  500 bytes seem to be the max for in-memory temporary tables, if I'm reading right (don't know how important that is).

Comment 43 Arod Shatil 2010-03-29 06:57:41 UTC

since all we really need is 200 chars, i think 500 bytes should completely solve the issue for all the 2-bytes-per-char languages (Hebrew, Arabic, Farsi, Cyrilic, Greek and more), and will improve situation a lot for 3-bytes-per-char languages (i think most of the south and east asian? - at least Praveen in bug 22696 talked about 3-to-1). 
even with the 3-to-1 they get an occasional single byte char such as space and punctuation, so in reality 500 will get them very close.

i also thought a bit more about the JS, and thought that the right thing will not be to try to handle keystrokes - this will require to duplicate too much editing functionality in the JS - movement, deletion, cut and paste, etc., but rather to dynamically manipulate the "maxlength" and let the browser handle the actual keystroke.
the only place where this will not be 100% adequate is with large paste.

what i mean is something like this: (100% untested. my JS is a bit rusty, so don't take it too literally):
============ JS Hack ============
function size_in_bytes(s) {//if there's builtin function, i didn't find it.
   var bytes = 0;
   for (var i in s.toIntArray()) 
       while (i > 0) { // count bytes in char.
           bytes++;
           i >>= 8;
       } 
   return bytes;
}

function update_maxlen() {
    var
        already_bytes = size_in_bytes(this.value),
        already_chars = this.value.length,
        available_bytes = 500 - already_bytes, //500=>const?
        available_chars = Math.floor(available_bytes / 4),
        new_maxlen = Math.min(200, already_chars+available_chars);//200=>const? 
    this.maxlength = new_maxleng;
}
============ END JS Hack =============
and make  update_maxlen() an event handler for keystroke or keyup or keydown or somesuch.

above code is 100% original and 100% copyright-free. use it as you see fit.
(once you switch to git i'll try submitting more "formal" patches) 

peace.

Comment 44 Bawolff (Brian Wolff) 2010-04-01 08:16:09 UTC

(In response to Comment 43) Umm, I don't think toIntArray() is a method of the string class (also, nitpick: that could have unexpected results if someone decides to add stuff to Array.prototype , better to use a normal for loop instead of a for in loop). Also some things like ã is represented as 0x00E3 in utf-16 (what js uses), but would take 2 bytes in utf-8 (0xC3A3), thus I don't think that counts correctly.

Personally I think a better approach would be to count how many bytes are in the field, and then plop some off the end if its too much. (preventing keystrokes has the issue of not allowing backspace through), since I have no idea if browsers interpert maxlength consistantly for high codepoint characters. If someone copies and paste, its still counted as a keypress, thus this still works (and if all else fails, keep the maxlength parameter). For example how about this (note: only tested on firefox):

var summary = document.getElementById('wpSummary');
trimSummary = function (e) {
 if (!e) e = window.event; //for IE
 var text = summary.value;
 //This basically figures out how many bytes a utf-16 string (which is what js sees) will take in utf-8
 //Note, surogate (\uD800-\uDFFF) characters are counted as 2 bytes, since theres two of them
 //and the actual character takes 4 bytes in utf-8 (2*2=4). might not work right for illegal surrogate sequences, but that should never happen.
 len = text.replace(/[^\u0000-\u007F]/g, '').length + text.replace(/[^\u0080-\u07FF\uD800-\uDFFF]/g, '').length*2 + text.replace(/[^\u0800-\uD7FF\uE000-\uFFFF]/g, '').length*3;
 if (len > 255) {
  summary.value = text.substring(0, text.length - 1);
  //We don't know how many characters over it is, so only plop off one
  trimSummary(); //recursively to get rid of other chars. In most cases this will just exit (or run once in case of surogate). depending on what happens, could cause an invalid sequence, as surgate pairs counted as 2 characters (at least in my testing in firefox). shouldn't cause an issue other then a ? appearing.

/*
Alternatively could do
if (e.preventDefault) e.preventDefault(); //std
else e.returnValue = false; //IE
but then have to determine if its a printing character, which seems difficult to do cross browser.
*/
 }
}

addHandler(summary, 'keyup', trimSummary);

Comment 45 Arod Shatil 2010-04-01 11:57:15 UTC

(In reply to comment #44)
i believe your counting routine is most probably better than mine, but i do not like the idea of chopping.
please note that my routine *did not* "block keystrokes". rather it manipulated the maxlength property of the input field and let the browser handle the actual keystrokes.

peace.

Comment 46 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-04-01 19:27:23 UTC

If anyone would like to submit some actual working JavaScript, feel free to do so at bug 22967 and I'll review it and commit if appropriate.  It can be standalone, so I'll integrate it into MediaWiki, but it has to actually work, since it's unlikely I'll find the time to do much fix-up.  Let's keep this bug for discussing schema changes.  Thanks.

Comment 47 Jan Paul Posma 2011-04-24 19:50:17 UTC

Created attachment 8454 [details]
Schema change

This patch changes the schema for MySQL, SQLite, MS SQL, IBM DB2, and Oracle. PostgreSQL already stored everything in TEXT types.

Updated columns are:
* ar_comment
* fa_description
* img_description
* ipb_reason
* log_comment
* oi_description
* pt_reason
* rc_comment
* rev_comment

All these columns are now varchar(1024) binary, although this number can easily be changed by replacing the values in the patch.

Note that the actual limits can be enforced in the software and don't have to be 1024 bytes. Perhaps we can discuss increasing the limit in the user interface in other bugs.

Comment 48 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-04-24 20:34:05 UTC

This patch has the potential for serious performance impact on Wikimedia sites, so it needs to be discussed with someone like Tim or Brion before anyone considers committing it.

Comment 49 Brion Vibber 2011-04-25 21:12:39 UTC

As a schema change touching some of the biggest and most active core tables, it's an issue of deployment rather than coding or review.

Applying these schema changes will require rebuilds of HUGE tables like 'revision' -- that probably means slow application on slave servers and a master context switch, which probably needs to be carefully coordinated.


I'm also a bit leery about just upping the limit from 255 to 1024 bytes; it doesn't solve the practical problems with the limits like too-long strings being cropped (unless we painstakingly assure that there's a strict 255-character limit applied on Unicode input prior to insertion into the database, in which case we're still enforcing the too-small limits and preventing useful slightly longer texts).

Comment 50 Domas Mituzas 2011-04-25 21:54:50 UTC

We can't have covering indexing on 1024-byte long fields (nor we can actually have any kind of indexing on them)

Comment 51 MZMcBride 2011-04-25 22:05:23 UTC

(In reply to comment #49)
> I'm also a bit leery about just upping the limit from 255 to 1024 bytes; it
> doesn't solve the practical problems with the limits like too-long strings
> being cropped (unless we painstakingly assure that there's a strict
> 255-character limit applied on Unicode input prior to insertion into the
> database, in which case we're still enforcing the too-small limits and
> preventing useful slightly longer texts).

I think it makes more sense to impose any limit at the application level and make the fields all blobs.

(In reply to comment #50)
> We can't have covering indexing on 1024-byte long fields (nor we can actually
> have any kind of indexing on them)

I don't follow this. As far as I can see, none of the affected fields are indexed (or would ever need to be).

Comment 52 Platonides 2011-04-25 23:28:03 UTC

(In reply to comment #50)
> We can't have covering indexing on 1024-byte long fields (nor we can actually
> have any kind of indexing on them)

Why would we want to index the summaries?


I'm not convinced of this change, though.

Comment 53 Domas Mituzas 2011-04-25 23:40:07 UTC

> Why would we want to index the summaries?

To allow covering index reads for features like 'user contributions'

Comment 54 Roan Kattouw 2011-04-26 15:13:57 UTC

(In reply to comment #53)
> > Why would we want to index the summaries?
> 
> To allow covering index reads for features like 'user contributions'

To clarify: one thing that is not in the default MW schema but is on the WMF databases is an extended usertext_timestamp index on the revision table that covers all fields selected by the user contributions query. This allows faster retrieval of the results: the database doesn't have to look at the actual row because all of the data is already in the index.

Vanilla MW:
KEY `usertext_timestamp` (`rev_user_text`, `rev_timestamp`)

WMF:
KEY `usertext_timestamp` (`rev_user_text`, `rev_timestamp`, `rev_user`, `rev_deleted`, `rev_minor_edit`, `rev_text_id`, `rev_comment`)

Comment 55 MZMcBride 2011-04-26 19:27:59 UTC

(In reply to comment #54)
> To clarify: one thing that is not in the default MW schema but is on the WMF
> databases is an extended usertext_timestamp index on the revision table that
> covers all fields selected by the user contributions query. This allows faster
> retrieval of the results: the database doesn't have to look at the actual row
> because all of the data is already in the index.
> 
> Vanilla MW:
> KEY `usertext_timestamp` (`rev_user_text`, `rev_timestamp`)
> 
> WMF:
> KEY `usertext_timestamp` (`rev_user_text`, `rev_timestamp`, `rev_user`,
> `rev_deleted`, `rev_minor_edit`, `rev_text_id`, `rev_comment`)

Thank you for the clarification, Roan. Much appreciated.

Domas also pointed me toward this old blog post of his that explains the reasoning for a broader index: <http://dom.as/2007/01/26/mysql-covering-index-performance/>.

Comment 56 Sumana Harihareswara 2011-11-09 03:36:14 UTC

Seems like this fix needs further review and affects performance, so I'm marking the bug with the appropriate keywords.

Comment 58 Siddhartha Ghai 2014-03-01 14:16:27 UTC

Any chance of any action on this now? It's been almost three years since the last comments.

Comment 59 Quim Gil 2014-04-10 07:08:38 UTC

As initially asked in bug 4714 comment 16:

What process do we need to resolve this request in a way or another? A chat between maintainers? A thread in wikitech-l? A formal architecture RfC seems excessive.

Even if the implementation is technical, the question is social: do we as MediaWiki / Wikimedia communities want the possibility to post longer edit summaries?

This request was filed in 2006! Let's kick the discussion to resolve it.

Comment 60 Raimond Spekking 2014-04-10 09:01:21 UTC

The result of a poll in the German Wikipedia shows that this bug is the 5th most annoying bug: https://de.wikipedia.org/wiki/Wikipedia:Umfragen/Technische_W%C3%BCnsche/Top_20

(In reply to Quim Gil from comment #59)
> 
> Even if the implementation is technical, the question is social: do we as
> MediaWiki / Wikimedia communities want the possibility to post longer edit
> summaries?

That is a real good point!

Comment 61 Bawolff (Brian Wolff) 2014-04-10 14:48:27 UTC

(In reply to Raimond Spekking from comment #60)
> The result of a poll in the German Wikipedia shows that this bug is the 5th
> most annoying bug:
> https://de.wikipedia.org/wiki/Wikipedia:Umfragen/Technische_W%C3%BCnsche/
> Top_20
> 
> (In reply to Quim Gil from comment #59)
> > 
> > Even if the implementation is technical, the question is social: do we as
> > MediaWiki / Wikimedia communities want the possibility to post longer edit
> > summaries?
> 
> That is a real good point!

People in certain scripts complain often about only being able to write 85 letter summaries (63 letters if you have a really unlucky script).

Actual technical proposal above (storing full comment in external storage, not showing it in most reports unless specificly unless asked for) would probably need to be signed off on by a dba.

Comment 62 Andre Klapper 2014-07-03 21:51:49 UTC

Greg: Do you have any input on comment 59 (what could be the process here)?

(In reply to Bawolff (Brian Wolff) from comment #61)
> People in certain scripts complain often about only being able to write 85
> letter summaries (63 letters if you have a really unlucky script).
> 
> Actual technical proposal above (storing full comment in external storage,
> not showing it in most reports unless specificly unless asked for) would
> probably need to be signed off on by a dba.

DBA = would that be springle?

Comment 63 Bawolff (Brian Wolff) 2014-07-03 22:22:22 UTC

(Inreply to Andre Klapper from comment #62)
> Greg: Do you have any input on comment 59 (what could be the process here)?
> 
> (In reply to Bawolff (Brian Wolff) from comment #61)
> > People in certain scripts complain often about only being able to write 85
> > letter summaries (63 letters if you have a really unlucky script).
> > 
> > Actual technical proposal above (storing full comment in external storage,
> > not showing it in most reports unless specificly unless asked for) would
> > probably need to be signed off on by a dba.
> 
> DBA = would that be springle?

Yes.

This bug has a long history, and its a little confusing. To summarize there seem to be three things discussed:
*make various edit summary fields longer (convert to blob). Domas has vetoed that due to interfering with covering indexes. That was quite a while ago. Springle could perhaps comment on if covering indexes are still a hard requirement. I suspect they are.
*make js to make sure people dont go over limit. This is done.
*keep current limit but have "..." at end of comment be clickable and have js load rest of summary on click. This needs to be fleshed out a little more, but badic questions that need answering:
**would users find such a solution acceptable (design review needed).
**where would long version go? External storage?
**how would we associate edits with long version. New db fields? Sticking an id number at the end of rev_comment after some marker field?
**how long to go. If we allow infinite, should the interface treat it more like a commit summary?

Comment 64 Sean Pringle 2014-07-04 02:45:52 UTC

The rev_comment field could be increased to 767 bytes without requiring changes to the indexing for either stock Mediawiki or WMF databases. The same width limit would apply to all of the varbinary comment fields. That would allow 191 characters minimum for a 4 byte charset.

`rev_comment` varbinary(767) DEFAULT NULL

767 is a internal key-width limit per column in InnoDB. Exceeding it would block the use of covering indexes used on WMF databases for performance. 

Sounds like we could:

a) Choose to expand to a minimum of 191 characters
b) Decide whether comments should be entirely redesigned to be more flexible

My expectation is that (a) would not harm performance much, if at all, for enwiki; currently there are various factors boosting performance besides the original custom covering indexes. Eg, table partitioning on slaves handling specific query groups, and quite simply a bunch more hardware :)

Such schema changes would take some weeks to roll out, but otherwise aren't terribly painful. Certainly not a reason to avoid the issue.

Comment 65 Andre Klapper 2014-08-05 22:59:35 UTC

Based on comment 63 and comment 64 (thanks for the analysis!) some performance input would probably be welcome here, plus having somebody driving this, in order to allow unlucky scripts to get more than 63 letters for change summaries.

Greg: Anybody in mind?

And does the patch in comment 47 still make sense, and work could be based on it (means: should be put into Gerrit by following http://www.mediawiki.org/wiki/Gerrit/Tutorial or using http://tools.wmflabs.org/gerrit-patch-uploader/ )?

Comment 66 Andre Klapper 2014-10-16 12:53:47 UTC

Based on comment 63 and comment 64 some performance input would probably be welcome here, plus having somebody driving this, in order to allow unlucky scripts to get more than 63 letters for change summaries.

Greg: Anybody in mind?

And does the patch in comment 47 still make sense, and work could be based on it (means: should be put into Gerrit by following http://www.mediawiki.org/wiki/Gerrit/Tutorial or using http://tools.wmflabs.org/gerrit-patch-uploader/ )?

Comment 67 Greg Grossmeier 2014-10-20 16:24:53 UTC

(In reply to Andre Klapper from comment #65)
> Based on comment 63 and comment 64 (thanks for the analysis!) some
> performance input would probably be welcome here, plus having somebody
> driving this, in order to allow unlucky scripts to get more than 63 letters
> for change summaries.
> 
> Greg: Anybody in mind?

Not really no, unfortunately. Brian?

Comment 68 Andre Klapper 2014-11-07 13:58:46 UTC

bawolff: ^ ?

Comment 69 Yuvi Panda 2014-11-07 17:23:38 UTC

Hmm, so from springle's comments, looks like the action items now are:

1. Make the schema change,
2. Check for places in the code where this is displayed, and make sure they will handle this properly,
2. Co-ordinate to make that happen on wikimedia wikis,
3. Announce globally.

Does that seem correct?

Comment 70 Bawolff (Brian Wolff) 2014-11-08 18:25:11 UTC

(In reply to Andre Klapper from comment #68)
> bawolff: ^ ?

/me does not have time to work on this right now.

(In reply to Yuvi Panda from comment #69)
> Hmm, so from springle's comments, looks like the action items now are:
> 
> 1. Make the schema change,
> 2. Check for places in the code where this is displayed, and make sure they
> will handle this properly,
> 2. Co-ordinate to make that happen on wikimedia wikis,
> 3. Announce globally.
> 
> Does that seem correct?

Yes. Also make sure that maxlength attributes, and jquery.bytecounter (or whatever the js that does max byte count is called) is updated appropiately.

The only tricky bit might be not breaking things for people using myisam or other db backends that might have different limits on the index. Otherwise should be relatively straight forward.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links

aklapper
alexsm333
arod.shatil
b
bawolff+wn
church.of.emacs.ml
danny.b
domas.mituzas
Eduard.Braun2
glaisher.wiki
greg
john
matma.rex
matthew.britton
me.praveen
mediawiki-bugs
meno25mail
p858snake
Platonides
qgil
raimond.spekking
roan.kattouw
scott
shijualex
Siddhartha.Ghai
springle
srik.lak
vasilvv
yuvipanda