Last modified: 2011-04-15 10:08:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T17422, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 15422 - Wikibits patch #10 - more numbers


Summary:	Wikibits patch #10 - more numbers

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Low enhancement with 2 votes (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	patch, patch-need-review

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2008-09-01 16:12 UTC by SharkD
Modified:	2011-04-15 10:08 UTC (History)
CC List:	1 user (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
unified diff file (449 bytes, patch) 2008-09-01 16:12 UTC, SharkD	Details
unified diff file (1.06 KB, patch) 2008-09-02 04:05 UTC, SharkD	Details
unified diff file (1.80 KB, patch) 2008-09-02 04:55 UTC, SharkD	Details
unified diff file (1.82 KB, patch) 2008-09-02 18:01 UTC, SharkD	Details
unified diff file (1.51 KB, patch) 2008-09-03 04:22 UTC, SharkD	Details
unified diff file (1.57 KB, patch) 2008-09-03 05:48 UTC, SharkD	Details
unified diff file (1.58 KB, patch) 2008-09-03 05:57 UTC, SharkD	Details
test script (1.11 KB, application/octet-stream) 2008-09-03 05:59 UTC, SharkD	Details
unified diff file (1.58 KB, patch) 2008-09-03 06:09 UTC, SharkD	Details
test script (1.27 KB, text/plain) 2008-09-03 06:10 UTC, SharkD	Details
javascript demonstration (820 bytes, application/octet-stream) 2008-09-09 11:42 UTC, SharkD	Details
Show Obsolete (8) Add an attachment (proposed patch, testcase, etc.)

Description SharkD 2008-09-01 16:12:56 UTC

Created attachment 5251 [details]
unified diff file

This patch allows more number formats to be detected properly when sorting tables.

Specifically, numbers beginning with "-" or "+", or containing "e" or "E" are handled properly.

This patch requires that the multiple "if" statements in this particular section also be changed to "if" and "if else" statements. Otherwise, the regexp for numeric values will falsely detect some values that are actually dates.

Comment 1 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-01 20:01:02 UTC

Do all browsers support all of these formats?  If some browsers don't support exponential formats, it might be best not to let those through.  If they've all supported them since the beginning, though, which is quite possible, it's fine.

Comment 2 SharkD 2008-09-02 01:20:52 UTC

Javascript is based on the ECMA standard. It shouldn't matter what browser you use. This has nothing to do with character entities or text encoding or anything like that.

Comment 3 SharkD 2008-09-02 01:21:49 UTC

I just want to re-emphasize that this patch requires some changes made in one of the earlier patches (and not included here) in order to work.

Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-02 02:06:35 UTC

(In reply to comment #2)
> Javascript is based on the ECMA standard. It shouldn't matter what browser you
> use. This has nothing to do with character entities or text encoding or
> anything like that.

Yeah, theory is a wonderful thing . . . it's probably correct on this point, I'm just being paranoid.

(In reply to comment #3)
> I just want to re-emphasize that this patch requires some changes made in one
> of the earlier patches (and not included here) in order to work.

Can you post a full patch that will implement this by itself?  Or at least indicate exactly which other patch is needed, and which lines of that patch?

Comment 5 SharkD 2008-09-02 04:05:57 UTC

Created attachment 5256 [details]
unified diff file

Sure, here you go.

Comment 6 SharkD 2008-09-02 04:12:47 UTC

The code (the new and the original) searches for the percent sign (%) if it exists. I was wondering if there were any other symbols that should be searched for? I can't think of any off the top of my head, besides maybe the degrees symbol (°).

Actually, the plus-minus symbol (±) might be a good thing to catch, too.

The ¼ ½ and ¾ symbols could easily be replaced with decimal values.

² and ³ are a bit tougher. They would need to be evaluated as expressions in addition to being replaced.

Comment 7 SharkD 2008-09-02 04:14:30 UTC

The tricky thing is if these ANSI codes represent something different in other character encodings.

Comment 8 SharkD 2008-09-02 04:55:16 UTC

Created attachment 5257 [details]
unified diff file

Added support for additional mathematics/science symbols as outlined in my previous post.

Comment 9 SharkD 2008-09-02 18:01:59 UTC

Created attachment 5259 [details]
unified diff file

Slight tweak.

Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-03 00:19:28 UTC

(In reply to comment #8)
> Added support for additional mathematics/science symbols as outlined in my
> previous post.

I think this goes much too far.  We don't need to write a mini-parser here.  The treatment of ± is outright wrong, too: it makes no sense to try sorting such a thing relative to numbers known to be positive or negative.

I'm going to look at attachment 5256 [details] instead:

>-	sortfn = ts_sort_caseinsensitive;
>+	var sortfn = ts_sort_caseinsensitive;
> 	if (itm.match(/^\d\d[\/. -][a-zA-Z]{3}[\/. -]\d\d\d\d$/))
>-		sortfn = ts_sort_date;
>-	if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
>-		sortfn = ts_sort_date;
>-	if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
>-		sortfn = ts_sort_date;
>-	if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
>-		sortfn = ts_sort_currency;
>-	if (itm.match(/^[\d.,]+\%?$/))
>-		sortfn = ts_sort_numeric;
>+		var sortfn = ts_sort_date;
>+	else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
>+		var sortfn = ts_sort_date;
>+	else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
>+		var sortfn = ts_sort_date;
>+	else if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
>+		var sortfn = ts_sort_currency;
>+	else if (itm.match(/^[\d.,-+eE]+\%?$/))
>+		var sortfn = ts_sort_numeric;

Why are you adding "var" before each sortfn assignment?  That should only be needed on the initial assignment.  Even if not, *please* keep one thing to one patch.  Same for if -> else if.  The substantive change here *seems* to be one line, but I'd prefer not to have to look through ten times as many lines to find it:

>-	if (itm.match(/^[\d.,]+\%?$/))
>+	if (itm.match(/^[\d.,-+eE]+\%?$/))

Note that this seems incorrect, "-" has a special meaning here and should be at the end of the character class.  I've committed this line's change as r40348, with the character class reordered so "-" is at the end.

I've also committed the stylistic changes *separately* as r40349 -- except with only the first variable declaration marked local, not all of them.  If there's some reason to mark all of them with "var" separately, let me know.

Comment 11 SharkD 2008-09-03 02:21:11 UTC

(In reply to comment #10)
> (In reply to comment #8)
> > Added support for additional mathematics/science symbols as outlined in my
> > previous post.
> I think this goes much too far.  We don't need to write a mini-parser here. 
> The treatment of ± is outright wrong, too: it makes no sense to try sorting
> such a thing relative to numbers known to be positive or negative.

I don't think it's overkill. There are only a few ANSI characters used in mathematics that aren't already understood by javascript. As for the ± symbol, I wouldn't expect it to present a problem the way it is typically used.

> I'm going to look at attachment 5256 [details] instead:
> >-	sortfn = ts_sort_caseinsensitive;
> >+	var sortfn = ts_sort_caseinsensitive;
> > 	if (itm.match(/^\d\d[\/. -][a-zA-Z]{3}[\/. -]\d\d\d\d$/))
> >-		sortfn = ts_sort_date;
> >-	if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
> >-		sortfn = ts_sort_date;
> >-	if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
> >-		sortfn = ts_sort_date;
> >-	if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
> >-		sortfn = ts_sort_currency;
> >-	if (itm.match(/^[\d.,]+\%?$/))
> >-		sortfn = ts_sort_numeric;
> >+		var sortfn = ts_sort_date;
> >+	else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
> >+		var sortfn = ts_sort_date;
> >+	else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
> >+		var sortfn = ts_sort_date;
> >+	else if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
> >+		var sortfn = ts_sort_currency;
> >+	else if (itm.match(/^[\d.,-+eE]+\%?$/))
> >+		var sortfn = ts_sort_numeric;
> Why are you adding "var" before each sortfn assignment?  That should only be
> needed on the initial assignment.  Even if not, *please* keep one thing to one
> patch.  Same for if -> else if.  The substantive change here *seems* to be one
> line, but I'd prefer not to have to look through ten times as many lines to
> find it:

First of all, the statements are preceded by var because they *are* the first instance. There's no sense in defining it an extra time just to eliminate some characters when it's not necessary. Secondly you specifically asked me to include the additional code needed for this patch to work. I quote:

> Can you post a full patch that will implement this by itself?  Or at least
> indicate exactly which other patch is needed, and which lines of that patch?



> >-	if (itm.match(/^[\d.,]+\%?$/))
> >+	if (itm.match(/^[\d.,-+eE]+\%?$/))
> Note that this seems incorrect, "-" has a special meaning here and should be at
> the end of the character class.  I've committed this line's change as r40348,
> with the character class reordered so "-" is at the end.

Good catch.

> I've also committed the stylistic changes *separately* as r40349 -- except with
> only the first variable declaration marked local, not all of them.  If there's
> some reason to mark all of them with "var" separately, let me know.

As I explained, the "var"s are necessary, otherwise you are creating a *global* variable within each condition. This is unneeded additional overhead.

Comment 12 SharkD 2008-09-03 02:30:00 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > Why are you adding "var" before each sortfn assignment?  That should only be
> > needed on the initial assignment.  Even if not, *please* keep one thing to one
> > patch.  Same for if -> else if.  The substantive change here *seems* to be one
> > line, but I'd prefer not to have to look through ten times as many lines to
> > find it:
> First of all, the statements are preceded by var because they *are* the first
> instance. There's no sense in defining it an extra time just to eliminate some
> characters when it's not necessary. Secondly you specifically asked me to
> include the additional code needed for this patch to work. I quote:

Ah, I see my error now. The block of code is supposed to look like this:

>+	if (itm.match(/^\d\d[\/. -][a-zA-Z]{3}[\/. -]\d\d\d\d$/))
>+		var sortfn = ts_sort_date;
>+	else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
>+		var sortfn = ts_sort_date;
>+	else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
>+		var sortfn = ts_sort_date;
>+	else if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
>+		var sortfn = ts_sort_currency;
>+	else if (itm.match(/^[\d.,+eE-]+\%?$/))
>+		var sortfn = ts_sort_numeric;
>+	else
>+		var sortfn = ts_sort_caseinsensitive;

Comment 13 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-03 02:36:08 UTC

(In reply to comment #11)
> I don't think it's overkill. There are only a few ANSI characters used in
> mathematics that aren't already understood by javascript.

Why just [[Latin-1]] characters (more proper term than "[[ANSI]]") and not Unicode generally?  It's a slippery slope here, especially ² and ³.  When is anyone ever going to square tabulated data?  ± might be worth it, but Unicode fractions are rarely used, and if you support any you should support all, not just the Latin-1 ones.  % might be useful, but it's already treated more or less correctly by just stripping it: the only issue would be if people mix percents and normal numbers in the same table, which is weird.

> First of all, the statements are preceded by var because they *are* the first
> instance. There's no sense in defining it an extra time just to eliminate some
> characters when it's not necessary.

No, they aren't the first instance.  To wit:

	var sortfn = ts_sort_caseinsensitive;

That's executed unconditionally at the start.

> Secondly you specifically asked me to
> include the additional code needed for this patch to work. I quote:

By that I meant include additional code needed for this patch to *work*, not additional code needed for the patch to *apply* in its unduly modified form.  Of course your original patch was one line, but I didn't remember that when I wrote my last comment, what with the other ten patches by you I've been looking at.  I thought you meant a meaningful dependency, not a formatting dependency, which is why I said it should be included in the patch.  Patches should be *formatted* to apply against current trunk.

(In reply to comment #12)
> Ah, I see my error now. The block of code is supposed to look like this:

Which is one line plus 4*5 = 20 characters longer.  :)

Comment 14 SharkD 2008-09-03 04:20:48 UTC

(In reply to comment #13)
> (In reply to comment #11)
> > I don't think it's overkill. There are only a few ANSI characters used in
> > mathematics that aren't already understood by javascript.
> Why just [[Latin-1]] characters (more proper term than "[[ANSI]]") and not
> Unicode generally?  It's a slippery slope here, especially ² and ³.  When is
> anyone ever going to square tabulated data?  ± might be worth it, but Unicode
> fractions are rarely used, and if you support any you should support all, not
> just the Latin-1 ones.  % might be useful, but it's already treated more or
> less correctly by just stripping it: the only issue would be if people mix
> percents and normal numbers in the same table, which is weird.

Yes, the powers of 2 and 3 were maybe overkill (and lead to some false results). I'll post a trimmed-down version. The fractions ½ ¼ ¾ require only simple substitutions, no evaluation of any kind. Percentages are also an easy case.

> > First of all, the statements are preceded by var because they *are* the first
> > instance. There's no sense in defining it an extra time just to eliminate some
> > characters when it's not necessary.
> No, they aren't the first instance.  To wit:
>         var sortfn = ts_sort_caseinsensitive;
> That's executed unconditionally at the start.

Hence my correction.

> > Secondly you specifically asked me to
> > include the additional code needed for this patch to work. I quote:
> By that I meant include additional code needed for this patch to *work*, not
> additional code needed for the patch to *apply* in its unduly modified form. 
> Of course your original patch was one line, but I didn't remember that when I
> wrote my last comment, what with the other ten patches by you I've been looking
> at.  I thought you meant a meaningful dependency, not a formatting dependency,
> which is why I said it should be included in the patch.  Patches should be
> *formatted* to apply against current trunk.

The changes *were* required in order for it to work. In its original form, the numerical search might have picked up a date when it wasn't supposed to.

> > Ah, I see my error now. The block of code is supposed to look like this:
> Which is one line plus 4*5 = 20 characters longer.  :)

You're missing the point. Declaring the variable at the beginning results in an additional assignment and an (expensive) additional lookup in global scope in 5 out of the 6 cases.

Comment 15 SharkD 2008-09-03 04:22:02 UTC

Created attachment 5271 [details]
unified diff file

Scaled-down version of the last patch.

Comment 16 SharkD 2008-09-03 04:27:41 UTC

Also, you retained a bug from the previous version when you committed the change.

In /^[\d.,eE+-]+\%?$/ the e and E may appear at the beginning of the text, which is not a number.

Comment 17 SharkD 2008-09-03 05:48:29 UTC

Created attachment 5272 [details]
unified diff file

The previous versions of the code were picking up a lot of (admittedly unlikely) false positives. Hopefully I've eliminated them. I'll also upload the test script I've been using just in case.

Comment 18 SharkD 2008-09-03 05:57:44 UTC

Created attachment 5273 [details]
unified diff file

Forgot something in the last version.

Comment 19 SharkD 2008-09-03 05:59:24 UTC

Created attachment 5274 [details]
test script

Here's the test script I've been using to eliminate errors.

Comment 20 SharkD 2008-09-03 06:09:36 UTC

Created attachment 5275 [details]
unified diff file

Cought something else.

Comment 21 SharkD 2008-09-03 06:10:14 UTC

Created attachment 5276 [details]
test script

Update to the test script.

Comment 22 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-03 14:03:39 UTC

Please make all new patches against SVN trunk.  Your current patches wouldn't have applied (even if I wanted to apply them) because of r40348 and r40349.  Before you make changes to submit a patch, right-click on wikibits.js or the directory and choose "Update" or something like that, to make sure the copy you're patching against is up-to-date.

(In reply to comment #14)
> The fractions ½ ¼ ¾ require only
> simple substitutions, no evaluation of any kind.

How about the fractions ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞?  And you need to consider "1 ½", "1-½", etc.  Nobody even uses these, [[WP:MOS]] and most other guides recommend against them.  It adds bloat for little gain here.

> The changes *were* required in order for it to work. In its original form, the
> numerical search might have picked up a date when it wasn't supposed to.

Well, you should *say* these things, so that I don't make incorrect assumptions about what your patch is supposed to do.  It wasn't obvious.

> You're missing the point. Declaring the variable at the beginning results in an
> additional assignment and an (expensive) additional lookup in global scope in 5
> out of the 6 cases.

You're missing the point.  *Performance does not matter here.*  This piece of code is run *once* for every time the user clicks the button.  You're talking probably under a microsecond's difference for each *click*.  The physical process of clicking the mouse button takes thousands of times as long.

Stop worrying about performance when you don't need to.  It makes code uglier and harder to understand and maintain.

(In reply to comment #16)
> Also, you retained a bug from the previous version when you committed the
> change.
> 
> In /^[\d.,eE+-]+\%?$/ the e and E may appear at the beginning of the text,
> which is not a number.

There are a lot more possible problems than that, in fact.  + and - can also appear anywhere, you can have multiple dots, etc.  It doesn't support hex.  I've created a new regex based on the ECMAScript standard:

http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

I've committed it in r40379 (with a fix in r40382), so there should be nothing incorrect detected as numbers anymore.

Comment 23 SharkD 2008-09-03 23:18:02 UTC

(In reply to comment #22)
> Please make all new patches against SVN trunk.  Your current patches wouldn't
> have applied (even if I wanted to apply them) because of r40348 and r40349. 
> Before you make changes to submit a patch, right-click on wikibits.js or the
> directory and choose "Update" or something like that, to make sure the copy
> you're patching against is up-to-date.
> (In reply to comment #14)
> > The fractions ½ ¼ ¾ require only
> > simple substitutions, no evaluation of any kind.
> How about the fractions ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞?  And
> you need to consider "1 ½", "1-½", etc.  Nobody even uses these, [[WP:MOS]]
> and most other guides recommend against them.  It adds bloat for little gain
> here.

It seems to be used quite frequently, and is even included in the JavaScript character insertion toolbox of the edit page.

> > The changes *were* required in order for it to work. In its original form, the
> > numerical search might have picked up a date when it wasn't supposed to.
> Well, you should *say* these things, so that I don't make incorrect assumptions
> about what your patch is supposed to do.  It wasn't obvious.

I did say so in message #5.

> > You're missing the point. Declaring the variable at the beginning results in an
> > additional assignment and an (expensive) additional lookup in global scope in 5
> > out of the 6 cases.
> You're missing the point.  *Performance does not matter here.*  This piece of
> code is run *once* for every time the user clicks the button.  You're talking
> probably under a microsecond's difference for each *click*.  The physical
> process of clicking the mouse button takes thousands of times as long.
> Stop worrying about performance when you don't need to.  It makes code uglier
> and harder to understand and maintain.

Might I remind you that my previous optimizations were similarly labeled trivial, yet I was able to achieve a twenty-fold increase in javascript performance. Also, lag in user interface elements is particularly insidious.

> (In reply to comment #16)
> > Also, you retained a bug from the previous version when you committed the
> > change.
> > 
> > In /^[\d.,eE+-]+\%?$/ the e and E may appear at the beginning of the text,
> > which is not a number.
> There are a lot more possible problems than that, in fact.  + and - can also
> appear anywhere, you can have multiple dots, etc.  It doesn't support hex. 

Yes, hence my subsequent patches. I feel your lack of keeping current with the latest changes to be suspect.

> I've created a new regex based on the ECMAScript standard:
> http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
> I've committed it in r40379 (with a fix in r40382), so there should be nothing
> incorrect detected as numbers anymore.

There are several issues with your changes:

1 commas can appear at the beginning of a number
2 multiple commas may appear in a sequence or separated by fewer than three digits
3 multiple periods may appear in a single number
4 commas may appear after a period
5 sequences surrounded by parentheses are stored as matches when they don't need to be

All of which I have resolved in my latest changes.

Comment 24 SharkD 2008-09-03 23:26:28 UTC

(In reply to comment #22)
> Please make all new patches against SVN trunk.  Your current patches wouldn't
> have applied (even if I wanted to apply them) because of r40348 and r40349. 
> Before you make changes to submit a patch, right-click on wikibits.js or the
> directory and choose "Update" or something like that, to make sure the copy
> you're patching against is up-to-date.
> (In reply to comment #14)
> > The fractions ½ ¼ ¾ require only
> > simple substitutions, no evaluation of any kind.
> How about the fractions ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞?  And
> you need to consider "1 ½", "1-½", etc.  Nobody even uses these, [[WP:MOS]]
> and most other guides recommend against them.  It adds bloat for little gain
> here.

Forgot to add a link to support the frequency of the character's usage:

http://www.google.com/search?hl=en&q=site%3Aen.wikipedia.org+%C2%BD&aq=f&oq=site%3Aen.wikipedia.org+%C2%BD%22

Also, I don't think "1 ½" "1-½" are valid.

> There are a lot more possible problems than that, in fact.  + and - can also
> appear anywhere, you can have multiple dots, etc.  It doesn't support hex. 
> I've created a new regex based on the ECMAScript standard:
> http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
> I've committed it in r40379 (with a fix in r40382), so there should be nothing
> incorrect detected as numbers anymore.

Having three separate regexps sharing multiple sequences also slows down the code.

Comment 25 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-08 16:07:09 UTC

(In reply to comment #23)
> It seems to be used quite frequently, and is even included in the JavaScript
> character insertion toolbox of the edit page.

It still adds complication . . . I'm not sure.  I'd like some more opinions on this.

> I did say so in message #5.

Providing a patch that also made changes that clearly weren't necessary for this to work (adding var), and where you didn't explain why changing "if" to "else if" was actually necessary -- which wasn't obvious at all.

> Might I remind you that my previous optimizations were similarly labeled
> trivial, yet I was able to achieve a twenty-fold increase in javascript
> performance. Also, lag in user interface elements is particularly insidious.

I'll reserve judgment on your other optimizations (not *all* of which anyone called trivial, only some of them) until I've reviewed them.  Microseconds or less per click is not insidious, it's invisible.

> Yes, hence my subsequent patches. I feel your lack of keeping current with the
> latest changes to be suspect.

"Suspect"?  I'm a volunteer here.  I'm kindly setting aside my own time to review your patches out of the goodness of my heart.  It's pretty rude of you to question my commitment when I have no obligation to pay attention to you at all, and when in fact most patches to MediaWiki go totally unreviewed and are ignored for months or years.

Your subsequent patches were totally unexplained, so I didn't realize what exactly you had done with them.  If you had *pointed out* what problems they fixed, I might not have seen the need to make that comment.  Besides, your subsequent patches make mistakes that mine didn't.  They consider, for instance, the single string "." to be a number.

> 1 commas can appear at the beginning of a number

Yeah, I spotted that when re-reviewing one of your patches above.  I've fixed that in r40611.

> 2 multiple commas may appear in a sequence or separated by fewer than three
> digits

I'm not sure if that's good or bad.  Is sorting something like "40,52" as the number 4052 better or worse than sorting it as a string?  Sorting "," as 0 or NaN or whatever it evaluates to is definitely wrong, and I've fixed that.

> 3 multiple periods may appear in a single number

How?  I don't see how that's possible with the regexes I committed.

> 4 commas may appear after a period

It doesn't hurt anything.  I guess it would simplify the regex a bit if that didn't happen, but the commas will be stripped before evaluation anyway.

> 5 sequences surrounded by parentheses are stored as matches when they don't
> need to be

Chucking in extra "?:" everywhere makes the code uglier for no good reason.

(In reply to comment #24)
> Also, I don't think "1 ½" "1-½" are valid.

Why not?

> Having three separate regexps sharing multiple sequences also slows down the
> code.

I'm simply not going to consider any claims of slowing down anything unless you can provide either benchmarks or a convincing theoretical argument for why the slowdown might be nontrivial.  Some microoptimizations are worth making, but most aren't except in performance-critical code paths.

Comment 26 SharkD 2008-09-09 11:39:40 UTC

(In reply to comment #25)
> (In reply to comment #23)
> > I did say so in message #5.
> Providing a patch that also made changes that clearly weren't necessary for
> this to work (adding var), and where you didn't explain why changing "if" to
> "else if" was actually necessary -- which wasn't obvious at all.

I already explained earlier in this thread that the "var"s are necessary, otherwise you end up possibly overriding the regexps for the dates.

> > Yes, hence my subsequent patches. I feel your lack of keeping current with the
> > latest changes to be suspect.
> "Suspect"?  I'm a volunteer here.  I'm kindly setting aside my own time to
> review your patches out of the goodness of my heart.  It's pretty rude of you
> to question my commitment when I have no obligation to pay attention to you at
> all, and when in fact most patches to MediaWiki go totally unreviewed and are
> ignored for months or years.

I am also a volunteer, and am no less obligated to submit patches to the software. Don't act like you're doing me a favor.

> Your subsequent patches were totally unexplained, so I didn't realize what
> exactly you had done with them.  If you had *pointed out* what problems they
> fixed, I might not have seen the need to make that comment.

Each of my patches is accompanied by a comment as to the nature of the changes. They're easily enough compared with previous versions using the "diff" command that exists next to the link.

> Besides, your
> subsequent patches make mistakes that mine didn't.  They consider, for
> instance, the single string "." to be a number.

No you're wrong. The only period in the regexp is escaped, and therefore only matches a literal dot.

> > 1 commas can appear at the beginning of a number
> Yeah, I spotted that when re-reviewing one of your patches above.  I've fixed
> that in r40611.
> > 2 multiple commas may appear in a sequence or separated by fewer than three
> > digits
> I'm not sure if that's good or bad.  Is sorting something like "40,52" as the
> number 4052 better or worse than sorting it as a string?  Sorting "," as 0 or
> NaN or whatever it evaluates to is definitely wrong, and I've fixed that.

It's been a couple of days since I worked on it, so I'm a bit hazy on why this might be undesirable.

> > 3 multiple periods may appear in a single number
> How?  I don't see how that's possible with the regexes I committed.

My mistake.

> > 4 commas may appear after a period
> It doesn't hurt anything.  I guess it would simplify the regex a bit if that
> didn't happen, but the commas will be stripped before evaluation anyway.
> > 5 sequences surrounded by parentheses are stored as matches when they don't
> > need to be

Most of the speed benefits (see below) are a direct result of this optimization.

> Chucking in extra "?:" everywhere makes the code uglier for no good reason.
> (In reply to comment #24)
> > Also, I don't think "1 ½" "1-½" are valid.
> Why not?

Authors would omit the space simply for the sake of readability. The minus sign in this case indicates and expression not a number, and expressions should not be picked up by the regexp.

> > Having three separate regexps sharing multiple sequences also slows down the
> > code.
> I'm simply not going to consider any claims of slowing down anything unless you
> can provide either benchmarks or a convincing theoretical argument for why the
> slowdown might be nontrivial.  Some microoptimizations are worth making, but
> most aren't except in performance-critical code paths.

I'll attach the demonstration file where you can compare my and your regular expressions side by side. After five trials the results were:

2237ms for the current method
1996ms for the current regexp plus the "?:" at the beginning of the parentheses
1943ms for the method in my latest patch

Comment 27 SharkD 2008-09-09 11:42:00 UTC

Created attachment 5309 [details]
javascript demonstration

Comparison of three methods of writing regular expressions. The regular expressions are tested for speed versus an identical input string. The results are sent to standard output.

Comment 28 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-14 19:00:39 UTC

(In reply to comment #26)
> Each of my patches is accompanied by a comment as to the nature of the changes.
> They're easily enough compared with previous versions using the "diff" command
> that exists next to the link.

Not an informative comment.  Things like "Forgot something in the last version" and "Caught something else" don't tell me what the actual difference is.

In any event, there's no point bickering about this.  I'm telling you how to best get your patches reviewed.  If you don't follow that advice -- although mostly you have followed it so far with regard to splitting patches up and so forth, and I thank you for that -- then you're going to drive away reviewers, including me, who are uninterested in spending the time to decipher, cut pieces out of, or reformat your patches when they could just write their own, or ignore the problem entirely.

> > Besides, your
> > subsequent patches make mistakes that mine didn't.  They consider, for
> > instance, the single string "." to be a number.
> 
> No you're wrong. The only period in the regexp is escaped, and therefore only
> matches a literal dot.

Yes, that's my point.  The literal string "." is counted as a number.  Look at the regex:

/^[+-]?(?:\d{1,3}(?:\,\d{3})*)*\.?\d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?$/

This can match nothing:

[+-]?(?:\d{1,3}(?:\,\d{3})*)*

This can also match nothing:

\d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?

The part in between is this:

\.?

and that can match the literal string ".".  For that matter, your pattern also matches the empty string, since every piece is followed by either ? or *.

> > > Also, I don't think "1 ½" "1-½" are valid.
> > Why not?
> 
> Authors would omit the space simply for the sake of readability. The minus sign
> in this case indicates and expression not a number, and expressions should not
> be picked up by the regexp.

"1 ½" is the same as "1½", which is the same as "1-½" (the hyphen/dash there is not a minus sign, it just separates the fractional part from the integer part).  Googling "1 ½" shows a mix of the three formats:

http://www.google.com/search?q="1+½"

> I'll attach the demonstration file where you can compare my and your regular
> expressions side by side. After five trials the results were:
> 
> 2237ms for the current method
> 1996ms for the current regexp plus the "?:" at the beginning of the parentheses
> 1943ms for the method in my latest patch

Yes, if you *run the code a hundred thousand times*.  It's not run 100,000 times.  It's run *once* when the user clicks the sort button.  The difference is 22.37 *microseconds* for my code and 19.43 *microseconds* for your code, per click.  So in other words, correct me if I'm mistaking how this code is used, you're making changes to the code that save the user three microseconds when they click on the sort button.  Am I misconstruing how often this code runs, or do you get my point that it makes absolutely no difference to the speed of the application?


If I'm not wrong, we're basically down to the following differences between your version and mine:

* Your patch treats fractions and percent signs more intelligently.  I'm not sure if we really want to go in this direction.  It's possibly scope creep, and there's no end to it.  I'd like input from some more people on this, I'm not going to commit it by myself.
* You're a little more picky about where commas go.  Again, I'm not sure if this is needed or not.  I won't do anything without further input from others.
* Your version uses regexes that save maybe two microseconds per click.  Unless I'm totally mistaken on how many times that code is executed, in which case I retract my remarks, I hope you agree we don't need to worry about complicating the regex over this.

Comment 29 SharkD 2008-09-17 01:21:16 UTC

(In reply to comment #28)
> Yes, that's my point.  The literal string "." is counted as a number.  Look at
> the regex:
> /^[+-]?(?:\d{1,3}(?:\,\d{3})*)*\.?\d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?$/
> This can match nothing:
> [+-]?(?:\d{1,3}(?:\,\d{3})*)*
> This can also match nothing:
> \d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?
> The part in between is this:
> \.?
> and that can match the literal string ".".  For that matter, your pattern also
> matches the empty string, since every piece is followed by either ? or *.

I've been trying to find a way to fix this, but the only solutions I've been able to come up with involve slapping additional tests onto the end, such as " && (itm != '') ", which I'd rather not do.


> > > > Also, I don't think "1 ½" "1-½" are valid.
> > > Why not?
> > 
> > Authors would omit the space simply for the sake of readability. The minus sign
> > in this case indicates and expression not a number, and expressions should not
> > be picked up by the regexp.
> "1 ½" is the same as "1½", which is the same as "1-½" (the hyphen/dash there
> is not a minus sign, it just separates the fractional part from the integer
> part).  Googling "1 ½" shows a mix of the three formats:
> http://www.google.com/search?q="1+½"

OK, but the same might be said of the percent sign that exists in the current revision; and, at least the version with the space in between (the most common of the two) could be caught by simply adding an "\s*" beforehand.


> * Your version uses regexes that save maybe two microseconds per click.  Unless
> I'm totally mistaken on how many times that code is executed, in which case I
> retract my remarks, I hope you agree we don't need to worry about complicating
> the regex over this.

No, it's not a lot of time, but I like to be thorough.

Comment 30 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-10-08 18:38:07 UTC

Being "thorough" should not equate to writing less readable code.  The extra characters are not needed and so are just clutter.

Regardless, as I said, I'm not going to commit any further changes to the things brought up here (complexity creep) unless other developers are in favor.  So I have nothing more to say here unless a third party comments.

Comment 31 DieBuche 2011-04-15 10:08:18 UTC

Stuff like e/E and -/+ is fixed in r86088 (I think e/E was patched in the meantime as well)

I agree that ½ is unneccessary

Note You need to log in before you can comment on or make changes to this bug.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links