Last modified: 2009-02-03 10:34:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T17401, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 15401 - Wikibits patch #4 - more date formats
Wikibits patch #4 - more date formats
Status: RESOLVED DUPLICATE of bug 8226
Product: MediaWiki
Classification: Unclassified
Parser (Other open bugs)
unspecified
All All
: Normal enhancement (vote)
: ---
Assigned To: SharkD
: patch, patch-need-review
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-09-01 02:19 UTC by SharkD
Modified: 2009-02-03 10:34 UTC (History)
1 user (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
unified diff file (5.59 KB, application/octet-stream)
2008-09-01 02:19 UTC, SharkD
Details
unified diff file (5.58 KB, patch)
2008-09-01 16:02 UTC, SharkD
Details
unified diff file (5.44 KB, patch)
2008-09-01 23:12 UTC, SharkD
Details
unified diff file (5.88 KB, patch)
2008-09-02 20:21 UTC, SharkD
Details
unified diff file (5.26 KB, patch)
2008-09-02 23:09 UTC, SharkD
Details
unified diff file (5.25 KB, patch)
2008-09-02 23:15 UTC, SharkD
Details
unified diff file (5.28 KB, patch)
2008-09-03 03:20 UTC, SharkD
Details

Description SharkD 2008-09-01 02:19:21 UTC
Created attachment 5238 [details]
unified diff file

Detect more date formats when sorting tables.

Currently only 3 formats are detected. The changes increase this number to about two dozen.
Comment 1 Daniel Friesen 2008-09-01 08:40:56 UTC
Just a note to anyone who decides to apply the patch.

This patch also includes a change to the currency which is a different bug, not relevant to this bug.
And it also contains a change to numeric matching which merely adds some useless escape characters.
Comment 2 SharkD 2008-09-01 16:02:36 UTC
Created attachment 5250 [details]
unified diff file

Update, minus the change to currency and 'useles' escape sequences.
Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-01 19:22:52 UTC
You're still removing semicolons.  On another stylistic note, try to keep lines down to 80 chars or so if it's reasonable.  Those comments in the first chunk should go on different lines, not the end of the current line.

I'll review this at some point, if no one else does, but not this second.
Comment 4 SharkD 2008-09-01 23:12:41 UTC
Created attachment 5254 [details]
unified diff file

Woops. Sorry about that. Here's a fixed version.
Comment 5 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-02 16:46:26 UTC
>+	// matches: D(D)-M(M)-YY(YY) or M(M)-D(D)-YY(YY)
>+	if (itm.match(/^\d{1,2}[\/\.\-\s]\d{1,2}[\/\.\-\s](\d{2}|\d{4})$/))
>+		sortfn = ts_sort_date_1;

Maybe we want to narrow this a bit.  Something like "1/3.14" would match, even though it's clearly not a date.  I'd enforce that the two delimiters need to be the same.  Making sure that neither of the first two parts is greater than 31 might also be a good idea, but on the other hand it might be too complicated.

>+	// matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY)
>+	if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/))
>+		sortfn = ts_sort_date_2;

This seems way too broad.  Wouldn't it catch strings like "Score: 5-1"?  The old code seems to require that month abbreviations be three letters, which is less prone to error (although not as precise as a list).

>-function ts_dateToSortKey(date) {	
>+function ts_dateToSortKey_1(date) {

Could you think of a more descriptive way to distinguish between their names than appending _1 and _2?  Same for ts_sort_date_*.

>+	date = date.replace(/[\/\.\s]/g, '-');
>+	var dateLen = date.length;
>+	var firstIdx = date.indexOf('-');
>+	var secndIdx = date.indexOf('-', firstIdx + 1);
>+	var yearIdx = secndIdx + 1;
>+	var yearLen = dateLen - yearIdx;
>+	var secndLen = dateLen - yearLen - firstIdx - 2;
>+	var year = parseInt(date.substr(yearIdx, yearLen));
>...
>+	if (ts_europeandate) {
>+		var month = date.substr(firstIdx + 1, secndLen);
>+		var day = date.substr(0, firstIdx);
>+	} else {
>+		var month = date.substr(0, firstIdx);
>+		var day = date.substr(firstIdx + 1, secndLen);
> 	}
>-	return "00000000";
>+	month = parseInt(month);
>+	day = parseInt(day);

Couldn't you do this a lot more simply?  Like (untested)

	var day, month, year;
	if( ts_europeandate ) {
		[day, month, year] = date.split( "-" );
	} else {
		[month, day, year] = date.split( "-" );
	}

Followed by parseInt().

>+	var monthStr = "janfebmaraprmayjunjulaugsepoctnovdec";
>+	var monthLen = 3;
>+	var monthNam = date.match(/[^\d\s\/\.\-]+/)[0];
>+	var monthAbv = monthNam.substr(0, monthLen);
>+	date = date.replace(/[^\d\s\/\.\-]+/, monthAbv);
>+	date = date.replace(/[\s\/\.]/g, '-');
>+	date = date.replace(/,/g, '');
>+	var dateLen = date.length;
>+	var firstIdx = date.indexOf('-');
>+	var secndIdx = date.indexOf('-', firstIdx + 1);
>+	var yearIdx = secndIdx + 1;
>+	var yearLen = dateLen - yearIdx;
>+	var secndLen = dateLen - yearLen - firstIdx - 2;
>+	var year = parseInt(date.substr(yearIdx, yearLen));

Again, this seems to be way more intricate than it needs to be.  How about something like (untested):

	var monthLen = 3;
	var monthName = date.match(/[^\d\s\/.-]+/)[0].substr( 0, monthLen );
	var month = monthStr.indexOf( monthName )/monthLen;
	var day = parseInt( date.match(/\d\d?/)[1] );
	var year = parseInt( date.match(/\d\d?/)[2] );

You're writing like this is C rather than JavaScript, with all those indexes all over the place.

Also, it might be that this function would best be a wrapper around the other function.  There seems to be some code duplication here.  How about replacing the month abbreviation with the month number and then just passing through to the first date parsing variant?

>+	if (date.charCodeAt(0) < 58) {

What's this supposed to mean?  Don't use incomprehensible constructs like this, please, *especially* not without comments.  Most people don't have the ASCII code tables memorized.
Comment 6 SharkD 2008-09-02 17:57:29 UTC
(In reply to comment #5)
> >+	// matches: D(D)-M(M)-YY(YY) or M(M)-D(D)-YY(YY)
> >+	if (itm.match(/^\d{1,2}[\/\.\-\s]\d{1,2}[\/\.\-\s](\d{2}|\d{4})$/))
> >+		sortfn = ts_sort_date_1;
> Maybe we want to narrow this a bit.  Something like "1/3.14" would match, even
> though it's clearly not a date.  I'd enforce that the two delimiters need to be
> the same.  Making sure that neither of the first two parts is greater than 31
> might also be a good idea, but on the other hand it might be too complicated.

Good idea. I'm not sure off the top of my head how to ensure the deliminator is the same. I'll have to look it up.

> >+	// matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY)
> >+	if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/))
> >+		sortfn = ts_sort_date_2;
> This seems way too broad.  Wouldn't it catch strings like "Score: 5-1"?  The
> old code seems to require that month abbreviations be three letters, which is
> less prone to error (although not as precise as a list).

It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the search, and the last set of digits must be 2 or 4 digits long.

> >-function ts_dateToSortKey(date) {	
> >+function ts_dateToSortKey_1(date) {
> Could you think of a more descriptive way to distinguish between their names
> than appending _1 and _2?  Same for ts_sort_date_*.
> >+	date = date.replace(/[\/\.\s]/g, '-');
> >+	var dateLen = date.length;
> >+	var firstIdx = date.indexOf('-');
> >+	var secndIdx = date.indexOf('-', firstIdx + 1);
> >+	var yearIdx = secndIdx + 1;
> >+	var yearLen = dateLen - yearIdx;
> >+	var secndLen = dateLen - yearLen - firstIdx - 2;
> >+	var year = parseInt(date.substr(yearIdx, yearLen));
> >...
> >+	if (ts_europeandate) {
> >+		var month = date.substr(firstIdx + 1, secndLen);
> >+		var day = date.substr(0, firstIdx);
> >+	} else {
> >+		var month = date.substr(0, firstIdx);
> >+		var day = date.substr(firstIdx + 1, secndLen);
> > 	}
> >-	return "00000000";
> >+	month = parseInt(month);
> >+	day = parseInt(day);
> Couldn't you do this a lot more simply?  Like (untested)
>         var day, month, year;
>         if( ts_europeandate ) {
>                 [day, month, year] = date.split( "-" );
>         } else {
>                 [month, day, year] = date.split( "-" );
>         }

I haven't looked too closely, but that looks like a good alternative.

> Followed by parseInt().
> >+	var monthStr = "janfebmaraprmayjunjulaugsepoctnovdec";
> >+	var monthLen = 3;
> >+	var monthNam = date.match(/[^\d\s\/\.\-]+/)[0];
> >+	var monthAbv = monthNam.substr(0, monthLen);
> >+	date = date.replace(/[^\d\s\/\.\-]+/, monthAbv);
> >+	date = date.replace(/[\s\/\.]/g, '-');
> >+	date = date.replace(/,/g, '');
> >+	var dateLen = date.length;
> >+	var firstIdx = date.indexOf('-');
> >+	var secndIdx = date.indexOf('-', firstIdx + 1);
> >+	var yearIdx = secndIdx + 1;
> >+	var yearLen = dateLen - yearIdx;
> >+	var secndLen = dateLen - yearLen - firstIdx - 2;
> >+	var year = parseInt(date.substr(yearIdx, yearLen));
> Again, this seems to be way more intricate than it needs to be.  How about
> something like (untested):
>         var monthLen = 3;
>         var monthName = date.match(/[^\d\s\/.-]+/)[0].substr( 0, monthLen );
>         var month = monthStr.indexOf( monthName )/monthLen;
>         var day = parseInt( date.match(/\d\d?/)[1] );
>         var year = parseInt( date.match(/\d\d?/)[2] );
> You're writing like this is C rather than JavaScript, with all those indexes
> all over the place.

Ditto.

> Also, it might be that this function would best be a wrapper around the other
> function.  There seems to be some code duplication here.  How about replacing
> the month abbreviation with the month number and then just passing through to
> the first date parsing variant?

Ditto.

> >+	if (date.charCodeAt(0) < 58) {
> What's this supposed to mean?  Don't use incomprehensible constructs like this,
> please, *especially* not without comments.  Most people don't have the ASCII
> code tables memorized.

This just tests whether the first character is a digit or not in order to differentiate between the month-day-year (American) and day-month-year (European) formats. It could be replaced with a regexp. Or, the variable "ts_europeandate" could be used instead. However, I'd rather not use "ts_europeandate" if at all possible.
Comment 7 SharkD 2008-09-02 18:07:03 UTC
(In reply to comment #6)
> > Couldn't you do this a lot more simply?  Like (untested)
> >         var day, month, year;
> >         if( ts_europeandate ) {
> >                 [day, month, year] = date.split( "-" );
> >         } else {
> >                 [month, day, year] = date.split( "-" );
> >         }
> I haven't looked too closely, but that looks like a good alternative.

Errr, woops.

[day, month, year] is not valid syntax. However, storing the values in an array would work (which is your general idea I think).
Comment 8 SharkD 2008-09-02 20:21:26 UTC
Created attachment 5261 [details]
unified diff file

I tried to implement as many of your suggestions as possible. I also added support (and found a bug) for incomplete dates consisting of only a month and a day.
Comment 9 SharkD 2008-09-02 23:09:33 UTC
Created attachment 5264 [details]
unified diff file

Update. I discovered how to collapse the multiple, similar regexps to a single line.
Comment 10 SharkD 2008-09-02 23:15:12 UTC
Created attachment 5265 [details]
unified diff file

Tiny tweak.
Comment 11 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-09-03 00:34:14 UTC
(In reply to comment #6)
> Good idea. I'm not sure off the top of my head how to ensure the deliminator is
> the same. I'll have to look it up.

Back-references should work.

> 
> > >+	// matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY)
> > >+	if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/))
> > >+		sortfn = ts_sort_date_2;
> > This seems way too broad.  Wouldn't it catch strings like "Score: 5-1"?  The
> > old code seems to require that month abbreviations be three letters, which is
> > less prone to error (although not as precise as a list).
> 
> It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the
> search, and the last set of digits must be 2 or 4 digits long.

[^\d\/\.\-\s] will pick up a colon, or anything else that's not a digit, slash, dot, whitespace, or hyphen.  You're right about the other part, but "Score: 12-10" will be picked up AFAICT.

Incidentally, good style would be to write that as [^\d\/.\s-]: move the hyphen to the end so you don't have to escape it, and don't escape the dot (it has no special meaning in character classes).

> > >+	if (date.charCodeAt(0) < 58) {
> > What's this supposed to mean?  Don't use incomprehensible constructs like this,
> > please, *especially* not without comments.  Most people don't have the ASCII
> > code tables memorized.
> 
> This just tests whether the first character is a digit or not in order to
> differentiate between the month-day-year (American) and day-month-year
> (European) formats. It could be replaced with a regexp. Or, the variable
> "ts_europeandate" could be used instead. However, I'd rather not use
> "ts_europeandate" if at all possible.

Find something else for that than comparing raw ASCII codes, please.  :)

(In reply to comment #7)
> (In reply to comment #6)
> > > Couldn't you do this a lot more simply?  Like (untested)
> > >         var day, month, year;
> > >         if( ts_europeandate ) {
> > >                 [day, month, year] = date.split( "-" );
> > >         } else {
> > >                 [month, day, year] = date.split( "-" );
> > >         }
> > I haven't looked too closely, but that looks like a good alternative.
> 
> Errr, woops.
> 
> [day, month, year] is not valid syntax. However, storing the values in an array
> would work (which is your general idea I think).

It was supposed to mean

	day = date.split("-")[0]
	month = date.split("-")[1]
	year = date.split("-")[2]

Syntax like this works in Python, and in PHP as well with the list() function.  I was told the syntax I gave worked, but maybe not.

Will review new patch at some point.
Comment 12 SharkD 2008-09-03 01:58:57 UTC
First of all, I changed my display name from my real name to my Wikipedia account.

(In reply to comment #11)
> (In reply to comment #6)
> > Good idea. I'm not sure off the top of my head how to ensure the deliminator is
> > the same. I'll have to look it up.
> Back-references should work.

Yes, I figured that out and already incorporated them.

> > It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the
> > search, and the last set of digits must be 2 or 4 digits long.
> [^\d\/\.\-\s] will pick up a colon, or anything else that's not a digit, slash,
> dot, whitespace, or hyphen.  You're right about the other part, but "Score:
> 12-10" will be picked up AFAICT.

You're right. What I wanted was \w, except without the digits. The problem with [a-zA-Z] is that it is limited to certain locales. Using unicode ranges instead would be lengthy and complicated.


> Incidentally, good style would be to write that as [^\d\/.\s-]: move the hyphen
> to the end so you don't have to escape it, and don't escape the dot (it has no
> special meaning in character classes).

I have to disagree with you on this one. I like to keep all special characters escaped. It makes the code easier to read, and I don't have to worry about treating certain characters differently in special cases (i.e., the dot and dash).


> > > >+	if (date.charCodeAt(0) < 58) {
> > > What's this supposed to mean?  Don't use incomprehensible constructs like this,
> > > please, *especially* not without comments.  Most people don't have the ASCII
> > > code tables memorized.
> > 
> > This just tests whether the first character is a digit or not in order to
> > differentiate between the month-day-year (American) and day-month-year
> > (European) formats. It could be replaced with a regexp. Or, the variable
> > "ts_europeandate" could be used instead. However, I'd rather not use
> > "ts_europeandate" if at all possible.
> Find something else for that than comparing raw ASCII codes, please.  :)

OK.

> (In reply to comment #7)
> > [day, month, year] is not valid syntax. However, storing the values in an array
> > would work (which is your general idea I think).
> It was supposed to mean
>         day = date.split("-")[0]
>         month = date.split("-")[1]
>         year = date.split("-")[2]
> Syntax like this works in Python, and in PHP as well with the list() function. 
> I was told the syntax I gave worked, but maybe not.
> Will review new patch at some point.

Yes, this is what I've done. Check the most recent version I uploaded.
Comment 13 SharkD 2008-09-03 02:00:19 UTC
Crap! I keep getting logged out. I've disabled the "restrict to a single IP" option in the Login screen.
Comment 14 SharkD 2008-09-03 03:20:54 UTC
Created attachment 5270 [details]
unified diff file

Update. Finding a unicode range didn't end up being so difficult after all. The code now uses [a-zA-Z\u00c0-\uffff] instead of [^\d\/\.\-\s].
Comment 15 bluehairedlawyer 2009-02-03 10:34:16 UTC

*** This bug has been marked as a duplicate of bug 8226 ***

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links