Last modified: 2009-02-03 10:34:16 UTC
Created attachment 5238 [details] unified diff file Detect more date formats when sorting tables. Currently only 3 formats are detected. The changes increase this number to about two dozen.
Just a note to anyone who decides to apply the patch. This patch also includes a change to the currency which is a different bug, not relevant to this bug. And it also contains a change to numeric matching which merely adds some useless escape characters.
Created attachment 5250 [details] unified diff file Update, minus the change to currency and 'useles' escape sequences.
You're still removing semicolons. On another stylistic note, try to keep lines down to 80 chars or so if it's reasonable. Those comments in the first chunk should go on different lines, not the end of the current line. I'll review this at some point, if no one else does, but not this second.
Created attachment 5254 [details] unified diff file Woops. Sorry about that. Here's a fixed version.
>+ // matches: D(D)-M(M)-YY(YY) or M(M)-D(D)-YY(YY) >+ if (itm.match(/^\d{1,2}[\/\.\-\s]\d{1,2}[\/\.\-\s](\d{2}|\d{4})$/)) >+ sortfn = ts_sort_date_1; Maybe we want to narrow this a bit. Something like "1/3.14" would match, even though it's clearly not a date. I'd enforce that the two delimiters need to be the same. Making sure that neither of the first two parts is greater than 31 might also be a good idea, but on the other hand it might be too complicated. >+ // matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY) >+ if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/)) >+ sortfn = ts_sort_date_2; This seems way too broad. Wouldn't it catch strings like "Score: 5-1"? The old code seems to require that month abbreviations be three letters, which is less prone to error (although not as precise as a list). >-function ts_dateToSortKey(date) { >+function ts_dateToSortKey_1(date) { Could you think of a more descriptive way to distinguish between their names than appending _1 and _2? Same for ts_sort_date_*. >+ date = date.replace(/[\/\.\s]/g, '-'); >+ var dateLen = date.length; >+ var firstIdx = date.indexOf('-'); >+ var secndIdx = date.indexOf('-', firstIdx + 1); >+ var yearIdx = secndIdx + 1; >+ var yearLen = dateLen - yearIdx; >+ var secndLen = dateLen - yearLen - firstIdx - 2; >+ var year = parseInt(date.substr(yearIdx, yearLen)); >... >+ if (ts_europeandate) { >+ var month = date.substr(firstIdx + 1, secndLen); >+ var day = date.substr(0, firstIdx); >+ } else { >+ var month = date.substr(0, firstIdx); >+ var day = date.substr(firstIdx + 1, secndLen); > } >- return "00000000"; >+ month = parseInt(month); >+ day = parseInt(day); Couldn't you do this a lot more simply? Like (untested) var day, month, year; if( ts_europeandate ) { [day, month, year] = date.split( "-" ); } else { [month, day, year] = date.split( "-" ); } Followed by parseInt(). >+ var monthStr = "janfebmaraprmayjunjulaugsepoctnovdec"; >+ var monthLen = 3; >+ var monthNam = date.match(/[^\d\s\/\.\-]+/)[0]; >+ var monthAbv = monthNam.substr(0, monthLen); >+ date = date.replace(/[^\d\s\/\.\-]+/, monthAbv); >+ date = date.replace(/[\s\/\.]/g, '-'); >+ date = date.replace(/,/g, ''); >+ var dateLen = date.length; >+ var firstIdx = date.indexOf('-'); >+ var secndIdx = date.indexOf('-', firstIdx + 1); >+ var yearIdx = secndIdx + 1; >+ var yearLen = dateLen - yearIdx; >+ var secndLen = dateLen - yearLen - firstIdx - 2; >+ var year = parseInt(date.substr(yearIdx, yearLen)); Again, this seems to be way more intricate than it needs to be. How about something like (untested): var monthLen = 3; var monthName = date.match(/[^\d\s\/.-]+/)[0].substr( 0, monthLen ); var month = monthStr.indexOf( monthName )/monthLen; var day = parseInt( date.match(/\d\d?/)[1] ); var year = parseInt( date.match(/\d\d?/)[2] ); You're writing like this is C rather than JavaScript, with all those indexes all over the place. Also, it might be that this function would best be a wrapper around the other function. There seems to be some code duplication here. How about replacing the month abbreviation with the month number and then just passing through to the first date parsing variant? >+ if (date.charCodeAt(0) < 58) { What's this supposed to mean? Don't use incomprehensible constructs like this, please, *especially* not without comments. Most people don't have the ASCII code tables memorized.
(In reply to comment #5) > >+ // matches: D(D)-M(M)-YY(YY) or M(M)-D(D)-YY(YY) > >+ if (itm.match(/^\d{1,2}[\/\.\-\s]\d{1,2}[\/\.\-\s](\d{2}|\d{4})$/)) > >+ sortfn = ts_sort_date_1; > Maybe we want to narrow this a bit. Something like "1/3.14" would match, even > though it's clearly not a date. I'd enforce that the two delimiters need to be > the same. Making sure that neither of the first two parts is greater than 31 > might also be a good idea, but on the other hand it might be too complicated. Good idea. I'm not sure off the top of my head how to ensure the deliminator is the same. I'll have to look it up. > >+ // matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY) > >+ if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/)) > >+ sortfn = ts_sort_date_2; > This seems way too broad. Wouldn't it catch strings like "Score: 5-1"? The > old code seems to require that month abbreviations be three letters, which is > less prone to error (although not as precise as a list). It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the search, and the last set of digits must be 2 or 4 digits long. > >-function ts_dateToSortKey(date) { > >+function ts_dateToSortKey_1(date) { > Could you think of a more descriptive way to distinguish between their names > than appending _1 and _2? Same for ts_sort_date_*. > >+ date = date.replace(/[\/\.\s]/g, '-'); > >+ var dateLen = date.length; > >+ var firstIdx = date.indexOf('-'); > >+ var secndIdx = date.indexOf('-', firstIdx + 1); > >+ var yearIdx = secndIdx + 1; > >+ var yearLen = dateLen - yearIdx; > >+ var secndLen = dateLen - yearLen - firstIdx - 2; > >+ var year = parseInt(date.substr(yearIdx, yearLen)); > >... > >+ if (ts_europeandate) { > >+ var month = date.substr(firstIdx + 1, secndLen); > >+ var day = date.substr(0, firstIdx); > >+ } else { > >+ var month = date.substr(0, firstIdx); > >+ var day = date.substr(firstIdx + 1, secndLen); > > } > >- return "00000000"; > >+ month = parseInt(month); > >+ day = parseInt(day); > Couldn't you do this a lot more simply? Like (untested) > var day, month, year; > if( ts_europeandate ) { > [day, month, year] = date.split( "-" ); > } else { > [month, day, year] = date.split( "-" ); > } I haven't looked too closely, but that looks like a good alternative. > Followed by parseInt(). > >+ var monthStr = "janfebmaraprmayjunjulaugsepoctnovdec"; > >+ var monthLen = 3; > >+ var monthNam = date.match(/[^\d\s\/\.\-]+/)[0]; > >+ var monthAbv = monthNam.substr(0, monthLen); > >+ date = date.replace(/[^\d\s\/\.\-]+/, monthAbv); > >+ date = date.replace(/[\s\/\.]/g, '-'); > >+ date = date.replace(/,/g, ''); > >+ var dateLen = date.length; > >+ var firstIdx = date.indexOf('-'); > >+ var secndIdx = date.indexOf('-', firstIdx + 1); > >+ var yearIdx = secndIdx + 1; > >+ var yearLen = dateLen - yearIdx; > >+ var secndLen = dateLen - yearLen - firstIdx - 2; > >+ var year = parseInt(date.substr(yearIdx, yearLen)); > Again, this seems to be way more intricate than it needs to be. How about > something like (untested): > var monthLen = 3; > var monthName = date.match(/[^\d\s\/.-]+/)[0].substr( 0, monthLen ); > var month = monthStr.indexOf( monthName )/monthLen; > var day = parseInt( date.match(/\d\d?/)[1] ); > var year = parseInt( date.match(/\d\d?/)[2] ); > You're writing like this is C rather than JavaScript, with all those indexes > all over the place. Ditto. > Also, it might be that this function would best be a wrapper around the other > function. There seems to be some code duplication here. How about replacing > the month abbreviation with the month number and then just passing through to > the first date parsing variant? Ditto. > >+ if (date.charCodeAt(0) < 58) { > What's this supposed to mean? Don't use incomprehensible constructs like this, > please, *especially* not without comments. Most people don't have the ASCII > code tables memorized. This just tests whether the first character is a digit or not in order to differentiate between the month-day-year (American) and day-month-year (European) formats. It could be replaced with a regexp. Or, the variable "ts_europeandate" could be used instead. However, I'd rather not use "ts_europeandate" if at all possible.
(In reply to comment #6) > > Couldn't you do this a lot more simply? Like (untested) > > var day, month, year; > > if( ts_europeandate ) { > > [day, month, year] = date.split( "-" ); > > } else { > > [month, day, year] = date.split( "-" ); > > } > I haven't looked too closely, but that looks like a good alternative. Errr, woops. [day, month, year] is not valid syntax. However, storing the values in an array would work (which is your general idea I think).
Created attachment 5261 [details] unified diff file I tried to implement as many of your suggestions as possible. I also added support (and found a bug) for incomplete dates consisting of only a month and a day.
Created attachment 5264 [details] unified diff file Update. I discovered how to collapse the multiple, similar regexps to a single line.
Created attachment 5265 [details] unified diff file Tiny tweak.
(In reply to comment #6) > Good idea. I'm not sure off the top of my head how to ensure the deliminator is > the same. I'll have to look it up. Back-references should work. > > > >+ // matches: D(D)-MONTHABV-YY(YY) or MONTHABV-D(D)-YY(YY) or MONTHABV D(D), YY(YY) > > >+ if (itm.match(/^([^\d\/\.\-\s]+[\/\.\-\s]\d{1,2}|\d{1,2}[\/\.\-\s][^\d\/\.\-\s]+),?[\/\.\-\s](\d{2}|\d{4})$/)) > > >+ sortfn = ts_sort_date_2; > > This seems way too broad. Wouldn't it catch strings like "Score: 5-1"? The > > old code seems to require that month abbreviations be three letters, which is > > less prone to error (although not as precise as a list). > > It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the > search, and the last set of digits must be 2 or 4 digits long. [^\d\/\.\-\s] will pick up a colon, or anything else that's not a digit, slash, dot, whitespace, or hyphen. You're right about the other part, but "Score: 12-10" will be picked up AFAICT. Incidentally, good style would be to write that as [^\d\/.\s-]: move the hyphen to the end so you don't have to escape it, and don't escape the dot (it has no special meaning in character classes). > > >+ if (date.charCodeAt(0) < 58) { > > What's this supposed to mean? Don't use incomprehensible constructs like this, > > please, *especially* not without comments. Most people don't have the ASCII > > code tables memorized. > > This just tests whether the first character is a digit or not in order to > differentiate between the month-day-year (American) and day-month-year > (European) formats. It could be replaced with a regexp. Or, the variable > "ts_europeandate" could be used instead. However, I'd rather not use > "ts_europeandate" if at all possible. Find something else for that than comparing raw ASCII codes, please. :) (In reply to comment #7) > (In reply to comment #6) > > > Couldn't you do this a lot more simply? Like (untested) > > > var day, month, year; > > > if( ts_europeandate ) { > > > [day, month, year] = date.split( "-" ); > > > } else { > > > [month, day, year] = date.split( "-" ); > > > } > > I haven't looked too closely, but that looks like a good alternative. > > Errr, woops. > > [day, month, year] is not valid syntax. However, storing the values in an array > would work (which is your general idea I think). It was supposed to mean day = date.split("-")[0] month = date.split("-")[1] year = date.split("-")[2] Syntax like this works in Python, and in PHP as well with the list() function. I was told the syntax I gave worked, but maybe not. Will review new patch at some point.
First of all, I changed my display name from my real name to my Wikipedia account. (In reply to comment #11) > (In reply to comment #6) > > Good idea. I'm not sure off the top of my head how to ensure the deliminator is > > the same. I'll have to look it up. > Back-references should work. Yes, I figured that out and already incorporated them. > > It wouldn't pick up "Score: 5-1", specifically. Colons aren't included in the > > search, and the last set of digits must be 2 or 4 digits long. > [^\d\/\.\-\s] will pick up a colon, or anything else that's not a digit, slash, > dot, whitespace, or hyphen. You're right about the other part, but "Score: > 12-10" will be picked up AFAICT. You're right. What I wanted was \w, except without the digits. The problem with [a-zA-Z] is that it is limited to certain locales. Using unicode ranges instead would be lengthy and complicated. > Incidentally, good style would be to write that as [^\d\/.\s-]: move the hyphen > to the end so you don't have to escape it, and don't escape the dot (it has no > special meaning in character classes). I have to disagree with you on this one. I like to keep all special characters escaped. It makes the code easier to read, and I don't have to worry about treating certain characters differently in special cases (i.e., the dot and dash). > > > >+ if (date.charCodeAt(0) < 58) { > > > What's this supposed to mean? Don't use incomprehensible constructs like this, > > > please, *especially* not without comments. Most people don't have the ASCII > > > code tables memorized. > > > > This just tests whether the first character is a digit or not in order to > > differentiate between the month-day-year (American) and day-month-year > > (European) formats. It could be replaced with a regexp. Or, the variable > > "ts_europeandate" could be used instead. However, I'd rather not use > > "ts_europeandate" if at all possible. > Find something else for that than comparing raw ASCII codes, please. :) OK. > (In reply to comment #7) > > [day, month, year] is not valid syntax. However, storing the values in an array > > would work (which is your general idea I think). > It was supposed to mean > day = date.split("-")[0] > month = date.split("-")[1] > year = date.split("-")[2] > Syntax like this works in Python, and in PHP as well with the list() function. > I was told the syntax I gave worked, but maybe not. > Will review new patch at some point. Yes, this is what I've done. Check the most recent version I uploaded.
Crap! I keep getting logged out. I've disabled the "restrict to a single IP" option in the Login screen.
Created attachment 5270 [details] unified diff file Update. Finding a unicode range didn't end up being so difficult after all. The code now uses [a-zA-Z\u00c0-\uffff] instead of [^\d\/\.\-\s].
*** This bug has been marked as a duplicate of bug 8226 ***