Last modified: 2014-02-13 23:52:55 UTC
Users created before r12207 who have no edits until after r12207 are not assigned guessed data for user.user_registration. This is not critical, but it is often very confusing and sometimes wildly inaccurate. There are several months' worth of data on Wikimedia wikis in the new user log from the extension (see r10573) that could populate this data. Also, for users who predate even the extension, a Gaussian curve could be plotted from the data of available edits and log entries (all of which would be after the creation date) and normalized to a curve or wave of user creation date/ID. Awaiting WONTFIX!
Go on, then, let's see this gaussian curve of yours :D Might as well work for your wontfix!! The other suggestion, however, is good; that extension provided accurate log data. A quick check on the toolserver suggests that there are at least 290,000 entries in the relevant period; a substantial fraction of these could be recovered in this fashion. It should probably be a separate script, though; there's no guarantee that wikis needing to populate the column would have had the extension installed, and no point in the script trying to use that data if it's not present.
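A minimal sketch of what such a separate backfill step might look like, in Python rather than as an actual MediaWiki maintenance script. The function name, the dict/list inputs, and the sample values are all hypothetical; the only thing taken from the discussion is the idea of filling NULL registration values from new user log entries and doing nothing when the log data is absent. Timestamps are assumed to be MediaWiki-style 14-digit strings (YYYYMMDDHHMMSS), which sort correctly as plain strings.

```python
def backfill_from_log(registrations, newuser_log):
    """Fill missing registration timestamps from new user log entries.

    registrations: dict of user_id -> timestamp string, or None if unknown
                   (stands in for user.user_registration; hypothetical shape).
    newuser_log:   iterable of (user_id, timestamp) pairs from the log.
    Returns a new dict; existing non-NULL values are never overwritten.
    """
    filled = dict(registrations)
    if not newuser_log:
        # Wiki never had the extension installed: nothing to recover.
        return filled

    # Keep the earliest log timestamp seen for each user.
    earliest = {}
    for user_id, ts in newuser_log:
        if user_id not in earliest or ts < earliest[user_id]:
            earliest[user_id] = ts

    for user_id, reg in registrations.items():
        if reg is None and user_id in earliest:
            filled[user_id] = earliest[user_id]
    return filled


# Made-up sample data, for illustration only.
registrations = {1: None, 2: "20060101000000", 3: None}
newuser_log = [
    (1, "20050910120000"),
    (1, "20050909120000"),
    (2, "20050101000000"),
]
result = backfill_from_log(registrations, newuser_log)
```

Users with no log entry (user 3 here) stay NULL, so they would still need some other estimate.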
> Go on, then, let's see this gaussian curve of yours :D

Too slow of a query to do it for everyone without actually, yknow, DOING it, as in populating the data. But here is 5000 from en.wp. Note there isn't much curve to it, and it skips all users with double/nulls, but there is definitely a trend line: http://test.wikipedia.org/wiki/File:Example_of_user_first_actions_for_en.wp_400000-405000.gif
Created attachment 6080: Sampling of normalizable user first-contribution curve

Here is a more distributed sampling, of all users from 1k-750k (1:1000). Copied from http://test.wikipedia.org/wiki/File:Example_of_user_first_actions_for_en.wp_1-750000_(by_thousand).gif
Wow, that's a much better fit than I was expecting, TBH. And the outliers tell their own story; particularly interesting are the ones on the second graph that were registered in 2001-03, but not used until around 2008... More ammunition (as if it were needed) against deleting old accounts. Still not entirely sure how you'd convert that data into registration timestamps, or are you going to assume that the curve approximately follows the registration time; that is, the average delay between registering and editing is zero? Seems a justifiable assumption, but I notice the curve gets a bit wobbly at the top; lots of double NULLs in the data...
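One possible way to turn first-action data into registration estimates, sketched in Python. This is not the Gaussian fit proposed above but a simpler running-minimum bound, and it rests on an assumption the thread does not confirm: that user IDs are handed out in registration order. Under that assumption, a user cannot have registered later than the first action of any user with a higher ID, which gives an upper bound that ignores late-use outliers and fills in double NULLs. The function name and the integer timestamps are hypothetical.

```python
def estimate_registration(first_actions):
    """Upper-bound registration time per user from first-action times.

    first_actions: dict of user_id -> first edit/log timestamp, or None
                   when the user has neither (a "double NULL").
    Assumes (not confirmed by the source) that a higher user_id always
    means a later registration.
    Returns dict of user_id -> estimate, or None if no bound is available.
    """
    estimates = {}
    running_min = None
    # Walk from the newest ID down, tracking the earliest first action seen.
    for user_id in sorted(first_actions, reverse=True):
        ts = first_actions[user_id]
        if ts is not None and (running_min is None or ts < running_min):
            running_min = ts
        estimates[user_id] = running_min
    return estimates


# Made-up sample: user 2 is a late-use outlier, user 3 is a double NULL.
sample = {1: 100, 2: 900, 3: None, 4: 120, 5: 130}
estimates = estimate_registration(sample)
```

In the sample, user 2's late first action is overridden by the bound from user 4, and user 3 inherits the same bound. The result is only an upper limit; how close it sits to the true registration time depends on how many users act soon after registering.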
*** Bug 22097 has been marked as a duplicate of this bug. ***