Last modified: 2010-05-15 14:36:00 UTC
There are potential problems with 'clumping', where some entries may be assigned duplicate or near-duplicate cur_random values, biasing selection toward some entries in a large sample. Avoiding duplicates, regenerating random keys, or adding jitter could help.
enwiki:
mysql> select cur_random, count(*) as c from cur group by cur_random having c > 1;
1466 rows in set (2.04 sec)
The code that generates cur_random is:

$rand = number_format( mt_rand() / mt_getrandmax(), 12, '.', '' );

which rounds to 12 decimal digits of precision. By the birthday-problem approximation, we would expect a repeated value after about 1.2 * sqrt(N) draws; with N = 10^12 possible values, that's roughly a million articles, so 12 digits should be enough. However, on my (32-bit) machine, mt_getrandmax() returns only 2147483647 == 2^31-1. This means we get a much smaller set of distinct values, and would expect a collision after about 1.2 * sqrt(2^31), i.e. only about 50K articles. I think a solution would be to use the RAND() MySQL function instead. MySQL 3.23 and up allow ORDER BY RAND(), although I'm guessing this was rejected in favor of the cur_random column for some reason.
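The birthday-problem arithmetic above can be checked with a quick calculation. This is a sketch in Python rather than PHP; the 1.18 factor is sqrt(2 ln 2), from the standard approximation for the number of draws at which a repeat becomes more likely than not.

```python
import math

def collision_threshold(n_values):
    """Approximate number of uniform draws from n_values distinct values
    after which a repeated value becomes more likely than not:
    ~1.18 * sqrt(n_values)."""
    return math.sqrt(2 * math.log(2) * n_values)

# 12 decimal digits of precision: ~10^12 distinct values
print(round(collision_threshold(10**12)))     # on the order of a million
# but mt_getrandmax() == 2^31 - 1 on 32-bit builds
print(round(collision_threshold(2**31 - 1)))  # on the order of 50K
```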
ORDER BY RAND() was rejected for being ungodly slow on tables with tens of thousands of rows. Can't really think of any reason not to use RAND(), though... might have been concern about replication? Need to check that.
Give me a break, who cares if 100 articles have the same cur_random? Who is going to notice? It's not a collision, it's a non-issue. If Special:Randompage is our only method of article review, we have bigger problems than machine word length.
Correctness is always a concern, even if the effect is minor. Just because we have better things to do doesn't mean it's not worth doing! :)
(In reply to comment #3)
> Can't really think of any reason not to use RAND(), though... might have been
> concern about replication? Need to check that.

Probably because it's not possible to pass RAND() to the database wrapper functions; they expect literal values and quote everything that isn't numeric before querying the DB. So to fix this we'd need to prevent Database::makeList from quoting RAND(), or use a separate query to get the random value (slower?).

Re Tim's comment: my understanding is that this could prevent articles with duplicate cur_random values from ever being returned by the random page function. True, not a big deal, but it would be nice to fix, since we have more than enough precision in the cur_random field. Also, it might be important for statistical surveys to know that a truly random sample is taken, although I can't claim to know whether this would make any practical difference.
> Also, it might be important for statistical surveys to know that a truly
> random sample is taken, although I can't claim to know whether this would
> make any practical difference.

I do claim to know. If you take a population, randomly remove 100 members of that population, then take a random sample of the remainder, that sample is still representative of the original population. What annoys me is the superstition surrounding random numbers that arises from programmers not understanding them. Like when my parsing algorithm was changed because there was a 1 in 2^64 probability of a minor parsing error on a typical page, or 1 in 2^32 for a special kind of page 100 million times longer than our page size limit. Never mind the IE expression() JS insertion bug, or the countless other bugs that occur every single time you type in a particular bit of text -- software engineers seem to have an inexplicable urge to tame randomness even when it was just minding its own business.
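The sampling claim above is easy to demonstrate empirically. A minimal sketch in Python (the population, sample sizes, and seed are arbitrary choices for illustration):

```python
import random

random.seed(0)  # for reproducibility

# A population with a known mean.
population = list(range(1_000_000))           # mean = 499999.5
pop_mean = sum(population) / len(population)

# Randomly remove 100 members (the "duplicate cur_random" articles).
removed = set(random.sample(range(len(population)), 100))
remainder = [x for i, x in enumerate(population) if i not in removed]

# A random sample of the remainder is still representative of the
# original population: its mean estimates the population mean.
sample = random.sample(remainder, 10_000)
sample_mean = sum(sample) / len(sample)

print(pop_mean, sample_mean)  # the two agree to within sampling error
```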
As for the database wrappers: IIRC they already handle NULL distinctly from strings and integers. It should be relatively straightforward to provide, say, a wrapper class for raw SQL functions, which could be recognized and the raw string passed through.
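The wrapper-class idea might look something like this. A Python sketch of the concept only; `RawSQL`, `quote`, and `make_list` are hypothetical names, not MediaWiki's actual API, and the escaping is deliberately naive.

```python
class RawSQL:
    """Marker for SQL fragments that must be passed through unquoted."""
    def __init__(self, sql):
        self.sql = sql

def quote(value):
    """Quote a literal for SQL, but let RawSQL fragments through verbatim."""
    if isinstance(value, RawSQL):
        return value.sql                     # raw string passed through
    if value is None:
        return "NULL"                        # NULL handled distinctly
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("'", "''")  # naive escaping, for the sketch
    return f"'{escaped}'"

def make_list(assignments):
    """Build a SET-style assignment list from a column -> value mapping."""
    return ", ".join(f"{col} = {quote(val)}" for col, val in assignments.items())

print(make_list({"cur_random": RawSQL("RAND()"), "cur_title": "Foo"}))
# cur_random = RAND(), cur_title = 'Foo'
```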
I thought of a simple fix for this problem: use two calls to mt_rand() to get the required amount of randomness. Creating a million random numbers using the current method, I got 262 dupes. Instead using this function:

function wfRandom() {
	$max = mt_getrandmax();
	$rand = number_format( mt_rand() * mt_rand() / $max / $max, 12, '.', '' );
	return $rand;
}

I got 0 dupes.
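The dupe-count experiment above can be reproduced. A sketch in Python rather than PHP, using `random.getrandbits(31)` as a stand-in for `mt_rand()` on a 32-bit build; exact counts will vary with the seed.

```python
import random

random.seed(0)
MAX = 2**31 - 1  # mt_getrandmax() on a 32-bit build
N = 1_000_000

def dupes(values):
    """Number of values that collide with an earlier one."""
    return len(values) - len(set(values))

# Current method: one 31-bit draw, rounded to 12 decimal digits.
single = [f"{random.getrandbits(31) / MAX:.12f}" for _ in range(N)]

# Proposed method: two draws combined.
double = [f"{random.getrandbits(31) * random.getrandbits(31) / MAX / MAX:.12f}"
          for _ in range(N)]

print(dupes(single), dupes(double))  # hundreds of dupes vs. few or none
```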
Also a note: we need to use a predetermined constant in Special:Randompage because RAND() in a WHERE clause doesn't behave as expected (see comment I added to that file a few days ago), however RAND() should work fine when setting the cur_random on page save.
(In reply to comment #10)
> Also a note: we need to use a predetermined constant in Special:Randompage
> because RAND() in a WHERE clause doesn't behave as expected (see comment I
> added to that file a few days ago), however RAND() should work fine when
> setting the cur_random on page save.

Thanks for the clarification. I committed my wfRandom() function to HEAD, and changed Special:Randompage as well as the article inserts to use it. That was easier than using RAND() for the page saves, because of the database wrapper difficulty I mentioned in comment #6. Besides, mt_rand() appears to have better randomness properties than MySQL's RAND(), so I think it makes sense to be consistent and use mt_rand() for everything. I don't think my change will hurt performance, since as far as I understand, none of its uses is performance-critical.
1.4 release imminent, resolving as fixed.