Last modified: 2010-05-15 14:36:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2589, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 589 - Make random selection slightly more random
Make random selection slightly more random
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
unspecified
All All
: Normal minor (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2004-09-28 01:05 UTC by Brion Vibber
Modified: 2010-05-15 14:36 UTC (History)
0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Brion Vibber 2004-09-28 01:05:14 UTC
There are potential problems with 'clumping' where some entries may be assigned
duplicate cur_random entries or near-duplicates that bias favorably to some
entries in a large sample.

Avoiding duplicates, changing random keys, or adding jitter could help.
Comment 1 River Tarnell 2004-09-28 01:06:40 UTC
enwiki:

mysql> select cur_random,count(*) as c from cur group by cur_random having c > 1;
1466 rows in set (2.04 sec)
Comment 2 Wil Mahan 2004-09-28 06:34:00 UTC
The code that generates cur_random is 

 $rand = number_format( mt_rand() / mt_getrandmax(), 12, '.', '' );

which means we round to 12 decimal digits of precision. So we would expect a
repeated value of cur_random after about 1.2*sqrt(10^12), or roughly a million
articles, using an approximation of the formula for the Birthday Problem. Thus 12
digits should be enough.

However, on my (32-bit) machine, mt_getrandmax() returns only 2147483647 == 2^31-1.
This means we get a much smaller set of random values, and would expect a collision
after only about 50K articles, assuming the same value of mt_getrandmax.

I think a solution would be to use the RAND() MySQL function instead. MySQL 3.23 
and up allow ORDER BY RAND(), although I'm guessing this was rejected
in favor of the cur_random column for some reason.
Comment 3 Brion Vibber 2004-09-28 06:59:23 UTC
ORDER BY RAND() was rejected for being ungodly slow on tables with tens of
thousands of rows.

Can't really think of any reason not to use RAND(), though... might have been
concern about replication? Need to check that.
Comment 4 Tim Starling 2004-09-28 07:28:23 UTC
Give me a break, who cares if 100 articles have the same cur_random? Who is
going to notice? It's not a collision, it's a non-issue. If Special:Randompage
is our only method of article review, we have bigger problems than machine word
length.
Comment 5 Brion Vibber 2004-09-28 07:38:16 UTC
Correctness is always a concern, even if the affect is minor.

Just because we have better things to do doesn't mean it's not worth doing! :)
Comment 6 Wil Mahan 2004-09-28 07:39:12 UTC
(In reply to comment #3)
> 
> Can't really think of any reason not to use RAND(), though... might have been
> concern about replication? Need to check that.
> 

Probably because it's not possible to pass RAND() to the database wrapper functions;
they expect literal values and quote everything that isn't numeric before
querying the DB. So to fix this we'd need to prevent Database::makeList from
quoting RAND(), or use a separate query to get the random value (slower?).

Re Tim's comment, my understanding is that this could prevent articles with
duplicate
cur_random values from ever being returned by the random page function. True,
not a big deal,
but it would be nice to fix since we have more than enough precision in the
cur_random
field. Also, it might be important for statistical surveys to know that a truly
random
sample is taken, although I can't claim to know whether this would make any
practical
difference.
Comment 7 Tim Starling 2004-09-28 07:59:18 UTC
> Also, it might be important for statistical surveys to know that a truly
> random sample is taken, although I can't claim to know whether this would make any
> practical difference.

I do claim to know. If you take a population, randomly remove 100 members of
that population, then take a random sample of the remainder, that sample is
still representative of the original population. 

What annoys me is superstition surrounding random numbers arising from the fact
that programmers don't understand them. Like when my parsing algorithm was
changed because there was a 1 in 2^64 probability of a minor parsing error on a
typical page, or 1 in 2^32 for a special kind of page 100 million times longer
than our page size limit. Never mind the IE expression() JS insertion bug, or
the countless other bugs that occur every single time you type in a particular
bit of text -- software engineers seem to have an inexplicable urge to tame
randomness even when it was just minding its own business.

Comment 8 Brion Vibber 2004-09-28 08:02:06 UTC
As far as the database wrappers; iirc they already handle NULL distinctly from
strings/integers. It should be relatively straightforward to provide, say, a
wrapper class for raw SQL functions, which could be recognized and the raw
string passed through.
Comment 9 Wil Mahan 2004-10-06 20:07:23 UTC
I thought of a simple fix for this problem: use two calls to
mt_rand() to get the required amount of randomness.

Creating a million random numbers using the current method,
I got 262 dupes. Instead using this function:

function wfRandom() {
        $max = mt_getrandmax();
        $rand = number_format( mt_rand() * mt_rand()
            / $max / $max, 12, '.', '' );
        return $rand;
}

I got 0 dupes.
Comment 10 Brion Vibber 2004-10-06 20:45:21 UTC
Also a note: we need to use a predetermined constant in Special:Randompage because RAND() in a WHERE clause doesn't 
behave as expected (see comment I added to that file a few days ago), however RAND() should work fine when setting the 
cur_random on page save.
Comment 11 Wil Mahan 2004-10-11 17:48:51 UTC
(In reply to comment #10)
> Also a note: we need to use a predetermined constant in Special:Randompage
because RAND() in a WHERE clause doesn't 
> behave as expected (see comment I added to that file a few days ago), however
RAND() should work fine when setting the 
> cur_random on page save.
> 

Thanks for the clarification. I committed my wfRandom() function to HEAD,
and changed Special:Randompage as well as the article inserts to use it.
That was easier than using RAND() for the page saves, because of the
database wrapper difficulty I mentioned in comment #6. Besides, mt_rand()
appears to have better randomness properties than MySQL's RAND(), so it I
think it makes sense to be consistent and use mt_rand() for everything.

I don't think my change will hurt performance, since in my understanding,
none of the users is performance-criticial.
Comment 12 Brion Vibber 2004-12-12 00:04:44 UTC
1.4 release imminent, resolving as fixed.
Comment 13 Brion Vibber 2004-12-12 00:04:51 UTC
1.4 release imminent, resolving as fixed.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links