Last modified: 2010-05-15 15:28:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You could still run searches in Bugzilla or access your list of votes but bug reports will obviously not be up-to-date in Bugzilla.
Bug 1148 - Write correct SQL instead of php arrays which causes HUGE hiden IO
Write correct SQL instead of php arrays which causes HUGE hiden IO
Status: RESOLVED INVALID
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
1.3.x
All All
: Low minor (vote)
: ---
Assigned To: Nobody - You can work on this!
:
: 1145 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2004-12-22 17:02 UTC by Alain B
Modified: 2010-05-15 15:28 UTC (History)
1 user (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Alain B 2004-12-22 17:02:58 UTC
Many php functions are written using a temporary array which 
has 3 nasty side effects: 
 
- its less understandable 
- it kills both sql server and php "server" by inducing huge IO 
- its damages mysql.log which are HUGE because we find the data 
inside, introduce another slowdown, and prevent reading 
mysql.log 
 
USE SQL , dont subsitute php to myssql engine and it will be at 
least 10 time faster !
Comment 1 Alain B 2004-12-22 17:24:41 UTC
It is not a performance issue, it s a BUGFIX, which has the side effect 
of improving performance AND diminushing I/O pressure on the servers. 
I thinks it is 10 times faster, perhaps even more, but the important is 
that it preserve the server. 
 
For example, in FixBrokenLinks (the heart of wikipedia) you should write 
(or put back?) something like 
 
$dqw->query( "SELECT bl_from FROM 'brokenlinks' WHERE bl_to=$this->mTitle 
FOR UPDATE" , $fname ); 
$dbw->query( "INSERT IGNORE $links $links.l_from $links.l_to \ 
      SELECT $brokenlinks.bl_from, $cur.cur_id FROM $brokenlinks, $cur \ 
      WHERE $brokenslinks.bl_to=$this->mTitle AND 
$cur.cur_title=$brokenlinks.bl_to;" , $fname ); 
 
instead of the crazy (and broken and evily written, and opposite to 
coding style) stuff with an array passed in a global var !!! 
 
ALSO you MUST put the end of the request in the same function, not hide 
them in a called function, that is FORBIDDEN ! 
 
Last but not least, when there is a lock FOR UPDATE, please manage to 
write the SELECT ... FOR UPDATE and the UPDATE statement, in the same 
function, and clearly readable. Thanks a lot. 
 
All that happens many times in the code, please correct that, dont wait 
for bug reports. 
  
Comment 2 River Tarnell 2004-12-22 17:28:07 UTC
How are you planning to handle database table prefixes, schema names, and DBMSs
that don't support INSERT IGNORE and SELECT FOR UPDATE?
Comment 3 Alain B 2004-12-22 17:42:57 UTC
It is not a matter of sql abstract layer (i find it crazy also, but 
usefull if it works. This should be a separate project, not an internal 
function, that would prevent many of the uglyness of the code) 
 
THE pb is the written sql is BROKEN, in mysql, pgsql, db2 or whatever 
DBengine you use, there are data in the request. 
 
For fixbrokenLinks, we cant avoid the title, but the array you use is 
really crazy and clean sql avoid to used this awfull array that breaks 
many things in the server, causes huge io, huge logs ... 
 
 
Comment 4 Antoine "hashar" Musso (WMF) 2004-12-22 23:11:25 UTC
Lowering priority as that s code cleanup.

Alain, can you cleanly expose the problem you are refering too ?

Give a short example of actual code which is not clean and please
give a way to fix it. Until you do it i close that bug report.

Comment 5 Antoine "hashar" Musso (WMF) 2004-12-22 23:12:19 UTC
*** Bug 1145 has been marked as a duplicate of this bug. ***
Comment 6 Brion Vibber 2004-12-23 00:57:50 UTC
Alain, I'm afraid this report is simply incoherent. Can you explain what exactly is wrong? Namely:

* Which queries are produced that are incorrect
* Where they are produced
* What is wrong with them

The only way we have of communicating with the database server is through SQL, so if there's a 
problem with the queries you should be able to produce some specific information to back up the 
vague claims posted above. Some SQL queries are built by functions which handle table naming, 
quoting, and DBMS-specific differences in syntax, but they still produce SQL.
Comment 7 Alain B 2004-12-24 15:42:41 UTC
OK I try again to explain (coolely ;) why:  
- i put a bug on that  
- i say it is severity critical  
  
Preamble: I speak of functions which are used very often by the server,  
and in maintenance, function that are nearly the only one used. So we  
must think we are in a SQL engine, dealing with very high traffic, and  
with lets says 10 millisecond time order of magnitude. Maintenance is a  
cool way to simultate a very heavy traffic on the server. Thinking with  
separate SQL server and remote php client gives light on what i write  
below.  
  
1/ There is a _bug_ that crush the IO on the server  
- in fixBrokenLinks (called only for a neawly created page, or in  
maintenance) we have an array of links pointing to the (new) title.  
  
This seems nothing but it mask huge IO:  
- We query the DB to get links to this title, so the data are now outside  
the DBengine, and inside the (slow, interpreted) php program.  
- we transport those data through the php program , in array, and  
send them without _any_ change to another function  
- the function (fixBrokenLinks) then take those data and send them back  
to the SQL engine.  
  
It seems nothing but it is a huge cause of IO, we can see that  
in /var/log/mysql/mysqld.log which is far much bigger than the DB  
itself !  
  
For maintenance for 100 articles, i have a log of nearly 10 MB because of  
queries using arrays, expanded with a lot of redundancy, only with  
title... If we extrapolate to the whole DB 100 000 articles will give me  
10 GB of log (FR.cur.sql is 240MB) and people have reported logs like  
that. That mean that just by manipulating titles, we have IO that are 40  
time the whole DB ! This is a BUG.  
   
That pb disappear if we supress the array and stay always in the DB, and  
never take the data outside it (except of course for modifying them, or  
displaying).  
  
And the same occurs in other places with same king of arrays, coming from  
teh DB and going back to it after a slow-eroneous travel in the php code.  
  
2/ This is critical, because the same ovehead (IO = 40 times the DB!)  
occurs for each connection/request in the server, which has a lot of IO  
to deal with, just for nothing.   
Some people complains that the server is getting slow, or that  
refrehLinks is tooooooo_slow, i'm quite sure that suppressing this huge  
IO oveahead will give some fresh air to both SQL server and "apache php  
server", and on site with separate servers (like en.wikipedia.org) we  
should see a huge IO decrease between both machine.  
  
3/ Last thing: having data in request also prevent request-caching in the  
sqlengine. For refeshlinks, we cannot avoid the title ;), but on other  
places we can completely suppress ARRAYS and the specificity of the data,  
just have requests with tables and index-names , and that will give 3  
nice things:  
 - IO pressure will disappear  
 - request can be cached by the sqlengine for furious speed improvement:)  
 - mysql.log will be readable again, (without huge filtering) and will  
ease the understanding, and will surely guide us to other improvements.  
Comment 8 Alain B 2004-12-24 17:18:08 UTC
I ve seen the cool $fname in the logs of 1.4.beta3 :)  I ll go on digging 
it and try to do a report of "offending" functions, that i ll send next 
time i go to the cyber, (in few days). Mery Xmas. 
Comment 9 Jamesday 2004-12-24 18:30:18 UTC
I've changed this to low priority and minor severity. Please don't overstate the
significance of a minor performance issue which might, depending on the case, be
faster either way. This isn't remotely close to being a bug or critical issue
causing significance performance issues, compared to other things. Maybe it will
be later, but not today. I'm the person who is doing DBA work for Wikimedia,
which handles about 190 million queries per day. I have no plans to log 190
million queries per day to the general query log. I suggest that you don't think
of doing it either. If you want profiling there are profiling settings which log
the queries without the data to a database table for later analysis. If you want
to know what the real performance issues are, look at the schema. You'll find
plenty of scope for improvements there, just as I have. So have most people
who've looked at it. Improvements are planned for MediaWiki 1.5.

The MySQL 4.n query cache will not cache the queries if changed as you suggest.
ANY change to a table will purge ALL queries using that table from the cache.
This is an update operation which changes the brokenlinks and links tables, so
all queries involving the brokenlinks and links tables will be removed from the
cache by this function.

Your suggested alternative doesn't do the same job as the function you're
suggesting it be used in. That uses the same select twice and getting the data
in PHP avoids doing it twice. It may be faster to do the select twice to avoid
the I/O, as you suggest. Or not - it'll depend on the case. Does seem worth
checking - in this case I think it is likely ot be faster, because the query
appears both simple and fast.

I agree that the abstraction in this function is at too low a level. This one
looks like the "generic" version of the function, not one which would be in a
database-specific layer. That would be tuned to each database engine.

You make some good points but describing a performance enhancement suggestion as
a critical bug didn't help your case.:) If you want to find the real problems I
suggest using the slow query log with a one minute threshold and seeing what you
see. You might be able to lower it to ten seconds without being swamped if you
have a fast system or few queries.

To see the issues which are most significant today, see
http://bugzilla.wikipedia.org/votes.cgi?action=show_bug&bug_id=1191 .
Comment 10 Brion Vibber 2004-12-24 20:56:07 UTC
> That pb disappear if we supress the array and stay always in the DB, and  
> never take the data outside it (except of course for modifying them, or  
> displaying).

You need to cite the exact queries that you think are problematic, and explain 
exactly what they are doing wrong. Ignore the arrays, they are irrelevant -- focus 
only on the SQL. We have ONE WAY to communicate with the database, and that 
is SQL. We don't send arrays to MySQL, we send SQL queries. We don't get arrays 
from MySQL, we get results of SQL queries.

Arrays can not have any affect on I/O load between our scripts and the database, 
because we neither send arrays to or get arrays from the database.

This has nothing to do with arrays. Pretend you never heard of arrays.

Now, tell us what QUERIES are bad, and how they can be done more efficiently.
Comment 11 Brion Vibber 2005-01-29 08:19:49 UTC
No further details from original poster in over a month. Resolving as INVALID.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links