Last modified: 2010-05-15 15:28:09 UTC
Many php functions are written using a temporary array which
has 3 nasty side effects:
- its less understandable
- it kills both sql server and php "server" by inducing huge IO
- its damages mysql.log which are HUGE because we find the data
inside, introduce another slowdown, and prevent reading
USE SQL , dont subsitute php to myssql engine and it will be at
least 10 time faster !
It is not a performance issue, it s a BUGFIX, which has the side effect
of improving performance AND diminushing I/O pressure on the servers.
I thinks it is 10 times faster, perhaps even more, but the important is
that it preserve the server.
For example, in FixBrokenLinks (the heart of wikipedia) you should write
(or put back?) something like
$dqw->query( "SELECT bl_from FROM 'brokenlinks' WHERE bl_to=$this->mTitle
FOR UPDATE" , $fname );
$dbw->query( "INSERT IGNORE $links $links.l_from $links.l_to \
SELECT $brokenlinks.bl_from, $cur.cur_id FROM $brokenlinks, $cur \
WHERE $brokenslinks.bl_to=$this->mTitle AND
$cur.cur_title=$brokenlinks.bl_to;" , $fname );
instead of the crazy (and broken and evily written, and opposite to
coding style) stuff with an array passed in a global var !!!
ALSO you MUST put the end of the request in the same function, not hide
them in a called function, that is FORBIDDEN !
Last but not least, when there is a lock FOR UPDATE, please manage to
write the SELECT ... FOR UPDATE and the UPDATE statement, in the same
function, and clearly readable. Thanks a lot.
All that happens many times in the code, please correct that, dont wait
for bug reports.
How are you planning to handle database table prefixes, schema names, and DBMSs
that don't support INSERT IGNORE and SELECT FOR UPDATE?
It is not a matter of sql abstract layer (i find it crazy also, but
usefull if it works. This should be a separate project, not an internal
function, that would prevent many of the uglyness of the code)
THE pb is the written sql is BROKEN, in mysql, pgsql, db2 or whatever
DBengine you use, there are data in the request.
For fixbrokenLinks, we cant avoid the title, but the array you use is
really crazy and clean sql avoid to used this awfull array that breaks
many things in the server, causes huge io, huge logs ...
Lowering priority as that s code cleanup.
Alain, can you cleanly expose the problem you are refering too ?
Give a short example of actual code which is not clean and please
give a way to fix it. Until you do it i close that bug report.
*** Bug 1145 has been marked as a duplicate of this bug. ***
Alain, I'm afraid this report is simply incoherent. Can you explain what exactly is wrong? Namely:
* Which queries are produced that are incorrect
* Where they are produced
* What is wrong with them
The only way we have of communicating with the database server is through SQL, so if there's a
problem with the queries you should be able to produce some specific information to back up the
vague claims posted above. Some SQL queries are built by functions which handle table naming,
quoting, and DBMS-specific differences in syntax, but they still produce SQL.
OK I try again to explain (coolely ;) why:
- i put a bug on that
- i say it is severity critical
Preamble: I speak of functions which are used very often by the server,
and in maintenance, function that are nearly the only one used. So we
must think we are in a SQL engine, dealing with very high traffic, and
with lets says 10 millisecond time order of magnitude. Maintenance is a
cool way to simultate a very heavy traffic on the server. Thinking with
separate SQL server and remote php client gives light on what i write
1/ There is a _bug_ that crush the IO on the server
- in fixBrokenLinks (called only for a neawly created page, or in
maintenance) we have an array of links pointing to the (new) title.
This seems nothing but it mask huge IO:
- We query the DB to get links to this title, so the data are now outside
the DBengine, and inside the (slow, interpreted) php program.
- we transport those data through the php program , in array, and
send them without _any_ change to another function
- the function (fixBrokenLinks) then take those data and send them back
to the SQL engine.
It seems nothing but it is a huge cause of IO, we can see that
in /var/log/mysql/mysqld.log which is far much bigger than the DB
For maintenance for 100 articles, i have a log of nearly 10 MB because of
queries using arrays, expanded with a lot of redundancy, only with
title... If we extrapolate to the whole DB 100 000 articles will give me
10 GB of log (FR.cur.sql is 240MB) and people have reported logs like
that. That mean that just by manipulating titles, we have IO that are 40
time the whole DB ! This is a BUG.
That pb disappear if we supress the array and stay always in the DB, and
never take the data outside it (except of course for modifying them, or
And the same occurs in other places with same king of arrays, coming from
teh DB and going back to it after a slow-eroneous travel in the php code.
2/ This is critical, because the same ovehead (IO = 40 times the DB!)
occurs for each connection/request in the server, which has a lot of IO
to deal with, just for nothing.
Some people complains that the server is getting slow, or that
refrehLinks is tooooooo_slow, i'm quite sure that suppressing this huge
IO oveahead will give some fresh air to both SQL server and "apache php
server", and on site with separate servers (like en.wikipedia.org) we
should see a huge IO decrease between both machine.
3/ Last thing: having data in request also prevent request-caching in the
sqlengine. For refeshlinks, we cannot avoid the title ;), but on other
places we can completely suppress ARRAYS and the specificity of the data,
just have requests with tables and index-names , and that will give 3
- IO pressure will disappear
- request can be cached by the sqlengine for furious speed improvement:)
- mysql.log will be readable again, (without huge filtering) and will
ease the understanding, and will surely guide us to other improvements.
I ve seen the cool $fname in the logs of 1.4.beta3 :) I ll go on digging
it and try to do a report of "offending" functions, that i ll send next
time i go to the cyber, (in few days). Mery Xmas.
I've changed this to low priority and minor severity. Please don't overstate the
significance of a minor performance issue which might, depending on the case, be
faster either way. This isn't remotely close to being a bug or critical issue
causing significance performance issues, compared to other things. Maybe it will
be later, but not today. I'm the person who is doing DBA work for Wikimedia,
which handles about 190 million queries per day. I have no plans to log 190
million queries per day to the general query log. I suggest that you don't think
of doing it either. If you want profiling there are profiling settings which log
the queries without the data to a database table for later analysis. If you want
to know what the real performance issues are, look at the schema. You'll find
plenty of scope for improvements there, just as I have. So have most people
who've looked at it. Improvements are planned for MediaWiki 1.5.
The MySQL 4.n query cache will not cache the queries if changed as you suggest.
ANY change to a table will purge ALL queries using that table from the cache.
This is an update operation which changes the brokenlinks and links tables, so
all queries involving the brokenlinks and links tables will be removed from the
cache by this function.
Your suggested alternative doesn't do the same job as the function you're
suggesting it be used in. That uses the same select twice and getting the data
in PHP avoids doing it twice. It may be faster to do the select twice to avoid
the I/O, as you suggest. Or not - it'll depend on the case. Does seem worth
checking - in this case I think it is likely ot be faster, because the query
appears both simple and fast.
I agree that the abstraction in this function is at too low a level. This one
looks like the "generic" version of the function, not one which would be in a
database-specific layer. That would be tuned to each database engine.
You make some good points but describing a performance enhancement suggestion as
a critical bug didn't help your case.:) If you want to find the real problems I
suggest using the slow query log with a one minute threshold and seeing what you
see. You might be able to lower it to ten seconds without being swamped if you
have a fast system or few queries.
To see the issues which are most significant today, see
> That pb disappear if we supress the array and stay always in the DB, and
> never take the data outside it (except of course for modifying them, or
You need to cite the exact queries that you think are problematic, and explain
exactly what they are doing wrong. Ignore the arrays, they are irrelevant -- focus
only on the SQL. We have ONE WAY to communicate with the database, and that
is SQL. We don't send arrays to MySQL, we send SQL queries. We don't get arrays
from MySQL, we get results of SQL queries.
Arrays can not have any affect on I/O load between our scripts and the database,
because we neither send arrays to or get arrays from the database.
This has nothing to do with arrays. Pretend you never heard of arrays.
Now, tell us what QUERIES are bad, and how they can be done more efficiently.
No further details from original poster in over a month. Resolving as INVALID.