Last modified: 2010-05-15 15:28:09 UTC
Many php functions are written using a temporary array which has 3 nasty side effects: - its less understandable - it kills both sql server and php "server" by inducing huge IO - its damages mysql.log which are HUGE because we find the data inside, introduce another slowdown, and prevent reading mysql.log USE SQL , dont subsitute php to myssql engine and it will be at least 10 time faster !
It is not a performance issue, it s a BUGFIX, which has the side effect of improving performance AND diminushing I/O pressure on the servers. I thinks it is 10 times faster, perhaps even more, but the important is that it preserve the server. For example, in FixBrokenLinks (the heart of wikipedia) you should write (or put back?) something like $dqw->query( "SELECT bl_from FROM 'brokenlinks' WHERE bl_to=$this->mTitle FOR UPDATE" , $fname ); $dbw->query( "INSERT IGNORE $links $links.l_from $links.l_to \ SELECT $brokenlinks.bl_from, $cur.cur_id FROM $brokenlinks, $cur \ WHERE $brokenslinks.bl_to=$this->mTitle AND $cur.cur_title=$brokenlinks.bl_to;" , $fname ); instead of the crazy (and broken and evily written, and opposite to coding style) stuff with an array passed in a global var !!! ALSO you MUST put the end of the request in the same function, not hide them in a called function, that is FORBIDDEN ! Last but not least, when there is a lock FOR UPDATE, please manage to write the SELECT ... FOR UPDATE and the UPDATE statement, in the same function, and clearly readable. Thanks a lot. All that happens many times in the code, please correct that, dont wait for bug reports.
How are you planning to handle database table prefixes, schema names, and DBMSs that don't support INSERT IGNORE and SELECT FOR UPDATE?
It is not a matter of sql abstract layer (i find it crazy also, but usefull if it works. This should be a separate project, not an internal function, that would prevent many of the uglyness of the code) THE pb is the written sql is BROKEN, in mysql, pgsql, db2 or whatever DBengine you use, there are data in the request. For fixbrokenLinks, we cant avoid the title, but the array you use is really crazy and clean sql avoid to used this awfull array that breaks many things in the server, causes huge io, huge logs ...
Lowering priority as that s code cleanup. Alain, can you cleanly expose the problem you are refering too ? Give a short example of actual code which is not clean and please give a way to fix it. Until you do it i close that bug report.
*** Bug 1145 has been marked as a duplicate of this bug. ***
Alain, I'm afraid this report is simply incoherent. Can you explain what exactly is wrong? Namely: * Which queries are produced that are incorrect * Where they are produced * What is wrong with them The only way we have of communicating with the database server is through SQL, so if there's a problem with the queries you should be able to produce some specific information to back up the vague claims posted above. Some SQL queries are built by functions which handle table naming, quoting, and DBMS-specific differences in syntax, but they still produce SQL.
OK I try again to explain (coolely ;) why: - i put a bug on that - i say it is severity critical Preamble: I speak of functions which are used very often by the server, and in maintenance, function that are nearly the only one used. So we must think we are in a SQL engine, dealing with very high traffic, and with lets says 10 millisecond time order of magnitude. Maintenance is a cool way to simultate a very heavy traffic on the server. Thinking with separate SQL server and remote php client gives light on what i write below. 1/ There is a _bug_ that crush the IO on the server - in fixBrokenLinks (called only for a neawly created page, or in maintenance) we have an array of links pointing to the (new) title. This seems nothing but it mask huge IO: - We query the DB to get links to this title, so the data are now outside the DBengine, and inside the (slow, interpreted) php program. - we transport those data through the php program , in array, and send them without _any_ change to another function - the function (fixBrokenLinks) then take those data and send them back to the SQL engine. It seems nothing but it is a huge cause of IO, we can see that in /var/log/mysql/mysqld.log which is far much bigger than the DB itself ! For maintenance for 100 articles, i have a log of nearly 10 MB because of queries using arrays, expanded with a lot of redundancy, only with title... If we extrapolate to the whole DB 100 000 articles will give me 10 GB of log (FR.cur.sql is 240MB) and people have reported logs like that. That mean that just by manipulating titles, we have IO that are 40 time the whole DB ! This is a BUG. That pb disappear if we supress the array and stay always in the DB, and never take the data outside it (except of course for modifying them, or displaying). And the same occurs in other places with same king of arrays, coming from teh DB and going back to it after a slow-eroneous travel in the php code. 2/ This is critical, because the same ovehead (IO = 40 times the DB!) occurs for each connection/request in the server, which has a lot of IO to deal with, just for nothing. Some people complains that the server is getting slow, or that refrehLinks is tooooooo_slow, i'm quite sure that suppressing this huge IO oveahead will give some fresh air to both SQL server and "apache php server", and on site with separate servers (like en.wikipedia.org) we should see a huge IO decrease between both machine. 3/ Last thing: having data in request also prevent request-caching in the sqlengine. For refeshlinks, we cannot avoid the title ;), but on other places we can completely suppress ARRAYS and the specificity of the data, just have requests with tables and index-names , and that will give 3 nice things: - IO pressure will disappear - request can be cached by the sqlengine for furious speed improvement:) - mysql.log will be readable again, (without huge filtering) and will ease the understanding, and will surely guide us to other improvements.
I ve seen the cool $fname in the logs of 1.4.beta3 :) I ll go on digging it and try to do a report of "offending" functions, that i ll send next time i go to the cyber, (in few days). Mery Xmas.
I've changed this to low priority and minor severity. Please don't overstate the significance of a minor performance issue which might, depending on the case, be faster either way. This isn't remotely close to being a bug or critical issue causing significance performance issues, compared to other things. Maybe it will be later, but not today. I'm the person who is doing DBA work for Wikimedia, which handles about 190 million queries per day. I have no plans to log 190 million queries per day to the general query log. I suggest that you don't think of doing it either. If you want profiling there are profiling settings which log the queries without the data to a database table for later analysis. If you want to know what the real performance issues are, look at the schema. You'll find plenty of scope for improvements there, just as I have. So have most people who've looked at it. Improvements are planned for MediaWiki 1.5. The MySQL 4.n query cache will not cache the queries if changed as you suggest. ANY change to a table will purge ALL queries using that table from the cache. This is an update operation which changes the brokenlinks and links tables, so all queries involving the brokenlinks and links tables will be removed from the cache by this function. Your suggested alternative doesn't do the same job as the function you're suggesting it be used in. That uses the same select twice and getting the data in PHP avoids doing it twice. It may be faster to do the select twice to avoid the I/O, as you suggest. Or not - it'll depend on the case. Does seem worth checking - in this case I think it is likely ot be faster, because the query appears both simple and fast. I agree that the abstraction in this function is at too low a level. This one looks like the "generic" version of the function, not one which would be in a database-specific layer. That would be tuned to each database engine. You make some good points but describing a performance enhancement suggestion as a critical bug didn't help your case.:) If you want to find the real problems I suggest using the slow query log with a one minute threshold and seeing what you see. You might be able to lower it to ten seconds without being swamped if you have a fast system or few queries. To see the issues which are most significant today, see http://bugzilla.wikipedia.org/votes.cgi?action=show_bug&bug_id=1191 .
> That pb disappear if we supress the array and stay always in the DB, and > never take the data outside it (except of course for modifying them, or > displaying). You need to cite the exact queries that you think are problematic, and explain exactly what they are doing wrong. Ignore the arrays, they are irrelevant -- focus only on the SQL. We have ONE WAY to communicate with the database, and that is SQL. We don't send arrays to MySQL, we send SQL queries. We don't get arrays from MySQL, we get results of SQL queries. Arrays can not have any affect on I/O load between our scripts and the database, because we neither send arrays to or get arrays from the database. This has nothing to do with arrays. Pretend you never heard of arrays. Now, tell us what QUERIES are bad, and how they can be done more efficiently.
No further details from original poster in over a month. Resolving as INVALID.