Last modified: 2009-12-23 17:31:43 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T23868, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 21868 - SMW Parser is missing grammar validation
SMW Parser is missing grammar validation
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Semantic MediaWiki (Other open bugs)
unspecified
All All
: Normal critical (vote)
: ---
Assigned To: Markus Krötzsch
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-12-16 16:11 UTC by Roi Avinoam
Modified: 2009-12-23 17:31 UTC (History)
0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Roi Avinoam 2009-12-16 16:11:25 UTC
The SMW Parser seems to ignore grammar/syntax errors altogether. Instead of issuing an error message, it's generating an impossibly-heavy DB query:

SELECT /* SMW::getQueryResult Avinoamr */  DISTINCT t0.smw_id AS id,t0.smw_title AS t,t0.smw_namespace AS ns,t0.smw_iw AS iw,t0.smw_sortkey  FROM `smw_ids` AS t0 INNER JOIN `smw_ids` AS t2 ON t0.smw_id=t2.smw_id  WHERE t2.smw_iw!=':smw' AND t2.smw_iw!=':smw-redi' AND t2.smw_iw!=':smw-border' AND t2.smw_iw!=':smw-intprop'  ORDER BY t0.smw_sortkey ASC  LIMIT 21

This query is obviously malformed, it doesn't try to match anything useful, just filters out certain rows. On our current MediaWiki installation this query takes more than 1 hour to complete. 

I've noticed it's being generated by (1) using a bad syntax (try [Field::->Value]]), or (2) by using the {#ask} parser function with an empty query string (try {#ask:}). 

Obviously, the user should be aware of what he's doing - but a couple of these queries can really slow down the DBs for no reason. 

I think it's a critical performance/stability bug. If I'm not mistaken, a single user can pretty much crash several of the DB servers simply by running these queries repeatedly. 

I'm not certain what kind of parsing algorithm is being used (LL(1) ?), but I'm sure grammar validation is part of it. If you wish, I volunteer to help with developing this support - but I'd like to have a comprehensive understanding and guidance before I begin.

By the way - another approach would be to check the Query object before executing it. And if we realize that this meaningless SQL error will be executed, we can just throw a non-informative Parsing exception. That's a more quick & dirty solution, in my opinion.
Comment 1 Markus Krötzsch 2009-12-23 17:31:43 UTC
The query parser is pretty much ad hoc, but it works as expected in the cases you mention. The parser generally tries to ignore errors and interpret whatever remains of the query. In your case, there are no valid conditions in the malformed query, so that the query ends up selecting all pages. The same happens for the empty query. Note that you can use "format=debug" if you want to have more information about how the query was understood without actually executing it.

One could (and probably should) disallow queries for *all* pages, but the main issue here is that such a simple query actually freezes your database. Normally, the query without conditions should be easiest to answer (given that it imposes a small limit as in your case). I think the problem here is that the SQL generation adds another instance of the smw_id table (SMW's "page table") to the query, thus enforcing a redundant inner join of potentially large tables. I have now improved the query generation process to avoid this, so the performance of the empty query should be much improved.

I have also added a new parameter "$smwgIgnoreQueryErrors" that can be used to specify if queries should be executed at all if errors occured during parsing or preprocessing. The default for this setting is "true." Together with the above, this should fix this specific bug.

There is still much room for optimising the SQL query generation. Ultimately, it might be worth to investigate the use of more specialised storage backends than MySQL (relational DBs are not particularly great for SMW's data model), or to adopt a more sophisticated table layout in SQL (especially large inner joins are an issue with the "single table" approach of SMW). I am also somewhat disappointed about the bad performance that you indicate, given that the inner join is done over the primary key of the tables: retrieving these 21 results should really be no problem in spite of the large table as such!

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links