Last modified: 2014-01-22 11:22:31 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T47529, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 45529 - use composite indexes for efficient term search
use composite indexes for efficient term search
Status: VERIFIED FIXED
Product: MediaWiki extensions
Classification: Unclassified
WikidataRepo (Other open bugs)
unspecified
All All
: High major (vote)
: ---
Assigned To: Wikidata bugs
backlog, termsearch
: performance
Depends on:
Blocks: 44529
  Show dependency treegraph
 
Reported: 2013-02-28 00:52 UTC by Daniel Kinzler
Modified: 2014-01-22 11:22 UTC (History)
13 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Daniel Kinzler 2013-02-28 00:52:18 UTC
Currently, the wb_terms table has indexes on individual fields:

CREATE INDEX /*i*/wb_terms_entity_id ON /*_*/wb_terms (term_entity_id);
CREATE INDEX /*i*/wb_terms_entity_type ON /*_*/wb_terms (term_entity_type);
CREATE INDEX /*i*/wb_terms_language ON /*_*/wb_terms (term_language);
CREATE INDEX /*i*/wb_terms_type ON /*_*/wb_terms (term_type);
CREATE INDEX /*i*/wb_terms_text ON /*_*/wb_terms (term_text);
CREATE INDEX /*i*/wb_terms_search_key ON /*_*/wb_terms (term_search_key);

Since only a single index will be used for any given query, this is not helpful, and results in inefficient searches. We really need a composite key for each kind of query. As far as I remember, the kinds of queries we do are:

* by term_language, term_term, and term_entity_type
* by term_language, term_term, and term_type 

"term" here may be an exact match, or a soft match - so we need indexes covering either. The term type has very low cardinality (3 or 4, iirc), so it can probably be admitted.

From this, I gather we need the following two composite indexes:

* for exact matches: ( term_language, term_term ) 
* for soft matches: ( term_language, term_search_key )

We also need to access terms by entity ID (especially for deleting and updating, so we also want an index on 

* ( term_entity_type, term_entity_id )

And of course, row_id will remain a primary key.
Comment 1 Ariel T. Glenn 2013-06-03 17:52:48 UTC
What is the wb_terms table?  I need a human-readable one line description (for non wikibase users) which will accompany it on the dumps page. It currently has the following placeholder: "This table is currently missing a description." See the description of the wb_items_per_site table at http://dumps.wikimedia.org/wikidatawiki/20130527/ for an example.
Comment 2 Daniel Kinzler 2013-06-03 19:26:02 UTC
> What is the wb_terms table?  I need a human-readable one line description
> (for non wikibase users) which will accompany it on the dumps page. 

wb_terms associates labels, aliases and description in a given language with data items. It can be used for finding all labels and aliases for a given concept, or finding all concepts that use a given name (label or alias) in a given language. (In reply to comment #1)
Comment 3 Asher Feldman 2013-06-03 23:55:49 UTC
(In reply to comment #0)
> Since only a single index will be used for any given query, this is not
> helpful, and results in inefficient searches. 

This isn't strictly true.  Since the select queries in production against wb_terms aren't join queries, index merge optimizations are possible and are being utilized.  I.e. a query containing WHERE (term_language='oc' AND term_type='label' AND term_entity_type='property') is executed via an intersection of two indexes - "Using intersect(wb_terms_entity_type,wb_terms_language)".  

That said, index merging isn't as efficient as composite indexes, and there's one query case that isn't performing well at all. 

It looks like there are four common select query types in production:

1) /* Wikibase\TermSqlIndex::getTermsOfEntity */ select term_language, term_type, term_text from `wb_terms` where term_entity_id = ? and term_entity_type = ? 

2) /* Wikibase\TermSqlIndex::getMatchingIDs */ select distinct term_entity_id from `wb_terms` where (term_language=? and term_search_key like ? and term_type=? and term_entity_type=?) or (term_language=? and term_search_key like ? and term_type=? and term_entity_type=?) limit ? 

3) /* Wikibase\TermSqlIndex::getMatchingTerms */ SELECT term_entity_type, term_type, term_language, term_text, term_entity_id FROM `wb_terms` WHERE (term_language=? AND term_type=? AND term_entity_type=?)

4) /* Wikibase\TermSqlIndex::getMatchingTermCombination */ select terms?term_entity_type as terms?term_entity_type, terms?term_type as terms?term_type, terms?term_language as terms?term_language, terms?term_text as terms?term_text, terms?term_entity_id as terms?term_entity_id, terms?term_entity_type as terms?term_entity_type, terms?term_type as terms?term_type, terms?term_language as terms?term_language, terms?term_text as terms?term_text, terms?term_entity_id as terms?term_entity_id from `wb_terms` `terms?` inner join `wb_terms` `terms?` on ((terms?term_entity_id=terms?term_entity_id) and (terms?term_entity_type=terms?term_entity_type)) where (terms?term_language=? and terms?term_text=? and terms?term_type=? and terms?term_entity_type=?) and (terms?term_language=? and terms?term_text=? and terms?term_type=? and terms?term_entity_type=?) and (terms?term_entity_id <> ? or terms?term_entity_type <> ?) limit ?

Query type 3 above can have a large number of results returned and should probably include a limit.

I think your proposed indexes should be good, but please limit the length on term_search_key.  Most queries use a few letters only ('abc%'), indexing term_search_key(6) should be good.  

If the combination ( term_entity_type, term_entity_id ) is guaranteed to be unique, and it will be used to delete rows, define this index as unique to minimize locking. 

Submit the changes (with drop index statements) as a regular schema change and let me know when it's available in gerrit for review.
Comment 4 Daniel Kinzler 2013-06-04 12:18:25 UTC
Thanks for the analysis Asher!

> If the combination ( term_entity_type, term_entity_id ) is guaranteed to be
> unique, and it will be used to delete rows, define this index as unique to
> minimize locking. 

It's not unique. In fact, ( term_entity_type, term_entity_id, term_language ) isn't unique either (there can be multiple aliases in a language language on an entity).
Comment 5 Max Semenik 2013-08-18 08:31:56 UTC
Just noticed Wikibase\TermSqlIndex::getMatchingIDs queries taking over a minute in production - this needs to be fixed.

(In reply to comment #3)

> I think your proposed indexes should be good, but please limit the length on
> term_search_key.  Most queries use a few letters only ('abc%'), indexing
> term_search_key(6) should be good.  

UTF-8 can take up to 4 bytes for some characters, so this limit would cover 1.5 chars for some languages.
Comment 6 Daniel Kinzler 2013-10-29 16:06:55 UTC
we should really look into this.
Comment 7 Sean Pringle 2013-11-11 00:52:54 UTC
While this is still in limbo, to keep production alive, last week I added this temporary index:

KEY `tmp1` (`term_language`,`term_type`,`term_entity_type`,`term_search_key`)

This query form (2 in comment #3) was backing up to the point of pain on slaves:

/* Wikibase\TermSqlIndex::getMatchingIDs */ select distinct term_entity_id
from `wb_terms` where (term_language=? and term_search_key like ? and
term_type=? and term_entity_type=?) or (term_language=? and term_search_key
like ? and term_type=? and term_entity_type=?) limit ?

Most examples were identical. Perhaps something could be cached.
Comment 8 Daniel Kinzler 2013-12-06 14:03:28 UTC
(In reply to comment #3)
> It looks like there are four common select query types in production:
> 
> 1) /* Wikibase\TermSqlIndex::getTermsOfEntity */ select term_language,
> term_type, term_text from `wb_terms` where term_entity_id = ? and
> term_entity_type = ? 

This is getting all the terms defined for an entity. This is mostly used when determining which term entries need to be updated when an entity is saved.
 
> 2) /* Wikibase\TermSqlIndex::getMatchingIDs */ select distinct term_entity_id
> from `wb_terms` where (term_language=? and term_search_key like ? and
> term_type=? and term_entity_type=?) or (term_language=? and term_search_key
> like ? and term_type=? and term_entity_type=?) limit ? 

This is used to find entities based on a user provided prefix. We should see LOTS of those, since they are used for find-as-you-type suggestions in the search box and in the entity selector.

> 3) /* Wikibase\TermSqlIndex::getMatchingTerms */ SELECT term_entity_type,
> term_type, term_language, term_text, term_entity_id FROM `wb_terms` WHERE
> (term_language=? AND term_type=? AND term_entity_type=?)

This should be relatively rare: it's used to load all labels for all properties. The number of rows in the result should be the number of properties defined, which at the moment is about 1000. TermPropertyLabelResolver already caches the result in memcached.

There should be another variant of the Wikibase\TermSqlIndex::getMatchingTerms queries, which searches for a specific term_text:

3b) /* Wikibase\TermSqlIndex::getMatchingTerms */ SELECT term_entity_type,
term_type, term_language, term_text, term_entity_id FROM `wb_terms` WHERE
(term_language=? AND term_type=? AND term_entity_type=? AND term_text=? )

This is quite similar to case (2), but using term_text, not term_search_key.

> 4) /* Wikibase\TermSqlIndex::getMatchingTermCombination */ select
> terms?term_entity_type as terms?term_entity_type, terms?term_type as
> terms?term_type, terms?term_language as terms?term_language, terms?term_text [...]

Used to find entities with a given combination of label and description, to avoid conflicts. This query is performed once per edit.

I suppose the query as such could be optimized - maybe using a subquery is more efficient here than a full join. Should be investigated.


To cover all these use cases, I'm considering the following indexes:

-- for TermSqlIndex::getMatchingIDs
CREATE INDEX /*i*/term_search ON /*_*/wb_terms (term_language, term_search_key(12), term_entity_type, term_type);

-- for TermSqlIndex::getTermsOfEntity and for the join in TermSqlIndex::getMatchingTermCombination
CREATE INDEX /*i*/term_entity ON /*_*/wb_terms (term_entity_type, term_entity_id, term_type);

-- TermSqlIndex::getMatchingTerms with or without given term_text, as well as for TermSqlIndex::getMatchingTermCombination
CREATE UNIQUE INDEX /*i*/term_identity ON /*_*/wb_terms (term_language, term_type, term_entity_type, term_text, term_entity_id);
Comment 9 Gerrit Notification Bot 2013-12-06 15:36:40 UTC
Change 99660 had a related patch set uploaded by Daniel Kinzler:
(bug 45529) use composite indexes on wb_terms.

https://gerrit.wikimedia.org/r/99660
Comment 10 Sean Pringle 2014-01-07 07:56:03 UTC
Pulled from db1021 with the old wb_% indexes after a few hours of activity:

+--------------+------------+---------------------+------------+
| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME          | ROWS_READ  |
+--------------+------------+---------------------+------------+
| wikidatawiki | wb_terms   | tmp1                |  905372478 |
| wikidatawiki | wb_terms   | wb_terms_entity_id  | 1284769088 |
| wikidatawiki | wb_terms   | wb_terms_text       |     365238 |
| wikidatawiki | wb_terms   | wb_terms_language   |   85815191 |
| wikidatawiki | wb_terms   | wb_terms_search_key |      80851 |
+--------------+------------+---------------------+------------+

(tmp1 is roughly eqivalent to the proposed terms_search composite index)

Turns out we need to keep some indexes with term_entity_id, term_text, and term_search_key in left-most positions, plus avoid having massive indexes (22G already, more for the new composites) except where "Index Condition Pushdown" (ICP) is useful. Therefore I propose these indexes instead:

-- Some wb_terms queries not listed in the bug report use term_entity_id=N
-- which is good selectivity.
CREATE INDEX /*i*/terms_entity ON /*_*/wb_terms (term_entity_id);

-- When any wb_terms query includes a search on term_text greater than
-- four or five leading characters a simple index on term_text and
-- language is often better than the proposed composite indexes. Note
-- that it's fine to have term_language second as MariaDB still uses
-- the entire key length even with LIKE '...%' on term_text.
CREATE INDEX /*i*/terms_text ON /*_*/wb_terms (term_text, term_language);

-- Same idea as above for terms_search_key. Not so much traffic as
-- term_text but enough to be useful.
CREATE INDEX /*i*/terms_search_key ON /*_*/wb_terms (term_search_key, term_language);

-- The getMatchingIDs query is horrible when it has just one or two
-- leading characters on term_search_key. This index is slightly
-- different to the proposed term_search for better selectivity
-- while still allowing ICP for short string values.
CREATE INDEX /*i*/terms_search ON /*_*/wb_terms (term_language, term_entity_id, term_type, term_search_key(16));

The above indexes give reasonable performance on all wb_terms queries I can find and seems suitably generic enough for tables.sql.

However, for wikidatawiki in production it's still not fast enough for all fringe cases due to the sheer dataset size. So, on db1026 I've partitioned wb_terms based on hash(term_language). This:

1) Lets the optimizer use partition pruning like a free first-stage index on term_language (because we use equality for term_language clauses). Reduces the number of rows read for many of our queries except those already using ICP.

2) Allows terms_text and terms_search_key to drop the second term_language field for slightly smaller footprints with comparable performance due to (1). More stuff fits in the buffer pool for longer, basically.

3) Speeds up writes to wb_terms on the slave because insert/update/delete only touches the smaller indexes in the relevant partition rather than the entire btrees (this is only a slight benefit, but I figure every bit to help reduce replag...)

CREATE TABLE wb_terms (
...
  KEY terms_entity (term_entity_id),
  KEY terms_text (term_text),
  KEY terms_search_key (term_search_key),
  KEY terms_search (term_language,term_entity_type,term_type,term_search_key(16))
) ENGINE=InnoDB PARTITION BY KEY (term_language) PARTITIONS 16

Note that I left term_language in terms_search despite the partitions because getMatchingIDs query still benefits from ICP when term_search_key is used with a poorly performing short prefix: LIKE 'a%'.

Partitions aren't suitable for tables.sql as we still say we support MySQL 5.0+, but I think it's worthwhile for this particular table on WMF production slaves.
Comment 11 Daniel Kinzler 2014-01-07 13:19:14 UTC
(In reply to comment #10)
> Pulled from db1021 with the old wb_% indexes after a few hours of activity:

Thank you for the thorough analysis and the helpful suggestions! I don't understand mysql's internals nearly as well as you, so I think I'll just go with your suggestions :)
Comment 12 Daniel Kinzler 2014-01-07 13:26:59 UTC
> Partitions aren't suitable for tables.sql as we still say we support MySQL
> 5.0+, but I think it's worthwhile for this particular table on WMF production
> slaves.

Note that this is for the Wikibase extension, not core. We can and do have stronger requirements for the extension anyway. I'll think about whether to use partitioning always, or to make a special wmf version.
Comment 13 Sean Pringle 2014-01-08 05:13:24 UTC
FTR, correction for my comment #10: I said:

"Note that I left term_language in terms_search despite the partitions because
getMatchingIDs query still benefits from ICP when term_search_key is used with
a poorly performing short prefix: LIKE 'a%'."

This is actually wrong for MariaDB 5.5. ICP is not being used inside the partition (though it apparently will be possible in the future). Looks like the speed-up for getMatchingIDs here is only due to the pruning and the smaller partitioned indexes generating fewer Handler_read% hits.
Comment 14 Gerrit Notification Bot 2014-01-08 14:08:59 UTC
Change 99660 merged by jenkins-bot:
(bug 45529) use composite indexes on wb_terms.

https://gerrit.wikimedia.org/r/99660

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links