Last modified: 2014-09-29 14:01:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T50625, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 48625 - Provide namespace IDs and names in the databases similar to toolserver.namespace
Provide namespace IDs and names in the databases similar to toolserver.namespace
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
tools (Other open bugs)
unspecified
All All
: Unprioritized trivial
: ---
Assigned To: Marc A. Pelletier
http://article.gmane.org/gmane.org.wi...
:
Depends on:
Blocks: labs-replication tool-missing-ts-feat 66868
  Show dependency treegraph
 
Reported: 2013-05-20 04:27 UTC by Tim Landscheidt
Modified: 2014-09-29 14:01 UTC (History)
18 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Tim Landscheidt 2013-05-20 04:27:20 UTC
Toolserver has the local table toolserver.namespace on all databases that provides the IDs and names of all namespaces in a wiki:

| mysql> SELECT * FROM toolserver.namespace LIMIT 5;
| +----------+------------------+-------+-------------------------------------+
| | dbname   | domain           | ns_id | ns_name                             |
| +----------+------------------+-------+-------------------------------------+
| | abwiki_p | ab.wikipedia.org |    10 | Ашаблон                      |
| | abwiki_p | ab.wikipedia.org |     7 | Обсуждение файла     |
| | abwiki_p | ab.wikipedia.org |    11 | Обсуждение шаблона |
| | abwiki_p | ab.wikipedia.org |     8 | Амедиавики                |
| | abwiki_p | ab.wikipedia.org |     9 | Обсуждение MediaWiki      |
| +----------+------------------+-------+-------------------------------------+
| 5 rows in set (0.00 sec)

| mysql>

This can probably be easily generated from iterating over all wikis, using the API (http://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces) to query namespaces and update them in the database if the webcall succeeded, but an even more interesting approach would of course be to clone operations/mediawiki-config and extract the information from there.
Comment 1 Tim Landscheidt 2013-05-21 19:03:19 UTC
On second thought: It would be much more elegant and logical to have a table namespace on *each* database (not one per server) that provides only the IDs (ns_id) and names (ns_name) of that particular database.
Comment 2 Kunal Mehta (Legoktm) 2013-05-31 21:36:24 UTC
(In reply to comment #1)
> On second thought: It would be much more elegant and logical to have a table
> namespace on *each* database (not one per server) that provides only the IDs
> (ns_id) and names (ns_name) of that particular database.

There are use cases to have it all together (https://www.wikidata.org/wiki/User_talk:Legoktm/namespaces is an example) in one database, however having it also available on the database would also be useful.

I think for backwards-compatability reasons we should have one central database structured like the toolserver one, and then we can have tables on each individual database.
Comment 3 Alex Monk 2013-06-02 03:15:19 UTC
I have put most this data in p50380g50577__wmf_wiki_info on tools-db. It was generated using the API so some information is missing/wrong for private wikis which do not allow meta=siteinfo API access.
Comment 4 Tim Landscheidt 2013-06-02 17:27:09 UTC
(In reply to comment #3)
> I have put most this data in p50380g50577__wmf_wiki_info on tools-db. It was
> generated using the API so some information is missing/wrong for private
> wikis
> which do not allow meta=siteinfo API access.

Thanks!  Could you post the script here as an attachment, please?
Comment 5 Liangent 2013-06-02 17:35:08 UTC
(In reply to comment #3)
> I have put most this data in p50380g50577__wmf_wiki_info on tools-db. It was
> generated using the API so some information is missing/wrong for private
> wikis
> which do not allow meta=siteinfo API access.

Maybe you can make it a "tool"
Comment 6 Liangent 2013-06-02 17:39:29 UTC
(In reply to comment #0)
> but an even more interesting approach would of course be to clone
> operations/mediawiki-config and extract the information from there.

I'm doing things in this way currently (in my script it's not only for namespace names, but also the whole environment so a single table itself wouldn't help me) but there should be some caveat here: to resolve all namespace names you need to have all extensions deployed ready, at least those who create new namespaces, and follow any WMF's deployment of new extensions.
Comment 7 Alex Monk 2013-06-02 17:45:40 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > I have put most this data in p50380g50577__wmf_wiki_info on tools-db. It was
> > generated using the API so some information is missing/wrong for private
> > wikis
> > which do not allow meta=siteinfo API access.
> 
> Thanks!  Could you post the script here as an attachment, please?

You can find it in /data/project/wmf-wiki-info/PyMediaWikiClient/script.py
The script wont work alone because you also need some dependencies in certain locations (MediaWikiClient - a small python class which makes it easy to deal with MW's API, and WMF's operations/mediawiki-config.git for the s[1-7].dblist files).
Any feedback/review/etc. is welcome
Comment 8 MZMcBride 2013-06-02 17:48:18 UTC
I think we should have a discussion about what the current "toolserver" database is, what we want in the future, and whether we care about breaking backward compatibility.

Some of the design decisions in some of the database tables could probably be re-thought, but only if we're willing to break the current interfaces.

In addition, I think we should only rely on MediaWiki's API for this information (with user authentication, as necessary). This is the cleanest and sanest way to accurately get this information, as far as I know.
Comment 9 Marc A. Pelletier 2013-06-12 16:29:08 UTC
This should probably be discussed on-wiki so that we can gather actual requirements and figure out /how/ before we go ahead and implement something.
Comment 10 MZMcBride 2013-06-14 01:31:37 UTC
(In reply to comment #9)
> This should probably be discussed on-wiki so that we can gather actual
> requirements and figure out /how/ before we go ahead and implement something.

Agreed. That's pretty much what I had in mind in comment 8. Feel free to start that discussion. :-)
Comment 11 MZMcBride 2013-09-30 03:45:27 UTC
I just hit this issue trying to manually run a database report. The lack of a namespace table is a bit painful. I imagine this issue is preventing a number of database reports from being moved to Labs. Hrm. Perhaps everyone should just query the MediaWiki API themselves. Hmm.
Comment 12 Marc A. Pelletier 2013-10-10 17:06:19 UTC
Querying the API is probably the only way to get fresh data; the actual configuration is difficult to parse from the configuration and may also be modified by extensions (which vary from project to project).

If we were to provide metadata tables with namespaces and other project-specific information, it'd have to be queried from the API anyways; the only benefit of having it in tables would be to allow joins (don't know how useful that'd be) at the cost of the information lagging behind reality by 0.5 update period on average.
Comment 13 Pietrodn 2013-10-10 18:15:23 UTC
I wrote a simple PHP function that queries the MediaWiki API for namespaces.
See the "getNamespacesAPI" function here:
https://github.com/pietrodn/intersect-contribs/blob/master/pietrodnUtils.php
Comment 14 Beta16 2013-10-11 09:58:24 UTC
My two cents:
Generate database reports by queries, like [[:en:Wikipedia:Database reports]], is quite impossible without namespace meta-table.
Get this information via API means making thousands of API calls instead of a single join.
Comment 15 Marc A. Pelletier 2013-10-29 13:57:37 UTC
Beta16: I'm not sure how to get "thousands of API calls"; if you are generating a database report for project X, getting the list of namespaces for that project is exactly one API call.  OTOH, doing a join while querying some table on that project's database doubles the number of index lookups, at best, and potentially multiplies the number of fetched rows and increases the data set size a lot.
Comment 16 MZMcBride 2013-10-29 15:41:15 UTC
I'm strongly inclined to agree with comment 12 and comment 15. I'm not sure this is a valid bug. I say this as the original author of [[Wikipedia:Database reports]], mentioned in comment 14. Scripts and tools should probably query the MediaWiki Web API themselves.
Comment 17 Marc A. Pelletier 2013-12-19 20:50:00 UTC
Given that this would provide nothing that cannot be obtained faster and more reliably by an API call, and that its use would add load to the database for no gain except a minor programming shorthand; closing WONTFIX.

Tool maintainers are perfectly allowed to make such a database available for general use (and maintain it accordingly).
Comment 18 Merlijn van Deen (test) 2013-12-21 13:20:30 UTC
I suggest re-opening this, as there is are at least two use cases the table provides:

 1) creating full page titles and URLs on the SQL level; for this at least the canonical (Talk, User, User Talk, etc) namespaces have to be provided. The Toolserver wiki shows the following example query:

SELECT
  ns_name,
  page_title
FROM enwiki_p.page
 JOIN toolserver.namespacename 
  ON ns_id = page_namespace AND dbname = 'enwiki_p'
WHERE page_namespace = 6
 AND ns_is_favorite = TRUE
LIMIT 1;

    Note that it's impossible to construct a URL without the namespace names -- you cannot link to http://en.wikipedia.org/wiki/NS3:valhallasw, for example.

 2) providing localized page titles for multiple sites (see Comment 2).
Comment 19 Maarten Dammers 2013-12-21 16:01:34 UTC
I agree with Merlijn -> reopened.
Comment 20 Marc A. Pelletier 2013-12-21 16:06:56 UTC
This belongs in the application, and not the database.  The cost of doubling index lookups on queries is nontrivial for something that can be done at near-zero cost at the applicative level.

Comment 18 example 1 is exactly the use case that is to be /avoided/; that join is pointless and expensive when the application could simply look up the namespace number in a hash when presenting the result.

Closing again; this will not be supported by tool labs.
Comment 21 Maarten Dammers 2013-12-21 16:27:55 UTC
Marc, we don't accept no. This is widely used at the Toolserver and so we need an alternative. It's not near-zero core and is not pointless and expensive. 

Do you think volunteer time is free? What about all the that are currently using this at the Toolserver and are now forced to move to labs?
Comment 22 Marc A. Pelletier 2013-12-21 16:44:46 UTC
It is pointless and expensive, and can be done in a line or two of PHP or perl; and probably no more in pretty much any other language.  Do you not realize that the query above causes an index lookup and row fetch in the database for every single page?

Indeed, given that query, you more than triple the resources required to run the query, as well as increase the size of the result set by sending redundant data at every row.

You /have/ an alternative:  use one line of code at the beginning of your script to fetch the namespaces info, and look the name up when outputting it.

Does that mean small changes in scripts that used to rely on offloading their work to the database?  Yes.  It's certainly a much smaller change than was required when we added edit tokens, for instance.  APIs and interface evolve.  Obsolete junk is pruned all the time.  This is one of those times.

Reopening this bug will not change those facts, nor will it magically alter how databases work to make it sane to push what is a trivial hash lookup in the application into an expensive join at the database.  So please don't; I'd rather not have to start cracking down on disputes in bugzilla (which isn't the right venue for this anyways).
Comment 23 Tim Landscheidt 2013-12-22 11:29:24 UTC
(In reply to comment #22)
> It is pointless and expensive, and can be done in a line or two of PHP or
> perl;
> and probably no more in pretty much any other language.  Do you not realize
> that the query above causes an index lookup and row fetch in the database for
> every single page?

Those are two claims I'd like to see evidence for :-).  I don't like MySQL (or MariaDB) very much, but any query planner worth its salt will cache the namespacename table heavily.

> Indeed, given that query, you more than triple the resources required to run
> the query, as well as increase the size of the result set by sending
> redundant
> data at every row.
> [...]

If we are counting bytes, we should close down Tools *now*.

Most of the queries run on Toolserver/Tools are ineffective, and that fits nicely with MediaWiki being written in PHP.

If it is your professional opinion that with the current setup we need more database servers or bigger network cables to implement toolserver.namespace & Co., that's totally fine, but the solution then isn't WONTFIX, but take it up the chain to the one who signs an order for new equipment.
Comment 24 MZMcBride 2013-12-30 09:47:24 UTC
Two thoughts here:

1. This is clearly a breaking change for the transition from Toolserver to Labs. Is this appropriately documented somewhere as such?

2. Is the source code that controls/maintains meta[_p] in a Labs project? I think it should be just another Labs project®, if it isn't already.
Comment 25 merl 2014-01-27 14:00:58 UTC
I also need to have all namespace names of all wikis in one table on each cluster.

One aditional use case in my script is replacing full page titles of languagelinks, iwlinks and wikidata-sitelink table with namespace id and title without namespace as is needed for further queries in this script on page table.

Doing this at application layer would be very inefficient any slow, because i am doing this with about 6,5 million pages (all living people) twice a day. Currently this part of my tools is done on toolserver with joining namespacename in less than 5 minutes.

Running api queries to get this namespace info for all 500 wikis (wikipedia + commons + wikisource + wikiquote) would take longer. And running 6.5 mio. single update queries on a use table instead of one join would also stress the database server needless.

So the solution for me is to create this table for my own.
Comment 26 Tim Landscheidt 2014-04-01 16:20:01 UTC
(In reply to Tim Landscheidt from comment #23)
> > It is pointless and expensive, and can be done in a line or two of PHP or
> > perl;
> > and probably no more in pretty much any other language.  Do you not realize
> > that the query above causes an index lookup and row fetch in the database for
> > every single page?

> Those are two claims I'd like to see evidence for :-).  I don't like MySQL
> (or MariaDB) very much, but any query planner worth its salt will cache the
> namespacename table heavily.

> > Indeed, given that query, you more than triple the resources required to run
> > the query, as well as increase the size of the result set by sending
> > redundant
> > data at every row.
> > [...]

> If we are counting bytes, we should close down Tools *now*.

> Most of the queries run on Toolserver/Tools are ineffective, and that fits
> nicely with MediaWiki being written in PHP.

> If it is your professional opinion that with the current setup we need more
> database servers or bigger network cables to implement toolserver.namespace
> & Co., that's totally fine, but the solution then isn't WONTFIX, but take it
> up the chain to the one who signs an order for new equipment.

With no response, I ran some tests today on enwiki_p with toolserver.namespace dumped from the Toolserver to s51073__toolserver.namespace.  I did see an increase in looking up and returning an additional field (as expected), but nothing major that would justify buying new equipment.  So reopening this issue so that it can be solved in software.
Comment 27 PiRSquared17 2014-04-04 21:13:18 UTC
(In reply to Merlijn van Deen from comment #18)
>     Note that it's impossible to construct a URL without the namespace names
> -- you cannot link to http://en.wikipedia.org/wiki/NS3:valhallasw, for
> example.

You can select page_id and do //en.wikipedia.org/?curid=$pageid
Comment 28 nosy 2014-04-09 12:54:34 UTC
I checked with several people.
I can move the Toolserver DB MMP off TS and migrate it into Labs.
When I am done something like toolserverdb_p should be available on all user DB hosts.
Comment 29 nosy 2014-04-09 13:00:29 UTC
If anyone wants to join the project I'd be glad since it is intended to be a MMP anyway.
Comment 30 Silke Meyer (WMDE) 2014-04-10 14:09:22 UTC
Thanks a lot for the proposal, nosy! Indeed, I would like others to join you because "we" (as in WMDE) cannot / will not maintain a tool that that many other tools depend on in the long run.

Any volunteers? If there aren't any, it might not be such a good idea after all...
Comment 31 nosy 2014-04-25 14:42:18 UTC
Did the first steps - most of the scripts already run and the dbs get filled.
I still dont know how to test the data in regard of being valid.
I'd be glad if some of you who need it anyway probably try to use s51892_toolserverdb on any of the db instances.
Comment 32 Marc A. Pelletier 2014-04-25 15:48:39 UTC
Give me a ring once you are satisfied with the result, I can rename the database to something more mnemonic for you.

(Also, unless your database name ends with _p, you will need to give grants to it)
Comment 33 Silke Meyer (WMDE) 2014-04-28 10:03:30 UTC
Any comments on Marlen's work so far? Did anyone test it?
Comment 34 Alex Monk 2014-04-29 01:07:23 UTC
All I can do is get a list of tables in that DB. Everything else is denied.
Comment 35 Silke Meyer (WMDE) 2014-05-09 12:36:34 UTC
Little status update... According to Nosy, this is not fully done, yet.
Comment 36 Silke Meyer (WMDE) 2014-06-10 09:37:41 UTC
Nosy is on it. Long term maintenance of this tool is still an open question though.
Comment 37 nosy 2014-06-10 20:31:21 UTC
Its done  so far.
Comment 38 Silke Meyer (WMDE) 2014-06-11 16:29:06 UTC
Dear all, please verify and close if okay. Thanks, nosy!
Comment 39 Silke Meyer (WMDE) 2014-07-08 12:54:37 UTC
No complaints in almost a month. Closing the ticket.
Comment 40 Jimmy Xu 2014-08-03 02:47:02 UTC
Reopening, would this be merged into meta_p or renamed without a s**_ prefix?
Comment 41 Marc A. Pelletier 2014-08-13 15:25:11 UTC
Renaming without prefix makes it easy to keep maintaining with the current tool.  Just say the word, and I'll do the rename.
Comment 42 nosy 2014-08-18 09:36:30 UTC
Word :). Please rename the DB to toolserver.
Comment 43 nosy 2014-08-18 10:13:21 UTC
If its not too complicated I would prefer to have a additional DB called toolserver so I can migrate and leave the old DB in place until its done.
Comment 44 Marc A. Pelletier 2014-08-27 14:41:12 UTC
Wait, if you want an additional DB named toolserver, where do you want me to put s51073__toolserver?
Comment 45 Marc A. Pelletier 2014-08-27 14:41:57 UTC
(Also, if you want everyone to be able to select in a database, it normally needs to be named ending in _p)
Comment 46 nosy 2014-09-10 13:49:35 UTC
I'd need a DB called toolserver. :D
It'd be fine if I can only get toolserver_p.
The rest of the DBs of the user can be left untouched.
Comment 47 nosy 2014-09-29 13:34:46 UTC
Latest plan is to implement a view called toolserverdb that points to the original db.
It would regard:

s51892_toolserverdb_p.language
s51892_toolserverdb_p.namespace
s51892_toolserverdb_p.namespacename
s51892_toolserverdb_p.servermapping
s51892_toolserverdb_p.wiki
Comment 48 Marc A. Pelletier 2014-09-29 14:01:00 UTC
(The view is named toolserverdb_p to allow all users select right, at it lives on tools.labsdb).

That has been created.  Please note that (a) the views will not automatically track schema changes, but that shouldn't be an issue since those are legacy tables, and (b) the data in those tables is explicitly maintained by the community and may fall out of sync with reality unless rigorously maintained - caveat ūtilisor.

Issues with the data should be reported to the toolserverdb tool maintainers (https://tools.wmflabs.org/?list#toollist-toolserverdb).

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links