Last modified: 2012-06-11 18:06:24 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T35253, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 33253 - Run updateArticleCount.php on all Wikisources and Wiktionaries
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: unspecified
Hardware: All All
Importance: Normal major with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
: analytics, shell
Depends on: 26033
Blocks: 29782 34184
Reported: 2011-12-19 16:53 UTC by Pyb
Modified: 2012-06-11 18:06 UTC (History)

Description Pyb 2011-12-19 16:53:58 UTC
1) Wikisource has a lot of articles without any internal links. The article count should include pages without internal links.

2) Some namespaces are included in the article count (102 = author, 104 = page, 106 = index), but the numbering is not identical across wikis.

For example:
On pl: 100=page, 102=index, 104=author
On it: 102=author, 108=page, 110=index
On fr: 102=author, 104=page, 112=index
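
The mismatch in point 2 means that any tool hard-coding 102/104/106 would pick the wrong namespaces on other wikis. A minimal sketch in Python of looking the numbers up by role instead, using only the three examples listed above; the names `NAMESPACE_ROLES` and `namespace_ids` are hypothetical and not part of MediaWiki or Wikistats:

```python
# Namespace numbers by role, copied from the three examples above.
# NAMESPACE_ROLES and namespace_ids() are illustrative names only.
NAMESPACE_ROLES = {
    "pl": {"page": 100, "index": 102, "author": 104},
    "it": {"author": 102, "page": 108, "index": 110},
    "fr": {"author": 102, "page": 104, "index": 112},
}


def namespace_ids(wiki, roles=("author", "page", "index")):
    """Return the sorted namespace numbers used for the given roles on one wiki."""
    mapping = NAMESPACE_ROLES[wiki]
    return sorted(mapping[role] for role in roles)


if __name__ == "__main__":
    for wiki in NAMESPACE_ROLES:
        print(wiki, namespace_ids(wiki))
```

Keying on the role rather than the number keeps the per-wiki differences in one table instead of scattering them through the counting logic.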
Comment 1 Nemo 2012-01-15 09:10:41 UTC
(In reply to comment #0)
> 1) Wikisource has a lot of articles without any internal links. The article count
> should include pages without internal links.

I thought this had been fixed by bug 24754 / bug 11868.
I don't think Wikisource needs the new $wgArticleCountMethod created after bug 26033, does it? And in any case it should be set to "comma", not "any", except perhaps for some languages which don't use commas.
Cf. bug 27256. I can't find a bug for Wiktionary; perhaps the actual configuration has not been requested yet or wasn't actually needed?

> 2) Some namespaces are included in the article count (102 = author, 104 = page,
> 106 = index), but the numbering is not identical across wikis.
> 
> For example:
> On pl: 100=page, 102=index, 104=author
> On it: 102=author, 108=page, 110=index
> On fr: 102=author, 104=page, 112=index

This is bug 29172: reopen if some namespaces are missing on some wikis.

After the configuration has been set or confirmed to be set correctly, the really missing piece here is running updateArticleCount.php on all Wikisources, which already have utterly broken counts because of the ContentNamespaces change, mass deletions and so on.
Also Wiktionary (after r88113) and en/pt.books need it I suppose.
Comment 2 Pyb 2012-01-15 10:54:20 UTC
(In reply to comment #1)
Thanks for your comment. I didn't understand everything and don't know where the problem comes from.

But I still believe that the figures on http://stats.wikimedia.org and [[Special:Statistics]] for Wikisource are misleading.

E.g. the size of the French Wikisource database is decreasing!

There is also a problem with the number of active editors, article count...

The following script may not take into account the specificities of Wikisource (transclusion, namespaces):
http://svn.wikimedia.org/viewvc/mediawiki/trunk/wikistats/dumps/WikiCountsInput.pm?view=markup
Comment 3 Nemo 2012-01-15 11:00:43 UTC
(In reply to comment #2)
> But I still believe that the figures on http://stats.wikimedia.org and
> [[Special:Statistics]] for Wikisource are misleading.

What's misleading and why on Special:Statistics?

> E.g. the size of the French Wikisource database is decreasing!
> 
> There is also a problem with the number of active editors, article count...
> 
> The following script may not take into account the specificities of
> Wikisource (transclusion, namespaces):
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/wikistats/dumps/WikiCountsInput.pm?view=markup

Please open another bug for WikiStats. Erik Zachte has already worked on this and fixed some parts of it: I think Commons is ok, while he's not updated the script to use ContentNamespaces rather than namespace 0 in all cases, AFAIK. I think he has other priorities now, but a bug report might be useful.
Comment 4 Nemo 2012-01-15 11:11:44 UTC
In the meantime, because no change to the article count method seems to be needed or requested, I'm changing the summary to make this a shell request for a maintenance script run.
Severity set to "major" because the article count is an important feature and Special:Statistics is a heavily used page, currently completely incorrect on those wikis after the count method changes.
Comment 5 Sam Reed (reedy) 2012-01-16 16:50:25 UTC
Isn't this a WONTFIX?
Comment 6 Nemo 2012-01-16 16:56:29 UTC
(In reply to comment #5)
> Isn't this a WONTFIX?

Why should it be?! The script updates the count to reflect the real value according to current rules.
Comment 7 Platonides 2012-01-27 20:53:22 UTC
This needs to wait until deployment of 1.19, which is where r88113 added $wgArticleCountMethod
Comment 8 Donald Lancon 2012-04-16 05:03:54 UTC
(In reply to comment #7)
> This needs to wait until deployment of 1.19, which is where r88113 added
> $wgArticleCountMethod

1.19 has been deployed for a while now. What's the status of this bug? I'm assuming the script has not been run on everything, since Veps Wikipedia (for example) is still reporting the wrong article count.
Comment 9 Donald Lancon 2012-05-10 05:30:46 UTC
And now we're on MW1.20wmf2. Is this still waiting for anything in particular (apart from someone to act on it)?

[P.S. - Why was the URL set to "analytics"?]
Comment 10 Sam Reed (reedy) 2012-05-10 07:00:39 UTC
Done
Comment 11 Donald Lancon 2012-05-11 02:09:01 UTC
Be careful what you wish for! Thanks for running the script on these wikis, but it has resulted in some *huge* changes in article counts, and some are very questionable. For example, the Nepali Wiktionary has "lost" 98% of its entries, falling from 4,821 down to 73. How is this Wiktionary counting articles? Link or comma method? (It clearly isn't using "any".)

And while most of the Wikisources have grown, as one would expect since more namespaces are now considered "content", the Ukrainian Wikisource has lost 57% of its text units, dropping from 4,563 down to 1,947. How could it have lost so many content pages?

More generally: I know how to check what namespaces count as content (API namespaces query), but how does one find out what article-count method a wiki is using?

Reopening this bug until this issue is cleared up. (I've only checked a few of the updated article counts so far, but I will check more and see if there's a systematic problem with certain types of languages, or what...)
Comment 12 Sam Reed (reedy) 2012-05-11 02:15:07 UTC
I'd suspect quite a few haven't had their configuration updated as they should have:

/**
 * Method used to determine if a page in a content namespace should be counted
 * as a valid article.
 *
 * Redirect pages will never be counted as valid articles.
 *
 * This variable can have the following values:
 * - 'any': all pages are considered valid articles
 * - 'comma': the page must contain a comma to be considered valid
 * - 'link': the page must contain a [[wiki link]] to be considered valid
 * - null: the value will be set at run time depending on $wgUseCommaCount:
 *         if $wgUseCommaCount is false, it will be 'link', if it is true
 *         it will be 'comma'
 *
 * See also http://www.mediawiki.org/wiki/Manual:Article_count
 *
 * Retroactively changing this variable will not affect the existing count,
 * to update it, you will need to run the maintenance/updateArticleCount.php
 * script.
 */
$wgArticleCountMethod = null;
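
As a rough illustration of the rules in that doc comment, here is a simplified sketch in Python. It is not MediaWiki's implementation: the function names are hypothetical, the 'link' check is a plain substring test for `[[`, and redirect detection is left to the caller.

```python
def resolve_method(article_count_method, use_comma_count):
    """Mirror the documented null fallback: 'comma' if $wgUseCommaCount is true, else 'link'."""
    if article_count_method is None:
        return "comma" if use_comma_count else "link"
    return article_count_method


def is_countable(text, is_redirect, method):
    """Approximate the documented rule for a page in a content namespace."""
    if is_redirect:
        return False  # redirects are never counted
    if method == "any":
        return True
    if method == "comma":
        return "," in text
    if method == "link":
        # Crude stand-in for "contains a [[wiki link]]" (assumption).
        return "[[" in text
    raise ValueError(f"unknown article count method: {method!r}")


if __name__ == "__main__":
    method = resolve_method(None, use_comma_count=False)
    print(method, is_countable("A page with a [[link]].", False, method))
```

Under this reading, a wiki left on the null default ends up with 'link', so pages without any wiki link would drop out of the count once the script is run.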
Comment 13 Donald Lancon 2012-05-11 05:01:53 UTC
So what would you recommend as far as finding out which wikis still/now need fixing? Since I'm just a "regular user", the best I can do is compare the on-wiki (/API) article counts before and after the running of the updateArticleCount.php script (I collect these numbers daily) with the official article counts listed at Wikistats (stats.wikimedia.org), once those are posted (in a few weeks). Unfortunately, the Nepali Wiktionary is one of the wikis that are not tracked (for whatever reason) at Wikistats (meaning there are no "official" article counts for it, only what the wiki itself reports). In any case, because of the sheer number of projects involved, I haven't actually made any such comparisons yet (I have been collecting some relevant data over the past few weeks, though).

Would it be possible to post the values of $wgArticleCountMethod and $wgUseCommaCount for every wiki? I know it's a lot to ask, but I assume there's a "quick" way of doing this on the command line...?
Comment 14 Nemo 2012-05-11 05:21:45 UTC
All wikis are still using the default link method except pt.books and en.books, which filed a request; see above.
Entries on ne.wiktionary seem to have no links, categories or templates at all, so they wouldn't be counted with the normal method either. The only difference may be that interwiki links also used to be counted, but I'd consider that a bug. In any case, please open a new bug to get the method fixed/changed; this bug is indeed fixed.

(In reply to comment #13)
> Would it be possible to post the values of $wgArticleCountMethod and
> $wgUseCommaCount for every wiki? I know it's a lot to ask, but I assume there's
> a "quick" way of doing this on the command line...?

'wgArticleCountMethod' => array(
	'default' => 'link',
	'enwikibooks' => 'comma',
	'ptwikibooks' => 'comma',
),

http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
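
To read that excerpt: a wiki gets its own entry if one exists, otherwise 'default'. A minimal sketch in Python of that lookup, assuming only exact database names and the 'default' key matter; the real configuration loader may support further override levels that this ignores, and `setting_for` is a hypothetical helper:

```python
# Values copied from the InitialiseSettings.php excerpt above.
ARTICLE_COUNT_METHOD = {
    "default": "link",
    "enwikibooks": "comma",
    "ptwikibooks": "comma",
}


def setting_for(dbname, setting=ARTICLE_COUNT_METHOD):
    """Return the per-wiki value, falling back to the 'default' entry."""
    return setting.get(dbname, setting["default"])


if __name__ == "__main__":
    for dbname in ("enwikibooks", "newiktionary", "ukwikisource"):
        print(dbname, setting_for(dbname))
```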
Comment 15 Donald Lancon 2012-05-11 07:44:41 UTC
That is a very helpful file! Thanks.

Actually, from what I've seen, the vast majority of ne.wiktionary entries do have categories. Same for uk.wikisource. Are you sure those would count as "links"? I don't remember...

In fact, based on a census of Special:AllPages (main namespace, hiding redirects) and a sample of 30 "Special:Random" pages (in main ns) checked for the presence of at least one link (assuming Category: links count) on each wiki:

* ne.wiktionary = 28/30 * 4,937 = 4,608 estimated article count
* uk.wikisource = 28/30 * 4,757 = 4,440 estimated article count

(Both wikis count only the main namespace as "content".)

Those estimates are really close to the respective counts before the update script was run: 4,821 and 4,563.
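
The arithmetic behind those two estimates, as a small Python sketch (the function names are mine; the inputs are the figures quoted above):

```python
def estimated_articles(sample_hits, sample_size, total_pages):
    """Scale the share of sampled pages with at least one link up to all pages."""
    return round(sample_hits / sample_size * total_pages)


def relative_gap(estimate, reported):
    """Signed difference between the estimate and a reported count, as a fraction."""
    return (estimate - reported) / reported


if __name__ == "__main__":
    # (wiki, pages with a link in the sample, sample size, non-redirect pages, count before the run)
    rows = [
        ("ne.wiktionary", 28, 30, 4937, 4821),
        ("uk.wikisource", 28, 30, 4757, 4563),
    ]
    for wiki, hits, size, total, before in rows:
        est = estimated_articles(hits, size, total)
        print(f"{wiki}: estimate {est}, gap vs. old count {relative_gap(est, before):+.1%}")
```

With only 30 pages sampled per wiki, these are rough figures rather than exact counts.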

So... does this mean Category: links _used_ to count as links but don't anymore?

If so, this is going to affect a great many wikis. (And already has: 13 Wikisources and 24 Wiktionaries dropped below their latest significant article count milestone [in the sense of those tracked at m:Wikimedia_News] in the last 24 hours -- typically only a few wikis fall below milestones every _month_, across _all_ WMF projects.)

So, does anyone know if this has been discussed on-wiki anywhere, or on a mailing list?
Comment 16 Donald Lancon 2012-05-11 08:31:38 UTC
Well, damn, there it is right there at [[mw:Manual:Article count]]: "...will be counted as an article in the statistics and the {{NUMBEROFARTICLES}} variable... if it contains at least one wiki link... or is categorized to at least one category."

So, this used to be the behavior, at least. Has it changed?
Comment 17 Erik Zachte 2012-05-16 13:12:25 UTC
(In reply to comment #11)

> More generally: I know how to check what namespaces count as content (API
> namespaces query), but how does one find out what article-count method a wiki
> is using?

Query does list namespaces which are in use, but not whether these count as content. Am I missing something?

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces

If such a query exists, or such an attribute could be added, wikistats could use it to get that part of article counting up to date.
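
For reference, the same query URL can be built for any wiki host. A minimal Python sketch, assuming the `/w/api.php` path shown above applies to the host in question; `siteinfo_namespaces_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode


def siteinfo_namespaces_url(host):
    """Build the siteinfo/namespaces query URL for one wiki host."""
    params = {"action": "query", "meta": "siteinfo", "siprop": "namespaces"}
    return f"http://{host}/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    for host in ("en.wikipedia.org", "de.wikisource.org"):
        print(siteinfo_namespaces_url(host))
```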
Comment 18 Donald Lancon 2012-05-17 00:13:08 UTC
(In reply to comment #17)
> Query does list namespaces which are in use, but not whether these count as
> content. Am I missing something?
> 
> http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces

In the results of that query:

 <ns id="0" case="first-letter" content="" xml:space="preserve" />

The string 'content=""' indicates that this namespace counts as content. Here's another example from <http://de.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces>:

 ...
 <ns id="0" case="first-letter" subpages="" content="" xml:space="preserve" />
 ...
 <ns id="102" case="first-letter" canonical="Seite" content="" xml:space="preserve">Seite</ns>
 ...
 <ns id="104" case="first-letter" canonical="Index" content="" xml:space="preserve">Index</ns>
 ...
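
A minimal Python sketch of pulling the content namespaces out of such a response, assuming (as above) that the presence of a `content` attribute is what marks a namespace as content. The sample reuses the three de.wikisource entries shown above; the wrapper element and the Talk entry are added only for illustration and are not copied from a real response.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<namespaces>
 <ns id="0" case="first-letter" subpages="" content="" xml:space="preserve" />
 <ns id="1" case="first-letter" canonical="Talk" xml:space="preserve">Diskussion</ns>
 <ns id="102" case="first-letter" canonical="Seite" content="" xml:space="preserve">Seite</ns>
 <ns id="104" case="first-letter" canonical="Index" content="" xml:space="preserve">Index</ns>
</namespaces>"""


def content_namespace_ids(xml_text):
    """Return the ids of all <ns> elements that carry a content attribute."""
    root = ET.fromstring(xml_text)
    return sorted(
        int(ns.get("id")) for ns in root.iter("ns") if "content" in ns.attrib
    )


if __name__ == "__main__":
    print(content_namespace_ids(SAMPLE))
```

Testing for the attribute's presence rather than its value matters here, because the value is an empty string.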
Comment 19 Donald Lancon 2012-05-17 00:56:34 UTC
I should point out that I'm really only _assuming_ this is true (about the 'content=""' string), since it seems to match (more or less) what I've been told about what namespaces count as content on what projects.

Note, however, that there is quite a bit of variation in this. For example, when you, Erik, told me at [[m:Talk:Wikimedia News#Using Wikipedia Statistics to fill in gaps]] that "102 = Author, 104 = Page, 106 = Index" count as content on Wikisource, that's true of the English Wikisource, but not necessarily the others. Not all Wikisources even use the same namespace numbers for the same purposes: in the Estonian Wikisource, for example, 102 = Page, 104 = Index, and 106 = Author (and these are all marked as "content" in the API query results); and in the Turkish Wikisource, 100 = Author, and that's the only namespace other than main (ns0) marked as "content".

So does this mean not even Wikistats is counting the articles correctly?? [g]

As part of my investigation into the large shifts in "on-wiki" article counts alluded to above, I've started to fill in a large table at [[m:Talk:Wikimedia News#May 10 article count updates]] with some relevant info, including what namespaces are marked as "content" in the API results, how many non-redirect pages are (or appear to be, approximately) in each, and an estimate of what percentage of these should count as "articles" by the "at least one link" standard (plus a lot of other stuff -- note, BTW, that the table only contains wikis that passed or dropped below article-count "milestones").

I'm also in the process of downloading all the relevant database dumps that should allow me to calculate "exactly" many of the numbers in that table that are currently only estimates (in essence duplicating what I assume your script[s] do, Erik, but for dumps made just before and just after May 10th, not only at the end of the month).
Comment 20 Lars Aronsson 2012-05-17 01:18:26 UTC
At http://meta.wikimedia.org/wiki/Wikisource
the Norwegian (no.) Wikisource is listed with 4,145 "good" pages,
which should be ten times larger if the "Side:" (Page) namespace was counted.
It should only be slightly smaller than the Swedish (sv.) Wikisource,
which has 46,815 "good" pages in the same table.
Comment 21 Donald Lancon 2012-05-17 04:51:57 UTC
OTOH, notice that the "official" article count for s:no:, as of Mar 31, 2012, is only 2,392. <http://stats.wikimedia.org/wikisource/EN/Sitemap.htm>

The more I look into this, the more convinced I become that, unfortunately, *most* of the article counts, both on-wiki and based on dumps, are actually wrong by significant amounts.... but I can't be sure at this point exactly how widespread the problem is. When I get a clearer picture, I'll open a different bug about it.
Comment 22 Erik Zachte 2012-05-17 22:18:42 UTC
Here is a list of 'content' namespaces collected via the API.
If this looks sensible I can use it from now on for wikistats.

http://stats.wikimedia.org/wikimedia/misc/StatisticsContentNamespaces.csv

BTW commons does not list 6 or 14.
Comment 23 Erik Zachte 2012-05-17 22:47:56 UTC
Ah, the list of content namespaces is already available via

http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php

section 'wgContentNamespaces' => array
Comment 24 Erik Zachte 2012-05-17 22:58:14 UTC
Ahem I overlooked comment 14, this php file was already mentioned.

http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php

So how about the Wikisource and Wiktionary wikis, aren't those set to wgArticleCountMethod 'any'?
Comment 25 Sam Reed (reedy) 2012-05-17 23:01:09 UTC
reedy@fenari:~$ mwscript eval.php enwikisource
> print $wgArticleCountMethod
link
>
reedy@fenari:~$ mwscript eval.php enwiktionary
> print $wgArticleCountMethod
link
Comment 26 Donald Lancon 2012-06-11 18:06:24 UTC
Oops. Forgot to mention here that I've opened bug 37291 about updateArticleCount.php (or whatever code actually counts the articles) not counting correctly.
