Last modified: 2012-06-12 19:13:54 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T39291, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 37291 - updateArticleCount.php script is broken
Status: RESOLVED INVALID
Product: MediaWiki
Classification: Unclassified
Component: Maintenance scripts
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://meta.wikimedia.org/wiki/User:D...
Keywords: analytics
Depends on:
Blocks:

Reported: 2012-06-02 04:39 UTC by Donald Lancon
Modified: 2012-06-12 19:13 UTC

Description Donald Lancon 2012-06-02 04:39:47 UTC
In short: The updateArticleCount.php script is not counting articles correctly.

The evidence:

See the table I'm still filling out at [[m:User:Dcljr/Article counts]], which collects (way too many) statistics based on the official database dumps. (In particular, see the columns highlighted in pink, which show how far off the "on-wiki" article counts were from the actual dump-based article counts, both before and after the script was run.)

The longer version:

Ever since the resolution of bug 33253, which led to several wikis "losing" or "gaining" huge numbers of articles (according to their {{NUMBEROFARTICLES}} count), I've suspected very strongly that the updateArticleCount.php script is not counting articles correctly. Now I have firm evidence.

I wrote a Perl script to download and parse the relevant dumps from <dumps.wikimedia.org>, thereby counting articles "from scratch" based on the current "non-redirect with at least one wikilink" criterion (as well as some more and less generous criteria that I'm trying out for comparison). The results are being collected at the Meta page above.

I've started with the Wiktionaries whose article counts dropped the most (in terms of percentage), so the table is currently showing huge undercounts. I originally suspected that the wikis whose article counts gained the most would show significant overcounts, but the handful of checks I've made of such wikis (which haven't been added to the table yet) haven't shown this to be the case.

We Shall See...

Punchline: Someone needs to check the updateArticleCount.php script to see why it's undercounting articles.
Comment 1 Donald Lancon 2012-06-02 04:49:56 UTC
BTW, I should point out that the undercounting cannot be because it's not considering all the "content" namespaces, because all Wiktionaries use only ns0 for content.
Comment 2 Donald Lancon 2012-06-07 08:33:21 UTC
I see that the updateArticleCount.php script itself does very little. Instead, it relies on other code to actually count the articles. I followed the dependencies for a while, but eventually gave up before I found the actual code that does the counting. Someone more familiar with MW code will have to say where the problem lies...
Comment 3 Platonides 2012-06-11 20:33:30 UTC
Is your script available somewhere?

Maybe you could point out a small wiki with the count of your script, for comparing with the numbers provided by updateArticleCount for that wiki?
Comment 4 Nemo 2012-06-11 21:30:22 UTC
(In reply to comment #3)
> Is your script available somewhere?

aka, what definition of "good" article are you using to say that the count is not "correct"?
Comment 5 Platonides 2012-06-11 22:04:26 UTC
See above: «based on the current "non-redirect with at least one wikilink" criteria»
Comment 6 Donald Lancon 2012-06-11 22:22:28 UTC
If you haven't already, please see the Meta page I pointed to in my initial post: <http://meta.wikimedia.org/wiki/User:Dcljr/Article_counts>

That gives all the information I think someone would need to independently check my counts....  (In fact, it might be a good idea for someone to try to count the articles themselves without seeing my code first.  My script is currently not available anywhere, but I can put it up at Meta if it's really necessary.)

I've posted stats for 12 Wiktionaries and 15 Wikisources so far (each for a date before and a date after the running of the maintenance script).  Take your pick for which one(s) you want to check.

The exact definition I'm using for a "good" article is: "non-redirect in a content namespace with any kind of internal-style [[wikilink]]: page, category, image/file, interlanguage, or interwiki".  AFAIK, that's the definition currently in use.  Which pages contain each of these types of links is gleaned from the respective database dumps.  For details, see the Meta page.  For the Wiktionaries, I also show counts using three other sets of criteria (also explained at the Meta page).
Comment 7 Donald Lancon 2012-06-12 03:22:13 UTC
Me: "AFAIK, that's the definition currently in use."

This only applies to the "link" article-count method, of course -- which all Wikisources and Wiktionaries are currently using, according to <http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php> (search for "wgArticleCountMethod").
Comment 8 Nemo 2012-06-12 07:11:26 UTC
(In reply to comment #6)
> The exact definition I'm using for a "good" article is: "non-redirect in a
> content namespace with any kind of internal-style [[wikilink]]: page, category,
> image/file, interlanguage, or interwiki".  AFAIK, that's the definition
> currently in use.

I wouldn't be so sure. As I already told you, interwikis and/or category links may not be counted now (which makes sense, especially for interwikis).
Comment 9 Donald Lancon 2012-06-12 16:02:01 UTC
OK... so I finally looked at r88113, which is apparently where all of this changed radically.

Let's be very precise here.  There are many different kinds of wikilinks (a fact that has contributed greatly to confusion over this issue):

1. page: e.g., [[link]] or [[Special:Statistics]]
2. category: [[Category:English]]
3. image/file: [[File:Yes.png]]
4. interlanguage: [[:de:]] or [[de:]]
5. interwiki: [[species:]]
6. hidden: <!-- [[don't look at me]] -->
7. deactivated: <nowiki>[[look at me]]</nowiki>
8-14. template-provided versions of, respectively, 1-7

Before r88113, 1-7 (in fact, _any_ instance of "[[") were all counted, but not 8-14.  Afterwards, 1 and 8 are counted and no others.  (Even though I can't check 8-14 with my script, checking for only type 1 links gave counts that matched {{NUMBEROFARTICLES}} on four wikis I tried it on.  So there ya go.)
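
As a rough illustration of the two rules just described, here is a minimal Python sketch that works on raw wikitext only. It is not the MediaWiki code and not the Perl script mentioned in this report. The prefix sets are hypothetical placeholders (a real wiki has its own namespace names and interwiki table), and because templates are not expanded, types 8-14 cannot be detected this way.

```python
import re

# Hypothetical prefix sets for illustration only; a real wiki defines its own
# namespace names and interwiki/interlanguage prefixes.
NAMESPACE_PREFIXES = {"category", "file", "image"}
INTERWIKI_PREFIXES = {"de", "species"}

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
NOWIKI_RE = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)
# Captures the link target of [[target]] or [[target|label]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def has_any_link_marker(wikitext: str) -> bool:
    """Old rule as described above: any occurrence of "[[" counts."""
    return "[[" in wikitext


def has_page_link(wikitext: str) -> bool:
    """Approximate the type 1 check on raw wikitext (no template expansion)."""
    # Drop hidden (type 6) and deactivated (type 7) links first.
    visible = NOWIKI_RE.sub("", COMMENT_RE.sub("", wikitext))
    for match in LINK_RE.finditer(visible):
        target = match.group(1).strip()
        escaped = target.startswith(":")  # e.g. [[:Category:English]]
        name = target.lstrip(":")
        prefix = name.split(":", 1)[0].strip().lower() if ":" in name else ""
        if prefix in INTERWIKI_PREFIXES:
            continue  # types 4 and 5
        if prefix in NAMESPACE_PREFIXES and not escaped:
            continue  # types 2 and 3
        return True  # type 1: an ordinary page link
    return False
```

Under these assumptions, a page containing only [[Category:English]] passes the old rule but fails the new one, which is the direction of the undercounts reported here.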

Unfortunately, this means you can't tell anymore just from the raw page source whether a page will be an article or not (I mean, say, if it has a template on it but no page links); it must be parsed first.

Seems to me, this amounts to a fundamental change in the way articles are counted (the changes in article counts that have resulted are proof enough of this) that was only ever discussed beforehand by a handful of people in bug 11868 -- and nobody there seemed to actually be discussing _this_ particular counting method!  (Brion, for example, stated that the new method would "overcount" articles, which is the opposite of what has happened!)

IOW, this "new" state of affairs (which, although over a year old at this point, has not yet propagated to projects beyond Wikisource and Wiktionary, because updateArticleCount.php hasn't been run on them) was not arrived at through any real consensus process.  In fact, Nemo_bis, I see that's essentially what you said just 3 weeks before the changes were committed by IAlex <https://bugzilla.wikimedia.org/show_bug.cgi?id=24754#c1>.

So, anyway... I guess this bug is finished, and I need to start a (now more informed) discussion about this on Meta....
Comment 10 Platonides 2012-06-12 16:08:59 UTC
(In reply to comment #9)
> OK... so I finally looked at r88113, which is apparently where all of this
> changed radically.

Wow, I wasn't aware of that.



> Unfortunately, this means you can't tell anymore just from the raw page source
> whether a page will be an article or not (I mean, say, if it has a template on
> it but no page links); it must be parsed first.

For the verification purposes discussed here, you can use pagelinks.sql.gz though.
Comment 11 Donald Lancon 2012-06-12 19:13:54 UTC
Yeah, I just realized that! [g]

For some reason I was thinking that the way my script was doing it would miss links provided by templates, but of course that's not true: what my script does is _exactly_ what the MW code itself does when not triggered by a page edit: it checks pagelinks.sql for the existence of links originating from the page in question!

I don't know what I was thinking....
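
For anyone wanting to repeat the verification discussed in comments 10 and 11, the dump-based check might look roughly like the Python sketch below. It assumes the rows have already been extracted from the page and pagelinks dumps into plain tuples and integers (the parsing step is not shown), and the function name and default namespace set are made up for illustration. The field names in the comments follow the MediaWiki page and pagelinks tables.

```python
from typing import Collection, Iterable, Tuple

# (page_id, page_namespace, page_is_redirect) taken from the page table dump.
PageRow = Tuple[int, int, int]


def count_link_articles(
    pages: Iterable[PageRow],
    pagelink_sources: Iterable[int],
    content_namespaces: Collection[int] = frozenset({0}),
) -> int:
    """Count non-redirect content pages with at least one outgoing page link.

    pagelink_sources holds the pl_from value of every pagelinks row, so a
    page counts as soon as its page_id appears there at least once.
    """
    sources = set(pagelink_sources)
    return sum(
        1
        for page_id, namespace, is_redirect in pages
        if namespace in content_namespaces
        and not is_redirect
        and page_id in sources
    )


# Tiny made-up data set: only page 1 qualifies.
sample_pages = [(1, 0, 0), (2, 0, 1), (3, 0, 0), (4, 4, 0)]
sample_pl_from = [1, 1, 2, 4]
sample_count = count_link_articles(sample_pages, sample_pl_from)
```

Since pagelinks is filled in after parsing, links supplied by templates are included without the checker having to expand anything itself.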
