Last modified: 2013-04-08 11:01:51 UTC

Bug 11868 - Use transclusions to count articles as well
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Low enhancement with 3 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch
Duplicates: 12566
Depends on:
Blocks:
Reported: 2007-11-04 10:59 UTC by Dominic
Modified: 2013-04-08 11:01 UTC (History)
CC: 13 users



Attachments
Patch to article.php to fix problem (using diff command) (191 bytes, patch)
2007-11-12 08:13 UTC, Matt
DefaultSettings.php patch, to coincide with Article.php patch above. (472 bytes, patch)
2007-11-12 08:14 UTC, Matt

Description Dominic 2007-11-04 10:59:46 UTC
Currently, the article count used to generate {{NUMBEROFARTICLES}} and Special:Statistics only counts a page as an article if it includes a [[wikilink]]. Instead, this should be expanded to include {{transclusions}} as well as wikilinks. The issue here is that non-Wikipedia projects, like Wiktionary, do often have valid articles without wikilinks, because the wikilinks are contained in the templates that generate the article. As many as one fifth of valid Wiktionary articles may be inflection articles (plurals, verb form, etc.) and are mostly just a template. We've had to input wikilinks in the template, like {{plural of|[[word]]}}, but this is inefficient and also prevents us from passing that parameter to any other parts of the template, like a category.

There may be a downside, but I can't think of one, especially now that preventing page creations is done with cascade protection instead of protected templates.
Comment 1 Roan Kattouw 2007-11-04 11:01:56 UTC
What about pages that transclude {{stub}} then?
Comment 2 Rob Church 2007-11-04 13:50:11 UTC
(In reply to comment #1)
> What about pages that transclude {{stub}} then?

As far as we're concerned, those are still articles - such pages would usually contain at least one link anyway, which would put them on the counter.
Comment 3 Aaron Schulz 2007-11-09 05:45:41 UTC
This seems to have an efficiency barrier. We'd need another site stat to track this. That, or change the current one (with a retroactive batch query run too).
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-11-11 01:28:04 UTC
The request is to change the existing metric.  Would be easy enough to do.  The batch query would only have to examine the "bad" articles, too, which is probably a good deal fewer than all of them.
Comment 5 Matt 2007-11-12 08:13:44 UTC
Created attachment 4329
Patch to article.php to fix problem (using diff command)

I have updated the isCountable method of Article to take templates into account if $wgCountTemplateOnlyPages is set to true. As well as getting this committed, I would like that setting turned on for the English Wiktionary (and hopefully the rest of the Wiktionaries). Lastly, someone with shell access needs to run "maintenance/recount.sql" (or something like that) on the Wiktionary versions it's enabled on.

Patch to DefaultSettings coming in a second.
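
As a rough illustration of the option-gated check described in this comment, here is a Python sketch. It is not the attached patch, whose contents are not reproduced in this thread, and MediaWiki's real code is PHP. `is_countable` and `count_template_only_pages` are hypothetical stand-ins for Article::isCountable and $wgCountTemplateOnlyPages; the test for '{{' is assumed from the description of the patch later in the thread.

```python
def is_countable(text: str, count_template_only_pages: bool = False) -> bool:
    """Decide whether page text qualifies the page for the article count.

    Hypothetical stand-in for the check discussed in this bug:
    a wikilink always qualifies; a transclusion qualifies only
    when the (assumed) configuration flag is switched on.
    """
    if "[[" in text:
        return True
    return count_template_only_pages and "{{" in text


# A template-only inflection entry, as described in the bug report.
entry = "{{plural of|word}}"
print(is_countable(entry))                                  # False
print(is_countable(entry, count_template_only_pages=True))  # True
```

With the flag left at its default, behaviour is unchanged, which is the property Tim Starling asks for in comment 8.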
Comment 6 Matt 2007-11-12 08:14:17 UTC
Created attachment 4330
DefaultSettings.php patch, to coincide with Article.php patch above.
Comment 7 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-11-14 01:40:30 UTC
1) There's no reason to have a config option for this.  If a page contains a template link, it should logically be counted, especially given that the template may itself include links.  The behavior should be on automatically for all wikis.

2) Please use the command "svn diff" to generate diff files.  If you really don't want to check out the SVN repository, at *least* use diff -u, and concatenate the two diffs into one text file for easier reading (indicating within the single diff file which part of the diff corresponds to which file).

3) I'm not willing to check this in without comment from Brion or Tim about how to proceed with recounting.  Given (1) above, a recount needs to be done on update, but of course not every single time an update is done, once suffices.  Maybe we should have a database flag for schema versions, as a general thing?  This kind of issue has come up before, and we have no good answer for it.

If (3) is satisfied I'm willing to check in a one-line patch based on the given attachment.  Have you tested it?
Comment 8 Tim Starling 2007-11-14 02:55:03 UTC
I think it should have the config option, to avoid an unnecessary step change in the non-Wiktionary counters. I've generally been against recounts on the large wikis, where years of counter drift have taken their toll, because of the psychological importance of the article count and its continuous growth. 

The suggested patch is fine, except that it also needs a patch to maintenance/updateArticleCount.inc.php. 
Comment 9 Danny B. 2007-11-14 03:05:34 UTC
(In reply to comment #0)
> Currently, the article count used to generate {{NUMBEROFARTICLES}} and
> Special:Statistics only counts a page as an article if it includes a
> [[wikilink]].

Not exactly. Per related bug 10834 comment #5 and live tests, the current good-article counter uses the following method:

1. page is in ns 0
2. page is not redirect
3. page contains "[[" string

Step 3 means that pages with no wikilink but with an image or category inserted, or even just <!-- [[ -->, are counted.
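
The three checks above can be sketched in Python as follows. This is only an illustration of the described behaviour, not MediaWiki's PHP code; the `Page` class and `is_counted` function are hypothetical names.

```python
from dataclasses import dataclass

MAIN_NAMESPACE = 0


@dataclass
class Page:
    namespace: int
    is_redirect: bool
    text: str


def is_counted(page: Page) -> bool:
    """Apply the three checks listed above, in order."""
    return (
        page.namespace == MAIN_NAMESPACE
        and not page.is_redirect
        and "[[" in page.text
    )


template_only = Page(0, False, "{{plural of|word}}")
commented_link = Page(0, False, "Some text <!-- [[ -->")
category_only = Page(0, False, "Some text [[Category:Nouns]]")

print(is_counted(template_only))   # False: no "[[" in the raw text
print(is_counted(commented_link))  # True: the string check ignores comments
print(is_counted(category_only))   # True: a category tag contains "[["
```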
Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-11-14 03:13:52 UTC
Tim points out that the recount script already counts template inclusions.  It would probably make the most sense to make Article.php use the updateArticleCount method (parsing and checking the resulting links) rather than adding the extra check for '{{'.
Comment 11 Matt 2007-11-14 03:45:58 UTC
In that case, why was one updated and not the other? I also believe there should be a config option, as sites like Wikipedia don't want pages that only include {{deletedpage}} or such to be counted.

Also, I don't have any command line svn utility, as I use a graphical system for SVN (hooked into right-click menu). So the diff command was my only choice. I'll use -u next time.
Comment 12 Dominic 2007-11-14 08:19:50 UTC
Note that the {{deletedpage}} practice is obsolete ever since cascade protection, and, at least on enwp, has been completely converted to the new method (and {{deletedpage}} was itself deprecated following a deletion nomination).
Comment 13 Roan Kattouw 2007-11-14 13:46:20 UTC
(In reply to comment #11)
> Also, I don't have any command line svn utility, as I use a graphical system
> for SVN (hooked into right-click menu). So the diff command was my only choice.
> I'll use -u next time.
TortoiseSVN I presume? Right-click on a file or folder, then click "Create patch".
Comment 14 Matt 2007-11-15 04:33:21 UTC
No, some random program for Mac whose name I can't remember.
Comment 15 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-11-16 18:26:33 UTC
It should still support patch creation somewhere; check the docs.  If not, you can just use the command line for this one thing.  But that's off-topic.
Comment 16 Brion Vibber 2007-12-04 15:42:34 UTC
initStats sets the good article count to the number of pages in defined content namespaces which are not redirects and are greater than 0 bytes in length.

The length check is an approximation due to the difficulty of checking for text contents in a single query like this (text would have to be loaded, uncompressed, and encoding-converted individually for every revision checked).


Then there seems to be *yet another* script which got shoved in there somehow, updateArticleCount, which does the above checks, plus a join against the pagelinks table to list those which have outgoing wiki links.


So we currently have *three different methods* of counting, all different:

1) on every page update: check for text containing '[['

This is the canonical version; updates to the count on edit assume that the existing count was based on this -- the total count is incremented or decremented based on changes in state of this check between previous and new versions.


2) on bulk initStats.php: check for non-empty text

This will overcount, including pages which have text but no links.


3) on bulk updateArticleCount.php: check for non-empty text and outgoing links

This will overcount but not as much, including pages which transclude templates which themselves have links as well as extensions which record links but don't contain '[[' in the actual text.
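
To make the disagreement concrete, here is a small Python sketch that applies the three criteria to the same sample pages. It is a simplification under stated assumptions: the namespace and redirect checks are omitted, `outgoing_links` is a stand-in for rows in the pagelinks table, and all function names are hypothetical rather than MediaWiki's.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    text: str
    # Stand-in for this page's rows in the pagelinks table.
    outgoing_links: list = field(default_factory=list)


def counted_on_save(page: Page) -> bool:
    """Method 1: raw text contains '[['."""
    return "[[" in page.text


def counted_by_init_stats(page: Page) -> bool:
    """Method 2: text is non-empty."""
    return len(page.text) > 0


def counted_by_update_article_count(page: Page) -> bool:
    """Method 3: text is non-empty and outgoing links are recorded."""
    return len(page.text) > 0 and len(page.outgoing_links) > 0


pages = [
    Page("Just prose, no links."),
    Page("{{plural of|word}}", outgoing_links=["word"]),  # link comes from the template
    Page("See [[word]].", outgoing_links=["word"]),
]

totals = {
    "on save": sum(counted_on_save(p) for p in pages),
    "initStats": sum(counted_by_init_stats(p) for p in pages),
    "updateArticleCount": sum(counted_by_update_article_count(p) for p in pages),
}
print(totals)  # {'on save': 1, 'initStats': 3, 'updateArticleCount': 2}
```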


The sanest thing to do might actually be to add a page_is_counted field on the page table and update it at save time. Then bulk updates can be done a lot more sanely, and changes in the counter method won't cause as much weird drift. :P
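
A minimal sketch of the page_is_counted idea, using Python's sqlite3 module and an in-memory database. The table layout is heavily simplified and the `save_page` and `recount` helpers are hypothetical; the '[[' test is used only as a placeholder for whatever counting rule is eventually chosen.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    " page_title TEXT PRIMARY KEY,"
    " page_is_counted INTEGER NOT NULL DEFAULT 0)"
)


def save_page(title: str, text: str) -> None:
    """Store the per-page flag at save time (placeholder rule: text has '[[')."""
    counted = 1 if "[[" in text else 0
    conn.execute(
        "INSERT OR REPLACE INTO page (page_title, page_is_counted) VALUES (?, ?)",
        (title, counted),
    )


def recount() -> int:
    """A bulk recount becomes a single aggregate query over the flag."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(page_is_counted), 0) FROM page"
    ).fetchone()
    return total


save_page("cats", "{{plural of|cat}}")
save_page("cat", "A small [[feline]].")
print(recount())  # 1
```

Because the flag is stored per page, the total can be rebuilt at any time without reloading revision text, and changing the rule only requires recomputing the flag.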

But a good start would be to harmonize them:

* Junk updateArticleCount and merge its check into initStats

This seems like a no-brainer... any reason not to?


* Change the article count updates to be based on link count in parse state rather than text contents

Besides causing extra parses on save (slow), one obvious problem here is that link count from transcluded templates can change over time. A template might contain links at time T and no links at time T+1. Thus refreshes of links could change the state.

So... a check for transcludes as well, maybe?

Bleah.
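
The drift problem can be illustrated with a toy Python model. The data and function names below are hypothetical: `templates` maps a template name to its current wikitext, and `pages` maps a page title to the templates it transcludes.

```python
templates = {"plural of": "Plural form of [[word]]."}
pages = {"cats": ["plural of"]}  # page title -> names of transcluded templates


def has_links(title: str) -> bool:
    """True if any template transcluded by the page currently contains a wikilink."""
    return any("[[" in templates[name] for name in pages[title])


def transcludes_anything(title: str) -> bool:
    """The alternative check: true if the page transcludes any template at all."""
    return len(pages[title]) > 0


def recount() -> int:
    return sum(1 for title in pages if has_links(title))


# Counter maintained incrementally; it is only adjusted when a page is saved.
article_count = recount()

# The template is edited and loses its link. The page "cats" is not saved,
# so the incremental counter is never adjusted.
templates["plural of"] = "Plural form of a word."

print(article_count)                 # 1 (stale)
print(recount())                     # 0 (what a fresh recount would say)
print(transcludes_anything("cats"))  # True, regardless of template content
```

In this model the transclusion check does not depend on what the template contains at any given moment, which is why it avoids the drift.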
Comment 17 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-06 05:30:59 UTC
(In reply to comment #16)
> Besides causing extra parses on save (slow), one obvious problem here is that
> link count from transcluded templates can change over time. A template might
> contain links at time T and no links at time T+1. Thus refreshes of links could
> change the state.
> 
> So... a check for transcludes as well, maybe?

The easy answer to that is yes, check for transcludes as well, as this bug suggests.  ;)  It makes sense anyway.
Comment 18 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-01-09 17:01:19 UTC
*** Bug 12566 has been marked as a duplicate of this bug. ***
Comment 19 Connel MacKenzie 2008-01-27 03:20:27 UTC
Just a comment from en.wiktionary.org: It does not represent community consensus to suggest that anyone beyond a small minority wants transclusions "counted."  Perhaps a separate statistic that shows that, but not the count of "good" entries.  For example, many assumptions have been made based on the so-called "incorrect" behavior.  Entries marked with {{misspelling of}} do not contain any wikilinks, specifically so that they are not counted.
Comment 20 Yair Rand 2010-12-12 22:34:10 UTC
(In reply to comment #19)
> Just a comment from en.wiktionary.org: It does not represent community
> consensus to suggest that anyone beyond a small minority wants transclusions
> "counted."  Perhaps a separate statistic that shows that, but not the count of
> "good" entries.  For example, many assumptions have been made based on the
> so-called "incorrect" behavior.  Entries marked with {{misspelling of}} do not
> contain any wikilinks, specifically so that they are not counted.
Note: This is no longer correct. All mainspace pages on enwiktionary are counted due to a bot adding invisible links to all pages that otherwise don't have links.
Comment 21 Krinkle 2010-12-12 22:52:20 UTC
(In reply to comment #16)
> So we currently have *three different methods* of counting, all different:
> 
> 1) on every page update: check for text containing '[['
> 
> This is the canonical version; updates to the count on edit assume that the
> existing count was based on this -- the total count is incremented or
> decremented based on changes in state of this check between previous and new
> versions.
> 
> 
> 2) on bulk initStats.php: check for non-empty text
> 
> This will overcount, including pages which have text but no links.
> 
> 
> 3) on bulk updateArticleCount.php: check for non-empty text and outgoing links
> 
> This will overcount but not as much, including pages which transclude templates
> which themselves have links as well as extensions which record links but don't
> contain '[[' in the actual text.
> 

What's the status of this as of 2010? Still three methods?
I guess 3) makes the most sense. Then when saving an article it counts outgoing links again (which it needs to do to update pagelinks anyway).

When a template changes, the job queue that updates caches for transclusions (whatlinkshere) etc. within X minutes could be extended to update this count as well.

Anyway, keeping three different methods that are inconsistent with each other seems a bad thing no matter how we look at it.
Comment 22 Alexandre Emsenhuber [IAlex] 2011-05-14 17:15:10 UTC
I'm marking this as FIXED: as of r88113, the check is also based on the presence of links in templates.
