Last modified: 2013-03-25 23:16:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T2547, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 547 - Special page for statistics about specific articles
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Special pages
Version: unspecified
Hardware: All All
Importance: Low enhancement with 5 votes
Target Milestone: ---
Assigned To: Matt Johnston
URL: http://en.wikipedia.org/wiki/Wikipedi...
Depends on:
Blocks:
Reported: 2004-09-21 14:33 UTC by Leah
Modified: 2013-03-25 23:16 UTC (History)
CC List: 7 users



Attachments
Proposed patch (7.62 KB, patch), 2008-09-11 11:29 UTC, Matt Johnston
Proposed patch v2 (9.97 KB, patch), 2008-09-11 11:57 UTC, Matt Johnston
Proposed patch v3 (10.07 KB, patch), 2008-09-12 04:35 UTC, Matt Johnston
Proposed patch v4 (9.72 KB, patch), 2008-09-12 05:12 UTC, Matt Johnston
Proposed patch v5 (10.40 KB, patch), 2008-09-15 11:00 UTC, Matt Johnston

Description Leah 2004-09-21 14:33:33 UTC
A recent discussion on [[en:Wikipedia:Village pump]] suggests the addition of a
special page, possibly [[Special:Statistics]], to automatically determine
interesting statistics about individual articles.  In particular, the following
statistics are desired:

* Total size of article (in kb)
* Total size of text (in kb)
* Total size of readable text (in kb)
* Total size of all images in article (in kb)
* Readable Word Count (excluding headings etc as per definitions of '''article
count - alternative''')
* Amount of internal links
* Amount of external links

: As well as up-to-date information on the page which shows what the average
Wikipedia article is like in comparison to the article, plus additional
information on the language version. (eg Average Wikipedia article size is 2078
bytes, compared to English language article size of 2315 bytes)

Additional useful statistics include ''character'', sentence, and paragraph
counts, and readability score (noting readability is language-specific).

While it is possible to download the entire database or use [[Special:Export]],
the former may be infeasible, and both require the user to parse wiki code on
their own, which could be processed much more efficiently by the server.

There is also confusion about what the size as given by [[Special:Search]]
actually represents (wiki code vs. HTML vs. rendered text, with or without images).
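
A rough sketch of how a few of the requested numbers could be derived from raw wiki code, written in Python purely for illustration. The function name `article_stats`, the regular expressions, and the definition of "readable" text (heading lines dropped, link labels kept, quote markup stripped) are all assumptions; the report does not define them, and templates, tables, and images are ignored here.

```python
import re

def article_stats(wikitext: str) -> dict:
    """Crude per-article statistics computed from raw wiki code (hypothetical helper)."""
    # Internal links: [[Target]] or [[Target|label]]; image and category links are skipped.
    targets = re.findall(r"\[\[([^\[\]|]+)", wikitext)
    internal = [t for t in targets
                if not t.lower().startswith(("image:", "file:", "category:"))]
    # External links: any bare or bracketed http(s) URL.
    external = re.findall(r"https?://[^\s\]]+", wikitext)

    # Approximate "readable" text: drop heading lines, keep link labels, strip quote markup.
    lines = [ln for ln in wikitext.splitlines()
             if not re.match(r"\s*=+.*=+\s*$", ln)]
    text = "\n".join(lines)
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)
    text = re.sub(r"\[https?://\S+\s*([^\]]*)\]", r"\1", text)
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"'{2,}", "", text)

    return {
        "wikicode_bytes": len(wikitext.encode("utf-8")),
        "readable_bytes": len(text.encode("utf-8")),
        "readable_words": len(text.split()),
        "internal_links": len(internal),
        "external_links": len(external),
    }
```

Reporting the wiki-code size and the readable-text size side by side would also make explicit which of the two a given "size" figure refers to.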
Comment 1 Leah 2004-09-21 17:19:05 UTC
Since readability measurements are language (and culture) specific, and because
most readability scores are of dubious objective value, the following score may
serve better as a general measure of readability.

   readability = sqrt[(characters / words)**2 + (words / sentences)**2 + (sentences / paragraphs)**2]

   with larger values indicating greater complexity.

This avoids tenuous links to "grade level" and allows scores to be meaningfully
compared as ratios.  An average score can be calculated for each language, and
statistics for individual pages might simply state "this article is 53% less
complex than Wikipedia average" or "this page is 31% more complex than Wikipedia
average", without needing to state what the actual score is.

Some sample measurements:

* [[Edgar Allan Poe]] has an average score of about 66
** The article about him has a score of 38

* [[Douglas Adams]] scores about 19
** His article has a score of 27.5

* [[The Lord of the Rings]] has an average of about 75, while [[The Hobbit]]
scores around 37
** Their articles score 35.8 and 52.3, respectively

* [[Roger Zelazny]] has a wide range of scores, with an average around 45
** His article scores 25.5
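
A minimal Python sketch of the score above, for illustration only. How characters, sentences, and paragraphs are counted is not specified in this comment, so the choices here are assumptions: characters are the non-whitespace characters of each word, sentences are split on runs of `.`, `!`, or `?`, and paragraphs are separated by blank lines. The helper `relative_complexity` is likewise hypothetical and only shows how a score could be phrased as a ratio to an average.

```python
import math
import re

def readability(text: str) -> float:
    """sqrt of the summed squares of chars/word, words/sentence, sentences/paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not (paragraphs and sentences and words):
        raise ValueError("text must contain at least one word")
    characters = sum(len(w) for w in words)  # assumption: punctuation counts, whitespace does not
    return math.sqrt(
        (characters / len(words)) ** 2
        + (len(words) / len(sentences)) ** 2
        + (len(sentences) / len(paragraphs)) ** 2
    )

def relative_complexity(score: float, average: float) -> str:
    """Phrase a score as a percentage above or below an average score."""
    pct = round(abs(score - average) / average * 100)
    direction = "more" if score > average else "less"
    return f"{pct}% {direction} complex than average"
```

Because the three ratios are on different scales, the words-per-sentence term tends to dominate for ordinary prose; that is a property of the formula as stated, not of this sketch.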
Comment 2 Leah 2004-09-21 17:31:11 UTC
Additionally, since sections are well-separated, it may be useful to include
(paragraphs / sections) in the above formula.
Comment 3 Mark Clements (HappyDog) 2006-04-17 10:54:48 UTC
The page produced by the 'info' action seems like a sensible place to put this
information.  Currently this just gives no. of edits, no. of editors and no. of
watchers, but there is a lot of extra useful info that it could contain.

However this will not help on Wikipedia and other WM wikis as this feature is
disabled.  I would imagine that this kind of info is quite intensive to produce
so it would be unlikely to be enabled in the future (unless the DB was modified
so that this data could be cached).
Comment 4 Matt Johnston 2008-09-11 11:29:10 UTC
Created attachment 5313
Proposed patch

Here's my attempt at doing this. The stats are only shown when $wgEnablePerPageStats is true; they are inefficient to compute, but the results are cached in memcached. This should work fine for most wikis.
Comment 5 Matt Johnston 2008-09-11 11:57:24 UTC
Created attachment 5314
Proposed patch v2

Various updates, proper keying system for memcached, permission checks, and fully deprecates action=info and $wgAllowPageInfo.
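
The patch's actual key format is not shown in this report. As a sketch of what a keying scheme for cached per-page stats might look like, the Python below keys each entry on wiki name, page ID, and revision ID, so a new revision never reuses stale numbers. The class `PageStatsCache`, the key layout, and the in-memory dict standing in for memcached are all assumptions made for illustration.

```python
from typing import Callable, Dict

class PageStatsCache:
    """In-memory stand-in for memcached; one entry per (wiki, page, revision)."""

    def __init__(self, compute: Callable[[str], dict]):
        self._compute = compute          # the expensive stats function
        self._store: Dict[str, dict] = {}
        self.misses = 0                  # how many times stats had to be recomputed

    @staticmethod
    def key(wiki: str, page_id: int, rev_id: int) -> str:
        # Hypothetical key layout: including the revision ID makes edits invalidate implicitly.
        return f"{wiki}:pagestats:{page_id}:{rev_id}"

    def get(self, wiki: str, page_id: int, rev_id: int, wikitext: str) -> dict:
        k = self.key(wiki, page_id, rev_id)
        if k not in self._store:
            self.misses += 1
            self._store[k] = self._compute(wikitext)
        return self._store[k]
```

With this layout the expensive computation runs once per revision, and entries for old revisions can simply be left to expire.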
Comment 6 Matt Johnston 2008-09-12 04:35:55 UTC
Created attachment 5318
Proposed patch v3

Fixes a few bugs in the previous patch, mostly with non-mainspace redirection and with some poorly documented Title functions that misled me.
Comment 7 Aaron Schulz 2008-09-12 04:44:31 UTC
Perhaps you can use the global $wgAllowPageInfo, which already exists?
Comment 8 Aaron Schulz 2008-09-12 04:48:54 UTC
(In reply to comment #7)
> Perhaps you can use the global $wgAllowPageInfo, which already exists?
> 

You might as well merge this into the ?action=info stats as well. A link could be added to it on the toolbox or at the bottom of the page (may require some minor skin changes). The function getStats() should be in Article.php (and tweaked to work there of course). No per-page items should be in SpecialStatistics.php.
Comment 9 Matt Johnston 2008-09-12 04:51:33 UTC
action=info IS merged in (try it with this patch enabled, it's at least backwards compatible). I decided not to use the old setting, as this is probably more DB intensive (although better cached).

I'll move it to Article; it does seem a better place now that I look at it.
Comment 10 Matt Johnston 2008-09-12 05:12:31 UTC
Created attachment 5319
Proposed patch v4

This should address Aaron's suggestions, and also means no changes to Title. Fully backwards-compatible now, with the main addition of caching and a few more stats.
Comment 11 Aaron Schulz 2008-09-12 11:40:50 UTC
OK, please add error_reporting( E_ALL & E_NOTICE ); to localsettings.php. I'm getting an error notice in the background.

Also, it would be nice if the data was in an HTML table like special:statistics.
Comment 12 Aaron Schulz 2008-09-12 11:44:42 UTC
Also, you might as well show the top 10 authors or so, rather than just the first.
Comment 13 Aaron Schulz 2008-09-12 11:47:35 UTC
http://vs.aka-online.de/cgi-bin/wppagehiststat.pl?lang=en.wikipedia&page=Main+Page

I'd look at that for some more possible stats to have here as an example.
Comment 14 Roan Kattouw 2008-09-12 13:00:10 UTC
(In reply to comment #11)
> OK, please add error_reporting( E_ALL & E_NOTICE ); to localsettings.php.
That should be E_ALL | E_NOTICE , I assume?
Comment 15 Aaron Schulz 2008-09-12 13:02:35 UTC
(In reply to comment #14)
> (In reply to comment #11)
> > OK, please add error_reporting( E_ALL & E_NOTICE ); to localsettings.php.
> That should be E_ALL | E_NOTICE , I assume?
> 

Yes
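
For reference, the difference between the two expressions is plain bit arithmetic, sketched here in Python with PHP's constants copied in as integers. E_NOTICE is 8; the E_ALL value used below (32767) is an assumption that holds for some PHP versions only, since the exact mask has changed between releases.

```python
# PHP error-level constants written out as plain integers for illustration.
E_NOTICE = 8
E_ALL = 32767  # assumed value; it varies by PHP version, but here it includes the E_NOTICE bit

# AND keeps only the bits set in both masks, so this reports notices and nothing else.
only_notices = E_ALL & E_NOTICE

# OR keeps the bits set in either mask; E_ALL already contains E_NOTICE, so nothing changes.
everything = E_ALL | E_NOTICE

print(only_notices, everything)
```

Under that assumption either setting surfaces notices, but only the OR form (or E_ALL alone) also keeps warnings and other errors visible.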
Comment 16 Matt Johnston 2008-09-15 11:00:53 UTC
Created attachment 5334
Proposed patch v5

This tidies up the output, groups it, and should make it easier to add individual stats. Does anyone think it's a good idea to add something like a $wgEnableStat['asdf'] setting to enable fine-tuning by admins?
Comment 17 Aaron Schulz 2009-01-03 14:10:38 UTC
special:statistics should probably be left as is.

Also, this patch gives heavy merge conflicts :(
Comment 18 Andre Klapper 2013-01-09 13:22:19 UTC
mattj:
This report has been in ASSIGNED status for more than one year and you are set as its assignee. In case that you are not actively working on a fix, please reset the bug status to NEW/UNCONFIRMED.
In case you do not plan to work on a fix in the near future: Please also edit the "Assigned To" field by clicking "Reset Assignee to default", in order to not prevent potential contributors from working on a fix. Thanks for your help!
[assigned>=1y]
Comment 19 MZMcBride 2013-03-25 23:16:27 UTC
Resetting this to new.

The good news is that the info action has now been re-implemented and is available on Wikimedia wikis (including Wikipedia) and by default in MediaWiki core. Yay!

The bad news is that we still don't have cool stats in the info action's output.

Here's the live info action today: <https://en.wikipedia.org/wiki/Barack_Obama?action=info>.

(In reply to comment #0)
> * Total size of article (in kb)
> * Total size of text (in kb)
> * Total size of readable text (in kb)
> * Total size of all images in article (in kb)
> * Readable Word Count (excluding headings etc as per definitions of
> '''article
> count - alternative''')
> * Amount of internal links
> * Amount of external links

Can you split this out into individual bug reports, please? :-)  One bug report for each bullet, I guess. Take a look at <https://bugzilla.wikimedia.org/showdependencytree.cgi?id=38450&hide_resolved=0> to see how similar bugs have been filed. Individual bug reports allow everyone to focus on particular data points to add to action=info. Some of these data points have very clear forward paths. Others have very unclear forward paths (for example, measuring "readable text").

> : As well as up-to-date information on the page which shows what the average
> Wikipedia article is like in comparison to the article, plus additional
> information on the language version. (eg Average Wikipedia article size is
> 2078 bytes, compared to English language article size of 2315 bytes)

Right. This will be a separate bug report.

> Additional useful statistics include ''character'', sentence, and paragraph
> counts, and readability score (noting readability is language-specific).

This too. Ideas for implementation strategies for any of this would also be great inside the individual bug reports. ;-)
