Last modified: 2013-03-25 23:16:27 UTC
A recent discussion on [[en:Wikipedia:Village pump]] suggests the addition of a
special page, possibly [[Special:Statistics]], to automatically determine
interesting statistics about individual articles. In particular, the following
statistics are desired:
* Total size of article (in kb)
* Total size of text (in kb)
* Total size of readable text (in kb)
* Total size of all images in article (in kb)
* Readable Word Count (excluding headings etc as per definitions of '''article
count - alternative''')
* Amount of internal links
* Amount of external links
: As well as up-to-date information on the page which shows what the average
Wikipedia article is like in comparison to the article, plus additional
information on the language version. (eg Average Wikipedia article size is 2078
bytes, compared to English language article size of 2315 bytes)
Additional useful statistics include ''character'', sentence, and paragraph
counts, and readability score (noting readability is language-specific).
While it is possible to download the entire database or use [[Special:Export]],
the former may be infeasible, and both require the user to parse wiki code on
their own, which could be processed much more efficiently by the server.
There is also confusion about what the size as given by [[Special:Search]]
actually represents (wiki code vs. HTML vs. rendered text, with or without images).
Since readability measurements are language (and culture) specific, and because
most readability scores are of dubious objective value, the following score may
serve better as a general measure of readability.
 readability = sqrt[(characters / words)**2 + (words / sentences)**2 + (sentences / paragraphs)**2]
with larger values indicating greater complexity.
This avoids tenuous links to "grade level" and allows scores to be meaningfully
compared as ratios. An average score can be calculated for each language, and
statistics for individual pages might simply state "this article is 53% less
complex than Wikipedia average" or "this page is 31% more complex than Wikipedia
average", without needing to state what the actual score is.
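As a rough illustration of the score and the relative comparison described above, here is a minimal Python sketch. The counting rules (whitespace-separated words, sentences ending in ".", "!" or "?", paragraphs separated by blank lines, characters counted as non-whitespace characters) and all function names are assumptions made for this example; the proposal itself does not define them.

```python
import math
import re


def counts(text):
    """Return (characters, words, sentences, paragraphs) for plain text.

    Assumed rules (not part of the proposal): paragraphs are separated by
    blank lines, sentences end in '.', '!' or '?', words are separated by
    whitespace, and characters are all non-whitespace characters
    (punctuation included).
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(len(w) for w in words)
    return characters, len(words), len(sentences), len(paragraphs)


def readability(characters, words, sentences, paragraphs):
    """Square root of the sum of squared ratios, as in the formula above.

    Raises ZeroDivisionError for empty text; a real implementation would
    need to decide how to handle that case.
    """
    return math.sqrt(
        (characters / words) ** 2
        + (words / sentences) ** 2
        + (sentences / paragraphs) ** 2
    )


def compare_to_average(score, average):
    """Describe a page score relative to a wiki-wide average score."""
    percent = round(abs(score - average) / average * 100)
    direction = "more" if score > average else "less"
    return f"{percent}% {direction} complex than Wikipedia average"
```

For example, `readability(*counts(page_text))` would score one page, and `compare_to_average(page_score, wiki_average)` would produce the kind of sentence quoted above, where `page_text`, `page_score` and `wiki_average` are placeholders for values the server would supply.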
Some sample measurements:
* [[Edgar Allan Poe]] has an average score of about 66
** The article about him has a score of 38
* [[Douglas Adams]] scores about 19
** His article has a score of 27.5
* [[The Lord of the Rings]] has an average of about 75, while [[The Hobbit]]
scores around 37
** Their articles score 35.8 and 52.3, respectively
* [[Roger Zelazny]] has a wide range of scores, with an average around 45
** His article scores 25.5
Additionally, since sections are well-separated, it may be useful to include
(paragraphs / sections) in the above formula.
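One way to picture the extra (paragraphs / sections) term is to treat the score as a chain of ratios between adjacent levels of structure. The sketch below is only an illustration; the function name `layered_readability` is hypothetical, and the counts are assumed to be supplied by the caller.

```python
import math


def layered_readability(*counts):
    """Score from counts ordered from the finest to the coarsest unit.

    For example: layered_readability(characters, words, sentences,
    paragraphs, sections). Each adjacent pair of counts contributes one
    squared ratio to the sum under the square root.
    """
    ratios = [fine / coarse for fine, coarse in zip(counts, counts[1:])]
    return math.sqrt(sum(r * r for r in ratios))
```

Called with four counts, this reduces to the original formula; with a fifth count for sections, it adds the (paragraphs / sections) term. Since every extra term can only increase the result, scores computed with and without the sections term are not directly comparable.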
The page produced by the 'info' action seems like a sensible place to put this
information. Currently this just gives no. of edits, no. of editors and no. of
watchers, but there is a lot of extra useful info that it could contain.
However, this will not help on Wikipedia and other Wikimedia wikis, as this
feature is disabled there. I would imagine that this kind of info is quite
expensive to produce, so it is unlikely to be enabled in the future (unless the
DB is modified so that this data can be cached).
Created attachment 5313
Here's my attempt at doing this. Stats rely on $wgEnablePerPageStats being true; they are inefficient to compute but are cached in memcached. This will work fine for most wikis.
Created attachment 5314
Proposed patch v2
Various updates, proper keying system for memcached, permission checks, and fully deprecates action=info and $wgAllowPageInfo.
Created attachment 5318
Proposed patch v3
Fixing a few bugs in the previous version, mostly with non-mainspace redirection and some Title functions whose poor docs misled me.
Perhaps you can use the global $wgAllowPageInfo, which already exists?
(In reply to comment #7)
> Perhaps you can use the global $wgAllowPageInfo, which already exists?
You might as well merge this into the ?action=info stats as well. A link could be added to it on the toolbox or at the bottom of the page (may require some minor skin changes). The function getStats() should be in Article.php (and tweaked to work there of course). No per-page items should be in SpecialStatistics.php.
action=info IS merged in (try it with this patch enabled, it's at least backwards compatible). I decided not to use the old setting, as this is probably more DB intensive (although better cached).
I'll move it to Article; it does seem a better place now that I look at it.
Created attachment 5319
Proposed patch v4
This should address Aaron's suggestions, and also means no changes to Title. Fully backwards-compatible now, with the main addition of caching and a few more stats.
OK, please add error_reporting( E_ALL & E_NOTICE ); to localsettings.php. I'm getting an error notice in the background.
Also, it would be nice if the data was in an HTML table like Special:Statistics.
Also, you might as well show the top 10 authors or so, rather than the first
I'd look at that for some more possible stats to have here as an example.
(In reply to comment #11)
> OK, please add error_reporting( E_ALL & E_NOTICE ); to localsettings.php.
That should be E_ALL | E_NOTICE , I assume?
(In reply to comment #14)
> (In reply to comment #11)
> > OK, please add error_reporting( E_ALL & E_NOTICE ); to localsettings.php.
> That should be E_ALL | E_NOTICE , I assume?
Created attachment 5334
Proposed patch v5
This tidies up the output, groups it, and should make it easier to add individual stats. Does anyone think it's a good idea to add something like a $wgEnableStat['asdf'] setting to enable fine-tuning by admins?
Special:Statistics should probably be left as is.
Also, this patch gives heavy merge conflicts :(
This report has been in ASSIGNED status for more than one year and you are set as its assignee. In case that you are not actively working on a fix, please reset the bug status to NEW/UNCONFIRMED.
In case you do not plan to work on a fix in the near future: Please also edit the "Assigned To" field by clicking "Reset Assignee to default", in order to not prevent potential contributors from working on a fix. Thanks for your help!
Resetting this to new.
The good news is that the info action has now been re-implemented and is available on Wikimedia wikis (including Wikipedia) and by default in MediaWiki core. Yay!
The bad news is that we still don't have cool stats in the info action's output.
Here's the live info action today: <https://en.wikipedia.org/wiki/Barack_Obama?action=info>.
(In reply to comment #0)
> * Total size of article (in kb)
> * Total size of text (in kb)
> * Total size of readable text (in kb)
> * Total size of all images in article (in kb)
> * Readable Word Count (excluding headings etc as per definitions of '''article
> count - alternative''')
> * Amount of internal links
> * Amount of external links
Can you split this out into individual bug reports, please? :-) One bug report for each bullet, I guess. Take a look at <https://bugzilla.wikimedia.org/showdependencytree.cgi?id=38450&hide_resolved=0> to see how similar bugs have been filed. Individual bug reports allow everyone to focus on particular data points to add to action=info. Some of these data points have very clear forward paths. Others have very unclear forward paths (for example, measuring "readable text").
> : As well as up-to-date information on the page which shows what the average
> Wikipedia article is like in comparison to the article, plus additional
> information on the language version. (eg Average Wikipedia article size is
> 2078 bytes, compared to English language article size of 2315 bytes)
Right. This will be a separate bug report.
> Additional useful statistics include ''character'', sentence, and paragraph
> counts, and readability score (noting readability is language-specific).
This too. Ideas for implementation strategies for any of this would also be great inside the individual bug reports. ;-)