Last modified: 2014-09-23 23:46:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T4994, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 2994 - Automatically generated count and list of contributors to an article (authorship tracking)
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: History/Diffs
Version: unspecified
Hardware: All All
Importance: Low enhancement with 6 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 23327
Depends on:
Blocks:

Reported: 2005-07-29 11:02 UTC by Robert Horning
Modified: 2014-09-23 23:46 UTC
CC List: 9 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Robert Horning 2005-07-29 11:02:37 UTC
When trying to determine "authorship" of an article, one possible method would
be to "count" the number of edits for a given article.  This is particularly
important when trying to determine who the "principal author" of an article
might be when giving citations of the article, or for formal copyright registration.

In short, a quick count "tab" or "button" in the history page would then count
each user's contributions in a fashion like this:

User1 (20 edits)
User2 (15 edits)
User3 (7 edits)
49.12.24.127 (3 edits)

To get "fancy" you could even try to eliminate counts from reversions (or even
reversion wars), especially to eliminate giving credit to vandals.  A simple
implementation would only require a simple count.

Another further enhancement would be to list the timestamp for the last edit for
each author on a particular article.

The main purpose of this is to extract the names of all authors for a particular
article.
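
The simple count requested above, together with the last-edit timestamp, could look roughly like the following Python sketch. It is only an illustration, not MediaWiki code: the `revisions` list of (user, timestamp) pairs and the name `count_contributions` are hypothetical, the sample history is made up, and no attempt is made to detect reverts or vandalism.

```python
from collections import Counter

def count_contributions(revisions):
    """Return (user, edit_count, last_edit) tuples, most active users first.

    `revisions` is a hypothetical list of (user, timestamp) pairs, one per
    saved revision; timestamps are ISO 8601 strings, so they sort as text.
    """
    counts = Counter(user for user, _ in revisions)
    last_edit = {}
    for user, timestamp in revisions:
        # Keep the latest timestamp seen for each user.
        if user not in last_edit or timestamp > last_edit[user]:
            last_edit[user] = timestamp
    return [(user, n, last_edit[user]) for user, n in counts.most_common()]

# Made-up history, for illustration only.
history = [
    ("User1", "2005-07-01T10:00:00Z"),
    ("User2", "2005-07-02T09:30:00Z"),
    ("User1", "2005-07-03T18:45:00Z"),
    ("49.12.24.127", "2005-07-04T12:00:00Z"),
    ("User1", "2005-07-05T08:15:00Z"),
]

for user, n, last in count_contributions(history):
    print(f"{user} ({n} edits, last edit {last})")
```

Excluding reverts would need extra information per revision (for example a content hash to spot restored versions), which this sketch deliberately leaves out.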
Comment 1 Rowan Collins [IMSoP] 2005-07-29 23:21:42 UTC
If the purpose is to extract the names of all authors for the article, why do
you need to count their edits? I would think that that was a terribly poor
statistic. For instance, I make heavy use of the "Preview" button, or even a
temporary page, when performing multiple or substantial modifications on an
article; other users (particularly less experienced ones) tend to save "little
and often", filling the history with multiple small changes. I don't see how any
system could overcome such biases and represent "relative contributions" to the
article.

For just listing the users from a page's history, I think the software already
has this capability, but it's not available on the Wikimedia servers for
security purposes.

For how to cite an article, see
http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia and bug 800

For determining who has "contributed most" to an article, you might be
interested in this piece of IBM research:
http://www.alphaworks.ibm.com/tech/historyflow/
Comment 2 Robert Horning 2005-07-30 00:45:03 UTC
The #1 reason for getting the edit counts would be to get a simple method to
determine who may have been major contributors to the article as opposed to
minor edits.  This is not a totally consistent rule, as you pointed out Rowan,
but it at least does give a consistent basis in fact.  A more comprehensive rule
would be to try and do a word count for each author, although that may take
quite a bit of server time to try and figure out.  Some sort of algorithm might
be derived to determine exactly who wrote what words in a given article, but it
may be tricky to do that.  I'm just trying to keep things simple here.  I agree
that one author with one edit might write 90% of an article with the other 200+
edits only minor rearranging and vandalism with reverts.

The purpose of this is to primarily organize and quickly come up with the names
of all of the authors for an article, or preferably a series of articles (a
whole Wikibook, for instance) that could be used in a legal context to define
who exactly is the author of the article, and to be able to "file" formal
copyright registration.  Some have told me that only the top 5-10 authors need
to be referenced in this fashion, so getting the top 10 users by edit count
would help to determine just who should be included in a formal application. 
I'm not throwing out the idea that another metric could be used, but at the same
time I don't want to overload the system trying to do a whole series of reverts
to compare just who wrote what part of the article and base "ownership" on
original word count.

If the IBM technology can be distributed with and function with MediaWiki,
perhaps that is what is needed.  This request, however, is for something that I
need for Wikimedia projects specifically, and en.wikibooks.org in particular. 
It is a general request because I think it could be useful in other
installations of MediaWiki software.

Legally we _have_ to cite each author.  The Wikipedia article on citing
Wikipedia articles is wrong when it comes to legal registration practices and
formal citations in a legal setting.  As far as "recommended" citations for term
papers and such, it is an easy cop-out to simply ignore the authors of the
articles altogether, and not necessarily required.
Comment 3 Rowan Collins [IMSoP] 2005-07-30 01:02:12 UTC
Well, in my view, counting the edits by each user would just be such a poor
indicator of "relative work" as to be a waste of time. If, as you say, 90% of
the article can be owed to one one-off edit, your list of the "top 5" might as
well just be a "random 5". If you want a heuristic for who has contributed most,
look into the IBM research; if all you're after is a cheap list of authors,
either list them all or pick them at random.

I've just checked, and the software does indeed have a feature for listing the
editors of a page - via ...&action=credits - but it appears to be switched off
or otherwise unavailable on Wikipedia. (Well, the test wiki has it, anyway, see
http://test.leuksman.com/index.php?title=Main_Page&action=credits). This would
seem to me to be much what you need - why limit it any further?

You could of course use database dumps to get at this information, or indeed the
Special:Export feature which can output entire histories (although this
presumably puts large amounts of strain on the server for heavily edited articles).
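
As a rough illustration of the Special:Export route, the sketch below pulls the distinct contributor names out of an export dump. It assumes the usual export layout, where each `<revision>` carries a `<contributor>` holding either a `<username>` or an `<ip>`; the function name `contributors_from_export` and the sample XML are made up for this example, and a real dump should be checked against the schema it declares.

```python
import xml.etree.ElementTree as ET

def contributors_from_export(xml_text):
    """List distinct contributors in an export dump, in order of appearance.

    Registered users appear as <username>, anonymous editors as <ip>.
    Tags are matched by local name, so the export schema version in the
    XML namespace does not matter.
    """
    seen = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip "{namespace}"
        if local_name in ("username", "ip") and elem.text:
            if elem.text not in seen:
                seen.append(elem.text)
    return seen

# Hand-written sample in the shape of a Special:Export dump (not real data).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2005-07-01T10:00:00Z</timestamp>
      <contributor><username>User1</username><id>1</id></contributor>
    </revision>
    <revision>
      <timestamp>2005-07-02T09:30:00Z</timestamp>
      <contributor><ip>49.12.24.127</ip></contributor>
    </revision>
    <revision>
      <timestamp>2005-07-03T18:45:00Z</timestamp>
      <contributor><username>User1</username><id>1</id></contributor>
    </revision>
  </page>
</mediawiki>"""

print(contributors_from_export(SAMPLE))
```

This only reads what the dump contains; the cost of producing the full-history dump on the server side is unchanged.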
Comment 4 Robert Horning 2005-08-11 11:36:38 UTC
Is there any reason why this feature would be "turned off" from Wikipedia?  (the
&action=credits feature?)  Is it still considered "buggy" or does it take a lot
of server resources to accomplish?  There are valid reasons required in the GFDL
where obtaining this information is not only useful but legally required.  Also,
who would have the power to "turn on" such a feature for a given Wiki?  As in
the typical Developer/Steward/Bureaucrat/Admin hierarchy?

Doing a DB dump seems like a waste of bandwidth, particularly when all you are
trying to do is get the credits for just a few articles.  It would take me a
couple of days to download all of en.wikipedia, for instance.  That really isn't
a reasonable request or expectation of a typical user.
Comment 5 Rowan Collins [IMSoP] 2005-08-11 12:04:31 UTC
(In reply to comment #4)
> Is there any reason why this feature would be "turned off" from Wikipedia?  (the
> &action=credits feature?) 

I'm not sure, tbh - I think I'll ask on wikitech-l. But my guess is that there's
no efficient way of generating/caching this information, so that it takes large
amounts of server resources. Thinking about it, the only way I can think of
would require accessing the metadata (although not now the text, which is stored
separately) for every revision a page has ever had - which is a lot of revisions
on some pages...

> Also, who would have the power to "turn on" such a feature for a given Wiki?
> As in the typical Developer/Steward/Bureaucrat/Admin hierarchy?

A developer; I imagine it's a variable in LocalSettings.php.

> Doing a DB dump seems like a waste of bandwidth, particularly when all you are
> trying to do is get the credits for just a few articles.  

Well, if you just want a few articles, you can use [[Special:Export]] to dump
just those articles, including their history. But it's still kind of wasteful, I
agree.
Comment 6 Catherine Munro 2005-12-03 01:01:21 UTC
There is a way to count editor contributions, since this external site by German
user Aka does it:
http://vs.aka-online.de/wppagehiststat/

(See http://de.wikipedia.org/wiki/Benutzer:Aka)

Perhaps his solution could be adapted into MediaWiki, if it's less taxing on the
database than "&action=credits".  
Comment 7 Rob Church 2006-01-04 20:13:02 UTC
A straight count of all revisions in an article's history wouldn't be too bad.
Grouping by username, etc. is where the fun comes in, however, since it's a more
complicated and hence longer query; ultimately, performance is affected.
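
To make the difference between the two queries concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory table. The column names mirror the MediaWiki `revision` table of that era (`rev_page`, `rev_user_text`, `rev_timestamp`), but the schema is a simplified stand-in, the rows are made up, and SQLite says nothing about how the production database would perform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, "
    "rev_user_text TEXT, rev_timestamp TEXT)"
)
# Made-up revisions: four on page 1, one on page 2.
rows = [
    (1, "User1", "20050701100000"),
    (1, "User2", "20050702093000"),
    (1, "User1", "20050703184500"),
    (1, "49.12.24.127", "20050704120000"),
    (2, "User3", "20050705081500"),
]
conn.executemany(
    "INSERT INTO revision (rev_page, rev_user_text, rev_timestamp) "
    "VALUES (?, ?, ?)",
    rows,
)

# Straight count of all revisions of page 1.
total = conn.execute(
    "SELECT COUNT(*) FROM revision WHERE rev_page = ?", (1,)
).fetchone()[0]

# Per-user breakdown: the grouping and sorting is the more expensive part.
per_user = conn.execute(
    "SELECT rev_user_text, COUNT(*) AS edits, MAX(rev_timestamp) AS last_edit "
    "FROM revision WHERE rev_page = ? "
    "GROUP BY rev_user_text ORDER BY edits DESC, rev_user_text",
    (1,),
).fetchall()

print(total)
print(per_user)
```

Both queries still have to touch every revision row of the page, so on pages with very long histories the result would probably need to be cached.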
Comment 8 Jarle Tufto 2008-07-13 16:26:23 UTC
I strongly agree that there should be a better way to cite Wikipedia articles and get authorship information.  My concern is how Wikipedia is perceived and used in academia.  This in turn has implications for the quality of Wikipedia.  Suppose someone who is the main author of an article wants to include a reference to the article in his list of publications submitted when applying for, say, tenure or grant money.  The current way,

  http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia,

is not good enough.  There needs to be an easy way to get information on exactly what some particular author has written.  I would suggest that a special link for this be made available, perhaps on the history page.  The URL could have, for example, the following format

  http://en.wikipedia.org/w/index.php?title=Genetics&oldid=225193947&highlight=Jimbo_Wales,

which should produce a standard view of the page Genetics but with user Jimbo_Wales' contributions highlighted in, say, light yellow.  This should work on a per-character and not a per-line basis.  Authorship should be preserved for moved text, which seems to be possible if this is based on algorithms such as that of

  http://en.wikipedia.org/wiki/User:Cacycle/wikEdDiff

I guess handling reverts of old deletions may be more tricky (preserving authorship of text which becomes reinserted by someone other than the original author).

I suppose this would require changes in the MediaWiki software so that it keeps track of the authorship of every byte in the source of each article from every version to the next; I don't see that this should lead to a massive increase in computational load if implemented properly (perhaps some downtime would be needed when making the transition...).  The above wikEdDiff page mentions "integration into Mediawiki"...
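
A very rough sketch of carrying authorship from one version to the next is shown below, using Python's standard `difflib` at word level. It is not the wikEdDiff algorithm and not per-character: the function name `attribute_tokens` and the sample revisions are made up, moved text is treated as a deletion plus a new insertion, and text restored by a revert is credited to whoever restored it, which are exactly the hard cases mentioned above.

```python
import difflib

def attribute_tokens(revisions):
    """Return [(token, author), ...] for the latest revision.

    `revisions` is a list of (author, text) pairs, oldest first. A token
    keeps its author as long as difflib matches it to the previous
    revision; inserted or replaced tokens go to the current editor.
    """
    tokens, authors = [], []
    for author, text in revisions:
        new_tokens = text.split()
        matcher = difflib.SequenceMatcher(
            None, tokens, new_tokens, autojunk=False
        )
        # Default: everything belongs to the current editor.
        new_authors = [author] * len(new_tokens)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged run: carry the earlier attribution forward.
                new_authors[j1:j2] = authors[i1:i2]
        tokens, authors = new_tokens, new_authors
    return list(zip(tokens, authors))

# Made-up two-revision history, for illustration only.
sample_revisions = [
    ("User1", "Genetics is the study of genes"),
    ("User2", "Genetics is the scientific study of genes and heredity"),
]

for token, author in attribute_tokens(sample_revisions):
    print(f"{token}\t{author}")
```

Only the attribution of the latest revision is kept in memory at each step, so storing that mapping alongside each saved revision would avoid replaying the whole history on every view.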
Comment 9 Platonides 2008-07-13 16:46:06 UTC
Roman Nosov wrote an interesting blamemap extension last year that does what you describe here.
Guy Van den Broeck is working on a Visual Diff for this year's SoC: http://code.google.com/soc/2008/wikimedia/appinfo.html?csaid=9813DF0473619117
Comment 10 Jarle Tufto 2008-07-13 18:17:29 UTC
Very nice!  A live demo is still up and running at

  http://91.186.7.138:9001/wiki/Freebsd?trackchanges=blamemap
Comment 11 Sumana Harihareswara 2011-10-07 20:42:02 UTC
Section "3. Page-level change tracking" of the Quality section of the Feature Map here:

https://www.mediawiki.org/wiki/Feature_map#Quality:_Features_that_directly_support_quality_assurance.2C_assessment_and_labeling

mentions and links to WikiBlame, Daniel Kinzler's Contributors Script, PARC's WikiDashboard, and a few other tools that individuals can use to understand who contributed to a wiki article.

Given the current options, what's the best way to move forward?  Perhaps researchers who just want to cite an article's authors could use a user gadget that, for any article, generates a simple list of the authors' names and puts it on the revision history page.  As for more complicated needs involving highlighting who-wrote-what, I'm not sure what the best option is.
Comment 12 Sumana Harihareswara 2011-10-07 21:48:23 UTC
*** Bug 23327 has been marked as a duplicate of this bug. ***
Comment 13 Nemo 2013-10-23 09:20:50 UTC
(In reply to comment #11)
> As for more complicated needs involving
> highlighting who-wrote-what, I'm not sure what the best option is.

I've no idea what's the best option but WikiTrust did that and is seeking a new maintainer: http://lists.wikimedia.org/pipermail/wiki-research-l/2013-September/003068.html
Comment 14 Nemo 2014-06-27 07:42:51 UTC
(In reply to Rob Church from comment #7)
> A straight count of all revisions in an article's history wouldn't be too
> bad.
> Grouping by username, etc. is where the fun comes in, however, since it's a
> more
> complicated and hence longer query; ultimately, performance is affected.

Rob did this in https://www.mediawiki.org/wiki/Extension:Contributors around 2006; I'm adding Yaron, the current maintainer, to cc. Then we have action=credits in core.

There are many ways to approach this bug, inside or outside core. The two main lines of work in core can be seen here:
https://bugzilla.wikimedia.org/showdependencygraph.cgi?id=39533&showsummary=on&display=tree&rankdir=TB

The most advanced features are unlikely to be implemented in core but it's useful to have a map of existing and possible work; I hope the graph above helps.
