Last modified: 2014-09-23 23:46:23 UTC
When trying to determine "authorship" of an article, one possible method would be to count the number of edits each user has made to it. This is particularly important when trying to determine who the "principal author" of an article might be when citing the article, or for formal copyright registration. In short, a quick count "tab" or "button" on the history page would count each user's contributions in a fashion like this:

User1 (20 edits)
User2 (15 edits)
User3 (7 edits)
49.12.24.127 (3 edits)

To get "fancy" you could even try to exclude reversions (or even reversion wars) from the counts, especially to avoid giving credit to vandals. A simple implementation would only require a simple count. A further enhancement would be to list the timestamp of each author's last edit to a particular article. The main purpose of this is to extract the names of all authors for a particular article.
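As a rough illustration of the simple count described above, here is a minimal Python sketch. It is not MediaWiki code: the `count_edits` function and the `(author, timestamp)` input format are assumptions made for the example, and it does not attempt to filter out reversions.

```python
from collections import Counter

def count_edits(revisions):
    """Return (author, edit_count, last_timestamp) tuples, most edits first.

    `revisions` is a hypothetical list of (author, timestamp) pairs.
    Timestamps are assumed to be ISO 8601 strings in UTC, so that plain
    string comparison orders them chronologically.
    """
    counts = Counter()
    last_edit = {}
    for author, timestamp in revisions:
        counts[author] += 1
        # Remember the most recent edit seen for this author.
        if timestamp > last_edit.get(author, ""):
            last_edit[author] = timestamp
    return [(author, n, last_edit[author]) for author, n in counts.most_common()]

# Made-up history, for illustration only.
history = [
    ("User1", "2005-01-03T10:00:00Z"),
    ("User2", "2005-01-04T09:30:00Z"),
    ("User1", "2005-01-05T12:15:00Z"),
    ("49.12.24.127", "2005-01-06T08:00:00Z"),
]

for author, n, last in count_edits(history):
    print(f"{author} ({n} edits, last edit {last})")
```

Anonymous edits are simply keyed by IP address here, matching the example list above.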
If the purpose is to extract the names of all authors for the article, why do you need to count their edits? I would think that was a terribly poor statistic. For instance, I make heavy use of the "Preview" button, or even a temporary page, when performing multiple or substantial modifications on an article; other users (particularly less experienced ones) tend to save "little and often", filling the history with multiple small changes. I don't see how any system could overcome such biases and represent "relative contributions" to the article. For just listing the users from a page's history, I think the software already has this capability, but it's not available on the Wikimedia servers for security purposes. For how to cite an article, see http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia and bug 800. For determining who has "contributed most" to an article, you might be interested in this piece of IBM research: http://www.alphaworks.ibm.com/tech/historyflow/
The #1 reason for getting the edit counts would be to have a simple method to determine who the major contributors to the article may have been, as opposed to those who made only minor edits. This is not a totally consistent rule, as you pointed out, Rowan, but it does at least give a consistent basis in fact. A more comprehensive rule would be to try to do a word count for each author, although that may take quite a bit of server time to figure out. Some sort of algorithm might be derived to determine exactly who wrote which words in a given article, but it may be tricky to do that. I'm just trying to keep things simple here. I agree that one author with one edit might write 90% of an article, with the other 200+ edits being only minor rearranging and vandalism with reverts. The purpose of this is primarily to organize and quickly come up with the names of all of the authors for an article, or preferably a series of articles (a whole Wikibook, for instance), that could be used in a legal context to define who exactly is the author of the article, and to be able to "file" formal copyright registration. Some have told me that only the top 5-10 authors need to be referenced in this fashion, so getting the top 10 users by edit count would help to determine just who should be included in a formal application. I'm not throwing out the idea that another metric could be used, but at the same time I don't want to overload the system by walking through a whole series of revisions to compare just who wrote what part of the article and base "ownership" on original word count. If the IBM technology can be distributed with and function with MediaWiki, perhaps that is what is needed. This request, however, is for something that I need for Wikimedia projects specifically, and en.wikibooks.org in particular. It is a general request because I think it could be useful in other installations of MediaWiki software. Legally we _have_ to cite each author.
The Wikipedia article on citing Wikipedia articles is wrong when it comes to legal registration practices and formal citations in a legal setting. As for "recommended" citations for term papers and such, simply ignoring the authors of the articles altogether is an easy cop-out, and not necessarily required.
Well, in my view, counting the edits by each user would just be such a poor indicator of "relative work" as to be a waste of time. If, as you say, 90% of the article can be owed to one one-off edit, your list of the "top 5" might as well just be a "random 5". If you want a heuristic for who has contributed most, look into the IBM research; if all you're after is a cheap list of authors, either list them all or pick them at random. I've just checked, and the software does indeed have a feature for listing the editors of a page - via ...&action=credits - but it appears to be switched off or otherwise unavailable on Wikipedia. (Well, the test wiki has it, anyway; see http://test.leuksman.com/index.php?title=Main_Page&action=credits). This would seem to me to be much what you need - why limit it any further? You could of course use database dumps to get at this information, or indeed the Special:Export feature, which can output entire histories (although this presumably puts large amounts of strain on the server for heavily edited articles).
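For anyone going the Special:Export route, a minimal Python sketch of pulling the contributor names out of an exported history might look like the following. The embedded `SAMPLE` document is made up and heavily trimmed; it assumes the usual export layout in which each `<revision>` carries a `<contributor>` element holding either a `<username>` or an `<ip>`. The XML namespace version differs between MediaWiki releases, so the sketch ignores namespaces rather than hard-coding one.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed-down export document (real exports carry far more fields).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><contributor><username>User1</username></contributor></revision>
    <revision><contributor><ip>49.12.24.127</ip></contributor></revision>
    <revision><contributor><username>User1</username></contributor></revision>
  </page>
</mediawiki>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def contributors(xml_text):
    """Return each distinct contributor once, in order of first appearance."""
    names, seen = [], set()
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) != "contributor":
            continue
        for child in elem:
            # Registered users appear as <username>, anonymous ones as <ip>.
            if local_name(child.tag) in ("username", "ip") and child.text:
                if child.text not in seen:
                    seen.add(child.text)
                    names.append(child.text)
    return names

print(contributors(SAMPLE))
```

This only reads a file that has already been exported; it does nothing to reduce the load that generating a long history puts on the server.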
Is there any reason why this feature would be "turned off" from Wikipedia? (the &action=credits feature?) Is it still considered "buggy" or does it take a lot of server resources to accomplish? There are valid reasons required in the GFDL where obtaining this information is not only useful but legally required. Also, who would have the power to "turn on" such a feature for a given Wiki? Where would that fall in the typical Developer/Steward/Bureaucrat/Admin hierarchy? Doing a DB dump seems like a waste of bandwidth, particularly when all you are trying to do is get the credits for just a few articles. It would take me a couple of days to download all of en.wikipedia, for instance. That really isn't a reasonable request or expectation of a typical user.
(In reply to comment #4)
> Is there any reason why this feature would be "turned off" from Wikipedia?
> (the &action=credits feature?)

I'm not sure, tbh - I think I'll ask on wikitech-l. But my guess is that there's no efficient way of generating/caching this information, so that it takes large amounts of server resources. Thinking about it, the only way I can think of would require accessing the metadata (although not now the text, which is stored separately) for every revision a page has ever had - which is a lot of revisions on some pages...

> Also, who would have the power to "turn on" such a feature for a given Wiki?
> As in the typical Developer/Steward/Bureaucrat/Admin hierarchy?

A developer; I imagine it's a variable in LocalSettings.php.

> Doing a DB dump seems like a waste of bandwidth, particularly when all you
> are trying to do is get the credits for just a few articles.

Well, if you just want a few articles, you can use [[Special:Export]] to dump just those articles, including their history. But it's still kind of wasteful, I agree.
There is a way to count editor contributions, since this external site by German user Aka does it: http://vs.aka-online.de/wppagehiststat/ (See http://de.wikipedia.org/wiki/Benutzer:Aka) Perhaps his solution could be adapted into MediaWiki, if it's less taxing on the database than "&action=credits".
A straight count of all revisions in an article's history wouldn't be too bad. Grouping by username, etc. is where the fun comes in, however, since it's a more complicated and hence longer query; ultimately, performance is affected.
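To make the difference between the two queries concrete, here is a small self-contained Python sketch using an in-memory SQLite database. The table and column names are only loosely modelled on a revision table and should be treated as assumptions, as should the sample rows; the real schema, indexes, and database engine will differ, so this says nothing about actual performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified revision table (not the real MediaWiki schema).
conn.execute(
    "CREATE TABLE revision (rev_page INTEGER, rev_user_text TEXT, rev_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [
        (1, "User1", "20050103100000"),
        (1, "User2", "20050104093000"),
        (1, "User1", "20050105121500"),
        (1, "49.12.24.127", "20050106080000"),
        (2, "User3", "20050107000000"),
    ],
)

def total_revisions(conn, page_id):
    """The 'straight count': one aggregate over the page's revisions."""
    row = conn.execute(
        "SELECT COUNT(*) FROM revision WHERE rev_page = ?", (page_id,)
    ).fetchone()
    return row[0]

def revisions_by_user(conn, page_id):
    """The grouped variant: per-user counts plus each user's latest edit."""
    return conn.execute(
        "SELECT rev_user_text, COUNT(*) AS edits, MAX(rev_timestamp) "
        "FROM revision WHERE rev_page = ? "
        "GROUP BY rev_user_text "
        "ORDER BY edits DESC, rev_user_text",
        (page_id,),
    ).fetchall()

print(total_revisions(conn, 1))
for user, edits, last in revisions_by_user(conn, 1):
    print(user, edits, last)
```

The grouped query has to bucket and sort every revision row for the page, which is the extra work referred to above.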
I strongly agree that there should be a better way to cite Wikipedia articles and get authorship information. My concern is how Wikipedia is perceived and used in academia. This in turn has implications for the quality of Wikipedia. Suppose someone who is the main author of an article wants to include a reference to the article in the list of publications submitted when applying for, say, tenure or grant money. The current way, http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia, is not good enough. There needs to be an easy way to get information on exactly what some particular author has written. I would suggest that a special link for this be made available, perhaps on the history page. The URL could have, for example, the following format: http://en.wikipedia.org/w/index.php?title=Genetics&oldid=225193947&highlight=Jimbo_Wales, which should produce a standard view of the page Genetics but with user Jimbo_Wales' contributions highlighted in, say, light yellow. This should work on a per-character and not a per-line basis. Authorship should be preserved for moved text, which seems to be possible if this is based on algorithms such as that of http://en.wikipedia.org/wiki/User:Cacycle/wikEdDiff. I guess handling reverts of old deletions may be more tricky (preserving authorship of text which is reinserted by someone other than the original author). I suppose this would require changes in the MediaWiki software so that it keeps track of the authorship of every byte in the source of each article from every version to the next; I don't see that this should lead to a massive increase in computational load if implemented properly (perhaps some downtime would be needed when making the transition...). The above wikEdDiff page mentions "integration into Mediawiki"...
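A very rough sketch of the per-character idea, in Python, is given below. It is not wikEdDiff's algorithm: it uses the standard library's `difflib.SequenceMatcher` as a stand-in, the `blame` function and the `(author, text)` input format are assumptions made for the example, and it deliberately does not handle the hard cases mentioned above (moved text is credited to whoever moved it, and text restored by a revert is credited to the reverter).

```python
from collections import Counter
from difflib import SequenceMatcher

def blame(revisions):
    """Attribute each character of the latest text to an author.

    `revisions` is a hypothetical list of (author, full_text) pairs in
    chronological order. Returns (latest_text, owners), where owners[i]
    is the author credited with character i of latest_text.
    """
    text, owners = "", []
    for author, new_text in revisions:
        matcher = SequenceMatcher(None, text, new_text, autojunk=False)
        new_owners = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged characters keep their previous author.
                new_owners.extend(owners[i1:i2])
            else:
                # 'insert' and 'replace' credit the new characters to this
                # editor; 'delete' contributes nothing since j1 == j2.
                new_owners.extend([author] * (j2 - j1))
        text, owners = new_text, new_owners
    return text, owners

# Made-up two-revision history.
history = [
    ("User1", "Genetics is a science."),
    ("User2", "Genetics is a natural science."),
]
final_text, owners = blame(history)
print(Counter(owners))
```

A highlighted view like the one proposed would then just wrap the runs of characters whose owner matches the requested user.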
Roman Nosov did an interesting blamemap extension last year for what you describe here. Guy Van den Broeck is working on a Visual Diff in this year's SoC: http://code.google.com/soc/2008/wikimedia/appinfo.html?csaid=9813DF0473619117
Very nice! A live demo is still up and running at http://91.186.7.138:9001/wiki/Freebsd?trackchanges=blamemap
Section "3. Page-level change tracking" of the Quality section of the Feature Map here: https://www.mediawiki.org/wiki/Feature_map#Quality:_Features_that_directly_support_quality_assurance.2C_assessment_and_labeling mentions and links to WikiBlame, Daniel Kinzler's Contributors Script, PARC's WikiDashboard, and a few other tools that individuals can use to understand who contributed to a wiki article. Given the current options, what's the best way to move forward? Perhaps researchers who just want to cite an article's authors could use a user gadget that, for any article, generates a simple list of the authors' names and puts it on the revision history page. As for more complicated needs involving highlighting who-wrote-what, I'm not sure what the best option is.
*** Bug 23327 has been marked as a duplicate of this bug. ***
(In reply to comment #11)
> As for more complicated needs involving highlighting who-wrote-what, I'm not
> sure what the best option is.

I've no idea what's the best option but WikiTrust did that and is seeking a new maintainer: http://lists.wikimedia.org/pipermail/wiki-research-l/2013-September/003068.html
(In reply to Rob Church from comment #7)
> A straight count of all revisions in an article's history wouldn't be too
> bad. Grouping by username, etc. is where the fun comes in, however, since
> it's a more complicated and hence longer query; ultimately, performance is
> affected.

Rob did this in https://www.mediawiki.org/wiki/Extension:Contributors around 2006; I'm adding Yaron, the current maintainer, to cc. Then we have action=credits in core. There are many ways to approach this bug, inside or outside core. The two main lines of work in core can be seen here: https://bugzilla.wikimedia.org/showdependencygraph.cgi?id=39533&showsummary=on&display=tree&rankdir=TB The most advanced features are unlikely to be implemented in core but it's useful to have a map of existing and possible work; I hope the graph above helps.