Last modified: 2014-10-26 21:09:45 UTC
I have often had to step through a page's history revision by revision to find out who added an offending line, or a curious line I need to contact the author about. Since various people mentioned here (such as --TK) did not sign, it would take a while to figure out exactly who TK was. It would be rather nice to be able to highlight or search for a line and be told each time that line was affected, which would let me easily find who added it. This is a feature request, so I labeled it an enhancement, as there's no easy way to request features. Apologies if I did this wrong. I also searched for "Line" and found only a few bugs, none of them like this.
CVS has this on a line-by-line basis, so it's theoretically doable for a word-oriented check (since we have paragraph-oriented text, 'lines' are whole paragraphs and that's not as useful). However I suspect it's optimized by CVS's diff-based storage. This would be spiffy indeed, but it's likely an expensive operation. (Particularly as some pages have thousands of revisions.) Something to keep in mind for the future. Also note that when text is rearranged the results may be misleading.
I've not seen the CVS in action (unless that's the test wiki). Basically, tracking the origins of a paragraph would be a great improvement as it is; the line basis is just nitpicky, as I really had meant paragraph (I guess) to begin with. I just wanted to track the history of a line, being a comment by a person, which would in itself also be a paragraph. As words can be added to lines at any time (and I mean non-paragraph, word-wrapped lines), and line lengths change with word wrapping, per-line tracking would be very processor intensive and only slightly more useful than paragraph history tracing.
I should clarify that I'm talking about CVS itself, the 'cvs annotate' command. It gives output like this, marking each line with the revision number, user, and date that that line was last changed:

1.1 (eloquenc 28-Feb-04): if ( "" == $title && "delete" != $action ) {
1.58 (zhengzhu 22-Sep-04): $wgTitle = Title::newFromText( wfMsgForContent( "mainpage" ) );
1.10 (vibber 08-Mar-04): } elseif ( $curid = $wgRequest->getInt( 'curid' ) ) {
1.1 (eloquenc 28-Feb-04): # URLs like this are generated by RC, because rc_title isn't always accurate
1.10 (vibber 08-Mar-04): $wgTitle = Title::newFromID( $curid );
1.1 (eloquenc 28-Feb-04): } else {
1.1 (eloquenc 28-Feb-04): $wgTitle = Title::newFromURL( $title );
1.1 (eloquenc 28-Feb-04): }
(In reply to comment #3)
> I should clarify that I'm talking about CVS itself, the 'cvs annotate' command. It gives output like this,
> marking each line with the revision number, user, and date that that line was last changed:

Ah, I understand now. Yes, this feature (In paragraph form if not lineform) would be excellent in mediawiki/wikipedia.
*** Bug 1652 has been marked as a duplicate of this bug. ***
*** Bug 1827 has been marked as a duplicate of this bug. ***
(In reply to comment #2)
> I've not seen the CVS in action (Unless that's the test wiki).

Just in case you weren't aware, CVS (Concurrent Versions System) is the source control tool used by the developers. The annotate command is often used to find out who broke what part of the code. :)
*** Bug 4796 has been marked as a duplicate of this bug. ***
This feature is called blame in Subversion. I don't think it's feasible on a per-sentence basis, and we shouldn't worry about getting that out first. I really think this would be useful. Unfortunately, it does seem to be an expensive operation (even Subversion says so). How would it work? Hmm... If we had delta-based histories, a blame operation would be a simple matter of stepping backwards through the history in increments, matching the diffs to current lines until all the lines had been matched, and then spitting that out. However, we have a sort of compressed full-text history, with diffs computed on the fly (correct me if I'm wrong). That suggests to me that the solution would be to generate these delta histories when a blame is requested, and then keep them on file for the rest of eternity. This, however, increases redundancy, and has its own synchronization problems. Perhaps a move to delta compression is in order? Or has it already happened? :? ::is thoroughly confused, but would really like the feature::
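The backwards walk described above can be sketched roughly as follows. This is only an illustration in Python using the standard `difflib` module, not the code from any attachment on this bug; the function name `blame` and the `(rev_id, text)` input format are assumptions made for the example, and it works on lines (which on a wiki are mostly whole paragraphs).

```python
from difflib import SequenceMatcher


def blame(revisions):
    """Attribute each line of the newest revision to the revision that introduced it.

    `revisions` is a non-empty list of (rev_id, text) pairs, oldest first
    (a hypothetical input format chosen for this sketch).
    Returns a list of (rev_id, line) pairs for the newest revision.
    """
    texts = [text.splitlines() for _, text in revisions]
    current = texts[-1]
    origin = [None] * len(current)
    # position[k] = index of current line k inside the revision being examined
    position = {k: k for k in range(len(current))}

    for i in range(len(revisions) - 1, 0, -1):
        if not position:
            break  # every line is attributed; older history need not be read
        older, newer = texts[i - 1], texts[i]
        # Map each line index in `newer` to the matching index in `older`.
        carried = {}
        matcher = SequenceMatcher(a=older, b=newer, autojunk=False)
        for a, b, size in matcher.get_matching_blocks():
            for offset in range(size):
                carried[b + offset] = a + offset
        for k in list(position):
            p = position[k]
            if p in carried:
                position[k] = carried[p]     # line already existed; keep walking back
            else:
                origin[k] = revisions[i][0]  # line was introduced by revision i
                del position[k]

    # Anything still unresolved dates from the oldest revision.
    for k in position:
        origin[k] = revisions[0][0]
    return list(zip(origin, current))
```

The walk stops as soon as every current line has an origin, so recently rewritten pages finish early; a page whose text mostly dates from its first revisions still requires diffing the whole history, which is where the cost concern comes from.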
At WikiSym, a guy was showing off some work he was doing on this kind of stuff. He was basically running the comparisons offline and building a parallel database which could be then queried quickly. Once built, additional diffs can be added in pretty fast as well, at least in theory.
Created attachment 1367: Implementation of blame

So, what this attachment does is create a blame() function, which takes an array of revisions and computes the diff in the form of an Annotation object. See the SimpleTest testcase: it works. It's horrible code though, but I was hoping to get it running on the Toolserver (unfortunately, pulling revisions from the database is also a horribly complicated problem, albeit one that can be bypassed).
Created attachment 1386: Defines Annotation class for annotating based on revisions

Much cleaner code, having been rewritten. A test suite is also going to be uploaded for it. Still needs integration and an AnnotationPrinter.
Created attachment 1387: Test suite for Annotation package.

Test suite for the annotation package. After all, TDD is good.
With the implementation of the Annotation in place, there are several more tasks to do:

1. Hook this code up to a special page
2. Create a new table annotations for storing the cached annotations
3. Create a maintenance script that will munch through all pages and generate all initial annotations
4. Create an AnnotationPrinter
5. Add a hook to edit saves that recompiles the annotation

Items 2, 3 and 5 are necessary in order to make this sort of extension efficient enough for a huge wiki like English Wikipedia. Any comments?
*** Bug 7366 has been marked as a duplicate of this bug. ***
I've decided to unassign myself from this bug. This is a very tricky piece of software to implement and I don't think I'm the most qualified person to do it. That's not to say that the code isn't any good, but it still needs to be integrated with MediaWiki.
I really would like to know who was the *first* to introduce a given sentence/paragraph, so I can hunt down copyright violators and kill them =)
That requires considerably more complexity. You have to decide what happens when lines are split or merged or moved, to begin with.
I think that running an annotation on a page every time it's saved would make saving /very/ slow on pages with large histories. My suggestion would be /only/ updating the annotation for the changed lines, rather than redoing the entire annotation.
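A minimal sketch of that incremental update, assuming the previous annotation is cached as a list of `(rev_id, line)` pairs; the function name `update_annotation` and that cache format are hypothetical and not taken from the attached Annotation class. Only one diff, between the previous and the new revision, is computed per save.

```python
from difflib import SequenceMatcher


def update_annotation(annotation, new_text, new_rev_id):
    """Return the annotation for `new_text`, reusing labels of unchanged lines.

    `annotation` is the cached list of (rev_id, line) pairs for the
    previous revision (hypothetical cache format for this sketch).
    """
    old_lines = [line for _, line in annotation]
    new_lines = new_text.splitlines()
    updated = []
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            updated.extend(annotation[a1:a2])  # untouched lines keep their origin
        else:
            # 'replace' and 'insert' bring in new lines; 'delete' contributes none
            updated.extend((new_rev_id, line) for line in new_lines[b1:b2])
    return updated
```

The cost per save then depends on the size of the page rather than the length of its history. One caveat: a line that is moved, or restored by a revert, is credited to the saving revision, since only the previous revision is consulted.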
Maybe a crazy idea, but anyway: I started using git (the version control tool used for the Linux kernel) two weeks ago and am already amazed at its power and flexibility. It's very fast and has good tools for searching through history. Maybe the whole Wikipedia history could be imported into git? After that, new page saves would be added as new commits; as this is very fast in git, it won't represent a problem for the servers.
To make the git idea more practical, it would also be possible to have a git repository for each Wikipedia page; git is very space efficient, so this would not be a problem (I think it would probably need less space than the DB) and the repositories could be stored on different servers. As pages are effectively independent from each other, a shared repository wouldn't have many advantages.
That would require gutting MediaWiki's internals, breaking compatibility with huge amounts of other implementations; requiring the use of another piece of software, and *could* introduce serious performance problems, despite the "speed of git", as it were. The current use of the database is optimised in various places for speed and overall load balancing as it is. A "blame" command would be nice to have, but it's going to need a sane implementation, not a radical reorganising of literal terabytes of information.
> I have had many times where I would continuously go through a history to find
> out who added an offending line, or a curious line which I need to contact them
> about.

Me too; it sucks! But note that a full-on CVS/Subversion line-by-line "annotate" command is more than this feature really needs to be. All you really need is a box where you can type some text, and click "Find first version of this article containing this text". The code could just look at revisions of the article in a binary-search fashion, so it would be fast. Here's a quick implementation in Perl: http://en.wikipedia.org/wiki/User:TotoBaggins#Wikiblame
Binary search is unacceptable for this. It can return incorrect results in the case of reversions.
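To make the objection concrete, here is a small Python sketch of the binary-search approach (not the Perl tool linked above; the function name and the toy history are made up for illustration). Binary search assumes that once a phrase is added it stays in every later revision, and a revert breaks that assumption.

```python
def first_containing_bisect(revisions, phrase):
    """Binary search for the first revision containing `phrase`.

    `revisions` is a non-empty list of revision texts, oldest first.
    Only reliable if the phrase, once added, is never removed again.
    """
    lo, hi = 0, len(revisions) - 1
    if phrase not in revisions[hi]:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if phrase in revisions[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Made-up history in which the phrase is added, reverted, then re-added.
history = [
    "Intro.",                  # 0
    "Intro. Disputed claim.",  # 1: phrase first added
    "Intro.",                  # 2: reverted
    "Intro.",                  # 3
    "Intro. Disputed claim.",  # 4: phrase re-added
    "Intro. Disputed claim.",  # 5
]
```

Here `first_containing_bisect(history, "Disputed claim.")` returns 4, although the phrase first appeared in revision 1. The search always lands on some revision where the phrase was added, but which one depends on where the midpoints happen to fall.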
*** Bug 9455 has been marked as a duplicate of this bug. ***
I'll repost my request 9455 here, as it's rather simpler to implement than the original request, and possibly less expensive:

---
It would be useful to be able to search in the prior revisions of a page in two modes:
* Search backwards to find the first time when a specified piece of text appears (ie, when it was added)
* Search backwards to find the last time that a specified piece of text appears (ie, when it was removed)

Ideally one day it would be great to be able to click on text and see who added it. But in the meantime, it would be great to simply be able to search for a phrase like "He was a supporter of Hitler." and to be able to leap to the revision when that text first appeared. (A slightly souped-up version might show a condensed history consisting of groups of revisions where the phrase appears at least once, followed by groups of revisions where it doesn't appear at all.)
---

I notice that it would not be susceptible to whole paragraphs being moved around as Brion commented. Since we would only be detecting whether the given phrase exists or not, two successive revisions where the phrase existed (but in different locations) would be treated the same. It ought to be less expensive as there is no diffing involved, just a simple text search: Does the phrase exist in revision T-1? No. Does the phrase exist in revision T-2? No. Does the phrase exist in T-3? Yes. Stop.
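A rough Python sketch of this phrase search, assuming the revisions are already available as `(rev_id, text)` pairs ordered oldest to newest; the function names are hypothetical. The first function builds the "condensed history" of present/absent runs, and the second walks backwards from the newest revision to find where the phrase most recently appeared.

```python
from itertools import groupby


def phrase_runs(revisions, phrase):
    """Condense a history into runs of consecutive revisions.

    `revisions` is a list of (rev_id, text) pairs, oldest first.
    Returns (present, first_rev_id, last_rev_id) triples.
    """
    runs = []
    for present, group in groupby(revisions, key=lambda rev: phrase in rev[1]):
        ids = [rev_id for rev_id, _ in group]
        runs.append((present, ids[0], ids[-1]))
    return runs


def latest_addition(revisions, phrase):
    """Walk backwards from the newest revision and return the rev_id in which
    the phrase most recently (re)appeared, or None if the newest revision
    does not contain it."""
    added_in = None
    for rev_id, text in reversed(revisions):
        if phrase not in text:
            break  # the revision after this one is where the phrase came in
        added_in = rev_id
    return added_in
```

Both are plain substring checks with no diffing, so a paragraph that merely moves within the page does not affect the result. The backwards walk can stop at the first revision lacking the phrase, whereas the condensed history has to read every revision.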
*** Bug 10031 has been marked as a duplicate of this bug. ***
There is an extension [1] that does this now. WONTFIX? [1] http://www.mediawiki.org/wiki/Extension:Annotation
No. This is an important feature for reasonably effective version control and should be in core if at all possible.
I was checking out the article on Noah Webster for Americanized words, and noticed that the section on it seemed to incorrectly reference American words as British, and vice versa. I wasn't sure where the problem lay (was it specifying them wrong, or had they been swapped?), so I checked a slightly older version which had them correct. It took a few clicks on "next" (as I had not realized it was so recent) to find the culprit: http://en.wikipedia.org/w/index.php?title=Noah_Webster&diff=prev&oldid=166613821 Some users might have just thought that it was possibly old vandalism and corrected it by hand. The problem there, as evidenced by the edit I link, is that there was more vandalism than just the section I had noticed it in. The benefit of a blame system shines here, where I can see which revision the edit occurred in and spot additional, previously hidden, edits. I'm back at my bug, 3 years and 6 dupes later, and I can't really see what the exact status of this bug is. I do like the new partial undo feature though, that is really nice.
The most important point for this bug is that it's not at all simple to do with a relational database system. If we had something like git or Bazaar as a backend for revision storage, it would be trivial. The interesting questions at this point seem to be:

1a) If someone were to implement version storage for MediaWiki on top of something like git or Bazaar in a manner that doesn't sacrifice existing efficiency, is the Wikimedia Foundation willing to put in the time and effort to transfer the major projects? Or even the minor projects, to start with? (Probably not going to get an unambiguous "yes" here without progress on (2a).)
1b) If so, is anyone willing to do it? (So far, no, and probably not going to be yes unless (1a) is fulfilled.)
2a) Is it possible to implement blame efficiently and scalably on top of an RDBMS? (No evidence for a yes to this that I've seen: Ambush Commander admits that his work is not efficient enough for use right now.)
2b) If so, is anyone willing and able to do it? (So far, no, and definitely not going to be yes unless (2a) is fulfilled.)

The picture is unlikely to change at any time in the foreseeable future, unless we get someone to step forward and put in a lot of work that may or may not end up amounting to something. Put another way, in standard open-source fashion: if you really want it, you're going to have to write it yourself.
*** Bug 13927 has been marked as a duplicate of this bug. ***
*** Bug 18810 has been marked as a duplicate of this bug. ***
*** Bug 18218 has been marked as a duplicate of this bug. ***
I was also interested in this, but thinking about the day-to-day edits on a wiki, I think that a blame/annotate SVN/CVS-like feature is not feasible in a MediaWiki installation, especially in a public one where vandalism is common. The annotation feature makes sense in a controlled development system where changes are not very large. But here at Wikimedia (and other public wikis) where we deal with vandalism, it's common for vandals to blank pages or large sections of a page. That defeats the whole annotation system, since all lines would be marked as changed. Instead, the idea of Steve Bennett at Comment 26 (posted on Bug 9455) would be more useful here, as it only needs a text or pattern search of every revision text. That could also be implemented using JavaScript, retrieving every revision text through the API and doing the search. Bug 9455 was closed as a resolved duplicate of this one, but I think it's worth reopening it and probably implementing it if this one isn't going to be implemented.
(In reply to comment #28)
> There is an extension [1] that does this now. WONTFIX?
>
> [1] http://www.mediawiki.org/wiki/Extension:Annotation

The WikiTrust userscript also has this functionality: https://de.wikipedia.org/wiki/Benutzer:NetAction/WikiTrust/WikiPraise
The tool from Comment 26 is now available at http://wikipedia.ramselehof.de/wikiblame.php