Last modified: 2014-11-17 10:36:11 UTC
In the example URL, the lines beginning "Sed elit" on each side differ by only one character. These should be treated as "equivalent lines", shown in the same row of the table, and given word-by-word highlighting. This is a general feature request for more intelligence in deciding what lines are "equivalent".
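To illustrate what "equivalent" could mean here, a minimal sketch in Python follows. It is not how MediaWiki's diff engine works; it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and the function name `pair_similar_lines`, the greedy strategy, and the 0.6 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def pair_similar_lines(old_lines, new_lines, threshold=0.6):
    """Greedily pair each old line with the most similar unused new line.

    Returns a list of (old_index, new_index) pairs. Lines with no partner
    above the (hypothetical) similarity threshold stay unpaired and would
    be shown as plain deletions or insertions.
    """
    pairs = []
    used = set()
    for i, old in enumerate(old_lines):
        best_j, best_ratio = None, threshold
        for j, new in enumerate(new_lines):
            if j in used:
                continue
            # ratio() is 1.0 for identical strings and near 0.0 for unrelated ones.
            ratio = SequenceMatcher(None, old, new).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


old = ["Lorem ipsum dolor sit amet", "Sed elit sapien, dictum ac"]
new = ["An entirely new opening paragraph", "Sed elit sapien, dictum at"]
print(pair_similar_lines(old, new))
```

In this toy input the two "Sed elit" lines differ by one character and end up paired, while the unrelated first lines do not. A real implementation would also need to preserve line order and avoid the quadratic cost of comparing every line with every other.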
The solution might affect bug 13466 somehow.
*** Bug 24618 has been marked as a duplicate of this bug. ***
'This is a general feature request for more intelligence in deciding what lines are "equivalent".'

I agree with Random832. An acquaintance of mine gives this example:
http://dbclass.saintjoe.edu/wiki/index.php/Demo_Context

The diff:
http://dbclass.saintjoe.edu/wiki/index.php?title=Demo_Context&diff=2773&oldid=2772

This person teaches English composition. He uses MediaWiki to do it. His students type their essays into MediaWiki, he improves them via an edit, and then they look at the diff together to understand what he changed and why. He finds that the diff calculation in MediaWiki is not robust enough and fails to sensibly show linebreak changes in some instances, and that this makes it much harder to use the diffs as a teaching tool.

"There were very minimal changes made to the article between the first and second revisions; however, I did add a number of paragraph breaks, and coalesced a couple of paragraphs.

"You can see that the paragraph breaks caused the diff "discernment function" to identify whole paragraphs as changes, when in fact all that happened with the addition of a simple line break."

Adding the "design" keyword to ping a designer to consider what we should really be doing regarding various diff generation and diff-viewing edge cases.
Agree, visiting here to report a similar request with this example:
http://en.wikipedia.org/w/index.php?diff=470024993&oldid=469833887

Examples of issues that should have been noticed by the diff engine/formatter:

* Line starting "{{cquote|It is surprising" -- same or virtually same line appears left and right, diff engine fails to match them with no obvious reason why that should be. So they appear as a deletion + insertion, rather than shown adjacent. Common problem.
* Same occurs lower down with line starting ":* Fibres from"
* Under heading "=== Subsequent events ===" -- a paragraph has been added starting "An inquest into..." Surrounding text is unchanged. Instead of recognizing this as a simple one-paragraph addition, it's treating it as a removal of one paragraph and change to all text in all following paragraphs (ie believes each para has changed when they have merely moved down one para simultaneously due to the insertion). The last 2 paras in the section are then treated as new insertions which they aren't.
* Line starting ": "I don't" edited to add a {{cquote| template. Instead of recognizing the few extra characters diff treated it as a completely substituted new paragraph.
*** Bug 349 has been marked as a duplicate of this bug. ***
Created attachment 9885
dwdiff

Histories are full of completely useless diffs like this
https://www.mediawiki.org/w/index.php?title=Help%3AExtension%3ATranslate&action=historysubmit&diff=489225&oldid=487083
(just a random example, things can get much worse).

Word-level diff gives better results in such cases, see screenshot of a simple dwdiff -c (1.9; I see there are further improvements in later releases).
(In reply to comment #6)
> Word-level diff gives better results in such cases, see screenshot of a simple
> dwdiff -c (1.9; I see there are further improvements in later releases).

According to docs (which are outdated) wikidiff2 «performs word-level (space-delimited) diffs» (now they're [always?] character-level), so it probably should be able to handle whitespace in a more sensible way, but I don't know how the different features can be merged/balanced. Moving under wikidiff2 anyway.
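For readers unfamiliar with the idea, here is a rough Python sketch of a word-level (whitespace-delimited) diff. It reproduces neither dwdiff nor wikidiff2; it only shows why comparing word sequences rather than lines makes moved line breaks disappear from the result. The function name `word_diff` is hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for whatever algorithm those tools actually use.

```python
from difflib import SequenceMatcher


def word_diff(old_text, new_text):
    """Return the non-equal opcodes of a diff computed over words.

    Each entry is (tag, old_words, new_words), where tag is one of
    'replace', 'delete' or 'insert'.
    """
    # str.split() with no argument splits on any run of whitespace,
    # so a line break that merely moved does not register as a change.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "The quick brown fox\njumps over the lazy dog."
new = "The quick brown fox jumps\nover the sleepy dog."
print(word_diff(old, new))
# [('replace', ['lazy'], ['sleepy'])]
```

A line-based diff of the same input would flag both lines as changed, whereas the word-based one reports only the single substituted word. The trade-off is that whitespace changes become invisible, which is the balancing problem mentioned above.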
The bad matching of paragraphs is definitely harming my productivity. Raising this to a bug to give it the credit it deserves.
*** Bug 23704 has been marked as a duplicate of this bug. ***
Here is a fresh example where the diff algorithm fails: http://de.wikipedia.org/w/index.php?title=Holland-America_Line&diff=prev&oldid=103985082
(In reply to comment #10)
> Here is a fresh example where the diff algorithm fails:
>
> http://de.wikipedia.org/w/index.php?title=Holland-America_Line&diff=prev&oldid=103985082

That example is still kind of annoying, yeah.
(In reply to comment #11)
> (In reply to comment #10)
> > http://de.wikipedia.org/w/?diff=103985082
>
> That example is still kind of annoying, yeah.

As announced in bug #33331, I have improved my user script a lot in the past months.

http://de.wikipedia.org/wiki/Benutzer_Diskussion:TMg/cleanDiff.js

Besides other features (it shrinks the word-level highlighting to character-level and improves the highlighting for single characters), it also fixes bad line matching like in the example above.

One of the reasons for bad line matching is spaces in otherwise empty lines. If a space is added to or removed from an empty line, the diff algorithm gets confused. It tries to find another empty line with the same number of spaces. It will find one, but in almost all cases these empty lines don't belong together.

My proposed fix is to simply ignore all trailing whitespace when matching lines. Trailing whitespace never has any meaning in the wiki syntax. It's good to highlight it in the diff, but it should be ignored in the first step, when the algorithm tries to match lines.
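The proposal can be sketched in a few lines of Python. This is neither the cleanDiff.js code nor wikidiff2's; it is an illustration under assumptions, with a hypothetical function name and the standard library's `difflib.SequenceMatcher` as the matching step. Lines are matched on a key with trailing whitespace stripped, and the whitespace difference is still recorded so that it can be highlighted afterwards.

```python
from difflib import SequenceMatcher


def match_lines_ignoring_trailing_ws(old_lines, new_lines):
    """Match lines on their right-stripped form.

    Returns (old_index, new_index, whitespace_changed) for every pair of
    lines that match once trailing whitespace is ignored.
    """
    # Step 1: build matching keys without trailing whitespace.
    old_keys = [line.rstrip() for line in old_lines]
    new_keys = [line.rstrip() for line in new_lines]
    matcher = SequenceMatcher(None, old_keys, new_keys, autojunk=False)

    matches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            continue
        for i, j in zip(range(i1, i2), range(j1, j2)):
            # Step 2: compare the original lines so the whitespace change
            # can still be highlighted in the rendered diff.
            matches.append((i, j, old_lines[i] != new_lines[j]))
    return matches


old = ["First paragraph.", "", "Second paragraph."]
new = ["First paragraph.", " ", "Second paragraph."]
print(match_lines_ignoring_trailing_ws(old, new))
# [(0, 0, False), (1, 1, True), (2, 2, False)]
```

The empty line that gained a space is still paired with its original counterpart, and the third tuple element flags it for highlighting.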
Ignoring only trailing whitespace is not enough:
https://test.wikipedia.org/w/index.php?diff=199552&oldid=199551
I'm removing myself from cc as I prepare to leave Wikimedia Foundation, but I will leave my 2 cents here: improving the line matching in diffs seems, to me, a cool project that could go in https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects .
Cc bawolff due to https://lists.wikimedia.org/pipermail/wikitech-l/2014-November/079427.html
(In reply to Nemo from comment #15)
> Cc bawolff due to
> https://lists.wikimedia.org/pipermail/wikitech-l/2014-November/079427.html

It should be noted that displaying diffs and doing edit merges/edit conflicts use two different code paths, probably with different algorithms. Better line matching would be nice in both cases.