Last modified: 2014-11-17 10:36:11 UTC

Bug 13462 - Enhance line matching in diffs
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: wikidiff2
Hardware: All All
Importance: Normal normal with 6 votes
Target Milestone: Future release
Assigned To: Nobody - You can work on this!
Keywords: design
Duplicates: 349 23704 24618
Depends on:
Blocks: 70163
Reported: 2008-03-21 03:08 UTC by Random832
Modified: 2014-11-17 10:36 UTC (History)
CC List: 15 users


dwdiff (205.30 KB, image/png)
2012-01-22 19:50 UTC, Nemo

Description Random832 2008-03-21 03:08:15 UTC
In the example URL, the lines beginning "Sed elit" on each side differ by only one character. They should be treated as "equivalent lines": shown in the same row of the table, with word-by-word highlighting. This is a general feature request for more intelligence in deciding which lines are "equivalent".
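
To illustrate the kind of matching requested above, here is a minimal sketch in Python. It is not how wikidiff2 works: the function name `pair_similar_lines` and the 0.6 threshold are assumptions made for illustration, and the similarity score comes from Python's standard `difflib`, not from MediaWiki. The idea is to pair each removed line with the most similar added line, so near-identical lines end up in the same row.

```python
from difflib import SequenceMatcher


def pair_similar_lines(removed, added, threshold=0.6):
    """Greedily pair each removed line with the most similar unused added line.

    Returns a list of (old, new) rows. A row with None on one side is a
    pure deletion or a pure insertion. The threshold is an arbitrary,
    hypothetical cut-off, not a value taken from wikidiff2.
    """
    pairs = []
    unused = list(range(len(added)))
    for old in removed:
        best_index, best_score = None, threshold
        for i in unused:
            score = SequenceMatcher(None, old, added[i]).ratio()
            if score >= best_score:
                best_index, best_score = i, score
        if best_index is None:
            # Nothing similar enough: show as a plain deletion.
            pairs.append((old, None))
        else:
            unused.remove(best_index)
            pairs.append((old, added[best_index]))
    # Added lines that were never paired are plain insertions.
    pairs.extend((None, added[i]) for i in unused)
    return pairs


rows = pair_similar_lines(
    ["Sed elit nunc, porta.", "0000"],
    ["1111", "Sed elit nunc, porto."],
)
```

With this input the two "Sed elit" lines share a row even though they sit at different positions, while "0000" and "1111" remain a separate deletion and insertion. A real implementation would also have to keep rows in document order, which this greedy sketch ignores.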
Comment 1 Huji 2008-03-21 08:50:01 UTC
The solution might affect bug 13466 somehow.
Comment 2 Alexandre Emsenhuber [IAlex] 2010-08-01 08:53:09 UTC
*** Bug 24618 has been marked as a duplicate of this bug. ***
Comment 3 Sumana Harihareswara 2011-10-07 23:35:50 UTC
'This is a general feature request for more intelligence in deciding what lines are "equivalent".'  I agree with Random832.

An acquaintance of mine gives this example:

The diff:

This person teaches English composition.  He uses MediaWiki to do it.  His students type their essays into MediaWiki, he improves them via an edit, and then they look at the diff together to understand what he changed and why.  He finds that the diff calculation in MediaWiki is not robust enough and fails to sensibly show linebreak changes in some instances, and that this makes it much harder to use the diffs as a teaching tool.

"There were very minimal changes made to the article between the first and second revisions; however, I did add a number of paragraph breaks, and coalesced a couple of paragraphs.

"You can see that the paragraph breaks caused the diff "discernment function" to identify whole paragraphs as changes, when in fact all that happened was the addition of a simple line break."

Adding the "design" keyword to ping a designer to consider what we should really be doing regarding various diff generation and diff-viewing edge cases.
Comment 4 FT2 2012-01-07 06:11:49 UTC
Agree, visiting here to report a similar request with this example:

Examples of issues that should have been noticed by the diff engine/formatter:

* Line starting "{{cquote|It is surprising" -- same or virtually same line appears left and right, diff engine fails to match them with no obvious reason why that should be. So they appear as a deletion + insertion, rather than shown adjacent. Common problem.

* Same occurs lower down with line starting ":* Fibres from"

* Under heading "=== Subsequent events ===" -- a paragraph has been added starting "An inquest into..." Surrounding text is unchanged. Instead of recognizing this as a simple one-paragraph addition, it's treating it as a removal of one paragraph and change to all text in all following paragraphs (ie believes each para has changed when they have merely moved down one para simultaneously due to the insertion). The last 2 paras in the section are then treated as new insertions which they aren't.

* Line starting ": "I don't" edited to add a {{cquote| template. Instead of recognizing the few extra characters diff treated it as a completely substituted new paragraph.
Comment 5 Nemo 2012-01-22 19:19:25 UTC
*** Bug 349 has been marked as a duplicate of this bug. ***
Comment 6 Nemo 2012-01-22 19:50:55 UTC
Created attachment 9885

Histories are full of completely useless diffs like this (just a random example, things can get much worse).

Word-level diff gives better results in such cases, see screenshot of a simple dwdiff -c (1.9; I see there are further improvements in later releases).
Comment 7 Nemo 2012-01-30 15:29:36 UTC
(In reply to comment #6)
> Word-level diff gives better results in such cases, see screenshot of a simple
> dwdiff -c (1.9; I see there are further improvements in later releases).

According to docs (which are outdated) wikidiff2 «performs word-level (space-delimited) diffs» (now they're [always?] character-level), so it probably should be able to handle whitespace in a more sensible way, but I don't know how the different features can be merged/balanced. Moving under wikidiff2 anyway.
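
For readers unfamiliar with the term, a "word-level (space-delimited) diff" can be sketched as follows. This is only an illustration using Python's standard `difflib`; the function name `word_diff` is hypothetical and the code does not reproduce wikidiff2's or dwdiff's actual algorithm.

```python
from difflib import SequenceMatcher


def word_diff(old_line, new_line):
    """Return the non-equal operations between two lines, compared word by word.

    Words are split on single spaces only, so runs of spaces produce empty
    "words" and whitespace changes remain visible in the result.
    """
    old_words = old_line.split(" ")
    new_words = new_line.split(" ")
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    operations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            operations.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return operations


changes = word_diff("Sed elit nunc porta", "Sed elit nunc porto")
```

Here `changes` contains a single "replace" operation covering only the last word, which is the granularity a word-level diff highlights.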
Comment 8 Niklas Laxström 2012-04-12 10:35:57 UTC
The bad matching of paragraphs is definitely harming my productivity. Raising to a bug to give it credit it should have.
Comment 9 Nemo 2012-04-13 08:59:39 UTC
*** Bug 23704 has been marked as a duplicate of this bug. ***
Comment 10 TMg 2012-06-08 10:24:05 UTC
Here is a fresh example where the diff algorithm fails:
Comment 11 Sumana Harihareswara 2012-12-28 01:03:44 UTC
(In reply to comment #10)
> Here is a fresh example where the diff algorithm fails:
> America_Line&diff=prev&oldid=103985082

That example is still kind of annoying, yeah.
Comment 12 TMg 2012-12-31 12:06:16 UTC
(In reply to comment #11)
> (In reply to comment #10)
> >
> That example is still kind of annoying, yeah.

As announced in bug #33331 I improved my user script a lot in the past months.

Besides other features (it shrinks the word-level highlighting to character-level and improves the highlighting for single characters) it also fixes bad line matching like in the example above.

One of the reasons for bad line matching is spaces in otherwise empty lines. If a space is added to or removed from an empty line, the diff algorithm gets confused. It tries to find another empty line with the same number of spaces. It will find one, but in almost all cases these empty lines don't belong together.

My proposed fix is to simply ignore all trailing whitespace when matching lines. Trailing whitespace never has any meaning in the wiki syntax. It's good to highlight it in the diff, but it should be ignored in the first step, when the algorithm tries to match lines.
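
The proposal above could be sketched roughly like this. It is an illustration in Python using the standard `difflib`, not the user script or wikidiff2 code; the function name is hypothetical. Lines are matched on a copy with trailing whitespace stripped, while the original lines are kept so that whitespace changes can still be highlighted afterwards.

```python
from difflib import SequenceMatcher


def match_lines_ignoring_trailing_whitespace(old_lines, new_lines):
    """Return (old_index, new_index) pairs of lines considered the same line.

    Matching uses right-stripped copies of the lines, so "" and " " count
    as equal. The indices refer to the original, unstripped lists.
    """
    old_keys = [line.rstrip() for line in old_lines]
    new_keys = [line.rstrip() for line in new_lines]
    matcher = SequenceMatcher(None, old_keys, new_keys, autojunk=False)
    matched = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            matched.append((block.a + offset, block.b + offset))
    return matched


old = ["Intro", "", "Body", " ", "End"]
new = ["Intro", " ", "Body", "", "End"]
matched = match_lines_ignoring_trailing_whitespace(old, new)

# Matched pairs whose original text differs are whitespace-only changes
# that the display step can still highlight.
whitespace_only = [(i, j) for i, j in matched if old[i] != new[j]]
```

In this example every line is matched with its counterpart at the same position, and `whitespace_only` lists the two empty lines whose trailing space changed.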
Comment 13 Gryllida 2014-05-16 10:43:29 UTC
Ignoring only trailing whitespace is not enough.
Comment 14 Sumana Harihareswara 2014-09-23 19:43:25 UTC
I'm removing myself from cc as I prepare to leave Wikimedia Foundation, but I will leave my 2 cents here: improving the line matching in diffs seems, to me, a cool project that could go in .
Comment 16 Bawolff (Brian Wolff) 2014-11-13 07:48:25 UTC
(In reply to Nemo from comment #15)
> Cc bawolff due to

It should be noted that displaying diffs and doing edit merges/edit conflicts use two different code paths, probably with different algorithms. Better line matching would be nice in both cases.
