Last modified: 2009-12-30 05:07:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T15209, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 13209 - Add Revision Diff functionality to API


Summary:	Add Revision Diff functionality to API

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	API (Other open bugs)
Version:	1.12.x
Hardware:	All All

Importance:	Normal enhancement with 1 vote (vote)
Target Milestone:	---
Assigned To:	Brion Vibber

URL:
Whiteboard:
Keywords:	patch, patch-need-review

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2008-03-01 15:05 UTC by OverlordQ
Modified:	2009-12-30 05:07 UTC (History)
CC List:	6 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Patch to hook it into the API (1.16 KB, patch) 2008-03-01 15:05 UTC, OverlordQ	Details
The actual functionality (5.25 KB, text/plain) 2008-03-01 15:07 UTC, OverlordQ	Details
Patch that adds diff functionality to prop=revisions (11.06 KB, patch) 2009-01-12 23:40 UTC, Roan Kattouw	Details
Show Obsolete (2) Add an attachment (proposed patch, testcase, etc.)

Description OverlordQ 2008-03-01 15:05:09 UTC

Created attachment 4684 [details]
Patch to hook it into the API

The current methods of getting a diff:

1) Requesting both revisions, running diff algo on it.
2) Scraping the diff html output

are rather sub par.  A handier/nicer way would be to add this option to the api that returns a nice clean format.

Comment 1 OverlordQ 2008-03-01 15:07:53 UTC

Created attachment 4685 [details]
The actual functionality

It's horrible horrible code that is probably more complex then it should be, and is most likely horribly inefficient.

The only thing going for it, is that it works.

http://www.thedarkcitadel.com/w/api.php?action=query&prop=diff&titles=Main%20Page
or
http://www.thedarkcitadel.com/w/api.php?action=query&prop=diff&titles=Rainbow%20Tables

Comment 2 Bryan Tong Minh 2008-03-01 15:15:25 UTC

Don't assume text to be located in the text table, since they are not. You should use Revision::getRevisionText.

Comment 3 Victor Vasiliev 2008-03-01 15:24:22 UTC

This bug was fixed once and then reverted because of performance issues.

Comment 4 Roan Kattouw 2008-03-01 20:41:24 UTC

(In reply to comment #3)
> This bug was fixed once and then reverted because of performance issues.

To elaborate on this a little further: there are several problems I ran into when implementing API diffs:

* Diffs are cached, so for performance reasons you wanna use the diff cache. However, the diff cache contains prefab HTML.
* Diff generation in PHP is slow as hell, so we use a C++ program for that (wikidiff2). However, wikidiff2 outputs (you guessed it) HTML.

Caching diffs rather than HTML is:
* A huge job, because it requires rewriting large parts of wikidiff2 and MediaWiki's diff handling code
* Memory-intensive, as arrays take truckloads of memory in PHP
* A performance hit, as the diff has to be processed and HTML generated on every diff view, rather than once and for all

Comment 5 Gurch 2008-10-05 13:16:27 UTC

I would settle for an API query that wrapped the already-generated HTML representation of the diff in <api><diff> ... </diff></api> tags and output that. It would still be more efficient at both ends than getting the same information through index.php. Any chance of this?

Comment 6 Roan Kattouw 2008-10-05 20:18:20 UTC

(In reply to comment #5)
> I would settle for an API query that wrapped the already-generated HTML
> representation of the diff in <api><diff> ... </diff></api> tags and output
> that. It would still be more efficient at both ends than getting the same
> information through index.php.

It wouldn't: with the right parameter mix (I don't know the exact mix, others do) you can get just the diff from index.php .

Comment 7 Gurch 2008-12-04 00:17:22 UTC

Reopening this; I understand that providing machine-readable diffs is not feasible for performance reasons. However, providing an API query that outputs the HTML diffs (already generated by MediaWiki) suitably wrapped would not, as suggested above, be merely redundant to an existing UI query. It is something that would be useful for two key reasons:

* Machine-readable output in the event of errors, something that the UI does not always, or even usually, provide
* It would be relatively simple to implement an API query format that, unlike the UI, could return several diffs in the same request. This would make fetching many diffs easier on the servers, something that is done in the process of dealing with vandalism for example.

Comment 8 Roan Kattouw 2008-12-04 14:34:18 UTC

(In reply to comment #7)
> Reopening this; I understand that providing machine-readable diffs is not
> feasible for performance reasons. However, providing an API query that outputs
> the HTML diffs (already generated by MediaWiki) suitably wrapped would not, as
> suggested above, be merely redundant to an existing UI query. It is something
> that would be useful for two key reasons:
> 
> * Machine-readable output in the event of errors, something that the UI does
> not always, or even usually, provide
> * It would be relatively simple to implement an API query format that, unlike
> the UI, could return several diffs in the same request. This would make
> fetching many diffs easier on the servers, something that is done in the
> process of dealing with vandalism for example.
> 

I guess those are good arguments for implementing this. I'll do it soon, but note that its usefulness is severely limited because the diff format depends on the wiki's configuration. On a wiki using MW's built-in diff functionality (slow!) or wikidiff2 (WMF uses this), you'll get an HTML diff polluted with <table>, <tr>, <td>, <ins> and <del> tags. But if that's what you want, you can get it :)

Comment 9 AlexSm 2008-12-04 14:59:43 UTC

The parameters mentioned by Roan in #6 are
&action=render&diffonly=yes

This outputs just HTML diff.
(Need to import /skins-1.5/common/diff.css to make it look better)

I don't really see why adding <api><diff> is needed at all.

Comment 10 Roan Kattouw 2008-12-04 15:04:51 UTC

(In reply to comment #9)
> The parameters mentioned by Roan in #6 are
> &action=render&diffonly=yes
> 
> This outputs just HTML diff.
> (Need to import /skins-1.5/common/diff.css to make it look better)
> 
> I don't really see why adding <api><diff> is needed at all.
> 

Because it'd allow fetching lots of diffs at the same time. I'll have to ask Brion about this, though, I think requesting lots of uncached diffs may blow up stuff.

Comment 11 Gurch 2008-12-04 20:36:09 UTC

(In reply to comment #8)
> On a wiki using MW's built-in diff functionality
> (slow!) or wikidiff2 (WMF uses this), you'll get an HTML diff polluted with
> <table>, <tr>, <td>, <ins> and <del> tags. But if that's what you want, you can
> get it :)

Sometimes the intended purpose is display to the user anyway, in which case it's not a problem. For those cases where it isn't, it's not *that* difficult to parse the HTML and get the information into a more machine-friendly format.

(I have, of course, considered fetching the two revisions, then producing the diff locally. But when the change is small relative to the size of the page, which is often, and the user is on a slow connection and/or has a slow machine, this option takes much longer, and alas there are still many people out there on dial-up. :(

Obviously if this blows stuff up then forget it, Brion has the last word on that. :)

Comment 12 Gurch 2008-12-04 20:37:44 UTC

(In reply to comment #10)
> Because it'd allow fetching lots of diffs at the same time. I'll have to ask
> Brion about this, though, I think requesting lots of uncached diffs may blow up
> stuff.

Oh, and by "lots" I didn't have in mind hundreds. Even if it was limited to, say, five, it would be useful.

Comment 13 Brion Vibber 2008-12-15 18:17:40 UTC

extractRowInfo($row) is bogus -- it's assuming that text.old_text contains the final wikitext of the revision, which is wrong. You *must* fetch text via the Revision class, or you'll break on compressed text, legacy encodings, cur row proxies, batch compression, or external storage.

Comment 14 Roan Kattouw 2008-12-15 18:59:54 UTC

(In reply to comment #13)
> extractRowInfo($row) is bogus -- it's assuming that text.old_text contains the
> final wikitext of the revision, which is wrong. You *must* fetch text via the
> Revision class, or you'll break on compressed text, legacy encodings, cur row
> proxies, batch compression, or external storage.
> 

Yes, the patch is fundamentally flawed, we know about that. I'll mark it obsolete.

The real question we have for you though is if you see any harm in clients being able to pull 50 (500 for privileged users) diffs in one request, considering they might have to be generated at that point. Is 50/500 diff generations in one request evil enough to deny this feature?

Comment 15 Brion Vibber 2008-12-15 19:06:24 UTC

Well, they can pull 50 or 500 diffs in individual requests if they like already. :)

Without wikidiff2 present, diff generation is hideously slow, however, making lots of hits a bit of a danger to begin with. This is why for instance the RSS feed attempts to avoid doing diffs it thinks might be too big or slow (though it doesn't do a very good job of this iirc).

In short: I hate diffs, they suck. ;)

Considerations:

* Do we really need to request multiple diffs at once, or do we just want an API point for requesting *a* diff in a consistent fashion?

* Do we care that the result is going to be a big wad of HTML which will change depending on the version of MediaWiki, the diff backend installed, etc?

* What's the actual intention of fetching the diff anyway?

Comment 16 Roan Kattouw 2008-12-15 19:12:10 UTC

(In reply to comment #15)
> Well, they can pull 50 or 500 diffs in individual requests if they like
> already. :)
> 
Well yeah, but that'll at least load-balance somewhat nicely.

> Without wikidiff2 present, diff generation is hideously slow, however, making
> lots of hits a bit of a danger to begin with. This is why for instance the RSS
> feed attempts to avoid doing diffs it thinks might be too big or slow (though
> it doesn't do a very good job of this iirc).
> 
> In short: I hate diffs, they suck. ;)
> 
I second that.

> Considerations:
> 
> * Do we really need to request multiple diffs at once, or do we just want an
> API point for requesting *a* diff in a consistent fashion?
>
> * Do we care that the result is going to be a big wad of HTML which will change
> depending on the version of MediaWiki, the diff backend installed, etc?
> 
For answers to these questions, read comment #7, #8, #10 and #11.

> * What's the actual intention of fetching the diff anyway?
> 
That doesn't seem to be mentioned anywhere.

Comment 17 Gurch 2008-12-15 19:34:25 UTC

(In reply to comment #15)
> * Do we really need to request multiple diffs at once, or do we just want an
> API point for requesting *a* diff in a consistent fashion?
> 
> * Do we care that the result is going to be a big wad of HTML which will change
> depending on the version of MediaWiki, the diff backend installed, etc?
> 
> * What's the actual intention of fetching the diff anyway?

It seems that the only person who actually wants this is me, so I guess don't bother too much about it if it's a lot of work. Things work OK how they are, at least nothing has fallen over yet.

But in my case, the answers to these questions:

* Ideally multiple diffs, but single ones would still mean simpler client-side code and cleaner error handling.

* Not really, since it's going to be presented to the user anyway.

* The intended use is for recent changes patrolling on the English Wikipedia. Users need to review diffs fairly constantly at short intervals over a reasonable period of time, and this accounts for the bulk of web requests associated with rc patrolling. I am assuming -- quite possibly incorrectly -- that it would be easier on the servers if the client was asking for, say, 2 diffs every 6 seconds rather than 1 diff every 3 seconds. If that assumption is complete bollocks then please disregard anything about multiple diffs. :) The trade-off here is minimizing the delay between the user wanting the diff and having it, while also not being excessive with web requests. I keep said delay close to zero by pre-loading diffs a few in advance of the one they're actually looking at, hence it would be easy to fetch them two or three at a time.

Comment 18 Gurch 2008-12-15 19:36:30 UTC

Another things I should clarify is that all the diffs that such requests might generate would have been generated anyway, just possibly in separate requests. And at least for en.wikipedia RC patrolling, if an edit looks even in the least suspicious then it is almost guaranteed that a diff is going to be generated for it fairly quickly because there is always at least someone patrolling.

Comment 19 Roan Kattouw 2008-12-15 20:53:41 UTC

(In reply to comment #18)
> Another things I should clarify is that all the diffs that such requests might
> generate would have been generated anyway, just possibly in separate requests.
> And at least for en.wikipedia RC patrolling, if an edit looks even in the least
> suspicious then it is almost guaranteed that a diff is going to be generated
> for it fairly quickly because there is always at least someone patrolling.
> 

That applies to your tool, but the API isn't restricted to your tool. If it's *possible* to blow stuff up, your tool might not use that 'feature', but someone else might.

(In reply to comment #17)
> It seems that the only person who actually wants this is me, so I guess don't
> bother too much about it if it's a lot of work. Things work OK how they are, at
> least nothing has fallen over yet.
> 
It's not that much work, but it's more than just a few lines, which is why I wanna get Brion's approval before I implement this.

> But in my case, the answers to these questions:
> 
> * Ideally multiple diffs, but single ones would still mean simpler client-side
> code and cleaner error handling.
> 
Single diffs are definitely possible; I'll write a module that returns them later, and leave the possibility of multiple diffs open until Brion's cast his verdict.

> I am assuming -- quite possibly incorrectly --
> that it would be easier on the servers if the client was asking for, say, 2
> diffs every 6 seconds rather than 1 diff every 3 seconds. If that assumption is
> complete bollocks then please disregard anything about multiple diffs. :)
That makes sense. Retrieval of cached diffs is fast compared to request overhead, generating diffs isn't (AFAIK). But since most diffs are in cache your assumption is valid (although it doesn't matter much in a worst-case scenario, but then those scenarios are rare).

> The
> trade-off here is minimizing the delay between the user wanting the diff and
> having it, while also not being excessive with web requests. I keep said delay
> close to zero by pre-loading diffs a few in advance of the one they're actually
> looking at, hence it would be easy to fetch them two or three at a time.
> 
That makes sense as well.

Comment 20 Roan Kattouw 2009-01-12 23:40:57 UTC

Created attachment 5664 [details]
Patch that adds diff functionality to prop=revisions

The attached patch adds diff functionality to prop=revisions. It can:
* diff all listed revs to a given revid (rvdiffto)
* diff all listed revs to the previous rev of the same page (rvdifftoprev)
* diff all listed revs to the top rev of the same page (rvdifftotop)
* show diff size (bytes), diff lines and revids involved (rvdiffprop)

Side effects of this patch:
* all calls to getText() (get text if publicly available) changed to revText() (get text if available to the current user)
* the revisions array in the output now uses revids rather than 0,1,2... as keys (technically a breaking change, although sane clients won't even notice)

Performance considerations:
* difftoprev and difftotop are only enabled when listing multiple revisions of the same page
** this means all revisions have the same top revision so it has to be fetched only once
* difftoprev diffs to the previous/next revision IN THE LIST, which may not be the previous/next revision if rvuser/rvexcludeuser is used to filter stuff. Also, the last/first revision doesn't get a difftoprev
** this allows difftoprev to use only the revision texts already fetched
* when diffs are generated, at most 50 revisions (500 for users with the apihighlimits right) are returned
** this means at most 150/1500 diffs are generated in one request
*** 50/500 of these (diffto) have a low cache hit probability (diffto revision may not even belong to the same page)
*** 50/500 of these (difftoprev) have a high cache hit probability
*** 50/500 of these (difftotop) have a medium cache hit probability

Comment 21 Roan Kattouw 2009-01-12 23:41:49 UTC

Reassigning to Brion as I need him to review the patch.

Comment 22 Brion Vibber 2009-01-12 23:52:51 UTC

If the diffs *are* in fact cached, then fetching a lot at once is cheap. If they're not, I really don't want a single request to be able to multiply an expensive operation times 1500...

My general preference would be to limit to one-at-a-time fetching.

Perhaps if there's a strong use case for it, multiple fetches could be conditionally enabled, or could fetch "only if cached", but I'm not sure how valuable this is.

Comment 23 Roan Kattouw 2009-01-12 23:56:48 UTC

(In reply to comment #22)
> If the diffs *are* in fact cached, then fetching a lot at once is cheap. If
> they're not, I really don't want a single request to be able to multiply an
> expensive operation times 1500...
> 
I completely understand. Of course that limit could be reduced, but I'm guessing it would have to be too low to be useful (would it?).

> My general preference would be to limit to one-at-a-time fetching.
> 
> Perhaps if there's a strong use case for it, multiple fetches could be
> conditionally enabled, or could fetch "only if cached", but I'm not sure how
> valuable this is.
> 
Not trivial to implement either: the current diff backend doesn't support such conditional fetches.

I'll think about one-at-a-time fetching and only-if-cached fetching some more later, but right now I just wanna get some sleep :P

Comment 24 Gurch 2009-01-13 02:00:51 UTC

(In reply to comment #22)
> If the diffs *are* in fact cached, then fetching a lot at once is cheap. If
> they're not, I really don't want a single request to be able to multiply an
> expensive operation times 1500...
> 
> My general preference would be to limit to one-at-a-time fetching.
> 
> Perhaps if there's a strong use case for it, multiple fetches could be
> conditionally enabled, or could fetch "only if cached", but I'm not sure how
> valuable this is.

For me, multiple fetches would be useful, but I would only be using small values of "multiple".

How about not granting a higher limit to users with the 'apihighlimits' right for this particular request, and/or reducing the limit on the number of revisions? If the revision limit was lowered to, say, 10 (with no higher limit for bots), that would mean a maximum of 30 diffs generated, which may or may not be still too high but is certainly lower than 1500 :)

Fetching only if cached wouldn't be much use in my case (as it would be mostly used with suspicious recent changes which are inevitably going to be requested by someone at some time but may not have been yet).

It's not vital, just useful to have.

Comment 25 P.Copp 2009-01-13 02:09:44 UTC

(In reply to comment #20)
> * the revisions array in the output now uses revids rather than 0,1,2... as
> keys (technically a breaking change, although sane clients won't even notice)
> 
To be honest, this would break almost all of my scripts, as this changes the json output from an array of revisions ([...]) to an object ({...}). If you still want to do this, please consider adding some parameter like &indexrevids like you did for the pageids.

Comment 26 Bryan Tong Minh 2009-01-13 14:26:29 UTC

(In reply to comment #20)
> * the revisions array in the output now uses revids rather than 0,1,2... as
> keys (technically a breaking change, although sane clients won't even notice)
> 

It would because this means that the type changes from list to dictionary, which does differ in sane languages.

Comment 27 Roan Kattouw 2009-01-13 17:03:20 UTC

(In reply to comment #25)
> (In reply to comment #20)
> > * the revisions array in the output now uses revids rather than 0,1,2... as
> > keys (technically a breaking change, although sane clients won't even notice)
> > 
> To be honest, this would break almost all of my scripts, as this changes the
> json output from an array of revisions ([...]) to an object ({...}). If you
> still want to do this, please consider adding some parameter like &indexrevids
> like you did for the pageids.
> 

Sure.

Does this mean you can't iterate over an object in JavaScript in a sane fashion (i.e. without knowing the keys beforehand)? Isn't there something like a foreach statement?

Comment 28 Roan Kattouw 2009-01-13 17:04:38 UTC

(In reply to comment #27)
> (In reply to comment #25)
> > (In reply to comment #20)
> > > * the revisions array in the output now uses revids rather than 0,1,2... as
> > > keys (technically a breaking change, although sane clients won't even notice)
> > > 
> > To be honest, this would break almost all of my scripts, as this changes the
> > json output from an array of revisions ([...]) to an object ({...}). If you
> > still want to do this, please consider adding some parameter like &indexrevids
> > like you did for the pageids.
> > 
> 
> Sure.
>
Oh wait, I just realized I don't absolutely need this change; I'll just have to create yet another array to keep track of which revision of which page has which revid.

Comment 29 Gurch 2009-01-28 22:17:21 UTC

Any progress on this? Feel free to scrap the whole multiple-diff thing if there are still reservations about it; just single would be fine.

Comment 30 Roan Kattouw 2009-02-28 13:27:16 UTC

(In reply to comment #23)
> I'll think about one-at-a-time fetching and only-if-cached fetching some more
> later, but right now I just wanna get some sleep :P
> 

I decided to go with only-if-cached fetching, which I implemented in r47890. This implementation will let you pull an unlimited number of cached diffs and at most $wgAPIMaxUncachedDiffs (1 by default) uncached diffs.

Note You need to log in before you can comment on or make changes to this bug.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links