Last modified: 2009-12-30 05:12:13 UTC
There seems to be a need for some kind of API for accessing Wikipedia articles from external programs (at least I need it ;-) ).
* getting the data record of the current version, or of a given version, matching a given article name
* getting a list of version numbers matching an article name (this depends on #181)
* getting a list of article names matching a search term
This API would relieve load on the Wikipedia servers, since nothing has to be parsed before the data is served.
* getting a list of authors having worked on a given article (this has also been requested on [Wikitech-l] by Jakob Voss).
A rough sketch of these operations as a hypothetical interface follows below.
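To make the requested operations concrete, here is a minimal sketch of what such an interface could look like in PHP. Every name in it (WikiApiInterface, getArticle, and so on) is hypothetical and only illustrates the shape of the calls, not anything that exists in MediaWiki:

<?php
// Hypothetical interface sketching the operations requested above.
// None of this exists in MediaWiki; it only shows the desired calls.
interface WikiApiInterface {
    // Data record (text, timestamp, author, ...) of the current version
    // of an article, or of a specific version if $version is given.
    public function getArticle( $title, $version = null );

    // List of version numbers (revision ids) of an article.
    public function getVersions( $title );

    // List of article names matching a search term.
    public function searchTitles( $term );

    // List of authors having worked on a given article.
    public function getAuthors( $title );
}

Whether these operations are exposed via SOAP, plain HTTP GET, or something else is a separate question, discussed in the comments below.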
Added a SOAP interface to HEAD. Open todos:
* Improve search to use searchindex instead of cur.
* Change client.php so that it can't be called via HTTP, just the CLI.
* Check whether nusoap is UTF-8 clean.
* Add a feature to limit the number of requests.
Updated:
* now using searchindex
* client.php can only be called from the CLI
* at first sight, nusoap seems to be UTF-8 clean; a search query for UTF-8 strings succeeded
Todos:
* limit the number of requests per user per day
I would like to suggest something similar, but IMHO easier to use and implement than SOAP (I already posted this on wikitech-l on Sept. 15 2004 and was directed here). I hope the new features would make life easier for people who develop bots and other tools that access Wikipedia, and also reduce the traffic caused by such tools. I also believe these to be fairly easy to implement.

The idea is to have an optional URL parameter ("format=" or such) that would tell the software to return a page in a format different from the full-fledged HTML. I would like to suggest formats for "real" pages and special pages separately, as the requirements are different.

For articles, discussion pages, etc., support the following formats:
* source - return the wiki source of that page
* text - return a plain-text version with all markup stripped/replaced (tables, text boxes, etc. do not have to be formatted nicely, but their content should be there)

For special pages and all automatically generated lists (categories, changes, watchlist, whatlinkshere, etc.):
* csv - return the list in CSV format
* rss - return the entries in the list as RSS items

Additionally, for the normal "full html" view, provide a switch "plain" that suppresses all sidebars, etc., and shows just the formatted text.

As to the implementation, I would suggest mapping the format name to the name of a PHP class and loading it on demand, as sketched below. That way, new formats can be supported by just placing an appropriate file in the PHP lib path. But all this is pretty different from the original bug - so maybe I should file a new one? Thank you.
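A minimal sketch of the class-mapping idea above, assuming a hypothetical naming convention in which the format name 'text' maps to a class TextFormatter defined in textformatter.php; none of these class or function names exist in MediaWiki:

<?php
// Hypothetical dispatcher: map the value of the format= URL parameter to a
// formatter class and load it on demand. A new format could then be added
// by dropping a matching file into the PHP lib path.
function loadFormatter( $format ) {
    // Whitelist of known formats to avoid loading arbitrary files.
    $known = array( 'source', 'text', 'csv', 'rss' );
    if ( !in_array( $format, $known ) ) {
        return null;   // fall back to the normal full-HTML view
    }
    $class = ucfirst( $format ) . 'Formatter';       // e.g. TextFormatter
    require_once( strtolower( $class ) . '.php' );   // e.g. textformatter.php
    return new $class();
}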
require_once('nusoap.php');
$s = new soapclient( 'http://en.wikipedia.org:80/soap/' );
$r = $s->call( 'getArticle', array( 'Frankfurt' ) );
print $r->text;

I don't see how this is complicated or how parsing a CSV file is easier. The approach of having a format= parameter for all pages is not realistic. It would require a complete rewrite of MediaWiki, which is not designed with a strict model/view/controller separation.
(In reply to comment #4)
> * source - return the wiki source of that page
Already have this: action=raw
> * text - return a plain-text version with all markup stripped/replaced (tables,
> text boxes, etc. do not have to be formatted nicely, but their content should be there)
Potentially doable.
> * csv - return the list in CSV format
Unlikely to be useful.
> * rss - return the entries in the list as RSS items
Already have this where supported: feed=rss (or feed=atom)
> Additionally, for the normal "full html" view, provide a switch "plain" that
> suppresses all sidebars, etc., and shows just the formatted text.
Potentially doable.
> But all this is pretty different from the original bug - so maybe I should
> file a new one?
Please do.
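For reference, the two existing interfaces mentioned here can already be used without any SOAP machinery. A minimal PHP example (URLs shown for en.wikipedia.org, error handling omitted; the exact URL layout may differ on other wikis):

<?php
// Fetch the editable wiki source of a page via the existing action=raw interface.
$source = file_get_contents(
    'http://en.wikipedia.org/w/index.php?title=Frankfurt&action=raw' );

// Fetch recent changes as a feed via the existing feed=rss interface.
$feed = file_get_contents(
    'http://en.wikipedia.org/w/index.php?title=Special:Recentchanges&feed=rss' );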
*** Bug 1012 has been marked as a duplicate of this bug. ***
Thank you for taking over Bug 1012. But please don't forget to pay attention to "get by category", as well as the fact that there are valid alternatives to SOAP, such as HTTP GET with parameters (see the RESTful architectural style).
To clarify: this bug is not specifically about the SOAP protocol. Bug 1012 was specifically about the action=raw and Special:Export interfaces, which are interfaces specifically for retrieving editable source text. Non-page-attached metadata such as category memberships is included in that interface as the source markup which produces those links, but the requested things just don't fit into that interface. This bug is a general one, and is on-topic for additional data fetch interfaces.
Regarding APIs: I vote for a continuation of the already existing "RESTful API", instead of or in addition to SOAP. Wikimedia is already using REST; it's nothing new! Yahoo! and Amazon are using it. It's exactly what comment #4 above suggests, as did I in Bug 1012. For the debate about "REST versus SOAP" and "RESTful SOAP" see http://c2.com/cgi/wiki?RestArchitecturalStyle.

In comment #5 of Bug 1012 Ævar Arnfjörð Bjarmason wrote:
> Our needs aren't simple, ideally a SOAP api would handle all sorts of stuff such
> as getting edit histories, feeds, rendered html, stuff that links to page $1 and
> so on, also, there are standard API's for most programming languages that
> implement it.

Look at the needs of this API: there are almost only getter operations, and their parameters are (so far) a single article, a search term, a category, or an author's name. The responses are either unstructured HTML, wiki text, RSS or CSV(??) - or an error message. Now compare these requirements with the pros of a RESTful implementation: HTTP support is all that programming languages need. The cons of REST come into play when the operations involve complex objects (in the request or the response), that is, when encoding of non-string parameters is needed.
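To illustrate the difference, the getArticle call from the SOAP example in comment #7 could be expressed as a plain HTTP GET. The URL below is purely hypothetical (no such endpoint exists); it only shows the REST style:

<?php
// Hypothetical REST-style getter: the operation and its single string
// parameter live entirely in the URL, so plain HTTP support is all a
// client language needs.
$text = file_get_contents( 'http://en.wikipedia.org/rest/article/Frankfurt' );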
*** Bug 1233 has been marked as a duplicate of this bug. ***
(In reply to comment #6)
> (In reply to comment #4)
> > Additionally, for the normal "full html" view, provide a switch "plain" that
> > suppresses all sidebars, etc., and shows just the formatted text.
>
> Potentially doable.

It's very doable. I did an action=html thing the other day that dumped the HTML of an article, and Tim also made a dumpHTML tool that does the same, just not through a web interface. The problem with it, however, was that it didn't allow modification of the parser options, which is something an API like the one this bug discusses should implement.
> In comment #5 of Bug 1012 Ævar Arnfjörð Bjarmason wrote:
> > Our needs aren't simple, ideally a SOAP api would handle all sorts of stuff such
> > as getting edit histories, feeds, rendered html, stuff that links to page $1 and
> > so on, also, there are standard API's for most programming languages that
> > implement it.
>
> Look at the needs of this API: there are almost only getter operations, and
> their parameters are (so far) a single article, a search term, a category,
> or an author's name. The responses are either unstructured HTML, wiki text,
> RSS or CSV(??) - or an error message.

Just so we're clear on this: I personally don't really care what we use, as long as it's something designed in such a way that it can suit all our needs, both current and potential ones, does so through one interface, and is widely supported by the most popular programming languages. The problem with what you're suggesting is that it would basically grow like cancer over time (and I'm talking about the mix of RSS, CSV and other formats; I haven't really looked into REST) and be hell to implement: you'd have to switch between parsing CSV, RSS and probably some other things rather than just using one format for everything.
Now, to contribute something useful other than flaming other people's choice of APIs ;)

I made a special page extension the other day that was a basic proof of concept of SOAP functionality. It used the nusoap_server class (see: http://cvs.sourceforge.net/viewcvs.py/*checkout*/nusoap/lib/nusoap.php?rev=HEAD ) to parse requests and generate output; unlike Jeluf's implementation (which is no longer in CVS), it used internal MediaWiki functions to fetch the various things rather than making its own SQL queries on the database.

I don't have access to the code right now (it's on another computer that I can't reach at the moment), but for anyone interested in SOAP support, making a new special page (Special:SOAP) along the lines sketched below shouldn't be that difficult. Just remember to turn off the skin output with $wgOut->disable().
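A rough sketch of what such a page could look like, assuming nusoap is on the include path. The MediaWiki calls are written from memory and may need adjusting for the version at hand, and the registration of the special page is omitted; treat this as an outline, not working code:

<?php
require_once( 'nusoap.php' );

class SpecialSOAP extends SpecialPage {
    function SpecialSOAP() {
        SpecialPage::SpecialPage( 'SOAP' );
    }

    function execute( $par ) {
        global $wgOut;
        $wgOut->disable();   // turn off the skin output

        // Let nusoap parse the request and dispatch to registered functions.
        $server = new soap_server();
        $server->register( 'getArticleText' );
        $server->service( isset( $GLOBALS['HTTP_RAW_POST_DATA'] )
            ? $GLOBALS['HTTP_RAW_POST_DATA'] : '' );
    }
}

// Exposed SOAP operation: fetch the wiki text of an article using internal
// MediaWiki functions rather than direct SQL queries.
function getArticleText( $title ) {
    $t = Title::newFromText( $title );
    if ( is_null( $t ) ) {
        return '';
    }
    $article = new Article( $t );
    return $article->getContent();
}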
We agree on the points that the API should be easy to implement, supported by programming languages, and able to cover both current and potential needs. Don't hype SOAP and misinterpret REST: the former is more heavyweight than the latter and is 'restricted' to XML. REST doesn't imply any particular format (CSV, RSS, ...) - it's up to our design choices to use XML - and it's not cancer-causing per se. Neither approach resolves the need for a data model for the content to be transferred. The choice is simply a matter of good evaluation and engineering.
(In reply to comment #16)
> We agree on the points that the API should be easy to implement, supported by
> programming languages, and able to cover both current and potential needs.
> Don't hype SOAP and misinterpret REST: the former is more heavyweight than the
> latter and is 'restricted' to XML. REST doesn't imply any particular format
> (CSV, RSS, ...) - it's up to our design choices to use XML - and it's not
> cancer-causing per se. Neither approach resolves the need for a data model for
> the content to be transferred. The choice is simply a matter of good
> evaluation and engineering.

I don't think I care what the API looks like either. When I posted my request (since absorbed into #208), I was basically looking for a way to grab a page's content in HTML format without the surrounding elements (header, sidebar, footer, etc.). The idea is that there could then be a "Print this Article" link, and it would print only the article content and not the whole page (which is unnecessary if you're simply trying to give someone reference material). Similarly, an "E-Mail this Article" link could do the same thing. Another feature could be "Download Article as PDF" or "Download Article as Word Document"... Any API would allow these operations...

Of course I do have an opinion, and I would use SOAP/XML. It's ubiquitous, even if it has some historical flaws.
(In reply to comment #17)
> The idea is that there could then be a "Print this Article" link, and it would
> print only the article content and not the whole page (which is unnecessary if
> you're simply trying to give someone reference material).
We already have that: just try printing any page (with Monobook turned on) in a modern browser; only the @media print styles will be used (i.e. the sidebars and other irrelevant content won't be printed).
> Similarly, an "E-Mail this Article" link could do the same thing.
>
> Another feature could be "Download Article as PDF" or "Download Article as
> Word Document"...
>
> Any API would allow these operations...
Actually it wouldn't. Any currently foreseeable API would allow access to the current interface (or rather a subset of it) through alternative means; it doesn't follow that we'll have PDF generation.
(In reply to comment #11)
> Wikimedia is already using REST; it's nothing new! Yahoo! and Amazon are using
> it. It's exactly what comment #4 above suggests, as did I in Bug 1012.
For those of you who have no idea what REST is: it's (to paraphrase a dot.kde.org post) a design approach to web APIs. While SOAP tries to solve everything inside SOAP itself, REST means you rely on existing mechanisms (URLs, HTTP) as much as possible.
*** Bug 2037 has been marked as a duplicate of this bug. ***
I would like to request some functionality for this upcoming API - specifically:

1. A method (a sequence of API calls) of enumerating and iterating over the entire content;
2. A method of obtaining a list of objects that have been changed or are new since a particular timestamp.

For an example of how (1) would be used, please see the Apache Commons VFS APIs, which look at a data source as a file system - i.e. in a hierarchical fashion. As I am a newcomer to this project and am ignorant of the way the data is organized, I will only venture to make a trivial suggestion. It would likely be possible (and simple) to overlay a simple topic-based alphabetical hierarchy over all content, so that at the first level the categorization is by the first letter of all topics or article titles, the next level is based on the first two letters of the topic or article title, and so on:

A, B, C, D, E, F, G, H, I, ..., Z
AA, AB, AC, AD, ..., AX
AAA, AAB, AAC, AAD, ..., AXX
etc.

Again - as long as there is some way to retrieve a tree structure and walk through it, or an iterator over the collection of unique content object identifiers in the repository, the API would let me accomplish the task I am looking at.
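To show the kind of client-side walk this would enable, here is a sketch built on a single hypothetical call, listTitlesWithPrefix(), which does not exist anywhere; processTitle() is likewise imaginary. Only the shape of the iteration matters:

<?php
// Hypothetical client-side enumeration of all content by title prefix.
// The point is only that some such enumeration primitive would let a
// client walk the whole wiki, one prefix bucket at a time.
function walkAllTitles() {
    foreach ( range( 'A', 'Z' ) as $letter ) {
        $titles = listTitlesWithPrefix( $letter );   // imaginary API call
        foreach ( $titles as $title ) {
            processTitle( $title );                  // imaginary per-title work
        }
    }
}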
I have put together some ideas and suggestions for a REST interface on Meta. See here: http://meta.wikimedia.org/wiki/REST Please have a look and add your thoughts, if you like.
Copied from bug #3365:

I would like to have a feature that allows me to load diffs, and just diffs. The <rc> bot reports changes to articles on Wikipedia, and I have a working bot that processes this. I can't load every diff of a Wikipedia edit for practical reasons (bandwidth and CPU). So I would like to load just the diffs, process them, and check for obvious cases of sneaky vandalism. I don't believe it is hard to implement this, since we already have a function showing the diff (along with the rest of the page).
It should be machine-friendly so as to use minimal bandwidth and processing time. All I care about is what's removed and what's added.
I'd be interested in writing an implementation of WTTP (as described at http://meta.wikimedia.org/wiki/WikiText_Transfer_Protocol). WTTP (or some other RESTful approach) would be quicker to implement than a full-fledged SOAP approach, and it seems like the 80% solution would be much more than the zero percent we've got now.
I agree (cf. my comment #8). One should start with the simple case (diffs can come later) and clarify the goal: a request for a single page, a range of pages, or everything (= bulk download?).
From the bot's perspective, the most needed features are bulk get and (bulk?) put - anything else can come later. The biggest hurdle: we need a clear indication when the requested page(s) do not exist, and we need to make sure the user's login does not get lost. These two issues have occurred numerous times with the present system. Beyond that, any page parsing can be done on the client for now.
The first stage of this interface has been implemented. Ladies and gentlemen, I proudly present the Query API: http://en.wikipedia.org/w/query.php It supports multiple output formats and bulk data retrieval of many items you could only get by html scraping. Any comments are welcome on the API's home page at http://en.wikipedia.org/wiki/User:Yurik/Query_API
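A quick usage sketch; the exact parameter names are best checked on the API's home page linked above, as they may change:

<?php
// Example request against the Query API. Parameter names (what, titles,
// format) are as documented on the API's home page at the time of writing.
$xml = file_get_contents(
    'http://en.wikipedia.org/w/query.php?what=content&titles=Frankfurt&format=xml' );
print $xml;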
We've had some of this for ages. More specific requests should be filed separately.