Last modified: 2009-12-30 05:12:13 UTC
There seems to be a need for some kind of API for accessing Wikipedia articles from external programs (at least I need it ;-) ).
* getting the data record of the current version, or of a given version, matching a given article name
* getting a list of version numbers matching an article name (this depends on #181)
* getting a list of article names matching a search term
This API would relieve load on the Wikipedia servers, since nothing has to be parsed before the data is served.
* getting a list of authors having worked on a given article (this has also been requested on [Wikitech-l] by Jakob Voss).
A rough sketch of these operations as a hypothetical interface follows below.
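To make the requested operations concrete, here is a minimal sketch of what such an interface could look like in PHP. Every name in it (WikiApiInterface, getArticle, and so on) is hypothetical and only illustrates the shape of the calls, not anything that exists in MediaWiki:

<?php
// Hypothetical interface sketching the operations requested above.
// None of this exists in MediaWiki; it only shows the desired calls.
interface WikiApiInterface {
    // Data record (text, timestamp, author, ...) of the current version
    // of an article, or of a specific version if $version is given.
    public function getArticle( $title, $version = null );

    // List of version numbers (revision ids) of an article.
    public function getVersions( $title );

    // List of article names matching a search term.
    public function searchTitles( $term );

    // List of authors having worked on a given article.
    public function getAuthors( $title );
}

Whether these operations are exposed via SOAP, plain HTTP GET, or something else is a separate question, discussed in the comments below.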
Added a SOAP interface to HEAD. Open todos:
* Improve search to use searchindex instead of cur.
* Change client.php so that it can't be called via HTTP, just the CLI.
* Check whether nusoap is UTF-8 clean.
* Add a feature to limit the number of requests.
Updated:
* now using searchindex
* client.php can only be called from the CLI
* at first sight, nusoap seems to be UTF-8 clean; a search query for UTF-8 strings succeeded
Todos:
* limit the number of requests per user per day
I would like to suggest something similar, but IMHO easier to use and implement than SOAP (I already posted this on wikitech-l on Sept. 15 2004 and was directed here). I hope the new features would make life easier for people who develop bots and other tools that access Wikipedia, and also reduce the traffic caused by such tools. I also believe these to be fairly easy to implement.

The idea is to have an optional URL parameter ("format=" or such) that would tell the software to return a page in a format different from the full-fledged HTML. I would like to suggest formats for "real" pages and special pages separately, as the requirements are different.

For articles, discussion pages, etc., support the following formats:
* source - return the wiki source of that page
* text - return a plain-text version with all markup stripped/replaced (tables, text boxes, etc. do not have to be formatted nicely, but their content should be there)

For special pages and all automatically generated lists (categories, changes, watchlist, whatlinkshere, etc.):
* csv - return the list in CSV format
* rss - return the entries in the list as RSS items

Additionally, for the normal "full html" view, provide a switch "plain" that suppresses all sidebars, etc., and shows just the formatted text.

As to the implementation, I would suggest mapping the format name to the name of a PHP class and loading it on demand, as sketched below. That way, new formats can be supported by just placing an appropriate file in the PHP lib path. But all this is pretty different from the original bug - so maybe I should file a new one? Thank you.
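A minimal sketch of the class-mapping idea above, assuming a hypothetical naming convention in which the format name 'text' maps to a class TextFormatter defined in textformatter.php; none of these class or function names exist in MediaWiki:

<?php
// Hypothetical dispatcher: map the value of the format= URL parameter to a
// formatter class and load it on demand. A new format could then be added
// by dropping a matching file into the PHP lib path.
function loadFormatter( $format ) {
    // Whitelist of known formats to avoid loading arbitrary files.
    $known = array( 'source', 'text', 'csv', 'rss' );
    if ( !in_array( $format, $known ) ) {
        return null;   // fall back to the normal full-HTML view
    }
    $class = ucfirst( $format ) . 'Formatter';       // e.g. TextFormatter
    require_once( strtolower( $class ) . '.php' );   // e.g. textformatter.php
    return new $class();
}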
require_once('nusoap.php');
$s = new soapclient( 'http://en.wikipedia.org:80/soap/' );
$r = $s->call( 'getArticle', array( 'Frankfurt' ) );
print $r->text;

I don't see how this is complicated or how parsing a CSV file is easier. The approach of having a format= parameter for all pages is not realistic. It would require a complete rewrite of MediaWiki, which is not designed with a strict model/view/controller separation.
(In reply to comment #4)
> * source - return the wiki source of that page
Already have this: action=raw
> * text - return a plain-text version with all markup stripped/replaced (tables,
> text boxes, etc. do not have to be formatted nicely, but their content should be there)
Potentially doable.
> * csv - return the list in CSV format
Unlikely to be useful.
> * rss - return the entries in the list as RSS items
Already have this where supported: feed=rss (or feed=atom)
> Additionally, for the normal "full html" view, provide a switch "plain" that
> suppresses all sidebars, etc., and shows just the formatted text.
Potentially doable.
> But all this is pretty different from the original bug - so maybe I should
> file a new one?
Please do.
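For reference, the two existing interfaces mentioned here can already be used without any SOAP machinery. A minimal PHP example (URLs shown for en.wikipedia.org, error handling omitted; the exact URL layout may differ on other wikis):

<?php
// Fetch the editable wiki source of a page via the existing action=raw interface.
$source = file_get_contents(
    'http://en.wikipedia.org/w/index.php?title=Frankfurt&action=raw' );

// Fetch recent changes as a feed via the existing feed=rss interface.
$feed = file_get_contents(
    'http://en.wikipedia.org/w/index.php?title=Special:Recentchanges&feed=rss' );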
*** Bug 1012 has been marked as a duplicate of this bug. ***
Thank you for taking over Bug 1012. But please don't forget to pay attention to "get by category", as well as the fact that there are valid alternatives to SOAP, such as HTTP GET with parameters (see the RESTful architectural style).
To clarify: this bug is not specifically about the SOAP protocol. Bug 1012 was specifically about the action=raw and Special:Export interfaces, which are interfaces specifically for retrieving editable source text. Non-page-attached metadata such as category memberships is included in that interface as the source markup which produces those links, but the requested things just don't fit into that interface. This bug is a general one, and is on-topic for additional data fetch interfaces.
Regarding APIs: I vote for a continuation of the already existing "RESTful API", instead of or in addition to SOAP. Wikimedia is already using REST; it's nothing new! Yahoo! and Amazon are using it. It's exactly what comment #4 above suggests, as did I in Bug 1012. For the debate about "REST versus SOAP" and "RESTful SOAP" see http://c2.com/cgi/wiki?RestArchitecturalStyle.

In comment #5 of Bug 1012 Ævar Arnfjörð Bjarmason wrote:
> Our needs aren't simple, ideally a SOAP api would handle all sorts of stuff such
> as getting edit histories, feeds, rendered html, stuff that links to page $1 and
> so on, also, there are standard API's for most programming languages that
> implement it.

Look at the needs of this API: there are almost only getter operations, and their parameters are (so far) a single article, a search term, a category, or an author's name. The responses are either unstructured HTML, wiki text, RSS or CSV(??) - or an error message. Now compare these requirements with the pros of a RESTful implementation: HTTP support is all that programming languages need. The cons of REST come into play when the operations involve complex objects (in the request or the response), that is, when encoding of non-string parameters is needed.
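To illustrate the difference, the getArticle call from the SOAP example in comment #7 could be expressed as a plain HTTP GET. The URL below is purely hypothetical (no such endpoint exists); it only shows the REST style:

<?php
// Hypothetical REST-style getter: the operation and its single string
// parameter live entirely in the URL, so plain HTTP support is all a
// client language needs.
$text = file_get_contents( 'http://en.wikipedia.org/rest/article/Frankfurt' );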
*** Bug 1233 has been marked as a duplicate of this bug. ***
(In reply to comment #6)
> (In reply to comment #4)
> > Additionally, for the normal "full html" view, provide a switch "plain" that
> > suppresses all sidebars, etc., and shows just the formatted text.
>
> Potentially doable.

It's very doable. I did an action=html thing the other day that dumped the HTML of an article, and Tim also made a dumpHTML tool that does the same, just not through a web interface. The problem with it, however, was that it didn't allow modification of the parser options, which is something an API like the one this bug discusses should implement.
> In comment #5 of Bug 1012 Ævar Arnfjörð Bjarmason wrote:
> > Our needs aren't simple, ideally a SOAP api would handle all sorts of stuff such
> > as getting edit histories, feeds, rendered html, stuff that links to page $1 and
> > so on, also, there are standard API's for most programming languages that
> > implement it.
>
> Look at the needs of this API: there are almost only getter operations, and
> their parameters are (so far) a single article, a search term, a category,
> or an author's name. The responses are either unstructured HTML, wiki text,
> RSS or CSV(??) - or an error message.

Just so we're clear on this: I personally don't really care what we use, as long as it's something designed in such a way that it can suit all our needs, both current and potential ones, does so through one interface, and is widely supported by the most popular programming languages. The problem with what you're suggesting is that it would basically grow like cancer over time (and I'm talking about the mix of RSS, CSV and other formats; I haven't really looked into REST) and be hell to implement: you'd have to switch between parsing CSV, RSS and probably some other things rather than just using one format for everything.
Now, to contribute something useful other than flaming other people's choice of APIs ;)

I made a special page extension the other day that was a basic proof of concept of SOAP functionality. It used the nusoap_server class (see: http://cvs.sourceforge.net/viewcvs.py/*checkout*/nusoap/lib/nusoap.php?rev=HEAD ) to parse requests and generate output; unlike Jeluf's implementation (which is no longer in CVS), it used internal MediaWiki functions to fetch the various things rather than making its own SQL queries on the database.

I don't have access to the code right now (it's on another computer that I can't reach at the moment), but for anyone interested in SOAP support, making a new special page (Special:SOAP) along the lines sketched below shouldn't be that difficult. Just remember to turn off the skin output with $wgOut->disable().
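A rough sketch of what such a page could look like, assuming nusoap is on the include path. The MediaWiki calls are written from memory and may need adjusting for the version at hand, and the registration of the special page is omitted; treat this as an outline, not working code:

<?php
require_once( 'nusoap.php' );

class SpecialSOAP extends SpecialPage {
    function SpecialSOAP() {
        SpecialPage::SpecialPage( 'SOAP' );
    }

    function execute( $par ) {
        global $wgOut;
        $wgOut->disable();   // turn off the skin output

        // Let nusoap parse the request and dispatch to registered functions.
        $server = new soap_server();
        $server->register( 'getArticleText' );
        $server->service( isset( $GLOBALS['HTTP_RAW_POST_DATA'] )
            ? $GLOBALS['HTTP_RAW_POST_DATA'] : '' );
    }
}

// Exposed SOAP operation: fetch the wiki text of an article using internal
// MediaWiki functions rather than direct SQL queries.
function getArticleText( $title ) {
    $t = Title::newFromText( $title );
    if ( is_null( $t ) ) {
        return '';
    }
    $article = new Article( $t );
    return $article->getContent();
}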
We agree on the points that the API should be easy to implement, supported by programming languages, and able to cover both current and potential needs. Don't hype SOAP and misinterpret REST: the former is more heavyweight than the latter and is 'restricted' to XML. REST doesn't imply any particular format (CSV, RSS, ...) - it's up to our design choices to use XML - and it's not cancer-causing per se. Neither approach resolves the need for a data model for the content to be transferred. The choice is simply a matter of good evaluation and engineering.
(In reply to comment #16)
> We agree on the points that the API should be easy to implement, supported by
> programming languages, and able to cover both current and potential needs.
> Don't hype SOAP and misinterpret REST: the former is more heavyweight than the
> latter and is 'restricted' to XML. REST doesn't imply any particular format
> (CSV, RSS, ...) - it's up to our design choices to use XML - and it's not
> cancer-causing per se. Neither approach resolves the need for a data model for
> the content to be transferred. The choice is simply a matter of good
> evaluation and engineering.

I don't think I care what the API looks like either. When I posted my request (since absorbed into #208), I was basically looking for a way to grab a page's content in HTML format without the surrounding elements (header, sidebar, footer, etc.). The idea is that there could then be a "Print this Article" link, and it would print only the article content and not the whole page (which is unnecessary if you're simply trying to give someone reference material). Similarly, an "E-Mail this Article" link could do the same thing. Another feature could be "Download Article as PDF" or "Download Article as Word Document"... Any API would allow these operations...

Of course I do have an opinion, and I would use SOAP/XML. It's ubiquitous, even if it has some historical flaws.
(In reply to comment #17)
> The idea is that there could then be a "Print this Article" link, and it would
> print only the article content and not the whole page (which is unnecessary if
> you're simply trying to give someone reference material).
We already have that: just try printing any page (with Monobook turned on) in a modern browser; only the @media print styles will be used (i.e. the sidebars and other irrelevant content won't be printed).
> Similarly, an "E-Mail this Article" link could do the same thing.
>
> Another feature could be "Download Article as PDF" or "Download Article as
> Word Document"...
>
> Any API would allow these operations...
Actually it wouldn't. Any currently foreseeable API would allow access to the current interface (or rather a subset of it) through alternative means; it doesn't follow that we'll have PDF generation.
(In reply to comment #11)
> Wikimedia is already using REST; it's nothing new! Yahoo! and Amazon are using
> it. It's exactly what comment #4 above suggests, as did I in Bug 1012.
For those of you who have no idea what REST is: it's (to paraphrase a dot.kde.org post) a design approach to web APIs. While SOAP tries to solve everything inside SOAP itself, REST means you rely on existing mechanisms (URLs, HTTP) as much as possible.
*** Bug 2037 has been marked as a duplicate of this bug. ***
I would like to request some functionality for this upcoming API - specifically:

1. A method (a sequence of API calls) of enumerating and iterating over the entire content;
2. A method of obtaining a list of objects that have been changed or are new since a particular timestamp.

For an example of how (1) would be used, please see the Apache Commons VFS APIs, which look at a data source as a file system - i.e. in a hierarchical fashion. As I am a newcomer to this project and am ignorant of the way the data is organized, I will only venture to make a trivial suggestion. It would likely be possible (and simple) to overlay a simple topic-based alphabetical hierarchy over all content, so that at the first level the categorization is by the first letter of all topics or article titles, the next level is based on the first two letters of the topic or article title, and so on:

A, B, C, D, E, F, G, H, I, ..., Z
AA, AB, AC, AD, ..., AX
AAA, AAB, AAC, AAD, ..., AXX
etc.

Again - as long as there is some way to retrieve a tree structure and walk through it, or an iterator over the collection of unique content object identifiers in the repository, the API would let me accomplish the task I am looking at.
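To show the kind of client-side walk this would enable, here is a sketch built on a single hypothetical call, listTitlesWithPrefix(), which does not exist anywhere; processTitle() is likewise imaginary. Only the shape of the iteration matters:

<?php
// Hypothetical client-side enumeration of all content by title prefix.
// The point is only that some such enumeration primitive would let a
// client walk the whole wiki, one prefix bucket at a time.
function walkAllTitles() {
    foreach ( range( 'A', 'Z' ) as $letter ) {
        $titles = listTitlesWithPrefix( $letter );   // imaginary API call
        foreach ( $titles as $title ) {
            processTitle( $title );                  // imaginary per-title work
        }
    }
}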
I have put together some ideas and suggestions for a REST interface on Meta. See here: http://meta.wikimedia.org/wiki/REST Please have a look and add your thoughts, if you like.
Copied from bug #3365:

I would like to have a feature that allows me to load diffs, and just diffs. The <rc> bot reports changes to articles on Wikipedia, and I have a working bot that processes this. I can't load every diff of a Wikipedia edit for practical reasons (bandwidth and CPU). So I would like to load just the diffs, process them, and check for obvious cases of sneaky vandalism. I don't believe it is hard to implement this, since we already have a function showing the diff (along with the rest of the page).
It should be machine-friendly so as to use minimal bandwidth and processing time. All I care about is what's removed and what's added.
I'd be interested in writing an implementation of WTTP (as described at http://meta.wikimedia.org/wiki/WikiText_Transfer_Protocol). WTTP (or some other RESTful approach) would be quicker to implement than a full-fledged SOAP approach, and it seems like the 80% solution would be much more than the zero percent we've got now.
I agree (cf. my comment #8). One should start with the simple case (diffs can come later) and clarify the goal: a request for a single page, a range of pages, or everything (= bulk download?).
From the bot's perspective, the most needed features are bulk get and (bulk?) put - anything else can come later. The biggest hurdle: we need a clear indication when the requested page(s) do not exist, and we need to make sure the user's login does not get lost. These two issues have occurred numerous times with the present system. Beyond that, any page parsing can be done on the client for now.
The first stage of this interface has been implemented. Ladies and gentlemen, I proudly present the Query API: http://en.wikipedia.org/w/query.php It supports multiple output formats and bulk data retrieval of many items you could only get by html scraping. Any comments are welcome on the API's home page at http://en.wikipedia.org/wiki/User:Yurik/Query_API
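A quick usage sketch; the exact parameter names are best checked on the API's home page linked above, as they may change:

<?php
// Example request against the Query API. Parameter names (what, titles,
// format) are as documented on the API's home page at the time of writing.
$xml = file_get_contents(
    'http://en.wikipedia.org/w/query.php?what=content&titles=Frankfurt&format=xml' );
print $xml;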
We've had some of this for ages. More specific requests should be filed separately.